EP1: Why PLR Fails with Multiple Discrete Treatments
Overlap, Heterogeneous Effects, and the Estimand Problem
When conducting causal analysis with observational cross-sectional data and multiple discrete treatments, the preferred estimator is usually the Interactive Regression Model, implemented as DoubleMLIRM.
When treatment effects are heterogeneous, regression-style estimators like Partially Linear Regression Models (PLR) do not estimate the Average Treatment Effect (ATE). Instead, they estimate an overlap-weighted average of Conditional Average Treatment Effects (CATEs).
With multiple treatments, each treatment has a different overlap pattern, so each PLR coefficient averages effects over a different subpopulation. The coefficients are therefore not ATEs, not comparable, and can even reverse treatment rankings.
In contrast, the Interactive Regression Model (IRM) recovers the true population ATE, enabling valid comparison and ranking.
In modern observational causal inference — especially in tech, marketplaces, and e-commerce — we often face multiple discrete treatments. Imagine evaluating four different promotional strategies to increase user retention:
- Treatment A: 10% Discount
- Treatment B: Free Shipping
- Treatment C: Buy-One-Get-One (BOGO)
- Treatment D: Loyalty Points Bonus
Unlike randomized experiments, these interventions are not assigned randomly. Users self-select or are targeted based on observed characteristics.
The Strategy: Identification by Conditioning in Cross-Sectional Data
When we cannot run a randomized experiment, we rely on cross-sectional observational data and the strategy of Identification by Conditioning (Chernozhukov, Hansen, et al. 2024).
This approach moves us from correlation to causation by assuming that, after controlling for a sufficiently rich set of covariates \(X\), treatment assignment is as good as random:
\[ (Y(1), Y(0)) \;\perp\; D \mid X \]
This assumption is known as conditional exchangeability, ignorability, or selection on observables.
Under this assumption, causal effects can be recovered by properly adjusting for \(X\).
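To make this concrete, here is a minimal simulation (the numbers are illustrative, not from any real dataset): a naive difference in means is biased by confounding, while conditioning on the confounder \(X\) recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.binomial(1, 0.5, n)                   # observed confounder
e = np.where(X == 1, 0.8, 0.2)                # P(D=1 | X): treated users skew toward X=1
D = rng.binomial(1, e)
Y = 2.0 * D + 3.0 * X + rng.normal(0, 1, n)   # true effect = 2; X shifts the baseline

naive = Y[D == 1].mean() - Y[D == 0].mean()   # confounded contrast, close to 3.8

# Identification by conditioning: effect within each stratum of X,
# averaged over the population distribution of X
strata = [Y[(D == 1) & (X == x)].mean() - Y[(D == 0) & (X == x)].mean()
          for x in (0, 1)]
adjusted = np.average(strata, weights=[np.mean(X == 0), np.mean(X == 1)])
print(round(naive, 2), round(adjusted, 2))    # naive is biased; adjusted is close to 2
```

With one binary confounder, stratification is all the adjustment we need; the rest of the article is about what happens when \(X\) is rich enough that the adjustment must be learned.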
The Bridge to DoubleML: Learning with High-Dimensional Controls
In high-dimensional settings, traditional regression struggles to flexibly adjust for \(X\).
Double Machine Learning (DoubleML) (Chernozhukov, Chetverikov, et al. 2024) provides a principled solution.
DoubleML uses machine learning to model two nuisance components:
- Outcome model \(g(X) = \mathbb{E}[Y \mid X]\)
- Treatment model \(m(X) = \mathbb{E}[D \mid X]\) (the propensity score)
It then combines them using orthogonal (debiased) scores, making estimation robust to small modeling errors.
Once we adopt this framework, a key modeling choice arises when treatments are multiple and discrete:
Should we use a Partially Linear Regression Model (PLR) or an Interactive Regression Model (IRM)?
The Comparison: PLR vs IRM Specifications
Partially Linear Regression Model (PLR)
Implemented as DoubleMLPLR, the PLR assumes treatment effects enter additively and are constant in the residualized space. It partials out the effect of \(X\) from both \(D\) and \(Y\), then estimates a single linear relationship:
\[ Y - g(X) = \sum_{j=1}^{k} \theta_j \big(D_j - m_j(X)\big) + \varepsilon \]
All treatments compete within the same regression equation.
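The partialling-out mechanics can be sketched in a few lines. This toy simulation uses a single binary treatment with a homogeneous effect, and stratum means stand in for the ML nuisance learners (a real DoubleMLPLR run would also use cross-fitting):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.binomial(1, 0.5, n)
e = np.where(X == 1, 0.7, 0.2)                   # m(X) = P(D=1 | X)
D = rng.binomial(1, e)
Y = 2.0 * D + 3.0 * X + rng.normal(0, 1, n)      # homogeneous effect, theta = 2

# Nuisances estimated by stratum means (stand-ins for the ML learners)
m_hat = np.array([D[X == x].mean() for x in (0, 1)])[X]    # E[D | X]
g_hat = np.array([Y[X == x].mean() for x in (0, 1)])[X]    # E[Y | X]

# Residual-on-residual regression: theta = E[D_res * Y_res] / E[D_res^2]
D_res, Y_res = D - m_hat, Y - g_hat
theta = (D_res * Y_res).mean() / (D_res ** 2).mean()
print(round(theta, 2))                           # close to the true theta = 2
```

With a constant effect this works perfectly; the trouble below starts when \(\theta\) is replaced by \(\tau(X)\).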
Interactive Regression Model (IRM)
The IRM (DoubleMLIRM) is more flexible. It does not impose an additive linear structure on treatment effects. Instead, it estimates potential outcomes for each treatment level and constructs an Augmented Inverse Propensity Weighted (AIPW) estimator.
Conceptually, IRM asks:
What would the average outcome be if the entire population received treatment \(j\)?
This directly targets the population ATE for each treatment.
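A minimal sketch of the AIPW score, with oracle nuisance functions plugged in to keep the example short (DoubleMLIRM estimates them with ML learners and cross-fitting; the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
X = rng.binomial(1, 0.5, n)
e = np.where(X == 1, 0.5, 0.01)            # propensity: poor overlap at X = 0
tau = np.where(X == 1, 3.0, -3.0)          # heterogeneous CATEs; true ATE = 0
D = rng.binomial(1, e)
Y = 1.0 * X + tau * D + rng.normal(0, 1, n)

# Oracle nuisances (known here; a real run estimates these with ML)
g1 = 1.0 * X + tau                          # E[Y | X, D=1]
g0 = 1.0 * X                                # E[Y | X, D=0]

# AIPW / doubly robust score: plug-in contrast plus propensity-weighted residuals
psi = (g1 - g0) + D * (Y - g1) / e - (1 - D) * (Y - g0) / (1 - e)
ate_hat = psi.mean()
print(round(ate_hat, 2))                    # close to the true ATE of 0
```

Note that even with overlap as poor as \(e(x) = 0.01\), the score still averages over the full population; poor overlap shows up as variance, not as a shifted estimand.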
The Trap: Why PLR Fails to Rank Treatments
The “efficiency” of PLR is tempting — it produces one clean regression table. However, PLR coefficients are overlap-weighted averages, not population ATEs.
To understand why, we first need to clarify what heterogeneous treatment effects are.
1. What are heterogeneous treatment effects?
A treatment effect is heterogeneous if it varies across individuals. The conditional average treatment effect (CATE) is
\[ \tau_t(x) = \mathbb{E}\big[Y(t) - Y(0) \mid X = x\big] \]
If \(\tau_t(X)\) depends on covariates \(X\), then some groups benefit more than others.
This does not just mean “different treatments have different averages.”
It means:
The same treatment has different effects for different people.
Example:
| User Type | Effect of Discount Badge |
|---|---|
| New users | +10% conversion |
| Returning users | +1% conversion |
2. Why heterogeneity changes what regression estimates
Under heterogeneous effects, PLR coefficients converge to:
\[ \theta_t^{PLR} = \frac{\mathbb{E}\!\left[\tau_t(X)\,\mathrm{Var}(D_t \mid X)\right]} {\mathbb{E}\!\left[\mathrm{Var}(D_t \mid X)\right]} \]
(See Appendix Section 15 for a full derivation.)
So:
PLR estimates a weighted average of CATEs, not the population ATE.
The weights arise from the conditional variance of treatment after residualizing covariates (Appendix Section 15).
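This can be checked numerically. The sketch below simulates heterogeneous effects with uneven overlap and compares the residual-on-residual estimate to the variance-weighted formula (true nuisances are used to keep the example short; the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.binomial(1, 0.5, n)
e = np.where(X == 1, 0.5, 0.01)                  # propensity e(X)
tau = np.where(X == 1, 3.0, -3.0)                # CATEs; population ATE = 0
D = rng.binomial(1, e)
Y = tau * D + rng.normal(0, 1, n)

# Residual-on-residual regression with the true nuisances
D_res = D - e
Y_res = Y - e * tau                              # Y - E[Y | X]
theta_plr = (D_res * Y_res).mean() / (D_res ** 2).mean()

# The formula's prediction: variance-weighted average of the CATEs
w = e * (1 - e)
theta_formula = (w * tau).mean() / w.mean()

print(round(theta_plr, 2), round(theta_formula, 2))
# both close to 2.77, nowhere near the population ATE of 0
```

The coefficient is pulled toward the stratum where treatment varies, even though half the population is harmed by the treatment.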
3. Where do the weights come from?
For binary treatments:
\[ \mathrm{Var}(D_t \mid X=x) = e_t(x)(1-e_t(x)) \]
where \(e_t(x)\) is the propensity score.
This conditional variance appears naturally in the regression derivation (Appendix Section 15).
It is largest when treatment probability is near 0.5 and near zero when treatment is rare or deterministic.
PLR learns the treatment effect where regression has leverage.
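The weight function is easy to inspect directly (a trivial sketch):

```python
# Overlap weight e(1 - e) across a range of propensity values
weights = {p: p * (1 - p) for p in (0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99)}
for p, w in weights.items():
    print(p, round(w, 4))   # peaks at 0.5 (0.25), collapses near 0 or 1
```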
4. Why this is fine with one treatment
With a single treatment, there is only one overlap pattern. PLR estimates an overlap-weighted ATE, which may be acceptable depending on the research question.
5. Why this breaks with multiple treatments
With multiple treatments, each treatment has its own overlap weights:
\[ \mathrm{Var}(D_t \mid X) \]
Thus:
- PLR(\(D_1\)) reflects effects where \(D_1\) overlaps
- PLR(\(D_2\)) reflects effects where \(D_2\) overlaps
These are different subpopulations.
The coefficient of \(D_1\) in PLR is not the ATE of treatment 1. It is the average effect in the subpopulation where treatment 1 has overlap.
6. A simple illustration
| X | P(D₁=1 \| X) | P(D₂=1 \| X) | τ₁(X) | τ₂(X) |
|---|---|---|---|---|
| 0 | 0.01 | 0.50 | −3 | −2 |
| 1 | 0.50 | 0.01 | +3 | +3 |
The true ATEs average both \(X\) values equally: \(\text{ATE}_1 = 0\) and \(\text{ATE}_2 = +0.5\), so treatment 2 is truly better.
PLR, by contrast:
- gives high weight to \(X=1\) for \(D_1\)
- gives high weight to \(X=0\) for \(D_2\)
So the coefficients summarize different slices of the population, and the ranking can reverse.
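The table's numbers can be plugged straight into the overlap-weighted formula; this is a deterministic check, no simulation needed:

```python
import numpy as np

p_x = np.array([0.5, 0.5])                       # X = 0 and X = 1, equal mass
e1, tau1 = np.array([0.01, 0.50]), np.array([-3.0, 3.0])
e2, tau2 = np.array([0.50, 0.01]), np.array([-2.0, 3.0])

def ate(tau):
    """Population ATE: equal-weight average over the X strata."""
    return float(np.sum(p_x * tau))

def theta_plr(e, tau):
    """Overlap-weighted average the PLR coefficient converges to."""
    w = e * (1 - e)                              # Var(D | X) = e(X)(1 - e(X))
    return float(np.sum(p_x * w * tau) / np.sum(p_x * w))

print(ate(tau1), ate(tau2))                      # 0.0 vs 0.5: treatment 2 wins
print(theta_plr(e1, tau1), theta_plr(e2, tau2))  # PLR puts treatment 1 far ahead
```

The true ranking says deploy treatment 2; the PLR coefficients (roughly \(+2.77\) vs \(-1.81\)) say the opposite.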
7. Why IRM succeeds
IRM directly estimates:
\[ \text{ATE}_t = \mathbb{E}[Y(t) - Y(0)] \]
for each treatment using the same population distribution of \(X\).
Overlap affects variance, not which population is averaged over.
IRM keeps the estimand fixed. PLR lets overlap redefine the estimand.

8. RCT vs Observational Studies: Why PLR Sometimes Works
| Setting | Effects heterogeneous? | Does PLR recover ATE? | Why |
|---|---|---|---|
| Randomized experiment | Yes | ✅ Yes | Treatment probability constant across X |
| Observational, binary treatment | Yes | ❌ Not generally | Overlap varies across X |
| Observational, multiple treatments | Yes | ❌❌ Worse | Each treatment has different overlap weights |
In RCTs, the variance term becomes constant and cancels out (Appendix Section 15).
9. Practical takeaway
PLR is reasonable with one treatment and near-homogeneous effects.
With multiple discrete treatments, estimate ATEs separately.
Default rule: use IRM whenever there are multiple discrete treatments.
10. The real lesson
Regression does not just estimate effects — it determines where in the data those effects are learned.
With heterogeneous effects, overlap becomes the weight.
With multiple treatments, the weights differ.
And once the weights differ, comparison breaks.
Appendix: Why PLR Estimates an Overlap-Weighted Average of CATEs
Setup
Assume unconfoundedness and heterogeneous treatment effects. Observed outcome:
\[ Y = Y(0) + D \cdot \tau(X) \]
PLR:
\[ Y = \theta D + g(X) + \varepsilon \]
Partialling Out
By Frisch–Waugh–Lovell:
\[ \theta = \frac{\mathbb{E}[\tilde{D}\tilde{Y}]}{\mathbb{E}[\tilde{D}^2]} \]
Residualizing
Writing \(e(X) = \mathbb{E}[D \mid X]\) and \(\eta = Y(0) - \mathbb{E}[Y(0) \mid X]\), the residualized outcome is
\[ \tilde{Y} = \tau(X)\big(D - e(X)\big) + \eta \]
Numerator
Under unconfoundedness, \(\mathbb{E}[\tilde{D}\,\eta] = 0\), so
\[ \mathbb{E}[\tilde{D}\tilde{Y}] = \mathbb{E}\big[\tau(X)\,\tilde{D}^2\big] = \mathbb{E}[\tau(X)\,\mathrm{Var}(D \mid X)] \]
Denominator
\[ \mathbb{E}[\tilde{D}^2] = \mathbb{E}[\mathrm{Var}(D \mid X)] \]
Final
\[ \theta^{PLR} = \frac{\mathbb{E}[\tau(X)\,\mathrm{Var}(D \mid X)]} {\mathbb{E}[\mathrm{Var}(D \mid X)]} \]
For binary treatments:
\[ \mathrm{Var}(D \mid X=x) = e(x)(1-e(x)) \]
PLR therefore estimates a variance-weighted average of CATEs, emphasizing regions with overlap.