EP1: Why PLR Fails with Multiple Discrete Treatments

Overlap, Heterogeneous Effects, and the Estimand Problem

Categories: observational causal inference, estimation, estimand, overlap, DoubleML, estimator

Author: Lin Jia

Published: January 26, 2026

Key Takeaway

When conducting causal analysis with observational cross-sectional data and multiple discrete treatments, the preferred estimator is usually the Interactive Regression Model (DoubleMLIRM).

When treatment effects are heterogeneous, regression-style estimators like Partially Linear Regression Models (PLR) do not estimate the Average Treatment Effect (ATE). Instead, they estimate an overlap-weighted average of Conditional Average Treatment Effects (CATEs).

With multiple treatments, each treatment has a different overlap pattern, so each PLR coefficient averages effects over a different subpopulation. The coefficients are therefore not ATEs, not comparable, and can even reverse treatment rankings.

In contrast, the Interactive Regression Model (IRM) recovers the true population ATE, enabling valid comparison and ranking.

In modern observational causal inference, especially in tech, marketplaces, and e-commerce, we often face multiple discrete treatments. Imagine evaluating four different promotional strategies to increase user retention.

Unlike randomized experiments, these interventions are not assigned randomly. Users self-select or are targeted based on observed characteristics.

The Strategy: Identification by Conditioning in Cross-Sectional Data

When we cannot run a randomized experiment, we rely on cross-sectional observational data and the strategy of Identification by Conditioning (Chernozhukov, Hansen, et al. 2024).

This approach moves us from correlation to causation by assuming that, after controlling for a sufficiently rich set of covariates \(X\), treatment assignment is as good as random:

\[ (Y(1), Y(0)) \;\perp\; D \mid X \]

This assumption is known as conditional exchangeability, ignorability, or selection on observables.

Under this assumption, causal effects can be recovered by properly adjusting for \(X\).
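Concretely, together with overlap (\(0 < P(D = d \mid X) < 1\)), conditional exchangeability licenses the standardization (g-formula) identity:

\[ \mathbb{E}[Y(d)] = \mathbb{E}_X\big[\, \mathbb{E}[Y \mid X, D = d] \,\big] \]

Averaging the conditional mean outcome over the population distribution of \(X\) recovers the counterfactual mean.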

The Bridge to DoubleML: Learning with High-Dimensional Controls

In high-dimensional settings, traditional regression struggles to flexibly adjust for \(X\).
Double Machine Learning (DoubleML) (Chernozhukov, Chetverikov, et al. 2024) provides a principled solution.

DoubleML uses machine learning to model two nuisance components:

  1. Outcome model \(g(X) = \mathbb{E}[Y \mid X]\)
  2. Treatment model \(m(X) = \mathbb{E}[D \mid X]\) (the propensity score)

It then combines them using orthogonal (debiased) scores, making estimation robust to small modeling errors.
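The residual-on-residual construction can be sketched end to end. The simulation below is a minimal illustration, not the DoubleML API: it plugs in oracle nuisance functions (the true \(g\) and \(m\)) where DoubleML would use cross-fitted ML predictions, and the data-generating process is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
theta = 2.0                              # constant treatment effect

X = rng.uniform(0, 1, n)
m = 0.2 + 0.6 * X                        # treatment model m(X) = E[D | X]
D = rng.binomial(1, m)
baseline = np.sin(2 * np.pi * X)         # confounded baseline outcome
Y = baseline + theta * D + rng.normal(0, 1, n)

# Oracle nuisances; DoubleML would cross-fit these with ML learners
g = baseline + theta * m                 # outcome model g(X) = E[Y | X]

# Orthogonal score: regress the Y residual on the D residual
D_res, Y_res = D - m, Y - g
theta_hat = np.mean(D_res * Y_res) / np.mean(D_res ** 2)
print(theta_hat)
```

Because the score is orthogonal, small errors in \(g\) and \(m\) enter the estimate only at second order; that robustness is the point of the construction.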

Once we adopt this framework, a key modeling choice arises when treatments are multiple and discrete:

Should we use a Partially Linear Regression Model (PLR) or an Interactive Regression Model (IRM)?

The Comparison: PLR vs IRM Specifications

Partially Linear Regression Model (PLR)

Implemented as DoubleMLPLR, the PLR assumes treatment effects enter additively and are constant in the residualized space. It partials out the effect of \(X\) from both \(D\) and \(Y\), then estimates a single linear relationship:

\[ Y - g(X) = \sum_{j=1}^{k} \theta_j \big(D_j - m_j(X)\big) + \varepsilon \]

All treatments compete within the same regression equation.

Interactive Regression Model (IRM)

The IRM (DoubleMLIRM) is more flexible. It does not impose an additive linear structure on treatment effects. Instead, it estimates potential outcomes for each treatment level and constructs an Augmented Inverse Propensity Weighted (AIPW) estimator.

Conceptually, IRM asks:

What would the average outcome be if the entire population received treatment \(j\)?

This directly targets the population ATE for each treatment.

The Trap: Why PLR Fails to Rank Treatments

The “efficiency” of PLR is tempting — it produces one clean regression table. However, PLR coefficients are overlap-weighted averages, not population ATEs.

To understand why, we first need to clarify what heterogeneous treatment effects are.

1. What are heterogeneous treatment effects?

A treatment effect is heterogeneous if it varies across individuals. The conditional average treatment effect (CATE) captures this variation:

\[ \tau_t(X) = \mathbb{E}[Y(t) - Y(0) \mid X] \]

If \(\tau_t(X)\) depends on covariates \(X\), then some groups benefit more than others.

This does not just mean “different treatments have different averages.”
It means:

The same treatment has different effects for different people.

Example:

| User Type | Effect of Discount Badge |
|---|---|
| New users | +10% conversion |
| Returning users | +1% conversion |

2. Why heterogeneity changes what regression estimates

Under heterogeneous effects, PLR coefficients converge to:

\[ \theta_t^{PLR} = \frac{\mathbb{E}\!\left[\tau_t(X)\,\mathrm{Var}(D_t \mid X)\right]} {\mathbb{E}\!\left[\mathrm{Var}(D_t \mid X)\right]} \]

(See the Appendix for a full derivation.)

So:

PLR estimates a weighted average of CATEs, not the population ATE.

The weights arise from the conditional variance of treatment after residualizing covariates (see the Appendix).
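This can be checked by simulation. The sketch below uses an invented two-group DGP and oracle residualization (standing in for cross-fitted nuisances): the PLR coefficient lands on the variance-weighted average of the two CATEs, not on the population ATE of zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

X = rng.binomial(1, 0.5, n)
e = np.where(X == 1, 0.5, 0.05)        # overlap differs sharply across X
tau = np.where(X == 1, 3.0, -3.0)      # heterogeneous CATEs; true ATE = 0
D = rng.binomial(1, e)
Y = tau * D + rng.normal(0, 0.5, n)    # baseline Y(0) = 0 for simplicity

# Oracle residualization (stands in for cross-fitted ML nuisances)
D_res = D - e
Y_res = Y - tau * e                    # E[Y | X] = tau(X) e(X) here
theta_plr = np.mean(D_res * Y_res) / np.mean(D_res ** 2)

# Closed-form overlap-weighted average of CATEs
w = e * (1 - e)                        # Var(D | X)
theta_formula = np.mean(tau * w) / np.mean(w)

print(theta_plr, theta_formula)        # both near 2.0, far from the ATE of 0
```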

3. Where do the weights come from?

For binary treatments:

\[ \mathrm{Var}(D_t \mid X=x) = e_t(x)(1-e_t(x)) \]

where \(e_t(x)\) is the propensity score.

This conditional variance appears naturally in the regression derivation (see the Appendix).

It is largest when treatment probability is near 0.5 and near zero when treatment is rare or deterministic.

PLR learns the treatment effect where regression has leverage.

4. Why this is fine with one treatment

With a single treatment, there is only one overlap pattern. PLR estimates an overlap-weighted ATE, which may be acceptable depending on the research question.

5. Why this breaks with multiple treatments

With multiple treatments, each treatment has its own overlap weights:

\[ \mathrm{Var}(D_t \mid X) \]

Thus:

  • PLR(\(D_1\)) reflects effects where \(D_1\) overlaps
  • PLR(\(D_2\)) reflects effects where \(D_2\) overlaps

These are different subpopulations.

The coefficient of \(D_1\) in PLR is not the ATE of treatment 1. It is the average effect in the subpopulation where treatment 1 has overlap.

6. A simple illustration

| \(X\) | \(P(D_1 = 1 \mid X)\) | \(P(D_2 = 1 \mid X)\) | \(\tau_1(X)\) | \(\tau_2(X)\) |
|---|---|---|---|---|
| 0 | 0.01 | 0.50 | −3 | −2 |
| 1 | 0.50 | 0.01 | +3 | +3 |

True ATEs average both X values equally: \(\text{ATE}_1 = (-3 + 3)/2 = 0\) and \(\text{ATE}_2 = (-2 + 3)/2 = +0.5\), so treatment 2 is truly better.

PLR:

  • gives high weight to \(X=1\) for \(D_1\)
  • gives high weight to \(X=0\) for \(D_2\)

So the coefficients summarize different slices of the population, and the ranking can reverse.
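Plugging the table's numbers into a simulation makes the reversal concrete. This is an illustrative DGP with oracle residualization, not a DoubleML run: treatments are mutually exclusive, and both residualized treatments enter one joint regression, as in the PLR specification.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

X = rng.binomial(1, 0.5, n)
e1 = np.where(X == 1, 0.50, 0.01)      # P(D1 = 1 | X), from the table
e2 = np.where(X == 1, 0.01, 0.50)      # P(D2 = 1 | X)
tau1 = np.where(X == 1, 3.0, -3.0)     # true ATE_1 = 0
tau2 = np.where(X == 1, 3.0, -2.0)     # true ATE_2 = +0.5

# Mutually exclusive assignment: control / treatment 1 / treatment 2
u = rng.uniform(0, 1, n)
D1 = (u < e1).astype(float)
D2 = ((u >= e1) & (u < e1 + e2)).astype(float)
Y = tau1 * D1 + tau2 * D2 + rng.normal(0, 0.5, n)

# Joint PLR with oracle residualization: both D residuals in one regression
D_res = np.column_stack([D1 - e1, D2 - e2])
Y_res = Y - (tau1 * e1 + tau2 * e2)    # E[Y | X]
coef, *_ = np.linalg.lstsq(D_res, Y_res, rcond=None)
theta1, theta2 = coef
print(theta1, theta2)                  # roughly +2.7 and -1.7
```

The true ATEs are 0 and +0.5, so treatment 2 is better; yet the PLR coefficients rank treatment 1 far ahead, because each coefficient is learned in a different overlap region.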

7. Why IRM succeeds

IRM directly estimates:

\[ \text{ATE}_t = \mathbb{E}[Y(t) - Y(0)] \]

for each treatment using the same population distribution of \(X\).

Overlap affects variance, not which population is averaged over.

IRM keeps the estimand fixed. PLR lets overlap redefine the estimand.

The figure illustrates the difference visually: PLR aggregates effects in different local overlap regions, while IRM evaluates each treatment over the same population.
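A minimal sketch of the AIPW score that IRM builds on, reusing the illustrative DGP from the table above (oracle nuisances again stand in for cross-fitted ML fits):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

X = rng.binomial(1, 0.5, n)
e1 = np.where(X == 1, 0.50, 0.01)      # propensities from the table
e2 = np.where(X == 1, 0.01, 0.50)
e0 = 1 - e1 - e2                       # probability of control
tau1 = np.where(X == 1, 3.0, -3.0)     # true ATE_1 = 0
tau2 = np.where(X == 1, 3.0, -2.0)     # true ATE_2 = +0.5

u = rng.uniform(0, 1, n)
D1 = (u < e1).astype(float)
D2 = ((u >= e1) & (u < e1 + e2)).astype(float)
D0 = 1 - D1 - D2
Y = tau1 * D1 + tau2 * D2 + rng.normal(0, 0.5, n)

def aipw_ate(Dt, et, gt):
    """AIPW estimate of E[Y(t) - Y(0)], with oracle g_t and g_0 = 0."""
    g0_x = np.zeros(n)                 # E[Y | X, control] = 0 in this DGP
    psi_t = gt + Dt * (Y - gt) / et
    psi_0 = g0_x + D0 * (Y - g0_x) / e0
    return np.mean(psi_t - psi_0)

ate1 = aipw_ate(D1, e1, tau1)
ate2 = aipw_ate(D2, e2, tau2)
print(ate1, ate2)
```

Both estimates average over the same population distribution of \(X\), so they land near the true ATEs of 0 and +0.5 and the ranking comes out right, in contrast to the PLR coefficients.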

8. RCT vs Observational Studies: Why PLR Sometimes Works

| Setting | Effects heterogeneous? | Does PLR recover ATE? | Why |
|---|---|---|---|
| Randomized experiment | Yes | ✅ Yes | Treatment probability constant across \(X\) |
| Observational, binary treatment | Yes | ❌ Not generally | Overlap varies across \(X\) |
| Observational, multiple treatments | Yes | ❌❌ Worse | Each treatment has different overlap weights |

In RCTs, the variance term becomes constant and cancels out (see the Appendix).
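With a constant propensity \(e(X) = e\), the overlap weight factors out of both expectations:

\[ \theta^{PLR} = \frac{\mathbb{E}[\tau(X)\, e(1-e)]}{\mathbb{E}[e(1-e)]} = \frac{e(1-e)\,\mathbb{E}[\tau(X)]}{e(1-e)} = \mathbb{E}[\tau(X)] = \text{ATE} \]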

9. Practical takeaway

PLR is reasonable with one treatment and near-homogeneous effects.
With multiple discrete treatments, estimate ATEs separately.

Default rule: Use IRM when treatments are multiple.

10. The real lesson

Regression does not just estimate effects — it determines where in the data those effects are learned.

With heterogeneous effects, overlap becomes the weight.
With multiple treatments, the weights differ.
And once the weights differ, comparison breaks.

Appendix: Why PLR Estimates an Overlap-Weighted Average of CATEs

Setup

Assume unconfoundedness and heterogeneous treatment effects. Observed outcome:

\[ Y = Y(0) + D \cdot \tau(X) \]

PLR:

\[ Y = \theta D + g(X) + \varepsilon \]

Partialling Out

By Frisch–Waugh–Lovell:

\[ \theta = \frac{\mathbb{E}[\tilde{D}\tilde{Y}]}{\mathbb{E}[\tilde{D}^2]} \]

Residualizing

Write \(\tilde{D} = D - e(X)\) and \(\tilde{Y} = Y - \mathbb{E}[Y \mid X]\), where \(e(X) = \mathbb{E}[D \mid X]\) is the propensity score. Substituting the outcome model gives:

\[ \tilde{Y} = \tau(X)\big(D - e(X)\big) + \eta, \qquad \eta = Y(0) - \mathbb{E}[Y(0) \mid X] \]

Numerator

\[ \mathbb{E}[\tilde{D}\tilde{Y}] = \mathbb{E}[\tau(X)\,\mathrm{Var}(D \mid X)] \]

Denominator

\[ \mathbb{E}[\tilde{D}^2] = \mathbb{E}[\mathrm{Var}(D \mid X)] \]

Final

\[ \theta^{PLR} = \frac{\mathbb{E}[\tau(X)\,\mathrm{Var}(D \mid X)]} {\mathbb{E}[\mathrm{Var}(D \mid X)]} \]

For binary treatments:

\[ \mathrm{Var}(D \mid X=x) = e(x)(1-e(x)) \]

PLR therefore estimates a variance-weighted average of CATEs, emphasizing regions with overlap.

References

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2024. “Double/Debiased Machine Learning for Treatment and Causal Parameters.” https://arxiv.org/abs/1608.00060.
Chernozhukov, Victor, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis. 2024. Applied Causal Inference Powered by ML and AI. https://causalml-book.org/.