EP3: The Causality Gap — Measuring the True Impact of Voluntary Adoption in Digital Marketplaces

Randomized Encouragement Design + DoubleML for voluntary adoption

Tags: observational causal inference, randomized encouragement design, instrumental variables, DoubleML, LATE

Author: Lin Jia

Published: May 22, 2026

A feature can create value and still fail the experiment.

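In marketplaces, many features depend on users or partners choosing to opt in: loyalty programs, vouchers and incentives, partner optimization tools, email messaging. That creates a surprisingly tricky measurement problem: when only a fraction of users adopt, an A/B test of the feature can come back flat even if adopters get real value from it.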

We call this the causality gap.

The wrong question: “should we kill it?”

Faced with a flat A/B result, the default instinct is to kill the feature. But that conclusion quietly assumes the feature and its adoption are the same thing. They are not.

A weak experiment result can mean two very different things:

  • The feature creates little value, or
  • The feature creates real value, but very few users adopted it.

Depending on which story holds, and on how much value adopters actually get, the right action is completely different:

  • 🎯 Improve the product — adopters aren’t getting enough value.
  • 📣 Improve adoption — adopters love it, you just need more of them.
  • 🧭 Stop investing — the value isn’t there, even for adopters.

Reading a flat A/B as “the feature doesn’t work” collapses three different decisions into one. That is the trap.
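
A small arithmetic sketch of why one flat number can hide different stories. It assumes that only adopters are affected by the feature (an assumption for illustration, not something an A/B test verifies), so the experiment-level lift is roughly the adoption rate times the average effect among adopters. The function name and all numbers are hypothetical.

```python
def experiment_level_lift(adoption_rate: float, effect_among_adopters: float) -> float:
    """Average lift per exposed user, assuming non-adopters are unaffected."""
    if not 0.0 <= adoption_rate <= 1.0:
        raise ValueError("adoption_rate must be between 0 and 1")
    return adoption_rate * effect_among_adopters


# Hypothetical scenarios: (share of exposed users who adopt, lift per adopter).
weak_feature_wide_adoption = experiment_level_lift(0.50, 0.2)
strong_feature_rare_adoption = experiment_level_lift(0.05, 2.0)

# Both scenarios produce the same top-line lift per exposed user.
print(weak_feature_wide_adoption, strong_feature_rare_adoption)
```

The top-line result cannot tell these two scenarios apart, which is why the adoption rate and the effect among adopters need to be estimated separately.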

Two questions, two distinct estimands

In the article, Kexin Fei and I walk through how Randomized Encouragement Designs (RED) combined with DoubleML can answer the two questions a marketplace team actually cares about:

  1. What is the effect on users who actually adopt the feature? (the effect among adopters — LATE / CACE)
  2. What is the impact of rolling it out to everyone eligible? (the intent-to-treat effect — ITT)
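
In standard potential-outcomes notation, the two estimands can be written as follows. The symbols are assumptions introduced here for illustration, not notation from the article: Z is the randomized encouragement (1 if encouraged), D is actual adoption (1 if the user opted in), and Y is the outcome.

```latex
\begin{aligned}
\text{ITT}  &= \mathbb{E}[Y \mid Z = 1] - \mathbb{E}[Y \mid Z = 0], \\
\text{LATE} &= \frac{\mathbb{E}[Y \mid Z = 1] - \mathbb{E}[Y \mid Z = 0]}
                    {\mathbb{E}[D \mid Z = 1] - \mathbb{E}[D \mid Z = 0]}.
\end{aligned}
```

The ITT is a plain contrast over the randomized assignment. The ratio identifies the average effect among compliers (users who adopt because they were encouraged) only under the usual instrumental-variable assumptions: the encouragement is randomized, it affects the outcome only through adoption, it never discourages adoption, and it actually moves adoption.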

These are different numbers with different uses, and most teams unintentionally mix them up. RED + DoubleML lets you estimate both cleanly from the same experiment.
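
As a minimal sketch of how both numbers can come from one encouragement experiment, the snippet below computes the ITT as a difference in means and the LATE as the simple Wald ratio. It is not the article's implementation: it leaves out the covariate adjustment with cross-fitted machine-learning models that DoubleML adds, and the function name and toy data are hypothetical.

```python
from statistics import mean


def itt_and_late(z, d, y):
    """Return (ITT, first-stage uptake difference, Wald estimate of LATE).

    z: 1 if the user was randomly encouraged, else 0
    d: 1 if the user actually adopted the feature, else 0
    y: observed outcome
    """
    y_enc = [yi for zi, yi in zip(z, y) if zi == 1]
    y_ctl = [yi for zi, yi in zip(z, y) if zi == 0]
    d_enc = [di for zi, di in zip(z, d) if zi == 1]
    d_ctl = [di for zi, di in zip(z, d) if zi == 0]

    itt = mean(y_enc) - mean(y_ctl)          # effect of being encouraged
    first_stage = mean(d_enc) - mean(d_ctl)  # how much encouragement moves adoption
    if first_stage == 0:
        raise ValueError("encouragement did not move adoption; LATE is undefined")
    return itt, first_stage, itt / first_stage


# Hypothetical toy data: 4 encouraged users followed by 4 non-encouraged users.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 1, 0, 1, 0, 0, 0]
y = [12, 12, 12, 10, 12, 10, 10, 10]

itt, first_stage, late = itt_and_late(z, d, y)
print(itt, first_stage, late)
```

In this toy data the encouragement raises adoption by 50 percentage points and the outcome by 1 on average, so the implied effect among those moved to adopt is 2.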

Why the naïve fixes don’t save you

The instinctive next move is to compare adopters to “similar” non-adopters by adjusting for what you can measure — past activity, engagement, demographics. That is the spirit behind regression adjustment, matching, weighting, and their machine-learning variants.

Here is the catch. Adopters and non-adopters differ in things you can measure and in things you cannot — how motivated they were to begin with, how much they would have engaged anyway. Adjustment only handles the measurable differences. The invisible ones — exactly the ones that drove the adoption decision in the first place — stay invisible.

It gets worse. When you match adopters to non-adopters on observable proxies for motivation, you systematically pick the most motivated non-adopters as your control group. You end up comparing a mixed-motivation treatment group against a hand-picked super-motivated control. Broken by design.

Adding more controls does not fix this. Adjusting harder on the wrong variables just produces a wrong answer with tighter confidence intervals.
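
The argument above can be checked with a small simulation. This is a sketch under stated assumptions, not the article's analysis: the data-generating process, the coefficients, and all names are made up for illustration. Motivation is unobserved, the analyst sees only a noisy proxy for it, the encouragement is randomized, and the true effect of adoption is fixed at 1.0.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residualize(target, control):
    """Remove the linear association between target and control."""
    slope = ols_slope(control, target)
    mc, mt = mean(control), mean(target)
    return [t - mt - slope * (c - mc) for t, c in zip(target, control)]


def mean_where(values, flags, flag):
    """Mean of values over the rows where flags equals flag."""
    return mean(v for v, f in zip(values, flags) if f == flag)


TRUE_EFFECT = 1.0
rng = random.Random(0)
z, d, x, y = [], [], [], []
for _ in range(40000):
    motivation = rng.gauss(0, 1)              # unobserved by the analyst
    proxy = motivation + rng.gauss(0, 1)      # observed, but only a noisy proxy
    zi = 1 if rng.random() < 0.5 else 0       # randomized encouragement
    # Motivated users adopt more often; encouragement nudges adoption upward.
    di = 1 if motivation + 1.0 * zi + rng.gauss(0, 1) > 1.0 else 0
    yi = TRUE_EFFECT * di + 2.0 * motivation + rng.gauss(0, 1)
    z.append(zi); d.append(di); x.append(proxy); y.append(yi)

# 1) Naive: adopters vs non-adopters.
naive = mean_where(y, d, 1) - mean_where(y, d, 0)
# 2) Adjusted: regress y on adoption while controlling for the observed proxy.
adjusted = ols_slope(residualize(d, x), residualize(y, x))
# 3) Encouragement as instrument: Wald ratio of the two randomized contrasts.
wald = (mean_where(y, z, 1) - mean_where(y, z, 0)) / (
    mean_where(d, z, 1) - mean_where(d, z, 0)
)

print(f"naive={naive:.2f} adjusted={adjusted:.2f} wald={wald:.2f} truth={TRUE_EFFECT}")
```

In this setup the adjusted estimate should move toward the truth but stay well above it, because the proxy captures only part of the motivation that drove adoption. The Wald ratio relies on the randomized encouragement instead and should land near the true effect, at the cost of a noisier estimate.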

In the article, we walk through the mechanics step by step — and show what randomized encouragement gives you that pure adjustment never can.

Read the article — or listen to the episode

👉 The Causality Gap: Measuring the True Impact of Voluntary Adoption in Digital Marketplaces →

If it lands for you, a few claps on Medium really do help it reach more practitioners working on the same problem.

Prefer audio? The same ideas on the podcast: