GenAI in Experimentation: Notes from the Panel and the Sidelines
What I said on the GenAI & Experimentation panel at Booking.com’s 3rd Experimentation Conference — and the themes that came up in the breakout conversations afterwards.
🎤 The GenAI & Experimentation panel at Booking.com’s 3rd Experimentation Conference covered roughly four questions in forty minutes. I sat on it alongside David Gregory (Skyscanner), Marcel Toben (Zalando), and Dima Bordiugov (Delivery Hero), with Gosia Popławska (Netflix) moderating.

A panel format optimizes for momentum, not depth. The post below is in two parts:
- Part 1 — On the panel. The four questions we covered on stage, with the reasoning I’d have included if there had been time.
- Part 2 — From the side conversations. Themes that came up repeatedly with conference attendees during the breakouts and breaks — questions that aren’t asked on panels but that practitioners are clearly wrestling with.
Part 1 — On the panel
1. Where has GenAI been most useful — before, during, or after the experiment? Where is it overrated?
I’d reframe the question. Usefulness isn’t really about when GenAI shows up in the experimentation lifecycle. It’s about what function it serves.
Before an experiment — GenAI helps with hypothesis quality, retrieves relevant past experiments, recommends metrics, and helps people navigate experimentation best practices. The friction it removes is about finding and framing.
During an experiment — it can identify anomalies, diagnose issues, categorize failures, surface monitoring signals, and assist with debugging and investigation. The friction it removes is about catching things faster.
After an experiment — it synthesizes findings, connects results to prior evidence, generates decision digests, and helps stakeholders absorb outcomes. The friction it removes is about communication and retention.
Where it gets overrated: when teams expect it to become the source of truth itself. GenAI is very good at retrieval, synthesis, explanation, and recommendation. It should support decision-making — not replace experimental evidence.
GenAI can support the entire experimentation lifecycle, but it should support decisions rather than replace evidence.
2. What would you put in a copilot roadmap — year one versus year three?
Honestly, at today’s pace I’m not sure the year-one versus year-three distinction is primarily a technical one anymore.
Most of the capabilities people imagine for “year three” can be prototyped this quarter.
The harder questions are:
- Are we solving the right problem?
- Are the outputs trustworthy?
- Can users rely on them?
- Do they actually improve decisions?
For me, the roadmap is less about what becomes technically possible and more about what becomes operationally trusted and useful.
The wrong question is “where can we use GenAI?” The right question is “what problem are we trying to solve, and is GenAI actually the right tool for it?”
Just because we have a hammer doesn’t mean everything becomes a nail.
3. Where’s the first friction between AI-generated outputs and experimentation as the arbiter of truth?
The first friction appears the moment fluency gets confused with correctness.
Large language models are excellent at producing coherent and convincing outputs. That coherence does not guarantee factual accuracy — and experimentation is a domain where wrong-but-confident answers do real damage downstream.
The guardrail isn’t “tell users to be careful.” It’s building grounding and evaluation directly into the system:
🧷 grounding through trusted sources
🔍 retrieval mechanisms
✅ factuality evaluation
🎯 contextual relevance evaluation
📐 statistical correctness evaluation
📊 continuous monitoring
The goal is not just fluent outputs. The goal is trustworthy outputs.
In experimentation, the standard shouldn’t be “does this sound right?” It should be “is this grounded, accurate, and decision-safe?”
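To make "statistical correctness evaluation" a little more concrete, here is a minimal sketch of one possible check: recompute the numbers from raw counts and compare them with what an AI-generated summary claims. Everything in it is an assumption for illustration, including the `SummaryClaim` structure, the pooled two-proportion z-test as the analysis method, the 0.05 significance level, and the lift tolerance. Your platform's actual analysis method should be the reference, not this one.

```python
import math
from dataclasses import dataclass


@dataclass
class SummaryClaim:
    """Hypothetical structured claim extracted from an AI-generated summary."""
    relative_lift: float   # e.g. 0.10 means "+10% relative lift"
    is_significant: bool   # what the summary asserts about significance


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def check_claim(claim, conv_a, n_a, conv_b, n_b, alpha=0.05, lift_tol=0.01):
    """Return a list of mismatches between the claim and the recomputed numbers.

    Assumes the control group has at least one conversion (p_a > 0).
    """
    problems = []
    p_a, p_b = conv_a / n_a, conv_b / n_b
    actual_lift = (p_b - p_a) / p_a
    if abs(actual_lift - claim.relative_lift) > lift_tol:
        problems.append(
            f"lift: claimed {claim.relative_lift:.3f}, recomputed {actual_lift:.3f}"
        )
    p_value = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
    if claim.is_significant != (p_value < alpha):
        problems.append(
            f"significance: claimed {claim.is_significant}, p-value {p_value:.4f}"
        )
    return problems
```

An empty list means the summary agrees with the recomputed evidence. Anything else is a reason to block or flag the output before a stakeholder reads it.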
4. What makes experimentation memory actionable? How do you avoid stale knowledge?
The challenge with experimentation memory isn’t storing knowledge.
It’s keeping knowledge useful.
Products evolve. Metrics evolve. User behavior evolves. Business priorities evolve.
A knowledge base becomes valuable only when the retrieved information remains useful for the current decision context. That requires:
- continuous maintenance
- updating documentation
- incorporating new information
- deprecating outdated information
And usefulness depends on the downstream application. The same knowledge gets used differently for hypothesis generation, metric recommendation, support assistants, experiment retrieval, and decision support — and each use case requires different evaluation criteria.
Experimentation memory should therefore be treated as a continuously maintained and continuously evaluated system, not a static archive with a search box on top.
Experimentation memory is more than a searchable archive. It is a continuously maintained and evaluated system.
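One way to picture the "deprecate and maintain" part is a freshness filter that sits between retrieval and whatever consumes the results. The sketch below is only an illustration: the `ExperimentRecord` fields, the `last_reviewed` date, and the 365-day default are hypothetical, and a real system would likely use a different age limit per downstream use case.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ExperimentRecord:
    """Hypothetical knowledge-base entry; every field name is illustrative."""
    experiment_id: str
    summary: str
    last_reviewed: date
    deprecated: bool = False


def usable_records(records, today, max_age_days=365):
    """Split retrieved records into fresh ones and ones that need review."""
    cutoff = today - timedelta(days=max_age_days)
    fresh, needs_review = [], []
    for record in records:
        if record.deprecated:
            continue  # explicitly retired knowledge is never surfaced
        if record.last_reviewed < cutoff:
            needs_review.append(record)  # stale: route to an owner, don't serve it
        else:
            fresh.append(record)
    return fresh, needs_review
```

The design choice worth noting is that stale records are returned separately rather than silently dropped, so the maintenance work becomes visible instead of accumulating unnoticed.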
Part 2 — From the side conversations
The next set of questions never made it onto the panel agenda, but they kept coming up in the hallway and over coffee. They’re the questions practitioners ask when there’s no microphone — which usually means they’re the ones currently costing teams the most time.
5. How has experimentation culture changed with GenAI adoption?
GenAI is making experimentation more accessible across organizations. It reduces friction around:
- finding information
- understanding experimentation concepts
- retrieving past learnings
- interpreting results
- navigating experimentation processes
That accessibility is genuinely valuable.
But the goal is not to remove complexity. Some complexity is fundamental.
Questions around uncertainty, measurement quality, causal validity, and trade-offs still exist. A friendlier interface doesn’t make them go away — and if the interface hides them, that’s a regression, not progress.
The job is to make the important complexity easier to navigate, not to pretend it isn’t there.
The goal isn’t to remove complexity from experimentation. It’s to make the important complexity easier to navigate.
6. How should experimentation program KPIs evolve?
I’d group the success metrics for a GenAI-augmented experimentation program into three directions.
1. Existing experimentation quality metrics (rigor, decision quality, hypothesis quality, metric quality, guideline compliance). GenAI should improve these, not replace them.
2. Efficiency and enablement — how much friction is reduced? Faster workflows. Easier onboarding. Improved learning velocity. Better knowledge access. Fewer avoidable mistakes.
3. GenAI-system metrics — the GenAI systems themselves require evaluation. Factual accuracy. Contextual relevance. Statistical correctness. Trustworthiness. Helpfulness.
The mistake to avoid is treating volume as success. Running more experiments doesn’t mean learning more. The point of any GenAI investment here should be higher-quality learning per experiment, not just more experiments.
The goal isn’t scaling experimentation volume. The goal is scaling high-quality learning.
7. How do you evaluate and monitor LLM outputs in production?
Different use cases require different evaluation criteria.
For experimentation insights — factual accuracy, statistical correctness, contextual relevance, actionable recommendations.
For agent systems — correct tool usage, multi-turn coherence, workflow correctness, answer quality, safety and toxicity.
Operationally:
🚦 Before shipping — quality thresholds, shipping gates, human evaluation, judge-LLM evaluation.
🛰️ After shipping — continuous monitoring, ongoing evaluation, drift detection.
The shape of the evaluation system is itself a design decision. Building a one-size-fits-all eval pipeline for “AI features” is how you end up with a 92% benchmark score and a tool nobody trusts in production.
Different GenAI applications require different evaluation frameworks.
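As a rough sketch of what a per-use-case shipping gate might look like, the example below keeps a separate set of thresholds for each application and fails the gate if any single criterion falls short. The use-case names, criteria keys, and threshold values are all made up for illustration. How the scores are produced (human review, a judge LLM, or both) is outside the sketch.

```python
# Minimum acceptable score per criterion, keyed by use case.
# All names and numbers here are illustrative assumptions.
THRESHOLDS = {
    "experiment_insights": {
        "factual_accuracy": 0.95,
        "statistical_correctness": 0.98,
        "contextual_relevance": 0.85,
    },
    "agent": {
        "tool_usage": 0.90,
        "multi_turn_coherence": 0.85,
        "safety": 0.99,
    },
}


def shipping_gate(use_case, scores):
    """Return (passed, failures). A missing score counts as a failure."""
    failures = {}
    for criterion, minimum in THRESHOLDS[use_case].items():
        observed = scores.get(criterion)
        if observed is None or observed < minimum:
            failures[criterion] = (observed, minimum)
    return (not failures), failures


# Usage: a strong average can still fail the gate on one criterion.
passed, failures = shipping_gate(
    "experiment_insights",
    {
        "factual_accuracy": 0.96,
        "statistical_correctness": 0.97,
        "contextual_relevance": 0.90,
    },
)
```

Gating on every criterion rather than on an aggregate is deliberate here: a single averaged score is exactly what lets a weak dimension hide behind strong ones.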
8. How is building GenAI-enabled features different from classical experimentation features?
Two things often get conflated.
AI-assisted development — using GenAI to help engineers build software faster. This applies broadly to product development; it isn’t specific to experimentation.
GenAI-enabled products — the LLM itself becomes part of the production experience. The model isn’t a tool the team uses to ship; it’s a component the user interacts with.
That second case changes the engineering practice. You inherit the ML-systems toolkit:
- benchmarking
- evaluation datasets
- production monitoring
- shipping gates
- continuous quality assessment
The development process becomes much closer to ML-systems engineering than traditional deterministic product development. Treating it as the latter is how silent regressions reach production.
Once the LLM becomes part of the production logic, the product behaves more like a continuously evaluated ML system than a traditional deterministic feature.
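For the production-monitoring piece, a very simple starting point is a rolling-window check that compares recent quality scores against a baseline measured at ship time. The sketch below assumes each production response already gets a numeric quality score from somewhere (a judge model, sampled human review). The class name, the window size, and the tolerance are hypothetical, and a real drift detector would probably use a proper statistical test rather than a fixed margin.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Flags drift when the rolling mean score drops below baseline - tolerance."""

    def __init__(self, baseline, tolerance=0.05, window=50):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def record(self, score):
        """Add one score; return True if the full window has drifted."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge
        return mean(self.scores) < self.baseline - self.tolerance
```

Nothing about the code path changes when a model's quality drops this way, which is why this kind of regression is silent without a monitor.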
9. What should always remain human?
GenAI can retrieve, summarize, classify, explain, and recommend. All of that is useful.
But many experimentation decisions involve trade-offs that don’t reduce to a clean optimization:
- business impact vs user experience
- speed vs quality
- cost vs latency
- short-term vs long-term outcomes
- uncertainty vs action
These require judgment. They require context that doesn’t live in the data. They require being accountable to people, not metrics.
That’s the part that stays human — not because GenAI couldn’t theoretically learn to weigh these, but because the cost of being wrong sits with humans, not models.
GenAI can support decision-making. It should not replace decision-making. Trade-offs remain fundamentally human.
The thread
Across the panel and the side conversations, the same idea kept showing up.
GenAI is making experimentation faster, more accessible, and easier to scale. The challenge — and the work that actually matters — is maintaining trust, rigor, evaluation quality, and human judgment as experimentation velocity increases.
When experimentation velocity increases, maintaining the same rigor becomes more important, not less.