What do you do when there’s no experiment to run?

Over the past three weeks we’ve built a toolkit. Rubin gave us the language of potential outcomes and the counterfactual. Pearl gave us DAGs to map our assumptions and check whether our analysis can identify a causal effect. Both frameworks assume you have data, whether from an experiment or from an observational setting with a credible identification strategy.

But what happens when you don’t? When leadership asks “what would happen if we implemented policy X?” and policy X has never been tried? There’s no treatment group. No control group. No natural experiment. No data to construct a counterfactual from.

This is the problem that James Heckman, Nobel laureate in economics, and Rodrigo Pinto have spent years working on. Their argument: if you understand the mechanisms through which a program works, you can make credible predictions about policies that have never been implemented, in populations that have never been studied. But only if you move beyond treatment effects and into the structure underneath them.

Why an experiment isn’t enough

To see why mechanisms matter, consider a case where we do have an experiment, and where the experimental results still fall short.

The Perry Preschool Program, run in the 1960s in Michigan and one of the most studied social interventions in history, illustrates the problem. Disadvantaged children were randomly assigned to an intensive preschool program. Decades later, the treatment group had higher earnings, lower crime rates, and better health outcomes. The experiment tells you the program worked, but when a policymaker asks “should we fund something like this in our state, for our population, with our budget?” the experimental estimate doesn’t answer that question. The average treatment effect (ATE) tells you what happened in Michigan in the 1960s. It does not tell you what will happen somewhere else.

So why did it work? Was it cognitive skills, social and emotional development, or changes in parenting behaviour? These are different mechanisms with very different policy implications. If the effect works through cognitive skills, any good curriculum might replicate it. If it works through home visits that changed parenting, a different delivery model might fail entirely. Heckman and colleagues’ analysis of the Perry data showed that the long-term effects operated substantially through non-cognitive channels (self-regulation, motivation, social skills) rather than IQ gains, which faded within a few years. That understanding of mechanisms is what tells you which features of the program were doing the work and which were incidental, and whether a new program, in a different place, for a different population, has any reason to expect similar results.
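
To make the logic concrete, here is a minimal sketch of a channel decomposition. It assumes the total effect is simply the sum, over channels, of how far the program shifts a skill multiplied by how much that skill pays off in the outcome. Every number, along with the names `channels`, `channel_contributions` and `total_effect`, is hypothetical and chosen for illustration. None of them are Perry estimates.

```python
# Hypothetical numbers for illustration only; these are NOT Perry estimates.
# Each channel maps to (program's shift in the skill, payoff of the skill
# in the adult outcome). Assumes a simple additive, linear structure.
channels = {
    "cognitive": (0.125, 2.0),
    "non_cognitive": (0.5, 3.0),
}


def channel_contributions(channels):
    """Contribution of each channel = shift in skill * payoff to that skill."""
    return {name: shift * payoff for name, (shift, payoff) in channels.items()}


def total_effect(channels):
    """Total effect under the additive assumption: sum of channel contributions."""
    return sum(channel_contributions(channels).values())


original = total_effect(channels)

# A hypothetical replication that keeps the curriculum (same cognitive shift)
# but drops whatever was moving non-cognitive skills.
replication = dict(channels, non_cognitive=(0.0, 3.0))
replicated = total_effect(replication)

print(channel_contributions(channels))
print("original:", original, "replication:", replicated)
```

Under these made-up numbers, the replication keeps only a small fraction of the original effect even though the curriculum is unchanged. The experiment alone reports the total; only the decomposition tells you which component a new program has to preserve.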

If an experiment as strong as Perry still isn’t enough to answer the policy question without understanding mechanisms, the case is even stronger when no experiment exists at all.

A different starting point

This framework rests on two ideas that distinguish it from what we’ve covered so far.

First, causality is fundamentally about thought experiments, not statistical experiments. This idea traces back to Ragnar Frisch and to Trygve Haavelmo, whose seminal 1943 and 1944 papers gave the first rigorous treatment of causality in econometrics. A causal question (“what would happen if we changed this policy?”) is defined by a hypothetical model of how the world works. The thought experiment comes first. Data and identification strategies are tools for disciplining that thought experiment, but they don’t define the causal question itself. This is why the framework can reason about policies that have never been tried: you don’t need data from an experiment if you have a credible model of the mechanisms.

Second, people aren’t passive recipients of treatment; they’re decision-makers. This framework treats agents as making choices. People decide whether to pursue education. Parents decide whether to enrol children in preschool. These decisions are based on expected costs and benefits that vary across individuals. This is fundamentally different from Rubin, where treatment assignment is something that happens to people, and from Pearl, where the focus is on the causal structure of the world rather than the decision-making of agents within it.

This matters because the decision-making process itself generates the variation in who benefits and who doesn’t. When you ignore it, you treat a program like a pill: administer it and measure the average effect. When you model it, you understand why some people take up the program, why the effect differs across participants, and what would change if you redesigned the program’s eligibility or delivery. That’s the difference between an evaluation that says “it worked” and one that tells you enough to design something better.
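
A toy simulation can show what modelling the decision buys you. The sketch below assumes the simplest possible choice rule: each person knows their own benefit from the program and takes it up only if that benefit exceeds a participation cost. The benefit distribution, the cost values and the function name `simulate` are all hypothetical.

```python
import random


def simulate(cost, n=10_000, seed=0):
    """Simulate take-up when people join only if their benefit exceeds the cost."""
    rng = random.Random(seed)
    # Hypothetical distribution of individual benefits (mean 1, sd 2).
    benefits = [rng.gauss(1.0, 2.0) for _ in range(n)]
    takers = [b for b in benefits if b > cost]
    return {
        "take_up_rate": len(takers) / n,
        "avg_benefit_everyone": sum(benefits) / n,
        "avg_benefit_takers": sum(takers) / len(takers),
    }


# Same population, two hypothetical program designs.
high_cost = simulate(cost=2.0)  # e.g. hard to enrol
low_cost = simulate(cost=0.0)   # e.g. enrolment made easy

for label, result in [("high cost", high_cost), ("low cost", low_cost)]:
    print(label, {k: round(v, 2) for k, v in result.items()})
```

Because people select in on their own gains, the average benefit among participants sits above the population average, and lowering the cost draws in people with smaller gains. An evaluation of the high-cost design would therefore overstate what an expanded, easier-to-access version delivers per participant, which is exactly what the pill view of a program misses.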

If this sounds abstract, it may be closer to your work than you think. The structural models underlying this framework aren’t alien to government; they are close relatives of behavioural microsimulation models. If your agency builds or uses behavioural microsimulations to forecast policy impacts, you’re already working within a tradition that takes mechanisms and individual decision-making seriously. The frameworks in this series give you the causal foundations that those models rest on, and the language to interrogate whether the assumptions built into them are defensible.

Why this matters for your work

Your teams are regularly asked questions that the previous two frameworks alone can’t answer. “Should we expand this program to a new population?” “What would happen if we changed the eligibility criteria?” “If we doubled the funding, would we double the impact?”

These are questions about policies that haven’t been tried. You can’t estimate a counterfactual from data that doesn’t exist. What you can do is build a model of why your existing programs work, through what channels, for whom, and under what conditions, and use that structural understanding to make informed predictions about new policies.

Rather than treating a program as a black box that produces an average effect, this approach argues you should open the box and understand the mechanisms inside it. The average effect tells you “it worked.” The mechanisms tell you why it worked, which is what you need to know if you want to predict whether it will work somewhere else, for someone else, under different conditions.

Without this, you’re in the position of a doctor who knows that a drug reduced symptoms in a clinical trial but has no idea how it works pharmacologically. They can prescribe it to the same patient population, but they can’t predict what will happen if they change the dose, combine it with another drug, or give it to a different population. The mechanism is what makes generalisation possible.

The test for your own work

Next time you review an evaluation that says “the program improved outcomes by X%,” ask three questions:

Through what channels? What are the mechanisms connecting the program to the outcome? Can you draw them? If you can’t articulate the mechanisms, you can’t predict what happens when you change the program or the population.

For whom? Is the average effect masking important differences? A program that works well for one subgroup and harms another can produce a positive ATE that’s misleading for both.

Under what conditions? What features of the context (the economy, the institutional environment, the specific implementation) are essential to the result? Would the effect survive a change in any of these?
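
The “for whom” question is easy to check with arithmetic. The sketch below uses two hypothetical subgroups with made-up shares and effects to show how a positive average can hide a group that is harmed.

```python
# Hypothetical subgroups: (share of participants, effect on the outcome).
subgroups = {
    "group_a": (0.75, 8.0),   # most participants benefit
    "group_b": (0.25, -4.0),  # a minority is harmed
}

# The ATE is the share-weighted average of the subgroup effects.
ate = sum(share * effect for share, effect in subgroups.values())

print("ATE:", ate)
for name, (share, effect) in subgroups.items():
    print(f"{name}: share={share:.0%}, effect={effect:+.1f}")
```

Here the ATE is positive, yet it understates the gain for the first group and has the wrong sign for the second. Change the population mix and the same program produces a different average, with no change in how it works.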

If the evaluation can’t answer these questions, it tells you whether the program worked there and then. It doesn’t tell you whether it will work here and next.

Pushback and disagreements welcome. That’s the point. Next time we will wrap things up.