Introduction to Causality — Why Correlation Is Not Enough
ML is exceptionally good at prediction. It is mostly useless for answering 'what if' questions. Here is the math that explains why, and the framework that fixes it.
There is a running joke in data science circles that the field is 80% statistics dressed up in a hoodie. That is not entirely wrong — but the more important observation is that within statistics itself, there is a crucial distinction almost every introductory course glosses over. The distinction between association and causation. Not as a vague warning label on correlation plots, but as a precise mathematical claim.
This post lays out the foundational framework for causal inference: the potential outcomes model. If you work in marketing analytics, product, or economics and you are tired of dashboards that describe what happened without telling you what to do, this is the theoretical bedrock you need.
The AI Hype Problem
Picture a pint of beer. The beer is statistics — the actual substance, dense and real, developed over a century of careful work. Then there is foam on top: light, voluminous, looks impressive until it collapses. Most of the current AI hype about "machine learning for decision-making" is foam.
The underlying issue is that modern ML — gradient boosting, neural networks, transformer-based models — is extraordinarily good at one thing: prediction. Given a feature vector $x$, find a function $\hat{f}(x)$ that minimizes some loss over historical data. That is the whole game.
Prediction is valuable. But prediction is fundamentally backward-looking. It learns patterns that existed in the data you collected under the conditions that existed when you collected it. The moment you intervene — change a price, launch a campaign, ship a feature — you are creating a new condition the model has never seen. And that is precisely when you need answers most.
The Hotel Price Trap
Here is a concrete example that should make you deeply skeptical of naive ML in business settings.
You train a model on hotel booking data. The model learns that high prices correlate with high occupancy. Makes sense historically: peak season brings both full hotels and premium rates. A credulous data team might look at this and say, "The model says raising prices increases bookings. Let's raise prices."
This is how you go out of business.
The model captured a correlation driven by a hidden third variable: time of year. Peak season causes both high prices and high demand. There is no causal arrow from price to demand running in that direction — in fact, demand elasticity runs the other way. The model is not wrong about the pattern. It is completely wrong about the mechanism.
This is not a bug in the ML model. It is a fundamental limitation of the associational framework. To answer "what happens if we raise prices?" you need a different set of tools entirely.
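The hotel trap is easy to reproduce in a hypothetical simulation. The data-generating process below is invented for illustration: season drives both price and demand, the true causal effect of price on bookings is negative, and yet a naive regression of bookings on price alone learns a positive slope. Controlling for the confounder recovers the truth — but only because we know, in this toy world, what the confounder is.

```python
import numpy as np

# Hypothetical data-generating process: peak season (hidden confounder)
# raises both price and demand; the true price effect on bookings is -0.5.
rng = np.random.default_rng(0)
n = 5_000

peak_season = rng.binomial(1, 0.5, n)
price = 100 + 80 * peak_season + rng.normal(0, 10, n)
bookings = 200 - 0.5 * price + 120 * peak_season + rng.normal(0, 10, n)

# Naive regression of bookings on price alone — what a pattern-matching
# model "learns" from historical data
naive_slope = np.polyfit(price, bookings, 1)[0]

# Regression that also conditions on season recovers the causal slope
X = np.column_stack([np.ones(n), price, peak_season])
_, adj_slope, _ = np.linalg.lstsq(X, bookings, rcond=None)[0]

print(f"naive slope:    {naive_slope:+.2f}")  # positive: "raise prices!"
print(f"adjusted slope: {adj_slope:+.2f}")    # close to the true -0.5
```

The naive slope is positive purely because of the confounder — exactly the pattern a credulous team would misread as "higher prices cause more bookings."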
| Question type | Example | Right tool |
|---|---|---|
| Predictive | "Which users are most likely to churn?" | ML, regression |
| Causal | "Does this discount cause retention?" | Causal inference |
| Predictive | "What will revenue be next quarter?" | Time series forecasting |
| Causal | "What did this campaign do to revenue?" | Difference-in-differences, synthetic control |
| Predictive | "Who will click this ad?" | Propensity scoring |
| Causal | "Does showing this ad increase purchases?" | Randomized experiment, IV |
The left column is where ML excels. The right column is where it fails silently — and where most business decisions actually live.
Potential Outcomes — The Framework
The formal foundation for causal inference is the potential outcomes framework, developed by Donald Rubin and building on earlier work by Jerzy Neyman. The central idea is deceptively simple: for every unit, there exists a potential outcome under each possible treatment condition, regardless of which condition that unit actually receives.
Let's define the setup precisely.
For unit $i$:
- $T_i$ is the treatment indicator — 1 if treated, 0 if not
- $Y_i$ is the observed outcome
- $Y_{1i}$ is the potential outcome under treatment — what would happen if unit $i$ received treatment
- $Y_{0i}$ is the potential outcome under control — what would happen if unit $i$ did not receive treatment
The observed outcome ties these together via the consistency assumption:

$$Y_i = T_i Y_{1i} + (1 - T_i) Y_{0i}$$
In plain English: you observe the potential outcome corresponding to the treatment you actually received. The other one is counterfactual — it never happened in this universe.
The Individual Treatment Effect
Given this notation, the causal effect for unit $i$ is simply the difference in their two potential outcomes:

$$\tau_i = Y_{1i} - Y_{0i}$$
This is the individual treatment effect (ITE). It is the answer to the question: "What was the specific effect of this treatment, for this person, at this time?"
Now here is the problem. And it is not a data quality problem or a sample size problem. It is a logical impossibility.
You can never observe both $Y_{1i}$ and $Y_{0i}$ for the same unit at the same time. The moment unit $i$ receives treatment, their control potential outcome $Y_{0i}$ becomes unobservable. And vice versa.
This is the fundamental problem of causal inference. It is not solved by bigger data. It is not solved by better models. It is a consequence of time being linear and units existing in one state at a time. Every causal analysis is, at its core, a strategy for imputing the missing counterfactual.
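The fundamental problem is easiest to see in tabular form. In the hypothetical table below we play god and generate both potential outcomes for six units — something no real dataset ever contains — then mask the column that consistency hides from us:

```python
import numpy as np
import pandas as pd

# Hypothetical units where we (impossibly) know both potential outcomes.
rng = np.random.default_rng(1)
df = pd.DataFrame({"y0": rng.normal(10, 2, 6).round(1)})  # outcome under control
df["y1"] = (df["y0"] + 1.5).round(1)                      # outcome under treatment
df["t"] = rng.binomial(1, 0.5, 6)

# Consistency: we observe the potential outcome matching the treatment received
df["y_obs"] = np.where(df["t"] == 1, df["y1"], df["y0"])

# In any real dataset, the counterfactual column is simply missing
df["y0_observable"] = np.where(df["t"] == 0, df["y0"], np.nan)
df["y1_observable"] = np.where(df["t"] == 1, df["y1"], np.nan)
print(df)
```

Every row has exactly one NaN in the observable columns. Causal inference is, mechanically, the business of filling in those NaNs.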
From Individual Effects to Population Effects
Since individual treatment effects are unidentifiable, we shift our target to averages over a population.
The Average Treatment Effect (ATE) is the expected causal effect across the entire population:

$$ATE = E[Y_1 - Y_0]$$
The Average Treatment Effect on the Treated (ATT) is the expected causal effect among those who actually received treatment:

$$ATT = E[Y_1 - Y_0 \mid T = 1]$$
These are different quantities and the distinction matters. If you run a drug trial and want to know whether the drug works for the general population, you want the ATE. If you want to know whether it worked for the people who took it — whether the treatment was justified for those who received it — you want the ATT.
In marketing, ATT is often the more relevant quantity. You ran a campaign targeting a specific segment. Did it work for them? The ATE would ask whether it would work for everyone, which may not be the question anyone is paying you to answer.
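A hypothetical simulation makes the ATE/ATT gap concrete. Assume (invented for illustration) that unit-level effects are heterogeneous and that treatment is targeted at the units with the largest effects — which is exactly what a well-run campaign tries to do:

```python
import numpy as np

# Hypothetical population with heterogeneous treatment effects, where
# treatment probability rises with the unit's own effect (targeting).
rng = np.random.default_rng(2)
n = 100_000

effect = rng.normal(2.0, 1.0, n)                  # unit-level effect, mean 2
p_treat = 1 / (1 + np.exp(-(effect - 2.0)))       # bigger effect -> more likely treated
t = rng.binomial(1, p_treat)

ate = effect.mean()          # average effect over everyone
att = effect[t == 1].mean()  # average effect among the treated

print(f"ATE: {ate:.2f}")     # close to 2.0
print(f"ATT: {att:.2f}")     # larger, because targeting selected high-effect units
```

Here ATT exceeds ATE: the campaign worked better for the people it reached than it would for the population at large. Reporting one number when the stakeholder's question is about the other is a quiet but consequential error.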
The Bias Decomposition — The Core of Everything
Here is where the mathematics becomes genuinely illuminating. Consider the naive comparison we almost always make in practice: the difference in average outcomes between the treated and untreated groups.
This is what you compute when you look at a dashboard and compare cohorts. This is what gets reported in business reviews. And it is almost never what anyone thinks it is.
Let's decompose it. By consistency, $E[Y \mid T=1] = E[Y_1 \mid T=1]$ and $E[Y \mid T=0] = E[Y_0 \mid T=0]$. Now add and subtract $E[Y_0 \mid T=1]$ — the expected control outcome among the treated group, which is counterfactual and unobserved:

$$E[Y \mid T=1] - E[Y \mid T=0] = \underbrace{E[Y_1 - Y_0 \mid T=1]}_{ATT} + \underbrace{E[Y_0 \mid T=1] - E[Y_0 \mid T=0]}_{\text{bias}}$$
Read this carefully. The simple observed difference equals the ATT plus a bias term. The bias term is the difference in baseline outcomes between the treated and control groups — what the treated group's outcome would have been in the absence of treatment, versus what the control group's outcome actually was.
This is the mathematical proof that association $\neq$ causation. The observed difference is only causal if the bias term is zero.
The School Tablets Example
Let's make this concrete. A school district gives tablets to some schools. A year later, schools with tablets score higher on standardized tests. A naive analyst concludes: tablets cause better test scores.
Apply the decomposition:
- $Y_1$ = test scores with tablets
- $Y_0$ = test scores without tablets
- Treated units = schools that received tablets
The observed difference is ATT + bias. What is the bias term here?
$$E[Y_0 \mid T=1] - E[Y_0 \mid T=0]$$

This is: what would the tablet-receiving schools have scored without tablets, minus what the non-tablet schools actually scored? If wealthier, better-resourced schools are more likely to receive tablets — which they almost certainly are — then the tablet schools would have outperformed anyway. The bias term is large and positive. The observed gap massively overstates the causal effect of tablets.
The tablets might help. Or they might do nothing. You cannot tell from the observational comparison alone. What looks like a 15-point score improvement might be 3 points tablet and 12 points wealth.
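The decomposition can be verified numerically. In the hypothetical simulation below, wealth drives both tablet assignment and baseline scores, and the true tablet effect is fixed at +3 points; the observed difference then splits exactly into ATT plus selection bias:

```python
import numpy as np

# Hypothetical schools: wealth raises baseline scores AND the chance of
# receiving tablets. True causal effect of tablets: +3 points.
rng = np.random.default_rng(3)
n = 10_000

wealth = rng.normal(0, 1, n)
y0 = 70 + 8 * wealth + rng.normal(0, 3, n)           # scores without tablets
y1 = y0 + 3                                          # scores with tablets
t = rng.binomial(1, 1 / (1 + np.exp(-2 * wealth)))   # wealth drives assignment

y = np.where(t == 1, y1, y0)                         # consistency

observed_diff = y[t == 1].mean() - y[t == 0].mean()
att = (y1 - y0)[t == 1].mean()                       # +3 by construction
bias = y0[t == 1].mean() - y0[t == 0].mean()         # baseline (counterfactual) gap

print(f"observed difference: {observed_diff:.1f}")
print(f"ATT:                 {att:.1f}")
print(f"selection bias:      {bias:.1f}")
```

The observed gap is several times the true effect, and the identity `observed_diff == att + bias` holds to floating-point precision — the decomposition is algebra, not approximation.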
When Does Association Equal Causation?
The bias term vanishes when:

$$E[Y_0 \mid T=1] = E[Y_0 \mid T=0]$$
That is, when the treated and control groups have the same expected baseline outcomes. When, in the counterfactual world where nobody received treatment, the two groups would have performed the same.
This condition is called ignorability or unconfoundedness. It is the key assumption underlying every causal identification strategy, in some form.
The cleanest way to guarantee it is random assignment. If treatment is assigned by a fair coin flip, then by construction:

$$(Y_0, Y_1) \perp T$$
The potential outcomes are independent of treatment assignment. Rich schools and poor schools get tablets with equal probability. The groups are comparable. The bias term is zero in expectation.
Under randomization, ATT equals ATE (by the same independence argument), and the naive comparison is the causal effect. This is why randomized controlled trials are the gold standard. Not for philosophical reasons — for mathematical ones.
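To see randomization do its work, take a hypothetical population where wealth drives baseline scores (invented for illustration, with a true treatment effect of +3) but assign treatment by coin flip. The baseline gap between groups collapses toward zero, and the naive comparison recovers the causal effect:

```python
import numpy as np

# Hypothetical schools with wealth-driven baselines, but treatment
# assigned by a fair coin flip instead of by wealth.
rng = np.random.default_rng(4)
n = 100_000

wealth = rng.normal(0, 1, n)
y0 = 70 + 8 * wealth + rng.normal(0, 3, n)
y1 = y0 + 3                                  # true effect: +3 points
t = rng.binomial(1, 0.5, n)                  # random assignment

y = np.where(t == 1, y1, y0)
observed_diff = y[t == 1].mean() - y[t == 0].mean()
bias = y0[t == 1].mean() - y0[t == 0].mean()

print(f"observed difference: {observed_diff:.2f}")  # close to the true +3
print(f"selection bias:      {bias:.2f}")           # close to 0
```

Same confounded world, different assignment mechanism — and the naive difference becomes a valid causal estimate, which is the entire argument for experiments in one line of arithmetic.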
When randomization is not possible — which is most of the time in industry, policy, and economics — the entire field of causal inference is a collection of strategies for recovering the missing comparability through design or modeling: matching, instrumental variables, regression discontinuity, difference-in-differences, synthetic control. Each method is a different argument for why the bias term is approximately zero in a specific setting.
Key Takeaways
1. Prediction and causation are different problems. ML solves prediction. Causal inference solves intervention. Conflating them leads to decisions that look data-driven and are actually just dressed-up intuition.
2. The fundamental problem is unobservability, not data scarcity. You will never observe both potential outcomes for the same unit. Every causal estimate is an imputation strategy.
3. The bias decomposition is the master equation. Observed difference = ATT + selection bias. You are always computing this difference. The question is whether you have a reason to believe the bias is small.
4. Random assignment eliminates selection bias by construction. This is the only assumption-free path to causal identification. Every other method requires assumptions about the world that cannot be fully tested.
5. The right question changes what you should build. "Which users will churn?" is a prediction question. "Does this intervention reduce churn?" is a causal question. They require different designs, different methods, and different interpretations. Most analytics cultures are fluent in the first and illiterate in the second.
The potential outcomes framework is the starting point, not the destination. The next layer involves identification strategies — how to credibly claim your bias term is zero without running an experiment. That means diving into instrumental variables, regression discontinuity designs, and the assumptions that make each one work (or fail). But none of that scaffolding makes sense without this foundation: the precise statement of what we are trying to estimate, and exactly why the naive approach fails.