11  Causal Inference

“Correlation is not causation” is the most frequently cited statistical warning in business analytics — and the most frequently ignored.

Cities with more fire stations tend to have more fires. Ice cream sales and drowning rates are correlated. Users who see a premium ad are more likely to buy — but was the ad effective, or did the algorithm show it to people who were already going to buy?

Correlation tells us that two things move together. Causation tells us that changing one thing will cause the other to change. Most business decisions require causation, not correlation. And most observational analyses — even sophisticated ones — are not equipped to deliver it.

In this chapter we look at the fundamental problem of causal inference, the graphical language of causal models (DAGs), why aggregated data can reverse apparent relationships (Simpson’s Paradox), and the main methods for estimating causal effects from observational data: Difference-in-Differences, Regression Discontinuity, Propensity Score Matching, and Double ML.

Code
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
np.random.seed(42)
print("Libraries loaded.")

11.1 The Fundamental Problem

The reason causal inference is hard is simple: we can never observe the same unit in two states simultaneously. We either send a customer an email or we do not. We can never know what would have happened to that specific customer had we done the opposite. This unobserved outcome is called the counterfactual.

In the Potential Outcomes Framework (also called the Rubin Causal Model), we define \(Y_i(1)\) as the outcome for unit \(i\) under treatment and \(Y_i(0)\) as the outcome under no treatment. The individual treatment effect \(\tau_i = Y_i(1) - Y_i(0)\) is never observable. The quantity we can estimate from data is the Average Treatment Effect:

\[\tau = E[Y(1) - Y(0)]\]

When treatment is randomly assigned, the two groups are identical on average in all characteristics — observed and unobserved. Any difference in outcomes can then be attributed to the treatment. When treatment is not randomly assigned, the people who received it differ systematically from those who did not, and comparing outcomes conflates the treatment effect with those pre-existing differences.
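A small simulation makes the framework concrete. This is a hypothetical illustration (not one of the chapter's examples): we generate both potential outcomes for every unit, so the true effect is known, then compare a randomized assignment with a confounded one.

```python
# Hypothetical illustration: potential outcomes under random vs. confounded
# assignment. We observe both Y(0) and Y(1) only because we simulated them.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
ability = rng.normal(0, 1, n)                  # unobserved characteristic
y0 = 50 + 5*ability + rng.normal(0, 2, n)      # outcome without treatment
y1 = y0 + 3                                    # true individual effect = 3

# Randomized assignment: difference in means recovers the ATE.
t_rand = rng.binomial(1, 0.5, n)
ate_rand = y1[t_rand == 1].mean() - y0[t_rand == 0].mean()

# Confounded assignment: high-ability units select into treatment.
t_conf = (ability + rng.normal(0, 1, n) > 0).astype(int)
ate_conf = y1[t_conf == 1].mean() - y0[t_conf == 0].mean()

print('True ATE: 3.00')
print(f'Randomized estimate: {ate_rand:.2f}')  # close to 3
print(f'Confounded estimate: {ate_conf:.2f}')  # biased upward by selection
```

The confounded comparison does not fail because of noise — at this sample size the randomized estimate is essentially exact — but because the treated group had higher ability to begin with.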

11.2 Directed Acyclic Graphs

A Directed Acyclic Graph (DAG) is a visual tool for representing causal assumptions. Nodes are variables; directed arrows represent causal relationships. Getting the DAG right is not a statistical question — it requires domain knowledge.

Three patterns matter most.

A confounder is a common cause of both \(X\) and \(Y\): Z → X and Z → Y. \(Z\) creates a spurious correlation between \(X\) and \(Y\). We must control for \(Z\) to estimate the true \(X \to Y\) effect. Example: socioeconomic status causes both exercise and health. Not controlling for it overstates the benefit of exercise.

A mediator lies on the causal path: X → M → Y. Controlling for \(M\) blocks the effect we are trying to measure. Example: a drug lowers blood pressure, which reduces heart attacks. Controlling for blood pressure removes the very effect we want.

A collider is caused by both \(X\) and \(Y\): X → Z ← Y. Conditioning on a collider creates a spurious correlation between \(X\) and \(Y\). Example: analyzing only hospitalized patients creates associations between diseases that do not exist in the general population.

The rule is: control for confounders, but do not control for mediators or colliders. Getting this wrong produces biased estimates regardless of sample size.
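Collider bias in particular is easy to demonstrate with a simulation. The example below is hypothetical: talent and looks are generated as independent causes of fame, yet restricting the analysis to famous people induces a correlation between them.

```python
# Hypothetical collider sketch: conditioning on a common effect (fame)
# creates a spurious correlation between two independent causes.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
talent = rng.normal(0, 1, n)                              # X
looks  = rng.normal(0, 1, n)                              # Y, independent of X
famous = talent + looks + rng.normal(0, 0.5, n) > 1.5     # collider Z

corr_all = np.corrcoef(talent, looks)[0, 1]
corr_sel = np.corrcoef(talent[famous], looks[famous])[0, 1]
print(f'Full population corr:  {corr_all:+.3f}')   # near zero
print(f'Among the famous only: {corr_sel:+.3f}')   # negative
```

Among the famous, knowing someone has low talent makes high looks more likely — they had to get famous somehow. No amount of data fixes this; only not conditioning on the collider does.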

11.3 Simpson’s Paradox

Simpson’s Paradox occurs when a trend that appears in aggregated data reverses when the data is broken into subgroups. It is one of the most compelling illustrations of why aggregate statistics can mislead.

The classic example is the UC Berkeley graduate admissions study from 1973. Aggregate data showed men were admitted at a higher rate than women (44% vs 35%). But examined department by department, women were admitted at a higher or equal rate in most departments. The explanation: women disproportionately applied to highly competitive departments — the very factor that creates the paradox.

Let us construct a business example. Two salespeople. Aggregated, Salesperson A looks stronger. But A handles mostly small deals, which are easier to close. B handles mostly large deals. Within each deal size, B wins.

Code
# Simpson's Paradox: sales performance
np.random.seed(42)

data = pd.DataFrame({
    'salesperson': ['A']*100 + ['B']*100,
    'deal_size':   (['small']*80 + ['large']*20) + (['small']*20 + ['large']*80),
    'closed':      (np.random.binomial(1, 0.70, 80).tolist() +
                    np.random.binomial(1, 0.40, 20).tolist() +
                    np.random.binomial(1, 0.80, 20).tolist() +
                    np.random.binomial(1, 0.55, 80).tolist())
})

print('Aggregate close rate (misleading):')
print(data.groupby('salesperson')['closed'].mean().rename('close_rate').round(3))
print()
print('Close rate by deal size (true picture):')
print(data.groupby(['salesperson','deal_size'])['closed'].mean().round(3))
print()
print('Salesperson B outperforms A in BOTH categories.')
print('The confounder (deal size) reverses the apparent result in the aggregate.')

11.4 Randomized Controlled Trials

The cleanest solution to the causal inference problem is to randomly assign units to treatment and control. Randomization ensures that, on average, the two groups are identical in all characteristics — both observed and unobserved. Any difference in outcomes can therefore be attributed to the treatment.

In many business and policy settings, randomization is expensive, unethical, or impossible. We cannot randomly assign people to smoke to study lung cancer. We cannot go back in time and re-randomize historical data. We may not be able to run a full experiment before a decision must be made.

For these situations, observational study methods aim to approximate the conditions of a randomized experiment using data where assignment was not random. Each method requires a different identification assumption, and that assumption cannot always be tested — it must be argued on substantive grounds.

11.5 Difference-in-Differences

Difference-in-Differences (DiD) compares the change in outcomes for a treated group to the change for a control group over the same time period. The difference between these two changes is the estimated treatment effect:

\[\hat{\tau}_{DiD} = (\bar{Y}_{\text{treated, after}} - \bar{Y}_{\text{treated, before}}) - (\bar{Y}_{\text{control, after}} - \bar{Y}_{\text{control, before}})\]

The key assumption is parallel trends: in the absence of treatment, the treated group would have followed the same trend as the control group. If the two groups were already diverging before the treatment, the DiD estimate is biased.

Card and Krueger (1994) used DiD to study the effect of a minimum wage increase in New Jersey by comparing fast-food employment trends with neighboring Pennsylvania. They found no negative employment effect, a famous result that reshaped the debate around minimum wage policy.

Code
# DiD: loyalty program rolled out in 12 of 24 cities
np.random.seed(42)
n_cities = 12

sales_before_ctrl    = np.random.normal(100, 10, n_cities)
sales_before_treated = np.random.normal(102, 10, n_cities)

program_effect = 8
sales_after_ctrl    = sales_before_ctrl    + np.random.normal(5, 3, n_cities)
sales_after_treated = sales_before_treated + np.random.normal(5, 3, n_cities) + program_effect

did = ((sales_after_treated - sales_before_treated).mean() -
       (sales_after_ctrl    - sales_before_ctrl).mean())

print(f'Control:  before={sales_before_ctrl.mean():.1f}, after={sales_after_ctrl.mean():.1f}, change={(sales_after_ctrl-sales_before_ctrl).mean():.1f}')
print(f'Treated:  before={sales_before_treated.mean():.1f}, after={sales_after_treated.mean():.1f}, change={(sales_after_treated-sales_before_treated).mean():.1f}')
print()
print(f'DiD estimate: {did:.2f}  (true effect = {program_effect})')

fig, ax = plt.subplots(figsize=(7, 4))
for i in range(n_cities):
    ax.plot([0,1],[sales_before_ctrl[i], sales_after_ctrl[i]],       color='steelblue', alpha=0.3)
    ax.plot([0,1],[sales_before_treated[i], sales_after_treated[i]], color='darkorange', alpha=0.3)
ax.plot([0,1],[sales_before_ctrl.mean(), sales_after_ctrl.mean()],       color='steelblue', lw=3, label='Control')
ax.plot([0,1],[sales_before_treated.mean(), sales_after_treated.mean()], color='darkorange', lw=3, label='Treated')
ax.set_xticks([0,1]); ax.set_xticklabels(['Before','After'])
ax.set_ylabel('Sales Index')
ax.set_title(f'Difference-in-Differences  (estimated effect = {did:.1f})')
ax.legend(); plt.tight_layout(); plt.show()
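A standard robustness check for any DiD analysis is a placebo test: reshuffle the treated/control labels at random and recompute the estimate many times. The placebo estimates should center on zero; the real estimate should sit far out in their distribution. A sketch using the same data-generating process as the example above (variable names abbreviated):

```python
# Placebo test for DiD: random relabeling of cities should yield
# estimates centered on zero, unlike the real treatment assignment.
import numpy as np

rng = np.random.default_rng(7)
n_cities = 12
before_ctrl    = rng.normal(100, 10, n_cities)
before_treated = rng.normal(102, 10, n_cities)
after_ctrl     = before_ctrl    + rng.normal(5, 3, n_cities)
after_treated  = before_treated + rng.normal(5, 3, n_cities) + 8  # true effect = 8

change = np.concatenate([after_ctrl - before_ctrl,
                         after_treated - before_treated])

placebo = []
for _ in range(2000):
    fake = rng.permutation(np.r_[np.zeros(n_cities), np.ones(n_cities)]).astype(bool)
    placebo.append(change[fake].mean() - change[~fake].mean())

print(f'Mean placebo DiD: {np.mean(placebo):+.2f}  (should be near 0)')
print(f'Placebo spread:   {np.std(placebo):.2f}')
```

With only two time periods the parallel-trends assumption itself cannot be tested; when more pre-treatment periods are available, plotting the group trends before the intervention is the first diagnostic to run.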

11.6 Regression Discontinuity

Regression Discontinuity (RD) exploits situations where treatment is assigned based on a threshold in some running variable — a credit score, an age cutoff, an exam score. Units just above and just below the threshold are nearly identical in every way except whether they crossed it. Assignment near the threshold is therefore as good as random, and any jump in outcomes at the threshold can be attributed to the treatment.

A bank that approves loans to applicants with credit score \(\geq 700\) provides a natural experiment: comparing default rates for borrowers just above and below 700 estimates the causal effect of loan approval. A loyalty program that grants gold status at $1,000 annual spend can similarly be evaluated by comparing retention just above and below the threshold.

Sharp RD assigns treatment deterministically at the threshold. Fuzzy RD allows some non-compliance — crossing the threshold changes the probability of treatment but does not guarantee it.

Code
# Regression Discontinuity: credit score threshold for loan approval
np.random.seed(0)
n = 500
score = np.random.uniform(600, 800, n)
threshold = 700
treatment = (score >= threshold).astype(float)
true_effect = 0.10
repayment = 0.40 + 0.0005*(score-600) + true_effect*treatment + np.random.normal(0, 0.08, n)
repayment = np.clip(repayment, 0, 1)

bandwidth = 30
mask = np.abs(score - threshold) <= bandwidth
df = pd.DataFrame({'score': score[mask], 'repay': repayment[mask], 'treated': treatment[mask]})
left  = df[df.treated==0]; right = df[df.treated==1]
cl = np.polyfit(left.score, left.repay, 1); cr = np.polyfit(right.score, right.repay, 1)
jump = np.polyval(cr, threshold) - np.polyval(cl, threshold)

print(f'Estimated treatment effect at threshold: {jump:.3f}  (true = {true_effect:.3f})')

xl = np.linspace(left.score.min(), threshold, 100)
xr = np.linspace(threshold, right.score.max(), 100)
fig, ax = plt.subplots(figsize=(9, 5))
ax.scatter(df[df.treated==0].score, df[df.treated==0].repay, alpha=0.4, color='steelblue', s=15)
ax.scatter(df[df.treated==1].score, df[df.treated==1].repay, alpha=0.4, color='darkorange', s=15)
ax.plot(xl, np.polyval(cl, xl), color='steelblue',  lw=2.5, label='Not approved')
ax.plot(xr, np.polyval(cr, xr), color='darkorange', lw=2.5, label='Approved')
ax.axvline(threshold, color='black', linestyle='--', lw=1.5)
ax.set_xlabel('Credit Score'); ax.set_ylabel('Repayment Rate')
ax.set_title('Regression Discontinuity: Loan Approval at Score 700')
ax.legend(); plt.tight_layout(); plt.show()
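An RD estimate should not hinge on one arbitrary bandwidth. A common robustness check is to re-estimate the jump across several bandwidths and confirm it is stable — a sketch using the same data-generating process as above, with a larger sample for stability:

```python
# Bandwidth sensitivity check for the RD example: a real discontinuity
# should produce a similar jump estimate across reasonable bandwidths.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
score = rng.uniform(600, 800, n)
threshold = 700
t = score >= threshold
repay = 0.40 + 0.0005*(score - 600) + 0.10*t + rng.normal(0, 0.08, n)

jumps = []
for bw in (10, 20, 30, 40):
    mask = np.abs(score - threshold) <= bw
    s, y, tr = score[mask], repay[mask], t[mask]
    cl = np.polyfit(s[~tr], y[~tr], 1)   # local linear fit, left of cutoff
    cr = np.polyfit(s[tr],  y[tr],  1)   # local linear fit, right of cutoff
    jump = np.polyval(cr, threshold) - np.polyval(cl, threshold)
    jumps.append(jump)
    print(f'bandwidth {bw:2d}: estimated jump = {jump:.3f}')
```

If the estimate swings wildly with the bandwidth, the "discontinuity" is likely curvature in the underlying relationship rather than a treatment effect.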

11.7 Propensity Score Matching

Propensity Score Matching (PSM) matches each treated unit to control units that had a similar probability of receiving treatment given their observed characteristics. The propensity score \(e(X) = P(T=1 \mid X)\) is commonly estimated with logistic regression. Comparing outcomes within matched pairs removes the bias from observed confounders.

The critical limitation is that PSM only controls for observed confounders. If important confounders are unmeasured, the estimate remains biased — unlike an RCT, there is no guarantee of balance on unobservables. Always check balance diagnostics after matching: the covariate distributions in the matched treated and control groups should be visually similar.

Code
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Training program: experienced, high-scoring workers self-select in
np.random.seed(1)
n = 600
age = np.random.normal(35, 8, n).clip(22, 60)
prior_score = np.random.normal(70, 15, n).clip(30, 100)
logit_p = -4 + 0.05*age + 0.04*prior_score
treatment = np.random.binomial(1, 1/(1+np.exp(-logit_p)))
true_effect = 5
outcome = 50 + 0.3*prior_score + 0.1*age + true_effect*treatment + np.random.normal(0,8,n)

df_psm = pd.DataFrame({'age':age,'prior_score':prior_score,'treatment':treatment,'outcome':outcome})
X = StandardScaler().fit_transform(df_psm[['age','prior_score']])
df_psm['pscore'] = LogisticRegression().fit(X, treatment).predict_proba(X)[:,1]

naive_ate = df_psm[df_psm.treatment==1].outcome.mean() - df_psm[df_psm.treatment==0].outcome.mean()
print(f'Naive estimate (biased): {naive_ate:.2f}')

treated = df_psm[df_psm.treatment==1]; untreated = df_psm[df_psm.treatment==0]
matched = [untreated.loc[np.abs(untreated.pscore - ps).idxmin(), 'outcome'] for ps in treated.pscore]
att = treated.outcome.values.mean() - np.mean(matched)
print(f'PSM estimate (ATT):      {att:.2f}  (true effect = {true_effect})')
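The balance diagnostics mentioned above can be made quantitative with standardized mean differences (SMD). This snippet regenerates the same synthetic data as the example and checks balance before and after matching; a common rule of thumb is that |SMD| should fall below about 0.1 after matching.

```python
# Balance check for the matching example: standardized mean differences
# on each covariate, before vs. after nearest-neighbor matching.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

np.random.seed(1)
n = 600
age = np.random.normal(35, 8, n).clip(22, 60)
prior_score = np.random.normal(70, 15, n).clip(30, 100)
logit_p = -4 + 0.05*age + 0.04*prior_score
treatment = np.random.binomial(1, 1/(1 + np.exp(-logit_p)))

df = pd.DataFrame({'age': age, 'prior_score': prior_score, 'treatment': treatment})
X = StandardScaler().fit_transform(df[['age', 'prior_score']])
df['pscore'] = LogisticRegression().fit(X, treatment).predict_proba(X)[:, 1]

treated = df[df.treatment == 1]
control = df[df.treatment == 0]
match_idx = [np.abs(control.pscore - ps).idxmin() for ps in treated.pscore]
matched_ctrl = control.loc[match_idx]       # matched with replacement

def smd(a, b):
    pooled = np.sqrt((a.std()**2 + b.std()**2) / 2)
    return (a.mean() - b.mean()) / pooled

before_age = smd(treated['age'], control['age'])
after_age  = smd(treated['age'], matched_ctrl['age'])
before_ps  = smd(treated['prior_score'], control['prior_score'])
after_ps   = smd(treated['prior_score'], matched_ctrl['prior_score'])
print(f'age         SMD before: {before_age:+.3f}  after: {after_age:+.3f}')
print(f'prior_score SMD before: {before_ps:+.3f}  after: {after_ps:+.3f}')
```

If matching fails to shrink the SMDs, the treated and control groups have too little overlap in their propensity scores, and no weighting scheme will rescue the comparison.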

11.8 Double Machine Learning

PSM and standard regression work well when there are few confounders. When dozens of features could be confounders, these methods become unreliable. Double Machine Learning (Double ML), introduced by Chernozhukov et al. (2018), handles high-dimensional controls using any machine learning algorithm.

The procedure residualizes both the outcome and the treatment on the controls:

1. Fit an ML model predicting \(Y\) from controls \(X\) (not treatment). Compute residuals \(\tilde{Y}\).
2. Fit an ML model predicting \(T\) from controls \(X\). Compute residuals \(\tilde{T}\).
3. Regress \(\tilde{Y}\) on \(\tilde{T}\). The coefficient is the causal effect estimate.

Cross-fitting (train on one fold, predict on another) avoids overfitting bias. The econml library implements this correctly:

from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

model = LinearDML(
    model_y=GradientBoostingRegressor(),
    model_t=GradientBoostingClassifier(),
    discrete_treatment=True,  # treatment is binary, so model_t is a classifier
    cv=5
)
model.fit(Y=y, T=t, X=x_for_heterogeneity, W=w_controls)
print(model.ate(x_for_heterogeneity))

Install: pip install econml
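The three-step procedure can also be sketched by hand with scikit-learn on synthetic data. This is a minimal illustration of the mechanics — econml adds valid standard errors, treatment-effect heterogeneity, and more careful cross-fitting — using a continuous treatment so the final stage is a simple residual-on-residual regression:

```python
# Hand-rolled Double ML sketch: residualize Y and T on ten controls
# with cross-fitting, then regress residual on residual.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n = 3000
X = rng.normal(0, 1, (n, 10))                       # ten potential confounders
t = X[:, 0] + 0.5*X[:, 1] + rng.normal(0, 1, n)     # confounded treatment
true_effect = 2.0
y = true_effect*t + X[:, 0]**2 + X[:, 1] + rng.normal(0, 1, n)

# Cross-fitting: nuisance models predict on folds they were not trained on.
y_res = np.zeros(n)
t_res = np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    my = GradientBoostingRegressor().fit(X[train], y[train])
    mt = GradientBoostingRegressor().fit(X[train], t[train])
    y_res[test] = y[test] - my.predict(X[test])
    t_res[test] = t[test] - mt.predict(X[test])

# Final stage: slope of residualized Y on residualized T.
theta = (t_res @ y_res) / (t_res @ t_res)
print(f'Double ML estimate: {theta:.2f}  (true effect = {true_effect})')
```

The nuisance models can be anything — gradient boosting here, but random forests or neural networks work equally well — because the final-stage regression is orthogonal to small errors in them.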

11.9 Instrumental Variables

Sometimes there are unmeasured confounders that no matching or ML method can address. Instrumental Variables (IV) offer a different route: find a variable \(Z\) that causes the treatment but has no direct effect on the outcome (other than through treatment).

Distance to college as an instrument for years of education is the classic example. Growing up near a college makes you more likely to attend (relevant to treatment), but proximity to a college does not directly affect your earnings — it only does so through the education it facilitates (exclusion restriction).

Finding a valid instrument requires domain knowledge and is often the hardest part. The linearmodels library implements two-stage least squares:

from linearmodels.iv import IV2SLS
result = IV2SLS.from_formula(
    'outcome ~ 1 + controls + [treatment ~ instrument]', data=df
).fit()
print(result.summary)

Install: pip install linearmodels
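The mechanics of two-stage least squares can be sketched by hand on synthetic data — an illustration only, since linearmodels provides correct standard errors and weak-instrument diagnostics:

```python
# Hand-rolled 2SLS: stage 1 predicts treatment from the instrument,
# stage 2 regresses the outcome on the predicted treatment.
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
u = rng.normal(0, 1, n)                     # unmeasured confounder
z = rng.normal(0, 1, n)                     # instrument: affects t, not y directly
t = 0.8*z + u + rng.normal(0, 1, n)         # treatment, confounded by u
true_effect = 1.5
y = true_effect*t + 2*u + rng.normal(0, 1, n)

# OLS is biased because u drives both t and y.
ols = np.cov(t, y)[0, 1] / np.var(t)

# Stage 1: fitted values of t given z.  Stage 2: regress y on them.
t_hat = np.polyval(np.polyfit(z, t, 1), z)
iv = np.cov(t_hat, y)[0, 1] / np.var(t_hat)

print(f'OLS estimate: {ols:.2f}  (biased upward by the confounder)')
print(f'IV  estimate: {iv:.2f}   (true = {true_effect})')
```

The instrument works because \(z\) moves \(t\) but is independent of \(u\); the variation in \(t\) that \(z\) explains is therefore clean of confounding.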

11.10 Choosing a Method

Setting                                           Recommended Method
------------------------------------------------  ----------------------------
We control treatment assignment                   Randomized Controlled Trial
Before/after data with a control group            Difference-in-Differences
Treatment assigned by a threshold                 Regression Discontinuity
Rich observational data, few confounders          Propensity Score Matching
Rich observational data, many confounders         Double ML (EconML)
Unmeasured confounders, valid instrument exists   Instrumental Variables

What is not an option is running a regression on observational data and calling the coefficient a causal effect without satisfying the identification assumptions. This is one of the most common analytical errors in business data science.

The DoWhy library (Microsoft Research) enforces good causal hygiene by requiring a four-step process: model the causal graph, identify the estimand, estimate it, then refute the estimate with sensitivity checks. Install: pip install dowhy.

Recommended reading:
- The Book of Why — Pearl and Mackenzie (accessible, conceptual)
- Causal Inference: The Mixtape — Cunningham (free online, practitioner-focused)
- Mostly Harmless Econometrics — Angrist and Pischke (rigorous, applied)