Causation vs. Correlation
Two things happening together doesn't mean one causes the other
The confusion between causation and correlation leads to inaccurate assumptions about how the world works. We notice two things happening at the same time (correlation) and mistakenly conclude that one causes the other (causation). We then act on that erroneous conclusion, making decisions that are successful only by luck rather than by capitalizing on real dynamics.
Correlation is measured by a coefficient between -1 and 1, representing the relative weight of shared factors between two measures. Two phenomena with no shared factors (like bottled water consumption and suicide rate) should have a coefficient near zero. Temperature in Celsius and Fahrenheit has a perfect correlation of 1 because they measure the same underlying factor. Most real-world relationships fall somewhere between, indicating that while one variable has some predictive power over another, other factors are clearly at play.
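To make the coefficient concrete, here is a minimal Python sketch. The `pearson_r` helper, the sample temperatures, and the simulated "shared factor" measures are illustrative assumptions, not data from any study. It shows a coefficient of 1 for two scales of the same quantity, a coefficient near 0.5 when a shared factor and a private factor carry equal weight, and a coefficient near zero when nothing is shared.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Same underlying quantity on two scales: the coefficient is 1.
celsius = [-10.0, 0.0, 12.5, 25.0, 37.0]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
r_temperature = pearson_r(celsius, fahrenheit)

# Hypothetical measures: each is one shared factor plus one private factor
# of equal weight, so the coefficient should land near 0.5.
rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 50_000
shared = [rng.gauss(0, 1) for _ in range(n_samples)]
measure_a = [s + rng.gauss(0, 1) for s in shared]
measure_b = [s + rng.gauss(0, 1) for s in shared]
r_shared = pearson_r(measure_a, measure_b)

# No shared factor at all: the coefficient should land near zero.
unrelated = [rng.gauss(0, 1) for _ in range(n_samples)]
r_unrelated = pearson_r(measure_a, unrelated)

print(f"Celsius vs Fahrenheit: {r_temperature:.3f}")
print(f"Half-shared measures:  {r_shared:.3f}")
print(f"Unrelated measures:    {r_unrelated:.3f}")
```

None of these numbers says anything about cause: the half-shared measures correlate because both depend on the same factor, not because one drives the other.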
A critical complication is regression to the mean: whenever correlation is imperfect, extremes will soften over time. On average, the best will appear to get worse and the worst will appear to get better, regardless of any intervention. As a result, we frequently mistake regression to the mean for the effect of a treatment or policy. A group of depressed children treated with anything (even hugging a cat or standing on their heads) will tend to show improvement, because extreme groups naturally regress toward the average. The only way to distinguish real improvement from regression is through a control group.
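A small simulation can illustrate both points. The model below is an assumption made for illustration only: each observed score is a stable trait plus fresh noise at every measurement, and the "treatment" has no real effect at all. Under those assumptions, the lowest-scoring group still improves on the second measurement, and only the comparison with a control group reveals that the treatment added nothing.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
population = 20_000

# Assumed model: observed score = stable trait + measurement-specific noise.
trait = [rng.gauss(50, 10) for _ in range(population)]
first = [t + rng.gauss(0, 10) for t in trait]
second = [t + rng.gauss(0, 10) for t in trait]  # nothing changes in between


def mean(values):
    return sum(values) / len(values)


# Select the extreme group: the lowest 10% on the first measurement.
cutoff = sorted(first)[population // 10]
extreme = [i for i in range(population) if first[i] < cutoff]

before_all = mean([first[i] for i in extreme])
after_all = mean([second[i] for i in extreme])

# Randomly split the extreme group. The hypothetical treatment (say, hugging
# a cat) is given a true effect of exactly zero in this model.
rng.shuffle(extreme)
half = len(extreme) // 2
treated, control = extreme[:half], extreme[half:]


def average_gain(group):
    return mean([second[i] - first[i] for i in group])


treated_gain = average_gain(treated)
control_gain = average_gain(control)

print(f"Extreme group, first measurement:  {before_all:.1f}")
print(f"Extreme group, second measurement: {after_all:.1f}")
print(f"Gain in treated group: {treated_gain:.1f}")
print(f"Gain in control group: {control_gain:.1f}")
```

Looking at the treated group alone, the gain reads like a treatment effect. Set against the control group, it is the same gain that regression alone produces.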
- Two things happening together (correlation) does not mean one causes the other (causation).
- The correlation between two measures reflects the relative weight of their shared factors, not a causal relationship.
- Whenever correlation is imperfect, extreme values will regress toward the mean over time regardless of intervention.
- Trying to invert a relationship can help determine whether you are dealing with causation or just correlation.
- The only reliable way to distinguish treatment effects from regression to the mean is through a control group.
- Identify the claimed relationship: When presented with a relationship between two variables, clearly state what the claim is. Is it that A causes B, that B causes A, or simply that A and B are observed together? Pro tip: A study showing a relationship between parental alcohol consumption and children's academic success has demonstrated only a correlation, not that one causes the other.
- Try inverting the relationship: Ask whether the reverse could be true. If A appears to cause B, could B actually cause A? Could having kids who do poorly in school cause parents to drink more, rather than the reverse? Pro tip: Inverting the relationship is a quick test for false causation claims.
- Check for regression to the mean: If you are evaluating whether a treatment or intervention worked, ask whether the group being studied is extreme. Extreme groups naturally regress toward the mean over time, regardless of any treatment. Pro tip: Depressed children will get somewhat better over time even if they hug no cats and drink no Red Bull. Warning: Without a control group, it is impossible to determine whether improvement is due to the intervention or simply regression to the mean.
- Look for confounding variables: Consider whether a third variable might explain the observed correlation. Both A and B might be caused by C, creating a correlation between A and B without any direct causal link. Pro tip: Height and weight are correlated, but both are partly caused by underlying genetic and nutritional factors.
- Demand a control group or equivalent: For any claim of causation, ask whether a proper control group was used. The aim of rigorous research is to determine whether the treated group improves more than regression alone can explain. Pro tip: In real-life performance evaluation, where no control group exists, compare against industry averages, peer cohorts, or historical improvement rates. Warning: None of these alternatives is a perfect measure, but they are better than no comparison at all.
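The inversion and confounding checks in the steps above can be sketched in Python. Everything here is a hypothetical setup: variables `a` and `b` never influence each other, and both are driven by a third variable `c`. The sketch shows that the coefficient is identical in both directions, so it cannot say which way any causation runs, and that the correlation largely disappears once `c` is held roughly constant.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(7)  # fixed seed so the run is repeatable
n_samples = 20_000

# Assumed causal structure: c drives both a and b; a and b never interact.
c = [rng.gauss(0, 1) for _ in range(n_samples)]
a = [ci + rng.gauss(0, 1) for ci in c]
b = [ci + rng.gauss(0, 1) for ci in c]

# Inversion: the coefficient is symmetric, so it carries no direction.
r_ab = pearson_r(a, b)
r_ba = pearson_r(b, a)

# Confounding: keep only cases where c is nearly the same, then re-check.
band = [i for i in range(n_samples) if abs(c[i]) < 0.1]
r_within_band = pearson_r([a[i] for i in band], [b[i] for i in band])

print(f"r(a, b) overall:         {r_ab:.3f}")
print(f"r(b, a) overall:         {r_ba:.3f}")
print(f"r(a, b) with c held:     {r_within_band:.3f}")
```

Restricting to a narrow band of `c` is a crude stand-in for controlling a confounder, chosen here for simplicity. It only works when the confounder has been identified and measured, which is one reason a control group remains the stronger test.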
Kahneman offered a hypothetical headline: 'Depressed children treated with energy drink improve significantly over three months.' The headline would be factually accurate, but depressed children treated by standing on their heads or hugging a cat would also show improvement, because extreme groups regress toward the mean.
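A study shows a relationship between high alcohol consumption in parents and low academic success in children. It's tempting to conclude that parental drinking causes poor academic outcomes, but the study has demonstrated only a correlation. The reverse is just as conceivable: having children who do poorly in school could lead parents to drink more.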
Temperature measured in Celsius and Fahrenheit has a perfect correlation coefficient of 1 because both scales measure the same underlying factor (the average kinetic energy of molecular motion). Every value in Celsius has exactly one corresponding value in Fahrenheit, related by a fixed linear conversion.
The statistical foundations of correlation and causation were developed across centuries of probability theory and experimental design. The correlation coefficient was formalized by Karl Pearson in the late nineteenth century. The critical importance of distinguishing correlation from causation became prominent in medical research, epidemiology, and social science throughout the twentieth century.
Daniel Kahneman, in Thinking, Fast and Slow, provided memorable illustrations of how regression to the mean fools us into false causal attributions. His hypothetical headline about depressed children improving with an energy drink demonstrates how any treatment appears to work when applied to an extreme group, because extreme groups regress toward the mean regardless of intervention. The book uses this as a supporting idea for the probabilistic thinking chapter, showing how confusing correlation with causation leads to decisions based on luck rather than genuine understanding.