Causal inference series
This approach serves you poorly because it does not reflect how we expect you to do research
I’d like to briefly review 1-4 and focus mainly on number 5, the “statistical gap”. I hope this helps answer the question “what model should I use?”
If you are interested, a lot of my own research is about 6.
A directed acyclic graph (or DAG or just graph) conveys our assumptions about the mechanisms that gave rise to the observations, e.g.,
This is a functional causal model \(\{F_V:pa(V)\times U_V\to V\mid V\in\mathcal{V}\}\), e.g., \(y = F_Y(x, u, \varepsilon_Y)\).
Note
There are other frameworks for specifying a causal model; this is just one of them (the “Pearl model”, also known as NPSEM-IE).
The other main framework, which has led to many results in the causal literature, is the Neyman/Rubin potential outcomes framework.
In potential outcome notation, \(Y(1)\) denotes the variable \(Y\) if \(X\) were intervened upon to have value 1.
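To make the functional causal model and the idea of intervening concrete, here is a minimal simulation sketch. Everything specific in it is an assumption made for illustration: the binary variables, the functions `f_x` and `f_y`, and the probabilities are hypothetical and do not come from any real study. Reusing the same seed holds the exogenous draws fixed, so the simulated \(Y(1)\) and \(Y(0)\) belong to the same units.

```python
import random

def f_x(u, eps_x):
    # Hypothetical structural equation for X: more likely 1 when U = 1.
    return 1 if eps_x < (0.7 if u == 1 else 0.3) else 0

def f_y(x, u, eps_y):
    # Hypothetical structural equation y = F_Y(x, u, eps_Y).
    return 1 if eps_y < 0.2 + 0.3 * x + 0.4 * u else 0

def simulate(n, do_x=None, seed=0):
    """Draw n units; if do_x is given, replace F_X by the constant do_x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = 1 if rng.random() < 0.5 else 0
        eps_x, eps_y = rng.random(), rng.random()
        x = f_x(u, eps_x) if do_x is None else do_x
        y = f_y(x, u, eps_y)
        rows.append((u, x, y))
    return rows

def mean_y(rows):
    return sum(y for _, _, y in rows) / len(rows)

n = 20000
observed = simulate(n)        # X generated by its own mechanism
y1 = simulate(n, do_x=1)      # the world where X is set to 1: Y(1)
y0 = simulate(n, do_x=0)      # the world where X is set to 0: Y(0)

treated = [row for row in observed if row[1] == 1]
print("P{Y(1) = 1} is roughly", round(mean_y(y1), 3))
print("P{Y(0) = 1} is roughly", round(mean_y(y0), 3))
print("P(Y = 1 | X = 1) is roughly", round(mean_y(treated), 3))
```

In this toy model the observed conditional mean among units with \(X = 1\) differs from the mean of \(Y(1)\), because \(U\) affects both \(X\) and \(Y\).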
The scientific question
Does drinking coffee in childhood stunt growth?
What causal parameter did you choose?
Note
The terms causal target, causal contrast, and causal metric all mean the same thing.
Some options:
\[ \frac{P\{Y(1) = 1\}/P\{Y(1) = 0\}}{P\{Y(0) = 1\}/P\{Y(0) = 0\}} \]
The answer is complicated; see Colnet et al. (2023), “Risk ratio, odds ratio, risk difference… Which causal measure is easier to generalize?”
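As a small illustration of how the candidate parameters relate, the sketch below computes the risk difference, risk ratio, and odds ratio from the two counterfactual risks \(P\{Y(1) = 1\}\) and \(P\{Y(0) = 1\}\). The function name `causal_measures` and the input values are hypothetical, chosen only to show the arithmetic.

```python
def causal_measures(p1, p0):
    """Contrasts of p1 = P{Y(1) = 1} and p0 = P{Y(0) = 1}.

    Both risks must lie strictly between 0 and 1 so the odds are defined.
    """
    if not (0 < p1 < 1 and 0 < p0 < 1):
        raise ValueError("p1 and p0 must be strictly between 0 and 1")
    odds1 = p1 / (1 - p1)   # P{Y(1) = 1} / P{Y(1) = 0}
    odds0 = p0 / (1 - p0)   # P{Y(0) = 1} / P{Y(0) = 0}
    return {
        "risk_difference": p1 - p0,
        "risk_ratio": p1 / p0,
        "odds_ratio": odds1 / odds0,
    }

# Hypothetical counterfactual risks, for illustration only.
measures = causal_measures(0.2, 0.1)
for name, value in measures.items():
    print(name, round(value, 3))
```

The same pair of risks gives three different numbers, which is one reason the choice of parameter needs to be made deliberately.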
Some key tips:
Now consider the real world data. Suppose we use the conscription register in Sweden to answer this question. The observed data include
Draw the DAG for the data generating mechanism
Is the effect of interest identifiable from these data?
Rules of thumb:
Fear not: there are algorithms that, given a DAG and an effect, determine whether the effect is identified and, if so, return the statistical estimand (Tian and Pearl 2002 -)
If the effect is identifiable, the estimand is some variation on the g-formula:
\[ E\{Y(a)\} = \sum_x E(Y | A = a, X = x) P(X = x). \]
Important
This still involves unknown quantities: the mean outcome given covariates and observed treatment, and the marginal distribution of the covariates.
We need to estimate these things using statistical models. Which ones and how?
This is the statistical gap
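One simple way to fill in the unknown quantities, when the covariate is discrete, is a nonparametric plug-in: estimate the mean outcome within each stratum of \(X\) among units with \(A = a\), then average those means over the empirical distribution of \(X\) in the whole sample. The sketch below assumes that setting; the function name `g_formula` and the eight-row data set are hypothetical and serve only to show the computation. With continuous or high-dimensional covariates the stratum means would have to come from a statistical model instead.

```python
from collections import defaultdict

def g_formula(data, a):
    """Plug-in estimate of E{Y(a)} by standardization.

    data: list of (x, a_obs, y) tuples with a discrete covariate x.
    Stratum means of y among units with a_obs == a are averaged over
    the empirical distribution of x in the full sample.
    """
    n = len(data)
    x_count = defaultdict(int)     # number of units with X = x
    y_sum = defaultdict(float)     # sum of Y among A = a, X = x
    a_count = defaultdict(int)     # number of units with A = a, X = x
    for x, a_obs, y in data:
        x_count[x] += 1
        if a_obs == a:
            y_sum[x] += y
            a_count[x] += 1
    estimate = 0.0
    for x, count in x_count.items():
        if a_count[x] == 0:
            # Positivity fails: no unit with A = a in this stratum.
            raise ValueError(f"no units with A={a} in stratum X={x}")
        estimate += (y_sum[x] / a_count[x]) * (count / n)
    return estimate

# Hypothetical toy data: (x, a, y)
toy = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 0, 0),
       (1, 0, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0)]

standardized = g_formula(toy, a=1)
naive = sum(y for _, a_obs, y in toy if a_obs == 1) / sum(
    1 for _, a_obs, _ in toy if a_obs == 1)
print("standardized:", round(standardized, 4), "naive:", round(naive, 4))
```

The standardized estimate and the naive mean among the treated differ here because treatment is unevenly distributed across the strata of \(X\).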
True or false?