Regression and Standardization

Causal inference series

Michael C Sachs

Example

The scientific question

Does drinking coffee in childhood stunt growth?

In the ideal study

  1. We would enroll children aged 10 - 12
  2. Randomize them to drink coffee versus placebo
  3. Follow them up for 8 years (very closely to ensure compliance)
  4. Measure height at age 18 - 20
  5. Causal parameter of interest is

\[ E\{Y(X = 1)\} - E\{Y(X = 0)\} \]

Identifiability

Is the effect of interest identifiable from these data?

Yes! The DAG is simple in this case: randomization ensures there are no common causes of \(X\) and \(Y\), so the only path is the direct arrow from \(X\) to \(Y\).

Estimation

Our usual estimand is

\[ E(Y | X = 1) - E(Y | X = 0). \]

This identifies the causal parameter because

\[ E(Y | X = x) = E\{Y(X = x) | X = x\} \] which is always true (sometimes called the consistency assumption/axiom).

Because of randomization, the potential outcomes are independent of the treatment assignment: \[ E\{Y(X = x) | X = x\} = E\{Y(X = x)\} \]

Estimation continued

An estimator is \[ \frac{\sum_i Y_i X_i}{\sum_i X_i} - \frac{\sum_i Y_i (1 - X_i)}{\sum_i (1 - X_i)} \] which is what you get from a t-test.
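
As a quick sanity check (a minimal sketch on made-up data, not part of the study above), the difference in group means and the t-test estimate agree:

set.seed(1)
X <- rbinom(100, 1, .5)
Y <- 1 + X + rnorm(100)

# the estimator above: difference in group means
sum(Y * X) / sum(X) - sum(Y * (1 - X)) / sum(1 - X)
# the same quantity from t.test (groups ordered X = 0, then X = 1)
diff(t.test(Y ~ X)$estimate)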

This is not the only possible estimator. It is consistent for the causal effect of interest, meaning that as the sample size increases, the estimator approaches the truth. There are other properties to consider:

  • robustness: how sensitive the estimator is to the modelling assumptions (not so relevant for this example)
  • efficiency: how precise the estimator is compared to alternatives

Alternative estimand

Linear regression adjusted for \(C\):

\[ E(Y | X, C) = \beta_0 + \beta_1 X + \beta_2 C \]

As long as there is no interaction, \(\beta_1\) has the interpretation \(E(Y | X = 1) - E(Y | X = 0)\), so what’s the difference?

The estimator is different from the t-test, and it involves \(C\)

Simulation

generate_ests <- function(n = 300) {
  X <- rbinom(n, 1, .5)            # randomized exposure
  C <- rnorm(n)                    # covariate, independent of X (not a confounder)
  Y <- 1 + 1 * X + C + rnorm(n)    # true effect of X is 1
  
  c(ttest = t.test(Y ~ X)$estimate |> diff(), 
    lm = lm(Y ~ X + C)$coefficients[2])

}

simul <- t(replicate(1e3, generate_ests()))

summary(simul) 
 ttest.mean in group 1      lm.X       
 Min.   :0.4736        Min.   :0.6421  
 1st Qu.:0.8879        1st Qu.:0.9211  
 Median :1.0021        Median :1.0017  
 Mean   :0.9977        Mean   :1.0017  
 3rd Qu.:1.1081        3rd Qu.:1.0774  
 Max.   :1.4981        Max.   :1.3293  
apply(simul, 2, sd)
ttest.mean in group 1                  lm.X 
            0.1594874             0.1144657 

The adjusted regression gives the same answer on average and is much more efficient. What are the drawbacks?

We need to get the outcome model correct, so it is potentially less robust.

However, linear regression is quite robust: it is hard to get the outcome model wrong enough to make a difference.

Simulation 2

generate_ests2 <- function(n = 300) {
  X <- rbinom(n, 1, .5)
  C <- rnorm(n)
  # Y depends on C nonlinearly and with an X interaction,
  # so the outcome model Y ~ X + C is misspecified
  Y <- 1 + 1 * X + 5 * X * (C > 1) - 3 * (C < 1) + rnorm(n)
  
  c(ttest = t.test(Y ~ X)$estimate |> diff(), 
    lm = lm(Y ~ X + C)$coefficients[2])

}

simul <- t(replicate(1e3, generate_ests2()))

summary(simul) 
 ttest.mean in group 1      lm.X      
 Min.   :0.9392        Min.   :1.055  
 1st Qu.:1.6067        1st Qu.:1.631  
 Median :1.7841        Median :1.781  
 Mean   :1.7890        Mean   :1.788  
 3rd Qu.:1.9770        3rd Qu.:1.940  
 Max.   :2.6779        Max.   :2.571  
apply(simul, 2, sd)
ttest.mean in group 1                  lm.X 
            0.2772902             0.2362267 

Warning

Lessons from linear regression often do not apply to other types of regression. For example, the exposure coefficient in a covariate-adjusted logistic model does not correspond to a marginal causal difference.
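
A minimal sketch of this non-collapsibility (made-up data, not the coffee example): even with a randomized exposure and no confounding, the conditional odds ratio from an adjusted logistic model differs from the marginal odds ratio.

set.seed(2)
n <- 1e5
X <- rbinom(n, 1, .5)                      # randomized exposure
C <- rnorm(n)                              # affects Y but not X (not a confounder)
Y <- rbinom(n, 1, plogis(-1 + X + 2 * C))

# conditional log odds ratio for X, adjusted for C
coef(glm(Y ~ X + C, family = "binomial"))["X"]
# marginal log odds ratio for X, ignoring C: closer to zero, even without confounding
coef(glm(Y ~ X, family = "binomial"))["X"]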

A confounded setting

The more typical scenario, since we can’t force children to drink coffee: there is a confounder \(C\) with arrows into both the exposure \(X\) and the outcome \(Y\).

How do we estimate this now?

The g-formula

Let’s focus on the first term \(E\{Y(X = 1)\}\)

\[ E\{Y(1)\} = \sum_c E(Y(1) | C = c) P(C = c) = \sum_c E(Y(1) | C = c, X = 1) P(C = c) \] \[ = \sum_c E(Y | C = c, X = 1) P(C = c) \] under the assumptions in the DAG: the first equality is the law of total expectation, the second uses exchangeability given \(C\) (no unmeasured confounding), and the third uses consistency.

The first component inside the sum is an outcome model; we can use a prediction from, e.g., linear regression.

For the second component, we can use the empirical distribution. This is based on the random sampling principle: if I have \(n\) subjects in my sample, and the \(i\)th subject’s covariates are \(C_i\), my best guess of the probability of observing data like \(C_i\) is \(1/n\).
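
A minimal sketch of the g-formula as this weighted sum, using a made-up binary confounder, group means as the outcome model, and the empirical distribution of \(C\):

set.seed(3)
n <- 1e4
C <- rbinom(n, 1, .4)
X <- rbinom(n, 1, plogis(-1 + 2 * C))       # exposure depends on the confounder
Y <- 1 + X + 2 * C + rnorm(n)

EY1c <- tapply(Y[X == 1], C[X == 1], mean)  # outcome model: E(Y | X = 1, C = c)
pC <- table(C) / n                          # empirical distribution of C
sum(EY1c * pC)                              # g-formula estimate of E{Y(1)}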

Regression standardization

The regression standardization estimator of the g-formula is

\[ \hat{E}\{Y(1)\} = \sum_{i=1}^n \hat{E}(Y | X = 1, C = C_i) \cdot \frac{1}{n}. \]

The basic approach:

  1. Fit an outcome model involving the exposure of interest and confounders
  2. Copy the data used to fit the model and then:
    1. Replace observed values of \(X\) with 1, and get predictions from the model: \(\hat{E}(Y_i(1))\)
    2. Replace observed values of \(X\) with 0, and get predictions from the model: \(\hat{E}(Y_i(0))\)
  3. Take the means and then the difference: \(\frac{1}{n}\sum_i \{\hat{E}(Y_i(1)) - \hat{E}(Y_i(0))\}\)

Linear regression

Example, using simulated data:

n <- 800
C <- rnorm(n)
X <- rbinom(n, 1, plogis(-1 + 2 * C))        # exposure depends on the confounder C
Y <- 1 + 1 * X + C + .5 * X * C + rnorm(n)   # true marginal effect of X is 1
simdata <- data.frame(C, Y, X)

# fit the outcome model
fit <- glm(Y ~ C + X + C:X, data = simdata)
# copies of the data, with the exposure set to 0 and to 1
simdata0 <- simdata1 <- simdata

simdata0$X <- 0
simdata1$X <- 1

# predicted potential outcomes for each individual
EYi_1 <- predict(fit, newdata = simdata1)
EYi_0 <- predict(fit, newdata = simdata0)

# take the means and the difference
mean(EYi_1) - mean(EYi_0)
[1] 0.9900777

Inference?

One option is to use the bootstrap:

B <- 1000
bootests <- rep(NA, B)
for(i in 1:B) {
  simdata.star <- simdata[sample(1:nrow(simdata), 
                                 nrow(simdata), replace = TRUE), ]

  # refit the outcome model on the bootstrap sample
  # (separate names so the original fit and data copies are not overwritten)
  fit.star <- glm(Y ~ C + X + C:X, data = simdata.star)
  simdata0.star <- simdata1.star <- simdata.star

  simdata0.star$X <- 0
  simdata1.star$X <- 1

  bootests[i] <- mean(predict(fit.star, newdata = simdata1.star)) - 
    mean(predict(fit.star, newdata = simdata0.star))
}

sd(bootests)
[1] 0.09725708
quantile(bootests, c(0.025, 0.975))
     2.5%     97.5% 
0.7975081 1.1795300 

Or you can use the stdReg package

library("stdReg")

stdfit <- stdGlm(fit, data = simdata, X = "X", x = c(0, 1))
summary(stdfit, contrast = "difference", reference = 0)

Formula: Y ~ C + X + C:X
Family: gaussian 
Link function: identity 
Exposure:  X 
Reference level:  X = 0 
Contrast:  difference 

  Estimate Std. Error lower 0.95 upper 0.95
0    0.000     0.0000      0.000      0.000
1    0.778     0.0998      0.582      0.973

Logistic regression

With a binary outcome, we would tend to use logistic regression.

\[ \mbox{logit}(p\{Y = 1 | X, C\}) = \beta_0 + \beta_1 X + \beta_2 C \]

but the coefficient \(\beta_1\) is not a marginal causal contrast, i.e., there is no function \(g\) such that

\[ \beta_1 = g\{Y(1), Y(0)\}. \] The reason is that the nonlinear link function does not simplify: the contrast continues to depend on \(C\), so \(\beta_1\) is a conditional causal contrast. Moreover, we usually want the risk difference or risk ratio, so with logistic regression we need to standardize.

Binary example

The steps are nearly identical:

n <- 800
C <- rnorm(n)
X <- rbinom(n, 1, plogis(-1 + 2 * C))
Y <- rbinom(n, 1, plogis(1 + 1 * X + C + .5 * X * C))
simdata <- data.frame(C, Y, X)

fit <- glm(Y ~ C + X + C:X, data = simdata, family = "binomial")
simdata0 <- simdata1 <- simdata

simdata0$X <- 0
simdata1$X <- 1

EYi_1 <- predict(fit, newdata = simdata1, type = "response")
EYi_0 <- predict(fit, newdata = simdata0, type = "response")

mean(EYi_1) - mean(EYi_0)
[1] 0.1231389
mean(EYi_1) / mean(EYi_0)
[1] 1.176667

Or use the stdReg package

stdfit <- stdGlm(fit, X = "X", x = c(0,1), data = simdata)
summary(stdfit, contrast = "difference", reference = 0)

Formula: Y ~ C + X + C:X
Family: binomial 
Link function: logit 
Exposure:  X 
Reference level:  X = 0 
Contrast:  difference 

  Estimate Std. Error lower 0.95 upper 0.95
0    0.000      0.000     0.0000      0.000
1    0.123      0.049     0.0271      0.219
summary(stdfit, contrast = "ratio", reference = 0)

Formula: Y ~ C + X + C:X
Family: binomial 
Link function: logit 
Exposure:  X 
Reference level:  X = 0 
Contrast:  ratio 

  Estimate Std. Error lower 0.95 upper 0.95
0     1.00     0.0000       1.00       1.00
1     1.18     0.0721       1.04       1.32

Cox regression and other survival models

For a time to event outcome \(Y\), the Cox model assumes \[ \lim_{\delta \rightarrow 0} \frac{1}{\delta}p\{t \leq Y < t + \delta | Y \geq t, X, C\} = \lambda_0(t) \exp(\beta_1 X + \beta_2 C) \]

\(\beta_1\) is not generally a causal contrast, for different reasons than the logistic model. Hence we need to standardize to get a contrast of survival probabilities such as \[ p\{Y(1) > \tau\} - p\{Y(0) > \tau\} \] for a given time \(\tau\).

The standardization procedure is basically the same

Note

To be pedantic, to get a prediction of the survival probability, you also need the Breslow estimator of the baseline hazard. Note that the predictions are still subject to the proportional hazards assumption.
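
A minimal sketch of what such a prediction involves (using the survival package's built-in lung data purely for illustration): the Breslow baseline cumulative hazard from basehaz() combined with a subject's linear predictor.

library(survival)
fit0 <- coxph(Surv(time, status) ~ age + sex, data = lung, ties = "breslow")
H0 <- basehaz(fit0, centered = FALSE)    # Breslow baseline cumulative hazard
lp <- sum(coef(fit0) * c(60, 1))         # linear predictor for age = 60, sex = 1
surv_new <- exp(-H0$hazard * exp(lp))    # predicted S(t | age = 60, sex = 1) at times H0$time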

Cox example

library(survival)
head(rotterdam)
     pid year age meno  size grade nodes pgr  er hormon chemo rtime recur dtime
1393   1 1992  74    1  <=20     3     0  35 291      0     0  1799     0  1799
1416   2 1984  79    1 20-50     3     0  36 611      0     0  2828     0  2828
2962   3 1983  44    0  <=20     2     0 138   0      0     0  6012     0  6012
1455   4 1985  70    1 20-50     3     0   0  12      0     0  2624     0  2624
977    5 1983  75    1  <=20     3     0 260 409      0     0  4915     0  4915
617    6 1983  52    0  <=20     3     0 139 303      0     0  5888     0  5888
     death
1393     0
1416     0
2962     0
1455     0
977      0
617      0
sdata <- rotterdam

cfit <- coxph(Surv(dtime, death) ~ hormon + age + meno + grade + er, 
              data = sdata, ties = "breslow")
sdata0 <- sdata1 <- sdata
sdata0$hormon <- 0
sdata1$hormon <- 1

# standardized survival curves, setting everyone to hormon = 0 and hormon = 1
Ey0t <- survfit(cfit, newdata = sdata0, se.fit = FALSE)
Ey1t <- survfit(cfit, newdata = sdata1, se.fit = FALSE)

# average the individual survival predictions at the largest event time <= 3000
mean(Ey1t$surv[max(which(Ey1t$time <= 3000)), ]) - 
  mean(Ey0t$surv[max(which(Ey0t$time <= 3000)), ])
[1] -0.07465547

or use stdCoxph

stdcfit <- stdCoxph(cfit, data = sdata, X = "hormon", x = c(0, 1), 
                    t = 3000)
summary(stdcfit, contrast = "difference", reference = 0)

Formula: Surv(dtime, death) ~ hormon + age + meno + grade + er
Exposure:  hormon 
Reference level:  hormon = 0 
Contrast:  difference 
Survival functions evaluated at t = 3000 

  Estimate Std. Error lower 0.95 upper 0.95
0   0.0000     0.0000      0.000      0.000
1  -0.0747     0.0274     -0.128     -0.021

Summary and comments

  1. Regression standardization is a method to estimate causal effects, for all types of exposures (binary, categorical, continuous)
  2. It works when the outcome model is correctly specified for confounding, i.e., all confounders are included in the model in the correct functional form (nonlinearities, interactions, etc.)
  3. Basic steps:
    1. Fit an outcome regression model (can also fit two, separately by the exposure levels). Use diagnostics to make sure the model is as flexible as it needs to be
    2. Create copies of the data, and set the exposure levels to the desired levels for the contrast
    3. Predict the potential outcome for each individual given their observed covariates
    4. Take the means and construct the contrast of interest (difference or ratio)

Next time