Statistical Issues in Developing Multivariable Biomarker Signatures

Molecular Medicine and Surgery

Michael C Sachs

18 May 2016

Biomarker Signatures

A biomarker signature is a transformation of multiple individual features to a one-dimensional space.

Outline

Issues in development and evaluation of models for prediction

Decisions, Decisions

Signature Development

Let \(X\) denote the set of \(p\) features. The signature is an unknown function

\[ f(X): \mathbb{R}^p \mapsto \mathbb{R}^1 \]
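For example (a common special case, not spelled out above), a linear signature combines the \(p\) features using a weight vector \(\beta\):

\[ f(X) = X^T\beta = \beta_1 X_1 + \cdots + \beta_p X_p. \]

Signature development then amounts to choosing both the form of \(f\) and its unknown parameters from the data.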

Methods for signature development

Broad classes of methods

  1. Filter + combine
  2. Regularization
  3. Other

Dataset includes \(X_{ij}\) and \(Y_i, Z_i\) for \(i = 1, \ldots, n\), \(j = 1, \ldots, p\)

The \(p > n\) problem

Filter + combine

General idea: screen each feature for its marginal association with the outcome, keep the top-ranked features, and combine the selected features into a single score (e.g., with a regression model).

Variations: the filtering statistic used, how many features to keep, and how to combine them.
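A minimal sketch of filter + combine in R, assuming a hypothetical feature matrix x (n rows, p columns) and a binary outcome y; the helper name filter_combine and the cutoff of 50 features are illustrative choices, not taken from the slides.

# Sketch of filter + combine (hypothetical inputs: x is an n x p
# feature matrix, y is a binary outcome)
filter_combine <- function(x, y, n_keep = 50) {
  # filter: univariate t-test of each feature against the outcome
  pvals <- apply(x, 2, function(xj) t.test(xj ~ y)$p.value)
  keep <- order(pvals)[seq_len(n_keep)]

  # combine: logistic regression on the selected features
  dat <- data.frame(y = y, x[, keep, drop = FALSE])
  fit <- glm(y ~ ., data = dat, family = binomial)

  # the signature f(X) is the fitted linear predictor
  list(selected = keep, fit = fit)
}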

Regularization

Least squares:

\[ \sum_{i=1}^n(Y_i - X_i^T\beta)^2. \]

Penalized least squares:

\[ \sum_{i=1}^n(Y_i - X_i^T\beta)^2 + h(\beta, \lambda), \]

where \(h\) is a penalty function depending on fixed penalty parameters \(\lambda\).
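Two standard choices of penalty (added here for concreteness) are the ridge and lasso penalties:

\[ h(\beta, \lambda) = \lambda \sum_{j=1}^p \beta_j^2 \quad \text{(ridge)}, \qquad h(\beta, \lambda) = \lambda \sum_{j=1}^p |\beta_j| \quad \text{(lasso)}. \]

The lasso penalty shrinks some coefficients exactly to zero, which is what makes it useful for variable selection.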

Bias-variance tradeoff

Variable selection

Works for many regression models
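A minimal sketch of lasso-penalized regression with the glmnet package (the package is not named in the slides; x and y are hypothetical inputs as above):

library(glmnet)

# lasso-penalized least squares: alpha = 1 gives the lasso penalty
# lambda * sum(|beta_j|); alpha = 0 gives ridge (hypothetical x, y)
cvfit <- cv.glmnet(x, y, alpha = 1, family = "gaussian")

# coefficients at the cross-validation-selected lambda;
# features with coefficients shrunk to zero are dropped
coef(cvfit, s = "lambda.min")

# the signature is the fitted linear predictor for new subjects
f_new <- predict(cvfit, newx = x, s = "lambda.min")

cv.glmnet chooses \(\lambda\) by cross-validation, one way of navigating the bias-variance tradeoff.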

Other

Stacking/ensemble methods

Define a library of algorithms, each of which gives you a way to estimate a predictive model.

Superlearner

On a V-fold cross-validation partition of the data,

  1. Fit each algorithm separately on the training portion of each fold, giving V model fits per algorithm
  2. Obtain predictions for \(Y\) on the validation portion of each fold, giving a cross-validated prediction for each subject and each algorithm
  3. Compute the cross-validated performance of each algorithm
  4. Regress the outcome \(Y\) on the set of cross-validated predictions to obtain optimal weights
  5. Fit each algorithm on the full data and combine the fits with the optimal weights estimated in step 4.

Asymptotically performs at least as well as the best individual algorithm in the library

Polley, Eric C. and van der Laan, Mark J., “Super Learner In Prediction” (May 2010). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 266. https://biostats.bepress.com/ucbbiostat/paper266
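A minimal sketch using the SuperLearner R package (the implementation that accompanies the paper above); the particular library of wrappers and V = 10 below are illustrative assumptions:

library(SuperLearner)

# super learner with a small library of candidate algorithms
# (hypothetical inputs: x is a data.frame of features, y a binary outcome)
sl_fit <- SuperLearner(
  Y = y, X = x, family = binomial(),
  SL.library = c("SL.mean", "SL.glmnet", "SL.randomForest"),
  cvControl = list(V = 10)   # V-fold cross-validation, steps 1-4
)

# cross-validated risk and estimated weights for each algorithm
sl_fit$cvRisk
sl_fit$coef

# predictions combine the full-data fits with the estimated weights (step 5)
pred <- predict(sl_fit, newdata = x)$pred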

Considerations for predictive signatures

Looking for a strong, qualitative interaction with treatment \(Z\)

Presence of an interaction does not guarantee clinical utility

You need to be able to compute the predictions for people who haven’t yet received treatment!
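As a sketch of what the interaction with \(Z\) looks like in a model (hypothetical variable names: signature is the baseline value of \(f(X)\), z the randomized treatment, y the outcome, all columns of dat):

# test for a signature-by-treatment interaction
# (hypothetical data frame dat with columns y, signature, z)
fit <- glm(y ~ signature * z, family = binomial, data = dat)
summary(fit)  # the signature:z term estimates the interaction

# a qualitative interaction means the treatment effect changes direction
# across signature values, not just its magnitude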

Valid estimates of performance

Let \(S\) denote the development dataset, which includes \(X\) and possibly \(Y\), \(Z\), and other variables: a sample of size \(n\) from a distribution \(\mathcal{P}\) with domain \(\mathcal{X}\).

Let \(\mathcal{F}: \mathcal{X} \mapsto \mathcal{D}\) denote the process or algorithm that generates a particular signature, i.e., \(f = \mathcal{F}(S) \in \mathcal{D}\)

Let \(\phi_f: \mathcal{X} \mapsto \mathbb{R}\) denote the performance evaluation function, e.g., accuracy of predictive model, magnitude of hazard ratio.

We are interested in estimating \(E_\mathcal{P}[\phi_f(S)]\), the generalization error for a fixed \(f\) or maybe for a class \(\mathcal{F}\).

An estimate of that is the in-sample empirical estimate: \(\hat{E}[\phi_f(S)] = \frac{1}{n}\sum_{i=1}^n \phi_f(s_i)\).

However, if the analyst interacts with \(S\) in the definition of \(\phi_f\), the estimate will be biased (overfit). I.e.,

\[ |E_\mathcal{P}[\phi_f(S)] - \hat{E}[\phi_f(S)]| \]

will be large.

Examples of “interacting with \(S\)”

  1. Working on a genomic classifier for binary \(Y\):
    • I test it out on \(S\), and take a look at the individual-level accuracies \(\phi(s_i)\) of a classifier \(f(x_i)\)
    • For \(i\) where \(\phi(s_i)\) is poor, manually change the value of \(y_i\).
  2. Developing a predictor for binary \(Y\):
    • Test the association of each \(X_j\) with \(Y\) using t-test on \(S\).
    • Select the 50 most significant
    • Put them all in a regression model and test on \(S\)
  3. Developing a classifier
    • Select 50 most significant genes using \(S\)
    • Split \(S\) into \(S_h\) and \(S_t\)
    • Put them in a regression model estimated using \(S_t\)
    • Test it on \(S_h\)
  4. Developing predictive signature
    • Split into \(S_t\) and \(S_h\)
    • Build clustering model on \(S_t\)
    • Test performance on \(S_h\)
    • Performance isn’t as good as I expected
    • Go back to \(S_t\) and try again using a different approach

Which ones give valid estimates?

An analogy

Let’s say I’m conducting a randomized clinical trial testing a new treatment versus a placebo.

Partway through, I look at the results, don’t like what I see, and change the primary endpoint and the analysis until the treatment effect looks better.

Obviously not OK. In clinical trials we have pre-registration.

Performance?

Examples of \(\phi\):

Calibration

Accurately predicts clinical outcome of interest

Performance? (2)

Examples of \(\phi\):

Discrimination

Separates into groups that would be clinically managed differently

Overfitting problem

Remedies

  1. Split-sample
  2. Cross-validation
  3. Bootstrapping
  4. Pre-validation

Split sample

# selectvars, runclassifier, and fitclassifier are placeholders for
# feature selection, model fitting, and performance evaluation
holdoutest <- function(trratio = .5, data){

  n <- nrow(data)
  npart.tr <- floor(trratio * n)

  # randomly split the rows into a training set and a holdout set
  train.dex <- sample(1:n, npart.tr)
  hold.dex <- setdiff(1:n, train.dex)

  train <- data[train.dex, ]
  hold <- data[hold.dex, ]

  # select features and fit the classifier using the training set only
  selecttr <- selectvars(train)
  fit <- runclassifier(train, selecttr)

  # estimate performance on the untouched holdout set
  fitclassifier(fit, hold)
}

Cross validation

cross_validate <- function(K = 50, data){

  n <- nrow(data)
  # partition_index is a placeholder returning a list of K sets of row indices
  over <- partition_index(data, K = K)

  cvests <- sapply(over, function(index){

    # hold out one fold, estimate on the rest
    leaveout <- data[index, ]
    estin <- data[setdiff(1:n, index), ]

    # feature selection and model fitting are repeated within each fold
    selectin <- selectvars(estin)
    fit <- runclassifier(estin, selectin)

    # evaluate on the held-out fold
    fitclassifier(fit, leaveout)

  })
  # average the performance estimates over the K folds
  rowMeans(cvests, na.rm = TRUE)

}

Bootstrapping

bootstrap <- function(B = 50, data){

  n <- nrow(data)

  bootests <- replicate(B, {

    # draw a bootstrap sample with replacement
    boot.dex <- sample(1:n, n, replace = TRUE)
    bin <- data[boot.dex, ]
    # the out-of-bag observations serve as the test set
    notbin <- setdiff(1:n, unique(boot.dex))
    leaveout <- data[notbin, ]

    # feature selection and model fitting use only the bootstrap sample
    selectin <- selectvars(bin)
    fit <- runclassifier(bin, selectin)

    fitclassifier(fit, leaveout)

  })

  # average over the B bootstrap replicates
  rowMeans(bootests)

}

Pre-validation

# the same placeholder helpers as above; 'over' is a K-fold partition of
# the row indices (K = 10 here is only an illustration)
n <- nrow(data)
over <- partition_index(data, K = 10)

y.hats <- lapply(over, function(oot.dex){

    # hold out one fold, select features and fit on the remaining folds
    leaveout <- data[oot.dex, ]
    estin <- data[setdiff(1:n, oot.dex), ]

    selectin <- selectvars(estin)
    fit <- runclassifier(estin, selectin)

    # pre-validated predictions: fitted values for the held-out subjects
    preval <- predict(fit, newdata = leaveout, type = "response")
    preval

  })
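To finish the sketch (assuming, as above, that over is the K-fold partition of row indices and that the binary outcome Y is a column of data): the out-of-fold predictions are reassembled in the original row order and then treated as a single, pre-specified covariate when estimating performance.

# reassemble the pre-validated predictions in the original row order
ord <- unlist(over)
prevalidated <- numeric(nrow(data))
prevalidated[ord] <- unlist(y.hats)

# e.g., association of the pre-validated signature with the outcome Y
summary(glm(Y ~ prevalidated, family = binomial, data = data))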

Cross-validation (schematic figure)

Common errors

  1. Incomplete/partial validation
    • Feature selection performed on full data
    • Then only regression model validated
  2. Resubstitution
    • Model built using training sample
    • Performance estimated using full data
  3. Resubstitution (2)
    • Model built using full data
    • Performance estimated using holdout sample

Numerical experiment

Signature estimation

Results: \(\phi = AUC\)
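The results themselves appear as figures; the following minimal stand-in (pure-noise features, a binary outcome, AUC as \(\phi\), with a simulation setup and fold counts that are assumptions rather than those behind the slides) illustrates the contrast between resubstitution and full cross-validation:

set.seed(1)
n <- 100; p <- 1000
x <- matrix(rnorm(n * p), n, p)   # pure noise features
y <- rbinom(n, 1, 0.5)            # outcome unrelated to x

# rank-based (Mann-Whitney) estimate of the AUC
auc <- function(pred, y) {
  r <- rank(pred)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# filter + combine with resubstitution: features selected and model
# evaluated on the same data -> badly optimistic AUC despite no signal
pvals <- apply(x, 2, function(xj) t.test(xj ~ y)$p.value)
keep  <- order(pvals)[1:10]
fit   <- glm(y ~ ., data = data.frame(y = y, x[, keep]), family = binomial)
auc(fitted(fit), y)

# full cross-validation: selection and fitting repeated within each fold
folds <- sample(rep(1:10, length.out = n))
cvpred <- numeric(n)
for (k in 1:10) {
  tr <- folds != k
  pv <- apply(x[tr, ], 2, function(xj) t.test(xj ~ y[tr])$p.value)
  kp <- order(pv)[1:10]
  fk <- glm(y ~ ., data = data.frame(y = y[tr], x[tr, kp]), family = binomial)
  cvpred[!tr] <- predict(fk, newdata = data.frame(x[!tr, kp, drop = FALSE]),
                         type = "response")
}
auc(cvpred, y)   # close to 0.5, the true discrimination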

Common source of trouble

Example 4: I decided on a statistical model in advance (e.g., the lasso) and did a training/validation split.

Performance wasn’t as good as I expected on the validation set, so I went back and used random forest.

Don’t do this! Use SuperLearner instead.

Censored data

Answer: my favorite model for continuous outcomes

Censoring is a problem, what can we do?

Pseudo-values

We know how to estimate the Kaplan-Meier curve. Let \(\hat{\theta} = \int g(x) \, d\hat{F}(x)\) denote our summary statistic of interest, where \(\hat{F}(x)\) equals 1 minus the Kaplan-Meier estimate.

Pseudo-value

The \(i\)th pseudo-value is \(\hat{\theta}_i = n \hat{\theta} - (n - 1) \hat{\theta}^{-i}\), where \(\hat{\theta}^{-i}\) is the same estimate computed with the \(i\)th observation left out.

Properties

Calculate \(\hat{\theta}_i\) for all \(i\) in the sample, and use them as the outcome in standard models for continuous data!
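A minimal sketch in R (the helper pseudo_surv is hypothetical) for one common choice of summary, the Kaplan-Meier survival probability at a fixed time t0:

library(survival)

# jackknife pseudo-values for the Kaplan-Meier survival probability
# at a fixed time t0 (hypothetical helper, not from the slides)
pseudo_surv <- function(time, status, t0) {
  n <- length(time)
  km_at <- function(idx) {
    fit <- survfit(Surv(time[idx], status[idx]) ~ 1)
    summary(fit, times = t0, extend = TRUE)$surv
  }
  theta_full <- km_at(seq_len(n))
  sapply(seq_len(n), function(i) {
    # i-th pseudo-value: n * theta_hat - (n - 1) * theta_hat^{-i}
    n * theta_full - (n - 1) * km_at(setdiff(seq_len(n), i))
  })
}

# use the pseudo-values as a continuous outcome in a standard model,
# e.g. with the lung data from the survival package:
# pv <- pseudo_surv(lung$time, lung$status == 2, t0 = 365)
# summary(lm(pv ~ age + sex, data = lung))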

What it looks like (pseudo-values)

Real data example

Zhu et al. (2010), Prognostic and Predictive Gene Signature for Adjuvant Chemotherapy in Resected Non-Small-Cell Lung Cancer.

Methods

library(Biobase)
library(GEOquery)

# load series and platform data from GEO
gset <- getGEO("GSE14814", GSEMatrix = TRUE)
# getGEO returns a list of ExpressionSets; keep the first element
gset <- gset[[1]]

Resubstitution estimates

Resubstitution continued

Pre-validated estimate

Pre-validated continued

Predictions of survival times

Multivariable modeling

Hazard ratios and 95% confidence intervals from separate Cox regression models that adjust for tumor histologic subtype, stage, age, and sex.

Method                  Comparison                   Hazard Ratio   95% CI           Adjusted p
Partial resubstitution  High Risk vs Low Risk        38.9           9.2 to 164.7     < 0.001
Partial resubstitution  Trt/Risk interaction         14.7           3.2 to 67.0      < 0.001
Partial resubstitution  High Risk vs Low Risk (SL)   147.2          16.5 to 1316.9   < 0.001
Partial resubstitution  Trt/Risk interaction (SL)    77.6           7.8 to 774.9     < 0.001
Pre-validation          High Risk vs Low Risk        1.9            0.8 to 4.3       0.122
Pre-validation          Trt/Risk interaction         1.8            0.5 to 6.5       0.395
Pre-validation          High Risk vs Low Risk (SL)   0.7            0.3 to 1.7       0.49
Pre-validation          Trt/Risk interaction (SL)    0.7            0.2 to 2.5       0.571
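A sketch of the kind of adjusted Cox models behind the table (all variable names are assumptions): the dichotomized risk group, its interaction with treatment, and the clinical covariates.

library(survival)

# hypothetical variables in data frame d:
#   os_time, os_event = overall survival time and event indicator
#   risk = High vs Low risk group from the signature, trt = treatment arm
coxph(Surv(os_time, os_event) ~ risk + histology + stage + age + sex, data = d)
coxph(Surv(os_time, os_event) ~ risk * trt + histology + stage + age + sex, data = d)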

Summary

Modeling

Conclusions

Slides available at https://sachsmc.github.io/Talks/gsurg-talk