Statistical Issues in Developing Multivariable Biomarker Signatures

Molecular Medicine and Surgery

Michael C Sachs

18 May 2016

Biomarker Signatures

A biomarker signature is a transformation of multiple individual features to a one-dimensional space.

Outline

Issues in development and evaluation of models for prediction

Decisions, Decisions

Signature Development

Let \(X\) denote the set of \(p\) features. The signature is an unknown function

\[ f(X): \mathbb{R}^p \mapsto \mathbb{R}^1 \]
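For example (a common special case, not spelled out above), a linear signature combines the \(p\) features using a weight vector \(\beta\):

\[ f(X) = X^T\beta = \beta_1 X_1 + \cdots + \beta_p X_p. \]

Signature development then amounts to choosing both the form of \(f\) and its unknown parameters from the data.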

Methods for signature development

Broad classes of methods

  1. Filter + combine
  2. Regularization
  3. Other

Dataset includes \(X_{ij}\) and \(Y_i, Z_i\) for \(i = 1, \ldots, n\), \(j = 1, \ldots, p\)

The \(p > n\) problem

Filter + combine

General idea: screen each feature for its marginal association with the outcome, keep the top-ranked features, and combine the selected features into a single score (e.g., with a regression model).

Variations: the filtering statistic used, how many features to keep, and how to combine them.
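A minimal sketch of filter + combine in R, assuming a hypothetical feature matrix x (n rows, p columns) and a binary outcome y; the helper name filter_combine and the cutoff of 50 features are illustrative choices, not taken from the slides.

# Sketch of filter + combine (hypothetical inputs: x is an n x p
# feature matrix, y is a binary outcome)
filter_combine <- function(x, y, n_keep = 50) {
  # filter: univariate t-test of each feature against the outcome
  pvals <- apply(x, 2, function(xj) t.test(xj ~ y)$p.value)
  keep <- order(pvals)[seq_len(n_keep)]

  # combine: logistic regression on the selected features
  dat <- data.frame(y = y, x[, keep, drop = FALSE])
  fit <- glm(y ~ ., data = dat, family = binomial)

  # the signature f(X) is the fitted linear predictor
  list(selected = keep, fit = fit)
}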

Regularization

Least squares:

\[ \sum_{i=1}^n(Y_i - X_i^T\beta)^2. \]

Penalized least squares:

\[ \sum_{i=1}^n(Y_i - X_i^T\beta)^2 + h(\beta, \lambda), \]

where \(h\) is a penalty function depending on fixed penalty parameters \(\lambda\).
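Two standard choices of penalty (added here for concreteness) are the ridge and lasso penalties:

\[ h(\beta, \lambda) = \lambda \sum_{j=1}^p \beta_j^2 \quad \text{(ridge)}, \qquad h(\beta, \lambda) = \lambda \sum_{j=1}^p |\beta_j| \quad \text{(lasso)}. \]

The lasso penalty shrinks some coefficients exactly to zero, which is what makes it useful for variable selection.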

Bias-variance tradeoff

Variable selection

Works for many regression models
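A minimal sketch of lasso-penalized regression with the glmnet package (the package is not named in the slides; x and y are hypothetical inputs as above):

library(glmnet)

# lasso-penalized least squares: alpha = 1 gives the lasso penalty
# lambda * sum(|beta_j|); alpha = 0 gives ridge (hypothetical x, y)
cvfit <- cv.glmnet(x, y, alpha = 1, family = "gaussian")

# coefficients at the cross-validation-selected lambda;
# features with coefficients shrunk to zero are dropped
coef(cvfit, s = "lambda.min")

# the signature is the fitted linear predictor for new subjects
f_new <- predict(cvfit, newx = x, s = "lambda.min")

cv.glmnet chooses \(\lambda\) by cross-validation, one way of navigating the bias-variance tradeoff.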

Other

Stacking/ensemble methods

Define a library of algorithms, each of which gives you a way to estimate a predictive model.

Superlearner

On a V-fold cross-validation partition of the data,

  1. Fit each algorithm separately on the training portion of each fold, giving V model fits per algorithm
  2. Obtain predictions for \(Y\) on the validation portion of each fold, giving a cross-validated prediction for each subject and each algorithm
  3. Compute the cross-validated performance of each algorithm
  4. Regress the outcome \(Y\) on the set of cross-validated predictions to obtain optimal weights
  5. Fit each algorithm on the full data and combine the fits with the optimal weights estimated in step 4.

Asymptotically performs at least as well as the best individual algorithm in the library

Polley, Eric C. and van der Laan, Mark J., “Super Learner In Prediction” (May 2010). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 266. https://biostats.bepress.com/ucbbiostat/paper266
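A minimal sketch using the SuperLearner R package (the implementation that accompanies the paper above); the particular library of wrappers and V = 10 below are illustrative assumptions:

library(SuperLearner)

# super learner with a small library of candidate algorithms
# (hypothetical inputs: x is a data.frame of features, y a binary outcome)
sl_fit <- SuperLearner(
  Y = y, X = x, family = binomial(),
  SL.library = c("SL.mean", "SL.glmnet", "SL.randomForest"),
  cvControl = list(V = 10)   # V-fold cross-validation, steps 1-4
)

# cross-validated risk and estimated weights for each algorithm
sl_fit$cvRisk
sl_fit$coef

# predictions combine the full-data fits with the estimated weights (step 5)
pred <- predict(sl_fit, newdata = x)$pred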

Considerations for predictive signatures

Looking for a strong, qualitative interaction with treatment \(Z\)

Presence of an interaction does not guarantee clinical utility

You need to be able to compute the predictions for people who haven’t yet received treatment!
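As a sketch of what the interaction with \(Z\) looks like in a model (hypothetical variable names: signature is the baseline value of \(f(X)\), z the randomized treatment, y the outcome, all columns of dat):

# test for a signature-by-treatment interaction
# (hypothetical data frame dat with columns y, signature, z)
fit <- glm(y ~ signature * z, family = binomial, data = dat)
summary(fit)  # the signature:z term estimates the interaction

# a qualitative interaction means the treatment effect changes direction
# across signature values, not just its magnitude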

Valid estimates of performance

Let \(S\) denote the development dataset, which includes \(X\) and possibly \(Y\), \(Z\), and other variables: a sample of size \(n\) from a distribution \(\mathcal{P}\) with domain \(\mathcal{X}\).

Let \(\mathcal{F}: \mathcal{X} \mapsto \mathcal{D}\) denote the process or algorithm that generates a particular signature, i.e., \(f = \mathcal{F}(S) \in \mathcal{D}\)

Let \(\phi_f: \mathcal{X} \mapsto \mathbb{R}\) denote the performance evaluation function, e.g., accuracy of predictive model, magnitude of hazard ratio.

We are interested in estimating \(E_\mathcal{P}[\phi_f(S)]\), the generalization error for a fixed \(f\) or maybe for a class \(\mathcal{F}\).

An estimate of that is the in-sample empirical estimate: \(\hat{E}[\phi_f(S)] = \frac{1}{n}\sum_{i=1}^n \phi_f(s_i)\).

However, if the analyst interacts with \(S\) in the definition of \(\phi_f\), the estimate will be biased (overfit). I.e.,

\[ |E_\mathcal{P}[\phi_f(S)] - \hat{E}[\phi_f(S)]| \]

will be large.

Examples of “interacting with \(S\)”

  1. Working on a genomic classifier for binary \(Y\):
    • I test it out on \(S\), and take a look at the individual-level accuracies \(\phi(s_i)\) of a classifier \(f(x_i)\)
    • For \(i\) where \(\phi(s_i)\) is poor, manually change the value of \(y_i\).
  2. Developing a predictor for binary \(Y\):
    • Test the association of each \(X_j\) with \(Y\) using t-test on \(S\).
    • Select the 50 most significant
    • Put them all in a regression model and test on \(S\)
  3. Developing a classifier
    • Select 50 most significant genes using \(S\)
    • Split \(S\) into \(S_h\) and \(S_t\)
    • Put them in a regression model estimated using \(S_t\)
    • Test it on \(S_h\)
  4. Developing predictive signature
    • Split into \(S_t\) and \(S_h\)
    • Build clustering model on \(S_t\)
    • Test performance on \(S_h\)
    • Performance isn’t as good as I expected
    • Go back to \(S_t\) and try again using a different approach

Which ones give valid estimates?

An analogy

Let’s say I’m conducting a randomized clinical trial testing a new treatment versus a placebo.

Partway through, I look at the results, don’t like what I see, and change the primary endpoint and the analysis until the treatment effect looks better.

Obviously not OK. In clinical trials we have pre-registration.

Performance?

Examples of \(\phi\):

Calibration

Accurately predicts clinical outcome of interest

Performance? (2)

Examples of \(\phi\):

Discrimination

Separates into groups that would be clinically managed differently

Overfitting problem

Remedies

  1. Split-sample
  2. Cross-validation
  3. Bootstrapping
  4. Pre-validation

Split sample

# selectvars, runclassifier, and fitclassifier are placeholders for
# feature selection, model fitting, and performance evaluation
holdoutest <- function(trratio = .5, data){

  n <- nrow(data)
  npart.tr <- floor(trratio * n)

  # randomly split the rows into a training set and a holdout set
  train.dex <- sample(1:n, npart.tr)
  hold.dex <- setdiff(1:n, train.dex)

  train <- data[train.dex, ]
  hold <- data[hold.dex, ]

  # select features and fit the classifier using the training set only
  selecttr <- selectvars(train)
  fit <- runclassifier(train, selecttr)

  # estimate performance on the untouched holdout set
  fitclassifier(fit, hold)
}

Cross validation

cross_validate <- function(K = 50, data){

  n <- nrow(data)
  # partition_index is a placeholder returning a list of K sets of row indices
  over <- partition_index(data, K = K)

  cvests <- sapply(over, function(index){

    # hold out one fold, estimate on the rest
    leaveout <- data[index, ]
    estin <- data[setdiff(1:n, index), ]

    # feature selection and model fitting are repeated within each fold
    selectin <- selectvars(estin)
    fit <- runclassifier(estin, selectin)

    # evaluate on the held-out fold
    fitclassifier(fit, leaveout)

  })
  # average the performance estimates over the K folds
  rowMeans(cvests, na.rm = TRUE)

}

Bootstrapping

bootstrap <- function(B = 50, data){

  n <- nrow(data)

  bootests <- replicate(B, {

    # draw a bootstrap sample with replacement
    boot.dex <- sample(1:n, n, replace = TRUE)
    bin <- data[boot.dex, ]
    # the out-of-bag observations serve as the test set
    notbin <- setdiff(1:n, unique(boot.dex))
    leaveout <- data[notbin, ]

    # feature selection and model fitting use only the bootstrap sample
    selectin <- selectvars(bin)
    fit <- runclassifier(bin, selectin)

    fitclassifier(fit, leaveout)

  })

  # average over the B bootstrap replicates
  rowMeans(bootests)

}

Pre-validation

# the same placeholder helpers as above; 'over' is a K-fold partition of
# the row indices (K = 10 here is only an illustration)
n <- nrow(data)
over <- partition_index(data, K = 10)

y.hats <- lapply(over, function(oot.dex){

    # hold out one fold, select features and fit on the remaining folds
    leaveout <- data[oot.dex, ]
    estin <- data[setdiff(1:n, oot.dex), ]

    selectin <- selectvars(estin)
    fit <- runclassifier(estin, selectin)

    # pre-validated predictions: fitted values for the held-out subjects
    preval <- predict(fit, newdata = leaveout, type = "response")
    preval

  })
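To finish the sketch (assuming, as above, that over is the K-fold partition of row indices and that the binary outcome Y is a column of data): the out-of-fold predictions are reassembled in the original row order and then treated as a single, pre-specified covariate when estimating performance.

# reassemble the pre-validated predictions in the original row order
ord <- unlist(over)
prevalidated <- numeric(nrow(data))
prevalidated[ord] <- unlist(y.hats)

# e.g., association of the pre-validated signature with the outcome Y
summary(glm(Y ~ prevalidated, family = binomial, data = data))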

Cross-validation (schematic figure)

Common errors

  1. Incomplete/partial validation
    • Feature selection performed on full data
    • Then only regression model validated
  2. Resubstitution
    • Model built using training sample
    • Performance estimated using full data
  3. Resubstitution (2)
    • Model built using full data
    • Performance estimated using holdout sample

Numerical experiment

Signature estimation

Results: \(\phi = AUC\)
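The results themselves appear as figures; the following minimal stand-in (pure-noise features, a binary outcome, AUC as \(\phi\), with a simulation setup and fold counts that are assumptions rather than those behind the slides) illustrates the contrast between resubstitution and full cross-validation:

set.seed(1)
n <- 100; p <- 1000
x <- matrix(rnorm(n * p), n, p)   # pure noise features
y <- rbinom(n, 1, 0.5)            # outcome unrelated to x

# rank-based (Mann-Whitney) estimate of the AUC
auc <- function(pred, y) {
  r <- rank(pred)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# filter + combine with resubstitution: features selected and model
# evaluated on the same data -> badly optimistic AUC despite no signal
pvals <- apply(x, 2, function(xj) t.test(xj ~ y)$p.value)
keep  <- order(pvals)[1:10]
fit   <- glm(y ~ ., data = data.frame(y = y, x[, keep]), family = binomial)
auc(fitted(fit), y)

# full cross-validation: selection and fitting repeated within each fold
folds <- sample(rep(1:10, length.out = n))
cvpred <- numeric(n)
for (k in 1:10) {
  tr <- folds != k
  pv <- apply(x[tr, ], 2, function(xj) t.test(xj ~ y[tr])$p.value)
  kp <- order(pv)[1:10]
  fk <- glm(y ~ ., data = data.frame(y = y[tr], x[tr, kp]), family = binomial)
  cvpred[!tr] <- predict(fk, newdata = data.frame(x[!tr, kp, drop = FALSE]),
                         type = "response")
}
auc(cvpred, y)   # close to 0.5, the true discrimination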

Common source of trouble

Example 4: I decided on a statistical model in advance (e.g., the lasso) and did a training/validation split.

Performance wasn’t as good as I expected on the validation set, so I went back and used random forest.

Don’t do this! Use SuperLearner instead.

Censored data

Answer: my favorite model for continuous outcomes

Censoring is a problem, what can we do?

Pseudo-values

We know how to estimate the Kaplan-Meier curve. Let \(\hat{\theta} = \int g(x) \, d\hat{F}(x)\) denote our summary statistic of interest, where \(\hat{F}(x)\) equals 1 minus the Kaplan-Meier estimate.

Pseudo-value

The \(i\)th pseudo-value is \(\hat{\theta}_i = n \hat{\theta} - (n - 1) \hat{\theta}^{-i}\), where \(\hat{\theta}^{-i}\) is the same estimate computed with the \(i\)th observation left out.

Properties

Calculate \(\hat{\theta}_i\) for all \(i\) in the sample, and use them as the outcome in standard models for continuous data!
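A minimal sketch in R (the helper pseudo_surv is hypothetical) for one common choice of summary, the Kaplan-Meier survival probability at a fixed time t0:

library(survival)

# jackknife pseudo-values for the Kaplan-Meier survival probability
# at a fixed time t0 (hypothetical helper, not from the slides)
pseudo_surv <- function(time, status, t0) {
  n <- length(time)
  km_at <- function(idx) {
    fit <- survfit(Surv(time[idx], status[idx]) ~ 1)
    summary(fit, times = t0, extend = TRUE)$surv
  }
  theta_full <- km_at(seq_len(n))
  sapply(seq_len(n), function(i) {
    # i-th pseudo-value: n * theta_hat - (n - 1) * theta_hat^{-i}
    n * theta_full - (n - 1) * km_at(setdiff(seq_len(n), i))
  })
}

# use the pseudo-values as a continuous outcome in a standard model,
# e.g. with the lung data from the survival package:
# pv <- pseudo_surv(lung$time, lung$status == 2, t0 = 365)
# summary(lm(pv ~ age + sex, data = lung))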

What it looks like (pseudo-values)

Real data example

Zhu et al. (2010), Prognostic and Predictive Gene Signature for Adjuvant Chemotherapy in Resected Non-Small-Cell Lung Cancer.

Methods

library(Biobase)
library(GEOquery)

# load series and platform data from GEO
gset <- getGEO("GSE14814", GSEMatrix = TRUE)
# getGEO returns a list of ExpressionSets; keep the first element
gset <- gset[[1]]

Resubstitution estimates

Resubstitution continued

Pre-validated estimate

Pre-validated continued

Predictions of survival times

Multivariable modeling

Hazard ratios and 95% confidence intervals from separate Cox regression models that adjust for tumor histologic subtype, stage, age, and sex.

Method                  Comparison                   Hazard Ratio   95% CI           Adjusted p
Partial resubstitution  High Risk vs Low Risk        38.9           9.2 to 164.7     < 0.001
Partial resubstitution  Trt/Risk interaction         14.7           3.2 to 67.0      < 0.001
Partial resubstitution  High Risk vs Low Risk (SL)   147.2          16.5 to 1316.9   < 0.001
Partial resubstitution  Trt/Risk interaction (SL)    77.6           7.8 to 774.9     < 0.001
Pre-validation          High Risk vs Low Risk        1.9            0.8 to 4.3       0.122
Pre-validation          Trt/Risk interaction         1.8            0.5 to 6.5       0.395
Pre-validation          High Risk vs Low Risk (SL)   0.7            0.3 to 1.7       0.49
Pre-validation          Trt/Risk interaction (SL)    0.7            0.2 to 2.5       0.571
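A sketch of the kind of adjusted Cox models behind the table (all variable names are assumptions): the dichotomized risk group, its interaction with treatment, and the clinical covariates.

library(survival)

# hypothetical variables in data frame d:
#   os_time, os_event = overall survival time and event indicator
#   risk = High vs Low risk group from the signature, trt = treatment arm
coxph(Surv(os_time, os_event) ~ risk + histology + stage + age + sex, data = d)
coxph(Surv(os_time, os_event) ~ risk * trt + histology + stage + age + sex, data = d)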

Summary

Modeling

Conclusions

Slides available at https://sachsmc.github.io/Talks/gsurg-talk