Statistical Issues in Developing Multivariable Biomarker Signatures
Molecular Medicine and Surgery
Michael C Sachs
18 May 2016
Biomarker Signatures
A biomarker signature is a transformation of multiple individual features to a one-dimensional space.
- A signature that reliably predicts an outcome may be useful for treatment selection or prognosis.
- A signature that discriminates between groups that would be treated differently may be clinically useful.
- A signature may be continuous, binary, or take multiple discrete values.
Outline
Issues in development and evaluation of models for prediction
- Modeling methods, SuperLearner
- Pitfalls in evaluation/validation
- Considerations for censored outcomes (e.g., survival)
- Example
Decisions, Decisions
- Do I include all features or a subset?
- How do I combine the features?
- Transformations?
- Weight and combine with regression coefficients?
- What coefficients?
- Thresholds/Cutoffs before or after combining?
Signature Development
Let \(X\) denote the set of \(p\) features. The signature is an unknown function
\[
f: \mathbb{R}^p \to \mathbb{R}
\]
- Development Phase Goals:
- Estimate \(f\) based on some training data
- Based on association with outcome \(Y\)
- Provide a valid estimate of the performance of the procedure by which \(f\) is estimated
- Depends on the true signal in the data
- and the manner in which \(f\) is estimated
- Provide a specification of \(f\) for others to use (optional)
Methods for signature development
Broad classes of methods
- Filter + combine
- Regularization
- Other
Dataset includes features \(X_{ij}\), outcomes \(Y_i\), and treatment indicators \(Z_i\), for \(i = 1, \ldots, n\), \(j = 1, \ldots, p\)
The \(p > n\) problem
Filter + combine
General idea:
- Number of features, \(p\), is too large to incorporate all of the \(X_j\) values into a multivariable prognostic model
- Select only those genes with a promising univariate association with \(Y\).
- First step: for each \(j = 1, \ldots, p\), calculate the two-sample t-statistic comparing the values of \(X_j\) between the two \(Y\) groups
- Then, select only the top 5% of t-statistics in absolute value.
- Second step, those top 5% are then included in a multivariable regression model to predict \(Y\)
Variations:
- Statistic used for comparison
- Threshold for inclusion
- Regression model used to combine
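The two-step recipe can be sketched in a few lines of numpy. This is an illustrative toy, not a recommended implementation: the function name `filter_and_combine` is mine, and ordinary least squares stands in for whatever multivariable regression model is actually used.

```python
import numpy as np

def filter_and_combine(X, y, top_frac=0.05):
    """Sketch of filter + combine: rank features by the two-sample
    t-statistic comparing X_j between the Y groups, keep the top 5%,
    then combine them with a regression model (here, least squares
    as a stand-in)."""
    g0, g1 = X[y == 0], X[y == 1]
    n0, n1 = len(g0), len(g1)
    # Welch two-sample t-statistic for each feature
    t = (g1.mean(axis=0) - g0.mean(axis=0)) / np.sqrt(
        g1.var(axis=0, ddof=1) / n1 + g0.var(axis=0, ddof=1) / n0)
    k = max(1, int(top_frac * X.shape[1]))
    keep = np.argsort(-np.abs(t))[:k]  # indices of the top |t| features
    Xk = np.column_stack([np.ones(len(y)), X[:, keep]])
    beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    # the signature f: maps a feature matrix to one-dimensional scores
    def f(Xnew):
        return np.column_stack([np.ones(len(Xnew)), Xnew[:, keep]]) @ beta
    return f, keep

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 200))
y = (X[:, 0] + rng.normal(size=100) > 0).astype(float)  # feature 0 carries signal
f, keep = filter_and_combine(X, y)
scores = f(X)  # one-dimensional signature values
```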
Regularization
Least squares:
\[
\sum_{i=1}^n(Y_i - X_i^T\beta)^2.
\]
Penalized least squares:
\[
\sum_{i=1}^n(Y_i - X_i^T\beta)^2 + h(\beta, \lambda),
\]
where \(h\) is a penalty function depending on fixed penalty parameters \(\lambda\).
Bias-variance tradeoff
- A bit of bias can improve predictions
Variable selection
- Different \(h\) have different nice properties
- Lasso: \(h(\beta, \lambda) = \lambda \sum_{j=1}^p |\beta_j|\)
- Ridge: \(h(\beta, \lambda) = \lambda \sum_{j=1}^p (\beta_j)^2\)
- Elastic net: \(\lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p(\beta_j)^2\)
Works for many regression models
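Of these penalties, ridge has a closed form, which makes penalized least squares and the bias-variance tradeoff easy to see in code. A minimal numpy sketch (lasso and elastic net require iterative solvers such as coordinate descent and are not shown):

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize sum_i (y_i - x_i' beta)^2 + lam * sum_j beta_j^2.
    Closed form: beta = (X'X + lam * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 50, 200  # p > n: ordinary least squares has no unique solution
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = X @ beta_true + rng.normal(size=n)

b_small = ridge(X, y, lam=0.1)
b_large = ridge(X, y, lam=100.0)  # heavier penalty shrinks coefficients toward zero
```

Note that the penalized estimate exists even when \(p > n\), since \(X^TX + \lambda I\) is invertible for any \(\lambda > 0\).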
Other
- Clustering/Principal components analysis
- Regression trees
- Other regression models
- Stacking/ensemble methods
Stacking/ensemble methods
- Unless you generated the data, it is impossible to predict which prediction development algorithm will work best
- Valid performance estimation requires specifying the method before looking at the data, which makes choosing a single method a priori difficult
- instead, use them all and average the results!
Define a library of algorithms, each of which gives you a way to estimate a predictive model.
Superlearner
On a V-fold cross-validation partition of the data,
- Fit each algorithm separately, on the training portion. Get V model fits for each algorithm
- Obtain predictions for \(Y\) on the validation set for each fold, get predictions for each subject and algorithm
- Compute the cross-validated performance for each algorithm
- Regress the outcome \(Y\) on the set of cross validated predictions to obtain optimal weights
- Fit each algorithm on the full data, combine with the optimal weights estimated in step 4.
Asymptotically performs at least as well as the best individual algorithm in the library
Polley, Eric C. and van der Laan, Mark J., “Super Learner In Prediction” (May 2010). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 266. https://biostats.bepress.com/ucbbiostat/paper266
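The five steps above can be sketched as follows. This is a toy stacking illustration, not the SuperLearner R package: the library holds just two learners (the overall mean and ordinary least squares), and the weights come from a least-squares fit of \(Y\) on the cross-validated predictions, clipped to be nonnegative; all helper names are mine.

```python
import numpy as np

def cv_stack(X, y, learners, V=5, seed=0):
    """Toy stacking: V-fold CV predictions per learner, weights from
    regressing y on those predictions, final fit on the full data."""
    n = len(y)
    fold = np.random.default_rng(seed).permutation(n) % V
    Z = np.zeros((n, len(learners)))  # cross-validated predictions (steps 1-2)
    for v in range(V):
        tr, va = fold != v, fold == v
        for j, learn in enumerate(learners):
            Z[va, j] = learn(X[tr], y[tr])(X[va])
    # step 4: weights from regressing y on the CV predictions
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    w = np.clip(w, 0, None)
    w = w / w.sum()
    full = [learn(X, y) for learn in learners]  # step 5: refit on all data
    return (lambda Xn: sum(wj * f(Xn) for wj, f in zip(w, full))), w

def mean_learner(X, y):
    m = y.mean()
    return lambda Xn: np.full(len(Xn), m)

def ols_learner(X, y):
    Xi = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ b

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=200)
predict, w = cv_stack(X, y, [mean_learner, ols_learner])
```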
Considerations for predictive signatures
Looking for a strong and qualitative interaction with \(Z\)
- Expand design matrix \(X\) to include all treatment by feature interactions \(X_{ij} \times Z_i\)
- Grouped lasso: include interaction terms and force in the main effects
Presence of interaction does not guarantee clinical utility
You need to be able to compute the predictions for people who haven’t yet received treatment!
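Expanding the design matrix with treatment-by-feature interactions is one line of numpy; a sketch (the grouped-lasso fit itself is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 4
X = rng.normal(size=(n, p))     # features
Z = rng.integers(0, 2, size=n)  # 0/1 treatment indicator

# columns: treatment main effect, p feature main effects,
# then the p treatment-by-feature interactions X_ij * Z_i
design = np.column_stack([Z, X, X * Z[:, None]])
```

A grouped penalty would then shrink the interaction block while leaving the main effects unpenalized.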
Examples of “interacting with” the development sample \(S\)
- Working on a genomic classifier for binary \(Y\):
- I test it out on \(S\), and take a look at the individual-level accuracies \(\phi(s_i)\) of a classifier \(f(x_i)\)
- For \(i\) where \(\phi(s_i)\) is poor, manually change the value of \(y_i\).
- Developing a predictor for binary \(Y\):
- Test the association of each \(X_j\) with \(Y\) using t-test on \(S\).
- Select the 50 most significant
- Put them all in a regression model and test on \(S\)
- Developing a classifier
- Select 50 most significant genes using \(S\)
- Split \(S\) into \(S_h\) and \(S_t\)
- Put them in a regression model estimated using \(S_t\)
- Test it on \(S_h\)
- Developing predictive signature
- Split into \(S_t\) and \(S_h\)
- Build clustering model on \(S_t\)
- Test performance on \(S_h\)
- Performance isn’t as good as I expected
- Go back to \(S_t\) and try again using a different approach
Which ones give valid estimates?
An analogy
Let’s say I’m conducting a randomized clinical trial testing a new treatment versus a placebo. Partway through, I look at the outcomes and adjust the analysis until the results favor the treatment.
Obviously not OK. In clinical trials we have pre-registration to prevent exactly this.
- With signatures, often we use \(S\) to develop \(f\), and thus \(\phi_f\).
- Adapting to the features in \(S\), not necessarily the distribution that generated \(S\)
Overfitting problem
Remedies
- Split-sample
- Cross-validation
- Bootstrapping
- Pre-validation
Split sample
- Partition \(S\) into \(S_t\) and \(S_h\) with sample sizes \(n_t\) and \(n_h\)
- Hide \(S_h\) from yourself
- Generate an \(f_t\) using \(S_t\) only
- Estimate is \(E[\phi_{f_t}(S_h)]\)
- Error for the fixed \(f_t\)
Cross validation
- Leave out sample \(i\) and estimate \(f\) using \(S_{-i}\)
- Get single estimate \(\phi_f(s_i)\)
- Repeat a number of times
- Average the single estimates
- Generalization error for the class \(\mathcal{F}\)
Bootstrapping
- Randomly sample \(S_b\) from \(S\), with replacement
- Derive \(f\) using \(S_b\)
- Estimate using samples not selected \(S_{-b}\)
- Repeat and average
- Generalization error for the class \(\mathcal{F}\)
Pre-validation
- Leave out sample \(i\) and estimate \(f\) using \(S_{-i}\): \(\hat{f}_{-i}\)
- Get single fitted value \(\hat{y}_i = \hat{f}_{-i}(s_i)\)
- Repeat for all \(i\)
- Compare \(\hat{y}_i\) with \(y_i\)
- Generalization error for the class \(\mathcal{F}\)
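Pre-validation in code, as a sketch with ordinary least squares standing in for the signature-building algorithm (function names are mine):

```python
import numpy as np

def prevalidate(X, y, fit):
    """Leave out sample i, fit on the rest, record the held-out
    fitted value f_{-i}(x_i); repeat for all i."""
    n = len(y)
    yhat = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        f = fit(X[keep], y[keep])
        yhat[i] = f(X[i:i + 1])[0]
    return yhat

def ols(X, y):
    Xi = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ b

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X[:, 0] + rng.normal(size=60)
yhat = prevalidate(X, y, ols)  # compare yhat honestly against y
```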
Common errors
- Incomplete/partial validation
- Feature selection performed on full data
- Then only regression model validated
- Resubstitution
- Model built using training sample
- Performance estimated using full data
- Resubstitution (2)
- Model build using full data
- Performance estimated using holdout sample
Numerical experiment
- Data were generated with \(n\) samples, each with a binary outcome \(Y\) with prevalence 0.3
- \(p\) features sampled from the standard normal distribution.
- This is the null case where no features are associated with \(Y\).
Signature estimation
- Each feature is regressed against \(Y\) in a univariate logistic regression model.
- The 25 features with the smallest p-values are selected
- Logistic regression model defines our final signature.
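A hedged reconstruction of this experiment in numpy, with two simplifications relative to the description above: the univariate logistic p-value ranking is replaced by ranking on absolute correlation with \(Y\), and the honest performance estimate comes from an independent holdout sample rather than cross-validation.

```python
import numpy as np

rng = np.random.default_rng(2016)
n, p, k = 100, 1000, 25

def make_data(n):
    """Null case: features are standard normal, independent of Y."""
    return rng.normal(size=(n, p)), (rng.random(n) < 0.3).astype(int)

def develop(X, y):
    """Keep the k features most correlated with Y, combine by least squares."""
    r = np.abs(np.corrcoef(X.T, y)[-1, :-1])
    keep = np.argsort(-r)[:k]
    Xk = np.column_stack([np.ones(len(y)), X[:, keep]])
    b, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn[:, keep]]) @ b

def auc(scores, y):
    """P(random case scores above random control); ties count half."""
    d = scores[y == 1][:, None] - scores[y == 0][None, :]
    return (d > 0).mean() + 0.5 * (d == 0).mean()

Xtr, ytr = make_data(n)
Xho, yho = make_data(n)
f = develop(Xtr, ytr)
resub_auc = auc(f(Xtr), ytr)   # optimistic: evaluated on the development data
honest_auc = auc(f(Xho), yho)  # holdout: hovers near 0.5, the true value
```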
Results: \(\phi = AUC\)
Common source of trouble
Example 4: I decided on a statistical model in advance (e.g., lasso), did training/validation split.
Performance wasn’t as good as I expected on the validation set, so I went back and used random forest.
Don’t do this! Use SuperLearner instead.
Censored data
- Let’s say your outcome of interest is a time to some event (e.g., death, cancer recurrence)
- If the outcome is not censored in your sample, what sort of prediction model would you use?
Answer: my favorite model for continuous outcomes
Censoring is a problem, what can we do?
Pseudo-values
We know how to estimate the Kaplan-Meier curve. Let \(\hat{\theta} = \int g(t) \, d\hat{F}(t)\) denote our summary statistic of interest, where \(\hat{F}(t)\) equals 1 minus the Kaplan-Meier estimate.
Pseudo-value
The \(i\)th pseudo-value is \(\hat{\theta}_i = n \hat{\theta} - (n - 1) \hat{\theta}^{-i}\), where \(\hat{\theta}^{-i}\) is the estimate computed with sample \(i\) left out
Properties
- By construction: \(E(\hat{\theta}_i) = E(\hat{\theta}) = \theta\)
- Furthermore: \(E(\hat{\theta}_i \mid X_i)\) is asymptotically unbiased for the conditional parameter
Calculate \(\hat{\theta}_i\) for all \(i\) in the sample, and use them as the outcome in standard models for continuous data!
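A sketch of the construction with my own helper names: a minimal Kaplan-Meier estimate of \(\theta = S(t_0)\) (tie-handling conventions glossed over), then the jackknife-style pseudo-values.

```python
import numpy as np

def km_surv(time, event, t0):
    """Minimal Kaplan-Meier estimate of S(t0). event = 1 death, 0 censored."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    s, at_risk = 1.0, len(time)
    for i in range(len(time)):
        if time[i] > t0:
            break
        if event[i] == 1:
            s *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return s

def pseudo_values(time, event, t0):
    """i-th pseudo-value: n * theta_hat - (n - 1) * theta_hat^{-i}."""
    n = len(time)
    theta = km_surv(time, event, t0)
    idx = np.arange(n)
    return np.array([
        n * theta - (n - 1) * km_surv(time[idx != i], event[idx != i], t0)
        for i in range(n)])

rng = np.random.default_rng(0)
t = rng.exponential(size=40)
e = np.ones(40, dtype=int)  # no censoring: pseudo-values reduce to I(T_i > t0)
pv = pseudo_values(t, e, t0=1.0)
```

With no censoring the pseudo-value for \(S(t_0)\) equals the indicator \(I(T_i > t_0)\) exactly, a handy sanity check; with censoring the pseudo-values can fall outside \([0, 1]\) but retain the right conditional mean.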
Real data example
Zhu et al. (2010) Prognostic and Predictive Gene Signature for Adjuvant Chemotherapy in Resected Non small-Cell Lung Cancer.
- JBR.10 trial, RCT of adjuvant vinorelbine/cisplatin (ACT) versus observation alone (OBS) in 482 participants with non small cell lung cancer (NSCLC).
- Of those 482 participants, 169 had frozen tissue collected, and of those samples, 133 (71 in ACT and 62 in OBS) had gene-expression profiling done using U133A oligonucleotide microarrays.
- Aim to develop multi-gene signature that strongly predicts prognosis, and the hypothesis was that the poor prognosis subgroup would benefit more from ACT compared to the good prognosis subgroup.
- The signature was trained to predict disease specific survival in the OBS group.
- The authors mainly focus on the discrimination ability of their estimated signature.
- They do not directly address calibration, that is, whether the signature accurately predicts survival times.
Methods
- Pre-processing to remove batch effects
- Gene selection step with univariate Cox regression models
- Genes with univariate p-values less than 0.005 were selected
- Each gene was weighted by its univariate Cox regression coefficient, and the resulting weighted gene expression values summed to form risk scores.
- Genes were then added in a forward-selection manner: starting from the most significant genes, at each step the gene that most improved the concordance between survival times and the risk score was added; the process stopped when no gene improved the concordance.
- The final list of selected genes were all included in a multivariable Cox regression model to fit the final risk score.
- The cutoff that yielded the smallest log-rank statistic p-value was used to dichotomize into two risk groups.
- Compare to SuperLearner model using pseudo-values for restricted mean at 8 years
Resubstitution estimates
Resubstitution continued
Pre-validated estimate
Pre-validated continued
Predictions of survival times
Multivariable modeling
Hazard ratios and 95% confidence intervals from separate Cox regression models that adjust for tumor histologic subtype, stages, age, and sex.
| Estimate | Comparison | HR | 95% CI | p-value |
|---|---|---|---|---|
| Partial resubstitution | High Risk vs Low Risk | 38.9 | 9.2 to 164.7 | < 0.001 |
| Partial resubstitution | Trt/Risk interaction | 14.7 | 3.2 to 67.0 | < 0.001 |
| Partial resubstitution | High Risk vs Low Risk (SL) | 147.2 | 16.5 to 1316.9 | < 0.001 |
| Partial resubstitution | Trt/Risk interaction (SL) | 77.6 | 7.8 to 774.9 | < 0.001 |
| Pre-validation | High Risk vs Low Risk | 1.9 | 0.8 to 4.3 | 0.122 |
| Pre-validation | Trt/Risk interaction | 1.8 | 0.5 to 6.5 | 0.395 |
| Pre-validation | High Risk vs Low Risk (SL) | 0.7 | 0.3 to 1.7 | 0.49 |
| Pre-validation | Trt/Risk interaction (SL) | 0.7 | 0.2 to 2.5 | 0.571 |
Summary
- Omics-based signatures are particularly vulnerable to overfitting due to high dimensional data
- Potential for model complexity is high
- Select relevant features with true associations
- Avoid fitting to random noise in the observations
- Use one of the many approaches to get valid performance estimates
Modeling
- Pseudo-values turn censored problems into standard continuous problems
- Use stacking/SuperLearner to avoid making impossible decisions about methods
Conclusions
- Be skeptical of reports of extreme hazard ratios or perfect prediction, especially if the signature development process was complex
- Key details often buried in methods, supplementary materials, code, etc.
- Rare to find study that makes data, code, methods public
Slides available at https://sachsmc.github.io/Talks/gsurg-talk