Reproducible Research in Data Analysis

Michael Sachs

2018-01-25

About me

Interests:

At the lab?

[Image: lab1]

At the computer?

[Image: lab1, repeated four times]

Scientific Pathway

[Figure: knit-flow1]

Scientific Cycle

[Figure: knit-flow2]

Where do we (statisticians) fit in?

[Figure: knit-flow3]

A mathematical framework for data analysis

Notation

Example 1

Example 2

Evaluating performance

However, if the selection of \(f\) from \(\mathcal{F}\) involves interacting with \(S\), then the estimate/inference may be biased.

Examples of “interacts with \(S\)”

  1. Working on a genomic classifier for binary \(Y\):
    • I test it out on \(S\), and take a look at the individual-level accuracies \(\phi(s_i)\) of a classifier \(f(x_i)\)
    • For \(i\) where \(\phi(s_i)\) is poor, manually change the value of \(y_i\).
  2. Developing a predictor for binary \(Y\):
    • Test the association of each \(X_j\) with \(Y\) using t-test on \(S\).
    • Select the 50 most significant
    • Put them all in a regression model and test on \(S\)
  3. Developing predictive signature
    • Split into \(S_t\) and \(S_h\)
    • Build clustering model on \(S_t\)
    • Test performance on \(S_h\)
    • Performance isn’t as good as I expected
    • Go back to \(S_t\) and try again using a different approach

Which ones give valid estimates?

All aspects of \(\mathcal{F}\) should be documented and reported, i.e., how is \(f\) selected?
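Example 2 above is the classic case of this bias. A minimal simulation sketch (hypothetical, not from the talk) shows how selecting features and evaluating on the same \(S\) looks impressive even when \(X\) carries no information about \(Y\):

```r
## Hypothetical simulation: Y is pure noise, yet "select the 50 most
## significant features on S, then test on S" yields near-perfect accuracy.
set.seed(2018)
n <- 100
p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)  # no feature is truly predictive of y

## t-test each column against y, keep the 50 smallest p-values
pvals <- apply(X, 2, function(x) t.test(x ~ y)$p.value)
top50 <- order(pvals)[1:50]

## fit a logistic model on the selected features, evaluate on the SAME data
fit <- glm(y ~ X[, top50], family = binomial)
apparent_acc <- mean((fitted(fit) > 0.5) == y)
apparent_acc  # well above the 0.5 achievable on truly null data
```

Repeating the last two steps on an independent sample drawn from the same null model brings the accuracy back to about 0.5, which is the point of keeping the selection of \(f\) away from \(S\).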

Tools to make our lives better

[Figure: knit-flow4]

Data analysis

[Figure: analysis]

Goals:

Knitr

Motivation

History

The importance of text

Markdown

Plain-text formatting: indicate what elements represent, not how they should look. Minimal yet flexible; HTML and LaTeX commands are interpreted correctly.

[Image: markdown sample]

Markdown specs

Incorporating code chunks

Three backticks:

```{r my-first-chunk, results='asis'}
## code goes in here and gets evaluated
```

Inline code uses single backticks

Here I am using `#r rnorm(1)` to generate a random number: -0.89047. (Omit the pound sign; it is there only so this example itself is not evaluated.)

Results, raw output

Raw output using the mtcars dataset:

```{r mtcars-example}
summary(lm(mpg ~ hp + wt, data = mtcars))
```
## 
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Goals

Generate every in-text number from code.

paste_meansd <- function(x, digits = 2, na.rm = TRUE){
  paste0(round(mean(x, na.rm = na.rm), digits), 
         " (", round(sd(x, na.rm = na.rm), digits), ")")  
}

## The mean (sd) of a random sample of normals is `r paste_meansd(rnorm(100))`

The mean (sd) of a random sample of normals is -0.12 (0.99)

sprint_CI95 <- function(mu, se) {
  lim <- mu + c(-1.96, 1.96) * se
  sprintf("%.2f (95%% CI: %.2f to %.2f)", mu, lim[1], lim[2])
}
bfit <- lm(hp ~ disp, mtcars)
## The coefficient estimate is `r sprint_CI95(bfit$coeff[2], sqrt(diag(vcov(bfit)))[2])`

The coefficient estimate is 0.44 (95% CI: 0.32 to 0.56).

Getting started

Limitations

git and Github

Example 1: Reverting changes

A recent paper I worked on used data from a disease registry, which released “frozen” databases quarterly. While we were working on the revisions, a new database was released. I used the new database to update the analysis because it contained the most reliable and up-to-date information. After completing the revisions, I received this email from the lead author (this was in 2013, by the way):

“As you can see from the paper I sent you, it is almost complete and I do not want to re-write it. Therefore, I just want the data described in the e-mail below from the June 1, 2011 data freeze. … Is it possible to reconstruct the data inquiry as per what was originally delivered?”

I had not saved prior versions of the analysis code, not to mention the manuscript with all of the results incorporated into the text. My only option at that point was to start over.
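With version control, that request becomes a checkout instead of a do-over. A hedged sketch (the tag and file names are hypothetical, not from the actual project):

```shell
# Tag the state of the analysis at each data freeze
git tag freeze-2011-06-01

# Months later, recover the analysis exactly as it was at that freeze
git checkout freeze-2011-06-01 -- analysis.Rmd

# Or browse the history to find the right commit
git log --oneline -- analysis.Rmd
```

Because the knitr source regenerates every number in the manuscript, checking out the old code and re-knitting reproduces the old results as well.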

Example 2: Incorporating edits on a manuscript

Applied papers that I’ve worked on had between 5 and 13 authors. Inevitably, a “final” draft of the manuscript (usually a Word document) gets circulated via email and comments or suggestions are solicited. Here are the typical types of responses that I get:

The challenge is to incorporate (or not) all of the changes from a variety of collaborators, while keeping a record of who has contributed what.

Example 3: Sharing content

Once a paper gets published, occasionally people want to use or extend the method.

“I would be very grateful if you are able to help me implement this tool in my dataset as well.”

“Could you please send me your code so that I can try to apply it to my example?”

“Would you please kindly e-mail me your article and other works in that field promptly?”

Email is an ineffective tool for sharing code, data, and documents.

What is it?

git

Github

Summary

Final thoughts

Programs are meant to be read by humans and only incidentally for computers to execute.
    -Donald Knuth
Your closest collaborator is you six months ago, but you don’t reply to emails.
    -Paul Wilson

References

  1. The Economist ‘Trouble at the Lab’ 2013 Oct 19; available at http://go.nature.com/dstij3.
  2. Baggerly, Keith A., and Kevin R. Coombes. “Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology.” The Annals of Applied Statistics (2009): 1309-1334.
  3. Witten, Daniela M., and Robert Tibshirani. “Scientific research in the age of omics: the good, the bad, and the sloppy.” Journal of the American Medical Informatics Association 20(1) (2013): 125-127.
  4. Ziemann, Mark, Yotam Eren, and Assam El-Osta. “Gene name errors are widespread in the scientific literature.” Genome Biology 17(1) (2016): 177.
  5. Eklund, Anders, Thomas E. Nichols, and Hans Knutsson. “Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates.” Proceedings of the National Academy of Sciences (2016): 201602413.
  6. Leisch, Friedrich. “Sweave: Dynamic generation of statistical reports using literate data analysis.” In Compstat, pp. 575-580. Physica-Verlag HD, 2002.
  7. Xie, Yihui. Dynamic Documents with R and knitr. Vol. 29. CRC Press, 2015.
  8. Sachs, Michael C., and Lisa M. McShane. “Issues in developing multivariable molecular signatures for guiding clinical care decisions.” Journal of Biopharmaceutical Statistics just-accepted (2016).
  9. Rmarkdown, Rstudio. https://rmarkdown.rstudio.com
  10. Wilson, Greg, Bryan, Jennifer, Cranston, Karen, et al. “Good Enough Practices in Scientific Computing” (2016). https://arxiv.org/pdf/1609.00037v1.pdf
  11. Patil, Prasad, Peng, Roger D., and Jeffrey Leek. “A statistical definition for reproducibility and replicability” (2016), doi: http://dx.doi.org/10.1101/066803.

Contact

https://sachsmc.github.io

michael.sachs@ki.se