June 10, 2014
Goal: avoid making arbitrary/data-driven decisions that go undocumented or unreported
knit (weave) data analysis code to the report
track versions and collaborate over countless iterations
Programs are meant to be read by humans and only incidentally for computers to execute.
knit
to create results.doc
\(\rightarrow\) .docx
Plain-text formatting. Indicate what elements represent, not how they should look. Minimal, yet flexible, html is interpreted correctly.
Three backticks:
```{r my-first-chunk, results='asis'} ## code goes in here and gets evaluated ```
See http://yihui.name/knitr/options#chunk_options for all available options.
Inline code uses single backticks
Here I am using `#r rnorm(1)`
to generate a random digit: 0.2392. (Omit the pound sign)
Raw output using the mtcars
dataset:
```{r mtcars-example} summary(lm(mpg ~ hp + wt, data = mtcars)) ```
## ## Call: ## lm(formula = mpg ~ hp + wt, data = mtcars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.941 -1.600 -0.182 1.050 5.854 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 37.22727 1.59879 23.28 < 2e-16 *** ## hp -0.03177 0.00903 -3.52 0.0015 ** ## wt -3.87783 0.63273 -6.13 1.1e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.59 on 29 degrees of freedom ## Multiple R-squared: 0.827, Adjusted R-squared: 0.815 ## F-statistic: 69.2 on 2 and 29 DF, p-value: 9.11e-12
Since markdown interprets html, we can use kable
to generate html tables from R
```{r table-example, results='asis'} kable(head(mtcars)) ```
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
```{r mt-plot} library(ggplot2) ggplot(mtcars, aes(y = mpg, x = wt, size = hp)) + geom_point() + stat_smooth(method = "lm", se = FALSE) ```
R
outputThere are three functions I'm aware of that will help create output tables.
kable
in the knitr
package
xtable
in the xtable
package
xtable()
others to print.xtable()
stargazer
in the stargazer
package
kable
example```{r kable, results = 'asis'} kable(head(mtcars), digits = 2, align = c(rep("l", 4), rep("c", 4), rep("r", 4))) ```
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.88 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.21 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
xtable
example```{r xtable, results = 'asis'} library(xtable) print(xtable(head(mtcars, 3)), type = "html") ```
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.00 | 6.00 | 160.00 | 110.00 | 3.90 | 2.62 | 16.46 | 0.00 | 1.00 | 4.00 | 4.00 |
Mazda RX4 Wag | 21.00 | 6.00 | 160.00 | 110.00 | 3.90 | 2.88 | 17.02 | 0.00 | 1.00 | 4.00 | 4.00 |
Datsun 710 | 22.80 | 4.00 | 108.00 | 93.00 | 3.85 | 2.32 | 18.61 | 1.00 | 1.00 | 4.00 | 1.00 |
stargazer
example```{r star, results = 'asis', warning=FALSE, message=FALSE} library(stargazer, quietly = TRUE) fit1 <- lm(mpg ~ wt, mtcars) fit2 <- lm(mpg ~ wt + hp, mtcars) fit3 <- lm(mpg ~ wt + hp + disp, mtcars) stargazer(fit1, fit2, fit3, type = 'html') ```
Dependent variable: | |||
mpg | |||
(1) | (2) | (3) | |
wt | -5.344*** | -3.878*** | -3.801*** |
(0.559) | (0.633) | (1.066) | |
hp | -0.032*** | -0.031** | |
(0.009) | (0.011) | ||
disp | -0.001 | ||
(0.010) | |||
Constant | 37.280*** | 37.230*** | 37.110*** |
(1.878) | (1.599) | (2.111) | |
Observations | 32 | 32 | 32 |
R2 | 0.753 | 0.827 | 0.827 |
Adjusted R2 | 0.745 | 0.815 | 0.808 |
Residual Std. Error | 3.046 (df = 30) | 2.593 (df = 29) | 2.639 (df = 28) |
F Statistic | 91.380*** (df = 1; 30) | 69.210*** (df = 2; 29) | 44.570*** (df = 3; 28) |
Note: | p<0.1; p<0.05; p<0.01 |
Generate every in-text number from code. paste
and sprintf
are my friends.
paste_meansd <- function(x, digits = 2, na.rm = TRUE){ paste0(round(mean(x, na.rm = na.rm), digits), " (", round(sd(x, na.rm = na.rm), digits), ")") } ## The mean (sd) of a random sample of normals is `r paste_meansd(rnorm(100))`
The mean (sd) of a random sample of normals is 0.17 (1.01)
sprint_CI95 <- function(mu, se, trans = function(x) x) { lim <- trans(mu + c(-1.96, 1.96)*se) sprintf("%.2f (95%% CI: %.2f to %.2f)", mu, lim[1], lim[2]) } bfit <- lm(hp ~ disp, mtcars) ## The coefficient estimate is `r sprint_CI95(bfit$coeff[2], sqrt(diag(vcov(bfit)))[2])`
The coefficient estimate is 0.44 (95% CI: 0.32 to 0.56).
The important ones:
dev
, graphics device. E.g. pdf, png, check out tikzDevice
if you are creating pdf output:
path
where to save the plot filesfig_width
, fig_height
, in inches. Can also be set globally.fig_align
, left, right or centerOther options
include = FALSE
evaluates code but doesn't include anythingecho = FALSE
don't display resultswarning = FALSE
don't display warningscache = TRUE
cache results for long-running stuffOutput and rendering can be customized endlessly. knitr
is written in R
to process chunks, so write your own functions. These types of functions are called "hooks". For example, in this document I used a custom hook to display the code chunks as they appear:
knit_hooks$set(source = function(x, options){ if (!is.null(options$verbatim) && options$verbatim){ opts = gsub(",\\s*verbatim\\s*=\\s*TRUE\\s*", "", options$params.src) bef = sprintf('\n\n ```{r %s}\n', opts, "\n") stringr::str_c( bef, knitr:::indent_block(paste(x, collapse = '\n'), " "), "\n ```\n" ) } else { stringr::str_c("\n\n```", tolower(options$engine), "\n", paste(x, collapse = '\n'), "\n```\n\n" ) } })
Credit: Ramnath Vaidyanathan
knit
documentspandoc
pdf_document
, word_document
, html_document
beamer_presentation
, ioslides_presentation
bibliography: mybib.bib
@paperkey
--- title: "Some Tools for Reproducible Research" author: "Michael Sachs" date: "June 10, 2014" output: ioslides_presentation ---
Full source for this presentation available http://github.com/sachsmc/knit-git-markr-guide
Your closest collaborator is you six months ago, but you don't reply to emails.
git
and Githubgit
and http://github.com make version control and collaboration easierA recent paper I worked on used data from a disease registry, which released "frozen" databases quarterly. While working on the revisions, a new database was released. I used to new database to update the analysis because it contained the most reliable and up to date information. After completing the revisions, I received this email from the lead author (this was in 2013 btw):
"As you can see from the paper I sent you, it is almost complete and I do not want to re-write it. Therefore, I just want the data described in the e-mail below from the June 1, 2011 data freeze. … Is it possible to reconstruct the data inquiry as per what was originally delivered?"
I had not saved prior versions of the analysis code, not to mention the manuscript with all of the results incorporated into the text. My only option at that point was to start over.
Another statistician took over an analysis for a manuscript that I thought was complete. Here is a real email exchange
"The manuscript says that out of 1350 participants there were 411 with incident AFib. However, when I run your code to create the dataset, I only end up with 232. The AFib data came from a file called 'mathew_main'. Did you use anything else to get the extra AFib cases?"
My response:
"I'm not 100% sure about anything without looking at my code. Which file are you going on? There should be a dated .Rnw file my Afib folder that contains all the analysis code. I believe that calls an R script called "load-data-chs.R". The mathew_main file does not sound familiar, I suspect that is from a very old version of the paper. "
:(
Applied papers that I've worked on had between 5 and 13 authors. Inevitably, a "final" draft of the manuscript (usually a Word document) gets circulated via email and comments or suggestions are solicited. Here are the typical types of responses that I get:
The challenge is to incorporate (or not) all of the changes from a variety of collaborators, while keeping a record of who has contributed what.
Once a paper gets published, occasionally people want to use or extend the method.
"I would be very grateful if you are able to help me implement this tool in my dataset as well."
"Could you please send me your code so that I can try to apply it to my example? "
"Would you please kindly e-mail me your article and other works in that field promptly.?"
Email is an ineffective tool for sharing code, data, documents
git
Begin tracking using git init
. This creates a hidden file that tells git
what to track.
mkdir my-first-repo cd my-first-repo git init
Now add content. Copy over documents or code, or create new files.
echo "This is a *readme* file for my-first-repo" > readme.md
Check the status of the tracker by running git status
. It will inform you that a file has been changed but it is not tracked:
On branch master Initial commit Untracked files: (use "git add <file>..." to include in what will be committed) readme.md nothing added to commit but untracked files present (use "git add" to track)
Track the new files (the .
means add everything in the directory):
git add .
my-first-repo sachsmc$ git status Initial commit Changes to be committed: (use "git rm --cached <file>..." to unstage) new file: readme.md
Commit the changes making sure to include a detailed description of what was changed.
my-first-repo sachsmc$ git commit -m "Added readme.md" [master (root-commit) f92799f] Added readme.md 1 file changed, 1 insertion(+) create mode 100644 readme.md
git
.tex
file, some analyses in .R
with data and outputCommits do not affect the other branch
Push all of her changes to github, then submit a pull request.
git
is a structured approach to tracking content
git
git revert
Topic | Link |
---|---|
KBroman's UWisc Class | https://kbroman.github.io/Tools4RR/pages/schedule.html |
Github tutorials | http://guides.github.com/ |
Possible workflows | http://www.atlassian.com/git/workflows |
Another tutorial | http://www.atlassian.com/git/tutorial |
Knitr homepage | http://yihui.name/knitr/ |
rmarkdown documentation | http://rmarkdown.rstudio.com/ |
More stuff and source available