Practice manipulating data by tidying and reshaping
Author
Michael C Sachs
Learning objectives
In this lesson you will
Practice tidying statistical results
See and understand how to reshape data from wide to long and long to wide
Tidying our mean_sd function
Load the broom package and look at the source code for tidy.lm
broom:::tidy.lm
function (x, conf.int = FALSE, conf.level = 0.95, ...)
{
    warn_on_subclass(x)
    ret <- as_tibble(summary(x)$coefficients, rownames = "term")
    colnames(ret) <- c("term", "estimate", "std.error", "statistic",
        "p.value")
    coefs <- tibble::enframe(stats::coef(x), name = "term", value = "estimate")
    ret <- left_join(coefs, ret, by = c("term", "estimate"))
    if (conf.int) {
        ci <- broom_confint_terms(x, level = conf.level)
        ret <- dplyr::left_join(ret, ci, by = "term")
    }
    ret
}
<bytecode: 0x58a990881060>
<environment: namespace:broom>
Write a tidy method for our mean_sd function and try it out on the penguins dataset.
Apply the mean_sd function to the penguins body mass in grams by species and sex. Organize the results into a table suitable for publication, where it is easy to compare the two sexes.
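One possible way to approach the two steps above is sketched here in base R. Several things are assumptions: the mean_sd function below is a hypothetical stand-in for the one written earlier in the course (yours may return something different), the small toy data frame uses made-up values in place of the penguins data, and the "mean (sd)" cell format is only one of many reasonable choices for a publication table.

```r
# Use the tidy() generic from generics (re-exported by broom) if installed;
# otherwise fall back to a minimal stand-in generic so the sketch still runs.
tidy <- if (requireNamespace("generics", quietly = TRUE)) {
  generics::tidy
} else {
  function(x, ...) UseMethod("tidy")
}

# Hypothetical stand-in for the course's mean_sd(); the real one may differ.
mean_sd <- function(x, na.rm = TRUE) {
  structure(
    list(mean = mean(x, na.rm = na.rm), sd = sd(x, na.rm = na.rm)),
    class = "mean_sd"
  )
}

# tidy method: a one-row data frame with one column per statistic
tidy.mean_sd <- function(x, ...) {
  data.frame(mean = x$mean, sd = x$sd)
}

# Toy stand-in for penguins (made-up values, for illustration only)
toy <- data.frame(
  species = rep(c("Adelie", "Gentoo"), each = 4),
  sex = rep(c("female", "male"), times = 4),
  body_mass_g = c(3400, 4000, 3300, 4100, 4700, 5500, 4600, 5400)
)

# Apply mean_sd by species and sex, keeping the group labels with each result
groups <- split(toy, list(toy$species, toy$sex), drop = TRUE)
long <- do.call(rbind, lapply(groups, function(d) {
  data.frame(species = d$species[1], sex = d$sex[1],
             tidy(mean_sd(d$body_mass_g)))
}))
long$mean_sd <- sprintf("%.0f (%.0f)", long$mean, long$sd)

# Long to wide: one row per species, one column per sex, for easy comparison
wide <- reshape(long[, c("species", "sex", "mean_sd")],
                idvar = "species", timevar = "sex", direction = "wide")
names(wide) <- sub("^mean_sd\\.", "", names(wide))
rownames(wide) <- NULL
wide
```

With the real data, toy would be replaced by penguins (in the palmerpenguins package the column is named body_mass_g), and rows with missing sex would need to be dropped or given their own column first.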
Tidying the national patient register dataset
Load the LPR data example from "https://sachsmc.github.io/r-programming/data/lpr-ex.rds"
library(here)
here() starts at /home/micsac/Teaching/Courses/r-programming
lpr <- readRDS(here("data", "lpr-ex.rds"))
Use the tidy principles to do the following:
Reshape the data into wide format, where the columns are the primary diagnosis (hdia) at each visit number.
Reshape the data into a longer format, where all of the diagnoses are stored in a single variable, with another variable indicating whether each one is the primary diagnosis.
Create a new variable for each participant which equals TRUE if they had any diagnosis of either D150, D152, or D159 before the date 1 January 2010.
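The three tasks above can be sketched in base R on a toy table. The structure of the real LPR file is not shown here, so every column name except hdia (pid, visit, indat, diag1) is an assumption, the values are made up, and diagnoses are matched as exact codes, which may not be right if the real codes carry extra characters.

```r
# Toy stand-in for the LPR data; all column names except hdia are assumptions
lpr_toy <- data.frame(
  pid = c(1, 1, 2, 2, 3),
  visit = c(1, 2, 1, 2, 1),
  indat = as.Date(c("2008-03-01", "2011-06-15", "2009-01-10",
                    "2010-02-01", "2012-05-05")),
  hdia = c("D150", "K219", "J189", "D152", "D159"),
  diag1 = c("I109", NA, "D159", NA, NA),
  stringsAsFactors = FALSE
)

# 1. Wide: one row per participant, one hdia column per visit number
wide <- reshape(lpr_toy[, c("pid", "visit", "hdia")],
                idvar = "pid", timevar = "visit", direction = "wide")

# 2. Longer: stack all diagnosis columns into one variable and
#    record whether each value came from the primary diagnosis column
diag_cols <- c("hdia", "diag1")
long <- do.call(rbind, lapply(diag_cols, function(col) {
  data.frame(lpr_toy[, c("pid", "visit", "indat")],
             diagnosis = lpr_toy[[col]],
             primary = col == "hdia")
}))
long <- long[!is.na(long$diagnosis), ]

# 3. Per participant: any target diagnosis strictly before 1 January 2010
target <- c("D150", "D152", "D159")
long$hit <- long$diagnosis %in% target & long$indat < as.Date("2010-01-01")
flag <- aggregate(hit ~ pid, data = long, FUN = any)
names(flag)[2] <- "any_target_before_2010"
flag
```

Working from the long table in step 3 means secondary diagnoses are checked as well as primary ones, which is one reason the longer format is convenient here.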
Merging and manipulation
The objective of this project is to describe the distribution of the number of days between hospitalizations and drug dispensations by age and sex. Your challenge is to do the following:
Import and merge the drug register data with the hospitalization register.
Create a new variable that counts the number of drug dispensations in the 3 months following a hospitalization.
Summarize the variable by age and sex. Try making a graphical summary.
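As a starting point for the counting step, here is a minimal sketch on made-up data. The table and column names (hosp, drug, pid, indat, disp_date) are assumptions, "3 months" is approximated as 90 days, and the window is taken to start at the admission date; any of these may need adjusting for the real registers.

```r
# Toy hospitalization and dispensation tables (made-up values)
hosp <- data.frame(
  pid = c(1, 1, 2),
  indat = as.Date(c("2006-01-10", "2007-05-01", "2006-03-15"))
)
drug <- data.frame(
  pid = c(1, 1, 1, 2, 2),
  disp_date = as.Date(c("2006-01-20", "2006-03-01", "2007-09-01",
                        "2006-03-10", "2006-04-01"))
)

# All hospitalization-dispensation pairs within a patient.
# Note: patients with no dispensations at all are dropped by this inner merge.
pairs <- merge(hosp, drug, by = "pid")

# Flag dispensations from the admission date up to 90 days afterwards
days <- as.numeric(pairs$disp_date - pairs$indat)
pairs$in_window <- days >= 0 & days <= 90

# Count flagged dispensations per hospitalization
counts <- aggregate(in_window ~ pid + indat, data = pairs, FUN = sum)
names(counts)[3] <- "n_disp_90d"
counts <- counts[order(counts$pid, counts$indat), ]
counts
```

Once age and sex are attached to counts, the new variable can be summarized within groups, for example with aggregate or a boxplot per group.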
Hints
The drug register data are stored in separate files by year. You will need to iterate over these files somehow, maybe using a loop or one of the apply functions.
The file names can be created programmatically with paste0("med-", 2005:2010, "-ex.rds")
Once they are all read in as objects, you will want to append them by row, e.g., using rbind.
Join the hospitalization table to the drug table by patient id. How should the dates be handled? We want only the most recent prescription since the last hospitalization. This is a rolling join.
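The hints above can be put together roughly as follows. This is a sketch under stated assumptions: so that it runs anywhere, it first writes small made-up yearly files to a temporary directory, the column names (pid, disp_date, indat) are hypothetical, and the rolling join is done by hand in base R, reading the hint as "match each dispensation to the latest admission on or before it". The data.table package offers rolling joins directly through the roll argument of its join syntax, which may be the more convenient route on the real data.

```r
# File names as suggested in the hints
files <- paste0("med-", 2005:2010, "-ex.rds")

# Toy setup: write one small made-up table per year to a temporary directory.
# With the real data, point data_dir at the folder holding the course files.
data_dir <- tempfile("drugdata")
dir.create(data_dir)
for (i in seq_along(files)) {
  yr <- 2004 + i
  toy <- data.frame(pid = c(1, 2),
                    disp_date = as.Date(paste0(yr, c("-02-01", "-08-01"))))
  saveRDS(toy, file.path(data_dir, files[i]))
}

# Read every file into a list, then append the tables by row
drug_list <- lapply(file.path(data_dir, files), readRDS)
drug <- do.call(rbind, drug_list)

# Hypothetical hospitalization table
hosp <- data.frame(
  pid = c(1, 1, 2),
  indat = as.Date(c("2005-01-15", "2008-01-15", "2007-06-01"))
)

# Rolling join by hand: pair each dispensation with every admission of the
# same patient, keep admissions on or before the dispensation date, then
# keep only the latest such admission per dispensation
cand <- merge(drug, hosp, by = "pid")
cand <- cand[cand$indat <= cand$disp_date, ]
cand <- cand[order(cand$pid, cand$disp_date, cand$indat), ]
rolled <- cand[!duplicated(cand[, c("pid", "disp_date")], fromLast = TRUE), ]
rolled$days_since_hosp <- as.numeric(rolled$disp_date - rolled$indat)
rolled
```

Dispensations that occur before a patient's first admission have no match and drop out of rolled. The merge-then-filter approach builds every within-patient pair, so it can become slow on large registers.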