Day 3, C
mean_sd function
This is not so nice to work with because we cannot see which row is the mean and which is the sd.
Return a data frame instead
These are two different ways to represent the same data, which illustrates the concept of tidy data.
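Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.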
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10
In general, if you are writing the function that does the calculations, you are in control of the shape of its output.
Tips and tricks:
deparse1(substitute(x)) creates a character string from the name of the object used as the argument x.
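A minimal sketch of how the two ideas above might be combined: a mean_sd function that returns a one-row data frame and labels it with the name of its argument. The na.rm argument, the column names, and the bmi vector are assumptions made for illustration, not necessarily what the lesson uses.

```r
# Sketch: return a labelled data frame instead of an unnamed vector
mean_sd <- function(x, na.rm = TRUE) {
  # Capture the name of the object passed as x, e.g. "bmi"
  name <- deparse1(substitute(x))
  data.frame(
    variable = name,
    mean     = mean(x, na.rm = na.rm),
    sd       = sd(x, na.rm = na.rm)
  )
}

# Hypothetical example data
bmi <- c(21.5, 24.0, 27.3, NA)
res <- mean_sd(bmi)
res
```

Each statistic now sits in its own named column, so the result can be row-bound with the results for other variables without losing track of what is what.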
broom package
broom provides a tidy() function, which is a generic: it can be applied to many different kinds of objects.
It attempts to organize statistical output into a tidy tibble.
broom
tidy() returns data about the model coefficients. What about the other components of the model?
tidy() summarizes information about model components
glance() reports information about the entire model
augment() adds information about observations to a dataset
The main reason for being tidy is to help with subsequent analysis and reporting. lmfit prints the necessary results, but look at tidy(lmfit).
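The three broom verbs can be compared side by side on a small model. This is only a sketch: the model below, fitted to the built-in mtcars data, is an assumption standing in for the lesson's lmfit, which may be a different model.

```r
library(broom)

# Hypothetical model for illustration; the lesson's lmfit may differ
lmfit <- lm(mpg ~ wt, data = mtcars)

coefs <- tidy(lmfit)     # one row per model coefficient
model <- glance(lmfit)   # a single row of model-level statistics
obs   <- augment(lmfit)  # one row per observation, with fitted values and residuals

coefs
```

Because all three results are tibbles, they can be filtered, joined, or passed to plotting and table functions directly, which is harder to do with the printed output of lmfit.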
Real data is often messy, and we want it to be tidy as well for easier analysis. Example of messy data:
Converting from wide to long (tidyr package):
Base R
This makes it easier to compute summary statistics, or make figures of ranks by artist.
Most longitudinal data analyses will expect data in “long” format.
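As a sketch of the wide-to-long conversion, the example below uses a small made-up table of chart ranks with one column per week. The data, the column names, and the choice to drop missing ranks are all assumptions for illustration; the lesson's own example data may look different.

```r
library(tidyr)

# Hypothetical wide data: one row per track, one column per week
wide <- data.frame(
  artist = c("A", "B"),
  track  = c("Song 1", "Song 2"),
  wk1    = c(10, 25),
  wk2    = c(8, NA),
  wk3    = c(5, NA)
)

# tidyr: stack the week columns into a week column and a rank column
long <- pivot_longer(
  wide,
  cols            = starts_with("wk"),
  names_to        = "week",
  names_prefix    = "wk",
  names_transform = list(week = as.integer),
  values_to       = "rank",
  values_drop_na  = TRUE
)

# Base R: reshape() does the same, but missing ranks are dropped by hand
long_base <- reshape(
  wide,
  direction = "long",
  varying   = c("wk1", "wk2", "wk3"),
  v.names   = "rank",
  timevar   = "week",
  times     = 1:3,
  idvar     = c("artist", "track")
)
long_base <- long_base[!is.na(long_base$rank), ]

# Summary statistics by artist are now a one-liner
aggregate(rank ~ artist, data = long, FUN = mean)
```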
Wide format also has its uses. Examples include computing correlation matrices and doing derived variable analysis; tables are also often more readable in wide or partially wide format.
Ways to convert from long to wide: tidyr, base R, data.table.
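A short sketch of going from long to wide, here in order to correlate two visits. The data frame and its column names are made up for illustration. The tidyr and base R versions are run; the data.table call is shown only as a comment.

```r
library(tidyr)

# Hypothetical long data: one row per person and visit
long <- data.frame(
  id    = c(1, 1, 2, 2, 3, 3),
  visit = c(1, 2, 1, 2, 1, 2),
  score = c(5.1, 4.8, 6.0, 6.3, 4.2, 4.9)
)

# tidyr: one column per visit
wide <- pivot_wider(
  long,
  names_from   = visit,
  values_from  = score,
  names_prefix = "score_"
)

# Base R equivalent
wide_base <- reshape(
  long,
  direction = "wide",
  idvar     = "id",
  timevar   = "visit",
  v.names   = "score",
  sep       = "_"
)

# data.table equivalent (not run here):
# data.table::dcast(data.table::as.data.table(long), id ~ visit, value.var = "score")

# Wide format makes the correlation between visits straightforward
cor(wide$score_1, wide$score_2)
```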
Real datasets are often much more complex, and can be messier. Here is one register data example that we will use in the lesson:
    pid  age sex      indat visit hdia diag1 diag2 diag3 diag4 diag5 diag6
1  A001 72.5   f 2010-01-27     1 H560  B632  H180  J050  <NA>  <NA>  <NA>
2  A001 72.5   f 2010-06-26     2 C871  D422  K820  A602  <NA>  <NA>  <NA>
3  A001 72.5   f 2010-07-20     3 C040  B710  <NA>  <NA>  <NA>  <NA>  <NA>
4  A001 72.5   f 2011-03-06     4 F412  F381  F632  E832  <NA>  <NA>  <NA>
5  A001 72.5   f 2011-12-23     5 F622  J720  <NA>  <NA>  <NA>  <NA>  <NA>
6  A002 81.5   f 2005-09-02     1 B481  K512  F750  K059  <NA>  <NA>  <NA>
7  A002 81.5   f 2008-05-05     2 K959  G959  E252  <NA>  <NA>  <NA>  <NA>
8  A002 81.5   f 2012-07-05     3 G902  D939  K329  G402  <NA>  <NA>  <NA>
9  A003 54.5   m 2006-07-30     1 E841  A062  <NA>  <NA>  <NA>  <NA>  <NA>
10 A003 54.5   m 2007-04-09     2 E071  J602  J490  J839  <NA>  <NA>  <NA>
11 A003 54.5   m 2008-03-14     3 H189  B752  B190  C751  I610  <NA>  <NA>
12 A003 54.5   m 2010-08-24     4 B090  I992  D021  <NA>  <NA>  <NA>  <NA>
We will practice creating tidy data by continuing with our mean_sd function, and use the register data example to practice pivoting and doing grouped analyses.
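To give a flavour of the register data exercise, here is a rough sketch that pivots the diagnosis columns to long format and then counts codes per patient. It uses three rows copied from the table above (rows 1, 3 and 9) and only some of the columns. The object names and the choice of summary are assumptions, not the lesson's solution.

```r
library(tidyr)

# Three visits taken from the register data shown above (subset of columns)
reg <- data.frame(
  pid   = c("A001", "A001", "A003"),
  visit = c(1, 3, 1),
  hdia  = c("H560", "C040", "E841"),
  diag1 = c("B632", "B710", "A062"),
  diag2 = c("H180", NA, NA),
  diag3 = c("J050", NA, NA)
)

# One row per recorded diagnosis code, dropping the empty slots
diag_long <- pivot_longer(
  reg,
  cols           = c(hdia, starts_with("diag")),
  names_to       = "position",
  values_to      = "code",
  values_drop_na = TRUE
)

# Grouped analysis: number of recorded codes per patient
n_codes <- aggregate(code ~ pid, data = diag_long, FUN = length)
n_codes
```

In long format the number of diagnosis columns no longer matters, so the same grouped summary works whether a visit has two codes or seven.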