Tidy data

Day 3, C

Michael C Sachs

Illustration

Our mean_sd function

This is not so nice to work with because we cannot see which row is the mean and which is the sd.
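
The problem can be reproduced with a minimal sketch. The version of mean_sd below and the use of the built-in iris data are assumptions made for illustration, not necessarily the course's own code:

```r
# Hypothetical version of mean_sd that returns an unnamed vector of length 2
mean_sd <- function(x) {
  c(mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))
}

# Apply it to sepal length within each species of the built-in iris data
res <- sapply(split(iris$Sepal.Length, iris$Species), mean_sd)

# The result is a 2 x 3 matrix: columns are labelled by species,
# but the rows carry no labels at all
res
```

Nothing in the output says that the first row holds means and the second holds standard deviations; that knowledge lives only in the function body.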

Alternative 1

Return a data frame instead
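
One way this could look, as a sketch only: the helper name mean_sd_df and the iris data are assumptions chosen for illustration.

```r
# Hypothetical variant that returns a one-row data frame with named columns
mean_sd_df <- function(x) {
  data.frame(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}

# Compute the summary within each species and keep the group as a column
by_species <- split(iris$Sepal.Length, iris$Species)
res_df <- do.call(rbind, lapply(names(by_species), function(g) {
  data.frame(Species = g, mean_sd_df(by_species[[g]]))
}))

# One row per species, one column per statistic
res_df
```

Each statistic now has its own named column, and each group is a row.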

Alternative 2

Definitions

Tidy data

These are two different ways to represent the same data, shown here to illustrate the concept of tidy data.

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10

Variables and observations

In general,

  • It is easier to describe and create functional relationships between variables (columns) than between observations.
    • Think of our mean, sd example. Relationship between mean and sd?
    • Regression
  • It is easier to make comparisons between groups of observations than across variables
    • T-tests, anova, compare means across species

In practice

Making statistical output tidy

If you are writing the function that does the calculations, you are in control.

Tips and tricks:

  • Assemble the results in a data frame; the columns you group by will then be added automatically.
  • deparse1(substitute(x)) creates a character string from the name of the object used as the argument x
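
The second tip can be sketched as follows. The function name mean_sd_named is hypothetical, and deparse1() requires R 4.0.0 or later:

```r
# Hypothetical helper that records which object was summarized
mean_sd_named <- function(x) {
  data.frame(
    # substitute(x) captures the unevaluated argument expression;
    # deparse1() turns it into a single character string
    variable = deparse1(substitute(x)),
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE)
  )
}

out <- mean_sd_named(iris$Sepal.Length)
out$variable
```

Here the variable column holds the text of the argument exactly as it was typed in the call, so results from several calls can be stacked and still be told apart.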

The broom package

The broom package provides tidy(), a generic function that can be applied to many different kinds of objects.

It attempts to organize statistical output into a tidy tibble.
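
As an illustration, assuming the broom package is installed and using a linear model on the built-in iris data (the model formula is a stand-in, not necessarily the one used in the course):

```r
library(broom)

# An example linear model; the formula is chosen only for illustration
lmfit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# One row per coefficient, with columns term, estimate, std.error,
# statistic and p.value
coefs <- tidy(lmfit)
coefs
```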

More functions from broom

tidy() returns data about the model coefficients. What about the other components of the model?

  • tidy() summarizes information about model components
  • glance() reports information about the entire model
  • augment() adds information about observations to a dataset
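
A short sketch of the other two functions, again assuming broom is installed and using an illustrative model on the iris data:

```r
library(broom)

# Illustrative model; the formula is an assumption
lmfit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# glance(): a single row of model-level summaries (r.squared, AIC, ...)
model_summary <- glance(lmfit)

# augment(): the data used in the fit plus per-observation columns
# such as .fitted and .resid
obs_level <- augment(lmfit)

model_summary
head(obs_level)
```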

Benefits of tidy statistical output

The main reason for being tidy is to help with subsequent analysis and reporting. Printing lmfit shows the necessary results, but compare that with tidy(lmfit).

  • Easier to organize into a table or figure
  • Easier to merge or append with results from different models
  • Possibly easier to store/save the results
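
The second point can be sketched like this, assuming broom is installed. The two models and the label column named model are made up for illustration:

```r
library(broom)

# Two illustrative models fitted to the same data
fit_small <- lm(Sepal.Length ~ Sepal.Width, data = iris)
fit_large <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# Because both results share the same columns, they can be stacked
# after adding a label that identifies the model
both <- rbind(
  data.frame(model = "small", tidy(fit_small)),
  data.frame(model = "large", tidy(fit_large))
)
both
```

The stacked result can go straight into a table or a coefficient plot, with model as the grouping variable.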

Tidy data for analysis

Messy data example

Real data is often messy, and we want it to be tidy as well for easier analysis. Example of messy data:

  • A variable (week) is encoded in the column names
  • This is also called “wide format”

Pivoting data

Converting from wide to long (tidyr package):
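
A sketch of this step, assuming the example is the billboard data shipped with tidyr (which has one column per week, wk1, wk2, and so on); the output column names week and rank are choices made here:

```r
library(tidyr)

# Gather all wk* columns into a pair of columns: week and rank
billboard_long <- pivot_longer(
  billboard,
  cols = starts_with("wk"),
  names_to = "week",
  names_prefix = "wk",                        # strip "wk" from the names
  names_transform = list(week = as.integer),  # store week as an integer
  values_to = "rank",
  values_drop_na = TRUE                       # drop weeks with no chart rank
)
head(billboard_long)
```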

Base R
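
In base R the same reshaping can be done with reshape(). The small data frame below is a toy stand-in with invented values, used so the example is self-contained:

```r
# Toy wide data: one row per artist, one column per week (invented values)
wide <- data.frame(
  artist = c("A", "B"),
  wk1 = c(10, 50),
  wk2 = c(8, 45),
  wk3 = c(5, NA)
)

# varying: the columns to stack; v.names: name of the stacked value column;
# timevar/times: name and values of the new week column
long <- reshape(
  wide,
  direction = "long",
  varying = c("wk1", "wk2", "wk3"),
  v.names = "rank",
  timevar = "week",
  times = 1:3,
  idvar = "artist"
)

# Sort by artist and week, and drop the generated row names
long <- long[order(long$artist, long$week), ]
rownames(long) <- NULL
long
```

Unlike the tidyr version with values_drop_na = TRUE, reshape() keeps rows with missing values.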

Melting (data.table and old tidyverse)
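
A sketch with data.table's melt(), assuming the data.table package is installed and using a toy table with invented values:

```r
library(data.table)

# Toy wide data: one row per artist, one column per week (invented values)
wide <- data.table(
  artist = c("A", "B"),
  wk1 = c(10, 50),
  wk2 = c(8, 45),
  wk3 = c(5, NA)
)

# Stack all columns matching "^wk" into week/rank pairs, dropping NAs
long <- melt(
  wide,
  id.vars = "artist",
  measure.vars = patterns("^wk"),
  variable.name = "week",
  value.name = "rank",
  na.rm = TRUE
)

# week comes back as a factor of column names; convert it to an integer
long[, week := as.integer(sub("^wk", "", as.character(week)))]
long
```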

The long format makes it easier to compute summary statistics or make figures of ranks by artist.

Most longitudinal data analyses will expect data in “long” format.

Pivoting from long to wide

This also has its uses. Examples include computing correlation matrices and doing derived variable analysis. Tables are also often more readable in wide or partially wide format.

tidyr
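
A minimal sketch with pivot_wider(), assuming tidyr is installed and using a toy long table with invented values:

```r
library(tidyr)

# Toy long data: one row per artist and week (invented values)
long <- data.frame(
  artist = rep(c("A", "B"), each = 3),
  week = rep(1:3, times = 2),
  rank = c(10, 8, 5, 50, 45, 40)
)

# Spread week into columns wk1, wk2, wk3 holding the ranks
wide <- pivot_wider(
  long,
  names_from = week,
  values_from = rank,
  names_prefix = "wk"
)
wide
```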

base R
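
The base R counterpart again uses reshape(), here on a toy long table with invented values:

```r
# Toy long data: one row per artist and week (invented values)
long <- data.frame(
  artist = rep(c("A", "B"), each = 3),
  week = rep(1:3, times = 2),
  rank = c(10, 8, 5, 50, 45, 40)
)

# idvar identifies a row of the wide table, timevar supplies the column
# suffixes, and v.names is the column whose values are spread out
wide <- reshape(
  long,
  direction = "wide",
  idvar = "artist",
  timevar = "week",
  v.names = "rank"
)
wide
```

The new columns are named by joining v.names and the time values with a dot, giving rank.1, rank.2 and rank.3.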

data.table
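
With data.table, the long-to-wide step is done by dcast(). This sketch assumes data.table is installed and uses a toy table with invented values:

```r
library(data.table)

# Toy long data: one row per artist and week (invented values)
long <- data.table(
  artist = rep(c("A", "B"), each = 3),
  week = rep(1:3, times = 2),
  rank = c(10, 8, 5, 50, 45, 40)
)

# Left of ~ : variables that identify a row; right of ~ : variable whose
# values become column names; value.var: the values to fill in
wide <- dcast(long, artist ~ week, value.var = "rank")
wide
```

The new columns are named after the values of week, here "1", "2" and "3".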

Real data

Real data are often much more complex and can be messier. Here is one example we will use in the lesson:

library(here)
readRDS(here("data", "lpr-ex.rds")) |> head(12)
    pid  age sex      indat visit hdia diag1 diag2 diag3 diag4 diag5 diag6
1  A001 72.5   f 2010-01-27     1 H560  B632  H180  J050  <NA>  <NA>  <NA>
2  A001 72.5   f 2010-06-26     2 C871  D422  K820  A602  <NA>  <NA>  <NA>
3  A001 72.5   f 2010-07-20     3 C040  B710  <NA>  <NA>  <NA>  <NA>  <NA>
4  A001 72.5   f 2011-03-06     4 F412  F381  F632  E832  <NA>  <NA>  <NA>
5  A001 72.5   f 2011-12-23     5 F622  J720  <NA>  <NA>  <NA>  <NA>  <NA>
6  A002 81.5   f 2005-09-02     1 B481  K512  F750  K059  <NA>  <NA>  <NA>
7  A002 81.5   f 2008-05-05     2 K959  G959  E252  <NA>  <NA>  <NA>  <NA>
8  A002 81.5   f 2012-07-05     3 G902  D939  K329  G402  <NA>  <NA>  <NA>
9  A003 54.5   m 2006-07-30     1 E841  A062  <NA>  <NA>  <NA>  <NA>  <NA>
10 A003 54.5   m 2007-04-09     2 E071  J602  J490  J839  <NA>  <NA>  <NA>
11 A003 54.5   m 2008-03-14     3 H189  B752  B190  C751  I610  <NA>  <NA>
12 A003 54.5   m 2010-08-24     4 B090  I992  D021  <NA>  <NA>  <NA>  <NA>
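
As a hedged sketch of what pivoting such data could look like, the snippet below rebuilds three of the rows printed above by hand (rows 1, 3 and 6, keeping only some columns), since the lpr-ex.rds file itself is not assumed to be available. The output column names position and code are choices made here:

```r
library(tidyr)

# Rows 1, 3 and 6 of the printed example, typed in by hand
lpr_toy <- data.frame(
  pid = c("A001", "A001", "A002"),
  visit = c(1L, 3L, 1L),
  hdia = c("H560", "C040", "B481"),
  diag1 = c("B632", "B710", "K512"),
  diag2 = c("H180", NA, "F750"),
  diag3 = c("J050", NA, "K059")
)

# One row per recorded secondary diagnosis; empty slots are dropped
diag_long <- pivot_longer(
  lpr_toy,
  cols = starts_with("diag"),
  names_to = "position",
  values_to = "code",
  values_drop_na = TRUE
)

# Grouped analysis: number of secondary diagnoses per person and visit
n_diag <- aggregate(code ~ pid + visit, data = diag_long, FUN = length)
n_diag
```

Note that hdia does not start with "diag", so it is left untouched as an identifying column.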

Practical

We will practice creating tidy data by continuing with our mean_sd function, and we will use the register data example to practice pivoting and grouped analyses.
