Data structures in R

Day 1, B

Michael C Sachs

Overview

  • R is a programming language, most often used to work with data
  • I use the term ‘data’ loosely; it can refer to
    • data to be analyzed
    • results from a data analysis
    • information used in an analysis (e.g., ICD diagnosis codes)
  • Think carefully about organizing your data structures

“Bad programmers worry about the code. Good programmers worry about data structures and their relationships.”

Linus Torvalds, creator of Linux

Types of data

logical, numeric, character, factor, date, …

TRUE
[1] TRUE
1.2
[1] 1.2
"hello"
[1] "hello"
factor("low", levels = c("low", "med", "high"))
[1] low
Levels: low med high
as.Date("2022-05-11", format = "%Y-%m-%d")
[1] "2022-05-11"
NULL
NULL

Vectors

A one-dimensional collection of data, all of the same type. Vectors can be named or unnamed, and can be created in many ways:

1:4
[1] 1 2 3 4
seq(1, 4, by = 1)
[1] 1 2 3 4
letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
rep(TRUE, times = 5)
[1] TRUE TRUE TRUE TRUE TRUE
x <- 1:4
names(x) <- c("a", "b", "c", "d")
x
a b c d 
1 2 3 4 

Missing and empty/null values

NA marks a missing value; note that NA itself has a data type.

NaN for “not a number”, e.g., 0 / 0

NULL is empty, and has 0 length

c(NA, 1) ## different from the next line
[1] NA  1
c(NA, "a")
[1] NA  "a"
c(NULL, "a")
[1] "a"
list(NULL, NULL, 1:3)
[[1]]
NULL

[[2]]
NULL

[[3]]
[1] 1 2 3
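
A short sketch of the points above, using only base R: NA comes in typed variants, NaN is treated as missing by is.na, and NULL has length 0.

```r
# NA is typed: the default NA is logical, but typed variants exist
typeof(NA)             # "logical"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# NaN counts as missing for is.na(), but NA is not NaN
is.na(NaN)   # TRUE
is.nan(NA)   # FALSE

# NULL has length 0, while NA is a vector of length 1
length(NULL) # 0
length(NA)   # 1
```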

Indexing vectors

Subsequences of vectors are obtained with square brackets []

Inside the square brackets goes the index, which can itself be a vector of numbers, logicals, or characters (if the vector is named).
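
As an illustration of the three kinds of index, here is a small sketch; the vector x and its values are made up for the example.

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

x[2]              # by position: the element named b
x[c(1, 3)]        # several positions at once
x[-1]             # negative numbers drop elements
x[x > 15]         # logical index: keeps elements where the condition is TRUE
x[c("a", "d")]    # by name, which works only because x is named
```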

Lists

A list is a collection of things that are not required to be the same type. An element of a list can be any R object. Lists can also be named or unnamed.
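
A minimal sketch of creating lists; the element names and contents are invented for illustration.

```r
# a named list mixing types, including another list
res <- list(
  id = 1L,
  label = "patient A",
  values = c(2.5, 3.1, 4.8),
  nested = list(flag = TRUE)
)
str(res)
names(res)

# an unnamed list; elements can even be functions
unnamed <- list(1:3, "a", mean)
length(unnamed)
```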

Indexing lists

A list can be indexed with square brackets [] or double-square brackets [[]], but there is a difference!
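
The difference can be sketched as follows (the list res is a made-up example): single brackets return a smaller list, double brackets return the element itself.

```r
res <- list(id = 1L, values = c(2.5, 3.1, 4.8))

sub <- res["values"]     # single brackets: a list of length 1
el  <- res[["values"]]   # double brackets: the element itself
class(sub)  # "list"
class(el)   # "numeric"

res$values          # $ is shorthand for [[ ]] with a name
res[["values"]][2]  # index into the extracted vector
```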

Concatenating lists

New elements can be added by name or number

The c function concatenates new elements to the list
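
A small sketch of both approaches, with made-up element names:

```r
res <- list(a = 1, b = "two")

res$c <- TRUE          # add by name
res[["d"]] <- 4:6      # add by name with double brackets
res[[5]] <- "fifth"    # add by position (this element has no name)

# c() appends the elements of another list
res <- c(res, list(e = NULL, f = 3.14))
length(res)
names(res)
```

Note that the second argument to c is itself a list, so that each of its elements becomes one new element of res.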

Matrices

Just as in math, R matrices are like vectors with 2 dimensions. They are also indexed with square brackets.

There are lots of matrix manipulation functions in base R
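
For instance, a matrix can be built and inspected with a handful of base R functions. This is only a sketch with toy values:

```r
M <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
M
dim(M)

# byrow = TRUE fills row by row instead
matrix(1:6, nrow = 2, byrow = TRUE)

# bind vectors together as rows or as columns
rbind(1:3, 4:6)
cbind(a = 1:2, b = 3:4)

# a few of the base R helpers
t(M)
diag(2)
rowSums(M)
colMeans(M)
```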

Indexing matrices

Using square brackets, we get a single value by using two numbers or names separated by a comma:

A missing index means “everything”, so this returns a vector

You can also use a logical matrix or a numeric vector as a single index.
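
The indexing styles described above, sketched on a toy matrix (the dimension names are invented for the example):

```r
M <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

M[2, 3]               # single value: row 2, column 3
M["r2", "c"]          # the same value, by name
M[1, ]                # missing column index: all of row 1, as a vector
M[, "b"]              # missing row index: all of column b, as a vector
M[1, , drop = FALSE]  # drop = FALSE keeps the result as a 1 x 3 matrix

M[M > 3]              # logical matrix as a single index
M[5]                  # single number: position in column-major order
```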

Index matrices are convenient but hard to understand

Before we used a single index for each dimension:

If we create a series of paired single indices, and store them in a matrix with 2 columns, we can use that matrix as an index:
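
A minimal sketch of an index matrix, using a toy 3 x 4 matrix; each row of idx is one (row, column) pair.

```r
M <- matrix(1:12, nrow = 3, ncol = 4)

# one index per dimension, as before
M[1, 2]
M[3, 4]

# the same pairs stored as the rows of a 2-column matrix
idx <- cbind(c(1, 3), c(2, 4))
picked <- M[idx]   # the values M[1, 2] and M[3, 4]
picked

# index matrices also work for assignment
M[idx] <- 0L
```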

Matrix arithmetic

Matrix transpose, multiplication, inversion, eigenvalues, etc, are all available in R

S1 <- matrix(runif(9), nrow = 3, ncol = 3)
S2 <- matrix(runif(9), nrow = 3, ncol = 3)
x <- c(1, 1.5, 3)

S1 %*% S2
         [,1]      [,2]     [,3]
[1,] 1.302694 0.9729528 1.514939
[2,] 1.091354 0.7723148 1.142920
[3,] 1.208944 0.9165629 1.444500
t(S1) %*% S2
          [,1]      [,2]     [,3]
[1,] 1.2363695 0.9751907 1.591019
[2,] 1.2367629 0.9261671 1.445685
[3,] 0.9502872 0.7489214 1.219555
x %*% solve(S1) %*% x
          [,1]
[1,] -4.448517

Arrays

Arrays generalize matrices to any number of dimensions; a matrix is simply an array with 2 dimensions.

The data get filled in along the first dimension, then the second, then the third, …

The data also get “unrolled” in the same way.

Indexing works like it does with matrices:
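
A sketch with a toy 2 x 3 x 4 array shows the fill order, the unrolling, and the indexing:

```r
A <- array(1:24, dim = c(2, 3, 4))

A[1, 1, 1]   # 1: the first element
A[2, 1, 1]   # 2: the first dimension varies fastest
A[1, 2, 1]   # 3: then the second
A[1, 1, 2]   # 7: then the third

as.vector(A)[1:7]   # unrolling gives back the original order

A[, , 1]        # a missing index means "everything": a 2 x 3 matrix
dim(A[, 2, ])   # fixing the second dimension leaves a 2 x 4 matrix
```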

Data frames

Data frames look like matrices, but the columns can be different data types:

While they look like matrices, they act more like lists:
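
A small sketch of both views; the data frame and its column names are made up for illustration.

```r
df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  treated = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
df
dim(df)            # matrix-like: 3 rows and 3 columns

is.list(df)        # TRUE: a data frame is a list of equal-length columns
length(df)         # the number of columns, not the number of cells
names(df)
sapply(df, class)  # each column keeps its own type
```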

Indexing and manipulating data frames

Multiple ways to refer to a particular column:

Subsetting

Manipulation

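Be careful with attach, if you know about it: it places a copy of the columns on the search path, so later changes to the data frame are not reflected in the attached copy.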
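
The column access, subsetting, and manipulation idioms named above can be sketched with the built-in mtcars data. The new column names kpl and wt_kg are invented for the example, and with is shown as one possible alternative to attach.

```r
# several ways to refer to one column
mtcars$mpg
mtcars[["mpg"]]
mtcars[, "mpg"]
head(mtcars["mpg"])   # single brackets with one name: a one-column data frame

# subsetting rows (and columns)
heavy  <- mtcars[mtcars$wt > 5, ]
heavy2 <- subset(mtcars, wt > 5, select = c(mpg, wt))

# manipulation: add or modify columns on a copy
cars <- mtcars
cars$kpl <- cars$mpg * 0.425               # approximate km per litre
cars <- within(cars, wt_kg <- wt * 453.6)  # wt is in units of 1000 lb

# with() evaluates an expression inside the data frame, no attach needed
with(cars, mean(mpg[cyl == 4]))
```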

Two important data gotchas

Coercion

Coercion is what happens when data gets converted from one type to another (e.g., numeric to character).

This can also be done explicitly using the as. family of functions.

One of R’s “nice” features is that it will automatically attempt to coerce data when different types meet in an operation.

Examples

This is useful sometimes, other times it can cause problems:

R usually warns you when data is destroyed by coercion (e.g., values that become NA), but not always.

If the data type is critical for an operation then it is up to you to check using the is. family of functions.
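
A few illustrative cases of explicit and automatic coercion, as a sketch in base R. The factor example is one common way coercion silently goes wrong.

```r
# explicit coercion with the as. family
as.numeric("3.14")
as.character(1:3)
as.logical(c("TRUE", "yes"))   # "yes" is not recognized and becomes NA

# automatic coercion when types meet
c(1, "a", TRUE)                # everything becomes character
TRUE + TRUE                    # logicals become numbers: 2
sum(c(TRUE, FALSE, TRUE))      # handy for counting: 2

# a problem case: factor to number, with no warning
f <- factor(c("10", "20", "10"))
as.numeric(f)                  # the internal codes 1 2 1, not the values
as.numeric(as.character(f))    # the values 10 20 10

# data destroyed by coercion: "two" becomes NA, with a warning
bad <- as.numeric(c("1", "two"))

# checking types with the is. family
is.numeric(bad)
is.character(f)
is.factor(f)
```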

Recycling

Vector and array arithmetic works elementwise, as long as the things have the same dimension.

If not, the shorter one is recycled (repeated) to match the length of the longer one, sometimes without any warning.

I often make this mistake when calculating proportions from a table:

Again, hopefully R warns you about this, but when in doubt check and validate lengths.
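
A sketch of recycling, including a guess at the kind of proportion mistake described above (the table values are made up). Dividing a table by its column totals recycles the totals down the columns, which is not what was intended; sweep and prop.table are two base R ways to do it correctly.

```r
1:6 + 1:2   # the shorter vector is recycled three times, no warning
1:6 + 1:4   # lengths are not multiples: still recycled, but with a warning

tab <- matrix(c(10, 30, 20, 40), nrow = 2,
              dimnames = list(c("r1", "r2"), c("c1", "c2")))

# intended: column proportions. Recycling runs down the columns, so this
# divides row 1 by the first total and row 2 by the second total instead
wrong <- tab / colSums(tab)

# correct: divide each column by its own total
right  <- sweep(tab, 2, colSums(tab), "/")
right2 <- prop.table(tab, margin = 2)

colSums(wrong)   # does not sum to 1 in each column
colSums(right)   # sums to 1 in each column
```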

Special data structures

Attributes

Any object can have attributes, which are data that get attached to the object. It is a flexible way to include information with an object.

They are stored as name and value pairs, as in a list. Query or replace them with attributes or attr.

Some attributes are special, e.g., class, comment, dim, dimnames, …, and have special ways of querying and setting
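
A minimal sketch of custom and special attributes; the attribute name "units" and its value are invented for the example.

```r
x <- 1:6
attr(x, "units") <- "mg"   # set a custom attribute
attr(x, "units")           # query it
attributes(x)              # all attributes, as a named list

# special attributes have their own functions for querying and setting
dim(x) <- c(2, 3)          # x is now a 2 x 3 matrix
dimnames(x) <- list(c("a", "b"), NULL)
comment(x) <- "toy example"
class(x)
attributes(x)
```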

Environments – how does R find things?

An environment is kind of like a list, it contains a number of arbitrary objects.

The global environment is a special one: look at the upper right pane of RStudio, or run ls() to list its contents.

When you type a name in the console, R first looks for it in the global environment. If it cannot find it there, it then looks in the attached packages.
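
The list analogy and the lookup order can be sketched in base R; the environment e and the objects in it are made up for the example.

```r
# an environment holds named objects, much like a list
e <- new.env()
assign("a", 1, envir = e)
e$b <- "two"
ls(e)
get("a", envir = e)

# the search path: the global environment first, then attached packages
search()

# sd is not defined in the global environment, so R finds it further along
find("sd")     # reports where on the search path sd lives
exists("sd")
```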

We will come back to environments when we talk about functions.

Packages

Add-on packages can be installed from a few different places: CRAN, Bioconductor, GitHub, R-forge, and locally from package files.

Install them on your system with install.packages("pkgname"), which downloads from CRAN by default.

When you use library("pkgname"), the package is attached, so that objects in the package can be found just by typing the name:
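
As a sketch, using the tools package because it ships with every R installation (so nothing needs to be installed first):

```r
library("tools")                # attach the package
"package:tools" %in% search()   # it is now on the search path
file_ext("results.csv")         # found by name because tools is attached
```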

You can use objects from packages without attaching them with :: (two colons)

head(survival::aml)
  time status          x
1    9      1 Maintained
2   13      1 Maintained
3   13      0 Maintained
4   18      1 Maintained
5   23      1 Maintained
6   28      0 Maintained

and you can get internal objects from a package with ::: (three colons)
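
For example, the default t.test method lives inside the stats package but is not among its exported objects. This sketch checks that and then retrieves the function with three colons:

```r
# t.test.default is internal to stats: it is not in the list of exports
"t.test.default" %in% getNamespaceExports("stats")

# ::: reaches into the package namespace anyway
f <- stats:::t.test.default
is.function(f)
```

Internal objects are not part of a package's public interface, so code relying on them may break when the package is updated.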

S4 objects

Some packages you may use (e.g., from Bioconductor) return S4 objects. These are kind of like lists, but to access objects (called ‘slots’) inside, use @ (the at symbol).

For example
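
Here is a minimal sketch with a made-up S4 class called Measurement; real Bioconductor classes are more elaborate, but slot access works the same way.

```r
library(methods)

# define a toy S4 class with two slots (hypothetical, for illustration only)
setClass("Measurement", slots = c(value = "numeric", unit = "character"))
m <- new("Measurement", value = c(1.2, 3.4), unit = "mg")

isS4(m)
slotNames(m)      # the names of the slots
m@value           # access a slot with @
slot(m, "unit")   # the same, using a function
```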

Tibbles

Tibbles (cute name for ‘table’) are data frames with enhanced printing.

library(tibble)
library(palmerpenguins)
class(penguins) ## this will be data.frame in the browser, but tbl_df in R
[1] "tbl_df"     "tbl"        "data.frame"
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

You can convert a regular data.frame to a tibble

The indexing behavior is slightly different from data frames:
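
A sketch of the conversion and of two indexing differences, assuming the tibble package is installed; the data frame and its column names are made up for the example.

```r
library(tibble)

df <- data.frame(xval = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)   # convert a regular data frame to a tibble

df[, "xval"]    # data frame: a single column is dropped to a vector
tb[, "xval"]    # tibble: still a tibble with one column
tb[["xval"]]    # use [[ ]] or $ to get the vector from a tibble

df$xv           # data frame: $ partially matches the name xval
tb$xv           # tibble: no partial matching, NULL with a warning
```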

data.tables

data.table is a package that provides another data.frame extension.

It has many features for data manipulation and management with a focus on speed, both typing and computer speed for large datasets.

There is a special syntax for indexing and merging using square brackets. We will come back to this (because it is my favorite tool for data management).
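
As a brief preview only, here is a sketch of the bracket syntax, assuming the data.table package is installed. The lookup table info and the column names model, kpl, and size are invented for the example.

```r
library(data.table)

dt <- as.data.table(mtcars, keep.rownames = "model")

dt[cyl == 4 & mpg > 30]                    # first argument: filter rows
dt[, .(mean_mpg = mean(mpg)), by = cyl]    # second argument plus by: grouped summary
dt[, kpl := mpg * 0.425]                   # := adds a column in place

# merging inside the brackets: join dt to a small lookup table on cyl
info <- data.table(cyl = c(4, 6, 8), size = c("small", "medium", "large"))
merged <- dt[info, on = "cyl"]
head(merged)
```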

Data import and export

Reading in external data

The basic functions are read.table, read.csv, read.csv2

Via add-on packages, R supports import of any data format I can think of.

The most flexible way to read in data is with the rio package. It guesses what the format is and uses the correct import tool (most of the time)

library(rio)
library(here)
df <- import(here("data/starwars.xlsx"))
head(df)
            Name homeworld species
1 Luke Skywalker  Tatooine   Human
2          C-3PO  Tatooine   Human
3          R2-D2  Alderaan   Human
4    Darth Vader  Tatooine   Human
5    Leia Organa  Tatooine   Human
6      Owen Lars  Tatooine   Human

Big datasets

The slowness of reading in data usually comes from format guessing.

Supplying known column types can dramatically speed up import:

df1 <- read.csv(here("data/starwars.csv"))

df1b <- read.csv(here("data/starwars.csv"),
                colClasses = c("character", "factor", "factor", 
                               "numeric"))

fread from the data.table package is fast and also flexible:

library(data.table)
fread(here("data/starwars.csv"))
                  Name homeworld species age
 1:     Luke Skywalker  Tatooine   Human  27
 2:              C-3PO  Tatooine   Human  12
 3:              R2-D2  Alderaan   Human   8
 4:        Darth Vader  Tatooine   Human  44
 5:        Leia Organa  Tatooine   Human  25
 6:          Owen Lars  Tatooine   Human  32
 7: Beru Whitesun lars   Stewjon   Human  38
 8:              R5-D4  Tatooine   Human   7
 9:  Biggs Darklighter  Kashyyyk Wookiee  65
10:     Obi-Wan Kenobi  Corellia   Human  68

Exporting data

Most import functions have their output counterparts, e.g., write.table, write.csv, write.csv2, fwrite. These are useful for writing out rectangular data for use in other programs.
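
A minimal round trip with write.csv and read.csv, sketched with the built-in mtcars data and a temporary file so that nothing is left behind:

```r
out <- tempfile(fileext = ".csv")

# write the data frame, including row names, to a CSV file
write.csv(mtcars, file = out, row.names = TRUE)

# read it back, using the first column as row names
back <- read.csv(out, row.names = 1)
dim(back)
head(back)

unlink(out)   # clean up the temporary file
```

Text formats like CSV store only the values, so column types are guessed again on import.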

Another under-used way of exporting objects is saveRDS. It saves any R object to a file, which then gives you exactly the same object when read back into R using readRDS. I use this frequently for intermediate datasets, analysis results stored in a list, and even functions.

Example

lmfit <- lm(mpg ~ wt, data = mtcars)
lmfit

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344  
saveRDS(lmfit, file = "reg-ex.rds")

readRDS("reg-ex.rds")

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344  

Practical

  1. Practice working with vectors and matrices
  2. Think about ways to organize data and output into data structures
  3. Compare and contrast base R, data.table, and tibble for working with data.
