[1] TRUE
[1] 1.2
[1] "hello"
[1] low
Levels: low med high
[1] "2022-05-11"
NULL
Day 1, B
“Bad programmers worry about the code. Good programmers worry about data structures and their relationships.”
Linus Torvalds, creator of Linux
logical, numeric, character, factor, date, …
A vector is a one-dimensional collection of data of the same type. It can be named or unnamed, and can be created in many ways:
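A minimal sketch of a few common ways to build vectors. The object names are illustrative and are not taken from the original slides.

```r
# Illustrative objects only; none of these names come from the slides.
v_num   <- c(1.2, 3.4, 5.6)              # combine values with c()
v_int   <- 1:4                           # integer sequence
v_seq   <- seq(0, 1, by = 0.25)          # regular sequence
v_rep   <- rep("a", times = 3)           # repeat a value
v_named <- c(low = 1, med = 2, high = 3) # named vector

typeof(v_num)   # storage type of a numeric vector
typeof(v_int)   # storage type of an integer sequence
names(v_named)  # names attached to the elements
```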
NA for missing values; note that NA has a data type.
NaN for “not a number”, e.g., 0 / 0.
NULL is empty and has length 0.
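The three special values can be compared directly. This is a small sketch using only base R.

```r
typeof(NA)             # the default NA is logical
typeof(NA_character_)  # typed NA constants exist for other types
typeof(NA_real_)

is.nan(0 / 0)          # 0 / 0 is NaN
is.na(0 / 0)           # NaN also counts as missing for is.na()

length(NULL)           # NULL has length 0
length(c(1, NA, 3))    # NA holds a position in the vector
length(c(1, NULL, 3))  # NULL disappears when combined
```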
Subsequences of vectors are obtained with square brackets []
Inside the square brackets goes the index, which can itself be a vector of numbers, logicals, or characters (if the vector is named).
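A short sketch of the three index types, using a hypothetical named vector x:

```r
x <- c(a = 10, b = 20, c = 30, d = 40)  # hypothetical example vector

x[2]            # by position
x[c(1, 3)]      # several positions at once
x[-1]           # a negative index drops that element
x[x > 15]       # logical index: keep elements where the condition is TRUE
x[c("a", "d")]  # by name
```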
A list is a collection of elements that need not be the same type. An element of a list can be any R object, and the elements can be named or unnamed.
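One way to build lists like the ones printed in this section is sketched here. It is a reconstruction that appears consistent with the output, not the original slide code, and the object names are made up.

```r
# Unnamed list holding an integer vector, a character vector, and a function
unnamed <- list(1:4, letters[1:4], mean)

# Named list; an element can itself be a list
named <- list(
  numbers = 1:4,
  letters = letters[1:4],
  mean    = mean,
  list    = list("a", 1, TRUE)
)

empty    <- list()            # a list with no elements
prealloc <- vector("list", 3) # a list of three NULL placeholders

names(named)
length(prealloc)
```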
[[1]]
[1] 1 2 3 4
[[2]]
[1] "a" "b" "c" "d"
[[3]]
function (x, ...)
UseMethod("mean")
<bytecode: 0x5e74b9ba4650>
<environment: namespace:base>
$numbers
[1] 1 2 3 4
$letters
[1] "a" "b" "c" "d"
$mean
function (x, ...)
UseMethod("mean")
<bytecode: 0x5e74b9ba4650>
<environment: namespace:base>
$list
$list[[1]]
[1] "a"
$list[[2]]
[1] 1
$list[[3]]
[1] TRUE
list()
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
A list can be indexed with square brackets [] or double square brackets [[]], but there is a difference!
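The difference is easiest to see by checking what each form returns. A minimal sketch with a hypothetical list lst:

```r
lst <- list(numbers = 1:4, letters = letters[1:4])  # hypothetical list

lst["numbers"]        # single brackets: a list of length 1
lst[["numbers"]]      # double brackets: the element itself
lst$numbers           # $ is shorthand for [[ ]] with a name

class(lst["numbers"])    # still a list
class(lst[["numbers"]])  # the integer vector inside

lst[c("numbers", "letters")]  # single brackets can select several elements
lst[["numbers"]][2]           # extract the element, then index into it
```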
[1] 1 2 3 4
$numbers
[1] 1 2 3 4
$letters
[1] "a" "b" "c" "d"
$mean
function (x, ...)
UseMethod("mean")
<bytecode: 0x5e74b9ba4650>
<environment: namespace:base>
$list
$list[[1]]
[1] "a"
$list[[2]]
[1] 1
$list[[3]]
[1] TRUE
$numbers
[1] 1 2 3 4
$letters
[1] "a" "b" "c" "d"
[1] 2
[1] 1 2 3 4
[1] 1 2 3 4
New elements can be added by name or number
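A small sketch of both forms of assignment, on a hypothetical list:

```r
lst <- list(numbers = 1:4, letters = letters[1:4])  # hypothetical list

lst$LETTERS <- LETTERS[1:4]  # add by name
lst[[4]]    <- LETTERS[1:5]  # add by position; this element gets no name

length(lst)
names(lst)   # the unnamed element shows up as an empty string
```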
$numbers
[1] 1 2 3 4
$letters
[1] "a" "b" "c" "d"
$mean
function (x, ...)
UseMethod("mean")
<bytecode: 0x5e74b9ba4650>
<environment: namespace:base>
$list
$list[[1]]
[1] "a"
$list[[2]]
[1] 1
$list[[3]]
[1] TRUE
$LETTERS
[1] "A" "B" "C" "D"
[[6]]
[1] "A" "B" "C" "D" "E"
The c function concatenates new elements to the list
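One subtlety with c is whether the new item is wrapped in list() first. The sketch below uses hypothetical objects to show the two outcomes, which correspond to a single $AB element versus separate $AB1 and $AB2 elements.

```r
lst <- list(numbers = 1:4)  # hypothetical starting list

# Wrapped in list(): the vector is added as ONE new element
one_elem <- c(lst, list(AB = c("A", "B")))

# Not wrapped: each value of the vector becomes its own element
two_elem <- c(lst, AB = c("A", "B"))

names(one_elem)
names(two_elem)
```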
$numbers
[1] 1 2 3 4
$letters
[1] "a" "b" "c" "d"
$mean
function (x, ...)
UseMethod("mean")
<bytecode: 0x5e74b9ba4650>
<environment: namespace:base>
$list
$list[[1]]
[1] "a"
$list[[2]]
[1] 1
$list[[3]]
[1] TRUE
$LETTERS
[1] "A" "B" "C" "D"
[[6]]
[1] "A" "B" "C" "D" "E"
$AB
[1] "A" "B"
$numbers
[1] 1 2 3 4
$letters
[1] "a" "b" "c" "d"
$mean
function (x, ...)
UseMethod("mean")
<bytecode: 0x5e74b9ba4650>
<environment: namespace:base>
$list
$list[[1]]
[1] "a"
$list[[2]]
[1] 1
$list[[3]]
[1] TRUE
$LETTERS
[1] "A" "B" "C" "D"
[[6]]
[1] "A" "B" "C" "D" "E"
$AB1
[1] "A"
$AB2
[1] "B"
Just like in math, R matrices are like vectors with 2 dimensions, and they are also indexed with square brackets.
There are lots of matrix manipulation functions in base R
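A few of these base R functions are sketched below on a 3 x 4 matrix. The calls are chosen to resemble the printed output in this section, but they are a reconstruction rather than the original code.

```r
M <- matrix(1:12, nrow = 3, ncol = 4)  # filled column by column

dimnames(M) <- list(c("a", "b", "c"), c("A", "B", "C", "D"))
M

dim(M)        # number of rows and columns
diag(M)       # diagonal entries
lower.tri(M)  # logical matrix marking entries below the diagonal
row(M)        # row index of every cell
col(M)        # column index of every cell
```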
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
A B C D
a 1 4 7 10
b 2 5 8 11
c 3 6 9 12
[1] 1 5 9
[,1] [,2] [,3] [,4]
[1,] FALSE FALSE FALSE FALSE
[2,] TRUE FALSE FALSE FALSE
[3,] TRUE TRUE FALSE FALSE
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 2 2 2 2
[3,] 3 3 3 3
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 1 2 3 4
[3,] 1 2 3 4
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
Using square brackets, we get a single value by using two numbers or names separated by a comma:
A missing index means “everything”, so this returns a vector
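A minimal sketch of these indexing forms on a hypothetical 3 x 4 matrix, including drop = FALSE to keep the result a matrix:

```r
M <- matrix(1:12, nrow = 3,
            dimnames = list(c("a", "b", "c"), c("A", "B", "C", "D")))

M[2, 3]               # single value by position
M["b", "C"]           # the same value by name
M[2, ]                # missing column index: the whole second row, as a vector
M[2, , drop = FALSE]  # same row, kept as a 1 x 4 matrix
```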
[1] 2 5 8 11
[,1] [,2] [,3] [,4]
[1,] 2 5 8 11
A logical matrix or a numeric vector can also be used as a single index
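With a single index, the matrix is treated as one long vector in column order. A short sketch, using a hypothetical matrix and an arbitrary threshold of 7:

```r
M <- matrix(1:12, nrow = 3)  # hypothetical matrix

M[1:6]    # numeric single index: first six cells in column order
M[M < 7]  # logical matrix as index: cells where the condition holds

M2 <- M
M2[M2 < 7] <- 0  # the same kind of index works for assignment
M2
```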
[1] 1 2 3 4 5 6
[,1] [,2] [,3] [,4]
[1,] 0 0 7 10
[2,] 0 0 8 11
[3,] 0 0 9 12
Index matrices are convenient but hard to understand
Before, we used a single index for each dimension:
If we create a series of paired single indices, and store them in a matrix with 2 columns, we can use that matrix as an index:
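A minimal sketch of an index matrix. Each row of the hypothetical idx holds one (row, column) pair:

```r
M <- matrix(1:12, nrow = 3)  # hypothetical matrix

# One pair of indices at a time
M[1, 2]
M[3, 4]
M[2, 1]

# The same three cells in one step: each row of idx is a (row, column) pair
idx <- cbind(c(1, 3, 2), c(2, 4, 1))
M[idx]
```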
Matrix transpose, multiplication, inversion, eigenvalues, etc., are all available in R
S1 <- matrix(runif(9), nrow = 3, ncol = 3)
S2 <- matrix(runif(9), nrow = 3, ncol = 3)
x <- c(1, 1.5, 3)
S1 %*% S2
[,1] [,2] [,3]
[1,] 0.5817102 0.4319694 0.5662887
[2,] 0.8618178 0.6708640 0.8093220
[3,] 0.9190139 0.9417321 1.3805946
[,1] [,2] [,3]
[1,] 0.7698256 0.5293676 0.9130591
[2,] 0.5945810 0.5559717 0.7709751
[3,] 0.8993657 0.9304431 1.3227746
[,1]
[1,] 9.059178
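Because the matrices above are random, the sketch below uses a small fixed matrix so the results are reproducible. The objects S and x here are illustrative and separate from S1, S2, and x above.

```r
S <- matrix(c(2, 1, 1, 3), nrow = 2)  # small symmetric matrix (illustrative)
x <- c(1, 2)

t(S)               # transpose
S %*% S            # matrix multiplication
S_inv <- solve(S)  # inverse
S %*% S_inv        # identity matrix, up to rounding error
eigen(S)$values    # eigenvalues
t(x) %*% S %*% x   # quadratic form; the result is a 1 x 1 matrix
```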
Arrays are like matrices, but with more dimensions. A matrix is an array with 2 dimensions. Arrays can have more than 2 dimensions.
The data get filled in along the first dimension, then the second, then the third, …
, , 1
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
, , 2
[,1] [,2] [,3] [,4]
[1,] 17 21 25 29
[2,] 18 22 26 30
[3,] 19 23 27 31
[4,] 20 24 28 32
The data also get “unrolled” in the same way.
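The fill order and the unrolling can be checked with a 4 x 4 x 2 array like the one printed above. The name A is an assumption.

```r
A <- array(1:32, dim = c(4, 4, 2))  # filled along dimension 1, then 2, then 3

dim(A)
A[, , 1]      # first slice holds 1 to 16
A[, , 2]      # second slice holds 17 to 32
as.vector(A)  # unrolling returns the values in the original fill order
```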
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32
Indexing works like it does with matrices:
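A sketch of the indexing forms on a 4 x 4 x 2 array, with one index per dimension. The name A and the index matrix idx are illustrative.

```r
A <- array(1:32, dim = c(4, 4, 2))

A[2, 4, 2]                   # a single value
A[2, 4, ]                    # missing third index: one value per slice
A[, , 2]                     # a whole slice, returned as a 4 x 4 matrix
dim(A[, , 2, drop = FALSE])  # drop = FALSE keeps all three dimensions

# Index matrix: each row is a (row, column, slice) triple
idx <- rbind(c(2, 4, 2), c(2, 4, 1), c(3, 2, 1))
A[idx]
```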
[1] 30
[1] 14 30
[,1] [,2] [,3] [,4]
[1,] 17 21 25 29
[2,] 18 22 26 30
[3,] 19 23 27 31
[4,] 20 24 28 32
, , 1
[,1] [,2] [,3] [,4]
[1,] 17 21 25 29
[2,] 18 22 26 30
[3,] 19 23 27 31
[4,] 20 24 28 32
[1] 30 14 7
Data frames look like matrices, but the columns can be different data types:
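A data frame like the one printed below can be built as in this sketch. The name df is an assumption, and stringsAsFactors is set explicitly so the character column behaves the same on older R versions.

```r
df <- data.frame(
  logical = c(FALSE, TRUE, FALSE),
  numeric = 1:3,
  char    = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

df
sapply(df, class)  # each column keeps its own type
```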
logical numeric char
1 FALSE 1 a
2 TRUE 2 b
3 FALSE 3 c
[1] "a" "b" "c"
[1] 1 2 3
While they look like matrices, they act more like lists:
Multiple ways to refer to a particular column:
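The list-like and matrix-like access styles all return the same column. A minimal sketch with a hypothetical df:

```r
df <- data.frame(logical = c(FALSE, TRUE, FALSE), numeric = 1:3)

is.list(df)  # a data frame is a list of columns
length(df)   # so its length is the number of columns

df$numeric       # list style, by name
df[["numeric"]]  # list style, double brackets
df[[2]]          # list style, by position
df[, "numeric"]  # matrix style, by name
df[, 2]          # matrix style, by position
```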
[1] 1 2 3
[1] 1 2 3
[1] 1 2 3
[1] 1 2 3
[1] 1 2 3
Subsetting
logical numeric char missing
2 TRUE 2 b NA
logical numeric char missing
1 FALSE 1 a NA
2 TRUE 2 b NA
Manipulation
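A sketch of common ways to add and use columns, written to resemble the output in this part. The exact calls on the original slides are not known, so treat transform and with here as one possible choice.

```r
df <- data.frame(logical = c(FALSE, TRUE, FALSE), numeric = 1:3)

# Add a column by assignment
df$numeric.squared <- df$numeric^2

# Add several columns at once; column names can be used directly
df2 <- transform(df, not.logical = !logical, numeric.cubed = numeric^3)
df2

# with() evaluates an expression using the columns as variables
with(df, sqrt(numeric))

# Outside of with()/transform(), a bare column name is not a variable:
# typing numeric.squared at the console gives "object not found"
```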
logical numeric char missing numeric.squared
1 FALSE 1 a NA 1
2 TRUE 2 b NA 4
3 FALSE 3 c NA 9
logical numeric char missing numeric.squared not.logical numeric.cubed
1 FALSE 1 a NA 1 TRUE 1
2 TRUE 2 b NA 4 FALSE 8
3 FALSE 3 c NA 9 TRUE 27
[1] 1.000000 1.414214 1.732051
[1] 1 2 3
[1] TRUE FALSE TRUE
[1] 1 4 9
[1] 1 2 3
Error in eval(expr, envir, enclos): object 'numeric.squared' not found
Coercion is what happens when data gets converted from one type to another (e.g., numeric to character).
This can also be done explicitly using the as.* family of functions.
One of R’s “nice” features is that it will automatically attempt to coerce data when different types meet in an operation.
Examples
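A few explicit and automatic coercions in base R, as a small sketch:

```r
# Explicit coercion with the as.* functions
as.numeric(c(FALSE, TRUE, FALSE))  # logical to numeric
as.logical(c(1, 0))                # numeric to logical
as.character(1:3)                  # integer to character

# Automatic coercion when types meet
c(1, "a", TRUE)             # everything becomes character
TRUE + TRUE                 # logicals are treated as 1 and 0 in arithmetic
paste("A", 1:4, sep = "_")  # numbers become character when pasted
```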
[1] 0 1 0
[1] 1 0
[1] TRUE FALSE
[1] "A_1" "A_2" "A_3" "A_4"
This is sometimes useful, but at other times it can cause problems:
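Two well-known pitfalls, sketched with made-up values: factors coerce to their internal level codes, and text that is not a number becomes NA.

```r
f <- factor(c("10", "20", "10"))  # made-up values

as.numeric(f)                # the level codes, not the labels
as.numeric(as.character(f))  # go through character to recover the numbers

# Non-numeric text becomes NA; R issues a warning here
bad <- as.numeric(c("1", "2", "three"))
bad
```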
Hopefully R warns you if data is destroyed due to coercion.
If the data type is critical for an operation, then it is up to you to check using the is.* family of functions.
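A minimal sketch of such a check. The helper safe_mean is hypothetical and only illustrates guarding a computation with is.numeric.

```r
x <- c("1", "2", "3")  # numbers stored as text

is.numeric(x)
is.character(x)

# Hypothetical helper: stop early if the input is not numeric
safe_mean <- function(v) {
  stopifnot(is.numeric(v))
  mean(v)
}

safe_mean(as.numeric(x))  # works after explicit conversion
# safe_mean(x)            # would stop with an error instead of guessing
```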
Vector and array arithmetic works elementwise, as long as the operands have the same dimensions.
If not, the shorter one is sometimes recycled to match the length of the longer one.
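A short sketch of recycling with plain vectors:

```r
1:6 + 1:2  # the shorter vector is repeated: 1 2 1 2 1 2
1:6 * 10   # a single value is recycled across the whole vector

# When the longer length is not a multiple of the shorter one,
# R still recycles but gives a warning
res <- 1:6 + 1:4
res
```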
I often make this mistake when calculating proportions from a table:
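The counts printed below match a table of cyl by gear from the built-in mtcars data, so the sketch assumes that table. It contrasts dividing directly by colSums, which recycles down the columns, with two ways of dividing each column by its own total.

```r
tab <- table(mtcars$cyl, mtcars$gear)  # assumed source of the counts

# Mistake: colSums(tab) is recycled down each column,
# so row i is divided by the i-th column total
wrong <- tab / colSums(tab)

# Intended: divide every column by its own total
right      <- sweep(tab, 2, colSums(tab), "/")
also_right <- prop.table(tab, margin = 2)

colSums(wrong)  # the columns do not sum to 1
colSums(right)  # every column sums to 1
```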
3 4 5
4 1 8 2
6 2 4 1
8 12 0 2
3 4 5
4 0.06666667 0.53333333 0.13333333
6 0.16666667 0.33333333 0.08333333
8 2.40000000 0.00000000 0.40000000
3 4 5
4 0.06666667 0.66666667 0.40000000
6 0.13333333 0.33333333 0.20000000
8 0.80000000 0.00000000 0.40000000
3 4 5
4 0.06666667 0.66666667 0.40000000
6 0.13333333 0.33333333 0.20000000
8 0.80000000 0.00000000 0.40000000
Again, hopefully R warns you about this, but when in doubt, check and validate lengths.
Any object can have attributes, which are data attached to the object. They are a flexible way to include information with an object.
They are stored as name-value pairs, as in a list. Query or replace them with attributes or attr
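A minimal sketch of querying and setting attributes on an array. The name A and the note text are assumptions that mirror the output in this section.

```r
A <- array(1:32, dim = c(4, 4, 2))

attributes(A)   # all attributes, as a named list
attr(A, "dim")  # one attribute by name

# Attach a new, arbitrary attribute
attr(A, "note") <- "This is a new attribute"
names(attributes(A))
```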
$dim
[1] 4 4 2
[1] 4 4 2
, , 1
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
, , 2
[,1] [,2] [,3] [,4]
[1,] 17 21 25 29
[2,] 18 22 26 30
[3,] 19 23 27 31
[4,] 20 24 28 32
attr(,"note")
[1] "This is a new attribute"
Some attributes are special, e.g., class, comment, dim, dimnames, …, and have special ways of querying and setting
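The dedicated accessor functions for these attributes are sketched below on small illustrative objects.

```r
M <- matrix(1:6, nrow = 2)  # illustrative matrix

dim(M)    # query the dim attribute
class(M)  # query the class

dimnames(M) <- list(c("r1", "r2"), c("a", "b", "c"))  # set dimnames
comment(M)  <- "Created for illustration"             # set a comment
comment(M)  # comments are not printed with the object; query them explicitly

# Setting dim on a plain vector turns it into a matrix
v <- 1:6
dim(v) <- c(2, 3)
is.matrix(v)
```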
, , 1
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
, , 2
[,1] [,2] [,3] [,4]
[1,] 17 21 25 29
[2,] 18 22 26 30
[3,] 19 23 27 31
[4,] 20 24 28 32
attr(,"note")
[1] "This is a new attribute"
[1] "I created this array on 2024-03-20 15:39:58.962121"
An environment is kind of like a list: it contains a number of arbitrary objects.
The global environment is a special one; look at the upper right pane of RStudio, or run ls():
[1] "A1" "d1" "i3" "imat" "M1" "M2" "og.num" "S1"
[9] "S2" "t1" "x" "xl"
When you type a name in the console, R first looks for it in the global environment. If it cannot find it there, it then looks in the attached packages.
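A small sketch of working with an environment directly and of the search path R follows. The environment e and its contents are made up.

```r
e <- new.env()             # a fresh, empty environment
assign("a", 1, envir = e)  # store an object by name
e$b <- "two"               # list-like syntax also works

sort(ls(e))          # names of the objects stored in e
get("a", envir = e)  # retrieve an object by name

exists("a", envir = e, inherits = FALSE)  # look only in e itself
exists("mean")                            # found through the attached base package

search()  # the global environment first, then the attached packages
```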
We will come back to environments when we talk about functions.
Add-on packages can be installed from a few different places: CRAN, Bioconductor, GitHub, R-Forge, and locally from package files.
They are installed to your system with install.packages("pkgname")
When you use library("pkgname"), the package is attached, so that objects in the package can be found just by typing the name:
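The effect of attaching can be seen on the search path. This sketch uses splines, a package that ships with R, rather than the package used on the slides.

```r
"package:splines" %in% search()  # usually FALSE in a fresh session

library(splines)                 # attach the package
"package:splines" %in% search()  # now on the search path

bs_fun <- bs  # bs() from splines is found just by typing its name
is.function(bs_fun)

detach("package:splines")  # remove it from the search path again
```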
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
You can use objects from packages without attaching them with :: (two colons)
time status x
1 9 1 Maintained
2 13 1 Maintained
3 13 0 Maintained
4 18 1 Maintained
5 23 1 Maintained
6 28 0 Maintained
and you can get internal objects from a package with ::: (three colons)
Error in eval(expr, envir, enclos): object 'plot.aareg' not found
1 function (x, se = TRUE, maxtime, type = "s", ...)
2 {
3 if (!inherits(x, "aareg"))
4 stop("Must be an aareg object")
5 if (missing(maxtime))
6 keep <- 1:length(x$time)
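The two operators can also be tried with the stats package, which ships with R. The choice of plot.lm as an internal object is an assumption made for illustration.

```r
# Two colons: use an exported function without attaching the package
stats::median(c(1, 5, 9))

# Three colons: reach an object that lives inside the package namespace
f <- stats:::plot.lm
is.function(f)
head(f, 3)  # first few lines of the function source
```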
Some packages you may use (e.g., from Bioconductor) return S4 objects. These are kind of like lists, but to access objects (called ‘slots’) inside, use @ (the at symbol).
For example
## A simple class with two slots
track <- setClass("track", slots = c(x="numeric", y="numeric"))
## an object from the class
ts1 <- track(x = 1:10, y = 1:10 + rnorm(10))
ts1
An object of class "track"
Slot "x":
[1] 1 2 3 4 5 6 7 8 9 10
Slot "y":
[1] -0.4686968 1.4483865 2.1845973 3.7024516 4.7426660 6.1987013
[7] 6.1366448 6.9758053 8.9995082 10.5466263
[1] 1 2 3 4 5 6 7 8 9 10
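A few more ways to inspect and modify slots, sketched with a hypothetical Person class that is separate from the track example above.

```r
library(methods)

# Hypothetical class with two slots
Person <- setClass("Person", slots = c(name = "character", age = "numeric"))
p <- Person(name = "Ada", age = 36)

isS4(p)         # confirm this is an S4 object
slotNames(p)    # names of the available slots
p@name          # access a slot with @
slot(p, "age")  # functional form of the same access

p@age <- 37     # slots can be replaced with @ as well
p@age
```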
Tibbles (cute name for ‘table’) are data frames with enhanced printing.
[1] "tbl_df" "tbl" "data.frame"
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
You can convert a regular data.frame to a tibble
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
The indexing behavior is slightly different from data frames:
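The main difference is that single-bracket column selection on a tibble stays a tibble, while a data frame drops to a vector. This sketch assumes the tibble package is installed and skips the tibble part otherwise.

```r
df <- mtcars

class(df[, "mpg"])                # data.frame: a single column drops to a vector
class(df[, "mpg", drop = FALSE])  # unless drop = FALSE is given

if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(mtcars)  # convert a regular data frame

  print(class(tb[, "mpg"]))  # tibble: still a tibble with one column
  print(head(tb[["mpg"]]))   # [[ ]] extracts the column as a vector
  print(head(tb$mpg))        # so does $
}
```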
# A tibble: 32 × 1
mpg
<dbl>
1 21
2 21
3 22.8
4 21.4
5 18.7
6 18.1
7 14.3
8 24.4
9 22.8
10 19.2
# ℹ 22 more rows
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
data.table is a package that provides another data.frame extension.
It has many features for data manipulation and management, with a focus on speed: both typing speed and computing speed for large datasets.
There is a special syntax for indexing and merging using square brackets. We will come back to this (because it is my favorite tool for data management).
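A brief preview of the bracket syntax on mtcars, as a sketch. It assumes the data.table package is installed, and the object names dt and res are made up.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  dt <- as.data.table(mtcars)  # convert a data frame
  print(class(dt))

  print(head(dt[order(mpg)]))  # first argument: select or order rows

  # Second argument computes on columns; by = defines the groups
  res <- dt[, .(meanwt = mean(wt)), by = cyl]
  print(res)
}
```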
[1] "data.table" "data.frame"
mpg cyl disp hp drat wt qsec vs am gear carb
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mpg cyl disp hp drat wt qsec vs am gear carb
1: 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
2: 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
3: 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
4: 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
5: 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
6: 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
cyl meanwt
1: 8 3.999214
2: 6 3.117143
3: 4 2.285727
The basic functions are read.table, read.csv, read.csv2
Via add-on packages, R supports import of any data format I can think of.
The most flexible way to read in data is with the rio package. It guesses what the format is and uses the correct import tool (most of the time).
The slowness of reading in data usually comes from format guessing.
Supplying known column types can dramatically speed up import:
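With base R, column types are supplied through the colClasses argument. The sketch below writes a tiny made-up file to a temporary location so that it runs anywhere; it does not measure speed.

```r
# Write a small made-up dataset to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, name = c("a", "b", "c"), score = c(1.5, 2.5, 3.5)),
  tmp,
  row.names = FALSE
)

# Read it back, stating the column types instead of letting R guess
dat <- read.csv(tmp, colClasses = c("integer", "character", "numeric"))
str(dat)

unlink(tmp)  # clean up the temporary file
```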
fread from the data.table package is fast and also flexible:
Name homeworld species age
1: Luke Skywalker Tatooine Human 27
2: C-3PO Tatooine Human 12
3: R2-D2 Alderaan Human 8
4: Darth Vader Tatooine Human 44
5: Leia Organa Tatooine Human 25
6: Owen Lars Tatooine Human 32
7: Beru Whitesun lars Stewjon Human 38
8: R5-D4 Tatooine Human 7
9: Biggs Darklighter Kashyyyk Wookiee 65
10: Obi-Wan Kenobi Corellia Human 68
Most import functions have their output counterparts, e.g., write.table, write.csv, write.csv2, fwrite. These are useful for writing out rectangular data for use in other programs.
Another under-used way of exporting objects is saveRDS. It saves any R object to a file, which then gives you exactly the same object when read back into R using readRDS. I use this frequently for intermediate datasets, analysis results stored in a list, and even functions.
Example
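A minimal round trip through saveRDS and readRDS, using a made-up results list and a temporary file:

```r
# Made-up object: a list holding text, a named vector, and a function
results <- list(
  note  = "illustrative",
  coefs = c(a = 1, b = 2),
  f     = function(x) x + 1
)

path <- tempfile(fileext = ".rds")
saveRDS(results, path)     # write the object to disk
restored <- readRDS(path)  # read it back under any name

identical(restored$coefs, results$coefs)  # data come back unchanged
restored$f(1)                             # the stored function still works

unlink(path)  # clean up the temporary file
```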
data.table, and tibble packages for working with data.