Day 1, B
“Bad programmers worry about the code. Good programmers worry about data structures and their relationships.”
Linus Torvalds, creator of Linux
Basic data types: logical, numeric, character, factor, date, …
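As a small illustration, here is one value of each of the types named above, with class() used to check the type. The variable names are made up for this sketch.

```r
# One example value per basic type (names are illustrative only)
is_ok  <- TRUE                                             # logical
amount <- 1.2                                              # numeric (double)
word   <- "hello"                                          # character
dose   <- factor("low", levels = c("low", "med", "high"))  # factor
day    <- as.Date("2022-05-11")                            # date

class(is_ok)   # "logical"
class(amount)  # "numeric"
class(word)    # "character"
class(dose)    # "factor"
class(day)     # "Date"
```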
A vector is a one-dimensional collection of data of the same type. It can be named or unnamed, and can be created in many ways:
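A few common ways to create vectors in base R, as a minimal sketch (the object names are arbitrary):

```r
v1 <- c(1.5, 2, 10)                  # combine values
v2 <- 1:5                            # integer sequence
v3 <- seq(0, 1, by = 0.25)           # sequence with a step size
v4 <- rep("a", times = 3)            # repeat a value
v5 <- vector("logical", length = 2)  # empty logical vector: FALSE FALSE

# A named vector
named <- c(a = 1, b = 2, c = 3)
names(named)  # "a" "b" "c"
```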
NA for missing; note that this has a data type.
NaN for “not a number”, e.g., 0 / 0
NULL is empty, and has 0 length
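A short sketch of how these three special values behave:

```r
# NA is typed: the plain NA is logical, but there are typed variants
class(NA)             # "logical"
class(NA_character_)  # "character"
class(NA_real_)       # "numeric"
is.na(c(1, NA, 3))    # FALSE TRUE FALSE

# NaN results from undefined arithmetic
nan_value <- 0 / 0
is.nan(nan_value)     # TRUE
is.na(nan_value)      # TRUE: NaN also counts as missing

# NULL is the empty object
length(NULL)          # 0
is.null(NULL)         # TRUE
```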
Subsequences of vectors are obtained with square brackets [].
Inside the square brackets goes the index, which can itself be a vector of numbers, logicals, or characters (if the vector is named).
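The three kinds of index can be sketched on a small named vector (the values are arbitrary):

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

x[2]            # numeric index: second element
x[c(1, 3)]      # several positions at once
x[-1]           # negative index: everything except the first
x[x > 15]       # logical index: elements where the condition is TRUE
x[c("a", "d")]  # character index: elements by name
```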
A list is a collection of things that are not required to be the same type. An element of a list can be any R object. Lists can also be named or not.
A list can be indexed with square brackets [] or double-square brackets [[]], but there is a difference!
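A minimal sketch of the difference: single brackets return a (sub)list, while double brackets return the element itself. The list contents are made up.

```r
lst <- list(num = 1:3, txt = "hello", flag = TRUE)

sub <- lst["num"]    # still a list, of length 1
el  <- lst[["num"]]  # the integer vector stored inside

class(sub)  # "list"
class(el)   # "integer"

lst$txt     # $ is shorthand for [[ ]] with a name
```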
New elements can be added by name or number
The c function concatenates new elements to the list
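Both ways of growing a list, sketched with arbitrary values:

```r
lst <- list(a = 1)

lst$b      <- "two"  # add by name with $
lst[["c"]] <- 3      # add by name with [[ ]]
lst[[4]]   <- TRUE   # add by position (this element has no name)

# c() concatenates; wrap the new elements in list()
lst <- c(lst, list(e = 5:6))

length(lst)  # 5
names(lst)   # "a" "b" "c" "" "e"
```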
Just like in math, R matrices are like vectors with 2 dimensions, and they are also indexed by square brackets.
There are lots of matrix manipulation functions in base R
Using square brackets, we get a single value by using two numbers or names separated by a comma:
A missing index means “everything”, so this returns a vector
You can also use a logical matrix or a numeric vector as a single index
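These indexing styles can be sketched on a small matrix with row and column names. The names r1, r2, a, b, c are made up for the example.

```r
M <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
dim(M)        # 2 3

M[2, 3]       # single value by position: 6
M["r1", "b"]  # single value by name: 3

M[1, ]        # missing column index: the whole first row, as a vector
M[, "c"]      # missing row index: the whole third column

M[M > 4]      # logical matrix as a single index: 5 6
M[5]          # single numeric index counts down the columns: 5
```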
Index matrices are convenient but hard to understand
Before we used a single index for each dimension:
If we create a series of paired single indices, and store them in a matrix with 2 columns, we can use that matrix as an index:
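A sketch of the contrast, with arbitrary values: each row of the index matrix is one (row, column) pair, so the result is one value per pair rather than a sub-matrix.

```r
M <- matrix(1:12, nrow = 3, ncol = 4)

# One index per dimension: returns the 3 x 3 block of all combinations
block <- M[c(1, 2, 3), c(4, 1, 2)]
dim(block)  # 3 3

# Two-column index matrix: each row is a (row, column) pair
idx <- cbind(c(1, 2, 3), c(4, 1, 2))  # the pairs (1,4), (2,1), (3,2)
M[idx]      # 10 2 6
```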
Matrix transpose, multiplication, inversion, eigenvalues, etc, are all available in R
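# Two random 3 x 3 matrices and a length-3 vector
S1 <- matrix(runif(9), nrow = 3, ncol = 3)
S2 <- matrix(runif(9), nrow = 3, ncol = 3)
x <- c(1, 1.5, 3)
# Matrix product (not elementwise multiplication)
S1 %*% S2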
[,1] [,2] [,3]
[1,] 1.302694 0.9729528 1.514939
[2,] 1.091354 0.7723148 1.142920
[3,] 1.208944 0.9165629 1.444500
[,1] [,2] [,3]
[1,] 1.2363695 0.9751907 1.591019
[2,] 1.2367629 0.9261671 1.445685
[3,] 0.9502872 0.7489214 1.219555
[,1]
[1,] -4.448517
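The other operations mentioned (transpose, inversion, eigenvalues) can be sketched on a small fixed matrix, so the results do not depend on random numbers. The matrix A and vector b are chosen only for illustration.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # a symmetric 2 x 2 matrix
b <- c(1, 2)

t(A)                   # transpose
A %*% A                # matrix multiplication
A_inv <- solve(A)      # inverse
A_inv %*% A            # identity matrix, up to rounding error
sol <- solve(A, b)     # solves A %*% sol = b without forming the inverse
ev <- eigen(A)$values  # eigenvalues

# A quadratic form; the result is a 1 x 1 matrix
q <- t(b) %*% A_inv %*% b
dim(q)                 # 1 1
```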
Arrays are like matrices, but can have more than 2 dimensions. A matrix is an array with exactly 2 dimensions.
The data gets filled in by the first dimension, then the second, then the third, …
The data also get “unrolled” in the same way.
Indexing works like it does with matrices:
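A sketch of the fill order, unrolling, and indexing with a 2 x 3 x 4 array filled with 1 to 24:

```r
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)       # 2 3 4

# The first index moves fastest, then the second, then the third
arr[2, 1, 1]   # 2
arr[1, 2, 1]   # 3
arr[1, 1, 2]   # 7

# Unrolling gives back the data in the same order
unrolled <- as.vector(arr)

# A missing index means "everything" along that dimension
slice <- arr[, , 1]  # a 2 x 3 matrix
dim(slice)
```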
Data frames look like matrices, but the columns can be different data types:
While they look like matrices, they act more like lists:
Multiple ways to refer to a particular column:
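A minimal sketch with a made-up data frame: the columns have different types, the object behaves like a list of columns, and a column can be pulled out in several equivalent ways.

```r
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 score = c(2.5, 3.1, 4.0))

sapply(df, class)  # one type per column
length(df)         # 3: like a list, the length is the number of columns
names(df)

# Equivalent ways to get the score column as a vector
df$score
df[["score"]]
df[, "score"]
df[[3]]

# Single brackets with one index keep the data frame structure
df["score"]
```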
Subsetting
Manipulation
Be careful with attach, if you know about it.
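Some base R idioms for subsetting and manipulating a data frame, sketched on made-up data. The use of with() here is one possible alternative to attach, since it evaluates an expression inside the data frame without changing the search path.

```r
df <- data.frame(id = 1:4,
                 grp = c("a", "b", "a", "b"),
                 score = c(2.5, 3.1, 4.0, 1.2))

# Subsetting rows with a logical index, or with subset()
high  <- df[df$score > 2, ]
high2 <- subset(df, score > 2, select = c(id, score))

# Manipulation: add, transform, and remove columns
df$score2 <- df$score * 2
df <- transform(df, centered = score - mean(score))
df$id <- NULL

# with() looks up column names in the data frame
mean_a <- with(df, mean(score[grp == "a"]))  # 3.25
```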
Coercion is what happens when data gets converted from one type to another (e.g., numeric to character).
This can also be done explicitly using the as. family of functions.
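A few explicit conversions, as a sketch. The last two lines show a common pitfall: converting a factor directly gives the internal level codes, not the labels.

```r
as.character(1.5)                      # "1.5"
as.numeric("3.14")                     # 3.14
as.integer(3.9)                        # 3 (truncates, does not round)
as.logical(c("TRUE", "false", "yes"))  # TRUE FALSE NA

f <- factor(c("10", "20"))
as.numeric(f)                          # 1 2: the level codes
as.numeric(as.character(f))            # 10 20: the labels as numbers
```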
One of R’s “nice” features is that it will automatically attempt to coerce data when different types meet in an operation.
Examples
This is useful sometimes, other times it can cause problems:
Hopefully R warns you if data is destroyed due to coercion.
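A sketch of automatic coercion, first where it helps and then where it can surprise:

```r
# Useful: logicals become numbers in arithmetic
TRUE + TRUE                # 2
sum(c(TRUE, FALSE, TRUE))  # 2: counts the TRUE values
paste("n =", 5)            # "n = 5"

# Mixing types in c() coerces everything to the most general type
mixed <- c(1, "a", TRUE)   # "1" "a" "TRUE"

# Surprising: the number is coerced to a string before comparing
10 < "9"                   # TRUE, because "10" sorts before "9"

# Data destroyed by coercion: R gives NA, with a warning
bad <- as.numeric(c("1", "two"))
```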
If the data type is critical for an operation, then it is up to you to check using the is. family of functions.
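One way such a check could look, as a sketch. safe_mean is a hypothetical helper written for this example, not a base R function.

```r
x <- c("1", "2", "3")
is.numeric(x)    # FALSE
is.character(x)  # TRUE

# Hypothetical helper: refuse to run on anything that is not numeric
safe_mean <- function(v) {
  if (!is.numeric(v)) {
    stop("v must be numeric, got ", class(v)[1])
  }
  mean(v)
}

safe_mean(as.numeric(x))  # 2
```

Calling safe_mean(x) directly would stop with an error instead of silently returning NA.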
Vector and array arithmetic works elementwise, as long as the operands have the same dimensions.
If not, the shorter one is sometimes recycled to match the length of the longer one
I often make this mistake when calculating proportions from a table:
Again, hopefully R warns you about this, but when in doubt check and validate lengths.
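A sketch of the proportion pitfall on a small matrix of counts (the counts are made up). Recycling runs down the columns, so dividing by the column totals directly lines the totals up with the wrong cells, and here R gives no warning because 6 is a multiple of 3.

```r
# Silent recycling: the shorter vector is reused
1:6 + 1:2  # 2 4 4 6 6 8

counts <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns
col_totals <- colSums(counts)    # 3 7 11

# Wrong: the 3 totals are recycled down the columns of a 2-row matrix
wrong <- counts / col_totals

# One correct alternative: prop.table with margin = 2 (columns)
right <- prop.table(counts, margin = 2)

colSums(wrong)  # not all equal to 1
colSums(right)  # 1 1 1
```

sweep(counts, 2, col_totals, "/") is another option; in either case, checking that the proportions sum to 1 is a cheap validation step.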
Any object can have attributes, which are data that get attached to the object. It is a flexible way to include information with an object.
They are stored as names and values, as in a list. Query or replace them with attributes or attr
Some attributes are special, e.g., class, comment, dim, dimnames, …, and have special ways of querying and setting
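A sketch of both kinds of attribute. The attribute name "units" is made up; dim, dimnames, and comment use their own accessor functions.

```r
x <- 1:6

# An arbitrary attribute, set and queried with attr()
attr(x, "units") <- "cm"
attr(x, "units")  # "cm"

# Special attributes have dedicated functions
dim(x) <- c(2, 3)  # setting dim turns the vector into a matrix
is.matrix(x)       # TRUE
dimnames(x) <- list(c("r1", "r2"), c("a", "b", "c"))
comment(x) <- "toy example"

# All attributes at once, as a named list
names(attributes(x))
```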
An environment is kind of like a list: it contains a number of arbitrary objects.
The global environment is a special one; look at the upper right pane of RStudio, or run
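For instance, ls() lists the objects in an environment. The sketch below assumes that this is the kind of command meant here; the object names are made up.

```r
# Put an object in the global environment and list its contents
assign("demo_value", 42, envir = globalenv())
ls(globalenv())
exists("demo_value", envir = globalenv())  # TRUE

# Environments behave a bit like lists of named objects
e <- new.env()
e$b <- "two"
assign("a", 1, envir = e)
ls(e)               # "a" "b"
get("a", envir = e) # 1
e[["b"]]            # "two"
```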
When you type a name in the console, it will first look for it in the global environment. If it cannot find it there, it will then look in the attached packages.
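The lookup order can be inspected with search(), as a small sketch:

```r
# The search path: the global environment first, then attached packages
path <- search()
path[1]             # ".GlobalEnv"
path[length(path)]  # "package:base"

# sd is not defined in the global environment; it is found in stats
environmentName(environment(stats::sd))  # "stats"
```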
We will come back to environments when we talk about functions.
Add-on packages can be installed from a few different places: CRAN, Bioconductor, GitHub, R-Forge, and locally from package files.
They are installed to your system with install.packages("pkgname")
When you use library("pkgname"), the package is attached, so that objects in the package can be found just by typing the name:
You can use objects from packages without attaching them with :: (two colons)
time status x
1 9 1 Maintained
2 13 1 Maintained
3 13 0 Maintained
4 18 1 Maintained
5 23 1 Maintained
6 28 0 Maintained
You can also get internal objects from a package with ::: (three colons)
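A sketch of both operators using packages that ship with R (stats and utils), chosen here only so the example runs without installing anything:

```r
# :: reaches an exported object without attaching the package
stats::median(c(1, 5, 9))  # 5
utils::head(letters, 3)    # "a" "b" "c"

# ::: reaches an internal (unexported) object;
# plot.lm is the plot method for lm objects inside stats
f <- stats:::plot.lm
is.function(f)             # TRUE
```

Internal objects can change between package versions without notice, so ::: is better suited to exploration than to code you rely on.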
Some packages you may use (e.g., from Bioconductor) return S4 objects. These are kind of like lists, but to access objects (called ‘slots’) inside, use @ (the at symbol).
For example
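A minimal sketch with a made-up S4 class called Person; real packages define their own classes and slot names.

```r
library(methods)

# Define a toy S4 class with two slots
setClass("Person", slots = c(name = "character", age = "numeric"))
p <- new("Person", name = "Ada", age = 36)

isS4(p)         # TRUE
slotNames(p)    # "name" "age"

p@name          # access a slot with @
slot(p, "age")  # the same thing, by slot name as a string

p@age <- 37     # slots can be replaced the same way
```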
Tibbles (cute name for ‘table’) are data frames with enhanced printing.
library(tibble)
library(palmerpenguins)
class(penguins) ## this will be data.frame in the browser, but tbl_df in R
[1] "tbl_df" "tbl" "data.frame"
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
You can convert a regular data.frame to a tibble
The indexing behavior is slightly different from data frames:
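A sketch of the conversion and of one indexing difference, assuming the tibble package is installed. The data are made up.

```r
library(tibble)

df <- data.frame(id = 1:3, grp = c("a", "b", "c"))
tb <- as_tibble(df)  # convert a regular data.frame

# Selecting one column with [ , ] drops to a vector for a data.frame ...
class(df[, "id"])    # "integer"

# ... but always stays a tibble for a tibble
is_tibble(tb[, "id"])  # TRUE

# Use [[ ]] or $ to get the column as a vector from either one
tb[["id"]]
```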
data.table is a package that provides another data.frame extension.
It has many features for data manipulation and management with a focus on speed, both in typing and in computation for large datasets.
There is a special syntax for indexing and merging using square brackets; we will come back to this (because it is my favorite tool for data management).
The basic functions are read.table, read.csv, read.csv2
Via add-on packages, R supports import of any data format I can think of.
The most flexible way to read in data is with the rio package. It guesses what the format is and uses the correct import tool (most of the time)
The slowness of reading in data usually comes from format guessing.
Supplying known column types can dramatically speed up import:
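With the base functions, column types are supplied through the colClasses argument. The sketch below writes a tiny temporary file first so that it is self-contained; on a file this small there is no measurable speed difference, it only shows the mechanics.

```r
# Write a small CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3,
                     name = c("a", "b", "c"),
                     score = c(1.5, 2, 3.25)),
          tmp, row.names = FALSE)

# Read it back, telling R the type of each column up front
dat <- read.csv(tmp, colClasses = c("integer", "character", "numeric"))
sapply(dat, class)

unlink(tmp)  # clean up the temporary file
```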
fread from the data.table package is fast and also flexible:
Name homeworld species age
1: Luke Skywalker Tatooine Human 27
2: C-3PO Tatooine Human 12
3: R2-D2 Alderaan Human 8
4: Darth Vader Tatooine Human 44
5: Leia Organa Tatooine Human 25
6: Owen Lars Tatooine Human 32
7: Beru Whitesun lars Stewjon Human 38
8: R5-D4 Tatooine Human 7
9: Biggs Darklighter Kashyyyk Wookiee 65
10: Obi-Wan Kenobi Corellia Human 68
Most import functions have output counterparts, e.g., write.table, write.csv, write.csv2, fwrite. These are useful for writing out rectangular data for use in other programs.
Another under-used way of exporting objects is saveRDS, which saves any R object to a file; reading that file back into R with readRDS gives you exactly the same object. I use this frequently for intermediate datasets, analysis results stored in a list, and even functions.
Example
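A sketch of a round trip through a temporary file, with a made-up list holding a data frame, some results, and a function:

```r
results <- list(
  data    = data.frame(id = 1:3, score = c(1.5, 2, 3.25)),
  summary = c(est = 1.2, se = 0.3),
  square  = function(x) x^2
)

path <- tempfile(fileext = ".rds")
saveRDS(results, path)      # write the whole object to disk
restored <- readRDS(path)   # read it back, assigning to any name

all.equal(restored$data, results$data)  # TRUE
restored$square(4)                      # 16

unlink(path)  # clean up the temporary file
```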
data.table, and tibble packages for working with data.