Day 3, B
We’ve worked a bit with the patient register data defining possible exposures/outcomes using ICD codes and dates
Often the goal of a join or merge is to add new variables to an existing data table
Example, given a table of hospitalization diagnoses, I want to know the drug dispensations that happened for each person after the hospitalization date
# A tibble: 3 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
same as base R.
Slightly different (right join):
merge(x, y, by = "key", all.y = TRUE)
merge(x, y, by = "key", all = FALSE)
(the default)merge(x, y, by = "key", all = TRUE)
I want to merge these two tables where key1
matches exactly, and key2
in y
is less than key2
in x
Multi step process in base R
key1 key2x val_x key2y val_y
1 1 20 x1 10 y1
4 1 50 x1b 10 y1
5 1 50 x1b 25 y1b
6 1 50 x1b 35 y1c
7 2 35 x2 20 y2
## merge allows multiple keys, but only exact matching
merge(x, y,
by.x = c("key1", "key2x"),
by.y = c("key1", "key2y"), all = FALSE)
[1] key1 key2x val_x val_y
<0 rows> (or 0-length row.names)
The previous example returns all matches satisfying the inequality, what if we only want the closest one?
key1 key2y val_y val_x
1: 1 20 y1 x1
2: 1 50 y1 x1b
3: 1 50 y1b x1b
4: 1 50 y1c x1b
5: 2 35 y2 x2
key1 key2y val_y val_x
1: 1 20 y1 x1
2: 1 50 y1c x1b
3: 2 35 y2 x2
This is available in dplyr
since version 1.1.0
When working with real data, remember these principles when merging
After the next lecture, we will continue using the register data example to practice merging and defining new variables based on dates and strings.