Merging/joining data

Day 3, B

Michael C Sachs

Register data example

We’ve worked a bit with the patient register data defining possible exposures/outcomes using ICD codes and dates

  • There are other registers, drug, cause of death, social welfare, etc.
  • We want to obtain data from these other data sets and match it to individuals we already have

Keys

  • A key is one or more variables that can be used to identify observations
  • Examples, CPR number, hospital code, drug codes

Example

classDiagram
  direction RL
  class Person {
    - PID
    - Birthdate
    - Deathdate
  }
  class Hospitalization {
    - PID
    - Date
    - Diagnosis
    - Procedures
  }
  class Prescription {
    -PID
    -Date
    -ATC
    -Dose
  }
  class Drug{
    -ATC
    -Class
  }
  Person <|-- Hospitalization
  Person <|-- Prescription
  Prescription <|-- Drug

Joins

Types of joins

Often the goal of a join or merge is to add new variables to an existing data table

Example, given a table of hospitalization diagnoses, I want to know the drug dispensations that happened for each person after the hospitalization date

Left join

Tidyverse

Base R

data.table

same as base R.

Slightly different (right join):

Many-to-one

One-to-many

Many-to-many

Other merge types

  • Right join: merge(x, y, by = "key", all.y = TRUE)
  • Inner join: merge(x, y, by = "key", all = FALSE) (the default)
  • Full join: merge(x, y, by = "key", all = TRUE)

Real data is harder

Examples from the register

  • Inpatient and outpatient hospitalizations are in different tables, outpatient has only admission date, inpatient also has discharge date
  • Drugs are large in number, so drug registers are stored in different files by year
  • Dates do not match exactly in different registers, often want to match within a range of dates

Non exact joins

I want to merge these two tables where key1 matches exactly, and key2 in y is less than key2 in x

Base R

Multi step process in base R

data.table

Rolling joins with data.table

The previous example returns all matches satisfying the inequality, what if we only want the closest one?

Tidyverse

This is available in dplyr since version 1.1.0

Also can keep only the closest.

I prefer the dplyr in this case, because it is easier to communicate and inspect what is going on.

Remember

When working with real data, remember these principles when merging

  • What is the key? Is it named something else in the other table?
  • Does the key define unique observations? If not (e.g., multiple rows per individual), then you will have a non one-to-one merge.
  • Working with dates and ranges is possible, but be careful and check the results carefully

Practical

After the next lecture, we will continue using the register data example to practice merging and defining new variables based on dates and strings.

Link home