To recap: a nested case control study is drawn from a cohort which has been followed up longitudinally. For each case in the cohort, a specified number of controls is drawn from the subjects in the cohort that are still at risk for the outcome of interest at the time when the case experiences their event; specifically, controls can become later during follow-up. This design can be useful when we have a large cohort, but only few cases, e.g. an uncommon disease followed up in the full Swedish population: by matching each case to say 5 controls, we can reduce the size of the analysis data set considerably, without suffering any noticeable loss of power. It is also straightforward to include extra matching between cases and controls in the process.
We have a simple synthetic cohort of a 1,000 subjects, with CVD as outcome and alcohol as exposure (either continuous or binary).
## sex alc alc.binary time cvd
## Min. :0.0 Min. : 1.0 Min. :0.000 Min. :1517 Min. :0.000
## 1st Qu.:0.0 1st Qu.:10.0 1st Qu.:0.000 1st Qu.:7300 1st Qu.:0.000
## Median :0.5 Median :13.0 Median :1.000 Median :7300 Median :0.000
## Mean :0.5 Mean :12.5 Mean :0.503 Mean :6966 Mean :0.189
## 3rd Qu.:1.0 3rd Qu.:15.0 3rd Qu.:1.000 3rd Qu.:7300 3rd Qu.:0.000
## Max. :1.0 Max. :24.0 Max. :1.000 Max. :7300 Max. :1.000
Time to either CVD diagnosis (cvd==1
) or censoring (cvd==0
) is saved in variable time
.
The code below takes the existing dataframe and create a new one, with matched cases and controls indicated by new variable matchSet
:
## Generate vector with row numbers of cases in cohort,
## extract number of cases in data
ndx_case = which(cohort$cvd == 1)
ncase = length(ndx_case)
## Set up matrix for storing the matched sets of cases
## and controls: 2 controls/case, matrix has cases as
## columns and case/control row indices as rows
k = 2
sets = matrix(NA, nrow = k+1, ncol = ncase)
## Save case numbers as first row
sets[1, ] = ndx_case
## Set seed and loop over cases
set.seed(5236)
for (i in 1:ncase)
{
## Find row numbers of subjects that are still at risk when current case
## happens, and have the same sex as the current case
tmp = with(cohort, which(time > time[ndx_case[i]] & sex == sex[ndx_case[i]]) )
## Sample k controls from the eligible subjects
sets[2:(k+1), i] = sample(tmp, size = k)
}
## Matched sets are consecutive blocks of case + k controls
matchSet = rep(1:ncase, rep(k+1, ncase))
## Use the row numbers in matrix sets to extract the rows from
## the cohort data, add matched set variable
match_ndx = as.vector(sets)
nested_cc = cbind(cohort[match_ndx, ], matchSet = set_var)
nestedCC
that accepts a cohort data frame, and returns a nested case-control sampled from the data. The user has to specify the variables in the input data containing event times as well as event status (event vs censoring); variables can be specified by name (character) or position (integer). Bonus: optionally allow the user to add the name of a matching variable in the input data.epiDesign
(ambitious!) and add the R function from above to the R
directory. Build, document and load the package, and run some informal tests at the command line: can you replicate the example above? How does the documentation look?The original code works well for the original data, but it does not deal with the general situation where we may not be able to find as many controls as specified for some cases (undermatched cases), or even no control at all (unmatched case).
k=2
(why?). Use the new expanded data set to test (informally, at the command line) how nestedCC
behaves in this situation; if necessary, fix the function so it deals correctly with these situations.nestedCC
.Modify function nestedCC
so that it returns a more informative object, including a random seed (if specified) and the row numbers of the under- and unmatched cases in the original data.
Turn the more complicated output of the modified function into an object of (S3) class nestedCC
. Provide simple print- and summary methods for this class, and document them using roxygen. Bonus: write a new generic function getData
that takes an object of class nestedCC
and returns the sample case control data only.
interaction
)subset
)epiDesign
, where multiple different designs (e.g. matched cohort design, case-cohort design) can be generated from cohort data via different constructor functions. What kind of methods could be useful? What common data elements would make sense?dplyr
-implementation, and re-implement the linear example code. Bonus: write a dplyr
-based function analogous to nestedCC
. Melior: allow non-standard evaluation in the new function (i.e. no quotation marks around variable names); you may want to have a look at the dplyr
-programming vignette.Epi::ccwc
This is a fairly complete implementation (except for S3) which tries to deal with unmatched and undermatched cases in the data.
#' Generate a nested case-control design
#'
#' Given a data frame with cohort data, this function will
#' return a a randomly sampled nested case-control design
#'
#' @param cohort Data frame with cohort data
#' @param exit Variable in data frame with event/censoring times,
#' specified as position (integer) or name (character)
#' @param event Integer variable in cohort indicating censoring (0)
#' or event (>0)
#' @param match Optional vector of variables in cohort that cases
#' and controls will be matched on
#' @param k Number of controls per case; default 1
#' @param seed Optional seed for random number generation
#'
#' @return A data frame with rows from the input data (possibly
#' sampled repeatedly) and an extra column `matchSet`
#' indicating matched sets of cases and controls.
nestedCC = function(cohort, exit, event, match, k=1, seed)
{
## Generate vector with row numbers of cases in cohort,
## extract number of cases in data
ndx_case = which(cohort[, event] == 1)
ncase = length(ndx_case)
## Set up matrix for storing the matched sets of cases
## and controls: 2 controls/case, matrix has cases as
## columns and case/control row indices as rows
sets = matrix(NA, nrow = k+1, ncol = ncase)
## Save case numbers as first row
sets[1, ] = ndx_case
## Define a matching variable: match can be a vector of
## variable names or positions
## If not specified, set a dummy variable that always matches
if (!missing(match)) {
grp = interaction(cohort[, match])
} else {
grp = rep(TRUE, nrow(cohort))
}
## If specified: set the random seed; otherwise set to NA
if (!missing(seed)) {
set.seed(seed)
} else {
seed = NA
}
## Loop over cases
for (i in 1:ncase)
{
## Find row numbers of subjects that are still at risk when current case
## happens, and have the same sex as the current case
tmp = which( ( cohort[, exit] > cohort[ndx_case[i], exit] ) &
( grp == grp[ndx_case[i]] ) )
## I have to deal with the special case of only one eligible
## control because the UI for sample is CRAP; see ?sample
n_match_control = length(tmp)
if (n_match_control > 1) {
nsamp = min(k, n_match_control)
sets[2:(1+nsamp), i] = sample(tmp, size = nsamp)
} else if (n_match_control == 1) { ## Only one control: we gotta take it
sets[2:2, i] = tmp
} ## Note: we skip the case where no matching control was found, see below
}
## Calculate number of missing values per column (matched set)
## If all control indices are NA, no match was found, and we
## dump the case
n_controls = colSums(!is.na(sets)) - 1
sets = sets[, n_controls > 0, drop = FALSE]
## We can use the same information to define the match indicator
matchSet = rep(1:ncol(sets), n_controls[n_controls > 0] + 1)
## Use the row numbers in matrix sets to extract the rows from
## the cohort data, dropping unmatched cells
match_ndx = as.vector(sets)
match_ndx = match_ndx[!is.na(match_ndx)]
## Join the sample data and the matched set indicator
ret = cbind(cohort[match_ndx, ], matchSet = matchSet)
## Add some extra information
list(data = ret, seed = seed, k = k,
ndx_unmatched = ndx_case[n_controls == 0],
ndx_undermatched = ndx_case[n_controls < k & n_controls > 0])
}
This is a data frame that has both unmatched and undermatched cases:
test_coh = data.frame(time = c(1, 2, 3, 4, 5, 6),
event = c(0, 1, 0, 1, 0, 1),
sex = c(1, 1, 2, 2, 2, 1 ))
test_coh
## time event sex
## 1 1 0 1
## 2 2 1 1
## 3 3 0 2
## 4 4 1 2
## 5 5 0 2
## 6 6 1 1