The Receiver Operating Characteristic (ROC) curve is used to assess the accuracy of a continuous measurement for predicting a binary outcome. In medicine, ROC curves have a long history of use for evaluating diagnostic tests in radiology and general diagnostics. ROC curves have also been used for a long time in signal detection theory.
The accuracy of a diagnostic test can be evaluated by considering the two possible types of errors: false positives, and false negatives. For a continuous measurement that we denote as \(M\), convention dictates that a test positive is defined as \(M\) exceeding some fixed threshold \(c\): \(M > c\). In reference to the binary outcome that we denote as \(D\), a good outcome of the test is when the test is positive among an individual who truly has a disease: \(D = 1\). A bad outcome is when the test is positive among an individual who does not have the disease \(D = 0\).
Formally, for a fixed cutoff \(c\), the true positive fraction is the probability of a test positive among the diseased population:
\[ TPF(c) = P\{ M > c | D = 1 \} \]
and the false positive fraction is the probability of a test positive among the healthy population:
\[ FPF(c) = P\{ M > c | D = 0 \} \]
Since the cutoff \(c\) is not usually fixed in advance, we can plot the TPF against the FPF for all possible values of \(c\). This is exactly what the ROC curve is, \(FPF(c)\) on the \(x\) axis and \(TPF(c)\) along the \(y\) axis.
In the medical literature, ROC curves are commonly plotted without
the cutoff values displayed. Other problems with ROC curve plots are
abundant in the medical literature. We aim to solve some of these
problems by providing a plotting interface for the ROC curve that comes
with sensible defaults. It is easy to create interactive ROC curves for
local or web-based use. The next section details the usage of the
plotROC
package.
I created a shiny application in order to make the features more accessible to non-R users. A limited subset of the functions of the plotROC package can be performed on an example dataset or on data that users upload to the website. Resulting plots can be saved to the users’ machine as a pdf or as a stand-alone html file. It can be used in any modern web browser with no other dependencies at the website here: https://sachsmc.shinyapps.io/plotROC.
plotROC can be installed from github or CRAN. It requires a recent version of ggplot2 (>2.0.0).
I start by creating an example data set. There are 2 markers, one that is moderately predictive and one that is not as predictive.
## Loading required package: ggplot2
set.seed(2529)
D.ex <- rbinom(200, size = 1, prob = .5)
M1 <- rnorm(200, mean = D.ex, sd = .65)
M2 <- rnorm(200, mean = D.ex, sd = 1.5)
test <- data.frame(D = D.ex, D.str = c("Healthy", "Ill")[D.ex + 1],
M1 = M1, M2 = M2, stringsAsFactors = FALSE)
Next I use the ggplot
function to define the aesthetics,
and the geom_roc
function to add an ROC curve layer. The
geom_roc
function requires the aesthetics d
for disease status, and m
for marker. The disease status
need not be coded as 0/1, but if it is not, stat_roc
assumes (with a warning) that the lowest value in sort order signifies
disease-free status. stat_roc
and geom_roc
are
linked by default, with the stat doing the underlying computation of the
empirical ROC curve, and the geom consisting of the ROC curve layer.
The disease status aesthetic can be specified as a string or factor, but with a warning.
## Warning in verify_d(data$d): D not labeled 0/1, assuming Healthy = 0 and Ill =
## 1!
The geom_roc
layer includes the ROC curve line combined
with points and labels to display the values of the biomarker at the
different cutpoints. It accepts the argument n.cuts
to
define the number of cutpoints to display along the curve. Labels can be
suppressed by using n.cuts = 0
or
labels = FALSE
. The size of the labels and the number of
significant digits can be adjusted with labelsize
and
labelround
, respectively.
We provide a function style_roc
that can be added to a
ggplot that contains an ROC curve layer. This adds a diagonal guideline,
sets the axis labels, and adjusts the major and minor grid lines. The
direct_label
function operates on a ggplot object, adding a
direct label to the plot. It attempts to intelligently select an
appropriate location for the label, but the location can be adjusted
with nudge_x, nudge_y
and label.angle
. If the
labels
argument is NULL, it will take the name from the
mapped aesthetic.
It is common to compute confidence regions for points on the ROC curve using the Clopper and Pearson (1934) exact method. Briefly, exact confidence intervals are calculated for the \(FPF\) and \(TPF\) separately, each at level \(1 - \sqrt{1 - \alpha}\). Based on result 2.4 from Pepe (2003), the cross-product of these intervals yields a \(100 * (1 - \alpha)\) percent rectangular confidence region for the pair.
This is implemented in the stat_rocci
and displayed as a
geom_rocci
layer. These both require the same aesthetics as
the ROC geom, d
for disease status and m
for
marker. By default, a set of 3 evenly spaced points along the curve are
chosen to display confidence regions. You can select points by passing a
vector of values in the range of m
to the
ci.at
argument. By default, the significance level \(\alpha\) is set to 0.05, this can be
changed using the sig.level
option.
Ggplot objects that contain a GeomRoc layer can be used to create an
interactive plot and display it in the Rstudio viewer or default web
browser by passing it to the plot_interactive_roc
, or
export_interactive_roc
function. The style_roc function is
applied by default. Give the function an optional path to an html file
as an argument called file
to save the interactive plot as
a complete web page. By default, any existing Rocci layers are removed
and replaced with a dense layer of confidence regions so that the user
can click anywhere for a confidence region. This can be suppressed by
add.cis = FALSE
. Furthermore, the points layer of the Roc
geom can be hidden by using the hide.points
option.
Hovering over the display shows the cutoff value at the point nearest to the cursor. Clicking makes the cutoff label stick until the next click, and if confidence regions are available, clicks will also display those as grey rectangles. The confidence regions are automatically detected. When the user clicks on the ROC curve, the confidence region for the TPF and FPF is overlaid using a grey rectangle. The label and region stick until the next click.
An interactive ROC plot can be exported by using the
export_interactive_roc
function, which returns a character
string containing the necessary HTML
and
JavaScript
. The character string can be copy-pasted into an
html document, or better yet, incorporated directly into a dynamic
document using knitr
(knitr homepage).
In a knitr
document, it is necessary to use the
cat
function on the results and use the chunk options
results = 'asis'
and fig.keep='none'
so that
the interactive plot is displayed correctly. For documents that contain
multiple interactive plots, it is necessary to assign each plot a unique
name using the prefix
argument of
export_interactive_roc
. This is necessary to ensure that
the JavaScript code manipulates the correct svg elements. The next code
block shows an example knitr
chunk that can be used in an
.Rmd document to display an interactive plot.
```{r int-no, fig.keep='none', results = 'asis'}
cat(
export_interactive_roc(basicplot,
prefix = "a")
)
```
The result is shown below:
Click for confidence regions.
If you have grouping factors in your dataset, or you have multiple markers measured on the same subjects, you may wish to plot multiple ROC curves on the same plot. plotROC fully supports faceting and grouping done by ggplot2. In out example dataset, we have 2 markers measured in a paired manner:
## D D.str M1 M2
## 1 1 Ill 1.48117155 -2.50636605
## 2 1 Ill 0.61994478 1.46861033
## 3 0 Healthy 0.57613345 0.07532573
## 4 1 Ill 0.85433197 2.41997703
## 5 0 Healthy 0.05258342 0.01863718
## 6 1 Ill 0.66703989 0.24732453
These data are in wide format, with the 2 markers going across 2
columns. ggplot requires long format, with the marker result in a single
column, and a third variable identifying the marker. We provide the
function melt_roc
to perform this transformation. The
arguments are the data frame, a name or index identifying the disease
status column, and a vector of names or indices identifying the the
markers. Optionally, the names argument gives a vector of names to
assign to the marker, replacing their column names. The result is a data
frame in long format.
## D M name
## M11 1 1.48117155 M1
## M12 1 0.61994478 M1
## M13 0 0.57613345 M1
## M14 1 0.85433197 M1
## M15 0 0.05258342 M1
## M16 1 0.66703989 M1
Then, the dataset can be passed to the ggplot function, with the marker name given as a grouping or faceting variable.
pairplot <- ggplot(longtest, aes(d = D, m = M, color = name)) +
geom_roc(show.legend = FALSE) + style_roc()
direct_label(pairplot)
Interactive versions of the plots are fully supported.
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.
Click for confidence regions.
cat(
export_interactive_roc(
ggplot(longtest, aes(d = D, m = M)) + geom_roc() + facet_wrap(~ name),
prefix = "c", width = 10, height = 5)
)
Click for confidence regions.
Showing multiple curves is also useful when there is a factor that affects the classification accuracy of the test. Let’s create another example dataset.
test.cov2 <- data.frame(D = D.cov, gender = gender, M = M.diff,
group = c("A", "B")[rbinom(400, 1, .4) + 1])
bygend2 <- ggplot(test.cov2, aes(d = D, m = M, color = gender)) +
geom_roc() + facet_wrap(~ group)
calc_auc(bygend2)
## PANEL group group.1 gender AUC
## 1 1 1 1 Female 0.6374434
## 3 1 2 2 Male 0.9531326
## 2 2 1 1 Female 0.6328904
## 4 2 2 2 Male 0.8732526
Also works for multiple panels, and layers.
set.seed(123)
x <- rnorm(100)
y <- round(plogis(3*x + rnorm(100, sd = 5)))
df <- data.frame(x, y, gp = c("A", "B"), pan = sample(c("Z", "W"), 100, replace = TRUE))
x2 <- rnorm(100)
y2 <- round(plogis(3*x + rnorm(100, sd = 2)))
df2 <- data.frame(x2, y2, gp = c("X", "Y"), pan = sample(c("Z", "W"), 100, replace = TRUE))
p3 <- ggplot(df, aes(d = y, m = x, color = gp)) + geom_roc() +
geom_roc(data = df2, aes(d = y2, m = x2, color = gp))
calc_auc(p3)
## PANEL group gp AUC
## 1 1 1 A 0.7451923
## 2 1 2 B 0.6928000
## 3 1 1 X 0.4285714
## 4 1 2 Y 0.4640000
p4 <- ggplot(df, aes(d = y, m = x, color = gp)) + geom_roc() +
geom_roc(data = df2, aes(d = y2, m = x2, color = gp)) +
facet_wrap(~ pan)
calc_auc(p4)
## PANEL pan group gp AUC
## 1 1 W 1 A 0.7285068
## 3 1 W 2 B 0.4062500
## 2 2 Z 1 A 0.7252747
## 4 2 Z 2 B 0.8099548
## 11 1 W 1 X 0.3515152
## 31 1 W 2 Y 0.4437500
## 21 2 Z 1 X 0.5071429
## 41 2 Z 2 Y 0.5407407
plotROC uses the ggplot2
package to create the objects
to be plotted. Therefore, themes and annotations can be added in the
usual ggplot2 way, or with external libraries such as
ggthemes
. The area under the ROC curve can be calculated by
passing the ggplot object to the calc_auc
function.
basicplot +
style_roc(theme = theme_grey) +
theme(axis.text = element_text(colour = "blue")) +
ggtitle("Themes and annotations") +
annotate("text", x = .75, y = .25,
label = paste("AUC =", round(calc_auc(basicplot)$AUC, 2))) +
scale_x_continuous("1 - Specificity", breaks = seq(0, 1, by = .1))
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
By default calculate_roc
computes the empirical ROC
curve. There are other estimation methods out there, as I have
summarized in the introduction. Any estimation method can be used, as
long as the cutoff, the TPF and the FPF are returned. Then you can
simply pass those values in a data frame to the ggroc
function, and overriding the default stat to identity
.
D.ex <- test$D
M.ex <- test$M1
mu1 <- mean(M.ex[D.ex == 1])
mu0 <- mean(M.ex[D.ex == 0])
s1 <- sd(M.ex[D.ex == 1])
s0 <- sd(M.ex[D.ex == 0])
c.ex <- seq(min(M.ex), max(M.ex), length.out = 300)
binorm.roc <- data.frame(c = c.ex,
FPF = pnorm((mu0 - c.ex)/s0),
TPF = pnorm((mu1 - c.ex)/s1)
)
binorm.plot <- ggplot(binorm.roc, aes(x = FPF, y = TPF, label = c)) +
geom_roc(stat = "identity") + style_roc(theme = theme_grey)
binorm.plot
Interactive plots with stat = "identity"
are not
currently supported.
Another potential use of this approach is for plotting time-dependent ROC curves for time-to-event outcomes estimated as described in Heagerty, Lumley, and Pepe (2000).
library(survivalROC)
survT <- rexp(350, 1/5)
cens <- rbinom(350, 1, .1)
M <- -8 * sqrt(survT) + rnorm(350, sd = survT)
sroc <- lapply(c(2, 5, 10), function(t){
stroc <- survivalROC(Stime = survT, status = cens, marker = M,
predict.time = t, method = "NNE",
span = .25 * 350^(-.2))
data.frame(TPF = stroc[["TP"]], FPF = stroc[["FP"]],
c = stroc[["cut.values"]],
time = rep(stroc[["predict.time"]], length(stroc[["FP"]])))
})
sroclong <- do.call(rbind, sroc)
ggplot(sroclong, aes(x = FPF, y = TPF, label = c, color = time)) +
geom_roc(labels = FALSE, stat = "identity") + style_roc(theme = theme_gray)