Almost everything you do in R involves functions. You call a function by typing its name with its arguments (inputs) inside the parentheses:
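For example, a call to sample, whose source we inspect below:

sample(1:10, size = 3)  ## draw three values at random; results vary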
The function takes the arguments you provide, does something, and then returns an object. To see what a function does, you can type its name without parentheses to view its source:
function (x, size, replace = FALSE, prob = NULL)
{
    if (length(x) == 1L && is.numeric(x) && is.finite(x) && x >= 1) {
        if (missing(size))
            size <- x
        sample.int(x, size, replace, prob)
    }
    else {
        if (missing(size))
            size <- length(x)
        x[sample.int(length(x), size, replace, prob)]
    }
}
<bytecode: 0x589e90c42b28>
<environment: namespace:base>
The source shows you the arguments, their default values, and the expression defining the function. You can also look at the help file for the documentation:
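For sample, either of the following opens the documentation:

help(sample)
?sample    ## shorthand for help(sample)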
Functions can have 0 or more arguments, with or without defaults.
The arguments can be given in order, or by name
Names can be partially matched, which can be confusing:
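For example, the length.out argument of seq is often abbreviated:

seq(1, 10, len = 5)  ## 'len' partially matches 'length.out'
[1]  1.00  3.25  5.50  7.75 10.00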
Some functions take ... as an argument, e.g., paste, list, and the apply family.
There are 2 reasons for this:
Often we want to use the result of one function as the argument to another function. There are many ways to do this, e.g., nested calls, intermediate variables, or the native pipe |> (available since R 4.1.0).
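A sketch of the alternatives:

mean(sqrt(runif(100)))           ## nested calls
x <- sqrt(runif(100))            ## intermediate variable
mean(x)
runif(100) |> sqrt() |> mean()   ## native pipe (R >= 4.1.0)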
Some functions will take other functions as arguments. An example is the apply family of functions, which applies a function over an index or iterator. See help(apply).
apply repeatedly applies a function over the dimensions of an array. MARGIN indicates which dimension, and then for each index in that dimension, it applies FUN to the sub-array.
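A minimal sketch:

m <- matrix(1:6, nrow = 2)
apply(m, MARGIN = 1, FUN = sum)  ## apply sum over rows (dimension 1)
[1]  9 12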
tapply is commonly used with data. It subsets the data X based on the INDEX argument, then applies a function to each subset:
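For example, using the built-in chickwts data:

with(chickwts, tapply(weight, feed, mean))  ## mean weight per feed group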
lapply
lapply is more general, in that it can take any index and apply any function that takes the index as an argument. It always returns a list. sapply does the same, but simplifies the result to an array, if possible.
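A small sketch of the difference:

lapply(1:3, function(i) i^2)  ## returns a list of length 3
sapply(1:3, function(i) i^2)  ## simplified to the vector 1 4 9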
mapply
This is the multivariate version of sapply that allows vector arguments.
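For example:

mapply(rep, x = 1:3, times = 3:1)  ## rep(1, 3), rep(2, 2), rep(3, 1)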
See also the purrr package.
The apply family of functions is computationally equivalent to a loop (with pre-allocation)
Using apply instead of a for loop will not be faster computationally
It may be faster to write, but it may also be harder to understand
You can do whatever you want inside a for loop; how would you do something more complex with lapply?
name2 is a local variable. It exists only inside the function. Modifying a local variable inside the function has no effect outside it. But be careful:
Likewise, arguments modified inside the function do not change the object outside the function.
This is called lexical scoping: it defines how R looks for objects when they are referred to by name
If R sees a variable it needs to use inside a function, and it is not an argument or local variable, then it follows these rules to find the object with that name:
Note the specification "sees a variable" and "needs to use it". This is called lazy evaluation: R does not evaluate anything until it needs to use it.
This can be used to your advantage, e.g.,
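an argument whose evaluation would fail is harmless as long as it is never used:

f <- function(x, y) x    ## y is never touched ...
f(1, stop("boom"))       ## ... so the error never fires
[1] 1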
One way to manually check for arguments is with missing:
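For instance, a toy function that reports whether y was supplied:

f <- function(x, y) {
  if (missing(y)) "y was not supplied" else "y was supplied"
}
f(1)     ## "y was not supplied"
f(1, 2)  ## "y was supplied"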
match.arg
Look at the help file for t.test, and specifically the alternative argument. It is a vector with 3 elements, but only one is used. Also, it can be partially matched, e.g.,
How does that work? Using match.arg inside the function:
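The same pattern in a function of our own (center and its type argument are made-up names):

center <- function(x, type = c("mean", "median")) {
  type <- match.arg(type)  ## defaults to "mean", allows partial matching
  switch(type, mean = mean(x), median = median(x))
}
center(c(1, 2, 10), "med")  ## partial match for "median"
[1] 2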
Your own functions do not need to be saved and assigned names. If a function does not have a name, it is anonymous. I use these often with the apply family:
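For example:

sapply(1:5, function(i) i * (i + 1) / 2)  ## the first five triangular numbers
[1]  1  3  6 10 15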
Since R 4.1.0, \() can be used as shorthand for function():
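The same call with the shorthand:

sapply(1:5, \(i) i * (i + 1) / 2)
[1]  1  3  6 10 15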
Operators are symbols like +, <-, %*%, [. These are functions! To treat them like functions instead of operators, use backticks:
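For example:

`+`(2, 3)  ## the same as 2 + 3
[1] 5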
You can then treat operators as you would any other function, using them in apply or otherwise
You can also define your own operators:
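A sketch of a user-defined infix operator (the name must be of the form %...%):

`%+%` <- function(a, b) paste(a, b)
"Hello" %+% "world"
[1] "Hello world"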
Assignment operators have a special syntax:
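For example, a replacement function has a name ending in <- and is called on the left-hand side of an assignment; a sketch with a hypothetical second<- function:

`second<-` <- function(x, value) {
  x[2] <- value
  x
}
v <- c(10, 20, 30)
second(v) <- 99   ## calls `second<-`(v, value = 99)
v
[1] 10 99 30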
Look at the function print. It is a generic function. UseMethod says that, depending on the class of the argument object, R will call a suitable method (a function) that does something designed for whatever object is.
You can find all the special methods by running methods("print") (try it now).
The class of the object is a simple attribute and the method is defined by appending the class name after the function name separated by a dot. This is called the S3 class system:
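A minimal sketch with a made-up class:

obj <- list(x = 1:10)
class(obj) <- "myclass"              ## the class is just an attribute
print.myclass <- function(x, ...) {  ## method name: generic.class
  cat("A 'myclass' object with", length(x$x), "values\n")
  invisible(x)
}
print(obj)  ## dispatches to print.myclass
A 'myclass' object with 10 values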
In R, everything that happens is due to a function, and everything that exists is an object. Functions themselves are objects.
How do functions work together? We can classify functions according to their inputs and outputs:
| Input/Output | Data | Function |
| --- | --- | --- |
| Data | Regular function | Function factory |
| Function | Functional | Function operator |
These concepts are loosely defined, because functions can take both data and function arguments and return data and function results.
When should you write a function? How should it be designed?
Everything is done via (different types of) functions (see functional programming tomorrow):
Functions have
arguments: args(lm) or formals(lm)
a body: body(lm)
an environment: environment(lm)
An argument can be checked with missing (if necessary).
Fun fact - you can overwrite body, arguments (via formals) and environment in an assignment:
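A sketch of the sort of code that produces the output below (x and y are a simulated regression dataset defined earlier in the session):

coef(lm(y ~ x))                              ## the real lm still works
body(lm) <- "No linear regression for you!"  ## overwrite the body
lm(y ~ x)                                    ## the sabotaged copy
lm                                           ## formals intact, body replaced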
(Intercept) x
1.962911 2.989594
[1] "No linear regression for you!"
function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
"No linear regression for you!"
<environment: namespace:stats>
Because the original lm is defined in a package, changes are done to a local copy in the current environment. At the command line, this is the global environment:
[1] ".GlobalEnv" "package:stats"
(Intercept) x
1.962911 2.989594
(Intercept) x
1.962911 2.989594
Lessons learned:
We can have functions with the same name in the search path - someone keeps (magically?) track of them.
There are differences between functions living in a package and in the global environment (e.g. generated via a script).
By default, an R function returns the value of the last evaluated expression in the body:
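A minimal sketch:

add_one <- function(x) {
  y <- x + 1
  y  ## value of the last expression: this is what gets returned
}
add_one(41)
[1] 42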
A function can return any value at any point in the code by using the return function:
I think this makes programs harder to read and generally avoid using return; opinions on this differ.
invisible works like return, but the resulting value will not be printed:
Often, a vector, matrix or data frame is the obvious return value (e.g. boxcox).
Sometimes, a matrix or data frame is not the most obvious, but still the most useful return value (e.g. broom::tidy).
And sometimes, you just get lots of different objects that you need to stick together, which is when you can use a list:
[1] TRUE
coefficients residuals effects rank fitted.values
"double" "double" "double" "integer" "double"
assign qr df.residual call terms
"integer" "list" "integer" "language" "language"
model
"list"
Return the same type of result every time (if possible). Unlike this:
Run conc density
1 1 0.04882812 0.017
2 1 0.04882812 0.018
3 1 0.19531250 0.121
4 1 0.19531250 0.124
5 1 0.39062500 0.206
[1] 0.017 0.018 0.121 0.124 0.206
Consider tidy return values: does it make sense to return a rectangular data matrix?
tab = with(survival::veteran, table(celltype, status))
ttab = reshape2::melt(tab, value.name= "Freq")
tab ## NOT tidy
status
celltype 0 1
squamous 4 31
smallcell 3 45
adeno 1 26
large 1 26
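The melted version was presumably displayed with:

ttab  ## tidy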
celltype status Freq
1 squamous 0 4
2 smallcell 0 3
3 adeno 0 1
4 large 0 1
5 squamous 1 31
6 smallcell 1 45
7 adeno 1 26
8 large 1 26
Call: glm(formula = Freq ~ celltype + status, data = ttab)
Coefficients:
(Intercept) celltypesmallcell celltypeadeno celltypelarge
2.625 6.500 -4.000 -4.000
status
29.750
Degrees of Freedom: 7 Total (i.e. Null); 3 Residual
Null Deviance: 2019
Residual Deviance: 101.4 AIC: 55.02
Avoid fancy printing and plotting, unless this is the only thing your function does. Many older functions do this:
hist(DNase$density) ## plot, invisible result
hist(DNase$density, plot = FALSE) ## no plot - visible result
$breaks
[1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2
$counts
[1] 45 31 12 22 4 18 8 13 18 4 1
$density
[1] 1.27840909 0.88068182 0.34090909 0.62500000 0.11363636 0.51136364
[7] 0.22727273 0.36931818 0.51136364 0.11363636 0.02840909
$mids
[1] 0.1 0.3 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9 2.1
$xname
[1] "DNase$density"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
Compare this to ggplot2, which has a clean separation between defining and displaying a plot.
Function effx in package Epi produces fancy-ish text output, but returns nothing, nothing at all:
---------------------------------------------------------------------------
response : bweight
type : metric
exposure : hyp
hyp is numeric
effects are measured as differences in means
---------------------------------------------------------------------------
effect of an increase of 1 unit in hyp on bweight
number of observations 500
Effect 2.5% 97.5%
-431 -585 -276
Test for no effects of exposure on 1 df: p-value= 4.9e-08
NULL
Don’t do that.
Functions should avoid elaborate print output unless this is the only thing they do: see above, object-oriented programming.
Instead, functions should construct an object with all the relevant information, and either trust the R printing methods, or hide the ugly parts using invisible.
However, there are situations where it makes sense to contact the user from within the function:
Progress: for a slow process, use cat (not print) to indicate progress; offer an argument to suppress the progress report.
(Alternatively, use the progress package; however, it does not work in terminal R.)
Side effects: if the function does something that will have side effects outside itself, the user needs to be informed. This should be done via message and warning, not print-ing or cat-ing.
Problems: if the function runs into a situation it cannot handle, it can either try a best-guess fix, or just give up; warning informs, but continues execution, stop terminates execution and returns an error.
Useful:
stopifnot for tough-love argument checking
options(warn=2) to turn warnings into errors

Functions as arguments are common: lapply etc.
Functions can also be return values. E.g. we can define our own logarithms for different bases:
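The definition was presumably along these lines (a sketch consistent with the description and output below):

log_general <- function(base) {
  function(x) log(x, base = base)  ## 'base' is looked up in the enclosing environment
}
log2 <- log_general(base = 2)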
This log2 is a function defined in log_general. When log_general was called with base=2, this was saved in the execution environment of log_general, which log2 has inherited:
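Calls consistent with the output below:

environment(log2)        ## the execution environment of log_general
ls(environment(log2))    ## what it contains
environment(log2)$base   ## the saved value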
<environment: 0x589e97ca9870>
[1] "base"
[1] 2
This is a reasonably advanced, but powerful technique. This combination of a function and data (in the environment) is generally known as a closure, though the term has a narrower technical meaning in R. See Advanced R 2nd ed, Section 10.4, for use cases.
A different, more conventional way of defining our own binary logarithm:
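Presumably something like:

log2 <- function(x) log(x, base = 2)
log2(4)
environment(log2)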
[1] 2
<environment: R_GlobalEnv>
Nothing wrong with that as a one-off, though it is ultimately less flexible.
Other people’s functions may also cause messages, warnings or errors.
Use suppressMessages or suppressWarnings to keep it silent:
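For example, consistent with the output below:

suppressWarnings(log(-1))  ## would otherwise warn: NaNs produced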
[1] NaN
Errors can be caught using the function try:
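A minimal sketch:

res <- try(log("a"), silent = TRUE)  ## the error is caught, not fatal
class(res)
[1] "try-error"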
tryCatch offers a more flexible way of defining error handlers.
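A sketch:

tryCatch(log("a"),
         error = function(e) {
           message("caught: ", conditionMessage(e))
           NA  ## fallback value
         })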
Write a function hasPlotArg that takes a function as argument and checks whether the function has plot as an argument.
Find all functions in the search path that have plot in their name, and check which ones have a plot argument.
If we also want to know about the default value for argument plot, should we add an optional argument to hasPlotArg, or is there a better way to design this? Implement your idea.
Have a look at how the function read.table handles unexpected situations: when does it give up, and when does it continue reading with a message/warning?
get and assign
Recall that we can retrieve a variable from a data frame by using a character string, e.g., penguins[["species"]].
We can use a character string to get or assign any other object using these functions. For example, this returns the function called mean, which we can use like a function:
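A sketch:

m <- get("mean")  ## retrieve the function by its name
m(1:10)
[1] 5.5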
Likewise, an object can be created with assign:
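For example:

assign("answer", 42)  ## equivalent to: answer <- 42
answer
[1] 42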
get and assign
Example: iterating over functions by name:
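A sketch, using the built-in precip vector:

for (f in c("mean", "median", "sd")) {
  cat(f, ":", get(f)(precip), "\n")
}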
Example: retrieving a function programmatically:
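A sketch, building the name first:

fname <- paste0("q", "norm")  ## "qnorm", assembled programmatically
get(fname)(0.975)             ## the standard normal 97.5% quantile
[1] 1.959964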
Example: programmatically creating new variables:
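A sketch:

for (i in 1:3) {
  assign(paste0("sample", i), rnorm(10))  ## creates sample1, sample2, sample3
}
ls(pattern = "^sample")
[1] "sample1" "sample2" "sample3"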
do.call
A variant on get is do.call. This takes a function as the first argument, then a list containing the arguments for the function: do.call(<function>, <list of arguments to function>).
A common use for this is with functions that take a variable number of arguments, e.g., cbind, paste, where the arguments are created programmatically.
A simple example:
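do.call(paste, list("a", "b", "c", sep = "-"))
[1] "a-b-c"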
Arranging a list into a matrix:
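rows <- lapply(1:3, function(i) i * (1:4))  ## a list of three row vectors
do.call(rbind, rows)                        ## bind into a 3 x 4 matrix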
There is the <<- operator, which is used in functions and does (re)assignment outside the function. It searches the parent environments and reassigns where the name is found; if it is not found, it assigns in the global environment.
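A minimal sketch:

counter <- 0
increment <- function() {
  counter <<- counter + 1  ## reassigns the 'counter' outside the function
}
increment()
increment()
counter
[1] 2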
This is generally considered to be a bad idea, but now you know about it.
Functionals invoke functions over some data structure to create an output that corresponds to that data structure. We've already used one functional: lapply. Here is a simpler one:
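A sketch consistent with the description below (with_random_data is a hypothetical name):

with_random_data <- function(f) f(runif(100))  ## apply f to 100 uniform draws
with_random_data(mean)
with_random_data(IQR)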
The data structure here is a single variable (runif(100)). Functionals are useful for iterating over more complex data structures as well.
lapply iterates over a list, applying the same function to each element. How can you achieve the same result using a for loop?
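One possible answer, as a sketch: pre-allocate a list and fill it element by element.

xs <- list(a = 1:3, b = 4:6)
out <- vector("list", length(xs))  ## pre-allocate
names(out) <- names(xs)
for (i in seq_along(xs)) {
  out[[i]] <- mean(xs[[i]])
}
out  ## the same result as lapply(xs, mean)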
random_stats <- function(fun) {
  lapply(list(uniform = runif(100), normal = rnorm(100),
              t = rt(100, 4), exponential = rexp(100)), fun)
}
random_stats(mean)
$uniform
[1] 0.4998756
$normal
[1] 0.1167602
$t
[1] -0.008399415
$exponential
[1] 0.817919
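Presumably followed by:

random_stats(sd)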
$uniform
[1] 0.2825966
$normal
[1] 1.015672
$t
[1] 1.641905
$exponential
[1] 0.9946533
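And iterating over the functions themselves, presumably along these lines:

lapply(list(mean = mean, sd = sd), random_stats)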
$mean
$mean$uniform
[1] 0.4918718
$mean$normal
[1] 0.1327262
$mean$t
[1] 0.1203086
$mean$exponential
[1] 1.093484
$sd
$sd$uniform
[1] 0.2998995
$sd$normal
[1] 0.9078401
$sd$t
[1] 1.468887
$sd$exponential
[1] 0.8860794
mapply iterates over a series of lists. Same question?
All of these iteration problems can be solved with loops. What are the advantages/disadvantages of using loops?
Function operators take functions as input and return functions as output, generally modifying the behaviour of the function.
Let’s see if we can implement an error capture adverb.
#' Modify a function to fail without stopping
#'
#' @param f A function
capture_errors <- function(f) {
  function(...) {
    tryCatch(f(...),
             error = function(e) as.character(e))
  }
}
geometric_mean <- function(x) {
  exp(mean(log(x)))
}
safe_geom_mean <- capture_errors(geometric_mean)
# lapply(iris, geometric_mean)  ## would stop with an error at the Species factor
lapply(iris, safe_geom_mean)
$Sepal.Length
[1] 5.78572
$Sepal.Width
[1] 3.026598
$Petal.Length
[1] 3.238267
$Petal.Width
[1] 0.8417075
$Species
[1] "Error in Math.factor(x): 'log' not meaningful for factors\n"
Function factories take data as input and return functions:
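A minimal sketch:

power <- function(exponent) {
  function(x) x^exponent  ## the data ('exponent') is baked into the result
}
square <- power(2)
square(4)
[1] 16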
Function factories are useful in statistics. Let’s do some regression!
x <- rnorm(50)
y <- 1 + 1.5 * x + rnorm(50, sd = 1)
mse_loss <- function(xmat, outcome) {
  function(beta) {
    linpred <- xmat %*% beta
    mean((linpred - outcome)^2)
  }
}
mse_loss(cbind(1, x), y)(c(0, 0))
[1] 3.912841
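The output below presumably comes from handing the loss function to a general-purpose optimiser, along these lines:

optim(par = c(0, 0), fn = mse_loss(cbind(1, x), y))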
$par
[1] 1.056706 1.552929
$value
[1] 0.7686793
$counts
function gradient
67 NA
$convergence
[1] 0
$message
NULL
What are some ways we could make this function more general? Multivariate, different loss functions?
Recursive functions are functions that call themselves. It is critical that they have an exit condition. The basic example is computing the Fibonacci numbers:
fibonacci <- function(n) {
  if (n < 2) {
    return(1)  ## exit condition: fib(0) = fib(1) = 1
  } else {
    fibonacci(n - 1) + fibonacci(n - 2)  ## the function calls itself twice
  }
}
fibonacci(9)
[1] 55
This seems like a useless novelty, but recursive functions are actually quite useful for dealing with DAGs and nested data structures.
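For example, a sketch of computing the maximum nesting depth of a list:

depth <- function(x) {
  if (!is.list(x)) return(0)             ## exit condition: not a list
  if (length(x) == 0) return(1)
  1 + max(vapply(x, depth, numeric(1)))  ## recurse into each element
}
depth(list(1, list(2, list(3))))
[1] 3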