Functions

Michael C Sachs

Using functions

A function and its components

Almost everything you do in R involves functions. You call a function by typing its name with its arguments (inputs) inside the parentheses:

The function takes the arguments you provide, does something, and then returns an object. To see what a function does, you can type its name without parentheses to see the source:

sample
function (x, size, replace = FALSE, prob = NULL) 
{
    if (length(x) == 1L && is.numeric(x) && is.finite(x) && x >= 
        1) {
        if (missing(size)) 
            size <- x
        sample.int(x, size, replace, prob)
    }
    else {
        if (missing(size)) 
            size <- length(x)
        x[sample.int(length(x), size, replace, prob)]
    }
}
<bytecode: 0x589e90c42b28>
<environment: namespace:base>

The source shows you the arguments, their default values, and the expression defining the function. You can also look at the help file for the documentation:

help("sample")
# or
?sample

Using functions – arguments

Functions can have 0 or more arguments, with or without defaults.

The arguments can be given in order, or by name

Names can be partially matched, which can be confusing:

The ellipsis argument

Some functions take ... as an argument, e.g., paste, list, also the apply family.

There are 2 reasons for this:

  1. There could be varying numbers of arguments
  1. To pass optional arguments to other functions involved

Using functions – composition

Often we want to use the result of one function as the argument to another function. There are many ways to do this:

  1. Intermediate variables
  1. Nested function calls
  1. The pipe operator |> (available in R 4.0.1)

Using functions – the apply family

Some functions will take other functions as arguments. An example is the apply family of functions, which applies a function over an index or iterator. See help(apply)

apply repeated applies a function over the dimensions of an array. MARGIN indicates which dimension, and then for each index in that dimension, it applies FUN to the sub-array

Apply continued

tapply is commonly used with data. It subsets the data X based on the INDEX argument, then applies a function to each subset:

lapply

lapply is more general, in that it can take any index and apply any function that takes the index as an argument. It always returns a list. sapply does the same, but simplifies the result to an array, if possible.

mapply

This is the multivariate version of sapply that allows vector arguments.

See also the purrr package

Notes on speed and flexibility

The apply family of functions is computationally equivalent to a loop (with pre-allocation)

Using apply instead of a for loop will not be faster computationally

It may be faster to write, but it may also be harder to understand

You can do whatever you want inside a for loop, how would you do something more complex with lapply?

Writing your own functions

A simple function

A function with arguments

Local variables and scoping

name2 is a local variable. It exists only inside the function.

Modifying local variables outside the function has no effect. But be careful:

Arguments passed by value

Likewise, arguments modified inside the function do not change the object outside the function.

Lexical scoping

This is called lexical scoping: it defines how R looks for objects when they are referred to by name

If R sees a variable it needs to use inside a function, and it is not an argument or local variable, then it follows these rules to find the object with that name:

  1. Look in the environment where the function was defined.
  2. If not found, look in the parent environment of 1
  3. If not found continue going down into parents until there are no more.

Note the specification sees a variable and needs to use it. This is called lazy evaluation: R does not evaluate anything until it needs to use it

Lexical scoping example

This can be used to your advantage, e.g.,

Lazy evaluation example

One way to manually check for arguments is with missing:

Using match.arg

Look at the help file for t.test, and specifically the alternative argument. It is a vector with 3 elements, but only one is used. Also, it can be partially matched, e.g.,

How does that work? Using match.arg inside the function:

Anonymous functions

Your own functions do not need to be saved and assigned names. If a function does not have a name it is anonymous, I use these often with the apply family:

Since R 4.0.1, \() can be used as shorthand for function():

Operators

Operators are symbols like +, <-, %*%, [.

These are functions! To treat them like functions instead of operators, use backticks:

You can then treat operators as you would any other function, using them in apply or otherwise

You can also define your own operators:

Assignment operators have a special syntax:

Generic methods/functions

Look at the function print

print
function (x, ...) 
UseMethod("print")
<bytecode: 0x589e8e5a0f88>
<environment: namespace:base>

It is a generic function. UseMethod says depending on the class of argument object, R will call a suitable method (a function) that does something designed for whatever object is.

You can find all the special methods by running methods("print") (try it now).

The class of the object is a simple attribute and the method is defined by appending the class name after the function name separated by a dot. This is called the S3 class system:

Summary

In R, everything that happens is due to a function, and everything that exists is an object. Functions themselves are objects.

How do functions work together? We can classify functions according to their inputs and outputs:

Input/Output Data Function
Data Regular function Function factory
Function Functional Function operator

These concepts are loosely defined, because functions can take both data and function arguments and return data and function results.

Designing functions

When should you write a function? How should it be designed?

  1. The DRY principle: don’t repeat yourself
  2. Consider the audience
  • Don’t write a function unless you expect somebody to use it
  • Consider the most likely use cases, and remember you can’t make everyone happy
  1. Balance ease-of-use with understandability
  • Break down the task into a series of smaller tasks, and abstract them away into functions
  • Reuse or build? Dependencies (using functions from other packages) may change in unpredictable ways
  • Default arguments and error checking – you can’t prevent all errors, ultimately it is the users’ responsibility to use the tools correctly

Recap from the morning

Everything is done via (different types of) functions (see functional programming tomorrow):

  • data, parameters go in as arguments
  • results are returned as value of the function
  • both are R objects

Functions have

  • arguments: args(lm) or formals(lm)
  • a body of code: body(lm)
  • an evironment: environment(lm)

An argument

  • is evaluated lazily (when and if required)
  • can be checked for presence via missing (if necessary)
  • can be an ellipsis

Fun fact - you can overwrite body, arguments (via formals) and environment in an assignment:

set.seed(313)
x = rnorm(10); y = 2 + 3* x + rnorm(10)/4
coef( lm(y~x) )     ## Show coefficients only
(Intercept)           x 
   1.962911    2.989594 
body(lm) = "No linear regression for you!"
lm(y~x) 
[1] "No linear regression for you!"
lm
function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...) 
"No linear regression for you!"
<environment: namespace:stats>

Because the original lm is defined in a package, changes are done to a local copy in the current environment. At the command line, this is the global environment:

find("lm")
[1] ".GlobalEnv"    "package:stats"
coef( stats::lm(y~x) )   ## Still works 
(Intercept)           x 
   1.962911    2.989594 
rm(lm)                   ## Delete local copy
coef( lm(y~x) )          ## Works again
(Intercept)           x 
   1.962911    2.989594 

Lessons learned:

  1. We can have functions with the same name in the search path - someone keeps (magically?) track of them.

  2. There are differences between functions living in a package and in the global environment (e.g. generated via a script).

Getting {stuff} out of functions

Returning results

By default, an R function returns the value of the last statement in the body:

boxcox = function(x, exp) {
  
  if (exp == 0) {
    ret = log(x)
  } else {
    ret = x^exp
  }
 
  ret 
}

A function can return any value at any point in the code by using the return function:

boxcox = function(x, exp) {
  
  if (exp == 0) {
    return( log(x) )
  } else {
    return( x^exp )
  }
 
}

I think this makes programs harder to read and generally avoid using return; opinions on this differ.

invisible works like return, but the resulting value will not be printed:

invisible_square = function(x)
{
  invisible(x * x)
}
invisible_square(2)
x = invisible_square(2)
x
[1] 4

Complex return values

Often, a vector, matrix or data frame is the obvious return value (e.g. boxcox).

Sometimes, a matrix or data frame is not the most obvious, but still the most useful return value (e.g. broom::tidy).

And sometimes, you just get lots of different objects that you need to stick together, which is when you can use a list:

lm_min = lm(1 ~ 1)    ## World's smallest linear model?! No, but close
is.list(lm_min)
[1] TRUE
sapply(lm_min, typeof)
 coefficients     residuals       effects          rank fitted.values 
     "double"      "double"      "double"     "integer"      "double" 
       assign            qr   df.residual          call         terms 
    "integer"        "list"     "integer"    "language"    "language" 
        model 
       "list" 

Designing return values

Return the same type of result every time (if possible). Unlike this:

DNase[1:5, ]
  Run       conc density
1   1 0.04882812   0.017
2   1 0.04882812   0.018
3   1 0.19531250   0.121
4   1 0.19531250   0.124
5   1 0.39062500   0.206
DNase[1:5, 3]  ## use drop=FALSE to make safe
[1] 0.017 0.018 0.121 0.124 0.206

Consider tidy return values: does it make sense to return a rectangular data matrix?

tab  = with(survival::veteran, table(celltype, status))
ttab = reshape2::melt(tab,  value.name= "Freq")
tab   ## NOT tidy
           status
celltype     0  1
  squamous   4 31
  smallcell  3 45
  adeno      1 26
  large      1 26
ttab  ## tidy
   celltype status Freq
1  squamous      0    4
2 smallcell      0    3
3     adeno      0    1
4     large      0    1
5  squamous      1   31
6 smallcell      1   45
7     adeno      1   26
8     large      1   26
glm(Freq ~ celltype + status, data = ttab)  ## Statistics, yay

Call:  glm(formula = Freq ~ celltype + status, data = ttab)

Coefficients:
      (Intercept)  celltypesmallcell      celltypeadeno      celltypelarge  
            2.625              6.500             -4.000             -4.000  
           status  
           29.750  

Degrees of Freedom: 7 Total (i.e. Null);  3 Residual
Null Deviance:      2019 
Residual Deviance: 101.4    AIC: 55.02

Avoid fancy printing and plotting, unless this is the only thing your function does. Many older functions do this:

hist(DNase$density)                    ## plot, invisible result
hist(DNase$density, plot = FALSE)      ## no plot - visible result
$breaks
 [1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2

$counts
 [1] 45 31 12 22  4 18  8 13 18  4  1

$density
 [1] 1.27840909 0.88068182 0.34090909 0.62500000 0.11363636 0.51136364
 [7] 0.22727273 0.36931818 0.51136364 0.11363636 0.02840909

$mids
 [1] 0.1 0.3 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9 2.1

$xname
[1] "DNase$density"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"
hh = hist(DNase$density, plot = FALSE) ## save object
plot(hh)                               ## ... and plot it

Compare this to ggplot2, which has a clean separation between defining and displaying a plot.

Function effx in package Epi produces fancy-ish text output, but returns nothing, nothing at all:

data(births, package = "Epi")
out = Epi::effx(bweight,exposure=hyp,data=births)
--------------------------------------------------------------------------- 
response      :  bweight 
type          :  metric 
exposure      :  hyp 

hyp is numeric 
effects are measured as differences in means 
--------------------------------------------------------------------------- 

effect of an increase of 1 unit in hyp on bweight 
number of observations  500 

Effect   2.5%  97.5% 
  -431   -585   -276 

Test for no effects of exposure on 1 df: p-value= 4.9e-08 
out
NULL

Don’t do that.

Communicating with the user

Functions should avoid elaborate print output unless this is the only thing they do: see above, object-oriented programming.

Instead, functions should put construct an object with all the relevant information, and either trust the R printing methods, or hide the ugly using invisible.

However, there are situations where it makes sense to contact the user from within the function:

  1. Progress: for a slow process, use cat (not print) to indicate progress; offer an argument to suppress the progress report.

    (Alt. use package progress - however, does not work on terminal R)

  2. Side effects: if the function does something that will have side effects outside itself, the user needs to be informed. This should be done via message and warning, not print-ing or cat-ing.

  3. Problems: if the function runs into a situation it cannot handle, it can either try a best-guess fix, or just give up; warning informs, but continues execution, stop terminates execution and returns an error.

log(-1)      ## Warning, moves on with NaN
log("a")     ## Error

Useful:

  • stopifnot for tough-love argument checking
  • options(warn=2) to turn warnings into errors

Functions of functions

Functions as return values

Functions as arguments are common: lapply etc.

Functions can also be return values. E.g. we can define our own logarithms for different bases:

log_general = function(base) 
{
  function(x) log(x, base)
}
log2 = log_general(2)
log2(4)
[1] 2

This log2 is a function defined in log_general. When log_general was called with base=2, this was saved in the execution environment of log_general, which log2 has inherited:

environment(log2)
<environment: 0x589e97ca9870>
ls( environment(log2) )
[1] "base"
get( "base", environment(log2) ) 
[1] 2

This is a reasonably advanced, but powerful technique. This combination of a function and data (in the environment) is generally known as a closure, though the term has a narrower technical meaning in R. See Advanced R 2nd ed, Section 10.4, for use cases.

Function wrappers

A different, more conventional way of defining our own binary logarithm:

other_log2 = function(x) log(x, base = 2)
other_log2(4)
[1] 2
environment(other_log2)
<environment: R_GlobalEnv>

Nothing wrong with that as one off, though ultimately less flexible.

Managing complaints

Other people’s functions may also cause messages, warnings or errors.

Use suppressMessages or suppressWarnings to keep it silent:

suppressWarnings( log(-1) )
[1] NaN
library(dplyr)
library(MASS) 
detach("package:MASS")
suppressMessages( library(MASS) )

Errors can be caught using the function try:

try( log("a"), silent = TRUE )  ## Not an error, but what happened?!
## Convert error to warning, use NaN as value
ret = try( log("a"), silent = TRUE )
if ( !is.numeric (ret) ) {
  warning(ret)
  ret = NaN
}

tryCatch offers a more flexible way of defining error handlers.

Exercises

  1. What do these functions return?
ex1_a = function() 7
ex1_b = function() x=7
ex1_c = function() {x=7}
ex1_d = function() {y=1; x=7}
ex1_e = function() y=1; x=7
  1. Write a function hasPlotArg that takes a function as argument and checks whether the function has plot as an argument.

    Find all functions in the search path that have plot in their name, and check which ones have a plot-argument.

    If we also want to know about the default value for argument plot, should we add an optional argument to hasPlotArg, or is there a better way to design this? Implement your idea.

  2. Have a look how the function read.table handles unexpected situations: when does it give up, and when does it continue reading with a message/warning?

Some more advanced topics

get and assign

Recall that we can retrieve a variable from a data frame by using a character string, e.g., penguins[["species"]].

We can use a character string to get or assign any other object using these functions. For example, this returns the function called mean

which we can use like a function

Likewise, an object can be created with assign

Uses of get and assign

Example, iterating over functions by name:

Example, retrieving a function programmatically,

Example, programmatically creating new variables,

do.call

A variant on get is do.call. This takes a function as the first argument, then a list containing the arguments for the function, do.call(<function>, <list of arguments to function>).

A common use for this is with functions that take a variable number of arguments, e.g., cbind, paste, where the arguments are created programmatically.

simple example,

arranging a list into a matrix

Global assigment operator

There is the <<- operator, which is used in functions and does (re)assignment outside the function. It searches the parent environments and reassigns where found, if not found it assigns in the global environment.

This is generally considered to be a bad idea, but now you know about it.

Functionals

Functionals invoke functions over some data structure to create an output that corresponds to that data structure. We’ve already used one functional: lapply. Here is a simpler one:

random_stat <- function(fun) {
    fun(runif(100))
}

random_stat(mean)
[1] 0.5041757

The data structure here is a single variable (runif(100)). Functionals are useful for iterating over more complex data structures as well.

lapply iterates over a list, applying the same function to each element. How can you achieve the same result using a for loop?

random_stats <- function(fun) {
    
    lapply(list(uniform = runif(100), normal = rnorm(100), 
                t = rt(100, 4), exponential = rexp(100)), fun)
    
}

random_stats(mean)
$uniform
[1] 0.4998756

$normal
[1] 0.1167602

$t
[1] -0.008399415

$exponential
[1] 0.817919
random_stats(sd)
$uniform
[1] 0.2825966

$normal
[1] 1.015672

$t
[1] 1.641905

$exponential
[1] 0.9946533
lapply(list(mean = mean, sd = sd), random_stats)
$mean
$mean$uniform
[1] 0.4918718

$mean$normal
[1] 0.1327262

$mean$t
[1] 0.1203086

$mean$exponential
[1] 1.093484


$sd
$sd$uniform
[1] 0.2998995

$sd$normal
[1] 0.9078401

$sd$t
[1] 1.468887

$sd$exponential
[1] 0.8860794

mapply iterates over a series of lists. Same question?

All of these iteration problems can be solved with loops. What are the advantages/disadvantages of using loops?

Function operators

Function operators take functions as input and return functions as output, generally modifying the behaviour of the function.

Let’s see if we can implement an error capture adverb.

#' Modify a function to fail without stopping
#' 
#' @param f A function

capture_errors <- function(f) {
    
    function(...) {
        tryCatch(f(...), 
                 error = function(e) as.character(e))    
        
    }
}

geometric_mean <- function(x) {
    
    exp(mean(log(x)))
    
}

safe_geom_mean <- capture_errors(geometric_mean)
#lapply(iris, geometric_mean)
lapply(iris, safe_geom_mean)
$Sepal.Length
[1] 5.78572

$Sepal.Width
[1] 3.026598

$Petal.Length
[1] 3.238267

$Petal.Width
[1] 0.8417075

$Species
[1] "Error in Math.factor(x): 'log' not meaningful for factors\n"

Function factories and closures

Function factories take data as input and return functions:

power1 <- function(exp) {
  function(x) {
    x ^ exp
  }
}

square <- power1(2)
cube <- power1(3)

Function factories are useful in statistics. Let’s do some regression!

x <- rnorm(50)
y <- 1 + 1.5 * x + rnorm(50, sd = 1)

mse_loss <- function(xmat, outcome) {
    
    function(beta) {
        
        linpred <- xmat %*% beta
        mean((linpred - outcome)^2)
        
    }
    
}


mse_loss(cbind(1, x), y)(c(0, 0))
[1] 3.912841
optim(par = c(0, 0), mse_loss(cbind(1, x), y))
$par
[1] 1.056706 1.552929

$value
[1] 0.7686793

$counts
function gradient 
      67       NA 

$convergence
[1] 0

$message
NULL

What are some ways we could make this function more general? Multivariate, different loss functions?

Recursive functions

Recursive functions are functions that call themselves. It is critical that they have an exit condition. The basic example is computing the Fibonacci numbers:

fibonacci <- function(n) {
    
    if(n < 2){
        
        return(1)
        
    } else {
        
        fibonacci(n - 1) + fibonacci(n - 2)
    }
    
}

fibonacci(9)
[1] 55

This seems like a useless novelty, but they are actually quite useful in dealing with DAGs and nested data structures.