Recap from the morning

Everything is done via (different types of) functions (see functional programming tomorrow):

Functions have

An argument

Fun fact - you can overwrite body, arguments (via formals) and environment in an assignment:

> set.seed(313)
> x = rnorm(10); y = 2 + 3* x + rnorm(10)/4
> coef( lm(y~x) )     ## Show coefficients only
(Intercept)           x 
   1.962911    2.989594 
> body(lm) = "No linear regression for you!"
> lm(y~x) 
[1] "No linear regression for you!"
> lm
function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...) 
"No linear regression for you!"
<environment: namespace:stats>

Because the original lm is defined in a package, changes are done to a local copy in the current environment. At the command line, this is the global environment:

> find("lm")
[1] ".GlobalEnv"    "package:stats"
> coef( stats::lm(y~x) )   ## Still works 
(Intercept)           x 
   1.962911    2.989594 
> rm(lm)                   ## Delete local copy
> coef( lm(y~x) )          ## Works again
(Intercept)           x 
   1.962911    2.989594 

Lessons learned:

  1. We can have functions with the same name in the search path - someone keeps (magically?) track of them.

  2. There are differences between functions living in a package and in the global environment (e.g. generated via a script).

Getting {stuff} out of functions

Returning results

By default, an R function returns the value of the last statement in the body:

> boxcox = function(x, exp) {
+   
+   if (exp == 0) {
+     ret = log(x)
+   } else {
+     ret = x^exp
+   }
+  
+   ret 
+ }

A function can return any value at any point in the code by using the return function:

> boxcox = function(x, exp) {
+   
+   if (exp == 0) {
+     return( log(x) )
+   } else {
+     return( x^exp )
+   }
+  
+ }

I think this makes programs harder to read and generally avoid using return; opinions on this differ.

invisible works like return, but the resulting value will not be printed:

> invisible_square = function(x)
+ {
+   invisible(x * x)
+ }
> invisible_square(2)
> x = invisible_square(2)
> x
[1] 4

Complex return values

Often, a vector, matrix or data frame is the obvious return value (e.g. boxcox).

Sometimes, a matrix or data frame is not the most obvious, but still the most useful return value (e.g. broom::tidy).

And sometimes, you just get lots of different objects that you need to stick together, which is when you can use a list:

> lm_min = lm(1 ~ 1)    ## World's smallest linear model?! No, but close
> is.list(lm_min)
[1] TRUE
> sapply(lm_min, typeof)
 coefficients     residuals       effects          rank fitted.values 
     "double"      "double"      "double"     "integer"      "double" 
       assign            qr   df.residual          call         terms 
    "integer"        "list"     "integer"    "language"    "language" 
        model 
       "list" 

Designing return values

Return the same type of result every time (if possible). Unlike this:

> DNase[1:5, ]
  Run       conc density
1   1 0.04882812   0.017
2   1 0.04882812   0.018
3   1 0.19531250   0.121
4   1 0.19531250   0.124
5   1 0.39062500   0.206
> DNase[1:5, 3]  ## use drop=FALSE to make safe
[1] 0.017 0.018 0.121 0.124 0.206

Consider tidy return values: does it make sense to return a rectangular data matrix?

> data(veteran, package = "survival")
> tab  = with(veteran, table(celltype, status))
> ttab = reshape2::melt(tab,  value.name= "Freq")
> tab   ## NOT tidy
           status
celltype     0  1
  squamous   4 31
  smallcell  3 45
  adeno      1 26
  large      1 26
> ttab  ## tidy
   celltype status Freq
1  squamous      0    4
2 smallcell      0    3
3     adeno      0    1
4     large      0    1
5  squamous      1   31
6 smallcell      1   45
7     adeno      1   26
8     large      1   26
> glm(Freq ~ celltype + status, data = ttab)  ## Statistics, yay

Call:  glm(formula = Freq ~ celltype + status, data = ttab)

Coefficients:
      (Intercept)  celltypesmallcell      celltypeadeno  
            2.625              6.500             -4.000  
    celltypelarge             status  
           -4.000             29.750  

Degrees of Freedom: 7 Total (i.e. Null);  3 Residual
Null Deviance:      2019 
Residual Deviance: 101.4    AIC: 55.02

Avoid fancy printing and plotting, unless this is the only thing your function does. Many older functions do this:

> hist(DNase$density)                    ## plot, invisible result
> hist(DNase$density, plot = FALSE)      ## no plot - visible result
$breaks
 [1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2

$counts
 [1] 45 31 12 22  4 18  8 13 18  4  1

$density
 [1] 1.27840909 0.88068182 0.34090909 0.62500000 0.11363636 0.51136364
 [7] 0.22727273 0.36931818 0.51136364 0.11363636 0.02840909

$mids
 [1] 0.1 0.3 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9 2.1

$xname
[1] "DNase$density"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"
> hh = hist(DNase$density, plot = FALSE) ## save object
> plot(hh)                               ## ... and plot it

Compare this to ggplot2, which has a clean separation between defining and displaying a plot.

Function effx in package Epi produces fancy-ish text output, but returns nothing, nothing at all:

> data(births, package = "Epi")
> out = Epi::effx(bweight,exposure=hyp,data=births)
--------------------------------------------------------------------------- 
response      :  bweight 
type          :  metric 
exposure      :  hyp 

hyp is numeric 
effects are measured as differences in means 
--------------------------------------------------------------------------- 

effect of an increase of 1 unit in hyp on bweight 
number of observations  500 

Effect   2.5%  97.5% 
  -431   -585   -276 

Test for no effects of exposure on 1 df: p-value= 4.9e-08 
> out
NULL

Don’t do that.

Communicating with the user

Functions should avoid elaborate print output unless this is the only thing they do: see above, object-oriented programming.

Instead, functions should put construct an object with all the relevant information, and either trust the R printing methods, or hide the ugly using invisible.

However, there are situations where it makes sense to contact the user from within the function:

  1. Progress: for a slow process, use cat (not print) to indicate progress; offer an argument to suppress the progress report.

    (Alt. use package progress - however, does not work on terminal R)

  2. Side effects: if the function does something that will have side effects outside itself, the user needs to be informed. This should be done via message and warning, not print-ing or cat-ing.

  3. Problems: if the function runs into a situation it cannot handle, it can either try a best-guess fix, or just give up; warning informs, but continues execution, stop terminates execution and returns an error.

> log(-1)      ## Warning, moves on with NaN
> log("a")     ## Error

Useful:

  • stopifnot for tough-love argument checking
  • options(warn=2) to turn warnings into errors

Functions of functions

Functions as return values

Functions as arguments are common: lapply etc.

Functions can also be return values. E.g. we can define our own logarithms for different bases:

> log_general = function(base) 
+ {
+   function(x) log(x, base)
+ }
> log2 = log_general(2)
> log2(4)
[1] 2

This log2 is a function defined in log_general. When log_general was called with base=2, this was saved in the execution environment of log_general, which log2 has inherited:

> environment(log2)
<environment: 0x000000001ad66f08>
> ls( environment(log2) )
[1] "base"
> get( "base", environment(log2) ) 
[1] 2

This is a reasonably advanced, but powerful technique. This combination of a function and data (in the environment) is generally known as a closure, though the term has a narrower technical meaning in R. See Advanced R 2nd ed, Section 10.4, for use cases.

Function wrappers

A different, more conventional way of defining our own binary logarithm:

> other_log2 = function(x) log(x, base = 2)
> other_log2(4)
[1] 2
> environment(other_log2)
<environment: R_GlobalEnv>

Nothing wrong with that as one off, though ultimately less flexible.

Managing complaints

Other people’s functions may also cause messages, warnings or errors.

Use suppressMessages or suppressWarnings to keep it silent:

> suppressWarnings( log(-1) )
[1] NaN
> library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
> library(MASS) 

Attaching package: 'MASS'
The following object is masked _by_ '.GlobalEnv':

    boxcox
The following object is masked from 'package:dplyr':

    select
> detach("package:MASS")
> suppressMessages( library(MASS) )

Errors can be caught using the function try:

> try( log("a"), silent = TRUE )  ## Not an error, but what happened?!
> ## Convert error to warning, use NaN as value
> ret = try( log("a"), silent = TRUE )
> if ( !is.numeric (ret) ) {
+   warning(ret)
+   ret = NaN
+ }

tryCatch offers a more flexible way of defining error handlers.

Exercises

  1. What do these functions return?
> ex1_a = function() 7
> ex1_b = function() x=7
> ex1_c = function() {x=7}
> ex1_d = function() {y=1; x=7}
> ex1_e = function() y=1; x=7
  1. Write a function hasPlotArg that takes a function as argument and checks whether the function has plot as an argument.

    Find all functions in the search path that have plot in their name, and check which ones have a plot-argument.

    If we also want to know about the default value for argument plot, should we add an optional argument to hasPlotArg, or is there a better way to design this? Implement your idea.

  2. Have a look how the function read.table handles unexpected situations: when does it give up, and when does it continue reading with a message/warning?