Everything is done via (different types of) functions (see functional programming tomorrow):
Functions have
args(lm)
or formals(lm)
body(lm)
environment(lm)
An argument
missing
(if necessary)Fun fact - you can overwrite body, arguments (via formals
) and environment in an assignment:
> set.seed(313)
> x = rnorm(10); y = 2 + 3* x + rnorm(10)/4
> coef( lm(y~x) ) ## Show coefficients only
(Intercept) x
1.962911 2.989594
> body(lm) = "No linear regression for you!"
> lm(y~x)
[1] "No linear regression for you!"
> lm
function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
"No linear regression for you!"
<environment: namespace:stats>
Because the original lm
is defined in a package, changes are done to a local copy in the current environment. At the command line, this is the global environment:
> find("lm")
[1] ".GlobalEnv" "package:stats"
> coef( stats::lm(y~x) ) ## Still works
(Intercept) x
1.962911 2.989594
> rm(lm) ## Delete local copy
> coef( lm(y~x) ) ## Works again
(Intercept) x
1.962911 2.989594
Lessons learned:
We can have functions with the same name in the search path - someone keeps (magically?) track of them.
There are differences between functions living in a package and in the global environment (e.g. generated via a script).
By default, an R function returns the value of the last statement in the body:
> boxcox = function(x, exp) {
+
+ if (exp == 0) {
+ ret = log(x)
+ } else {
+ ret = x^exp
+ }
+
+ ret
+ }
A function can return any value at any point in the code by using the return
function:
> boxcox = function(x, exp) {
+
+ if (exp == 0) {
+ return( log(x) )
+ } else {
+ return( x^exp )
+ }
+
+ }
I think this makes programs harder to read and generally avoid using return
; opinions on this differ.
invisible
works like return
, but the resulting value will not be printed:
> invisible_square = function(x)
+ {
+ invisible(x * x)
+ }
> invisible_square(2)
> x = invisible_square(2)
> x
[1] 4
Often, a vector, matrix or data frame is the obvious return value (e.g. boxcox
).
Sometimes, a matrix or data frame is not the most obvious, but still the most useful return value (e.g. broom::tidy
).
And sometimes, you just get lots of different objects that you need to stick together, which is when you can use a list:
> lm_min = lm(1 ~ 1) ## World's smallest linear model?! No, but close
> is.list(lm_min)
[1] TRUE
> sapply(lm_min, typeof)
coefficients residuals effects rank fitted.values
"double" "double" "double" "integer" "double"
assign qr df.residual call terms
"integer" "list" "integer" "language" "language"
model
"list"
Return the same type of result every time (if possible). Unlike this:
> DNase[1:5, ]
Run conc density
1 1 0.04882812 0.017
2 1 0.04882812 0.018
3 1 0.19531250 0.121
4 1 0.19531250 0.124
5 1 0.39062500 0.206
> DNase[1:5, 3] ## use drop=FALSE to make safe
[1] 0.017 0.018 0.121 0.124 0.206
Consider tidy return values: does it make sense to return a rectangular data matrix?
> data(veteran, package = "survival")
> tab = with(veteran, table(celltype, status))
> ttab = reshape2::melt(tab, value.name= "Freq")
> tab ## NOT tidy
status
celltype 0 1
squamous 4 31
smallcell 3 45
adeno 1 26
large 1 26
> ttab ## tidy
celltype status Freq
1 squamous 0 4
2 smallcell 0 3
3 adeno 0 1
4 large 0 1
5 squamous 1 31
6 smallcell 1 45
7 adeno 1 26
8 large 1 26
> glm(Freq ~ celltype + status, data = ttab) ## Statistics, yay
Call: glm(formula = Freq ~ celltype + status, data = ttab)
Coefficients:
(Intercept) celltypesmallcell celltypeadeno
2.625 6.500 -4.000
celltypelarge status
-4.000 29.750
Degrees of Freedom: 7 Total (i.e. Null); 3 Residual
Null Deviance: 2019
Residual Deviance: 101.4 AIC: 55.02
Avoid fancy printing and plotting, unless this is the only thing your function does. Many older functions do this:
> hist(DNase$density) ## plot, invisible result
> hist(DNase$density, plot = FALSE) ## no plot - visible result
$breaks
[1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2
$counts
[1] 45 31 12 22 4 18 8 13 18 4 1
$density
[1] 1.27840909 0.88068182 0.34090909 0.62500000 0.11363636 0.51136364
[7] 0.22727273 0.36931818 0.51136364 0.11363636 0.02840909
$mids
[1] 0.1 0.3 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9 2.1
$xname
[1] "DNase$density"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
> hh = hist(DNase$density, plot = FALSE) ## save object
> plot(hh) ## ... and plot it
Compare this to ggplot2
, which has a clean separation between defining and displaying a plot.
Function effx
in package Epi
produces fancy-ish text output, but returns nothing, nothing at all:
> data(births, package = "Epi")
> out = Epi::effx(bweight,exposure=hyp,data=births)
---------------------------------------------------------------------------
response : bweight
type : metric
exposure : hyp
hyp is numeric
effects are measured as differences in means
---------------------------------------------------------------------------
effect of an increase of 1 unit in hyp on bweight
number of observations 500
Effect 2.5% 97.5%
-431 -585 -276
Test for no effects of exposure on 1 df: p-value= 4.9e-08
> out
NULL
Don’t do that.
Functions should avoid elaborate print output unless this is the only thing they do: see above, object-oriented programming.
Instead, functions should put construct an object with all the relevant information, and either trust the R printing methods, or hide the ugly using invisible
.
However, there are situations where it makes sense to contact the user from within the function:
Progress: for a slow process, use cat
(not print
) to indicate progress; offer an argument to suppress the progress report.
(Alt. use package progress
- however, does not work on terminal R)
Side effects: if the function does something that will have side effects outside itself, the user needs to be informed. This should be done via message
and warning
, not print-ing or cat-ing.
Problems: if the function runs into a situation it cannot handle, it can either try a best-guess fix, or just give up; warning
informs, but continues execution, stop
terminates execution and returns an error.
> log(-1) ## Warning, moves on with NaN
> log("a") ## Error
Useful:
stopifnot
for tough-love argument checkingoptions(warn=2)
to turn warnings into errorsFunctions as arguments are common: lapply
etc.
Functions can also be return values. E.g. we can define our own logarithms for different bases:
> log_general = function(base)
+ {
+ function(x) log(x, base)
+ }
> log2 = log_general(2)
> log2(4)
[1] 2
This log2
is a function defined in log_general
. When log_general
was called with base=2
, this was saved in the execution environment of log_general
, which log2
has inherited:
> environment(log2)
<environment: 0x000000001ad66f08>
> ls( environment(log2) )
[1] "base"
> get( "base", environment(log2) )
[1] 2
This is a reasonably advanced, but powerful technique. This combination of a function and data (in the environment) is generally known as a closure, though the term has a narrower technical meaning in R. See Advanced R 2nd ed, Section 10.4, for use cases.
A different, more conventional way of defining our own binary logarithm:
> other_log2 = function(x) log(x, base = 2)
> other_log2(4)
[1] 2
> environment(other_log2)
<environment: R_GlobalEnv>
Nothing wrong with that as one off, though ultimately less flexible.
Other people’s functions may also cause messages, warnings or errors.
Use suppressMessages
or suppressWarnings
to keep it silent:
> suppressWarnings( log(-1) )
[1] NaN
> library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
> library(MASS)
Attaching package: 'MASS'
The following object is masked _by_ '.GlobalEnv':
boxcox
The following object is masked from 'package:dplyr':
select
> detach("package:MASS")
> suppressMessages( library(MASS) )
Errors can be caught using the function try
:
> try( log("a"), silent = TRUE ) ## Not an error, but what happened?!
> ## Convert error to warning, use NaN as value
> ret = try( log("a"), silent = TRUE )
> if ( !is.numeric (ret) ) {
+ warning(ret)
+ ret = NaN
+ }
tryCatch
offers a more flexible way of defining error handlers.
> ex1_a = function() 7
> ex1_b = function() x=7
> ex1_c = function() {x=7}
> ex1_d = function() {y=1; x=7}
> ex1_e = function() y=1; x=7
Write a function hasPlotArg
that takes a function as argument and checks whether the function has plot
as an argument.
Find all functions in the search path that have plot
in their name, and check which ones have a plot
-argument.
If we also want to know about the default value for argument plot
, should we add an optional argument to hasPlotArg
, or is there a better way to design this? Implement your idea.
Have a look how the function read.table
handles unexpected situations: when does it give up, and when does it continue reading with a message/warning?