Day 2, B
Almost everything you do in R involves functions. You call a function by typing its name with its arguments (inputs) inside the parentheses:
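As a minimal sketch of a function call, assuming sample (whose source is printed further down) is the function being discussed, with arbitrary inputs chosen for illustration:

```r
# Call a function: name, then arguments inside parentheses
set.seed(1)                       # only to make the random draw repeatable
draws <- sample(1:10, size = 3)   # x given by position, size given by name
length(draws)                     # 3
```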
The function takes the arguments you provide, does something, and then returns an object. To see what a function does, you can type its name without parentheses to see the source:
function (x, size, replace = FALSE, prob = NULL)
{
    if (length(x) == 1L && is.numeric(x) && is.finite(x) && x >=
        1) {
        if (missing(size))
            size <- x
        sample.int(x, size, replace, prob)
    }
    else {
        if (missing(size))
            size <- length(x)
        x[sample.int(length(x), size, replace, prob)]
    }
}
<bytecode: 0x586014475ba8>
<environment: namespace:base>
The source shows you the arguments, their default values, and the expression defining the function. You can also look at the help file for the documentation:
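A short sketch of inspecting a function without reading its whole body, again using sample as the example. formals is a base R function that returns the argument list with its defaults.

```r
# Arguments and their defaults, as a named list
fmls <- formals(sample)
names(fmls)      # "x" "size" "replace" "prob"
fmls$replace     # FALSE

# Open the documentation (run interactively):
# ?sample
# help("sample")
```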
Functions can have 0 or more arguments, with or without defaults.
The arguments can be given in order, or by name
Names can be partially matched, which can be confusing:
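A minimal sketch of the matching rules. The functions power and ambiguous are hypothetical and exist only for this illustration.

```r
# Hypothetical function with one required argument and one default
power <- function(base, exponent = 2) base^exponent

power(3)                        # 9: exponent falls back to its default
power(3, 3)                     # 27: matched by position
power(exponent = 3, base = 2)   # 8: matched by name, order does not matter
power(2, exp = 3)               # 8: "exp" partially matches "exponent"

# Partial matching fails when the prefix fits more than one argument
ambiguous <- function(value_min, value_max) value_max - value_min
res <- try(ambiguous(value = 1, 5), silent = TRUE)
inherits(res, "try-error")      # TRUE
```

Writing argument names out in full avoids the confusion.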
Some functions take ... as an argument, e.g., paste, list, and the apply family.
There are 2 reasons for this:
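Two uses of ... that are common in base R are accepting an arbitrary number of inputs and passing extra arguments through to another function. A sketch of both, with hypothetical function names:

```r
# 1. Accept any number of inputs
count_args <- function(...) length(list(...))
count_args(1, "a", TRUE)               # 3

# 2. Pass extra arguments on to an inner function
mean_of <- function(x, ...) mean(x, ...)
mean_of(c(1, 2, NA), na.rm = TRUE)     # 1.5
```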
Often we want to use the result of one function as the argument to another function. There are many ways to do this:
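A sketch of three common ways to feed one function's result into another, using an arbitrary numeric vector. The last form needs an R version that has the native pipe.

```r
x <- c(1, 4, 9, 16)

# Nesting: read from the inside out
nested <- sum(sqrt(x))

# Intermediate variable: one step per line
roots <- sqrt(x)
stepwise <- sum(roots)

# Native pipe: read from left to right
piped <- x |> sqrt() |> sum()

c(nested, stepwise, piped)   # 10 10 10
```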
The native pipe |> (available since R 4.1.0) is one of them.
Some functions take other functions as arguments. An example is the apply family of functions, which apply a function over an index or iterator. See help(apply).
apply repeatedly applies a function over the dimensions of an array. MARGIN indicates which dimension, and then for each index in that dimension, it applies FUN to the sub-array.
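A small sketch with a 2 x 3 matrix made up for illustration:

```r
m <- matrix(1:6, nrow = 2)          # rows: (1, 3, 5) and (2, 4, 6)

row_sums <- apply(m, MARGIN = 1, FUN = sum)   # one result per row
col_sums <- apply(m, MARGIN = 2, FUN = sum)   # one result per column

row_sums   # 9 12
col_sums   # 3 7 11
```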
tapply is commonly used with data. It subsets the data X based on the INDEX argument, then applies a function to each subset:
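A minimal sketch, assuming the built-in mtcars data set as a stand-in for whatever data the original example used:

```r
# Mean miles per gallon within each cylinder group
mpg_by_cyl <- tapply(X = mtcars$mpg, INDEX = mtcars$cyl, FUN = mean)
names(mpg_by_cyl)   # "4" "6" "8"
mpg_by_cyl
```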
lapply
lapply is more general, in that it can take any index and apply any function that takes the index as an argument. It always returns a list. sapply does the same, but simplifies the result to a vector or array, if possible.
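A sketch of the difference in return type, with an arbitrary index 1:3:

```r
squares_list <- lapply(1:3, function(i) i^2)   # always a list
squares_vec  <- sapply(1:3, function(i) i^2)   # simplified to a vector

is.list(squares_list)   # TRUE
squares_vec             # 1 4 9
```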
mapply
This is the multivariate version of sapply: it takes several vector arguments and applies the function to their elements in parallel.
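A minimal sketch with two made-up vectors. The function is called once per position, with one element from each vector.

```r
# Calls the function as f(1, 4), f(2, 5), f(3, 6)
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
sums   # 5 7 9
```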
See also the purrr package.
The apply family of functions is computationally equivalent to a loop (with pre-allocation).
Using apply instead of a for loop will not be faster computationally.
It may be faster to write, but it may also be harder to understand.
You can do whatever you want inside a for loop. How would you do something more complex with lapply?
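A sketch comparing the two styles on a toy task, followed by one possible answer to the question above: put the more complex logic in a multi-line function body.

```r
x <- 1:5

# for loop with pre-allocation
out_loop <- numeric(length(x))
for (i in seq_along(x)) {
  out_loop[i] <- x[i]^2
}

# Same computation with sapply
out_apply <- sapply(x, function(v) v^2)
all.equal(out_loop, out_apply)   # TRUE

# More complex work: use braces and several statements
labels <- lapply(x, function(v) {
  sq <- v^2
  if (sq > 10) "big" else "small"
})
unlist(labels)   # "small" "small" "small" "big" "big"
```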
name2 is a local variable. It exists only inside the function.
Modifying local variables outside the function has no effect. But be careful:
Likewise, arguments modified inside the function do not change the object outside the function.
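A sketch of both points. The function greet and its body are hypothetical; only the variable name name2 comes from the text above.

```r
greet <- function(name) {
  name2 <- toupper(name)       # local variable, created on each call
  name <- paste0(name, "!")    # changes the function's own copy only
  paste("Hello", name2)
}

who <- "ada"
greet(who)   # "Hello ADA"
who          # still "ada": the outside object is unchanged
```

After the call, name2 is gone: it was never created outside the function.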
This is called lexical scoping: it defines how R looks for objects when they are referred to by name.
If R sees a variable it needs to use inside a function, and it is not an argument or local variable, then it follows these rules to find the object with that name:
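A minimal sketch of the usual lookup order: R searches the environment where the function was defined, then that environment's parents up to the global environment, and then the attached packages. All names here are hypothetical.

```r
y <- 10                       # lives in the enclosing (here: global) environment
add_y <- function(x) x + y    # y is neither an argument nor a local variable
add_y(1)                      # 11

make_adder <- function() {
  y <- 100                    # closer enclosing environment wins
  function(x) x + y
}
add_100 <- make_adder()
add_100(1)                    # 101

# Lookup follows where add_y was defined, not where it is called
caller <- function() {
  y <- 1000
  add_y(1)
}
caller()                      # 11
```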
Note the wording: R sees a variable and needs to use it. This is called lazy evaluation: R does not evaluate anything until it needs to use it.
This can be used to your advantage, e.g.,
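Two sketches of lazy evaluation, with hypothetical functions: an argument that is never evaluated, and a default that refers to another argument.

```r
# b is only evaluated if the else branch is reached
lazy <- function(a, b) {
  if (a > 0) "b was never needed" else b
}
lazy(1, stop("this error is never triggered"))   # "b was never needed"

# The default for center is computed from x when it is first used
scaled <- function(x, center = mean(x)) x - center
scaled(c(1, 2, 3))   # -1 0 1
```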
One way to manually check for arguments is with missing:
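A minimal sketch, with a hypothetical function describe that fills in a value when the caller leaves an argument out:

```r
describe <- function(x, label) {
  if (missing(label)) label <- "unlabelled"   # label has no default
  paste(label, ":", length(x), "values")
}

describe(1:3)             # "unlabelled : 3 values"
describe(1:3, "counts")   # "counts : 3 values"
```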
match.arg
Look at the help file for t.test, and specifically the alternative argument. Its default is a vector with 3 elements, but only one is used. Also, it can be partially matched, e.g.,
How does that work? Using match.arg inside the function:
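A sketch of the pattern in a hypothetical function that borrows the three choices from the alternative argument of t.test. It is not the source of t.test itself.

```r
test_direction <- function(alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)   # validates and completes the value
  alternative
}

test_direction()        # "two.sided": first choice when nothing is supplied
test_direction("gr")    # "greater": partial match
bad <- try(test_direction("sideways"), silent = TRUE)
inherits(bad, "try-error")   # TRUE: not one of the choices
```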
Your own functions do not need to be saved and assigned names. If a function does not have a name, it is anonymous. I use these often with the apply family:
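A minimal sketch with arbitrary numbers:

```r
# The function is defined inline and never given a name
tens <- sapply(1:3, function(i) i * 10)
tens                     # 10 20 30

# An anonymous function can also be called on the spot
(function(x) x + 1)(2)   # 3
```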
Since R 4.1.0, \() can be used as shorthand for function():
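A sketch showing that the two spellings behave the same, assuming an R version that supports the shorthand:

```r
long_form  <- sapply(1:3, function(i) i * 10)
short_form <- sapply(1:3, \(i) i * 10)

identical(long_form, short_form)   # TRUE
```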
Operators are symbols like +, <-, %*%, [.
These are functions! To treat them like functions instead of operators, use backticks:
You can then treat operators as you would any other function, using them in apply or otherwise
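A short sketch of calling operators by their backticked names and passing them to other functions:

```r
`+`(1, 2)                  # 3, same as 1 + 2
`[`(c(10, 20, 30), 2)      # 20, same as c(10, 20, 30)[2]

# Operators as arguments to other functions
second_items <- sapply(list(1:3, 4:6), `[`, 2)   # second element of each
second_items               # 2 5
Reduce(`+`, 1:4)           # 10
```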
You can also define your own operators:
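A minimal sketch of a user-defined infix operator. The name %+% is chosen only for illustration; any name wrapped in percent signs works.

```r
# Define with backticks, use without them
`%+%` <- function(a, b) paste(a, b)

"hello" %+% "world"   # "hello world"
```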
Assignment operators have a special syntax:
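A sketch, assuming this refers to replacement functions such as names(x) <- value. The function name ends in <-, the last argument must be called value, and the modified object is returned. The name second is hypothetical.

```r
`second<-` <- function(x, value) {
  x[2] <- value
  x              # return the modified object
}

v <- 1:3
second(v) <- 10L
v   # 1 10 3
```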
Look at the function print. It is a generic function. UseMethod says that, depending on the class of its argument, R will call a suitable method (a function) that does something designed for that kind of object.
You can find all the special methods by running methods("print") (try it now).
The class of the object is a simple attribute and the method is defined by appending the class name after the function name separated by a dot. This is called the S3 class system:
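A minimal sketch of defining a print method for a hypothetical class called temperature:

```r
# Method name = generic name + "." + class name
print.temperature <- function(x, ...) {
  cat(unclass(x), "degrees\n")
  invisible(x)
}

# The class is just an attribute attached to the object
temp <- structure(21.5, class = "temperature")
class(temp)   # "temperature"
print(temp)   # 21.5 degrees
```

Typing temp at the console calls the same method, because R prints top-level results with print.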
In R, everything that happens is due to a function, and everything that exists is an object. Functions themselves are objects.
How do functions work together? We can classify functions according to their inputs and outputs:
Input (rows) / Output (columns) | Data | Function |
---|---|---|
Data | Regular function | Function factory |
Function | Functional | Function operator |
These concepts are loosely defined, because functions can take both data and function arguments and return data and function results.
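One hypothetical example for each cell of the table, as a sketch. Note that apply_twice takes both a function and data, which shows why the categories are loose.

```r
# Regular function: data in, data out
square <- function(x) x^2

# Function factory: data in, function out
power_factory <- function(exponent) function(x) x^exponent
cube <- power_factory(3)

# Functional: function in, data out
apply_twice <- function(f, x) f(f(x))

# Function operator: function in, function out
negate_result <- function(f) function(...) -f(...)
neg_square <- negate_result(square)

square(3)                # 9
cube(2)                  # 8
apply_twice(square, 2)   # 16
neg_square(3)            # -9
```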
When should you write a function? How should it be designed?
get and assign
Recall that we can retrieve a variable from a data frame by using a character string, e.g., penguins[["species"]].
We can use a character string to get or assign any other object using these functions. For example, this returns the function called mean, which we can use like any other function:
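A minimal sketch with arbitrary numbers:

```r
f <- get("mean")    # look up the object named "mean"
is.function(f)      # TRUE
f(c(1, 2, 3))       # 2
```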
Likewise, an object can be created with assign
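A minimal sketch. The variable name answer is arbitrary.

```r
assign("answer", 42)   # same effect as answer <- 42
answer                 # 42
```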
get and assign
Example, iterating over functions by name:
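A sketch of this idea, assuming three base summary functions and a made-up vector:

```r
x <- c(2, 4, 9)
fun_names <- c("mean", "median", "max")

# Look up each function by name and call it on x
results <- sapply(fun_names, function(nm) get(nm)(x))
results   # mean 5, median 4, max 9 (named by function)
```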
Example, retrieving a function programmatically,
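A sketch in which the function to use is chosen at run time. The wrapper summarise_with and its robust argument are hypothetical.

```r
summarise_with <- function(x, robust = FALSE) {
  fun_name <- if (robust) "median" else "mean"   # decide by name
  get(fun_name)(x)                               # then fetch and call
}

summarise_with(c(1, 2, 30))                  # 11
summarise_with(c(1, 2, 30), robust = TRUE)   # 2
```

The base function match.fun does a similar lookup but only returns functions.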
Example, programmatically creating new variables,
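A sketch that builds variable names with paste0. The prefix var_ is arbitrary.

```r
for (i in 1:3) {
  assign(paste0("var_", i), i^2)   # creates var_1, var_2, var_3
}

var_2          # 4
get("var_3")   # 9
```

A named list is often easier to work with than many separate variables.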
do.call
A variant on get is do.call. This takes a function as the first argument, then a list containing the arguments for the function: do.call(<function>, <list of arguments to function>).
A common use for this is with functions that take a variable number of arguments, e.g., cbind, paste, where the arguments are created programmatically.
simple example,
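A minimal sketch with made-up arguments. Named list elements become named arguments, and the function can also be given as a character string.

```r
args_list <- list("a", "b", sep = "-")
do.call(paste, args_list)       # "a-b", same as paste("a", "b", sep = "-")

do.call("sum", list(1, 2, 3))   # 6
```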
arranging a list into a matrix
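A sketch, assuming a list of equal-length vectors that should become the rows of a matrix:

```r
rows <- list(a = 1:3, b = 4:6, c = 7:9)

# Equivalent to rbind(a = 1:3, b = 4:6, c = 7:9), for any number of rows
mat <- do.call(rbind, rows)
dim(mat)        # 3 3
rownames(mat)   # "a" "b" "c"
mat["b", 3]     # 6
```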
There is the <<- operator, which is used in functions and does (re)assignment outside the function. It searches the parent environments and reassigns where found. If not found, it assigns in the global environment.
This is generally considered to be a bad idea, but now you know about it.
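A minimal sketch of what the operator does, with a hypothetical counter:

```r
counter <- 0

increment <- function() {
  counter <<- counter + 1   # modifies counter outside the function
  invisible(counter)
}

increment()
increment()
counter   # 2
```

The hidden side effect is what makes this hard to reason about. Returning the new value and assigning it at the call site is usually clearer.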
Functions that call themselves are possible.
As with repeat loops, they need to have a break condition.
These are actually useful when working with nested lists and directed acyclic graphs, for example.
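A sketch of a recursive function over a nested list. The function count_leaves and the example list are hypothetical.

```r
count_leaves <- function(x) {
  if (!is.list(x)) return(1L)   # break condition: not a list, so a leaf
  # Otherwise call the function again on every element and add up
  sum(vapply(x, count_leaves, integer(1)))
}

nested <- list(1, list(2, list(3, 4)), 5)
count_leaves(nested)   # 5
```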