split(penguins$body_mass_g, list(penguins$species, penguins$sex)) |>
lapply(FUN = mean_sd)
Functions – exercises
Day 2, B
Understanding how to create and use your own functions
Learning objectives
In this lesson you will
- Learn how the
apply
family of functions works, and the alternatives usingdplyr
anddata.table
- Practice writing and reusing your own functions
Iterating over data with functions
- Load the Palmer penguins dataset.
- Write a function to compute the mean and standard deviation of a numeric vector. We will apply this function to the numeric variables in
penguins
, and also by different subgroups
Stop and think
- What are the components of a function?
- Do I have to worry about missing data? How can I deal with it?
- What sort of data structure should I return?
Using the function
- Use your function to compute the mean and sd of all the numeric variables in
penguins
. - Use your function to compute the mean and sd of body mass by species and sex
- Try using one of the apply functions
Hints
sapply
instead of lapply
. What about tapply
?
- Try using
dplyr
: check out the functionsgroup_by
andsummarize
Hints
library(dplyr)
|> group_by(species, sex) |>
penguins summarize(mean_sd(body_mass_g))
- Try using
data.table
: use the.by
argument in the[
Hints
library(data.table)
<- data.table(penguins)
pengdt
mean_sd(body_mass_g), by = list(species, sex)]
pengdt[, ## . can be used as shorthand for list in data table
mean_sd(body_mass_g), by = .(species, sex)] pengdt[,
Classes and custom generics
Now that you have some functions to do something interesting, let’s create a “class” to indicate that the object has a specific meaning.
- Modify your mean and sd function so that the data structure that is returned has class “meansd”. There are two ways to do this:
- Say the object you currently return is called
res
, instead ofres
, returnstructure(res, class = "meansd")
- Add the line
class(res) <- "meansd"
before returning `res`` - Use the
attr
function to create and assign additional information, for example the name of the variable, You can get the name of the object passed tox
usingdeparse1(substitute(x))
.
- Say the object you currently return is called
Hints
<- function(x, na.rm = TRUE) {
mean_sd2 <- c(mean = mean(x, na.rm = na.rm),
res sd = sd(x, na.rm = na.rm))
attr(res, "variable") <- deparse1(substitute(x))
class(res) <- "meansd"
res }
- Write a custom print function
print.meansd
that nicely prints the mean and standard deviation. Use the functionsround
andpaste
functions to create a string, then print it out using thecat
function.
Hints
<- function(x, digits = 2) {
print.meansd
<- paste0(round(x["mean"], digits = digits), " (",
msd round(x["sd"], digits = digits), ")")
cat("mean (sd) of ",
attr(x, "variable"), ":",
"\n")
msd,
}
More functions – conditional calculations
- Write a function that allows the user to choose between the mean and standard deviation, or the median and interquartile range.
- What can you use as the default argument to allow the switching? Try using
match.arg
. - In your function, before doing any calculations, add a check that the data supplied by the user is numeric. Include an informative error message.
- Modify your function so that it does a calculation to decide whether the mean (sd) or median (IQR) is used (e.g., check the skewness). How can you communicate to the user whether the result is the mean or median?
More functions – classes
- Write another function that constructs a one-sample t-statistic from an estimated mean and standard deviation. Recall that the t-statistic to test the null hypothesis that \(\mu = \mu_0\) is \[ T = \frac{\overline{X} - \mu_0}{\hat{\sigma}/\sqrt{n}} \] where \(\overline{X}\) is the sample mean and \(\hat{\sigma}\) is the sample standard deviation and \(n\) is the sample size.
- Write another function that takes the t-statistic and calculates a p-value
- Compose your custom functions in order to test the null hypothesis that the mean body mass of penguins is 4000g. Try using the pipe operator
|>
.
If you have time or on your own
- Expand your class to include confidence interval and p-value calculation/printing. Check out the
scales::pvalue
function. - Look at the
t.test
function. What type of object does this return? - Look at the print method for the class of the object returned by
t.test
, use the commandstats:::print.htest
to find the source. How does it work? How would you modify it? - Are there any other methods are available for that class? Use the
methods
function to find out. What would be another useful method for this class?