split(penguins$body_mass_g, list(penguins$species, penguins$sex)) |>
  lapply(FUN = mean_sd)
Functions – exercises
Day 2, B
Understanding how to create and use your own functions
Learning objectives
In this lesson you will
- Learn how the apply family of functions works, and the alternatives using dplyr and data.table
- Practice writing and reusing your own functions
Modifying the greeting function
- Start with the basic hello function from the lecture. Modify it so that users have the option of greeting in a different language (e.g., choosing English or Danish). Provide defaults for the new arguments and use match.arg inside the function body.
- Document the function using Quarto or roxygen2 comments
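One possible shape for the modified greeting is sketched below. The lecture's hello function is not shown here, so the argument names (name, language) and the greeting strings are assumptions for illustration only.

```r
#' Greet someone in English or Danish (hypothetical sketch)
#'
#' @param name Character string, who to greet.
#' @param language One of "English" or "Danish"; the first is the default.
#' @return A character string containing the greeting.
hello <- function(name = "world", language = c("English", "Danish")) {
  # match.arg() returns the first choice when the argument is left at its
  # default, and raises an error for anything outside the allowed choices
  language <- match.arg(language)
  greeting <- switch(language,
                     English = "Hello",
                     Danish  = "Hej")
  paste0(greeting, ", ", name, "!")
}

hello("Ada")
hello("Ada", language = "Danish")
```

Listing the allowed values as the default vector keeps the choices visible in the function signature, which is what match.arg relies on.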
Iterating over data with functions
- Write a function to compute the mean and standard deviation of a numeric vector. We will apply this function to the numeric variables in penguins, and also by different subgroups
- Document your function using Quarto or roxygen2 comments
Stop and think
- What are the components of a function?
- Do I have to worry about missing data? How can I deal with it?
- What sort of data structure should I return?
- Should I do any exception handling?
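As one way of answering these questions, here is a minimal sketch of such a function. The name mean_sd matches the hints in this exercise, but the na.rm default, the named-vector return value, and the error message are assumptions rather than the intended solution.

```r
#' Mean and standard deviation of a numeric vector (sketch)
#'
#' @param x A numeric vector.
#' @param na.rm Should missing values be dropped before computing?
#' @return A named numeric vector with elements "mean" and "sd".
mean_sd <- function(x, na.rm = TRUE) {
  # Basic exception handling: fail early with an informative message
  if (!is.numeric(x)) {
    stop("`x` must be numeric, but has class '", class(x)[1], "'")
  }
  c(mean = mean(x, na.rm = na.rm),
    sd   = sd(x, na.rm = na.rm))
}

mean_sd(c(1, 2, 3, NA))
```

A named vector is easy to combine with sapply; returning a one-row data frame instead would suit dplyr and data.table better, so the choice depends on how the function will be used.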
Using the function
- Use your function to compute the mean and sd of all the numeric variables in penguins.
- Use your function to compute the mean and sd of body mass by species and sex
- Try using one of the apply functions
Hints
Split the data by group and use lapply, or try sapply instead of lapply. What about tapply?
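The sketch below shows how sapply and tapply behave with a function that returns two values. To stay self-contained it uses a small made-up data frame (toy) in place of penguins and a bare-bones mean_sd; both are assumptions for illustration.

```r
# Made-up stand-in for penguins (hypothetical values)
toy <- data.frame(
  species           = c("A", "A", "B", "B"),
  body_mass_g       = c(3000, 3400, 5000, 5400),
  flipper_length_mm = c(180, 190, 210, 220)
)

mean_sd <- function(x, na.rm = TRUE) {
  c(mean = mean(x, na.rm = na.rm), sd = sd(x, na.rm = na.rm))
}

# Keep only the numeric columns, then apply the function to each one.
# sapply simplifies the two-element results into a matrix (one column per variable).
num_cols <- toy[sapply(toy, is.numeric)]
all_vars <- sapply(num_cols, mean_sd)
all_vars

# tapply groups a vector by one or more factors. Because mean_sd returns
# two values, the result is a list-like array with one element per group.
by_species <- tapply(toy$body_mass_g, toy$species, mean_sd)
by_species
```

Passing a list of factors as the second argument to tapply groups by several variables at once, much like the list used with split.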
- Try using dplyr: check out the functions group_by and summarize
Hints
library(dplyr)
penguins |>
  group_by(species, sex) |>
  summarize(mean_sd(body_mass_g))
- Try using data.table: use the by argument in [
Hints
library(data.table)
pengdt <- data.table(penguins)
pengdt[, mean_sd(body_mass_g), by = list(species, sex)]
## . can be used as shorthand for list in data.table
pengdt[, mean_sd(body_mass_g), by = .(species, sex)]
Classes and custom generics
Now that you have some functions to do something interesting, let’s create a “class” to indicate that the object has a specific meaning.
- Modify your mean and sd function so that the data structure that is returned has class “meansd”. There are two ways to do this:
  - Say the object you currently return is called res. Instead of res, return structure(res, class = "meansd")
  - Add the line class(res) <- "meansd" before returning res
- Use the attr function to create and assign additional information, for example the name of the variable. You can get the name of the object passed to x using deparse1(substitute(x)).
Hints
mean_sd2 <- function(x, na.rm = TRUE) {
  res <- c(mean = mean(x, na.rm = na.rm),
           sd = sd(x, na.rm = na.rm))
  attr(res, "variable") <- deparse1(substitute(x))
  class(res) <- "meansd"
  res
}
- Write a custom print function print.meansd that nicely prints the mean and standard deviation. Use the round and paste functions to create a string, then print it out using the cat function.
Hints
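print.meansd <- function(x, digits = 2, ...) {
  ## build the "mean (sd)" string from the rounded values
  msd <- paste0(round(x["mean"], digits = digits), " (",
                round(x["sd"], digits = digits), ")")
  ## cat separates its arguments with a space by default
  cat("mean (sd) of", attr(x, "variable"), ":", msd, "\n")
  ## print methods conventionally return their argument invisibly
  invisible(x)
}
More functions – conditional calculations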
- Write a function that allows the user to choose between the mean and standard deviation, or the median and interquartile range.
- What can you use as the default argument to allow the switching? Try using match.arg.
- In your function, before doing any calculations, add a check that the data supplied by the user is numeric. Include an informative error message.
- Modify your function so that it does a calculation to decide whether the mean (sd) or median (IQR) is used (e.g., check the skewness). How can you communicate to the user whether the result is the mean or median?
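The first three bullets could be combined along the following lines. The function name center_spread, the method labels, and the use of an attribute to record which summary was computed are all assumptions made for this sketch; the automatic choice based on skewness is left as part of the exercise.

```r
center_spread <- function(x, method = c("mean_sd", "median_iqr"), na.rm = TRUE) {
  # Check the input before doing any calculations
  if (!is.numeric(x)) {
    stop("`x` must be numeric, but has class '", class(x)[1], "'")
  }
  # The first element of the default vector is used when `method` is not given
  method <- match.arg(method)
  res <- switch(method,
    mean_sd    = c(center = mean(x, na.rm = na.rm),
                   spread = sd(x, na.rm = na.rm)),
    median_iqr = c(center = median(x, na.rm = na.rm),
                   spread = IQR(x, na.rm = na.rm))
  )
  # One way to tell the user which summary was computed
  attr(res, "method") <- method
  res
}

center_spread(c(1, 2, 3, 4, 100))
center_spread(c(1, 2, 3, 4, 100), method = "median_iqr")
```

Storing the method as an attribute means a custom print method could later display "mean (sd)" or "median (IQR)" as appropriate.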
More functions – classes
- Write another function that constructs a one-sample t-statistic from an estimated mean and standard deviation. Recall that the t-statistic to test the null hypothesis that \(\mu = \mu_0\) is \[ T = \frac{\overline{X} - \mu_0}{\hat{\sigma}/\sqrt{n}} \] where \(\overline{X}\) is the sample mean, \(\hat{\sigma}\) is the sample standard deviation, and \(n\) is the sample size.
- Write another function that takes the t-statistic and calculates a p-value
- Compose your custom functions in order to test the null hypothesis that the mean body mass of penguins is 4000g. Try using the pipe operator |>.
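A sketch of how the three pieces might fit together is given below. The names t_stat and p_value, the list returned by t_stat, and the two-sided alternative are assumptions; the data are made-up values rather than the penguins measurements.

```r
mean_sd <- function(x, na.rm = TRUE) {
  c(mean = mean(x, na.rm = na.rm), sd = sd(x, na.rm = na.rm))
}

# T = (xbar - mu0) / (sigma_hat / sqrt(n)), with n - 1 degrees of freedom
t_stat <- function(est, n, mu0 = 0) {
  list(statistic = (est[["mean"]] - mu0) / (est[["sd"]] / sqrt(n)),
       df = n - 1)
}

# Two-sided p-value from the t distribution
p_value <- function(tstat) {
  2 * pt(abs(tstat$statistic), df = tstat$df, lower.tail = FALSE)
}

# Made-up body masses in grams (hypothetical values)
x <- c(3900, 4100, 4250, 3800, 4300, 4050)

p <- mean_sd(x) |>
  t_stat(n = sum(!is.na(x)), mu0 = 4000) |>
  p_value()
p

# Cross-check against the built-in test
t.test(x, mu = 4000)$p.value
```

Carrying the degrees of freedom along with the statistic lets p_value work without needing the original data again.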
If you have time or on your own
- Expand your class to include confidence interval and p-value calculation/printing. Check out the scales::pvalue function.
- Look at the t.test function. What type of object does this return?
- Look at the print method for the class of the object returned by t.test. Use the command stats:::print.htest to find the source. How does it work? How would you modify it?
- Are any other methods available for that class? Use the methods function to find out. What would be another useful method for this class?
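A starting point for exploring the object returned by t.test might look like this; the data are made-up values used only so the example runs on its own.

```r
# Made-up body masses in grams (hypothetical values)
x <- c(3900, 4100, 4250, 3800, 4300, 4050)
tt <- t.test(x, mu = 4000)

class(tt)                 # the S3 class used for method dispatch
names(tt)                 # components stored in the underlying list
methods(class = "htest")  # methods registered for this class
```

Calling unclass(tt) shows the plain list behind the object, which helps when reading the source of the print method.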