Functions – exercises

Day 2, B

Understanding how to create and use your own functions
Author

Michael C Sachs

Learning objectives

In this lesson you will

  1. Learn how the apply family of functions works, and the alternatives using dplyr and data.table
  2. Practice writing and reusing your own functions

Iterating over data with functions

  1. Load the Palmer penguins dataset.
  2. Write a function to compute the mean and standard deviation of a numeric vector. We will apply this function to the numeric variables in penguins, and also by different subgroups

Stop and think

  1. What are the components of a function?
  2. Do I have to worry about missing data? How can I deal with it?
  3. What sort of data structure should I return?

Using the function

  1. Use your function to compute the mean and sd of all the numeric variables in penguins.
  2. Use your function to compute the mean and sd of body mass by species and sex
    1. Try using one of the apply functions
Hints
split(penguins$body_mass_g, list(penguins$species, penguins$sex)) |>
  lapply(FUN = mean_sd)
Try sapply instead of lapply. What about tapply?
  1. Try using dplyr: check out the functions group_by and summarize
Hints
library(dplyr)
penguins |> group_by(species, sex) |>
  summarize(mean_sd(body_mass_g))
  1. Try using data.table: use the .by argument in the [
Hints
library(data.table)
pengdt <- data.table(penguins)

pengdt[, mean_sd(body_mass_g), by = list(species, sex)]
## . can be used as shorthand for list in data table
pengdt[, mean_sd(body_mass_g), by = .(species, sex)]

Classes and custom generics

Now that you have some functions to do something interesting, let’s create a “class” to indicate that the object has a specific meaning.

  1. Modify your mean and sd function so that the data structure that is returned has class “meansd”. There are two ways to do this:
    1. Say the object you currently return is called res, instead of res, return structure(res, class = "meansd")
    2. Add the line class(res) <- "meansd" before returning `res``
    3. Use the attr function to create and assign additional information, for example the name of the variable, You can get the name of the object passed to x using deparse1(substitute(x)).
Hints
mean_sd2 <- function(x, na.rm = TRUE) {
  res <- c(mean = mean(x, na.rm = na.rm), 
           sd = sd(x, na.rm = na.rm))
  
  attr(res, "variable") <- deparse1(substitute(x))
  class(res) <- "meansd"
  res
}
  1. Write a custom print function print.meansd that nicely prints the mean and standard deviation. Use the functions round and paste functions to create a string, then print it out using the cat function.
Hints
print.meansd <- function(x, digits = 2) {
  
  msd <- paste0(round(x["mean"], digits = digits), " (", 
                round(x["sd"], digits = digits), ")")
  
  cat("mean (sd) of ", 
      attr(x, "variable"), ":", 
      msd, "\n")
  
}

More functions – conditional calculations

  1. Write a function that allows the user to choose between the mean and standard deviation, or the median and interquartile range.
  2. What can you use as the default argument to allow the switching? Try using match.arg.
  3. In your function, before doing any calculations, add a check that the data supplied by the user is numeric. Include an informative error message.
  4. Modify your function so that it does a calculation to decide whether the mean (sd) or median (IQR) is used (e.g., check the skewness). How can you communicate to the user whether the result is the mean or median?

More functions – classes

  1. Write another function that constructs a one-sample t-statistic from an estimated mean and standard deviation. Recall that the t-statistic to test the null hypothesis that \(\mu = \mu_0\) is \[ T = \frac{\overline{X} - \mu_0}{\hat{\sigma}/\sqrt{n}} \] where \(\overline{X}\) is the sample mean and \(\hat{\sigma}\) is the sample standard deviation and \(n\) is the sample size.
  2. Write another function that takes the t-statistic and calculates a p-value
  3. Compose your custom functions in order to test the null hypothesis that the mean body mass of penguins is 4000g. Try using the pipe operator |>.

If you have time or on your own

  1. Expand your class to include confidence interval and p-value calculation/printing. Check out the scales::pvalue function.
  2. Look at the t.test function. What type of object does this return?
  3. Look at the print method for the class of the object returned by t.test, use the command stats:::print.htest to find the source. How does it work? How would you modify it?
  4. Are there any other methods are available for that class? Use the methods function to find out. What would be another useful method for this class?