Write a function to compute the mean and standard deviation of a numeric vector. We will apply this function to the numeric variables in penguins, and also by different subgroups
library(palmerpenguins)
Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':
penguins, penguins_raw
$Adelie.female
mean sd
3368.8356 269.3801
$Chinstrap.female
mean sd
3527.2059 285.3339
$Gentoo.female
mean sd
4679.7414 281.5783
$Adelie.male
mean sd
4043.4932 346.8116
$Chinstrap.male
mean sd
3938.9706 362.1376
$Gentoo.male
mean sd
5484.8361 313.1586
# sapply converts the result to a matrixsplit(penguins$body_mass_g, list(penguins$species, penguins$sex)) |>sapply(FUN = mean_sd1)
# tapply has a different syntax and each element of the matrix is a listtapres <-tapply(penguins$body_mass_g, list(penguins$species, penguins$sex), mean_sd1)tapres
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'species', 'sex'. You can override using
the `.groups` argument.
# A tibble: 16 × 3
# Groups: species, sex [8]
species sex `mean_sd1(body_mass_g)`
<fct> <fct> <dbl>
1 Adelie female 3369.
2 Adelie female 269.
3 Adelie male 4043.
4 Adelie male 347.
5 Adelie <NA> 3540
6 Adelie <NA> 477.
7 Chinstrap female 3527.
8 Chinstrap female 285.
9 Chinstrap male 3939.
10 Chinstrap male 362.
11 Gentoo female 4680.
12 Gentoo female 282.
13 Gentoo male 5485.
14 Gentoo male 313.
15 Gentoo <NA> 4588.
16 Gentoo <NA> 338.
library(data.table)
Attaching package: 'data.table'
The following objects are masked from 'package:dplyr':
between, first, last
pengdt <-data.table(penguins)pengdt[, mean_sd1(body_mass_g), by =list(species, sex)]
species sex V1
<fctr> <fctr> <num>
1: Adelie male 4043.4932
2: Adelie male 346.8116
3: Adelie female 3368.8356
4: Adelie female 269.3801
5: Adelie <NA> 3540.0000
6: Adelie <NA> 477.1661
7: Gentoo female 4679.7414
8: Gentoo female 281.5783
9: Gentoo male 5484.8361
10: Gentoo male 313.1586
11: Gentoo <NA> 4587.5000
12: Gentoo <NA> 338.1937
13: Chinstrap female 3527.2059
14: Chinstrap female 285.3339
15: Chinstrap male 3938.9706
16: Chinstrap male 362.1376
## . can be used as shorthand for list in data tablepengdt[, mean_sd1(body_mass_g), by = .(species, sex)]
species sex V1
<fctr> <fctr> <num>
1: Adelie male 4043.4932
2: Adelie male 346.8116
3: Adelie female 3368.8356
4: Adelie female 269.3801
5: Adelie <NA> 3540.0000
6: Adelie <NA> 477.1661
7: Gentoo female 4679.7414
8: Gentoo female 281.5783
9: Gentoo male 5484.8361
10: Gentoo male 313.1586
11: Gentoo <NA> 4587.5000
12: Gentoo <NA> 338.1937
13: Chinstrap female 3527.2059
14: Chinstrap female 285.3339
15: Chinstrap male 3938.9706
16: Chinstrap male 362.1376
Classes and custom generics
Modify your mean and sd function so that the data structure that is returned has class “meansd”. There are two ways to do this:
Say the object you currently return is called res, instead of res, return structure(res, class = "meansd")
Add the line class(res) <- "meansd" before returning `res``
Use the attr function to create and assign additional information, for example the name of the variable, You can get the name of the object passed to x using deparse1(substitute(x)).
Write a custom print function print.meansd that nicely prints the mean and standard deviation. Use the functions round and paste functions to create a string, then print it out using the cat function.
mean (sd) of penguins$body_mass_g : 4201.75 (801.95)
More functions
Write another function that constructs a one-sample t-statistic from an estimated mean and standard deviation. Recall that the t-statistic to test the null hypothesis that \(\mu = \mu_0\) is \[
T = \frac{\overline{X} - \mu_0}{\hat{\sigma}/\sqrt{n}}
\] where \(\overline{X}\) is the sample mean and \(\hat{\sigma}\) is the sample standard deviation and \(n\) is the sample size.
Write another function that takes the t-statistic and calculates a p-value
Compose your custom functions in order to test the null hypothesis that the mean body mass of penguins is 4000g. Try using the pipe operator |>.
t_stat <-function(x, mu_0 =0) {## x should be of class "meansd"stopifnot(inherits(x, "meansd")) Tstat <-unname((x["mean"] - mu_0) / (x["sd"] /sqrt(attr(x, "sampsize"))))attr(Tstat, "df") <-attr(x, "sampsize") -1 Tstat}p_value <-function(x) {2*pt(-abs(x), df =attr(x, "df"))}mean_sd2(penguins$body_mass_g) |>t_stat(mu_0 =4000) |>p_value()