```r
library(xactonomial)
library(parallel)
```
Let $T_j$ be distributed $\text{Multinomial}_{d_j}(\boldsymbol{\theta}_j, n_j)$ for $j = 1, \ldots, k$ and denote $\boldsymbol{T} = (T_1, \ldots, T_k)$ and $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_k)$. The subscript $d_j$ denotes the dimension of the multinomial. Suppose one is interested in the parameter $\psi(\boldsymbol{\theta}) \in \Psi \subseteq \mathbb{R}$. Given a sample of size $n$ from $\boldsymbol{T}$, one can estimate $\boldsymbol{\theta}$ with the sample proportions as $\hat{\boldsymbol{\theta}} = \boldsymbol{T} / n$ and hence $\hat{\psi} = \psi(\hat{\boldsymbol{\theta}})$. It would then be natural to construct a $100(1 - \alpha)$ percent confidence interval for $\psi(\boldsymbol{\theta})$ and calculate a p value for a test of the null hypothesis $H_0: \psi(\boldsymbol{\theta}) = \psi_0$.
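For concreteness, the plug-in estimator is nothing more than the vector of sample proportions; here is a minimal sketch with made-up counts:

```r
# Hypothetical counts from a single 4-cell multinomial with n = 10
T1 <- c(3, 1, 4, 2)
theta_hat <- T1 / sum(T1)  # sample proportions estimate theta
theta_hat
#> [1] 0.3 0.1 0.4 0.2
```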
If the functional $\psi$ is complex, or the variance of $\hat{\psi}$ is otherwise difficult to calculate, then one may consider using the nonparametric bootstrap for inference (Efron 1979). However, the bootstrap may perform poorly in small samples, e.g., failing to achieve the nominal coverage or type I error rate (Bickel and Freedman 1981). Instead, we propose a computational approach for constructing p values and confidence intervals in this setting.
Here is a simple illustration of the problem. Let $\psi(\boldsymbol{\theta}) = \sum_{i=1}^{4} \sqrt{\theta_{1i} \theta_{2i}}$, which is similar to the Bhattacharyya coefficient (Bhattacharyya 1946). Suppose we have two samples $\boldsymbol{T}_1$ and $\boldsymbol{T}_2$, each from independent multinomials with dimension 4. The following R code allows us to sample the data and compute $\hat{\psi}$.
```r
# True parameter values: two groups of 4 cell probabilities each
true_theta <- c(.45, .15, .3, .1, .05, .15, .4, .4)

# Draw one multinomial sample of size n[j] for each group
sample_data <- function(n) {
  T1 <- rmultinom(1, n[1], prob = true_theta[1:4])
  T2 <- rmultinom(1, n[2], prob = true_theta[5:8])
  list(T1 = c(T1), T2 = c(T2))
}

# The functional of interest: sum of sqrt(theta_1i * theta_2i)
psi_bc <- function(theta) {
  theta1 <- theta[1:4]
  theta2 <- theta[5:8]
  sum(sqrt(theta1 * theta2))
}

true_psi_bc <- psi_bc(true_theta)
```
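As a quick sanity check, one can draw a dataset and compute the plug-in estimate of $\psi$; a usage sketch (the result varies from draw to draw, so no output is shown):

```r
# Draw one dataset with 20 observations per group and plug the
# sample proportions into psi_bc
dat <- sample_data(c(20, 20))
psi_bc(unlist(lapply(dat, \(x) x / sum(x))))
```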
How bad is the bootstrap in this situation with different sample sizes?
```r
# One simulation replicate: draw data, compute 500 bootstrap replicates
# of psi, form the percentile interval, and check whether it covers the truth
run_one_bs <- function(i, n, psi, true_psi) {
  data <- sample_data(n)
  bsamps <- replicate(500, {
    unlist(lapply(data, \(x) {
      # resample the observations with replacement; equivalent to
      # newx <- c(rmultinom(1, sum(x), prob = x / sum(x)))
      newx <- table(factor(sample(rep.int(1:length(x), times = x), sum(x), replace = TRUE),
                           levels = 1:length(x)))
      newx / sum(newx)
    })) |> psi()
  })
  # percentile bootstrap 95% confidence interval
  cib <- quantile(bsamps, c(.025, .975))
  true_psi >= cib[1] & true_psi <= cib[2]
}
```
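A single call returns a logical indicating whether the percentile interval covered the truth; a usage sketch (the first argument is only a replicate index):

```r
# One simulation replicate at n = 10 per group; TRUE if the bootstrap
# interval covers the true Bhattacharyya-type parameter
run_one_bs(1, n = c(10, 10), psi = psi_bc, true_psi = true_psi_bc)
```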
```r
# Candidate per-group sample sizes: (10, 10), (15, 15), ..., (30, 30)
cand_ns <- do.call(rbind, lapply(seq(10, 30, by = 5), \(i) c(i, i)))
emp_coverage <- rep(NA, nrow(cand_ns))
for (i in 1:nrow(cand_ns)) {
  cover_boot <- mclapply(1:1000, run_one_bs, n = cand_ns[i, ], psi = psi_bc,
                         true_psi = true_psi_bc, mc.cores = detectCores() - 4)
  emp_coverage[i] <- mean(unlist(cover_boot))
}
knitr::kable(cbind("n in each group" = cand_ns[, 1],
                   "bootstrap 95% interval coverage" = emp_coverage))
```
| n in each group | bootstrap 95% interval coverage |
|---|---|
| 10 | 0.529 |
| 15 | 0.622 |
| 20 | 0.703 |
| 25 | 0.711 |
| 30 | 0.764 |
That does not look too good. Let us see how our method performs.
```r
# Repeat 200 times: sample data with n = 10 per group and compute the
# confidence interval with xactonomial
xmres <- mclapply(1:200, \(i) {
  system.time({
    data <- sample_data(n = c(10, 10))
    results <- xactonomial(psi_bc, data, alpha = .05, psi_limits = c(0, 1),
                           maxit = 100, chunksize = 100)
  })
  results$conf.int
}, mc.cores = detectCores() - 2)
```
```r
coverage_exact <- sapply(xmres, \(cint) {
  true_psi_bc >= cint[1] & true_psi_bc <= cint[2]
})
mean(coverage_exact)
#> [1] 0.985
```
Much better!
## Maximum of parameters
When two of the parameters are equal, the function $\psi(\boldsymbol{\theta}) = \max_j \theta_j$ is nondifferentiable at that point, so the bootstrap and other asymptotic methods fail, while our method still works.
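To see the kink numerically, compare one-sided difference quotients of the maximum at $\boldsymbol{\theta} = (0.4, 0.4, 0.2)$; this small illustration (not part of the simulation) shows the right and left derivatives in the first coordinate disagreeing:

```r
eps <- 1e-4
# right derivative in theta_1: the maximum increases one-for-one
(max(.4 + eps, .4, .2) - max(.4, .4, .2)) / eps
#> [1] 1
# left derivative in theta_1: the maximum stays at theta_2 = 0.4
(max(.4 - eps, .4, .2) - max(.4, .4, .2)) / eps
#> [1] 0
```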
```r
# The functional of interest: the maximum cell probability
psi_causal <- function(pp) {
  max(pp)
}

true_psi_causal <- psi_causal(c(.4, .4, .2))
```
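Note that `run_one_bs` calls `sample_data`, which was defined above for the two-group Bhattacharyya example and does not match the trinomial parameter $(0.4, 0.4, 0.2)$ used here. The simulations below therefore assume a sampler along the following lines; this is our reconstruction, and the only essential feature is that the cell probabilities are $(0.4, 0.4, 0.2)$:

```r
# Assumed sampler for this example: a single trinomial with
# theta = (0.4, 0.4, 0.2); only the first element of n is used
sample_data <- function(n) {
  list(T1 = c(rmultinom(1, n[1], prob = c(.4, .4, .2))))
}
```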
```r
cover_boot <- mclapply(1:200, run_one_bs, n = c(10, 10, 10), psi = psi_causal,
                       true_psi = true_psi_causal, mc.cores = detectCores() - 4)
mean(unlist(cover_boot))
#> [1] 0.44
```
```r
xmres <- mclapply(1:200, \(i) {
  system.time({
    data <- sample_data(n = c(10, 10, 10))
    results <- xactonomial(psi_causal, data, alpha = .05, psi_limits = c(0, 1),
                           maxit = 100, chunksize = 100)
  })
  results$conf.int
}, mc.cores = detectCores() - 2)

coverage_exact <- sapply(xmres, \(cint) {
  true_psi_causal >= cint[1] & true_psi_causal <= cint[2]
})
mean(coverage_exact)
#> [1] 0.975
```