We consider the following problem. Let $T_j$ be distributed $\mbox{Multinomial}(n_j, \theta_j)$ for $j = 1, \ldots, k$, where $n_j$ is the number of trials, $d_j$ the number of mutually exclusive categories, and $\theta_j$ the vector of $d_j$ probabilities. Denote $T = (T_1, \ldots, T_k)$ and $\theta = (\theta_1, \ldots, \theta_k)$. Suppose one is interested in the parameter $\psi(\theta): \Theta \mapsto \Psi$, where $\Psi$ is a subset of the real line $\mathbb{R}$. Suppose we observe a sample $X$, which is a vector of counts corresponding to the random variable $T$. Let $t(X)$ denote a real-valued statistic that defines an ordering of the sample space. A sensible default statistic in this setting is to apply the function $\psi$ defining the parameter of interest to the sample proportions.
If the function $\psi$ is complex, or the variance of the plug-in estimator $\psi(\hat\theta)$ is otherwise difficult to calculate, then one may consider using the nonparametric bootstrap for inference (Efron 1979). However, the bootstrap may perform poorly in small samples, e.g., failing to achieve the correct coverage or type I error (Bickel and Freedman 1981). Instead we propose a computational approach for computing p-values and confidence intervals in this setting.
Here is a simple illustration of the problem. Let $\psi(\theta) = \sum_{i=1}^{4} \sqrt{\theta_{1i}\,\theta_{2i}}$, which is similar to the Bhattacharyya coefficient (Bhattacharyya 1946). Suppose we have two samples $T_1$ and $T_2$, each from independent multinomials of dimension 4. The following R code allows us to sample the data and compute $\psi$.
true_theta <- c(.45, .15, .3, .1, .05, .15, .4, .4)
sample_data <- function(n) {
  # draw the two independent multinomial samples of sizes n[1] and n[2]
  T1 <- rmultinom(1, n[1], prob = true_theta[1:4])
  T2 <- rmultinom(1, n[2], prob = true_theta[5:8])
  list(T1 = c(T1), T2 = c(T2))
}
psi_bc <- function(theta) {
  # Bhattacharyya-type coefficient between the two probability vectors
  theta1 <- theta[1:4]
  theta2 <- theta[5:8]
  sum(sqrt(theta1 * theta2))
}
psi_bc(true_theta)
#> [1] 0.8464102
We can apply the bootstrap here, but it actually performs quite poorly in this situation. Next we define the same $\psi$ function, this time vectorized so that it takes a matrix with one parameter vector per row; supplying a vectorized function improves the speed of `xactonomial`.
psi_bc_v <- function(theta) {
  # vectorized version: theta is a matrix with one parameter vector per row
  theta1 <- theta[, 1:4, drop = FALSE]
  theta2 <- theta[, 5:8, drop = FALSE]
  rowSums(sqrt(theta1 * theta2))
}
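As a quick sanity check, the vectorized version agrees with the scalar version when the true parameter vector is passed as a one-row matrix:
psi_bc_v(matrix(true_theta, nrow = 1))
#> [1] 0.8464102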
data <- sample_data(n = c(10, 12))
data
#> $T1
#> [1] 2 3 4 1
#>
#> $T2
#> [1] 0 0 6 6
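The default ordering statistic applies `psi_bc` to the sample proportions, which for these data equals the sample estimate reported by `xactonomial` below:
psi_bc(c(data$T1 / sum(data$T1), data$T2 / sum(data$T2)))
#> [1] 0.6708204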
results <- xactonomial(data, psi_bc_v, psi_limits = c(0, 1), conf_int = TRUE, psi0 = .5,
                       maxit = 50, chunksize = 100, ga_restart_every = 10, itp_eps = .01)
results
#>
#> Monte-Carlo multinomial test
#>
#> data: data
#> p-value = 0.07305
#> alternative hypothesis: true psi0 is not equal to 0.5
#> 95 percent confidence interval:
#> 0.4850000 0.9939617
#> sample estimates:
#> [1] 0.6708204
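For comparison, here is a minimal sketch of the nonparametric bootstrap percentile interval mentioned earlier, resampling each multinomial sample from its observed proportions (the number of resamples `B` is an arbitrary illustrative choice):
B <- 2000
boot_psi <- replicate(B, {
  # resample each group from its observed proportions
  T1b <- c(rmultinom(1, sum(data$T1), prob = data$T1 / sum(data$T1)))
  T2b <- c(rmultinom(1, sum(data$T2), prob = data$T2 / sum(data$T2)))
  psi_bc(c(T1b / sum(T1b), T2b / sum(T2b)))
})
quantile(boot_psi, c(0.025, 0.975))
Because two cells of T2 were observed as zero, the resampled proportions can never place mass on those cells, which is one reason the bootstrap interval can behave poorly with small multinomial samples.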
Maximum of parameters
In this example, $\psi(\theta) = \max_j \theta_j$, and the true parameter vector is $(0.4, 0.4, 0.2)$, so the true value of $\psi$ is 0.4. When two of the parameters are equal, the function is nondifferentiable, so the bootstrap and other asymptotic methods will fail, while our method still works.
psi_causal <- function(pp) {
  max(pp)
}
true_psi_causal <- psi_causal(c(.4, .4, .2))
sample_data2 <- function(n) {
  # a single multinomial sample of size n with true probabilities (.4, .4, .2)
  list(rmultinom(1, n, prob = c(.4, .4, .2))[, 1])
}
set.seed(421)
data <- sample_data2(n = c(60))
data
#> [[1]]
#> [1] 27 25 8
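As before, the default ordering statistic is $\psi$ evaluated at the sample proportions, which matches the sample estimate reported in the output below:
psi_causal(data[[1]] / sum(data[[1]]))
#> [1] 0.45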
results <- xactonomial(data, psi_causal, psi_limits = c(1/3, 1), psi0 = .55,
                       maxit = 100, chunksize = 1000, itp_eps = .01)
results
#>
#> Monte-Carlo multinomial test
#>
#> data: data
#> p-value = 0.1539
#> alternative hypothesis: true psi0 is not equal to 0.55
#> 95 percent confidence interval:
#> 0.3352706 0.5883333
#> sample estimates:
#> [1] 0.45
In this example, we see that the lower confidence limit is close to the boundary of the psi space. It is possible that the lower limit should be equal to the boundary, but the root-finding algorithm was too eager. We can improve performance by providing the argument `p_value_limits`, which should be a vector of 2 p-values, the first being for the test of the null at the lower boundary of psi and the second at the upper boundary. If those p-values are greater than $\alpha / 2$, then no root exists and the confidence limit should instead be the boundary.
First we calculate the exact p-value for the test of `psi0 <= 1/3` by providing `theta_null_points`. The upper p-value is clearly less than $\alpha / 2$.
p.ll <- xactonomial(data, psi_causal, psi_limits = c(1/3, 1), psi0 = 1/3, conf_int = FALSE,
                    itp_eps = .01, theta_null_points = t(rep(1/3, 3)),
                    alternative = "greater")$p.value
xactonomial(data, psi_causal, psi_limits = c(1/3, 1), psi0 = 1, conf_int = FALSE,
            itp_eps = .01, theta_null_points = rbind(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1)),
            alternative = "less")$p.value
#> [1] 0
Then we provide the vector of those 2 p-values to the function via `p_value_limits`. They do not need to be precise; the important thing is whether one of them is greater than $\alpha / 2$.
xactonomial(data, psi_causal, psi_limits = c(1/3, 1), p_value_limits = c(p.ll, 1e-8),
            maxit = 100, chunksize = 1000, itp_eps = .01)
#>
#> Monte-Carlo multinomial test
#>
#> data: data
#> p-value = NA
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#> 0.3333333 0.5883333
#> sample estimates:
#> [1] 0.45
Now you can see we get exactly 1/3 for the lower confidence limit.