Michael Sachs
In Theory I, and maybe before, you studied finite-dimensional parameters and estimators of them.
Consider a random sample \(X_1, X_2, \ldots, X_n\) from some unspecified distribution \(F\) and an estimator:
\[ \hat{\mu}_n = T(X_1, \ldots, X_n), \]
where \(T\) is a mapping from \(\mathbb{R}^n\) to \(\mathbb{R}\), i.e., a function from the sample to the real line.
You studied the asymptotic behavior of \(\hat{\mu}_n\), i.e., what happens as \(n \rightarrow \infty\). Why asymptotics?
How do we prove such things for specific statistics?
Once some basic convergence results have been established, a number of theorems serve as building blocks for more complex results, for example the continuous mapping theorem, Slutsky's theorem, and the delta method; these and many more results can go a long way.
We will often write \(\int_\Omega \, dF_X\) to mean the integral with respect to the distribution \(F_X\) and it equals \(P(X \in \Omega)\) where \(X\) has distribution \(F_X\).
Likewise \(\int f\, dF_X\) means the expectation of \(f(X)\) or \(E[f(X)]\) where \(X \sim F_X\).
When applied to the empirical measure \(\mathbb{F}_n\) for a sample \(X_1, \ldots, X_n\) from \(F_X\), then \(\int f \, d\mathbb{F}_n = \frac{1}{n}\sum_{i = 1}^nf(X_i)\).
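As a quick sanity check in code, here is a minimal numpy sketch of that identity (the exponential distribution and the choice of \(f\) are just illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # sample X_1, ..., X_n from F_X

# The integral of f with respect to the empirical measure F_n
# is just the sample average of f(X_i).
f = lambda t: t ** 2
print(np.mean(f(x)))  # int f dF_n

# Compare with int f dF_X = E[X^2] = Var(X) + (EX)^2 = 4 + 4 = 8
```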
\(F(x-)\) denotes the left-hand limit \(\lim_{t \uparrow x} F(t)\), and \(F(x+)\) the right-hand limit.
\(dF(t)\) denotes the increment \(F(t + dt) - F(t)\), and in the discrete case \(\Delta A_i\) denotes the increment \(A_{i+1} - A_i\).
A stochastic process is a collection of random variables \(\{X(t): t \in T\}\) indexed by a set \(T\).
An empirical process is a stochastic process computed from data.
For example, consider our random sample \(X_1, \ldots, X_n\) from distribution \(F\). The empirical cdf is \[ \mathbb{F}_n(t) = \frac{1}{n}\sum_{i = 1}^n I\{X_i \leq t\}. \] This is a random function, whose realizations are called sample paths. We can view it as the collection of random variables \(\{\mathbb{F}_n(t): t \in \mathbb{R}\}\).
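A minimal numpy sketch of \(\mathbb{F}_n\) evaluated along a grid, viewing the result as one sample path (the normal distribution and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)  # a random sample from F = N(0, 1)

def ecdf(x, t):
    """F_n(t) = (1/n) sum_i I{X_i <= t}, vectorized over a grid of t values."""
    return np.mean(x[:, None] <= np.asarray(t)[None, :], axis=0)

grid = np.linspace(-3, 3, 7)
print(ecdf(x, grid))  # one realization of the random function F_n
```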
Our statistic \(T(X_1, \ldots, X_n)\) is now a mapping from \(\mathbb{R}^n\) to the space of bounded increasing functions.
How can we analyze the behavior of \(\mathbb{F}_n(t)\)?
For each fixed \(t \in \mathbb{R}\), the law of large numbers tells us that
\(\mathbb{F}_n(t) \rightarrow_p F(t)\),
and the central limit theorem tells us that
\(\sqrt{n}(\mathbb{F}_n(t) - F(t)) \rightarrow_d N(0, F(t)(1 - F(t)))\).
For a finite collection of time points \(T_k = \{t_1, \ldots, t_k\}\), the random vector \(\sqrt{n}\left(\mathbb{F}_n(t_1) - F(t_1), \ldots, \mathbb{F}_n(t_k) - F(t_k)\right)\) converges in distribution to a mean-zero Gaussian vector with covariance \(F(s \wedge t) - F(s) F(t)\) for \(s, t \in T_k\), where \(s \wedge t = \min\{s, t\}\).
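These limits are easy to check by simulation; here is a sketch assuming Uniform(0,1) data, so that \(F(t) = t\) and the limiting covariance is \(s \wedge t - st\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 10_000
s, t = 0.3, 0.7  # two fixed time points

u = rng.uniform(size=(reps, n))
zs = np.sqrt(n) * (np.mean(u <= s, axis=1) - s)  # sqrt(n)(F_n(s) - F(s))
zt = np.sqrt(n) * (np.mean(u <= t, axis=1) - t)

print(np.var(zs), s * (1 - s))                  # both ~ 0.21
print(np.cov(zs, zt)[0, 1], min(s, t) - s * t)  # both ~ 0.09
```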
Thinking about the random function, what other modes of convergence can we consider?
Glivenko and Cantelli proved in 1933 that
\[ \sup_t|\mathbb{F}_n(t) - F(t)| \rightarrow_{as} 0, \] as \(n \rightarrow \infty\). This is called uniform convergence of the sample paths.
This is also written as
\[ \| \mathbb{F}_n - F \|_\infty \rightarrow_{as} 0, \] and this highlights the fact that the process point of view uses different notions of distance, since we are now in a functional space rather than Euclidean space. We will see later that this requires different notions of differentiability.
For any \(\varepsilon > 0\), there exists a finite partition \(-\infty = t_0 < t_1 < \cdots < t_m = \infty\) such that \(F(t_j) - F(t_{j-1}) < \varepsilon\) for all \(j\) (assume for simplicity that \(F\) is continuous). For \(t \in (t_{j-1}, t_j]\), monotonicity of \(\mathbb{F}_n\) and \(F\) gives
\(\mathbb{F}_n(t) - F(t) \leq \mathbb{F}_n(t_j) - F(t_{j-1})\) \(= \mathbb{F}_n(t_j) - F(t_j) + F(t_j) - F(t_{j-1}) \leq \mathbb{F}_n(t_j) - F(t_{j}) + \varepsilon\),
and similarly
\(\mathbb{F}_n(t) - F(t) \geq \mathbb{F}_n(t_{j-1}) - F(t_{j-1}) - \varepsilon\). Together these imply \[ \|\mathbb{F}_n - F \|_\infty \leq \max_{j}|\mathbb{F}_n(t_j) - F(t_j)| + \varepsilon. \] The first term converges to 0 almost surely by the strong law of large numbers applied at the finitely many points \(t_j\), and thus, almost surely,
\[ \limsup_{n \to \infty}\| \mathbb{F}_n - F \|_\infty \leq \varepsilon. \] Since \(\varepsilon\) was arbitrary, the result follows.
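We can also watch the uniform convergence happen numerically. A sketch with Uniform(0,1) data, exploiting the fact that for a given sample the supremum is attained at one of the jumps of \(\mathbb{F}_n\), i.e., at an order statistic:

```python
import numpy as np

rng = np.random.default_rng(3)

def sup_distance(x):
    """||F_n - F||_inf for Uniform(0,1) data (F(t) = t); the supremum is
    attained at one of the jump points of F_n."""
    x = np.sort(x)
    n = len(x)
    i = np.arange(1, n + 1)
    return np.max(np.maximum(i / n - x, x - (i - 1) / n))

for n in [100, 1_000, 10_000, 100_000]:
    print(n, sup_distance(rng.uniform(size=n)))  # shrinks toward 0
```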
Given the classical finite dimensional result, we might surmise that \(\sqrt{n}(\mathbb{F}_n - F)\) converges to some sort of Gaussian process with mean 0 and covariance \(F(s \wedge t) - F(t)F(s)\).
Donsker proved just that in 1952, and the result was made more general and rigorous by Skorokhod and Kolmogorov in 1956.
Specifically, \(\sqrt{n}(\mathbb{F}_n - F) \Rightarrow G\) in the space \(D[-\infty, \infty]\) of càdlàg functions (right-continuous with left-hand limits) on \(\mathbb{R}\), equipped with the uniform norm \(\|\cdot\|_\infty\), where \(G\) is an element of that space satisfying \(E(G(t)) = 0\) and \(Cov(G(s), G(t)) = F(s \wedge t) - F(s)F(t)\) for every \(s, t \in \mathbb{R}\).
To be complete, we would also specify that \(G\) is tight, meaning that for every \(\varepsilon > 0\), there exists a compact set \(K \subset D[-\infty, \infty]\) such that \(P(G \notin K) < \varepsilon\). This is a fancy way of saying that the sample paths of \(G\) do not explode to infinity, and it holds because the mean and variance functions are bounded.
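A simulation sketch of the weak convergence, again assuming uniform data: by the continuous mapping theorem, \(\sqrt{n}\|\mathbb{F}_n - F\|_\infty \rightarrow_d \sup_t|G(t)|\), so the scaled sup distance should stabilize in distribution rather than vanish (for uniform \(F\), the limit \(G\) is the Brownian bridge introduced below):

```python
import numpy as np

rng = np.random.default_rng(4)

def scaled_sup(n, reps):
    """sqrt(n) * ||F_n - F||_inf for Uniform(0,1) samples, `reps` replications."""
    u = np.sort(rng.uniform(size=(reps, n)), axis=1)
    i = np.arange(1, n + 1)
    d = np.maximum(i / n - u, u - (i - 1) / n).max(axis=1)
    return np.sqrt(n) * d

# In contrast to Glivenko-Cantelli, the *scaled* sup distance does not vanish:
# its distribution converges to that of sup_t |G(t)| (the Kolmogorov
# distribution, whose median is roughly 0.83).
for n in [100, 1_000, 10_000]:
    print(n, np.median(scaled_sup(n, 1_000)))
```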
It is useful to consider the case where \(F(t) = t\), i.e., the random variables are uniform on \([0, 1]\). The Gaussian process \(\{\mathbb{U}(t), t \in [0, 1]\}\) with mean function \(E(\mathbb{U}(t)) = 0\) and \[ Cov(\mathbb{U}(s), \mathbb{U}(t)) = s \wedge t - st \] is called a Brownian bridge.
If a Gaussian process \(\mathbb{S}\) has mean 0 and covariance function \[ Cov(\mathbb{S}(s), \mathbb{S}(t)) = s \wedge t \] it is called a Brownian motion, also known as a Wiener process.
Note that if \(\mathbb{U}\) is a Brownian bridge and \(Z\) an independent standard normal random variable, then \[ \mathbb{U}(t) + t Z, \mbox{ for } t \in [0, 1] \] is a Brownian motion. Likewise, if \(\mathbb{S}\) is a Brownian motion, then \(\mathbb{S}(t) - t \mathbb{S}(1)\) is a Brownian bridge for \(t \in [0, 1]\).
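A sketch of the second construction on a discrete grid, approximating continuous time by summing independent Gaussian increments (the grid size is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 1_000                    # grid points on (0, 1]
t = np.arange(1, m + 1) / m

# Brownian motion on the grid: cumulative sum of independent N(0, 1/m) increments
S = np.cumsum(rng.normal(scale=np.sqrt(1 / m), size=m))

# Brownian bridge: pin the motion down to 0 at t = 1
U = S - t * S[-1]

print(S[-1], U[-1])  # S(1) is N(0, 1); U(1) is exactly 0
```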
The increment of a process \(X\) is defined as \(X(t + u) - X(t)\) for \(u \geq 0\). We will often write \(d\, X(t)\) to represent the increment \(X(t + dt) - X(t)\), i.e., the thing that happens just after \(t\).
A counting process is written as \(\{N(t), t \geq 0\}\), where \(N(0) = 0\), and \(N(t)\) counts the number of events that occur up to time \(t\).
\(N(t)\) is a Poisson process if it has independent increments (counts on disjoint subsets of \(\mathbb{R}^+\) are independent) and the count on an interval of length \(t\) is distributed Poisson with mean \(\lambda t\), where \(\lambda > 0\) is the rate. It has some interesting properties, which we can illustrate by simulation.
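A sketch using the standard construction via independent exponential inter-arrival times, which is equivalent to the definition above (the rate and horizon are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
lam, horizon = 2.0, 10.0  # rate and time horizon

# Event times: cumulative sums of independent Exponential(rate = lam) gaps
gaps = rng.exponential(scale=1 / lam, size=1_000)
times = np.cumsum(gaps)
times = times[times <= horizon]

def N(t):
    """Counting process: N(t) = number of events in (0, t]."""
    return np.searchsorted(times, t, side="right")

print(N(horizon), lam * horizon)  # count is Poisson with mean lam * horizon = 20
```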
The exercises should be a review of things you’ve seen before. If you struggle with the terminology or notation, consider brushing up on basic asymptotic statistics (ask me for recommendations).
Reading for Friday: ABG Chapter 1.