A4 Asymptotic Distribution Theory
Even when the correct member of a family of probability distributions is known, deriving the exact distributions of statistics is often difficult. Here we investigate what happens in indefinitely large samples. The analysis commences from measures of orders of magnitude, and concepts of stochastic convergence. Laws of large numbers and central‐limit theorems are reviewed for the simplest case of an independent, identically‐distributed, scalar random variable. Vector processes are analysed, and two worked examples are presented to illustrate the various steps in practical derivations as well as some of the problems which can ensue. Next, stationary dynamic processes are studied, and instrumental variables estimators are described, followed by mixing and martingale processes. These methods are applied to maximum likelihood estimation. The chapter concludes with a brief look at asymptotic approximations.
A4.1 Introduction
Statistical analysis in econometrics distinguishes between the unknown data‐generation process and the econometric model which attempts to represent the salient data features in a simplified form. Models are of necessity incomplete representations of reality, and the issue of their properties is discussed in the main text. The problem considered now is the behaviour of estimators and tests based on models when the data‐generation process is only specified in general terms. For pedagogic purposes, the analysis proceeds through simple cases (homogeneous, independent, identically‐distributed random variables) to explain the basic ideas, and then introduces complications which arise almost inevitably when analysing economic time series (dependence and heterogeneity, with only low‐order moments of the underlying distributions). The present section describes the framework for the analysis, why asymptotic distribution theory is needed and what problems must be confronted, followed in sections A4.2–A4.7 by formal definitions and notation, orders of magnitude, concepts of stochastic convergence, laws of large numbers, and central‐limit theorems for IID random variables, generalizations to vector random variables, and two detailed examples. Sections A4.8–A4.11 deal with dependent data series, focusing on dynamic processes under stationarity and ergodicity, and discuss mixing conditions and martingale limit theory. Sections A4.9 and A4.11 discuss instrumental‐variables and maximum‐likelihood estimation as important applications. Finally, some of the issues involved in asymptotic expansions in simple cases are noted.
In many ways, the use of asymptotic theory, which is a statistical distribution theory for large numbers of observations, is an admission of failure relative to obtaining exact sampling distributions. However, the latter are not only often intractable given present techniques, but can be unenlightening even when obtained due to their complicated forms. Thus, simplifications resulting from assuming large samples of observations can clarify the essential properties of complicated estimators and tests, although one must then examine the usefulness of such asymptotic findings for relevant sample sizes.
As an illustration, consider a random sample of T observations ${x}^{\prime}=({x}_{1}\,...\,{x}_{T})$ drawn from an unknown density function ${\text{D}}_{X}(x|\theta )$ for θ ∈ Θ. The distribution is assumed by an investigator to be ${\text{G}}_{X}(x;\mu )$ where $\mu \in \mathbb{R}$. A sample statistic W _{T}(x _{1} . . . x _{T}) is formed as an estimate of μ. The properties of ${W}_{T}(X)$ depend on those of D_{X}(·), and are characterized by its distribution function ${\text{F}}_{{w}_{T}}(·)$:
Letting W be a normal random variable with CDF F_{w}(·), assume for illustration that:
Equation (A4.2) expresses the general nature of one approach to deriving limiting distributions, namely relate W _{T} to a simpler random variable W whose distribution is known, such that (W _{T} − W) converges to zero (in a sense to be made precise) as T increases without bound. Many concepts of convergence are possible, and these cannot all be ordered such that type A implies type B implies C etc.; major alternatives are discussed in § A4.3 after noting some notions of the ‘size’ of the difference between W _{T} and W in § A4.2.
All of the main results have to be established for vector random variables which may be non‐linear functions of the original random variables {X _{t}}. Thus, we wish to know how a continuous function $g({W}_{T})$ behaves as T → ∞. Such behaviour usually relates to both central tendency, for which laws of large numbers apply, and the form of the distribution for large T, for which central‐limit theorems are required. For independent sampling, various theorems are reported in sections A4.4 and A4.5, and generalized to vector processes in §A4.6. Section A4.7 then discusses two examples in detail. The level in these sections is introductory, in as much as IID assumptions are usually made.
The later sections are more advanced. Dependent random variables are introduced in §A4.8 for data processes which are stationary and ergodic, then §A4.10 and §A4.11 consider mixing conditions and martingale processes.^{1} These have recently been used in econometrics as the basis for establishing limiting distributions, and are seen to offer a natural characterization of important features of suitably transformed economic time series, while being sufficiently strong to allow useful limit results to be established. The following comments are intended to motivate the usage of mixing conditions and martingales in deriving asymptotic distributions.^{2}
Heuristically, a process is uniformly mixing if knowledge as to its present state is uninformative about its state in the distant future. For economic systems, current knowledge about the state of the system is often informative about nearby future states, but seems relatively uninformative concerning distant states. Thus, the assumption of uniform mixing is not unreasonable for appropriately transformed economic variables. Moreover, with mixing, the system has a long but restricted memory: X _{T} and X _{T+n} become independent as n → ∞, so that new information about the system continually accrues from larger samples, a condition of importance for the convergence of estimators and the consistency of tests.
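This fading of memory can be given a minimal numerical sketch (my own construction, not from the text): for a stationary first‐order autoregression with parameter ρ, the correlation between X _{t} and X _{t+n} is ρ^{n}, so current information is informative about nearby states but almost uninformative about distant ones:

```python
import numpy as np

# For a stationary AR(1) process x_t = rho * x_{t-1} + e_t with |rho| < 1,
# corr(x_t, x_{t+n}) = rho**n: knowledge of the present state decays
# geometrically with the horizon n -- the kind of restricted memory that
# mixing conditions formalize.
rho = 0.8
horizons = np.arange(0, 41)
memory = rho ** horizons          # theoretical autocorrelations
near, distant = memory[1], memory[40]
```

With ρ = 0.8 (an arbitrary illustrative value), the one‐step correlation is 0.8 while the forty‐step correlation is of order 10^{−4}: near independence at long range, yet strong short‐run dependence.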
More generally, let ${X}_{t}$ denote a vector random process with respect to the probability space $(\Omega \text{,}\mathcal{F}\text{,}\text{P})$. As in Chapter A2, the joint data density D_{X}(·) can be sequentially factorized as:
Finally, since the data are generated by D_{X}(·), but a model may be based on G_{X}(·), it is natural to consider theorems relating to the behaviour of data moments in large samples (usually the first two moments suffice), and derive the properties of estimators therefrom (see e.g. Hannan, 1970, Ch. 4, and Hendry, 1979a). For mixing processes, direct derivations are possible (see e.g. White, 1984, and Domowitz and White, 1982). Sections A4.9 to A4.11 provide some illustrations.
It is worth stressing that we use asymptotic theory in order to derive approximate distributions in finite samples, and the validity of such results does not depend on the economic process remaining constant as T → ∞. Thus, even though $\theta $ may change in the next period, this does not affect the accuracy of approximating (A4.1) by F_{w}(·), although it does affect the usefulness of the underlying econometric model. Also, no attempt is made in what follows either to produce the most general possible results, or specify the weakest possible assumptions; nor is comprehensive coverage sought. Rather, we aim to provide a basis for results relevant to the class of models discussed in the main text, taking account of subject‐matter considerations, but simplifying by stronger than necessary assumptions where this eases the derivations. Useful references to asymptotic distribution theory in the field of econometrics are Davidson (1994), Davidson and MacKinnon (1993), McCabe and Tremayne (1993), Sargan (1988), Schmidt (1976), Spanos (1986), Theil (1971) and White (1984); in statistics, see Cox and Hinkley (1974), Cramér (1946), Rao (1973) and Wilks (1962); and in time‐series analysis, see Anderson (1971), Fuller (1976), Hannan (1970) and Harvey (1981a). Formal proofs are not provided for most of the theorems presented below, but may be found in these references.
Finally, as noted earlier, while limiting distributional results both clarify the relationships in large samples between apparently different methods and provide insight into the properties of estimators and tests, nevertheless, the accuracy of such results is dependent on the true value of $\theta $ as well as on T. Thus, for important problems, a check should always be made on the accuracy of the asymptotic approximation for the problem at hand, either by Monte Carlo methods (as in Ch. 3) or by higher‐order approximations as described in the final section.
A4.2 Orders of Magnitude
A4.2.1 Deterministic Sequences
Let {W _{t}}, {X _{t}}, and {Y _{t}} denote sequences of scalar random variables and {A _{t}}, {B _{t}}, and {C _{t}} denote sequences of non‐stochastic real numbers for t = 1, . . . , T, where we allow T → ∞. We first define convergence for a deterministic sequence.
Definition 1. The sequence {A _{T}} converges to the constant A as T → ∞, if ∀ δ > 0, there exists a T _{δ}, such that ∀ T > T _{δ}, |A _{T} − A| < δ. We write lim_{T→ ∞} A _{T} = A, or A _{T} → A.
From the definition of continuity, if g(·) is a function continuous at A, and A _{T} → A, then g(A _{T}) → g(A). Next, we define the order of magnitude for a deterministic sequence.
Definition 2. {A _{T}} is at most of order one, denoted by {A _{T}} ≈ O(1), if the sequence is bounded as T → ∞, namely, there exists a constant, finite M > 0 such that |A _{T}| < M ∀ T. Further {A _{T}} ≈ O(T ^{k}) if {T ^{−k} A _{T}} ≈ O(1).
Definition 3. {B _{T}} is of smaller order than T ^{n} if lim_{T→ ∞} T ^{−n} B _{T} = 0, denoted by {B _{T}} ≈ o(T ^{n}).
If {B _{T}} ≈ o(T ^{n}), then {T ^{−n} B _{T}} is bounded as T → ∞, so that {B _{T}} ≈ O(T ^{n}), but the converse is not necessarily true: that {B _{T}} ≈ O(T ^{n}) need not imply that {B _{T}} ≈ o(T ^{n}). Further, from these definitions (denoted by D1, D2 etc.):
Theorem 1. If {A _{T}} ≈ O(T ^{k}) and {B _{T}} ≈ O(T ^{s}), then:

(a) {A _{T}} ≈ o(T ^{k+m}) ∀ m > 0;

(b) {A _{T} ± B _{T}} ≈ O(T ^{r}) where r = max (k, s);

(c) {A _{T} B _{T}} ≈ O(T ^{k+s}).
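As a numerical sketch of these rules (an illustration of my own, not from the text), take A _{T} = 3T^{2} + T ≈ O(T ^{2}) and B _{T} = 5T ≈ O(T): by (b) the sum is O(T ^{2}) and by (c) the product is O(T ^{3}), so the correspondingly scaled sequences stay bounded:

```python
# Illustrating Theorem 1 (b) and (c): A_T = 3T^2 + T is O(T^2) and
# B_T = 5T is O(T), so A_T + B_T is O(T^max(2,1)) = O(T^2) and
# A_T * B_T is O(T^(2+1)) = O(T^3); the scaled sequences remain bounded.
def A(T):
    return 3 * T**2 + T

def B(T):
    return 5 * T

Ts = (10, 100, 1000, 10000)
scaled_sum = [(A(T) + B(T)) / T**2 for T in Ts]   # settles near 3
scaled_prod = [(A(T) * B(T)) / T**3 for T in Ts]  # settles near 15
```

Dividing by a smaller power (say T for the sum) would instead produce a divergent sequence, which is what the O(·) notation rules out.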
A4.2.2 Stochastic Sequences
These results on O(·) and o(·) can be applied to standard deviations to provide one characterization of the order of magnitude of a random variable. For example, when ${X}_{t}\sim \text{IN}[{\mu}_{x}\text{,}{\sigma}_{x}^{2}]$, then:
When we consider stochastic sequences, the concept ‘bounded’ must be replaced by ‘bounded in probability’:
Definition 4. {W _{T}} is at most of order unity in probability, denoted by {W _{T}} ≈ O _{p}(1), if ∀ ε > 0, ∃ finite M _{ε} > 0 and T _{ε} > 0 such that ∀ T > T _{ε}, P(|W _{T}| > M _{ε}) < ε.
Thus, {W _{T}} ≈ O _{p}(1) if for large enough T, there is a negligibly small probability of the absolute value of its terms exceeding an upper bound, which may depend on the probability ε. Correspondingly, {W _{T}} ≈ O _{p}(T ^{k}) if {T ^{−k} W _{T}} ≈ O _{p}(1).
Definition 5. {Y _{T}} ≈ o _{p}(1) if ∀ ε > 0, lim_{T→ ∞} P(|Y _{T}| > ε) = 0.
Hence ν/T ^{α} in (A4.2) is o _{p}(1) for α > 0. Similar results to Theorem 1 apply to O _{p}(·) and o _{p}(·). For example, if:
Theorem 2. If f _{T}(·) ≈ O(A _{T}) and g _{T}(·) ≈ o(B _{T}) imply that h _{T}(·) ≈ O(C _{T}), then f _{T}(W _{T}) ≈ O _{p}(A _{T}) and g _{T}(W _{T}) ≈ o _{p}(B _{T}) imply that h _{T}(W _{T}) ≈ O _{p}(C _{T}).
For example, if {X _{T}} and {Y _{T}} are O _{p}(T ^{k}) and O _{p}(T ^{s}) respectively, then {Y _{T} X _{T}} is O _{p}(T ^{k+s}).
Finally, we can paraphrase the earlier result on using the order of magnitude of a standard deviation to determine a finite‐variance limiting distribution: if {W _{T}} has a variance of O(1), then {W _{T}} ≈ O _{p}(1), although the converse is false. Mann and Wald (1943a) developed most of the results in this section.
A4.3 Stochastic Convergence
We now apply the notions of stochastic orders of magnitude to ascertain the relationships between two (or more) random variables. Two random variables could be deemed to converge if:

(a) their distribution functions converge;

(b) their difference is o _{p}(1);

(c) the variance of their difference exists ∀ T and is o(1);

(d) they are equal except on a set of probability zero.
Definition 6. W _{T} tends in distribution to W, denoted by ${W}_{T}\stackrel{D}{\to}W$, if for all continuity points of ${\text{F}}_{w}(\xb7)\text{,}\underset{T\to \infty}{lim}{\text{F}}_{{w}_{T}}(r)={\text{F}}_{w}(r)$, where F_{w} (r) = P(W ≤ r) denotes the CDF of W.
More generally, if for all continuity points of a CDF ${\text{F}}_{{w}_{T}}(\xb7)\text{,}\underset{T\to \infty}{lim}{F}_{{w}_{T}}(r)={\text{F}}_{w}(r)$, then ${\text{F}}_{{w}_{T}}(r)$ converges weakly to F_{w}(r) written as ${\text{F}}_{{w}_{T}}(r)\Rightarrow {\text{F}}_{w}(r)$.
Definition 7. W _{T} tends in probability to W if lim_{T→ ∞} P(|W _{T} − W| > ε) = 0 ∀ ε > 0, denoted by ${W}_{T}\stackrel{P}{\to}W$. Thus, from D5, (W _{T} − W) ≈ o _{p}(1).
Definition 8. W _{T} tends in mean square to W, denoted ${W}_{T}\stackrel{\mathrm{MS}}{\to}W$, if E[W _{T}] and $\text{E}[{W}_{T}^{2}]$ exist ∀ T and lim_{T→ ∞} E[(W _{T} − W)^{2}] = 0.
Definition 9. W _{T} tends almost surely to W if lim_{T→ ∞} P(|W _{t} − W| ≤ ε, ∀ t ≥ T) = 1 ∀ ε > 0, denoted ${W}_{T}\stackrel{\mathrm{AS}}{\to}W$.
Since CDFs are real numbers, ordinary convergence holds in D6: the more general definition holds without specific reference to an associated random variable, and weak convergence will prove of importance in the main text. In D9, the whole collection of {|W _{t} − W|, t ≥ T} must have a limiting probability of unity of being less than or equal to ε, so that the two random variables must come together except for a set of probability zero. An equivalent definition is:
Definition 10. ${W}_{T}\stackrel{\mathrm{AS}}{\to}W$, if P(lim_{T→ ∞} W _{T} = W) = 1.
Almost sure convergence implies that W _{T} converges to W for almost every possible realization of the stochastic process. Thus, D9 implies D7. When definition D8 holds, from Chebychev's inequality (see § A2.7):
Summarizing these results in schematic form, where $\Rightarrow $ denotes implies:
Theorem 3. If plim_{T → ∞} (W _{T} − W) = 0, plim_{T → ∞} Y _{T} = c and |Y _{T} − c| ≥ |X _{T} − c|, for constants a, b, c > 0, then:

(i) lim P(aW _{T} + bY _{T} ≤ r) = F_{w}([r − bc]/a);

(ii) lim P(W _{T} Y _{T} ≤ r) = F_{w}(r/c);

(iii) plim_{T → ∞} X _{T} = c.
Next, a useful theorem due to Slutsky, which follows from D7 and the definition of continuity:
Theorem 4. When g(·) is a continuous function independent of T, if ${W}_{T}\stackrel{P}{\to}W$, then $g({W}_{T})\stackrel{P}{\to}g(W)$; and if ${W}_{T}\stackrel{\mathrm{AS}}{\to}W$, then $g({W}_{T})\stackrel{\mathrm{AS}}{\to}g(W)$.
Since $\stackrel{P}{\to}$ implies $\stackrel{D}{\to}$, if W = κ, then Slutsky's theorem implies that:
A4.4 Laws of Large Numbers
We next consider the behaviour in large samples of averages of random variables. The behaviour of such averages depends on three factors, namely: the degree of interdependence between successive drawings (in this section we assume independence); the extent of heterogeneity in successive drawings (which is different between the two theorems cited below); and the existence of higher moments of the distributions (which needs to be stronger the greater the heterogeneity or interdependence allowed). Weak and strong laws of large numbers exist and we have the following results $(|{\mu}_{t}|<\infty \text{,}0<{\sigma}_{t}^{2}<\infty )$.
A4.4.1 Weak Law of Large Numbers
We denote the weak law of large numbers by WLLN, and cite two famous results:
Theorem 5(i). If X _{t} is IID with E[X _{t}] = μ where μ is finite, then ${\overline{X}}_{T}\stackrel{P}{\to}\mu $ (Khintchine's theorem).
Theorem 5(ii). If X _{t} is $\text{ID}[{\mu}_{t}\text{,}{\sigma}_{t}^{2}]$ with ${\overline{\mu}}_{T}={T}^{-1}\sum _{t=1}^{T}{\mu}_{t}$, and $\underset{T\to \infty}{lim}{T}^{-2}\sum _{t=1}^{T}{\sigma}_{t}^{2}=0$, then $({\overline{X}}_{T}-{\overline{\mu}}_{T})\stackrel{P}{\to}0$.
Thus, if the X _{t} are identically distributed with a common finite mean, then plim X̄_{T} = μ, even if the second moment of X _{t} does not exist. However, if the distributions are not identical as in Theorem 5(ii), extra conditions are required on the second moments of the {X _{t}}, to ensure that they are finite, and the average variance divided by T vanishes. Constant, finite, first two moments are sufficient for 5(ii) to hold, even when X _{t} is not IID (perhaps because the third moment is not constant).
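The fact that Khintchine's theorem needs only a finite mean can be checked by a quick simulation (my own construction: Student's t with 1.5 degrees of freedom has mean zero but no finite variance, yet the sample mean still concentrates on zero):

```python
import numpy as np

rng = np.random.default_rng(42)
# Student's t with 1.5 degrees of freedom has a finite mean (zero) but an
# infinite variance; Khintchine's WLLN still applies, so the sample mean
# converges in probability to 0 as T grows (albeit slowly, given the
# heavy tails).
abs_error = {}
for T in (100, 10_000, 1_000_000):
    abs_error[T] = abs(rng.standard_t(df=1.5, size=T).mean())
```

The bound used below is deliberately loose: with such heavy tails the convergence rate is far slower than 1/√T, which is why WLLN(ii) asks for second moments when heterogeneity is allowed.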
A4.4.2 Strong Law of Large Numbers
The corresponding strong laws of large numbers (SLLN) are:
Theorem 6(i). If X _{t} is IID, then E[X _{t}] = μ being finite is necessary and sufficient for ${\overline{X}}_{T}\stackrel{\mathrm{AS}}{\to}\mu $.
Theorem 6(ii). If X _{t} is $\text{ID}[{\mu}_{t}\text{,}{\sigma}_{t}^{2}]$ and $\underset{T\to \infty}{lim}\sum _{t=1}^{T}{t}^{-2}{\sigma}_{t}^{2}<\infty $, then $({\overline{X}}_{T}-{\overline{\mu}}_{T})\stackrel{\mathrm{AS}}{\to}0$.

Thus, WLLN(i) actually implies almost‐sure convergence without additional conditions, whereas, since t ^{−2} ≥ T ^{−2}, the second condition of WLLN(ii) needs strengthening to achieve almost‐sure convergence. Distributions like the Cauchy without finite moments do not satisfy the assumptions of either WLLN or SLLN. Consequently, neither law applies and the sample mean does not converge.
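The Cauchy failure can be made concrete (an illustrative simulation of mine, exploiting the standard result that the mean of T standard Cauchy draws is again standard Cauchy for every T):

```python
import numpy as np

rng = np.random.default_rng(7)
# The mean of T standard Cauchy draws is itself standard Cauchy whatever T,
# so the sampling spread of Xbar_T does not shrink as T grows: no law of
# large numbers operates.
reps = 4000
iqr = {}
for T in (10, 1000):
    means = rng.standard_cauchy(size=(reps, T)).mean(axis=1)
    q75, q25 = np.percentile(means, [75, 25])
    iqr[T] = q75 - q25   # a standard Cauchy has interquartile range 2
```

The interquartile range of the sample mean stays near 2 at both sample sizes: a hundred‐fold increase in T buys no concentration at all.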
For estimators of a constant population parameter μ, we have the following definitions:
Definition 11. An estimator ${\stackrel{\wedge}{\mu}}_{T}$ is consistent for μ if ${\text{plim}}_{T\to \infty}{\stackrel{\wedge}{\mu}}_{T}=\mu $.
Thus, X̄_{T} is consistent for μ given WLLN(i). Also:
Definition 12. ${\stackrel{\wedge}{\mu}}_{T}$ is asymptotically unbiased for μ if $\text{E}[{\stackrel{\wedge}{\mu}}_{T}]$ is finite ∀ T and: $\underset{T\to \infty}{lim}\text{E}[{\stackrel{\wedge}{\mu}}_{T}]=\overline{\text{E}}[{\stackrel{\wedge}{\mu}}_{T}]=\mu $.
Consistency does not imply asymptotic unbiasedness (e.g. $\text{E}[{\stackrel{\wedge}{\mu}}_{T}]$ need not exist), nor conversely (since an unbiased estimator need not have a degenerate limiting distribution). These last two definitions (D11 and D12) highlight one of the main roles for laws of large numbers, namely to allow us to establish the consistency of estimators. In sections A4.10 and A4.11, weaker assumptions will be discussed which still sustain laws like WLLN and SLLN.
A4.5 Central‐limit Theorems
For consistent estimators, the limiting CDF F_{w}(r) is a step function, and, as this has the same limiting distributional form for all consistent estimators, it is uninformative about the rate of convergence. For example, X̄_{T} and:
As with laws of large numbers, the outcome is determined by an interaction between the extent of memory (i.e. interdependence, which is absent here due to assuming independence), heterogeneity (allowed in Theorems 8(i) and (ii)), and the existence of higher moments (finite second moments are assumed in all the theorems, and the (2 + δ) moment by 8(i)). We have three central‐limit theorems, inducing asymptotic normality (generalized in § A4.11).
Theorem 7. If X _{t} ∼ IID[μ, σ^{2}] with |μ| < ∞, 0 < σ^{2} < ∞, and W _{T} = √ T (X̄_{T} − μ)/σ, then ${W}_{T}\stackrel{D}{\to}W\sim \text{N}[0\text{,}1]$.
This result can also be written as ${W}_{T}\stackrel{a}{\sim}\text{N}[0\text{,}1]$, where $\stackrel{a}{\sim}$ denotes ‘is asymptotically distributed as'.
Theorem 8. If ${X}_{t}\sim \text{ID}[{\mu}_{t}\text{,}{\sigma}_{t}^{2}]$ where $|{\mu}_{t}|<\infty $ and $0<{\sigma}_{t}^{2}<\infty \forall t$ with ${S}_{T}^{2}=\sum _{t=1}^{T}{\sigma}_{t}^{2}$, and either:

(i) β_{t} = E[X _{t} − μ_{t}^{2+δ}] < ∞ and $lim((\sum _{t=1}^{T}{\beta}_{t})/{S}_{T}^{2+\delta})=0$ for δ > 0; or

(ii) $lim\,{S}_{T}^{-2}\sum _{t=1}^{T}\text{E}[({X}_{t}-{\mu}_{t}){}^{2}\,I(|{X}_{t}-{\mu}_{t}|\ge \epsilon {S}_{T})]=0\,\forall \epsilon >0$; then:
$${W}_{T}=\sum _{t=1}^{T}\left[\frac{{X}_{t}-{\mu}_{t}}{{S}_{T}}\right]=\sqrt{T}\left[\frac{{\overline{X}}_{T}-{\overline{\mu}}_{T}}{{S}_{T}/\sqrt{T}}\right]\stackrel{D}{\to}W\sim \text{N}[0\text{,}1]\text{.}$$(A4.11)
Theorem 7 makes the strongest assumptions in terms of homogeneity (identical distributions) and moments existing (finite second moment). Whittle (1970) provides an elegant explanation as to why Theorem 7 holds by demonstrating a closure property as follows. Consider a sequence of standardized IID {Y _{t}}, each of which has the same distributional form as the limit of W _{T} with distribution W. Define Z _{T} = ∑ Y _{t}/√ T, then Z _{T} must also converge to the same limiting distribution W. Thus, the characteristic function of Z _{T} is the T ^{th} power of the characteristic function of the IID variates and yet of the same form. Whittle then shows that exp (−ν^{2}/2) (for argument ν) is the only function with that property, which is the characteristic function of N[0, 1].
As an illustration of the operation of central‐limiting effects, Figs. A4.1a‐d show the histograms and density functions for 10 000 drawings from the exponential distribution λ exp(−λ x) with parameter λ = 0.1, so the mean and standard deviation are both 10. The first shows the original distribution (which is highly skewed and covers a large range), then the means of samples of size 5 and size 50 therefrom, and finally N[10, 1] for comparison. The convergence to normality is rapid for this IID setting.
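The central‐limiting effect in the figures can be sketched in a few lines (an illustrative simulation of mine using the stated λ = 0.1; the tracked quantity is the skewness of the standardized mean, which for exponential draws falls like 2/√T):

```python
import numpy as np

rng = np.random.default_rng(1)
# Exponential draws with lambda = 0.1: mean = standard deviation = 10 and
# skewness 2.  The skewness of the standardized mean of T draws falls like
# 2/sqrt(T), so its distribution approaches N[0, 1] quickly.
reps = 20_000
skew = {}
for T in (1, 5, 50):
    m = rng.exponential(scale=10.0, size=(reps, T)).mean(axis=1)
    z = (m - m.mean()) / m.std()
    skew[T] = float(np.mean(z ** 3))
```

The estimated skewness falls from about 2 (the raw distribution) towards the theoretical 2/√50 ≈ 0.28 for means of 50 draws, matching the visual convergence in the figures.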
In Theorem 8(i), the identical‐distributions assumption is dropped, and instead a finite absolute moment of order (2 + δ) is required for which δ = 1 is sufficient and entails that the third absolute moment exists. If so, then for example, when μ_{t}, σ_{t}, and β_{t} are constant:
However, Theorem 8(ii) provides most insight into what features of the DGP are required to sustain central‐limiting normality, and thereby needs most explanation. First, if the limit in (ii) is to be zero, a necessary condition is that ${S}_{T}^{2}\to \infty $ as T → ∞ (so that information continually accrues). Next, since I(·) denotes the indicator function of the argument (see § A2.7), when the event $\{|{X}_{t}-{\mu}_{t}|\ge \epsilon {S}_{T}\}$ occurs, then $I(|{X}_{t}-{\mu}_{t}|\ge \epsilon {S}_{T})$ is unity, and is zero otherwise, so that:
Three generalizations are needed to render central‐limit theory applicable to economic data. First, it must apply to vector random variables, and the following section analyses that case. Secondly, it must allow for processes with memory rather than independence, and sections A4.8–A4.11 do so. Finally, integrated processes must be incorporated, and that aspect is analysed in Chapter 3.
A4.6 Vector Random Variables
While the concepts above carry over directly for elements of vectors, results for vector processes in general follow from the Cramér–Wold device. Let $\{{W}_{T}\}$ be a k‐dimensional vector process, and W a k × 1 vector random variable, then:
Theorem 9. ${W}_{T}\stackrel{D}{\to}W$ if and only if $\lambda \prime {W}_{T}\stackrel{D}{\to}\lambda \prime W$ for all fixed finite k × 1 vectors λ.
For example, if $\lambda \prime W\sim \text{N}[0\text{,}\lambda \prime \Psi \lambda ]$ then ${W}_{T}\stackrel{D}{\to}{\text{N}}_{k}[0\text{,}\Psi ]$. Moreover, certain stochastic functions of ${W}_{T}$ also converge to normality if ${W}_{T}$ does (see Cramér, 1946). This important result is referred to below as Cramér's theorem:
Theorem 10. If ${W}_{T}\stackrel{D}{\to}W\sim {\text{N}}_{k}[\mu \text{,}\Psi ]$ and ${A}_{T}$ is n × k with ${\text{plim}}_{T\to \infty}{A}_{T}=A$, then ${A}_{T}{W}_{T}\stackrel{D}{\to}\mathrm{AW}$ where $\mathrm{AW}\sim {\text{N}}_{n}[A\mu \text{,}A\Psi A\prime ]$.
The theorem is clearly valid if ${A}_{T}=A\,\forall T$; otherwise, use is made of $({A}_{T}-A)\approx {o}_{p}(1)$ so that:
Next, consider an n × 1 continuously differentiable vector function $g({W}_{T})$:
Theorem 11. If $\surd T({W}_{T}-\mu )\stackrel{D}{\to}{\text{N}}_{k}[0\text{,}\Psi ]$ and $G(w)=\partial g(w)/\partial w\prime $ exists and is continuous in the neighbourhood of $\mu $, then: $\surd T(g({W}_{T})-g(\mu ))\stackrel{D}{\to}{\text{N}}_{n}[0\text{,}G(\mu )\Psi G(\mu )\prime ]$.
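Theorem 11 can be checked by simulation in the scalar case (an illustrative sketch of mine with g(w) = w², μ = 2, σ = 1.5, all values arbitrary choices): since G(μ) = 2μ, the linearized variance is (2μ)²σ².

```python
import numpy as np

rng = np.random.default_rng(3)
# Sketch of Theorem 11 for g(w) = w^2 in the scalar case:
# if sqrt(T)(Wbar_T - mu) ->D N[0, sigma^2] and G(mu) = dg/dw = 2*mu, then
# sqrt(T)(Wbar_T^2 - mu^2) ->D N[0, (2*mu)^2 * sigma^2].
mu, sigma, T, reps = 2.0, 1.5, 2000, 4000
# draw Wbar_T directly from its exact N[mu, sigma^2/T] distribution
wbar = rng.normal(mu, sigma / np.sqrt(T), size=reps)
stat = np.sqrt(T) * (wbar**2 - mu**2)
theory_var = (2 * mu) ** 2 * sigma**2   # = 36.0
emp_var = stat.var()
```

The remainder term √T (W̄_T − μ)², of order O _{p}(T ^{−1/2}), is what the continuity of G(·) near μ renders asymptotically negligible.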
The following two examples seek to illustrate applications of the preceding results. In many cases, almost‐sure convergence can be established where we focus on convergence in probability below: establishing the stronger result in relevant cases makes an interesting and useful exercise (see White, 1984, for a number of helpful developments).
A4.7 Solved Examples
A4.7.1 Example 1: An IID Process
Let $\text{E}[{y}_{t}|{z}_{t}]=\beta \prime {z}_{t}$ for t = 1, . . . , T, and ${\epsilon}_{t}=({y}_{t}-{z}_{t}^{\prime}\beta )\sim \text{IID}[0\text{,}{\sigma}_{\epsilon}^{2}]\text{,}0<{\sigma}_{\epsilon}^{2}<\infty $, where the k × 1 vector ${z}_{t}\sim {\text{IID}}_{k}[0\text{,}\Psi ]$ with Ψ finite, positive definite, and $\{{z}_{t}\}\text{,}\{{\epsilon}_{t}\}$ are independent processes. The least‐squares estimator of $\beta \in {\mathbb{R}}^{k}$ (where $\text{rank}({z}_{1}\,...\,{z}_{T})=k$) is:
First, ${\nu}_{t}={z}_{t}{\epsilon}_{t}\sim {\text{IID}}_{k}[0\text{,}{\sigma}_{\epsilon}^{2}\Psi ]$, so that:
Next, since ${z}_{t}$ is IID, ${z}_{t}{z}_{t}^{\prime}$ is IID with $\text{E}[{z}_{t}{z}_{t}^{\prime}]=\Psi $. Let
Consequently, by the Cramér–Wold device:
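The limiting distribution derived in this example can be illustrated by Monte Carlo in the scalar case k = 1 (the parameter values below are my own illustrative choices), where it reduces to √T (β̂ − β) →D N[0, σ²_ε/ψ]:

```python
import numpy as np

rng = np.random.default_rng(11)
# Monte Carlo sketch of Example 1 with scalar z_t (k = 1):
# z_t ~ IID[0, psi] and eps_t ~ IID[0, sigma2] independent, so
# sqrt(T)(beta_hat - beta) ->D N[0, sigma2 / psi].
beta, psi, sigma2 = 0.5, 2.0, 1.0
T, reps = 500, 4000
stats = np.empty(reps)
for r in range(reps):
    z = rng.normal(0.0, np.sqrt(psi), T)
    y = beta * z + rng.normal(0.0, np.sqrt(sigma2), T)
    beta_hat = (z @ y) / (z @ z)
    stats[r] = np.sqrt(T) * (beta_hat - beta)
theory_var = sigma2 / psi   # = 0.5
```

Across replications the scaled estimation errors are centred on zero with variance near σ²_ε/ψ, as the Cramér–Wold argument predicts.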
A4.7.2 Example 2: A Trend Model
Let y _{t} = β t ^{α} + ε_{t} where ${\epsilon}_{t}\sim \text{IN}[0\text{,}{\sigma}_{\epsilon}^{2}]$ and α is a known, fixed constant. Given normality of ε_{t}, the exact distribution of the least‐squares estimator of β can be obtained; yet there are problems in obtaining its limiting distribution for some values of α, due to the regressor being a trend. We have:
Five values of α will be considered: α = 0, ± 1/2 and ± 1.

(a) If α = 0, then $\surd T(\stackrel{\wedge}{\beta}-\beta )\sim \text{N}[0\text{,}{\sigma}_{\epsilon}^{2}]$ holds ∀ T.

(b) If α = 1/2, then ${\Sigma}_{t=1}^{T}{t}^{2\alpha}={\Sigma}_{t=1}^{T}t=T(T+1)/2\approx O({T}^{2})$. Consequently:
$$\underset{T\to \infty}{\text{plim}}(\stackrel{\wedge}{\beta}-\beta )=\underset{T\to \infty}{\text{plim}}\sqrt{T}(\stackrel{\wedge}{\beta}-\beta )=0\text{,}$$ 
and the distribution ‘collapses’ rather rapidly. Nevertheless:
$$T(\stackrel{\wedge}{\beta}-\beta )\sim \text{N}\left[0\text{,}\frac{2{\sigma}_{\epsilon}^{2}}{(1+{T}^{-1})}\right]\to \text{N}[0\text{,}2{\sigma}_{\epsilon}^{2}]\text{.}$$(A4.26) 
Thus, a higher‐order rescaling produces a normal limiting distribution.

(c) α = − ½ could arise from trending heteroscedasticity when ${y}_{t}^{*}=\beta +{\epsilon}_{t}^{*}$ where:
$${\epsilon}_{t}^{*}\sim \text{IN}[0\text{,}{\sigma}_{\epsilon}^{2}t]\quad \text{and}\quad {y}_{t}=\frac{{y}_{t}^{*}}{\sqrt{t}}\text{.}$$(A4.27) 
Then:
$$(\stackrel{\wedge}{\beta}-\beta )={\left(\sum _{t=1}^{T}{t}^{-1}\right)}^{-1}\left[\sum _{t=1}^{T}\left(\frac{{\epsilon}_{t}}{\sqrt{t}}\right)\right]\text{.}$$(A4.28) 
Although Σ t ^{−1} ≈ O(log T) → ∞, it diverges very slowly (see e.g. Binmore, 1983). Thus:
$$(logT){}^{\frac{1}{2}}(\stackrel{\wedge}{\beta}-\beta )\sim \text{N}\left[0\text{,}{\sigma}_{\epsilon}^{2}\left({\left(\sum _{t=1}^{T}{t}^{-1}\right)}^{-1}logT\right)\right]\to \text{N}[0\text{,}{\sigma}_{\epsilon}^{2}V]\text{,}$$ 
where V is a finite constant.

(d) When α = 1, a different rescaling is necessary compared with (a)–(c), since now:
$$\sum {t}^{2}=\frac{1}{6}T(T+1)(2T+1)\approx O({T}^{3})$$(A4.29) 
so that:
$${T}^{\frac{3}{2}}(\stackrel{\wedge}{\beta}-\beta )\sim \text{N}\left[0\text{,}\frac{3{\sigma}_{\epsilon}^{2}}{1+1\text{.}5{T}^{-1}+0\text{.}5{T}^{-2}}\right]\to \text{N}[0\text{,}3{\sigma}_{\epsilon}^{2}]\text{.}$$ 
(e) Finally, when $\alpha =-1$, $\underset{T\to \infty}{lim}(\sum _{t=1}^{T}{t}^{-2})={\pi}^{2}/6$ and hence:
$$(\stackrel{\wedge}{\beta}-\beta )\stackrel{D}{\to}\text{N}\left[0\text{,}\frac{6{\sigma}_{\epsilon}^{2}}{{\pi}^{2}}\right]\text{.}$$(A4.30)
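Case (d) can be verified numerically (an illustrative Monte Carlo of mine, with β = σ²_ε = 1): the estimation error must be blown up by T^{3/2}, not √T, before a non‐degenerate normal limit appears.

```python
import numpy as np

rng = np.random.default_rng(5)
# Case (d) of Example 2: y_t = beta*t + eps_t with a linear trend regressor.
# The estimator converges at rate T^(3/2), and
# T^(3/2)*(beta_hat - beta) ~ N[0, 3*sigma2/(1 + 1.5/T + 0.5/T^2)].
beta, sigma2, T, reps = 1.0, 1.0, 200, 5000
t = np.arange(1, T + 1, dtype=float)
stt = (t * t).sum()                # = T(T+1)(2T+1)/6, of order T^3
stats = np.empty(reps)
for r in range(reps):
    y = beta * t + rng.normal(0.0, np.sqrt(sigma2), T)
    beta_hat = (t @ y) / stt
    stats[r] = T**1.5 * (beta_hat - beta)
```

The replication variance of the rescaled errors sits close to the limit 3σ²_ε; rescaling by √T instead would give a variance of order T ^{−2}, a degenerate limit.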
Combining these two examples and weakening the assumptions, we can extend the limiting normal distribution established in §A3.7 for least‐squares estimation conditional on ${Z}_{T}^{1}$ (see e.g. Anderson, 1971):
Theorem 12. Let ${y}_{t}=\beta \prime {z}_{t}+{\epsilon}_{t}$ where $\text{E}[{y}_{t}|{z}_{t}]=\beta \prime {z}_{t}$ and ${\epsilon}_{t}\sim \text{IID}[0\text{,}{\sigma}_{\epsilon}^{2}]$. Let ${A}_{T}=\sum {z}_{t}{z}_{t}^{\prime}$ and denote its (i, j)^{th} element by a(T)_{ij}, and let ${B}_{T}={C}_{T}^{-1}{A}_{T}{C}_{T}^{-1}$ where ${C}_{T}$ is a diagonal matrix with elements √ a(T)_{ii}. If as T → ∞:

(i) a(T)_{ii} → ∞, i = 1, . . . , k;

(ii) ${z}_{i,T+1}^{2}/a(T)_{ii}\to 0$, i = 1, . . . , k;

(iii) ${B}_{T}\to B$ where B is finite positive definite;
A4.8 Stationary Dynamic Processes
A4.8.1 Vector Autoregressive Representations
It is essential to generalize from the assumption of IID observations characterizing the previous derivations to dynamic processes in order to obtain theorems relevant to economic time series. Consider the data‐generation process for t = . . . −m, . . . , 1, . . . , T:
Equation (A4.31) can be written as a matrix polynomial in the lag operator L, where ${L}^{k}{x}_{t}={x}_{t-k}$:
Having allowed data dependency, it is unnecessarily restrictive to retain the assumption that $\{{\epsilon}_{t}\}$ is an independent process, which in effect restricts the analysis to dealing only with data‐generating processes, and excludes models where the error process is most unlikely to be independent. Instead, ${\epsilon}_{t}$ can be interpreted as ${x}_{t}-\text{E}[{x}_{t}|{\mathcal{I}}_{t-1}]$, where ${\mathcal{I}}_{t-1}$ denotes the available information, so that $\text{E}[{\epsilon}_{t}|{\mathcal{I}}_{t-1}]=0$ and so $\{{\epsilon}_{t}\}$ is a martingale difference sequence (see Ch. 2). A range of powerful limiting results has been established for martingale processes (see Hall and Heyde, 1980, and §A4.11). To ensure that information continuously accrues and no individual observations dominate, conditions are required about the underlying data process which is being modelled, such as mixing conditions as discussed in White (1984) and §A4.10. Spanos (1986) considers martingale central‐limit theorems as well as mixing processes.
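A sketch of limit theory under dependence (my own construction, not from the text): for a stationary AR(1) the least‐squares estimator is asymptotically normal with variance 1 − ρ², a result deliverable by martingale central‐limit theorems since the product of the lagged regressor and the innovation is a martingale difference:

```python
import numpy as np

rng = np.random.default_rng(9)
# Stationary AR(1): x_t = rho*x_{t-1} + eps_t, |rho| < 1, eps_t ~ IN[0, 1].
# x_{t-1}*eps_t is a martingale difference sequence, and martingale CLTs
# give sqrt(T)(rho_hat - rho) ->D N[0, 1 - rho^2] for least squares.
rho, T, reps = 0.5, 1000, 2000
stats = np.empty(reps)
for r in range(reps):
    eps = rng.standard_normal(T)
    x = np.empty(T)
    x[0] = eps[0] / np.sqrt(1 - rho**2)   # start in the stationary distribution
    for t in range(1, T):
        x[t] = rho * x[t - 1] + eps[t]
    rho_hat = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])
    stats[r] = np.sqrt(T) * (rho_hat - rho)
theory_var = 1 - rho**2   # = 0.75
```

Despite the serial dependence in {x _{t}}, the scaled estimation errors behave like draws from N[0, 1 − ρ²], which is the kind of result the martingale and mixing machinery of sections A4.10 and A4.11 delivers formally.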
A4.8.2 Mann and Wald's Theorem
When $\{{z}_{t}\}$ is a weakly stationary and ergodic n‐dimensional vector random process:
Theorem 13. Let ${\nu}_{t}\sim \text{IID}[0\text{,}{\sigma}_{\nu}^{2}]$ with finite moments of all orders. Then if $\text{E}[{z}_{t}{\nu}_{t}]=0\forall t$:
Mann and Wald's theorem again extends the conditions under which asymptotic normality occurs. If ${\nu}_{t}\sim {\text{IID}}_{k}[0\text{,}\Sigma ]$ is a k‐dimensional vector process with finite higher‐order moments, their result generalizes. Let ⊗ denote the Kronecker product defined in Chapter A1, and let (·)^{v} be the vectoring operator which stacks columns of a matrix in a vector; then:
A4.8.3 Hannan's Theorem
It is convenient to write (A4.31) in companion form as:

(i) r = n, so all n variables in ${x}_{t}$ are I(0) and hence stationary;

(ii) r = 0, so all n variables in ${x}_{t}$ are I(1), but $\Delta {x}_{t}$ is stationary; and

(iii) 0 < r < n, so (n − r) linear combinations of $\Delta {x}_{t}$, and r linear combinations of ${x}_{t}$, are I(0).
(i) When π has full rank n, since $\text{E}[f_{t-1}\epsilon_s'] = 0\ \forall s \ge t$, the second moments of $f_t$, namely $M_F = \text{E}[f_t f_t']$ and $M_1 = \text{E}[f_t f_{t-1}']$, are given from (A4.36) by:
Theorem 14. $\surd T(\hat{M}_F - M_F)^{v} \stackrel{D}{\to} \text{N}_{ns}[0, C]$.
When $\Pi = P\Lambda P^{-1}$ and Λ has a real diagonal, C is obtained in Hendry (1982) and Maasoumi and Phillips (1982); Govaerts (1986) extends the derivation to complex roots. As before, asymptotic normality results.
We now briefly consider the two cases of the VAR remaining from above, which involve unit roots in some or all of the processes.
(ii) When π(1) = 0 in (A4.31), so that r = 0, the system can be transformed to a VAR in the I(0) variables $\Delta {x}_{t}$. Then the results of (i) apply to the differenced data. Otherwise, if (A4.31) is estimated in levels, the asymptotic distribution of the least‐squares estimator is given in Phillips and Durlauf (1986) who show that conventional statistics do not have the usual limiting normal distributions.
(iii) When $x_t$ is I(1), but 0 < r < n, then π can be factorized as $\alpha\beta'$, where $\alpha$ and $\beta$ are n × r matrices of rank r. Then $\alpha$ is a matrix of adjustment parameters, and $\beta$ contains the r cointegrating vectors, such that the linear combinations $\beta' x_t$ are I(0). Once it is known that there are r cointegrating vectors, it is possible to estimate the parameters of (A4.31) subject to the restriction that $\pi = \alpha\beta'$, with $\alpha$ and $\beta$ of rank r, using a procedure proposed by Johansen (1988). As with (ii), unless the system is transformed to be I(0), which would involve only $\beta' x_t$ and $\Delta x_t$ variables, the limiting distributions are not normal. We consider system cointegration in Chapters 8 and 11.
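A minimal simulation, with invented loadings, illustrates case (iii): two I(1) series sharing one common stochastic trend, so the levels wander while one linear combination $\beta' x_t$ is I(0) with bounded variance.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000
trend = np.cumsum(rng.standard_normal(T))    # common I(1) stochastic trend
x1 = trend + rng.standard_normal(T)          # I(1): trend plus I(0) noise
x2 = 0.5 * trend + rng.standard_normal(T)    # I(1): shares the same trend

# beta = (1, -2)' eliminates the trend, so beta'x_t is I(0)
combo = x1 - 2.0 * x2
var_combo = combo.var()       # stays bounded (population value 1 + 4 = 5)
var_levels = x1.var()         # grows with the sample: the levels wander
```

Here r = 1 and n = 2; the single cointegrating vector (1, −2)′ is known by construction, whereas in practice it must be estimated, e.g. by Johansen's procedure.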
A4.8.4 Limiting Distribution of OLS for a Linear Equation
Since a wide range of econometric estimators are functions of data second moments, and so their distributions asymptotically converge to functions of C, Hannan's theorem (Theorem 14) is useful for stationary processes. As an illustration, let ${x}_{t}^{\prime}=({y}_{t}\text{:}{z}_{t}^{\prime}\text{:}{w}_{t}^{\prime})$ and consider an arbitrary linear econometric equation of the form:
Neglecting the sampling variability in $(S\hat{M}_F S')$ by Cramér's theorem, so that it is replaced in (A4.43) by $(S M_F S')$:
The result in (A4.45) depends only on the distribution of the second moments of the data, and is independent of the validity of the postulated model or of any distributional properties of {e _{t}} which an investigator may have claimed. The coefficient γ need not be of any interest, and $\hat{\gamma}$ is consistent for γ only because γ is constructed as the population value of the distribution of $\hat{\gamma}$. The variance matrix $HCH'$ of the limiting distribution of $\hat{\gamma}$ need not bear any relationship to the probability limit $(S M_F S')^{-1}$ of the OLS variance formula. Thus, although asymptotic distribution theory delivers a limiting normal distribution for an arbitrary coefficient estimator in a linear stationary process, that is at best cold comfort to an empirical modeller. However, when the model has constant parameters and a homoscedastic, innovation error, so that {e _{t}} is well behaved, the OLS variance is consistently estimated by the conventional formula.
Even when the model coincides with the DGP and is stationary, convergence to normality can be relatively slow. In finite samples, OLS estimators in dynamic models are usually biased, so convergence to normality requires both that the central tendency of the distribution shifts and that its shape changes. Figures A4.2a–d show the histograms and density functions for 10 000 drawings of the OLS estimator $\hat{\rho}_1$ of ρ_{1} in y _{t} = ρ_{0} + ρ_{1} y _{t−1} + ε_{t} (p.727) where ε_{t} ∼ IN[0, 1] when ρ_{0} = 0, ρ_{1} = 0.8, at T = 10, 25, 50, and 300. When T = 10, the distribution is skewed and centred on $\text{E}[\hat{\rho}_1 \mid T=10] \simeq 0.46$ with $\text{SD}[\hat{\rho}_1 \mid T=10] \simeq 0.31$; at T = 25, the bias has fallen and $\text{E}[\hat{\rho}_1 \mid T=25] \simeq 0.66$; at T = 50, $\text{E}[\hat{\rho}_1 \mid T=50] \simeq 0.73$; and only by T = 300 does $\text{E}[\hat{\rho}_1 \mid T=300] \simeq 0.79$. An approximation to the bias is given by $\text{E}[(\hat{\rho}_1 - \rho_1) \mid T] \simeq -(1+3\rho_1)/T$ (see §A4.13.3 for the special case when ρ_{0} = 0). The SD falls somewhat faster than O(1/√T). The distribution becomes symmetric only slowly, and is noticeably skewed even at T = 300; however, derived distributions such as those for t‐tests converge more rapidly to normality.
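A Monte Carlo in the spirit of Figures A4.2a–d can be sketched as follows; the replication count is reduced from 10 000 to keep the example quick, so the reported figures are reproduced only approximately.

```python
import numpy as np

rng = np.random.default_rng(42)
rho0, rho1, T, reps = 0.0, 0.8, 25, 2000

est = np.empty(reps)
for r in range(reps):
    e = rng.standard_normal(T + 50)
    y = np.zeros(T + 50)
    for t in range(1, T + 50):        # burn-in of 50 to reach stationarity
        y[t] = rho0 + rho1 * y[t - 1] + e[t]
    y = y[50:]
    # OLS of y_t on a constant and y_{t-1}
    X = np.column_stack([np.ones(T - 1), y[:-1]])
    b = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    est[r] = b[1]

mean_rho1 = est.mean()
approx = rho1 - (1 + 3 * rho1) / T    # E[rho1_hat] ~ rho1 - (1+3 rho1)/T
```

At T = 25 the approximation gives 0.8 − 3.4/25 ≈ 0.66, matching the value quoted above for the simulated mean.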
(p.728) A4.9 Instrumental Variables
Asymptotic Distribution Theory A4.9 Instrumental VariablesThe method of instrumental variables is an extension of least squares due to Geary (1943) and Reiersøl (1945); see Sargan (1958). Consider the linear equation:
In matrix terms:
In finite samples, issues other than the rate of convergence to normality arise. Because $\text{E}[(T^{-1}Z'X)^{-1}]$ need not exist, the distribution of $\tilde{\beta}$ need not have any finite‐sample moments, so large outliers are possible. The occurrence of these depends on the probability that $(Z'X)$ comes close to singularity, as is most easily seen when k = 1. Now:
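A small simulation of the k = 1 case illustrates the outlier problem: when the instrument is weakly correlated with the regressor, $(Z'X)$ is frequently near zero, and the IV estimate $\tilde{\beta} = (Z'X)^{-1}Z'y$ occasionally explodes. The instrument strength and degree of endogeneity below are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
beta, T, reps = 1.0, 50, 5000

est = np.empty(reps)
for r in range(reps):
    z = rng.standard_normal(T)
    u = rng.standard_normal(T)
    x = 0.05 * z + u                      # weak instrument: corr(z, x) small
    e = 0.8 * u + rng.standard_normal(T)  # x endogenous: corr(x, e) != 0
    y = beta * x + e
    est[r] = (z @ y) / (z @ x)            # simple IV: (Z'X)^{-1} Z'y, k = 1

# Heavy tails: the largest deviations dwarf the interquartile spread
q75, q25 = np.percentile(est, [75, 25])
iqr = q75 - q25
max_abs = np.abs(est - np.median(est)).max()
```

The bulk of the draws cluster, but near-singular $(Z'X)$ produces Cauchy-like tails, consistent with the non-existence of finite-sample moments.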
A4.10 Mixing Processes
Asymptotic Distribution Theory A4.10 Mixing ProcessesMixing conditions restrict the memory of a stochastic process such that the distant future is virtually independent of the present and past. This provides an upper bound on the predictability of future events from present information. The past and future cannot be interchanged in these statements so that the relationship is asymmetric.
A4.10.1 Mixing and Ergodicity
By way of analogy, imagine a sample space Ω which is a cup of coffee comprising 90 per cent black coffee and 10 per cent cream carefully placed in a layer at the top which defines the initial state ${x}_{0}$ of the system. The whole space is of unit measure μ (Ω) = 1, fixed throughout, where the measure is the volume. A transformation on Ω is defined by a systematic stir of the liquid in the cup, in which a spoon is moved in a 360° circle; let S denote the transformation and ${x}_{t}$ the state of the system at time t so that:
(p.731) An important implication of uniform mixing for stationary processes is the property of ergodicity. Consider a subset $\mathcal{C}$; then S is ergodic if $S\mathcal{C}=\mathcal{C}$ implies that $\mu (\mathcal{C})=0$ or 1. In terms of the example, the transformation S is ergodic if the only sets which are invariant under S are the null set ∅ and the whole space Ω. This property certainly holds for mixing cream into coffee. Consider any $\mathcal{C}$ such that $S\mathcal{C}=\mathcal{C}$, when S is uniform mixing from (A4.52):
The importance of this result derives from what is entailed by the statistical ergodic theorem (see e.g. Walters, 1975), namely, if a strictly stationary stochastic process {X _{t}} is ergodic with finite expectation, then the time average for one realization:
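The content of the statistical ergodic theorem can be illustrated by simulating one long realization of a stationary, ergodic first-order autoregression and comparing its time average with the population expectation; the parameter values below are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
T, mu, beta = 20000, 2.0, 0.7

# One long realization of a stationary, ergodic AR(1) around mean mu
y = np.empty(T)
y[0] = mu
for t in range(1, T):
    y[t] = mu + beta * (y[t - 1] - mu) + rng.standard_normal()

time_average = y.mean()   # converges a.s. to E[y_t] = mu
```

A single realization suffices: ergodicity lets the time average stand in for the ensemble average.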
A4.10.2 Uniform Mixing and α‐Mixing Processes
Formally, let $\mathcal{F}_m^k$ denote the σ‐field generated by $(X_m, \ldots, X_k)$, and let $\mathcal{A} \in \mathcal{F}_{-\infty}^{t}$ and $\mathcal{B} \in \mathcal{F}_{t+\tau}^{\infty}$.
Definition 13. {X _{t}} is a uniform mixing process if for integer τ > 0, and $\text{P}(\mathcal{A})>0$:
Further, D13 implies that:
For example, if {X _{t}} is an IID process, then D13 is satisfied with φ (τ) = 0 for τ ≥ 1, so {ε _{t}} in (A4.31) is a uniform mixing process. Similarly, an m‐dependent process, which is a process where terms more than m‐periods apart are independent, as in an m ^{th}‐order moving average, is mixing with φ (τ) = 0 for τ ≥ m + 1. Moreover, if x _{t} is given by (A4.31), then:
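The 1-dependence of a first-order moving average can be checked by simulation: the sample autocorrelation at lag 1 approaches the population value $\theta/(1+\theta^2)$, while at lags of two or more, where terms are independent, it vanishes.

```python
import numpy as np

rng = np.random.default_rng(5)
T, theta = 100000, 0.6
eps = rng.standard_normal(T + 1)
x = eps[1:] + theta * eps[:-1]      # MA(1): a 1-dependent process

def acf(series, lag):
    # sample autocorrelation at the given lag
    d = series - series.mean()
    return (d[:-lag] @ d[lag:]) / (d @ d)

rho1 = acf(x, 1)   # population value theta / (1 + theta^2)
rho2 = acf(x, 2)   # population value 0: terms two apart are independent,
                   # matching phi(tau) = 0 for tau >= m + 1 = 2
```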
An important implication of uniform mixing is ergodicity as noted above, so that {x _{t}} in (A4.31) is ergodic when the process is stationary, and:
Theorem 15. When {X _{t}} is mixing with φ_{x} (τ) ≈ O(τ^{−r}) where r > 0, let Y _{t} = h _{t}(X _{t}, . . . , X _{t−k}) for a finite integer k > 0, where h _{t}(·) is a measurable function onto $\mathbb{R}$; then {Y _{t}} is mixing with φ_{y} (τ) ≈ O(τ^{−r}).
Theorem 16. Further, when Z _{t}(θ) = b _{t}(Y _{t}, θ) where b _{t}(·) is a function measurable in Y onto $\mathbb{R}$ and is continuous on Θ, then φ_{z} (τ, θ) ≤ φ_{y} (τ) ∀ θ ∈ Θ, so Z _{t}(θ) is mixing. Consequent on this last theorem, quite complicated functions of mixing processes remain mixing.
A4.10.3 Laws of Large Numbers and Central‐Limit Theorems
The above developments allow us to formulate the analogue of a SLLN for mixing processes.
Theorem 17. If, in addition to the conditions in Theorems 15 and 16, there exist measurable functions d _{t}(·) such that |b _{t}(Y _{t}, θ)| ≤ d _{t}(Y _{t}) ∀ θ ∈ Θ, and m ≥ 1 and δ > 0 such that E[|d _{t}(Y _{t})|^{m+δ}] ≤ Δ < ∞ ∀ t, with r > m/(2m − 1), then:
Next, we have a central‐limit theorem for mixing processes of the form {Z _{t}} subject to the further conditions that:

(a) E[Z _{t}] = 0 and E[Z _{t}^{2m}] ≤ Δ < ∞ ∀ t and m > 1;

(b) there exists a sequence $\{V_T\}$, $0 < \delta \le V_T \le \Delta$, such that for $S_T(j) = T^{-\frac{1}{2}} \sum_{t=j+1}^{j+T} Z_t$, $\text{E}[S_T(j)^2] - V_T \to 0$ uniformly in j, where $\{V_T\}$ need not (p.733) be constant, but its variability depends only on T and not on the starting point of the sum (i.e. the date). Then:
Theorem 18. If {Z _{t}} satisfies the conditions of Theorems 15–17 and conditions (a) and (b), with r > m/(m − 1):
A4.11 Martingale Difference Sequences
Asymptotic Distribution Theory A4.11 Martingale Difference SequencesA4.11.1 Constructing Martingales
The earlier laws of large numbers and central‐limit theorems were framed in terms of independent random variables. For data‐generation processes with autonomous innovations, this restriction is not too serious; but, for models defined by their error processes just being an innovation relative to the information used in the study, the assumption of independence is untenable. The necessary statistical apparatus already exists for developing a theory relevant for models, in the form of martingale limit theory (see Hall and Heyde, 1980). This theory includes as special cases many of the limit results discussed in earlier sections, since sums of independent random variables (as deviations about their expectations) are in fact martingales. Indeed, there is an intimate link between martingales, conditional expectations on σ‐fields, and least‐squares approximations, which is why the associated theory is useful for econometric modelling.
Let {X _{t}} be a sequence of zero‐mean random variables on the probability space $\{\Omega, \mathcal{F}, \text{P}\}$, and let $X_{(t-1)} = (X_{t-1}, \ldots, X_{t-n}, \ldots)$ such that:
(p.734) A salient feature of economic data is their high degree of inter‐dependence and temporal dependence. Thus, let ${X}_{t}$ denote a vector random process with respect to the probability space $(\Omega \text{,}\mathcal{F}\text{,}\text{P})$. As in Chapter A2, the joint data density D_{x}(·) can be sequentially conditioned as in (A4.3). Then a linear representation of ${X}_{t}$, given the past, takes the form:
A4.11.2 Properties of Martingale‐difference Sequences
If {X _{t}} is an MDS, with associated martingale $Y_T = \sum_{t=1}^{T} X_t$, whose second moment exists ∀ t, we can generalize Chebychev's inequality to:

(a) if $\sum_{t=1}^{\infty} t^{-(1+r)}\,\text{E}[|X_t|^{2r}] < \infty$ for r ∈ [1, 2], then $T^{-1}Y_T \stackrel{\mathrm{AS}}{\to} 0$.

(b) if $\text{E}[|X_t|^{2r}] < B < \infty\ \forall t$ and some r ∈ [1, 2], then $T^{-1}Y_T \stackrel{\mathrm{AS}}{\to} 0$.
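A concrete MDS that is dependent but not independent is $X_t = \epsilon_t\epsilon_{t-1}$ for IID $\{\epsilon_t\}$: here $\text{E}[X_t \mid \mathcal{I}_{t-1}] = \epsilon_{t-1}\text{E}[\epsilon_t] = 0$ even though successive terms share an $\epsilon$. A simulation sketch of the strong law in (b) for this process:

```python
import numpy as np

rng = np.random.default_rng(11)
T = 100000
eps = rng.standard_normal(T + 1)

# X_t = eps_t * eps_{t-1}: dependent, yet E[X_t | past] = 0, so an MDS
x = eps[1:] * eps[:-1]

# T^{-1} Y_T for each sample size T: should tend to zero
y_over_T = x.cumsum() / np.arange(1, T + 1)
final = y_over_T[-1]
```

All moments of $X_t$ are bounded here, so condition (b) holds trivially.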
Before discussing asymptotic distributions based on MDSs, we need an important additional concept. We met Borel sets in Chapter A2 when we wished to construct random variables from the basic event space, in terms of the sequence of intervals $\mathcal{B}_z = (-\infty, z]$, which are the sets $\{x : -\infty < x \le z\}$ for $z \in \mathbb{R}$. Technically, the Borel field $\mathcal{B}$ is the smallest collection of the $\mathcal{B}_z$, $z \in \mathbb{R}$, which is closed under complements and countable unions. P(·) was well defined for such sets, which comprised all the relevant events, and we moved to a new probability space $(\mathbb{R}, \mathcal{B}, \text{P}_x)$. If we think of the sample space here as $\mathbb{R}$ repeated indefinitely often (to allow for infinite sequences), then the Borel sets again generate an event space. This approach generalizes to q‐dimensional vector random variables X, with a probability space which we could write as $(\mathbb{R}_{\infty}^{q}, \mathcal{B}, \text{P}_X)$. If the basic event space increases over time due to information accrual, so does the Borel field, and we can now exploit that feature. The following two theorems sustain the use of martingale difference sequences for asymptotic distributions of model estimates (see e.g. Whittle, 1970, and White, 1984).
Theorem 19. Let $\{\mathcal{B}_n\}$ be an increasing sequence of Borel fields, $\mathcal{B}_n \subseteq \mathcal{B}_{n+1}$, and Z be a fixed random variable, where $Y_n = \text{E}[Z \mid \mathcal{B}_n]$. Then:
Further:
Theorem 20. If {Y _{T}} is a martingale for which $\text{E}[Y_T^2]$ is uniformly bounded, then Y _{T} converges a.s. to a genuine random variable $Y = \text{E}[Z \mid \mathcal{B}_{\infty}]$.
The key here is that:

(i) if X _{t} ∼ IID[μ, σ^{2}], then $T(\overline{X}_T - \mu)$ is a martingale and so, by (a), ${\overline{X}}_{T}\stackrel{\mathrm{AS}}{\to}\mu $;

(ii) a sequence of conditional expectations on an increasing sequence of ${\mathcal{B}}_{n}$ is a martingale, so the innovations from a congruent model are a martingale difference sequence to which the above theorems apply (see Ch. 9).
Finally, we need a central‐limit theorem for MDS corresponding to the Lindeberg–Lévy result earlier:
Theorem 21. If $\{X_t\}$ is an MDS with $\text{E}[X_t^2] = \sigma_t^2$ where $0 < \sigma_t^2 < \infty$, and $W_T = T^{-1}\sum_{t=1}^{T} X_t$, with $S_T^2 = \sum_{t=1}^{T} \sigma_t^2$, when:
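A simulation sketch of the martingale CLT, using the same product-form MDS as above (so $\sigma_t^2 = 1$ and $S_T^2 = T$) and standardizing the sum by $S_T$:

```python
import numpy as np

rng = np.random.default_rng(13)
T, reps = 500, 4000

stats = np.empty(reps)
for r in range(reps):
    eps = rng.standard_normal(T + 1)
    x = eps[1:] * eps[:-1]        # MDS with sigma_t^2 = 1 for every t
    s_T = np.sqrt(T)              # S_T^2 = sum of sigma_t^2 = T
    stats[r] = x.sum() / s_T      # standardized martingale sum

# The standardized sums should be close to N[0, 1]
m, s = stats.mean(), stats.std()
```

Despite the dependence between successive $X_t$, the standardized sums are approximately standard normal, as the theorem asserts.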
A4.11.3 Applications to Maximum‐likelihood Estimation
We maintain the regularity assumptions made in Chapter A3, assume stationarity and uniform mixing of {X _{t}}, but weaken the sampling assumption, using results in Crowder (1976). Mixing ensures that information continually accrues, yet the process is ergodic, as shown in §A4.10. Let $\ell_t(\theta) = \ell(\theta; x_t \mid \mathcal{I}_{t-1})$ denote the log‐likelihood function for one observation on the random variable X from the sequential density $\text{D}_x(x_t \mid \mathcal{I}_{t-1}, \theta)$ for $\theta \in \Theta \subseteq \mathbb{R}^k$ and history $\mathcal{I}_{t-1}$, and let:
Next, we need to relate that distribution to the limiting distribution of the MLE. A first‐order Taylor‐series expansion around ${\theta}_{p}$ is taken to be valid for $\theta \in \mathcal{N}({\theta}_{p})$, so that asymptotically, the average likelihood function is quadratic in the neighbourhood of ${\theta}_{p}$ (a mean‐value theorem could be used, given the consistency of the MLE noted below):
Next, from the mean‐value expression corresponding to (A4.80), but for the MLE:
Transient parameters can be allowed for, as can logistic growth in the data series. This provides a general result for maximum‐likelihood estimation in stationary, ergodic processes, with considerable initial data dependence and some heterogeneity, extending section A4.7.
A4.12 A Solved Autoregressive Example
Asymptotic Distribution Theory A4.12 A Solved Autoregressive ExampleConsider the following stationary data generation process for a random variable y _{t}:

(a) Obtain the population moments of the process.

(b) Derive (i) $\text{E}[T^{-1}\sum_{t=2}^{T} y_{t-1} e_t]$; (ii) $\text{E}[T^{-1}\sum_{t=1}^{T} y_t^2]$; (iii) $\text{E}[T^{-1}\sum_{t=2}^{T} y_t y_{t-1}]$; and (iv) $V = \text{V}[T^{-1}\sum_{t=1}^{T} y_t^2]$.

(c) Derive the limiting distribution of the sample mean.

(d) Obtain the limiting distribution of the least‐squares estimator of β.
(Adapted from Oxford M.Phil., 1987)
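Before turning to the analytic solution, the moments in (a) and (b) can be previewed numerically, assuming for illustration that (A4.86) is the zero-mean first-order autoregression $y_t = \beta y_{t-1} + e_t$ with $e_t \sim \text{IN}[0, \sigma_e^2]$ and |β| < 1 (the parameter values below are invented):

```python
import numpy as np

rng = np.random.default_rng(17)
beta, sigma, T = 0.5, 1.0, 200000

# One long realization of the assumed AR(1) DGP
y = np.zeros(T)
for t in range(1, T):
    y[t] = beta * y[t - 1] + sigma * rng.standard_normal()

m2 = (y ** 2).mean()            # should approach sigma^2 / (1 - beta^2)
m11 = (y[1:] * y[:-1]).mean()   # should approach beta * sigma^2 / (1 - beta^2)
```

With β = 0.5 and σ = 1 the sample moments should settle near 4/3 and 2/3, the stationary values derived analytically below.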
Solution to (a)
When |β| < 1, the population moments of a first‐order autoregressive process are derived as follows. Solve (A4.86) backwards in time as: