## David F. Hendry

Print publication date: 1995

Print ISBN-13: 9780198283164

Published to Oxford Scholarship Online: November 2003

DOI: 10.1093/0198283164.001.0001


PRINTED FROM OXFORD SCHOLARSHIP ONLINE (www.oxfordscholarship.com). (c) Copyright Oxford University Press, 2017. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a monograph in OSO for personal use (for details see http://www.oxfordscholarship.com/page/privacy-policy).

# A4 Asymptotic Distribution Theory

Source: Dynamic Econometrics

Publisher: Oxford University Press

Even when the correct member of a family of probability distributions is known, deriving the exact distributions of statistics is often difficult. Here we investigate what happens in indefinitely large samples. The analysis commences from measures of orders of magnitude, and concepts of stochastic convergence. Laws of large numbers and central‐limit theorems are reviewed for the simplest case of an independent, identically‐distributed, scalar random variable. Vector processes are analysed, and two worked examples are presented to illustrate the various steps in practical derivations as well as some of the problems which can ensue. Next, stationary dynamic processes are studied, and instrumental variables estimators are described, followed by mixing and martingale processes. These methods are applied to maximum likelihood estimation. The chapter concludes with a brief look at asymptotic approximations.

# A4.1 Introduction


Statistical analysis in econometrics distinguishes between the unknown data‐generation process and the econometric model which attempts to represent the salient data features in a simplified form. Models are of necessity incomplete representations of reality, and the issue of their properties is discussed in the main text. The problem considered now is the behaviour of estimators and tests based on models when the data‐generation process is only specified in general terms. For pedagogic purposes, the analysis proceeds through simple cases (homogeneous, independent, identically‐distributed random variables) to explain the basic ideas, and then introduces complications which arise almost inevitably when analysing economic time series (dependence and heterogeneity, with only low‐order moments of the underlying distributions). The present section describes the framework for the analysis, why asymptotic distribution theory is needed and what problems must be confronted, followed in sections A4.2–A4.7 by formal definitions and notation, orders of magnitude, concepts of stochastic convergence, laws of large numbers, and central‐limit theorems for IID random variables, generalizations to vector random variables, and two detailed examples. Sections A4.8–A4.11 deal with dependent data series, focusing on dynamic processes under stationarity and ergodicity, and discuss mixing conditions and martingale limit theory. Sections A4.9 and A4.11 discuss instrumental‐variables and maximum‐likelihood estimation as important applications. Finally, some of the issues involved in asymptotic expansions in simple cases are noted.

In many ways, the use of asymptotic theory, which is a statistical distribution theory for large numbers of observations, is an admission of failure relative to obtaining exact sampling distributions. However, the latter are not only often intractable given present techniques, but can be unenlightening even when obtained due to their complicated forms. Thus, simplifications resulting from assuming large samples of observations can clarify the essential properties of complicated estimators and tests, although one must then examine the usefulness of such asymptotic findings for relevant sample sizes.

As an illustration, consider a random sample of $T$ observations drawn from an unknown density function $D_X(x \mid \theta)$ for $\theta \in \Theta$. The distribution is assumed by an investigator to be $G_X(x; \mu)$ where $\mu \in \mathbb{R}$. A sample statistic $W_T(x_1, \ldots, x_T)$ is formed as an estimate of $\mu$. The properties of $W_T(X)$ depend on those of $D_X(\cdot)$, and are characterized by its distribution function $F_{w_T}(\cdot)$:

$$F_{w_T}(r) = P(W_T \le r)$$
(A4.1)
Often, the CDF is analytically intractable, yet, as $T$ increases, it takes on an increasingly simple form, which can be used as an approximation to (A4.1) when $T$ is sufficiently large. This limiting distribution, denoted by $F_w(\cdot)$, provides a first‐order approximation, and, if that is not sufficiently accurate, higher‐order approximations may be developed (see § A4.13 for a more precise statement). The sense of the word ‘approximation’ must be carefully specified, since $F_w(\cdot)$ may not be adequate in all desired senses. For example, (A4.1) may be a distribution which possesses no moments, whereas all the moments of $F_w(\cdot)$ could exist, and this may or may not be an important simplification and/or a serious limitation.

Letting $W$ be a normal random variable with CDF $F_w(\cdot)$, assume for illustration that:

$Display mathematics$
(A4.2)
Then $E[W_T]$ does not exist whereas $E[W]$ does by assumption. If $\alpha = 10^{10}$ and $T > 10$, then $W_T$ and $W$ coincide for all practical purposes despite the divergence in their moments. Conversely, if $W_T = W + 10^{\alpha}/T$ (for the same $\alpha$), $W_T$ and $W$ coincide for ‘sufficiently large $T$’, but are not remotely alike in practice even when all moments of both exist.

Equation (A4.2) expresses the general nature of one approach to deriving limiting distributions, namely to relate $W_T$ to a simpler random variable $W$ whose distribution is known, such that $(W_T - W)$ converges to zero (in a sense to be made precise) as $T$ increases without bound. Many concepts of convergence are possible, and these cannot all be ordered such that type A implies type B implies C etc.; major alternatives are discussed in § A4.3 after noting some notions of the ‘size’ of the difference between $W_T$ and $W$ in § A4.2.

All of the main results have to be established for vector random variables which may be non‐linear functions of the original random variables $\{X_t\}$. Thus, we wish to know how a continuous function $g(W_T)$ behaves as $T \to \infty$. Such behaviour usually relates to both central tendency, for which laws of large numbers apply, and the form of the distribution for large $T$, for which central‐limit theorems are required. For independent sampling, various theorems are reported in sections A4.4 and A4.5, and generalized to vector processes in §A4.6. Section A4.7 then discusses two examples in detail. The level in these sections is introductory, in as much as IID assumptions are usually made.

The later sections are more advanced. Dependent random variables are introduced in §A4.8 for data processes which are stationary and ergodic, then §A4.10 and §A4.11 consider mixing conditions and martingale processes. These have recently been used in econometrics as the basis for establishing limiting distributions, and seen to offer a natural characterization of important features of suitably transformed economic time series, while being sufficiently strong to allow useful limit results to be established. The following comments are intended to motivate the usage of mixing conditions and martingales in deriving asymptotic distributions.

Heuristically, a process is uniformly mixing if knowledge as to its present state is uninformative about its state in the distant future. For economic systems, current knowledge about the state of the system is often informative about nearby future states, but seems relatively uninformative concerning distant states. Thus, the assumption of uniform mixing is not unreasonable for appropriately transformed economic variables. Moreover, with mixing, the system has a long but restricted memory: $X_T$ and $X_{T+n}$ become independent as $n \to \infty$, so that new information about the system continually accrues from larger samples, a condition of importance for the convergence of estimators and the consistency of tests.

More generally, let $X_t$ denote a vector random process with respect to the probability space $(\Omega, \mathcal{F}, P)$. As in Chapter A2, the joint data density $D_X(\cdot)$ can be sequentially factorized as:

$Display mathematics$
(A4.3)
where $\mathcal{I}_{t-1}$ denotes previous information (strictly, the σ‐field generated by the history of the process up to $t-1$). By removing the predictable component from the past, the remainder acts like the difference of a martingale, called a martingale difference process as shown in §A4.11.

Finally, since the data are generated by $D_X(\cdot)$, but a model may be based on $G_X(\cdot)$, it is natural to consider theorems relating to the behaviour of data moments in large samples (usually the first two moments suffice), and derive the properties of estimators therefrom (see e.g. Hannan, 1970, Ch. 4, and Hendry, 1979a). For mixing processes, direct derivations are possible (see e.g. White, 1984, and Domowitz and White, 1982). Sections A4.9 to A4.11 provide some illustrations.

It is worth stressing that we use asymptotic theory in order to derive approximate distributions in finite samples, and the validity of such results does not depend on the economic process remaining constant as T → ∞. Thus, even though $θ$ may change in the next period, this does not affect the accuracy of approximating (A4.1) by Fw(·), although it does affect the usefulness of the underlying econometric model. Also, no attempt is made in what follows either to produce the most general possible results, or specify the weakest possible assumptions; nor is comprehensive coverage sought. Rather, we aim to provide a basis for results relevant to the class of models discussed in the main text, taking account of subject‐matter considerations, but simplifying by stronger than necessary assumptions where this eases the derivations. Useful references to asymptotic distribution theory in the field of econometrics are Davidson (1994), Davidson and MacKinnon (1993), McCabe and Tremayne (1993), Sargan (1988), Schmidt (1976), Spanos (1986), Theil (1971) and White (1984); in statistics, see Cox and Hinkley (1974), Cramér (1946), Rao (1973) and Wilks (1962); and in time‐series analysis, see Anderson (1971), Fuller (1976), Hannan (1970) and Harvey (1981a). Formal proofs are not provided for most of the theorems presented below, but may be found in these references.

Finally, as noted earlier, while limiting distributional results both clarify the relationships in large samples between apparently different methods and provide insight into the properties of estimators and tests, nevertheless, the accuracy of such results is dependent on the true value of $θ$ as well as on T. Thus, for important problems, a check should always be made on the accuracy of the asymptotic approximation for the problem at hand, either by Monte Carlo methods (as in Ch. 3) or by higher‐order approximations as described in the final section.

# A4.2 Orders of Magnitude


## A4.2.1 Deterministic Sequences

Let $\{W_t\}$, $\{X_t\}$, and $\{Y_t\}$ denote sequences of scalar random variables and $\{A_t\}$, $\{B_t\}$, and $\{C_t\}$ denote sequences of non‐stochastic real numbers for $t = 1, \ldots, T$, where we allow $T \to \infty$. We first define convergence for a deterministic sequence.

Definition 1. The sequence $\{A_T\}$ converges to the constant $A$ as $T \to \infty$, if $\forall\, \delta > 0$, there exists a $T_\delta$, such that $\forall\, T > T_\delta$, $|A_T - A| < \delta$. We write $\lim_{T \to \infty} A_T = A$, or $A_T \to A$.

From the definition of continuity, if $g(\cdot)$ is a function continuous at $A$, and $A_T \to A$, then $g(A_T) \to g(A)$. Next, we define the order of magnitude for a deterministic sequence.

Definition 2. $\{A_T\}$ is at most of order one, denoted by $\{A_T\} \approx O(1)$, if the sequence is bounded as $T \to \infty$, namely, there exists a constant, finite $M > 0$ such that $|A_T| < M$ $\forall\, T$. Further $\{A_T\} \approx O(T^k)$ if $\{T^{-k} A_T\} \approx O(1)$.

Definition 3. $\{B_T\}$ is of smaller order than $T^n$ if $\lim_{T \to \infty} T^{-n} B_T = 0$, denoted by $\{B_T\} \approx o(T^n)$.

If $\{B_T\} \approx o(T^n)$, then $\{T^{-n} B_T\}$ is bounded as $T \to \infty$, so that $\{B_T\} \approx O(T^n)$, but the converse is not necessarily true: that $\{B_T\} \approx O(T^n)$ need not imply that $\{B_T\} \approx o(T^n)$. Further, from these definitions (denoted by D1, D2 etc.):

Theorem 1. If $\{A_T\} \approx O(T^k)$ and $\{B_T\} \approx O(T^s)$, then:

(a) $\{A_T\} \approx o(T^{k+m})$ $\forall\, m > 0$;

(b) $\{A_T \pm B_T\} \approx O(T^r)$ where $r = \max(k, s)$;

(c) $\{A_T B_T\} \approx O(T^{k+s})$.

Since $O(T^k) + o(T^k) \approx O(T^k)$, terms in $o(\cdot)$ are negligible for sufficiently large $T$ relative to equivalent terms in $O(\cdot)$ and can be ignored. Thus, from (b) and (c):
$Display mathematics$
(A4.4)

## A4.2.2 Stochastic Sequences

These results on $O(\cdot)$ and $o(\cdot)$ can be applied to standard deviations to provide one characterization of the order of magnitude of a random variable. For example, when $X_t \sim \mathsf{IN}[\mu_x, \sigma_x^2]$, then:

$$\bar{X}_T = \frac{1}{T} \sum_{t=1}^{T} X_t \sim N\left[\mu_x, \frac{\sigma_x^2}{T}\right]$$
(A4.5)
has a standard error $\sigma_x/\sqrt{T}$ of $O(1/\sqrt{T}) \approx o(1)$, and hence $\bar{X}_T$ has a degenerate limiting distribution (i.e. with a zero variance). Nevertheless, at all sample sizes:
$$\sqrt{T}\,(\bar{X}_T - \mu_x) \sim N[0, \sigma_x^2]$$
(A4.6)
and hence has a standard error of $O(1)$. Thus, $\sqrt{T}$ is the scaling transformation of $(\bar{X}_T - \mu_x)$ required to produce a finite‐variance, non‐degenerate, limiting distribution. However, not all small‐sample distributions have finite variances.
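The scaling argument can be checked numerically. The following is a minimal simulation sketch, not part of the original text: the function name `mean_spread`, the values $\mu_x = 2$ and $\sigma_x = 3$, the seed and the number of replications are all illustrative assumptions. It estimates the standard deviation of $\bar{X}_T$ and of $\sqrt{T}(\bar{X}_T - \mu_x)$ across replications for several $T$.

```python
import math
import random
import statistics


def mean_spread(T, reps=2000, mu=2.0, sigma=3.0, seed=0):
    """Return (sd of the sample mean, sd of sqrt(T) * (mean - mu)) over `reps` samples.

    mu, sigma, reps and seed are illustrative choices, not values from the text.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(T))
        for _ in range(reps)
    ]
    raw_sd = statistics.pstdev(means)
    scaled_sd = statistics.pstdev([math.sqrt(T) * (m - mu) for m in means])
    return raw_sd, scaled_sd


if __name__ == "__main__":
    for T in (25, 100, 400):
        raw_sd, scaled_sd = mean_spread(T)
        print(f"T={T:4d}  sd(mean)={raw_sd:.3f}  sd(sqrt(T)*(mean-mu))={scaled_sd:.3f}")
```

Under these assumptions the first column should shrink roughly in proportion to $1/\sqrt{T}$, while the second should stay close to $\sigma_x$ at every $T$.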

When we consider stochastic sequences, the concept ‘bounded’ must be replaced by ‘bounded in probability’:

Definition 4. $\{W_T\}$ is at most of order unity in probability, denoted by $\{W_T\} \approx O_p(1)$, if $\forall\, \varepsilon > 0$, $\exists$ finite $M_\varepsilon > 0$ and $T_\varepsilon > 0$ such that $\forall\, T > T_\varepsilon$, $P(|W_T| > M_\varepsilon) < \varepsilon$.

Thus, $\{W_T\} \approx O_p(1)$ if for large enough $T$, there is a negligibly small probability of the absolute value of its terms exceeding an upper bound, which may depend on the probability $\varepsilon$. Correspondingly, $\{W_T\} \approx O_p(T^k)$ if $\{T^{-k} W_T\} \approx O_p(1)$.

Definition 5. $\{Y_T\} \approx o_p(1)$ if $\forall\, \varepsilon > 0$, $\lim_{T \to \infty} P(|Y_T| > \varepsilon) = 0$.

Hence $1/(T^{\alpha} \nu)$ in (A4.2) is $o_p(1)$ for $\alpha > 0$. Similar results to Theorem 1 apply to $O_p(\cdot)$ and $o_p(\cdot)$. For example, if:

$Display mathematics$
(A4.7)
then $\{X_T\} \approx O_p(1)$ and $\{X_T - W_T\} = Y_T \approx o_p(1)$. We can treat $O_p(\cdot)$ and $o_p(\cdot)$ as if they were $O(\cdot)$ and $o(\cdot)$ respectively (see Madansky, 1976).

Theorem 2. If $f_T(\cdot) \approx O(A_T)$ and $g_T(\cdot) \approx o(B_T)$ imply that $h_T(\cdot) \approx O(C_T)$, then $f_T(W_T) \approx O_p(A_T)$ and $g_T(W_T) \approx o_p(B_T)$ imply that $h_T(W_T) \approx O_p(C_T)$.

For example, if $\{X_T\}$ and $\{Y_T\}$ are $O_p(T^k)$ and $O_p(T^s)$ respectively, then $\{Y_T X_T\}$ is $O_p(T^{k+s})$.

Finally, we can paraphrase the earlier result on using the order of magnitude of a standard deviation to determine a finite‐variance limiting distribution: if $\{W_T\}$ has a variance of $O(1)$, then $\{W_T\} \approx O_p(1)$, although the converse is false. Mann and Wald (1943a) developed most of the results in this section.

# A4.3 Stochastic Convergence


We now apply the notions of stochastic orders of magnitude to ascertain the relationships between two (or more) random variables. Two random variables could be deemed to converge if:

(a) their distribution functions converge;

(b) their difference is $o_p(1)$;

(c) the variance of their difference exists $\forall\, T$ and is $o(1)$;

(d) they are equal except on a set of probability zero.

These notions need not coincide, as shown below. Precise definitions for (a)–(d) are:

Definition 6. $W_T$ tends in distribution to $W$, denoted by $W_T \xrightarrow{D} W$, if for all continuity points of $F_w(\cdot)$, $\lim_{T \to \infty} F_{w_T}(r) = F_w(r)$, where $F_w(r) = P(W \le r)$ denotes the CDF of $W$.

More generally, if for all continuity points of a CDF $F_w(\cdot)$, $\lim_{T \to \infty} F_{w_T}(r) = F_w(r)$, then $F_{w_T}(r)$ converges weakly to $F_w(r)$, written as $F_{w_T}(r) \Rightarrow F_w(r)$.

Definition 7. $W_T$ tends in probability to $W$ if $\lim_{T \to \infty} P(|W_T - W| > \varepsilon) = 0$ $\forall\, \varepsilon > 0$, denoted by $W_T \xrightarrow{P} W$. Thus, from D5, $(W_T - W) \approx o_p(1)$.

Definition 8. $W_T$ tends in mean square to $W$, denoted $W_T \xrightarrow{MS} W$, if $E[W_T]$ and $E[W_T^2]$ exist $\forall\, T$ and $\lim_{T \to \infty} E[(W_T - W)^2] = 0$.

Definition 9. $W_T$ tends almost surely to $W$ if $\lim_{T \to \infty} P(|W_t - W| \le \varepsilon,\ \forall\, t \ge T) = 1$ $\forall\, \varepsilon > 0$, denoted $W_T \xrightarrow{AS} W$.

Since CDFs are real numbers, ordinary convergence holds in D6: the more general definition holds without specific reference to an associated random variable, and weak convergence will prove of importance in the main text. In D9, the whole collection of $\{|W_t - W|,\ t \ge T\}$ must have a limiting probability of unity of being less than or equal to $\varepsilon$, so that the two random variables must come together except for a set of probability zero. An equivalent definition is:

Definition 10. $W_T \xrightarrow{AS} W$, if $P(\lim_{T \to \infty} W_T = W) = 1$.

Almost sure convergence implies that $W_T$ converges to $W$ for almost every possible realization of the stochastic process. Thus, D9 implies D7. When definition D8 holds, from Chebychev's inequality (see § A2.7):

$$P(|W_T - W| > \varepsilon) \le \frac{E[(W_T - W)^2]}{\varepsilon^2} \to 0 \quad \text{as } T \to \infty$$
(A4.8)
so D8 implies D7. However, neither D7 nor D9 implies D8 (since no moments need exist), nor is D8 by itself sufficient to ensure D9. If $(W_T - W) \approx o_p(1)$, the difference between $W_T$ and $W$ is asymptotically negligible, so D7 implies D6. Finally, if $W$ is a constant, D6 implies D7 since the CDF is degenerate and becomes a step function at $W = \kappa$:
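$$F_w(r) = \begin{cases} 0 & \text{if } r < \kappa \\ 1 & \text{if } r \ge \kappa \end{cases}$$
(A4.9)
and hence $\lim_{T \to \infty} P(|W_T - \kappa| \ge \varepsilon) = 0$ $\forall\, \varepsilon > 0$, which is the probability limit, written as:
$$\operatorname{plim}_{T \to \infty} W_T = \kappa$$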

Summarizing these results in schematic form, where $\Rightarrow$ denotes implies:

$$\text{D9} \Rightarrow \text{D7} \Rightarrow \text{D6} \quad \text{and} \quad \text{D8} \Rightarrow \text{D7}$$
with none of the converse implications holding, except for $\text{D6} \Rightarrow \text{D7}$ if $W = \kappa$: this last condition is sufficient for $W_T$ to be $O_p(1)$, but not necessary, so that if $W_T$ is $O_p(1)$, it does not follow that $\operatorname{plim}_{T \to \infty} W_T$ is a constant. Further, D6 does not imply that the moments of $W_T$ tend in probability to the moments of $W$ unless the random variables have bounded range. Finally, if $W_T$ tends in distribution to $W$, we say that $W_T$ is asymptotically distributed as $W$, and denote that expression by $\tilde{a}$, writing $W_T \mathrel{\tilde{a}} W$. In such a case:

Theorem 3. If $\operatorname{plim}_{T \to \infty} (W_T - W) = 0$, $\operatorname{plim}_{T \to \infty} Y_T = c$ and $|Y_T - c| \ge |X_T - c|$, for constants $a, b, c > 0$, then:

(i) $\lim P(aW_T + bY_T \le r) = F_w([r - bc]/a)$;

(ii) $\lim P(W_T Y_T \le r) = F_w(r/c)$;

(iii) $\operatorname{plim}_{T \to \infty} X_T = c$.

For example, Theorem 3(i) follows because $P(aW_T + bY_T \le r) \to P(aW + bc \le r) = P(W \le (r - bc)/a)$. Generally, we omit the limiting argument when no confusion is likely, writing $\operatorname{plim} X_T = c$.

Next, a useful theorem due to Slutsky, which follows from D7 and the definition of continuity:

Theorem 4. When $g(\cdot)$ is a continuous function independent of $T$, if $W_T \xrightarrow{P} W$, then $g(W_T) \xrightarrow{P} g(W)$; and if $W_T \xrightarrow{AS} W$, then $g(W_T) \xrightarrow{AS} g(W)$.

Since $\xrightarrow{P}$ implies $\xrightarrow{D}$, if $W = \kappa$, then Slutsky's theorem implies that:

$Display mathematics$
Theorem 4 has many important applications. For example, if $W_T \xrightarrow{P} W \sim N[0, 1]$ then, since $\text{D7} \Rightarrow \text{D6}$, $W_T^2 \mathrel{\tilde{a}} \chi^2(1)$.
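The $\chi^2(1)$ application can be illustrated by simulation. This is a minimal sketch under assumptions that are not in the original text: the sequence $W_T = W + V/T$ with $V \sim N[0, 1]$ independent of $W$ is a hypothetical choice that satisfies $W_T \xrightarrow{P} W$, and the names `simulate` and `CHI2_1_95`, the tolerance 0.05 and the seed are illustrative.

```python
import random

# 95% point of chi-squared(1): the square of the 97.5% standard normal quantile.
CHI2_1_95 = 1.959964 ** 2


def simulate(T, reps=20_000, seed=3):
    """Return (share of W_T^2 below the chi2(1) 95% point, share with |W_T^2 - W^2| > 0.05).

    W_T = W + V / T is a hypothetical sequence chosen so that W_T - W is o_p(1).
    """
    rng = random.Random(seed)
    inside = 0
    far = 0
    for _ in range(reps):
        w = rng.gauss(0.0, 1.0)             # limit variable W ~ N[0, 1]
        w_t = w + rng.gauss(0.0, 1.0) / T   # W_T differs from W by an o_p(1) term
        inside += w_t ** 2 <= CHI2_1_95
        far += abs(w_t ** 2 - w ** 2) > 0.05
    return inside / reps, far / reps


if __name__ == "__main__":
    for T in (1, 10, 1000):
        coverage, far = simulate(T)
        print(f"T={T:5d}  P(W_T^2 <= 3.84) ~ {coverage:.3f}  P(|W_T^2 - W^2| > 0.05) ~ {far:.3f}")
```

As $T$ grows, the second share should fall towards zero, in line with $g(W_T) \xrightarrow{P} g(W)$ for $g(w) = w^2$, and the first should approach 0.95.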

# A4.4 Laws of Large Numbers


We next consider the behaviour in large samples of averages of random variables. The behaviour of such averages depends on three factors, namely: the degree of interdependence between successive drawings (in this section we assume independence); the extent of heterogeneity in successive drawings (which is different between the two theorems cited below); and the existence of higher moments of the distributions (which needs to be stronger the greater the heterogeneity or interdependence allowed). Weak and strong laws of large numbers exist and we have the following results $(|\mu_t| < \infty,\ 0 < \sigma_t^2 < \infty)$.

## A4.4.1 Weak Law of Large Numbers

We denote the weak law of large numbers by WLLN, and cite two famous results:

Theorem 5(i). If $X_t$ is IID with $E[X_t] = \mu$ where $\mu$ is finite, then $\bar{X}_T \xrightarrow{P} \mu$ (Khintchine's theorem).

Theorem 5(ii). If $X_t$ is $\mathsf{ID}[\mu_t, \sigma_t^2]$ with $\bar{\mu}_T = T^{-1} \sum_{t=1}^{T} \mu_t$, and $\lim_{T \to \infty} T^{-2} \sum_{t=1}^{T} \sigma_t^2 = 0$, then $(\bar{X}_T - \bar{\mu}_T) \xrightarrow{P} 0$.

Thus, if the $X_t$ are identically distributed with a common finite mean, then $\operatorname{plim} \bar{X}_T = \mu$, even if the second moment of $X_t$ does not exist. However, if the distributions are not identical as in Theorem 5(ii), extra conditions are required on the second moments of the $\{X_t\}$, to ensure that they are finite, and the average variance divided by $T$ vanishes. Constant, finite, first two moments are sufficient for 5(ii) to hold, even when $X_t$ is not IID (perhaps because the third moment is not constant).

## A4.4.2 Strong Law of Large Numbers

The corresponding strong laws of large numbers (SLLN) are:

Theorem 6(i). If $X_t$ is IID, then $E[X_t] = \mu$ being finite is necessary and sufficient for $\bar{X}_T \xrightarrow{AS} \mu$.

Theorem 6(ii). If $X_t$ is $\mathsf{ID}[\mu_t, \sigma_t^2]$ and $\lim_{T \to \infty} \sum_{t=1}^{T} t^{-2} \sigma_t^2 < \infty$, then $\bar{X}_T \xrightarrow{AS} \bar{\mu}_T$.

Thus, WLLN(i) actually implies almost‐sure convergence without additional conditions, whereas, since $t^{-2} \ge T^{-2}$ for $t \le T$, the second condition of WLLN(ii) needs strengthening to achieve almost‐sure convergence. Distributions like the Cauchy without finite moments do not satisfy the assumptions of either WLLN or SLLN. Consequently, neither law applies and the sample mean does not converge.
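The contrast between a distribution with a finite mean and the Cauchy can be seen in a small simulation. This is an illustrative sketch rather than anything from the original text: the helper names, seed and replication count are assumptions, and the interquartile range of $\bar{X}_T$ across replications is used as the measure of spread because the Cauchy has no variance.

```python
import math
import random
import statistics


def normal_draw(rng):
    """One draw from N[0, 1], which has a finite mean."""
    return rng.gauss(0.0, 1.0)


def cauchy_draw(rng):
    """One draw from the standard Cauchy via the inverse CDF (no finite mean)."""
    return math.tan(math.pi * (rng.random() - 0.5))


def iqr_of_means(draw, T, reps=2000, seed=1):
    """Interquartile range of the sample mean across `reps` independent samples of size T."""
    rng = random.Random(seed)
    means = [statistics.fmean(draw(rng) for _ in range(T)) for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


if __name__ == "__main__":
    for T in (10, 200):
        print(
            f"T={T:4d}  IQR of normal means={iqr_of_means(normal_draw, T):.3f}"
            f"  IQR of Cauchy means={iqr_of_means(cauchy_draw, T):.3f}"
        )
```

The spread of the normal means should shrink as $T$ rises, whereas the spread of the Cauchy means should stay near 2, the interquartile range of a single standard Cauchy draw.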

For estimators of a constant population parameter μ, we have the following definitions:

Definition 11. An estimator $\hat{\mu}_T$ is consistent for $\mu$ if $\operatorname{plim}_{T \to \infty} \hat{\mu}_T = \mu$.

Thus, $\bar{X}_T$ is consistent for $\mu$ given WLLN(i). Also:

Definition 12. $\hat{\mu}_T$ is asymptotically unbiased for $\mu$ if $E[\hat{\mu}_T]$ is finite $\forall\, T$ and: $\lim_{T \to \infty} E[\hat{\mu}_T] = \bar{E}[\hat{\mu}_T] = \mu$.

Consistency does not imply asymptotic unbiasedness (e.g. $E[\hat{\mu}_T]$ need not exist), nor conversely (since an unbiased estimator need not have a degenerate limiting distribution). These last two definitions (D11 and D12) highlight one of the main roles for laws of large numbers, namely to allow us to establish the consistency of estimators. In sections A4.10 and A4.11, weaker assumptions will be discussed which still sustain laws like WLLN and SLLN.

# A4.5 Central‐limit Theorems


For consistent estimators, the limiting CDF $F_w(r)$ is a step function, and, as this has the same limiting distributional form for all consistent estimators, it is uninformative about the rate of convergence. For example, $\bar{X}_T$ and:

$Display mathematics$
(A4.10)
both have a plim of $\mu$ under WLLN(i), but the latter is useless for all practical $T$. To discriminate between estimators and compare their properties, we must rescale the random variables to $O_p(1)$. If $\operatorname{plim}(W_T - W) = 0$, and a random variable of $O_p(1)$ is achieved by multiplying by $T^k$ where $V[W_T - W] \approx O(T^{-2k}) > 0$, then we can investigate the behaviour of $T^k (W_T - W)$ as $T \to \infty$ and expect to obtain a non‐degenerate limiting distribution. This will let us compare distributional forms across alternative estimators or tests.

As with laws of large numbers, the outcome is determined by an interaction between the extent of memory (i.e. interdependence, which is absent here due to assuming independence), heterogeneity (allowed in Theorems 8(i) and (ii)), and the existence of higher moments (finite second moments are assumed in all the theorems, and the (2 + δ) moment by 8(i)). We have three central‐limit theorems, inducing asymptotic normality (generalized in § A4.11).

Theorem 7. If $X_t \sim \mathsf{IID}[\mu, \sigma^2]$ with $|\mu| < \infty$, $0 < \sigma^2 < \infty$, and $W_T = \sqrt{T}\,(\bar{X}_T - \mu)/\sigma$, then $W_T \xrightarrow{D} W \sim N[0, 1]$.

This result can also be written as $W_T \mathrel{\tilde{a}} N[0, 1]$.

Theorem 8. If $X_t \sim \mathsf{ID}[\mu_t, \sigma_t^2]$ where $|\mu_t| < \infty$ and $0 < \sigma_t^2 < \infty$ $\forall\, t$ with $S_T^2 = \sum_{t=1}^{T} \sigma_t^2$, and either:

(i) $\beta_t = E[|X_t - \mu_t|^{2+\delta}] < \infty$ and $\lim \left( \left( \sum_{t=1}^{T} \beta_t \right) / S_T^{2+\delta} \right) = 0$ for $\delta > 0$; or

(ii) $\lim S_T^{-2} \sum_{t=1}^{T} E[(X_t - \mu_t)^2 I(|X_t - \mu_t| \ge \varepsilon S_T)] = 0$ $\forall\, \varepsilon > 0$; then:

$$W_T = \frac{\sum_{t=1}^{T} (X_t - \mu_t)}{S_T} \xrightarrow{D} N[0, 1]$$
(A4.11)
The three versions 7–8(ii) of the central‐limit theorem (CLT) are referred to as the Lindeberg–Lévy, Lyapunov and Lindeberg–Feller theorems respectively. In each case, the random variable $W_T$ to be studied has been normalized to have a mean of zero and a unit variance.

Theorem 7 makes the strongest assumptions in terms of homogeneity (identical distributions) and moments existing (finite second moment). Whittle (1970) provides an elegant explanation as to why Theorem 7 holds by demonstrating a closure property as follows. Consider a sequence of standardized IID $\{Y_t\}$, each of which has the same distributional form as the limit of $W_T$ with distribution $W$. Define $Z_T = \sum Y_t/\sqrt{T}$, then $Z_T$ must also converge to the same limiting distribution $W$. Thus, the characteristic function of $Z_T$ is the $T$th power of the characteristic function of the IID variates and yet of the same form. Whittle then shows that $\exp(-\nu^2/2)$ (for argument $\nu$) is the only function with that property, which is the characteristic function of $N[0, 1]$.

Fig. A4.1 Central‐limit Convergence

As an illustration of the operation of central‐limiting effects, Figs. A4.1a‐d show the histograms and density functions for 10 000 drawings from the exponential distribution $\lambda \exp(-\lambda x)$ with parameter $\lambda = 0.1$, so the mean and standard deviation are both 10. The first shows the original distribution (which is highly skewed and covers a large range), then the means of samples of size 5 and size 50 therefrom, and finally $N[10, 1]$ for comparison. The convergence to normality is rapid for this IID setting.
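Since the figure itself is not reproduced here, the following sketch gives a rough numerical counterpart. It is an illustration only, not the code behind Fig. A4.1: the function names, the seed, and the use of sample skewness as a summary of departure from normality are assumptions. It draws 10 000 means of exponential samples with $\lambda = 0.1$ for sample sizes 1, 5 and 50.

```python
import random
import statistics


def skewness(xs):
    """Sample skewness: the mean of the cubed standardized observations."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)


def exponential_means(n, reps=10_000, lam=0.1, seed=2):
    """Means of `reps` independent samples of size n from the density lam * exp(-lam * x)."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(lam) for _ in range(n))
        for _ in range(reps)
    ]


if __name__ == "__main__":
    for n in (1, 5, 50):
        means = exponential_means(n)
        print(
            f"n={n:3d}  mean={statistics.fmean(means):6.2f}"
            f"  sd={statistics.pstdev(means):5.2f}  skewness={skewness(means):5.2f}"
        )
```

The skewness of a mean of $n$ IID exponential variates is $2/\sqrt{n}$, so the printed skewness should fall from about 2 towards zero as $n$ increases, while the mean stays near 10.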

In Theorem 8(i), the identical‐distributions assumption is dropped, and instead a finite absolute moment of order $(2 + \delta)$ is required for which $\delta = 1$ is sufficient and entails that the third absolute moment exists. If so, then for example, when $\mu_t$, $\sigma_t$, and $\beta_t$ are constant:

$Display mathematics$
(A4.12)
which provides sufficient conditions, although these are too strong given Theorem 7.
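As a worked check of condition (i) of Theorem 8 in this constant‐moment case (a sketch derived directly from the definitions in Theorem 8, which need not match the form of the display above), write $\mu_t = \mu$, $\sigma_t = \sigma$ and $\beta_t = \beta$ for all $t$, so that $S_T^2 = T \sigma^2$. Then:

$$\frac{\sum_{t=1}^{T} \beta_t}{S_T^{2+\delta}} = \frac{T \beta}{(T \sigma^2)^{1 + \delta/2}} = \frac{\beta}{\sigma^{2+\delta}\, T^{\delta/2}} \to 0 \quad \text{as } T \to \infty,$$

for any $\delta > 0$. With $\delta = 1$ the ratio is $\beta / (\sigma^3 \sqrt{T})$, which vanishes at rate $T^{-1/2}$.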

However, Theorem 8(ii) provides most insight into what features of the DGP are required to sustain central‐limiting normality, and thereby needs most explanation. First, if the limit in (ii) is to be zero, a necessary condition is that $S_T^2 \to \infty$ as $T \to \infty$ (so that information continually accrues). Next, since $I(\cdot)$ denotes the indicator function of the argument (see § A2.7), when the event $\{|X_t - \mu_t| \ge \varepsilon S_T\}$ occurs, then $I(|X_t - \mu_t| \ge \varepsilon S_T)$ is unity, and is zero otherwise, so that:

$Display mathematics$
Also, $E[(X_t - \mu_t)^2 I(|X_t - \mu_t| \ge \varepsilon S_T)]$ is the expected value of $(X_t - \mu_t)^2$ over the interval where $|X_t - \mu_t| \ge \varepsilon S_T$ when $E[(X_t - \mu_t)^2] = \sigma_t^2$. Since $(X_t - \mu_t)^2 / S_T^2 \ge \varepsilon^2$ when $I(|X_t - \mu_t| \ge \varepsilon S_T) = 1$, as $T \ge t$, we have:
$Display mathematics$
which can tend to zero as $T \to \infty$ only if the largest $\varepsilon^2 P(|X_t - \mu_t| \ge \varepsilon S_T)$ goes to zero $\forall\, \varepsilon$. Thus, it is necessary that $\sigma_T^2 / S_T^2 \to 0$ for 8(ii) to hold, and indeed all individual terms must be asymptotically negligible. However, the two conditions $\sigma_T^2 / S_T^2 \to 0$ and $S_T^2 \to \infty$ as $T \to \infty$ are not sufficient, whereas the conditions in Theorem 8 are. Sufficient conditions for Theorem 8 can be found which have intuitive appeal, namely that $E[|X_t|^{2+\delta}] < K < \infty$ $\forall\, t$ for some $\delta > 0$ and $T^{-1} S_T^{2+\delta} \to \infty$, so that the $|X_t|$ do not grow without bound. Then, no one term dominates, and information accrues sufficiently rapidly for asymptotic normality to hold. The trade‐off between the existence of higher moments and increased heterogeneity is apparent from comparing Theorems 7 and 8.

Three generalizations are needed to render central‐limit theory applicable to economic data. First, it must apply to vector random variables, and the following section analyses that case. Secondly, it must allow for processes with memory rather than independence, and sections A4.8–A4.11 do so. Finally, integrated processes must be incorporated, and that aspect is analysed in Chapter 3.

# A4.6 Vector Random Variables


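While the concepts above carry over directly for elements of vectors, results for vector processes in general follow from the Cramér–Wold device. Let $\{\mathbf{W}_T\}$ be a $k$‐dimensional vector process, and $\mathbf{W}$ a $k \times 1$ vector random variable, then:

Theorem 9. $\mathbf{W}_T \xrightarrow{D} \mathbf{W}$ if and only if $\boldsymbol{\lambda}' \mathbf{W}_T \xrightarrow{D} \boldsymbol{\lambda}' \mathbf{W}$ for all fixed finite $k \times 1$ vectors $\boldsymbol{\lambda}$.

For example, if $\boldsymbol{\lambda}' \mathbf{W} \sim N[0, \boldsymbol{\lambda}' \boldsymbol{\Psi} \boldsymbol{\lambda}]$ then $\mathbf{W}_T \xrightarrow{D} N_k[\mathbf{0}, \boldsymbol{\Psi}]$. Moreover, certain stochastic functions of $\mathbf{W}_T$ also converge to normality if $\mathbf{W}_T$ does (see Cramér, 1946). This important result is referred to below as Cramér's theorem:

Theorem 10. If $\mathbf{W}_T \xrightarrow{D} \mathbf{W} \sim N_k[\boldsymbol{\mu}, \boldsymbol{\Psi}]$ and $\mathbf{A}_T$ is $n \times k$ with $\operatorname{plim}_{T \to \infty} \mathbf{A}_T = \mathbf{A}$, then $\mathbf{A}_T \mathbf{W}_T \xrightarrow{D} \mathbf{A}\mathbf{W}$ where $\mathbf{A}\mathbf{W} \sim N_n[\mathbf{A}\boldsymbol{\mu}, \mathbf{A} \boldsymbol{\Psi} \mathbf{A}']$.

The theorem is clearly valid if $\mathbf{A}_T = \mathbf{A}$ $\forall\, T$; otherwise, use is made of $(\mathbf{A}_T - \mathbf{A}) \approx o_p(1)$ so that:

$$\mathbf{A}_T \mathbf{W}_T = \mathbf{A} \mathbf{W}_T + (\mathbf{A}_T - \mathbf{A}) \mathbf{W}_T$$
(A4.13)
where the second term is asymptotically negligible.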

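Next, consider an $n \times 1$ continuously differentiable vector function $\mathbf{g}(\mathbf{W}_T)$:

Theorem 11. If $\sqrt{T}\,(\mathbf{W}_T - \boldsymbol{\mu}) \xrightarrow{D} N_k[\mathbf{0}, \boldsymbol{\Psi}]$ and $\mathbf{G}(\mathbf{W}) = \partial \mathbf{g} / \partial \mathbf{w}'$ exists and is continuous in the neighbourhood of $\boldsymbol{\mu}$, then:

$$\sqrt{T}\,\big(\mathbf{g}(\mathbf{W}_T) - \mathbf{g}(\boldsymbol{\mu})\big) \xrightarrow{D} N_n\big[\mathbf{0}, \mathbf{G}(\boldsymbol{\mu}) \boldsymbol{\Psi} \mathbf{G}(\boldsymbol{\mu})'\big]$$
To prove this claim, since $\operatorname{plim} \mathbf{W}_T = \boldsymbol{\mu}$, then $\operatorname{plim} \mathbf{g}(\mathbf{W}_T) = \mathbf{g}(\boldsymbol{\mu})$, and by the mean‐value theorem:
$$\mathbf{g}(\mathbf{W}_T) = \mathbf{g}(\boldsymbol{\mu}) + \mathbf{G}(\mathbf{W}_T^*)\,(\mathbf{W}_T - \boldsymbol{\mu})$$
(A4.14)
where $|\mathbf{W}_T^* - \boldsymbol{\mu}| \le |\mathbf{W}_T - \boldsymbol{\mu}|$ so that:
$$\sqrt{T}\,\big(\mathbf{g}(\mathbf{W}_T) - \mathbf{g}(\boldsymbol{\mu})\big) = \mathbf{G}(\mathbf{W}_T^*)\,\sqrt{T}\,(\mathbf{W}_T - \boldsymbol{\mu})$$
Since $\mathbf{G}(\mathbf{W}_T^*) \xrightarrow{D} \mathbf{G}(\boldsymbol{\mu})$ (a constant), $\mathbf{G}(\mathbf{W}_T^*) \xrightarrow{P} \mathbf{G}(\boldsymbol{\mu})$, noting that $\operatorname{plim} (\mathbf{W}_T - \boldsymbol{\mu}) = \mathbf{0}$ implies $\operatorname{plim} (\mathbf{W}_T^* - \boldsymbol{\mu}) = \mathbf{0}$.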
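Theorem 11 can be checked numerically in the scalar case. The sketch below is illustrative and rests on choices that are not in the original text: $W_T$ is the mean of $T$ draws from $N[\mu, \sigma^2]$ with $\mu = 2$ and $\sigma = 1$, and $g(w) = w^2$, so that $G(\mu) = 2\mu$ and the limiting standard deviation of $\sqrt{T}\,(g(W_T) - g(\mu))$ would be $|2\mu|\sigma = 4$.

```python
import math
import random
import statistics


def scaled_g_errors(T, reps=2000, mu=2.0, sigma=1.0, seed=4):
    """Draws of sqrt(T) * (g(W_T) - g(mu)) for g(w) = w**2 and W_T a sample mean.

    mu, sigma, the choice of g, reps and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        w_t = statistics.fmean(rng.gauss(mu, sigma) for _ in range(T))
        draws.append(math.sqrt(T) * (w_t ** 2 - mu ** 2))
    return draws


if __name__ == "__main__":
    mu, sigma = 2.0, 1.0
    for T in (25, 400):
        draws = scaled_g_errors(T, mu=mu, sigma=sigma)
        print(
            f"T={T:4d}  simulated sd={statistics.pstdev(draws):.3f}"
            f"  limiting sd |G(mu)|*sigma={abs(2 * mu) * sigma:.3f}"
        )
```

The simulated standard deviation should be close to 4 for moderate $T$, with a small positive mean of order $\sigma^2/\sqrt{T}$ coming from the second‐order term that the first‐order approximation ignores.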

The following two examples seek to illustrate applications of the preceding results. In many cases, almost‐sure convergence can be established where we focus on convergence in probability below: establishing the stronger result in relevant cases makes an interesting and useful exercise (see White, 1984, for a number of helpful developments).

# A4.7 Solved Examples


## A4.7.1 Example 1: An IID Process

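Let $E[y_t \mid \mathbf{z}_t] = \boldsymbol{\beta}' \mathbf{z}_t$ for $t = 1, \ldots, T$, and $\varepsilon_t = (y_t - \mathbf{z}_t' \boldsymbol{\beta}) \sim \mathsf{IID}[0, \sigma_\varepsilon^2]$, $0 < \sigma_\varepsilon^2 < \infty$, where the $k \times 1$ vector $\mathbf{z}_t \sim \mathsf{IID}_k[\mathbf{0}, \boldsymbol{\Psi}]$ with $\boldsymbol{\Psi}$ finite, positive definite, and $\{\mathbf{z}_t\}$, $\{\varepsilon_t\}$ are independent processes. The least‐squares estimator of $\boldsymbol{\beta} \in \mathbb{R}^k$ (where ) is:

$$\hat{\boldsymbol{\beta}} = \left( \sum_{t=1}^{T} \mathbf{z}_t \mathbf{z}_t' \right)^{-1} \sum_{t=1}^{T} \mathbf{z}_t y_t$$
(A4.15)
We seek to derive the asymptotic distribution of a suitably normalized function of $\hat{\boldsymbol{\beta}}$.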

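First, $\boldsymbol{\nu}_t = \mathbf{z}_t \varepsilon_t \sim \mathsf{IID}_k[\mathbf{0}, \sigma_\varepsilon^2 \boldsymbol{\Psi}]$, so that:

$Display mathematics$
(A4.16)
WLLN(i) applies to $\{\boldsymbol{\nu}_t\}$, so letting: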
$Display mathematics$
then:
$Display mathematics$
(A4.17)
Hence, by the Cramér–Wold device:
$Display mathematics$
(A4.18)

Next, since $\mathbf{z}_t$ is IID, $\mathbf{z}_t \mathbf{z}_t'$ is IID with $E[\mathbf{z}_t \mathbf{z}_t'] = \boldsymbol{\Psi}$. Let

$Display mathematics$
so that WLLN(i) applies again and hence:
$Display mathematics$
(A4.19)
Thus, $\operatorname{plim}_{T \to \infty} T^{-1} \sum \mathbf{z}_t \mathbf{z}_t' = \boldsymbol{\Psi}$, and hence $\sum \mathbf{z}_t \mathbf{z}_t' \approx O_p(T)$. By Slutsky's Theorem, since the inverse of a positive definite matrix is a continuous function of its elements:
$Display mathematics$
Consequently, again by Slutsky's Theorem from (A4.18):
$Display mathematics$
and hence $\hat{\boldsymbol{\beta}}$ is consistent for $\boldsymbol{\beta}$. Further, since:
$Display mathematics$
which is $O(T)$, then $(\sum \mathbf{z}_t \varepsilon_t)/\sqrt{T} \approx O_p(1)$. Hence, by central‐limit Theorem 7:
$Display mathematics$
(A4.20)

Consequently, by the Cramér–Wold device:

$Display mathematics$
(A4.21)
Finally, by Cramér's Theorem:
$Display mathematics$
and therefore:
$Display mathematics$
(A4.22)
This result generalizes the theorem on the (asymptotic) normality of least squares in § A3.7 to stochastic regressors which are independent over time, and independent of the difference $(y_t - \mathbf{z}_t' \boldsymbol{\beta})$.
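A small Monte Carlo check of this example is sketched below for the scalar case $k = 1$. The setup is an assumption made for illustration, not taken from the text: $z_t \sim N[0, \Psi]$ with $\Psi = 4$, errors drawn as a centred exponential (mean zero, $\sigma_\varepsilon^2 = 1$, deliberately non‐normal), and $\beta = 1.5$. Under the assumptions of the example, the limiting variance of $\sqrt{T}\,(\hat{\beta} - \beta)$ would be $\sigma_\varepsilon^2 \Psi^{-1} = 0.25$.

```python
import math
import random
import statistics


def scaled_ols_errors(T, reps=2000, beta=1.5, psi=4.0, seed=5):
    """Draws of sqrt(T) * (beta_hat - beta) for scalar least squares of y on z.

    beta, psi, the error distribution, reps and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        szz = 0.0  # running sum of z_t^2
        szy = 0.0  # running sum of z_t * y_t
        for _ in range(T):
            z = rng.gauss(0.0, math.sqrt(psi))   # regressor with variance psi
            eps = rng.expovariate(1.0) - 1.0     # skewed error: mean 0, variance 1
            y = beta * z + eps
            szz += z * z
            szy += z * y
        beta_hat = szy / szz
        draws.append(math.sqrt(T) * (beta_hat - beta))
    return draws


if __name__ == "__main__":
    draws = scaled_ols_errors(200)
    print(f"simulated mean = {statistics.fmean(draws):.3f}")
    print(f"simulated sd   = {statistics.pstdev(draws):.3f}  (limiting sd = 0.500)")
```

Even with skewed errors, the simulated standard deviation should be close to 0.5 and the mean close to zero at $T = 200$.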

## A4.7.2 Example 2: A Trend Model

Let $y_t = \beta t^{\alpha} + \varepsilon_t$ where $\varepsilon_t \sim \mathsf{IN}[0, \sigma_\varepsilon^2]$ and $\alpha$ is a known, fixed constant. Given normality of $\varepsilon_t$, the exact distribution of the least‐squares estimator of $\beta$ can be obtained; yet there are problems in obtaining its limiting distribution for some values of $\alpha$, due to the regressor being a trend. We have:

$Display mathematics$
(A4.23)
so that:
$Display mathematics$
(A4.24)
and:
$Display mathematics$
(A4.25)

Five values of α will be considered: α = 0, ± 1/2 and ± 1.

1. (a) If $\alpha = 0$, then $\sqrt{T}(\hat{\beta} - \beta) \sim N[0, \sigma_\varepsilon^2]$ holds $\forall T$.

2. (b) If $\alpha = 1/2$, then $\sum_{t=1}^{T} t^{2\alpha} = \sum_{t=1}^{T} t = T(T+1)/2 \approx O(T^2)$. Consequently:

   $Display mathematics$

   and the distribution ‘collapses’ rather rapidly. Nevertheless:

   $Display mathematics$
   (A4.26)

   Thus, a higher‐order rescaling produces a normal limiting distribution.

3. (c) $\alpha = -1/2$ could arise from trending heteroscedasticity when $y_t^* = \beta + \varepsilon_t^*$ where:

   $Display mathematics$
   (A4.27)

   Then:

   $Display mathematics$
   (A4.28)

   Although $\sum_{t=1}^{T} t^{-1} \approx O(\log T) \to \infty$, it diverges very slowly (see e.g. Binmore, 1983). Thus:

   $Display mathematics$

   where $V$ is a finite constant.

4. (d) When $\alpha = 1$, a different rescaling is necessary compared with (a)–(c), since now:

   $Display mathematics$
   (A4.29)

   so that:

   $Display mathematics$

5. (e) Finally, when $\alpha = -1$, $\lim_{T \to \infty} \left( \sum_{t=1}^{T} t^{-2} \right) = \pi^2/6$ and hence:

   $Display mathematics$
   (A4.30)

Thus, $\hat{\beta}$ is unbiased but inconsistent for $\beta$, since the variance does not go to zero as $T \to \infty$. In effect, the error ‘swamps’ $\beta/t$ once $t$ becomes large and so additional observations on $y$ are not informative about $\beta$, reflected in $V[\hat{\beta}]$ being non‐zero even asymptotically. Thus, the non‐singularity of $\operatorname{plim}\left( f(T)^{-1} \sum z_t z_t' \right)$ is an essential condition for consistent regression estimation, for some function $f(T) \to \infty$ as $T \to \infty$.
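The five cases can be compared numerically. The Python sketch below assumes the usual least‐squares variance for this single‐regressor model, $V[\hat{\beta}] = \sigma_\varepsilon^2 / \sum_{t=1}^{T} t^{2\alpha}$; the function name and the sample sizes are illustrative. It shows the variance shrinking at different rates for $\alpha = 0, \pm 1/2, 1$, and settling at $6\sigma_\varepsilon^2/\pi^2$ when $\alpha = -1$.

```python
import math


def ols_variance(alpha, T, sigma2=1.0):
    """V[beta_hat] = sigma2 / sum_{t=1}^{T} t^(2*alpha) when y_t = beta * t^alpha + eps_t."""
    return sigma2 / sum(t ** (2.0 * alpha) for t in range(1, T + 1))


for alpha in (0.0, 0.5, -0.5, 1.0, -1.0):
    values = [ols_variance(alpha, T) for T in (10, 1_000, 100_000)]
    print(f"alpha = {alpha:+.1f}:", ", ".join(f"{v:.3e}" for v in values))

# For alpha = -1 the variance tends to 6 * sigma2 / pi^2 rather than to zero.
print("6 / pi^2 =", round(6.0 / math.pi ** 2, 4))
```

The slow decline for $\alpha = -1/2$ reflects the logarithmic growth of $\sum t^{-1}$, while the $\alpha = -1$ row stops changing after the first few hundred observations.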

Combining these two examples and weakening the assumptions, we can extend the limiting normal distribution established in §A3.7 for least‐squares estimation conditional on $Z_T^1$ (see e.g. Anderson, 1971):

Theorem 12. Let $y_t = \beta' z_t + \varepsilon_t$ where $E[y_t \mid z_t] = \beta' z_t$ and $\varepsilon_t \sim \mathsf{IID}[0, \sigma_\varepsilon^2]$. Let $A_T = \sum z_t z_t'$ and denote its $(i, j)$th element by $a(T)_{ij}$, and let $B_T = C_T^{-1} A_T C_T^{-1}$ where $C_T$ is a diagonal matrix with elements $\sqrt{a(T)_{ii}}$. If as $T \to \infty$:

1. (i) a(T)ii → ∞, i = 1, . . . , k;

2. (ii) ;

3. (iii) $B T → B$ where B is finite positive definite;

then:
$Display mathematics$
This theorem covers example 2 for the values of α ≠ −1 as these cases satisfy its assumptions.

# A4.8 Stationary Dynamic Processes


## A4.8.1 Vector Autoregressive Representations

It is essential to generalize from the assumption of IID observations characterizing the previous derivations to dynamic processes in order to obtain theorems relevant to economic time series. Consider the data‐generation process for t = . . . −m, . . . , 1, . . . , T:

$Display mathematics$
(A4.31)
which is an s th‐order vector autoregression (VAR) in the n variables $x t$, where $Σ$ is an unrestricted n × n covariance matrix, the initial values are fixed, and s is finite. Bounded deterministic functions could be introduced without altering the form of the results for stationary processes (but would alter the algebra), whereas trends require a generalization of Theorem 12, and even intercepts matter if the process is non‐stationary.

Equation (A4.31) can be written as a matrix polynomial in the lag operator L, where $L k x t = x t − k$:

$Display mathematics$
(A4.32)
so π(L) is a polynomial matrix (see Ch. A1). The roots of the polynomial:
$Display mathematics$
(A4.33)
determine the statistical properties of $\{x_t\}$. We begin with the case that all $ns$ roots of (A4.33) lie inside the unit circle. Such processes $\{x_t\}$ are called weakly stationary since the first two unconditional moments of $\{x_t\}$ are independent of $t$: $E[x_t] = 0 \ \forall t$ and $E[x_t x_{t+k}']$ is both independent of $t$ for all $k$, and tends to zero as $k \to \infty$. If:
$Display mathematics$
(A4.34)
so that the joint distribution of all collections is unaltered by ‘translation’ $h$‐periods along the time axis, then $\{x_t\}$ is strictly stationary. Neither concept implies the other, since strictly stationary processes need not have finite second moments, and weak stationarity is not enough to ensure that the CDF of $x_t$ is constant over time (see Ch. 2).
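For a scalar second‐order autoregression the root condition can be verified directly. The Python sketch below assumes the characteristic polynomial is written as $\lambda^2 - a_1 \lambda - a_2 = 0$, so that the stationary case corresponds to both roots lying inside the unit circle, as in the text. The function names and coefficient values are hypothetical.

```python
import cmath


def ar2_roots(a1, a2):
    """Roots of lambda^2 - a1*lambda - a2 = 0 for y_t = a1*y_{t-1} + a2*y_{t-2} + e_t."""
    disc = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + disc) / 2.0, (a1 - disc) / 2.0


def roots_inside_unit_circle(a1, a2):
    """True when both roots lie strictly inside the unit circle."""
    return all(abs(root) < 1.0 for root in ar2_roots(a1, a2))


for coeffs in [(0.5, 0.3), (1.0, -0.5), (1.2, -0.1)]:
    moduli = [round(abs(r), 3) for r in ar2_roots(*coeffs)]
    print(coeffs, moduli, roots_inside_unit_circle(*coeffs))
```

The second coefficient pair gives complex roots, which is why the check uses the modulus of each root rather than its real part.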

Having allowed data dependency, it is unnecessarily restrictive to retain the assumption that ${ ε t }$ is an independent process, which in effect restricts the analysis to dealing only with data‐generating processes, and excludes models where the error process is most unlikely to be independent. Instead, $ε t$ can be interpreted as $x t − E [ x t | ℐ t − 1 ]$, where $ℐ t − 1$ denotes the available information, so that $E [ ε t | ℐ t − 1 ] = 0$ and so ${ ε t }$ is a martingale difference sequence (see Ch. 2). A range of powerful limiting results has been established for martingale processes (see Hall and Heyde, 1980, and §A4.11). To ensure that information continuously accrues and no individual observations dominate, conditions are required about the underlying data process which is being modelled, such as mixing conditions as discussed in White (1984) and §A4.10. Spanos (1986) considers martingale central‐limit theorems as well as mixing processes.

## A4.8.2 Mann and Wald's Theorem

When ${ z t }$ is a weakly stationary and ergodic n‐dimensional vector random process:

$Display mathematics$
(A4.35)
where $M Z$ is finite, positive definite. Then (see Mann and Wald, 1943b):

Theorem 13. Let $ν t ∼ IID [ 0 , σ ν 2 ]$ with finite moments of all orders. Then if $E [ z t ν t ] = 0 ∀ t$:

$Display mathematics$

Mann and Wald's theorem again extends the conditions under which asymptotic normality occurs. If $\nu_t \sim \mathsf{IID}_k[0, \Sigma]$ is a $k$‐dimensional vector process with finite higher‐order moments, their result generalizes. Let $\otimes$ denote the Kronecker product defined in Chapter A1, and let $(\cdot)^v$ be the vectoring operator which stacks the columns of a matrix in a vector; then:

$Display mathematics$
This theorem is applicable when the model coincides with the stationary DGP, as is assumed in much of econometric theory. An example would be the error $ε t$ in (A4.31) relative to lagged xs. We consider the case where the model and DGP differ in §[A3.8.d] and Chapter 11.

## A4.8.3 Hannan's Theorem

It is convenient to write (A4.31) in companion form as:

$Display mathematics$
(A4.36)
where:
$Display mathematics$
In (A4.36), $\Psi$ is non‐negative definite, but singular for $s > 1$, and the eigenvalues of $\Pi$ are the same as the roots in (A4.33) (see Ch. A1). Let:
$Display mathematics$
(A4.37)
This is the ‘long‐run’ matrix, and the rank of π plays an important role in determining the asymptotic properties of estimators and tests (see Ch. 11). Although ${ ε t }$ is stationary, the n variables in $x t$ need not all be stationary, and r = rank(π) determines how many levels variables are stationary, and how many are integrated of first order, denoted I(1), so that $Δ x it$ is stationary or I(0). Three cases can be distinguished:
1. (i) r = n, so all n variables in $x t$ are I(0) and hence stationary;

2. (ii) r = 0, so all n variables in $x t$ are I(1), but $Δ x t$ is stationary; and

3. (iii) $0 < r < n$, so $(n - r)$ linear combinations of $\Delta x_t$, and $r$ linear combinations of $x_t$, are I(0).

Case (ii) leads to differencing transformations on all variables, and case (iii) to cointegration; for the present we consider stationary distributions (i).

(i) When π has full rank n, since $E [ f t − 1 ε s ′ ] = 0 ∀ s ≥ t$, the second moments of $f t$, namely $M F = E [ f t f t ′ ]$ and $M 1 = E [ f t f t − 1 ′ ]$, are given from (A4.36) by:

$Display mathematics$
(A4.38)
so that vectorizing:
$Display mathematics$
(A4.39)
The sample estimators of $M F$ and $M 1$ are:
$Display mathematics$
(A4.40)
where . From Hannan (1970, Ch. 4), under the above assumptions:

Theorem 14. $√ T ( M ∧ F − M F ) v → D N ns [ 0 , C ]$.

When $Π = P Λ P − 1$ and $Λ$ has a real diagonal, C is obtained in Hendry (1982) and Maasoumi and Phillips (1982); Govaerts (1986) extends the derivation to complex roots. As before, asymptotic normality results.

We now briefly consider the two cases of the VAR remaining from above, which involve unit roots in some or all of the processes.

(ii) When π(1) = 0 in (A4.31), so that r = 0, the system can be transformed to a VAR in the I(0) variables $Δ x t$. Then the results of (i) apply to the differenced data. Otherwise, if (A4.31) is estimated in levels, the asymptotic distribution of the least‐squares estimator is given in Phillips and Durlauf (1986) who show that conventional statistics do not have the usual limiting normal distributions.

(iii) When $x t$ is I(1), but 0 < r < n, then π can be factorized as $α β ′$ where $α$ and $β$ are n × r matrices of rank r. Then $α$ is a matrix of adjustment parameters, and $β$ contains r cointegrating vectors, such that the linear combinations $β ′ x t$ are I(0). Once it is known that there are r cointegrating vectors, it is possible to estimate the parameters of (A4.31) subject to the restriction that $π = α β ′$, with $α$ and $β$ of rank r, using a procedure proposed by Johansen (1988). As with (ii), unless the system is transformed to be I(0), which would involve only either $β ′ x t$ or $Δ x t$ variables, then the limiting distributions are not normal. We consider system cointegration in Chapters 8 and 11.

## A4.8.4 Limiting Distribution of OLS for a Linear Equation

Since a wide range of econometric estimators are functions of data second moments, and so their distributions asymptotically converge to functions of C, Hannan's theorem (Theorem 14) is useful for stationary processes. As an illustration, let $x t ′ = ( y t : z t ′ : w t ′ )$ and consider an arbitrary linear econometric equation of the form:

$Display mathematics$
(A4.41)
where $φ ′ = ( 1 : − γ ′ : 0 ′ )$ and γ is an m × 1 vector defined by:
$Display mathematics$
(A4.42)
Then $z t = Sf t$ where $S = ( 0 : I m : 0 )$ is a selection matrix. Given the model in (A4.41)+(A4.42), γ would be estimated by OLS:
$Display mathematics$
(A4.43)
From (A4.42), $SM F φ = 0$ and hence:
$Display mathematics$

Neglecting the sampling variability in $( S M ∧ F S ′ )$ by Cramér's theorem, so that it is replaced in (A4.43) by $( SM F S ′ )$:

$Display mathematics$
(A4.44)
using $( ABC ) v = ( A ⊗ C ′ ) B v$ where $H = ( SM F S ′ ) − 1 ( S ⊗ φ ′ )$. The limiting distribution of $√ T ( γ ∧ − γ )$ now follows from Theorem 14:
$Display mathematics$
(A4.45)

The result in (A4.45) only depends on the distribution of the second moments of the data and is independent of the validity of the postulated model or of any distributional properties of {e t} which an investigator may have claimed. The coefficient γ need not be of any interest, and the fact that $γ ∧$ is consistent for γ is achieved by constructing γ as the population value of the distribution of $γ ∧$. The variance matrix $HCH ′$ of the limiting distribution of $γ ∧$ need not bear any relationship to the probability limit $( SM F S ′ ) − 1$ of the OLS variance formula. Even though the asymptotic distribution theory delivers a limiting normal distribution for an arbitrary coefficient estimator in a linear stationary process, that is at best cold comfort to an empirical modeller. However, when the model has constant parameters and a homoscedastic, innovation error, so {e t} is well behaved, the OLS variance is consistently estimated by the conventional formula.

Even when the model coincides with the DGP and is stationary, convergence to normality can be relatively slow. In finite samples, OLS estimators in dynamic models are usually biased, so convergence to normality requires both that the central tendency of the distribution shifts and that its shape changes. Figures A4.2a–d show the histograms and density functions for 10 000 drawings for the OLS estimator $\hat{\rho}_1$ of $\rho_1$ in $y_t = \rho_0 + \rho_1 y_{t-1} + \varepsilon_t$ where $\varepsilon_t \sim \mathsf{IN}[0, 1]$ when $\rho_0 = 0$, $\rho_1 = 0.8$ at $T = 10$, 25, 50, and 300. When $T = 10$, the distribution is skewed and centred on $E[\hat{\rho}_1 \mid T = 10] \simeq 0.46$ with $\mathsf{SD}[\hat{\rho}_1 \mid T = 10] \simeq 0.31$; at $T = 25$, the bias has fallen and $E[\hat{\rho}_1 \mid T = 25] \simeq 0.66$; at $T = 50$, $E[\hat{\rho}_1 \mid T = 50] \simeq 0.73$; and only by $T = 300$ does $E[\hat{\rho}_1 \mid T = 300] \simeq 0.79$. An approximation to the bias is given by $E[(\hat{\rho}_1 - \rho_1) \mid T] \simeq -(1 + 3\rho_1)/T$ (see §A4.13.3 for the special case when $\rho_0 = 0$). The SD falls somewhat faster than $O(1/\sqrt{T})$. The distribution becomes symmetric slowly, and is noticeably skewed even at $T = 300$; however, derived distributions such as those for t‐tests converge more rapidly to normality.

Fig. A4.2 Convergence Of OLS In a Dynamic Equation
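The finite‐sample bias described above can be reproduced approximately with a small simulation. The Python sketch below is a minimal version of such an experiment, under assumptions that the text does not state: the initial value is drawn from the stationary distribution, $T$ counts the regression observations, and only 1000 replications are used to keep the run short. Function names are illustrative. Each simulated mean is printed beside the approximation $\rho_1 - (1 + 3\rho_1)/T$.

```python
import random
import statistics


def ols_ar1_slope(y):
    """Slope from regressing y_t on a constant and y_{t-1}."""
    lagged, current = y[:-1], y[1:]
    n = len(lagged)
    mean_lag = sum(lagged) / n
    mean_cur = sum(current) / n
    sxx = sum((a - mean_lag) ** 2 for a in lagged)
    sxy = sum((a - mean_lag) * (b - mean_cur) for a, b in zip(lagged, current))
    return sxy / sxx


def simulate_rho_hat(T, rho1=0.8, reps=1000, seed=0):
    """Return `reps` OLS estimates of rho1 from samples with T regression observations."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        # Assumption: y_0 is drawn from the stationary distribution N(0, 1 / (1 - rho1^2)).
        y = [rng.gauss(0.0, (1.0 - rho1 ** 2) ** -0.5)]
        for _ in range(T):
            y.append(rho1 * y[-1] + rng.gauss(0.0, 1.0))
        estimates.append(ols_ar1_slope(y))
    return estimates


for T in (10, 25, 50):
    est = simulate_rho_hat(T)
    approx = 0.8 - (1.0 + 3.0 * 0.8) / T
    print(f"T = {T:3d}: mean = {statistics.fmean(est):.3f}, approximation = {approx:.3f}")
```

Because the initial condition and replication count are assumptions, the simulated means need not match the quoted figures to two decimals, but the downward bias and its decline with $T$ should be visible.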

# A4.9 Instrumental Variables


The method of instrumental variables is an extension of least squares due to Geary (1943) and Reiersøl (1945); see Sargan (1958). Consider the linear equation:

$Display mathematics$
(A4.46)
where $β$ is k × 1. However, $E [ y t | x t ] ≠ x t ′ β$, so that, from (A4.46), $E [ x t u t ] ≠ 0$. Thus, regression will not yield a consistent estimator of $β$, as in the case discussed in §A4.8.4. There is assumed to exist a k × 1 vector $z t$ such that:
$Display mathematics$
The role of $z t$ is purely instrumental in solving the problem of estimating $β$. We develop the algebra of instrumental variables (IV) estimators and their large‐sample behaviour when there are k instruments: the case of more than k instruments is considered in Chapter 11. $β$ is not estimable if there are fewer than k instruments.

In matrix terms:

$Display mathematics$
(A4.47)
with $E [ X ′ u ] ≠ 0$ but $E [ Z ′ u ] = 0$. When (A4.47) holds, consider pre‐multiplying by Z′:
$Display mathematics$
Providing $rank ( T − 1 Z ′ X ) = k$, and so $rank ( T − 1 Z ′ Z ) = k$ at all sample sizes, then:4
$Display mathematics$
(A4.48)
exists ∀ T and is called the instrumental variables estimator. When $( y t : x ′ t : z ′ t )$ is a stationary vector process, then given the above assumptions, the results in §A4.8 apply and:
$Display mathematics$
(A4.49)
from Slutsky's theorem and the law of large numbers. Hence, by Mann and Wald's theorem together with Cramér's theorem:
$Display mathematics$
where R is defined by:
$Display mathematics$
(A4.50)
The matrix in [·] is a consistent estimator of R by construction. Let:
$Display mathematics$
Since plim $β ˜ = β$, then $σ ˜ u 2$ is a consistent estimator of $σ u 2$, so that $σ u 2 R$ can be consistently estimated from sample evidence. An extension of §A4.8.4 could apply.

Fig. A4.3 Behaviour Of IV Estimation In a Just‐identified Equation

In finite samples, other issues arise than merely the rate of convergence to normality. Because $E[(T^{-1} Z'X)^{-1}]$ need not exist, the distribution of $\tilde{\beta}$ need not have any finite‐sample moments, so large outliers are possible. The occurrence of these depends on the probability that $(Z'X)$ gets close to singularity, as is most easily seen when $k = 1$. Now:

$Display mathematics$
where $x = μ z + e$ when $e t ∼ IN [ 0 , σ e 2 ]$ so that:
$Display mathematics$
and $\hat{\gamma} \sim N[\gamma, (\sigma_u^2 + \beta^2 \sigma_e^2)(z'z)^{-1}]$ with $\hat{\mu} \sim N[\mu, \sigma_e^2 (z'z)^{-1}]$. Then $P(|\hat{\mu}| < \delta) \neq 0$ even for small $\delta$ so $E[\hat{\gamma} \hat{\mu}^{-1}]$ does not exist ($\hat{\gamma} \hat{\mu}^{-1}$ is a Cauchy random variable): or, in practice, outliers abound when $\hat{\mu}$ can be close to zero. However, when the non‐centrality $c = T \mu^2 m_{zz} / \sigma_e^2$ is large (where $m_{zz} = E[T^{-1} z'z]$), there is a small probability of $\hat{\mu} \simeq 0$ so well‐behaved outcomes result. Figures A4.3a–d show the histograms and density functions for 10 000 drawings for the IV estimator $\tilde{\beta}$ of $\beta$ in $y_t = \beta x_t + u_t$ where $u_t \sim \mathsf{IN}[0, 1]$ when $\beta = 1.0$ for $\mu = 0.4$ at $T = 25$ and 100 and $\mu = 0.8$ at $T = 50$ and 100, where $\sigma_e^2 = 1$ and $m_{zz} = 1$ (so $c = 4$, 16, 32, and 80). The sampling distribution is badly behaved until $c \geq 32$, after which it looks close to normality (although no moments exist).
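The contrast between a weak and a strong instrument can be illustrated with a short simulation. The Python sketch below treats the just‐identified case $k = 1$ under assumptions that go beyond the text: $z_t$ is drawn as standard normal (so $m_{zz} = 1$), and $u_t$ and $e_t$ are drawn independently, matching the variance expression for $\hat{\gamma}$ above. The function names and the outlier threshold are hypothetical. It compares the designs with $c = 4$ ($\mu = 0.4$, $T = 25$) and $c = 32$ ($\mu = 0.8$, $T = 50$).

```python
import random
import statistics


def iv_estimate(y, x, z):
    """Just-identified IV estimator with one instrument: (z'x)^{-1} z'y."""
    return sum(a * b for a, b in zip(z, y)) / sum(a * b for a, b in zip(z, x))


def simulate_iv(T, mu, beta=1.0, reps=2000, seed=0):
    """Return `reps` IV estimates of beta from x = mu*z + e, y = beta*x + u."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        # Assumption: z, e and u are mutually independent standard normals.
        z = [rng.gauss(0.0, 1.0) for _ in range(T)]
        x = [mu * zt + rng.gauss(0.0, 1.0) for zt in z]
        y = [beta * xt + rng.gauss(0.0, 1.0) for xt in x]
        draws.append(iv_estimate(y, x, z))
    return draws


def tail_share(draws, centre=1.0, width=2.0):
    """Share of estimates further than `width` from `centre` (a crude outlier count)."""
    return sum(abs(d - centre) > width for d in draws) / len(draws)


for T, mu in [(25, 0.4), (50, 0.8)]:
    draws = simulate_iv(T, mu)
    print(f"c = {T * mu ** 2:5.1f}: median = {statistics.median(draws):.3f}, "
          f"share beyond +/-2 = {tail_share(draws):.3f}")
```

The median is reported rather than the mean because the estimator has no finite‐sample moments, so a sample mean would be dominated by the occasional draw with $\hat{\mu}$ near zero.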

# A4.10 Mixing Processes


Mixing conditions restrict the memory of a stochastic process such that the distant future is virtually independent of the present and past. This provides an upper bound on the predictability of future events from present information. The past and future cannot be interchanged in these statements so that the relationship is asymmetric.

## A4.10.1 Mixing and Ergodicity

By way of analogy, imagine a sample space Ω which is a cup of coffee comprising 90 per cent black coffee and 10 per cent cream carefully placed in a layer at the top which defines the initial state $x 0$ of the system. The whole space is of unit measure μ (Ω) = 1, fixed throughout, where the measure is the volume. A transformation on Ω is defined by a systematic stir of the liquid in the cup, in which a spoon is moved in a 360° circle; let S denote the transformation and $x t$ the state of the system at time t so that:

$Display mathematics$
(A4.51)
where S is a mixing transformation in a literal sense. The experiment involves stirring the coffee/cream mixture, then drawing tiny samples of liquid from the cup at different points in time. The mixing is uniform if after enough stirs, any measurable set (i.e. volume of coffee drawn from the cup) contains 10% cream. In technical terms, let $𝒜$ and $ℬ$ respectively denote the sets of cream and black coffee, and μ(·) the measure function so that $μ ( 𝒜 ) = 0 . 1 , μ ( ℬ ) = 0 . 9$. Then S is uniformly mixing if, for any subset $𝒞 ⊂ Ω$ with $μ ( 𝒞 ) > 0$:
$Display mathematics$
(A4.52)
i.e. if the transformation shifts $𝒜$ around enough so that the amount of $𝒜$ ending up in $𝒞$ equals their relative shares of the whole space (see e.g. Halmos, 1950). For dynamical systems, one can take $𝒜$ and $𝒞$ to be subsets of the phase space of the system, and consider such questions as: What is $P ( x t ∈ 𝒞 | x 0 ∈ 𝒜 )$?

An important implication of uniform mixing for stationary processes is the property of ergodicity. Consider a subset $𝒞$; then S is ergodic if $S 𝒞 = 𝒞$ implies that $μ ( 𝒞 ) = 0$ or 1. In terms of the example, the transformation S is ergodic if the only sets which are invariant under S are the null set ∅ and the whole space Ω. This property certainly holds for mixing cream into coffee. Consider any $𝒞$ such that $S 𝒞 = 𝒞$, when S is uniform mixing from (A4.52):

$Display mathematics$
(A4.53)
But as $S^n \mathcal{C} = \mathcal{C}$ also, then $\mu(S^n \mathcal{C} \cap \mathcal{C}) = \mu(\mathcal{C} \cap \mathcal{C}) = \mu(\mathcal{C})$ so we have $\mu(\mathcal{C}) = \mu(\mathcal{C})^2$ which implies $\mu(\mathcal{C}) = 0$ or 1. Thus, uniform mixing transformations are ergodic.

The importance of this result derives from what is entailed by the statistical ergodic theorem (see e.g. Walters, 1975), namely, if a strictly stationary stochastic process {X t} is ergodic with finite expectation, then the time average for one realization:

$\bar{X}_T = T^{-1} \sum_{t=1}^{T} X_t$
(A4.54)
converges almost surely to the expectation $E[X_t]$ as $T \to \infty$. This is an essential condition for conducting inference about $E[X_t]$ from the time average.
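The practical content of the ergodic theorem is that a single long realization suffices to learn $E[X_t]$. The Python sketch below illustrates this for one assumed process, a stationary Gaussian first‐order autoregression with mean 2; the process, its parameter values, and the function name are illustrative choices rather than anything specified in the text.

```python
import random


def time_average_path(T, mean=2.0, rho=0.5, seed=0):
    """Running time averages of one realization of a stationary AR(1) with E[X_t] = mean."""
    rng = random.Random(seed)
    # Start from the stationary distribution so the realization is strictly stationary.
    x = mean + rng.gauss(0.0, (1.0 - rho ** 2) ** -0.5)
    total = 0.0
    averages = []
    for t in range(1, T + 1):
        x = mean + rho * (x - mean) + rng.gauss(0.0, 1.0)
        total += x
        averages.append(total / t)
    return averages


averages = time_average_path(50_000)
for T in (10, 100, 1_000, 50_000):
    print(f"T = {T:6d}: time average = {averages[T - 1]:.3f}")
```

Only one realization is ever drawn here: the averaging is across time, not across independent replications.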

## A4.10.2 Uniform Mixing and α‐Mixing Processes

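Formally, let $\mathcal{F}_m^k$ denote the σ‐field generated by $(X_m, \ldots, X_k)$, and let $\mathcal{A} \in \mathcal{F}_{-\infty}^{t}$ and $\mathcal{B} \in \mathcal{F}_{t+\tau}^{\infty}$.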

Definition 13. {X t} is a uniform mixing process if for integer τ > 0, and $P ( 𝒜 ) > 0$:

$Display mathematics$

Further, D13 implies that:

$Display mathematics$
say, then $\lim_{\tau \to \infty} \alpha(\tau) = 0$ when $\lim_{\tau \to \infty} \phi(\tau) = 0$. This second, weaker condition is called α‐mixing. In the limit as $\tau \to \infty$, for uniform‐mixing processes, events in $\mathcal{F}_{t+\tau}^{\infty}$ are essentially independent of events in $\mathcal{F}_{-\infty}^{t}$.

For example, if {X t} is an IID process, then D13 is satisfied with φ (τ) = 0 for τ ≥ 1, so {ε t} in (A4.31) is a uniform mixing process. Similarly, an m‐dependent process, which is a process where terms more than m‐periods apart are independent, as in an m th‐order moving average, is mixing with φ (τ) = 0 for τ ≥ m + 1. Moreover, if x t is given by (A4.31), then:

$Display mathematics$
(A4.55)
so that if $\varepsilon_t$ is IND, then $\{x_t\}$ is mixing. The autocorrelations of $\{x_t\}$ need not vanish exponentially as with (A4.31), and yet the process remains mixing with sufficient independence to allow useful asymptotic results to be established. Thus, (A4.31) is overly restrictive for the class of admissible data‐generation processes.
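The $m$‐dependent case mentioned above is easy to inspect numerically. The Python sketch below simulates a first‐order moving average, which is 1‐dependent, and computes sample autocorrelations; the parameter value and function names are hypothetical. Zero autocorrelation beyond lag 1 is only a necessary consequence of 1‐dependence, not a proof of it, so the output is an illustration rather than a test of mixing.

```python
import random


def simulate_ma1(T, theta=0.6, seed=0):
    """X_t = e_t + theta * e_{t-1} with e_t ~ IN[0, 1]: a 1-dependent process."""
    rng = random.Random(seed)
    e = [rng.gauss(0.0, 1.0) for _ in range(T + 1)]
    return [e[t] + theta * e[t - 1] for t in range(1, T + 1)]


def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    ck = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return ck / c0


x = simulate_ma1(20_000)
print("population lag-1 value theta / (1 + theta^2):", round(0.6 / (1 + 0.6 ** 2), 3))
for lag in (1, 2, 3):
    print(f"sample autocorrelation at lag {lag}: {autocorrelation(x, lag):.3f}")
```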

An important implication of uniform mixing is ergodicity as noted above, so that {x t} in (A4.31) is ergodic when the process is stationary, and:

$Display mathematics$
(A4.56)
Moreover, if $\{x_t\}$ is ergodic so is $\{x_t x_{t+s}\}$, so that sample second moments also converge almost surely to their population counterparts. Indeed, if $\{x_t\}$ is uniform mixing (and hence ergodic) so that $\phi(\tau) \to 0$ as $\tau \to \infty$, then any functions of a finite number of $\{x_t\}$ are also uniform mixing (with a new $\phi(\tau)$ which tends to zero as $\tau \to \infty$) so (e.g.) $\{x_t x_{t+s}\}$ is uniform mixing. More generally (see e.g. White, 1984):

Theorem 15. When $\{X_t\}$ is mixing with $\phi_x(\tau) \approx O(\tau^{-r})$ where $r > 0$, letting $Y_t = h_t(X_t, \ldots, X_{t-k})$ for a finite integer $k > 0$ when $h_t(\cdot)$ is a measurable function onto $\mathbb{R}$, then $\{Y_t\}$ is mixing with $\phi_y(\tau) \approx O(\tau^{-r})$.

Theorem 16. Further, when Z t(θ) = b t(Y t, θ) where b t(·) is a function measurable in Y onto $ℝ$ and is continuous on Θ, then φz (τ, θ) ≤ φy (τ) ∀ θ ∈ Θ, so Z t(θ) is mixing. Consequent on this last theorem, quite complicated functions of mixing processes remain mixing.

## A4.10.3 Laws of Large Numbers and Central‐Limit Theorems

The above developments allow us to formulate the analogue of a SLLN for mixing processes.

Theorem 17. If, in addition to the conditions in Theorems 15 and 16, there exist measurable functions $d_t(Z_t)$ so that $|b_t(\cdot)| < d_t(\cdot) \ \forall \theta \in \Theta$, and $m \geq 1$ and $\delta > 0$ such that $E[|d_t(Z_t)|^{m+\delta}] \leq \Delta < \infty \ \forall t$ with $r > m/(2m - 1)$ then:

$Display mathematics$
(A4.57)
This theorem provides a uniform strong law of large numbers for mixing processes. The larger is m, then the greater the order of moments assumed to exist, but the weaker the memory restrictions in terms of the value of r. If φ (τ) vanishes exponentially, then m can be set arbitrarily close to unity.
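The trade‐off between moments and memory can be tabulated from the condition $r > m/(2m - 1)$. The Python sketch below simply evaluates that bound for a few moment orders; the function name is hypothetical and the values of $m$ are arbitrary.

```python
def min_mixing_rate(m):
    """Lower bound m / (2m - 1) on the mixing rate r, for a moment order m >= 1."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return m / (2.0 * m - 1.0)


for m in (1.0, 1.5, 2.0, 4.0, 100.0):
    print(f"m = {m:6.1f}: r must exceed {min_mixing_rate(m):.4f}")
```

The bound falls from 1 at $m = 1$ towards $1/2$ as $m$ increases, so assuming more moments permits slower decay of $\phi(\tau)$.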

Next, we have a central‐limit theorem for mixing processes of the form {Z t} subject to the further conditions that:

1. (a) $E[Z_t] = 0$ and $E[|Z_t|^{2m}] \leq \Delta < \infty \ \forall t$ and $m > 1$;

2. (b) there exists a sequence $\{V_T\}$, $0 < \delta \leq V_T \leq \Delta$, such that for $S_T(j) = T^{-1/2} \sum_{t=1+j}^{T} Z_t$, then $E[S_T(j)^2 - V_T] \to 0$ uniformly in $j$, where $\{V_T\}$ need not be constant, but the variability depends only on $T$ and not on the starting point of the sum (i.e. the date).

Then:

Theorem 18. If {Z t} satisfies Theorems 15, 16 and 17 and conditions (a) and (b) with r > m/(m − 1):

$Display mathematics$
(A4.58)
While certain of the required conditions are neither very transparent nor easily verified, the need for continuity, finite moments of an appropriate order, and bounding functions in addition to mixing is unsurprising given the discussion in earlier sections. Conversely, this theorem is general and potentially covers a large class of econometric estimators providing ${ x t }$ is mixing, irrespective of any assumptions concerning the correct specification of the model under consideration (see White, 1984). As will be seen in the next section, mixing provides a useful set of sufficient conditions for Crowder's Theorem.

# A4.11 Martingale Difference Sequences


## A4.11.1 Constructing Martingales

The earlier laws of large numbers and central‐limit theorems were framed in terms of independent random variables. For data‐generation processes with autonomous innovations, this restriction is not too serious; but, for models defined by their error processes just being an innovation relative to the information used in the study, the assumption of independence is untenable. The necessary statistical apparatus already exists for developing a theory relevant for models, in the form of martingale limit theory (see Hall and Heyde, 1980). This theory includes as special cases many of the limit results discussed in earlier sections, since sums of independent random variables (as deviations about their expectations) are in fact martingales. Indeed, there is an intimate link between martingales, conditional expectations on σ‐fields, and least‐squares approximations, which is why the associated theory is useful for econometric modelling.

Let {X t} be a sequence of zero‐mean random variables on the probability space ${ Ω , ℱ , P }$, and let such that:

$Display mathematics$
(A4.59)
Let $Y T = Σ t = 1 T X t$, then:
$Display mathematics$
(A4.60)
and $Y_T$ is called a martingale.5 Since $X_T = Y_T - Y_{T-1}$, then $\{X_T\}$ is a martingale‐difference sequence (MDS).

A salient feature of economic data is their high degree of inter‐dependence and temporal dependence. Thus, let $X t$ denote a vector random process with respect to the probability space $( Ω , ℱ , P )$. As in Chapter A2, the joint data density Dx(·) can be sequentially conditioned as in (A4.3). Then a linear representation of $X t$, given the past, takes the form:

$Display mathematics$
(A4.61)
where $ℐ t − 1$ denotes previous information, π j is the matrix of coefficients whereby each $X t − j$ influences $X t$, and s is the longest time dependence. In general, therefore, ${ X t }$ is not a martingale. However, define:
$Display mathematics$
(A4.62)
then implicitly we have created the model:
$Display mathematics$
(A4.63)
where {ε t} is a process such that, by construction, $E [ ε t | ℐ t − 1 ] = 0$ and hence:
$Display mathematics$
(A4.64)
where is a subset of $ℐ t − 1$. The derived process {ε t} in (A4.64) is a vector of martingale‐difference sequences. Successive ε t are not independent in general, but are uncorrelated, and in particular:
$Display mathematics$
(A4.65)
Let:
$Display mathematics$
then:
$Display mathematics$
(A4.66)
Hence, {ν T} is a martingale. This result is intimately related to the famous Wold decomposition theorem (see Wold, 1938) that the purely non‐deterministic component of any stationary time series can be represented in terms of an infinite moving average of uncorrelated errors.
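The distinction between uncorrelated and independent errors can be seen in a simulated example. The Python sketch below uses a first‐order ARCH process, which is not discussed in this section and is chosen here purely as a convenient martingale‐difference sequence: its conditional mean is zero, but its conditional variance depends on the previous error. The parameter values and function names are assumptions.

```python
import random


def simulate_arch1(T, omega=0.75, a=0.25, seed=0):
    """eps_t = v_t * sqrt(omega + a * eps_{t-1}^2) with v_t ~ IN[0, 1].

    E[eps_t | past] = 0, so {eps_t} is a martingale-difference sequence,
    but successive values are dependent through the conditional variance.
    """
    rng = random.Random(seed)
    eps = [0.0]
    for _ in range(T):
        h = omega + a * eps[-1] ** 2
        eps.append(rng.gauss(0.0, 1.0) * h ** 0.5)
    return eps[1:]


def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    ck = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return ck / c0


eps = simulate_arch1(20_000)
squares = [e * e for e in eps]
print("lag-1 autocorrelation of eps   :", round(autocorrelation(eps, 1), 3))
print("lag-1 autocorrelation of eps^2 :", round(autocorrelation(squares, 1), 3))
```

The levels should show a lag‐1 autocorrelation near zero while the squares show a clearly positive one, which is the sense in which the sequence is uncorrelated but not independent.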

## A4.11.2 Properties of Martingale‐difference Sequences

If {X t} is an MDS from Y t whose second moment exists ∀ t, we can generalize Chebychev's inequality to:

$Display mathematics$
(A4.67)
Also, generalizations of the SLLNs hold, noting that $T^{-1} Y_T = \bar{X}_T$ (see e.g. White, 1984):
1. (a) if $\sum_{t=1}^{\infty} t^{-(1+r)} E[|X_t|^{2r}] < \infty$ for $r \in [1, 2]$, then $T^{-1} Y_T \xrightarrow{AS} 0$.

2. (b) if $E[|X_t|^{2r}] < B < \infty \ \forall t$ and some $r \in [1, 2]$, then $T^{-1} Y_T \xrightarrow{AS} 0$.

For IID random variables, the condition r > 1/2 is necessary and sufficient, so we come close to that requirement despite much weaker dependence assumptions.

Before discussing asymptotic distributions based on MDSs, we need an important additional concept. We met Borel sets in Chapter A2 when we wished to construct random variables from the basic event space, in terms of the sequence of intervals $ℬ z = ( − ∞ , z ]$, which are the sets ${ x : − ∞ < x ≤ z , z ∈ ℝ }$. Technically, the Borel field $ℬ$ is the smallest collection of $ℬ z , z ∈ ℝ$ which is closed under complements and countable unions. P(·) was well defined for such sets, which comprised all the relevant events, and we moved to a new probability space $( ℝ , ℬ , P x )$. If we think of the sample space here as $ℝ$ repeated indefinitely often (to allow for infinite sequences), then the Borel sets again generate an event space. This approach generalizes to q‐dimensional vector random variables X, with a probability space which we could write as $( ℝ ∞ q , ℬ , P X )$. If the basic event space increases over time due to information accrual, so does the Borel field, and we can now exploit that feature. The following two theorems sustain the use of martingale difference sequences for asymptotic distributions of model estimates (see e.g. Whittle, 1970, and White, 1984).

Theorem 19. Let ${ ℬ n }$ be an increasing sequence of Borel fields $ℬ n ⊆ ℬ n + 1$, and Z be a fixed random variable where $Y n = E [ Z | ℬ n ]$. Then:

$Display mathematics$
so that $Y n = E [ Y n + 1 | ℬ n ]$. Consequently, {Y n} is a martingale relative to $ℬ n$.

Further:

$Display mathematics$
Indeed, $Y_n$ is the random variable defined on $\mathcal{B}_n$ by minimizing $E[(Z - Y_n)^2]$, and so is a conditional expectation, or least‐squares approximation, with the property that $Y_n \perp (Z - Y_n)$ (read as $Y_n$ is orthogonal to $Z - Y_n$). Alternatively:
$Display mathematics$
(A4.68)
as required. This result provides a general way to construct an MDS of the form $Y_T - Y_{T-1}$, from an increasing sequence of Borel fields. Next:

Theorem 20. If {Y T} is a martingale for which $E [ Y T 2 ]$ is uniformly bounded, then Y T converges a.s. to a genuine random variable $Y = E [ Z | ℬ ∞ ]$.

The key here is that:

$Display mathematics$
(A4.69)
by the martingale property, and if $∑ t = 1 ∞ E [ X t 2 ]$ is finite due to the uniform bound on $E [ Y T 2 ]$, then the SLLN applies. There are many applications of these results of which two immediate ones are that:
1. (i) if $X_t \sim \mathsf{IID}[\mu, \sigma^2]$, then $T(\bar{X}_T - \mu)$ is a martingale and so by (a), $\bar{X}_T \xrightarrow{AS} \mu$;

2. (ii) a sequence of conditional expectations on an increasing sequence of $ℬ n$ is a martingale, so the innovations from a congruent model are a martingale difference sequence to which the above theorems apply (see Ch. 9).

Finally, we need a central‐limit theorem for MDS corresponding to the Lindeberg–Lévy result earlier:

Theorem 21. If {X t} is an MDS with $E [ X t 2 ] = σ t 2$ where $0 < σ t 2 < ∞$, and $W T = T − 1 ∑ t = 1 T X t$, with $S T 2 = ∑ t = 1 T σ t 2$ when:

$Display mathematics$
and $∑ t = 1 T X t 2 / S T 2 → P 1$, then:
$Display mathematics$
(A4.70)
Note the additional requirement that $∑ t = 1 T X t 2 / S T 2 → P 1$, which was implied by the Lindeberg‐Lévy theorem when the sequences were independent. We now apply Theorem 21 to MLE.

## A4.11.3 Applications to Maximum‐likelihood Estimation

We maintain the regularity assumptions made in Chapter A3, assume stationarity and uniform mixing of {X t}, but weaken the sampling assumption, using results in Crowder (1976). Mixing ensures that information continually accrues, yet the process is ergodic as shown in §A4.10. Let $ℓ t ( θ ) = ℓ ( θ ; x t | ℐ t − 1 )$ denote the log‐likelihood function for one observation on the random variable X from the sequential density $D x ( x | ℐ t − 1 , θ )$ for $θ ∈ Θ ⊆ ℝ k$ and history $ℐ t − 1$, and let:

$Display mathematics$
(A4.71)
where
$Display mathematics$
(A4.72)
with:
$Display mathematics$
(A4.73)
Let $θ p$ denote the population value of $θ$. It is assumed that $E [ q t ( θ p ) | ℐ t − 1 ] = 0$ and:
$Display mathematics$
(A4.74)
where the notation $H t ( θ p , θ )$ potentially allows different rows to be evaluated at different points in $[ θ p , θ ]$, and:
$Display mathematics$
(A4.75)
Therefore, $Z t = λ ′ q t ( θ p )$ is an MDS, so that:
$Display mathematics$
(A4.76)
is a martingale, essentially achieved by sequential factorization of the joint density function. Its (normal) limiting distribution can be derived from the results in Theorem 21 under the present assumptions, via the martingale version of the Lindeberg‐Lévy central‐limit theorem. Let:
$Display mathematics$
(A4.77)
then:
$Display mathematics$
(A4.78)
Using the Cramér‐Wold device for the martingale Z (T):
$Display mathematics$
(A4.79)

Next, we need to relate that distribution to the limiting distribution of the MLE. A first‐order Taylor‐series expansion around $θ p$ is taken to be valid for $θ ∈ 𝒩 ( θ p )$, so that asymptotically, the average likelihood function is quadratic in the neighbourhood of $θ p$ (a mean‐value theorem could be used, given the consistency of the MLE noted below):

$Display mathematics$
(A4.80)
Assume that the MLE $θ ∧ T$, based on solving the score equation $q ¯ T ( θ ∧ T ) = 0$, is unique. Then:
$Display mathematics$
(A4.81)
and
$Display mathematics$
(A4.82)
Consequently, from (A4.75) and (A4.81), $B_T^{-1}$ is $O_p(T^{-1})$. From (A4.80):
$Display mathematics$
(A4.83)
Crowder proves the existence of a divergent sequence $\{C_T\}$ (related to the lower bound of the eigenvalues of $H_{(T)}(\cdot) H_{(T)}(\cdot)'$) such that, on scaling (A4.83) by $C_T^{-1}$, the first two terms converge to zero, whereas the third diverges except at $\theta_p$ (see (A4.81)), which allows him to prove consistency of the MLE (see Ch. 11).

Next, from the mean‐value expression corresponding to (A4.80), but for the MLE:

$Display mathematics$
(A4.84)
From (A4.84) and (A4.81), scaling by √T in (A4.84) achieves a non‐degenerate limiting normal distribution based on (A4.79):
$Display mathematics$
(A4.85)
where $B_1^{-1}(\theta_p)$ denotes the matrix for a single observation, so we rely on ergodicity and stationarity again.

Transient parameters can be allowed for, as can logistic growth in the data series. This provides a general result for maximum‐likelihood estimation in stationary, ergodic processes, with considerable initial data dependence and some heterogeneity, extending section A4.7.
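
As a hedged illustration of the limiting result for the MLE, the following sketch again assumes a Gaussian AR(1) with unit error variance, $y_t = \beta y_{t-1} + e_t$. In that assumed model the conditional MLE of $\beta$ coincides with least squares, and the inverse of the single‐observation information is $1 - \beta^2$. The function names, sample size, and seed are hypothetical choices made for the example. The simulation compares the Monte Carlo variance of $\sqrt{T}(\hat{\beta}_T - \beta)$ with that asymptotic value.

```python
import math
import random
import statistics


def ar1_mle(beta, T, rng):
    """Conditional Gaussian MLE of beta (equal to least squares) on one path.

    Assumed illustrative model: y_t = beta * y_{t-1} + e_t, e_t ~ IN[0, 1].
    """
    y_prev = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - beta ** 2))  # stationary y_0
    num = 0.0
    den = 0.0
    for _ in range(T):
        y = beta * y_prev + rng.gauss(0.0, 1.0)
        num += y * y_prev       # sum of y_t * y_{t-1}
        den += y_prev * y_prev  # sum of y_{t-1}^2
        y_prev = y
    return num / den


def scaled_errors(beta=0.5, T=200, reps=2000, seed=2024):
    """Monte Carlo draws of sqrt(T) * (beta_hat - beta)."""
    rng = random.Random(seed)
    return [math.sqrt(T) * (ar1_mle(beta, T, rng) - beta) for _ in range(reps)]


errors = scaled_errors()
mle_mean = statistics.fmean(errors)
mle_var = statistics.pvariance(errors)
asy_var = 1.0 - 0.5 ** 2  # inverse single-observation information at beta = 0.5

print(f"mean = {mle_mean:.3f}, variance = {mle_var:.3f}, "
      f"asymptotic variance = {asy_var:.3f}")
```

With a finite sample the mean of the scaled errors is slightly negative, reflecting the small‐sample bias of the autoregressive estimator, which vanishes as $T$ grows.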

# A4.12 A Solved Autoregressive Example

Consider the following stationary data‐generation process for a random variable $y_t$:

$Display mathematics$
(A4.86)
when $|\beta| < 1$ and $y_0 \sim \mathsf{N}[0, (1 - \beta^2)^{-1}]$.

(a) Obtain the population moments of the process.

(b) Derive (i) $E[T^{-1} \sum_{t=2}^{T} y_{t-1} e_t]$; (ii) $E[T^{-1} \sum_{t=1}^{T} y_t^2]$; (iii) $E[T^{-1} \sum_{t=2}^{T} y_t y_{t-1}]$; and (iv) $V = V[T^{-1} \sum_{t=1}^{T} y_t^2]$.

(c) Derive the limiting distribution of the sample mean.

(d) Obtain the limiting distribution of the least‐squares estimator of $\beta$.

#### Solution to (a)

When |β| < 1, the population moments of a first‐order autoregressive process are derived as follows. Solve (A4.86) backwards in time as:

$Display mathematics$
so that:
$Display mathematics$
(A4.87)
Letting $V[y_t] = \sigma_y^2 = E[y_t^2]$:
$Display mathematics$
(A4.88)
since $\{y_t\}$ is stationary. Further, since $E[e_t^4] = 3$, noting only the non‐zero terms:
$Display mathematics$
(A4.89)
Finally:
$Display mathematics$
(A4.90)
Also $C[y_t, y_{t-k}] = \beta C[y_t, y_{t-k+1}]$ as $E[y_{t-k} e_t] = 0 \; \forall k > 0$, so $\mathrm{corr}(y_t, y_{t-k}) = \beta^k$.
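
The second moments in solution (a) can be checked by simulation. The sketch below assumes the process is $y_t = \beta y_{t-1} + e_t$ with $e_t \sim \mathsf{IN}[0, 1]$, an assumption consistent with the stated variance of $y_0$ and with $E[e_t^4] = 3$. The value $\beta = 0.5$, the path length, and the helper names are hypothetical. One long path is generated and its sample variance and autocorrelations are compared with $(1 - \beta^2)^{-1}$ and $\beta^k$.

```python
import math
import random


def simulate_ar1(beta, T, seed):
    """Generate y_0, ..., y_T from the assumed stationary Gaussian AR(1)."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0 / math.sqrt(1.0 - beta ** 2))]  # stationary y_0
    for _ in range(T):
        y.append(beta * y[-1] + rng.gauss(0.0, 1.0))
    return y


def sample_autocov(y, k):
    """Average of y_t * y_{t-k}; the process has zero mean by construction."""
    n = len(y) - k
    return sum(y[t] * y[t - k] for t in range(k, len(y))) / n


beta = 0.5
y = simulate_ar1(beta, T=200_000, seed=7)
var_hat = sample_autocov(y, 0)
corr_hat = {k: sample_autocov(y, k) / var_hat for k in (1, 2, 3)}

print(f"sample variance = {var_hat:.3f}, population = {1 / (1 - beta ** 2):.3f}")
for k, r in corr_hat.items():
    print(f"lag {k}: sample corr = {r:.3f}, beta^k = {beta ** k:.3f}")
```

A single long path suffices here because the assumed process is stationary and ergodic, so time averages converge to the corresponding population moments.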

#### Solution to (b)

$Display mathematics$