This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
1, 1/p + 1/q = 1:
and consequently, taking yj = 1, lEji.Xjr^nP-'Ejl.lx/^l.
(1.4.3)
The latter inequality is a sharpening of the following trivial but useful inequality: j
]
p
,
p > 0.
Moreover, by Minkowski's inequality we have for 1 < p < oo,
(1.4.4)
14
Basic probability theory
{(l/n)E°=ilXj + yjr} 1/p< {(l/n)£jL.N P } 1/p + {(l/n)E]LilyjlP}1/P-(1-4-5) Finally we shall also use (in chapter 2) the following result. Theorem 1.4.1 Let X be a r.v. on {&,g,P} such that E[|X|] < oo. Let (An) be a sequence of sets in g such that lim n _+ooP(/ln) = 0. Then: lim n ^oJ,Jx(a>)|P(da>) = 0. Proof: Let for n = 1,2,..., An = {coeQ :|x(a))| > n}; T n = {co e Q : n < |x(co)| < n + 1}. From the fact that J|x(co)|P(dco) = Ju^ orn |x(a;)|P(da)) = £ ~ o J r J x M | P ( d c o ) < oo we conclude that JdJx(fl>)|P(da>) = E £ n Jrk|x(co)|P(dco) ^ 0 as n ^ oo. The theorem now follows from: fAn\x(ri)\P(dco)
= $AnnAk\x(co)\I>(da>) +J ylnnJ c|x(co)|P(dco)
< J^k|x(G))|P(dG)) + \A,nAi kP(dco) < JdJx(a))|P(dco) + kP(^ n ) by letting first n —• oo and then k —> oo.
Q.E.D.
Exercises 1. Prove the first set of properties of the integral with respect to a probability measure for finite-valued random variables. 2. Prove the second property for simple functions of independent random variables. 3. Prove Chebishev's inequality. (Hint: Write E[
(|X|)] = 4. Prove Holder's inequality. (Hint: Denote a = |X| P /{E[|X| P ]}, b = |Y| q /{E[|Y| q ]}, Z = a 1/p b 1/q , and use the concavity of ln(.) to show thatln(Z) < ln[(l/p)a + (l/q)b]. 5. Prove Jensen's inequality for the case that X is discretely distributed, i.e., P(X =
Xj)
=
Pj
> 0 ; j = l,...,n ; EjLiPj = 1-
1.5 Characteristic functions A special and very useful mathematical expectation is the so-called characteristic function: Definition 1.5.1 Let X be a random vector in R k with dis-
Characteristic functions
15
tribution function F. The characteristic function of this distribution is the following complex-valued function
x R m which is continuous in 6 for each x G R m . There exists a mapping 0(x) from R k into 0 with Borel measurable components such that
Random functions
17
Consequently the supremum involved is a Borel measurable real function on R m . Moreover, if (p(0,x) is continuous on 0 x R m then sup0G6>(/?(#,x) is a continuous function on R m . Proof: Jennrich(1969). (N.B. Compact sets in R k are Borel sets.) Of course, a similar result holds for the 'inf case. The condition in theorem 1.6.1 that the set 0 is compact (hence bounded) is not strictly necessary for the Borel measurability of sup#G6>(/?(0,x) and mfee0(p(O,x), provided that 0n = 0 n {0 e R k : |0| < n} is compact for each n. Then we have:
which by theorems 1.3.2 and 1.6.1 are Borel measurable functions. Let us return to more general random functions. The properties of a random function f(.) may differ for different co in Q. For two points <x>\ and oj2 in Q it is for example possible that f(r,coi) is continuous and f(i,co2) is discontinuous at the same T. In this study we shall always consider properties of random functions holding almost surely, which means that a property of f(.) = f(.,co) holds for all co in a set E e 5 with P(E) = 1. Thus, for example, the statement: "f(.) is a.s. continuous on 0" means that there is a null set N such that f(.,co) is continuous on 0 for all co e Q\N. Exercises: 1. Let X be uniformly distributed on [0,1] and let for 0 e [0,1], {(0) = 1 if X > 0, f(<9) = 0 if X < 6. Is f(#) a random function on [0,1]? 2. Let X be N(0,l) distributed and let for 9 e R, f(0) = 6 if 0 is rational, f(0) = -0 if 6 is irrational. Prove that f(X) is a continuously distributed random variable. 3. Let for x e R and 6 e R,
0, ip(x,0) = [l-(0-2) 2 ]/[l+x 2 ]if 1 < 9< 3andx<0,
0, (f(x,6) = 0 elsewhere, e) = P(X n < y>(X)-e) < P(X n < X-e) (t)dF(t)| < 5e for sufficiently large n, which proves the "only if" part of the theorem. Now let u be a continuity point of F and define (f(t) = 1 if t < u, (f(t) = 0 if t > u, E[?(X)] if y? is a bounded continuous function. Q.E.D. The converse of this theorem is not generally true, but it is if X is constant a.s., that is: P(X = c) = 1 for some constant c. In that case the proper limit F involved is: F(t) = 1 if t > c, F(t) = Oift < c. The proof of this proposition is very simple: P(|X n - c | <e) = P ( c - e < X n < c + e) = F n (c + e) - F n ( ( c - e ) - ) -¥(c-e) = 1 for every e > 0, since c + e and c — e are continuity points of F. Thus: Theorem 2.3.3 Convergence in distribution to a constant implies convergence in probability to that constant. Let X n and X be random vectors in R k such that X n —> X in distr. and let f be any continuous real function on R k . For any bounded continuous real function (p on R it follows that (f(f) is a bounded continuous real function on R k , so that by theorem 2.3.1, E[ E[^(f(X,c)] for any bounded continuous real function ip on R, which by theorem 2.3.1 implies f(X n ,Y n )^f(X,c) in distr. Let M be the uniform bound of ip and let F n and F be the distribution functions of X n and X, respectively. For every e we can choose continuity points a and b of F such that: P(X G (a,b]) = F(b) - F ( a ) > 1 -e/(2M). Moreover, for any d > 0 we have: F properly pointwise. Proof: Cf. Feller (1966). This theorem is basic for proving central limit theorems. Moreover, the following corollary of theorem 2.3.6 is very useful in proving multivariate asymptotic normality results. Theorem 2.3.7 Let (Xn) be a sequence of random vectors in R k If for all vectors £ e R k , 0. Then J(^(x)dFn(x) -+ JV(x)dF(x). Proof: Define for a > 0 <pa(x) = (p(x) if | (x) > a, <pa(x) = - a i f > ( x ) < - a . x oo . It follows therefore from the monotone convergence theorem (theorem 2.2.4) and theorem 2.3.1 that: 0 Proof: Consider the function ?a(x) defined in (2.5.1). Then obviously by the independence of the Xj's, the boundedness of <^aOO a n ^ Chebishev's inequality Xj)]} = 0, X in distr., hence the distribution function F n of X n converges F properly setwise then (2.5.2) carries over for Borel measurable <pa, as we have just shown in theorem 2.5.4. Consequently we have: Theorem 2.5.5 Let (F n ) be a sequence of distribution functions on R k satisfying F n —> F properly setwise. Let (p(x) be a Borel measurable real function such that sup n JV(x)| 1 + <5dFn(x) < oo for some S > 0. m (0o) + Es~ifos(0) - ^ o ) ) ' Z * _ s ] 2 . Clearly 90 minimizes Q(0) and hence 90 = 9*. This proves: Theorem 8.1.3 Under assumptions 8.1.1, 8.1.2, and 8.1.3, ^ S2 is continuous if for each f G Si and arbitrary e > 0 there exists a S > 0 such that p2(4>(f),#(g)) < e if g G Si and Pi(f,g) < <5. Continuous mappings are Borel measurable. Now consider a sequence (Xn) of random elements of S, defined on a common probability space {(2,g,P}. Each X n induces a probability measure / j n on {S,33} by the correspondence /in(B) = P[{o>eQ: xn(©,.) G B}], B e 23. X if and only if for all bounded continuous real functions ip on S, E[ip(Xn)] 1, /i n(K) > 1 — e. If /i n is induced by a random element X n of S, then the sequence (Xn) is called tight. m, it follows from (6.2.3) and theorem 6.2.1 that cov{Z< m) K[(x-X< m) )/ 7n ],z| m) K[(x-X} m) )/y n ]} <2/(j)' /2 E({Z< m) K[(x-X< m) )/y n ]} 2 ). 
Step 1 follows now from the fact that Proof of step 2: From the inversion formula for Fourier transforms we have = [l/(2 7r)]kJexp(-i-t'x)V'(t)dt From (10.7.9) it follows that for all x e R |Z0ZjK[(x - X0)/7n]K[(x - Xj)/y J
18
Basic probability theory and let #be such that
— ls(p continuous? — Determine 6(x). — What is your conclusion regarding the second part of theorem 1.6.1?
Convergence In this chapter we consider various modes of convergence, i.e., weak and strong convergence of random variables, weak and strong laws of large numbers, convergence in distribution and central limit theorems, weak and strong uniform convergence of random functions, and uniform weak and strong laws. The material in this chapter is a revision and extension of sections 2.2-2.4 in Bierens (1981). 2.1 Weak and strong convergence of random variables In this section we shall deal with the concepts of convergence in probability and almost sure convergence, and various laws of large numbers. Throughout we assume that the random variables involved are defined on a common probability space {&,g,P}. The first concept is well known: Definition 2.1.1 Let (X n) be a sequence of r.v.'s. We say that X n converges in probability to a r.v. X if for every e > 0, linin^ooPdXn — X| < e) = 1, and we write: X n —> X in pr. or plinin^ooXn = X. However, almost sure convergence is a much stronger convergence concept: Definition 2.1.2 Let (X n) be a sequence of r.v.'s. We say that X n converges almost surely (a.s.) to a r.v. X if there is a null set N G g (that is a set in g satisfying P(N) = 0) such that for every co G £2\N, limn_^ocxn(co) = x(co), and we write: X n —> X a.s. or linin^ooXn = X a.s. Note that this definition is equivalent to: X n -* X a.s. if P({COGO: l i n i n ^ x ^ ) = x(co)}) = 1. A useful criterion for almost sure convergence of random variables is given by the following theorem. Theorem 2.1.1 Let X and Xi,X 2 ,... be random variables. Then X n —> X a.s. if and only if 19
20
Convergence l i m n ^ o o P ( n - = n { | X m - X | < e})= 1 for every e > 0.
(2.1.1)
Proof: First we prove that x(co) = lim n ^ oo x n (a;) pointwise on Q\N implies (2.1.1). Let co0 G &\N. Then for every e > 0 there is a number sucn that |xn(co0) x(co0)| < e for all n > no(co0,e). Now consider the following set in g. An(e) = n^ =n {co G & :|xm(o>) -x(©)| < e}. Then co0 G AUo(
Random variables
21
Corollary 2.1.1 X n —> X a.s. implies X n —> X in pr. Proof Note that
n oQ
ri"v
•{L/VjYj
"v^I ^
-**-
_
\
^j
^
fI~v
|
AJJ
^rI
—-^*-
<^
1
^/
and consequently P(n~=n{|Xm - X | < e}) < P(|Xn - X | < e).
Q.E.D.
The following simple but important theorem provides another useful criterion for almost sure convergence. Theorem 2.1.2 (Borel-Cantelli lemma) If for every e > 0, £nP(|X n - X | > e) < oo, then X n —> X a.s. Proof. Consider the set An(e) = n» =n {co e Q: |xn(co) -x(co)| < e}= n- = n {|X n - X | < e}. From theorem 2.1.1 it follows that it suffices to show P(An(e)) -> 1 or, equivalent^, P(An(e)c) -> 0. But
and hence
Since the latter sum is a tail sum of the convergent series EnP(|X n -X| > S) we must have £~ =n p (|Xn - X | > e) -> 0 as n -+ oo, which proves the theorem.
Q.E.D.
The a.s. convergence concept arises in a natural way from the strong laws of large numbers. Here we give three versions of these laws. Theorem 2.1.3 Let (Xj) be a sequence of uncorrelated random variables satisfying E[Xj — EXj)]2 = O(j0 for some \i < 1. Then (l/n)£jL 1 [X J -E(X J )]-+Oa.s. Proof: This theorem is a further elaboration of the strong law of large
22
Convergence
numbers of Rademacher-Menchov (see Revesz [1968, theorem 3.2.1] or Stout [1974], theorem 2.3.2), which states: Let (Yj), j > 0, be a sequence of orthogonal random variables (orthogonality means that E(YjiYj2) = 0 if ji / j 2 ). If < oo
then XjXj converges a.s. (which means that Z = XjTi is a.s. a finite valued random variable). Now let Yj = [Xj-E(Xj)]/j. Then £j(logj)E(Yj 2 ) = EjOogJPG^ 2 ) < o o f o r / K 1, hence Xj[Xj — E(Xj)]/j converges a.s. From the Kronecker lemma (see Revesz [1968, theorem 1.2.2] or Chung [1974, p. 123]) it follows that this result implies the theorem under review. Q.E.D. Theorem 2.1.4 Let (Xj) be a sequence of uncorrelated random variables satisfying: sup n (l/n)X] J I L 1 E[|Xj-E(Xj)| 2 + ^ < oc for some 8 > 0. Then (l/n)£jLi[Xj -E(Xj)] -+ 0 a.s. Proof: The moment condition in this theorem implies E[|Xj -E(Xj)| 2 + <5] = O(j), so that by Liapounov's inequality E{[X j -E(X J )])} 2 < {E[|X j -E(X J )]| 2 + ^} 2 / ( 2 + <5) = Oa 2/(2 + ' ) ). The theorem now follows from theorem 2.1.3.
Q.E.D.
If the Xj's are independent identically distributed (i.i.d.) the condition on the second moment is not needed: Theorem 2.1.5 (Strong law of large numbers of Kolmogorov) Let (Xj) be a sequence of i.i.d. random variables satisfying E(|Xi|) < oo. Then
Proof Chung (1974, theorem 5.4.2). We already have mentioned that almost sure convergence implies convergence in probability. There is also a converse connection, given by the following theorem. Theorem 2.1.6 Let X and Xi,X 2 ,... be r.v.'s. Then X n —> X in pr. if and only if every subsequence (nk) of the sequence (n) contains a further subsequence (n k ) such that
Random variables
23
Xnk -> Xa.s. a s j -> oc
(2.1.2)
Proof. Suppose that every subsequence (nk) contains a further subsequence (nk.) such that (2.1.2) holds, but not that X n —• X in pr. Then there exist numbers e > 0, 3 > 0, and a subsequence (nk) such that
P(|Xnk-X|<e)
G O\(NIUN2)
we have:
|f(xn(co)) -f(x(co))| < e if n > n0(co,S(e,(o)). Since NiUN 2 is a null set in g, this proves part (a) of the theorem. (b) For an arbitrary subsequence (nk) we have a further subsequence
24
Convergence
(nk.) such that (2.1.2) holds and consequently by part (a) of the theorem: f(Xnkj) -> f(X) a.s. By theorem 2.1.6 this implies f(Xn) -> f(X) in pr.
Q.E.D.
Remark: Thus far in this section we have dealt only with random variables in R. However, generalization of the definitions and the theorems in this section to finite dimensional random vectors is straightforward, simply by changing random variable to random vector. The only notable extension is: Theorem 2.1.8 Let X n - (X ln ,...,X kn )' and X =(X 1 ,...,X k )' be random vectors. Then X n —> X a.s. (in pr.) if and only if for each j , X jn -> Xj a.s. (in pr.). Exercises 1. Let (Xj) be a sequence of random variables such that for each j E(Xj) = ji, E [ ( X J - M ) 2 ] = o\ su P j E[|Xj-^| 3 ] < oo,
cov(Xj,Xj.m) = Oifm > l,cov(Xj,Xj. m )/Oifm = 1. Prove that (l/n)£jLiXj -+ ^ a .s. (Hint: Let Yij = X 2j , Y 2j = X2j + 1. Prove first that for i = 1,2, ( l / n ^ j ^ Y y -> \i a.s.) 2. Let (Yj), j > 0, be a sequence of random variables satisfying E(Yj) = 0, E(Yj2) = 1/j2 and let X n = (Y n ) n , n = 1,2,... Prove that X n -> 0 a.s. (Hint: Combine Chebishev's inequality and the Borel-Cantelli lemma.) 3. Let X n m and X m be random variables, where n = 1,2,... and m = 1,2,... such that for m = 1,2,...,k < oo, Xn>m —> X m a.s. as n ^ oo. Prove that nm maxm=lf2,...,k Xv n,m
4. Let (Xj) be a sequence of i.i.d. random variables satisfying E(Xj) = 0, 0 < E(Xj2) < oo. Prove that SjLiXj 2 ] - 0 a.s. 5. Let X be a random variable satisfying P(X = 1) = P(X = - 1 ) = XA. Let Y n = X n , n = 1,2,... Does Y n converge a.s. or in pr.? 6. Prove theorem 2.1.8.
Mathematical expectations
25
2.2 Convergence of mathematical expectations
If X and Xi,X 2v .. are random variables or vectors such that for some p > 0, E(|X n - X | p ) -> 0 as n -> oo, then it follows from Chebishev's inequality that X n —> X in pr. The converse is not always true. A partial converse is given by the following theorem. Theorem 2.2.1 If X n —+ X in pr. and if there is a r.v. Y satisfying |X n | < Y a.s. for n = 1,2,... and E(YP) < oo for some p > 0, thenE(|X n - X | p ) - + 0 . Proof. If P(|X| > Y) > 0 then X n -> X in pr. is not possible, hence |X| < Y a.s. Since now |X n — X| < 2Y there is no loss of generality in assuming X = 0 a.s. We then have:
J|x n (co)| p P(dco) = J { | Xn(co) |>, } |x n The theorem follows now from theorem 1.4.1.
Q.E.D.
Putting p = 1 in theorem 2.2.1 we have: Theorem 2.2.2 (Dominated convergence theorem) If X n —•X in pr. and if |X n | < Y a.s., where E(Y) < oo, then E(X n ) -> E(X). We shall use this theorem to prove first Fatou's lemma which in turn will be used to prove the monotone convergence theorem. Theorem
2.2.3 (Fatou's lemma) If X n ooXn) < liminf n ^ oo E(X n ).
>
0 a.s., then
Proof. Put X = liminfn^ooXn and let (f(x) be any simple function satisfying 0 <
- 0.
Moreover, since ^(x) is a simple function we must have E[v?(X)] < oo. From the dominated convergence theorem and from Y n < X n a.s. it follows now: E[
26
Convergence
Proof. Since our sequence (X n) is nondecreasing, we have: = liminf n _ oo E(X n ), so that by Fatou's lemma, EQimn^ooXn) < lim n _ >oo E(X n ). However, for any n we have X n < linin^ooXn a.s. because X n is nondecreasing, hence E(X n ) < E^imn^ooXn) and consequently lim n _ oo E(X n ) < This proves the theorem.
Q.E.D.
Exercises 1. Let Y n be defined as in exercise 5 of section 2.1. Does E(Y n ) converge? 2. Let (fn) be a sequence of Borel measurable real functions on R k and let ft be a probability measure on {Rk,33k}. Suppose there exists a non-negative Borel measurable real function g on R k such that supn|fn(x)| < g(x), with Jg(x)/x(dx) < oo. Moreover, assume that f(x) = limn_,oofn(x) exists for each x in a set S C R k with /i(S) = 1. Prove that lim n ^ oo Jf n (x)/i(dx) = Jf(x)/*(dx). (This is another version of the dominated convergence theorem.) 2.3 Convergence of distributions If X and Xi,X 2 ,... are r.v.'s with distribution functions F, Fi, F 2 ,..., respectively, then one would like to say that X n converges in distribution to X if for every t G R, F n (t) —> F(t). However, if X and F are given and if we define X n - X + 1/n, then: F n (t) = P(X n < t) = P(X < t - 1/n) = F(t - 1/n), so that for every discontinuity point to of F we have: - 1 / n ) = F ( t 0 - ) < F(t 0 ), while intuitively we would expect that in this case we also have convergence in distribution. Furthermore, if X n = X + n we have: F n (t) = P(X n < t) = F ( t - n ) -> F ( - o o ) = 0 for every t. Thus not every sequence of distribution functions converges to another distribution function. In the latter case we say that the convergence is improper. Definition 2.3.1 A sequence (F n (t)) of distribution functions converges properly pointwise if F n (t) —> F(t) pointwise for all
Distributions
27
continuity points of F, where F is a distribution function. We then write: F n —> F properly pointwise. The exclusion of discontinuity points avoids the complication that otherwise the function F(t) = limn^ooFn^) may not be right continuous. In view of the above example we now define: Definition 2.3.2 A sequence (Xn) of random variables (or random vectors) converges in distribution to X, if their underlying distribution functions (F n ), F, respectively, satisfy F n —> F properly pointwise. We then write: X n —• X in distr. Remark: If this "limit" distribution F is the distribution function of (for example) the normal distribution N(/z,cr2), we shall also write: X n —> N(/i,cr2) in distr. There is a close connection between proper pointwise convergence of distribution functions and convergence of mathematical expectations, as is shown by the following theorem. This theorem is fundamental as it allows for a variety of applications. Theorem 2.3.1 Let F and F n , n = 1,2,... be distribution functions on R k . Then F n —> F properly pointwise if and only if for every bounded continuous real function ip on R k ,
JV(t)dFn(t) - MOdF(t). Proof: Since the proof for the general case k > 1 is a straightforward extension of that for k = 1, we assume k = 1. Suppose F n —> F properly pointwise. For given e > 0 we can always find continuity points a and b of F such that F(b) — F(a) > 1 — e. Let
|JV(t)dF n (t) -J^(t)dF n (t)| < J {tG(a , b]} edF n (t) + J {t ^ (a , b]} dF n (t)
28
Convergence = e(F n (b) - F n ( a ) ) + 1 - F n ( b ) 4- F n (a) -> e(F(b) - F ( a ) ) + 1 - F ( b ) + F(a) < 2e.
Moreover, JV(t)dF n (t) =ES 1 {inft e ( t 1 ,t I + l ] ^(t)}(F n (t i + 1 ) -F n (ti)) - E ' T ^ i n f t e c v ^ ^ W K F C t i ^ ) -F(tO) = JXt)dF(t) and
So we have: |J
Moreover: 0 < J(^2,m(t) -(^l,m(t))dF(t) < J{t€(u-l/m,u+l/m]}dF(t) = F(u+l/m) - F ( u Since u is a continuity point of F, F(u + 1/m) — F(u — 1/m) can be made arbitrarily small by increasing m, hence F n (u) —•F(u), which proves the "if" part. Q.E.D. A direct consequence of this theorem is that: Theorem 2.3.2 X n —> X in pr. implies X n —* X in distr. Proof: By theorem 2.1.7 it follows that for any continuous function (p we have X n —> X in pr. implies
Distributions
29
2.2.2,
and consequently f(Xn) —> f(X) in distr. Thus we have: Theorem 2.3.4 Let X n and X be random vectors in R k and let f be a continuous real function on R k. Then X n —> X in distr. implies f(Xn) -> f(X) in distr. Remark: The continuity of f in theorem 2.3.4 is an essential condition. For example, let X be a continuously distributed random variable, let X n the value of X rounded off to n decimal digits and let f(x) = 1 if x is rational, f(x) = 0 if x is irrational. Then X n —> X in distr. and f is Borel measurable. However, f(Xn) = 1 a.s. and f(X) = 0 a.s., which renders f(Xn) —• f(X) in distr. impossible. A more general result is given by the following theorem. Theorem 2.3.5 Let X n and X be random vectors in R k , Y n a random vector in R m and c a non-random vector in R m . If X n —> X in distr. and Y n —• c in distr., then f(X n,Yn) —> f(X,c) in distr. for any continuous real function f on R k x C, where C is some subset of R m with interior point c. Proof: Again we prove the theorem for the case k = m = 1 since the
30
Convergence
proof of the general case is similar. It suffices to prove that for any bounded continuous real function
Y n )] - E
J
2M.P({X n e(a,b]}n{|Y n -c|><5}) + 2M.P(X n 0(a,b]) + 2M-P{|Y n -c|>(5} + 2 M ( 1 - F n ( b ) + F n (a)). Since by theorem 2.3.3, Y n —> c in pr., we have P ( | Y n - c | > S) -+ 0 for any d > 0, whereas F ^ + F^a)) = 2 M ( 1 - F ( b ) + F(a)) < e. Furthermore, since (p(ti,t2) is uniformly continuous on the bounded set {(ti,t 2 ) G R 2 : a <
ti
provided that S is so small that this set is contained in R k x C, we can choose S such that the last integral is smaller than e. So we conclude:
Since obviously E[(/?(Xn,c)] —• E[(/?(X,c)] because Xn —• X in distr., the theorem follows. Q.E.D. Remark: It should be stressed that the constancy of c is crucial in theorem 2.3.5. Thus, for example, X n —• X in distr. and Y n —• Y in distr., where X and Y are nonconstant random variables, does not generally imply X n + Y n —• X + Y in distr. or X n Y n —> X-Y in distr. For example, let
Distributions
31
Consequently, theorem 2.1.8 does not carry over to convergence in distribution. Finally we note that convergence in distribution is closely related to convergence of characteristic functions: Theorem 2.3.6 Let (F n ) be a sequence of distribution functions on Rk and let (
^
)
]
)
= A.
(Recall that if A is a singular matrix then the normal distribution involved is said to be singular). Suppose that for every vector £ in R k , £'X is normally distributed. Then £'X~N(£'jU,£Vl£), hence for every t eR and every £ £ R k , E[exp(i-tf'X)] =
32
Convergence
Substituting t = 1 now yields E[exp(i- f'X)] = exp[i./*'<£ - V^Mf ] for all £ e R k ,
which is just the characteristic function of the k-variate normal Q.E.D. distribution Nk[^A\. Exercises 1. Let Y n be defined as in exercise 5 of section 2.1. Does Y n converge in distribution? 2. Let (X n ) be a sequence of independent N k (^,yl) - distributed random vectors in R k , where A is nonsingular. Let
A = (l/n)E°=,(Xj -M)(Xj -A*)'Prove that ( X n - ^ ' i " 1 (X n -/i) ^ x^in distr. (Hint: Use the fact that the elements of an inverse matrix are continuous functions of the elements of the inverted matrix, provided the latter is nonsingular.) 3. The x^ distribution has characteristic function y>n(t) = (1 -2it)~ n / 2 . Let Xn = Yn/n, where Yn is distributed xn2- Prove Xn —> 1 in pr., using theorem 2.3.6 and the fact that (1 + z/n)n —* ez for real or complex-valued z. 2.4 Central limit theorems
In this section we consider a few well-known central limit theorems (CLT), which are stated here for convenience. The first one is the standard textbook CLT (see, for example, Feller [1966]): Theorem 2.4.1 Let X1? X2, X3,... be i.i.d. random variables with E(Xj) = /i, var(Xj) = a 2 < oo. Then (l/>/n)£jLi(Xj-/i) -+ N(0,a2) in distr. Proof. Let (f(t) be the characteristic function of Xj — \i, and let ^ n(t) be the characteristic function of (l/x/n^jL^Xj-/!). Then Moreover, it follows from exercise 3 in section 1.5 that for the first and second derivatives y>' and y>" of tp, ip\0) = 0; (^"(0) =
-o\
Central limit theorems
33
hence by Taylor's theorem,
where kni e [0,1]. Denoting
an = -AtVantt/Vn), it now follows easily from the continuity of 99" and the well-known result lim n ^ ^an = a implies lim n ^ ^(1 +a n /n) n = exp(a), that limn_> ^ ( t ) = e x p ( - VitV), which is the characteristic function of the N(0,cr2) distribution. The theorem follows now from theorem 2.3.6. Q.E.D. The next central limit theorem is due to Liapounov. This theorem is less general than the Lindeberg-Feller central limit theorem (see Feller [1966] or Chung [1974]), but its conditions are easier to verify. Theorem 2.4.2 Let <s = V k n Y •
where for each n the r.v.'s X n?1 ,...,X nkn are independent and kn—>oo. Put E
(Xn,j) = anj,an
= J2)l\an,p
2
<x (Xnj) = E[(XnJ -a n J ) 2 ] = < , a\ = E f c i < , assuming o\ < 00 (but not necessarily limsupn^oo^ < 00). If for some 5 > 0, limn^ooEjiiEIKXnj -a nJ )/
(2.4.1)
34
Convergence
Remark: Note that condition (2.4.1) holds if
Exercises 1. Let (Xj) be a sequence of i.i.d. random vectors in R k with E(X}) = fi,E[(Xi where A is nonsingular. Let X = (l/n)E]LiXj, A = (l/n)EjLiPCj-X)(Xj - X ) \ Yn = niX-fiYA-'a-fi). Prove that Y n —• ^ m distr. (Cf. exercise 2 of section 2.3). 2. Let (Uj) be a sequence of i.i.d. random variables satisfying E(Uj) = 0, E(Uj2) = 1. Let Xj = Uj + Uj_i. Prove that Er=i X j -> N(0,4) in distr. 2.5 Further convergence results, stochastic boundedness, and the O p and op notation 2.5.7 Further results on convergence of distributions and mathematical expectations, and laws of large numbers The condition in theorem 2.3.1 that the function y> is bounded is only necessary for the "if" part. If we assume proper pointwise convergence then convergence of expectations will occur under more general conditions, as is shown in the following extension of the "only if" part of theorem 2.3.1. Theorem 2.5.1 Let (F n ) be a sequence of distribution functions on R k satisfying F n —> F properly pointwise. Let
Obviously (/?a( ) is theorem 2.3.1:
a
(2.5.1) k
bounded continuous real function on R , hence by
Uniform laws of large numbers
JVa(x)dFn(x) -+ JVa(x)dF(x)
35
(2.5.2)
Moreover, |JV(x)dFn(x) -JV a (x)dF n (x)| < 2j Mx) | >a |< / ?(x)|dF n (x) < 2.&-sS\
(2.5.3)
uniformly in n, and similarly we have: |JV(x)dF(x)-JV a (x)dF(x)|
<
2a- 5 J|^(x)| 1 + 5 dF(x) = O(a-' 5 )
(2.5.4)
provided
J|vKx)radF(x) < oo
(2.5.5)
The theorem follows easily from (2.5.2) through (2.5.4) by letting first n —> oo and then a —* oo. Thus it suffices to show that (2.5.5) is true. Now observe that |(/?a(x)|1 + <5 is monotonic nondecreasing in a and that \Va(x)\l+3 - • \
JV(x)|1+5dF(x) = lima^JVaWr' < lim n _ o J| v >(x)| I + a dF n (x) < sup n J|^(x)| 1 + 5 dF n (x) < oo (2.5.6) This result completes the proof of theorem 2.5.1.
Q.E.D.
Along similar lines we can prove the following version of the weak law of large numbers. Theorem 2.5.2 Let Xi,X 2 ,... be a sequence of independent random vectors in R k , and let (Fj(x)) be the sequence of corresponding distribution functions. Let
(2.5.7)
36
Convergence
while from theorem 2.3.1 it follows li miwoo E[(l/n)£jL,¥>a(Xj)] = JV a (x)dG(x). Hence plim n ^ oo (l/n)E J n =1 ¥'a(X j ) = JV a (x)dG(x).
(2.5.9)
Moreover, since (fa(x) is bounded, it follows from (2.5.9) and theorem 2.2.1 that j ) - JV>a(x)dG(x)|] -> 0 as n - oo.
(2.5.10)
Furthermore, similarly to (2.5.3) it follows that limsup n ^ o o E[|(l/n)E]Li^(X j )-(l/n)E J r Li^a(X J )|] ^ 0 as a ^ oo (2.5.11) and, similarly to (2.5.4), that |JV(x)dG(x)- JV a (x)dG(x)| -> 0 as a -> oo.
(2.5.12)
Combining (2.5.10), (2.5.11), and (2.5.12) we see that (2.5.13) J ) - JV(x)dG(x)|] - 0. The theorem follows now from (2.5.13) and Chebishev's inequality. Q.E.D. Remark: The difference of this theorem from the classical weak law of large numbers is that the finiteness of second moments is not necessary. If we combine theorems 2.1.4 and 2.5.1 we easily obtain the following strong version of theorem 2.5.2. Theorem 2.5.3 Let the conditions of theorem 2.5.2 be satisfied, and assume in addition that supn(l/n)J^=lE[\(p(Xj)\2
+s
]< oo for some 8 > 0.
(2.5.14)
Then (l/n)£jLi¥>(Xj) -+ JV(x)dG(x) a.s. The continuity condition on the function ip in theorems 2.5.2 and 2.5.3 is mainly due to theorem 2.5.1, i.e., without this condition theorem 2.5.1 is not generally true. Suppose for example
Uniform laws of large numbers
37
properly to the distribution function F of X. However, since X is a.s. irrational we have: E[?(X)] = J - x 2 d F ( x ) = - 1 whereas X n is a.s. rational and thus
Efo>(Xn)] = Jx 2 dF n (x) - Jx 2 dF(x) = 1. This counter-example shows that theorem 2.5.1 does not carry over for general Borel measurable functions (p. In order that a similar result to that in theorem 2.5.1 does hold for Borel measurable functions we need a stronger convergence in distribution concept, namely setwise proper convergence: Definition 2.5.1. Let (F n ) be a sequence of distribution functions on R k with corresponding sequence (//n) of probability measures on {Rk,95k} (cf. section 1.1). This sequence (F n ) converges properly setwise if for each Borel set B in 93k, lim n ^ oo ^ n (B) = /i(B), where n is a probability measure on {Rk,23k}. We then write F n —> F properly setwise, where F is the distribution function induced by \i. Using this concept and definition 1.4.4 we can now state: Theorem 2.5.4 Let F n and F be distribution functions on R k . Then F n —> F properly setwise if and only if for every bounded Borel measurable real function
lim n_ooJ(^(x)dFn(x) = JV(x)dF(x). Proof: The "if" part is easy and therefore left to the reader. Moreover, the "only if" part follows easily from definition 1.3.2 if ip is a simple function. Thus, assume that ip is not simple. From the proof of theorem 1.3.4 it easily follows that for arbitrary e > 0 and each bounded Borel set B we can construct a simple function ip such that \ip(x)-(p(x)\
< e i f x e B , 7p(x) = O f o r x ^ B .
Moreover, we may choose B such that MB) = J B dF(x) >
l-e.
Furthermore, since /in(B) - J B dF n (x) -> MB) there exists an n£ such that
38
Convergence
Thus we have for n > n£
|JV(x)dFn(x) -JV(x)dF(x)| < + | JRk\B
|JB(^(X)
-V(x))dF n (x)| + |JB(v>(x) -V(x))dF(x)|
+ |J B V(x)dF n (x) -J B V(x)dF(x)| + M Mn (R k \B) + M/i(R k \B) < e •/in(B) +
£
•MB) + 3eM + |JBV(x)dFn(x) - J B ^(x)dF(x)|
< (2 + 3M)e + |JV(x)dFn(x) -J>(x)dF(x)|, where M is the bound of ?(x). Since the second term converges to zero (for ip is simple) and the first term can be made arbitrarily small, the theorem follows. Q.E.D. Now observe from the proof of theorem 2.5.1 that the continuity of
Replacing theorem 2.5.1 by theorem 2.5.5 the laws of large numbers (theorems 2.5.2 and 2.5.3) can now be generalized to: Theorem 2.5.6 Let (Xj) be a sequence of independent random vectors in R k , and let (Fj(x)) be the sequence of corresponding distribution functions. Let (/?(x) be a Borel measurable real function on R k . If J1Fj
—• G properly setwise
and supn(l/n)X:jIL 1E[|^(Xj)|1+ ^] < oo for some 3 < 0 thenplim n _ o o (l/n)E^i^(X J ) = Mx)dG(x). Theorem 2.5.7 Let the conditions of theorem 2.5.6 be satisfied. If s u p n O / n ^ j ^ E ^ X j ) ! 2 ^ < oo for some 5 > 0, then ( l / n ) ^ ^ ( X j ) -> JV(x)dG(x) a.s.
Uniform laws of large numbers
39
Finally we consider convergence of random variables of the type
where the ?j's are Borel measurable (respectively, continuous) functions. Theorem 2.5.8 Let Xj be a sequence of independent random vectors in R k and let (y?j) be a sequence of Borel measurable (continuous) real functions on R k . Denote Yj =
40
Convergence
For example, if Xi,...,X n is a random sample from a distribution with E(|Xj|2) < oo, then by theorems 2.4.1 and 2.5.9,
Exercises 1. Prove theorem 2.5.8. 2. Restate theorems 2.5.2 and 2.5.6 for double arrays (X n j), j = l,...,n, n = 1,2,... of random vectors in R k and prove the modified theorems involved. 3. Prove theorem 2.5.9. 4. Prove theorem 2.5.10. 2.6 Convergence of random functions In dealing with convergence of random functions, one should be aware of some pitfalls. The first one concerns pointwise a.s. convergence. Let f(0) and fn(0) be random functions on a subset 0 on R k such that for each 9 G 0, fn(0) —> f(0) a.s. as n —> oo. At first sight we would expect from definition 2.1.2 that there is a null set N and an integer function n0(co,9,e) such that for every e > 0 and every at e &\N, |fn(0,co) -f(0,co)| < e if n > no(t»,0,e). However, reading definiton 2.1.2 carefully we see that this is not correct, because the null set N may depend on 9\ N = N#. Then, again at first sight, we might reply that this does not matter because we could choose N = U^^N^ as a null set. But the problem now is that we are not sure whether N e g, for only countable unions of members of g are surely members of g themselves. Thus although N# e g for each 9 e 0, this is not necessarily the case for U9e0N9 if 0 is uncountable. Moreover, even if U0G6>N0 e g, it may fail to be a null set itself if 0 is uncountable. For example, let 0 = Q = [0,1], let P be the Lebesgue measure on [0,1] and let N0 = {6} for 9 e [0,1]. Then ?(U9e0N9) = P(Q) = 1, while obviously the N#'s are null sets. The second pitfall concerns uniform convergence of random functions. As is well known, uniform convergence of (real) non-random functions, for example (pn(6) —• (p(0) uniformly on 0 as n —> oo, can be defined by sup9e0\ipn(0) -ip(0)\ -> 0 as n -+ oo. Dealing with uniform a.s. convergence of random functions, i.e., fn(0) -> f(0) a.s. uniformly on 0, a suitable definition is therefore:
Random functions
41
supee0\{n(O) -f(0)| -> 0 a.s. as n -> oo, However, this has only a probabilistic meaning if the supremum involved is a random variable. If so, then uniform a.s. convergence is equivalent to the following: There is a null set N and an integer function no(co,£) both independent of 9, such that for every e > 0, every co e Q\N and every 9 e (9, |fn(<9,co) -f(0,cw)| < e if n > no(cu,e). Thus going from pointwise a.s. convergence to uniform a.s. convergence we have to check three things, namely that the null set N is independent of 0, that the integer function n0(co,e) is independent of 0, and that is a random variable for each n. Only if so, can we say that fn(#) —> f(0) a.s. uniformly on 0. Nevertheless, if this is not the case but fn(0,co) —• f(0,co) uniformly on 0 for every a> in Q except in a null set not depending on 0, then we still have a useful property, as will be shown in chapter 4. In this case we can say that fn(0) —• f(0) a.s. pseudo-unifovmly on 0. Summarizing: Definition 2.6.1 Let f(0) and fn(0) be random functions on a subset 0 of a Euclidean space, and let {£2,$,P} be the probability space involved. Then: (a) fn(0) —• f(0) a.s. pointwise on 0 if for every O G 0 there is a null set N# in g and for every e > 0 and every co e fl\N^ a number no(co,0,£) such that |fn(0,co) — f(0,co)| < e if n > no(co,0,£); (b) fn(0) -> f(0) a.s. uniformly on 0 if (i) sup0e(9|fn(0) — f(0)| is a random variable for n = 1,2,..., and if (ii) there is a null set N and an integer function n0(co,£), both independent of 9, such that for every e > 0 and every co e (2\N, |fn(0,Q)) -f(0,c»)| < e if n > no(c»,e). (c) fn(0) -^ f(0) a.s. pseudo-uniformly on 0 if condition (ii) in (b) holds, but not necessarily condition (i). Similarly to the case of a.s. uniform convergence of random functions the uniform convergence in probability of fn(#) to f(0) can be defined by plim n _ oo sup de 0|f n (0) -f(0)| = 0, provided that sup^G0|fn(0) -f(0)| is a random variable for n = 1,2,... In that case it follows from theorem 2.1.6 that fn(0) —• f(0) in pr. uniformly on 0 if and only if every subsequence (nk) of (n) contains a further subsequence (n k ) such that fnk.(0) —• f(0) a.s. uniformly on 0. This suggests how to define pseudo-uniform convergence in pr.:
42
Convergence Definition 2.6.2 Let fn(0) and f(0) be random functions on a subset 0 of a Euclidean space. Then: (a) fn(0) -> f(0) in pr. uniformly on 0 if sup^ 0 |f n ((9) -f(0)| is a random variable for n = 1,2,... satisfying pUm^ooSupteelfnW -f(0)| - 0; (b) fn(0) —> f(0) in pr. pseudo-uniformly on (9 if every subsequence (nk) of (n) contains a further subsequence (n k ) such that fnk.(0) —• f(0) a.s. pseudo-uniformly on 0.
Remark: In this study we shall often conclude: sup, G0 |f n (0) -f(0)| ->Oa.s.orplim n ^ o o sup^©|f n (0) -f(0)| - 0 instead of fn(0) —• f(0) a.s. or fn(0) —• f(0) in pr., uniformly on 0, respectively. In these cases it will be clear from the context that sup# G6>|fn(#) — f(0)| is a random variable for n = 1,2,... We are now able to generalize theorem 2.1.7 to random functions. Theorem 2.6.1 Let (fn(0)) be a sequence of random functions on a Borel subset 0 of a Euclidean space. Let f(0) be an a.s. continuous random function on 0. Let X n and X be random vectors in 0 such that P(X e 0) = 1 and P(X n e 0) = 1 for n = 1,2,... Moreover, suppose that f(X) is a random variable and that fn(Xn) is a random variable for n = 1,2,... If (a) X n —> X a.s. and fn(0) —> f(0) a.s. pseudo-uniformly on 0, or (b) X n —> X in pr. and fn(0) —» f(0) in pr. pseudo-uniformly on 0, then (a) fn(Xn) -+ f(X) a.s. or (b) fn(Xn) -+ f(X) in pr., respectively. Proof (a) Let (f2,g,P) be the probability space. Let Ni be the null set on which x n( w ) —* x(&>) fails to hold, let N 2 be the null set on which f(0,co) fails to be continuous, let N 3 and N 3 , n be null sets on which x(co) G 0 and xn(co) e 0, respectively, fail to hold and finally let N 4 be the null set on which fails to hold. Put N = N1UN2UN3U{U^=1N3,n}UN4. Then N e g, P(N) = 0 and for co G O\N we have: |fn(xn(«),CO) -f(x(C0),G))| ,CO) -f(0,CO)| "f |f(xn(G)),O)) - f ( x ( G ) ) , G ) ) | - 0 .
This proves part (a). Part (b) follows from (a) by using theorem 2.1.6. Q.E.D.
Uniform laws of large numbers
43
Exercise 1. Let X be uniformly distributed on [0,1]. Define for 9 G [0,1], fn(0) = n - ' x - * l Show that fn(#) —• 0 a.s. pointwise on 0 = [0,1] while sup^G©|fn(0)| = 1. (This is a counter-example that pointwise a.s. convergence on compact spaces does not imply uniform a.s. convergence.) 2.7 Uniform strong and weak laws of large numbers Next we shall extend theorems 1 and 2 of Jennrich (1969). We shall closely follow Jennrich's proof, but instead of the Helly-Bray theorem (theorem 2.3.1) we shall now use theorems 2.5.1 and 2.5.5. The extension involved is: Theorem 2.7.1 Let Xi,X 2 ,... be a sequence of independent random vectors in R k with distribution functions Fi,F 2 ,... respectively. Let f(x,0) be a continuous real function on R k x 0, where 0 is a compact Borel set in R m . If O/rOEjLiFj ->
G
properly pointwise
(2.7.1)
and sup n (l/n)5:jLiE[sup^ e |f(x,0)| 2 + <5] < oo
(2.7.2)
then (l/n)£jLif(Xj,0) -> Jf(x,0)dG(x) a.s. uniformly on 0, where the limit function involved is continuous on 0. Proof: First we note that by theorem 1.6.1 the supremum in (2.7.2) is a random variable. For the sake of convenience and clarity we shall label the main steps of the proof. Step 1: Choose 0O arbitrarily in 0 and put for S > 0
r3 = {6eRm:\9 -eo\ <($}n<9, Then for any S > 0, sup# er /(x,0) and inf^ Gr /(x,0) are continuous functions on R k , because r$ is a closed subset of a compact set and therefore compact itself. See Rudin (1976, theorem 2.35) and compare theorem 1.6.1. Moreover, |sup* er /(x,0)| < sup, G0 |f(x,0)|,
(2.7.3)
|inf* er /(x,0)| < sup, G0 |f(x,0)|.
(2.7.4)
44
Convergence
Thus it follows from theorem 2.5.3 and the conditions (2.7.1) and (2.7.2) that (l/n)Ej n =iSup^r/(Xj^) - Jsup, e r /(x,0)dG(x) - a.s.
(2.7.5)
and (l/n)E]Liinf^r/(Xj,fl) -
Jinf, €r /(x,0)dG(x) a.s.
(2.7.6)
Step 2: By continuity, sup0 €r /(x,0) -inf^ e r /(x,0) -* 0 as 8 j 0, pointwise in x. It follows now from the dominated convergence theorem that lim <5iO |Jsup, er /(x,0)dG(x) -Jinf^ r /(x,0)dG(x)| = 0.
(2.7.7)
Step 3: Choose e > 0 arbitrarily. From (2.7.7) it follows that S > 0 can be chosen so small, say S = d(e), that 0 < Jsup, er ^f(x,0)dG(x) -Jinf, G ^ (e f(x,0)dG(x) < V2e.
(2.7.8)
Let {^,g,P} be the probability space involved. From (2.7.5) and (2.7.6) it follows that there is a null set N and for each co e Q\N a number n0(co,e) such that: ^fCXjCco)^) -Jsup e e r i ( /(x,0)dG(x)| < Vie, | < Vie,
(2.7.9) (2.7.10)
if n > no(w,e). From (2.7.8), (2.7.9) and (2.7.10) it follows now that for every co e O\N, every n > n0(co,e) and every 9 € r 5(£) : (l/n)£jL,f(Xj(a)),0) -Jf(x,^ , ( /(x j (a)),0) - Jinf,6rjwfl(x,0)dG(x) f(xj(a)),0) - Jsup, 6 r ^f(x,0)dG(x)| and similarly: (l/n)E]Lif(Xj(co),0) - Jf(x,0)dG(x) >
-s.
Thus for co £ Q\N and n > n0(co,^) we have: (l/n)E]Lif(Xj(^),0) - Jf(x,0)dG(x)| < e. We note that the null set N and the number n0(co,e) depend on the set f$(£), which in its turn depends on 60 and e. Thus the above result should be restated as follows. For every 60 in 0 and every e > 0 there is a null
Uniform laws of large numbers
45
set N(#o,£) and an integer function no(.,£,#o) on Q\N(0o,e) such that for co e Q\N(0o,e) and n > n0(co,e,60): (
W
j
)
,
0
)
-Jf(x,0)dG(x)| < e9
(2.7.11)
where
rs(o0)
= {0 e R k : \e — Ool < s } n e .
(2.7.12)
k
Step 4: The collection of sets {6 e R :\6-6O\ < 6} with 60 e 0 is an open covering of 0. Since 0 is compact, there exists by definition of compactness a finite covering. Thus there are a finite number of points in 0, say
, < 5 , r
< 5 < 5
< oc
such that 0 c u ; i , {0eR k :|0-0a f i | < S}. Using (2.7.12) we therefore have:
e = u[i 1 rd(os,dNow
(2-7-13)
put:
Then by (2.7.12) and (2.7.13) we have for co G fi\Ne and n > n*(co9e), sup,G0|(l/n)EJn=1f(xj(co),0)-Jf(x,0)dG(x)| -Jf(x,0)dG(x)|<e. Since it can be shown, similarly to the proof of theorem 2.2.1, that the null set N£ can be chosen independently of e, it follows now that O/rOEjL^Xj^) -> Jf(x,0)dG(x) a.s. pseudo-uniformly on 0.
(2.7.14)
Step 5: From (2.7.7) it follows that Jf(x,0)dG(x) is a continuous function on 0. Using theorem 1.6.1, it is now easy to verify that sup,e0|(l/n)X:jn=1f(XJ,0)-Jf(x,0)dG(x)| is a random variable, so that (2.7.14) becomes (l/n)£jLif(Xj,0) -> Jf(x,0)dG(x) a.s. uniformly on 0. This completes the proof.
Q.E.D.
If condition (2.7.2) is only satisfied with 1 + S instead of 2 + S then we can no longer apply theorem 2.5.3 for proving (2.7.5) and (2.7.6).
46
Convergence
However, applying theorem 2.5.2 we see that (2.7.5) and (2.7.6) still hold in probability. From theorem 2.1.6 it then follows that any subsequence (nk) of (n) contains further subsequences (njj. ) and (n^ ), say, such that for m —> oo,
O/n^EjSsup^fiCX.j.e) - Jsup,€r/(x,6>)dG(x) a.s. O/n£))E]SinW/i(Xj,0) - Jinf,er/(x,0)dG(x) a.s. Note that we may assume without loss of generality that these further subsequences are equal:
nk = 4 1 } = 4 2 ) .
We now conclude from the argument in the proof of theorem 2.7.1 that s u p ^ K l / n O E ^ T f(Xj,0) - J f ( x , 0 ) d G ( x ) H O a . s . as m —» oo. Again using theorem 2.1.6 we then conclude: Theorem 2.7.2 Let the conditions of theorem 2.7.1 be satisfied, except (2.7.2). If supn(l/n)5]jL1E[sup^©|f](Xj,fl)|1+*] < oo for some d > 0, then 0/n)Ej n =i f CM) -> Jf(x,0)dG(x) in pr. uniformly on 6>, where the limit function involved is continuous on 0. Next, let f(x,0) be Borel measurable in both arguments and for each x G Rk continuous on 0. Referring to theorems 2.5.6 and 2.5.7 instead of theorems 2.5.2 and 2.5.3, respectively, we have: Theorem 2.7.3 Let X!,X 2,... be a sequence of independent random vectors in R k with distribution functions F 1 ,F 2 ,.respectively. Let f(x,0) be a Borel measurable function on R k x 0 , where 0 is a compact Borel set in R m , which is continuous in 9 for each x e R k . If O/n)£r=i F j -> G properly setwise
(2.7.15)
and sup n (l/n)E]LiE[sup, G0 |f(X J ,0)| 2 + <3] < oo for some S > 0 (2.7.16) then (l/n)£jLif(Xj,0) - • Jf(x,0)dG(x) a.s. uniformly on 0,
Uniform laws of large numbers
47
where the limit function involved is continuous on 0. Theorem 2.7.4 Let the conditions of theorem 2.7.3 be satisfied, except condition (2.7.16). If ) ! ^ ' ] < oo for some 8 > 0 (2.7.17) then (l/n)Ej=i f ( x j^) -+ Jf(x,0)dG(x) in pr. uniformly on <9, where the limit function involved is continuous on 0. Finally, if the sequence (Xj) is i.i.d. we can relax the moment conditions (2.7.16) and (2.7.17) further, due to Kolmogorov's strong law (cf. theorem 2.1.5): Theorem 2.7.5 Let the conditions of theorem 2.7.3 be satisfied, except condition (2.7.16). If (Xj) is i.i.d. with Fj = G and E[sup*=*|f(Xj,0)|] < oo then the conclusion of theorem 2.7.3 carries over. Exercise 1. Restate theorems 2.7.2 and 2.7.4 for double arrays (X n j), j = l,2,...,n, n = 1,2,... of random vectors in R k (cf. exercise 2 in section 2.5).
Introduction to conditioning
The concept of conditional expectation is basic to regression analysis, as regression models essentially represent conditional expectations. The theory of conditioning, however, is one of the most abstract and difficult parts of probability theory. In particular, conditioning relative to a onesided infinite sequence of random variables requires the concept of conditional expectation relative to a Borel field. We shall need this concept when we deal with time series regression models. However, as long as we deal with independent samples, we only need the concept of a conditional expectation relative to a random vector, and fortunately the latter conditional expectation concept can be defined in a much more transparant way than the former. Therefore we shall discuss the abstract theory of conditioning relative to a Borel field later on. Here we shall confine attention to the easier concept of conditional expectation relative to a random vector. 3.1 Definition of conditional expectation
Most intermediate textbooks on mathematical statistics define conditional expectations by using conditional densities and probabilities. For our purpose this elementary conditional expectation concept is not suitable. In particular, our theory of model specification testing requires a more rigorous conditioning concept. Before we introduce this rigorous concept, however, we illustrate two basic features of the elementary conditional expectation concept. Thus let (Y,X) G R x R k be an absolutely continuously distributed random vector with density f(y,x) and marginal density h(x). Then the conditional density of Y relative to the event X = x is defined as: f(y|x) = f(y,x)/h(x)ifh(x) > 0, f(y|x) = 0 if h(x) = 0. The conditional expectation of Y relative to the event X = x is E ( Y |X
48
= x) = £ ~ yf(y|x)dy = g(x),
Definition of conditional expectation
49
say. Substituting X for x, we get the conditional expectation of Y relative toX: E(Y|X) - g(X). Thus E(Y|X) is & function of X. Moreover, we also have E[Y-E(Y|X)]V(X) = E[Y^(X)] -E[g(X)^(X)] = 0
for every function ijj for which this expectation is defined, as easily follows from the above elementary definition. These two properties are basic to conditional expectations. In fact they form the defining properties: Definition 3.1.1 Let Y be a random variable satisfying E(|Y|) < co and let X be a random vector in Rk. The conditional expectation of Y relative to X, denoted by E(Y|X), is defined as E(Y|X) = g(X), where g is a Borel measurable real function on Rk such that for every bounded Borel measurable real function V> on Rk, E[Y-g(X)W(X) = 0.
(3.1.1)
Example: Draw randomly a pair (Y,X) from the set Since X takes only two values, any Borel measurable function ip of X is a.s. equal to a simple function of X, i.e. = a-I(X=l) + b-I(X = 2),a,beR. Now we have E[Y-g(X)W(X) = ±{[l-g(l)]a + [2-g(l)]a + [3-g(2)]b + [4-g(2)]b} = [| -£g(l)]a + [\ -£g(2)]b = 0 for every a e R, b e R, hence g(l) = |, g(2) = \. Thus E(Y|X) = 1.5-I(X=1) + 3.5-I(X = 2). Two problems now arise. First, does this function g always exist? The answer is yes, but for the proof we actually need the notion of a conditional expectation relative to a Borel field (see, for example, Chung [1974, ch. 9]), together with the Radon-Nikodym theorem (see Royden [1968, p. 238]). We shall not pursue this point further here. Second, is g(X) unique? The answer to this question is also affirmative, owing to the Radon-Nikodym theorem, in the sense that if there are two Borel measurable real functions gi and g2 on Rk satisfying the definition, then gi(X) = g2(X) a.s. An alternative proof of the uniqueness of g(X) is given
50
Introduction to conditioning
by the following theorem of Bierens (1982), which is also of intrinsic interest and, moreover, is basic to our theory of model specification testing, described in chapter 5. Theorem 3.1.1 Let gi and g2 be Borel measurable real functions on Rk. Let X be a random vector in Rk such that E(|gi(X)|) < oo, E(|g2(X)|) < oo. Let for non-random t e Rk, <^(t) = E[ gl (X)exp(i.t'X)],^ 2 (t) = E[ g2(X)exp(i.fX)]. Then P(gi(X) = g2(X)) < 1 if and only if y?i(t) ±
(3.1.2)
Definition of conditional expectation
51
Then we can define probability measures v\ and v2 on the Euclidean Borel field 33k by (cf. exercise la) »j(B) = JBrj(x)u(dx)/Cj, j = 1,2,
(3.1.3)
where v is the probability measure induced by X (cf. section 1.1) and B is an arbitrary Borel set in R k . We may now write (cf. exercise lc) E [r(X)exp(i-t'X)] = Jr(x)exp(i-t'x)o(dx) = Jr 1 (x)exp(it'x)u(dx) — Jr 2 (x)exp(it'x)u(dx) = c 1 jexp(it'x)u 1 (dx) — c 2 jexp(it'x)u 2 (dx) = Ci77i(t) - c 2 r / 2 ( t ) ,
say, where 7j(t) = Jexp(i-t'x)Dj(dx),j = 1,2, is the characteristic function of Dj, j = 1,2. If E[r(X)exp(i-t'X)] = O then Cii7i(t) — c2r/2(t) = 0 for all t e R k . Substituting t = 0 yields: CiT7i(0) -c2r)2(0) = ci - c 2 = 0,
(3.1.4)
hence f o r a11
1 G Rk.
The latter implies that the corresponding probability measures are equal, i.e., ui(B) = v2(B) for all Borel sets B in R k .
(3.1.5)
From (3.1.3), (3.1.4), and (3.1.5) it follows now: Jer^xMdx) = JBr2(x)u(dx) for all B e 33k and consequently JBr(x)u(dx) = 0 for all B £ S k .
(3.1.6)
Now take B! = {XE
Rk: r(x) > 0}.
This is a Borel set, for r is Borel measurable. Hence by (3.1.6), JBir(x)u(dx) = 0. This implies that i^B^ = 0. Similarly, we have for B 2 = {XG
Rk: r(x) < 0}
that v(B2) = 0. Since Bi and B 2 are disjoint we now have u(BiUB2) = + v(B2) = 0 or equivalently:
52
Introduction to conditioning
P(r(X) ^ 0) = 0. This proves that r(X) = gi(X) -g 2 (X) = 0 a.s. if (3.1.2) holds. The proof for the case that E [rx(X)] = 0 and/or E [r2(X)] = 0 is left to the reader as an easy exercise. Q.E.D. Exercises 1. Let v be a probability measure on (Rk,33k) and let f be a non-negative Borel measurable function on R k such that Jf(x)u(dx) = c with 0 < c < oc. Define for B e 93k, MB) = JBf(x)o(dx)/c. a) Prove that \i is a probability measure on {Rk,53k}. b) Prove that for every simple function ip on R k , JV(x)f(x)»(dx) = cJV(x)/i(dx). c) Prove the same for bounded Borel measurable real functions ip on Rk. 2. Check the proof of theorem 3.1.1 for the cases Ci = 0; c 2 > 0, Ci > 0; c2 = 0; and ci = c 2 = 0. 3. Prove theorem 3.1.2. 3.2 Basic properties of conditional expectations All the basic properties of conditional expectations, which are well known from intermediate statistical textbooks, can easily be derived from definition 3.1.1 and theorem 3.1.1. We list them in theorem 3.2.1 below. The proofs are left to the reader as exercises. Theorem 3.2.1 Let Y e R and V e R be random variables satisfying E(|Y|) < oo, E(|V|) < oc, and let X e R k and Z e R m be random vectors. We have: (i) E[E(Y|X,Z)|X] = E(Y|X) = E[E(Y|X)|X,Z]; (ii) E[E(Y|X)] = E (Y); E(Y|Y) = Y; (iii) Let U = Y -E(Y|X). Then E(U|X) = 0 a.s.; (iv) E(Y + V|X) - E(Y|X) + E(V|X); (v) Y < V a.s. implies E(Y|X) < E(V|X) a.s.; (vi)|E(Y|X)|<E(|Y||X)a.s.; (vii) E(Y-f(X)|X) = f(X)E(Y|X) a.s. for every Borel measurable real function f on R k satisfying E(|f(X)|) < oc ;
Basic properties of conditional expectations
53
(viii) Let F be a Borel measurable mapping from R k into a subset of R m . Then E[E(Y|X)|r(X)] = E[Y|r(X)] a.s. If F is a one-to-one mapping then E(Y|f(X)) = E(Y|X) a.s. (ix) If X and Y are independent then E(Y|X) = E(Y) a.s. Hint. For proving (v) apply (3.1.1) with iP(X) = I[E(Y|X) > E(V|X)], where I[.] is the indicator function. Also Chebishev's, Holder's, Minkowski's, Liapounov's, and Jensen's inequalities easily carry over to conditional expectations: Theorem 3.2.2 (Chebishev's inequality) Let Y G R, X G R k and let (p be a positive monotonic increasing real function on (0,oo) such that (p(y) = ?( — y) and E[ip(Y)]< oo. Then for every 5 > 0, E[I(|Y| > 5)\ X] < E[^(Y)|X]^(5)a.s. Pro*/: Let ^(X) = I{E[I(|Y| > S)\ X] > E[y>(Y)|X]/y>(5)}. Applying definition 3.1.1 we find ?/<X) = 0 a.s. Q.E.D. Theorem 3.2.3 (Holder's inequality) Let Y G R, V G R, E(|Y| P ) < oo, E(|V| q ) < oo, E(|Y-V|) < oo, and X G R k , where p > 1 and 1/p H-l/q = l.Then |E(Y-V|X)| < [E(|Y| p |X)] 1/p [E(|V| q | X)] 1/q a.s. Proof: Similarly to the unconditional case. Theorem 3.2.4 (Minkowski's inequality) Let Y G R, V G R, X e R k and E(|Y| P ) < oo, E(|V| P ) < oc for some p > 1. Then [E(|Y + V| p | X)] 1/p < [E(|Y| P | X)] 1/p + [E(|V|P| X)] 1/p a.s. Proof: Similarly to the unconditional case. Theorem 3.2.5 (Liapounov's inequality) Let Y G R, E(|Y| q ) < oo for some q > 1, X G Rk and 1 < p < q. Then [E(|Y| P | X)] 1/p < [E(|Y| q | X)] 1/q a.s. Proof: Let V = 1 in theorem 3.2.3.
Q.E.D.
Theorem 3.2.6 (Jensen's inequality) Let if be a convex real function on R and let Y G R, X G R k , E(|Y|) < oo, oo. Then ] < E[(Y)|X] a.s.
54
Introduction to conditioning
Proof: Similar to the unconditional cse. The results in section 2.2 also go through for conditional expectations. Although we do not need these generalizations in the sequel we shall state and prove them here for completeness. Theorem 3.2.7 Let Y n , Y and Z be random variables and let X be a random vector in R k . If sup n (|Y n |) < Z ; E(|Z| P ) < oo for some p > 0 and Y n -> Y in prob., then E(|Y n - Y| p | X) -> 0 in pr. Proof: The theorem follows easily from hev's inequality and theorem 3.2.1 (ii).
theorem 2.2.1, ChebisQ.E.D.
Theorem 3.2.8 (Dominated convergence theorem) Let the conditions of theorem 3.2.7 with p = 1 be satisfied. Then E(Y n |X)-+E(Y|X)inpr. Proof: By theorems 3.2.1 (iv and vi) and 3.2.7 it follows that |E(Y n |X) - E ( Y | X ) | = |E[(Yn < E(|Y n - Y | | X ) -> 0 in pr.
Q.E.D.
Theorem 3.2.9 (Fatou's lemma) Let Y n be a random variable satisfying Y n > 0 a.s. and let X be a random vector in R k . Then n |X) < liminfn^ECYnlX) a.s. Proof: Put Y = liminf^^ Yn, Zn = Yn - Y, and let Fn(z | x) be the conditional distribution function of Zn, given X = x. Then it follows from the unconditional Fatou's lemma (theorem 2.2.3) that pointwise in x,
liming JzdFn(z | x)>0. Plugging in X for x now yields 0 < liminf n _ J z dFn(z | X) = l i m i n g E(Z; |X) = limsup n _ E(Yn | X) - E(Y | X) a.s. Q.E.D. Theorem 3.2.10 (Monotone convergence theorem) Let (Y n) be a non-decreasing sequence of random variables satisfying E|Y n | < oo and let X be a random vector in R k . Then
Identification of conditional expectations
55
EOim^ooYnl X) = Umo^ooEOTnlX) < oo a.s.
Proof: Similarly to the proof of theorem 2.2.4, using theorem 3.2.9 Q.E.D. instead of theorem 2.2.3. Exercises 1. Prove theorem 3.2.1. 2. Complete the proof of theorem 3.2.2. 3. Prove theorem 3.2.7. 3.3 Identification of conditional expectations
In parametric regression analysis the conditional expectation E(Y|X) is specified as a member of a parametric family of functions of X. In particular the family of linear functions is often used in empirical research. The question we now ask is how a given specification can be identified as the conditional expectation involved. In other words: given a dependent variable Y, a k-vector X of explanatory variables and a Borel measurable functional specification f(X) of E(Y|X), how can we distinguish between P[E(Y|X) = f(X)] = 1
(3.3.1)
and P[E(Y|X) = f(X)] < 1 ?
(3.3.2)
An answer is given by theorem 3.1.1, i.e. (3.3.1) is true if E[Y -f(X)]exp(it'X) = O and (3.3.2) is true if E[Y -f(X)]exp(i.t'X) ^ 0 for some t e Rk. Verifying this, however, requires searching over the entire space Rk for such a point t. So where should we look? For the case that X is bounded the answer is this: Theorem 3.3.1 Let X be bounded. Then (3.3.2) is true if and only if E[Y-f(X)]exp(iVX)/O for some t0 ^ 0 in an arbitrary small neighborhood of the origin ofRk . Thus in this case we may confine our search to an arbitrary neighborhood of t = 0. If we do not find such a t0 in this neighborhood, then
E[(Y − f(X))exp(i·t'X)] = 0 for all t ∈ R^k and thus (3.3.1) is true.
Proof: Let (3.3.2) be true. According to theorem 3.1.1 there exists a t* ∈ R^k for which E[(Y − f(X))exp(i·t*'X)] ≠ 0
(3.3.3)
Let Z be a discrete random variable, independent of X and Y, such that P(Z = j) = (1/j!)e^{-1}, j = 0,1,2,... Since X is bounded and Z is independent of (X,Y) we can now write (3.3.3) as E[(Y − f(X))exp(i·t*'X)] = E[(Y − f(X)) Σ_{j=0}^∞ (i·t*'X)^j/j!] = exp(1)·E{E[(Y − f(X)) i^Z (t*'X)^Z |X,Y]} = exp(1)·E[(Y − f(X)) i^Z (t*'X)^Z] = exp(1)·E{E[(Y − f(X)) i^Z (t*'X)^Z |Z]} = Σ_{j=0}^∞ (i^j/j!) E[(Y − f(X))(t*'X)^j] ≠ 0. Consequently, there exists at least one j* for which E[(Y − f(X))(t*'X)^{j*}] ≠ 0.
Then (d/dλ)^{j*} E[(Y − f(X))exp(i·λt*'X)] → i^{j*} E[(Y − f(X))(t*'X)^{j*}] ≠ 0 as λ → 0. This result implies that there exists an arbitrarily small λ* such that E[(Y − f(X))exp(i·λ*t*'X)] ≠ 0. Taking t₀ = λ*t*, the theorem follows.
Q.E.D.
Now observe from the proof of theorem 3.3.1 that (3.3.2) is true if and only if for a point t 0 in an arbitrarily small neighborhood of the origin of R k and some non-negative integer j 0 , E[(Y-f(X))(t o 'Xy°]/0. Applying a similar argument as in the proof of theorem 3.3.1 (with i replaced by 1) it is easy to verify: Theorem 3.3.2 Let X be bounded. Then (3.3.2) is true if and only if the function E[(Y — f(X))exp(t'X)] is nonzero for a t in an arbitrarily small neighborhood of the origin of R k . Clearly this theorem is more convenient than theorem 3.3.1, as we no
longer have to deal with complex-valued functions. Next, let t* ∈ R^k be arbitrary, let Y* = Y·exp(t*'X) and let f*(X) = f(X)exp(t*'X). Then (3.3.2) is true if and only if P[E(Y*|X) = f*(X)] < 1. Applying theorem 3.3.2 we see that then E[(Y* − f*(X))exp(t₀'X)] = E{(Y − f(X))exp[(t* + t₀)'X]} ≠ 0 for some t₀ in an arbitrarily small neighborhood of the origin of R^k. Consequently we have: Theorem 3.3.3 Let X be bounded and let t* ∈ R^k be arbitrary. Then (3.3.2) is true if and only if E[(Y − f(X))exp(t₀'X)] ≠ 0 for a t₀ in an arbitrarily small neighborhood of t*. Thus we may pick an arbitrary neighborhood and check whether there exists a t₀ in this neighborhood for which E[(Y − f(X))exp(t₀'X)] ≠ 0. If so, then (3.3.2) is true, else (3.3.1) is true. This result now leads to our main theorem. Theorem 3.3.4 Let X ∈ R^k be bounded, and let S be the set of all t ∈ R^k for which E[(Y − f(X))exp(t'X)] = 0. For any probability measure μ on 𝔅^k corresponding to an absolutely continuous distribution we have: μ(S) = 1 if (3.3.1) is true and μ(S) = 0 if (3.3.2) is true.
Proof: Let V = Y − f(X) and let t₀ be arbitrary. Suppose for the moment that X ∈ R. If P[E(V|X) = 0] < 1, then theorem 3.3.3 implies that for every t₀ ∈ R there exists a δ > 0 such that E[V·exp(tX)] ≠ 0 on (t₀ − δ, t₀) ∪ (t₀, t₀ + δ). Consequently, we have: Lemma 3.3.1 Let V ∈ R be a random variable satisfying E(|V|) < ∞ and let X ∈ R be a bounded random variable. If P[E(V|X) = 0] < 1 then the set S = {t ∈ R : E[V·exp(tX)] = 0} is countable.
Using the lemma it is very easy to prove theorem 3.3.4 for the case k = 1. So let us turn to the case k = 2. Let P[E(V|X) = 0] < 1. According to theorem 3.3.3 there exists a t* e R 2 such that E[V-exp(U'X)] / 0. Denote V* = Vexp(U'X), <Mt l9 t 2 ) = EfV-exp^X! + t 2 X 2 )] where X! and X 2 are the components of X. Moreover, let Sx = {t! e R: ^ ( t i , 0 ) = 0}, S 2 (ti) = {t2 e R: >*(ti,t2) = 0} . Since E (V*) ^ 0, we have P[E(V*|Xi) = 0] < 1, hence by lemma 3.3.1 the set Si is countable. By the same argument it follows that the set S 2 (ti) is countable if ti ^ Si. Now let (*i,f2) be a random drawing from an absolutely continuous distribution. We have: E{IhMfi,*2) = 0]} = E{I[<M*i,*2) = 0].I(*i e Si) + E { l h M * ^ = 0]J[ri £ Si]} < E[l(ri e SO] + E[l(ri 0 S0-l(r 2 e S 2 (r0). Since the set Si is countable and t\ is continuously distributed we have E [I(*i e Si)] = 0. Moreover, since the distribution of t2 conditional on tx is continuous we have:
for S 2 (ti) is countable if ti ^ Si. Thus: i A ) = 0] = 0. Replacing (^i,^2) by (t\ —t\*, t 2 —1 2*) , where ti* and t 2 * are the compoments oft*, we now see that theorem 3.3.3 holds too for the case k = 2. The proof of the cases k = 3,4,... is similar to the case k = 2 and therefore left to the reader. Q.E.D. Finally we consider the case that X is not bounded. By theorem 3.2.1 (viii) we have E[Y -f(X)|X] = E[Y -f(X)|r(X)] a.s. for every Borel measurable one-to-one mapping F from R k into R k . From this result and theorem 3.3.3 it now follows: Theorem 3.3.5 Let the conditions of theorem 3.3.3 be satisfied, except that X is bounded. Let F be an arbitrary bounded Borel measurable one-to-one mapping from R k into R k , and let S = {t G R k : E[(Y -f(X))exp(tT(X))] = 0}. For any probability measure \i on 93k corresponding to an
absolutely continuous distribution we have: ^(S) = 1 if (3.3.1) is true and /x(S) = 0 if (3.3.2) is true. Exercises 1. Use the argument in the proof of theorem 3.3.1 to prove the following corollary: Let X = (Xi,...,X k )\ Under the conditions of theorem 3.3.1 it follows that (3.3.2) is true if and only if there exist non-negative integers m l v ..,m k such that Cf. Bierens (1982, theorem 2). 2. Let 0 be a subset of R k . A point y in R k is called a point of closure of 0 if for every e > 0 we can find an x e 0 such that |x — y| < e. The set of all points of closure of 0 is called the closure of 0. A subset S of 0 is called dense in 0 if 0 C S", where S~ is the closure of S. Prove that the set S in theorem 3.3.4 and 3.3.4 is not dense in Rk.
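The practical content of theorems 3.3.2-3.3.4 is that misspecification of E(Y|X) can be detected by evaluating the sample analogue of E[(Y − f(X))exp(t'X)] at a single t drawn from a continuous distribution. The following minimal sketch assumes NumPy and a user-supplied specification f; the function name and the standard-error line are my own illustrative choices, not part of the text.

```python
import numpy as np

def moment_check(y, x, f_spec, rng=None, scale=1.0):
    """Sample analogue of E[(Y - f(X)) exp(t'X)] at a randomly drawn t.

    y : (n,) responses; x : (n, k) bounded regressors;
    f_spec : callable returning the specified E(Y|X) for the rows of x.
    A value far from zero relative to its sampling error suggests
    P[E(Y|X) = f(X)] < 1, in the spirit of theorem 3.3.4.
    """
    rng = np.random.default_rng(rng)
    t = rng.normal(scale=scale, size=x.shape[1])   # t from an absolutely continuous distribution
    w = np.exp(x @ t)                              # exp(t'X_j)
    resid = y - f_spec(x)
    m = np.mean(resid * w)                         # (1/n) sum_j (Y_j - f(X_j)) exp(t'X_j)
    s = np.std(resid * w, ddof=1) / np.sqrt(len(y))
    return m, s
```

This is only an illustration of the moment condition itself; turning it into a formal test statistic requires the asymptotic theory developed in chapter 5.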
Nonlinear parametric regression analysis and maximum likelihood theory In this chapter we consider the asymptotic properties, i.e. weak and strong consistency and asymptotic normality, of the least squares estimators of the parameters of a nonlinear regression model. Throughout we assume that the data generating process is independent, but we distinguish between the identically distributed and the nonidentically distributed case. Also, we consider maximum likelihood estimation and the Wald, likelihood ratio and Lagrange multiplier tests of (non)linear restrictions on the parameters. Notable work in the area of nonlinear regression has been done by Jennrich (1969), Malinvaud (1970a,b), White (1980b, 1982), Bierens (1981), Burguete, Gallant, and Souza (1982), and Gallant (1987), among others. The present nonlinear regression approach is a further elaboration of the approach in Bierens (1981, section 3.1). 4.1 Nonlinear regression models and the nonlinear least squares estimator
Consider an independent data generating process {(Yj,Xj)}, j = 1,2,... where Yj G R is the dependent variable and Xj e Rk is a vector of regressors. The central problem in regression analysis is how to determine the conditional expectation functions gj(Xj) = E(Yj|Xj),j = 1,2,...
(4.1.1)
We shall call this function gj the response function. Note that this response function exists and is a Borel measurable real function on Rk, provided j
(4.1.2)
(cf. section 3.1). Defining Uj = Yj- g j (Xj),j = 1,2,...
(4.1.3)
we get the tautological regression model Yj = gj(Xj) + Uj, j = 1,2,...
(4.1.4)
where by construction the errors Uj satisfy E(Uj|Xj) = Oa.s.,j = 1,2,...
(4.1.5)
In parametric regression analysis it is assumed that: (i) gj does not depend on j : gj
= g f o r j = 1,2,...;
(4.1.6)
(ii) the function g belongs to a parametric family of functions f(.,0); g(.) e {f(.,0): 9 e 6>},
(4.1.7)
where 0 is a subset of a Euclidean space, called the parameter space, and f(x,0) is known for each x e R k and each 6 e 0. Quite often one also assumes that the errors Uj are independent of Xj and even that the Uj's are normally distributed. These assumptions, however, are not necessary for the asymptotic theory of regression analysis and therefore we shall not depend on them. Condition (i) naturally holds if the data generating process is i.i.d. However, if some or all of the components of Xj are control variables set by the analyst in a statistical experiment, we can no longer assume that the Xj's are identically distributed. In that case the response function may depend on the observation index j . The condition that the response function belongs to a certain parametric family of known functions is crucial to parametric regression analysis, but is also disputable. In practice we almost never know the exact functional form of a regression model. The choice of the parametric family (4.1.7) is therefore almost always a matter of convenience rather than a matter of precise a priori knowledge. For that reason the linear regression model is still the most popular functional specification in empirical research, owing to its convenience in estimating the parameters by ordinary least squares (OLS). So why should we bother about nonlinear regression models? There are various reasons why nonlinear regression analysis makes sense. First, it is possible to test whether a given (linear) functional specification is correct (see chapter 5). If we have to reject the linear specification we are then forced to look for a nonlinear specification that is closer to the true response function. Second, a linear regression model sometimes is logically inconsistent with the nature of the data. If for example the range of Yj is [0,oo) and the range of the components of Xj is ( — 00,00), then the linear response function may take negative values, whereas E(Yj|Xj) > 0 a.s. (cf. theorem 3.2.1.(v)). Another example is the
case where Yj is a binary variable only taking the values 0 and 1. Then E(Yj|Xj) is just the conditional probability of the event Yj = 1 relative to Xj, and must therefore lie in the interval [0,1]. A possible specification in this case is a logit-type model: Yj = 1/[1 + exp(−θ₀ − θ₁'Xj)] + Uj with E(Uj|Xj) = 0 a.s., where θ = (θ₀,θ₁')' ∈ R × R^k = Θ. Note that in this case Uj cannot be independent of Xj. Third, there are cases where theory is more explicit about the functional form of a model. An example is the CES production function in economics, introduced by Arrow, Chenery, Minhas, and Solow (1961). This production function takes the form y = {θ₁x₁^{−θ₃} + θ₂x₂^{−θ₃}}^{−1/θ₃}, where θ₁ > 0, θ₂ > 0, θ₃ > −1, y is the output level of an industry, x₁ is the total labor force employed in the industry and x₂ is the stock of capital in the industry. Adding an error term to the right-hand side yields a nonlinear regression model. Fourth, there are flexible functional specifications that allow the functional form to be partly determined by the data. A typical example of such a flexible functional specification is the Box-Cox (1964) transformation. The Box-Cox transformation of a variable v is given by η(v|λ) = (v^λ − 1)/λ for λ ≠ 0. For λ = 0 we have η(v|0) = ln(v), and clearly η(v|1) = v − 1. This transformation is particularly suitable when the regressors are positively valued. Consider for example the case Xj ∈ R, P(Xj > 0) = 1. Specifying the regression model as Yj = θ₁ + θ₂·η(Xj|θ₃) + Uj yields a model that contains the linear model and the semi-loglinear model [Yj = θ₁ + θ₂ln(Xj) + Uj] as special cases.
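As a purely illustrative aside, the Box-Cox family can be coded in a few lines; this is a minimal sketch assuming NumPy, and the function name is a hypothetical choice rather than anything prescribed by the text.

```python
import numpy as np

def box_cox(v, lam):
    """Box-Cox transformation eta(v|lambda) for positive v."""
    v = np.asarray(v, dtype=float)
    if lam == 0.0:
        return np.log(v)             # eta(v|0) = ln(v)
    return (v ** lam - 1.0) / lam    # eta(v|lambda) = (v^lambda - 1)/lambda

# eta(v|1) = v - 1, and eta(v|lambda) approaches ln(v) as lambda -> 0:
v = np.array([0.5, 1.0, 2.0, 4.0])
print(box_cox(v, 1.0))     # v - 1
print(box_cox(v, 1e-8))    # approximately ln(v)
```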
By estimating the parameters θ₁, θ₂, and θ₃ and testing the hypothesis θ₃ = 0 against θ₃ = 1 we actually let the data choose between these alternatives. From now on we take the parametric family of functional forms as given (though not necessarily as true). Moreover, we confine ourselves to functions that are Borel measurable in the explanatory variables and the parameters and are continuous in the parameters. Furthermore, the parameter space Θ is assumed to be a compact Borel set. Assumption 4.1.1 Given the data generating process {(Yj,Xj)},
j = 1,2,... in R × R^k with E(|Yj|) < ∞, the conditional expectation of Yj relative to Xj equals f(Xj,θ₀) a.s. for j = 1,2,..., where f(x,θ) is a known Borel measurable real function on R^k × Θ with Θ a compact Borel set in R^m containing θ₀. For each x ∈ R^k the function f(x,θ) is continuous on Θ. We recall that compactness of a set implies that every open covering contains a finite subcovering, and that a subset of a Euclidean space is compact if and only if it is closed and bounded; cf. Royden (1968). Assuming that the data generating process is observable for j = 1,2,...,n, the nonlinear least squares estimator for θ₀ is now defined as a measurable solution of the following minimization problem: θ̂ ∈ Θ: Q̂(θ̂) = inf_{θ∈Θ} Q̂(θ),
(4.1.8)
where Q̂(θ) = (1/n)Σ_{j=1}^n [Yj − f(Xj,θ)]².
(4.1.9)
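A minimal numerical sketch of (4.1.8)-(4.1.9) follows, assuming SciPy is available and using a hypothetical exponential response function f(x,θ) = θ₁ + θ₂·exp(θ₃·x); the optimizer, starting value, and simulated data are illustrative choices, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize

def f(x, theta):
    # hypothetical response function f(x, theta) = theta1 + theta2 * exp(theta3 * x)
    return theta[0] + theta[1] * np.exp(theta[2] * x)

def Q_hat(theta, y, x):
    # Q_hat(theta) = (1/n) sum_j [Y_j - f(X_j, theta)]^2, cf. (4.1.9)
    return np.mean((y - f(x, theta)) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=200)
theta0 = np.array([1.0, 0.5, -1.0])
y = f(x, theta0) + rng.normal(scale=0.1, size=x.size)

# nonlinear least squares estimator: minimize Q_hat over the parameter space
res = minimize(Q_hat, x0=np.array([0.0, 1.0, 0.0]), args=(y, x), method="Nelder-Mead")
print(res.x)   # should be close to theta0 for this simulated sample
```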
By theorem 1.6.1 such a measurable solution always exists. Later in this chapter we shall give further conditions for strong consistency, i.e.
θ̂ → θ₀ a.s. as n → ∞ (4.1.10)
or weak consistency:
plim_{n→∞} θ̂ = θ₀, (4.1.11)
and asymptotic normality, i.e.
√n(θ̂ − θ₀) → N_m(0,Ω) in distr. (4.1.12)
4.2 Consistency and asymptotic normality: general theory
4.2.1 Consistency
The consistency proof contains two main steps. First we set forth conditions such that Q(0) —> Q(0) a.s. (in pr.) pseudo-uniformly on 0
(4.2.1)
where Q(0) = lim n _ oo E[Q(0)] is continuous on 0
(4.2.2)
and then we set forth conditions such that 0* G 0: Q(0*) = Mee0Q(6)
=> 0* = 0o.
(4.2.3)
Thus the latter condition says that Q(0) has a unique infimum on 0 at 90. The conditions (4.2.2) and (4.2.3) guarantee that 9 —• 90 a.s. (in pr.), because of the following fundamental theorem. Theorem 4.2.1 Let (Qn(#)) be a sequence of random functions on a compact set 0 c Rm such that for a continuous real function Q(0) on 0, Qn(#) —• Q(0) a.s. (in pr.) pseudo-uniformly on 0. Let 9n be any random vector in 0 satisfying
and let 90 be a unique point in 0 such that
Then 9n —> 0O a.s. (in pr.) Proof: We consider first the a.s. case. Let {&,g,P} be the probability space and let N be the null set on which the uniform convergence fails to hold. For a> e Q\N we then have Qn(^n(^),CO) + Qn(0O,CO) " Q(0 O ) < 2'S\xp0e0\Qn(0,co)-Q(0)\
-* 0 as n -+ oo,
hence Q(9n(co)) —>Q(6> 0) as n ^ oo and co G fi\N. (4.2.4) Now let ^*(co) be any limit point of the sequence (0 n (^))- Since this sequence lies in a compact set 0, all its limit points lie in 0. Thus, 9*(co) e 0. By definition of a limit point there exists a subsequence (0n.(co)) such that lirrij ^ocA/co) = 0*(co). From (4.2.4) and the continuity of Q(0) it follows that also limj _ooQ(0nj(co)) = Q(limj ^O^co))
= Q(0\co)) = Q(flo),
hence 9*(co) = 90 by uniqueness. Thus all the limit points of 9n(a>) are equal to 90 and consequently lim n _ oo 0 n (o)) = 0O for co e O\N. By definition, this implies 9n —> 90 a.s. The convergence in probability case follows easily from theorem 2.1.6 and the above argument. Q.E.D.
4.2.2 Asymptotic normality
Next we turn to asymptotic normality. We shall put our argument in a general framework, in order also to use the results for more general cases than considered in this chapter.
Theorem 4.2.2 Let the conditions of theorem 4.2.1 be satisfied (for the convergence in probability case) and assume in addition that: (i) Θ is convex and θ₀ is an interior point of Θ; (ii) (∂/∂θ')Q̂_n(θ) and (∂/∂θ)(∂/∂θ')Q̂_n(θ) are well defined as vector and matrix, respectively, of random functions on Θ; (iii) √n(∂/∂θ')Q̂_n(θ₀) → N_m(0,A₁) in distr., where A₁ is a positive semi-definite m×m matrix; (iv) for i₁,i₂ = 1,2,...,m, (∂/∂θ_{i₁})(∂/∂θ_{i₂})Q̂_n(θ) → (∂/∂θ_{i₁})(∂/∂θ_{i₂})Q̄(θ) in pr. pseudo-uniformly on a neighborhood of θ₀, where the limit function involved is continuous in θ₀; (v) the matrix A₂ = (∂/∂θ)(∂/∂θ')Q̄(θ₀) is nonsingular. Then √n(θ̂_n − θ₀) → N_m(0, A₂⁻¹A₁A₂⁻¹) in distr.
Proof: Consider the following Taylor expansion of (∂/∂θᵢ)Q̂_n(θ̂_n):
(∂/∂θᵢ)Q̂_n(θ̂_n) = (∂/∂θᵢ)Q̂_n(θ₀) + (θ̂_n − θ₀)'(∂/∂θ)(∂/∂θᵢ)Q̂_n(θ̃_n^{(i)}), i = 1,2,...,m (4.2.5)
where θ̃_n^{(i)} is a mean value satisfying
|θ̃_n^{(i)} − θ₀| ≤ |θ̂_n − θ₀| a.s. (4.2.6)
In the non-random case the existence of such a mean value is proved by Taylor's theorem. In the random case under review, however, we should also ask whether a measurable mean value exists. For the moment we shall ignore this problem. At the end of the proof we shall return to this issue. Note that we have indexed the mean values by i. This indicates that these mean values may be different for different i's. Cf. Don (1986). Now consider the left member of equation (4.2.5). If θ̂_n is an interior point of Θ then (∂/∂θᵢ)Q̂_n(θ̂_n) = 0. Since θ̂_n → θ₀ in pr. and θ₀ is an interior point of Θ, the probability that θ̂_n is an interior point of Θ will converge to 1, hence
lim_{n→∞} P[(∂/∂θ')Q̂_n(θ̂_n) = 0] = 1. (4.2.7)
Next consider the m x m random matrix
A2 = K^e{)(b/Wi2)QM2))]
(4.2.8)
Then it follows from (4.2.5) and (4.2.7) that lim n ^ o o P[ V 'n(0 n -0 ( ) ) = - A 2 V n ( hence
= 0.
(4.2.9)
If plimn^A^-A;1) = O
(4.2.10)
then it follows from condition (iii) and theorems 2.3.3 and 2.3.5 that plim^ooCAj x - A2" V n ( W ) Q n ( 0 o ) = 0.
(4.2.11)
Combining (4.2.9) and (4.2.11) we then conclude plim^ooIX^-flo) + A 2 -yn(W)Qn(0o)] = 0.
(4.2.12)
Since condition (iii) implies V
N m (0,A 2 -i^Aj 1 )
(4.2.13)
(note that A2 is symmetric), it follows from (4.2.12) and theorem 2.3.5 that y/n(6n -0 O ) -> Nm(0,A2 ! AiA 2 1 ) in distr. Thus for proving the theorem it remains to show that (4.2.10) holds and that the mean values are indeed properly defined random vectors. For proving (4.2.10), observe from (4.2.6) that plinin^oofln = 0O implies plim^ooflo® = 0O, i= l,2,...,m.
(4.2.14)
From (4.2.14), condition (iv), and theorem 2.6.1 it follows now that plimn_ooA2 = A2
(4.2.15)
Finally, since the elements of an inverse matrix are continuous functions of the elements of the inverted matrix, provided the latter is nonsingular, it follows from (4.2.15), condition (v) and theorem 2.1.7 that
plimn^ooA^1 = A^ 1 , which proves (4.2.10). We now conclude our proof with the following lemma of Jennrich: Lemma 4.2.1 Let f(x,0) be a real-valued function on R k x<9, where 0 is a convex compact subset of R m . For each 6 in 0 let f(x,0) be a Borel measurable function on R k and for each x e R k let f(x,0) be a continuously differentiable function on 0. Let #i(x) and 62(x) be Borel measurable functions from R k into 0. Then there exists a Borel measurable function 6*(x) from R k into 0 such that (i)f(x,0,(x))-f(x,02(x)) = (
Q.E.D.
Exercises 1. Let Y l v ..,Y n be independent standard normally distributed random variables. Let Q n (0) = (l/n)Ej=i|Yj - 0 | , 0 = [a,b], - o o < a < 0 < b < c x ) .
2.
Let 0n be defined as in theorem 4.2.1. Prove 6n —• 0 a.s. (Hint: Use theorems 2.7.5 and 4.2.1.) Let Y l v ..,Y n be i.i.d. random variables satisfying Let Qn(0) = (l/n)E]Li(Yj-0) 4 and let 0 be as in exercise 1. (a) Prove 0n -> 0 a.s. (b) Prove yjn6n -> N(0,a 2 ) in distr. (c) Determine a 2 .
4.3 Consistency and asymptotic normality of nonlinear least squares estimators in the i.i.d. case 4.3.1
Consistency
We now consider the case where (Y U X\),...,(Y n ,X n ),... are independent identically distributed random vectors in R x R k , and we verify the conditions for consistency and asymptotic normality in section 4.2 for the nonlinear least squares estimator. Thus, we augment assumption 4.1.1 with: Assumption 4.3.1 The random vectors (Yj,Xj), j = 1,2,...,n,... are independent random drawings from a k + 1-variate distribution. Moreover, E(Yj2) < oo. For proving strong consistency we use theorem 2.7.5 in order to show (4.2.1). First, Assumption 4.3.2 Let E[supe e0f(XuO)2]
< oo.
Then E[sup, € 0 (Y, -f(X 1 ,0))] 2 ]<2E(Y2) + 2E[sup, G0 f(X 1 ,0) 2 ] < oo, (4.3.1) hence by theorem 2.7.5 sup* € e|Q(0)-Q(0)| - 0 a.s.,
(4.3.2)
where Q(0) = E{[Y{-f(Xu6)]2}.
(4.3.3)
From assumption 4.1.1 it follows that Yi =f(X 1 ,0 o ) + U b with E(U!|X,) = 0 a.s.
(4.3.4)
hence )) 2 ]. 2
(4.3.5)
2
Note that E(Y )
Theorem 4.3.1 Under assumptions 4.1.1 and 4.3.1-4.3.3, θ̂ → θ₀ a.s.
4.3.2
Asymptotic normality
Next we set forth the conditions for asymptotic normality by specializing the conditions of theorem 4.2.2. First, Assumption 4.3.4 Let 0 be convex, let 00 be an interior point of 0 and let f(x,0) be twice continuously differentiable in 9 for each xeRk. Then the conditions (i) and (ii) of theorem 4.2.2 hold. Now consider the first and second derivatives of Q(0): ]
j
j
)
j
,
(4.3.6)
= ( - 2)(l/n)EjL, [Yj - f(Xj,0)](^0)(*/<>0')f(Xj,0) n ) ] . (4.3.7) Assumption 4.3.5 Let for i,ii,i2 = 1,2,...,m, oo; )] 2 } < CXD.
Then by the Cauchy-Schwarz inequality,
^
'
7
2
< oo
(4.3.8)
and
E{suP, 66,|[Y,
-{(XiMi^oOi^e^iXue)]} ^{Efsup^^Y.-^X,,^]) 2 ]}" 7 3 x l E I s u p ^ e P / ^ ^ . X ^ ^ n X ! ^ ) ] 2 } ^ < oo
(4.3.9)
where the last inequality follows partly from (4.3.1). Using theorem 2.7.5 we now conclude from (4.3.8) and (4.3.9) that for ii,i2 = 1,2,...,m. sup, e©IG>/i0i1)O>/^0i2)Q((?)-E[O)/ieil)G)/b0i2)O(«)]| - 0 a.s.
(4.3.10)
which proves condition (iv) of theorem 4.2.2. Now consider the matrix A2: A2 = (d/W)(b
= -2
fit.^P/XWW.A)
(4.3.11)
Since E[Y{ -f(Xi,0o)|Xi] = EflJilXO = 0 a.s., the first term vanishes. Thus: A 2 = 2Q2,
(4.3.12)
where Q2 = E[WW)f(Xu0oMW0)f(Xu6o)l
(4.3.13)
Assumption 4.3.6 Let Q2 be nonsingular. Then condition (v) of theorem 4.2.2 is satisfied. So it remains to show that condition (iii) of theorem 4.2.2 holds. Observe from (4.3.4) and (4.3.6) that J,0o)=
-2(l/n)£jLiZj,
(4.3.14)
say, where (Zj), j = 1,2,..., is a sequence of i.i.d. random vectors in Rk with E(Z0 = E[U,O>/a0')f(Xi,0o)] = EtEKU^XOO/^^flX^Oo)]} = 0 (4.3.15) (for E(Ui|Xi) = 0 a.s.). Moreover,
E ^ z j ) = E(\j]{(^ey(xue0)}{^iw){(xue0)^
= au
(4.3.16)
say. If Assumption 4.3.7 For il9i2 = l,2,...,m, E{sup9e«[Y, -fQLuOtfWrbejfQLuemWjqXuB)!}
< oo,
then the elements of the matrix Q\ are finite, hence we may apply theorem 2.4.1 to arbitrary linear combinations £'Zj (£ € Rm): ( l / » E " = i f Zj - N(0,f (2,0 in distr.
(4.3.17)
From theorem 2.3.7 we then conclude: OA/n)£jLiZj - Nm(0,O,) in distr.
(4.3.18)
Thus > Nm(0,4O!) in distr. This proves condition (iii) of theorem 4.2.2 with A! = 4Q\. Summarizing, we have shown: Theorem 4.3.2. Under the conditions of theorem 4.3.1 and the additional assumptions 4.3.4-4.3.7 we have: ^ ( 0 - 0 0 ) -> Nm(0,fi2 ! Q i ^ 2 ! ) i n
distr
-
where Ω₁ and Ω₂ are defined by (4.3.16) and (4.3.13), respectively. Cf. White (1980b). Quite often it is assumed that Uj is independent of Xj or that Assumption 4.3.8 E(U₁²|X₁) = E(U₁²) = σ² a.s. Then Ω₁ = σ²Ω₂, and moreover assumption 4.3.7 then follows from assumption 4.3.5 and the condition E(Y₁²) < ∞. Thus: Theorem 4.3.3 Under the conditions of theorem 4.3.2, with assumption 4.3.7 replaced by assumption 4.3.8, we have: √n(θ̂ − θ₀) → N_m(0, σ²Ω₂⁻¹) in distr.
4.3.3
Consistent estimation of the asymptotic variance matrix
Finally, we turn to the problem of how to consistently estimate the asymptotic variance matrices in theorems 4.3.2 and 4.3.3. Let
Ω̂₁(θ) = (1/n)Σ_{j=1}^n [Yj − f(Xj,θ)]²[(∂/∂θ')f(Xj,θ)][(∂/∂θ)f(Xj,θ)] (4.3.19)
Ω̂₂(θ) = (1/n)Σ_{j=1}^n [(∂/∂θ')f(Xj,θ)][(∂/∂θ)f(Xj,θ)] (4.3.20)
Ω̂₁ = Ω̂₁(θ̂), Ω̂₂ = Ω̂₂(θ̂), σ̂² = Q̂(θ̂). (4.3.21)
We have shown that under the conditions of theorem 4.3.2:
Ω̂₁(θ) → E[Ω̂₁(θ)] a.s. uniformly on Θ, (4.3.22)
Ω̂₂(θ) → E[Ω̂₂(θ)] a.s. uniformly on Θ, (4.3.23)
Q̂(θ) → E[Q̂(θ)] a.s. uniformly on Θ, (4.3.24)
and Ω₁ = E[Ω̂₁(θ₀)], Ω₂ = E[Ω̂₂(θ₀)], σ² = E(U₁²) = E[Q̂(θ₀)]. From theorems 2.6.1 and 4.3.1 it therefore follows: Ω̂₁ → Ω₁ a.s.; Ω̂₂ → Ω₂ a.s.; σ̂² → σ² a.s.; hence, using theorem 2.1.7, we have:
Theorem 4.3.4 Under the conditions of theorem 4.3.2,
Ω̂₂⁻¹Ω̂₁Ω̂₂⁻¹ → Ω₂⁻¹Ω₁Ω₂⁻¹ a.s.,
and under the conditions of theorem 4.3.3 we have:
σ̂²Ω̂₂⁻¹ → σ²Ω₂⁻¹ a.s. (4.3.25)
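To make theorem 4.3.4 concrete, here is a hedged plug-in sketch of the variance matrix estimators, reusing the hypothetical response function of the earlier sketch; the gradient is coded analytically here, but any correct derivative would do.

```python
import numpy as np

def f(x, theta):
    # the same hypothetical response function as in the earlier sketch
    return theta[0] + theta[1] * np.exp(theta[2] * x)

def grad_f(x, theta):
    # d f(x,theta) / d theta' for that hypothetical response function
    e = np.exp(theta[2] * x)
    return np.column_stack([np.ones_like(x), e, theta[1] * x * e])

def sandwich_variance(y, x, theta_hat):
    """Plug-in estimate of Omega2^{-1} Omega1 Omega2^{-1} in theorem 4.3.2 (a sketch)."""
    u = y - f(x, theta_hat)                       # residuals
    G = grad_f(x, theta_hat)                      # n x m matrix of partial derivatives
    n = len(y)
    Omega1 = (G * (u ** 2)[:, None]).T @ G / n    # Omega1_hat, cf. (4.3.19)
    Omega2 = G.T @ G / n                          # Omega2_hat, cf. (4.3.20)
    Omega2_inv = np.linalg.inv(Omega2)
    # under homoskedasticity (assumption 4.3.8) one would instead use np.mean(u**2) * Omega2_inv
    return Omega2_inv @ Omega1 @ Omega2_inv       # estimates avar of sqrt(n)(theta_hat - theta0)
```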
Exercises 1. Verify the conditions for strong consistency and asymptotic normality of 9 for the standard linear regression model Yj = 0o'Xj + Uj, where Uj and Xj are mutually independent. 2. Suppose that f(x,0) is incorrectly specified as 9% with 9 e 0 C R k , x G R k . Set forth conditions such that 0 —• 90 a.s. and
y/n(6 - eo in distr., where 90 = [E(XiX 1 > )]" 1 E(XiYi)]. Cf. White (1980a). 4.4 Consistency and asymptotic normality of nonlinear least squares estimators under data heterogeneity 4.4.1 Data heterogeneity In medicine, biology, psychology, and physics the source of the data set {(Yj,Xj)}, j = l,...,n, is quite often a statistical experiment, where Yj is the outcome of the experiment and Xj is a vector of control variables set by the analyst. In this case the vector Xj is in fact non-random. However, non-random variables or vectors may be considered as random variables or vectors, respectively, taking a value with probability 1. Thus, if (XJ), j = 1,2,... is a sequence of control vectors then we define: Xj = Xj a.s.
The distribution function of Xj is then where I(.) is the indicator function and X^ and x ( ^ are the ^-th components of Xj and x, respectively. Moreover, suppose E(Y j ) = g(x j ),U j = Y j - g ( x j ) . Then Y j = g(Xj) + U j a.s. ) where E(Uj) = 0 and Uj is independent of Xj. The joint distribution function Fj(y,x) of (Yj,Xj) is now: Fj(y,x) = P[Uj
stratified sample of size n with q strata of different sizes iii (i=l,...,q, n = ni +... + n q ), corresponding to q income classes. Then the distribution of (Yj,Xj) depends on the stratum to which j belongs. See further White (1980b). In order to cover data heterogeneity as well, we now augment assumption 4.1.1 with: Assumption 4.4.1 The random vectors (Yj,Xj), j = 1,2,...,n,... are independent with joint distribution functions Fj(y,x), j = 1,2,...,n,..., respectively. Moreover, denoting F<%,x) = (l/n)£jL,Fj(y,x), one of the following alternative conditions holds: (a) F ( n ) —» F properly setwise, (b) F ( n ) —> F properly pointwise, and f(x,0) is continuous in both of its arguments. The distinction between the alternatives (a) and (b) is due to the distinction between the conditions of theorems 2.7.1-2.7.4, i.e., under assumption 4.4.1a we shall apply theorems 2.7.3 and 2.7.4 and under assumption 4.4.1b we shall apply theorems 2.7.1 and 2.7.2. 4.4.2 Strong and weak consistency Now let us modify the assumptions of section 4.3 to the case under consideration. Depending on whether we wish to prove weak consistency, i.e.. p l i m ^ ^ = 0O, or strong consistency, i.e., 0 -> 0O a.s. we assume either Assumption 4.4.2(i). Let, for some 6 > 0,
supn(l/n)EjLiE[suptfee|Yj -f(Xj,0)|2 + *| < oo, or Assumption 4.4.2(ii). Let, for some 5 > 0, suPn(l/n)5:°=iElsupfl6e|Yj-f(Xj,0)|4-M] < oo. Denoting = J( y -f( x ,0)) 2 dF(y,x),
(4.4.1)
it follows from assumptions 4.1.1, 4.4.1, and 4.4.2(i) and theorems 2.7.2 and 2.7.4 that plim n ^ oo sup tf۩|Q(fl) - Q(0)| = 0,
(4.4.2)
whereas from assumptions 4.1.1, 4.4.1, 4.4.2.(ii) and theorems 2.7.1 and 2.7.3 it follows that: sup0G<9|Q(0) - Q(0)| -> 0 a.s.
(4.4.3)
Moreover, in both cases the limit function Q(0) is continuous. Furthermore, it follows from theorems 2.5.1 and 2.5.4 that Q(0) = lim n _ oo E[Q(0)] = lim n _ oo (l/n)y;| 1 _ 1 E{[U j +
f(Xi,0o)--f(XhO)]2} J
'(4.4.4)
hence Q(0O) = inf, e 0 Q(0).
(4.4.5)
Thus, if Assumption 4.4.3
Q(0) takes a unique minimum on 0,
then by theorem 4.2.1 Theorem 4.4.1. (i) Under assumptions 4.1.1, 4.4.1, 4.4.2(i) and 4.4.3, p l i m , ^ ^ (ii) Under assumptions 4.1.1, 4.4.1, 4.4.2(ii) and 4.4.3, 0 -> 0O a.s.
4.4.3 Asymptotic normality Next we shall modify the assumptions 4.3.4—4.3.8 such that we may replace the reference to theorem 2.7.5 by references to theorems 2.7.2 and 2.7.4, and the reference to theorem 2.4.1 by a reference to theorem 2.4.3. Assumption 4.4.4 Let assumption 4.3.4 hold. If part (b) of assumption 4.4.1 holds then the first and second partial derivatives of f(x,0) to 0 are continuous in both arguments. Assumption 4.4.5
Let for some 8 > 0andi,ii,i 2 = 1,2,...,m, : oo; ))\2 + s] < oo.
Assumption 4.4.6 The matrix fi2 = J[O/i0')f(x,0o)][^0)f(x,e o )]dF(y,x) is nonsingular. Assumption 4.4.7 Let for some d > 0 and ii,i2 = 1,2,...,m,
Assumption 4.4.8 For j = 1,2,..., E(Uj2|Xj) = E(Uj2) = a 2 a.s. Moreover, let
Then similarly to theorems 4.3.2, 4.3.3, and 4.3.4 we have: Theorem 4.4.2 Let the conditions of theorem 4.4. l(i) be satisfied. (i) Under the additional assumptions 4.4.4-4.4.7 we have: -0 O ) -> Nm[0,fi2 ^ I ^ ! 1
i n
distr
>
(ii) Under the additional asumptions 4.4.4-4.4.6 and 4.4.8 we have: -0 O ) -> N m (0,a 2 fi2 ! ) i n d i s t r -
and
4.5 Maximum likelihood theory The theory in section 4.2 is straightforwardly applicable to maximum likelihood (ML) estimation as well. Although we assume that the reader is familiar with the basics of ML theory, say on the level of Hogg and Craig (1978), we shall briefly review here the asymptotic properties of ML estimators, for the case of a random sample from a continuous distribution. Let Z l v ..,Z n be a random sample from an absolutely continuous kvariate distribution with density function h(z|0o), where 6o is an unknown parameter vector in a known parameter space 0 c R m . We assume that Assumption 4.5.1 The parameter space 0 is convex and compact. For each z e R k , h(z|0) is twice continuously
differentiable on 0. Moreover, the following conditions hold for each interior point 60 of 0: (i) Jsup,e0|ln(h(z|0))|h(z|0o)dz < oo; (ii) For i,r,s = l,...,m, 0))| 2 h(|0)d
oo, oo,
M\)\
oc.
(iii) The function £(6\e0) = Jln(h(z|0))h(z|0 o )dz
(4.5.1)
has a unique supremum on 0. (iv) The matrix H(*o) =
(4.5.2) is nonsingular. Assumptions 4.5.1(i)-(iii) are just what are often termed in intermediate statistics and econometrics textbooks "some mild regularity conditions." In particular, condition (ii) allows us to take derivatives of integrals by integrating the corresponding derivatives: Lemma 4.5.1 Let assumption 4.5.1(ii) hold. For i,r,s= l,...,m and every pair of interior points 0 and 90 of 0, n(h(z|0))h(z|0o)dz;
(c) jO/U0i)h(z|0)dz = jO/^0r)O/^0s)h(z|0)dz = 0. Proof: Let ei be the i-th unit vector in Rm. Then (b/W{)\n(h(z\6)) = ^ms
l
whereas by the mean value theorem |^~][ln(h(z104-(5ei))-ln(h(z1 The result (a) therefore follows from the dominated convergence theorem 2.2.2. Similarly, so do (b) and (c). Q.E.D. A direct corollary of lemma 4.5.1 is that
(^0Ojln(h(z|0))h(z|0 o )dz = 0 at 0 = 0O;
(4.5.3)
(b/W)(b/W)$ln(h(z\6))h(z\60)dz
(4.5.4)
= -H((90) at 0 = 0O;
hence the first- and second-order conditions for a maximum of Jln(h(z|0))h(z|0o)dz in 0O are satisfied: Lemma 4.5.2 Under assumption 4.5.1, the function £(6\6o) [cf. (4.5.1)]) takes a unique supremum on © at 0 = 0O. The ML estimator 0 of 0O is n o w defined as a measurable solution of the maximization problem sup0G(9Ln(Z|0), or equivalently sup*? €6>ln(Ln(Z|0)), where ]
X
with Z
= (Zr,...,ZnT,
is the likelihood function. Denoting Qn(0)
=
(4.5.5)
ln(Ln(Z|0))/n and
H = (l/^EjLitO/^^MMZjI^^JIO/^^ln^Zjl^))],
(4.5.6)
it follows straightforwardly from theorems 4.2.1-4.2.2 that: Theorem 4.5.1 Under assumption 4.5.1, 6 —> 60 a.s., yjn(6 -Oo) -* N ^ O f l ^ o ) " 1 ) in distribution and H -> H((90) a.s. Proof. Exercise 1. Least squares estimators and ML estimators are members of the class of so-called M-estimators. A typical M-estimator 6 is obtained by maximizing an objective function of the form Qn(0) = (l/n)£jLi*
(4.5.7)
For example, in the case of least squares estimation of a nonlinear regression model Yj = g(Xj,0o) + Uj» we have —
VWo) = Jrtz,0)h(z|0o)dz
(4.5.8)
has a unique supremum on 0 at 9 = 0O. (iii) For i,r,s = l,...,m, h(z|0o)dz < oo, < oo, z < oo.
O/r)O/s)(|)|
oo.
(iv) The matrix
B(0O) = S[Wm(W0oWzA)Mz\0o)dz.
(4.5.9)
is nonsingular. Denoting
A(0) = ^/WMzMl(^eMzMHz\e)dz9
(4.5.10)
it follows now that: Theorem 4.5.2 Under assumptions 4.5.1-4.5.2, 0 —• 0O a.s. and Vn(fl> - 0 O ) -* Nm(0,B((90)" lA(60)B(e0yl) in distribution. Proof: Exercise 2. The position of the ML estimator within the class of M-estimators is a very special one, namely that of the asymptotic efficient estimator, i.e., the asymptotic variance matrix of the ML estimator is always "smaller" (or at least not "greater") than the asymptotic variance matrix of the typical M-estimator, in the following sense: Theorem 4.5.3. Under assumptions 4.5.1-4.5.2, the matrix B(0 O )~^(00)8(00)"* -H(0 O )~ 1
(4.5.11)
is positive semi-definite. Proof. The first-order condition for a maximum of ip(0*\6) in 0* = 8 is: J[((V()0'Mz,0)]h(z|0)dz = O, for each interior point 0 of 0. Taking derivatives again it follows that O = O/^0)J[O/^0')^(z,0)]h(z|0)dz = J[O/^0)O/^0')^(z,0)]h(z|0)dz+f[(b/io 9)ip(zM[Q>WMz\e)]dz 9 (4.5.12) where assumptions 4.5.1-4.5.2 allow us to take the derivatives inside the integrals. Consequently, we have (4.5.13)
This is the main key for the proof of the asymptotic efficiency of ML estimation. The rest of the proof below is quite similar to the derivation of the Cramer-Rao lower bound of the variance of an unbiased estimator. It is not hard to show (cf. exercise 2) that y/n(0 - 0 o ) = V n + o p (l),
(4.5.14)
where
Vn = B(flo)" 1 (Wn)Er=i^ e X z jA)
(4.5.15)
and o p (l) denotes a random variable that converges in probability to zero. Moreover, denote W n = (l/Vn)5:jLiO^0')ln(h(Zj|eo)).
(4.5.16)
Then var(Vn) = B(6orlA(00)B(00)-1
and var(W n ) = H(0O),
(4.5.17)
whereas by (4.5.13) and (4.5.14), covar(V n ,W n ) = covar(W n ,V n ) = - I .
(4.5.18)
It follows now from (4.5.17) and (4.5.18) that
-I
H(6)J
(4519)
is the variance matrix of (Vn',Wn')'; hence M is positive semi-definite, and so is SMS' = B(0O)~ W^WB^o)"* -H(flo)" 1 , with S = (I,H((90)"1). Q.E.D. Since the ML estimator is the efficient estimator, one may wonder why somebody would consider other, less efficient, estimators. The reason is robustness. ML estimation requires complete specification of the density h(z|0). Only if h(z|0) = h(zuz2\6) can be written as h1(z1|z2,e)h2(z2), with hi the conditional density of h given z2, and h2 the marginal density, where only hx depends on 0, do we not need to know the functional form of h2. For example, ML estimation of the regression model Yj = f(Xj,0o) + Uj requires full specification of the conditional density of Uj relative to the event Xj = x, whereas least squares estimation requires the much weaker condition that E(Uj|Xj) = 0 a.s. Thus, since ML estimation employs more a priori information about the data-generating process than say least squares, it is not suprising that ML estimators are more efficient. However, this virtue also makes ML estimators vulnerable to deviations from the assumed distributions. For example, if the errors Uj of the regression model Yj = f(Xj,0o) + Uj are independent of Xj but
their distribution is non-normal with fat tails, while ML estimation is conducted under the assumption of normality of Uj, the "pseudo" ML estimator involved will be asymptotically less efficient than many alternative robust M-estimators. See Bierens (1981). Often the ML estimator is more complicated to calculate than some alternative non-efficient M-estimators. Take, for example, the case of a linear regression model Yj = a + /3Xj + Uj, where Xj and Uj are independent and the distribution of Uj is a mixture of two normal distributions, say N(0,l) and N(0,2), with mixture coefficient X. Thus we assume that the density of Uj is
f(u) = texp(-V2\i2)/y/(27r) + ( l - ^ e x p C - ^ u ^ / ^ T r ) ,
(4.5.20)
where X e [0,1] is the mixture coefficient. Potscher and Prucha (1986) consider the similar but even more complicated problem of efficient estimation of a nonlinear time series regression model with t-distributed errors, and they propose the adaptive one-step M-estimator discussed below. Denoting the possibly unknown density of Xj by g(x), the joint density of (Yj,Xj)' is now: - y 2 (y-a-/3x) 2 /2]/V(47r)}g(x),
(4.5.21)
hence the log-likelihood function takes the form ln(L n (Z|0)) = EjLiln{^exp[- 1 / 2 (Y j -a-^Xj) 2 ]/V(27r) + (l-i)exp[- 1 /2(Y j -a- i aXj) 2 /2]/ > /(47r)} + £jL,ln(g(Xj)), (4.5.22) where Z = (Yi,X l v ..,Y n ,X n )' and 6 = (a,(3,X)\ Note that the term ^jn=1ln(g(Xj)) does not matter for the solution of sup^ln(L n (Z|0)). Clearly, maximizing (4.5.22) with respect to 9 is a highly nonlinear optimization problem. However, in the present case it is easy to obtain an asymptotically efficient estimator of 9 by conducting a single Newton step starting from an initial estimator 6 of 9 for which ^n(0 — 9) = O p (l), as follows. Let a and /3 be the OLS estimators of a and /?, respectively, and let a 2 = (l/n)£" = 1 (Yj —a — /3Xj)2 be the usual estimator of the variance a 2 of Uj. Since in the case under consideration a 2 = X + 2(1— X) = 2 — X, we estimate X by X = 2 — a 2 . Now let 6 = (d,/3,A)'.
(4.5.23)
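The mixture log-likelihood (4.5.22) and the initial estimator (4.5.23) are easy to code; below is a minimal sketch assuming NumPy, with the unknown density g of Xj dropped since it does not involve θ. The function names are hypothetical.

```python
import numpy as np

def loglik_mixture(theta, y, x):
    """Log-likelihood (4.5.22) up to the term sum_j ln g(X_j); theta = (alpha, beta, lambda)."""
    a, b, lam = theta
    u = y - a - b * x
    dens = (lam * np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)          # N(0,1) component
            + (1.0 - lam) * np.exp(-0.25 * u ** 2) / np.sqrt(4.0 * np.pi))  # N(0,2) component
    return np.sum(np.log(dens))

def initial_estimator(y, x):
    """Initial root-n consistent estimator (4.5.23): OLS plus lambda_tilde = 2 - sigma_tilde^2."""
    X = np.column_stack([np.ones_like(x), x])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.mean((y - a - b * x) ** 2)
    return np.array([a, b, 2.0 - sigma2])
```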
It can be shown that if the (Yj,Xj)'s are independent and 0 < E[|Xj — E(Xj)|2] < oo, then y/n(0 — 9) converges in distribution to N 3 (0,O), hence y/n(0-9) = O p (l). Cf. theorem 2.5.9 and exercise 5. The Newton method for finding a maximum or minimum of a twice
differentiable function Φ(θ) on R^m is based on the second-order Taylor approximation of Φ(θ) about an initial point θ₁:
Φ(θ) ≈ Φ(θ₁) + (θ − θ₁)'∇Φ(θ₁) + ½(θ − θ₁)'∇²Φ(θ₁)(θ − θ₁), (4.5.24)
where ∇Φ(θ) = (∂/∂θ')Φ(θ) and ∇²Φ(θ) = (∂/∂θ)(∂/∂θ')Φ(θ). Differentiating the right-hand side of (4.5.24) with respect to θ and equating the result to zero yields the Newton step:
θ₂ = θ₁ − (∇²Φ(θ₁))⁻¹∇Φ(θ₁).
(4.5.25)
In the present case Φ(θ) = ln(L_n(Z|θ)) and θ₁ = θ̃, so the single Newton step estimator involved is: θ̂ = θ̃ − [(∂/∂θ)(∂/∂θ')ln(L_n(Z|θ̃))]⁻¹(∂/∂θ')ln(L_n(Z|θ̃)).
(4.5.26)
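A hedged numerical sketch of the single Newton step (4.5.26) follows; it uses central finite differences for the score and Hessian rather than analytic derivatives, which is purely a convenience assumption on my part.

```python
import numpy as np

def newton_one_step(loglik, theta_tilde, eps=1e-4):
    """One Newton step (4.5.26) from a root-n consistent initial estimate.

    loglik : callable theta -> ln L_n(Z|theta); theta_tilde : initial estimate.
    Gradient and Hessian are approximated by central finite differences.
    """
    m = len(theta_tilde)
    grad = np.zeros(m)
    hess = np.zeros((m, m))
    for i in range(m):
        ei = np.zeros(m); ei[i] = eps
        grad[i] = (loglik(theta_tilde + ei) - loglik(theta_tilde - ei)) / (2.0 * eps)
        for j in range(m):
            ej = np.zeros(m); ej[j] = eps
            hess[i, j] = (loglik(theta_tilde + ei + ej) - loglik(theta_tilde + ei - ej)
                          - loglik(theta_tilde - ei + ej) + loglik(theta_tilde - ei - ej)) / (4.0 * eps ** 2)
    return theta_tilde - np.linalg.solve(hess, grad)   # theta_hat = theta_tilde - H^{-1} * grad

# e.g., with loglik_mixture and initial_estimator from the previous sketch:
# theta_hat = newton_one_step(lambda th: loglik_mixture(th, y, x), initial_estimator(y, x))
```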
Theorem 4.5.4 Let the conditions of theorem 4.5.1 hold, and let as
0 be the single Newton step estimator (4.5.26), where the initial estimator 0 is such that y/n(0 - 0 O ) = OP(1). Then y/n(S - 0 o ) -+ Nm(O,H(0o)"1). Proof: We leave the details of the proof as an exercise (exercise 6). Here we shall only give some hints. First, let
H = -(l/n)O/*0)(^0>(L n (Z|0))
(4.5.27)
and prove that (4.5.28) Si
Next, using the mean value theorem, show that there exists a matrix H such that (l/>)(W)ln(L n (Z|0)) = (l/ v /n)O/()0')ln(L n (Z|0o))-Hyn(0-0o) (4.5.29) and plimn^ooH=H(0o).
(4.5.30)
Then y/n(0 -eo) = y/iL{0-0o) +H- 1 [(l/Vn)(^0')ln(L n (Z|0 o ))-H v /n(0-0 o )] = (I - H " lH)jn(§ - 0O) + H - l[(l / v/n)(c)^0')ln(Ln(Z|0o))]. (4.5.31) Finally, show that plim n ^ o o (I-H- 1 H v /n(0-0 o ) = 0 and
(4.5.32)
H"H(l/>)(W)ln(L n (Z|0 o ))] - N m (O,H(0 o r')•
(4.5.33) Q.E.D.
Exercises 1. Prove theorem 4.5.1. 2. Prove theorem 4.5.2. 3. Verify (4.5.18). 4. Let Zj = (Yj, Xj)' G R 2 , j = l,.,.,n, be a random sample from a bivariate distribution, where Yj = a + /?Xj + c\5} with Uj standard normally distributed and independent of Xj. The distribution of Xj is unknown. Let 9 = (a,/?,a 2 ). Derive the ML estimator 0 of 9 and the asymptotic distribution of y/n(§ — 9). 5. Prove that for the estimator (4.5.23), yJn{9-9) - • N 3 (0,G), and derive Q. 6. Complete the details of the proof of theorem 4.5.4.
4.6 Testing parameter restrictions 4.6.1 The Waldtest Estimation of a regression model is usually not the final stage of regression analysis. What we actually want to know are the properties of the true model rather than those of the estimated model. Given the truth of the functional form, we thus want to make inference about the true parameter vector 90. Various restrictions on 90 may correspond to various theories about the phenomenon being studied. In particular, the theory of interest may correspond to a set of q (non)linear restrictions in 0o, i.e., Ho: m(0o) = 0 for i= l,2,...,q < m
(4.6.1)
Throughout we assume that the functions rji are continuously differentiable in a neighborhood of 90 and that the q x m matrix Q^
(4.6.2)
has rank q. We now derive the limiting distribution of 77 =
(r]l(e),...,r]q(O)y
under Ho, given that v/n(£-0o)^N m (O,G),
(4.6.3)
where Q is nonsingular, and that Qis a consistent estimator of Q: plimn^oo Q= Q.
(4.6.4)
Note that if 6 is a ML estimator, Q = H(0 O )~ * and Q = H ~ . Observe from the mean value theorem that -Oo), where 0
(l)
(4.6.5)
satisfies
\d®-0o\ < \0-60\ a.s.
(4.6.6)
Since (4.6.3) and (4.6.6) imply p l i m ^ ^ ^ = 90, it follows from theorem 2.1.7 that (4-6.7) and consequently, by (4.6.3) and theorems 2.3.3 and 2.3.5, plimn^oetO/^^i^^-^/^^r/i^lVn^-eo) = 0
(4.6.8)
Combining (4.6.5) and (4.6.8) it follows that
piim^oj >te(0) - mWo)) - (^eMOoWnie - e0y\ = o
(4.6.9)
and consequently pUm^ooIXtf-ry) ~ ^ X ^ -60)]= 0,
(4.6.10)
where v = (mWol-MOo))9
(4.6.H)
Since (4.6.3) implies that ry/n(6 -60) -4 N q (0,ror') in distr., it follows now from (4.6.10) that, 01(7? -rj)-*
N q (0,ror']) in distr.
4.6.12)
Note that FQP has rank q and is therefore nonsingular. Now let f = [(W6i2)%(6)]
(4.6.13)
Then plinin^oo^ = 90 implies plimn^^r = r
(4.6.14)
and thus
fti
= rar.
(4.6.15)
Using theorem 2.3.5 we conclude from (4.6.12) and (4.6.15) that (for)-
V n ^ - r ? ) -> Nq(0,I) in distr.
(4.6.16)
and consequently by theorem 2.3.4, n(rt-rinfQfTl(ri
-V) - Xq in distr.
(4.6.17)
The Wald statistic for testing H₀: η = 0 is now W_n = n·η̂'(Γ̂Ω̂Γ̂')⁻¹η̂.
(4.6.18)
Clearly, W n -> Xq in distr. if H o is true.
(4.6.19)
If H o is false, i.e., 77 ^ 0, then p i i m n ^ W n / n = Ti\rarylri
> o,
(4.6.20)
hence plim n .ooW n = 00.
(4.6.21)
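A hedged sketch of computing W_n in (4.6.18) follows, assuming an estimate theta_hat with estimated asymptotic variance Omega_hat of sqrt(n)(theta_hat − theta0) is already available; the Jacobian is approximated by finite differences and the example restriction is purely illustrative.

```python
import numpy as np

def wald_statistic(theta_hat, Omega_hat, n, eta, eps=1e-6):
    """W_n = n * eta(theta_hat)' [Gamma Omega Gamma']^{-1} eta(theta_hat), cf. (4.6.18).

    eta : callable returning the q-vector of restrictions; its q x m Jacobian
    Gamma is approximated here by one-sided finite differences.
    """
    e = np.atleast_1d(eta(theta_hat))
    q, m = len(e), len(theta_hat)
    Gamma = np.zeros((q, m))
    for j in range(m):
        step = np.zeros(m); step[j] = eps
        Gamma[:, j] = (np.atleast_1d(eta(theta_hat + step)) - e) / eps
    V = Gamma @ Omega_hat @ Gamma.T
    return float(n * e @ np.linalg.solve(V, e))   # compare with a chi-square(q) critical value

# illustrative restriction eta(theta) = theta1 * theta2 - 1 (so q = 1):
# W = wald_statistic(theta_hat, Omega_hat, n, lambda th: np.array([th[0] * th[1] - 1.0]))
```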
Remark: A drawback of the Wald test is that changing the form of the nonlinear restrictions 77i(0o) = 0 to a form which is algebraically equivalent under the null hypothesis will change the Wald statistic W n . See Gregory and Veall (1985) and Phillips and Park (1988). The former authors show on the basis of Monte Carlo experiments that differences in functional form of the restrictions are likely to be important in small samples, and the latter two authors investigate this phenomenon analytically. Note that the result in this section is in fact a straightforward corollary of the following general theorem. Theorem 4.6.1 Let (6n) be a sequence of random vectors in R k satisfying r n (0 n -0 o ) - •N k (0,G), where (r n) is a sequence of real numbers converging to infinity and 00 G R k is non-random. Let rj\(0)9...,rjm(0) be Borel measurable real functions on R k such that for j = l,...,m, rj^0) is continuously differentiate in 6o. Let j](0) = (j]i(0),...,r]m(0)y and let F be the q x m matrix r = [(b/lOiJriiSPo)]
(4.6.22)
Then rn(r7(<9n) — 77(6^0)) -•Nm(0,FQF') in distr. (N.B. m may be larger than k.) 4.6.2 The likelihood ratio test Suppose that 9 is the ML estimator of 60, and consider again the problem of testing the null hypothesis (4.6.1). We recall that the null hypothesis involved is H o : v(0o) = 0,
(4.6.23)
where 7/(0) - (7n(0)9...9riq0))\
(4.6.24)
Cf. (4.6.1). The likelihood ratio (LR) test is based on the ratio of the constrained and the unconstrained maximum likelihood: 4 = sup6e0Me) = oK(Z\9) / supee0Ln(Z\6).
(4.6.25)
Note that 0 < Xn < 1. If H o is true one may expect Xn to be close to 1, whereas if H o is false Xn will tend to zero. Therefore the LR test is onesided: H o is rejected if kn is less than the a-critical value Aa, which under H o is such that P(An
(4.6.27)
be the new parameter space, and let *
'(#*))
(4.6.28)
be the reparameterized likelihood function. The null hypothesis (4.6.23) now simplifies to H o : The last q components of 6Q = ^(OQ) are zero,
(4.6.29)
and the LR test involved becomes
where 62 is a q-component subvector. Thus without loss of generality we may confine our attention to testing the simple null hypothesis: H o : 0O = (0oi\O')\ with 0' a 1 x q zero vector.
(4.6.31)
Let 0 = (0u0y
(4.6.32)
be the constrained ML estimator. Then the LR test statistic is: ^ = L n (Z|0)/Ln(Z|0).
(4.6.33)
In practice the distribution of Xn under H o is often unknown, so that
the critical value Aa cannot be calculated. However, Aa can be approximated using the following asymptotic result: Theorem 4.6.2 Let the conditions of theorem 4.5.1 hold. Under the null hypothesis (4.6.31), -2.1n(>ln) -• #j in distribution. Proof. We leave the details of the proof as an exercise. Here we shall only give some hints. Consider the second-order Taylor expansion ln(Ln(Z|0))-ln(Ln(Z|0)) = (0 - §)(b/W)\n(Ln(Z\§)) + V20 - §y[(b/W/)(d/W) ln(Ln(Z|0-))](0 - 0),
(4.6.34)
with 0* a mean value. Conclude from (4.6.34) that p (l),
(4.6.35) where we recall that op(l) stands for a random variable that converges in probability to zero. Next, observe that (4.6.36)
and V n ( 0 - 0 o ) = H(0o)- 1 [(l/Vn)E]Li(^/^^)ln(h(Z j |0 o ))] + o p (l),
(4.6.37)
where Hn(0 o ) is the upper-left (m-q)x(m-q) submatrix of H(0O). Denoting H = H ( 0 o ) , H n = H n ( 0 o ) a n d S = So - H" 1 with So = ( Hg
° ),
(4.6.38)
we conclude from (4.6.36) and (4.6.37) that ^(0 _
fl)->Nm(0,SHS'),
(4.6.39)
which implies that !/
1 / 1 /
).
(4.6.40)
Since the matrix I —H^SoH^ is idempotent with rank q, the theorem under consideration follows. Q.E.D. We have seen in section 4.6.1 that the Wald test is not invariant with respect to possible mathematically equivalent functional specifications of the parameter restrictions under the null hypothesis. This does not apply to the LR test, owing to the fact that the LR test statistic /l n is invariant with respect to reparameterization. In particular, the null hypothesis (4.6.31) is mathematically equivalent to (4.6.23). This implies that: Theorem 4.6.3 The result of theorem 4.6.2 carries over if the
null hypothesis is replaced by (4.6.23). Moreover, the LR test is invariant with respect to possible mathematically equivalent functional specifications of the restrictions in (4.6.23). This invariance property distinguishes the LR test favorably from the Wald test. On the other hand, the LR test requires full specification of the likelihood function (up to a possible factor that does not depend on the parameters), and two different ML estimators, the constrained and the unconstrained ML estimators. The Wald test employs only one unconstrained estimator and does not require full specification of the likelihood function. Thus the Wald test is more robust with respect to possible misspecification of the model and has the advantage of computational simplicity. Finally, we note that the LR test is a consistent test. If the null hypothesis is false then asymptotically we will always reject the null hypothesis: Theorem 4.6.4 Let the conditions of theorem 4.5.1 hold. If HQ is false then plimn^oo —2.1 n(An) = oo. Proof. Exercise 2. 4.6.3 The Lagrange multiplier test Maximization of the likelihood function under the restrictions (4.6.23) can be done using the well-known Lagrange technique. Consider the Lagrange function F(09/i) = ln(L n (Z|0))-/AK0).
(4-6.41)
The first-order conditions for an optimum in (6, ft) are:
(b/W)\n(Ln(Z\0))-r(0yil = 0;
(4.6.42)
rj(6) = 0;
(4.6.43)
where F(0) is the q x m matrix r(6) = [(b/Wi2)mi(e)l
i, = l,...,q, i 2 = l,...,m.
(4.6.44)
Cf. (4.6.2). The Lagrange multiplier (LM) test is based on the vector /I of Lagrange multipliers:
4 = (ii/y/n)rH-1 r(ji/y/n),
(4.6.45)
where f = T(§) and H = (l/nJE-LitO/^^ln^ZjIflWlIO/^ejlnaiCZjie))].
(4.6.46)
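The LM statistic (4.6.45)-(4.6.46) is a simple quadratic form once the restricted estimate, its Lagrange multipliers, and the matrices Γ̂ and Ĥ are available. The sketch below assumes NumPy and that these ingredients have been computed; it is an illustration of the formula only.

```python
import numpy as np

def lm_statistic(mu_hat, Gamma_hat, H_hat, n):
    """LM statistic (4.6.45): (mu/sqrt(n))' Gamma H^{-1} Gamma' (mu/sqrt(n)).

    mu_hat    : q-vector of Lagrange multipliers from the restricted optimum,
    Gamma_hat : q x m Jacobian of the restrictions at the restricted estimate,
    H_hat     : m x m matrix (4.6.46).
    """
    v = mu_hat / np.sqrt(n)
    return float(v @ Gamma_hat @ np.linalg.solve(H_hat, Gamma_hat.T @ v))  # compare with chi-square(q)
```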
Theorem 4.6.5 Let the conditions of theorem 4.5.1 hold. Under the null hypothesis (4.6.23), ℓ̂_n → χ²_q in distr., whereas if (4.6.23) is false, plim_{n→∞} ℓ̂_n = ∞.
Also the LM test is invariant with respect to possible mathematically equivalent formulations of the null hypothesis, although this is not so obvious as in the case of the LR test. Moreover, we note that the LM test is a special case of Neyman's (1959) C(a) test. See Dagenais and Dufour (1991). In the first instance the test statistics of the Wald, LR and LM tests look quite different, but they are in fact asymptotically equivalent under the null hypothesis, in the sense that: Theorem 4.6.6. Under the conditions of theorem 4.5.1 and the null hypothesis (4.6.23), (a) p l i m ^ o o l ^ + 2.1n(^n)} = 0, (b) p l i m n ^ W n + 2.1n(Ao)} = 0. Proof. Exercise 3. (Hint: Show by using the mean value theorem that ») + o p (l),
(4.6.47)
and then use (4.6.40) and (4.6.42). This proves part (a). The proof of part (b) goes similarly.) Q.E.D. Thus, the first part of theorem 4.6.5 follows from theorems 4.6.3 and 4.6.6a. The second part is left as exercise 4.
Exercises
1. Complete the details of the proof of theorem 4.6.2.
2. Prove theorem 4.6.4.
3. Prove theorem 4.6.6.
4. Prove the consistency of the LM test.
Tests for model misspecification
In the literature on model specification testing two trends can be distinguished. One trend consists of tests using one or more well-specified non-nested alternative specifications. See Cox (1961, 1962), Atkinson (1969, 1970), Quandt (1974), Pereira (1977, 1978), Pesaran and Deaton (1978), Davidson and MacKinnon (1981), among others. The other trend consists of tests of the orthogonality condition, i.e. the condition that the conditional expectation of the error relative to the regressors equals zero a.s., without employing a well-specified alternative. Notable work on this problem has been done by Ramsey (1969, 1970), Hausman (1978), White (1981), Holly (1982), Bierens (1982, 1991a), Newey (1985), and Tauchen (1985), among others. A pair of models is called non-nested if it is not possible to construct one model out of the other by fixing some parameters. The non-nested models considered in the literature usually have different vectors of regressors, for testing non-nested models with common regressors makes no sense. In the latter case one may simply choose the model with the minimum estimated error variance, and this choice will be consistent in the sense that the probability that we pick the wrong model converges to zero. A serious point overlooked by virtually all authors is that nonnested models with different sets of regressors may all be correct. This is obvious if the dependent variable and all the regressors involved are jointly normally distributed and the non-nested models are all linear, for conditional expectations on the basis of jointly normally distributed random variables are always linear functions of the conditioning variables. Moreover, in each model involved the errors are independent of the regressors. In particular, in this case, the tests of Davidson and MacKinnon (1981) will likely reject each of these true models, as these tests are based on combining linearly the non-nested models into a compound regression model. Since other tests of non-nested hypotheses are basically in the same spirit one may expect this flaw to be a pervasive phenomenon. Consequently, these tests are only valid if either the null or only one of the alternatives is true. Moreover, tests of non-nested hypotheses may have low power against nonspecified alternatives, as 89
pointed out by Bierens (1982). Therefore we shall not review these tests further. In this chapter we only consider tests of the orthogonality condition, without employing a specific alternative. First we discuss White's version of Hausman's test in section 5.1 and then, in section 5.2, the more general M-test of Newey. In section 5.3 we modify the M-test to a consistent test and in section 5.4 we consider a further elaboration of Bierens' integrated M-test. 5.1 White's version of Hausman's test
In an influential paper, Hausman (1978) proposed to test for model misspecification by comparing an efficient estimator with a consistent but inefficient estimator. Under the null hypothesis that the model is correctly specified the difference of these estimators times the square root of the sample size, will converge in distribution to the normal with zero mean, whereas under the alternative that the model is misspecified it is likely that these two estimators have different probability limits. White (1981) has extended Hausman's test to nonlinear models, using the nonlinear least squares estimator as the efficient estimator and a weighted nonlinear least squares estimator as the nonefficient consistent estimator. The null hypothesis to be tested is that assumption 4.1.1 holds: Ho: E(Yj|Xj) = f(Xj,0o) a.s. for some 6o e 0, where f(x,0) is a given Borel measurable real function on Rk x 0 which for each x £ Rk is continuous on the compact Borel set 8 c Rm . The weighted nonlinear least squares estimator is a measurable solution 0 of: 0* e 0 a.s., Q*(0*) = inf0G0Q*(0), where Q\ff) = (l/nJEjLiCYj-f^j^Vxj), with w(.) a positive Borel measurable real weight function on Rk. Following White (1981), we shall now set forth conditions such that under the null hypothesis, y/n(0 - §*) -> Nm(0,Q) in distr., with 6 the nonlinear least squares estimator, whereas if H o is false, Given a consistent estimator Q of the asymptotic variance matrix Q the test statistic of White's version of Hausman's test is now
W* = n(θ̂ − θ̂*)'Ω̂⁻¹(θ̂ − θ̂*), which is asymptotically χ²_m distributed under H₀ and converges in probability to infinity if H₀ is false. Let us now list the maintained hypotheses which are assumed to hold regardless of whether or not the model is correctly specified. Assumption 5.1.1 Assumption 4.3.1 holds and E[Yj²w(Xj)] < ∞.
Assumption 5.1.2 Assumption 4.3.2 holds and E[supfleefKX,,0)2w(X,)] < oo. Assumption 5.1.3 There are unique vectors 6* and 0*» in 0 such that E{[E(Y,|X,) -f(X,A)] 2 } = i and
If H o is false then 0* / 0«. Assumption 5.1.4 The parameter space 0 is convex and f(x,0) is for each x e Rk twice continuously differentiate on 0. If H o is true then 60 is an interior point of 0. Assumption 5.1.5 Let assumption 4.3.5 hold. Moreover, let for oo,
< oo.
Assumption 5.1.6 The matrices
are nonsingular for i = 0,1; j = 0,1. Assumption 5.1.7 Let assumption 4.3.7 hold and let for ii, -f(Xu9)]2 x\(7>/lOi2)f(XuO)\}
Finally, denoting fi
6i'
< ex).
we assume: Assumption 5.1.8 The matrix Q is nonsingular. Now observe that under assumptions 5.1.1 and 5.1.2, Q(0) -> Q(0) a -s- uniformly on <9, Q*(0) - • Q*(#) a.s. uniformly on <9, where Q(0) and Q(0) are defined in (4.1.9) and (4.3.3), respectively, and
Q\e) = E{[Y, -qxuoyfyfri)}. Together with assumption 5.1.3 these results now imply: Theorem 5.1.1 Under assumptions 5.1.1-5.1.3, 0 —> 6* a.s. and 6* -> 6** a.s. (cf. theorem 4.2.1). Moreover, if H o is true then clearly n _ n _ A U* — (/*• — t/Q.
Now assume that H o is true, and denote Uj = Yj-f(Xj,0o). Then it follows from assumptions 5.1.1-5.1.8, similarly to (4.2.12), that ')f(Xj,0o)] = 0,
(5.1.1)
')fl[Xj,0o)] = 0,
(5.1.2)
hence
where
Moreover, from the central limit theorem it follows
where )-lo 6 1 A o-l oo ^IO^OO
O^OnO"1 oo ^ n ^ o i
From these results it easily follows that
Theorem 5.1.2 Under H o and assumptions 5.1.1-5.1.8, y/n{0 - 0*) -+ Nm(0,&) in distr. A consistent estimator of Q can be constructed as follows. Let for
and define Q analogously to Q. Then Theorem 5.1.3 Under assumptions 5.1.1-5.1.7, Q —• Q a.s., regardless whether or not the null is true. Combining theorems 5.1.1-5.1.3 we now have Theorem 5.1.4 Under assumptions 5.1.1-5.1.8, W*—•Xm if Ho is true and W*/n -> (0*-6**yQ~ l (0*-0**) > 0 a.s. if H o is false. The latter implies, of course, that plimn^ooW* = oo. The power of the Hausman-White test depends heavily on the condition that under misspecification 0* ^ 0**, and for that the choice of the weight function w(.) is crucial. Take for example the true model Yj = Xy + X 2j + XyXzj + Uj where the X^'s, X2j's and Uj's are independent and N(0,l) distributed, and let f(x,0) = 0^1 + 02x2, w(x) = x^ + xj. Then fpd^)] 2 } = ( 1 - 0 O 2 + ( l - 0 2 ) 2 + 2
(5.1.3)
and E{[Y 1 -f(X 1 ,0)] 2 w(Xi)} = 4(1 -9xf
+ 4 ( 1 - 0 2 ) 2 + 8,
(5.1.4)
hence 0* = 0** = (1,1)'. Moreover, in this case we still have W*-+ xl in distr.,
(5.1.5)
although the model is misspecified. Thus Hausman's test is not consistent against all alternatives, a result also confirmed by Holly (1982). Remark: If under the null hypothesis H o the model is also homoskedastic, i.e., Assumption 5.1.9 Under H o , E(U2|Xj) = E(U 2 ) = a2 where Uj = Yj-f(Xj,0 o ),
a.s.,
then
where
Cf. assumption 5.1.6. Hence s2 — d i^Qj ^02^01
—^
oo *
It is easy to verify that now Q is just the difference of the asymptotic variance matrix of the weighted nonlinear least squares estimator 6 and the asymptotic variance matrix of the least squares estimator 6. Thus we have: Theorem 5.1.5. Under H o and the assumptions 5.1.1-5.1.9, y/n(6*-60) y/n(6-60) 0*6)
-> N^O^fioVfloiOol 1 ) in distr, -> Nm(0,(72Q^1) in distr., and -+ Nm(0yn^QO2Q^
-0*0^)
in distr.
Consequently, denoting
with
we have: Theorem 5.1.6 Under assumptions 5.1.1-5.1.9 the results in theorem 5.1.4 carry over. Actually, this is the form of the Hausman-White test considered by White (1981). Exercises 1. Prove (5.1.2). 2. Prove theorem 5.1.2. 3. Prove theorem 5.1.3. 4. Prove (5.1.3) and (5.1.4), using the fact that the fourth moment of a standard normally distributed random variable equals 3. 5. Prove (5.1.5).
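A quick way to see the lack of power in the example above is to simulate it. The sketch below (Python with numpy; it is ours, not part of the original text, and all variable names are illustrative) approximates $\theta^*$ and $\theta^{**}$ by unweighted and weighted least squares on one large sample: both estimates are close to $(1,1)'$, so the contrast $\hat\theta - \hat\theta^*$ on which $W^*$ is built converges to zero even though the interaction term is omitted.

```python
# Simulation sketch for the counterexample in (5.1.3)-(5.1.4): weighted and unweighted
# least squares have the same pseudo-true value (1, 1)', so the Hausman-White contrast
# has no power against this particular misspecification.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                                    # large n approximates the pseudo-true values

x1, x2, u = rng.standard_normal((3, n))
y = x1 + x2 + x1 * x2 + u                      # true model with an omitted interaction term
X = np.column_stack([x1, x2])                  # misspecified regression function theta'x
w = x1 ** 2 + x2 ** 2                          # weight function w(x) = x1^2 + x2^2

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)                          # unweighted LS
theta_star = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))    # weighted LS

print("unweighted estimate:", theta_hat)       # both are close to (1, 1)'
print("weighted estimate:  ", theta_star)
print("contrast:           ", theta_hat - theta_star)   # close to zero: no power
```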
5.2 Newey's M-test
5.2.1 Introduction
Newey (1985) argues that testing model correctness usually amounts to testing a null hypothesis of the form
$H_0\colon E[M(Y_j,X_j,\theta_0)] = 0,$
(5.2.1)
where $M(y,x,\theta) = (M_1(y,x,\theta),\ldots,M_p(y,x,\theta))'$ is a vector-valued function on $R \times R^k \times \Theta$ (with Borel measurable components). A specification test can then be based on the sample moment vector
$\hat M(\theta) = (1/n)\sum_{j=1}^{n}M(Y_j,X_j,\theta),$
(5.2.2)
where $\hat\theta$ is, under $H_0$, a consistent and asymptotically normally distributed estimator of $\theta_0$. A similar class of M-tests has been developed independently by Tauchen (1985), under weaker conditions on the function M(.) than in Newey (1985). See also Pagan and Vella (1989) for a survey of various M-tests and their application. We show now that the Hausman–White test is indeed asymptotically equivalent under $H_0$ to a particular M-test. Let
$M(y,x,\theta) = (y - f(x,\theta))\,(\partial/\partial\theta')f(x,\theta)\,w(x).$
(5.2.3)
Then the i-th component of M(0) is:
Let $\hat\theta$ be the nonlinear least squares estimator. It follows from the mean value theorem that there exists a mean value $\hat\theta^{(i)}$ satisfying $|\hat\theta^{(i)} - \theta_0| \le |\hat\theta - \theta_0|$ a.s. such that
$\sqrt{n}\,\hat M_i(\hat\theta) = \sqrt{n}\,\hat M_i(\theta_0) + [(\partial/\partial\theta)\hat M_i(\hat\theta^{(i)})]\,\sqrt{n}(\hat\theta - \theta_0).$
We leave it as an exercise (cf. exercise 1) to show that under the conditions of theorem 5.1.2,
hence plim n _ o o [ v 'n M(0) -
j + Q0l^n(6-80)]
= 0.
(5.2.4)
Comparing this result with (5.1.2) we now see that
$\mathrm{plim}_{n\to\infty}[\sqrt{n}\,\hat M(\hat\theta) - \bar Q_{01}\sqrt{n}(\hat\theta^* - \hat\theta)] = 0,$
hence the chi-square test based on ^/nM(9) is the same as the chi-square
test based on y/n(6 — 0*). This result demonstrates the asymptotic equivalence under Ho of this special case of the M-test and the Hausman-White test. Next, consider the case that H o is false. Under the conditions of theorem 5.1.4 we have (5.2.5) as is not hard to verify. Cf. exercise 2. Now assume that the function
has no local extremum at 6 = 6*. This condition is only a minor augmentation of assumption 5.1.3. Then the right-hand side of (5.2.5) is unequal to the zero vector, i.e., p l i m ^ M ^ ) ^ 0. This establishes the asymptotic power of the M-test under review. However, also the M-test is not generally watertight. It is not hard to verify that for the example at the end of section 5.1 this version of the Mtest also has low power. Another example of an M-test is the Wald test in section 4.6.1, with M(y,x,0) = r/(0). Also, Ramsey's (1969, 1970) RESET test may be considered as a special case of the M-test. 5.2.2 The conditional M-test In regression analysis, where we deal with conditional expectations, model correctness usually corresponds to a null hypothesis of the form H o : E[r(Yj,Xj,0)|Xj] = 0 a.s. if and only if 6 = 6O.
(5.2.6)
For example in the regression case considered in section 4.3 an obvious candidate for this function r is r(y,x,0) = y — f(x,0). Also, we may choose
$r(y,x,\theta) = (y - f(x,\theta))\,(\partial/\partial\theta')f(x,\theta).$ In this case we have:
(5.2.7)
$E[r(Y_j,X_j,\theta)|X_j] = [E(Y_j|X_j) - f(X_j,\theta)]\,(\partial/\partial\theta')f(X_j,\theta) = 0$ a.s.   (5.2.8)
if and only if $E(Y_j|X_j) = f(X_j,\theta_0)$ a.s. and $\theta = \theta_0$. Furthermore, observe that in the case (5.2.7) the least squares estimator $\hat\theta$ of $\theta_0$ is such that
$P[(1/n)\sum_{j=1}^{n}r(Y_j,X_j,\hat\theta) = 0] \to 1.$
(5.2.9)
Cf. (4.2.7). This is true even if the model is misspecified, provided that 0 converges in probability to an interior point of the parameter space 0.
Consequently we cannot choose M = r, for then P(M(0) = O) —> 1 anyhow and thus any test based on M(0) will have no power at all. In order to cure this problem we need to use a weight function, similarly to (5.2.3), i.e., let rj(y,x,0) be the i-th component of the vector r(y,x,0), and let Wi(x,0) be a weight function. Then Mi(y,x,0) = ri(y,x,0)wi(x,0), i= I,2,...,m.
(5.2.10)
Note that in the case (5.2.3), Wi(x,0)
= w(x), ri(y,x,0) = ( y - f (
In view of the above argument, we can now state the basic ingredients of the conditional M-test. First, let us assume that: Assumption 5.2.1 The data-generating process {(Yj,Xj)} with Yj e R, Xj G R k is i.i.d. The model is implicitly specified by the functions ri(y,x,0): Assumption 5.2.2 For i=l,2,...,m the functions ri(y,x,0) are Borel measurable real functions on R x R k x (9, where 0 is a compact Borel subset of R m . Moreover, for each (y,x) e R x Rk the functions ri(y,x,0) are continuously differentiable on 9. Let r(y,x,0) = ( ri (y,x,0),...,r m (y,x,0))\ There exists a unique interior point 90 of 0 such that
Note that the latter condition does not say anything about model correctness. For example in the case of the regression model in section 4.3 this condition merely says that the function E{[Y j -f(X j ,0)] 2 } = EUYj-EtYjlXj)]2} + E{[E(Yj|Xj)-f<Xj,0)]2} takes a unique minimum on 9 at an interior point 0O? without saying that jJ-fpCj^o)] 2 } - 0. Next, we consider a consistent estimator 9n o£90, satisfying (5.2.9): Assumption 5.2.3 Let (0n) be a sequence of random vectors in 0 such that, (i) plinin^oofln = 0O, EjLirCYj^j^n) = 0] = 1. We may think of 9n as an estimator obtained by solving an optimization problem with first-order condition
(l/n)£?=ir(Yj,Xj,0) = 0. Estimators based on such moment conditions are called Method of Moment (MM) estimators. Cf. Hansen (1982). We show now that assumption 5.2.3 implies asymptotic normality. By the mean value theorem we have
^^Y^M^YV^n-eol
(5.2.11)
where $\bar\theta_n$ is a mean value satisfying $|\bar\theta_n - \theta_0| \le |\hat\theta_n - \theta_0|$. Now assume: Assumption 5.2.4 For $i,\ell = 1,2,\ldots,m$, let $E[\sup_{\theta\in\Theta}|(\partial/\partial\theta_\ell)r_i(Y_1,X_1,\theta)|] < \infty$. Then by theorem 2.7.5,
$(1/n)\sum_{j=1}^{n}(\partial/\partial\theta')r_i(Y_j,X_j,\theta) \to E[(\partial/\partial\theta')r_i(Y_1,X_1,\theta)]$
a.s.
(5.2.12)
uniformly on (9, hence by theorem 2.6.1 and the consistency of 0n, plimn^oofl^ = 0O, p l i m n _ o o ( l / n ) 5 : j W Denoting
and r = (EIO/^^^^X^floMv^EIO/^^rmCY^X^flo)])'
(5.2.13)
we thus have plimn^r* = T.
(5.2.14)
Next assume: Assumption 5.2.5 The (m x m) matrix F is nonsingular. Then plim^r*-1 = T-1
(5.2.15)
(Cf. exercise 3), whereas by (5.2.11) and assumption 5.2.3 plim n ^ oo [(l/v / n)£ J n = ,^~ 1 r(Y j ,X j ,0 o ) + Vn(0 n -0 o )] = 0.
(5.2.16)
This result, together with (5.2.15), implies plimn^oo[(l/v'n)5:jLi/'-1r(Yj,Xj,0o) + X f l n - f l 0 ) ] = 0,
(5.2.17)
provided that $(1/\sqrt{n})\sum_{j=1}^{n}r(Y_j,X_j,\theta_0)$ converges in distribution. Cf. exercise 4. A sufficient additional condition for the latter is: Assumption 5.2.6 For $i=1,2,\ldots,m$, $E[\sup_{\theta\in\Theta}(r_i(Y_1,X_1,\theta))^2] < \infty$, as then the $(m \times m)$ variance matrix
$\Delta = E[r(Y_1,X_1,\theta_0)\,r(Y_1,X_1,\theta_0)']$
(5.2.18)
has finite elements. Since the random vectors $r(Y_j,X_j,\theta_0)$ are i.i.d. with zero mean vectors and finite variance matrix $\Delta$, it follows now from the central limit theorem that
$(1/\sqrt{n})\sum_{j=1}^{n}r(Y_j,X_j,\theta_0) \to N_m(0,\Delta)$ in distr.
(5.2.19)
Combining (5.2.17) and (5.2.19) yields: Theorem 5.2.1 Under assumptions 5.2.1–5.2.6, $\sqrt{n}(\hat\theta_n - \theta_0) \to N_m(0,\Omega)$ in distr., where
$\Omega = \Gamma^{-1}\Delta(\Gamma')^{-1}.$
Note that this result holds regardless of whether or not the underlying model is correctly specified. A similar result has been obtained by White (1980a, 1982) for misspecified linear models and maximum likelihood under misspecification. Moreover, if the underlying model is correctly specified, r is defined by (5.2.7) and $\hat\theta_n$ is the nonlinear least squares estimator, then $\Omega$ reduces to $Q_0^{-1}Q_1Q_0^{-1}$. Cf. theorem 4.3.2 and exercise 5. A consistent estimator of $\Omega$ can be obtained as follows. Let
$\hat\Gamma = (1/n)\sum_{j=1}^{n}(\partial/\partial\theta')r(Y_j,X_j,\hat\theta_n),$   (5.2.20)
$\hat\Delta = (1/n)\sum_{j=1}^{n}r(Y_j,X_j,\hat\theta_n)\,r(Y_j,X_j,\hat\theta_n)',$   (5.2.21)
$\hat\Omega = \hat\Gamma^{-1}\hat\Delta(\hat\Gamma')^{-1}.$   (5.2.22)
Then Theorem 5.2.2 Under assumptions 5.2.1-5.2.6, plinin^oo^ = Q. Proof: Exercise 6. We now come to the null hypothesis to be tested. As said before, the null hypothesis E(Yj|Xj) = f(Xj,0o) a.s. is equivalent to (5.2.6), where r is defined by (5.2.7). If H o is true then for i = l,2,...,m, E[ri(Yj,Xj,0o)wi(Xj,0o)] = EfEIr^Y^X^^IX^w^X,,^)} = 0
(5.2.23)
for all weight functions w{ for which the expectation involved is defined.
If Ho is false there exist continuous weight functions Wi for which (5.2.23) does not hold. Cf. theorem 3.1.2. Now let us specify these weight functions. Assumption 5.2.7 The weight functions Wi(x,#), i = 1,2,...,m, are Borel measurable real functions on R k x 0 such that for (i) for each x e R k , Wi(x,0) is continuously differentiate on (9; (ii) EIsup.^lr^Y^X,^)! 2 |wi(X,,0)|] < oo; (iii) E[sup Oe0 (r i (Y I ,X 1 ,0)w i (X 1 ,0)) 2 ] < oo; (iv)for f=l,2,...,m, oo;
Etsup^K^^Xr^Y^X^^w^X,^))!]
<
(v) if H o is false then E[ri(Y1,X1,0o)wi(X1,0o)] ^ 0 for at least one i. The conditions (i)-(iv) are regularity conditions. Condition (v), however, is the crux of the conditional M-test, because it determines the power of the test. It says that the random vector function M(0) = O/n)£jLiM(Yj,Xj,0)
(5.2.24)
with
$M(Y_j,X_j,\theta) = (r_1(Y_j,X_j,\theta)w_1(X_j,\theta),\ldots,r_m(Y_j,X_j,\theta)w_m(X_j,\theta))' = (M_1(Y_j,X_j,\theta),\ldots,M_m(Y_j,X_j,\theta))',$
(5.2.25)
say, has nonzero mean at $\theta = \theta_0$ if $H_0$ is false. Thus, we actually test the null hypothesis
$H_0^*\colon E[M(Y_1,X_1,\theta_0)] = 0$   (5.2.26)
against the alternative hypothesis
$H_1^*\colon E[M(Y_1,X_1,\theta_0)] \ne 0.$
(5.2.27)
However, it may occur that the choice of the weight functions is inappropriate in that $H_0^*$ holds while $H_0$ is false. The choice of the weight functions is therefore more or less a matter of guesswork, as a watertight choice would require knowledge of the true model. In the next section it will be shown how the conditional M-test can be modified into a consistent test. We are now going to construct a test statistic on the basis of the statistic $\hat M(\hat\theta_n)$. Consider its i-th component $\hat M_i(\hat\theta_n)$. By the mean value theorem we have
$\sqrt{n}\,\hat M_i(\hat\theta_n) = \sqrt{n}\,\hat M_i(\theta_0) + [(1/n)\sum_{j=1}^{n}(\partial/\partial\theta')M_i(Y_j,X_j,\hat\theta_n^{(i)})]\,\sqrt{n}(\hat\theta_n - \theta_0),$   (5.2.28)
where $|\hat\theta_n^{(i)} - \theta_0| \le |\hat\theta_n - \theta_0|$ a.s. Denoting
$A = (E[(\partial/\partial\theta')r_1(Y_1,X_1,\theta_0)\,w_1(X_1,\theta_0)],\ldots,E[(\partial/\partial\theta')r_m(Y_1,X_1,\theta_0)\,w_m(X_1,\theta_0)])'$
(5.2.29)
it is not hard to show, similarly to (5.2.17), that (5.2.28) implies
$\mathrm{plim}_{n\to\infty}[\sqrt{n}\,\hat M(\hat\theta_n) - (1/\sqrt{n})\sum_{j=1}^{n}M(Y_j,X_j,\theta_0) - A\sqrt{n}(\hat\theta_n - \theta_0)] = 0.$   (5.2.30)
Cf. exercise 7. Substituting $-(1/\sqrt{n})\sum_{j=1}^{n}\Gamma^{-1}r(Y_j,X_j,\theta_0)$ (cf. (5.2.17)) for $\sqrt{n}(\hat\theta_n - \theta_0)$, it follows from (5.2.30) and (5.2.24) that
$\mathrm{plim}_{n\to\infty}[\sqrt{n}\,\hat M(\hat\theta_n) - (1/\sqrt{n})\sum_{j=1}^{n}Z_j] = 0,$
(5.2.31)
where $Z_j = M(Y_j,X_j,\theta_0) - A\Gamma^{-1}r(Y_j,X_j,\theta_0)$. If $H_0$ is true then $E(Z_j) = 0$, and moreover it follows from assumption 5.2.7 that $E(Z_jZ_j') = \Delta_*$, where
$\Delta_* = E[M(Y_1,X_1,\theta_0)M(Y_1,X_1,\theta_0)'] - E[M(Y_1,X_1,\theta_0)r(Y_1,X_1,\theta_0)'](\Gamma')^{-1}A' - A\Gamma^{-1}E[r(Y_1,X_1,\theta_0)M(Y_1,X_1,\theta_0)'] + A\Gamma^{-1}\Delta(\Gamma')^{-1}A'$
(5.2.32)
is well-defined. By the central limit theorem and (5.2.31) we now have
$\sqrt{n}\,\hat M(\hat\theta_n) \to N_m(0,\Delta_*)$ in distr. under $H_0$.
(5.2.33)
Moreover, under $H_1^*$ we have
$\mathrm{plim}_{n\to\infty}\hat M(\hat\theta_n) = E[M(Y_1,X_1,\theta_0)] \ne 0.$
(5.2.34)
A consistent estimator of $\Delta_*$ can be obtained as follows. Let $\hat\Gamma$ and $\hat\Delta$ be defined by (5.2.20) and (5.2.21), let $\hat A$ be the sample analogue of $A$ in (5.2.29) with $\theta_0$ replaced by $\hat\theta_n$ and expectations replaced by sample averages, and let
$\hat B = (1/n)\sum_{j=1}^{n}M(Y_j,X_j,\hat\theta_n)M(Y_j,X_j,\hat\theta_n)',$
$\hat C = (1/n)\sum_{j=1}^{n}M(Y_j,X_j,\hat\theta_n)\,r(Y_j,X_j,\hat\theta_n)',$
and
$\hat\Delta_* = \hat B - \hat C(\hat\Gamma')^{-1}\hat A' - \hat A\hat\Gamma^{-1}\hat C' + \hat A\hat\Omega\hat A'.$
(5.2.35)
Then: Theorem 5.2.3 Under assumptions 5.2.1-5.2.7, pliirin^ooZl* = A*. Proof: Exercise 8.
Note that this result also holds if $H_0$ is false, although in that case $\Delta_*$ is no longer the asymptotic variance matrix of $\sqrt{n}\,\hat M(\hat\theta_n)$. Finally, assume: Assumption 5.2.8 The matrix $\Delta_*$ is nonsingular, and let
$H = n\,\hat M(\hat\theta_n)'\hat\Delta_*^{-1}\hat M(\hat\theta_n)$
be the ultimate test statistic. Then: Theorem 5.2.4 Under assumptions 5.2.1–5.2.8, (i) $H \to \chi^2_m$ in distr. if $H_0$ is true,
(ii) $\mathrm{plim}_{n\to\infty}H/n = \{E[M(Y_1,X_1,\theta_0)]\}'\Delta_*^{-1}\{E[M(Y_1,X_1,\theta_0)]\} > 0$ if $H_0$ is false.
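As an illustration, the statistic can be assembled directly from sample moments. The following sketch (Python with numpy and scipy; it is ours, not the book's code, and the function and argument names are hypothetical) builds $\hat\Gamma$, $\hat\Delta$, $\hat\Omega$, $\hat A$, $\hat B$, $\hat C$, $\hat\Delta_*$ and H from user-supplied evaluations of r, its derivatives and the weights at $\hat\theta_n$; values of H above the $\chi^2_m$ critical value are evidence against the null.

```python
# Schematic computation of the conditional M-test statistic H of theorem 5.2.4.
import numpy as np
from scipy.stats import chi2


def conditional_m_test(r, dr, w):
    """r  : (n, m) array of r_i(Y_j, X_j, theta_hat_n)
    dr : (n, m, m) array with dr[j, i, :] = (d/d theta') r_i(Y_j, X_j, theta_hat_n)
    w  : (n, m) array of weights w_i(X_j, theta_hat_n)
    Returns the statistic H and its asymptotic chi^2_m p-value."""
    n, m = r.shape
    M = r * w                                   # M_i = r_i * w_i, cf. (5.2.10)
    Mbar = M.mean(axis=0)                       # sample moment vector, cf. (5.2.24)

    Gamma = dr.mean(axis=0)                     # estimates Gamma, cf. (5.2.13)
    Delta = r.T @ r / n                         # estimates Delta, cf. (5.2.18)
    Gi = np.linalg.inv(Gamma)
    Omega = Gi @ Delta @ Gi.T                   # Gamma^{-1} Delta (Gamma')^{-1}

    A = (dr * w[:, :, None]).mean(axis=0)       # sample analogue of A, cf. (5.2.29)
    B = M.T @ M / n
    C = M.T @ r / n
    Delta_star = B - C @ Gi.T @ A.T - A @ Gi @ C.T + A @ Omega @ A.T   # cf. (5.2.35)

    H = n * Mbar @ np.linalg.solve(Delta_star, Mbar)
    return H, chi2.sf(H, df=m)                  # H -> chi^2_m in distr. under H_0
```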
Exercises
1. Prove (5.2.4).
2. Prove (5.2.5).
3. Why does (5.2.15) follow from (5.2.14)?
4. Prove (5.2.17).
5. Prove that under the conditions in section 4.3, $\Omega = Q_0^{-1}Q_1Q_0^{-1}$.
6. Prove theorem 5.2.2.
7. Prove (5.2.30).
8. Prove theorem 5.2.3. In particular, check which parts of assumption 5.2.7 have been used here.
5.3 A consistent randomized conditional M-test
As mentioned before, the power of the conditional M-test heavily depends on the choice of the weight functions. Quoting Newey (1985, p. 1054): "An important property of specification tests based on a finite set of moment conditions is that they may not be consistent. This inconsistency has been noted in particular examples by Bierens (1982) and Holly (1982) and is a pervasive phenomenon." Thus, the solution of the inconsistency problem is to use an infinite set of moment conditions. Theorem 3.3.5 suggests how to do that. Let $\psi$ be a bounded Borel measurable one-to-one mapping from $R^k$ into $R^k$, and replace $w_i(x,\theta)$ by $\exp(\xi'\psi(x))$. Then theorem 3.3.5 says that the null hypothesis is false, i.e.,
$P\{E[r(Y_j,X_j,\theta_0)|X_j] = 0\} < 1,$
if and only if $E[r(Y_j,X_j,\theta_0)\exp(\xi'\psi(X_j))] \ne 0$, except on a set S with $\mu(S) = 0$, where $\mu$ is a probability measure induced by an absolutely continuous k-variate distribution. Note that in general the complement of the set S is infinite, so that we have to check the inequality involved over an infinite (and usually also uncountable) set of $\xi$'s. Denoting
$M(y,x,\theta,\xi) = r(y,x,\theta)\exp(\xi'\psi(x))$
(5.3.1)
we now have
$E[M(Y_j,X_j,\theta_0,\xi)] \ne 0$ for all $\xi \notin S$ if $H_0$ is false,
(5.3.2)
whereas clearly
$E[M(Y_j,X_j,\theta_0,\xi)] = 0$ for all $\xi \in R^k$ if $H_0$ is true.
(5.3.3)
Next, let
$\hat M(\hat\theta_n,\xi) = (1/n)\sum_{j=1}^{n}M(Y_j,X_j,\hat\theta_n,\xi),$
(5.3.4)
$H(\xi) = n\,\hat M(\hat\theta_n,\xi)'\hat\Delta_*(\xi)^{-1}\hat M(\hat\theta_n,\xi),$
(5.3.5)
where $\hat\Delta_*(\xi)$ is defined as in (5.2.35) with $w_i(x,\theta)$ replaced by $\exp(\xi'\psi(x))$, and assume that in particular assumption 5.2.8 holds for every $\xi \in R^k\setminus\{0\}$. Then we have
$H(\xi) \to \chi^2_m$ in distr. for every $\xi \in R^k\setminus\{0\}$ if $H_0$ is true,
(5.3.6)
and, for all $\xi \in R^k\setminus S$ if $H_0$ is false,
$\mathrm{plim}_{n\to\infty}H(\xi)/n = \{E[M(Y_1,X_1,\theta_0,\xi)]\}'\Delta_*(\xi)^{-1}\{E[M(Y_1,X_1,\theta_0,\xi)]\} > 0,$   (5.3.7)
where A*(£) is defined in (5.2.31). The latter result indicates that this version of the conditional M-test is "almost surely" consistent (i.e., has asymptotic power 1 against all deviations from the null), as consistency only fails for £ in a null set S of the measure //. Also, note that we actually have imposed an infinite number of moment restrictions, namely the restrictions (5.3.3). Furthermore, observe that the exclusion of £ = 0 is essential for (5.3.6) and (5.3.7), because by assumption 5.2.3,
hence H(0) -+ 0 in pr. regardless of whether or not H o is true. Thus, the set S contains at least the origin of R k . One might argue now that the problem of how to choose the weight
function w(x,θ) has not been solved but merely been shifted to the problem of how to choose the vector $\xi$ in the weight function $\exp(\xi'\psi(x))$. Admittedly, in the present approach one still has to make a choice, but the point is that this choice will now be far less crucial for the asymptotic power of the test, for the asymptotic power will be equal to 1 for "almost" any $\xi \in R^k$, namely all $\xi$ outside a null set with respect to an absolutely continuous k-variate distribution. If one picks $\xi$ randomly from such a distribution then $\xi$ will be an admissible choice with probability 1. In fact, the asymptotic properties of the test under $H_0$ will not be affected by choosing $\xi$ randomly, whereas the asymptotic power will be 1 without worrying about the null set S: Theorem 5.3.1 Let $H(\xi)$ be the test statistic of the conditional M-test with weight functions $w_i(x,\theta) = \exp(\xi'\psi(x))$, where $\xi$ is a random drawing from an absolutely continuous k-variate distribution with density $h(\xi)$. Then
$H(\xi) \to \chi^2_m$ in distr. if $H_0$ is true   (5.3.8)
and
$\mathrm{plim}_{n\to\infty}H(\xi) = \infty$ if $H_0$ is false.   (5.3.9)
Proof: First, assume that $H_0$ is true, so that for every $\xi \in R^k\setminus\{0\}$, $H(\xi) \to \chi^2_m$ in distr. Then by theorem 2.3.6,
$E[\exp(\mathrm{i}tH(\xi))] \to (1-2\mathrm{i}t)^{-m/2} = \varphi_m(t)$
(5.3.10)
for every t e R and every fixed £ e Rk \{0}, where <^m(t) is the characteristic function of the Xm distribution. (Cf. section 2.3, exercise 3). Now let £ be a random drawing from an absolutely continuous k-variate distribution with density h(£). Then for every t e R, E[exp(itH(O)] - jE[exp(i-tH(O)]h(Od£ (5.3-11) by bounded convergence (cf. theorem 2.2.2). Theorem 2.3.6 says that this result implies that H(£) —• xL m distr. Second, assume that H o is false. Then there exists a null set S of the distribution of £ such that for every £ e R k \S,
Hence
$\mathrm{plim}_{n\to\infty}H(\xi)/n = \{E[M(Y_1,X_1,\theta_0,\xi)]\}'\Delta_*(\xi)^{-1}\{E[M(Y_1,X_1,\theta_0,\xi)]\} = T(\xi),$   (5.3.12)
say, where
$T(\xi) > 0$ if $\xi \in R^k\setminus S.$   (5.3.13)
Again using theorem 2.3.6 we see that
$H(\xi)/n \to T(\xi)$ in distr.
(5.3.14)
and since S is a null set of the distribution of $\xi$ we have
$P[T(\xi) > 0] = 1.$
(5.3.15)
It is not hard to show now that (5.3.14) and (5.3.15) imply (5.3.9). Q.E.D. Remark: The randomization of the test parameter £ involved can be avoided by using the more complicated approach in Bierens (1991a). The present approach has been choosen for its simplicity. Next, we have to deal with a practical problem regarding the choice of the bounded Borel measurable mapping ip. Suppose for example that we would have chosen
This mapping is clearly admissible. However, if the components Xy- of Xj are large then tg~ 1(Xy) will be close to the upperbound Viir, hence exp(£Y(Xj)) will be almost constant, i.e.,
and consequently M(0n,O Since the mean between the square brackets equals the zero vector with probability converging to 1 (cf. assumption 5.2.3), M(0n,£) will be close to the zero vector and consequently H(£) will be close to zero. This will obviously destroy the power of the test. A cure for this problem is to standardize the Xj's in ^(Xj). Thus let Xi be the sample mean of the Xy's, let Si be the sample standard deviation of the Xy's (i = 1,2,...,k), and let Zj = (tg- 1 [(X l j -X 1 )/S 1 ],...,tg- 1 [(X k j -X k )/S k ])'.
(5.3.16)
Then the proposed weight function is $\exp(\xi'\hat Z_j)$.
(5.3.17)
It can be shown (cf. exercise 3) that using this weight function is asymptotically equivalent to using the weight function exptf'Zj)
(5.3.18)
where
)1
(5.3.19)
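For concreteness, the standardized weight in (5.3.16)–(5.3.17) and a random draw of $\xi$ can be computed as follows. This is a minimal Python sketch of ours; the choice of N(0, I_k) for the distribution of $\xi$ is merely one admissible absolutely continuous distribution.

```python
# Standardized, bounded arctan transform of the regressors and the randomized weight
# exp(xi' Z_hat_j) used by the randomized conditional M-test.
import numpy as np


def randomized_weights(X, rng):
    """X: (n, k) regressor matrix.  Returns the n weights exp(xi' Z_hat_j) and xi."""
    Z = np.arctan((X - X.mean(axis=0)) / X.std(axis=0, ddof=1))   # hat Z_j of (5.3.16)
    xi = rng.standard_normal(X.shape[1])                          # random test parameter
    return np.exp(Z @ xi), xi


rng = np.random.default_rng(123)
X = rng.standard_normal((500, 3)) * 50.0   # large regressors: unstandardized arctan would saturate
w, xi = randomized_weights(X, rng)
```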
Exercises 1. Check the conditions in section 5.2.2 for the weight function exp(£'?/j(x)), and in particular verify that only assumption 5.2.8 is of concern. 2. Show that (5.3.12) holds uniformly on any compact subset of R k. 3. Verify that using the weight function (5.3.17) is asymptotically equivalent to using the weight function (5.3.18), provided that for i=l,2
k,
= E(Xn) and plrni^^S, = v/var(Xn) > 0. 5.4 The integrated M-test An alternative to plugging a random £ into the test statistic H(£) defined in (5.3.5) is to take a weighted integral of H(£), say
$\hat b = \int H(\xi)h(\xi)\,d\xi,$
(5.4.1)
where h(£) is a k-variate density function. This is the approach in Bierens (1982). This idea seems attractive because under F^ it is possible to draw a £ for which the function T(O = plim n _ >oo H(0/n (cf.(5.3.12)) is close to zero, despite the fact that P[T(0 > 0] = 1. In that case the small sample power of the test may be rather poor. By integrating over a sufficient large domain of T(£) we will likely cover the areas for which T(£) is high, hence we may expect that a test based on b will in general have higher small sample power than the test in section 5.3. A disadvantage of this approach, however, is firstly that the limiting distribution of b under H o is of an unknown type, and secondly that calculating b can be quite laborious. It will be shown that under H o the test statistic b is asymptotically equivalent to an integral of the form
$\hat b_* = \int\bigl[(1/\sqrt{n})\sum_{j=1}^{n}Z_j(\xi)\bigr]'\bigl[(1/\sqrt{n})\sum_{j=1}^{n}Z_j(\xi)\bigr]h(\xi)\,d\xi,$   (5.4.2)
provided $h(\xi)$ vanishes outside a compact set, where the $Z_j(\xi)$'s are for each $\xi \in R^k\setminus\{0\}$ independent random vectors in $R^m$ with zero mean vector and unit variance matrix:
$E[Z_j(\xi)] = 0$, $E[Z_j(\xi)Z_j(\xi)'] = I_m.$
(5.4.3)
Although $[(1/\sqrt{n})\sum_{j=1}^{n}Z_j(\xi)]'[(1/\sqrt{n})\sum_{j=1}^{n}Z_j(\xi)] \to \chi^2_m$ in distr. for each $\xi \in R^k\setminus\{0\}$, this result does not imply that $\hat b_* \to \chi^2_m$ in distr. On the other hand, the first moment of $\hat b_*$ equals m, hence by Chebishev's inequality
$P(\hat b_* > m/\varepsilon) \le E[\hat b_*]/(m/\varepsilon) = \varepsilon$
(5.4.4)
for every e > 0. Since under H o , plim n _ oo (b — b*) = 0, we may conclude that for every e > 0, > m/e) < e under H o .
(5.4.5)
Moreover, if H o is false then plinin^oob = oo, hence limn-.ooPCb > m/e) = 1 under Hj.
(5.4.6)
These results suggest to use m/e as a critical value for testing H o at the e x 100 per cent significance level, i.e., reject H o if b > m/e and accept H o if b < m/e. Admittedly, the actual type I error will be (much) smaller than e, because Chebishev's inequality is not very sharp an inequality, but this is the price we have to pay for possible gains of small sample power. The problem regarding the calculation of the integral (5.4.1) can be solved by drawing a sample {£i,.--,£Nn} of size N n (N n —• oo as n —• oo) from h(£) and to use b = (l/Nn)E£iH(6)
(5-4.7)
instead of b. This will be asymptotically equivalent, i.e., plimn^ooCb-b) = 0 under H o
(5.4.8)
and > 0 under H,.
(5.4.9)
108
Tests for model misspeciflcation
Now let us turn to the proof of the asymptotic equivalence of b and b under Ho. Observe that similarly to (5.2.28)
J
i
J
J
^
n
-
0
o
)
,
(5.4.10)
where Mi(#,0 is the i-th component of M(0,£) defined in (5.3.4) and 9n(O is a mean value satisfying \0{n\Q-9o\ < \en-60\ a.s., for all £ e Rk.
(5.4.11)
k
Let E be a compact subset of R . Under the conditions of theorem 5.3.1 we have j M i C Y j . X j . e , © - E[(W)Mi(Yi,Xi,0,Q]
(5.4.12)
a.s., uniformly o n 0 x H . Cf. theorem 2.7.5. Denoting ai(0,0 - E p / c ^ M i t Y ^ X ! , ^ ) ]
(5.4.13)
we thus have
oo|<|0n_0o||ai((9*,O-ai((9o,OI = 0.
(5.4.14)
where the last step follows from the continuity of ai(#,£) on the compact set @ xE (hence aj(0,£) is uniformly continuous on 0 x E), and the consistency of 6n. Consequently, denoting A(O = (cf. (5.2.29)), we have (5.4.16) Next, let Cj(O
= M(Yj,Xj,0o,O + A ^ r - ' ^ Y j ^ j ^ o ) .
(5.4.17)
Then it follows from (5.2.17) and (5.4.16) that p l i m n ^ s u p ^ s | y n M ( 0 n , O -(l/>/n)EjLiCj«)l = 0.
(5.4.18)
Moreover, we have shown in section 5.2 that under H o, E[Cj(O] = 0, E[CJ(£)CJ(O'] = A*(O-
(5.4.19)
Furthermore, it is not hard to show that the consistent estimator A*(£) of A*(£) is also uniformly consistent on H, and thus (cf. exercise 1)
The integrated M-test
109
A*(Q~l —• ^ * ( O - 1 *n Pr-> uniformly on 5 .
(5.4.20)
Denoting Zj(O = A.(Q~\(O
(5.4.21)
it is now not too hard to show (exercise 2) that (5.4.22) Moreover, we leave it to the reader (exercise 3) to show that under Hi, = 0.
(5.4.23)
Cf. (5.3.12). Summarizing, we now have shown: Theorem 5.4.1 Let the conditions of theorem 5.3.1 hold and let h(£) be a k-variate density vanishing outside a compact subset E ofRk. (i) Under H o we have plirrin^o^b — b*) = 0, where E(b*) m.
=
(ii) Under H! we have p l i n i n ^ b / n = jT(Oh(Od£ > 0. (iii) Replacing b by b defined in (5.4.7), the above results go through. Exercises 1. Prove 2. Prove 3. Prove 4. Prove
(5.4.20). (5.4.22). (5.4.23). part iii of theorem 5.4.1.
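A minimal sketch of the feasible integrated test (5.4.7) with the conservative critical value m/ε follows. It is ours, not the book's code; the callable h_stat is a hypothetical user-supplied function returning H(ξ) of (5.3.5) for the sample at hand, and the uniform density on [-c, c]^k is just one possible compactly supported choice of h.

```python
# Feasible integrated M-test: average H(xi) over draws from h and reject H_0 at
# (at most) level eps when the average exceeds the Chebishev bound m/eps.
import numpy as np


def integrated_m_test(h_stat, k, m, eps=0.05, N=200, c=1.0, seed=0):
    rng = np.random.default_rng(seed)
    xis = rng.uniform(-c, c, size=(N, k))            # draws from h = uniform on [-c, c]^k
    b_tilde = np.mean([h_stat(xi) for xi in xis])    # the statistic in (5.4.7)
    critical_value = m / eps                         # conservative: actual size <= eps
    return b_tilde, critical_value, b_tilde > critical_value
```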
6
Conditioning and dependence
Time series models usually aim to represent, directly or indirectly, the conditional expectation of a time series variable relative to the entire past of the time series process involved. The reason is that this conditional expectation is the best forecasting scheme; best in the sense that it yields forecasts with minimal mean square forecast error. The concept of a conditional expectation relative to a one-sided infinite sequence of "past" variables cannot be made clear on the basis of the elementary notion of conditional expectation known from intermediate mathematical-statistical textbooks. Even the more general approach in chapter 3 is not suitable. What we need here is the concept of a conditional expectation relative to a Borel field. We shall discuss this concept and its consequences (in particular martingale theory) in section 6.1. In section 6.2 we consider various measures of dependence as some, though rather weak, conditions have to be imposed on the dependence of a time series process to prove weak (uniform) laws of large numbers. These weak laws are the topics of sections 6.3 and 6.4. Throughout we assume that the reader is familiar with the basic elements of linear time series analysis, say on the level of Harvey's (1981) textbook. 6.1 Conditional expectations relative to a Borel field 6.1.1 Definition and basic properties
In section 3.1 we have defined the conditional expectation of a random variable Y relative to a random vector $X \in R^k$ as a Borel measurable real function g on $R^k$ such that $E\{[Y - g(X)]\psi(X)\} = 0$ for all bounded Borel measurable real functions $\psi$ on $R^k$. Approximating $\psi$ by simple functions (cf. definition 1.3.2 and theorem 1.3.5) it follows that $g(X) = E(Y|X)$ a.s. if and only if
$E\{[Y - g(X)]\,I(X \in B)\} = 0$
(6.1.1)
Conditional expectations relative to a Borel field
111
for all Borel sets B in Rk, where I(.) is the indicator function. Now let {&,g,P} be the probability space on which (Y,X) is defined and let g x be the Borel field generated by X, i.e. g x is the collection of all sets of the type {co G Q: x(co) G B}, B G 23k. Cf. theorem 1.1.1. Then condition (6.1.1) is equivalent to J^[yM-g(x(co))]P(dco) = 0 for all A G gx- This justifies the notation E(Y|X) = E(Y|g x ). Note that g x is a Borel field contained in g: gxCg. Following this line we can define conditional expectations relative to an arbitrary sub-Borel field © of g without reference to the random variable or vector which generates ©: Definition 6.1.1 Let Y be a random variable defined on the probability space {(2,g,P} satisfying E(|Y|) < oo, and let © be a Borel field contained in g. Then the conditional expectation of Y relative to © is a random variable Z defined on {0,©,P} such that for every A G ©, Jyl(y(co)-z(Q)))P(da)) = 0. The existence of Z = E(Y|©) is guaranteed by the condition E(|Y|) < oo, i.e. if E(Y) exists, so does E(Y|©). Cf. Chung (1974, theorem 6.1.1.) Moreover, the uniqueness of Z follows from the Radon-Nikodym theorem (cf. Royden [1968, p.238]), i.e. if both Zx and Z 2 satisfy the conditions in definition 6.1.1 then P(Zi = Z2) = 1. We then say that Zx and Z 2 belong to the same equivalence class. Thus, E(Y|©) is almost surely unique. Next, we define the Borel field generated by a one-sided infinite sequence (Xt), t = 1,2,..., of random vectors in Rk. First, let g n be the Borel field generated by X1?...,Xn, i.e. g n is the collection of sets {co G Q: X^OJ) G Bj,...,x n (co) G B n },
where Bj,...,Bn are arbitrary Borel sets in Rk. Now (g n ) is an increasing sequence of Borel fields: 5nCgn+1
(6.1.2)
If U n >i5n is a Borel field itself, it is just the Borel field generated by (Xt), t = 1,2,... However, it is in general not a Borel field, for
Ate\Jn>l%n9t=l,29... does not imply
Therefore we define the Borel field g ^ generated by (Xt), t = 1,2,.., as the minimum Borel field containing U n >ig n . Definition 6.1.2 Let (X t ), t=l,2,..., be a one-sided infinite sequence of random vectors defined on a common probability space {£2,g,P}, and let g n be the Borel field generated by X b ...,X n . Then the Borel field generated by (X t ), t = 1,2,... is the minimum Borel field containing U n >ig n , which is denoted by goo = V n >,g n . Consequently, for every Y defined on (£2,g,P) satisfying E(|Y|) < oo we have E(Y|X 1 ,X 2 ,X 3 ,...])= E(Y|goo). The properties of conditional expectations relative to a Borel field are quite similar to those in section 3.2. In particular, theorem 3.2.1 now reads as: Theorem 6.1.1 Let Y and Z be random variables defined on a probability space {£,g,P} satisfying E(|Y|) < oo, E(|Z|) < oo. Let © and § be Borel fields satisfying © c § C g. We have: (i)
E[E(Y|S)|(S] = E(Y|<5) = E[E(Y|©)|$];
(ii)
(a) E[E(Y|®)] = E(Y); (b) E(Y|g) = Y a.s.;
(iii) U = Y - E ( Y | © ) => E(U|©) = 0 a.s.; (iv) E(Y + Z|©) = E(Y|©) + E(Z|©) a.s.; (v)
Y < Z => E(Y|©) < E(Z|©) a.s.;
(vi)
|E(Y|©)|<E[|Y||©]a.s.;
(vii) If X is defined on {0,©,P} and E(|X|) < oo then E(X-Y|©) = X.E(Y|(5); (viii) If every random variable defined on {£2,©,P} is independent of Y, i.e., Ax e 8fY, ^2 € © ^> P(>4i H yl2) = P(/li)P(yl2), where g Y is the Borel field generated by Y, then E(Y|©) = E(Y).
Proof: Exercise 2. Note that part (viii) of theorem 3.2.1 does not apply anymore. The Borel field generated by F(X) is contained in the Borel field generated by X, so that the first result of theorem 3.2.1 (viii) is just part (i) of theorem 6.1.1. Moreover, if F is a one-to-one mapping the aforementioned Borel fields are the same, and so are the corresponding conditional expectations. 6.1.2 Martingales A fundamental concept in the theory of stochastic processes is the martingale concept. This concept is particularly important in regression analysis, as the errors of a properly specified time series regression model are, by construction, martingale differences and martingale differences obey central limit theorems similar to those obeyed by independent random variables. Consider a sequence (Xj) of independent random variables defined on a common probability space, satisfying E(|Xj|) < oc, E(Xj) = 0, and let
Moreover, let 5n be the Borel field generated by (X lv ..,X n ). Then
EOgSn-O = E(Xn + Y^ISn.O = E(Xn|gn_0 + EOfn^lgn-l) = Yn_ia.s.,
(6.1.3)
for E(X n |5n-i) = 0 by independence and the condition E(X n ) = 0 (cf. theorem 6.1.1 (viii)]) whereas by theorem 6.1.1 (iib) and the fact that Y n is defined on {G,g n ,P}, we have E(Yn_i|5n-i) E(|Y n |) < oo
=
(6.1.4)
Yn_! a.s. Furthermore, observe that (6.1.5)
and g n D 3fn-i
(6.1.6)
The properties (6.1.3) through (6.1.6) are just the defining conditions for a martingale: Definition 6.1.3 Let (Yn) be a sequence of random variables satisfying (a) E(|Y n |) < oc (b) Y n is defined on {O,g n ,P}, (c) g n c 5n +1 and (d) Y n = E(Y n +11 g n ) a.s. Then Y n is called a martingale. Next, let Un = Y n - ¥ „ _ ! .
Then by (d), E(Un|8fn-i) = Oa.s., whereas conditions (a), (b), and (c) in definition 6.1.3 hold with Y n replaced by U n . Such a sequence (U n ) is called a martingale difference sequence. Conversely, let (U n ) be such that E(|U n |) < oo, U n is defined on {O,g n ,P}, with g n c g n + I, and E(U n |g n _O = Oa.s.
(6.1.7)
Can we construct a martingale (Y n ) such that TJ ^n
= v 1
n
—Y x
i? n—1 •
In general the answer is no. The only possible candidate for Y n is but in general E(Y n ) is not defined. For example, if the U n 's are independent random drawings from the standard normal distribution then E(|Y n |) = oo. However, for every one-sided infinite sequence (U n ), n = 1,2,..., satisfying (6.1.7) we can define a martingale (Yn) such that U n = Y n — Y n _i for n > 1, namely Y
n = EjLiUj for n > 1, Y n = 0 for n < 1.
(6.1.8)
Therefore we restrict the martingale difference concept to one-sided infinite sequences: Definition 6.1.4 Let (U n ), n=l,2,..., be a sequence of random variables such that for n > 1 (a) E(|U n |) < oo (b) U n is defined on {0,5n,P} (c) g n C 3fn+1, 5o = {0,0} (d) E(U n |g n _!) = 0 a.s. Then (U n ), n = 1,2,..., is called a martingale difference sequence. Note that for n = 1 condition (d) follows from the definition of go> which is called a trivial Borel field, and the condition E(Ui) = 0, for E(U!|go) - E(U0
(6.1.9)
6.1.3 Martingale convergence theorems Given a random variable X defined on {£2,g,P}, with E(|X|) < oo, a martingale (Yn) can be constructed as follows: let (g n ) be any increasing sequence of Borel fields contained in g and let Y n = E(X|g n ) Then by theorem 6.1.1(vi), |Y n | = | E ( X | 5 n ) | < E ( | X | | g n ) ,
(6.1.10)
Measures of dependence
115
hence by theorem 6.1.l(iia),
E(|Yn|) < E(|X|) < oo Moreover, by definition 6.1.1, Y n is defined on {(2,g n ,P}. Finally, E(Y n + 1 |g n ) = E[E(X|g n + 1 )|g n ] - E(X|g n ) = Y n by theorem 6.1.1(i). Thus (Yn) is a martingale. This construction is important because it enables us to prove the existence of
on the basis of the following dominated convergence theorem for martingales. Theorem 6.1.2 Let (Yn) be a martingale satisfying sup n E(|Y n |) < oo. Then Y n converges a.s. to an a.s. finite limit Y ^ Proof: Chung (1974, theorem 9.4.4). Thus the martingale (Yn) defined in (6.1.10) satisfies: lim n ^ oo E(X|5 n ) = Y ^ a.s., with I Y J < oo a.s. Moreover, we can identify Y ^ as a conditional expectation of X: Theorem 6.1.3 Let X be a random variable defined on {£2,g,P}, satisfying E(|X|) < oo, and let 5n be an increasing sequence of Borel fields contained in JJ. Then linv^ECXIgn) = E(X|g oo ), where 5oo = V n >ig n . Proof: Chung (1974, theorem 9.4.8). A direct consequence of theorem 6.1.3 and definition 6.1.2 is: Theorem 6.1.4 Let X be defined as in theorem 6.1.3 and let (Z t ), t = 1,2,..., be a sequence of random variables or vectors defined on {ft,g,P lim n _ oo E(X|Z 1 ,Z 2 ,...,Z n ) = E(X|Z,,Z 2 ,...)a.s. Another application of theorem 6.1.3 we need later on is the following: Theorem 6.1.5 Let the conditions of theorem 6.1.4 hold. Let Z< be the random vector consisting of the components of Z t truncated to £ decimal digits. Then ^ Z ^ , . . . ) = E(X|Z 1 ,Z 2 ,...)a.s.
Proof. Let ©^ be the Borel field generated by (z[% t = 1,2,... and let © ^ = V^>i©^. Since for each £, ©^ c 5oo, we have S ^ c 5oo- Moreover, since for each £, ©^ C ®e+\, it follows from theorem 6.1.3 that
^zf,...) = The theorem can now be proved by showing gooC©^,
(6.1.11)
as then g ^ = S ^ . To prove (6.1.11), assume first that the Z t (= zt(co)) are scalar random variables defined on a common probability space {£2,g,P}. Next, prove, following the line of the proof of theorem 1.3.2, that for arbitrary £ € R, {co e Q: zt(co) < £} = {oj e Q: limsup^oozf \co) < Q £ © ^ and conclude from this (cf. theorem 1.1.1) that for every Borel set B t in R, {o e Q: zt(co) e BJ e S ^ . See further, exercise 4.
Q.E.D.
The importance of theorem 6.1.5 is that the sequence (Zj') is rationalvalued, hence countably-valued. Cf. Royden (1968, proposition 6, p. 21). This property will enable us to identify E(X|Zi,Z 2,...) in terms of unconditional expectations, similarly to theorem 3.3.5. See Bierens (1987a, 1988a) and chapter 7. 6.1.4 A martingale difference central limit theorem The martingale difference central limit theorem we consider here is an extension of theorem 2.4.3 to double arrays (X nj ) of martingale differences, based on the following theorem of McLeish (1974). Theorem 6.1.6 Let (Xnj) be a martingale difference array, i.e. E(X nJ |X nJ _ 1 ,...,X n , 1 ) = 0 for j > 2, E(Xn>1) = 0, satisfying (a)supn>iE[(maXj
(c)£JiiX2j->linpr., where kn —> oo as n —» oo. Then ££,X nJ ->N(0,l)indistr.
Proof. McLeish (1974, theorem 2.3) Using this theorem it is not too hard to prove the following generalization of theorem 2.4.3. Theorem 6.1.7 Let (X nj) be a double array of martingale differences. If plimn^ooCl/n^jLiX^ = lim n ^ oo (l/n)5:]LiE(X2 j ) = a2 G (0,oo) (6.1.12) and Hmn^ooEjLiEdXnj/Vnl^ 5 ) = 0 for some S > 0
(6.1.13)
then (l/Vn)£jLiX n ,j — N(0,(T 2) in distr. Proof: Let
Verify that (Ynj) satisfies the conditions of theorem 6.1.6 Cf. exercise 5. Q.E.D. Remark: Note that condition (6.1.13) is implied by condition sup n (l/n)X]jLiE(|X n j|) 2 + 5 < oo for some S > 0.
(6.1.14)
Exercises 1. Prove that (6.1.1) for arbitrary Borel sets B in R k implies E(Y|X) = g(X) a.s. 2. 3. 4. 5.
Prove Prove Prove Prove
theorem 6.1.1. (6.1.9). (6.1.11). theorem 6.1.7.
6.2 Measures of dependence 6.2.1 Mixingales To prove laws of large numbers for sequences of dependent random variables we need some conditions on the extent of dependence of the random variables involved. For example, let (X t) be a sequence of dependent random variables with E (X 2) < oo for each t. Then var[(l/n)£f = ,X t )]= (l/n 2 )£?=ivar(X t ) 2
m ,X t ).
(6.2.1)
118
Conditioning and dependence
If = M < oo
(6.2.2)
then the first term in (6.2.1) is O(l/n). So for showing that (6.2.1) converges to zero, and hence by Chebishev's inequality, plimn_oo(l/n)E[LiPCt - E ( X t ) ] = 0, we need a condition that ensures that the second term in (6.2.1) is o(l). So let us have a closer look at the covariances in (6.2.1). First we introduce some notation, which will be maintained throughout this section. For — oo < n < m < oo, gjf is the Borel field generated by X n ,X n + 1? X n+2 ,...,X m . Now by theorem 6.1.1(ii,vii) and Cauchy-Schwarz inequality, |cov(X t+m ,X t )| = |E({[X t+m -E(X t + m )][(X t = |E{[E(Xt+m|afL00) -E(X t + m )][X t < (E{[E(X t+m|5t_oo) -E(X t+m )] 2 }) 1/2 (var(X t )) l/2 .
(6.2.3)
Denoting //(m) = s u p ^ E ^ X ^ ^ )
-E(X t + m )] 2 ) 1 / 2
(6.2.4)
it follows now from (6.2.1)-(6.2.3) and Liapounov's inequality 1/
var[(l/n)£? =1 X t ] < (l/n)M
(l/n)E^ir/(m) - 0
(6.2.5)
if \imm_oor](m) = 0.
(6.2.6)
Condition (6.2.6), together with (6.2.2), is therefore a sufficient condition for a weak law of large numbers. Sequences satisfying condition (6.2.6) are called mixingales (cf. McLeish [1975]). Definition 6.2.1. Let (Xt) be a sequence of random variables defined on a common probability space, satisfying E(X2) < oo for each t > 1. Then (Xt) is called a mixingale if there exists nonnegative constants c t, ^ m , with i/jm —• 0 as m —> oo, such that for all t > l , m > 0 , oo) -E(X t+m )] 2 }) 1/2 < Thus, condition (6.2.6) now holds if supt>ict < oo, as then r](m) =
Examples of m ix ing ales (1) Independent sequences. If the X t 's are independent then by theorem m |g
t
_ 00 ) = E(Xt + m ) f o r m > 1,
hence tpm = 0 for m > 1. For m = 0 we have ECXJSLOO) = x t (cf. theorem 6.1.1(iib)), hence we may choose ct = (var(X t )])' /2 ^o = 1. (2) Finite dependent sequences. The sequence (Xt) is finite dependent if for some non-negative integer £, Xt + ^, Xt + ^+i, ... are independent of X t , Xt_ i, Xt_2, ... Then for m > £, E(Xt + m |^ t _ oo ) = E(Xt + m ), hence ipm = 0 for m > L We may now choose i/jm = 1 for m < £ ct = max0<m<€(E{[E(Xt + m |g t _ oo ) - E ( X t + m )] 2 )' /a (3) AR(1) processes. Let X t = pXt_, + U t where \p\ < 1 and (Ut) is an independent process with E(U t ) = 0, E(Uf) = a2 < oo. Then Since X t , Xt_i, Xt.2, ... can be constructed from U t , Ut_i, U t . 2 , ... and vice versa, the Borel field g ^ is also the Borel field generated by the latter sequence. We now have t m |af _ 00 )
=
andE(X t + m ) = 0. Thus {E{[E(Xt + m\^J-E(Xi
So we may choose C, = < T , V m =
\p
2
+ m)]
})^ =
(Z^^
(4) ARMA(p,q) processes. Let Xt = p,X t _, + p 2 X t _ 2 4- ... + p p X t _ p + U t -T q U t . q ,
(6.2.7)
where (Ut) is an independent sequence with E(Ut) = 0, E(U?) = a1 < oc, and the lag polynomials p(L) = 1 -PlL
- p 2 L - ... - p p L p
(6.2.8)
p
T(L) = 1 - T , L - T 2 L - ... - i q L (6.2.9) have roots all outside the complex unit circle. Then Xt has a movingaverage representation: Xt - E j ^ j U t - j
(6-2.10)
where the /?j's are exponentially decreasing: |/3j| < c y for some c , > 0 , p G (0,1),
(6.2.11)
and U t has a moving average representation: Ut = E j ^ j X t - j .
(6.2.12)
Cf. Harvey (1981). Therefore, again, g'.oo is also the Borel field generated by Ut,Ut.i,Ut_2,..., and so E(Xt + m |g t _ 00 ) = E(X t + m ) = 0. Consequently,
So we may take ct = a, ^m = (E~m/ 3 f)' /l ^ c.|p| m /^(l - p 2 ) . (5.2.2 Uniform and strong mixing A disadvantage of the mixingale concept is that it does not necessarily carry over to functions of mixingales. However, it is desirable to work with conditions that are invariant under Borel measurable transformations. We consider here two of these invariant conditions, namely the
uniform (or ip-) mixing condition and the strong (or a-) mixing condition. Cf. Iosifescu and Theodorescu (1969) and Rosenblatt (1956a). Definition 6.2.2 Let (Xt) be a sequence of random variables or vectors defined on a common probability space. Denote for m > 0, = suptsupAGg«_oo>BGgocmip(A)>0|P(B|A) a(m) = sup tsupAeg«_oofB€g~in|P(AnB) -P(A)P(B)| If limm^oo^^) = 0 then (X t) is called a uniform (or
(6.2.13)
where (ipt) is an arbitrary sequence of conformable Borel measurable functions. Moreover, a (/> or a-mixing process is a mixingale, provided a certain moment condition holds. The following theorem points out the relation between the mixing and the mixingale concepts. Theorem 6.2.1. Let (Xt) be a sequence of random vectors satisfying [E(|X t | r )] 1/r < oo for some r > 1, r < oo. Then for 1 < p < r,
Proof. Serfling (1968) for the ^-mixing case and McLeish (1975, lemma 2.1) for the a-mixing case. Thus, if ct = [E(|X t | r )] 1/r < oo for some r > 2
(6.2.14)
then the mixingale coefficient ipm equals ipm = 2(p(m) 1 ~ 1/r , V^m = 2(1 + V2)a(m) 1 / 2 - 1 / r ,
(6.2.15)
respectively. 6.2.3 ^-stability A disadvantage of the mixing conditions is that they are hard to verify. The traditional time series models, in particular ARMA(p,q) and AR(p) models, however, can be represented by moving averages of mixing variables, provided the errors are assumed to be independent. For example, a standard ARMA(p,q) model may be considered as an AR(p) model with q-dependent hence (/> and a-mixing, errors, and if the AR lag polynomial (6.2.8) has roots all outside the unit circle the process can be represented by an infinite moving average of these q-dependent errors. A natural extension of these kinds of process is therefore to consider processes of the type X t = f t (U t ,U t _ l9 U t _ 2 ,...),
(6.2.16)
where (Ut) is a mixing process and ft is a mapping from the space of onesided infinite sequences in the range of the U t 's such that the right-hand side of (6.2.16) is a properly defined random variable or vector. In order to limit the memory of this process (Xt) we need a condition which ensures that the impact of remote U t _ m 's on X t vanishes: Definition 6.2.3 Let (Ut) be a sequence of random vectors in Re and let Xt e Rk be defined by (6.2.16). For some r > 1 and all t > 1, let E(|X t | r ) < oo, and denote v(m) = supt>1{E[|E(Xt|©[_m) -X t | r ]} 1 / r , where ©|_m is the Borel field generated by U t , U t _i,..., U t _ m . If limm^ooi^m) = 0, then (Xt) is a v-stable process in L r with respect to the base (U t ). Cf. McLeish (1975) and Bierens (1983). This concept is reminiscent of the stochastic stability concept of Bierens (1981), which is somewhat weaker
a condition. See also Potscher and Prucha (1991) for a discussion of the relationship between u-stability and Gallant and White's (1988) near epoch dependency concept. The actual content of this definition is not in the first place the condition that t>(m) —> 0, for this is always satisfied if the processes (X t) and (Ut) are strictly stationary, as in that case we may delete "sup t >i" and take for t some arbitrary fixed index, say t = 1: v(m) = {E|E(X 1 |©j_ m )-X 1 | r ]} 1 / r From theorem 6.1.3 it then follows l i m n w o o E ( X , | © | _ J = E(X,|© 1 . oo )a.s. where ©loc = V m >0®|- m is the Borel Field generated by Ui, U o , U _ ! v . . Since X! is a function of this one-sided infinite sequence, the Borel field generated by X\ is contained in ©L^, by which E(XI|©L00) = X,a.s. Cf. theorem 6.1.1(iib). Thus, lim m _ oo E(X 1 |®!_ m ) = X, a.s. and consequently by dominated convergence (theorem 2.2.2), u(m) —• 0 as m —• oo. Summarizing, we have shown: Theorem 6.2.2 Let (U t) be a strictly stationary stochastic process in R/ and let for each t Xt = f(U t ,U t -i,U t -2,...) be a well-defined random vector in R k, where f does not depend on t. If E(|X t | r ) < oo for some r > 1 then (Xt) is u-stable in Lr with respect to the base (U t ). Therefore, the actual content of definition 6.2.3 is that this result goes through if the X t 's and/or U t 's are heterogeneously distributed (that means that the distributions of U t and/or Xt depend on t). In other words, the u-stability condition actually imposes restrictions on the extent of heterogeneity of (Xt) and (U t ). In checking the ^-stability condition the following result is useful.
Theorem 6.2.3 Let X t be of the form (6.2.16). If there exist Borel measurable mappings ftm such that for some r > 1, E|X t | r < oo, t > 1, i/(m) = sup t >,[E(|f t , m (U t M-i,...M-m) -X t | r })] 1 / r —> 0 as m —> oo, then (Xt) is D-stable in Lr with respect to the base (U t ), where K m ) < 2u*(m). Proof: Observe that E(X t |©;_J = E(X t |U t ,U t -i,..,U t _ m ) is a Borel measurable function of the conditioning variables. By Jensen's inequality for conditional expectations (theorem 3.2.6) it follows that |E(X t |U t ,U t -lv..,U t -m)-ft,m(U t v ..,U t -m)i r < E[|Xt -f t , m (U t ,..,U t - m )| r |U t ,..,U t _ m ] hence <E(|X t -f t , m (U t ,...,U t _ m )| r ) Using Minkowski's inequality (cf. section 1.4) it now follows that: {E[|E(X t |©[_ m )-X t | r ]} 1/r < 2{E[|X t -f t , m (U t ,...,U t _ m )|] r } 1/r .
Q.E.D.
It should be noted that the u-stability concept is only meaningful if we make further assumptions on the dependence of the base (U t ), for every sequence (Xt) is u-stable with respect to itself. In the next section we shall impose mixing conditions on (U t ), in order to prove various laws of large numbers. The ^-stability concept, like the mixingale concept, is not generally invariant under Borel measurable transformations. However, an invariance property holds under bounded uniformly continuous transformation, and this is just what we need to generalize the weak law of large numbers in sections 2.5 and 2.7 to u-stable processes in L with respect to a mixing base. Thus, let (Xt) obey the conditions in definition 6.2.3 and let ip be a bounded uniformly continuous function on the domain of the X t 's. For notational convenience, let for m > 0
Xim) = E(X t |©l_J and let for a > 0 C(a) = sup| Xl _x 2 |
(6.2.17)
Then for arbitrary q > 0 E ( | ^ ( X | m ) ) - < / > ( X t ) | ) " < e ( a ) q + 2sup x |tA(x)| q P(|X< m) - X t | > a) < C ( a ) q + 2sup x |V(x)| q E(|xi m ) - X t | r ) / a r < C(a)q + 2sup x |V(x)| q o(m)7a r , where the second inequality follows from Chebishev's inequality. Then sup^.tEdVCX* 1 "*) - V ( X t ) | q ) ] 1 / q <
V2v\m),
where v\m)
= 2.inf a > 0 (C(a) q 4- 2sup x |^(X 1 )| q u(m)7a r ) 1 / q
(6.2.18)
Using the fact that ((a) —> 0 for a j 0, it is easy to show v (m) —> 0 for m —> oo. It follows now from theorem 6.2.3 that ip(Xt) is v (m)-stable in L q with respect to the base (U t ): Theorem 6.2.4 Let (X t ) be u-stable in L r with respect to the base (U t ). Let ip be a bounded uniformly continuous function on the domain of the X t 's. Let q > 0 be arbitrary. Then ip(Xt) is v*stable in L q with respect to the base (U t ), where u* is defined by (6.2.17) and (6.2.18). 6.3 Weak laws of large numbers for dependent random variables
In section 6.2 we have already derived a weak law of large numbers for mixingales. We shall now generalize this result to processes that are testable in L1 with respect to a ip- or a-mixing base. Let (Xt) be a sequence of bounded random variables, i.e., for some M < oo and all t, P(|X t | < M) = 1,
(6.3.1)
with the structure (6.2.16). Define the Borel fields
as in section 6.2. Moreover, let the base (U t ) be (/^-mixing or a-mixing and denote Xt(m) = E(Xt|(5{_m) = gt,m(U t ,U t _ l5 ...,U t -m),
(63.2)
m
say. Note that X{ ^ is bounded too by M. We shall now derive an upper bound of |cov(Xt + ^,Xt)| in terms of u, tp and/or a. First observe that by (6.3.1) and the definition of ^-stability, |cov(Xt + €, Xt) -cov(Xt(™], X t )|< 2.E[|Xt + , -Xt(™]||Xt -E(X t )|] < 4M-u(m), uniformly in L Next observe that for fixed m, and a-mixing (U t )
(6.3.3)
126 x(m)
_E(x(m)}
= g t m (U t ,U t _ l v ..,U t -m)
- E ( x [ m ) ) is a*-mixing (6.3.4)
with a*(m*) = 1 if m* < m, a*(m*) = a(m* — m) if m* > m.
(6.3.5)
Similarly, if (Ut) is (^-mixing the process (6.3.3) is
(6.3.6)
Thus by theorem 6.2.1 with p = 1 and r = oo it follows that
^ l J
(6 3 7)
,)|] < 6Ma(£).
' '
Hence, similarly to (6.2.3), we have |cov(X t ( ^,X t )| < 2M2(p(£-m),
|cov(X t ( ^,X t )| < 6M2a(£-m),
(6.3.8)
where ?(m) = a(m) = 1 if m < 0. Combining (6.3.3) and (6.3.8) now yields: Lemma 6.3.1 Let (Xt) be u-stable in L k with respect to a base (Ut) and let ip and a be the mixing coefficients corresponding to (U t ). Moreover, let for some M < oo and all t, P(|X t | < M) = 1. Then for £> l , m > 0 m) + 4M-u(m), where ip(£-m) = mm(2(f(£-m),6a(£-m)) if £-m > 0, ip(£-m) = 1 i f £ - m < 0 . Using this result in (6.2.1) now yields: < M2/n + 8u(m)M + 2 n ~ 1 2 Letting m -^ oo with n at rate o(n), we now see that the following lemma holds Lemma 6.3.2 Let the bounded stochastic process (X t) be u-stable in L1 with respect to the base (U t). Assume either (a) (Ut) is (p-mixing with X^So^CO < oo, or (b) (Ut) is a-mixing with Xl£o a W
<
°°-
iXJ = 0. We are now ready to generalize theorem 2.5.2 to u-stable processes in L1 with respect to a mixing base.
Theorem 6.3.1 Let (X t ) be an R e v a l u e d o-stable process in L 1 with respect to a
) < oo for some 5 > 0
and either
^ X t ) = JV«dG(x) Proof: We prove the theorem for the case X t G R. The proof for the case X t G Rk is almost the same. We show that (2.5.7) in the proof of theorem 2.5.2 goes through. Thus, define similarly to (2.5.1) ^a(x) = V<x) if |^(x)| < a, ^ a (x) = a if ip(x) > a, ^a(x) = - a if^(x) < - a ,
(6.3.9)
and let for b > 0, ^ab(x) = ^ a (x) if |x| < b ,
(6.3.10)
^ab(x) = ^a(b-x/|x|)if|x| > b,
(6.3.10)
Then ^ab(x) is bounded and uniformly continuous, for the set {x G Rk: |x| < b} is a closed and bounded subset of R k and hence compact, whereas continuous functions on a compact set are uniformly continuous on that set. The proof of the uniform continuity outside this set is left as an exercise (exercise 1). It follows from theorem 6.2.4 that ^ab(^t) is ^-stable in L1 with respect to the mixing base involved. Thus it follows from lemma 6.3.1 that [ ( / £ [ i ]
= 0,
hence by Chebishev's inequality, plim n ^ 0O (l/n)i;?=i{V'ab(Xt)-E[^ ab (X t )]} = 0. Since
- (1 /n)E?= i ti >b)] -F,(b) + Ft(-b))
we now have Xt)]} = 0.
Thus (2.5.7) in the proof of theorem 2.5.2 goes through. Since the rest of the proof of theorem 2.5.2 does not employ the independence condition, this result is sufficient for theorem 6.3.1 to hold. Q.E.D. Refering in the proof of theorem 2.7.2 to theorem 6.3.1 rather than to theorem 2.5.2, it immediately follows: Theorem 6.3.2 Let (Xt) be an Rk-valued u-stable process in L1 with respect to a ip- or a-mixing base. Let F t be the distribution function of Xt and let f(x,0) be a continuous function on Rk x (9, where 0 is a compact Borel set in Rm. If (l/ n )E?=i^t —• G, properly, pointwise, supn^^l/nJ^iEIsup^elfCXt^)!^ 5 ] < oo for some S > 0, and either
£ ^ ( ^ < o o or ££,,<*(*)
plim n ^(l/n)£[LiV>a(X t ) = Efe(X,)]. Now by (6.3.9) and theorem 1.4.1 Xt)|] ,)| > a)] -> 0 as a -+ oo, for P(|^(Xi)| > a) -> 0 as a -> oo, as otherwise E[|^(X0|] = oo. Similarly lEft^XOl-EtyaCX,)]! - 0 as a -> oo. The theorem under review now follows from the argument in the proof of theorem 2.7.2. Q.E.D. Again, referring to theorem 6.3.3 instead of theorem 2.7.2 (with 8 = 0), theorem 2.7.5 carries over. Theorem 6.3.4 Let (Xt) be as in theorem 6.3.3. Let f(x,0) be a Borel measurable real function on R k x (9, where 0 is a compact Borel set in R m , such that for each x e R k , f(x,#) is continuous on 0. If in addition
^ l f ^ ) ! ] < oo then plim n _ o o sup, G 6 ) |(l/n)E^ 1 f(X t ,0)-E[f(X 1 ^)]| = 0, where E[f(X!,#)] is continuous on 0. For generalizing theorem 6.3.1 to Borel measurable functions I/J we need the following lemma. Lemma 6.3.3 Let ip be a Borel measurable real function on R k and let X be a random vector in Rk such that for some r > 1, E[|^(X)|r ] < oo. For every e > 0 there exists a function ip£ on Rk that is bounded, continuous and zero outside a compact set (hence uniformly continuous), such that e
Proof. Dunford and Schwartz (1957, p. 298). A direct corollary of this lemma is: Lemma 6.3.4 Let (Xt) be a sequence of random vectors in R k and let (F t ) be the sequence of corresponding distribution functions. Assume (l/n)5^" =1 F t —> G properly setwise.
Let ip be a Borel measurable real function on R k such that 1
^ ] < oo for some S > 0.
For every e > 0 there exists a uniformly continuous bounded function ip£ on R k such that i^CXt)!] < e Proof. Exercise 2 (Hint: Combine lemma 6.3.3 with theorem 2.5.5.) Combining lemma 6.3.4 with theorem 6.3.2 yields: Theorem 6.3.5 Let the conditions of theorem 6.3.1 be satisfied, except that now tp is Borel measurable and (l/n)^£ = 1 F t —• G properly, setwise. Then the conclusion of theorem 6.3.1 carries over. Proof: Exercise 3. Finally, referring in the proof of theorem 2.7.4 to theorem 6.3.5 rather than to theorem 2.5.6, yields: Theorem 6.3.6. Let the conditions of theorem 6.3.2 be satisfied, except that now 0/ n )Za=iFt —• G properly, setwise, f(x,0) is Borel measurable on R k x <9 and for each x e R k a continuous function on 0. Then the conclusions of theorem 6.3.2 carry over. Remark: It seems possible to generalize the results in this section further to strong laws, using theorem 3.1 of McLeish (1975). This, however, will require further conditions on the rate of convergence to zero of v, ip, and a. Exercises 1. Prove that the function (6.3.10) is uniformly continuous. 2. Prove lemma 6.3.4. 3. Prove theorem 6.3.5. 6.4 Proper heterogeneity and uniform laws for functions of infinitely many random variables For some time series models, such as ARMA and ARMAX models least squares parameter estimation is conducted by minimizing a sum of functions of a one-sided infinite sequence of random variables. See
chapter 7. In order to prove consistency of these parameter estimators under data heterogeneity we need an extension of the proper convergence concept to distributions of one-sided infinite sequences of random variables (cf. definitions 2.3.1 and 2.5.1), as well as some generalizations of the uniform laws in section 6.3. Definition 6.4.1 Let (Xt) be a sequence of random variables in R k , and let F t m be the distribution function of (X t ,...,X t _ m ). The process (Xt) is said to be pointwise (setwise) properly heterogenous if there exists a one-sided infinite sequence (X*), t < 0, of Revalued random variables such that for m = 0,1,2,..., (l/n^JLjF^m —> H m properly pointwise (setwise), where H m is the distribution function of (Xo,X_ l v ..,Xl m ). The sequence (X*), t < 0 will be called the mean process. The following theorem now specializes theorem 6.3.2. This theorem is the basis for the consistency results of linear ARM AX models, in chapter 8. Theorem 6.4.1 Let (Xt) be a stochastic process in R k which is testable in L1 with respect to an a- or (^-mixing base, where either
and pointwise properly heterogenous with mean process (X*). Let (}>j,i(0)), j > 0, i=l,2,...,p, be sequences of continuous mappings from a compact subset 0 of a Euclidean space into R k , such that for i = l,2,...,p, ^•2osup0e6>|7j5i(#)| < oo.
(6.4.1)
Let ^ be a differentiate real function on R p such that for c —• oo, su
P|f|
=
O(c0>
(6.4.2)
where fx > 0 is such that for some S > 0, sup t E(|X t | 1+/x + <5) < oo.
(6.4.3)
Denote
Then
= 0.
Moreover, the limit function E[^(][^ o rj(0)'X!_j)] is a continuous real function on Q. Proof. Denote Pj = maXi=i,...,pSup0G0|yj,i(0)|.
(6.4.4)
Then condition (6.4.1) implies
Moreover, denote for non-negative integers s,
We shall now prove theorem 6.4.1 in four steps, each stated in a lemma. Lemma 6.4.1 There exists a constant K such that for every n > 1 and every s > 0,
Lemma 6.4.2 For every fixed s > 0,
-j)]l = 0. Lemma 6.4.3 There exists a constant K such that for every s
< KLemma 6.4.4 The function E[^(^j2orj(0)'X!_j)] is continuous on 0. Realizing that (6.4.5) implies l i m m _ > o o ^ m + 1 p j - 0, the theorem under review now easily follows from these four lemmas. Proof of lemma 6.4.1: Observe from (6.4.4) and (6.4.6) that for # E (9,
Moreover, by the mean value theorem there exists a mean value ^t,s [0,1] such that
P
+
i
:
j
^
i
(6.4.7)
J
where the last inequality follows from (6.4.4). According to condition (6.4.2) there exists a constant C such that = Ca",
(6.4.8)
hence by (6.4.7) and Holder's and Liapounov's inequalities
say. This proves the lemma.
Q.E.D.
Proof of lemma 6.4.2: Lemma 6.4.2 follows straightforwardly from theorem 6.3.2. Proof of lemma 6.4.3: From lemma 6.4.1 and the conditions of theorem 6.4.1 it follows that hence E(|Xl| 1 + ")< oo. The lemma now follows similarly to lemma 6.4.1.
Q.E.D.
Proof of lemma 6.4.4: Let 6X e 0 and 62 € 0. Similarly to (6.4.7) (with (6.4.8)) we have
ijiy. Thus similarly to (6.4.9) it follows from (6.4.10) that
(6.4.10)
Since the 7ji(0)'s are continuous on (9, this result proves the lemma. Q.E.D. The next theorems are easy extensions of theorem 6.4.1. They will enable us to prove consistency of least squares parameter estimators of nonlinear ARM AX models. Cf. chapter 8. Theorem 6.4.2 Let <9, ip, \i, (Xt), and (X*) be as in theorem 6.4.1. Let the functions yij(0,x), j = 0,1,2,..., i = l,2,...,p, be continuous real functions on 0 x R k , such that maxi=1>..,psup0e6>|7j,i((9,x)| < pjb(x),
(6.4.11)
where E£oPj < oo
(6-4-12)
and b (x) is a non-negative continuous real function on R k such that for some 5 > 0, sup t E(b(X t ) 1 + " + <5) < oo.
(6.4.13)
Finally, let rj(fl,x) = (yj,1(0,x)v..,yjiP(e,x))'.
(6.4.14)
Then
Moreover, the limit function E[^(^^oi~j(0,X!_j))] is continuous on 0. Proof. Replacing |X t _j| and |X^j| in the proofs of lemmas 6.4.1-6.4.4 by b (Xt_j) and b (Xlj), respectively, the theorem easily follows. Q.E.D. Theorem 6.4.3 Let the conditions of theorem 6.4.2 be satisfied, except that (Xt) is now setwise properly heterogenous and the functions yy(0,x), j = 0,1,2,..., i=l,2,...,p, are for each x e R k continuous real functions on 0 and for each 6 e 0 Borel measurable real functions on R k . Then the conclusions of theorem 6.4.2 carry over. Proof: Note that the function b (.) may now be merely Borel measurable. The proof of this theorem is similar to the proof of theorem 6.4.2, referring to theorem 6.3.6 instead of theorem 6.3.2. Q.E.D.
7
Functional specification of time series models
7.1 Introduction Consider a vector time series process (Zt) in Rk with E(|Z t|) < oc for each t. In time series regression analysis we are interested in modeling and estimating the conditional expectation of Z t relative to its entire past. The reason for our interest in this conditional expectation is that it represents the best forecasting scheme for Z t; best in the sense that the mean square forecast error is minimal. To see this, compare this forecast, i.e., Z t = ECZtlZt^Zt-*...),
(7.1.1)
with an alternative forecasting scheme, say Z t = G t (Z t _ 1 ,Z t _ 2 ,...),
(7.1.2)
where G t is a Revalued (non-random) function on the space of all onesided infinite sequences in Rk such that Z t is a well-defined random vector. Denote the forecast errors by Ut = Z t - Z t ,
(7.1.3)
and Wt = Zt-Zt,
(7.1.4)
respectively. Then E(WtWt') = E(UtUt') + E[(Z t -Zt)(Zt-Z t )'],
(7.1.5)
due to the fact that by (7.1.1) and (7.1.3) E(U t |Z t _,,Z t _ 2 ,...) = Oa.s.
(7.1.6)
Thus the mean square error matrix E(WtWt') of the alternative forecasting scheme dominates E(U t U t ') by a positive semi-definite matrix. 135
136 7.2
7.2.7
Functional specification of time series models Linear time series regression models
The Wold decomposition
The central problem in time series modeling is to find a suitable functional specification of the conditional expectation (7.1.1). Often the model is specified directly or indirectly as a linear AR(oo) model. This linear AR(oo) specification can be motivated on the basis of the famous Wold (1954) decomposition theorem, together with the assumption that the process Z t is stationary and Gaussian, and some regularity conditions. Here we present Wold's theorem for univariate Gaussian time series processes. Theorem 7.2.1 (Wold decomposition). Let (Z t) be a univariate stationary Gaussian time series process satisfying o 2 = var(U t ) > 0, where U t is defined by (7.1.3), and let for s ^ 0, ys = E(ZtUt_s/c72) (note that y0 = 1). Then E^o^s V~^OO
<
TT s
°°
t ~ 2-/s=0' ^t — s
anc l
*
TIT
•" "t?
where the process (Wt) is such that E (WjUt) = 0 for all j and t and (Wt) is deterministic, i.e., W t is a (possibly infinite) linear combination of W t _i, W t _ 2 , ... without error. Moreover, (U t ) is an independent Gaussian process.
(7.2.1)
Proof. Let {O,g,P} be the probability space. From the definition of y s it easily follows that E[(Zt — E L o ^ U t - s ) 2 ] = E(Z 2 — ELo^s)' n e n c e ESo^s < °°- Let (5j_n be the Borel field generated by U t ,...U t _ n and denote Y t n = Es=oysUt _ s - Then for fixed t, Yt>n is a martingale defined on {£2,($|_n,P}, with sup n E(|Y t9n |) < v i E ^ o ? 2 ] < °°- Therefore, it follows from theorem 6.1.2 that l i m ^ ooYt>n = Yt?oo a.s., where Yt)OO = E^o^sUt-s- Consequently, E ^ o ^ U t - s exists a.s.
(7.2.2)
Next, denote W t = Z t — ^ ^ o y s U t _ s - F o r j > 0 we have E(W t _jU t ) = E(U t Z t _j) - E ^ o y s E ( U t U t + j _ s ) = E[E(U t |Z t _ 1 ,Z t _ 2 ,...)Z t _ j ] - E ^ s E [ E ( U t | Z t _ 1 , Z t _ 2 , . . . ) U t _ j _ s ] = 0, whereas for j > 0, by definition of ys,
(7.2.3)
Linear time series regression models
E(Wt+jUt) = E(Z t+j U t ) -Es"o = E(Zt+jUt) - o \ = 0.
137
(7.2.4)
The proof that Wt is deterministic is quite difficult, and therefore we refer for it to Anderson (1971, p. 421). Since Zt is stationary and Gaussian, E(Zt|Zt_1,...,Zt_m) is a linear function of Z t _ 1 ,...,Z t _ m , i.e., there exist constants /3i,m,...,/3m,m n o t depending on t, such that E(Zt|Zt_l9...,Zt_m) = £™ AmZt-j
(7.2.5)
(cf. exercise 1). Defining Ut,m = Z t -E(Z t |Z t __ lv ..,Z t _ m ),
(7.2.6)
it follows that (Ut,m,Zt_i,...,Zt_m) is (m+ l)-variate normally distributed with Ut,m independent of Z t _i,...,Z t _ m . Since by theorem 6.1.4, lirrim^ooUt^ = Ut a.s.,
(7.2.7)
it follows that Ut is independent of (Zt_i,...,Zt_m) for every m > 1, Ut ~ N(0,d2) and Ut is a linear combination of Z t ,Z t _ l5 Zt_2,..- With these hints, (7.2.1) is not hard to prove (cf. exercise 2). Q.E.D. Next, assume that the lag polynomial y(L) = X ^ o ) ^ is invertible: /?(L) = y(L)- 1 = £s~o/?sLs,
(7.2.8)
say. See Anderson (1971, pp. 423-424) for precise conditions under which (7.2.8) holds. If Zt is stationary and Gaussian with E(Zt) = 0 and the deterministic part Wt is a.s. zero then £s~oftZ t -s = Ut,
(7.2.9)
hence, since /30=l, Zt = E S i ( - A ) Z t _ s + Ut,
(7.2.10)
which is an AR(oo) model. In practice one often assumes that the lag polynomial y(L) is rational, i.e. y(L) = 9(L)/a(L),
(7.2.11)
where 0(L) = l - E s = 1 ^ L s
(7.2.12)
a(L) = l-Es=i«» L S (7-2-13) are finite-order lag polynomials with no common roots, and all roots
138
Functional specification of time series models
outside the unit circle. If Z t is a zero-mean stationary Gaussian process with Wt = 0 a.s. then a(L)Z t = 0(L)Ut,
(7.2.14)
which is an ARMA(p,q) model. 7.2.2 Linear vector time series models Similar results also hold for vector time series processes. If (Zt) is a kvariate zero mean stationary Gaussian process then under some regularity conditions we have Zt = £ £ o W _
(7.2.15)
s
where Fs = [E(ZtUt_s')][E(UtUt')]~1. Again assuming that the matrixvalued lag polynomial £^sLs
(r0 - I)
(7.2.16)
is invertible with inverse B(L) = T(L)- 1 = E^o B sL S
(B0 = I)
(7.2.17)
the model becomes a VAR(oo) model: A = E ^ i ( - B s ) Z t _ s + Ut.
(7.2.18)
If T(L) is rational, i.e. T(L) = A O L r ^ L )
(7.2.19)
with A(L) = I-£P = 1 A S L S
(7.2.20)
0(L) = I - E s = i ^ s L s ,
(7.2.21)
then (under some regularity conditions), the model becomes a VARMA(p,q) model: A(L)Zt = 0(L)U t
(7.2.22)
with A(0) - 0(0) = I. VARMA models, however, may be considered as systems of ARM AX models. This is obvious if <9(L) is diagonal, but it is also true if not each equation in (7.2.22) can be written as an ARMAX model. To see this, observe that
ARMA memory index models
139
where C(L) is the matrix of co-factors of <9(L). Multiplying both sides of (7.2.22) by C(L) then yields C(L)A(L)Zt = (det <9(L))Ut.
(7.2.23)
The polynomial matrix ^(L) = C(L)A(L) consists of finite-order lag polynomials, and also
Zi,, + ElLiEfi^i.ioZi.t-j = U,,t + T^L^u-i
(7-2.24)
Denoting
we can now write (7.2.24) in ARMAX form as Y, = Efi"jY.-j + EfiA'Xt-j + Vt + Ej.yjV.-j
(7.2.25)
Finally, for the case q* = 0 we get an ARX-model. Exercises 1. Prove (7.2.5). 2. Prove (7.2.1). 7.3
ARMA memory index models
7.3.1 Introduction The linearity of the time series models discussed in section 7.2 is due to the assumption of normality of the time series involved. Normality, however, is by no means a necessity for time series. So the question arises what can be said about the functional form of the conditional expectation (7.1.1) if the process (Zt) is non-Gaussian. In this section we discuss the ARMA memory index modeling approach of Bierens (1988a,b). This approach exploits the fact that all time series are rational-valued. One could consider the rationality condition as an assumption, but in practice one cannot deal with irrational numbers, hence time series are always reported in a finite number of decimal digits and consequently time series are rationalvalued by nature. Thus, the rationality condition is an indisputable fact rather than an assumption. In this section it will be shown that in conditioning a k-variate rational-valued time series process on its entire past it is possible to capture the information contained in the past of the process by a single random variable. This random variable, containing all relevant information about the past of the process involved, can be
140
Functional specification of time series models
formed as an autoregressive moving average of past observations. Hence the conditional expectation involved then takes the form of a nonlinear function of an autoregressive moving average of past observations. In particular, for univariate rational-valued time series processes (Zt) it will be shown that there exist uncountably many real numbers T G (—1,1) such that ECZtlZt^Zt^Z^,...) = E(Z t Ej2iT j - 1 Z t _j) a.s.
(7.3.1)
Moreover, if Zt is k-variate rational-valued there exist uncountably many T G ( - 1,1) and 6 e Rk such that for i= l,...,k, E(Zi,t|Zt_1,Zt_2,Zt_3,...) = E(Z u |^iT j " 1 0'Z t _ j ) a.s.
(7.3.2)
where Zi?t is the i-th component of Zt. This result is not specific for the geometric weighting scheme involved. More generally, it will be shown that there exist uncountably many sets of rational lag polynomials where V'jj (L) and I/J\ \h) are finite-order lag polynomials, such that for i=l,2,...,k, E(Z i , t |Z t _ 1 ,Z t _ 2 ,Z t _ 3 ,..) = E(Zi,tE]Li^id(L)Zj,t_1) a.s.
(7.3.3)
Since a conditional expectation can be written as a Borel measurable function of the conditioning variable, the result (7.3.3) implies that for each permissible lag polynomial V'ij(L) there exists a Borel measurable real function fi>t such that E(Zi,t|Zt_1,Zt_2,Zt_3,...) = fift(E]Li^ij(L)Zjft>1) a.s.
(7.3.4)
Denoting
J ^ j
)
M
2 )
- 1 ,
(7.3.5)
we see that the conditioning variable in (7.3.3) can be written in ARM A form:
(7.3.6)
Consequently, specifying the data-generating process as an ARMA process is equivalent to specifying the response functions fit, for a particular set of rational lag polynomials /0i,j(L), as time-invariant linear functions. Moreover, in the multivariate case one may interpret the conditioning variable £j>t as a one-step ahead forecast with an almost arbitrary linear ARMAX model for Zi?t. This can be seen if one replaces fi)t in (7.3.6) by Z i t —V it, where (Vi?t) is the error process. The X-vector involved then consists of all components of Z t except Zit. Thus, the best
ARM A memory index models
141
one-step-ahead forecasting scheme is a Borel measurable real function of a one-step-ahead forecast with an almost arbitrary linear ARMAX model. Specifying the equations in the VARMA model (7.2.22) as ARMAX models is therefore equivalent to specifying the corresponding functions f[t in (7.3.4) as linear time-invariant functions. Furthermore, all the nonlinearity of the conditional expectation function (7.3.4) is now captured by the nonlinearity of the functions fit, and the impact of heterogeneity of the process (Z t) on the conditional expectation involved is captured by the time dependence of fi>t. As the conditioning variable (7.3.5) carries the memory of the process, plays a similar role as the index in the index modeling approach of Sargent and Sims (1977) and Sims (1981), and can be written in ARMA(X) form, we have called our approach autoregressive moving average (ARMA) memory index modeling and the index (7.3.5) will be called the ARM A memory index. 7.3.2 Finite conditioning of univariate rational-valued time series Let (Zt) be a Q-valued stochastic process, where Q is the set of rational numbers. If E(|Z t |) < oo for every t,
(7.3.7)
then E(Z t |Z t _ l v ..,Z t _ m ) exists for any integers t and m > 1 (see chapter 3). Now our aim is to show that the conditioning variables Zt_i,...,Z t _ m in this conditional expectation may be replaced by Y^L\^~X^t-] f° r some real numbers T, provided m is finite. Suppose there exists a Borel measurable one-to-one mapping
(7.3.8)
for each t, provided (7.3.7) holds. Thus we see that by using such a oneto-one mapping
(7.3.9)
where w = (w,,...,w m )' e Q m , T e R. Moreover, let for w(1) e Q m , w(2) € Qm ,
(7.3.10)
142
Functional specification of time series models
R m (w(1),w(2)) = {T G R: <*>m(w(1)|i) = <MW ( 2 ) |T)}. (1)
(7.3.11)
(2)
In other words, R m(w ,w ) is the set of real roots of the ( m - l)-order polynomial <*>m(w(1)|i) -<2>m(w(2)|i) = E j ^ i C w / ^ - w / V 1 . (1)
(7.3.12)
(2)
It is well known that if w ^ w , so that for at least one j , wj ' ^ wj \ the number of real roots of the polynomial (7.3.12) does not exceed m— 1. Thus if w (1) / w(2) then Rm(w(1),w(2)) is a finite set of size less than or equal to m—1. Since Q is a countable set (see Royden [1968, proposition 6, p. 21]) and since the union of a countable collection of countable sets is countable itself (Royden [1968, proposition 7, p.21]), we now obviously have that the set S m - URm(w(1),w(2))
(7.3.13) m
is countable, where the union is over all Q -valued unequal w Thus for T G R\S m we have that
(1)
and w(2).
w(1) G Q m , w(2) <E Q m , <£m(w(1)|T) =
(7.3.14)
and vice versa. This proves that for T G R\S m the function (Pm(w|x) is a one-to-one mapping from Q m to a subset of R. Taking S = U~ =1 S m ,
(7.3.15)
which is a countable union of countable sets and therefore countable itself, we now see that the following theorem holds. Theorem 7.3.1 Let (Zt) be a Q-valued stochastic process satisfying E(|Z t |) < oc for all t. Then there exists a countable subset S of R such that T G R\S implies E(Z t |Z t _ 1 ,...,Z t _ m ) = E(Z t |Ej^iZ t _ J T j - 1 ) a.s.
(7.3.16)
f o r m = 1,2,3,... and t - . . . , - 2 , - 1 , 0 , 1 , 2 , 3 , . . . Remark: Note that this result carries over for processes (Z t) in any countable subset of R, as we have only used the countability of Q for proving theorem 7.3.1. Thus, the theorem remains valid if the Z t are Borel measurable transformations of Q-valued random variables, for countability is always preserved. 7.3.3 Infinite conditioning of univariate rational-valued time series In this subsection we shall set forth conditions such that (7.3.1) holds for each t and each T(—1,1)\S, where S is the same as in theorem 7.3.1. Intuitively we feel that (7.3.1) requires the following condition:
ARM A memory index models
143
The process (Zt) is such that for every t and every T G (—1,1), Y^\Zi-f]~X converges a.s. (7.3.17) As has been shown in Bierens (1988a), this condition is implied by the following assumption: Assumption 7.3.1 Let sup t E(|Z t |) < oo. Now if condition (7.3.17) holds then (Z t _i,Z t _ 2 ,...) and ( Z ^ i Z t - j ^ ~ 1 , Z t _ m _ i , Z t _ m _ 2 , . . . ) generate the same Borel field, because both sequences can then be constructed from ( E j = i ^ t V ~ 1 A - m - i A - m - 2 v ) a n d vice versa. Since this conclusion holds for every m > 1, it follows that under the conditions of theorem 7.3.1 and condition (7.3.17) or assumption 7.3.1, re(— 1,1)\S implies E(Z t |Z t _ 1 ,Z t _ 2 ,...) = E(Z t |X: j ^ 1 Z t _ J T J - 1 ,Z t _ m _ 1 ,Z t _ m _ 2 ,...)a.s. for every m > 1 and t, hence for every t, E(Z t |Z t _ b Z t _ 2 ,...) = l i m m _ 0 0 E ( Z t E J - 1 Z t _ J i J - 1 , Z t _ m _ 1 , Z t _ m _ 2 , . . . ) a.s.
(7.3.18)
For showing (7.3.1) we now need additional conditions ensuring that the impact of Z t _ m _ 1 , Z t _ m _ 2 , . . . on the conditional expectation at the right-hand side of (7.3.18) vanishes as m —» oo. In Bierens (1988a,b) we have shown that u-stability with respect to an a-mixing base, together with some regularity conditions, will do. However, the proof involved is rather complicated. Therefore we impose here the following extension of the mixingale condition. Assumption 7.3.2
Let g 1 ^ be the Borel field generated by
Zt>Z t _i,Z t _ 2 ,Z t _3,... and let g^° be the Borel field generated by
Let ^ < oo be an arbitrary integer and let W t be an arbitrary random variable defined on gj 5 ^ satisfying E(W^) < oo. Moreover, let {©Loo} a n ^ (§-oo) ^ e arbitrary sequences of Borel fields such that © ^ c g 1 ^ and S 1 , ^ c S 1 ^ . For every t and every m > 0 there exist constants c t , ^ m , with ^ m —• 0 as m —> oo, such that ^
V S ^ - 1 ) -E(W t |© t r o L)] 2 }' /2 <
c^m.
This assumption is stated more generally than needed here. We will need its full extent in chapter 8.
144
Functional specification of time series models
Admittedly, assumption 7.3.2 looks quite complicated. However, it simply states that the impact of the remote past of Z t , where the remote past involved is represented by i o 1 ^ " 1 , vanishes as m —• oo In the case (7.3.18), Wt = Z t , g 1 . ^ = g 1 ^ and © ^ is the Borel field generated by ^•^1 Z t _jT j ~ 1 , hence assumption 7.3.2 and Chebishev's inequality imply plimm^ooE(Zt|^1Zt_jTj"1,Zt_m_1,Zt_m_2,...) From theorem 2.1.6 it now follows that there exists a subsequence such that giZt-jT j " 1 ) a.s. as £ -+ oo, hence the limit (7.3.18) must be equal to the latter conditional expectation as well. Thus we have: Theorem 7.3.2 Let the conditions of theorem 7.3.1 and assumptions 7.3.1 and 7.3.2 be satisfied. There exists a countable subset S of R such that T G ( - 1,1)\S implies E(Z t |Z t _ 1? Z t _ 2 ,...) = fort=...,-2,-1,0,1,2,... 7.3.4
The multivariate case
If Z t G Q k we may proceed in the same way as before, hence theorems 7.3.1 and 7.3.2 carry over to rational-valued vector time series processes. However, the ARMA memory index Z ^ i A - j ^ " 1 is then multivariate too. We can get a scalar ARMA memory index by using the concept of a linear separator introduced in Bierens and Hartog (1988): Definition 7.3.1 A vector 6 is a linear separator of a countable subset E of R k if for all pairs (x1?x2) G E x S, 6'x\ = 0'x2 implies Xi = x2. The existence of a linear separator of a countable set is always guaranteed. In fact the set of all linear separators is uncountable, as is illustrated by the following theorem. Theorem 7.3.3 Let E be a countable subset of R k and let S be the set of all linear separators of E. Then the set R k \S has Lebesgue measure zero.
ARM A memory index models
145
Proof: Let C = {(xi,x2): X! eE,x2e
S, x} ^ x2}
and let T(xl9x2) = {6>eR k :6T(xi-x 2 ) = 0}. For X! / x2 the set T(x b x 2 ) is of lower dimension than k, hence T(xl5x2) has Lebesgue measure zero. Now R k \S = U(Xi,X2)GCT(x1?x2) is a countable union of sets with Lebesgue measure zero and therefore a set with Lebesgue measure zero itself. Q.E.D. From definition 7.3.1 it follows that for any linear separator 8 of E each point in the range of 9\ (x e E) can uniquely be associated to a point in the domain 3, and vice versa. Thus #'x is a one-to-one mapping from E into R. Thus, let 6eRk be a linear separator of the countable set Q k . Then it follows from theorem 3.2.l(viii): E(Z t |Z t _ 1 ,Z t _ 2 ,...) = E(Z t |0'Z t _ 1 ,0'Z t _ 2 v ..)a.s. and moreover the process (0'Zt) is still countable-valued. Applying theorem 7.3.2 we now conclude that for each linear separator 9 of Q k there exists a countable subset Se of R such that for each T(— E(Z t |Z t _ 1? Z t _ 2 ,...) = E f Z t l E j ^ ' Z t - ^ - 1 ) a.s.
(7.3.19)
k
We recall that the set (90 of vectors 6 e R that are not linear separators of Q k has Lebesgue measure zero; see theorem 7.3.3. This result, together with the countability of Se for each linear separator 6, imply that there exists a set N c R k + 1 with Lebesgue measure zero such that (7.3.19) holds for (0,T) G Rk x ( - 1 , 1 )\N. To see this, draw x from the uniform [-1,1] distribution, draw the components of 6 independently from say the uniform [a,b] distribution and let Ct(t,0) = E{I[E(Z t |Z t _,,Z t _ 2v ..) =
^
Then E[Ct(T,0)] = JxJLI[a,b]{J[-l,l]Ct(T,0)dT}d0 = J{x?=1[a,b]}\0o{J(-l,l)\SoCt(T,0)dT}d0 = I, which implies that (7.3.19) holds except for (0,T) in a set with Lebesgue measure zero. Thus we have: Theorem 73.4 Let (Zt) = ((Zi >t ,...,Z M )') be a Qk-valued stochastic process satisfying E(|Z t |) < oo for every t. There
146
Functional specification of time series models
exists a subset N of R k+1 with Lebesgue measure zero such that (<9,r)eR k x(-l,l)\N implies E(z ift|zt_lv..,zt_m) - ECZ^E^i^ZtV" 1 ) a-s-
(7-3-2°)
for m=l,2,...,i=l,2,...,k, and t = . . . , - 2 , - 1 , 0 , 1 , 2 , . . . If, in addition, assumptions 7.3.1 and 7.3.2 are satisfied, then E(Z i>t |Z t _ 1 ,Z t _ 2 ,...) = W&,t\Z™iO 9Zt-iJ~l)
a s
--
(7-3.21)
for i = l,2,...,k and t = . . . , - 2 , - 1,0,1,2,... For showing that (7.3.3) holds for uncountably many rational lag polynomials we need the following generalization of theorems 7.3.1 and 7.3.2. Lemma 7.3.1 Let (Zt) be a Q-valued stochastic process with E(|Z t |) < oo. Let q be an arbitrary positive integer. Let Q be the set of complex numbers with absolute value less than 1. There exists a subset S of C q with Lebesgue measure zero such that (T 1V..,Tq)' € Cq \S implies
E(Z|Z t _ l v ..,Z t _ m q ) = E{Z t |i7? = 1 [(l-(T,L) m )/(l-T,L)]Z t _ 1 ) a.s. (7.3.22) for m = 1,2,3,... and all t. If in addition assumptions 7.3.1 and 7.3.3 hold, then for each t, E(Z t |Z t _ l9 Z t _ 2 ,...) = E(Z t |Ht 1 [l/(l-T,L)]Z t _ 1 ) a.s.
(7.3.23)
Proof: Let (wt) be an arbitrary sequence of rational numbers. Denote for
= J7? =1 {[l-(T,L) m ]/(l-T,L)}w t . Suppose for the moment that T lv..,Tq are real-valued. Now draw T\,...,xq independently from the uniform [0,1] distribution. Then P(x{ q) (f 1 ,...,f q ) = 0|f l v ..,f q ) = Oa.s. x[^ (f l v . . , f q _ i ) , j = 0,...,m— 1, is unequal to zero, if at leastt one of the x[^j whereas q) P(x[ x[ q) (f! (f!,...,!,) = 0|f !,..,f q ) = 1 a.s.
if allll the *^
(f i,...,fq-i) are zero. Thus
q)
P(x[ (f !,...,! q) = 0|f !,..., _1) = 0)
ARM A memory index models
147
and consequently P(x t (q) (f lv ..,f q ) = 0) = E[minj = o,...,m-il(xt(!j"1)(f l s ...,f q _,) = 0)] <min j = o,...,m-iP(x t ( !7 1) (f 1 ,...,f q _ 1 ) = 0). By recursion it therefore follows that P(x t (q) (T lv .,f q ) = 0) < minj = 0....,q(m-i)P(xt(i5(f,) = 0).
(7.3.24)
But P(x[ Vf i) = 0) = 0 if at least one w t _j (j = 0,...,m—1) is unequal to zero,P(x t (1 '(fi) = 0) = 1 ifw t _j = 0 for j = 0,..,m- 1, hence P(x t (1) (f0 = 0) = minj = o ,...,m-il(w t _j = 0).
(7.3.25)
Combining (7.3.24) and (7.3.25) now yields P(x t (q) (fi,...,f q ) = O)<minj = o,...,(m-i)ql(wt-j = 0). This result shows that there exists a subset S* of x q=1(— 1,1), depending on w t ,w t _i,...,w t _ q(m _ 1) , which has Lebesgue measure zero if one of these w t _j's is unequal to zero. The set S in lemma 7.3.1 is now the countable union of all these null sets S*. The case that the fe are complex-valued is similar. For example, let T£ = pe(cos ^£ + i-sin 7p£), where the p? are drawn independently from the uniform [—1,1] distribution and the ^ are drawn independently from say the standard normal distribution. The rest of the proof of lemma 7.3.1 is similar to the proofs of theorems 7.3.1 and 7.3.2. Q.E.D. Now let T q be the set of vectors y = (y l5...,yq)' e R q for which the polynomial l + ^ s = i ^ S n a s r o o t s a ^ outside the unit circle. Realizing that these roots are related to yi,...,yq by a one-to-one mapping, the following corollary of part (7.3.23) of lemma 7.3.1 is easy to verify. Lemma 73.2 Let the conditions of part (7.3.23) of lemma 7.3.1 hold. There exists a subset S of Fq with Lebesgue measure zero such that y = (yi,.-,y q )' € ^q \S implies that for each t, E(Z t |Z t _,,Z t _ 2 ,...) = EiZtltlAl + ^ i ^ L ^ Z ^ ^ a . s . Moreover, part (7.3.21) of theorem 7.3.4 can now be generalized as follows: Lemma 7.3.3 Let the conditions of part (7.3.21) of theorem 7.3.3 hold and let q be an arbitrary positive integer. There exists
148
Functional specification of time series models
a subset N of Rk x Fq with Lebesgue measure zero such that G Rk x T q \N implies E(Z t |Z t _ 1 ,Z t _ 2 ,...) = E{Zi,t|[l/(l + i;s=iysL s )FZ t _ 1 }a.s. for i = l,2,...,k and t =...,-2,-1,0,1,2,... Proof: Similarly to theorem 7.3.4. Applying lemma 7.3.3 to the sequence: (Z*)withZ* = (Z t ',Z t _r,...,Z t _ p T, our main theorem below now easily follows. Theorem 7.3.5 Let p and q be arbitrary integers satisfying p > 0, q > l.Let
(7.3.26)
for i= 1,2,...,k and t = ..., —2,— 1,0,1,2,... Consequently, for each permissible rational lag polynomial VXL|Aj57i) there exist Borel measurable real functions fi>t(.) depending on (3lX, /?i,2v,A,k and 7i such that for i= 1,2,...,k and t = . . . , - 2 , - 1,0,1,2,..., E(Z 1 , t |Z t _ 1 ,Z t _ 2 ,...)
= UEjli^LIA^yOZj,!-])
a.s. (7.3.27)
7.3.5 The nature of the ARM A memory index parameters and the response functions In discussing the nature of the ARMA memory index parameters we shall first focus on the univariate case. We ask the question what the nature of a permissible i in (7.3.1) is, i.e., is x in general irrational or are rational T'S also permissible? We recall that a permissible T is such that the polynomial E'IW^-
1
(7.3.28)
in nonzero for arbitrary m > 1 and arbitrary rational numbers Wj not all equal to zero. But for m = 2, W I + W 2 T = 0 for i = — W1/W2, so that for given rational T we can always find rational numbers Wj such that the
ARM A memory index models
149
polynomial (7.3.28) equals zero. Obviously the same applies if only integer-valued w/s are allowed. Consequently, the permissible T'S are in general irrational. By a similar argument it can be shown that in the case of (7.3.26) at least some of the parameters in (5-^ and yx will likely be irrational. What is the consequence of the irrationality of the ARMA memory index parameters for the nature of the response functions? If we would pick an arbitrary permissible x the Borel measurable real function ft T for which
E(Zt|Zt_l5Zt_2,...) -
^
j
will likely be highly discontinuous, as this function has to sort out Z t _ 1 ,Z t _ 2 ,Z t _3,... from E,=i T J ~ l z t-j- See Sims(1988). On the other hand, if we choose x such that Zt and Ej^i r J ~ l Zt-j are strongly correlated, and thus that ft, T (Ej=i TJ " lz t-j) and E j ^ i ^ Z t - j are strongly correlated, then the function ft T will be close to a linear function. In any event, lemma 6.3.3 shows that the possible discontinuity of ftT is not too dramatic, as ft T can always be approximated arbitrarily close by a uniformly continuous function. Thus, given an arbitrary S e (0,1), a permissible x e (—1,1) and the condition E(|Zt|) < oo, there exists a uniformly continuous real function g tT such that ,Ej2i^" 1 Zt-j)-gt,r(Ej^iT j " 1 Zt-j)|] < &
(7.3.29)
and consequently by Chebishev's inequality, E~i^ J " 1 Zt-j)-gt,T(Ej2i^ 1 Zt-j)l < S]>l-5. (7.3.30) We have argued that the ARMA memory index parameters are likely irrational. Since computers deal only with rational numbers we therefore cannot calculate the ARMA memory index exactly in practice. However, it is possible to choose a rational x* close to x such that "almost" all information about the past of the data-generating process is preserved. The argument goes as follows. Because of the uniform continuity of g tT, there exist real numbers rj > 0, p e (0,1) and a rational number T* close to x such that
j
t
-
j
l
< Vl
(7-3.31)
where 1 > p > max(|T|,|i*|). The last inequality follows from the mean value theorem. Thus by Chebishev's inequality > i_5
(7.3.32)
150
Functional specification of time series models
if |T-T*|
< ^(l-p) 2 /sup t E(|Z t |).
(7.3.33)
Combining (7.3.30) and (7.3.32) we see that for arbitrary 3 € (09Yi)9 P[IUE^iT j " I Zt-j)-8t,x(E?i^" 1 Zt-j)l < s\ > 1—25 if (7.3.33) holds.
(7.3.34)
It should be noted that the rational-valued T* depends in general on the time index t. However, if the process (Zt) is strictly stationary we can pick a constant T*, as is not too hard to verify from (7.3.29) - (7.3.34). In that case the functions ft T and gt T are independent of t. Summarizing, we have shown: Theorem 7.3.6 Let (Zt) be a strictly stationary univariate rational-valued process. Let assumptions 7.3.1 and 7.3.2 hold and let 3 e (O/A) and T G ( - 1,1 )\S be arbitrary, where S is the same as in theorem 7.3.1. There exists a uniformly continuous real function gT and a rational number T* in a neighborhood of T such that PtmZtlA^Zt^,...) - g x C E g i ^ Z t - j ) ! < 3] > 1-2(5. (7.3.35) Finally, a similar result can be shown for the general case in theorem 7.3.5. Thus: Theorem 7.3.7 Let the conditions of theorem 7.3.5 hold and assume in addition that (Zt) is strictly stationary. For arbitrary 3 e (0,!/2) and (A,i,-,/Wi) e R ( p + 1 ) k xT q \N there exist uniformly continuous real functions & and vectors (/?* I,...,/?*,^?!) £ Q(P +1 )k X (r q nQ q ) such that j _ > 1-2(5
1
)
|
< 3] (7.3.36)
for i = l,2,...,k and each t. 7.3.6 Discussion
We have shown that in modeling rational-valued time series processes as conditional expectations relative to the entire past of the process involved, it is possible to capture the relevant information about the past of the process by a single random variable, called an ARMA memory index. Given this ARMA memory index, the specification of the model then amounts to specifying a nonlinear response function defined on the
Linear time series regression models
151
real line. Although this response function might be highly discontinuous, it can be approximated arbitrarily close by a uniformly continuous real function of an ARM A memory index with rational-valued parameters. One might argue that our approach is merely a sophisticated variant of representing a one-sided infinite sequence of variables as a decimal expansion of a real variable. For example, let the univariate stochastic process (Zt) be integer-valued with values 0,1,...,9, and define
Then £t contains all the information about the past Z t _i, Z t _ 2 , ... of the process under review, hence E(Z t |Z t _,,Z t _ 2 ,...) = E(Z t |£ t )a.s. In particular if E(Z t |Z t _,,Z t _ 2 ,...) = V
152
Functional specification of time series models
our approach but a universal problem. For example, transforming the data by, say, a log transformation will result in loss of information, because of the flniteness of data storage in a computer. Whether this problem is serious or not for our ARMA memory index depends on the dependence of the data. Take, for example, the above primitive index £t. Storing £t as a double precision variable yields 29 significant decimal digits (in CDC Fortran 5). Thus at least we can sort out Z t _i,...,Z t _ 2 9 from £t. If Z t is almost independent of Z t _j for j > 29 then E(Z t |Z t _,,Z t _ 2 ,...) « E(Z t |Z t _ 1 ,...,Z t _ 29 ) 7.4 Nonlinear ARMAX models The lesson we learn from the argument in sections 7.2 and 7.3 is that the class of linear ARMAX models forms a good starting point for modeling vector time series processes. In modeling the conditional expectation E(Zi 5 t|Z t _i,Z t _ 2 v) one should first look for the best-fitting linear ARMAX model, as this strategy forces the nonlinear function fj, which maps the corresponding ARMA memory index £i?t into this conditional expectation, towards a linear function. Then apply various model misspecification tests to check the validity of the linear ARMAX model. We will consider these tests in the next chapter. If these tests indicate the presence of misspecification one could then try to model the nonlinear function fi? for which E(Z i?t |^ it ) = fi(£i,t), for example by specifying fj as a polynomial of a bounded one-to-one transformation of £i>t. Moreover, one could run a nonparametric regression of Zi)t on £i?t to find a suitable functional form of fj. The latter approach is suggested in Bierens (1988a, section 6.2) and worked out further in Bierens (1990). Also, plotting Zi t and £ it may reveal the form of this function f;. Thus, if the linear ARMAX model fails to pass model misspecification tests we may think of specifying a parametric family for the function fi9 say f(.,a), where a is a parameter vector. This approach gives rise to a model of the form Y t = fKi + E j L j y M - ^ E ^ i / S j ' Z t - j M + u t . where Y t is one of the components of Z t , (Ut) is the error process (which should now satisfy E(U t |Z t _ 1 ,Z t _ 2 ,...) = 0 a.s.) and the /?j's and y = (yi,...,yq)' are parameter vectors. In what follows, however, we shall not deal with this class of models, for the simple reason that these models have not yet been considered in the literature, hence the sampling theory involved is still absent. The mean reason for introducing the ARMA memory index modeling theory is that it plays a key role in our consistent model misspecification testing approach, in chapter 8. Alternatively, if a linear ARMAX model does not pass our model
Nonlinear ARMAX models
153
misspecification tests one could follow Sims' (1988) common-sense approach and add nonlinear terms to the best linear ARMAX model to capture the possible nonlinearity of the conditional expectation function. How these nonlinear terms should be specified depends on prior knowledge about the phenomena one wishes to model. This specification issue falls outside the scope of this book. Quoting Sims (1988): There is no more general procedure available for inference in infinite-dimensional parameter spaces than the common sense one of guessing a set of finitedimensional models, fitting them, and weighting together the results according to how well the models fit. This describes the actual behavior of most researchers and decision makers ... Thus, if we are unsure of lag length and also believe that there may be nonlinearity in the system, a reasonable way to proceed is to introduce both a flexible distributed lag specification and some nonlinear terms that can be thought of as part of a Taylor or Fourier expansion of the nonlinear function to be estimated. Sims' common-sense approach will lead to a nonlinear ARMAX model of the form Y t = g(Z t _ lv ..,Z t _ p ,/3) + U t + E?=i7jUt-j.
(7.4.1)
where g(.,/3) is a known parametric functional form for the AR part of the ARMAX model, with (3 a parameter vector. The MA part of this model may be considered as a flexible distributed lag specification, together with the AR lag structure implied by the function g(.,/3). In chapter 8 we consider the problem of estimating the parameters of model (7.4.1), taking the function g(.,/3) as given, and we derive the asymptotic properties of these estimators under strict stationarity of the data-generating process (Zt) as well as under data heterogeneity. Also, we consider various model misspecification tests, in particular consistent tests based on the ARM A memory index approach.
8 ARMAX models: estimation and testing
In this chapter wefirstconsider the asymptotic properties of least squares estimators of the parameters of linear ARMAX models, and then we extend the results involved to nonlinear ARMAX models. A new feature of our approach is that we allow the X-variables to be stochastic time series themselves, possibly depending on lagged F-values. Moreover, we allow the data-generating process to be heterogeneous. Furthermore, we propose consistent tests of the null hypothesis that the errors are martingale differences, and a less general but easier test of the null hypothesis that the errors are uncorrelated. Most of these results are obtained by further elaboration of the results in Bierens (1987a, 1991b). Moreover, we refer to Bierens and Broersma (1993) for empirical applications of the ARMAX modeling approach. 8.1 Estimation of linear ARMAX models
8.1.1 Introduction We recall that, given a k-variate time series process {(Yt,Xt)}, where Yt and the k — 1 components of Xt are real-valued random variables, the linear ARMAX model assumes the form: (l-ELi«sL s )Y t = /. + ELift'L s X t + (l + ELi7sL s )U t ,
(8.1.1)
where L is the usual lag operator, / i G R , a s G R J s G R k " ' and ys e R are unknown parameters, the Ut's are the errors and p, q and r are natural numbers specified in advance. The exclusion of Xt in this model is no loss of generality, as we may replace Xt by Xj* = Xt+ \. The correctness of this linear ARMAX model specification now corresponds to the null hypothesis Ho: E[Ut|(Yt_1,Xt_1),(Yt-2,Xt_2),...] = 0 a.s.
(8.1.2)
for each t. Assuming that the lag polynomial 1 + Zls=i);sLs is invertible, this hypothesis implies that the ARMAX model (8.1.1) represents the 154
Estimation of linear ARMAX models
155
mathematical expectation of Y t conditional on the entire past of the process. The ARMAX model specification is particularly suitable for macroeconomic vector time series modeling without imposing a priori restrictions prescribed by macroeconomic theory. Such macroeconomic analysis has been advocated and conducted by Sims (1980, 1981) and Doan, Litterman and Sims (1984) in the framework of unrestricted vector autoregressions and observable index models. Cf. Sargent and Sims (1977) for the latter models. The advantage of ARMAX models over the VAR models used by Sims (1980) is that ARMAX models allow an infinite lag structure with a parsimonious parameterization, by which we get a tractable model that may better reflect the strong dependence of macroeconomic time series. Estimation of the parameters of a linear ARMAX model for the case that the X t 's are exogenous (in the sense that the X t 's are either nonstochastic or independent of the U t 's) has been considered by Hannan, Dunsmuir and Deistler (1980), among others. This estimation theory, however, is not straightforwardly applicable to the model under review. The condition that the X t 's are exogenous is too strong a condition, as then feedback from Y t to X t is excluded. Also, we do not assume that the errors U t are Gaussian or independent, but merely that (Ut) is a martingale difference sequence. We recall that the ARMAX model (8.1.1) represents the conditional expectation of Y t given the entire past of the process {(Yt,Xt)}, provided condition (8.1.2) holds and the MA lag polynomial 1+ ]Cs=i);sLs is invertible. We then have
E[Yi\(Yi_uX{_MYi_2,Xt_2l...]
and U t = Y t -E[Y t |(Y t _ 1 ,X t _ 1 ),(Y t _ 2 ,X t _ 2 ),...] a.s.
(8.1.4)
Since the MA lag polynomial can be written as where A i " 1 , . . . , ^ " 1 are its possibly complex-valued roots, invertibility requires \ks\ < 1 for s= l,...,q. In particular, for 0 < S < 1 the set r3 = {(7i,...,?q)' e Rq: \ls\ < IS
for s=l,...,q}
(8.1.5)
is a compact set of vectors (yi,...,y q)' with this property. The compactness of Fs follows from the fact that the ys's are continuous functions of
156
ARMAX models: estimation and testing
ylv..,yq, hence Fs is the continuous image of a compact set and therefore compact itself. Cf. Royden (1968, proposition 4, p. 158). From now on we assume that there are known compact subsets M of R, A of Rp, B of R(k-1)r and r5 of Rq such that, if (8.1.2) is true, li G M, (a lv ..,a p )' G A, (j?r,...,j3rT G B and (ylv..,yq)' G Fd.
(8.1.6)
Stacking all these parameters in a vector 60: 90 = (^au...,apJl\...JT\yu--,yqy
(8.1.7)
and denoting the parameter space by 0 = MxAxBxrsc
Rm withm=l+p + (k-l)r + q,
(8.1.8)
which is a compact set, we thus have 60e © if (8.1.2) is true. Denoting Zt = (Yt,Xt')', the conditional expectation (8.1.3) can now be written as t-s,
(8.1.9)
where and the rjs(.) are continuously differentiate vector-valued functions defined by: (8.1.10) It is not too hard to verify that each component rjiS(6) of rjs(6) satisfies (8-ul) < oo(j = l,2,...,m)
(8.1.12)
7i.s(0)l < oo Q^= l,2,...,m)
(8.1.13)
etc. Cf. exercise 1. These properties will play a crucial role in our estimation theory. In particular, the model (8.1.3) can now be written as a nonlinear regression model: (8.1.14) where the response function gt(0) - y>(0) + £ £ i *7s(0)'Zt-s
(8.1.15)
and its first and second partial derivatives are well-defined random functions.
Estimation of linear ARM AX models
157
Assuming that only Z 1? ...,Z n have been observed, we now propose to estimate 90 by nonlinear least squares, as follows. Let
gt(0) = v(0) + ETJM0yZi_s
if t > 2, MO) =
(8.1.16)
Thus (8.1.16) is gt(#) with Z t set equal to the zero vector for t < 1. Alternatively, we may set Z t = Z{ for t < 1, but for convenience the analysis below will be conducted for the case (8.1.16) only. Moreover, denote Q(0) = (l/n)£t=i[Y t -U0)f.
(8.1.17)
Then the proposed least squares estimator 9 of 90 is a (measurable) solution of 9 e G : Q(0) = infee0Q(O).
(8.1.18)
Similar to the results in chapter 4 we can set forth conditions such that under the null hypothesis (8.1.2), y/n(9-90)
-> N m (0,Gf ^ f i f 1 )
in distr
>
(8.1.19)
4 = (l/n)E?=i[O/^0')gt(0)][O/^0)gt(0)]
(8.1.20)
where Q\ is the probability limit of
and Q2 is the probability limit of Q2 = (l/^^^Yt-gt^))2^^^)^)]^^^^)]^
(8.1.21)
Moreover, when the null hypothesis (8.1.2) is false we show that there exists a 9* £ 0 such that p\imn^J
= d*.
(8.1.22)
8.1.2 Consistency and asymptotic normality In this section we set forth conditions such that the results in section 8.1.1 hold. Assumption 8.1.1 The data-generating process (Z t) in R k , with Z t = (Y t ,X t '), is u-stable in L1 with respect to an a-mixing base, where X^o a (j) < ° ° ' a n d ^s P r o P e r ly heterogeneous. Moreover, sup t E(|Z t | 4 + <5) < oo for some S > 0. (Cf. Definitions 6.2.2, 6.2.3, and 6.4.1 and theorem 6.4.1.) In the sequel we shall denote the base involved by (vt) where vt G V with V a Euclidean space, and the mean process of (Z t ) (cf. definition 6.4.1) will be denoted by (Z*). It should be noted that the error U t of the ARMAX model (8.1.1) need not be a component of vt, as it is possible
158
ARM AX models: estimation and testing
that the U t 's themselves are generated by a one-sided infinite sequence of v t 's. If we would make the strict stationarity assumption then assumption 8.1.1 simplifies to: Assumption 8.1.1* There exists a strictly stationary a-mixing process (vt) in a Euclidean space V, with a the same as in assumption 8.1.1, and a Borel measurable mapping G from the space of one-sided infinite sequences in V into R k such that Z t = (Y t ,X t y = G(v t ,v t __ 1 ,v t _ 2 ....)a.s. Moreover, E(|Z t |) 4+<5 < oo for some 5 > 0. Thus assumption 8.1.1* implies assumption 8.1.1. The proof of this proposition follows straightforwardly from theorem 6.2.2 and the fact that by the strict stationarity assumption the proper heterogeneity condition automatically holds with mean process (Z t ). Next consider the function Q(0) defined by (8.1.17). Let Y*o be the first component of Z$ and let Q(0) = E(Y* -
(8.1.23)
Then it follows from theorem 6.4.1: Theorem 8.1.1 Under assumption 8.1.1, = 0. Proof: Condition (6.4.1) is implied by (8.1.11). Since the function ip in theorem 6.4.1 is now */;(.) = (.)2, condition (6.4.2) holds with fi = I. The other conditions of theorem 6.4.1 now follow from assumption 8.1.1. Q.E.D. Next, we assume: Assumption 8.1.2 There exists a unique 0* £ 0 such that
Since 0 is compact and Q(0) is continuous there is always a 0* in 0 which minimizes Q(0) over 0. Thus the actual contents of this assumption is the uniqueness of 0*. If the null hypothesis (8.1.2) is true then 0* = 0O, so that assumption 8.1.2 then identifies the parameters of model (8.1.1). However, this assumption is also supposed to hold in the case that the null hypothesis (8.1.2) is false. Applying theorem 4.2.1 it follows from theorem 8.1.1 and assumption 8.1.2:
Estimation of linear ARMAX models
159
Theorem 8.1.2 Under assumptions 8.1.1 and 8.1.2 the least squares estimator 9 defined by (8.1.18) satisfies p l i m ^ ^ = 0*. This proves (8.1.22). Next we show the consistency and asymptotic normality of 9 under the assumption that (8.1.2) holds for each t. Since (8.1.1) and (8.1.2) are equivalent to (8.1.9), we now assume: Assumption 8.1.3 There exists a point 90 in an open convex subset 0O of @ such that (8.1.9) holds for each t. This assumption is hardly a condition. The sets M, A, and B (cf. (8.1.6) and (8.1.8)) can be chosen to be the closures of open convex sets M o , A o , and B o , respectively, whereas the set ro,s = « y i v . , y q ) ' G R q : 2 s < l-<5fors=l,2,...,q} (cf. (8.1.5)) is for 5 e (0,1) the continuous image of an open set and therefore open itself, with closure F$. Assuming that y0 corresponding to #o is an interior point of Fs for some (5, hence y0 e Fo^, there exists an open convex neighborhood Fo of y0 contained in Fs. Thus <90 = M o x A o x B o x r 0 (cf. (8.1.7)) is then an open convex subset of 0. In order to establish the consistency of the least squares estimator 9 it suffices to show that 90 minimizes Q(#), as then by the uniqueness condition in assumption 8.1.2, 9* must be equal to 90. To show this, let
It follows from lemmas 6.4.2 and 6.4.3 that lim n ^ TO su P 0 e e|E[Q(0)] - Q ( 0 ) | = O
(8.1.25)
and from (8.1.4) and (8.1.9) it follows that under assumption 8.1.3 :s~ifas(0) -r/ s (#o))'Z t _ s ] 2 . (8.1.26)
X1
Consequently, under assumption 8.1.3, we have: + E[
= 0O.
(8.1.27)
160
ARM AX models: estimation and testing
The asymptotic normality proof follows the classical lines. Cf. chapter 4. Thus we first apply the mean value theorem to (b/W{)Q(d), where 9{ is the i-th component of 9. This yields
0o) + 0 - 0o)'(W)(W)Q(0 (i ^
(8.1.28)
(l)
with 0 a mean value satisfying |0(O _6/ 0 | < \e -90\ a.s.
(8.1.29)
Theorem 8.1.3 and assumption 8.1.3 imply that 9 is an interior point of 0 with probability converging to 1. Thus 9 and 0(l) are with probability converging to 1 contained in the open convex subset 0O of 0. Cf. assumption 8.1.3. Consequently, we have similarly to (4.2.7), = 0.
(8.1.30)
The next step is to establish plimn^ooKWXa/MOQCG 0 ^"^
(8.1-31)
where Q{ is a nonsingular matrix, and the last step is to show V n ( W ) Q ( 0 o ) -+ N m (0,4O 2 ).
(8.1.32)
Combining (8.1.28) through (8.1.32) then yields Vn(0 -0o) -> Nm(0,fiif1Q2Oj-1) in distr.
(8.1.33)
For proving (8.1.31) and (8.1.32) the following lemma is convenient. Let
g\s)=
(8. I .34)
where we recall that (Z*) is the mean process of (Zt). Lemma 8.1.1 Under assumptions 8.1.1 and 8.1.3, E[(Y£ -g*(0o))Z*_s] = Ofors = 1,2,.., where YJ is the first component of Zg. Proof: Since E [ ( Y t - gt(0o))Zt_s] = E(U t Z t _ s ) = 0 under assumption 8.1.3 and since by theorem 6.4.1, limn^ooEKYt -g t (0 o ))Z t _ s ] = E[(Y5 the lemma follows. Now consider the derivatives
- 2( 1 /n)
-g\60))Z*_s], Q.E.D.
Estimation of linear ARMAX models
161
It follows from theorem 6.4.1 and (8.1.11), (8.1.12), and (8.1.13) that Lemma 8.1.2 Under assumptions 8.1.1 and 8.1.2,
]
0
(8.1.35)
and -E[(Y5 -g\0m/^ei)(b/Wj)g\9)]\
= 0,
(8.1.36)
Proof: We only prove (8.1.35). The proof of (8.1.36) is left as exercise 2. We verify the conditions of theorem 6.4.1. For t > 2 we have
hence
where
ri2\oy = and for ^ £u 6 e R, ^ i ( 0 = ^ ^2(6,6) = 6 6 - Moreover, the parameter fi in (6.4.2) is \x = 0 for V^i, H = 1 for ^ 2 Now part (8.1.35) of the lemma follows easily from (8.1.11), (8.1.12), assumptions 8.1.1 and 8.1.2, and theorem 6.4.1. The proof of (8.1.36) is similar. Q.E.D. Next, observe that lemma 8.1.1 implies E[(Y* - g * ( 0 o ) ) ( W X ^ / ^ g ^ o ) ] = 0.
(8.1.37)
Since
because of (8.1.29) and theorem 8.1.3, (8.1.31) follows from (8.1.35), (8.1.36), and (8.1.37) with
162
ARM AX models: estimation and testing
V * 0 o ) ] } .
(8-1.38)
The nonsingularity of Q\ cannot be derived, but has to be assumed as part of the identification assumption. Moreover, in order that this matrix is also defined in the case that the null hypothesis (8.1.2) fails to hold we redefine Q\ as 0, = E{[(W)g*(0*)][(W)g*(0.)]}. (8.1-39) Since 6* = OQ under the null hypothesis, there is no loss of generality in doing so. Assumption 8.1.4 The matrix Qx defined by (8.1.39) is nonsingular. This assumption is part of the set of maintained hypotheses which are assumed to hold regardless of whether or not the null hypothesis is true. Using lemma 8.1.2 it easily follows that the matrix Q\ defined by (8.1.20) is a consistent estimator of Q\. Lemma 8.1.3 Under assumptions 8.1.1 and 8.1.2, Note that assumption 8.1.3 is not needed for this result, due to the more general definition (8.1.39) oiQ\. For proving (8.1.32), we observe first that £?=! flJt + gt(0o) - gt( £[Li (gt(0o) - gt(0o))(Wi)gt(0o).
(8.1.40)
Secondly, we shall prove: p l i m ^ a /Vn)EU (gt(0o) ~ gt(0o))(W)gt(0o) = 0, so that
(8.1.41)
plim^Iv/nO/^OQ^o) + 2( 1 / » E I L I U t (W)g t (0 o )] = 0
(8.1.42)
Thirdly, denoting (8.1.43) cf. theorem 6.1.7, where £ is an arbitrary non-random vector in Rm, we show that (Xn>t) is a sequence of martingale differences for which the martingale central limit theorem 6.1.7 applies: (W n )Ef=i x n,t -> N(0,<j2) in distr., where
(8.1.44)
Estimation of linear ARMAX models
163
(72 = plim^ o o (l/n)5:i! = iX2 > t = rfi2^
(8.1.45)
with Q2 = E{(Y*0 -g\0.))\^/W)g\e,)][(bmg\9,)]}.
(8.1.46)
From these results it then follows Vn(
(8.1.47)
which implies (8.1.32). For proving (8.1.41) we need the following extension of (8.1.11). Lemma 8.1.4 For s—>oo and i = l,...,m, supoeehM\
=O[sq(l-5)s],
where 5 is defined by (8.1.5). Proof: It follows from (8.1.5) that
(i + EJ-iy.LV1 = n?=id - W - 1 = UUE and
Combining these results with (8.1.10), the lemma easily follows. Q.E.D. Using this lemma, Cauchy-Schwarz inequality and the fact that
we see that there are constants C . and C»* such that
] 1/4 .
(8.1.48)
Moreover, denoting. ps = maXiKcVMMtfo)!, Po = m a X i \ ( b / l 9 M ^ ) \ we have from (8.1.12) a n d Liapounov's inequality, )2 sup t E(|Z t | 2 ). (8.1.49)
164
ARM AX models: estimation and testing
Thus the right-hand side of (8.1.48) is of order
and converges therefore to zero. Now (8.1.41) follows from Chebishev's inequality. With Xn>t defined by (8.1.43), condition (6.1.14) and thus condition (6.1.13) of theorem 6.1.7 follow from < [E(|lJ t |
4+
,|4+M)]i/1
")]Wi[E(|(
/r\ \ i 4 + 2<5 i"j f 1 C/Q ) I \
4+M ') < oo implies and the fact that sup t E(|Z t |
sup t T7/IA/
|4 + Id \
< OO
E(|Yt| ) and., similarly to (8.1.49), sup t E(|gt(0o)|4+2*') < (50 and sup,E|i|(()/<)0i)gt(#i o)| 4 + 2<5* ' - o o . The S for which this holds must therefore be smaller than lA5, with S as in assumption 8.1.1. Condition (6.1.12) of theorem 6.1.7 follows easily from theorem 6.4.1, where a2 is given by (8.1.45) and (8.1.46). So (8.1.47) is proved. Furthermore, we note that similarly to lemma 8.1.3 we have: Lemma8.L5
Under assumptions 8.1.1 and 8.1.2,
plim n ^ o o 0 2 = Q2. Proof'. Exercise 3. Summarizing, we have: Theorem 8.1.4 Under assumptions 8.1.1-8.1.4, y/n(6 -00) -> Nm(0,(2i"1Q2Qr1)
in distr
-
(8.1.50)
and under assumptions 8.1.1, 8.1.2, and 8.1.4, plim^ooflj-^Ofl =QrlQ2«r1-
(8.1.51)
Note that assumption 8.1.3 is not needed for (8.1.51). However, if assumption 8.1.3 does not hold then Q^XQ2Q\X is no longer the variance matrix of the limiting distribution of 6. Moreover, note that nonsingularity of Q2 is not required: if Q2 is singular the limiting normal distribution (8.1.50) is singular too.
Estimation of nonlinear ARM AX models
165
Exercises 1. Prove (8.1.11), (8.1.12), and (8.1.13). 2. Prove (8.1.36). 3. Prove lemma 8.1.5. 8.2 Estimation of nonlinear ARMAX models In this section we consider the asymptotic properties of the least squares parameter estimators of model (7.4.1): ^ t . j ,
Po e B,
(8.2.1)
with B c Rr a parameter space. Let again
where F3 is defined by (8.1.5) and let 0o = (j8 o>o')\ ® = B x rdcRm,
with m = r + q.
Since the lag polynomial 1 + X^s=i^sLs is invertible, we can write
where the ps()0's are continuously differentiate functions on FS such that Po(y) = 1, £ £ i s u p y e / > s ( y ) | < oo,
(8.2.2)
oo, i=l,...,q,
(8.2.3)
s(y)| < oo,ij=l,..,q.
(8.2.4)
Cf. (8.1.11), (8.1.12), and (8.1.13). Similarly to (8.1.15) we can write the model in nonlinear ARX(oo) form as Yt = gt(0) + U t , where now
Moreover, similarly to (8.1.16) we truncate the response function gt(0) to gt(0) = £ l X W y ) g ( Z t - i v . . , Z t _ p , j 8 ) f o r t > p + l ,
1 (8 2 6)
gt(0) = Ofort < p + 1 ,
J
and we define the least squares estimator 6 of 0O as a (measurable) solution of 0 e 0 : Q(0) = in
166
ARM AX models: estimation and testing
where Q(0) = [l/(n-p)]Et= P + i(Yt -gt(#)) 2 . Now assume: Assumption 8.2.1 (a) The process (Z t) is u-stable in L 1 with respect to an oc-mixing base, where X]j2oa(J) K °°(b) The function g(w,/f) is for each w e R pk a continuous real function on B, and for each j5 E B a Borel measurable real function on R pk , where B is a compact subset of R r . (c) The process (Zt) is setwise or pointwise properly heterogeneous. (d) If (Zt) is pointwise properly heterogeneous, then g(w,/f) is continuous on R pk x B. (e) For some S > 0, sup t E(|Z t | 4 + ^) < o c s u p t E t s u p ^ B l g C Z t - ! , . . . , ^ . ^ ) ! 4 ^ ] < oo. Let similarly to (8.1.23)
where YQ and Z*_} are the same as before. Then Theorem 8.2.1 Under assumption 8.2.1, = 0. Proof: We verify the conditions of theorems 6.4.2 and 6.4.3. Let for w = (w,,..,w 1 + p k ) ' G R 1 + p k , ys(0,w) = wi -g(w 2 ,...,wi + p k ,j5)ifs = 0, and let ps = sup yer jp s (y)|, b(w) = |wi| -fNote that b(w) is continuous on R 1 + pk if g(w2,."?w1 + pk,j8) is continuous in all its arguments. Cf. theorem 1.6.1. Then ,w)| < psb(w), J2ZoPs < o°The latter result follows from (8.2.2). Moreover, denoting Wt = (Y t ,Z t _r,...,Z t _ p ')'
Estimation of nonlinear ARMAX models
167 1
it follows from assumption 8.2.1 that (W t ) is ^-stable in L with respect to an a-mixing base and that sup t E(|b(W t )| 2 + <5) < oo. The theorem under review now easily follows from theorems 6.4.2 and 6.4.3. Q.E.D. Next, assume: Assumption 8.2.2 There exists a unique O * G 0 such that Q(0.) = infee Assumption 8.2.3 There exists a point 9Q in an open convex subset &0 of (9, such that for each t, E(Y t |Z t _ 1 ,Z t _ 2 ,...) = gt( Then similarly to theorems 8.1.2 and 8.1.3 we have: Theorem 8.2.2 Under assumptions 8.2.1 and 8.2.2, plinin^oo^ = 6*. Theorem 8.2.3 Under assumptions 8.2.1, 8.2.2, and 8.2.3,
The proof of the asymptotic normality of 9 is left as an exercise. We only give here the additional assumptions involved. Assumption 8.2.4 The function g(w,/0 is for each w G R pk twice continuously differentiable on B. If (Zt) is pointwise properly heterogeneous then for i,i1,i2=: l,...,r, (
(8.2.7)
168
ARMAX models: estimation and testing
V ( n - p ) (0 -0 O ) -> N r + q(0,O["XQ2Q^X) in distr. and under assumptions 8.2.1, 8.2.2, 8.2.4 and 8.2.5,
Exercise 1. Prove theorem 8.2.4 along the lines of the proof of theorem 8.1.4. 8.3 A consistent N(0,l) model specification test In Bierens (1984) we have proposed a consistent model specification test for nonlinear time series regressions. This tests the null hypothesis that the errors of a nonlinear ARX(p) model obey a condition of the type (8.1.2). This model specification test is in principle also applicable to the ARMAX case considered in sections 8.1 and 8.2. A disadvantage of this test, however, is that the distribution of the test statistic under the null hypothesis is of an unknown type, so that the critical region of the test involved has to be derived on the basis of Chebishev's inequality for first absolute moments. This approach will lead, of course, to overestimating the effective type I error of the test, as Chebishev's inequality is not very sharp. Moreover, this test is quite laborious for relatively large data sets and models. On the other hand, the test involved is consistent in the sense that any model misspecification will be detected as the sample size grows to infinity, provided the data-generating process is strictly stationary. To the best of our knowledge no other model specification test for time series regressions has this consistency property. We shall now propose a new test which has a known limiting distribution under the null hypothesis and is consistent in the above sense. In particular, in this section we shall construct a test statistic \ / n T , say, with the property that under the null hypothesis (8.1.2), ^ / n t —• N(0,l) in distribution as n —> oo, whereas under the alternative hypothesis that (8.1.2) does not hold and under the stationarity hypothesis, p l i m ^ ^ ^ n f = oo. This test is a further elaboration of the test of Bierens (1987a) and is reminiscent of the consistent conditional moment tests in chapter 5. The consistency of our test requires that assumption 8.1.1 holds. Thus, strict stationarity of (Z t) and thus of {(Ut,Zt)} is part of our maintained hypothesis. Some of our results below also hold under data heterogeneity. This will be indicated by not explicitly referring to assumption 8.1.1*. The reason for imposing the stationarity assumption is that under data heterogeneity the null hypothesis (8.1.2) may be false
A consistent N(0,l) model specification test
169
for only finitely many t's. Clearly, no test based on asymptotic arguments can detect this. Let U t = Y t — gt(#*), where g t is defined by (8.2.5) and 6* is defined in assumption 8.2.2. The null hypothesis of a correct model specification can now be restated as H 0 :P[E(U t |Z t _ 1 ,Z t _ 2 ,Z t _ 3 v..) = 0 ] = l .
(8.3.1)
The alternative hypothesis we consider is that H o is false. Under stationarity this general alternative hypothesis becomes H 1 :P[E(U t |Z t _ 1 ,Z t _ 2 ,Z t _3,...) = 0] < 1.
(8.3.2)
Let us assume for the moment that H] is true. Then theorem 6.1.5 implies that there exists an integer £0 such that sup,>, o P[E(U t |Z t ( l ) 1 ,Z t ( i ) 2 ,...) = 0] < I,
(8.3.3)
where Z\ ' is the vector of components of Z t rounded off to £ decimal digits. Cf. exercise 1. Since {(U t ,Z t )} is strictly stationary, £0 is independent of t. Because the process (Z([ ') is rational-valued, we may now apply theorem 7.3.4: Lemma 83.1 Let assumptions 7.3.2, 8.1.1*, and 8.2.1 hold. There exists a subset N of R k + 1 with Lebesgue measure zero such that (8.3.3) implies E s ^ i ^ - ^ ' Z ^ s ) = 0} < 1 forall(£,i)eRkx(-l,l)\N. Proof: Let 9* = (j8*\y*')' and UtG) = Yt -E J s : P oPs(T*)g(A-i-sv.,Z t _ p _ s ,^) Cf. (8.2.5). Then UtG) e g ^ (cf. assumption 7.3.2) and E(|Ut -U t a ) |) < X;~ j+p+1 |p s ();O|E[|g(Z t _ 1 _ s ,...,Z t _p_ —> 0 asj —> oo. Consequently, < E(|Ut -U t G ) |) -^ 0 as j -> oo
(8.3.4)
and < E(|Ut -U t a ) |) -> 0 as j - • oo.
(8.3.5)
Moreover, from theorem 7.3.4 and assumption 7.3.2 it follows that there
170
ARM AX models: estimation and testing
exists a subset Nj £ of R k+ ] with Lebesgue measure zero such that for all (f,T)eR k x(-l,i)\Nj, € , l ^ ^ I E ^ i T
8
-
1
^ ^ ) .
(8.3.6)
Combining (8.3.4), (8.3.5), and (8.3.6) and taking for N the union of all the sets Nj ^ the lemma follows. Q.E.D. Next, we combine lemma 8.3.1 with lemma 3.3.1: Lemma 8.3.2 Let ip be an arbitrary bounded Borel measurable one-to-one mapping from R into R. Let the conditions of lemma 8.3.1 hold. There exists a subset S of Rk + 2 with Lebesgue measure zero such that (8.3.3) implies E{Ut • exp[p • HY.Zx^'^M
± 0
(8.3.7)
k
for all e>i0 and all (p,£,i) G R X R x ( - 1,1)\S. Proof: Lemmas 3.3.1 and 8.3.1 imply that for each (£,T) G Rk x ( - 1,1 )\N there exists a countable subset Q(£,T) of R such that (8.3.7) holds for p 0 Q(£,T). The proof can now easily be completed along the lines of the proof of theorem 3.3.5. Cf. exercise 2. Q.E.D. Let Ut = Yt -g t (0), where gt is defined by (8.2.6). Denoting (8. 3.8) l
i
ZW~ €Z[ l])] if t > 2, = l i f t < 1;
•(8.
3.9)
(8.3.10) (8.3.11) (8.3.12) where ip is now a bounded uniformly continuous one-to-one mapping from R into R, it follows: Theorem 8.3.1 Let the conditions of theorem 8.2.2 be satisfied. Then Under Hi and assumptions 7.3.2 and 8.1.1 there exists an integer £0 and a subset S of Rk + 2 with Lebesgue measure zero such that for all £>£0 and all (P,£,T) G R X Rk x ( - 1,1)\S,
A consistent N(0,l) model specification test
171
c,(p,£,t) ^ 0, whereas under H o , c,(p,£,i) = 0 for all £ and all (P,£,T) € R x Rk x ( -
1,1).
Proof: This theorem follows easily from lemma 8.3.2 and theorems 6.4.3 and 8.2.2. Cf. exercise 3. Q.E.D. Theorem 8.3.1 suggests using cn,e(P&T) a s a basis for a consistent test of the null hypothesis (8.3.1) versus the alternative hypothesis (8.3.2). The next two lemmas establish the asymptotic normality of cn^(p,£>T) under H o . Lemma 8.3.3 Under the null hypothesis (8.3.1) and the conditions of theorem 8.2.4, r t ( ^ ] } = 0,
(8.3.13)
where Q{ is denned by (8.1.38) with g* defined by (8.2.7), and bXp^T) = lim n ..^(l/n)E^iE[w t >,^,T)O/^0')gt(^)].
(8.3.14)
Proof: Exercise 4. Denoting Tt(6>*)]2},
(8.3.15)
it follows now from lemma 8.3.3 and the martingale central limit theorem 6.1.7 that Lemma 8.3.4 Under the null hypothesis (8.3.1) and the conditions of theorem 8.2.4, x/(n-p) cn,*(p,f,T) -+
N(0,S?(P,£,T))
in distr.
(8.3.16)
for each (p,f,t) e R x R k x ( - 1,1) and each L Proof: Exercise 5. Moreover, denoting
M % M d ) ? with
(8-3.17)
172
ARMAX models: estimation and testing
b n > ^ T ) = [l/(n-p)]E| l == p + iW t >,e,T)O/^0')gt(0)
(8.3.18)
we have: Lemma 8.3.5 Under assumptions 8.2.1, 8.2.2, 8.2.4, and 8.2.5,
regardless whether or not the null is true. Proof: Exercise 6. Our test statistic is now f
2>l(p^T)).
(8.3.19)
However, before we can draw conclusions about the limiting behavior of t^(p,£,T) under H o and Hi we have to address the question whether
where in the latter case 0 is the zero vector. If £ = 0 then clearly w u (p,f,T) = w u (p,£,i) = exp(p and consequently CM(P,£,T)
= exp(p •
Similarly, if p = 0 then
But if model (8.2.1) contains a constant term then
Er= P+ iU, = 0a.s., hence s?(p4,T) = 0. We may exclude this case by including p = 0 and/or £ = 0 in the set S. Because the new set S is then the union of the original set S with another set with Lebesgue measure zero, it still has Lebesgue measure zero. The following assumption guarantees that the points (P,£,T) G R x R k x ( — 1 , 1 ) for which sf(p,£,r) = 0 are indeed contained in a set with Lebesgue measure zero. Assumption 8.3.1 The process (Zt) is strictly stationary. There exists a stationary process (ft), where ft is defined on the Borel field generated by Z t _i,Z t _ 2 ,..., such that the random vector Kt
A consistent N(0,l) model specification test
173
= (ft,((V^#)gt(#*))' has positive definite second moment matrix E(KtKt).
Lemma 8.3.6 Under assumption 8.3.1 and the conditions of lemma 8.3.5 there exists a natural number £x such that for all £ > £x the set S? =
{(P,£,T)
<E R x Rk x (-1,1): s?(p,£,T) = 0}
has Lebesgue measure zero. Proof. Without loss of generality we may assume P(U t = 0) < 1. Let (p,£,i)eS;;.Then
Hence E[ft •w where
Since P[ft = (<)/d0')gt(0*)A] = 1 would imply that E(fct/ct') is singular, it follows from assumption 8.3.1 that
Similarly to lemma 8.3.2 it follows now that for sufficiently large £, S*£ has Lebesgue measure zero. Q.E.D. Taking the union of the former set S with the union of the S*£ over £ > £\ and denoting the new set again by S we now have: Theorem 8.3.2 There exists a natural number £0 and a subset S of Rk + 2 with Lebesgue measure zero such that for all (p,£,r) G R x Rk x ( - 1,1 )\S and all £>£0 the following hold. (i) Under the null hypothesis, the conditions of theorem 8.2.4 and assumption 8.3.1, we have v/(n-p)fXp,£,T) -» N(0,l) in distr.
(8.3.20)
(ii) If the null is false then under assumptions 7.3.2, 8.1.1 , 8.3.1, 8.2.1,8.2.2, 8.2.4, and 8.2.5, f , T ) = oo.
(8.3.21)
174
ARMAX models: estimation and testing
In practice the exceptional set S is unknown. However, similarly to theorem 5.3.1 we have: Theorem 8.3.3 Draw p and the components of £ randomly from continuous distributions and draw x randomly from the uniform (—1,1) distribution. Let £ > £0. Then the conclusions of theorem 8.3.2 carry over. Proof: Exercise 7. Next we propose a slightly modified version of the test under review, for the very same reason as in section 5.3. Suppose we have chosen ^(.) = tg~ 1 (5(.)), where 5 > 0 is some constant.
(8.3.22)
Moreover, suppose that
takes with high probability only large positive values. Then takes with high probability only values close to !/27r, which is the upper bound of the function ip. Hence the function \MP,£,T)
= exp[p • ^(xt^(T,O)]
(8.3.24)
takes values close to exp(p Yin) and consequently c n >,£,T) « [(l/n)Ef=iUt]exp(p1/27r). But if the U t are least squares residuals of a model with a constant term they sum up to zero and hence cn^(p,£,T) will then be close to zero. Clearly this will destroy the power of the test. Therefore we propose to standardize the argument of ip, i.e., we propose to replace x t^(i,£) in (8.3.24) by xt>/T,£) defined as follows:
^ ) 2 -[d/t)Es=iM^O] 2 }' / 2 if t > 3,
I (8.3.25) 1
where X U (T,£) = 0 if t < 1 and defined by (8.3.24) if t > 2. The function vttj(p,£,T) is thus redefined as if t > 3, W,,XP,£,T)=1 ift
<2.
\
J
Redefine the function wue(p,£,r) accordingly as (x t ,,(t,0)],
(8.3.27)
A consistent N(0,l) model specification test
175
where x M
(8.3.28)
with xt,KT,O = E j ^ i ^ J " I ^ z S .
(8.3.29)
Then Theorem 83.4 With (8.3.26) instead of (8.3.9) and (8.3.27) instead of (8.3.8), all the previous results in this section go through. Proof: Exercise 8. Remark 1: It is easy to verify that the results in this section also go through if we use more general ARMA memory indices than (8.3.29). In particular, we may replace (8.3.29) by, say, M T , 0 = (l+T 1 L 1 + ... + T q L q ' ) " 1 ( 6 L + . . . + e Pl L P TZ t ( ' ) ) and replace (8.3.23) by
where now T = (T,,...,Tq,)' e z U = (£i\...,£ Pi y
e
Rk+Pl
with A such that the lag polynomial 1 +T1L1 + ... + TqqiLqi has roots all outside the complex unit circle, and Z[£) = Z{t£) for t > 1,
z[£) = 0 for t < 1.
Remark 2: In chapter 7 we argued that the parameters T and £ for which are likely to be irrational. Since this result plays a key role in the proof of theorem 8.3.1, one might therefore think that the consistency of our test cannot hold as consistency requires irrational i and £, whereas in practice it is impossible to deal with irrational numbers. However, the functions C>(P,£,T) and S|(/P,£,T) are continuous, and so is
provided s|(p,£,i) > 0, hence if T*(P,6T) / 0
holds then it holds too on an open neighborhood of (p,£,r) and thus also for all rational (p ,£ ,r ) in this neighborhood.
176
ARMAX models: estimation and testing
Exercises 1. Prove (8.3.3). 2. Complete the proof of lemma 8.3.2. 3. Complete the proof of theorem 8.3.1. 4. Prove lemma 8.3.3. 5. Prove lemma 8.3.4. 6. Prove lemma 8.3.5. 7. Prove theorem 8.3.3. 8. Prove theorem 8.3.4. 8.4 An autocorrelation test In this section we briefly discuss a test for first- or higher-order autocorrelation of the errors U t of model (8.2.1). The null hypothesis is still H o defined by (8.3.1), but instead of (8.3.2) we consider the less general alternative H(!r): cov(U t ,U t _j) £ 0 for some j £ {l,2,...,r}. The reason for considering the problem of testing H o against Hj is threefold. First, in traditional times analysis most tests for model specification test the null hypothesis of white-noise errors against an alternative of the type H j r \ Second, such a test is rather easy to construct, and its construction is a very useful exercise that highlights the essence of the approach in the previous sections. Third, severe model misspecification will likely be covered by Hj for r sufficiently large. Therefore we advocate conducting the test below first, as a pretest of model misspecification. If H o is rejected in favor of Hj there is no need to conduct a consistent test. However, since Hj r) may be false while H o is false, not rejecting H o in favor of Hj does not provide sufficient evidence that H o is true. In that case the consistent tests in section 8.3 should be used in order to verify whether H o is true or not. The test involved can simply be based on the statistic c = [l/(n-r-p)]£?= r+p+ ,U t V t . where U, = Y t -g t (0), with gt defined by (8.2.6), and
Let U, = Y t - g t (0*),
An autocorrelation test
177
where gt is defined by (8.2.5) and 6* is defined in assumption 8.2.2, and let Vt = (U t _ 1 ,...,U t -r)'. We recall that under H o, 0* = 90. Denoting A = A -B 1 O 1 ^ 1 B 2 ' - B 2 ^ r 1 B 1 ' + 1
A=A -BiQf ^' -B2Q^B^ +
B]Q^Q2Q^Bl\
B^fyQ^B^
with B, = B2 = A = B1 = B2 = A = the test statistic involved is now ar = (n —r —p)c'J~ 1 c. Theorem 8.4.1 Let det(zl) / 0. (i) Under the null hypothesis (8.3.1) and assumptions 8.2.1-8.2.5 we have ar —• x^ m distr. (ii) Under H(,r) and the assumptions 8.1.1*, 8.2.1, 8.2.2, 8.2.4 and 8.2.5 we have p l i m ^ ^ a r = oo. Proof: The details of the proof are left as exercises. Below we only give the main steps of the argument. First, under the conditions of part (ii) of the theorem we have p l i m n ^ c + 0.
(8.4.1)
Next, let the conditions of part (i) hold. Let Cj be the i-th component of c. By the mean value theorem there exists a mean value 6^ satisfying
\6O)-0o\<\O-eo\ such that
178
ARMAX models: estimation and testing
Since
^ + p + t U , - , }
= 0,
pli m i w o o [ 1 /(n - r - p)]Er=r+P+1 (Yt - gt(0(i))
= 0, plimn^tl /(n - r - p)]£?=r+P+i ( ( W & ^ X Y , -i - g, _ i(fl(i))) = plimn^^t 1 /(n - r - p)]£[U +P+ , Ut _ i(ime)gl(0o) and plinin^oo {V(n - r - p)(0 - 0O) -[l/^n-r-pMELr+p+iUtfir'O/^gt^o)} = 0 it follows now that plim n _ o o { v /(n-r-p)c -[l/VCn-r-pW^r+p+iUtCVt-B^i-^/^^g^o))} = 0, hence by theorem 6.1.7, v
/
( n - r - p ) c -* Nr(0,zl) in distr.
(8.4.2)
Finally, we have p]imn^ooA = A. Combining (8.4.1), (8.4.2), and (8.4.3), the theorem follows.
(8.4.3) Q.E.D.
Unit roots and cointegration
If a time series is modeled as an ARMA(p,q) process while the true datagenerating process is an ARIMA(p— 1,1,q) process, strange things may happen with the asymptotic distributions of parameter estimators. For example, if a time series process Yt is modeled as Yt = aY t _i + Ut, with Ut Gaussian white noise and a assumed to be in the stable region (1,1), while in reality the process is a random walk, i.e., AYt = Ut, then the OLS estimator an of a (on the basis of a sample of size n) is nconsistent rather than A/n-consistent, and the asymptotic distribution of n(an-a) is non-normal. Therefore, in testing the hypothesis a = 1 standard asymptotic theory is no longer valid. See Fuller (1976), Dickey and Fuller (1979, 1981), Evans and Savin (1981, 1984), Said and Dickey (1984), Dickey, Hasza, and Fuller (1984), Phillips (1987), Phillips and Perron (1988), Hylleberg and Mizon (1989), and Haldrup and Hylleberg (1989), among others, for various unit root tests (all based on testing a = 1 in an AR model) and Schwert (1989) for a Monte Carlo analysis of the power of some of these tests. Moreover, see Diebold and Nerlove (1990) for a review of the unit root literature, and see Bierens (1993) and Bierens and Guo (1993) for alternative tests of the unit root hypothesis. In this chapter we shall review and explain the most common unit root tests. However, since understanding the asymptotic theory involved requires some knowledge of the theory of convergence of probability measures on metric spaces, we shall start with reviewing the latter first. The concept of cointegration was first introduced by Granger (1981) and elaborated further by Engle and Granger (1987), Engle and Yoo (1987), Stock and Watson (1988) and Johansen (1988, 1991), to mention a few. Recently it has become an increasingly popular topic in the econometrics literature. The basic idea is that if all the components Xjt of a vector time series process Xt have a unit root there may exist linear combinations a'X t without a unit root. These linear combinations may then be interpreted as long-term relations between the components of X t. Since the theory of cointegration is rapidly developing, any attempt to cover the whole literature will soon be outdated. In this chapter we shall therefore outline the basic principles only, on the basis of Engle and 179
180
Unit roots and cointegration
Granger (1987), Engle (1987), Engle and Yoo (1989) and Johansen (1988). 9.1 Weak convergence of random functions 9.1.1 Introduction In order to derive tests for a unit root in time series processes we need to generalize the concept of convergence in distribution of random variables or vectors to weak convergence of random functions on the unit interval [0,1]. Therefore we briefly review some terminology and results related to convergence of probability measures on metric spaces. The emphasis is on intuition rather than on mathematical rigor. For a full account, see Royden (1968) for a general treatment of the theory of metric spaces, and, in particular, Billingsley (1968) for a rigorous treatment of convergence of probability measures on metric spaces. Let {£2,g,P} be a probability space and let f(t) be a random function on [0,1] defined on {&,g,P}. We recall that by definition f is a mapping from Q x [0,1] onto R such that for each r G R, t G [0,1], {coeQ: f(co,t) < r} G g. Cf. definition 1.6.1. Now assume that f belongs to a class S of real function on [0,1], in the sense that the set N = {coeQ: f(co,.) £ S} is a null set in $: N G 5? P(N) = 0. Then we may interpret f as a mapping from Q onto S. In order to make probability statements about f we need to define a Borel field © of subsets of S such that for each set B G 25, f !(B) G 5. These sets B play a similar role to the Borel sets in a Euclidean space. Thus, we need to define Borel sets of functions in S. We recall that the Euclidean Borel field of subsets of R (that is, the collection of Borel sets in R) is defined as the minimal Borel field containing the collection of all half-open intervals ( —oo,b], b G R. It is quite easy to verify that the same Euclidean Borel field can be defined as the minimal Borel field containing the collection of all open intervals (a,b), a,b G R, a < b. This suggests to define the collection of Borel sets in S as the minimal Borel field containing the collection of all open sets in S. However, since we have not yet defined a distance measure for functions in S, topological terms like "open", "closed", "closure", "dense", "boundary", etc., have no meaning yet for subsets of S. Therefore we shall endow the set S with a metric p, i.e., let p be a mapping from S x S into R with the following properties: for all functions f, g, and h in S, p(f,f) = 0; 0 < p(f,g) < 00 iff f/ g; p(f,g) - p(g,f); p(f,h) < p(f,g) + p(g,h) (triangular inequality).
(9.1.1)
Weak convergence of random functions
181
An example of such a metric is the "sup" norm: p(f,g) = s u p o ^ i | f ( t ) - g ( t ) | .
(9.1.2)
The pair (S,p) is called a metric space. The metric p allows us to define the following attributes of subsets of S. A set A is open if for every f e A there exists an e > 0 such that the open e-sphere {geS: p(f,g) < e} about f is contained in A. An element f of S is called a point of closure of a set A if for every e > 0 there is an element g e A such that p(f,g) < e. The set of points of closure of A is called the closure of A and is denoted by A " . A set A is closed if A~ = A. An element f of A is called an interior point if there exists an e > 0 such that {gGS: p(f,g) < e} c A. The set of interior points of A is called the interior of A, and is denoted by A 0. Clearly, A0 is an open set. The boundary of a set A, denoted by <)A, is defined by dA = A~\A°. A set A is dense in B if A C B C A". Finally, a set A is compact if every open cover of A (that is a union of open sets in S containing A) contains a finite subcover. We now define the Borel sets of S as the members of the minimal Borel field 93 containing the collection of open sets in S. It is not too hard to prove that open sets, closed sets, boundaries and closures of Borel sets, and compact sets are all Borel sets. Note that this notion of a Borel set of functions is relative to the metric p: if we choose another metric p then 53 may change. In chapter 1 we defined a random variable X as a real function x(co) on Q such that for each b e R, {coeQ: x(<x>) and vice versa. Thus an equivalent condition for X being a random variable is that for each Borel set B in R, {coeQ: x(co)eB} e 5- This suggests to define a random element X of S as a mapping x: Q —> S such that for each Borel set B in S, {coeQ: x(co,.) G B) e J . Similarly to random variables, a random element X of S induces a probability measure \i on {S,53} by the correspondence /i(B) = ?[{coeQ: x(o),.) e B}], B e S.
(9.1.3)
This probability measure \i describes the distribution of X. Let (f be a real function on S. In order that
182
Unit roots and cointegration
of the metric space under review. Consequently, the expectation E[(/?(X)], with X a random element of a metric space and (p a Borel measurable function on S, can also be defined along the lines of section 1.4. Next, let (Si,pi) and (S2,p2) be two metric spaces, and let 331 and 332 be the corresponding collections of Borel sets. A mapping
(9.1.4)
Analogously to proper convergence of distribution functions we can now define the weak convergence of fin to a probability measure \i on {S,B} as follows: Definition 9.1.1 A sequence (/in) of probability measures on {S,93} converges weakly to /i, denoted by fin => \i, if \i is a probability measure on {S,33} such that for all Borel sets B e 33 with /i(dB) = 0, lim n _ oo/in(B) = MB). The condition /i(()B) = 0 corresponds to the exclusion of discontinuity points of the limiting distribution F in the case of proper convergence of F n to F. Cf. definition 2.3.1. Similarly to convergence in distribution of random variables we can now define: Definition 9.1.2 Let X n , n = 1,2,..., and X be random elements of a metric space {S,p} defined on a common probability space. Then X n converges weakly to X, denoted by X n => X, if fin => /z, where fin and \i are the induced probability measures of X n and X, respectively. We recall that for random variables X n and X, X n —> X in distribution if and only if for all bounded continuous real functions
Weak convergence of random functions
183
Proof. Billingsley (1968, theorem 2.1, pp. 11-12). From theorems 2.3.1 and 9.1.1 it now follows easily that in the case S = R and of 33 the Euclidean Borel field, definition 9.1.1 is equivalent to the definition of proper convergence of distribution functions. We have seen in theorem 2.3.4 that convergence in distribution of random variables is invariant under continuous transformations. Also this property carries over to weak convergence: Theorem 9.1.2 (Continuous mapping theorem) Let X n => X, where X n and X are random elements of a metric space (S,p), and let \i be the probability measure induced by X. Let $ be a Borel measurable mapping from (S,p) into a metric space (S*,p*) such that 0 is continuous on a Borel set So C S with /i(So) = 1. Then
9.1.2 Weak convergence in C; Wiener process We now consider a special but important case of a metric space of real functions, namely the metric space C of continuous real functions on [0,1], endowed with the "sup" norm p defined in (9.1.2). The concept of tightness, together with the concept of finite dimensional distribution below, plays a crucial role in determining weak convergence in C: Definition 9.1.4 Let X be a random element of C. A finite distribution of X is the distribution of a finite dimensional random vector of the form (X(ti),X(t 2 ),.-,X(t m ))', where ix e [0,l],i=l,..,m < oo.
184
Unit roots and cointegration
We now have: Theorem 9.1.3. Let X n and X be random elements of C. Then X n => X if and only if (i) all finite distributions of X n converge properly pointwise to the corresponding finite distributions of X, and (ii) X n is tight. Proof: Billingsley (1968, p.35) Next, we consider the concept of Wiener process (also called Brownian motion). Let (U n ) be a sequence of independent standard normal random variables, and define the random function Y n (t) on [0,1] by
Yn(t) = (l/Vn)ESUj if t e [n-1,!]; = Oifte^n" 1 ),
1 i J ' '
where [x] stands for the largest integer < x. This is not a continuous random function and hence not a random element in C, but we can smooth out the jumps in Y n (t) by adding the term Z n (t) = (nt-[nt])(l/Vn)U [ n t ] + 1 .
(9.1.6)
Thus, let X n (t) = Y n (t) + Z n (t).
(9.1.7)
Then X n (t) is a.s. continuous and therefore a random element of C. Note that for fixed t, Y n (t) and Z n (t) are independent normal random variables with zero means and variances: var(Yn(t)) = [nt]/n, var(Zn(t)) = (nt-[nt]) 2 /n, hence var(Xn(t)) - [nt]/n + (nt-[nt]) 2 /n -> t as n -* oc.
(9.1.8)
Moreover for 0 < t < s < 1 we have
?)) as is not hard to verify. Thus Xn(s) —Xn(t) and X n (t) are asymptotically independent. This result suggests the existence of a random element W in C with the following properties: For 0 < t < s < 1,
hMUM
W(s)-W(t)\
XT
//0\
or equivalently, for s, t e [0,1],
/s-t
))
0\\
/nii^ (9U0a)
Weak convergence of random functions
fw(s)\
XT ff°\
N2
fs
185
min(s,t)
U(t)J^ VVoJ'Uin(s,t)
t
Such a random element is called a Wiener process or Brownian motion. However, does a Wiener process exist in C ? According to theorem 9.1.3, we have to verify that all finite distributions of X n converge properly pointwise to the corresponding finite distributions of W, and that X n is tight. The proof of the latter is rather complicated; we refer for it to Billingsley (1968, pp. 61-64). On the other hand, it is easy to verify that for arbitrary unequal ti,...,t m in [0,1],
(xn(u),...,xn(tm)y -> (W(to,...,w(tm))\ where the random variables W(tj) are such that for s = ti , t = tj, (9.1.10b) holds. Thus a Wiener process exists. Summarizing, we have shown X n => W, where X n is defined by (9.1.5)(9.1.7) with the Uj's independent standard normally distributed, and W a Wiener process. More generally, we have the following functional central limit theorem: Theorem 9.1.4 Let X n be defined by (9.1.5)-(9.1.7), where the Uj's are independent random variables with zero mean and finite variance a2 > 0. Then XJa => W, where W is a Wiener process. Proof: Billingsley (1968, theorem 10.1, p. 68) Note that this theorem generalizes the central limit theorem 2.4.1, i.e., theorem 2.4.1 follows from theorem 9.1.4 by confining attention to t = 1. A Wiener process is an (important) example of a Gaussian process. A Gaussian process Z in C is a random element of C such that all finite distributions of Z are multivariate normal. They are uniquely determined by their expectation function E[Z(t)] and covariance function cov[Z(s),Z(t)], s,t e [0,1]. Consequently, a Wiener process is uniquely determined by (9.1.10b). Another example of a Gaussian process in C is the so-called Brownian bridge: W°(t) = W(t) - tW(l), where W is a Wiener process. Note that W°(0) = W°(l) = 0 a.s. Exercises 1. Let W be a Wiener process. (a) What is the distribution of JjJ tW(t)dt? (b) Calculate EfWXOjJwOOdt]. (c) Calculate E[W(1/2)J()1W(t)dt] 2. Calculate E[W°(t)2], where W° is a Brownian bridge.
186
Unit roots and cointegration
9.1.3 Weak convergence in D
The random function Yn(t) defined by (9.1.5) is a right-continuous real random function: limS|tYn(s) = Yn(t) a.s, has finite left-hand limits: limsTtYn(s) = Yn(t —) exists a.s., and hasfinitelymany discontinuities. In order to consider weak convergence properties of such random functions we have to extend the set C to the set D of all right-continuous real functions on [0,1] with left-hand limits. As is shown in Billingsley (1968, p. 110), these two conditions ensure that all functions in D have countably many discontinuities. However, endowing the space D with the "sup" norm (9.1.2) would lead to too conservative a distance measure. For example, let f(t) = I(t > Vi), fn(t) = f(t- '/in" 1 ),
(9.1.11)
where I(.) is the indicator function. Then for all n, supo^t^i|fn(t) — f(t)| = 1, although fn —> f for all continuity points of f. Therefore we endow the space D with the so-called Skorohod metric: Definition 9.1.5 Let A be the space of strictly increasing continuous real functions on [0,1] satisfying 2(0) = 0, A(l)=l. The Skorohod metric p°(f,g) is the infimum of those positive e for which A contains some X with sup8^t,o<.
(9.1.12a)
and supo
(9.1.12b)
It is shown in Billingsley (1968, p.l 13) that p° is indeed a metric. Note that condition (9.1.12a) says that a suitable function X should be close to unity: A(t) « t. Furthermore, taking 2(t) = t we see that p° is dominated by the "sup" norm: p°(f,g) < sup o ^i|f(t)-g(t)|.
(9.1.13)
This easy inequality is quite convenient, as it will often enable us to verify continuity properties by working with the "sup" metric rather than with the more complicated Skorohod metric. The idea behind this definition is that in comparing the distance between two functions f and g in D one should allow for a small deformation /l(t) of the time index t in order to bring the discontinuity points of f and g closer together. As an illustration of this point, consider again example (9.1.11). Let for n > 2, An(t) = (l-c* n )t + ant2, where a n = 1/tn-n"1].
Estimating an AR(1) model without intercept
187
Then An is a strictly increasing continuous function on [0,1] such that An(0) = 0, A n ('/2+ Vin1) = V2, An(l) = 1, f(An(t)) = fn(t), hence |
= 0,and
sups ^ t,o ^ s ^ i ,0 < t < 11 ln[(An(t) — An(s))/(t — s)] | = Thus, for the case (9.1.11) we have p°(fn,f) = OCn"1), so that fn -+ f in the Skorohod topology. Having endowed D with the Skorohod metric p°, we can now state the following version of theorem 9.1.4: Theorem 9.1.5 Let Y n be defined by (9.1.5), where the Uj's are independent random variables with zero mean and finite variance a1 > 0. Then Yn/cr => W, where W is a Wiener process. Proof: Billingsley (1968, theorem 16.1, p. 137) 9.2 Estimating an AR(1) model without intercept when the true datagenerating process is a white-noise random walk As a first application of the material in section 9.1 we consider the limiting distribution of the OLS estimator of the parameter a in the AR(1) model Yt = a Y t _ , + Ut,
(9.2.1)
where Assumption 9.2.1
The U t 's are i.i.d. and satisfy
E(U t ) = 0, E(U 2 ) = a2, E(|U t | 2 + <5) < oo for some 8 > 0. If |a| < 1 it is well known that under assumption 9.2.1 the OLS estimator
"n = E?= 2 Y,Y t _ 1 /Er= 2 Y?_,
(9.2.2)
of a is asymptotically normally distributed: 2
.
(9.2.3)
If the true value of a equals one, then (9.2.3) is still valid, in the sense that then indeed y/n(an—l) —> 0 in probability, hence y / n(a n —1) —> N(0,0) in distr., but this result is of no help in making inference about a. As is shown by Fuller (1976) and Dickey and Fuller (1979), in the case a = 1 we have that n(a n -l) converges weakly (and in distribution) to a function of a Wiener process, namely:
188
Unit roots and cointegration
Theorem 9.2.1 Then
Let AYt = U t and let assumption 9.2.1 hold.
/o'W(r)2dr '
(9
,4)
The limiting distribution Z\ is tabulated in Fuller (1976, table 8.5.1, p.371). In particular, P(Z! < - 8 ) « 0 . 0 5 , P ( Z , < - 5 . 7 ) « 0.1. Fuller (1976) and Dickey and Fuller (1979) derive the result (9.2.4) in a rather complicated way. Here we shall follow the more transparant approach of Phillips (1987), using the results in section 9.1. Proof of Theorem 9.2.1: First, observe from (9.2.1) and (9.2.2) that <*n - 1 = E l U U t Y t - , / E?=2Y?_i-
(9-2.5)
Now consider the denominator in (9.2.5). For the case a = 1 we have
= n
(9.2.6)
where W n (r) = [l/(av/n)E[L n !u t if r e [ n - ! , l ] , W n (r) = 0 i f r € [ 0 , n - ! ) .
1 J
Note that by theorem 9.1.5, Wn^W,
(9.2.8)
with W a Wiener process. Moreover, for any Borel measurable real function f on R we have J0'f(Wn(r))dr = Er="o1Jt/n+1)/nf(Wn('-) = (l/n)E"=|fTWn(t/n)] + f(0)/n -f(W n (l))/n. Since for continuous real functions f on R and functions g in D = f(g(l))and
'
(9.2.9)
Estimating an AR(1) model without intercept
189
are continuous real functions on D, it follows now from (9.2.8), (9.2.9), and the continuous mapping theorem 9.1.2 that Lemma 9.2.1
Under assumption 9.2.1,
(l/n)E?=i«Wn(t/n)] = jJffWnO-fldr + O^n" 1 ) =» £ f(W(r))dr for any continuous real function f on R. Application of this lemma (for f(x) = x2 and f(x) = x, respectively) to the right-hand side of (9.2.6) yields: Lemma 9.2.2 Let AYt = U t and let assumption 9.2.1 hold. Then
Note that the results in lemmas 9.2.1 and 9.2.2 involve random variables rather than random functions, so that in this case weak convergence is equivalent to convergence in distribution. Next we consider the limiting distribution of the numerator on the right-hand side of (9.2.5). If a = 1 we can write ( l / n J E ^ U t Y t - ! = (l/n)Xt 2 U t (Y 0
+
Ej=}Uj)
^ ^ i U 2 + o p (l) = l/2a2\Wn(l)2 -(l/n)E^!(U 2 /a 2 )] + o p (l) = l/2(T2\Wn(l)2 - 1 ] + o p (l),
(9.2.10)
where the third and last equalities follow from the fact that by the weak law of large numbers,
Combining (9.2.8) and (9.2.10) now yields: Lemma 9.2.3
Let A Yt = U t and let assumption 9.2.1 hold. Then
Finally, denoting Xn = [(l/n)E" = 2UtYt-i,(l/n 2 )E"=2 X* = <j2[V2(Wn(\)2-l\tiwn(r)2dr]\ X - (T2[1/2(W(l)2-l),J01W(r)2dr]',
190
Unit roots and cointegration
lemmas 9.2.2 and 9.2.3 imply that X n = X* + o p ( l ) ^ X i n d i s t r . In other words, the results of lemmas 9.2.2 and 9.2.3 hold jointly. Referring to one of the continuous mapping theorems 2.3.4 or 9.1.2, this result now proves theorem 9.2.1. Q.E.D. Exercises: 1. Prove (9.2.3). 2. Calculate ¥(ZX < 0), where Zx is defined in (9.2.4). 3. Suppose U t = Vt + !/2Vt_i, where the V t 's are i.i.d. N(0,(72) random variables. How will (9.2.4) change? 9.3 Estimating an AR(1) model with intercept when the true datagenerating process is a white noise random walk The test suggested by theorem 9.2.1 tests the null hypothesis H o : Y t — Y t _i = U t , with (U t ) a white-noise process, against the alternative that Y t is a zero mean AR(1) process. However, if the true datagenerating process is AR(1) with nonzero mean , i.e., Y t = c + a Y t _ i + U t , c / 0, \a\ < 1, then the probability limit of the OLS estimate an defined by (9.2.2) is plim^ooan = E(Y, Y 0 )/E(Y 2 ) = a + — (9.3.1) It is easy to see that the right-hand side of (9.3.1) approaches 1 if cr2/c2 approaches zero. Thus if c 2 is large, relative to a 2, then an will be close to 1, even if a itself is not. Take for example the case a = 0.5, a2 = 1, c = 10. Then plim n ^ ^ n « 0.998, hence for large n, n(a n— 1) « -0.002n. In order that this value is less than the 5 percent critical value — 8, n should be at least 4000! This example therefore suggests including an intercept in the AR(1) model to be estimated, so that the OLS estimator of a now becomes: with Y o = [ l / ( n - l ) E ^ 2 Y t - i ; Y, = [ l / ( n - l ) ] E " = 2 Y f
(9.3.2)
Now let us assume again that the data-generating process is the random walk AYi = U t , where U t satisfies assumption 9.2.1. Then similarly to (9.2.5) we can write "n - 1 = E " = 2 U t ( Y . - . - Y o ) / E " = 2 ( Y t - i - Y o ) 2
YD-Cn-l)^].
(9.3.3)
Estimating an AR(1) model with intercept
191
Now observe that (n-l)Yo = £!UYt-i = E t ^ Y t - Y n - E^i(Ej=iUj + Yo) -(EjliUj + Yo) = ( m > ( l / n ) ^ 2 W n ( t / n ) -
n
v
n(an — 1) => Z2 =
^(W(l)2-l)-W(l)J0'w(r)dr 1
z
-,——
£ W(r)2dr - (£ W(r)dr)2
z—.
(9.3.4)
Also the limiting distribution Z 2 is tabulated in Fuller (1976, table 8.5.1, p. 371). In particular, P(Z2 < -14)«0.05,P(Z 2 < - 11.3) « 0.1. Exercises 1. Prove (9.3.1). 2. Suppose U t = Vt — ViWt-u where the V t 's are i.i.d. N(0,cr2) random variables. How will (9.3.4) change? 9.4 Relaxing the white-noise assumption 9.4.1 The augmented Dickey-Fuller test
The condition that the errors of the AR(1) model (9.2.1) are white noise (assumption 9.2.1) is, of course, too restrictive for econometric applica-
192
Unit roots and cointegration
tions. In practice we much more often encounter time series that are better approximated by an ARIMA(p,l,q) model than by the ARIMA(0,l,0) model considered in the previous two sections. Since under fairly general conditions an ARIMA (p,l,q) model can be written as an ARIMA(oo,l,0) model, and the latter can be approximated by an ARIMA(k,l,0) model, with k possibly increasing with the sample size, Said and Dickey (1984) proposed basing a test on the OLS estimate an of the parameter a in the model: Y t = c + a Y t _ ! 4- / M Y t _ ! + ...+ / M Y t _ k + U t , with (1 — /3jL — ... — /3kLk) a lag polynomial with roots all outside the unit circle. They showed that if k -» oo at rate o(n1/3) then the Mest statistic of the hypothesis a - 1 has the same limiting distribution as in the case p{ = 0, Ut is white noise. The latter distribution is tabulated in Fuller (1976). However, we shall not discuss this result further, but focus on the Phillips (1987) and Phillips-Perron (1988) tests, because the latter tests require less distributional assumptions. 9.4.2 The Phillips and Phillips-Perron tests Phillips (1987) and Phillips and Perron (1988) propose to test the null hypothesis H o : AYt = U t , where (U t ) is a zero mean mixing process rather than white noise (cf. assumption 9.2.1): Assumption 9.4.1. (a) E(U t ) = 0 for all t; (b) suptEflU/) < oo for some (5 > 2; (c)<72 = limn-.ooEiKl/v/nJ^iUJ 2 } exists and a1 > 0; (d)a2u = U m ^ ^ l / n ^ ^ i E K U t ) 2 ] exists and a\ > 0; (e) {Ut} is a-mixing with mixing coefficients a(s) satisfying
XZM*?-2"3
< <*>•
Note that in general a1 ^ dj, i.e., G2 = a\ + 2.1im n _ o o (l/n)E^ 2 E J t :i 1 E(U t U j ). Moreover, note that assumption 9.4.1 allows for a fair amount of heterogeneity.
Relaxing the white-noise assumption
193
The main difference between the Phillips and the Phillips-Perron test is that Phillips (1987) considers the alternative Hi:Yt =
aiYt_,
+ Ut, M
< 1,
(9.4.1)
whereas Phillips and Perron (1988) consider the alternative Hn Y t = c + a 2 Y t _ ! + U t , \a2\ < 1.
(9.4.2)
where in both cases the U t satisfy assumption 9.4.1. Phillips (1987) and Phillips-Perron (1988) now use Herrndorf s (1984) functional central limit theorem for mixing processes: Lemma 9.4.1 Let W n be defined by (9.2.7), with U t satisfying assumption 9.4.1. Then W n => W, with W a Wiener process. Moreover, we have Lemma 9.4.2
Under Assumption 9.4.1, plim n ^ ooOMXCtUU?
Proof: Exercise for the reader. Now replacing in section 9.2 the references to assumption 9.2.1 by references to assumption 9.4.1, the only change we have to make is in (9.2.10) and consequently in lemma 9.2.3. Due to lemma 9.4.1 the last equality in (9.2.10) now becomes: •/2( r 2 [W n (l) 2 - ( l / n J l t i f l J ? / * 2 ) ] + o p (l) = '/2<x2[Wn(l)2 -alia2] + o p (l), hence lemma 9.2.3 becomes: Lemma 9.4.3 Then
Let zlYt = U t and let assumption 9.4.1 hold.
(l/n)E"= 2 UtY t _, = ' / 2 a 2 [ W n ( l ) 2 - a ; > 2 ] + o p (l) Also the lemmas in section 9.3 carry over. In view of the result in lemma 9.4.3 we can now restate theorems 9.2.1 and 9.3.1 as follows: Theorem 9.4.1 Let aXn and a2n be the OLS estimators of the parameters ax and a2 in the auxiliary regressions (9.4.1) and (9.4.2), respectively. Moreover, let AYt = U t with U t obeying assumption 9.4.1. Then
194
Unit roots and cointegration and
The problem now arises that the distributions of Z* and Z2 depend on the unknown variance ratio o^Jo1. In order to solve this problem, let us assume for the moment that we have a consistent estimator 62 of er2, and choose
a2u = (l/(n-l))E? =2 e?
(9.4.5)
as a consistent estimator of 0^, where the et's are the OLS residuals of the auxiliary regression (9.4.1) in the case of the Phillips test and auxiliary regression (9.4.2) in the case of the Phillips-Perron test. The consistency of G\ under H o follows straightforwardly from lemma 9.4.2 and theorem 9.4.1. It now follows easily from lemmas 9.2.2 and 9.3.1 (with assumption 9.2.1 replaced by assumption 9.4.1) that under H o and assumption 9.4.1,
(WE^Yf
(9A6)
£W(r)'dr
and
(1/n2) E[Li( y t - Y)2
Jj W(r)2dr - (£ W(r)dr)2
(9 4 7)
' *
where Y = ( l / n ) ^ = i Yt. Hence, denoting *1-
Z2n = n(a2n - 1) - _ J ^
we then have:
n
^ '
(9.4.8)
_
(9.4.9)
The Newey-West estimator of (T2
195
Theorem 9.4.2 Let a2 be a consistent estimator of a2. Under the null hypothesis AYt = U t with U t obeying assumption 9.4.1, we have Z i n =>> Zi5 i = 1,2, where Z l n and Z 2 n are defined in (9.4.8) and (9.4.9), and Zi and Z 2 are defined in theorems 9.2.1 and 9.3.1. The statistics Zin and Z 2 n are now the test statistics of the Phillips and Phillips-Perron unit root tests, respectively. Next, consider the case that Y t is stationary. Since plim n
^n
^
< 1 in the case (9.4.1),
— < 1 in the case (9.4.2),
(9.4.10)
(9.4.11)
it follows easily Theorem 9.4.3 Let a2 be a consistent estimator of a 2 . Under the alternative hypotheses (9.4.1) and (9.4.2) respectively, with U t obeying assumption 9.4.1, we have: plim n ^ ooZin/n < 0, hence plimn^ooZin = - o o ( i = l , 2 ) . In other words, the tests involved are consistent. Note that the inequalities (9.4.10) and (9.4.11) and theorem 9.4.3 do not necessarily hold if
(9.5.1)
where Wj(m) = 1 - j / ( m + l )
(9.5.2)
and mn is chosen such that mn = o(n1/4). This estimator has the
196
Unit roots and cointegration
advantage over the White-Domowitz (1984) estimator (which is equal to (9.5.1) with Wj(m) = 1) that it is a.s. non-negative. In the discussion of the Newey-West estimator it is convenient to drop the subscript of mn and to assume that Yo is observable, so that (9.5.1) then becomes: a2 = ftO) + 2£™1w1(m)y(i),
(9.5.3)
where for i = 0,1,2,..., y(i) = (l/n)E^i + 1 et-A.
(9.5.4)
Similarly, denote a2 = y(0) + 2££,Wi(m)y(i),
(9.5.5)
with 7(i) = (l/nJE^i+iUt-iUt.
(9.5.6)
Theorem 9.5.1 Let AYt = Ut and let assumption 9.4.1 hold. Then 62 — d2 = (^(m^/m/n). Proof'. We prove the theorem only for the case that Ut is strictly stationary and et is the residual of the auxiliary regression (9.4.1). Substituting et = Ut — (ai n —l)Y t _i yields
- ( l / n ^ - i + A - i Y t . i . , -(l/n)Y:?. I + i (Y t _ i _ 1 -Y t _,)U t - d / n ) E ? - i + iY,_,Ut + (a l n -l)d/n)y:r_ 1 + i Y t _ i _ I Y t _ I . It follows easily from (9.2.10) that for i < m and m = o(n), m U t - i Y . - i - , = ( l / n ^ r l u . Y , . , = Op(l)
(9.5.7) (9.5.8)
and similarly, (l/n)E?=i +i U t Y t _, = Op(l).
(9.5.9)
Moreover, it is easy to verify that for some constant M, not depending on t and i, E[|Ut_i(Yt_, — Y t _j_ ,)[]< Mi'72, hence for i < m, (l/n)i:r_ l + i U t _ i (Y t _ l -Y t _ i _,) = Op(Vm),
(9.5.10)
and similarly (9.5.11)
The Newey-West estimator of a1
197
Furthermore, using the Cauchy-Schwarz inequality and lemma 9.2.2 it follows easily that
(l/nJEf-i+iYt-i-iYt-! < (l/n)£[LiY? = Op(n).
(9.5.12)
Finally, under Ho we have: a l n - l - Op(l/n).
(9.5.13)
Combining (9.5.7)-(9.5.13) now yields: maxo
(9.5.14)
from which the theorem easily follows.
Q.E.D. 2
2
Theorem 9.5.2 Under assumption 9.4.1, o = 5 + Op(m2/n), where a 2 = (l/n)Er=T +1 [(l/v / m)E J =o 1 U t+j ] 2 . Proof. Exercise 2. Theorem 9.5.3 Under assumption 9.4.1, strict stationarity and the condition m —• oo at rate o ^ n ) , plimn^ ^a2 = a2. Proof: We can write a2 = (l/n)^" = 1 Vt(m), where Vt(m) = [ ( l / V ^ E ' o ' U t - j ] 2 if t > m; Vt(m) = 0 if t < m.
(9.5.15)
Now for fixed m, V t (m) is an a-mixing process with mixing coefficient a m(j) = »G~" m ) ^ j > m> a m0) = 1 ^ j < m ? a n ( i s o is Vt( m )I[V t (m)
(9.5.16)
hence var{(l/n)Et=iV t (m)I[V t (m)
(9.5.17)
Thus if K = o(^/n) and m = o(i/n), then ^ K ] -E{V t (m)I[V t (m)
198
Unit roots and cointegration
mapping theorem (theorem 9.1.2) that fK(Vi(m)) => fK(
(9.5.20)
uniformly in t. Moreover, by assumption we have E[Yt(m)] -> ex2 if m -+ oo,
(9.5.21)
uniformly in t. Combining (9.5.19)-(9.5.21) yields E{Vt(m)I[Vt(m) > K]} -> E{<72W(l)2I[
(9.5.23)
Combining (9.5.18) and (9.5.23), the theorem follows.
Q.E.D.
Combining theorems 9.5.1-9.5.3 now yields: Theorem 9.5.4 Under assumption 9.4.1, strict stationarity and the condition m —• oo at rate o ^ n ) , the Newey-West estimator (9.5.3) is consistent: plimn^oocx2 = a1. Note that this result (and its proof) differs from the original result of Newey and West in that here m = o(^/n) rather than m = o(n 1/4 ). However, Andrews (1991) following different lines arrives at the same conclusion as theorem 9.5.4. Finally, one may wonder why we have not based the Newey-West estimator involved on AYt = U t directly rather than on the OLS residuals e t. The reason is that if we replace e t by AYt then under the stationarity hypothesis, plim n ^ ^a2 = 0. Consequently, plim n ^ ooZin/n would then be less negative than in the case of a residual-based NeweyWest estimator, and therefore the tests would have less power. Exercises 1. Prove theorem 9.5.1 for the case that the e t's are the residuals of the auxiliary regression (9.4.2). 2. Prove theorem 9.5.2. 3. Suppose that the Y t 's are i.i.d. N(0,<x2). Let Zfn be the test statistic Z l n with
Unit root versus deterministic trend
199
Phillips-Perron test will suffer from a similar problem, namely in the case that the true data-generating process is of the type Hx: Y t = a Y t _ i + /Jt + y + U t
(9.6.1)
with |a| < 1, (3 7^ 0 and U t satisfying assumption 9.4.1. Take for example the case Y t = /?t + y + et; et~ NID(O,1), with (3 ^ 0. Then it is easy to verify that the OLS estimator a2n of the parameter a2 in model (9.4.2) satisfies n(a2n — 1) = o p (l), hence the Dickey-Fuller test in section 9.3 and the Phillips-Perron test in section 9.4 will have no power against this alternative. In order to cover this case as well, Dickey and Fuller (1979, 1981) also propose a test of the null hypothesis a = 1, /? = y = 0 in model (9.6.1) with (U t ) white noise. Phillips and Perron (1988) modified this test for the case of mixing (U t ). In deriving this test it is convenient to rewrite model (9.6.1) as Y t = 0'X t + U t , where X t = (Y t _i,t,l)'; 9 = (a,/3,y)\
(9.6.2)
Again the null hypothesis to be tested is that AYt = U t , where (U t ) satisfies assumption 9.4.1. Clearly, under H o we have 9 = (1,0,0)'. The OLS estimator 9n = (an,/3n,yny of 9 satisfies
9n -9 = (a n -iA, 7n y = ElUXtXtT'E^XtUt
Lt=2 Y t-l Most of the partial sums in (9.6.3) involving Y t and U t have been encountered before, except X^"=2tYt-i and X)" =2 tU t . Their limiting behavior under H o is derived in the following two lemmas. Lemma 9.6.1 Then
Let AYX = U t and let assumption 9.4.1 hold.
J =>
Proof: The lemma follows from the following equalities:
[l/(nVn)]E?=i'Y, + Ej=iUj) + [l/(nVn)]E?=i'(Yo + E|=iUj) = [l/(nVn)]E"i't[Y 0 + crv/nWnCt/n)] V i ' [ Y o +
+
^E^'
200
Unit roots and cointegration
+ [l/(nVn)]E?="i'Y0 + <x(l/n2)££i'Wn(t/n) and J0'rWn(r)dr rWn(r)dr = (l/n)Er=o'Wn(t/n)J (r)dr = E tt o! !^^ rW(r)dr (l/n)Er'W(t/n)Jtt(;n+1)/n rdr = (l/n)ESjW n (t/n)[(t/n) + 'An'2] = (l/n2)Et=o'tWn(t/n))+ ' / ^ W ^ d r .
Q.E.D.
Lemma 9.6.2 Let assumption 9.4.1 hold. Then 2 tU t
=
ffwn(l)
-^Wn(
Proof: This lemma is a straightforward corollary of the following more general lemma: Lemma 9.6.3 Let F be a differentiable real function on [0,1] with derivative f, let (xt) be an arbitrary sequence in R and let Sn(r) = £[=!*, if r G [n-'.ll; Sn(r) = 0 if r e [O.n"1). Then E F ( V n ) x = F(l)Sn(l) -Jorf(r)Sn(r)dr. Proof: Without loss of generality we may assume that F is constant on (l,oo), so that F(x) = F(l) for x > 1. Now observe that l)/n]-F[t/n]}Sn(t/n) = E[^{F[(t+ l)/n]-F[t/n]}Sn(t/n) Moreover, by rearranging terms it easily follows that: E"=.{F[(t+l)/n]-F[t/n]}Sn(t/n) = Et=i(F[(t+ l)/n]-F[t/n]}EJ=,Xj = Er=iEj=t{F[(J+l)/n]-F[j/n]}xt = E?=i[F(l +n" 1 )-F(t/n)]x t = FO+n-^EjL.Xj -E"=.F(t/n)x t = F(l)Sn(l) -E?=,F(t/n)x t . Combining these results, the lemma follows. Denoting rn =
/n 0 0 \ 0 n^n 0 \0 0 y/nj
it is now easy to verify that
Q.E.D.
Unit root versus deterministic trend
'o
0 0 \ /Jo'Wn(r)2dr ,(r)dr J 0 '\V n (r)dr
201
Jo'rWn(r)dr
Jo' W n (r)dr\ /a
1/3
1/2
0
0N
| [ 0 1 0 |+op(l).
1/2
'O/n)E!UYt-iUt
0 0
° V wn(i) Thus we have: Theorem 9.6.1 Let J Y t = U t and let assumption 9.4.1 hold. Then the OLS estimators an, /3n, and yn of the parameters a, (3, and y of model (9.6.1) satisfy [n(an— l),n^/n/3 n/oi,^n')'n/(T]' => M - 1 V , where
/rJ10'W(r)2dr J0'rW(r)2dr Jj W(r)dr J0'rW(r)dr
M =
Vjo'WMdr
1/3
1/2
1/2
1
W(1) Next we derive a more explicit expression for the limiting distribution of n(an— 1). Denote X, = J0'W(r)2dr -(J 0 'W(r)dr) 2 , X 2 = Jo(r-'/ 2 )W(r)dr, X 3 = J0'W(r)dr, X 4 = X, -\2X22(=
12det(M)).
202
Unit roots and cointegration
Then
1
M" = X7' V4 'I -12X - 2 v 6X2 - X 3 1
1 W
-12X2 6X2 - X 3 12X, -6X, + I2X2X3 -6X1 - 12X2X3 X 4 + 3X, - I2X2X3
and A(W(1) 2 -1)\ ( V= W(l)-X 3 + 0 \W(1) / \0 Hence we have: n{an
i
^(W(l)l)W(l)(6X
— 1) =>
—
2
+ X 3 ) + 12X2X3 ,
A4
1
A4
This result can be restated as: Theorem 9.6.2 Let AYt = U t and let assumption 9.4.1 hold. Then the OLS estimator an of the parameter a in model (9.6.1) satisfies
"v"n
*'^
J
' 24
where 3
_1(W(1) 2 - l)-2W(l)J 0 '(3r- l)W(r)dr + 6j o '(2r- l)W(r)drJ0'w(r)dr 12det(M)
Note that the distribution of Z 3 is tabulated in Fuller (1976, table 8.5.1). In particular, P(Z 3 < -21.5) « 0.05, P(Z 3 < -18.1) « 0.1. Finally, observe that under H o , det[rn-l[J2«=2XtXt']rn-1]
=* a 2 det(M),
hence: Theorem 9.6.3 Let AYt = U t with U t obeying assumption 9.4.1. Denote
Unit root versus deterministic trend ~2
203 "2
Z 3n = n(a n - 1) where an is the OLS estimator of the parameter a in model (9.6.1), and 62 and <J2U are defined in (9.4.5) and (9.5.1), where the et's are the OLS residuals of the auxiliary regression (9.6.1). Then Z 3n => Z 3 . Along similar lines we can construct test statistics for the hypotheses (3 = 0 and y = 0, and the joint hypothesis a = 1, j3 = y = 0. Exercise 1. An alternative to the above test is to apply the Phillips test to the OLS residuals et of the auxiliary regression Yt = 0t + y + Vt. Derive the asymptotic distribution of this alternative test statistic under the unit root hypothesis, and compare it with Z 3 . (Hint: Denote W?(r) = e[rn]/(
G
[ n - \ l ] ; W?(r) = 0 if r G [O^" 1 ]
and show that W^ => W D , with W D a function of a Wiener process W [WD is called the detrended Wiener process]. Then replace W in (9.4.3) by W D .)
9.7 Introduction to cointegration Consider a vector time series process X t G Rk such that each component Xit is integrated of order one, i.e., each of the series Xi t contains a unit root, but JXi t is a zero mean stationary process. Referring to the Wold decomposition theorem (cf. theorem 7.2.1), we may therefore write: AXt = C(L)U t = (E~ O CjL j )U t = 5:~oCjU t H ,
(9.7.1)
where L is the lag operator, U t is a k-variate white-noise process: E(Ut) = 0, E(UtUt') = I, E(U t U' t _j) = O for j = 1,2,...,
(9.7.2)
and the coefficients matrices Cj are such that Ej^oCjCj' exists.
(9.7.3)
Note that there is no loss of generality in assuming E(U t U t ') = I. If E(UtUt') = Q, then one may replace U t by U* = Q~ '/2Ut and the Cj's by Cfi~ v\ Moreover, note that (9.7.3) implies that
204
Unit roots and cointegration
C(l) = Ej'oCj exists.
(9.7.4)
Denote B(L) = C(L)-C(1). Then B(l) = O, hence all the elements of B(L) have a unit root, and consequently we can write C(L) = C(l) + (1-L)C*(L).
(9.7.5)
Thus, (l-L)X t = C(l)Ut + (l-L)C*(L)U t . Now assume that the matrix C(l) is singular. Then there exists a nonzero vector a such that a'C(l) = 0'. Denoting Zt = a'X t , we then have ( l - L ) Z t = (l-L)a'X t = a'C(l)U t + (1-L)a'C*(L)U t = (l-L)a'C*(L)U t , hence, Zt = a'C*(L)Ut. Thus Zt is stationary if C*(L)Ut is. A sufficient condition for the latter is that the elements of C*(L) are rational lag polynomials, C*(L) = [c^(L)],withc^(L) = a^(L)/b^(L), where a*j(L) and b*j(L) are finite lag polynomials with roots all outside the unit circle. In its turn a sufficient condition for this is that the elements of C(L) are rational lag polynomials: C(L) = (Cij(L)), with Cy(L) = aij(L)/bij(L), where a^L) and by(L) are finite lag polynomials with roots all outside the unit circle. (9.7.6) Then say, where dy(L) is a finite lag polynomial. Denoting the order of dy by my, we can write dij(L) = n ^ ^ - L ) , where the <5yS are the roots of dy(L). However, since by construction dy(L) contains a unit root, one of the <5yS equals one, say 8^\ = 1. Then C*(L) = (ctj(L)),withctj(L) = n ^ ^ j s We may take a equal to an eigenvector of C(l)' corresponding to a
Error correction form of cointegrated systems
205
zero eigenvalue. Thus let A be the matrix of eigenvectors of C(l)' corresponding to the zero eigenvalues. Then A'X t is stationary. Summarizing, we have shown: Theorem 9.7.1 Let the k-variate time series process (Xt) be defined by (9.7.1H9.7.3) and (9.7.6). If m = rank(C(l)) < k then there exists a k x ( k - m ) matrix A with rank k —m such that Z t = A'X t is stationary. If C(l) is indeed singular then X t is called a cointegrated system, and the columns of A are called the cointegrated vectors. 9.8 Error correction form of cointegrated systems Following Engle (1987) and Engle and Yoo (1989), we show now how to write the model in autoregressive error correction form. Assume for simplicity that rank(C(l)) = k— 1, so that there is only one cointegrated vector a. Without loss of generality we may normalize a as follows: (9.8.1)
<x = (l,a 2 ,...,a k ) Now consider the matrix /I * =
Oil-
..c*k
\ a
0
|, where*, = [0,I k _i].
(9.8.2)
Ik- 1
\o Denoting a'C*(L)
(9.8.3)
,M(L) =
we can write a'AXt
U,=
(9.8.4)
206
Unit roots and cointegration
Next, denote f "l
I O'
M*(L) = '
(9.8.5) ,0
so that M*(L)M(L) = ( l - L ) I k , and assume that V l(L) is invertible, with inverse V(L). Then (1 - L)M*(L)
(9.8.6)
hence M*(L)
(9.8.7)
Denoting A(L) = V(L)M*(L)<£,
(9.8.8)
we now get the AR form of the model: A(L)X t = U t .
(9.8.9)
Next, observe that
/I 1
10'
.0
I O
I
a2-ak
a2--Qk\
0
0
I O k _i
Ik-i 0
I
/
\o
i
I (9.8.10)
where O k _i is the (k— 1) x (k— 1) zero matrix and y is the first column of V(l). Moreover, similarly to (9.7.5) we can write A(L) = A(1)L + ( l - L ) A (L).
(9.8.11)
Combining (9.8.9), (9.8.10), and (9.8.11) now yields the error correction form of the model: A*(L)JX t + ya'X t _i = U t .
(9.8.12)
Note that the lag of X t does not matter: replacing A (L) by, say,
A"(L) = A*(L) + EfJi'ya'L^
(9-8-!3)
Estimation and testing of cointegrated systems
207
model (9.8.12) becomes
A*\L)AXt + yaXt_p = Ut.
(9.8.14)
The latter form is the one used by Johansen (1988), with A**(L) a (p— 1)order lag polynomial: AXt
= BxAXt-X
+ ...+ B p _!zlX t _p + 1 - y a ' X t _ p + U t . (9.8.15)
9.9 Estimation and testing of cointegrated systems 9.9.1
The Engle-Gr anger approach
Working in the context of a bivariate system with at most one cointegrated vector, Engle and Granger (1987) propose estimating the cointegrated vector a = (1, a 2 )' by regressing the first component X l t of X t on the second component X 2t , using OLS, and then testing whether the OLS residuals Z t have a unit root, using the augmented DickeyFuller test of Said and Dickey (1984). However, since the latter test is conducted on estimated residuals, the tables of the critical values of this test in Fuller (1976) no longer apply. The correct critical values involved can be found in Engle and Yoo (1987). If the test rejects the unit root hypothesis, and thus accepts the cointegration hypothesis, one may substitute for a'X t _i in (9.8.12) the OLS residual Z t _ i and estimate the parameter matrices in A*(L) by OLS, assuming that A*(L) is a finite-order lag polynomial. Replacing a by its OLS estimate a = (1, a2) does not affect the asymptotic properties of the OLS estimators of the coefficients of A (L), because a 2 is super consistent: n(a 2 — ^2) = OpO)The above approach is only applicable if there is at most one cointegrated vector. Systems with dimension greater than two, however, may have multiple cointegrated vectors. In that case one may use the approach of Stock and Watson (1988), which is (more or less) a multivariate extension of the Engle-Granger approach and is related to Johansen's approach below. See Johansen (1991, sect.6). 9.9.2 Johansen's maximum likelihood approach In a recent series of influential papers, Johansen (1988, 1991) and Johansen and Juselius (1990) propose an ingenious and practical full maximum likelihood (ML) estimation and testing approach. The basic idea is quite simple: assume that the process X t is Gaussian, integrated of order 1, and observable for t = l,...,n. Moreover, assume that model (9.8.15), with a' and y r x k matrices of rank r < k, represents the
208
Unit roots and cointegration
conditional expectation of AXt relative to Xt_1,Xt_2,.-- Then the conditional distribution of AXt, relative to X t _ 1 ,X t _ 2 ,.-, is k-variate normal: AXt\Xt_!,Xt_2,... -N k (B!AX t _! + ... + Bp_!AXt_p+ x -ya'Xt_p,
A) (9.9.1)
Consequently, the log-likelihood function takes the form
f / t . i
+ rest,
(9.9.2)
where the rest term depends on the initial values X 0,...,X1_p only. The actual parameter matrix of interest is a, so let us concentrate out the other parameter matrices. First, if a, y, and A are assumed to be known the ML estimators of the matrices B 1,...,Bp_1 can be obtained simply by regressing AXt + yaXt_p on zlXt_1,...,zlXt_p+1, using OLS. The residuals of this regression are ROt + ya'Rpt, where ROt is the residual vector of the regression of AXt on JXt_1,...,z1Xt_p+15 and Rtp is the residual vector of the regression of X t _ p on (9.9.3) ZIX t _ lv ..,JX t _ p+1 . Thus the partially concentrated log-likelihood function, with only B lv ..,B p _i concentrated out, is £(a,y,A) = - !/2nln(det A) -V^D^p+iCRot + ya'RpO'^'VRot + ya'Rpt) + rest.
(9.9.4)
Similarly, we can concentrate y out, given a and A, by regressing ROt on a'Rpt, which yields the estimate y(a) = -Sop^a'SppC*)-1,
(9.9.5)
with (9.9.6) Sij = (l/n)E?=2RitRjt'' iJ = 0,p. Finally, we concentrate A out, given a, by substituting the well-known ML estimator of the variance matrix of a normal distribution with zero mean vector:
A(a) = (l/n)E"=1 = SQO -SopaCa'SppoT'a'Spo.
(9.9.7)
Thus, the concentrated log-likelihood function now becomes: £(a) = - Vin\n(det A(a)) + rest,
(9.9.8)
Estimation and testing of cointegrated systems
209
hence the ML estimator of a is found by solving the minimization problem min detlSoo-Sop^a'Spp^-^'Spo],
(9.9.9)
where the minimization is over all k × r matrices α. Now consider the well-known matrix relation
( A   B )   ( A    O              ) ( I   A^{−1}B )
( B'  C ) = ( B'   C − B'A^{−1}B  ) ( O   I       ),
where A and C are nonsingular matrices. Then
det(A)·det(C − B'A^{−1}B) = det(C)·det(A − BC^{−1}B').
Substituting A = S_{00}, B = S_{0p}α, and C = α'S_{pp}α, it follows that the ML estimator of α is obtained by solving the minimization problem
min_α det(α'S_{pp}α − α'S_{p0}S_{00}^{−1}S_{0p}α) / det(α'S_{pp}α).
(9.9.10)
Note, however, that the solution of (9.9.10) is not unique, as we may freely premultiply α' by a conformable nonsingular matrix. Next, let D be the diagonal matrix with diagonal elements λ̂_1,...,λ̂_k, where
λ̂_1 > λ̂_2 > ... > λ̂_k are the ordered roots of the polynomial det(λS_{pp} − S_{p0}S_{00}^{−1}S_{0p}),    (9.9.11)
with Q = (q_1,...,q_k) the corresponding matrix of eigenvectors. Then
S_{pp}QD = S_{p0}S_{00}^{−1}S_{0p}Q.
(9.9.12)
Moreover, since the roots (9.9.11) are the eigenvalues of the positive semi-definite matrix S_{pp}^{−1/2}S_{p0}S_{00}^{−1}S_{0p}S_{pp}^{−1/2}, and Q = S_{pp}^{−1/2}Q_*, with Q_* the matrix of eigenvectors of S_{pp}^{−1/2}S_{p0}S_{00}^{−1}S_{0p}S_{pp}^{−1/2}, we can normalize Q such that
Q'S_{pp}Q = I_k.
(9.9.13)
Now choose
α = Qξ, where ξ is a k × r matrix.
(9.9.14)
Then it follows from (9.9.13) and (9.9.14) that the minimization problem (9.9.10) becomes
min_ξ det(ξ'ξ − ξ'Dξ) / det(ξ'ξ),
(9.9.15)
where without loss of generality we may normalize ξ such that ξ'ξ = I_r
(because we may replace ξ by ξ(ξ'ξ)^{−1/2}). It is now easy to verify that the solution of (9.9.15) is
ξ̂ = (e_1,...,e_r), the first r columns of the unit matrix I_k,    (9.9.16)
hence the ML estimator of α is equal to the first r columns of Q, i.e.,
α̂ = (q_1,...,q_r),
(9.9.17)
and the corresponding maximum value of the likelihood function, given the restriction that α has rank r, is
L̂_max(r)^{−2/n} = const × ∏_{i=1}^r (1 − λ̂_i).
(9.9.18)
Consequently, the likelihood ratio test LR(k — r) for the hypothesis that there are at most r cointegrated vectors is:
−2·ln(LR(k−r)) = −n·Σ_{i=r+1}^k ln(1 − λ̂_i).
(9.9.19)
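The eigenvalue problem (9.9.11) and the trace statistics (9.9.19) are cheap to compute. The following sketch (Python with numpy; it is not part of Johansen's papers, and the function and variable names are illustrative only) shows one way to carry out the steps (9.9.3)-(9.9.12) and (9.9.19) for the model without deterministic terms; in applied work one would normally add an intercept and dummies as in (9.9.21), which changes the critical values.

```python
import numpy as np

def johansen_trace(X, p):
    """Minimal sketch of steps (9.9.3)-(9.9.12) and (9.9.19) for model (9.9.1)
    without intercept or seasonal dummies.
    X is an (n, k) array holding X_1,...,X_n; p >= 2 is the VAR order in levels."""
    n, k = X.shape
    dX = np.diff(X, axis=0)                          # Delta X_t, t = 2,...,n
    Y0 = dX[p - 1:]                                  # Delta X_t,  t = p+1,...,n
    Yp = X[:n - p]                                   # X_{t-p},    t = p+1,...,n
    Z = np.hstack([dX[p - 1 - j:n - 1 - j] for j in range(1, p)])  # lagged differences

    def resid(A):
        # OLS residuals of the columns of A on the lagged differences Z
        B, *_ = np.linalg.lstsq(Z, A, rcond=None)
        return A - Z @ B

    R0, Rp = resid(Y0), resid(Yp)                    # cf. (9.9.3)
    m = R0.shape[0]                                  # effective sample size
    S00, S0p, Spp = R0.T @ R0 / m, R0.T @ Rp / m, Rp.T @ Rp / m    # cf. (9.9.6)
    # ordered roots of det(lambda*Spp - Sp0 S00^{-1} S0p) = 0, cf. (9.9.11)
    M = np.linalg.solve(Spp, S0p.T @ np.linalg.solve(S00, S0p))
    lam = np.sort(np.linalg.eigvals(M).real)[::-1]
    # trace statistics -2 ln LR(k - r) of (9.9.19), for r = 0,1,...,k-1
    trace = [-m * np.log(1.0 - lam[r:]).sum() for r in range(k)]
    return lam, trace
```

Comparing the entries of trace with the critical values in table 9.9.1 (with m = k − r) gives the estimated cointegration rank; the cointegrating vectors themselves are the eigenvectors of (9.9.11) corresponding to the r largest roots, normalized as in (9.9.13).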
It is well known that under standard conditions, −2 times the log of the LR test statistic converges in distribution to a χ² distribution with degrees of freedom equal to the number of constraints (in the present case, k − r). The present situation, however, is far from standard, and therefore the usual conclusions from maximum likelihood theory (cf. section 4.5) no longer automatically apply. In particular, the limiting distribution of the test statistic (9.9.19) is now a function of a (k − r)-variate standard Wiener process. Johansen (1988, table 1, p. 239) has calculated critical values on the basis of Monte Carlo simulations (Table 9.9.1). Moreover, Johansen (1988) shows that the ML estimators are consistent. Since the proofs are tedious, we shall not give them here. Since cointegrated vectors represent long-run relations between integrated variables, it is important to test whether certain components of cointegrated vectors are zero. Also, in economics, for example, one may ask the question whether the log of real GNP is stationary, while the log of nominal GNP and the log of the GNP deflator are integrated. This hypothesis corresponds to a cointegrated vector proportional to (1, −1)'. These types of hypotheses can be formulated as linear restrictions:
H_0: α = Hφ,
(9.9.20)
where H is a known k × s matrix of full rank and φ is an s × r matrix of parameters, with r ≤ s < k. The likelihood ratio test statistic −2·ln(LR*) involved, which is easy to derive along the above lines, has a limiting χ² distribution with r(k − s) degrees of freedom. See Johansen (1988, theorem 4).
Table 9.9.1  Asymptotic critical values of the LR test −2·ln(LR(m))

          Significance level (percent)
m         10        5         2.5
1         2.9       4.2       5.3
2         10.3      12.0      13.9
3         21.2      23.8      26.1
4         35.6      38.6      41.2
5         53.6      57.2      60.3
Finally, in Johansen (1991) the model (9.8.15) is augmented with a vector of intercepts μ and a vector of seasonal dummies D_t orthogonal to μ:
ΔX_t = B_1ΔX_{t−1} + ... + B_{p−1}ΔX_{t−p+1} − γα'X_{t−p} + ΦD_t + μ + U_t.    (9.9.21)
The derivation of the concentrated likelihood function and the LR test for the number of cointegrated vectors is similar to above. However, the presence of an intercept has a considerable impact on the asymptotic distribution of the LR test. In particular, it matters whether the vector μ of intercepts can be written as
μ = γα_0, with α_0 an r × 1 vector,
(9.9.22)
or not. If so, then
γα'X_{t−p} + μ = γα*'X*_{t−p}, with α* = (α', α_0)' and X*_t = (X_t', 1)',    (9.9.23)
so that then the cointegrated relations have intercepts rather than the model for ΔX_t itself. See further Johansen (1991).
10 The Nadaraya-Watson kernel regression function estimator

This chapter reviews the asymptotic properties of the Nadaraya-Watson type kernel estimator of an unknown (multivariate) regression function. Conditions are set forth for pointwise weak and strong consistency, asymptotic normality, and uniform consistency. These conditions cover the standard i.i.d. case with continuously distributed regressors, as well as the cases where the distribution of all, or some, regressors is discrete. Moreover, attention is paid to the problem of how the kernel and the window width should be specified. This chapter is a modified and extended version of Bierens (1987b). For further reading and references, see the monographs by Eubank (1988), Hardle (1990), and Rosenblatt (1991), and for an empirical application, see Bierens and Pott-Buter (1990).

10.1 Introduction
The usual practice in constructing regression models is to specify a parametric family for the response function. Obviously the most popular parametric family is the linear model. However, one could consider this as choosing a parametric functional form from a continuum of possible functional forms, analogously to sampling from a continuous distribution, for often the set of theoretically admissible functional forms is uncountably large. Therefore the probability that we pick the true functional form in this way is zero, or at least very close to zero. The only way to avoid model misspecification is to specify no functional form at all. But then the problem arises of how information about the functional form of the model can be derived from the data. A possible solution to this problem is to use so-called kernel estimators of regression functions. Since the pioneering papers of Nadaraya (1964) and Watson (1964) on kernel regression function estimators, a growing body of literature has emerged on the problem of nonparametric estimation of unknown regression functions. See Collomb (1981, 1985a) and Bierens (1987b) for a bibliography. Most of the literature on nonparametric regression function estimation deals with the kernel method and its variants.
In this section we shall now introduce the kernel regression function estimator for the case where we have an i.i.d. sample {(Y_1,X_1),...,(Y_n,X_n)} from an absolutely continuous (k+1)-variate distribution with density f(y,x), where y ∈ R and x ∈ R^k. In this data set the Y_j's are the dependent variables and the X_j's are k-component vectors of regressors. If E(|Y_j|) < ∞, then by definition (cf. chapter 3) the conditional expectation of Y_j relative to X_j exists and takes the form
E(Y_j|X_j) = g(X_j) a.s.,    (10.1.1)
with g(.) a Borel measurable real function on R^k. Denoting
U_j = Y_j − g(X_j),
(10.1.2)
we then get the regression model
Y_j = g(X_j) + U_j,    (10.1.3)
where by construction the error term U_j satisfies the usual condition that its conditional expectation relative to the vector of regressors equals zero with probability 1, i.e., E(U_j|X_j) = 0 a.s.
(10.1.4)
The model (10.1.3) is therefore purely tautological; that is, its set-up is merely a matter of definition. Now our aim is to estimate the regression function g(.) without making explicit assumptions about its functional form. For the data-generating process under review the regression function g(.) takes the well-known form
g(x) = ∫ y f(y,x) dy / h(x), if h(x) > 0,    (10.1.5)
where h(x) is the marginal density of f(y,x), i.e.,
h(x) = ∫ f(y,x) dy.    (10.1.6)
This suggests estimation of the function g(x) via estimating the densities f and h. A convenient method for estimating unknown multivariate density functions is the kernel density estimation method proposed by Rosenblatt (1956b). Important contributions to the asymptotic theory of this class of estimators have been made by Parzen (1962) for the univariate case and Cacoullos (1966) for the multivariate case. See Fryer (1977) and Tapia and Thompson (1978) for reviews. A kernel estimator of the density h(x) is a random function of the form
ĥ(x) = (1/n)Σ_{j=1}^n K[(x − X_j)/γ_n]/γ_n^k,
(10.1.7)
where K[.] is an a priori chosen real function on R^k, called the kernel, satisfying
∫|K(x)|dx < ∞, ∫K(x)dx = 1,
(10.1.8)
and (γ_n) is an a priori chosen sequence of positive numbers, called window width parameters, satisfying
lim_{n→∞} γ_n = 0, lim_{n→∞} nγ_n^k = ∞.
(10.1.9)
Under conditions (10.1.8) and (10.1.9) the estimator ĥ(x) is pointwise weakly consistent in every continuity point of h(x), provided
sup_x h(x) < ∞.
(10.1.10)
The proof of this proposition is simple but instructive. First, the asymptotic unbiasedness follows from
E[ĥ(x)] = ∫ γ_n^{−k} K[(x − z)/γ_n] h(z) dz = ∫ h(x − γ_n z) K(z) dz → h(x) ∫ K(z) dz = h(x)
(10.1.11)
by bounded convergence. Second, the variance vanishes at order O[1/(nγ_n^k)], as
nγ_n^k·var[ĥ(x)] = E{γ_n^{−k}K[(x − X_j)/γ_n]²} − γ_n^k (E{γ_n^{−k}K[(x − X_j)/γ_n]})²
= ∫ h(x − γ_n z) K(z)² dz − γ_n^k [∫ h(x − γ_n z) K(z) dz]² → h(x) ∫ K(z)² dz
(10.1.12)
by bounded convergence. This completes the pointwise weak consistency proof. We shall now construct a kernel density estimator f̂(y,x) of the joint density f(y,x) such that ĥ(x) is the marginal density of f̂(y,x) and the integral ∫ y f̂(y,x) dy yields an expression involving the same kernel K as in (10.1.7). This kernel estimator of f(y,x) is of the form
f̂(y,x) = (1/n)Σ_{j=1}^n K*[(y − Y_j)/γ_n, (x − X_j)/γ_n]·γ_n^{−k−1},
(10.1.13)
where the kernel K* satisfies
∫ y K*(y,x) dy = 0, ∫ K*(y,x) dy = K(x).
(10.1.14)
Then ĥ(x) is the marginal density of f̂(y,x) and moreover,
∫ y f̂(y,x) dy = (1/n)Σ_{j=1}^n Y_j K[(x − X_j)/γ_n] γ_n^{−k},
(10.1.15)
hence the corresponding regression function estimator of (10.1.5) is
ĝ(x) = {Σ_{j=1}^n Y_j K[(x − X_j)/γ_n]} / {Σ_{j=1}^n K[(x − X_j)/γ_n]}.
(10.1.16)
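As an illustration, the following minimal sketch (Python with numpy; not part of the original text) evaluates (10.1.16) at a point x, taking K to be the k-variate standard normal density; the factors γ_n^{−k} and (2π)^{−k/2} cancel between the numerator and the denominator of (10.1.16) and are therefore omitted. The window-width choice in the example anticipates (10.2.22).

```python
import numpy as np

def nw_estimate(x, X, Y, gamma):
    """Nadaraya-Watson estimate (10.1.16) at the point x, with the
    k-variate standard normal density as kernel K."""
    U = (np.atleast_1d(x)[None, :] - X) / gamma      # (x - X_j)/gamma_n
    w = np.exp(-0.5 * np.sum(U ** 2, axis=1))        # kernel weights K[(x - X_j)/gamma_n]
    return np.sum(w * Y) / np.sum(w)

# Illustration with simulated data: g(x) = sin(x), k = 1, gamma_n = n^(-1/(k+4))
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-3.0, 3.0, size=(n, 1))
Y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(n)
print(nw_estimate([1.0], X, Y, n ** (-1.0 / 5.0)))   # close to sin(1) = 0.841...
```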
The estimator (10.1.16) is the so-called Nadaraya-Watson kernel regression function estimator, named after Nadaraya (1964) and Watson (1964). Note that this kernel regression function estimator is a weighted mean of the dependent variables Y_j, where the weights sum up to 1. In particular, if the kernel is chosen to be a unimodal density function with zero mode, for instance the density of the k-variate standard normal distribution, then the closer x is to X_j, the more weight is put on Y_j. Similarly to (10.1.11) and (10.1.12) it follows now that
E[ĝ(x)ĥ(x)] = E{Y_j K[(x − X_j)/γ_n] γ_n^{−k}} = E{g(X_j) K[(x − X_j)/γ_n] γ_n^{−k}}
= ∫ g(z) h(z) K[(x − z)/γ_n] γ_n^{−k} dz = ∫ g(x − γ_n z) h(x − γ_n z) K(z) dz → g(x)h(x)
(10.1.17)
by bounded convergence, and
nγ_n^k·var[ĝ(x)ĥ(x)] = E{Y_j² K[(x − X_j)/γ_n]² γ_n^{−k}} − γ_n^k (E{Y_j K[(x − X_j)/γ_n] γ_n^{−k}})²
= ∫ σ²(z) h(z) K[(x − z)/γ_n]² γ_n^{−k} dz − γ_n^k {E[ĝ(x)ĥ(x)]}²
= ∫ σ²(x − γ_n z) h(x − γ_n z) K(z)² dz − γ_n^k {E[ĝ(x)ĥ(x)]}² → σ²(x) h(x) ∫ K(z)² dz
(10.1.18)
by bounded convergence, where
σ²(x) = E(Y_j²|X_j = x) for h(x) > 0,    (10.1.19)
provided x is a continuity point of g(x), h(x), and σ²(x), and
sup_x |g(x)|h(x) < ∞, sup_x σ²(x)h(x) < ∞.
(10.1.20)
Now it is easy to verify from (10.1.11), (10.1.12), (10.1.17), and (10.1.18) that
plim_{n→∞} ĥ(x) = h(x), plim_{n→∞} ĝ(x)ĥ(x) = g(x)h(x)
(10.1.21)
and hence
plim_{n→∞} ĝ(x) = g(x)
(10.1.22)
in every continuity point x of h(x) and g(x)h(x) for which h(x) > 0. The weak consistency of the kernel regression function estimator is not limited to the case that Xj is continuously distributed, as is shown by Devroye (1978) and Bierens (1983). We shall consider the (partly) discrete case later on, in section 10.4. Uniform consistency will also be considered later in this chapter. Moreover, strong consistency results have been derived by Nadaraya (1965, 1970) and Noda (1976). Here we shall give a different proof of the strong consistency of the kernel estimator.
First, consider the following special case of Whittle's (1960) lemma.
Lemma 10.1.1 Let Z_1,...,Z_n be independent random variables with E(Z_j⁴) ≤ M < ∞, j = 1,...,n. Then E({Σ_{j=1}^n [Z_j − E(Z_j)]}⁴) ≤ 3n²M.
Proof: Exercise 1. (Hint: Relate a_n = E({Σ_{j=1}^n [Z_j − E(Z_j)]}⁴) to a_{n−1}.)
From this lemma and (10.1.10) it follows that
E({ĥ(x) − E[ĥ(x)]}⁴) ≤ [3/(n²γ_n^{4k})] E{K[(x − X_1)/γ_n]⁴}
= [3/(n²γ_n^{3k})] ∫ h(x − γ_n z) K(z)⁴ dz = O[1/(n²γ_n^{3k})],
(10.1.23)
hence from Chebishev's inequality
P{|ĥ(x) − E[ĥ(x)]| > ε} = O[1/(n²γ_n^{3k})]
(10.1.24)
for every ε > 0. Thus if γ_n is chosen such that
Σ_{n=1}^∞ (n²γ_n^{3k})^{−1} < ∞
(10.1.25)
then it follows from theorem 2.1.2 that
ĥ(x) − E[ĥ(x)] → 0 a.s.
(10.1.26)
Now assume
E(Y_j⁴) < ∞
(10.1.27)
and let
σ₄(x) = E(Y_j⁴|X_j = x), h(x) > 0,
(10.1.28)
be such that
sup_x σ₄(x)h(x) < ∞.
(10.1.29)
Then, similarly to (10.1.23),
E({ĝ(x)ĥ(x) − E[ĝ(x)ĥ(x)]}⁴) ≤ [3/(n²γ_n^{3k})] ∫ σ₄(x − γ_n z) h(x − γ_n z) K(z)⁴ dz = O[1/(n²γ_n^{3k})],
(10.1.30)
hence condition (10.1.25) implies
ĝ(x)ĥ(x) − E[ĝ(x)ĥ(x)] → 0 a.s.
(10.1.31)
Together with (10.1.11) and (10.1.17), the results (10.1.26) and (10.1.31) imply ĝ(x) → g(x) a.s., provided h(x) > 0.
Summarizing, we have shown:
Theorem 10.1.A Let x ∈ R^k be such that h(x) > 0.
(i) If γ_n → 0, nγ_n^k → ∞, sup_x h(x) < ∞, sup_x |g(x)|h(x) < ∞ and sup_x σ²(x)h(x) < ∞, then ĝ(x) → g(x) in pr.
(ii) If in addition E(Y_j⁴) < ∞, sup_x σ₄(x)h(x) < ∞ and Σ_{n=1}^∞ n^{−2}γ_n^{−3k} < ∞, then ĝ(x) → g(x) a.s.
Exercises
1. Prove lemma 10.1.1.
2. Are the conditions on the window width for strong consistency stronger than those for weak consistency?
3. Do we need the condition that Y_j is continuously distributed?

10.2 Asymptotic normality in the continuous case

The kernel regression estimation approach distinguishes itself from other nonparametric regression methods in that asymptotic distribution theory is fairly well established. In particular, the asymptotic normality of the kernel regression function estimator under the conditions under review has been proved by Schuster (1972) for the univariate case (k = 1). Here we shall derive asymptotic normality, in a somewhat different but much easier way, for the general case k ≥ 1. Observe from (10.1.3), (10.1.7), and (10.1.16) that
[ĝ(x) − g(x)]ĥ(x) =
(1/n)Σ_{j=1}^n U_j K[(x − X_j)/γ_n] γ_n^{−k}
+ (1/n)Σ_{j=1}^n ([g(X_j) − g(x)]K[(x − X_j)/γ_n]γ_n^{−k} − E{[g(X_j) − g(x)]K[(x − X_j)/γ_n]γ_n^{−k}})
+ (1/n)Σ_{j=1}^n E{[g(X_j) − g(x)]K[(x − X_j)/γ_n]γ_n^{−k}} = q_1(x) + q_2(x) + q_3(x),
(10.2.1)
say. We shall now set forth conditions such that, firstly,
√(nγ_n^k)·q_1(x) → N(0, σ_u²(x)h(x)∫K(z)²dz) in distr.,
(10.2.2)
where, for h(x) > 0,
σ_u²(x) = E(U_j²|X_j = x)
(10.2.3)
is the conditional variance of Uj. Secondly, we show that
lim_{n→∞} E{[√(nγ_n^k)·q_2(x)]²} = 0
(10.2.4)
and finally we set forth conditions such that lim_{n→∞} γ_n^{−2}·q_3(x) exists.
(10.2.5)
The conditions we need are the following. Let for p > 0
σ_p(x) = E(|U_j|^p | X_j = x),    (10.2.6)
provided E(|U_j|^p) < ∞ and h(x) > 0.
Assumption 10.2.1 There exists a δ > 0 such that σ_{2+δ}(x)h(x) is uniformly bounded. The functions g(x)²h(x) and σ_u²(x)h(x) are continuous and uniformly bounded, and the functions h(x) and g(x)h(x) are twice continuously differentiable with uniformly bounded derivatives.
Denote
v_{nj}(x) = U_j K[(x − X_j)/γ_n] γ_n^{−k/2}.    (10.2.7)
Since √(nγ_n^k)·q_1(x) = (1/√n)Σ_{j=1}^n v_{nj}(x), it suffices to show that the double array (v_{nj}(x)) satisfies the conditions of Liapounov's central limit theorem (cf. theorem 2.4.3). Thus the results
E[v_{nj}(x)] = 0,
(10.2.8)
E[v_{nj}(x)²] = E{U_j²K[(x − X_j)/γ_n]²γ_n^{−k}} = ∫ σ_u²(x − γ_n z) h(x − γ_n z) K(z)² dz → σ_u²(x)h(x)∫K(z)²dz,
(10.2.9)
n^{−δ/2}·E[|v_{nj}(x)|^{2+δ}] = O[1/(√(nγ_n^k))^δ] → 0, for some δ > 0
(10.2.10)
and
imply
(1/√n)Σ_{j=1}^n v_{nj}(x) → N(0, σ_u²(x)h(x)∫K(z)²dz) in distr.
(10.2.11)
This proves (10.2.2). Next, observe that similarly to (10.1.12),
E{[√(nγ_n^k)·q_2(x)]²} = ∫[g(x − γ_n z) − g(x)]²h(x − γ_n z)K(z)²dz − γ_n^k{∫[g(x − γ_n z) − g(x)]h(x − γ_n z)K(z)dz}² → 0
(10.2.12)
by bounded convergence. This proves (10.2.4). Finally, observe that similarly to (10.1.11),
q_3(x) = ∫[g(x − γ_n z) − g(x)]h(x − γ_n z)K(z)dz
= ∫[g(x − γ_n z)h(x − γ_n z) − g(x)h(x)]K(z)dz − g(x)∫[h(x − γ_n z) − h(x)]K(z)dz
= −γ_n∫z'(∂/∂x)[g(x)h(x)]K(z)dz + ½γ_n²∫z'(∂²/∂x∂x')[g(x − γ_nξ_n(x,z)z)h(x − γ_nξ_n(x,z)z)]zK(z)dz
 + γ_n g(x)∫z'(∂/∂x)h(x)K(z)dz − ½γ_n²g(x)∫z'(∂²/∂x∂x')h(x − γ_nξ_n*(x,z)z)zK(z)dz,    (10.2.13)
where 0 ≤ ξ_n(x,z) ≤ 1 and 0 ≤ ξ_n*(x,z) ≤ 1. The last equality in (10.2.13) follows from Taylor's theorem. Thus if we choose K such that
∫xK(x)dx = 0, ∫xx'K(x)dx = Ξ is
finite,
(10.2.14)
then the first and third terms at the right-hand side of (10.2.13) vanish, while by bounded convergence the second and fourth terms, divided by γ_n², converge. Thus,
lim_{n→∞} γ_n^{−2}·q_3(x) = b(x),    (10.2.15)
where
b(x) = ½∫z'(∂²/∂x∂x')[g(x)h(x)]zK(z)dz − ½g(x)∫z'(∂²/∂x∂x')h(x)zK(z)dz.    (10.2.16)
This proves (10.2.5). From (10.2.1), (10.2.2), (10.2.4), and (10.2.5) the following theorem easily follows (exercise 1).
Theorem 10.2.1 Let assumption 10.2.1 and condition (10.2.14) hold and let h(x) > 0. If
lim_{n→∞} γ_n²√(nγ_n^k) = μ with 0 ≤ μ < ∞
(10.2.17)
then
√(nγ_n^k)[ĝ(x) − g(x)] → N{μb(x)/h(x), [σ_u²(x)/h(x)]∫K(z)²dz}    (10.2.18)
in distribution. If
lim_{n→∞} γ_n²√(nγ_n^k) = ∞
(10.2.19)
then
plim_{n→∞} γ_n^{−2}[ĝ(x) − g(x)] = b(x)/h(x).
(10.2.20)
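To see what these two regimes mean in practice, the following small simulation (an illustration only, not from the original text; it assumes k = 1, g(x) = sin(x), a Gaussian kernel, and c = 1) evaluates the estimator at x = 1 with the window width γ_n = n^{−1/(k+4)}, for which (10.2.18) applies, so the error scaled by n^{2/(k+4)} = n^{0.4} should remain of the same order as n grows.

```python
import numpy as np

def nw(x, X, Y, gamma):                              # Nadaraya-Watson estimate, Gaussian kernel
    w = np.exp(-0.5 * np.sum(((x - X) / gamma) ** 2, axis=1))
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(1)
for n in (250, 1000, 4000, 16000):
    X = rng.uniform(-3.0, 3.0, size=(n, 1))
    Y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(n)
    err = nw(np.array([1.0]), X, Y, n ** (-0.2)) - np.sin(1.0)   # gamma_n = n^(-1/(k+4)), k = 1
    print(n, round(err, 4), round(err * n ** 0.4, 4))            # scaled error roughly stable
```

Taking instead a much larger fixed γ_n drives the scaled error up through the bias term μb(x)/h(x), which is the trade-off discussed next.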
Note that the latter result may be considered as convergence in distribution to a degenerate limiting distribution. At first sight it looks attractive to choose the window width γ_n such that μ = 0, as then the asymptotic bias vanishes. However, in that case the asymptotic rate of convergence in distribution is lower than in the case μ > 0, as (10.2.17) implies
√(nγ_n^k)/n^{2/(k+4)} → μ^{k/(k+4)} as n → ∞.
(10.2.21)
This corresponds to the fact that minimizing the integrated mean square error E{∫[ĝ(x)ĥ(x) − g(x)h(x)]²dx} yields an optimal window width of the form (10.2.17) with μ > 0. Thus the window width γ_n which gives the maximum rate of convergence in distribution is
γ_n = c·n^{−1/(k+4)},
(10.2.22)
where c > 0 is a constant. Since μ = c^{(k+4)/2}, we have the following corollary.
Theorem 10.2.2 Let the conditions of theorem 10.2.1 hold. With the window width (10.2.22) we have
n^{2/(k+4)}[ĝ(x) − g(x)] → N{c²b(x)/h(x), c^{−k}[σ_u²(x)/h(x)]∫K(z)²dz}.    (10.2.23)
Notice that the asymptotic rate of convergence in distribution is negatively related to the number of regressors. This is typical for nonparametric regression, for the more regressors we have, the more information we ask from the data and thus the more observations we need to get a useful answer. The result (10.2.23) only has practical significance if it is possible to estimate the mean and the variance of the limiting normal distribution involved. As far as the variance is concerned, consistent estimation will appear to be feasible. Regarding the mean, however, the estimation problem is too hard. Inspecting the function b(x) (cf. (10.2.16)) reveals that estimating this function is awkward, as b(x) is a function of the second derivatives of the unknown functions h(x) and g(x)h(x). It would therefore be preferable to eliminate the mean of the limiting normal distribution. We have already mentioned a way of doing that, namely to choose the window width such that the limit \i in (10.2.17) is zero, but then we also sacrifice some of the speed of convergence. There is, however, another way to remove the asymptotic bias while maintaining
the maximal rate of convergence in distribution of order n^{2/(k+4)}, namely by combining the results (10.2.18) and (10.2.20). The idea is to use (10.2.20) for estimating the mean of the limiting normal distribution in (10.2.18) by subtracting the random function at the left-hand side of (10.2.20) times μ from the left-hand side of (10.2.18).
Theorem 10.2.3 Let the conditions of theorem 10.2.1 hold. Let ĝ_1(x) be the kernel regression estimator with window width (10.2.22) and let ĝ_2(x) be the kernel regression estimator with window width γ_n = c·n^{−δ/(k+4)}, with δ ∈ (0,1). Denote
ĝ(x) = [ĝ_1(x) − n^{−2(1−δ)/(k+4)}ĝ_2(x)]/[1 − n^{−2(1−δ)/(k+4)}].    (10.2.24)
Then
n^{2/(k+4)}[ĝ(x) − g(x)] → N(0, c^{−k}[σ_u²(x)/h(x)]∫K(z)²dz) in distr.    (10.2.25)
Note that for the estimator ĝ_1(x) the result (10.2.18) holds with μ > 0, whereas for ĝ_2(x) the result (10.2.20) holds. The proof of this theorem therefore follows straightforwardly from the fact that by (10.2.20),
n^{2/(k+4)}[ĝ(x) − g(x)] is asymptotically distributed as n^{2/(k+4)}[ĝ_1(x) − g(x)] − c²b(x)/h(x) (cf. exercise 2). This easy result is related to the generalized jackknife method of Schucany and Sommers (1977) for bias reduction of kernel density estimators. The rate of convergence in distribution is determined by the rate of convergence of the expectation of q_3(x). If we choose the kernel K such that ∫xK(x)dx = 0 and ∫xx'K(x)dx = O, then it can be shown that, instead of (10.2.15), lim_{n→∞} γ_n^{−3}q_3(x) exists and is finite. The asymptotic rate of convergence in distribution then becomes n^{3/(k+6)} instead of n^{2/(k+4)}. Thus a way to improve the convergence in distribution is to choose a kernel with zero moments up to a particular order m. More precisely, following Singh (1981) we define the class 𝔎_{k,m} of these kernels as follows.
Definition 10.2.1 Let 𝔎_{k,m} be the class of all bounded Borel
measurable real valued functions K(.) on R^k such that, with z = (z_1,...,z_k)', z_i ∈ R,
∫ z_1^{i_1} z_2^{i_2} ··· z_k^{i_k} K(z_1,z_2,...,z_k) dz_1···dz_k = 0 if 0 < i_1 + ... + i_k < m,
(10.2.26)
∫ |z|^i |K(z)| dz < ∞ for i = 0 and i = m, ∫K(z)dz = 1.
With K ∈ 𝔎_{k,m} there exists a function b*(x) such that
lim_{n→∞} γ_n^{−m} q_3(x) = b*(x),
(10.2.27)
provided h(x) and g(x)h(x) belong to the class 𝔇_{k,m}:
Definition 10.2.2 Let 𝔇_{k,m} be the class of all continuous real functions f on R^k such that the derivatives
(∂/∂z_1)^{i_1}(∂/∂z_2)^{i_2}···(∂/∂z_k)^{i_k} f(z_1,...,z_k), i_j ≥ 0, j = 1,...,k,
are continuous and uniformly bounded for 0 < i_1 + i_2 + ... + i_k ≤ m.
Thus similarly to theorem 10.2.1 we have:
Theorem 10.2.4 Let assumption 10.2.1 and the additional conditions h(x) ∈ 𝔇_{k,m}, g(x)h(x) ∈ 𝔇_{k,m}, K ∈ 𝔎_{k,m} hold, where m is an integer ≥ 2. Let h(x) > 0. There exists a real function b*(x) on R^k such that
lim_{n→∞} γ_n^m√(nγ_n^k) = μ with 0 ≤ μ < ∞
(10.2.28)
implies
√(nγ_n^k)[ĝ(x) − g(x)] → N{μb*(x)/h(x), [σ_u²(x)/h(x)]∫K(z)²dz}    (10.2.29)
in distr., and
lim_{n→∞} γ_n^m√(nγ_n^k) = ∞
(10.2.30)
implies
plim_{n→∞} γ_n^{−m}[ĝ(x) − g(x)] = b*(x)/h(x).    (10.2.31)
The optimal rate of convergence in distribution is now n^{m/(2m+k)}, and the corresponding window width is
γ_n = c·n^{−1/(2m+k)}
(10.2.32)
with c > 0 a constant. Moreover, similarly to theorems 10.2.2 and 10.2.3 we now have:
Theorem 10.2.5 Let the conditions of theorem 10.2.4 hold. With window width (10.2.32) we have:
n^{m/(2m+k)}[ĝ(x) − g(x)] → N{c^m b*(x)/h(x), c^{−k}[σ_u²(x)/h(x)]∫K(z)²dz} in distr.    (10.2.33)
Theorem 10.2.6 Let the conditions of theorem 10.2.4 hold. Let ĝ_1(x) be the kernel regression estimator with window width (10.2.32) and let ĝ_2(x) be the kernel regression estimator with window width γ_n = c·n^{−δ/(2m+k)}, with δ ∈ (0,1). Denote
ĝ(x) = [ĝ_1(x) − n^{−(1−δ)m/(2m+k)}ĝ_2(x)]/[1 − n^{−(1−δ)m/(2m+k)}].    (10.2.34)
Then
n^{m/(2m+k)}[ĝ(x) − g(x)] → N{0, c^{−k}[σ_u²(x)/h(x)]∫K(z)²dz}.    (10.2.35)
As we have seen in chapter 4, the usual asymptotic normality results in parametric regression analysis hold with a rate of convergence in distribution equal to the square root of the number of observations. Now we see that in the nonparametric regression case this rate can be approached arbitrarily closely by increasing m. In Singh (1981) examples are given of members of the class 𝔎_{k,m} for m = 3,4,5,6. However, a simple way to construct kernels in 𝔎_{k,m} for arbitrary k ≥ 1 and even m ≥ 2 is the following. Let for x ∈ R^k and N ≥ 1
K(x) = Σ_{j=1}^N θ_j exp(−½x'Ω^{−1}x/σ_j²)/[(√(2π))^k |σ_j|^k √det(Ω)],    (10.2.36)
where Ω is a positive definite matrix and the θ_j and σ_j are such that
Σ_{j=1}^N θ_j = 1,
(10.2.37)
Σ_{j=1}^N θ_j σ_j^{2ℓ} = 0 for ℓ = 1,2,...,N−1.
(10.2.38)
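The conditions (10.2.37)-(10.2.38) form a small linear (Vandermonde) system in the θ_j once the scales σ_j are fixed, so higher-order kernels of the form (10.2.36) are easy to generate numerically. A minimal sketch (Python with numpy; illustrative only):

```python
import numpy as np

def mixture_weights(sigmas):
    """Solve (10.2.37)-(10.2.38) for theta_1,...,theta_N given distinct
    positive scales sigma_1,...,sigma_N; the kernel (10.2.36) built with
    these weights then has zero moments of order 1,...,2N-1."""
    s2 = np.asarray(sigmas, dtype=float) ** 2
    N = s2.size
    A = np.vander(s2, N, increasing=True).T          # row l holds s2_j**l, l = 0,...,N-1
    b = np.zeros(N)
    b[0] = 1.0                                       # sum theta_j = 1, higher rows equal 0
    return np.linalg.solve(A, b)

theta = mixture_weights([1.0, 2.0])                  # N = 2, so m = 2N = 4
print(theta)                                         # [ 4/3, -1/3 ]
```

Note that some θ_j are necessarily negative for N ≥ 2, so the kernel is no longer a density; this is precisely what buys the higher-order bias reduction.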
Then it is not hard to verify that K ∈ 𝔎_{k,m} with m = 2N (exercise 3). The choice of the θ_j and σ_j affects the asymptotic variance of the estimator ĝ via the quantity
∫K(x)²dx = Σ_{i=1}^N Σ_{j=1}^N θ_iθ_j/[(√(σ_i² + σ_j²))^k (√(2π))^k √det(Ω)].
(10.2.39)
Thus at first sight one might think of choosing the θ_j and σ_j so as to minimize (10.2.39), given Ω. However, (10.2.39) can be made arbitrarily small, for if θ_1,...,θ_N, σ_1,...,σ_N is a solution of (10.2.37) and (10.2.38) then
so is #I,...,0N> / i f f i v ^ N f ° r an Y A<>0. Then (10.2.39) is proportional to fi~l. This indicates that the choice of the 9} and the o} is not crucial, as the constant of the window width (10.2.32) may take over the role of this \i. Finally, we consider the multivariate limiting distribution of the kernel regression estimator in distinct points. Thus let x(1) and x(2) be distinct points in R k such that h(x(1)) > 0, h(x(2)) > 0. Then, similarly to (10.2.9), ^ K x ^ - Xj)/yn]K[(x<2) - X}] 2
(1)
(1)
= J(7 (x - 7n z)h(x - 7n z)K(z)K{[(x (2) -x (1) )/y n ] + z}dz^0, (1)
(10.2.40)
(2)
by bounded convergence, for K{[(x^{(2)} − x^{(1)})/γ] + z} → 0 as γ ↓ 0. Using this result it is easy to show (exercise 4) that
(√(nγ_n^k)q_1(x^{(1)}), √(nγ_n^k)q_1(x^{(2)}))' → N_2(0, D) in distr., with D a diagonal variance matrix.    (10.2.41)
Thus √(nγ_n^k)q_1(x^{(1)}) and √(nγ_n^k)q_1(x^{(2)}) are asymptotically independent, and so are √(nγ_n^k)[ĝ(x^{(1)}) − g(x^{(1)})] and √(nγ_n^k)[ĝ(x^{(2)}) − g(x^{(2)})]. More generally we have:
Theorem 10.2.7 Let the conditions of theorem 10.2.1 or theorem 10.2.4 be satisfied and let x^{(1)},...,x^{(M)} be distinct points in R^k with h(x^{(ℓ)}) > 0 for ℓ = 1,2,...,M. Then the sequence {√(nγ_n^k)[ĝ(x^{(ℓ)}) − g(x^{(ℓ)})]}, ℓ = 1,...,M, is asymptotically independent, and so is {n^{m/(2m+k)}[ĝ_*(x^{(ℓ)}|c) − g(x^{(ℓ)})]}, ℓ = 1,...,M.
Finally we consider estimation of the asymptotic variance in (10.2.35). Let
k 2 { ( V n ) E " = i K [ ( x - X j ) / 7 n ( c ) ] / 7 n ( c ))k }}2
(10242)
with yn(c) = cn- 1 / ( 2 m + k) .
(10.2.43)
It is not too hard to show: Theorem 10.2.8 Under the conditions of theorem 10.2.6, hence T)(X|C)
= n m/(2m + k) [g.(x|c)-g(x)]/
Uniform consistency in the continuous case (1)
Moreover, if x ,..,x 0,^=l,2,...,M, then
(M)
225 k
(
are distinct points in R with h(x ^) >
(η̂(x^{(1)}|c),...,η̂(x^{(M)}|c))' → N_M(0, I) in distr.
Proof: Exercise 5.
On the basis of this result it is now easy to construct confidence bands for g(x). In particular, the 95 per cent asymptotic confidence interval for g(x) is: ĝ_*(x|c) ± 1.96·σ̂(x|c)/n^{m/(2m+k)}.
Exercises
1. Complete the proof of theorem 10.2.1.
2. Complete the proof of theorem 10.2.3.
3. Verify that the kernel (10.2.36) belongs to the class 𝔎_{k,m} with m = 2N.
4. Prove (10.2.41).
5. Prove theorem 10.2.8.

10.3 Uniform consistency in the continuous case

The uniform consistency of the kernel regression estimator is proved by Nadaraya (1965, 1970), Devroye (1978), Schuster and Yakowitz (1979), and Bierens (1983), among others. The approach in the latter two papers is based on an idea of Parzen (1962), namely to use the Fourier transform of the kernel. Suppose that the kernel has an absolutely integrable Fourier transform, i.e.,
ψ(t) = ∫exp(i·t'x)K(x)dx,
J>(t)|dt < oo.
(10.3.1)
If K is a density then ψ(t) is its characteristic function. Then by the inversion formula for characteristic functions (cf. theorem 1.5.1) we have
K(x) = (1/(2π))^k ∫exp(−i·t'x)ψ(t)dt.
(10.3.2)
This result, however, carries over to more general Fourier transforms. In particular, for the kernel (10.2.36) we have ^
i
^
f
(10.3.3)
and applying the inversion formula in theorem 1.5.1 to each of the terms involved we see that (10.3.2) holds. Thus g(x)h(x) can be written as k
] k Yj Jexp[ - it'(x - Xj)/ g(x)h(x) = (1 /n)EjL, 7n-k [ 1 /(27r)] = [ 1 /(27r)]kJ[( 1 /n)E]Li Yjexptft'XjMexiK - it'x)M n t)dt,
(10.3.4)
226
The Nadaraya-Watson estimator
hence E[sup x |g(x)h(x)-E(g(x)h(x))|]
D
(10.3.5)
Moreover, using the well-known equality exp(ia) = cos(a) + i.sin(a) and Liapounov's inequality we see that, uniformly in t, E(|(l/n)EjLi{Yjexp(it'Xj) -E[Yjexp(it'Xj)]}|) < {var[(l/n)EjLiYjCos(t'Xj)] + var[(l/n)E]LiYjSin(t'Xj)]} v* < VIECY?)]/^.
(10.3.6)
Combining (10.3.5) and (10.3.6) yields E[sup x |g(x)h(x)-E(g(x)h(x))|] < x /[E(Y^)](l/ v / n)[l/(27r)] k J|^(y n t)|dt k (10.3.7) Furthermore, if f(x) = g(x)h(x) belongs to the class £)k?2 and K belongs to the class ftk?2 then it follows, similarly to (10.2.13), that 7-
2
sup x |E[g(x)h(x)] -g(x)h(x)| = y- 2 sup x |J[g(x- y n z)h(x-y n z) -g(x)h(x)]K(z)dz| = 7 - 2 sup x |J[f(x- 7 n z) -f(x)]K(z)dz| < supx|1/2Jz'O/^x)O/(3x')f(x)zK(z)dz| < oo.
(10.3.8)
More generally, if g(x)h(x) belongs to T) k m and K belongs to ftk?m then 7-
m
sup x |E[g(x)h(x)]-g(x)h(x)| < oo,
(10.3.9)
uniformly in n. Combining (10.3.7) and (10.3.9) now yields E[su Px |g(x)n(x)-g(x)h(x)|] = O{[min( 7 k x /n,7- m )]- 1 }.
(10.3.10)
Clearly, this rate of convergence is optimal for yn = c.n- 1 / ( 2 m + 2k) , c > 0,
(10.3.11)
and then E[sup x |g(x)h(x)-g(x)h(x)|] = O(n- m / ( 2 m + 2 k ) ).
(10.3.12)
Changing Yj to 1 we see that a similar results hold for h(x). It is now easy to verify: Theorem 10.3.1 Let assumption 10.2.1 and the additional
Discrete and mixed continuous-discrete regressors
227
conditions (10.3.1), h(x)eD k , m , g(x)h(x)e£ k , m , and hold, where m > 2. Let S G (0,supxh(x)] be arbitrary. For the window width (10.3.11) we have n m/ ( 2m + 2k) sup xe{x€Rk:h(x)> <5} |g(x)-g(x)| = O p (l). The proof is left as an exercise. It should be noted that the rate of uniform convergence, n m /( 2 m + 2k>9 is not the maximum obtainable rate, as is shown by Silverman (1978) for the density case and Revesz (1979), Schuster and Yakowitz (1979), Liero (1982), and Cheng (1983) for the regression case. The present conservative approach has been chosen for the simplicity with which it can easily be extended to the case with partly discrete regressors and/or time series. 10.4 Discrete and mixed continuous-discrete regressors 10.4.1 The discrete case Economic and other social data quite often contain qualitative variables. A typical feature of such variables is that they take a countable number of values and can usually be rescaled to integer valued variables. We consider first the case where all the components of Xj are of a qualitative nature. In the next subsection we show what happens in the mixed continuous-discrete case. The following assumption formalizes the discrete nature of Xj. Assumption 10.4.1 There exists a countable subset A of R k such that (i) x e A implies p(x) = P(Xj = x) > 0; (ii) E X G J P W = 1;
(iii) every bounded subset of A is finite. Part (iii) of this assumption excludes limit points in A. It ensures that for every x e A, inf ZGzlUx} |z-x| = //(x) > 0.
(10.4.1)
Now let the kernel and the window width be such that K(0)= 1, y n |0, y/n sup| z | >Ai/ JK(z)|->0 for every JH > 0.
(10.4.2)
This condition holds for kernels of the type (10.2.36) and window widths of the type yn = c.n~T with T > 0 , c > 0.
228
The Nadaraya-Watson estimator
Since now
< (l/n)EjLilYj|>/n su P | z | >/i(x)/yn |K(z)H0 in prob.,
(10.4.3)
where I(.) is the indicator function, and similarly Kl/»E?=iK[(x-X j )/yJ-(l/ v /n)EjLiI(X j = x)HO in pr.,
(10.4.4)
it is easy to verify that for every x e A pUm^ooVnlgto-gV)] = 0,
(10.4.5)
where g*(x) = [E?=i YjipCj = x)]/[E]Lii(Xj = x)] = [EjLi UJKXJ = x)ME]Li i(Xj=x)] + g(x).
(10.4.6)
It now follows straightforwardly from the law of large numbers that p*(x) = (l/n)Ef=iI(Xj = xHp(x) in pr.,
(10.4.7)
whereas by the central limit theorem v/n[g*(x)p*(x) - g(x)p*(x)] = (1 / » E j L i UJIPCJ = x) ->N(0,(7j(x)p(x)) in distr.
(10.4.8)
Combining (10.4.5), (10.4.7), and (10.4.8) yields:
Theorem 10.4.1 Under assumption 10.4.1 and condition (10.4.2) we have
√n[ĝ(x) − g(x)] → N(0, σ_u²(x)/p(x)) in distr.
(10.4.9)
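In the purely discrete case the asymptotically equivalent estimator g*(x) of (10.4.6) is nothing but the cell mean of the Y_j's with X_j = x, which the following sketch (Python with numpy; illustrative only) computes directly:

```python
import numpy as np

def cell_mean(x, X, Y):
    """The estimator g*(x) of (10.4.6): the sample mean of the Y_j with X_j = x."""
    mask = np.all(X == np.asarray(x)[None, :], axis=1)   # observations falling in cell x
    return Y[mask].mean() if mask.any() else np.nan
```

By (10.4.5) the kernel estimator ĝ(x) and g*(x) differ by o_p(n^{−1/2}) under condition (10.4.2), so the √n-normality (10.4.9) can be read off directly from the cell means.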
Also, similarly to theorem 10.2.7 we have:
Theorem 10.4.2 Let x^{(1)},...,x^{(M)} be distinct points in A. Under the conditions of theorem 10.4.1 the sequence {√n[ĝ(x^{(ℓ)}) − g(x^{(ℓ)})]}, ℓ = 1,...,M, is asymptotically independent.
Proof: Exercise 1.
Note that the discrete case differs from the continuous case in that hardly any restrictions are placed on the window width while, nevertheless, the asymptotic normal distribution has zero mean. Moreover, the asymptotic rate of convergence in distribution is now the same as for
Discrete and mixed continuous-discrete regressors
229
the usual parametric models. Furthermore, since every bounded subset A* of A is finite, theorem 10.4.1 implies that max x e J J > /n[g(x)-g(x)]| = Op(l).
(10.4.10)
10.4.2 The mixed continuous-discrete case We now consider the case where the first ki components of Xj are continuous and the remaining k2 components are discrete. Of course, this case is only relevant for k = kj + k2 > 2. Assumption 10.4.2 L e t Xj = ( X J ^ X p V e A x x A 2 , w h e r e Ax is a kj-dimensional real space and A2 is a subset of a k2-dimensional real space. The set A2 is such that (i) x(2) G A 2 implies p(x(2)) = P(XJ2) = x(2)) > 0; (ii) EX( 2 >^ 2 P(X ( 2 ) ) = 1;
(iii) every bounded subset of A2 is finite. Let x = (x(1),x(2))' G Ax x A2 and let h(x(1)|x(2)) be the density of the conditional distribution of X- relative to the event X- = x(2). For every fixed x(2) G A2 the following holds: (iv) h(x(1)|x(2)) and g(x(1),x(2))h(x(1)|x(2)) belong to the class D ki ,m with m > 2; (v) there exists a d>0 such that (T2+(5(x(1),x(2))h(x(1)|x(2)) is uniformly bounded on A\\ (vi) the functions g(x(1),x(2))2h(x(1)|x(2)) and <x2(x(1),x(2))h(x(1)|x(2)) are continuous and uniformly bounded on A \. Moreover, we now choose the kernel K(x(1),x(2)) and the window width yn such that with (zuz2y e A{x A2 and for n —• oo, y n |0; y/n sup| 22 |> /i/y J|K(z,,z 2 )|dz 1 -> 0 for every \i > 0;
.
n y ^ o o ; K(z1?0) G ftkl,m with m > 2;
1(10.4.11)
J|K(z 1 ,0)|dz 1 < oojK(z 1 ,0)dz 1 = 1.
J
A suitable kernel satisfying (10.4.11) can be constructed similarly to (10.2.36), i.e., let K(x) = £ j l i 0 j e x p ( - i / 2 x ' G - ^
(10.4.12)
where Q\ is the inverse of the upper-left (kj x k{) submatrix of Q~x and the o-j's and fy's are the same as in (10.2.38). Denoting
230
The Nadaraya-Watson estimator
h(x) = h(x(1),x(2)) = h(x(1)|x(2))p(x(2)),
(10.4.13)
we now have: Theorem 10.4.3 Under assumption 10.4.2 and condition (10.4.11) the conclusions of theorems 10.2.1-10.2.8 and theorem 10.3.1 carry over with k replaced by k^ and jK(z) 2 dz replaced by jK(z l5 0) 2 d Zl . This theorem can be proved by combining the arguments in sections 10.2 and 10.4.1. The proof is somewhat cumbersome but does not involve insurmountable difficulties. It is therefore left to the reader. Exercises 1. Prove theorem 10.4.2. 2. Prove theorem 10.4.3. 10.5 The choice of the kernel In the literature on kernel density and regression function estimation the problem of how the kernel should be specified has mainly been considered from an asymptotic point of view. In the case of density estimation Epanechnikov (1969) has shown that the kernel which minimizes the integrated mean squared error J[h(x)-h(x)] 2 dx
(10.5.1)
over the class of product kernels K(x) = K(x(1),x(2),...,x(k)) = n!LiK 0 (x (l) ), x (l) eR,
(10.5.2)
with Ko(v) = K o ( - v ) > 0 ,
jK o (v)dv = Jv 2 K o (v)dv=l, JvK o(v)dv = 0, (10.5.3)
is a product kernel with Ko(v) = 3/(V5) -3v 2 /(20^5) if |v| < ^ 5 , Ko(v) = 0if |v| > v/5.
(10 5 4)
- '
Note that Epanechnikov's kernel K o is the solution of the problem: minimize jK 0 (v) 2 dv, subject to (10.5.3).
(10.5.5)
Greblicki and Krzyzak (1980) have confirmed this result for the regression case. Epanechnikov also shows that there are various kernels which are nearly optimal. For example, the standard normal density satisfies the condition (10.5.3) and is almost optimal, as
Choice of the kernel 1
2
/
231
2
2
J[exp(- /2V )/v (27r)] dv= 1.051jK0(v) dv.
(10.5.6)
A disadvantage of Epanechnikov's kernel is that its Fourier transform is not absolutely integrable, a condition employed in section 10.3. Also, the non-negativity of K o implies that the kernel (10.5.2) with (10.5.4) merely satisfies K€ftk 2 , whereas higher rates of convergence in distribution than n2/(k + 4) r e q u i r e K e # k , m with m > 2. Cf. theorem 10.2.4. Since kernels of the type (10.2.36) have all the required properties, are almost arbitrarily flexible and can easily be constructed, we advocate the use of that type of kernel. However, the question now arises how the matrix Q should be specified. A heuristic approach to solve this problem is to specify Q such that certain properties of the true regression function carry over to the estimate g. The property we shall consider is the linear translation invariance principle. Suppose we apply a linear translation to x and the xj's: x* = Px + q, X* = PXj + q,
(10.5.7)
where P is a nonsingular k x k matrix and q is a k-component vector. Then g(x) = E(Yj|Xj = x) = E(Y j |X;=x*) = g V ) ,
(10.5.8)
say. However, if we replace the Xj and x in (10.2.16) by X* and x \ respectively, and if we leave the kernel K unchanged, then the resulting kernel regression estimator g*(x*), say, will in general be unequal to g(x), for J
J
}
/
g(x) if (10.5.9)
The only way to accomplish g*(x*) = g(x) in all cases (10.5.7) is to let the kernel be of the form K(x) = 77(x'V"1x),
(10.5.10)
where r\ is a real function on R and V is the sample variance matrix, i.e., V = (l/n)5:]Li(Xj-X)(Xj-X)' with X = (l/n)£jLiXj.
(10.5.11)
In particular, if we use kernels of the form (10.2.36) then we should specify Q = V. Thus for m = 2,4,6,... we let Km(x) = E ^ j e x p C - '/2X'V-'x/ofVK^Tr^lffjlVdet(V)]
(10.5.12a)
in the continuous case, (10.5.12b)
232
The Nadaraya-Watson estimator
in the mixed continuous-discrete case (with the first ki components of Xj continuously distributed) and K m (x) = K2(x) = e x p ( - Vix'V-^x) in the discrete case, where V and
(1)
(10.5.12c)
is the upper-left kj x kj submatrix of V " 1
Y™lx 0pV=i if £ = 09 Y™J? Off = 0 if £=l,2,...,(m/2)-l.
(10.5.13)
The question now arises whether the previous asymptotic results go through for kernel regression estimators with this kernel. The answer is yes, provided the following additional conditions hold. Assumption 10.5.1 Let E(|Xj|4) < oo and let the matrix V = E(XjXj') -[E(Xj)][E(Xj)]'
(10.5.14)
be nonsingular. Denoting Km(x) = E?i 2 ^p(- 1 /2x'V- i x/ff?)/[( N /(27r)) k |ffjlVdet(V)] >
(10.5.15a)
in the continuous case, (10.5.15b) (1)
in the mixed continuous-discrete case, where V is the upper-left k\ x k\ submatrix of V~ l and the 0j and the o} are the same as before, and Km(x) = K2(x) = e x p ( - ^ x ' V ' x )
(10.5.15c)
in the discrete case, we can now state: Theorem 10.5.1 With assumption 10.5.1 the kernel regression estimator with kernel (10.5.12) has the same asymptotic properties (as previously considered) as the kernel regression estimator with kernel (10.5.15). Proof. We shall prove this theorem only for the continuous case with kernel K2(x). The proof for the other cases is similar and therefore left to the reader. Let s(x) = (l/n)Ejl 1 Y j ex P t-'/ 2 (x-X J )'V- 1 (x-X j )/y2] k
J
(10.5.16)
s(x) = (1 /n)£jL, Yjexp[ - V2(x - Xj)' V ~' (x - Xj)/y*] kk
(10.5.17)
Choice of the kernel
233
s(x) = (l/n)£j!=i Yjexp[- '/ 2 ( x -Xj)'V" '(x-Xj)/^] k ^
(10.5.18)
Moreover, let (ij)(ij)
(10.5.19) 1
(l
where v ^ is the typical element of V" and v '^ is the typical element of V" 1 . For every t = (t lv ..,t k )'R k we have I t ' V - ' t - t ' V ^ t l < M^i,j|titj| < k Mt't < pMt'V" 1 ^
(10.5.20)
where p is the maximum eigenvalue of V times k. Using inequality (10.5.20) and the mean value theorem it is not too hard to verify that |s(x)-s(x)| < (10.5.21) Since the Xj's are independent and have bounded fourth moments we have for every e > 0, n 1 / 2 " £ ( V - V H in pr.,
(10.5.22)
as is easy to verify by using Chebishev's inequality. Since the elements of an inverse matrix are continuously differentiate functions of the elements of the inverted matrix, provided the inverted matrix is nonsingular, it follows that (10.5.22) implies (10.5.23) n ' / 2 - e ( y - i _ v - i ^ 0 Jn p r and consequently plimn^n 172 - 5 M = 0 for every 8 > 0.
(10.5.24)
Thus also Hmn^ooPtpM < 1/4)= 1.
(10.5.25)
Now (10.5.21) and (10.5.25) imply that the inequality |s(x) -s(x)|< (2pkM/yk)(Vdet(V)/v/det(V))(v/2)k x(l/n)EjLi|Yj|K*[(x-Xj)/y n]
(10.5.26)
with (10.5.27) holds with probability converging to 1. Since
234
The Nadaraya-Watson estimator = J[l + ol(x - ynz) + g(x - ynz)2]h(x - ynz)K*(z)dz ->[1 + <7j;(x) + g(x)2]h(x) as n ^ o o
(10.5.28)
and since (10.5.24) implies plim n ^ ooA /(ny k )M = 0,
(10.5.29)
it now follows that, pointwise in x,
plimn^yrtlsW-sWI =0
(10.5.30)
Next, observe that p l i m n ^ o o ^ n y ^ l s t x ) - ^ ) ! =0,
(10.5.31)
for (10.5.22) implies that plim n ^ oo> /(ny k )(det(V)-det(V)) = 0.
(10.5.32)
Thus, plim n ^ 00> /(nyj;)(s(x)-s(x)) = 0.
(10.5.33)
From this result it follows straightforwardly that the asymptotic normality results go through. The proof that the uniform consistency results go through is left as an exercise. 10.6 The choice of the window width From the preceding results it is clear that the asymptotic performance of the kernel regression function estimator heavily depends on the choice of the window width yn. In particular, the asymptotic normality results in theorems 10.2.3 and 10.2.6 show that the variance of the limiting normal distribution of g, shrinks down to zero if we let the constant c of the window-width parameters approach infinity. But that will destroy the small sample performance of the kernel regression estimator. If we choose too large a yn the Nadaraya-Watson regression function estimate will become too flat, for g(x)->Y = (l/n)X;j1=,Yj if we let y n -^oc.
(10.6.1)
Similarly, g,(x) —> Y if c —• oo. On the other hand, if we choose too small a yn the estimate g will go wild. For example, if we employ in (10.1.16) the kernel K2 and if we let y n |0, then (10.6.2)
Choice of the window width
235
(•£= l,...,n), where I(.) is the indicator function. Thus g(x) converges to the Yj for which (x — Xj)'V~1(x — Xj) is minimal, so that then the estimate g degenerates to an inconsistent nearest neighbor estimate. Cf. Stone (1977). Again, a similar result holds for g, if we let c —> 0. A somewhat heuristic but effective trick to optimize the window width is the cross-validation approach introduced by Stone (1974), Geisser (1975), and Wahba and Wold (1975). See also Wong (1983). The basic idea is to split the data set in two parts. Then the first part is used for calculating the estimate and the second part is used for optimizing the fit of the estimate by minimizing the mean squared error. A variant used by Bierens (1983) is to consider various partitions and to minimize the mean of the mean squared errors. In particular, let (10.6.3) and denote similarly to (10.2.34), )
),
(10.6.4)
)
),
(10.6.5) (10.6.6)
Then g* (x|c) is the regression function estimator of the type (10.2.34) with kernel K m , based on the data set leaving the observation with index £ out. We now propose to optimize c by minimizing Q(c) = E]Li( Y J-g* ) ( x jl c )) 2
( 10 - 6 - 7 )
to c in an interval [ci,c2] with 0 < Cj < c2 < oo. Denoting the resulting optimal c by c, i.e., (c) = inf(Q(c)|cG[c,,C2]),
(10.6.8)
we then propose to use g,(x|c) = gio)(x|c)
(10.6.9)
as the cross-validated kernel regression function estimator. Although this approach works well in practice, it has the disadvantage that we lose control over some of the asymptotic properties of kernel estimators. From Bierens (1983) it follows that the cross-validated kernel regression estimator remains (uniformly) consistent, but it is not clear whether asymptotic normality goes through. We can regain some control over the asymptotic behaviour of g.(x|c) if instead of (10.6.8) we use (c(*V = 1,2,. ..,M),
(10.6.10)
236
The Nadaraya-Watson estimator
where Cj = c(1) < ... < c (M) = C2 are grid points. It is not hard to show that in the continuous case the M-variate limiting distribution of n m/(2m + k) (i(x|c (1) )-g(x),...,gXx|c (M) )-g(x))
(10.6.11)
is M-variate normal with zero mean vector. Hence for x with h(x) > 0 we have at least n m/(2m + k) (g.(x|c)-g(x)) =O P (1).
(10.6.12)
A similar result holds for the mixed continuous-discrete case. However, if for this c, plimn^^c = c, then asymptotic normality goes through as if c = c. Moreover, in the discrete case the cross-validated regression estimator has the same properties as before, without additional conditions. If our sample is large, we may proceed as follows. Split the sample into two subsamples of sizes nj and n2, respectively. Now apply the crossvalidation procedure as described above to one of the subsamples, say subsample 2, or alternatively, determine an appropriate c by visual inspection of the nonparametric regression results based on this subsample. Then use the resulting c as constant in the regression function estimator g.(x|c) based on subsample 1. Since now c and g.(x|c) are independent, the asymptotic distribution results go through for g.(x|c), conditional on c. 10.7 Nonparametric time series regression Recently the kernel regression approach has been extended to time series. Robinson (1983) shows strong consistency and asymptotic normality, using the a-mixing concept. In Bierens (1983) we proved uniform consistency under u-stability in L with respect to a
Nonparametrie time series regression
237
Assumption 10.7.1 The data-generating process {(Yt,Xt)} is a strictly stationary u-stable process in L 2 with respect to a strictly stationary cp-mixing base (W t ), where u(m) = O(exp(-cm)) for some c > 0; ' / < °°.
(10.7.1) (10.7.2)
Also, we assume that g(Xt) represents the conditional expectation of Y t given the entire past of the data-generating process: Assumption 10.7.2 Let = E(Y t |X t ,X t _ 1 ,X t _ 2 ,...,Y t _ 1 ,Y t _ 2 ,...)a.s.
(10.7.3)
The vector X t may contain lagged Y t 's. Thus g(Xt) is in fact a (non)linear ARX model with unknown functional form g(.). We note that Robinson (1983) only assumes E(Y t |X t ) = g(Xt), which is weaker than (10.7.3). However, as has been argued in chapter 7, proper time series models should satisfy condition (10.7.3). The errors U t are then martingale differences, so that the martingale difference central limit theorem 6.1.7 is applicable. Next, we assume: Assumption 10.7.3. (i) If assumption 10.2.1 holds then in addition: (a)
238
The Nadaraya-Watson estimator
condition (10.7.2). Let K be a Borel measurable real function on Rk such that J|K(x)|dx < oo; J|t^(t)|dt < oo, where ip(t) = Jexp(it'x)K(x)dx. Denote for x e Rk, dn(x) = var{(l/n)Ef=iZiK[(x-Xj)/yn]} -(l/n^EjLivartZjKKx-XjVyJ},
(10.7.4)
where yn > 0, yn j 0 as n —> oo. Then dn(x) = 0({ln(n/)»n) + ln[l/(E{Z2K[(x-Xo)/yn]2})]} xE{Z2K[(x-X0)/yn]2/n}) Proof: We can write
= 2(l/n2)E"J,1 EjiT -2(l/n 2 )E":,' £jL-/E({ZoK[(x-Xo)/yn]}2.)
(10.7.5)
Similarly, let dlm)(x) = 2(l/n 2 )EL I E J I L"/cov{Z< m) K[(x-X< m) ]/y n ),ZJ m) K[(x-X( m) )/y n ]} = 2(l/n 2 )E":, 1 E J ="/E{Z( m) Z; m) K[(x-X( m) )/y n ]K[(x-X| m) )/y n ]} -2(l/n 2 )E"l 1 1 EjL" 1 'E({z( m) K((x-X| 1 m) )/y n )} 2 ),
(10.7.6)
where
We shall prove the lemma in three steps:
Step 1: Step 2: For sufficiently large n and some constant Ci > 0, independent of j and m, we have E(sup x | {Z0K[(x - X0)/yn]} {ZjK[(x - X^/yJ} - {Z< m) K[(x-X<, m) ]/y n )}{ZJ m) K[(x-X; m) ]/y n )}|)< c,y-'u(m);
Nonparametric time series regression
239
Step 3: For sufficiently large n and some constants c2 > 0, c3 > 0, independent of x, j and m, we have
We first show that the results of these three steps imply the lemma. Let (mn) be a sequence of positive integers such that m n —> oo and m n /n —> 0, and let n be so large that
It follows from step 1 and step 2 with j = 0 that for sufficiently large n, |di m) (x)|<8(m n /n)E({Z o K[(x-Xo)/y n ]} 2 ) + 8(m n /n) C l 7 n -ym n ). Moreover, it follows from steps 2 and 3 that |d n (x)-d< m) (x)| <2(c 1 +c 3 )7 n - 1 u(m n ) +
2c2(y"'o(mn))2.
Without loss of generality we may assume that for sufficiently large n, m n /n < (d + c 2 + c3)/4 and y~]v(mn) < 1
(10.7.7)
(as will appear). Then |d n (x)|<8(m n /n)E({Z 0 K[(x - X0)/yn]}2) + 4(c, + c 2 + c 3 )y"' u(mn) = O(Pn(x)), where pn(x) = (m n /n)E({Z 0 K[(x-X 0 )/y n ]} 2 ) +7 n - 1 exp(-cm n ).
(10.7.8)
Minimizing the right-hand side of (10.7.8) to m n yields i)(mn) = exp(-c-m n ) = (l/c)(y n /n)(E{Z 0 K[(x-X 0 )/y n ]})- 2 , with m n = (1/c) [log(c) + log(n/yn) - 2-log(E{Z0K[(x - X 0 )/y n ]})], (thus observe that indeed (10.7.7) holds). Hence Pn(x) < (l/c)|l +log(c) + log(n/yn) + log [(E{Z2K[(x - X 0 )/y n ] 2 })"'] | (E{Z2K[(x - X0)/yn]2/n}) = O {[ln(n/ 7n ) + In (E{Z2K[(x - X0)/yn]2}) ~'] E{Z2K[(x - X 0 )/ 7n ] 2 /n}}, as was to be shown.
240
The Nadaraya-Watson estimator
Proof of step 1: Since {Zj K[(x — X-m )/yn]} is a cp*-mixing sequence, where q>\f) = 1 if £ < m, cp\£) =
(10.7.9) k
O-ti'Xo + i-t2'Xj) m)
.trx| ) +i.t 2 'X; m) )|^( [l/(27r)]2kyf JJ|ZoZjexp(i-trXo + i-t2'Xj) + i-tz' -4m) z| m) exp(i-t,'Xj n) +i.t 2 'XJin) )||^(yn t,W
< [l/(27r)] 2 k y-'|ZoZj||<Xo,Xj)-
+ [l/(27r)] 2k |Z 0 Z j -Z[ ) m) ZJ m) |JJ^(t 1 )V(t 2 )|dt 1 dt 2 . < [l/(27r)] 2k y- 1 |Z o Z j |(|Xo-X< m) | 2 + |Xj-X< m) + [l/(27r)] 2k (|Z 0 ||Z j -Z{ m) | + |Zo-Z< m) ||Z< m) |)(J|V(t)|dt) 2 . (10.7.10) Applying the Cauchy-Schwarz and Liapounov inequalities and using the inequality E[(ZJm))2] < E(Z 2 ) (cf. exercise 1) it follows from (10.7.10) that there exist constants c» > 0, c* > 0, independent of j and m, such that
Nonparametric time series regression
241
E{supx|ZoZjK[(x - X 0 )/7 n ]K((x - Xj)/yJ
- Z< m) ZJ m) K[(x - x( m) )/y n ]K[(x - XJ m) ]/y n )|}
+ [l/(2 7 r)] 2k {E[(Z 0 -Z( m) ) 2 ]} 1/i [E(Z 2 )]' /2 (J|^(t)|dt) 2
^cl'VVmHc^Vm)
(10.7.11)
For n so large that ci y" 1 > ci we may take cj in step 2 equal to 2-ci . Proof of step 3: By stationarity and the trivial inequality |a2-b2| <(a-b)2+|a||a-b| we have |E{Z0K[(x-X0)/y1J}E{ZjK[(x-Xj)/yn]} = |(E{Z0K[(x-X0)/7n]})2 -(E{Z< m) Kt(x-X|, m) )/7 n ]}) 2 | < {E|Z0K[(x-X0)/yn] -Zj, m) K[(x-X< m) )/ 7n ]|} 2 + |E{ZoK[(x-Xo)/rn]}|E|ZoK[(x-Xo)/yn]-Z|)m)K[(x-X|)m))/yn]| (10.7.12) From (10.7.9) it follows, similarly to (10.7.11), that there exist constants c* > 0, c* > 0, independent of x and m, such that E|Z0K[(x - X0)/yn)] - Z^m)K[(x - xj,m))/yn]| < [l/(27r)]kyk jE|Zoexp(i-t'Xo) -Z<m)exp(i-t'X<m))||V(7nt)|dt <[l/(2 7 r)] k y k E|Zo-Z| ) m) |J|^(y n t)|dt <[l/(27r)]k{E|Z0-Z<m)|}'/jJ|t/;(t)|dt + [l/(27r)]k}'k[E(|Xo-X(m)|2)]1/2{E[(Z<m))]2]}1/2J|t^(t)|dt
^ c i ' ^ - ' u M + ci^Km).
(10.7.13)
Realizing that by (10.7.9) |E{Z0K[(x-X0)/yn]}| < [l/(27r)]kE|Zo|J|^(t)|dt < oo, step 3 now easily follows from (10.7.12) and (10.7.13). This completes the proof. Q.E.D. The following lemma enables us to prove uniform consistency
242
The Nadaraya-Watson estimator Lemma 10.7.2 Let the conditions of lemma 10.7.1 hold, except that now E(Z 2 ) < oo suffices, and let an(x) = (1 /n)£jL, ZjK[(x - Xj)/yJ Then E{supx|an(x) - E[an(x)]|} = O{[log(n/y2)] " / ^ n }
Proof: Let
aim)(x) = (l/n)E]L,zJ m) K[(x-XJ m) )/ rn ]. It follows from (10.7.9), similarly to (10.7.18), that there exist constants c, > 0, c; > 0, independent of x and m, such that E[supx|an(x) -ai m ) (x)|] < c{l)y-lD(m) + c?}D(m).
(10.7.14)
Moreover, it follows from (10.7.9), the well-known formula exp(i-u) = cos(u) + i sin(u), Liapounov's inequality and inequality (1.4.4) that E{supx|aim)(x) -E[ai m )
n)X:jL, {ZJm)exp(i-t'XJm)) -E[ZJ ra) exp(i-t'XJ m) )]|Ml>nt)|dt KE^l/n^iaZ^cosa'X^) -E[ZJm)cos(t'XJm))]})2]} 1 W2)[l/(2n)]kykn J{E[((l/n)EjLi {ZJm)cos(t'XJm)) -E[ZJ m) cos(t'XJ m) )]})]] 2 } w |V
(10.7.15)
Similarly to step 1 in the proof of lemma 10.7.1 we have, uniformly in t, '(m + E ^ o ]
(m + Eto
1/2
)]E[(Z0)2]< c.(m/n),
(10.7.16)
say, for sufficiently large m, and similarly, E[((l/n)E J I Li{ZJ m) sin(t'XJ m) )-E[ZJ m) sin(t'XJ m) )]}) 2 ]
'
i
'/
(10.7.18)
say. Combining (10.7.14) and (10.7.18) it follows that for sufficiently large n
Nonparametric time series regression
243
E{sup x |a n (x) - E [ a n ( x ) ] | } < 2 E{sup x |a n (x) - a i m n ) ( x ) | }
+ E{sup x |ai mn) (x) -E[ai mn) (x)]|} < 2ci 1) y- 1 o(mn) + 2ci 2) u(m n ) + Cl (m n /n) 1/2 = O{[(mn/n) + 7 - 2 exp(-2c.m n )] l / 2 },
(10.7.19)
say. Minimizing the right hand side of (10.7.19) to m n , the lemma follows. Q.E.D. Finally, the following lemma will enable us to extend the results in section 10.5 to time series. Lemma 10.7.3 Let (Zj) be a strictly stationary stochastic process in R satisfying E(Z 4 ) < oo. Let (Zj) be u-stable in L 4 with respect to a strictly stationary (^-mixing base, where u(m) = O[exp( — cm)] for some c > 0, Z (10.7.20) Then for every e > 0, =0
j 1 j
(10.7.21)
and plim n ^ oo n l/a - fi (l/n)5:iLi[Z?-E(Z?)] = 0. (10.7.22) Proof: Let (Wj) be the base and let
ZJm) = E(Zj|WJ,Wj_l,WJ_2,...,Wj_m). Denote similarly to (10.7.4), (10.7.5), and (10.7.6)
dim) = varKl/nEjLifZJ^)2] - O/n2)£jLi var[(ZJm))2]. Then it follows similarly to the proof of lemma 10.7.1 that for sufficiently large n 4
)<
Cl m/n,
and |dn-dim)|
244
The Nadaraya-Watson estimator
as n —» oo. This proves (10.7.22). The proof of (10.7.21) is nearly the same. Q.E.D. 10.7.2 Consistency and asymptotic normality of time series kernel regression function estimators In this subsection we now generalize the approach in sections 10.1-10.6 to time series. Theorem 10.7.1 Let E(Uj8) < oo and E[g(Xj)4] < oo
(10.7.23)
and let the kernel K be such that for £ = 1,2, J|t^(t)|dt < oo, where ^ ( t ) = Jexp(i-t'x)K(x/dx.
(10.7.24)
Moreover, let Jzz'K(z)2dz be finite in the continuous case, and let Jz1Zi'K(z1,0)2dz1 be finite in the mixed continuous-discrete case, respectively. (10.7.25) (Note that the conditions (10.7.24) and (10.7.25) hold for kernels of the type (10.2.36).) With assumptions 10.7.1, 10.7.2, and 10.7.3 and the conditions (10.7.23), (10.7.24), and (10.7.25) the asymptotic normality results in sections 10.2 and 10.4 go through. Proof. We only prove the theorem for the continuous case, leaving the proofs for the discrete and mixed continuous-discrete cases as exercises. We now have to show that (10.2.2) and (10.2.4) go through and that var(h(x))->0
(10.7.26)
in the time series case under review, as only in these steps has the independence assumption been employed. Step 1: proof of (10.2.2). Since now the vnj(x) defined by (10.2.7) are martingale differences, it suffices to show plim n _ oo (l/n)E]l l [Vn, J (x) 2 -E(v n0 (x) 2 )] = 0,
(10.7.27)
as then (10.2.2) follows from theorem 6.1.7. Thus consider lemma 10.7.1 with Zj = U 2 and K(x) replaced by K(x)2. Since E{U4K[(x - Xj)/yn]4} = ykjcr4u(x - ynz)h(x - ynz)K(z)4dz = it follows from lemma 10.7.1 that
Nonparametric time series regression
245
var{(l/n)£jLi U 2K[(x - Xj)/y n ] 2 /^} r{U J 2 K[(x-X j )/)» n ] 2 /^} + )--2kdn(x) J5)] + O[(l /(nykn)\n(n/ykn+' )]->0
(10.7.28)
as n —> oo, provided yn is proportional to n~~T with T < 1/k. Therefore, (10.7.26) follows from (10.7.28) and Chebishev's inequality. Step 2: proof of (10.2.4). Let Zj in lemma 10.7.1 be
Then - (l/n 2 )EjLi var{ V(nyJD[g(Xj) - g(x)]K[(x - Xj)/yn]/y*} where the last conclusion follows from the fact that by assumption 10.7.3 and Taylor's theorem = ?!; J[g(x - ynz) - g(x)]2h(x - ynz)K(z)2dz Since y2ln(n/y£+3) —> 0 as n —> oo if yn is proportional to n~ T with T > 0, (10.2.4) follows. Step 3: proof of (10.7.26). Let (72(x) = E{K[(x-X 0 )/y n ] 2 }y n - k . Observe that similarly to (10.1.17), <72(x) - h(x)jK(z) 2 dz. Now let Zj = 1 in lemma 10.7.1 and let x be such that h(x) > 0. Then var(h(x)) < y- 2k d n (x) + (72(x)/(nyk) - O[(log(n/yn) + log(y- k ) + l)/(nyk)] = O[(log(ny n - k - 1 ))/(nyJ;)]->0 if yn is proportional to n~ T with T < 1/k. Since the latter condition is satisfied throughout section 10.2, the proof of theorem 10.7.1 for the continuous case is completed. Q.E.D. Theorem 10.7.2 Let assumption 10.7.1 hold. Then: (i) the conclusion of theorem 10.3.1 becomes:
246
The Nadaraya-Watson estimator [n/log(n)] m/(2m + 2k) sup xG{xeRk:h(x) >, } |g(x) -g(x)| =O P (1), with corresponding optimal window width of the form yn = c-[iog(ii)/n] 1 / (2m+2k) ; (ii) the conclusion of theorem 10.4.1 regarding uniform consistency becomes: l)
suPxG{x€Rk:h(x)><5}|g(x) - g ( x ) | =O P (1),
with corresponding optimal window width 7n
= c-[log(n)/n]1/(2m + 2k|) ,
where h(x) is defined by (10.4.13). Proof: Again we confine our attention to the continuous case, i.e. part (i). The only places in section 10.3 where the independence assumption has been employed are (10.3.6) and, subsequently, (10.3.7) and (10.3.10). From lemma 10.7.2 with Zj = Yj it follows that in the present case (10.3.7) becomes: E{supx|g(x)h(x) -E[g(x)h(x)]|} = y- k E{sup x |a n (x) -E[a n (x)]|} = O{[log(n/y2)]^/(yk > / n)} .
(10.7.29)
Combining (10.7.29) with (10.3.9), there exist constants ci > 0, c 2 > 0 such that (10.3.10) becomes: E[supx|g(x)h(x) - g (x)h(x)|] = O{c1[log(n/y2)]1/V(yJ;>/n) + c2y™} = O{Cl[log(n)] I/2/(yk »
+ c2y™}
(10.7.30)
provided log(l/ 7 2 )/log(n)-,0.
(10.7.31)
Minimizing (10.7.30) to yn yields an optimal window width of the form yn = c[log(n)/n] 1/(2m + 2k) for which indeed (10.7.31) holds. With this window width, (10.7.30) becomes E[supx|g(x)h(x) -g(x)h(x)|] = O{[log(n)/n] m/(2m + This proves part (a) of the theorem.
2k)
}. Q.E.D.
Finally, we have Theorem 10.7.3 Under the conditions of theorems 10.7.1 and 10.7.2 and the additional assumption 10.7.4 the conclusions of theorem 10.5.1 carry over.
Nonparametric time series regression
247
Proof. It follows straightforwardly from lemma 10.7.3 that (10.5.22) goes through, which proves the theorem. Q.E.D. Remark: We note that the ^-mixing condition on the base can be relaxed to the weaker a-mixing condition, but at the expense of stronger conditions on the moments of Yj and Xj. This follows from theorem 6.2.1. This extension is left as an exercise. Exercises 1. Prove E[(ZJm))2] < E(Z?). (cf. lemma 10.7.1). 2. Prove theorem 10.7.1 for the discrete and mixed continuous-discrete case. 3. Prove part (ii) of theorem 10.7.2.
Index
α-mixing, see mixing conditions
AR model, 137
ARMA memory index, 139-141
ARMA model, 138
ARMAX model: linear, 139, 154-155; nonlinear, 152-153, 165-168
ARX model, 139, 165
asymptotic normality of estimators, 65-67: ARMAX models, 160-164, 167-168; kernel regression, 217-225, 227-230; M-estimators, 77-78; maximum likelihood, 77; nonlinear regression, 69-71, 74-75
autocorrelation test, 176-178
base, 122
bias reduction of kernel estimators, 221, 223
Borel field, 1: Euclidean, 4; minimal, 3
Borel measurability, 7-10
Borel set, 4, 181
Borel-Cantelli lemma, see convergence of random variables
boundary of a set, 180-181
Box-Cox transformation, 62
Brownian bridge, 185
Brownian motion, see Wiener process
central limit theorems: functional central limit theorem, 185, 187, 193; i.i.d. random variables, 33; Liapounov, 33-34; martingale differences, 116-117
characteristic functions, 14-15: convergence, 31; inversion formula, 15
Chebishev's inequality, see inequalities
closure of a set, 181
cointegrated system, 205
cointegrated vectors, 205-211
cointegration, 179, 203-205
conditional expectation: identification, 50, 55-59; properties, 52-53, 112; relative to a Borel field, 110-113; relative to a random vector, 48-49
conditional M-test, see M-test
consistency of estimators, 63-64: ARMAX models, 157-159, 167; asymptotic variance, 71, 164, 168; kernel regression, 215-217, 226-229, 244; M-estimators, 78; maximum likelihood, 77; Newey-West estimator, 198; non-linear regression, 69, 73-74
consistent test, see M-test
convergence of distributions: asymptotic normality, 31; Borel measurable transformations, 37-39; continuous transformations, 29-30, 34-36; convergence in probability, 28-29; convergence of characteristic functions, 31; convergence of expectations, 27; mean process, 131; proper heterogeneity, 131; proper pointwise, 27, 73; proper setwise, 37-38, 74; weak convergence, 182-183, 187
convergence of mathematical expectations, 25-26: characteristic functions, 31; convergence of distributions, 27, 34; dominated convergence, 25-26, 54; Fatou's lemma, 25, 54; monotone convergence, 25, 54-55
convergence of random functions: pointwise, 41; pseudo-uniform, 41-42; uniform, 41-42; weak convergence, see convergence of distributions
convergence of random variables: almost surely, 19, 24; Borel-Cantelli lemma, 21; continuous transformations, 23; in probability, 19, 22-23
cross-validation, 235-236
dense set, 59, 181
Dickey-Fuller unit root tests, 187-192
distribution function, 2, 5
dominated convergence, see convergence of mathematical expectations
efficiency, see maximum likelihood
error correction models, 205-207
Fatou's lemma, see convergence of mathematical expectations
Hausman's test, 90
Hausman-White test, 90-94: inconsistency, 93
Holder's inequality, see inequalities
independence: mutual, 7; pairwise, 7; total, 6
induced probability measure, see probability measure
inequalities: Chebishev, 13, 53; Holder, 13, 53; Jensen, 13, 53; Liapounov, 13, 53; Minkowski, 13, 53
interior of a set, 181
Jensen's inequality, see inequalities
kernel, 214: higher-order kernel, 221-223, 229, 231-232
kernel density estimator, 213-214: pointwise consistency, 214
kernel regression estimator, 214-215: asymptotic normality, 217-225, 228-230, 244; asymptotic variance, 224; consistency, 215-217, 227-230, 244; time series, 237-247; uniform consistency, 225-227, 245-246
Lagrange multiplier test, 87-88
laws of large numbers: dependent random variables, 125-130; i.i.d. random variables, 22; Kolmogorov's strong law, 22; uncorrelated random variables, 21-22
Liapounov's inequality, see inequalities
likelihood ratio test, 84-87
linear separator, 148
linear translation invariance principle, 231
M-estimators, 77-78: adaptive, 80-82; asymptotic normality, 78; consistency, 78
M-test, 95-96: conditional M-test, 96-102; consistent integrated conditional M-test, 106-109; consistent randomized conditional M-test, 102-106, 168-175
martingale, 113: convergence theorems, 114-116; difference, 114
mathematical expectations, 11-14: Borel measurable transformations, 11; integral representation, 11-12; properties, 12; simple function, 11
maximum likelihood, 75-77: asymptotic normality, 77; consistency, 77; efficiency, 78-79; regularity conditions, 75-76
mean process, 131
metric, 180-182, 186-187
Minkowski's inequality, see inequalities
mixing conditions: α-mixing, 121; φ-mixing, 121; strong mixing, 121; uniform mixing, 121
mixingale, 118-120
monotone convergence, see convergence of mathematical expectations
Newton-step, 81
Neyman's C(α) test, 88
non-nested hypotheses, 89
non-nested tests, 89
ν-stability, 122-125
φ-mixing, see mixing conditions
probability measure, 1: induced, 4
probability space, 1
random function, 16-17
random variable, 2
random vector, 1
regression model, 60-63
sample space, 1
σ-algebra, 1
simple function, 8
Skorohod metric, 186-187
stochastic boundedness, 39
strong mixing, see mixing conditions
tightness, 183
uniform consistency, see kernel regression estimator
uniform laws of large numbers: dependent random variables, 129-132, 134; independent random variables, 43-47
uniform mixing, see mixing conditions
unit root, 179
VARMA model, 138
Wald test, 82-84, 88
weighted least squares, 90
Wiener process, 184-185
window width, 214
Wold decomposition, 136