Chapter 36
LARGE SAMPLE ESTIMATION TESTING* WHITNEY
AND HYPOTHESIS
K. NEWEY
Massachusetts Institute of Technology DANIEL
MCFADDEN
University of California, Berkeley
Contents
2113 2113 2120
Abstract 1. Introduction 2. Consistency
3.
2.1.
The basic consistency
2.2.
Identification
2121
theorem
2124
2.2.1.
The maximum
2.2.2.
Nonlinear
likelihood
2.2.3.
Generalized Classical
method
minimum
2.3.
Uniform
convergence
2.4.
Consistency
of maximum
2.5.
Consistency
of GMM
2.6.
Consistency
without
2.1.
Stochastic
2.8.
Least absolute
Maximum
2128 2129 2131
likelihood
2132 2133
compactness and uniform
deviations
Censored
2126
of moments distance
and continuity
equicontinuity
2.8.2.
2124 2125
least squares
2.2.4.
2.8.1.
estimator
2136
convergence
2138
examples
2138
score least absolute
2140
deviations
2141
Asymptotic normality
2143
3.1.
The basic results
3.2.
Asymptotic
normality
for MLE
2146
3.3.
Asymptotic
normality
for GMM
2148
*We are grateful to the NSF for financial support P. Ruud, and T. Stoker for helpful comments.
and to Y. Ait-Sahalia,
J. Porter, J. Powell, J. Robins,
Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden 0 1994 Elsevier Science B.V. All rights reserved
Ch. 36: Large Sample Estimation and Hypothesis Testing
2113
Abstract Asymptotic distribution theory is the primary method used to examine the properties of econometric estimators and tests. We present conditions for obtaining consistency and asymptotic normality of a very general class of estimators (extremum estimators). Consistent asymptotic variance estimators are given to enable approximation of the asymptotic distribution. Asymptotic efficiency is another desirable property then considered. Throughout the chapter, the general results are also specialized to common econometric estimators (e.g. MLE and GMM), and in specific examples we work through the conditions for the various results in detail. The results are also extended to two-step estimators (with finite-dimensional parameter estimation in the first step), estimators derived from nonsmooth objective functions, and semiparametric two-step estimators (with nonparametric estimation of an infinite-dimensional parameter in the first step). Finally, the trinity of test statistics is considered within the quite general setting of GMM estimation, and numerous examples are given. 1.
Introduction
Large sample distribution theory is the cornerstone of statistical inference for econometric models. The limiting distribution of a statistic gives approximate distributional results that are often straightforward to derive, even in complicated econometric models. These distributions are useful for approximate inference, including constructing approximate confidence intervals and test statistics. Also, the location and dispersion of the limiting distribution provides criteria for choosing between different estimators. Of course, asymptotic results are sensitive to the accuracy of the large sample approximation, but the approximation has been found to be quite good in many cases and asymptotic distribution results are an important starting point for further improvements, such as the bootstrap. Also, exact distribution theory is often difficult to derive in econometric models, and may not apply to models with unspecified distributions, which are important in econometrics. Because asymptotic theory is so useful for econometric models, it is important to have general results with conditions that can be interpreted and applied to particular estimators as easily as possible. The purpose of this chapter is the presentation of such results. Consistency and asymptotic normality are the two fundamental large sample properties of estimators considered in this chapter. A consistent estimator 6 is one that converges in probability to the true value Q,,, i.e. 6% 8,, as the sample size n goes to infinity, for all possible true values.’ This is a mild property, only requiring ‘This property is sometimes referred to as weak consistency, with strong consistency holding when(j converges almost surely to the true value. Throughout the chapter we focus on weak consistency, although we also show how strong consistency can be proven.
W.K. Newey and D. McFadden
2114
that the estimator is close to the truth when the number of observations is nearly infinite. Thus, an estimator that is not even consistent is usually considered inadequate. Also, consistency is useful because it means that the asymptotic distribution of an estimator is determined by its limiting behavior near the true parameter. An asymptotically normal estimator 6is one where there is an increasing function v(n) such that the distribution function of v(n)(8- 0,) converges to the Gaussian distribution function with mean zero and variance V, i.e. v(n)(8 - 6,) A N(0, V). The variance I/ of the limiting distribution is referred to as the asymptotic variance of @. The estimator &-consistent
is ,,/&-consistent
if v(n) = 6.
case, so that unless otherwise
noted,
This chapter asymptotic
focuses normality
on the will be
taken to include ,,&-consistency. Asymptotic normality and a consistent estimator of the asymptotic variance can be used to construct approximate confidence intervals. In particular, for an esti1 - CY mator c of V and for pori2satisfying Prob[N(O, 1) > gn,J = 42, an asymptotic confidence interval is
Cal-@=
ce-g,,2(m”2,e+f,,2(3/n)“2].
If P is a consistent estimator of I/ and I/ > 0, then asymptotic normality of 6 will imply that Prob(B,EY1 -,)1 - a as n+ co. 2 Here asymptotic theory is important for econometric practice, where consistent standard errors can be used for approximate confidence interval construction. Thus, it is useful to know that estimators are asymptotically normal and to know how to form consistent standard errors in applications. In addition, the magnitude of asymptotic variances for different estimators helps choose between estimators in practice. If one estimator has a smaller asymptotic variance, then an asymptotic confidence interval, as above, will be shorter for that estimator in large samples, suggesting preference for its use in applications. A prime example is generalized least squares with estimated disturbance variance matrix, which has smaller asymptotic variance than ordinary least squares, and is often used in practice. Many estimators share a common structure that is useful in showing consistency and asymptotic normality, and in deriving the asymptotic variance. The benefit of using this structure is that it distills the asymptotic theory to a few essential ingredients. The cost is that applying general results to particular estimators often requires thought and calculation. In our opinion, the benefits outweigh the costs, and so in these notes we focus on general structures, illustrating their application with examples. One general structure, or framework, is the class of estimators that maximize some objective function that depends on data and sample size, referred to as extremum estimators. An estimator 8 is an extremum estimator if there is an ‘The proof of this result is an exercise in convergence states that Y. 5 Y, and Z, %C implies Z, Y, &Y,.
in distribution
and the Slutzky theorem,
which
Ch. 36: Large Sample Estimation and Hypothesis
objective
function
o^maximizes
Testing
2115
o,(0) such that o,(Q) subject to HE 0,
(1.1)’
where 0 is the set of possible parameter values. In the notation, dependence of H^ on n and of i? and o,,(G) on the data is suppressed for convenience. This estimator is the maximizer of some objective function that depends on the data, hence the term “extremum estimator”.3 R.A. Fisher (1921, 1925), Wald (1949) Huber (1967) Jennrich (1969), and Malinvaud (1970) developed consistency and asymptotic normality results for various special cases of extremum estimators, and Amemiya (1973, 1985) formulated the general class of estimators and gave some useful results. A prime example of an extremum estimator is the maximum likelihood (MLE). Let the data (z,, , z,) be i.i.d. with p.d.f. f(zl0,) equal to some member of a family of p.d.f.‘s f(zI0). Throughout, we will take the p.d.f. f(zl0) to mean a probability function where z is discrete, and to possibly be conditioned on part of the observation z.~ The MLE satisfies eq. (1.1) with Q,(0) = nP ’ i
(1.2)
lnf(ziI 0).
i=l
Here o,(0) is the normalized log-likelihood. Of course, the monotonic transformation of taking the log of the likelihood and normalizing by n will not typically affect the estimator, but it is a convenient normalization in the theory. Asymptotic theory for the MLE was outlined by R.A. Fisher (192 1, 1925), and Wald’s (1949) consistency theorem is the prototype result for extremum estimators. Also, Huber (1967) gave weak conditions for consistency and asymptotic normality of the MLE and other extremum estimators that maximize a sample average.5 A second example is the nonlinear least squares (NLS), where for data zi = (yi, xi) with E[Y Ix] = h(x, d,), the estimator solves eq. (1.1) with
k(Q)= - n- l i
[yi- h(Xi,
!!I)]*.
(1.3)
i=l
Here maximizing o,(H) is the same as minimizing the sum of squared residuals. The asymptotic normality theorem of Jennrich (1969) is the prototype for many modern results on asymptotic normality of extremum estimators. 3“Extremum” rather than “maximum” appears here because minimizers are also special cases, with objective function equal to the negative of the minimand. 4More precisely, flzIH) is the density (Radon-Nikodym derivative) of the probability measure for z with respect to some measure that may assign measure 1 to some singleton’s, allowing for discrete variables, and for z = (y, x) may be the product of some measure for ~1with the marginal distribution of X, allowing f(z)O) to be a conditional density given X. 5Estimators that maximize a sample average, i.e. where o,(H) = n- ‘I:= 1q(z,,O),are often referred to as m-estimators, where the “m” means “maximum-likelihood-like”.
W.K. Nrwuy
2116
and D. McFuddrn
A third example is the generalized method of moments (GMM). Suppose that there is a “moment function” vector g(z, H) such that the population moments satisfy E[g(z, 0,)] = 0. A GMM estimator is one that minimizes a squared Euclidean distance of sample moments from their population counterpart of zero. Let ii/ be a positive semi-definite matrix, so that (m’@m) ‘P is a measure of the distance of m from zero. A GMM estimator is one that solves eq. (1.1) with
&I) = -
[n-l izln
Ytzi,
O)
1
‘*[ n-l it1 e)]. Ytzi3
(1.4)
This class includes linear instrumental variables estimators, where g(z, 0) =x’ ( y - Y’O),x is a vector of instrumental variables, y is a left-hand-side dependent variable, and Y are right-hand-side variables. In this case the population moment condition E[g(z, (!I,)] = 0 is the same as the product of instrumental variables x and the disturbance y - Y’8, having mean zero. By varying I% one can construct a variety of instrumental variables estimators, including two-stage least squares for k%= (n-‘~;=Ixix;)-‘.” The GMM class also includes nonlinear instrumental variables estimators, where g(z, 0) = x.p(z, Q)for a residual p(z, Q),satisfying E[x*p(z, (!I,)] = 0. Nonlinear instrumental variable estimators were developed and analyzed by Sargan (1959) and Amemiya (1974). Also, the GMM class was formulated and general results on asymptotic properties given in Burguete et al. (1982) and Hansen (1982). The GMM class is general enough to also include MLE and NLS when those estimators are viewed as solutions to their first-order conditions. In this case the derivatives of Inf(zI 0) or - [y - h(x, H)12 become the moment functions, and there are exactly as many moment functions as parameters. Thinking of GMM as including MLE, NLS, and many other estimators is quite useful for analyzing their asymptotic distribution, but not for showing consistency, as further discussed below. A fourth example is classical minimum distance estimation (CMD). Suppose that there is a vector of estimators fi A x0 and a vector of functions h(8) with 7c,,= II( The idea is that 71consists of “reduced form” parameters, 0 consists of “structural” parameters, and h(0) gives the mapping from structure to reduced form. An estimator of 0 can be constructed by solving eq. (1.1) with
&@I)= -
[72-
h(U)]‘ci+t-
h(U)],
(1.5)
where k? is a positive semi-definite matrix. This class of estimators includes classical minimum chi-square methods for discrete data, as well as estimators for simultaneous equations models in Rothenberg (1973) and panel data in Chamberlain (1982). Its asymptotic properties were developed by Chiang (1956) and Ferguson (1958). A different framework that is sometimes useful is minimum distance estimation. “The l/n normalization in @does not affect the estimator, but, by the law oflarge numbers, that W converges in probability to a constant matrix, a condition imposed below.
will imply
Ch. 36: Large Sample Estimation and Hypothesis
Testing
2117
a class of estimators that solve eq. (1.1) for Q,,(d) = - &,(@‘@/g,(@, where d,(d) is a vector of the data and parameters such that 9,(8,) LO and I@ is positive semidefinite. Both GMM and CMD are special cases of minimum distance, with g,,(H) = n- l XI= 1 g(zi, 0) for GMM and g,(0) = 72- h(0) for CMD.’ This framework is useful for analyzing asymptotic normality of GMM and CMD, because (once) differentiability of J,(0) is a sufficient smoothness condition, while twice differentiability is often assumed for the objective function of an extremum estimator [see, e.g. Amemiya (1985)]. Indeed, as discussed in Section 3, asymptotic normality of an extremum estimator with a twice differentiable objective function Q,(e) is actually a special case 0, asymptotic normality of a minimum distance estimator, with d,(0) = V,&(0) and W equal to an identity matrix, where V, denotes the partial derivative. The idea here is that when analyzing asymptotic normality, an extremum estimator can be viewed as a solution to the first-order conditions V,&(Q) = 0, and in this form is a minimum distance estimator. For consistency, it can be a bad idea to treat an extremum estimator as a solution to first-order conditions rather than a global maximum of an objective function, because the first-order condition can have multiple roots even when the objective function has a unique maximum. Thus, the first-order conditions may not identify the parameters, even when there is a unique maximum to the objective function. Also, it is often easier to specify primitive conditions for a unique maximum than for a unique root of the first-order conditions. A classic example is the MLE for the Cauchy location-scale model, where z is a scalar, p is a location parameter, 0 a scale parameter, and f(z 10) = Ca- ‘( 1 + [(z - ~)/cJ]*)- 1 for a constant C. It is well known that, even in large samples, there are many roots to the first-order conditions for the location parameter ~1,although there is a global maximum to the likelihood function; see Example 1 below. Econometric examples tend to be somewhat less extreme, but can still have multiple roots. An example is the censored least absolute deviations estimator of Powell (1984). This estimator solves eq. (1.1) for Q,,(O) = -n-‘~;=,Jyimax (0, xi0) 1,where yi = max (0, ~18, + si}, and si has conditional median zero. A global maximum of this function over any compact set containing the true parameter will be consistent, under certain conditions, but the gradient has extraneous roots at any point where xi0 < 0 for all i (e.g. which can occur if xi is bounded). The importance for consistency of an extremum estimator being a global maximum has practical implications. Many iterative maximization procedures (e.g. Newton Raphson) may converge only to a local maximum, but consistency results only apply to the global maximum. Thus, it is often important to search for a global maximum. One approach to this problem is to try different starting values for iterative procedures, and pick the estimator that maximizes the objective from among the converged values. AS long as the extremum estimator is consistent and the true parameter is an element of the interior of the parameter set 0, an extremum estimator will be ‘For
GMM.
the law of large numbers
implies cj.(fI,) 50.
W.K. Newey und D. McFadden
2118
a root of the first-order conditions asymptotically, and hence will be included among the local maxima. Also, this procedure can avoid extraneous boundary maxima, e.g. those that can occur in maximum likelihood estimation of mixture models. Figure 1 shows a schematic, illustrating the relationships between the various types of estimators introduced so far: The name or mnemonic for each type of estimator (e.g. MLE for maximum likelihood) is given, along with objective function being maximized, except for GMM and CMD where the form of d,(0) is given. The solid arrows indicate inclusion in a class of estimators. For example, MLE is included in the class of extremum estimators and GMM is a minimum distance estimator. The broken arrows indicate inclusion in the class when the estimator is viewed as a solution to first-order conditions. In particular, the first-order conditions for an extremum estimator are V,&(Q) = 0, making it a minimum distance estimator with g,,(0) = V,&(e) and I%‘= I. Similarly, the first-order conditions for MLE make it a GMM estimator with y(z, 0) = VBIn f(zl0) and those for NLS a GMM estimator with g(z, 0) = - 2[y - h(x, B)]V,h(x, 0). As discussed above, these broken arrows are useful for analyzing the asymptotic distribution, but not for consistency. Also, as further discussed in Section 7, the broken arrows are not very useful when the objective function o,(0) is not smooth. The broad outline of the chapter is to treat consistency, asymptotic normality, consistent asymptotic variance estimation, and asymptotic efficiency in that order. The general results will be organized hierarchically across sections, with the asymptotic normality results assuming consistency and the asymptotic efficiency results assuming asymptotic normality. In each section, some illustrative, self-contained examples will be given. Two-step estimators will be discussed in a separate section, partly as an illustration of how the general frameworks discussed here can be applied and partly because of their intrinsic importance in econometric applications. Two later sections deal with more advanced topics. Section 7 considers asymptotic normality when the objective function o,(0) is not smooth. Section 8 develops some asymptotic theory when @ depends on a nonparametric estimator (e.g. a kernel regression, see Chapter 39). This chapter is designed to provide an introduction to asymptotic theory for nonlinear models, as well as a guide to recent developments. For this purpose,
Extremum
O.@) /
i$,{yi - 4~
/ MLE
@l’/n
Distance
-AW~cm
\
NLS
-
Minimum
------_---__*
\ CMD
GMM
iglsh
i In f(dWn ,=I
L-_________l___________T Figure
1
Q/n
{A(@))
3 - WI)
Ch. 36: Lurge Sample Estimation und Hypothesis
Testing
2119
Sections 226 have been organized in such a way that the more basic material is collected in the first part of each section. In particular, Sections 2.1-2.5, 3.1-3.4, 4.1-4.3, 5.1, and 5.2, might be used as text for part of a second-year graduate econometrics course, possibly also including some examples from the other parts of this chapter. The results for extremum and minimum distance estimators are general enough to cover data that is a stationary stochastic process, but the regularity conditions for GMM, MLE, and the more specific examples are restricted to i.i.d. data. Modeling data as i.i.d. is satisfactory in many cross-section and panel data applications. Chapter 37 gives results for dependent observations. This chapter assumes some familiarity with elementary concepts from analysis (e.g. compact sets, continuous functions, etc.) and with probability theory. More detailed familiarity with convergence concepts, laws of large numbers, and central limit theorems is assumed, e.g. as in Chapter 3 of Amemiya (1985), although some particularly important or potentially unfamiliar results will be cited in footnotes. The most technical explanations, including measurability concerns, will be reserved to footnotes. Three basic examples will be used to illustrate the general results of this chapter. Example 1.I (Cauchy location-scale) In this example z is a scalar random variable, 0 = (11,c)’ is a two-dimensional vector, and z is continuously distributed with p.d.f. f(zId,), where f(zl@ = C-a- ’ { 1 + [(z - ~)/a]~} -i and C is a constant. In this example p is a location parameter and 0 a scale parameter. This example is interesting because the MLE will be consistent, in spite of the first-order conditions having many roots and the nonexistence of moments of z (e.g. so the sample mean is not a consistent estimator of 0,). Example 1.2 (Probit) Probit is an MLE example where z = (y, x’) for a binary variable y, y~(0, l}, and a q x 1 vector of regressors x, and the conditional probability of y given x is f(zl0,) for f(zl0) = @(x’@~[ 1 - @(x’Q)]’ -y. Here f(z ItI,) is a p.d.f. with respect to integration that sums over the two different values of y and integrates over the distribution of x, i.e. where the integral of any function a(y, x) is !a(~, x) dz = E[a( 1, x)] + Epu(O,x)]. This example illustrates how regressors can be allowed for, and is a model that is often applied. Example 1.3 (Hansen-Singleton) This is a GMM (nonlinear instrumental variables) example, where g(z, 0) = x*p(z, 0) for p(z, 0) = p*w*yy - 1. The functional form here is from Hansen and Singleton (1982), where p is a rate of time preference, y a risk aversion parameter, w an asset return, y a consumption ratio for adjacent time periods, and x consists of variables
Ch. 36: Large Sample Estimation and Hypothesis
2121
Testing
lead to the estimator being close to one of the maxima, which does not give consistency (because one of the maxima will not be the true value of the parameter). The condition that QO(0) have a unique maximum at the true parameter is related to identification. The discussion so far only allows for a compact parameter set. In theory compactness requires that one know bounds on the true parameter value, although this constraint is often ignored in practice. It is possible to drop this assumption if the function Q,(0) cannot rise “too much” as 8 becomes unbounded, as further discussed below. Uniform convergence and continuity of the limiting function are also important. Uniform convergence corresponds to the feature of the graph that Q,(e) was in the “sleeve” for all values of 0E 0. Conditions for uniform convergence are given below. The rest of this section develops this descriptive discussion into precise results on consistency of extremum estimators. Section 2.1 presents the basic consistency theorem. Sections 2.222.5 give simple but general sufficient conditions for consistency, including results for MLE and GMM. More advanced and/or technical material is contained in Sections 2.662.8.
2.1.
The basic consistency
theorem
To state a theorem it is necessary probability, as follows:
to define
Uniform convergence_in
o,(d) converges
probability:
precisely
uniform
uniformly
convergence
in
in probability
to
Qd@ meanssu~~~~l Q,(e) - Qd@ 30. The following is the fundamental consistency is similar to Lemma 3 of Amemiya (1973).
result for extremum
estimators,
and
Theorem 2.1 If there is a function QO(0) such that (i)&(8) IS uniquely maximized at 8,; (ii) 0 is compact; (iii) QO(0) is continuous; (iv) Q,,(e) converges uniformly in probability to Q,(0), then i?p.
19,.
Proof For any E > 0 we have wit_h propability 43 by eq. (1.1); (b)
approaching
one (w.p.a.1) (a) Q,(g) > Q,(O,) -
Qd@ > Q.(o) - e/3 by (iv); (4 Q,&J > Qd&J - 43 by W9
‘The probability statements in this proof are only well defined if each of k&(8),, and &8,) are measurable. The measurability issue can be bypassed by defining consistency and uniform convergence in terms of outer measure. The outer measure of a (possibly nonmeasurable) event E is the infimum of E[ Y] over all random variables Y with Y 2 l(8), where l(d) is the indicator function for the event 6.
W.K. Newey and D. McFadden
2122
Therefore,
w.p.a. 1,
(b) Q,(e, > Q,(o^, - J3?
Q&J
- 2E,3(? Qo(&J - E.
Thus, for any a > 0, Q,(Q) > Qe(0,) - E w.p.a.1. Let .,Ir be any open subset of 0 containing fI,. By 0 n.4”’ compact, (i), and (iii), SU~~~~~,-~Q~(~) = Qo(8*) < Qo(0,) for some 0*~ 0 n Jt”. Thus, choosing E = Qo_(fIo)- supBE .,,flCQ0(8), it follows that Q.E.D. w.p.a.1 Q,(6) > SU~~~~~,~~Q,,(H), and hence (3~~4”. The conditions of this theorem are slightly stronger than necessary. It is not necessary to assume that 8 actually maximi_zes_the objectiv_e function. This assumption can be replaced by the hypothesis that Q,(e) 3 supBE @Q,,(d)+ o,(l). This replacement has no effect on the proof, in particular on part (a), so that the conclusion remains true. These modifications are useful for analyzing some estimators in econometrics, such as the maximum score estimator of Manski (1975) and the simulated moment estimators of Pakes (1986) and McFadden (1989). These modifications are not given in the statement of the consistency result in order to keep that result simple, but will be used later. Some of the other conditions can also be weakened. Assumption (iii) can be changed to upper semi-continuity of Q,,(e) and (iv) to Q,,(e,) A Q,(fI,) and for all E > 0, Q,(0) < Q,(e) + E for all 19~0 with probability approaching one.” Under these weaker conditions the conclusion still is satisfied, with exactly the same proof. Theorem 2.1 is a weak consistency result, i.e. it shows I!?3 8,. A corresponding strong consistency result, i.e. H^Z Ho, can be obtained by assuming that supBE eJ Q,(0) - Qo(0) 1% 0 holds in place of uniform convergence in probability. The proof is exactly the same as that above, except that “as. for large enough n” replaces “with probability approaching one”. This and other results are stated here for convergence in probability because it suffices for the asymptotic distribution theory. This result is quite general, applying to any topological space. Hence, it allows for 0 to be infinite-dimensional, i.e. for 19to be a function, as would be of interest for nonparametric estimation of (say) a density or regression function. However, the compactness of the parameter space is difficult to check or implausible in many cases where B is infinite-dimensional. To use this result to show consistency of a particular estimator it must be possible to check the conditions. For this purpose it is important to have primitive conditions, where the word “primitive” here is used synonymously with the phrase “easy to interpret”. The compactness condition is primitive but the others are not, so that it is important to discuss more primitive conditions, as will be done in the following subsections. I0 Uppersemi-continuity means that for any OE 0 and t: > 0 there is an open subset. 0 such that Q”(P) < Q,(0) + E for all U’EA’.
V of 0 containing
Ch. 36: Large Sample Estimation and Hypothesis
Testing
2123
Condition (i) is the identification condition discussed above, (ii) the boundedness condition on the parameter set, and (iii) and (iv) the continuity and uniform convergence conditions. These can be loosely grouped into “substantive” and “regularity” conditions. The identification condition (i) is substantive. There are well known examples where this condition fails, e.g. linear instrumental variables estimation with fewer instruments than parameters. Thus, it is particularly important to be able to specify primitive hypotheses for QO(@ to have a unique maximum. The compactness condition (ii) is also substantive, with eOe 0 requiring that bounds on the parameters be known. However, in applications the compactness restriction is often ignored. This practice is justified for estimators where compactness can be dropped without affecting consistency of estimators. Some of these estimators are discussed in Section 2.6. Uniform convergence and continuity are the hypotheses that are often referred to as “the standard regularity conditions” for consistency. They will typically be satisfied when moments of certain functions exist and there is some continuity in Q,(O) or in the distribution of the data. Moment existence assumptions are needed to use the law of large numbers to show convergence of Q,(0) to its limit Q,,(0). Continuity of the limit QO(0) is quite a weak condition. It can even be true when Q,(0) is not continuous, because continuity of the distribution of the data can “smooth out” the discontinuities in the sample objective function. Primitive regularity conditions for uniform convergence and continuity are given in Section 2.3. Also, Section 2.7 relates uniform convergence to stochastic equicontinuity, a property that is necessary and sufficient for uniform convergence, and gives more sufficient conditions for uniform convergence. To formulate primitive conditions for consistency of an extremum estimator, it is necessary to first find Q0(f9). Usually it is straightforward to calculate QO(@ as the probability limit of Q,(0) for any 0, a necessary condition for (iii) to be satisfied. This calculation can be accomplished by applying the law of large numbers, or hypotheses about convergence of certain components. For example, the law of large numbers implies that for MLE the limit of Q,(0) is QO(0) = E[lnf(zI 0)] and for NLS QO(0) = - E[ {y - h(x, @}‘I. Note the role played here by the normalization of the log-likelihood and sum of squared residuals, that leads to the objective function converging to a nonzero limit. Similar calculations give the limit for GMM and CMD, as further discussed below. Once this limit has been found, the consistency will follow from the conditions of Theorem 2.1. One device that may allow for consistency under weaker conditions is to treat 8 as a maximum of Q,(e) - Q,(e,) rather than just Q,(d). This is a magnitude normalization that sometimes makes it possible to weaken hypotheses on existence of moments. In the censored least absolute deviations example, where Q,,(e) = -n-rC;=,lJ$max (0, xi0) (, an assumption on existence of the expectation of y is useful for applying a law of large numbers to show convergence of Q,(0). In contrast Q,,(d) - Q,,(&) = -n- ’ X1= 1[ (yi -max{O, x:6} I- (yi --ax (0, XI@,}I] is a bounded function of yi, so that no such assumption is needed.
2124
2.2.
W.K. Newey end D. McFadden
Ident$cution
The identification condition for consistency of an extremum estimator is that the limit of the objective function has a unique maximum at the truth.” This condition is related to identification in the usual sense, which is that the distribution of the data at the true parameter is different than that at any other possible parameter value. To be precise, identification is a necessary condition for the limiting objective function to have a unique maximum, but it is not in general sufficient.” This section focuses on identification conditions for MLE, NLS, GMM, and CMD, in order to illustrate the kinds of results that are available. 2.2.1.
The maximum
likelihood estimator
An important feature of maximum likelihood is that identification is also sufficient for a unique maximum. Let Y, # Y2 for random variables mean Prob({ Y1 # Y,})>O. Lemma 2.2 (Information
inequality)
If 8, is identified [tI # 0, and 0~ 0 implies f(z 10)# f(z 1O,)] and E[ 1In f(z 10)I] < cc for all 0 then QO(tl) = E[lnf(zI@] has a unique maximum at 8,. Proof By the strict dom variable
version of Jensen’s inequality, for any nonconstant, positive Y, - ln(E[Y]) < E[ - ln(Y)].r3 Then for a = f(zIfI)/f(zI0,)
ranand
~~~,,Q,~~,~-Q,~~~=~C~-~~Cf~~I~~lf~~I~,~l~l~-~n~C~f(zl~)lf(zl~~)~l= Q.E.D. - In [i.f(z (B)dz] = 0. The term “information inequality” refers to an interpretation of QO(0) as an information measure. This result means that MLE has the very nice feature that uniqueness of the maximum of the limiting objective function occurs under the very weakest possible condition of identification of 8,. Conditions for identification in particular models are specific to those models. It
‘i If the set of maximands .1 of the objective function has more than one element, then this set does not distinguish between the true parameter and other values. In this case further restrictions are needed for identification. These restrictions are sometimes referred to as normalizations. Alternatively, one could work with convergence in probability to a set .,*/R,but imposing normalization restrictions is more practical, and is needed for asymptotic normality. “If Or, is not identified, then there will be some o# 0, such that the distribution of the data is the same when 0 is the true parameter value>s when 0, is the true parameter value. Therefore, Q*(O) will also be limiting objective function when 0 is the true parameter, and hence the requirement that Q,,(O) be maximized at the true parameter implies that Q,,(O) has at least two maxima, flo and 0. i3The strict version of Jensen’s inequality states that if a(y) is a strictly concave function [e.g. a(y) = In(y)] and Y is a nonconstant random variable, then a(E[Y]) > E[a(Y)].
Ch. 36:
Large
Samplr
Estimation
and Hypothesis
Testing
is often possible to specify them in a way that is easy to interpret way), as in the Cauchy example. Exampk
2125 (i.e. in a “primitive”
1.1 continued
It will follow from Lemma 2.2 that E[ln,f(z10)] has a unique maximum at the true parameter. Existence of E [I In f(z I@[] for all 0 follows from Ilnf(zIO)I d C, + ln(l+a-2~~-~~2)
0. Thus, by the information inequality, E [ln f(z I O)] has a unique maximum at OO.This example illustrates that it can be quite easy to show that the expected log-likelihood has a unique maximum, even when the first-order conditions for the MLE do not have unique roots. Example
I .2 continued
Throughout the probit example, the identification and regularity conditions will be combined in the assumption that the second-moment matrix E[xx’] exists and is nonsingular. This assumption implies identification. To see why, note that nonsingularity of E[xx’] implies that it is positive definite. Let 0 # O,, so that E[{x’(O - O,)}“] = (0 - O,)‘E[xx’](O - 0,) > 0, implying that ~‘(0 - 0,) # 0, and hence x’0 # x’OO, where as before “not equals” means “not equal on a set of positive probability”. Both Q(u) and @( - u) are strictly monotonic, so that x’0 # ~‘0, implies both @(x’O) # @(x’O,) and 1 - @(X’S) # 1 - @(x’O,), and hence that f(z I 0) = @(x’O)Y[1 - @(x’O)] l py # f(z IO,). Existence of E[xx’] also implies that E[ Ilnf(zlO)l] < co. It is well known that the derivative d In @(u)/du = %(u)= ~(U)/@(U) [for 4(u) = V,@(u)], is convex and asymptotes to - u as u -+ - cc, and to zero as u + co. Therefore, a mean-value expansion around 0 = 0 gives Iln @(x’O)l = Iln @(O) + ~(x’8”)x’O1d Iln Q(O)\ + i(x’@)lx’OI
~I~~~~~~I+~~~+I~‘~l~l~‘~Idl~~~(~~I+C(~+IIxII lIOIl)llxlI IlOll. Since 1 -@(u)=@(-u)andyis bounded, (lnf(zIO)Id2[Iln@(O)I+C(l + 11x/I x II 0 II )II x /III 0 II 1, so existence of second moments of x implies that E[ Ilnf(z1 O)/] is finite. This part of the probit example illustrates the detailed work that may be needed to verify that moment existence assumptions like that of Lemma 2.2 are satisfied. 2.2.2.
Nonlinear
least squares
The identification condition for NLS is that the mean square error E[ { y - h(x,O)l’] = - QJO) have a unique minimum at OO.As is easily shown, the mean square error
W.K. Newey
2126
und D. McFudden
has a unique minimum at the conditional mean. I4 Since h(x,O,) = E[ylx] is the conditional mean, the identification condition for NLS is that h(x, 0) # h(x, 0,) if 0 # 8,, i.e. that h(x, 0) is not the conditional mean when 8 # 0,. This is a natural “conditional mean” identification condition for NLS. In some cases identification will not be sufficient for conditional mean identification. Intuitively, only parameters that affect the first conditional moment of y given x can be identified by NLS. For example, if 8 includes conditional variance parameters, or parameters of other higher-order moments, then these parameters may not be identified from the conditional mean. As for identification, it is often easy to give primitive hypotheses for conditional mean identification. For example, in the linear model h(x, 19)= x’d conditional mean identification holds if E[xx’] is nonsingular, for then 6 # 0, implies ~‘6’ # x’O,,, as shown in the probit example. For another example, suppose x is a positive scalar and h(x, 6) = c( + bxy. As long as both PO and y0 are nonzero, the regression curve for a different value of 6 intersects the true curve at most at three x points. Thus, for identification it is sufficient that x have positive density over any interval, or that x have more than three points that have positive probability. 2.2.3.
Generalized
method
of moments
For generalized method of moments the limit cated than for MLE or NLS, but is still easy g,(O) L g,,(O) = E[g(z, O)], so that if 6’ A W W, then by continuity of multiplication, Q,(d) tion has a maximum of zero at 8,, so 8, will 0 # 00. Lemma
2.3 (GMM
function QO(fI)is a little more complito find. By the law of large numbers, for some positive semi-definite matrix 3 Q,JO) = - go(O) Wg,(B). This funcbe identified if it is less than zero for
identification)
If W is positive semi-definite and, for go(Q) = E[g(z, S)], gO(O,) = 0 and Wg,(8) for 0 # 8, then QJfI) = - g0(0)‘Wg,(8) has a unique maximum at 8,.
# 0
Proof
Let R be such that R’R = W. If 6’# (I,, then 0 # Wg,(8) = R’RgJB) implies Rg,(O) #O and hence QO(@ = - [RgO(0)]‘[Rgo(fl)] < QO(fl,) = 0 for 8 # Be. Q.E.D. The GMM identification condition is that if 8 # 8, then go(O) is not in the null space of W, which for nonsingular W reduces to go(B) being nonzero if 8 # 0,. A necessary order condition for GMM identification is that there be at least as many moment
“‘For ECOI
m(x)= E[ylx]
and
a(x) any
-a(~))~1 = ECOI -m(4)2l + ~JX{Y
with strict inequality
if a(x) #m(x).
function -m(4Hm(x)
with
finite
-&)}I
variance,
iterated
expectations
gives
+ EC~m(x)-~(x)}~l~ EC{y-m(x)}‘],
Ch. 36: Large Sumplr
Esrimution
and Hypothesis
Testing
2121
functions as parameters. If there are fewer moments than parameters, then there will typically be many solutions to ~~(8) = 0. If the moment functions are linear, say y(z, Q) = g(z) + G(z)0, then the necessary and sufficient rank condition for GMM identification is that the rank of WE[G(z)J is equal to the number of columns. For example, consider a linear instrumental variables estimator, where g(z, 19)= x.(y - Y’Q) for a residual y - Y’B and a vector of instrumental variables x. The two-stage least squares estimator of 8 is a GMM estimator with W = (C!‘= 1xixi/n)- ‘. Suppose that E[xx’] exists and is nonsingular, so that W = (E[xx’])- i by the law of large numbers. Then the rank condition for GMM identification is E[xY’] has full column rank, the well known instrumental variables identification condition. If E[Y’lx] = x’rt then this condition reduces to 7~having full column rank, a version of the single equation identification condition [see F.M. Fisher (1976) Theorem 2.7.11. More generally, E[xY’] = E[xE[Y’jx]], so that GMM identification is the same as x having “full rank covariance” with
-uYlxl.
If E[g(z, 0)] is nonlinear in 0, then specifying primitive conditions for identification becomes quite difficult. Here conditions for identification are like conditions for unique solutions of nonlinear equations (as in E[g(z, e)] = 0), which are known to be difficult. This difficulty is another reason to avoid formulating 8 as the solution to the first-order condition when analyzing consistency, e.g. to avoid interpreting MLE as a GMM estimator with g(z, 0) = V, In f(z 119). In some cases this difficulty is unavoidable, as for instrumental variables estimators of nonlinear simultaneous equations models.’ 5 Local identification analysis may be useful when it is difficult to find primitive conditions for (global) identification. If g(z,@ is continuously differentiable and VOE[g(z, 0)] = E[V,g(z, Q)], then by Rothenberg (1971), a sufficient condition for a unique solution of WE[g(z, 8)] = 0 in a (small enough) neighborhood of 0, is that WEIVOg(z,Bo)] have full column rank. This condition is also necessary for local identification, and hence provides a necessary condition for global identification, when E[V,g(z, Q)] has constant rank in a neighborhood of 8, [i.e. in Rothenberg’s (1971) “regular” case]. For example, for nonlinear 2SLS, where p(z, e) is a residual and g(z, 0) = x.p(z, 8), the rank condition for local identification is that E[x.V,p(z, f&J’] has rank equal to its number of columns. A practical “solution” to the problem of global GMM identification, that has often been adopted, is to simply assume identification. This practice is reasonable, given the difficulty of formulating primitive conditions, but it is important to check that it is not a vacuous assumption whenever possible, by showing identification in some special cases. In simple models it may be possible to show identification under particular forms for conditional distributions. The Hansen-Singleton model provides one example. “There are some useful results on identification (1983) and Roehrig remains difficult.
(1989), although
global
of nonlinear simultaneous equations models in Brown identification analysis of instrumental variables estimators
W.K. Newey and D. McFadden
2128
Example
I .3 continued
Suppose that l? = (n-l C;= 1x,x;), so that the GMM estimator is nonlinear twostage least squares. By the law of large numbers, if E[xx’] exists and is nonsingular, Then the l?’ will converge in probability to W = (E[xx’])~‘, which is nonsingular. GMM identification condition is that there is a unique solution to E[xp(z, 0)] = 0 at 0 = H,, where p(z, 0) = {/?wy’ - 1). Quite primitive conditions for identification can be formulated in a special log-linear case. Suppose that w = exp[a(x) + u] and y = exp[b(x) + u], where (u, u) is independent of x, that a(x) + y,b(x) is constant, and that rl(0,) = 1 for ~(0) = exp[a(x) + y,b(x)]aE[exp(u + yv)]. Suppose also that the first element is a constant, so that the other elements can be assumed to have mean zero (by “demeaning” if necessary, which is a nonsingular linear transformation, and so does not affect the identification analysis). Let CI(X,y)=exp[(Y-yJb(x)]. Then E[p(z, @lx] = a(x, y)v](@- 1, which is zero for 0 = BO,and hence E[y(z, O,)] = 0. For 8 # B,, E[g(z, 0)] = {E[cr(x, y)]q(8) - 1, Cov [x’, a(x, y)]q(O)}‘. This expression is nonzero if Cov[x, a(x, y)] is nonzero, because then the second term is nonzero if r](B) is nonzero and the first term is nonzero if ~(8) = 0. Furthermore, if Cov [x, a(x, y)] = 0 for some y, then all of the elements of E[y(z, 0)] are zero for all /J and one can choose /I > 0 so the first element is zero. Thus, Cov[x, c((x, y)] # 0 for y # y0 is a necessary and sufficient condition for identification. In other words, the identification condition is that for all y in the parameter set, some coefficient of a nonconstant variable in the regression of a(x, y) on x is nonzero. This is a relatively primitive condition, because we have some intuition about when regression coefficients are zero, although it does depend on the form of b(x) and the distribution of x in a complicated way. If b(x) is a nonconstant, monotonic function of a linear combination of x, then this covariance will be nonzero. l6 Thus, in this example it is found that the assumption of GMM identification is not vacuous, that there are some nice special cases where identification does hold. 2.2.4.
Classical minimum distance
The analysis
of CMD
identification
is very similar
to that for GMM.
If AL
r-r0
and %‘I W, W positive semi-definite, then Q(0) = - [72- h(B)]‘@72 - h(6)] -% - [rco - h(0)]’ W[q, - h(O)] = Q,(O). The condition for Qo(8) to have a unique maximum (of zero) at 0, is that h(8,) = rcOand h(B) - h(0,) is not in the null space of W if 0 # Be, which reduces to h(B) # h(B,) if W is nonsingular. If h(8) is linear in 8 then there is a readily interpretable rank condition for identification, but otherwise the analysis of global identification is difficult. A rank condition for local identification is that the rank of W*V,h(O,) equals the number of components of 0.
“It is well known variable x.
that Cov[.x,J(x)]
# 0 for any monotonic,
nonconstant
function
,f(x) of a random
Ch. 36: Laryr Sample Estimation and Hypothesis
2.3.
Unform
convergence
2129
Testing
and continuity
Once conditions for identification have been found and compactness of the parameter set has been assumed, the only other primitive conditions for consistency required by Theorem 2.1 are those for uniform convergence in probability and continuity of the limiting objective function. This subsection gives primitive hypotheses for these conditions that, when combined with identification, lead to primitive conditions for consistency of particular estimators. For many estimators, results on uniform convergence of sample averages, known as uniform laws oflarge numbers, can be used to specify primitive regularity conditions. Examples include MLE, NLS, and GMM, each of which depends on sample averages. The following uniform law of large numbers is useful for these estimators. Let a(z, 6) be a matrix of functions of an observation z and the parameter 0, and for a matrix A = [aj,], let 11 A 11= (&&)“’ be the Euclidean norm. Lemma
2.4
If the data are i.i.d., @is compact, a(~,, 0) is continuous at each 0~ 0 with probability one, and there is d(z) with 11 a(z,d)ll d d(z) for all 8~0 and E[d(z)] < co, then E[a(z, e)] is continuous
and supeto /In- ‘x1= i a(~,, 0) - E[a(z, 0)] I/ 3
0.
The conditions of this result are similar to assumptions of Wald’s (1949) consistency proof, and it is implied by Lemma 1 of Tauchen (1985). The conditions of this result are quite weak. In particular, they allow for a(~,@ this result is useful to not be continuous on all of 0 for given z.l’ Consequently, even when the objective function is not continuous, as for Manski’s (1975) maximum score estimator and the simulation-based estimators of Pakes (1986) and McFadden (1989). Also, this result can be extended to dependent data. The conclusion remains true if the i.i.d. hypothesis is changed to strict stationarity and ergodicity of zi.i8 The two conditions imposed on a(z, 0) are a continuity condition and a moment existence condition. These conditions are very primitive. The continuity condition can often be verified by inspection. The moment existence hypothesis just requires a data-dependent upper bound on IIa(z, 0) II that has finite expectation. This condition is sometimes referred to as a “dominance condition”, where d(z) is the dominating function. Because it only requires that certain moments exist, it is a “regularity condition” rather than a “substantive restriction”. It is often quite easy to see that the continuity condition is satisfied and to specify moment hypotheses for the dominance condition, as in the examples.
r
'The conditions of Lemma 2.4 are not sufficient but are sufficient for convergence of the supremum sufficient for consistency of the estimator in terms objective function is not continuous, as previously “Strict stationarity means that the distribution and ergodicity implies that n- ‘I:= ,a(zJ + E[a(zJ]
for measurability of the supremum in the conclusion, in outer measure. Convergence in outer measure is of outer measure, a result that is useful when the noted, of (zi, zi + ,, , z.,+,) does not depend on i for any tn, for (measurable) functions a(z) with E[ la(z)l] < CO.
Ch. 36: Large Sample Estimation and Hypothesis Testing
2.4.
Consistency
of maximum
2131
likelihood
The conditions for identification in Section 2.2 and the uniform convergence result of Lemma 2.4, allow specification of primitive regularity conditions for particular kinds of estimators. A consistency result for MLE can be formulated as follows: Theorem 2.5
Suppose that zi, (i = 1,2,. . .), are i.i.d. with p.d.f. f(zJ0,) and (i) if 8 f8, then f(zi18) #f(zilO,); (ii) B,E@, which is compact; (iii) In f(z,le) is continuous at each 8~0 with probability one; (iv) E[supe,oIlnf(~18)1] < co. Then &Lo,. Proof
Proceed by verifying the conditions of Theorem 2.1. Condition 2.1(i) follows by 2.5(i) and (iv) and Lemma 2.2. Condition 2.l(ii) holds by 2S(ii). Conditions 2.l(iii) and (iv) Q.E.D. follow by Lemma 2.4. The conditions of this result are quite primitive and also quite weak. The conclusion is consistency of the MLE. Thus, a particular MLE can be shown to be consistent by checking the conditions of this result, which are identification, compactness, continuity of the log-likelihood at particular points, and a dominance condition for the log-likelihood. Often it is easy to specify conditions for identification, continuity holds by inspection, and the dominance condition can be shown to hold with a little algebra. The Cauchy location-scale model is an example. Example 1 .l continued
To show consistency of the Cauchy MLE, one can proceed to verify the hypotheses of Theorem 2.5. Condition (i) was shown in Section 2.2.1. Conditions (iii) and (iv) were shown in Section 2.3. Then the conditions of Theorem 2.5 imply that when 0 is any compact set containing 8,, the Cauchy MLE is consistent. A similar result can be stated for probit (i.e. Example 1.2). It is not given here because it is possible to drop the compactness hypothesis of Theorem 2.5. The probit log-likelihood turns out to be concave in parameters, leading to a simple consistency result without a compact parameter space. This result is discussed in Section 2.6. Theorem 2.5 remains true if the i.i.d. assumption is replaced with the condition thatz,,~,,... is stationary and ergodic with (marginal) p.d.f. of zi given byf(z IO,). This relaxation of the i.i.d. assumption is possible because the limit function remains unchanged (so the information inequality still applies) and, as noted in Section 2.3, uniform convergence and continuity of the limit still hold. A similar consistency result for NLS could be formulated by combining conditional mean identification, compactness of the parameter space, h(x, 13)being conti-
2132
W.K. Nrwey and D. McFadden
nuous at each H with probability such a result is left as an exercise.
Consistency
2.5.
A consistency Theorem
one, and a dominance
condition.
Formulating
ofGMM
result for GMM
can be formulated
as follows:
2.6
Suppose that zi, (i = 1,2,. .), are i.i.d., I%’% W, and (i) W is positive semi-definite and WE[g(z, t3)] = 0 only if (I = 8,; (ii) tIO~0, which is compact; (iii) g(z, 0) is continuous at each QE 0 with probability one; (iv) E[sup~,~ I/g(z, 0) I/] < co. Then 6% (so. ProQf
Proceed by verifying the hypotheses of Theorem 2.1. Condition 2.1(i) follows by 2.6(i) and Lemma 2.3. Condition 2.l(ii) holds by 2.6(ii). By Lemma 2.4 applied to a(z, 0) = g(z, g), for g,(e) = n- ‘x:1= ,g(zi, 0) and go(g) = E[g(z, g)], one has supBEe I(g,(8) - go(g) II30 and go(d) is continuous. Thus, 2.l(iii) holds by QO(0) = - go(g) WY,(Q) continuous. By 0 compact, go(e) is bounded on 0, and by the triangle and Cauchy-Schwartz inequalities,
I!A(@- Qo@) I
G IICM@ - Yov4II2II + II + 2 IIso(@) II IId,(@- s,(@ II II @ II + llSo(~N2 II @- WII, so that sup,,,lQ,(g)
- Q,Jg)I AO,
and 2.l(iv) holds.
Q.E.D.
The conditions of this result are quite weak, allowing for discontinuity in the moment functions.’ 9 Consequently, this result is general enough to cover the simulated moment estimators of Pakes (1986) and McFadden (1989), or the interval moment estimator of Newey (1988). To use this result to show consistency of a GMM estimator, one proceeds to check the conditions, as in the Hansen-Singleton example.
19Measurability of the estimator becomes an issue in this case, although working with outer measure, as previously noted.
this can be finessed
by
2133
Ch. 36: Large Sample Estimation and Hypothesis Testing Example
1.3 continued
‘. For hypothesis (i), simply Assume that E[xx’] < a, so that I% A W = (E[xx’])assume that E[y(z, 0)] = 0 has a unique solution at 0, among all PIE0. Unfortunately, as discussed in Section 2.2, it is difficult to give more primitive assumptions for this identification condition. Also, assume that @is compact, so that (ii) holds. Then (iii) holds by inspection, and as discussed in Section 2.3, (iv) holds as long as the moment existence conditions given there are satisfied. Thus, under these assumptions, the estimator will be consistent.
Theorem 2.6 remains true if the i.i.d. assumption is replaced with the condition that zlr z2,. . is stationary and ergodic. Also, a similar consistency result could be formulated for CMD, by combining uniqueness of the solution to 7c,,= h(8) with compactness of the parameter space and continuity of h(O). Details are left as an exercise. 2.6.
Consistency
without compactness
The compactness assumption is restrictive, because it implicitly requires that there be known bounds on the true parameter value. It is useful in practice to be able to drop this restriction, so that conditions for consistency without compactness are of interest. One nice result is available when the objective function is concave. Intuitively, concavity prevents the objective function from “turning up” as the parameter moves far away from the truth. A precise result based on this intuition is the following one: Theorem
2.7
If there is a function QO(0) such that (i) QO(0) 1s uniquely maximized at 0,; (ii) B0 is an element of the interior of a convex set 0 and o,,(e) is concave; and (iii) o,(e) L QO(0) for all 8~0,
then fin exists with probability
approaching
one and 8,,-%te,.
Proof Let %?be a closed sphere of radius 2~ around 8, that is contained in the interior of 0 and let %?!be its boundary. Concavity is preserved by pointwise limits, so that QO(0) is also concave. A concave function is continuous on the interior of its domain, so that QO(0) is continuous on V?. Also, by Theorem 10.8 of Rockafellar (1970), pointwise convergence of concave functions on a dense subset of an open set implies uniform convergence on any compact subset of the open set. It then follows as in Andersen and Gill (1982) that o,(e) converges to QO(fI) in probability uniformly on any compact subset of 0, and in particular on %Y.Hence, by Theorem 2.1, the maximand f!?!of o,,(e) on % is consistent for 0,. Then the event that g,, is within c of fIO, so that Q,(g,,) 3 max,&,(@, occurs with probability approaching one. In this event, for any 0 outside W, there is a linear convex combination ,J$” + (1 - ,I)0
W.K. Newry and D. McFadden
2134
that lies in g (with A < l), so that_ Q,(g,,) 3 Q,[ng,, + (1 - i)U]. By concavity, Q.[ng,,_+ (1 - i)O] 3 ,$,(g,,) + (1 - E_)_Q,(e). Putting these inequalities together, Q.E.D. (1 - i)Q,(@ > (1 - i)Q,(0), implying 8, is the maximand over 0. This theorem is similar to Corollary II.2 of Andersen and Gill (1982) and Lemma A of Newey and Powell (1987). In addition to allowing for noncompact 0, it only requires pointwise convergence. This weaker hypothesis is possible because pointwise convergence of concave functions implies uniform con_vergence (see the proof). This result also contains the additional conclusion that 0 exists with probability approaching one, which is needed because of noncompactness of 0. This theorem leads to simple conditions for consistency without compactness for both MLE and GMM. For MLE, if in Theorem 2.5, (ii)are replaced by 0 convex, In f(z 10)concave in 0 (with probability one), and E[ 1In f’(z 10)I] < 03 for all 0, then the law of large numbers and Theorem 2.7 give consistency. In other words, with concavity the conditions of Lemma 2.2 are sufficient for consistency of the MLE. Probit is an example. Example
1.2 continued
It was shown in Section 2.2.1 that the conditions of Lemma 2.2 are satisfied. Thus, to show consistency of the probit MLE it suffices to show concavity of the loglikelihood, which will be implied by concavity of In @(x’@)and In @( - ~‘0). Since ~‘8 is linear in H, it suffices to show concavity of In a(u) in u. This concavity follows from the well known fact that d In @(u)/du = ~(U)/@(U) is monotonic decreasing [as well as the general Pratt (1981) result discussed below]. For GMM, if y(z, 0) is linear in 0 and I?f is positive semi-definite then the objective function is concave, so if in Theorem 2.6, (ii)are replaced by the requirement that E[ /Ig(z, 0) 111< n3 for all tj~ 0, the conclusion of Theorem 2.7 will give consistency of GMM. This linear moment function case includes linear instrumental variables estimators, where compactness is well known to not be essential. This result can easily be generalized to estimators with objective functions that are concave after reparametrization. If conditions (i) and (iii) are satisfied and there is a one-to-one mapping r(0) with continuous inverse such that &-‘(I.)] is concave_ on^ r(O) and $0,) is an element of the interior of r( O), then the maximizing value i of Q.[r - ‘(J”)] will be consistent for i, = s(d,) by Theorem 2.7 and invariance of a maxima to one-to-one reparametrization, and i? = r- ‘(I) will be consistent for 8, = z-~(&) by continuity of the inverse. An important class of estimators with objective functions that are concave after reparametrization are univariate continuous/discrete regression models with logconcave densities, as discussed in Olsen (1978) and Pratt (1981). To describe this class, first consider a continuous regression model y = x’& + cOc, where E is independent of x with p.d.f. g(s). In this case the (conditional on x) log-likelihood is - In 0 + In sCa_ ‘(y - x’fi)] for (B’, C)E 0 = @x(0, co). If In g(E) is concave, then this
Ch. 36: Large Sample Estimation and Hypothesis
Testing
2135
log-likelihood need not be concave, but the likelihood In ‘/ + ln Y(YY- ~‘6) is concave in the one-to-one reparametrization y = Q- ’ and 6 = /~‘/a. Thus, the average loglikelihood is also concave in these parameters, so that the above generalization of Theorem 2.7 implies consistency of the MLE estimators of fi and r~ when the maximization takes place over 0 = Rkx(O, a), if In g(c) is concave. There are many log-concave densities, including those proportional to exp( - Ixl”) for CI3 1 (including the Gaussian), logistic, and the gamma and beta when the p.d.f. is bounded, so this concavity property is shared by many models of interest. The reparametrized log-likelihood is also concave when y is only partially observed. As shown by Pratt (1981), concavity of lng(a) also implies concavity of ln[G(u)G(w)] in u and w, for the CDF G(u)=~“~~(E)~E.~~ That is, the logprobability of an interval will be concave in the endpoints. Consequently, the log-likelihood for partial observability will be concave in the parameters when each of the endpoints is a linear function of the parameters. Thus, the MLE will be consistent without compactness in partially observed regression models with logconcave densities, which includes probit, logit, Tobit, and ordered probit with unknown censoring points. There are many other estimators with concave objective functions, where some version of Theorem 2.7 has been used to show consistency without compactness. These include the estimators in Andersen and Gill (1982), Newey and Powell (1987), and Honort (1992). It is also possible to relax compactness with some nonconcave objective functions. Indeed, the original Wald (1949) MLE consistency theorem allowed for noncompactness, and Huber (1967) has given similar results for other estimators. The basic idea is to bound the objective function above uniformly in parameters that are far enough away from the truth. For example, consider the MLE. Suppose that there is a compact set % such that E[supBtOnMc In f(z 1d)] < E[ln f(z) fl,)]. Then by the law of large numbers, with probability approaching one, supBtOnXc&(0) d n-l x In f(zil@) < n-‘Cy= I In f(zl do), and the maximum must lie in %‘. c;= 1 suPoE@n’fjc Once the maximum is known to be in a compact set with probability approaching one, Theorem 2.1 applies to give consistency. Unfortunately, the Wald idea does not work in regression models, which are quite common in econometrics. The problem is that the likelihood depends on regression parameters 8 through linear combinations of the form ~‘9, so that for given x changing 8 along the null-space of x’ does not change the likelihood. Some results that do allow for regressors are given in McDonald and Newey (1988), where it is shown how compactness on 0 can be dropped when the objective takes the form Q,(e) = n- ’ xy= 1 a(Zi, X:O) an d a (z, u) goes to - co as u becomes unbounded. It would be useful to have other results that apply to regression models with nonconcave objective functions. “‘Pratt (1981) also showed that concavity to be concave over all v and w.
of In g(c) is necessary
as well as sufficient for ln[G(u) ~ G(w)]
W.K. Newey and D. McFadden
2136
Compactness is essential for consistency of some extremum estimators. For example, consider the MLE in a model where z is a mixture of normals, having likelihood f(z 1Q)= pea-‘~+!$a-‘(z-p)] +(I -p)y~‘f$Cy~l(z-~)l for8=(p,a,6y)‘, some 0 < p < 1, and the standard normal p.d.f. d(c) = (271) 1’2e-E2’2. An interpretation of this model is that z is drawn from N(p, a2) with probability p and from N(cc, r2) with probability (1 - p). The problem with noncompactness for the MLE in this model is that for certain p (and u) values, the average log-likelihood becomes unbounded as g (or y) goes to zero. Thus, for existence and consistency of the MLE it is necessary to bound 0 (and y) away from zero. To be specific, suppose that p = Zi as o+o, for some i. Then f(z,lfI) = ~.a ~‘@(O)$(l -p)y-lf$cy~l(zi-cc)]+co and assuming that zj # zi for all j # i, cs occurs with probability one, f(zj/U)+ (1 -p)y-l~[y-l(zj-@]>O. Hence, Q,,(e)= n-‘Cy=r lnf(zilO) becomes unbounded as (T+O for p = zi. In spite of this fact, if the parameter set is assumed to be compact, so that (Tand y are bounded away from zero, then Theorem 2.5 gives consistency of the MLE. In particular, it is straightforward to show that (I is identified, so that, by the information inequality, E[ln f(zl@] has a unique maximum at Be. The problem here is that the convergence of the sample objective function is not uniform over small values of fr. This example is extreme, but there are interesting econometric examples that have this feature. One of these is the disequilibrium model without observed regime of Fair and Jaffee (1972), where y = min{x’p, + G,,E,~‘6, + you}, E and u are standard normal and independent of each other and of x and w, and the regressors include constants. This model also has an unbounded average log-likelihood as 0 -+ 0 for a certain values of /I, but the MLE over any compact set containing the truth will be consistent under the conditions of Theorem 2.5. Unfortunately, as a practical matter one may not be sure about lower bounds on variances, and even if one were sure, extraneous maxima can appear at the lower bounds in small samples. An approach to this problem is to search among local maxima that satisfy the first-order conditions for the one that maximizes the likelihood. This approach may work in the normal mixture and disequilibrium models, but might not give a consistent estimator when the true value lies on the boundary (and the first-order conditions are not satisfied on the boundary).
2.7.
Stochastic
equicontinuity
and uniform
convergence
Stochastic equicontinuity is important in recent developments in asymptotic distribution theory, as described in the chapter by Andrews in this handbook. This concept is also important for uniform convergence, as can be illustrated by the nonstochastic case. Consider a sequence of continuous, nonstochastic functions {Q,(0)},“= 1. For nonrandom functions, equicontinuity means that the “gap” between Q,(0) and Q,(6) can be made small uniformly in n by making g be close enough to 0, i.e. a sequence of functions is equicontinuous if they are continuous uniformly in
Ch. 36: Lurqr
Sample Estimation
and Hypothesis
Testing
2137
More precisely, equicontinuity holds if for each 8, c > 0 there exists 6 > 0 with 1Q,(8) ~ Q,(e)1 < E for all Jj6 0 11< 6 and all 11.~~ It is well known that if Q,(0) converges to Q,J0) pointwise, i.e. for all UE 0, and @is compact, then equicontinuity is a necessary and sufficient condition for uniform convergence [e.g. see Rudin (1976)]. The ideas behind it being a necessary and sufficient condition for uniform convergence is that pointwise convergence is the same as uniform covergence on any finite grid of points, and a finite grid of points can approximately cover a compact set, so that uniform convergence means that the functions cannot vary too much as 0 moves off the grid. To apply the same ideas to uniform convergence in probability it is necessary to define an “in probability” version of equicontinuity. The following version is formulated in Newey (1991 a). n.
Stochastic_equicontinuity: For every c, n > 0 there exists a sequence of random variables d, and a sample size no such that for n > n,, Prob( 1d^,1> E) < q and for each 0 there is an open set JV containing 8 with
Here t_he function d^, acts like a “random epsilon”, bounding the effect of changing 0 on Q,(e). Consequently, similar reasoning to the nonstochastic case can be used to show that stochastic equicontinuity is an essential condition for uniform convergence, as stated in the following result: Lemma 2.8 Suppose 0 is compact and Qo(B) is continuous. Then ~up~,~lQ,(~) - Qo(@ 30 if and only if Q,(0) L Qo(e) for all 9~ @and Q,(O) is stochastically equicontinuous. The proof of this result is given in Newey (1991a). It is also possible to state an almost sure convergence version of this result, although this does not seem to produce the variety of conditions for uniform convergence that stochastic equicontinuity does; see Andrews (1992). One useful sufficient condition for uniform convergence that is motivated by the form of the stochastic equicontinuity property is a global, “in probability” Lipschitz condition, as in the hypotheses of the following result. Let O,(l) denote a sequence of random variables that is bounded in probability.22
” One can allow for discontinuity in the functions by allowing the difference to be less than I: only for n > fi, where fi depends on E, but not on H. This modification is closer to the stochastic equicontinuity condition given here, which does allow for discontinuity. ” Y” is bounded in probability if for every E > 0 there exists ii and q such that Prob(l Y,l > ‘1)< E for n > ii.
W.K. Newey and D. McFadden
2138
Lemma 2.9 %QO(0) for all 0~0, and there is If 0 is compact, QO(0) is contmuous,_Q,,(0) OL,then cr>O and B,=O,(l) such that for all 0, HE 0, 1o,(8) - Q^,(O)ld k,, I/g- 0 11 su~~lto
I Q,(@ - QdfO 5 0.
Prooj By Lemma 2.8 it suffices to show stochastic equicontinuity. Pick E, ye> 0. By B,n = o,(l) there is M such that Prob( IB,I > M) < r] for all n large enough. Let a)=Prob(Ifi,I>M) Q.E.D. and for all 0, ~E.~V, IQ,,(o) - Q,,(0)1 < 6,,Il& 8 lla < 2,. This result is useful in formulating the uniform law of large numbers given in Wooldridge’s chapter in this volume. It is also useful when the objective function Q,(e) is not a simple function of sample averages (i.e. where uniform laws of large numbers do not apply). Further examples and discussion are given in Newey (1991a).
2.8.
Least ubsolute deviations examples
Estimators that minimize a sum of absolute deviations provide interesting examples. The objective function that these estimators minimize is not differentiable, so that weak regularity conditions are needed for verifying consistency and asymptotic normality. Also, these estimators have certain robustness properties that make them interesting in their own right. In linear models the least absolute deviations estimator is known to be more asymptotically more efficient than least squares for thick-tailed distributions. In the binary choice and censored regression models the least absolute deviations estimator is consistent without any functional form assumptions on the distribution of the disturbance. The linear model has been much discussed in the statistics and economics literature [e.g. see Bloomfeld and Steiger (1983)], so it seems more interesting to consider here other cases. To this end two examples are given: maximum score, which applies to the binary choice model, and censored least absolute deviations. 2.8.1.
Maximum
score
The maximum score estimator of Manski (I 975) is an interesting example because it has a noncontinuous objective function, where the weak regularity conditions of Lemma 2.4 are essential, and because it is a distribution-free estimator for binary choice. Maximum score is used to estimate 8, in the model y = I(x’B, + E > 0), where l(.s&‘)denotes the indicator for the event .d (equal to one if d occurs and zero
Ch. 36: Lurye Sumple Estimation and Hypothesis
Testing
otherwise), and E is a disturbance term with a conditional The estimator solves eq. (1.1) for
!A(@=-H-It i=l
lyi-
2139
median (given x) ofzero.
l(x;H>o)/.
A scale normalization is necessary (as usual for binary choice), and a convenient one here is to restrict all elements of 0 to satisfy //0 /I = 1. To show consistency of the maximum score estimator, one can use conditions for identification and Lemma 2.4 to directly verify all the hypotheses of Theorem 2.1. By the law of large numbers, Q,(e) will have probability limit Qe(0) = - EC/y - l(x’U > O)l]. To show that this limiting objective has a unique maximum at fIO,one can use the well known result that for any random variable Y, the expected absolute deviation E[ 1Y - a(x)I] is strictly minimized at any median of the conditional distribution of Y given x. For a binary variable such as y, the median is unique when Prob(y = 1 Ix) # +, equal to one when the conditional probability is more than i and equal to zero when it is less than i. Assume that 0 is the unique conditional median of E given x and that Prob(x’B, = 0) = 0. Then Prob(y = 1 Ix) > ( < ) 3 if and only if ~‘0, > ( < ) 0, so Prob(y = 1 Ix) = i occurs with probability zero, and hence l(x’t), > 0) is the unique median of y given x. Thus, it suffices to show that l(x’B > 0) # l(x’B, > 0) if 0 # 19,. For this purpose, suppose that there are corresponding partitions 8 = (or, fl;,’ and x = (x,, x;)’ such that x&S = 0 only if 6 = 0; also assume that the conditional distribution of x1 given x2 is continuous with a p.d.f. that is positive on R, and the coefficient O,, of x1 is nonzero. Under these conditions, if 0 # 8, then l(x’B > 0) # l(x’B, > 0), the idea being that the continuous distribution of x1 means that it is allowed that there is a region of x1 values where the sign of x’8 is different. Also, under this condition, ~‘8, = 0 with zero probability, so y has a unique conditional median of l(x’8, > 0) that differs from i(x’8 > 0) when 0 # fI,,, so that QO(@ has a unique maximum at 0,. For uniform convergence it is enough to assume that x’0 is continuously distributed for each 0. For example, if the coefficient of x1 is nonzero for all 0~0 then this condition will hold. Then, l(x’B > 0) will be continuous at each tI with probability one, and by y and l(x’B > 0) bounded, the dominance condition will be satisfied, so the conclusion of Lemma 2.4 gives continuity of Qo(0) and uniform convergence of Q,,(e) to Qe(@. The following result summarizes these conditions: Theorem
2.10
If y = l(x’B, + E > 0) and (i) the conditional median at I: = 0; (ii) there are corresponding
distribution of E given x has a unique partitions x = (x,, xi)’ and 8 = (e,, pZ)’
13A median of the distribution and Prob(y < m) 2 +.
Y is the set of values m SUCKthat Prob( Y 2 m) > f
of a random
variable
W.K. Nrwey
2140
and D. McFadden
such that Prob(x;G # 0) > 0 for 6 # 0 and the conditional distribution of xi given x2 is continuous with support R; and (iii) ~‘8 is continuously distributed for all 0~0= (H:lIHIl = l}; then 850,. 2.8.2.
Censored leust ubsolute deviations
Censored least absolute deviations is used to estimate B0 in the model y = max{O, ~‘0, + F} where c has a unique conditional median at zero. It is obtained by solvingeq.(l.l)forQ,(0)= -n-‘~~=i (lyi- max{O,x~~}~-~yi-max{O,xj~,}~)= Q,(U) - Q,(0,). Consistency of 8 can be shown by using Lemma 2.4 to verify the conditions of Theorem 2.1. The function Iyi - max (0, xi0) 1- Iyi - max {0, xi@,} I is continuous in 8 by inspection, and by the triangle inequality its absolute value is bounded above by Imax{O,x~H}I + Imax{O,xI8,}I d lIxJ( 118ll + IId,ll), so that if E[ 11 x II] < cc the dominance condition is satisfied. Then by the conclusion of Lemma 2.4, Q,(0) converges uniformly in probability to QO(@= E[ ly - max{O,x’8} Ily - max{O, ~‘8,) I]. Thus, for the normalized objective function, uniform convergence does not require any moments of y to exist, as promised in Section 2.1. Identification will follow from the fact that the conditional median minimizes the expected absolute deviation. Suppose that P(x’B, > 0) and P(x’6 # Olx’8, > 0) > 0 median at zero, y has a unique if 6 # 0. 24 By E having a uniqu e conditional conditional median at max{O, x’o,}. Therefore, to show identification it suffices to show that max{O, x’d} # max{O, x’BO} if 8 # 0,. There are two cases to consider. In case one, l(x’U > 0) # 1(x’@, > 0), implying max{O,x’B,} # max{O,x’@}. In case two, 1(x’@> 0) = l(x’0, > 0), so that max 10, x’(9) - max 10, x’BO}= l(x’B, > O)x’(H- 0,) # 0 by the identifying assumption. Thus, QO(0) has a unique maximum over all of R4 at BO. Summarizing these conditions leads to the following result: Theorem 2.11 If (i) y = max{O, ~‘8, + a}, the conditional distribution of E given x has a unique median at E = 0; (ii) Prob(x’B, > 0) > 0, Prob(x’G # Olx’0, > 0) > 0; (iii) E[li x 111< a; and (iv) 0 is any compact set containing BO, then 8 3 8,. As previously promised, this result shows that no assumption on the existence of moments of y is needed for consistency of censored least absolute deviations. Also, it shows that in spite of the first-order conditions being identically zero over all 0 where xi0 < 0 for all the observations, the global maximum of the least absolute deviations estimator, over any compact set containing the true parameter, will be consistent. It is not known whether the compactness restriction can be relaxed for this estimator; the objective function is not concave, and it is not known whether some other approach can be used to get rid of compactness.
241t suffices for the second condition
that E[l(u’U,
> 0)x.x’] is nonsingular.
2141
Ch. 36: Large Sample Estimation and Hypothesis Testiny
3.
Asymptotic
normality
Before giving precise conditions for asymptotic normality, it is helpful to sketch the main ideas. The key idea is that in large samples estimators are approximately equal to linear combinations of sample averages, so that the central limit theorem gives asymptotic normality. This idea can be illustrated by describing the approximation for the MLE. When the log-likelihood is differentiable and 8 is in the interior of the parameter set 0, the first-order condition 0 = n ‘x1= 1V, In f(zi I$) will be satisfied. Assuming twice continuous differentiability of the log-likelihood, the mean-value theorem applied to each element of the right-hand side of this first-order condition gives
(3.1)
where t?is a mean value on the line joining i? and 19~and V,, denotes the Hessian matrix of second derivatives. ’ 5 Let J = E[V, In f(z (0,) (V, In f(z 1tl,)}‘] be the information matrix and H = E[V,, In f(z 1O,)] the expected Hessian. Multiplying through by Jn
and solving for &(e^ - 6,) gives
p
I
(Hessian Conv.)
d
(Inverse Cont.)
1 NO.
H-1
(CLT)
(3.2)
J)
By the well known zero-mean property of the score V,ln ,f(z/Q,) and the central limit theorem, the second term will converge in distribution to N(0, .I). Also, since eis between 6 and 8,, it will be consistent if 8 is, so that by a law of large numbers that is uniform in 0 converging to 8, the Hessian term converges in probability to H. Then the inverse Hessian converges in probability to H-’ by continuity of the inverse at a nonsingular matrix. It then follows from the Slutzky theorem that &(6-
0,) % N(0, Hm 1JH-‘).26
Furthermore,
by the information
matrix equality
25The mean-value theorem only applies to individual elements of the partial derivatives, so that 0 actually differs from element to element of the vector equation (3.1). Measurability of these mean values holds because they minimize the absolute value of the remainder term, setting it equal to zero, and thus are extremum estimators; see Jennrich (1969). *“The Slutzky theorem
is Y, 5
Y, and Z, Ac*Z,Y,
’ -WY,.
W’,K. Newey
2142
und D. McFadden
H = -J, the asymptotic variance will have the usual inverse information matrix form J-l. This expansion shows that the maximum likelihood estimator is approximately equal to a linear combination of the average score in large samples, so that asymptotic normality follows by the central limit theorem applied to the score. This result is the prototype for many other asymptotic normality results. It has several components, including a first-order condition that is expanded around the truth, convergence of an inverse Hessian, and a score that follows the central limit theorem. Each of these components is important to the result. The first-order condition is a consequence of the estimator being in the interior of the parameter space.27 If the estimator remains on the boundary asymptotically, then it may not be asymptotically normal, as further discussed below. Also, if the inverse Hessian does not converge to a constant or the average score does not satisfy a central limit theorem, then the estimator may not be asymptotically normal. An example like this is least squares estimation of an autoregressive model with a unit root, as further discussed in Chapter 2. One condition that is not essential to asymptotic normality is the information matrix equality. If the distribution is misspecified [i.e. is not f’(zI fI,)] then the MLE may still be consistent and asymptotically normal. For example, for certain exponential family densities, such as the normal, conditional mean parameters will be consistently estimated even though the likelihood is misspecified; e.g. see Gourieroux et al. (1984). However, the distribution misspecification will result in a more complicated form H- 'JH-' for the asymptotic variance. This more complicated form must be allowed for to construct a consistent asymptotic variance estimator under misspecification. As described above, asymptotic normality results from convergence in probability of the Hessian, convergence in distribution of the average score, and the Slutzky theorem. There is another way to describe the asymptotic normality results that is often used. Consider an estimator 6, and suppose that there is a function G(z) such that
fi(e-
0,) = t
$(zi)/$
+ o,(l),
EC$(Z)l = 0,
~%$(z)lc/(ZYl exists,
(3.3)
i=l
where o,(l) denote: a random vector that converges in probability to zero. Asymptotic normality of 6’then results from the central limit theorem applied to Cy= 1$(zi)/ ,,h, with asymptotic variance given by the variance of I/I(Z).An estimator satisfying this equation is referred to as asymptotically lineur. The function II/(z) is referred to as the influence function, motivated by the fact that it gives the effect of a single “It is sufficient that the estimator be in the “relative interior” of 0, allowing for equality restrictions to be imposed on 0, such as 0 = r(g) for smooth ~b) and the true )’ being in an open ball. The first-order condition does rule out inequality restrictions that are asymptotically binding.
Ch. 36: Lurge Sumplr Estimation and Hypothesis
2143
Testing
observation on the estimator, up to the o,(l) remainder term. This description is useful because all the information about the asymptotic variance is summarized in the influence function. Also, the influence function is important in determining the robustness properties of the estimator; e.g. see Huber (1964). The MLE is an example of an asymptotically linear estimator, with influence function $(z) = - H ‘V, In ,f(z IO,). In this example the remainder term is, for the mean value a, - [(n ‘C;= 1V,,,,In f(zi 1g))- ’ - H - ‘In- li2Cr= ,V, In f(zil e,), which converges in probability to zero because the inverse Hessian converges in probability to H and the $I times the average score converges in distribution. Each of NLS and GMM is also asymptotically linear, with influence functions that will be described below. In general the CMD estimator need not be asymptotically linear, because its asymptotic properties depend only on the reduced form estimator fi. However, if the reduced form estimator 72is asymptotically linear the CMD will also be. The idea of approximating an estimator by a sample average and applying the central limit theorem can be used to state rigorous asymptotic normality results for extremum estimators. In Section 3.1 precise results are given for cases where the objective function is “sufficiently smooth”, allowing a Taylor expansion like that of eq. (3.1). Asymptotic normality for nonsmooth objective functions is discussed in Section 7.
3.1.
The husic results
For asymptotic normality, two basic results are useful, one for an extremum estimator and one for a minimum distance estimator. The relationship between these results will be discussed below. The first theorem is for an extremum estimator. Theorem
3.1
Suppose
that 8 satisfies eq. (l.l),
@A O,, and (i) o,Einterior(O);
(ii) o,(e) is twice
continuously differentiable in a neighborhood Jf of Be; (iii) &V,&,(0,,) % N(0, Z); (iv) there is H(Q) that is continuous at 8, and supBEN IIV,,&(@ - H(d)11 30; (v) H = H(H,) is nonsingular.
Then J&(8 - 0,) % N(0, H
l,?ZH- ‘).
Proqf A sketch of a proof is given here, with full details described in Section 3.5. Conditions (i)-(iii) imply that V,&(8) = 0 with probability approaching one. Expanding around B0 and solving for ,,&(8 - 0,) = - I?(e)- ’ $V,&(0,),
where E?(B) = V,,&(0)
and f?is a mean value, located between Band 8,. By ep. Be and (iv), with probability approaching - one, I/fi(q - H /I< /IE?(g) - H(g) II + )IH(g) - H II d supBEell fi(O) H(B) /I + /IH(0) - H/I 3 0. Then by continuity of matrix inversion, - f?(g)- l 3 -H-l. The conclusion then follows by the Slutzky theorem. Q.E.D.
2144
W.K. Newey and D. McFuddun
The asymptotic variance matrix in the conclusion of this result has a complicated form, being equal to the product H -'EH- '.In the case of maximum likelihood matrix, because of the this form simplifies to J- ‘, the inverse of the information information matrix equality. An analogous simplification occurs for some other estimators, such as NLS where Var(ylx) is constant (i.e. under homoskedasticity). As further discussed in Section 5, a simplified asymptotic variance matrix is a feature of an efficient estimator in some class. The true parameter being interior to the parameter set, condition (i), is essential to asymptotic normality. If 0 imposes inequality restrictions on 0 that are asymptotically binding, then the estimator may not be asymptotically normal. For example, consider estimation of the mean of a normal distribution that is constrained to be nonnegative, i.e. f(z 1H) = (271~~)- ’ exp [ - (z - ~)~/20~], 8 = (p, 02), and 0 = [0, co) x (0, acj). It is straightforward to check that the MLE of ~1 is ii = Z,Z > 0, fi = 0 otherwise. If PO = 0, violating condition (ii), then Prob(P = 0) = i and Jnfi is N(O,o’) conditional on fi > 0. Therefore, for every n (and hence also asymptotically), the distribution of &(flpO) is a mixture of a spike at zero with probability i and the positive half normal distribution. Thus, the conclusion of Theorem 3.1 is not true. This example illustrates that asymptotic normality can fail when the maximum occurs on the boundary. The general theory for the boundary case is quite complicated, and an account will not be given in this chapter. Condition (ii), on twice differentiability of Q,(s), can be considerably weakened without affecting the result. In particular, for GMM and CMD, asymptotic normality can easily be shown when the moment functions only have first derivatives. With considerably more work, it is possible to obtain asymptotic normality when Q,,(e) is not even once differentiable, as discussed in Section 7. Condition (iii) is analogous to asymptotic normality of the scores. It -11 often follow from a central limit theorem for the sample averages that make up V,Q,(0,). Condition (iv) is uniform convergence of the Hessian over a neighborhood of the true parameter and continuity of the limiting function. This same type of condition (on the objective function) is important for consistency of the estimator, and was discussed in Section 2. Consequently, the results of Section 2 can be applied to give primitive hypotheses for condition (iv). In particular, when the Hessian is a sample average, or depends on sample averages, Lemma 2.4 can be applied. If the average is continuous in the parameters, as will typically be implied by condition (iv), and a dominance condition is satisfied, then the conclusion of Lemma 2.4 will give uniform convergence. Using Lemma 2.4 in this way will be illustrated for MLE and GMM. Condition (v) can be interpreted as a strict local identification condition, because H = V,,Q,(H,) (under regularity conditions that allow interchange of the limiting and differentiation operations.) Thus, nonsingularity of H is the sufficient (secondorder) condition for there to be a unique local maximum at 0,. Furthermore, if V,,QO(0) is “regular”, in the sense of Rothenberg (1971) that it has constant rank in a neighborhood of 8,, then nonsingularity of H follows from Qa(0) having a unique
Ch. 36:
Large
Sample Estimation
and ffypothesis
2145
Testing
maximum at fIO.A local identification condition in these cases is that His nonsingular. As stated above, asymptotic normality of GMM and CMD can be shown under once differentiability, rather than twice differentiability. The following asymptotic normality result for general minimum distance estimators is useful for this purpose. Theorem
3.2
Suppose that H^satisfies eq. (1.1) for Q,(0) = - 4,(0)‘ii/g,,(e) where ii/ 3 W, W is and (i) .Q,Einterior(O); (ii) g,(e) is continuously positive semi-definite, @Lo,, differentiable in a neighborhood JV’ of 8,; (iii) $9,(8,) 5 N(O,n); (iv) there is G(8) that is continuous at 0, and supBE y /(V&,,(e) - G(U) II A 0; (v) for G = G(e,), G’ WC is nonsingular.
Then $(8-
0,) bI[O,(G’WG)-‘G’Wf2WG(G’WG)-‘1.
The argument is similar to the proof of Theorem 3.1. By (i) and (ii), with probability approaching one the first-order conditions G(@t@@,($ = 0 are satisfied, for G(0) = V&,,(0). Expanding
d,(8) around
I?%@)] - 1G^(@I&“$,(&,),
B0 and
solving
gives Jn(e^-
e,,) = - [G(@ x
w h ere t?is a mean value. By (iv) and similar reasoning
as
for Theorem 3.1, G(8) A G and G(g) A G. Then by(v), - [G(@‘@‘G(@]-16(e),%~ - (G’WG)- 'G'W, so the conclusion follows by (iii) and the Slutzky theorem. Q.E.D. When W = Q - ‘, the asymptotic variance of a minimum distance estimator simplifies to (G’Q - ‘G)) ‘. As is discussed in Section 5, the value W = L2 _ ’ corresponds to an efficient weighting matrix, so as for the MLE the simpler asymptotic variance matrix is associated with an efficient estimator. Conditions (i)-(v) of Theorem 3.2 are analogous to the corresponding conditions of Theorem 3.1, and most of the discussion given there also applies in the minimum distance case. In particular, the differentiability condition for g,(e) can be weakened, as discussed in Section 7. For analyzing asymptotic normality, extremum estimators can be thought of as a special case of minimum distance estimators, with V&,(e) = d,(0) and t?f = I = W. The_ first-order conditions for extremum estimators imply that o,(tI)‘@g,(fI) = V,Q,(0)‘V,Q,(@ has a minimum (of zero) at 0 = 8. Then the G and n of Theorem 3.2 are the H and Z of Theorem 3.1, respectively, and the asymptotic variance of the extremum estimator is that of the minimum distance estimator, with (G’WG)-’ x G’Wf2WG(G’WG)p1 =(H’H)-‘H’L’H(H’H)m’ = H-‘ZHpl. Thus, minimum distance estimation provides a general framework for analyzing asymptotic normality, although, as previously discussed, it is better to work directly with the maximum, rather than the first-order conditions, when analyzing consistency.28 18This generality suggests that Theorem 3.1 could be formulated as a special case of Theorem 3.2. The results are not organLed in this way because it seems easier to apply Theorem 3.1 directly to particular extremum estimators.
W.K. Newey und D. McFadden
2146
3.2.
Asymptotic
normality
jbr MLE
The conditions for asymptotic to give a result for MLE. Theorem
normality
of an extremum
estimator
can be specialized
3.3
Suppose that zl,. . . , z, are i.i.d., the hypotheses of Theorem 2.5 are satisfied and (i) d,Einterior(O); (ii) f(zl0) is twice continuously differentiable and f(zl0) > 0 in a neighborhood ,X of 8,; (iii) {suP~~,~- 11 V,f(zl B) //dz < co, jsupe._, IIV,,f(zl@ I)dz < m;; VBHx (iv) J = ECVBln f(z I 4,) PO In f(z I 6Ji’l exists and is nonsingular; (v) E[suP~~_,~ 11 lnf(z~8)~l]
The proof proceeds by verifying the hypotheses of Theorem 3.1. By Theorem 2.5, o^A do. Important intermediate results are that the score s(z) = V, lnJ‘(zI U,) has mean zero and the information matrix equality .I = - E[V,,Inf(zI0,)]. These results follow by differentiating the identity jf(zlB)dz twice, and interchanging the order of differentiation and integration, as allowed by (iii) and Lemma 3.6 in Section 3.5. Then conditions 3.1(i), (ii) hold by 3.3(i), (ii). Also, 3.l(iii) holds, with Z = J, by E[s(z)] = 0, existence of J, and the LindberggLevy central limit theorem. To show 3.l(iv) with H = -J, let 0 be a compact set contained in JY and containing fIOin its interior, so that the hypotheses of Lemma 2.4 are satisfied for a(z, 0) = V,, In ,f(zl 0) by (ii) and (v). Condition 3.1 (v) then follows by nonsingularity of .I. Now Jn(H^-0,) %N(O, andH= -J.
H-‘JHP’)=N(O,JP1)follows
by theconclusionofTheorem
3.1 Q.E.D.
The hypotheses of Theorem 2.5 are only used to make sure that @-% O,, so that they can be replaced by any other conditions that imply consistency. For example, the conditions that 8, is identified, In f(z / 19)is concave in 6, and E[ IIn f(z 10)I] < x for all 8 can be used as replacements for Theorem 2.5, because Theorem 2.7 then gives 8At10. More generally, the MLE will be asymptotically normal if it is consistent and the other conditions (i)-(v) of Theorem 3.3 are satisfied. It is straightforward to derive a corresponding result for nonlinear least squares, by using Lemma 2.4, the law of large numbers, and the Lindberg-Levy central limit theorem to provide primitive conditions for Theorem 3.1. The statement of a theorem is left as an exercise for the interested reader. The resulting asymptotic variance for NLS will be H-‘ZH -I, for E[ylx] = h(x, U,), h&x, 0) = V,h(x, 0), H = - E[h,(x, O,)h,(x, O,)‘] and Z = E[ {y - h(x, O,)}‘h,(x, Q,)h,(x, O,)‘]. The variance matrix simplifies to a2H - ’ when E[ {y - h(x, BO)}2 Ix] is a constant 02, a well known efficiency condition for NLS.
Ch. 36: Larye Sump/e Estimation and Hypothesis
Testing
2147
As previously stated, MLE and NLS will be asymptotically linear, with the MLE influence function given by J- ‘VOIn j’(zI 0,). The NLS influence function will have a similar form,
It/(z)= { EChk ~oP,(.?Qd’l} - l h&x,Q,) [y - 4x, U,)], as can be shown by expanding the first-order conditions for NLS. The previous examples provide useful illustrations of how the regularity tions can be verified. Example
(3.4)
condi-
1.1 continued
In the Cauchy location and scale case, f(z18) = G- ‘y[o- ‘(z - p)] for Y(E)= l/[rc( 1 + E’)]. To show asymptotic normality of the MLE, the conditions of Theorem 3.3 can be verified. The hypotheses of Theorem 2.5 were shown in Section 2. For the parameter set previously specified for this example, condition (i) requires that p0 and (me are interior points of the allowed intervals. Condition (ii) holds by inspection. It is straightforward to verify the dominance conditions for (iii) and (v). For example, (v) follows by noting that V,,lnf(z10) is bounded, uniformly in bounded p and 0, and 0 bounded away from zero. To show condition (iv), consider cc=(~(~,c(J # 0. Note that a,(1 + z2)[ti’V01nf(z~8,)] = cr,2z + ~~(1 + z’) + c(,2z2= ~1~+ (2c(,)z + (3u,)z2 is a polynomial and hence is nonzero on an interval. Therefore, E[{cx’V,ln~f(z~0,,)}2] = c(‘J M> 0. Since this conclusion is true for any CI# 0, J must be nonsingular. Example
1.2 continued
Existence and nonsingularity of E[xx’] are sufficient for asymptotic normality of the probit MLE. Consistency of 8 was shown in Section 2.6, so that only conditions (i)-(v) of Theorem 3.3 are needed (as noted following Theorem 3.3). Condition (i) holds because 0 = Rq is an open set. Condition (ii) holds by inspection of f’(z 10) = y@(x’O) + (1 - y)@( - x’(9). For condition (iii), it is well known that 4(u) and 4”(u) are uniformly bounded, implying V&z /0) = (1 - 2y)4(x’H)x and V,,f(z 10)= (1 - 2y) x ~,(x’@xx’ are bounded by C( 1 + I/x 11 2, for some constant C. Also, integration over dz is the sum over y and the expectation over x {i.e. ja(y, x)dz = E[a(O, x) + a( 1, x)] }, so that i( 1 + 11 x I/2)dz = 2 + 2E[ //x 11’1< GC. For (iv), it can be shown that J = E[i.(x’0&( - x’d,)xx’], for j(u) = ~(U)/@(U). Existence of J follows by E.(u)i.(- ~1) bounded, and nonsingularity by %(u)A(- u) bounded away from zero on any open interval.29 Condition (v) follows from V,, In ,f’(z IQ,,)= [&.(x’B,)y + &,( - x’tI,)( 1 - y)]xx’ 291t can be shown that Z(u)i.( - a) is bounded using l’H8pital’s rule. Also, for any Ir>O, J 2 E[l(lx’H,I < fi)i(x’fI,)n( -x’tI,)xx’] 2 CE[ l(lx’O,I < C)x.x’] in the positive semi-definite sense, the last term is positive definite for large enough V by nonsingularity of E[xx’].
W.K. Newey and D. McFuddm
2148
and boundedness of I_,(u). This example illustrates how conditions on existence of moments may be useful regularity conditions for consistency and asymptotic normality of an MLE, and how detailed work may be needed to check the conditions.
3.3.
Asymptotic
normulity for GMM
The conditions on asymptotic normality specialized to give a result for GMM. Theorem
of minimum
distance
estimators
can be
3.4
Suppose that the hypotheses ofTheorem 2.6 are satisfied, r;i/ A W, and (i) 0,Einterior of 0; (ii) g(z,O) is continuously differentiable in a neighborhood _t‘ of 0,, with probability approaching one; (iii) E[g(z, fl,)] = 0 and E[ I/g(z, 0,) I/‘1 is finite; (iv) E[su~,,~ Ij V&z, 0) 111< co;(v) G’WG is nonsingular for G = E[V,g(z, fl,)]. Then for 0 = ECg(z, @,Jg(z, Hd’l,$(@
- 0,) ~N[O,(G’WG)G’WBWG(G’WG)~‘].
Proof
The proof will be sketched, although a complete proof like that of Theorem 3.1 given in Section 3.5 could be given. By (i), (ii), and (iii), the first-order condition 2G,,(@%~,(8) = 0 is satisfied with probability approaching one, for G,(e) = V&,,(0). Expanding
J,,(g) around
fI,, multiplying
through
by $,
and solving gives (3.5)
where 0 is the mean [G,(~))‘~~,(8)]-‘~,(~))‘ii/ Slutzky theorem.
value.
By (iv), G,,(8) LG and G,(g) 3 G, so that by (v), The conclusion then follows by the Q.E.D.
~(G’WG)~‘G’W.
asymptotic variance formula simplifies to (G’R ‘G)- ’ when W = in Hansen (1982) and further discussed in Section 5, this value for W is optimal in the sense that it minimizes the asymptotic variance matrix of the GMM estimator. The hypotheses of Theorem 2.6 are only used to make sure that I!?L BO, so that they can be replaced by any other conditions that imply consistency. For example, the conditions that 8, is identified, g(z, 0) is linear in 8, and E[ /Ig(z, II) 111< cc for all 8 can be used as replacements for Theorem 2.6, because Theorem 2.7 then gives 830,. More generally, a GMM estimator will be asymptotically normal if it is consistent and the other conditions (i))(v) of Theorem 3.4 are satisfied. The complicated
R- ‘. As shown
2149
Ch. 36: Large Sample Estimation and Hypothesis Testing
It is straightforward
to derive
a corresponding
result
for classical
minimum
distance, under the conditions that 6 is consistent, &[72 - h(e,)] L N(0, fl) for some R, h(8) is continuously differentiable in a neighborhood of Be, and G’WG is nonsingular for G = V&(0,). The statement of a theorem is left as an exercise for the interested reader. The resulting asymptotic variance for CMD will have the same form as given in the conclusion of Theorem 3.4. By expanding the GMM first-order conditions, as in eq. (3.5), it is straightforward to show that GMM is asymptotically linear with influence function $(z) = - (G’ WC) - ‘G’ Wg(z, 0,).
(3.6)
In general CMD need not be asymptotically linear, but will be if the reduced form estimator 72 is asymptotically linear. Expanding the first-order conditions for 6 around
the truth gives $(e^-
0,) = - (G’WG)-‘6’6’&(72
G = V,@(8), and @is the mean value. Then &(fi and(~‘~G)-‘~‘ii/‘~(G’WG)-‘G’W. W&(72
- x0), where G = V&(8),
- rra) converging
implies that &(8-
in distribution
0,) = - (G’WG)-‘G’
x
- TC,J+ o,(l). Therefore,
ll/“(z), the CMD estimator t&z) = - (G’WG)-
if 72is asymptotically linear with influence function will also be asymptotically linear with influence function
‘G’W$“(z).
The Hansen-Singleton example provides of Theorem 3.4 can be verified.
(3.7) a useful illustration
of how the conditions
Example 1.3 continued
It was shown
in Section 2 that
sufficient conditions for consistency are that solution at 0eE 0 = [Be, /3,]x[yl, y,], and that E[llx(l] 0 and constant C big enough. Let N be a neighborhood of B,, such that ye + E < y < yU- E for all &_N. Then SUP~,~~ liV,g(z,e)iI ~CllxlllwlCl +ln(y)] x E[x(BwyY - l)] = 0 have a unique
~~~~l~l~~~lI~III~l~~+l~l~~+l~l~~~~~~~l~l~~lI~III~l~l~l~‘+l~l~~~,
so
that
condition (iv) follows by the previously assumed moment condition. Finally, condition (v) holds by the previous rank condition and W = (E[xx’])-’ nonsingular. Thus, under the assumptions imposed above, the nonlinear two-stage least squares estimator will be consistent and asymptotically normal, with asymptotic variance as given in the conclusion of Theorem 3.4.
W.K. Nrrvey and D. McFudden
2150
3.4.
One-step
theorems
A result that is useful, particularly for efficient estimation, pertains to the properties of estimators that are obtained from a single iteration of a numerical maximization procedure, such as NewtonRaphson. If the starting point is an estimator that is asymptotically normal, then the estimator from applying one iteration will have the same asymptotic variance as the maximum of an objective function. This result is particularly helpful when simple initial estimators can be constructed, but an efficient estimator is more complicated, because it means that a single iteration will yield an efficient estimator. To describe a one-step extremum estimator, let ?? be an initial estimator and l? be an estimator of H = plim[V,,&(B,)]. Consider the estimator 8=
e- I7 - lV,&(O).
(3.8)
If l? = V,,&(@ then eq. (3.8) describes one Newton-Raphson iteration. More generally it might be described as a modified NewtonRaphson step with some other value of fi used in place of the Hessian. The useful property of this estimator is that it will have the same asymptotic variance as the maximizer of o,(Q), if &(& 0,) is bounded in probability. Consequently, if the extremum estimator is efficient in some class, so will be the one-step estimator, while the one-step estimator is computationally more convenient than the extremum estimator.30 An important example is the MLE. In this case the Hessian limit is the negative of the information matrix, so that fi = -J is an estimated Hessian. The corresponding iteration is e= @+ J-‘n-’
f
V,lnf(zi)8).
(3.9)
i=l
For the Hessian estimator of the information matrix 7 = - n ’ x1= 1V,, In f(zi Ig), eq. (3.9) is one NewtonRaphson iteration. One could also use one of the other information matrix estimators discussed in Section 4. This is a general form of the famous linearized maximum likelihood estimator. It will have the same asymptotic variance as MLE, and hence inherit the asymptotic efficiency of the MLE. For minimum distance estimators it is convenient to use a version that does not involve second derivatives of the moments. For c = V,d,(@, the matrix - 2G’l?‘G is an estimator of the Hessian of the objective function - ~,,(O)‘l?~,(0) at the true parameter value, because the terms that involve the second derivatives of Q,(e) are asymptotically negligible.31 Plugging a = - 2G’l?fi/G into eq. (3.8) gives a one-step “‘An alternative one-step estimator can be obtained by setting it equal to one, as t? = fI + xd^for d^= - H ’ P,,&(@ will also have the same asymptotic variance as the solution 31These terms are all multiplied bv one or more elements
maximizing over the step size, rather than and z= argmax,Q,(O + 22). This estimator to eq. (l.l), as shown by Newey (1987). of iJO,), which all converge to zero.
Ch. 36: Large Sample Estimation and Hypothesis
minimum
distance
2151
Testing
estimator,
e”=e- (Cfr;i/G)-‘G~~gn(H).
(3.10)
Alternatively, one could replace G by any consistent estimator of plim[V&,(8,)]. This estimator will have the same asymptotic variance as a minimum distance estimator with weighting matrix I?. In particular, if I%’is a consistent estimator of fl- ‘, an efficient choice of weighting matrix, then e” has the same asymptotic variance as the minimum distance estimator with an efficient weighting matrix. An example is provided by GMM estimation. Let G = n- ’ x1= 1V&z,, g) and let fi be an estimator of R = E[y(z, fI,)g(z, 0,)‘], such as fi = n- r C;= 1 g(zi, 8)g(z, g)‘. Then the one-step estimator of eq. (3.10) is
--
H1=8-(GrQ-lC)-l~ffl-~
t
g(zi,iJ)/n.
(3.1 1)
i=l
This is a one-step GMM estimator with efficient choice of weighting matrix. The results showing that the one-step estimators have the same asymptotic variances as the maximizing values are quite similar for both extremum and minimum distance estimators, so it is convenient to group them together in the following result: Theorem
3.5
Suppose that h(s0,) is bounded in probability. If I!? satisfies eq. (3.8), the conditions of Theorem 3.1 are satisfied, and either I? = V,,,&(@ or Z? 3 H, then $(Q-
0,) L
N(0, H- ‘ZH-
‘). If esatisfies
are satisfied, and either G= V&J@ or G L G’Wl2WG(G’WG)-‘1.
eq. (3.10), the conditions G, then J&(8-
of Theorem
3.2
(3,) % N[O, (G’WG)- l x
Proof Using eq. (3.8) and expanding
V,&(@ around
8, gives:
where 4 is the mean value. By 1-l -% H-l and the Slutzky theorem, the second term -converges to N(0, H- ‘ZH- ‘). By condition (iv) of Theorem _ .in distribution 3.1, Hp’V,,Q,(@+H‘H = I, so that the first term is a product of a term that converges in probability to zero with a term that is bounded in probability, so that the first term converges in probability to zero, giving the conclusion. The result for minimum distance follows by a similar argument applied to the expansion of eq. (3.10)
W.K. Newey and D. McFadden
2152
given
by
&(e”-
(3,) = [Z - (c’~~)-‘c’~V,g,(e)]~(e-
J%“(&).
6,) - (G’k%-‘G’@ Q.E.D.
This result can be specialized to MLE or GMM by imposing the conditions of Theorem 3.3 or 3.4, but for brevity this specialization is not given here. The proof of this result could be modified to give the slightly stronger conclusion that &(e - 6) 3 0,-a condition that is referred to as “asymptotic equivalence” of the estimators B”and 0. Rothenberg(l984) showed that for MLE, if a second iteration is undertaken, i.e. f? in eq. (3.8) solves the same equation for some other initial estimator, then n(e - 6) -% 0. Thus, a second iteration makes the estimator asymptotically closer to the extremum estimator. This result has been extended to multiple iterations and other types of estimators in Robinson (1988a).
3.5.
Technicalities
A complete proof of Theorem
3.1
Without loss of generality, assume that Af is a convex, open set contained in 0. Let i be the indicator function for the event that &eJlr. Note that $11-*0, implies i 3 1. By condition (ii) and the first-order conditio_ns fo_ra maximum, i*V&,,(@ = 0. Also, b_y a mean-value expansion theorem, 0 = 1 *V,Q,(e,), % 1 *VQQ^,,(e,)Je- 0,), where tIj is a random variable equal to the mean value when 1 = 1 and equal to fIO otherwise. Then c&0,. Let H denote the matrix with jth row Vi&(gj);. By condition
(iv), H L
H. Let 7 be the indicator
for film
and H nonsingular.
Then
by condition (v), i -% 1, and 0 = i.V,&(&,) + i*H(eIII,), so that $(e0,) = -1H ’ &V&,(6,) + (1 - i)J%(e - 0,). Then since ifi - ’ 3 H- ’ by condition (v), &V,&(&J 5 N(O, z) b y condition (iii), and (1 - i)&(i!? - 0,) 3 0 by i -% 1, the conclusion follows by the Slutzky theorem and the fact that if Y, -% Ye and 2, Y, 5 0 then Z, % Y,. Q.E.D. The proof that the score has zero mean and of the information matrix equality.
proof of Theorem 3.3 it suffices that the order of differentiation well known lemma, e.g. as found that the order of differentiation
By the to show that J f (zl B)dz is twice differentiable and and integration can be interchanged. The following in Bartle (1966, Corollary 5.9), is useful for showing and integration can be interchanged.
Lemma 3.6
If a(z, 13) is continuously differentiable on an open set ./lr of 8,, a.s. dz, and Jsu~,,~ 11V,a(z, 19)1)dz < co, then ia(z, f3)dz is continuously differentiable and V,ja(z, B)dz = j[V,a(z, fI)]dz for f3~Jlr.
2153
Ch. 36: Large Sample Estimation and Hypothesis Testing
Proof Continuity of l [V&z, 0)] dz on X follows by continuity of V&z, ~9)in 0 and the dominated convergence theorem. Also, for all e”close enough to 8, the line jo@ing 8 and 0 will lie in Jlr, so a mean-value expansion gives a(z, g) = a(z, 0) + V&z, @‘(fJ- 0) + r(z, g),, where, for the mean value f?(z), I(Z, $) = {V&z, g(z)] - V&z, 0)}‘(8- 0). AS &+ 0, )(r(z,@ 1)/j e”- 8 1)< (1V&z, g(z)] - V&z, (3)II+0 by continuity of V&z, 0). so by the dominated convergence Also, i@, 0) i / ii 8 - 6 iI G 2 sUPeE~ /IV&z, 0) ii , theorem, jlr(z, @(dz/~~8-0(~-+0. Therefore,lja(z, 8)dz-Sa(z, @dz- {j[Ve4z, e)]dz}’ x Q.E.D. (~-e)(=IS~(Z,B)dzldSIr(z,8)Idz=0(1le-eli). The needed result that f f(zI0)dz is twice differentiable and that f (zlf3) can be differentiated under the integral then follows by Lemma 3.6 and conditions (ii) and (iii) of Theorem 3.3.
4.
Consistent asymptotic variance estimation
A consistent estimator of the asymptotic variance is important for construction of asymptotic confidence intervals, as discussed in the introduction. The basic idea for constructing variance estimators is to substitute, or “plug-in”, estimators of the various components in the formulae for the asymptotic variance. For both extremum and minimum distance estimators, derivatives of sample functions can be used to estimate the Hessian or Jacobian terms in the asymptotic variance, when the derivatives exist. Even when derivatives do not exist, numerical approximations can be used to estimate Hessian or Jacobian terms, as discussed in Section 7. The more difficult term is the one that results from asymptotic
normality
of ,,&V,f&(e,)
or
&n&(0,). The form of this term depends on the nature of the estimator and whether there is dependence in the data. In this chapter, estimation of this more difficult term will only be discussed under i.i.d. data, with Wooldridge’s chapter in this volume giving results for dependent observations. To better describe variance estimation it is helpful to consider separately extremum and minimum distance estimators. The asymptotic variance of an extremum estimator is H- 'ZH- ',where H is the probability limit of Vee&(BO) and Z is the asymptotic variance of ^A $&7,&e,). Thus, an estimator of the asymptotic variance can be formed as fi- 'ZH- ',where fi is an estimator of H and 2 is an estimator of Z. An estimator of H can be constructed in a general way, by substituting 8 for 8, in the Hessian of the objective function, i.e. l? = V,,&(8). It is more difficult to find a general estimator of .Z, because it depends on the nature of the extremum estimator and the properties of the data. In some cases, including MLE and NLS, an estimator of Z can be formed in a straightforward way from sample second moments. For example, for MLE the central limit theorem implies that ;I: = E[V, In f (z I/3,-J{V, In f (z IO,,)}‘], so that an
W.K. Newey and D. McFadden
2154
estimator can be formed by substituting moments for expectations and estimators for true parameter, i.e. 2 = II- ‘x1= 1Ve In f(zil 8) {V, In f(zii g)}!. More generally, an analogous estimator can be constructed whenever the objective function is a sample average, Q,@) = n ‘Cr= 1q(z,,fl), e.g.where q(z,0) = - [y - h(x, O)]' for NLS. In this case $V,Q,(tI,) = n - ‘I2 C;= r Veq(zi, N,), so the central limit theorem will imply that Z = E[V,q(z, BO){V,q(z, 8,)}‘].32 This second-moment matrix can be estimated as
2 = n-l .f i=l
v,q(z,,8){v,q(z,,~)}',
Q,(d)=
Iv1
i
q(z,fl).
(4.1)
i=l
In cases where the asymptotic variance simplifies it will be possible to simplify the variance estimator in a corresponding way. For example the MLE asymptotic variance is the inverse of the information matrix, which can be estimated by J^- ‘, for an estimator J^ of the information matrix. Of course, this also means that there are several ways to construct a variance estimator. For the MLE, jcan be estimated from the Hessian, the sample second moment of the score, or even the general formula &‘,??I?‘. Asymptotic distribution theory is silent about the choice between these estimators, when the models are correctly specified (i.e. the assumptions that lead to simplification are true), because any consistent estimator will lead to asymptotically correct confidence intervals. Thus, the choice between them has to be based on other considerations, such as computational ease or more refined asymptotic accuracy and length of the confidence intervals. These considerations are inherently specific to the estimator, although many results seem to suggest it is better to avoid estimating higher-order moments in the formation of variance estimators. If the model is not correctly specified, then the simplifications may not be valid, so that one should use the general form fi- ‘Tfi?- ‘, as pointed out by Huber (1967) and White (1982a). This case is particularly interesting when 8 is consistent even though the model is misspecified, as for some MLE estimators with exponential family likelihoods; see Gourieroux et al. (1984). For minimum distance estimation it is straightforward to estimate the Jacobian term G in the asymptotic variance (G’WG))‘G’W~RG(G’WG)-‘, as G = V&,(u^). Also, by assumption W will be a consistent estimator of W. A general method of forming B is more difficult because the form of fl depends on the nature of the estimator. For GMM an estimator of R can be formed from sample second moments. By the central limit theorem, the asymptotic variance of Jng,(fl,,) = n- ‘I2 C;= 1g(zi, 0,) is R = E[g(z, e,)g(z, O,)‘]. Thus, an estimator can be formed by substituting sample
32The derivative V,q(z,O,,) can often be shown to have mean zero, as needed for the central limit theorem, by a direct argument. Alternatively, a zero mean will follow from the first-order condition for maximization of Q,,(O) = E[q(z,O)]at 0,.
Ch. 36: Large Sample Estimation and Hypothesis
moments
for the expectation
2155
Testing
and an estimator
of 8 for the true value, as
i=l
As discussed in Section 3, extremum estimators can be considered as special cases of minimum distance estimators for analyzing asymptotic normality. More specifically, an extremum estimator with o,(O) = n- ’ x1= ,q(z,, 0) will be a GMM estimator with g(z, 0) = V,q(z, 0). Consequently, the estimator in eq. (4.1) is actually a special case of the one in eq. (4.2). For minimum distance estimators, where Q,(d) = r? - h(O), the asymptotic variance R of $g,(O,) is just the asymptotic variance of R. Thus, to form h one simply uses a consistent estimator of the asymptotic variance of 72.If r? is itself an extremum or GMM estimator, its asymptotic variance can be estimated in the way described above. When the asymptotic variance matrix simplifies there will be a corresponding simplification for an estimator. In particular, if W = 0-l then the asymptotic variance is (G’O-‘G)-‘, so that a corresponding estimator is (c’& ‘6)-l. Alternatively, if I? is a consistent estimator of a- ‘, a variance estimator is (@ii/& ‘. In addition, it may also be possible to estimate L2 in alternative ways. For example, for linear instrumental variables where g(z, 0) = x(y - Y’@, the estimator in eq. (4.2) is II- ‘XI= r x,xi(y, - Y$)‘, which is consistent even if si = yi - YIfI, is heteroskedastic. An alternative estimator that would be consistent under homoskedasticity (i.e. if E[s2 Ix] is constant) is c?‘C~, 1xixi/n for 82 = n- ’ Cr= 1(yi - Y$)2. For minimum distance estimators, the choice between different consistent variance estimators can be based on considerations such as those discussed for extremum estimators, when the model is correctly specified. When the model is not correctly specified and there are more elements in d,(O) than 8, the formula (G’ WC)- ‘G’ WR WG(G’ WC) ’ is no longer the correct asymptotic variance matrix, the reason being that other terms enter the asymptotic variance because S,,(J) need not converge to zero. It is possible to show that 6 is asymptotically normal when centered at its limit, by treating it as an extremum estimator, but the formula is very complicated [e.g. see Maasoumi and Phillips (1982)]. This formula is not used often in econometrics, because it is so complicated and because, in most models where d,,(O) has more elements than 0, the estimator will not be consistent under misspecification.
4.1.
The basic results
It is easy to state a consistency result for asymptotic variance is assumed to be consistent. A result for extremum estimators
estimation is:
if ,E or r3
W.K. Newey and D. McFadden
2156
Theorem
4.1
If the hypotheses fi-‘f&l
of Theorem
3.1 are satisfied,
fi = V,,&,(6),
to. By c_ondition(iv)
of Theorem
and 2 AZ,
then
!J+H-l~H-l.
Proof
By asymptotic
normality,
o^3
3.1, with probability
one, IF-W 5 IlH-WtWI + lW(~)-ffll ~suP~~.,~IIV~~Q~(~)-H(~)II + (1H(8) - HI/ LO, SO that H g H. The conclusion then follows by condition (v) Q.E.D. of Theorem 3.1 and continuity of matrix inversion and multiplication.
approaching
A corresponding Theorem
result for minimum
distance
estimators
is:
4.2
If the hypotheses of Theorem (~‘~~)-‘~‘~ii~~(~‘~‘6)-’
3.2 are satisfied,
6 = V&,(8), and
fi -% 0,
then
%(G’WG)-‘G’Wf2WG(G’WG)-?
Proof
It follows similarly
to the proof of Theorem
implies 6 5 G, while %A then follows from condition and multiplication.
4.1 that condition
(iv) of Theorem
3.2
W and fi % 0 hold by hypothesis.
(v) of Theorem
3.2 and continuity
The conclusion of matrix inversion Q.E.D.
As discussed above, the asymptotic variance for MLE, NLS, and GMM can be estimated using sample second moments, with true parameters replaced by estimators. This type of estimator will be consistent by the law of large numbers, as long as the use of estimators in place of true parameters does not affect the limit. The following result is useful in this respect. Lemma 4.3
If zi is i.i.d., a(z, 0) is continuous borhood Jf _of fI,, such that n - 1Z;= 1a(z, 0) 3 E[a(z, tl,)].
at 8, with probability one, and there is a neighE[sup,,,/Ia(~, @II] < cu, then for any e%B,,
Proof
By consistency oft? there is 6, + 0 such that II8 - 8, I( < 6, with probability approacha(z, tl) - a(z, 0,) II. By continuity of a(z, 0) at et,, ing one. Let A,(z) = su~~~~-~~,,d 6,,11 d,(z) + 0 with probability one,.while by the dominance condition, for n large enough d,,(z) d 2 SUPes.A~I(a(z, 0) I(. Then by the dominated convergence theorem, E[d,(z)]-+ 0, so by the Markov inequality, P( I n- ’ Cr= 1A,(zi)/ > E) d E[A.(z)]/c -+ 0 for all E>O, giving n-‘Cy= IA,, 3 0. By Khintchine’s law of large numbers, n- ‘XI= 1u x
Ch. 36: Large Sample Estimation and Hypothesis (zi, fI,) %
E[a(z,
O,,)].
Also, with pro_bability
np’Cr=
la(Zi,8,)li
elusion
follows by the triangle
2157
Testing
approaching
one, (1n-‘Cr=
~lU(zi,8)-a(Zi,8,)1~,
inequality.
0,so
lu(Zi, 8) -
theconQ.E.D.
The conditions of this result are even weaker than those of Lemma 2.4, because the conclusion is simply uniform convergence at the true parameter. In particular, the function is only required to be continuous at the true parameter. This weak type of condition is not very important for the cases considered so far, e.g. for GMM where the moment functions have been assumed to be differentiable, but it is very useful for the results of Section 7, where some discontinuity of the moments is allowed. For example, for the censored LAD estimator the asymptotic variance depends on indicator functions for positivity of x’% and Lemma 4.3 can be used to show consistency of asymptotic variance estimators that depend on such indicator functions.
4.2.
Variance
estimation
for MLE
The asymptotic variance of the maximum likelihood estimator is J- ‘, the inverse of the Fisher information matrix. It can be consistently estimated from J^- ‘, where J^ is a consistent estimator of the information matrix. There are several ways to estimate the information matrix. To describe these ways, let s(z, 9) = V, lnf(zI 9) denote the score. Then by the information matrix equality, J = E[s(z, %&(z, %,)‘I = - E[V,s(z, %,)I = J(%,), where J(9) = - j [V&z, %)]f(zI %)dz. That is, J is the expectation of the outer product of the score and the expectation of the negative of the derivative of the score, i.e. of the Hessian of the log-likelihood. This form suggests that J might be estimated by the method of moments, replacing expectations by sample averages and unknown parameter values by estimates. This yields two estimators,
j1 = n - ’
t i=l
s(zi, @s(z,,B)‘/n,
j2 = -n-l
t
V,,lnf(zJ%).
i=l
The second estimator is just the negative of the Hessian, and so will be consistent under the conditions of Theorem 3.3. Lemma 4.3 can be used to formulate conditions for consistency of the first estimator. A third estimator could be obtained by substituting 6 in the integrated function J(9). This estimator is often not feasible in econometrics, because f(z 1%)is a conditional likelihood, e.g. conditioned on regressors, and so the integration in J(9) involves the unknown marginal distribution. An alternative estimator that is feasible is the sample average of the conditional information matrix. To describe this estimator, suppose that z = (y, x) and that f(z 1%)= f(y Ix, 9) is the conditional density of y given x. Let J(x, II) = E[s(z, %)s(z,%)‘Ix, O] = is(z, U)s(z, U)‘f’(y Ix, 0)dy be the con-
2158
W.K. Newey and D. McFadden
ditional information matrix, so that J = E[J(x, e,)] by the law of iterated tions. The third estimator of the information matrix is then
& = f:
J(x,,
8)/n.
expecta-
(4.4)
i=l
Lemma 4.3 can be used to develop conditions for consistency of this estimator. In particular, it will often be the case that a(~, 8) = J(x, 0) is continuous in 8, because the integration in J(x, 0) tends to smooth out any discontinuities. Consistency will then follow from a dominance condition for J(x, d). The following result gives conditions for consistency ofall three of these estimators: Theorem
4.4
Suppose that the hypotheses of Theorem 3.3 are satisfied. Then 51; ’ A Jp ‘. Also, if there is a neighborhood N of B0 such that E[su~,,_~ 11 s(z, 0) 11’1-=zco then 51; ’ 3 J- I. Also, if J(x, 0) is continuous at B0 with probability one and E[su~,,,+~ /)J(x, Q)/)] < c;o then.?;‘AJ-‘. Proof
It follows as in the proof of Theorem 4.1 that 512 ’ A J- ‘. Also, by s(z, 0) continuously differentiable in a neighborhood of 0,, u(z, e) = s(z, 8)s(z, 0)’ so consistency of I; ’ follows from Lemma 4.3. Also, consistency of IT1 follows by Lemma 4.3 with a(z, 0) = J(x, 0). Q.E.D. The regularity conditions for consistency of each of these estimators are quite weak, and so typically they all will be consistent when the likelihood is twice differentiable. Since only consistency is required for asymptotically correct confidence intervals for 0, the asymptotic theory for @provides no guide as to which of these one should use. However, there are some known properties of these estimators that are useful in deciding which to use. First, 5, is easier to compute than j2, which is easier to compute than j3. Because it is easiest to compute, jr has seen much use in maximum likelihood estimation and inference, as in Berndt et al. (1974). In at least some cases they seem to rank the opposite way in terms of how closely the asymptotic theory approximates the true confidence interval distribution; e.g. see Davidson and MacKinnon (1984). Since the estimators are ranked differently according to different criteria, none of them seems always preferred to the others. One property shared by all inverse information matrix estimators for the MLE variance is that they may not be consistent if the distribution is misspecified, as pointed out by Huber (1967) and White (1982a). If .f(zle,) is not the true p.d.f. then the information matrix equality will generally not hold. An alternative estimator that will be consistent is the general extremum estimator fdrmula J^;‘J^,j;‘. Sufficient regularity conditions for its consistency are that 8% 8,, In f(z18) satisfy
Ch. 36:
Large
Sample
Estimation
and Hypothesis
2159
Testing
parts (ii) and (iv) of Theorem 3.3, E[sup Ot,,, /IV, In f(z (0) (I*] be finite for a neighborhood .,+’ of 8,, and EIVOe In f(zl Q,)] be nonsingular. Example
I .l continued
It would be straightforward to give the formulae j1 and j2 using the derivatives derived earlier. In this example, there are no conditioning variables x, so that 5^; ’ would simply be the information formula evaluated at 8. Alternatively, since it is known that the information matrix is diagonal, one could replace J^; ’ and ji ’ with same matrices, except that before the inversion the off-diagonal elements are set equal to zero. For example, the matrix corresponding to jil would produce a variance estimator for j2 of ncP/C~, I/‘ce(.$), for & = 8- ‘(zi - fi). Consistency of all of these estimators will follow by Theorem 4.4 Sometimes some extra conditions illustrated by the probit example. Example
are needed
of j; ’ or 1; ‘, as
for consistency
1.2 continued
For probit, the three information
matrix estimators
discussed
above are, for L(E)=
&Y@(s),
5^2 =
j3
n-l
+
i
xix:[d{~(-
~)-l~(U)}/d~]lo_x,~[yi
-
@(x$)],
i=l
_T1 =
n- ’ 2
xixi@( - xi&
*L(xj@*{ yi - @(x$)}~.
i=l
Bothj;‘%J-‘andj;’
LJ-1
will follow from consistency
of 6, E[ IIx I/‘1 finite,
and J nonsingular. However, consistency of 5^; ’ seems to require that E[ 11 x II”] is finite, because the score satisfies IIV, lnf(zlQ)l12 < I@(x’O)-‘@( - ~‘0)) ‘~(x’8)(4/~~~~~d 4CC,(l +
IIx II II4 )I2 IIx II2d C(1+ IIx II”).
The variance of nonlinear least squares has some special features that can be used to simplify its calculation. By the conditional mean assumption that E [y Ix] = h(x, fl,), the Hessian term in the asymptotic variance is
H = 2{ECh(x,Wdx> &,)‘I- W,,(x, 4,) CY - hk 6,) } I}
= 2ECk9b,4$,(x, &)‘I, where h, denotes
the gradient,
h,, the Hessian
of h(x, O), and the second
equality
W.K. Newry and D. McFadden
2160
follows by the law of iterated expectations. Therefore, H can be estimated by s = 2n- ‘C:= ,h,(xi, @h&xi, @, which is convenient because it only depends on first derivatives, rather than first and second derivatives. Under homoskedasticity the matrix Z also simplifies, to 4f~~E[h,(x, 8,)h,(x, e,)‘] for cr2 F E[ { y - h(x, U,)}2], which can be estimated by 2e2H for e2 = nP ‘Cy= i { y - h(xi, d)}“. Combining this estimator of Z with the one for H gives an asymptotic variance estimator of the form ? = fi“TfiP1 = 262fim ‘. Consistency of this estimator can be shown by applying the conditions of Lemma 4.3 to both u(z, 6) = {y - h(x, 19))’ and a(z, 8) = h,(x, @h&x, e)‘, which is left as an exercise. If there is heteroskedasticity then the variance of y does not factor out of Z, so that one must use the estimator z= 4n-‘Cr, ,h,(xi, @h&xi, @‘{ yi - h(xi, 8)}2. Also, if the conditional expectation is misspecified, then second derivatives of the regression function do not disappea_r from the Hessian (except in the linear case), so that one must use the estimator H = 2n- ‘x1= 1 [h&x,, @h&xi, i$ + h&xi, @{ yi - h(xi, @}I. A variance estimator for NLS that is consistent in spite of heteroskedasticity or misspecification is fi-‘&-‘, as discussed in White (1982b). One could formulate consistency conditions for this estimator by applying Lemma 4.3. The details are left as an exercise.
4.3.
Asymptotic
vuriance estimation,for
GMM
The asymptotic variance of a GMM estimator is (G’WG))‘G’~~l&‘G(G’~G)-‘, which can be estimated by substituting estimators for each of G, W and 0. As p_reviously discussed,_estima_tors of G and Ware readily available, and are given by G = n- ‘x1= 1VOy(zi, e) and W, where k@is the original weighting matrix. To estimate R = E[g(z, H&z, 0,)‘], one can replace the population moment by a sample average and the true parameter by an estimator, to form fi = n- ’ Cy= r g(zi, @)g(z,, I!?)‘,as in eq. (4.2). The estimator of the asymptotic variance is then given by e = (G’I%‘G)-’ x G,I@r2 I?G@l?G,_l. Consistencyof Sz will follow from Lemma 4.3 with a(z, 8) = g(z, B)g(z, 0)‘, so that consistency of F’will hold under the conditions of Theorem 4.2, as applied to GMM. A result that summarizes these conditions is the following one: Theorem
4.5
If the hypotheses of Theorem 3.4 are satisfied, g(z,@ is continuous at B0 with probability_ one,a_nd_for ^a neighborhood JV of 8,, E[su~~,~ I/ g(z, 0) 11 2] < co, then ?. . ^ V=(&$‘G)-‘G’WRWG(G’WG)-
’ -(’
G’WG)-‘G’WRWG(G’WG)-‘.
Proof
By Lemma 4.3 applied to a(z, 0) = g(z, H)g(z, 6)‘, fiL a. Also, the proof of Theorem 3.4 shows that the hypotheses of Theorem 3.2 are satisfied, so the conclusion follows by Theorem 4.2. Q.E.D.
Ch. 36: Large Sample Estimation
and Hypothesis
Testing
2161
If @‘is a consistent estimator of a-‘, i.e. the probability limit W of @is equal to n-l, then a simpler estimator of the asymptotic variance can be formed as p = (@k&l. Alternatively, one could form &as in eq. (4.2) and use v = ((?fi-‘& ‘. Little seems to be known about the relative merits of these two procedures in small samples, i.e. which (if either) of the initial I%’or the final d-l gives more accurate or shorter confidence intervals. The asymptotic variance estimator c is very general, in that it does not require that the second moment matrix a= E[g(z,B,)g(z,8,)‘] be restricted in any way. Consequently, consistency of ? does not require substantive distributional restrictions other than E[g(z, Q,)] = 0.33 For example, in the context of least squares estimation, where y(z, 0) = x( y - x’d), l?f = I, and (? = - C;= 1x,xi/n, this GMM variance estimator is P = k’[n-‘C~= lxixi(yi - x$)‘]&‘, the Eicker (1967) and White (1980) heteroskedasticity consistent variance estimator. Furthermore, the GMM variance estimator includes many heteroskedasticity-robust IV variance estimators, as discussed in Hansen (1982). When there is more information about the model than just the moment restrictions, it may improve the asymptotic confidence interval approximation to try to use this information in estimation of the asymptotic variance. An example is least squares, where the usual estimator under homoskedasticity is n(Cr, 1xix:)- ‘C( yi - x@‘/ (n - K), where K is the dimension of x. It is well known that under homoskedasticity this estimator gives more accurate confidence intervals than the heteroskedasticity consistent one, e.g. leading to exact confidence intervals from the t-distribution under normality.
Example
1.3 continued
The nonlinear two-stage least squares estimator for the Hansen-Singleton example is a GMM estimator with g(z, 0) = x{bwyY - 1) and @= x1= 1x,xi/n, so that an asymptotic variance estimator can be formed by applying the general GMM formula to this case. Here an estimator of the variance of the moment functions can be formed as described above, with 8= n-‘~~=,x,x,{&viyf - l}‘. The Jacobian estimator is G^= n- ‘Cr= 1xi(wi yly^, Bwi In ( yi)yr). The corresponding asymptotic variance estimator then comes from the general GMM formula (~f~~)-‘~~~~~~(~f~~)~ ‘. Consistency of this estimator will follow under the conditions of Theorem 4.5. It was previously shown that all of these conditions are satisfied except the additional moment assumption stated in Theorem 4.5. For this assumption, it suffices that the upper and lower limits on y, namely yr and y,, satisfy E[~~x~/*~w~~(I~~*~’ + Iyl*‘“)] < co. This condition requires that slightly more moments exist than the previous conditions that were imposed.
331f this restriction is not satisfied, then a GMM estimator may still be asymptotically normal, but the asymptotic variance is much more complicated; see Maasoumi and Phillips (1982) for the instrumental variables case.
W.K. Newey and D. McFudden
2162
5.
Asymptotic
efficiency
Asymptotically normal estimators can be compared on the basis of their asymptotic variances, with one being asymptotically efficient relative to another if it has at least as small an asymptotic variance for all possible true parameter values. Asymptotic efficiency is desirable because an efficient estimator will be closer to the true parameter value in large samples; if o^is asymptotically efficient relative to 8 then for all constants K, Prob() e- O,I d K/&) > Prob( (8- 8,I < K/J%) for all n large enough. Efficiency is important in practice, because it results in smaller asymptotic confidence intervals, as discussed in the introduction. This section discusses general results on asymptotic efficiency within a class of estimators, and application of these results to important estimation environments, both old and new. In focusing on efficiency within a class of estimators, we follow much of the econometrics and statistics literature. 34 Also, this efficiency framework allows one to derive results on efficiency within classes of “limited information” estimators (such as single equation estimators in a simultaneous system), which are of interest because they are relatively insensitive to misspecification and easier to compute. An alternative approach to efficiency analysis, that also allows for limited information estimators, is through semiparametric efficiency bounds, e.g. see Newey (1990). The approach taken here, focusing on classes of estimators, is simpler and more directly linked to the rest of this chapter. Two of the most important and famous efficiency results are efficiency of maximum likelihood and the form of an optimal weighting matrix for minimum distance estimation. Other useful results are efficiency of heteroskedasticity-corrected generalized least squares in the class of weighted least squares estimators and two-stage least squares as an efficient instrumental variables estimator. All of these results share a common structure that is useful in understanding them and deriving new ones. To motivate this structure, and focus attention on the most important results, we first consider separately maximum likelihood and minimum distance estimation.
5.1.
Eficiency
of maximum
likelihood estimation
Efficiency of maximum likelihood is a central proposition of statistics that dates from the work of R.A. Fisher (1921). Although maximum likelihood is not efficient in the class of all asymptotically normal estimators, because of “superefficient” estimators, it is efficient in quite general classes of estimators.35 One such general class is the 341n particular, one of the precise results on efficiency of MLE is the HajekkLeCam representation theory, which shows efficiency in a class of reyular estimators. See, e.g. Newey (1990) for a discussion of regularity. 35The word “superefficient” refers to a certain type ofestimator, attributed to Hodges, that is used to show tha?there does not exist an efficient estimator in the class of all asymptotically normal estimators. Suppose 0 is asymptotically normal, and for some numb_er t( and 0 ^( p < i, suppose that 0 ha_s positive asympiotic variance when the trueparameter is rx. Let B = e if nalU - al > 1 and 0 = a if nPIO - a( < 1. Then 6’ is superefficient relative to 8, having the same asymptotic variance when the true parameter is not cxbut having a smaller asymptotic variance, of zero, when the true parameter is X.
Ch. 36: Large Sample Estimation and Hypothesis
2163
Testing
class of GMM estimators, which includes method of moments, least squares, instrumental variables, and other estimators. Because this class includes so many estimators of interest, efficiency in this class is a useful way of thinking about MLE efficiency. Asymptotic efficiency of MLE among GMM estimators is shown by comparing asymptotic variances. The asymptotic variance of the MLE is (E[ss’])-‘, where s = V, In f(zl0,) is the score, with the z and 8 arguments suppressed for notational convenience. The asymptotic variance of a GMM estimator can be written as m-%l)r ‘Jm~‘l(~b;l)l where m, = (E[Veg(z, (3,)])‘WV0g(z, 0,) and m = (E[V,g(z,8,)])‘Wg(z, 0,). At this point the relationship between the GMM and MLE variances is not clear. It turns out that a relationship can be derived from an interpretation of E[me] as the covariance of m with the score. To obtain this interpretation, consider the GMM moment condition jg(z, 19)f(z ItI) dz = 0. This condition is typically an identity over the parameter space that is necessary for consistency of a GMM estimator. If it did not hold at a parameter value, then the GMM estimator may not converge to the parameter at that point, and hence would not be consistent.36 Differentiating this identity, assuming differentiation under the integral is allowed, gives
s
0 = Vo s(z,W(zl@dzle=e, =
Cvodz,@lf(z I@dz + & ‘3CV,f(z I@I’ dz B=B” s
= ECVddz, 4Jl + %dz, &JVoInf(z IWI,
(5.1)
where the last equality follows by multiplying and dividing V, f(z IO,) by f(z IO,). This is the generalized information matrix equality, including the information matrix equality as a special case, where g(z, 0) = V,ln f(~l8).~’ It implies that E[m,] + E[ms’] = 0, i.e. that E[ms] = - E[ms’]. Then the difference of the GMM and MLE asymptotic variances can be written as (E[mJ-‘E[mm’](E[m~])-’
-(E[ss’])~’
= (E[ms’])-‘E[mm’](E[sm’])p’ = (E[ms’])-‘{E[mm’]
- (E[ss’])-’
- E[ms’](E[ss’])-‘E[sm’]}(E[sm’])-’
= (E[ms’])-‘E[UU’](E[sm’])-‘,
U = m - E[ms’] (E[ss’])-
1 s.
(5.2)
3hRecall that consistency means that the estimator converges in probability to the true parameter for all oossible true oarameter values. ‘;A similar eq’uality, used to derive the Cramer-Rao bound for the variance of unbiased estimators, is obtained by differentiating the identity 0 = JOdF,, where F’,, is the distribution of the data when 0 is the true parameter value.
W.K. Newey and D. McFadden
2164
Since E[UU’] is positive semi-definite, the difference of the respective variance matrices is also positive semi-definite, and hence the MLE is asymptotically efficient in the class of GMM estimators. To give a precise result it is necessary to specify regularity conditions for the generalized information matrix equality of eq. (5.1). Conditions can be formulated by imposing smoothness on the square root of the likelihood, f(zl@“‘, similar to the regularity conditions for MLE efficiency of LeCam (1956) and Hajek (1970). A precise result on efficiency of MLE in the class of GMM estimators can then be stated as: Theorem 5.1 If the conditions of Theorem O,, J is nonsingular, and for f(z I 0) dz and s ~upij~.~~ I/V&z (G’WG)- ‘G’WRWG(G’WG)
3.4 are satisfied,f(zl 0)1’2 is continuously differentiable at all 8 in a neighborhood JY of BO,JsuP~~,~ 11g(z, g) /I2 x Ir$1’2 11 2 dz are bounded and Jg(z, @f(z 10)dz = 0, then - JP1 is positive semi-definite.
The proof is postponed until Section 5.6. This result states that J-’ is a lower bound on the asymptotic variance of a GMM estimator. Asymptotic efficiency of MLE among GMM estimators then follows from Theorem 3.4, because the MLE will have J ’ for its asymptotic variance.38
5.2.
Optimal minimum distance estimation
The asymptotic variance of a minimum distance estimator depends on the limit W of the weighting matrix [email protected] W = a-‘, the asymptotic variance of a minimum distance estimator is (G’R-‘G)-‘. It turns out that this estimator is efficient in the class of minimum distance estimators. To show this result, let Z be any random vector such that a= E[ZZ’], and let m = G’WZ and fi = G’K’Z. Then by G’ WC = E[mfi’]
and G’R- ‘G = E[riifi’],
(G’WG)~‘G’WL!WG(G’WG)~l-(G’~nlG)-’ = (G’WG)PIEIUU’](G’WG)-‘,
U = m - E[mfi’](E[tirii’])-‘6.
Since E[UU’] is positive semi-definite, the difference of the asymptotic positive semi-definite. This proves the following result:
(5.3) variances
is
38 It is possible to show this result under the weaker condition that f(zlO)“’ is mean-square differentiable, which allows for f(zlO) to not be continuously differentiable. This condition is further discussed in Section 5.5.
2165
Ch. 36: Large Sample Estimation and Hypothesis Testing
Theorem 5.2 If f2 is nonsingular, a minimum distance estimator with W = plim(@‘) = R-r asymptotically efficient in the class of minimum distance estimators.
is
This type of result is familiar from efficiency theory for CMD and GMM estimation. For example, in minimum chi-square estimation, where b(Q) = 72- $0) the efficient weighting matrix W is the inverse of the asymptotic variance of fi, a result given by Chiang (1956) and Ferguson (1958). For GMM, where Q(H)= x1=, g(Zi, d)/n, the efficient weighting matrix is the inverse of the variance of g(zi, fI,), a result derived by Hansen (1982). Each of these results is a special case of Theorem 5.2. Construction of an efficient minimum distance estimator is quite simple, because the weighting matrix affects the asymptotic distrib_ution only t_hrough its probability limit. All that is required is a consistent estimator R, for then W = fX ’ will converge in probability to rZP ‘. Since an estimator of R is needed for asymptotic variance estimation, very little additional effort is required to form an efficient weighting matrix. An efficient minimum distance estimator can then be constructed by minimizing d(O)‘& ‘g(O). Alternatively, the one-step estimator r?= &(@a‘c)) ’ x eh- ‘g(6) will also be efficient, because it is asymptotically equivalent to the fully iterated minimum distance estimator. The condition that W = fl- ’ is sufficient but not necessary for efficiency. A necessary and sufficient condition can be obtained by further examination of eq. (5.3). A minimum distance estimator will be efficient if and only if the random vector U is zero. This vector is the residual from a population regression of m on I+&and so will be zero if and only if m is a linear combination of fi, i.e. there is a constant matrix C such that G’WZ = CG’R-‘2. Since 2 has nonsingular variance matrix, this condition is the same as G’W = CG’O-‘. This is the necessary estimator.
5.3.
(5.4) and sufficient
condition
for efficiency of a minimum
distance
A general eficiency framework
The maximum likelihood and minimum distance efficiency results have a similar structure, as can be seen by comparing eqs. (5.2) and (5.3). This structure can be exploited to construct an eficiency framework that includes these and other important results, and is useful for finding efficient estimators. To describe this framework one needs notation for the asymptotic variance associated with an estimator. To this end, let r denote an “index” for the asymptotic variance of an estimator in some
W.K. Newey and D. McFadden
2166
class, where r is an element of some abstract set. A completely general form for z would be the sequence of functions of the data that is the sequence of estimators. However, since r is only needed to index the asymptotic variance, a simpler specification will often suffice. For example, in the class of minimum distance estimators with given g,(O), the asymptotic variance depends only on W = plim(I@, so that it suffices to specify that z = W. The framework considered here is one where there is a random vector Z such that for each r (corresponding to an estimator), there is D(z) and m(Z, z) with the asymptotic variance V(r) satisfying V(7) = D(z)_ l E[m(Z, T)rn(Z, T)‘]D(T)-
l’.
(5.5)
Note that the random vector Z is held fixed as t varies. The function m(Z, z) can often be interpreted as a score or moment function, and the matrix D(z) as a Jacobian matrix for the parameters. For example, the asymptotic variances of the class of GMM estimators satisfy this formula, with z being [g(z, 8&G, W], Z = z being a single observation, m(Z, r) = G’Wg(z, tl,), and D(r) = G’WG. Another example is minimum distance estimators, where Z is any random vector with mean zero and variance 0, z = W, m(Z, z) = G’WZ, and D(T) = G’ WC. In this framework, there is an interesting and useful characterization of an efficient estimator. Theorem 5.3 If Z satisfies D(z) = E[m(Z, z)m(Z, ?)‘I for all z then any estimator with variance L’(f) is efficient. Furthermore, suppose that for any ri, r2, and constant square matrices C,, C, such that C,D(z,) + C&r,) is nonsingular, there is z3 with (i) (linearity of the moment function set) m(Z,r,) = C,m(Z,z,) + C,m(Z,z,); (ii) (linearity of D) D(r,) = C,D(t,) + C,D(z,). If there is an efficient estimator with E[m(Z, z)m(Z, z)‘] nonsingular then there is an efficient estimator with index F such that D(z) = E[m(Z, z)m(Z, f)‘] for all z. Proof If r and S satisfy D(z) = E[m(Z,r)m(Z,?)‘] then the difference asymptotic variances satisfies, for m = m(Z, z) and 6 = m(Z, ?), V(7) -
V(f) =
(E[m~‘])~‘E[mm’](E[fim’])~’
= (E[mti’])-‘E[UU’](E[tim’])p CJ= m-E[mrii’](E[@iti’])-‘ti,
- (E[tid])~
of the respective
l
‘, (5.6)
so the first conclusion follows by E[UU’] positive semi-definite. To show the second conclusion, let ll/(Z, t) = D(T)- ‘m(Z, T), so that V(7) = E[$(Z, z)$(Z, s)‘]. Consider
Ch. 36: Large Sample Estimation
and Hypothesis
2167
Testing
any constant matrix B, and for 7r and T* let C, = BD(7,))’ and C, = (I - B)D(T~)-’ note that C,D(z,) + C,D(z,) = I is nonsingular, so by (i) and (ii) there is 73 such that Bl+b(Z,7,)+(Z-B)II/(Z,7,)
=
c,m(z,7,)+C,m(Z,7,)=m(Z,7,)
=
I-‘m(Z,z,)
=
[C,D(t,) + C,D(t,)]‘m(Z, z3) = D(z,)- 'm(Z, 7j) = I/I(Z, 73). Thus, the set ($(Z, 7)} is affine, in the sense that B$(Z, tl) + (I - B)$(Z, z2) is in this set for any 71, z2 and constant matrix B. Let $(Z,?) correspond to an efficient estimator. Suppose that there is 7 with E[($ - $)$‘I # 0 for $ = $(Z, 7) and & = $(Z, 5). Then $ - 6 # 0, so there exists a constant matrix F such that e = F($ - $) has nonsingular variance andE[e~]#O.LetB=-E[~e’](E[ee’])-’Fandu=~+B(~-~)=(Z-B)~+B~. By the affine property of {rj(Z,z)} there is z”such that k’(f) = E[uu’] = E[$$‘] E[$e’](E[ee’])-‘E[e$‘] = V(T) - E[$e’](E[ee’])-‘E[e$‘], which is smaller than V(S) in the positive semi-definite sense. This conclusion contradicts the assumed - -, efficiency of Z, so that the assumption that E[($ - $)tj ] # 0 contradicts efficiency. Thus, it follows that E[($ - $)I+?‘]= 0 for all 7, i.e. that for all 7,
D(t)-
‘E[m(Z,r)m(Z,f)‘]D(?)-”
= D(t)-'E[m(Z,~)m(Z,~)']D(~)-
“.
(5.7)
By the assumed nonsingularity of E[m(Z, T)m(Z, Z)‘], this equation can be solved for D(7) to give D(7) = E[m(Z, z)m(Z, T)‘](E[m(Z, f)m(Z, 2)‘])- ‘D(f). Since C = D(f)‘(E[m(Z, f)m(Z, ?)‘I)- ’ is a nonsingular matrix it follows by (i) and (ii) that there exists ? with m(Z, ?) = Cm(Z, Y). Furthermore, by linearity of D(7) it follows that V(?)= V(Z), so that the estimator corresponding to z”is efficient. The second conQ.E.D. clusion then follows from D(7) = E[m(Z, z)m(Z, S)‘] for all 7. This result states that D(7) =
E[m(Z, t)m(Z,
Z)‘],
for all 7,
(5.8)
is sufficient for Z to correspond to an efficient estimator and is necessary for some efficient estimator if the set of moment functions is linear and the Jacobian is a linear function of the scores. This equality is a generalization of the information matrix equality. Hansen (1985a) formulated and used this condition to derive efficient instrumental variables estimators, and gave more primitive hypotheses for conditions (i) and (ii) of Theorem 5.3. Also, the framework here is a modified version of that of Bates and White (1992) for general classes of estimators. The sufficiency part of Theorem 5.3 appears in both of these papers. The necessity part of Theorem 5.3 appears to be new, but is closely related to R.A. Fisher’s (1925) necessary condition for an efficient statistic, as further discussed below. One interpretation of eq. (5.8) is that the asymptotic covariance between an efficient estimator and any other estimator is the variance of the efficient estimator. This characterization of an efficient estimator was discussed in R.A. Fisher (1925),
W.K.NeweyandD.McFaddvn
2168
and is useful in constructing Hausman (1978) specification tests. It is derived by assuming that the asymptotic covariance between two estimators in the class takes as can usually be verified by “stacking” theform D(r,)~'E[m(Z,z,)m(Z,s,)']D(z,)-", the two estimators and deriving theirjoint asymptotic variance (and hence asymptotic covariance). For example, consider two different GMM estimators 8, and g2, with two different moment functions g,(z, 6) and g2(z, @, and r = q for simplicity. The vector y*= (@,, &)’ can be considered a joint GMM estimator with moment vector g(z, y) = [gr(z, H,)‘, gz(z, @,)‘I’. The Jacobian matrix of the stacked moment vector will be block diagonal, and hence so will its inverse, so that the asymptotic covariance between 6, and 6, will be {E[V,g,(z, e,)]} _ ‘E[g,(z, d0)g2(z, O,)‘] x {ECV,g,(z, &,)I) - l’. Th’ISISexactly of the form D(T,)- ‘E[m(Z, tl)m(Z, TJ']O(T~)-I', where Z = z, m(Z, TV)= g,(z,O,),etc. When the covariance takes this form, the covariance between any estimator and one satisfying eq. (5.8) will be D(T)-' x E[m(Z,z)m(Z,~)l]D(~)-“=I~D(~)~“=D(~)-’E[m(Z,t)m(Z,~)‘]D(~)-” = V(t), the variance of the efficient estimator. R.A. Fisher (1925) showed that this covariance condition is sufficient for efficiency, and that it is also necessary if the class of statistics is linear, in a certain sense. The role of conditions (i) and (ii) is to guarantee that R.A. Fisher’s (1925) linearity condition is satisfied. Another interpretation ofeq. (5.8) is that the variance of any estimator in the class can be written as the sum of the efficient variance and the variance of a “noise term”. to Let u(Z)= D(T)-'m(Z,T)-D(f)-'m(Z,f), and note that U(Z) is orthogonal D(5)_ ‘m(Z, Z) by eq. (5.8). Thus, V(T)= V(Z)+ E[CI(Z)U(Z)‘]. This interpretation is a second-moment version of the Hajek and LeCam efficiency results.
5.4.
Solving fir the smallest asymptotic variance
The characterization of an efficient estimator given in Theorem 5.3 is very useful for finding efficient estimators. Equation (5.8) can often be used to solve for Z, by following two steps: (1) specify the class of estimators so that conditions (i) and (ii) of Theorem 5.3 are satisfied, i.e. so the set of moment functions is linear and the Jacobian D is linear in the moment functions; (2) look for Z such that D(T) = E[m(Z, s)m(Z, Z)‘]. The importance of step (1) is that the linearity conditions guarantee that a solution to eq. (5.8) exists when there is an efficient estimator [with the variance of m(Z, t) nonsingular], so that the effort of solving eq. (5.8) will not be in vain. Although for some classes of estimators the linearity conditions are not met, it often seems to be possible to enlarge the class of estimators so that the linearity conditions are met without affecting the efficient estimator. An example is weighted least squares estimation, as further discussed below. Using eq. (5.8) to solve for an efficient estimator can be illustrated with several examples, both old and new. Consider first minimum distance estimators. The asymptotic variance has the form given in eq. (5.5) for the score G’WZ and the Jacobian term G’ WC. The equation for the efficient W is then 0 = G’ WC - G’Wf26’G =
Ch. 36: Large Sample Estimation and Hypothesis Testing
2169
G’W(I - flW)G, which holds if fll?f= I, i.e. w = R- ‘. Thus, in this example one can solve directly for the optimal weight matrix. Another example is provided by the problem of deriving the efficient instruments for a nonlinear instrumental variables estimator. Let p(z, (3)denote an s x 1 residual vector, and suppose that there is a vector of variables x such that a conditional moment restriction,
ma
&I)Ixl = 0,
(5.9)
is satisfied. Here p(z, 0) can be thought of as a vector of residuals and x as a vector of instrumental variables. A simple example is a nonlinear regression model y = ,f(x, (3,) + E, @&lx] = 0, where the residual p(z, 0) = y - f(x, 0) will satisfy the conditional moment restriction in eq. (5.9) by E having conditional mean zero. Another familiar example is a single equation of a simultaneous equations system, where p(z, 0) = y - Y’8 and Y are the right-hand-side endogenous variables. An important class of estimators are instrumental variable, or GMM estimators, based on eq. (5.9). This conditional moment restriction implies the unconditional moment restriction that E[A(x)p(z, e,)] = 0 for any q x s matrix of functions A(x). Thus, a GMM estimator can be based on the moment functions g(z, 0) = A(x)p(z, 0). Noting that V&z, 0) = A(x)V,p(z, Q), it follows by Theorem 3.4 that the asymptotic variance of such a GMM estimator will be
WV = {~%4x)Vep(z> 441) ‘~C44dz, ~JP(z,&J’44’1 {~C44Ve~k441 > “2 (5.10) where no weighting matrix is present because g(z, Q) = A(x)p(z,B) has the same number of components as 0. This asymptotic variance satisfies eq. (5.5), where T = A(-) indexes the asymptotic variance. By choosing p(z, 0) and A(x) in certain ways, this class of asymptotic variances can be set up to include all weighted least squares estimators, all single equation instrumental variables estimators, or all system instrumental variables estimators. In particular, cases with more instrumental variables than parameters can be included by specifying A(x) to be a linear combination of all the instrumental variables, with linear combination coefficients given by the probability limit of corresponding sample values. For example, suppose the residual is a scalar p(z,@ = y- Y’B, and consider the 2SLS estimator with instrumental variables x. Its asymptotic variance has the form given in eq. (5.10) for A(x) = E[ Yx’](E[x~‘])~‘x. In this example, the probability limit of the linear combination coefficients is E[Yx’](E[xx’])-‘. For system instrumental variables estimators these coefficients could also depend on the residual variance, e.g. allowing for 3SLS. The asymptotic variance in eq. (5.10) satisfies eq. (5.5) for Z=z, D(r)= E[A(x) x V&z, Q,)], and m(Z, r) = A(x)p(Z, 0,). Furthermore, both m(Z, r) and D(r) are linear in A(x), so that conditions (i) and (ii) should be satisfied if the set of functions {A(x)}
W.K. Newey and D. McFadden
2170
is linear. To be specific, consider the class of all A(x) such that E[A(x)V&z, O,)] and E[ )1.4(x)11 2 /)p(z, 0,) I/2] exist. Then conditions (i) and (ii) are satisfied with TV= A3(*) = CIA,(.) _t C,A,(V).~~ Thus, by Theorem 5.3, if an efficient choice of instruments exist there will be one that solves eq. (5.8). To find such a solution, let G(x) = E[V,p(z, 0,)j x] and 0(x) = E[p(z, Qp(z, 0,)’ (xl, so that by iterated expectations eq. (5.8) is 0 = E[A(x)(G(x) - Q(x)A(x)‘}]. This equation will be satisfied if G(x) - Q(x),?(x)’ = 0, i.e. if
A(x) = G(x)'O(x)- ‘.
(5.11)
Consequently, this function minimizes the asymptotic variance. Also, the asymptotic variance is invariant to nonsingular linear transformations, so that A(x) = CG(x)‘n(x)-’ will also minimize the asymptotic variance for any nonsingular constant matrix C. This efficient instrument formula includes many important efficiency results as special cases. For example, for nonlinear weighted least squares it shows that the optimal weight is the inverse of the conditional variance of the residual: For G,(0) = - n- 1C;= 1w(xi)[ yi - h(xi, O)]“, the conclusion of Theorem 3.1 will give an asymptotic variance in eq. (5.10) with A(x) = w(x)h,(x, S,), and the efficient estimator has A(x) = {E[a2 1x] } - ‘h,(x, Q,), corresponding to weighting by the inverse of the conditional variance. This example also illustrates how efficiency in a class that does not satisfy assumptions (i) and (ii) of Theorem 5.3 (i.e. the linearity conditions), can be shown by enlarging the class: the set of scores (or moments) for weighted least squares estimators is not linear in the sense of assumption (i), but by also including variances for “instrumental variable” estimators, based on the moment conditions y(z, 19)= A(x)[y - h(x, tI)], one obtains a class that includes weighted least squares, satisfies linearity, and has an efficient member given by a weighted least squares estimator. Of course, in a simple example like this one it is not necessary to check linearity, but in using eq. (5.8) to derive new efficiency results, it is a good idea to set up the class of estimators so that the linearity hypothesis is satisfied, and hence some solution to eq. (5.8) exists (when there is an efficient estimator). Another example of optimal instrument variables is the well known result on efficiency of 2SLS in the class of instrumental variables estimators with possibly nonlinear instruments: If p(z, 0) = y - Y’O, E[ Yjx] = 17x, and c2 = E[p(z, B,J2 1x] is constant, then G(x) = - Ii’x and 0(x) = 02, and the 2SLS instruments are E[ Yx’](E[xx’])lx = 17x = - 02&x), a nonsingular linear combination of A(x). As noted above, for efficiency it suffices that the instruments are a nonsingular linear combination of A(x), implying efficiency of 2SLS. This general form A(x) for the optimal instruments has been previously derived in Chamberlain (1987), but here it serves to illustrate how eq. (5.8) can be used to “Existence of the asymptotic Cauchy-Schwartz inequalities.
variance
matrix
corresponding
to 53 follows
by the triangle
and
Ch. 36: Large Sample Estimation and Hypothesis Testing
2171
derive the form of an optimal estimator. In this example, an optimal choice of estimator follows immediately from the form of eq. (5.8) and there is no need to guess what form the optimal instruments might take.
5.5.
Feasible
efficient estimation
In general, an efficient estimator can depend on nuisance parameters or functions. For example, in minimum distance estimation the efficient weighting matrix is a nuisance parameter that is unknown. Often there is a nuisance function, i.e. an infinite-dimensional nuisance parameter, such as the optimal instruments discussed in Section 5.4. The true value of these nuisance parameters is generally unknown, so that it is not feasible to use the true value to construct an efficient estimator. One feasible approach to efficient estimation is to use estimates in place of true nuisance parameters, i.e. to “plug-in” consistent nuisance parameter estimates, in the construction of the estimator. For example, an approach to feasible, optimal weighted least squares estimator is to maximize - n-l x1= r G(xi)[yi - h(xi, 8)12, where a’(x) is an estimator of 1/E[.a2 1x]. This approach will give an efficient estimator, if the estimation of the nuisance parameters does not affect the asymptotic variance of 6. It has already been shown, in Section 5.2, that this approach works for minimum distance estimation, where it suffices for efficiency that the weight matrix converges in probability to R - ‘. More generally, a result developed in Section 6, on two-step estimators, suggests that estimation of the nuisance parameters should not affect efficiency. One can think of the “plug-in” approach to efficient estimation as a two-step estimator, where the first step is estimating the nuisance parameter or function, and the second is construction of &. According to a principle developed in the next section, the first-step estimation has no effect on the second-step estimator if consistency of the first-step estimator does not affect consistency of the second. This principle generally applies to efficient estimators, where nuisance parameter estimates that converge to wrong values do not affect consistency of the estimator of parameters of interest. For example, consistency of the weighted least squares estimator is not affected by the form of the weights (as long as they satisfy certain regularity conditions). Thus, results on two-step estimation suggest that the “plug-in” approach should usually yield an efficient estimator. The plug-in approach is often easy to implement when there are a finite number of nuisance parameters or when one is willing to assume that the nuisance function can be parametrized by a finite number of parameters. Finding a consistent estimator of the true nuisance parameters to be used in the estimator is often straightforward. A well known example is the efficient linear combination matrix Z7= E[Yx’](E[xx’])’ for an instrumental variables estimator, which is consistently estimated by the 2SLS coefficients fi= xy= r Yix:(Cy= ,x,x~)-‘. Another example is the optimal weight for nonlinear least squares. If the conditional variance is parametrized as a’(~, y), then
W.K. Newey and D. McFadden
2172
the true y can be consistently estimated from the nonlinear least squares regression of $ on aZ(xi, y), where Ei = yi - h(xi, I$ (i = 1,. . , n), are the residuals from a preliminary consistent estimator (7. Of course, regularity conditions are, useful for showing that estimation of the nuisance parameters does not affect the asymptotic variance of the estimator. To give a precise statement it is helpful to be more specific about the nature of the estimator. A quite general type of “plug-in” estimator is a GMM estimator that depends on preliminary estimates of some parameters. Let g(z, 19,y) denote a q x 1 vector of functions of the parameters of interest and nuisance parameters y, and let y*be a first-step estimator. Consider an estimator e that, with probability approaching one. solves
n-
l f
cJ(Zi,f&y*)= 0.
(5.12)
i=l
This class is quite general, because eq. (5.12) can often be interpreted as the firstorder conditions for an estimator. For example, it includes weighted least squares estimators with an estimated weight w(x,y*), for which eq. (5.12) is the first-order condition with g(z, 8, y) = w(x, y&(x, 0)[ y - h(x, 8)]. One type of estimator not included is CMD, but the main result of interest here is efficient choice of weighting matrix, as already discussed in Section 5.2. Suppose also that y*is a GMM estimator, satisfying n-l x1= i m(zi, y) = 0. If this equation is “stacked” with eq. (5.12), the pair (6, $) becomes a joint GMM estimator, so that regularity conditions for asymptotic efficiency can be obtained from the assumptions for Theorem 3.4. This result, and its application to more general types of two-step estimators, is described in Section 6. In particular, Theorem 6.1 can be applied to show that 6’from eq. (5.12) is efficient. If the hypotheses of that result are satisfied and G, = E[V,g(z, B,, yO)] = 0 then 8 will be asymptotically normal with asymptotic variance the same as if 7 = yO. As further discussed in Section 6, the condition G, = 0 is related to the requirement that consistency of ji not affect consistency of 8. As noted above, this condition is a useful one for determining whether the estimation of the nuisance parameters affects the asymptotic variance of the feasible estimator 6. To show how to analyze particular feasible estimators, it is useful to give an example. Linear regression with linear heteroskedusticity: Consider a linear model where &lx] = ~‘8, and C?(X) = Var( y Jx) = w’c(~for some w = w(x) that is a function of x. As noted above, the efficient estimator among those that solve n-‘Cy= i A(xi) x [ yi - x:(3] = 0 has A(x) = A(x) = (~‘a,))’ x. A feasible efficient estimator can be constructed by using a squared residual regression to form an estimator oi for Q, and plugging this estimator into the first-order conditions. More precisely, let p be the least squares estimator from a regression of y on x and & the least squares
Ch. 36: Large Sample Estimation and Hypothesis
2173
Testing
estimator from a regression of (y - x/j?)’ on w. Suppose that w’aO is bounded below and let r(u) be a positive function that is continuously differentiable with bounded derivative and z(u) = u for u greater than the lower bound on w’cx,,.~’ Consider 8 obtained from solving CT= i r(w’&)) ‘xi(yi - xi@ = 0. This estimator is a two-step GMM estimator like that given above with y = (cc’,fl’)‘, m(z, y) =
[( y -
x’P)x’,
{( y - x’py - w’cr}w’]‘,
g(z,8, y) = T(W’cI)- ‘x( y - de). It is straightforward to verify that the vector of moment functions [m(z, y)‘, g(z, 8, y)‘]’ satisfies the conditions of Theorem 6.1 if w is bounded, x and y have finite fourth moments, and E[xx’] and E[ww’] are nonsingular. Furthermore, E[V,g(z, do, yo)] = - E[~(w’a~)-~(y - x’~,)xw’] = 0, so that this feasible estimator will be efficient. In many cases the efficiency of a “plug-in” estimator may be adversely affected if the parametrization of the nuisance functions is incorrect. For example, if in a linear model, heteroskedasticity is specified as exponential, but the true conditional variance takes another form, then the weighted least squares estimator based on an exponential variance function will not be efficient. Consistency will generally not be affected, and there will be only a little loss in efficiency if the parametrization is approximately correct, but there could be big efficiency losses if the parametrized functional form is far from the true one. This potential problem with efficiency suggests that one might want to use nonparametric nuisance function estimators, that do not impose any restrictions on functional form. For the same reasons discussed above, one would expect that estimation of the nuisance function does not affect the limiting distribution, so that the resulting feasible estimators would be efficient. Examples of this type of approach are Stone (197.3 Bickel (1982), and Carroll (1982). These estimators are quite complicated, so an account is not given here, except to say that similar estimators are discussed in Section 8.
5.4.
Technicalities
It is possible to show the generalized information matrix equality in eq. (5.1) under a condition that allows for f(zl @‘I2 to not be continuously differentiable and g(z, 0) to not be continuous. For the root-density, this condition is “mean-square”differentiability at fIO with respect to integration over z, meaning that there is 6(z) with l /l&z) )I2dz < co such that J[f(zI @‘I2 - f(zl Qo)1’2- 6(z)‘(H- 0,)12 dz = o( II8 - ,!?,I)2, ““The T(U)function is a “trimming” device similar to those used in the semiparametric estimation literature. This specification requires knowing a lower bound on the conditional variance. It is also possible to allow T(U)to approach the identity for all u > 0 as the sample size grows, but this would complicate the analysis.
W.K. Newey and D. McFadden
2174
that as O-+9,. As shown in Bickel et al. (1992) it will suffice for this condition ,f(zl0) is continuously differentiable in 0 (for almost all z) and that J(0) = jV, In ,f(zlO) x and continuous in 8. Here 6(z) is the derivative {VOln~(zIfO>‘,!“(zI@d z ISnonsingular off(z I w2, so by V,f‘(z 141/2= +f(z l 0)1/2V0 In f(z Id), the expression for the information matrix in terms of 6(z) is J = 4J”6(z)&z) dz. A precise result on efficiency of MLE in the class of GMM estimators can then be stated as: Lemma 5.4 If(i) ,f(~l@r/~ is mean-square differentiable at 0, with derivative 6(z); (ii) E[g(z, Q)] is differentiable at 0, with derivative G; (iii) g(z, 0) is continuous at B,, with probability one; (iv) there is a neighborhood _N of 6, and a function d(z) such that IIg(z, 0) II d d(z) and Srl(z)‘f(z IO)dz is bounded for BEN; then lg(z, Q)f(z 10)dz is differentiable at B0 with derivative G + 2jg(z, Q,)~(z)f(~lQ,)‘~~ dz. Proof The proof is similar to that of Lemma 7.2 of Ibragimov and Has’minskii (1981). Let r(0) = f(z IQi”, g(e) = g(z, 0), 6 = 6(z), and A(B) = r(0) - r(Q,) - iY(d - Q,), suppressing the z argument for notational convenience. Also, let m(8,8) = ~g(@r(Q2dz and M = jg(b’,)&(ll,) dz. By (ii), m(0, d,) - m(B,, ~9,)- G(B - 0,) = o( Ij 0 - do I/). Also, by the triangle inequality, I/m(0,@ - m(B,, 0,) - (G + 2M)(0 - 0,) 11< I/m(e,e,) m(fl,, 0,) - G(8 - 0,) /I + 11 m(6,6) - m(@,d,) - 2M(6’ - 0,) 11,so that to show the conclusion it suffices to show IIm(d, 0) - m(d, 0,) - 2M(B - 0,) II = o( 110- B. /I). To show this, note by the triangle inequality,
IIde, 4 - MO,&4 - 2M(8 - 0,) I/ = d
IIs
g(d)[r(d)’
- r(0,J2] dz - 2~(0 - 0,)
!I
Cd4 - g(~o)lr(80)6’dz II8 - 8, II s(QCr@)+ r(bJl44 dz + IIS II IIS II +
Cs(e)r(e)-s(e,)r(e,)lsdz II~-~,ll=~,+~,ll~-8,l/+R,ll8-~,ll.
Therefore, it suffices to show that R, =o( /I0-0, II), R, -0, By (iv) and the triangle and Cauchy-Schwartz inequalities,
R, < { [ [g(Q2r(R,’
d.z]li2 + [ Ig(Q2r(&J2
Also, by (iii) and (iv) and the dominated
dzy
convergence
and R, +O as O+ fl,,.
“)[p(B)’
dz]12
theorem, E[ I(g(e) - g(0,) \I’] +O,
Ch. 36: Large Sample Estimation
and Hypothesis
2175
Testing
so by the Cauchy-Schwartz inequality, R, <(EC IIg(0)-g(0,) Il’])“‘(j 116112 dz)“‘+O. Also, by the triangle inequality, R, < R, + 1 I/g(0) 11Ir(0) - r(Q,)l /I6 II dz, while for K > 0,
s
/Icd@It Ir(@ - 44,) I II6 II dz d
d
s
44 Ir@) - 4%) I II6 II dz
4.4 Ir(Q)- r&J I II6 IIdz + K Ir(Q)- 4%)l II6 II dz s d(z)>.4 s
I is l/2
<
d(~)~lr(0) - r(B,)12 dz
II6 II2dz
d(z) 3 X
lr(R)-r(i),)2dz11’2{
S6’dz)L-i.
By (iv), i d(z)‘Ir(Q) - r(0,)12 dz < 2fd(z)2r(B)2 dz + 2jd(z)2r(6,)2 dz is bounded. Also, by the dominated convergence theorem, JdcZjaK I/6 (1’dz + 0 as K + co, and by (i), z + 0, so that the last term converges to zero for any K. Consider j Ir(4 - r(bJ I2d E > 0 and choose K so jdtz,a K ((6 ((2 dz < 3~. Then by the last term is less than +E for 0 close enough to 0,, implying that j 1)g(0) I/Ir(8) - r(Q,)l 116I/ dz < E for 0 close enough to (IO. The conclusion then follows by the triangle inequality. Q.E.D. Proof of Theorem
5.1
By condition (iv) of Theorem 3.4 and Lemma 3.5, g(z, e) is continuous on a neighborhood of 8, and E[g(z, 0)] is differentiable at B0 with derivative G = E[Vsy(z, (II,)]. Also, f(z I0)“’ is mean-square differentiable by the dominance condition in Theorem 5.1, as can be shown by the usual mean-value expansion argument. Also, by the conditions of Theorem 5.1, the derivative is equal to $1 [f(zj0,)>O]f(z(0,)-“2 x V,f(z) 0,) on a set of full measure, so that the derivative in the conclusion of Lemma 5.4 is G + %(z,
j442.fW)dz
WV0 ln
bounded,
f(zl&Jl. Also, IIdz, 0)II d 44 = su~~~.~- IIAZ, 0)II Has
so that u = g(z, 0,) + GJ- ‘V,in f(zIB,), (G’WG)- ‘G’B’~WG(G’WG)-
the conclusion
of Lemma
5.4 holds.
Then
for
1 - J-l
=(G’WG)-‘G’W(~uu’dz)WG(G’WG)-‘, so the conclusion
6.
follows by i UU’dz positive
semi-definite.
Q.E.D.
Two-step estimators
A two-step estimator is one that depends on some preliminary, “first-step” estimator of a parameter vector. They provide a useful illustration of how the previous results
W.K. Newey and D. McFadden
2116
can be applied, even to complicated estimators. In particular, it is shown in this section that two-step estimators can be fit into the GMM framework. Two-step estimators are also of interest in their own right. As discussed in Section 5, feasible efficient estimators often are two-step estimators, with the first step being the estimation of nuisance parameters that affect efficiency. Also, they provide a simpler alternative to complicated joint estimators. Examples of two-step estimators in econometrics are the Heckman (1976) sample selection estimator and the Barro (1977) estimator for linear models that depend on expectations and/or corresponding residuals. Their properties have been analyzed by Newey (1984) and Pagan (1984,1986), among others. An important question for two-step estimators is whether the estimation of the first step affects the asymptotic variance of the second, and if so, what effect does the first step have. Ignoring the first step can lead to inconsistent standard error estimates, and hence confidence intervals that are not even asymptotically valid. This section develops a simple condition for whether the first step affects the second, which is that an effect is present if and only if consistency of the first-step estimator affects consistency of the second-step estimator. This condition is useful because one can often see by inspection whether first-step inconsistency leads to the secondstep inconsistency. This section also describes conditions for ignoring the first step to lead to either an underestimate or an overestimate of the standard errors. When the variance of the second step is affected by the estimation in the first step, asymptotically valid standard errors for the second step require a correction for the first-step estimation. This section derives consistent standard error estimators by applying the general GMM formula. The results are illustrated by a sample selection model. The efficiency results of Section 5 can also be applied, to characterize efficient members of some class of two-step estimators. For brevity these results are given in Newey (1993) rather than here.
6.1.
Two-step
estimators
as joint GMM
estimators
The class of GMM estimators is sufficiently general to include two-step estimators where moment functions from the first step and the second step can be “stacked” to form a vector of moment conditions. Theorem 3.4 can then be applied to specify regularity conditions for asymptotic normality, and the conclusion of Theorem 3.4 will provide the asymptotic variance, which can then be analyzed to derive the results described above. Previous results can also be used to show consistency, which is an assumption for the asymptotic normality results, but to focus attention on the most interesting features of two-step estimators, consistency will just be assumed in this section.
Ch. 36: Large Sample Estimation and Hypothesis
Testing
A general type of estimator 8 that has as special cases most examples is one that, with probability approaching one, solves an equation
n-
’
i$l
dzi,
8,
2117
of interest
(6.1)
y*)= O,
where g(z,B,y) is a vector of functions with the same dimension as 0 and y*is a first-step estimator. This equation is exactly the same as eq. (5.12), but here the purpose is analyzing the asymptotic distribution of gin general rather than specifying regularity conditions for $ to have no effect. The estimator can be treated as part of a joint GMM estimator if y^also satisfies a moment condition of the form, with probability approaching one, n-l
i
m(z,,y)=O,
(6.2)
i=l
where m(z,y) is a vector with the same dimension as y. If g(z, 0,~) and m(z,r) are “stacked” to form J(z, 8, y) = [m(z, O)‘,g(z, 8, y)‘]‘, then eqs. (6.1) and (6.2) are simply the two components of the joint moment equation n-i C;= 1 g(zi, 8,y*) = 0.Thus, the two-step estimator from eq. (6.1) can be viewed as a GMM estimator. An interesting example of a two-step estimator that fits into this framework is Heckman’s (1976) sample selection estimator. Sample selection example: In this example the first step +$is a probit estimator with regressors x. The second step is least squares regression in the subsample where the probit-dependent variable is one, i.e. in the selected sample, with regressors given by w and i(x’y^) for n(o) = ~(U)/@(U). Let d be the probit-dependent variable, that is equal to either zero or one. This estimator is useful when y is only observed if d = 1, e.g. where y is wages and d is labor force participation. The idea is that joint normality of the regression y = w’/& + u and the probit equation leads to E[yl w, d = 1, x] = w’p,, + cc,A(x’y,), where a, is nonzero if the probit- and regression-dependent variables are not independent. Thus, %(x’cr,) can be thought of as an additional regressor that corrects for the endogenous subsample. This two-step estimator will satisfy eqs. (6.1) and (6.2) for
Y(4 8,Y)= d m(z,y)
=
[
A(&1 CY-w'B-~wr)l~
Il(x’y)a=-‘( -x’y)x[d-
@(x’y)],
(6.3)
where 8 = (/Y, a)‘. Then eq. (6.1) becomes the first-order condition for least squares on the selected sample and eq. (6.2) the first-order condition for probit.
W.K. Newley and D. McFadden
2178
Regularity conditions for asymptotic normality can be formulated by applying the asymptotic normality result for GMM, i.e. Theorem 3.4, to the stacked vector of moment conditions. Also, the conclusion of Theorem 3.4 and partitioned inversion can then be used to calculate the asymptotic variance of 8, as in the following result. Let
G, = ECV,g(z,‘&>YO)I> Y(Z)= dz, &, Yoh
G, = W,dZ> Q,>ro)l, M = ECV,mk ~o)l, Theorem
I,@) = - M
‘m(z, y,,).
(6.4)
6.1
Ifeqs. (6.1) and (6.2) are satisfied with probability approaching one, 8% 8,, y*3 ye, and g(z, 8, y) satisfies conditions (i)-(v) of Theorem 3.4, then 8 and 9 are asymptotically normal
and $(&
0,) 4
N(0, V) where
I/ = G; ‘EC {g(z) + G,$(z)}(g(z)
+
G,WJ’1G,“. Proof
By eqs. (6.1) and (6.2), with probability approaching one (8, y*)is a GMM estimator with moment function g”(z,_B, y) = [m(z,y)‘,g(z, e,y)‘]’ and I? equal to an identity the asymptotic variance of the estimator is matrix. By (~?‘1@‘6’= G-‘, (W‘Z(‘IE[#(z, do, y&(z, 8,, y,)‘]zz;(~~zz1)- l = CT-‘E[ij(z, 8,, y&(z, o,, yJ]G- l’. Also, the expected Jacobian matrix and its inverse are given by
(6.5) that the first row of G- ’ is G; ’ [I, - GYM - ‘1 and that [I, - G,M- ‘1 x variance of 8, which is the upper left block Q.E.D. of the joint variance matrix, follows by partitioned matrix multiplication. Noting
g(z, BO,yO) = g(z) + G&(z), the asymptotic
An alternative
approach
to deriving
the asymptotic
distribution
of two-step
esti-
mators is to work directly from eq. (6. l), expanding in 6’to solve for &(e^ - 6,) and then expanding the result around the true yO. To describe this approach, first note that 9 is an asymptotically linear estimator with influence function $(z) = - M- ‘m(zi, ye), where fi(y* - yO) = Cr= 1 $(zi)/$ + op(1). Then left-hand side of eq. (6.1) around B0 and solving gives:
Jj2(8-8,)= -
a-1 t
[ =-[& t i=l
i=l
Vog(z.
1)
@) ?
1
-l iFl n
1-l
V,g(z,,8,y^)
Stzi,
eO,Y”V&
expanding
the
2179
Ch. 36: Large Sample Estimation and Hypothesis Testing
x = -
ii$l
g(zi)l&
+
[,-l
i, i=l
vyCl(zi,
V,]\;;(9 - YOJ]
eO,
GB1 t {g(zi)+ Gyti(zJ}lJn + up, i=l
where (? and 7 are mean values and the y^and the mean values and the conclusion by applying the central limit theorem to One advantage of this approach is
third equality follows by convergence of of Lemma 2.4. The conclusion then follows the term following the last equality. that it only uses the influence function
representation &($ - ye) = x1= 1 tj(z,)/& + o,(l) for 9, and not the GMM formula in eq. (6.2). This generalization is useful when y*is not a GMM estimator. The GMM approach has been adopted here because it leads to straightforward primitive conditions, while an influence representation for y*is not a very primitive condition. Also the GMM approach can be generalized to allow y*to be a two-step, or even multistep, estimator by stacking moment conditions for estimators that affect 3 with the moment conditions for 0 and y.
6.2.
The efect
ofjrst-step
estimation
on second-step
standard
errors
One important feature of two-step estimators is that ignoring the first step in calculating standard errors can lead to inconsistent standard errors for the second step. The asymptotic variance for the estimator solving eq. (6.1) with y*= yO, i.e. the asymptotic variance ignoring the presence of y*in the first stage, is G; ’ E[g(z)g(z)‘]G; l’. In general, this matrix differs from the asymptotic variance given in the conclusion of Theorem 6.1, because it does not account for the presence of the first-step estimators. Ignoring the first step will be valid if G, = 0. Also, if G, # 0, then ignoring the first step will generally be invalid, leading to an incorrect asymptotic variance formula, because nonzero G, means that, except for unusual cases, E[g(z)g(z)‘] will not equal E[ (g(z) + G&(z)} {g(z) + G&(z)}‘]. Thus, the condition for estimation of the first step to have no effect on the second-step asymptotic variance is G, = 0. A nonzero G, can be interpreted as meaning that inconsistency in the first-step estimator leads to inconsistency in the second-step estimator. This interpretation is useful, because it gives a comparatively simple criterion for determining if first-stage estimation has to be accounted for. To derive this interpretation, consider the solution 8(y) to E[g(z, B(y), y)] = 0. Because 8 satisfies the sample version of this condition, B(y) should be the probability limit of the second-step estimator when J? converges to y (under appropriate regularity conditions, such as those of Section 2). Assuming differentiation inside the expectation is allowed, the implicit function theorem gives V$(y,)
= - G; ‘Gy.
(6.7)
W.K. Newey and D. McFadden
2180
By nonsingularity of G,, the necessary and sufficient condition for G, = 0 is that V,H(yJ = 0. Since H(y,) = H,, the condition that V,B(y,J = 0 is a local, first-order condition that inconsistency in y*does not affect consistency of 8. The following result adds regularity conditions for this first-order condition to be interpreted as a consistency condition. Theorem 6.2 Suppose that the conditions of Theorem 6.1 are satisfied and g(z, 0, y) satisfies the conditions of Lemma 2.4 for the parameter vector (H’,y’). If &A 8, even when ‘j-y # yO, then G, = 0. Also suppose that E[V,g(z, 8,, y)] has constant rank on a neighborhood of yO. If for any neighborhood of y0 there is y in that neighborhood such that 8 does not converge in probability to H, when $ L y, then G, # 0. Proof By Lemma 2.4, 8 3 8, and y*3 y imply that Cy= r g(zi, 8, y^)/n -% E[g(z, 8,, y)]. The sample moment conditions (6.1) thus imply E[g(z, BO,y)] = 0. Differentiating this identity with respect to y at y = y0 gives G, = 0.41 To show the second conclusion, let H(y) denote the limit of e when 9 L y. By the previous argument, E[g(z, 8(y), y)] = 0. Also, by the implicit function theorem 0(y) is continuous at yO, with @ye) = BO.By the conditions of Theorem 6.1, G&8, y) = E[V,g(z, 0, y)] is continuous in a neighborhood of B0 and yO, and so will be nonsingular on a small enough neighborhood by G, nonsingular. Consider a small enough convex neighborhood where this nonsingularity condition holds and E[V,g(z, 8,, y)] has constant rank. A mean-value expansion gives E[g(z, 8,, ?)I.= E[g(z, B(y), y)] + G,(& y)[e, - 0(y)] ~0. Another expansion then gives E[g(z, Be, y)] = E[V,g(z, O,, -$](y - y,,) # 0, implying E[V,g(z, do, v)] # 0, and hence G, # 0 (by the derivative having constant rank). Q.E.D. This results states that, under certain regularity conditions, the first-step estimator affects second-step standard errors, i.e. G, # 0, if and only if inconsistency in the first step leads to inconsistency in the second step. The sample selection estimator provides an example of how this criterion can be applied. Sample selection continued: The second-step estimator is a regression where some of the regressors depend on y. In general, including the wrong regressors leads to inconsistency, so that, by Theorem 6.2, the second-step standard errors will be affected by the first step. One special case where the estimator will still be consistent is if q, = 0, because including a regressor that does not belong does not affect consistency. Thus, by Theorem 6.2, no adjustment is needed (i.e. G, = 0) if c(~ = 0. This result is useful for constructing tests of whether these regressors belong, because 41Differentiation
inside the expectation
is allowed
by Lemma
3.6.
Ch. 36: Large Sample Estimation and Hypothesis
2181
Testing
it means that under the null hypothesis the test that ignores the first stage will have asymptotically correct size. These results can be confirmed by calculating
where n,(o) = di(v)/dv. a, = 0.
By inspection
this matrix is generally
nonzero,
but is zero if
This criterion can also be applied to subsets of the second-step coefficients. Let S denote a selection matrix such that SA is a matrix of rows of A, so that Se is a subvector of the second-step coefficients. Then the asymptotic variance of Se is SC, ’ E[ {g(z) + G&(z)} {g(z) + G,$(z)}‘]G; ‘S’, while the asymptotic variance that ignores the first step is SC; ‘E[g(z)g(z)‘]G; 1S’. The general condition for equality of these two matrices is 0= -
SC,' G, = SV,B(y,) = V,[SB(y,)],
where the second equality follows statement that asymptotic variance only if consistency of the first step could be made precise by modifying is not given here.
(6.8)
by eq. (6.7). This is a first-order version of the of Skis affected by the first-step estimator if and affects consistency of the second. This condition Theorem 6.2, but for simplicity this modification
Sample selection continued: As is well known, if the correct and incorrect regressors are independent of the other regressors then including the wrong regressor only affects consistency of the coefficient of the constant. Thus, the second-step standard errors of the coefficients of nonconstant variables in w will not be affected by the first-step estimation if w and x are independent. One can also derive conditions for the correct asymptotic variance to be larger or smaller than the one that ignores the first step. A condition for the correct asymptotic variance to be larger, given in Newey (1984), is that the first- and second-step moment conditions are uncorrelated, i.e.
Gdz, &, xJm(z,YJI = 0.
(6.9)
In this case E[g(z)$(z)‘]
= 0, so the correct
G; ‘~,WW$WlG;~, the one G; ‘E[g(z)g(z)‘]G;
“2which is larger, in the positive semi-definite
variance
” that ignores first-step
is G; ’ E[g(z)g(z)‘]G; I’ + sense, than estimation.
W.K. Newey and D. McFadden
2182
continued: In this example, E[y - w’fiO - cr,i(x’y,)l w, d = 1, x] = 0, which implies (6.9). Thus, the standard error formula that ignores the first-step estimation will understate the asymptotic standard error. Sump/e selection
A condition for the correct asymptotic variance to be smaller ignores the first step, given by Pierce (1982), is that
than the one that
(6.10)
m(z) = m(z, yO) = V, ln f(z I Q,, yd
In this case, the identities Sm(z, ~)f(zl O,, y) dz = 0 and lg(z, 0,, y)f(z Id,, y) dz = 0 can be differentiated to obtain the generalized information matrix equalities M = - E[s(z)s(z)‘] and G, = - E[g(z)s(z)‘]. It then follows that G, = - E[g(z)m(z)‘] = variance is - J%d4w’l I~c$wwl> - l> so that the correct asymptotic
G; 1~Cg(4g(4’lG;’ - G; 'ECgWWl { ~C+WW’l> - ‘-f%WsM’lG; “. This variance is smaller, in the positive semi-definite sense, than the one that ignores the first step. Equation (6.10) is a useful condition, because it implies that conservative asymptotic confidence intervals can be constructed by ignoring the first stage. Unfortunately, the cases where it is satisfied are somewhat rare. A necessary condition for eq. (6.10) is that the information matrix for Q and y be block diagonal, because eq. (6.10) implies that the asymptotic variance of y*is {E[m(z)m(z)‘]} - ‘, which is only obtainable when the information matrix is block diagonal. Consequently, if g(z, 8, y) were the score for 8, then G, = 0 by the information matrix equality, and hence estimation of 9 would have no effect on the second-stage variance. Thus, eq. (6.10) only leads to a lowering of the variance when g(z, 8, y) is not the score, i.e. 8 is not an efficient estimator. One case where eq. (6.10) holds is if there is a factorization of the likelihood f(z ItI, y) = fl(z IB)f,(z Iy) and y^is the MLE of y. In particular, if fi (z 10)is a conditional likelihood and f,(zl y) = fi(x 17) a marginal likelihood of variables x, i.e. x are ancillary to 8, then eq. (6.8) is satisfied when y*is an efficient estimator of yO.
6.3.
Consistent
asymptotic
variance
estimation
for two-step
estimators
The interpretation of a two-step estimator as a joint GMM estimator can be used to construct a consistent estimator of the asymptotic variance when G, # 0, by applying the general GMM formula. The Jacobian terms can be estimated by sample Jacobians, i.e. as
60~n-l t v,g(ziy8,9), Gy= 6’ i=l
The second-moment
t V,g(Z,,BJ),ii = n-l i V,m(z,,y*). i=l
matrix
can be estimated
i=l
by a sample second-moment
matrix
Ch. 36: Larye Sample Estimation
and Hypothesis
Testing
2183
di = y(zi, 8, y*)and Ai = m(z,, f), of the form fi= n- ‘x1= ,(& &i)‘(& &I). An estimator of the joint asymptotic variance of 8 and 7 is then given by
An estimator of the asymptotic variance of the second step 8 can be extracted from the upper left block of this matrix. A convenient expression, corresponding to that in Theorem 6.1, can be obtained by letting $i = - & l&z,, so that the upper left block of ? is
(6.11)
If the moment functions are uncorrelated as in eq. (6.9) so that the first-step estimation increases the second-step variance, then for ?? = n- ‘Cy= 1JitJ:y an asymptotic variance estimator for 8 is
(6.12) This estimator is quite convenient, because most of its pieces can be recovered from standard output of computer programs. The first of the two terms being summed is a variance estimate that ignores the first step, as often provided by computer output (possibly in a different form than here). An estimated variance FYis also often provided by standard output from the first step. In many cases 6;’ can also be recovered from the first step. Thus, often the only part of this variance estimator requiring application-specific calculation is eY. This simplification is only possible under eq. (6.9). If the first- and second-step moment conditions are correlated then one will need the individual observations Gi, in order to properly account for the covariance between the first- and second-step moments. A consistency result for these asymptotic variance estimators can be obtained by applying the results of Section 4 to these joint moment conditions. It will suffice to assume that the joint moment vector g(z, 0, y) = [m(z, y)‘, y(z, 0, r)‘]’ satisfies the conditions of Theorem 4.5. Because it is such a direct application of previous results a formal statement is not given here. In some cases it may be possible to simplify PO by using restrictions on the form of Jacobians and variance matrices that are implied by a model. The use of such restrictions in the general formula can be illustrated by deriving a consistent asymptotic variance estimator for the example.
W.K. Newey and D. McFadden
2184
Sumple selection example continued: Let Wi = di[wI, /z(xIyo)]’ and %i = di[wI, i(.$j)]‘. Note that by the residual having conditional mean zero given w, d = 1, and x, it is the case that G, = - E[diWiWJ and G, = - a,E[di~,,(xlyo)WiX11, where terms involving second derivatives have dropped out by the residual having conditional mean zero. Estimates of these matrices are given by ee = - x1= 1ki/iA~/~ and G, = -oily= II.,(x~j)ii/,x~/n. Applying eq. (6.12) to this case, for ii = yi - W#‘, 3i)‘, then gives
(6.13)
where pY is a probit estimator of the asymp_totic_variance of &(y - yO), e.g. as provided by a canned computer program, and 17~ G; ‘Gy is the matrix of coefficients from a multivariate regression of c?%,(x~y*)xi on Wi. This estimator is the sum of the White (1980) variance matrix for least squares and a correction term for the firststage estimation.42 It will be a consistent estimator of the asymptotic variance of JII@ - do).43
7.
Asymptotic
normality with nonsmooth objective functions
The previous asymptotic normality results for MLE and GMM require that the log-likelihood be twice differentiable and that the moment functions be once differentiable. There are many examples of estimators where these functions are not that smooth. These include Koenker and Bassett (1978), Powell’s (1984, 1986) censored least absolute deviations and symmetrically trimmed estimators, Newey and Powell’s (1987) asymmetric least squares estimator, and the simulated moment estimators, of Pakes (1986) and McFadden (1989). Therefore, it is important to have asymptotic normality results that allow for nonsmooth objective functions. Asymptotic normality results for nonsmooth functions were developed by Daniels (1961), Huber (1967), Pollard (1985), and Pakes and Pollard (1989). The basic insight of these papers is that smoothness of the objective function can be replaced by smoothness of the limit if certain remainder terms are small. This insight is useful because the limiting objective functions are often expectations that are smoother than their sample counterparts. 4*Contrary to a statement given in Amemiya (1985), the correction term is needed here. 43The normalization by the total sample size means that one can obtain asymptotic confidence intervals as described in Section 1, with the n given there equal to the total sample size. This procedure is equivalent to ignoring the n divisor in Section 1and dropping the n from the probit asymptotic variance estimator (as is usually done in canned programs) and from the lead term in eq. (6.13).
2185
Ch. 36: Large Sample Estimation and Hypothesis Testing
To illustrate how this approach works it is useful to give a heuristic The basic idea is the approximation
&@)- e^,&)r &e - &J + Qo(4 E&e
description.
Qo(4J
- (3,) + (0 - O,)H(B - 8,)/2, (7.1)
where 6, is a derivative, or approximate derivative, of Q,,(e) at B,,, H = V,,Q,(B,), and the second approximate equality uses the first-order condition V,QO(e,) = 0 in a second-order expansion of QO(0). This is an approximation of Q,(e) by a quadratic function. Assuming that the approximation error is of the right order, the maximum of the approximation should be close to the true maximum, and the maximum of the approxi_mation is 8 = B0 - H- ‘fi,,. This random yariable will be asymptotically normal if D, is, so that asymptotic normality of 0 will follow from asymptotic normality of its approximate value 8.
7.1.
The basic results
In order to make the previous argument precise the approximation error in eq. (7.1) has to be small enough. Indeed, the reason that eq. (7.1) is used, rather than some other expansion, is because it leads to approximation errors of just the right size. Suppose for discussion Purposes that 6,, = V&(6,), where the derivative exists with probability one. Then Q,(e) - Q,(e,) - 6;(0 - 0,) goes to zero faster than 118- doI/ does, by the definition of a derivative. Similarly, QO(e) - QO(O,) goes to zero faster than ((8 - 0, (([since V,Q,(B,) = 01. Also, assuming ded in probability for each 8, as would typically
that J%@,,(e) - Qo(@] is bounbe the case when Q,(e) is made
up of sample averages, and noting that $0, bounded in probability asymptotic normality, it follows that the remainder term,
k(e)
= JtrcOm
- O,w
- 6,te - 0,) - mv3
- ade,w
follows by
Ii 8 - e. II,
(7.2)
is bounded in probability for each 0. Then, the combination of these two properties suggests that l?,(e) goes to zero as the sample size grows and 8 goes to BO,a stochastic equicontinuity property. If so, then the remainder term in eq. (7.1) will be of order oP( I/0 - 8, I//& + II8 - 8, /I*). The next result shows that a slightly weaker condition is sufficient for the approximation in eq. (7.1) to lead to asymptotic normality of 8. Theorem
7.1
Suppose that Q.(8) 2 supti&(@ - o&r- ‘), 8 A 8,, and (i) QO(0) is maximized on @ at 8,; (ii) 8, is an interior point of 0, (iii) Qe(0) is twice differentiable at 8,
W.K. Newey
2186
with nonsingular
second
s~p~~~-~,,,,,~~R,(e)/[l
derivative
+ JnllO
H; (iv) &fi
- ~,I111 LO.
Then
5
N(O,Q;
&(e-
and D. McFadden
(v) for any 6, +O,
&J ~N(O,H-‘~H-‘).
The
proof of this result is given in Section 7.4. This result is essentially a version of Theorem 2 of Pollard (1985) that applies to any objective function rather than just a sample average, with an analogous method of proof. The key remainder condition is assumption (v), which is referred to by Pollard as stochastic diflerentiability. It is slightly weaker than k,(O) converging to zero, because of the presence of the denominator term (1 + & /I8 - 8, II)- ‘, which is similar to a term Huber (1967) used. In several cases the presence of this denominator term is quite useful, because it leads to a weaker condition on the remainder without affecting the conclusion. Although assumption (v) is quite complicated, primitive conditions for it are available, as further discussed below. The other conditions are more straightforward._Consistency can be shown using Theorem 2.1, or the generalization that allows for 8 to be an approximate maximum, as suggested in the text following Theorem 2.1. Assumptions (ii) and (iii) are quite primitive, although verifying assumption (iii) may require substantial detailed work. Assumption (iv) will follow from a central limit theorem in the usual case where 6, is equal to a sample average. There are several examples of GMM estimators in econometrics where the moments are not continuous in the parameters, including the simulated moment estimators of Pakes (1986) and McFadden (1989). For these estimators it is useful to have more specific conditions than those given in Theorem 7.1. One way such conditions can be formulated is in an asymptotic normality result for minimum distance estimators where g,(e) is allowed to be discontinuous. The following is such a result. Theorem
7.2
Suppose that $,,(@I?o.(@ < info,0Q.(8)‘i@&(8) + o,(n-‘), 8-% 8,, and I? L W, W is positive semi-definite, where there is go(e) such that (i) gO(O,) = 0; (ii) g,,(d) is differentiable at B0 with derivative G such that G’WG is nonsingular; (iii) 8, is an interior point of 0; (iv) +g,(e,) $,(e,)-g&III/[1 WZWG
L
+fiIIe-e,II]
NO, z3; (v) for any 6, + LO.
Then
0, supllO- OolI $6,&
II8,u4 -
,/“(k@wV[O,(G’WG)-‘G’R
(G’WG)-‘1.
The proof is given in Section 7.4. For the case where Q,(e) has the same number of elements as 8, this result is similar to Huber’s (1967), and in the general case is like Pakes and Pollard’s (1989), although the method of proof is different than either of these papers’. The conditions of this result are similar to those for Theorem 7.1. The function go(e) should be thought of as the limit of d,(e), as in Section 3. Most of the conditions are straightforward to interpret, except for assumption (v). This assumption is a “stochastic equicontinuity” assumption analogous to the condition (v) of Theorem 7.1. Stochastic equicontinuity is the appropriate term here because when go(e) is the pointwise
limit of $,,(e), i.e. d,(e) Ago(B)
for all 0, then for all
Ch. 36: Laryr Sample Estimation and Hypothesis
Testing
2187
8 # 8,, & 11Q,(O) - &,(8,) - go(H) II/[ 1 + Ji )I0 - B. II] AO. Thus, condition (v) can be thought of as an additional requirement that this convergence be uniform over any shrinking neighborhood of BO.As discussed in Section 2, stochastic equicontinuity is an essential condition for uniform convergence. Theorem 7.2 is a special case of Theorem 7.1, in the sense that the proof proceeds by showing that the conditions of Theorem 7.1 are satisfied. Thus, in the nonsmooth case, asymptotic normality for minimum distance is a special case of asymptotic normality for an extremum estimator, in contrast to the results of Section 3. This relationship is the natural one when the conditions are sufficiently weak, because a minimum distance estimator is a special case of a general extremum estimator. For some extremum estimators where V,&,(0) exists with probability one it is possible to, use Theorem 7.2 to show asymptotic normality, by setting i,,(e) equal to V,Q,(@. An example is censored least absolute deviations, where V,&(0) = n - l C;= 1xil(xj8 > 0)[ 1 - 2.l(y < x’e)]. However, when this is done there is an additional condition that has to be checked, namely that )/V,Q,(0) )/* d
inf,, 8 11 V,&(e) II2 + o,(n- ‘), for which it suffices to show that J&V&,(@ L 0. This is an “asymptotic first-order condition” for nonsmooth objective functions that generally has to be verified by direct calculations. Theorem 7.1 does not take this assumption to be one of its hypotheses, so that the task of checking the asymptotic first-order condition can be bypassed by working directly with the extremum estimator as in Theorem 7.1. In terms of the literature, this means that Huber’s (1967) asymptotic first-order condition can be bypassed by working directly with the extremum formulation of the estimator, as in Pollard (1985). The cost of doing this is that the remainder in condition (v) of Theorem 7.1 tends to be more complicated than the remainder in condition (v) of Theorem 7.2, making that regularity condition more difficult to check. The most complicated regularity condition in Theorems 7.1 and 7.2 is assumption (v). This condition is difficult to check in the form given, but there are more primitive conditions available. In particular, for Q,(0) = n ‘Cy= 1 q(z,, 8), where the objective function is a sample average, Pollard (1985) has given primitive conditions for stochastic differentiability. Also, for GMM where J,(0) = C;= i g(z, 0)/n and go(B) = E[g(z, 0)], primitive conditions for stochastic equicontinuity are given in Andrews’ (1994) chapter of this handbook. Andrews (1994) actually gives conditions for a stronger result, that s~p,,~_~,,, da./% )Id,(0) - .&(0,) - go(e) 1)L 0, i.e. for (v) of Theorem 7.2 without the denominator term. The conditions described in Pollard (1985) and Andrews (1994) allow for very weak conditions on g(z, 0), e.g. it can even be discontinuous in 8. Because there is a wide variety of such conditions, we do not attempt to describe them here, but instead refer the reader to Pollard (1985) and Andrews (1994). There is a primitive condition for stochastic equicontinuity that is not covered in these other papers, that allows for g(z, 8) to be Lipschitz at 0, and differentiable with probability one, rather than continuously differentiable. This condition is simple but has a number of applications, as we discuss next.
W.K. Newey and D. McFadden
2188
7.2.
Stochastic
equicontinuity
for Lipschitz
moment,functions
The following result gives a primitive condition for the stochastic equicontinuity hypothesis of Theorem 7.2 for GMM, where Q,(e) = nP ‘Cy= 1g(Zi, 0) and go(O)=
ECg(z, @I. Theorem
7.3
Suppose that E[g(z, O,)] = 0 and there are d(z) and E > 0 such that with probability r(z,B)]
one,
IIdz, Q)- & 0,) - W(fl - 44 II/IIQ- 0, I/+ 0 as Q+ oo,~C~W,,,-,~,, Ccx
r(z, d) =
< a,
Theorem
and n- ‘Cr= 1d(zi) LE[d(z)]. 7.2 are satisfied for G = E[d (z)].
Then
assumptions
(ii) and
(v) of
Proof
one r(z, E) + 0 as For any E > 0, let r(z,E) = sup, o-00, BEIIr(z, 0) 11.With probability E+ 0, so by the dominated convergence theorem, E[r(z, E)] + 0 as E+ 0. Then for 0 + 0, and s = IIQ- 4, II, IIad@- sd4) - (30 - 0,) I/= IIEC&, 0)- g(z,0,) - 44 x (0 - O,)]11 d E[r(z, E)] II0 - 0, /I+O, giving assumption (iii). For assumption (v), note that for all (5’with /I8 - 0, I/ < 6,, by the definition
of r(z, E) and the Markov
inequality,
II4,(@- &(Ho)- go(@I//Cl + fi II0 - 0, II1 d Jn CIICY=1{d(zi) - EC&)1 } x (0- f&)/nII + {C1=Ir(zi, Wn + ECr(z, S.)l > II0 - 00IIl/(1 + Jn II0 - 00II1d IICy=1 Q.E.D. j A(zJ - J%A(z)lj/n II + ~,@Cr(z,%)I) JS 0. Jn
The condition on r(z, Cl) in this result was formulated by Hansen et al. (1992). The requirement that r(z, 0) --f 0 as 8 + B0 means that, with probability one, g(z, 19)is differentiable with derivative A(z) at BO.The dominance condition further restricts this remainder to be well behaved uniformly near the true parameter. This uniformity property requires that g(z, e) be Lipschitz at B0 with an integrable Lipschitz constant.44 A useful aspect of this result is that the hypotheses only require that Cr= 1A(zi) 3 E[A(z)], and place no other restriction on the dependence of the observations. This result will be quite useful in the time series context, as it is used in Hansen et al. (1992). Another useful feature is that the conclusion includes differentiability of go(e) at B,, a “bonus” resulting from the dominance condition on the remainder. The conditions of Theorem 7.3 are strictly weaker than the requirement of Section 3 that g(z, 0) be continuously differentiable in a neighborhood of B0 with derivative that is dominated by an integrable function, as can be shown in a straightforward way. An example of a function that satisfies Theorem 7.3, but not the stronger continuous differentiability condition, is the moment conditions corresponding to Huber’s (1964) robust location estimator.
44For44
= SUPI~~~~,,~ < &tiz, 01, the triangle and Cauchy-Schwarz
Ill&) II + &)I II0 - 6, Il.
inequalities
imply
1)~(z,o)- g(~,0,) I/<
2189
Ch. 36: Largr Sample Estimution and Hypothesis Testing Huher’s
robust locution estimator: The first-order conditions for this estimator are n~‘~~~,p(yi~~)=Oforp(c)=-l(cd-l)+l(-l~~~l)~+1(~31).Thisestimator will be consistent for B0 where y is symmetrically distributed around 8,. The motivation for this estimator is that its first-order condition is a bounded, continuous function of the data, giving it a certain robustness property; see Huber (1964). This estimator is a GMM estimator with g(z, 0) = p(y - 0). The function p(c) is differentiable everywhere except at - 1 or 1, with derivative P,(E) = l( - 1 < E < 1). Let d(z)= -p,(y-U,).ThenforE=y-H,and6=H,-U,
r(z, 0) = Ig(z, Q)- dz, RJ)- d (z)(fl~ 4J l/l Q- b I
= IP(E+ 4 - PM - PEWI/ IdI =~[-1(E+6<-1)+1(&~-1)]+[1(E+~>1)-1(E~1)] +[l(-l<E+6<1)-1(-1<E
1,
r(z,~)=~1(-1-~<~d-1)+1(1-6d~<1)+[1(-1-~6~~-1) -l(l
-6<e<
l)](E+fi)I/lfil
~1(-6~E+1~0)(~+~E+1~)//~~+1(-~6~-1<0)(/&-1~+6)/~6~ 62[1(-6<E+
1
l(-66E-
1<0)]<2.
Applying an analogous argument for negative - 1 d 6 < 0 gives r(z,O) < 2[l(lc-lId/6~)+l(le+lI~~6~)]d4. Therefore, if Prob(&=l)=O and Prob(& = - 1) = 0 then r(z, 0) + 0 with probability one as 0 -+ 0, (i.e. as 6 -+ 0). Also, r(z, fl) < 4. Thus, the conditions of Theorem 7.3 are satisfied. Other examples of estimators that satisfy these conditions are the asymmetric least squares estimator of Newey and Powell (1987) and the symmetrically trimmed estimators for censored Tobit models of Powell (1986) and Honori: (1992). All of these examples are interesting, and illustrate the usefulness of Theorem 7.3.
7.3.
Asymptotic
variance
estimation
Just as in the smooth case the asymptotic variance of extremum and minimum distance estimators contain derivative and variance terms. In the smooth case the derivative terms were easy to estimate, using derivatives of the objective functions. In the nonsmooth case these estimates are no longer available, so alternatives must be found. One alternative is numerical derivatives. For the general extremum estimator of Theorem 7.1, the matrix H can be
W.K. Newey and D. McFadden
2190
estimated by a second-order numerical derivative of the objective function. Let e, denote the ith unit vector, E, a small positive constant that depends on the sample size, and fi the matrix with i, jth element fiij = [Q(o^+ eis, + ejs,) - Q(@- eis, + ejs,) - Q(@+ eie, - eje,) + Q(B- eis, - ejsn)]/4$. Under certain conditions on E,,, the hypotheses of Theorem 7.1 will suffice for consistency of G for the H in the asymptotic variance of Theorem 7.1. For a minimum distance estimator a numerical derivative estimator G of G hasjth column
Gj = [i(B + ejc,)
-
d(@ - eje,)]/2s,.
This estimator will be consistent result shows consistency:
under the conditions
of Theorem
7.2. The following
Theorem 7.4 of Theorem 7.1 are satisfied Suppose that E, + 0 and E,,& + co. If the conditions then fi AH. Also, if the conditions of Theorem 7.2 are satisfied then G 5 G. This result is proved in Section 7.4. Similar results have been given by McFadden (1989), Newey (1990), and Pakes and Pollard (1989). A practical problem for both of these estimators is the degree of difference (i.e. the magnitude of s,) used to form the numerical derivatives. Our specification of the same E, for each component is only good if 6 has been scaled so that its components have similar magnitude. Alternatively, different E, could be used for different components, according to their scale. Choosing the size of&,, is a difficult problem, although analogies with the choice of bandwidth for nonparametric regression, as discussed in the chapter by Hardle and Linton (1994), might be useful. One possibility is to graph some component as a function of E, and then choose E, small, but not in a region where the function is very choppy. Also, it might be possible to estimate variance and bias terms, and choose E, to balance them, although this is beyond the scope of this chapter. In specific cases it may be possible to construct estimators that do not involve numerical differentiation. For example, in the smooth case we know that a numerical derivative can be replaced by analytical derivatives. A similar replacement is often possible under the conditions of Theorem 7.3. In many cases where Theorem 7.3 applies, g(z,@ will often be differentiable with probability one with a derivative V,g(z, 0) that is continuou_s in 8 with probabil$y one and dominated by an integrable function. Consistency of G = n- ‘Cy= 1V,g(z, 0) will then follow from Lemma 4.3. For example, it is straightforward to show that this reasoning applies to the Huber locationestimator,withV,g(z,O)=-1(-1~y-~
Ch. 36: Lurye Sample Estimation
and Hypothesis
2191
Testing
Estimation of the other terms in the asymptotic variance of 8 can usually be carried out in the way described in Section 4. For example, for GMM the moment function g(z,fI) will typically be continuous in 8 with probability one and be dominated by a square integrable function, so that Lemma 4.3 will imp_ly the consistency of fi = Cr= 1 g(zi, 6)g(zi, @‘/II. Also, extremum estimators where Q,(0) = nP ‘Cr= lq(z, U), q(z, 0) will usually be differentiable almost everywhere, and Lemma 4.3 will yield consistency of the variance estimator given in eq. (4.1).
7.4.
Technicalities
Because they are long and somewhat complicated, and 7.4 are given here rather than previously. Proof of Theorem
the proofs of Theorems
7.1,7.2,
7.1
Let Q(e) = Q,(e) and Q(0) = Qo(@. First it will be proven
that $118
- 8, /I = O,(l),
i.e. that 8is “&-consistent”. By Q(0) having a local maximum at 8,, its first derivative is zero at O,, and hence Q(0) = Q(0,) + (0 - 8,)‘H(fI - (!I,)/2 + o( /I 0 - 8, II2). Also, H is negative definite by fI,, a maximum and nonsingularity of H, so that there is C > 0 and a small enough neighborhood of B0 with (t9 - 8,)‘H(B - 8,)/2 + o( 110- 8,II 2, < - C 110- 0, I/2. Therefore, by 8 A 8,, with probability approaching one (w.p.a.l), Q(8)< Q(d,) - C II& 8, II2. Choose U, so that 8~ U, w.p.a.1, so that by (v) $I&Q),I (1
<
+~IIo^-~oll)~pu).
0 d &6) - &I,)
+ o&n- ‘) = Q(8) - Q&J + 6’(&
d -C/18-8,112+
llfil( l&e,11 + Ile-e,11(1
d -CC+o,(l)]1~8-e,112+0,(n-“2)11~-~o~I
0,) + 116
@,,I1i?(6) + o&n-‘)
+Jni18-8,/1)o,(n-l’2)+Op(lZ-‘) +o,(n?).
Since C + ~~(1) is bounded away from zero w.p.a.1, it follows that /I6- 8, (I2 < O&n- 1’2)II8 - 0, II + o&n- ‘), and hence, completing the square, that [ iI&- 8, II + 0plnm”2)]2 < O&K’). Taking the square root of both sides, it follows that I lld- Boll + O,(n- 1’2)ld O,(n- ‘12), so by the triangle inequality, 11G-S,, 11
+ 1- 0,(n-1’2)l
6 0,(n-1’2).
Next, let e”= H,, - H- ‘6, and note that by construction by &-consistency
of 8, twice differentiability
2[Q(8) - Q(0,) + Q&J]
it is $-consistent.
Then
of Q(0), and (v) it follows that
= (s - &J’H(o^ - Q,) + 26’(8 - 0,) + o&n- ‘) =(8-Q,)‘H(&-&)-2(8”-8,)‘H&8,)+0&n-’).
Similarly,
2[!j(&
- f&O,) + Q(&)]
= (8 - O,,)‘H@ - e,,) + 26’(8 - 0,) + ~,(n- l) =
W.K. Nwry
2192
und D. McFadden
-(H”- H,)‘H(& H,) + o,(nP1). Then since 0” is contained within 0 w.p.a.1, 2[&8)Q^(e,) + Q(e,)] -2[&& ~ dc!IO) + Q(fI,)] > o&n- ‘), so by the last equation and the corresponding equation for 0, o,(nP’)
< (e-
H,)‘H(8-
=(H^-@H(H^-G)& Therefore,
0,) - 2(H1- 8,)‘H(8-
e,)%@-
0,)
-CIIo^-el[‘.
((Jr1(6 - 0,) ~ ( - H - ‘Jr&) -+d N(0, H
follows by - H-‘&6
e,) + (S-
jl = fi
Ij8 - 811% 0, so the conclusion
‘flH _ ‘) and the Slutzky
theorem.
Q.E.D.
Proof of Theorem 7.2
Let g(0) = d,(0) and y(8) = go(B). The proof proceeds by verifying the hypotheses of Theorem 7.1, for Q(0) = - g(H)‘Wy(0)/2, Q(0) = - 4(tI)‘f?‘4(@/2 + d^(@, and d^(B) equal to a certain function specified below. By (i) and (ii), Q(0) = - [G(e - 19,) + 4 II6’ - 41 II)I’WG(@- 41)+ o(II 0 - ‘JoII)I/2 = Q(6.J+ (0- ‘AJ’W~- 4JP + o(/IB- 0, )/2),for H = - G’ WC and Q(0,) = 0, so that Q(0) is twice differentiable at do. Also, by W positive semi-definite and G’ WC nonsingular, H is negative definite, implying that there is a neighborhood of 8, on which Q(0) has a unique maximum (of zero) at 6’= BO. Thus, hypotheses (i)-(ii) of Theorem 7.1 are satisfied. By the Slutzky theorem, 6 = - G’@$g(0,) % N(0,0) for B = G'WZ WC, so that hypothesis (v) of Theorem 7.1 is satisfied. It therefore remains to check the initial supposition and hypothesis (v) of Theorem 7.1. Let e(0) = [Q(H) - .&fI,) - g(H)]/[ 1 + $11
Let (v),
H - 8, I/1. Then
4?(e) = - 4(@‘@&@)/2 + &(6)‘$6(6)/2 + ~(8,)‘%6((8).
For
any
6, + 0,
by
~~P,,e-so,,~a,IQ1(~)-~-~(~)‘~~(~)/~~l~~,(~)s~P,,,_,“,,,,“{llE*(~)I/ ll4Qll+ O&l - 1’2))= o&n- ‘), so that by (9, o(6) 3 SU~,,~_~,,,G,,&I) - o&n-‘). Thus,
the initial supposition of Theorem 7.1 is satisfied. To check hypothesis by E*(e,)= 0, for i(0) as defined above,
&I k(e)lC1 +& r^I(4 = JG& f2(4
=
(v), note that
III9- UC3 II1Id t pj(e), II0 - 8, II + II0 - &I II2)lE*(mq~)I/[ IIH - 8” I/(1 ;t J”
hICs(@ + G(Q- 4@‘iW,)llII IIQ- 0, II(1 + &
110- fl, II)],
)I8 - 8, II)],
2193
Ch. 36: Large Sample Estimation and Hypothesis
Testing
f3(@ = n I Cd@ + 44mm)
II 0 - 0, II),
l/(1 + 4
tde) = JliIg(e)f~~(e)I/lle-eoii,
Then for 6,-O
and U = (0: j(8-0,ll
<<s,,}, sup,F,(e)6Cn.sup,l(~((B)(1211
till =0,(l),
- e,ii)ii @II iw,)ii
=WW(II~ - 4mw = 0,(u ~~P,ww ~~uPu~li~(e)l~iJTIlie-eOll)+JtIil~(eo)~l~~~ ~ll~~~,JIS~w~~l~ {SUpa~(l~e-eolo+~,(l))o,(l)=o,(l), ~~~,r*,(~~~~~~,i(li~(~~l~il~e-~oli~ll~~~ 6 ~~P,(lk7(eWile - eol12)il@ - w = 0,(l). ~~h&llwl~ = 0,(l), and sw,(e) sup,p,(e)d~sup,{0(lie
Q.E.D. Proof of Theorem 7.4 Let a be a constant vector. By the conclusion Then by hypothesis (v),
Io(e^ + &,,a)-
&4J -
Q@ + q)
of Theorem
7.1, II8 + a&,- do I( = O,,(E,).
+ Q&J I
d /I6 + aEn- e. I/ [I JW + w)l + IIfi II IIe+ a&,- 0, II1 ~qJ(~,){u +Jw+
&,a- doII)qdll&) + O&,l&))
Also, by twice differentiability
IE,~[Q(~+ &,a) -
e,)viq6
3
[2(ei
+ &,a- e,)/2 + o( I/8 + &,a - 8,1)“)] - a’Ha/21
O,)‘Ha( + lE;2(&
It then follows by the triangle fiij
of Q(0) at 8,,
Q(O,)] -a’Ha/2(
= IE; 2[(6 + &,aG jq1(f9-
= O&,Z)~
+ ej)‘H(ei +
e,yH@-
inequality
e,)l + 0,(l) = 0,(i).
that
ej) - (ei - ej)‘H(ei - ej) - (ej - eJ’H(ej - eJ]/8
= 2[eiHe, + e;Hej - ejHe, - eJHej]/8 + eiHej = eiHej = Hi,, giving the first conclusion. For the second_conclusion, it follows from hypothesis (v) of Theorem 7.2, similarly to the proof for H, that /IJ(fl + c,a) - Q(O,)- g(t?+ &,a)(I < (1 +,& I/8 + &,,a- 0, I()0&n,- li2) = O&E; ‘), and by differentiability of g(0) at B0 that /Ig(d + q,a)/c,, - Ga I/ d /I G(B - f&)/c, 11+ O(E; l I/8 + Ena- do )I) = op( 1). The second conclusion then follows by the triangle inequality. Q.E.D.
W.K. Nrwev and D. McFadden
2194
8.
Semiparametric
two-step estimators
Two-step estimators where the first step is a function rather than a finite-dimensional parameter, referred to here as semiparametric two-step estimators, are of interest in a number of econometric applications. 45 As noted in Section 5, they are useful for constructing feasible efficient estimators when there is a nuisance function present. Also, they provide estimators for certain econometric parameters of interest without restricting functional form, such as consumer surplus in an example discussed below. An interesting property of these estimators is that they can be Jnconsistent, even though the convergence rate for the first-step functions is slower than fi. This section discusses how and when this property holds, and gives regularity conditions for asymptotic normality of the second-step estimator. The regularity conditions here are somewhat more technical than those of previous sections, as required by the infinite-dimensional first step. The type of estimator to be considered here will be one that solves
n- l t g(zi, 8, 9) = 0, i=l
where f can include infinite-dimensional functions and g(z, 0, y) is some function of a data observation z, the parameters of interest 0, and a function y. This estimator is exactly like that considered in Section 6, except for the conceptual difference that y is allowed to denote a function rather than a finite-dimensional vector. Here, g(z,U,y) is a vector valued function of a function. Such things are usually referred to as functionals. Examples are useful for illustrating how semiparametric two-step estimators can be fit into this framework. V-estimators: Consider a simultaneous equations model where the residual p(z, d) is independent of the instrumental variables x. Let u(x,p) be a vector of functions of the instrumental variables and the residual p. Independence implies that of a EC~{x,pk4,)]1 = ECSajx,p(Z,e,)}dF,(P)Iwhere F,(z) is the distribution single observation. For example, if a(x,p) is multiplicatively separable, then this restriction is that the expectation of the product is the product of the expectations. This restriction can be exploited by replacing expectations with sample averages and dF(Z) with an estimator, and then solving the corresponding equation, as in
(8.2) where m(z,, z2, 0) = a[~,, p(z,, Q)] - a[~,, p(z,, O)]. This estimator 45This terminology
may not be completely
consistent
with Powell’s chapter
has the form given of this handbook.
Ch. 36: Large Sample Estimation
and Hypothesis
Testing
2195
in eq. (8.1), where y is the CDF of a single observation, y(z, 0, y) = s m(z, F,O)dy(F), and $ is the empirical distribution with y(f) = Cr=, l(zi < 5)/n. It is referred to as a V-estimator because double averages like that in eq. (8.2) are often referred to as V-statistics [Serfling (1980)]. V-statistics are related to U-statistics, which have been considered in recent econometric literature [e.g. Powell et al. (1989) and Robinson (1988b)] and are further discussed below. The general class of V-estimators were considered in Newey (1989). If a(x,p) is multiplicatively separable in x and p then these estimators just set a vector of sample covariances equal to zero. It turns out though, that the optimal u(x, p) may not be multiplicatively separable, e.g. it can include Jacobian terms, making the generalization in eq. (8.2) of some interest. Also, Honor6 and Powell (1992) have recently suggested estimators that are similar to those in equation (8.2), and given conditions that allow for lack of smoothness of m(z,, z2, H) in fl. Nonpurumetric approximate consumer surplus estimation: Suppose that the demand function as a function of price is given by h,(x) = E[qlx], where 4 is quantity demanded and x is price. The approximate consumer surplus for a price change from a to h is Ii,h,(x)dx. A nonparametric estimator can be constructed by replacing the true condttronal expectation by a nonparametric estimator. One such is a kernel estimator of the form i(x) = EYE 1q&&x - xi)/C;= 1K,(x - xi), where K,(u) = a-‘K(o/a), r is the dimension of x, K(u) is a function such that JK(u)du = 1, and 0 is a bandwidth term that is chosen by the econometrician. This estimator is a weighted average of qi, with the weight for the ith observation given by K,(x - xi)/ cjn= 1K,(x - xj). The bandwidth 0 controls the amount of local weighting and hence the variance and bias of this estimator. As 0 goes down, more weight will tend to be given to observations with Xi close to x, lowering bias_, but raising variance by giving more weight to fewer observations. Alternatively, h(x) can be interpreted as a ratio estimator, with a denominator j(x) = n-l Cr= 1K,(x - xi) that is an estimator of the density of x. These kernel estimators are further discussed in Hardle and Linton (1994). A kernel estimator of h,(x) can be used to construct a consumer surplus estimator of the form t?=isi(x)dx. This estimator takes the form given in eq. (8.1), for y = (r,, y2) where yr(x) is a density for x, yz(x) is the product of a density for x and a conditional expectation of y given x, g(z, 8, y) = Jf:[Y2(x)IY1(x)]dx - 6, y,(x) = n- ‘Cy= 1 K,(x - xi) and f2(x) = K’ x1= 1q,KJx - xi). This particular specification, where y consists separately of the numerator and denominator of i(x), is convenient in the analysis to follow.
In both ofthese examples there is some flexibility in the formulation of the estimator as a solution to eq. (8.1). For V-estimators, one could integrate over the first argument in a[~,, p(z,, 0)] rather than the second. In the consumer surplus example, one could set y = h rather than equal to the separate numerator and denominator terms. This flexibility is useful, because it allows the estimator to be set up in a way
W.K. Newey and D. McFadden
2196
that is most convenient for verifying the regularity conditions for asymptotic normality. This section will focus on conditions for asymptotic normality, taking consistency as given, similarly to Section 6. Consistency can often be shown by applying Theorem 2.1 directly, e.g. with uniform convergence resulting from application of Lemma 2.4. Also, when y(z, 0, y) is linear in 0, as in the consumer surplus example, then consistency is not needed for the asymptotic normality arguments.
8.1.
Asymptotic
To motivate
and consistent
variance estimation
the precise results to be given, it is helpful to consider
for 8. Expanding
Ji(8-
normality
eq. (8.1) and solving
0,) = -
for $(@-
an expansion
0,) gives
n-l
YCz,
Y) =
g(Z,
003
y),
(8.3) where e is the mean value. The usual (uniform) convergence arguments, when combined with consistency of e and y*, suggest that 6’ x1= 1V,g(zi, e, 9) 3 E[V,g(z,tI,,y,)] = G,. Thus, the behavior of the Jacobian term in eq. (8.3) is not conceptually difficult, only technically difficult because of the presence of nonparametric estimates. The score term C;= r g(zi, y”)/& is much more interesting and difficult. Showing asymptotic normality requires accounting for the presence of the infinite-dimensional term 7. Section 6 shows how to do this for the finitedimensional case, by expanding around the true value and using an influence function representation for 9. The infinite-dimensional case requires a significant generalization. One such is given in the next result, from Newey (1992a). Let 11 y /I denote a norm, such as ~up~+~~ /Iy(x) I/.
Theorem 8.1 Suppose that EC&, YJI = 0, ECIIdz, 14 II21 < co, and there is 6(z) with E[G(z)] = 0, E[ II6(z)l12] < co, and (i) (linearization) there is a function G(z,y - yO) that is linear in y - ye such that for all y with IIy - y. II small enough, IIg(z, y) - g(z, ye) G(z, y - yO)II d b(z) IIy - y. II2, and E[b(z)]&
11 y*- y. II2 3
0; (ii) (stochastic
equicon-
tinuity) C;= 1 [G(zi, y^- yO) - j G(z, $ --?/O)dFo]/fi 3 0; (iii) (mean-square differentiability) there is 6(z) and a measure F such that EC&z)] = 0, E[ II6(z) II2] < co and for all IIy - y. I/ small enough, JG(z, y*- Ye)dF, = !6(z)dF; (iv) for the empirical F [F(z) = n- ‘C;= 1 l(z, d z)],
distribution Cy=
1 Stzi7
7)/J
n3
&[j&z)dF
- S6(z)dF]
N(0, f2), where R = Var [g(zi, yO) + 6(zi)].
-%O.
Then
Ch. 36: Large Sumple Estimation
und Hypothesis
2197
Testing
Proof
It follows by the triangle and by the central
inequality
limit theorem
that Cr= 1[g(zi> y*)- g(Zi, ~0) - S(Zi)]/Ji
that Cy= 1[g(Zi, yo) + 6(Zi)]/&
A
JL 0,
N(O, 0). Q.E.D.
This result is just a decomposition
Cdzi~ ~0)+ &zJll&
of the remainder
term Cr=
I g(Zi> p)/,,h- EYEI
As will be illustrated
for the examples, it provides a useful outline of how asymptotic normality of a semiparametric two-step estimator can be shown. In addition, the assumptions of this result are useful for understanding even though y*is not fihow Cr= 1dzi, Y”)lJ n can have a limiting distribution, consistent. Assumption(i) requires that the remainder term from a linearization be small. The remainder term in this condition is analogous to g(z, y) - g(z, yO)- [V,g(z, y,J](y - yO) from parametric, two-step estimators. Here the functional G(z, y - ye) takes the place of [V,g(z, yO)](y - yO). The condition on this remainder requires either that it be zero, where b(z) = 0, or that the convergence rate of 9 be faster than npii4, in terms of the norm /Iy /I. Often such a convergence rate will require that the underlying nonparametric function satisfy certain smoothness restrictions, as further discussed in Section 8.3. Assumption (ii) is analogous to the requirement for parametric two-step estimators that {n ‘C;= 1V,g(zi, y,,) - E[V,g(z, y,-J] I(? - ye) converge to zero. It is referred to as a stochastic equicontinuity condition for similar reasons as condition (v) of Theorem 7.2. Andrews (1990) has recently given quite general sufficient conditions for condition (ii). Alternatively, it may be possible to show by direct calculation that condition (ii) holds, under weaker conditions than those given in Andrews (1990). For example, in the V-estimator example, condition (ii) is a well known projection result for V-statistics (or U-statistics), as further discussed in Section 8.2. For kernel estimators, condition (ii) will follow from combining a V-statistic projection and a condition that the bias goes to zero, as further discussed in Section 8.3. Both conditions (i) and (ii) involve “second-order” terms. Thus, both of these conditions are “regularity conditions”, meaning that they should be satisfied if g(z, y) is sufficiently smooth and y*sufficiently well behaved. The terms in (iii) and (iv) are “first-order” terms. These conditions are the ones that allow I;= 1 g(z, y^)/Ji to be asymptotically normal, even though y^may converge at a slower rate. The key condition is (iii), which imposes a representation of JG(z, y*- ye)dF, as an integral with respect to an estimated measure. The interpretation of this representation is that [G(z, jj - y,JdF, can be viewed as an average over some estimated distribution. As discussed in Newey (1992a), this condition is essentially equivalent to finiteness of the semiparametric variance bound for estimation of J G(z, y - yo)dF,. It is referred to as “mean-square differentiability” because the representation as an integral lG(z)dF(z, y) means that if dF(z, y) 1’2 has a mean-square derivative then
W.K. Newey and D. McFadden
2198
{G(z)dF(z, y) will be differentiable in y, as shown in Ibragimov and Has’minskii (1981). This is an essential condition for a finite semiparametric variance bound, as discussed in Van der Vaart (1991), which in turn is a necessary condition for Jn-consistency average
of j G(z, y*- yo)dF,.
over an estimated
distribution,
If jG(z,$
- yo)dF,
cannot
be viewed
then it will not be &-consistent.
as an Thus,
condition (iii) is the key one to obtaining &-consistency. Condition (iv) requires that the difference between the estimator F and the empirical distribution be small, in the sense of difference of integrals. This condition embodies a requirement that p be nonparametric, because otherwise it could not be close to the empirical measure. For kernel estimators it will turn out that part (iv) is a pure bias condition, requiring that a bias term goes to zero faster than l/fi. For other estimators this condition may not impose such a severe bias requirement, as for the series estimators discussed in Newey (1992a). An implication of conditions (iii) and (iv) is that Jil@z)d(F - F,) = JG(z)d&. (6 -F,) converges in distribution to a normal random vector, a key result. An alternative way to obtain this result is to show that fi(@ - F,) is a stochastic process that converges in distribution in a metric for which !6(z)d(*) is continuous, and then apply the continuous mapping theorem. 46 This approach is followed in Ait-Sahalia (1993). One piece of knowledge that is useful in verifying the conditions of Theorem 8.1 is the form of 6(z). As discussed in Newey (1992a), a straightforward derivative calculation is often useful for finding 6(z). Let v denote the parameters of some general distribution where q0 is equal to the truth, and let y(q) denote the true value of y when 7 are the true parameters. The calculation is to find 6(z) such that V,jg[z, y(q)] dF, = E[S(z)Sh], where the derivative is taken at the true distribution. The reason that this reproduces the 6(z) of Theorem 8.1 is that condition (i) will imply that V,Jg[z, y(q)]dF, = V,lG[z, y(q) - yo]dF, [under the regularity condition that /Iy(q) - y 11is a differentiable function of y], so (iii) implies that V,sg[z, Y(q)]dF, = V,jG(z)dF(q) = E[S(z)Sk]. This calculation is like the Gateaux derivative calculation discussed in Huber (1981), except that it allows for the distributions to be continuous in some variables. With 6(z) in hand, one can then proceed to check the conditions of Theorem 8.1. This calculation is even useful when some result other than Theorem 8.1 is used to show asymptotic normality, because it leads to the form of the remainder term Cr= 1 [g(zi, $) - g(zi, yO) - 6(zi)]/fi that should be small to get asymptotic normality. Theorem 8.1 can be combined with conditions for convergence of the Jacobian to obtain conditions for asymptotic normality-of 4, as in the following result.
46The continuous mapping Z then hLY(n)] Ah(Z).
theorem
states that if Y(n) AZ
and h(y) is continuous
on the support
of
Ch. 36: Large Sample Estimation
Theorem
and Hypothesis
2199
Testing
8.2
If 8% O,, the conditions of Theorem 8.1 are satisfied, and (i) there are a norm (1y 11,E > 0, and a neighborhood .Af of 8, such that for IIy - y. II small enough, sup& Ir IIv&Y(z,@,Y) - v&dzi, @,YO)II d Nz) IIY - YO IHEand E[b(z)] 11y*- y0 11’3 0; (ii) V,g(z,,fI, yO) satisfies the conditions of Lemma 4.3; (iii) G, is nonsingular; then Jr@
- 0,) % N(0, c; ‘RG,
I’).
Pr#Of
It suffices to show that IZ- ‘Cy=, V&z,, 6! 9) 3 G,, because then the conclusion will follow from the conclusion of Theorem 8.1, eq. (8.3), and arguments like those of Section 3. Condition (i) implies that [x1= 1b(zi)/n] I/y^- y0 11’% 0 by the Markov inequality, SO n~'Cr=,(/V,g(Zi,8,y^)-V,g(zi,8,y,)(l ~[n-lC1=Ih(Zi)]Ilp-y,ll"~O. of Lemma 4.3, II- ’ xy= 1Veg(zi, & yO)A G,. The conclusion
Also, by the conclusion
then follows by the triangle
Q.E.D.
inequality.
This result provides one set of sufficient conditions for convergence of the Jacobian term. They are specified so as to be similar to those of Theorem 8.1, involving a norm for y. In particular cases it may be useful to employ some other method for showing Jacobian convergence, as will be illustrated in Section 8.2. A similar comment applies to the consistency condition. Consistency can be shown by imposing conditions like (i) and (ii) to give uniform convergence of an objective function, but this result will not cover all cases. In some cases it may be better to work directly with Theorem 2.1 to show consistency. The asymptotic variance of a semiparametric two-step estimator is Gi ‘flG; “. As usual, a consistent estimator can be formed by plugging in estimators of the different pieces. An estimator of the Jacobian term can be formed in a straightforward way, as
CB= n l i
v,g(zi,
f7,g.
i=l
Consistency of GBfor G, will follow under the same conditions as used for asymptotic normality of I!?,because of the need to show consistency of the Jacobian matrix in the Taylor expansion. The more difficult term to estimate is the “score” variance 0. One way to estimate this term is to form an estimator g(z) of the function 6(z) that appears in the asymptotic variance, and then construct
’ = Iz~ ’ ~
{g(Zi,
e,B) + I}
(g(Z,
e,y*)+ I}'.
(8.5)
i=l
An estimator
of the asymptotic
variance
can then be formed as G;’
fiG;
“.
WK.
2200
and D. McFadden
Newey
It is difficult at this level of generality to give primitive conditions for consistency of a variance estimator, because these will depend on the nature of 6(z). One useful intermediate result is the following one. Lemma
8.3
If the conditjons of Theorem and C;= 1 11 6(zi) - d(zi) II2/n
L
8.1 are sati_sfied, xy= 0, then fi L 0.
1 11g(Zi,
6, f)
-
g(Zi,
8,,
yo)
11‘/n
5
0,
Proof
Let zii = g(zi, 8, 7) + J(zi) and ui = g(zi, 8,, yO) + 6(zJ, so that fl= E[uiuIl and 8 = ~1~ 1ti,ti:/n. By the assumptions and the triangle inequality, x1= 1 IIfii - UCII2/n 11,~). Also, by the LLN, x1= 1uiui/n -S E[UiUi]. Also, IIx1= 1riirii/n - Cy= 1u&/n II d
CyzI ~Iliiri~-UiU~ll/n
of Cl= 1uiuI/n implies that Cy= 1 /IUi I/2/n is bounded
in probability. Q.E.D.
Powell et al. (1989) use an analogous intermediate result to show consistency of their variance estimator. More primitive conditions are not given because it is difficult to specify them in a way that would cover all examples of interest. These results provide a useful way of organizing and understanding asymptotic normality of semiparametric two-step estimators. In the analysis to follow, their usefulness will be illustrated by considering V-estimators and estimators where the first step is a kernel estimator. These results are also useful in showing asymptotic normality when the first step is a series regression estimator, i.e. an estimator obtained from least squares regression of some dependent variable on approximating functions. The series estimator case is considered in Newey (1992a).
8.2.
V-estimators
A V-estimator, as in eq. (8.2), is useful as an illustration of the results. As previously noted, this estimator has g(z, y) = J m (z, Z,8,)dy(?), and y*is the empirical distribution with p(5) = x1= 1 l(zi d 5)/n. For this estimator, condition (i) of Theorem 8.1 is automatically satisfied, with b(z) = 0 because g(z, y) is linear in y. Condition (ii) needs to be verified. To see what this condition means, let m(z,, z2) = m(z,, z2, fl,), ml(z) = Sm(z, ?)dF,(Z), m2(z) = Im(z”, z)dF,(Z), and p = j~m(z,~)dF,(z)dF&). Then i$l CG(zi, $ - ~0) - J G(z, Y*-
~dd~J/~n
ml(Zi) =&{n~li~l[n~’ Ii I[ m(ZhZj)
j=l
-
n-l
i;. m2(zi) - p i=l
II
Ch. 36: Lurge Sample Estimation and Hypothesis
Testing
2201
It will follow from U- and V-statistic theory that this remainder term is small. A U-statistic has the form fi = np ‘(n - I)- ’ xi< ja(zi, zj), where u(z,, z2) = a(~,, zJ. A V-statistic has the form p = II-’ C;= IC;= I m(z,, zj). A V-statistic is equal to a U-statistic plus an asymptotically negligible term, as in p = n-‘CF, 1 Pn(Zi, Zi) + [(n - 1)/n] 6, where a(~,, zj) = m(zi, zj) + m(zj, zi). The lead term, n-‘Cr, 1m(z,, Zi) is a negligible “own observations” term, that converges in probability to zero at the rate l/n as long as E[m(z,, zi)] is finite. The condition that the remainder term in eq. (8.6) have probability limit zero is known as the projection theorem for U- or V-statistics. For a U-statistic, a(z) = fu(z, Z)dF,(Z), and E[c?(z)] = 0, the projection theorem states if the data are i.i.d. and u(z,,z,) n-‘CrZl vations; remainder theorem following
has finite second moments, then &[C? - np ‘Cr= 1a( LO, where a-( z J 1s re ferred to as the projection of the U-statistic on the basic obsersee Serfling (1980). The V-statistic projection theorem states that the in eq. (8.6) converges in probability to zero. The V-statistic projection is implied by the U-statistic projection theorem, as can be shown in the way. Let a(~,, z2) = m(z,, z2) + m(z,, zl) - 2~~ so
n-‘t
i, i=l
[Wl(zi~zj)-~]="~zi~l
[m(Zi,Zi)-~]+[(I1-1)/11]~.
j=l
The first term following the equality should be negligible. The second term following the equality is a multiple of the U-statistic, where the multiplying constant converges to 1. Furthermore, a(z) = ml(z) + m2(z) - 2~ in this case, so the projection of the U-statistic on the basic observations is n-l x1= 1[ml(zi) + m*(Zi) - 2~1. The Ustatistic projection theorem then implies that the remainder in eq. (8.6) is small. Thus, it will follow from eq. (8.6) and the U-statistic projection theorem that condition (ii) of Theorem 8.1 is satisfied. The previous discussion indicates that, for V-estimators, assumption (ii) follows from the V-statistic projection theorem. This projection result will also be important for assumption (ii) for kernel estimators, although in that case the V-statistic varies with the sample size. For this reason it is helpful to allow for m(z,,z,) to depend on n when stating a precise result. Let m,,(z) = Jm,,(z, ,?)dF,(?), mn2(z) = Im,(z”, z)dF,(Z), and Y,,= O,(r,) mean that 11Y, II/r, is bounded in probability for the Euclidean norm
II* II. Lemma 8.4 z,,z*,
. . are i.i.d. then n-‘C;=
1Cjn= 1m,(zi,zj) - n-l C”= 1[m,l(zi) + m,,(zi)] + p =
zl, ZJ III/n + (ECIIm,(z,,z2)II21)“2/fl).
W.K. Newey und D. McFadden
2202
The proof is technical, and so is postponed until Section 8.4. A consequence of this result is that condition (ii), the stochastic equicontinuity hypothesis, will be satisfied for U-estimators as long as E[ 11 m(zl, zl, Q,) II1 and EC IIm(z,,z2,0,) II“I are finite. Lemma 8.4 actually gives a stronger result, that the convergence rate of the remainder is l/Jr~, but this result will not be used until Section 8.3. With condition (ii) (finally) out of the way, one can consider conditions (iii) and (iv) for V-estimators. Assuming that p = 0, note that j G(z, y*- Ye)dF, ;j [ jm(z, z”,e,) x dF,(z)] dF”(1) = j G(z)dF(z) for 6(z) = m2(z) = j m(Z, z, B,)dF,(z) and F(z) equal to the empirical distribution. Thus, in this example conditions (iii) and (iv) are automatically satisfied because of the form of the estimator, giving all the assumptions of Theorem 8.1, with g(z, 8,, yO) + 6(z) = ml(z) + m2(z). An asymptotic normality result for Vestimators can then be stated by specifying conditions for uniform convergence of the Jacobian. The following condition is useful in this respect, and is also useful for showing the uniform convergence assumption of Theorem 2.1 and V-estimators. Lemma 8.5
If 21, z*, . . . are i.i.d., a(z,, z2, O), is continuous
at each (3~ 0 with probability
one,
ECsupo,ell 4zI,zI, 0)III< Q and EC~UP,,, II4z,, z2,0)II1 < ~0,thenEC4zI,z2,@I is continuous
in 0~ 0, and
supBE 8 /I nm2 x1= 1 x7=
1a(z,
The proof is postponed until Section 8.4. This result can be used to formulate conditions adding a condition for convergence of the Jacobian. Theorem
zj, d) -
ELZ(Z,, z2, O)] (I J+
for asymptotic
normality
0.
by
8.6
Suppose that zr, z2,. . are i.i.d., 63 C+,,(i) E[m(z,, z2, @,)I = 0, E[ II m(z,, zl, 8,) II] < co, E[ /Im(z,, z2, /!I,) /I2] < 03, (ii) m(z,, zl, 19)and m(z,, z2, 19)are continuously differentiable on a neighborhood of (I0 with probability one, and there is a neighborhood
IIVom(z,, zl, 6)II1 < ~0andEC~UP,~.,~ IIVom(zI, z2, 4 II1 < co, (iii) GB = E [V,m(z,, z2, (!I,)] is nonsingular. Then &(e-- 0,) L N(0, G, ‘!CCIGg“)
N of 8,, such that ECsup,,,-
for 0 = Var { j [m(z, Z,0,) + m(Z, z, 0,)] dF,(z”)}. Proof
It follows by Lemma 8.4, assumption (i), and the preceding discussion that conditions (i)-(iv) of Theorem 8.1 are satisfied for g(z, ye) + 6(z) = 1 [m(z, Z, 0,) + m(Z, z, tI,)]dF,(Z), so it follows by the conclusion
of Theorem
8.1 that &c’C~=,
CT= 1m(z,, zj, 0,) 3
N(O,C?). Therefore,
t?L
it suffices to show that n-“Cr, 1cjn= 1V,m(z,, zj, I$ L G, for any 8,. This condition follows by Lemma 8.5 and the triangle inequality. Q.E.D.
To use this result to make inferences about 8 it is useful to have an asymptotic variance estimator. Let GH= nT 2 C;= 1x7=, V,m(z,, zj, 8) be a Jacobian estimator.
Ch. 36: Larye Sample Estimation and Hypothesis
2203
Testiny
This estimator wll be consistent for G, under the conditions estimator of g(z, O,, yO) + 6(z) can be constructed by replacing in the expression given in R, to form
Iii =
n- l j$l[rn(Zi, zj, 6)+ rn(Zj, zi, 8)],
The following result is useful for showing consistency m,(z,, z2, 0) depend on n and m,,(z) be as defined above. Lemma
of this estimator.
Let
8.7
Lf~ll~-~,I/=0,(1) ECsuP,,.,_ II%(Z,,
then
n~1~~~~~~n~‘~j”~~m,(~~,~j,~)-m,l(~i)~~2=0,{n-1
21,Q)II2 + sup,,,, IIVcPn(zl~z2,~)
II2 + II%(Z,,
This result is proved in Section 8.4. Consistency be shown, using Lemma 8.7.
of the variance
Theorem
of Theorem 8.6. An BOby 6 and F, by E
x z2,hJ)
II‘I ).
estimator
can now
8.8
If the conditions of Theorem 8.6 are satisfied, E[su~,,,~ /Im(z,, zl, 0) I/2] < cc and E[su~,,,~ 1)Vem,(zl, z2, 0) I/‘1 < cc then G; ‘dG; ’ 3 Cc ‘RG; ‘. Proof
It follows by Lemmas 8.7 and 8.3 that b Aa, and it follows as in the proof of Theorem 8.6 that 6, ’ L Gil, so the conclusion follows by continuity of matrix multiplication. Q.E.D.
8.3.
First-step
kernel
estimation
There are many examples of semiparametric two-step estimators that depend on kernel density or conditional expectations estimators. These include the estimators of Powell et al. (1989) and Robinson (1988b). Also, the nonparametric approximate consumer surplus estimator introduced earlier is of this form. For these estimators it is possible to formulate primitive assumptions for asymptotic normality, based on the conditions of Section 8.1. Suppose that y denotes a vector of functions of variables x, where x is an r x 1 subvector of the data observation z. Let y denote another subvector of the data. The first-step estimator to be considered here will be the function of x with y*(x)= n- ’ i
yiKb(x - xi).
(8.7)
W.K. Newey and D. McFadden
2204
This is a kernel estimator of fO(x)E[ylx], where Jo(x) is the marginal density of x. A kernel estimator of the density of x will be a component of y(x) where the corresponding component of y is identically equal to 1. The nonparametric consumer surplus estimator depends on 9 of this form, where yi = (1, qi)‘. Unlike V-estimators, two-step estimators that depend on the 9 of eq. (8.6) will often be nonlinear in y*.Consequently, the linearization condition (i) of Theorem 8.1 will be important for these estimators. For example, the nonparametric consumer surplus estimator depends on a ratio, with g(z,y) = ~~[Y2(x)/y1(x)]dx - BO. In this example the linearization G(z, y - y_) is obtained by expanding the ratio inside the integral. By ii/g - a/b = bm‘Cl - b- ‘(6 - b)] [ii - a - (u/b)(g - b)], the linearization of Z/g around a/b is b- ‘[E - (I - (u/b)(g - b)]. Therefore, the linear functional of assumption (i) is
WY) =
sbfoW
‘C- h&4,llyWx.
(8.8)
a
Ify,,(x) = f,Jx) is bounded away from zero, y2,,(x) is bounded, close to yie(x) on [a, b], then the remainder term will satisfy
and yi(x) is uniformly
Ig(z,Y)- dz, 14 - G(z,Y - ~0)I d
bl~&)I - ‘f,,(x)-‘Cl
sP
+ lMx)llClr~(4
- f&)lz
+ IyAx) - ~zo(4121d~
d c SUP,,,qb] 11 dx) - h,tx) 11 ‘. Therefore
assumption
(i) of Theorem
(8.9) 8.1 will be satisfied
if &
supXt,a,bl IIy*(x) -
Yob) II . One feature of the consumer surplus example that is shared by other cases where conditional expectations are present is that the density in the denominator must be bounded away from zero in order for the remainder to be well behaved. This condition requires that the density only effects the estimator through its values on a bounded set, a “fixed trimming” condition, where the word trimming refers to limiting the effect of the density. In some examples, such as the consumer surplus one, this fixed trimming condition arises naturally, because the estimator only depends on x over a range of values. In other cases it may be necessary to guarantee that this condition holds by adding a weight function, as in the weighted average derivative example below. It may be possible to avoid this assumption, using results like those of Robinson (1988b), where the amount of trimming is allowed to decrease with sample size, but for simplicity this generalization is not considered here. 2ao47
471n this case p,(x) will be uniformly close to y,Jx), and so will be bounded probability approaching one if yIo(x) is bounded away from zero, on [a, h].
away from zero with
Ch. 36: Large Sample Estimation and Hypothesis
2205
Testing
In general, to check the linearization condition (i) of Theorem 8.1 it is necessary to specify a norm for the function y. A norm that is quite convenient and applies to many examples is a supremum norm on a function and its derivatives. This norm does not give quite as sharp results as an integral norm, but it applies to many more examples, and one does not lose very much in working with a supremum norm rather than an integral norm.48 Let ajy(x)/axj denote any vector consisting of all distinct jth-order partial derivatives of all elements of y(x). Also, let 3’ denote a set that is contained in the support of x, and for some nonnegative integer d let
This type of norm is often referred to as a Sobolev norm. With this norm the n’j4 convergence rate of Theorem 8.1 will hold if the kernel estimator g(x) and its derivatives converge uniformly on CCat a sufficiently fast rate. To make sure that the rate is attainable it is useful to impose some conditions on the kernel, the true function y,(x), the data vector y, and the bandwidth. The first assumption gives some useful conditions for the kernel. Assumption
8.1
K(u) is differentiable of order d, the derivatives of order d are bounded, K(u) is zero outside a bounded set, jX(u)du = 1, there is a positive integer m such that for all j<m,SK(u)[~~=,u]du=O.
The existence of the dth derivative of the kernel means that IIf 11will be well defined. The requirement that K(u) is zero outside a bounded set could probably be relaxed, but is maintained here for simplicity. The other two conditions are important for controlling the bias of the estimator. They can be explained by considering an expansion of the bias of y(x). For simplicity, suppose that x is a scalar, and note EC?(x)] = SE[ylZ],f,(l)K,(x - I)d,? = ~y,+)K,(x - I)dZ. Making the change of variables u = (x - .%)/a and expanding around CJ= 0 gives
E[-f(x)]
=
=
s
y,,(x - ou)K(u)du
2
Ojajy,(xyaxjK(u)ujdu
O<j<m
=
s
ye(x) + 0m amyotx +
s
+ Cm ayotx +
ouyaxvqu)u~du
s
i7u)/axv(u)umdu,
48With an integral norm, the Inn term in the results below could be dropped. dominate this one, so that this change would not result in much improvement.
(8.10)
The other
terms
WK. Newey and D. McFadden
2206
where 6 is an intermediate value, assuming that derivatives up to order m of ye(x) exist. The role of jK(u)du = 1 is to make the coefficient of y,(x) equal to 1, in the expansion. The role of the “zero moment” condition {K(u)ujdu = 0, (j < m), is to make all of the lower-order powers of cr disappear, so that the difference between E[y*(x)] and yO(x) is of order grn. Thus, the larger m is, with a corresponding number of derivatives of y,(x), the faster will be the convergence rate of E[y*(x)] to y&x). Kernels with this moment property will have to be negative when j 3 2. They are often referred to as “higher-order” or “bias-reducing” kernels. Such higher-order rate for y*and are also important kernels are used to obtain the r2’/4 convergence for assumption (iv) of Theorem 8.1. In order to guarantee that bias-reducing kernels have the desired effect, the function being estimated must be sufficiently smooth. The following condition imposes such smoothness. Assumption
8.2
There is a version of yO(x) that is continuously derivatives on an open set containing .F.
differentiable
to order d with bounded
This assumption, when combined with Assumption 8.1 and the expansion given above produce the following result on the bias of the kernel estimator 9. Let E[y*] denote E[y^(x)] as a function of x. Lemma 8.9 If Assumptions
8.1 and 8.2 are satisfied then 11E[$] - y /I = O(C).
This result is a standard one on kernel estimators, as described in Hardle and Linton (1994), so its proof is omitted. To obtain a uniform convergence rate for f is also helpful to impose the following condition. Assumption
8.3
There is p 3 4 such that E[ 11 y II”] < co and E[ lly Ilplx]fO(x) is bounded. Assumptions
8.1-8.3 can be combined
to obtain
the following
result:
Lemma 8.10 If Assumptions 8.1-8.3 are satisfied and cr = a(n) such that o(n)+0 In n -+ cc then IIy*- y,, /I = O,[(ln n)l/* (w~+*~))“* + ~“‘1.
and n1 -(2ip),(rr)‘/
This result is proved in Newey (1992b). Its proof is quite long and technical,
and so
is omitted.
for as-
It follows from this result that Jn
(Iy*- y0 (I* 3
0, as required
Ch. 36: Large Sample Estimation
and Hypothesis
Testing
2207
sumption (i) of Theorem 8.1, if ,1-‘2’P’a(n)‘/ln n+ 03, &no2” --f 0, and Ja In n/ sequences (nd+ 2d)+ 0. Th ese conditions will be satisfied for a range of bandwidth o(n), if m and p are big enough, i.e. if the kernel is of “high-enough order”, the true function y(x) is smooth enough, and there are enough moments of y. However, large values of m will be required if r is large. For kernel estimators it turns out that assumption (ii) of Theorem 8.1 will follow from combining a V-statistic projection with a small bias condition. Suppose that G(z, v) is linear in y, and let 7 = I?[?]. Then G(z, 9 - yO) = G(z, 9 - 7) + G(z, 7 - yO). Let m,(zi, Zj) = G[zi, .YjK,( * - xj)], m&) = Jm,(Z Z)@‘cd4 = SG[z, .YjK,( * - Xj)] dF,(z), and assume that m,,(z) = 1 m,(z, 2) dF,(Z) = G(z, 17)as should follow by the linearity of G(z, y). Then G(z, y*- 7) dF,(z) s
(8.11) where the second equality follows by linearity of G(z, y). The convergence in probability of this term to zero will follow by the V-statistic projection result of Lemma 8.4. The other term, &(x1= 1G(zi, 7 -,yJ/n - 1 G(z, 7 - y) dF,(z)}, will converge in probability to zero if E[ )(G(z, 7 - yO)I/2] + 0, by Chebyshev’s inequality, which should happen in great generality by y -+ y0 as 0 40, as described precisely in the proof of Theorem 8.11 below. Thus, a V-statistic projection result when combined with a small bias condition that E[ 1)G(z, v - yO)/I‘1 goes to zero, gives condition (ii) of Theorem 8.1. For kernel estimators, a simple condition for the mean-square differentiability assumption (iii) of Theorem 8.1 is that there is a conformable matrix v(x) of functions of x such that
jG(z,
?;)dF, =
~.(xh(x) dx,
(8.12)
for some v(x).This condition says j G(z, y) dF, can be represented as an integral, i.e. as an “average” over values of x. It leads to a simple form for 6(z). As previously discussed, in general 6(z) can be calculated by differentiating f G[z, y(q)] dF, with respect to the parameters q of a distribution of z, and finding 6(z) such that V,J G[z, y(q)] dFO = E[S(z)SJ for the score S, and all sufficiently regular parametrizations. Let I$[.] denote the expectation with respect to the distribution at this
W.K. Newey and D. McFadden
2208
parametrization.
Here, the law of iterated
s
s
v)dx = GCz,~h)l dF, = V(X)Y(X>
so differentiating for J(z)=
expectations
implies that
s
vW,CyIx1.0x Iv) dx = E,Cv(x)yl,
gives V,j G[z, y(v)] dF, = V,E,[v(x)y]
= E[v(x)ySJ
= E[G(z)SJ, (8.13)
V(X)Y- ECv(x)~l.
For example, for the consumer surplus estimator, by eq. (8.8), one has v(x) = and y = (l,q), so that 6(z) = l(a 6 x d b)f,(x)-’ x l(U~X~b)f~(x)-lC-h~(x),ll c4- Mx)l. With a candidate for 6(z) in hand, it is easier to find the integral representation for assumption (iii) of Theorem 8.1. Partition z as z = (x, w), where w are the components of z other than x. By a change of variables, 1 K,(x - xi) dx = j K(u) du = 1, so that
s
v(x)y,(x)dx=n-’
G(z,y^-y,)dF,=
i, i=l
- E[v(x)y]
= n
”
1 izl
s
V(X)yiK,(X
-
Xi)
dx
f
J 6(X,WJKg(X- xi) dx = Jd(z)d’, (8.14)
where the integral of a function a(z) over d$ is equal to np ’ x1= I sa(x, wi)K,(x - xi)dx. The integral here will be the expectation over a distribution when K(u) 2 0, but when K(u) can be negative, as for higher-order kernels, then the integral cannot be interpreted as an expectation. The final condition of Theorem 8.1, i.e. assumption (iv), will follow under straightforward conditions. To verify assumption (iv) of Theorem 8.1, it is useful to note that the integral in eq. (8.14) is close to the empirical measure, the main difference being that the empirical distribution of x has been replaced by a smoothed version with density n- ’ x1= 1K,(x - xi) [for K(u) 3 01. Consequently, the difference between the two integrals can be interpreted as a smoothing bias term, with
b(z)diSd(z)dF=K’
By Chebyshev’s in probability
inequality,
$r
[ Sv(x)K,(X-xi)dX-V(Xi)]Yi.
sufficient conditions
to zero are that JnE[y,{
‘CIIY~II*IISV(X)K,(X-~~)~X-VV(X~)II~I~O.A and smoothness
parts of Assumptions
for Jn
(8.15)
times this term to converge
jv(x)K,(x - xi)dx - V(Xi)}] -0 and that s s.h own below, the bias-reducing kernel 8.1-8.3 are useful in showing that the first
Ch. 36: Large Sample Estimation and Hypothesis
condition holds, while continuity the second. In particular, one can even when v(x) is discontinuous, Putting together the various asymptotic Theorem
normality
2209
Testing
of v(x) at “most points” of v(x) is useful for showing show that the remainder term in eq. (8.15) is small, as is important in the consumer surplus example. arguments described above leads to a result on
of the “score” x1= r g(zi, y”)/&.
8.11
Suppose that Assumptions 8.1-8.3 are satisfied, E[g(z, yO)] = 0, E[ I/g(z, y,,) )(‘1 < a, X is a compact set, cr = o(n) with na2’+4d/(ln n)2 -+ cc and na2m + 0, and there is a vector of functionals G(z, y) that is linear in y such that (i) for ll y - y. I/ small
IIdz, y)- dz, yo)- W, Y- yo) II 6 W IIY-y. II2,ECWI < ~0;(3 IIW, Y) II d c(z) 1)y 1)and E[c(z)‘] < co; (iii) there is v(x) with 1 G(z, y) dF,(z) = lv(x)y(x) dx for all /ly 11< co; (iv) v(x) is continuous almost everywhere, 111v(x) 11dx < co, and there
enough,
is E > 0 such that E[sup,,,,, GE(Iv(x + u) II41 < co. Then for 6(z) = v(x)y - E[v(x)y], Cr= 1S(zi, Y^)l& 5
N(0, Var [g(z, yo) + 6(Z)]}.
Proof The proof proceeds
by verifying
the conditions
of Theorem
8.1. To show assump-
2 30 which follows by the rate conditions tion (i) it suffices to show fi /Iy*- y. 11 on 0 and Lemma 8.10. To show assumption iii), note that by K(u) having bounded derivatives of order d and bounded support, (IG[z, yK,(. - x)] 11d o-‘c(z) IIy )I. It then follows by Lemma 8.4 that the remainder term of eq. (8.11) is O,,(n- ‘a-’ x {E[c(z,)/( y, II] + (E[c(z~)~ IIy, 112])1’2})= o,(l) by n-‘a-‘+O. Also, the rate conditions imply 0 --f 0, so that E[ I(G(z, 7 - yo) )I2] d E[c(z)~] 117- y. 1)2 + 0, so that the other remainder term for assumption (ii) also goes to zero, as discussed following eq. (8.11). Assumption (iii) was verified in the text, with dF as described there. To show assumption (iv), note that
v(x)K,(x
- xi) dx - v(xi) yi I
v(x)K(u)y,(x
Cyo(x < &
s
II v(x) II
iI1
Ill
- au) dudx -
- 0~)-
yo(x)PW
du
[y,(x - au) - y,,(x)lK(u) du 11dx < 1 Ca” jll
v(x)IIdx,
(8.16)
W.K. Newey and D. McFadden
2210
for some constant C. Therefore, //J&[ { Sv(x)K,(x-xi) dx - V(Xi)}yi] 11d CJjza”‘+O. Also, by almost everywhere continuity of v(x), v(x + au) + v(x) for almost all x and U. Also, on the bounded support of K(u), for small enough 0, v(x + W) d SU~~~~~~ S ,v(x + o), so by the dominated convergence theorem, j v(x + au)K(u) du + j v(x)K(u) du = v(x) for almost all x. Another application of the dominated convergence theorem, using boundedness of K(u) gives E[ 11 j v(x)K,(x - xi) dx - v(xi) 114]-0, so by the CauchySchwartz inequality, E[ 11yi I/2 11j v(x)K,(x - xi) dx - v(xi) II2] + 0. Condition (iv) then follows from the Chebyshev inequality, since the mean and variance of Q.E.D. II- l” C;= 1[I v(x)K,(x - xi) dx - v(x,)]y, go to zero. The assumptions of Theorem 8.11 can be combined of the Jacobian to obtain an asymptotic normality estimator. As before, let R = Var [g(z, y,,) + 6(z)]. Theorem
with conditions for convergence result with a first-step kernel
8.12
Suppose that e -% 00~ interior(O), the assumptions of Theorem 8.11 are satisfied, E(g(z, yO)] = 0 and E[ 11g(z, ye) 11 2] < co, for 11 y - y. II small enough, g(z, 0, y) is continuously differentiable in 0 on a neighborhood _# of O,, there are b(z), s > 0 with
EC&)1< ~0, IIV~s(z,~,y)-V,g(z,~,,y,)/I d&)Cl/Q-4ll”+ E[V,g(z, Oo,yo)] exists and is nonsingular.
Then $(&
0,) 3
IIY-Y~II~~~ and N(0, G; ‘L2G; I’).
Proof
It follows similarly to the proof of Theorem 8.2 that 6; ’ 3 G; ‘, so the conclusion follows from Theorem 8.11 similarly to the proof of Theorem 8.2. Q.E.D. As previously discussed, the asymptotic variance can be estimatedby G,,‘86, I’, C;= 1 Vsg(zi, e,y*) and 8= n- ’ x1= lliiti; for ai = g(zi, 0, $) + 6(zi). The whereG,=n-’ main question here is how to construct an estimator of 6(z). Typically, the form of 6(z) will be known from assumption (iii) of Theorem 8.11, with 6(z) = 6(z, 8,, yo) for some known function 6(z, 0, y). An estimator of 6(z) can then be formed by substituting 8 and $3for 8, and y. to form
8(z)= 6(z, 6,jq.
(8.17)
The following result gives regularity asymptotic variance estimator. Theorem
conditions
for consistency
of the corresponding
8.13
Suppose that the assumptions of Theorem 8.12 are satisfied and there are b(z), s > 0, such that E[~(z)~] < cc and for /Iy - y. /I small enough, IIg(z, 19,y)-g(z, do, y)II d h(z) x
CIIQ-Q,ll”+ /I~-~~ll~l and 11~~~,~,~~-~6(~,~~,~~~ll dWCII~-~oI/“+ /IY-Y~II”~. Then 6; ’ 86;
l’ L
G; ‘RG;
“.
Ch. 36: Large Sample Estimation and Hypothesis
2211
Testing
Proof
It suffices to show that the assumptions of Theorem 8.3 are satisfied. By the conditions of Theorem 8.12, I/t? - 0, I/ 3 0 and /I9 - y0 I/ 5 0, so with probability approaching one,
because n- ’ x1= 1 b(zi) 2 is bounded in probability follows similarly that Cr= 1 11 8(zi) - 6(Zi) II“/ n 30, Theorem 8.3.
by the Markov so the conclusion
inequality. It follows by Q.E.D.
In some cases 6(z, 0, y) may be complex and difficult to calculate, making it hard to form the estimator 6(z, e,?). There is an alternative estimator, recently developed in Newey (1992b), that does not have these problems. It uses only the form of g(z, 6,~) and the kernel to calculate the estimator. For a scalar [ the estimator is given by
i(zi)=v, n-l [
j$l
C71zj38,y*
+
(8.18)
i,K,(‘yxil}]~
i=O’
This estimator can be thought of as the influence of the ith observation through the kernel estimator. It can be calculated by either analytical or numerical differentiation. Consistency of the corresponding asymptotic variance estimator is shown in Newey (1992b). It is helpful to consider some examples to illustrate how these results for first-step kernel estimates can be used. Nonparametric consumer surplus continued: To show asymptotic normality, one can first check the conditions of Theorem 8.11. This estimator has g(z, yO) = Jib,(x) dx 8, = 0, so the first two conditions are automatically satisfied. Let X = [a, b], which is a compact set, and suppose that Assumptions 8.1-8.3 are satisfied with m = 2, d = 0, and p = 4, so that the norm IIy 1) is just a supremum norm, involving no derivatives. Note that m = 2 only requires that JuK(u)du = 0, which is satisfied by many kernels. This condition also requires that fO(x) and fO(x)E[q Ix] have versions that are twice continuously differentiable on an open set containing [a, b], and that q have a fourth moment. Suppose that no2/(ln n)‘+ CC and no4 +O, giving the bandwidth conditions of Theorem 8.11, with r = 1 (here x is a scalar) and d = 0. Suppose that f,,(x) is bounded away from zero on [a, b]. Then, as previously shown in eq. (8.9), assumption (i) is satisfied, with b(z) equal to a constant and G(z, y) = (ii) holds by inspection by fO(x)-’ and !,bfo(x)- ’ C- Mx), lldx) dx. Assumption h,(x) bounded. As previously noted, assumption (iii) holds with v(x) = l(a < x < b) x fO(x)- ’ [ - h,(x), 11. This function is continuous except at the points x = a and x = b,
W.K. Newey and D. McFadden
2212
and is bounded, so that assumption Theorem 8.11 it follows that i;(x) - 0,
>
LV(O,
(iv) is satisfied.
Then
E[l(a ,< x d 4f,(x)~‘{q
-
by the conclusion
hI(x))21)>
of
(8.19)
an asymptotic normality result for a nonparametric consumer surplus estimator. To estimate the asymptotic variance, note that in this example, 6(z) = l(a d x d b) x Then f&)- ’ [I4- Mx)l = &z,h) for h(z,Y)= 1(a d x d b)y,(~)~’[q - y1(x)-1y2(x)]. for 6(z) = 6(z, y^),an asymptotic variance estimator will be
‘= f 8(Zi)2/n = n-l i=l
i$l l(U <Xi
< b)f(Xi)p2[qi-
&(Xi)]2.
(8.20)
By the density bounded away from zero on 3 = [a, b], for /Iy - y. /I small enough that yr (x) is also bounded away from zero on .oll‘,16(zi, y) - 6(zi, yO)1d C( 1 + qi) 11y - y0 1) for some constant C, so that the conditions of Theorem 8.13 are satisfied, implying consistency of d. Weighted average derivative estimation: There are many examples of models where there is a dependent variable with E[qlx] = T(X’ /3,Jfor a parameter vector /IO, as discussed in Powell’s chapter of this handbook. When the conditional expectation satisfies this “index” restriction, then V,E[ql.x] = s,(x’~,,)~~, where r,(v) = dr(v)/dv. Consequently, for any bounded function w(x), E[w(x)V,E[q(x]] = E[w(x)r,(x’/3,)]&,, i.e. the weighted average derivative E[w(x)V,E[qlx]] is equal to a scale multiple of the coefficients /I,,. Consequently, an estimate of /I0 that is consistent up to scale can be formed as
B=n-'
t W(Xi)V,L(Xi), C(X)= i i=l
i=l
qiK,(X-Xi)/i
K,(X-Xi).
(8.21)
i=l
This is a weighted average derivative estimator. This estimator takes the form given above where yIO(x) = f,Jx), yIO(x) = fO(x) x ECq Ixl,
and
Yk 0, v) = %47,cY2(4lY,(~)l
- 8.
(8.22)
The weight w(x) is useful as a “fixed trimming” device, that will allow the application of Theorem 8.11 even though there is a denominator term in g(z, 0, y). For this purpose, let 3 be a compact set, and suppose that w(x) is zero outside % and bounded. Also impose the condition that fe(x) = yIO(x) is bounded away from zero on I%^.Suppose that Assumptions 8.1-8.3 are satisfied, n~?‘+~/(ln ~)~+co and &“+O.
Ch. 36: Large Sample Estimation and Hypothesis
2213
Testing
These conditions will require that m > r + 2, so that the kernel must be of the higher-order type, and yO(x) must be differentiable of higher order than the dimension of the regressors plus 2. Then it is straightforward to verify that assumption (i) of Theorem 8.11 is satisfied where the norm (/y )I includes the first derivative, i.e. where d = 1, with a linear term given by
G(z,Y)= w(x)Cdx)~(x)+ V,r(xMx)l, %b4 = .I-&)- l c- &Ax) + kl(xb(x), - SWI,
Md = .foW l c- Mx), II’> (8.23)
where an x subscript denotes a vector of partial derivatives, and s(x) = fO,.(x)/fO(x) is the score for the density of x. This result follows from expanding the ratio V,[y,(x)/y,(x)] at each given point for x, using arguments similar to those in the previous example. Assumption (ii) also holds by inspection, by fO(x) bounded away from zero. To obtain assumption (iii) in this example, an additional step is required. In particular, the derivatives V,y(x) have to be transformed to the function values y(x) in order to obtain the representation in assumption (iii). The way this is done is by integration by parts, as in
HwW,~Wd41=
=-s
s
w(x)fo(x)b,(x~CV,~(x)l dx
V,Cw(x)fo(x)~o(x)l’~O dx>
v,Cww-ow,(x)l’
= w,(x)C- Mx), II+ w(x)c- 4&d, 01
It then follows that 1 G(z, y) dF, = j v(x)y(x) dx, for
44 = - w,(x)C- w4 11- w(x)II- &&4,01 + wb)a,(x) = - {WAX) + w(x)s(x)> c- Mx), 11= 04 t(x) = - w,(x) -- w(x)s(x).
c-
h_l(x),11, (8.24)
By the assumption that fO(x) is bounded away from zero on .!Zand that 9” is compact, the function a(~)[ - h,(x), l] is bounded, continuous, and zero outside a compact set, so that condition (iv) of Theorem 8.11 is satisfied. Noting that 6(z) = C(x)[q - h,(x)], the conclusion of Theorem 8.11 then gives
1L
w(xi)V,&xi) - 80
W,Var{w(x)V,h,(x) + QX)[q - &(x)]}). (8.25)
W.K. Newey
2214
The asymptotic
&!=n-’
i,
variance I,?$,
of this estimator
can be estimated
pi = W(Xi)V,~(Xi) - H^+ ~(Xi)[qi -
I],
and D. McFadden
as (8.26)
i=l
where z(x) = - w,(x) - w(x)fJx)/f^(x) for T(X) = n- ’ C;= 1 K(x - xi). Consistency of this asymptotic variance estimator will follow analogously to the consumer surplus example. One cautionary note due to Stoker (1991) is that the kernel weighted average derivative estimators tend to have large small sample biases. Stoker (1991) suggests a corrected estimate of - [n-l Cy= 1e^(x,)x~]- ‘8, and shows that this correction tends to reduce bias 8 and does not affect the asymptotic variance. Newey et al. and show that (1992) suggest an alternative estimator o^+ n- ’ C;= 1 &xi) [qi - &)I, this also tends to have smaller bias than 6. Newey et al. (1992) also show how to extend this correction to any two-step semiparametric estimator with a first-step kernel.
8.4.
Technicalities
Proof of Lemma 8.4 Let mij = m(zi, zj), 61,.= m,(z,), and fi., = m2(zi). Note that E[ /Im, 1 - p 111d E[ 11 m, I 111 +(E[ I~m,,~~2])1’2 and (E[I~m,, -p(12])1’2 <2(E[ /Im,2~~2])1’2 by the triangle inequality. Thus, by replacing m(z,, z2) with m(z,, z2) - p it can be assumed that p = 0. Note that IICijmij/n2 - Ci(fii. + fi.,)/n II = 11 Cij(mij - 61,. - Kj)/n2 II < I/xi+ j(mij rii,. - ti.j)/n2 II + IICi(mii - 6,. -6.,)/n” I/= Tl + T2. Note E[ TJ <(EC I/ml 1 /I + 2 x 111)/n. Also, for i #j, k #P let vijk/ = E[(mij - tii. - rKj)‘(m,( - fi,. - ti./)]. ECllm12 By i.i.d. observations, if neither k nor 8 is equal to i or j, then vijk/ = 0. Also for e not equal to i orj, viji/ = E[(mij - tii.)‘(mi/ - tip)] = E[E[(mij - &.)‘(m,( - ti,.)Izi,zj]] = E[(mij - fii.)‘(E[mit Izi, zj] - tip)] = 0 = vijj/. Similarly, vijk/ = 0 if k equals neither i nor j. Thus, ‘CT:1
= C
C Vijk//n4 = 1 (vijij + rijji)/n4
i#jk#/
=
i#j
2(n2 - n)E[ IIm12 - ti,.-ti.,
112]/n4= E[ l/ml2 - 6,. - Kz., lj2]0(np2),
and Tl =O,({E[IIml2-~l.-~.2/~2]}112n2-1)=Op({E[~~ml2~~2]}1~2n~‘). clusion then follows by the triangle inequality.
The conQ.E.D.
Proof of Lemma 8.5 Continuity of a(z,, z2, l3) follows by the dominated convergence theorem. Without changing notation let a(z,, z2, 0) = a@,, z2, 0) - E[a(z,, z2, Q]. This function satisfies the same dominance conditions as a(z,, z2, e), so it henceforth suffices to assume that
Ch. 36: Large Sample Estimation and Hypothesis
2215
Testing
E[a(z,,z,,
e)] = 0 for all 0. Let O(0) = n-‘(n - 1)-i Ci+ja(Zi,Zj,B)t and note that A 0. Then by well known results on U-statistics s”pOe@ IIn-z Ci,jatzi, zj9@ - tic@ II as in Serfling (1980), for each 0, 6(e) -%O. It therefore suffices to show stochastic e_quicontinuity of 8. The rest of:he proof proceeds as in the proof of Lemma 2.4, with di,(e, 6) = suplia- ~iiG,J/Ia(zi,zj, 0)- a(zi,zj, 0)II replacing 4(& :I, Ci+j replaci_ng Cy=1T and the U-statistic convergence result n- ‘(n - 1)-i xi+ jAij(d, 6) -% E[A12(B, S)] replacing the law of large numbers. Q.E.D Proof of Lemma 8.7 Let fiij = m,(zi, zj, g),,mij = m,(zi, zj, Q,), and ml, = m,i(zJ. By the triangle we have
n-l t I/II-l t diij-mlil12
+cn-’
j=l
t i=l
+Cnp2
t
IIn-’
inequality,
izI” II&ii II2
j~i(Jnij-mij)l12+c~-1i~l
II(n-I)-’
j~i(Wlij-mli)l12
JJmli))2=R1+R2+R3+R4.
i=l
for some positive constant C. Let b(zi) = SUP~~.~ IIm,(zi, zi, 0) II and b(z, zj) = sup~~,~ IIVom”(zi, zjr e) 11.With probability approaching one, R, <_CK2 Cl= 1b(z,)’ = O,(n-‘E[b(z,‘f]}. Also, R2~Cn-‘~~=lI~n-‘~jzib(zi,zj)~~2~~~-~e,~12~Cn-2 X Cifjb(zi,zj)211e-e~l12=0,{n-‘E[b( zl, z,)‘]}. Also, by the Chebyshev and CauchySchwartz inequalities, E[R,] d CE[ IIml2 II’]/n and E[R,] < CE[ (Im,, Il’]/n. The conclusion then follows by the Markov and triangle inequalities. Q.E.D.
9.
Hypothesis
testing with GMM estimators
This section outlines the large sample theory of hypothesis testing for GMM estimators. The trinity of Wald, Lagrange multiplier, and likelihood ratio test statistics from maximum likelihood estimation extend virtually unchanged to this more general setting. Our treatment provides a unified framework that specializes to both classical maximum likelihood methods and traditional linear models estimated on the basis of orthogonality restrictions. Suppose data z are generated by a process that is parametrized by a k x 1 vector 8. Let /(z, 0) denote the log-likelihood of z, and let 8, denote the true value of 0 in the population. Suppose there is an m x 1 vector of functions of z and 0, denoted g(z, f3),that have zero expectation in the population if and only if 8 equals 0,:
g(e) = ~~5 1(z,~ f l) =
g(z, 0) ee(‘veo)dz = 0, s
if 8 = 8,.
W.K. Newey and D. McFadden
2216
Then, Ey(z, H) are moments, and the analogy principle suggests that an estimator of 8, can be obtained by solving for 8 that makes the sample analogs of the population moments small. Identification normally requires that m 3 k. If the inequality is strict, and the moments are not degenerate, then there are overidentifying moments that can be used to improve estimation efficiency and/or test the internal consistency of the model. In this set-up, there are several alternative interpretations of z. It may be the case that z is a complete description of the data and P(z,Q) is the “full information” likelihood. Alternatively, some components of observations may be margined out, and P(z, 0) may be a marginal “limited information” likelihood. Examples are the likelihood for one equation in a simultaneous equations system, or the likelihood for continuous observations that are classified into discrete categories. Also, there may be “exogenous” variables (covariates), and the full or limited information likelihood above may be written conditioning on the values of these covariates. From the standpoint of statistical analysis, variables that are conditioned out behave like constants. Then, it does not matter for the discussion of hypothesis testing that follows which interpretation above applies, except that when regularity conditions are stated it should be understood that they hold almost surely with respect to the distribution of covariates. Several special cases of this general set-up occur frequently in applications. First, if Qz,~) is a full or limited information likelihood function, and g(z,8) = V,L(z,@ is the score vector, then we obtain maximum likelihood estimation.49 Second, if z = (y, x, w) and g(z, 0) = w’(y - x0) asserts orthogonality in the population between instruments w and regression disturbances E = y - x0,, then GMM specializes to 2SLS, or in the case that w = x, to OLS. These linear regression set-ups generalize immediately to nonlinear regression orthogonality conditions based on the form Y(Z,0) = W’CY - h(x, @I. Suppose an i.i.d. sample zi, . . . , z, is obtained from the data generation process. A GMM estimator of 0, is the vector 6,, that minimizes the generalized distance of the sample moments from zero, where this generalized distance is defined by the quadratic form
with l,(0) = (l/n)C:, i g(z,, (3)and 0, an m x m positive definite symmetric matrix that defines a “distance metric”. Define the covariance matrix of the moments, fl = Eg(z, O,)g(z, 0,)‘. Efficient
converge
mxm
weighting of a given set of m moments requires that 0, to Ras n + m.50 Also, define the Jacobian matrix mfik = EVOg(z, O,), and
@‘If the sample score has multiple roots, we assume that a root is selected that achieves a global maximum of the likelihood function. 50This weighting is efficient in that it minimizes the asymptotic covariance matrix in the class of all estimators obtained by setting to zero k linear combinations of the m moment conditions. Obviously, if there are exactly k moments, then the weighting is irrelevant. It is often useful to obtain initial consistent asymptotically normal GMM estimators employing an inefficient weighting that reduces computation, and then apply the one-step theorem to get efficient estimators.
Ch. 36: Large Sample Estimation and Hypothesis
2217
Testing
let G, denote an array that approaches G as n -+ co. The arrays 0, and G, may be functions of (preliminary) estimates g,, of 8,. When it is necessary to make this dependence explicit, write Q,,(g,,) and G,(g,,). Theorems 2.6, 3.4, and 4.5 for consistency, asymptotic normality, and asymptotic covariance matrix estimation, guarantee that the unconstrained GMM estimator
IS consistent and asymptotically & = argmw,~Q,,(@ N(0, B- ‘); where B = G’R- ‘G. Further, from Theorem matrix can be estimated using
G, = t,
normal, with &(8” - 0,) L 4.5, the asymptotic covariance
clVedz,,&, J+G,
f
where 8,, is any &-consistent estimator of 0, [i.e., &(8,, - 0,) is stochastically bounded]. A practical procedure for estimation is to first estimate 0 using the GMM criterion with an arbitrary L2,, such as J2,, = 1. This produces an initial $-consistent estimator I!?~,. Then use the formulae above to estimate the asymptotically efficient R,, and use the GMM criterion with this distance metric to obtain the final estimator gn. Equation (5.1) establishes that r- Ey(z,B,)V,G(z,0,)’ s EV,g(z, 0,) = G. It will sometimes be convenient to estimate G by
In the maximum likelihood case g = V,d, one has a= r= G, and the asymptotic covariance matrix of the unconstrained estimator simplifies to OR ‘.
9.1.
The null hypothesis
Suppose
and the constrained
there is an r-dimensional
GMM
null hypothesis
estimator
on the data generation
H,: r; 1(Q,) = 0.
We will consider
H
1 :
alternatives
to the null of the form
a(@,)# 0,
or asymptotically
local alternatives
H,,: a(&) = SJ&
# 0.
of the form
process,
W.K. Newey and D. McFadden
2218
Assume that F& z V,a(&,) has rank r. The null hypothesis
may be linear or nonlinear.
A particularly simple case is He. ‘6 = do, or a(@ = 0 - do, so the parameter vector 8 is completely specified under the null. More generally, there will be k - r parameters to be estimated when one imposes the null. One can define a constrained GMM estimator by optimizing the GMM criterion subject to the null hypothesis: g,, = argmaxtiEOQn(@, Define a Lagrangian
subject to a(0) = 0.
for t?“: _Y’;p,(6,y)= Q,(0) - , z ,(6yr;
1. In this expression,
y is
the vector of undetermined Lagrangian multipliers; these will be nonzero when the constraints are binding. The first-order conditions for solution of this problem are
0 = [I [
&VOQA@J -vo@J,/% II
0
- 46)
A first result establishes
l-
that g,, is consistent
under the null or local alternatives:
Theorem 9.1 Suppose the hypotheses of Theorem 2.6. Suppose ~(0s) = S/,/Z, including null when 6 = 0, with a continuously differentiable and A of rank r. Then ez
the 6’0.
Proof Let f3,, minimize [E&(e)]‘fl‘[&J,(@] subject to a(0) = S/J%. Continuity of this objective function and the uniqueness of its minimum imply eon + 8,. Then Q,(8,) 6 Q,(e,,) -% 0, implying Q,(gJ LO. But Q, converges uniformly [&j,(6)], so the argument of Theorem 2.6 implies t?,,3 0,. The consistency V,Q,(e,) V,a(e,) a
A
to [E~,(@]‘~’ x Q.E.D.
of g” implies - G’R - ’ Eg(z, 6,) = 0, A * A’?, = - V,Q,(e,) + oP 5
and since A is of full rank, 7, LO.
A central
0, limit theorem
implies (9.1)
A Taylor’s &W)
expansion = &,@o)
of the sample moments + G,,/&J
- &,I,
about
8. gives (9.2)
2219
Ch. 36: Large Sample Estimation and Hypothesis Testing
with G, evaluated
at points
between
8 and 8,. Substituting
this expression
for the
final term in the unconstrained first-order condition 0 = &V,Q,(g,J = - Gbf2; ’ x g,,(@,,)and using the consistency of e^, and uniform convergence of G,(0) yields 0 = - G’R - 1’2ulln+ S&e, =+(e, Similarly,
- 0,) = B-‘G’C substituting
- 0,) + oP li2@ n + o P’
&&(t?,J
= $&,(0,)
(9.3) + G,&(t?”
G$(en - &,) + op, and J&z(&) = J&(0,) + A&(e, op in the first-order conditions for an yields
- 0,) = - G’Qn-1’2@, +
- 0,) + op = 6 + Afi(g”
- 0,) +
(9.4)
From the formula
[,” ;I-l=[
for partitioned
inverses,
~-l/ZMBI’/’
B-‘,q(AB-l,q-l
(ABm’A’)-‘AB-l
where M = I - B- 1’2A’(AB- ‘A’)- ‘AK k - r. Applying this to eq. (9.4) yields
1
(9.5)
’
-(&-‘A’)-’
l” is a k x k idempotent
matrix
of rank
(9.6) Then, the asymptotic distribution of &(t?,, - 0,) under a local alternative, null with 6 =O, is N[ - B-‘A’(AB-‘A’)-‘6,B-1’2MB-“2]. Writing out M = I-B“2A’(AB- ‘A’)- ‘AB- 1’2 yields
or the
JJt(B,-8o)=B-1G’R-1I2U11,-~-1A’(AB-1~’)-1~~-1~’R-1/2~‘, -B-‘A’(AB-‘A’)-‘6
+ op.
The first terms on the right-hand side of eq. (9.7) and the right-hand are identical, to order op. Then, they can be combined to conclude
&(e, - e,) = B -l
(9.7) side of eq. (9.3) that
A ‘(AB-‘A’)-‘AB-‘G’R-“2~n+B-‘A’(AB-‘A’)-16+op, (9.8)
so that &(6,,
- g,,) IS . asymptotically
normal
with mean B- ‘A’(AB- ‘A’)-‘6
and
W.K. Newey and D. McFadden
2220
Table 1
Formula
Statistic Jn(&
- %)
&$-So)
Asymptotic covariance matrix B-’ EC
B~‘G’Q~“%,+o, -B~‘A’(AB~‘A’)~‘6+B~“2MB~“ZG’~~“Z~i,+o,
Bm l/2MBm l/2
J;l(e,-e,)
B-‘A’(AB-‘A’)-‘6+B-‘A’(AB~‘A’)~‘AB~’G’R-”z~~C,+op
Bm’A’(ABm’A’)m’ABm’
&r.
(AB-‘A’)~‘6+(AB~‘A’)~‘AB-‘G’R~“2~~11,+o,
(AB-‘A’)-’
$4&)
S+AB-‘G’R-“*42
AB-‘A’
,,‘h,Q,(e,)
A’(AB~‘A’)-‘B+A:(AB’A’)-‘AB-LC’R-”’~.+o,
+o
A’(AB-‘A’)-‘A
‘. Note that the covariance matrix B- ‘/‘(I - M)BPi’2 z BP ‘A’(AB-lA’)-‘ABP asymptotic covariance matrices satisfy acov(& - e,) = acov 8n - acov S,, or the uariante of the difference equals the difference of the variances. This proposition is familiar in a maximum likelihood context where the variance in the deviation between an efficient estimator and any other estimator equals the difference of the variances. We see here that it also applies to relatively efficient GMM estimators that use available moments and constraints optimally. The results above and some of their implications are summarized in Table 1. Each statistic is distributed asymptotically as a linear transformation of a common standard normal random vector %. Recall that B = G’R- ‘G is a positive definite kxkmatrix,andletC=B~‘-acov8,.RecallthatM=Z-B~”2A’(AB~‘A’)~’x AK ‘I2 is a k x k idempotent matrix of rank k - r.
9.2.
The test statistics
The test statistics for the null hypothesis fall into three major classes, sometimes called the trinity. Wald statistics are based on deviations of the unconstrained estimates from values consistent with the null. Lagrange multiplier (LM) or score statistics are based on deviations of the constrained estimates from values solving the unconstrained problem. Distance metric statistics are based on differences in the GMM criterion between the unconstrained and constrained estimators. In the case of maximum likelihood estimation, the distance metric statistic is asymptotically equivalent to the likelihood ratio statistic. There are several variants for Wald statistics in the case of the general nonlinear hypothesis; these reduce to the same expression in the simple case where the parameter vector is completely determined under the null. The same is true for the LM statistic. There are often significant computational advantages to using one member or variant of the trinity rather than another. On the other hand, they are all asymptotically equivalent. Thus, at least to first-order asymptotic approximation, there is no statistical reason to choose be-
Ch. 36: Large Sample Estimation and Hypothesis
2221
Testing
Figure 3. GMM
tests
tween them. This pattern of first-order asymptotic equivalence for GMM estimates is exactly the same as for maximum likelihood estimates. Figure 3 illustrates the relationship between distance metric (DM), Wald (W), and score (LM) tests. In the case of maximum likelihood estimation, the distance metric criterion is replaced by the likelihood ratio. The arguments 0, and f?,,are the unconstrained GMM estimator and the GMM estimator subject to the null hypothesis, respectively. The GMM criterion function is plotted, along with quadratic approximations to this function through the respective arguments 6, and &. The Wald statistic (W) can be interpreted as twice the difference in the criterion function at the two estimates, using a quadratic approximation to the criterion function at 6,. The Lagrange multiplier (LM) statistic can be interpreted as twice the difference in the criterion function of the two estimates,
using a quadratic
statistic is twice the difference and constrained estimators.
approximation in the distance
at 15%. The distance metric
between
metric
(DM)
the unconstrained
We develop the test statistics initially for the general nonlinear hypothesis ~(0,) = 0; the various statistics we consider are given in Table 2. In this table, recall that acov 87,= B and acov g,, = B- “‘MB- l” . In the following section, we consider the important special cases, including maximum likelihood and nonlinear least squares.
W.K. Newey and D. McFadden
2222 Table 2
Test statistics Wald statistics WI”
na(e.Y[AB-‘A’]-‘a(B,)
W,”
n(&”- f?J{acov(JJ - acov(G”)}-(6” - 13”) =n(e”-B,)‘B-‘A’(AB-‘A’)-‘AB-‘(~“-~”)
W3”
t1(8” - GJ acov(JJ
Lagrange
multiplier
‘(6” - f7J
statistics
LM,,
rq$4B~ ‘A’&
LM,,
nV,Q,(B,)‘{A’(AB-‘A’)-‘A}-V,Q,(B,)
LM,,
nv,Q.@J’B- ‘V,Q.(e.)
= “V,Q,(B,)‘B~‘A’(AB~‘A’)~‘AB~‘V,Q,(B,)
Distance
metric statistic -
DM,
2n[Q.(e., - Q,(&)l
In particular, when the hypothesis is that a subset of the parameters are constants, there are some simplifications of the statistics, and some versions are indistinguishable. The following theorem gives the large sample distributions of these statistics: Theorem 9.2 Suppose the conditions of Theorems 2.6,3.4, and 4.5 are satisfied, and a(8) is continuously differentiable with A of rank r. The test statistics in Table 2 are asymptotically equivalent under the null or under local alternatives. Under the null, the statistics converge in distribution to a chi-square with r degrees of freedom. Under a local alternative chi-square
a(&,) = S/J& the statistics converge in distribution with r degrees of freedom and a noncentrality parameter
to a noncentral 6’(AB- ‘A’)- ‘6.
Proof All of the test statistics are constructed from the expressions in Table 1. If 4 is an expression from the table with asymptotic covariance matrix R = acov q and asymptotic mean RA under local alternatives to the null, then the statistic will be of the form q’R+q, where R+ is any symmetric matrix that satisfies RR+R = R. The matrix R+ will be the ordinary inverse R- ’ if R is nonsingular, and may be the MoorePenrose generalized inverse R - if R is singular. Section 9.8 defines generalized inverses, and Lemma 9.7 in that section shows that if q is a normal random vector with covariance matrix R of rank r and mean R1, then q’R+q is distributed noncentral chi-square with r degrees of freedom and noncentrality parameter A’R;i under local alternatives to the null.
Ch. 36: Large Sample Estimation and Hypothesis
2223
Testing
Consider W,,. Under the local alternative ~(0,) = S/&, row five of Table 1 gives normal with mean S and a nonsingular covariance matrix q=6+AB-‘G’fi-“2ul;! R = AB-‘A’. Let A = R-‘6. Then Lemma 9.7 implies the result with noncentrality parameter i’R/1= 6’R ‘6 = 6’(AB- ‘A’)- ‘6. Consider W,,. The generalized inverse R of R = acov 8,, - acov t?,,can be written as:
The first identity substitutes the covariance formula from row 2 of Table 1. The second and third equalities follow from Section 9.8, Lemma 9.5, (5) and (4), respectively. One can check that A = R-B- ‘A’(AB- ‘A’)- ‘6 satisfies RI = B- ‘A’ x (AB-‘A’)-‘8, so that ;I’RA = d’(AB-‘A’)-‘6. The statistic W,, is obtained by noting that for R = BP’A’(AB-‘A’)-‘AB-‘, the matrix R+ = B satisfies RR+R = R and /z = RfB-‘A’(AB-‘A’)-‘6 satisfies RL = B- ‘A’(AB- ‘A’)- ‘6. Similar arguments establish the properties of the LM statistics. In particular, the second form of the statistic LM,, follows from previous argument that A’(AB-‘A’)-‘A’ and B-‘A’(AB-‘A’)-‘AB-’ are generalized inverses, and the statistic LM,, is obtained by noting that R = A’(AB-‘A’)-‘A has RR+R = R when R+ =B-‘. To demonstrate
the asymptotic
a Taylor’s expansion G,,&(g,,
- 8”) + oP, and substitute
with the last equality
equivalence
of the sample moments
holding
of DM, to the earlier statistics, for i?n about &, J&,(f?,,)
this in the expression
since G$?;
‘$nj,(&)
= 0.
make
= J$,(&)
+
for DM, to obtain
Q.E.D.
The Wald statistic W,, asks how close are the unconstrained estimators to satisfying the constraints; i.e., how close to zero is a(B,)? This variety of the test is particularly useful when the unconstrained estimator is available and the matrix A is easy to compute. For example, when the null is that a subvector of parameters equal constants, then A is a selection matrix that picks out the corresponding rows and columns of B- ‘, and this test reduces to a quadratic form with the deviations of the estimators from their hypothesized values in the wings, and the inverse of their asymptotic covariance matrix in the center. In the special case H,: 8 = 8’, one has A = I.
W.K. Newey and D. McFadden
2224
The Wald test W,, is useful if both the unconstrained and constrained estimators are available. Its first version requires only the readily available asymptotic covariance matrices of the two estimators, but for r < k requires calculation of a generalized inverse. Algorithms for this are available, but are often not as numerically stable as classical inversion algorithms because near-zero and exactzero characteristic roots are treated very differently. The second version involves only ordinary inverses, and is potentially quite useful for computation in applications. The Wald statistic W,, treats the constrained estimators us ifthey were constants with a zero asymptotic covariance matrix. This statistic is particularly simple to compute when the unconstrained and constrained estimators are available, as no matrix differences or generalized inverses are involved, and the matrix A need not be computed. The statistic W,, is in general larger than W,, in finite samples, since the center of the second quadratic form is (acov6J’ and the center of the first quadratic form is (acov e?, - acov I!?~)-, while the tails are the same. Nevertheless, the two statistics are asymptotically equivalent. The approach of Lagrange multiplier or score tests is to calculate the constrained estimator e,, and then to base a statistic on the discrepancy from zero at this argument of a condition that would be zero if the constraint were not binding. The statistic LM,, asks how close the Lagrangian multipliers Y,, measuring the degree to which the hypothesized constraints are binding, are to zero. This statistic is easy to compute if the constrained estimation problem is actually solved by Lagrangian methods, and the multipliers are obtained as part of the calculation. The statistic LM,, asks how close to zero is the gradient of the distance criterion, evaluated at the constrained estimator. This statistic is useful when the constrained estimator is available and it is easy to compute the gradient of the distance criterion, say using the algorithm to seek minimum distance estimates. The second version of the statistic avoids computation of a generalized inverse. The statistic LM,, bears the same relationship to LM,, that W,, bears to W,,. This flavor of the test statistic is particularly convenient to calculate, as it can be obtained by auxiliary regressions starting from the constrained estimator g.,: Theorem
9.3
LM,, can be calculated by a 2SLS regression: (a) Regress V,d(z,, f?,J’ on g(z,, g,,), and retrieve fitted values VO?(z,, I?,,)‘. (b) Regress 1 on V&z,, r?“), and retrieve fitted values 9,. Then LM,, = C:= 19:. For MLE, g = V&‘, and this procedure reduces to OLS. Proof
Let y be an n-vector of l’s, X an n x k array whose rows are V,&‘, Z an n x m array whose rows are g’. The first regression yields X = Z(Z’Z)-‘Z’X, and the second regression yields 9 = X(X’_%-‘X’y. Then, (l/n)Z’Z = C?,, (l/n)Z’X = r,, (l/n)Z’y =
2225
Ch. 36: Large Sample Estimation and Hypothesis Testing
9’9 =
y’@?X)- ‘2y = y’Z(zq
Note that V,Q,(g,,) = - Gin; LM,,.
‘zx[x’z(zz)-
‘J,(g,,) = - l-:0;
‘Z’X] - ‘x’z(z’z)-
‘g,(t?,,). Substituting
‘Z’y.
terms, y^‘p= Q.E.D.
Another form of the auxiliary regression for computing LM,, arises in the case of nonlinear instrumental variable regression. Consider the model y, = k(x,, 8,) + E* Define with E(E,[ wt) = 0 and E(sf 1w,) = 02, where w, is a vector of instruments. z, = (y,, x,, wt) and g(q, 0) = w,Cy, - k(x,, @I. Then @(a, 0,) = 0 and Eg(z, ~,)g(z, &J’ = 02Ew,w:. The GMM criterion Q,(0) for this model is
the scalar g2 does not affect the optimization of this function. Consider the hypothesis ~(0,) = 0, and let I!?,,be the GMM estimator obtained subject to this hypothesis. One can compute LM,, by the following method: (a) Regress V,k(x,, 8,) on w,, and retrieve the fitted values V&. (b) Regress the residual u, = y, - k(x,, f?,,)on V,k,, and retrieve the fitted values 12,. Then LM,, = nx:, 1tif/C:= 1uf = nR2, with R2 the uncentered multiple correlation coefficient. Note that this is not in general the same as the standard R2 produced by OLS, since the denominator of that definition is the sum of squared deviations of the dependent variable about its mean. When the dependent variable has mean zero (e.g. if the nonlinear regression has an additive intercept term), the centered and uncentered definitions coincide. The approach of the distance metric test is based on the discrepancy between the value of the distance metric, evaluated at the constrained estimate, and the minimum attained by the unconstrained estimate. This estimator is particularly convenient when both the unconstrained and constrained estimators can be computed, and the estimation algorithm returns the goodness-of-fit statistics. In the case of linear or nonlinear least squares, this is the familiar test statistic based on the sum of squared residuals from the constrained and unconstrained regressions. The tests based on GMM estimation with an optimal weight matrix can be extended to any extremum estimator. Consider such an estimator, satisfying eq. (1.1). Also, let e be a restricted estimator, maximizing Q,(0) subject to a(0) = 0. Suppose that the equality H = - Z is satisfied, for the Hessian matrix H and the asymptotic variance Z [of JIV,Q,(~,)] from Theorem 3.1. This property is a generalization of the information matrix equality to any extremum estimator. For GMM estimation with optimal weight matrix, this equality is satisfied if the objective function is normalized by i, i.e. Q,(0) = +9,(8)‘8- ‘J,(0). Let 2 denote an estimator
W.K. Newey and D. McFadden
2226
of ,?Zbased on Band E an estimator
based on t?. Consider
w = ..(@[‘m’2-‘a(B), -LM = nV,&,(O)‘Z-
the following
test statistics:
2 = V@(B),
‘V&@),
DM = 2n[Q,(@ - Q,(e,]. The statistic W is analogous to the first Wald statistic in Table 2 and the statistic LM to the third LM statistic in Table 2. We could also give analogs of the other statistics in Table 2, but for brevity we leave these extensions to the reader. Under the conditions of Theorems 2.1, 3.1, and 4.1, H = - Z and the same conditions on a(0) previously given, these three test statistics will all have an asymptotic chi-squared distribution, with degrees of freedom equal to the number of components of a(8). As we have discussed, optimal GMM estimation provides one example of these statistics. The MLE also provides an example, as does optimal CMD estimation. Nonlinear least squares also fits this framework, if homoskedasticity holds and the objective function is normalized in the right way. Suppose that Var(ylx) = c?, a constant. Consider the objective function Q,(O) = (2b2)- ‘x1= 1 [yi - h(x, f3)12,where d2 is an estimator of rs2. Then it is straightforward to check that, because of the normalization of dividing by 2b2, the condition H = - Z is satisfied. In this example, the DM test statistic will have a familiar squared residual form. There are many examples of estimators where H = - Z is not satisfied. In these cases, the Wald statistic can still be used, but 2-l must be replaced by a consistent estimator of the asymptotic variance of 6. There is another version of the LM statistic that will be asymptotically equivalent to the Wald statistic in this case, but for brevity we do not describe it here. Furthermore, the DM statistic will not have a chi-squared distribution. These results are further discussed for quasi-maximum likelihood estimation by White (1982a), and for the general extremum estimator case by Gourieroux et al. (1983).
9.3.
One-step versions
qf the
trinity
Calculation of Wald or Lagrange multiplier test statistics in finite samples requires estimation of G, R, and/or A. Any convenient consistent estimates of these arrays will do, and will preserve the asymptotic equivalence of the tests under the null and local alternatives. In particular, one can evaluate terms entering the definitions of these arrays at &, t?,,,or any other consistent estimator of 8,. In sample analogs that converge to these arrays by the law of large numbers, one can freely substitute sample and population terms that leave the probability limits unchanged. For example, if z, = (y,, xr) and 8” is any consistent estimator of 8,,, then R can be estimated by (1) an analytic ex_pressio_n for Eg(z, O)g(z, O)‘, evaluated at e”,, (2) a sample average (l/n)C:= 1dz,, &Jd.q, &J, or (3) a sample average of conditional
2221
Ch. 36: Large Sample Estimation and Hypothesis Testing
expectations (lln)C:,
1
~,~,&J, x,, AMY,x,, 8,J’. These first-order
efficiency
equiv-
alences do not hold in finite samples, or even to higher orders of &I. Thus, there may be clear choices between these when higher orders of approximation are taken into account. The next result is an application of the one-step theorem in Section 3.4, and shows how one can start from any initial &-consistent estimator of 8,, and in one iteration obtain versions of the trinity that are asymptotically equivalent to versions obtained when the exact estimators e^, and GE are used. Further, the required iterations can usually be cast as regressions, so their computation is relatively elementary. Consider the GMM criterion Q,(0). Suppose gn is any consistent estimator of B0 such that $(gn -j,,) is stochastically strained maximizer of Q, and 19, be the maximizer a(6) = 0. Suppose The unconstrained 8;
the null hypothesis, one-step estimator
‘d,(e’,), satisfies &($n
tors from the Lagrangian
- e^,)L
or a local alternztive, ~(0,) = S/,/i, is true. from eq. (3.1 l), 0, = e”, - (G:R; ‘G,,) ‘CL x
0. Similarly,
first-order
bounded. Let 6” be the unconof Q subject to the constraint
define one-step
constrained
estima-
conditions:
[$[;]-[A” ‘y[“s!J Note in this definition that y = 0 is a trivial initially consistent estimator of the Lagrangian multipliers under the null or local alternatives, and that the arrays B and A can be estimated at 8,. The one-step theorem again applies, yielding fi(& - 6,) 3 0 and fi(%, - m) 3 0. Then, these one-step equivalents can be substituted in any of the test statistics of the trinity without changing their asymptotic distribution. A regression procedure for calculating the one-step expressions is often useful for computation. The adjustment from & yielding the one-step unconstrained estimator is obtained by a two-stage least squares regression of the constant one on V&z,, I%), with g(zt, 8,) as instruments; i.e. (a) Regress each component of V~d(z~, &), on g(zt, &) in the sample t = 1, . . , n, and retrieve fitted values V&(zt, &). (b) Regress 1 on V~i(z,,&); and adjust &, by the amounts of the fitted coefficients. Step (a) yields V&z,, 8,J’ = g(z,, 8J2;
‘I-,,, and step (b) yields coefficients
W.K. Newey and D. McFadden
2228
This is the adjustment indicated by the one-step theorem. Computation of one-step constrained estimators is conveniently formulae
= & + A - BP’A’(AB-‘A’): 6, = - (.4BP’A’)-
‘a(&
‘[a(&)
E - (AK
done using the
+ AA],
l/I’)- ‘[a(e’,) + AA],
with A and B evaluated at 8,. To derive these formulae from the first-order conditions fqr the Lagrangian problem, replace V,Q,(@ by the expression - (rJ2; ‘I-b) x (8” -i,,) from:the one-step definition of the unconstrained estimator, replace a(g,,) by a(fI,) + A(8, - g,,), and use the formula
9.4.
for a partitioned
inverse.
Special cases
Maximum likelihood. We have noted that maximum likelihood estimation can be treated as GMM estimation with moments equal to the score, g = V,t. The statistics in Table 2 remain the same, with the simplification that I3 = f2( = G = r). The likelihood ratio statistic 2n[L,(8,) - L,(&J], where L,(0) = (l/n) C:= 1Qz,, d), is shown by a Taylor’s expansion about g” to be asymptotically equivalent to the Wald statistic W,,, and hence to all the statistics in Table 2. Suppose one sets up an estimation problem in terms of a maximum likelihood criterion, but that one does not in fact have the true likelihood function. Suppose that in spite of this misspecification, optimization of the selected criterion yields consistent estimates. One place this commonly arises is when panel data observations are serially correlated, but one writes down the marginal likelihoods of the observations ignoring serial correlation. These are sometimes called pseudo-likelihood criteria. The resulting estimators can be interpreted as GMM estimators, so that hypotheses can be tested using the statistics in Table 2. Note however that now G # 0, so that B = G’S2- ‘G must be estimated in full, and one cannot do tests using a likelihood ratio of the pseudo-likelihood function. Least squares. Consider the nonlinear regression model y = h(x, 0) + E, and suppose cr*. Minimizing the least squares criterion E(ylx)=h(x,8)andE[{y-h(x,8)}21x]= Q,(0) = C:= 1 [y, - h(z,, Q]’ is asymptotically equivalent to GMM estimation with g(z, 19)= [y - h(x, B)]V,h(x, 13)and a distance metric R, = (a2/n) C:= 1 [V,h(x, 0,)] x [V,h(x, e,)]‘. For this problem, B = R = G. If h(z,, 0) = z,O is linear, one has g(z,, (I) = u,(@z,, where u,(0) = y, - z,O is the regression residual, and 0, = (o’/n) C:= 1z,z;. Instrumental variables. Consider the regression model y, = h(z,, 0,) + E, where E, may be correlated with V,h(z,,O,). Suppose there are instruments w such that E(e,I w,) = 0. For this problem, one has the moment conditions g(y,, z,, w,, 0) = [y, - h(z,, U)]f(w,) satisfying Eg( yt, z,, w,, 0,) = 0 for any vector of functions f(w) of
Ch. 36: Large Sample Estimation
the instruments,
and Hypothesis
so the GMM
criterion
2229
Testing
becomes
’
{Y, Qn(4= f ; ; 1Y, - Nz,,Q))f(w,) 0, ’ t, *Cl f 1 1 [
- et, e))f(w,)1
with 0, = (cr2/n)C:= rf(w,)f(w,)‘. Suppose that it were feasible to construct the conditional expectation of the gradient of the regression function conditioned on w, qt = E[V&(z,, S,)\ wJ. This is the optimal vector of functions of the instruments, in the sense that the GMM estimator based on f(w) = q will yield estimators with an asymptotic covariance matrix that is smaller in the positive definite sense than any other distinct vector of functions of w. A feasible GMM estimator with good efficiency properties may then be obtained by first obtaining a preliminary,,& consistent estimator 8,, employing a simple practical distance metric, second, regressing V&z,, 8”) on a flexible family of functions of wt, such as low-order polynomials in w, and, third, using fitted values from this regression as the vector of functions f(w,) in a final GMM estimation. Note that only one Newton-Raphson step is needed in the last stage. Simplifications of this problem result when h(z, 0) = z0 is linear in 8; in this case, the feasible procedure above is simply 2SLS, and no iteration is needed. Simple hypotheses. An important practical case of the general nonlinear hypothesis ~(0,) = 0 is that a subset of the parameters are zero. (A hypothesis that parameters equal constants other than zero can be reduced to this case by reparametrization.) and H,: /I = 0. The first-order Assume@=(,xf+fl~r) of this problem Y,=
are 0 = &V,Q,(i?,,),
0 = &VsQ,,(g,,)
conditions
+ &y,,
and 0 = &,, implying
-V~Q.(R.),andA=[~~~_~~~~~ . Let C = B- ’ be the asymptotic
matrix of &(gn
- 0,), and AB- ‘A’ = C,, the submatrix
sions about t?”of the first-order
conditions
imply ,,&(a.
for solution
covariance
of C for j?. Taylor’s expan- CI,)= - B&I,,
fib”
+ or
A and &Y, = C& - ~~a~~l~,,J&~n + op= j?‘C&lfln + op. Then the Wald statistics are
One can check the asymptotic
equivalence
of these statistics
by substituting
the
expression for &(&-c(,). The LM statistic, in any version, becomes LM, = nV,Qn(t)n)‘CssV,Q,(B,). Recall that B, hence C, can be evaluated at any consistent estimator of 8,. In particular, the constrained estimator is consistent under the null
W.K. Newey and D. McFadden
2230
or under local alternatives. The LM testing procedure for this case is then to (a) compute the constrained estimator Cr,,subject to the condition /3 = 0, (b) calculate the gradient and Hessian of Q, with respect to the full parameter vector, evaluated at cl, and /I = 0, and (c) form the quadratic form above for LM, from the /I part of the gradient and the /? submatrix of the inverse of the Hessian. Note that this does not require any iteration of the GMM criterion with respect to the full parameter vector. It is also possible to carry out the calculation of the LM, test statistic using auxiliary regressions. This could be done using the auxiliary regression technique introduced earlier for the calculation of LM,, in the case of any nonlinear hypothesis, but a variant is available for this case that reduces the size of the regressions required. The steps are as follows: (a) Regress VJ(z,,8,) and V,J(z,, I?,,)’ on g(z,, t?,,), and retrieve the fitted values V,?(Z~, g,J and V$~(Z,, I?“). (b) Regress VD?(z,, 0,) on Vol?(z(zr, f?,J, and retrieve the residual u(z,, I$,). (c) Regress the constant 1 on the residual u(z,, g,,), and calculate the sum of squares of the$rted values of 1. This quantity is LM,. To justify this method, start from the gradient of the GMM criterion, 0=
V,Q&,,O)= - %fl;'&(&,O), v,Q,@,>O) = - G,,~;'&@,,O),
where G, is partitioned into its CI and /I submatrices. From partitioned inverses, one has for C = BP1 the expression c,,
= [z-,n-rr;
the formula
- T~~-‘Tb(T,R-‘r~)-‘r,n-‘r~]-‘.
The fitted values from step (a) satisfy 6,) = g(zr, e,)fl,
‘G&,
V&z,, e,)’ = g(z,, e,)a,
‘CL,.
V&z,, and
Then the residuals
from step (b) satisfy
u(z,, e,) = g(zf, e,)fiR 1G& - g(zf, @,)a; ’ G;,(G,$;
r G;,) - ’ G&2;
t G&.
Then f,
zl u(zt, &I’ = VsQ .(s-
f
(9’- V,Q,b%n W(G,,$, l G;J - ’ G,&‘n-1Gbs
for the
Ch. 36: Large Sample Estimation
and Hypothesis
2231
Testing
Then, the step (c) regression yields LM,. In the case of maximum estimation, step (a) is redundant and can be omitted.
9.5.
Tests for overidentifying
likelihood
restrictions
Consider the GMM estimator based on moments g(z,, 0), where g is m x 1,O is k x 1, and m > k, so there are overidentifying moments. Thecriterion
Q,(@ = -
$M9’fJ, ’ &(@,
evaluated at its maximizing argument @,,for any 0,&n, has the property that - 2n& = - 2nQ,(&) Lx:_, under the null hypothesis that Eg(z, 0,) = 0. This statistic then provides a specification test for the overidentifying moment_ in g. It can also be used as an indicator for convergence in numerical search for 0,. To demonstrate
this result, recall from eqs. (9.1) and (9.2) that - a-
02c,% % m N(0, I) and &(f!?” - 0,) = B- ’ G’O- ‘j2%!” + op. Then, pansion yields
&c?,(&)= - fii’2%, + G,,(G$?; 1G,)where R, = I - 0; 112Gn(Gb12;1G,)~ ‘Gkf2; - 2nQ,(6,,) = %:R,%,
lGp;
1’2&&(tI,)
=
a Taylor’s
ex-
1’2@,+ op = - 12;‘2~,q, +o,,
‘I2 is idempotent
of rank m - k. Then
+ op Ax:_,.
Suppose that instead of estimating 6 using the full list of moments, one uses a linear combination Lg(z, f3), where L is r x m with k < r < m. In particular, L may select a subset of the moments. Let t?” denote the GMM estimator obtained from these moment combinations, and assume the identification conditions are satisfied so I?,, is Jn-consistent. Then the statistic S = nd,,(e,)‘l2; ‘I2 R,,R; 1’2d,(8,) -% xi_, under If,, and this statistic
is asymptotically
equivalent
to the statistic
- 2nQ,(&).
This result holds for any &-consistent estimator $” of 8,, not necessarily the optimal GMM estimator for the moments Lg(z, 0), or even an initially consistent estimator based on only these moments. The distance metric in the center of the quadratic form S does not depend on L, so that the formula for the statistic is invariant with respect to the choice of the initially consistent estima’tor. This implies in particular that the test statistics S for overidentifying restrictions, starting from
W.K. Newey and D. McFadden
2232
different subsets of the moment conditions, are all asymptotically equivalent. HOWever, the presence of the idempotent matrix R, in the center of the quadratic form S is critical to its statistical properties. Only the GMM distance metric criterion using all moments, evaluated at &, is asymptotically equivalent to S. Substitution of another &-consistent estimator @,, in equivalent version of S, but - 2nQ,(gJ is not These results are a simple coroLlary of the one-step estimator of 6” is Jn(O, - 8,,) = -
place of 8n yields an asymptotically asymptotically chi-square distributed. one-step theorem. Starting from gn, the (Glfl; ‘G,)) ‘Gifl; ‘d,(g”). Then, one
has a one-step estimator ,/;lS,,($,,) = JnJ,,(gJ + G,,,I@~~ - e”,) = 0; “‘R,f2; ‘I2 x &ti,(e”,). Substituting this expression in the formula for - 2nQ,,(fi,J yields the statistic S. The test for overidentifying restrictions can be recast as an LM test by artificially embedding the original model in a richer model. Partition the moments
dz, 4 =
Y’k4 [ Y2(Z,@) 1’
where g1 is k x 1 with G, = EV,g’(z,8,) of rank G, = EV0g2(z, 0,). Embed this in the model
where II/ is an (m - k) vector of additional parameters. GMM estimation of this expanded model is
k, and
g2 is (m-k)
The first-order
x 1 with
condition
for
The second block of conditions are satisfied by $,, = &((?,,), no matter what I!?,,,so g,, is determined by 0 = G’J2; lgi(e,). This is simply the estimator obtained from the first block of moments, and coincides with the earlier definition of g”. Thus, unconstrained estimation of the expanded model coincides with restricted estimation of the original model. Next consider GMM estimation of the expanded model subject to H,: $ = 0. This constrained estimation obviously coincides with GMM estimation using all moments in the original model, and yields e^,. Thus, constrained estimation of the expanded model coincides with unrestricted estimation of the original model. The distance metric test statistic for the constraint Ic/= 0 in the expanded model is DM, = - 2n[&(&,,, 0) - &(q,,, $J] = - 2nQ,(&), where Q denotes the criterion as a function of the expanded parameter list. One has Q,(6”,0) = Q,,(8”) from the coincidence of the constrained expanded model estimator and the unrestricted
2233
Ch. 36: Large Sample Estimation and Hypothesis Testing
original model estimator, and one has Q,(e,, $,,) = 0 since the number of moments equals the number of parameters. Then, the test statistic - 2nQ,(&) for overidentifying restrictions is identical to a distance metric test in the expanded model, and hence asymptotically equivalent to any of the trinity of tests for H,: II/ = 0 in the expanded model. We give four examples of econometric problems that can be formulated as tests for overidentifying restrictions: Example
9.1
If y = xb + F with E(E(x) = 0, E(E’ Ix) = 02, then the moments
4Y - XP) g’(zJJ)
=
[
(Y_xp)2_,2
1
can be used to estimate p and 02. If E is normal, MLE. Normality can be tested via the additional kurtosis,
s2k B)= Example
(Y - xP)“la” ( y - xj?)“/a” -. 3 [
then these GMM estimators moments that give skewness
are and
1
9.2
In the linear model y = xb + E with E(E[x) = 0 and E(E~B,Ix) = 0 for t # s, but with possible heteroskedasticity of unknown form, one gets the OLS estimates b of /I and V(b) = s2(X’X)-’ under the null hypothesis of homoskedasticity. A test for homoskedasticity can be based on the population moments 0 = E vecu[x’x(e2 - 02)], where “vecu” means the vector formed from the upper triangle of the array. The sample value of this moment vector is
I gl
vecui
.+A (Y, - xtBY- s’> ,
I
1
the difference between the White robust estimator of vecu [X’f2X]. Example
and the standard
OLS estimator
9.3
If P(z, 8) is the log-likelihood of an observation, and H^,is the MLE, then an additional moment condition that should hold if the model is specified correctly is the information matrix equality 0 = EV,,,/‘(z, 4,) + EV,/(z, U,)V,/(z, 8,)‘.
W.K. Nrwey and D. McFudden
2234
The sample analog is White’s information matrix test, which then can be interpreted as a GMM test for overidentifying restrictions. Example
9.4
In the nonlinear model y = h(x, 0) + E with E(E~x) = 0, and e, a GMM estimator based on moments w(x)[y - h(x, fI)], w h ere w(x) is some vector of functions of x, suppose one is interested in testing the stronger assumption that E is independent of x. A necessary and sufficient condition for independence is E[w(x) - Ew(x)] x f[ y - h(x, 19,)] = 0 for every function f and vector of functions w for which the moments exist. A specification test can be based on a selection of such moments. 9.6.
Specijication
tests in linear models5’
GMM tests for overidentifying restrictions have particularly convenient forms in linear models; see Newey and West (1988) and Hansen and Singleton (1982). Three standard specification tests will be shown to have this interpretation. We summarize a few properties of projections that will be used in the following discussion. Let Yp, = X(X/X)-X denote the projection matrix from R” onto the linear subspace X spanned by an n x p array X. (We use a Moore-Penrose generalized inverse in the definition of Yx to handle the possibility that X is less than full rank; see Section 9.8.) Let 2, = I - Yp, denote the projection matrix onto the linear subspace orthogonal to X. Note that Yx and sx are idempotent. If X is a subspace generated by an array X and w is a subspace generated by an array W = [X Z] that contains X, then Y’xY’w = ??,&Fx = gx; i.e. a projection onto a subspace is left invariant by a further projection onto a larger subspace, and a two-stage projection onto a large subspace followed by a projection onto a smaller one is the same as projecting directly onto the smaller one. The subspace of VVthat is orthogonal to X is generated by 2x W; i.e., it is the set of linear combinations of the residuals, orthogonal to X, obtained by regressing Won X. Any y in R” has a unique decomposition y = Yxy + sxYwy + ZIwy into the sum of projections onto X, the subspace of W orthogonal to X, and the subspace orthogonal to W. The projection Z?x.Yw can be rewritten %xYw = g’w - Yx = Y’w9, = &Y&, or since $x W = Sx[X Z] = [0 2xZ], z?xP’w = 9 Ll,yw-- Pdxz = 2,Z(Z’9,Z)mZ’LL?x. This implies that 2xYw is idempotent since (Z!,.Y\,)($,Y’,) = Z!.x(YwZ?x)Y’w = 2,(Z?!,Y’,)g, = sxYw. Omitted variables test: Consider the regression model y = X/3 + E,where y is n x 1, X is n x k, E(EIX) = 0, and E(EE’IX) = a2Z. Suppose one has the hypothesis H,: B1 = 0, where /I1 is a p x 1 subvector of /I. Define u = y - Xb to be the residual associated with an estimator b of /I. The GMM criterion is then 2nQ = u’X(X’X)-‘. X’u/a2. The projection matrix Px = X(X/X)-‘X’ that appears in the center of this criterion can obviously be decomposed as Yx = g)x2 + (9X - Yx,). Under H,, “Paul
Ruud contributed
substantially
to this section.
2235
Ch. 36: Large Sample Estimation and Hypothesis Testing
u = y - X,b, and X’u can be interpreted as k = p + q overidentifying moments for the q parameters p2. Then, the GMM test statistic for overidentifying restrictions is the minimum value - 2n& in b, of u’PP,u/a2. But Pp,u = I?pxZu + (Yx - Ypx,)y and minb2u’Yx2u = 0 (at the OLS estimator under H, that makes u orthogonal to X2). Then - 2~0, = ~‘(9’~ - PPx,)y/a 2. The unknown variance c? in this formula can be replaced by any consistent estimators 2, in particular the estimated variance of the disturbance from either the restricted or the unrestricted regression, without altering the asymptotic distribution, which is xf under the null hypothesis. The statistic - 2n& has three alternative interpretations. First, - 2n& = y’P,ylcT2 - Y’pxz Y/u2 =
SSR,,
- SSR, g2 ’
which is the difference of the sum of squared residuals from the restricted regression under If, and the sum of squared residuals from the unrestricted regression, normalized by a2. This is a large sample version of the usual finite sample F-test for H,. Second, note that the fitted value of the dependent variable from the restricted regression is Jo = Px2y, and from the unrestricted regression is 9, = .Ppxy, so that - 24,
= (9b90 - p:9,)/a’
= (90 - 9”)‘(90 - 9”)/@’ = IIPO- 9” II”/~‘.
Then, the statistic is calculated from the distance between the fitted values of the dependent variable with and without H, imposed. Note that this computation requires no covariance matrix calculations. Third, let b, denote the GMM estimator restricted by H, and b, denote the unrestricted GMM estimator. Then, b, consists of the OLS estimator for p2 and the hypothesized value 0 for /II, while b, is the OLS estimator for the full parameter vector. Note that j0 = Xb, and 9, = Xb,, so that j0 - 9, = X(b, - b,). Then - 24,
= (b, - b,)‘(X’X/a2)(b,
- b,) = (b, - b,)‘V(b,)-
‘(6, - b,).
This is the Wald statistic W,,. From the equivalent form W,, of the Wald statistic, this can also be written as a quadratic form - 2~10, = b;,,V(b,,,)-lb,,,, where b,,, is the subvector of unrestricted estimates for the parameters that are zero under the null hypothesis. The Hausman exogeneity test: Consider the regression y = X,/I, + X2B2 + X3/jj + E, and the null hypothesis that X, is exogenous, where X2 is known to be exogenous, and X, is known to be endogenous. Suppose N is an array of instruments, including X2, that are sufficient to identify the coefficients when the hypothesis is false. Let W = [N X,] be the full set of instruments available when the null hypothesis is tfue. Then the best instruments under the null hypothesis are XAO= 9,X = [Xl, X,X,], and the best instruments under the alternative are ,XU =Yp,X E [X, X2 X,]. The test statistic for overidentifying restrictions is - 2nQ, = y’(Ppx, .Yg,)y/a’, as in the previous case. This can be written - 2nQ, = (SSRi” - SSRi0)/a2,
WK. Newey and D. McFadden
2236
with the numerator the difference in sum of squared residuals from an OLS regression of y on 2, and an OLS regression of y on r?,. Also, - 2nQ^, = 11 jf, - jiu II’/o’, the difference between on 2,. Finally,
the fitted values of y from a regression
- 2nQ^, = (b,s,s, - ~2s&‘CVb2s&
-
on 2, and a regression
W2~d-(b2~~~o - b2dT
an extension of the Hausman-Taylor exogeneity test to the problem where some variables are suspect and others are known to be exogenous. Newey and West (1988) show that the matrix in the center of this quadratic form has rank equal to the rank of X,, and that the test statistic can be written equivalently as a quadratic form in the subvector of differences of the 2SLS estimates for the X, coefficients, with the ordinary inverse of the corresponding submatrix of differences of variances in the center of the quadratic form. Testingfor overidentifying restrictions in a structural system: Consider an equation y = X/I + E from a system of simultaneous equations, and let W denote the array of instruments (exogenous and predetermined variables) in the system. Let _? = 9,X denote the fitted values of X obtained from OLS estimation of the reduced form. The equation is oueridentijied if the number of instruments W exceeds the number of right-hand-side variables X. The GMM test statistic for overidentification is the minimum in fi of - 2nQ,(/?) = u’P,u/a2
= u’Piu/a2
+ ~‘(9~
- Pg)u/o’,
where u = y - Xg. As before, - 2n& = y’(P’, - Pi)y/a’. Under H,, this statistic is asymptotically chi-squared distributed with degrees of freedom equal to the difference in ranks of Wand 2. This statistic can be interpreted as the difference in the sum of squared residuals from the 2SLS regression of y on X and the sum of squared residuals from the reduced form regression of y on W, normalized by CJ~. A computationally convenient equivalent form is - 2n& = II J&, - $2 (I2/a2, the sum of squares of the difference between the reduced form fitted values and the 2SLS fitted values of y, normalized by c2. Finally, - 2n& = y’sgP,,,gky/02 = nR2/a2, where R2 is the multiple correlation coefficient from regressing the 2SLS residuals on all the instruments; this result follows from the equivalent formulae for the projection onto the subspace of VWorthogonal to the subspace spanned by 2;;. This test statistic does not have a version that can be written as a quadratic form with the wings containing a difference of coefficient estimates from the 2SLS and reduced form regressions.
9.7.
Specification
testing
in multinomial
models
As applications of GMM testing, we consider hypotheses arising in the context of analysis of discrete response data. The first example is a test for omitted variables
Ck. 36: Larye Sample Estimation and Hypothesis
Testing
2237
in multinomial data, which extends to various tests of functional specification by introduction of appropriate omitted variables. The second example tests for the presence of random effects in discrete panel data. Example
9.5
Suppose J multinomial outcomes are indexed C = { 1,. . , J}. Define z = (d,, . , d,, x), where d, is one if outcome j is observed, and zero otherwise. The x are exogenous variables. The log-likelihood of an observation is
e) = C di log P,(i, X, e),
e(Z,
SC
where P&i,x, 19) is the probability that i is observed from C, given x. Suppose 0 = (CI,/?), and the null hypothesis He: fi = 0. We derive an LM test starting from the maximum likelihood estimates of a under the constraint fi = 0. Define ui = [di - P,(i, X, gn)]Pc(i, x, c??,,-‘I’,
qi = P&,x,
~J1’*VO logPc(i,
x, GJ.
Then, in a sample t = 1,. . . , n, one has (l/n) C:= 1 V&(zt, g,,) E (l/n) C:= 1CiEc qituit. Also, (l/n) C:= i Cisc qiqj 3 R since fl=
-EVged~
= E C
-EVsC
iEC
[di - Pc(i,x,8o)]VslogPc(i,x,Bo)
Pc(i,x,0e)[VtilogP(i,x,e)][V,logP(i,x,~)]’.
icC
Then,
This statistic can be computed from the sum of squares of the fitted values of uit from an auxiliary regression over i and t of uit on qil. If R2 is the multiple correlation coefficient from this regression, and U is the sample mean of the uir, then LM,, = n( J - 1)R2 + (1 - R2)d2. McFadden (1987) shows for the multinomial logit model that the Hausman and McFadden (1984) test for the independence from irrelevant alternatives property of this model can be calculated as an omitted variable test of the form above, where the omitted variables are interactions of the original variables and dummy variables for subsets of C where nonindependence is suspected. Similarly, Lagrange multiplier tests of the logit model against nested logit alternatives can be cast as omitted
W.K. Newey and D. McFadden
2238
variable tests where the omitted variables are interactions of dummy variables for suspect subsets A of C and variables of the form log[P,(1’, x, I!?,,)/C,,, P,(i, x, e,)]. Example 9.6 We develop a Lagrange multiplier test for unobserved heterogeneity in discrete panel data. A case is observed to be either in state d, = + 1 or d, = - 1 in periods t = 1,. . . , T. A probability model for these observations that allows unobserved heterogeneity is
. . , xT are exogenous, PI,. . , /jT and 6 are parameters, F is a cumulative function for a density that is symmetric about zero, and v is an “case effect” heterogeneity. The density h(v) is normalized so that Ev = 0 1. = 0, this model reduces to a series of independent Bernoulli trials,
where x r,. distribution unobserved and Ev2= When 6
P(d,,. ..,d,lx,,.
.,x,,Bl>.
.,/LO)
=
fi
f’k&xtBtL
1=1
and is easily estimated. For example, F normal yields binary probits, and F logistic yields binary logits. A Lagrange multiplier test for 6 = 0 will detect the presence of unobserved heterogeneity across cases. Assume a sample of n cases, drawn randomly from the population. The LM test statistic is
[
LM=
n C (Vd2/n-
i
(VAVJYln
Cw
I[
12
1 (VpW&Yln 1
where e is the log-likelihood of the case, V,/ = (VP,/, , V,,Z?), and all the derivatives are evaluated at 6 = 0 and the Bernoulli model estimates of /I. The j? derivatives are straightforward, e,t = d,x,f(d,x,P,)/F(d,x,B,), where f is the density I’Hopital’s rule:
la=;
~-
of F. The 6 derivative
is more delicate,
f(4xtBJ’ + i 4f(4x,B,) 2 t=I F(d,x,&) II ’ W,x,BJ2 I[
requiring
use of
Ch. 36: Large Sample Estimation
and Hypothesis
2239
Testing
The reason for introducing 6 in the form above, so J&J appeared in the probability, was to get a statistic where C V& was not identically zero. The alternative would have been to develop the test statistic in terms of the first non-identically zero higher derivative; see Lee and Chesher (1986). The LM statistic can be calculated by regressing the constant 1 on V& and V,,P, . . . ) V,“e, where all these derivatives are evaluated at 6 = 0 and the Bernoulli model estimates, and then forming the sum of squares of the fitted values. Note that the LM statistic is independent of the shape of the heterogeneity distribution h(v), and is thus a “robust” test against heterogeneity of any form.
9.8.
Technicalities
Some test statistics are conveniently defined using generalized inverses. This section gives a constructive definition of a generalized inverse, and lists some of its properties. A matrix ,& is a Moore-Penrose generalized inverse of a matrix ,A, k if it has three properties: (i) AA-A = A, (ii) A-AA= A-, (iii) AA _ and A -A are symmetric. There are other generalized inverse definitions that have some, but not all, of these properties; in particular A + will denote any matrix that satisfies (i). First, a method for constructing a generalized inverse is described, and then some of the implications of the definition are developed. The construction is called the singular value decomposition (SVD) of a matrix, and is of independent interest as a tool for finding the eigenvalues and eigenvectors of a symmetric matrix, and for calculation of inverses of moment matrices of data with high multicollinearity; see Press et al. (1986) for computational algorithms and programs. Lemma 9.4
Every real m x k matrix A of rank r can be decomposed
into a product
A=UDV mxk
mxrrxrrxk’
where D is a diagonal matrix with positive diagonal, and U and V are column-orthonormal;
nonincreasing elements i.e. U’U = I, = V’V.
down
the
Proof
The m x m matrix AA’ is symmetric and positive semi-definite. Then, there exists an m x m orthonormal matrix W, partitioned W = [W, W,] with WI of dimension m x r, such that w;(AA’)W, = G is diagonal with positive, nonincreasing diagonal
WK. Newey and D. McFadden
2240
elements, and W;(AA’)W, = 0, implying A’W, = 0. Define D from G by replacing the diagonal elements of G by their positive square roots. Note that W' W = I = W W’ = W, W; + W, W;. Define U = W, and V’ = D-l U’A. Then, U’U = I, and V’V = D~‘U’AUD~‘=D-‘GD-‘=I,.Further,A=(Z,-W,W;)A=UU’A=UD1/‘.This Q.E.D. establishes the decomposition. Note that if A is symmetric, then U is the array of eigenvectors of A corresponding to the nonzero roots, so that A’U = UD,, with D, the r x r diagonal matrix with the nonzero eigenvalues in descending magnitude down the diagonal. In this case, V = A’UD-’ = UD,D-‘. Since the elements of D, and D are identical except possibly for sign, the columns of U and V are either equal (for positive roots) or reversed in sign (for negative roots). Lemma 9.5 The Moore-Penrose generalized inverse of an m x k matrix A is the matrix A- = V D-l U’ Let A’ denote any matrix, including A-, that satisfies AA+A = A. kxr
rxr
rxm
These matrices satisfy: (1) A+ = A- ’ if A is square and nonsingular. (2) The system of equations Ax = y has a solution if and only if y = AA+y, and the linear subspace of all solutions is the set ofvectors x = A+y + [Z - A+A]z for all ZERk.
(3) AA+ and A+A are idempotent. (4) If A is idempotent, then A = A-. (5) If A = BCD with B and D nonsingular, A+ = D-‘C+B-’ satisfies AA+A = A.
then A- = D-‘C-B-‘,
and any matrix
Proof Elementary;
see Pringle
and Rayner
(1971).
Lemma 9.6 If A is square, symmetric, and positive semi-definite of rank r, then (1) There exist Q positive definite and R idempotent of rank r such that A = QRQ and A- = Q-‘RQ-‘. (2) There exists kt, column-orthonormal such that U’AU = D is nonsingular diagonal and A- = U(U’AU)- ’ U’. (3) A has a symmetric square root B = A”‘, and A- = B-B-. Proof Let W = [U W,] diagonal W’,R=
be an orthogonal
matrix
matrix of positive eigenvalues, 1, 0 W [ 00
1
w’ and B = UD’i2U’. ’
diagonalizing
A. Then,
U’AU = D, a
ID:1:_.
and A W, = 0. Define Q = W
Q.E.D.
Ch. 36: Large Sample Estimation and
2241
Hypothesis Testing
Lemma 9.7 If y - N(A,I, A), with A of rank I, and A+ is any symmetric matrix satisfying AA+A = A, then y’A+y is noncentral chi-square distributed with I degrees of freedom and noncentrality parameter ,l’Al.
Proof Let W = [U W,] be an orthonormal matrix that diagonalizes A, as in the proof of Lemma 9.6, with U’AU = D, a positive diagonal r x r matrix, and W’AW, = 0, implying
A W, = 0. Then, the nonsingular
mean [ Dm1’F’A’]
and covariance
transformation
z=
matrix
buted N(D- “2U’A2,Z,), z2 = W,y = 0, implying w’y = [Dli2z, 01. It is standard that z’z has a noncentral chi-square distribution with r degrees of freedom and noncentrality parameter A’AUD-‘U’AA = 2’A;1. The condition A = AA+A implies U’AU = U’AWW’A+ WW’AU, or D = [DO]W’A+
W[DO]‘=
Hence, U’A+U = D-l. y’A+y = y’WW’A+
D(U’A+U)D.
Then WW’y = [z;D”~O](W’A+
= z;D”~(U’A+U)D~‘~Z~
W)[D”2z;
01’
= z;zl. Q.E.D.
References Ait-Sahalia, Y. (1993) “Asymptotic Theory for Functionals of Kernel Estimators”, MIT Ph.D. thesis. Amemiya, T. (1973) “Regression Analysis When the Dependent Variable is Truncated Normal”. Econometrica, 41, 997-1016. Amemiya, T. (1974) “The Nonlinear Two-Stage Least-Squares Estimator”, Journal of Econometrics, 2, 105-l 10. Amemiya, T. (1985) Advanced Econometrics, Cambridge, MA: Harvard University Press. Andersen, P.K. and R.D. Gill (1982) “Cox’s Regression Model for Counting Processes: A Large Sample Study”, The Annals of Statistics, 19, 1100-1120. Andrews, D.W.K. (1990) “Asymptotics for Semiparametric Econometric Models: I. Estimation and Testing”, Cowles Foundation Discussion Paper No. 908R. Andrews, D.W.K. (1992) “Generic Uniform Convergence”, Econometric Theory, 8,241-257. Andrews, D.W.K. (1994) “Empirical Process Methods in Econometrics”, in: R. Engle and D. McFadden, eds., Handbook ofEconometrics, Vol. 4, Amsterdam: North-Holland. Barro, R.J. (1977) “Unanticipated Money Growth and Unemployment in the United States”, American Economic Reoiew, 67, 101-115.
2242
W.K. Newey and D. McFadden
Bartle, R.G. (1966) The Elements oflntegration, New York: John Wiley and Sons. Bates, C.E. and H. White (1992) “Determination of Estimators with Minimum Asymptotic Covariance Matrices”, preprint, University of California, San Diego. Berndt, E.R., B.H. Hall, R.E. Hall and J.A. Hausman (1974) “Estimation and Inference in Nonlinear Structural Models”, Annals of Economic and Social Measurement, 3,653-666. Bickel, P. (1982) “On Adaptive Estimation,” Annals of Statistics, 10, 6477671. Bickel, P., C.A.J. Klaassen, Y. Ritov and J.A. Wellner (1992) “Efficient and Adaptive Inference in Semiparametric Models” Forthcoming monograph, Baltimore, MD: Johns Hopkins University Press. Billingsley, P. (1968) Convergence ofProbability Measures, New York: Wiley. Bloomfeld, P. and W.L. Steiger (1983) Least Absolute Deviations: Theory, Applications, and Algorithms, Boston: Birkhauser. Brown, B.W. (1983) “The Identification Problem in Systems Nonlinear in the Variables”, Econometrica, 51, 175-196. Burguete, J., A.R. Gallant and G. Souza (1982) “On the Unification of the Asymptotic Theory of Nonlinear Econometric Models”, Econometric Reviews, 1, 151-190. Carroll, R.J. (1982) “Adapting for Heteroskedasticity in Linear Models”, Annals of Statistics, 10,1224&1233. Chamberlain, G. (1982) “Multivariate Regression Models for Panel Data”, Journal of Econometrics, 18, 5-46. Chamberlain, G. (1987) “Asymptotic Efficiency in Estimation with Conditional Moment Restrictions”, Journal of Econometrics, 34, 305-334. Chesher, A. (1984) “Testing for Neglected Heterogeneity”, Econometrica, 52, 865-872. Chiang, C.L. (1956) “On Regular Best Asymptotically Normal Estimates”, Annals of Mathematical Statistics, 27, 336-351. Daniels, H.E. (1961) “The Asymptotic Efficiency of a Maximum Likelihood Estimator”, in: Fourth Berkeley Symposium on Mathematical Statistics and Probability, pp. 151-163, Berkeley: University of California Press. Davidson, R. and J. MacKinnon (1984) “Convenient Tests for Probit and Logit Models”, Journal of Econometrics, 25, 241-262. Eichenbaum, M.S., L.P. Hansen and K.J. Singleton (1988) “A Time Series Analysis of Representative Agent Models of Consumption and Leisure Choice Under Uncertainty”, Quarterly Journal of Economics, 103, 5 l-78. Eicker, F. (1967) “Limit Theorems for Regressions with Unequal and Dependent Errors”, in: L.M. LeCam and J. Neyman, eds., Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press. Fair, R.C. and D.M. Jaffee (1972) “Methods of Estimation for Markets in Disequilibrium”, Econometrica, 40,497-514. Ferguson, T.S. (1958) “A Method of Generating Best Asymptotically Normal Estimates with Application to the Estimation of Bacterial Densities”, Annals of Mathematical Statistics, 29, 1046-1062. Fisher, F.M. (1976) The Identification Problem in Econometrics, New York: Krieger. Fisher, R.A. (1921) “On the Mathematical Foundations of Theoretical Statistics”, Philosophical Transactions, A, 222, 309-368. Fisher, R.A. (1925) “Theory of Statistical Estimation”, Proceedings of the Cambridge Philosophical Society, 22, 700-725. Gourieroux, C., A. Monfort and A. Trognon (1983) “Testing Nested or Nonnested Hypotheses”, Journal of Econometrics, 21, 83-l 15. Gourieroux, C., A. Monfort and A. Trognon (1984) “Psuedo Maximum Likelihood Methods: Theory”, Econometrica, 52, 68 l-700. Hajek, J. (1970) “A Characterization of Limiting Distributions of Regular Estimates”, Z. Wahrscheinlichkeitstheorie uerw. Geb., 14, 323-330. Hansen, L.P. 
(1982) “Large Sample Properties of Generalized Method of Moments Estimators”, Econometrica, 50, 1029-1054.
Ch. 36: Large Sample Estimation
and Hypothesis
Testing
2243
Hansen, L.P. (1985a) “A Method for Calculating Bounds on the Asymptotic Covariance Matrices of Generalized Method of Moments Estimators”, Journal ofEconometrics, 30, 203-238. Discussion, December meetings of the Hansen, L.P. (1985b) “Notes on Two Step GMM Estimators”, Econometric Society. Hansen, L.P. and K.J. Singleton (1982) “Generalized Instrumental Variable Estimation of Nonlinear Rational Expectations Models”, Econometrica, 50, 1269-1286. Hansen, L.P., J. Heaton and R. Jagannathan (I 992) “Econometric Evaluation of Intertemporal Asset Pricing Models Using Volatility Bounds”, mimeo, University of Chicago. Hardle, W. (1990) Applied Nonparametric Regression, Cambridge: Cambridge University Press. Hiirdle, W. and 0. Linton (1994) “Nonparametric Regression”, in: R. Engle and D. McFadden, eds., Handbook of Econometrics, Vol. 4, Amsterdam: North-Holland. Hausman, J.A. (1978) “Specification Tests in Econometrics”, Econometrica, 46, 1251-1271. Hausman, J.A. and D. McFadden (1984) “Specification Tests for the Multinomial Logit Model”, Econometrica, 52, I2 19-l 240. Heckman, J.J. (1976) “The Common Structure of Statistical Models of Truncation, Sample Selection, and Limited Dependent Variables and a Simple Estimator for Such Models”, Annals ofEconomic and Social Measurement, 5,475-492. Honor&, B.E. (1992) “Timmed LAD and Least Squares Estimation of Truncated and Censored Models with Fixed Effects”, Econometrica, 60, 533-565. Honor& B.E. and J.L. Powell (1992) “Pairwise Difference Estimators of Linear, Censored, and Truncated Regression Models”, mimeo, Northwestern University. Huber, P.J. (1964) “Robust Estimation of a Location Parameter”, Annals ofMathematical Statistics, 35, 73-101. Huber, P. (1967) “The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions”, in: L.M. LeCam and J. Neyman, eds., Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press. Huber, P. (1981) Robust Statistics, New York: Wiley. Ibragimov, LA. and R.Z. Has’minskii (1981) Statistical Estimation: Asymptotic Theory, New York: Springer-Verlag. Jennrich (1969), “Asymptotic Properties of Nonlinear Least Squares Estimators”, Annals of Mathematical Statistics, 20, 633-643. Koenker, R. and G. Bassett (1978) “Regression Quantiles”, Econometrica, 46, 33-50. LeCam, L. (1956) “On the Asymptotic Theory of Estimation and Testing Hypotheses”, in: L.M. LeCam and J. Neyman, eds., Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 129-156, Berkeley: University of California Press. Lee, L. F. and A. Chesher (1986) “Specification Testing when the Score Statistics are Identically Zero”, Journal ofEconometrics, 31, 121-149. Maasoumi, E. and P.C.B. Phillips (1982) “On the Behavior of Inconsistent Instrumental Variables Estimators”, Journal ofEconometrics, 19, 183-201. Malinvaud, E. (1970) “The Consistency of Nonlinear Regressions”, Annals of Mathematical Statistics, 41,956-969. Manski, C. (1975) “Maximum Score Estimation of the Stochastic Utility Model of Choice”, Journal of Econometrics, 3, 205-228. McDonald, J.B. and W.K. Newey (1988) “Partially Adaptive Estimation of Regression Models Via the Generalized T Distribution”, Econometric Theory, 4, 428-457. McFadden, D. (1987) “Regression-Based Specification Tests for the Multinomial Logit Model”, Journal of Econometrics, 34, 63-82. McFadden, D. 
(1989) “A Method of Simulated Moments for Estimation of Multinomial Discrete Response Models Without Numerical Integration”, Econometricu, 57, 995-1026. McFadden, D. (1990) “An Introduction to Asymptotic Theory: Lecture Notes for 14.381”, mimeo, MIT.
W.K. Newey and D. McFadden
2244
Newey, W.K. (1984) “A Method of Moments Interpretation of Sequential Estimators”, Economics Letters, 14, 201-206. Newey, W.K. (1985) “Generalized Method of Moments Specification Testing”, Journal ofEconometrics, 29,229-256. Newey, W.K. (1987) “Asymptotic Properties of a One-Step Estimator Obtained from an Optimal Step Size”, Econometric Theory, 3, 305. Newey, W.K. (1988) “Interval Moment Estimation of the Truncated Regression Model”, mimeo, Department of Economics, MIT. Newey, W.K. (1989) “Locally Efficient, Residual-Based Estimation of Nonlinear Simultaneous Equations Models”, mimeo, Department of Economics, Princeton University. Newey, W.K. (1990) “Semiparametric Efficiency Bounds”, Journal of Applied Econometrics, 5,99-l 35. Newey, W.K. (1991a) “Uniform Convergence in Probability and Stochastic Equicontinuity”, Econometrica, 59, 1161-l 167. Newey, W.K. (1991b) “Efficient Estimation of Tobit Models Under Conditional Symmetry”, in: W. Barnett, J. Powell and G. Tauchen, eds., Semiparametric and Nonparametric Methods in Statistics and Econometrics, Cambridge: Cambridge University Press. Newey, W.K. (1992a) “The Asymptotic Variance of Semiparametric Estimators”, MIT Working Paper. Newey, W.K. (1992b) “Partial Means, Kernel Estimation, and a General Asymptotic Variance Estimator”, mimeo, MIT. Newey, W.K. (1993) “Efficient Two-Step Instrumental Variables Estimation”, mimeo, MIT. Newey, W.K. and J.L. Powell (1987) “Asymmetric Least Squares Estimation and Testing”, Econometrica, 55,819-847. Newey, W.K. and K. West (1988) “Hypothesis Testing with Efficient Method of Moments Estimation”, International Economic Review, 28, 777-787. Newey, W.K., F. Hsieh and J. Robins (1992) “Bias Corrected Semiparametric Estimation”, mimeo, MIT. Olsen, R.J. (1978) “Note on the Uniqueness Econometrica, 46, 1211~1216.
of the Maximum
Likelihood
Estimator
for the Tobit Model”,
Pagan, A.R. (1984) “Econometric Issues in the Analysis of Regressions with Generated Regressors”, International Economic Review, 25,221-247. Pagan, A.R. (1986) “Two Stage and Related Estimators and Their Applications”, Reuiew of Economic Studies, 53, 517-538. Pakes, A. (1986) “Patents as Options: Some Estimates of the Value of Holding European Patent Stocks”, Econometrica, 54, 755-785. Pakes, A. and D. Pollard (1989) “Simulation metrica, 57, 1027-1057.
and the Asymptotics
of Optimization
Estimators”,
Econo-
Pierce, D.A. (1982) “The Asymptotic Effect of Substituting Estimators for Parameters in Certain Types of Statistics”, Annals ofStatistics, IO, 475-478. Pollard, D. (1985) “New Ways to Prove Central Limit Theorems”, Econometric Theory, 1, 295-314. Pollard, D. (1989) Empirical Processes: Theory and Applications, CBMS/NSF Regional Conference Series Lecture Notes. Powell, J.L. (1984) “Least Absolute ofEconometrics, 25, 303-325. Powell, J.L. (1986) “Symmetrically 54.1435-1460.
Deviations Trimmed
Powell, J.L., J.H. Stock and T.M. Stoker Econometrica, 57, 1403-1430.
Pratt,J.W. (1981) “Concavity
Estimation
for the Censored
Least Squares Estimation (1989) “Semiparametric
of the Log Likelihood”,
Regression
Model”, Journal
for Tobit Models”, Econometrica, Estimation
of Index Coefficients”,
Journal ofthe American Statistical Association, 76,
103%106. Press, W.H., B.P. Flannery, University Press.
S.A. Tenkolsky
and W.T. Vetterling
(1986) Numerical Recipes, Cambridge
Ch. 36: Large Sample Estimation and Hypothesis
Testing
2245
Pringle, R. and A. Rayner (1971) Generalized Inverse Matrices, London: Griffin. Robins, J. (1991) “Estimation with Missing Data”, preprint, Epidemiology Department, Harvard School of Public Health. Robinson, P.M. (1988a) “The Stochastic Difference Between Econometric Statistics”, Econometrica, 56, 531-548. Robinson, P. (1988b) “Root-N-Consistent Semiparametric Regression”, Econometrica, 56, 931-954. Rockafellar, T. (1970) Convex Analysis, Princeton: Princeton University Press. Roehrig, C.S. (1989) “Conditions for Identification in Nonparametric and Parametric Models”, Econometrica, 56, 433-447. Rothenberg, T.J. (1971) “Identification in Parametric Models”, Econometrica, 39, 577-592. Rothenberg, T. J. (1973) Eficient Estimation with a priori Ir$ormation, Cowles Foundation Monograph 23, New Haven: Yale University Press. Rothenberg, T.J. (1984) “Approximating the Distributions of Econometric Estimators and Test Statistics”, Ch. 15 in: Z. Griliches and M.D. Intriligator, eds., Handbook of Econometrics, Vol 2, Amsterdam, North-Holland. Rudin, W. (1976) Principles ofMathematical Analysis, New York: McGraw-Hill. Sargan, J.D. (1959) “The Estimation of Relationships with Autocorrelated Residuals by the Use of Instrumental Variables”, Journal of the Royal Statistical Society Series B, 21, 91-105. Serfling, R.J. (1980) Approximation Theorems of MathematicalStatistics, New York: Wiley. Stoker, T. (1991) “Smoothing Bias in the Measurement of Marginal Effects”, MIT Sloan School Working Paper, WP3377-91-ESA. Stone, C. (1975) “Adaptive Maximum Likelihood Estimators of a Location Parameter”, Annals of Statistics, 3, 267-284. Tauchen, G.E. (1985) “Diagnostic Testing and Evaluation of Maximum Likelihood Models”, Journal of Econometrics, 30, 4 155443. Van der Vaart, A. (1991) “On Differentiable Functionals”, Annals ofStatistics, 19, 178204. Wald (1949) “Note on the Consistency of the Maximum Likelihood Estimate”, Annals ofMathematical Statistcs, 20, 595-601. White, H. (1980) “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity”, Econometrica, 48, 8177838. White, H. (1982a)“Maximum Likelihood Estimation ofMisspecified Models”, Econometrica, 50, l-25. White, H. (1982b) “Consequences and Detection of Misspecified Linear Regression Models”, Journal of the American Statistical Association, 76, 419-433.
Chapter 37
EMPIRICAL PROCESS IN ECONOMETRICS DONALD Co&s
METHODS
W.K. ANDREWS’
Foundation Yale University
Contents Abstract 1. Introduction 2. Weak convergence 3. Applications
4.
and stochastic
3.1.
Review of applications
3.2.
Parametric
Tests when a nuisance
3.4.
Semiparametric
Stochastic Primitive
4.2.
Examples
5. Stochastic 6. Conclusion Appendix References
based on non-differentiable parameter
is present
criterion
functions
only under the alternative
conditions
2267
via symmetrization
for stochastic
equicontinuity
2255 2259 2263
estimation
equicontinuity
4.1.
equicontinuity
2253
M-estimators
3.3.
2248 2248 2249 2253
2267
equicontinuity
2273
2276 2283 2284 2292
via bracketing
‘This paper is a substantial revision of the first part of the paper Andrews (1989). I thank D. McFadden for comments and suggestions concerning this revision. I gratefully acknowledge research support from the Alfred P. Sloan Foundation and the National Science Foundation through a Research Fellowship and grant nos. SES-8618617, SES-8821021, and SES-9121914 respectively.
Handbook of Econometrics, Volume IV, Edited by R.F. En& 0 1994 Elsevier Science B. V. All rights reserved
and D.L. McFadden
D. W.K. Andrew
2248
Abstract
This paper provides an introduction to the use of empirical process methods in econometrics. These methods can be used to establish the large sample properties of econometric estimators and test statistics. In the first part of the paper, key terminology and results are introduced and discussed heuristically. Applications in the econometrics literature are briefly reviewed. A select set of three classes of applications is discussed in more detail. The second part of the paper shows how one can verify a key property called stochastic equicontinuity. The paper takes several stochastic equicontinuity results from the probability literature, which rely on entropy conditions of one sort or another, and provides primitive sufficient conditions under which the entropy conditions hold. This yields stochastic equicontinuity results that are readily applicable in a variety of contexts. Examples are provided.
1.
Introduction
This paper discusses the use of empirical process methods in econometrics. It begins by defining, and discussing heuristically, empirical processes, weak convergence, and stochastic equicontinuity. The paper then provides a brief review of the use of empirical process methods in the econometrics literature. Their use is primarily in the establishment of the asymptotic distributions of various estimators and test statistics. Next, the paper discusses three classes of applications of empirical process methods in more detail. The first is the establishment of asymptotic normality of parametric M-estimators that are based on non-differentiable criterion functions. This includes least absolute deviations and method of simulated moments estimators, among others. The second is the establishment of asymptotic normality of semiparametric estimators that depend on preliminary nonparametric estimators. This includes weighted least squares estimators of partially linear regression models and semiparametric generalized method of moments estimators of parameters defined by conditional moment restrictions, among others. The third is the establishment of the asymptotic null distributions of several test statistics that apply in the nonstandard testing scenario in which a nuisance parameter appears under the alternative hypothesis, but not under the null. Examples of such testing problems include tests of variable relevance in certain nonlinear models, such as models with BoxCox transformed variables, and tests of cross-sectional constancy in regression models. As shown in the first part of the paper, the verification of stochastic equicontinuity in a given application is the key step in utilizing empirical process results. The
Ch. 37: Empirical Process
Methods
2249
in Econometrics
second part of the paper provides methods for verifying stochastic equicontinuity. Numerous results are available in the probability literature concerning sufficient conditions for stochastic equicontinuity (references are given below). Most of these results rely on some sort of entropy condition. For application to specific estimation and testing problems, such entropy conditions are not sufficiently primitive. The second part of the paper provides an array of primitive conditions under which such entropy conditions hold, and hence, under which stochastic equicontinuity obtains. The primitive conditions considered here include: differentiability conditions, Lipschitz conditions, LP continuity conditions, Vapnikkcervonenkis conditions, and combinations thereof. Applications discussed in the first part of the paper are employed to exemplify the use of these primitive conditions. The empirical process results discussed here apply only to random variables (rv’s) that are independent or m-dependent (i.e. independent beyond lags of length m). There is a growing literature on empirical processes with more general forms of temporal dependence. See Andrews (1993) for a review of this literature. The remainder of this paper is organized as follows: Section 2 defines and discusses empirical processes, weak convergence, and stochastic equicontinuity. Section 3 gives a brief review of the use of empirical process methods in the econometrics literature and discusses three classes of applications in more detail. Sections 4 and 5 provide stochastic equicontinuity results of the paper. Section 6 provides a brief conclusion. An Appendix contains proofs of results stated in Sections 4 and 5.
2.
Weak convergence and stochastic equicontinuity
We begin by introducing some notation. Let ( Wr,: t G T, T 2 l} be a triangular array of w-valued rv’s defined on a probability space (0, d, P), where w is a (Bore1 measurable) subset of Rk. For notational simplicity, we abbreviate W,, by W, below. Let .Y be a pseudometric space with pseudometric p.* Let A! = {m(~,z):z&~-) be a class of R”-valued functions empirical process vT(.) by
VT(~)= it Jr1
(2.1) defined
Cm(W,, r) - Em( W,, 7)]
on -ly- and indexed
for
r~r-,
by KEY. Define
an
(2.2)
‘That is, F is a metric space except that p(~, , TV)= 0 does not necessarily imply that r1 = r2. For example, the class of square integrable functions on [0, 11 with p(s,,r,) = [lA(T,(W) - T2(W))Zdw]1’2.is a pseudometric space, but not a metric space. The reason is that if rr(w) equals T?(W) for all w except one point, say, then ~(5,. T2) = 0, but TV # TV. In order to handle sets Y that are function spaces of the above type, we allow F to be a pseudometric space rather than a (more restrictive) metric space.
D.W.K.
2250
Andrew
where CT abbreviates xF= i. The empirical process vT(.) is a particular type of stochastic process. If Y = [0,11, then vT(.) is a stochastic process on [0,11. For parametric applications of empirical process theory, Y is usually a subset of RP. For semiparametric and nonparametric,applications, Y is often a class of functions. In some other applications, such as chi-square diagnostic test applications, .q is a class of subsets of RP. We now define weak convergence of the sequence of empirical processes {vT(.): T 2 l} to some stochastic process v(.) indexed by elements z of Y. (v(.) may or may not be defined on the same probability space (a,,&‘, P) as vT(.) VT> 1.) Let * denote weak convergence of stochastic processes, as defined below. Let % denote convergence in distribution of some sequence of rv’s. Let 1).1)denote the Euclidean norm. All limits below are taken as T-+ 00. Definition of weak convergence
v~(.)=-v(.)
if
E*f(v,(.))+Ef(v(.))
VfWB(F_)),
where B(Y) is the class of bounded R”-valued functions on Y (which includes all realizations of vr(.) and v(.) by assumption), d is the uniform metric on B(Y) (i.e. d(b,, b2) = sup,,r 11 b,(z) - b2(7) II), and @(B(S)) is the class of all bounded uniformly continuous (with respect to the metric d) real functions on B(Y). In the definition, E* denotes outer expectation. Correspondingly, P* denotes outer probability below. (It is used because it is desirable not to require vr(.) to be a measurable random element of the metric space (B(Y), d) with its Bore1 o-field, since measurability in this context can be too restrictive. For example, if (B(Y), d) is the space of functions D[O, l] with the uniform metric, then the standard empirical distribution function is not measurable with respect to its Bore1 a-field. The limit stochastic process v(.), on the other hand, is sufficiently well-behaved in applications that it is assumed to be measurable in the definition.) The above definition is due to HoffmanJorgensen. It is widely used in the recent probability literature, e.g. see Pollard (1990, Section 9). Weak convergence is a useful concept for econometrics, because it can be used to establish the asymptotic distributions of estimators and test statistics. Section 3 below illustrates how. For now, we consider sufficient conditions for weak convergence. In many applications of interest, the limit process v(.) is (uniformly p) continuous in t with probability one. In such cases, a property of the sequence of empirical processes {vr(.): T 2 11, called stochastic equicontinuity, is a key member of a set of sufficient conditions for weak convergence. It also is implied by weak convergence (if the limit process v(.) is as above).
Ch. 37:
Empirical
Dejnition
Process
Methods
of stochastic
equicontinuity
{I+(.): T> l} IS t s t oc h as t’KU11y equicontinuous
lim P -[
T+m
2251
in Econometrics
SUP T,.?2E~:&J(r,,rZ)
llV&1)-M~2)II
if VE> 0 and q > 0,36 > 0 such that
>II
1
(2.3)
-Cc.
Basically, a sequence of empirical processes iv,(.): T > l} is stochastically equicontinuous if vT(.) is continuous in z uniformly over Y at least with high probability and for T large. Thus, stochastic equicontinuity is a probabilistic and asymptotic generalization of the uniform continuity of a function. The concept of stochastic equicontinuity is quite old and appears in the literature under various guises. For example, it appears in Theorem 8.2 of Billingsley (1968, p. 55), which is attributed to Prohorov (1956), for the case of 9 = [O, 11. Moreover, a non-asymptotic analogue of stochastic equicontinuity arises in the even older literature on the existence of stochastic processes with continuous sample paths. The concept of stochastic equicontinuity is important for two reasons. First, as mentioned above, stochastic equicontinuity is a key member of a set of sufficient conditions for weak convergence. These conditions are specified immediately below. Second, in many applications it is not necessary to establish a full functional limit (i.e. weak convergence) result to obtain the desired result - it suffices to establish just stochastic equicontinuity. Examples of this are given in Section 3 below. Sufficient conditions for weak convergence are given in the following widely used result. A proof of the result can be found in Pollard (1990, Section 10) (but the basic result has been around for some time). Recall that a pseudometric space is said to be totally bounded if it can be covered by a finite number of c-balls VE> 0. (For example, a subset of Euclidean space is totally bounded if and only if it is bounded.) Proposition If (i) (Y,p) is a totally bounded pseudometric space, (ii) finite dimensional (fidi) convergence holds: V finite subsets (z,, . . . , T_,)of Y-, (v,(z,)‘, . . . , ~~(7,)‘)’ converges in distribution, and (iii) {v*(.): T 3 l} is stochastically equicontinuous, then there exists a (Borel-measurable with respect to d) B(F)-valued stochastic process. v(.), whose sample paths are uniformly p continuous with probability one, such that VT(.)JV(.). Conversely, if v=(.)*v(.) (ii) and (iii) hold.
for v(.) with the properties
above
Condition (ii) of the proposition typically is verified by central limit theorem (CLT) (or a univariate CLT coupled device, see Billingsley (1968)). There are numerous CLTs in different configurations of non-identical distributions and
and (i) holds, then
applying a multivariate with the Cramer-Wold the literature that cover temporal dependence.
D. W.K. Andrews
2252
Condition (i) of the proposition is straightforward to verify if Y is a subset of Euclidean space and is typically a by-product of the verification of stochastic equicontinuity in other cases. In consequence, the verification of stochastic equicontinuity is the key step in verifying weak convergence (and, as mentioned above, is often the desired end in its own right). For these reasons, we provide further discussion of the stochastic equicontinuity condition here and we provide methods for verifying it in several sections below. Two equivalent definitions of stochastic equicontinuity are the following: (i) {v,(.): T 3 1) is stochastically equicontinuous if for every sequence of constants (6,) that converges to zero, we have SUP~(,~,~~)~~~IV~(Z~)- vT(rZ)l 30 where “A” denotes convergence in probability, and (ii) {vT(.): vT 3 l} is stochastically equicontinuous if for all sequences of random elements {Z^iT} and {tZT} that satisfy p(z^,,,f,,) LO, we have v,(Q,,) - v,(z^,,) L 0. The latter characterization of stochastic equicontinuity reflects its use in the semiparametric examples below. Allowing {QiT} and {tZT} to be random in the latter characterization is crucial. If only fixed sequences were considered, then the property would be substantially weaker-it would not deliver the result that vT(z*,,)- vr.(fZT) 30 ~ and its proof would be substantially simpler - the property would follow directly from Chebyshev’s inequality. To demonstrate the plausibility of the stochastic equicontinuity property, suppose JZ contains only linear functions, i.e. ~2’ = {g: g(w) = w’t for some FERN} and p is the Euclidean metric. In this simple linear case,
< E,
(2.4)
where the first inequality holds by the CauchyySchwarz inequality and the second inequality holds for 6 sufficiently small provided (l/J?)x T( W, - E IV,) = O,( 1). Thus, Iv,(.): T 3 l} is stochastically equicontinuous in this case if the rv’s { W, - E W,: t < T, T 2 l} satisfy an ordinary CLT. For classes of nonlinear functions, the stochastic equicontinuity property is substantially more difficult to verify than for linear functions. Indeed, it is not difficult to demonstrate that it does not hold for all classes of functions J?‘. Some restrictions on .k are necessary ~ ~2! cannot be too complex/large. To see this, suppose { W,: t d T, T 3 l} are iid with distribution P, that is absolutely continuous with respect to Lebesgue measure and J$? is the class of indicator
Ch. 37: Empirical Process Methods in Econometrics
2253
functions of all Bore1 sets in %‘“.Let z denote a Bore1 set in w and let Y denote the collection of all such sets. Then, m(w, t) = l(w~r). Take p(r,, z2) = (J(m(w, ri) m(w, rz))*dPl(w)) ‘I* . For any two sets zl, r2 in Y that have finite numbers of elements, v,(zj) = (l/$)C~l(W,~t~) and p(r1,z2) = 0, since P1(WI~tj) = 0 forj = 1,2. Given any T 2 1 and any realization o~Q, there exist finite sets tlTo and rZTwin Y such that W,(o)~r,,~ and IVJo)$r,rwVt d T, where W,(o) denotes the value of W, when o is realized. This yields vr-(riTw) = @, v~(~*~J = 0, and supP(rl,Q)
3. 3.1.
Applications Review of applications
In this subsection, we briefly describe a number of applications of empirical process theory that appear in the econometrics literature. There are numerous others that appear in the statistics literature, see Shorack and Wellner (1986) and Wellner (1992) for some references. The applications and use of empirical process methods in econometrics are fairly diverse. Some applications use a full weak convergence result; others just use a stochastic equicontinuity result. Most applications use empirical process theory for normalized sums of rv’s, but some use the corresponding theory for U-processes, see Kim and Pollard (1990) and Sherman (1992). The applications include estimation problems and testing problems. Here we categorize the applications not by the type of empirical process method used, but by area of application. We consider estimation first, then testing. Empirical process methods are useful in obtaining the asymptotic normality of parametric optimization estimators when the criterion function that defines the estimator is not differentiable. Estimators that fit into this category include robust M-estimators (see Huber (1973)). regression quantiles (see Koenker and Bassett (1978)), censored regression quantiles (see Powell (1984, 1986a)), trimmed LAD estimators (see Honore (1992)), and method of simulated moments estimators (see McFadden (1989) and Pakes and Pollard (1989)). Huber (1967) gave some asymptotic normality results for a class of M-estimators of the above sort using empirical process-like methods. His results have been utilized by numerous econometricians, e.g. see Powell (1984). Empirical process methods were utilized explicitly in several subsequent papers that treat parametric estimation with non-differentiable criterion
2254
D. WK.
Andrews
functions, see Pollard (1984, 1985) McFadden (1989), Pakes and Pollard (1989) and Andrews (1988a). Also, see Newey and McFadden (1994) in this handbook. In Section 3.2 below, we illustrate one way in which empirical process methods can be exploited for problems of this sort. Empirical process methods also have been utilized in the semiparametric econometrics literature. They have been used to establish the asymptotic normality (and, in a few cases, other limiting distributions) of various estimators. References include Horowitz (1988, 1992), Kim and Pollard (1990), Andrews (1994a), Newey (1989), White and Stinchcombe (1991), Olley and Pakes (1991), Pakes and Olley (1991), Ait-Sahalia (1992a, b), Sherman (1993,1994) and Cavanagh and Sherman (1992). Kim and Pollard (1990) establish the asymptotic (non-normal) distribution of Manski’s (1975) maximum score estimator for binary choice models using empirical process theory for U-statistics. Horowitz (1992) establishes the asymptotic normal distribution of a smoothed version of the maximum score estimator. Andrews (1994a), Newey (1989), Pakes and Olley (1991) and Ait-Sahalia (1992b) all use empirical process theory to establish the asymptotic normality of classes of semiparametric estimators that employ nonparametric estimators in their definition. Andrews (1994a), Newey (1989) and Pakes and Olley (1991) use stochastic equicontinuity results, whereas Ait-Sahalia (1992b) utilizes a full weak convergence result. Sherman (1993,1994) and Cavanagh and Sherman (1992) establish asymptotic normality of a number of semiparametric estimators using empirical process theory of U-statistics. Section 3.3 below gives a heuristic description of one way in which empirical process methods can be used for semiparametric estimation problems. A third area of application of empirical process methods to estimation problems is that of nonparametrics. Gallant (1989) and Gallant and Souza (1991) use these methods to establish the asymptotic normality of certain seminonparametric (i.e. nonparametric series) estimators. In their proof, empirical process methods are used to establish that a law of large numbers holds uniformly over a class of functions that expands with the sample size. Andrews (1994b) uses empirical process methods to show that nonparametric kernel density and regression estimators are consistent when the dependent variable or the regressor variables are residuals from some preliminary estimation procedure (as often occurs in semiparametric applications). Empirical process methods also have been utilized very effectively in justifying the use of bootstrap confidence intervals. References include Gine and Zinn (1990), Arcones and Gine (1992) and Hahn (1995). Next, we consider testing problems. Empirical process methods have been used in the literature to obtain the asymptotic null (and local alternative) distributions of a wide variety of test statistics. These include test statistics for chi-square diagnostic tests (see Andrews (1988b, c)), consistent model specification tests (see Bierens (1990), Yatchew (1992), Hansen (1992a), De Jong (1992) and Stinchcombe and White (1993)), tests of nonlinear restrictions in semiparametric models (see Andrews (1988a)), tests of specification of semiparametric models (see Whang and Andrews (1993) and White and Hong (1992)), tests of stochastic dominance (see
2255
Ch. 37: Empirical Process Methods in Econometrics
Klecan et al. (1990), and tests of hypotheses for which a nuisance parameter appears only under the alternative (see Davies (1977,1987), Bera and Higgins (1992), Hansen (1991, 1992b), Andrews and Ploberger (1994) and Stinchcombe and White (1993). For tests of the latter sort, Section 3.4 below describes how empirical process methods are utilized. Last, we note that stochastic equicontinuity can be used to obtain uniform laws of large numbers that can be employed in proofs of consistency of extremum estimators. For example, see Pollard (1984, Chapter 2), Newey (1991) and Andrews (1992). 3.2.
Parametric
M-estimators
based on non-d@erentiable
criterion functions
Here we give a heuristic description of one way in which empirical process theory can be used to establish the asymptotic normality of parametric M-estimators (or GMM estimators) that are based on criterion functions that are not differentiable with respect to the unknown parameter. This treatment follows that of Andrews (1988a) most closely (in which a formal statement of assumptions and results can be found). Other references are given in Section 3.1 above. Suppose ? is a consistent estimator of a parameter ~,,ER~ that satisfies a set of p first order conditions m,(e) = 0
(3.1)
at least with probability that goes to one as T-+ CO, where tiT(7) =
-$tjm( W,, t). 1
(3.2)
Here, W, is an observed vector of random variables and m( ., .) is a known RP-valued function. Examples are given below. If m( W,, z) is differentiable in z, one can establish the asymptotic normality of 5^by expanding fiti, about t0 using element by element mean value expansions. This is the standard way of establishing asymptotic normality of f (or, more precisely, of fi(z^ - to)). In a variety of applications, however, the function m(W,, T) is not differentiable in 5, or not even continuous, due to the appearance of a sign function, an indicator function or a kinked function, etc. Examples are listed above and below. In such cases, one can still establish asymptotic normality of t^provided Em(W,, 2) is differentiable in t. Since the expectation operator is a smoothing operator, Em( W,, z) is often differentiable in t even though m( W,, t) is not. One method is as follows. Let 6$(t)
= $ $ Em( W,, t).
(3.3)
D. WK.
2256
Andrew
To establish asymptotic normality off, one can replace (element by element) mean value expansions of +(Z*) about r0 by corresponding mean value expansions of fi+(rO) about d and then use empirical process methods to establish the limit distribution of the expansion. In particular, such mean value expansions yield 0=
J%qT,) = J!%;(Q)- apii;(qyaT~JT(f
- TV),
where the first equality holds by the population orthogonality conditions (by assumption) and 7 lies on the line segment joining t* and r,, (and takes different values in each row of a[Krr(?)]/ar’). Under suitable assumptions on {m(kV,, r): t < T, T >, l}, one obtains
(For example, if the rv’s a[Em( W,, z,)]/at’ continuous
fi(z^- TO)= (M (Here, o,(l) denotes
-
fitiT
1+ op(l)@%*,(t).
a term that converges
Now, the asymptotic process methods
distribution
to determine
(3.5) in probability
of fi(s*
the asymptotic
to zero as T + co.)
- re) is obtained distribution
of fitiT(
by using empirical We write
= [J’rrn,(Q- @ii;(t)] - JTrn,(t*) = tvT@) -
The term able sum
W, are identically distributed, it suffices to have in r at r,,.) Thus, provided M is nonsingular, one has
vT(TO)) +
vTh,) -
fifi,(t).
(3.6)
third term on the right hand side (rhs) of (3.6) is o,(l) by (3.1). The second on the rhs of (3.6) is asymptotically normal by an ordinary CLT under suitmoment and temporal dependence assumptions, since vr(t,,) is a normalized of mean zero rv’s. That is, we have
(3.7) where S = lim,, m varC(lIJT)CTm(w,,~,)l. F or example, if the rv’s W, are independent and identically distributed (iid), it suffices to have S = Em( W,, z,)m( W,, to) well-defined.) Next, the first term on the rhs of (3.6) is o,(l) provided {vT(.): T 2 l> is stochastically equicontinuous and Z Lro. This follows because given any q > 0 and E > 0, there exists a 6 > 0 such that
Ch. 37: Empirical Process Methods
2257
in Econometrics
lim P(I vT(f)- vTh)l > d
T-ra,
d lim P( 1VT(t) - vT(ro)l > q, p(t, To) d 6) + lim P(p(z*, rO) > 6) T+OZ
T-rm
d
lim P -(
T-CC <
sup
1vT(z)
re.F:p(r,ro)<
-
vT(zO)/
>
q
d
>
(3.8)
6
where the second inequality uses z^A t0 and the third uses stochastic Combining (3.5))(3.8) yields the desired result that
JT(r*-
zo) L N(O,M-'S(M-I)')
as T+
equicontinuity.
co.
(3.9)
It remains to show how one can verify the stochastic equicontinuity of (VT(.): T 2 l}. This is done in Sections 4 and 5 below. Before doing so, we consider several examples. Example 1 M-estimators for standard, censored and truncated linear regression model. In the models considered here, {(K, X,): t d T} are observed rv’s and {(Y:, XT): t d T} are latent rv’s. The models are defined by Yf = linear
xye, + u,, regression
censored truncated
t=l,...,T,
(LR):
regression regression
(YE X,) = (Y:, X:)7 (Y,, X,) = (Y: 1(Y:
(CR): (TR):
Depending upon the context, assumptions such as constant about zero for all t. We need We consider M-estimators
2 Cl), Xf),
(q, X,) = (Y: 1(Y: 2 0), XT 1(Y:
2 0)).
(3.10)
the errors (U,} may satisfy any one of a number of conditional mean or quantile for all t or symmetry not be specific for present purposes. ? of r,, that satisfy the equations
(3.11)
O=~ci/l(r,-X;r*)~,(w,,~)X* with probability + 1 as T-, co, where W, = (Y,, Xi,‘. Such estimators framework of (3.1)-(3.2) with m(w, r) = 11/i(y - x’~)$~(w, r)x,
where w = (y, x’)‘.
fit the general
(3.12)
D. WK. Andrews
2258
Examples of such M-estimators in the literature include the following: (a) LR model: Let $r(z) = sgn(z) and tiz = 1 to obtain the least absolute deviations (LAD) estimator. Let $r(z) = q - l(y - x’~ < 0) and $* = 1 to obtain Koenker and Bassett’s (1978) regression quantile estimator for quantile qE(O, 1). Let rc/1(z) = (z A c) v (- c) (where A and v are the min and max operators respectively) and $z = 1 to obtain Huber’s (1973) M-estimator with truncation at + c. Let $t (z) = 1q - 1(y - x’t < O)l and $z(w, r) = y - x’s to obtain Newey and Powell’s (1987) asymmetric LS estimator. (b) CR model: Let $r(z) = q - 1(y - x’r < 0) and tjz(w, r) = l(x’r > 0) to obtain Powell’s (1984, 1986a) censored regression quantile estimator for quantile qE(O, 1). Let $r = 1 and tjz(w, r) = 1(x? > O)[(y - x’r) A x’r] to obtain Powell’s (1986b) symmetrically trimmed LS estimator. (c) TR model: Let $r = 1 and $z(w, r) = l(y < 2x’t)(y - x’r) to obtain Powell’s (1986b) symmetrically trimmed LS estimator. (Note that for the Huber M-estimator of the LR model one would usually simultaneously estimate a scale parameter for the errors U,. For brevity, we omit this above.) Example
2
Method of simulated moments (MSM) estimator for multinomial probit. The model and estimator considered here are as in McFadden (1989) and Pakes and Pollard (1989). We consider a discrete response model with r possible responses. Let D, be an observed response vector that takes values in {ei: i = 1,. . . , I}, where ei=(O ,..., O,l,O ,..., 0)’ is the ith elementary r-vector. Let Zli denote an observed b-vector of covariates - one for each possible response i = 1,. , r. Let Z, = Z;J’. The model is defined such that cZ:r’Z:2’...’ D, = e,
if
(Zti - Z,,)‘(j3(s0) + A(r,)U,)
3 0
Vl = 1,. . . , r,
(3.13)
where U, N N(O,Z,) is an unobserved normal rv, /3(.) and A(.) are known RbX ‘and RbX ‘-valued functions of an unknown parameter rOey c RP. McFadden’s MSM estimator of r0 is constructed using s independent simulated N(0, I,) rv’s (Y,, , . . . , Y,,)’ and a matrix of instruments g(Z,, r), where g(., .) is a known R” b-valued function. The MSM estimator is an example of the estimator of (3.1)-(3.2) with W, L (D,, Z,, Ytl,. . . , Y,,) and m(“‘, r) = g(z, r)
(
d - 1 ,gI HCz(P(r) + A(z)Yj)I 2
where w = (d, z, y,, , . . , y,). Here, H[.] is of the form
nl 1=1
CCzi
-
(3.14)
>
J
is a (0, I}‘-valued
zlY(B(t)+ A(z)Yj) 3 Ol.
function
whose ith element
(3.15)
Ch. 37: Empirical Process Methods in Econometrics
3.3.
Tests when a nuisance parameter
2259
is present only under the alternative
In this section we consider a class of testing problems for which empirical process limit theory can be usefully exploited. The testing problems considered are ones for which a nuisance parameter is present under the alternative hypothesis, but not under the null hypothesis. Such testing problems are non-standard. In consequence, the usual asymptotic distributional and optimality properties of likelihood ratio (LR), Lagrange multiplier (LM), and Wald (W) tests do not apply. Consider a parametric model with parameters 8 and T, where & 0 c R”, TEF c R”. Let 0 = (/I’, S’)‘, where BERN, and FERN, and s = p + q. The null and alternative hypotheses of interest are H,:
/I=0
H,:
pzo.
and
(3.16)
Under the null hypothesis, the distribution parameter r by assumption. Under the examples are the following. Example
of the data does not depend on the alternative hypothesis, it does. Two
3
This example variable/vector
is a test for variable relevance. We want to test whether a regressor Z, belongs in a nonlinear regression model. This model is
Y,=dX,,4)+LWZ,,z)+
U,,
u,-N(o,d,),
t= l,...,~.
(3.17)
The functions g and h are assumed known. The parameters (/?,bl,fi2, r) are unknown. The regressors (X,,Z,) and/or the errors U, are presumed to exhibit some sort of asymptotically weak temporal dependence. As an example, the term h(Z,,r) might be of the Box-Cox form (Z: - 1)/r. Under the null hypothesis H,: /I = 0, Z, does not enter the regression function and the parameter r is not present. Example 4 This example is a test of cross-sectional constancy in a nonlinear regression model. A parameter r (ERR) partitions the sample space of some observed variable Z, (E R’) into two regions. In one region the regression parameter is 6, (ERR) and in the other region it is 6, + /I. A test of cross-sectional constancy of the regression parameters corresponds to a test of the null hypothesis H,: p = 0. The parameter r is present only under the alternative. To be concrete, the model is
for Wt34>0 for
for
t=l
T ,...,
h(Z,,z) 6 0
3
(3.18)
D. W.K. Andrew
2260
where the errors CJ, N iid N(O,6,), the regressors X, and the rv Z, are m-dependent and identically distributed, and g(.;) and h(.;) are known real functions, For example, h(Z,,t) could equal Z, - r, where the real rv Z, is an element of X,, an element of Xt_d for some integer d 2 1, or Y,_, for some integer d > 1. The model could be generalized to allow for more regions than two. Problems of the sort considered above were first treated in a general way by Davies (1977, 1987). Davies proposed using the LR test. Let LR(r) denote the LR test statistic (i.e. minus two times the log likelihood ratio) when t is specified under the alternative. For given r, LR(r) has standard asymptotic properties (under standard regularity conditions). In particular, it converges in distribution under the null to a random variable X2(r) that has a xi distribution. When r is not given, but is allowed to take any value in y, the LR statistic is (3.19)
sup LR(r). rsf
This statistic has power against a much wider variety of alternatives than the statistic LR(r) for some fixed value of r. To mount a test based on SUP,,~ LR(r), one needs to determine its asymptotic null distribution. This can be achieved by establishing that the stochastic process LR(r), viewed as a random function indexed by r, converges weakly to a stochastic process X’(r). Then, it is easy to show that the asymptotic null distribution of SUP,,~ LR(t) is that of the supremum of the chi-square process X’(r). The methods discussed below can be used to provide a rigorous justification of this type of argument. Hansen (1991) extended Davies’ results to non-likelihood testing scenarios, considered LM versions of the test, and pointed out a variety of applications of such tests in econometrics. A drawback of the supLR test statistic is that it does not possess standard asymptotic optimality properties. Andrews and Ploberger (1994) derived a class of tests that do. They considered a weighted average power criterion that is similar to that considered by Wald (1943). Optimal tests turn out to be average exponential tests:
Exp-LR = (1 + c)-~‘~ ]exp(
k&
LRo)dJW.
(3.20)
where J(.) is a specified weight function over r~9 and c is a scalar parameter that indexes whether one is directing power against close or distant alternatives (i.e. against b small or /I large). Let Exp-LM and Exp-W denote the test statistic defined as in (3.20) but with LR(t) replaced by LM(7) and W(7), respectively, where the latter are defined analogously to LR(7). The three statistics Exp-LR,
Ch. 37: Empirical Process Methods
2261
in Econometrics
Exp-LM, and Exp-W each have asymptotic optimality properties. Using empirical process results, each can be shown to have an asymptotic null distribution that is a function of the stochastic process X”(z) discussed above. First, we introduce some notation. Let I,(B,r) denote a criterion function that is used to estimate the parameters 6’ and r. The leading case is when l,(Q, r) is the log likelihood function for the sample of size T. Let D&.(8, r) denote the s-vector of partial derivatives of I,(Q,r) with respect to 8. Let 8, denote the true value of 8 under the null hypothesis H,, i.e. B0 = (0,s;)‘. (Note that D1,(8,, r) depends on z in general even though I,(B,,s) does not.) By some manipulations (e.g. see Andrews and Ploberger (1994)), one can show that the test statistics SUP~~,~LR(r), Exp-LR, Exp-LM, and Exp-W equal a continuous real function of the normalized score process {D/,(0,, r)/,,@: try-) plus an op( 1) term under H,. In view of the continuous mapping theorem (e.g. see Pollard (1984, Chapter 111.2)), the asymptotic null distributions of these statistics are given by the same functions of the limit process More specifically, let
VT(T)=
as T-r co of {D1,(8,, r)/fi:
AN,(A,,7).
reF_).
(3.21)
Jr
(Note that EDIr(BO,r) = 0 under Ho, since these are the population first order conditions for the estimator.) Then, for some continuous function g of v,(.), we have sup LR(r) = g(vT(.)) + o,(l) re.!7
under
H,.
(3.22)
(Here, continuity is defined with respect to the uniform metric d on the space of bounded R”-valued functions on Y-, i.e. B(Y).) If vr.(.)* v(.), then
;,“,p LR(r) 5
g(v(.))
under
H,,
(3.23)
which is the desired result. The distribution of g(v(.)) yields asymptotic critical values for the test statistic SUP,,,~ LR(z). The results are analogous for Exp-LR, Exp-LM, and Exp-W. In conclusion, if one can establish the weak convergence result, v=(.)*v(.) as T-t co, then one can obtain the asymptotic distribution of the test statistics of interest. As discussed in Section 2, the key condition for weak convergence is stochastic equicontinuity. The verification of stochastic equicontinuity for Examples 3 and 4 is discussed in Sections 4 and 5 below. Here, we specify the form of v=(z) in these examples.
2242
Examples
D. WK.
Andrews
the assumption
of iid
3 (continued)
In this example, normal errors:
1,(O,r) is the log likelihood
function
under
and
VT(Z)
=
1
D1,(0,,
7)
(3.24)
=
fi
Since 7 only appears in the first term, it suffices to show that { (l/fl)xTU,h(Z,, T 3 l} is stochastically equicontinuous.
.):
Example 4 (continued) In this cross-sectional constancy example, I(& 7) is the log likelihood the assumption of iid normal innovations:
function
under
Since 7 only appears in the first term, it suffices to show that {(I/fi)CTU, a[g(X,, 8,,)]/&S, 1(h(Z,;) d 0): T 3 I} is stochastically equicontinuous.
x
MO,
7) =
-
+gZnd,
-+f:
[r, 2
-
dx,,
61
+
B)
l(W,,
-
g(X,,6J
l(h(Z,,z)
> 0)
1 7)
d
(31’
and
L
D&-(0,,
7) =
Jr
2263
Ch. 37: Empirical Process Methods in Econometrics
3.4.
Semiparametric
estimation
We now consider the application of stochastic equicontinuity results to semiparametric estimation problems. The approach that is discussed below is given in more detail in Andrews (1994a). Other approaches are referenced in Section 3.1 above. Consider a two-stage estimator e of a finite dimensional parameter 0e~ 0 c R’. In the first stage, an infinite dimensional parameter estimator z*is computed, such as a nonparametric regression or density estimator or its derivative. In the second stage, the estimator 8 of 8, is obtained from a set of estimating equations that depend on the preliminary estimator t^. Many semiparametric estimators in the literature can be defined in this way. By linearizing the estimating equations, one can show that the asymptotic distribution of ,/?((8- 19,)depends on an empirical process vr(t), evaluated at the preliminary estimator f. That is, it depends on vr(?). To obtain the asymptotic distribution of 8, then, one needs to obtain that of vr(?). If r*converges in probability to some t0 (under a suitable pseudometric) and vT(r) is stochastically equicontinuous, then one can show that v=(f) - Q(Q) 50 and the asymptotic behavior of ,/?(e^- 19,) depends on that of v&& which is obtained straightforwardly from an ordinary CLT. Thus, one can effectively utilize empirical process stochastic equicontinuity results in establishing the asymptotic distributions of semiparametric estimators. We now provide some more details of the argument sketched above. Let the data consist of {W,: t Q T}. Consider a system of p estimating equations ti,(B, f) = f
$m(e, f),
(3.26)
where m(0, r) =, m( W,, 8, z) and m(., ., *) is an RP-valued known function. Suppose the estimator 0 solves the equations
J%iT(& f)
= 0
(3.27)
(at least with probability that goes to one as T+ CO). These equations might be the first order conditions from some minimization problem. We suppose consistency of 8 has already been established, i.e. e-%0, (see Andrews (1994:) for sufficient conditions). We wish to determine the asymptotic distribution of 8. When m( W,, 8, t) is a smooth function of 8, the following approach can be used. Element by element mean value expansions stacked yield
o,(l) = w& 4 = .JTm,(e,,f) + a[rii,(e*,
f)yaelfi@-
e,),
(3.28)
where 8* lies between 6 and 0, (and 0* may differ from row to row in
D.W.K. Andrews
2264
a[fi,(O*,
z*)],W’). Under
suitable
conditions,
(3.29) Thus, JT(e^-
l + o,(l))Jrrn,(O,,~)
0,) = -(A!_
= - (M- 1 + o,(l))CJr(m,(e,,
t*) - m;(e,,z*)) + @ii;(8,,
?)I, (3.30)
where ti*,(O,z) = (l/T)CTEm(W,, 8,~). Again under suitable conditions, either
for some covariance Let
VT(4=
matrix
JeMe,,
z) -
A, see Andrews
(1994a).
fqe,, t)).
(3.32)
Note that v=(.) is a stochastic process indexed by an infinite dimensional parameter in this case. This differs from the other examples in this section for which r is finite dimensional. Under standard conditions, one can establish that
%-bo)5 N(O,S) for some covariance can show that
(3.33)
matrix
S, by applying
an ordinary
CLT. If, in addition,
VT(z*) - VT(%)J+0,
one
(3.34)
then we obtain
JT(&
e,) = -(M-l
= - M5 which is the desired
+
‘CVTh)+ JTmge,,f)]
N(O, M - ‘(S + A)(M result,
@qe,,?)I
%U))CVT(Q) +
‘)‘),
+ O,(l) (3.35)
2265
Ch. 37: Empirical Process Methods in Econometrics
To prove (3.34), we can use the stochastic is stochastically (i) {v,(.): T 2 1) and pseudometric p on r-, (ii) P(QEF)+ 1, and (iii) p(?, tO) J+ 0,
equicontinuity equicontinuous
property.
Suppose
for some choice of F
(3.36)
then (3.34) holds (as shown below). Note that there exist tradeoffs between conditions (i), (ii), and (iii) of l(3.36) in terms of the difficulty of verification and the strength of the regularity conditions needed. For example, a larger set Y makes it more difficult to verify (i), but easier to verify (ii). A stronger pseudometric p makes it easier to verify (i), but more difficult to verify (iii). Since the sufficiency of (3.36) for (3.34) is the key to the approach considered here, we provide a proof of this simple result. We have: V E > 0, V n > 0,3 6 > 0 such that
lim P(I vT(z*) - vT(d > rl)
T-30
< lim P(
I VT(?)
-
vT(zo)
1>
q, QEF,
p(t,
zo)
d
6)
T-CC +
lim P(2#Y
or
p(t^,r,) > 6)
T+CC
1vT(z)
sup re.F: <
p(r,ro)
-
vT(zO)
1>
‘1
< d
(3.37)
E,
where the term on the third line of (3.37) is zero by (ii) and (iii) and the last inequality holds by (i). Since E > 0 is arbitrary, (3.34) follows. To conclude, one can establish the fi-consistency and asymptotic normality of the semiparametric estimator 6 if one can establish, among other things, that {v,(.): T 2 l} is stochastically equicontinuous. Next, we consider the application of this approach to two examples and illustrate the form of vT(.) in these examples. In Sections 4 and 5, we discuss the verification of stochastic equicontinuity when “M = {m(., t): ZEY} is an infinite dimensional class of functions. Example
5
This example considers a weighted least squares (WLS) estimator linear regression (PLR) model. The PLR model is given by Y, = X:6’, + g(Z,) + U,
and
E( U,I X,, Z,) = 0
a.s.
of the partially
(3.38)
D. W.K. Andrew
2266
W, = (Y,,X:,Z:)’ is iid or for t= l,..., T, where the real function g(.) is unknown, m-dependent and identically distributed, Y,, U,eR, X,, tl,ERP and Z,eRka. This model is also discussed by Hlrdle and Linton (1994) in this handbook. The WLS estimator is defined for the case where the conditional variance of U, given (X,, Z,) depends only on Z,. This estimator is a weighted version of Robinson’s (1988) semiparametric LS estimator. The PLR model with heteroskedasticity of the above form can be generated by a sample selection model with nonparametric selection equation (e.g. see Andrews (1994a)). Let rlO(Z,) = E(Y,IZ,),r,,(Z,) = E(X,IZ,), r3JZt) = E(U: IZ,) and r. = (riO, rio, rzo)). Let fj(.) be an estimator of tjO(.) for j = 1,2,3. The semiparametric WLS estimator of the PLR model is given by -1
e=[ 51 5wt)(X, -
Z^2(Zt))(Xt - ~*m)‘/~,m
1
x i 5wJw, - z^2(Zt))(yt- ~lm)/~,(z,),
(3.39)
1
where r( W,) = l(Z,~f%“*) is a trimming function and 5?* is a bounded Rka. This estimator is of the form (3.16)-(3.17) with m(K, 8, f) = S(K)Cr,
- %(Z,) - (X, - z^,(Z,))‘ei LX, - e,(Z,)l/t3(Z,).
subset
of
(3.40)
To establish the asymptotic normality of z^using the approach above, one needs to establish stochastic equicontinuity for the empirical process vr(.) when the class of functions JJ’ is given by J? = {m(., Bo, t):
ZEF}
where
m(w, eo, r) = <(w)~Y - ri(z) - (x - r,(z))‘eoi
cx - z,(z)I/z,(z),
(3.41)
w = (y,x’,z’), r = (r,,r;,r,)’ and F is as defined below. Here, the elements ZEF are possible realizations of the vector nonparametric estimator 2. By definition, 3 c Rk” is the domain of rj(z) for j = 1,2,3 and 2 includes the support of Z, V t 3 1. By assumption, the trimming set 6* c 3. If d* = 2, then no trimming occurs and t(w) is redundant. If i%“*is a proper subset of 2, then trimming occurs and the WLS estimator 8 is based on only nontrimmed observations. Example
6
This example considers generalized method of moments (GMM) estimators parameters defined by conditional moments restrictions (CMR). In this example, 0, is the unique parameter vector that solves the equations E($(Z,,e)IX,)=O
a.s.
Vt>
1
of
(3.42)
Ch. 37: Empirical
Process Methods
2261
in Econometrics
for some specified R”-valued function in econometrics are quite numerous,
Ic/(., .), where X,eRkn. Examples of this model see Chamberlain (1987) and Newey (1990).
Let %(X,) = E($(Z,, t%)$(z,, &)‘lX,), d&X,) = ECaC$(z,, 4Jll~@I~,l and to(X,) = d,(X,)‘R, ‘(X,). By assumption, a,(.), A,(.), and rO(.) do not depend on t. Let fi(.) and A(.) be nonparametric estimators of a,(.) and A,(.). Let t*(.) = d^(.)‘lt;,- ‘(.). Let W, = (Z;, Xi)‘. A GMM estimator 6 of B,, minimizes
over
0~ 0 c RP’,
(3.43)
where 9 is a data-dependent weight matrix. To obtain the asymptotic distribution of this estimator using the approach above, we need to establish a stochastic equicontinuity result for the empirical process vT(.) when the class of functions J? is given by M = {m(., do, 5): TEL?-},
where
m(w, &, r) = r(x)lcI(z, 6,) = A(x)‘nw = (z’, x’) and Y is defined
4. 4.1.
Stochastic
equicontinuity
‘(x)$(z,
&,),
(3.44)
below.
via symmetrization
Primitive conditions for stochastic
equicontinuity
In this section we provide primitive conditions for stochastic equicontinuity. These conditions are applied to some of the examples of Section 3 in Section 4.2 below. We utilize an empirical process result of Pollard (1990) altered to encompass m-dependent rather than independent rv’s and reduced in generality somewhat to achieve a simplification of the conditions. This result depends on a condition, which we refer to as Pollard’s entropy condition, that is based on how well the functions in JV can be approximated by a finite number of functions, where the distance between functions is measured by the largest L’(Q) distance over all distributions Q that have finite support. The main purpose of this section is to establish primitive conditions under which the entropy condition holds. Following this, a number of examples are provided to illustrate the ease of verification of the entropy condition. First, we note that stochastic equicontinuity of a vector-valued empirical process (i.e. s > 1) follows from the stochastic equicontinuity of each element of the empirical process. In consequence, we focus attention on real-valued empirical processes (s = 1).
D. W.K. Andrews
2268
The pseudometric
p on Y is defined
E(m(W,,
71)
-
in this section
m(W,,
by
1’2.3
72))2
(4.1)
>
Let Q denote a probability measure on W. For a real function f on W, let Qf 2 = 1% f*(w)dQ(w). Let 9 be a class of functions in c(Q). The L2(Q) cover numbers of 9 are defined as follows: Definition For any E > 0, the cover number N*(E, Q, F) is the smallest value of n for which in 4 such that minj, ,(Q(f - fj)*)li2 < EVlf~p. there exist functions fI, . . ,f,, N2(&, Q, 9) = co if no such n exists. The log of N2(&,Q,S7 is referred to as the L*(Q) &-entropy of 9. Let 2 denote the class of all probability measures Q on W that concentrate on a finite set. The following entropy/cover number condition was introduced in Pollard (1982). Definition A class F of real functions
defined
on W satisfies Pollard’s entropy condition
if
1
sup [log N*(E(QF~)“~, Q, F)]1’2 de < co, s o QE~
(4.2)
where F is some envelope function for 9, i.e. F is a real function on W” for which If(.)1 < F(.)V’fEF. As ~10, the cover number N2(&(QF2)“*, Q,p) increases. Pollard’s entropy condition requires that it cannot increase too quickly as ~10. This restricts the complexity/size of 9 and does so in a way that is sufficient for stochastic equicontinuity given suitable moment and temporal dependence assumptions. In particular, the following three assumptions are sufficient for stochastic equicontinuity. Assumption
A
JZZ satisfies Pollard’s Assumption
entropy
condition
with some envelope
ti.
B
lim T_ 3. ( l/T)CTEti2
“(IV,) < CCfor some 6 > 0, where M is as in Assumption
A.
3The pseudometric p(., .) is defined here using a dummy variable N (rather than T) to avoid confusion when we consider objects such as plim T_rcp(Q,so). Note that p(.;) is taken to be independent of the sample size T.
Ch. 37: Empirical Process
Assumption
Methods
C
( W,: t < T, T 2 1j is an m-dependent Theorem
2269
in Econometrics
triangular
array of rv’s.
1 (Pollard)
Under Assumptions given by (4.1).
A-C,
{vT(.): T > l} is stochastically
equicontinuous
with p
Comments (1) Theorem 1 is proved using a symmetrization argument. In particular, one obtains a maximal inequality for vT(r) by showing that SUP,,,~ 1vT(t)j is less variable 1o,m( W,, z)l, where (6,: t d T} are iid rv’s that are indepenthan suproY l(l/fi)CT dent of { W,: t < T) and have Rudemacher distribution (i.e. r~( equals + 1 or - 1, each with probability i). Conditional on { W,} one performs a chaining argument that relies on Hoeffding’s inequality for tail probabilities of sums of bounded, mean zero, independent rv’s. The bound in this case is small when the average sum of squares of the bounds on the individual rv’s is small. In the present case, the latter ultimately is applied to the is just (lIT)Clm T ’ ( W t, z). The maximal inequality empirical measure constructed from differences of the form m( W,, zl) - m( W,, r2) rather than to just m(W,, z). In consequence, the measure of distance between m(.,z,) and m(.,z,) that makes the bound effective is an L2(P,) pseudometric, where P, denotes the empirical distribution of (W,: t d Tj. This pseudometric is random and depends on T, but is conveniently dominated by the largest L2(Q) pseudometric over all distributions Q with finite support. This explains the appearance of the latter in the definition of Pollard’s entropy condition. To see why Pollard’s entropy condition takes the precise form given above, one has to inspect the details of the chaining argument. The interested reader can do so, see Pollard (1990, Section 3). (2) When Assumptions A-C hold, F is totally bounded under the pseudometric p provided p is equivalent to the pseudometric p* defined by p*(z,,z2) = zi) - m(W,, T2))2]1’2. By equivalent, we mean that &,+ co[(l/N)CyE(m(W,, p*(~, , TV)2 Cp(z,, z2) V tl, Z*EF for some C > 0. (p*(~i, z2) < p(r,, ZJ holds automatically.) Of course, p equals p* if the rv’s W, are identically distributed. The proof of total boundedness is analogous to that given in the proof of Theorem 10.7 in Pollard (1990). Combinatorial arguments have been used to establish that certain classes of functions, often referred to as Vapnik-Cervonenkis (VC) classes of one sort or another, satisfy Pollard’s entropy condition, see Pollard (1984, Chapter 2; 1990, Section 4) and Dudley (1987). Here we consider the most important of these VC classes for applications (type I classes below) and we show that several other classes of functions satisfy Pollard’s entropy condition. These include Lipschitz functions
D.W.K. Andrew
2270
indexed by finite dimensional parameters (type II classes) and infinite dimensional classes of smooth functions (type III classes). The latter are important for applications to semiparametric and nonparametric problems because they cover realizations of nonparametric estimators (under suitable assumptions). Having established that Pollard’s entropy condition holds for several useful classes of functions, we proceed below to show that functions from these classes can be “mixed and matched”, e.g. by addition, multiplication and division, to obtain new classes that satisfy Pollard’s entropy condition. In consequence, one can routinely build up fairly complicated classes of functions that satisfy Pollard’s entropy condition. In particular, one can build up classes of functions that are suitable for use in the examples above. The first class of functions we consider are applicable in the non-differentiable M-estimator Examples 1 and 2 (see Section 3.2 above). Dejinition A class F of real functions on W is called a type I class if it is of the form (a) 8 = {f:f(w) = ~‘4 V w~-Iy- for some 5~ Y c Rk} or (b) 9 = {f:f(w) = h(w’t) V w~.q for some <E Y c Rk, hi V,}, where V, is some set of functions from R to R each with total variation less than or equal to K < co. Common choices for h in (b) include the indicator function, the sign function, and Huber $-functions, among others. For the more knowledgeable reader (concerning empirical processes), we note that it is sometimes useful to extend the definition of type I classes of functions to include various classes of functions called VC classes. By definition, such classes include (i) classes of indicator functions of VC sets, (ii) VC major classes of uniformly bounded functions, (iii) VC hull classes, (iv) VC subgraph classes, and (v) VC subgraph hull classes, where each of these classes is as defined in Dudley (1987) (but without the restriction that f > 0 V’~EF). For brevity and simplicity, we do not discuss all of these classes here. The second class of functions we consider contains functions that are indexed by a finite dimensional parameter and are Lipschitz with respect to that parameter: Dejinition A class F of real functions on W is called a type II class if each function f in F satisfies: f(.) = f(., t) for some re5-, where Y is some bounded subset of Euclidean space and f(., r) is Lipschitz in r, i.e.,
Lf(~*~l)-f(~>~2)1
B( .): W + R.
-52/I
V/t,,T,EY
(4.3)
Ch. 37: Empirical
Process Methods
2271
in Econometrics
The third class of functions we consider is an infinite dimensional class of functions that is useful for semiparametric and nonparametric applications such as Examples 5 and 6. This class is more complicated to define than type I and II classes. The reader may wish to skip this section on first reading and move ahead to Theorem 2. The third class of functions contains functions that depend on w = (~1, ~6)’ only through a subvector w, that has dimension k,
on W is called a type III class if
(i) each f in 9 depends on w only through a subvector w, of dimension k, < k, (ii) for some real number q > k,/2, some constant C < co, and some set W,*, which is a subset of Wa and is a connected compact subset of Rka, each f EF satisfies the smoothness condition: V WEW and w + hE-W^,
f(w + h)=
.r,, $
B,(h,, w,) + Nh,, w,) and R(h,, w,) d C/I h, llq,
where B,(h,, w,) is homogeneous on f, w, or h, (iii) for some constant K and all f~9,
of degree v in h, and (4, C, WJ) do not depend f(w) = K V WEW such that w,,EW~ - WT.
Typically the expansion of f(w + h) in (4.4) is a Taylor expansion and the function B&H,, w,) is the vth differential of f at w, i.e.
&(h,, wy) = 1 y
(4.4)
of order
[q]
‘!
VI!...
where zV denotes the sum over all ordered k,-tuples (v,, . . . , v,+,) of nonnegative integers such that vr + ... + vk, = v, w, = (W,1,. . . , W&)l and h = (h,l,. . , h,k.y. Sufficient conditions for condition (ii) above are: (a) for some real number q > k,/2, f~9 has partial derivatives of order [q] on W”* = {weW: wa~W~}; (b) the [q]th order partial derivatives of f satisfy a Lipschitz condition with exponent q - [qJ and some Lipschitz constant C* that does not depend on f V f ~9; and (c) W”,* is a convex compact set. The envelope of a type III class 9 can be taken to be a constant function, since the functions in 9 are uniformly bounded in absolute value over WEW and f~9. Type III classes can be extended to allow Wa to be a finite union of connected compact subsets of Rkm.In this case, (4.4) only needs to hold V wgW and w + hEY+‘” such that w, and w, + h, are in the same connected set in W,*.
2212
D.W.K. Andrews
In applications, type III classes of functions typically are classes of realizations of nonparametric function estimates. Since these realizations usually depend on only a subvector W,, of W, = (Wb,, Wb,)‘, it is advantageous to define type III classes to contain functions that may depend on only part of W,. By “mixing and matching” functions of type III with functions of types I and II (see below), classes of functions are obtained that depend on all of w. In applications where the subvector W,, of W, is a bounded rv, one may have YV,*= W,. In applications where W,, is an unbounded rv, vV~ must be a proper subset of wa for 9 to be a type III class. A common case where the latter arises in the examples of Andrews (1994a) is when W,, is an unbounded rv, all the observations are used to estimate a nonparametric function I for w,EYV~, and the semiparametric estimator only uses observations W, such that W,, is in a bounded set -WT. In this case, one sets the nonparametric estimator of rO(w,) equal to zero outside YV,*and the realizations of this trimmed estimator form a type III class if they satisfy the smoothness condition (ii) for w,E%‘“~. Theorem 2 If g is a class of functions of type I, II, or III, then Pollard’s entropy condition (4.2) (i.e. Assumption A) holds with envelope F(.) given by 1 v SUP~~,~If(. 1 v su~r~.~ If(.)1 v B(.), or 1 v su~~~~~ If( .) 1,respectively, where v is the maximum operator. Comment For type I classes, the result of Theorem 2 follows from results in the literature such as Pollard (1984, Chapter II) and Dudley (1987) (see the Appendix for details). For type II classes, Theorem 2 is established directly. It is similar to Lemma 2.13 of Pakes and Pollard (1989). For type III classes, Theorem 2 is established using uniform metric entropy results of Kolmogorov and Tihomirov (1961). We now show how one can “mix and match” functions of types I, II, and III to obtain a wide variety of classes that satisfy Pollard’s entropy condition (Assumption A). Let 3 and g* be classes of I x s matrix-valued functions defined on -Iy- with scalar envelopes G and G*, respectively (i.e. G: -ly- + R and Igij(.) I < G( .) V i = 1>..., r,vj= 1, . . . , s, V g&J). Let g and g* denote generic elements of 3 and g*. Let Z be defined as 3 is, but with s x u-valued functions. Let h denote a generic element of Z. We say that a class of matrix-valued functions 3, ?J*, or 2 satisfies Pollard’s entropy condition or is of type I, II, or III if that is the case element by element for each of the rs or su elements of its functions. Let~~O*={g+g*}(={g+g*:g~~,g*~~}),~~={gh},4ev~*=(gvg*}, 9~Y*={gr\g*} and Igl={lgl}, w h ere v, A, and 1.1 denote the element by element maximum, minimum, and absolute value operators respectively. If I = s and g(w) is non-singular V w~-ly- and VgM, let 3-i = {g-i}. Let ,&(.) denote the smallest eigenvalue of the matrix.
Ch. 37: Empirical Process
Theorem
Methods
2213
in Econometrics
3
If g, F?*, and 9 satisfy Pollard’s entropy condition with envelopes G, G*, and H, respectively, then so do each of the following classes (with envelopes given in parentheses): %ug* (G v G*), g@O* (G + G*), Y.8 ((G v l)(H v l)), $9 v 9* (G v G*), 9 A Y* (G v G*), and 191 (G). If in addition r = s and 3-l has a finite envelope c, and 9-i also satisfies Pollard’s entropy condition (with envelope (G v l)@‘). Comments
(1) The stability properties of Pollard’s entropy condition quite similar to stability properties of packing numbers
given in Theorem 3 are considered in Pollard
(1990).
(2) If r = s and infgsY infwEw &,(g(w)) uniformly bounded by a finite constant.
4.2.
> 0, then 9-i
has an envelope
that
is
Examples
We now show how Theorems l-3 can be applied obtain stochastic equicontinuity of vT(.). Example
of Section
3 to
1 (continued)
By Theorems l-3, the following conditions of vr(.) in this example.
(4
in the examples
are sufficient for stochastic equicontinuity
{(Y,, X,): t > l> is an m-dependent
(ii) ~~~~~~IlX~ll
2+6
(iii) {$,(., r): ZEF}
satisfies
supI$,(.,r)Iand
G
$$E
sequence
of rv’s.
for some 6>0. Pollard’s
rE3 T-r, [ some 6 > 0. (iv) 11/r(.) is a function of bounded
entropy
condition
(IlX,l12+s+ l)s~pI$,(W’,,r)~~+~ re.F
variation.
1
with
envelope
for
(4.5)
Sufficiency of conditions (i)-(iv) for stochastic equicontinuity of vT(.) is established as follows. The sets (g: g(w) = t+kl(y- X’T) for some TEF} and {h: h(w) = x} are type I classes with envelopes C, and /Ix 11,respectively, for some constant C, < co, and hence satisfy Pollard’s entropy condition by Theorem 2. This result, condition (iii), and the 9% result of Theorem 3 show that A satisfies Pollard’s entropy condition with envelope ( )Ix /I v l)(su~,,,~ I $2(w, T)I v 1). Stochastic equicontinuity now follows from Theorem 1, since Assumption B is implied by conditions (ii) and (iii).
D.W.K.
2214
Andrews
For the particular M-estimators considered in Example 1 above, condition (iv) is always satisfied and condition (iii) is automatically satisfied given (ii) whenever $2 = 1 or $2(w, r) = 1(x’s > 0). When tj2(w, t) = y - X'T,I+b2(W, t) = l(X’T > o)[(y - X’T) A X’T], or $*(w, r) - 1(y < 2x’r)(y - x’r), condition (iii) is satisfied provided Y is bounded and 1 r lim ~~C[EIU,12+6+EilX(i14+b
+E~lU,X,~~2+b]
forsome
6>0.
T-GPTI
This follows from Theorem 3, since {1(x’s > 0): reY}, {y - X'T:TEF}, {x'T:TE.?} and (1 (y < ~x’T): TELT} are type I classes with envelopes 1, Iu I + I\x (I supIGg 11 r - ?. 11, respectively, where u = y x’~e. IIx IIsu~,,.~ IIT II and1, Example
2 (continued)
In the method of simulated moments example, sufficient for stochastic equicontinuity of vT(.).
the following
conditions
are
is an m-dependent sequence of rv’s. is a type IT class of functions with Lipschitz function
B(.)
(9 (U&Z,, ytl,..., Y,,): t 3 l} (ii) {g(., t): reY_)
EB*+“(Z,)
that satisfies
+ Esup
))g(Z,,r)))2+6
re.Y for some
< co >
6 > 0.
Note that condition open, and
(ii) holds if g(w,r) is differentiable
(4.6) in ZV w~-ly-,Vr~~-,~
is
Sufficiency is established as follows. Classes of functions of the form { l((Zi-Zl)'(fl(z)+ A(z)yj)> 0): rsY c RP} are type I classes with envelopes equal to 1 (by including products ziyj and z,yj as additional elements of w) and hence satisfy Pollard’s entropy condition by Theorem 2. {g(.,r):rEY} also satisfies Pollard’s entropy condition with envelope 1 v supres 1)g(‘, t) II v B(.) by condition (ii) and Theorem 2. The 9% result of Theorem 3 now implies that A satisfies Pollard’s entropy condition with envelope 1 v SUP,,,~ IIg(‘, r) II v B(.). Stochastic equicontinuity now follows by Theorem 1. Example
5 (continued)
By applying Theorems stochastic equicontinuity
l-3, we find the following conditions are sufficient for of vT(.) in the WLS/PLR example. With some abuse of
Ch. 37: Empirical Process Methods in Econometrics
2275
notation, let rj(w) denote a function on W that depends on w only through the k,-subvector z and equals tj(z) above for j = 1,2,3. The sufficient conditions are:
(4 {(K,X,,Z,):t2 1) is an m-dependent
identically distributed sequence of rv’s. (ii) El1 Yt-Xle,I12+“+EIIX,I12+S+E/I(Y,-XX:B,)X,l)2+6< cc for some 6 > 0. (iii) F={t:r=(tl,t2, tJ),tjEFj for j = 1,2,3}. Fj is a type III class of RPj-valued functions on W c Rk that depend on w =(y, x’, z’)’ only through the k,-vector z for j = 1,2,3, where pi = 1, p2 =p and p3 = 1, and C 1 y-3 = tj: inf lr3(w)l 2 E for some E > 0. (4.7) wsll i 1’ The set W,* in the definition of the type III class Fj equals g* in this example for j = 1,2,3. Since g* is bounded by condition (iii), conditions (i)-(iii) can be satisfied without trimming only if the rv’s {Z,: t > l} are bounded. Sufficiency of conditions (i)-(iii) for stochastic equicontinuity is established as follows. Let h,(w) = y - ~‘0, and h2(w) = x. By Theorem 2, {c}, (hi}, {h2} and Fj satisfy Pollard’s entropy condition with envelopes 1, Ih, 1,Ih, I and Cj, respectively, for some constant C,E[~, co), for j = 1,2,3. By the 9-l result of Theorem 3, so 2 2. By the F?% and $!?@?J* results of does {l/r,:rj~Fj} with envelope CJE Theorem 3 applied several times, .&! satisfies Pollard’s entropy condition with envelope (lh,l v l)C,+(lh,I v l)C,+(lh,I v l)((h,( v l)C, for some finite constants C,, C,, and C,. Hence, Theorem 1 yields the stochastic equicontinuity of v,(.), since (ii) suffices for Assumption B. Next, we consider the conditions P(Z*EF)+ 1 and ? Are of (3.36). Suppose (i) fj(z) is a nonparametric estimator of rjO(z) that is trimmed outside T* to equal zero for j = 1,2 and one for j = 3, (ii) %* is a finite union of convex compact subsets of Rka, (iii) fj(z) and its partial derivatives of order d [q] + 1 are uniformly consistent over ZEN* for Tag and its corresponding partial derivatives, for j = 1,2,3, for some q > k,/2, and (iv) the partial derivatives of order [q] + 1 of Tag are uniformly bounded over ZEN!‘* and infiea* Ewmin(~&z)) > 0. Then, the realizations of fj(z), viewed as functions of w, lie in a type III class of functions with probability -+ 1 for j = 1,2,3 and t L T,, uniformly over 5?’ (where zjO(z) is defined for ZEN - %* to equal zero for j = 1,2 and one for j = 3). Hence, the above conditions plus (i) and (ii) of (4.7) imply that conditions (i)-(iii) of (3.36) hold. If fj(z) is a kernel regression estimator for j = 1,2,3, then sufficient conditions for the above uniform consistency properties are given in Andrews (1994b).
2276
5.
D. W.K. Andrew
Stochastic equicontinuity
via bracketing
This section provides an alternative set of sufficient conditions for stochastic equicontinuity to those considered in Section 4. We utilize a bracketing result of Ossiander (1987) for iid rv’s altered to encompass m-dependent rather than independent rv’s and extended as in Pollard (1989) to allow for non-identically distributed rv’s. This result depends on a condition, that we refer to as Ossiander’s entropy condition, that is based on how well the functions in JZ can be approximated by a finite number of functions that “bracket” each of the functions in A. The bracketing error is measured by the largest L’(P,) distance over all distributions P, of IV, for t d T, T 3 1. The main purpose of this section is to give primitive conditions under which Ossiander’s entropy condition holds. The results given here are particularly useful in three contexts. The first context is when r is finite dimensional and m(W,, t) is a non-smooth function of some nonlinear function of t and W,. For example, the rn(W,,~) function for the LAD estimator of a nonlinear regression model is of this form. In this case, it is difficult to verify Pollard’s entropy condition, so Theorems l-3 are difficult to apply. The second context concerns semiparametric and nonparametric applications in which the parameter r is infinite dimensional and is a bounded smooth function with an unbounded domain. Realizations of smooth nonparametric estimators are sometimes of this form. Theorem 2 above does not apply in this case. The third context concerns semiparametric and nonparametric applications in which r is infinite dimensional, is a bounded smooth function on one set out of a countable collection of sets and is constant outside this set. For example, realizations of trimmed nonparametric estimators with variable trimming sets are sometimes of this form. The pseudometric p on r that is used in this section is defined by
p(rl,~2) = We adopt
sup (W(W, tl) - WK,T~))~)“~. ti N.N> 1
the following
notational
convention:
~,@lf(K)IJYP = supwsr- If(w)1 if P = 00. An entropy condition analogous to Pollard’s bracketing cover numbers.
For
(5.1) any real function
is defined
using
f on
the following
Dejnition For any E > 0 and p~[2, m], the Lp bracketing cover number N:(e, P,,F)is the smallest value of n for which there exist real functions a,, . . . ,a, and b,, ,b, on YV such that for each f~9 one has If - ajl < bj for some j < II and maxjG n supt< r r> l (Eb$‘( Wr))lIpd E, where { W,: t d T, T > l} has distribution determined by PF ’ ’ The log of N~(E, P,F) is referred to as the Lp bracketing E-entropy of F. The following entropy condition was introduced by Ossiander (1987) (for the case p = 2).
Ch. 37: Empirical Process Methods
2271
in Econometrics
Definition A class F of real functions p~[2, co] if
s
on ?Y satisfies Ossiander’s Lp entropy condition for some
1
(log N;(E, P, F))“2
d& < a3.
(5.2)
0
As with Pollard’s entropy condition, Ossiander’s entropy condition restricts the complexity/size of F by restricting the rate ofincrease of the cover numbers as ~10. Often our interest in Ossiander’s Lp entropy condition is limited to the case where p = 2, as in Ossiander (1987) and Pollard (1989). To show that Ossiander’s Lp entropy condition holds for p = 2 for a class of products of functions 32, however, we need to consider the case p > 2. The latter situation arises quite frequently in applications of interest. Assumption
D
_k’ satisfies Ossiander’s
Lp entropy
condition
with p = 2 and has envelope
I&
Theorem 4 Under Assumptions B-D (with M in Assumption B given by Assumption D rather than Assumption A), {vT(.): T > l} is stochastically equicontinuous with p given by (5.1) and F is totally bounded under p. Comments 1. The proof of this theorem follows easily from Theorem 2 of Pollard (1989) (as shown in the Appendix). Pollard’s result is based on methods introduced by Ossiander (1987). Ossiander’s result, in turn, in an extension of work by Dudley (1978). 2. As in Section 4, one establishes stochastic equicontinuity here via maximal inequalities. With the bracketing approach, however, one applies a chaining argument directly to the empirical measure rather than to a symmetrized version of it. The chaining argument relies on the Bernstein inequality for the tail probabilities of a sum of mean zero, independent rv’s. The upper bound in Bernstein’s inequality is small when the L2(P,) norms of the underlying rv’s are small, where P, denotes the distribution of the tth underlying rv. The bound ultimately is applied with the underlying rv’s given by the centered difference between an arbitrary function in _&’and one of the functions from a finite set of approximating functions, each evaluated at W,. In consequence, these functions need to be close in an L2(P,) sense for all t < T for the bound to be effective, where P, denotes the distribution of W,. This explains the appearance of the supremum L2(P,) norm as the measure of approximation error in Ossiander’s L2 entropy condition.
D. W.K. Andrew
2278
We now provide primitive conditions under which Ossiander’s entropy condition is satisfied. The method is analogous to that used for Pollard’s entropy condition. First, we show that several useful classes of functions satisfy the condition. Then, we show how functions from these classes can be mixed and matched to obtain more general classes that satisfy the condition. Dejinition A class 9 of real functions on w is called a type IV class under P with index p~[2, CO] if each function f in F satisfies f(.) = f(., r) for some Roy-, where F is some bounded subset of Euclidean space, and l/P
V r~r and V 6 > 0 in a neighborhood of 0, for some finite positive constants C and I,+,where { W,: t d T, T b l} has distribution determined by P.4 Condition (5.3) is an Lp continuity condition that weakens the Lipschitz condition (4.3) of type II classes (provided suptG r,r> l(EBp(W,))“p < 00). The Lp continuity condition allows for discontinuous functions such as sign and indicator functions. For example, for the LAD estimator of a nonlinear regression model one takes f( W,, z) = sgn (Y, - g(X,, z))a[g(X,, z)]/hj for different elements rj of r. Under appropriate conditions on (Y,, X,) and on the regression function g(., .), the resultant class of functions can be shown to be of type IV under P with index p. Example 3 (continued) In this test of variable relevance the following condition: sup EU: SUP IW,,~,) r> I ?,:~I?,-?/~ <s
example,
-
h(Z,,z)l’
J& is a type IV class with p = 2 under
d
Cd*
(5.4)
for all thy, for all 6 > 0, and for some finite positive constants C and $. Condition (5.4) is easy to verify if h(Z,,t) is differentiable in r. By a mean value expansion, (5.4) holds if supt, 1 E II II, supTGF a[h(z,, z)]/ik II2 < 00 and r is bounded. On the other hand, condition (5.4) can be verified even if h(Z,,z) is discontinuous in r. For example, suppose h(Z,, z) = l(h*(Z,, r) d 0) for some real differentiable function h*(Z,, z). In this case, it can be shown that condition (5.4) holds if supta 1 El U,)2+6 < CO for some 6 > 0, sup*> 1 SUP,,,~ (Ia[h*(Z,, z)yar Ii d C, < cc a.s. for some constant C,, and h*(Z,, t) has a (Lebesgue) density that is bounded above uniformly over ZEF. 41f need be, the bound in (5.3) can be replaced i. > 1 and Theorem 5 still goes through.
by CIlog61-”
for arbitrary
constants
CE(~, co) and
2219
Ch. 37: Empirical Process Methods in Econometrics
Example 4 (continued) Jl is a type IV class with p = 2 in this cross-sectional constancy example under the same conditions as in Example 3 with U, of Example 3 replaced by U,a[g(X,,s,,)]/a8, and with h(Z,,z) taken to be of the non-differentiable form 1(h*(Z,, t) d 0) discussed above. Note that the conditions placed on a type IV class of functions are weaker in several respects than those placed on the functions in Huber’s (1967, Lemma 3, p 227) stochastic equicontinuity result. (Huber’s conditions N-2, N-3(i), and N-3(ii) are not used here, nor is his independence assumption on { W,}.)Huber’s result has been used extensively in the literature on M-estimators. Next we consider an analogue of type III classes that allows for uniformly bounded functions that are smooth on an unbounded domain. (Recall that the functions of type III are smooth only on a bounded domain and equal a constant elsewhere.) The class considered here can be applied to the WLS/PLR Example 5 or the GMM/CMR Example 6. Define wU as in Section 4 and lel w = (wb, wb)‘, h = (hb, hb)‘, and W, = (W;,, Wb,)‘. Dejinition A class 9
of real functions
on w
is called
a type I/ class under P with index
001,if
PER
(i) each fin F depends on w only through a subvector w, of dimension k, d k, (ii) wb is such that w0 n {w,ER~=: I/w, I/ < r} is a connected compact set V r > 0, (iii) for some real number q > k,/2 and some finite constants C,, . . . , Clql, C,, each f EF satisfies the smoothness condition V w~-llr and w + hew, f
(w+ h)=
vro y’!B,(k,,
w,) + W,, w,),
R(h,>w,) G C, IIh, I?, and
IB,(h,, w,)l 6 C,
IIh, II” for v = 0,. . , Cd, (5.5)
where B,(h,, w,) is homogeneous of degree v in h, and (q, C,, . . . , C,) do not depend on f,w,or h, (iv) suPtg T,Ta 1 E I/ W,, Iii < co for some [ > pqkJ(2q - k,) under P. In condition (iv) above, the condition [ > co, which arises when p = co, is taken to hold if [ = 00. Condition (ii) above holds, for example, if “IIT,= Rka. As with type III classes, the expansion of f(w + h) in (5.5) is typically a Taylor expansion and B,(h,, w,) is usually the vth differential of f at w. In this case, the third condition of (5.5) holds if the partial derivatives of f of order <[q] are uniformly bounded. Sufficient conditions for condition (iii) above are: (a) for some real number q > k,/2, each fEF has partial derivatives of order [q] on YF that are bounded uniformly over W~YY and f EF, (b) the [q]th order partial derivatives off satisfy
D. W.K. Andrews
2280
a Lipschitz condition with exponent q - [q] and some Lipschitz constant C, that does not depend on f, and (c) Y+$ is a convex set. The envelope of a type V class 9 can be taken to be a constant function, since the functions in 9 are uniformly bounded over wcw and f EF:. Type V classes can be extended to allow wO to be such that _wbn{w,~RI’~: 11w, 11d r} is a finite union of connected sets V r > 0. In this case, (5.5) only needs to hold V w~-llr and w + hE-IY_ such that w, and h, are in the same connected set in “wb n {w,: IIw, II d r} for some r > 0. In applications, the functions in type V classes usually are the realizations of nonparametric function estimates. For example, nonparametric kernel density estimates for bounded and unbounded rv’s satisfy the uniform smoothness conditions of type V classes under suitable assumptions. In addition, kernel regression estimates for bounded and unbounded regressor variables satisfy the uniform smoothness conditions if they are trimmed to equal a constant outside a suitable bounded set and then smoothed (e.g. by convolution with another kernel). The bounded set in this case may depend on T. In some cases one may wish to consider nonparametric estimates that are trimmed (i.e. set equal to a constant outside some set), but not subsequently smoothed. Realizations of such estimates do not comprise a type V class because the trimming procedure creates a discontinuity. The following class of functions is designed for this scenario. It can be used with the WLS/PLR Example 5 and the GMMjCMR Example 6. The trimming sets are restricted to come from a countably infinite number of sets {wOj: j 3 l}. (This can be restrictive in practice.) Definition
A class 9
of real functions
on w
is called a type
VI class
under
P with index
PECK, 001,if (i) each f in F depends on w only through a subvector w, of w of dimension k, d k, (ii) for some real number q > k, 12, some sequence {wOj: j 2 1) of connected compact subsets of Rka that lie in wO, some sequence {Kj:j 3 l} of constants that satisfy supja 1llyjl < co, and some finite constants C,, . . , CLql,C,, each f~9- satisfies the smoothness condition: for some integer J, (a) f(w) = K, V WE%/ for which w,+!~~~ and (b) V w~YY and w + hEW for which w,E~~~ and w, + huEdyb,, f(w
+ h) = .rO ,I;B,(h.,
R(hm wJ d C,
wu) + R(h,, w,),
IIh, 114,and IMh,, w,)l d C, /Ih, II” for v = 0,.
where B,(h,, w.) is homogeneous do not depend on f, w, or h.
. . , [q],
(5.6)
of degree v in h, and (q, (Woj: j >, l}, C,, . . . , C,)
Ch. 37: Empirical
Process
Methods
in Econometrics
2281
supti . T1T> I 1 E (1IV,, lli < cc for some iy> pqk,/(2q - k,) under P, (iv) n(r) < K, exp(K,rr) for some 5 < 2[/p and some finite constants K 1, K,, where n(r) is the number of sets Waj in the sequence {Waj: j 3 l} that do not include
(iii)
(W&“K:
IIw, II G 4.
Conditions (i)-(iii) in the definition of a type VI class are quite similar to conditions used above to define type III and type V classes. The difference is that with a type VI class, the set on which the functions are smooth is not a single set, but may vary from one function to the next among a countably infinite number of sets. Condition (iv) restricts the number of ^whj sets that may be of a given radius or less. Sufficient conditions for condition (iv) are the following. Suppose Wuj 3 or allj sufficiently large, where II(.) is a nondecreasing real Cw,E”Wb: IIw,II d r?(j)) f function on the positive integers that diverges to infinity as j-+ a3. For example, {Waj: j 3 l} could contain spheres, ellipses, and/or rectangles whose “radii” are large for large j. If q(j)3 D*(log j)lir
(5.7)
for some positive finite constant D *, then condition (iv) holds. Thus, the “radii” of the sets {~~j: j > 1) are only required to increase logarithmically for condition (iv). This condition is not too restrictive, given that the number of trimming sets {Waj} is countable. More restrictive is the latter condition that the number of trimming sets {-Wbj} is countable. As with type III and type V classes, the envelope of a type VI class of functions can be taken to be a constant function. The trimmed kernel regression estimators discussed in Andrews (1994b) provide examples of nonparametric function estimates for which type VI classes are applicable. For suitable trimming sets {WGj: j 2 l} and suitable smoothness conditions on the true regression function, one can specify a type VI class that contains all of the realizations of such kernel estimators in a set whose probability -1. The following result establishes Ossiander’s Lp entropy condition for classes of type II-VI. Theorem 5 Let p~[2,00]. If Y is a class of functions of type II with supt< T T, I (EBp(Wt))“p < co, of type III, or of type IV, V, or VI under P with index i, then Ossiander’s Lp entropy condition (5.2) holds (with envelope F(.) given by supltF If(.)\). Comments (1) To obtain Assumption D for any of the classes of functions considered above, one only needs to consider p = 2 in Theorem 5. To obtain Assumption D for a
D. W.K. Andrew
2282
class of the form 3&F’, where 9 and 2 are classes of types II, III, IV, V or VI, however, one needs to apply Theorem 5 to 9 and 2 for values of p greater than 2, see Theorem 6 below. (2) Theorem 5 covers classes containing a finite number of functions, because such functions are of type IV under any distribution P and for any index PE[~, co]. In particular, this is true for classes containing a single function. This observation is useful when establishing Ossiander’s Lp entropy condition for classes of functions that can be obtained by mixing and matching functions from several classes, see below. We now show how one can “mix and match” functions of types II-VI. Let 9?,9*, Y?, 9 @ %*, etc., be as defined in Section 4. We say that a class of matrixvalued functions 3,9*, or H satisfies Ossiander’s Lp entropy condition or is of type II, III, IV, V or VI if it does so, or if it is, element by element for each of the IS or su elements of its functions. We adopt the convention that &/(A + p) = ~E(O, co] if A = co and vice versa.
Theorem 6 (a) If 3 and 3* satisfy Ossiander’s Lp entropy condition for some p~[2, co], with envelopes G and G*, respectively, then so do each of the following classes (with envelopes given in parentheses): 9 u 3* (G v G*), 9 0 9* (G + G*), 3’ v Y* (G v G*), 9 A 3* (G v G*), and IF?\(G). If in addition r = s and inf,,, inf,,,,,- A,,,(g(w)) = A., for some A.,,> 0, then 9-i also satisfies Ossiander’s Lp entropy condition (with envelope r/E,,). (b) The class 3% satisfies Ossiander’s Lp entropy condition with p equal to cr~[2, co] and envelope sGH, if(i) 3 and A? satisfy Ossiander’s Lp entropy condition with p equal to k(cc, co] and p equal to ,ULE(CL, co], respectively, (ii) +/(A + p) 3 CI, and (iii) the envelopes G and H of Y and YP satisfy sup,< T,Ta 1(EG”(W,))“’ < cc and suptG T,Ta ,(EH”(K))“”
< 00.
Example 6 (continued) Theorems 4-6 can be used to verify stochastic equicontinuity of vT(.) and total boundedness of F in the GMMjCMR example. With some abuse of notation, let d(w) and n(w) denote functions on -w^ whose values depend on w only through the k,-vector x and equal A(x) and Q(x) respectively. Similarly, let $(w, 0,) denote the function on -w^ that depends on w only through z and equals ll/(z,e,). The following conditions are sufficient.
(i) {(Z,,X,):t> l} is an m-dependent
sequence
of rv’s.
(ii) ;;y E II$(Z,, &J II6 < ~0. (iii) $ = {r: r = A’R-’ for some AE~ and a~&‘}, where $3 and s4 are type V or type VI classes of functions on FY c Rk with index p = 6 whose functions
Ch. 37: Empirical Process
Methods
depend on w only through for some E > 0.
2283
in Econometrics
the k,-vector x, and .d c
R: inf &,(fl(w))> we*
E (5.8)
Note that condition (iii) of (5.8) includes a moment condition on X,:supta 1 E I/X, lir< co for some i > 6qk,/(2q - k,). Sufficiency of conditions (i))(iii) for stochastic equicontinuity and total boundedness is established as follows. By Theorem 5, {$(., (!I,)}, LS and d satisfy Ossiander’s Lp entropy condition with p = 6 and with envelopes I$(.,tI,)l, C, and C,, respectively, for some finite constants C,, C,. By the 9-l result of Theorem 6, so does J4-’ with some constant envelope C, < co. By the 32 result of Theorem 6 applied with c1= 3 and 1, = p = 6, SS&-’ satisfies Ossiander’s Lp entropy condition with p = 3 and some constant envelope C, < co. By this result, condition (ii), and the 9%’ result of Theorem 6 applied with c1= 2, 2 = 3, p = 6, 9 = g&-r, and Y? = ($(.,e,)}, JY satisfies Ossiander’s Lp entropy condition with p = 2 and envelope C, I$(., Q,)l for some constant C, < co. Theorem 4 now yields stochastic equicontinuity, since condition (ii) is sufficient for Assumption B. Condition (iii) above covers the case where the domain of the nonparametric functions is unbounded and the nonparametric estimators A and fi are not trimmed to equal zero outside a single fixed bounded set, as is required when the symmetrization results of Section 4 are applied. As discussed above, nonparametric kernel regression estimators that are trimmed and smoothed or trimmed on variable sets provide examples where condition (iii) holds under suitable assumptions for realizations of the estimators that lie in a set whose probability + 1. For example, Andrews (1994b) provides uniform consistency on expanding sets and LQconsistency results for such estimators, as are required to establish that P(Z*EY) -+ 1 and z^3 z0 (the first and second parts of (3.36)) when stochastic equicontinuity is established using conditions (i)-(iii) above.
6.
Conclusion
This paper illustrates how empirical process methods can be utilized to find the asymptotic distributions of econometric estimators and test statistics. The concepts of empirical processes, weak convergence, and stochastic equicontinuity are introduced. Primitive sufficient conditions for the key stochastic equicontinuity property are outlined. Applications of empirical process methods in the econometrics literature are reviewed briefly. More detailed discussion is given for three classes of applications: M-estimators based on non-differentiable criterion functions; tests of hypotheses for which a nuisance parameter is present only under the alternative hypothesis; and semiparametric estimators that utilize preliminary nonparametric estimators.
D. W.K. Andrew
2284
Appendix
Proof of Theorem 1 Write vT(.) as the sum of m empirical processes {vrj(.): T 3 l} forj = 1,. . , m, where vTj(.) is based on the independent summands {m(W,, .): t = j + sm, s = 1,2,. .}. By standard inequalities is suffices to prove the stochastic equicontinuity of {vTj(.): T3 l} for each j. The latter can be proved using Pollard’s (1990) proof of stochastic equicontinuity for his functional CLT (Theorem 10.7). We take his functions &(w, t) to be of the form m( IV,, r)/fl We alter his pseudometric from lim,, m [ (l/N)xyE 11 m( W,, zl) m(W,, t2) 11 2]“2 to that given in (3.1). Pollard’s proof of stochastic equicontinuity relies on conditions (i) and (iii)-(v) of his Theorem 10.7. Condition (ii) of Theorem 10.7 is used only for obtaining convergence of the finite dimensional distributions, which we do not need, and for ensuring that his pseudometric is well-defined. Our pseudometric does not rely on this condition. Inspection of Pollard’s proof shows that any pseudometric can be used for his stochastic equicontinuity result (although not for his total boundedness result) provided his condition (v) holds. Thus, it suffices to verify his conditions (i) and (iii)-(v). Condition “manageable.” satisfy
(i) requires that the functions {m(W,, t)/fi: t d T, T > l} are This holds under Assumption A because Pollard’s packing numbers
sup
D(s
Ia0 F”.(w)I,a0 Pn,) d sup N,(.s/2, Q, A).
Conditions (iii) and (iv) are implied by Assumption matically given our choice of pseudometric.
B. Condition
(A.1) (v) holds autoQ.E.D.
Proof of Theorem 2 Type I classes of form (a) satisfy Pollard’s entropy condition by Lemmas II.28 and 11.36(ii) of Pollard (1984, pp 30 and 34). Type I classes of form (b) satisfy Pollard’s entropy condition because (i) they are contained in VC hull classes by the proof of Proposition 4.4 of Dudley (1987) and the fact that {f: f (w)= w'
T) -
f (.,
T~))~)"~ d
min(QB2)‘j2 j< n
IIT - Tj IId
&(QF’)l”,
64.2)
2285
Ch. 37: Empirical Process Methods in Econometrics
N2(s(QF2)“2,
Q, 9) is d the number of cubes above. By choice of the envelope F(.) = 1 v su~/,~ If(.)1 v B(.), 4QF’)“’ A (QB 21lj2 3 E, so the number of cubes is < CE-~ for some C > 0 and all QE_%?.Thus, Pollard’s entropy condition holds with envelope F( .). For classes of type III, Pollard’s entropy condition holds because sup N2(~(QF2)“2,
Q, 9) < C exp(e-k0’4)
v EE(0, l]
(A.3)
Qd
for some C < cc by Kolmogorov and Tihomirov (1961, Theorem Since g > k,/2 by assumption, Pollard’s entropy condition holds. Proof of Theorem
XIII,
p 308). Q.E.D.
3
For Yu Y*, we have N26,
Q, 3 u g*) d
Q, 3) +
N26,
N26,
Q, 3*X
N(&(Q(G v G*)2)1’2, Q, %JuV*) < N2(s(QG2)“2,
and so, Q, 9) + N2(~(QG*2)“2,
Q,?J*), (A.4)
where the second inequality uses the facts that N2(s, Q,F) is nonincreasing in E, Q(G v G*)2 2 QG2, and Q(G v G*)2 3 QG *2 . Pollard’s entropy condition follows from the second inequality of (A.4). For ?J 0 Y*, it suffices to suppose that I = s = 1. As above, Pollard’s entropy condition follows from the inequalities N26,
Q, 9 0 9*) d N2(@, Q, Y)N,(s/2, Q, %*),
Q(G + G*)2 b QG2
Q(G + G*)2 2 QG*2,
and
(A.5)
where the first inequality holds because minjsn k
min j$n,k
(gh - gjh,)’
dQ
l/2
>
~~::(QH2S(g-gj)2d[Q~])1’2+
D. W.K. Andrews
2286
(‘4.6) Thus, we get d N2($s(QHG2)“2, QH, C!3N2($~(Q,$Z2)1’2, Qc, YE’) and
N~(J~QG~H~)“~, Q, $2) sup N2(s(QG”H2)“‘,
Q, 3%‘)
QEY <
=
sup N&(Q&2)“2, QHE??
sup N2(+~(QG2)“2,
QH, 9) sup N2(34QcP2)1’2, Qc, 2) QCE9
Q, B) sup N2(+~(QH2)“2,
Q, 2).
Pollard’s entropy condition follows from the latter inequality. For 3 v B*, it suffices to suppose r = s = 1. Pollard’s entropy from the inequalities
N2b
Q, 3 v 9*) d
Q(G v G*)2 3 QG2
64.7)
QE22
QEZ?
N2W,
and
Q, 94N,W,
condition
follows
Q, %*I,
Q(G v G*)* > QG*2,
(‘4.8)
where the first inequality uses 1g v g* - gj v g: I< 1g - gjl + (g* - g: I. The proof for 3 A CC?*is analogous (with the envelope still given by G v G* rather than G A G*). The result for 131 follows because 1191- lajl I < lg - ajl. Lastly, consider 3-r. For gEY, let g1 denote the Ith element of g, where I = 1,. ..,L and L = r*. Let Y[ = {gl: gE3} and n, = N,(E/~, Q, 3J for some QE_!~?.We claim that given any E > 0 and QE~?, there exist functions gr,. . . ,g,, in 9 with n < nF= rn, such that for all g& min max (Q(g’ - gi)2)112 < E. j
64.9)
I
To see this, note that by the assumption that 3 satisfies Pollard’s entropy condition, for each 1 there exist real functions grr, . . . , glnr in %I such that for all ge?J minjc,,(Q(g’ - glj)2)112 < 42. Form the set Y+ of all RL-valued functions whose Ith element is gu for some j = 1,. . . , n, for 1= 1,. ..,L.The number of such functions is n+ = nL=1 ,n,. The functions in 3’ are not necessarily in 9. For each function g+ in g+ consider the L'(Q) a/2-ball in 3 centered at g+. Take one function from each non-empty ball and let gr, . . . , gn denote the chosen functions. These functions satisfy the claim above.
Ch. 37: Empirical
Process
Methods
2287
in Econometrics
If 9 satisfies Pollard’s entropy condition with envelope G, it also does so with envelope G v 1. For notational simplicity, suppose G = G v 1. Given QeS, let Q(.) = Q(.e4)/Qc4 (ES), where 6 is the envelope of 3-r. Take E and Q in the claim above to equal E(oG4)“2/r4 and Q respectively. Then, there exist functions gl,. . . , gn in 9 such that
min max (Q(g’-g$2)1’2 j
<.s(QG4)112/r4
and
nb fi N2($s(Q”G4)1/2/r4, Q, G,). I=1
l
Let l,=(l,..., 1)’ (ER’) and let 1.1 denote the matrix matrix *. For arbitrary unit vectors b,cER’, we have min Q(b’g-‘c j
of absolute
values
of the
- b’gJ:‘c)’
= min Q(b’g- ‘(gj - g)gJ: ‘c)~ j
5
5
Qlg’-g:lIgm-gy/
1=1 m=1
< r8Qc”” min max Q”(g’- gf)” G r8Qt?4&20G4/r8 = ,s2QG4G4. j
SU~N,(E(QG~Z;~)“~,
(A. 10)
Q, %- ‘) d n d nf= 1N2(i~(QG4)112/r4, Q”,G,) and
Q, C!- ‘) d sup I”r Nz($@G4)“2/r4,
0, %J
i&i? I=1
Qd
=
sup
fi
N&E(QG~)“~/~~,
QE~ I=1
Q, 9,) d sup I”r N&E(QG’)~‘~/~~,
Q, ?Ir).
Q&? I=1
(A.1 1) The integral over FE[O, l] of the square root of the logarithm of the right-hand side (rhs) of (A.ll) is finite since 9 satisfies Pollard’s entropy condition with envelope G = G v 1. Thus, 9-l satisfies Pollard’s entropy condition with envelope (G v 1)‘c2. Q.E.D. Proof
ofTheorem 4
Total boundedness of Y under p follows straightforwardly from N:(E, P, ~64)-=c cc VIE> 0. For stochastic equicontinuity of {v~(.): T 2 l}, by the same argument as in the proof of Theorem 1, it suffices to prove the result when {W,: t d T} are independent rv’s. By Markov’s inequality and Theorem 2 of Pollard (1989), we
D. W.K.
Andrew
have lim P* T+CC -
(
sup
IV&i)
- v&J
’ r
p(rr,r2)
suP
ivTtzl)
-
vT(zZ)i/?
P(rIxT2F6
for some constant C < co, where & > 0 is a constant that does not depend on T. The second term on the right-hand side of (A.12) can be made arbitrarily small by choice of 6 using Assumption D. The first term is less than or equal to
(A.13)
using Assumption Proof of Theorem
B. Stochastic
equicontinuity
follows.
Q.E.D.
5
It suffices to prove the result for classes of types III-VI, because a type II class < co is a type IV class under P with index p. with suplG T,T> i(EWW,))“’ First, we consider classes of type III. For given E > 0, define the functions aj, bj, j = 1,. , n, of the definition of Lp bracketing cover numbers as follows: (a)V WGW such that w,EW~ - W,*, let aj(w) = K and b,(w) = OVj and (b)V weW such that w,~“Ilrz, let {uj(w): j = l,.. ., n,} be the functions constructed by Kolmogorov and Tihomirov (1961, ~~312-314) in their proof of Theorem XIV and let b,(w) = E Vj. These functions satisfy the conditions for Lp bracketing cover numbers for all p~[2, co]. Hence, N~(E, P, 9) < n, V ~(0, 11, V p~[2, co]. The number n, of such functions is < C exp E- ka’q V ~(0, l] for some C < co by Kolmogorov and Tihomirov (1961, Theorem XIV). Since q > k,/2 by assumption, Ossiander’s entropy condition holds for all pe[O, a]. For a type IV class with index p, consider disjoint cubes in W of diameter 6 = (E/C)“~. The number N(E) of such cubes satisfies N(E) d C*E-~‘$ for some C* < co, where d is the dimension of Y. Let rj be some element of the jth cube in F. Let ~j(~)=f(~~‘j)andbj=~uP~~r-,,~~<~lf(’~~)-~j(’)I~B~(4.3),~uP~~T,~>~[Eb~(~~)I”P,< Cd” = E. Thus, N~(E, P, 9) < N(E). Since jA(log N(a))“2 de < c&: Ossiander’s L!’ entropy condition holds. For a type V class with index p, let W, = W n {WE Rk: (Iw, 11d r}, let Fr denote the class of functions 9 restricted to -W;, and let N,(E, Wr, Fr) be the minimal number
Ch. 37:
Empiricul
Procrss
Methods
2289
in Econometrics
n of real functions fi,. . . , fn on vr each fog,.. We claim that
such that mini<,, sup,,,lf(w)
- Jj(w)l < E for
(A. 14)
N;(E,P, 9”) d N,W, %(c,,Kc,,),
where r(c) = C&-p/r for some constant C < co when p < cc and r(c) = sup { (1w, I(:WEW} (
,<
Dr(@aE
- k/q < D*E
-kd(!‘lil+
(l/q)1
(A.15)
for some constants D, D* < co, where the second inequality holds only when p < co. When p < cc, (A.14) and (A.15) combine to yield Ossiander’s Lp entropy condition for 9 if k,(p/[ + l/q)/2 < 1, or equivalently, if [ > pqkJ(2q - k,) and q > kJ2, as is assumed. When p = co, (A.14) and the first inequality of (A.15) combine to yield Ossiander’s Lp entropy condition for fl provided q > k,/2, as is assumed. It remains to show (A.14). For p = co, (A.14) follows immediately from the definition of Nt(.) and N,(.), since “z%“&, = w and F,(e) = 9 when p = co. Next, suppose p < co. For n = N,(E/~, w,, P,),), define real functions aj, bj, j = 1,. . . , n on by %‘” as follows: On YJ$ take {aj(.): j= l,..., n} to be the functions constructed Kolmogorov and Tihomirov (1961, pp 312-314) in their proof of Theorem XIV and let b,(.)=c/2 forj=l,...,n. On w--W;, take aj(.)=O and takes bj(.)=F for j= l,..., n, where F is a constant for which supwcw If(w)1 < F Vf EY. Then, for each REP, minj,,lf - ajl < bj and
< (E/Z)’
+
FPr _ c
E I( W,, 11’=
Sup
td
T,T$
(E/Z)’
+
C*rei,
(A. 16)
1
where C* is defined implicitly. If we let r = T(E) = (2pC*/(2p - l))lii&-p’r, then sup,< T,Ta 1 Eb;( W,) < &pand (A. 14) holds. Last, we consider type VI classes of functions. First, suppose p < co. We derive an upper bound on Nf(s, P, F) for arbitrary E > 0. Let rc = C&-p’ii for some C < co and let F be a constant for which supwSly If(w)1 < Ftlf~F. Let J be the index of a set vO, that does not include {w,E%~~: 11 w, 1)d r,}. For functions REP whose corresponding integer of part (ii) (of the definition of type VI classes) is J, take the centering and E-bracketing functions ((a,, b,): 1= 1,. . . , n,,} (of the definition of Lp bracketing cover numbers) as follows: (a) V WE?Y such that I(w, )( > rE, let al(w) = 0 and b,(w) = F, (b) V WEYY such that 1)w, I/ 6 r, and w,$wG,, let al(w) = K, and
D. W.K. Andrew
2290
b,(w) = 0, and (c) V wow such that 11 w, 11d rE and w,E~~~, let {al(w): 1= 1,. . . , neJ) be the functions constructed by Kolmogorov and Tihomirov (1961) in the proof of their Theorem XIV and let b,(w) = 42 V 1. The number neJ of such functions is < D, exp [D,r,kaE-kaig] by Theorem XIV of Kolmogorov and Tihomirov (1961), since {w: I/w, 11< r,, w,EWoJ} c {w: /Iw, iI’< r,}. Next, for all functions ~IzP- whose corresponding integer J of part (ii) is such that waJ contains {w,~^jYb: 11 w, )I < r}, take the centering and s-bracketing functions such that II w,II > r, let al(w) = 0 {(a,,b,):l= l,..., n,} as follows. (a) Vw~w and b,(w) = FV 1 and (b) V we”Y such that IIw, II d r, let {al(w): I = 1,. . . , n,} be the functions constructed by Kolmogorov and Tihomirov (1961) in the proof of their Theorem XIV and let b,(w) = s/2V 1. The number of such functions also is
d (n(r,) +
l)D, exp[D,r2s-kQ’q]
d (K, exp[K,Cr~-P5’1 With this bound, k,(p/i
Ossiander’s
] + l)D, exp[D,Ck~s-k~(pii+
1’q)].
(A.17)
Lp entropy
+ l/q)/2 < 1, or equivalently,
condition holds provided p5/(2[) < 1 and q > k,/2 and i > pqk,/(2q - k,), as 4 < 2[lp,
is assumed. For the case where p= CO, take r(8) = sup{)1 w,II: WE%‘“} < cc VE >O in the argument above. Then, Ossiander’s L” entropy condition holds provided q > k,/2, as is assumed. Q.E.D. Proof of Theorem
6
For YuU*, the result is obvious. For 3@ g*, it suffices to suppose that r = s = 1. Let (g, Uj, bj) and (g*, a:, b:) for ge3 and g*E??* be defined analogously to (f, uj, bj) given in the definition of the Lp bracketing cover numbers. We have (E(bj N,B(k
+
bF)P)l’P I’,
3
0
< 3*)
(EbS)“P + (Eb:P)l’P d
<
2E,
SO,
N;(E, P, 4e)N;(c, P, 9?*).
The result follows. For 3 v Y*, it also suffices to suppose
(A.18)
that r = s = 1. We have
Igvg*-~jv~jLI~(g-U~~+~g*-~~~~bj+b~, N32~
and
P, 3 v Y*) ,< N;(E, P, %)N;(E, P, c-c?*).
andso,
(A.19)
Ch. 37: Empirical
Process Methods
2291
in Econometrics
The result for 99 A 97* is analogous. For 191, the result follows from the inequality 119I- (Uj( 1d 19- Uj(. Next consider %‘- ‘. For gE9, let g’ denote the Ith element of g for 1= 1,. . . , L, where L= r2. By the same argument as used to prove the claim in the proof of the Y- ’ result of Theorem 3, there exist r x r matrix functions a,, . . . , a, and b ,,...,b,suchthat(i)aj~Yforallj,
dIb)‘lg-‘)luj-gg)luj’llcl
(E[(r4/il:)l~bj1,]P}1iP
<(r4/L:)1;bjl,
and
< (r6/A:)&.
(A.20)
Thus, N:(r6&/;1:, P, 9-l) d n < nf= ,N:(.5/2, P, YJ and the result follows. To prove part (b) of Theorem 6 concerning 92, note that each element of gh (for geB and hi&‘) is a finite union of products of scalar functions, and so, using the result for 9 @9* it suffices to suppose that r = s = u = 1. Let (g, uj, bj) and (h,uJ”, b,*) be defined analogously to (f,uj, bj) given in the definition of the Lp bracketing cover numbers, with p = i and p = p respectively. We have
d Gb; + I(u: - k) + kl bj < Gbl” + Hb, + b,b:
(A.21)
and (E(Gb: + Hbj + bjb:,,)“’
d (EG”b~“)“” + (EH”bS)“” + (Ebj”bl*a)l’a
< (EGaPi(r-a))(a-a)/ap(Ebl*P)lir+ (EHani(~-a))(~-a)inn(Eb))l/~ + (EbOf”i(‘-a))(“-a)in’(Eb:‘)‘/” J
,<
sup ((EC’)“” + (EH’)“‘)& + 2 t< T,T> 1
,< c*&
(A.22)
for ~(0, 11, where C* is defined implicitly and the dependence of each of the functions G, b:, etc. on W, is suppressed for notational simplicity. The second and third inequalities hold by Holder’s inequality and the fact that &/(,I + p) > c(implies that a~/@ - LY)d 2 and an/(2 - IX)d p. Equations (A.21) and (A.22) imply that Nf(C*tz, P, 9%) d N;@,P, ~)N;(E, P, 2)
(A.23)
2292
D. W.K. Andrews
and the desired result follows. Note that using the notational conventions in the text, (A.21)-(A.23) hold whether or not c1= CO, 1, = CO or p = co.
stated Q.E.D.
References Ait-Sahalia, Y. (1992a) “Nonparametric Pricing of Interest Rate Derivative Securities”, Department of Economics, MIT, unpublished manuscript. Ait-Sahalia, Y. (1992b) “The Delta and Bootstrap Methods for Nonparametric Kernel Functionals”, Department of Economics, MIT, unpublished manuscript. Andrews, D.W.K. (1988a) “Asymptotics for Semiparametric Econometric Models: I. Estimation and Testing”, Cowles Foundation Discussion Paper No. 908R, Yale University. Andrews. D.W.K. (1988b) “Chi-square Diagnostic Tests for Econometric Models: Introduction and Applications”, Journal of Econometrics, 31, 135-156. Andrews. D.W.K. (19886) “Chi-sauare Diagnostic Tests for Econometric Models: Theory”, Econometrica, 56, 1419-1453. ’ ’ L Andrews, D.W.K. (1989) “Asymptotics for Semiparametric Econometric Models: II. Stochastic EquiDiscussion Paper No. 909R, continuity and Nonparametric Kernel Estimation”, Cowles Foundation Yale University. Econometric Theory, 8, 241-257. Andrews, D.W.K. (1992) “Generic Uniform Convergence”, Andrews. D.W.K. (1993) “An Introduction to Econometric Applications of Empirical Process Theory for Dependent Random Variables”, Econometric Reviews, ii, 183-216. Andrews, D.W.K. (1994a) “Asymptotics for Semiparametric Econometric Models Via Stochastic Equicontinuity”, Econometrica, 62, forthcoming. Andrews, D.W.K. (1994b) “Nonparametric Kernel Estimation for Semiparametric Models”, Econometric Theory, 10, forthcoming. Andrews, D.W.K. and W. Ploberger (1994) “Optimal Tests When a Nuisance Parameter Is Present Only under the Alternative”, Econometrica, 62, forthcoming. Arcones, M. and E. Gine (1992) “On the Bootstrap of M-estimators and Other Statistical Functional?, in: R. LePage and L. Billard, eds., Exploring the Limits of the Bootstrap, New York: Wiley. Bera, A. K. and M. L. Higgins (1992) “A Test for Conditional Heteroskedasticity in Time Series Models”, Journal of Time Series Analysis, 13, 501-519. Bierens, H. (1990) “A Consistent Conditional Moment Test of Functional Form”, Econometrica, 58, 1443-1458. Billingsley, P. (1968) Convergence of Probability Measures. New York: Wiley. Cavanagh, C. and R.P. Sherman (1992) “Rank Estimators for Monotone Index Models”, Bellcore Economics Discussion Paper No. 84, Bellcore, Morristown, NJ. Chamberlain, G. (1987) “Asymptotic Efficiency in Estimation with Conditional Moment Restrictions”, Journal of Econometrics, 34, 305-324. Davies, R.B. (1977) “Hypothesis Testing When a Nuisance Parameter Is Present Only under the Alternative”, Biometrika, 64, 247-254. Davies, R.B. (1987) “Hypothesis Testing When a Nuisance Parameter Is Present Only under the Alternative”, Biometrika, 74, 33-43. De Jong, R.M. (1992) “The Bierens Test under Data Dependence”, Department of Econometrics, Free University, Amsterdam, unpublished manuscript. Dudley, R.M. (1978) “Central Limit Theorems for Empirical Measures”, Annals of Probability, 6, 899-929. Dudley, R.M. (1987) “Universal Donsker Classes and Metric Entropy”, Annals of Probability, 15, 1306G1326. Gallant, A.R. (1989) “On Asymptotic Normality When the Number of Regressors Increases and the Minimum Eigenvalues of X,X/n Decreases”, Institute of Statistics Mimeograph Series No. 1955, North Carolina State University, Raleigh, NC. Gallant, A.R. and G. Souza (1991) “On the Asymptotic Normality of Fourier Flexible Form Estimates”, Journal of Econometrics, 50, 329-353.
Ch. 37: Empirical
Process Methods
in Econometrics
2293
Gine, E. and J. Zinn (1990) “Bootstrapping General Empirical Measures”, Annals of Prabability, 18, 851-869. Hahn, J. (1995) “Bootstrapping Quantile Regression Estimators”, Econometric Theory, 11, forthcoming. Hansen, B.E. (1991) “Inference When a Nuisance Parameter Is Not Identified under the Null Hypothesis”, Working Paper No. 296, Rochester Center for Economic Research, University of Rochester. Hansen, B.E. (1992a) “Testing the Conditional Mean Specification in Parametric Regression Using the University of Rochester, unpublished Empirical Score Process”, Department of Economics, manuscript. Hansen, B.E. (1992b) “The Likelihood Test under Non-standard Conditions: Testing the Markov Trend Model of GNP”, Journal of Applied Econometrics, 7, s61-~82. Hlrdle, W. and 0. Linton (1994) “Applied Nonparametric Methods”, in: Handbook of Econometrics, Volume 4. Amsterdam: North-Holland. Honore, B. (1992) “Trimmed LAD and Least Squares Estimation of Truncated and Censored Regression Models with Fixed Effects”, Econometrica, 60, 533-565. Horowitz, J.L. (1988) “Semiparametric M-estimation of Censored Linear Regression Models”, Adoances in Econometrics, 7, 45-83. Horowitz, J. L. (1992) “A Smoothed Maximum Score Estimator for the Binary Response Model”, Econometrica, 60, 505-531. Horowitz, J.L. and G.R. Neumann (1992) “A Generalized Moments Specification Test of the Proportional Hazards Model”, Journal of the American Statistical Association, 87, 234-240. Huber, P.J. (1967) “The Behaviour of Maximum Likelihood Estimates under Nonstandard Conditions”, in Proceedings ofthe Fifth Berkeley Symposium in Mathematical Statistics and Probability, 1,221-233. Berkeley: University of California. Huber, P.J. (1973) “Robust Regression: Asymptotics, Conjectures and Monte Carlo”, Annals ofstatistics, 1, 799-821. Kim, J. and D. Pollard (1990) “Cube Root Asymptotics”, Annals of Statistics, 18, 191-219. Klecan, L., R. McFadden, and D. McFadden (1990) “A Robust Test for Stochastic Dominance”, Department of Economics, MIT, unpublished manuscript. Koenker, R. and G. Bassett (1978) “Regression Quantiles”, Econometrica, 46, 33-50. Kolmogorov, A.N. and V.M. Tihomirov (1961) “s-entropy and e-capacity of Sets in Functional Spaces”, American Mathematical Society Translations, Ser. 2, 17, 277-364. Manski, C.F. (1975) “Maximum Score Estimation of the Stochastic Utility Model of Choice”, Journal of Econometrics, 3, 205-228. McFadden, D. (1989) “A Method of Simulated Moments for Estimation of Discrete Response Models without Numerical Integration”, Econometrica, 57, 995-1026. Newey, W.K. (1989) “The Asymptotic Variance of Semiparametric Estimators”, Department of Economics, Princeton University, unpublished manuscript. Newey, W. K. (1990) “Efficient Instrumental Variables Estimation of Nonlinear Models”, Econometrica, 58, 809-837. Newey, W.K. (1991) “Uniform Convergence in Probability and Stochastic Equicontinuity”, Econometrica, 59, 1161-l 167. Newey, W.K. and D. McFadden (1994) “Estimation in Large Samples”, in: Handbook of Econometrics, Volume 4. Amsterdam: North-Holland. Newey, W.K. and J.L. Powell (1987) “Asymmetric Least Squares Estimation and Testing”, Econometrica, 55, 819-847. Olley, S. and A. Pakes (1991) “The Dynamics of Productivity in the Telecommunications Equipment Industry”, Department of Economics, Yale University, unpublished manuscript. Ossiander, M. (1987) “A Central Limit Theorem under Metric Entropy with Bracketing”, Annals of Probability, 15, 897-919. Pakes, A. and S. 
Olley (1991) “A Limit Theorem for a Smooth Class of Semiparamettic Estimators”, Department of Economics, Yale University, unpublished manuscript. Pakes, A. and D. Pollard (1989) “Simulation and the Asymptotics of Optimization Estimators”, Econometrica, 57, 1027-1057. Pollard, D. (1982) “A Central Limit Theorem for Empirical Processes”, Journal of the Australian Mathematical Society (Series A), 33, 235-248. Pollard, D. (1984) Convergence of Stochastic Processes. New York: Springer-Verlag.
2294
D. W.K. Andrews
Pollard, D. (1985) “New Ways to Prove Central Limit Theorems”, Econometric Theory, 1, 2955314. Pollard, D. (1989) “A Maximal Inequality for Sums of Independent Processes under a Bracketing Condition”, Department of Statistics, Yale University, unpublished manuscript. Pollard, D. (1990) Empirical Processes: Theory and Applications. CBMS Conference Series in Probability and Statistics, Vol. 2. Hayward, CA: Institute of Mathematical Statistics. Powell, J.L. (1984) “Least Absolute Deviations Estimation for the Censored Regression Model”, Journal qf Econometrics, 25, 303-325. Powell, J.L. (1986a) “Censored Regression Quantiles”, Journal of Econometrics, 32, 143- 155. Powell, J.L. (1986b) “Symmetrically Trimmed Least Squares Estimators for Tobit Models”, Econometrica, 54, 1435-1460. Prohorov, Yu.V. (1956) “Convergence of Random Processes and Limit Theorems in Probability Theory”, Theory of Probability and Its Applications, 1, 157-214. Robinson, P.M. (1988) “Root-N-Consistent Semiparametric Regression”, Econometrica, 56, 931-954. Sherman, R.P. (1992) “Maximal Inequalities for Degenerate U-processes with Applications to Optimization Estimators”, unpublished manuscript, Bell Communications Research, Morristown, NJ. Sherman, R.P. (1993) “The Limiting Distribution of the Maximum Rank Correlation Estimator”, Econometrica, 61, 123-137. Sherman, R.P. (1994) “U-processes in the Analysis of a Generalized Semiparametric Regression Estimator”, Econometric Theory, 10, forthcoming. Shorack, G.R. and J.A. Wellner (1986) Empirical Processes with Applications to Statistics. New York: Wiley. Stinchcombe, M.B. and H. White (1993) “Consistent Specification Testing with Unidentified Nuisance Parameters Using Duality and Banach Space Limit Theory”, Department of Economics, University of California, San Diego, unpublished manuscript. Wald, A. (1943) “Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations Is Large”, Transactions of the American Mathematical Society, 54, 426-482. Wellner, J.A. (1992) “Empirical Processes in Action: A Review”, International Statistical Review, 60, 247-269. Whang, Y.-J. and D.W.K. Andrews (1993) “Tests of Model Specification for Parametric and Semiparametric Models”, Journal of Econometrics, 57, 277-3 18. White, H. and Y. Hong(1992) “M-testing Using Finite and Infinite Dimensional Parameter Estimators”, Department of Economics, University of California, San Diego, unpublished manuscript. White, H. and M. Stinchcombe (1991) “Adaptive Efficient Weighted Least Squares with Dependent Observations”, in Directions in Robust Statistics and Diagnostics, Part II, ed. by W. Stahel and S. Weisberg. Berlin: Springer. Yatchew, A. (1992) “Nonparametric Regression Tests Based on Least Squares”, Econometric Theory, 8,435-451.
Chapter 38
APPLIED WOLFGANG
NONPARAMETRIC H;iRDLE*
Humboldt-Universitiit OLIVER
METHODS
Berlin
LINTON’
Oxford University
Contents
Abstract 1. Nonparametric estimation 2. Density estimation
3.
2300
2.1.
Kernels as windows
2.2.
Kernels and ill-posed
2.3.
Properties
of kernels
2.4,
Properties
of the kernel density estimator
2.5.
Estimation
2.6.
Fast implementation
Regression
2297 2297 2300
in econometrics
2301
problems
2302
of multivariate
densities,
their derivatives
2303 and bias reduction
of density estimation
Kernel estimators
3.2.
k-Nearest
2306
2308
estimation
3.1.
2304
2308
neighbor
estimators
k-NN estimators
3.2.1.
Ordinary
3.2.2.
Symmetrized
k-NN estimators estimators
2310 2310 2311 2311
3.3.
Local polynomial
3.4.
Spline estimators
2312
3.5.
Series estimators
2313
3.6.
Kernels, k-NN, splines, and series
2314
*This work was prepared while the first author was visiting CentER, KUB Tilburg, The Netherlands. It was financed, in part, by contract No 26 of the programme “P81e d’attraction interuniversitaire” of the Belgian government. +We would like to thank Don Andrews, Roger Koenker, Jens Perch Nielsen, Tom Rothenberg and Richard Spady for helpful comments. Without the careful typewriting of Mariette Huysentruit and the skillful programming of Marlene Miiller this work would not have been possible. Handbook ofEconometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden (c 1994 Elsevier Science B. V. All rights reserved
W. Hiirdle and 0. Linton
2296
4.
5.
6.
3.7.
Confidence
intervals
3.8.
Regression
derivatives
Optimality
2315 and quantiles
and bandwidth
4.1.
Optimality
4.2.
Choice of smoothing
choice
2318
2319 2319
parameter
2321
4.2.1.
Plug-in
2322
4.2.2.
Crossvalidation
2322
4.2.3.
Other data driven selectors
Application 5. I
Autoregression
5.2.
Correlated
Applications
2325 2326
errors
2321
to semiparametric
6. I
The partially
6.2.
Heteroskedastic
6.3.
Single index models
7. Conclusions References
2323
to time series
estimation
linear model nonlinear
2328 2329
regression
2330 233 I
2334 2334
Ch. 38: Applied Nonpuramrtric
Methods
2297
Abstract We review different approaches to nonparametric density and regression estimation. Kernel estimators are motivated from local averaging and solving ill-posed problems. Kernel estimators are compared to k-NN estimators, orthogonal series and splines. Pointwise and uniform confidence bands are described, and the choice of smoothing parameter is discussed. Finally, the method is applied to nonparametric prediction of time series and to semiparametric estimation.
1.
Nonparametric
estimation in econometrics
Although economic theory generally provides only loose restrictions on the distribution of observable quantities, much econometric work is based on tightly specified parametric models and likelihood based methods of inference. Under regularity conditions, maximum likelihood estimators consistently estimate the unknown parameters of the likelihood function. Furthermore, they are asymptotically normal (at convergence rate the square root of the sample size) with a limiting variance matrix that is minimal according to the Cramer-Rao theory. Hypothesis tests constructed from the likelihood ratio, Wald or Lagrange multiplier principle have therefore maximum local asymptotic power. However, when the parametric model is not true, these estimators may not be fully efficient, and in many cases - for example in regression when the functional form is misspecified - may not even be consistent. The costs of imposing the strong restrictions required for parametric estimation and testing can be considerable. Furthermore, as McFadden says in his 1985 presidential address to the Econometric Society, the parametric approach “interposes an untidy veil between econometric analysis and the propositions of economic theory, which are mostly abstract without specific dimensional or functional restrictions.”
Therefore, much effort has gone into developing procedures that can be used in the absence of strong a priori restrictions. This survey examines nonparametric smoothing methods which do not impose parametric restrictions on functional form. We put emphasis on econometric applications and implementations on currently available computer technology. There are many examples of density estimation in econometrics. Income distributions - see Hildenbrand and Hildenbrand (1986) - are of interest with regard to welfare analysis, while the density of stock returns has long been of interest to financial economists following Mandelbrot (1963) and Fama (1965). Figure 1 shows a density estimate of the stock return data of Pagan and Schwert (1990) in comparison with a normal density. We include a bandwidth factor in the scale parameter to correct for the finite sample bias of the kernel method.
W. Hiirdle and 0. Linton
2298
Stock Returns
-0.‘15
Figure
1 Density
estimator
-0:10
-0(05
of stock returns
o.bo Returns
of Pagan
0.b5
and Schwert
o.io
data compared
0. 15
with a mean zero
normal density (thin line) with standard deviation J&%, d = 0.035and & = 0.009, both evaluated at a grid of 100 equispaced points. Sample size was 1104. The bandwidth h was determined by the XploRe macro denauto according to Silverman’s rule of thumb method.
Regression smoothing methods are used frequently in demand analysis ~ see for example Deaton (1991), Banks et al. (1993) and Hausman and Newey (1992). Figure 2 shows a nonparametric kernel regression estimate of the statistical Engel curve for food expenditure and total income. For comparison the (parametric) Leser curve is also included. There are four main uses for nonparametric smoothing procedures. Firstly, they can be employed as a convenient and succinct means of displaying the features of a dataset and hence to aid practical parametric model building. Secondly, they can be used for diagnostic checking of an estimated parametric model. Thirdly, one may want to conduct inference under only the very weak restrictions imposed in fully nonparametric structures. Finally, nonparametric estimators are frequently required in the construction of estimators of Euclidean-valued quantities in semiparametric models. By using smoothing methods one can broaden the class of structures under which the chosen procedure gives valid inference. Unfortunately, this robustness is not free. Centered
nonparametric
estimators
converge
smoothing parameter, which is slower than the & in correctly specified models. It is also sometimes
at rate Jnh,
where h +O is a
rate for parametric estimators suggested that the asymptotic
Ch. 38: Applied
Nonparametric
2299
Methods
Enael Curve
Figure 2. A kernel regression smoother applied to the food expenditure as a function of total income. Data from the Family Expenditure Survey (196%1983), year 1973, Quartic kernel, bandwidth h = 0.2. The data have been normalized by mean income. Standard deviation of net income is 0.544. The kernel has been computed using the XploRe macro regest.
distributions themselves can be poor approximations in small samples. However, this problem is also found in parametric situations. The difference is quantitative rather than qualitative: typically, centered nonparametric estimators behave similarly to parametric ones in which II has been replaced by nh. The closeness of the approximation is investigated further in Hall (1992). Smoothing techniques have a long history starting at least in 1857 when the Saxonian economist Engel found the law named after him. He analyzed Belgian data on household expenditure, using what we would now call the regressogram. Whittaker (1923) used a graduation method for regression curve estimation which one would now call spline smoothing. Nadaraya (1964) and Watson (1964) provided an extension for general random design based on kernel methods. In time series, Daniel1 (1946) introduced the smoothed periodogram for consistent estimation of the spectral density. Fix and Hodges (1951) extended this for the estimation of a probability density. Rosenblatt (1956) proved asymptotic consistency of the kernel density estimator. These methods have developed considerably in the last ten years, and are now frequently used by applied econometricians - see the recent survey by Deaton (1993). The massive increase in computing power as well as the increased availability of large cross-sectional and high-frequency financial time-series datasets are partly responsible for the popularity of these methods. They are typically simple to implement in software like GAUSS or XploRe (1993). We base our survey of these methods around kernels. All the techniques we review for nonparametric regression are linear in the data, and thus can be viewed as kernel estimators with a certain equivalent weighting function. Since smoothing parameter selection methods and confidence intervals have been mostly studied for kernels,
W. Hiirdle and 0. Linton
2300
we feel obliged to concentrate nonparametric smoothing.
on these methods
as the basic unit of account
in
2. Density estimation It is simplest to describe the nonparametric approach in the setting of density estimation, so we begin with that. Suppose we are given iid real-valued observations {Xi};, 1 with density f. Sometimes ~ for the crossvalidation algorithm described in Section 4 and for semiparametric estimation - it is required to estimate f at each sample point, while on other occasions it is sufficient to estimate at a grid of points xi,. . . , xM for M fixed. We shall for the most part restrict our attention to the latter situation, and in particular concentrate on estimation at a single point x. Below we give two approaches to estimating f(x).
2.1.
Kernels as windows
If f is smooth in a small following approximation,
neighborhood
[x - h,x + h] of x, we can justify
the
x+h
fh.f(x)
zz
f(u)du
= P(XE[X - h,x + A]),
(I)
s x-h by the mean value theorem. The right-hand side of counting the number of X,‘s in this small interval of by n. This is a histogram estimator with bincenter x ;Z([U~ d l), where I (.) is the indicator function taking true and zero otherwise. Then the histogram estimator Th(X)
=
n-
’
t
i=l
Kh(X
-
(1) can be approximated by length 2h, and then dividing and binwidth 2h. Let K(u) = the value 1 when the event is can be written as
Xi),
(2)
where Kh(.) = hp ‘K(./h). This is also a kernel density estimator of f(x) with kernel K(u) = $I( 1u 1< 1) and bandwidth h. The step function kernel weights each observation inside the window equally, even though observations closer to x should possess better information than more distant ones. In addition, the step function estimator is discontinuous in x, which is unattractive given the smoothness assumption on f. Both objectives can be satisfied by choosing a smoother “window function” K as kernel, i.e. one for which K(u) + 0 as 1u I + 1. One example is the so-called quartic kernel K(u)=g(l
-u~)~~(IuI
< 1).
In the next section we give an alternative motivation less technically able reader may skip this section.
(3) for kernel estimators.
The
Ch. 3X: Applied
2.2.
Kernels
Nonpurametric
2301
Methods
and ill-posed problems
An alternative approach to the estimation of ,f is to find the best smooth mation to the empirical distribution function and to take its derivative. The distribution function F is related to f by
approxi-
s m
Af(xl =
Z(u < x)f (u)du = F(x), --Lo
(4)
which is called a Fredholm equation with integral operator Af(x) = s” mf (u) du. Recovering the density from the distribution function is the same as finding the inverse of the operator A. In practice, we must replace the distribution function by the empirical distribution function (edf) F,(x) = n-‘Cy= lI(Xi < x), which converges to F at rate &. However, this is a step function and cannot be differentiated to obtain an approximation to f(x). Put another way, the Fredholm problem is ill-posed since for a sequence F, tending to F, the solutions (satisfying Af,= F,) do not necessarily converge to f: the inverse operator in (4) is not continuous, see Vapnik (1982, p. 22). Solutions to ill-posed problems can be obtained using the Tikhonov (1963) regularization method. Let Q(f) be a lower semicontinuous functional called the stabilizer. The idea of the regularization method is to find indirectly a solution to Af= F by use of the stabilizer. Note that the solution of Af = F minimizes (with respect to f^)
sH ai
30
pm
Z(u 3 x)_?‘(u)du - F(x)
-c*)
The stabilizer parameter A,
a(T)
R,(?,F) =
=
IIf 11’is
Z(x 3
now
1
* dx.
added
u)f(u)du
to this equation
with
1 sm
- F(x)
* dx + 1
f*(U)du. -Cc
Since we do not know F(x), we replace it by the edf F,(x) and obtain of minimizing the functional R,($ F,) with respect to f. A necessary condition for a solution f^ is
Z(x >
Applying
the Fourier
s)f^(s) ds
transform
- F,(x)
a Lagrange
(5)
the problem
I
dx + i.?(u) = 0.
for generalized
functions
and
noting
that
the
W. Hiirdle und 0. Linton
2302
Fourier
transform
of I(u 3 0) is (i/w) + rr&w) (with 6(.) the delta function),
where I- is the Fourier transform of f. Solving applying the inverse Fourier transform, we obtain J(x) =
1 ,g
n-
this equation
we obtain
for r
and then
2( -,l~-W~, fi
Thus we obtain h = &.
2.3.
a kernel estimator
with kernel K(u) = $exp( - 1~1) and bandwidth
More details are given in Vapnik
Properties
(1982, p. 302).
of kernels
In the first two sections we derived different approaches to kernel smoothing. Here we would like to collect and summarize some properties of kernels. A kernel is a piecewise continuous function, symmetric around zero, integrating to one: K(u) = K( - u);
s
K(u)du
= 1.
(6)
It need not have bounded support, although many commonly used kernels live on [ - 1, 11. In most applications K is a positive probability density function, however for theoretical reasons it is sometimes useful to consider kernels that take on negative values. For any integerj, let ~j(K) =
U’K(u) du;
K(u)‘du.
Vj(K) =
s
s
The order p of a kernel is defined as the first nonzero Pj
=
O,
j=
l,...,p-
1;
moment,
&J z 0.
(7)
We mostly restrict our attention to positive kernels which can be at most of order 2. An example of a higher order kernel (of order 4) is K(u) = $7u4
- 1ou2 + 3)1(1 UI < 1).
A list of common kernel functions the values in the third column.
is given in Table
1. We shall comment
later on
Ch. 38: Applied Nonparametric
2303
Methods
Common Kernel
WO,,,
K(u) $(l -u~)l(JuI < 1) $1 - uz)zz(IuI < 1) (1 - lulY(lul Q 1) (2x)-“‘exp( - d/2)
Epanechnikov Quartic Triangular Gauss Uniform
2.4.
Table 1 kernel functions.
;z(lul Q 1)
w
1 1.005 1.011 1.041 1.060
Properties of the kernel density estimator
The kernel estimator is a sum of iid random variables, and therefore
~C?,,Wl = f&(x - z)f (z) d.z = K, *f (x), s
(8)
where * denotes convolution, assuming the integral exists. When f is N(0, a’) and K is standard normal, E[f,,(x)] is therefore the normal density with standard deviation d&?? evaluated at x, see Silverman (1986, p. 37). This explains our modification to the normal density in Figure 1. More generally, it is necessary to approximate E[f,,(x)] by a Taylor series expansion. Firstly, we change variables
E[&x)] = K(u)f (x -
uh) du.
s Then expanding f (x - uh) about
-K&,(x)1= f(x) + ;h2MK)f
(9)
f(x) gives “(x) + o(h2),
(10)
provided f”(x) is continuous in a neighborhood O(h2) as h + 0. By similar calculation,
of x. Therefore, the bias of f,,(x) is
VarC_Lh(x)l = $v2(K)f
(x),
(11)
see Silverman (1986, p. 38). Therefore, provided h-+0 and nh-+ a, T,,(x) A f(x). Further asymptotic properties of the kernel density estimator are given in Prakasa Rao (1983). The statistical properties of r,,(x) depend closely on the bandwidth h: the bias
2304
W. H&de
und 0. Linton
increases and the variance decreases with h. We investigate how the estimator itself depends on the bandwidth using the income data of Figure 2. Figure 3a shows a kernel density estimate for the income data with bandwidth h = 0.2 computed using the quartic kernel in Equation 3 and evaluated at a grid of 100 equispaced points. There is a clear bimodal structure for this implementation. A larger bandwidth h = 0.4 creates a single model structure as shown in Figure 3b, while a smaller h = 0.05 results in Figure 3c where, in addition to the bimodal feature, there is considerable small scale variation in the density. It is therefore important to have some method of choosing h. This problem has been heavily researched ~ see Jones et al. (1992) for a collection of recent results and discussion. We take up the issue of automatic bandwidth selection in greater detail for the regression case in Section 4.2. We mention here one method that is frequently used in practice ~ Silverman’s rule of thumb. Let 8’ be the sample variance of the data. Silverman (1986) proposed choosing the bandwidth to be
This rule is optimal (according to the IMSE - see Section 4 below) for the normal density, and is not far from optimal for most symmetric, unimodal densities. This procedure was used to select h in Figure 1.
2.5.
Estimation
A multivariate estimator
of multivariate (d-dimensional)
densities, their derivatives and bias reduction density
function
f can be estimated
by the kernel
(12) where kH(.) = {det(H)} ‘k(H- ’ .), where k(.) is a d-dimensional kernel function, while H is a d by d bandwidth matrix. A convenient choice in practice is to take H = hS”‘, where S is the sample covariance matrix and h is a scalar bandwidth sequence, and to give k a product structure, i.e. let k(u) = n4=1 K(uj), where u=(ur,..., I.+)~ and K(.) is a univariate kernel function. Partial derivatives off can be estimated by the appropriate partial derivatives of fH(x) (providing k(.) has the same number of nonzero continuous derivatives). For any d-vector r = (rl, . . . , rd) and any function g(.) define
where I rl = Cj”= 1rj, then T;‘(x) estimates
f(x).
Ch. 38: Applied
Nonparametric
2305
Methods
8 4.5
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
2.0
2.5
3.0
3.5
2.0
2.5
3.0
3.5
Net income
x -0.5
0.0
0.5
1.0
1.5
Net Income
I
x -0.5
1
0.0
0.5
1.0
1.5
Net income
Figure
3. Kernel density estimates of net income distribution: (a) h = 0.2, (b) h = 0.4, (c) h = 0.05. Family Expenditure Survey (1968-1983). XploRe macro denest. Year 1973.
W. Hiirdle and 0. Linton
2306
The properties ofmultivariate derivative estimators are described in Prakasa Rao (1983, p. 237). In fact, when a bandwidth H = kA is used, where h is scalar and A is any fixed positive definite d by d matrix, then Var[fc;l’(x)] = O(n- 1h-(2”‘+d)), while the bias is 0(h2). For a given bandwidth h, the variance increases with the number of derivatives being estimated and with the dimensionality of X. The latter effect is well known as the curse of dimensionality. It is possible to improve the order of magnitude of the bias by using a pth order kernel, where p > 2. In this case, the Taylor series expansion argument shows that EC?,,(x)] -f(x) = O(kP), where p is an even integer. Unfortunately, with this method there is the possibility of a negative density estimate, since K must be negative somewhere. Abramson (1982) and Jones et al. (1993) define bias reduction techniques that ensure a positive estimate. Jones and Foster (1993) review a number of other bias reduction methods. The merits of bias reduction methods are based on asymptotic approximations. Marron and Wand (1992) derive exact expressions for the first two moments of higher order kernel estimators in a general class of mixture densities and find that unless very large samples are used, these estimators may not perform as well as the asymptotic approximations suggest. Unless otherwise stated, we restrict our attention to second order kernel estimators.
2.6.
Fast
implementation
of density
estimation
Fast evaluation of Equation 2 is especially important for optimization of the smoothing parameter. This topic will be treated in Section 4.2. If the kernel density estimator has to be computed at each observation point for k different bandwidths, the number of calculations are 0(n2kk) for kernels with bounded support. For the family expenditure dataset of Figure 1 with about 7000 observations this would take too long for the type of interactive data analysis we envisage. To resolve this problem we introduce the idea of discretization. The method is to map the raw data onto an equally spaced grid of smaller cardinality. All subsequent calculations are performed on this data summary which results in considerable computational savings. Let H,(x; A), I= 0, 1, . . . , A4 - 1, be the Ith histogram estimator of f(x) with origin l/M and small binwidth d. The sensitivity of histograms with respect to choice of origin is well known, see, e.g. Hardle (1991, Figure 1.16). However, if histograms with different origins are then repeatedly averaged, the result becomes independent of the histograms’ origins. Let rM,4(x) = (l/M)Cf?“=,H,(x;A) be the averaged histogram estimator. Then
.f,bf, Ax) = ,:h .qItxEBj) ,E ’
5
i= -M
nj-iwi,
(13)
Ch. 38: Applied
Nonparametric
2307
Methods
where 2’ = {. . , - l,O, 1,. . . >, Bj = [bj - +h, bj + ih] with h = A/M and bj = jh, while nj = C;= ,Z(Xi~Bj) and wi = (M - Iii/M). At the bincenters
Note that {wi>E _M is, in fact, a discrete approximation to the (resealed) triangular kernel K(u) = (1 - lu()l((u( < 1). More generally, weights wi can be used that represent the discretization of any kernel K. When K is supported on [ - 1, 11, Wi is the resealed evaluation of K at the points -i/M (i = - M, . . . , M). If a kernel with non-compact support is used, such as the Gaussian for example, it is necessary to truncate the kernel function. Figure 4 shows the weights chosen from the quartic kernel with M = 5. Since Equation 13 is essentially a convolution of the discrete kernel weights wi with the bincounts nj, modern statistical languages such as GAUSS or XploRe that supply a convolution command are very convenient for computation of Equation 13. Binning the data takes exactly n operations. If C denotes the number of nonempty bins, then evaluation of the binned estimator at the nonempty bins requires O(MC) operations. In total we have a computational cost of O(n + kM,,,C) operations for evaluating the binned estimator at k bandwidths, where M,,, = Max{ M ;;j = 1,. . . , k}, This is a big improvement.
Kernel and Discretization r
I
I
I
I
I
Figure 4. The quartic kernel qua(u) = $1 - u’)~I(/u~ < 1). Discretizing the kernel (without resealing) leads to w-~ = qua(i/M), i = - M,. , M. Here M = 5 was chosen. The weights are represented by the thick step function.
W. Hiirdle and 0. Linton
2308
The discretization technique also works for estimating derivatives and multivariate densities, see Hardle and Scott (1992) and Turlach (1992). This method is basically a time domain version of the Fast Fourier Transform computational approach advocated in Silverman (1986), see also Jones (1989).
3.
Regression estimation
The most common method for studying and Y is to estimate the conditional Suppose that Yi = m(X,) +
i=l
Ei,
,...1 n,
the relationship between two variables X expectation function m(x) = E( Y [X = x).
(14)
where si is an independent random error satisfying E(siIXi = x) = 0, and Var(si(Xi = x) = o’(x). In this section we restrict our attention to independent sampling, but some extensions to the dependent sampling case are given in Section 5. The methods we consider are appropriate for both random design, where the (Xi, Yi) are iid, and fixed design, where the Xi are fixed in repeated samples. In the random design case, X is an ancillary statistic, and standard statistical practice - see Cox and Hinkley (1974) - is to make inferences conditional on the sample (Xi}:, r. However, many papers in the literature prove theoretical properties unconditionally, and we shall, for ease of exposition, present results in this form. We quote most results only for the case where X is scalar, although where appropriate we describe the extension to multivariate data. In some cases, it is convenient to restrict attention to the equispaced design sequence Xi=i/n, i= l,..., n. Although this is unsuitable for most econometric applications, there are situations where it is of interest; specifically, time itself is conveniently described in this way. Also, the relative ranks of any variable (within a given sample) are naturally equispaced - see Anand et al. (1993). The estimators of m(x) we describe are all of the form x1= i W,,(x)Yi for some weighting sequence (W,i(X)}1= 1, but arise from different motivations and possess different statistical properties.
3.1.
Kernel estimators
Given the technique of kernel density estimation, a natural way to estimate m(.) is first to compute an estimate of the joint density f(x, y) of (X, Y) and, then, to integrate it according to the formula
s s yfk
m(x) =
Y)dy
S(x>Y) d.v
(15)
Ch. 3X: Applied Nonparametric
The kernel density
estimate
T,Jx, y) of f(x, y) is
L(x,Y) = n- l i$lK~(x and by Equation
s
T,,(x, y)dy
2309
Methods
Xi)KhO,
-
6
= n-l
t
K,(x
- Xi);
Y&X,
Y) dy = n - ’
s
i=l
Plugging these into the numerator NadarayaaWatson kernel estimate i
yi::)3
and denominator
of Equation
15 we obtain the
K,(x - Xi)Yi
(16)
The bandwidth h determines the degree of smoothness of A,,. This can be immediately seen by considering the limits for h tending to zero or to infinity, respectively. Indeed, at an observation Xi, &(Xi) -+ Yi, as h + 0, while at an arbitrary point x, A,,(x) + y, as h + co. These two limit considerations make it clear that the smoothing parameter h, in relation to the sample size n, should not converge to zero too rapidly nor too slowly. Conditions for consistency of & are given in the following theorem, proved in Schuster (1972):
Theorem 1 Let K(.) satisfy 11K(u)1 du < co and Lim lUI_ ,uK(u) = 0. Suppose also that m(x), f(x), and (T’(X)are continuous at x, and f(x) > 0. Then, provided h = h(n) + 0 and nh + co as n + oo, we have Ah(x) -% m(x). The kernel (1972).
estimator
is asymptotically
normal,
as was first shown
in Schuster
Theorem 2 Suppose in addition to the conditions of Theorem 1 that SlK(~)l~+~du < co, for some g > 0. Suppose also that m(x) and f(x) are twice continuously differentiable at x and that E(J Y)2’9 Ix) exists and is continuous at x. Finally, suppose that
W. Hiirdle and 0. Linton
2310
Lim h5n < co.Then
Jnhc+w -
m(x)- h24nAx)l*NO, Vnw(x)),
where B,,(x) = $iLZ(K) m”(x) +
[
2m’(x)qx) f 1
in, = v2wJ2Mf(x). The Nadaraya-Watson estimator has an obvious generalization to d-dimensional explanatory variables and pth order kernels. In this case, assuming a common bandwidth h is used, the (asymptotic) bias is O(hp), when p is an even integer, while the (asymptotic) variance is O(n~‘Kd).
3.2. 3.2.1.
k-Nearest
neighbor estimators
Ordinary k-NN estimators
The kernel estimate was defined as a weighted average of the response variables in a fixed neighborhood of x. The k-nearest neighbor (k-NN) estimate is defined as a weighted average of the response variables in a varying neighborhood. This neighborhood is defined through those X-variables which are among the k-nearest neighbors of a point x. Let J(x) = {i:Xi 1s one of the k-NN to x} be the set of indices of the k-nearest neighbors of x. The k-NN estimate is the average of Y’s with index in J+‘(X), (17) Connections to kernel smoothing can be made by considering Equation 17 as a kernel smoother with uniform kernel K(u) = 31(juI < 1) and variable bandwidth h = R(k), the distance between x and its furthest k-NN,
Note that in Equation 18, for this specific kernel, the denominator is equal to (k/nR) the k-NN density estimate of f(x). The formula in Equation 18 provides sensible estimators for arbitrary kernels. The bias and variance of this more general k-NN estimator is given in a theorem by Mack (198 1).
Ch. 38: Applied Nonparametric
Theorem
2311
Methods
3
Let the conditions asn-+co.Then
of Theorem
&%(x) - m(x)-
2 hold, except that k + co, k/n -+ 0 and Lim k5/n4 < GO
Wn)*&,(41= NO, v,,(x)),
where
m”(X) + Zm(x)f(xj f
B,,(x) = P*(K)
gf *(x)
i V”“(X)= 2f?(x)Z*(K).
In contrast to kernel smoothing, the variance of the k-NN regression smoother does not depend on f, the density of X. This makes sense since the k-NN estimator always averages over exactly k observations independently of the distribution of the X-variables. The bias constant B,,(x) is also different from the one for kernel estimators given in Theorem 2. An approximate identity between k-NN and kernel smoothers can be obtained by setting k = 2nhf(x), or equivalently mean squared 3.2.2.
(19)
h = k/[2nf(x)]. error formulas
Symmetrized
For this choice of k or h respectively, the asymptotic of Theorem 2 and Theorem 3 are identical.
k-NN estimators
A computationally useful modification of & is to restrict the k-nearest neighbors always to symmetric neighborhoods, i.e., one takes k/2 neighbors to the left and k/2 neighbors to the right. In this case, weight-updating formulas can be given, see Hardle (1990, Section 3.2). The bias formulas are slightly different, see Hlrdle and Carroll (1990), but Equation 19 remains true.
3.3.
Local polynomial
The Nadaraya-Watson tion problem
estimators estimator
can be regarded
&(x) = arg, min $ K,(x - Xi). i=l
as the solution
of the minimiza-
(20)
W. Hiirdle and 0. Linton
2312
This motivates
the local polynomial
class of estimators.
Let go,.
, gp minimize
ipqX-Xi) Yi-e,-e,(Xi-x)-...-H,(Xi-x~p 2.
[
P!
I
(21)
Then g0 serves as an estimator of m(x), while Qj estimates thejth derivative of m. Clearly, 60 is linear in Y. A variation on these estimators called LOWESS was first considered in Cleveland (1979) who employed a nearest neighbor window. Fan (1992) establishes an asymptotic approximation for the case where p = 1, which he calls the local linear estimator &,,Jx). Theorem 4 Let the conditions
of Theorem
Jnh@,,,(x) - m(x) -
2 hold. Then
h2W41 =W,
J’dx)),
where B,(x) = ~~~(~)m”(x)
The local linear estimator is unbiased when m is linear, while the Nadaraya-Watson estimator may be biased depending on the marginal density of the design. We note here that fitting higher order polynomials can result in bias reduction, see Fan and Gijbels (1992) and Ruppert and Wand (1992) - who also extend the analysis to multidimensional explanatory variables. The principle underlying the local polynomial estimator can be generalized in a number of ways. Tibshirani (1984) introduced the local likelihood procedure in which an arbitrary parametric regression function g(x; 8) substitutes the polynomial in Equation 21. Fan, Heckman and Wand (1992) developed a theory for a nonparametric estimator in a GLIM (Limited Dependent Variable) model in which, for example, a probit likelihood function replaces the polynomial in Equation 21. An advantage of this procedure is that low bias results when the parametric model is true (Linton and Nielsen 1993).
3.4.
Spline estimators
For any estimate +I of m, the residual sum of squares (RSS) is defined as CT= r [ Yi - @Xi)12, which is a widely used criterion, in other contexts, for generating estimators of regression functions. However, the RSS is minimized by A interpolating the data, assuming no ties in the X’s, To avoid this problem it is necessary to add a stabilizer. Most work is based on the stabilizer 0(&t) = J[&“(u)]~ du, although see
Ch. 38: Applied Nonparametric
2313
Methods
Ansley et al. (1993) and Koenker et al. (1993) for alternatives. estimator A, is the (unique) minimizer of R,(Gi, m) = $ [ Yi - @Xi)]’
+ 2 [M’(u)]~ du.
i=l
J
The cubic
spline
(22)
The spline &A has the following properties. It is a cubic polynomial between two successive X-values at the observation points tin(.) and its first two derivatives are continuous; at the boundary of the observation interval the spline is linear. This characterization of the solution to Equation 22 allows the integral term on the right hand side to be replaced by a quadratic form, see Eubank (1988) and Wahba (1990), and computation of the estimator proceeds by standard, although computationally intensive, matrix techniques. The smoothing parameter 2 controls the degree of smoothness of the estimator A,. As LO,h, interpolates the observations, while if A+ co,fi, tends to a least squares regression line. Although &, is linear in the Y data, see Hardle (1990, pp 58859), its dependency on the design and on the smoothing parameter is rather complicated. This has resulted in rather less treatment of the statistical properties of these estimators, except in rather simple settings, although see Wahba (1990) - in fact, the extension to multivariate design is not straightforward. However, splines are asymptotically equivalent to kernel smoothers as Silverman (1984) showed. The equivalent kernel is
K(u)=fenp(
-$)sin($+t),
(23)
which is of fourth order, since its first three moments bandwidth h = h(L; Xi) is IQ; Xi) = Al’% - ““f(x,)-
l/4.
are zero, while the equivalent
(24)
One advantage of spline estimators over kernels is that global inequality and equality constraints can be imposed more conveniently. For example, it may be desirable to restrict the smooth to pass through a particular point - see Jones (1985). Silverman (1985) discusses a Bayesian interpretation of the spline procedure. However, from Section 2.2 we conclude that this interpretation can also be given to kernel estimators.
3.5.
Series estimators
Series estimators have received considerable attention in the econometrics literature, following Elbadawi et al. (1983). This theory is very much tied to the structure of
W. H&-d/e und 0. Linron
2314
Hilbert
space. Suppose
Mx) = f
that m has an expansion
for all x:
(25)
j3jcPjtxh
j=O
in terms of the orthogonal basis functions {~j},?Zo and their coefficients {/3j},?o. Suitable basis systems include the Leyendre polynomials described in Hardle (1990) and the Fourier series used in Gallant and Souza (1991). A simple method of estimating m(x) involves firstly selecting a basis system and a truncation sequence t(n), where t(n) is an integer less than II, and then regressing K on ‘Pti = (cPO(xi), . . .T (p,(Xi))r. Let (~j}$Jo be the least squares “parameter” estimates, then r(n) _ @(PI)(~)
C j=O
Bj(Pjtx)
=
t i=l
wni(x)yi,
(26)
where W,(x) = (W,,r,. . . , WnJT, with
where vt, = (cp,(x),. . . , ~~(4)~and at= (a,. . . , ~3~. These estimators are typically very easy to compute. In addition, the extension to additive structures and semiparametric models is convenient, see Andrews and Whang (1990) and Andrews (1991). Finally, provided t(n) grows at a sufficiently fast rate, the optimal (given the smoothness of m) rate of convergence can be established - see Stone (1982), while fixed window kernels achieve at best a rate of convergence (of MSE) of n4’5. However, the same effect can be achieved by using a kernel estimator, where the order of the kernel changes with n in such a way as to produce bias reduction of the desired degree, see Miiller (1987). In any case, the evidence of Marron and Wand (1992) cautions against the application of bias reduction techniques unless quite large sample sizes are available. Finally, a major disadvantage with the series method is that there is relatively little theory about how to select the basis system and the smoothing parameter t(n).
3.6.
Kernels, k-NN, splines and series
Splines and series are both “global” methods in the sense that they try to approximate the whole curve at once, while kernel and nearest neighbor methods work separately on each estimation point. Nevertheless, when X is uniformly distributed, kernels and nearest neighbor estimators of m(x) are identical, while spline estimators are roughly equivalent to a kernel estimator of order 4. Only when the design is not equispaced, do substantial differences appear.
Ch. 38: Applied Nonparametric
Methods
2315
We apply kernel, k-NN, orthogonal series (we used the Legendre system of orthogonal polynomials), and splines to the car data set (Table 7, pp 352-355 in Chambers et al. (1983)). In each plot, we give a scatterplot of the data x = price in dollars of car (in 1979) versus y = miles per US gallon of that car, and one of the nonparametric estimators. The sample size is n = 74 observations. In Figure 5a we have plotted together with the raw data a kernel smoother #r,, for which a quartic kernel was used with h = 2000. Very similar to this is the spline smoother shown in Figure 5b (2 = 109). In this example, the X’s are not too far from uniform. The effective local bandwidth for the spline smoother from Equation 24 is a function of f-1’4 only, which does not vary that much. Ofcourse at the right end with the isolated observation at x = 15906 and y = 21 (Cadillac Seville) both kernel and splines must have difficulties. Both work essentially with a window of fixed width. The series estimator (Figure 5d) with t = 8 is quite close to the spline estimator. In contrast to these regression estimators stands the k-NN smoother (k = 11) in Figure 5c. We used the symmetrized k-NN estimator for this plot. By formula (19) the dependence of k on f is much stronger than for the spline. At the right end of the price scale no local effect from the outlier described above is visible. By contrast in the main body of the data where the density is high this k-NN smoother tends to be wiggly.
3.7.
Confidence
intervals
The asymptotic distribution results contained in Theorems 2-4 can be used to calculate pointwise confidence intervals for the estimators described above. In practice, it is usual to ignore the bias term, since this is rather complicated, depending on higher derivatives of the regression function and perhaps on the derivatives of the density of X. This approach can be justified when a bandwidth is chosen that makes the bias relatively small. In this section we restrict our attention to the Nadaraya-Watson regression estimator. In this case, we suppose that hn”’ -+O, which ensures that the bias term does not appear in the limiting distribution. Let CLO(x) = &h(X) - $23 CUP(x) = &h(X) + c&, where @(c,) = (1 -a) with a(.) the standard normal distribution, while g2 is a consistent estimate of the asymptotic variance of&(x). Suitable estimators include (1) .ff = n-‘A-’
v2uw;w_7,(4
(2) s*; = 8;(x) t i=l
w;;(x)
,
4MxJ
6ma
4m
tmo
6ocKl
Plice
loo00 14m
14Km
*-
16OCXl
16ooO
:::;:
(y) with four different
****; ** 12oca
12OCN
*
Figure 5(a-d). Scatterplot of car price (x) and miles per gallon Standard deviation of car price is 2918.
L
Price
loom
KNN Estimate
6coo
::k**
I
Kernel tstlmate
smooth
4cbo
4okxl
6cbJ
a&
approximations
6obo
Orthogonal
6&J
rnooo
loo00
12boo
14boo
*
14&l
l&
16&o
(n = 74, h = 2000, k = 11, i. = 109, r = 8).
Piice
*
Series Estimate
Price
loo00
Spline Estimate
$
Ch. 38: Applied Nonparametric
2317
Methods
(3) $3 = t w,2i(x)E*;, i=l where f,,(x) is defined in Equation 2, ti = Yi - oh are the nonparametric E is estimator of a2(x) - see . a nonparametric residuals and S:(x) = x1= 1Wni(x )^f Robinson (1987) and Hildenbrand and Kneip (1992) for a discussion of alternative conditional variance estimators and their application. With the above definitions.
P(rn(X)E[CLO(X),
CUP(x)]}
-+ 1 - a.
(28)
These confidence intervals are frequently employed in econometric applications, see for example Bierens and Pott-Buter (1990), Banks et al. (1993) and Gozalo (1989). This approach is relevant if the behavior of the regression function at a single point is under consideration. Usually, however, its behavior over an interval is under study. In this case, pointwise confidence intervals do not take account of the joint nature of the implicit null hypothesis. We now consider uniform confidence bands for the function m, over some compact subset x of the support of X. Without loss of generality we take x = [0,11. We require functions CLO*(x) and CUP*(x) such that
P{m(x)E[CLO*(x),CUP*(x)]
Vxex} -+ 1 - c(.
(29)
Let
where 6 = ,/%go, and exp [ - 2 exp( - c,*)] = (1 - CY). Then (29) is satisfied under the conditions given in Hardle (1990, Theorem 4.3.1). See also Prakasa Rao (1983, Theorem 2.1.17) for a treatment of the same problem for density estimators. In Figure 6 we show the uniform confidence band’s for the income data of Figure 2. Hall (1993) advocates using the bootstrap to construct uniform confidence bands. He argues that the error in (29) is O(l/log n), which can be improved to O((log h-‘)3/ nh) by the judicious use of this resampling method in the random design case. See also Hall (1992) and Hlrdle (1990) for further applications of the bootstrap in nonparametric statistics.
W. Hiirdle and 0. Linton
2318
Engel Curve and Confidence Bands P
Net income
Figure 6. Uniform confidence bands for the income data. Food versus net income. Calculated using XploRe macro reguncb.
3.8.
Regression
derivatives and quantiles
There are a number of other functionals of the conditional distribution that are of interest for applications. The first derivative of the regression function measures the strength of the relationship between Y and X, while second derivatives can quantify the concavity or convexity of the regression function. Let &t(x) be any estimator of m(x) that has at least r non-zero derivatives at x. Then m”‘(x) can be estimated by the rth derivative of Sz(x), denoted &“(x). Miiller (1988) describes kernel estimators of m”‘(x) based on the convolution method of Gasser and Miiller (1984); their method gives simpler bias expressions than the Nadaraya-Watson estimator. An alternative technique is to fit a local polynomial (of order r) estimator, and take the coefficient on the rth term in (21), see Ruppert and Wand (1992). In each case, the resulting estimator is linear in Yi, with bias of order h2 and variance of order n -lh-(2r+l) Quantiles can also be useful. The median is an alternative - and robust - measure of location, while other quantiles can help to describe the spread of the conditional distribution. Let fy,x=x(y) denote the conditional distribution of Y given X = x, and let c,(x) be the crth conditional quantile, i.e. CC=
C=(x) f,,, Z,(Y) dy, s -co
(30)
where for simplicity we assume this is unique. There are several methods for estimating c,(x). Firstly, let Zj = [ K’nj(X), Yj]‘, where W”j(X) are kernel or nearest neighbor weights. We first sort {Zj}j”= 1 on the variable Yj, and find the largest index J such that i j=
wnj(x)d
1
LX.
Ch. 38: Applird
Nonparametric
2319
Methods
Then let (31)
c*,(x) = YJ.
Stute (1986) shows that e,(x) consistently estimates c,(x), with the same convergence rates as in ordinary nonparametric regression, see also Bhattacharya and Gangopadhyay (1990). When K is the uniform kernel and c( = i, this procedure corresponds to the running median discussed in Hardle (1990, pp 69-71). A smoother estimator is obtained by also smoothing in the y direction, i.e.
Provided K has at least r non-zero derivatives, the rth derivative of C,(X) can be estimated by the rth derivative of e,(x). See Anand et al. (1993) and Robb et al. (1992) for applications. An alternative method of estimating conditional quantiles is through minimizing an appropriate loss function. This idea originated in Koenker and Bassett (1978). In particular, e,(x) = arg,min
t
K,(x - X,)p,( K - 19),
(32)
i=l
where p,(y) = \yJ + (2a - l)y, consistently estimates C,(X). Computation of the estimator can be carried out by linear programming techniques. Chaudhuri (1991) provides asymptotic theory for this estimator in a general multidimensional context and for estimators of the derivatives of c,(x). In neither (31) nor (32) is the estimator linear in &:., although the asymptotic distribution of the estimators are determined by a linear approximation to them, i.e. the estimators are asymptotically normal.
4. 4.1.
Optimality
and bandwidth choice
Optimality
Let Q(h) be a performance asymptotically optimal if
Q@*) ~~ infhEHnQ@)
’
criterion.
We say that a bandwidth
sequence
h* is
(33)
as n + co, where H, is the range of permissible bandwidths, There are a number of alternative optimality criteria in use. Finally, we may be interested in the quadratic
W. Hiirdle and 0. Linton
2320
loss of the estimator at a single point x, which is measured by the Meun squared error, MSE(&(x)}. Secondly, we may be only concerned with a global measure of performance. In this case, we may consider the Integrated mean squared error, IMSE = ~MSE[&,,(X)]Z(X)~( x )d x for some weighting function rc(.). An alternative is the in-sample version of this, the aoeraged squared error d,(h) = n-l
j$I ChhCxj) - m(Xj)127c(Xj).
(34)
The purpose of rr(.) may be to downweight observations in the tail of X’s distribution, and thereby to eliminate boundary effects - see Miiller (1988) for a discussion. When h = O(n-‘I’), the squared bias and the variance of the kernel smoother have the same magnitude; this is the optimal order of magnitude for h with respect to all three criteria, and the corresponding performance measures are all 0(n-415) in this case. Now let h = yn-‘I’, where y is a constant. The optimal constant balances the contributions to MSE from the squared bias and the variance respectively. From Theorem 2 we obtain an approximate mean squared error expansion, MSE[&,(x)] and the bandwidth
h,(x)
Similarly,
=
[
z n-‘hP’V(x) minimizing
v(x) 4P(x) 1
+ h4B2(x). Equation
(35)
35 is
l/5 n-1/5
(36)
the optimal bandwidth with respect to IMSE is the same as in (36) with v = jVx)n(x)f( x ) d x and B2 = JB2(x)~(x)f(x) dx replacing V(x) and B’(x). Unfortunately, in either case the optimal bandwidth depends on the unknown regression function and design density. We discuss in Section 4.2 below how one can obtain empirical versions of (36). The optimal local bandwidths can vary considerably with x, a point which is best illustrated for density estimation. Suppose that the density is standard normal and a standard normal kernel is used. In this case, as x-+ co, h,(x)+ co: when data is sparse a wider window is called for. Also at x = f 1, h,(x) = co, which reflects the fact that 4” = 0 at these points. Elsewhere, substantiallyless smoothing is called for: at f 2.236, h,(x) = 0.884n-“5 (which is the minimum value of h,(x)). The optimal global bandwidth is 1.06~ 115. Although allowing the bandwidth to vary with x dominates over the strategy of throughout choosing a single bandwidth, in practice this requires consjderably more computation, and is rarely used in applications. By substituting ha in (35), we find that the optimal MSE and IMSE depend on
Ch. 38: Applied Nonparametric
2321
Methods Table 2 Kernel exchange
sfJ.q Uniform Triangle Epanechnikov Quartic Gaussian
Uniform
Triangle
1.000 1.398 1.272 1.507 0.575
0.715 1.000 0.910 1.078 0.411
rate.
Epanechnikov 0.786 1.099 1.000 1.185 0.452
Quartic
Gaussian
0.663 0.927 0.844 1.000 0.381
1.740 2.432 2.214 2.623 1.000
K only through
T(K) = v:W)PAK).
(37)
This functional can be minimized with respect to K using the calculus of variations, although it is necessary to first adopt a scale standardization of K - for details, see Gasser et al. (1985). A kernel is said to be optimal if it minimizes (37). The optimal kernel of order 2 is the Epanechnikov kernel given in Table 1. The third column of this table shows the loss in efficiency of other kernels in relation to this optimal one. Over a wide class of kernel estimators, the loss in efficiency is not that drastic; more important is the choice of h than the choice of K. Any kernel can be resealed as K*(.) = s- ‘K(./s) which of course changes the value of the kernel constants and hence h,. In particular, v,(K*)
= s-%,(K);
We can uncouple s*
=
&K*)
= s2/4K).
the scaling effect by using for each kernel K, that K* with scale
vAK*) 1’5
[ P;(K)1 for which pz(K*) = v2(K*). Now suppose we wish to compare two smooths with kernels K, and bandwidths hj respectively. This can be done by transforming both to their canonical scale, see Marron and Nolan (1989) and then comparing their ~7. In Table 2 we give the exchange rate between various commonly used kernels. For example, the bandwidth of 0.2 used with a quartic kernel in Figure 2, translates into a bandwidth of 0.133 for a uniform kernel and 0.076 for a Gaussian kernel.
4.2.
Choice of smoothing parameter
For each nonparametric regression method, one has to choose how much to smooth for the given dataset. In Section 3 we saw that k-NN, series, and spline estimation are asymptotically equivalent to the kernel method, so we describe here only the selection of bandwidth h for kernel regression smoothing.
W. Hiirdle and 0. Linton
2322
4.2.1.
Plug-in
The asymptotic approximation given in (36) can be used to determine an optimal local bandwidth. We can calculate an estimated optimal bandwidth iPl in which the consistent estimators &i*(x), 6,$(x), f,,*(x) and &(x) replace the unknown functions. We then use fig,,(x) to estimate m(x). Likewise, if a globally optimal bandwidth is required, one must substitute estimators of the appropriate average functionals. This procedure is generally fast and simple to implement. Its properties are examined in Hardle et al. (1992a). However, this method fails to provide pointwise optimal bandwidths, when m(x) possesses less than two continuous derivatives. Finally, a major disadvantage of this procedure is that a preliminary bandwidth h* must be chosen for estimation of m”(x) and the other quantities. 4.2.2.
Crossvalidation
Crossvalidation is a convenient method of global bandwidth choice for many problems, and relies on the well established principle of out-of-sample predictive validation. Suppose that optimality with respect to d,(h) is the aim. We must first replace d,(h) by a computable approximation to it. A naive estimate would be to just replace the unknown values m(Xj) by the observations Yj: p(h)=n-’
” jzl
ChL(xj)- yj124xj)~
This is called the resubstitution estimate. However, this quantity makes use of each observation twice - the response variable Yj is used in &,,(Xj) to predict itself. Therefore, p(h) can be made arbitrarily small by taking h +O (when there are no tied X observations). This fact can be expressed via asymptotic expressions for the moments of p. Conditional on Xi,. . . , X,, we have
‘Cd’)1 = ‘Cd,(h)1 + i
,tl~‘(XMXJ - 2:I,tlwni(XJ~2(XJ~(XJ,
I
(38)
and the third term is of the same order of magnitude as E[d,(h)], but with a negative sign. Therefore, d, is wrongly underestimated and the selected bandwidth will be downward biased. The simplest way to avoid this problem is to remove thejth observation 1 $$j(xj)
=
Kh(Xj
EL1
j#i
-
Kh(Xj-Xi)
xi)yi
(39)
Ch. 38: Applied Nonpuramrtric
This leave-one-out
2323
Methods
estimate
is used to form the so-called
crossvalidation
function
j=l
which is to be minimized with respect to h. For technical reasons, the minimum must be taken only over a restricted set of bandwidths such as H, = [np(1’5pr), n-(‘i5+i)], for some c > 0. Theorem
5
Assume that the conditions given in Hardle (1990, Theorem 5.1.1) hold. Then the bandwidth selection rule, “Choose 6 to minimize CV(h)” is asymptotically optimal with respect to d,(h) and IMSE. Proof See Hardle
and Marron
(1985).
The conditions include the restriction that f > 0 on the compact support of rc, moment conditions on E, and a Lipschitz condition on K. However, unlike the plug-in procedure, m and f need not be differentiable (a Lipschitz condition is required, however). 4.2.3.
Other data driven selectors
There are a number of different automatic bandwidth selectors that asymptotically optimal kernel smoothers. They are based on various correcting the downwards bias of the resubstitution estimate of dA(h). The p(h) is multiplied by a correction factor that in a sense penalizes h’s which small. The general form of this selector is
where c” is the correction
function
with first-order
Taylor
expansion
Z(u) = 1 + 2u + O(u2), as u + 0. Some well known (i) Generalized &&d)
produce ways of function are too
(41) examples
crossvalidation = (1 - u)-Z;
are: (Craven
and Wahba
1979; Li 1985),
W. Hiirdle and 0. Linton
2324
(ii) Akaike’s information
criterion
(Akaike
1970)
E,Ic(u) = exp 2~; (iii) Finite prediction E&U) (iv) Shibata’s
error (Akaike
1974).
= (1 + u)/( 1 - u); (198 1) model selector,
&(u) = 1 + 2u; (v) Rice’s (1984) bandwidth
selector,
&(U) = (1 - 2u)_‘. Hgrdle et al. (1988) show that the general criterion G(h) works in producing asymptotically optimal bandwidth selection, although they present their results for the equispaced design case only. The method of crossvalidation was applied to the car data set to find the optimal smoothing parameter h. A plot of the crossvalidation function is given in Figure 7.
Crossvalidation
I 1500 Figure 7. The crossvalidation
1 1600 function
0
1700
I
Function
I
1800 1900 Bandwidth h
1
2000
CV(h) for the car data. Quartic XploRe macro regcvl.
I
2100
I
2200
kernel. Computation
made with
Ch. 38:
Applied
Nonparametric
2325
Methods
The computation is for the quartic kernel using the WARPing method, see Hardle and Scott (1992). The minimal & = argminCV(h) is at 1922 which shows that in Figure 5a we used slightly too large a bandwidth. Hardle et al. (1988) investigate how far the crossvalidation optimal i is from the true optimum, &,, (that minimizes d,(h)). They show that for each optimization method, nl/10
L-Ii,
(T 1=+
w4(Q-
4(~0)1
NO, a2),
*c,x;,
(43)
where o2 and C, are both positive. The above methods are all asymptotically equivalent at this higher order of approximation. Another interesting result is that the estimated i and optimum fro are actually negatively correlated! Hall and Johnstone (1992) show how to correct for this effect in density estimation and in regression with uniform X’s. It is still an open question how to improve this for the general regression setting we are considering here. There has been considerable research into finding improved methods of bandwidth selection that give faster rates of convergence in (42). Most of this work is in density estimation ~ see the recent review of Jones et al. (1992) for references. In this case, various & consistent bandwidth selectors have been suggested. The finite sample properties of these procedures are not well established, although Park and Turlach (1992) contains some preliminary simulation evidence. Hardle et al. (1992a) construct a $ technique.
5.
consistent
bandwidth
selector for regression
based on a bias reduction
Application to time series
In the theoretical development described up to this point, we have restricted our attention to independent sampling. However, smoothing methods can also be applied to dependent data. Considerable resources are devoted to providing forecasts of macroeconomic entities such as GNP, unemployment and inflation, while the benefits of predicting asset prices are obvious. In many cases linear models have been the basis of econometric prediction, while more recently nonlinear models such as ARCH have become popular. Nonparametric methods can also be applied in this context, and provide a model free basis of predicting future outcomes. We focus on the issue of functional form, rather than that of correlation structure - this latter issue is treated, from a nonparametric point of view, in Brillinger (1980), see also Phillips (1991) and Robinson (1991). Suppose that we observe the vector time series {Z,};, 1, where Zi = ( Yi, Xi), and Xi is strictly exogenous in the sense of Engle et al. (1983). It is convenient to assume
W. Hiirdle and 0. Linton
2326
that the process is stationary and mixing is as defined in Gallant and White (1988) which includes most linear processes, for example, although extensions to certain types of nonstationarity can also be permitted. We consider two distinct problems. Firstly, we want to predict Yi from its own past which we call autoregression. Secondly, we want to predict Yi from Xi. This problem we call regression with correlated errors.
5.1.
Autoregression
For convenience we restrict our attention to the problem of predicting the scalar Yi+k given Yi for some k > 0. The best predictor is provided by the autoregression function Mk(Y)
=
E(Yi+kl
More generally, lagged values, v,(Y)
=
yi
Y).
=
one may wish to estimate
Var(Yi+kI
yi
=
(44) the conditional
variance
of Yi+k from
Yh
One can also estimate the predictive density fri+r,Pi. These quantities can be estimated using any of the smoothing methods described in this chapter. See Robinson (1983) and Bierens (1987) for some theoretical results including convergence rates and asymptotic distributions. Diebold and Nason (1990), Meese and Rose (1991) and Mizrach (1992) estimate M(.) for use in predicting asset prices over short horizons. In each case, a locally weighted regression estimator was employed with a nearest neighbor type window, while bandwidth was chosen subjectively (except in Mizrach (1992) where crossvalidation was used). Not surprisingly, their published results concluded that there was little gain in predictive accuracy over a simple random walk. Pagan and Hong (1991), Pagan and Schwert (1990) and Pagan and Ullah (1988) estimate V( .) in order to evaluate the risk premium of asset returns. They used a variety of nonparametric methods including Fourier series and kernels. Their focus was on estimation rather than prediction, and their procedures relied on some parametric estimation. See also Whistler (1988) and Gallant et al. (1991). A scientific basis can also be found for choosing bandwidth in this sampling scheme. Hardle and Vieu (1991) showed that crossvalidation also works in the autoregression problem - “choose” i = arg min CV(h) gives asymptotically optimal estimates. To illustrate this result we simulated an autoregressive process Yi = M( Yi_ i) + si with
MY) = Y ev( - ~“1,
(45)
Ch. 38:
Applied
Nonparametric
2321
Methods
True and Estimated Function M J
I
I
4.6
4.3
I
I
I
c
010 x
013
016
019
* 6
4i.9
Figure 8. The time regression
function
M(y) = y exp( -y2) for the simulated kernel smoother (thick line).
example
(thin line) and the
where the innovations si were uniformly distributed over the interval (- l/2,1/2). Such a process is a-mixing with geometrically decreasing a(n) as shown by Doukhan and Ghindes (1980) and Gyorfi et al. (1990, Section 111.4.4).The sample size investigated was n = 100. The quartic kernel function in (3) was used. The minimum of CV(h) was 6 = 0.43, while the maximum of d,(h) was at h = 0.52. The curve of d,(h) was very flat for this example, since there was very little bias present. In Figure 8 we compare the estimated curve with the autoregression function and find good agreement.
5.2.
Correlated errors
We now consider the regression model Yi = m(X,) +
Ei,
where Xi is fixed in repeated samples and the errors Ei satisfy .E(&JXi)= 0, but are autocorrelated. The kernel estimator A,,(X)of m(x) is consistent under quite general conditions. In fact, its bias is the same as when the Q are independent. However, the variance is generally affected by the dependency structure. Suppose that the error
W. Hiirdle und 0. Linton
2328
process is MA(l), i.e. Ei =
ui + l3ui_l,
where ui are iid with zero mean and variance
Var[rfi,(x)]
= fr* (1 + e2) i [
c*. In this case,
1
Wii + 2flnf1 WniWni+ 1 i=l
i=l
which is O(K ‘h-l), but differs from Theorem 2. If the explanatory variable time itself (i.e. Xi = i/n, i = 1,. . , n), then a further approximation is possible:
were
Var C%(x)1Z $Zr”(l + t3* + 2@!,(K). Hart and Wehrly (1986) develop MSE approximations in a regression model in which the error correlation is a general function p(.) of the time between observations. Unfortunately, crossvalidation fails in this case. Suppose that the errors are AR(l) with autoregression parameter close to one. The effect on the crossvalidation technique described in Section 4 must be drastic. The error process stays a long time on one side of the mean curve. Therefore, the bandwidth selection procedure gives undersmoothed estimates, since it interprets the little bumps of the error process as part of the regression curve. An example is given in Hardle (1990, Figures 7.6 and 7.7). The effect of correlation on the crossvalidation criterion may be mitigated by leaving out more than just one observation. For the MA(l) process, leaving out the 3 contiguous (in time) observations works. This “leave-out-some” technique is also sometimes appealing in an independent setting. See the discussion of Hardle et al. (1988) and Hart and Vieu (1990). It may also be possible to correct for this effect by “whitening” the residuals in (40), although this has yet to be shown.
6.
Applications
to semiparametric
estimation
Semiparametric models offer a compromise between parametric modeling and the nonparametric approaches we have discussed. When data are high dimensional or if it is necessary to account for both functional form and correlation of a general nature, fully nonparametric methods may not perform well. In this case, semiparametric models may be preferred. By a semiparametric model we mean that the density of the observable data, conditional on any ancillary information, is completely specified by a finite
Ch. 38: Applied
Nonparametric
2329
Methods
dimensional parameter 8 and an unknown function G(.). The exhaustive monograph of Bickel et al. (1992) develops a comprehensive theory of inference for a large number of semiparametric models, although mostly within iid sampling. There are a number of reviews for econometricians including Robinson (1988b), Newey (1990) and Powell (this volume). In many cases, f3 is of primary interest. Andrews (1989) provides asymptotic theory for a general procedure designed to estimate 0 when a preliminary estimate G of G is available. The method involves substituting G for G in an estimating equation derived, perhaps, from a likelihood function. Typically, the dependence of the estimated parameters 8on the nonparametric estimators disappears asymptotically, and
where fl, > 0. Nevertheless, the small sample properties of e can depend quite closely on the way in which this preliminary step is carried out ~ see the Monte Carlo evidence contained in Engle and Gardiner (1976) Hsieh and Manski (1987), Stock (1989) and Delgado (1992). Some recent work has investigated analytically the small sample properties of semiparametric estimators. Carroll and Hardle (1989), Cavanagh (1989), HCrdle et al. (1992b), Linton (1991, 1992,1993) and Powell and Stoker(1991) develop asymptotic expansions of the form
(48) where q1 and q2 both increase with n under restrictions on h(n). These expansions yield a formula for the optimal bandwidth similar to (36). An important finding is that different amounts of smoothing are required for i? and for G; in particular, it is often optimal to undersmooth G (by an order of magnitude) when the properties of i? are at stake. The MSE expansions can be used to define a plug-in method of bandwidth choice for 6 that is based on second order optimality considerations.
6.1.
The partially linear model
Consider Yi = fiTXi + Cf)(Zi)+
Ei;
xi
=
dzi)
+
Vi,
i=12 > ,..., n
(49)
where 4(.) and g(.) are of unknown functional form, while E(aiIZi) = E(qijZi) = 0. If an inappropriate parametric model is fit to $(.), the resulting MLE of p may be
W. Hiirdlr and 0. Linton
2330
inconsistent. This necessitates using nonparametric methods that allow a more general functional form, when it is needed. Engle et al. (1986) uses this model to estimate the effects of temperature on electricity demand, while Stock (1991) models the effect of the proximity of toxic waste on house prices. In both cases, the effect is highly nonlinear as the large number of covariates make a fully nonparametric analysis infeasible. See also Olley and Pakes (1991). This specification also arises from various sample selection models. See Ahn and Powell (1990) and Newey et al. (1990). Notice that Yi - E( Yi 1Zi) = B’[Xi - E(Xi 1Zi)] +
Ei.
Robinson
(1988a) constructed a semiparametric and m(Z,) = E( Yi/Zi) by nonparametric and then letting E(X,/Z,)
fi =
[
estimator of /I replacing g(Z,) = kernel estimators #,,(Zi) and &(Zi)
igl Ixi - C9h(Zi)} Ixi - 9h(ziJ}‘]m ’ $I lIxi - 4h(Zi)l Cyi- hh(zi)l.
In fact, Robinson modified this estimator by trimming out observations for which the marginal density of Z was small. Robinson’s estimator satisfies (47, provided the dimensions of Z are not too high relative to the order of the kernel being used (provided m and g are sufficiently smooth). Linton (1992) establishes that the optimal bandwidth for b is O(n-2’9), when Z is scalar, and the resulting correction to the (asymptotic) MSE of the standardized estimator is O(n - 719).
4.2.
Heteroskedastic
Consider
the following
Yi = Z(Xi; p) + Ei,
nonlinear regression
nonlinear
regression
i= 1,2 ,...,
model:
n,
(W
where t(.; /3) is known, while E(siIXi) = 0 and Var(siI Xi) = a2(Xi), where a*(.) is of unknown functional form. Efficient estimation of j3 can be carried out using the pseudo-likelihood principle. Assuming that the si are iid and normally distributed, the sample log-likelihood function is proportional to _P[/3; cJ2( .)] = i [ Yi - z(X,; p)]‘a”(x,)i=l
1,
where a’(.) is known. In the semiparametric situation nonparametric estimator 8*(Xi), and then let fi minimize
(51) we replace ._!?[/i’;e2(.)].
a2(Xi) by a
Ch. 38: Applied Nonparametric
Carroll (1982) and Robinson in which case p=
i
t
xix’82(xi)-’
i=l
233 1
Methods
(1987) examine
I-l n izl
xi
yi:.82(xi)-
the situation
”
where t(X; /3) = BTX
(52)
They establish (under iid sampling) that /? is asymptotically equivalent to the infeasible GLS estimator based on (51). Remarkably, Robinson allows X to have unbounded support, yet did not need to trim out contributions from its tails: he used nearest neighbor estimators of g2(.) that always average over the same number of observations. Extensions of this model to the multivariate nonlinear r(*; 8) case are considered in Delgado (1992), while Hidalgo (1992) allows both heteroskedasticity and serial correlation of unknown form. Applications include Melenberg and van Soest (1991), Altug and Miller (1992) and Whistler (1988). Carroll and Hardle (1989), Cavanagh (1989) and Linton (1993) develop second order theory for these estimators. In this case, the optimal bandwidth is O(n-‘1’) when X is scalar, making the correction to the (asymptotic) MSE O(n-4’5).
6.3.
Single index models
When the conditional distribution of a scalar variable Y, given the d-dimensional predictor variable X, depends on X only through the index /I’X, we say that this is a single index model. One example is the single index regression model in which E [ Y 1X = x] = m(x) = g(xT&, but no other restrictions are imposed. Define the vector of average derivatives 6=
ECm’(Wl= ECdWB)IP,
(53)
and note that 6 determines /I up to scale - as shown by Stoker (1986). Let f(x) denote the density of X and 1 be its vector of the negative log-derivatives (partial), I= - a log flax = -f’/f (1 is also called the score oector). Under the assumptions on f given in Powell et al. (1989), we can write 6 = E[m’(X)]
= E[l(X) Y],
(54)
and we estimate 6 by s^= n- ’ x1= ,&(Xi) Yi, where &(x) = -_&/fH(x) is an estimator of l(x) based on a kernel density smoother with bandwidth matrix H. Furthermore, g(.) is estimated by a kernel estimator gh(.) for which [s^‘XJ;, 1 is the right-hand side data. Hardle and Stoker (1989) show that
W. Hiirdle and 0. Linton
2332
where & = Var {1(X)[ Y - m(X)] + m’(X)}, while Q,,converges at rate Jnh - i.e. like a one dimensional function. Stoker (1991) proposed alternative estimators for 6 based on first estimating the partial derivatives m’(x) and then averaging over the observations. A Monte Carlo comparison of these methods is presented in Stoker and Villas-Boas (1992). Hlrdle et al. (1992b) develop a second order theory for & in the scalar case, the optimal bandwidth h is O(np2j7) and the resulting correction to the MSE is O(n- ‘I’). Another example is the binary choice model Yi = Z(B’Xi +
Ui ~
O),
(55)
where (X, u) are iid. There are many treatments of this specification following the seminal paper of Manski (1975) - in which a slightly more general specification was considered. We assume also that u is independent of X with unknown distribution function F(.), in which case Pr[ F = 1 IX,] = F(pTXi) = E( YJP’X,), i.e. F(.) is a regression function. In fact, (55) is a special case of (53). Applications include Das (1990), Horowitz (199 l), and Melenberg and van Soest (199 1). Klein and Spady (1993) use the profile likelihood principle (see also Ichimura and Lee (1991)) to obtain (semiparametric) efficient estimates of 8. When F is known, the sample log-likelihood function is
Y{F(fi)} = i (Yiln[F(flrXJ]
+ (1 - YJln[l
- F(fi’Xi)]}.
(56)
i=l
For given /I, let @rX) be the nonparametric regression estimator feasible estimator fi of fi is obtained as the minimizer of y[&I)]
= i
{ Yiln[@rXi)]
+ (1 - Y,)ln[l
- @rX,)]}.
of E( Y 1/?‘X). A
(57)
i=l
This can be found using standard numerical optimization techniques. The average derivative estimator can be used to provide initial consistent estimators of B, although it is not in general efficient, see Cosslett (1987). Note that to establish &-consistency, it is necessary to employ bias reduction techniques such as higher order kernels as well as to trim out contributions from sparse regions. Note also that b is not as efficient as the MLE obtained from (56). We examined the performance of the average derivative estimator on a simulated dataset, where
Pr( Y = 11X = x) = A(brx) + 0.64’(~rx) B = (I, l)T,
Ch. 38:
Applied
Nonparumetric
2333
Methods
ADE Projection
I
-0.6
I
-0.4
0 OI-PD I -0.2
I
0.0 DeltaH’X
oan 0 0 I 0.2
I
0.4
I
0.6
Figure 9. For the simulated dataset 6, X versus Y and two estimates of g(pX,) are shown. The thick line shows the Nadaraya-Watson estimator with a bandwidth h = 0.3, while for the thin line h = 0.1 was chosen.
while A and C#Jare the standard logit and normal density functions respectively. A sample of size n = 200 was generated, and the bivariate density function was estimated using a NadarayaaWatson estimator with bandwidth matrix H = diag(0.99,0.78). This example is taken from Hardle and Turlach (1992). The estimation of 6 and its asymptotic covariance matrix & was done with XploRe macro adefit. For this example 6 = (0.135, 0.135)‘, and
Figure 9 shows the estimated regression function J,,(pXi). These results allow us to test some hypotheses formally using a Wald statistic (see Stoker (1992), pp 53-54). In particular, to test the restriction R6 = ro, the Wald statistic W= n(R6^- rJT(R&RT)is compared technique.
‘(Rg-
to a x2 (rank R) critical
ro) value. Table
3 gives some examples
for this
W. Hiirdle and 0. Linton
2334
Wald statistics Restriction 61 =p=0 S’ = 62 = 0.135 6’ = 62
7.
Table 3 for some restrictions
Value W
d.f.
25.25 0.365 0.027
2 2 1
on 6.
P[x2(d.f.)
> W]
0 0.83 0.869
Conclusions
The nonparametric methods we have examined are especially variable over which the smoothing takes place is one dimensional. relationship can be plotted and evaluated, while the estimators
useful when the In this case, the converge at rate
Jnh. For higher dimensions these methods are less attractive due to the slower rate of convergence and the lack of simple but comprehensive graphs. In these cases, there are a number of restricted structures that can be employed including the nonparametric additive models of Hastie and Tibshirani (1990), or semiparametric models like the partially linear and index models examined in Section 6.
References Abramson, I. (1982) “On bandwidth variation in .kernel estimates -a square root law”, Annals of Statistics, 10, 1217-1223. Ahn, H. and J.L. Powell (1990) “Estimation of Censored Selection Models with a Nonparametric Selection Mechanism”, Unpublished Manuscript, University of Wisconsin. Akaike, H. (1970) “Statistical predictor information”, Annals of the Institute of Statistical Mathematics, 22,203-17. Akaike, H. (1974) “A new look at the statistical model identification”, IEEE Transactions of Automatic Control, AC 19, 716-23. Altug, S. and R.A. Miller (1992) “Human capital, aggregate shocks and panel data estimation”, Unpublished manuscript, University of Minnesota. Anand, S., C.J. Harris and 0. Linton (1993) “On the concept of ultrapoverty”, Harvard Center for Population Studies Working paper, 93-02. Andrews, D.W.K. (1989) “Semiparametric Econometric Models: I Estimation”, Cowles Foundation Discussion paper 908. Andrews, D.W.K. (1991) “Asymptotic Normality of Series Estimators for Nonparametric and Semiparametric Regression Models”, Econometrica, 59, 307-346. Andrews, D.W.K. and Y.-J. Whang (1990) “Additive and Interactive Regression Models: Circumvention of the Curse of Dimensionality”, Econometric Theory, 6, 466-479. Ansley, C.F., R. Kohn and C. Wong (1993) “Nonparametric spline regression with prior information”, Biometrikn. 80. 75-88. Banks, J., R: Blundell and A. Lewbel (1993) “Quadratic Engel curves, welfare measurement and consumer demand”, Institute for Fiscal Studies, 92-14. Bhattacharya, P.K. and A.K. Gangopadhyay (1990) “Kernel and Nearest-Neighbor Estimation of a Conditional Quantile”, Annals ofbtatistics. 18. 1400-15. Bickel, P.J., C.A.J. Klaassen, Y. kitov and’ J.A. Welner (1992) Ejicient and Adaptive Znference in Semiparametric Models. Johns Hopkins University Press: Baltimore.
Ch. 38: Applied Nonparametric Methods
2335
in Advances in Econometrics: Fifth Bierens, H.J. (1987) “Kernel Estimators of Regression Functions”, World Congress, Vol 1, ed. by T.F. Bewley. Cambridge University Press. Bierens, H.J. and H.A. Pott-Buter (1990) “Specification of household Engel curves by nonparametric regression”, Econometric Reviews, 9, 123-184. Brillinger, D.R. (1980) Time Series, Data Analysis and, Theory, Holden-Day. Carroll, R.J. (1982) “Adapting for Heteroscedasttctty in Linear Models”, Annals of Statistics, 10, 122441233. Carroll, R.J. and W. Hlrdle (1989) “Second Order Eflects in Semiparametric Weighted Least Squares Regression”, Statistics, 20, 179-186. Cavanagh, C.L. (1989) “The cost of adapting for heteroskedasticity in linear models”, Unpublished manuscript, Harvard University. Chambers, J.M., W.S. Cleveland, B. Kleiner and P.A. Tukey (1983) Graphical Methodsfor Data Analysis. Duxburry Press. Chaudhuri, P. (1991) “Global nonparametric estimation of conditional quantile functions and their derivatives”, Journal of Multivariate Analysis, 39, 246-269. Cleveland, W.S. (1979) “Robust Locally Weighted Regression and Smoothing Scatterplots”, Journal of the American Statistical Association, 74, 829-836. Cosslett, S.R. (1987) “Efficiency bounds for Distribution-free estimators of the Binary Choice and the Censored Regression model”, Econometrica, 55, 559-587. Cox, D.R. and D.V. Hinkley (1974) Theoretical Statistics. Chapman and Hall. Craven, P. and Wahba, G. (1979) “Smoothing noisy data with spline functions”, Numer. Math., 31, 377-403. Daniell, P.J. (1946) “Discussion of paper by M.S. Bartlett”, Journal of the Royal Statistical Society Supplement, 8 :27. Das, S. (1990) “A Semiparametric Structural Analysis of the Idling of Cement Kilns”, Journal of Econometrics, 50, 235-256. Deaton, A.S. (1991) “Rice-prices and income distribution in Thailand: a nonparametric analysis”, Economic Journal, 99, l-37. Deaton, A.S. (1993) “Data and econometric tools for development economics”, The Handbook of Development Economics, Volume III, Eds J. Behrman and T.N. Srinavasan. Delgado, M. (1992) “Semiparametric Generalised Least Squares in the Multivariate Nonlinear Regression Model”, Econometric Theory, 8,203-222. Diebold, F., and J. Nason (1990) “Nonparametric exchange rate prediction?“, Journal of International Economics, 28,315-332. Doukhan, P. and Ghindts, M. (1980) “Estimation dans le processus X, = f (X,_ 1) + E,“, Comptes Rendus. Academic des Sciences de Paris, 297, Serie A, 61-4. Elbadawi, I., A.R. Gallant and G. Souza (1983) “An elasticity can be estimated consistently without a priori knowledge of functional form”, Econometrica, 51, 1731l1751. Engle, R.F. and R. Gardiner (1976) “Some Finite Sample Properties of Spectral Estimators of a Linear Regression”, Econometrica, 44, 149-165. Engle, R.F., D.F. Hendry and J.F. Richard (1983) “Exogeneity”, Econometrica, 51, 277-304. Engle, R.F., C.W.J. Granger, J. Rice and A. Weiss (1986) “Semiparametric Estimates of the Relationship Between Weather and Electricity Sales”, Journal of the American Statistical Association, 81, 310-320. Eubank, R.L. (1988) Smoothing Splines and Nonparametric Regression. Marcel Dekker. Fama, E.F. (1965) “The behavior of stock prices”, Journal of Business, 38, 34-105. Family Expenditure Survey, Annual Base Tapes (1968-1983). Department of Employment, Statistics Division, Her Majesty’s Stationary Office, London, 196881983. Fan, J. (1992) “Design-Adaptive Nonparametric Regression”, Journal of the American Statistical Association, 87, 998-1004. Fan, J. and I. 
Gijbels (1992) “Spatial and Design Adaptation: Variable order approximation in function estimation”, Institute ofStat&ics MimeoSeries,no 2080, University ofNorthCarolinaat Chapel Hill. Fan, J., N.E. Heckman and M.P. Wand (1992) “Local Polynomial Kernel Regression for Generalized Linear Models and Quasi-Likelihood Functions”, University of British Columbia Working paper 92-028. Fix, E. and J.L. Hodges (1951) “Discriminatory analysis, nonparametric estimation: consistency properties”, Report No 4, Project no 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas.
2336 Gallant,
W. Hiirdle and 0. Linton
A.R. and G. Souza (1991) “On the asymptotic normality of Fourier flexible form estimates”, Journal ofEconometrics, 50, 329-353. Gallant, A.R. and H. White (1988) A Unified Theory ofEstimation and Znferencefor Nonlinear Dynamic Models. Blackwell: Oxford. Gallant, A.R., D.A. Hsieh and G.E. Tauchen (1991) “On Fitting a Recalcitrant Series: The Pound/Dollar Exchange Rate, 1974-1983”, in Nonparametric and Semiparametric Methods in Econometrics and Statistics. Eds Barnett, Powell, and Tauchen. Cambridge University Press. Gasser, T. and H.G. Miiller (1984) “Estimating regression functions and their derivatives by the kernel method”, Scandinavian Journal of Statistics, 11, 171-85. Gasser, T., H.G. Miiller and V. Mammitzsch (1985) “Kernels for nonparametric curve estimation”, Journal of the Royal Statistical Society Series B, 47,238852. Gozalo, P.L. (1989) “Nonparametric analysis of Engel curves: estimation and testing of demographic effects”, Brown University, Department of Economics Working paper 92215. Gyorfi, L., W. Hardle, P. Sarda and P. Vieu (1990) Nonparametric Curve Estimation,fiom Time Series. Lecture Notes in Statistics, 60. Springer-Verlag: Heidelberg, New York. Hall, P. (1992) The Bootstrap and Edgeworth Expansion. Springer-Verlag: New York. Hall, P. (1993) “On Edgeworth Expansion and Bootstrap Confidence Bands in Nonparametric Curve Estimation”, Journal of the Royal Statistical Society Series B, 55, 291-304. Hall, P. and I. Johnstone (1992) “Empirical functional and efficient smoothing parameter selection”, Journal of the Royal Statistical Society Series B, 54, 4755530. Hardle, W. (1990) Applied Nonparametric Regression. Econometric Society Monographs 19, Cambridge University Press. Hardle, W. (1991) Smoothing Techniques with Implementation. Springer-Verlag: Heidelberg, New York, Berlin, Hardle, W. and.R.J. Carroll (1990) “Biased cross-validation for a kernel regression estimator and its derivatives”, Osterreichische Zeitschriffiir Statistik und Informatik, 20, 53-64. Hardle, W. and M. Jerison (1991) “Cross Section Engel Curves over Time”, Recherches Economiques de Louvain, 57, 391-431. Hardle, W. and J.S. Marron (1985) “Optimal bandwidth selection in nonparametric regression function estimation”, Annals of Statistics, 13, 1465581. Hardle, W. and M. Miiller (1993) “Nichtparametrische Gllttungsmethoden in der alltaglichen statistischen Praxis”, Allgemeines Statistiches Archiv, 77, 9-31. HPrdle, W. and D.W. Scott (1992) “Smoothing in Low and High Dimensions by Weighted Averaging Using Rounded Points”, Computational Statistics, 1, 97-128. Hlrdle, W. and T.M. Stoker (1989) “Investigating Smooth Multiple Regression by the Method of Average Derivatives”, Journal of the American Statistical Association, 84,9866995. Hlrdle, W. and B.A. Turlach (1992) “Nonparametric Approaches to Generalized Linear Models”, In: Fahrmeir, L., Francis, B., Gilchrist, R., Tutz, G. (Eds.) Aduances in GLIM and Statistical Modelling, Lecture Notes in Statistics, 78. SpringerrVerlag: New York. Hardle, W. and P. Vieu (1991) “Kernel regression smoothing of time series”, Journal of Time Series Analysis, 13, 209-232. Hardle, W., P. Hall and J.S. Marron (1988) “How far are automatically chosen regression smoothing parameters from their optimum. ?“, Journal of the American Statistical Association, 83, 86699. Hlrdle, W., P. Hall and J.S. Marron (1992a) “Regression smoothing parameters that are not far from their optimum”, Journal of the American Statistical Association, 87,227-233. Hardle, W., J. Hart, J.S. Marron and A.B. 
Tsybakov (1992b) “Bandwidth Choice for Average Derivative Estimation”, Journal of the American Statistical Association, 87, 218-226. HPrdle, W., P. Hall and H. Ichimura (1993) “Optical Smoothing in Single Index Models”, Annals of Statistics, 21, to appear. Hart, J. and P. Vieu (1990) “Data-driven bandwidth choice for density estimation based on dependent data”, Annals of Statistics, 18, 873-890. Hart, D. and T.E. Wehrly (1986) “Kernel regression estimation using repeated measurements data”, Journal of the American Statistical Association, 81, 1080-g. Hastie, T.J. and R.J. Tibshirani (1990) Generalized Additive Models. Chapman and Hall. Hausman, J.A. and W.K. Newey (1992) “Nonparametric estimation of exact consumer surplus and deadweight loss”, MIT, Department of Economics Working paper 93-2, Massachusetts.
Ch. 38: Applied Nonparametric Methods
2331
Hidalgo, J. (1992) “Adaptive Estimation in Time Series Models with Heteroscedasticity of Unknown Form”, Econometric Theory, 8, 161-187. Hildenbrand, K. and W. Hildenbrand (1986) “On the mean income effect: a data analysis of the U.K. family expenditure survey”, in Contributions to Mathematical Economics, ed W. Hildenbrand and A. Mas-Colell. North-Holland: Amsterdam. Hildenbrand, W. and A. Kneip (1992) “Family expenditure data, heteroscedasticity and the law of demand”, Universitat Bonn Discussion paper A-390. Horuwitz, J.L. (1991) “Semiparametric estimation of a work-trip mode choice model”, University of Iowa Department of Economics Working paper 91-12. Hsieh, D.A. and C.F. Manski (1987) “Monte Carlo Evidence on Adaptive Maximum Likelihood Estimation of a Regression”, Annals ofStatistics, 15, 541-551. Hussey, R. (1992) “Nonparametric evidence on asymmetry in business cycles using aggregate employment time series”, Journal of Econometrics, 51, 217-231. Ichimura, H. and L.F. Lee (1991) “Semiparametric Least Squares Estimation of Multiple Index Models: Single Equation Estimation”, in Nonparametric and Semiparametric Methods in Econometrics and Statistics. Eds Barnett, Powell, and Tauchen. Cambridge University Press. Jones, M.C. (1985) “Discussion of the paper by B.W. Silverman”, Journal ofthe Royal Statistical Society Series B, 47, 25-26. Jones, M.C. (1989) “Discretized and interpolated Kernel Density Estimates”, Journal ofthe American Statistical Association, 84, 733-741. Jones, M.C. and P.J. Foster (1993) “Generalized jacknifing and higher order kernels”, Forthcoming in Journal of Nonparametric Statistics. Jones, M.C., J.S. Marron and S.J. Sheather (1992) “Progress in data-based selection for Kernel Density estimation”, Australian Graduate School of Management Working paper no 92-014. Jones, M.S., 0. Linton and J.P. Nielsen (1993) “A multiplicative bias reduction method”, Preprint, Nuffield College, Oxford. Klein, R.W. and R.H. Spady (1993) “An Efficient Semiparametric Estimator for Binary Choice Models”, Econometrica, 61, 387-421. Koenker, R. and G. Bassett (1978) “Regression quantiles”, Econometrica, 46, 33-50. in Biometrika. Koenker, R., P. Ng and S. Portnoy (1993) “Q uantile Smoothing Splines”, Forthcoming Lewbel, A. (1991) “The Rank of Demand Systems: Theory and Nonparametric Estimation”, Econometrica, 59, 71 l-730. Li, K.-C. (1985) “From Stein’s unbiased risk estimates to the method of generalized cross-validation”, Annals of Statistics, 13, 1352-77. Linton, O.B. (1991) “Edgeworth Approximation in Semiparametric Regression Models”, PhD thesis, Department of Economics, UC Berkeley. Linton, O.B. (1992) “Second Order Approximation in the Partially Linear Model”, Cowles Foundation Discussion Paper no 1065. Linton, O.B. (1993) “Second Order Approximation in a linear regression with heteroskedasticity of unknown form”, Nuffield College Discussion paper no 75. Linton, O.B. and J.P. Nielsen (1993) “A Multiplicative Bias Reduction Method for Nonparametric Regression”, Forthcoming in Statistics and Probability Letters. McFadden, D. (1985) “Specification ofeconometric models”, Econometric Society, Presidential Address. Mack, Y.P. (1981) “Local properties of k-NN regression estimates”, SIAM J. Alg. Disc. Meth., 2, 31 l-23. Mandelbrot. B. (1963) “The variation of certain speculative prices”, Journal ofBusiness, 36, 394-419. Manski, C.F. (1975) “Maximum Score Estimation of the Stochastic Utility Model of Choice”, Journal of Econometrics, 3, 2055228. Marron, J.S. and D. 
Nolan (1989) “Canonical kernels for density estimation”, Statistics and Probability Letters, 7, 191-195. Marron, J.S. and M.P. Wand (1992) “Exact Mean Integrated Squared Error”, Annals ofstatistics, 20, 712-736. Meese, R.A. and A.K. Rose (1991) “An empirical assessment of nonlinearities in models of exchange rate determination”, Review of Economic Studies, 80, 603-619. Melenberg, B. and A. van Soest (1991) “Parametric and semi-parametric modelling of vacation expenditures”, CentER for Economic Research, Discussion paper no 9144, Tilburg, Holland.
2338
W. Hiirdle and 0. Linton
Mizrach, B. (1992) “Multivariate nearest-neighbor forecasts of EMS exchange rates”, Journal ofApplied Econometrics, I, 151-163. Miiller, H.G. (1987) “On the asymptotic mean square error of L, kernel estimates of C, functions”, Journal ofApproximation Theory, 51, 1933201. Miiller, H.G. (1988) Nonparametric Regression Analysis ofLongitudinal Data. Lecture Notes in Statistics, Vol. 46. SpringerrVerlag: Heidelberg/New York. Theory of Probability and its Applications, 10, Nadaraya, E.A. (1964) “On estimating regression”, 1866190. Newey, W.K. (1990) “Semiparametric Efficiency Bounds”, Journal ofApplied Econometrics, 5, 99-135. Newey, W.K., J.L. Powell and J.R. Walker (1990) “Semiparametric Estimation of Selection Models: Some Empirical Results”, American Economic Review Papers and Proceedings, 80, 324-328. Olley, G.S. and A. Pakes (1991) “The Dynamics of Productivity in the Telecommunications Equipment Industry”, Unpublished manuscript, Yale University. Pagan, A.R. and Y.S. Hong (1991) “Nonparametric Estimation and the Risk Premium”, in Nonparametric and Semiparametric Methods in Econometrics and Statistics. Eds Barnett, Powell, and Tauchen. Cambridge University Press. Pagan, A.R. and W. Schwert (1990) “Alternative models for conditional stock volatility”, Journal of Econometrics, 45, 267-290. Pagan, A.R. and A. Ullah (1988) “The econometric analysis of models with risk terms”, Journal of Applied Econometrics, 3, 87-105. Park, B.U. and B.A. Turlach (1992) “Practical performance of several data-driven bandwidth selectors (with discussion)“, Computational Statistics, 7,251&271. Phillips, P.C.B. (1991) “Spectral Regression for Cointegrated Time Series” in Nonparametric and Semiparametric Methods in Econometrics and Statistics. Eds Barnett, Powell, and Tauchen. Cambridge University Press. Powell, J.L. and T.M. Stoker (1991) “Optimal Bandwidth Choice for Density-Weighted Averages”, Unpublished manuscript, Princeton University. Powell, J.L., J.H. Stock and T.M. Stoker (1989) “Semiparametric Estimation of Index Coefficients”, Econometrica, 51, 1403-1430. Prakasa Rao, B.L.S. (1983) Nonparametric Functional Estimation. Academic Press. Rice, J.A. (1984) “Bandwidth choice for nonparametric regression”, Annals of Statistics, 12, 1215-30. Robb, A.L., L. Magee and J.B. Burbidge (1992) “Kernel smoothed consumption-age quantiles”, Canadian Journal of Economics, 25, 669-680. Robinson, P.M. (1983) “Nonparametric Estimators for Time Series”, Journal of Time Series Analysis, 4, 185-208. Robinson, P.M. (1987) “Asymptotically Efficient Estimation in the Presence of Heteroscedasticity of Unknown Form”, Econometrica, 56,875-891. Robinson, P.M. (1988a) “Root-N-Consistent Semiparametric Regression”, Econometrica, 56, 931-954. Robinson, P.M. (1988b) “Semiparametric Econometrics: A Survey”, Journal of AppliedEconometrics, 3, 35-51. Robinson, P.M. (1991) “Automatic Frequency Domain Inference on Semiparametric and Nonparametric Models”, Econometrica, 59, 132991364. Rosenblatt, M. (1956) “Remarks on some nonparametric estimates of a density function”, Annals of Mathematical Statistics, 27, 642-669. Ruppert, D. and M.P. Wand (1992) “Multivariate Locally Weighted Least Squares Regression”, Rice University, Technical Report no. 9224. Schuster, E.F. (1972) “Joint asymptotic distribution of the estimated regression function at a finite number of distinct points”, Annals of Mathematical Statistics, 43, 84-8. Sentana, E. and S. 
Wadhwani (1991) “Semi-parametric Estimation and the Predictability of Stock Returns: Some Lessons from Japan”, Review of Economic Studies, 58, 547-563. Shibata, R. (1981) “An optimal selection of regression variables”, Biometrika, 68,45-54. Silverman, B.W. (1984) “Spline smoothing: the equivalent variable kernel method”, Annals of Statistics, 12.898-916. Silverman, B.W. (1985) “Some aspects of the Sphne Smoothing approach to Non-parametric Regression Curve Fitting”, Journal of the Royal Statistical Society Series B, 47, l-52. Silverman, B.W. (1986). Densiry estimationfor statistics and data analysis. Chapman and Hall: London. Stock, J.H. (1989) “Nonparametric Policy Analysis”, Journal qfthe American Statistical Association, 84, 567-516.
Ch. 38: Applied Nonparumetric
Methods
2339
Stock, J.H. (1991) “Nonparametric Policy Analysis: An Application to Estimating Hazardous Waste Cleanup Benefits”, in Nonparametric and Semiparametric Methods in Econometrics and Statistics. Eds Barnett, Powell, and Tauchen. Cambridge University Press. Stoker, T.M. (1986) “Consistent Estimation of Scaled Coefficients”, Econometrica, 54, 1461-1481. Stoker, T.M. (1991) “Equivalence of direct, indirect, and slope estimators of average derivatives”, in Nonparametric and Semiparametric Methods in Econometrics and Statistics. Eds Barnett, Powell, and Tauchen. Cambridge University Press. Stoker, T.M. (1992) Lectures on Semiparametric Econometrics. CORE Lecture Series. Universite Catholique de Louvain, Belgium. Stoker, T.M. and J.M. Villas-Boas (1992) “Monte Carlo Simulation of Average Derivative Estimators”, Unpublished manuscript, MIT: Massachusetts. Stone, C.J. (1982) “Optical global rates ofconvergence for nonparametric regression”, Annals ofStatistics, 10,1040~1053. Strauss, J. and D. Thomas(1990) “The shape of the calorie-expenditure curve”, Unpublished manuscript, Rand Corporation, Santa Monica. Stute, W. (1986) “Conditional Empirical Processes”, Annals of Statistics, 14, 638-647. Tibshirani. R. (1984) “Local likelihood estimation”. PhD Thesis, Stanford University, California. Tikhonov,‘A.N. (1963) “Regularization of incorrectly posed problems”, Soviet Math:, 4, 1624-1627. Turlach, B.A. (1992) “On discretization methods for average derivative estimation”, CORE Discussion Paper no. 9232, Universite Catholique de Louvain, Louvain-la-Neuve, Belgium. Vapnik, V. (1982). Estimation of Dependencies Based on Empirical Data. SpringerrVerlag: Heidelberg, New York, Berlin. Wahba, G. (1990) Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics, no. 59. Watson, G.S. (1964) “Smooth regression analysis”, Sankhya Series A, 26, 359-372. Whistler, D. (1988) “Semiparametric ARCH Estimation of Intra-Daily Exchange Rate Volatility”, Unpublished manuscript, London School of Economics. Whittaker, E.T. (1923) “On a new method of graduation”, Proc. Edinburgh Math. Sot., 41, 63-75. XploRe (1993) An interactive statistical computing environment. Available from XploRe Systems, Institute fur Statistik und ekonometrie, Wirtschaftswissenschaftliche Fakultat, Humboldt-Universitat zu Berlin, D 10178 Berlin, Germany.
P.
2342
Hall
Abstract A brief account is given of the methodology and theory for the bootstrap. Methodology is developed in the context of the “equation” approach, which allows attention to be focussed on specific criteria for excellence, such as coverage error of a confidence interval or expected value of a bias-corrected estimator. This approach utilizes a definition of the bootstrap in which the key component is replacing a true distribution function by its empirical estimator. Our theory is Edgeworth expansion based, and is aimed specifically at elucidating properties of different methods for constructing bootstrap confidence intervals in a variety of settings. The reader interested in more detail than can be provided here is referred to the recent monograph of Hall (1992).
1.
Introduction
A broad interpretation of bootstrap methods argues that they are defined by replacing an unknown distribution function, F, by its empirical estimator, p, in a functional form for an unknown quantity of interest. From this standpoint, the individual who first suggested that a population mean, p =
xdF(x), s
could be estimated
x=
by the sample mean,
xdF(x), s
was using the bootstrap. We tend to favour this definition, although we appreciate that there are alternative views. Perhaps the most common alternative is to confer the name “bootstrap” on procedures that use Monte Carlo methods to effect a numerical approximation. While we see that this does have its merits, we would argue against it on two grounds. First, it is sometimes convenient to draw a distinction between the essentially statistical argument that leads to the “substitution” or “plug-in” method described in the previous paragraph, and the essentially numerical argument that employs a Monte Carlo approximation to calculate a functional of F^. There do exist statistical procedures which marry the numerical simulation and statistical estimation into one operation, where the simulation is regarded as primarily a statistical feature. Monte Carlo testing is one such procedure; see for example
Ch. 39: Methodoioyy and Theoryfor the Bootstrap
2343
Barnard (1963), Hope (I 968) and Marriott (1979). Our definition of the bootstrap would not regard Monte Carlo testing as a bootstrap procedure. That may be seen as either an advantage or a disadvantage, depending on one’s view. A second objection that one may have to defining the “bootstrap” strictly in terms of whether or not Monte Carlo methods are employed, is that the method of numerical computation becomes intrinsic to the definition. TO cite an extreme case, one would not usually think of using Monte Carlo methods to compute a sample mean or variance, but nevertheless those quantities might reasonably be regarded as bootstrap estimators of the population mean and variance, respectively. In a less obvious instance, estimators of bootstrap distribution functions, which would usually be candidates for approximation by Monte Carlo methods, may sometimes be computed most effectively by exact, non-Monte Carlo methods. See for example Fisher and Hall (1991). In other settings, saddlepoint methods provide excellent alternatives to simulation; see Davison and Hinkley (1988) and Reid (1988). Does a technique stop being a bootstrap method as soon as non-Monte Carlo methods are employed? To argue that it does seems unnecessarily pedantic, but to deny that it does would cause some problems for a bootstrap definition based on the notion of simulation. The name “bootstrap” was introduced by Efron (1979), and it is appropriate here to emphasize the fundamental contributions that he made. As Efron was careful to point out, bootstrap methods (in the sense of replacing F by F) had been around for many years before his seminal paper. But he was perhaps the first to perceive the enormous breadth of this class of methods. He saw too that the power of modern computing machinery could be harnessed to allow functionals of F^to be computed in very diverse circumstances. The combination of these two observations is extremely powerful, and its ultimate effect on Statistics will be revolutionary. Necessarily, these two observations go together; the vast range of applications of bootstrap methods would not be possible without a facility for extremely rapid simulation. However, that fact does not imply that bootstrap methods are restricted to situations where simulation is employed for calculation. Statistical scientists who thought along lines similar to Efron include Hartigan (1969, 1971), who used resampled sub-samples to construct point and interval estimators, and who stressed connections with Mahalanobis’ “interpenetrating samples” and the jackknife of Quenouille (1949, 1956) and Tukey (1958); and Simon (1969, Chapters 23-25), who described a variety of Monte Carlo methods. Let us accept, for the sake of argument, that bootstrap methods are defined by the “replace F by P’ rule, described above. Two challenges immediately emerge in response to this definition. First, we must determine how to “focus” this concept, SO as to make the bootstrap responsive to statistical demands. That is, how do we decide which functionals of F should be estimated? This requires a “principle” that enables US to implement bootstrap methods in a range of circumstances. The second challenge is that of calculating the values of those functionals in a practical setting. The latter problem may be solved partly by providing simulation methods or related
2344
P.Hall
devices, such as saddlepoint arguments, for numerical approximation. Space limitations mean that a thorough account of these techniques is beyond the scope of this chapter. However, a detailed account of efficient methods of bootstrap simulation may be found in Appendix II of Hall (1992). A key part of the answer to the first question is the development of theory describing the relative performance of different forms of the bootstrap, and that issue will be addressed at some length here. Our answer to the first question is provided in Section 2, where we describe an “equation approach” to focussing attention on specific statistical questions. This technique was discussed in more detail by Hall and Martin (1988), Martin (1989) and Hall (1992, Chapter 1). It leads naturally to bootstrap iteration, which is discussed in Section 3. Section 4 presents theory that enables comparisons to be made of different bootstrap approaches to inference about distributions. The reader is referred to Hinkley (1988) and DiCiccio and Roman0 (1988) for excellent reviews of bootstrap methods. Our discussion is necessarily kept brief and is essentially an abbreviated form of an account that may be found in Hall (1992). In undertaking that abbreviation we have omitted discussion of a variety of different approaches to the bootstrap. In particular, we do not discuss various forms of bias correction, not because we do not recommend it but because space does not permit an adequate survey. We readily concede that the restricted account of bootstrap methods and theory presented here is in need of a degree of bias correction itself! We do not address in any detail the bootstrap for dependent data, but pause here to outline the main issues. There are two main approaches to implementing the bootstrap in dependent settings. The first is to model the dependent process as one that is driven by independent and identically distributed disturbances - examples include autoregressions and moving averages. We describe briefly here a technique which may be used when no parametric assumptions are made about the distribution of the disturbances. First estimate the parameters of the model, and calculate the residuals (i.e. the estimated values of the independent disturbances). Then run the process over and over again, by Monte Carlo simulation, with parameter values set equal to their estimated values and with the bootstrapped independent disturbances obtained by resampling randomly, with replacement, from the set of residuals. Each resampled process should be of the same length as the original one, and bootstrap inference may be conducted by averaging over the independent Monte Carlo replications. Bose (1988) addresses the efficacy of this procedure in the context of autoregressive models, and derives results that may be viewed as analogues (in the case of autoregressive processes) of some of those discussed later in this chapter for independent data. If the distribution of disturbances is assumed known then, rather than estimate residuals and resample with replacement from those, the parameters of the assumed distribution may be estimated. The bootstrap disturbances may now be derived by resampling from the hypothesized distribution, with parameters estimated.
Ch. 39: Methodology
and Throryfir
2345
the Bootstrup
The major other way of bootstrapping dependent processes is to divide the data sequence into blocks, and resample the blocks rather than individual data values. This approach has application in spatial as well as “linear” or time series contexts, and indeed was apparently first suggested for spatial data; see Hall (1985). Blocking methods may involve either non-overlapping blocks, as in the technique treated by Carlstein (1986), or overlapping blocks, as proposed by Kiinsch (1989). (Both methods were considered for spatial data by Hall (1985)) In sheer asymptotic terms Kiinsch’s method has advantages over Carlstein’s, but those advantages are not always apparent in practice. This matter has been addressed by Hall and Horowitz (1993) in the context of estimating bias or variance, and there the matter of optimal block width has been treated. The issue of distribution estimation using blocking methods has been discussed by Gotze and Kiinsch (1990), Lahiri (1991, 1992) and Davison and Hall (1993).
2.
A formal definition of the bootstrap principle
Much of statistical inference involves describing the relationship between a sample and the population from which the sample was drawn. Formally, given a functional f, from a class (f,:t~Y->, we wish to determine that value t, of r that solves an equation such as
W(Fcl?FJlFo) = 0,
(2.1)
where F = F, denotes the population distribution function and F = F, is the distribution function “of the sample”. An explicit definition of F, will be given shortly. Conditioning on F, in (2.1) serves to stress that the expectation is taken with respect to the distribution F,. We call (2.1) the population equation because we need properties of the population if we are to solve this equation exactly. For example, let 8, = d(F,) denote a true parameter value, such as the rth power of a mean,
Let e= B(F,) be our bootstrap mean,
estimator
of 8,, such as the rth power of a sample
where 3 = F, is the empirical distribution function of the sample from which _? is computed. Correcting gadditively for bias is equivalent to finding that value 1, that
P.Hall
2346
solves (2.1) when
fr(F,, Fl) = v-1) - W,) + t.
(2.2)
Our bias-corrected estimator would be 8+ t,. On the other hand, to construct symmetric, 95% confidence interval for 8, we would solve (2.1) when jl(F,,
F,) = Z{B(F,) - t d B(F,) < B(F,) + t} - 0.95,
a
(2.3)
where the indicator function Z(&) is defined to equal 1 if event 6 holds and 0 otherwise. The confidence interval is (6 - to, 6 + to), where 8 = B(F,). To obtain an approximate solution of the population equation (2.1) we argue as follows. Let F, denote the distribution function of a sample drawn from F, (conditional on F,). Replace the pair (F,, F,) in (1.1) by (F,, F,), thereby transforming (2.1) to (2.4) We call this the sample equation because we know (or can find out) everything about it once we know the sample distribution function F,. In particular, its solution f, is a function of the sample values. We call & and E{f,(F,, FJ F,} “the bootstrap estimators” of t, and E{f,(F,, F,) 1F,}, respectively. They are obtained by replacing F0 by F, in formulae for to and E{f,(F,, F,)I F,}. In the bias correction problem, where f, is given by (2.2), the bootstrap version of our bias-corrected estimator is I!+ &,. In the confidence interval problem where (2.3) describes f,, our bootstrap confidence interval is (e - &,, 8 + f,). The latter is commonly called a (symmetric) percentile-method confidence interval for 6,. The “bootstrap principle” might be described in terms of this approach to estimation of a population equation. It is appropriate now to give detailed definitions of F, and F,. There are two approaches, suitable for nonparametric and parametric problems respectively. In both, inference is based on a sample X of n random (independent and identically distributed) observations of the population. In the nonparametric case, F, is simply the empirical distribution function of X; that is, the distribution function of the distribution that assigns mass n-l to each point in X. The associated empirical probability measure assigns to a region B a value equal to the proportion of the sample that lies within 2. Similarly, F, is the empirical distribution function of a sample drawn at random from the population with distribution function F,; that is, the empiric of a sample !Z* drawn randomly, with replacement, from 3. If we denote the population by X0 then we have a nest of sampling operations: X is drawn at random from X0 and !E* is drawn at random from X.
Ch. 39: Mrthodology
and Theoryfor
2341
the Bootstrap
In the parametric case, F, is assumed completely known up to a finite vector i, of unknown parameters. To indicate this dependence we write F, = F,*(,), an element of a class {F,,,, k.~Aj of possible distributions. Let 1: be an estimator of I, computed from J, often (but not necessarily) the maximum likelihood estimator. It will be a function of sample values, so we may write it as h(X). Then F, = F,Q, the distribution function obtained on replacing “true” parameter values by their sample estimates. Let X* denote the sample drawn at random from the distribution with distribution function F,,, (not simply drawn from 3” with replacement), and let fi* = A(F*) denote the version of I computed for .Y* instead of .Y. Then F, = F,i*,. It is appropriate now to discuss two examples that illustrate the bootstrap principle. Example 2.1.
Bias reduction
Here the function
f, is given by (2.2), and the sample equation
(2.4) assumes the form
E{W,) - W,) + [IF,) = 0, whose solution t=
is
to= 8(F,)
The bootstrap
- E{O(F,)IF,}.
bias-reduced
estimator
is thus
6, = @+ t*,,= 8(F,) + 2, = 28(F,) - E{O(F,)IF,}.
(2.5)
Note that our basic estimator I!?= B(F,) is also a bootstrap estimator since it is obtained by substituting F, for F, in the functional formula 8, = 8(F,). may always be computed (or approximated) by The expectation E(B(F,)jF,} Monte Carlo simulation, as follows. Conditional on F,, draw B resamples {.Fz, 1 d b d B} independently from the distribution with distribution function F,. In the nonparametric case, where F, is the empirical distribution function of the sample 3, let F,, denote the empirical distribution function of .!!z. In the parametric case, let iz = I’(%;) be that estimator of &, computed from resample Fz, and put F,, = Fci*,. Define 6: = 8(F,,) and o^= H(F,). Then in both parametric and nonparametrPc
circumstances,
h=l converges to fi = E(O(F,)lF,} as B+ncj.
= E(@*(X) (with probability
one, conditional
on F,)
P. Hull
2348
Example
2.2.
Confidence
interval
A symmetric confidence interval for 8, = U(F,) may be constructed by applying the resampling principle using the function f, given by (2.3). The sample equation then assumes the form P{8(F,)
- t < 8(F,) < Q(F,) + t(F,} - 0.95 = 0.
(2.6)
In a nonparametric context Q(F,), conditional on F,, has a discrete distribution and so it would seldom be possible to solve (2.6) exactly. However, any error in the solution of (2.6) will usually be very small, since the size of even the largest atom of the distribution of B(F,) decreases exponentially quickly with increasing II. The largest atom is of size only 3.6 x 1O-4 when IZ= 10. We could remove this minor difficulty by smoothing the distribution function F,. In parametric cases, (2.6) may usually be solved exactly for t. The interval (& f,, 8+ &J is a bootstrap confidence interval for 8, = 8(F,), usually called a (two-sided, symmetric) percentile interval since &, is a percentile of the distribution of le(F,) - Q(F,)I conditional on F,. Other nominal 95% percentile intervals include the two-sided, equal-tailed interval (i?- f,,, 8 + fo2) and the one-sided interval (- co, & + f,,), where f,,, fo2, and f,, solve P{@F,)
< B(F,) - tlF,}
- 0.025 = 0,
P(B(F,)
< QF,) + tlF,}
- 0.975 = 0,
and
P{e(F,)<e(F,)+
tlF,}
-0.95=0,
respectively. The former interval equal probability in each tail: P(8, d 8-
is called equal-tailed because
it attempts
to place
f,,) z I-‘(& > 8+ f,,) z 0.025.
The “ideal” form of this interval, obtained by solving the population equation rather than the sample equation, does place equal probability in each tail. Still other 95% percentile intervals are I^, = (e- fo2, 8+ f,,) and III = (- co, 8 + fo4), where too4is the solution of P{fI(F,) d Q(F,) - tlF,}
- 0.05 = 0.
These do not fit naturally into a systematic development of bootstrap methods by frequentist arguments, and we find them a little contrived. They are sometimes
Ch. 3Y: Methodology
motivated
and Theoryfor
2349
the Bootstrap
as follows. Define e* = B(F,), I?(x) = P(8* < ~1%) and
I?‘(C()=inf{x:t?(x)>a}. Then and
r^, = [I?‘(0.025),&‘(0.975)]
fI = [-
co,I?‘(0.95)].
All these intervals cover 8, with probability approximately 0.95, which might be called the nominal coverage. Coverage error is defined to be true coverage minus nominal coverage; it generally converges to zero as sample size increases. We now treat in more detail the construction of two-sided, symmetric percentile intervals in parametric problems. There, provided the distribution functions Fo, are continuous, equation (2.6) may be solved exactly. We focus attention on the cases where 8, = Q(F,) is a population mean and the population is normal or exponential. Our main aim is to bring out the virtues of pivoting, which usually amounts to resealing so that the distribution of a statistic depends less on unknown parameters. If the population is Normal N@,c?) and we use the maximum likelihood estimator x = (x, S2) to estimate 1, = (CL,a2), then the sample equation (2.6) may be rewritten as
P(ln - “%Nj < t 1F,) = 0.95,
(2.7)
where N is Normal N(0, 1) independent of F,. Therefore * t = t, = xog5n -1’2B, where X, is defined by P(INI <x,) The bootstrap
the solution
of (2.6) is
= oz. confidence
interval
is therefore
(X - n-“2X,,,,c?,X+ n-‘12xo.958) with coverage
error
P(r7- C2X (p&
d /.i d
x +n- 1’2xo.958)
= P{ (n”2(X - ,U)/Sl G x,.95} - 0.95.
(2.8)
Ofcourse n”‘(X - p)/S does not have a Normal distribution, but a resealed Student’s t distribution with n - 1 degrees of freedom. Therefore the coverage error is essentially that which results from approximating to Student’s t distribution by a Normal distribution, and so is O(n-‘). (See Kendall and Stuart (1977, p. 404).) That is disappointing, particularly as classical methods lead so easily to an interval with precisely known coverage in this important special case.
P.Hull
2350
To appreciate why the percentile interval has this inadequate performance, let us go back to our parametric example involving the Normal distribution. The root cause of the problem there is that 8, and not cr, appears on the right-hand side in (2.8). This happens because the sample equation (2.6), equivalent here to (2.7), depends on 8. Put another way, the population equation (2.1), equivalent to P{ l(V,)
- B(F,)I < t} = 0.95,
depends on cr’, the population variance. This occurs because the distribution of le(F,) - 8(F,)I depends on the unknown CJ.We should try to eliminate, or at least minimize, this dependence. A function T of both the data and an unknown parameter is said to be (exactly) piootal if it has the same distribution for all values of the unknowns. It is asymptotically pivotal if, for sequences of known constants {a,} and {b,}, a,T+ b, has a proper nondegenerate limiting distribution not depending on unknowns. We may convert 8(F,) - 8(F,) into a pivotal statistic by correcting for scale, changing it to T= {B(F,) - fl(F,)}/* r w h ere z*= r(F,) is an appropriate scale estimator. In our example about the mean there are usually many different choices for 2, e.g. the sample standard deviation {n-‘C(Xi - X) ’ } 1/Z, the square root of the unbiased variance estimate, Gini’s mean difference and the interquartile range. In more complex problems, a jackknife standard deviation estimator is usually an option. Note that exactly the same confidence interval will be obtained if t^ is replaced by c?, for any given c # 0, and so it is inessential that z^be consistent for the asymptotic standard deviation of f&F,). What is important is piuotalness - exact pivotalness if we are to obtain a confidence interval with zero coverage error, asymptotic pivotalness if exact pivotalness is unattainable. If we change to a pivotal statistic then the function f, alters from the form given in (2.3) to f,(F,,
F,) = 1(&F,)
- tr(F,) d B(F,) d B(F,) + tr(F,)}
- 0.95.
(2.9)
In the case of our parametric Normal model, any reasonable scale estimator t*will give exact pivotalness. We shall take z^= 8, where 8’ = a’(F,) = n-‘C(Xi - f)’ denotes sample variance. Then f, becomes ft(F,, F,) = 1(Q(F,) - m(~,) d e(F,) < e(F,) + ta(~,)) Using this functional in place of that at (2.3), but otherwise equation (2.7) changes to P((n-
1)-1’21T~_1( dt(F,}
where T,_ I has Student’s stochastically independent
- 0.95. arguing
=0.95,
t distribution with n - 1 degrees of F,. (Therefore the conditioning
exactly as before,
(2.10) of freedom and is on F, in (2.10) is
2351
Ch. 39: Methodology and Theory for the Bootstrap
irrelevant.) Thus, the solution of the sample equation is f0 = (n - 1)) 1i2w0,95, where w&= w,(n) is given by P(IT,_ 1 1< w,) = ct. The bootstrap confidence interval is (X - &,,b,.% + 2,8), with perfect coverage accuracy, P{X -(n
- 1)-1’2w0,95 8 ,< p d X + (n - l)-“zw,,,,B)
= 0.95.
(Of course, the latter statement applies only to the parametric bootstrap under the assumption of a Normal model.) Such confidence intervals are usually called percentile-t intervals since f0 is a percentile of the Student’s t-like statistic /0(F,) - 8(F,)1/r(F,). Perfect coverage accuracy of percentile-t intervals usually holds only in parametric problems where the underlying statistic is exactly pivotal. More generally, if symmetric percentile-t intervals are constructed in parametric .and nonparametric problems by solving the sample equation when f, is defined by (2.9), where z(F,) is chosen so that T= {8(F,) - B(F,)}/z(F,) is asymptotically pivotal, then coverage error will usually be O(n-‘) rather than the O(n- ‘) associated with ordinary percentile intervals. We conclude this example with remarks on the computation of critical points, such as r?,, by uniform Monte Carlo simulation. Further details, including an account of efficient Monte Carlo simulation, are given in Section 5. Assume we wish to compute the solution 0, of the equation
PC{w-2) - W,)}Iz(F,)
d 9, IF,] =
LY,
(2.11)
or, to be more precise, the value
4 = inf{x:PC{W,) - B(F,))/T(F,)d
xIF, (3 M}.
Choose integers B > 1 and 1 d v 6 B such that v/(B + 1) = ~1.For example, if c1= 0.95 then we could take (v, B) = (9599) or (950,999). Conditonal on F,, draw B resamples (gz, 1
and write T* for a generic Tt. In this notation, equation (2.11) is equivalent to P(T* d z$,(%“)= a. Let I&, denote the vth largest value of Tz. Then O,,, --f 8, with probability one, conditional on 3, as B+ co. The value O,,, is a Monte Carlo approximation to v^,.
P. Hall
2352
3.
Iterating the principle
Recall that in Section 2, we suggested that statistical inference often involves describing a relationship between the sample and the population. We argued that this leads to a bootstrap principle, which may be enunciated in terms of finding an empirical solution to a population equation, (2.1). The empirical solution is obtained by solving a sample version, (2.4) of the population equation. The notation employed in those equations includes taking F,, F, and F, to denote the true population distribution function, the empirical distribution function, and the resample version of the empiric, respectively. The solution of the population equation is a functional of F,, say T(F,), and the solution of the sample equation is the corresponding functional of the empiric, T(F,). The population equation may then be represented as
-W-W’,W,~ FJIFd = 0, with approximate
solution
-W’YFI)(FO~F,)IFI~ ~00.
(3.1)
The solution of the sample equation represents an approximation to the solution of the population equation. In many instances we would like to improve on this approximation ~ for example, to further reduce bias in a bias correction problem, or to improve coverage accuracy in a confidence interval problem. Therefore we introduce a correction term t to the functional T, so that T(.) becomes U(., t) with U( ., 0) E T( .). The adjustment may be multiplicative, for example, U( ., t) E (1 + t)T( .). Or it may be an additive correction, as in U(*, t) = T(.) + t. Or t might adjust some particular feature of T, as in the level-error correction for confidence intervals, which we shall discuss shortly. In all cases, the functional U(.,t) should be smooth in t. Our aim is to choose t so as to improve on the approximation (3.1). Ideally, we would like to solve the equation (3.2) for t. If we write g,(F, G) =fac,_(F, J%(Fo>
FAIF,)
G), we see that (3.2) is equivalent
to
= 0,
which is of the same form as the population equation approximation by passing to the sample equation,
(2.1). Therefore
we obtain
an
Ch. 39: Methodology
and TheoryJbr
2353
the Bootstrap
or equivalently,
equation of the This has solution &,oz= T,(F,), say, giving us a new approximate same form as the first approximation (3.1), and being the result of iterating that earlier approximation,
Our hope is that the approximation here is better than that in (3.1) so that in a sense U(F,, T,(F,)] is a better estimate than T(F,) of the solution t, to equation (2.1). Of course, this does not mean that U[F,, 7’,(F,)] is closer to t, than T(F,), only that the left-hand side of (3.4) is closer to zero than the left-hand side of (3.1). If we revise notation and call U[F,, T,(F,)] the “new” T(F,), we may run through the argument again, obtaining a third approximate solution of (2.1). In principle, these iterations may be repeated as often as desired. We have given two explicit methods, multiplicative and additive, for modifying our original estimate f, = T(F,) of the solution of (2.1) so as to obtain the adjustable form U(F,, t). Those modifications may be used in a wide range of circumstances. In the special case of confidence intervals, an alternative approach is to modify the nominal coverage probability of the confidence interval. To explain the argument we shall concentrate on the special case of symmetric percentile-method intervals discussed in Example 2.1. Corrections for other types of intervals may be introduced in like manner. An a-level symmetric percentile-method interval for Be = QF,) is given by [B(F,) - &,, f?(F,) + &,I, where &, is chosen to solve the sample equation
P{w-*) - t d WI)
d 8(F,) + t/F,)
(In our earlier examples, tl of the population equation
PlW,) - t d W,)
=
-a
= 0.
0.95.) This f,, is an estimator
d B(F,) + tlF,}
-
c1=
of the solution
t, = T(F,)
0,
that is. of P((B-8&t(F,)=cc, where e= O(F,). Therefore 6 &I,
to is just the a-level quantile,
x,, of the distribution
of
P. Hall
2354
Write x, as x(F,),, the quantile when F, is the true distribution E, = T(F,) is just x(F,),, and we might take U(., t) to be
function.
Then
U(., t) = x(.),+ 1. This is an alternative problem are U(.,t)-(1
to multiplicative
+t)x(.),
and
and additive
corrections,
which in the present
U(.,t)=x(.),+t,
respectively. In general, each will give slightly different numerical results, although, as we shall prove shortly, each provides the same order of correction. Concise definitions of Fj are different in parametric and nonparametric cases. In that are completely the former we work within a class {Fo,, d~l\} of distributions specified up to an unknown vector 1 of parameters. The “true” distribution is F, = Fcno), we estimate il, by I= L(X) where X = Xi is an n-sample drawn from F,, and we take F, to be F,i,. To define Fj, let ij = L(Xj) denote the estimator i computed for an n-sample Xj drawn from Fjm 1 and put Fj = F,ir The nonparametric case is conceptually simpler. There, Fj is the empirical distribution of an n-sample drawn randomly from Fj_ 1, with replacement. To explain how high-index Fis enter into computation of bootstrap iterations, we shall discuss calculation of the solution to equation (3.3). That requires calculation of U(F,, t), defined for example by U(F,, t) = (1 + t)T(F,). And for this we must compute sample equation
and so T(F,) is the solution
T(F,). Now, f, = T(F,) is the solution
(in t) of the resample
(in t) of the
equation
~tft(F,~F,)IF,)= 0. Thus, to find the second bootstrap iterate, the solution of (3.3), we must construct F,, F,, and F,. Calculation of F, “by simulation” typically involves order B sampling operations (B resamples drawn from the original sample), whereas calculation of F, “by simulation” involves order B2 sampling operations (B resamples drawn from each of B resamples) if the same number of operations is used at each level. Thus, i bootstrap iterations could require order B’ computations, and so complexity would increase rapidly with the number of iterations.
Ch. 39: Methodology
and Theory,for
2355
the Bootstrap
In regular cases, expansions of the error in formulae such as (3.1) are usually power series in n- i’* or n- r, often resulting from Edgeworth expansions of the type that we shall discuss in Section 4. Each bootstrap iteration reduces the order of magnitude of error by a factor of at least n - liz . However, in many problems with an element of symmetry, such as two-sided confidence intervals, expansions of error are power seriesin IZ- ’ rather than n ‘I’, and each bootstrap iteration reduces error by a factor of n-l, not just n- ‘I*. Example 3.1.
Bias reduction
In this situation, each bootstrap iteration reduces the order of magnitude of bias by the factor n-l. (See Hall 1992, Section 1.5, for further details.) To investigate further the effect of bootstrap iteration on bias, observe that, in the case of bias reduction by an additive correction,
.ft(FcbFl) = w-1) - QF,) + t. Therefore
the sample equation,
has solution
t = T(F,) = QF,) - E{B(F,)IF,},
and so the once-iterated
estimate
is
8, = @+ T(F,) = B(F,) + T(F,) = 2&F,) - E{B(F,)IF,). See also (2.5). On iteration of this formula general bootstrap estimator.
we obtain
the following
formula
for a
Theorem 3.1 If I!?~denotes then
the jth iterate of 8, and if the adjustment
E{B(Fi)(F,},
Example 3.2.
ja
at each iteration
is additive,
1.
Confidence interval
Here, each iteration generally reduces the order of coverage error by the factor n-l in the case of two-sided intervals, and by n- ‘I2 for one-sided intervals. To appreciate the effect of iteration in more detail, let us consider the case of parametric, percentile confidence intervals for a mean, assuming a Normal N(p, CJ*)population, discussed in Example 2.2. Let N denote a Normal N(0, 1) random variable. Estimate the
P. Hall
i, = (p, 0”) by th e maximum
parameter
n^ =(X, 62)= (up,),
likelihood
estimator
rJ2(Fl)),
where X = n- ’ CX, and 6’ = IZ- ‘x(X, - X)’ are sample mean and sample variance, respectively. The functional ,f, is, in the case of a symmetric two-sided 95% percentile confidence interval, f;(F,,F,)
= I{W,)
- t d fI(F,) d B(F,) + t} - 0.95,
and the sample equation (2.4) has solution t = T(F,) = n-“2~,~,,o(F,), is given by P( 1N 1< x0.95) = 0.95. This gives the percentile interval
x +n- l’2xo,958),
(X - n- “2x,,,&
derived in Example 2.2. For the sake of definiteness correction in the form U(F,,t)=n
although f
- 1’2(%.95
Fl)
=
w2
so that the sample equation P{n”2IU(F,) Observe
we shall make the coverage
+ t)@,),
we would draw the same conclusion
“(F,,t)(hI’
where x0,95
lw,l)
-
with other forms of correction.
~(~,w(~,)
d
x0.95
+
t> -
Thus,
0.95,
(3.3) becomes
- B(F,)(/o(F,)
< x0.95 + tlF,}
- 0.95 = 0.
(3.5)
that
W = nli2{d(F2)
=
nm1j2 i i
- B(F,)}/o(F,)
n-l
(X*-X)
i=l
H
i i=l
-“‘,
(X*-X)2 I
where conditional on ?Z, XT,. . . , Xz are independent and identically distributed on X, and N(X, b2) random variables and X* = n ’ cX*. Therefore, conditional also unconditionally, W is distributed as {n/(n - 1)}“2T,_, where T,- 1 has Student’s t distribution with n - 1 degrees of freedom. Therefore the solution E, of equation (3.5) is f. = (~/(n - l)}‘i2wo,,5 - xo,95, where w, = w,(n) is defined by
P(IT,-,/<w,)=sr.
Ch. 9:
Methodoloyy
The resulting
und Thuory,fiv
bootstrap
2357
the Bootstrap
confidence
interval
is
+ f,), NF,) + fl- 1’2~(~l)(%.95 + Ml CW,) - n 1’24F1)(%95 = [X -
(?I- 1))“2W,,,,B,X
+ (?I - 1))“2W,,,,6].
This is identical to the percentile-t (not the percentile) confidence interval derived in Example 2.2 and has perfect coverage accuracy. The methodology of bootstrap iteration was introduced by Efron (1983), Hall (1986), Beran (1987) and Loh (1987).
4. 4.1.
Asymptotic
theory
Summary
We begin by describing circumstances where Edgeworth expansions, in the usual rather than the bootstrap sense, may be generated under rigorous regularity conditions; see Section 4.2, Major contributors to this theory include Chibishov (1972,1973a, 1973b), Sargan (1975, 1976) and Bhattacharya and Ghosh (1978). Our account is based on the latter paper. Following that, in Section 4.3, we discuss bootstrap versions of those expansions and then describe the conclusions that may be drawn from those results. Our first conclusions, about the efficacy of pivotal methods, are given towards the end of Section 4.3. Sections 4.4, 4.5, 4.6 and 4.7 describe respectively a variety of different confidence intervals, properties of bootstrap estimates of critical points, properties of coverage error and the special case of regression. The last case is of particular interest because, in the context of intervals for slope parameters, it admits bootstrap methods with unusually good coverage accuracy. The main conclusions drawn in this section relate to the virtues of pivoting. That subject was touched on in Section 2 but there we lacked the technical devices necessary to provide a broad description of the relative performances of pivotal and non-pivotal methods. The Edgeworth expansion techniques introduced in Section4.2 fill this gap. In particular, they enable us to show that pivotal methods generally yield greater accuracy in the estimation of critical points (Section 4.5) and smaller asymptotic order of coverage error of one-sided confidence intervals (Section 4.6). Nevertheless, it should be borne in mind that these results are asymptotic in character and that, while they provide a valuable guide, they do not tell the whole story. For example, the performance of pivotal methods with small samples depends in large part on the relative accuracy of the variance estimator and can be very poor in cases where an accurate variance estimator is not available. Examples which feature poor accuracy include interval estimation for the correlation coefficient and for a ratio of means when the denominator mean is close to zero.
P. Hall
2358
Theory for the bootstrap, along the lines of that described here, was developed by Bickel and Freedman (1980), Singh (1981), Beran (1982, 1987), Babu and Singh (1983, 1984, 1985), Hall (1986, 1988a, 1988b), Efron (1987), Liu and Singh (1987) and Robinson (1987). Further work on the bootstrap in regression models is described by Bickel and Freedman (198 1, 1983), Freedman (198 l), Freedman and Peters (1984) and Peters and Freedman (1984a, 1984b).
4.2.
Edgeworth
and Cornish-Fisher
expansions
We begin by describing a general model that allows Edgeworth and CornishhFisher expansions to be established rigorously. Let @, 4 denote respectively the Standard Normal distribution and density functions. Let X, X,, X,, . . . be independent and identically distributed random column d-vectors with mean p, and put X = n ‘C Xi. Let A: Rd --f R be a smooth function satisfying A(p) = 0. We have in mind a function such as A(x) = {g(x) - g(~)}/h@), where 8, = g(p) is the (scalar) parameter estimated by 6 = g(X) and g2 = h(/1)2 is the asymptotic variance of n”28; or A(x) = {g(x) g(p)}/h(x), where b2 = h(X) is an estimator of h(p). (Thus, we assume h is a known function.) This “smooth function model” allows us to study problems where 8, is a mean, or a variance, or a ratio of means or variances, or a difference of means or variances, or a correlation coefficient, etc. For example, if { W,, . . . , W,,} were a random sample from a univariate population with mean m and variance fi2, and if we wished to estimate 0, = m, then we would take d = 2, x = (X”‘, Xc2’)r = (W, W2)T, p = E(f), (1’,x(2’) &7(x
= x(“,
h(x”‘,X’2’)
=x(2’
_ (,(1’)2.
This would ensure that g(p) = m, g(x) = w (the sample mean), h(p) = b2, and
h(X)=n-’
$lX~2’-(n~1i$lXj1’)2=n~1i$l(Wi-
W)‘=/P
(the sample variance). If instead our target were 8, = /II” then we would take d = 4, X = (W, W2, W3, W4)*, /I = E(X), g(x”‘, . . h(x(“,...,
)X(4’)= x(2’ - (x(l’)2, x(4’)
=
x(4’
_
4x”‘x’3’
In this case,
Y(P)= B2> dx, = Bz, h(p) = E( W - m)” - fi4,
+ fj(x”‘)2x’2’
_ 3(x”‘)4
_ [x”’
_ (,W)2]2.
2359
Ch. 39: Methodology and Theoryfor the Bootstrap
h(X) = rl-l i (Wi - W)2= B’. i=l
(Note that E(W- m)” - /I” equals the asymptotic variance of nri2/?.) The cases where o0 is a correlation coefficient (a function of five means), or a variance ratio (a function of four means), among others, may be treated similarly. The following result may be established under the model described above. We first present a little notation. Put p = E(X), and let pi ,,,.i, = E{ ai,...i,
=
(X - pp. . .(X - pp},
(aj/axv..
j> 1,
.ax~q4(x)~,=,,
and
Note that c2 equals the asymptotic
variance
of n’i2A(_%).
Theorem 4.1 Assume that the function A has j + 2 continuous derivatives in a neighbourhood of p = E(X), that .4(p) = 0, that E( IIii! IJj’2) < co, and that the characteristic function x of X satisfies limsup Ix(t)1 < 1. “E,, - % Suppose
(4.1)
CJ> 0. Then for j 2 1,
P{n”2A(@/o
d x) = e,(x) + n-“2pl(x)&) +
n-j’2pj(X)4(X)
+
+ ... 0(n-j’2)
(4.2)
P.Hall
2360
uniformly in x, where pj is a polynomial of degree at most 3j - 1, odd for even j and even for odd j, with coefficients depending on moments of 2 up to order j + 2. In particular, pl(x) = (‘4,o-’
+ $42a-3(X2- l)}.
See Bhattacharya and Ghosh (1978) for a proof. Condition (4.1) is a multivariate form of Cramer’s continuity condition. It is satisfied if the distribution of z is nonsingular (i.e. has a nondegenerate absolutely continuous component) or if 2 = (W, W2,. , Wd)T where W is a random variable with a nonsingular distribution. Two versions of (4.2) are given by
P{n1’2(g-f&)/a < x} = CD(x) + n-1’2pl(X)c/l(X) + +
I.’
n -“*pj(x)q5(x)+ o(n -j/2)
(4.3)
and
P{n1’2(O^8,)/b
d
x} = @(x) + +n
n-1’2ql(X)#(X)
-“2qj(X)~(X)
+ ...
+ O(n-j”),
(4.4)
being Edgeworth expansions for non-Studentized and Studentized statistics, respectively. Here, pj and qj are polynomials of degree at most 3j - 1 and are odd or even functions according to whether j is even or odd. They are usually distinct. The Edgeworth expansion in Theorem 4.1 is readily inverted so as to yield a Cornish-Fisher expansion of the critical point of a distribution. To appreciate how, first define w, = w,(n), the a-level quantile of the distribution of S, = n1’2A(x), by w, = inf{x:P(S,
d x) 3 2).
Let z, be the a-level Standard
Normal
W,=Z,+n-“2p,l(Z,)+n-1p2,(z,)+
quantile,
given by @(z,) = a. We may write
+n-“2pj&,)+
..’
..
and z, = w, + n -1’2p12(wa) + n-‘p,,(w,)
+
“’
+
nP2pj2(Wm) +
.‘.)
where the functions pjl and pj2 are polynomials. These expansions are to be interpreted as asymptotic series and in that sense are available uniformly in ECU< 1 -sforanyO<.s<+.
Ch. 39: Methodology
and Theoryfor
the Bootstrap
2361
The polynomials pjI and pj2 are of degree at most j + 1, odd for even j and even for odd j, and depend on cumulants only up to order j + 2. They are completely determined by the pis in (4.2). In particular, it follows that pjI is determined by p1 , . . . , pj. To derive formulae for p1 1 and p2 1, note that a= @(z,)+
~~-“2Pl,(z,)+~-‘P2,(z,)~~(z,)-~~~~”2P11(Z,)~2Z,~(Z,)
+ n - 1’2cPl
mw
+ n -
+ n-‘p2(z,)4(z,) =a+n
“2PII(z,){P;(d
- z,Pl(z,)Hwl
+ O(n_3’2)
~“2{Pll(z,)+P,(z,)}~(z,)+~-‘CP2,(z,)-~z,P,,(z,)2
+ Pllk){PW
- Z,PIWI
From this we may conclude Pllb)
= -
P2164
= PlwP;(x)
+ P2(Z,)laz)
+ wp3’2).
that
Pl(4
and -
+xPl(42
-
P2W
Formulae for the other polynomials piI, and for the pi2’s, may be derived similarly, however, they will not be needed in our work. CornishhFisher expansions under explicit regularity conditions may be deduced from results such as Theorem 4.1. For example, the following inversions of (4.3) and (4.4) are valid uniformly in E < c1< 1 - E, under the conditions of that theorem: u, = z, + n -1’2p11(z,)
+ nm’p2,(z,)
+ .‘. + n-j’2pj,(Za)
+ o(n-j’2),
(4.5)
and v, = z, + n ~“2q11(z,)+n~1q21(z,)+
Here z,, u,, v, are the solutions
‘.. +n-“2qjl(z,)+o(n-j’2).
of the equations
(4.6)
@(z,) = LX,
&J/a
respectively; p1 1 and p21 are given by the formulae displayed in the previous paragraph, with p1 and p2 defined as in (4.3); and qll and q21 are given by the analogous formulae, with q1 and q2 from (4.4) replacing p1 and p2.
P. Hull
2362
4.3.
Edgeworth
and Cornish-Fisher
expansions
of bootstrap distributions
We are now in a position to describe Edgeworth expansions of bootstrap distributions. We shall emphasize the role played by pivotal methods, introduced in Section 2. Recall that a statistic is (asymptotically) pivotal if its limiting distribution does not depend on unknown quantities. In several respects the bootstrap does a better job of estimating the distribution of a pivotal statistic than it does for a nonpivotal statistic. The advantages of pivoting can be explained very easily by means of Edgeworth expansion, as follows. If a pivotal statistic T is asymptotically Normally distributed, then in regular cases we may expand its distribution function as G(x) = P( T < x) = a(x)
+ n “‘q(x)+(x)
+ O(n- ‘),
(4.7)
where q is an even quadratic polynomial. See (4.2), for example. We might take T= n”‘(t!?- 0)/c?, where i? is an estimator of an unknown parameter H,, and s2 is an estimator of the asymptotic variance o2 of n ‘I28 . The bootstrap estimator of G admits an analogous expansion, G(x) = P(T* d x(.5) = a(x)
+ n-“2Q(x)~(x)
+ O,(n-I),
(4.8)
where T* is the bootstrap version of T, computed from a resample %* instead of the sample ?Z”,and the polynomial 4 is obtained from q on replacing unknowns, such as skewness, by bootstrap estimates. (The notation “O,(n- I)” denotes a random variable that is order n- ’ “’m probability”. The distribution of T* conditional on 3’ is called the bootstrap distribution of T*. The estimators in the coefficients of 4 are typically distant O,(n ‘j2) from their respective values in q, and so 4 -q = O,(n- ‘12). Therefore, subtracting (4.7) and (4.8), we conclude that P(T* <xl??“) - P(T6
x) = O,(n-‘).
That is, the bootstrap approximation to G is in error by only n-l. This is a substantial improvement on the Normal approximation, G N CD,which by (4.7) is in error by n- 1’2. On the other hand, were we to use the bootstrap to approximate the distribution of a nonpivotal statistic U, such as U = n”2(g- t3,), we would typically commit an error of size n- ‘I2 rather than n ‘. To appreciate why, observe that the analogues of (4.7) and (4.8) in this case are H(x) = P(U < x) = @(x/a)
+
n-“‘p(x/a)#(x/a)
+ O(n-‘),
Ch. 39: Methodology
B(x)= P(u*
2363
and Theory for the Bootstrap
d
xl%)
= 0 (x/c?) + n - “2~(x/o/s) + O,(n -
‘)
respectively, where p is a polynomial, 8 is obtained from p on replacing unknowns variance of U, d2 is the by their bootstrap estimators, o2 equals the asymptotic bootstrap estimator of c2 and U* is the bootstrap version of U. Again, @- p = O,(n- “‘), and also 8 - G = O,(n- 1/2), whence
B(x)-
H(x) = @(x/B) - @(x/a) +
O,(nP).
(4.9)
Now, the difference between 6 and (T is usually of precise order n- 1’2. Indeed, n1i2(B - 0) typical1 y h as a limiting Normal N(0, c2) distribution, for some i > 0. Thus, @(x/6) - @(x/a) is generally of size n-1/2, not n-l. Hence by (4.9), the bootstrap approximation to H is in error by terms of size n-*/*, not n-‘. This relatively poor performance is due to the presence of cr in the limiting distribution function @(x/a), i.e. to the fact that U is not pivotal. Expansions such as (4.8) may be developed under the smooth function model, and analogues of Theorem 4.1 are available in the bootstrap case. For example, let us return to the notation introduced just prior to that theorem, and introduce additionally the definitions x* = n- ‘CXT, 8* = g(x*) and c?*~ = h(X*), where %* = {XT,. . . ) x} denotes a resample drawn randomly, with replacement, from s = {Xl,. . .) X,,}, Then under the same conditions as in Theorem 4.1, except that the moment condition should be strengthened a little, we have the following analogues of (4.3) and (4.4) respectively:
P(n”2(f7*- e,lr3*< xp-} = @(x)
+
n-1’21j1(X)c#J(X) + ...
+ n -“2gj(x)c#J(x) + o,(n-j’2)
(4.10)
and
P(n1’2(B* -6)/d* <x(2-} =
aqx) +
n-“2Q1(X)4(X)+ ...
+ n-“24j(x)q5(x)+ o,(n -ji2).
(4.11)
Bootstrap Edgeworth expansions may be inverted in much the same way as ordinary Edgeworth expansions, to obtain bootstrap Cornish-Fisher expansions. For example, the quantiles arising from inversion of the bootstrap expansions (4.10) and (4.11) are a, = z, + n -“2B11(Z,)+n~1~21(Z,)+
“‘+fji2@j,(Z,)+
I?, = z, + n -1’2~11(z,)+n-‘~21(z,)+
‘..
“’
(4.12)
..‘.
(4.13)
and
+n-“2Qjl(z,)+
P.Ha//
2364
Here, pjl and Qjl differ from pjI and q,r, appearing in (4.5) and (4.6), only in that F, is replaced by F,; that is, population moments are replaced by sample moments. Of course, CornishhFisher expansions are to be interpreted as asymptotic series and apply uniformly in values of z bounded away from zero and one. For example, .j/
2
sup
Iti,-{z,+n
-r/2f111(Z,)+
.
+n-j/*
Bjl(za)}l+o
r.
almost surely (and hence also in probability) as II + co, for each 0 < e < $. A key assumption underlying these results is smoothness of the sampling distribution. For example, under the “smooth function model” introduced in Section 4.2, the sampling distribution would typically be required to satisfy Cramer’s condition.
4.4.
DifSerent versions
of bootstrap
confidence
intervals
We begin with a little notation. Let F, denote the population distribution function, Q(.) a functional of distribution functions, and 8, = 8(F,) a true parameter value, such as the rth power of a population mean 8, = (JxdFo(x)}‘. Write F, for the distribution function “of a sample” .% drawn from the population. Interpretations of F, differ for parametric and nonparametric cases; see Section 2. The bootstrap estimator of 8, is 8= O(F,). Define F, to be the distribution function of a resample X* drawn from the “population” with distribution function F,. Again, F, is different in parametric and nonparametric settings. A theoretical a-level percentile confidence interval for QO, I, =(is obtained
co,H^+t,), by solving the population
equation
(2.1) for t = t,, using the function
f;(F,, F,) = Z{W,) d B(F,) + t} - oz. Thus, to is defined by P(8 < e + to) = cc. The bootstrap version of this interval is r^, = (- oc, e+ fo), where solution of the sample equation (2.4). Equivalently, & is given by P{fl(F,)
d B(F,) + f,] F,}
We call r^, a bootstrap
percentile
t = f, is the
= a.
confidence
interval,
or simply a percentile
interval
Ch. 39: Methodology
2365
and Theory for the Bootstrap
To construct a percentile-r confidence interval for 8,, define a2(F,) to be the asymptotic variance of n 1/28, and put b2 = a2(F,). A theoretical cc-level percentile-t confidence interval is J, = (- co, 8 + to&), where on the present occasion t, is given by lye, <
f7+ to&) =
This is equivalent
ct.
to solving the population
equation
(2.1) with
.f,(F,, Fl) = I{@,) G w.1) + WFl)) - cc The bootstrap interval is obtained by solving the corresponding and is _?i = (- co, 8 + f,c?), where f, is now defined by
P(8 < B(F,)
+ ,)I
sample equation,
F,} = ct.
To simplify notation in future sections we shall often denote O(F,) and a@‘,) by t?* and 8*, respectively. Exposition will be clearer if we represent t, and &, in terms of quantiles. Thus, we define u,, v,, ri,, and 8, by
(4.14) and
PblVv,)
- e(Fl)jqa(F2)G D,IF,I = @.
Write (r = a(F,) and 6 = a@‘,). Then definitions those given earlier are
=(-Co,8-n-“2aul_m), I; =(-co,8-n-“26til_a),
I,
J,
(4.15) of I,, .I,, fI, and ii equivalent
to
=(-co,e-n-1’28vl_a),
f1 =(-co,8-n-“2c?91_J,
All are confidence intervals for B,,, with coverage probabilities approximately equal to a. In the nonparametric case the statistic B(F,), conditional on F,, has a discrete distribution. This means that equations (4.14) and (4.15) will usually not have exact solutions, although as we point out in Section 1.3 and Appendix I, the errors due to discreteness are exponentially small functions of n. The reader concerned by the
P. Hull
2366
problem
of discreteness
might like to define li, and 0, by
4 = inf{u:PCn1’2{B(F,)
- B(F,)}/~J(F,)
d u~F,] 3 CC}
0, = inf{u:P[n1’2{8(F2)
- B(F,)}/@,)
G uIF,]
and
Two-sided, equal-tailed confidence intersection of two one-sided intervals.
2 CZ},
intervals are constructed by forming Two-sided analogues of I, and J, are
the
and J, =
(&
respectively,
n- 1’2c?uC1 +aJ,2,& with bootstrap
n-“28uC, -&
versions
and
The intervals P(8, d
I, and J, have equal probability
6-
n-112cw (l+a),Z) = P(6, > e-
in each tail; for example, II- %UC1 _a),2) = $1 - CC).
Intervals f2 and j2 have approximately the same level of probability in each tail, and are called two-sided, equal-tailed confidence intervals. Two-sided symmetric intervals were discussed in Section 2. All the intervals defined above have at least asymptotic coverage ~1,in the sense that if 4 is any one of the intervals,
as n + co. As before, we call CIthe nominal coverage of the confidence interval 9. The coverage error of 9 is the difference between true coverage and nominal coverage, coverage
error = P(8,~4)
- CI.
Ch. 39: Methodology and Theoryfor
4.5.
Order of correctness
2361
the Bootstrap
of bootstrap approximations
to critical points
The a-level quantiles of the distributions of S = n”‘(@- 0,)/a and T = n”‘(8d,)/c? with bootstrap estimates ti, and 6,. Subtracting are u, and v,, respectively, expansions (4.5) and (4.6) from (4.12) and (4.13) we deduce that 4
-u,
= n-1’21MZ,)
-
PII
+ ~-‘{a,,(d
- P21Wl
+ ...
(4.16)
and U*,--U,=n-“2~~11(Z,)-qll(z,)}
+n-‘CLi21(Z,)-q21(Z,)}
+
“..
Now, the polynomial jjl is obtained from pjl on replacing population moments by sample moments, and the latter are distant O,(n- iI’) from their population counterparts. Therefore fijl is distant O,(n- 1’2) from pjl. Thus, by (4.16), li, - u, = O,(C”%I
-1’2 + n-i)=
O&-i),
and similarly 0, - v, = O,(n - ‘). This establishes one of the important properties of bootstrap, or sample, critical points: the bootstrap estimates of U, and u, are in error by only order n-‘. In comparison, the traditional Normal approximation argues that u, and v, are both close to z, and is in error by n-l”; for example, z, - u, = z, -
{za+ n -1’2p11(z,)+
. ..} = -n-“2pll(Z,)+O(n-‘).
Approximation by Student’s t distribution hardly improves on the Normal approximation, since the g-level quantile t, of Student’s t distribution with n - v degrees of freedom (for any fixed v) is distant order n- ‘, not order n- ‘12, away from z,. Thus, the bootstrap has definite advantages over traditional methods employed to approximate critical points. This property of the bootstrap will only benefit us if we use bootstrap critical points in the right way. To appreciate the importance of this remark, go back to the definitions of the confidence intervals I,, J,, fl, and j1 given in Section 4.4. Since v*l_~=vl~~+O,(n~‘),theupperendpointoftheinterval~l=(-co,8-n~”2Bv*~_~) differs from the upper endpoint of J, = (-co, 6 n-“28ul -,) by only 0,(n-3’2). We say that i1 is second-order correct for J, and that & n-‘/280, _a is secondorder correct for 8- n- 1128vl -a since the latter two quantities are in agreement up to and including terms of order (n- 1’2)2 = n- ‘. In contrast, II1 = (- co, 8 - n- ‘1’ OUl-a) ** is generally only first-order correct for I, = (- co, 8- n- 1/2~u1 _,) since the upper endpoints agree only in terms of order n- lj2, not n- l, ((j_n-1/2A* fJU1-.) - (6
n-“2aul
-,) = n-‘/2(au1
= n-
_a - CM, _,)
1’2u1-@(a - 8) + 0,(n-3’2),
P.Hall
2368
and (usually) n”‘(6 - a) is asymptotically Normally distributed with zero mean and nonzero variance. Likewise, r^, is usually only first-order correct for J, since terms of order n-112 in Cornish-Fisher expansions of ti,_, and ulPa generally do not agree, (Q-n-
“26ti,_.)-(8-n-“28u,_a)=n~‘128(v,~a-a,_.) =n +~{P,(z,)
- 41(z,)) + 0,(n-3’2).
However, there do exist circumstances where pi and q1 are identical, in which case it follows from the formula above that 1^i is first-order correct for J,. Estimation of slope in regression provides an example and will be discussed in Section 4.7. As we shall show in Section 4.6, these “correctness” properties of critical points have important consequences for coverage accuracy. The second-order correct confidence interval .Z, has coverage error of order n-‘, whereas the first-order correct interval I, has coverage error of size np112. As noted in Section 4.4, the intervals I;, ii represent bootstrap versions of I,, .Z1, respectively. Recall from Example 2.2 of Section 2 that percentile intervals such as I, are based on the nonpivotal statistic @- 8, and that it is this nonpivotalness that causes the asymptotic standard deviation r~ to appear in the definition of I,. That is why r^, is not second-order correct for I,, and so our problems may be traced back to the issue of pivotalness raised in Section 2. The percentile-t interval 5^i is based on the (asymptotically) pivotal statistic (e- 0,)/8, hence its virtuous properties. If the asymptotic variance rr2 should be known then we may use it to standardize, and construct confidence interv_als based on (e- 8,)/a (which is asymptotically pivotal), instead of on either 0 - 8, or (O- 8,)/8. Application of the principle enunciated in Section 2 now produces the interval
I;,
=(-
c0,6-nn-“2alil_m),
which is second-order correct for I, and has coverage error of order n- I. We should warn the reader not to read too much into the notion of “correctness order” for confidence intervals. While it is true that second-order correctness is a desirable property, and that intervals that fail to exhibit it do not correct even for the elementary skewness errors in Normal approximations, it does not follow that we should seek third- or fourth-order correct confidence intervals. Indeed, such intervals are usually unattainable as the next paragraph will show. Techniques such as bootstrap iteration, which reduce the order of coverage error, do not accomplish this goal by achieving high orders of correction but rather by adjusting the error of size n- 3’2 inherent to almost all sample-based critical points. Recall from (4.6) that v1 _L1=
z1
-a +.n -1’2qll(Z1_a)+II-~q21(Z1_a)+
‘...
Ch. 39: Methodology and Theory for the Bootstrap
2369
Coefficients of the polynomial q1 I are usually unknown quantities. In view of results such as the Cramer-Rao lower bound (e.g. Cox and Hinkley 1974, pp. 254ff), the coefficients cannot be estimated with an accuracy better than order n-l”. This means that u1 _a cannot be estimated with an accuracy better than order n- ‘, and that the upper endpoint of the confidence interval J, =(-co,& n-“28u, -,) cannot be estimated with an accuracy better than order nP3j2. Therefore, except in unusual circumstances, any practical confidence interval i, that tries to emulate J, will have an endpoint differing in a term of order ne3j2 from that of J,, and so will not be third-order correct. Exceptional circumstances are those where we have enough parametric information about the population to know the coefficients of qll. For example, in the case of estimating a mean, qll vanishes if the underlying population is symmetric. If we know that the population is symmetric, we may construct confidence intervals that are better than second-order correct. For example, we may resample in a way that ensures that the bootstrap distribution is symmetric, by sampling with replacement from the collection {+(X1 -X), . . . , X, - X}. But in most problems, both para*(X,-X)} rather than {X,-X,..., metric and nonparametric, second-order correctness is the best we can hope to achieve.
4.6.
Coverage
error of conjidence
intervals
In this section we show how to apply the Edgeworth and CornishhFisher expansion formulae developed in Sections 4.2 and 4.3, to develop expressions for coverage accuracy of bootstrap confidence intervals. It is convenient to focus attention initially on the case of one-sided intervals and to progress from there to the two-sided case. A general one-sided confidence interval for BOmay be expressed as 9, = (- 00, 8 + f), where f is determined from the data. In most circumstances, if 3i has nominal coverage c1then f admits the representation f = n - “2d(z, + i?,),
(4.17)
where e, is a random variable and converges to zero as n+ CO. For example, this would typically be the case if T = n”2(8 - f3,,)/&had an asymptotic Standard Normal distribution. However, should the value 0’ of asymptotic variance be known, we would most likely use an interval in which f had the form f = n li20(z, + c^,). Intervals fl,J1, and j1 (defined in Section 4.4) are of the former type, with c*, in (4.17) assuming the respective values - li, -a - z,, li, - z,, - vi _= - z,, -G1 _= - z,. So also are the “Normal approximation interval” (- co, 8 + n-i/*&z,) and “Student’s
P.Hull
2370
t approximation interval” (- co, 8 + n p1/28tm), where t, is the a-level quantile of Student’s t distribution with IZ- 1 degrees of freedom. Interval I, is of the latter type. The main purpose of the correction term 2, is to adjust for skewness. To a lesser extent it corrects for higher-order departures from Normality. Suppose that f is of the form (4.17). Then coverage probability a i,n = P(B,E.Y,)
= P{B, < e+ n-“Q(Z,
= 1 - P{ni’2(8-
0,)K
e,)B- l +
(4.18)
l + e, < -za}.
We wish to develop an expansion to have an Edgeworth expansion
?I”‘(&
+ c*,)}
of this probability. of the distribution
For that purpose function of
it is necessary
e,,
or at least a good approximation to it. In some circumstances, 6, is easy to work with directly; for example ?, = 0 in the case of the Normal approximation interval. But for bootstrap intervals, c^, is defined only implicitly as the solution of an equation, and that makes it rather difficult to handle. So we first approximate it by a Cornish-Fisher expansion. Suppose that t, = n-1’2.Gl(z,)
+
n-‘s*,(z,) + 0,(np3’2),
(4.19)
where s1 and s2 are polynomials with coefficients equal to polynomials in population moments and gj is obtained from sj on replacing population moments. by sample moments. Then ij = sj + O,,(n- l/*) and
P{n”*(B-
&JB-’ + t, d x} = P n”‘(Q- 8,)8-’ + n-‘i2{s*l - s,(z,)} [ d x -
i
n-ji2sj(z,)
j=l
1
+ 0(nm3’*).
(4.20)
Here we have used the delta method. Therefore, to evaluate the coverage probability al,n at (4.18) up to a remainder of order ne3j2, we need only derive an Edgeworth expansion of the distribution function of
S, = n”‘(&
8,)8-’ + n-‘A,,
where A,, = n’i2{s*l(z,) - s,(z,)}. That is usually expansion for n l’*@- 8,)K l + c^,.
(4.21) simpler than finding
an Edgeworth
Ch. 3Y: Methodology
Put T, = n”‘($number such that E(T,AJ
2371
and Theory for the Bootstrap
8,-J/8 and A, = n’iz{$l(zJ
= E[n”‘(&
8,)8- ld’z{$l(za)
- sl(z,)},
and let a, denote
the real
- s,(z,))] (4.22)
=a,+O(n-1).
If s, is an even polynomial of degree 2, which would typically be the case, then a, = n(z,), where rt is an even polynomial of degree 2 with coefficients not depending on c(. Then it may be shown that P(S, <x) = P(T, <x) - np’a,x@(x)
(4.23)
+ 0(n-3’2).
It is now a simple matter to obtain expansions of coverage probability for our general one-sided confidence interval Y1. Taking x = - z, in (4.20), and noting (4.18) and (4.23), we conclude that if 6, is given by (4.19) then the confidence interval
=(-co,e+n-1~28(z,+C*,))
9,
has coverage
probability
Q~‘~($-e,)p
= P
> - z, -
- n-‘a,z,~(z,) Assuming
putting
c( 19 =
expansion
for a Studentized
8,)/c? < x} = 0(x) + n- 1’2q1GM4
x=
-~~-x~=~,~n
-ji2sj(z,),
%(%){vI?l(%)
-
4;@)
+
fl-
-
+
statistic,
1Y2mw
and Taylor-expanding,
a + n-1’2{s1k) - 41(%))dw + n- 1cq2(z,) +
I
+ O(n-y.
the usual Edgeworth
P{?(B-
cj12, sj(z,)
i
j=l
i
+
i.e. O(n
- 3’2),
we finally obtain
sz(z,) - $z,s1(zJ2
a,zJ#&J + O(n-3’2).
(4.24)
(Remember that qj is an odd/even function for even/oddj, respectively.) The function sr is generally an even polynomial and s2 an odd polynomial. This follows from the Cornish-Fisher expansion (4.19) of 2,. Since s1 is an even function, then by the definition of a, at (4.23), this quantity equals an even polynomial in z,. Therefore the coefficient of n-‘$(z,) in (4.24) is an odd polynomial in z,. The coefficient of n- “‘&z,) is clearly an even polynomial.
P. Hall
2372
There is no difficulty developing the expansion (4.24) to an arbitrary number of terms, obtaining a series in powers of n- ‘P where the coefficient of n-j”4(z,) equals an odd or even polynomial depending on whether j is even or odd. The following proposition summarizes that result. Proposition
Consider
91
4.1
the confidence
interval
=c~~(~)=(-m,6+n-‘128(z,+
c*,)),
where ~2~ = n-“2s*,(z,)
+ n-‘s*,(z,)
+ ...,
where the 3j’s are obtained from polynomials sj on replacing by sample moments and odd/even indexed s;s are even/odd tively. Suppose P{n”‘(@-
where odd/even
0,)/s
<x)
= C@(x) + n-112q1(~)&~)
indexed polynomials
P{8,~9,(cc)}
where odd/even particular,
qj are even/odd
= CI+ n -‘i2rl(z,)4Yz,)
indexed
polynomials
population moments polynomials, respec-
+ n-‘q,(x)4(x)
functions,
+ n- lrAz,)@(z,)
rj are even/odd
+ ...,
respectively.
+ ...,
functions,
Then (4.25)
respectively.
In
rl = s1 - q1
and
r2(d = q2W + s2k) -$mW*
+ sl(z,){z,qlk) - 4;k)) - a,~,.
where a, is defined at (4.22). The coverage expansion (4.25) should of course be interpreted as an asymptotic series. It is often not available uniformly in c(, but does hold uniformly in E < tl< 1 - Efor any 0 < E < i. However, if c^,is monotone increasing in CIthen (4.25) will typically be available uniformly in 0 < c1< 1. An immediate consequence of (4.25) is that a necessary and su#icient condition for our conjidence interval 9, to have coverage error of order n-l for all values of cI, is that it be second-order correct relative to the interval J,, defined in Section 4.4. To
2373
Ch. 39: Methodology and Theory for the Bootstrap
appreciate
why, go back to the definition
(4.19) of e,, which implies that
Y, =(-&+n-“28(2,+&)) =(-co,e+n-“~8{z,+n-“%,(z,)}+O,(n-~’2)). Since q1 1 = -ql J,
(see Section 4.2) then,
=(-co,fLn-1’2hl_J =(-co,e-n-1’%[zl_~+ll-1’2qll(Zl-a)]+0,(n-3’2)) =(-co,e+n-1’2~[z,+.-1’2ql(ZJ]+Op(n-3’2)).
The upper endpoint of this interval agrees with that of 9, in terms of order n- ‘, for all CC,if and only if sl = ql, that is, if and only if the term of order n- l” vanishes from (4.25). Therefore the second-order correct interval jl has coverage error of order n- ‘, but the interval I;, which is only first-order correct, has coverage error of size IZ- 1’2 except in special circumstances. So far we have worked only with confidence intervals of the form (- co,6 + f) where t*= n-‘i2B(z, + &) and C, is given by (4.19). Should the value a2 of asymptotic variance be known then we would most likely construct confidence intervals using t*= n-‘izo(z, + e,), again for 6, given by (4.19). This case may be treated by reworking arguments above. We should change the symbol q to p at each appearance, because we are now working with the Edgeworth expansion (4.3) rather than (4.4). With this alteration, formula (4.24) for coverage probability continues to apply,
P{8,+oo,8+n-“2
4zm+ a)) = M +
n-
1’2{%(%)
Pl(Z,)}~(Z,)
-
+
n-‘C~~(z,) + s2(z,) - +,s1(d2
+
~lk){ZaPlW
-
P;(%)>
-
%4
x &z,) + O(n- 3’2). (Our
definition of a, at (4.22) is unaffected if 8-l is replaced by c-l, since o’ + O&n- 1’2).) Likewise, as the analogue of Proposition 4.1 is valid, it is necessary only to replace 8 by c in the definition of Y, and qj by pj at all appearances of the former. Therefore our conclusions in the case where C-Jis known are similar to those when c is unknown: a necessary and sufficient condition for the confidence interval (- co, 8 + n- ‘I2u(z, + 6,)) to have coverage error of order n-l for all values of cxis that it be second-order correct relative to I,. Similarly it may be proved that if 9, isjth-order correct relative to a one-sided confidence interval JJ;, meaning that the upper endpoints agree in terms of size n-j12 or larger, then Y, and zJ; have the same coverage probability up to but not A-1 o
_ -
P. Hall
2314
necessarily including terms of order n -j” The converse of this result is false for j > 3. Indeed, there are many important examples of confidence intervals whose coverage errors differ by O(n -3/2) but which are not third-order correct relative to one another. Coverage properties of two-sided confidence intervals are rather different from those in the one-sided case. For two-sided intervals, parity properties of polynomials in expansions such as (4.25) cause terms of order n -1’2 to cancel completely from expansions of coverage error. Therefore coverage error is always of order n-l or smaller, even for the most basic Normal approximation method. In the case of symmetric two-sided intervals constructed using the percentile-t bootstrap, coverage of the present section will treat two-sided error is of order nm 2. The remainder equal-tailed intervals. We begin by recalling our definition of the general one-sided interval 9, = S,(U) whose nominal coverage is a:
The equal-tailed
interval
-02(4=au
based on this scheme and having nominal
+~))\A(+(1
=(~+n-1'2+t~-gl,2
coverage
c1is
-4)
+c^(,-01),2),~+n-"28(z(l+.,,2
+tcl+aj,2)).
(4.26)
(Here Y\$ denotes the intersection of set 9 with the complement of set 2.) Apply Proposition 4.1 with z = zcl +aJ,2 = - zcl -ai,2, noting particularly that rj is an odd or even function accordingly as j is even or odd, to obtain an expansion of the coverage probability of Y,(M):
M z,n=Wo~.~2(4~ =
@(~)+n-~~~r~(z)~(z)+n-'r,(z)~(z)+
rl(-z)qb-z)+n~1r2(-z)~(-z)+~~~}
-{@(-z)+n-"2
= cI + 2n-‘r,(z)+(z) = a + 2n-‘[q,(z)
x 4(z) + O(n
...
+ 2nm2r,(z)q5(z) + ... + s2(z) - +z~~(z)~ + sl(z){zql(z) “).
- 4;(z))
- ql
+aj,2~l (4.27)
The property of second-order correctness, which as we have seen is equivalent to s1 = ql, has relatively little effect on the coverage probability in (4.27). This contrasts with the case of one-sided confidence intervals.
Ch. 39: Methodology
For percentile
and Theory@
confidence
the
2375
Bootstrap
intervals, (4.28)
--P11 =Pl
s1=
and
s*(x)= P21W= Plc4P;w - +vl(4z - P2(X), while for percentile-t s1=
(4.29)
intervals, (4.30)
-q11=41
and
s2(4 = 421(x) = ql(x)q;b) - &%(x)2- q2(4.
(4.3 1)
There is no significant simplification of (4.27) when (4.28) and (4.29) are used to express s1 and s2. However, in the percentile-t case we see from (4.27), (4.30) and (4.3 1) that a 2,n
=
a -
2~
‘a(,
+a)/2~(1
+or~,24(z~1
+al,2)
+
0W2L
which represents a substantial simplification. When the asymptotic variance c2 is known, our formula for equal-tailed, two-sided, cc-level confidence intervals should be changed from that in (4.26) to (e+ n- “%zo
-aji2 + c*(r-aj,2), @+ n- 1’2+o
+a)i2 + kc1+&),
for a suitable random function c^,.If c*,is given by (4.19) then the coverage probability of this interval is given by (4.27), except that q should be changed to p at each appearance in that formula. The value of a(, +6j,2 is unchanged.
4.7.
Simple linear regression
In previous sections we drew attention to important properties of the bootstrap in a wide range of statistical problems. We stressed the importance of pivoting. For example, the coverage error of a one-sided percentile-t confidence interval is of size n - ‘, but the coverage error of an uncorrected one-sided percentile interval is of size n- 11.7 The good performance of a percentile-t interval is available in problems where the variance of a parameter estimate may be estimated accurately. Many regression
P.Hall
2316
problems are of this type. Thus, we might expect the endearing properties of percentile-t to go over without change to the regression case. In a sense, this is true; one-sided percentile-t confidence intervals for regression mean, intercept or slope all have coverage error at most 0(X ‘), whereas their percentile method counterparts generally only have coverage error of size n - ‘I2 However, this generalization conceals several very important differences in the case of slope estimation. Onesided percentile-t confidence intervals for slope have coverage error O(n 312), not O(n-‘); and the error is only O(n-‘) in the case of two-sided intervals. These exceptional properties apply only to estimates of slope, not to estimates of intercept or means. However, slope parameters are particularly important in the study of regression, and our interpretation of slope is quite general. For example, in the polynomial regression model Yi=c+xidl
+ . ..+x”dm+Ei.
ldidn,
(4.32)
we regard each dj as a slope parameter. A one-sided percentile-t confidence interval for dj has coverage error O(n- 3’2), although a one-sided percentile-t interval for c or for E(Ylx=x,)=c+x,d,+...+x;;d, has coverage error of size n-‘. The reason that slope parameters have this distinctive points xi confer a significant amount of extra symmetry. the model (4.32) as Yi = C’ + (Xi - 5l)dl
+
where<j=nm’Cxiandc’=c+tldl+... the fact that
‘..
+
(X” - 5,)d, + ‘i’
property is that the design Note that we may rewrite
l,
+ &,,d,. The extra symmetry
arises from
(4.j3) i=l
For example, (4.33) implies that random variables Ci(xi - tj)$ and CE: are uncorrelated for each triple (j,k, I) of nonnegative integers, and this symmetry property is enough to establish the claimed performance of percentile-t confidence intervals. We shall deal only with the simple linear regression model. Multivariate problems are similar in many respects, and the reader is referred to Chapter 4 of Hall (1992) for a more general account in that context.
Ch. 39: Methodology
and Theory
forthe Bootstrap
2311
The simple linear model is
Yi= c + xid + .zi,
Ibid&
where c, d, xi, Yi, &iare scalars, c and d are unknown constants representing intercept and slope, respectively, the ENSare independent and identically distributed random variables with zero mean and variance o’, and the xi’s are fixed design points. Put &= ri- r-(Xi-$&X=n-‘Cxi,
62&p;,
fJ; = n- l t (Xi - X)2. i=l
Then 6’ estimates
o’, and in this notation,
8=cq2n-’ i
Y),
(Xi_X)(Yi_
c*= r-22,
i=l
are the usual least-squares estimates of c and d. Since ahas n”‘(& d)a,/8 is (asymptotically) pivotal. Define r* =
e+ x,2+ E?,
variance
It- ‘0; ‘CT’then
16idn,
where the 6:‘s are generated by resampling randomly from the residuals ii. Furthermore, ?*, c?*, and 6* have the same formulae as ~2,d, and 8, except that Yi is replaced by YF at each appearance of the former. Quantiles u, and u, of the distributions of n”‘(d^ - d)o,/o
and
n”‘(d^ - d)a,/c?
may be defined by f’{n”‘(&
d)o,/a
< urn}= P{n”‘(&
d)o,.B
d ua}
= cc. and their bootstrap
estimates
12,and 0, by
P{n”‘(iP - d^)g,/B< 22,l~} = P{n”‘(J*
- &7,/6*
d 0,1X}
= CY, where X denotes the sample of pairs {(xl, Y,), . . . ,(x,, Y,)}. In this notation, one-sided percentile and percentile-r bootstrap confidence intervals for d are given
P. Hull
2378
by
r^,=(-co,li-n-“2a,1Bal_.), .T1=(-co,&--“‘a;‘80,_,)
respectively; compare Section 4.4. Each of these confidence intervals has nominal coverage c(.The percentile-t interval _?i is the bootstrap version of an “ideal” interval J,
=(-G0,d^-n-“2a,1Bvl_a).
Of course, each of the intervals has a two-sided counterpart. Recall our definition that a one-confidence interval is second-order correct relative to another if the (finite) endpoints of the intervals agree up to and including terms of order (n - 1/2)2 = n- ‘; see Section 4.3. It comes as no surprise to find that j1 is second-order correct for J,, given what was learned in Section 4.5 about bootstrap confidence intervals in more conventional problems. However, on the present occasion r^, is also second-order correct for J,, and that property is quite unusual. It ari_ses because Edgeworth expansions of the distributions of n’/‘(d^- d)a,/a and n’12(d - Lt)a /c? contain identical terms of size n- l/2, that is, Studentizing has no effect on thexfirst term in the expansion. This is a consequence of the extra symmetry conferred by the presence of the design points xi, as we shall show in the next paragraph. The reason why second-order correctness follows from identical formulae for the n- 1’2 terms in expansions was made clear in Section 4.5. Assume that 0.: is bounded away from zero as n---f co, and that maxi ~ iG ,,(xi - Xl is bounded as n -+ co. (In refined versions of the proof below, this boundedness condition may be replaced by a moment condition on the design points xi, such as sup& ‘C(xi - x-)” < co.) Put C= n- ‘C.q, and observe that
82=ne1i~l~f=np1
i i=l
(ei_c_(Xi-_x)(&d)}2
i$l(Ef-
=g2+n-’
Therefore,
defining
A =+n-1a-2
S = n”‘(d^ - c&,/c,
i
02)+ O,(n - 1). T = n1j2(a - d)a,/~?, and
(6: - g2),
i=l
we have T= S(l - d) + O,(n-‘) By making
= S + 0,(n-112).
use of the fact that X(xi - X) = 0 (this is where the extra
(4.34) symmetry
Ch. 3Y: Methodology
conferred
2379
and Theory fiw the Bootstrap
by the design comes in) and of the representation
s = n-%;lo-l
t
(Xi - X)Ci,
i=l
we may easily prove that E{ S( 1 - A)}’ - E(S) = O(n ‘) forj = 1,2,3. Therefore, the first three cumulants of S and S(l - A) agree up to and including terms of order n- ‘I’. Higher-order cumulants are of size n- ’ or smaller. It follows that Edgeworth expansions of the distributions of S and S(l - A) differ only in terms of order 6 ‘. In view of (4.34), the same is true for S and T, P(S < w) = P(T6
w) + O(n_ ‘).
(This step uses the delta method.) Therefore, term in the expansion, as had to be proved.
Studentizing
has no effect on the fist
References Babu, G.J. and Singh, K. (1983) “Inference on Means Using the Bootstrap”, Annals ofStatistics, 11, 999-1003. Babu, G.J. and Singh, K. (1984) “On One Term Correction by Efron’s Bootstrap”, Sankhya, Series A 46, 219-232. Babu, G.J. and Singh, K. (1985) “Edgeworth Expansions for Sampling without Replacement from Finite Populations”, Journal of Multivariate Analysis, 17, 261-278. Barnard, G.A. (1963) “Contribution to Discussion”, Journal of the Royal Statistical Society, Series B, 25, 294. Beran, R. (1982) “Estimated Sampling Distributions: The Bootstrap and Competitors”, Annals of Statistics, 10, 212-225. Beran, R. (1987) “Prepivoting to Reduce Level Error of Confidence Sets”, Biometrika, 74.457-468. Bhattacharya, R.N. and Ghosh, J.K. (1978) “On the Validity of the Formal Edgeworth Expansion”, Annals of Statistics, 6,434&451. Bickel, P.J. and Freedman, D.A. (1980) “On Edgeworth Expansions and the Bootstrap”. Unpublished manuscript. Bickel, P.J. and Freedman, D.A. (1981) “Some Asymptotic Theory for the Bootstrap”, Annals of Statistics, 9, 1196-1217. Bickel, P.J. and Freedman, D.A. (1983) Bootstrapping Regression Models with Many Parameters” in P.J. Bickel, K.A. Doksum, and J.C. Hodges, Jr., eds. A Festschrift for Erich L. Lehmann. Belmont: Wadsworth, 28-48. Bose, A. (1988) “Edgeworth Correction by Bootstrap in Autoregressions”, Annals of Statistics, 16, 170991722. Carlstein, E. (1986) “The Use of Subseries Methods for Estimating the Variance of a General Statistic from a Stationary Time Series”, Annals of Statistics, 14, 1171-l 179. Chibishov, D.M. (1972) “An Asymptotic Expansion for the Distribution of a Statistic Admitting an Asymptotic Expansion”, Theory of Probability and its Applications, 17, 620-630. Chibishov, D.M. (1973a) “An Asymptotic Expansion for a Class of Estimators Containing Maximum Likelihood Estimators”, Theory of Probability and its Applications, 18, 295-303. Chibishov, D.M. (1973b) “An Asymptotic Expansion for the Distribution of Sums of a Special Form with an Application to Minimum-Contrast Estimates”, Theory ofProbability and its Applications, 18, &19-661. Cox, D.R. and Hinkley, D.V. (1974) Theoretical Statistics. London: Chapman and Hall,
2380
P. Hall
Davison, AC. and Hall, P. (1993) “On Studentizing and Blocking Methods for Implementing the Bootstrap with Dependent Data”, Australian Journal ofStaristics, 35, 215-224. Davison, A.C. and Hinkley, D.V. (1988) “Saddlepoint Approximations in Resampling Methods”, Biometrika, 15, 411-431. DiCiccio, T.J. and Romano, J.P. (1988) “A Review of Bootstrap Confidence Intervals (With Discussion)“, Journal oj the Royal Statistical Society, Series B 50, 338-354. Efron, B. (1979) “Bootstrap Methods: Another Look at the Jackknife”, Annals ofStatistics, 7, l-26. Efron, B. (1983) “Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation”, Journal ofthe American Statistical Association, 78, 316-331. Efron, B. (1987) “Better Bootstrap Confidence Intervals (With Discussion)“, Journal of the American Statistical Association, 82, 171L200. Fisher, N.I. and Hall, P. (1991) “Bootstrap Algorithms for Small Samples. Journal ofStatistical PIanniny and Inference”, 21, 151- 169. Freedman, D.A. (198 1) “Bootstrapping Regression Models”, Annals ofStatistics, 9, 1218-1228. Freedman, D.A. and Peters, S.C. (1984) “Bootstrapping a Regression Equation: Some Empirical Results”, Journal oJthe American Statistical Association, 79, 97-106. G&e, F. and Kiinsch, H.R. (1990) “Blockwise Bootstrap for Dependent Observations: Higher Order Approximations for Studentized Statistics (Abstract)“, Bull. Inst. Math. Statist.. 19,443. Hall, P. (1985) “Resampling a Coverage Process”, Stoch. Proc. Appl., 19,259-269. Hall, P. (1986) “On the Bootstrap and Confidence Intervals”, Annals ofstatistics, 14, 1431-1452. Hall, P. (1988a) “Theoretical Comparison of Bootstrap Confidence Intervals (With Discussion)“, Annals of Statistics, 16, 927-985. Hall, P. (1988b) “Unusual Properties of Bootstrap Confidence Intervals in Regression Problems”, Probab. Theory Rel. Fields, 81, 247-273. Hall, P. (1992) The Bootstrap and Edgeworth Expansion. New York: Springer. Hall, P. and Horowitz, J.L. (1993) Corrections and Blocking Rules for the Bootstrap with Dependent Data. Research Report no. CMA-SRI l-93, Centre for Mathematics and its Applications, Australian National University. Hall, P. and Martin, M.A. (1988) “On Bootstrap Resampling and Iteration”, Biometriku, 75, 661-671. Hartigan, J.A. (1969) “Using Subsample Values as Typical Values”, Journal of the American Statistical Association, 64, 1303-1317. Hartigan, J.A. (1971) “Error Analysis by Replaced Samples”, Journal of the Royal Statistical Society, Series B 33, 98-l 10. Hinkley, D.V. (1988) “Bootstrap Methods (With Discussion)“, Journal ofthe Royal Statistical Society, Series B 50, 321-337 Hope, A.C.A. (1968) “A Simplified Monte Carlo Significance Test Procedure”, Journal qf the Royal Statistical Society, Series B 30, 582-598. Kendall, M.G. and Stuart, A. (1977) The Advanced Theory of Statistics. Vol 1, 4th three-volume Ed. London: Griffin. Kiinsch, H.R. (1989) “The Jackknife and the Bootstrap for General Stationary Observations”, Annals ofStatistics, 17, 1217-1241. Lahiri, S.N. (1991) “Second Order Optimality of Stationary Bootstrap”, Statistics and Probability Letters, 11, 335-341. Lahiri, S.N. (1992) “Edgeworth Correction by ‘Moving Block’ Bootstrap for Stationary and Nonstationary Data”, in: Exploring the Limits ofthe Bootstrap. R. Lepage and L. Billard, eds., New York: Wiley, pp 183-214. Liu, R.Y. and Singh, K. (1987) “0 n a Partial Correction by the Bootstrap”, Annals of Statistics, 15, 1713-1718. Lob, W.-Y. 
(I 987) “Calibrating Confidence Coefficients”, Journal of the American Statistical Association, 82, 155-162. Marriott, F.H.C. (1979) “Barnard’s Monte Carlo Tests: How Many Simulations?“, Applied Statistics, 28, 75-77. Martin, M.A. (1989) On the Bootstrap and Confidence Intervals. Unpublished PhD thesis, Australian National University. Peters, SC. and Freedman, D.A. (1984a) “Bootstrapping an Econometric Model: Some Empirical Results”, J. Bus. Econ. Studies, 2, 150-158.
Ch. 39: Methodology
und Theory,for
the Bootstrap
2381
Peters, S.C. and Freedman, D.A. (1984b) “Some Notes on the Bootstrap in Regression Problems”, J. Bus. Econ. Studies, 2, 406-409. Quenouille, M.H. (1949) “Approximate Tests of Correlation in Time-Series”, Journal of the Royal Statistical Association, Series B 11. 68-84. Quenouille, M.H. (1956) “Notes on Bias in Estimation”, Biometrika, 43, 353-360. Reid, N. (1988) “Saddlepoint Methods and Statistical Inference (With Discussion)“, Statistic. Sci., 3, 213-238. Robinson, J. (1987) “Nonparametric Confidence Intervals in Regression: The Bootstrap and Randomization Methods”, in: M.L. Puri, J.P. Vilaplana, and W. Wertz, eds., New Perspectives in Theoretical and Applied Statistics. New York: Wiley, pp 2433256. Sargan, J.D. (1975) “Gram-Charlier Approximations Applied to t Ratios of k-Class Estimators”, Econometrica, 43, 3277346. Sargan, J.D. (1976) “Econometric Estimators and the Edgeworth Approximation”, Econometrica, 44, 421-448. Simon, J.L. (1969) Basic Research Methods in Social Science. New York: Random House. Singh, K.(1981)“0n the Asymptotic Accuracy ofEfron’s Bootstrap”, Annalsofstatistics, 9,1187-l 195. Tukey, J.W. (1958) “Bias and Confidence in Not-Quite Large Samples (Abstract)“, Ann. Math. Statist., 29, 614.
Chapter
40
CLASSICAL ESTIMATION METHODS FOR LDV MODELS USING SIMULATION VASSILIS
A. HAJIVASSILIOU
Yale University PAUL
A. RUUD
University
of California
Contents 1. 2.
variable
models
2.1.
The latent normal
regression
model
2.2.
Censoring
2386
Truncation
2392
2.3.
3.
4.
2384 2386
Introduction Limited dependent
2386
2.4.
Mixtures
2394
2.5.
Time series models
2395
2.6.
Score functions
2.7.
The computational
Simulation 3.1.
Overview
3.2.
Censored
3.3.
Truncated
Simulation
2397 intractability
of LDV models
methods
2399
2400 2400 2402
simulation
2406
simulation
and estimation
of LDV models
2408 2408
4.1.
Overview
4.2.
Simulation
of the log-likelihood
4.3.
Simulation
of moment
functions
2421
4.4.
Simulation
of the score function
2428
4.5.
Bias corrections
function
Conclusion 6. Acknowledgements References 5.
Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden 0 1994 Elseuier Science B. V. All rights reserved
2412
2435
2437 2438 2438
V.A. Hajitmsiliou
2384
1.
and P.A. Ruud
Introduction
This chapter discusses classical estimation methods for limited dependent variable (LDV) models that employ Monte Carlo simulation techniques to overcome computational problems in such models. These difficulties take the form of high-dimensional integrals that need to be calculated repeatedly. In the past, investigators were forced to restrict attention to special classes of LDV models that are computationally manageable. The simulation estimation methods we discuss here make it possible to estimate LDV models that are computationally intractable using classical estimation methods. We first review the ways in which LDV models arise, describing the differences and similarities in censored and truncated data generating processes. Censoring and truncation give rise to the troublesome multivariate integrals. Following the LDV models, we described various simulation methods for evaluating such integrals. Naturally, censoring and truncation play roles in simulation as well. Finally, estimation methods that rely on simulation are described. We review three general approaches that combine estimation of LDV models and simulation: simulation of the log-likelihood function (MSL), simulation of moment functions (MSM), and simulation of the score (MSS). The MSS is a combination of ideas from MSL and MSM, treating the efficient score of the log-likelihood function as a moment function. One of the most familiar LDV models is the binomial probit model, which specifies that the probability that a binomial random variable y equals one, conditional on the regression vector x, is @(x’fi) where a(.) is the univariate standard normal cumulative distribution function (c.d.f.). Although this integral has no analytical expression, @ has accurate, rapid, numerical approximations. These help make maximum likelihood estimation of the binomial probit model straightforward and most econometric software packages provide such estimation as a feature. However, a simple and common extension of the binomial probit model renders the resulting model too difficult for maximum likelihood computation. Introducing correlation among the observations generally produces a likelihood function containing integrals that cannot be well approximated and rapidly computed. An example places the binomial probit model in the context of panel data in which a cross-section of N experimental units (individuals or households) is observed repeatedly, say in T consecutive time periods. Denote the binomial outcome for the nth experimental unit in the tth time period by y,,~(0, l}. In panel data sets, econometricians commonly expect correlation among the y,, for the same n across different t, reflecting the presence of unobservable determinants of y,, that evolve slowly for each experimental unit through time. In order to model such correlation parsimoniously, econometricians have adapted familiar models with correlation to the probit model. One can describe each y,, as the transformation of a latent,
Ch. 40: Classical Estimation
normally
distributed,
Methods,for
2385
LDV Models Using Simulation
y,*,:
Then, one can assign the latent y,: a nonscalar covariance matrix appropriate to continuously distributed panel data. For example, stacking the y,*, first by time period and then by experimental unit, a common specification of the covariance matrix is the variance components plus first-order autoregression model
+ +JT, PT-2
...
p
1 (1)
Now consider the impact of such nonscalar covariance matrices on the likelihood for the observed y,,. Although the marginal probabilities that y,, is zero or one are unchanged, the likelihood function consists of thejoint probabilities that the dependent series {yn1,yn2,. . . , ynT} are the observed sequences of zeros and ones. These joint probabilities are multivariate normal integrals over T dimensions and there are 2* possible integrals.’ The practical significance of the increased dimensionality of the integrals is that traditional numerical methods generally cannot compute the integrals with sufficient speed and precision to make the computation of the maximum likelihood estimator workable. In this chapter, we review a collection of alternative, feasible, methods based on the ideas of estimation with simulation suggested by McFadden (1989) and Pakes and Pollard (1989). In Section 2, we describe LDV models and illustrate the computational difficulties classical estimation methods encounter. Section 3 summarizes basic simulation methods, covering censored and truncated sampling methods. Estimation of LDV models and simulation are combined in Section 4 where three general approaches are reviewed: simulation of the log-likelihood function, simulation of moment
’ A partial list of studies in numerical analysis of such integrals is Clark (1961), Daganzo (1980), Davis and Rabinowitz (1984), Dutt (1973), Dutt (1976). Fishman (1973), Hammersley and Handscomb (1964), Horowitzeta1.(1981),Moran(1984),Owen(1956),Rubinstein(l981),Stroud(l97l)andThisted(1988).
V.A. Hajiuassiliou and P.A. Ruud
2386
functions, and simulation of the efficient score. We provide computational examples throughout to illustrate the various methods and their properties. We conclude this chapter with a summary of the main approaches presented.
2.
2.1.
Limited dependent variable models The latent normal regression model
Consider the problem of maximum likelihood estimation given the N observations on the vector of random variables y drawn from a population with cumulative distribution function (c.d.f.) F(0, Y) = Pr{y < Y}. Let the corresponding density function with respect to Lebesgue measure be f(0, y). The density f is a parametric function and the parameter vector 0 is unknown, finite-dimensional, and 0~63, where 0 is a compact subset of RK.Estimation of 19by maximum likelihood (ML) involves the maximization of the log-likelihood function IN(d) = C,“= 1In f(0; y,) over 0. Often, finding the root of a system of normal equations V,l,(O) = 0 is equivalent. In the limited dependent variable models that we consider in this chapter, F will be a mixture of discrete and continuous distributions, so that f may consist of nonzero probabilities for discrete values of y and continuous probability densities for intervals of y. These functions are generally difficult to compute because they involve multivariate integrals that do not have closed forms, accurate approximations or rapid numerical solutions. As a result, estimation of 0 by classical methods is effectively infeasible. In this section, we review the various forms of likelihood functions that arise in LDV models. In the second subsection, we discuss models generated as partially observed or censored latent dependent variables. The third subsection describes truncated latent dependent variables. In this case, one views observations in a latent data set as missing entirely from an observed data set. Within these broad categories, we review discrete, mixed discrete/continuous, and mixture likelihood functions. Following our discussion of likelihood functions, Section 2.6 treats the structure of the score function for LDV models and the last subsection gives a concrete illustration of the intractability of classical estimation methods for the general LDV model.
2.2.
Censoring
In general, and particularly in LDV models, one can represent the data generating process for y as an “incomplete data” or “partial observability” process in which the observed data vector y is an indirect observation on a latent vector y*. In such cases, y* cannot be recovered from the censored random variable y.
Ch. 40:
Classical
De$nition
1.
Estimation
Methods.for
Censored
LDV Models
2387
Using Simulation
random variables
Let Y* be a random variable from a population with c.d.f. F(Y*) and support A. Let B be the support of the random variable Y = r(Y*) where z:A+ B is not invertible. Then Y is a censored random variable. In LDV models, r is often called the “observation rule”; and though it may not be monotonic, t is generally piece-wise continuous. An important characteristic of censored sampling is that no observations are missing. Observations on y* are merely abbreviated or summarized, hence the descriptive term “censored”. Let A c RM and Bc
RJ.
The latent c.d.f. F(B; Y*) for y* is related to the observed equation
F(B; Y) =
s
(Y*lr(Y*)
c.d.f. for y by the integral
dF(B; y*).
(2)
d Y)
In the LDV models that we consider, F(0; y*) is the multivariate normal c.d.f. given by F(B, y*) = J&y* - p, Qdy* where R is a positive definite matrix, and @(y* - p,Q) = {det[2nR]}-1’2exp[
-i(y*
- ~)‘a-‘(y*
-p)].
(3)
We will refer to this multivariate normal distribution as the N(p,fl) distribution. The mean vector is often parameterized as a linear function of observed conditioning variables X: p(p) = Xp, where fi is a vector of K, slope coefficients. The covariance matrix is usually a function of a vector of K, variance parameters C. The p.d.f. for y is the function that integrates to F(8; Y). In this chapter, integration refers to the Lebesgue-Stieltjes integral and the p.d.f. is a generalized derivative of the c.d.f.2 This means that the p.d.f, has discrete and continuous components. Everywhere in the support of Y where F is differentiable, the p.d.f. can be obtained by ordinary differentiation:
(4)
A simple illustration of such p.d.f.‘s is given below in Example 2. In the LDV models we consider, F generally has a small nuumber of discontinuities in some dimensions
‘Such densities are formally
known as Radon-Nikodym
p.d.t’s. with respect to Lebesgue
measure
V.A. Hajimssiliou
2388
and P.A. Ruud
of Y so that F is not differentiable everywhere. At a point of discontinuity Yd, we can obtain the generalized p.d.f. by partitioning Y into the elements in which F is differentiable, { Y,, . . . , YJ,} say, and the remaining elements { Y,. + i, . . . , Y,} in which the discontinuity occurs. The p.d.f. then has the form
f(e; y) =
$& 1 ..
[F(0;
Y) - F(8; Y - 0)]
J’
= f(e; Y1,..., yJ,)Pr{Yj=Y~;j>~‘(6;
yl,...,
YJ'},
(5)
where the discrete jump F(8; Y) - F(8; Y - 0) reflects the nontrivial probability of the event { Yj = Y:; j > 5’}.3 Examples 1 and 2 illustrate such probabilities. It is these probabilities, the discrete components of the p.d.f., that pose computational obstacles to classical estimation. One must carry out multivariate integration and differentiation in (2))(5) to obtain the likelihood for the observed data - see the following example for a clear illustration of this problem. Because accurate numerical approximations are unavailable, this integration is often handled by such general purpose numerical methods as quadrature. But the speed and accuracy of quadrature are inadequate to make the computation of the MLE practical except in special cases. Example
1.
Multinomial
probit
The multinomial probit model is a leading illustration of the computational difficulties of classical estimation methods for LDV models, which require the repeated evaluation of (2))(5). This model is based on the work of Thurstone (1927) and was first analyzed by Bock and Jones (1968). For a multinomial model with J = M possible outcomes, the latent y* is N(p, f2) where p is a J x 1 vector of means and R is a J x J symmetric positive definite covariance matrix. The observed y is often represented as a vector of indicator functions for the maximal element of y*: r(y*) = [l(yf=max,y*}; j= l,..., 51. Therefore, the sampling space B of y is the set of orthonormal elementary unit vectors, whose elements are all zero except for a unique element that equals one:
B= {(l,O,O ,..., O),(O,l,O,O ,..., 0) ,..., (0,O ,..., O,l)}. The probability function for y can be written as an integral over J - 1 dimensions after noting that the event {yj = 1, yi = 0, i # j} is equivalent to {y; - y* 3 0, i = 1,. . . ,
3The height of the discontinuity
is denoted
by
F(O; Y) - F(B; Y - 0) = lim [F(f$ Y) - F(0; Y - E)]. 40
Ch. 40: CIassical Estimation
Methods
2389
for LDV Models Using Simulation
J}. By creating the first-difference vector zj = [y: - y*, i = 1,. . . ,J, i #j] = AjY* and denoting its mean and covariance by pj = Ajp and fij = AjflA> respectively, F(B;y) and ~“(0; y) are both functions of multivariate normal negative orthant integrals of the general form
ss 0
0
@(p,L?) =
4(x + P, Wx.
..’
-co
-m
We obtain
F(e;Y)= i
l{Yj3
l}@(-Pj,.Rj)
j=l
and
f
ng= r @( - ~j, Rj)Y’
(0;Y) =
0
When J = 2, this reduces the introduction:
f
(0;Y) =
ifyE&
i
WC12
-PI,
@(
p’,
-
(6)
otherwise. to the familiar
l)y’@(P1
-
P2,
1)’ _Y’@($, 1)Y’
binomial
probit
likelihood
mentioned
in
1Y2
(7)
where ji = pL1- pLzand y’ = y,. If J > 5, then the likelihood function (6) is difficult to compute using conventional expansions without special restrictions on the covariance matrix or without adopting other distributions that imply closed-form expressions. Examples of the former approach are the factor-analytic structures for R analyzed in Heckman (1981), Bolduc (1991) and Bolduc and Kaci (1991), and the diagonal R discussed in Hausman and Wise (1978), p. 310. An example of the latter is the i.i.d. extreme-value distribution which, as McFadden (1973) shows, yields the analytically tractable multinomial logit model. See also Lerman and Manski (1981), p. 224, McFadden (198 1) and McFadden (1986) for further discussions on this issue.
Example
2.
Tohit
The tobit or censored regression model4 is a simple example of a mixed distribution with discrete and continuous components. This model has a univariate latent
4Tobin
(I 95X).
V.A. Hajivassiliou
2390
and P.A. Ruud
rule is also similar: structure like probit: y* - N(p, a’). The observation l{y* >O}.y* which leads to the sample space B = {yeRly 2 0} and c.d.f. 0
F(B; Y) =
+(y*-p,~)dy*=@(Y-~(,a*) /
s (Y’
0
i
if Y < 0, ifY30.
< Y)
The p.d.f. is mixed, containing
f(@Y)=
r(y*) =
discrete and continuous
terms:
if Y < 0,
@(--~,a’)
ifY=O,
f$(Y-~,a’)
ifY>O.
(8)
The discrete jump in F at Y = 0 corresponds to the nonzero probability of { Y = 0}, just as in binomial probit. F is differentiable for Y > 0 so that the p.d.f. is obtained by differentiation. Just as in the extension of binomial to multinomial probit, multivariate tobit models present multivariate integrals that are difficult to compute. Example 3.
Nonrandom sample selection
The nonrandom sample selection model provides a final example of partial observability which generalizes the tobit model.’ In the simplest version, the latent y* consists of two elements drawn from a bivariate normal distribution where
n(a) =
l
[ al2
The observation
*
al2
a2 1
rule is
so that the first element of y observation on yf when y, = is identically zero. That is, the B = { (O,O)} u { (l,y,), ~,ER}.
‘See
is a binomial variable and the second element is an 1; otherwise, there is no observation of yf because y, sampling space of y is the union of two disjoint sets: Thus, two cases capture the nonzero regions of the
Gronau (1974), Heckman (1974), Heckman (1979), Lewis (1974), Lee (1978) and Lee (1979).
Ch. 40: Chwical
Estimation
Merhods
fir
LDV
Models
c.d.f. of y. First of all, the c.d.f. is constant
F(B; Y)=
s
2391
Using Simuhfion
on B, = [0, 1) x [0, co):
YEB,
ddy* - p,fWy* = @(- ~l,lX
rv: <01
because yr is unrestricted (and unobserved) in this case. Once Y, reaches 1, the entire sampling space for y, has been covered and the c.d.f. on B, = [l, co) x R is increased according to
l{Y22O}@(-P1,1)+ F(B; Y) =
s
4(~* - p, 0) IY:~O,Y;~Y*)
dy*,
YEB,,
l{Y,2O}@(-Pr,l)+@
The p.d.f. will therefore
f
(0; Y) =
be
1
@(-
@(PI + arAY
- P2)/&
The sample selection process sample selection. In such cases, with a different cause of partial elements of y: (suppose there rule is
Pl>
if Y, = 0,
1)
1 - 0:,/0:).4(Y,
- ~,,a:)
if Y, = 1.
is often more complicated, with several causes of the latent yT is a vector with each element associated observation. The latent yT is observed only if all the are J = M - 1) are positive so that the observation
where 1 {y: 2 0) is an (M - 1) x 1 vector of indicator
variables. The sampling space is
M-l
B=
JJER~IYM=O,fl Yj=l,YjE{O,l},
j<M
j= 1 M-l
JJER~IYM=O,
JJ Yj=O,YjE{O,l}
9
j=l
and the likelihood sions of yy.
function
contains
multivariate
integrals
over the M - 1 dimen-
Other types of nonrandom sample selection lead to general discrete/continuous models and models of switching regressions with known sample separation. Such
V.A. Hajiwssiliou
2392
models are discussed extensively in Dubin and McFadden Lee (1978) Maddala (1983) and Amemiya (1984).
2.3.
and P.A. Ruud
(1984), Hanemann
(1984)
Truncation
When it is represented as a partial observation, censored latent variable. Another mechanism variables is truncation, which refers to dropping tion goes unrecorded. Dejinition
2.
Truncated
a limited dependent variable is a for generating limited dependent observations so that their realiza-
random variables
Let F(Y) be the c.d.f. of y* and let D be a proper subset of the support its complement such that Pr {y*E DC} > 0. The function G(Y)=
F(Y)/Pr{YED} 0
is the c.d.f. of a truncated
of F and DC
if YED, if YED’.
y*.
One can generate a sample of truncated random variables with the c.d.f. G by drawing a random sample of y* and removing the realizations that are not members of D. This is typically the way truncation arises in practice. To draw a single realization of the truncated random variable, one can draw y*‘s until a realization falls into D. The term “truncation” derives from the visual effect dropping the set DC has on the original distribution when DC is a tail region: the tail of the p.d.f. is cut off or truncated. To incorporate truncation, we expand the observation rule to
Y=
4Y*)
ify*ED,
unobserved
otherwise,
where D is an “acceptance region”. This situation differs from that of the nonrandom sample selection model in which an observation is still partially observed: at least, every realization is recorded. In the presence of truncation, the observed likelihood requires normalization relative to the latent likelihood:
(10)
Ch. 40: Classical
Estimation Methodsfor
The normalization with an upper bound Example
4.
by a probability of one.
Truncated
2393
LDV Models Using Simulation
in the denominator
makes the c.d.f. proper,
normal regression
Ify* - N(p, a’) and y is an observation of y* when y* > 0, the model is a truncated normal regression. Setting D = (y~f?I y > 0) makes B = D so that the c.d.f. and p.d.f. of y are
F(6;
Y)=
4 y
P, 4
NY*
0
sm 0
f(& Y) =
I
dy*
= @(y-PY~2)-@(-I44
@(Y* -p,ddy* 0
&Y-
if Y < 0,
0
r
1-
@(-p,c?)
if y>.
2
if Y d 0, K 0)
! 1 -@(-&a”)
ifY>O,
As in the tobit model, a normal integral appears in the likelihood function. However, this integral enters in a nonlinear fashion, in the denominator of a ratio. Clearly, multivariate forms of truncation lead to multivariate integrals in the denominator. To accommodate both censored and truncated models, in the remainder of this chapter we will often denote the general log-likelihood function for LDV models with a two-part function:
ln f@ Y) = ln ft (8 Y) - ln fd@ Y)
(11)
where fi represents the normalizing probability Pr {y* E D ] = SDdF(0; y*). In models with only censoring, f2E 1. But in general, both fi and f2 will require numerical approximation. Note that in this general form, the log-likelihood function can be viewed as the difference between two log-likelihood functions for models with censoring. For example, the log-likelihood of the truncated regression in Example 4 is the difference between the log-likelihoods of the tobit regression in Example 2 and the binomial probit model mentioned in the introduction and Example 1 (see equations (7) and (S))?
6Note that scale information about y* is available in the censored and truncated normal regression models which is not in the case of binary response, so that u2 is now identifiable. Hence, the normalization c’_= 1 is not necessary, as it is in the binary probit model where only the discrete information 1{ Y> 0) is available.
V.A. Hajivassiliou
2394
1 {y >
L!9kEL [ 1 -@(-p,a2)
0) ln
und P.A. Ruud
1
=l{Y>O}ln[C$(Y-/J,a)]+1{Y=0}@(-,U,0*) -[l(Y>O}ln[l-@(--p,c*)] +l(Y=o}@(-~,a*)].
2.4.
Mixtures
LDV models have limited dependent an analytical trait generally contains Definition
3.
come to variables. with the discrete
include a family of models that do not necessarily have This family, containing densities called mixtures, shares LDV models that we have already reviewed: the p.d.f. probability terms.
Mixtures
Let F(8; Y) be the c.d.f. of y* depending Then the c.d.f.
on a parameter
8 and H(8) another
c.d.f.
F(B; Y) dH(8)
G(Y) =
s is a mixture. Possible ways in which mixtures arise in econometric models are unobservable heterogeneity in the underlying data generating process (see, for example, Heckman (198 1)) and “short-side” rationing rules (Quandt (1972), Goldfeld and Quandt (1975), Laroque and Salanit: (1989)). Laroque and Salanitt (1989) discuss simulation estimation methods for the analysis of this type of model. Example 5.
Mixture
A cousin of the nonrandom by an underlying trivariate
The observation written as
sample selection model is the mixture normal distribution, where
rule maps a three-dimensional
model generated
vector into a scalar; the rule can be
Ch. 40: Cla.ssical Estimation
Mefhods,fiw
LDV
Models
2395
Usiny Simulation
An indicator function determines whether yT or y: is observed. An important difference with sample selection is that the indicator itself is not observed. Thus, y is a “mixture” of yz’s and yz’s. As a result, such mixtures have qualitatively distinct c.d.f’s, compared to the other LDV models we have discussed. In the present case,
F(8; Y) =
s s o(
3
4(y*
- PL,f4
0.y; G Y)u (Y: < 0-y;Q Yl 4(y*
- PL,Qdy*
+
dy*,
s
$(Y* -
PL,4
dy*,
{Y:< 0.y: s Y)
(Y:a o,y: s Yl
and
where, for j = {2,3}, PlljeE(Y:lYj*=
Llllj
E
y)=Pl
+alj(YTPj)/af9
Iqyyyj* = Y) = 1 - 0+;
are conditional moments. The p.d.f. particularly demonstrates the weighted nature of the distribution: the marginal distributions of yz and ys are mixed together by probability weights.
2.5.
Time series models
LDV models are not typically applied to time series data sets but short time series have played an important role in the analysis of panel or longitudinal data sets. Such time series are another source of high-dimensional integrals in likelihood functions. Here we expand our introductory example. Example
6.
Multiperiod
binary probit model
A random sample of N economic agents is followed over time, with agent n being observed for T periods. The latent variable y,*, = pnt + E,, measures the net benefit to the agent characterizing an action in period t. Typically, pnt is a linear index
V.A. Hajiwssiliou
2396
und P.A. Ruud
function of a k x 1 vector of exogenous explanatory variables x,,~,i.e., pL,t= xk#. The agent chooses one of two actions in each period, denoted by y,,,~jO, l}, depending upon the value of y,*,:
r(y*) =
y,, = 1 i y,, = 0
ify,*, > 0,
(12)
t= l,...,T.
if yz* d 0, I
Hence, the sample space for r(y*) is B = x T= I (0, l}, i.e., all possible (2r) sequences of length T, with 0 and 1 as the possible realizations in each period. normal given in Let the distribution of y,* = (y,*,, . . , y,*,)’ be the multivariate equation (3). Then, for individual n the LDV vector {y,,}, t = 1,. , T, has the discrete p.d.f. where p = x’b
f(fi, 0; Y) = @(S,,, SOS),
and S = diag{2y - 1).
This is a special case of the multinomial probit model of Example 1, with J = 2r alternatives and a typically highly restricted 52, reflecting the assumed serial correlation in the {snt}T=1 sequence. By way of illustration, let us consider the specific covariance structure, found very useful in applied work:7 E”*= rn + in,,
IPI < 1
&I*= Pi,,* - 1 + “nt,
and v and yeindependent.
This implies that p
p2
...
pT-’
P
1
p
...
pT-2
p2
p
1
pT-l
(13)
pT-2
...
i
...
1
P
...
/,
1
+ a;.J,.
The variance parameters 0: and of cannot both be identified, so the normalization a~+~,2=1isused.’ The probability of the observed sequence of choices of individual n is
s b&n)
Pr (y,; 8, XJ =
4(~,*
- A, f&J dy:,
MYn)
‘See Hajivassiliou and McFadden (1990), BGrsch-Supan et al. (1992) and Hajivassiliou *This is the structure assumed in the introductory example see equation (1) above.
(1993a).
Ch. 40: Classical
Estimation Methods/iv
LD V Models Using Simulation
2397
with 0 = (/3, of, p) and 0
%t =
if y,, = 1,
i -YE
ify,, = 0,
Note that the likelihood of this example is another member of the family of censored models. Time series models like this do not present a new analytical problem. Indeed, such time series models are more tractable for estimation because classical methods do provide consistent, though statistically inefficient, estimators (see Poirier and Ruud (1988), Hajivassiliou (1986) and Avery et al. (1983)).9 Keane (1993) discusses extensively special issues in the estimation by simulation of panel data models and Miihleisen (1991) compares the performance of alternative simulation estimators for such models. Studies of dynamic discrete behavior using simulation techniques are Berkovec and Stern (1991), Bloemen and Kapteyn ($991), Hajivassiliou and Ioannides (1991), Hotz and Miller (1989), Hotz and Sanders (1991), Hotz et al. (1991), Pakes (1992) and Rust (1992). In this chapter, we do not analyze the estimation by simulation of “long” time series models. We refer the reader to Lee and Ingram (1991), Duffie and Singleton (1993), Laroque and Salanie (1990) and Gourieroux and Monfort (1990) for results on this topic.
2.6.
Score functions
For models with censoring, the score for 8 can be written in two ways which we will use to motivate two approaches to approximation of the score by simulation.
V,lnf(Q;y)
=z =
(14)
~JW~lnf(~;~*)l~l
(15)
where V0 is an operator that represents partial differentiation with respect to the elements of 0. The ratio (14) is simply the derivative of the log-likelihood and ‘Panel data sets in which each agent is observed for the same number of time periods T are called balanced, while sets with T. # T for some n = 1,. , N are known as unhalancrd. As long as the determination of T. is not endogenous to the economic model at hand, balanced and unbalanced sets can be analyzed using the same techniques. There exists, however, the interesting case in which T,, is determined endogenously through an economic decision, which leads to a multiperiod sample selection problem. See Hausman and Wise (1979) for a discussion of this case.
V.A. Hajivassiliou
2398
and P.A. Ruud
simulation can be applied to the numerator and denominator separately. The second expression (15), the conditional expectation of the score of the latent loglikelihood, can be simulated as a single expectation if V, lnf(8; y*) is tractable. Ruud (1986), van Praag and Hop (1987), Hajivassiliou and McFadden (1990) and Hajivassiliou (1992) have noted alternative ways of writing score functions for the purpose of estimation by simulation. Here is the derivation of (15). Let F(8; y* 1y) denote the conditional c.d.f. of y* given that r(y*) = y.” We let
ECNY*)IYI =
r(y*)dF(&y*Iy)
(16)
s denote the expectation of a random variable c.d.f. F(B; y* 1y) of y* given r(y*) = y. Then
vtm Y) _
1
.mY)
f(R Y) =
s
s
V,dF(&y*)
(y*,r(y*)
‘y)
v&m
(y'lr(y*)'Y)
t(y*) with respect to the conditional
Y*)
fvt Y*)
= -V,WV;Y*)I~(Y*)
fvt Y*) fvt Y)
dy*
= ~1
since
f(8; y*)dy* is the p.d.f. of the truncated We; Y*)ifvt Y) = fv3 Y*YJ(~.,~(~*)=~~ distribution {y* 1r(y*) = y}. This formula for the score leads to the following general equations for normal LDV models when y* has the multivariate normal p.d.f. given in (3):
v,ln f(e; Y) = a- 1[E(Y*
IY) - ~1,
V,lnf(e;Y)=~~-l{~(Y*lY)+c~(y*i~(y*)=y)-~i x [WY* using the standard
Ir(Y*) = Y) - PI’ - fl>fl-
derivatives
for the log-likelihood
I0 Formally, F(O; Y* 1T(y*) =
y)
E
lim 610
Pr{y*< y*,y-E
r, of a multivariate
(17) normal
Ch. 40: Classical Estimation
MethodsJiv
LDV Models Usiny Simulation
2399
According to (17), the score of a normal LDV model depends only on the first two moments of a truncated multivariate normal random variable z generated by the truncation rule
z=
Y* unobserved
if r(y*) = y, (19)
otherwise.
The functional form of these moments depends on the specification of the LDV function 2. For LDV models with truncation, there are no changes to (14)-(16). The only change that (9) requires for (19) is the restriction to the acceptance region D. That is, the score depends on only the first two moments of a truncated multivariate normal random variable z’ generated by the truncation rule z’ =
Y* unobserved
if r(y*) = y,
Y*E&
otherwise.
As a result, there is a basic change to (17). Because the log-likelihood function of truncated models is the difference between two log-likelihood functions for censored models (see equation (1 l)), the score is expressed as the difference in the scores for such models:
V. ln f(& Y) = V. ln fl(& Y) - V. ln f,(Q;Y) = ~1- W,lnf(R~*)l~*~Dl
= .W,lnf(R~*)l~(~*) and (17) becomes V,lnF(B;y)=R-‘[E(y*)z(y*)
=y)-E(y*ly*~D)],
v,lnF(B;y)=~R-‘{EC(y*-~)(y*-I*)‘I~(y*)=y] - EC(Y* - P)(Y* - ~)ll~*~Dl}fl-‘.
2.7.
The computational
intractability
of LDV models
The likelihood contribution f(t9; y,) and the score V, lnf(B; y,) are functions of at most M-dimensional integrals over the region D(y) = {ylz(y*) = y} in the domain of the M x 1 latent vector yx. The fundamental source of the computational intractability of classical estimation methods for the general LDV model is the repeated evaluation of such integrals. To illustrate, consider a multinomial probit model with M = 16 alternatives, with K = 20 exogenous variables that vary by alternative. A random sample of N = 1000 observations is available. Suppose the M x M variance-
V.A. Hajicassiliou
2400
and P.A. Ruud
covariance matrix R of the unobserved random utilities has (15 x 16/2) - 1 = 119 free elements (after imposing identification restrictions). Then, the number of parameters to be estimated is p = 139. Suppose the analyst uses an iterative NewtonRaphson type of numerical procedure, employing numerical approximations to the first derivatives based on two-sided first differences and that 20 iterations are required to achieve convergence, which is a realistic number.” Each iteration requires at least 2p evaluations of the likelihood function for approximating the first derivatives. We thus expect that finding the ML estimator will require about 20 x 2p function evaluations. Since the sample consists of N = 1000 individuals, we will have to calculate N x 20 x 2p contributions to the likelihood function, each of which, in general, will be 16-dimensional integrals. Let s be the time in seconds a given computer requires to approximate a 16-dimensional integral by numerical quadrature methods. Our hypothetical ML estimation will thus require about N x 20 x 2p x s seconds. On a typical modern supercomputer (say a Cray 1) one could expect s z 2. Hence, using such a supercomputer, our problem would take about 1000 x 20 x 178 x 2/3600 hours, which is about 4 months of Cray 1 CPU! It is crucial to stress that such numerical quadrature methods offer only poor approximations to integrals of such dimension. I2 The maximum likelihood estimates resulting from 4 months of Cray 1 CPU would be utterly unreliable. The need for alternative estimation methods for these problems is apparent.
3. 3.1.
Simulation
methods
Overview
Two general approaches to exploiting simulation in parametric estimation are to approximate the likelihood function and to approximate such moment functions as the score. The likelihood function can be simulated by Monte Carlo techniques over the latent marginal distribution f(0; y*) in equation (4) for the mixture case, equation (5) for the discrete/continuous case and equation (10) for the truncated case. Alternatively, the score can be approximated either by integrating both numerator and denominator in equation (14) or by integrating over the latent conditional p.d.f. f(0; y* 1y) as in equation (15). Thus, simulation techniques focus on the simulation from these two distributions, f(0; y*) and f(0; y* 1y). The censoring and truncation discussed above for LDV models also appear in simulations and we consider methods for effecting each type of observation rule below. As we will show in Section 4, some simulation estimation methods use censored simulation for the estimation of the main types of LDV models discussed in Section 2 (censored, truncated and
“See Quandt (1986) for a discussion of issues in numerical optimization methods. “Clark (1961) proposed another numerical approximation method for such integrals - see also Daganzo et al. (1977) and Daganzo (1980). The Horowitz et al. (1981) study finds serious shortcomings in the numerical accuracy of the Clark method in typical problems with high .I and unrestricted R.
Ch. 40:
C/assical Estimation
Methods for LDV Models Using Simulation
2401
mixture models), whereas other estimation methods use truncated simulation for the estimation of these models. Simulation of standard normal random variables is an old and well-studied problem. Relatively fast algorithms are widely available for generating such random variables on a computer. Thus, consider the simulation of the latent data generating process. We can always write
where q is a vector of M independent standard normal random variables and I- is a matrix square root of Q, so that R = I-T’. It is convenient to set r to the (lower triangular) Cholesky factor. Clearly, the latent data generating process can be simulated rapidly with simulations of q for any values of p and R. Such simulations can be used in turn to simulate the likelihood and log-likelihood functions and their derivatives with respect to the parameters. As in all of the examples given above, the observation rules common in LDV models imply regions of integration that are rectangles: that is, for some matrix A and vectors b, and b,, possibly with some infinite elements,
(21) where rank(A) d M. These are the problems that we will consider. Since Ay* is also normally distributed, it will often be convenient to simulate Ay* instead of y*. In that case, we simply transform the mean vector and covariance matrix to Ap and ARA’, respectively. Without any loss of generality in this section, we set A = I,, the M x M identity matrix. We denote D = {zeR”I b, < z < b,}. Such regions as (21) have two important analytical properties. First of all, rectangular regions have constant boundaries with respect to the variable of integration, simplifying integration. Secondly, the differentiation in (4) and (5) can be carried out analytically to obtain likelihood functions composed of multivariate normal p.d.f.‘s of the form (3) and multivariate normal c.d.f.‘s of the form
Pr{D;p,fl}
l{y*ED}+(y*-p,n)dy*.
=
(22)
s
Thus, simulation of the likelihood can be restricted to terms in Pr( D; p, l2}. Simulation of the score in (14) involves only the additional terms
V,Pr(D;fl,R}
=0-i s
V,Pr{D;p,R}
I{Y*ED)(Y*
= $Q-l is
-/MY*
- ~fl)dy*,
I{Y*ED)C(Y*--)(Y*-/+-J&~*-@)~~* (23)
V.A. Hujioussiliou
2402
Normalized
by Pr { D; p, Q}, these equations
transform
and P.A. Kuud
to
(24) which are terms in (17). In the remainder of this section, we will discuss the simulation this purpose we denote yTi=
[yi;m=
of (22)-(24). For
l,...,M;m#i],
p-i S E(Y*i)j n_i,~i
~ V(YTi),
R_i,i = cov(y*,,y*) and the conditional
moments
p_i,i(y*) s E(Y*~~YF) These conditional
moments
and
R-,.-iii
have the well-known
The conditional mean and variance defined analogously. We also define y*,,=[y$m=
l,...,i-
E t’(Y’iIYF). formulas
of y* given y? i, denoted
nil _ i and nii, _ i, are
11
and use a similar notation for the marginal and conditional moments of ‘this subvector of random variables. For example, the conditional mean ofy: given y*, i is
3.2.
Censored simulation
We begin by focusing
on the integrals
in (22) and (23) accumulated
E[h(y*, D)] s E 1 {y*ED} [
in the vector
(25) [veci
y*J]'
The elements of h are censored of simulation: direct censoring importance sampling. 3.2.1.
Multivariate
random variables. We consider two basic methods of the multivariate normal random variable and
normal simuhtion
A direct method for simulating Pr{ D; p, Q} and its derivatives is to make repeated Monte Carlo draws for ye, use (20) to calculate y* for each q, and then form an empirical analogue of (25), by working only with the realizations that fall in set D. Let {YIP,...,qR} be R simulated draws from the N(0, ZJ distribution and J, = p + l-q, (r=l,...,R)sothat
is an unbiased simulation of (25). As R gets larger, the sampling variance of h, P(l - P)/R, approaches zero and h coverges strongly to E[h(y*, D)]. The simulation of Pr{ D; p, 0) is simply the observed frequency with which the simulations of y* fall into D. Its derivatives with respect to p and Q are functions of the average simulation of l{y*~D}y* and l{y*~D}y*y*‘. We will call this the crude Monte Curlo (CMC) simulator. Lerman and Manski (1981) conducted the first extensive application of Monte Carlo integration as a numerical technique to the estimation of LDV models using the CMC simulator. The CMC is quick to compute and ideal for computers with a “vectorization facility”.’ 3 However, the CMC also has at least two major drawbacks. First, it is not continuous in parameters. The simulator jumps at parameter values where a J, is on the boundary of D. For example, consider parameter values (pug,r,) chosen so that the mth element of the rth simulation equals its lower bound in D:
where r,, is the mth row of r,. Decreasing the parameter pm from porn will cause the indicator l{jTl~D> to jump from 1 to 0, and this will result in discrete jumps in the elements of h(jj,, D) and h. Such discontinuities make computation of estimators and asymptotic distribution theory awkward.i4 Second, the number of computations required by the CMC rises inversely with Pr{D; p, Q}, which makes it intractable when this probability is small. It should be noted that in principle the accuracy of the CMC can be improved by use of so-called simulation-variance-reduction “Such a mechanism allows simultaneous operation on adjacent elements of a vector using multiple processors. See Hajivassiliou (1993b) which shows that the CMC exhibits the greatest speed gains from vectorization among 13 alternative simulation methods. 14See Quandt (1986) for a discussion ofiterative parameter search algorithms and their requirements for differentiability of the function to be optimized.
V.A. Hajivassiliou
2404
techniques, as, for example, (1984) for definitions. Importance
3.2.2.
the use of control
and antithetic
variates.
and P.A. Ruud
See Hendry
sampling
Importance
sampling is another general method for reducing the sampling variance of integrals computed by Monte Carlo integration over (censoring) intervals. The CMC involves sampling y* from the #(y* - ~1~0) p.d.f. and evaluating the function h(y*, D). A simple generalization of this procedure rewrites EC/r] in terms of another sampling distribution y:
E[h] =
s
h(y*, D)$(y*
- p, Wy*
=
S[ &
,
D)
a -
PL,f4
yw;P, fl,4
16;
P,
Q, 4 d9
1
6 is a vector of parameters characterizing the design of the importance sampler y(.). Note that for h(.) = 1 {LED}, this expression corresponds to Pr{D;p, 01. By drawing a random variable j from the importance p.d.f. y and evaluating the weighted indicator function h@)w(j), where
w(J)
=
m - P, Q) Y(k zGi’
one obtains an alternative unbiased simulation of Pr{ D; p, Ll>.The first advantage offered by importance sampling is the ability to substitute sampling from y for sampling from 4. In some cases, y may be sampled more quickly, or, in a more general setting, sampling from 4 may be impractical. In addition, if y also has an analytical integral over a truncated sampling region C such that D G C, then this analytical integral can be exploited as an approximation to Pr{ D; p, f2} as follows:
Pr{D;p,R}
I(j%D}w(j)~~~dj.
= Pr{jEC} sC
By drawing from the truncated p.d.f. y(J, ,u, 0, fi)/Pr{jEC}, fewer simulations are “wasted” on outcomes of zero and, in effect, Pr{ jEC}w(j) approximates Pr{ D; p, f2). When y is a good approximation to 4, so that the ratio of densities w = 4/y is relatively constant, the sampling variance of the importance-sampling simulator is small. As noted above, the sampling variance of the CMC for a single simulation is P( 1 - P), while the sampling variance of the importance sampler is
W’c.1 {SD}w(J)) = P~%V’(w(Wd’) where
PC=
Pr{$EC}
and
PD = Pr{JED}.
+ (1 In
the
f’,).E(~(j)lj%D)~], extreme
case
that
y = 4,
Ch. 40: Classical Estimation
Methods,for
2405
LDV Models Using Simulation
to 4 V(W(J)IJED) = 0 and E(w(J)ljj~0)’ = P D. Therefore, good approximations afford improvements over the CFC. Geweke (1989) introduces importance sampling in Monte Carlo integration in the context of Bayesian estimation.i5 Dejinition 4.
GHK importance-sampling
The GHK importance
p.d.f. is the “recursively
for jkD where (T,,<,,, = fi,,,,,,,,, Cim
E
him
-
simulator
&nI < m(9
truncated”multivariate
normal
p.d.f.
and i=O,l.
By construction, the support of this p.d.f. is D. Conditional on J<,,,, p,,, is univariate truncated normal on D, with conditional mean determined by p<,. Draws from y can be made recursively according to the formula
where the o are independently simulator is the product
It is an unbiased
simulator
distributed
uniform
random
variablesi
The GHK
of E(h).
The GHK simulator was developed by Geweke (1992), Hajivassiliou and McFadden (1990) and Keane (1990). Experience suggests that the sampling variance of &uk(jj) is very small so that it approximates E(h) well in practice. This approximant has the properties of lying in the unit interval, summing to one over all the
t50ther investigations of the use of Monte Carlo integration in Bayesian analysis are, inter aha, Bauwens (1984), Kloek and van Dijk (1978) van Dijk (1987) and West (1990). 16This method is described extensively in Devroye (1986) and is a simple application of the cumulative probability integral transform result -see Feller (1971). Computationally more efficient methods for generating univariate truncated normal variates exist -for example Geweke (1992). The advantage of the method presented in the preceding equation, however, is that it is continuous in p, 0, and w,, which, as already mentioned, is a desirable property of simulators for asymptotic theory and for iterative parameter search. The method of constructing y in this example can also be extended to a bivariate version using a bivariate normal c.d.f. and standardizing adjacent pairs of elements.
V.A. H~jiaassiliou
2406
and P.A.
Ruud
disjoint rectangular regions surrounding and including D, and being a continuous function of w, p, L2, he, and b,. These properties are discussed in Borsch-Supan and Hajivassiliou (1993). Moreover, Hajivassiliou et al. (1992) found conclusive evidence for the superior root-mean-squared-error performance of the GHK method in an extensive Monte Carlo study comparing the GHK to 12 other simulators for normal rectangle probabilities Pr{ D; p, l2}.
Truncated simulation
3.3.
We now turn to the expectations in (24). These are ratios of the integrals in (25) and cannot be simulated without bias using the censored simulation methods above. Even ignoring the bias, one must ensure that the denominator of the ratio is not zero. For example, the CMC and some importance-sampling simulators can yield outcomes of zero for probabilities and thus violate this requirement.17 In this subsection, we describe two general procedures which draw directly from the truncated distributions associated with the expectations in equation (24). 3.3.1.
Acceptance/rejection methods
Acceptance/rejection (A/R) methods provide a mechanism for drawing from a conditional density when practical exact transformations from uniform or standard normal variates are not available. The following result is standard; see Devroye (1986) Fishman (1973) or Rubinstein (1981) for proofs. Proposition 1 Suppose b(y*) is a J-dimensional density, and one wishes to sample from the conditional density 4(y* ID) = 4(y*)/jD r#~(y*)dy*. Suppose y(J) is a density with a support A from which it is practical to sample, with the property that
supa
+ co,
where D s A. Draw J from y and o from a uniform density on [0,11, repeat this process until a pair satisfying PED and 4(J) 3 occ.r(J) is observed, and accept the associated 9. Then, the accepted points have density 4(.1 D).
“It should be noted that one of the attractive properties of the GHK simulator simulated probability values that are bounded away from 0 and 1, unlike many sampling simulators. See Bgrsch-Supan and Hajivassiliou (1993) for details.
is that it generates other importance-
Ch. 40: Classical
Estimation
2401
Methods for LD V Models Using Simulution
The choice of a suitable comparison density y(‘) is important because it determines the expected “yield” of the acceptance/rejection scheme. The main attractive feature of A/R is that the accepted draws have the correct truncated distribution. The practical shortcoming, though, is that the operations necessary until a specific number of draws are accepted may be very large. The A/R scheme also provides an unbiased simulator of l/Pr{ D; p, f2} if JDy@)dJ = T(D) is practical to compute. The conditional probability of acceptance, given of acceptance is @ED}, is So6(J)dUa = Pr{D)l c(, so that the marginal probability T(D) Pr{D}/cr. The distribution of the number of trials to get an acceptance is the geometric distribution and its expectation is LY/[ZJZI) Pr(D}]. Therefore, if t is the number of draws made until J is accepted, t.T(D)/cc is an unbiased simulator of l/Pr( D}. Example
7
The recursively truncated comparison distribution. a=
fi
normal p.d.f. in Definition 4 works well in practice A bound on the density ratio is given by
as the
{~(b,,-~L,,<,(bl<,),~~,<,)-~(bo,-~,,<,(bo<,),aH,<,)>,
m=l
where the conditional moments Since A = D, T(D) = 1. 3.3.2.
are conditioned
on J<,
equal to the boundaries.
Gibbs resampling
Gibbs resampling is another way to draw from truncated distributions. An infinite number of calculations are required to generate a finite number of draws with distribution approaching the true one. But convergence to the true distribution is geometric in the number of resamplings, hence the performance of this simulator in practice is generally very satisfactory. In addition, this simulator is continuous and differentiable in the parameters j.~ and R. The Gibbs simulator is based on a Markov chain that utilizes computable univariate truncated normal densities to construct transitions, and has the desired truncated multivariate normal as its limiting distribution.’ * This simulator is defined by the following Markovian updating scheme. Proposition
2
Consider the multivariate normal distribution N(p,O) truncated on D, which is assumed to be finite. Define a recursive procedure with steps j = 1,. . . , J in rounds g=l , . . . , G. Let {y*‘jg’} be a sequence on D such that on the jth step of the gth “This simulator can be generalized in principle to non-normal ing univariate distributions are easy to sample.
distributions,
provided
the correspond-
V.A. Hajimwiliou
2408
round,
thejth
element
of y*(jg) is computed
and P.A. Ruud
from J??~~“-~’ by
yj*(jg) = pj, _ j(y *(j&J j - “) + aj, _ j’ @ - l [oj,y _ I aqc; j, dj, _j) -(l
-Oj,g-l)~(c~j’~jl-j)l’
where
and the ojg are independent uniform [0, l] variates and cj, _ j = Gj. Repeat this process for G “Gibbs resampling rounds”. Then the random draws obtained by this simulator have a distribution that converges in L, norm at a geometric rate to the true truncated distribution {y*ly*~D} as the number of Gibbs resampling rounds G grows to infinity. This result is proved in Hajivassiliou and McFadden (1990). It relies on stochastic relaxation techniques as discussed in Geman and Geman (1984). See also Tierny (1992) for other theoretical results on the Gibbs resampling scheme.” We present below Monte Carlo experiments with simulation estimators based on this truncated simulation scheme.
4. 4.1.
Simulation
and estimation of LDV models
Overview
In this section, we bring together the parametric estimation of the LDV models described in Section 2 with the simulation methods in Section 3. Our focus is the consistent estimation of the parameters of the model; we defer the discussion of limiting distributions to a later section. Our exposition follows the general historical trend of thought in this area. We begin with the application of simulation to approximating the log-likelihood function. Next, we consider the simulation of moment functions. Because of the simulation biases that naturally arise in the log-likelihood approach, the unbiased simulation of moment functions and the method of moments are an alternative approach. Finally, we discuss simulation of the score function. Solving the normal equations of ML estimation is a special case of the method of moments and simulating the score function offers the potential for efficient estimation. One can organize a description of the methods along the following lines. Figure 1 gives a diagrammatic presentation of a useful taxonomy. In this figure, the various “The usefulness of Gibbs resampling for Bayesian Chib (1993), and by McCulloch and Rossi (1993).
estimation
has been recognized
by Geweke (1992),
Ch. 40: Classical Estimorion Methods fir LDV Models
UsingSimulation
GMSM
Figure
1.
Taxonomy of simulation estimators
estimation methods are represented as elliptical sets and the properties of the associated simulation methods are represented as rectangular sets. Five families of estimation methods are pictured. All of the methods fall into the class of generalized method ofsimulated moments (GMSM). This is the simulated counterpart to the generalized method of moments (GMM) (see Newey and McFadden (1994)). Within the GMSM fall the method of simulated scores (MSS), the simulated EM (SEM), the method of simulated moments (MSM), and maximum simulated likelihood (MSL). In parallel with the types of LDV models, the simulation methods are divided between censored and truncated sampling. The simulation methods are further separated into those that simulate the efficient score of the LDV models with and without bias. The MSM is a simulated counterpart to the method of moments (MOM). As the figure shows, the MSM is restricted to simulation methods that generate unbiased simulations using censored simulation methods. The MSL estimation method also rests on censored simulation but, as we will explain, the critical object (the loglikelihood function) is simulated with bias. The SEM algorithm is an extension of the EM algorithm using unbiased simulations from truncated distributions; it falls, therefore, in the upper half of the figure. Of these methods, only the MSS has versions that use both classes of simulation methods, censored and truncated, that we have described above. Throughout this section, we will assume that we are working with models for which the maximum likelihood estimator is well-behaved. In particular, we suppose
V.A. Hajimssiliou
2410
und P.A. Ruud
that the usual regularity conditions are met, ensuring that the ML estimator is the most efficient CUAN estimator. We will illustrate the methods using the rank ordered prohit model. This LDV model is a natural candidate for most approaches to estimation with simulation and the exact MLE performs well in small samples. Example
8.
Rank ordered probit
The rank ordered probit model is a generalization of the multinomial probit model described in Example 1. Instead of observing only the most preferred (or highest ranked) alternative, each observation records the rank order of the alternatives from most preferred to least preferred. The rank ordering yields considerably more information about the underlying preference parameters than the simpler, highestranked-alternative response. Hence, consumer survey designers often prefer to ask for complete rankings. We can express the observation rule of rank ordered data algebraically as Y = zij(Y*) = l {Y& = Yj*}f
where the {yz,} correspond * <
Y(l)
’
* <...<
Y(2)
’
’
i,j=
1,..., J
to the order statistics
of y*,
Y;“J,,
so that the first row of y contains indicators of the on until the last row indicates the largest element consistsoftheJ!=Jx(J-1)x... x 2 different J and ones such that only a single entry equals one in J]lYijE{o,1}3CYij=CYij=
I
smallest element of y* and so of y*. The sample space of y x J matrices containing zeros each row and column: ’
j
Thus, even moderate numbers of alternatives correspond to discrete sampling spaces with many outcomes. The c.d.f. of y is not particularly informative; it is simpler to derive the probability of each possible outcome directly. The rank ordering y corresponds to values of y* in a set satisfying J - 1 inequalities:
D(y) = {y*d?-‘ly,.y*
d y2.y* d ... < y,.y*},
where ,vj, is the row vector [yjl,. . . , yjJ]. Such additional inequalities as yr.y* < y3.y* are redundant. As in the multinomial choice model, it is convenient to transform the latent y* into a vector of .I - 1 differences: zy~Ayy*=[yi.y*-y;+l.y*;i=
l,...,
J-
11,
Ch. 40: Classicul
Estimation
Methods fir
LDV
Models
2411
Using Simulatim
where i=
CYij-Yi+l,j;
d,-
l)...) J-
I;j=
l)..) 51
is a (J - 1) x J differencing matrix. According to this definition, The transformed random vector zy is also multivariate normal
D(y) = {.Y*Iz, d 0). and for all YEB,
f(8; Y) = Pr{ y = Y; p, f2} = @(drp, d&Id;). One probability normal orthant
(29)
term in this p.d.f. is equivalent in computational integrals of the choice probabilities in Example
complexity 1.
to the
We will use the various simulation and estimation methods to estimate this rank ordered probit model in Monte Carlo experiments. Because a natural standard of comparison is the MLE, we present first a Monte Carlo experiment for the MLE in a workable case. Example
9
When J = 4, the MLE is computable using standard approximation methods. our basic Monte Carlo experiment the population parameters will be -1 -l/3 p= i
l/3 1
I
and
Q =
1
l/2
0
0
l/2
1
0
0
0
0
1
f/2
: 0
0
l/2
In
.
1 I
These values yield a reasonable amount of variation in y and they induce significant inconsistency in the popular rank ordered logit estimator (Beggs et al. (1981)) when it is applied to the data. The block diagonal 0 contains covariances among the latent y* that are zero in the latent logit model. The ,u and R parameters are not all identifiable and so we normalize by reducing the parameterization to d,~ and drf2dk for Y = I,, the 4 x 4 identity matrix. The first variance in d,R,4; is also scaled to 1. In order to restrict d,Rd; to be positive semi-definite, this covariance matrix is also parameterized in terms of its Cholesky square root. Putting the mean parameters first, then stacking the nonzero elements of the Cholesky parameters, the identifiable population parameter vector is B0 = [ - 0.6667, - 0.6667, - 0.6667, 0.5000, 1.3230, 0.0000, - 0.3780, 0.92581. The basic Monte Carlo experiment will be a random draw from the distribution of each estimator for N = 100 observations on y. There will be 500 replications of each estimator. Results of the experiment for the MLE are in Table 1. The MLE has a small bias relative to its sampling variance and the sampling variance is small enough to make hypothesis tests for equal means or zero covariances quite powerful.
V.A. Hajivassiliou
2412
Sample statistics Population value
Parameter
Standard deviation
Mean
-0.6667
-0.6864
-0.6667 -0.6667 -0.5000
-0.6910 -0.7063 -0.5135 1.3536 -0.0127 ~ 0.408 1 0.9385
1.3230 0.000 -0.3780 0.9258
Table 1 for rank ordered
0.1317 0.2351 0.2263 0.2265 0.3002 0.1797 0.1909 0.246 I
probit
and P.A. Ruud
MLE
Lower quartile
Median
-0.7702 -0.8354 - 0.8276 - 0.6402 1.130 -0.1241 -0.5158 0.7513
- 0.6629 - 0.6648 -0.5016 1.317 -0.008616 -0.3891 0.9140
-0.6807
Upper quartile -0.5921 -0.5231 -0.5374 -0.3645 1.519 0.09545 -0.2765 1.074
It appears that the bias in the MLE is largely caused by asymmetry in the sampling distribution: The medians are closer to the population values than the means. Overall, the asymptotic approximation to the distribution of the MLE is good. The inverse information matrix predicts the standard deviations in the fourth column of Table 1 to be 0.1296, 0.1927, 0.1703, 0.2005, 0.2248, 0.1543, 0.1514, 0.1987. Therefore, the actual sampling distribution has more variation than the asymptotic approximation. For the simulation estimators, we will also conduct Monte Carlo experiments for a model with J = 6 alternatives. In that case, the MLE is not easily computed. We will use the population values --1- 315 - l/5 P(=
_
- 1
l/2
0
0
0
0’
l/2
1
0
0
0
0 l/4
0
0
514
314
I/4
115
0
0
314
514
l/4
l/4
315 l_
0
0
l/4
l/4
514
314
0
l/4
l/4
314
514
and
R=
-0
whichcorrespond to0,=[-0.4000, -0.4000, -0.4000, -0.4000, -0.4000, -0.5000, 1.414, 0.000, -0.3536, 0.9354, 0.000, -0.1768, - 0.6013, 1.052, 0.000, 0.000, O.OOQ, -0.4752, 0.87991 when normalizing on Y = I,.
4.2.
Simulation
of the log-likelihood function
One of the earliest applications of simulation to estimation was the general computation of multivariate integrals in such likelihoods as that of the multinomial probit by Monte Carlo integration. Crude Monte Carlo simulation can approximate the probabilities of the multinomial probit to any desired degree of accuracy, so that
Ch. 40: Classical Estimation Methodsfir
the corresponding the ML estimator. De$nition 5.
LDV
2413
Models Using Simulation
maximum simulated likelihood
(MSL) estimator
can approximate
Maximum simulated likelihood
Let the log-likelihood sample of observations
function for the unknown (y,, n = 1,. . , N) be
parameter
vector
0 given the
and let f”(t); y, o) be an unbiased simulator so that f(O; y) = E,[7(6; y, o)l y] where w is a simulated vector of R random variates. The maximum simulated likelihood estimator is
&,,
E
arg max TN(O), 8
where
for some given simulation
sequence
{w”}.
It is important to note that the MSL estimator is conditional on the sequence of simulators {w”}. For both computational stability and asymptotic distribution theory, it is important that the simulations do not change with the parameter values. See McFadden (1989) and Pakes and Pollard (1989) for an explanation of this point. Example
10
Borsch-Supan and Hajivassiliou (1993) proposed MSL estimation of the multinomial probit model of Example 1 using the GHK simulator for the choice probabilities. In this example, we make similar calculations for the rank ordered probit model of Example 9. Instead of the normal probability function in (29), we used the probability simulator in the first element of h,,, in (28) to compute the simulated log-likelihood function ~,(O).2o For the simu lations of the probability of each observation, we drew a vector of J - 1 = 3 independently distributed uniform random variables for each CO,.For each replication of &,,,, we drew a new data set
“The order of integration orderings. They were chosen integration.
affects this simulator, but we do not attempt to describe our particular purely on the basis of a convenient algorithm for finding the limits of
2414
V.A. Hajiaassiliou
Sample statistics
Parameter
01 $2 0, 0, 05 06 0, 08
Population value
-0.6667 - 0.6667 - 0.6667 ~0.5000 1.3230 0.0000 -0.3780 0.9258
for rank ordered
Table 2 probit MSLE using GHK
Mean
Standard deviation
Lower quartile
-0.7230 - 0.6077 -0.9555 -0.6387 1.2595 0.0131 -0.6715 1.3282
0.1424 0.2162 0.2520 0.1430 0.1741 0.1717 0.2088 0.2211
-0.8219 - 0.7342 - 1.087 - 0.7305 1.134 - 0.09063 -0.7883 1.185
and P.A. Ruud
(J = 4, R = 1).
Median -0.7198 -0.5934 -0.9256 -0.6415 1.237 -0.01013 -0.6639 1.301
Upper quartile - 0.6253 - 0.4640 -0.7860 -0.5379 1.353 0.1285 - 0.5292 1.448
{(Y,>%);n= I,..., N} before maximizing TN(@)over 8. Each 7(&y,, w,) consisted of a single simulation of f(0; y,) (R = 1). The results of this Monte Carlo experiment for J = 4 are in Table 2. In contrast with the MLE, this MSLE exhibits much larger bias. The median is virtually identical to the mean. The sampling variances are also larger, particularly for the covariance parameters. Nevertheless, this MSLE gives a rough approximation to the population parameters. The results of this Monte Carlo experiment for J = 6 are in Table 3. For brevity, only the mean parameters are listed. Once again, substantial biases appear in the sample of estimators. Given our experience with J = 4, it seems likely that these biases are largely due to simulation. We will confirm this below as we apply other methods to this case.
Note that unbiased simulation of the likelihood function is neither necessary nor sufficient for consistent MSL estimation. Because the estimator is a nonlinear function (through optimization) of the simulator, the MSL estimator will generally be a biased simulation of the MLE even when the criterion function of estimation
Sample statistics Population value ~ -
0.4000 0.4000 0.4ooo 0.4000 0.4000
for rank ordered
Table 3 probit MSLE using GHK
(J = 6, R = 1).
Mean
Standard deviation
Lower quartile
Median
Upper quartile
-0.4585 - 0.2489 - 0.5054 -0.4589 -0.6108
0.1504 0.2059 0.1710 0.2013 0.1882
-0.5565 -0.3898 - 0.6056 -0.5779 - 0.6934
-0.4561 - 0.2460 -0.4957 -0.4551 -0.6016
-0.3664 - 0.0940 -0.3891 -0.3216 -0.5042
Ch. 40: Ckwicnl
is simulated
Estimation Methods
without
for LDV Models Using Simulation
2415
bias because
Ll
E[T(e)]= l(O) +
6
E arg max q(3) = arg max r(0).
Note also that while unbiased simulation of the likelihood function is often straightforward, unbiased simulation of the log-likelihood is generally infeasible. The logarithmic transformation of the intractable function introduces a nonlinearity that cannot be overcome simply. However, to obtain an estimator with the same probability limit as the MLE, a sufficient characteristic of a simulator for the log-likelihood is that its sample average converge to the same limit as the sample average log-likelihood. Only by reducing the error of a simulator for the loglikelihood function to zero at a sufficiently rapid rate with sample size can one expect to obtain a consistent estimator. Such results rest on a general proposition that underlies the consistency of many extremum estimators (see Newey and McFadden (1994), Theorem 2.1): Lemma 1 Let (1) (2) (3) (4) (5)
0~ 0, a compact subset of RK, QJO), QN(0) be continuous in 8, f0 = arg max,, eQo(B) be unique, 8, = arg max,,,QN(B) and Q,(e) + Qe(C3)in probability uniformly
in & 0 as N + co.
Then $N 4 Be in probability. We will assume from now on that the log-likelihood function is sufficiently regular to exploit this lemma. In particular, we suppose that the y, are i.i.d., that 8 is identifiable, that f(0;y) is continuous at each 8 in a compact parameter space 0, and that E[sup,,.llnf(8;y)l] < cc. We refer the reader to Newey and McFadden (1994, Theorem 2.5) for further discussion of these conditions and their roles. For LDV models with censoring, the generic likelihood simulator f(0; y,, w,) is the average of R replications of one of the simulation methods described above:
.7eY,, wn)-
i
i I
.W;Y,, o,,).
1
If the model includes truncation, then the likelihood simulation typically involves a ratio of such averages, because a normalizing probability appears in the denominator, although unbiased simulation of the ratio is possible (see Section 3.3). In any case, the simulation error will generally be O,( l/R). Thus, a common approach
V.A. Hajimwiliou
2416
and P.A. Ruud
approximating the log-likelihood function with sufficient accuracy is increasing the number of replications per observation R with the sample size N. This statistical approach is in contrast to a strictly numerical approach of setting R high enough to achieve a specified numerical accuracy independent of sample size. to
Example
11
For illustration, let us increase the replications in the previous examples from R = 1 simulation per observation to 5. The summary statistics are listed in Tables 4 and 5. In both cases, J = 4 and J = 6, the biases are significantly reduced. See BSrschSupan and Hajivassiliou (1993) for a more extensive Monte Carlo study of the relationship between R and bias in the multinomial probit model. In the rank ordered probit model and similar discrete LDV models, all that is necessary for estimator consistency is that R + CE as N -+ co. No relative rates are required provided that the likelihood is sufficiently regular. Nor must the simulations o satisfy any restrictions on dependence across observations. The following proposition, taken from Lee (1992), establishes this situation. Proposition
3
Let f(& y) be uniformly bounded away from zero for all 8~0, a compact set, and all WEB, the sample space of y. Assume that the set of regularity conditions in the
Sample statistics Population value -0.6667 - 0.6667 - 0.6667 -0.5OQo I .3230 0.0000 - 0.3780 0.9258
Sample statistics Population value - 0.4000 - 0.4ooo -0.4000 - 0.4OcNI -0.4000
for rank ordered
Table 4 probit MSLE using GHK
(J = 4, R = 5).
Mean
Standard deviation
Lower quartile
Median
quartile
~ 0.6795 -0.6528 -0.8327 -0.5771 1.3582 -0.0121 - 0.5034 1.1334
0.1366 0.2267 0.2299 0.2159 0.2459 0.2089 0.2016 0.2454
-0.7726 -0.7913 -0.9686 -0.7076 1.1863 -0.1380 -0.6256 0.9505
-0.6774 - 0.6268 -0.8085 -0.5641 1.3184 -0.01570 -0.4875 1.1142
-0.5840 -0.5029 -0.6768 -0.4412 1.5036 0.1275 -0.3753 1.2814
for rank ordered
Table 5 probit MSLE using GHK
Upper
(J = 6, R = 5).
Mean
Standard deviation
Lower quartile
Median
Upper quartile
- 0.4088 -0.3059 -0.4554 - 0.4288 -0.5046
0.1256 0.1776 0.1387 0.1661 0.1773
- 0.4893 - 0.4200 -0.5373 -0.5369 -0.6211
-0.4053 - 0.2966 -0.4553 -0.4219 - 0.4976
-0.3227 -0.1846 -0.3615 -0.3142 -0.3872
Ch. 40: Classical
Estimation
Methods
f;v LDV
Models
2417
Usiny Simulation
paragraph after Lemma 1 hold. Let {ω_nr} be an i.i.d. sequence over the index r. The MSL estimator θ̂_MSL = arg max_θ (1/N) Σ_{n=1}^N ln f̃(θ; y_n, ω_n) is consistent if R → ∞ as N → ∞.

Proof

By a uniform law of large numbers and the lower bound on f, f̃ →^p f as R → ∞, so that ln f̃ →^p ln f. Since our regularity assumptions in the paragraph after Lemma 1 guarantee that

sup_θ |(1/N) l_N(θ) − E[ln f(θ; y)]| →^p 0   as N → ∞,

then l̃_N(θ)/N also converges uniformly to E[ln f(θ; y)], and consistency follows by Lemma 1. Q.E.D.
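To make the MSL recipe concrete, here is a minimal sketch, assuming a hypothetical random-effects probit in which the choice probability is an intractable expectation over unobserved heterogeneity; the probability is replaced by a smooth unbiased simulator averaged over R draws held fixed across optimizer iterations. All names and tuning values (N, R, beta0, sigma0) are our illustrative assumptions, not taken from the original text.

```python
# Maximum simulated likelihood (MSL) for a hypothetical random-effects
# probit y = 1{x*beta + sigma*eta + eps > 0}, eta, eps ~ N(0,1).  The choice
# probability E_eta[Phi(x*beta + sigma*eta)] is replaced by a smooth unbiased
# simulator averaging over R draws of eta held fixed across iterations.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, R = 500, 20                        # the theory wants R to grow with N
beta0, sigma0 = 1.0, 0.8              # true values (illustrative)

x = rng.normal(size=N)
y = (x * beta0 + sigma0 * rng.normal(size=N) + rng.normal(size=N) > 0).astype(float)
omega = rng.normal(size=(N, R))       # simulations, drawn once and held fixed

def neg_sim_loglik(theta):
    beta, sigma = theta
    # smooth, unbiased simulator of the choice probability
    p = norm.cdf(x[:, None] * beta + sigma * omega).mean(axis=1)
    p = np.clip(p, 1e-10, 1 - 1e-10)  # keep the likelihood away from zero
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(neg_sim_loglik, x0=[0.5, 0.5], method="Nelder-Mead")
print("MSL estimates (beta, sigma):", fit.x)
```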
Thus, the property of estimator consistency makes modest demands on the simulations of the likelihood function. Strictly speaking, one could employ a common sequence of simulations {ω_r} for all simulated likelihoods, which grows at an arbitrarily slow rate with sample size. The differences between simulation designs appear only in the limiting normal distributions of the estimators. It is especially important to note that consistency does not confine such differences to sampling variances. Both the expectations and the variances of the approximate limiting distribution can be affected by the simulation design. Note that Proposition 3 does not apply to models with elements of y which are continuously distributed and unbounded. Additional work is needed in this area. See Hajivassiliou and McFadden (1990) for the special conditions needed for an example of a multiperiod (panel) autocorrelated tobit model.

From the standpoint of asymptotic distribution theory, the simplest use of simulation makes independent simulations for the contribution of each observation to the likelihood function. If elements of the sequence {ω_nr} are independent across the observation index n, as well as the replication index r, then we preserve the independence of the f̃(θ; y_n, ω_n) and its derivatives across n, permitting the application of familiar laws of large numbers and central limit theorems. When f̃ is
differentiable in θ, we can make a familiar linear approximation for θ̂_MSL:

0 = (1/√N) ∇_θ l̃(θ_0) + [(1/N) ∇²_θ l̃(θ̄)] √N(θ̂_MSL − θ_0),   (30)
where the elements of θ̄ lie on the line segment between θ̂_MSL and θ_0. The consistency of θ̂_MSL implies the consistency of θ̄, which in turn implies that
(1/N) ∇²_θ l̃(θ̄) →^p E[∇²_θ ln f(θ_0; y)],   (31)

using the argument that supports Proposition 3. The leading term is a sum of N i.i.d. terms
to which we would like to apply a central limit theorem. But we are prevented from this by the fact that the expectation of these terms is not zero. Consider the simple factorization, obtained by adding and subtracting terms,
(1/√N) ∇_θ l̃(θ_0) = (1/√N) ∇_θ l(θ_0) + A_N + B_N,   (32)

where
A_N is a sum of i.i.d. terms with zero expectation and can be viewed as the source of pure simulation noise in θ̂_MSL. B_N is the potential source of simulation bias. The next result can be used to show that R/√N → ∞ is a sufficient rate of increase to avoid such bias.

Proposition 4
Let μ̃(θ; y, ω) be an unbiased simulator for μ(θ; y) such that V(μ̃ − μ | y) = O(1/R). Let s(θ; y, μ) be a moment function such that E[s(θ_0; y, μ)] = 0. Consider the simulator s̃(θ; y) = s(θ; y, μ̃) and let R/√N → ∞. If s̃ is Lipschitz in μ uniformly in θ, then the simulation bias B_N →^p 0.
Proof

If s̃ is Lipschitz in μ uniformly in θ, then

s̃ − s = [∇_μ s(θ; y, μ)](μ̃ − μ) + [∇_μ s(θ; y, μ*) − ∇_μ s(θ; y, μ)](μ̃ − μ),

where μ* is on the line segment joining μ̃ and μ. According to the hypothesis of unbiasedness,

E_ω(s̃ − s) = E_ω{[∇_μ s(θ; y, μ*) − ∇_μ s(θ; y, μ)](μ̃ − μ)},

so that

‖E_ω(s̃ − s)‖ ≤ M* E[(μ̃ − μ)²] = O(1/R)

for some finite M* according to the Lipschitz hypothesis. Therefore, B_N = O_p(√N/R) and the result follows. Q.E.D.
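The O(1/R) order of the log bias can be checked numerically. The following sketch, with purely hypothetical values, draws an unbiased simulator p̂ of a probability p bounded away from zero and compares the simulated log bias E[ln p̂] − ln p with the second-order approximation −V(p̂)/(2p²), which shrinks like 1/R.

```python
# Numerical check of the O(1/R) log bias: p_hat is an unbiased simulator of
# p (bounded away from zero), and E[ln p_hat] - ln p ~ -V(p_hat)/(2 p^2).
import numpy as np

rng = np.random.default_rng(1)
p, reps = 0.3, 50_000
for R in (1, 5, 25, 125):
    # unbiased simulator: mean of R draws from Uniform(p - 0.2, p + 0.2),
    # which is strictly positive by construction
    draws = rng.uniform(p - 0.2, p + 0.2, size=(reps, R))
    p_hat = draws.mean(axis=1)
    bias = np.log(p_hat).mean() - np.log(p)
    approx = -draws.var() / (R * 2 * p ** 2)
    print(f"R={R:4d}  simulated log bias={bias:+.5f}  -V/(2p^2)={approx:+.5f}")
```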
In the multinomial and rank ordered probit cases, the Lipschitz requirement is generally met by the regularity conditions that bound the discrete probabilities and the smoothness of the probability simulator f̃: μ = (f, ∇_θ f), μ̃ = (f̃, ∇_θ f̃), and s = (∇_θ f)/f. We are not aware of any slower rates for R that avoid bias in the limiting distribution of θ̂_MSL.

Proposition 5
Let f be bounded uniformly away from zero and Lipschitz in θ on a compact space Θ. Let f̃(θ; y, ω) be an unbiased differentiable simulator for f(θ; y), also bounded uniformly away from zero and Lipschitz in θ on Θ, such that V(f̃ − f) = O(1/R). Let R/√N → ∞. Then the simulation components satisfy

(1/√N) Σ_{n=1}^N {∇_θ ln f̃(θ; y_n, ω_n) − ∇_θ ln f(θ; y_n)} →^p 0

and θ̂_MSL is asymptotically efficient.
Proof

The difference between simulated and exact scores can be written componentwise. By the Chebyshev inequality,

Pr{ |(1/√N) Σ_{n=1}^N [∇_θ ln f̃ − ∇_θ ln f]| > ε } ≤ ε⁻² V[(1/√N) Σ_{n=1}^N (∇_θ ln f̃ − ∇_θ ln f)] = O(√N/R)

for each component of the gradient. The result follows from this order and equations (30)-(33). Q.E.D.
Propositions 4 and 5 demonstrate that bias is the fundamental hurdle that MSL must overcome. The logarithmic transformation of the likelihood function forces one to increase R with the sample size to obtain a consistent estimator. Given enough simulations to overcome bias, there are enough simulations to make the asymptotic contribution of simulation to the limiting distribution of θ̂_MSL negligible.

There is a simulation design that uses the same total number (N × R) of simulations of ω as the independent design, but applies every simulation of ω to every observation of y. That is, the simulated log-likelihood function is generated according to the double sum

l̃(θ) = Σ_{n=1}^N ln [ (1/(NR)) Σ_{m=1}^{NR} f̂(θ; y_n, ω_m) ].
The motivation for this approach is to take advantage of all N × R simulations that must be drawn when R independent simulations are made for each observation. Lee (1992) finds that efficiency requires only that R → ∞ as N → ∞ with this design. This approach appears to gain efficiency without any additional computational cost. However, one simulates each contribution to the likelihood N × R times rather than merely R times, substantially increasing the cost of evaluating the average simulated log-likelihood function. The computational savings gained by pooling simulations in this manner are generally overcome by the added computational cost of calculating O(N²) likelihood simulations instead of O(N^{3/2}), especially when N is large.

We close our discussion of simulated likelihood functions by noting that the method of simulated pseudo-maximum likelihood (SPML) of Laroque and Salanie (1989) is another early simulation estimation approach for LDV models. This
method, originally developed for the mixture models of Section 2.4 in the case of the analysis of markets in disequilibrium, uses simulation to overcome the high-dimensional integration difficulties that arise in calculating the moments of such models.

Definition 6. Simulated pseudo-maximum likelihood

Let the observation rule y = τ(y*) yield a mixture model with the first two moments g₁(x_n, θ) = E(y | x_n, θ) and g₂(x_n, θ) = E[(y − Ey)² | x_n, θ]. Consider simulating functions g̃_j(x_n, θ, ω, R), j = 1, 2, based on auxiliary simulation sequences {ω}, such that g̃_j(x_n, θ, ω, R) converge almost surely to g_j(x_n, θ) as R → ∞, j = 1, 2. The simulated pseudo-maximum likelihood estimator θ̂_SPML is defined by:

θ̂_SPML ≡ arg min_θ Σ_{n=1}^N ψ(y_n; g̃₁(x_n, θ, ω, R), g̃₂(x_n, θ, ω, R)),

where ψ(·) = ½{[(y_n − g̃₁(·))²/g̃₂(·)] + ln g̃₂(·)} corresponds to the log-likelihood contribution assuming y_n ~ N(g₁(·), g₂(·)).
Laroque and Salanie (1989) prove that for x_n ∈ X ⊂ ℝ, θ ∈ Θ compact, and g_j(·) sufficiently continuous on X × Θ, θ̂_SPML →^p θ̂_PML as R → ∞.²¹ It should be noted that for particular choices of a pseudo-likelihood function ψ(·), the SPML estimator can be shown to be consistent for a finite number of simulations R, because it then satisfies the basic linearity property of the MSM approach. Such a choice could be ψ(·) = (y_n − g̃₁(·))², which corresponds to the assumption that y_n ~ N(g₁(·), 1).
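As an illustration of Definition 6, the following sketch applies SPML to a hypothetical one-parameter model y = exp(θx + u), u ~ N(0, 1), whose conditional mean and variance are treated as intractable and replaced by simulated counterparts based on draws held fixed in θ. The model and all constants are our assumptions for the demonstration, not from the original text.

```python
# Simulated pseudo-maximum likelihood (SPML) for a hypothetical model
# y = exp(theta*x + u), u ~ N(0,1): the conditional mean g1 and variance g2
# are replaced by simulated counterparts using draws held fixed in theta.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
N, R, theta0 = 400, 50, 0.7
x = rng.normal(size=N)
y = np.exp(theta0 * x + rng.normal(size=N))
u = rng.normal(size=(N, R))                # common random numbers

def spml_objective(theta):
    ysim = np.exp(theta * x[:, None] + u)  # R simulated outcomes per obs.
    g1 = ysim.mean(axis=1)                 # simulated conditional mean
    g2 = ysim.var(axis=1) + 1e-8           # simulated conditional variance
    # psi(.) from Definition 6: Gaussian pseudo-log-likelihood, up to sign
    return 0.5 * np.sum((y - g1) ** 2 / g2 + np.log(g2))

fit = minimize_scalar(spml_objective, bounds=(0.0, 2.0), method="bounded")
print("SPML estimate of theta:", fit.x)
```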
4.3. Simulation of moment functions
The simulation of the log-likelihood is an appealing approach to applying simulation to estimation, but this approach must overcome the inherent simulation bias that forces one to increase R with the sample size. Instead of simulating the log-likelihood function, one can simulate moment functions. When they are linear in the simulations, moment functions can be simulated easily without bias. The direct consequence is that the simulation bias in the limiting distribution of an estimator is also zero, making it unnecessary to increase the number of simulations per observation with sample size. This was a key insight of McFadden (1989) and Pakes and Pollard (1989).
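The linearity insight is easy to verify numerically: a residual built from an unbiased simulator has expectation zero at the truth for any fixed R, in contrast to the logarithmic case above. A tiny sketch with hypothetical values:

```python
# Linearity makes simulation bias vanish in moments: the residual
# y - mu_tilde has mean zero at the truth for ANY fixed R (here R = 1).
import numpy as np

rng = np.random.default_rng(3)
mu, N, R = 1.5, 100_000, 1
y = mu + rng.normal(size=N)                            # data with E[y] = mu
mu_tilde = mu + rng.normal(size=(N, R)).mean(axis=1)   # unbiased simulator
print("mean simulated residual:", (y - mu_tilde).mean())  # ~ 0 for any R
```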
²¹Pseudo-maximum likelihood estimation methods, which are special types of the classical minimum distance (CMD) approach, are developed in Gourieroux et al. (1984a) and Gourieroux et al. (1984b). See Newey and McFadden (1994) for a discussion of CMD and the closely related generalized method of moments (GMM).
Method of moments (MOM) estimators have a simple structure. Such estimators are generally constructed from "residuals" that are differences between observed random variables y and their conditional expectations. These expectations are known functions of the conditioning variables x and the unknown parameter vector θ to be estimated. Let E(y | x, θ) = μ(θ; x). Moment equations are built up by multiplying the residuals by various weights or instrumental variable functions and specifying the estimator as the parameter values which equate the sample average of these products with zero. The MOM estimator θ̂_MOM is defined by
(1/N) Σ_{n=1}^N w_n(x_n, θ̂_MOM)[y_n − μ(θ̂_MOM; x_n)] = 0.   (35)
The consistency of such estimators rests on the uniform convergence of the sample averages to their population counterparts for any value of θ as the sample size approaches infinity. When the unique root of the population equations is θ_0, the population value of θ, the root of the sample equations converges to θ_0. The limiting distribution of θ̂_MOM is derived from the linear expansion

0 = (1/√N) Σ_{n=1}^N w_n(θ_0) e_n(θ_0) + [(1/N) Σ_{n=1}^N (w_n(θ̄) ∇_θ e_n(θ̄) + e_n(θ̄) ∇_θ w_n(θ̄))] √N(θ̂_MOM − θ_0),
where we have denoted the residual by e_n(θ) = y_n − E(y_n | x_n, θ) and θ̄ lies between θ̂_MOM and θ_0. Because E[e_n(θ_0)] = 0, the leading term will generally converge to a limiting normal random variable with zero expectation, implying no asymptotic bias in θ̂_MOM:

(1/√N) Σ_{n=1}^N w_n(θ_0) e_n(θ_0) →^d N(0, Σ),

where

Σ = lim_{N→∞} V[(1/√N) Σ_{n=1}^N w_n(θ_0) e_n(θ_0)].

One of the matrices in the second term converges to zero:

(1/N) Σ_{n=1}^N e_n(θ̄) ∇_θ w_n(θ̄) →^p 0.
This fact is often exploited by replacing the weights w in (35) with consistent estimates that do not change the limiting distribution of θ̂_MOM. Thus under regularity
conditions,
√N(θ̂_MOM − θ_0) →^d N(0, H⁻¹ Σ H′⁻¹),

where

(1/N) Σ_{n=1}^N w_n(θ̄) ∇_θ e_n(θ̄) →^p H.
Simulation has an affinity with the MOM. Substituting an unbiased, finite-variance simulator for the conditional expectation μ(θ; x_n) does not alter the essential convergence properties of these sample moment equations. We therefore consider the class of estimators generated by the method of simulated moments (MSM).
Definition 7. Method of simulated moments

Let μ̃(θ; x, ω) = (1/R) Σ_{r=1}^R μ̂(θ; x, ω_r) be an unbiased simulator so that μ(θ; x) = E[μ̃(θ; x, ω) | x], where ω is a simulated random variable. The method of simulated moments estimator is

θ̂_MSM ≡ arg min_θ ‖s̃_N(θ)‖,

where

s̃_N(θ) ≡ (1/N) Σ_{n=1}^N w_n(θ)[y_n − μ̃(θ; x_n, ω_n)]   (36)

for some sequence {ω_n}.
Defining the MSM estimator as a minimizer rather than the root of the simulated moment equation s̃_N(θ) = 0 is an important part of making the MSM operational. Newey and McFadden (1994), Sections 1 and 2.2.3, discuss the general difficulties that MOM poses for the construction of consistent estimators. Whereas the structure of ML provides a direct link between parameter identification and estimator consistency, MOM does not. It is often difficult to guarantee that a system of nonlinear equations has a unique solution. MSM inherits these difficulties. Also, the addition of simulation in MSM may introduce problems that were not present in the original MOM formulation. For example, simulated moment equations may not have solutions at all in small samples, leading one to question the reliability of asymptotic approximations. This property may be the greatest practical drawback of this method of estimation using simulations, although it does not greatly affect the asymptotic distribution theory extended from the MOM case.
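The following sketch, under hypothetical values, implements the MSM of Definition 7 for a one-parameter probit, replacing the choice probability with a crude frequency simulator using a single draw per observation (R = 1) held fixed across parameter values, and minimizing the squared simulated moment over a grid, a crude but robust way to handle the simulator's discontinuity in θ.

```python
# Method of simulated moments (MSM) for a hypothetical one-parameter probit
# y = 1{theta*x + eps > 0}.  Pr{y=1|x} is replaced by a crude frequency
# simulator with R = 1 draw per observation, fixed across theta; the
# estimator minimizes the squared simulated moment, as in Definition 7.
import numpy as np

rng = np.random.default_rng(4)
N, theta0 = 5000, 0.8
x = rng.normal(size=N)
y = (theta0 * x + rng.normal(size=N) > 0).astype(float)
eta = rng.normal(size=N)                  # one simulation per observation

def sim_moment(theta):
    mu_tilde = (theta * x + eta > 0).astype(float)   # unbiased simulator
    return np.mean(x * (y - mu_tilde))               # instrument w = x

# grid search copes with the simulator's discontinuity in theta
grid = np.linspace(0.0, 2.0, 401)
theta_hat = grid[np.argmin([sim_moment(t) ** 2 for t in grid])]
print("MSM estimate with R = 1:", theta_hat)
```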
Table 6
Sample statistics for rank ordered probit CMD (J = 4).

Parameter   Population value   Mean      Standard deviation   Lower quartile   Median    Upper quartile
θ1          -0.6667            -0.6906   0.1481               -0.1192          -0.6918   -0.5948
θ2          -0.6667            -0.7887   0.3714               -0.9496          -0.7109   -0.5431
θ3          -0.6667            -0.6953   0.2223               -0.8347          -0.6594   -0.5366
θ4          -0.5000            -0.6683   0.4271               -0.8962          -0.5688   -0.3679
θ5           1.3230             1.4143   0.4384                1.118            1.337     1.633
θ6           0.0000             0.1764   0.5053               -0.1957           0.08331   0.5563
θ7          -0.3780            -0.3077   0.2765               -0.4703          -0.3207   -0.1747
θ8           0.9258             0.7714   0.3356                0.5955           0.7980    0.9834

Example 12
To construct an MSM estimator for the rank ordered probit model, we construct a set of moment equations corresponding to the elements of y:
(1/N) Σ_{n=1}^N y_{ijn} − Pr{y_{ij} = 1; μ, θ} = 0,   i, j = 1, ..., J − 1.
Not all J² elements of y are needed because these elements have a singular distribution. As the sampling space of y makes clear, we can focus our attention on the first J − 1 rows and columns of y. Because we obtain more moment equations than parameters, we combine the moments of y according to the method of classical minimum distance (CMD), using the inverse of the sample covariance of the elements of y as the normalizing matrix. Note, however, that one could use more moments to increase the efficiency of the estimator. For example, the cross-products y_{ij} y_{kl} (i ≠ k, j ≠ l) contain additional sample information about the population parameters.

The CMD estimation results are described in Table 6 for J = 4 ranked alternatives. This classical estimator is much less efficient than the MLE. In addition, it exhibits large bias and skewness in the sampling distribution. The summary statistics for the MSM version of the CMD estimator are listed in Table 7. There was R = 1 simulation of the GHK probability simulator for each observation and each probability. As expected, the sampling variance is larger for the MSM estimator than for the CMD estimator. In addition, the bias and skewness in the CMD estimator for the mean parameters seem to be aggravated by the simulation in the MSM estimator.

We do not present analogous results for J = 6 alternatives because the MSM estimator is not practical in this case. With 720 elements in the sampling space, the
Table 7
Sample statistics for rank ordered probit CMD, MSM version (J = 4, R = 1).

Parameter   Population value   Mean      Standard deviation   Lower quartile   Median    Upper quartile
θ1          -0.6667            -0.6916   0.1905               -0.7915          -0.6809   -0.5798
θ2          -0.6667            -0.9790   0.9576               -1.099           -0.7619   -0.5654
θ3          -0.6667            -0.8561   0.6394               -1.008           -0.6900   -0.4813
θ4          -0.5000            -0.7083   0.6392               -0.8918          -0.5559   -0.3327
θ5           1.3230             1.4733   0.8402                1.086            1.323     1.662
θ6           0.0000             0.0780   0.6268               -0.3091           0.03749   0.4616
θ7          -0.3780            -0.3828   0.5423               -0.5758          -0.3099   -0.1110
θ8           0.9258             0.8560   0.6857                0.5023           0.7341    0.9745
amount of simulation becomes prohibitive. This illustrates another important drawback of this method: the MSM works best for sample spaces with a small number of elements.

The analogies between MSM and MOM are direct and, as a result, the asymptotic analysis is generally simpler than for MSL. The first difference with MSL appears in the requirements on the simulation design for estimator consistency. Whereas MSL requires that R → ∞ regardless of whether simulations are independent across observations, MSM yields consistent estimators with fixed R provided that the simulations vary enough to make a law of large numbers work. Because the simulated moments are linear in the simulations, one has the option of applying the law of large numbers to large numbers of observations alone, or in combination with large numbers of simulations.

Proposition 6
Let μ̂(θ; x, ω) be an unbiased, finite-variance simulator for μ(θ; x) and let either

(1) {ω_nr; n = 1, ..., N, r = 1, ..., R} be i.i.d. random variables for fixed R, or
(2) {ω_r; r = 1, ..., N} be an i.i.d. sequence for R = N, with ω_n = ω, n = 1, ..., N.

Then θ̂_MSM →^p θ_0 under the regularity conditions: (1) s_N(θ) ≡ (1/N) Σ_{n=1}^N w_n(θ)[y_n − μ(θ; x_n)] is continuous in θ; (2) s_N(θ) →^p s(θ) = plim (1/N) Σ_{n=1}^N w_n(θ)[μ(θ_0; x_n) − μ(θ; x_n)] uniformly in θ ∈ Θ, a compact parameter space; (3) s(θ) is continuous in θ and equals zero only at θ_0.

Proof
The average difference between the classical moment functions and their simulated counterparts is
s_N(θ) − s̃_N(θ) = (1/N) Σ_{n=1}^N w_n(θ)[μ̃(θ; x_n, ω_n) − μ(θ; x_n)]   (37)

= (1/(NR)) Σ_{n=1}^N Σ_{r=1}^R w_n(θ)[μ̂(θ; x_n, ω_{nr}) − μ(θ; x_n)],   (38)
where s_N(θ) ≡ (1/N) Σ_{n=1}^N w_n(θ)[y_n − μ(θ; x_n)]. Under design 1, the {μ̃_n − μ_n} are an i.n.i.d. sequence, so that a uniform law of large numbers applied to (37) implies s_N(θ) − s̃_N(θ) →^p 0 as N → ∞. Under design 2, s_N(θ) − s̃_N(θ) is written in (38) as a U-statistic, and a uniform law of large numbers for U-statistics (Lee (1992)) implies s_N(θ) − s̃_N(θ) →^p 0 as N → ∞. Therefore, in either case, by continuity, ‖s̃_N(θ) − s_N(θ)‖ →^p 0 uniformly in θ, and Lemma 1 implies the result. Q.E.D.

The opportunity to fix R for all sample sizes offers significant computational savings that are a key motivation for interest in the MSM. As we shall see below, the benefits of the dependent design are generally modest. Thus, while the theoretical applicability of U-statistics to MSM is interesting in itself, we will not consider it further in this section.²²

We continue with the analogy between the MOM and the MSM. Note first of all that an analogous linear expansion for θ̂_MSM exists:
0 = (1/√N) Σ_{n=1}^N w_n(θ_0) ẽ_n(θ_0) + [(1/N) Σ_{n=1}^N (w_n(θ̄) ∇_θ ẽ_n(θ̄) + ẽ_n(θ̄) ∇_θ w_n(θ̄))] √N(θ̂_MSM − θ_0),
where we have denoted the simulated residual by ẽ_n(θ) = y_n − μ̃(θ; x_n) and θ̄ lies between θ̂_MSM and θ_0. Because E[ẽ_n(θ_0)] = 0, the leading term will generally converge to a limiting normal random variable with zero expectation, implying no asymptotic bias in θ̂_MSM:

(1/√N) Σ_{n=1}^N w_n(θ_0) ẽ_n(θ_0) →^d N(0, Σ_MSM),

where

Σ_MSM = lim_{N→∞} V[(1/√N) Σ_{n=1}^N w_n(θ_0) ẽ_n(θ_0)].

Also, as before,

(1/N) Σ_{n=1}^N ẽ_n(θ̄) ∇_θ w_n(θ̄) →^p 0,

²²See Lee (1992).
so that under regularity conditions,

√N(θ̂_MSM − θ_0) →^d N(0, H⁻¹ Σ_MSM H′⁻¹),

where

(1/N) Σ_{n=1}^N w_n(θ̄) ∇_θ ẽ_n(θ̄) →^p H.
The equivalence of the H matrices also rests on the unbiased simulation of μ: if μ(θ; x) = E[μ̃(θ; x, ω) | x], then ∇_θ μ(θ; x) = ∇_θ E[μ̃(θ; x, ω) | x] = E[∇_θ μ̃(θ; x, ω) | x] for the smooth simulators described in Section 3. While the first moment of the MSM estimator does not depend on R, the limiting covariance matrix, and hence relative efficiency, does. Simulation noise introduces a generic difference between the covariance matrices of θ̂_MOM and θ̂_MSM. Intuition suggests, and theory confirms, that the larger R is, the more efficient the MSM estimator will be, as the simulation noise is diminished. The extra variation in θ̂_MSM is contained in the object (37). This term is generated conditional on the realizations of y and is, by definition, distributed independently of the classical moment function. Inflating the simulation noise by √N and evaluating it at θ_0, we can apply a central limit theorem to it to obtain the following result.

Proposition 7

Σ_MSM = Σ_MOM + (1/R) Σ_ω,

where

Σ_ω = lim_{N→∞} V{(1/√N) Σ_{n=1}^N w_n(θ_0)[μ̂(θ_0; x_n, ω_n) − μ(θ_0; x_n)]}.
If it were not for the simulation noise, the MSM estimator would be as efficient as its MOM counterpart. McFadden (1989) noted that in the special case where μ̃ is obtained by averaging simulations of the data generating process itself, Σ_ω = Σ_MOM and Σ_MSM = (1 + 1/R) Σ_MOM. In this case, the inefficiency of simulation is easy to measure, and one observes that 10 replications are sufficient to reduce the inefficiency to 10% compared to classical MOM. The proposition suggests that full efficiency would be obtained if we simply increased R without bound as N grows. That intuition is formalized in the next proposition, which is analogous to Proposition 5 (see McFadden and Ruud (1992)).

Proposition 8
If R = O(N^α), α > 0, then √N(θ̂_MOM − θ̂_MSM) →^p 0.
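Proposition 7's special case, Σ_MSM = (1 + 1/R)Σ_MOM when the simulator replicates the data generating process, can be checked with a small Monte Carlo. The toy location model and all values below are hypothetical:

```python
# Monte Carlo check of Var(theta_MSM) = (1 + 1/R) Var(theta_MOM) in the
# special case where the simulator replicates the data generating process.
# Toy location model: y ~ N(theta0, 1); simulator mu_tilde = theta + nu_bar,
# with nu_bar an average of R standard normal draws.
import numpy as np

rng = np.random.default_rng(5)
N, R, reps, theta0 = 200, 4, 5000, 1.0
y = theta0 + rng.normal(size=(reps, N))
nu = rng.normal(size=(reps, N, R)).mean(axis=2)    # simulation noise, var 1/R

theta_mom = y.mean(axis=1)            # classical MOM: solves sum(y - theta) = 0
theta_msm = (y - nu).mean(axis=1)     # MSM: solves sum(y - theta - nu_bar) = 0
print("variance ratio MSM/MOM:", theta_msm.var() / theta_mom.var(),
      "  theory: 1 + 1/R =", 1 + 1 / R)
```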
For any given residual and instrumental variables, there generally exist optimal weights among MOM estimators, and the same holds for MSM as well. In what is essentially an asymptotic counterpart to the Gauss-Markov theorem, if H = Σ_MSM then the MSM estimator is optimal (Hansen (1982)). To construct an MSM estimator that satisfies this restriction, one normalizes the simulated residual by its variance and makes the instrumental variables the partial derivatives of the conditional expectation of the simulated moment with respect to the unknown parameters:

w_n(θ) = [∇_θ μ(θ; x_n)]′ V(y_n | x_n; θ)⁻¹.
One can approximate these functions using simulations that are independent of the moment simulations with R fixed, but efficiency will require increasing R with sample size. If μ̃ is differentiable in θ, then independent simulations of ∇_θ μ̃ are unbiased simulators of the instruments. Otherwise, discrete numerical derivatives can be employed. The covariance matrix can be estimated using the sample variance of μ̂ and the simulated variance of y. Inefficiency in simulated instruments constructed in this way has two sources: the simulation noise and the bias in the inverse of an estimated variance. Both sources disappear asymptotically if R approaches infinity with N. While it is critical that the simulations of w be independent of the simulations of μ̃, there is no obvious advantage to simulating the individual components of w independently. In some cases, for example simulating a ratio, it appears that independent simulation may be inferior.²³
4.4. Simulation of the score function
Interest in the efficiency of estimators naturally leads to attempts to construct an efficient MSM estimator. The obvious way to do this is to simulate the score function as a set of simulated moment equations. Within the LDV framework, however, unbiased simulation of the score with a finite number of operations is not possible with simple censored simulators. The efficient weights are nonlinear functions of the objects that require simulation. Nevertheless, it may be possible with the aid of simulation to construct good approximations that offer improvements in efficiency over simpler MSM estimators.

There is an alternative approach based on truncated simulation. We showed in Section 2 that every score function can be expressed as the expectation of the score of a latent data generating process taken conditional on the observed data. In the particular case of normal LDV models, this conditional expectation is taken over a truncated multivariate normal distribution, and the latent score is the score of an untruncated multivariate normal distribution. Simulations from the truncated normal distribution can replace the expectation operator to obtain unbiased simulators of the score function.

²³A Taylor series expansion suggests that positive correlation between the numerator and denominator of a ratio can yield a smaller variance than independent simulation.
In order to include both the censored and truncated approaches to simulating the score function, we define the method of simulated scores as follows.²⁴
Method
of simulated scores
Let the log-likelihood function for the unknown parameter vector (3 given the sample of observations (y,, n = 1,. . . , N) be I,(0) = C,“= 1In f(& y,). Let fi(Q; y,, w,) = (l/R)Cr=, ~(&y,,,lo,J be an asymptotically (in R) unbiased simulator of the score function ~(0;y) = Vlnf(B; y) where o is a simulated random variable. The method of simulated scores estimator is &,s, E arg min,, J/ 5,(e) (1 where .YN(0)3 (l/N)Cr= ,b(@, y,, 0,) for some sequence {on}. Our definition includes all MSL estimators as MSS estimators, because they implicitly simulate the score with a bias that disappears asymptotically with the number of replications R. But there are also MSS estimators without simulation bias for fixed R. These estimators rely on simulation from the truncated conditional distribution of the latent y* given y. We turn to such estimators first. 4.4.1.
Truncated
simulation of the score
The truncated simulation methods described in Section 3.3 provide unbiased simulators of the LDV score (17), which is composed of elements of the form (24). Such simulation would be ideal, because R can be held fixed, thus leading to fast estimation procedures. The problem is that these truncated simulation methods pose new problems for the MSS estimators that use them.

The first truncated simulation scheme, discussed in Section 3.3.1 above, is the A/R method. This provides simulations that are discontinuous in the parameters, a property shared with the CMC. A/R simulation delivers the first element in a simulated sequence that falls into a region which depends on the parameters under estimation. As a result, changes in the parameter values cause discrete changes in which element in the sequence is accepted. An example of this phenomenon is to suppose that one is drawing a sequence of normal random variables {η_r} ~ N(0, I_J) in order to obtain truncated multivariate normal random variables for rank ordered probit estimation. Given the observation y, one seeks a simulation from D(y), as defined in Example 8. Let the simulation of y* be ỹ_r(μ₁, Γ₁) ≡ μ₁ + Γ₁η_r at the parameter values (μ₁, Γ₁). At neighboring parameter values where two elements of the vector ỹ_r(μ, Γ) are equal, the A/R simulation is at the point of jumping from the value ỹ_r(μ, Γ) to another point in the sequence {ỹ_r(μ, Γ)}. See Hajivassiliou and McFadden (1990) and McFadden and Ruud (1992) for treatments of the special
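The discontinuity is easy to visualize in a toy univariate setting: holding the simulation sequence fixed, the index of the first draw that lands in the (parameter-dependent) acceptance region changes discretely as the parameter moves, so the delivered A/R simulation jumps. A minimal sketch, with a hypothetical truncation region {v > 0}:

```python
# Discontinuity of A/R simulation in the parameters: with the simulation
# sequence held fixed, the first draw accepted into the region {v > 0}
# changes discretely as mu moves, so the delivered simulation jumps.
import numpy as np

rng = np.random.default_rng(6)
eta = rng.normal(size=1000)               # fixed simulation sequence

def ar_draw(mu):
    v = mu + eta                           # candidates ~ N(mu, 1)
    first = int(np.argmax(v > 0))          # index of first accepted draw
    return first, v[first]

for mu in (-1.00, -0.99, -0.98, -0.97):
    idx, draw = ar_draw(mu)
    print(f"mu={mu:+.2f}  accepted index={idx:3d}  accepted draw={draw:+.4f}")
```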
²⁴The term was coined by Hajivassiliou and McFadden (1990).
asymptotic distribution theory for such simulation estimators. Briefly described, this distribution theory requires a degree of smoothness in the estimator with respect to the parameters that permits such discontinuities but allows familiar linear approximations in the limit. See Ruud (1991) for an illustrative application.

The second truncated simulation scheme we discussed above was the Gibbs resampling simulation method; see Section 3.3.2. This method is continuous in the parameters provided that one uses a continuous univariate truncated normal simulation scheme. But this simulation method also has a drawback: strictly applied, each simulation requires an infinite number of resampling rounds. In practice, Gibbs resampling is truncated and applied as an approximation. The limited Monte Carlo evidence that we have seen suggests that such approximation is reliable.

Simulation of the efficient score fits naturally with the EM algorithm for computing the MLE derived by Dempster et al. (1977). The EM algorithm includes a step in which one computes an expectation with respect to the truncated distribution of y* conditional on y. Ruud (1991) suggested that a simulated EM (SEM) algorithm could be based on simulation of the required expectation.²⁵ This substitution provides a computational algorithm for solving the simulated score of MSS estimators.

Definition 9. EM algorithm
The EM algorithm is an iterative process for computing the MLE of a censored data model. On the ith iteration, the EM algorithm solves

θ^{i+1} = arg max_θ Q(θ, θ^i; y),   (39)

where the function Q is

Q(θ¹, θ⁰; y) ≡ E_{θ⁰}[ln f(θ¹; y*) | y],   (40)

where E_{θ⁰}[· | y] indicates an expectation measured with respect to f(θ⁰; y* | y).

If Q is continuous in both θ arguments, then (39) is a contraction mapping that converges to a root of the normal equations; as Ruud (1991) points out,

θ = θ¹ = θ⁰ ⟹ ∇_{θ¹} Q(θ¹, θ⁰; y) = ∇_θ ln f(θ; y),   (41)

so that the first-order conditions for an iteration of (39) and the normal equations for ML are intimately related.
so that the first-order conditions for an iteration of (39) and the normal equations for ML are intimately related. Unlike the log-likelihood function, this Q can be simulated without bias for LDV models because the latent likelihood f(0; y*) is tractable and Q is linear in In f(0; y*) 25van Pragg et al. (1989) and van Praag et al. (1991) also investigated a study of the Dutch labor market.
this approach
and applied
it in
According to (41), unbiased simulation of Q implies a means for unbiased simulation of the score. Although it is not guaranteed, an unbiased simulator of Q usually yields a contraction mapping to a stationary point. For LDV models based on a latent multivariate normal distribution, the iteration in (39) is quite simple to compute, given Q or a simulation of Q. If f(θ; y*) = φ(y* − μ; Ω), then

μ^{i+1} = (1/N) Σ_{n=1}^N E_{θ^i}[y*_n | y_n]
and

Ω^{i+1} = (1/N) Σ_{n=1}^N E_{θ^i}[(y*_n − μ^{i+1})(y*_n − μ^{i+1})′ | y_n],   (42)
which are analogous to the equations for the MLE using the latent data. This algorithm is often quite slow, however, in a neighborhood of the stationary point of (39). Any normalizations necessary for identification of θ can be imposed at convergence. See Ruud (1991) for a discussion of these points.

Example 13. SEM estimation
In this example, we apply the SEM procedure to the rank ordered probit model of our previous examples. We simulated an (approximately) unbiased Q by drawing simulations of y*_n from its truncated normal distribution conditional on y_n, using the Gibbs resampling method truncated to 10 rounds. The support of this truncated distribution is specified as D(y) in Example 8. The simulated estimators were computed according to (42), after replacing the expectations with the averages of independent simulations. The usual Monte Carlo results for 500 experiments with J = 6 ranked alternatives are reported in Table 8 for data sets containing 100 observations and R = 5 simulations per observation. These statistics are comparable to those in Table 5 for the MSL estimator of the same model with the same number of simulation replications. The biases for the true parameter values appear to be appreciably smaller in the SEM estimator, while the sampling variances are larger. We cannot judge either estimator as an approximation to the MLE, because the latter is prohibitively difficult to compute.
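For a transparent special case of the SEM iteration (42), the following sketch estimates a hypothetical censored normal model y = max(y*, 0), y* ~ N(μ, σ²), replacing the E-step expectations with draws of y* from the truncated normal distribution given y (one draw per censored observation per iteration). Because the E-step is simulated, the iterates oscillate around the MLE rather than converging exactly; all values are illustrative.

```python
# Simulated EM (SEM) for a hypothetical censored normal model
# y = max(y*, 0), y* ~ N(mu, sigma^2), in the spirit of equation (42):
# E-step expectations are replaced by one draw of y* per censored
# observation from its truncated normal distribution given y.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
N, mu0, sig0 = 2000, 0.5, 1.0
ystar = mu0 + sig0 * rng.normal(size=N)
y = np.maximum(ystar, 0.0)
censored = y == 0.0

mu, sig = 0.0, 2.0                           # crude starting values
for it in range(100):
    ysim = y.copy()
    b = (0.0 - mu) / sig                     # standardized censoring point
    # inverse-CDF draw from N(mu, sig^2) truncated to (-inf, 0]
    u = rng.uniform(size=censored.sum())
    ysim[censored] = mu + sig * norm.ppf(u * norm.cdf(b))
    mu = ysim.mean()                         # M-step updates as in (42)
    sig = np.sqrt(((ysim - mu) ** 2).mean())
print("SEM estimates (mu, sigma):", round(mu, 3), round(sig, 3))
```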
Table 8
Sample statistics for rank ordered probit SEM using Gibbs simulation (J = 6, R = 5).

Parameter   Population value   Mean      Standard deviation   Lower quartile   Median    Upper quartile
θ1          -0.4000            -0.3827   0.1558               -0.4907          -0.3848   -0.2757
θ2          -0.4000            -0.4570   0.3271               -0.5992          -0.4089   -0.2455
θ3          -0.4000            -0.4237   0.2262               -0.5351          -0.3756   -0.2766
θ4          -0.4000            -0.4268   0.2710               -0.5319          -0.3891   -0.2580
θ5          -0.4000            -0.4300   0.2622               -0.5535          -0.3794   -0.2521
Although truncated simulation is generally more costly, the SEM estimator remains a promising general approach to combining simulation with relatively efficient estimation. It is the only method that combines unbiased simulation of the score with optimization of an objective function, and the latter property appears to offer substantial computational advantages.

4.4.2. Censored simulation of ratios
The censored simulation methods in Section 3.2 can also be applied to approximating the efficient score. These simulation methods tend to be much faster computationally than the truncated simulation methods, but censored simulations introduce simulation bias in much the same way as in the MSL. Censored simulation can be applied to discrete LDV models by noting that the score function of an LDV model with observation rule y = τ(y*) can generally be written in the ratio form:
s(θ; y) = ∇_θ f(θ; y)/f(θ; y) = [∫_{{y*: τ(y*)=y}} ∇_θ dF(θ; y*)] / [∫_{{y*: τ(y*)=y}} dF(θ; y*)] = E[∇_θ ln f(θ; Y*) | τ(Y*) = y],

where F(θ; y* | y) is the conditional c.d.f. of y* given τ(y*) = y. See Section 2.6 for more details. Van Praag and Hop (1987), McFadden (1989) and Hajivassiliou and McFadden (1990) note that this form of the score function offers the potential of estimation by simulation.²⁶ An MSS estimator can be constructed by simulating separately the numerator and denominator of the score expressions:
s̃_N(θ) = (1/N) Σ_{n=1}^N d̃(θ; y_n, ω_{1n}) / p̃(θ; y_n, ω_{2n}),   (43)
where d̃(θ; y_n, ω_{1n}) = (1/R₁) Σ_{r=1}^{R₁} d̂(θ; y_n, ω_{1nr}) is an unbiased simulator of the derivative function ∇_θ f(θ; y) and p̃(θ; y_n, ω_{2n}) = (1/R₂) Σ_{r=1}^{R₂} p̂(θ; y_n, ω_{2nr}) is an unbiased simulator of the probability expression f(θ; y_n). Hajivassiliou and McFadden (1990) prove that when the approximation of the scores in ratio form is carried out using the GHK simulator, the resulting MSS estimator is consistent and asymptotically normal when N → ∞ and R₂/√N → ∞. The number of simulations for the numerator expression, R₁, affects the efficiency of the resulting MSS estimator. Because the unbiased simulator p̃(θ; y, ω₂) of f(θ; y) does not yield an unbiased simulator of
²⁶See Hajivassiliou (1993c) for a survey of the development of simulation estimation methods for LDV models.
the reciprocal 1/f(θ; y) in the simulator 1/p̃(θ; y, ω₂), R₂ must increase with sample size to obtain a consistent estimator. This is analogous to simulation in MSL. In fact, this simulation scheme is equivalent to MSL when ω₁ = ω₂ and d̂ = ∇_θ p̂.

McFadden and Ruud (1992) note that MSM techniques can also be used generally to remove the simulation bias in such MSS estimators. In discrete LDV models, where y has a sampling space B that is countable and finite, we can always write y as a vector of dummy variables for each of the possible outcomes, so that

E_θ(y_i) = Pr{y_i = 1; θ} = f(θ; Y)   if Y_i = 1, Y_j = 0, j ≠ i.
Thus,

E_θ[∇_θ f(θ; Y)/f(θ; Y)] = 0 = Σ_{Y∈B} [∇_θ f(θ; Y)/f(θ; Y)] f(θ; Y),

and the score can be written

s(θ; y) = Σ_{Y∈B} [∇_θ f(θ; Y)/f(θ; Y)] [1{y = Y} − f(θ; Y)].   (44)

Provided that the "residual" 1{y = Y} − f(θ; Y) and the "instrumental variables" ∇_θ f(θ; Y)/f(θ; Y) are simulated independently, equation (44) provides a moment function for the MSM. In this form, the instrumental variables ratio can be simulated with bias as in (43), because the residual term is independently distributed and possesses a marginal expectation equal to zero at the population parameter value. For example, we can alter (43) to
s̃_N(θ) = (1/N) Σ_{n=1}^N Σ_{Y∈B} [d̃(θ; Y, ω_{1n})/p̃(θ; Y, ω_{1n})] [1{y_n = Y} − p̃(θ; Y, ω_{2n})],   (45)
where ω₁ and ω₂ are independent pseudo-random variables. While such bias does not introduce inconsistency into the MSM estimator, the simulation bias does introduce inefficiency because the moment function is not an unbiased simulator of the score function. This general approach underlies the estimation method for multinomial probit originally proposed by McFadden (1989).

4.4.3. MSM versus MSS
MSM and MSS are natural competitors in estimation with simulation because each has a comparative advantage. MSM uses censored simulations that are cheap to
compute, but it cannot simulate the score without bias within a finite number of calculations. MSS uses truncated simulations that are expensive to compute (and introduce jumps in the objective function with A/R simulations), but simulates the score (virtually) without bias. McFadden and Ruud (1992) make a general comparison of the asymptotic covariance matrices that suggests when one method is preferable to the other. Consider the special MSS case in which the simulations ỹ*(θ; Y, ω) are drawn from the latent conditional distribution and the exact latent score ∇_θ l* is available, so that

s̃(θ) = R⁻¹ Σ_{r=1}^R ∇_θ l*[θ; ỹ*_r(θ; Y, ω)].
Then Σ_ω, the contribution of simulation to the covariance matrix of the estimator, has a useful interpretation:

Σ_ω = R⁻¹ E_θ{V[∇_θ l*(θ; Y*) | Y]} = R⁻¹(Σ_* − Σ_I),

where Σ_* = E_θ{∇_θ l*(θ; Y*)[∇_θ l*(θ; Y*)]′} is the information matrix of the latent log-likelihood and Σ_I is the information matrix of the observed log-likelihood. The simulation noise is proportional to the information loss due to partial observability.

In the simplest applications of censored simulation to the MSM, the simulations are independent of sample outcomes, and their contribution to the moment function is additively separable from the contribution of the data. Thus we can write s̃_N(θ) = g(θ; Y, ω₂) − ḡ(θ; ω₁, ω₂) (see (45)). In that case, Σ_ω simplifies to V{√N[ḡ(θ_0; ω₁, ω₂)]}. In general, the simulation process makes R independent replications of the simulations {ω_r; r = 1, ..., R}, so that
ḡ(θ; ω₁, ω₂) = R⁻¹ Σ_{r=1}^R g(θ; ω_{1r}, ω_{2r})

and Σ_ω = R⁻¹ V[√N g(θ_0; ω₁, ω₂)]. In an important special case of censored simulation, the simulation process makes R independent replications of the modeled data generating process, {ỹ(θ; ω_r); r = 1, ..., R}, so that
ḡ(θ; ω₁, ω₂) = R⁻¹ Σ_{r=1}^R g[θ; ỹ(θ; ω_{1r}), ω₂]

and Σ_ω = R⁻¹ V[g(θ_0; Y, ω₂)] = Σ_MOM/R. Then the MSM covariance matrix equals (1 + 1/R) times the classical MOM covariance matrix without simulation, G⁻¹Σ_MOM(G′)⁻¹.

Now let us specialize to simulation of the score. For simplicity, suppose that the simulated moment functions are unbiased simulations of the score: E[s̃_N(θ) | Y] = ∇_θ l(θ; Y). Of course in most cases, the MSM estimator will have a simulation bias
for the score. The asymptotic variance of the MSM estimator is

Σ_MSM = lim_{N→∞} V{√N[s̃_N(θ_0) − (1/N) ∇_θ l(θ_0; Y)]} + lim_{N→∞} V[(1/√N) ∇_θ l(θ_0; Y)] = Σ_Δ + Σ_I,

where Σ_Δ = Σ̄_Δ/R and Σ̄_Δ holds the additional variation attributable to the simulation of the score. If the MSS and MSM estimators use the same number of simulation replications, we can make a simple comparison of the relative efficiency of the two methods. The difference between the asymptotic covariance matrices is
+
(R+
l)Z,
- (Z, - &)]Z,‘.
This expression gives guidance about the conditions under which censored simulation is likely to dominate truncated. It is already obvious that if Z, is high, so that censored simulation is inefficient due to a poor approximation of the score, then truncated simulation is likely to dominate. On the other hand, if Z:, is low, because partial observability causes a large loss in information, then estimation with censored simulation is likely to dominate truncated. Thus, we might expect that the censored simulation method will dominate the truncated one for the multinomial probit model, particularly if Z, = 0. That, however, is a special case in which a more efficient truncated simulation estimator can be constructed from the censored simulation estimator. Because E[E(Q)I Y] = VfJ(R Y),
E[g(θ; Y, ω₂) − ḡ(θ; ω₁, ω₂) | Y] = ∇_θ l(θ; Y)

and

E[ḡ(θ; ω₁, ω₂)] = E{g[θ; ỹ(θ; ω), ω₂]} = 0   ∀θ.
The bias correction is obviously unnecessary and only increases the variance of the MSM estimator. But an MSM estimator based on g(θ; Y, ω) is a truncated simulation MSM estimator; only simulation for the particular Y observed is required. We conclude that the censored method can outperform the truncated method only by choosing E_ω[s̃(θ)] ≠ ∇_θ l(θ; Y) in such a way that the loss in efficiency in Σ_I is offset by low Σ̄_Δ and low Σ_ω.²⁷
4.5. Bias corrections
In this section, we interpret estimation with simulation as a general method for removing bias from approximate parametric moment functions, following McFadden and Ruud (1992).

²⁷The actual difference in asymptotic covariance matrices is more complicated than the formula above, however, because G ≠ Σ_M ≠ Σ_I.
The approximation of the efficient score is the leading problem in estimation with simulation. In a comparison of the MSM and MSS approximations, we have just described a simple trade-off. On the one hand, the simulated term in the residual of (45) that replaces the expectation in (44) is clearly redundant when the instrumental variables are ∇_θ f(θ; Y)/f(θ; Y). The expectation of the simulated terms multiplied by the instruments is identically zero for all parameter values, so that the simulation merely adds noise to the score and the resulting estimator. On the other hand, the simulated residual is clearly necessary when the instruments are not ideal. Without the simulation, the moment equation is invalid and the resultant estimators are inconsistent. This trade-off motivates a general structure of simulated moments estimators.

We can interpret the extra simulation term as a bias correction to an approximation of the score. For example, one can view the substitution of non-ideal weights into the original score function as an approximation to the score, chosen for its computational feasibility. Because the approximation introduces bias, the bias is removed by simulating the (generally) unknown expectation of the approximate score. Suppose the moment restrictions have a general form
E[s(θ_0; y, X) | X] = 0.

When the moment function s is computationally burdensome, an approximation g(θ; y, X, ω) becomes a feasible alternative. The additional argument ω represents an ancillary statistic containing the "coefficients" of the approximation. In general, such approximation will introduce inefficiency and bias into MOM estimators constructed from g. Simulation of g over the distribution of y produces an approximate bias correction ḡ(θ; X, ω′, ω), where ω′ represents the simulated component. Thus, we consider estimators θ̂ that satisfy

g(θ̂; y, X, ω) − ḡ(θ̂; X, ω′, ω) = 0.   (47)

MSM estimators have this general form; and feasible MSS estimators generally do, too.

4.5.1. A score test for estimator bias
The appeal of simulation estimators without bias correction is substantial. Although the simulation of moments or scores overcomes a substantial computational difficulty in the estimation of LDV models, there may remain practical difficulties in solving the simulated moment functions for the estimators. Whereas maximum likelihood possesses a powerful relationship between the normal equations and the likelihood function, moment equations generally do not satisfy such “integrability” conditions. As a result, there is not even a guarantee that a root of the estimating
equations exists. Bias correction can introduce a significant amount of simulation noise to estimators. For these reasons, the approximation of the log-likelihood function itself through simulation still offers an important opportunity to construct feasible and relatively efficient estimators.

MSS, and particularly MSL, estimators can be used without bias correction if the bias is negligible relative to the sampling error of the estimator and the magnitude of the true parameter. A simple score test for significant bias can be developed and implemented easily. Conditional on the MSS estimator, the expectation of the simulated bias in the approximate score should be zero. The conditional distribution of the elements of the bias correction are i.n.i.d. random variables to which a central limit theorem can be applied. In addition, the White-Eicker estimator of the covariance matrix of the bias elements is consistent, so that the usual Wald statistic, measuring the statistical significance of the bias term, can be computed (see Engle (1984)). As an alternative to testing the significance of this statistic, the bias correction term can be used to compute a local approximate confidence region for the biases in the moment function or the estimated parameters. This has the advantage of providing a way to assess whether the biases are important for the purposes of inference.
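A sketch of the Wald computation just described, assuming one has already evaluated the per-observation bias-correction terms at the estimator; the array b below is a hypothetical stand-in for those terms, and the covariance is estimated in White-Eicker fashion:

```python
# Wald test for significant simulation bias, based on the sample mean of the
# bias-correction terms and the White-Eicker covariance estimator.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
N, k = 1000, 3
b = rng.normal(scale=0.05, size=(N, k))    # hypothetical bias-correction terms

bbar = b.mean(axis=0)
V = (b - bbar).T @ (b - bbar) / N          # White-Eicker covariance of b_n
wald = N * bbar @ np.linalg.solve(V, bbar) # N * bbar' V^{-1} bbar ~ chi2(k) under H0
print("Wald statistic:", wald, "  p-value:", chi2.sf(wald, df=k))
```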
5. Conclusion
In this chapter, we have described the use of simulation methods to overcome the difficulties in computing the likelihood and moment functions of LDV models. These functions contain multivariate integrals that cannot be easily approximated by series expansions. However, unbiased simulators of these integrals can be computed easily. We began by reviewing the ways in which LDV models arise, describing the differences and similarities in censored and truncated data generating processes. Censoring and truncation give rise to the troublesome multivariate integrals. Following the LDV models, we described various simulation methods for evaluating such integrals. Naturally, censoring and truncation play roles in simulation as well. Finally, estimation methods that rely on simulation were described in the final section of this chapter. We organized these methods into three broad groups: MSL, MSM, and MSS. These are not mutually exclusive groups. But each group has a different motivation: MSL focuses on the log-likelihood function, the MSM on moment functions, and the MSS on the score function. The MSS is a combination of ideas from MSL and MSM, treating the efficient score of the log-likelihood function as a moment function. Software for implementing these methods is not yet widely available. But as such tools spread, and as improvements in the simulators themselves are developed, simulation methods will surely become a familiar tool in the applied econometrician's workshop.
6. Acknowledgements

We would like to thank John Geweke and Daniel McFadden for very helpful comments. John Wald provided expert research assistance. We are grateful to the National Science Foundation for partial financial support, under grants SES-929411913 (Hajivassiliou) and SES-9122283 (Ruud).
References

Amemiya, T. (1984) "Tobit Models: A Survey", Journal of Econometrics, 24, 3-61.
Avery, R., Hansen, L. and Hotz, V. (1983) "Multiperiod Probit Models and Orthogonality Condition Estimation", International Economic Review, 24, 21-35.
Bauwens, L. (1984) Bayesian Full Information Analysis of Simultaneous Equation Models using Integration by Monte Carlo. Berlin: Springer.
Beggs, S., Cardell, S. and Hausman, J. (1981) "Assessing the Potential Demand for Electric Cars", Journal of Econometrics, 17, 1-20.
Berkovec, J. and Stern, S. (1991) "Job Exit Behavior of Older Men", Econometrica, 59, 189-210.
Bloemen, H. and Kapteyn, A. (1991) The Joint Estimation of a Non-linear Labour Supply Function and a Wage Equation Using Simulated Response Probabilities. Tilburg University, mimeo.
Bock, R.D. and Jones, L.V. (1968) The Measurement and Prediction of Judgement and Choice. San Francisco: Holden-Day.
Bolduc, D. (1992) "Generalized Autoregressive Errors in the Multinomial Probit Model", Transportation Research B - Methodological, 26B(2), 155-170.
Bolduc, D. and Kaci, M. (1991) Multinomial Probit Models with Factor-Based Autoregressive Errors: A Computationally Efficient Estimation Approach. Universite Laval, mimeo.
Börsch-Supan, A. and Hajivassiliou, V. (1993) "Smooth Unbiased Multivariate Probability Simulators for Maximum Likelihood Estimation of Limited Dependent Variable Models", Journal of Econometrics, 58(3), 347-368.
Börsch-Supan, A., Hajivassiliou, V., Kotlikoff, L. and Morris, J. (1992) Health, Children and Elderly Living Arrangements: A Multi-Period Multinomial Probit Model with Unobserved Heterogeneity and Autocorrelated Errors, pp. 79-108, in: D. Wise, ed., Topics in the Economics of Aging. Chicago: University of Chicago Press.
Chib, S. (1993) "Bayes Regression with Autoregressive Errors: A Gibbs Sampling Approach", Journal of Econometrics, 58(3), 275-294.
Clark, C. (1961) "The Greatest of a Finite Set of Random Variables", Operations Research, 9, 145-162.
Daganzo, C. (1980) Multinomial Probit. New York: Academic Press.
Daganzo, C., Bouthelier, F. and Sheffi, Y. (1977) "Multinomial Probit and Qualitative Choice: A Computationally Efficient Algorithm", Transportation Science, 11, 338-358.
Davis, P. and Rabinowitz, P. (1984) Methods of Numerical Integration. New York: Academic Press.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, 39, 1-38.
Devroye, L. (1986) Non-Uniform Random Variate Generation. New York: Springer.
Dubin, J. and McFadden, D. (1984) "An Econometric Analysis of Residential Electric Appliance Holdings and Consumption", Econometrica, 52(2), 345-362.
Duffie, D. and Singleton, K. (1993) "Simulated Moments Estimation of Markov Models of Asset Prices", Econometrica, 61(4), 929-952.
Dutt, J. (1973) "A Representation of Multivariate Normal Probability Integrals by Integral Transforms", Biometrika, 60, 637-645.
Dutt, J. (1976) "Numerical Aspects of Multivariate Normal Probabilities in Econometric Models", Annals of Economic and Social Measurement, 5, 547-562.
Engle, R. (1984) Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics, pp. 776-826, in: Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2. Amsterdam: North-Holland.
Feller, W. (1971) An Introduction to Probability Theory and its Applications. New York: Wiley.
Fishman, G. (1973) Concepts and Methods of Digital Simulation. New York: Wiley.
Geman, S. and Geman, D. (1984) "Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images", IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
Geweke, J. (1989) "Bayesian Inference in Econometric Models Using Monte Carlo Integration", Econometrica, 57, 1317-1340.
Geweke, J. (1992) Efficient Simulation from the Multivariate Normal and Student-t Distributions Subject to Linear Constraints. Computing Science and Statistics: Proceedings of the Twenty-Third Symposium, 571-578.
Goldfeld, S. and Quandt, R. (1975) "Estimation in a Disequilibrium Model and the Value of Information", Journal of Econometrics, 3(3), 325-348.
Gourieroux, C. and Monfort, A. (1990) Simulation Based Inference in Models with Heterogeneity. INSEE, mimeo.
Gourieroux, C., Monfort, A., Renault, E. and Trognon, A. (1984a) "Pseudo Maximum Likelihood Methods: Theory", Econometrica, 52, 681-700.
Gourieroux, C., Monfort, A., Renault, E. and Trognon, A. (1984b) "Pseudo Maximum Likelihood Methods: Applications to Poisson Models", Econometrica, 52, 701-720.
Gronau, R. (1974) "The Effect of Children on the Housewife's Value of Time", Journal of Political Economy, 81, 168-199.
Hajivassiliou, V. (1986) Serial Correlation in Limited Dependent Variable Models: Theoretical and Monte Carlo Results. Cowles Foundation Discussion Paper No. 803.
Hajivassiliou, V. (1992) The Method of Simulated Scores: A Presentation and Comparative Evaluation. Cowles Foundation Discussion Paper, Yale University.
Hajivassiliou, V. (1993a) Estimation by Simulation of the External Debt Repayment Problems. Cowles Foundation Discussion Paper, Yale University. Published in the Journal of Applied Econometrics, 9(2) (1994) 109-132.
Hajivassiliou, V. (1993b) "Simulating Normal Rectangle Probabilities and Their Derivatives: The Effects of Vectorization", International Journal of Supercomputer Applications, 7(3), 231-253.
Hajivassiliou, V. (1993c) Simulation Estimation Methods for Limited Dependent Variable Models, pp. 519-543, in: G.S. Maddala, C.R. Rao and H.D. Vinod, eds., Handbook of Statistics (Econometrics), Vol. 11. Amsterdam: North-Holland.
Hajivassiliou, V. and Ioannides, Y. (1991) Switching Regressions Models of the Euler Equation: Consumption, Labor Supply and Liquidity Constraints. Cowles Foundation for Research in Economics, Yale University, mimeo.
Hajivassiliou, V. and McFadden, D. (1990) The Method of Simulated Scores, with Application to Models of External Debt Crises. Cowles Foundation Discussion Paper No. 967.
Hajivassiliou, V., McFadden, D. and Ruud, P. (1992) "Simulation of Multivariate Normal Orthant Probabilities: Methods and Programs", Journal of Econometrics, forthcoming.
Hammersley, J. and Handscomb, D. (1964) Monte Carlo Methods. London: Methuen.
Hanemann, M. (1984) "Discrete/Continuous Models of Consumer Demand", Econometrica, 52(3), 541-562.
Hansen, L.P. (1982) "Large Sample Properties of Generalized Method of Moments Estimators", Econometrica, 50, 1029-1054.
Hausman, J. and Wise, D. (1978) "A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences", Econometrica, 46, 403-426.
Hausman, J. and Wise, D. (1979) "Attrition Bias in Experimental and Panel Data: The Gary Negative Income Maintenance Experiment", Econometrica, 47(2), 445-473.
Heckman, J. (1974) "Shadow Prices, Market Wages, and Labor Supply", Econometrica, 42, 679-694.
Heckman, J. (1979) "Sample Selection Bias as a Specification Error", Econometrica, 47, 153-161.
Heckman, J. (1981) Dynamic Discrete Models, pp. 179-195, in: C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications. Cambridge: MIT Press.
Hendry, D. (1984) Monte Carlo Experimentation in Econometrics, pp. 937-976, in: Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2. Amsterdam: North-Holland.
Horowitz, J., Sparmonn, J. and Daganzo, C. (1981) "An Investigation of the Accuracy of the Clark Approximation for the Multinomial Probit Model", Transportation Science, 16, 382-401.
Hotz, V.J. and Miller, R. (1989) Conditional Choice Probabilities and the Estimation of Dynamic Programming Models. GSIA Working Paper 88-89-10.
Hotz, V.J. and Sanders, S. (1991) The Estimation of Dynamic Discrete Choice Models by the Method of Simulated Moments. NORC, University of Chicago.
Hotz, V.J., Miller, R., Sanders, S. and Smith, J. (1991) A Simulation Estimator for Dynamic Discrete Choice Models. NORC, University of Chicago, mimeo.
Keane, M. (1990) A Computationally Efficient Practical Simulation Estimator for Panel Data with Applications to Estimating Temporal Dependence in Employment and Wages. University of Minnesota, mimeo.
Keane, M. (1993) Simulation Estimation Methods for Panel Data Limited Dependent Variable Models, in: G.S. Maddala, C.R. Rao and H.D. Vinod, eds., Handbook of Statistics (Econometrics), Vol. 11. Amsterdam: North-Holland.
Kloek, T. and van Dijk, H. (1978) "Bayesian Estimates of Equation System Parameters: An Application of Integration by Monte Carlo", Econometrica, 46, 1-20.
Laroque, G. and Salanie, B. (1989) "Estimation of Multi-Market Disequilibrium Fix-Price Models: An Application of Pseudo Maximum Likelihood Methods", Econometrica, 57(4), 831-860.
Laroque, G. and Salanie, B. (1990) The Properties of Simulated Pseudo-Maximum Likelihood Methods: The Case of the Canonical Disequilibrium Model. Working Paper No. 9005, CREST-Departement de la Recherche, INSEE.
Lee, B.-S. and Ingram, B. (1991) "Simulation Estimation of Time-Series Models", Journal of Econometrics, 47, 197-205.
Lee, L.-F. (1978) "Unionism and Wage Rates: A Simultaneous Equation Model with Qualitative and Limited Dependent Variables", International Economic Review, 19, 415-433.
Lee, L.-F. (1979) "Identification and Estimation in Binary Choice Models with Limited (Censored) Dependent Variables", Econometrica, 47, 977-996.
Lee, L.-F. (1992) "On the Efficiency of Methods of Simulated Moments and Maximum Simulated Likelihood Estimation of Discrete Response Models", Econometric Theory, 8(4), 518-552.
Lerman, S. and Manski, C. (1981) On the Use of Simulated Frequencies to Approximate Choice Probabilities, pp. 305-319, in: C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications. Cambridge: MIT Press.
Lewis, H.G. (1974) "Comments on Selectivity Biases in Wage Comparisons", Journal of Political Economy, 82(6), 1145-1155.
Maddala, G.S. (1983) Limited Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.
McCulloch, R. and Rossi, P.E. (1993) An Exact Likelihood Analysis of the Multinomial Probit Model. Working Paper 91-102, Graduate School of Business, University of Chicago.
McFadden, D. (1973) Conditional Logit Analysis of Qualitative Choice Behavior, pp. 105-142, in: P. Zarembka, ed., Frontiers in Econometrics. New York: Academic Press.
McFadden, D. (1981) Econometric Models of Probabilistic Choice, pp. 198-272, in: C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications. Cambridge: MIT Press.
McFadden, D. (1986) Econometric Analysis of Qualitative Response Models, pp. 1395-1457, in: Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2. Amsterdam: North-Holland.
McFadden, D. (1989) "A Method of Simulated Moments for Estimation of Discrete Response Models without Numerical Integration", Econometrica, 57, 995-1026.
McFadden, D. and Ruud, P. (1992) Estimation by Simulation. University of California at Berkeley, working paper.
Moran, P. (1984) "The Monte Carlo Evaluation of Orthant Probabilities for Multivariate Normal Distributions", Australian Journal of Statistics, 26, 39-44.
Mühleisen, M. (1991) On the Use of Simulated Estimators for Panel Models with Limited-Dependent Variables. University of Munich, mimeo.
Newey, W.K. and McFadden, D.L. (1994) Estimation in Large Samples, in: R. Engle and D. McFadden, eds., Handbook of Econometrics, Vol. 4. Amsterdam: North-Holland.
Owen, D. (1956) "Tables for Computing Bivariate Normal Probabilities", Annals of Mathematical Statistics, 27, 1075-1090.
Pakes, A. (1992) Estimation of Dynamic Structural Models: Problems and Prospects Part II: Mixed Continuous-Discrete Controls and Market Interactions. Yale University, mimeo.
Pakes, A. and Pollard, D. (1989) "Simulation and the Asymptotics of Optimization Estimators", Econometrica, 57, 1027-1057.
Poirier, D. and Ruud, P.A. (1988) "Probit with Dependent Observations", Review of Economic Studies, 55, 593-614.
Quandt, R. (1972) "A New Approach to Estimating Switching Regressions", Journal of the American Statistical Association, 67, 306-310.
Quandt, R. (1986) Computational Problems in Econometrics, pp. 1395-1457, in: Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 1. Amsterdam: North-Holland.
Rubinstein, R. (1981) Simulation and the Monte Carlo Method. New York: Wiley.
Rust, J. (1992) Estimation of Dynamic Structural Models: Problems and Prospects Part II: Discrete Decision Processes. SSRI Working Paper #9106, University of Wisconsin at Madison.
Ruud, P. (1986) On the Method of Simulated Moments for the Estimation of Limited Dependent Variable Models. University of California at Berkeley, mimeo.
Ruud, P. (1991) "Extensions of Estimation Methods Using the EM Algorithm", Journal of Econometrics, 49, 305-341.
Stroud, A. (1971) Approximate Calculation of Multiple Integrals. New York: Prentice-Hall.
Thisted, R. (1988) Elements of Statistical Computing. New York: Chapman and Hall.
Thurstone, L. (1927) "A Law of Comparative Judgement", Psychological Review, 34, 273-286.
Tierney, L. (1992) Markov Chains for Exploring Posterior Distributions. University of Minnesota, working paper.
Tobin, J. (1958) "Estimation of Relationships for Limited Dependent Variables", Econometrica, 26, 24-36.
van Dijk, H.K. (1987) Some Advances in Bayesian Estimation Methods Using Monte Carlo Integration, pp. 205-261, in: T.B. Fomby and G.F. Rhodes, eds., Advances in Econometrics, Vol. 6. Greenwich, CT: JAI Press.
van Praag, B.M.S. and Hop, J.P. (1987) Estimation of Continuous Models on the Basis of Set-Valued Observations. Erasmus University Working Paper, presented at the ESEM Copenhagen.
van Praag, B.M.S., Hop, J.P. and Eggink, E. (1989) A Symmetric Approach to the Labor Market by Means of the Simulated Moments Method with an Application to Married Females. Erasmus University Working Paper, presented at the EEA Augsburg.
van Praag, B.M.S., Hop, J.P. and Eggink, E. (1991) A Symmetric Approach to the Labor Market by Means of the Simulated EM-Algorithm with an Application to Married Females. Erasmus University Working Paper, presented at the ESEM Cambridge.
West, M. (1990) Bayesian Computations: Monte-Carlo Density Estimation. Duke University, Discussion Paper 90-A10.
Chapter 41
ESTIMATION MODELS* JAMES
OF SEMIPARAMETRIC
L. POWELL
Princeton Unioersity
Contents 2444 2444
Abstract 1. Introduction
2.
3.
2444
1.1.
Overview
1.2.
Definition
of “semiparametric”
1.3.
Stochastic
restrictions
1.4.
Objectives
and techniques
Stochastic
and structural
2449 2452
models
of asymptotic
2460
theory
2465
restrictions
2.1.
Conditional
mean restriction
2466
2.2.
Conditional
quantile
2469
2.3.
Conditional
symmetry
2.4.
Independence
2.5.
Exclusion
Structural
restrictions
2416
restrictions
2482
and index restrictions
2487
models
3.1.
Discrete
3.2.
Transformation
3.3.
Censored
and truncated
3.4.
Selection
models
3.5.
Nonlinear
4. Summary References
2414
restrictions
response
2487
models
2492
models regression
2500
models
2506 2511
panel data models
2513 2514
and conclusions
*This work was supported by NSF Grants 91-96185 and 92-10101 to Princeton University. I am grateful to Hyungtaik Ahn, Moshe Buchinsky, Gary Chamberlain, Songnian Chen, Gregory Chow, Angus Deaton, Bo Honor&, Joel Horowitz, Oliver Linton, Robin Lumsdaine, Chuck Manski, Rosa Ma&kin, Dan McFadden, Whitney Newey, Paul Ruud, and Tom Stoker for their helpful suggestions, which were generally adopted except when they were mutually contradictory or required a lot of extra work.
Handbook of Econometrics, Volume IV, Edited by R.F. En& 0 1994 Elseuier Science B.V. All rights reserved
and D.L. McFadden
J.L. Powell
2444
Abstract
A semiparametric model for observational data combines a parametric form for some component of the data generating process (usually the behavioral relation between the dependent and explanatory variables) with weak nonparametric restrictions on the remainder of the model (usually the distribution of the unobservable errors). This chapter surveys some of the recent literature on semiparametric methods, emphasizing microeconometric applications using limited dependent variable models. An introductory section defines semiparametric models more precisely and reviews the techniques used to derive the large-sample properties of the corresponding estimation methods. The next section describes a number of weak restrictions on error distributions ~ conditional mean, conditional quantile, conditional symmetry, independence, and index restrictions - and show how they can be used to derive identifying restrictions on the distributions of observables. This general discussion is followed by a survey of a number of specific estimators proposed for particular econometric models, and the chapter concludes with a brief account of applications of these methods in practice.
1.
1.l.
Introduction Overview
Semiparametric modelling is, as its name suggests, a hybrid of the parametric and nonparametric approaches to construction, fitting, and validation of statistical models. To place semiparametric methods in context, it is useful to review the way these other approaches are used to address a generic microeconometric problem ~ namely, determination of the relationship of a dependent variable (or variables) y to a set of conditioning variables x given a random sample {zi = (yi, Xi), i = 1,. . . , N} of observations on y and x. This would be considered a “micro’‘-econometric problem because the observations are mutually independent and the dimension of the conditioning variables x is finite and fixed. In a “macro’‘-econometric application using time series data, the analysis must also account for possible serial dependence in the observations, which is usually straightforward, and a growing or infinite number of conditioning variables, e.g. past values of the dependent variable y, which may be more difficult to accommodate. Even for microeconometric analyses of cross-sectional data, distributional heterogeneity and dependence due to clustering and stratification must often be considered; still, while the random sampling assumption may not be typical, it is a useful simplification, and adaptation of statistical methods to non-random sampling is usually straightforward. In the classical parametric approach to this problem, it is typically assumed that the dependent variable is functionally dependent on the conditioning variables
2445
Ch. 41: Estimation of Semiparametric Models
(“regressors”) and unobservable of the form
“errors” according to a fixed structural relation (1.1)
Y = g(x, @o,s),
where the structural function g(.) is known but the finite-dimensional parameter vector a,~Iwp and the error term E are unobserved. The form of g(.) is chosen to give a class of simple and interpretable data generating mechanisms which embody the relevant restrictions imposed by the characteristics of the data (e.g. g(‘) is dichotomous if y is binary) and/or economic theory (monotonicity, homotheticity, etc.). The error terms E are introduced to account for the lack of perfect fit of (1.1) for any fixed value of c1eand a, and are variously interpreted as expectational or optimization errors, measurement errors, unobserved differences in tastes or technology, or other omitted or unobserved conditioning variables; their interpretation influences the way they are incorporated into the structural function 9(.). To prevent (1.1) from holding tautologically for any value of ao, the stochastic behavior of the error terms must be restricted. The parametric approach takes the error distribution to belong to a finite-dimensional family of distributions, a Pr{s d nix} = f,(a Ix, ~0) dl,, (1.2) s -CO where f(.) is a known density (with respect to the dominating measure p,) except for an unknown, finite-dimensional “nuisance” parameter ‘lo. Given the assumed structural model (1.1) and the conditional error distribution (1.2), the conditional distribution of y given x can be derived,
s 1
Pr{y <
11~)
=
-a,
1b
d
~hf,,,(uI
x, %I, qo)
dpYIX,
for some parametric conditional density f,,,,(.). Of course, it is usually possible to posit this conditional distribution of y given x directly, without recourse to unobservable “error” terms, but the adequacy of an assumed functional form is generally assessed with reference to an implicit structural model. In any case, with this conditional density, the unknown parameters c(~and q. can be estimated by maximizing the average conditional log-likelihood
This fully parametric modelling strategy has a number of well-known optimality properties. If the specifications of the structural equation (1.1) and error distribution (1.2) are correct (and other mild regularity conditions hold), the maximum likelihood estimators of ~1~and ‘lo will converge to the true parameters at the rate of the inverse square root of the sample size (“root-N-consistent”) and will be
asymptotically normally distributed, with an asymptotic covariance matrix which is no larger than that of any other regular root-N-consistent estimator. Moreover, the parameter estimates yield a precise estimator of the conditional distribution of the dependent variable given the regressors, which might be used to predict y for values of x which fall outside the observed support of the regressors. The drawback to parametric modelling is the requirement that both the structural model and the error distribution are correctly specified. Correct specification may be particularly difficult for the error distribution, which represents the unpredictable component of the relation of y to x. Unfortunately, if g(x, u, E) is fundamentally nonlinear in E - that is, it is noninvertible in E or has a Jacobian that depends on the unknown parameters tl - then misspecification of the functional form of the error distribution f(slx, 9) generally yields inconsistency of the MLE and inconsistent estimates of the conditional distribution of y given x. At the other extreme, a fully nonparametric approach to modelling the relation between y and x would define any such “relation” as a characteristic of the joint distribution of y and x, which would be the primitive object of interest. A “causal” or predictive relation from the regressors to the dependent variable would be given as a particular functional of the conditional distribution of y given x,
g(x) = WY,,),
(1.3)
where F,,, is the joint and F,tx is the conditional distribution. Usually the functional T(.) is a location measure, in which case the relation between y and x has a representation analogous to (1.1) and (1.2), but with unknown functional forms for f( .) and g(.). For example, if g(x) is the mean regression function (T(F,,,) = E[y 1x]), then y can be written as Y =
g(x) + E,
with E defined to have conditional density f,,, assumed to satisfy only the normalization E[E[x] = 0. In this approach the interpretation of the error term E is different than for the parametric approach; its stochastic properties derive from its definition in terms of the functional g(.) rather than a prior behavioral assumption. Estimation of the function g(.) is straightforward once a suitable estimator gYIX of the conditional distribution of y given x is obtained; if the functional T(.) in (1.3) is well-behaved (i.e. continuous over the space of possible I’&, a natural estimator is
9(x) =
~(~y,,).
Thus the problem of estimating the “relationship” g(.) reduces to the problem of estimating the conditional distribution function, which generally requires some smoothing across adjacent observations of the regressors x when some components
Ch. 41: Estimation
of Semiparametric Models
2441
are continuously distributed (see, e.g. Prakasa Rao (1983) Silverman (1986), Bierens (1987), Hardle (1991)). In some cases, the functional T(.) might be a well-defined functional of the empirical c.d.f. of the data (for example, g(x) might be the best linear projection of y on x, which depends only on the covariance matrix of the data); in these cases smoothing of the empirical c.d.f. will not be required. An alternative estimation strategy would approximate g(x) and the conditional distribution of E in (1.6) by a sequence of parametric models, with the number of parameters expanding as the sample size increases; this approach, termed the “method of sieves” by Grenander (1981), is closely related to the “seminonparametric” modelling approach of Gallant (1981, 1987), Elbadawi et al. (1983) and Gallant and Nychka (1987). The advantages and disadvantages of the nonparametric approach are the opposite of those for parametric modelling. Nonparametric modelling typically imposes few restrictions on the form of the joint distribution of the data (like smoothness or monotonicity), so there is little room for misspecification, and consistency of an estimator of g(x) is established under much more general conditions than for parametric modelling. On the other hand, the precision of estimators which impose only nonparametric restrictions is often poor. When estimation of g(x) requires smoothing of the empirical c.d.f. of the data, the convergence rate of the estimator is usually slower than the parametric rate (square root of the sample size), due to the bias caused by the smoothing (see the chapter by Hardle and Linton in this volume). And, although some prior economic restrictions like homotheticity and monotonicity can be incorporated into the nonparametric approach (as described in the chapter by Matzkin in this volume), the definition of the “relation” is statistical, not economic. Extrapolation of the relationship outside the observed support of the regressors is not generally possible with a nonparametric model, which is analogous to a “reduced form” in the classical terminology of simultaneous equations modelling. The semiparametric approach, the subject of this chapter, distinguishes between the “parameters of interest”, which are finite-dimensional, and infinite-dimensional “nuisance parameters”, which are treated nonparametrically. (When the “parameter of interest” is infinite-dimensional, like the baseline hazard in a proportional hazards model, the nonparametric methods described in the Hardle and Linton chapter are more appropriate.) In a typical parametric model, the parameters of interest, mO, appear only in a structural equation analogue to (l.l), while the conditional error distribution is treated as a nuisance parameter, subject to certain prior restrictions. More generally, unknown nuisance functions may also appear in the structural equation. Semiparametric analogues to equations (1.1) and (1.2) are (1.4) 1 {u d A}fo(aIx)dp,,
Pr{s d nix} = s
(1.5)
J.L. Powell
2448
where, as before, CQis unknown but known to lie in a finite-dimensional subspace, and where the unknown nuisance parameter is ‘lo =
Euclidean
(to(.)J
As with the parametric
approach,
prior economic
reasoning
general regularity and identification restrictions are imposed on the nuisance parameters qO, as in the nonparametric approach. As a hybrid of the parametric and nonparametric approaches, semiparametric modelling shares the advantages and disadvantages of each. Because it allows a more general specification of the nuisance parameters, estimators of the parameters of interest for semiparametric models are consistent under a broader range of conditions than for parametric models, and these estimators are usually more precise (converging to the true values at the square root of the sample size) than their nonparametric counterparts. On the other hand, estimators for semiparametric models are generally less efficient than maximum likelihood estimators for a correctly-specified parametric model, and are still sensitive to misspecification of the structural function or other parametric components of the model. This chapter will survey the econometric literature on semiparametric estimation, with emphasis on a particular class of models, nonlinear latent variable models, which have been the focus of most of the attention in this literature. The remainder of Section 1 more precisely defines the “semiparametric” categorization, briefly lists the structural functions and error distributions to be considered and reviews the techniques for obtaining large-sample approximations to the distributions of various types of estimators for semiparametric models. The next section discusses how each of the semiparametric restrictions on the behavior of the error terms can be used to construct estimators for certain classes of structural functions. Section 3 then surveys existing results in the econometric literature for several groups of latent variable models, with a variety of error restrictions for each group of structural models. A concluding section summarizes this literature and suggests topics for further work. The coverage of the large literature on semiparametric estimation in this chapter will necessarily be incomplete; fortunately, other general references on the subject are available. A forthcoming monograph by Bickel et al. (1993) discusses much of the work on semiparametrics in the statistical literature, with special attention to construction of efficient estimators; a monograph by Manski (1988b) discusses the analogous econometric literature. Other surveys of the econometric literature include those by Robinson (1988a) and Stoker (1992), the latter giving an extensive treatment of estimation based upon index restrictions, as described in Section 2.5 below. Newey (1990a) surveys the econometric literature on semiparametric efficiency bounds, which is not covered extensively in this chapter. Finally, given the close connection between the semiparametric approach and parametric and
say, to different method5 and degrees of “smoothing” of the empirical c.d.f.), while estimation of a semiparametric model would require an additional choice of the particular functional T* upon which to base the estimates. On a related point, while it is common to refer to “semiparametric estimation” Some and “semiparametric estimators”, this is somewhat misleading terminology. authors use the term “semiparametric estimator” to denote a statistic which involves a preliminary “plug-in” estimator of a nonparametric component (see, for example, Andrews’ chapter in this volume); this leads to some semantic ambiguities, since the parameters of many semiparametric models can be estimated by “parametric” estimators and vice versa. Thus, though certain estimators would be hard to interpret in a parametric or nonparametric context, in general the term “semiparametric”, like “parametric” or “nonparametric”, will be used in this chapter to refer to classes of structural models and stochastic restrictions, and not to a particular statistic. In many cases, the same estimator can be viewed as parametric, nonparametric or semiparametric, depending on the assumptions of the model. For example, for the classical linear model y = x’& +
E,
the least squares estimator
PC
[
itl
xixl
1
-lit1
of the unknown
coefficients
&,
xiYi3
would be considered a “parametric” estimator when the error terms are assumed to be Gaussian with zero mean and distributed independently of the regressors x. With these assumptions fi is the maximum likelihood estimator of PO, and thus is asymptotically efficient relative to all regular estimators of PO. Alternatively, the least squares estimator arises in the context of a linear prediction problem, where the error term E has a density which is assumed to satisfy the unconditional moment restriction E[&.X] = 0. This restriction yields a unique tion of the data,
representation
for /I0 in terms of the joint distribu-
& = {E[x.x'])-'E[x.y], so estimation of /I0 in this context would be considered a “nonparametric” problem by the criteria given above. Though other, less precise estimators of the moments E[x.x’] and E[x.y] (say, based only on a subset of the observations) might be used to define alternative estimators, the classical least squares estimator fi is, al-
Ch. 41: Estimation
of Semiparametric
2451
Models
most by default, an “efficient” estimator of PO in this model (as Levit (1975) makes precise). Finally, the least squares estimator b can be viewed as a special case of the broader class of weighted least squares estimators of PO when the error terms E are assumed to have conditional mean zero, E[.51xi] = 0
a.s.
The model defined by this restriction
would be considered
“semiparametric”,
since
&, is overidentified; while the least squares estimator b is *-consistent and asymptotically normal for this model (assuming the relevant second moments are finite), it is inefficient in general, with an efficient estimator being based on the representation
of the parameters of interest, where a2(x) E Var(sJxi) (as discussed in Section 2.1 below). The least squares statistic fi is a “semiparametric” estimator in this context, due to the restrictions imposed on the model, not on the form of the estimator. Two categories of estimators which are related to “semiparametric estimators”, but logically distinct, are “robust” and “adaptive” estimators. The term “robustness” is used informally to denote statistical procedures which are well-behaved for slight misspecifications of the model. More formally, a robust estimator & - T(p,,,) can be defined as one for which T(F) is a continuous functional at the true model (e.g. Manski (1988b)), or whose asymptotic distribution is continuous at the as defined by Huber (1981)). Other notions of truth (“quantitative robustness”, robustness involve sensitivity of particular estimators to changes in a small fraction of the observations. While “semiparametric estimators” are designed to be well-behaved under weak conditions on the error distribution and other nuisance parameters (which are assumed to be correct), robust estimators are designed to be relatively efficient for correctly-specified models but also relatively insensitive to “slight” model misspecification. As noted in Section 1.4 below, robustness of an estimator is related to the boundedness (and continuity) of its influence function, defined in Section 1.4 below; whether a particular semiparametric model admits a robust estimator depends upon the particular restrictions imposed. For example, for conditional mean restrictions described in Section 2.1 below, the influence functions for semiparametric estimators will be linear (and thus unbounded) functions of the error terms, so robust estimation is infeasible under this restriction. On the other hand, the influence function for estimators under conditional quantile restrictions depends upon the sign of the error terms, so quantile estimators are generally “robust” (at least with respect to outlying errors) as well as “semiparametric”. “Adaptive” estimators are efficient estimators of certain semiparametric models for which the best attainable efficiency for estimation of the parameters of interest
J.L. Powell
does not depend upon prior knowledge of a parametric form for the nuisance parameters. That is, adaptive estimators are consistent under the semiparametric restrictions but as efficient (asymptotically) as a maximum likelihood estimator when the (infinite-dimensional) nuisance parameter is known to lie in a finitedimensional parametric family. Adaptive estimation is possible only if the semiparametric information bound for attainable efficiency for the parameters of interest is equal to the analogous Cramer-Rao bound for any feasible parametric specification of the nuisance parameter. Adaptive estimators, which are described in more detail by Bickel et al. (1993) and Manski (1988b), involve explicit estimation of (nonparametric) nuisance parameters, as do efficient estimators for semiparametric models more generally.
1.3.
Stochastic
restrictions
and structural
models
As discussed above, a semiparametric model for the relationship between y and x will be determined by the parametric form of the structural function g(.) of (1.4) and the restrictions imposed on the error distribution and any other infinitedimensional component of the model. The following sections of this chapter group semiparametric models by the restrictions imposed on the error distribution, describing estimation under these restrictions for a number of different structural models. A brief description of the restrictions to be considered, followed by a discussion of the structural models, is given in this section. A semiparametric restriction on E which is quite familiar in econometric theory and practice is a (constant) conditional mean restriction, where it is assumed that
-wx) = PO
(1.6)
for some unknown constant po, which is usually normalized to zero to ensure identification of an intercept term. (Here and throughout, all conditional expectations are assumed to hold for a set of regressors x with probability one.) This restriction is the basis for much of the large-sample theory for least squares and method-of-moments estimation, and estimators derived for assumed Gaussian distributions of E (or, more generally, for error distributions in an exponential family) are often well-behaved under this weaker restriction. A restriction which is less familiar but gaining increasing attention in econometric practice is a (constant) conditional quantile restriction, under which a scalar error term E is assumed to satisfy Pr{c d qolx} = 71 for some fixed proportion restriction is the (leading)
(1.7) rr~(O, 1) and constant q. = qo(n); a conditional median special case with n= l/2. Rewriting the conditional
J.L. Powell
2454
an assumption
that (1.11)
Pr{e < ulx} = Pr{a d UlU(X)>
for some “index” function u(x) with dim{u(x)} < dim (x}; a weak or mean index restriction asserts a similar property only for the conditional expectation ~ (1.12)
E[&1X] = E[&IU(X)].
For different structural models, the index function v(x) might be assumed to be a known function of x, or known up to a finite number of unknown parameters (e.g. u(x) = x’BO), or an unknown function of known dimensionality (in which case some extra restriction(s) will be needed to identify the index). As a special case, the function u(x) may be trivial, which yields the independence or conditional mean restrictions as special cases; more generally, u(x) might be a known subvector x1 of the regressors x, in which case (1.11) and (1.12) are strong and weak forms of an exclusion restriction, otherwise known as conditional independence and conditional mean independence of E and x given x1, respectively. When the index function is unknown, it is often assumed to be linear in the regressors, with coefficients that are related to unknown parameters of interest in the structural model. The following diagram summarizes the hierarchy of the stochastic restrictions to be discussed in the following sections of this chapter, with declining level of generality from top to bottom: Nonparametric
I
-1
Conditional
lndedndence
1
Parametric
Turning parametric
mean
Location
Conditional
symmetry
median
m
now to a description of some structural models treated in the semiliterature, an important class of parametric forms for the structural
Ch. 41: Estimation
of Semiparametric
2455
Models
functions is the class of linear latent variable models, in which the dependent y is assumed to be generated as some transformation Y =
(1.13)
t(y*; &, %(.I)
of some unobservable y* = x’& +
variable
variable
y*, which itself has a linear regression
representation (1.14)
E.
Here the regression coefficients /I0 and the finite-dimensional parameters 2, of the transformation function are the parameters of interest, while the error distribution and any nonparametric component rO(.) of the transformation make up the nonparametric component of the model. In general y and y* may be vector-valued, and restrictions on the coefficient matrix /I0 may be imposed to ensure identification of the remaining parameters. This class of models, which includes the classical linear model as a special case, might be broadened to permit a nonlinear (but parametric) regression function for the latent variable y*, as long as the additivity of the error terms in (1.14) is maintained. One category of latent variable models, parametric transformation models, takes the transformation function t(y*;&) to have no nonparametric nuisance component to(.) and to be invertible in y* for all possible values of &. A well-known example of a parametric transformation model is the Box-Cox regression model (Box and Cox (1964)), which has y = t(x’&, + E; A,) for t - yy.
2)
=
-va F - l 1(1#
0} + ln(y)l{A = O}.
This transformation, which includes linear and log-linear (in y) regression models as special cases, requires the support of the latent variable y* to be bounded from below (by - I/&) for noninteger values of A,, but has been extended by Bickel and Doksum (1981) to unbounded y*. Since the error term E can be expressed as a known function of the observable variables and unknown parameters for these models, a stochastic restriction on E (like a conditional mean restriction, defined below) translates directly into a restriction on y, x,/IO, and II, which can be used to construct estimators. Another category, limited dependent variable models, includes latent variable models in which the transformation function t(y*) which does not depend upon unknown parameters, but which is noninvertible, mapping intervals of possible y* values into single values of y. Scalar versions of these models have received much of the attention in the econometric literature on semiparametric estimation, owing to their relative simplicity and the fact that parametric methods generally yield inconsistent estimators for /I0 when the functional form of the error distribution is misspecified. The simplest nontrivial transformation in this category is
J.L.Powell
2456
an indicator model
for positivity
y = 1{x’/A)+
E>
of the latent variable
O},
y*, which yields the binary response
(1.15)
which is commonly used in econometric applications to model dichotomous choice problems. For this model, in which the parameters can be identified at most up to a scale normalization on PO or E, the only point of variation of the function t(y*) occurs at y* = 0, which makes identification of &, particularly difficult. A model which shares much of the structure of the binary response model is the ordered response model, with the latent variable y* is only known to fall in one of J + 1 ordered intervals { (- co, c,], (c,, c,], . . . , (c,, GO)}; that is, J y=
1 1(x’& +& >Cj}.
(1.16)
j=l
Here the thresholds {cj} are assumed unknown (apart from a normalization like c0 = 0), and must be estimated along with PO. The grouped dependent oariable model is a variation with known values of {cj}, where the values of y might correspond to prespecified income intervals. A structural function for which the transformation function is more “informative” about /I0 is the censored regression model, also known in econometrics as the censored Tobit model (after Tobin (1956)). Here the observable dependent variable is assumed to be subject to a nonnegativity constraint, so that y = max (0, x’pO + s};
(1.17)
this structural function is often used as a model of individual demand or supply for some good when a fraction of individuals do not participate in that market. A variation on this model, the accelerated failure time model with jixed censoring, can be used as a model for duration data when some durations are incomplete. Here y=min{x;&+E,x2},
(1.18)
where y is the logarithm of the observable duration time (e.g. an unemployment spell), and x2 is the logarithm of the duration of the experiment (following which the time to completion for any ongoing spells is unobserved); the “fixed” qualifier denotes models in which both x1 and x2 are observable (and may be functionally related). These univariate limited dependent variable models have multivariate analogues which have also been considered in the semiparametric literature. One multivariate generalization of the binary response model is the multinomial response
Ch. 41: Estimation
model, for which the dependent y=vec{yj,j= l,..., J}, with
yj=l(yf3y:
for
and with each latent yj*=x’pj,+E.
Bo
variable
is a J-dimensional
=
vector
of indicators,
(1.19)
k#j)
variable
J’
2457
Models
of Semipurametric
y? generated cp;,
. . . , a&
by a linear model ‘.
>
DJ,l.
(1.20)
That is, yj = 1 if and only if its latent variable yl is the largest across alternatives. Another bivariate model which combines the binary response and censored regression models is the censored sample selection model, which has one binary response variable y, and one quantitative dependent variable y, which is observed only when yi = 1: y1=
l(x;B;+E,
>O)
(1.21)
and Y2 = Yl
cx;fi:+4.
(1.22)
This model includes the censored regression model as a special case, with fi; = fii s /I, and s1 = a2 = E. A closely related model is the disequilibrium regression model with observed regime, for which only the smaller of two latent variables is observed, and it is known which variable is observed: y, =
1(x;& + El -=c x;g +e,)
(1.23)
and
A special case of this model, the randomly censored regression model, imposes the restriction fii = 0, and is a variant of the duration model (1.18) in which the observable censoring threshold x2 is replaced by a random threshold a2 which is unobserved for completed spells. A class of limited dependent variable models which does not neatly fit into the foregoing latent variable framework is the class of truncated dependent variable models, which includes the truncated regression and truncated sample selection models. In these models, an observable dependent variable y is constructed from latent variables drawn from a particular subset of their support. For the truncated regression model, the dependent variable y has the distribution of y* = x’/I,, + E
J.L. Powell
2458
conditional
on y* > 0: (1.25)
y = x’po + u, with Pr(u
6)
-x’B0}.
(1.26)
For the truncated selection model, the dependent variable y is generated in the same way as y, in (1.24), conditionally on y, = 1. Truncated models are variants of censored models for which no information on the conditioning variables x is available when the latent variable y* cannot be observed. Since truncated samples can be constructed from their censored counterparts by deleting censored observations, identification and estimation of the parameters of interest is more challenging for truncated data. An important class of multivariate latent dependent variable models arises in the analysis of panel data, where the dimensionality of the dependent variable y is proportional to the number of time periods each individual is observed. For concreteness, consider the special case in which a scalar dependent variable is observed for two time periods, with subscripts on y and x denoting time period; then a latent variable analogue of the standard linear “fixed effects” model for panel data has y, =
HY+ x;&J + $1qJ>
Y, = t(Y+$D,
(1.27)
+ &2'?J'
where t(.) is any of the transformation functions discussed above and y is an unobservable error term which is constant across time periods (unlike the timespecific errors cl and s2) but may depend in an arbitrary way on the regressors* x1 and x2. Consistent estimation of the parameters of interest PO for such models is a very challenging problem; while “time-differencing” or “deviation from cell means” eliminates the fixed effect for linear models, these techniques are not applicable to nonlinear models, except in certain special cases (as discussed by Chamberlain (1984)). Even when the joint distribution of the error terms E, and s2 is known parametrically, maximum likelihood estimators for &,, r0 and the distributional parameters will be inconsistent in general if the unknown values of y are treated as individual-specific intercept terms (as noted by Heckman and MaCurdy (1980)), so semiparametric methods will be useful even when the distribution of the fixed effects is the only nuisance parameter of the model. The structural functions considered so far have been assumed known up to a finite-dimensional parameter. This is not the case for the generalized regression
Ch. 41: Estimation
ofSemiparametric Models
2459
model, which has Y =
(1.28)
%(X’Po + 4,
for some transformation function TV which is of unknown parametric form, but which is restricted either to be monotonic (as assumed by Han (1987a)), or smooth (or both). Formally, this model includes the univariate limited dependent variable and parametric transformation models as special cases; however, it is generally easier to identify and estimate the parameters of interest when the form of the transformation function t(.) is (parametrically) known. Another model which at first glance has a nonparametric component in the structural component is the partially linear or semilinear regression model proposed by Engle et al. (1986), who labelled it the “semiparametric regression model”; estimation of this model was also considered by Robinson (1988). Here the regression function is a nonparametric function of a subset xi of the regressors, and a linear function of the rest: Y=
x;p,+&(x,)
+6
(1.29)
where A,(.) is unknown but smooth. By defining a new error term E* = 2,(x,) + E, a constant conditional mean assumption on the original error term E translates into a mean exclusion restriction on the error terms in an otherwise-standard linear model. Yet another class of models with a nonparametric component are generated regressor models, in which the regressors x appear in the structural equation for y indirectly, through the conditional mean of some other observable variable w given x:
Y=h(ECwlxl,~,,&)~g(x,~,,~,(~),&),
(1.30)
with 6,(x) _=E[wjx]. These models arise when modelling individual behavior under uncertainty, when actions depend upon predictions (here, conditional expectations) of unobserved outcomes, as in the large literature on “rational expectations”. Formally, the nonparametric component in the structural function can be absorbed into an unobservable error term satisfying a conditional mean restriction; that is, defining q 5 w -JZ[wlx] (so that E[qlx] -O), the model (1.30) with nonparametrically-generated regressors can be rewritten as y = g(w - q,cr,,s), with a conditional mean restriction on the extra error term q. In practice, this alternative representation is difficult to manipulate unless g(.) is linear, and estimators are more easily constructed using the original formulation (1.30). Although the models described above have received much of the attention in the econometric literature on semiparametrics, they by no means exhaust the set of models with parametric and nonparametric components which are used in
J.L.Powell
2460
econometric applications. One group of semiparametric models, not considered here, include the proportional hazards model proposed and analyzed by Cox (1972, 1975) for duration data, and duration models more generally; these are discussed by Lancaster (1990) among many others. Another class of semiparametric models which is not considered here are choice-based or response-based sampling models; these are similar to truncated sampling models, in that the observations are drawn from sub-populations with restricted ranges of the dependent variable, eliminating the ancillarity of the regressors x. These models are discussed by Manski and McFadden (1981) and, more recently, by Imbens (1992).
1.4.
Objectives and techniques
of asymptotic
theory
Because of the generality of the restrictions imposed on the error terms for semiparametric models, it is very difficult to obtain finite-sample results for the distribution of estimators except for special cases. Therefore, analysis of semiparametric models is based on large-sample theory, using classical limit theorems to approximate the sampling distribution of estimators. The goals and methods to derive this asymptotic distribution theory, briefly described here, are discussed in much more detail in the chapter by Newey and McFadden in this volume. As mentioned earlier, the first step in the statistical analysis of a semiparametric model is to demonstrate identijkation of the parameters a0 of interest; though logically distinct, identification is often the first step in construction of an estimator of aO. To identify aO, at least one function T(.) must be found that yields T(F,) = aO, where F, is the true joint distribution function of z = (y,x) (as in (1.3) above). This functional may be implicit: for example, a,, may be shown to uniquely solve some functional equation T(F,, a,,) = 0 (e.g. E[m(y, x, a,,)] = 0, for some m(.)). Given the functional T(.) and a random sample {zi = (y,, xi), i = 1,. . . , N) of observations on the data vector z, a natural estimator of a0 is 62= T(P),
(1.31)
where P is a suitable estimator of the joint distribution function F,. Consistency of & (i.e. oi+ a,, in probability as N + co) is often demonstrated by invoking a law of large numbers after approximating the estimator as a sample average:
’=
$,f cPiV(Yi3xi) + Op(1)~ I
(1.32)
1
where E[q,(y, x)] + aO. In other settings, that the estimator maximizes a random almost surely to a limiting function with As noted below, establishing (1.31) can
consistency is demonstrated by showing function which converges uniformly and a unique maximum at the true value aO. be difficult if construction of 6i involves
Ch. 41: Estimation
ofSemiparametric Models
2461
explicit nonparametric estimators (through smoothing of the empirical distribution function). Once consistency of the estimator is established, the next step is to determine its rate ofconueryence, i.e. the steepest function h(N) such that h(N)(Gi - Q) = O,(l). so this is a maximal rate under weaker For regular parametric models, h(N) = fi, semiparametric restrictions. If the estimator bi has h(N) = fi (in which case it is said to be root-N-consistent), then it is usually possible to find conditions under which the estimator has an asymptotically linear representation:
di= '0 +
k,E$(Yi, I
where the “influence The Lindeberg-Levy estimator,
JNca-
xi)
+
op(11JN)2
(1.33)
1
function” I/I(.) has E[$(y, x)] = 0 and finite second moments. central limit theorem then yields asymptotic normality of the
ao) L Jqo, If,),
(1.34)
where V, = E{$(y,x)[$(y,x)]‘}. With a consistent estimator of V, (formed as the sample covariance matrix of some consistent estimator ~(yi,Xi) of the influence function), confidence regions and test statistics can be constructed with coverage/ rejection probabilities which are approximately correct in large samples. For semiparametric models, as defined above, there will be other functionals T+(F) which can be used to construct estimators of the parameters of interest. The asymptotic efJtciency of a particular estimator 6i can be established by showing that its asymptotic covariance matrix V, in (1.34) is equal to the semiparametric analogue to the Cramer-Rao bound for estimation of ~1~. This semiparametric ejjiciency bound is obtained as the smallest of all efficiency bounds for parametric models which satisfy the semiparametric restrictions. The representation ~1~= T*(F,) which yields an efficient estimator generally depends on some component do(.) of the unknown, infinite-dimensional nuisance parameter qo(.), i.e. T*(.) = T*(., 6,), so construction of an efficient estimator requires explicit nonparametric estimation of some characteristics of the nuisance parameter. Demonstration of (root-iv) consistency and asymptotic normality of an estimator depends on the complexity of the asymptotic linearity representation (1.33), which in turn depends on the complexity of the estimator. In the simplest case, where the estimator can be written in a closed form as a smooth function of sample averages,
6i=a
(
9 j$,$ m(Yi,xi) I 1 >
(1.35)
J.L. Powell
2462
the so-called +(Y,
“delta method”
yields an influence
function
II/ of the form
4 = ca4~o)/ad c~(Y, 4 - ~~1,
(1.36)
where pLoE E[m(y,x)]. Unfortunately, except for the classical linear model with a conditional mean restriction, estimators for semiparametric models are not of this simple form. Some estimators for models with weak index or exclusion restrictions on the errors can be written in closed form as functions of bivariate U-statistics.
(1.37)
with “kernel” function pN that has pN(zi, zj) = pN(zj,zi) for zi = (y,,z,); under conditions given by Powell et al. (1989), the representation (1.33) for such an estimator has influence function II/ of the same form as in (1.36), where now
m(.V, X)= lim EEPN(zi9 zj)lzi
=
(Y,X)1,
PO = ECm(y,
41.
(1.38)
N-02
A consistent estimator of the asymptotic sample second moment matrix of
covariance
matrix
of bi of (1.37) is the
(1.39)
In most cases, the estimator 6i will not have a closed-form expression like in (1.35) or (1.37), but instead will be defined implicitly as a minimizer of some sample criterion function or a solution of estimating equations. Some (generally inefficient) estimators based on conditional location or symmetry restrictions are “Mestimators”, defined as minimizers of an empirical process
h = aigETn $ ,i L and/or
solutions
0= i
p(_Yi,Xi, a) = argmin S,(a) 1
of estimating
equations
.g m(yit xi, 8.)= kN(B). I
(1.40)
asI3
(1.41)
1
for some functions p(.) and m(.), with dim{m(.)} = dim(a). When p(y,x,cr) (or m(y,x, a)) is a uniformly continuous function in the parameters over the entire parameter space 0 (with probability one), a standard uniform law of large numbers can be used to ensure that normalized versions of these criteria converge to their
J.L. Powell
2464
where the kernel pN(.) has the same symmetry property as stated for (1.37) above; such estimators arise for models with independence or index restrictions on the error terms. Results by Nolan and Pollard (1987,1988), Sherman (1993) and Honor& and Powell (1991) can be used to establish the consistency and asymptotic normality of this.estimator, which will have an influence function of the form (1.42) when m(y, X, a) = lim aE [pN(zi, zj, CC)1yi =
y, xi =
xyaE.
(1.47)
N+m
A more difficult class of estimators to analyze are those termed “semiparametric M-estimators” by Horowitz (1988a), for which the estimating equations in (1.41) also depend upon an estimator of a nonparametric component de(.); that is, ai solves
o=~.~ m(yi,xi,6i,~(‘))=mN(6i,6^(‘))
(1.48)
1
I
for some nonparametric estimator sof 6,. This condition might arise as a first-order condition for minimization of an empirical loss function that depends on 8,
d=ar~~n~i~lP(Yi,xi,a,6^(‘)),
(1.49)
as considered by Andrews (1990a, b). As noted above, an efficient estimator for any semiparametric model is generally of this form and estimators for models with independence or index restrictions are often in this class. To derive the influence function for an estimator satisfying (1.48), a functional mean-value expansion of Ci,(& c!?)around s^= 6, can be used to determine the effect on di of estimation of 6,. Formally, condition (1.48) yields
o=
mN(61,
&‘,, = &(&&,(‘)) + &,(8(‘)-
for some linear functional this second term
do(‘)) +
op(f/v6)
L,; then, with an influence
function
(1.50) representation
of
(1.51)
(with E[S(y, x)] = O), the form of the influence estimator is G(Y, 4
= raE(m(Y,x3 4 ~o)iiw,_,3
function
for a semiparametric
- 1~4~~ X, ao,6,) + a~, 41.
M-
(1.52)
Ch. 41: Estimation
of Semiparametric
To illustrate, suppose 6, is finite-dimensional, (1.50) would be a matrix product,
L&%9- hk)) = b&
2465
Models
6,~@‘; then the linear functional
6,) = CaE(m(y,~,a,6)/a6'1.=.o,a=a,](6-
do),
in
(1.53)
and the additional component 5 of the influence function in (1.52) would be the product of the matrix L, with the influence function of the preliminary estimator 8. When 6, is infinite-dimensional, calculation of the linear functional L, and the associated influence function 5 depends on the nature of the nuisance parameter 6, and how it enters the moment function m(y,x,a,d). One important case has 6, equal to the conditional expectation of some function s(y,x) of the data given some other function u(x) of the regressors, with m(.) a function only of the fitted values of this expectation; that is,
43= ~,(44) = ECdY, x)l a41
(1.54)
and (1.55) with am/&J well-defined. For instance, this is the structure of efficient estimators for conditional location restrictions. For this case, Newey (1991) has shown that the adjustment term t(y,x) to the influence function of a semiparametric Mestimator 6i is of the form a~,
X) = CWm(y,x3 4 4
ewa~~t,=,,i CS(Y~ 4
- 4A44)i.
(1.56)
In some cases the leading matrix in this expression is identically zero, so the asymptotic distribution of the semiparametric M-estimator is the same as if 6,(.) were known; Andrews (1990a, b) considered this and other settings for which the adjustment term 5 is identically zero, giving regularity conditions for validity of the expansion (1.50) in such cases. General formulae for the influence functions of more complicated semiparametric M-estimators are derived by Newey (1991) and are summarized in Andrews’ and Newey and McFadden’s chapters in this volume.
2.
Stochastic restrictions
This section discusses how various combinations of structural equations and stochastic restrictions on the unobservable errors imply restrictions on the joint distribution of the observable data, and presents general estimation methods for the parameters of interest which exploit these restrictions on observables. The classification scheme here is the same as introduced in the monograph by Manski
2466
(1988b) (and also in Manski’s chapter in this volume), although the discussion here puts more emphasis on estimation techniques and properties. Readers who are familiar with this material or who are interested in a particular structural form, may wish to skip ahead to Section 3 (which reviews the literature for particular models), referring back to this section when necessary.
2.1.
Conditional
mean restriction
As discussed in Section 1.3 above, restrictions for the error distribution
the class of constant assert constancy of
conditional
vO = argmin E[r(c - b)jx],
location
(2.1)
b
for some function r(.) which is nonincreasing for negative arguments and nondecreasing for positive arguments; this implies a moment condition E[q(.z - po)lx] = 0, for q(u) = ar(t#Ih. When the loss function of (2.1) is taken to be quadratic, r(u) = u’u, the corresponding conditional location restriction imposes constancy of the conditional mean of the error terms, .%4x) = PO
(2.2)
for some po. By appropriate definition of the dependent variable(s) y and “exogenous” variables x, this restriction may be applied to models with “endogenous” regressors (that is, some components of x may be excluded from the restriction (2.2)). This restriction is useful for identification of the parameters of interest for structural functions g(x, IX,E) that are invertible in the error terms E; that is, Y = g(x, MO,40s for some function
= 4Y, x, MO)
e(.), so that the mean restriction
(2.1) can be rewritten (2.3)
where the latter equality imposes the normalization p. E 0 (i.e., the mean ,u~ is appended to the vector ~1~of parameters of interest). Conditional mean restrictions are useful for some models that are not completely specified ~ that is, for models in which some components of the structural function g(.) are unknown or unspecified. In many cases it is more natural to specify the function e(.) characterizing a subset of the error terms than the structural function g(.) for the dependent variable; for example, the parameters of interest may be coefficients of a single equation from a simultaneous equations system and it is
Ch. 41: Estimation
of Semipurametric
2461
Models
often possible to specify the function e(.) without specifying the remaining equations of the model. However, conditional mean restrictions generally are insufficient to identify the parameters of interest in noninvertible limited dependent variable models, as Manski (1988a) illustrates for the binary response model. The conditional moment condition (2.3) immediately yields an unconditional moment equation of the form
0 = EC4x)4.k x, 41,
(2.4)
where d(x) is some conformable matrix with at least as many rows as the dimension of a,. For a given function cl(.), the sample analogue of the right-hand side of (2.8) can be used to construct a method-of-moments or generalized method-of-moments estimator, as described in Section 1.4; the columns of the matrix d(x) are “instrumental variables” for the corresponding rows of the error vector E. More generally, the function d(.) may depend on the parameters of interest, Q, and a (possibly) infinite-dimensional nuisance parameter 6,(.), so a semiparametric M-estimator for B may be defined to solve
(2.5)
estimator of the where dim(d(.)) = dim(g) x dim(s) and s^= c?(.) is a consistent nuisance function 6,(.). For example, these sample moment equations arise as the first-order conditions for the GMM minimization given in (1.43), where the moment functions take the form m(y, x, U) = c(x) e(y, x, a), for a matrix c(x) of fixed functions of x with number of rows greater than or equal to the number components of CC Then, assuming differentiability of e(.), the GMM estimator solves (2.5) with
d(x,
d, 8)= i
$ ,$[ae(y,, xi, d)pd]‘[c(xi)]’ L
1
I
A,c(x),
(2.6)
where A, is the weight matrix given in (1.43). Since the function d(.) depends on the data only through the conditioning variable x, it is simple to derive the form of the asymptotic distribution for the estimator oi which solves (2.5) using the results stated in Section 1.4: ,h@ where
- a,,)~N(O,
M,‘V’JM;)-‘),
(2.7)
J.L. Powell
2468
and
V. = ECdb,ao,6,) e(y,x, a01e’(y,x, a01d’k a0, S0)l
= E[d(x, aO, 6O)z(X)d’(xi, aO>sO)l. In this expression, Z(x)-
Z(x) is the conditional
E[e(y,x,ao)e’(y,x,ao)lx]
covariance
matrix
of the error terms,
= E[EdIx].
Also, the expectation and differentiation in the definition of MO can often be interchanged, but the order given above is often well-defined even if d(.) or e(.) is not smooth in a. A simple extension of the Gauss-Markov argument can be used to show that an efficient choice of instrumental variable matrix d*(x) is of the form
d*(x)=d*(x,ao,do)= the resulting
,/??(a*
Cal -‘; &E[e(yyx,cr)lxi]Ia=,,,
efficient estimator
- ~1~)5
J(O,
V*),
(2.8)
&* will have
with
I/* = {E[d*(x)C~(x)lCd*(x)l’}-‘,
(2.9) under suitable regularity conditions. Chamberlain (1987) showed that V* is the semiparametric efficiency bound for any “regular” estimator of ~1~when only the conditional moment restriction (2.3) is imposed. Of course, the optimal matrix d*(x) of instrumental variables depends upon the conditional distribution of y given x, an infinite-dimensional nuisance parameter, so direct substitution of d*(x) in (2.5) is not feasible. Construction of a feasible efficient estimator for a0 generally uses nonparametric regression and a preliminary inefficient GMM estimator of u. to construct estimates of the components of d*(x), the conditional mean of ae(y,x, a,)/aa and the conditional covariance matrix of e(y, x, ao). This is the approach taken by Carroll (1982), Robinson (1987), Newey (1990b), Linton (1992) and Delgado (1992), among others. Alternatively, a “nearly” efficient sequence of estimators can be generated as a sequence of GMM estimators with moment functions of the form m(y, x, a) = c(x) e(y, x, a), when the number of rows of c(x) (i.e. the number of “instrumental variables”) increases slowly as the sample size increases; Newey (1988a) shows that if linear combinations of c(x) can be used to approximate d*(x) to an arbitrarily high degree as the size of c(x) increases, then the asymptotic variance of the corresponding sequence of GMM estimators equals v*.
Ch. 41: Estimation
2469
of Semipammrtric Models
For the linear model y = x’& +
E
with scalar dependent matrix d*(x) simplifies d*(x) = [a’(x)]
variable y, the form of the optimal to the vector
instrumental
variable
- lx,
where a’(x) is the conditional variance of the error term E. As noted in Section 1.2 above, an efficient estimator for f10 would be a weighted least squares estimator, with weights proportional to a nonparametric estimator of [a’(x)] -I, as considered by Robinson (1987).
2.2.
Conditional quantile restrictions
In its most general form, the conditional 71th quantile of a scalar error term E is defined to be any function 9(x; rr) for which the conditional distribution of E has at least probability rr to the left and probability 1 - rc to the right of q=(x): Pr{s d q(x; n) Ix} 2 71 and
Pr{.s > ~(x; n)lx} 3 1 - 7~.
A conditional quantile restriction is the assumption conditional quantile is independent of x, 9(x; 7r)= rj,(7c) = qo,
a.s.
(2.10)
that, for some rt~(O, l), this
(2.11)
Usually the conditional distribution of E is further restricted to have no point mass at its conditional quantile (Pr{s = q,,} = 0), which with (2.10) implies the conditional moment restriction E[71-
l{E
l{&
(2.12)
where again the normalization ‘lo E 0 is imposed (absorbing q0 as a component of Q). To ensure uniqueness of the solution ‘lo = 0 to this moment condition, the conditional error distribution is usually assumed to be absolutely continuous with nonnegative density in some neighborhood of zero. Although it is possible in principle to treat the proportion rr as an unknown parameter, it is generally assumed that rt is known in advance; most attention is paid to the special case 71 = i (i.e. a conditional median restriction) which is implied by the stronger assumptions of either independence of the errors and regressors or conditional symmetry of the errors about a constant.
J.L. Powell
2470
A conditional quantile restriction can be used to identify parameters of interest in models in which the dependent variable y and the error term E are both scalar, and the structural function g(.) of (1.4) is nondecreasing in E for all possible a0 and almost all x: u1
G u2 =-dx, M, 4) G 4x, ~1,d
(2.13)
a.s. Cd.
(Of course, nonincreasing structural functions can be accommodated with a sign change on the dependent variable y.) This monotonicity and the quantile restriction (2.11) imply that the conditional xth quantile of y given x is g(x, aO, 0); since
EGO or
E2.O
y=g(x,ao,E)~g(x,~o,O)
3
or
ykg(x,cc,,O),
it follows that Pr{ydg(x,cc,,O)Ix)
> Pr{s
and
Pr{y > g(x,cr,,O)lx}
2 Pr{.s 3 01x) 3 1 - rc.
(2.14)
Unlike a conditional mean restriction, a conditional quantile restriction is useful for identification of CI~even when the structural function g(x, a, E) is not invertible in E. Moreover, the equivariance of quantiles to monotonic transformations means that, when it is convenient, a transformation l(y) might be analyzed instead of the original dependent variable y, since the conditional quantile of I(y) is l(g(x, aO, 0)) if I(.) is nondecreasing. (Note, though, that application of a noninvertible transformation may well make the parameters a,, more difficult to identify.) The main drawback with the use of quantile restrictions to identify a0 is that the approach is apparently restricted to models with a scalar error term E, because of their lack of additivity (i.e. quantiles of convolutions are not generally the sums of the corresponding quantiles) as well as the ambiguity of a monotonicity restriction on the structural function in a multivariate setting. Estimators based upon quantile restrictions have been proposed for the linear regression, parametric transformation, binary response, ordered response and censored regression models, as described in Section 3 below. For values of x for which g(x,a,,e) is strictly increasing and differentiable at E = 0, the moment restriction given in (2.12) and monotonicity restriction (2.13) can be combined to obtain a conditional moment restriction for the observable data and unknown parameter aO. Let
b(x
a)
7
=
1
aE
then (2.12) immediately
=-
a&
-
aE
’
(2.15)
implies
E(&x,~~)C~- 11~ ~s(x,~o,O))llx) = ~Cm(y,x,ao)lxl=O.
(2.16)
Ch. 41: Estimation
of Semiparametric Models
2471
In principle, this conditional moment condition might be used directly to define a method-of-moments estimator for cr,; however, there are two drawbacks to this approach. First, the moment function m(.) defined above is necessarily a discontinuous function of the unknown parameters, complicating the asymptotic theory. More importantly, this moment condition is substantially weaker than the derived quantile restriction (2.14), since observations for which g(x, CX~, u) is not strictly increasing at u = 0 may still be useful in identifying the unknown parameters. As an extreme example, the binary response model has b(x, a,) = 0 with probability one under standard conditions, yet (2.14) can be sufficient to identify the parameters of interest even in this case (as discussed below). An alternative approach to estimation of c(~ can be based on a characterization of the nth conditional quantile as the solution to a particular expected loss minimization problem. Define
wh x; 4 = m,(Y - b) - P,(Y)lXl>
(2.17)
where p,(u) = u[7c - l(u
Qb; w(.),4 = NW(X)Ndx, a,O),x; 41= E{w(x)CP,(Y - dx, a,0)) - A(
> (2.18)
over the parameter space, where w(x) is any scalar, nonnegative function of x which has E[w(x).Ig(x,a,O)l] < co. For a particular structural function g(.), then, the unknown parameters will be identified if conditions on the error distribution, regressors, and weight function w(x) are imposed which ensure the uniqueness of the minimizer of Q(cc;w(.), n) in (2.18). Sufficient conditions are uniqueness of the rrth conditional quantile q0 = 0 of the error distribution and Pr{w(x) > 0, g(x, u, r~)# g(x, c1,,0)} > 0 whenever c1# ~1~. Given a sample {(y,, xi), i = 1,. . . , N} of observations on y and x, the sample analogue of the minimand in (2.18) is
QN(CC wt.),n) =
k.$1 W(Xi)Pn(yi- g(xi,m,OIL
(2.19)
L
where an additive constant which does not affect the minimization problem has been deleted. In general, the weight function w(x) may be allowed to depend upon
J.L. Powell
2472
nuisance parameters, w(x) E w(x, 6,), so a feasible weighted quantile estimator of CC~might be defined to minimize SN(a,q, G(.);x), with G(x) = w(x, $) for some preliminary estimator 6^of 6,. In the special case of a conditional median restriction (n = $), minimization of QN is equivalent to minimization of a weighted sum of absolute deviations criterion (2.20)
which, with w(x) 3 1, is the usual starting point for estimation of the particular models considered in the literature cited below. When the structural function g(.) is of the latent variable form (g(x, CL,&) = t(x’/3 + E,T)), the estimator oi which minimizes QJLY; cii,rr) will typically solve an approximate first-order condition, ,:
fl
k(Xi)[71 - l(y, < g(xi, oi,O))]b(Xi, a) ag(;;e,O)
r 0,
(2.21)
where b(x, CY)is defined in (2.15) and ag(.)/acr denotes the vector of left derivatives. (The equality is only approximate due to the nondifferentiability of p,(u) at zero and possible nondifferentiability of g(.) at c?; the symbol “G” in (2.21) means the left-hand side converges in probability to zero at an appropriate rate.) These equations are of the form
where the moment
d(X,
bi,8)3
W(Xi,
function
&Xi,
m(.) is defined in (2.16) and
d,jag’:&“’ O).
Thus the quantile minimization problem yields an analogue to the unconditional moment restriction E[m( y, x, cl,,) d(x, CI~,S,)] = 0, which follows from (2.16). As outlined in Section 1.4 above, under certain regularity conditions (given by Powell (1991)) the quantile estimator di will be asymptotically normal,
,/%a - ~0)5 A’“@,M,
’ Vo(Mb)- ‘),
(2.22)
where now
adx, uo,wm, ao,0) MO= E ./-@I4 w(x,&.Jm,ao)
1
au
aa
I
Ch. 41:
Estimation
2413
Models
qf Semiparametric
and V, = E ~(1 - rc) w2(x, 6,) b(x, CIJ
ag(x, a,,O)ag(x,
aa
ao, 0)
ad
I ’
for f(Ol x) being the conditional density of the “residual”y - g(x, clO,0) at zero (which appears from the differentiation of the expectation of the indicator function in (2.21)). The “regularity” conditions include invertibility of the matrix MO, which is identically zero for the binary and ordered response models; as shown by Kim and Pollard (1990), the rate of convergence of the estimator bi is slower than fi for these models. When (2.22) holds, an efficient choice of weight function w(x) for this problem is
w*(x) E .I-(0IX)?
(2.23)
for which the corresponding
JN@* - a,) J+J-(0,
estimator
c?* has
v*),
(2.24)
with
v* = n(l
- 7t) E i[
ah, ao,wdx, a,, 0) - 1 f2(Olx)b(x,ao). aa ad ’ 11
The matrix V* was shown by Newey and Powell (1990) to be the semiparametric efficiency bound for the linear and censored regression models with a conditional quantile restriction, and this is likely to be the case for a more general class of structural models. For the linear regression model g(x, c(~,E) 3 x’bo + E, estimation of the true coefficients PO using a least absolute deviations criterion dates from Laplace (1793); the extension to other quantile restrictions was proposed by Koenker and Bassett (1978). In this case b(x, CI)= 1 and ag(x, a, .s)/aa = x, which simplifies the asymptotic variance formulae. In the special case in which the conditional density of E = y - x’BO at zero is constant - f(Olx) = f. - the asymptotic covariance matrix of the quantile estimator B further simplifies to V*=rc(l
-~)[f~]-~(E[xx’]}-~.
(Of course, imposition of the additional restriction of a constant conditional density at zero may affect the semiparametric information bound for estimation of PO.) The monograph by Bloomfield and Steiger (1983) gives a detailed discussion of the
Ch. 41: Estimation
of Semiparametric
2475
Models
for some h(.) and all possible x, CIand E. Then the random function h(y, x, a) = h(g(x, Q,E),x, a) will also be symmetrically distributed about zero when CI= LX~, implying the conditional moment restriction
my, x, MO)Ixl
=
awx,
MO,4, XT@ON xl =
0.
(2.27)
As with the previous restrictions, the conditional moment restriction can be used to generate an unconditional moment equation of the form E[d(x) h( y, x, LY,)]= 0, with d(x) a conformable matrix of instruments with a number of rows equal to the number of components of 0~~.In general, the function d(x) can be a function of a and nuisance parameters S (possibly infinite-dimensional), so a semiparametric M-estimator biof ~1~can be constructed to solve the sample moment equations
O= i
,$
d(xi,
I
Oi,4 h(Yi, xi, Oi),
(2.28)
1
for s^an estimator of some nuisance parameters 6,. For structural functions g(x, M,E) which are invertible in the error terms, it is straightforward to find a transformation satisfying condition (2.26). Since E= e( y, x, ~1) is an odd function of E, h(.) can be chosen as this inverse function e(.). Even for noninvertible structural functions, it is still sometimes possible to find a “trimming” function h( .) which counteracts the asymmetry induced in the conditional distribution of y by the nonlinear transformation g(.). Examples discussed below include the censored and truncated regression models and a particular selectivity bias model. As with the quantile estimators described in a preceding section, the moment condition (2.27) is sometimes insufficient to identify the parameters go, since the “trimming” transformation h(.) may be identically zero when evaluated at certain values of c1in the parameter space. For example, the symmetrically censored least squares estimator proposed by Powell (1986b) for the censored regression model satisfies condition (2.27) with a function h(.) which is nonzero only when the fitted regression function x$ exceeds the censoring point (zero), so that the sample moment equation (2.28) will be trivially satisfied if fl is chosen so that x$ is nonpositive for all observations. In this case, the estimator /? was defined not only as a solution to a sample moment condition of the form (2.28), but in terms of a particular minimization problem b = argmino &(/I) which yields (2.28) as a firstorder condition. The limiting minimand was shown to have a unique minimizer at /IO, even though the limiting first-order conditions have multiple solutions; thus, this further restriction on the acceptable solutions to the first-order condition was enough to ensure consistency of the estimator ,!?for PO.Construction of an analogous minimization problem might be necessary to fully exploit the symmetry restriction for other structural models. Once consistency of a particular estimator di satisfying (2.28) is established, the asymptotic distribution theory immediately follows from the GMM formulae pre-
J.L. Powell
2476
sented in Section 2.1 above. For a particular choice of h(.), the form of the sample moment condition (2.28) is the same as condition (2.6) of Section 2.2 above, replacing the inverse transformation “e(.)” with the more general “h(.)” here; thus, the form of the asymptotically normal distribution of 6i satisfying (2.28) is given by (2.7) of Section 2.2, again replacing “e(.)” with “h(.)“. Of course, the choice of the symmetrizing transformation h(.) is not unique - given any h(.) satisfying (2.26), another transformation h*( y, x, U) = I(h( y, x, CI),x, U) will also satisfy (2.26) if I(u, x, a) is an odd function of u for all x and CI.This multiplicity of possible symmetrizing transformations complicates the derivation of the semiparametric efficiency bounds for estimation of ~1~under the symmetry restriction, which are typically derived on a case-by-case basis. For example, Newey (1991) derived the semiparametric efficiency bounds for the censored and truncated regression models under the conditional symmetry restriction (2.25), and indicated how efficient estimators for these models might be constructed. For ,the linear regression model g(x, cue,E) E x’b + E, the efficient symmetrizing transformation h(y, x, B) is the derivative of the log-density of E given x, evaluated at the residual y - x’j, with optimal instruments equal to the regressors x: h*(~,x,p)=alnf~,~(y--‘BIx)la&,
d*(x, p, 6) = x.
Here an efficient estimator might be constructed using a nonparametric estimator of the conditional density of E given x, itself based on residuals e”= y - x’g from a preliminary fit of the model. Alternatively, as proposed by Cragg (1983) and Newey (1988a), an efficient estimator might be constructed as a sequence of GMM estimators, based on a growing number of transformation functions h(.) and instrument sets d(.), which are chosen to ensure that the sequence of GMM influence functions can approximate the influence function for the optimal estimator arbitrarily well. In either case, the efficient estimator would be “adaptive” for the linear model, since it would be asymptotically equivalent to the maximum likelihood estimator with known error density.
2.4.
Independence
restrictions
Perhaps the most commonly-imposed semiparametric restriction of independence of the error terms and the regressors, Pr(si < ;1Ixi} = Pr(s, < A}
for all real 2, w.p.1.
is the assumption
(2.29)
Like conditional symmetry restrictions, this condition implies constancy of the conditional mean and median (as well as the conditional mode), so estimators which are consistent under these weaker restrictions are equally applicable here. In fact, for models which are invertible in the errors (E E e(y,x, cle) for some e(.)), a large
Ch. 41: Estimation
of Semiparametric
class of GMM
estimators
2417
Models
is available,
based upon the general
E(d(x)Cl(e(y,x,cr,))-v,l} =O
moment
condition (2.30)
for any conformable functions d(.) and I(.) for which the moment in (2.30) is well-defined, with v,, = EC/(s)]. (MaCurdy (1982) and Newey (1988a) discuss how to exploit these restrictions to obtain more efficient estimators of linear regression coefficients.) Independence restrictions are also stronger than the index and exclusion restrictions to be discussed in the next section, so estimation approaches based upon those restrictions will be relevant here. In addition to estimation approaches based on these weaker implied stochastic restrictions, certain approaches specific to independence restrictions have been proposed. One strategy to estimate the unknown parameters involves maximization of a “feasible” version of the log-likelihood function, in which the unknown distribution function of the errors is replaced by a (preliminary or concomitant) nonparametric estimator. For some structural functions (in particular, discrete response models), the conditional likelihood function for the observable data depends only on the cumulative distribution function FE(.) of the error terms, and not its derivative (density). Since cumulative distribution functions are bounded and satisfy certain monotonicity restrictions, the set of possible c.d.f.‘s will be compact with respect to an appropriately chosen topology, so in such cases an estimator of the parameters of interest CI~can be defined by maximization of the log-likelihood simultaneously over the finite-dimensional parameter c1and the infinite-dimensional nuisance parameter F,( .). That is, if f( y Ix, a, FE(.)) is the conditional density of y given x and the unknown parameters cl0 and F, (with respect to a fixed measure pLy),a nonparametric maximum likelihood (NPML) estimator for the parameters can be defined as
= argmax 1 $J Inf(yiIxi,cr,F(.)), at~,~Ep IV i= 1
(2.31)
where p is the space of admissible c.d.f.‘s. Such estimators were proposed by, e.g. Cosslett (1983) for the binary response model and Heckman and Singer (1984) for a duration model with unobserved heterogeneity. Consistency of 6i can be established by verification of the Kiefer and Wolfowitz (1956) conditions for consistency of NPML estimation; however, an asymptotic distribution theory for such estimators has not yet been developed, so the form of the influence function for 6i (if it exists) has not yet been rigorously established. When the likelihood function of the dependent variable y depends, at least for some observations, on the density function f,(e) = dF,(e)/de of the error terms, the joint maximization problem given in (2.31) can be ill-posed: spurious maxima (at infinity) can be obtained by sending the (unbounded) density estimator Te to infinity at particular points (depending on c1and the data). In such cases, nonparametric density estimation techniques are sometimes used to obtain a preliminary estimator
2419
Ch. 41: Estimation of Semiparametric Models
and identically distributed random variables are symmetrically distributed about zero. For a particular structural model y = g(x, CC, E), the first step in the construction of a pairwise difference estimator is to find some transformation e(z,, zj, a) E eij(a) of pairs of observations (zi, zj) 3 (( yi, xi), (yj, xi)) and the parameter vector so that, conditional on the regressors xi and xj, the transformations eij(crO) and eji(cr,) are identically distributed, i.e. =Y(eij(ao)lXi,
xj)
=
~(eji(Q)lXi,
xj)
as.,
(2.35)
where LZ(.l.) denotes the conditional sampling distribution of the random variable. In order for the parameter a0 to be identified using this transformation, it must also be true that 9(eij(a,)Ixi, xj) # _Y(eji(a,)Ixi, xj) with positive probability if a1 # ao, which implies that observations i andj cannot enter symmetrically in the function e(zi,zj,a). Since si and sj are assumed to be mutually independent given xi and Xi, eij(a) and eji(a) will be conditionally independent given xi and xj; thus, if (2.35) is satisfied, then the difference eij(a) - eji(a) will be symmetrically distributed about zero, conditionally on xi and xi, when evaluated at a = a,,. Given an odd function {(.) (which, in general, might depend on xi and xj), the conditional symmetry of eij(a) - eji(a) implies the conditional moment restriction
E[S(eij(%J- ~ji(%))I~i~xjl = O a.s.,
(2.36)
provided this expectation exists, and a0 will be identified using this restriction if it fails to hold when a # ao. When [(.) is taken to be the identity mapping t(d) = d, the restriction that eij(ao) and eji(ae) have identical conditional distributions can be weakened to the restriction that they have identical conditional means, E[eij(a,)IXi,
Xjl
=
ECeji(ao)lXi,
Xjl a.s.,
(2.37)
which may not require independence of the errors Ei and regressors xi, depending on the form of the transformation e(.). Given an appropriate (integrable) vector /(xi, xj, a) of functions of the regressors and parameter vector, this yields the unconditional moment restrictions (2.38) which can be used as a basis for estimation. If Z(.) is chosen to have the same dimension as a, a method-of-moments estimator bi of a0 can be defined as the solution to the sample analogue of this population moment condition, namely,
02”-IiTj
4teijCbi) -
eji(d))l(Xi,
Xj,
di) = 0
(2.39)
J.L. Powell
2480
(which may only approximately hold if t(eij(a) - eji(M))is discontinuous in E). For many models (e.g. those depending on a latent variable y* E g(xi, a) + ci), it is possible to construct some minimization problem which has this sample moment condition as a first-order condition, i.e. for some function s(zi, zj, IX)with
as(z:azj’ ‘) =((eij(a) -
eji(a))l(xi,xj,
a),
the estimator d might alternatively be defined as bi= argmin ; aE@ 0
(2.40)
-I 1 S(Zi,Zj$). iij
A simple example of a model which is amenable to the pairwise differencing approach is the linear model, yi = x:/I0 + ci, where gi and xi are assumed to be independent. For this case, one transformation function which satisfies the requirements above is 4Yi,
xi, xj, Coc
Yi -
X:B,
which does not depend on xP Choosing l(x,, xj, ~1)= xi - xi, a pairwise difference estimator of /&,can be defined to solve
0i -’
1 (((yi - yj) -(Xi
- Xj))fi)(xi -
xj)
E
OT
i<j
or, if E is the antiderivative of r, to minimize
0
&l(B)= ;
-’ ~j~((Yi-Yj)-(xi-xj)lB).
When &I) = d, the estimator fiis algebraically equal to the slope coefficient estimators of a classical least squares regression of yi on Xi and a constant (unless some normalization on the location of the distribution of ci is imposed, a constant term is not identified by the independence restriction). When t(d) = sgn(d), j? is a rank regression estimator which sets the sample covariance of the regressors xi with the ranks of the residuals yi - x$ equal (approximately) to zero (JureEkovB (1971), Jaeckel(l972)). The same general approach has been used to construct estimators for discrete response models and censored and truncated regression models. In all of these cases, the pairwise difference estimator diis defined as a minimizer of a second-order U-statistic of the form
2481
Ch. 41: Estimation of Semiparametric Models
(with zi 3 ( yi, xi)), and will solve an approximate first-order condition
0 -’ n 2
C q(Zi,Zj,6i)=",(n-"2), icj
where q(.) = ap(.)/aa when this derivative is well-defined. As described in Section 1.4 above, the asymptotic normal distribution of the estimator 6i can be derived from the asymptotically linear representation
h = %3 -
m t H, l r(zi, cto)+ n
o&n- l/2),
(2.41)
i=l
where r(zj, LX) E E[q(zi, zj, CY)/ zi] and
The pairwise comparison approach is also useful for construction of estimators for certain nonlinear panel data models. In this setting functions of pairs of observations are constructed, not across individuals, but over time for each individual. In the simplest case, where only two observations across time are available for each individual, a moment condition analogous to (2.36) is ECS(elz,i(~o)
- ezl,i(~o))IXil9
xi21 = 0
a.s.,
(2.42)
where now ei2,Ja) - e(zil, zi2, a) for th e same types of transformation functions e(.) described above, and where the second subscripts on the random variables denote the respective time periods. To obtain the restriction (2.42), it is not necessary for the error terms Ei= (sil, ci2) to be independent of the regressors xi = (xii, xi2) across individuals i; it suffices that the components sil and si2 are mutually independent and identically distributed across time, given the regressors xi. The pairwise differencing approach, when it is applicable to panel data, has the added advantage that it automatically adjusts for the presence of individual-specific fixed effects, since Eil + yi and Ei2+ yi will be identically distributed if sil and si2 are. A familiar example is the estimation of the coefficients /IOin the linear fixed-effects model Yit
=
XIrbO
+
Yi
+
&it,
t=
where setting the transformation in the moment condition
1,2, e12Jcl) = yi, - xi1 /I and 5(u) = u in (2.42) results
J.L. Powell
2482
which is the basis for the traditional least squares fixed effects estimator. As described in Section 3.5 below, this idea has been exploited to construct estimators for panel data versions of the binary response and censored and truncated regression models which are semiparametric with respect to both the error distribution and the distribution of the fixed effects.
2.5.
Exclusion and index restrictions
Construction of estimators based on index restrictions can be based on a variety of different approaches, depending upon whether the index function u(x) is completely known or depends upon (finite- or infinite-dimensional) unknown parameters, and whether the index sufficiency condition is of the “weak” (affecting only the conditional mean or median) or “strong” (applying to the entire error distribution) form. Estimators of the parameters of interest under mean index restrictions exploit modified forms of the moment conditions implied by the stronger constant conditional mean restrictions, just as estimators under distributional index restrictions use modifications of estimation strategies for independence restrictions. Perhaps the simplest version of the restrictions to analyze are mean exclusion restrictions, for which the index function is a subset of the regressors (i.e. u(x) E x1, where x = (xi, xi)‘), so that the restriction is E[elx]
= E[E[x,]
a.s.
(2.43)
As for conditional mean restrictions, this condition can be used to identify the parameters of interest, 01~,for structural functions y = g(x, a,, E) which are invertible in the error terms (E = e(y,x, a,,)), so that the exclusion restriction (2.43) can be rewritten as
EC4y,x,aO)Ixl-ECe(~,x,a~)lx~l=O. By iterated expectations, is analogous to condition
this implies an unconditional (2.4) of Section 2.1, namely,
(2.44) moment
restriction
which
(2.45) where now (2.46) for any conformable matrix d(x) and square matrix A(x) of functions of the regressors for-which the relevant expectations and inverses exist. (Note that, by construction, E[d(x)lx,] = 0 almost surely.) Alternatively, estimation might be based on the
ofSemiparametric Models
Ch. 41: Estimation
2483
condition 0=
EC&)ay,x, cl,)l,
(2.47)
where, analogously to (2.46),
Given a particular nonparametric method for estimation of conditional means given x1 (denoted E[*lx,]), a semiparametric M-estimator 61of the structural coefficients c1ecan be defined as the solution to a sample analogue of (2.45), 0=
j!$ ,${d(xi,4 s3- E[d(xi,4 s31xi1] (i[A(xi)lxil])- ’ A(xi)}e(.Yi,
Xi,
I
a),
1
(2.48) where the instrumental variable matrix d(x) is permitted to depend upon LXand a preliminary nuisance parameter estimator 8, as in Section 2.2. Formally, the asymptotic distribution of this estimator is given by the same expression (2.7) for estimation with conditional mean restrictions, replacing d with 2 throughout. However, rigorous verification of the consistency and asymptotic normality of dzis technically difficult, and the estimating equation (2.48) must often be modified to, “trim” (i.e. delete) observations where the nonparametric regression estimator EC.1 is imprecise. A bound on the attainable efficiency of estimators of t1e under condition (2.44) was derived by Chamberlain (1992), who showed that an optimal instrumental variable matrix d”*(x)of the form (2.46) is related to the corresponding optimal instrument matrix d*(x) for the constant conditional moment restrictions of Section 2.2 by the formula
d”(x)= d*(x)
- E[d*(x)lx,]
[E{ [Z(x)]-‘lxl}]-’
[Z(x)]-
‘,
(2.49)
where d*(x) is defined in (2.8) above and E(x) is the conditional covariance matrix of the errors s given the regressors x. This formula directly generalizes to the case in which the subvector x1 is replaced by a more general (but known) index function u(x). For a linear model y = x!J& + E, the mean exclusion restriction (2.43) yields the semilinear model considered by Robinson (1988): Y = @cl + w% I+ % where 0(x,) - E[E/x,] and E[qlx] = E[E - 0(x,)1x] = 0. Defining y - xi/I, d(x) 3 x2, and A = I, the moment condition (2.47) becomes
e(y,x,cl) E
J.L. Powell
2484
which can be solved for PO:
Robinson (1988) proposed an estimator of /IO constructed from a sample analogue to (2.47), using kernel regression to nonparametrically estimate the conditional expectations and “trimming” observations where a nonparametric estimator of the density of x1 (assumed continuously distributed) is close to zero and gave conditions under which the resulting estimator was root-N-consistent and asymptotically normal. Linton (1992) constructs higher-order approximations to the distribution of this estimator. Strengthening the mean exclusion restriction to a distributional exclusion condition widens the class of moment restrictions which can be exploited when the structural function is invertible in the errors. Imposing Pr{s < ulx} = Pr{s bulx,) for all possible
0=
(2.50)
values of u yields the general moment
conditions
EC&444x x, d)l
(2.51)
for any square-integrable function I(E)of the errors, which includes (2.45) as a special case. As with independence restrictions, precision of estimators of a, can be improved by judicious choice of the transformation I(.). Even for noninvertible structural functions, the pairwise comparison approach considered for index restrictions can be modified to be applicable for distributional exclusion (or known index) restrictions. For any pair of observations zi and zj which have the same value of the index function u(xi) = u(xj), the corresponding error terms si and sj will be independently and identically distributed, given the regressors xi and xj, under the distributional index restriction Pr{.s < ulx} = Pr{s < ulu(x)>.
(2.52)
Given the pairwise transformation function e(z,, zj, a) = eij@) described in the previous section, an analogue to restriction (2.35) holds under this additional restriction of equality of index functions: T(eij(cco))xi, xj) = Y(eji(M,)lXi, Xj) As for independence tion E[eij(cr,)Ixi,
restrictions,
a.s. if
U(Xi)
=
U(Xj).
(2.53) implies the weaker conditional
xi] = E[eji(M,)IXi, Xj]
a.s. if
U(Xi)
=
U(Xj),
(2.53) mean restric-
(2.54)
Ch. 41: Estimation of‘Semiparametric
2485
Models
which is relevant for invertible structural functions (with eij(a) equated with the inverse function e( yi, xi, a) in this case). These restrictions suggest estimation of a, by modifying the estimating equation (2.39) or the minimization problem (2.40) of the preceding subsection to exclude pairs of observations for which u(xi) # D(x~). However, in general U(Xi)- U(Xj) may be continuously distributed around zero, so direct imposition of this restriction would exclude all pairs of observations. Still, if the sampling distributions LZ’(eij(Uo) (xi, xj, u(xi) - u(xj) = c) or conditional expectations E[eij(Eo)l xi, xj, u(xi) - u(xj) = c] are smooth functions of c at c = 0, the restrictions (2.53) or (2.54) will approximately hold if u(xi) - u(xj) is close to zero. Then appropriate modifications of the estimating equations (2.39) and minimization problem (2.40) are
0 *:
-I
iTj
4(eij(d)
-
eji(d))
l(Xi,
Xj,
d) WN(U(Xi)
-
U(Xj))
=
()
(2.55)
and
oi = argmin &Q
0
T
-
1 iTj Nzi2 zj9 Co wN("(xi)-
u(xj))3
(2.56)
for some weighting function wN(.) which tends to zero as the magnitude of its argument increases and, at a faster rate, as the sample size N increases (so that, ultimately, only observations with u(xi) - u(xj) very close to zero are included in the summations). Returning to the semilinear regression model y = xi& + 0(x,) + 4, E[qlx] = 0, the same transformation as used in the previous subsection can be used to construct a pairwise difference, provided the nonparametric components B(xil) and /3(x,J are equal for the two observations; that is, if e( yi, xi, Xi, CI)= eij(a) = yi - xQ and u(xi) = xii, then
if u(xi) = D(Xj). Provided B(x,r) is a smooth (continuous and differentiable) function, relation (2.36) will hold approximately if xi1 E xjI. Defining the weight function w,,&.) to be a traditional kernel weight, WN(d)= k(h, l d),
k(O)>O,k(ll)+Oas
IIAII+oO,hN+OasN+cO,
(2.57)
and+ taking l(x,, xj, CC) = xiZ - xj2 and t(d) = d, a pairwise difference estimator of PO using either (2.55) or (2.56) reduces to a weighted least squares regression of the distinct differences (yi - yj) in dependent variables on the differences (Xi2 - xj2) in regressors, using k(h,‘(xi, - xjI)) as weights (as proposed by Powell (1987)).
Consistency of the resulting estimator a requires only the weak exclusion restriction (2.43); when the strong exclusion restriction (2.53) is imposed, other choices of odd function t(d) besides the identity function are permissible in (2.55). Thus, an estimator of Do using t(d) = sgn(d) might solve N
0 2
’ iTj sgn((yi - yj) - (xi1 - xjl)‘g)(xil
- Xjl)k((Xi2 - Xj2)lhN) E 0.
This is the first-order condition of a “smoothed” problem defining the rank regression estimator,
b = argmin : 0 a
version
(2.5’)
of the minimization
- ’ iTj I(Yi - Yj) ~ (xi - xj)‘B Ik((xi, - Xjz)/hN),
(2.59)
which is a “robust” alternative to estimators proposed by Robinson (1988b) and Powell (1987) for the semilinear model. Although the asymptotic theory for such estimators has yet to be developed, it is likely that reasonable conditions can be found to ensure their root-N-consistency and asymptotic normality. So far, the discussion has been limited to models with known index functions u(x). When the index function depends upon unknown parameters 6, which are functionally unrelated to the parameters of interest rxe,and when preliminary consistent estimators s^ of 6, are available, the estimators described above are easily adapted to use an estimated index function O(x) = u(x, 8). The asymptotic distribution theory for the resulting estimator must properly account for the variability of the preliminary estimator $. When 6, is related to a,, and that relation is exploited in the construction of an estimator of CI~, the foregoing estimation theory requires more substantial modification, both conceptually and technically. A leading special case occurs when the index governing the conditional error distribution appears in the same form in the structural function for the dependent variable y. For example, suppose the structural function has a linear latent variable form, Y=
57(x, %I, 4 = WPo + 4,
(2.60)
and index u(x) is the latent linear regression Pr(s d ulx} = Pr{s <
x’po, (2.61)
4x’Po).
This particular index restriction implies the same index restriction Pr{ydulx}=Pr(ydulx’Bo),
function
on the unobservable error for the observable dependent
terms immediately variable, (2.62)
Ch. 41: Estimation
qf Semiparametric
2487
Models
which can be used to generate moment restrictions example, (2.62) implies the weaker restriction
for estimation
of PO. For
(2.63)
ECYIXI =x;(x’P,),
on the conditional mean of the dependent variable, where G(.) is some unknown nuisance function. (Clearly &, is at most identified up to a location and scale normalization without stronger restrictions on the form of G(.).) Defining Z(y, x, b) y - E[ y Jx’b], condition (2.63) implies that
J34x) Z(Y, x2 /Ml
=
(2.64)
0
for any conformable, square-integrable d(x). Thus, with a nonparametric estimator @ylx’b] of the conditional expectation function E[ylx’b], a semiparametric Mestimator of /I0 can be constructed as a sample analogue to (2.64). Alternatively, a weighted pairwise difference approach might be used: assuming G(.) is continuous, the difference in the conditional means of the dependent variables for observations i and j satisfies E[ yi - yj 1xi, xj] = G(x&)
- G(x;/3,) g 0
if
x&
z x;p,,.
(2.65)
So by estimating E[ yi - yj (Xi, xj] nonparametrically and determining when it is near zero, the corresponding pair of observations will have (xi - xj)‘& g 0, which is useful in determining &,. When G(.) is known to be monotonic (which follows, for example, if the transformation t(.) of the latent variable in (2.60) is monotonic and E is assumed to be independent of x), a variation on the pairwise comparison approach could exploit the resulting inequality E[y, - yjl xi, xj] = G(x&) - G(x>P,) > 0 only if x$,, > x&. Various estimators based upon these conditions have been proposed for the monotone regression model, as discussed in Section 3.2 below. More complicated examples involve multiple indices, with some indices depending upon parameters of interest and others depending upon unrelated nuisance parameters, as for some of the proposed estimators for selectivity bias models. The methods of estimation of the structural parameters ~1~vary across the particular models but generally involve nonparametric estimation of regression or density functions involving the index u(x).
3. 3.1.
Structural models Discrete response models
The parameters
of the binary
y = 1(x’& + E > 0)
response
model (3.1)
J.L. Powell
2488
are traditionally
estimated
by maximization
of the average log-likelihood
function
(3.2)
where the error term E is assumed to be distributed independently of x with known distribution function F(.) (typically standard normal or logistic). Estimators for semiparametric versions of the binary response model usually involve maximization of a modified form of this log-likelihood, one which does not presuppose knowledge of the distribution of the errors. For the more general multinomial response model, in which J indicator variables { yj, j = 1,. . . , J} are generated as yj=l{x’fl~+~j>x’&++Ek the average log-likelihood
~N(P,..., BJ;
F, = i
forall
k#j},
has the analogous
itl
j$l
YijlnCFj(x$‘,
(3.3)
form
. . . , XipJ)],
(3.4)
where Fj(.) is the conditional probability that yj = 1 given the regressors x. This form easily specializes to the ordered response or grouped dependent variable models, replacing Fj(.) with F(x& - cj) - F(x$,, - cj_ r), where the {cj} are the (known or unknown) group boundaries. The earliest example of a semiparametric approach for estimation of a limited dependent variable model in econometrics is the maximum score estimation method proposed by Manski (1975). For the binary response mode, Manski suggested that PO be estimated by maximizing the number_of correct predictions of y by the sign of the latent regression function x’p; that is, /I was defined to maximize the predictive score function
(3.5)
i=l
over a suitable parameter space 0 (e.g. the unit sphere). The error terms E were restricted to have conditional median zero to ensure consistency of the estimator. A later interpretation of the estimator (Manski (1985)) characterized the maximum score estimator p^as a least absolute deviations estimator, since the estimator solved the minimization problem
b = arg:in
A i$r I Yi - 1 tX:B >
Ol1.
(3.6)
Ch. 41: Estimation
of Semiparametric
Models
2489
This led to the extension of the maximum score idea to more general quantile estimation of /?,,, under the assumption that the corresponding conditional quantile of the error terms was constant (Manski (1985)). The maximum score approach was also applied to the multinomial response model by Manski (1975); in this case, the score criterion becomes
and its consistency was established under the stronger condition of mutual independence of the alternative specific errors (ej}. M. Lee (1992) used conditional median restrictions to define a least absolute deviations estimator of the parameters of the ordered response model along the same lines. Although consistency of the maximum score estimator for binary response was rigorously established by Manski (1985) and Amemiya (1985), its asymptotic distribution cannot be established by the methods described in Section 2.2 above, because of lack of continuity of the median regression function 1{x’j?, > 0} of the dependent variable y. More importantly, because this median regression function is flat except at its discontinuity points, the estimator is not root-N-consistent under standard regularity conditions on the errors and regressors. Kim and Pollard (1990) found that the rate of convergence of the maximum score estimator to j?,, under such conditions is N1/3, with a nonstandard asymptotic distribution (involving the distribution of the maximum value of a particular Gaussian process with quadratic drift). This result was confirmed for finite samples by the simulation study of Manski and Thompson (1986). Chamberlain (1986) showed that this slow rate of convergence of the maximum score estimator was not particular to the estimation method, but a general consequence of estimation of the binary response model with a conditional median restriction. Chamberlain showed that the semiparametric version of the information matrix for this model is identically zero, so that no regular root-N-consistent estimator of /I?,,exists in this case. An extension by Zheng (1992) derived the same result - a zero semiparametric information matrix - even if the conditional median restriction is strengthened to an assumption of conditional symmetry of the error distribution. Still, consistency of the maximum score estimator fi illustrates the fact that the parameters flc,of the binary response model are identified under conditional quantile or symmetry assumptions on the error terms, which is not the case if the errors are restricted only to have constant conditional mean. If additional smoothness restrictions on the distribution of the errors and regressors are imposed, the maximum score (quantile) approach can be modified to obtain estimators which converge to the true parameters at a faster rate than N113. Nawata (1992) proposed an estimator which, in essence, estimates f10by maximizing the fit of an estimator of the conditional median function 1(x’& > 0) of the binary variable to a nonparametric estimator of the conditional median of y given x. In a
J.L.Powell
2490
first stage, the observations are grouped by a partition of the space of regressors, and the median value of the dependent variable y is calculated for each of these regressor bins. These group medians, along with the average value of the regression vector in each group, are treated as raw data in a second-stage fit of the binary response model using the likelihood function (3.2) with a standard normal cumulative and a correction for heteroskedasticity induced by the grouping scheme. Nawata (1992) gives conditions under which the rate of convergence of the resulting estimator is N2’5, and indicates how the estimator and regularity conditions can be modified to achieve a rate of convergence arbitrarily close to N”‘. Horowitz (1992) used a different approach, but similar strengthening of the regularity conditions, to obtain a median estimator for binary response with a faster convergence rate. Horowitz modifies the score function of (3.5) by replacing the conditional median function l{x’/I > 0} by a “smoothed” version, so that an estimator of /I,, is defined as a minimizer of the criterion
s,*(P)=
iilYi K(x:B/hN)+ t1 - Yi) Cl -
K(x~B/hN)l~
(3.8)
where K(.) is a smooth function in [0, l] with K(u)+0 or 1 as U+ - co or co, and h, is a sequence of bandwidths which tends to zero as the sample size increases (so that K(x’&/h,) approaches the binary median 1(x’& > 0) as N + co). With particular conditions on the function K(.) and the smoothness of the regressor distribution and with the conditional density of the errors at the median being zero, Horowitz (1992) shows how the rate of convergence of the minimizer of S;G(fi) over 0 can be made at least N2” and arbitrarily close to N”2; moreover, asymptotic normality of the resulting estimator is shown (and consistent estimators of asymptotic bias and covariance terms are provided), so that normal sampling theory can be used to construct confidence regions and hypothesis tests in large samples. When the error terms in the binary response model are assumed to satisfy the stronger assumption of independence of the errors and regressors, Cosslett (1987) showed that the semiparametric information matrix for estimation of fiO in (3.1) (once a suitable normalization is imposed) is generally nonsingular, a necessary condition for existence of a regular root-N-consistent estimator. Its form is analogous to the parametric information matrix when the distribution function F(.) of the errors is known, except that the regressors x are replaced by deviations from their conditional means given the latent regression function x’&; that is, the best attainable asymptotic covariance matrix for a regular estimator of &, when E is independent of x with unknown distribution function F(.) is Cf (x’Bo)12
wm where f(u) = dF(u)/du
- wm1
[5Z- E(:lx’&)]
and Z?is the subvector
II
(3.9)
x which eliminates
the
[Z - E(i(x’&,)]’
of regressors
-l,
Ch. 41: Estimution
~JSemipurametric
2491
Models
last component (whose coefficient is assumed normalized to unity to pin down the scale of /IO). Existence of the inverse in (3.9) implies that a constant term is excluded from the regression vector, and the corresponding intercept term is absorbed into the definition of the error cumulative F(.). For the binary response model under an index restriction, Cosslett (1983) proposed a nonparametric maximum likelihood estimator (NPMLE) of j3e through maximization of the average log-likelihood function _Y,,@‘;F) simultaneously over BE 0 and FEN, where g is the space of possible cumulative distributions (monotonic functions on [0,11). Computationally, given a particular trial value b of fi, an estimator of F is obtained by monotonic regression of the indicator y on x’b, using the pool adjacent violators algorithm of isotonic regression; this estimator F^ of F is then substituted into the likelihood function, and the concentrated criterion SY,(b; F) is maximized over bE O= {/I: )//31)= 1 }. Cosslett (1983) establishes consistency of the resulting estimators of j?, and F(.) through verification of the Kiefer-Wolfowitz (1956) conditions for the consistency of NPMLE, constructing a topology which ensures compactness of the parameter space B of possible nuisance functions F(.). As noted in Section 2.4 above, an asymptotic distribution for NMPLE has not yet been established. Instead of the monotonic regression estimator F(.) of F(.) implicit in the construction of the NPMLE, the same estimation approach can be based upon other nonparametric estimators of the error cumulative. The resulting projle likelihood estimator of /IO, maximizing ZP,(b; F) of (3.2) using a kernel regression estimator F, was considered by Severini and Wong (1987a) (for a single parameter) and Klein and Spady (1993). Because kernel regression does not impose monotonicity of the function estimator, this profile likelihood estimator is valid under a weaker index restriction on the error distribution Pr{.s < u/x} = Pr{& < u[x’&,}, which implies that E[ ~1x1 = F(x’/?,) for some (not necessarily monotone) function F(.). Theoretically, the form of the profile likelihood TN(b;@ is modified by Klein and Spady (1993) to “trim” observations with imprecise estimators of F(.) in order to show root-N-consistency and asymptotic normality of the resulting estimator p. Klein and Spady show that this estimator is asymptotically efficient under the assumption of independence of the errors and regressors, since its asymptotic covariance matrix equals the best attainable value V* of (3.9) under this restriction. Other estimators of the parameters of the binary response model have been proposed which do not exploit the particular structure of the binary response model, but instead are based upon general properties of transformation models. If indepen-
dence of the errors and regressors is assumed, the monotonicity function (3.1) in E can be used to define a pairwise comparison Imposition
of a weaker index restriction
ECYIXI= WA,)
of the structural
estimator of Do. Pr{s < u (x] = Pr{s < ~1x’p,} implies that (3.10)
for some unknown function G(.), so any estimator which is based on this restriction
J.L. Powell
2492
is applicable to the binary response model. A number of estimators proposed for this more general setup are discussed in the following section on transformation models. Estimation of the multinomial response model (3.3) under independence and index restrictions can be based on natural extensions of the methods for the binary response model. In addition to the maximum score estimator defined by minimizing (3.7), Thompson (1989a, b) considered identification and estimation of the parameters in (3.3) assuming independence of the errors and regressors; Thompson showed how consistent estimators of (/?A,. . . , /I”,) could be constructed using a least squares criterion even if only a single element yj of the vector of choice indicators (y,, . . . , yj) is observed. L. Lee (1991) extended profile likelihood estimation to the multinomial response model, and obtained a similar efficiency result to Klein and Spady’s (1993) result for binary response under index restrictions on the error terms. And, as for the binary response model, various pairwise comparison or index restriction estimators for multiple index models are applicable to the multinomial response model; these estimators are reviewed in the next section.
3.2.
Transformation models
In Section 1.3 above, two general classes of transformation models were distinguished. Parametric transformation models, in which the relation between the latent and observed dependent variables is invertible and of known parametric form, are traditionally estimated assuming the errors are independent of the regressors with density function f(.;r) of known parametric form. In this setting, the average conditional log-likelihood function for the dependent variable y = t(x’&
+
E;
&JO& = t - l (Y; &)
- x’Po= 4x x, PO,2,)
k.z (InCf(e(Yi, 1
B,4; r)l - ln CladYi,
(3.11)
is
ThdP,A ? f) =
xi,
Xi,
B,2yay I]),
I
(3.12) which is maximized over 8 = (B, ;1,r) to obtain estimators of the parameters /IO and 2, of interest. Given both the monotonicity of the transformation t(.) in the latent variable and the explicit representation function e(.) for the errors in terms of the observable variables and unknown parameters, these models are amenable to estimation under most of the semiparametric restrictions on the error distribution discussed in Section 2. For example, Amemiya and Powell (1981) considered nonlinear twostage least squares (method-of-moments) estimation of /IO and A,, for the Box-Cox
Ch. 41: Estimation
of Semiparametric
Models
2493
transformation under a conditional mean restriction on the errors E given the regressors x, and showed how this estimator could greatly outperform (in a meansquared-error sense) a misspecified Gaussian ML estimator over some ranges of the transformation parameter &. Carroll and Ruppert (1984) and Powell (1991) discuss least absolute deviations and quantile estimators of the Box-Cox regression model, imposing independence or constant quantile restrictions on the errors. Han (1987b) also assumes independence of the errors and regressors, and constructs a pairwise difference estimator of the transformation parameter 2, and the slope coefficients &, which involves maximization of a fourth-order U-statistic; this approach is a natural generalization of the maximum rank correlation estimation method described below. Newey (1989~) constructs efficient method-of-moments estimators for the BoxxCox regression model under conditional mean, symmetry, and independence restrictions on the error terms. Though not yet considered in the econometric literature, it would be straightforward to extend the general estimation strategies described in Section 2.5 above to estimate the parameters of interest in a semilinear variant of the BoxxCox regression model. When the form of the transformation function t(.) in (3.11) is not parametrically specified (i.e. the transformation itself is an infinite-dimensional nuisance parameter), estimation of &, becomes more problematic, since some of the semiparametric restrictions on the errors no longer suffice to identify /I,, (which is, at most, uniquely determined up to a scale normalization). For instance, since a special case is the binary response model, it is clear from the discussion of the previous section that a conditional mean restriction on E is insufficient to identify the parameters of interest. Conversely, any dependent variable generated from an unknown (nonconstant and monotonic) transformation can be further transformed to a binary response model, so that identification of the parameters of a binary response model generally implies identification of the parameters of an analogous transformation model. Under the assumption of independence of the errors and regressors, Han (1987a) proposed a pairwise comparison estimator, termed the maximum rank correlation estimator, for the model (3.11) with t(.) unknown but nondecreasing. Han actually considered a generalization of (3.1 l), the generalized regression model, with structural function Y =
tCs(x’Bo, 41,
(3.13)
with t[.] a monotone (but possibly roninvertible) function and s(.) smooth and invertible in both of its arguments; with continuity and unbounded support of the error distribution, this construction ensures that the support of y will not depend upon the unknown parameters &,. Though the discussion below focusses on the special case s(x’ /I, E) = x’fi + E, the same arguments apply to this, more general, setup. For model (3.11), with t(.) unknown and E and x assumed independent, Han proposed estimation of /I,, by maximization of
J.L. Powell
2494
=0“1
-IN-1
RN(b)
N ’
x;P)+ l(Yi <
Yj)
ltx:P < x;B)l (3.14)
over a suitably-restricted parameter space @(e.g. normalizing one of the components of & to unity). Maximization of (3.14) is equivalent to minimization of a least absolute deviations criterion for the sign of yi - yj minus its median, the sign of xi/? - xi/?, for those observations with nonzero values of yi - yj: N
B E argmax RN(b) = argmin 0
0
0
-IN-1
N
izI j=z 1 l(Yi+Yj)
2
I l(Yi>Yj)-
l(xlP>x;B)I.
(3.15) In terms of the pairwise
eij(B)
E
difference
estimators
l(Yi Z Yj)%nCl(Yi > Yj)-
identification of & using the maximum conditional symmetry of
=
of Section 2.4, defining
l(x$>xJB)l~ rank correlation
2 l(yi # Yj)Sgn[l((Xi-Xj)‘Bo
criterion
> &j-&i)-
is related to the
l((xi-xjYBO
‘“)l
about zero given xi and xj. The maximum rank correlation estimator defined in (3.15) does not solve a sample moment condition like (2.39) of Section 2.4 (though such estimators could easily be constructed), because the derivative of RN(B) is zero wherever it is well-defined; still, the estimator b is motivated by the same general pairwise comparison approach described in Section 2.4. Han (1987a) gave regularity conditions under which fl is consistent for & these included continuity of the error distribution and compact support for the regressors. Under similar conditions Sherman (1993) demonstrated the root-N-consistency and asymptotic normality of the maximum rank estimator; writing the estimator as the minimizer of a second-order U-process,
j? = argmax 0
N 0 2
-r ‘jj’ i=l
5
P(ziYzjt8)>
(3.16)
j=i+l
Sherman showed that the asymptotic distribution of B is the same as that for an M-estimator based on N/2 observations which maximizes the sample average of the conditional expectation r(zi, /I) = E[ p(zi, zj, /I) 1zi] over the parameter space 0,
y*. Greene (1981, 1983) derives similar results for classical least squares estimates in the special case of a censored dependent variable. Brillinger (1983) shows consistency of classical least squares estimates for the general transformation model when the regressors are jointly normally distributed, which implies that the conditional distribution of the regressors x given the index x’BO has the linear form
Cxl X’BOI= PO + vo(X’BO)
(3.20)
for some p. and vo. Ruud (1983) noted that condition (3.20) (with a full-rank condition on the distribution of the regressors) was sufficient for consistency (up to scale) of a misspecified maximum likelihood estimator of PO in a binary response model with independence of the errors and regressors; this result was extended by Ruud (1986) to include all misspecified maximum likelihood estimators for latent variable models when (3.1 l), (3.20) and independence of the errors and regressors are assumed. Li and Duan (1989) have recently noted this result, emphasizing the importance of convexity of the assumed likelihood function (which ensures uniqueness of the minimizer rcfio of the limiting objective function). As Ruud points out, all of these results use the fact that the least squares or misspecified ML estimators 6i and y^of the intercept term and slope coefficients satisfy a sample moment condition of the form
5
O= i=l r(.Yi,6i +
1 Xiy*) [3xi
(3.21)
for some “quasi-residual” function I(.). Letting F(x’Bo, ~1+ x’y) = E[r(y, c1+ x’y) 1x] and imposing condition (3.20), the value y* = rcpo will solve the corresponding population moment condition if K and the intercept CIare chosen to satisfy the two conditions
0 = W(x’Bo, a + dx’Do))l = -w(x’&, a + K(X’PO))(x’A41, since the population
analogue
of condition
(3.21) then becomes
under the restriction (3.20). (An analogous argument works for condition (3.19), replacing x’fio withy* where appropriate; in this case, the index restriction _Y(yI x) = _Y(yIx’p,) is not necessary, though this condition may not be as easily verified as (3.20).) Conditions (3.19) and (3.20) are strong restrictions which seem unlikely to hold for observational data, but the consistency results may be useful in experimental design settihgs (where the distribution of the regressors can be chosen to satisfy
2497
Ch. 41: Estimation of Semiparametric Models
(3.20)), and the results suggest that the inconsistency of traditional maximum likelihood estimators may be small when the index restriction holds and (3.19) or (3.20) is approximately satisfied. If the regressors are assumed to be jointly continuously distributed with known density function fX(x), modifications of least squares estimators can yield consistent estimators of /I0 (up to scale) even if neither (3.19) nor (3.20) holds. Ruud (1986) proposed estimation of & by weighted least squares, &
(d4xi)lfx(xi))(xi - a)(Yi - Jib
(3.22) where 4(x) is any density function for a random vector satisfying (for example, a multivariate normal density function) and
condition
(3.20)
(3.23)
with an analogous definition for 9. This reweighting ensures that the probability limit for the weighted least squares estimator in (3.22) is the same as the probability limit for an unweighted least squares estimator with regressors having marginal density 4(x); since this density is assumed to satisfy (3.20), the resulting estimator will be consistent for /I,, (up to scale) by the results cited above. A different approach to use of a known regressor density was taken by Stoker (1986), who used the mean index restriction E[y 1x] = E[y 1x’/I,,] = G(x’/?,) implied by the transformation model with a strong index restriction on the errors. If the nuisance function G(.) is assumed to be smooth, an average of the derivative of E[ylx] with respect to the regressors x will be proportional to PO:
EEaE~ylxll~xl= ECWx’P,)lWP,)l PO= K*Po.
(3.24)
Furthermore, if the regressor density f,(x) declines smoothly to zero on the boundary of its support (which is most plausible when the support is unbounded), an integrationby-parts argument yields
huh = - EC9 lnCfx(41/ax)~
(3.25)
which implies that PO can be consistently estimated (up to scale) by the sample average ofy, times the derivative of the log-density of the regressors, a ln[fX(xi)]/ax. Also, using the facts that
-W lnCfxb)l/ax) = 0,
E{(a Ufx(41/WX’) = - 1,
(3.26)
J.L. Powell
2498
Stoker proposed an alternative estimator of K*& as the slope coefficients of an instrumental variables fit ofyi on xi using the log-density derivatives a ln[fJxJ]/ax, and a constant as instruments. This estimator, as well as Ruud’s density-weighted least squares estimator, is easily generalized to include models which have regressor density f,(x; rO) of known parametric form, by substitution of a preliminary estimator + for the unknown distribution parameters and accounting for the variability of this preliminary estimator in the asymptotic covariance matrix formulae, using formula (1.53) in Section 1.4 above. When the regressors are continuously distributed with density function f,(x) of unknown form, nonparametric (kernel) estimators of this density function (and its derivatives) can be substituted into the formulae for the foregoing estimators. Although the nonparametrically-estimated components necessarily converge at a rate slower than N1’2, the corresponding density-weighted LS and average derivative estimators will be root-IV-consistent under appropriate conditions, because they involve averages of these nonparametric components across the data. Newey and Ruud (1991) give conditions which ensure that the density-weighted LS estimator (defined in (3.22) and (3.23)) is root-iV-consistent and asymptotically normal when f,.(x) is replaced by a kernel estimator_?Jx). These conditions include the requirement that the reweighting density 4(x) is nonzero only inside a compact set which has f,(x) bounded above zero, to guarantee that the reciprocal of the corresponding nonparametric estimator f,(x) is well-behaved. Hlrdle and Stoker (1989) and Stoker (1991) considered substitution of the derivative of a kernel estimator of the logdensity, a ln[~.Jx)]fix into a sample analogue of condition (3.26) (which deletes observations for which a ln[TX(xi)]/Z x is small), and gave conditions for root-l\rconsistency and asymptotic normality of the resulting estimator. A “density-weighted” variant on the average derivative estimator was proposed by Powell et al. (1989), using the fact that
$$\delta_0 \equiv E\left[f_x(x)\,\partial E[y\,|\,x]/\partial x\right] = -2E\left[y\,\partial f_x(x)/\partial x\right], \qquad (3.27)$$

where the last equality follows from an integration-by-parts argument similar to that used to derive (3.25). The resulting estimator δ̂ of δ₀ = κ⁺β₀, defined in (3.28), was shown to have lth component of the form (3.29), with weights ω_N(x_i − x_j) which tend to zero as ‖x_i − x_j‖ increases.
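To fix ideas, the following is a minimal numpy sketch of a density-weighted average derivative computation in the spirit of (3.27), using the plug-in form δ̂ = −(2/N) Σᵢ y_i ∇f̂_{x,−i}(x_i) with a leave-one-out kernel density estimate. The Gaussian product kernel, the bandwidth h, and the function name are illustrative assumptions, not the authors' exact construction:

```python
import numpy as np

def density_weighted_avg_derivative(y, x, h):
    """Illustrative sketch of the density-weighted average derivative:
    delta_hat = -(2/N) * sum_i y_i * grad fhat_{-i}(x_i), where fhat is a
    leave-one-out Gaussian-product-kernel density estimate (assumption)."""
    n, k = x.shape
    total = np.zeros(k)
    for i in range(n):
        u = (x[i] - np.delete(x, i, axis=0)) / h          # scaled differences
        kern = np.exp(-0.5 * (u ** 2).sum(axis=1)) / (2 * np.pi) ** (k / 2)
        # gradient of the kernel density estimate at x_i (K'(u) = -u K(u))
        grad = -(u * kern[:, None]).sum(axis=0) / ((n - 1) * h ** (k + 1))
        total += y[i] * grad
    return -2.0 * total / n
```

Because the estimator averages the nonparametric gradient across observations, it can converge at the root-N rate even though each gradient estimate converges more slowly.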
(In the corresponding efficiency-bound formula, the regressor x, which enters separately from the index x'β₀, is replaced by the deviation of the regressors from their conditional mean given the index, x − E[x|x'β₀].) Newey and Stoker (1993) derived the semiparametric efficiency bound for estimation of β₀ (up to a scale normalization on one coefficient) under condition (3.32), which has a similar form to the semiparametric efficiency bound for estimation under exclusion restrictions given by Chamberlain (1992), as described in Section 2.5 above.
3.3. Censored and truncated regression models
A general notation for censored regression models which covers fixed and random censoring takes the dependent variable y and an observable indicator variable d to be generated as

$$y = \min\{x'\beta_0 + \varepsilon,\; u\}, \qquad d = 1\{y < u\}. \qquad (3.34)$$
This notation covers the censored Tobit model with the dependent variable censored below at zero (with u = 0 and a sign change on the dependent and explanatory variables) and the accelerated failure time model (y equals log failure time) with either fixed (u always observable) or random censoring times. Given a parametric density f(ε; τ₀) for the error terms (assumed independent of x), estimation of the resulting parametric model can be based upon maximization of the likelihood function
$$\mathcal{L}_N(\beta,\tau;F) = \frac{1}{N}\sum_{i=1}^{N}\Big\{ d_i \ln\big[f(y_i - x_i'\beta;\tau)\big] + (1-d_i)\ln\big[1 - F(u_i - x_i'\beta;\tau)\big] \Big\} \qquad (3.35)$$
over possible values for β₀ and τ₀, where F(·) is the c.d.f. of ε (i.e. the antiderivative of the density f(·)). This likelihood is actually the conditional likelihood of y_i, d_i given the regressors {x_i} and the censoring points {u_i} for all observations (assuming u_i is independent of y_i and x_i), but since it only involves the censoring point u_i for those observations which are censored, maximization of the likelihood in (3.35) is equally feasible for fixed or random censoring. For truncated data (i.e. sampling conditional on d = 1), the likelihood function becomes
$$\mathcal{L}_N(\beta,\tau;F) = \frac{1}{N}\sum_{i=1}^{N} \ln\big[f(y_i - x_i'\beta;\tau)\big/F(u_i - x_i'\beta;\tau)\big]; \qquad (3.36)$$

here the truncation points must be known for all observations. When the error density is Gaussian (or in a more general exponential family), the first-order conditions for the maximum likelihood estimator of β₀ with censored data can be interpreted in terms of the "EM algorithm" (Dempster et al. (1977)), as a solution to the least squares normal equations (3.37), where

$$\hat y_i(\beta_0,\tau_0) = d_i y_i + (1-d_i)\,E[x_i'\beta_0 + \varepsilon_i\,|\,d_i = 0, x_i, u_i] = d_i y_i + (1-d_i)\left[x_i'\beta_0 + \frac{\int_{u_i - x_i'\beta_0}^{\infty} \varepsilon\, f(\varepsilon;\tau_0)\,d\varepsilon}{1 - F(u_i - x_i'\beta_0;\tau_0)}\right], \qquad (3.38)$$

with a similar expression for the nuisance parameter estimator τ̂. Related formulae for the conditional mean of y given x and u,

$$E[y\,|\,x,u] = [1 - F(u - x'\beta_0)]\,u + \int_{-\infty}^{u - x'\beta_0} [x'\beta_0 + \varepsilon]\, f(\varepsilon;\tau_0)\,d\varepsilon, \qquad (3.39)$$

or for the conditional mean of y given x and u and with d = 1,

$$E[y\,|\,x,u,d=1] = [F(u - x'\beta_0)]^{-1}\int_{-\infty}^{u - x'\beta_0} [x'\beta_0 + \varepsilon]\, f(\varepsilon;\tau_0)\,d\varepsilon, \qquad (3.40)$$

can be used to define nonlinear least squares estimators for censored data (or for truncated data using (3.40)) in a fully parametric model.
As discussed in Section 2.1 above, the parameters of interest β₀ for the censored regression model (3.34) will not in general be identified if the error terms are assumed only to satisfy a constant conditional mean restriction, because the structural function is not invertible in the error terms. However, the monotonicity of the censoring transformation in ε for fixed x and u implies that the constant conditional quantile restrictions discussed in Section 2.2 will be useful in identifying and consistently estimating β₀. For fixed censoring (at zero), Powell (1984) proposed a least absolute deviations estimator for β₀ under the assumption that the error terms had conditional median zero; in the notation of model (3.34), this estimator β̂ would be defined as

$$\hat\beta = \arg\min_{\beta\in\Theta} \frac{1}{N}\sum_{i=1}^{N} \big| y_i - \min\{x_i'\beta,\, u_i\} \big|, \qquad (3.41)$$
where Θ is the (compact) parameter space. Since the conditional median of y given x and u depends on the censoring value u for all observations (even if y is uncensored), the estimator is not directly applicable to random censoring models. Demonstration of the root-N-consistency and asymptotic normality of this estimator follows the steps outlined in Section 2.3. The asymptotic covariance matrix of √N(β̂ − β₀) for this model will be H₀⁻¹V₀H₀⁻¹, with

$$H_0 = 2E\left[f(0|x)\,1\{x'\beta_0 < u\}\,xx'\right] \quad\text{and}\quad V_0 = E\left[1\{x'\beta_0 < u\}\,xx'\right],$$

where f(0|x) is the conditional density of the error term ε at its median, zero.
This approach was extended to the model with a general constant quantile restriction by Powell (1986a), which derived analogous conditions for consistency and asymptotic normality. Under the stronger restriction that the error terms are independent of the regressors, this paper showed how more efficient estimators of the slope coefficients in β₀ could be obtained by combining coefficients estimated at different quantiles, and how the assumption of independent errors could be tested by testing convergence of the differences in quantile slope estimators to zero, as proposed by Koenker and Bassett (1982) for the linear model. Nawata (1990) proposed a two-step estimator for β₀ which calculates a nonparametric estimator of the conditional median of y given x in the first step, by grouping the regressors into cells and computing the within-cell medians of the dependent variable. The second step treats these cell medians ȳ_j and the corresponding cell averages of the regressors x̄_j as raw data in a Gaussian version of the likelihood function (3.35), and weights these quasi-observations by the cell frequencies (which would be optimal if the conditional density of the errors at the median were constant). Nawata gives conditions for the consistency of this estimator, and shows how its asymptotic distribution approaches the distribution of the censored least absolute deviations estimator (defined in (3.41)) as the regressor cells become small. And, as mentioned in Section 3.2, Newey and Powell (1990) showed that an efficient estimator of β₀, under a quantile restriction on the errors, is a weighted quantile estimator with weights proportional to f(0|x), the conditional density of the errors at their conditional quantile, and proposed a feasible one-step version of this estimator which is asymptotically efficient.

When the censoring value u is observed only for censored observations, with u independently distributed from (y, x), Ying et al. (1991) propose a quantile estimator for β₀ under the restriction Pr{ε ≤ 0|x} = π ∈ (0,1), using the implied relation

$$\Pr\{y > x'\beta_0\,|\,x\} = \Pr\{x'\beta_0 < u \text{ and } \varepsilon > 0\,|\,x\} = \Pr\{x'\beta_0 < u\,|\,x\}\Pr\{\varepsilon > 0\,|\,x\} = H(x'\beta_0)(1 - \pi), \qquad (3.42)$$
where H(c) = Pr{u > c} is the survivor function of the random variable u. The unknown function H(·) can be consistently estimated using the Kaplan and Meier (1958) product-limit estimator for the distribution function for censored data. The resulting consistent estimator Ĥ(·) uses only the dependent variables {y_i} and the
censoring indicators {d_i}. Ying et al. (1991) define a quantile estimator β̂ as a solution to estimating equations of the form

$$0 \cong \frac{1}{N}\sum_{i=1}^{N}\Big[ [\hat H(x_i'\beta)]^{-1}\,1\{y_i > x_i'\beta\} - (1 - \pi) \Big]x_i, \qquad (3.43)$$
based on the conditional moment restriction (3.42), and give conditions for the root-N-consistency and asymptotic normality of this estimator. Since H(x'β₀) = 1{x'β₀ < u₀} when the censoring points u_i are constant at some value u₀ with probability one, these equations are not well-defined for fixed censoring (say, at zero) except in the special case Pr{x'β₀ < u₀} = 1. A modification of the sample moment conditions defined in (3.43),

$$0 \cong \frac{1}{N}\sum_{i=1}^{N}\Big[ 1\{y_i > x_i'\beta\} - [\hat H(x_i'\beta)](1 - \pi) \Big]x_i, \qquad (3.44)$$
would allow a constant censoring value, and when π = ½ would reduce to the subgradient condition for the minimization problem (3.41) in this case. Unfortunately, this condition may have a continuum of inconsistent roots, if β can be chosen so that x_i'β > u_i for all observations. It is not immediately clear whether an antiderivative of the right-hand side of (3.44) would yield a minimand which could be used to consistently estimate β₀ under random censoring, as it does (yielding (3.41) for π = ½) for fixed censoring.

Because the conditional median (and other quantiles) of the dependent variable y depend explicitly on the error distribution when the dependent variable is truncated, quantile restrictions are not helpful in identifying β₀ for truncated samples. With a stronger restriction of conditional symmetry of the errors about a constant (zero), the "symmetric trimming" idea mentioned in Section 2.3 can be used to construct consistent estimators for both censored and truncated samples. Powell (1986b) proposed a symmetrically truncated least squares estimator of β₀ for a truncated sample. The estimator exploited the moment condition

$$E\big[\,1\{y > 2x'\beta_0 - u\}\,(y - x'\beta_0)\,\big|\,x\,\big] = E\big[\,1\{|\varepsilon| < u - x'\beta_0\}\,\varepsilon\,\big|\,x\,\big] = 0, \qquad (3.45)$$

which follows from the conditional symmetry of the errors given x and u; the estimator itself was defined as a minimizer of a corresponding sample objective function (3.46), which yields a sample analogue to (3.45) as an approximate first-order condition.
Similarly, a symmetrically censored least squares estimator for the censored regression model (3.34) will solve a sample moment condition based upon the condition

$$E\big[\max\{y,\, 2x'\beta_0 - u\} - x'\beta_0\,\big|\,x\big] = E\big[\max\{\min\{\varepsilon,\, u - x'\beta_0\},\, x'\beta_0 - u\}\,\big|\,x\big] = 0. \qquad (3.47)$$
The root-N-consistency and asymptotic normality of these estimators were established by Powell (1986b). In addition to conditional symmetry and a full-rank condition on the matrix V₀ = E[1{x'β₀ < u}xx'], a unimodality condition on the error distribution was imposed in the truncated case. A variant on the symmetric trimming approach was proposed by M. Lee (1993a, b) which, for a fixed scalar w > 0, constructed estimators for truncated and censored samples based on the moment conditions (3.48) and

$$E\big[\,1\{u - x'\beta_0 > w\}\min\{|y - x'\beta_0|,\, w\}\,\mathrm{sgn}\{y - x'\beta_0\}\,\big|\,x\big] = E\big[\,1\{u - x'\beta_0 > w\}\min\{|\varepsilon|,\, w\}\,\mathrm{sgn}\{\varepsilon\}\,\big|\,x\big] = 0, \qquad (3.49)$$
respectively. Newey (1989a) derives the semiparametric efficiency bounds for estimation of β₀ under conditional symmetry with censored and truncated samples, noting that the symmetrically truncated least squares estimator attains that efficiency bound in the special case where the unknown error distribution is, in fact, Gaussian (the analogous result does not hold, though, for the symmetrically censored estimator).

As described at the end of Section 2.2, conditional mode restrictions can be used to identify β₀ for truncated data, and an estimator proposed by M. Lee (1992) exploits this restriction. This estimator solves a sample analogue to the characterization of β₀ as the solution to the minimization problem

$$\beta_0 = \arg\min_{\beta\in\Theta} \Pr\big\{\,|y - \min\{x'\beta,\, u\}| > w\,\big\},$$
as long as the modal interval of length 2w for the untruncated error distribution is assumed to be centered at zero. M. Lee (1992) showed the N^{1/3}-consistency of this estimator and considered its robustness properties.

Most of the literature on semiparametric estimation for censored and truncated regression, in both statistics and econometrics, has been based upon independence restrictions. Early estimators of β₀ for random censoring models relaxed the assumed parametric form of the error distribution, but maintained independence of the errors and the regressors.
Pairwise difference estimators for the censored and truncated regression models have also been constructed by Honoré and Powell (1991). For model (3.34) with fixed censoring, and using the notation of Section 2.4, these estimators were based upon the transformation

$$e_{ij}(\beta) \equiv e(z_i, z_j, \beta) = \min\{y_i - x_i'\beta,\; u_j - x_j'\beta\}, \qquad (3.54)$$

which satisfies

$$e_{ij}(\beta_0) = \min\{\min\{\varepsilon_i,\, u_i - x_i'\beta_0\},\, u_j - x_j'\beta_0\} = \min\{\varepsilon_i,\, u_i - x_i'\beta_0,\, u_j - x_j'\beta_0\},$$
so that e_{ij}(β₀) and e_{ji}(β₀) are clearly independently and identically distributed given x_i and x_j. Again choosing l(x_i, x_j, β) = x_i − x_j, the pairwise difference estimator for the censored regression model was given as a solution to the sample moment condition (2.39) of Section 2.4 above. These estimating equations were shown to have a unique solution, since they correspond to first-order conditions for a convex minimization problem. Honoré and Powell (1991) also considered estimation of the truncated regression model, in which y_i and x_i are observed only if y_i is positive; that is, if y_i = x_i'β₀ + v_i, where v_i has the conditional distribution of ε_i given ε_i > −x_i'β₀, so that ℒ(v_i|x_i) = ℒ(ε_i|x_i, ε_i > −x_i'β₀). Again assuming the untruncated errors ε_i are i.i.d. and independent of the regressors x_i, a pairwise difference estimator of β₀ was defined using the transformation

$$e(z_i, z_j, \beta) \equiv (y_i - x_i'\beta)\,1(y_i - x_i'\beta > -x_j'\beta)\,1(y_j - x_j'\beta > -x_i'\beta). \qquad (3.55)$$

When evaluated at the true value β₀, the difference

$$e_{ij}(\beta_0) - e_{ji}(\beta_0) = (v_i - v_j)\,1(v_i > -x_j'\beta_0)\,1(v_j > -x_i'\beta_0) \qquad (3.56)$$
is symmetrically distributed around zero given x_i and x_j. As for the censored case, the estimator β̂ for this model was defined using l(x_i, x_j, β) = (x_i − x_j) and (2.39) through (2.40) above. When the function ζ(d) = sgn(d), the solution to (2.39) for this model was proposed by Bhattacharya et al. (1983) as an estimator of β₀ for this model under the assumption that x_i is a scalar. The general theory derived for minimizers of mth-order U-statistics (discussed in Section 1.3) was applied to show root-N-consistency and to obtain the large-sample distributions of the pairwise difference estimators for the censored and truncated regression models.
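As a rough computational illustration of the censored case, the sketch below builds the transformation (3.54) for all pairs and minimizes a least squares criterion in the differences e_{ij}(β) − e_{ji}(β). This quadratic criterion is a hypothetical stand-in chosen for simplicity; Honoré and Powell (1991) work with a criterion whose first-order conditions are the moment condition (2.39) and which is convex in β:

```python
import numpy as np
from scipy.optimize import minimize

def pairwise_diff_censored(y, x, u, b0):
    """Sketch: since e_ij(b0) and e_ji(b0) from (3.54) are identically
    distributed given the regressors, their difference is centered at zero;
    a simple (assumed) sample criterion penalizes squared differences."""
    n = len(y)
    i, j = np.triu_indices(n, k=1)            # all pairs i < j
    def objective(b):
        e_ij = np.minimum(y[i] - x[i] @ b, u[j] - x[j] @ b)
        e_ji = np.minimum(y[j] - x[j] @ b, u[i] - x[i] @ b)
        return ((e_ij - e_ji) ** 2).sum()
    return minimize(objective, b0, method="Nelder-Mead").x
```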
3.4. Selection models

Rewriting the censored selection model of (1.21) and (1.22) as
d = 1(x’@, + v > 0}, Y = dCx;Bo +
~1
(3.57)
(for y₁ ≡ d, y₂ ≡ y, β₀₁ ≡ δ₀, and β₀₂ ≡ β₀), a fully parametric model would specify the functional form of the joint density f(ε, η; τ₀) of the error terms. Estimation can then be based upon maximization of the average log-likelihood function

$$\mathcal{L}_N(\beta,\delta,\tau) = \frac{1}{N}\sum_{i=1}^{N}\left\{ d_i \ln\left[\int_{-x_{1i}'\delta}^{\infty} f(y_i - x_{2i}'\beta,\, \eta;\, \tau)\, d\eta\right] + (1-d_i)\ln\left[\int_{-\infty}^{\infty}\int_{-\infty}^{-x_{1i}'\delta} f(\varepsilon,\, \eta;\, \tau)\, d\eta\, d\varepsilon\right]\right\} \qquad (3.58)$$
over β, δ, and τ in the parameter space. An alternative estimation method, proposed by Heckman (1976), can be based upon the conditional mean of y given x and d = 1:

$$E[y\,|\,x, d=1] = x_2'\beta_0 + \left[\int_{-x_1'\delta_0}^{\infty}\int_{-\infty}^{\infty} \varepsilon\, f(\varepsilon,\, \eta;\, \tau_0)\, d\varepsilon\, d\eta\right]\left[\int_{-x_1'\delta_0}^{\infty}\int_{-\infty}^{\infty} f(\varepsilon,\, \eta;\, \tau_0)\, d\varepsilon\, d\eta\right]^{-1} \equiv x_2'\beta_0 + \lambda(x_1'\delta_0;\, \tau_0). \qquad (3.59)$$
When the "selection correction function" λ(x₁'δ; τ) is linear in the distributional parameters τ (as is the case for bivariate Gaussian densities), a two-step estimator of β₀ can be constructed using linear least squares, after inserting a consistent first-step estimator δ̂ of δ₀ (using the indicator d and regressors x₁ in the binary log-likelihood of (3.2)) into the selection correction function. Alternatively, a nonlinear least squares estimator of the parameters can be constructed using (3.59), which is also applicable for truncated data (i.e. for y and x being observed conditional on d = 1).

To date, semiparametric modelling of the selection model (3.57) has imposed independence or index restrictions on the error terms (ε, η). Chamberlain (1986a) derived the semiparametric efficiency bound for estimation of β₀ and δ₀ in (3.57) when the errors are independent of the regressors with unknown error density. The form of the efficiency bound is a simple modification of the parametric efficiency bound for this problem when the error density is known, with the regression vectors x₁ and x₂ being replaced by their deviations from their conditional means given the selection index, x₁ − E[x₁|x₁'δ₀] and x₂ − E[x₂|x₁'δ₀], except for terms which involve the index x₁'δ₀. Chamberlain notes that, in general, nonsingularity of the semiparametric information matrix will require an exclusion restriction on x₂ (i.e. some component of x₁ with nonzero coefficient in δ₀ is excluded from x₂), as well as a normalization restriction on δ₀. The efficiency bound, which was derived imposing independence of the errors and regressors, apparently holds more generally when the joint distribution of the errors in (3.57), given the regressors, depends only upon the index x₁'δ₀ appearing in the selection equation.
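For reference, here is a minimal sketch of the parametric Gaussian benchmark described above, where the correction in (3.59) is the inverse Mills ratio λ(c) = φ(c)/Φ(c): a probit first step for δ₀, then least squares of y on x₂ and the estimated Mills ratio over the selected sample. The function name and the optimizer choice are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def heckman_two_step(y, d, x1, x2):
    """Sketch of Heckman's (1976) two-step estimator for (3.57) under
    bivariate Gaussian errors. Step 1: probit of d on x1 gives delta_hat.
    Step 2: OLS of y on x2 and the estimated inverse Mills ratio,
    using the selected (d = 1) observations."""
    def neg_loglik(delta):                       # probit log-likelihood
        p = np.clip(norm.cdf(x1 @ delta), 1e-12, 1 - 1e-12)
        return -(d * np.log(p) + (1 - d) * np.log(1 - p)).sum()
    delta = minimize(neg_loglik, np.zeros(x1.shape[1]), method="BFGS").x
    idx = x1 @ delta
    mills = norm.pdf(idx) / np.clip(norm.cdf(idx), 1e-12, None)
    sel = d == 1
    regressors = np.column_stack([x2[sel], mills[sel]])
    coef = np.linalg.lstsq(regressors, y[sel], rcond=None)[0]
    return delta, coef[:-1]                      # (delta_hat, beta_hat)
```

The semiparametric estimators discussed next replace the known Mills-ratio form with an unknown correction function.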
Under this index restriction, the conditional mean of y given d = 1 and x will have the same form as in (3.59), but with a selection correction function of unknown form. More generally, conditional on d = 1, the dependent variable y has the linear representation y = x₂'β₀ + ε, where ε satisfies the distributional index restriction

$$\mathcal{L}(\varepsilon\,|\,d = 1, x) = \mathcal{L}(\varepsilon\,|\,d = 1, x_1'\delta_0) \quad \text{a.s.}, \qquad (3.60)$$

so that other estimation methods for distributional index restrictions (discussed in Section 2.5) are applicable here. So far, though, the econometric literature has exploited only the weaker mean index restriction

$$E(\varepsilon\,|\,d = 1, x) = E(\varepsilon\,|\,d = 1, x_1'\delta_0). \qquad (3.61)$$
A semiparametric analogue of Heckman's two-step estimator was constructed by Cosslett (1991), assuming independence of the errors and regressors. In the first step of this approach, a consistent estimator of the selectivity parameter δ₀ is obtained using Cosslett's (1983) NPMLE for the binary response model, described in Section 3.1 above. In this first step, the concomitant estimator F̂(·) of the marginal c.d.f. of the selection error η is a step function, constant on a finite number J of intervals {ℐ_j = (c_{j−1}, c_j), j = 1, ..., J} with c₀ = −∞ and c_J = ∞. The second-step estimator of β₀ approximates the selection correction function λ(·) by a piecewise-constant function on those intervals. That is, writing

$$y = x_2'\beta_0 + \sum_{j=1}^{J} \lambda_j\, 1\{x_1'\delta_0 \in \mathcal{I}_j\} + e, \qquad (3.62)$$
the estimator β̂ is constructed from a linear least squares regression of y on x₂ and the J indicator variables {1{x₁'δ̂ ∈ ℐ̂_j}}. Cosslett (1991) showed consistency of the resulting estimator, using the fact that the number of intervals, J, increases slowly to infinity as the sample size increases, so that the piecewise-constant function can approximate the true selection function λ(·) to an arbitrary degree. An important identifying assumption was the requirement that some component of the regression vector x₁ for the selection equation be excluded from the regressors x₂ in the equation for y, as discussed by Chamberlain (1986a). Although independence of the errors and regressors was imposed by Cosslett (1991), this was primarily used to ensure consistency of the NPML estimator of the selection coefficient vector δ₀. The same approach to approximation of the selection correction function will work under an index restriction on the errors, provided the first-step estimator of δ₀ only requires this index restriction.

In a parametric context, L. Lee (1982) proposed estimation of β₀ using a flexible parametrization of the selection correction function λ(·) in (3.59). For the semiparametric model, Newey (1988) proposed a similar two-step estimator, which in the second step used a series
approximation to the selection correction function to obtain the approximate model

$$y \cong x_2'\beta_0 + \sum_{j=1}^{J} \lambda_j\, p_j(x_1'\delta_0) + e, \qquad (3.63)$$
which was estimated (substituting a preliminary estimator δ̂ for δ₀) by least squares to obtain an estimator of β₀. Here the functions {p_j(·)} were a series of functions whose linear combination could be used to approximate (in a mean-squared-error sense) the function λ(·) arbitrarily well as J → ∞. Newey (1988) gave conditions (including a particular rate of growth of the number J of series components) under which the estimator β̂ of β₀ was root-N-consistent and asymptotically normal, and also discussed how efficient estimators of the parameters could be constructed.

As discussed in Section 2.5, weighted versions of the pairwise-difference estimation approach can be used under the index restriction of (3.61). Assuming a preliminary, root-N-consistent estimator δ̂ of δ₀ is available, Powell (1987) considers a pairwise-difference estimator of the form (2.55) when ζ(d) = d, e_{ij}(β) = y_i − x_{i2}'β and l(x_i, x_j, β) = x_{i2} − x_{j2}, yielding the explicit estimator
$$\hat\beta = \left[\sum_{i<j} w_N\big((x_{i1} - x_{j1})'\hat\delta\big)(x_{i2} - x_{j2})(x_{i2} - x_{j2})'\right]^{-1}\left[\sum_{i<j} w_N\big((x_{i1} - x_{j1})'\hat\delta\big)(x_{i2} - x_{j2})(y_i - y_j)\right]. \qquad (3.64)$$
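Formula (3.64) is directly computable once a weight function is chosen; pairs with nearly equal estimated selection indices (and hence nearly equal selection corrections) receive the most weight, so differencing removes the unknown correction function. A sketch assuming a Gaussian weight with bandwidth h (both assumptions):

```python
import numpy as np

def weighted_pairwise_diff(y, x1, x2, delta_hat, h):
    """Direct sketch of (3.64): weighted least squares on pairwise
    differences of outcome regressors, with kernel weights in the
    difference of estimated selection indices."""
    n = len(y)
    i, j = np.triu_indices(n, k=1)
    w = np.exp(-0.5 * (((x1[i] - x1[j]) @ delta_hat) / h) ** 2)  # kernel weights
    dx = x2[i] - x2[j]
    dy = y[i] - y[j]
    a = (w[:, None, None] * dx[:, :, None] * dx[:, None, :]).sum(axis=0)
    c = (w[:, None] * dx * dy[:, None]).sum(axis=0)
    return np.linalg.solve(a, c)
```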
Conditions were given in Powell (1987) on the data generating process, the weighting functions w_N(·), and the preliminary estimator δ̂ which ensured the root-N-consistency and asymptotic normality of β̂. The dependence of this asymptotic distribution on the large-sample behavior of δ̂ was explicitly derived, along with a consistent estimator of the asymptotic covariance matrix. The approach was also extended to permit endogeneity of some components of x_{i2}, using an instrumental variables version of the estimator. L. Lee (1991) considers system identification of semiparametric selection models with endogenous regressors and proposes efficient estimators of the unknown parameters under an independence assumption on the errors.

When the errors in (3.57) are assumed independent of the regressors, and the support of the selection error η is the entire real line, the assumption of a known parametric form x₁'δ₀ of the regression function in the selection equation can be relaxed. In this case, the dependent variable y given d = 1 has the linear representation y_i = x_{i2}'β₀ + ε_i, where the error term ε satisfies the distributional index restriction

$$\mathcal{L}(\varepsilon\,|\,d = 1, x) = \mathcal{L}(\varepsilon\,|\,d = 1, p(x_1)) \quad \text{a.s.}, \qquad (3.65)$$

where now the single index p(x₁) is the "propensity score" (Rosenbaum and Rubin (1983)), defined as

$$p(x_1) \equiv \Pr\{d = 1\,|\,x_1\} = E[d\,|\,x_1]. \qquad (3.66)$$
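Since (3.66) is simply a conditional mean, it can be estimated by standard nonparametric regression. A minimal leave-one-out Nadaraya-Watson sketch (Gaussian product kernel and bandwidth h are assumptions) is:

```python
import numpy as np

def propensity_kernel(d, x1, h):
    """Sketch of a leave-one-out kernel estimator of the propensity score
    p(x1) = E[d | x1] in (3.66), as used to replace the linear index."""
    n, k = x1.shape
    p_hat = np.empty(n)
    for i in range(n):
        u = (x1[i] - np.delete(x1, i, axis=0)) / h
        kern = np.exp(-0.5 * (u ** 2).sum(axis=1))     # Gaussian product kernel
        p_hat[i] = (np.delete(d, i) * kern).sum() / max(kern.sum(), 1e-12)
    return p_hat
```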
Given a nonparametric estimator p̂(x_{1i}) of the conditional mean p(x_{1i}) of the selection indicator, it is straightforward to modify the estimation methods above to accommodate this new index restriction, by replacing the estimated linear index x₁'δ̂ with the nonparametric index p̂(x_{1i}) throughout. Choi (1990) proposed a series estimator of β₀ based on (3.63) with this substitution, while Ahn and Powell (1993) modified the weighted pairwise-difference estimator in (3.64) along these lines. Both papers used a nonparametric kernel estimator to construct p̂(x_{1i}), and both gave conditions on the model, the first-step nonparametric estimator, and the degree of smoothing in the second step which guaranteed root-N-consistency and asymptotic normality of the resulting estimators of β₀. The influence functions for these estimators depend upon the conditional variability of the errors ε and the deviations of the selection indicator from its conditional mean, d − p(x_{1i}). Newey and Powell (1993) calculate the semiparametric efficiency bounds for β₀ under the distributional index restriction (3.65) and its mean index analogue, while Newey and Powell (1991) discuss construction of semiparametric M-estimators which will attain these efficiency bounds.

For the truncated selection model (sampling from (3.57) conditional on d = 1), identification and estimation of the unknown parameters is much more difficult. Ichimura and Lee (1991) consider a semiparametric version of a nonlinear least squares estimator using the form of the truncated conditional mean function
$$E[y\,|\,x, d = 1] = x_2'\beta_0 + \lambda(x_1'\delta_0) \qquad (3.67)$$
from (3.59), with λ(·) unknown, following the definition of Ichimura's (1992) estimator in (3.33) above. Besides giving conditions for identification of the parameters and root-N-consistency of their estimators, Ichimura and Lee (1991) consider a generalization of this model in which the nonparametric component depends upon several linear indices. If the linear index restriction (3.61) is replaced by the nonparametric index restriction (3.65), identification and consistent estimation of β₀ requires the functional independence of x₁ and x₂, in which case the estimator proposed by Robinson (1988), discussed in Section 2.5 above, will be applicable. Chamberlain (1992) derives the efficiency bound for estimation of the parameters of the truncated regression model under the index restriction (3.65).

Just as eliminating the information provided by the selection variable d makes identification and estimation of β₀ harder, a strengthening of the information in the selection variable makes estimation easier, and permits identification using other semiparametric restrictions on the errors. Honoré et al. (1992) consider a model in which the binary selection variable d is replaced by a censored dependent variable
y₁, so that the model becomes

$$y_1 = \max\{0,\; x_1'\delta_0 + \eta\}, \qquad y_2 = 1\{y_1 > 0\}\,[x_2'\beta_0 + \varepsilon]. \qquad (3.68)$$
This model is called the "Type 3 Tobit" model by Amemiya (1985). Assuming conditional symmetry of the errors (ε, η) about zero given x (as defined in Section 2.3), the authors note that δ₀ can be consistently estimated using the quantile or symmetric trimming estimators for censored regression models discussed in Section 3.3; furthermore, symmetrically trimming the dependent variable y₂ using the trimming function

$$h(y_1, y_2, x_1, x_2, \delta, \beta) \equiv 1\{0 < y_1 < 2x_1'\delta\}\,(y_2 - x_2'\beta), \qquad (3.69)$$

the function h(·) satisfies the conditional moment restriction

$$E[h(y_1, y_2, x_1, x_2, \delta_0, \beta_0)\,|\,x] = E[\,1\{-x_1'\delta_0 < \eta < x_1'\delta_0\}\,\varepsilon\,|\,x] = 0 \qquad (3.70)$$

because of the joint conditional symmetry of the errors. By constructing a sample analogue of (3.70) (possibly based on other odd functions of y₂ − x₂'β) and inserting the preliminary estimator δ̂, Honoré et al. (1992) show the resulting estimator β̂ to be root-N-consistent and asymptotically normal under relatively weak conditions on the model. Thus, with the additional information on the latent variable x₁'δ₀ + η provided by the censored variable y₁, it is possible to consistently estimate β₀ without obtaining explicit nonparametric estimators of infinite-dimensional nuisance functions.
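A convenient feature of (3.69)-(3.70) is that the trimming set depends only on y₁, x₁ and the plug-in δ̂, not on β, so the sample moment E_N[h·x₂] = 0 is solved exactly by least squares on the trimmed subsample. A minimal sketch under that reading (function name hypothetical):

```python
import numpy as np

def type3_tobit_beta(y1, y2, x1, x2, delta_hat):
    """Sketch of estimating beta_0 via (3.69)-(3.70): keep observations
    with 0 < y1 < 2*x1'delta_hat (symmetric trimming), then regress y2 on
    x2 over the trimmed subsample, which sets the sample moment to zero."""
    trim = (y1 > 0) & (y1 < 2 * (x1 @ delta_hat))
    return np.linalg.lstsq(x2[trim], y2[trim], rcond=None)[0]
```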
3.5. Nonlinear panel data models
For panel data versions of the latent variable models considered above, with

$$y_s = t(\eta + x_s'\beta_0 + \varepsilon_s,\; u_s), \qquad s = 1, \ldots, T, \qquad (3.71)$$
derivation of log-likelihood functions like the ones above is straightforward if the individual-specific intercept η is assumed independent of x (or its dependence is parametrically specified), with a distribution of known parametric form. The conditional density of y ≡ (y₁, ..., y_T) given x for each individual can be obtained from the joint density of the convolution v ≡ (η + ε₁, ..., η + ε_T), which, for special (e.g. Gaussian) choices of error distribution, is of simple form. Maximum likelihood estimators of β₀ for these nonlinear "random effects" models have the usual optimality properties, but their consistency depends on proper specification of both the error terms ε ≡ (ε₁, ..., ε_T) and the random effect η. When the individual-specific intercepts are treated as unknown parameters ("fixed effects"), the corresponding log-likelihoods for the parameters β₀ and the vector of intercept terms (η₁, ..., η_i, ..., η_N) are even simpler to derive, being of the same general forms as given above when the errors ε_s are assumed to be i.i.d. across individuals and time. However, because the vector of unknown intercept terms increases with the sample size, maximum likelihood estimators of these fixed effects will be inconsistent unless the number of time periods T also increases to infinity; moreover, the inconsistency of the fixed effect estimators leads to inconsistency of the estimators of the parameters of interest, β₀, as a consequence of the notorious "incidental parameters" problem (Neyman and Scott (1948)).

For some special parametric discrete response models, consistent estimators of β₀ with fixed effects can be obtained by maximizing a "conditional likelihood" function, which conditions on a fixed sum of the discrete dependent variable across time for each individual. In the special case T = 2, this is the same as maximizing the conditional likelihood given that y₁ ≠ y₂, and the estimation method is the analogue of estimation using pairwise differences (over time) for linear panel data models. Models for which a version of pairwise differencing can be used to eliminate the fixed effect in panel data include the binary logit model (Andersen (1970)), the Poisson regression model (Hausman et al. (1984)) and certain duration models (Chamberlain (1984)); however, these results require a particular (exponential) structure to the likelihood which does not hold in general.

For the binary, censored, and truncated regression models with fixed effects, estimators have been proposed under the assumption that the time-specific errors {ε_s} are identically distributed across time periods s given the regressors x. Manski (1987) shows that, with T = 2 time periods, the conditional median of the difference y₂ − y₁ of the binary variables y_s = 1{x_s'β₀ + η + ε_s > 0}, given that y₁ ≠ y₂, is 1{(x₂ − x₁)'β₀ > 0}, so that a consistent estimator for β₀ will be
$$\hat\beta = \arg\min_{\beta\in\Theta} \frac{1}{N}\sum_{i=1}^{N} 1\{y_{i2} \neq y_{i1}\}\,\big|(y_{i2} - y_{i1}) - 1\{(x_{i2} - x_{i1})'\beta > 0\}\big|, \qquad (3.72)$$
which will be consistent under conditions on (x_{i2} − x_{i1}), etc., similar to those for consistency of the maximum score estimator. Honoré (1992) considered pairwise-difference estimators for censored and truncated regression models with fixed effects using the approach described in Section 3.3. Specifically, using the transformations given in (3.54) and (3.55) for the censored and truncated cases, respectively, estimators of the parameter vector β₀ in both cases were defined as solutions to minimization problems which generate a first-order condition of the form

$$0 \cong \frac{1}{N}\sum_{i=1}^{N} \zeta\big[e(z_{i2}, z_{i1}, \beta) - e(z_{i1}, z_{i2}, \beta)\big](x_{i2} - x_{i1}). \qquad (3.73)$$
As discussed at the end of Section 2.4, the expectation of the right-hand side of (3.73) will be zero when evaluated at β₀, even in the presence of a fixed effect. As for Manski's binary panel data estimator, this estimation approach can be generalized to allow for more than T = 2 time periods.
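Because the objective in (3.72) is a step function of β, standard gradient methods do not apply; a crude grid search conveys the idea. The sketch below, with a user-supplied array of candidate coefficient vectors, is purely illustrative:

```python
import numpy as np

def panel_max_score(y1, y2, x1, x2, grid):
    """Sketch of the conditional maximum score criterion (3.72) for the
    T = 2 binary panel model: among switchers (y_i1 != y_i2), choose b
    minimizing sum |(y_i2 - y_i1) - 1{(x_i2 - x_i1)'b > 0}| over a
    (hypothetical) grid of candidate b vectors."""
    dx = x2 - x1
    dy = y2 - y1
    sw = dy != 0                                  # keep switchers only
    best, best_val = None, np.inf
    for b in grid:
        val = np.abs(dy[sw] - (dx[sw] @ b > 0)).sum()
        if val < best_val:
            best, best_val = b, val
    return best
```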
4. Summary and conclusions
As the previous section indicates, the theoretical analysis of the properties of estimators under various semiparametric restrictions is quite extensive, at least for the latent variable models considered above. The following table gives a general summary of the state of the econometric literature on estimation of several semiparametric models.
Model            Mean   Median   Mode   Symmetry   Index   Independence
Linear             3       3      ?        3         3          3
Transformed        3       3      ?        3         3          3
Censored           0       3      1        3         0          3
Truncated          0       0      1        3         2          3
Binary             0       0+     ?        1         2          3
Monotone           0       0+     ?        1         2          2
Semilinear         3       1      ?        2         3          3
Selection          0       ?      ?        2         2          3
Binary panel       0       1      ?        ?         ?          ?
Censored panel     0       2      ?        2         ?          2

Key: 0 = Not identified (0+ = identified only up to scale); 1 = identified/consistent estimator; 2 = √N-consistent, asymptotically normal estimator; 3 = efficient estimator.
Of course, this table should be viewed with caution, as some of its entries are ambiguous (for instance, the entry under “symmetry” for the “selection” row refers to the “Type 3 Tobit” model with a censored regression model as the selection equation, while the other columns presume a binary selection equation). Nevertheless, the table should be suggestive of areas where more research is needed. The literature on the empirical application of semiparametric methods (apart from estimation of invertible models under conditional mean restrictions) is much less extensive. When applied to relatively small data sets (roughly 100 observations per parameter), the potential bias from misspecification of the parametric model has proven to be less important than the additional imprecision induced when parametric restrictions are relaxed. For example, Horowitz and Neumann (1987) and McFadden and Han (1987) estimate the parameters of an employment duration data set imposing independence and quantile restrictions, but for these data even maximum likelihood estimates are imprecise (in terms of their asymptotic standard errors). A similar outcome was obtained by Newey et al. (1990), which reanalyzed data on married women’s labor supply originally studied (in a parametric context)
by Mroz (1987). For these data, estimates based upon semiparametric restrictions were fairly comparable to their parametric counterparts, with differences in the estimates having large standard errors. On the other hand, for larger data sets (with relatively few parameters), the bias due to distributional misspecification is more likely to be evident. Chamberlain (1990) and Buchinsky (1991b) apply quantile methods to estimate the returns to education for a large, right-censored data set, and find these estimates to be quite precise. Other empirical papers which use semiparametric methods, with mixed success, include those by Deaton and Irish (1984), Newey (1987), Das (1991), Horowitz (1993), Bult (1992a, b), Horowitz and Markatou (1993), Deaton and Ng (1993) and Melenberg and van Soest (1993).

Besides the possible imprecision due to weakening of semiparametric restrictions, an obstacle to routine use of some of the estimators described in Section 3 is their dependence upon a choice of the type and degree of "smoothing" imposed for estimators which depend explicitly upon nonparametric components of the model. Though this question has been widely studied in the literature on nonparametrics, the results are different when the nonparametric component is a nuisance parameter. Some early results on the proper degree of smoothing are available for special cases of estimators for censored regression (Hall and Horowitz (1990)) or estimators based upon index restrictions (Hall and Marron (1987), Powell and Stoker (1991), Härdle et al. (1992)), but more theoretical results are needed to narrow the choice of possible estimators which depend upon nonparametrically-estimated components.
References

Ahn, H. and C.F. Manski (1993) "Distribution Theory for the Analysis of Binary Choice Under Uncertainty with Nonparametric Estimation of Expectations", Journal of Econometrics, forthcoming.
Ahn, H. and J.L. Powell (1993) "Semiparametric Estimation of Censored Selection Models with a Nonparametric Selection Mechanism", Journal of Econometrics, forthcoming.
Amemiya, T. (1974) "The Nonlinear Two-Stage Least-Squares Estimator", Journal of Econometrics, 2, 105-110.
Amemiya, T. (1977) "The Maximum Likelihood and Nonlinear Three-Stage Least Squares Estimator in the General Nonlinear Simultaneous Equations Model", Econometrica, 45, 955-968.
Amemiya, T. (1982) "Two Stage Least Absolute Deviations Estimators", Econometrica, 50, 689-711.
Amemiya, T. (1985) Advanced Econometrics. Cambridge, Mass.: Harvard University Press.
Amemiya, T. and J.L. Powell (1981) "A Comparison of the Box-Cox Maximum Likelihood Estimator and the Non-Linear Two-Stage Least Squares Estimator", Journal of Econometrics, 17, 351-381.
Andersen, E.B. (1970) "Asymptotic Properties of Conditional Maximum Likelihood Estimators", Journal of the Royal Statistical Society, Series B, 32, 283-301.
Andrews, D.W.K. (1987) "Consistency in Nonlinear Econometric Models: A Generic Uniform Law of Large Numbers", Econometrica, 55, 1465-1471.
Andrews, D.W.K. (1990a) "Asymptotics for Semiparametric Econometric Models, I: Estimation and Testing", Cowles Foundation, Yale University, Discussion Paper No. 908R.
Andrews, D.W.K. (1990b) "Asymptotics for Semiparametric Econometric Models, II: Stochastic Equicontinuity and Nonparametric Kernel Estimation", Cowles Foundation, Yale University, Discussion Paper No. 909R.
Andrews, D.W.K. (1991) "Asymptotic Normality of Series Estimators for Nonparametric and Semiparametric Regression Models", Econometrica, 59, 307-345.
Arabmazar, A. and P. Schmidt (1981) "Further Evidence on the Robustness of the Tobit Estimator to Heteroscedasticity", Journal of Econometrics, 17, 253-258.
Arabmazar, A. and P. Schmidt (1982) "An Investigation of the Robustness of the Tobit Estimator to Non-Normality", Econometrica, 50, 1055-1063.
Bassett, G.S. and R. Koenker (1978) "Asymptotic Theory of Least Absolute Error Regression", Journal of the American Statistical Association, 73, 667-677.
Begun, J., W. Hall, W. Huang and J. Wellner (1983) "Information and Asymptotic Efficiency in Parametric-Nonparametric Models", Annals of Statistics, 11, 432-452.
Bhattacharya, P.K., H. Chernoff and S.S. Yang (1983) "Nonparametric Estimation of the Slope of a Truncated Regression", Annals of Statistics, 11, 505-514.
Bickel, P.J. (1982) "On Adaptive Estimation", Annals of Statistics, 10, 647-671.
Bickel, P.J. and K.A. Doksum (1981) "An Analysis of Transformations Revisited", Journal of the American Statistical Association, 76, 296-311.
Bickel, P.J., C.A.J. Klaasen, Y. Ritov and J.A. Wellner (1993) Efficient and Adaptive Inference in Semiparametric Models. Johns Hopkins University Press, forthcoming.
Bierens, H.J. (1987) "Kernel Estimators of Regression Functions", in: T.F. Bewley, ed., Advances in Econometrics, Fifth World Congress, Vol. 1. Cambridge: Cambridge University Press.
Bloomfield, P. and W.L. Steiger (1983) Least Absolute Deviations: Theory, Applications, and Algorithms. Boston: Birkhäuser.
Box, G.E.P. and D.R. Cox (1964) "An Analysis of Transformations", Journal of the Royal Statistical Society, Series B, 26, 211-252.
Brillinger, D.R. (1983) "A Generalized Linear Model with 'Gaussian' Regressor Variables", in: P.J. Bickel, K.A. Doksum and J.L. Hodges, eds., A Festschrift for Erich L. Lehmann. Belmont, CA: Wadsworth International Group.
Buchinsky, M. (1991a) "A Monte Carlo Study of the Asymptotic Covariance Estimators for Quantile Regression Coefficients", manuscript, Harvard University, January.
Buchinsky, M. (1991b) "Changes in the U.S. Wage Structure 1963-1987: Applications of Quantile Regression", manuscript, University of Chicago.
Buchinsky, M. (1993) "How Did Women's 'Return to Education' Evolve in the U.S.? Exploration by Quantile Regression Analysis with Nonparametric Correction for Sample Selection Bias", manuscript, Yale University.
Buckley, J. and I. James (1979) "Linear Regression with Censored Data", Biometrika, 66, 429-436.
Bult, J.R. (1992a) "Target Selection for Direct Marketing: Semiparametric versus Parametric Discrete Choice Models", Faculty of Economics, University of Groningen, Research Memorandum No. 468.
Bult, J.R. (1992b) "Semiparametric versus Parametric Classification Models: An Application to Direct Marketing", manuscript, University of Groningen.
Burguete, J., R. Gallant and G. Souza (1982) "On Unification of the Asymptotic Theory of Nonlinear Econometric Models", Econometric Reviews, 1, 151-190.
Carroll, R.J. (1982) "Adapting for Heteroskedasticity in Linear Models", Annals of Statistics, 10, 1224-1233.
Carroll, R.J. and D. Ruppert (1982) "Robust Estimation in Heteroskedastic Linear Models", Annals of Statistics, 10, 429-443.
Carroll, R.J. and D. Ruppert (1984) "Power Transformations When Fitting Theoretical Models to Data", Journal of the American Statistical Association, 79, 321-328.
Cavanagh, C. and R. Sherman (1991) "Rank Estimators for Monotonic Regression Models", manuscript, Bellcore.
Chamberlain, G. (1984) "Panel Data", in: Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2. Amsterdam: North-Holland.
Chamberlain, G. (1986a) "Asymptotic Efficiency in Semiparametric Models with Censoring", Journal of Econometrics, 32, 189-218.
Chamberlain, G. (1986b) "Notes on Semiparametric Regression", manuscript, Department of Economics, University of Wisconsin-Madison.
Chamberlain, G. (1987) "Asymptotic Efficiency in Estimation with Conditional Moment Restrictions", Journal of Econometrics, 34, 305-334.
Chamberlain, G. (1990) "Quantile Regression, Censoring, and the Structure of Wages", manuscript, Harvard University.
Chamberlain, G. (1992) "Efficiency Bounds for Semiparametric Regression", Econometrica, 60, 567-596.
Choi, K. (1990) "The Semiparametric Estimation of the Sample Selection Model Using Series Expansion and the Propensity Score", manuscript, University of Chicago.
Chung, C.-F. and A.S. Goldberger (1984) "Proportional Projections in Limited Dependent Variable Models", Econometrica, 52, 531-534.
Cosslett, S.R. (1981) "Maximum Likelihood Estimation for Choice-Based Samples", Econometrica, 49, 1289-1316.
Cosslett, S.R. (1983) "Distribution-Free Maximum Likelihood Estimator of the Binary Choice Model", Econometrica, 51, 765-782.
Cosslett, S.R. (1987) "Efficiency Bounds for Distribution-Free Estimators of the Binary Choice and the Censored Regression Models", Econometrica, 55, 559-587.
Cosslett, S.R. (1991) "Distribution-Free Estimator of a Regression Model with Sample Selectivity", in: W.A. Barnett, J.L. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge: Cambridge University Press.
Cox, D.R. (1972) "Regression Models and Life Tables", Journal of the Royal Statistical Society, Series B, 34, 187-220.
Cox, D.R. (1975) "Partial Likelihood", Biometrika, 62, 269-276.
Cragg, J.G. (1983) "More Efficient Estimation in the Presence of Heteroscedasticity of Unknown Form", Econometrica, 51, 751-764.
Das, S. (1991) "A Semiparametric Structural Analysis of the Idling of Cement Kilns", Journal of Econometrics, 50, 235-256.
Deaton, A. and M. Irish (1984) "Statistical Models for Zero Expenditures in Household Budgets", Journal of Public Economics, 23, 59-80.
Deaton, A. and S. Ng (1993) "Parametric and Non-parametric Approaches to Price and Tax Reform", manuscript, Princeton University.
Delgado, M.A. (1992) "Semiparametric Generalized Least Squares in the Multivariate Nonlinear Regression Model", Econometric Theory, 8, 203-222.
Dempster, A.P., N.M. Laird and D.B. Rubin (1977) "Maximum Likelihood from Incomplete Data via the E-M Algorithm", Journal of the Royal Statistical Society, Series B, 39, 1-38.
Duncan, G.M. (1986) "A Semiparametric Censored Regression Estimator", Journal of Econometrics, 32, 5-34.
Elbadawi, I., A.R. Gallant and G. Souza (1983) "An Elasticity Can be Estimated Consistently Without A Priori Knowledge of its Functional Form", Econometrica, 51, 1731-1751.
Engle, R.F., C.W.J. Granger, J. Rice and A. Weiss (1986) "Semiparametric Estimates of the Relation Between Weather and Electricity Sales", Journal of the American Statistical Association, 81, 310-320.
Ferguson, T.S. (1967) Mathematical Statistics: A Decision Theoretic Approach. New York: Academic Press.
Fernandez, L. (1986) "Nonparametric Maximum Likelihood Estimation of Censored Regression Models", Journal of Econometrics, 32, 35-57.
Friedman, J.H. and W. Stuetzle (1981) "Projection Pursuit Regression", Journal of the American Statistical Association, 76, 817-823.
Gallant, A.R. (1980) "Explicit Estimators of Parametric Functions in Nonlinear Regression", Journal of the American Statistical Association, 75, 182-193.
Gallant, A.R. (1981) "On the Bias in Flexible Functional Forms and an Essentially Unbiased Form: The Fourier Flexible Form", Journal of Econometrics, 15, 211-245.
Gallant, A.R. (1987) "Identification and Consistency in Nonparametric Regression", in: T.F. Bewley, ed., Advances in Econometrics, Fifth World Congress. Cambridge: Cambridge University Press.
Gallant, A.R. and D.W. Nychka (1987) "Semi-nonparametric Maximum Likelihood Estimation", Econometrica, 55, 363-390.
Goldberger, A.S. (1983) "Abnormal Selection Bias", in: S. Karlin, T. Amemiya and L. Goodman, eds., Studies in Econometrics, Time Series, and Multivariate Statistics. New York: Academic Press.
Greene, W.H. (1981) "On the Asymptotic Bias of the Ordinary Least Squares Estimator of the Tobit Model", Econometrica, 49, 505-514.
Greene, W.H. (1983) "Estimation of Limited Dependent Variable Models by Ordinary Least Squares and the Method of Moments", Journal of Econometrics, 21, 195-212.
Grenander, U. (1981) Abstract Inference. New York: Wiley.
Hall, P. and J.L. Horowitz (1990) "Bandwidth Selection in Semiparametric Estimation of Censored Linear Regression Models", Econometric Theory, 6, 123-150.
Hall, P. and J.S. Marron (1987) "Estimation of Integrated Squared Density Derivatives", Statistics and Probability Letters, 6, 109-115.
Han, A.K. (1987a) "Non-Parametric Analysis of a Generalized Regression Model: The Maximum Rank Correlation Estimator", Journal of Econometrics, 35, 303-316.
Han, A.K. (1987b) "A Non-Parametric Analysis of Transformations", Journal of Econometrics, 35, 191-209.
Hansen, L.P. (1982) "Large Sample Properties of Generalized Method of Moment Estimators", Econometrica, 50, 1029-1054.
Härdle, W. (1991) Applied Nonparametric Regression. Cambridge: Cambridge University Press.
Härdle, W. and T.M. Stoker (1989) "Investigating Smooth Multiple Regression by the Method of Average Derivatives", Journal of the American Statistical Association, forthcoming.
Härdle, W., J. Hart, J.S. Marron and A.B. Tsybakov (1992) "Bandwidth Choice for Average Derivative Estimation", Journal of the American Statistical Association, 87, 227-233.
Hausman, J., B.H. Hall and Z. Griliches (1984) "Econometric Models for Count Data with an Application to the Patents-R&D Relationship", Econometrica, 52, 909-938.
Heckman, J.J. (1976) "The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models", Annals of Economic and Social Measurement, 5, 475-492.
Heckman, J.J. and T.E. MaCurdy (1980) "A Life-Cycle Model of Female Labor Supply", Review of Economic Studies, 47, 47-74.
Heckman, J.J. and B. Singer (1984) "A Method for Minimizing the Impact of Distributional Assumptions in Econometric Models for Duration Data", Econometrica, 52, 271-320.
Heckman, N.E. (1986) "Spline Smoothing in a Partly Linear Model", Journal of the Royal Statistical Society, Series B, 48, 244-248.
Hoeffding, W. (1948) "A Class of Statistics with Asymptotically Normal Distribution", Annals of Mathematical Statistics, 19, 293-325.
Honoré, B.E. (1986) "Estimation of Proportional Hazards Models in the Presence of Unobserved Heterogeneity", manuscript, University of Chicago, November.
Honoré, B.E. (1992) "Trimmed LAD and Least Squares Estimation of Truncated and Censored Regression Models with Fixed Effects", Econometrica, 60, 533-565.
Honoré, B.E. and J.L. Powell (1991) "Pairwise Difference Estimators of Linear, Censored, and Truncated Regression Models", manuscript, Department of Economics, Princeton University, November.
Honoré, B.E., E. Kyriazidou and C. Udry (1992) "Estimation of Type 3 Tobit Models Using Symmetric Trimming and Pairwise Comparisons", manuscript, Department of Economics, Northwestern University.
Horowitz, J.L. (1986) "A Distribution-Free Least Squares Estimator for Censored Linear Regression Models", Journal of Econometrics, 32, 59-84.
Horowitz, J.L. (1988a) "Semiparametric M-Estimation of Censored Linear Regression Models", Advances in Econometrics, 7, 45-83.
Horowitz, J.L. (1988b) "The Asymptotic Efficiency of Semiparametric Estimators for Censored Linear Regression Models", Empirical Economics, 13, 123-140.
Horowitz, J.L. (1992) "A Smoothed Maximum Score Estimator for the Binary Response Model", Econometrica, 60, 505-531.
Horowitz, J.L. (1993) "Semiparametric Estimation of a Work Trip Mode Choice Model", Journal of Econometrics, forthcoming.
Horowitz, J.L. and M. Markatou (1993) "Semiparametric Estimation of Regression Models for Panel Data", Department of Economics, University of Iowa, Working Paper No. 93-14.
Horowitz, J.L. and G. Neumann (1987) "Semiparametric Estimation of Employment Duration Models", with discussion, Econometric Reviews, 6, 5-40.
Hsieh, D. and C. Manski (1987) "Monte-Carlo Evidence on Adaptive Maximum Likelihood Estimation of a Regression", Annals of Statistics, 15, 541-551.
Huber, P.J. (1967) "The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions", in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 4. Berkeley: University of California Press, 221-233.
Huber, P.J. (1981) Robust Statistics. New York: Wiley.
Huber, P.J. (1984) "Projection Pursuit", with discussion, Annals of Statistics, 13, 435-525.
Hurd, M. (1979) "Estimation in Truncated Samples When There is Heteroskedasticity", Journal of Econometrics, 11, 247-258.
Ichimura, H. (1992) "Semiparametric Least Squares Estimation of Single Index Models", Journal of Econometrics, forthcoming.
Ichimura, H. and L.-F. Lee (1991) "Semiparametric Least Squares Estimation of Multiple Index Models: Single Equation Estimation", in: W.A. Barnett, J.L. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge: Cambridge University Press.
Imbens, G.W. (1992) "An Efficient Method of Moments Estimator for Discrete Choice Models with Choice-Based Sampling", Econometrica, 60, 1187-1214.
Jaeckel, L.A. (1972) "Estimating Regression Coefficients by Minimizing the Dispersion of the Residuals", Annals of Mathematical Statistics, 43, 1449-1458.
Jurečková, J. (1971) "Nonparametric Estimate of Regression Coefficients", Annals of Mathematical Statistics, 42, 1328-1338.
Kaplan, E.L. and P. Meier (1958) "Nonparametric Estimation from Incomplete Data", Journal of the American Statistical Association, 53, 457-481.
Kiefer, J. and J. Wolfowitz (1956) "Consistency of the Maximum Likelihood Estimator in the Presence of Infinitely Many Incidental Parameters", Annals of Mathematical Statistics, 27, 887-906.
Kim, J. and D. Pollard (1990) "Cube Root Asymptotics", Annals of Statistics, 18, 191-219.
Klein, R.W. and R.H. Spady (1993) "An Efficient Semiparametric Estimator for Discrete Choice Models", Econometrica, 61, 387-421.
Koenker, R. and G.S. Bassett Jr. (1978) "Regression Quantiles", Econometrica, 46, 33-50.
Koenker, R. and G.S. Bassett Jr. (1982) "Robust Tests for Heteroscedasticity Based on Regression Quantiles", Econometrica, 50, 43-61.
Koul, H., V. Susarla and J. Van Ryzin (1981) "Regression Analysis with Randomly Right Censored Data", Annals of Statistics, 9, 1276-1288.
Lancaster, T. (1990) The Econometric Analysis of Transition Data. Cambridge: Cambridge University Press.
Laplace, P.S. (1793) "Sur Quelques Points du Système du Monde", Mémoires de l'Académie Royale des Sciences de Paris, Année 1789, 1-87.
Lee, L.F. (1982) "Some Approaches to the Correction of Selectivity Bias", Review of Economic Studies, 49, 355-372.
Lee, L.F. (1991) "Semiparametric Instrumental Variables Estimation of Simultaneous Equation Sample Selection Models", manuscript, Department of Economics, University of Minnesota.
Lee, L.F. (1992) "Semiparametric Nonlinear Least-Squares Estimation of Truncated Regression Models", Econometric Theory, 8, 52-94.
Lee, M.J. (1989) "Mode Regression", Journal of Econometrics, 42, 337-349.
Lee, M.J. (1992) "Median Regression for Ordered Discrete Response", Journal of Econometrics, 51, 59-77.
Lee, M.J. (1993a) "Winsorized Mean Estimator for Censored Regression Model", Econometric Theory, forthcoming.
Lee, M.J. (1993b) "Quadratic Mode Regression", Journal of Econometrics, forthcoming.
Levit, B.Y. (1975) "On the Efficiency of a Class of Nonparametric Estimates", Theory of Probability and Its Applications, 20, 723-740.
Li, K.C. and N. Duan (1989) "Regression Analysis Under Link Violation", Annals of Statistics, 17, 1009-1052.
Linton, O.B. (1991) "Second Order Approximation in Semiparametric Regression Models", manuscript, Nuffield College, Oxford University.
Linton, O.B. (1992) "Second Order Approximation in a Linear Regression with Heteroskedasticity of Unknown Form", manuscript, Nuffield College, Oxford University.
MaCurdy, T.E. (1982) "Using Information on the Moments of the Disturbance to Increase the Efficiency of Estimation", manuscript, Stanford University.
Manski, C.F. (1975) "Maximum Score Estimation of the Stochastic Utility Model of Choice", Journal of Econometrics, 3, 205-228.
Manski, C.F. (1983) "Closest Empirical Distribution Estimation", Econometrica, 51, 305-319.
Manski, C.F. (1984) "Adaptive Estimation of Nonlinear Regression Models", Econometric Reviews, 3, 145-194.
Manski, C.F. (1985) "Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator", Journal of Econometrics, 27, 313-333.
Manski, C.F. (1987) "Semiparametric Analysis of Random Effects Linear Models from Binary Panel Data", Econometrica, 55, 357-362.
Manski, C.F. (1988a) "Identification of Binary Response Models", Journal of the American Statistical Association, 83, 729-738.
Manski, C.F. (1988b) Analog Estimation Methods in Econometrics. New York: Chapman and Hall.
Manski, C.F. and S. Lerman (1977) "The Estimation of Choice Probabilities from Choice-Based Samples", Econometrica, 45, 1977-1988.
Manski, C.F. and D.F. McFadden (1981) "Alternative Estimators and Sample Designs for Discrete Choice Analysis", in: C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications. Cambridge: MIT Press.
Manski, C.F. and T.S. Thompson (1986) "Operational Characteristics of Maximum Score Estimation", Journal of Econometrics, 32, 85-108.
McFadden, D.F. (1985) "Specification of Econometric Models", Presidential Address, Fifth World Congress of the Econometric Society.
McFadden, D.F. and A. Han (1987) "Comment on Joel Horowitz and George Neumann 'Semiparametric Estimation of Employment Duration Models'", Econometric Reviews, 6, 257-270.
Melenberg, B. and A. van Soest (1993) "Semi-parametric Estimation of the Sample Selection Model", manuscript, Department of Econometrics, Tilburg University.
Meyer, B. (1987) "Semiparametric Estimation of Duration Models", Ph.D. dissertation, Department of Economics, MIT.
Moon, C.-G. (1989) "A Monte Carlo Comparison of Semiparametric Tobit Estimators", Journal of Applied Econometrics, 4, 361-382.
Mroz, T.A. (1987) "The Sensitivity of an Empirical Model of Married Women's Hours of Work to Economic and Statistical Assumptions", Econometrica, 55, 765-799.
Nawata, K. (1990) "Robust Estimation Based on Grouped-Adjusted Data in Censored Regression Models", Journal of Econometrics, 43, 337-362.
Nawata, K. (1992) "Semiparametric Estimation of Binary Choice Models Based on Medians of Grouped Data", manuscript, University of Tokyo.
Newey, W.K. (1984) "Nearly Efficient Moment Restriction Estimation of Regression Models with Nonnormal Disturbances", Princeton University, Econometric Research Program Memo. No. 315.
Newey, W.K. (1985) "Semiparametric Estimation of Limited Dependent Variable Models with Endogenous Explanatory Variables", Annales de l'INSEE, 59/60, 219-236.
Newey, W.K. (1987a) "Efficient Estimation of Models with Conditional Moment Restrictions", manuscript, Princeton University.
Newey, W.K. (1987b) "Interval Moment Estimation of the Truncated Regression Model", manuscript, Department of Economics, Princeton University, June.
Newey, W.K. (1987c) "Specification Tests for Distributional Assumptions in the Tobit Model", Journal of Econometrics, 34, 125-145.
Newey, W.K. (1988a) "Adaptive Estimation of Regression Models Via Moment Restrictions", Journal of Econometrics, 38, 301-339.
Newey, W.K. (1988b) "Efficient Estimation of Semiparametric Models Via Moment Restrictions", manuscript, Princeton University.
Newey, W.K. (1988c) "Two-Step Series Estimation of Sample Selection Models", manuscript, Princeton University.
Newey, W.K. (1989a) "Efficient Estimation of Tobit Models Under Symmetry", in: W.A. Barnett, J.L. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge: Cambridge University Press.
Newey, W.K. (1989b) "Efficiency in Univariate Limited Dependent Variable Models Under Conditional Moment Restrictions", manuscript, Princeton University.
Newey, W.K. (1989c) "Efficient Instrumental Variables Estimation of Nonlinear Models", mimeo, Princeton University.
Newey, W.K. (1989d) "Uniform Convergence in Probability and Uniform Stochastic Equicontinuity", mimeo, Department of Economics, Princeton University.
Newey, W.K. (1990a) "Semiparametric Efficiency Bounds", Journal of Applied Econometrics, 5, 99-135.
Newey, W.K. (1990b) "Efficient Instrumental Variables Estimation of Nonlinear Models", Econometrica, 58, 809-837.
Newey, W.K. (1991) "The Asymptotic Variance of Semiparametric Estimators", Working Paper No. 583, Department of Economics, MIT, revised July.
Ncwcy, W.K. and J.L. Powell (1990) “Efficient Estimation of Linear and Type I Censored Regression Econometric Theory, 6: 295-3 17. Models Under Conditional Quantile Restrictions”, Newey, W.K. and J.L. Powell (1991) “Two-Step Estimation, Optimal Moment Conditions, and Sample Selection Models”, manuscript, Department of Economics, MIT, October. Newey, W.K. and J.L. Powell (1993) “Efficiency Bounds for Some Semiparametric Selection Models”, Journal ofEconometrics, forthcoming. Newey, W.K. and P. Ruud (1991) “Density Weighted Least Squares Estimation”, manuscript, Department of Economics, MIT. Newey, W.K. and T. Stoker (1989) “Efficiency Properties of Average Derivative Estimators”, manuscript, Sloan School of Management, MIT. Newey, W.K. and T.M. Stoker (1993) “Efficiency of Weighted Average Derivative Estimators and Index Models”, Econometrica, 61, 1199-1223. Newey, W.K., J.L. Powell and J.M. Walker (1990) “Semiparametric Estimation of Selection Models: Some Empirical Results”, American Economic Review Papers and Proceedings, 80, 324-328. Neyman, J. and E.L. Scott (1948) “Consistent Estimates Based on Partially Consistent Cbservations”, Econometrica, 16, l-32. Nolan, D. and D. Pollard (1987) “U-Processes, Rates of Convergence”, Annals of Statistics, 15, 780799. Nolan, D. and D. Pollard (1988) “Functional Central Limit Theorems for U-Processes”, Annals of Probability, 16, 1291-1298. Oakes, D. (1981) “Survival Times: Aspects of Partial Likelihood”, International Statistical Reuiew, 49, 235-264. Obenhofer, W. (1982) “The Consistency of Nonlinear Regression Minimizing the Ll Norm”, Annals of Statistics, 10, 316-319. Pakes, A. and D. Pollard (1989) “Simulation and the Asymptotics of Optimization Estimators”, Econometrica, 57, 1027-1058. Pollard, D. (1985) “New Ways to Prove Central Limit Theorems”, Econometric Theory, 1, 295-314. Powell, J.L. (1983) “The Asymptotic Normality of Two-Stage Least Absolute Deviations Estimators”, Econometrica, 51, 1569-1575. Powell, J.L. (1984) “Least Absolute Deviations Estimation for the Censored Regression Model”, Journal of Econometrics, 25, 303-325. Powell, J.L. (1986a) “Censored Regression Quantiles”, Journal of Econometrics, 32, 143-155. Powell, J.L. (1986b) “Symmetrically Trimmed Least Squares Estimation ofTobit Models”, Econometrica, 54,1435-1460. Powell, J.L. (1987) “Semiparametric Estimation of Bivariate Latent Variable Models”, Social Systems Research Institute, University of Wisconsin-Madison, Working Paper No. 8704. Powell, J.L. (1991) “Estimation ofMonotonic Regression Models Under Quantile Restrictions”, in: W.A. Barnett, J.L. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics, Cambridge: Cambridge University Press. Powell, J.L. and T.M. Stoker (1991) “Optimal Bandwidth Choice for Density-Weighted Averages”, manuscript, Department of Economics; Princeton University, December. _ Powell, J.L., J.H. Stock and T.M. Stoker (1989) “Semiparametric Estimation of Weighted Average Derivatives”, Econometrica, 57, 1403-1436. Prakasa Rao, B.L.S. (1983) Nonparametric Functional Estimation. New York: Academic Press. Rice, J. (1986) “Convergence Rates for Partially Splined Estimates”, Statistics and Probability Letters, 4, 203-208. Rilstone, P. (1989) “Semiparametric Estimation of Missing Data Models”, mimeo, Department of Economics, Lava1 University. Ritov, Y. (1990) “Estimation in a Linear Regression Model with Censored Data”, Annals of Statistics, 18,303-328. Robinson, P. 
Robinson, P. (1987) “Asymptotically Efficient Estimation in the Presence of Heteroskedasticity of Unknown Form”, Econometrica, 55, 875-891.
Robinson, P. (1988a) “Semiparametric Econometrics: A Survey”, Journal of Applied Econometrics, 3, 35-51.
Robinson, P. (1988b) “Root-N-Consistent Semiparametric Regression”, Econometrica, 56, 931-954.
Rosenbaum, P.R. and D.B. Rubin (1983) “The Central Role of the Propensity Score in Observational Studies for Causal Effects”, Biometrika, 70, 41-55.
Ruud, P. (1983) “Sufficient Conditions for Consistency of Maximum Likelihood Estimation Despite Misspecification of Distribution”, Econometrica, 51, 225-228.
Ruud, P. (1986) “Consistent Estimation of Limited Dependent Variable Models Despite Misspecification of Distribution”, Journal of Econometrics, 32, 157-187.
Schick, A. (1986) “On Asymptotically Efficient Estimation in Semiparametric Models”, Annals of Statistics, 14, 1139-1151.
Serfling, R.J. (1980) Approximation Theorems of Mathematical Statistics. New York: Wiley.
Severini, T.A. and W.H. Wong (1987a) “Profile Likelihood and Semiparametric Models”, manuscript, University of Chicago.
Severini, T.A. and W.H. Wong (1987b) “Convergence Rates of Maximum Likelihood and Related Estimates in General Parameter Spaces”, Technical Report No. 207, Department of Statistics, University of Chicago, Chicago, IL.
Sherman, R.P. (1990a) “The Limiting Distribution of the Maximum Rank Correlation Estimator”, manuscript, Bell Communications Research.
Sherman, R.P. (1990b) “Maximal Inequalities for Degenerate U-Processes with Applications to Optimization Estimators”, manuscript, Bell Communications Research.
Sherman, R.P. (1993) “The Limiting Distribution of the Maximum Rank Correlation Estimator”, Econometrica, 61, 123-137.
Silverman, B.W. (1986) Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
Stein, C. (1956) “Efficient Nonparametric Testing and Estimation”, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. Berkeley: University of California Press.
Stock, J.H. (1989) “Nonparametric Policy Analysis”, Journal of the American Statistical Association, 84, 1461-1481.
Stoker, T.M. (1986) “Consistent Estimation of Scaled Coefficients”, Econometrica, 54, 1461-1481.
Stoker, T.M. (1991) “Equivalence of Direct, Indirect, and Slope Estimators of Average Derivatives”, in: W.A. Barnett, J.L. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge: Cambridge University Press.
Stoker, T.M. (1992) Lectures on Semiparametric Econometrics. Louvain-la-Neuve, Belgium: CORE Lecture Series.
Thompson, T.S. (1989a) “Identification of Semiparametric Discrete Choice Models”, manuscript, Department of Economics, University of Minnesota.
Thompson, T.S. (1989b) “Least Squares Estimation of Semiparametric Discrete Choice Models”, manuscript, Department of Economics, University of Minnesota.
Tobin, J. (1958) “Estimation of Relationships for Limited Dependent Variables”, Econometrica, 26, 24-36.
Wahba, G. (1984) “Partial Spline Models for the Semiparametric Estimation of Functions of Several Variables”, in: Statistical Analysis of Time Series. Tokyo: Institute of Statistical Mathematics.
White, H. (1982) “Maximum Likelihood Estimation of Misspecified Models”, Econometrica, 50, 1-26.
Ying, Z., S.H. Jung and L.J. Wei (1991) “Survival Analysis with Median Regression Models”, manuscript, Department of Statistics, University of Illinois.
Zheng, Z. (1992) “Efficiency Bounds for the Binary Choice and Sample Selection Models under Symmetry”, in: Topics in Nonparametric and Semiparametric Analysis, Ph.D. dissertation, Princeton University.
Chapter 42
RESTRICTIONS OF ECONOMIC THEORY IN NONPARAMETRIC METHODS*
ROSA L. MATZKIN
Northwestern University
Contents

Abstract 2524
1. Introduction 2524
2. Identification of nonparametric models using economic restrictions 2528
   2.1. Definition of nonparametric identification 2528
   2.2. Identification of limited dependent variable models 2530
   2.3. Identification of functions generating regression functions 2535
   2.4. Identification of simultaneous equations models 2536
3. Nonparametric estimation using economic restrictions 2537
   3.1. Estimators that depend on the shape of the estimated function 2538
   3.2. Estimation using seminonparametric methods 2544
   3.3. Estimation using weighted average methods 2546
4. Nonparametric tests using economic restrictions 2548
   4.1. Nonstatistical tests 2548
   4.2. Statistical tests 2551
5. Conclusions 2554
References 2554
*The support of the NSF through Grants SES-8900291 and SES-9122294 is gratefully acknowledged. I am grateful to an editor, Daniel McFadden, and two referees, Charles Manski and James Powell, for their comments and suggestions. I also wish to thank Don Andrews, Richard Briesch, James Heckman, Bo Honoré, Vrinda Kadiyali, Ekaterini Kyriazidou, Whitney Newey and participants in seminars at the University of Chicago, the University of Pennsylvania, Seoul University, Yonsei University and the conference on Current Trends in Economics, Cephalonia, Greece, for their comments. This chapter was partially written while the author was visiting MIT and the University of Chicago, whose warm hospitality is gratefully appreciated.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved
Abstract

This chapter describes several nonparametric estimation and testing methods for econometric models. Instead of using parametric assumptions on the functions and distributions in an economic model, the methods use the restrictions that can be derived from the model. Examples of such restrictions are the concavity and monotonicity of functions, equality conditions, and exclusion restrictions. The chapter shows, first, how economic restrictions can guarantee the identification of nonparametric functions in several structural models. It then describes how shape restrictions can be used to estimate nonparametric functions using popular methods for nonparametric estimation. Finally, the chapter describes how to test nonparametrically the hypothesis that an economic model is correct and the hypothesis that a nonparametric function satisfies some specified shape properties.
1. Introduction
Increasingly, it appears that restrictions implied by economic theory provide extremely useful tools for developing nonparametric estimation and testing methods. Unlike parametric methods, in which the functions and distributions in a model are specified up to a finite dimensional vector, in nonparametric methods the functions and distributions are left parametrically unspecified. The nonparametric functions may be required to satisfy some properties, but these properties do not restrict them to be within a parametric class. Several econometric models, formerly requiring very restrictive parametric assumptions, can now be estimated with minimal parametric assumptions, by making use of the restrictions that economic theory implies on the functions of those models. Similarly, tests of economic models that have previously been performed using parametric structures, and hence were conditional on the parametric assumptions made, can now be performed using fewer parametric assumptions by using economic restrictions. This chapter describes some of the existing results on the development of nonparametric methods using the restrictions of economic theory.

Studying restrictions on the relationship between economic variables is one of the most important objectives of economic theory. Without this study, one would not be able to determine, for example, whether an increase in income will produce an increase in consumption or whether a proportional increase in prices will produce a similar proportional increase in profits. Examples of economic restrictions that are used in nonparametric methods are the concavity, continuity and monotonicity of functions, equilibrium conditions, and the implications of optimization on solution functions.

The usefulness of the restrictions of economic theory on parametric models is
by now well understood. Some restrictions can be used, for example, to decrease the variance of parameter estimators, by requiring that the estimated values satisfy the conditions that economic theory implies on the values of the parameters. Some can be used to derive tests of economic models by testing whether the unrestricted parameter estimates satisfy the conditions implied by the economic restrictions. And some can be used to improve the quality of an extrapolation beyond the support of the data. In nonparametric models, economic restrictions can be used, as in parametric models, to reduce the variance of estimators, to falsify theories, and to extrapolate beyond the support of the data. But, in addition, some economic restrictions can be used to guarantee the identification of some nonparametric models and the consistency of some nonparametric estimators.

Suppose, for example, that we are interested in estimating the cost function a typical, perfectly competitive firm faces when it undertakes a particular project, such as the development of a new product. Suppose that the only available data are independent observations on the price vector faced by the firm for the inputs required to perform the project, and whether or not the firm decides to undertake the project. Suppose that the revenue of the project for the typical firm is distributed independently of the vector of input prices faced by that firm. The firm knows the revenue it can get from the project, and it undertakes the project if its revenue exceeds its cost. Then, using the convexity, monotonicity and homogeneity of degree one¹ properties that economic theory implies on the cost function, one can identify and estimate both the cost function of the typical firm and the distribution of revenues, without imposing parametric assumptions on either of these functions (Matzkin (1992)). This result requires, for normalization purposes, that the cost is known at one particular vector of input prices.

Let us see how nonparametric estimators for the cost function and the distribution of the revenue in the model described above can be obtained. Let (x^1, …, x^N) denote the observed vectors of input prices faced by N randomly sampled firms possessing the same cost function. These could be, for example, firms with the same R&D technologies. Let y^i equal 0 if the ith sampled firm undertakes the project and equal 1 otherwise (i = 1, …, N). Let us denote by h*(x) the cost of undertaking the project when x is the vector of input prices, and let us denote by ε the revenue associated with the project. Note that ε > 0. The cumulative distribution function of ε will be denoted by F*. We assume that F* is strictly increasing over the nonnegative real numbers and that the support of the probability distribution of x is ℝ^K_{++}. (Since we are assuming that ε is independent of x, F* does not depend on x.) According to the model, the probability that y^i = 1 given x^i is Pr(ε ≤ h*(x^i)) = F*(h*(x^i)). The homogeneity of degree one of h* implies that h*(0) = 0. A necessary normalization is imposed by requiring that h*(x*) = α, where both x* and α are known; α ∈ ℝ.

¹A function h: X → ℝ, where X ⊂ ℝ^K is convex, is convex if for all x, y ∈ X and all λ ∈ [0, 1], h(λx + (1 − λ)y) ≤ λh(x) + (1 − λ)h(y); h is homogeneous of degree one if for all x ∈ X and all λ > 0, h(λx) = λh(x).
Nonparametric estimators for h* and F* can be obtained as follows. First, one estimates the values that h* attains at each of the observed points x^1, …, x^N and the values that F* attains at h*(x^1), …, h*(x^N). Second, one interpolates between these values to obtain functions ĥ and F̂ that estimate, respectively, h* and F*. The nonparametric functions ĥ and F̂ satisfy the properties that h* and F* are known to possess. In our model, these properties are that h*(x*) = α, h* is convex, homogeneous of degree one and monotone increasing, and F* is monotone increasing and its values lie in the interval [0, 1]. The estimator for the finite dimensional vector {h*(x^1), …, h*(x^N); F*(h*(x^1)), …, F*(h*(x^N))} is obtained by solving the following constrained log-likelihood maximization problem:

maximize over {F^i}, {h^i}, {T^i}:  Σ_{i=1}^N { y^i log(F^i) + (1 − y^i) log(1 − F^i) }    (1)

subject to

F^i ≤ F^j  if  h^i ≤ h^j,   i, j = 1, …, N,    (2)
0 ≤ F^i ≤ 1,   i = 1, …, N,    (3)
h^i = T^i · x^i,   i = 0, …, N + 1,    (4)
h^i ≥ T^j · x^i,   i, j = 0, …, N + 1,    (5)
T^i ≥ 0,   i = 0, …, N + 1.    (6)
In this problem, h^i is the value of a cost function h at x^i, T^i is the subgradient² of h at x^i, and F^i is the value of a cumulative distribution at h^i (i = 1, …, N); x^0 = 0, x^{N+1} = x*, h^0 = 0, and h^{N+1} = α. The constraints (2)-(3) on F^1, …, F^N characterize the behavior that any distribution function must satisfy at any given points h^1, …, h^N in its domain. As we will see in Subsection 3.1, the constraints (4)-(6) on the values h^0, …, h^{N+1} and vectors T^0, …, T^{N+1} characterize the behavior that the values and subgradients of any convex, homogeneous of degree one, and monotone function must satisfy at the points x^0, …, x^{N+1}. Matzkin (1993b) provides an algorithm to find a solution to the constrained optimization problem above. The algorithm is based on a search over randomly drawn points (h, T) = (h^1, …, h^N; T^0, …, T^{N+1}) that satisfy (4)-(6) and over convex combinations of these points. Given any point (h, T) satisfying (4)-(6), the optimal values of F^1, …, F^N and the optimal value of the objective function given (h, T) are calculated using the algorithm developed by Ayer et al. (1955). (See also Cosslett (1983).) This algorithm divides the observations into groups, and assigns to each F^i in a group the value equal to the proportion of observations within the group with y^i = 1.

²If h: X → ℝ is a convex function on a convex set X ⊂ ℝ^K and x ∈ X, any vector T ∈ ℝ^K such that for all y ∈ X, h(y) ≥ h(x) + T·(y − x), is called a subgradient of h at x. If h is differentiable at x, the gradient of h at x is the unique subgradient of h at x.
The groups are obtained by first ordering the observations according to the values of the h^i's. A group ends at observation i in the jth place and a new group starts at observation k in the (j + 1)th place iff y^i = 0 and y^k = 1. If the values of the F^i's corresponding to two adjacent groups are not in increasing order, the two groups are merged. This merging process is repeated till the values of the F^i's are in increasing order. To randomly generate points (h, T), several methods can be used, but the most practical one proceeds by drawing N + 2 homogeneous and monotone linear functions and then letting (h, T) be the vector of values and subgradients of the function that is the maximum of those N + 2 linear functions. The coefficients of the N + 2 linear functions are drawn so that one of the functions attains the value α at x* and the other functions attain a value smaller than α at x*. To interpolate between the solution values (ĥ^1, …, ĥ^N; T̂^0, …, T̂^{N+1}; F̂^1, …, F̂^N), one can use different interpolation methods. One possible method proceeds by interpolating linearly between F̂^1, …, F̂^N to obtain a function F̂ and using the following interpolation for h:

ĥ(x) = max{ T̂^i · x | i = 0, …, N + 1 }.
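To make the inner step concrete, the following is a minimal sketch of the pooling algorithm described above (our own illustration, not code from the chapter; the NumPy implementation and the variable names are assumptions). Given candidate cost values h^1, …, h^N, it returns the likelihood-maximizing F^1, …, F^N by merging adjacent groups until monotonicity holds:

```python
import numpy as np

def pooled_cdf_values(h, y):
    """Optimal F^1,...,F^N in (1) given candidate values h^1,...,h^N.

    Orders the observations by h, starts with one group per observation,
    and merges adjacent groups whose proportions of y = 1 violate the
    monotonicity constraint (2).  Returns F in the original order.
    """
    order = np.argsort(h)
    y_sorted = np.asarray(y, dtype=float)[order]
    blocks = []  # each block holds [sum of y, number of observations]
    for yi in y_sorted:
        blocks.append([yi, 1.0])
        # Pool adjacent violators: merge while group means decrease.
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    F_sorted = np.concatenate([np.full(int(n), s / n) for s, n in blocks])
    F = np.empty_like(F_sorted)
    F[order] = F_sorted
    return F
```

Given the pooled values, the objective in (1) can be evaluated at each candidate (h, T), and the interpolation ĥ above is a one-line function of the solution subgradients, e.g. `h_hat = lambda x: np.max(T_hat @ x)`.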
Figure 1 presents some value sets of this nonparametric estimator ĥ when x ∈ ℝ²_{+}. For contrast, Figure 2 presents some value sets for a parametric estimator for h* that is specified to be linear in a parameter β and x. At this stage, several questions about the nonparametric estimator described above may be in the reader's mind. For example, how do we know whether these estimators are consistent? More fundamentally, how can the functions h* and F* be identified when no parametric specification is imposed on them? And, if they are identified, is the estimation method described above the only one that can be used to estimate the nonparametric model? These and several other related questions will be answered for the model described above and for other popular models. In Section 2 we will see first what it means for a nonparametric function to be identified. We will also see how restrictions of economic theory can be used to identify nonparametric functions in three popular types of models.
Figure 1
Figure 2
In Section 3, we will consider various methods for estimating nonparametric functions and we will see how properties such as concavity, monotonicity, and homogeneity of degree one can be incorporated into those estimation methods. Besides estimation methods like the one described above, we will also consider seminonparametric methods and weighted average methods. In Section 4, we will describe some nonparametric tests that use restrictions of economic theory. We will be concerned with nonstatistical as well as statistical tests. The nonstatistical tests assume that the data are observed without error and the variables in the models are nonrandom. Samuelson's Weak Axiom of Revealed Preference is an example of such a nonparametric test. Section 5 presents a short summary of the main conclusions of the chapter.
2. Identification of nonparametric models using economic restrictions

2.1. Definition of nonparametric identification
Formally, an econometric model is specified by a vector of functionally dependent and independent observable variables, a vector of functionally dependent and independent unobservable variables, a set of known functional relationships among the variables, and a set of restrictions on the unknown functions and distributions. In the example that we have been considering, the observable and unobservable independent variables are, respectively, x ∈ ℝ^K and ε ∈ ℝ_{+}. A binary variable, y, that takes the value zero if the firm undertakes the project and takes the value 1 otherwise, is the observable dependent variable. The profit of the firm if it undertakes the project is the unobservable dependent variable, y*. The known functional relationships among these variables are that y* = ε − h*(x) and that y = 0 when y* > 0 and y = 1 otherwise. The restrictions on the functions and distributions are that h* is continuous, convex, homogeneous of degree one, monotone increasing and attains the value α at x*; the joint distribution, G, of (x, ε) has as its support the set ℝ^{K+1}_{++} and it is such that ε and x are independently distributed.
The restrictions imposed on the unknown functions and distributions in an econometric model define the set of functions and distributions to which these belong. For example, in the econometric model described above, h* belongs to the set of continuous, convex, homogeneous of degree one, monotone increasing functions that attain the value α at x*, and G belongs to the set of distributions of (x, ε) that have support ℝ^{K+1}_{++} and satisfy the restriction that x and ε are independently distributed.

One of the main objectives of specifying an econometric model is to uncover the “hidden” functions and distributions that drive the behavior of the observable variables in the model. The identification analysis of a model studies what functions, or features of functions, can be recovered from the joint distribution of the observable variables in the model. Knowing the hidden functions, or some features of the hidden functions, in a model is necessary, for example, to study properties of these functions or to predict the behavior of other variables that are also driven by these functions. In the model considered in the introduction, for example, one can use knowledge about the cost function of a typical firm to infer properties of the production function of the firm or to calculate the cost of the firm under a nonperfectly competitive situation.

Let M denote a set of vectors of functions such that each function and distribution in an econometric model corresponds to a coordinate of the vectors in M. Suppose that the vector, m*, whose coordinates are the true functions and distribution in the model, belongs to M. We say that we can identify within M the functions and distributions in the model, from the joint distribution of the observable variables, if no other vector m in M can generate the same joint distribution of the observable variables. We next define this notion formally.

Let m* denote the vector of the unknown functions and distributions in an econometric model. Let M denote the set to which m* is known to belong. For each m ∈ M let P(m) denote the joint distribution of the observable variables in the model when m* is substituted by m. Then, the vector of functions m* is identified within M if for any vector m ∈ M such that m ≠ m*, P(m) ≠ P(m*).

One may consider studying the recoverability of some feature, C(m*), of m*, such as the sign of some coordinate of m*, or one may consider the recoverability of some subvector, m₁*, of m*, where m* = (m₁*, m₂*). A feature is identified if a different value of the feature generates a different probability distribution of the observable variables. A subvector is identified if, given any possible remaining unknown functions, any subvector that is different can not generate the same joint distribution of the observable variables. Formally, the feature C(m*) of m* is identified within the set {C(m) | m ∈ M} if for all m ∈ M such that C(m) ≠ C(m*), P(m) ≠ P(m*). The subvector m₁* is identified within M₁, where M = M₁ × M₂, m₁* ∈ M₁, and m₂* ∈ M₂, if for all m₁ ∈ M₁ such that m₁ ≠ m₁*, it follows that for all m₂, m₂′ ∈ M₂, P(m₁*, m₂′) ≠ P(m₁, m₂).
When the restrictions of an econometric model specify all functions and distributions up to the value of a finite dimensional vector, the model is said to be
parametric. When some of the functions or distributions are left parametrically unspecified, the model is said to be semiparametric. The model is nonparametric if none of the functions and distributions are specified parametrically. For example, in a nonparametric model, a certain distribution may be required to possess zero mean and finite variance, while in a parametric model the same distribution may be required to be a Normal distribution.

Analyzing the identification of a nonparametric econometric model is useful for several reasons. To establish whether a consistent estimator can be developed for a specific nonparametric function in the model, it is essential to determine first whether the nonparametric function can be identified from the population behavior of the observable variables. To single out the recoverability properties that are solely due to a particular parametric specification being imposed on a model, one has to analyze first what can be recovered without imposing that parametric specification. To determine what sets of parametric or nonparametric restrictions can be used to identify a model, it is important to analyze the identification of the model first without, or with as few as possible, restrictions. Imposing restrictions on a model, whether they are parametric or nonparametric, is typically not desirable unless those restrictions are justified. While some amount of unjustified restrictions is typically unavoidable, imposing the restrictions that economic theory implies on some models is not only desirable but also, as we will see, very useful.

Consider again the model of the firm that considers whether to undertake a project. Let us see how the properties of the cost function allow us to identify the cost function of the firm and the distribution of the revenue from the conditional distribution of the binary variable y given the vector of input prices x. To simplify our argument, let us assume that F* is continuous. Recall that F* is assumed to be strictly increasing and the support of the probability measure of x is ℝ^K_{++}. Let g(x) denote Pr(y = 1 | x). Then, g(x) = F*(h*(x)) is a continuous function whose values on ℝ^K_{++} can be identified from the joint distribution of (x, y). To see that F* can be recovered from g, note that since h*(x*) = α and h* is a homogeneous of degree one function, for any t ∈ ℝ_{+},

F*(t) = F*((t/α)·α) = F*((t/α)·h*(x*)) = F*(h*((t/α)·x*)) = g((t/α)·x*).

Next, to see that h* can be recovered from g and F*, we note that for any x ∈ ℝ^K_{++}, h*(x) = (F*)⁻¹(g(x)). So, we can recover both h* and F* from the observable function g. Any other pair (h, F) satisfying the same properties as (h*, F*) but with h ≠ h* or F ≠ F* will generate a different continuous function g. So, (h*, F*) is identified. In the next subsections, we will see how economic restrictions can be used to identify other models.
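The two recovery formulas translate directly into a numerical procedure. Below is a minimal sketch (our own construction, not from the chapter; the grid, the names, and the use of NumPy are assumptions) that, given the choice probability function g, recovers F* along the ray through x* and then inverts it to recover h*:

```python
import numpy as np

def recover_F_and_h(g, x_star, alpha, t_grid):
    """Recover F*(t) = g((t/alpha) x*) on t_grid and h*(x) = F*^{-1}(g(x)).

    g      : choice probability function, g(x) = Pr(y = 1 | x)
    x_star : normalization point with h*(x_star) = alpha
    alpha  : known cost at x_star
    t_grid : increasing grid of points at which F* is evaluated
    """
    F_vals = np.array([g((t / alpha) * x_star) for t in t_grid])

    def F(t):
        return np.interp(t, t_grid, F_vals)

    def h(x):
        # Invert the monotone F* numerically: h*(x) = F*^{-1}(g(x)).
        return np.interp(g(x), F_vals, t_grid)

    return F, h
```

The inversion step relies on the maintained assumption that F* is strictly increasing, so that the recovered values F_vals are increasing along the grid.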
2.2. Identification of limited dependent variable models
Limited dependent variable (LDV) models have been extensively used to analyze microeconomic data such as labor force participation, school choice, and purchase of commodities.
A typical LDV model can be described
by a pair of functional relationships,

y = G(y*)  and  y* = D(h*(x), ε),
where y is an observable dependent vector, which is a transformation, G, of the unobservable dependent vector, y*. The vector y* is a transformation, D, of the value that a function, h*, attains at a vector of observable variables, x, and the value of an unobservable vector, ε. In most popular examples, the function D is additively separable into the value of h* and ε. The model of the firm that we have been considering satisfies this restriction. Popular cases of G are the binary threshold crossing model

y = 1 if y* ≥ 0 and y = 0 otherwise,

and the tobit model

y = y* if y* ≥ 0 and y = 0 otherwise.

2.2.1. Generalized regression models
Typically, the function h* is the object of most interest in LDV models, since it aggregates the influence of the vector of observable explanatory variables, x. It is therefore of interest to ask what can be learned about h* when G and D are unknown and the distribution of ε is also unknown. An answer to this question has been provided by Matzkin (1994) for the case in which y, y*, h*(x), and ε are real valued, ε is distributed independently of x, and G∘D is nondecreasing and nonconstant. Roughly, the result is that h* is identified up to a strictly increasing transformation. Formally, we can state the following result (see Matzkin (1990b, 1991c, 1994)).

Theorem. Identification of h* in generalized regression models

Suppose that
(i) G∘D: ℝ² → ℝ is monotone increasing and nonconstant,
(ii) h*: X → ℝ, where X ⊂ ℝ^K, belongs to a set W of functions h: X → ℝ that are continuous and strictly increasing in the Kth coordinate of x,
(iii) ε ∈ ℝ is distributed independently of x,
(iv) the conditional probability of the Kth coordinate of x has a Lebesgue density that is everywhere positive, conditional on the other coordinates of x,
(v) for any x, x′ in X such that h*(x) < h*(x′) there exists t ∈ ℝ such that Pr[G∘D(h*(x), ε) ≤ t] > Pr[G∘D(h*(x′), ε) ≤ t], where the probability is taken with respect to the probability measure of ε, and
(vi) the support of the marginal distribution of x includes X.
Then, h* is identified within W if and only if no two functions in W are strictly increasing transformations of each other.
Assumptions (i) and (iii) guarantee that increasing values of h*(x) generate nonincreasing values of the probability of y given x. Assumption (v) slightly strengthens this, guaranteeing that variations in the value of h* are translated into variations in the values of the conditional distribution of y given x. Assumption (ii) implies that whenever two functions are not strictly increasing transformations of each other, we can find two neighborhoods at which each function attains different values from the other function. Assumptions (iv) and (vi) guarantee that those neighborhoods have positive probability.

Note the generality of the result. One may be considering a very complicated model determining the way by which an observable vector x influences the value of an observable variable y. If the influence of x can be aggregated by the value of a function h*, the unobservable random variable ε in the model is distributed independently of x, and both h* and ε influence y in a nondecreasing way, then one can identify the aggregator function h* up to a strictly increasing transformation. The identification of a more general model, where ε is not necessarily independent of x, h* is a vector of functions, and G∘D is not necessarily monotone increasing on its domain, has not yet been studied.

For the result of the above theorem to have any practicality, one needs to find sets of functions such that no two functions in the set are strictly increasing transformations of each other. When the functions are linear in a finite dimensional parameter, say h(x) = β·x, one can guarantee this by requiring, for example, that ‖β‖ = 1 or β_K = 1, where β = (β₁, …, β_K). When the functions are nonparametric, one can use the restrictions of economic theory. The set of homogeneous of degree one functions that attain a given value, α, at a given point, x*, for example, is such that no two functions are strictly increasing transformations of each other. To see this, suppose that h and h′ are in this set and that for some strictly increasing function f, h′ = f∘h; then since h(λx*) = h′(λx*) for each λ ≥ 0, it follows that f(t) = f(α·(t/α)) = f(h((t/α)·x*)) = h′((t/α)·x*) = t. So, f is the identity function. It follows that h′ = h. Matzkin (1990b, 1993a) shows that the set of least-concave³ functions that attain common values at two points in their domain is also a set such that no two functions in the set are strictly increasing transformations of each other. The sets of additively separable functions described in Matzkin (1992, 1993a) also satisfy this requirement. Other sets of restrictions that could also be used remain to be studied.
³A function v: X → ℝ, where X is a convex subset of ℝ^K, is least-concave if it is concave and if any concave function, v′, that can be written as a strictly increasing transformation, f, of v can also be written as a concave transformation, g, of v. For example, v(x₁, x₂) = (x₁·x₂)^{1/2} is least-concave, but u(x₁, x₂) = log(x₁) + log(x₂) is not.
Summarizing, we have shown that restrictions of economic theory can be used to identify the aggregator function h* in LDV models where the functions D and G are unknown. In the next subsections we will see how much more can be recovered in some particular models where the functions D and G are known.

2.2.2. Binary threshold crossing models
A particular case of a generalized regression model where G and D are known is the binary threshold crossing model. This model is widely used not only in economics but in other sciences, such as biology, physics, and medicine, as well. The books by Cox (1970), Finney (1971) and Maddala (1983), among others, describe several empirical applications of these models. The semi- and nonparametric identification and estimation of these models has been studied, among others, by Cosslett (1983), Han (1987), Horowitz (1992), Hotz and Miller (1989), Ichimura (1993), Klein and Spady (1993), Manski (1975, 1985, 1988), Matzkin (1990b, 1990c, 1992), Powell et al. (1989), Stoker (1986) and Thompson (1989). The following theorem has been shown in Matzkin (1994):
Theorem. Identification of (h*, F*) in a binary choice model

Suppose that
(i) y* = h*(x) + ε; y = 1 if y* ≥ 0, y = 0 otherwise,
(ii) h*: X → ℝ, where X ⊂ ℝ^K, belongs to a set W of functions h: X → ℝ that are continuous and strictly increasing in the Kth coordinate of x,
(iii) ε is distributed independently of x,
(iv) the conditional probability of the Kth coordinate of x has a Lebesgue density that is everywhere positive, conditional on the other coordinates of x,
(v) F*, the cumulative distribution function (cdf) of ε, is strictly increasing, and
(vi) the support of the marginal distribution of x is included in X.

Let Γ denote the set of monotone increasing functions on ℝ with values in the interval [0, 1]. Then, (h*, F*) is identified within (W × Γ) if and only if W is a set of functions such that no two functions in W are strictly increasing transformations of each other.

Assumptions (ii) and (vi) are the same as in the previous theorem and they play the same role here as they did there. Assumptions (i) and (v) guarantee that assumptions (i) and (v) in the previous theorem are satisfied. They also guarantee that the cdf F* is identified when h* is identified. Note that the set of functions W within which h* is identified satisfies the same properties as the set in the previous theorem. So, one can use sets of homogeneous of degree one functions, least-concave functions, and additively separable functions to guarantee the identification of h* and F* in binary threshold crossing models.
2.2.3. Discrete choice models
Discrete choice models have been extensively used in economics since the pioneering work of McFadden (1974, 1981). The choice among modes of transportation, the choice among occupations, and the choice among appliances have, for example, been studied using these models. See, for example, Maddala (1983) for an extensive list of empirical applications of these models.

In discrete choice models, a typical agent chooses one alternative from a set A = {1, …, J} of alternatives. The agent possesses an observable vector, s ∈ S, of socioeconomic characteristics. Each alternative j in A is characterized by a vector of observable attributes z_j ∈ Z, which may be different for each agent. For each alternative j ∈ A, the agent's preferences for alternative j are represented by the value of a random function U defined by U(j) = V*(j, s, z_j) + ε_j, where ε_j is an unobservable random term. The agent is assumed to choose the alternative that maximizes his utility; i.e., he is assumed to choose alternative j iff

V*(j, s, z_j) + ε_j > V*(k, s, z_k) + ε_k,   for k = 1, …, J; k ≠ j.
(We are assuming that the probability of a tie is zero.) The identification of these models concerns the unknown function V* and the distribution of the unobservable random vector ε = (ε₁, …, ε_J). The observable variables are the chosen alternatives, the vector s of socioeconomic characteristics, and the vector z = (z₁, …, z_J) of attributes of the alternatives. The papers by Strauss (1979), Yellott (1977) and those mentioned in the previous subsection concern the nonparametric and semiparametric identification of discrete choice models. A result in Matzkin (1993a) concerns the identification of V* when the distribution of the vector of unobservable variables (ε₁, …, ε_J) is allowed to depend on the vector of observable variables (s, z₁, …, z_J). Letting (ε₁, …, ε_J) depend on (s, z) is important because there is evidence that the estimators for discrete choice models may be very sensitive to heteroskedasticity of ε (Hausman and Wise (1978)). The identification result is obtained using the assumptions that (i) the V*(j, ·) functions are continuous and the same for all j, i.e. there exists v* such that for all j, V*(j, s, z_j) = v*(s, z_j), and (ii), conditional on (s, z₁, …, z_J), the ε_j's are i.i.d.⁴ Matzkin (1993a) shows that a sufficient condition for v*: S × Z → ℝ to be identified within a set of continuous functions W is that for any two functions v, v′ in W there exists a vector s such that v(s, ·) is not a strictly increasing transformation of v′(s, ·). So, for example, when the functions v: S × Z → ℝ in W are such that for each s, v(s, ·) is homogeneous of degree one, continuous, convex and attains a value α at some given vector z*, one can identify the function v*. A second result in Matzkin (1993a) extends techniques developed by Yellott (1977)

⁴Manski (1975, 1985) used this conditional independence assumption to analyze the identification of semiparametric discrete choice models.
and Strauss (1979). The result is obtained under the assumption that the distribution of ε is independent of the vector (s, z). It is shown that using shape restrictions on the distribution of ε and on the function V*, one can recover the distribution of the vector (ε₂ − ε₁, …, ε_J − ε₁) and the V*(j, ·) functions over some subset of their domain. The restrictions on V* involve knowing its values at some points and requiring that V* attains low enough values over some sections of its domain. For example, Matzkin (1993a) shows that when V* is a monotone increasing and concave function whose values are known at some points, V* can be identified over some subset of its domain. The nonparametric identification of discrete choice models under other nonparametric assumptions on the distribution of the ε's remains to be studied.
2.3. Identification of functions generating regression functions

Several models in economics are specified by the functional relation

y = f*(x) + ε,    (7)
where x and ε are, respectively, vectors of observable and unobservable functionally independent variables, and y is the observable vector of dependent variables. Under some weak assumptions, the function f*: X → ℝ can be recovered from the joint distribution of (x, y) without need of specifying any parametric structure for f*. To see this, suppose that E(ε|x) = 0 a.s.; then E(y|x) = f*(x) a.s. Hence, if f* is continuous and the support of the marginal distribution of x includes the domain of f*, we can recover f*. A similar result can be obtained making other assumptions on the conditional distribution of ε, such as Median(ε|x) = 0 a.s.

In most cases, however, the object of interest is not a conditional mean (or a conditional median) function f*, but some “deeper” function, such as a utility function generating the distribution of demand for commodities by a consumer, or a production function generating the distribution of profits of a particular firm. In these cases, one could still recover these deeper functions, as long as they influence f*. This requires using results of economic theory about the properties that f* needs to satisfy.

For example, suppose that in the model (7) with E(ε|x) = 0, x is a vector (p, I) of prices of K commodities and income of a consumer, and the function f* denotes for each (p, I) the vector of commodities that maximizes the consumer's utility function U* over the budget set {z ≥ 0 | p·z ≤ I}; ε denotes a measurement error. Then, imposing theoretical restrictions on f* we can guarantee that the preferences represented by U* can be recovered from f*. Moreover, since f* can be recovered from the joint distribution of (y, p, I), it follows that U* can also be recovered from this distribution. Hence, U* is identified. The required theoretical restrictions on f* have been developed by Mas-Colell (1977).
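Before turning to Mas-Colell's restrictions, note that the first step — recovering f* as the conditional mean E(y|x) — can be carried out with any standard nonparametric regression estimator. A minimal Nadaraya-Watson sketch (our own illustration; the Gaussian kernel and the bandwidth value are arbitrary choices, not part of the chapter) is:

```python
import numpy as np

def nadaraya_watson(x_data, y_data, x, bandwidth=0.5):
    """Kernel estimate of f*(x) = E(y | x) under E(eps | x) = 0.

    x_data : (N, K) regressors;  y_data : (N,) responses;  x : (K,) point.
    """
    d = x_data - x
    w = np.exp(-0.5 * np.sum(d * d, axis=1) / bandwidth**2)  # Gaussian kernel
    return np.sum(w * y_data) / np.sum(w)
```

For a vector-valued demand function, the same weights are simply applied to each coordinate of y.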
Theorem. Recoverability of utility functions from demand functions (Mas-Colell (1977))

Let W denote a set of monotone increasing, continuous, concave and strictly quasiconcave functions such that no two functions in W are strictly increasing transformations of each other. For any U ∈ W, let f(p, I; U) denote the demand function generated by U, where p ∈ ℝ^K_{++} denotes a vector of prices and I ∈ ℝ_{++} denotes a consumer's income. Then, for any U, U′ in W such that U ≠ U′, one has that

f(·, ·; U) ≠ f(·, ·; U′).

This result states that different utility functions generate different demand functions when the set of all possible values of the vector (p, I) is ℝ^{K+1}_{++}. The assumption that the utility functions in the set W are concave is the critical assumption guaranteeing that the same demand function can not be generated from two different utility functions in the set W. Mas-Colell (1978) shows that, under certain regularity conditions, one can construct the preferences represented by U* by taking the limit, with respect to an appropriate distance function, of a sequence of preferences. The sequence is constructed by letting {p^i, I^i}_{i=1}^∞ be a sequence that becomes dense in ℝ^{K+1}_{++}. For each N, a utility function V_N is constructed using Afriat's (1967a) construction:

V_N(z) = min{ V^i + λ^i p^i·(z − z^i) | i = 1, …, N },

where z^i = f*(p^i, I^i) and the V^i's and λ^i's are any numbers satisfying the inequalities

V^i ≤ V^j + λ^j p^j·(z^i − z^j),   i, j = 1, …, N,
λ^i ≥ 0,   i = 1, …, N.
The preference relation represented by U* is the limit of the sequence of preference relations represented by the functions V_N as N goes to ∞. Summarizing, we have shown that using Mas-Colell's (1977) result about the recoverability of utility functions from demand functions, we can identify a utility function from the distribution of its demand. Following a procedure similar to the one described above, one could obtain nonparametric identification results for other models of economic theory. Brown and Matzkin (1991) followed this path to show that the preferences of heterogeneous consumers in a pure exchange economy can be identified from the conditional distribution of equilibrium prices given the endowments of the consumers.
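As a concrete illustration of Afriat's construction, the numbers V^i and λ^i can be computed by linear programming, after which V_N is a one-line function. The following is a sketch of our own (the LP formulation, the normalization V^1 = 0, and the names are assumptions, and feasibility presumes the observed data are consistent with utility maximization):

```python
import numpy as np
from scipy.optimize import linprog

def afriat_numbers(P, Z, lam_min=1.0):
    """Find V^i, lambda^i with V^i <= V^j + lambda^j p^j.(z^i - z^j).

    P : (N, K) observed prices p^i;  Z : (N, K) observed demands z^i.
    Any feasible point works; we minimize sum(lambda) for definiteness.
    """
    N = P.shape[0]
    A, b = [], []
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            # V^i - V^j - lambda^j * p^j.(z^i - z^j) <= 0
            row = np.zeros(2 * N)
            row[i], row[j] = 1.0, -1.0
            row[N + j] = -P[j] @ (Z[i] - Z[j])
            A.append(row)
            b.append(0.0)
    bounds = [(None, None)] * N + [(lam_min, None)] * N
    bounds[0] = (0.0, 0.0)  # V is determined only up to location: fix V^1 = 0
    c = np.concatenate([np.zeros(N), np.ones(N)])
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=bounds)
    return res.x[:N], res.x[N:]

def V_N(z, V, lam, P, Z):
    """Afriat utility: V_N(z) = min_i { V^i + lambda^i p^i.(z - z^i) }."""
    return np.min(V + lam * (P @ z - np.einsum('ik,ik->i', P, Z)))
```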
2.4. Identification of simultaneous equations models
Restrictions of economic theory can also be used to identify the structural equations of a system of nonparametric simultaneous equations. In particular, when the functions in the system of equations are continuously differentiable, this could be
done by determining what type of restrictions guarantee that a given matrix is of full rank. This matrix is presented in Roehrig (1988). Following Roehrig, let us describe a system of structural equations by

r*(x, y) − u = 0,

where x ∈ ℝ^K, y, u ∈ ℝ^G, and r*: ℝ^K × ℝ^G → ℝ^G; y denotes a vector of observable endogenous variables, x denotes a vector of observable exogenous variables, and u denotes a vector of unobservable exogenous variables. Let φ* denote the joint distribution of (x, u). Suppose that (i) for all (x, y), ∂r*/∂y is of full rank, (ii) there exists a function π such that y = π(x, u), and (iii) φ* is such that u is distributed independently of x. Let (r, φ) be another pair satisfying these same conditions. Then, under certain assumptions on the support of the probability measures, Roehrig (1988) shows that a necessary and sufficient condition guaranteeing that P(r*, φ*) = P(r, φ) is that for all i = 1, …, G and all (x, y) the rank of the matrix
is less than G + 1. In the above expression, r_i denotes the ith coordinate function of r, and P(r, φ) is the joint distribution of the observable vectors (x, y) when (r*, φ*) is substituted with (r, φ).

Consider, for example, a simple system of a demand and a supply function described by

q = d(I, p, w) + ε_d,
p = s(w, q, I) + ε_s,

where q denotes quantity, p denotes price, I denotes the income of the consumers and w denotes input price. Then, using the restrictions of economic theory that ∂d/∂w = 0, ∂s/∂I = 0, ∂d/∂I ≠ 0 and ∂s/∂w ≠ 0, one can show that both the demand function and the supply function are identified up to additive constants. Kadiyali (1993) provides a more complicated example where Roehrig's (1988) conditions are used to determine when the cost and demand functions of the firms in a duopolistic market are nonparametrically identified. I am not aware of any other work that has used these conditions to identify a nonparametric model.
3. Nonparametric estimation using economic restrictions
Once it has been established that a function can be identified nonparametrically, one can proceed to develop nonparametric estimators for that function. Several methods exist for nonparametrically estimating a given function. In the following subsections we will describe some of these methods. In particular, we will be
concerned with the use of these methods to estimate nonparametric functions subject to restrictions of economic theory. We will be concerned only with independent observations.

Imposing restrictions of economic theory on an estimator of a function may be necessary to guarantee the identification of the function being estimated, as in the models described in the previous section. They may also be used to reduce the variance of the estimators. Or, they may be imposed to guarantee that the results are meaningful, such as guaranteeing that an estimated demand function is downward sloping. Moreover, for some nonparametric estimators, imposing shape restrictions is critical for the feasibility of their use. It is to these estimators that we turn next.
3.1. Estimators that depend on the shape of the estimated function
When a function that one wants to estimate satisfies certain shape properties, such as monotonicity and concavity, one can use those properties to estimate the function nonparametrically. The main practical tool for obtaining these estimators is the possibility of using the shape properties of the nonparametric function to characterize the set of values that it can attain at any finite number of points in its domain. The estimation method proceeds by, first, estimating the values (and possibly the gradients or subgradients) of the nonparametric function at a finite number of points of its domain, and second, interpolating among the obtained values. The estimators in the first step are subject to the restrictions implied by the shape properties of the function. The interpolated function in the second step satisfies those same shape properties. The estimator presented in the introduction was obtained using this method. In that case, the constraints on the vector (h^0, …, h^{N+1}; T^0, …, T^{N+1}) of values and subgradients of a convex, homogeneous of degree one, and monotone function were
h^i = T^i · x^i,   i = 0, …, N + 1,    (4′)
h^i ≥ T^j · x^i,   i, j = 0, …, N + 1,    (5′)
T^i ≥ 0,   i = 0, …, N + 1.    (6′)

The constraints on the vector (F^1, …, F^N) of values of a cdf were

F^i ≤ F^j  if  h^i ≤ h^j,   i, j = 1, …, N,    (2′)
0 ≤ F^i ≤ 1,   i = 1, …, N.    (3′)
The necessity of the first set of constraints follows by definition. A function h: X → ℝ, where X is an open and convex set in ℝ^K, is convex if and only if for all x ∈ X there exists T(x) ∈ ℝ^K such that for all y ∈ X, h(y) ≥ h(x) + T(x)·(y − x). Let h be a convex
function and T(x) a subgradient of h at x; h is homogeneous of degree one if and only if h(x) = T(x)·x, and h is monotone increasing if and only if T(x) ≥ 0. Letting x = x^i, y = x^j, h(x) = h^i, h(y) = h^j and T(x) = T^i, one gets the above constraints. Conversely, to see that if the vector (h^0, …, h^{N+1}; T^0, …, T^{N+1}) satisfies the above constraints with h^0 = 0 and h^{N+1} = α, then its coordinates must correspond to the values and subgradients at x^0, …, x^{N+1} of some convex, monotone and homogeneous of degree one function, we note that the function h(x) = max{T^i·x | i = 0, …, N + 1} is one such function. (See Matzkin (1992) for a more detailed discussion of these arguments.)
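A minimal sketch of this characterization (our own, with assumed names) checks the constraints (4′)-(6′) for a candidate (h, T) and constructs the witness function h(x) = max_i T^i·x:

```python
import numpy as np

def satisfies_shape_constraints(h, T, X, tol=1e-10):
    """Check (4')-(6') for candidate values h[i], subgradients T[i], points X[i].

    h : (M,) values h^i;  T : (M, K) subgradients;  X : (M, K) points,
    with M = N + 2 (the data points plus x^0 = 0 and x^{N+1} = x*).
    """
    hd = np.einsum('ik,ik->i', T, X)              # T^i . x^i
    homog = np.allclose(h, hd, atol=tol)          # (4'): h^i = T^i . x^i
    convex = np.all(h[:, None] >= X @ T.T - tol)  # (5'): h^i >= T^j . x^i
    monot = np.all(T >= -tol)                     # (6'): T^i >= 0
    return homog and convex and monot

def witness(T):
    """The convex, homogeneous, monotone function h(x) = max_i T^i . x."""
    return lambda x: np.max(T @ np.asarray(x))
```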
The estimators for (h*, F*) obtained by interpolating the results of the optimization in (1)-(6) are consistent. This can be proved by noting that they are maximum likelihood estimators and using results about the consistency of not-necessarily parametric maximum likelihood estimators, such as Wald (1949) and Kiefer and Wolfowitz (1956). To see that (ĥ, F̂) is a maximum likelihood estimator, let the set of nonparametric estimators for (h*, F*) be the set of functions that solve the problem

max_{(h,F)} L_N(h, F) = Σ_{i=1}^N { y^i log[F(h(x^i))] + (1 − y^i) log[1 − F(h(x^i))] }
subject to (h, F) ∈ (H × Γ),    (8)
where H is the set of convex, monotone increasing, and homogeneous of degree one functions that attain the value α at x*, and Γ is the set of monotone increasing functions on ℝ whose values lie in the interval [0, 1]. Notice that the value of L_N(h, F) depends on h and F only through the values that these functions attain at a finite number of points. As seen above, the behavior of these values is completely characterized by the restrictions (2)-(6) in the problem in the introduction. Hence, the set of solutions of the optimization problem (8) coincides with the set of solutions obtained by interpolating the solutions of the optimization problem described by (1)-(6). So, the estimators we have been considering are maximum likelihood estimators. We are not aware of any existing results about the asymptotic distribution of these nonparametric maximum likelihood estimators.

The principles that have been exemplified in this subsection can be generalized to estimate other nonparametric models, using possibly other types of extremum estimators, and subject to different sets of restrictions on the estimated functions. The next subsection presents general results that can be used in those cases.
3.1.1. General types of shape restrictions
Generally speaking, one can interpret the theory behind estimators of the sort described in the previous subsection as an immediate extension of the theory behind parametric M-estimators. When a function is estimated parametrically using a
maximization procedure, the function is specified up to the value of some finite dimensional parameter vector θ ∈ ℝ^L, and an estimator for the parameter is obtained by maximizing a criterion function over a subset of ℝ^L. When the nonparametric shape restricted method is used, the function is specified up to some shape restrictions and an estimator is obtained by maximizing a criterion function over the set of functions satisfying the specified shape restrictions. The consistency of these nonparametric shape restricted estimators can be proved by extending the usual arguments to apply to subsets of functions instead of subsets of finite dimensional vectors. For example, the following result, which is discussed at length in the chapter by Newey and McFadden in this volume, can typically be used:

Theorem

Let m* be a function, or a vector of functions, that belongs to a set of functions M. Let L_N: M → ℝ denote a criterion function that depends on the data. Let m̂_N be an estimator for m*, defined by m̂_N ∈ argmax{L_N(m) | m ∈ M}. Assume that the following conditions are satisfied:
(i) The function L_N converges a.s. uniformly over M to a nonrandom continuous function L: M → ℝ.
(ii) The function m* uniquely maximizes L over the set M.
(iii) The set M is compact with respect to a metric d.

Then, any sequence of estimators {m̂_N} converges a.s. to m* with respect to the metric d. That is, with probability one, lim_{N→∞} d(m̂_N, m*) = 0.
of Economic Theory in Nonparametric
Methods
2541
problem. First, it is necessary that the function L_N depends on any m ∈ M only through the values that m attains at a finite number of points. And second, it is necessary that the values that any function m ∈ M may attain at those finite number of points can be characterized by a finite set of inequality constraints. When these conditions are satisfied, one can use standard routines to solve the finite dimensional optimization problem that arises when estimating functions using this method. The second requirement is not trivially satisfied. For example, there is no known finite set of necessary and sufficient conditions on the values of a function at a finite number of points guaranteeing that the function is differentiable and α-Lipschitzian⁵ (α > 0). In the example given in Section 3.1, the concavity of the functions was critical in guaranteeing that we can characterize the behavior of the functions at a finite number of points. While the results discussed in this section can be applied to a wide variety of models and shape restrictions, some types of models and shape restrictions have received particular attention. We next survey some of the literature concerning estimation subject to monotonicity and concavity restrictions.
3.1.2. Estimation of monotone functions
A large body of literature concerns the use of monotone restrictions to estimate nonparametric functions. Most of this work is summarized in an excellent book by Robertson et al. (1988), which updates results surveyed in a previous book by Barlow et al. (1972). (See also Prakasa Rao (1983).) The book by Robertson et al. describes results about the computation of the estimators, their consistency, rates of convergence, and asymptotic distributions. Subsection 9.2 in that book is of particular interest. In that subsection the authors survey existing results about monotone restricted estimators for the function f* in the model

y = f*(x) + ε,

where E(ε|x) = 0 a.s. or Median(ε|x) = 0. Key papers are Brunk (1970), where the consistency and asymptotic distribution of the monotone restricted least squares estimator for f* is studied when E(ε|x) = 0 and x ∈ [0, 1]; and Hanson et al. (1973), where consistency is proved when x ∈ [0, 1] × [0, 1]. Earlier, Ayer et al. (1955) had proved some weak convergence results. Recently, Wang (1992) derived the rate of convergence of the monotone restricted estimator for f* when E(ε|x) = 0 a.s. and x ∈ [0, 1] × [0, 1]. The asymptotic distribution of the least squares estimator for this latter case is not yet known. Of course, the general methods described in the previous subsection apply in particular to monotone functions. So, one can use those results to determine the consistency of monotone restricted estimators in a variety of models that may or may not fall into the categories of models that are usually studied. (See, for example, Cosslett (1983) and Matzkin (1990a).)

⁵A function h: X → ℝ, where X ⊂ ℝ^K, is α-Lipschitzian (α > 0) if for all x, y ∈ X, |h(x) − h(y)| ≤ α‖x − y‖.
gradients of the concave function (Matzkin (1986, 1991a), Balls (1987)). The constraints in (9) become

f^i ≤ f^j + T^j · (x^i − x^j),   i, j = 1, …, N,
and the minimization is over the values {f^i} and the vectors {T^i}. To add a monotonicity restriction, one includes the constraints

T^i ≥ 0,   i = 1, …, N.
To bound the subgradients by a vector B, or to bound the values of the function by the values of a function b, one uses, respectively, the constraints

−B ≤ T^i ≤ B,   i = 1, …, N,

and

−b(x^i) ≤ f^i ≤ b(x^i),   i = 1, …, N.
Algorithms for the resulting constrained optimization problem were developed by Dykstra (1983) and Goldman and Ruud (1992) for the least squares estimator, and by Matzkin (1993b) for general types of objective functions. The algorithms by Dykstra and by Goldman and Ruud are extensions of the method proposed by Hildreth (1954). This algorithm proceeds by solving the problem

min_{λ ≥ 0} ‖ y − A′λ ‖²,

where A is a matrix whose rows are all vectors β ∈ ℝ^N with β_i = 1 (some i), β_k ≤ 0 (all k ≠ i), and β′X = 0. The rows of the N × K matrix X are the observed points x^i, the first coordinates of which are ones. This is the dual of the problem of finding the vector z* that minimizes the sum of squared errors subject to concavity constraints

min_{A·z ≤ 0} ‖ y − z ‖².
The solution to this problem is ẑ = y − A′λ̂, where λ̂ is the solution to the dual problem. While the dual problem is minimized over more variables, the constraints are much simpler than those of the primal problem. The algorithm minimizes the objective function over one coordinate of λ at a time, repeating the procedure till convergence.
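The coordinate-wise minimization has a simple closed form. The following minimal sketch (our own construction for the one-dimensional case with distinct design points; the second-difference form of A and the fixed iteration count are assumptions, not Matzkin's or Hildreth's code) implements the dual algorithm:

```python
import numpy as np

def concave_ls_hildreth(x, y, n_iter=500):
    """Concavity restricted least squares via Hildreth's dual algorithm.

    Solves min_z ||y - z||^2 s.t. A z <= 0, where each row of A encodes a
    second-difference (concavity) constraint at an interior point of the
    sorted design.  Dual: min_{lam >= 0} ||y - A'lam||^2, minimized one
    coordinate of lam at a time; the primal solution is z = y - A'lam.
    """
    order = np.argsort(x)
    xs, ys = np.asarray(x, float)[order], np.asarray(y, float)[order]
    n = len(xs)
    A = np.zeros((n - 2, n))
    for i in range(1, n - 1):
        # Concavity: z_i(x_{i+1}-x_{i-1}) >= z_{i-1}(x_{i+1}-x_i) + z_{i+1}(x_i-x_{i-1})
        A[i - 1, i - 1] = xs[i + 1] - xs[i]
        A[i - 1, i] = -(xs[i + 1] - xs[i - 1])
        A[i - 1, i + 1] = xs[i] - xs[i - 1]
    lam = np.zeros(n - 2)
    r = ys - A.T @ lam                  # current residual y - A'lam
    norms = np.sum(A * A, axis=1)
    for _ in range(n_iter):
        for j in range(n - 2):
            r = r + A[j] * lam[j]       # remove coordinate j's contribution
            lam[j] = max(0.0, (A[j] @ r) / norms[j])
            r = r - A[j] * lam[j]       # restore it at the updated value
    z = np.empty(n)
    z[order] = r                        # fitted values, original order
    return z
```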
The consistency of the concavity restricted least squares estimator of a multivariate nonparametric concave function can be proved using the consistency result presented in Section 3.1.1. Suppose, for example, that in the model

    y = f*(x) + ε,

x ∈ X, where X is an open and convex subset of R^K, f*: X → R^q, and the unobserved vector ε ∈ R^q is distributed independently of x with mean 0 and variance Σ. Let B ∈ R^K, and b: X → R^q. Assume that f* belongs to the set, H, of concave functions f: X → R^q whose subgradients are uniformly bounded by B and whose values satisfy that for all x ∈ X, |f(x)| ≤ b(x). Then, H is a compact set, in the sup norm, of equicontinuous functions. So, following the same arguments as in, e.g., Epstein and Yatchew (1985) and Gallant (1987), one can show that the function L_N: H → R defined by

    L_N(f) = (1/N) Σ_{i=1}^N (y^i − f(x^i))′ Σ^{-1} (y^i − f(x^i))

converges a.s. uniformly to the continuous function L: H → R defined by

    L(f) = q + ∫ (f(x) − f*(x))′ Σ^{-1} (f(x) − f*(x)) dμ(x),

where μ is the probability measure of x. Since the functions in H are continuous, L is uniquely minimized at f*. Hence, by the theorem of Subsection 3.1.1 it follows that the least squares estimator is a strongly consistent estimator for f*.

For an LAD (least absolute deviations) nonparametric concavity restricted estimator, Balls (1987) proposed proving consistency by showing that the distance between the concavity restricted estimator and the true function is smaller than the distance between an unrestricted consistent nonparametric splines estimator (see Section 3.2) and the true function. Matzkin (1986) showed consistency of a nonparametric concavity restricted maximum likelihood estimator using a variation of Wald's (1949) theorem, which uses compactness of the set H. No asymptotic distribution results are known for these estimators.
3.2. Estimation using seminonparametric methods
Seminonparametric methods proceed by approximating any function of interest with a parametric approximation. The larger the number of observations available to estimate the function, the larger the number of parameters used in the approximating function and the better the approximation. The parametric approximations are chosen so that as the number of observations increases, the sequence of parametric approximations converges to the true function, for appropriate values of the parameters. A popular example of such a class of parametric approximations is the set of
functions defined by the Fourier flexible form (FFF) expansion

    g_N(x, θ) = b′x + x′Cx + Σ_{|k|* ≤ T} u_k e^{ik′x},    x ∈ R^K,

where i = √−1, b ∈ R^K, C is a K × K matrix, u_k = a_k + i v_k for some real numbers a_k and v_k, k = (k_1, …, k_K) is a vector with integer coordinates, and |k|* = Σ_{i=1}^K |k_i|. (See Gallant (1981).) To guarantee that the above sum is real valued, it is imposed that v_0 = 0, a_k = a_{−k} and v_k = −v_{−k}. Moreover, the values of each coordinate of x need to be modified to fall into the [0, 2π] interval. The coordinates of the parameter vector θ are the a_k's, the v_k's and the coefficients of the linear and quadratic terms. Important advantages of this expression are that it is linear in the parameters and its partial derivatives are easily calculated. As T → ∞, the FFF and its partial derivatives up to order m − 1 approximate in an L^p norm any m times differentiable function and its m − 1 derivatives.

Imposing restrictions on the values of the parameters of the approximation, one can guarantee that the resulting estimator satisfies a desired shape property. Gallant and Golub (1984), for example, impose quasi-convexity in the FFF estimator by calculating the estimator for θ as the solution to a constrained minimization problem

    min_θ s_N(θ)    subject to    r(θ) ≥ 0,

where s_N(·) is a data dependent function, such as a weighted sum of squared errors, r(θ) = min_x v(x, θ) and v(x, θ) = min_z {z′D²g_N(x, θ)z : z′Dg_N(x, θ) = 0, z′z = 1}. Dg_N and D²g_N denote, respectively, the gradient and Hessian of g_N with respect to x. Gallant and Golub (1984) have developed an algorithm to solve this problem. Gallant (1981, 1982) developed restrictions guaranteeing that the Fourier flexible form approximation satisfies homotheticity, linear homogeneity or separability.

The consistency of seminonparametric estimators can typically be shown by appealing to the following theorem, which is presented and discussed in Gallant (1987) and Gallant and Nychka (1987, Theorem 0).

Theorem

Suppose that m* belongs to a set of functions M. Let L_N: M → R denote a criterion function that depends on the data. Let {M_N} denote an infinite sequence of subsets of M such that … M_N ⊂ M_{N+1} ⊂ M_{N+2} …. Let m̂_N be an estimator for m*, defined by m̂_N = argmax {L_N(m) : m ∈ M_N}. Assume that the following conditions are satisfied.

(i) The function L_N converges a.s. uniformly over M to a nonrandom continuous function L: M → R.
(ii) The function m* uniquely maximizes L over the set M.
(iii) The set M is compact with respect to a metric d.
(iv) There exists a sequence of functions {g_N} ⊂ M such that g_N ∈ M_N for all N = 1, 2, … and d(g_N, m*) → 0.
Then, the sequence of estimators {m̂_N} converges a.s. to m* with respect to the metric d. That is, with probability one, lim_{N→∞} d(m̂_N, m*) = 0.

This result is very similar to the theorem in Subsection 3.1.1. Indeed, Assumptions (i)-(iii) play the same role here as they played in that theorem. Assumption (iv) is necessary to substitute for the fact that the maximization of L_N for each N is not over the whole space M but only over a subset, M_N, of M. This assumption is satisfied when the M_N sets become dense in M as N → ∞. (See Gallant (1987) for more discussion about this result.)

Asymptotic normality results for Fourier flexible forms and other seminonparametric estimators have been developed, among others, by Andrews (1991), Eastwood (1991), Eastwood and Gallant (1991) and Gallant and Souza (1991). None of these considers the case where the estimators are restricted to be concave.

The M_N sets are typically defined by using results that allow one to characterize any arbitrary function as the limit of an infinite sum of parametric functions. The Fourier flexible form described above is one example of this. Each set M_N is defined as the set of functions obtained as the sum of the first T(N) terms in the expansion, where T(N) is increasing in N and such that T(N) → ∞ as N → ∞. Some other types of expansions that have been used to define parametric approximations are Hermite forms (Gallant and Nychka (1987)), power series (Bergstrom (1985)), splines (Wahba (1990)), and Müntz-Szatz type series (Barnett and Yue (1988a, 1988b) and Barnett et al. (1991)).

Splines are smooth functions that are piecewise polynomials. Kimeldorf and Wahba (1971), Utreras (1984, 1985), Villalobos and Wahba (1987) and Wong (1984) studied the imposition of monotonicity and convexity restrictions on splines estimators. Yatchew and Bos (1992) proposed using splines to estimate a consumer demand function subject to the implications of economic theory on demand functions. Barnett et al. (1991) impose concavity in a Müntz-Szatz type series by requiring that each term in the expansion satisfies concavity. This method for imposing concavity restrictions in series estimators was proposed by McFadden (1985).
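As an illustration of how such expansions are used in practice, the sketch below builds the regressor matrix of a Fourier flexible form for data already rescaled to [0, 2π]^K; the estimator for θ is then an ordinary linear regression on these columns. The function is hypothetical and follows the real-valued (cosine/sine) version of the expansion displayed above.

    import numpy as np
    from itertools import product

    def fff_design(X, T):
        """Regressor matrix for the real-valued form of the FFF:
        linear terms, quadratic terms, and cos(k'x), sin(k'x) for
        multi-indices k with sum_i |k_i| <= T (one member of each
        {k, -k} pair). X is n x K, already rescaled to [0, 2*pi]."""
        n, K = X.shape
        cols = [X]                                    # b'x terms
        for i in range(K):                            # x'Cx terms
            for j in range(i, K):
                cols.append((X[:, i] * X[:, j])[:, None])
        used = set()
        for k in product(range(-T, T + 1), repeat=K):
            if sum(abs(v) for v in k) > T or not any(k):
                continue
            if tuple(-v for v in k) in used:
                continue                              # conjugate pair done
            used.add(k)
            kx = X @ np.array(k, dtype=float)
            cols.append(np.cos(kx)[:, None])
            cols.append(np.sin(kx)[:, None])
        return np.hstack(cols)

    # least squares estimate of theta:
    # theta_hat = np.linalg.lstsq(fff_design(X, T), y, rcond=None)[0]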
3.3. Estimation using weighted average methods

A weighted average estimator, f̄, for the function f* in the model

    y = f*(x) + ε

can be built up from the monotone envelopes

    f_U(x) = min {f_K(x′) : x′ ≥ x},    f_L(x) = max {f_K(x′) : x′ ≤ x},

where f_K is a kernel estimator for f. The consistency of this estimator follows from the consistency of the kernel estimator. No asymptotic distribution for it is known.

A kernel estimator was also used in Matzkin (1991d) to obtain a smooth interpolation of a concavity restricted nonparametric maximum likelihood estimator and in Matzkin and Newey (1992) to estimate a homogeneous function in a binary threshold crossing model. The Matzkin and Newey estimator possesses a known asymptotic distribution.
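The min and max operations in the displayed formulas are easy to compute on a grid. The following sketch, for a scalar regressor and a Gaussian kernel, is a hypothetical illustration of how the two monotone envelopes of a kernel fit are obtained; it is not the exact estimator of the papers cited above.

    import numpy as np

    def kernel_fit(x, y, grid, h):
        """Nadaraya-Watson fit with a Gaussian kernel."""
        w = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2)
        return (w @ y) / w.sum(axis=1)

    def monotone_envelopes(fk):
        """For fitted values fk on an increasing grid, compute
        fU(x) = min over x' >= x of fk(x') and
        fL(x) = max over x' <= x of fk(x'); both are nondecreasing."""
        fU = np.minimum.accumulate(fk[::-1])[::-1]
        fL = np.maximum.accumulate(fk)
        return fU, fL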
4. Nonparametric tests using economic restrictions
The testing of economic hypotheses in parametric models suffers from drawbacks similar to those of the estimation of parametric models; the conclusions depend on the parametric specifications used. Suppose, for example, that one is interested in testing whether some given consumer demand data provide support for the classical model of utility maximization. The parametric approach would proceed by: first, specifying parametric structures for the demand functions; second, using the demand data to estimate the parameters; and then testing whether the estimated demand functions satisfy the integrability conditions. But, if the integrability conditions are not satisfied by the parametrically estimated demand functions it is not clear whether this is evidence against the utility maximization model or against the particular parametric structures chosen. In contrast, a nonparametric test of the utility maximization model would use demand functions estimated nonparametrically. In this case, rejection of the integrability conditions provides stronger evidence against the utility maximization model.
4.1. Nonstatistical tests
A large body of literature dating back to the work of Samuelson (1938) and Houthakker (1950) on Revealed Preference has developed nonparametric tests for the hypothesis that data is consistent with a particular choice model, such as the choice made by a consumer or a firm. Most of these tests are nonstatistical. The data is assumed to be observed without error and the models contain no unobservable random terms. (One exception is the Axiom of Revealed Stochastic Rationality (McFadden and Richter (1970, 1990)), where conditions are given characterizing discrete choice probabilities generated by a random utility function.) In the nonstatistical tests, an hypothesis is rejected if at least one in a set of
nonparametric restrictions is violated; the hypothesis is accepted otherwise. The nonparametric restrictions used to test the hypotheses are typically expressed in one of two different ways. Either they establish that a solution must exist for a certain finite system of inequalities whose coefficients are determined by the data; or they establish that certain algebraic conditions must be satisfied by the data. For example, the Strong Axiom of Revealed Preference is one of the algebraic conditions that is used in these tests. To provide an example of such results, we state below Afriat’s (1967a) Theorem, which is fundamental in this literature. Afriat’s Theorem can be used to test the consistency of demand data with the hypothesis that observed commodity bundles are the maximizers of a common utility function over the budget sets determined by observed prices of the commodities and incomes of a consumer. If the data correspond to different individuals, the conditions of the theorem can be used to test the existence of a utility function that is common to all of them. Afriat’s Theorem
(1967a)

Let {x^i, p^i, I^i}_{i=1}^N denote a set of N observations on commodity bundles x^i, prices p^i, and incomes I^i such that for all i, p^i·x^i = I^i. Then, the following conditions are equivalent.

(i) There exists a nonsatiated function V: R^K → R such that for all i = 1, …, N and all y ∈ R^K, [p^i·y ≤ I^i] ⇒ [V(y) ≤ V(x^i)].

(ii) The data {x^i, p^i, I^i}_{i=1}^N satisfy Cyclical Consistency; i.e., for all sequences {i, j, k, …, r, t} in {1, …, N},

    [p^j·x^i ≤ I^j, p^k·x^j ≤ I^k, …, p^t·x^r ≤ I^t] ⇒ [I^i ≤ p^i·x^t].

(iii) There exist numbers λ^i > 0 and V^i (i = 1, …, N) satisfying

    V^i ≤ V^j + λ^j p^j·(x^i − x^j),    i, j = 1, …, N.

(iv) There exists a monotone increasing, concave and continuous function V: R^K → R such that for all i = 1, …, N and all y ∈ R^K, [p^i·y ≤ I^i] ⇒ [V(y) ≤ V(x^i)].

This result states that the data could have been generated by the maximization of a common nonsatiated utility function (condition (i)) if and only if the data satisfy the set of algebraic conditions stated in condition (ii). In Figure 3, two observations that do not satisfy Cyclical Consistency are graphed. In these observations, p^1·x^2 < I^1 = p^1·x^1 and p^2·x^1 < I^2 = p^2·x^2. The theorem also states that a condition equivalent to Cyclical Consistency is that one can find numbers λ^i > 0 and V^i (i = 1, …, N) satisfying the linear inequalities in (iii). For example, no such numbers can be found for the observations in Figure 3; since when p^1·(x^2 − x^1) < 0, p^2·(x^1 − x^2) < 0, and λ^1, λ^2 > 0, the inequalities in (iii) imply that V^1 − V^2 < 0 and V^2 − V^1 < 0.
Figure 3
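Condition (iii) makes the test computable in practice: the existence of numbers V^i and λ^i > 0 satisfying the Afriat inequalities is a linear programming feasibility problem. Below is a minimal sketch using scipy; since the inequalities are homogeneous of degree one in (V, λ), the bound λ^i ≥ 1 can be imposed in place of λ^i > 0 without loss of generality. The function name and interface are hypothetical.

    import numpy as np
    from scipy.optimize import linprog

    def afriat_feasible(X, P):
        """Feasibility of condition (iii): find V_i and lam_i with
        V_i <= V_j + lam_j * p_j.(x_i - x_j) for all i, j.
        Variables are stacked as (V_1..V_N, lam_1..lam_N)."""
        N = X.shape[0]
        rows, rhs = [], []
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue
                row = np.zeros(2 * N)
                row[i] = 1.0                          # V_i
                row[j] = -1.0                         # -V_j
                row[N + j] = -P[j] @ (X[i] - X[j])    # -lam_j p_j.(x_i - x_j)
                rows.append(row); rhs.append(0.0)
        bounds = [(None, None)] * N + [(1.0, None)] * N
        res = linprog(np.zeros(2 * N), A_ub=np.array(rows), b_ub=rhs,
                      bounds=bounds, method="highs")
        return res.success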
Finally, the equivalence between conditions (i) and (iv) implies that if one can find a nonsatiated function that is maximized at the observed x^i's, then one can also find a monotone increasing, concave, and continuous function that is maximized at the observed x^i's.

Varian (1982) stated an alternative algebraic condition to Cyclical Consistency and developed algorithms to test the conditions of the above theorem. Along similar lines to the above theorem, a large literature deals with nonparametric tests for the hypothesis that a given set of demand data has been generated from the maximization of a utility function that satisfies certain shape restrictions. For example, Afriat (1967b, 1972a, 1973, 1981), Diewert (1973), Diewert and Parkan (1985), and Varian (1983a) provided tests for the consistency of demand data with additively separable, weakly separable and homothetic utility functions. Matzkin and Richter (1991) provided a test for the strict concavity and strict monotonicity of the utility function; and Chiappori and Rochet (1987) developed a test for the consistency of demand data with a strictly concave and infinitely differentiable utility function. To provide an example of one such set of conditions, the algebraic conditions developed by Chiappori and Rochet are that

(i) for all sequences {i, j, k, …, l, t} in {1, …, N}, [p^j·x^i ≤ I^j, p^k·x^j ≤ I^k, …, p^t·x^l ≤ I^t] ⇒ [I^i < p^i·x^t], and

(ii) for all i, j, [x^i = x^j] ⇒ [p^i = α p^j for some α > 0].

Yatchew (1985) provided nonparametric restrictions for demand data generated by utility maximization subject to budget sets that are the union of linear sets. Matzkin (1991b) developed restrictions for demand data generated subject to choice sets that possess monotone and convex complement and for choices that are each supported by a unique hyperplane.

Nonstatistical nonparametric tests for the hypothesis of cost minimization and profit maximization have also been developed. See, for example, Afriat (1972b), Diewert and Parkan (1979), Hanoch and Rothschild (1978), Richter (1985) and Varian (1984). Suppose, for example, that {y^i, p^i} are a set of observations on a vector of inputs and outputs, y, and a vector of the corresponding prices, p. Then, one of the results in the above papers is that {y^i, p^i} is consistent with profit maximization iff for all i, j = 1, …, N, p^i·y^i ≥ p^i·y^j (Hanoch and Rothschild (1978)).
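The Hanoch-Rothschild condition is a simple pairwise comparison, as the following sketch (hypothetical interface) shows:

    import numpy as np

    def profit_max_consistent(Y, P):
        """Hanoch-Rothschild check: rows of Y are netput vectors y_i,
        rows of P the corresponding price vectors p_i.  Consistency
        requires p_i . y_i >= p_i . y_j for all i, j."""
        profits = P @ Y.T                 # entry (i, j) is p_i . y_j
        return bool(np.all(np.diag(profits)[:, None] >= profits))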
Some of the above mentioned tests have been used in empirical applications. See, for example, Landsburg (1981), McDonald and Manser (1984) and Manser and McDonald (1988).

Nonparametric restrictions have also been developed to test efficiency in production. These tests, typically appearing under the heading of Data Envelope Analysis, use data on the input and output vectors of different facilities (decision making units, or DMU's) that are assumed to possess the same technology. Then, making assumptions about the technology, such as constant returns to scale, they determine the set of vectors of inputs and outputs that are efficient. A DMU is not efficient if its vector of input and output quantities is not in the efficiency set. See the paper by Seiford and Thrall (1990) for a survey of this literature.

Recently, nonparametric restrictions characterizing data generated by models other than the single agent optimization problem have been developed. Chiappori (1988) developed a test for the Pareto optimality of the consumption allocation within a household using data on aggregate household consumption and labor supply of each household member. Brown and Matzkin (1993) developed a test for the general equilibrium model, using data on market prices, aggregate endowments, consumers' incomes, and consumers' shares of profits. Nonparametric restrictions characterizing data consistent with other equilibrium models, such as various imperfect competition models, have not yet been developed. Varian (1983b) developed a test for the model of investors' behavior.

Some papers have developed statistical tests using the nonstatistical restrictions of some of the tests mentioned above (Varian (1985, 1988), Epstein and Yatchew (1985), Yatchew and Bos (1992) and Brown and Matzkin (1992), among others). As we will see in the next subsection, the test developed by Yatchew and Bos (1992), in particular, can be used with several of the above restrictions to obtain statistical nonparametric tests for economic models.
4.2. Statistical tests
Using nonparametric methods similar to those used to estimate nonparametric functions, it is possible to develop tests for the hypothesis that a nonparametric regression function satisfies a specified set of nonparametric shape restrictions. Yatchew and Bos (1992) and Gallant (1982), for example, present such tests.

The consistent test by Yatchew and Bos is based on a comparison of the restricted and unrestricted weighted sums of squared errors. More specifically, suppose that the model is specified by y = f*(x) + ε, where y ∈ R^q, x ∈ R^K, ε ∈ R^q, x and ε are independent, E(ε) = 0, and Cov(ε) = Σ. The null hypothesis is that f* ∈ F̄ ⊂ F, while the alternative hypothesis is that f* ∈ F∖F̄. The Sobolev^6 norms of the functions in the sets F and F̄ are uniformly bounded. The test proceeds as follows. First, divide the sample into two independent samples of the same size, T. Compute the estimators s²_F̄ and s²_F using, respectively, the first and second samples, where

    s²_F̄ = min_{f ∈ F̄} (1/T) Σ_{i=1}^T [y^i − f(x^i)]′ Σ^{-1} [y^i − f(x^i)]

and

    s²_F = min_{f ∈ F} (1/T) Σ_{i=1}^T [y^i − f(x^i)]′ Σ^{-1} [y^i − f(x^i)].

To transform these minimization problems into finite dimensional problems, Yatchew and Bos (1992) use a method similar to the one described in Section 3.1. They show that, under the null hypothesis, the asymptotic distribution of t_T = T^{1/2}[s²_F̄ − s²_F] is N(0, 2σ²), where σ² = Var(ε′Σ^{-1}ε). So, one can use standard statistical tables to determine whether the difference of the sums of squared errors is significantly different from zero. (This test builds on the work of Epstein and Yatchew (1985), Varian (1985) and Yatchew (1992).)

The Yatchew and Bos (1992) test can be used in conjunction with the nonstatistical nonparametric tests described in the previous subsection. Suppose, for example, that y^i denotes a vector of commodities purchased by a consumer and x^i denotes the vector of prices p^i and income I^i faced by the consumer when he or she purchased y^i. Assume that the observations are independent and for each i, y^i = f*(x^i) + ε, where ε satisfies the assumptions made above. Then, as is described in Yatchew and Bos (1992), we can use their method to test whether the data is consistent with the utility maximization hypothesis. In particular, Afriat's inequalities (in condition (iii) in Afriat's Theorem) can be used to calculate s²_F̄ by minimizing the value of

    (1/T) Σ_{i=1}^T [y^i − f^i]′ Σ^{-1} [y^i − f^i]

with respect to V^i, λ^i, and f^i (i = 1, …, T) subject to (i) the Afriat inequalities: V^i ≤ V^j + λ^j p^j·(f^i − f^j) (i, j = 1, …, T), (ii) the budget constraints: p^i·f^i = I^i (i = 1, …, T), and (iii) inequalities that guarantee that the Sobolev norm of the function f is within specified bounds.

^6 The Sobolev norm is defined on a set of m times continuously differentiable functions C^m by taking the maximum, over derivatives D^a f with |a| ≤ m, of sup_x |D^a f(x)|, where a = (a_1, …, a_K) is a vector of integers; D^a f(x) is the value resulting from differentiating f at x, a_1 times with respect to x_1, a_2 times with respect to x_2, …, a_K times with respect to x_K; and |a| = max_k {a_k}.
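Once the restricted and unrestricted fits have been computed, the test statistic itself is straightforward. The sketch below is a hypothetical illustration that assumes the residual matrices from the two split-sample fits are available; the estimator of σ² = Var(ε′Σ^{-1}ε) by its sample analog from the unrestricted residuals is an assumption of the sketch, not a prescription of Yatchew and Bos (1992).

    import numpy as np
    from scipy.stats import norm

    def yatchew_bos_test(res_r, res_u, Sigma_inv):
        """res_r, res_u: T x q residual matrices from the restricted
        and unrestricted fits on the two subsamples.  Forms
        t_T = sqrt(T) * (s2_restricted - s2_unrestricted), which is
        N(0, 2*sigma2) under the null."""
        T = res_r.shape[0]
        q_r = np.einsum("ij,jk,ik->i", res_r, Sigma_inv, res_r)
        q_u = np.einsum("ij,jk,ik->i", res_u, Sigma_inv, res_u)
        t = np.sqrt(T) * (q_r.mean() - q_u.mean())
        sigma2 = q_u.var()                # sample analog of Var(eps' Sigma^-1 eps)
        pvalue = 1.0 - norm.cdf(t / np.sqrt(2.0 * sigma2))
        return t, pvalue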
Gallant (1982) presents a seminonparametric method for testing whether a regression function satisfies some shape restrictions, such as linear homogeneity, separability and homotheticity. The method proceeds by testing whether the parametric approximation used to estimate a nonparametric function satisfies the hypothesized restrictions. Following Gallant (1982), suppose that we are interested in testing the linear homogeneity of a cost function, c(p, u), where p = (p_1, …, p_K) is a vector of input prices and u is the output. Let
    g(l, v) = ln c(exp(l_1)/a_1, …, exp(l_K)/a_K, exp(v)/a_{K+1}),

where l_i = ln p_i + ln a_i and v = ln u + ln a_{K+1}. (The a_i's are location parameters that are determined from the data.) Then, linear homogeneity of the cost function c in prices is equivalent to requiring that for all t, g(l + t·1, v) = t + g(l, v), where 1 denotes a K-vector of ones. The approximation g_N of g, given by
    g_N(x | θ) = b′x + x′Cx + Σ_{|k|* ≤ T} u_k e^{ik′x},    x ∈ R^{K+1},

satisfies these restrictions if

    Σ_{j=1}^K b_j = 1,

and if, for C = Σ_k c_k k k′, u_k = 0 and c_k = 0 when Σ_{j=1}^K k_j ≠ 0.
Linear homogeneity is then tested by determining whether these restrictions are satisfied. Gallant (1982) shows that by increasing the degree of approximation (i.e. the number of parameters) at a particular specified rate, as the number of observations increases, one can construct tests that are asymptotically free of specification bias. That is, for any given level of significance, α, one can construct a test statistic t_N and a critical value c_N such that if the true nonparametric function satisfies the null hypothesis then lim_{N→∞} Pr(t_N > c_N) = α.

Several other methods have been developed to test restrictions of economic theory on nonparametric functions. Stoker (1989), for example, presents nonparametric tests for additive constraints on the first and second derivatives of a conditional mean function f*(x). These tests are based on weighted-average derivatives estimators (Stoker (1986), Powell et al. (1989), Härdle and Stoker (1989)). Linear homogeneity and symmetry constraints are examples of properties of f* that can be tested using this method. (See also Lewbel (1991).) Also using average derivatives, Härdle et al. (1992) tested the positive definiteness of the matrix of aggregate income effects.
Hausman and Newey (1992) developed a test for the symmetry and negative slope of the Hicksian (compensated) demand. The test is derived from a nonparametric estimator for consumer surplus. Since symmetry of the Hicksian demand implies that the consumer surplus is independent of the price path used to calculate it, estimates obtained using different paths should converge to the same limit. A minimum chi-square test is then developed using this idea.

We should also mention in this section the extensive existing literature that deals with tests for the monotonicity of nonparametric functions in a wide variety of statistical models. For a survey of such literature, we refer the reader to the previously mentioned books of Barlow et al. (1972) and Robertson et al. (1988). (See also Prakasa Rao (1983).)
5. Conclusion
We have discussed the use of restrictions implied by economic theory in the econometric analysis of nonparametric models. We described advancements that have been made on the theories of identification, estimation, and testing of nonparametric models due to the use of restrictions of economic theory. First, we showed how restrictions implied by economic theory, such as shape and exclusion restrictions, can be used to identify functions in economic models. We demonstrated this in generalized regression models, binary threshold models, discrete choice models, models of consumer demand and in systems of simultaneous equations. Various ways of incorporating economic shape restrictions into nonparametric estimators were discussed. Special attention was given to estimators whose feasibility depends critically on the imposition of shape restrictions. We described technical results that can be used to develop new shape restricted nonparametric estimators in a wide range of models. We also described seminonparametric and weighted average estimators and showed how one can impose restrictions of economic theory on estimators obtained by these two methods. Finally, we have discussed some nonstatistical and statistical nonparametric tests. The nonstatistical tests are extensions of the basic ideas underlying the theory of Revealed Preference. The statistical tests are developed using nonparametric estimation methods.
References

Afriat, S. (1967a) "The Construction of a Utility Function from Demand Data", International Economic Review, 8, 66-77.
Afriat, S. (1967b) "The Construction of Separable Utility Functions from Expenditure Data", mimeo, Purdue University.
Afriat, S. (1972a) "The Theory of International Comparisons of Real Income and Prices", in: J.D. Daly, ed., International Comparisons of Prices and Output. New York: National Bureau of Economic Research.
Afriat, S. (1972b) "Efficiency Estimates of Production Functions", International Economic Review, 13, 568-598.
Afriat, S. (1973) "On a System of Inequalities on Demand Analysis", International Economic Review, 14, 460-472.
Afriat, S. (1981) "On the Constructability of Consistent Price Indices between Several Periods Simultaneously", in: A. Deaton, ed., Essays in Applied Demand Analysis. Cambridge: Cambridge University Press.
Andrews, D.W.K. (1991) "Asymptotic Normality of Series Estimators for Nonparametric and Semiparametric Regression Models", Econometrica, 59(2), 307-345.
Ayer, M., H.D. Brunk, G.M. Ewing, W.T. Reid and E. Silverman (1955) "An Empirical Distribution Function for Sampling with Incomplete Information", Annals of Mathematical Statistics, 26, 641-647.
Balls, K.G. (1987) Inequality Constrained Nonparametric Estimation, Ph.D. Dissertation, Carnegie Mellon University.
Barlow, R.E., D.J. Bartholomew, J.M. Bremner and H.D. Brunk (1972) Statistical Inference under Order Restrictions. New York: John Wiley.
Barnett, W.A. and P. Yue (1988a) "The Asymptotically Ideal Model (AIM)", working paper.
Barnett, W.A. and P. Yue (1988b) "Semiparametric Estimation of the Asymptotically Ideal Model: The AIM Demand System", in: G. Rhodes and T. Fomby, eds., Nonparametric and Robust Inference, Advances in Econometrics, Vol. 7. Greenwich, Connecticut: JAI Press, 229-252.
Barnett, W.A., J. Geweke and P. Yue (1991) "Seminonparametric Bayesian Estimation of the Asymptotically Ideal Model: The AIM Consumer Demand System", in: W. Barnett, J. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge: Cambridge University Press.
Bergstrom, A.R. (1985) "The Estimation of Nonparametric Functions in a Hilbert Space", Econometric Theory, 1, 1-26.
Brown, D.J. and R.L. Matzkin (1991) "Recoverability and Estimation of the Demand and Utility Functions of Traders when Demands are Unobservable", mimeo, Cowles Foundation, Yale University.
Brown, D.J. and R.L. Matzkin (1992) "A Nonparametric Test for the Perfectly Competitive Model", mimeo, Northwestern University.
Brown, D.J. and R.L. Matzkin (1993) "Walrasian Comparative Statics", Technical Report No. 57, Stanford Institute for Theoretical Economics.
Brunk, H.D. (1970) "Estimation of Isotonic Regression", in: M.L. Puri, ed., Nonparametric Techniques in Statistical Inference. Cambridge: Cambridge University Press, 177-197.
Chiappori, P.A. (1988) "Rational Household Labor Supply", Econometrica, 56(1), 63-89.
Chiappori, P. and J. Rochet (1987) "Revealed Preference and Differential Demand", Econometrica, 55, 687-691.
Cosslett, S.R. (1983) "Distribution-free Maximum Likelihood Estimator of the Binary Choice Model", Econometrica, 51(3), 765-782.
Cox, D.R. (1970) The Analysis of Binary Data. Methuen & Co Ltd.
Diewert, E. (1973) "Afriat and Revealed Preference Theory", Review of Economic Studies, 40, 419-426.
Diewert, E.W. and C. Parkan (1979) "Linear Programming Tests for Regularity Conditions for Production Functions", University of British Columbia.
Diewert, E.W. and C. Parkan (1985) "Tests for the Consistency of Consumer Data", Journal of Econometrics, 30, 127-147.
Dykstra, R.L. (1983) "An Algorithm for Restricted Least Squares Regression", Journal of the American Statistical Association, 78, 837-842.
Eastwood, B.J. (1991) "Asymptotic Normality and Consistency of Semi-nonparametric Regression Estimators Using an Upward F Test Truncation Rule", Journal of Econometrics.
Eastwood, B.J. and A.R. Gallant (1991) "Adaptive Truncation Rules for Seminonparametric Estimators that Achieve Asymptotic Normality", Econometric Theory, 7, 307-340.
Epstein, L.G. and D.J. Yatchew (1985) "Non-Parametric Hypothesis Testing Procedures and Applications to Demand Analysis", Journal of Econometrics, 30, 150-169.
Finney, D.J. (1971) Probit Analysis. Cambridge University Press.
Friedman, J. and R. Tibshirani (1984) "The Monotone Smoothing of Scatter Plots", Technometrics, 26, 243-250.
Gallant, A.R. (1981) "On the Bias in Flexible Functional Forms and an Essentially Unbiased Form", Journal of Econometrics, 15, 211-245.
Gallant, A.R. (1982) "Unbiased Determination of Production Technologies", Journal of Econometrics, 20, 285-323.
Gallant, A.R. (1987) "Identification and Consistency in Seminonparametric Regression", in: T.F. Bewley, ed., Advances in Econometrics, Fifth World Congress, Volume 1. Cambridge University Press.
Gallant, A.R. and G.H. Golub (1984) "Imposing Curvature Restrictions on Flexible Functional Forms", Journal of Econometrics, 26, 295-321.
Gallant, A.R. and D.W. Nychka (1987) "Seminonparametric Maximum Likelihood Estimation", Econometrica, 55, 363-390.
Gallant, A.R. and G. Souza (1991) "On the Asymptotic Normality of Fourier Flexible Form Estimates", Journal of Econometrics, 50, 329-353.
Goldman, S.M. and P.A. Ruud (1992) "Nonparametric Multivariate Regression Subject to Monotonicity and Convexity Constraints", mimeo, University of California, Berkeley.
Han, A.K. (1987) "Nonparametric Analysis of a Generalized Regression Model: The Maximum Rank Correlation Estimator", Journal of Econometrics, 35, 303-316.
Hanoch, G. and M. Rothschild (1978) "Testing the Assumptions of Production Theory: A Nonparametric Approach", Journal of Political Economy, 80, 256-275.
Hanson, D.L. and G. Pledger (1976) "Consistency in Concave Regression", Annals of Statistics, 4, 1038-1050.
Hanson, D.L., G. Pledger and F.T. Wright (1973) "On Consistency in Monotonic Regression", Annals of Statistics, 1, 401-421.
Härdle, W. and T.M. Stoker (1989) "Investigating Smooth Multiple Regression by the Method of Average Derivatives", Journal of the American Statistical Association, 84, 986-995.
Härdle, W., W. Hildenbrand and M. Jerison (1992) "Empirical Evidence on the Law of Demand", Econometrica, 59, 1525-1550.
Hausman, J. and W. Newey (1992) "Nonparametric Estimation of Exact Consumer Surplus and Deadweight Loss", mimeo, Department of Economics, M.I.T.
Hausman, J.A. and D.A. Wise (1978) "A Conditional Probit Model of Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences", Econometrica, 46, 403-426.
Hildreth, C. (1954) "Point Estimates of Ordinates of Concave Functions", Journal of the American Statistical Association, 49, 598-619.
Horowitz, J.L. (1992) "A Smoothed Maximum Score Estimator for the Binary Choice Model", Econometrica, 60, 505-531.
Hotz, V.J. and R.A. Miller (1989) "Conditional Choice Probabilities and the Estimation of Dynamic Discrete Choice Models", mimeo, University of Chicago and Carnegie Mellon University.
Houthakker, H.S. (1950) "Revealed Preference and the Utility Function", Economica, 17, 159-174.
Ichimura, H. (1993) "Semiparametric Least Squares (SLS) and Weighted SLS Estimation of Single Index Models", Journal of Econometrics, 58, 71-120.
Kadiyali, V. (1993) Ph.D. Thesis, Department of Economics, Northwestern University.
Kiefer, J. and J. Wolfowitz (1956) "Consistency of the Maximum Likelihood Estimator in the Presence of Infinitely Many Incidental Parameters", Annals of Mathematical Statistics, 27, 887-906.
Kimeldorf, G.S. and G. Wahba (1971) "Some Results on Tchebycheffian Spline Functions", Journal of Mathematical Analysis and Applications, 33, 82-95.
Klein, R.W. and R.H. Spady (1993) "An Efficient Semiparametric Estimator for Discrete Choice Models", Econometrica, 61, 387-422.
Landsburg, S.E. (1981) "Taste Change in the United Kingdom, 1900-1955", Journal of Political Economy, 89, 92-104.
Lewbel, A. (1991) "Applied Consistent Tests of Nonparametric Regression and Density Restriction", mimeo, Brandeis University.
Maddala, G.S. (1983) Limited-Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.
Mammen, E. (1991a) "Estimating a Smooth Monotone Function", Annals of Statistics, 19, 724-740.
Mammen, E. (1991b) "Nonparametric Regression under Qualitative Smoothness Assumptions", Annals of Statistics, 19, 741-759.
Manser, M.E. and R.J. McDonald (1988) "An Analysis of Substitution Bias in Measuring Inflation, 1959-85", Econometrica, 56(4), 909-930.
Manski, C. (1975) "Maximum Score Estimation of the Stochastic Utility Model of Choice", Journal of Econometrics, 3, 205-228.
Manski, C. (1985) "Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator", Journal of Econometrics, 27, 313-334.
Manski, C. (1988) "Identification of Binary Response Models", Journal of the American Statistical Association, 83, 729-738.
Mas-Colell, A. (1977) "On the Recoverability of Consumers' Preferences from Market Demand Behavior", Econometrica, 45(6), 1409-1430.
Mas-Colell, A. (1978) "On Revealed Preference Analysis", Review of Economic Studies, 45, 121-131.
Matzkin, R.L. (1986) Mathematical and Statistical Inferences from Demand Data, Ph.D. Dissertation, University of Minnesota.
Matzkin, R.L. (1990a) "Estimation of Multinomial Models Using Weak Monotonicity Assumptions", Cowles Foundation Discussion Paper No. 957, Yale University.
Matzkin, R.L. (1990b) "Least-concavity and the Distribution-free Estimation of Nonparametric Concave Functions", Cowles Foundation Discussion Paper No. 958, Yale University.
Matzkin, R.L. (1990c) "Fully Nonparametric Estimation of Some Qualitative Dependent Variable Models Using the Method of Kernels", mimeo, Cowles Foundation, Yale University.
Matzkin, R.L. (1991a) "Semiparametric Estimation of Monotone and Concave Utility Functions for Polychotomous Choice Models", Econometrica, 59, 1315-1327.
Matzkin, R.L. (1991b) "Axioms of Revealed Preference for Nonlinear Choice Sets", Econometrica, 59, 1779-1786.
Matzkin, R.L. (1991c) "A Nonparametric Maximum Rank Correlation Estimator", in: W. Barnett, J. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge: Cambridge University Press.
Matzkin, R.L. (1991d) "Using Kernel Methods to Smooth Concavity Restricted Estimators", mimeo, Cowles Foundation, Yale University.
Matzkin, R.L. (1992) "Nonparametric and Distribution-Free Estimation of the Binary Choice and the Threshold-Crossing Models", Econometrica, 60, 239-270.
Matzkin, R.L. (1993a) "Nonparametric Identification and Estimation of Polychotomous Choice Models", Journal of Econometrics, 58, 137-168.
Matzkin, R.L. (1993b) "Computation and Operational Properties of Nonparametric Concavity Restricted Estimators", mimeo, Northwestern University.
Matzkin, R.L. (1994) "Identification in Nonparametric LDV Models", mimeo, Northwestern University.
Matzkin, R.L. and W. Newey (1992) "Kernel Estimation of a Structural Nonparametric Limited Dependent Variable Model", mimeo.
Matzkin, R.L. and M.K. Richter (1991) "Testing Strictly Concave Rationality", Journal of Economic Theory, 53, 287-303.
McDonald, R.J. and M.E. Manser (1984) "The Effect of Commodity Aggregation on Tests of Consumer Behavior", mimeo.
McFadden, D. (1974) "Conditional Logit Analysis of Qualitative Choice Behavior", in: P. Zarembka, ed., Frontiers of Econometrics. New York: Academic Press, 105-142.
McFadden, D. (1981) "Econometric Models of Probabilistic Choice", in: C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications. The MIT Press, 198-272.
McFadden, D. (1985) "Specification of Econometric Models", Presidential Address, Fifth World Congress, mimeo, M.I.T.
McFadden, D. and M.K. Richter (1970) "Stochastic Rationality and Revealed Stochastic Preference", mimeo, Department of Economics, University of California, Berkeley.
McFadden, D. and M.K. Richter (1990) "Stochastic Rationality and Revealed Stochastic Preference", in: J.S. Chipman, D. McFadden and M.K. Richter, eds., Preference, Uncertainty, and Optimality: Essays in Honor of Leonid Hurwicz. Boulder: Westview Press, 161-186.
Mukarjee, H. (1988) "Monotone Nonparametric Regression", The Annals of Statistics, 16, 741-750.
Mukarjee, H. and S. Stern (1994) "Feasible Nonparametric Estimation of Multiargument Monotone Functions", Journal of the American Statistical Association, 89(425), 77-80.
Nadaraja, E. (1964) "On Regression Estimators", Theory of Probability and its Applications, 9, 157-159.
Nemirovskii, A.S., B.T. Polyak and A.B. Tsybakov (1983) "Rates of Convergence of Nonparametric Estimates of Maximum Likelihood Type", Problemy Peredachi Informatsii, 21, 258-272.
Powell, J.L., J.H. Stock and T.M. Stoker (1989) "Semiparametric Estimation of Index Coefficients", Econometrica, 57, 1403-1430.
Prakasa Rao, B.L.S. (1983) Nonparametric Functional Estimation. Academic Press.
Richter, M.K. (1985) "Theory of Profit", mimeo, Department of Economics, University of Minnesota.
Robertson, T., F.T. Wright and R.L. Dykstra (1988) Order Restricted Statistical Inference. John Wiley and Sons.
Roehrig, C.S. (1988) "Conditions for Identification in Nonparametric and Parametric Models", Econometrica, 56(2), 433-447.
Royall, R.M. (1966) A Class of Nonparametric Estimators of a Smooth Regression Function, Ph.D. Thesis, Stanford University.
Samuelson, P.A. (1938) "A Note on the Pure Theory of Consumer Behavior", Economica, 5, 61-71.
Seiford, L.M. and R.M. Thrall (1990) "Recent Developments in DEA", Journal of Econometrics, 46, 7-38.
Stoker, T.M. (1986) "Consistent Estimation of Scaled Coefficients", Econometrica, 54, 1461-1481.
Stoker, T.M. (1989) "Tests of Additive Derivative Constraints", Review of Economic Studies, 56, 535-552.
Strauss, D. (1979) "Some Results on Random Utility Models", Journal of Mathematical Psychology, 20, 35-52.
Thompson, T.S. (1989) "Identification of Semiparametric Discrete Choice Models", Discussion Paper No. 249, Center for Economic Research, University of Minnesota.
Utreras, F. (1984) "Positive Thin Plate Splines", CAT Rep. No. 68, Dept. of Math., Texas A&M Univ.
Utreras, F. (1985) "Smoothing Noisy Data Under Monotonicity Constraints: Existence, Characterization, and Convergence Rates", Numerische Mathematik, 47, 611-625.
Varian, H. (1982) "The Nonparametric Approach to Demand Analysis", Econometrica, 50(4), 945-974.
Varian, H. (1983a) "Nonparametric Tests of Consumer Behavior", Review of Economic Studies, 50, 99-110.
Varian, H. (1983b) "Nonparametric Tests of Models of Investor Behavior", Journal of Financial and Quantitative Analysis, 18, 269-278.
Varian, H. (1984) "The Nonparametric Approach to Production Analysis", Econometrica, 52, 579-597.
Varian, H. (1985) "Non-Parametric Analysis of Optimizing Behavior with Measurement Error", Journal of Econometrics, 30.
Varian, H. (1988) "Goodness-of-Fit in Demand Analysis", CREST Working Paper, University of Michigan.
Villalobos, M. and G. Wahba (1987) "Inequality Constrained Multivariate Smoothing Splines with Application to the Estimation of Posterior Probabilities", Journal of the American Statistical Association, 82, 239-248.
Wahba, G. (1990) Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied Mathematics, No. 59, Society for Industrial and Applied Mathematics.
Wald, A. (1949) "Note on the Consistency of the Maximum Likelihood Estimate", Annals of Mathematical Statistics, 20, 595-601.
Wang, Y. (1992) Nonparametric Estimation Subject to Shape Restrictions, Ph.D. Dissertation, Department of Statistics, University of California, Berkeley.
Watson, G.S. (1964) "Smooth Regression Analysis", Sankhya Series A, 26, 359-372.
Wong, W.H. (1984) "On Constrained Multivariate Splines and Their Approximations", Numerische Mathematik, 43, 141-152.
Wright, F.T. (1982) "Monotone Regression Estimates for Grouped Observations", Annals of Statistics, 10, 278-286.
Yatchew, A.J. (1985) "A Note on Nonparametric Tests of Consumer Behavior", Economics Letters, 18, 45-48.
Yatchew, A.J. (1992) "Nonparametric Regression Tests Based on Least Squares", Econometric Theory, 8, 435-451.
Yatchew, A. and L. Bos (1992) "Nonparametric Tests of Demand Theory", mimeo, Department of Economics, University of Toronto.
Yellott, J.I. (1977) "The Relationship Between Luce's Choice Axiom, Thurstone's Theory of Comparative Judgement, and the Double Exponential Distribution", Journal of Mathematical Psychology, 15, 109-144.
Chapter 43

ANALOG ESTIMATION OF ECONOMETRIC MODELS*

CHARLES F. MANSKI

University of Wisconsin-Madison

Contents

Abstract 2560
1. Introduction 2560
2. Preliminaries 2561
  2.1. The analogy principle 2561
  2.2. Moment problems 2563
  2.3. Econometric models 2565
3. Method-of-moments estimation of separable models 2566
  3.1. Mean independence 2567
  3.2. Median independence 2568
  3.3. Conditional symmetry 2569
  3.4. Variance independence 2570
  3.5. Statistical independence 2570
  3.6. A historical note 2571
4. Method-of-moments estimation of response models 2572
  4.1. Likelihood models 2574
  4.2. Invertible models 2574
  4.3. Mean independent linear models 2575
  4.4. Quantile independent monotone models 2577
5. Estimation of general separable and response models 2577
  5.1. Closest-empirical-distribution estimation of separable models 2577
  5.2. Minimum-distance estimation of response models 2580
6. Conclusion 2581
References 2581

* I am grateful for the comments of Rosa Matzkin and Jim Powell.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved
Abstract

Suppose that one wants to estimate a parameter characterizing some feature of a specified population. One has some prior information about the population and a random sample of observations. A widely applicable approach is to estimate the parameter by a sample analog; that is, by a statistic having the same properties in the sample as the parameter does in the population. If there is no such statistic, then one may choose an estimate that, in some well-defined sense, makes the known properties of the population hold as closely as possible in the sample. These are analog estimation methods. This chapter surveys some uses of analog methods to estimate two classes of econometric models, the separable and the response models.
1. Introduction

Suppose that one wants to estimate a parameter characterizing some feature of a specified population. One has some prior information about the population and a random sample of observations. A widely applicable approach is to estimate the parameter by a sample analog; that is, by a statistic having the same properties in the sample as the parameter does in the population. If there is no such statistic, then one may choose an estimate that, in some well-defined sense, makes the known properties of the population hold as closely as possible in the sample. These are analog estimation methods.

Familiar examples include use of the sample average to estimate the population mean and sample quantiles to estimate population quantiles. The classical method of moments (Pearson (1894)) is an analog approach, as is minimum chi-square estimation (Neyman (1949)). Maximum likelihood, least squares and least absolute deviations estimation are analog methods.

This chapter surveys some uses of analog methods to estimate econometric models. Section 2 presents the necessary preliminaries, defining the analogy principle, moment problems and the method of moments, and two classes of models, the separable and the response models. Sections 3 and 4 describe the variety of separable and response models that imply moment problems and may be estimated by the method of moments. Section 5 discusses two more general analog estimation approaches: closest empirical distribution estimation of separable models and minimum distance estimation of response models. Section 6 gives conclusions. The reader wishing a more thorough treatment of much of the material in this chapter should see Manski (1988).

The analogy principle is used here to estimate population parameters. Other chapters of this handbook exploit related ideas for other purposes. The chapter by Hall describes bootstrap methods, which apply the analogy principle to approximate
the distribution of sample statistics. The chapter by Hajivassiliou and Ruud describes simulation methods, which use the analogy between an observed sample and a pseudo-sample from the same population, drawn at postulated parameter values.

2. Preliminaries

2.1. The analogy principle
Assume that a probability distribution P on a sample space Z characterizes a population. One observes a sample of N independent realizations of a random variable z distributed P. One knows that P is a member of some family Π of probability distributions on Z. One also knows that a parameter b in a parameter space B solves an equation

    T(P, b) = 0,    (1)

where T(·, ·) is a given function mapping Π × B into some vector space Y. The problem is to combine the sample data with the knowledge that b ∈ B, P ∈ Π and T(P, b) = 0 so as to estimate b.

Many econometric models imply that a parameter solves an extremum problem rather than an equation. We can use (1) to express extremum problems by saying that b solves

    b − argmin_{c ∈ B} W(P, c) = 0.    (2)

Here W(·, ·) is a given function mapping Π × B into the real line.

Let P_N be the empirical distribution of the sample of N draws from P. That is, P_N is the multinomial probability distribution that places probability 1/N on each of the N observations of z. The group of theorems collectively referred to as the laws of large numbers show that P_N converges to P in various senses as N → ∞. This suggests that to estimate b one might substitute the function T(P_N, ·) for T(P, ·) and use

    B_N = [c ∈ B: T(P_N, c) = 0].    (3)

This defines the analog estimate when P_N is a feasible value for P; that is, when P_N ∈ Π. In these cases T(P_N, ·) is well-defined and has at least one zero in B, so B_N is the (possibly set-valued) analog estimate of b.

Equation (3) does not explain how to proceed when P_N ∉ Π. We have so far defined T(·, ·) only on the space Π × B of feasible population distributions and parameter values. The function T(P_N, ·) is as yet undefined for P_N ∉ Π.
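For instance, when T(P, b) = ∫ z dP − b, the analog estimate defined by (3) is the sample average; a quantile works the same way. A minimal numerical sketch:

    import numpy as np

    z = np.random.default_rng(0).normal(loc=2.0, scale=1.0, size=500)

    # T(P, b) = E_P[z] - b = 0 has sample analog (1/N) sum_i z_i - b = 0,
    # so the analog estimate of the population mean is the sample average:
    b_mean = z.mean()

    # The population median solves P(z <= b) - 1/2 = 0; its sample analog
    # is the empirical median:
    b_median = np.median(z)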
Let Φ denote the space of all multinomial distributions on Z. To define T(P_N, ·) for every sample size and all sample realizations, it suffices to extend T(·, ·) from Π × B to the domain (Π ∪ Φ) × B. Two approaches have proved useful in practice.

Mapping P_N into Π. One approach is to map P_N into Π. Select a function π(·): Π ∪ Φ → Π which maps every member of Π into itself. Now replace the equation T(P, b) = 0 with

    T[π(P), b] = 0.    (4)

This substitution leaves the estimation problem unchanged as T[π(Q), ·] = T(Q, ·) for all Q ∈ Π. Moreover, π(P_N) ∈ Π; so T[π(P_N), ·] is defined and has a zero in B. The analogy principle applied to (4) yields the estimate

    B_{Nπ} = [c ∈ B: T{π(P_N), c} = 0].    (5)

When P_N ∈ Π, this estimate is the same as the one defined in equation (3). When P_N ∉ Π, the estimate (5) depends on the selected function π(·); hence we write B_{Nπ} rather than B_N.

A prominent example of this approach is kernel estimation of Lebesgue density functions. Let Π be the space of distributions having Lebesgue densities. The empirical distribution P_N is multinomial and so is not in Π. But P_N can be smoothed so as to yield a distribution that is in Π. In particular, the convolution of P_N with any element of Π is itself an element of Π. The density of the convolution is a kernel density estimate. See Manski (1988), Chapter 2.

Direct extension. Sometimes there is a natural direct way to extend the domain of T(·, ·), so T(P_N, ·) is well-defined. Whenever T(P_N, ·) has a zero in B, equation (3) gives the analog estimate. If P_N is not in Π, it may be that T(P_N, c) ≠ 0 for all c ∈ B. Then the analogy principle suggests selection of an estimate that makes T(P_N, ·) as close as possible to zero in some sense.

To put this idea into practice, select an origin-preserving function r(·) which maps values of T(·, ·) into the non-negative real half line. That is, let r(·): Y → [0, ∞), with T = 0 ⇔ r(T) = 0. Now replace the equation T(P, b) = 0 with the extremum problem

    min_{c ∈ B} r[T(P, c)].    (6)

This substitution leaves the estimation problem unchanged as T(Q, c) = 0 ⇔ r[T(Q, c)] = 0 for (Q, c) ∈ Π × B. To estimate b, solve the sample analog of (6). Provided only that r[T(P_N, ·)] attains its minimum on B, the analog estimate is

    B_{Nr} = argmin_{c ∈ B} r[T(P_N, c)].    (7)

If P_N ∈ Π, this estimate is the same as the one defined in (3). If P_N ∉ Π but T(P_N, ·) has a zero in B, the estimate remains as in (3). If T(P_N, ·) is everywhere non-zero, the estimate depends on the selected function r(·); hence we write B_{Nr} rather than B_N. Section 2.2 describes an extraordinarily useful application of this approach, the method of moments.
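A minimal sketch of the smoothing map π(·) described above: convolving P_N with a normal distribution and evaluating the density of the result gives the familiar Gaussian kernel density estimate. The function below is a hypothetical illustration.

    import numpy as np

    def kde(z, grid, h):
        """Density of the convolution of the empirical distribution P_N
        with a N(0, h^2) distribution: the map pi(P_N) described above,
        evaluated at the grid points."""
        u = (grid[:, None] - z[None, :]) / h
        return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(z) * h * np.sqrt(2 * np.pi))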
2.2. Moment problems
Much of present-day econometrics is concerned with estimation of a parameter b solving an equation of the form

    ∫ g(z, b) dP = 0    (8)

or an extremum problem of the form

    min_{c ∈ B} ∫ h(z, c) dP.    (9)

In (8), g(·, ·) is a given function mapping Z × B into a real vector space. In (9), h(·, ·) is a given function mapping Z × B into the real line. Numerous prominent examples of (8) and (9) will be given in Sections 3 and 4 respectively.

When P_N ∈ Π, application of the analogy principle to (8) and (9) yields the estimates
    B_N = [c ∈ B: ∫ g(z, c) dP_N = 0] = [c ∈ B: (1/N) Σ_i g(z_i, c) = 0],    (10)

    B_N = argmin_{c ∈ B} ∫ h(z, c) dP_N = argmin_{c ∈ B} (1/N) Σ_i h(z_i, c),    (11)

where (z_i, i = 1, …, N) are the sample observations of z. When P_N ∉ Π, one might either map P_N into Π or extend the domain of T(·, ·) directly. The latter approach is simplest; the sample analogs of the expectations ∫ g(z, ·) dP and ∫ h(z, ·) dP are the sample averages ∫ g(z, ·) dP_N and ∫ h(z, ·) dP_N. So (10) and (11) remain analog estimates of the parameters solving (8) and (9).

It remains only to consider the possibility that the estimates may not exist. In applications, ∫ h(z, ·) dP_N generally has a minimum. On the other hand, ∫ g(z, ·) dP_N often has no zero. In that case, one may select an origin-preserving transformation r(·) and replace (8) with the problem of minimizing r[∫ g(z, ·) dP], as was done in (6).
Minimizing the sample analog yields

    B_{Nr} = argmin_{c ∈ B} r[∫ g(z, c) dP_N] = argmin_{c ∈ B} r[(1/N) Σ_i g(z_i, c)].    (12)

Estimation problems relating b to P by (8) or (9) are called moment problems. Estimates of the forms (10), (11), and (12) are method-of-moments estimates. Use of the term "moment" rather than the equally descriptive "expectation," "mean," or "integral" honors the early work of K. Pearson on the method of moments.

Consistency of method-of-moments estimates. Clearly, consistent estimation of b requires that the asserted moment problem have a unique solution; that is, b must be identified. If no solution exists, the estimation problem has been misspecified and b is not defined. If there are multiple solutions, sample data cannot possibly distinguish between them. There is no general approach for determining the number of solutions to equation systems of the form (8) or to extremum problems of the form (9). One must proceed more or less case-by-case.

Given identification, method-of-moments estimates are consistent if the estimation problem is sufficiently regular. Rigorous treatments appear in such econometrics texts as Amemiya (1985), Gallant (1987) and Manski (1988). I provide here an heuristic explanation focussing on (12); case (11) involves no additional considerations.

We are concerned with the behavior of the function r[∫ g(z, ·) dP_N] as N → ∞. The strong law of large numbers implies that for all c ∈ B, ∫ g(z, c) dP_N → ∫ g(z, c) dP as N → ∞, almost surely. The convergence is uniform on B if the parameter space is sufficiently small, the function g(·, ·) sufficiently smooth, and the distribution P sufficiently well-behaved. (For example, it suffices for B to be a compact finite-dimensional set, for |g(z, ·)| to be bounded by an integrable function D(z), and for g(z, ·) to be continuous on B. See Manski (1988), Chapter 7.) If the convergence is uniform and r(·) is smooth, then as N → ∞ the minima on B of r[∫ g(z, ·) dP_N] tend to occur increasingly near the minima of r[∫ g(z, ·) dP]. The unique minimum of r[∫ g(z, ·) dP] occurs at b. So the estimate B_{Nr} converges to b.

Uniform convergence on B of ∫ g(z, ·) dP_N to ∫ g(z, ·) dP is close to a necessary condition for consistency of method-of-moments estimates. If this condition is seriously violated, ∫ g(z, ·) dP_N is not a good sample analog to ∫ g(z, ·) dP and the estimation approach does not work. Beginning in the 1930s with the Glivenko-Cantelli Theorem, statisticians and econometricians have steadily broadened the range of specifications of B, g(·, ·) and P for which uniform laws of large numbers have been shown to hold (e.g. Pollard (1984) and Andrews (1987)). Nevertheless, uniformity does break down in situations that are far from pathological. Perhaps the most important practical concern is the size of the parameter space. Given a specification for g(·, ·) and for P, uniformity becomes a more demanding property as B becomes larger.
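A minimal sketch of the estimate (12), taking r(t) = t′t and a generic moment function g; the interface and the use of a derivative-free minimizer are assumptions of the illustration, not part of the theory above.

    import numpy as np
    from scipy.optimize import minimize

    def mom_estimate(g, z, c0):
        """Method-of-moments estimate (12) with r(t) = t't: minimize the
        squared length of the sample moment (1/N) sum_i g(z_i, c)."""
        def objective(c):
            gbar = np.mean([g(zi, c) for zi in z], axis=0)
            return float(gbar @ gbar)
        return minimize(objective, c0, method="Nelder-Mead").x

    # example: g(z, c) = z - c identifies the population mean
    rng = np.random.default_rng(1)
    data = rng.normal(3.0, 1.0, size=200)
    c_hat = mom_estimate(lambda zi, c: np.atleast_1d(zi - c), data, np.array([0.0]))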
Sampling distributions. The exact sampling distributions of method-of-moments estimates are generally complicated. Hence the practice is to invoke local asymptotic approximations. If the parameter space is finite-dimensional and the estimation problem is sufficiently regular, a method-of-moments estimate B_{Nr} converges at rate 1/√N and √N(B_{Nr} − b) has a limiting normal distribution centered at zero. Alternative estimates of a given parameter may have limiting distributions with different variances. This fact suggests use of the variance of the limiting distribution as a criterion for measuring precision.

Comparison of the precision of alternative estimators has long engaged the attention of econometric theorists. An estimate is termed asymptotically efficient if the variance of the limiting normal distribution of √N(B_{Nr} − b) is the smallest possible given the available prior information. Hansen (1982) and Chamberlain (1987) provide the central findings on the efficiency of method-of-moments estimates. For an exposition, see Manski (1988), Chapters 8 and 9.

Non-random sampling. In discussing moment problems and estimation problems more generally, I have assumed that the data are a random sample. It is important to understand that random sampling, albeit a useful simplifying idea, is not essential to the success of analog estimation. The essential requirement is that the sampling process be such that relevant features of the empirical distribution converge to corresponding population features.

For example, consider stationary time series problems. Here the data are observations at N dates from a single realization of a stationary stochastic process whose marginal distribution is P. So we do not have a random sample from P. Nevertheless, dependent sampling versions of the laws of large numbers show that P_N converges to P in various senses as N → ∞.
2.3. Econometric models
We have been discussing an estimation problem relating a parameter b to a probability distribution P generating realizations of an observable random variable z. Econometric models typically relate a parameter b to realizations of the observable z and of an unobservable random variable, say u. Analog estimation methods may be used to estimate b if one can transform the econometric model into a representation relating b to P and to nuisance parameters.

Formally, suppose that a probability distribution P_{zu} on a space Z × U characterizes a population. A random sample of N realizations of a random variable (z, u) distributed P_{zu} is drawn and one observes the realizations of z but not of u. One knows that P_{zu} is a member of some family Π_{zu} of probability distributions on Z × U. One also knows that a parameter b in a parameter space B solves an equation

    f(z, u, b) = 0,    (13)

where f(·, ·, ·) maps Z × U × B into some vector space. Equation (13) is to be interpreted as saying that almost every realization (ζ, η) of (z, u) satisfies the equation f(ζ, η, b) = 0.

Equation (13) typically has no content in the absence of information on the probability distribution P_{zu} generating (z, u). A meaningful model combines (13) with some distributional knowledge. The practice has been to impose restrictions on the probability distribution of u conditional on some function of z, say x = x(z) taking values in a space X. Let P_{u|x} denote this conditional distribution. Then a model is defined by equation (13) and by a restriction on the conditional distributions (P_{u|ξ}, ξ ∈ X).

Essentially all econometric research has specified f to have one of two forms. A separable model makes the unobserved variable u additively separable, so that

    f(z, u, b) = u_*(z, b) − u,    (14)

where u_*(·, ·) maps Z × B into U. A response model defines z = (y, x), Z = Y × X, and makes f have the form

    f(y, x, u, b) = y − y_*(x, u, b),    (15)

where y_*(·, ·, ·) maps X × U × B into Y. Functional forms (14) and (15) are not mutually exclusive. Some models can be written both ways.

The next two sections survey the many separable and response models implying that b and a nuisance parameter together solve a moment problem. (The nuisance parameter characterizes unrestricted features of P_{u|x}.) These models may be estimated by the method of moments if the parameter space is not too large.
3.
Method-of-moments
estimation of separable models
Separable models suppose through an equation u,(z, b) = u.
that realizations
of (z, U) are related
to the parameter
b
(16)
In the absence of information restricting the distribution of the unobserved U, this equation simply defines u and conveys no information about b. In the presence of various distributional restrictions (16) implies that b and a nuisance parameter solve a type of moment equation known as an orthogonality condition, defined here. Orthogonality conditions. Let x = x(z) take values in a real vector space X. Let r denote a space in which a nuisance parameter y lives. Let e(*, .) be a function mapping U x r into a real vector space. Let e(.;)’ denote the transpose of the
2561
Ch. 43: Analog Estimation of Econometric Models
column
vector e(., .). The random
vectors x and e(u, y) are orthogonal
if
s
xe(u, y)‘dP,, = 0.
(17)
Equation (17) relates the observed random variable x to the unobserved random variable u. Suppose that (16) holds. Then we can replace u in (17) with u,(z, b), yielding xe[u,(z,
b), y]‘dP = 0.
(18)
This orthogonality condition is a moment the distribution P of the observable z. It is not easy to motivate orthogonality show that these conditions are implied by restrictions. The remainder of this section
equation
relating
the parameters
(b, y) to
conditions directly, but we can readily various more transparent distributional describes the leading cases.
Mean independence
3.1.
The classical econometric literature on instrumental variables estimation is concerned with separable models in which x and u are known to be uncorrelated. Let y be the mean of U. Zero covariance is the orthogonality condition x(u - Y)‘dP,, = s
x[u,(z,
b) - y]‘dP = 0.
(19)
s
Most authors incorporate the nuisance by giving that function a free intercept. and equation (19) is rewritten as
s
x[u,(z, b)]‘dP = 0.
parameter y into the specification of u,(., .) This done, u is declared to have mean zero
(20)
To facilitate discussion of a variety of distributional restrictions, I shall keep y explicit. Zero covariance is sometimes asserted directly, to express a belief that the random variables x and u ,are unrelated. It is preferable to think of zero covariance as following from a stronger form of unrelatedness. This is the mean-independence condition
s
udP,lt
= y,
VEX.
(21)
C.F.
2568
Manski
Mean independence implies zero covariance but it is difficult to motivate zero To see why, rewrite (19) as the covariance in the absence of mean independence. iterated expectation r
J
x(u - y)‘dP,,
=
r rr 1 x (u - Y)‘dP,jx dP, = 0. J J LJ
(22)
This shows that mean independence implies zero covariance. It also shows that x and u are uncorrelated if positive and negative realizations of x[(u - Y)‘dP,(x balance when weighted by the distribution of x. But one rarely has information about P,, certainly not information that would make one confident in (22) in the absence of (21). Hence, an assertion of zero covariance suggests a belief that x and u are unrelated in the sense of mean independence. Mean independence implies orthogonality conditions beyond (19). Let u(.) be any function mapping X into a real vector space. It follows from (16) and (21) that
v(x) [u,(z, h) - y]’ dP = s
(u - y)’ dP, 1x
v(x) s
[S
1
dP, = 0,
(23)
provided only that the integral in (23) exists. So the random variables u(x) and u,(z, b) are uncorrelated. In other words, all functions ofx are instrumental variables.
Median independence
3.2.
The assertion that u is mean independent of x expresses a belief that u has the same central tendency conditional on each realization of x. Median independence offers another way to express this belief. Median independence alone does not imply an orthogonality condition, but it does when the conditional distributions P,l& VEX are componentwise continuous. Let U be the real line; the vector case introduces no new considerations as we shall deal with u componentwise. For each 5 in X, let mg be the median of u conditional on the event [x = 41. Let y be the unconditional median of U. We say that u is median independent of x if m<=Y,
(EX.
(24)
It can be shown (see Manski (1988), Chapter 4) that if P,/& [EX are continuous probability distributions, their medians solve the conditional moment equations sgn(u - mJdP,(t s
= 0,
[EX.
(25)
Ch. 43: Analog Estimation of Econometric Models
So median
independence
s s
sgn(u - Y)dP,It
and continuity
2569
together
imply that
(26)
VEX.
= 0,
It follows from (16) and (26) that
s[s
b) - r]dP=
u(x)x4-n
44
for all u(.) such that the integral
sgn(u -
Y)dP,lx
1
dP, = 0
in (27) exists. Thus, all functions
(27)
of x are orthogonal
to sgn CU,(Z,b) - ~1. The median of a probability distribution is its 0.5-quantile. The above derivation can be generalized to obtain orthogonality conditions implied by the assumption that, for given c(E(O, l), the a-quantile of u does not vary with x.
3.3.
Conditional
symmetry
Mean and median independence both express a belief that the central tendency of u does not vary with x. Yet they are different assertions. This fact may cause the applied researcher some discomfort. One often feels at ease saying that the central tendency of u does not vary with x. But only occasionally can one pinpoint the mathematical sense in ulhich the term “central tendency” should be interpreted. The need for care in defining central tendency disappears if the conditional distributions P,l<, {EX are componentwise symmetric with common center of symmetry. Let U be the real line again and assume that for all realizations of x, the conditional distribution of u is symmetric around some point y. That is,
pu-,I5 = py-.14,
(28)
<EX.
Let II(.) be any odd function mapping the real line into a real vector space; that is, h(q) = - h( - ‘I) for q in R’. Conditional symmetry implies
s s
h(u-y)dP,lt=O,
Equations
VEX.
(29)
(16) and (29) imply that (b, y) solves
o(x)h[ue(z, b) - y]‘dP =
u(x) s
h(u-y)‘dP,Ix [S
1
dP,=O
(30)
C.F. Manski
2570
for all u(.) and h(.) such that the integral in (30) exists. So all functions of x are orthogonal to all odd functions of u - y. The functions h(u - y) = u - y and h(u - y) = w(u - y) are odd. Thus, the orthogonality conditions (23) and (27) that follow from mean and median independence are satisfied given conditional symmetry.
Variance independence
3.4.
One may believe that u not only has the same central tendency for each realization of x but also the same spread. The usual econometric practice has been to express spread by variance. In the presence of mean independence, variance independence (homoskedasticity) is the additional condition
(31)
Here yi is the common mean of the distributions P,[<, VEX and y2 is the common variance matrix. Let u(.) be any real function on X. It follows from (16) and (31) that (b,y,,y,) solves the orthogonality condition
r
J
W{ CU,(Z~ b)- YllCu&
w-
(32)
The assertion of variance independence imposes no restrictions on the variance matrix yZ. In some applications, information about y2 is available. For example, it may be known that the components of u are uncorrelated with one another, so y2 is a diagonal matrix. Such information may be expressed by appropriately restricting the space of possible values of yZ.
3.5.
Statistical
independence
It is often assumed that u has the same distribution
PA5 = p,,
(EX.
for each realization
ofx. That is, (33)
This statistical independence assumption implies mean, median and variance independence. In fact, it implies that all functions of x are uncorrelated with all functions of U. Let s(.) map U into a real vector space. Let y be the unconditional mean of s(u).
Ch. 43:
Analog
Estimation
of Econometric
Models
2571
It follows from (33) that
s(u) dP, I5 = Y,
<EX.
(34)
s It follows from (16) and (34) that (b, y) solves
s
u(x)t-S{UO(Z~ b)f - y]‘dP=O
for all u(.) and s(.) such that the integral
3.6.
(35) in (35) exists.
A historical note
The analogy principle can be applied to the orthogonality conditions derived in the preceding sections to yield method-of-moments estimates of (b, y). These estimators are easy to understand and to apply. Nevertheless, they have taken considerable time to evolve. Wright (1928) and Reiersol(l941, 1945) developed the zero covariance condition (19) in the case where U is the real line, X and B are both K-dimensional real space, and u,(.;) is linear in b. In this case, the sample analog of the orthogonality condition always has a solution. For some time, the literature offered no clear prescription for estimation when the vector x is longer than b; that is, when there are more instruments than unknowns. The sample analog of the zero covariance condition then usually has no solution. The idea of selecting an estimate that makes the sample condition hold as closely as possibly took hold in the 1950s particularly following the work of Sargan (1958). It was not until the 1970s that the estimation methods developed for linear models were extended to models nonlinear in b. See Amemiya (1974). And it was not until the late 1970s that systematic attention was paid to distributional restrictions other than mean independence. The work of Koenker and Bassett (1978) did much to awaken interest in models assuming median independence. The idea that orthogonality conditions should be thought of as special cases of moment equations took hold in the 1980s. See Burguete et al. (1982), Hansen (1982) and Manski (1983).
4.
Method-of-moments
estimation of response models
Response models assert that an observable random variable y is a measurable function of a random pair (x, u), with x observable and u not. The mapping from (x, u) to y is known to be a member of a family of functions indexed by a parameter
C.F. Munski
2512
h in a parameter
space B. Thus,
y = Y&G u, b).
(36)
The random variable y is referred to variously as the dependent, endogenous, explained or response variable. The pair (x, U) are termed independent, exogenous, explanatory or stimulus variables. The function yO(.;, .) mapping (x, u, h) into y is sometimes called the response function. Equation (36) is meaningful only when accompanied by information on the distribution of U. The usual practice is to restrict the conditional distribution P,lx in some way. Many response models imply that b and a nuisance parameter solve a moment problem. I describe here the moment problems implied by likelihood models (Section 4.1) invertible models (Section 4.2) mean independent linear models (Section 4.3) and quantile independent monotone models.(Section 4.4).
4.1.
Likelihood
models
The form of the response model (36) implies that the conditional distribution P,lx is determined by x, b, and P,lx. Suppose that P,(x is known to be a member of a family of distributions s(x,y), ~/ET, where I- is a parameter space and r(.;) is a known function mapping X x r into probability distributions on U. Then P,lx is a function of x, of the parameter of interest b and of the nuisance parameter y. To be precise, let <eX, CUB, A c Y and define U((,c, A) = UEU
s.t.
y()((,u, C)EA.
(37)
BY (36)
P,(A 15)= r(t, Y)CU(~, b,41,
for all A c Y.
(38)
Suppose that there exists a measure v on Y that dominates (in the measure theoretic sense) all of the feasible values of P,,[<. That is, for (&c,d)~X x B x r and A c Y, v(A) = O=r(<, S)[U(& c, A)] = 0. Then the Radon-Nikodym Theorem shows that P,lx has a density with respect to v, say 4(.; x, b, y), with the density function known up to the value of (b, y). Jensen’s inequality can be used to show that, for each ~;EX, (b, y) solves the extremum problem
max log &Y; 4, c, 6) dP, (C,&EB x I- s See, for example,
Manski
I 5.
(1988), Chapter
(39)
5. Because (39) holds for all values of x,
2573
Ch. 43: Analog Estimation of Econometric Models
it follows that (b, y) solves the unconditional
extremum
problem
max log CNY;x, c, 6) dp, (C,6)EB xr s whose sample analog
is the maximum
(40)
likelihood
estimator
The dominance condition. The above shows that maximum likelihood estimation is well-defined whenever there exists a measure v dominating the feasible values of P,lx. I give a pair of familiar examples in which the condition is satisfied. Discrete response models are ones in which the space Y has at most a countable number of points. The dominance condition is satisfied trivially by letting v be a counting measure on Y. Hence, all discrete response models are likelihood models. Models with additive errors are ones in which Y is a finite dimensional real space, U = Y, and equation (36) has the form Y=
Y(X> b) +
4
g(., .) being a given function
PYlX= pg(X,b)+“lx.
(42) mapping
X x B into Y. It follows that (43)
Suppose that the distributions P,(x are known to be dominated by Lebesgue measure. Then the shifted distributions P,,, bJ+uIx are similarly dominated. So the distributions P,jx are dominated by Lebesgue measure. The nuisance parameter. It may appear from the above that maximum likelihood can be used to estimate almost any response model, but this conclusion is too sanguine. To apply the method, one usually must estimate the parameter of interest b and the nuisance parameter y jointly. (There are special cases in which the problem decomposes but these are not the rule.) The estimation task is typically feasible if one has substantial prior information; for example, the classical theory of maximum likelihood estimation supposes that the parameter space B x r is finite dimensional. But the maximum likelihood estimate may not be consistent if the parameter space is too large. And the computational problem of maximizing the likelihood function may become intractable before the method breaks down statistically. For example, maximum likelihood estimation is unappealing when one knows only that u is mean or median independent of x. In these cases, the space r indexing the possible values of P,) x is rather large. In fact, the dominance condition typically
C.F. Manski
2514
fails to hold, so that the maximum likelihood estimate is not even well-defined. That maximum likelihood may break down in the presence of weak distributional information does not imply that estimation is impossible. The remainder of this section shows that some response models can be estimated using other methodof-moments approaches.
4.2.
Invertible
models
Suppose that y and u are one to one. That is, for each (5, c) in X x B, let y,(& ., c) be invertible as a mapping from U into Y. Let y; ‘(t, ., c) denote the inverse function mapping Y into U. Then an alternative representation of equation (36) is Y,
Yx,Y, b) = u.
(44)
This is a separable model, so all of the approaches described in Section 3 can be applied. The additive error model (42) is obviously invertible. Also invertible are the linear simultaneous equations models prominent in the econometrics literature. In simultaneous equations analysis, equation (44) is referred to as the structural form and equation (36) as the reduced form of the model.
4.3.
Mean independent
linear models
Certain response functions combine well with specific distributional restrictions. Linear response functions pair nicely with mean independent unobservables. Let Y and U be J-dimensional and K-dimensional real space. Let equation (36) have the linear-in-u form Y = 91 (x, b) +
&,
bb.
(45)
Here g1 (., .) maps X x B into RJ. The function g2(., .) maps X x B into RJ ’ K and is written as a J x K matrix. Note that the response function in (45) is not invertible unless J = K and the matrices g2(& c), (5, C)EX x B are non-singular. Let it be known that u is mean independent of x. Let y denote the mean of 1.4. Equation (45) implies that the mean regression of y on x is E(Y Ixl
= c/l (x, b) + c/Ax, 4-c
Mean regressions
s CY -
solve a variety
(46) of moment
sl(L b) - gz(k b)yldP,I t = 0,
problems.
(EX.
Rewrite (46) as
(47)
2515
Ch. 43: Analog Estimation of Econometric Models
Let u(.) be any function on X. Because (47) holds for all values that (b, y) solves the orthogonality condition
s
44 CY-
9 1 (x, b) -
of x, it follows
gAx>Nyl dP = 0.
(48)
Another approach uses the well-known fact that the mean regression is the best predictor of y conditional on x, in the sense of minimizing square loss. That is, for each VEX, (b,y) solves the extremum problem
min C~-~~(5,c)-~~(5,c)~l’Cy-~~(5,c)-~~(5,c)~ldP~l5. (C,6)EB xr s It follows that, for any function conditional extremum problem
w(.) mapping
analog
is a weighted
(49)
X into (0, co), (b, y) solves the un-
min W(X)CY-~~(~,C)-S~(~,~)~I’C~-~~~(~,~)-~~~(~,C)~I~~, (C,@EB xr s whose sample
of y on x expected
least squares
estimator
(50)
of (b, y), with weights
w(x).
4.4. Quantile independent monotone models Whereas mean independence meshes well with linear response function, quantile independence combines nicely with real valued response functions that are monotone in a scalar u. Let Y and U be the real line. Let it be known that, for given a~(0, l), the a-quantile of u does not vary with x. Let y denote the cr-quantile of u. For each VEX, let y,(<, u, c) be non-decreasing as a function of u and continuous at y as a function of c. Then it can be shown that y,(x,y, b) is the a-quantile regression of y on x (see Manski (1988), Chapter 6). The a-quantile regression of y on x is a best predictor of y conditional on x, in the sense of minimizing the expected value of the asymmetric absolute loss function giving weights (1 - a) and c1to overpredictions and underpredictions. That is, for each VEX, (b,y) solves the extremum problem
min
(C,&B xr I
(1 -~)~CY
+ @ICY’ YO(~~~>C)IIY - ~,,(S,hc)l @,I& It follows that, for any function
w(.) mapping
(51)
X into (0, oo), (b,y) solves the un-
2576
C.F. Manski
conditional
extremum
,,pF r W{(l <, t x s
problem
-~~~CY~Y,~~,~,~~llY-Yy,~~,~,~~I
+ alCy > Y~(x,~c)IIY
-~,b,~,c)I)
dP,
(52)
whose sample analog is a weighted asymmetric least absolute deviations estimator of (b,y), with weights w(x). Two applications follow. For simplicity, I confine attention to the median independence case. Censored response. Let Y = [0, co) and X = B = RK. Powell (1984, 1986) studied estimation of the censored linear model asserting that y = 0 if x’b + u d 0 and y = x’b + u otherwise. That is,
y = max(O, x’b + u).
(53)
For each l in X, the function max(O, t’b + u) is non-decreasing and continuous u. Hence, the median of P,lx is max(O,x’b + y). Applying (52), (b,y) solves
s
,,,zJf.ly- max(O,x’c whose sample
analog
+ S)l dP,
is the censored
in
(54)
least absolute
deviations
estimator
(55) Binary response. Let Y = (0, l} and X = B = RK. Manski (1975, 1985) studied estimation of the binary response model asserting that y = 0 if x’b + u d 0 and y = 1 otherwise. That is,
y=
l[x’b+u>O].
(56)
For each 5 in X, the response function l[{‘b + u > 0] is non-decreasing in u. This function is continuous at y if {‘b # y, but discontinuous at y if t’b = y. Nevertheless, it can be shown that for all 5, the median of P,I 4 is 1[t’b + y > 01. Applying (52), (b, y) solves
fc,mkxr
I
l~-1Cx’c+~>OlldP,
(57)
Ch. 43: Analog Estimation
qf Econometric
whose sample
is the maximum
analog
2517
Models
score estimator
(58)
5.
Estimation
of general separable and response models
The method-of-moments approaches described in the preceding sections make it possible to estimate a wide variety of econometric models and so are enormously useful. These approaches cannot, however, handle all models of potential interest. Not all models imply moment problems and those models that do imply moment problems can be estimated by the method of moments only if the estimation problem is sufficiently regular. This section describes more general analog approaches to the estimation of separable and response models. Section 5.1 presents closest-empirical-distribution estimation of separable models, introduced in Manski (1983). Section 5.2 presents minimum-distance estimation of response models, based on the work of Wolfowitz (1953, 1957) and others.
5.1.
Closest-empirical-distribution
Recall that a separable
estimation of separable
models
model has the form u,(z, b) = u. Hence
P,” = P+,b).
(59)
Thus, the joint distribution of observables and unobservables observable distribution P and of the parameter b. To make P,, on (P, b) explicit, let Q be any probability distribution on Z a random variable distributed Q. For EL?, let $(Q,c) denote [z(Q), ue{z(Q), c}]. Then (59) implies that
is a the and the
function of the dependence of let z(Q) denote distribution of
Pm= W’, b). Suppose one knows that P,,~17,,, Z x U. By (60),
(60) where I&, is some family of distributions
W’, b)EJI,,. This translates the information on P,, into a condition relating to the observable distribution P. We may now apply the analogy
on
(61) the parameter b principle to (61).
C.F. Manski
2578
The analog
estimate
for b is
(62)
B, = [cEB:$(P,, C)EITZ”],
unless this set is empty. In that case, the analogy principle suggests selection of an estimate that makes the distribution $(PN, .) as close as possible to &,, in some sense. To do this, we may select a function r(.,&,) that maps each probability distribution $ on Z x U into [0, co) and that satisfies the condition r($, l7,,) = 00 $~17,,. Thus, r(., I&,) distinguishes distributions that are in fl,, from ones that are not. Condition (61) is equivalent to saying that b solves the minimization problem (63)
The analogy principle (CED) estimator
applied
to (63) yields
In words, (64) selects an estimate as close as possible to Z&,.
the closest-empirical-distribution
that brings the distribution
of [z(PN), uO{z(P,), .}]
Examples. Method-of-moments estimates of parameters solving orthogonality conditions are CED estimates. The set fl,, is the family of distributions satisfying (17). This information is translated by (18) into an orthogonality condition relating b to P. The method of moments then selects an estimate that makes the distribution of [z(PN), u,{z(P,), .}] satisfy the orthogonality condition as closely as possible. As a second example, suppose it is known that u is statistically independent of some function of z, say ?c = x(z). This information can be expressed through the statement that, for all (s, t)eX x U,
PZ”(Xd s, u ,< t) = P(x < S)P”(U< t). This translates
into the following
p[xG s,u,(z, b) < tl If it exists, the analog B, s {EB:P,[x
restriction
(65) relating
b ‘to P:
= P(x d s)P[u,(z, b) < t].
estimate
(66)
is
d s, u,(z, b) < t] = P,(x d s)P,Ju,,(z, b) < t, ‘v’s,tl}.
(67)
Ch. 43: Analog
Estimation
of Econometric
2519
Models
But this estimate typically does not exist so we need to make (65) hold as closely as possible, in some sense. One among many a priori reasonable approaches expresses the prior information through the statement that b minimizes the integrated square distance between P[x,u,(z;)] and P(x)P[u,(z;)]. That is, b solves the minimization problem
?$I
s
{P[X G s,u,(z,c) < t] - P(x <s)PCu,(z,c)< tl}2dsdt,
whose sample analog
min {P,[x CEB
(68)
is the estimator
< s, u,(z, c) d t] - P,(x d s)PN[uo(z,
C)
G
t]}” ds dt.
(69)
s
Computation of this estimator is difficult as one must integrate over all values of (s, t). A computationally simpler (but notationally more complicated) estimator results if one uses mean square distance rather than integrated square distance. Let P’ = P, define z’ to be a random variable distributed P’ and independent of z, and replace (68) by
m$r
{P[X G x’, u,(z, c) < ue(z’, c)] - P(x < x’)P[u,(z,
c) < u&‘, c)I}~ dP’.
(70)
s The sample
analog
of (70) is
TE$lk ,il{‘NCxG
xi,
uO(z,c) d uO(zi~c)l - p,(x G xi)p~[“~(Z,C)G uo(zi, C)]}2.
I
(71) Consistency of CED estimates. Manski (1983) shows that closest-empirical-distribution estimates are consistent if b is identified, B is compact and finite dimensional, r[II/(P, .), I7,,] is continuous on B, and (72)
almost surely. (Condition (72) is an abstract uniform law of large numbers.) This consistency theorem has been applied to prove that the estimator (71) is consistent given regularity conditions (see Manski (1983), Corollary to Theorem 2). The asymptotic sampling behavior of general CED estimates has not been studied.
C.F. Manski
2580
5.2.
Minimum-distance
estimation of response
models
Recall that the response model (36) implies that P,I x is a function of (x, b, P,Ix). When we introduced likelihood models in Section 4.1, we assumed that P,lx is a member of a family of distributions T(X,y), r~r, so P,lx is a function of (x, b, y). We also assumed that P,(x is dominated by a known measure v. Here we maintain the former assumption but drop the latter one. By assumption,
P, Ix = N--G b,Y),
(73)
where h(., ., .) is a known function mapping X x B x r into probability distributions on Y. Let p(., .) be a function that maps pairs of probability distributions on Y into [0, co) and that satisfies the condition Q = Q’op(Q, Q’) = 0. Equation (73) implies that (b, y) solves the collection of conditional minimization problems
min pC(P,I0, ML c,41,
(c,6)eB
Y I-
5~x.
It follows that, for any function w(.) mapping conditional minimization problem
min
(C.&Bx r
s
w(OCV’, Ix), k
whose sample analog
minr
(C,6)EB x
= ,,gk
(74)
X into (0, co), (b,y) solves the un-
c,41 dp,,
is the minimum-distance
(75)
estimator
s
W(X)P CWNy Ix),4~ c,41 df’,,
.$ jl
W(Xi)PC(pNy
Ixi)T Yxi7‘, ‘)I’
(76)
Wolfowitz (1953, 1957) investigated the case with no conditioning variable x and with p specified to be a metric on the space of distributions on Y. In that setting, (75) selects an estimate that minimizes the distance, in the sense of p, between the theorized distribution of y and its empirical distribution. Sahler (1970) extended the approach by letting p be any smooth function that maps pairs of probability distributions on Y into [0, co) and that satisfies the condition Q = Q’op(Q, Q’) = 0. An early minimum-distance estimator with conditioning variables is the minimum chi-square method (Neyman, 1949). Here x is multinomial with support X = (1,. . . , J),
Ch. 43: Analoy Estimation
y is Bernoulli
of Econometric
conditional
on x, and p is Euclidean
w(j)[P,(y=
2581
Models
llx=j)-h(j,c,6)]2.
distance.
So the estimator
is
(77)
Here Nj is the number of observations at which xi =j and P,(y = 11.x =j) is the sample frequency of the event [y = 1] conditional on the event [x = j]. Econometricians use the term minimum-distance to refer to estimators that minimize the distance between specified features of the theorized distribution of y and the sample analogs of these features. Thus, in econometric usage, t(Q) = t(Q’)e p(Q, Q’) = 0 for some functional t(.). Usually, p has measured Euclidean distance between theoretical and sample moments (e.g. Goldberger and Joreskog, 1972). In a recent application, Chamberlain (1994) measures distance between theoretical and sample medians of y conditional on x.
6.
Conclusion
This chapter has surveyed the application of analog methods to estimate econometric models. The analogy principle is more than just a useful tool for generating estimators. It offers a valuable framework for teaching and for research. The analogy principle is an effective device for teaching estimation. In analog estimation, one begins by asking what one knows about the population. One then treats the sample as if it were the population. Finally, one selects an estimate that makes the known properties of the population hold as closely as possible in the sample. What could be more intuitive? The analogy principle disciplines econometric research by focussing attention on estimation problems rather than on methods. Much of the literature proposes some new method and then looks for problems to which it can be applied. This approach has been productive, but it seems more sensible to first specify an estimation problem and then seek to develop applicable estimation methods. The analogy principle forces this mode of thought. One can define an analog estimator only after one has stated the estimation problem of interest.
References Amemiya, T. (1974) “The Non-Linear Two Stage Least Squares Estimator”, Journal of Econometrics, 2, 105-l 10. Amemiya, T. (I 985) Advanced Econometrics, Cambridge: Harvard University Press. Andrews, D. (1987) “Consistency in Nonlinear Econometric Models: A Generic Uniform Law of Large Numbers”, Econometrica, 55, 1465-1471. Burguete, J., Gallant, R., and Souza, G. (1982) “On Unification of the Asymptotic Theory of Nonlinear Econometric Models”, Econometric Reviews, 1, I51-190.
2582
C.F. Manski
Chamberlain, G. (1987) “Asymptotic Efficiency in Estimation With Conditional Moment Restrictions”, Journal cf Econometrics, 34, 305-334. Chamberlain, G. (1994) “Quantile Regression, Censoring, and the Structure of Wages”, in: C. Sims, ed., Advances in Econometrics: Sixrh World Congress, New York: Cambridge University Press. Cambridge University Press, forthcoming. Gallant, R. (1987) Nonlinear Statistical Models, New York: Wiley. Goldberger, A. and Joreskog, K. (1972) “Factor Analysis by Generalized Least Squares”, Psychometrika, 37, 243-260. Hansen, L. (1982) “Large Sample Properties of Generalized Method of Moment Estimators”, Econometrica, 50, 1029-1054. Koenker, R. and Bassett, G. (1978) “Regression Quantiles”, Econometrica, 46, 33-50. Manski. C. (1975) “Maximum Score Estimation of the Stochastic Utility Model of Choice”, Journal ofEconometrics, 3, 205-228. Manski, C. (1983) “Closest Empirical Distribution Estimation”, Econometrica, 51, 305-319. Manski, C. (1985) “Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator”, Journal of Econometrics, 27, 303-333. Manski, C. (1988) Analog Estimation Methods in Econometrics, London: Chapman and Hall. Neyman, J. (1949) “Contributions to the Theory of the x2 Test”, in: Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California. Pearson, K. (1894) “Contributions to the Mathematical Theory of Evolution”, Philosophical Transactions of the Royal Society of London, A185, 71-78. Pollard, D. (1984) Contlergence c$Storhastic Processes, New York: Springer-Verlag. Powell, J. (1984) “Least Absolute Deviations Estimation for the Censored Regression Model”, Journal of Econometrics, 25, 303-325. Powell, J. (1986) “Censored Regression Quantiles”, Journal of Econometrics, 32, 143-155. Reiersol, 0. (1941) “Confluence Analysis by Means of Lag Moments and Other Methods of Confluence Analysis”, Econometrica, 9, l-23. Reiersol, 0. (1945) “Confluence Analysis by Means of Instrumental Sets of Variables”, Arkio Fur Matematik, Astronomi Och Fysik, 32A, no. 4, 1-119. Sahler, W. (1970) “Estimation by Minimum Discrepancy Methods”, Metrika, 16, 85-106. Sargan, J. (1958) “The Estimation of Economic Relationships Using Instrumental Variables”, Econometrica, 26, 393-415. Wolfowitz, J. (1953) “Estimation by the Minimum Distance Method”, Annals ofthe Institute ofStatistics and Mathematics, 5,9-23. Wolfowitz, J. (1957) “The Minimum Distance Method”, Annals OfMathematical Statistics, 28, 75-88. Wright, S. (1928) Appendix B to Wright, P. The Tariff on Animal and Vegetable Oils, New York: Macmillan.
Chapter 44
TESTING NON-NESTED
HYPOTHESES
C. GOURIEROUX CREST-CEPREMAP A. MONFORT CREST-INSEE
Contents
1. 2.
3.
4.
5.
Introduction Non-nested
2585 2587
2.1.
Definitions
2.2.
Pseudo-true
2.3.
Semi-parametric
2.4.
Examples
2.5.
Symmetry
2587
hypotheses
2589
values
2590
hypotheses
2591 2596
of the problem
Testing procedures 3.1. Maximum likelihood
2597 estimator
2597
under misspecification
3.2.
The extended
Wald test
2598
3.3.
The extended
score test
2600
3.4.
The Cox procedure
3.5.
Application
3.6.
Applications
Artificial
2602
to the choice of regressors to qualitative
nesting
2610
models
2610
Examples
4.2.
Local expansions
4.3.
A score test based on a modified
4.4.
The partially
5.1.
Asymptotic
2608
models
4.1.
Comparison
2605
in linear models
of artificial
modified
of testing equivalence
nesting
Atkinson’s
Atkinson’s
2614
models
compound
compound
model
model
procedures of test statistics
5.2.
Asymptotic
5.3.
Exact finite sample results
comparisons
of power functions
Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden 0 1994 Elseuier Science B.V. All rights reserved
2618 2621
2621 2622 2622 2624
Ch. 44: Testing Non-Nested Hypothrses
1.
2585
Introduction
The comparison of different hypotheses, i.e. of competing models, is the basis of model specification. It may be performed along two main lines. The first one consists in associating with each model a loss function and in retaining the specification implying the smallest (estimated) loss. In practice, the loss function is defined either by updating some a priori knowledge on the models given the available observations (the Bayesian point of view), or by introducing some criterion taking into account the trade-off between the goodness of fit and the complexity of the model (for instance the usual adjusted R2 or the Akaike information criterion). This approach, called model choice or model selection, has already been discussed in the handbook (see Learner (1983)) and therefore it will not be treated in this chapter, except in Section 6. The second approach is hypothesis testing theory. For model selection we have to choose a decision rule explaining for which observations we prefer to retain each hypothesis. In the simplest case of two hypotheses H, and H,, this is equivalent to the definition of the critical region giving the set of observations for which H, is rejected. However, the determination of the decision rule is not done on the same basis as model choice. The basis of hypothesis testing theory is to introduce the probability of errors: first error type (to reject H, when it is true), and second error type (to reject H, when it is true), then to choose a critical region for which the first error type probability is smaller than a given size (generally 5%) and the second error type probability is as small as possible. Hypothesis testing theory is usually advocated when one of the hypotheses He, called the null hypothesis, may be considered as a “limit case” of the second hypothesis H,, called the alternative hypothesis. Broadly speaking the model H, u H, can be reduced to the submodel H, by imposing some restrictions on the parameters. H, is said to be nested in H,. Some general testing procedures have been developed for this case; the main ones are the likelihood ratio tests, Wald tests, Lagrange multiplier (or score) tests and Hausman’s tests (see Engle (1984) for a survey of these testing procedures). In this chapter we are interested in the opposite case, where none of the hypotheses is a particular case of another one. These hypotheses may be entirely distinct (globally non-nested hypotheses) or may have an intersection (partially non-nested hypotheses). Most of the theoretical literature on non-nested hypotheses testing derives from papers by Cox (1961, 1962) and Atkinson (1970). The first author developed a general procedure, known as the Cox test, which generalizes the likelihood ratio procedure used in the case of nested hypotheses. The second author proposed to introduce a third model H, called an artificial nesting model, containing both Ho and H, and to use the classical procedures for testing H, against H, and HI against H. These two approaches to the problem are conceptually different, even if they provide similar results in a number of important applications,
2586
C. Gourieroux und A. Monfort
especially in the case of linear models. The example of linear models, which leads to explicit and tractable computations has been extensively studied during the seventies and the beginning of the eighties (see, e.g. Fisher and McAleer (1981), Godfrey and Pesaran (1983) Pesaran (1974, 1982a, 1982b)), and different nesting models have been proposed for specific problems. The generalization of Wald and score testing procedures to non-nested hypotheses have been proposed by Cox (1961), Gourieroux et al. (1983) and Mizon and Richard (1986) and their links with the Cox test have been studied. In parallel, Davidson and McKinnon (1981, 1983, 1984) considered some local approximations of artificial nesting models in a neighbourhood of the hypothesis H, and derived the associated tests. Then the power of these different testing procedures have been compared, either in finite samples (McAleer (198 l), Fisher and McAleer (198 I), Dufour (1989)) or asymptotically (Davidson and McKinnon (1987), Gourieroux (1983), Pesaran (1984, 1987) Szroeter (1989)). The generalization of the Wald test leads to a procedure with interesting interpretations in terms of predictions (Gourieroux et al. (1983), Mizon (1984), Mizon and Richard (1986)). This interpretation has been used as the basis of a modelling strategy. The so-called encompassing principle has been developed in a number of papers by Mizon and Richard (198 1,1986), Hendry (1988), Hendry and Mizon (1990), Hendry and Richard (1990) and associated tests have been introduced in Gourieroux and Monfort (1992). In Section 2, we carefully define the notions of non-nested hypotheses, we distinguish, especially, partially and globally non-nested hypotheses. For this purpose it is necessary to introduce a suitable metric measuring the closeness of the hypotheses; this leads to two concepts of pseudo-true value, one is marginal with respect to the explanatory variables and the other is conditional to these variables. Different classical examples of non-nested models are also described in this section. Section 3 treats the extension of the usual testing procedures: Wald, score, likelihood ratio tests. We obtain different forms of the test statistics depending on the kind of pseudo-true value which is used. The application to the choice of regressors in the linear model is particularly emphasized. The artificial nesting models are presented in Section 4, first for some specific problems in which the nesting models may have interesting interpretations, then in the general case. The expansion of these artificial nesting models in a neighbourhood of one of the hypotheses leads to a linearization of the problem and to simple specification tests, called the J-test and the P-test (Davidson and McKinnon (1981)). These tests are compared in Section 5 through their asymptotic power. This analysis is complicated by the fact that some points of the composite null hypothesis are not the limits of points of the alternative hypothesis. A local power analysis may only be conducted along a sequence of points leading to a distribution which is common to both hypotheses; otherwise it is necessary to develop some other
Ch. 44: Testing Non-Nested
2587
Hypotheses
asymptotic comparisons of tests for fixed alternatives (Bahadur (1960), Geweke (1981) Gourieroux (1983)). In the last section, we discuss the use of non-nested testing procedures for model building. We introduce the notion of encompassing and explain how it may be used as a modelling strategy. Then we derive Wald tests of the encompassing hypothesis which are modified versions of the Wald test for non-nested hypotheses, taking into account the fact that the true distribution does not necessarily belong to one of the competing models. Since our aim is to present the main ideas and potential applications, we will in general omit the proofs and the less significant assumptions. In the whole chapter, we consider T observations of some endogenous and exogenous variables, denoted by y,, x,, t = 1,. . . , T, respectively. We assume that: the pairs (y,, x,), t = 1,. , T are identically and independently distributed an unknown probability density function: fox(x) f&y/x) with respect product measure denoted by v,(dx) @ v,(dy).
with to a
These assumptions are made in order to simplify the presentation, but they may be weakened in order to allow for some correlations between the pairs (y,, x,), for instance when there are lagged endogenous regressors (see, e.g. Wooldridge (1990), Domowitz and White (1982), Gourieroux and Monfort (1992)), or to consider the case of deterministic regressors (White and Domowitz (1984), Wooldridge (1990)).
2.
Non-nested
hypotheses
The hypotheses may concern either the whole conditional distribution of y, given x,, or simply some conditional moments, such as the conditional expectation. We successively consider these two situations.
2.1.
Dejinitions
When the hypotheses concern the whole conditional metric form, they may be written as
H, = MY,/&;
distribution
and have a para-
a),CffzA= [WC},
H, = {~(Y,/x,; b’),PEB = RH}.
(2.1)
They are respectively parameterized by the parameters c( and 8, which may have different sizes and different interpretations. The first hypothesis H, (for instance)
C. Gourieroux and A. Morzfbrt
2588
is valid if the true conditional distribution f&y,/x,) can be written as g(y,/x,;irO) for some u,,EA. In such a case a, is the true value of the parameter. To measure the closeness between the two hypotheses H, and H,,we have to introduce a proximity measure between probability density functions. One such measure, is the KullbackkLeibler information criterion (KLIC). The KLIC of two conditional distributions g(y/x; a) and h(y/x; fl) may be defined either conditionally to the exogenous variables by
Z,,(a,P;4 =
log s
or unconditionally
cJhbY P)=
$f$2
(2.2)
g(y/x; a)v, (dy),
by
ss log
.dY/x;4 ____ s(yIx; G-ox(xbyWY)v, (d-4.
(2.3)
W/x; PI
The conditional version may be computed as soon as the two hypotheses have been defined, whereas the unconditional version Z,,(a, /J) depends on the unknown marginal distribution of the exogenous variables, and therefore is also unknown. However, it may be consistently estimated by
The KLIC takes non-negative values and is equal to zero if and only if the two p.d.f.‘s appearing in the definition are the same:
Igh(4B;x) = O-dYlx;
a) = 4Yl-T B).
It is not a distance in the mathematical the symmetry condition:
sense since, for instance,
it does not satisfy
I&?&>B; x) z r,,(D> a; x). For the definition of nested and non-nested hypotheses, we follow Pesaran (1987). We first define the proximity between a distribution of H, and the whole hypothesis H, by I&a; x) = inf Zgh(a,/I; x), BEE and similarly,
the proximity
Ih(B;x) = infIh,(B, 4x1. EA
(2.5) between
a distribution
of
H, and H, (2.6)
C. Gourieroux
2590
and A. MonJbrt
This finite sample pseudo-true value b,(a) depends on T and on the observations x1,. . . , xT, and therefore it is a random variable. For this reason, it is sometimes called the conditional pseudo-true value. It converges to b(a), when T tends to infinity. An important simplification occurs when we consider models without explanatory variables, or equivalently models in which the only x variables are constant variables. In this case Z,,(a, /i’;x) = Z,,(a, p) = I^,(a, fi) and the asymptotic and finite sample pseudo-true values coincide. However models without explanatory variables are not frequent in econometrics and therefore we will have to keep in mind the distinction between @cc) and b,(a).
2.3.
Semi-parametric
hypotheses
The competing models may also have been defined through some conditional moments, for instance through conditional means. The approach described in subsections 2.1 and 2.2 can easily be adapted to this semi-parametric framework. We basically have to define in a suitable way a measure of proximity between these conditional moments. As an illustration let us consider the case of conditional means. The two hypotheses H, and H, are defined by
H, = (E(y,/x,)
= m(x,; a), WA c R’},
H, = (E(ytlxt)
= Ax,; P), BEB = R”}.
The proximity between Euclidean distance
k,Ja, 8) =
two means may be measured
(2.8) by the usual average squared
s
or by its empirical
II4x; 4 - Ax; 8)II2fox(-4 dvx(xL
(2.9)
counterpart
kP)=;1;f: IIm(x,;~)-~(x,;B)l12.
(2.10)
The pseudo-true values associated with these semi-parametric this distance D, are then defined by D,,,+(M,b(x)) = inf D,,J~, BEE
b)
for the asymptotic for the finite sample
hypotheses
and with
case, case.
(2.11)
2591
Ch. 44: Testing Non-Nested Hypotheses
In fact this semi-parametric approach is strongly linked with the previous one. Let us assume for a moment that the conditional distribution of y, given x, is normal with unit variance. The hypotheses H, and H, would be defined by
H, = {Y(Y,/x,; a) = Wm(x,, a),11,a-l}, H,, = (~(Y,/x,; PI = Moreover
1
the associated
(a
gh
KLIC
would be
s s
logdJi~x;a) h(~B(y!x; aWv,(y)
px)= 2
N/4x,,/% 11,BEB).
1
= =
1
dylx; a)dv,W
4II4x; 4 - Ax; B)II*.
We deduce that I,,(a, /j’) = JZ,,(a, /I; x)f&x) dv,(x) = iD,,,Ja, p). The pseudo-true values defined in the semi-parametric framework with a Euclidean distance coincide with the pseudo-true values computed as if the conditional distribution was a Gaussian distribution. It would have been possible to measure the closeness of the non-nested hypotheses H, and H, by using a KLIC based on another artificial conditional distribution. The artificial distributions which are suitable for this purpose are the members of linear exponential families, including the normal, the Poisson, the gamma, etc. distributions (see Gourieroux et al. (1984)).
2.4. 2.4.1.
Examples
Choice of regressors in a linear model
We consider three sets of linearly independent regressors x0, xi, x2, with respective sizes K,, K,, K, and x = (x,, xi, x2). The two models of interest are
H, =
{E(Y,/x,)= xOtaO+
H, = {E(~,/xt) = x,,b,
xltal,(ab,a;)‘~[WKo+K’:,
+~~~bl,(bb,b;)~[W~‘+~~}.
To choose between H, and H, is equivalent to be eliminated: either xi (H, is accepted), or K, is non-zero the two semi-parametric
(2.12)
to choosing the regressors which have or x2 (H, is accepted). As soon as K, hypotheses are non-nested. They are
C. Gourieroux
2592
only partially
non-nested
as soon as K, # 0, their intersection
and A. Monfort
being
H,n H, = (E(y,/xJ = xotuo,uo~RK”}. The finite sample
is the solution
pseudo-true
of the minimization
T
min1
value br(c() corresponding
CxOtaO + ha1
-
to
problem
(xotbo+ x2A)12.
bo,b, 1= 1
It is a least squares problem. Let us introduce the matrices X0,, X,.,X,, giving the observations on the different regressors. They have respective sizes TX K,, T x K,, T x K,. The solution of the problem is
When T is large the empirical cross-products order moments of the marginal distribution
converge to the corresponding of the x variables. We get
>(>
second
a,
b
a,
’
(2.14)
where $ is the expectation with respect to the true unknown marginal distribution of the explanatory variables. It must be emphasized that the definitions of the two hypotheses concern the same conditional expectations. It might have seemed natural to define the hypotheses by ot, XlJ = XO(UO+ x,,a,,(u;, u;)‘ERKo+K’}, Hm= {E(Y,/x
H, = {E(ytlxot>
~a) = x,,b,
+ x,tb,,(b;,
b;)aRKo+K2).
However it is easily seen that the hypotheses defined in this way may be compatible: for instance, if the vector of all variables (yt,xot,x1t,x2t) is Gaussian, H, and H, are both satisfied.
Ch. 44: Testing Non-Nested
2.4.2.
Linear
2593
Hypotheses
versus logarithmic
,formulations
When specifying a regression model we have often to choose between a linear and a logarithmic formulation. For instance, for demand functions giving the consumption as a function of income, we have to compare a formulation with constant propensity to consume (linear formulation) and a formulation with constant elasticity (logarithmic formulation). The two hypotheses written in terms of conditional distributions are H, = {logy, = x,a + ut, where uJx,, z, - N(0, or’), CCEA,o2 > 0}, H, = {yt = z,B + vl, where Q/X,, z, - N[O, v2], /I’EB, q2 > O}.
(2.15)
For the same reasons as in the previous example the hypotheses are made on a conditional distribution given the same set of regressors x and z. The hypotheses differ in different respects. First, the distributions have different supports: the positive real line for H, and the whole real line for H,; clearly the question of choosing between H, and H, is only meaningful if the observed values of y are positive. Second, the two models differ not only by the form of the regression functions
E(y,/x,, zr) = exp(x,cr + $“),
for
H,,
E(YJ%
for
H,,
zt) = z,P,
but also by their variance properties since the data are supposed to be heteroskedastic under H, and homoskedastic under H,. The finite sample pseudo-true values may be easily determined (see Gourieroux et al. (1983)). They are given by
&,(a, a2)=
( fjziz,) -’ i \r=1
V+(a,a2)= +
/
T
Polytomous
exp(x,a) exp g,
$ exp(2x,cr) exp(2a2)
-rl,,cr2(~I
2.4.3.
zi
t=1
z,expx,a) (
logit versus sequential
*jil z:z,
logit formulations
Another classical example (Hausman and McFadden (1980)) is found in discrete choice models. Let us consider the case of three alternatives, i = 1,2,3, and the
C. Gourieroux
2594
and A. Monfort
conditional distribution of the indicator function of the retained alternative given some attributes. The y variable is a qualitative variable and the model is completely defined by the conditional selection probabilities. Two formulations are often examined. Under the independence from the irrelevant alternative hypothesis (I.I.A.), the selection probabilities have a polytomous logit form H,={g(l/x,;cr)=(l
+expx,,cr,
+expx,,a,)-‘,
g@/x,;cc) = expx,,cc,(l
+ expx,,cr,
+ expx,,a,))‘,
g(3/x,;a)
+ expx,,a,
+ expx,,a,)-‘,
= expx+,(l
(2.16)
EEA},
where g(i/x,;cc) denotes the probability of choosing alternative i given the x variables. Such a model describes choices deduced from a unique utility maximization among the set of alternatives { 1,2,3}. However other models describe the idea of sequential choices: a first choice between 1 and {2,3}, then if the second subset is retained, a choice between 2 and 3. We get the so-called sequential logit model
H, =
{Wlx,;8) = (1 + expx,,P,)-‘, Wx,;P) = evx3,B1(l + expx,A-‘(1 + expX41/j2)-13 h(3/x,; 8) = exp xJfBl exp x&(1
As before, the distributions
are conditional
+ exp xjtB1)- ‘(1 + exp xqlB2)- ‘, RB}. (2.17) to the set of all regressors
xit, x2*, xjt,
X4,.
2.4.4.
Choices between macromodels
Other examples exist in macromodelling: choice between New Keynesian and Classical models (Dadkhah and Valbuena (1985), Nakhaeizadeh (1988)), between models defined through Euler conditions corresponding to different optimizations (Ghysels and Hall (1990)) and between equilibrium and disequilibrium models. This last example is particularly interesting, since the usual equilibrium and disequilibrium models are generally defined by different kinds of conditional distributions. Let us introduce the demand and supply functions, depending on a price p and on exogenous variables x,z. The latent model is D, = up, + x,ct + ut,
St= b, + zrb’+ r~r> where for instance (ut, VJ is, conditionally mean and a variance-covariance matrix
to x, z, normally denoted by 0.
distributed
with a zero
Ch. 44: Testing Non-Nested
In the -equilibrium defined by
2595
Hypotheses
model the exchanged
quantity
and the equilibrium
price are
Q, = aP, + x,a + u,,
Q, = bp, + z,B+ u,, and this model leads to a parameterized
form for the conditional
distributions
of
Q,, P, given q,z,. In the disequilibrium framework, the prices pt are assumed to be exogenous and the exchanged quantity is defined as the minimum of supply and demand: Q, = min(D,, S,). The disequilibrium model leads to a parameterized form of the conditional distribution of Q, given pI,xt,z,. Since the two models do not correspond to the same kind of endogenous and exogenous variables, they can only be compared by increasing or decreasing the information. A first possibility is to ignore the information on the distribution of prices in the equilibrium model and to compare the form of f(Q,/p,, x,, zt) in the equilibrium and disequilibrium models. A second solution is to complete the disequilibrium model by adding a price equation explaining how the price depends on x and z; with this enlarged model we may determine the conditional distribution f(Q,, pJx,, z,) and compare it with the specification associated with the equilibrium model.
2.4.5.
Separate families
in time series analysis
The last examples that we present generally linked with the determination orders, i.e. the so-called identification
deal with time series. These examples are of the autoregressive or the moving average problem (see, e.g. Walker (1967) j.
2.4.5.1. Choice between autoregressive and moving average representations. Let us consider a centered time series (Y,, tcN), whose dynamics is of interest. It is natural to first test if this series satisfies the white noise properties and, if this hypothesis is rejected, to propose a more complicated specification. Two simple specifications may be introduced, an autoregressive representation of order 1 Y, = pY,_ I + ut, and a moving y, =
E, -
average BE,
_
1,
U, ‘v I.I.N(O, a*), representation
of order
1,
E, - I.I.N(O, Y/*).
These two hypotheses are partially non-nested to the white noise processes. The comparison
and their intersection corresponds between the two previous models
Ch. 44: Testing Non-Nested
2599
Hypotheses
E denotes
the conditional
zxpectation
with respect to f&x).
expectation
with
respect
to g(y/x;a,)
and
E the X
Then if V- and V*- are generalized inverses of I/ and I/*, if c- and c*- are consistent estimators of these generalized inverses and if d and d* are the ranks of V and V*, we deduce from Proposition 3.2.1 that the Wald statistics
are asymptotically distributed under H,, as chi-squares freedom respectively (see Cox (1961) for
with d and d* degrees of et al. (1983) for
3.2.2
The Wald tests of H, against H, consist in accepting H, if ;“rl < xt _,(d)[resp rp2 d x: _,(d*)], and in rejecting it otherwise (E is the asymptotic level of the test). Remark 3.2.3 It is easily seen that the previous testing procedures coincide with the usual Wald test in the special case of nested hypotheses. Indeed let us assume that
MY,/x,;B)= khlx,;
Y, CI),
and that the null hypothesis
wherep’ = (y’,a’)‘,
is defined
by constraining
y to be zero:
dytlx,; ~1)= k(y,lx,;0,a). The pseudo-true
b(cz)= b&)
=
values are
0O ci
and the two previous
’
test statistics
trl,
rp2 are equal. Moreover
we have
Ch. 44: Testing Non-Nested
Hypotheses
2601
evaluated at the estimated pseudo-true value. Depending on the computability the asymptotic pseudo-true value, we may consider either
of
(3.4) or (3.5)
In the special case of nested hypotheses l$) and ;i(T2)are equal and coincide with the usual score evaluated under the null. The asymptotic distributional properties of the estimated scores have been derived in Gourieroux et al. (1983); they are summarized in the following proposition (for the i.i.d. case).
Proposition
3.3.1
Under the null hypothesis H,, the random vectors (l/J?)@’ and (l/J?)@ asymptotically normal, with zero mean and asymptotic variance-covariance des respectively given by
w=ctltlw* = C& -
GgCgglCgh, C&,tCgh.
The matrices W and W* have the same ranks d and d* as the matrices introduced for the Wald procedures. Proposition
are matri-
V and V*
3.3.2
If I$- and @*- are consistent estimators (under the null) of generalized Wand W*, the score statistics are defined by
inverses of
C. Gourieroux and A. Mocfort
2602
They are asymptotically of freedom respectively. are
distributed under H,, as chi-squares with d and d* degrees The associated critical regions with asymptotic level c
and
{t”,‘> x:-,(d*)>.
3.4.
The Cox procedure
In the two previous subsections we described natural extensions of the Wald and score test procedures. We now consider the third kind of procedure: the likelihood ratio test. The extension was first introduced by Cox (1961,1962) and studied in detail in a number of other papers (see, e.g. Jackson (1968), Pesaran (1974), Pesaran and Deaton (1978)). The idea is to consider, in a first step, the usual likelihood ratio (LR) test statistic, and to study its asymptotic distributional properties. Since under the null hypothesis, the LR statistic divided by T tends to a non-zero limit, this limit is, in a second step, estimated and used to correct the usual procedure. More precisely let us consider the maximum log-likelihood function evaluated under H, and H, respectively:
(3.6)
Lh,&)= max i logh(y,lx,;B) p t=i
= f:
WWxt;&).
(3.7)
i=l
The usual LR statistic
=
2
is defined
by
,ilClog4+/x,; fir)- logdytlxt; 411.
(3.8)
C. Gourieroux and A. MonJort
2604
where u, is the E quantile of the standard normal. The asymptotic level of this test is E; moreover this test is consistent for the distributions of H, which do not belong to H,. (Note that C, is sometimes replaced by - C,; in this case the critical region must be changed accordingly.) Proof
We essentially have to explain why the critical region is one-sided. For this purpose we consider the asymptotic behaviour of C, under the alternative H,. We denote by a(Pe) the asymptotic pseudo-true value of c( associated with the true value, /IO, of fi. We get log
h(Y/X;P,)
E log W/X; II 9( y/x;
- d8o) [
a&J)
bC4Bo)l) 9(y/x; 4Bo)) x
This limit is strictly positive (except if the hypotheses are partially non-nested and if the true p.d.f. belongs to both hypotheses) and under the alternative the Cox statistic tends to + co, which explains the one-sided form of the test. Remark 3.4.3
In practical situations the marginal distribution of the exogenous variables generally unknown and it may be interesting to consider the modified version the Cox statistic
cT
=
+:(8T)
-
L;(dT)]
-
+,fl
E
{log
;[L:(fi,(,)
h(Y/X,;
&T@T))
-
L;(dT)]/x
-
l”gdy/x,;
is of
dT)/Xt)
1
However such a modification will necessitate the determination of the asymptotic variance of ET which is different from the asymptotic variance of CT. The explicit form of the Cox statistic or of its modified version has been derived and interpreted in several econometric models (Pesaran (1974) Pesaran and Deaton (1978), Fisher. and McAleer (1979), White (1982b)).
Ch. 44: Testing Non-Nested
2605
Hypotheses
Remark 3.4.4
The use of the Cox statistic is only valid under the regularity condition 0; # 0, i.e. if the difference logg(Y/X; a,) - logh( Y/X; b(a,)) does not belong to the vector space generated by the components of
When H, is nested in H,, we get (see Remark 3.2.3) h( Y/X; @a,)) = k( Y/X; 0, ao) = the Cox procedure does not apply in the simple case of nested hypotheses. It does not apply either if the two non-nested hypotheses are “orthogonal”. The second case is especially clear for linear models (see Section 3.5). g( Y/X; Q) and wi = 0. Therefore
Remark 3.4.5
Different modifications of the Cox statistic have been introduced The most popular one is Atkinson’s variation defined by
in the literature.
CA, = f {L;CW,)l - L;(b)} - H(d,). It is such that
CA, = C, +
+~(e,)l- L;[&l)
since flT gives the maximum the null.
3.5. 3.5.1.
Application
of L!j_, and it is asymptotically
to the choice of regressors
The estimated pseudo-true
Let us consider
d C,,
two Gaussian
to C, under
in linear models
values
linear models
H, = {Y = X,y + u,
u - N[O, &d]},
H,={Y=X,6+v,
o-NIO,rZId]}.
The matrices
equivalent
X, and X, have respective
(3.10)
sizes (n, K,), (n, K,) and their ranks are K,
C. Gourieroux
and X,. The parameter
is
for H,,
CX=
/J=
and A. Monfort
6 0 T2
for H,.
Let us denote by Pj = Xj(X;Xj)-‘Xi, j = 1,2 the orthogonal column vectors of Xj and Mj = I - Pj. We get
projector
on the
(xix,)-lx;
+Yll’
i
lx;
(xix,)-
(3.11)
$f2Yf12
and (x;x2)-'x;x,?T hT(BT)=
+‘,YJ”+
;i!M2X,f,/i2 i
(x;x,)-'x;P,
Y
(3.12)
~11~,~112+~11~2~~~112 =i
3.5.2.
An interpretation
The difference
between
of the extended
Wald statistic
the two estimators
of the pseudo-true
value b(a,) is
(x;x2)-'x~(y-x,Y*T) fi[bT-bT(6iT)l
= fi
i
It may be proved (see, e.g. Gourieroux and Monfort (1989)) that the second subvector is asymptotically equivalent to a linear combination of the first subvector under the hypothesis H,. Therefore the Wald statistic measures the distance between
Ch. 44: Testing Non-Nested
2607
Hypotheses
zero and
This quantity is an inner product between the residuals of H, and the exogenous variables of H,. Therefore the procedure is asymptotically equivalent to a score test of the hypothesis H, = {S = 0) in the artificial nesting model
Y = x,y +
r?,s+w,
including all the explanatory the usual F test.
3.5.3.
variables
2, of H, not spanned
by those of H,, i.e. to
The Cox statistic
Let us consider the modified (Pesaran (1974))
L;(p)=
version of the Cox statistic
(see Section
3.4.3). We get
-~log2n-~logr’-~,,Y-X26,,~,
L+(cI)= -~log2n-~loga’-Zfr’,,Y-X,y,,2,
EL;(p)= CI
-~log2n-~logr”-~-2j21X,~-X26,,“,
EL;(a)= CI
-~log2n-~logrr”-;.
We deduce that
where ?c = l/T 11 M, Y 11 2 is the maximum likelihood alternative and ?c is the estimated pseudo-true value
f;= ~IllWl12
+ I/M,~,Yl121.
estimator
of 52 under
the
C. Gourieroux and A. MonJivt
2608
Therefore, the Cox statistic is directly given as a simple function of two estimated variances. This kind of result is valid for any Gaussian regression model, including nonlinear regression or multivariate models (after replacing the scalar variances by the determinants of the residual variance-covariance matrices). Fg may be seen as the residual variance of H, expected under H, whereas ff is the actual estimate of r2 under H,. “A positive Cox statistic indicates that the actual performance is better than expected. A significantly positive statistic leads to rejection of H, because H, is performing too well for H, to be regarded as true.” (McAleer (1986)). It may be noted that when the two hypotheses are orthogonal, the Cox statistic e, still has meaning even if the Cox procedure cannot be directly used. More precisely we get, in this special case,
and
z; T
=!,ogLY? 2
IIM,Yl12’
This statistic is a simple function of the variable T 11P, Y l12/)lM2 Y II2 which is asymptotically proportional to a chi-square under H,. Contrary to the general case described in Proposition 3.4.2 the Cox statistic no longer follows a normal distribution.
3.6.
Applications
to qualitative
models
Let us consider a framework including, as a special case, the polytomous logit versus sequential logit example of Section 2.4.3. We assume that the endogenous variable associated with the choice among K alternatives is defined by ytk = 1 if k is chosen for the tth observation and ytk = 0 otherwise. The two competing models are
H,: P(Y,, = 1)= &Ax,,4, H,:P(y,,i= 1)= “h&t,BX
C. Gourieroux
2610
and A. Monfort
and
We obtain,
for instance
[diag g - @j’] diag
: L E 0 h apt- I
where the matrices are evaluated at ~1~and b(cr,), and g (resp x) is the vector whose components are & (resp &). estimators are Similar expressions are obtained for C,, and C,,. Consistent obtained by replacing f by the empirical mean, ~1~by 8, and b(a,) by fiT or bT(Oir.). 4.
Artificial nesting models
The basic idea of artificial nesting methods is to introduce a third hypothesis in which both H, and H, are nested and characterized by some equality constraints.
Examples
4.1.
We describe 4.1 .I.
some general nesting
Quad’s
procedures
procedure (Quandt
The idea is to introduce the compound of H, and H,. This model is
(1974)) model made of mixtures
~4= ((1- Mytlx,; 4 + Wy,/x,; D), The basic hypotheses H,={A=O}
and some specific ones.
of the distributions
k[O, 11, CEA, /MI}.
(4.1)
are defined by the constraints and
H,,={;l=l}.
The procedure consists in testing {A = 0} against {A 3 0} and {A = l} against {A < l}, i.e. in applying a one-sided t-ratio test to the parameters 2 or 2 - 1. It is possible that both hypotheses are not satisfied. In such a case H, and H, will be asymptotically rejected by the testing procedure and it will be possible to get an estimate f, of the parameter 2 significantly different from 0 or 1. In some applications such a value of 2 may be interpreted.
Ch. 44: Testing Non-Nested
2611
Hypotheses
Let us consider the choice between polytomous and nested logit models (see Section 2.4.3). The nesting model may be seen as describing the average choice of a group of individuals, some, a proportion 1 - 2, selecting the alternatives according to the polytomous logit form and the others, a proportion I+, in a sequential way. into these two &- may give an idea of the decomposition of the population subgroups. Remark 4.1 .l Even if such a compound model is attractive, it has the usual drawback of mixtures. Under the null hypothesis H, = {,I = 0}, the parameter fi is not identifiable. Therefore the asymptotic properties of the M.L. estimator of a, fi, 1, obtained by maximizing
maxi logcut=1
MYtlx,;
co+ %y,lx,;
P)l,
and the properties of the t-ratio test are unknown except in special cases. This difficulty is at the origin of several papers whose aim is either to compute directly the distribution of some test statistics under specific formulations of the hypotheses (Pesaran (1981, 1982a)) or to introduce some change of parameters or some identifiability constraints (Fisher and McAleer (1981)). 4.1.2.
Atkinson’s procedure (Atkinson
(1970))
Atkinson proposed a similar procedure in which the compound by considering exponential combinations. The model is
M = i Kdy,lx,; a)’-ih(~,lx,;
model is derived
BY,
where
1 -1
a)‘-“NY/x,; B)“dv,W ,
1~CO,ll,
ad,
(4.2)
This nesting model has two drawbacks. First, as for Quandt’s procedure, p is unidentifiable under H, (and a under W,) and this creates many difficulties in performing the tests (see, e.g. Pesaran (1981)). Second, the model is only meaningful if the function g(y/x,; a)’ -“Q/x,; p)” IS integrable for AE[O, 11, a condition which is not always satisfied. However the identifiability problem may be solved in particular cases.
C. Gourieroux
2612
Example
and A. Monfort
4.1.2
Let us consider
two linear hypotheses 4 - I.I.N(O, l), t = 1,. . ., T},
H,=~Yt=w+ut~ H, = {yt = xd
The distributions to
u, -
+ u,,
I.I.N(O, l), t = 1,. . . , T}.
of the nesting model have p.d.f. l(y,/x,; y, &A) which are proportional
exp (-
+(l - J*)(y, - xi,y)’ - $n(y, - xzt6)2)
exp I-
$CY, -(x1,(1
or to
Therefore normal with introducing no common
- 0~ +
x2J412).
the distributions of the nesting model are such that y, is conditionally a unit variance and a conditional mean obtained by simultaneously the variables of the two initial models. If for instance xi and x2 have component, we may introduce the new parameters
y* = (1 - A)y,
s* = U.
Whereas y, 6, I are not identifiable in the nesting model the transformed parameters y*, S* are. The same is true for the hypotheses H, and H, which are characterized by the constraints H, = {6* = 0}, H, = {y* = 0} which only depend on 6* and y*. This kind of result may directly be extended to multivariate Gaussian models (Pesaran (1982a)). 4.1.3.
Mixtures
of regressions
We have seen that for regression models Atkinson’s procedure is equivalent to considering the regression which is the convex combination of the two initial regressions. This kind of nesting procedure may be directly applied to nonlinear regressions, even in a semi-parametric framework. With the two nonlinear regression models ff, = {E(Y,/x,) =
m(x,;4, =A},
H, = {E(Y,/x,) = Axt; 8, DEB}, we can associate
the nesting
nonlinear
regression
model (4.3)
Ch. 44: Testing Non-Nested
4.1.4.
Box-Cox
The Box-Cox
y(l)
=
Hypotheses
transformation
transformation
Y”- l -,
A#O,
log Y9
A=0
2613
(Box and Cox (1964)) of a positive
variable
y is defined by
A
Y(A) =
(4.4)
This transformation reduces to the logarithmic function for 1= 0, and to a linear function for I = 1 and is often used for nesting linear and logarithmic formulations H, = {log yt = x,a + u,, H, = (y, = x,/3 + u,,
u, - I.I.N(O, o’)},
u, - I.I.N(O, q’)}.
As soon as the regressions model
contain
a constant
term we may consider
the nesting
A
Y;’
= x,y + q,
o, - I.I.N(O, .r2),
&CO, l]
.
The two initial hypotheses are characterized by H, = (2 = 0} and H, = (2 = l}, respectively, and may be tested by the usual t-ratio test (see Zarembka (1974), Chang (1977)) or by applying a t-ratio test after a Taylor expansion of the models around a pre-assigned value (2 = 0) on (2 = 1) (Andrews (1971), Godfrey and Wickens (1981)). In the nesting model it is assumed that the error term u is normally distributed with mean 0 and variance TV. Strictly speaking this assumption is untenable since y,(l) cannot take values less than - l/2. This difficulty can be circumvented by truncating o, in some fashion (Poirier and Ruud (1979)), but as noted by Davidson and McKinnon (1985a) it seems “reasonable to ignore the problem which would occur with small probabilities”. 4.13.
A comprehensive
model for autoregressive
and
moving average representations
In some applications a natural embedding model appears. Let us for instance consider the two hypotheses of an autoregressive and a moving average representation of order one: H, = (y, = py, _ 1 + Ed, Ewhite noise), H, = (y, = qr - Or],_ 1,
q white noise).
C. Gourieroux and A. Moniorort
2614
A comprehensive
model is the ARMA(l,l)
M = {y, - (~~y,_~ = u, - fl,u,_,,
representation
u white noise}.
As noted in Section 2.4.5. the initial models are regression coefficients
WYtlY, -
1)= PYt- 1
E(y,/y,_,)=
-By,_,
-82y,_2-
... -8ky,_k-
...
models with constrained
for
H,,
for
H,,
model is not obtained by where Y Y,-~ ={yt-i, Y,-2,... }, and the comprehensive taking linear combinations of the two previous regression functions since we get E(y,/y,-r)=(cpi 4.2.
-8i)y,_r
+ ... +(cpi -8i)8:-‘y,_k+
...
under
M.
Local expansions of artificial nesting models
In a series of papers Davidson and McKinnon proposed a simple method for solving the identifiability problem appearing in mixtures of regressions. We first discuss the idea in the case of linear regressions and then we describe the general results concerning the so-called J- and P-tests. 4.2.1.
Linear regressions
Let us consider
two linear regressions
with different
sets of regressors
ff, = {Y, = XI~Y +
u,, u, I.I.D, E(u,/x,) = 0, V(u,/x,) = a*, t = 1,. . . , T},
H, = {Y, =
u,,
~24
and the associated M =
+
u,
I.I.D, E(o,/x,) = 0, V(u,/x,) = r2, t = 1,. . . , T},
mixtures
y, = (1 - I)x,,y + 1x2,6 + o,,
o, I.I.D, E(w,/x,) = 0, V(o,/x,) t = 1,. . ., T
To circumvent and McKinnon
= q2,
.
(4.6)
the unidentifiability of parameter 6 under the null H,, Davidson (1981) proposed an approach which is different from the simple
Ch. 44: Testing Non-Nested
2615
Hypotheses
change of parameters given in Example 4.1.2, and which can be extended to more general problems (see Section 4.4). The idea is to replace the nuisance parameter 6 by its estimator under H, instead of using its estimator under M. Therefore the nesting model M is replaced by a pseudo-nesting model M* = {y, = Xlty* +
t=
n*x,,& + o:,
1,. . .) T},
(4.7)
where 8, is the O.L.S. estimator of 6 based on H,. Clearly in this formulation the second kind of regressors X,& depends on Yl,...? y, through the estimator &.; they are partly endogenous and the error terms are correlated. However Davidson and McKinnon (1981) proposed to omit these difficulties and to study directly the asymptotic properties of the t-ratio statistic associated with A*, computed as if & was non-random and the w:‘s were error terms satisfying the usual conditions. The O.L.S. estimator of A* in the regression of y, on xIt,xJT is
1; = {&x;M,x,&} and the associated
T2 =
- l6;x;M,
t-ratio
statistic
(l/fi)&X;M,
(4.8)
is given by
Y (4.9)
sT[(l/T)~~X;M,X,s^,]“”
where rj’, is the usual estimator
Under
Y,
the null hypothesis
H,,
of the variance:
8r tends to the pseudo-true
value
4hJ = C&;x,)l +E(x;x,)y,, fl’, tends to G: the true value of the variance statistic converges to
%P%+JCw$x,)
The numerator
-
E(x;x,)(Ex;x,)-‘E(x;x2)]d(yo))1’2.
of TA is equal to
and the denominator
of the t-ratio
C. Gourieroux
2616
und
It is asymptotically normal under H,, with a zero mean and a variance the square of limit of the denominator:
~;bwb)C~(x;x,)-
A. Mocfort equal to
~(x;x,)(~x;x,)~'~(x;x,)l~(~,)}.
Two cases have to be distinguished: (i) the denominator is asymptotically different from zero making the t-statistic asymptotically standard normal, (ii) the denominator tends to zero making the limit of the T,-statistic undetermined. The denominator tends to zero if either d(y,) = 0, i.e. if x1 and x2 are orthogonal regressors, or if E(x;xJ - E(x~xl)(Ex~xl)-‘E(x~x,) = 0, i.e. if the regressors x2 are linear combinations of the regressors x1. Proposition
4.2.1
If the regressors x1 and x2 are non-orthogonal (Exix, # 0) and if H, is not nested in H,, the t-ratio statistic TL has asymptotically a standard normal distribution under the null. The Davidson-McKinnon test consists in rejecting H, if 1TA1> u1 _-E,2, where E is the asymptotic level and u1 --E,2the 1 - c/2 quantile of the standard normal distribution. In the previous proposition we gave a two-sided version of the test; however, as for the Cox test, it can be seen that the one-sided test whose critical region is {T* > u1 -E} has an asymptotic level equal to E and is consistent (except if x26,, is a linear combination of the components of x1). The previous test has been called the J-test by Davidson and McKinnon. It is also worth noting that there exists an exact version of this test, called the_JA-test (A indicating the Atkinson variant of this test), which is the usual t-test of I = 0 in the regression y = x,7
+ ;iP,P,y
(4.10)
+ 63,
where Pj = I - Mj = Xj(XJXj)-
‘XJ,
j=
1,2.
The difference between this pseudo-nesting model and model M* given in (4.7) is the replacement of X,6, = P,y by P,P,y. Since the right hand side of (4.10) depends on y only through Ply, the t-statistic on 5 has the t distribution with T - K, - 1 degrees of freedom, where K, is the number of columns in X, (see Milliken and Graybill (1970)).
2617
Ch. 44: Testing Non-Nested Hypotheses Nonlinear
4.2.2.
The previous
regressions:
approach
the J-test
may be directly
H, = {Y, = m(x,; Y) + u,, H, = {it = Ax,; 4 + u,,
extended
u, I.I.D, E&/x,)
to nonlinear
= 0, V&/x,)
regressions
= r~‘, t = 1,. . . , T},
u, I.I.D, E(u,/x,) = 0, V(u,/x,) = ~~~ t = 1,. . . , T},
and to the set of mixtures A4 = { y, = (1 - i)m(x,, y) + Ap(xt, 6) + q,
CO,I.I.D, E(w,/x,) = 0, V(o,/x,) = v2, t= l,...,T
This model is replaced
by the pseudo-nesting
.
model
M* = {yt = (1 - A*)m(x,, y*) + A*p(x,, &) + o:,
t = 1,. . . ) T},
(4.11)
where s^, is the nonlinear least squares estimator of 6 under H,. Then we can compute the nonlinear least squares estimator of I*, y* under M* as if 8, was a constant and the t-ratio statistic TA was associated with A*. The following proposition is the analogue of Proposition 4.2.1, and, as in this proposition, the two-sided test can be replaced by a one-sided test. Proposition
4.2.2
Under regularity conditions, the t-ratio TA has, under H,, an asymptotic standard normal distribution. The so-called J-test consists in rejecting H, if 1TAI> u1 -E,2. 4.2.3.
Nonlinear
regressions:
the P-test:
The previous J-test necessitates the determination of nonlinear least squares estimators of some parameters. It is possible to develop a procedure which only uses linear least squares. For this purpose we may consider an expansion of the regression model M* in a neighbourhood of the null hypothesis. More precisely since the true value y0 is unknown, we introduce the expansion around yT, the least squares estimator of y under H,. We have (1 - A*)m(x,, y*) + A*/L(x,, &, + C0:
= (1 - A*) m(x,; j$-) + i
a4x,; w
2%)
(Y* - M
+ ~*P(x,; &) + 0: I
= m(x,; Y^T)+ A*[p(x,; s^,, - m(x,; &)I + am;;:
%) c + cot*,
2618
C. Gourieroux
where c is an unknown
M** =
it
parameter.
Therefore
an “expansion”
and A. Mmfort
of model
M* is
= 4x,; M + ~*CP(X,; $4- 4x,; &)I
am + -(x,; a?!
jqc
+ Cot**, t = 1,. . . ) T
(4.12)
.
Let us now consider the asymptotic properties of the t-ratio statistic Ti for A* in M**, computed as if pr, 8, were deterministic and the error terms CO:* had the usual properties of white noise. Let
h = [4x,; &)I,
P = cl&; &)I>
6 be the matrix of derivatives (am/aY’)(x,; 8,) and I?, the orthogonal projector on the space orthogonal to the space spanned by the columns of d. The t-ratio statistic is T: =
- 4 &i)‘G,(p - rfi)} liZ’
(P - W&dY
tj { (p
-
where q2 is the residual as the statistic
variance.
It has the same asymptotic
distribution
under H,
(P- m)‘Mdy- 4 ~oC(P - m)‘M,(I* - Nl”” wherem = (m(x,;Y~)AP = CP(X,, dbdl~ D = CGWW)(x,;RJI. Using the same arguments as for the linear case, we get the following (given with the two-sided version of the test). Proposition
proposition
4.2.3
Under the condition plim (l/T)(p - m)‘M,(p - m) # 0, the statistic TT has asymptotically a standard normal distribution under the null H,. The P-test consists in rejecting H, if I Tt I > u1 -E,2.
4.3.
A score test based on a modijed
Atkinson’s compound model
We have seen that the artificial nesting model obtained by introducing the convex combinations of two nonlinear regressions was a special case of the Atkinson’s compound model. Moreover the modification of this artificial nesting model
Ch. 44: Testing Non-Nested
2619
Hypotheses
considered by Davidson and McKinnon was mainly intended to solve the identification problem of the a-parameter under the null 2 = 0. Therefore, it is natural to look for a potential extension of the DavidsonMcKinnon procedure to the Atkinson’s compound model. More precisely let us consider this compound model
We may introduce the modified compound model in which the unknown c( and /I are both replaced by the maximum likelihood estimators computed under H, and H, respectively:
ii
Qyt/xt;
=
2) =
parameters oi, and br
dytlxt; 4)’ - Wtlxt; ih)”
s
dulx,; 4 ’ - 'k(u/x,;
i
By analogy with the DavidsonMcKinnon for testing the hypothesis (2 = 0}, computed statistic is
T a
‘=$t:l
i0g&(y,/~,;
aA
0)
1 T
[
approach, we define the score statistic as if bi, and Jr were deterministic. This
a iogfT(y,/x,;0) 2 - 1’2 a2 ’ 11
Tt?l (
(4.13)
What are the properties of this statistic? Let us first consider the numerator. It is easily seen that
alog.L:T(YJx~; 0) ~ = log khlx,; &I - log shlx,; an and that the numerator
a,) -
logg(ulx,;~,)lg(ulx,;b)dv,(u),
s
Clog&lx,;
b-d
is equal to
(4.14) where L,,(flT) and L&&r) are the maxima of the likelihood function under H, and H, respectively. Therefore the numerator has an expression which is analogous to that
C. Gourieroux
2620
and A. Monfort
of the Cox statistic except that the estimated pseudo-true value b(&,) has been directly replaced by a, (such a replacement was initially proposed by Atkinson (1970)). It is easy to check that the numerator is asymptotically equivalent under the null to the numerator ofthe Cox statistic; in particular it has the same asymptotic variance given by
w=
logh-logg,Z
~~/,,(logh-logg)-Cov,, (
1 (4.15)
X [ V~O(~)]~‘cov~O(~Jogh-logg),
where log h, logg and ag/aa are evaluated at (&a,), a,,) and where V,,, Cov,, are the variance and covariance under the null. The denominator of the score statistic computed as if oi,, p^, were deterministic tends to the square root of
which is always larger than the true asymptotic variance W. Therefore this first extension of the Davidson-McKinnon approach to Atkinson’s compound model does not produce a test with the right asymptotic size. More precisely we get the following result. Proposition
4.3.1
Let us consider the score statistic Fs computed and also the corrected score statistic
as if oi, and IT were deterministic,
where @is a consistent
the null
estimator
of W under
(i) 5’ is asymptotically equivalent to the Cox statistic, (ii) 4””gives a procedure with an incorrect asymptotic level, (iii) the critical region based on I?/ is conservative, and the null hypothesis accepted too often.
is
It is worth noting that 4”sis easier to evaluate than 4”. If the use of p leads to the rejection of the null hypothesis, we can conclude (without computing 5s) that the null hypothesis must be rejected.
Ch. 44: Testing Non-Nested
4.4.
The partially
2621
Hypotheses
modijied
Atkinson’s
compound
model
In fact in their approach Davidson and McKinnon only replace the parameter of the alternative model by its maximum likelihood estimator. The same idea may be followed with the Atkinson’s compound model. We now consider the partially modified compound model in which only /3 is replaced by the maximum likelihood estimator p^, computed under H,:
ii*
=
J‘*,(y,/x,;2, a) =
~
~(Y,/x,;
s
4’ - ‘hbtlx,;fl,)”
(4.16)
g(nlx,; a) 1- ‘h(u/x,; &.)” dv,(u)
We then define deterministic:
the score
statistic
for testing
(2 = 0} computed
as if br
was
where
a 10gj: a2
a i0gj; ’ a@
are evaluated at the constrained estimators 2 = 0 and c1= 8,. With this correction, we get the following proposition. Proposition
4.4.1
The score statistic tf model is asymptotically
5.
Comparison
for 2 = 0 deduced from the partially modified equivalent to the Cox statistic under the null.
compound
of testing procedures
In the comparisons of testing procedures we have to distinguish between asymptotic results (such as the equivalence of two procedures under the null hypothesis, or a comparison of their asymptotic relative efficiencies) and finite sample results (such as an exact comparison or an evaluation of the difference between the small sample
C. Gourieroux
2622
and A. Monjiirt
significance level and the nominal level). The asymptotic results are generally derived from a theoretical point of view, while the small sample results are obtained from Monte Carlo studies except for simple models such as Gaussian linear models.
5.1.
Asymptotic
equivalence
of test statistics
We first study if the test statistics considered above are asymptotically equivalent under the null, i.e. if they differ at most by a negligible term in probability o,(l). If this is the case it is known that they have the same asymptotic distribution. We have seen before that the form of these asymptotic distributions may be either chi-square (several degrees of freedom) or univariate normal distributions. Therefore some previously introduced statistics, such as the Wald statistic and the Cox statistic, cannot be equivalent and the equivalence may only be derived for particular pairs of statistics. Proposition
5.1.1
(i) The extended Wald and score statistics null hypothesis
are asymptotically
equivalent
under the
(ii) The Cox statistic and the score statistic based on the partially modified Atkinson’s compound model (and the J- and P-test in the case of regression models) are asymptotically equivalent under the null. (iii) The two Wald statistics <:I and i;r2 are generally not equivalent and they are not equivalent to the Cox statistic (or to the J- and P-tests in the case of regression models). The first part of the proposition has been proved in Gourieroux et al. (1983), the second part is a consequence of Proposition 4.4.1 and has been established by Davidson and McKinnon (1981) for the J- and P-tests. The third part is a consequence of the different asymptotic distributions for the statistics x2(d), x’(d*) and N[O, l] respectively.
5.2.
Asymptotic
comparisons
of power functions
Let us consider a consistent testing procedure for testing H, against H,. This procedure is generally defined by a critical region of the form (5r > cE}, where 5r
Ch. 44:
Testing
Non-Nested
2623
Hppotheses
is the test statistic and c, a critical value chosen to get the right asymptotic
level:
The power function of this test gives the probability of the critical region under the alternative hypothesis H,. It is a function of E, T and the parameter fi: P(E,
(5.1)
T>p) = p,(tT > Cc),
where P, is the probability with respect to the p.d.f. corresponding to p in H,. Since the tests previously defined are consistent, this power function tends to one when T tends to infinity: lim p(~, T, /I) = 1,
VE > 0, /I.
T+Ca
Therefore if (5iT,c1), (5 2T,c2) are two testing procedures associated with critical regions (trT > cle), (cZT > cZE),they cannot be compared by examining the asymptotic value of the power function at fixed arguments E,/I, since the limit equals one for both procedures. But it is possible to study this power function for either a varying p or a varying E. The first approach of a varying j? was introduced by Pitman (1948). The sequence /jT of alternative hypotheses has to be chosen in such a way that the associated distributions h(y/x; /IT) tend to a distribution g(y/x; a,) of the null hypothesis, at a rate ensuring that the limits
lim Pj(h T, PT) = lim T+CC
PpT(<jT
>
CjJ
=
aj(E)
(say),
j=
1,2
T-CC
exist and are different from zero and one. Then the first procedure is said to be asymptotically more powerful than the second if ai > +(E), Vs. ‘The sequence h(y/x; pT) is called a sequence of local alternatives. For non-nested hypotheses this approach of local alternatives can only be used for some specific problems. Indeed it needs partially non-nested hypotheses in order to be able to build the local alternatives and even in such a case the local alternatives can only be defined in some special directions corresponding to the intersection of the two hypotheses. Such an approach has been followed by several authors. Pesaran (1982b) considered the case of two non-nested linear regression models, and proved that the local power of the one degree of freedom tests (Cox test, J- and P-tests) is not exceeded by that of the standard F-test, when the number of explanatory variables is smaller under the null than under the alternative. Dastoor and McAleer (1982) extended the previous results to the case of multiple alternatives and demonstrated that Pesaran’s result depends crucially on the type of local alternatives specified. In general, it is
C. Gourieroux
2624
und A. Monfort
not possible to rank the tests in terms of asymptotic local power. Ericsson (1983) considered Cox-type statistics in the context of instrumental variable estimation and Gill (1983b) considered the general case of parametric hypotheses. In the second approach due to Bahadur (1960) (see also Geweke (1981)) the alternative /I is fixed and the first type error cT tends to zero in such a way that the limits
lim Pj(ET,T, B)= lim Pp(SjT> CjJ= bj(B) (say), j= 1,2 T-m
Td’X
exist and are different from zero and one. The first test is said to be asymptotically more powerful than the second if b,(b) 2 b&?), VP. This approach seems more suitable for non-nested hypotheses but may be difficult to apply. The problem of fixed alternatives has been analyzed in some particular cases by McAleer et al. (1982). Epps et al. (1982) introduced a testing procedure based on empirical moment generating functions and tried to maximize the power for fixed alternatives. Gourieroux (1983) considered testing procedures based on estimated residuals and looked for the optimal form of such test statistics using the Bahadur’s criterion. The results are valid for the choice between non-linear regression models and show that the Wald test, the score test and the J- and P-tests satisfy the optimality condition, while the Cox procedure does not. Pesaran (1984) gives a general survey of both approaches by local and fixed alternatives.
5.3.
Exactfinite
sample results
For some specific examples it is possible to determine the exact forms of the test statistics and of their distributions under the null and alternative hypotheses (or at least an expansion of this exact distribution). The earlier papers were interested in discriminating between some families with common invariant sufficient statistics, in the i.i.d. framework. Dumonceaux et al. (1973) proved that the likelihood ratio does not depend on nuisance parameters for discriminating between normal and Cauchy distributions and between normal and exponential distributions. Dumonceaux and Antle (1973) gave the table of critical values for a test based on likelihood ratio statistics for discriminating between a log-normal and a Weibull distribution (see also Pereira (1977a)); some other authors look for accurate approximations of the finite sample distribution when its explicit expression is not available. For instance Jackson (1968) considered the Cox statistic for choosing between a log-normal model
lexp
g(y;a)= [
Y&2
_(l”gy- @A2
(
2%
>I
Ch. 44: Testing Non-Nested
and an exponential 1 h(y;b)=-exp B
(
Hypotheses
2625
model
I
B)I>
y (
(see also Pereira (1977a), Epps et al. (1982) for a study of this problem originally considered by Cox). When some explanatory variables are introduced into the models, exact results have been essentially derived for the choice between two Gaussian linear models, or for the comparison of two test statistics with closed forms. These results provide some inequalities between test statistics, Fisher and McAleer (1981) consider this problem for Gaussian non-linear regressions, Dastoor (1983a) establishes an inequality between two versions of Cox statistic: when the opposites of the statistics given above are retained, the Atkinson’s version of the Cox statistic is smaller than the Cox statistic itself, and therefore is less likely to reject the null hypothesis. Determination of exact distributions of Fisher type (Dastoor and McAleer (1985)) and the comparison of exact power functions (Dastoor and McAleer (1985)) are also dealt with. A summary of the expressions of the main statistics used in the regression case and of the size and power functions of the associated tests in finite sample is given in Table 2.4 of Fisher and Whistler (1982), for instance.
5.4.
Monte Carlo studies
Monte Carlo studies have been performed for more complicated examples. Dyer (1973, 1974) compared testing procedures which are invariant with respect to location and scale parameters in the i.i.d. case. Pesaran (1982b) and Godfrey and Pesaran (1983) considered the choice between two regression models by the COX statistic or by modified versions of this statistic. They analyzed the effect of the difference between the number of regressors in the two hypotheses, of the non-normality of the error term and of the presence of lagged endogenous variables in the regressions. Davidson and McKinnon (1982) compared various versions of the Cox test with the F-test, the J-test and the JA-test in the linear case. All these results are partial since they concern specific problems and specific values of the parameters, but they give some ideas on the behaviour of the procedures for small T. The main observations are the following ones. (4 The finite sample size of Cox type tests can be much greate; than the nominal level. These tests reveal a tendency to reject the true model too frequently. This effect is also important for the J-test. However it seems possible to incorporate in the test statistics both mean and variance adjustments in order to avoid such an effect (Godfrey and Pesaran (1983)). The simulations by the authors show that these corrections are partially successful. For instance the size of the
2626
(ii) (iii)
(iv)
(v)
(vi)
6.
C. Gourimmx
and A. Monfort
adjusted J-test is smaller than for the unadjusted J-test, but is still higher than the nominal significance level. The comparison of power functions is difficult to interpret since the usual procedures do not have the same finite sample size. The results are often very sensitive to the relative number of regressors in the two hypotheses and significatively depend on the fact that this number is smaller or larger in the null hypothesis than in the alternative one. For instance the power of the J-test is poor in the second case. The JA-test lacks power in several situations: when the number of regressors in the null hypothesis is less than in the alternative or when the true distribution does not belong to either the null or the alternative. The finite sample sizes are not badly distorted when the errors have been assumed to be normal and follow another distribution (log-normal, chi-square, etc). Similarly the ordering of the power functions does not seem to be significatively modified. When the sample size is reasonably large and the variance of the error terms reasonably small, all the tests perform in a satisfactory manner.
Encompassing
6.1.
The encompassing
principle
In the non-nested hypothesis testing procedures that we have described in the previous sections, we assume that the true conditional distribution belongs to one of the hypotheses. This assumption can be considered as a strong one and, therefore, it is interesting to see if it is possible to avoid this assumption. This kind of idea led to a tentative definition of the notation of encompassing (Mizon and Richard (1986), Hendry and Richard (1990)): One model encompasses another results obtained by the latter.
if the former can account
for, or explain,
the
This notion can be used in a modelling strategy, in which we want to propose more and more suitable models. These models have not only to take into account some new interesting phenomena, but they also have to be able to explain previous results derived with the previous models. Theoretically when two or more competing models are considered, it is possible to define a general model in which they are all nested, and to assume that the true distribution belongs to this general model. This is the idea of artificial nesting and historically the first definition of encompassing (see Pesaran and Deaton (1978) Mizon and Richard (198 1) or Hendry and Richard (1982)). However in practice this general nesting model will contain many parameters and will require an amount of
Ch. 44: Testing Non-Nested
2627
Hypotheses
information often larger than that contained in the available data. In fact, there is room for more parsimonious strategies of encompassing, in which we do not have to nest the models at each step in a more general model, nor to assume that a model contains the true distribution (contrary to the axiom of correct specification formalized by Learner (1978)). For developing such a modelling strategy, we have (i) to precisely define the notion of encompassing, (ii) to modify the test procedures in order to take into account the fact that H, may encompass H,, even if neither H, nor H, is true. Since the notion of encompassing is linked with model choice, we introduce different notations in the rest of this section. The two competing models are denoted by
instead of H, and H,. The true conditional assumed a priori to belong to M, or M,. 6.1.1.
distribution
of y given x isf,,
and is not
Pseudo-true values and binding functions
As previously defined by
meutioned,
the pseudo-true
values of the parameters
GINand CI*are
foy(Y/X) rxyO= argminEElog-------OL, X0 Y,(YlK a = arg max E E log g,(y/x; CQ), b, X0 .foy(Y/X) a& = arg min E E log a2 x0 g,(ylx; 4 = arg max E E log g,(y/x; a,), a2 X0 where E is the expectation is the &nditional The proximity
l[fo,,
(6.2)
with respect to the marginal
distributionfoX
of x, and E 0
expectation with respect to the true p.d.f. fo,(y/x). between fo, and models M, or M, is
.foy(Y/X) Mj] = E E log ________ ~ X0
CJj(Ylxi clj*o)’
j=
1,2.
In the same spirit we can, for any curEAT, define the value of Q, denoted
by b,,(a,),
2628
C. Gourieroux
providing
and A. Monfbrt
the p.d.f. of M, which is the closest to gI(Y/x;a,), (6.4)
b,lh) = argmzxt E log g,(ylx; a,), and similarly
The functions b,, and b,, are called binding models M, and M, and not the true distribution. Encompassing
6.1.2.
(Mizon
and Richard
functions.
They only involve
the
(1986))
The distribution of M, (resp M,) associated with CYT~ (resp CX~,)can be seen as the best representative of M, (resp MJ and it seems natural to formalize the notion of encompassing by saying that M, encompasses M, if, acting as if the best representative of M 1 was the true distribution, we find that the closest distribution of M, is the best representative of M,. This means that a& = bzl(aTo). Note that this property depends not only on M, and M, but also on the true p.d.f. fO of (Y>x). Dejinition 6.1 .I (i) fO is such that M 1 encompasses is denoted fO
s.t.
M, if and only if LX&,= b,,(aTJ.
MI&M,.
(ii) fO is such that there is mutual encompassing fO s.t. MI&M, and fO s.t. M,bM,.
6.2. The encompassing 6.2.1.
This condition
tests
The encompassing
hypothesis
We want to define testing procedures H, = {fO = IG,
s.t.
of the null hypothesis
M,bM,}
= b&:,)>.
This null hypothesis
if and only if we have simultaneously
constrains
(6.5) the unknown
p.d.f. fO, and the tests have to be
Ch. 44: Testing Non-Nested
2629
Hypotheses
considered without assuming a priori that foV is in M, or in M,. It is also clear that H, is true if fey belongs to MI. It is natural to consider the test procedures previously introduced for testing non-nested hypotheses and to examine if they can be used for testing the encompassing hypothesis H,. 6.2.2.
Cox likelihood ratio statistic
The Cox approach
s 1,=;
-
Under
statistic:
{logg,(Y,lx,;si,,)-logg,(Y,lx,;6i,,)}
c
f
is based on the following
1
ktfl
&fT
(logg,(Y,lx,;~,.)-logg,(Y,lx,;~,,)}.
the null hypothesis
plim slT = E E T
He, this statistic
Clog g,(Ylx;@To)-
tends to
logg,(Ylx;
@To)1
X0
G,) - logg,(y/x;G,,l.
E E Clog&k
-
x 40
This limit is equal to zero if the true conditional p.d.f. belongs to M,. However it is generally different from zero for the p.d.f. of Ho whose conditional distributions do not belong to M,. This shows that the Cox approach is not appropriate for testing the encompassing hypothesis Ho. Obviously the same conclusion occurs for the J- and P-tests, which are equivalent to the Cox test. 6.2.3.
The Wald encompassing
The difference * ~1~~- b,,(oi,,),
test (WET)
between the two estimators of the pseudo-true value aTo, i.e. tends to zero under the encompassing hypothesis. Under Ho,
fi[B,, - b,,(oi,,)] is asymptotically normal, but its asymptotic variance-covariance matrix is different from the one given in Proposition 3.2.1 which has been computed under the strict sub-hypothesis M, of the encompassing hypothesis Ho. Proposition Under 0,
6.2.1 (see Gourieroux
Ho, fi[oZ
2T - bZ1(ilr)]
= K,-,‘[C,, x [C,,‘CIz
and Monfort converges
- C,,C;;CJK;; - K,,‘?,&‘K,,]K;;,
(1992))
in distribution + K,-,‘[CPIC;/
to N[O, a,], - K&$,,K,‘]C,,
with
Ch. 44: Testing Non-Nested
2631
Hypotheses
which plim 1, = 0.
Therefore
the score statistic
statistic & is asymptotically encompassing hypothesis.
can be used as the basis for an encompassing equivalent
to ,/?K,,[B,,
- b,,(oi,,)],
test. The under
the
Proposition 6.2.2 The SET is based on the statistic c$. = T?Tk,-,‘fi,l?;i&, is ;““, > I:_,, where d is the rank of 0,.
and the critical region
As for the WET, the asymptotic variance-covariance matrix is reevaluated, in comparison with that of the extended score test of Proposition 3.3.1, in order to take into account the fact that the true p.d.f. may satisfy the encompassing hypothesis without belonging to Mr. 6.2.5.
The generalized encompassing test (GET)
The previous Wald and score encompassing tests may be difficult to implement for various reasons and, in particular, because the variance-covariance matrices appearing in the test statistics are, in general, not invertible. This implies that a generalized inverse must be used and that the rank must be estimated. Therefore it is worth looking for simpler tests even if the price to pay is the enlargement of the implicit null hypothesis H, = (CC&, = b,,(aT,)). This null hypothesis has an intersection with M, which is equal to the so-called “reflecting set” R,, = (c~~:cx~ = b21[b12(a2)]}. The tests that are proposed below have an implicit null hypothesis whose intersection with M, is equal to the image M,, of M, by b,,. This implies that, when b,, is injective, these tests are effective only if pz, the size of c(~, is greater than pl, the size of c1r. Proposition 6.2.3 Under
He and if the rank of ab,,/acr,
is pl, the statistic
where zr is a consistent estimator of Z = K&’ Cz2K,;‘, is asymptotically distributed as xz(p2 - pJ. The test consists in rejecting H, if t”, > xf Jp2 - pl), where E is the asymptotic level of the test. This test is called the generalized encompassing test (GET).
2634
C. Gourieroux
and A. Mortfiwt
Davidson, R. and J.G. McKinnon (1985a) “Testing Linear and Loglinear Regressions against Box-Cox Alternatives”, Canadian Journal of Economics, XVIII, 499-517. Davidson, R. and J.G. McKinnon (1985b) “Heteroskedasticity-Robust Tests in Regression Directions”, Annales de I’INSEE, 59160, 183-218. Davidson, R. and J.G. McKinnon (1987) “Implicit Alternatives and the Local Power of Test Statistics”, Econometrica, 55, 1305- 1329. Deaton, AS. (1982) “Model Selection Procedures, or, Does the Consumption Function Exist?” in: C.G. Chow and P. Corsi, eds., Etialuating theReliability ofMacroeconomic Models, New York: Wiley, 43-65. Domowitz, I. and H. White (1982) “Misspecified Models with Dependent Observations”, Journal y/ Econometrics, 20, 35-58. Dufour, J.M. (1989) “Nonlinear Hypotheses, Inequality Restrictions and Non-nested Hypotheses: Exact Simultaneous Tests in Linear Regressions”, Econometrica, 57, 335-356. Dumonceaux, R. and C.E. Antle (1973) “Discrimination Between the Log Normal and the Weibull Distribution”, Technometrics, 15, 923-926. Dumonceaux, R., C.E. Antle and G. Haas (1973) “Likelihood Ratio Test for Discrimination Between Two Models with Unknown Location and Scale Parameters”, Technomerrics, 15, 19-27. Dyer, A.R. (1973) “Discrimination Procedures for Separate Families of Hypotheses”, Journal of the American Statistical Association, 68, 970-974. Dyer, A.R. (1974) “Hypothesis Testing Procedures for Separate Families of Hypotheses”, Journal ofthe American Statistical Association, 69, 140-145. Efron, B. (1983) “Comparing Non-nested Linear Models”, Technical Report 84, Stanford University. Engle, R.F. (1984) “Wald, Likelihood Ratio and Lagrange Multiplier Tests in Econometrics”, in: Z. Griliches and M. Intriligator, Eds., Handbook ofEconometrics, Vol 2, North-Holland: Amsterdam, 776-826. Epps, T.W., K.J. Singleton and L.B. Pulley (1982) “A Test of Separate Families of Distributions Based on the Empirical Moment Generating Function”, Biometrika, 69, 391-399. Ericsson, N.R. (1982) “Testing Non-nested Hypotheses in Systems of Linear Dynamic Economic Relationships”, Ph.D. dissertation, London School of Economics. Ericsson, N.R. (1983) “Asymptotic Properties of Instrumental Variables Statistics for Testing Nonnested Hypotheses”, Review ofEconomic Studies, 50,287-304. Ericsson, N.R. (1991) “Monte Carlo Methodology and the Finite Sample Properties of Instrumental Variables Statistics for Testing Nested and Non Nested Hypotheses”, Econometrica, 59, 1249-1278. Ericsson, N.R. and D.F. Hendry (1989) “Encompassing and Rational Expectations: How Sequential Corroboration Can Imply Refutation”, International Finance, dissertation paper 354, Board of Governors of the Federal Reserve System. Fisher, G.R. (1983) “Tests for Two Separate Regressions”, Journal of Econometrics, 21, 117-132. Fisher, G.R. and M. McAleer (1979) “On the Interpretation of the Cox Test in Econometrics”, Economics Letters, 4, 145-150. Fisher, G.R. and M. McAleer (1981) “Alternative Procedures and Associated Tests of Significance for Non-nested Hypotheses”, Journal qf‘Econometrics, 16, 103-l 19. Fisher, G.R. and D. Whistler (1982) “Tests for Two Separate Regressions”, Institut National de la Statistique et des Etudes Economiques (INSEE), dissertation paper 8210. Geweke, J. (1981) “The Approximate Slopes of Econometric Tests”, Econometrica, 49, 1427-1442. Ghysels, E. and A. Hall (1990) “Testing Non-nested Euler Conditions with Quadrature Based Method of Approximation”, Journal of Econometrics, 46, 273-308. Gill, L. 
(1983a) “Some Non-nested Tests in an Exponential Family of Distributions”, University of Manchester, dissertation paper 129. Gill, L. (1983b) “Local Power Comparisons for Tests of Non-nested Hypotheses”, University of Manchester dissertation paper. Godfrey, L.G. (1983) “Testing Non-nested Models After Estimation by Instrumental Variables or Least Squares”, Econometrica, 51, 355-365. Godfrey, L.G. (1984) “On the Use of Misspecification Checks and Tests of Non-nested Hypotheses in Empirical Econometrics”, Economic Journal, 94, 69-81. Godfrey, L.G. and M.H. Pesaran (1983) “Tests of Non-nested Regression Models: Small Sample Adjustments and Monte Carlo Evidence”, Journal of Econometrics, 21, 133-154.
2636
C. Gourieroux
and A. Monfort
with autocorrelated disturbances: an application to models of U.S. unemployment”, Communications in Statistics, Series A, 19, 3619-44. McFadden, D.L. (1984) “Econometric Analysis of Qualitative Response Models”, in: 2. Griliches and M.D. Intriligator, eds., Handbook of Econometrics, Vol 2, North-Holland: Amsterdam, 1395-1458. McFadden, D.L. (1989) “A Method of Simulated Moments for Estimation of Discrete Response Models without Numerical Integration”, Econometrica, 57, 995-1026. McKinnon, J.G. (1983) “Model Specification Tests Against Non-nested Alternatives”, Econometric Reoiew, 2, 85-l 10. McKinnon, J.G., H. White and R. Davidson (1983) “Tests for Model Specification in the Presence of Alternative Hypotheses: Some Further Results”, Journal ofEconometrics, 21, 53-70. Milliken G.A. and F.A. Graybill (1970) “Extensions of the General Linear Hypothesis Model”, Journal of the American Statistical Association, 65, 797-807. in: D.F. Hendry and K.F. Wallis, Mizon, G.E. (1984) “The Encompassing Approach in Econometrics”, eds., Econometrics and Qualitative Mod&q, Oxford: Basil Blackwell. Mizon, G.E. and J.F. Richard (198 I) “The Structure of some Non-nested Hypothesis Tests”, Southampton University, mimeo. Mizon, G.E. and J.F. Richard (1986) “The Encompassing Principle and its Application to Testing Non-Nested Hypotheses”, Econometrica, 54,657-678. Nakhaeizadeh, G. (1988) “Non Nested New Classical and Keynesian Models: A Comparative Studv in paper. _ the Case of the Federal Republic of Germany”, Karlsruhe University, dissertation Pakes, A. and D. Pollard (1989) “Simulation and the Asymptotics of Optimization Estimators”, Econometrica, 57, 1027-1058. Pereira, B. de B. (1977a) “A Note on the Consistency and on the Finite Sample Comparisons of Some Tests of Separate Families of Hypotheses”, Biometrika, 64, 109-l 13. Pereira, B. de B. (1977b) “Discriminating Among Separate Models: A Bibliography”, International Statistical Review, 45, 1633172. Pesaran, M.H. (1974) “On the General Problem of Model Selection”, Review of Economic Studies, 41, 153-171. Pesaran, M.H. (1981) “Pitfalls ofTesting Non-nested Hypotheses by the Lagrange Multiplier Method”, Journal of Econometrics, 17, 3233331. Pesaran, M.H. (1982a) “On the Comprehensive Method of Testing Non-nested Regression Models”, Journal of Econometrics, 78, 263-274. Pesaran, M.H. (1982b) “Comparison of Local Power of Alternative Tests of Non-nested Regression Models”, Econometrica, 50, 128771305. Pesaran, M.H. (1984) “Asymptotic Power Comparisons of Tests of Separate Parametric Families by Bahadur’s Approach”, Biometrika, 71, 245-252. Pesaran, M.H. (1987) “Global and Partial Non-nested Hypotheses and Asymptotic Local Power”, Econometric Theory, 3, 69-9. Pesaran, M.H. and A.S. Deaton (1978) “Testing Non-nested Nonlinear Regression Models”, Econometrica, 46,6777694. Pesaran, H. and B. Pesaran (1989) “Simulation Approach to the Problem of Computing Cox’s Statistic for Testing Non-nested Models”, dissertation paper presented at the European Meeting of the Econometric Society. Pitman, E.J.G. (1948) “Non-Parametric Statistical Inference”, University of North Carolina, Institute of Statistics, Mimeographed Lecture Notes. Poirier, D.J. and P.A. Ruud (1979) “A Simple Lagrange Multiplier Test for Lognormal Regression”, Economic Letters, 4, 251-255. Quandt, R.E. (1974) “A Comparison of Methods for Testing Non-nested Hypotheses”, Review of Economics and Statistics, 56, 92-99. Ramsey, J.B. (1974) “Classical Model Selection Through Specification Error Tests”, in: P. 
Zarembka, ed., Frontiers of Econometrics, Academic Press: New York, 13-47. Rossi, P.E. (1985) “Comparison of Alternative Functional Forms in Production”, Journal qfEconometrics, 30,345%361. Sargan, J.D. (1964) “Wages and Prices in the United Kingdom: A Study in Econometric Methodology”, in: P.E. Hart, G. Mills and J.K. Whitaker, eds., Econometric Analysisfor National Economic Planning, London: Butterworths, 25-63.
Ch. 44:
Testing
Non-Nested
2631
Hypotheses
Australian Sawyer, K.R. (1980) “The Theory of Econometric Model Selection”, Ph.D. dissertation, National University. Sawyer, K.R. (1983) “Testing Separate Families of Hypotheses: An Information Criterion”, Journal oj the Royal Sratistical Society, Series B, 45, 89-99. Smith, M.A. and G.S. Maddala (1983) “Multiple Model Testing for Non-nested Heteroscedastic Censored Regression Models”, Journal of Econometrics, 21, 71-81. Smith, R.J. (1992) “Non-nested tests for competing models estimated by generalized method of moments”, Econometrica, 60, 973-80. Szroeter, J. (1989) “Efficient Tests of Non-nested Hypotheses”, University College, London. Vuong, GM. (1989) “Likelikood Ratio Tests for Model Selection and Non-nested Hypotheses”, Econometrica,
Walker,
51, 307-334.
A.M. (1967) “Some
Biometrika,
Tests
of Separate
Families
of Hypotheses
in Time
Series Analysis”,
54, 39-68.
White, H. (1982a) “Maximum Likelikood Estimation of Misspecified Models”, Econometrica, 50, l-26. White, H. (1982b) “Regularity Conditions for Cox’s Test of Non-nested Hypotheses”, Journal of Econometrics, 19, 301-318. White, H. and I. Domowitz (1984) “Nonlinear Regression with Dependent Observations”, Econometrica, 52, 143% 162.
Wooldridge, J.M. (1990) “An Encompassing Approach to Conditional Mean Tests with Applications Testing Non-nested Hypotheses”, Journal ofEconomefrics, 45, 331-350. Zabel, J.E. (1992) “A comparison of non-nested tests for misspecified models using the method approximate slopes”, Journal ofEconometrics, forthcoming. Zarembka, P. (1974) “Transformation of Variables in Econometrics”, in: P. Zarembka, ed., Frontiers Econometrics, New York: Academic Press.
to of in
2640
7.4.
Estimating
7.5.
Asymptotic
7.6.
Testing
Part III. 8.
9.
2699 2700
nonstationary,
weakly dependent
Introduction
8.2.
Asymptotic
Asymptotic
Asymptotic
9.2.
Estimating
normality
of an abstract
variance
10.2.
Abstract
limiting
2710
2710
case
2710
results
Introduction
2702
2706 2706
the asymptotic
The nonergodic
estimator
of M-estimators
normality
10.1.
2701
2701
normality
9.1.
case
2701
results
8.1.
General
2697
variance
efficiency
The globally
General
Part IV. 10.
the asymptotic
2710 distribution
result
11. Some results for linear models 12. Applications to nonlinear models Appendix References
2711
2713 2723 2725 2733
Ch. 45: Estimation
und Ir&-mce,for
Dependrnt
Procrsses
2641
Abstract This chapter provides an overview of asymptotic results available for parametric estimators in dynamic models. Three cases are treated: stationary (or essentially stationary) weakly dependent data, weakly dependent data containing deterministic trends, and nonergodic data (or data with stochastic trends). Estimation of asymptotic covariance matrices and computation of the major test statistics are covered. Examples include multivariate least squares estimation of a dynamic conditional mean, quasi-maximum likelihood estimation of a jointly parameterized conditional mean and conditional variance, and generalized method of moments estimation of orthogonality conditions. Some results for linear models with integrated variables are provided, as are some abstract limiting distribution results for nonlinear models with trending data. Part I.
1.
Introduction and overview
Introduction
This chapter discusses estimation and inference in time series contexts. For the most part, estimation techniques that are suitable for cross section applications see Newey and McFadden (this Handbook) - are either directly applicable or applicable after slight modification to time series problems. Just a few examples include least squares, maximum likelihood and method of moments estimation. Complications in the analysis arise due to the dependence and possible trends in time series data. Part II of this chapter covers estimation and inference for the essentially stationary, weakly dependent case. This material comprises the bulk of the chapter and is also the case covered in most of the econometrics literature. The work of Bierens (1981, 1982), Domowitz and White (1982), White (1984), Bates and White (1985), Gallant (1987), Gallant and White (1988) and Potscher and Prucha (199 la, b) contains various catalogues of assumptions that can be used in various estimation settings. Part II synthesizes and extends some of these results, but our emphasis is somewhat different from the earlier work. While we state some formal results with regularity conditions, our focus is on the assumptions that impact on how one performs inference. These assumptions often involve conditional moments and are therefore straightforward to interpret. The approach in Part II of this chapter is most similar to the book by White (1993). White analyzes quasi-maximum likelihood estimation for heterogeneous (but essentially stationary), weakly dependent processes under possible model misspecification. His results are very general and technically sophisticated. Here, by
2642
J.M.
Wooldridgr
restricting ourselves to models where the primary feature of interest is correctly specified, and by focusing on weak rather than strong consistency of the estimators, we obtain results with simple regularity conditions that are nevertheless applicable in a variety of contexts. We also make some further simplifying assumptions, such as assuming that moment matrices settle down to some limit. The hope is that, after seeing a stripped-down analysis that de-emphasizes the role of regularity conditions, the reader can then tackle the more advanced treatments referenced above. At the same time the inference procedures offered here are fairly general. When the data are trending the standard uniform law of large numbers approach cannot be used to establish consistency and to find the limiting distribution of optimization estimators. Nevertheless, if the data are weakly dependent one generally expects the estimation techniques useful in the essentially stationary case to still have good properties in the trending, weakly dependent case. Part III of this chapter draws on the work of Crowder (1976), Heijmans and Magnus (1986), Wooldridge (1986) and others to establish the consistency and asymptotic normality (when properly scaled) of a general class of optimization estimators for globally nonstationary processes that are weakly dependent. These results can be applied to M-estimation and method of moments estimation. An important consequence of Part III is that the common practice of performing inference in trending, weakly dependent contexts exactly as if the stochastic process is essentially stationary and weakly dependent is justified quite generally. The last part of this chapter, Part IV, covers limiting distribution results when the process (or at least the score of the objective function) is not weakly dependent. The case when at least some elements of the underlying stochastic process are integrated of order one, or 1(l), is of particular interest and has received a lot of attention recently [just a few references include Phillips (1986, 1987, 1988), Phillips and Durlauf (1986), Park and Phillips (1988, 1989), Phillips and Hansen (1990) and Sims et al. (1990)]. Most of the specific work on nonergodic processes has been in the context of linear models. Some abstract results are available for nonlinear models, for example Basawa and Scott (1983), Domowitz (1985), Wooldridge (1986), Jeganathan (1988) and Domowitz‘ and Muus (1988). In Part IV we present a modification of a result in Wooldridge (1986) that applies immediately to linear models; we also given an example of a nonlinear application with nonergodic data. A few remaining features of this chapter are worth drawing attention to at this point. First, we do not discuss the interpretation of the parameters in dynamic models beyond assuming that they index a conditional distribution, conditional expectation or conditional variance. Models that are expressed in terms of underlying innovations are most easily handled by expressing them in terms of a conditional expectation or conditional distribution in observable variables. Second, although the subsequent results apply to linear models, most of the conditions are explicitly set out for nonlinear models (the exception is Section 11). Unsurprisingly, these are more restrictive than the conditions needed to analyze
linear models. While it is possible to relax the assumptions for application to linear models, we do not do that here. Finally, a warning about the notation. We have tried to limit conflicts, but some notation conflicts are unavoidable. It is best to view notation as being "local" in nature: the same symbol can be used to represent different quantities in different sections. Hopefully this will not cause confusion.

2. Examples of stochastic processes
The classification of the results in Parts II, III, and IV relies heavily on the notions of essential stationarity and weak dependence. It would take us too far afield to define and analyze the many kinds of dependence concepts (such as various mixing and near epoch dependence conditions) that have been used recently in the time series econometrics literature for generally heterogeneous processes. For the purposes of estimation and inference, what is most important are the implications of these concepts for limiting distribution theory, and this section provides an informal discussion primarily from this perspective. As we see below, this turns out to be imperfect; nevertheless, it strips a complicated literature down to its essentials for the asymptotic analysis of estimators in time series settings. Formal definitions of the types of stochastic processes discussed below are provided in Rosenblatt (1978), Hall and Heyde (1980), Gallant and White (1988), Andrews (1988) and Pötscher and Prucha (1991a, b). Let {x_t: t = 1, 2, ...} be a scalar stochastic process defined on the positive integers [for definitions of a stochastic process and other basic time series concepts, see Brillinger (1981)]. For our purposes the idea that {x_t} is essentially stationary is best captured by the (minimal) assumption that E(x_t²) is uniformly bounded. An immediate implication of essential stationarity is that the variance of the partial sum,
σ_T² = Var( Σ_{t=1}^T x_t ),  (2.1)

is well-defined. If, in addition,

σ_T² = O(T),  (2.2)

σ_T^{-2} = O(T^{-1}),  (2.3)

and

σ_T^{-1} Σ_{t=1}^T (x_t − E(x_t)) →d Normal(0, 1),  (2.4)
then we say that {x_t} is weakly dependent. Condition (2.2) implies that the variance of the partial sum is bounded above by a multiple of T; it rules out highly dependent processes with positive autocorrelations that do not die out to zero sufficiently quickly. Condition (2.3) implies that the variance of the partial sum is bounded below by a (positive) multiple of T; among other things it rules out processes with strong negative serial correlation. Condition (2.4) states that {x_t} satisfies the central limit theorem (CLT). These definitions of essential stationarity and weak dependence are not without their glitches. First, there are many strictly stationary sequences that have an infinite second moment; by the above convention, such processes are not essentially stationary. Actually, defining essential stationarity to rule out these cases serves a purpose, because stationary processes with an infinite second moment do not satisfy the CLT. Because we do not deal with such applications in this chapter it seems easiest to exclude them in the definition of essential stationarity. Second, there are processes exhibiting very little temporal dependence that nevertheless violate (2.3). The leading example is x_t ≡ e_t − e_{t−1}, where {e_t: t = 0, 1, 2, ...} is an i.i.d. sequence with finite second moment. Then σ_T²/T → 0 even though x_t and x_{t+j} are independent for j ≥ 2. The problem again is that such a sequence does not satisfy the CLT, so we rule it out in the definition of weak dependence. One might argue that, as long as we are assuming essential stationarity - which we sometimes refer to as local nonstationarity or bounded heterogeneity - we might as well simplify things further and restrict attention to the strictly stationary case. This argument has some merit because the inference procedures are identical whether we assume strict stationarity or allow for bounded heterogeneity, at least for correctly specified models. Nevertheless, it is important to know that asymptotic results are available for the heterogeneous case so we can handle processes with deterministic seasonality, structural breaks, and other forms of temporal heterogeneity. In addition, as will be seen in Section 4, even if the underlying stochastic process is strictly stationary the sequence of objective functions defining the estimation problem need not be. As we will see in Section 4.4, it is the gradient or score of the objective function evaluated at the "true" parameters that should satisfy the CLT in applications with essentially stationary, weakly dependent data. For many problems this follows from the essential stationarity and weak dependence of the underlying process, along with some additional moment conditions. Still, it is not always true that the score is weakly dependent if the underlying stochastic process is, especially when the objective function depends on a growing number of lags of the data. A simple example is nonlinear least squares (NLS) estimation of an MA(1) model (moving average of order one) when a white noise series has been overdifferenced: even though the first difference of a white noise series is stationary and weakly dependent, the score of the NLS objective function evaluated at the true parameter is not. See, for example, Quah and Wooldridge (1988). The other case is also possible: the underlying stochastic process could have an infinite second moment, or be
strongly dependent, but the score of the objective function could be essentially
stationary and weakly dependent. The point of the previous paragraph is that the terms "essentially stationary" and "weak dependence" really apply to a particular function of the underlying process, namely the score of the objective function. For many applications this distinction is irrelevant. But, when in doubt, it is the score that should be studied for weak dependence properties. There has been much work on establishing primitive conditions under which stochastic processes satisfy the CLT. Most of the early work on limiting distribution theory - which focused on maximum likelihood estimation - relied heavily on the central limit theorem for martingale difference sequences. This is because the score of the conditional log-likelihood is a martingale difference sequence under correct dynamic specification (more on this in Section 5). Roussas (1972) analyzed the MLE for strictly stationary, ergodic data and employed the CLT for strictly stationary martingale differences. McLeish (1974) proved CLTs for martingale difference sequences that are not strictly stationary; see also Hall and Heyde (1980). These results allowed for substantial heterogeneity in the underlying stochastic process, and they were used in the work of Bhat (1974), Basawa et al. (1976) and Crowder (1976). Recent work in the econometrics literature covers a broader class of estimators and allows for dynamic misspecification. For many problems with misspecified dynamics the martingale CLT cannot be applied. Thus, the econometric work on limit theory for estimation with essentially stationary, weakly dependent processes has relied on various mixing conditions available in the theoretical time series literature. Under certain moment and mixing assumptions the process {x_t} satisfies the central limit theorem. In the strictly stationary case, Rosenblatt (1956) proves a CLT for α-mixing (strong mixing) sequences and Billingsley (1968) proves results for φ-mixing (uniform mixing) sequences and functions of φ-mixing sequences; see also Rosenblatt (1978) and Hall and Heyde (1980). McLeish (1975) extended Billingsley's results to allow for bounded heterogeneity. Wooldridge and White (1989) and Davidson (1992) have proven CLTs for "near epoch dependent" (NED) functions of underlying mixing sequences. Among other things this allows for infinite moving averages in an underlying mixing sequence. It might be helpful at this point to give an example of a weakly dependent process. Let {e_t: t ∈ Z} be an independent, identically distributed (i.i.d.) sequence with σ_e² = E(e_t²) < ∞, E(e_t) = 0. Let {φ_j: j = 0, 1, 2, ...} be a sequence of real constants such that
Σ_{j=0}^∞ |φ_j| < ∞.  (2.5)
Then we can define a process {x_t: t = 1, 2, ...} by

x_t = Σ_{j=0}^∞ φ_j e_{t−j},  t = 1, 2, ...  (2.6)
(in the sense that Σ_{j=0}^∞ φ_j e_{t−j} exists almost surely). Provided that Σ_{j=0}^∞ φ_j ≠ 0, (2.3) holds and it follows by Anderson (1971, Theorem 7.7.8) that

σ_T^{-1} Σ_{t=1}^T x_t →d Normal(0, 1),  (2.7)

where
σ_T²/T → σ_e² ( Σ_{j=0}^∞ φ_j )².  (2.8)

[See Hall and Heyde (1980, Corollary 5.2) for a weaker set of conditions.] The summability condition on {φ_j: j = 1, 2, ...} ensures (2.2); it allows for much more dependence than a stable autoregressive moving average (ARMA) process with i.i.d. innovations, but (2.5) does imply that φ_j → 0 as j → ∞ at a sufficiently fast rate. We can easily allow for bounded heterogeneity by changing the assumption about the underlying sequence {e_t}. Now assume that the e_t are independent, nonidentically distributed, or i.n.i.d., with E(|e_t|^{2+δ}) bounded for some δ > 0. Then Fuller (1976, Theorem 6.3.4) implies that (2.4) holds. This covers heterogeneous ARMA models with independent innovations. Also, if we allow x_t to have a time-varying mean μ_t then {x_t − μ_t} satisfies the CLT. When we relax the requirement that E(x_t²) is uniformly bounded we arrive at the notion of a globally nonstationary process. Even though such processes are growing or shrinking over time, it is entirely possible for them to satisfy the CLT (2.4). [Condition (2.2) no longer holds, but this is not a problem.] As a simple example of a globally nonstationary but weakly dependent process, define

x_t = t u_t,  t = 1, 2, ...  (2.9)
where {u_t: t = 1, 2, ...} is a weakly dependent series with E(u_t²) uniformly bounded and E(u_t) = 0, t = 1, 2, ... (for example, {u_t} could be i.i.d.). Note that E(x_t²) = O(t²) and σ_T² = O(T³). Nevertheless, under general conditions, (2.4) holds. [See, for example, Wooldridge and White (1989) and Davidson (1992).] There are several examples of processes, including ones that are strictly stationary and ergodic, that are not weakly dependent. Robinson (1991b) calls such processes strongly dependent. A general class of strongly dependent processes is given by (2.6) where the coefficients {φ_j} are square summable,

Σ_{j=0}^∞ φ_j² < ∞.  (2.10)
Even though such a process is covariance stationary, without further restrictions on {φ_j} the variance of the partial sum can be of order larger than T, so (2.2) does not hold. Examples are the long memory or fractionally integrated processes with degree of integration between zero and one half; see, for example, Brockwell and Davis (1991). Little is known about the asymptotic distribution of estimators from general nonlinear problems when the underlying sequence is strongly dependent [for some recent results for a simple model, see Sowell (1988)]. The results in Part II or Part IV may be applicable, but this remains an important topic for future research. The term nonergodic is reserved for those processes that exhibit such strong dependence that they do not satisfy the law of large numbers. A popular example of a nonergodic process is

x_t = x_{t−1} + e_t,  t = 1, 2, ...,  (2.11)
where {e_t: t = 1, 2, ...} is an i.i.d. sequence and x_0 is a given random variable. For illustration, assume that E(e_t²) < ∞ and E(e_t) = 0. Even under these assumptions the first moment of x_t need not exist. If we add E(x_0²) < ∞ and x_0 is uncorrelated with all e_t, it is easy to see that Var(x_t) = O(t). Also, E(x_t) = E(x_0) for all t, so the mean of x_t is constant over time when it exists. Still, the process {x_t} does not return to its mean with any regularity (it is nonergodic), and the sample average x̄_T will not converge in probability or in any other meaningful sense to E(x_t). The work of Phillips (1986, 1987) has sparked a recent interest in asymptotic theory with general integrated processes, of which (2.11) is a special case. A general integrated of order one, or I(1), process can be written as x_t = α + x_{t−1} + u_t, where {u_t} is an essentially stationary, weakly dependent zero mean process with

lim_{T→∞} Var( T^{-1/2} Σ_{t=1}^T u_t ) > 0.  (2.12)
[Condition (2.12) ensures that the process {Δx_t ≡ x_t − x_{t−1}} has not been overdifferenced.] When α ≠ 0 the process is said to be I(1) with drift; otherwise it is I(1) without drift. Before turning to Part II we should emphasize that the partitioning of the results into Parts II, III, and IV is determined by the limiting distribution theory. In particular, a separate consistency result is given in Part II that does not require the process or any function of it to be weakly dependent. In the strictly stationary case only ergodicity of the underlying process and a moment condition are needed for consistency, so that it applies to strongly dependent processes. However, the asymptotic normality results rely on weak dependence of the score. Parts III and IV do not contain separate consistency results; consistency is proven along with the limiting distribution result.
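The contrast between weak dependence and nonergodicity is easy to see in simulation. The following sketch is ours, not the chapter's; the AR(1) coefficient, sample size and replication count are hypothetical choices. For the weakly dependent AR(1), the standardized partial sum behaves like a standard normal draw, as in (2.4); for the random walk (2.11), the variance of the sample average does not shrink with the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
T, R, rho = 500, 1000, 0.5       # hypothetical sample size, replications, AR(1) coefficient

zstats, rw_means = [], []
for _ in range(R):
    e = rng.standard_normal(T)
    # weakly dependent case: x_t = rho x_{t-1} + e_t
    x = np.empty(T)
    x[0] = e[0]
    for t in range(1, T):
        x[t] = rho * x[t - 1] + e[t]
    # sigma_T^2 = Var(sum x_t) is approximately T/(1 - rho)^2, so (2.2)-(2.4) hold
    zstats.append(x.sum() / (np.sqrt(T) / (1.0 - rho)))
    # nonergodic case: the sample average of a random walk, as in (2.11)
    rw_means.append(np.cumsum(e).mean())

print("AR(1): mean, var of standardized sum:", np.mean(zstats), np.var(zstats))  # near (0, 1)
print("random walk: var of sample average:", np.var(rw_means))  # O(T), does not shrink
```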
3. Types of estimation techniques
The approach to estimation in this chapter is what Goldberger (1968) and Manski (1988) have termed the analogy principle. To apply the analogy principle one must know the population problem that the parameters of interest solve in order to construct a sample counterpart. This is where basic probability theory, and especially properties of conditional expectations and conditional distributions, play a key role. Often population parameters can be shown to solve a minimization or maximization problem, which then leads to the class of optimization estimators discussed in this chapter. To show how the analogy principle is applied, we consider the example of nonlinear least squares estimation. Suppose that {(x_t, y_t): t = 1, 2, ...} is a stochastic process, where y_t is a scalar and x_t ∈ X_t is a vector whose dimension may depend on t. Allowing the number of conditioning variables x_t to grow with t allows for cases such as x_t = (y_{t−1}, y_{t−2}, ..., y_1) or x_t = (z_t, y_{t−1}, z_{t−1}, ..., y_1, z_1), where z_t is a 1 × J vector of conditioning variables, as well as for static regression models with x_t = z_t. Suppose that E|y_t| < ∞ for all t and that interest lies in the conditional expectation E(y_t|x_t). A parametric model of this conditional expectation is {m_t(x_t, θ): x_t ∈ X_t, θ ∈ Θ ⊂ R^P} (Θ is the parameter set). The model is correctly specified if for some θ_0 ∈ Θ

E(y_t|x_t) = m_t(x_t, θ_0),  t = 1, 2, ....  (3.1)
To see how to estimate θ_0, we rely on a well-known fact from probability theory. Namely, if E(y_t²) < ∞, then μ_t(x_t) ≡ E(y_t|x_t) is the best mean square error predictor of y_t. In other words, for any other function g_t(x_t) such that E[g_t(x_t)²] < ∞,

E[(y_t − μ_t(x_t))²] ≤ E[(y_t − g_t(x_t))²].  (3.2)

It follows that if the parametric model m_t(x_t, θ) is correctly specified then

E[(y_t − m_t(x_t, θ_0))²] ≤ E[(y_t − m_t(x_t, θ))²] for all θ ∈ Θ.  (3.3)

This suggests estimating θ_0 by solving the sample problem

min_{θ∈Θ} Σ_{t=1}^T (y_t − m_t(x_t, θ))²,  (3.4)
which leads to the nonlinear least squares estimator. Closely tied to the analogy principle is the concept of Fisher consistency. An estimation procedure is Fisher consistent if the parameters of interest solve the population analog of the estimation problem. Inequality (3.3) shows that least squares is Fisher consistent for estimating the parameters of a conditional mean. Using the Kullback-Leibler information inequality, we show in Section 5 that
maximum likelihood is Fisher consistent for estimating the parameters of a correctly specified conditional density, regardless of what the conditioning variables x_t are. Many other estimation procedures, including multivariate weighted nonlinear least squares, least absolute deviations, quasi-maximum likelihood, and generalized method of moments, are all Fisher consistent for certain features of conditional or unconditional distributions. We cover several of these examples in Sections 5, 6 and 7.
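To fix ideas, here is a minimal sketch (ours, not part of the chapter) of the analogy principle in action: data are generated from a hypothetical exponential regression function, and θ_0 is estimated by minimizing the sample analog (3.4) of the population mean squared error.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
T = 1000
z = rng.standard_normal(T)
theta0 = np.array([1.0, 0.5])                       # hypothetical true parameters
# correctly specified conditional mean: E(y|z) = exp(theta_1 + theta_2 z), as in (3.1)
y = np.exp(theta0[0] + theta0[1] * z) + rng.standard_normal(T)

def sample_mse(theta):
    # sample counterpart of E[(y_t - m_t(x_t, theta))^2], the objective in (3.4)
    return np.mean((y - np.exp(theta[0] + theta[1] * z)) ** 2)

theta_hat = minimize(sample_mse, x0=np.array([0.5, 0.0]), method="BFGS").x
print(theta_hat)   # near theta0: inequality (3.3) makes NLS Fisher consistent
```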
Part II. The essentially stationary, weakly dependent case

4. Asymptotic properties of M-estimators

4.1. Introduction
In this section we study the consistency and asymptotic normality of a class of estimators known as M-estimators (which stands for "maximum likelihood-like" estimators), a term introduced by Huber (1967) in the context of i.i.d. observations. The class of M-estimators includes the maximum likelihood estimator, the quasi-maximum likelihood estimator, multivariate nonlinear least squares and many other estimators used by econometricians. The terminology adopted here is not universal. Pötscher and Prucha (1991a, b) refer to a more general class of optimization estimators as M-estimators. Burguete et al. (1982) and Pötscher and Prucha (1991a, b) call the estimators studied in this section least mean distance estimators. The parameter space Θ is a subset of R^P and θ denotes a generic P × 1 vector contained in Θ. We have a sequence of random variables {w_t: t = 1, 2, ...}. Denote the range of w_t by W_t, where W_t is a subset of a finite dimensional Euclidean space whose dimension may depend on t. The objective function for M-estimation is a sample average:

T^{-1} Σ_{t=1}^T q_t(w_t, θ),  (4.1)
where q_t: W_t × Θ → R. There are a few different situations that warrant special attention. The first is when {w_t: t = 1, 2, ...} is a sequence of strictly stationary (hereafter, simply "stationary") M × 1 random vectors - whereby W_t can be taken to be a subset W of R^M for all t = 1, 2, ... - and there exists a time-invariant function q: W × Θ → R such that q_t(w_t, θ) = q(w_t, θ). Then each summand in (4.1) depends on t only through the observation w_t. An important consequence of this setup is that {q(w_t, θ)} is stationary for each θ ∈ Θ, and this facilitates application of laws of large numbers and central limit theorems.
Another case of interest, which requires notably more technical work, is when the dimension of W_t grows with t. This can happen when one is interested in getting the dynamics of a model for a conditional mean or a conditional distribution correctly specified. For example, suppose that for a scalar sequence {y_t} and a vector sequence {z_t ∈ R^K} one is interested in E(y_t|z_t, y_{t−1}, z_{t−1}, ..., y_1, z_1). Let m_t(x_t, θ) be a model for this conditional expectation, where x_t = (z_t, y_{t−1}, z_{t−1}, ..., y_1, z_1). If E(y_t|x_t) depends on all past lags of y and z then the model m_t(x_t, θ) should reflect this. Thus, letting w_t = (y_t, z_t, y_{t−1}, ..., y_1, z_1), nonlinear least squares estimation of θ_0 such that E(y_t|x_t) = m_t(x_t, θ_0) would take q_t(w_t, θ) = (y_t − m_t(x_t, θ))². Note that even if {(y_t, z_t)} is stationary, {m_t(x_t, θ)} is not if E(y_t|z_t, y_{t−1}, z_{t−1}, ...) depends on all past lags - such as in finite order moving average models. If E(y_t|z_t, y_{t−1}, z_{t−1}, ...) depends on a finite number of lags of y and z - such as in finite order autoregressive models - then we are essentially in the stationary case described above. Heterogeneity in {q_t(w_t, θ)} can also arise when interest lies in a model relating an observable sequence to an unobservable sequence. For example, let {e_t: t = 0, 1, 2, ...} be an i.i.d. sequence with E(e_t²) < ∞ and E(e_t) = 0, and consider an MA(1) model for observable y_t:

y_t = e_t + θ_0 e_{t−1},  t = 1, 2, ...  (4.2)
where |θ_0| < 1. One can study estimation of θ_0 in the previous framework by finding the regression function E(y_t|y_{t−1}, ..., y_1) (a tractable calculation in this simple example). In practice one often sees a different approach used: set the time zero residual equal to zero and then build up the residual function recursively. This leads to the pseudo-regression function

m_t(x_t, θ) ≡ m_t(y_{t−1}, y_{t−2}, ..., y_1, θ) = − Σ_{j=1}^{t−1} (−θ)^j y_{t−j}.  (4.3)
This is not a true regression function because m_t(x_t, θ_0) ≠ E(y_t|x_t) = E(y_t|y_{t−1}, ..., y_1). Nevertheless, because of the invertibility assumption |θ_0| < 1, E|E(y_t|y_{t−1}, ..., y_1) − m_t(y_{t−1}, ..., y_1, θ_0)| → 0 as t → ∞ at the rate |θ_0|^t. This is enough to consistently estimate θ_0 by nonlinear least squares. Once again the technical complications arise because, even though the observable data are stationary, the sequence of summands in the objective function is not. Yet a different approach that avoids both approximation arguments and complicated expectations calculations is to ensure that (4.3) is a true regression function by changing the assumption about how y_t is generated. If we assume that e_0 ≡ 0, then E(y_t|y_{t−1}, ..., y_1) is given by (4.3) with θ = θ_0. Now {y_t: t = 1, 2, ...} as well as {q_t(w_t, θ)} is a heterogeneous sequence. Of course heterogeneity in {q_t(w_t, θ)} also arises with fixed dimensional w_t if {w_t} constitutes a heterogeneously distributed sequence, as in Domowitz and White (1982), White and Domowitz (1984) and Bates and White (1985).
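To make the recursive construction concrete, the following sketch (ours) computes the residuals implied by the pseudo-regression function (4.3) - equivalently, by setting the time zero residual to zero and iterating e_t = y_t − θe_{t−1} - and minimizes their average square over a grid; the simulated series and the true value θ_0 = 0.4 are hypothetical.

```python
import numpy as np

def ma1_residuals(y, theta):
    # residuals from the MA(1) recursion with zero startup: e_0 = 0, e_t = y_t - theta e_{t-1};
    # y_t - theta e_{t-1} equals y_t - m_t(y_{t-1}, ..., y_1, theta) with m_t as in (4.3)
    e = np.zeros(len(y) + 1)
    for t in range(1, len(y) + 1):
        e[t] = y[t - 1] - theta * e[t - 1]
    return e[1:]

rng = np.random.default_rng(2)
eps = rng.standard_normal(501)
y = eps[1:] + 0.4 * eps[:-1]            # hypothetical invertible MA(1), theta_0 = 0.4

grid = np.linspace(-0.95, 0.95, 191)
sse = [np.mean(ma1_residuals(y, th) ** 2) for th in grid]
print(grid[int(np.argmin(sse))])        # close to 0.4 in large samples
```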
4.2. Consistency
It is convenient to begin with two definitions.
Definition 4.1
A sequence of random variables {z_t: t = 1, 2, ...} satisfies the weak law of large numbers (WLLN) if
(i) E[|z_t|] < ∞, t = 1, 2, ...;
(ii) lim_{T→∞} T^{-1} Σ_{t=1}^T E(z_t) exists;
(iii) T^{-1} Σ_{t=1}^T {z_t − E(z_t)} →p 0.

Condition (ii) of this definition is not needed for what follows, but it entails little loss of generality and simplifies the statement of conditions.
Definition 4.2
Let Θ ⊂ R^P, let {w_t: t = 1, 2, ...} be a sequence of random vectors with w_t ∈ W_t, t = 1, 2, ..., and let {q_t: W_t × Θ → R, t = 1, 2, ...} be a sequence of real-valued functions. Assume that
(i) Θ is compact;
(ii) q_t satisfies the standard measurability and continuity conditions on W_t × Θ, t = 1, 2, ... (see Definition A.2 in the Appendix);
(iii) E[|q_t(w_t, θ)|] < ∞ for all θ ∈ Θ, t = 1, 2, ...;
(iv) lim_{T→∞} T^{-1} Σ_{t=1}^T E[q_t(w_t, θ)] exists for all θ ∈ Θ;
(v) max_{θ∈Θ} |T^{-1} Σ_{t=1}^T {q_t(w_t, θ) − E[q_t(w_t, θ)]}| →p 0.
Then {q_t(w_t, θ)} is said to satisfy the uniform weak law of large numbers (UWLLN) on Θ. When applied to vector or matrix functions this definition applies element by element. Condition (iv) is an inconsequential but convenient simplification. When {w_t ∈ W ⊂ R^M} is stationary and ergodic, conditions sufficient for the UWLLN to hold are relatively straightforward. The following result is due to Ranga Rao (1962); see also Hansen (1982).

Theorem 4.1. UWLLN for the stationary ergodic case
Let Θ ⊂ R^P, let {w_t ∈ W: t = 1, 2, ...} be a sequence of stationary and ergodic M × 1 random vectors and let q: W × Θ → R be a real-valued function. Assume that
(i) Θ is compact;
(ii) q satisfies the standard measurability and continuity requirements on W × Θ;
(iii) for some function b: W → R⁺ with E[b(w_t)] < ∞, |q(w, θ)| ≤ b(w) for all θ ∈ Θ.
Then {q(w_t, θ)} satisfies the UWLLN on Θ.
The proof of Theorem 4.1 - which is very similar to the i.i.d. case (see Newey and McFadden (Lemma 2.4)) - is driven by the fact that if {w_t} is stationary and ergodic then so is any time-invariant function of it. Note carefully that we have only assumed ergodicity here; as discussed in Section 2 this allows for fairly strong forms of dependence. If we relax the stationarity assumption then the conditions for the UWLLN are notably more complicated, especially since we are allowing the dimension of w_t to grow. Here we follow Andrews (1987) and impose some smoothness on the objective function. Then, as in Newey (1991a), a pointwise WLLN can be turned into a UWLLN. The following result is a corollary of Newey (1991a, Corollary 3.1). A proof that does not rely on the notion of stochastic equicontinuity - see Andrews (this Handbook) and Newey and McFadden (Section 2.8) - is given in the Appendix.
Theorem 4.2. UWLLN for the heterogeneous case
Let Θ, {w_t: t = 1, 2, ...}, and {q_t: W_t × Θ → R: t = 1, 2, ...} be as in Definition 4.2. Assume that
(i) Θ is compact;
(ii) q_t satisfies the standard measurability and continuity requirements on W_t × Θ, t = 1, 2, ...;
(iii) for each θ ∈ Θ, {q_t(w_t, θ): t = 1, 2, ...} satisfies the WLLN;
(iv) there exists a function c_t(w_t) ≥ 0 such that
(a) for all θ_1, θ_2 ∈ Θ, |q_t(w_t, θ_1) − q_t(w_t, θ_2)| ≤ c_t(w_t)||θ_1 − θ_2||;
(b) {c_t(w_t)} satisfies the WLLN.
Then {q_t(w_t, θ)} satisfies the UWLLN on Θ. (For proof see Appendix.)
If q_t(w_t, ·) is continuously differentiable on an open, convex set C containing Θ, then the natural choice for c_t(w_t) is

c_t(w_t) = sup_{θ∈C} ||∇_θ q_t(w_t, θ)||,  (4.4)
provided it satisfies the WLLN. To see why this choice satisfies (iv)(a), simply use a mean value expansion of q_t about θ. Because most time series applications involve smooth objective functions, the difficulty in applying Theorem 4.2 usually lies in verifying that {q_t(w_t, θ)} and {c_t(w_t)} satisfy the WLLN for any θ ∈ Θ. These WLLN requirements restrict the dependence analogous to the ergodicity assumption for the stationary case. If the dimension of w_t is fixed, and {w_t} is an α- or φ-mixing process with mixing coefficients declining at an appropriate rate, verification of (iii) and (iv)(b) is straightforward because q_t(w_t, θ) inherits its mixing properties from {w_t}. See, for example, Domowitz and White (1982), White (1984), White and Domowitz (1984) and Pötscher and Prucha (1989, 1991a). McLeish (1975) introduced strong laws of large numbers that can be applied when q_t(w_t, θ) depends on an increasing number of lags of a mixing process. See Gallant and White (1988), Hansen (1991a), Pötscher and Prucha (1991a) and White (1993) for further discussion of strong laws. Because our focus is on weak consistency, the general WLLNs of Andrews (1988) are especially relevant here; they can be used to verify (iii) and (iv)(b) under satisfyingly weak assumptions, including conditions that allow for strongly dependent heterogeneous processes (although when applied to the stationary case, the conditions are always more restrictive than ergodicity). Before stating a formal consistency result for M-estimators, it is useful to allow for the presence of some estimated "nuisance" parameters. Let γ̂_T denote an R × 1 vector estimator such that plim γ̂_T = γ* for some γ* ∈ Γ ⊂ R^R. The (two-step) M-estimator θ̂_T solves
min_{θ∈Θ} Σ_{t=1}^T q_t(w_t, θ; γ̂_T),  (4.5)

where q_t is now defined on W_t × Θ × Γ.

Theorem 4.3. Weak consistency of M-estimators
Let Θ ⊂ R^P, Γ ⊂ R^R, let {w_t ∈ W_t: t = 1, 2, ...} be a sequence of random vectors, and let {q_t: W_t × Θ × Γ → R: t = 1, 2, ...} be the sequence of objective functions. Assume that
M.1: (i) Θ and Γ are compact; (ii) γ̂_T →p γ* ∈ Γ; (iii) q_t satisfies the standard measurability and continuity requirements on W_t × Θ × Γ, t = 1, 2, ...;
M.2: {q_t(w_t, θ; γ): t = 1, 2, ...} satisfies the UWLLN on Θ × Γ;
M.3: θ_0 is the unique minimizer of

q̄(θ; γ*) ≡ lim_{T→∞} T^{-1} Σ_{t=1}^T E[q_t(w_t, θ; γ*)]

on Θ. Then a random vector θ̂_T exists that solves (4.5) and θ̂_T →p θ_0. (For proof see Appendix.)
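A simple instance of the two-step problem (4.5) may help at this point. In the sketch below (ours; the linear model and variance function are hypothetical), the nuisance parameter γ scales a conditional variance function: the first step estimates γ from preliminary residuals, and the second step minimizes the weighted sum of squares with q_t(w_t, θ; γ̂_T) = (y_t − θx_t)²/(γ̂_T x_t²).

```python
import numpy as np

rng = np.random.default_rng(4)
T = 2000
x = np.abs(rng.standard_normal(T)) + 0.5
theta0, gamma0 = 2.0, 1.5                        # hypothetical DGP values
# E(y|x) = theta0 x and Var(y|x) = gamma0 x^2
y = theta0 * x + np.sqrt(gamma0) * x * rng.standard_normal(T)

# step 1: preliminary unweighted estimate of theta, then the nuisance parameter
theta_prelim = np.sum(x * y) / np.sum(x ** 2)
gamma_hat = np.mean((y - theta_prelim * x) ** 2 / x ** 2)   # plim gamma_hat = gamma0

# step 2: minimize T^{-1} sum_t (y_t - theta x_t)^2 / (gamma_hat x_t^2), as in (4.5);
# this weighted least squares problem has a closed form solution
w = 1.0 / (gamma_hat * x ** 2)
theta_hat = np.sum(w * x * y) / np.sum(w * x ** 2)
print(theta_prelim, theta_hat)                   # both consistent for theta0
```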
Under the assumptions of Theorem 4.3 it turns out that the limit function q̄(θ; γ*) is necessarily continuous on Θ, so that it achieves its minimum on Θ by compactness. In the stationary case without nuisance parameters, q̄(θ) = E[q(w_t, θ)] for all t, so it suffices to concentrate on a single observation when verifying the identification Assumption M.3. Even in the heterogeneous case, for most applications, θ_0 minimizes E[q_t(w_t, θ)] over Θ for each t (more on this in Sections 5 and 6). Verifying that θ_0 is the unique minimizer of q̄ in either the stationary or heterogeneous case often
requires knowing something about the distribution of the conditioning variables, and so identification is often taken on faith unless there are reasons to believe it might fail. Newey and McFadden (Section 2.2) give three examples of how to verify identification in examples with identically distributed data. There are situations - such as the stationary MA(1) example when the startup value is set to zero - where θ_0 will not minimize E[q_t(w_t, θ)] on Θ for any t. Nevertheless, especially in cases where the error of approximation dies off at an exponential rate, it can usually be verified that θ_0 minimizes lim_{T→∞} T^{-1} Σ_{t=1}^T E[q_t(w_t, θ)]. This is true for the invertible MA(1) example. Bierens (1981, 1982) provides several illustrations. In some cases with nuisance parameters the identification condition holds only for a particular value of γ, say γ = γ_0 = γ*, where γ_0 also indexes some feature of the distribution of w_t; this is generally the case for two-step maximum likelihood procedures under correct specification of the conditional density. In other cases the identification condition holds for any γ ∈ Γ, and so the preliminary estimator γ̂_T could come from a misspecified estimation problem. For example, as we will see later, γ̂_T could be parameter estimates in a misspecified conditional variance function in the context of weighted nonlinear least squares. The misspecification of the variance function does not prevent consistent estimation of the conditional mean parameters θ_0, and so we expect the identification condition for the conditional mean parameters to hold for an arbitrary element of Γ. Theorem 4.3 restricts attention to objective functions continuous in θ (and γ), and this rules out certain estimators that have been suggested primarily in the cross section econometrics literature. A leading example is Manski's (1975) maximum score estimator. It turns out that Theorem 4.3 can be extended without much difficulty to cover certain discontinuous objective functions (although measurability of the estimator becomes an issue). Wooldridge and White (1985) present a result that applies to the maximum score and related estimators for mixing processes. Newey and McFadden (Section 2.7.1) provide a careful discussion of the relevant issues for i.i.d. observations; these same issues are relevant for identically distributed dependent observations.
4.3. Asymptotic normality

We first define what it means for an essentially stationary, weakly dependent vector process to satisfy the central limit theorem.

Definition 4.3
Let {s_t: t = 1, 2, ...} be a P × 1 random vector sequence. Then {s_t} satisfies the central limit theorem (CLT) if
(i) E(s_t′s_t) < ∞, t = 1, 2, ...;
(ii) T^{-1/2} Σ_{t=1}^T E(s_t) → 0 as T → ∞;
(iii) T^{-1/2} Σ_{t=1}^T s_t →d Normal(0, B), where

B = lim_{T→∞} Var( T^{-1/2} Σ_{t=1}^T s_t ).  (4.6)
Condition (i) ensures that Var(s_t) exists for all t. In the cases we focus on, E(s_t) = 0 for all t, so (ii) is trivially satisfied. Still, there are cases where only the weaker condition (ii) holds, so we allow for it in the definition. Implicit in (iii) is that the limit in (4.6) exists. We could relax this assumption, as in Gallant and White (1988), Pötscher and Prucha (1991b) and White (1993), but this does not affect estimation or inference procedures. We refer the reader to Pötscher and Prucha (1991b) for a current list of central limit theorems available in the stationary and heterogeneous cases under weak dependence; see also the references in Section 2.
Asymptotic
normality
of M-estimators
Let 0, r, {w,: t = 1,2,. .}, and {qr: Wt x 0 x r + R: t = 1,2,. . .} be as in Theorem 4.3. In addition to M.l-M.3, assume e0 is interior to 0; M.4: (9 (ii) y* is interior to r;
M.5:
M.6:
(iii) JT(y*, - Y*) = G,(l); and second order (iv) for each ~~r,q, satisfies the standard measurability differentiability conditions on YY’“~ x 0 (see Definition A.4 in the Appendix). Define the P x 1 score vector s,(& y) = s,(w,, 8; y) = V,q,(w,, 0; y)’ and the P x P Hessian matrix h,(8; y) = V&B; y) = Viq,(w,, 8; y). differentiable on int(r). (v) For each OE 0, s,(Q; .) is continuously (h,(B; y): t = 1,2,. .} satisfies the UWLLN on 0 x r; (i) (ii) A,=limr._jo T-lC,T_l E[h,(8,; ?*)I is positive definite; (iii) {V,s,(&y): t = 1,2,. . .} satisfies the UWLLN on 0 x r; {st(Q,; y*): t = 1,2,. . .} satisfies the CLT with positive definite asymptotic variance B, E lim Var T+CC
M.7:
T- ‘j2 r$l s,(@,; Y*) . (
E[V,s,(B,; y*)] = 0, t = 1,2,.
Then ,,/‘?(e,
(4.7)
..
- 0,) % Normal(0, A; ’ BOA ; ‘), so that
Avar fi(e,
- Q,) = A ; 1B, A ; I.
(4.8)
(For proof see Appendix.)

This general result can be used to establish the asymptotic normality of the M-estimator in both the stationary and heterogeneous cases. The UWLLN assumptions
M.5(i) and M.5(iii) differ depending on the nature of {w_t} and {q_t(w_t, θ)}. As mentioned in Section 2, the CLT is applied to the score of the objective function evaluated at (θ_0, γ*), s_t^0 ≡ s_t(θ_0; γ*). Due to the scaling of the partial sum by T^{-1/2}, Theorem 4.4 is restricted to cases where Var(s_t^0) is bounded; this rules out most examples with trending data since, generally, trending data implies a trending score. Assumption M.6 also implies that Var(Σ_{t=1}^T s_t^0) grows linearly with T, and this restricts {s_t^0} to be a weakly dependent process. As discussed in Section 2, this is not necessarily the same as the underlying process {w_t} being weakly dependent. Often the weak dependence of the score is established by exploiting weak dependence properties of {w_t}, but this is not always the case. For example, for a broad class of estimation problems {s_t^0} is a martingale difference sequence (MDS) - see Sections 5 and 6 - in which case it satisfies a CLT under some additional assumptions. While the properties of {w_t} play a role in verifying these additional assumptions, it is sometimes possible to establish M.6 without imposing mixing or related conditions on {w_t}. Another consequence of M.6 is
Tm1’2,$l E[s,(B,;y*)]+O In most cases, including E[s,(B,;y*)]=o,
as T+m.
but not restricted t= 1,2,...
(4.9)
to stationarity,
the stronger
condition (4.10)
holds. The invertible MA(1) example when the startup values are set to zero is an example of where (4.9) holds but (4.10) does not. For the rest of this discussion we assume that (4.10) holds, but it should be kept in mind that the weaker assumption (4.9) is sufficient. Often it is possible to establish (4.10) directly from the structure of q_t(θ). Still, it is useful to know that it holds under the following additional assumptions. First, suppose that θ_0 minimizes E[q_t(θ; γ*)] on Θ for all t. Then, because θ_0 ∈ int(Θ), it follows that if E[q_t(θ; γ*)] is differentiable, then θ_0 satisfies

∇_θ E[q_t(θ; γ*)]|_{θ=θ_0} = 0.  (4.11)

Second, if the derivative and expectations operator can be interchanged (which is the case quite generally), then (4.11) implies E[∇_θ q_t(θ; γ*)]|_{θ=θ_0} = 0, which is simply (4.10). Notice that this is where θ_0 is assumed to be in the interior of its parameter space. Technically, interiority is needed for a mean value expansion of the score about θ_0. It is also easy to devise examples where √T(θ̂_T − θ_0) has a nonnormal limiting distribution because θ_0 is on the boundary of Θ; see, for example, Newey and McFadden (Section 3.1).
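For readers who want the expansion written out, the following is the standard argument in the chapter's notation (we abstract here from the estimated nuisance parameter, which M.7 renders innocuous):

```latex
\[
0 \;=\; T^{-1/2}\sum_{t=1}^{T} s_t(\hat{\theta}_T;\gamma^{*})
  \;=\; T^{-1/2}\sum_{t=1}^{T} s_t(\theta_0;\gamma^{*})
  \;+\; \Bigl[\,T^{-1}\sum_{t=1}^{T} h_t(\bar{\theta}_T;\gamma^{*})\Bigr]
        \sqrt{T}\,(\hat{\theta}_T-\theta_0),
\]
% the first equality uses interiority of theta_0 (the first order condition holds
% with probability approaching one) and \bar{\theta}_T lies between \hat{\theta}_T
% and theta_0; the bracketed average converges to A_0 by M.5, giving
\[
\sqrt{T}\,(\hat{\theta}_T-\theta_0)
  \;=\; -A_0^{-1}\,T^{-1/2}\sum_{t=1}^{T} s_t(\theta_0;\gamma^{*}) \;+\; o_p(1).
\]
```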
Another limitation of Theorem 4.4 is that the objective function is assumed to be twice continuously differentiable in θ with positive definite expected Hessian at θ_0. For most time series applications with essentially stationary, weakly dependent data, M.4(iv) holds. Nevertheless, it rules out some procedures that have recently become more popular in the econometric analysis of time series data. The leading example is least absolute deviations (LAD) estimation of a smooth conditional median function. For simplicity, suppose we have stationary data {(x_t, y_t)}. The objective function for the LAD estimator for observation t is q(w_t, θ) = |y_t − m(x_t, θ)|, where m(x_t, θ) is the hypothesized conditional median function. Under the assumption that Med(y_t|x_t) = m(x_t, θ_0) for some θ_0 ∈ Θ, it can be shown [for example, Manski (1988), White (1993), Newey and McFadden (Section 2.7.1)] that θ_0 minimizes E[q(w_t, θ)] on Θ, so that LAD is Fisher consistent for the parameters of a conditional median. Theorem 4.3 applies for weak consistency because the LAD objective function is continuous in θ. The problem with applying Theorem 4.4 for asymptotic normality is that q(w_t, θ) is not twice continuously differentiable on int(Θ). Nevertheless, under reasonable assumptions it is still possible to obtain a first order representation of the form

√T(θ̂_T − θ_0) = −A_0^{-1} T^{-1/2} Σ_{t=1}^T s_t(θ_0) + o_p(1),  (4.12)

where s_t(θ) plays the role of the score and A_0 is the derivative of the expected score E[s_t(θ)] evaluated at θ_0. The key insight in obtaining representations such as (4.12) for nonsmooth problems is that often E[s_t(θ)] is smooth in θ even though s_t(θ) is not; this works with dependent data just as with independent data. We refer the reader to the general treatment in Newey and McFadden (this Handbook, Section 7). Also, Wooldridge (1986) shows how the results of Huber (1967) extend to the case of mixing observations. Bloomfield and Steiger (1983), Wooldridge (1986) and Weiss (1991) study the asymptotic properties of LAD for dependent observations. For regular problems with essentially stationary, weakly dependent data, Assumptions M.1–M.6 can be viewed as regularity conditions. On the other hand, Assumption M.7 plays a key role in how one goes about conducting inference about θ_0. Namely,
M.7 guarantees that the asymptotic distribution of √T(γ̂_T − γ*) does not affect that of √T(θ̂_T − θ_0). (Of course, if there are no nuisance parameters, M.7 is automatically satisfied.) In particular,

Avar(θ̂_T) = A_0^{-1} B_0 A_0^{-1}/T  (4.13)

is the same as if γ* were known rather than estimated. See Newey and McFadden (Section 6) for an insightful analysis of two-step estimation procedures with finite dimensional Γ. Theorem 4.4 also assumes that Γ is finite dimensional, and so it cannot be applied to semiparametric estimation problems. For semiparametric problems θ̂_T
is still a P × 1 vector but Γ is an infinite dimensional function space. For a certain class of semiparametric problems, including adaptive estimation [see Pötscher and Prucha (1986), Robinson (1987), Andrews (1989), White and Stinchcombe (1991) and Steigerwald (1992)], the limiting distribution of √T(θ̂_T − θ_0) is still given by (4.8) provided γ̂_T is consistent at a fast enough rate in a suitable norm (often the required rate is T^{1/4} and a suitable norm is an L_2 norm). The conditions in Theorem 4.4, notably those concerning the gradients with respect to γ, must be modified to cover semiparametric problems. The notion of stochastic equicontinuity can replace differentiability assumptions. Andrews (this Handbook) gives a general discussion of empirical process methods; Newey (1991b) and Newey and McFadden (Section 8) show how to perform asymptotic analysis with certain kinds of infinite dimensional nuisance parameters and i.i.d. observations.
4.4. Adjustment for nuisance parameters
There are many problems for which M.7 fails, for example two-step estimators for factor autoregressive conditional heteroskedasticity (ARCH) models [see Lin (1992)]. In such cases one must adjust the asymptotic variance of θ̂_T to account for estimation of γ*. First define the P × R matrix

F_0 ≡ lim_{T→∞} T^{-1} Σ_{t=1}^T E[∇_γ s_t(θ_0; γ*)]  (4.14)
(this exists by M.5(iii)). In place of M.7 we now assume that γ̂_T has the first order (or influence function) representation

√T(γ̂_T − γ*) = T^{-1/2} Σ_{t=1}^T r_t(γ*) + o_p(1),
where r_t(γ) is an R × 1 vector with E[r_t(γ*)] = 0. The vector r_t(γ) in general depends on unknown parameters other than γ, but these are suppressed for simplicity. Thus, γ̂_T could itself be an M-estimator or, as we will see in Section 7, a generalized method of moments estimator. The mean value expansion underlying the proof of Theorem 4.4 (under M.1–M.6) now gives
√T(θ̂_T − θ_0) = −A_0^{-1} T^{-1/2} Σ_{t=1}^T {s_t(θ_0; γ*) + F_0 r_t(γ*)} + o_p(1).  (4.15)

If the P × 1 process {u_t(θ_0; γ*) ≡ s_t(θ_0; γ*) + F_0 r_t(γ*)} satisfies the CLT with asymptotic variance D_0 ≡ lim_{T→∞} Var(T^{-1/2} Σ_{t=1}^T u_t(θ_0; γ*)),
then

Avar √T(θ̂_T − θ_0) = A_0^{-1} D_0 A_0^{-1}.  (4.16)

Note that the form of D_0 is similar in structure to B_0. In the next section, we discuss how to estimate such matrices under different assumptions.
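Computationally, the adjustment amounts to appending F̂r̂_t to each score before forming the middle matrix of the sandwich. A schematic sketch (ours; the arrays are hypothetical inputs, and serial uncorrelatedness of u_t is assumed):

```python
import numpy as np

def adjusted_avar(s, r, F, A):
    """Estimate of Avar(theta_hat) based on (4.16).
    s: T x P array of scores s_t evaluated at the estimates
    r: T x R array of first-step influence function values r_t
    F: P x R estimate of F_0 in (4.14); A: P x P estimate of A_0
    Assumes {u_t} is serially uncorrelated; otherwise replace D with a
    serial-correlation-robust estimator (see Section 4.5)."""
    T = s.shape[0]
    u = s + r @ F.T              # adjusted score u_t = s_t + F r_t
    D = (u.T @ u) / T            # estimate of D_0
    Ainv = np.linalg.inv(A)
    return Ainv @ D @ Ainv / T
```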
4.5. Estimating the asymptotic variance

We first consider estimating the asymptotic variance of √T(θ̂_T − θ_0) under Assumptions M.1–M.7, so that the asymptotic variance of √T(γ̂_T − γ*) does not affect that of √T(θ̂_T − θ_0) (see (4.8)). In constructing valid estimators, we make use of a third matrix,

J_0 ≡ lim_{T→∞} T^{-1} Σ_{t=1}^T E[s_t(θ_0; γ*) s_t(θ_0; γ*)′],  (4.17)
which is essentially the average variance of the scores {s_t(θ_0; γ*)}. Estimation of A_0 is no different from cross section analysis with independent observations. A consistent estimator of A_0 that is always available under the conditions of Theorem 4.4 is the average of the Hessians evaluated at the estimates,

T^{-1} Σ_{t=1}^T h_t(θ̂_T; γ̂_T).  (4.18)
As a practical matter, an analytical formula for h_t(θ; γ) is not needed to implement this formula; numerical second derivatives can be used to approximate (4.18). Still, this estimator is often more difficult to compute than necessary, and it is not guaranteed to be even positive semi-definite for a particular sample (although it is positive definite with probability approaching one). Sometimes more structure is available that allows a different estimator. Suppose we can partition w_t as w_t = (x_t, y_t), where x_t is a random vector with possibly growing dimension and y_t is G × 1 (interest lies in estimating some aspect of the conditional distribution of y_t given x_t). Define

a_t(x_t, θ_0; γ*) ≡ E[h_t(w_t, θ_0; γ*)|x_t].  (4.19)

By the law of iterated expectations,

E[a_t(x_t, θ_0; γ*)] = E[h_t(w_t, θ_0; γ*)],  (4.20)
so that A_0 = lim_{T→∞} T^{-1} Σ_{t=1}^T E[a_t(θ_0; γ*)]. Appendix Lemma A.1 and the UWLLN generally imply that

T^{-1} Σ_{t=1}^T a_t(θ̂_T; γ̂_T) →p A_0.  (4.21)
This estimator is useful in the context of nonlinear least squares, maximum likelihood, and for several quasi-maximum likelihood applications. In some leading cases, including the examples in Section 6, a_t(x_t, θ_0; γ*) depends only on the first derivatives of the conditional mean and variance functions. Estimation of B_0 is generally more difficult due to the time series dependence in the data. But in some important cases it is no more difficult than estimating A_0. In the simplest case, B_0 = A_0, so a separate estimator of B_0 is not needed. Two additional assumptions imply that B_0 = A_0. The first is that the score (evaluated at (θ_0; γ*)) is serially uncorrelated.
Assumption M.8
For t = 1, 2, ..., E[s_t(θ_0; γ*) s_{t+j}(θ_0; γ*)′] = 0, j ≥ 1.
(In cases where E[s_t(θ_0; γ*)] is not identically zero, but (4.9) holds, M.8 might only be approximately true as t → ∞; the following conclusions still hold provided the approximation error dies out fast enough with t.) The score is always serially uncorrelated if s_t^0 is independent of s_{t+j}^0, j = 1, 2, ..., which is usually assumed in cross section applications but rarely in time series settings. Of course independence is not nearly necessary for M.8 to hold. When time series models are dynamically complete - in ways to be defined precisely in later sections - s_t^0 ≡ s_t(θ_0; γ*) often satisfies the stronger requirement of being a martingale difference sequence with respect to the information sets generated by {w_1, ..., w_t}. Even outside the context of maximum likelihood estimation (MLE), in many practical cases of interest an extension of the information matrix equality holds for each observation t. This is stated as follows.
Assumption M.9
E[s_t(θ_0; γ*) s_t(θ_0; γ*)′] = E[h_t(θ_0; γ*)],  t = 1, 2, ....
In other words, the variance of the score for observation t is equal to the expected value of the Hessian for observation t. (Note that the latter quantity is at least positive semi-definite because we are analyzing a minimization problem.) Assumption M.9 immediately implies that A_0 = J_0. In addition to MLE, we will encounter other cases where M.9 is satisfied, among these multivariate weighted nonlinear least squares when the conditional mean and conditional variance are both correctly specified. In virtually every case that M.9 holds it makes sense to replace γ* by γ_0, to indicate that γ indexes some feature of the distribution of w_t that is correctly specified. It is important to see that conditions M.8 and M.9 are logically distinct and must be examined separately. In Section 6 we cover cases where the score s_t(θ_0; γ*) is serially uncorrelated - so that M.8 holds - but M.9 does not hold; in Sections 5 and 6 we give examples where M.9 holds but M.8 does not. The usefulness of imposing both M.8 and M.9 is seen in a simple lemma.

Lemma 4.1
Under Assumptions M.1–M.9, B_0 = J_0 = A_0, and therefore

Avar(θ̂_T) = A_0^{-1}/T.  (4.22)
This shows that the asymptotic variance of θ̂_T can be estimated as Â_T^{-1}/T under M.1–M.9, where Â_T is the consistent estimator of the expected Hessian given by either (4.18) or (4.21). Alternatively, we can use an estimator of J_0 to obtain an estimate of Avar(θ̂_T). A consistent estimator of J_0 under M.1–M.6 (and the regularity condition that {s_t(θ; γ)s_t(θ; γ)′} satisfies the UWLLN) is

Ĵ_T = T^{-1} Σ_{t=1}^T ŝ_t ŝ_t′,  (4.23)
where ŝ_t ≡ s_t(θ̂_T; γ̂_T). Under Assumption M.9, Ĵ_T is also a consistent estimator of A_0. Therefore, under M.1–M.9, Avar(θ̂_T) can be estimated by Ĵ_T^{-1}/T. This outer product of the score estimator is usually attributed to Berndt, Hall, Hall and Hausman (1974) (BHHH) in the context of MLE. Practically speaking, Assumption M.9 without M.8 does not afford much simplification for estimating asymptotic variances because of the potential serial correlation in the score. On the other hand, M.8, even in the absence of M.9, means that covariance matrices can be estimated by sample averages. This follows from the following obvious lemma.

Lemma 4.2
Under M.1–M.8, B_0 = J_0, and so Avar(θ̂_T) = A_0^{-1} J_0 A_0^{-1}/T.
Thus, to estimate Avar(θ̂_T) under M.1–M.8, take Â_T to be one of the estimators (4.18) or (4.21) and take Ĵ_T as in (4.23). Then a consistent estimator of Avar(θ̂_T) is

Â_T^{-1} Ĵ_T Â_T^{-1}/T.  (4.24)

This estimator was suggested by White (1982) in the context of maximum likelihood estimation of misspecified models with i.i.d. data. The formula Avar √T(θ̂_T − θ_0) = A_0^{-1} J_0 A_0^{-1} appeared in Huber (1967) in his analysis of M-estimators for i.i.d. data. The estimator (4.24) has since been suggested by Domowitz and White (1982), Hsieh (1983), White and Domowitz (1984), White (1993) and others in a variety of contexts when the score is serially uncorrelated. When we relax M.8 as well as M.9, the general form of B_0 in (4.7) must be estimated. Again let s_t^0 ≡ s_t(θ_0; γ*) and first suppose it is known that

E(s_t^0 s_{t+j}^{0′}) = 0,  j > L,  (4.25)

where L is a known integer. A sensible estimator of B_0 is
B̂_T = Ê_0 + Σ_{j=1}^L (Ê_j + Ê_j′),  (4.26)

where

Ê_j = T^{-1} Σ_{t=1}^{T−j} ŝ_t ŝ_{t+j}′,  j = 0, 1, ..., L.  (4.27)
adjust(Sometimes T - ’ in (4.27) is replaced by (T - P)- ’ as a degrees-of-freedom ment.) Hansen and Hodrick (1980) proposed this estimator in the context of linear rational expectations models. Hansen (1982) and Hansen and Singleton (1982) proposed it in the generalized method of moments (GMM) framework; see also White (1984). A simple application of the uniform weak law of large numbers to js,(U; y)s,+ j(& 7)‘) and Lemma A.1 shows that T-j
plim Ê_j = lim_{T→∞} T^{-1} Σ_{t=1}^{T−j} E(s_t^0 s_{t+j}^{0′}),  (4.28)
and so B̂_T is consistent for B_0 under general conditions. When (4.26) applies, the choice of L typically depends on the frequency of the data. A potential drawback of (4.26) is that it need not be positive definite. Remedies for this are discussed below. Several econometricians have recently looked at the problem of estimating B_0 consistently in the general case where the autocovariances of {s_t^0} are known only to die out at a polynomial rate as j gets large. For motivational purposes, assume that {s_t^0} is covariance stationary; the estimators that follow are valid in the heterogeneous case under mixing or near epoch dependence conditions, as in White (1984), Newey and West (1987), Gallant and White (1988), Hansen (1992a), Andrews (1991), Pötscher and Prucha (1991b) and Andrews and Monahan (1992). Under covariance stationarity of the score, B_0 = E_0 + Σ_{j=1}^∞ (E_j + E_j′), where E_j ≡ E(s_t^0 s_{t+j}^{0′}).
If we truncate the infinite sum at L_T, but let L_T tend to infinity with T, then

E_0 + Σ_{j=1}^{L_T} (E_j + E_j′) → B_0.  (4.29)
This suggests estimating B_0 exactly as when s_t^0 is known to have zero autocovariances after a certain point, with the technical distinction that the truncation lag L_T should grow with T to ensure consistency for general autocovariance structures. But L_T cannot grow too quickly or else too many autocovariances are estimated for a given sample size. Rather than using (4.26), as in White (1984), in some cases it is useful to weight the autocovariances to ensure a positive semi-definite (p.s.d.) matrix. Thus, consider the estimator
B̂_T = Ê_0 + Σ_{j=1}^{L_T} ω(j, T)(Ê_j + Ê_j′),  (4.30)
where the weights ω(j, T) are chosen to ensure that B̂_T is p.s.d. Note that if (4.25) holds and L_T ≥ L, then requiring ω(j, T) → 1 as T → ∞, j = 1, 2, ..., L, ensures that B̂_T is consistent for B_0. Possibilities for ω(j, T) are abundant in the time series literature on spectral density estimation because, in the covariance stationary case, B_0 is proportional to the spectral density matrix of the process {s_t^0} evaluated at frequency zero; in econometrics, B_0 is often called the long run variance of {s_t^0} [for example, Phillips (1988)]. For a list and discussion of weights see Anderson (1971, Chapter 9), Bloomfield (1976, Chapter 8), Gallant (1987, Chapter 7) and Andrews (1991). The Bartlett weights were suggested by Newey and West (1987) and are also studied in Gallant (1987) and Gallant and White (1988). They are given by
ω(j, T) = 1 − j/(L_T + 1),  j = 1, 2, ..., L_T,  (4.31)
ω(j, T) = 0,  j > L_T.
[The Bartlett weights are not among the most popular weights in the spectral density estimation literature; see, for example, Bloomfield (1976, p. 164).] Note that ω(j, T) → 1 as T → ∞ for each j provided L_T grows with T. Newey and West (1987) demonstrate that with this choice of weights, B̂_T is p.s.d. For applications, Gallant (1987) recommends the Parzen weights, given by

ω(j, T) = 1 − 6(j/L_T)² + 6(j/L_T)³,  j ≤ L_T/2,
ω(j, T) = 2(1 − j/L_T)³,  L_T/2 < j ≤ L_T.  (4.32)
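A compact implementation of the weighted estimator (4.30) with either set of weights is sketched below (ours; the input s is a hypothetical T × P array of estimated scores, already approximately mean zero under (4.10)):

```python
import numpy as np

def bartlett(j, L):
    # Bartlett weights (4.31)
    return 1.0 - j / (L + 1.0) if j <= L else 0.0

def parzen(j, L):
    # Parzen weights (4.32)
    a = j / L
    if a <= 0.5:
        return 1.0 - 6.0 * a ** 2 + 6.0 * a ** 3
    return 2.0 * (1.0 - a) ** 3 if a <= 1.0 else 0.0

def long_run_variance(s, L, weight=bartlett):
    """(4.30): Xi_0_hat + sum_{j=1}^{L} w(j, T)(Xi_j_hat + Xi_j_hat'),
    with Xi_j_hat = T^{-1} sum_{t=1}^{T-j} s_t s_{t+j}' as in (4.27)."""
    T = s.shape[0]
    B = (s.T @ s) / T                       # Xi_0_hat
    for j in range(1, L + 1):
        Xi_j = (s[:-j].T @ s[j:]) / T       # T^{-1} sum_t s_t s_{t+j}'
        B += weight(j, L) * (Xi_j + Xi_j.T)
    return B

# usage with a hypothetical score array s_hat of length T:
# B_hat = long_run_variance(s_hat, L=int(4 * (len(s_hat) / 100.0) ** (1.0 / 3.0)))
```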
Andrews (1991) allows for weighting schemes that leave all of the autocovariances in the estimator for each sample size (that is, L_T ≡ T − 1), but of course those for large j are downweighted. Several other choices of ω(j, T) are studied by Andrews (1991). There are a variety of regularity conditions under which (4.30) is a consistent estimator of B_0. For nonlinear models these usually include the assumption that ω(j, T) is uniformly bounded (which is satisfied by the Bartlett and Parzen weights), that ω(j, T) → 1 as T → ∞ for each j, and that L_T tends to infinity at a slower rate than T. Pötscher and Prucha (1991b) provide a careful discussion and extensions of sufficient conditions based on work of White (1984), Newey and West (1987), Gallant (1987), Gallant and White (1988) and Andrews (1991). Hansen (1992a) gives results that relax the moment conditions. The details underlying consistency are rather intricate, and so they will not be given here. Generally, the assumptions include smoothness of s_t as a function on Θ × Γ and weak dependence of s_t^0 (such as mixing and near epoch dependence conditions). In applying serial-correlation-robust estimators the choice of the lag length L_T is crucial, but consistency results only yield rates at which L_T should grow with T. For problems with near epoch dependent scores, the available proofs allow L_T = o(T^{1/3}). This was also shown by Quah (1990) to produce a consistent estimator in a related context. Andrews (1991) contains consistency results under cumulant and mixing assumptions that allow L_T = o(T^{1/2}); see also Keener et al. (1991) for OLS with fixed regressors. Andrews (1991) also uses an asymptotic mean squared error criterion to derive optimal growth rates of L_T for a general class of weighting schemes. The optimal growth rate for the Bartlett weights is T^{1/3}, and for the Parzen weights it is T^{1/5}. The optimal weighting scheme in the class Andrews considers is the quadratic spectral (which, unlike the Bartlett and Parzen weights, keeps all autocovariances in the estimator). Andrews (1991, Table 1) gives the optimal lag length for a variety of weights ω(j, T) in the context of linear regression with AR(1) errors. This table provides useful guidance but is necessarily limited in scope; such calculations require knowing something about the degree of dependence in the underlying stochastic process. In practice, deterministic rules for choosing L_T necessarily have a component of arbitrariness. Andrews (1991) and Andrews and Monahan (1992) discuss data dependent or automatic ways of selecting L_T. Except for the computational requirements, these are attractive alternatives to deterministic rules. They are likely to become more popular as they are incorporated into econometrics software packages. There are other alternatives for estimating B_0. If {s_t^0} were a stationary, finite order vector autoregression, then B_0 is simply the long run variance of a finite order VAR; it can be consistently estimated by first estimating a VAR for ŝ_t. This leads to the multivariate version of Berk's (1974) autoregressive spectral density estimator. The autoregressive spectral density estimator might also work well when s_t^0 is not a finite order VAR, provided the lag L_T in the VAR is allowed to depend
on T (again, the rate L_T = o(T^{1/3}) is sufficient under certain conditions); see Berk (1974) for the scalar case. Another possibility is suggested by Andrews and Monahan (1992): prefilter ŝ_t through a VAR, and then apply an estimator such as (4.30) to the residuals. Andrews and Monahan show that this can lead to better finite sample properties. When M.7 is violated the asymptotic variance estimator of θ̂_T must account for the asymptotic variance of γ̂_T. This requires that we estimate the matrix D_0 in (4.16) rather than B_0. First we need to estimate F_0. This is typically straightforward. Under M.1–M.6 the P × R matrix
P, = T - ’
i
(4.33)
V,s,(&, 9,)
t=1
is consistent for F_0. In applications with conditioning variables, such as nonlinear least squares and quasi-maximum likelihood estimation (QMLE) (see Section 6), a simpler estimator is often obtained by computing f_t(x_t, θ_0; γ*) ≡ E[∇_γ s_t(θ_0; γ*)|x_t] and then replacing ∇_γ s_t(θ̂_T; γ̂_T) in (4.33) with f_t(x_t, θ̂_T; γ̂_T). Next, define û_t ≡ ŝ_t + F̂_T r̂_t, where r̂_t replaces any unknown parameters in r_t(γ*) with consistent estimates. For example, often r_t(γ*) = K_*^{-1} e_t(γ*), where K_* is an R × R unknown positive definite matrix. Given a consistent estimator K̂_T of K_* (which is typically very similar to estimating A_0), r̂_t = K̂_T^{-1} e_t(γ̂_T). This covers the case when γ̂_T is an M-estimator, and similar estimators are available in generalized method of moments contexts. Without further assumptions, estimation of D_0 requires application of one of the serial-correlation-robust estimators, such as (4.30), to {û_t}. When {u_t(θ_0; γ*)} is serially uncorrelated, as is typically the case when θ̂_T and γ̂_T are obtained from problems that have completely specified dynamics - more on this in Sections 5 and 6 - D_0 can be estimated as D̂_T = T^{-1} Σ_{t=1}^T û_t û_t′.
4.6. Hypothesis testing

Consider testing Q nonlinear restrictions

H_0: c(θ_0) = 0,  (4.34)
where c(θ) is a Q × 1 vector function of the P × 1 vector θ. We assume that Q ≤ P, c(·) is continuously differentiable on the interior of Θ, and θ_0 is in the interior of Θ under H_0. Define C(θ) ≡ ∇_θ c(θ) to be the Q × P gradient of c(θ) and assume that rank C(θ_0) = Q (this ensures that the Q restrictions are nonredundant). The Wald statistic
is based on the limiting distribution of √T c(θ̂_T) under H_0. A standard mean value expansion gives, under H_0,

√T c(θ̂_T) = C_0 √T(θ̂_T − θ_0) + o_p(1),  (4.35)

where C_0 ≡ C(θ_0). Therefore, the null limiting distribution of √T c(θ̂_T) is

√T c(θ̂_T) →d Normal(0, C_0 V_0 C_0′),  (4.36)
where V_0 ≡ Avar √T(θ̂_T − θ_0). Given a consistent estimator of V_0, say V̂_T, and the consistent estimator of C_0, Ĉ_T ≡ C(θ̂_T), the Wald statistic for testing H_0 against H_1: c(θ_0) ≠ 0 is
W_T = √T c(θ̂_T)′ [Ĉ_T V̂_T Ĉ_T′]^{-1} √T c(θ̂_T).  (4.37)
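Computationally, (4.37) is a few lines of code; the sketch below (ours) takes the estimate, a consistent V̂_T, and the restriction function and its gradient as inputs, and illustrates with a hypothetical test of θ_1 = θ_2.

```python
import numpy as np
from scipy.stats import chi2

def wald_test(theta_hat, V_hat, c, C, T):
    """Wald statistic (4.37) for H_0: c(theta_0) = 0.
    theta_hat: P-vector; V_hat: estimate of Avar sqrt(T)(theta_hat - theta_0);
    c: restriction function (returns Q-vector); C: its Q x P gradient."""
    cv, Cm = c(theta_hat), C(theta_hat)
    W = T * cv @ np.linalg.inv(Cm @ V_hat @ Cm.T) @ cv
    return W, 1.0 - chi2.cdf(W, df=len(cv))   # W -> chi2_Q under H_0

# hypothetical example: test theta_1 - theta_2 = 0 for a 2-vector estimate
W, p = wald_test(np.array([1.2, 1.0]), np.eye(2),
                 c=lambda th: np.array([th[0] - th[1]]),
                 C=lambda th: np.array([[1.0, -1.0]]), T=500)
print(W, p)
```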
Under H_0, W_T →d χ²_Q. As discussed in Section 4.5, the choice of V̂_T depends on what is maintained under H_0. Assuming that M.7 holds, V̂_T = Â_T^{-1} or Ĵ_T^{-1} under M.8 and M.9, V̂_T = Â_T^{-1} Ĵ_T Â_T^{-1} under M.8 only, and V̂_T = Â_T^{-1} B̂_T Â_T^{-1} if neither M.8 nor M.9 is assumed. When nuisance parameters are present and M.7 does not hold, V̂_T should account for the asymptotic variance of √T(γ̂_T − γ*). Often, it is convenient to use a likelihood ratio (LR)-type statistic when testing hypotheses. Because M-estimators solve a minimization problem, under certain assumptions the difference in the objective function evaluated at the constrained and unconstrained estimators can be used as a formal test statistic. While it is possible to derive the limiting distribution of this statistic under Assumptions M.1–M.6, here we only cover the case where the statistic has a limiting chi-square distribution; thus, we impose M.7, M.8, and M.9. White (1993) allows the model to be entirely misspecified, but the quasi-LR statistic has little practical value for classical testing in such cases. Let θ̂_T denote the unconstrained M-estimator of θ_0. The constrained estimator, θ̃_T, solves
min_{θ∈Θ} Σ_{t=1}^T q_t(w_t, θ; γ̂_T) subject to c(θ) = 0.  (4.38)

To derive the limiting distribution of the quasi-likelihood ratio statistic, we first assume that θ_0 ∈ int(Θ) under H_0. In addition, to ensure that √T(θ̃_T − θ_0) has a proper first order representation, we assume that a continuously differentiable mapping d: R^{P−Q} → R^P exists such that θ_0 = d(α_0) under H_0, where α_0 is a (P − Q) × 1 vector in the interior of its parameter space A under H_0. Further, the rank of the
Ch. 45: Estimarion
and Inferencr
P x (P - Q) gradient
2661
,fiw Dt~pendent Procrs.sr.s
V,d(a,)
is P - Q under
H,. The estimator
of a,, &, solves (4.39)
and the constrained
estimator
8,. is simply 8, = d(&r). By definition, (4.40)
The difference in (4.40) has a convenient limiting parameters under Assumptions M.llM.9. Lemma
distribution
that is free of nuisance
4.3
Under M.l-M.9, and the assumptions under H, that 8, = d(cx,) for ctO,a (P - Q) x 1 vector, and d(a), a continuously differentiable function on int(A), r,Eint(A), and rank V,d(cc,) = P - Q,
(4.41)
converges
in distribution
to xi under
H,.
A word of caution: in applying the QLR statistic, scaling factors that can appear in ql(w,,O;y*) must be chosen so that M.9 is satisfied. As we will see, that is no problem in the context of MLE since qr is simply the negative of the conditional log-likelihood. In Section 6 we show what M.9 entails in the context of the weighted NLS and QMLE approaches. The final test we cover is Rao’s (1948) score test, known more commonly in econometrics as the Lagrange multiplier (LM) test. Engle (1984), Godfrey (1988) and MacKinnon (1992) contain discussions of the LM statistic and its use in econometrics. Calculation of the LM statistic requires estimation only under the null, so it is well-suited for model specification testing. As with the quasi-LR statistic, we assume that f3, = d(a,) under H,, where rx,Eint(A) and d(a) satisfies the assumptions in Lemma 4.3. The simplest method for deriving the LM test is to use Rao’s score principle extended to the M-estimator case. The LM statistic is based on the limiting distribution of
T-
1’2 i I=1
St&
YT)
(4.42)
J.M.
2668
Wddridge
under H,,where yT now denotes
an estimator of y* used in obtaining e,. As usual, we assume that JT(PT - y*) = O,(l). W e explicitly derive the LM statistic under the assumption that M.7 is true under H,, as this holds in many applications. Assume initially that 0, is in the interior of 0 under H,; we discuss how this can be relaxed below. A standard mean value expansion yields T-‘12
= Tp1j2 i de,; Y*)+ A,&-@,- 0,) + o,(l)
f. &;-&) 1=1
r=1
under M.7 and H,. But 0 E ~‘?c(t!?,) = fic(B,) + c,JT(e, the Q x P matrix C(0) with rows evaluated at mean values Under
Ho, c(6),) = 0, and
(because
fi(E,
H,, C,fi(g,
plim C, = C(0,) = C,.
- c(,) = O,(l)
under
the above
Further,
- O,), where C, is between fIT and 0,. fi(e,
assumptions).
- 8,) = O,( 1) Therefore,
under
- 6,) = o,,( 1). It follows that
CoA,‘T-‘/2
i
s,(e,.;~T)=C,A,1T-112
(4.43)
’
t=1
under
H,. Without
imposing
M.8 or M.9 under
C,A y 1T- I/’ 5 s,(B,; y*) LNormal(0,
H,, we generally
have
C,,A; ‘BOA; ‘C:).
f=l
Under
our assumptions,
C, Ai ’ B,,A; ’ CL has full rank Q. Therefore,
where & z s,(e,; 9,). The score or LM statistic
under
H,
is given by
where all quantities are evaluated at (gT,yr) or just e”,. Under H,,LM, Axi. This LM statistic is robust in the sense that neither M.8 nor M.9 is maintained under H,. If the score is serially uncorrelated under H, (that is, M.8 holds under H,), then ET can be replaced in (4.44) by the outer product estimator
(4.45)
C/I. 45: Estimation
ud
Ir~/ewncrjhr
Dependrnt
Proccxses
2669
otherwise gr should be a serial-correlation-robust estimator applied to {S,}. For both the Wald and QLR statistics, we assumed that %,Eint(O) under H,; this is crucial for the statistics to have limiting chi-square distributions. We will not consider thi Wald or QLR statistics when 0” is on the boundary of 0 under H,; see Wolak (1991) for some recent results. The general derivation of the LM for certain applistatistic also assumed that %,Eint(O) under H,. Nevertheless, cations of the LM test we can drop the requirement that 0, is in the interior of 0 under H,. A leading case is when 8 can be partitioned as 0 = (%;, d;)‘, where %r is (P - Q) x 1 and 9, is Q x 1. The null hypothesis is H,:H,,= 0,sothat c(9) = 9,. It is easily seen that the mean value expansion used to derive the LM statistic is valid provided ~1,= %,, is in the interior of its parameter space under H,;9,= (%b,, 0) can be on the boundary of 0. This is useful especially when testing hypotheses about variances; see Bollerslev, Engle, and Nelson (this Handbook). Recall that under Assumptions M.8 and M.9,$, = B, = Jo, and the LM statistic simplifies considerably. For example, 2, and B, can both be replaced with 3,. Some matrix algebra shows that the LM statistic becomes
(4.46)
which is just TR,'from the linear regression 1
on
S;,
t=1,2
,...,
T,
(4.47)
where Rf is the uncentered r-squared from the regression (recall that 5: is a 1 x P vector). Engle (1984) derives this statistic in the context of MLE. Because the dependent variable in (4.47) is unity, LM, can also be computed as T - SSR, where SSR is the sum of squared residuals from (4.47). The statistic (4.46) is called the outer product LM statistic because it uses the estimator 1, for the estimator of the variance of the score. The outer product LM statistic requires that the two Assumptions M.8 and M.9 hold in addition to M.7; it is not generally asymptotically xi distributed if any of these assumptions fail. Even when M.7, M.8, and M.9 hold under H,, the outer product LM statistic might not be the best choice for applications. There is evidence in the maximum likelihood context that it can have severe finite sample biases. If an estimate of the Hessian or expected Hessian AT is available, the Hessian LM statistic (4.48)
can have better finite sample and Spady (1991) Davidson
properties. See, for example, Orme (1990), Chesher and MacKinnon (1984, 1991) and Bollerslev and
J.M.
2670
Wooldridge
(1992). When A, is estimated
Wooldridge
as
(4.49)
and the model has a residual structure, then (4.48) (or a scalar multiple of it) has a simple regression-based form. This is also true of the robust statistic (4.44) when Al,. is given in (4.49). It is important to remember that the outer product and Hessian forms are usually invalid if either M.8 or M.9 is violated. If there is any doubt about M.8 or M.9, the robust form of the statistic should be used. Two comments before ending this section. First, to keep the notation and assumptions simple, and to focus on the computation of valid test statistics under various assumptions, we have only derived the limiting distribution of the classical test statistics under the null hypothesis. Very general analyses under local alternatives are available in Gallant (1987), Gallant and White (1988) and White (1993). Second, the limiting distribution of the test statistics under H, have been derived under the assumption that all elements of 0, are identified. This is violated for certain econometric problems, and usually the standard test statistics no longer have limiting chi-square distributions. See Hansen (1991 b) for recent work on this problem.
5.
Maximum
likelihood
estimation
We now apply the results on M-estimation to maximum likelihood estimation (MLE). For the purposes of discussion, suppose initially that (IV,: t = 1,2,. . .} is a sequence of strictly stationary M x 1 random vectors. The classical approach to MLE entails specifying a parametric model for the joint distribution of W, = (w,, w2,. . . , wT) for any T. A less restrictive setup allows one to partition IV, into a G x 1 vector of endogenous variablesy, and a K x 1 vector of exogenous variables zl. Then (conditional) MLE requires specification of the joint distribution of Y, z (yl,. . , yT) conditional on Z, E (zl, t2,. . , zT). The latter setup allows one to investigate how zr influences various features of the conditional distribution of yt given z,, and it is familiar to economists in both cross section and time series settings. Still, for a few reasons this approach is too limiting for modern econometric practice. First, as this approach is usually implemented, a restrictive type of exogeneity of the process {zr: t = 1,2,. . .} IS assumed. In the time series econometrics literature this is known as the strict exoyeneity assumption. When interest lies in the conditional distribution, strict exogeneity of {tt} can be stated as
~(Y,lYl~...,Y,-l,zl,z
2,...)=~(YtIYl,...,Yt-l,Z1,...,
z,);
(5.1)
., ytm ,,zr,. . . ,zr. in other words,y, is independent of z,+ 1, zr+ 2r.. conditional any,,. [This is Chamberlain’s (1982) modification of Sims’s (1972) definition of strict exogeneity.] Chamberlain (1982) shows that (5.1) is equivalent to the Granger (1969) noncausality condition
Often, the Z~are policy or other control variables that can be influenced by economic agents, in which case (5.2) can easily be false. Second, specifying the joint distribution of (y,, . . . , yT) (conditional on (z,, z2,. , zT)) assumes that the researcher knows the entire dynamic structure of (yi,. . , yr). In certain cases the dynamic structure is not even of interest. For example, one might be interested in the contemporaneous relationship between y, and z,; in terms of conditional distributions, this entails specifying a model for D(y,Iz,). It is now recognized - for example, Robinson (1982) Levine (1983) and White (1993) ~ that this can be done quite generally without assuming that {z,} is strictly exogenous and without specifying how y, depends on past y or Z. Several examples of this kind of specification can be found in the literature, including the well-known static linear model, static logit and probit [Gourieroux et al. (1985) Poirier and Ruud (1988)], static Tobit [Robinson (1982)], static models for count data and static transformation models [Seaks and Layson (1983)]. Even when the complete dynamics are of interest one does not always directly specify the joint distribution of (yi,. , yT). Often it is more natural to specify the conditional distributions D(y,ly,- i,. ., y,) or D(y,Iz,, y,_ i, z,_ r,. , yr,zi) or D(y,l yt_ i, zf- 1,. . . , y,, z,). Static models, dynamic models, and most other cases of interest can be cast in the following framework: interest lies in the distribution of yt given a set of conditioning variables x,, where X, is some subset of (z,, ytp 1, 1,. . . , y,, z,). The z1 are not necessarily strictly exogenous and the dynamics are z, not necessarily completely specified. Thus, we follow White (1993) in taking the notion of maximum likelihood estimation to include cases where only the distributions D(y,Ix,) are specified for some conditioning variables x,. The fact that MLE is consistent for the parameters of D(y, Ix,) for essentially any conditioning variables x, has many useful applications. In what follows it is easiest to view x, as a subset of (z,, ~~~r,~~-r,. . . , yl, z,), where the G x 1 vector yt and the K x 1 vector z, are contemporaneously observed, but the following results hold for other definitions of x, (for example if x, contains leads of z as well as current and lagged z). Also note that x, can be null for all t, in which case interest lies in estimating the parameters in the unconditional distributions D(y,). From the M-estimation results we know that, for verifying regularity conditions, the easiest case to handle is when the dimension of X, is fixed and {(x,,y,):t= 1,2 )... } IS a strictly stationary sequence. This rules out cases where the distribution of y, given the observable past actually depends on all past observables, as with moving average models and generalizations of ARCH models
[Bollerslev (1986) Nelson (1991)]. The following treatment allows for the number of elements in x, to grow with t, without doing the sometimes difficult task of verifying the regularity conditions. Generally, one must ensure that the conditional log-likelihood (defined below) and the score are not too dependent on the past. Let pp(.lx,) denote the conditional density ofy, given x, with respect to a measure v,(dy). (In most applications, V,would not depend on t.) We will say nothing further about the measure rt(dy) except that it can be chosen to allow yt to be discrete, continuous, or some combination of the two. The interested reader is referred to Billingsley (1986) and White (1993). Let the support of y, be ?V, c R” and denote the range of x, by :X,; the dimension of F, may depend on t. A crucial result from basic probability theory for studying the properties of maximum likelihood estimation is the (conditional) Kullbuck-Leibler information inequality: for any nonnegative function fJ./x,) such that
s
f,(ylx,)v,(dy)
d
1, all x,EX,,
(5.3)
Af
it can be shown
that
s d
logCp~(~Ix,)/f,(~/x,)lv,(dy) 3 0, all VT,.
[see, for example, Manski (1988), White (1993) and Newey and McFadden 2.2)]. Now suppose that one specifies a parametric model for pF(.lx,),
(f,(~lx,;aeEo, 0
c
= pp(‘Ix,),
(Lemma
RP},
which satisfies (5.3) for all 0~0. some 8,~0, .f,(.lx,;O,)
(5.4)
(5.5) Model (5.5) is said to be correctly specijied if, for
all x,ETl, t = 1,2 ,... .
It follows from (5.4) that ifthe model is correctly
(5.6)
specified then for all t = 1,2,. . . ,
where
is the conditional log-likelihood for observation t. From (5.7) and the law of iterated
2673
expectations,
0, solves, for each t, (5.9)
The fact that the vector of interest, 0,, solves (5.9) shows that maximum likelihood estimation is Fisher consistent. Importantly, this holds for any conditioning variables x,; there is no presumption that the dynamics are completely specified or that some sort of strict exogeneity holds. The consistency of the maximum likelihood estimator now follows in a straightforward manner from the consistency of M-estimators. In the notation of Section 4, q,(w,, 0) = - /,(y,, xt, 0). The maximum likelihood estimator e^, solves
(5.10)
Theorem 5.1.
Weak consistenc~~ of MLE
Let ((xt, y,): t = 1,2,. . .} be a sequence of random vectors with xteZr, y,&Yj/, c RG. Let 0 c Rp be the parameter set and denote the parametric model of the conditional density as in (5.5). We assume that this is a density with respect to the measure v,(dy) for all x, and 0:
s
f,(ylx,; !V*
Assume MLE.l:
MLE.2: MLE.3:
B)v,(dy) = 1,
further
all
x,E.Z”(,
(5.11)
8~0.
that
(i) 0 is compact. (ii) 1, satisfies the standard measurability and continuity on “v x ’Tit x 0. {l(y,,x,k): t = 1,2 ,... } satisfies the UWLLN on 0. (i) For some B,E@, pf(.lx,)=f,(.lx,;8,), (ii) 0, is the unique
max lim T-’
ese T-m
Then there exists a solution
xfcF,,t= solution
requirements
1,2,...;
to
,tI ECUY,, -G
‘31.
to (5.10), the MLE 8r, and plim 6, = 8,.
(5.12)
J.M.
2674
Wooldridge
From the discussion above. MLE.3(i) ensures that 8, solves (5.12), but it does not guarantee uniqueness; this depends on the distribution of x,, so at this level we simply assume uniqueness in MLE.3(ii). To derive the asymptotic normality of the MLE, we assume that I,(.) is twice continuously differentiable on the interior of 0 and B,Eint(O). Define the score and Hessian for observation t by
s,(O)= s,(w,, 0) = - V&r(W,, ey, /l,(8) = h,(w,, 0) = V&(U) = - v;&v,,
O),
where W, = (xi, y:)‘. From the general M-estimator result we would that the score evaluated at 0, has expectation zero:
like to show
~Cs,(~,)l= 0. By the law of iterated
(5.13) expectations,
(5.13) certainly
holds if (5.14)
Now, for any 0~69.
where E,(. IX,) denotes expectation with respect to the density fJ.(x,; 0). If integration and differentiation can be interchanged on int(O) in the sense that
(5.16)
for all x,EX,, QEint(O), then, by (5.1 l), the right hand side of (5.16) is necessarily zero. But V,l,(y,x,, @j;(y(x,; 0) = V,,f,(ylx,; (I), which by (5.15) implies that Es[s,(B)lx,] = 0, BEint(O). Substituting U, for 0 now yields (5.14). Given (5.13), the central limit theorem generally implies that T-1~2~~~,s,(0,) is asymptotically distributed as Normal(0, B,), where
(5.17)
Define A, = A (0,) to be the limit of the expected
value of the Hessian
evaluated
2615
at the true parameter: A, = lim T-r
(5.18)
$j E[h,($,)]. r=1
T-J_
Provided that 0, is identified, A, will be positive definite; in fact, under an additional regularity condition it can be shown that E[h,(0,)] is equal to the variance of s,(fI,), so A, is at least positive semi-definite. The additional condition needed is that, for all ~,fz.K,, BEint(U),
This is simply another assumption about interchanging which is valid in regular cases. Taking the derivative
an integral and a derivative, of the identity
and using (5.19) yields, for all BEint(O),
EeCW)I-d = Vah [s,(Q) I-4
(5.20)
Equation (5.20) is called the conditional information matrix equality for observation t. This equality has important consequences for estimating the asymptotic variance of the MLE. One point worth emphasizing is that (5.20) holds for any set of conditioning variables xt, and we have not assumed that f,(.lx,,Q,) is also the density of yt conditional on x, and the observable past (y, 1,x, _ 1,. . ,yl, x1). Evaluating (5.20) at 6, and taking expectations yields
ECU~JI= VarCs,(~,)13
(5.21)
which is best thought of as the unconditional information matrix equality observation t. This shows that, under standard regularity conditions, estimation Assumption M.9 holds for MLE. Theorem 5.2.
(Asymptotic
Let the conditions MLE.4:
normality
of Theorem
for each the M-
of MLE)
5.1 hold. In addition,
(i) U,Eint(O); (ii) 1, satisfies the standard measurability conditions on Y f x .f‘t x 0. 3
assume
and second order differentiability
J.M.
2676
Wooldridye
(iii) the interchanges of derivative and integral in (5.16) and (5.19) hold for all BEint(O). (i) A, defined by (5.18) is positive definite; (ii) {V,2l,(y,, x,, 0)) satisfies the UWLLN. {sJ0,): t = 1,2,. . .} satisfies the CLT with asymptotic variance B, given by (5.17).
MLES: MLE.6: Then fi(e,
- 0,) A Normal(0,
A; ’ BOA; ‘).
(5.22)
While Assumption MLE.4(iii) is often satisfied in applied does rule out some cases where the (conditional) support parameters 0,. White (1993) presents a slightly more general port ?Yyrcan depend on 8 provided it does so in a smooth that (5.16) and (5.19) hold. It is possible to confuse (5.21) with the more traditional equality. To see the difference, define J, as in Section 4.5 but parameters: J, = lim T-’
i
T-30
t=1
time series analysis, it of yr depends on the result where the supenough manner such information matrix without the nuisance
E[s,(~,)s,(O,)‘].
(5.23)
Then, by (5.18) and (5.21), it is trivially true that A, = J,. But the traditional mation matrix equality says that (for all T)
Tpl
i
E[h,(Q,)] = Var
t=1
infor-
(5.24)
.
It is easily seen that (5.24) is not implied by the conditions of Theorem 5.2 because these conditions do not imply that the score is serially uncorrelated. In other words, for MLE as we have defined it, B, # A, necessarily. If {s,(d,)} is serially uncorrelated then the traditional information matrix equality holds, A, = B, = Jo, and the asymptotic variance of the MLE simplifies to its well-known form. We state this as a lemma. Lemma 5.1 Let the assumptions MLE.7: Then
of Theorem
For t = 1>2 ,...,
5.2 hold, and assume,
ECst(Bobt +j(Ho)‘l= O,
j>
in addition, 1.
Assumption MLE.7 always holds when distributed, which is why it never appears for data. With dependent observations, if MLE.7 captures all of the dynamics in the following
the observations are independently MLE with independently distributed holds it is usually because the model sense.
Dejinition 5.1 The model is said to be dynamically
WIG
complete
in distribution
if (5.25)
@t-l) = DbIxtL
observed at time t - 1. where @,-i -(yt-i,xt-i ,..., y,, x1) is the information Often we simply say the density is dynamically complete when (5.25) holds. Note that this definition allows x, and Dt_ 1 to overlap, as happens if X, contains lags of yt or lags of some other variables zl. For example, if X, = (z,, yt _ 1, zt 1), then Qtml =(Y~-~,z~~~,...,Y~,z~). Lemma 5.2 If the model is dynamically complete in distribution is a martingale difference sequence with respect 1,2,. .). Consequently, MLE.7 holds. This lemma is easily proven:
then the score {st(Q,): t = 1,2,. . .} to the information sets {at: t =
(5.14) and (5.25) imply that
E cs,(e,)Ix,, @t- 11= 0, so that E[s,(BJ at_ i] = 0 by iterated expectations. Because s,(@,) is a function of Qr, {s,(e,)} is a martingale difference sequence with respect to {Qt}. It is a simple consequence of the law of iterated expectations that a martingale difference sequence is serially uncorrelated. Dynamic completeness ~ which hinges crucially on the choice of the conditioning variables X, - has a simple interpretation: if interest lies in explaining y, in terms of past y and possibly current and past values of some other sequence, say {z,}, then enough lags of y and z have been included in the conditioning variables x, to capture the entire dependence on the past. Often, but not always, it is assumed that a fixed number of lags of observables is sufhcient for modelling dynamics, and (5.25) makes precise the notion that enough lags have been included. For example, suppose that the conditioning variables are chosen as x, = ( yr _ 1, zt_ 1). If this specification of x, is dynamically complete then
so that there are only first order x, = zl. Then dynamic completeness
dynamics. As another example suppose requires the fairly strong assumption
~(Y,lz,,Y,-l,z,-l,...~Yl~zl)=~(Y,lz,).
that
(5.26)
In other words, the static relationship is also the dynamic relationship. That this is rarely true is perhaps what prompts Hendry et al. (1984, p. 1043) to state that static models “. . rarely provide useful approximations to time-series data processes.” But one should keep in mind that D(y,Iz,) might be of economic interest even if (5.26) is false. Because we are allowing the dimension of x, to grow with t we can always choose 1,...,Yl,zl)orXt-(Yr-1,Yr~2,...,Y1) It-(Zf,.Y-I,&1,. ..,Y1,zl)orXt=(Yt-l,zl~ to ensure dynamic completeness, assuming of course that a correctly specified parametric model of the conditional density can be found. Most of the earlier work on maximum likelihood estimation with dependent processes assumes dynamic completeness, in which case a martingale difference central limit theorem can be applied to (st(fI,): t = 1,2,. . .}. See, for example, Roussas (1972), Basawa et al. (1976), Crowder (1976), Hall and Heyde (1980) and Heijmans and Magnus (1986), among others. The popular “prediction error decomposition” method of building up the likelihood function from conditional distributions [for example, Hendry and Richard (1983), Hendry et al. (1984) and Harvey (1990, Section 3.5)] is a special case of specifying a dynamically complete model: for each t, the density f,(.lx,;tI,) represents the density of y, (or of the prediction errors) given all past observable information on y and perhaps on current and past values of z. As usual, estimating Avar ,,I’?(& - 0,) requires estimating A, and II,. First consider estimation of the matrix A,. There are at least three possible estimators of A, in the MLE context, each of which is valid whether or not the model is dynamically complete. The first estimator, based on the Hessian, was encountered in the general M-estimation setting in Section 4, but it is now useful to have a separate notation for it. Define
& = T- ’ i I&), t=1
where we recall that h, generally depends on y, and x,. Under the conditions of Theorem 5.2, H, A A,. This estimator is generally thought to have good finite sample properties, but it does require second derivatives of the conditional loglikelihood function and it is not guaranteed to be even positive semi-definite in sample (although it is positive definite with probability approaching one). A second estimator is the BHHH [Berndt et al. (1974)] estimator, which is based on the information matrix equality (5.21). If we add to the conditions of Theorem 5.2
2679
that {s~(@s,(~)‘:t = 1,2,.
.} satisfies the UWLLN
then
& = T- ’ t, s,(B,)s,(&)~ L4,. r=1
This estimator has the advantage of requiring only the first derivatives of the conditional log-likelihood function, and it is guaranteed to be at least positive semi-definite. However, it is also known to have poor finite sample properties in some situations. [See MacKinnon and White (1985) for linear models, Davidson and MacKinnon (1984) for logit and probit, and Bollerslev and Wooldridge (1992) for QMLE.] A third possible estimator was given in the M-estimation section. Let a,(~,, 0) =
M~rbt~~t8)l-d = - J%CV~M)I-~.Then (5.29)
satisfies the UWLLN. If the is a consistent estimator of A, provided {a&,@} conditional expectation needed to obtain A_T is in closed form (as it is in some leading cases to be discussed in Section 6), A, has some attractive features. First, it often depends only on first derivatives of a conditional mean and/or conditional variance function. Second, A^T is guaranteed to be at least positive semi-definite because of the conditional information matrix equality (5.20). Thir_d, jT has been found to have significantly better finite sample properties than J, in situations where a(~,, 0) can be obtained analytically. If the model is dynamically complete or MLE.7 holds for some other reason there is nothing further to do. The asymptotic variance of 8, is estimated as one of the three matrices f?; 'IT, 3; 1JT, and A^; ‘IT. Things are more complicated if MLE.7 does not hold because B, depends on the autocovariances of {s,(e,)}. A serial-correlation-robust estimator using {$} should be used, for example that given in equation (4.30). With b, consistent for B, under MLE.1 -MLE.6,
We will not explicitly
consistent
estimators
of Avar JT(O,
cover the case with nuisance
- 0,) are given by
parameters
that affect the
asymptotic distribution of fi(eT - 0,) but these are easily reasoned from the general M-estimator results. The three tests covered in the M-estimation section apply directly to the MLE context. The Wald statistic for testing Ha: c(e,,, = 0 is given in equation (4.37). If the model is dynamically complete under H,, I”, can be taken to be (5.27), (5.28)
.I. M. Wooldridgr
26X0
or (5.29). If the application is to time series with incomplete dynamics then VT can be taken to be one of the estimators in (5.30). Define the log-likelihood function for the entire sample by YvT(0) = I,‘= 1I,(O). Let 8, be the unrestricted estimator, and let fi,. be the estimator with the Q nonredundant constraints imposed. Then, provided MLE.7 holds and the additional assumptions in Lemma 4.3 hold, the likelihood ratio statistic 29,
= 2[&(0^,)
(5.3 I)
- U,@,)]
is distributed asymptotically as xi under H,; this follows immediately from Lemma 4.3 as M.8 and M.9 are satisfied. Recall that there is no known correction to this statistic that has an asymptotic distribution free of nuisance parameters when MLE.7 is violated, so the LR statistic should be used only when the dynamics have been completely specified under H,. The LM test follows exactly as in the general M-estimator case. If we assume dynamic completeness under the null hypothesis then the three possible versions of LM, are (4.46), (4.48) and (4.48) with 5; ’ in place of 2; ‘, where all quantities are evaluated at the restricted estimator QT. Under H, and dynamic completeness, LM, zxi. In the case outlined in Section 4.6, 0, may be on the boundary of 0 under H,. If MLE.7 is not maintained under H, then the robust LM statistic given in (4.44) should be computed. It is possible in this framework to cover a broader class of tests that includes tests against nonnested alternatives, encompassing tests, Hausman tests and various information matrix tests. White (1987, 1993) gives a very general treatment for possible misspecified dynamic models.
6.
Quasi-maximum
likelihood
estimation
(QMLE)
In this section we cover estimation of the first two conditional moments ofy, given x,. Section 6.1 covers the case where the conditional mean is of interest. We consider multivariate weighted nonlinear least squares estimation of the parameters of E(y,Ix,), which covers many of the models used in applied time series analysis (including vector ARMA models). These methods can be applied whenever E(y,lx,) can be obtained in parametric form, regardless of the underlying structure. Section 6.2 covers the case when the conditional mean and conditional variance are jointly estimated. The multivariate normal distribution is known to have robustness properties for estimating the parameters indexing the conditional mean and conditional variance. These results are intended primarily for the case when the mean cannot be separated out from the variance; if E(y,lx,) can be estimated without specifying Var(y,lx,) then, at least for robustness reasons, the methods of Section 6.1 are preferred for estimating the conditional mean parameters. [See Pagan and Sabau (1987) for an illustration of the inconsistency of the normal
MLE for the conditional mean parameters in a univariate linear model case when the variance is misspecified.] Of course if one is confident about the specification of the first two moments the methods of Section 6.2 might be preferred on efficiency grounds over those in Section 6.1, but this depends on auxiliary assumptions.
6.1.
Conditional
mean estimation
We first consider estimating a model of a correctly specified conditional mean using multivariate weighted nonlinear least squares (MWNLS). These results build on the work of Hannan (197 l), Robinson (1972), Klimko and Nelson (1978), White and Domowitz (1984) Gallant (1987) and others for multivariate nonlinear regression. We allow for a nonconstant weighting matrix in order to include procedures asymptotically equivalent to QMLE in the linear exponential family [see Gourieroux et al. (1984) and White (1993)]. As in Section 5 let yt be a G x 1 vector of variables to be explained by the conditioning variables x,, which, as always, may contain lagged dependent variables and other conditioning variables. Let (m,(x,, %):x,EX,, %e 0 c [w’} be a model for E(y, 1x,). We will assume that the model is correctly specified in the following sense:
Assumption
WNLS.1
For some %,E 0 c Iwp,
NV, I-5)= 4(x,, O,), The weighted
nonlinear
t= 1,2,....
least squares
(6.1) (WNLS)
estimator
solves
where W1(x,, y) is a G x G symmetric, positive definite (with probability one) matrix that can depend on the conditioning variables x, and an R x 1 vector of parameters y. When G = 1 and Wr(xf, y) = y = cr’ the problem reduces to univariate nonlinear least squares. We have put the + in the objective function to simplify the gradient formulas and to make the quasi-LR statistic derived in Section 4.6 directly applicable. The motivation for using MWNLS to estimate the parameters of E(y,lx,) is that W1(x,, 9r) is thought to be a consistent estimator of Var(y, Ix,). For example, if W(x,, fr) = fir (so that Var(y,Ix,) = R, is the nominal variance assumption), then we obtain the nonlinear seemingly unrelated regressions (SUR) estimator
J.M.
2682
[Gallant
(1987) Robinson
fi,
=
(1972)].
Most of the time fi,
T-l i (y, - m,(O,+))(y, - m,(e;))’
where 0: is the MNLS estimator In general,
EE
1=1
of0, and (u:}
yT can be any estimator
for some DIET c [WR,fib?,
T-1 i
r=1
Wooldridyr
would be obtained
as
u&+‘,
are the G x 1 MNLS residuals.
that is n-consistent
- y*) = O,,(l). (In most
for its plim; namely,
applications
$r is either
a
preliminary a-consistent estimator of O,, such as the NLS estimator, or it comes as the dependent variable and functions from a regression procedure using u:u:’ of x, as the independent variables, or some combination of these.) Although we will discuss the case where the variance function is correctly specified, we are also interested in performing inference under variance misspecification. Under WNLS.1. define the G x 1 errors as
u,= Yt - m,(x,,Q,),
t= 1,2,....
(6.4)
Keep in mind that, under WNLS. 1, we can only conclude that E(u,lx,) = 0 and Var(u, 1x,) = Var(y, 1x,); the II, are not necessarily serially uncorrelated or independent of x,. This leads to the tautological model Yt =
~t(-%,do)+ 4,
(6.5)
E(u, IXJ = 0. Define the MWNLS
(6.6) objective
function
for observation
t as
By replacing yt in (6.7) with the right hand side of (6.5) and using a little algebra along with (6.6), we can write
Nq,(~,~Q,;74d = trC~,(+~)}-‘~~(-5) + x~,(xt> 00)- ~,(~,>@I’(Wr(X,,Y))- ‘Cwb,, 0,) - Mx,, @I,
(6.8)
where Var(y,Ix,) = fip(x,) is the actual conditional variance of yt given x,. Because the first term in (6.8) does not depend on 8 it is clear that, for any r~r, E[q,(w,, 8; y)I x,] 3 E[q,(w,, 8,; y)lx,], all BE 0. By iterated expectations
E(qt(w,,0;7412 a%(% 0,; Y)l> In particular,
this
inequality
holds
BE0. for plimy,
(6.9) = y*, which
establishes
Fisher
consistency of the multivariate weighted NLS estimator under MWNLS.l. (This is one of those cases alluded to in Section 4 where Fisher consistency holds for any value of the nuisance parameter.) We do not write down formal consistency and asymptotic normality results, as these follow from the results on M-estimation. Rather, we focus on the key assumptions that influence how inference is done on 0,. Because of the Fisher consistency result we have essentially shown that, under WNLS.l and regularity conditions, MWNLS consistently estimates 0,. Next define the score for observation f as
de Y)= V,q,@;Y)’= - Vpr(-% Q,‘IW,(-%Y)}- lCYr- m,(x,,@I = - V&(@‘CW,(Y))‘u,(Q).
(6.10)
Since u, = u,(e,), it is easily seen under WNLS.l that E[s,(B,;y)lx,] = 0 for all YET. Also under WNLS. 1 ~ and smoothness conditions on W(x,, y) ~ E[V,.s,(B,,~~)Ix,] = 0, which implies the convenient variance
(6.11)
all YET, M-estimation
assumption
of 8, does not depend on that of *TTprovided
Under WNLS.l and the M-estimation asymptotically normally distributed with Avarfi(i?,-(I,)=
regularity
A;‘BoAoml,
M.7. Thus, the asymptotic
that fioT conditions,
- y*) = O,( 1). fi(e,
- Q,) is
(6.12)
where we can write A, = lim T-’
f
T+ic
t=1
E[V,m,(B,)‘{ W,~*)}-lV,q(O,)]
(6; 13)
and B, is given by (4.7). (This expression for A, can be derived by noting that the term in V&B; y) depending on Vim,(x,, 0) and u,(e) has a zero conditional mean at 0 = 0,.) Under WNLS.1 and regularity conditions, a consistent estimator of A, is the positive semi-definite matrix
AT= T- ’ i
V,m,(e,)‘{
I+‘,(;,)} -‘V,m,&).
(6.14)
1=1
As usual, estimation of B, must generally account for possible serial correlation in s,(8,;y*); from (6.10) we see that the serial correlation in the score depends on the serial correlation in {Us}. Straightforward application of the law of iterated expectations shows that Assumption M.8 is implied by the following assumption.
J. M, Wooldridqe
2684
Assumption For all t3
WNLS.2 1, E(u,~:+~jlx,.x,+~)=O,j3
1
Definition 6.1 The model specification
is said to be dynamically
complete
in mean if (6.15)
WY,l-%@t-1)= WI-d, where @,-r =(yt_r,xt-r
,...,
yr,r,).
We have the following WNLS.2.
simple
lemma
relating
Definition
6.1 and
Assumption
difference
sequence
with respect
Lemma 6.1 If (6.15) holds then (u,: t = 1,2,. . .} is a martingale to (at}, and so WNLS.2 holds.
In fact, we can say more. If {u,} IS a martingale difference sequence with respect to { @,} then so is {s,(tI,;y)} for any YET. Thus, for MWNLS only the conditional mean has to be dynamically complete for the score evaluated at 8, and any value of y to be a martingale difference sequence (MDS). For MLE we derived the MDS property of the score under the assumption that the model for the conditional density of yt given (x,, @,- r) was correct in its entirety. Theorem Under
6.1
WNLS.1
and WNLS.2. (6.16)
where A, is given by (6.13) and J, is given by (4.17). A consistent estimator of J, is the usual outer product estimator jr given in (4.23). Under WNLS.1 and WNLS.2 an appropriate estimator of Avar(&) is A~(8,)=~;‘.?,A^;‘/T. Further simplifications arise if we have properly modelled Var(y,Ix,), as given by the following assumption. Assumption
WNLS.3
For some Y,E~, Var(y, Ix,) = W,k,
Y,),
t= 1,2,....
2685
When WNLS.3 Theorem
Under
holds we also assume
for yO.
6.2
WNLS.IPWNLS.3,
Under
that fT is JT-consistent
WNLS.l-WNLS.3,
Avar,,@(8,
- 0,) = Ai ‘.
a simple estimator
of Avar(H^,) is (6.17)
When y, is a scalar and I+‘&, y) = y = 02, 0, is the nonlinear estimator and the usual estimator of its asymptotic variance
&l
_____= 8: T
(
i
V,m,(~,),V,m,(Q,)
I=1
-l,
least squares is
(NLS)
(6.18)
>
where &G2 IS the usual estimator of Var(y,Ix,) = Var(u,Ix,) = of based on the sum of squared residuals from NLS estimation. For emphasis, sufficient conditions for (6.18) to be valid are E(Y~IX~,Qt- I) = E(Y,M
= m,(x,,~,)
(6.19)
and Var(y,Ix,)
= f~,‘.
(6.20)
Interestingly, it is not required that Var(y,( x,, Q1- 1) = Var(y,(x,), so that the variance need not be dynamically complete. For example, if x, does not contain lagged dependent variables then Var(y,Ix,, at- 1) can follow an ARCH process provided (6.20) is true; (6.18) is still a valid estimator of Avar(8,) even though there are neglected dynamics in the second moment of u,. For the general case where only WNLS. 1 is assumed to hold, we need to estimate II, as in equa$on_(4._30) with the score given by (6.10). As always, Avar(&.) is estimated by A;’ B,A; l/T. Testing is easily carried out as in previous sections. The Wald statistic is obtained from the results in Section 4.6, with the estimated variance matrix used depending on whether WNLS.2, WNLS.2 and WNLS.3, or neither of these assumptions is maintained. The quasi-LR statistic, obtained as
(6.21)
J.M.
2686
Wooldridqr
where o”, is the restricted estimator and 4,. is the unrestricted estimator, has an asymptotic xk distribution under WNLS.l, WNLS.2 and WNLS.3. To ensure that M.9 holds (see Lemma 4.3), we must ensure that the objective function is properly computed. For example, when y, is a scalar and the variance is given by Var(y, 1x,) = crfu,(x,, 6,) for some function u&c,, a), the function q1 is given by
4t(w
Typically,
?)
@
f)
3
=
(Y,-
et,
2&I,(x,,
@J2
(6.22)
6)
once s^, and @, have been calculated,
S’, is computed
as (6.23)
where ti, = y, - m,(x,, 8,) and 0, = u,(x,, 8,). Once the restricted been obtained, the QLR statistic can be computed as
QLR
=
T
estimator
8, has
(SW - SWJ A2 CT
where SSR, is the sum of squares of the restricted weighted residuals {Ot1’2Ult} and SSR,, is the sum of squares of the unrestricted weighted residuals (0, “‘I&}. For NLS, (6.24) is QT/(T - P) times the F-statistic covered in Gallant (1987, p. 56) for nonlinear least squares with fixed regressors and i.i.d. errors. This analysis shows that this F-statistic is valid under the weaker assumptions WNLS.1, WNLS.2 and WNLS.3, which allow for models with lagged dependent variables and heteroskedastic martingale difference sequences (provided the heteroskedasticity is properly modelled). To compute the LM statistic for testing conditional mean hypotheses, only the first derivative of the conditional mean function evaluated at the restricted estimates is needed. The setup is the same as in Section 4.6; there are Q possibly implicit restrictions on the P x 1 vector BO. Let yT be the nuisance parameter estimator used in computing the restricted estimator e”,. In using the LM statistics from Section 4.6 it is best to use as the estimate of A, the P x P matrix
(6.25)
where V,r71, = VBm,(x,, 8,) and k’, K WJx,, FT). Under WNLS.1 only, d, should be a serial-correlation-robust estimator applied to {St} and the robust LM statistic
(4.44) should be used. This would be the case, for example, in testing a static or distributed lag relationship without also assuming that it is the dynamically complete conditional mean. When WNLS.1 and WNLS.2 hold under H,, which is often the case ~ tests of dynamic misspecification take the null to be complete dynamic specification - & can be replaced in (4.44) by the outer product estimator jT in (4.45), where S, = - V&z: I?‘- ‘it. The resulting statistic can be given a regression interpretation as in Wooldridge (199 1b). If we impose WNLSI, WNLS.2 and WNLS.3 under the null then things are even easier. The outer product LM statistic in (4.46) is asymptotically valid, but this is not the best choice; the Hessian form (4.48) with JT in (6.25) is no more difficult to compute and has better finite sample properties. A statistic that is proportional to (4.48) (where the constant of proportionality tends in probability to unity under H,) can be computed from the r-squared of a regression. Run the (stacked) OLS regression @-“2~
t
t
on
pwv t
fi
I3
f’
t=1,2
T.
)...)
(6.26)
Under WNLS.l, WNLS.2 and WNLS.3, TGR,’ “1: (G is the dimension of y,). When y, is a scalar, Var(y,Ix,) is typically specified as Var(y,Ix,) = ~fu~(x,, 6,). Then the LM statistic can be computed as TR,’ from c-l’2jj f
I
on
17~“‘V,fiit,
c=1,2
,...,
T,
(6.27)
where Et = u,(x,, g,). For NLS, fir = 1. As an example of how to set up a test in the current framework, consider testing for AR(l) serial correlation after NLS estimation. The null mean function is f,(g,, /?) for g, some subset of (zr,yr-l,z,-l ,..., y,, z,) (so that g, can contain lags of y,) and the unrestricted mean function is m,(x,, 0) = f,(g,, B) + p(y,_ 1 - f,_ 1(g, _ 1, /Q), where x, = (g,, y, 1,gt 1). Under H,: ,p, = 0, we-obtain the NLS estimator o,f PO, /I, and the NLS residuals ii, = y, - f,(g,, /IT). Now 0, = (&O)’ and V,fi, = (VJ,, ii,_ l). Because this is always a test of dynamic completeness, WNLS.2 holds under H,. If we also impose (6.20) (WNLS.3 in this context), then an asymptotically xf statistic is (T - l)Ri from the regression
fit
on
VJt7fit-
l7
t = 2,3,. . . , T.
(6.28)
Note that because X, contains at least y,_ 1, (6.20) rules out ARCH or other forms of dynamic heteroskedasticity under H,. For applications of LM statistics and more on their robust forms, see Engle (1984), Godfrey (1988), Wooldridge (1991a) and MacKinnon (1992).
268X
4.2.
QMLE for the mean and variance
We now consider joint estimation of E(Y,lx,) and Var(y,Jx,). As in the previous section, y, is G x 1 and x,EX( is the set of conditioning variables that may contain lags of yr. The conditional mean and variance functions are jointly parameterized by a P x 1 vector 8: {m,(x,, %):rr&“r, %EO C P}, {C&Y,, %):x,&,,
%E0 C l@}.
The subsequent analysis is carried out under conditional moments are correctly specified. Assumption
the hypothesis
that
the first two
QMLE.1
For some %,E 0, E(Y, IT) = 4(x,9 eO)> Var(Y,lx,)
= Rr(q, eO),
t= 1,2,....
We estimate 8, using quasi-MLE under the nominal assumption is normally distributed. For observation t, the quasi-conditional apart from a constant is
that y, given x, log-likelihood
where 10,(x,, %)I denotes the determinant of J~,(x,, %). Letting u,(B) -y, - m,(x,, 8) denote the G x 1 residual function and suppressing the dependence of 0,(x,, 6) on X, yields the more concise expression I,(%)= - $loglf2,(%)I - $4,(e)‘{n,(e)}-1u,(%).
(6.29)
The QMLE %r is obtained by maximizing the normal quasi-log-likelihood function _Yr(%) = z,‘= 1I,(@. Under QMLE.l it can be shown that E[1,(8,)(x,] 3 E[l,(d)Ix,] for all 0~0 and for all x,EX,. This result has been established in special cases by Weiss (1986) and was shown to hold generally by Bollerslev and Wooldridge (1992). Its important consequence is that 8, solves
maxE M%)l eee
under
QMLE.1,
so that QMLE
based on the normal
conditional
log-likelihood
is Fisher consistent. The weak consistency of the QMLE GT for 0, under QMLE.l and regularity conditions follows from the usual analogy principle argument. As usual, the uniform weak law of large numbers applied to fl,(y,, x,, 0): t = 1,2,. . .} underlies such a result, which requires moment and continuity assumptions on m(x,, .) and 0(x,, .) and the assumption that I,(O) is not too dependent on the past. on 0 for all relevant x,, and if 0,(x,, (3) If m,(x,, .) and 0,(x,, .) are differentiable is nonsingular with probability one for all 0~ 0, then differentiation of (6.29) yields the P x 1 score function s,(e): s,(U) s - V&,(U) = - {V,m,(H)‘R;
‘(R)u,(U)
+~V,n,(e)‘[~,‘(U)on,‘(H)l
vecCu,(@,(@‘- Q(Q)]},
(6.30)
where V,m,(H) is the G x P derivative of m, and V&!,(Q)is the G2 x P derivative of n,(0). By definition, under QMLE.l E(u,)x,) = 0 and E(u,u:lx,) = O(xt, 0,). It follows that, under correct specification of the first two conditional moments of y, given x,, E[s,(0Jx,] = 0. This is an alternative statement of Fisher consistency of QMLE. To estimate Avar where the simplest vatives, in this case A straightforward
fi(e^, - 0,), we need to estimate A, and best behaved estimator of A, of the mean and variance functions. but tedious calculation shows that,
and B,. Here is another case depends only on first deriLet a,(~,, tl,) = E[h,(tI,)lx,]. under QMLE.l,
q(x,, do)= v,m,(OJ’Q; ‘@,)V,m,V,) +~v,~,(~,)‘C~,-‘(~,)o~,l(~,)lV,~,(~,).
(6.31)
(As expected, this matrix is positive semi-definite, something that is useful for programming Gauss-Newton iterations for obtaining the estimates.) A consistent estimator of A, is A^r = T- lx,‘= 1 a,(8,.). Under QMLE.1 only, we need a serialcorrelation-robust estimator for-B,, and this is obtained by applying one of the Section 4.5 estimators to $ = ~~(0~). To state a condition under which the score is serially uncorrelated, define the (G + G2) x 1 vector Yec [ul, {vec[u,uj - f2r(0,)] }‘I’, where u, = ~~(0,). Under QMLE.1, E(r,(x,) = 0. We now add the assumption that {Ye} is appropriately serially uncorrelated. Assumption
QMLE.2
For all t 3 1, E(r,r;+jJx,,x,+j)
= 0,j 3 1.
It is easily seen that Assumption
QMLE.2
implies that the score is (conditionally)
serially uncorrelated. Therefore, under QMLE.l and QMLE.2, Avar fi(gT - 0,) = A;‘J,A;‘, where J, is as in (5.23), and a consistent estimator of J, is given in
J.,2/1. Wooldridge
2690
(5.28) with the score as in (6.30). Usually, a stronger assumption. Dejnition
if QMLE.2
is to hold, one has in mind
6.2
The model is dynamically
WI-5
complete
in mean and variance if
@r-1)= EblxtL
Varbl-q Dt- 1)= Var(y,lx,),
t= 1,2,...,
Lemma 6.2 If the model is dynamically complete in mean and variance then {Y,:t = 1,2,. . .} is a martingale difference with respect to the information sets {a,}, and therefore QMLE.2 holds. From (6.30), {s,(f?,)} IS a martingale difference sequence (MDS) with respect to Dt = (yt,xt,. , y,,x,) if {r,} is. Thus, the score of the normal quasi-likelihood is an MDS if the first two conditional moments are dynamically complete; nothing else about the normal distribution need apply to the distribution of yt given x,. If yt given X, is normally distributed then the conditional information matrix equality
Var Cs,(~,)IXrl= a,(x,,0,) holds. When combined with QMLE.2 Avar(@,). While normality is sufficient assumption.
(6.32) this further simplifies the estimation of for (6.32), it also holds under a weaker
Assumption QMLE.3 (i) E [vec(u,u:)ui Ixt]=
0; (ii) EC(vec(&)- Q,(@,)} CWu,ul)- ~,(Q,))'lx,l = 2~J~,(~,)@f4(Q,,)l, where NG = DG(D)CDGJID~, and D, is the G2 x G(G + 1)/2 duplication matrix; see Magnus
and Neudecker
(1986).
In the scalar case, QMLE.3(i) is the symmetry condition E(u:Ix,) = 0 and QMLE.3(ii) is the familiar fourth moment condition E[ {uf - CJ:(~,))~~X,] = 2a,4(0,). Assumption QMLE.3 is the multivariate version of these assumptions, and it could hold for distributions other than multivariate normal. For more on the matrices D, and NG, and their relevance for the multivariate normal distribution, see Magnus and Neudecker (1986).
Under QMLE.1
and QMLE.3,
(6.32) holds. Therefore,
under
QMLE.1,QMLE.2
andQMLE.3,
Avar fi(&, - e,) =_AO, and Avar(@,) can be estimated as A^; ‘/T or 3; l/T. The estimator based on A, tends to have better finite sample properties than that based on the outer product of the score estimator jr. Testing hypotheses about BOposes no new problems. The Wald statistic is as before for M-estimation, with an appropriate estimator for B,. Recall that under QMLE.l-QMLE.3, B, = A,, but not otherwise. The quasi-LR statistic is valid under assumptions QMLE.l, QMLE.2 and QMLE.3; this follows directly from Lemma 4.3. The general formula for the LM statistic is again given by Equation (4.44). If, in addition to QMLE.l, we impose QMLE.2 and QMLE.3 under the null, the statistic simplifies to (4.46) or, preferably, (4.48). Under QMLE.l and QMLE.2 the statistic is (4.44) with Jr in place of I,. A natural application of these results is to univariate and multivariate ARCH, GARCH, and ARCH-in-mean models; see Bollerslev, Engle, and Nelson (this Handbook). But the normal QMLE can be applied in many other contexts. Whenever the mean and variance are twice continuously differentiable in 8 then, subject to the modest regularity conditions in Section 4, the QMLE will produce fi-asymptotically normal estimates of the conditional mean and conditional variance parameters. The actual distribution of yt given x, can be very different from normal; for example, when y, is a scalar, it could have a discrete or truncated distribution. This is useful for problems where maximum likelihood is computationally difficult or violates standard regularity conditions (such as (5.16)). For example, QMLE can be applied to certain switching regression models with unknown regime by obtaining E(y,lx,) and Var(y,Ix,) (where X, may or may not contain lagged dependent variables) in terms of the structural parameters and using these in the normal quasi-log-likelihood; the MLE for these models based on a mixture of normals is known to be computationally difficult [Quandt and Ramsey (1978)]. The QMLE can also be applied to certain frontier production function models [Schmidt (1976)], where the log-likelihood is discontinuous in some of the parameters and so standard inference procedures do not necessarily apply. The discontinuity in the true log-likelihood comes about because the support of the conditional distribution of y, depends on unknown parameters. The QMLE might be useful because the moments E(y,lx,) and Var(y,Ix,) depend on the parameters in a very smooth fashion.
7. 7.1.
Generalized
method of moments estimation
Introduction
This section studies dependent processes.
generalized method of moments (GMM) estimation for We rely heavily on the work of Hansen (1982), Bates and
White (1985, 1993), and Newey and McFadden (this Handbook). As with the rest of the chapter, we focus on weak consistency, and we do not attempt to find the weakest set of regularity conditions. The limiting distribution results in Section 7.3 apply to the essentially stationary, weakly dependent case. Many applications of GMM fit into the category of estimation under conditional moment restrictions. We will approach method of moments estimation from this standpoint. By specifying the conditioning set to be empty the unconditional moments framework is a special case. The related classical minimum distance estimators are not covered here; see Newey and McFadden for a treatment that can be applied to time series contexts by using the results of this chapter. Let {(w~,x,): t = 1,2,. . .$ be a stochastic process, where wtc%‘; and x,EX,, and the dimensions of both w, and x, may grow with t. Assume that there is an N x 1 vector function Y,:,YYW; x 0 --f [WNdefined on YJY~ and the parameter space 0 c [w’. Interest lies in estimating the P x 1 vector 8,~0 such that
Nr,(w,, Q,)lx,l= 0,
t=
1,2,....
(7.1)
Equation (7.1) constitutes a set of conditional moment restrictions, studied in cross section settings by Chamberlain (1987) and by Hansen (1982), Hansen and Singleton (1982), Robinson (1991 a), Newey (1990) and Rilstone (1991) in dependent settings. Conditional moment restrictions are straightforward to generate. First consider the setup of Section 6.1, where yt is a G x 1 vector and the model {m,(x, 0): x,E%“~, 0~0) is correctly specified for E(y,lx,). Then (7.1) holds by setting u,(w,,@ = Y, - m,(x,, Q). Next suppose m,(x,,H) and Q,(x,,O) are correctly specified models for E(y,Jx,) and Var(y, 1xr), as in Section 6.2. Defining u,(0) = yr - m,(x,, H), a set of conditional moment restrictions is obtained by setting
a {G+[G(G+ 1)/2]} x 1 vector, where vech[.] denotes the vectorization of the lower triangle of a matrix. Under correct specification of the conditional mean and conditional variance (7.1) holds. In many situations, including a variety of rational expectations models, economic theory implies an Euler equation of the form
~CY,(Wf,~o)l~,_1,~,_2,...‘~11=0’
(7.2)
in which case (7.1) follows by the law of iterated expectations whenever x, is a subset of (wI_l,wr_l ,..., WI). Conditional moment restrictions are also natural for analyzing nonlinear simultaneous equations models. A general dynamic structural model can be
Ch. 45: Estimution and l@renw
expressed
0,
2693
,fiw Drprndrrlt Prcrcessrs
as
IXJ = 0,
(7.4)
where qt(.) is N x 1 and x, contains predetermined variables. Thus, in a GMM setting we identify YJw,, 0,) with the structural errors u,. As we see below, a whole class of GMM estimators are consistent and asymptotically normal under the Assumption (7.4) that the structural errors have a conditional mean of zero given the predetermined variables; the errors need not be independent of X, or even conditionally homoskedastic. They can also be serially correlated provided (7.4) holds. Condition (7.1) implies that the unconditional expectation of Y~(w~,~?,)is zero, but it implies much more: any matrix function of x,, say the N x M matrix G&X,), is uncorrelated with r,(t),,)= r,(w,,0,). More precisely, if E[~T,,,(w,, 8,)1] < co, h = 1,. . . , N and E[I Grhj(xt)rth(w,, d,)(] < co, h = 1,. . . , N, j = 1,. . . , M, then EIGl(xt)‘r,(~,,
e,)l =
0.
(7.5)
Equation (7.5) is the basis for estimating 0,. Actually, to handle cases that arise in practice, we will consider a more general framework. Let the instrumental variable function depend on some nuisance parameters (which might include 19) as well as x,: G,(x,, y), where YET c RH. Under (7.1) and finite moment assumptions, E[G&,
y)‘rt(wt, e,)] = 0
As before, we assume
fi(y*T-
for all YET.
that we have an estimator
(7.6) fT that satisfies
Y*)= O,(l)>
(7.7)
where y* could contain 8, but need not have an interpretation otherwise. because (7.6) holds with y = y*, Lemma A.1 generally implies that
T- ’ f G,(W’r,(~J JL0,
Then,
(7.8)
t=1
where G,(y) = G,(x,, y) and r,(O) = rJw,, 0). The analogy BOby solving the M nonlinear equations T - l i G,(g,)'r,(e) = 0, 1=1
principle
suggests estimating
(7.9)
where, to especially hand side quadratic
identify O,, we need M > P. But (7.9) does not always have a solution, when M > P. Instead, we choose an estimator of 8, that makes the left of (7.9) as close as possible to zero, where closeness is measured as a form in a positive definite matrix. The weighting matrix A, is an M x M
positive semi-definite random matrix, such that A, 11, A*, where A* is an M x M nonrandom positive definite matrix. (This implies that A, is positive definite with probability approaching one (w.p.a.1)) As a shorthand, denote the instruments by 6, z G,(x,, TT). A generalized method o/‘ moments (GMM) estimator of 6’,, is the solution 8, to
(7.10)
7.1.
Consistency
Consistency of the GMM estimator does not follow from Theorem 4.3 because the GMM estimator is not an M-estimator as we have defined it here. Nevertheless, consistency follows under the uniform weak law of large numbers and the identification assumption that 8, is the unique solution to lim T- ’ f: E[G,(x,, y*)‘r,(w,,
Usually Theorem
Assume GMM.1:
@I E 0.
(7.11)
t=1
T-X
in practice
one argues that E[G,(x,, y*)‘r,(w,, 0)] # 0 for 6’# N,.
7.1 (Consistency
qf GMM)
that (i) (ii) (iii) (i)
0 and r are compact sets; $. J+ y*; Ar % A*, a positive definite M x M matrix. GMM.2: G, satisfies the standard measurability and continuity conditions on Xr x r; (ii) Y, satisfies the standard measurability and continuity conditions on wrx 0. GMM.3: {G,(y)‘r,(U)} satisfies the UWLLN on 0 x I-. GMM.4: (i) For some O,E@, E[r,(w,, Q,)jx,] = 0, t = 1,2,. .; (ii) 8, is the unique solution to (7.11). Then the GMM estimator gT exists and is weakly consistent for 8,. This result follows by verifying the conditions of Newey and McFadden (Theorem 2.1). It is similar in spirit to the strong consistency results of Hansen
(1982), Bates and White (1985), Gallant Potscher and Prucha (1991a).
7.3.
Asymptotic
(1987), Gallant
and
White
(1988) and
normality
To establish the asymptotic normality of the GMM estimator we apply Theorem 3.2 in Newey and McFadden; here, only the main ingredients are sketched. In defining the score of the objective function, it is notationally convenient to divide the gradient of (7.10) by 2 and we do so in what follows. Straightforward differentiation with respect to Q gives the score as the P x 1 vector
(7.12)
9
where V,r,(@) is the N x P gradient of_r,(@. Assuming that fI,Eint(O), consistency of 8, implies that S,(t?,; $r, A,) = 0 w.p.a.1. Further,
,,/fs,(e,;
&,
(i,) = Tml
5 6:V,r,(e,)1 ‘&[ T-l/’
1=1
Using a standard mean value expansion, Lemma A.l, it is easily shown that
Tp “’ i
1=1
@,(e,)= Tm‘I2 i
t=1
i t=1
the assumption
G,(y*)'r,(~,)
+ op( 1).
tf$,(d,)
.
weak
(7.13)
1 E[r,(OJlx,]
= 0, and
(7.14)
Define the M x 1 vector e,(@ y) E G,(y)‘r,(B), so that E[e,(8,; Y*)IxJ = 0. Under moment and weak dependence assumptions, T- ‘I2 C,‘= 1e,(flo; y*) satisfies the central limit theorem, and so it is O,(l). Next define the M x P matrix
R, = rlim T-l
i
E[G,(y*)‘V,r,(BJ].
(7.15)
1=1
Then by the UWLLN
and Lemma
A.l,
(7.16) Because AT-A* fiS,(fI,;&/ir)=
= o,(l)
it follows from (7.14) and (7.16) that R:JI*T-“~
,i* e,(Q$Y*) + opU).
(7.17)
J M. ~(~(~l~~i~~~~
2696
Equation (7.17) has the important impIication that the limiting distribution ,J’?($, - y*) does not affect that of ,,fi(& - 8,); only (7.7) is needed. From (7.17) it is natural to define the score for observation t as s,@,) = R;A*e,@?,; r*) = ~~A*G~~~*~~~(~~).
of
(7.18)
(For simplicity we suppress ah parameters in the score except for S,.) It now follows from Newey and McFadden (Theorem 34, that Jf($,.
- 0,) -%Normal(O, A;‘Z?,A;“),
(7.19)
where (7.20)
A, SER;/l*R,. The matrix B, can be expressed as
D, z5 lim Var T-+SZ
T-Ii2 $j e,(B,; Y*f [
.
1
t=1
We summarize
this argument with a theorem.
Theorem 7.2
(As~~~t~t~c n~~~~~~t~of GMMJ
(7.22)
Assume that the conditions of Theorem 7.1 hold. In addition, assume that GMM.5: (i) fi(Y*T- ?+I= (ii) B,Eint(O), y*&nt(Z). GMM.6: (i) G, satisfies the standard measurability and first order differentiability conditions on S, x C (ii) r, satisfies the standard measurability and first order di~erentiabi~ity conditions on 7ly; x 0. GMM.7: (i) Ic;,(x,, Y)fVOrt(wt> 0)) and ([r,(w,, @)OZH]VyG,(xl, r,‘> satisfy the UWLLN on 0 x Z; (ii) rank (Z?,) = P. GMM.8: (~~(8,):t = 1,2,...) satisfies the eentral limit theorem with positive definite asymptotic variance B, given by (7.21). Then (7.19) holds.
q?m
This result is similar to the asymptotic normality results in Hansen (1982) and Gallant (1987), except that we leave the UWLLN and CLT as high level assumptions. If necessary, the assumptions on the ranks of the matrices A*, R,, and B, can be somewhat relaxed; see Newey and McFadden (Theorem 3.2).
7.4.
Estimating
A consistent
the asymptotic
estimator
variance
of A0 under
the conditions
of Theorem
7.2 is
AT= &i,kT
(7.23)
where
&. = T’
i: G,~j,.)'V,r,(Q
(7.24)
r=1 Given
a consistent
estimator
BT of D,, a consistent
estimator
of B, is (7.25)
From (7.22) it is seen that estimation of D, generally entails a serial-correlationrobust estimator unless e; = e,(e,; y*) is serially uncorrelated. As usual, one simply applies one of the serial-correlation-robust covariance matrix estimators from Section 4.5 to (6, E e,(&r; y*,)}. See also Hansen (1982), Gallant (1987), Newey and West (1987), Gallant and White (1988), Andrews (1991) and Piitscher and Prucha (1991 b). As a computational matter, if M is much larger than P it is easier to estimate B, directly by applying a serial-correlation estimator to the P x 1 vector process $ = I&&-d,. The form of the asymptotic variance simplifies, and we obtain the efficient estimate: given the choice of instruments G&y*), by choosing the weighting matrix AT appropriately. In particular, let AT be a consistent estimator of 0; ‘, so that A* = 0; ‘. Then from (7.21) and (7.22), A, = B, = RbD;‘R,, and so Avar(t?=) = (Rb 0; ’ R,) ‘IT. Lemma 7.1 Assume that GMM.l-GMM.8 GMM.9: A* = 0,‘. Then Avar Jf(&r Choosing
hold, and in addition
- 0,) = (RiDo-’ RJ
/1, to be consistent
for 0;’
l.
(7.26) usually
requires
an initial
,,@-consistent
estimator of 8,. The resulting estimator is often called the minimum &-square estimator (Newey and McFadden call it the optimal minimum distance estimator). See Hansen (1982), Bates and White (1993) and Newey and McFadden (Theorem 5.2) for proofs that it is the efficient estimator given the choice of instruments under conditions analogous to GMM.l-GMM.9. If (uJ0,)) is serially uncorrelated in an appropriate sense, then estimation of D,, and therefore computation of the minimum chi-square estimator, is simplified. Lemma 7.2 Assume that GMM.l-GMM.9 hold, and in addition GMM.10: For t 3 1, E[rt(Uo)Y,+j(e,)IX,,X,+j] = O,j 3 1. Then D0 = lim T-’
i
T-r,
1=1
E[e,(Bo;y*)e,(80;y*)‘]
(7.27)
and (7.26) holds. Hansen (1982) showed Under GMM.l-GMM.lO,
earlier how GMM.10 leads to the form (7.27) for Do. a simple consistent estimator of Do is (7.28)
1=1
which makes To obtain the obtain /1, as estimator. A sufficient
A%(H^,) = (&.b;’ RT)-l/T especially straightforward to obtain. minimum chi-square estimator under GMM.l-GMM.10, one would the inverse of (7.28), but C, would be computed using the initial condition
for GMM.10
is
~Crf(~o)I~,,~,-l,~,_l,...,~l,~ll=O,
(7.29)
as would be the case under (7.2) when x, is a subset of (wt _ l,. . . , wl). Thus, GMM. 10 is satisfied for estimating the parameters of a dynamically complete mean, for joint estimation of dynamically complete means and variances, for many rational expectations applications, and for dynamic simultaneous equations models (SEM) when the errors uI are martingale difference sequences with respect to ot. Using the usual iterated expectations arguments, {sl(fIO;y*)) is seen to be a martingale difference sequence under (7.29) and so a martingale CLT can be applied in verifying GMM.8. Often the model (7.3))(7.4) is estimated under the assumptions GMM.l-GMM.10 and the conditional homoskedasticity assumption
Jwp, Iq) = a,,
(7.30)
where 0, is an N x N positive the general case, new estimators
definite matrix. If we extend this assumption for the asymptotic variance are available.
to
Lemma 7.3 Assume that GMM.l-GMM.10 hold, and in addition GMM.11: For a positive definite N x N matrix a,, E[r,(OO)v,(H,)‘jx,] = 0,. Then
D,,= lim T- ’
5 E[G,(x,,
y*)'i2,G,(~,, ?*)I.
Lemma 7.3 follows because, under GMM.ll, Now D, can be consistently estimated by
fiT=
T-’
f: &$!,ti,,,
E(epep’lx,) = G,(x,, y*))flOG,(x,, y*).
(7.31)
1=1
where fi, is a consistent estimator of 0,. In virtually all applications, obtaining R, requires an initial estimation stage. If 0,t is an initial consistent estimator of is g enerally consistent for .RO. In the simul6, then d, = T-‘C~Clr,(Q,‘)r,(e,i)’ taneous equations setting, 8, can first be estimated by nonlinear two-stage least squares (N2SLS). Let Z, = Z,(x,) be an N x M set of instruments. Then N2SLS estimation is obtained by setting r,(e) = u,(O), 6, = Z,, and d, = [T- ‘CT= 1Z:Z,] _ ’ in (7.10). Letting ,u(+ = ql(yt,xt, 0;) denote the N2SLS residuals, a consistent estimator of 0, is R, = T-‘CTz 1u~+u~“. This can then be used to obtain the nonlinear three-stage least squares (N3SLS) estimator with 6t E Z, and AT = d,‘, where bT is given by (7.31).
7.5.
Asymptotic
efjiciency
We have already seen that, given the choice of instruments G,(x,, y*,), the optimal GMM estimator is the minimum chi-square estimator. There is little reason not to compute the minimum chi-square estimator in applications since the weighting matrix needed in its computation is needed to estimate the asymptotic variance of any GMM estimator. A more difficult issue is the optimal choice of instruments. Under (7.1), any function of X, that satisfies regularity conditions can be used as instruments, which leaves a lot of possibilities. Fortunately, under assumptions GMM.l-GMM.lO, the optimal instruments can be easily characterized. For given “residual” function r,(w,, Q), the optimal choice of instruments under GMM.llGMM.10 is
G;(x,) = a;(~,)-
‘Q;(q)>
(7.32)
.I. M.
2700
where Hansen is easily GMM
WooldtYdge
C$‘(x,) = Var[u,(B,)lx,] and QP(x,) = E[VH~,(~,, U,,)lx,l. See, for example, (1982) and Bates and White (1993). The proof in Bates and White (1993) modified to cover the current setup. The asymptotic variance of the optimal estimator under GMM.l-GMM.10 is
A; l = R;’
= 0;’
=
lim TP1 i T-X
E[Qy(x,)‘{fi:(x,)}
‘Qp(x,)]
-l.
(7.33)
t=1
As an application, consider the class of M WNLS estimators studied in Section 6.1. Under WNLS.l, WNLS.2 and WNLS.3, (7.33) reduces to the variance of the WNLS estimator because Q;(x,) = - Veml(xI,O,) (see Theorem 6.2). Thus, if we start by assuming only WNLS.l and WNLS.2 then the optimal WNLS estimator is obtained by choosing W,(x,, y,) = Var(y,Ix,). The optimal choice (7.32) also has implications for estimating simultaneous equations models. Under GMM. 1-GMM 11, the efficient estimator uses instruments 0; ’ E[V,q,(B,) 1xr]. In models linear in the endogenous variables, E[V,q,(UJ Ix,] is in closed form and can be estimated; this leads to Sargan’s (1958) generalized instrumental variables estimator. For models linear in all variables this leads to the 3SLS estimator. In models nonlinear in yt the conditional expectation E[V,q,(B,)Ix,] is rarely of closed form and so an efficient GMM estimator cannot be obtained using standard parametric analysis. Nevertheless, recent work has shown that, quite generally the conditional expectations Qy(x,) and 0:(x,) need not be of a known parametric form in order to achieve the asymptotic variance of the efficient GMM estimator under GMM.lGMM.10. Under strict stationarity (in particular, when x, has fixed dimension) and weak dependence, both of these quantities can typically be nonparametrically estimated at fast enough rates to obtain the asymptotically efficient GMM estimator; see Robinson (1991a), Newey (1990) and Rilstone (1991). Newey (1990) suggests a series estimation approach. For example, in the simultaneous equations model under GMM. l-GMM.ll, Qp(x,) can be estimated by regressing the elements of V,q,(fI,f) on functions of x, that have good approximation properties (here 0; is a preliminary estimator). As a practical matter the nonparametric approach to estimating QP(x,) and 0:(x,) has limitations, especially when the dimension of x, is large. When the dimensions of w, and x, are growing, little is known about the efficiency bounds for the GMM estimator. Some work is available in the linear case; see Hansen (1985) and Hansen et al. (1988).
7.6.
Testing
Given a consistent estimator is just as in (4.37).
P, of Avar J?(H^,
- do), the Wald test of H,: ~(0,) = 0
A quasi-likelihood ratio statistic can be computed for testing hypotheses of the form covered in Section 4.6 when minimum ch_i-square estimation is used. Suppose chi-square there are Q restrictions to be tested. Let eT denote the minimum estimator with the restrictions imposed and let 4, be the unrestricted estimator. (Typically, the initial estimator of A * = 0; ’ would come from an unconstrained estimation procedure with a simple weighting matrix.) Let Qr(0) denote the objective function in (7.10). Then, under H,,
This limiting result holds under the conditions of Lemma 7.1, with the constraints satisfying the conditions of Lemma 4.3. It is established using a second order Taylor’s expansion and the fact that A, = B,. See Gallant (1987, Chapter 7, Theorem 15) for a careful proof. There is also a score test that can be applied to GMM estimation; see Gallant (1987, Chapter 7, Theorem 16).
Part III: 8. 8.1.
The globally nonstationary,
weakly dependent case
General results Introduction
We now consider the asymptotic properties of estimators when the data are not essentially stationary but are still weakly dependent. This covers the well-known case of series with deterministic trends that exhibit essentially stationary behavior around their means. Because such series generally satisfy the CLT when properly standardized, we expect estimators obtained from problems using globally nonstationary, weakly dependent data to be asymptotically normally distributed. The analysis in this part verifies this expectation. Some comments on the limitations of the following results. First, the CLTs underlying the analysis do not allow for series with exponential trends (polynomial trends of any finite order are allowed). Thus, often a transformation (such as taking the natural log) is needed to put one’s problem into this framework. Second, because the UWLLN approach cannot be used for general trending processes, we prove consistency and asymptotic normality at the same time. This means that, for consistency results, we assume twice-continuous differentiability of the objective function. Because general results for the globally nonstationary case are not readily available in the literature, in Section 8.2 we give a general result that can be used in a variety of contexts, including M-estimation and GMM estimation. This result is a simplification of a theorem in Wooldridge (1986), which builds on the work of
Weiss (197 1, 1973), Crowder (1976) and Heijmans and Magnus (1986). In Section 9 we show how the general result can be applied to M-estimation. The bottom line of the subsequent analysis is that, practically speaking, there is no difference, in performing inference, between the globally nonstationary and the essentially stationary cases, provided the stochastic processes are weakly dependent and reasonable regularity conditions hold. Recently, Andrews and McDermott (1993) reached the same conclusion using a triangular array approach to asymptotic analysis with deterministic trends.
8.2.
Asymptotic
normality
of an abstract estimator
We begin with an objective function Q=(w,~) and assume that QT(w;) is twice continuously differentiable on int(O) (in this section, we do not require 0 to be compact). Rather than assume that a minimizer of Q=(w, 0) on 0 exists, we work from the first order condition. Define the P x 1 score, S,(e), as S,(8) = V,Q,(@ and the P x P Hessian HT(0) z ViQ,(@ _= V&-(B). We assume that the P x 1 parameter vector 8,, which we are trying to estimate, is in the interior of 0. (Incidentally, in contrast to the essentially stationary case, the score and Hessian do not incorporate a scaling based on the sample size. Below we explicitly introduce a matrix scaling.) To sketch the issues that arise in the globally nonstationary case, first assume that we have an estimator e, such that S,(e,)
= 0
w.p.a.1
(8.1)
(below we prove that such an estimator exists). A mean value expansion yields, with probability approaching one, 0 = S,(OJ + ii,@,
- do),
about
0,
(8.2)
where /i, is the Hessian with rows evaluated at mean values between t?, and.0,. Assume for the moment that, for some nonstochastic, P x P positive definite diagonal matrix D,,
D;1'2ST(Oo)% Normal(O,BJ, where B, is a P x P positive (w.p.a.1) 0=
definite
(8.3) nonstochastic
matrix.
Next, use (8.2) to write
D;1'2ST(@_,)+(D;"2iiTD;1'2)[D;!2(i& - Q,)].
(8.4)
Now. if
D,;1'2[tiT - HT(Oo)]D,1'2 = o,(l),
(8.5)
which is reasonable
o;“& Under
since the mean values should
e,) = - [D,
weak dependence
“*
0; “*H&,)D,
1’2H,(e,)D, 1’21~‘[D, “2s,(e,)]
we can generally
3
be converging
to 0,, then
+ o,(l).
(8.6)
apply a law of large numbers
so that
AO,
where A0 is a P x P nonrandom,
(8.7) positive
Dy”(e^, - 0,) = - A; ’ [D, ““S,(d,)]
definite
matrix.
If so,
+ op( l),
(8.8)
and then asymptotic normality of Dy’(&, - 0,) follows in the usual way from the asymptotic normality of the score. For nonlinear estimation with trending data, Condition (8.5) is typically the most difficult to verify. One approach is to take (8.5) as a reasonable regularity condition, as is essentially done in Crowder (1976) for MLE with dependent observations. Still, it is often useful to have more basic conditions that imply this convergence of the Hessian, particularly since we have not yet shown that 8, is consistent for 0,. We use an extension of Weiss (197 1) suggested by Wooldridge (1986, Chapter 3, Proposition 4.3), which implies (8.5) and at the same time guarantezs that Di”(@, - 0,) = O,( 1). We can then derive the asymptotic normality of Dy*(0, - 19,) from (8.3) and (8.8). The idea is to impose a type of uniform convergence of the Hessian normalized by something tending to infinity at a rate slower than D,. The key differences between this approach and the type of uniform convergence used in the essentially stationary case are that now(i) we allow each element of the Hessian to be standardized by a different function of the sample size and (ii) the subset of the parameter space over which the Hessian must converge uniformly shrinks to 0, as the sample size tends to infinity. Formally, the condition we impose on the Hessian is max /IC; “‘[Zf,(d,) .‘“T where (C,} is a sequence that C,D; ’ = o(l), and Jv; = {tk@:
- H,(B)]C;
of nonstochastic
IIq2(e-e,)II
We have the following
l/* /I = o,(l), positive
definite diagonal
matrices
< l}.
such
(8.10)
theorem.
Theorem 8.1 Let {Qr: YV x 0 + R, T = 1,2,. .} be a sequence of objective functions the data space YV and the parameter space 0 c Rp. Assume that
defined on
J.,M. Wooldridye
2704
(i) H,Eint(O); (ii) Q, satisfies the standard measurability and second order differentiability conditions on W x 0, T = 1,2,. . . There are sequences of nonstochastic positive definite diagonal P x P matrices (C,:T= I,2 ,... } and {D,:T= 1,2 ,... } such that (iii) (a) C,D;‘+O as T+ XI; (b) (8.9) holds with .,+‘t defined by (8.10); where A0 is a P x P nonrandom, positive (iv) (a) 0; 1’2H,(Bo)Df “’ %A,, definite matrix; (b) 0; 1’2S,(0,) LNormal(0, B,), where B, is a P x P nonrandom, positive definite matrix. Then there exists a sequence of estimators (8,: T = 1,2,. . .} such that (8.1) and (8.6) hold. Further, Di”(e^, - 0,) -% Normal(0, A; ’ BoAi ‘), and therefore (8.11) (For proof see Appendix.) Condition (iv)(b) serves to identify 8, as the parameter vector such that E[S,(O,)] = 0; (iv)(a) then ensures that 0, is the unique value of 0 that solves the asymptotic first order condition. Given (8.11), it is natural to define Avar(g,)
= 0; 1’2A61BoA~‘D,
li2,
(8.12)
which shrinks to zero as T+ co, as we would expect. Formula (8.12) reduces to the expression for the asymptotic variance we derived in the essentially stationary case when D, ‘I2 = T- ‘I21p, where ZP is the P x P identity matrix. As can be seen in (8.12), the norming matrix D, ~ and therefore the kinds of trends in the underlying data ~ clearly affects Avar(8,). It is natural to conclude that the form of D, affects how one estimates the asymptotic variance of 6, and, therefore, how one performs inference about B,._Fortunately, this is not the case. In practice, consistent estimators A^r of A, and B, of B, incorporate the norming D-T li2 in such a way that D,1’2A^,‘jtj
does not depend
A^-‘D-1’2 TT
(8.13)
T
on D,. For example,
AT = 0; 1’2HT(@T)D, 1’2
under
the conditions
of Theorem
8.1, (8.14)
is a consistent estimator of A,. For illustration, suppose that B, = A0 which, as we saw in the essentially stationary case, occurs under classical assumptions. Then
2705
Avar(6,)
= 0;
A%(&)
“‘A;
‘0;
‘I2 and
= D;1’2[D;1’2HT(&)D;1’2]-1D;1’2
= [H,(&)]-‘.
(8.15)
Equation (8.15) shows that we estimate the asymptotic variance of @, by the inverse of the estimated Hessian, exuctly as would occur in the essentially stationary case when B, = A,. The researcher need not even know at what rate each parameter estimator is converging to its limiting normal distribution. A similar observation holds even when B, # Ao, because consistent estimators of B, typically have the form BT = 0; l12kTD; “‘, where A?, does not depend on scaling by a function of the sample size. Then A%(H^,) = fi;
’ A?#;
’
(8.16)
does not depend on a particular scaling. We see this explicitly in Section 9 on Mestimation. One apparently restrictive feature of Theorem 8.1 is that A0 and B, are assumed to be positive definite. There are several problems with multiple trending processes for which this is not true. As a simple example, consider the linear model y, = a,, + y,t + p,,zt + u,, where all quantities are scalars, z, = a, + a, t + u,, a, # 0, and. {(u,, II,)} is a strictly stationary and weakly dependent process with E(u,u,) = 0. Let x, = (1, t, z,), and assume we estimate this model by OLS. Then HT(Q) = H, = X’X, and it can be shown that the norming matrix that makes XX converge in probability is D, = diag{ T, T3, T3}. In particular,
which has rank two instead of three. Because of examples like this, we should discuss how Theorem 8.1 can handle these cases. Usually it is possible to find a P x P nonsingular matrix M, and new scaling matrix, say G, (that depends on M), such that G, “2(M’H,(0,)M)G,
1’2 -% A,M
(8.17)
and C, 1’2M’S,(0,)
A Normal(0,
(8.18)
BF),
where Af’ and By are positive definite. Then, by an argument to the above (multiply (8.2) through by M’),
entirely
analogous
cy
[M-
ye,- %J] =
- [G, 1’2M’H,(%,)MG,
l’*] - '[G, 1’2Ms,(%,)] + op( 1)
= - {A~}-‘[G,“*M’S,(%,)]
(8.19)
+ o,(l)
(8.20)
~N0rmal(O,(z4~}-‘B~{A~}~~).
Thus, inference can be carried out on 6, = Mm- ‘BO using Theorem 8.1. Fortunately, in order to perform inference about %,, one need not actually find a linear combination of i?, - 0, that, when properly scaled, has a nondegenerate limiting distribution (that is, one need not actually find the matrix M). It is enough to know that one exists, and this is virtually always the case in identified models. To see why the choice of M plays no role in performing inference about %0,suppose that interest lies in testing a linear hypothesis about BO: H,: R%, = r.
(8.21)
This is stated equivalently
in terms of 6, = M-‘BO as
He: PS, = r,
(8.22)
where P = RM. Suppose for simplicity
that A0 = B,, and A_: = Br, a case that arises
under-some assumptions covered earlier. Then, with 6, = M- ‘eT, A&$($,) Mp lH;l(M’-l and the Wald statistic for testing (8.22) is
=
W,=(P~T-r)f[PM-‘fi~‘(M’-‘P’~‘(P6^,-r) = (R8, - r)‘[Rfi;
‘R’] -‘(Rf?,
- r),
(8.23)
which is the usual Wald statistic if we had started with gT and its estimated asymptotic variance fi; ‘. Although the reasoning is more complicated, similar arguments can be made for testing nonlinear hypotheses of the form H,: c(%,) = 0. In general, t-statistics, Wald statistics, quasi-LR statistics and LM statistics can be computed in the usual ways without worrying about the rate of convergence of each estimator. Of course, the rates of convergence might be of interest for other reasons, such as to facilitate efficiency comparisons.
9. 9.1.
Asymptotic
normality of M-estimators
Asymptotic
normality
We now show how Theorem
QT(W0) = i 1=1
qt(w,,%I
8.1 can be applied
when the optimand
is
(9.1)
Ch. 45:
Estimation
and It@-ence,for
Dependent
Processes
2707
In applying Theorem 8.1 to M-estimation, we need to be able to verify Conditions (iii) and (iv)(b) of Theorem 8.1. First consider Assumption (iv)(b), which requires that the score evaluated at 19, satisfies the CLT: -% Normal(0,
B,),
where s,(e) = VOql(wf, 0)’ and B, is positive definite. If (9.2) is to hold at all, usually easy to find. This is because 0; 112 must satisfy
(9.2)
D, is
(9.3)
in particular, the matrix on the left hand side of (9.3) must be bounded and at the same time uniformly positive definite (this latter requirement prevents us from choosing D, too large). Once D, has been found, a CLT for trending processes can be applied [see McLeish (1974) for martingale difference sequences, Wooldridge and White (1989) and Davidson (1992) for near epoch dependent functions of mixing processes]. In some cases a functional CLT ~ see Part IV - can be used to establish (9.2). As in the essentially stationary case, {s,(0,): t = 1,2,. .} can be serially correlated, although under the complete dynamic specification assumptions of Sections 5 and 6 {s,(0,)} is a martingale difference sequence with respect to Qt = {wi, . . . , wr}. Thus, there are many examples for which {s,(0,)} is serially uncorrelated, although Var[s,(8J] will not be bounded. After the CLT has been argued to hold, we need to establish the key condition (8.9). Let htij(e) denote the (i,j)th element of h,(8). Especially if qt(.) is thrice continuously differentiable on int(O), we can often establish the inequality
maxI
h,ij(e)
-
h,ij(eJ
Id
bTtij
II8 - 8, II2,
(9.4)
BEND,
where bTtij is a positive random variable. Note that /I 8 - 8, II2 d (cTcl))- ‘I2 for all &Jf; where cr.(i) is the smallest element of C,; therefore, if we can control the rate at which bTtij grows, (8.9) is easily shown. The proof of the following lemma is straightforward. Lemma 9.1 Assume
that for all i, j = 1,2,. . . , P,
(i) inequality
(ii)
(9.4) holds;
Xi1b,.fij =OpiJG).
Then Condition
(iii) of Theorem
8.1 holds for M-estimation.
J.M.
2708
Wooldridyu
To see that the conditions of this lemma are reasonable, consider what they entail in the strictly stationary case. Suppose that q(w,, 6) is thrice continuously differentiable on int(O) with the third derivative dominated by a function with finite moment (see, for example, Condition (iii) in Theorem 4.1). Then we can take bTtij = btij to be a bound on the absolute value of the third derivatives of q,(B) on int(O). Because {b,,,} is stationary and ergodic with finite moment,
by the WLLN. Because D, = TIP we can take C, = T”Zp for any a < 1 and have (iii)(a) of Theorem 8.1 satisfied. For condition (ii) of Lemma 9.1 to hold we require C,‘= 1btij = o(T{~“)“); by (9.5) this holds for any a > 5. Of course Lemma 9.1 is useful only if it can be applied to nonlinear models with trending data. This is the case, at least for some applications, but finding bTtij and verifying (ii) can be tedious. For illustrative purposes, we given a simple nonlinear least squares example. Consider the model y, = CI,+ f(z,, Y,) + BJ + % where {(z,, uJ: t = 1,2,. . .) is a strictly that
(9.6) stationary,
weakly dependent
sequence
E(u,Iz,) = 0.
such
(9.7)
Here, ~1,and /3, are scalars and y, is a J x 1 vector. We assume that f(z,, .) is thrice continuously differentiable on the open set I’, with y,Eint(r). Note that if 8, # 0 then {y,} contains a linear time trend (extending this example to the case where y, has a polynomial time trend of any finite order is straightforward). Define the regression function m,(x,, 0) E @+ Ski,, Y) + Pt. The score for observation
t for nonlinear
(9.8) least squares
estimation
s,(@= - V,m,(@‘u,(~), where, for model (9.8), V,m,(@ = (1, V,f(w,, y), t). Letting forward to show that
is (9.9)
s, = $‘(eJ,
It is straight-
Ch. 45: Estimation
and Inference
forDependent
2709
Processes
where D, E diag{ T, TI,, T3}. (It helps to remember that c,‘=,t = O(T’) and CT= 1t2 = 0( T3), and to first assume that {ut} is serially uncorrelated.) As we now show, it suffices to set C, E diag{ T”‘, Ta’ZJ, Ta2} for some a, < 1 and a2 < 3. For nonlinear least squares estimation, there are always two pieces of the Hessian that need to be analyzed for convergence. The first is V,m,(O)‘V,m,(~) and the second is V~m,(@u,(O), where V,‘m,(O) is the P x P second derivative of m,(d). What must be shown is that
EygK:A,il I
Vd%(WfP,j(e)
-
VtP,i(~,)VfP$j(Q,)
I = O,(l)
and
for all i,j= l,..., P. By looking at the form of m,(O)and recalling the discussion of the stationary case, the only terms that present a new challenge are the cross products V,f,,(y)t, i = 1,. . , J, and the term V~ftih(~)ut(0), i, h = 1,. . . , J. Let us look at the cross product term V,f&)t. Because this is differentiable, and because cTcl) = T”‘, to verify the conditions of Lemma 9.1 it suffices to show that C,‘= I gtit = op( T”’ +02/‘), where gti 1s a b ound on the partial derivatives of VJJy) for all YET. But {gti} is a stationary sequence with finite moment, so it suffices to show that CT= 1 [gti - E(g,,)]t = op( T”’ +a2/2) and C,‘= 1t = op( T”’ +a2’2). By the WLLN [see Andrews (1988)], XT= 1[g,i - E(g,i)]t = 0p(T3’*), and CT= 1t = O(T*). Thus, we must choose (a, + 42) > 2; but under the restrictions a, < 1 and a2 < 3, (a, + u,/2) can be made arbitrarily close to 4. The sum containing the terms V:f,ij(y)u,(0) can be handled in a similar manner by writing u,(O) = u, + [f(z,, y,) - f(z,, y)] + (B, - fl)t and using the fact that the third derivative of j&y) is bounded. Thus, we can conclude the NLS estimator of (c(,, y,, /?,,) is consistent and asymptotically normal under fairly weak assumptions. Note that we have allowed for dynamic misspecification and conditional heteroskedasticity in {u,}. Another example is the nonlinear trend model y, = cr,t@”+ u,, where E(uf) is uniformly bounded, E(u,) = 0, and {Us} is an essentially stationary process. It turns out that we must assume that j?, > - $; otherwise, there is not enough information in the sample to consistently estimate the parameters. The apparent simplicity of this two parameter model for E(y,) is misleading. To carefully verify Condition (iii) of Theorem 8.1 for the NLS estimator is very tedious. We will not go through the details here, but Lemma 9.1 can be used. Wooldridge (1986, Chapter 3, Corollary 6.3) gives the details.
J.M.
2710
9.2.
Estimating
the asymptotic
Wooldridqr
variance
Estimation of A, and t#O follows in-much the same way as in the essentially stationary case. Define H, E CT= 1h,(d,). Then, under the conditions of Theorem +A,. The estimator based on a&8,) (Section 4.5) is also 8.1, D;1’21?TD;1’2 ’ generally consistent when it is available. If the score is serially uncorrelated, and {st(0)s,(O)‘) satisfies Condition (iii) in Theorem
8.1 (in addition
to {/I,(@)), then 0;
‘/‘kTD;
1/2 -% B,, where (9.10)
As mentioned earlier, the absence of serial correlation in the score follows in the same circumstances as in the essentially stationary case. The asymptotic variance of t?, is estimated as (9.11)
and this is one of the matrices used for obtaining asymptotic standard errors and for forming Wald statistics derived for the essentially stationary case with serially uncorrelated score. Equation (9.11) again shows that, for practical purposes, the scaling matrix D, has disappeared from the analysis. The absence of a scaling matrix in A%(&) is entirely consistent with observed econometric practice; one rarely sees consideration of scaling matrices appearing in applied work, and this is justifiably so. To complete the analysis of the globally nonstationary case, we should have methods of estimating B, when {s,(O,)} IS serially correlated. Conceptually, this causes no problems. The Hansen-Hodrick estimator and its weighted versions can be shown to be generally consistent for B, when the autocovariances of the score are known to be zero after a certain point. We conjecture that the general serial-correlation-robust estimators of the score covered in Section 4.5, when properly standardized, remain consistent under reasonably weak conditions. Unfortunately, we know of no formal results along these lines. This is an important topic for future research.
Part IV. 10. 10.1.
The nonergodic case
General results Introduction
In this final part we turn to inference in models with nonergodic processes. Actually, the primary distinctions between this part and the previous ones are that the score,
Ch. 45: Estimation
und Infirencefhr
Dependent
Procrss~s
2711
when properly standardized, is not weakly dependent (so it does not have a limiting normal distribution), and the Hessian, when properly standardized, does not necessarily converge in probability to a nonstochastic matrix. Instead, the standardized Hessian and score converge jointly in distribution, typically to a function of multivariate Brownian motion. The functional CLT (FCLT) or invariance principle plays a prominent role for determining the limiting distributions. Because of the nonstandard limit behavior of the score and Hessian, the properly standardized estimator and related test statistics do not necessarily have the usual normal and chi-square limiting distributions. The limiting random variable can depend on unknown parameters, which makes asymptotic theory difficult to apply in practice. Recently there has been much work on estimation of linear models with nonergodic processes. A short list of references includes Phillips (1987, 1988, 1991), Stock (1987), Phillips and Durlauf (1986), Park and Phillips (1988, 1989), Phillips and Hansen (1990) and Sims et al. (1990). Fortunately, some of this research has focused on finding statistics with either standard limiting distributions or at least limiting distributions that are free of nuisance parameters, and which can therefore be used for inference. Some of these results are given in Section 11; Watson (this Handbook) gives a more extensive treatment. In Section 10.2 we state a general theorem for nonlinear models that is a straightforward extension of Theorem 8.1. Section 11 covers some applications to linear models. Section 12 sketches how the general theorem can be applied to a particular nonlinear regression model. All of our examples are for processes that are integrated of order one. It is possible to apply them to explosive processes as in Domowitz and Muus (1988) and Jeganathan (1988). An interesting open question is whether these results, or whether the results in Part II, can be applied to estimation with strongly dependent data.
10.2.
Abstract
limiting distribution
result
We first analyze a setup very similar to that in Section 8.2. The score s,(O) and Hessian Hr.(O) of the objective function QT(0) are defined there. Theorem
10.1
Let Assumptions (i)-(iii) of Theorem 8.1 hold. Replace Assumption (iv) with (iv) (D; “2HT(0,)O; li2, 0; 1’2S,(0,)) -%(,d,, Yp,), where &, is positive definite with probability one. Then there exists a sequence (8.6) hold. Further,
of estimators
D;"(B,- 0,)%c4,'.4Po.
{&+ T = 1,2,.
.} such that (8.1) and
(10.1)
The proof of Theorem 10.1 is identical to that of Theorem 8.1 up to establishing the first order representation (8.6). Then, (10.1) follows by the continuous convergence theorem and Assumption (iv). For linear models Condition (iii) (see Theorem 8.1) is trivially satisfied because the Hessian does not depend on 8, so the difficulty lies in establishing (iv). Of course, one hardly needs a result like Theorem 10.1 to analyze linear models since the estimators are given in closed form. In Section 11 we show how the functional CLT can be used to establish (iv) and the distribution of &‘;‘YO for a class of linear models. As stated in the introduction, at this point we have no guarantee that the distribution of &‘Sy‘Y. can be used for inference, as it may depend in an intractable way on nuisance parameters. In addition to having to establish (iv), for nonlinear models we also have to verify (iii). As we saw in Sections 8 and 9, this is nontrivial for models with trending but weakly dependent data. It is even more difficult when we allow for nonergodic processes. For future applications of Theorem 10.1, more research is needed to see how (iii) can be verified. In Section 12 we show how the FCLT can be used to verify (iii) for a particular nonlinear model. Theorem 10.1 assumes that d4, is positive definite with probability one. As with Theorem 8.1, one might have to employ linear transformations to (t? - 0,) to ensure that this holds in particular applications. But, unlike the trend-stationary case, different linear combinations may have fundamentally different limiting distributions, and so some care is required for inference about the parameters of inference. More on this in Section 11 and in Watson (this Handbook). The notion of locally asymptotically mixed normal (LAMN) families [for example, LeCam (1986) Jeganathan (1980, 1988) and Phillips (1989)] has played an important role in studying efficiency issues in nonergodic models. It turns out to be closely related to the possibility of finding asymptotically chi-square test statistics. The LAMN condition was originally applied to log-likelihood functions, but, as shown in Phillips (1989) it can be applied to more general criterion functions. We do not consider the LAMN condition or its extensions [Phillips (1989, 1991)] here, as it involves a substantial technical burden, and for the purposes of establishing limiting distributions the full LAMN machinery is not needed. Nevertheless, it is informative to see what LAMN entails in Theorem 10.1. For purposes of inference, the important consequence of LAMN is that Assumption (iv) of Theorem 10.1 is satisfied with
where 3V0- Normal(O,I,) and is independent of &O. Thus, the LAMN condition restricts how the random quantities Cd0 and
and the conditions
of Theorem
[D; “2H,(N,)O;
10.1, under
“‘]“‘[o$“(e,
LAMN
it follows immediately
- Q,)] % .c4; ‘j2Y0 - Normal(0,
Thus, when normed by a random matrix, Oi”(&, standard normal distribution. From (10.3),
- 0,) has a limiting
Ip).
that (10.3)
multivariate
(8,.- O”)‘fqqJ(O^, - 0,) Lx;,
(10.4)
and so a quadratic form in the estimator has a standard limiting distribution. Generally, when the LAMN condition holds, there exist quadratic forms that have limiting chi-square distributions that can be used for inference about do.
11.
Some results for linear models
We begin with the linear model Y, = % +
xtBo+ u,>
where U, in an T(O),zero-mean x, = x f-l
+
“t,
t= 1,2,...,
(11.1)
process, and the 1 x K vector x, is an I(1) process: (11.2)
where {u,} is an I(O), zero-mean process, and there are no cointegrating relationships among the X, (x0 is an arbitrary random vector). These assumptions imply that y, and x, are cointegrated [see Engle and Granger (1987), Watson (this Handbook)] and, given the normalization that the coefficient on y, is unity, there is only one cointegrating vector. (Technically, we allow for /I, = 0, in which case y, is I(O).) Due to the work of Stock (1987), Phillips and Durlauf (1986), Phillips (1988), Park and Phillips (1988) and others, it is now known that the limiting distribution of the OLS estimator of fi, is nonnormal. To derive the limiting distribution requires the notion of a functional CLT. Dejinition
11 .I
Let (w,: t = 1,2,. . .} be an M x 1 strictly stationary, weakly dependent process such that (i) E($w,) < ~0, (ii) E( WJ = 0, (iii) 0 = lifnn,,m Var(T _ 1’2CT= 1wt) exists. Then {w,} is said to satisfy the functional central limit theorem (FCLT) (or invariance
2714
principle) if the stochastic
B,(r)
= T-1’2
[Trl 2
w,,
process
{B,: T = 1,2,. . .}, defined
by
O
r=1
converges covariance
in distribution to g&Z(O), an M-dimensional Brownian matrix 0. Here, [Tr] denotes the integer part of Tr.
motion
with
The notion of the process B, converging in distribution to g&‘(Q) is defined in terms of weak convergence of their probability distributions. Weak convergence is the extension to general metric spaces of the usual notion of convergence in distribution over finite dimensional Euclidean spaces. The reader is referred to Billingsley (1968, Chapters l-5) for definitions and background material. The use of the FCLT to obtain limiting distribution results for estimators was pioneered by Phillips (1986, 1987) for a univariate autoregression with a unit root. The multivariate FCLT was first used in econometric applications by Phillips and Durlauf (1986). The FCLT is known to hold under conditions analogous to the CLT; see, for example, Billingsley (1968), McLeish (1977), Herndorff (1984) Phillips and Durlauf (1986) and Wooldridge and White (1988). Although we have assumed strict stationarity of (wt} for simplicity, this is not necessary; certain forms of bounded heterogeneity are allowed, as in the references. In what follows, we simply assume that the FCLT holds. The following lemma is very useful for analyzing least squares estimators with I(1) processes. Parts (i)-(iv) were proven by Park and Phillips (1988). Lemma 11.1
Let {wt = (u:, II:)‘} be an M x 1 strictly stationary, weakly dependent stochastic process with finite second moments and zero mean. Here, u, is M, x 1 and u, is M, x 1. Define
(11.3)
and
(11.4)
z, *= E(qu:),
A,, = f .s= 1
~(u,u:_,),
(11.5)
and d,,
= z,,
(11.6)
+ A2r.
Assume that a,, and fizz are positive definite. Define X, as in (11.2) (note that X, is a column vector for the purposes of stating the lemma). Let B denote a Brownian motion with covariance matrix 0, and partition B as B = (B;, B;)‘. Thus, B, &JA(B, r) and B, - %?&‘(0,,), and these processes are independent if and only if R,, = 0. Then, under additional regularity conditions, the following hold jointly as well as separately: 1
(i) Tp3’2 i
x,
A
M-)
dr,
s0
r=1
1
(ii) T-5’2
i
tx,
A
rB,(r)
1
T
(iii) Tp2 1 x1x:% t=1
(iv) T-3’2
T-’
B2(r)B2(r)’ dr, s 0
5
tu, +d
I=1
(v)
dr,
s0
t=1
i t=1
‘rdB,(r), s 0
’ B,(r)dB,(r)‘+d,,.
V, ’ - d s 0
As shown in Park and Phillips (1988), parts (i))(iv) are an immediate consequence of the convergence in distribution of B, to a&(n) and the continuous convergence theorem [Billingsley (1968, Theorem 5.1)]. For example, part (i) follows because X, = T112BT2(t/T), and so T-3’2t$l
x, = T-’
fl
BT2(t/T)
= [‘B,,(r)dr
%s’B,(r)dr,
0
0
where the second equality follows because BT2(.) is stepwise continuous and the final convergence result follows because j,!,B,(r)dr is a continuous function of B. For part (ii), tx, = Tli2tBT,(t/T); then
Tm5j2 ,zl tx, = Tp ’ f. (fIT)B,,(t/T) r=1
Parts
= I1 rB,,(r)dr 0
L
j1 rB,(r) dr. 0
(iii) and (iv) are handled similarly, where part (iv) uses the fact that U, = - BTl((t - 1)/T)]. Part (v) is more difficult to verify, and does not follow simply from the convergence of B, to aM(0). The same kind of proof for T”2[BT,(t/T)
parts (i))(iv) initially appears to work, but the final convergence in distribution does not follow from the continuous convergence theorem because the right hand side in part (v) is not a continuous function of tiA(fl). Nevertheless, Hansen (1992b) shows that (v) holds under fairly general conditions. For analyzing the model (11.1) with x, 1 x K we simply replace u, with II:, x, with xi, B with B’, and so on. The assumption that On22 is positive definite implies that the elements of x, are not cointegrated, a restriction that should be borne in mind in what follows. With this lemma in hand, we can establish much of the needed asymptotic distribution theory for linear regressions with I(1) processes. Conclusions (ii) and (iv) are needed to allow for trend-stationary processes and I(1) processes with drift. To analyze (1 l.l), we only need the conclusions of this lemma when U, is a scalar. Thus, in Lemma 11.1, Z,,, AZ1, and A,, are K x 1 vectors, and the long run variance of {ut} is a,,
= E(r+,)
+ :
(11.7)
{E(U:_ju,) + E(u;u,_j,},
j=l
the long run covariance
between
{II,} and (u,} is (11.8)
J-221 = Wu,) + j=: 1 {-wJ_ju,)+ E(u;+ju,)}, and the long run variance
or1 = E(uf) + 2 g j=
of {Us} is
(11.9)
E(u,u,_j). 1
Define the 1 x (K + 1) stochastic process Br(r) = (BT1(r), BT2(r)) and the limit B z (B,, B,) as the transpose of that in Lemma 11.1. Let &fir denote the OLS estimators from the regression y, To obtain
on Lxt,
t= l,...,T.
the limiting
&.=/Jo+
I,il
(11.10)
distribution
(x,-X)‘(x,-f)
of j?,, write
1-lElT
(x,-
vu*
(11.11)
or T-'
i
1=1
(x,-X)‘(x,-f)
1
-lT-’
i 1=1
(x,--)‘u,,
(11.12)
assuming that [T- ‘C,‘= 1(_rt- X)‘(x, - X)] - ’ exists w.p.a.1. Using Lemma 11.1, Park and Phillips (1988) have derived the limiting distribution of T(fl, - fi,),
[S
B,(r)‘B,(r) dr
1
B,(r)‘dB,(r)+ A,, 1 ,
(11.13)
0
0
where B2 denotes
1 [S -1
1
m, - B,) A
the demeaned
process
B,: for each 0 ,< r d 1,
1
B2(s) ds.
B2(r) = B2(r) -
(11.14)
s0 is nonsingular with Incidentally, because a,, is positive definite, JAfi,(r)‘fi,(r)dr probability one, so that the distribution of the right hand side of (11.13) is well-defined and fir exists w.p.a.1. Generally, the distribution in (11.13) depends, in an intractable way, on the nuisance parameters f12i and d,, . But, there is one case where it can be applied immediately, and that is when the regressors are strictly exogenous in the sense that all t and s.
E(Ax;u,) = 0,
(11.15)
Assumption (11.15) implies that d, i = a,, = 0, so that B, and B, are independent Brownian motions. Letting W~_E wi 1, Park and Phillips (1988) argue that, when a,, = 0, the distribution of JhB,(r)‘dB,(r), conditional on B,, is normal:
s 1
B2(r)‘B,(r) dr .
~2W’dBlWl.,
(11.16)
0
A useful heuristic device (for which I am grateful to Bruce Hansen) helps us to see where (11.16) comes from. Take the definition of a stochastic integral to be
s 1
82W’dBl(r)= plim T+mr=l
0
=
5 B,(t/T)‘[B,(t/T)
- B,((t - 1)/T)]
plim f: B2(r)‘FTt, T-m?=1
where the srt are i.i.d. Normal(O,o,” T-l) by definition of a Brownian motion, are independent of the process B2(.). Therefore, conditional on B,(.),
,tl
&@y’Tt
-
O,oiT-’
Normal (
Taking
f: B,(t/T)‘B,(t/T) t=1
the limit of both sides yields (11.16).
. >
and
2718
Given
(11.16), we have 1
1
-l/Z
(S
1 (1
But (%‘x/T’)-
‘I2 % [ihB,(r)‘&(r)dr]-
B2(r)'B2(r) dr
0
~,(r)‘dB,(r) -
Normal(0,
w:Z,).
(11.17)
0
(X’X/T2)-1/2T(b
‘I’, so that
- p,) %Normal(O,
wi1,),
(11.18)
where x denotes the T x K matrix with tth row x, - 5 In practice, that 6, can be treated as being asymptotically normal. Loosely, --
BT2 Normal(fi,,o~(X’X)-‘).
this means
(11.19)
Except for the presence of w,’ in place of 0: c E(u:), (11.19) is identical to the usual approximation for the slope coefficients in regressions with essentially stationary, weakly dependent processes and appropriately homoskedastic and serially uncorrelated errors. Note that this is a problem for which the LAMN condition mentioned in Section 10.2 is satisfied. Asymptotically valid t-statistics and Wald or F-statistics can be obtained by replacing the usual estimate of a,2,8,2 (the square of the standard error of the regression) with a consistent estimator of 0,.2 A consistent estimator &),’is obtained by applying the estimators in Section 4.5 to the OLS residuals {Li,} (for example, (4.30)). Asymptotically valid t-statistics are obtained by multiplying the usual t-statistics by the ratio BJQ,; asymptotically valid Wald or F-statistics are obtained by multiplying the usual Wald or F-statistic by the ratio 8:/d:. Other than the strict exogeneity case, there is at least one other practical application of (11.13) which is testing for a unit root, as in Dickey and Fuller (1979) and Phillips (1987). Although the t-statistic does not have a limiting standard normal distribution, its limiting distribution is either free of nuisance parameters (as in the Dickey-Fuller setup) or a simple transformation of it is free of nuisance parameters (as in Phillips); see Stock (this Handbook). We now extend the model by allowing for I(0) regressors in addition to the I(1) regressors x,. Let zf be a 1 x J vector I(0) process (these can be any I(0) variables, including leads and lags of Ax, and lags of Ay, if y, is I(1)). The model is Y, = CT,+
xtBo+ zryo+ e,,
where e, is an I(0) zero-mean
(11.20) process,
and we assume
that
E(z:e,) = 0; this condition
allows us to identify
(11.21) the vector y, on the I(0) variables.
From
the
Ch. 45:
Estimation
and Inference
for Dependent
2719
Processes
results for model (11. l), we know that 8, can be consistently estimated by ignoring the I(0) process Z, and obtaining /?, from the regression (11.10). This is easily seen by writing (11.20) as Y, = ?, +
XrBo + u,,
(11.22)
where u, = e, + zty, - E(z,y,) and q0 = ~1,+ E(z,~,). Then u, is I(0) with zero mean, and so the limiting distribution of T(B, - p,) is given in Equation (11.13). As is now fairly well known [for example, Phillips and Durlauf (1986), Park and Phillips (1988)], omitting I(0) regressors does not affect our ability to consistently estimate fl, when x, is I(1) and has no cointegrating relationships among its elements. For a variety of reasons we need to know what happens when Z, is included in the regression. Let di,, bT, and $jTdenote the OLS estimators from the regression Y,
on
Lx,, z,,
t=l,...,T.
(11.23)
The following.!emma is useful for finding the asymptotic distribution of properly standardized B, and jjT. Its proof uses Lemma 11.1 and is given in Wooldridge (1991c, Lemma 5.1). Lemma 11.2 Let {n,}, {z,}, and (e,} satisfy (11.2) and (11.21). Let {X,: t = l,..., T} denote demeaned x, and let {t,: t = 1,. . . , T} denote the demeaned z,. Let f, denote 1 x K residuals from the regression -% on
t= l,...,T
l,bt,
and let ?( denote
Then the following (i) T-’
i
(iii) T-’
from the regression
t= l,...,T. asymptotic
.i$f, = T-2
1=1
(ii) T-l
(11.24)
the 1 x J residuals
z, on l,q,
(11.25) equivalences
f: _i$ft + o,(l); t=1
f: if:e, = T-’ 1=1
f: i:e, + o,(l); 1=1
f: i’;it, = T - ’ f: i;z, + o,(l); t=1
(iv) T-1’2 i t=1
the the
t=1
Z:e, = T-“2 ,tl i:e, + op( 1).
hold:
Note that the .i!t are the residuals from the regression x, on 1. Thus, Lemma 11.2 says that, for certain purposes, these can replace the residuals from the regression x, on 1, z,. A similar statement holds for Z, and i;. Combined with standard results from least squares mechanics, Lemma 11.2 yields straightforward asymptotic representations for fl, and jT. Write
(11.26)
Now, by Lemma
11.2,
The first term on the except that e, replaces from Lemma 11.1. Let (11.9) with e, replacing
W-BP+
right hand side of (11.27) is exactly of the form in (11.12) u,. Thus, its limiting distribution can be obtained directly w,‘~w:~,Z;r, and d”,, be defined as in (11.5), (11.6) and u,. Then, from Lemma 11.1,
l?,(r)‘B,(r)
dr
&(r)' dB’, (r) + A’, 1 ,
(11.28)
where Bf is a SJJz’(oi) process. This is as in (11.13), except that it is the covariogram of {(~,,e,): t = 1,2,. . .} which shows up in the asymptotic distribution. Thus, including I(0) regressors when estimating j?, changes the implicit errors in the relationship. The form of the limiting distribution is unaltered, but the asymptotic distributions of T(fi - b,) and T(fl- p,) are not the same. From (11.27) and the earlier discussion it follows that if the x, are strictly exogenous in (11.20) that is E(Ax:e,) = 0,
(11.29)
all t and S,
then we can treat a, as approximately
a, z Normal(/?,,o,2(_$?aii-1),
normal.
As before,
(11.30)
where 2 is the T x K matrix with tth row given by the 1 x K vector residual P,. The important difference between J.ll.30) and (11.19) is that of replaces wi. Using Lemma 11.2(i) we could replace X with x, but this would be unnatural since PT is obtained from (11.23). Note that the validity of (11.30) as a heuristic does not require strict exogeneity of the I(0) variables, z,; only (11.2 1) and (11.29) are assumed.
Under strict exogeneity the simple adjustments to t- and F-statistics discussed for model (11.1) apply here as well, except that 0% is estimated using the residuals &, from regression (11.23). Next, consider the coefficient estimates on the I(0) variables. As shown by Phillips (1988) Park and Phillips (1989) Sims et al. (1990) and others, the asymptotics for jjT are standard (regardless of whether or not there is any kind of strict exogeneity). This result has been derived under a variety of assumptions and in a number of ways. Given Lemma 11.2, it is most easily established by writing jjT in partial regression form. By Lemma 11.2 and standard results such as T - l”Cf’ 1iie, = Op( 1) (by the CLT), we have
(11.31)
Thus, the I( 1) regressors have disappeared entirely from the first order asymptotic representation for yT. Under standard assumptions for strictly stationary processes, the right hand side of (11.31) is asymptotically normally distributed. However, unless (e,} and (zie,} are serially
uncorrelated
(11.32)
and (11.33) the usual OLS covariance matrix estimator and test statistics will generally be invalid. This is as in regression with essentially stationary data. Given the OLS residuals (6,: t = 1,2,. . .} from the regression (11.23), standard serial-correlationrobust covariance matrix estimators can applied to {Q,}, say &. The asymptotic variance
of fi(jj,
- y,) is estimated
(.?‘.@T)-‘&if’~/T)-l.
as (11.34)
If (11.32) and (11.33) both hold, as in Sims et al. (1990), then the usual OLS variance matrix estimator for j’r is valid. Therefore, standard t- and F-statistics for testing hypotheses about y, are valid under no serial correlation and homoskedasticity assumptions. Note that these have nothing to do with the I(l), noncointegrated regressors x,. This limiting distribution theory can be applied to the augmented Dickey-Fuller regression under the null of a unit root. The model is
Wooldridge (1991c)l. Also important is that the leads and lags estimator of Boj (with or without z, included in the leads and lags regression) is generally an inconsistent estimator of poj under (11.35). If x,~ is not cointegrated with x,(~) then the leads and lags estimator of /Ioj produces asymptotically normal t-statistics for poj, just as before. The discussion in the preceding paragraph shows that the cointegrating properties of X, need to be known before currently available methods can be used for inference about fl,. For further discussion and examples, see Phillips (1988), Park and Phillips (1989), Sims et al. (1990), Wooldridge (1991~) and Watson (this Handbook). The preceding results have extensions to I(1) processes with drift, integrated processes of higher order, and multivariate regression and instrumental variables techniques. See Phillips (1988) Park and Phillips (1988, 1989), Sims et al. (1990) and Phillips and Hansen (1990).
12.
Applications
to nonlinear
models
In this section we sketch how Theorem We wish to estimate the model Y, =
a, + “I%,,Y,)+ x,/j,+ u,,
10.1 can be applied
to a nonlinear
model.
(12.1) (12.2)
E(u,Iz,)=O
by nonlinear least squares, where {(z,, u,): t = 1,2,. . .} is strictly stationary and {.qt= 1,2,...}isal x K I( 1) process without drift, as in (11.2), with no cointegrating relationships among the elements of x,. y, is an M x 1 vector. We assume that the gradient of f(z,,~), V,f(z,, y,), contains no constant elements, so that both its variance and long run variance are positive definite. Letting m,(Q) = tl + f(z,, y) + X,/I, we have V,m,(H) = [ 1, VJ,(r), x,]. The score of the NLS objective function for observation t is
s,(Q)= - v,m,(e)‘u,(e).
(12.3)
When we evaluate this at Ho we get ~~(0,) = - [u,, Vyf,(Yo)u,, x,u,]‘. From the CLT and (12.2), T - “‘Cl’= 1Vyf,(yO)‘ut has a limiting normal distribution. From Lemma 11.1, T - ‘Et’= 1xiu, converges in distribution to a functional of Brownian motion. Given this, it is clear that the scaling matrix must be
=!
TO
D,
0
0
TIM
0
0
0 T’t,
!,
(12.4)
2724
in which case
,tl
0; 1’2 s,@,)
(12.5)
converges in distribution to a nondegenerate random vector. Suppose, for now, that Condition (iii) of Theorem 10.1 holds, and let o^, denote the NLS estimator of 8,. Then (8.6) becomes
q’((j,_+
-
D,“2
(12.6) where h,(0,) = V,m,(0,)‘V,m,(0,) 0, and so
- V,mf(O,)u,. Because -f3u,Iz,)= 0,
~CV~.fk,~~,)~,l=
(12.7) t=1
by the WLLN
i=l
applied
to {V,m:(O,)u,}.
Therefore,
we can write -1
D;‘~@,-
e,) = -
0;
[ x
l/2 i v,m,(0,yv,m,(~,)D; 112 I=1
D; [
1
lj2 i Vomt(t3Jt4, + o#). i=l
(12.8)
1
This puts us back in the linear model case covered in Section 11. Using partitioned inverse and Lemma 11.2, T(b, - /?,) has the same representation as in (11.23). Similarly,
fi(fT
- y,) has a representation
as in (11.31), but with zI replaced
with
V,f(z,, Y,). The main point of this example is that, once we have a linearized representation as in (12.8), the asymptotic analysis is almost identical to the linear case, provided the joint limiting distribution of
D;‘j2
i V,m,(e,),V,m,(e,)D,liZ 1=1
can be found. The structure is the case. In general, finding particularly if the nonergodic We have yet to do the hard of Theorem 10.1. It turns out
and
Dp112i
V,,mt(eJ’u,
1=1
of the regression ‘function in (12.1) ensures that this the limiting distribution can be much more difficult, variables x, appear nonlinearly. part of the analysis, and that is to verify Condition (iii) that Lemma 9.1 can also be applied in this case. We
2775
sketch how this can be done. Define
where a, < 1 and a2 < 2. Note that the minimum diagonal element of this matrix differentiable, where each is c r(r) = T”‘. Assume that ,f(z,, y) is thrice continuously derivative is dominated by an integrable function. Then, in the notation of Lemma 9.1, the functions bTtij = b,,,, where 1 d i d M + 1 and M + 1 <j < M + K + 1, can be taken to be of the form E
bTtij
'tij
=
Srijl
',j
I + Stij,
where gtij is a stationary function of z, that dominates f&/I) and its first three derivatives. (The terms for other combinations of i and j are easy to handle.) We assume that E(gtij) < co for all i and j. An application of the FCLT, as in Lemma 11.1, implies that f. Stij IX,j I =
0p(T3”),
(12.10)
t=1
so that f. htij = O,( T3”). r=1
JGzgl) =
(12.11)
Tal”@. Thus, for condition (ii) of Lemma 9.1 to be satisfied, Now we must have (a, + aJ2) > $, a condition easily satisfied because a, + a,/2 can be made arbitrarily close to 2 under the restrictions stated above. Thus, the conditions of Theorem 10.1 hold under general conditions, and therefore representation (12.8) is valid. An important topic for future research is to examine how the conditions of Theorem 10.1, or a result with similar scope, can be verified for more complicated nonlinear models. It seems likely that the functional CLT will play an important role.
Appendix
1.
Notation
The transpose of a K x M matrix A is denoted by A’. 11a 11denotes the Euclidean norm of the P x 1 vector a.
11A 11= [tr(A’A)]“* denotes the Euclidean matrix norm of the matrix A. For a continuously differentiable function q(O), where 8 is a P x 1 vector, gradient of 4 is denoted by the 1 x P vector V&B). For a K x M differentiable matrix A(B), where 0 is a P x 1 vector, we denote gradient of A by V,A(H) = avec A(tI)/aO, which is a KM x P matrix. The second derivative of a matrix, denoted ViA(O), is defined as
the the
V,zA(8)= v,[v,A(e)]. For random vectors y and x, &lx) denotes the conditional distribution of y given X, E(yjx) denotes the conditional expectation, and Var(y(x) denotes the conditional variance.
2.
Dejinitions und proojY$
Dejnition
A.1
Let (O,P, P) be a probability space, and let {Or: T = 1,2,. . .} be a sequence of events defined on this space. Then (Or} occurs with probability approaching one (w.p.a.1) if P(OT)-+l
as
T+co.
Definition A.2 A random function r: W x 0 satisfies conditions on W x 0 if (i) for each & 0, r(., 0) is measurable; (ii) for each weW,r(w;) is continuous
the standard
measurability
and continuity
on 0.
Dejinition A.3 Let 0 be a compact (closed be a sequence of functions conditions on 9 x 0. Let 0. Then Qr(~,e) conoerges maxIQr(W,@-Q(e)lLo 068
and bounded) subset of lRp and let {Q,: w x 0 + R} satisfying the standard measurability and continuity Q: 0 -+ R be a nonstochastic continuous function on in probability to Q(O) uniformly on 0 if and only if as
When (a.1) holds we often write Q, 3
T+KI.
(a.1)
Q uniformly
on 0.
Let 0 be a subset of Rp and let {QT: w x 0 -+R:T= valued functions; Assume that
1,2,...}
Theorem
A.1 beasequenceofreal-
Ch. 45: Estimation
and Inference
for
Dependent
2727
Processes
(i) 0 is compact; is fi es the standard measurability and continuity IQ=) sa t' -llr x 0. Then a (measurable) estimator 8,: YV + 0 exists such that
conditions
(ii)
Q~(w, @r(w)) = min Q~(w, ~9) for all BE8
on
WEYY.
In addition, - assume that (iii) QT LQ uniformly on 0, where Q is a nonstochastic, valued function on 0; (iv) OO>s the unique minimizer of Q on 0. Then 13, A 19~.
continuous,
real-
Proof This follows from White (1993, Theorem 2.1).
3.4) or Newey and McFadden
(Theorem
Lemma A.1 Let G,: YV x 0 + R and G: 0 + R be functions satisfying the standard measurability and continuity conditions on the compact set 0. Suppose that G, 3 G uniformly on 0 and 8, % 8,. Then G,(t?,) 3 G(0,). Proof Follows
from White (1993, Theorem
3.7).
Definition A.4 Let 0 be a subset of Rp with nonempty interior. A random function r: W x 0 -+ lRKsatisfies the standard measurability and$rst order (second order) differentiability conditions on W x 0 if (i) for each 0~ 0, r( ., 0) is measurable; (ii) for each WE%‘“, r(w;) is once (twice) continuously differentiable on int(O). For the abstract optimization problem objective function as the P x 1 vector s,(e) = S,(w, 0)
E
V,Q,(w, 0) =
of Theorem
A.l, define
“Q;; ‘),...,““$’ 1
The Hessian of Qr is defined
to be the P x P symmetric
P
the score of the
‘. e,>
matrix
in particular, the (i, j)th element of H,(w, 0) is a’Qr(w, O)/??OiaOj. Hr(CJ) denotes P x P symmetric random matrix evaluated at 8. Theorem
the
A.2
Let the conditions of Theorem (v) o0 is in the interior of 0; 1sfi es the standard (vi) {Q,1sa t' (vii) H, LA uniformly on 0, matrix function, and A, = L
(viii) fiS,(fI,) Then fi(8,
Normal(0,
A.1 be satisfied.
In addition,
assume
that
measurability and differentiability conditions; continuous where A: 0 + Rpxp IS a nonrandom A(8J is nonsingular.
B,) where B, is a positive
definite
matrix.
- 0,) ~Normal(O,A;‘B,A;‘).
(a.2)
Proof
This essentially follows from Theorem 3.1 in Newey and McFadden. proof is available on request from the author. Proof of Theorem
A separate
4.2
Define
Q,(wQ)=
T-’ i: qt(w,,@ 1=1
and Q,(0) = E[Q,(w,
e)]. Then we must show that, for each E > 0,
maxIQT(w,e)-QT(0)I>E 888
1
-+O
as T+co.
(a.3)
Let 6 > 0 be a number to be set later. Because 0 is compact, there exists a finite covering of 0, say Y6(ej), j = 1,2,. . . , K(6), where 9’,(ej) is the sphere of radius 6 about ej. Set Yj = ,4”,(0,), K = K(6), and Q,(e) E Qr(rv, 0). Because 0 c uj”= rYj, it follows that
ese
P maxIQ,(e) - Q,(q > E
1 d P[
max 1 <j<
max IQ=(e) - &(e)I K e&‘,
max IQ,(e) - Q,(e)1 > ec9,
We will bound
each probability
in the summand.
For BEYj,
>
E1
E1.
ia.4
by the triangle
inequality G T-’
i
k,(e) -
4,(‘j)l
+
T-’
i I=1
1=1
+
where q,(Q) E E[qt(B)].
T-l
i 1=1
h,(e)
-
By Condition
4,tej)
-
4,(ej)
4,(ej)13
(iv)(a), for &Yj,
Ide) - q,(ej)ld cl(wt)IIe- 'j II < SC, and
Iqtte)-
d
Gt('j)l
IIe - dj 11< sqe,),
C,(ej)
where Ct = E(c,). Thus, we have
maxIQT(e)- Q&9
1I
+ T- l t
G6
e&j
<26T-’
i
C,+6
T-’
<26c+6
+ T-’ t$l qf(‘j) - 4t(‘j) 3 I 1
5 et-Et
T-’
1=1
1=1
i t=1
ct-El
I
q,(ej) - q,(ej)
1=1
+ T-’ I
where T- ‘Ct’= 1C, d c < cc by (iv)(b). It follows that
1
maxI QT(4 - QTm > E
p [
os.vj
T- l i: c, - c, + T- ’ 5 1=1 t=1 I I
I
q,(ej) - q,(ej)
> & - 2sc
I.
Now choose 6 < 1 such that (E - 26c) < 42 (this affects K, but not c). Then
1
maxIQT(WQT(W= BEYj
GP
T-’
5 1=1
Ct-2,
T-’
+
I
I
5 q,(oj)-q,(oj) 1=1
I
> E/2 . >
2730
Next, choose P
[I
T-’
T, so that T c q-Cc, + T-’ 1=1 I I
for all T 3 T,, and all j = 1,2,. , K; this is possible by Assumptions (iii) and (iv)(b) of Theorem 4.2 (and because K = K(6) is finite). From (a.4) it follows that, for T 3 T,, P max 1QT(6) - QT(@l > s Oet3
which establishes
1
d c,
the result.
Proof of Theorem 4.3 We verify the conditions
of Theorem
A.l. Define
By Assumptions M. 1 and M.2 of Theorem 4.3, it follows from White (1993, Theorem 3.7) that QT(w, 0) converges in probability to Q(O) 3 q(fI; y*) uniformly on 0. The result now follows from Assumption M.3 and Theorem A.l. Proof of Theorem 4.4 This is a simple application
of Theorem
O=T-‘.‘~~~~,(~~;g,)+[T-’
i
A.2. A mean value expansion
gives (w.p.a.1)
+((i,-t$),
t=1
where & is h,(t?;9,) evaluated
at mean values between
6, and 8,
where V,“; is V,s,(BO; y) evaluated at mean values between Assumptions M.4(iii), M.S(iii) and M.7 of Theorem 4.4, T-1
f: 1=
V,S;=T-1 1
,;,T -w,a~Y*)l
+ o,(l) = o,(l).
pT and
y*. By
7731
Therefore,
0 = T-
,fl~~(0,;y*) +
1’2
fi(8,
- 0,) + o,(l).
By M.5(i) and M.S(ii), because L?, 3 O,, T- lx,‘= 1it 3 nonsingular w.p.a.1. Thus, w.p.a.1 we can write
A0 and so T- lx,‘= 14, is
where we have used the fact that Tp ‘/‘CITE 1s,(O,; y*) = O,(l) (by M.6). This proves the result. Proof of Lemma 4.3 The proof is standard for example, Amemiya Proof of Theorem From
a second
and follows from a second order Taylor’s expansion. (1985, Section 4.5) and White (1993, Theorem 8.10).
See,
8.1
order Taylor’s
expansion,
where R,(8; (3,) = (0 - flJ’[Hr(& 0,) - H”,](8 - 0,) and H,(8; 0,) denotes ZfT(Q) evaluated at mean values between 8 and BO.Define a random vector by 0, = 8, If;- ‘SC (w.p.a. 1). After a little algebra we have
Q,(O)- Q,(e;.) =
(H OTy~(O - ‘d +R,.(B; 0,) -
RT(eT;0,).
(a.3
Also, write R,(O; 0,) = clT(8)‘A,(H; 0,)cr,(0), where a,(0) = Ck’2(Q - I!?~,) and A.(& 0,) = c, 1’2[HT(u; 0,) - HOr]CQ 1’2. Now by assumption (iii)(b) of Theorem 8.1, 11A,(H; 0,) /) A 0 uniformly over the set {@ 1)q.(d) I/ d E} for any E d 1. It follows
2732
that for E < 1,
for all 0 such that Next, define
I/Ck’2(Q - 0J 11< E, where 6, AO.
.,FT(E) = (U: /IDk”(U - 8,) I/ <
6:).
By Assumptions (iv)(a) and (iv)(b), Di’2(8T - (3,) = O,( 1). By (iii)(a) Ci!‘D; therefore, there exists a sequence {c, > 0} with sT +O such that P[/IC+.‘2(lJ-Bo)ll <+-‘I C”‘DT
Also because VT(ET)
=
By the triangle
as
T+co.
l/2 +O;
(a.7)
_ ‘I2 -+ 0,
T
{ 8: I)
Ch’2(0- ir,) IId Ed}
w.p.a. 1.
(a4
inequality,
IIq2(o - 0,) II < I/q2(6, - e,, I(+ llcy(e”, - UJ 1).
(a.9
BY (a.71and(a.% if IICk”(Q- &) II d &Tthen I/C$‘2(d- do)II6 2~~ w.p.a. 1. Thus, from (a.8) it follows that “~‘T(&T)c {0: IIC$‘2(U - 0,) 11d 2ET,\
w.p.a. 1.
(a. 10)
Now (a.6) and (a.lO) imply that
sup IRT(8;00)I <4dT6$ w.p.a.1
(a.1’1)
and (a.6) and (a.7) imply that IR,(H”,;O,)I d 6,s; Letting &?,(E,) w.p.a.1,
min
w.p.a.1.
denote
QT(U) -
(a. 12)
the boundary
of J?“,(E,), (a.?+ (a.1 1) and (a.12) imply that
QT(gT) 3 iA;,min~$- 56,~;
eE?.P~I.(ET) =
{+j.;,min - ~S,.}E+,
Ch. 45: Estimution
and Ir$wncr
fbr Dependent
2733
Procusxs
I/2. By Assumption (iv)(a) where 1; min is the smallest eigenvalue of 0; l12H'+D; and Ed > 0 for all T, QT of Theorem 8.1, 2; min 2 I> 0 w.p.a.1. Because 6, A0 cannot achieve its ‘minimum on the boundary of J?~(+) w.p.a.1; therefore, it achieves its minimum on the interior of J?~(+). Let e, denote this estimator. Then S,(&,) = 0 w.p.a.1. and 11 Di'"(6, - 8,)II d +, SO that (a.13)
D;"(& - tl,)= O,(1). Now we are almost done. Use a mean value expansion
of the score to write (w.p.a. 1)
DT1'2ST(eT)=DT1'2SOT+DT1'2HOTDT1'2D~2(83-e,) +D;1'2(l;iT-H;)D;1'2D;2(&9 ) 0'
(a. 14)
where ki, is evaluated at mean values. Letting 8, denote a generic mean value it is easily shown that Dy2(gT- 0,) = O,(l). But this implies that ~,EJV; w.p.a.1 because C1'2D-1/2 +O. Thus, by (iii)(b) and (a.13), the last term in (a.14) is o,(l). We have n:w lstablished that w.p.a.1, 0=
D,"'S; + D;1'2H;D;1'2Dy2(6T0,)+o,,(l);
by (iv)(a) we can write
D~'(&-0,)= -A,'D,"2S"T+ o,(l). Along with (iv)(b), this completes
(a.15)
the proof.
References Amemiya, T. (1985) Advanced Econometrics. Cambridge: Harvard University Press. Anderson, T.W. (1971) The Statistical Analysis of Time Series. New York: Wiley. Andrews, D.W.K. (1987) “Consistency in Nonlinear Econometric Models: A Generic Uniform Law of Large Numbers”, Econometrica, 55, 1465-1472. Andrews, D.W.K. (1988) “Laws of Large Numbers for Dependent Non-Identically Distributed Random Variables”, Econometric Theory, 4, 458-467. Andrews, D.W.K. (1989) “Asymptotics for Semiparametric Econometric Models I: Estimation”, Cowles Foundation for Research in Economics Working Paper No. 908. Andrews, D.W.K. (1991) “Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation”, Econometrica, 59, 817-858. Andrews, D.W.K. and J. McDermott (1993) “Nonlinear Econometric Models with Deterministically Trending Variables”, Cowles Foundation for Economic Research Working Paper No. 1053. Andrews, D.W.K. and J.C. Monohan (1992) “An Improved Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimator”, Econometrica, 60, 953-966. Basawa, I.V. and D.J. Scott (1983), Asymptotic Optimal Inferencefor Nonergodic Models. New York: Springer-Verlag.
Basa~a, I.V., P.D. Feigin and C.C. Heyde (1976) “Asymptotic Properties of Maximum Likelihood Estimators for Stochastic Processes”, Sankhya, Series A, 38, 259-270. Bates, C.E. and H. White (1985) “A Unified Theory of Consistent Estimation for Parametric Models”, Econometric Theory, 1, 15 I- 175. Bates, C.E. and H. White (1993) “Determination of Estimators with Minimum Asymptotic Variance”, Econometric Theory, 9, 633-648. Berk, K.N. (1974) “Consistent Autoregressive Spectral Estimates”, Annals of Statistics,2, 489-502. Berndt, E.R., B.H. Hall, R.E. Hall and J.A. Hausman (1974) “Estimation and Inference in Nonlinear Structural Models”, Annals of Economic and Social Measurement, 4, 653-665. Bhat, B.R. (1974) “On the Method of Maximum Likelihood for Dependent Observations”, Journal of /he Royal Statistical Society, Series B, 36, 48-53. Bierens, H.J. (1981) Robust Methods and Asymptotic Theory in Nonlinear Economerrics. New York: Springer-Verlag. Bierens, H.J. (1982) “A Uniform Weak Law of Large Numbers Under &mixing with Application to Statistica Nederlandica, 36, 81-86. Nonlinear Least Squares Estimation”, Billingsley, P. (1968) Conreryence of Probability Measures. New York: Wiley. Billingsley, P. (1986) Probability and Measure. Second edition. New York: Wiley. Bloomfield, P. (1976) Fourier Analysis of Time Series: An Introduction. New York: Wiley. Bloomfield, P. and W.L. Steiger (1983) Least Absolute Deviations. Boston: Birkhauser. Bollerslev, T. (1986)“Generalized Autoregressive Conditional Heteroscedasticity”, Journal of Econometrics, 3 1, 307-328. Bollerslev, T. and J.M. Wooldridge (1992) “Quasi-Maximum Likelihood Estimation and Inference in Econometric Reviews, II, 143-172. Dynamic Models with Time-Varying Covariances”, Brillinger, D.R. (1981) Time Series: Dara Analysis and Theory. New York: Holden-Day. Brockwell, P.J. and R.A. Davis (1991) Time Series: Theory and Methods. New York: Springer-Verlag. Burguete, J.F., A.R. Gallant and G. Souza (1982) “On the Unification of the Asymptotic Theory of Nonlinear Econometric Models”, Econometric Reviews, I, 151-190. Chamberlain, G. (1982) “The General Equivalence of Granger and Sims Causality”, Econometrica, 50, 569-581.
Chamberlain, G. (1987) “Asymptotic Efficiency in Estimation with Conditional Moment Restrictions”, Journal of‘ Econometrics, 34, 305-334. Chesher, A. and R. Spady (1991) “Asymptotic Expansions of the Information Test Statistic”, Econometrica,
59, 787-8
15.
Likelihood Estimation with Dependent Observations”, Journal of Series B, 38, 45-53. Davidson, J. (1992) “A Central Limit Theorem for Globally Nonstationary Near-Epoch Dependent Functions of Mixing Processes”, Econometric Theory, 8, 3 13-329. Davidson, R. and J.G. MacKinnon (1984) “Convenient Specification Tests for Logit and Probit Models”,
Crowder,
M.J. (1976) “Maximum
the Royal Statistical
Journal
Davidson,
Society,
qf Econometrics,
25, 241-262.
R. and J.G. MacKinnon
(1991) “A New Form of the Information
Matrix Test”, Econometrica,
60, 145-158.
Dickey, D.A. and W.A. Fuller (1979) “Distribution of the Estimators for Autoregressive Time Series with a Unit Root”, Journal of‘ the American Statistical Association, 74, 427-431. Domowitz, I. (1985) “New Directions in Nonlinear Estimation with Dependent Observations”, Canadian Journal
Domowitz, cations”,
of Economics,
19, l-27.
I. and L.T. Muus (1988) “Asymptotic Inference for Nonergodic with Econometric Appliin: W.A. Barnett, E.R. Berndt ind H. White, eds., Proceedings of the Third International Symposium in Economic Theory and Econometrics. New York: Cambridge University Press, Domowitz, I. and H. White (1982) “Maximum Likelihood Estimation of Misspecified Models”, Journal of Econometrics, 20, 35-58. Engle, R.F. (1984) “Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics”, in: Z. Griliches and M.D. Intriligator, eds., Handbook ofEconometrics, Vol. II. Amsterdam: North-Holland, 775-826. Engle, R.F. and C.W.J. Granger (1987) “Cointegration and Error Correction: Representation, Estimation and Testing”, Econometrica, 55, 251-276. Fuller, W. (1976) Introduction to Statistical Time Series. New York: Wiley. Gallant, A.R. (1987) Nonlinear Statistical Models. New York: Wiley.
Gallant, A.R. and H. White (1988) A Unified Approach to Estimation and Inference in Nonlinear Dynamic Models. Oxford: Basil Blackwell. Godfrey, L.C. (1988) Misspecijication Tests in Econometrics: The LM Principle and Other Approaches. New York: Cambridge University Press. Goldberger, A. (1968) Topics in Reyression Analysis. New York: Macmillan. Gourieroux, C., A, Monfort and A. Trognon (1984) “Pseudo-Maximum Likelihood Methods: Theory”, Econometrica, 52, 68 1~700. Gourieroux, C., A. Monfort and A. Trognon (1985) “A General Approach to Serial Correlation”, Econometric Theory, 1, 3155340. Granger, C.W.J. (1969) “Investigating Causal Relations by Econometric Models and Cross-Spectral Methods”, Econometrica, 37, 424-438. Hall, P. and C.C. Heyde (1980) Martingale Limit Theory and Its Application. New York: Academic Press. Hannan, E.J. (1971) “Non-Linear Time Series Regression”, Journal ofApplied Probability, 8,767-780. Hansen, B.E. (1991a) “Strong Laws for Dependent Heterogeneous Processes”, Econometric Theory, 7, 213-221. Hansen, B.E. (1991b) Inference When a Nuisance Parameter is Not Identified Under the Null Hypothesis, Rochester Center for Economic Research, Working Paper no. 296. Hansen, B.E. (1992a) “Consistent Covariance Matrix Estimation for Dependent Heterogeneous Processes”, Econometrica, 60, 9677972. Hansen, B.E. (1992b) “Convergence to Stochastic Integrals for Dependent Heterogeneous Processes”, Econometric Theory, 8, 4899500. Hansen, L.P. (1982) “Large Sample Properties of Generalized Method of Moments Estimators”, Econometrica, 50, 1029- 1054. Hansen, L.P. (1985) “A Method for Calculating Bounds in the Asymptotic Covariance Matrices of Generalized Method of Moments Estimators”, Journal of Econometrics. 30, 203-238. Hansen, L.P. and R.J. Hodrick (1980) “Forward Exchange Rates as Optimal Predictors of Future Spot Rates: An Econometric Analysis”, Journal qf Political Economy, 88, 8299853. Hansen, L.P. and K.J. Singleton (1982) “Generalized Instrumental Variables Estimation of Nonlinear Rational Expectations Models”, Econometrica, 50, 126991286. Hansen, L.P., J.C. Heaton and M. Ogaki (1988) “Efficiency Bounds Implied by Multiperiod Conditional Moment Restrictions”, Journal of the American Statistical Association, 83, 863-87 1, Harvey, A.C. (1990) The Econometric Analysis of Time Series. Cambridge: MIT Press. Heijmans, R.D.H. and J.R. Magnus (1986) “On the First-order Efficiency and Asymptotic Normality of Maximum Likelihood Estimators Obtained from Dependent Observations”, Statistica Nederlandica, 40. Hendry, D.F. and J.-F. Richard (1983) “The Econometric Analysis of Economic Time Series”, International Statistical Review, 51, 11 l-163. Hendry, D.F., A.R. Pagan and J.D. Sargan (1984) “Dynamic Specification”, in: Z. Griliches and M.D. Intriligator, eds., Handbook of Econometrics, Vol. II. Amsterdam: North-Holland, 1023-l 100. Herndorff, N. (1984) “An Invariance Principle for Weakly Dependent Sequences of Random Variables”, Annals of Probability, 12, 141-153. Hsieh, D.A. (1983) “A Heteroskedasticity-Consistent Covariance Matrix Estimator for Time Series Regressions”, Journal of Econometrics, 22, 281-290. Huber, P.J. (1967) “The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions”, Proceedings of the Fijth Berkeley Symposium in Mathematical Statistics and Probability. Berkeley: University of California Press. Jeganathan, P. (1980) “An Extension of a Result of L. LeCam Concerning Asymptotic Normality”, Sankhya, Series A, 43, 23-36. Jeganathan, P. 
(1988) “Some Aspects of Asymptotic Theory with Applications to Time Series Models”, University of Michigan Department of Statistics, Technical Report No. 166. Johansen, S. (1988) “Statistical Analysis of Cointegrating Vectors”, Journal ofEconomic Dynamics and Control, 12, 23 l-54. Keener, R.W., J. Kmenta. and N.C. Weber (1991) “Estimation of the Covariance Matrix of the Least Squares Regression Coefficients when the Disturbance Covariance Matrix is of Unknown Form”, Econometric Theory, 7, 22245. Klimko, L.A. and P.T. Nelson (1978) “Conditional Least Squares Estimation for Stochastic Processes”, Annals of Statistics,6, 6299642.
LeCam, L. (1986) Asymptotic Methods in Statistical Decision Theory. New York: Springer-Verlag. Levine, D. (1983) “A Remark on Serial Correlation in Maximum Likelihood”, Journal c$Econometrics, 23, 337-342. Lin, W.-L. (1992) “Alternative Estimators for Factor GARCH Models A Monte Carlo Comparison”, Journal of Applied Econometrics, 7, 259-279. MacKinnon, J.G. (1992) “Model Specification Tests and Artificial Regressions”, Journal of Economic Literature, 30, 102Z 146. MacKinnon, J.G. and H. White (1985) “Some Heteroskedasticity Consistent Covariance Matrix Estimators with Improved Finite Sample Properties”, Journal of Econometrics, 19, 305-325. Magnus, J.R. and H. Neudecker (1986) “Symmetry, O-l Matrices and Jacobians: A Review”, Econometric
Manski,
Theory,
2, 157-190.
C. (1975) “Maximum
ofEconometrics,
Score Estimation
3, 205-225. (1988) Analog Estimation
Manski, CF. McLeish, D.L. (1974) “Dependent Probability,
McLeish,
of the Stochastic
Methods
Central
in Econometrics.
Limit
Theorems
and
Utility
Model
of Choice”,
Journal
New York: Chapman
Invariance
and Hall. Principles”, Annals
of
2, 81-85.
D.L. (1975) “A Maximal
Inequality
and Dependent
Strong
Laws”, Annals of Probability,
3,
826-836.
McLeish, ability,
D.L. (1977) “On the Invariance
Principle
for Nonstationary
Mixingales”,
Annals of Prob-
5, 616-621.
Nelson, D.B. (1991) “Conditional
Heteroskedasticity in Asset Returns: A New Approach”, Econometrica, 59, 347F370. Newey, W.K. (1990) “Efficient Instrumental Variables Estimation of Nonlinear Econometric Models”, Econometrica, 58, 809%837. Newey, W.K. (1991a) “Uniform Convergence in Probability and Stochastic Equicontinuity”, Econometrica, 59, 1161~1167. Newey, W.K. (1991b) Consistency and Asymptotic Normality of Nonparametric Projection Estimators, mimeo, MIT Department of Economics. Newey, W.K. and K.D. West (1987) “A Simple Positive Semi-Definite Heteroskedasticity and Autocorrelation Consistent Covariance Matrix”, Econometrica, 55, 703-708. Orme, C. (1990) “The Small-Sample Performance of the Information Matrix Test”, Journal of Econometrics, 46, 309-331.
Pagan, A.R. and H. Regression Models, Park, J.Y. and P.C.B. Part l”, Econometric Park, J.Y. and P.C.B. Part 2”, Econometric Phillips, P.C.B. (1986)
Sabau (1987) On the Inconsistency of the MLE in Certain Heteroskedastic mimeo, University of Rochester, Department of Economics. Phillips (1988) “Statistical Inference in Regressions with Integrated Processes: Theory,
4, 468-497.
Phillips (1989) “Statistical Inference Theory, 5, 95-131. “Understanding Spurious Regressions
in Regressions in Econometrics”,
with Integrated Journal
Processes:
ofEconometrics.
33, 311-340.
Phillips, P.C.B. (1987) “Time Series Regression with a Unit Root”, Econometrica, 55, 277-301. Phillips, P.C.B. (1988) “Multiple Regression with Integrated Time Series”, Contemporary Mathematics, 80, 79%105.
Phillips, P.C.B. (1989) “Partially Identified Econometric Models”, Econometric Theory, 5, 181-240. Phillips, P.C.B. (1991) “Optimal Inference in Cointegrated Systems”, Econometrica, 59, 283-306. Phillips, P.C.B. and S.N. Durlauf (1986) “Multiple Time Series Regression with Integrated Processes”, Review of Economic Studies, 53, 473-496. Phillips, P.C.B. and B.E. Hansen (1990) “Statistical Inference in Instrumental Variables Regression with I(1) Processes”, Review of Economic Studies, 57, 99-125. Phillips, P.C.B. and M. Loretan (1991) “Estimating Long-Run Economic Equilibria”, Reoiew (If Economic Studies, 58, 407-436. Poirier, D.J. and P.A. Ruud (1988) “Probit with Dependent Variables”, Reoiew of EconomicStudies, 54, 593-614. PGtscher, B.M. and I.R. Prucha (1986) “A Class of Partially Adaptive One-Step Estimators for the Nonlinear Regression Model with Dependent Observations”, Journal of Econometrics, 32, 219-251. PGtscher, B.M. and I.R. Prucha (1989) “A Uniform Law of Large Numbers for Dependent and Heterogeneous Data Processes”, Econometrica, 57, 675-684.
Ch. 45: Estimation and Ir$rence~fhr Pijtscher, B.M. and I.R. Prucha
Dependent Processes
2737
(1991a) “Basic Structure of the Asymptotic Theory in Dynamic Nonlinear Econometric Models, Part I: Consistency and Approximation Concepts”, Ihnometric Reviews, 10, 125-216. Potscher, B.M. and I.R. Prucha (1991b) “Basic Structure of the Asymptotic Theory in Dynamic NonEconometric Reviews, 10, 2533325. linear Econometric Models, Part II: Asymptotic Normality”, Quah, D. (1990) “An Improved Rate for Non-Negative Definite Consistent Covariance Matrix Estimation with Heterogeneous Dependent Data”, Economics Letfers, 33, 133- 140. Quah, D. and J.M. Wooldridge (1988) A Common Error in the Treatment of Trending Time Series, MIT Department of Economics, Working Paper No. 483. Quandt, R.E. and J.B. Ramsey (1978) “Estimating Mixtures of Normal Distributions and Switching Regressions”, Journal of the American Statistical Association, 73, 730-738. Ranga Rao, R. (1962) “Relations Between Weak and Uniform Convergence of Measures with Applications”, Annals of Mathematical Statistics, 33, 659-680. Rao, C.R. (1948) “Large Sample Tests of Statistical Hypotheses Concerning Several Parameters with Applications to Problems of Estimation”, Proceedings of the Cambridge Philosophical Society, 44, 50-57. Rilstone, P. (1991) Efficient Instrumental Variables Estimation of Nonlinear Dependent Processes, mimeo, Universite Laval. Robinson, P.M. (1972) “Nonlinear Regression for Multiple Time Series”, Journal ofApplied Probability, 9, 758-768. Robinson, P.M. (1982) “On the Asymptotic Properties of Estimators of Models Containing Limited Dependent Variables”, Econometrica, 50, 27-42. Robinson, P.M. (1987) “Asymptotically Efficient Estimation in the Presence of Heteroskedasticity of Unknown Form”, Econometrica, 55, 875-891. Robinson, P.M. (1991a) “Best Nonlinear Three-Stage Least Squares Estimation of Certain Econometric Models”, Econometrica, 59, 755-786. Robinson, P.M. (1991b) “Testing for Strong Serial Correlation and Dynamic Conditional Heteroskedasticity in Multiple Regression”, Journal of Econometrics, 47, 67-84. Rosenblatt, M. (1956) “A Central Limit Theorem and a Strong Mixing Condition”, Proceedings of the National Academy of Sciences USA, 42,43-47. Rosenblatt, M. (1978) “Dependence and Asymptotic Independence for Random Processes”, in: M. Rosenblatt, ed., Studies in Probability Theory. Washington, DC: Mathematical Association of America. Roussas, G.G. (1972) Contiguity of Probability Measures. Cambridge: Cambridge University Press. Saikkonen, P. (1991) “Asymptotically Efficient Estimation of Cointegration Regressions”, Econometric Theory, 7, 1-21. Sargan, J.D. (1958) “The Estimation of Economic Relationships Using Instrumental Variables”, Econometrica, 26, 393-415. Schmidt, P. (1976) “On the Statistical Estimation of Parametric Frontier Production Functions”, Reoiew of Economics and Statistics, 58, 238-239. Seaks, T.G. and S.K. Layson (1983) “Box-Cox Estimation with Standard Econometric Problems”, Review of Economics and Statistics, 65, 857-859. Sims, CA. (1972) “Money, Income and Causality”, American Economic Review, 62, 540-552. Sims, C.A., J.H. Stock and M.W. Watson (1990) “Inference in Linear Time Series Models with Some Unit Roots”, Econometrica, 58, 113-144. Sowell, F. (1988) Maximum Likelihood Estimation of Fractionally Integrated Time Series, GSIA, Carnegie Mellon University, Working Paper. Steigerwald, D. (1992) “Adaptive Estimation in Time Series Models”, Journal of Econometrics, 54, 251-275. Stock, J.H. 
(1987) “Asymptotic Properties of Least Squares Estimators of Cointegrating Vectors”, Econometrica, 55, 1035-1056. Stock J.H. and M.W. Watson (1993) “A Simple MLE of Cointegrating Vectors in Higher Order Integrated Systems”, Econometrica, 61, 783-820. Weiss, A.A. (1986) “Asymptotic Theory for ARCH Models: Estimation and Testing”, Econometric Theory, 2, 107-131. Weiss, A.A. (1991) “Estimating Nonlinear Dynamic Models Using Least Absolute Error Estimation”, Econometric Theory, 7, 46-68.
2138
J.M.
Wooldrid
Weiss, L. (1971) “Asymptotic Properties of Maximum Likelihood Estimators in some Nonstandard Cases I”, Journal of the American Statisticul Association, 66, 345-350. Weiss, L. (1973) ‘*Asymptotic Properties of Maximum Likelihood Estimators in some Nonstandard Cases II”, Journal of the American Statistical Association, 68, 428-430. White, H. (1982) “Maximum Likelihood Estimation of Misspecified Models”, Econometrica, 50, l-25. White H. (1984) Asymptotic Theory@ Econometricians. Orlando: Academic Press. White, H. (1987) “Specification Testing in Dynamic Models”, in: T. Bewley, ed., Advances in Econometrics ~ Fifth World Congress, Vol. I, l-58. New York: Cambridge University Press. White, H. (1993) Estimution, Inference, und Specification Analysis. New York: Cambridge University Press. White, H. and I. Domowitz (1984) “Nonlinear Regression with Dependent Observations”, Econometrica, 52, 143-162. White, H. and M. Stinchcombe (1991) Adaptive Efficient Weighted Least Squares Estimation with Dependent Observations, mimeo, UCSD Department of Economics. Wolak, F.A. (1991) “The Local Nature of Hypothesis Tests Involving Inequality Constraints in Nonlinear Models”, Econometrica, 59, 98 l-996. Wooldridge, J.M. (1986) Asymptotic Properties of Econometric Estimators, UCSD Department of Economics, Ph.D. Dissertation. Wooldridge, J.M. (1991a) “On the Application of Robust, Regression-Based Diagnostics to Models of Conditmnal Means and Conditional Variances”, Journal of Econometrics, 47, 5-46. Wooldridee. J.M. (1991b) “Suecification Testing _ and Quasi-Maximum Likelihood Estimation”, Journal . of Eckketrics; 48, 24-55. Wooldridge, J.M. (1991~) Notes on Regression with Difference-Stationary Data, mimeo, Michigan State University Department of Economics. Wooldridge, J.M. and H. White (1985) “Consistency of Optimization Estimators”, UCSD Department of Economics, Discussion Paper 85-29. Wooldridge, J.M. and H. White (1988) “Some Invariance Principles and Central Limit Theorems for Dependent Heterogeneous Processes”, Econometric Theory, 4, 210-230. Wooldridge, J.M. and H. White (1989) Central Limit Theorems for Dependent, Heterogeneous Processes with Trending Moments, mimeo, MIT Department of Economics.
Chapter 46
UNIT ROOTS, STRUCTURAL JAMES
BREAKS AND TRENDS
H. STOCK*
Harvard University
Contents Abstract 1. Introduction 2. Models and preliminary
3.
4.
5.
6.
asymptotic
2.1.
Basic concepts
and notation
2.2.
The functional
central
2.3.
Examples
2.4.
Generalizations
2748
and related tools
2751
results
and additional
Unit autoregressive
theory
2745
limit theorem
and preliminary
2740 2740 2744
2756
references
2757
roots
3.1.
Point estimation
2758
3.2.
Hypothesis
2763
3.3.
Interval
tests
2785
estimation
Unit moving
average
4.1.
Point estimation
4.2.
Hypothesis
Structural
2788
roots
2790 2792
tests
breaks
and broken
5.1.
Breaks in coefficients
5.2.
Trend
breaks
Tests of the I( 1) and I(0) hypotheses: Parallels
6.2.
Decision-theoretic
between
6.3.
Practical
2807
in time series regression
and tests for autoregressive
6.1.
2805
trends
2817
unit roots
links and practical
limitations
2821
the I(0) and I(1) testing problems classification
and theoretical
2822
schemes
limitations
in the ability to distinguish
2821
I(0) and
I( 1) processes
References
2825
2831
*The author thanks Robert Amano, Donald Andrews, Jushan Bai, Ngai Hang Chan, In Choi, David Dickey, Frank Diebold, Robert Engle, Neil Ericsson, Alastair Hall, James Hamilton, Andrew Harvey, Sastry Pantula, Pierre Perron, Peter Phillips, Thomas Rothenberg, Pentti Saikkonen, Peter Schmidt, Neil Shephard and Mark Watson for helpful discussions and/or cements on a draft of this chapter. Graham Elliott provided outstanding research assistance. This research was supported in part by the National Science Foundation (Grants SES-89-10601 and SES-91-22463). Handbook ofEconometrics, Volume IV, Edited by R.F. Enyle and D.L. McFadden 0 1994 Elsevier Science B. V. All rights reserved
J.H. Stock
2740
Abstract This chapter reviews inference about large autoregressive or moving average roots in univariate time series, and structural change in multivariate time series regression. The “problem” of unit roots is cast more broadly as determining the order of integration of a series; estimation, inference, and confidence intervals are discussed. The discussion of structural change focuses on tests for parameter stability. Much emphasis is on asymptotic distributions in these nonstandard settings, and one theme is the general applicability of functional central limit theory. The quality of the asymptotic approximations to finite-sample distributions and implications for empirical work are critically reviewed.
1.
Introduction
The past decade has seen a surge of interest in the theoretical and empirical analysis of long-run economic activity and its relation to short-run fluctuations. Early versions of new classical theories of the business cycle (real business cycle models) predicted that many real economic variables would exhibit considerable persistence, more precisely, would contain a unit root in their autoregressive (AR) representations. Hall’s (1978) fundamental work on the consumption function showed that, under a simple version of the permanent income hypothesis, future changes in consumption are unpredictable, so consumption follows a random walk or, more generally, a martingale. The efficient markets theory of asset pricing recapitulated by Fama (1970) had the same prediction: if future excess returns were predictable, they would be bid away so that the price (or log price) would follow a martingale. The predictions of these theories often extended to multivariate relations. For example, if labor income has a unit root, then a simple version of the intertemporal permanent income hypothesis implies that consumption will also have a unit root and moreover that income minus consumption (savings) will not have a unit root, so that consumption and income are, in Engle and Granger’s (1987) terminology, cointegrated [Campbell (1987)]. Similarly, versions of real business cycle models predict that aggregate consumption, income and investment will be cointegrated. The empirical evidence on persistence in economic time series was also being refined during the 1980’s. The observation that economic time series have high persistence is hardly new. Orcutt (1948) found a high degree of serial correlation in the annual time series data which Tinbergen (1939) used to estimate his econometric model of the U.S. economy. By plotting autocorrelograms and adjusting for their downward bias when the true autocorrelation is large, Orcutt concluded that many of these series ~ including changes in aggregate output, investment andconsumption -
were well characterized as being generated by the first-order autoregression, Ay, = 0.3Ay,_, + I:, [Orcutt (1948, eq. 50)], where Ay, = y, - y,- , , that is, they contained an autoregressive unit root. During the 1960’s and 1970’s, conventional time series practice was to model most economic aggregates in first differences, a practice based on simple diagnostic devices rather than formal statistical tests. In their seminal article, Nelson and Plosser (1982) replaced this informal approach with Dickey and Fuller’s (1979) formal tests for a unit root, and found that they could not reject the hypothesis of a unit autoregressive root in 13 of 14 U.S. variables using long annual economic time series, in some cases spanning a century. Similarly, Meese and Singleton (1982) applied Dickey- Fuller tests and found that they could not reject the null of a single unit root in various exchange rates. Davidson et al. (1978) found that an error-correction model, later recognized as a cointegrating model, provided stable forecasts of consumption in the U.K. As Campbell and Mankiw (1987a, 1987b) and Cochrane (1988) pointed out, the presence of a unit root in output implies that shocks to output have great persistence through base drift, which can even exceed the magnitude of the original shock if there is positive feedback in the form of positive autocorrelation. This body of theoretical and empirical evidence drew on and spurred developments in the econometric theory of inference concerning long-run properties of economic time series. This chapter surveys the theoretical econometrics literature on long-run inference in univariate time series. With the exception of Section 5.1 on stability tests, the focus here is strictly univariate; for multivariate extensions see the chapter by Watson in this Handbook. Throughout, we write the observed series y, as the sum of a deterministic trend d, and a stochastic term u, y, = d, + uf’
t= 1,2 ,...,
T.
(1.1)
The trend in general depends on unknown parameters, for example, in the leading case of a linear time trend, d, = b, + fl, t, where /I, and pi are unknown. Unless explicitly stated otherwise, it is assumed that the form of the trend is correctly specified. If u, has a unit autoregressive root. then u, is integrated of order one (is I( 1)) in the sense of Box and Jenkins (1976). If Au, has a unit moving average (MA) root, then u, is integrated of order zero (is I(0)). In the treatment here, the focus is on these largest roots and the parameters describing the deterministic term are treated as nuisance parameters. The two types of unit roots (AR and MA) introduce obvious ambiguity in the phrase “unit root”, so this chapter emphasizes instead the I(0) and I( 1) terminology. Precise definitions are given in Section 2. The specific aim of this chapter is to outline the econometric theory of four areas of inference in time series analysis: unit autoregressive roots and inference for I(l) and nearly I( 1) series; unit moving average roots and testing for a series being I(0); inference on d,, and, in particular, testing for a unit autoregressive root when d, might have breaks, for example, be piecewise linear; and tests for parameter instability and structural breaks in regression models. Although the analysis of
2742
J.H.
Srock
structural breaks stems from a different literature than unit roots, the mathematics and indeed some test statistics in these two areas are closely related, and this survey emphasizes such links. There have been four main areas of application of the techniques for inference about long-run dependence discussed in this chapter. The first and perhaps the most straightforward is data description. Does real GNP contain an autoregressive unit root? What is a 95”/, confidence interval for the largest root? If output has a unit root, then it has a permanent component, in the sense that it can be decomposed into a stochastic trend (a martingale component) plus an I(O) series. What does this permanent component, trend output, look like, and how can it be estimated‘? This question has led to estimating and testing an unobserved components model. For empirical applications of the unobserved components model see Harvey (1985), Watson (1986), Clark (1987, 1989) Quah (1992) and Harvey and Jaeger (1993); for a technical discussion, see Harvey (1989); for reviews see Stock and Watson (1988a) and Harvey and Shephard (1992). A natural question is whether there is in fact a permanent component. As will be seen in Section 4, this leads to testing for a unit moving average root or, more generally, testing the null hypothesis that the series is l(0) against the I(1) alternative. A second important application is medium- and long-term forecasting. Suppose one is interested in making projections of a series over a horizon that represents a substantial fraction of the sample at hand. Such long-term forecasts will be dominated by modeling decisions about the deterministic and stochastic trends. Several of the techniques for inference studied in this chapter ~ for example, tests for unit AR or MA roots and the construction of median-unbiased estimates of autoregressive coefficients - have applications to long-run forecasting and the estimation of forecast error bands. A third application, perhaps the most common in practice, is to guide subsequent multivariate modeling or inference involving the variable in question. For example, suppose that primary interest is in the coefficients on yt in a regression in which y, appears as a regressor. Inference in this regression in general depends on the order of integration of y, and on its deterministic component [see West (1988a), Park and Phillips (1988), Sims et al. (1990), and the chapter by Watson in this Handbook]. As another example, if multiple series are I(1) then the next step might be to test for and model cointegration. Alternatively, suppose that the objective is to decompose multiple time series into permanent and transitory components, say to study short-run dynamic effects of permanent shocks [Blanchard and Quah (1989), King et al. (1991)]. In each of these cases, how best to proceed hinges on knowing whether the individual series are I(O) or I(1). Although these multivariate applications are beyond the scope of this chapter, inference about univariate AR and MA roots plays a key initial step in these multivariate applications. Fourth, information on the degree of persistence in a time series and, in particular, on its order of integration can help to guide the construction or testing of economic theories. Indeed, a leading interpretation of Nelson and Plosser’s (1982) findings
Ch. 46:
Unit
Koots, Structural
Breaks and Trends
2143
was that the prevalence of I( 1) series in their long annual data set provided support for real theories of the business cycle. Alternatively, knowledge of the order of integration of certain variables can be used to suggest more precise statements (and to guide inference) about certain economic theories, for example, the possibility of a vertical long-run Phillips curve or the neutrality of money [Fisher and Seater (1993), King and Watson (1992)]. In addition to these empirical applications, technical aspects of the econometric theory of unit roots, trend breaks and structural breaks are related to several other problems in econometric theory, such as inference in cointegrated systems. The theory developed here provides an introduction to the more involved multivariate problems. Several good reviews of this literature are already available and an effort has been made here to complement them. Phillips (1988) surveys the theoretical literature on univariate and multivariate autoregressive unit root distributions, and a less technical introduction to these topics is given in Phillips (1992a). Diebold and Nerlove (1990) provide a broad review of the econometric literature on measures and models of persistence. Campbell and Perron (199 1) provide an overview of the literature on unit autoregressive roots, as well as on cointegration, with an eye towards advising applied researchers. Banerjee et al. (1992a) provide a thorough introduction to testing and estimation in the presence of unit autoregressive roots and multivariate modeling of integrated time series, with special attention to empirical applications. The main approach to inference about long-term properties of time series which is excluded from this survey is fractional integration. In this alternative to the I(O)/I( 1) framework, it is supposed that a series is integrated of order d, where d need not be an integer. The econometric theory of inference in fractionally integrated models has seen ongoing important work over the past two decades. This literature is large and the theory is involved, and doing it justice would require a lengthier treatment than possible here. The R/S statistic of Mandelbrot and Van Ness (1968), originally developed to detect fractional integration, is discussed briefly in Section 3.2 in the context of tests for an AR unit root. Otherwise, the reader is referred to recent contributions in this area. Two excellent surveys are Beran (1992) and, at a more rigorous level, Robinson (1993). Important contributions to the theory of inference with fractional integration include Geweke and Porter-Hudak (1983), Fox and Taqqu (1986), Dahlhaus (1989) and Sowell (1990, 1992). Recent empirical work in econometrics includes Lo (1991) (R/S analysis of stock prices), Diebold and Rudebusch (1989, 199la) and Diebold et al. (1991) (estimation of the fractional differencing parameter for economic data). The chapter is organized as follows. Section 2 describes the I(0) and I( 1) models and reviews some tools for asymptotic analysis. Section 3 examines inference about the largest autoregressive root when this root equals or is close to one. Section 4 studies inference about unit or near-unit moving average roots. Two related topics are covered in Section 5: tests for parameter stability and structural breaks when
J.H. Stock
2744
the break date is unknown, and tests for AR unit roots when there are broken trends. Section 6 concludes by drawing links between the I(O) and I(1) testing problems and by suggesting some conclusions concerning these techniques that, it is hoped, will be useful in empirical practice. Most of the formal analysis in this chapter is based on asymptotic distribution theory. The treatment of the theory here is self-contained. Readers primarily interested in empirical applications can omit Sections 2.4,3.2.3 and 4.2.3 with little loss of continuity. Readers primarily interested in tests for parameter stability and structural breaks in time series regression can restrict their attention to Sections 2 and 5.1.
2.
Models and preliminary asymptotic
theory
This section provides an introduction to the basic models and limit theory which will be used to develop and to characterize the statistical procedures studied in the remainder of this chapter. Section 2.1 introduces basic notation used throughout the chapter, and provides formulations of the I(O) and I( 1) hypotheses. This section also introduces a useful tool, Beveridge and Nelson’s (1981) decomposition of an I(1) process into I(0) and I( 1) components. This leads naturally to a second expression for the I(0) and I( 1) hypothesis in a “components” representation. Section 2.2 summarizes the limit theory which will be used to derive the asymptotic properties of the various test procedures. A variety of techniques have been and continue to be used in the literature to characterize limiting distributions in the unit MA and AR roots problems. However, the most general and the simplest to apply is the approach based on the functional central limit theorem (FCLT, also called the invariance principle or Donsker’s theorem) and the continuous mapping theorem (CMT), and that is the approach used in this chapter. [There are a number of excellent texts on the FCLT. The classic text is Billingsley (1968). A more modern treatment, on which this chapter draws, is Hall and Heyde (1980). Ethier and Kurtz (1986) provide more advanced material and applications. Also, see the chapter by Wooldridge in this Handbook.] The version of the FCLT used in this chapter, which applies to the sequence of partial sums of martingale difference sequences, is due to Brown (1971). The main advantage of this approach is that, armed with the FCLT and the CMT, otherwise daunting asymptotic problems are reduced to a series of relatively simple calculations. White (I 958) was the first to suggest using the FCLT to analyze “unit root” distributions. Other early applications of the FCLT, with i.i.d. or martingale difference sequence errors, to statistics involving I( 1) processes include Bobkoski (1983) and Solo (1984). Phillips’ (1987a) influential paper demonstrated the power of this approach by deriving the distribution of the AR( 1) estimator and t-statistic in the misspecified case that the process has additional [non-AR(l)] dependence. These were paralleled by important developments in the asymptotics of multivariate unit root models; see the chapter by Watson in this Handbook for a review.
Ch. 46:
Unit
Roots, Structural
2745
Breaks and Trends
The aim of this chapter is to provide a treatment at a level suitable for graduate students and applied econometricians. To enhance accessibility, we make two main compromises in generality and rigor. The first is to restrict attention to time series which can be written as linear processes with martingale difference errors, subject to some moment restrictions. This class is rich enough to capture the key complications in the theory and practice of inference concerning unit roots and trend breaks, namely the presence of possibly infinitely many nuisance parameters describing the short-run dependence of an I(0) disturbance. However, most of the results hold under some forms of nonstationarity. References to treatments which handle such nonstationarity are given in Section 2.4. The second technical compromise concerns details of proofs of continuity of functionals needed to apply the continuous mapping theorem; these details are typically conceptually straightforward but tedious and notationally cumbersome, and references are given to complete treatments when subtleties are involved.
2.1.
Basic concepts and notation
Throughout this chapter, u, denotes a purely stochastic I(0) process and E, denotes a serially uncorrelated stochastic process, specifically a martingale difference sequence. The term “I(O)“is vague, so defining a process to be I(0) requires additional technical assumptions. The formulation which shall be used throughout this chapter is that u, is a linear process with martingale difference sequence errors. That is, the I(0) process u, has the (possibly infinite) moving average representation, 0, =
c(L)q,
t=O ,-+1 ,-+2 ,.. . 3
(2.1)
where c(L) = C,?=,cjL’ is a one-sided moving average polynomial in the lag operator L which in general has infinite order. The errors are assumed to obey
E(E,lE,_1,E,_2,...)=0, T-1
f E(&:I&,_1,E,_2,...)~a.s. r=1
E(~;ll.2,_~,~,_~,... )
E~:=c-+o
a.s.forallt.
as T+Q (2.2)
That is, E, can exhibit conditional heteroskedasticity but this conditional heteroskedasticity must be stationary in the sense that fourth moments exist and that E, is unconditionally homoskedastic. Because E, is unconditionally homoskedastic, under (2.1) and (2.2) U, is covariance stationary.’ This simplifies the discussion of ‘A time series y, is strictly stationary if the distribution series is covariance stationary (or second-order stationary) independent oft.
of (y ,+,,...,y,+,)doesnotdependonk.The if Ey, and Ey,y,_ j, j = 0, k 1,. . . exist and are
functions of second moments of L’,such as its spectrum, s,.(c)), or autocovariances, r,,(,j),j = 0, k 1, & 2,. The representation (2.1) is similar to the Wold representation for a covariance stationary series, although the Wold representation only implies that the errors are serially uncorrelated, not martingale difference sequences. Central to the idea that a process is I(O), is that the dependence between distant observations is limited. In the context of (2.1), this amounts to making specific assumptions on c(L). The assumption which will be maintained throughout is that c(L) has no unit roots and that it is one-summable [e.g. Brillinger (1981, ch. 2.7)]
c(l)#O
and
fj/cjl
< co,
(2.3)
j=O
where c(l) = CJYocj. The conditions
(2.3) can alternatively
be written
as restrictions
on the spectrum of u,,s,(w). Because s,(w) = (aE2/27C)ICjm_oCjeiwj12(where i = fi I), ~~(0) = ofc( 1)‘/27c, so c(1) # d implies that the spectral density of u, at frequency zero is nonzero. Similarly, the one-summability condition implies that ds,(w)/do is finite at o = 0. Thus, these conditions on c(L) restrict the long-term behavior of u,. Unless explicitly stated otherwise, throughout this chapter it is assumed that u, satisfies (2.1))(2.3). The definition of general orders of integration rests on this definition of I(0): a process is said to be I(d), d 3 1, if its dth difference, Ad+ is T(O). Thus u, is I(1) if Au, = u,, where u, satisfies (2.1))(2.3). In levels, u, = cf= rus + uo, so that the specification of the levels process of u, must also include an assumption about the initial condition. Unless explicitly stated otherwise, it is assumed that, if u, is I(l), then the initial condition satisfies Eui < co. A leading example of processes which satisfy (2.3) are finite-order ARMA models as popularized by Box and Jenkins (1976). If u, has an ARMA(p, q) representation, then it can be written
P(W, = 4(0%
(2.4)
where p(L) and 4(L), respectively, have finite orders p and q. If the roots of p(L) and 4(L) lie outside the unit circle, then the ARMA process is stationary and invertible and u, is integrated of order zero. If u, satisfies (2.4) and is stationary and invertible, then u, = c(L)a, where c(L) = p(L)_ ‘4(L) and it is readily verified that (2.3) is satisfied, since #( 1) # 0 and eventually c(L) decays exponentially. ARMA models provide a simple framework for nesting the I(0) and I(1) hypotheses, and are the origin of the “unit root” terminology. Suppose u, in (1.1) satisfies (1 - olL)u, = (1 - BL)u,, where u, is I(0). If (~1 < 1, u, is stationary.
(2.5) If 181-c 1, then (1 - OL) is said to be
invertible. If c( = 1 and 101< 1, then u, is integrated of order one; that is, U, - u0 can be expressed as the partial sum -- loosely, the “integration” - of a stationary process. If c( = 1 and 6, = 1, then u, = L’,+ (u, - L’J and u, is stationary, or integrated of order zero. [If (3= 1 and (a 1< 1, then u, is integrated of order - 1, but we will not consider this case since then u, in (2.2) can be replaced by its accumulation I:= rus, which in turn is I(O).] This framework provides an instructive interpretation of the I( 1) and I(O) models in terms of the properties of long-run forecasts of the series. As Harvey (1985, 1989) has emphasized, an intuitively appealing definition of the trend component of a series is that its long-run forecast is its trend. If u, is I(l), then its long-run forecast follows a martingale, while if u, is I(O), its long-run forecast tends to its unconditional mean (here zero). In this sense, if u, is T(1) then it and y, can be said to have a stochastic trend. This correspondence between the order of integration of a series and whether it has a stochastic trend is formally provided by Beveridge and Nelson’s (1981) decomposition of u, into I(1) and I(O) components. Suppose that Au, = u,. The BeveridgeeNelson (1981) decomposition rests on writing c(L) as c(L) = c(l) + [c(L) - c(l)] = c( 1) + c*(L)A, where A = 1 - L and cj* = - C,: j+ 1ci (this identity is readily verified by writing out c(L) - c( 1) = AC*(L) and collecting terms). Thus u, can be written u, = ~(1)s~ + c*(L)AE~. Then, because u, = cf= iv, + tq,, we get the Beveridye-Nelson decomposition
u, = c( 1) i
E, + c*(L)&, + C,,
(2.6)
s=l
where ul,, = u0 - c*(L)&,. It is readily verified that, under covariance stationary. This follows from the one-summability that c*(L) is summable. [Specifically,
(2.1)-(2.3), c*(L)q is of c(L), which implies
which is finite by (2.3).] Thus,
E(c*(L)E,)* = f j=O
(cj*)‘o, < (
j~olq
>
*g:,
which is finite by (2.2) and (2.3). The BeveridgeeNelson decomposition (2.6) therefore represents U, as the sum of a constant times a martingale, a covariance stationary disturbance and an initial condition do. If u. is fixed or drawn from a distribution on the real line, then zio can be neglected and often is set to zero in statements of the Beveridge-Nelson
J.H. Stock
2748
decomposition. The martingale term can be interpreted as the long-run forecast of u,: because c*(L) is summable, the long-term forecast, u,+~,~ for k very large, is c(l)Cf, rs,. Thus an I(1) series can be thought of as containing a stochastic trend. Equally, if u, is I(O), then plim,, mu, + kit = 0, so that U, does not have a stochastic trend.
2.2.
The functional
central limit theorem and related tools
If u, is stationary, or more generally has sufficiently many moments and limited dependence on past observations, then averages such as T- ‘CT= ,u: will be consistent for their expectation, and scaled sums like T-1’2Ct’= 1u, will obey a central limit theorem; see the chapter by Wooldridge in this Handbook for a general treatment. By the nature of the problems being studied, however, conventional limit theory does not apply to many of the statistics covered in this chapter. For example, the null distribution of a test for a unit autoregressive root is derived for U, being I(1). However, this violates the assumptions upon which conventional asymptotic tools, such as the weak law of large numbers (WLLN), are based. For example, if u, is I(l), then the sample mean U is Op( T”‘) and T- ‘j2ti has a limiting normal distribution, in sharp contrast to the I(0) case in which U is consistent2 The approach to this and related problems used in this chapter is based on the functional central limit theorem. The FCLT is a generalization of the conventional CLT to function-valued random variables, in the case at hand, the function constructed from the sequence of partial sums of a stationary process. Before discussing the FCLT, we introduce extensions to function spaces of the standard notions of consistency, convergence in distribution, and the continuous mapping theorem. Let C[O, l] be the space of bounded continuous functions on the unit interval with the sup-norm metric, d(f, g) = ~up,,t,,r~lf(s) - g(s)l, wheref, gEC[O, 11. Consistency. A random element ~,~f)ifPr[d(~,,f)>6]+Oforall6>0.
r+C[O,
l] converges
in probability
to f (that is,
Convergence in distribution. Let {r,, T> l} be a sequence of random elements of C[O, l] with induced probability measures {r-c,}. Then rcr converges weakly to 7c, or equivalently <,=s 5 where 4 has the probability measure rc, if and only if jfdrcr+Jfdrrforallb ounded continuous f: C[O, l] -+ W. The notations CT * 5 and (,(.)a[(.), where “.” denotes the argument of the functions 5, and [, are used interchangeably in this chapter. ‘Suppose Au, = E, and I+, = 0. Clearly conventional assumptions in the WLLN, such as u, having a bounded second moment, do not hold. Rather, T-% = T-3’2~~z 1u, = T3’*xT= ,xT= ,E, = T- 1’2xT= 1(l -(s - 1)/T)&,, so a central limit theorem for weighted sums implies that T-‘$% ‘YO, af/3).
Ch. 46: Unit RooI.~. Strucrurul
2749
l3rrak.s und Trvnd.s
If h is a continuous The continuous mupping theorem (CMT). C[O, l] to some metric space and (,a 5, then h(t,) * h(t).
functional
mapping
The FCLT generalizes the usual CLT to random functions 5r~C[0, I]. Let I.1 denote the greatest lesser integer function. Let lT(/l) be the function constructed by linearly interpolating between the partial sums of c, at the points ,I = (0, l/T, 2/T,. . , l), that is,
so that t, is a piecewise-linear random element of C[O, 11. The CLT for vectorvalued processes ensures that, if i,, . . . ,3., are fixed constants between zero and one and condition (2.2) holds, then [tr(I_,), rT(,12), . . , i;T(&,)] converges in distribution jointly to a k-dimensional normal random variable. The FCLT extends this result to hold not just for finitely many fixed values of J*, but rather for 5r treated as a function of 2. The following FCLT is a special case of Brown’s (1971) FCLT [see Hall and Heyde (1980), Theorem 4.1 and discussion]. Theorem
1 (Functional
cent&
limit theorem for a martingale)
Suppose that E, is a martingale difference sequence which satisfies (2.2). Then tT = W, where W is a standard Brownian motion on the unit interval. An FCLT for processes which satisfy (2.1)-(2.3) can be obtained by verifying that condition (5.24) in Hall and Heyde’s (1980) Theorem 5.5 is satisfied if c(L) is one-summable. [One-summability is used because of its prior use in unit root asymptotics [Stock (1987)], although it can be replaced by the weaker condition that c(L) is &summable; see Solo (1989) and Phillips and Solo (1992).] However, Hall and Heyde’s theorem is more general than needed here and for completeness an FCLT is explicitly derived from Theorem 1 for processes satisfying (2.1))(2.3). The argument here relies on inequalities in Hall and Heyde (1980) and follows Phillips and Solo (1992), except that the somewhat stronger conditions used here simplify the argument. See Phillips and Solo (1992) for an extensive discussion, based on the Beveridge-Nelson decomposition, of conditions under which the FCLT holds for linear processes. To show that Theorem 1 and the Beveridge-Nelson decomposition can be used to yield directly an FCLT for partial sums of I(0) processes which satisfy conditions (2.1))(2.3), let (,,(A) = (cf T) - 1’2 czI 0, + (Tj. According
to the BeveridgeeNelson
CT4 brrl,+1 .
decomposition
(2.6), this scaled partial
sum for
J.H.Sfock
2750
fixed 1. is c( l)T- “‘~~~j~, plus a term which is T- Ii2 times an I(0) variable. Because tT+ W, this suggests that [,.,.*c(l)W. To show this formally, the argument that [,,T - Cam LO must be made uniformly in I., that is, that Pr[sup,l t,,,(A) ~ c( I)tr(jb)l > (s] - 0 for all 6 > 0. Now. ITi
l<,.,(2) - c(1)5,(i.)I = (a:T))
‘I2 1 0, + (Ti. - [Tj.])C,TAl+, ,=I -
(Ti.
-
[
Ti.] )c( l)+il
[Ti.] ~ c(1) 1 cI 1=1
+,
17-11 = (a,27--
I/2
c(
1) 1 F, + c*(L)c[,,] ~ c*(L)i:, f=l
+ (Tr’ ~ [T’])(c(l)EIT~]+ -c(l)
[Til 1 E, -(Ti
-
1 + c*(L)dE,,.;,,+1)
[Tr”])c(l)c,r,,+,
1=1 =(aZT)-“21~*(L)~,TAIl-(.*(L)~0
6
bfT)-
“‘{
Ic*(L)c[,,jl
+(Ti.-
+
ic*(L)c,T~]+lI
~2a,‘max,=,,,,,,TIT-1!2~*(L)~,I
+
- c(l)tT(A)I > S] < Pr[2cr~~‘max,IT~
1)1
k*b%,I}
+oF;’ Tm”21c*(L)c,I
where the second equality uses the BeveridgeeNelson T-“2 Ic*(L)c,l in the final line of (2.7) does not depend negligible, so we drop it and have Pr[sup,llvT(A)
[TI.])(c*(L)h,,.,,+
(2.7)
decomposition. The term on 1, and is asymptotically
1’2c*(L)~,I > 61
-3
E max, IT - 1’2c*(L)c, 13
(2.8) where the final inequality follows from Minkowski’s inequality. Because max,EleJ” < m [by (2.2)] and &?= 1/cf I < M, [by the argument following (2.6)], Pr[sup, Ir’,.,(i) c(l)<,.(A)1 > S] +O for all 6 > 0 so trT - c(l)<, 3 0. Combining this asymptotic equivalence with Theorem 1, we have the general result that, if u, satisfies (2.1))(2.3) then t,.,*c(l)W.
Ch. 46: Unit Koors. Structurul
Breaks
and
2751
Trends
The continuity correction involved in constructing 5, and cVr is cumbersome and is asymptotically negligible in the sup-norm sense [this can be shown formally using the method of (2.7) and (2.8)]. We shall therefore drop this correction henceforth and write the result t,,=+c(l)W as the FCLTfor general I(0) processes, IT.1 T-1/2
1 u,*o$(l)w(~)
= oW(.),
(2.9)
s=l where w = ~,c( 1).3 Suppose u, is an I(1) process with Au, = u, and, as is assumed throughout, Eui < co. Then the levels process of u, obeys the FCLT (2.9): TP 1’2ulPI = T-“2C~\u,+ T-1i2u,=oW(.), where T l/22(0 A 0 by Chebyshev’s inequality. A special case of this is when u0 is fixed and finite. The result (2.9) provides a concrete link between the assumptions (2.2) and (2.3) used to characterize an I(1) process, the BeveridgeeNelson decomposition (2.6) and the limit theory which will be used to analyze statistics based on I( 1) processes. Under (2.2) and (2.3), the partial sum process is dominated by a stochastic trend, as in (2.6). In the limit, after scaling by T ‘I2 , this behaves like w times a Brownian motion, where w2 = 27cs,(O) is the zero-frequency power, or long-run variance, of G’,. Thus the limiting behavior of u,, where Au, = u,, is the same (up to a scale factor) for a wide range oft(L) which satisfy (2.2). It is in this sense that we think of processes which satisfy (2.1)-(2.3) as being I(0).
2.3.
Examples
and preliminary
results
The FCLT and the CMT provide a powerful set of tools for the analysis of statistics involving I(1) processes. The examples in this section will be of use later but are also of independent interest. Example
1. Sample moments of I( 1) processes
A problem mentioned in Section 2.2 was the surprising behavior of the sample mean of an I(1) process. The limiting properties of this and higher moments are readily characterized using the tools of Section 2.2. Let u, be I(l), so that Au, = u,, and let u0 = 0. Then, T-“2U=
T-3$u,=
T-’
1(T-1’2u,,,,)dA:+
.f (T-1’2+ t=1
T-3I2~,,
s 0
(2.10) 3Formally, the process on the left-hand side of (2.9) is an element of D[O, 11, the space of functions on [0, I] that are right-continuous and have left-hand limits. However, the discontinuous partial sum process is asymptotically equivalent to 5,,T~C[0, I], for which Theorem I applies. See Billingsley’(1968, ch. 3) or Ethier and Kurtz (1986) for a treatment of convergence on D[O, 11.
J.H. Stock
2152
where the final equality follows by definition of the integral. The final expression in (2.10) can be written T-‘%i = hI(T-lh,T.I) + T-‘h,(T-“2u,,.,), where h, and h, are functions from [0, l] + 3, namely h,(f) = Jkf(1.) dl, and h2(,f) = f(1). Both functions are readily seen to be continuous with respect to the sup-norm, so by (2.9) and the continuous mapping theorem,
s 1
h,(T~l’2u,,.,)=>h,(oW) = 0
W(A) d1.
0
so T-1h,(T-‘i2u,,.,)
A0 and T- ‘%Ii~o~~ W(A)di., which has a normal distribution (cf. footnote 2). This approach can be just as easily applied to higher moments, say the kth moment.
(2.11)
where the convergence follows from the FCLT and the CMT. The final expression in (2.11) uses a notational convention which will be used commonly in this chapter: the limits on integrals over the unit interval will be omitted, so for example JWk denotes jA( W(Iw))kdA. S imilarly, the stochastic (It6) integral size W(A) dG(i) is written SW dG for two continuous-time stochastic processes W and G. Example
2. Detrended
I( 1) processes
Because of the presence of the deterministic term d, in (1. l), many statistics of interest involve detrending. It is therefore useful to have limiting representations of the detrended series. The most common form of detrending is by an ordinary least squares (OLS) regression of y, on polynomials in time, the leading cases being demeaning and linear detrending. Let
Y;=Y,-
Y:=Yt-&-BIL
T-’ f
s=1
Y,,
(2.12b)
where (ii,, /;i,) are the OLS estimators of the parameters in the regression of y, onto (1, t). If d, = /I,, then (2.12a) applies, while if d, = /I’, + /I, t, then (2.12b) applies.
2753
Ch. 46: Unit Roots, Structural Breaks and Trends
As in the previous example, suppose that U, is I(l), so that (2.9) applies. NOW - T-“‘U. The CMT and (2.10) T-lC~zl~S, so T-l/Zyp = T-“% so that the demeaned I(1) process thus imply that T- 1’2y~T.l=Sw{ F[.) - SW>, IF1 converges to a “demeaned” Brownian motion. Similar arguments, with a bit more algebra, apply to the detrended case. Summarizing these two results, if U,is a general I(1) process so that Au, = u, where u, satisfies (2.1)-(2.3) and Eut < co, then we have4
yf = u,-
T-liZy&.I*oW”(~),
T-
1’2y;Tl-wWr(.),
where W“(n) = W(1) -
s
W,
(2.13a)
where W’(A)= W(A)-(4-62)
W-(121-6)
sW(s)ds. s
s
(2.13b) Perron (1991~) provides expressions extending (2.13) to the residuals of an I(1) process which has been detrended by an OLS regression onto a pth order polynomial in time. These results can be used to study the behavior of an I( 1) process which has been spuriously detrended, that is, regressed against (1, c) when in fact y, is purely stochastic. Because the sample R2 is 1 - {I,‘= 1(y)I:)*/CT= 1(y:)‘}, the results (2.13) and the CMT show that RZ has the limit R2 =S 1 - {s(Wr)‘/s( WP)2), which is positive with probability one; that is, the regression R2 is, asymptotically, a positive random variable. It follows that the standard t-test for significance of p1 rejects with probability one asymptotically even though the true coefficient on time is zero [Durlauf and Phillips ( 1988)15. Next, consider the autocorrelogram of the detrended process, P&) = 9,,(CT~l)/&(O) = h&T- 1/2y;r.1), say, where &(j) = (T - lj I)-’ x XT=,j, + ly:y:_, j,. Because h, is a continuous mapping from D[O, l] to O[O, 11, the FCLT and CMT imply that or * p*, where p*(1) = (1 - A)- ‘If= 1W’(s) W’(s - A)ds/ [Wr2, 0 < 2 < 1. Thus, in particular, the first k sample autocorrelations converge in probability to one for k fixed, although it eventually declines towards zero. Nelson and Kang (198 1) show, using other techniques, that the autocorrelogram dips below zero, suggesting periodicity which spuriously arises from the detrending of the I( 1) process. The results here indicate that this is an asymptotic phenomenon, when the autocorrelation is interpreted as a fraction of the sample size. Example 3. Cumulated detrended I(0) processes
Section 4 considers testing for a unit moving average root when it is maintained that there is a deterministic trend. A statistic which arises in this context is the 4Derivations of W’ arc given in the and Phillips (1988); the result can also Park and Phillips (1988) demonstrate, of the projection of W onto (1,s). ‘Phillips (1986) gives similar results
proof of Theorem 5.1 of Stock and Watson (1988b) and in Park be derived from Theorem 2.1 of Durlauf and Phillips (1988). As W’ can be thought of as detrended Brownian motion, the residual for regressions
with two independent
random
walks.
J.H. Stock
2754
cumulation of an I(0) process, which has been detrended by OLS. The asymptotics of this process are also readily analyzed using the FCLT and the CMT. For this example, suppose that u, is a general I(0) process and U, = u,, where V, satisfies (2.1))(2.3). Consider the demeaned case, and define the statistic LTJ.1
Y;,.(%)
=
T- “’
c yi, s=l
so CT21
ITAl
Y;,(l.)=
T-1'2
1 (us-~)=
s= 1
T-‘/z
1
Tm1’2 f
u,_
SE1
u,.
f=l
Then (2.9) and the CMT yield the limit Yz, =~o@‘, where B”(A) = w(A) - AW(1). The process BP is a standard Brownian bridge on the unit interval, so called because it is a Brownian motion that is “tied down” to be zero at 0 and 1. Similarly, define Brownian Y:,(A) = T- 1’2C&ajyz; then YbT 3 wE, where B’ is a second-level bridge on the umt interval, given by B’(1) = W(A) - APV(l) + 6A(1 - 1”){+ W( 1) - SW) [MacNeill (1978)]. Collecting these results, we have T-1’2
c
yr*wB’(.),
B’(1”) = W(A) - iW( l),
y:=>oB’(.),
B”(A)= H’(+APV(l)+61.(1
(2.14a)
s=1
T- 1’2 c s=1
LT.1
-A){+‘(l)-SW}, (2.14b)
MacNeill (1978) extended these results to kth order polynomial detrending. Let Yt&(A) = T-“2C~~jy~k) be the process of cumulated kth order detrended data, , tk). Then where y,(k’ is the residual from the OLS regression of y, onto (1, Ytk =+WIP - l), where Bck’ is a kth level generalized Brownian bridge, expressions for which are given by MacNeill (1978, eq. 8).
t,
Example
4. Processes
with an autoregressive
root local to unity
One of the issues considered in Section 3 is the asymptotic properties of statistics when the process is nearly I(l), in the sense that the largest root of the process is local to unity. The starting point in these calculations is characterizing the largesample behavior of the series itself when the root is close to one. Let U, obey u,=cIu,_,
+u,, where
do= 1 +c/T
and
Eu~-c CO.
(2.15)
This is the local-to-unity model considered (under various assumptions on the disturbances ut) by Bobkoski (1983), Cavanagh (1985), Chan and Wei (1987) and
2755
Ch. 46: Unit Roots, Structural Breaks and Trends
Phillips (1987b). The treatment here follows Bobkoski (1983). In particular, we use the method of proof of Bobkoski’s (1983) Lemma 3.4 to generalize his local-to-unity representations from his case of i.i.d. disturbances to general I(0) disturbances which satisfy (2.1))(2.3). As we shall see, this extension is a straightforward application of the FCLT (2.9) and the CMT. Use recursive substitution in (2.15) to write u, as y-- l/2u
_ T1-
112
1-l =
T-lj2
c
(u’-~-
l)u,+
T-1/2
=
(ct -
1):I~cx-~-~( 1
= k,(o,S,,)(tlT)
i
v,+
T-li2cc’u,
s=l
s= 1
T-i/2$1~r)+
+ o,(1).
~~~~~~~~~~
~-+h,
(2.16)
The third equality in (2.16) obtains as an identity by noting that ,r - 1 = (IX- l)CiZ+~j and rearranging summations. The final equality obtains by noting that (1 + c/T)‘~” = exp(ci) + o(l) uniformly in J., 0 < 1. < 1, and by defining k4(f)(I.) = CIAec(‘-“‘f(s)ds+ f(i), where toT is defined in Section 2.2. The op( 1) term in the final expression arises from the assumption Eut < a, so T- li2u0 = oP( I), and from the approximation (1 + c/T)[~“] E exp(c2). As in the previous examples, k, is a continuous functional, in this case from C[O, l] to C[O, 11. Using the FCLT (2.9) and the CMT we have T-“2~t,.l~k,(wW)()
= ml+‘,(.),
(2.17)
where W’,(n) = cJG e c(‘-s)W(s) ds + W(A). The stochastic process WC is the solution to the stochastic differential equation, dw, = c WC(%)d1. + d W(2) with W,(O) = 0. Thus, for Mlocal-to-unity, T - 1i2uCTIconverges to w times a diffusion, or OrnsteinUhlenbeck, process. A remark on the interpretation of limiting jiinctionals of‘ Brownian motion. The calculations of this section ended when the random variable of interest was shown to have a limiting representation as a functional of Brownian motion or, in the localto-unity case of Example 4, a diffusion process. These representations show that the limiting distribution exists; they indicate when a limiting distribution is nonstandard; and, importantly, they show when and how nuisance parameters describing the short-run dependence of u, enter the limiting distribution. Because W is a Gaussian process, the results occasionally yield simply-evaluated distributions. For example, W(l),lW, and JsW h ave normal distributions. However, in most cases the limiting distributions are nonstandard.
J.H. Stock
2756
This leads to the practical question of how to compute limiting distribution functions, once one has in hand the limiting representation of the process as a functional of Brownian motion. The simplest approach, both conceptually and in terms of computer programming, is to evaluate the functional by Monte Carlo simulation using discretized realizations of the underlying Brownian motions. This is equivalent to generating pseudo-data j, from a Gaussian random walk with PO = 0 and with unit innovation variance and replacing W by its discretized realization. For example, W(1)’ would be replaced by (T-“‘j,)’ and W”(.) would be replaced by T- 1’2{j$T.l - T-lx,‘= ,$,}. For T sufficiently large, the FCLT ensures that the limiting distribution of these pseudo-random variates converges to those of the functionals of Brownian motion. The main disadvantage of this approach is that high numerical accuracy requires many Monte Carlo repetitions. For this reason, considerable effort has been devoted to alternative methods for evaluating some of these limiting distributions. Because these techniques are specialized, they will not be discussed in detail, although selected references are given in Section 2.4.
2.4.
Generalizations
and additional mferences
The model (2.1)-(2.3) provides a concise characterization of I(0) processes with possibly infinitely many nuisance parameters describing the short-run dependence, but this simplicity comes at the cost of assuming away various types of nonstationarity and heteroskedasticity which might be present in empirical applications. The key result used in Section 2.3 and in the sections to follow is the FCLT, which obtains under weaker conditions than stated here. The condition (2.2) is weakened in Brown’s (1971) FCLT, which uses the Lindeberg condition and admits unconditional heteroskedasticity which is asymptotically negligible, in the sense that T-‘z;&: -+a:. The result (2.9) for linear processes can be obtained under Brown’s (1971) weaker conditions by modifying the argument in Section 2.2; see Hall and Heyde (1980, Chapter 4) and Phillips and Solo (1992). An alternative approach is to use mixing conditions, which permit an explicit tradeoff between the number of moments and the degree of temporal dependence in u, and which admit certain nonstationarities (which are asymptotically negligible in the sense above). This approach was introduced to the unit roots literature by Phillips (1987a), who used Herrndorf’s (1984) mixing-condition FCLT, and much of the recent unit roots literature uses these conditions. Phillips (1987b) derives the local-to-unity result (2.17) using Herrndorf’s (1984) mixing-condition FCLT. An elegant approach to defining I(0) is simply to make the high-level assumption that u, is I(0) if its partial sum process converges weakly to a constant times a standard Brownian motion. Thus (2.9) is taken as the assumption rather than the implication of (2.1)-(2.3). With additional conditions assuring convergence of sample moments, such as sample autocovariances, this “high-level” assumption provides a general definition of I(O), which automatically incorporates u, which
Ch. 46:
Unit
Root,s, Structural
2757
Breaks und Trends
satisfy Herrndorf’s (1984) FCLT’s. The gain in elegance of this approach comes at the cost of concreteness. However, the results in this chapter that rely solely on the FCLT and CMT typically can be interpreted as holding under this alternative definition. The FCLT approach is not the only route to asymptotic results in this literature. The approach used by Fuller (1976) Dickey and Fuller (1979) and Sargan and Bhargava (1983a) was to consider the limiting behavior of quadratic forms such as C,r= ,u: expressed as $A,v], where q is a T x 1 standard normal variate; thus the limiting behavior is characterized by the limiting eigenvalues of A,. See Chan (1988) and Saikkonen and Luukkonen (1993b) for discussions of computational issues involved with this approach. There is a growing literature on numerical evaluation of these asymptotic distributions. In some cases, it is possible to obtain explicit expressions for moment generating functions or characteristic functions which can be integrated numerically; see White (1958, 1959), Evans and Savin (1981a), Perron (1989b, 1991a), Nabeya and Tanaka (1990a, 1990b) and Tanaka (1990a). Finally, under normality, exact finite-sample distributions can be computed using the Imhof method; see, for example, Evans and Savin (198 1b, 1984).
3.
Unit autoregressive
This section examines y, = d, + uI)
roots inference
u,=c(u,-l
concerning
+u,,
c1in the model
t= 1,2,...,T
(3.1)
where CI is either close to or equal to one and u, is I(O) with spectral density at frequency zero of d/27c. Unless explicitly stated otherwise, it is assumed that u0 might be random, with Euf, < co, and that u, is a linear process satisfying (2.1)-(2.3). The trend term d, will be specified as known up to a finite-dimensional parameter vector /?. The leading cases for the deterministic component are (i) no deterministic term (d, = 0); (ii) a constant (d, = PO); and (iii) a linear time trend (d, = p,, + Prt). Extensions to higher-order polynomial trends or trends satisfying more general conditions are typically straightforward and are discussed only briefly. Another possibility is a piecewise-linear (or broken) trend [Rappaport and Reichlin (1989), Perron (1989a, 1990b)], a topic taken up in Section 5. Most of the procedures for inference on CItreat the unknown parameters in the trend term d, as nuisance parameters, so that many of the statistics can be represented generally in terms of detrended data. Throughout, yp denotes a general detrended process with unspecified detrending. For specific types of detrending, we adopt Dickey and Fuller’s (1979) notation: yr denotes demeaned data and y: denotes linearly detrended data when the detrending is by OLS as in (2.12).
J.H. Sfock
2758
The focus of this section is almost exclusively on the case in which there is at most a single real unit root. This rules out higher orders of integration (two or more real autoregressive unit roots) and seasonal unit roots (complex roots on the unit circle). These topics have been omitted because of space limitations. However, the techniques used here extend to these other cases. References on estimation and testing with seasonal unit roots include Hasza and Fuller (198 l), Dickey et al. (1984), Chan and Wei (1988), Ghysels (1990) Jegganathan (1991), Ghysels and Perron (1993), Hylleberg et al. (1990), Diebold (1993) and Beaulieu and Miron (1993). See Banerjee et al. (1992a) for an overview. In the area of testing when there might be two or more unit roots, an important practical lesson from the theoretical literature is that a “downward” testing procedure (starting with the greatest plausible number of unit roots) is consistent, while an “upward” testing procedure (starting with a test for a single unit root) is not. This was shown for F-type tests in the no-deterministic case by Pantula (1989). Based on simulation evidence, Dickey and Pantula (1987) recommend a downwardtesting, sequential t-test procedure. Pantula (1989) proves that the distribution of the relevant t-statistic under each null has the standard Dickey-Fuller (1979) distribution. Also, Hasza and Fuller (1979) provide distribution theory for testing two versus zero unit roots in an autoregression.
3.1.
Point estimation
The four main qualitative differences between regressions regressors are that, in contrast to the case of I(O) regressors, linear combinations of regression coefficients is nonstandard,
with I(1) and I(0) inference on certain with: (i) estimators
which are consistent at rate T rather than at the usual rate fi; (ii) limiting distributions of estimators and test statistics which are often nonstandard and have nonzero means; (iii) estimators which are consistent even if the regression misspecifies the short-run dynamics, although, in this case, the limiting distributions change; and (iv) limiting distributions which depend on both the true and estimated trend specifications. These differences between I(0) and I( 1) regressors can be seen by examining the OLS estimator of CY in (3.1). First, consider the no-deterministic case, so oi = CT= 2ytyt _ i/ C,‘= &_ i. When 1ct1< 1 and u, = E,, conventional a normal limiting distribution, ifIx < 1,
T”‘(d - a) A
N(0,
1 - cr2),
fi
asymptotics
apply and oi has
(3.2)
which was derived by Mann and Wald (1943) under the assumptions that E, is i.i.d. and all the moments of E, exist. In contrast, suppose that the true value of LYis 1, and let u, follow the general
Ch. 46:
Unit
Roots, Structural
2759
Breaks and Trends
linear process (2.1)-(2.3). Then the OLS estimator
T-’ it
t=2
T(&-1)=
Tp2
can be rewritten,
AY,Y,-, T
~
2 yf-, 1=2
T-’ f:
T-‘(yZ,-my:)-
(AY,)'
1=2
(3.3) r2
i~:-~ 1=2
where the second line uses the identity y: - yf = 2C,T=,Ay,y,_ 1 + CTE2(AyJ2. Although the conditions for Mann and Wald’s result do not apply here because of the unit autoregressive root, an asymptotic result for T(oi - 1) nonetheless can be obtained using the FCLT (2.9) and the CMT. Because Eni < co, T-‘12yl = T - “‘(u, + vl) -% 0. Thus, because y, = u0 + C’= 1us,by (2.9) and the CMT, we have T-‘i2yT=~W(1) and T-2CtT=2yf_1 =z-w2~Ws?. Also, T - ‘CT= 2(Ay,)2 = y*,,(O) % Y,,(O) = Y,(O). Thus, ifa = 1,
T(& -
l)=
#V(l)2
- K)
where
K =T,
(3.4a)
JW2, This expression
was first obtained
by White (1958) in the AR(l) model with K = 1
(although his result was in error by a factor of $) and by Phillips (1987a) for general K. An alternative expression for this limiting result obtains by using the continuoustime analogue of the identity used to obtain the second line of (3.3), namely IWdW=i(W(1)2 - 1) [Arnold (1973, p. 76)]; thus,
ifa = 1,
T(oi - l)=
Wdw-
$c - 1)
w2.
(3.4b)
This result can also be obtained from the first line in (3.3) by applying Theorem 2.4 of Chan and Wei (1988). The results (3.2) and (3.4) demonstrate the first three of the four main differences between I(0) and I(1) asymptotics. First, the OLS estimator of cI is “superconsistent”, converging at rate T rather than ,,&. While initially surprising, this has an intuitive interpretation: if the true value of ccis less than one, then in expectation the mean squared error E(y, - cly,_ 1)2 is minimized at the true value of c( but remains finite for other values of cc In contrast, if CIis truly 1, then E(AyJ2 is finite but, for any
J.H. Stock
2160
fixed value of CI# 1, (1 - clL)y, = by, + (1 - cr)y,_ 1 has an integrated component; thus the OLS objective function T- ‘C,TZ2(yt - ccy,_ J2 is finite, asymptotically, for CI= 1 but tends to infinity for fixed tl # 1. An alternative intuitive interpretation of this result is that the variance of the usual OLS estimator depends on the sampling variability of the regressors, here CT= 2yf_ 1; but, because y, is I(l), this sum is O,( T*) rather than the conventional rate O,(T). Second, the limiting distribution in (3.4) is nonstandard. While the marginal distribution of W(l)* is XT, the distribution of the ratio in (3.4a) does not have a simple form. This distribution has been extensively studied. In the leading case that u, is serially uncorrelated, then 0’ = y,(O) so that K = 1 and (3.4a) becomes +( W( 1)2 - 1)/jW2 [and (3.4b) becomes j W d W/j W2], This distribution was tabulated by Dickey (1976) and reproduced in Fuller (1976, Table 85.1). The distribution is skewed, with asymptotic lower and upper 5 percent quantiles of - 8.1 and 1.28. Third, oi is consistent for c1 even though the regression of y, onto y,_i is misspecified, in the sense that the error term v, is serially correlated and correlated with (differences of) the regressor. This misspecification affects the limiting distribution in an intuitive way. Use the definition of K to write
Because o* can be thought of as the long-run variance of u,, +(K - 1) represents the correlation between the error and the regressor, which enters as a shift in the numerator of the limiting representation but does not introduce inconsistency. This term can increase the bias of d in finite samples. Although this bias decreases at the rate T-i, when u, is negatively serially correlated, so that +(K - 1) is positive, in sample sizes often encountered in practice this bias can be large. For example, if u, follows the MA(l) process u, = (1 - HI+,, then +(K - 1) = e/(1 - e)‘, so for 8 = 0.8, 3(K - 1) = 20. To examine the fourth general feature of regressions with I(1) variables, the dependence of limiting distributions on trend specifications, consider the case that d, = PO + /I, t. Substitute this into (3.1), transform both sides of (3.1) by (1 - crL), and thus write y,=6,+6,t+cry,-,
+u,,
t= 1,2 ,...,
T,
(3.5)
where 6, = (1 - g)po + LX/~,and 6, = (1 - a)/3,. If both PO and pi are unrestricted, (3.5) suggests estimating c( by the regression of y, onto (1, t, y,_ ,); if b, is restricted a priori to be zero, then CIcan be estimated by regressing y, onto (1, yt_ i). Consider, for the moment, the latter case in which d, = /I, where fi, is unknown. Then, by the algebra of least squares, the OLS estimator of c(,oi”, can be written (after centering
Ch. 46: Unit Roots. Structural
2761
Breaks and Trends
and scaling) as
T-l i T(@-
AY,Y:-
,
t=2
l)= r2
t 2
i 1=2
(YP~J’
T-‘(y’;2_y;2)-
T-’
’ (AyJ2 x2 (3.6)
where yr_ 1 =Y~_~ -(T1)-‘Z~~2y,_1. of T(&” - 1) under the The method for obtaining a limiting representation hypothesis that c( = 1 is analogous to that used for T(B - l), namely, to use the FCLT to obtain a limiting representation for T - “‘yr_ 1 and then to apply the continuous mapping theorem. Expression (2.13a) provides the needed limiting result for the demeaned levels process of the data; applying this to (3.6) yields T(@-
1)=${W”(1)2 - Wp(O)2-
ti)/jWp2 ={jWYdW-+(K-
l)}/jM’f12 (3.7)
where the second representation is obtained using PV’(O)= - SW. The detrended case can be handled the same way. Let 8’ denote the estimator of rl obtained from estimating (3.5) including both the constant and time as regressors. Then T(oi’ - 1) can be written in the form (3.6), with y: replacing yr. The application of (2.13b) to this modification of (3.6) yields the limiting representation T(c?-
I)=+{
Wr(l)2 - Wr(O)2 - tc]/jWr2
= {jW’dW-+(c
I)}/jW? (3.8)
Because the distributions of W, Wp and W’ differ, so do the distributions in (3.4), (3.7) and (3.8). When u, = E, so that K = 1, the distribution of T(B” - 1) is skewed and sharply shifted to the left, with asymptotic lower and upper 5 percent quantiles of - 14.1 and - 0.13. With linear detrending, the skewness is even more pronounced, with 5 percent quantiles of - 21.8 and - 2.66. This imparts a substantial bias to the estimates of CY: for example, with T= 100 and u, = E,, the mean of 8’, based on the asymptotic approximation (3.8), is 0.898. Another feature of regression with I(1) regressors is that, when the regression contains both I( 1) and I(0) regressors, estimators of coefficients (and their associated
J.H. Stock
2162
test statistics) on the I(1) and I(0) regressors, in a suitably transformed regression, are asymptotically independent. This is illustrated here in the AR(p) model with a unit root as analyzed by Fuller (1976). General treatments of regressions with integrated regressors in multiple time series models are given by Chan and Wei (1988), Park and Phillips (1988) and Sims et al. (1990). When u, has nontrivial short-run dynamics so that w2 # y,,(O), an alternative approach to estimating c( is to approximate the dynamics of o, by a pth order autoregression, u(L)u, = e,. In the time-trend case, this leads to the OLS estimator, oi’, from the regression Ay,=60+6it+(~-l)y,_i
+ t
ajAY,-j+c,,
t=1,2
,...,
T.
(3.9)
j=l
If u, in fact follows an AR(p), then e, = E, and (3.9) is correctly specified. To simplify the calculation, consider the special case of no deterministic terms, conditional homoskedasticity and p = 1, so that y, is regressed on (y,_ i,Ay,_i). Define TT = diag( T 1/Z T), let a=(a,,a-l)‘, let zl-i=(Ayt_i,yt_i), and let d be the OLS estimator’of a. Then
(
-1
3-F’ i
l-&3-a)=
Zt_lZ;_lr;l
1=2
)(
3”;’
5 t=2
Z,_lE,
. >
of the FCLT and the CMT shows that Yg ‘CT= 2z,_ lzip 1Y, I* w2~W2), w h ere cc)= oJ( 1 - a(l)) (in the special case p = 1, a( 1) = al).
A direct application dia&,,(0), Similarly,
r,’
i
Z,_lE,=
T-1'21i2A~,-~&p
T-l,i2~t-l~t)'.
r=2
Because Ay,_ 1E, is a martingale difference sequence which satisfies Theorem 1, T~“2CT=2Ayt_1.zt%~*, where ‘I* _ N(0, 01E(AyJ2). Direct application of Chan and Wei’s (1988) Theorem 2.4 (or, alternatively, an algebraic rearrangement of the type leading to (3.4b)) implies that the second term has the limit o,oJW d W. From Theorem 2.2 of Chan and Wei (1988), this convergence is joint and moreover the (W, q*) are independent. Upon completing the calculation, one obtains the result
{ T1’2(Lil - a,), T(oi - l)} _i
N(O, 1 - ~:~>~@4~~dW’~2},
(3.10)
where asymptotically the two terms are independent. The joint asymptotic distribution of { T1j2(b1 -a,), T(B - l)} was originally obtained by Fuller (1976) using different techniques. The result extends to values of p > 1 and to more general time trends. For example, in the AR(p) model with a constant and a linear time trend, T(&‘- l)=+(~,/w)~W~dW/f(W’)~.
2163
Ch. 46: Unit Roots. Structural Breaks and Trends
Ordinary least squares estimation is not, of course, the only way to estimate (x. Interestingly, the asymptotic distribution is sensitive to seemingly minor changes in the estimator. Consider, for example, Dickey’s et al. (1984) “symmetric least squares” estimator in the no-deterministic case
i*Y,Yt-1 a, =
(3.1 la)
T-l c r=2
Y:
Straightforward
+
3(Y: + Y’,)
algebra
and an application
- + T- ’ i T(cc,-
K a-p
T-l
T-2
c
reveals that
(Ay,)’
t=2
l)=
of the FCLT
(3.1 lb)
2 Y: + +(Y: + Y$)
1=2
so that T(@, - 1) is negative with probability one. If point estimates are of direct interest, then the bias in the usual OLS estimator can be a problem. For example, if one’s object is forecasting, then the use of a biased estimator of (Ywill result in median-biased conditional forecasts of the stochastic component.6 This has led to the development of median-unbiased estimators of ct. This problem is closely related to the construction of confidence intervals for czand is taken up in Section 3.3.
3.2.
Hypothesis
3.2.1.
tests
Test of c( = 1 in the Gaussian AR(l)
model
The greatest amount of research effort concerning autoregressive unit roots, both empirical and theoretical, has been devoted to testing for a unit root. Because of the large number of tests available, a useful starting point is the no-deterministic i.i.d. Gaussian AR( 1) model, Yt = NY, -
1 +
5,
E, i.i.d. N(0, a’),
t = 1,2,. . . , T,
(3.12)
‘When (al < 1 and a is fixed, bi is also biased towards zero. In the Gaussian AR(l) model with d, = 0, Hurwicz (1950) derives the approximation Eoi = {(T’ ~ 2T+ 3)/(T2 - l)}x for a close to zero. When G( is close to one, Hurwicz’s approximation breaks down but the distribution becomes well-approximated using the local-to-unity approximations discussed in the next section, and the downward bias remains. Approximate biases in the stationary constant model are given by Marriot and Pope (1954). For an application of these bias expressions, see Rudebusch (1993). Also see Magnus anctpesaran (1991) and Stine and Shaman (1989).
J.H. Stock
2164
where y, = 0. Suppose further that rs2 is known, in which case we can set rr2 = 1. Because there is only one unknown parameter, CI,the NeymanPearson lemma can be used to construct the most powerful test of the null hypothesis c1= a, vs. the point alternative c( = Cc.The likelihood function is proportional to L(a) = k, exp( --$(a - E)‘CT= 2yf_ i), where k, does not depend on ~1. The Neyman Pearson test of 01= 1 vs. c1= C?rejects if L(c()/L(l) is sufficiently large; after some manipulation, this yields a critical region of the form
[T(a-I)]2T-2~y;_1-2T(E-l)~~1
5 y,_,Ay,
(3.13)
t=2
t=2
where k is a constant. The implication of (3.13) is that the most powerful test of c1= 1 vs. tl = &is a linear combination of two statistics, with weights that depend on a. It follows that, even in this simplified problem, there is no uniformly most powerful (UMP) test of a = 1 vs. 1~1< 1. This difficulty is present even asymptotically: suppose that the alternative of interest is local to one in the sense (2.15), so that Cr= 1 + f/T, where C is a fixed constant. Then, T(& - 1) = C. Under the null
T-2
i t=2
yf_,,2T-’
5 yt_tAy,
w2,
w(1)2 - 1 1
t=2
so both terms in (3.13) are O,(l). Thus, there is no single candidate test which dominates on theoretical grounds, either in finite samples [Anderson (1948), Dufour and King (1991)] or asymptotically.’ From the perspective of empirical work, the model (3.12) is overly restrictive because of the absence of deterministic components and because the errors are assumed to be i.i.d. The primary objective of the large literature on tests for unit autoregressive roots has, therefore, been to propose tests that have three characteristics: first, the test is asymptotically similar under the general I(1) null, in the sense that the null distribution depends on neither the parameters of the trend process (assuming the trend has been correctly specified) nor the nuisance parameters describing the short-run dynamics of u,; second, it has good power in large samples; and third, it exhibits small size distortions and good power over a range of empirically plausible models and sample sizes. The next three subsections, respectively, summarize the properties of various unit root tests in terms of these three characteristics. ‘This draws on Rothenberg (1990). Manipulation of(3.13) shows that the Dickey-Fuller (1979) p test, which rejects if T(ci - 1) < k’ (where k’ < 0 for conventional significance levels), is efficient against E = 2k’, although this does not extend to the demeaned or detrended cases. We thank Thomas Rothenberg and Pentti Saikkonen for pointing this out.
Ch. 46: Unit Roots. Structural
3.2.2.
Breuks and Trends
2165
Tests of the general I( 1) null
This subsection describes the basic ideas used to generalize tests from the AR(l) model to the general I(1) null by examining four sets of tests in detail. Some other tests of the general I( 1) null are then briefly mentioned. If u, follows an AR(p) and d, = p,, + Pit, then the regression (3.9) serves as a basis for two tests proposed by Dickey and Fuller (1979): a test based on the t-statistic testing 01= 1, z*‘, and a test based on p’ = (Cr)AR/BJT(&- l), where d& is the autoregressive spectral density estimator (the AR estimator ofo”): (3.14)
where (a,, a,, . . . ,c?,) are the OLS estimators from (3.9), modified, respectively, to omit t or (1, t) as regressors in the d, = /I0 or d, = 0 cases. In the time-trend case, under the null hypothesis u = 1 and the maintained AR(p) hypothesis, the limiting representations of these statistics are
(3.15)
neither of which depends on nuisance parameters. Thus these statistics form the basis for an asymptotically similar test of the unit root hypothesis in the AR(p)/timetrend model. Their distributions have come to be known as the Dickey-Fuller (1979) “p” and “Y distributions and are tabulated in Fuller (1976), Tables 8.51 and 8.5.2, respectively. In the constant-only case (d, = PO), the only modification is that t is dropped as a regressor from (3.9) and W’ replaces W’ in (3.15). In the nodeterministic case, the intercept is also dropped from (3.9) and W replaces W’. In an important extension of Fuller (1976) and Dickey and Fuller (1979), Said and Dickey (1984) used Berk’s (1974) results for AR( co) I(0) autoregressions to analyze the case that o, follows a general ARMA(p, q) process with unknown p, q. In this case, the true autoregressive order is infinite so the regression (3.9) is misspecified. If, however, the autoregressive order pT increases with the sample size (specifically, pT-+ co, p;/T+O), then Said and Dickey (1984) showed that the results (3.15) continue to hold. Thus the Dickey-Fuller/Said-Dickey tests have a nonparametric interpretation, in the sense that they are valid under a more general I(0) null with weak conditions on the dynamics of u,.~ 8Berk’s (1974) conditions on c(L) (in the notation of (2.1)) are less restrictive than Said and Dickey’s (1984) assumption that v, obeys an ARMA(p, q). For a related discussion and extension to the multivariate case, see Lewis and Reinsel(l985) and especially Saikkonen (1991).
J.H. Stock
2166
Alternative tests were proposed by Phillips (1987a) and Phillips and Perron (1988). They recognized that if K in (3.4) were consistently estimable, then T(oi - 1) from the misspecified AR(I) model could be adjusted so that it would be asymptotically similar. This reasoning led Phillips (1987a) to propose the corrected statistics
Z,=
g?? - l)&
T(oS-- l)+
--f‘s
WdW (3.16a) w= ’
Tp2 5 y:+ s
(3.16b)
where I? and d2 are consistent estimators of K and III’, and where r is the t-statistic testing for a unit root in the OLS regression of y, onto y,_, . Phillips and Perron (1988) extended these statistics to the constant and time-trend cases by replacing the regression of y, onto y,_ 1 with a regression of y, onto (1, y,_ r) in the constant case or, in the linear-trend case, onto (1, t, yt- 1). The limiting distributions for these two cases are as in (3.16), with Wb or W’, respectively, replacing W. Because +(K - 1) = &,(O) - 02)/02, the estimation of the correction entails the estimation of 02. Phillips (1987a) and Phillips and Perron (1988) suggested estimating o2 using a sum-of-covariances (SC) spectral estimator (the SC estimator of 02), IT
A2
coSC
F
= tn= c -IT
k(
(3.17)
r*;(m), T>
(x, - X)(X,_,,, - X), k( .) is a kernel weighting function and G, is the residual from the regression of y, onto y,_ t, (1, y, _ t) or (1, t, y,_ 1) in the no-deterministic, constant or linear-trend cases, respectively. The appropriate choice of kernel ensures that S.& > 0 [Newey and West (1987), Andrews (1991)]; Phillips (1987a) and Phillips and Perron (1988) suggested using Bartlett (linearly declining) weights. If [r increases to infinity at a suitable rate [e.g. $/T+O from Phillips (1987a, Theorem 4.2); see Andrews (1991) for optimal rates], then c;)& A w2 as required. Like the SaiddDickey tests, the Phillips/Phillips-Perron tests thus provide a way to test the general (nonparametric) I( 1) null. A third test for the general I(1) null can be obtained .by generalizing a statistic derived by Sargan and Bhargava (1983a) in the no-trend and constant cases and extended to the time-trend case by Bhargava (1986). They used Anderson’s (1948) approximation to the inverse of the covariance matrix to derive the locally most
wherey,(m)= V- ml- ‘CtTz A
lml
+
1
Ch. 46: Unit Roots, Structural Breaks and Trends
2767
powerful invariant (LMPI) test [although the test is not LMPI if the true inverse is used, as pointed out by Nabeya and Tanaka (1990b)l. The SarganBhargava statistics, E,, E;, and &, are
r2 i (Y,)’ t=1
iiT=y-’ T- l
(3.18a)
2 (AyJ2 1=2
r2 i
(~3 1=1
&--,
T-’ i
(3.18b)
(AY,)'
t=2
(3.18~)
y;=ylT-lC,T=Iyt and yf’=y,-&&‘t, where E= T-‘C,T=lyr1)/2(T- l)](yT - y,)and j$’ = (yT - y,)/(T- 1). Note that @ is the maximum likelihood estimator (MLE) of ,4i under the null c1= 1 when u, is i.i.d. normal. Also, the statistic d, is asymptotically equivalent to minus one-half times the inverse of the symmetric least squares estimator T(ii, - 1) in (3.11). These statistics have seen little use in empirical work because of their derivation in the first-order case and because, in their form (3.18), the tests are not similar under the general I(1) null. They are, however, readily extended to the general I(1) null. While of independent interest, this extension is worth explaining here because it demonstrates a simple way that a large class of tests of the unit root model can be extended to the general I(1) case, namely, by replacing yI by y,/&. A direct application of the FCLT and the CMT yields ET 3 K-~~W~, @_=> K-‘J( W’)2 and i+-K_2 j( WB)2, where W’(E) = IV(n) - (1. - $W( 1) - SW. Thus, if li-’ is consistent, modified Sargan-Bhargava (MSB) statistics are obtained as R, = R2R”,, R; = g2E; and Rt = 12’k& which are similar under the general I(1) null. These statistics can be summarized using a compact functional notation. Note that the demeaned (say) MSB statistic with li;= y*,,(O)/ci, can be written as RI+ = IIK~T-‘C,~= 1 Y;(t/T)2, where Y;(E,) = T- 1’2yrT,l. This suggests the notation where [(T+
l-1
MSB =
f (JJ2 d;l, J 1=0
(3.19)
2768
J.H. Stock
where f(A) = O- ’ Y,(n), & ’ Y;(1) and &‘YB,(J*), respectively, in the three cases, where Y,(I) = T- “*ytTsl and Y:(i) = T- “*yFTnl. The approach used to extend the SB statistic to the general I(1) null can be applied to other statistics as well. The NeymanPearson test regions (3.13) have the same drawback as the SB critical regions, that is, they depend on o under the general I( 1) null. Consider the no-deterministic case under which (3.13) was derived. Then, 2T-‘CTZ2y_,Ayf = (T-*/‘yT)* -y,(O) + o,(l), so the critical regions in (3.13) are asymptotically equivalent to critical regions based on (T(i - l))2T-2CT=2y:_1 o*{F*~W* - CW(1)2}. T(cl - l)( T - “*yT)*, which has the limiting representation While this depends on w, if Q2 -%02 under the null and local alternatives, then an asymptotically equivalent test can be performed using the statistic P, = - CT- ‘yi}. Because P,=z-C*~W* - EW( l)*, this test is asympd-2(C2T-2~~=2y~_1 totically similar and is moreover asymptotically equivalent to the NeymanPearson test in the case that u, is i.i.d. N(0, a*). When deterministic terms are present, it is desirable to modify the P, statistic. A feature of the tests discussed so far is that they are invariant to the values of the parameters describing the deterministic terms (PO and pi in the linear-trend case), that is, a change in the value of b does not induce a change in the value of the test statistic. This feature is desirable, particularly in the case of p,,. For example, when a test is performed on a series in logarithms, then a change in the units of measurement of the series (from thousands to millions of dollars, say) will appear, after taking logarithms, as an additive shift in /IO. It is natural to require a test for unit roots to be unaffected by the units of measurement, which translates here into requiring that the test be invariant to /IO. This line of reasoning led Dufour and King (1991), drawing on King (1980), to develop most powerful invariant (MPI) finite-sample tests of cI = a, vs. c1= c~i, where t1e is a general value, not necessarily one, in the Gaussian AR(l) model with Eui < co. The finite-sample distribution of these tests hinges on the Gaussian AR(l) assumption and they are not similar under the general I(1) null. These were extended by Elliott et al. (1992) to the general case using the same device as was used to extend the Neyman-Pearson tests to the statistic P,. The resultant statistics, P$ and P;, are asymptotically MPI against the alternative c = C. These statistics have forms similar to P,, but are constructed using demeaned and detrended series, where the trend coefficients are estimated by generalized least squares (GLS) under a local alternative (c = C), rather than under the null. This “local” GLS detrending results in intercept estimators which are asymptotically negligible, so that P;=E-C~~W* Cl+‘(l)*. This suggests examining other unit root test statistics using local detrending. One such statistic, proposed by Elliott et al. (1992), is their DickeyyFuller GLS (DF-GLS) statistic, in which the local GLS-demeaned or local GLS-detrendcd series is used to compute the r-statistic in the regression (3.9), where the intercept and time trend are suppressed.’ The construction of the P;,P;, DF-GLV and ‘The DF-GLS’ statistic is computed in two steps. Let z, = (1, t). (1) /I0 and PI are estimated by GLS that the process is an AR(l) with coefficient E = 1 + F/T and u0 = 0. That is, PO
under the assumption
Ch. 46: Unit Roots. Structural
2769
Breaks and Trends
DF-GLS’ tests requires the user to choose C. Drawing upon the arguments in King (1988), a case can be made for choosing C so that the test achieves the power envelope against stationary alternatives (is asymptotically MPI) at 50 percent power. This turns out to be achieved by setting C = - 7 in the demeaned case and C = - 13.5 in the detrended case. Another statistic of independent interest is the so-called resealed range (R/S) statistic which was proposed and originally analyzed by Mandelbrot and Van Ness (1968), Mandelbrot (1975) and Mandelbrot and Taqqu (1979). The statistic is
R/S=-
T- “2(max,= i,....ry, - mm*= T-’
i
l,...,TYr
(AyJ2
1
(3.20) .
1=2
Although the R/S statistic was originally proposed as a method for measuring the differencing parameter in a fractionally integrated (fractionally differenced) model, the R/S test also has power against stationary roots in the autoregressive model. In functional notation, the statistic is supJ(A)infnf(lZ), which is a continuous functional from C[O, l] to B’. As Lo (1991) pointed out, this statistic is not similar under the general I(1) null, but if evaluated using f(1) = T- 1’2yI,,1/Q it is asymptotically similar (note that this statistic needs no explicit demeaning in the d, = /i’e case). Thus, the asymptotic representation of this modified R/S statistic under the general I( 1) null is sup, W(n) - inf, W(1). Although a large number of unit root tests have been proposed, many fall in the same family, in the sense that they have the same functional representation. It will be shown in the next section that if two tests have the same functional representation then they have the same local asymptotic power functions. However, as will be seen in Section 3.2.4, tests which are asymptotically equivalent under the null and local alternatives can perform quite differently in finite samples. 3.2.3.
Consistency and local asymptotic power
Consistency. A simple argument proves the consistency of unit root tests which can be written in functional notation such as (3.19). Suppose that a test has the representation g(f), where g: C[O, l] -tB is continuous, and f is T-“2y,,.,/&, T- 1’2yFT.I/c&or T- 1’2y;PIjc3 in the no-deterministic, demeaned or detrended cases, respectively. If g(0) falls m the rejection region, then consistency follows immediately, provided that the process being evaluated is consistent for zero under all fixed and /I1 are estimated by regressing [y,,(l - 07L)y,,. .(l_- aL)y,] onto [zr,(l - KC)z,, ,(l - cX)Z,]: call the resulting estimator &s. Detrended J, = yr - z&Ls is then computed. (2) The Dickey-Fuller regression (3.9) is run using j$ without the intercept and time trend; the r-statistic on J,- 1is the DF-GLS’ statistic. The DF-GLS’ statistic is computed similarly except that the regressor t is omitted in the first step. The DF-GLS’ statistic has the no-deterministic Dickey-Fuller r*distribution and the distribution of DF-GLS is tabulated in Elliott et al. (1992).
2770
J.H. Stock
alternatives. As a concrete example, let d, = /I,, and consider the demeaned MSB statistic (3.19) with f= 0-l Y;. The test rejects for small values of the statistic, so consistency against the general I(0) alternative follows if Q- i Y; A 0. Now if U, is I(O)then Pr[sup, ( Y;(;I)I > S] + 0 for all 6 > 0 [the proof is along the lines of (2.8)]. It follows that Q-i Y; LO if cj L k > 0 for some constant k under the I(O) alternative. Thus, with this additional assumption, MSB’ = j(& ’ Y;(1))’ dA + o,,( 1) A 0 under the fixed I(0) alternative, and the test is consistent. The assumption that d2 5 k > 0 for some constant k under the I(0) alternative is valid for certain variants of both the SC and AR spectral estimators. For the AR spectral estimator, this was shown by Stock (1988, Lemma 1). For the SC spectral estimator, test consistency is an implication of Phillips and Ouliaris’ (1990, Theorem 5.1) more general result for tests for cointegration. These results, combined with some additional algebra, demonstrate the consistency for the MSB, Z,, Z,, P, and R/S statistics. lo Local asymptotic power. Power comparisons are a standard way to choose among competing tests. Because finite-sample distribution theory in nearly Z(1) models is prohibitively complicated, research has focused on asymptotic approximations to power functions. For consistent tests, this requires computing power against alternatives which are local to (in a decreasing neighborhood of) unity. Applications of asymptotic expansions commonly used in T”‘-asymptotic problems, in particular Edgeworth expansions and saddlepoint approximations, provided poor distributional approximations for c1near unity [Phillips (1978); also, see Satchel1 (1984)]. This led to the exploration of the alternative nesting, cur= 1 + c/T; important early work developing this approach includes Bobkoski (1983), Cavanagh (1985), Phillips (1987a, 1987b), Chan and Wei (1987) and Chan (1988, 1989). The treatment here follows Bobkoski (1983) as generalized in Example 4 of Section 2.3. The key observation is that, under the local-to-unity alternative (2.15), the processes T- 1/2u1Pl =oW,(.), where WCis a diffusion process on the unit interval satisfying d W,(A) = c W,(A) + d W(A) with W,(O) = 0. In addition, both the SC and AR spectral estimators have the property that d2 *’ o2 under the local alternative.” These results directly yield local-to-unity representations of those test statistics with functional representations such as (3.19). toNot all plausible estimators of& will satisfy this condition. For example, consider the SC estimator constructed using not the quasi-difference d - By;d_1, as in (3.17), but the first difference A$. These two estimators are asymptotically equivalent under the null, but under the alternative y, is overdifferenced. Thus, the spectrum of Afl at frequency zero is zero under the I(O) alternative, the SC estimator of the spectrum does not satisfy the positivity condition, and tests constructed using the first-differenced SC estimator are not in general consistent. [Precise statements of this result are given by Stock and Watson (1988b) in the MA(q) case with I, fixed and by Phillips and Ouliaris (1990, Theorem 5.2) in the general case.] This problem of overdifferencing by imposing a = 1 when nuisance parameters are estimated also results in the inconsistency of Solo’s (1984) Lagrange multiplier (LM) test for a unit AR root in an ARMA(p,q) model, as demonstrated by Saikkonen (1993). 
“See Phillips (1987b) for the SC estimator and Stock (1988) for the AR estimator.
Ch. 46: Unit Roots. Structural
2111
Breaks and Trends
As a concrete example, again consider the MSB’ statistic. Under the local-tounity alternative, Y!+-o( W, - IWJ = o WE. Thus, test statistics of the form g(6 i Y’“,) have the local-to-unity representation g( W,“). An important implication is that the local asymptotic power of these tests does not depend on the nuisance parameter o, simplifying their comparison. Phillips (1987b, Theorem 2) showed that this framework bridges the gap between the conventional Gaussian I(0) asymptotics and the nonstandard I( 1) asymptotics. Specifically, as c + - cc the (suitably normalized) local-to-unity approximations for T(bi - 1) and the associated t-statistic approached their I(O) Gaussian limits and, as c -+ + co, these distributions, respectively, tend to Cauchy and normal, in accordance with the asymptotic results of White (1958, 1959) and Anderson (1959) for the Gaussian AR(l) model with ICX/> 1. Another application of this approach is to derive the asymptotic Gaussian power envelope for unit root tests, that is, the envelope of the power functions of the family of most powerful unit root tests. Because there is no UMP test (or, in the time-trend case, no uniformly most powerful invariant test), this envelope provides a concrete way to judge the absolute asymptotic performance of various unit root tests. In the no-deterministic case, this envelope is readily derived using the local-to-unity limit (2.17) and the Neyman-Pearson critical regions (3.13). Assume that (i) the process is a Gaussian AR( 1) so that o2 = a:; (ii) En: < co; (iii) the alternative against which the test is most powerful is local-to-unity, so that F = T(i - 1) is fixed; and (iv) the true process is local-to-unity with c = T(cc - 1). Then, the probability of rejecting CI= 1 against the one-sided alternative l&l - 1 is, asymptotically,
Pr
(T(&-
1))2T-2
i
y:_i -2T(E-
t=2
l)T-’
i r=2
Wz-CWc(1)2
1 ,
yt_iAy,
1 (3.21)
where k and k’ are constants which do not depend on c. When c = C, the second expression in (3.21) provides the envelope of the power functions of the most powerful (Neyman-Pearson) tests and, thus, provides an asymptotic performance bound on all unit root tests in the Gaussian model. This result is extended in several ways in Elliott et al. (1992). The bound (3.21) is shown to hold if d, is unknown but changes slowly, in the sense that d, satisfies T T-’
1
I=1
(Ad,)‘+0
and
T-‘j2
max f= l,...,T
ld,l+O.
(3.22)
If d, = &, + /IIt, then the bound (3.21) cannot be achieved uniformly in pi, but a similar bound can be derived among the class of all invariant tests, and this is achieved by the Pi statistic. Although the bound (3.21) was motivated here for u, i.i.d.
J.H. Stock
2112
N(O,aZ), this bound applies under the more general condition that u, obeys a Gaussian AR(p). We now turn to numerical results for asymptotic power, computed using the local-to-unity asymptotic representations of various classes of statistics. In addition to the tests discussed so far, we include expressions for the Park (1990) J(p, q) variable-addition test [also, see Park and Choi (1988)] and, in the no-deterministic case, the modified asymptotically LMPI test [invariant under change of scale, this test rejects for small values of (T- 1izyr/Q)2 and is obtained by letting c --, 0 in (3.13) and rearranging]. Let W.f penote a general OLS-detrended WC process, that is, Wf = WCin the no-determmtstic case, W:(A) = W;(A) = W,(i) - j WCin the demeaned case and W:(A) = W;(A) = WC(A)- (4 - 6n)l W, - (122 - 6)fs W,(s) ds in the detrended case [cf. (2.13)]. Let Wf denote the asymptotic limit of the Bhargava-detrended process T - “‘yf3,., used to construct the Bhargava statistic (3.18~) in the detrended case; specifically, W:(A) = W,(A) - (A - i) Wc(1) - s WC. Finally, let Vc(A)= W,(A) n{E+WE(l)+(l -c+)3IrW,( Y)d r}, w h ere Zt = (1 - C)/(l - C + C2/3), denote the limit of the detrended process obtained from local GLS detrending, which is used to construct the Pk and DF-GLS’ statistics. The local-to-unity representations for various classes of unit root test statistics are given by the following expressions:
    ρ̂-class:    ½{W_c^d(1)² - W_c^d(0)² - 1} / ∫(W_c^d)²;                     (3.23a)

    τ̂-class:    ½{W_c^d(1)² - W_c^d(0)² - 1} / {∫(W_c^d)²}^{1/2};             (3.23b)

    SB-class:   ∫(W_c^d)²   (no-deterministic, demeaned cases);                (3.23c)

    SB-class:   ∫(W_c^B)²   (detrended case);                                  (3.23d)

    R/S:        sup_{λ∈(0,1)} W_c^d(λ) - inf_{λ∈(0,1)} W_c^d(λ);               (3.23e)

    J(0, 1):    ∫(W_c^μ)² / ∫(W_c^τ)² - 1   (demeaned case);                   (3.23f)

    J(1, 2):    ∫(W_c^τ)² / ∫(W_c^{τ²})² - 1   (detrended case);               (3.23g)

    LMPI:       W_c(1)²   (no-deterministic case only);                        (3.23h)

    P_T:        c̄²∫W_c² - c̄W_c(1)²   (no-deterministic, demeaned cases);      (3.23i)

    P_T:        c̄²∫V_c² + (1 - c̄)V_c(1)²   (detrended case);                  (3.23j)

    DF-GLS^μ:   ½{W_c(1)² - W_c(0)² - 1} / {∫W_c²}^{1/2}   (demeaned case);    (3.23k)

    DF-GLS^τ:   ½{V_c(1)² - V_c(0)² - 1} / {∫V_c²}^{1/2}   (detrended case),   (3.23l)

where, in (3.23g), W_c^{τ²} denotes W_c detrended by OLS on (1, s, s²).
The ρ̂ class includes the Dickey-Fuller (1979) ρ̂ tests and the Phillips (1987a)/Phillips-Perron (1988) Z_α tests. The τ̂ class includes the Dickey-Fuller (1979) t-tests and the Phillips (1987a)/Phillips-Perron (1988) Z_t tests. The SB class includes the Schmidt-Phillips (1992) test. Most of these representations can be obtained by directly applying the previous results. For those statistics with functional representations already given and where the statistic is evaluated using OLS detrending [the SB-class (demeaned case) and R/S statistics], the results obtain as a direct application of the continuous mapping theorem. In the cases involving detrending other than OLS in the time-trend case (the SB-class and DF-GLS^τ statistics), an additional calculation must be made to obtain the limit of the detrended processes. The other expressions follow by direct calculation.¹²

Asymptotic power functions for leading classes of unit root tests (5 percent level) are plotted in Figures 1, 2 and 3 in the no-deterministic, constant and trend cases, respectively.¹³ The upper line in these figures is the Gaussian power envelope. In the d_t = 0 case, the power functions for the τ̂, ρ̂ and SB tests are all very close to the power envelope, so this comparison provides little basis for choosing among them. Also plotted in Figure 1 is the power function of the LMPI test. Although this test has good power against c quite close to zero, its power quickly falls away from the envelope and is quite poor for distant alternatives.
12 References for these results include: for the ρ̂, τ̂ statistics, Phillips (1987b) (Z_α statistic) and Stock (1991) (Dickey-Fuller AR(p) statistics); for the SB-class statistics, Schmidt and Phillips (1992) and Stock (1988); for the P_T and DF-GLS statistics, Elliott et al. (1992).
13 The asymptotic power functions were computed using the functional representations in (3.23) evaluated with discrete Gaussian random walks (T = 500) replacing the Brownian motions, with 20,000 Monte Carlo replications. Nabeya and Tanaka (1990b) tabulate the power functions for tests including the SB and τ̂ tests, although they do not provide the power envelope. Because they derive and integrate the characteristic function for these statistics in the local-to-unity case, their results presumably have higher numerical accuracy than those reported here. Standard errors of rejection rates in Figures 1-3 are at most 0.004. Some curves in these figures originally appeared in Elliott et al. (1992).
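A minimal sketch, in Python, of the computational device described in footnote 13: a unit root test's local asymptotic power is approximated by evaluating its functional representation in (3.23) on discrete Gaussian random walks (T = 500) with local-to-unity root 1 + c/T, replacing the Brownian motions. It is illustrated here for the demeaned SB-class functional ∫(W_c^μ)² in (3.23c); the function names and the 5,000 replications are our choices, not the chapter's.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500

def sb_mu_draw(c, rng):
    """One draw from the discretized limit functional: integral of (W_c^mu)^2."""
    eps = rng.standard_normal(T)
    y = np.empty(T)
    y[0] = eps[0]
    for t in range(1, T):                 # y_t = (1 + c/T) y_{t-1} + eps_t
        y[t] = (1.0 + c / T) * y[t - 1] + eps[t]
    w = y / np.sqrt(T)                    # W_c(t/T) is approximated by T^{-1/2} y_t
    w_mu = w - w.mean()                   # OLS demeaning: W_c^mu = W_c - int W_c
    return np.mean(w_mu ** 2)             # Riemann approximation to the integral

def asymptotic_power(c, nrep=5000):
    null = np.array([sb_mu_draw(0.0, rng) for _ in range(nrep)])
    cv = np.quantile(null, 0.05)          # the test rejects for small values
    alt = np.array([sb_mu_draw(c, rng) for _ in range(nrep)])
    return np.mean(alt < cv)

for c in (-5.0, -10.0, -20.0):
    print(f"c = {c:6.1f}: approximate local asymptotic power {asymptotic_power(c):.2f}")
```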
In the empirically more relevant cases of a constant or constant and trend, the asymptotic power functions of the various tests differ sharply. First, consider the case d_t = β₀. Perhaps the most commonly used test in practice is the Dickey-Fuller/Said-Dickey t-test, τ̂^μ; however, its power is well below not just the power envelope but the power of the ρ̂^μ (equivalently, Z_α) test.
[Figures 1, 2 and 3: asymptotic power functions of unit root tests (5 percent level), plotted against -c (horizontal axis from 0 to 32), in the no-deterministic, demeaned and detrended cases, respectively.]
The SB-class statistics have asymptotic power slightly above the ρ̂^μ statistics, particularly for power between 0.3 and 0.8, but remain well below the envelope. In contrast, the asymptotic local power function of the P_T^μ test, which is, by construction, tangent to the power envelope at 50 percent power, is effectively on the power envelope for all values of c. Similarly, the DF-GLS^μ power function is effectively on the power envelope.

Pitman efficiency provides a useful way to assess the importance of these power differences. Pitman's proposal was to consider the behavior of two tests of the same hypothesis against a sequence of local alternatives, against which at least one of the tests had nondegenerate power. The Pitman efficiency [or asymptotic relative efficiency (ARE)] is the ratio of the sample sizes giving, asymptotically, the same power for that sequence. In conventional √T-normal asymptotics, often the ARE can be computed as a ratio of the variances entering the denominators of the two Studentized test statistics. Although this approach is inapplicable here, the ARE can be calculated using the asymptotic power functions. Suppose that two tests achieve power β against local alternatives c₁(β) and c₂(β); then the ARE of the first test relative to the second test is c₁(β)/c₂(β) [Nyblom and Mäkeläinen (1983)]. Using this device, the ARE of the P_T^μ test, relative to the optimal test, at power of 50 percent is 1.0, by construction, and the ARE of the DF-GLS^μ test is effectively 1. In contrast, the ARE's of the SB-, ρ̂- and τ̂-class tests, relative to the P_T^μ test, are 1.40, 1.53 and 1.91. That is, to achieve 50 percent power against a local alternative using the Dickey-Fuller t-statistic asymptotically requires 90 percent more observations than are needed using the asymptotically efficient P_T^μ test or the nearly efficient DF-GLS^μ test.
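The ARE calculation just described is easy to mechanize. The following sketch interpolates two tabulated power functions to find the local alternatives at which each test attains 50 percent power; the logistic power curves are illustrative placeholders of our own, not the values plotted in the figures.

```python
import numpy as np

def c_at_power(c_grid, power, target=0.5):
    # power is increasing in -c; invert it by linear interpolation
    return np.interp(target, power, c_grid)

c_grid = np.linspace(0.0, 32.0, 33)                    # grid of -c
power_pt  = 1.0 / (1.0 + np.exp(-(c_grid - 7.0)))      # stand-in for P_T
power_tau = 1.0 / (1.0 + np.exp(-(c_grid - 13.0)))     # stand-in for tau-hat

c1 = c_at_power(c_grid, power_tau)
c2 = c_at_power(c_grid, power_pt)
print(f"ARE of tau-hat relative to P_T at 50 percent power: {c1 / c2:.2f}")
```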
The results in the detrended case are qualitatively similar but, quantitatively, the power differences are less. The τ̂-class statistics have low power relative to the envelope and to the SB- and ρ̂-class tests. The SB-class tests have power slightly above the ρ̂-class tests and all power functions are dominated by the P_T^τ test. Some of the other tests, in particular the R/S test, have power that is competitive with the MSB- and ρ̂-class tests. At 50 percent power, the Pitman efficiency of the τ̂ tests is 1.39 and of the ρ̂ tests is 1.25. Interestingly, the power function of P_T^τ actually lies above the power function of τ̂^μ, even though P_T^τ involves the additional estimation of the linear-trend coefficient β₁. Comparing the results across the figures highlights a common theme in this literature: including additional trend terms reduces the power of the unit root tests if the trends are unnecessary.¹⁴

So far, the sampling frequency has been fixed at one observation per period. A natural question is whether power can be increased by sampling more frequently, for example, by moving from annual to quarterly data, while keeping the span of the data fixed. A simple argument, however, shows that it is the span of the data which matters for power, not the frequency of observation. To be concrete, consider the demeaned case and suppose that the true value of α is 1 + c₁/T, based on T annual observations, where c₁ is fixed. Suppose that the MSB statistic is used with sufficiently many lags for ω̂² to be consistent. With the annual data, the test statistic has the limiting representation ∫(W^μ_{c₁})². The quarterly test statistic has the limiting representation ∫(W^μ_{4c₄})², where c₄ is the local-to-unity parameter at the quarterly frequency and the factor of 4 arises because there are four times as many quarterly as annual observations. Because α = 1 + c₁/T at the annual level, at the quarterly level this root is α^{1/4} ≈ 1 + (c₁/4T) = 1 + (c₄/T), so c₄ = c₁/4. Thus ∫(W^μ_{4c₄})² = ∫(W^μ_{c₁})² and the quarterly and annual statistics have the same limiting representations and, hence, the same rejection probabilities. Although there are four times as many observations, the quarterly root is four times closer to one than the annual root, and these two effects cancel asymptotically. For theoretical results, see Perron (1991b); for Monte Carlo results, see Shiller and Perron (1985). More frequent observations, however, might improve estimation of the short-run dynamics, and this apparently led Choi (1993) to find higher finite-sample power at higher frequencies in a Monte Carlo study.

14 Asymptotic power was computed for the Dickey-Fuller (1981) and Perron (1990a) F-tests, but this is not plotted in the figures. These statistics test the joint restriction that α = 1 and that the trend coefficient is zero in (3.5) or (3.9). Unlike the other tests considered here, these F-tests are not invariant to the trend parameter under local and fixed alternatives. The power of the two F-tests depends on β₁ under the alternative, so for drifts sufficiently large their power functions can, in theory, exceed the power envelope for invariant tests. If β₁ = 0 or is small, the F-tests have very low asymptotic power, well below the τ̂-class tests; Perron's (1990a) calculations indicate, however, that for β₁ sufficiently large, the F-tests can have high (size-adjusted) power.

The case of u₀ drawn from its unconditional distribution. The preceding analysis makes various assumptions about u₀: to derive the finite-sample Neyman-Pearson tests, that u₀ = 0 (equivalently, u₀ is fixed and known) and, for the asymptotics, that
T^{-1/2}u₀ →p 0, as specified after (2.9). Under the null, the tests considered are invariant to β₀ and thus to u₀. Although this fixed-u₀ case has received the vast majority of the attention in the literature, some work addresses the alternate model that u₀ is drawn from its unconditional distribution or is large relative to the sample size. In finite samples, this modification is readily handled and leads to different tests [see Dufour and King (1991)]. The maximum likelihood estimator is different from that when u₀ is fixed, being the solution to a cubic equation [Koopmans (1942); for the regression case, Beach and MacKinnon (1978)]. As pointed out by Evans and Savin (1981b, 1984) and further studied by Nankervis and Savin (1988), Perron (1991a), Nabeya and Sorensen (1992), Schmidt and Phillips (1992) and DeJong et al. (1992a), the power of unit root tests depends on the assumption about u₀. Analytically, this dependence arises automatically if the asymptotic approximation relies on increasingly finely observed data, in Phillips' (1987a) terminology, continuous record asymptotics [see Perron (1991a, 1992), Sorensen (1992) and Nabeya and Sorensen (1992)]. Alternatively, equivalent expressions can be obtained with the local-to-unity asymptotics used here if T^{-1/2}u₀ = O_p(1) [in the stationary AR(1) case, a natural device is to let T^{-1/2}u₀ be distributed N(0, σ_u²/{T(1 - α²)}) → N(0, -σ_u²/2c), where c < 0, so that an additional term appears in (2.17)]. Elliott (1993a) derives the asymptotic power envelope under the unconditional case and shows that tests which are efficient in the unconditional case are not efficient in the conditional case in either the demeaned or detrended cases. The quantitative effect on the most commonly used unit root tests of drawing u₀ from its unconditional distribution is investigated in the Monte Carlo analysis of the next subsection.

3.2.4. Finite-sample size and power
There is a large body of Monte Carlo evidence on the performance of tests for a unit AR root. The most influential Monte Carlo study in this literature is Schwert (1989), which found large size distortions in tests which are asymptotically similar under the general I(1) null, especially the Phillips-Perron (1988) Z_α and Z_t statistics. A partial list of additional papers which report simulation evidence includes Dickey and Fuller (1979), Said and Dickey (1985), Perron (1988, 1989c, 1990a), Diebold and Rudebusch (1991b), Pantula and Hall (1991), Schmidt and Phillips (1992), Elliott et al. (1992), Pantula et al. (1992), Hall (1992a), DeJong et al. (1992b), Ng and Perron (1993a, 1993b) and Bierens (1993).

Taken together, these experiments suggest four general findings. First, all the asymptotically valid tests exhibit finite-sample size distortions for models which are in a sense close to I(0) models. However, the extent of the distortion varies widely across tests and depends on the details of the construction of the spectral estimator ω̂². Second, the estimation of nuisance parameters describing the short-run dynamics reduces test power, in some cases dramatically. Third, these two observations lead to the use of data-dependent truncation or AR lag lengths in the
estimation of ω² and the resulting tests show considerable improvements in size and power. Fourth, the presence of nonnormality or conditional heteroskedasticity in the errors results in size distortions, but these are much smaller than the distortions arising from the short-run dynamics.

We quantify these findings using a Monte Carlo study with eight designs (data generating processes or DGP's) which reflect some leading cases studied in the literature. In each, y_t = u_t, where u_t = αu_{t-1} + v_t. Five values of α were considered: 1.0, 0.95, 0.9, 0.8 and 0.7. All results are for T = 100. The DGP's are

    Gaussian MA(1): θ = 0.8, 0.5, 0, -0.5, -0.8:
        v_t = ε_t - θε_{t-1},   (3.24a)

    Gaussian MA(1), u₀ unconditional: θ = 0.5, 0, -0.5:
        v_t = ε_t - θε_{t-1},   u₀ ~ N(0, γ_u(0)),   (3.24b)
where in each case ε_t ~ i.i.d. N(0, 1). The Gaussian MA(1) DGP (3.24a) has received the most attention in the literature and was the focus of Schwert's (1989) study. The unconditional variant is identical under the null, but under the alternative u₀ is drawn from its unconditional distribution N(0, γ_u(0)), where γ_u(0) = (1 + θ² - 2αθ)/(1 - α²). This affects power but the size is the same as for (3.24a). The unconditional model is of particular interest because the power functions in Section 3.2.3 were for the so-called conditional (u₀ fixed) case.

The tests considered are: the Dickey-Fuller ρ̂ statistic T(α̂ - 1)/(1 - Σ_{j=1}^p â_j) [where α̂, â₁, ..., â_p are ordinary least squares estimators (OLSE's) from (3.9)]; the Phillips (1987a)/Phillips-Perron (1988) Z_α statistic (3.16a); the Dickey-Fuller τ̂ statistic computed from the AR(p + 1) (3.9); the MSB statistic (3.19) computed using ω̂²_AR; the Schmidt-Phillips (1992) statistic, which is essentially (3.19) computed using ω̂²_SC; and the DF-GLS statistic of Elliott et al. (1992).¹⁵

Various procedures for selecting the truncation parameter l_T in ω̂²_SC and the autoregressive order p_T in ω̂²_AR are considered. Theoretical and simulation evidence suggest using data-based rules for selecting l_T. Phillips and Perron (1988) and DeJong et al. (1992b) use the Parzen kernel, so this kernel is adopted here.¹⁶ The truncation parameter l_T was chosen using Andrews' (1991) optimal procedure for this kernel as given in his equations (5.2) and (5.4).

15 The results here are drawn from the extensive tabulations of 20 tests in 13 data generating processes (DGP's) in Elliott (1993b). Other tests examined include: the Dickey-Fuller (1981) and Perron (1990b) F-tests; the Phillips-Perron Z_t test; the modified R/S statistic; Hall's (1989) instrumental variable statistic; Stock's (1988) MZ_α statistic; and the Park J(p, p + 3) tests for p = 1, 2. In brief, each of these tests had drawbacks - distorted size, low power or both - which, in our view, makes them less attractive than the tests examined here, so, to conserve space, these results are omitted.
16 The Parzen kernel is given by: k(x) = 1 - 6x² + 6|x|³, 0 ≤ |x| ≤ 1/2; k(x) = 2(1 - |x|)³, 1/2 ≤ |x| ≤ 1; and k(x) = 0, |x| > 1.
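To fix ideas, here is a small Python sketch of the sum-of-covariances (SC) spectral estimator with the Parzen kernel of footnote 16. Andrews' (1991) automatic bandwidth is replaced by a fixed truncation l_T purely for illustration; the example DGP and sample size are our choices.

```python
import numpy as np

def parzen(x):
    ax = abs(x)
    if ax <= 0.5:
        return 1.0 - 6.0 * ax**2 + 6.0 * ax**3
    if ax <= 1.0:
        return 2.0 * (1.0 - ax)**3
    return 0.0

def omega2_sc(v, l_T):
    """omega^2_SC = gamma(0) + 2 * sum_{j=1}^{l_T} k(j/l_T) * gamma(j)."""
    v = np.asarray(v) - np.mean(v)       # use demeaned series (or residuals)
    T = len(v)
    gamma = lambda j: np.dot(v[j:], v[:T - j]) / T
    return gamma(0) + 2.0 * sum(parzen(j / l_T) * gamma(j)
                                for j in range(1, l_T + 1))

# Example: v_t = eps_t - 0.5*eps_{t-1}; its long-run variance is (1 - 0.5)^2 = 0.25.
rng = np.random.default_rng(1)
eps = rng.standard_normal(1001)
v = eps[1:] - 0.5 * eps[:-1]
print(f"omega^2_SC estimate: {omega2_sc(v, l_T=12):.3f} (population value 0.25)")
```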
Table 2
Size and size-adjusted power of selected tests of the I(1) null: Monte Carlo results (5 percent level tests, detrended case, T = 100, y_t = u_t, u_t = αu_{t-1} + v_t, v_t = ε_t - θε_{t-1}).ᵃ

                                 Asympt.  MA(1), θ =                            Unconditional MA(1), θ =
Test statistic           α       power    -0.8   -0.5   0.0    0.5    0.8       -0.5   0.0    0.5

DF-τ̂^τ AR(4)           1.00    0.05     0.03   0.05   0.05   0.06   0.37      0.05   0.05   0.06
                        0.95    0.09     0.07   0.07   0.07   0.08   0.09      0.07   0.08   0.08
                        0.90    0.19     0.10   0.12   0.13   0.16   0.18      0.12   0.14   0.17
                        0.80    0.61     0.24   0.28   0.32   0.43   0.49      0.28   0.32   0.44
                        0.70    0.94     0.40   0.45   0.53   0.71   0.78      0.45   0.52   0.72

DF-τ̂^τ AR(BIC)         1.00    0.05     0.10   0.07   0.05   0.09   0.58      0.07   0.05   0.09
                        0.95    0.09     0.09   0.08   0.08   0.09   0.08      0.08   0.08   0.09
                        0.90    0.19     0.16   0.14   0.15   0.18   0.17      0.15   0.15   0.18
                        0.80    0.61     0.36   0.36   0.39   0.51   0.50      0.36   0.39   0.52
                        0.70    0.94     0.57   0.58   0.64   0.81   0.80      0.58   0.64   0.81

DF-τ̂^τ AR(LR)          1.00    0.05     0.09   0.11   0.08   0.22   0.65      0.11   0.08   0.22
                        0.95    0.09     0.08   0.09   0.09   0.10   0.09      0.09   0.09   0.09
                        0.90    0.19     0.14   0.18   0.17   0.22   0.17      0.16   0.16   0.20
                        0.80    0.61     0.29   0.47   0.46   0.56   0.42      0.39   0.39   0.56
                        0.70    0.94     0.42   0.74   0.74   0.76   0.57      0.58   0.58   0.77

DF-ρ̂^τ AR(BIC)         1.00    0.05     0.21   0.16   0.13   0.21   0.81      0.16   0.13   0.21
                        0.95    0.10     0.09   0.09   0.10   0.10   0.08      0.09   0.09   0.10
                        0.90    0.23     0.18   0.18   0.20   0.22   0.17      0.18   0.19   0.20
                        0.80    0.70     0.42   0.45   0.49   0.58   0.47      0.43   0.48   0.57
                        0.70    0.97     0.62   0.67   0.74   0.87   0.73      0.66   0.73   0.86

Z_α SC(auto)            1.00    0.05     0.00   0.01   0.05   0.65   1.00      0.01   0.05   0.65
                        0.95    0.10     0.09   0.09   0.11   0.10   0.09      0.09   0.10   0.09
                        0.90    0.23     0.19   0.20   0.25   0.25   0.16      0.20   0.23   0.21
                        0.80    0.70     0.56   0.62   0.74   0.73   0.44      0.62   0.73   0.70
                        0.70    0.97     0.89   0.92   0.98   0.97   0.72      0.92   0.98   0.97

MSB AR(BIC)             1.00    0.05     0.23   0.17   0.13   0.12   0.49      0.17   0.13   0.12
                        0.95    0.10     0.10   0.09   0.10   0.10   0.08      0.09   0.09   0.09
                        0.90    0.25     0.19   0.20   0.21   0.21   0.16      0.18   0.19   0.19
                        0.80    0.73     0.42   0.45   0.48   0.50   0.42      0.41   0.44   0.46
                        0.70    0.97     0.60   0.65   0.69   0.74   0.69      0.61   0.64   0.69

MSB SC(auto)            1.00    0.05     0.00   0.01   0.03   0.46   0.99      0.01   0.03   0.46
                        0.95    0.10     0.10   0.10   0.11   0.11   0.11      0.09   0.10   0.10
                        0.90    0.25     0.24   0.24   0.28   0.27   0.22      0.21   0.24   0.23
                        0.80    0.73     0.63   0.66   0.75   0.74   0.42      0.61   0.70   0.65
                        0.70    0.97     0.89   0.91   0.97   0.94   0.42      0.86   0.93   0.89

DF-GLS^τ AR(BIC)        1.00    0.05     0.11   0.08   0.07   0.11   0.58      0.08   0.07   0.11
                        0.95    0.10     0.11   0.10   0.10   0.11   0.12      0.09   0.09   0.09
                        0.90    0.27     0.23   0.23   0.24   0.28   0.27      0.19   0.19   0.21
                        0.80    0.81     0.53   0.57   0.61   0.72   0.70      0.46   0.49   0.54
                        0.70    0.99     0.75   0.80   0.84   0.94   0.91      0.67   0.71   0.76

ᵃ AR(BIC) indicates that the AR spectral estimator based on (3.9) with the time trend suppressed was used. See the notes to Table 1.
The AR estimator lag length p_T in (3.14) (with a constant but no time trend in the regression, in both the demeaned and the detrended cases) was selected using the Schwarz (1978) Bayesian information criterion (BIC), with a minimum lag of 3 and a maximum of 8. For comparison purposes a sequential likelihood ratio (LR) downward-testing procedure with 10 percent critical values, as suggested by Ng and Perron (1993b), was also applied to the Dickey-Fuller t-statistic.¹⁷

The results for tests of asymptotic level 5 percent are summarized in Table 1 for the demeaned case and in Table 2 for the detrended case. For each statistic, the first column provides the asymptotic approximation to the size (which is always 5 percent) and to the local-to-unity power. The remaining entries for α = 1 are the empirical size, that is, the Monte Carlo rejection rate based on asymptotic critical values. The entries for |α| < 1 are the size-adjusted power, that is, the Monte Carlo rejection rates when the actual 5 percent critical value computed for that model with α = 1 is used to compute the rejections. Of course, in practice the model and this correct critical value are unknown, so the size-adjusted powers do not reflect the empirical rejections based on the asymptotic critical values. However, it is the size-adjusted powers, not the empirical rejection rates, which permit examining the quality of the local-to-unity asymptotic approximations reported in the first column.

These results illustrate common features of other simulations. Test performance, both size and power, varies greatly across the statistics, the models generating the data and the methods used to estimate the long-run variance. The most commonly used test in practice is the Dickey-Fuller τ̂ statistic. Looking across designs, this statistic has size closer to its level than any other statistic considered here, with size in the range 5-10 percent in both the demeaned and detrended cases with θ ≤ 0.5, for both the AR(4) and BIC choices of lag length. However, as the asymptotic comparisons of the previous subsection suggest, this ability to control size in a variety of models comes at a high cost in power. For example, consider the case θ = -0.5. In the demeaned case with α = 0.9, the DF-τ̂ test has power of 0.22 (BIC case) while the DF-GLS test has power of 0.59. In the detrended case, as the asymptotic results suggest, the power loss from using the DF-τ̂ statistic is less. Again, in the θ = -0.5, α = 0.9 case, the powers of the DF-τ̂ and the DF-GLS statistics are 0.14 and 0.23. Typically, the ρ̂- and SB-class tests also have better size-adjusted power than the DF-τ̂ statistics.

Three lag length selection procedures are compared for the DF τ̂^μ and τ̂^τ statistics, and the choice has important effects on both size and power. In the θ = 0 case, for example, using 4 lags results in substantial power declines against distant alternatives, relative to either data-dependent procedure. DeJong et al. (1992b) show that increasing p typically results in a modest decrease in power but a substantial reduction in size distortions.

17 Alternative strategies, both data-based and not, were also studied, but they, typically, did not perform as well as the procedures reported here and thus are not reported here to save space. In general, among SC estimators, the Andrews (1991) procedure studied here performed substantially better (in terms of size distortions and size-adjusted power) than non-data-based procedures with l_T = k(T/100)^{1/4} for k = 4 or 12. See Elliott (1993b).
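The BIC lag-length selection and the implied AR spectral estimator are straightforward to implement. A minimal Python sketch follows; the sample alignment, the BIC penalty form and the random-walk example are our own choices, with lag bounds 3 to 8 and a constant-only regression as in the text.

```python
import numpy as np

def adf_regression(y, p):
    """OLS of dy_t on (1, y_{t-1}, dy_{t-1}, ..., dy_{t-p}); returns (coef, ssr, n)."""
    dy = np.diff(y)
    rows = [np.ones(len(dy) - p), y[p:-1]] + [dy[p - j:-j] for j in range(1, p + 1)]
    X = np.column_stack(rows)
    Y = dy[p:]
    coef = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ coef
    return coef, float(resid @ resid), len(Y)

def bic_lag(y, pmin=3, pmax=8):
    def bic(p):
        _, ssr, n = adf_regression(y, p)
        return np.log(ssr / n) + (p + 2) * np.log(n) / n
    return min(range(pmin, pmax + 1), key=bic)

def omega2_ar(y, p):
    """AR spectral estimator: sigma_e^2 / (1 - sum of lagged-difference coefs)^2."""
    coef, ssr, n = adf_regression(y, p)
    return (ssr / n) / (1.0 - coef[2:].sum())**2

rng = np.random.default_rng(2)
y = np.cumsum(rng.standard_normal(200))   # a driftless random walk, omega^2 = 1
p = bic_lag(y)
print(f"BIC lag: {p}, AR spectral estimate: {omega2_ar(y, p):.2f}")
```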
The results here favor the BIC over the LR selector, a finding congruent with Hall's (1992b) proof that the asymptotic null distribution of the DF statistic is the same using the BIC as if the true order were known (assuming the maximum possible lag is known and fixed). However, Ng and Perron (1993b) provide evidence supporting the sequential LR procedure. In any event, currently available information suggests using one of these two lag selection procedures.

Although the size distortions are slight for the cases with positive serial correlation in v_t, the introduction of moderate negative serial correlation results in very large size distortions for several of the statistics. This is the key finding of Schwert's (1989) influential Monte Carlo study and is one of the main lessons for practitioners of this experiment. For several statistics, these size distortions are extreme. For example, for the Gaussian MA(1) process with θ = 0.5, which corresponds to a first autocorrelation of v_t of -0.4, the detrended Phillips-Perron Z_α statistic has a rejection rate of 65 percent. These large size distortions are partially but not exclusively associated with the use of the SC spectral estimator. For example, the sizes of the MSB/AR(BIC) test and Schmidt and Phillips' (1992) version of this test implemented with the Parzen kernel, the MSB/SC(auto) statistic, are respectively 9 percent and 38 percent in the θ = 0.5 case. Similarly, the Z_α test can be modified using an AR estimator to reduce the distortions substantially, although they remain well above the distortions of the DF-τ̂ or DF-GLS statistics. Ng and Perron (1993a) give theoretical reasons for the improvement of the AR over SC estimators. Part of the problem is that the SC estimators are computed using the estimated quasidifference of y^μ or y^τ, where the quasidifference is based on α̂, which in turn is badly biased in the very cases where the correction factor is most important [see the discussion following (3.4b)].¹⁸

Looking across the statistics, the asymptotic power rankings provide a good guide to finite-sample size-adjusted power rankings, although the finite-sample power typically falls short of the asymptotic power. As predicted by the asymptotic analysis, the differences in size-adjusted powers are dramatic. For example, in the demeaned θ = 0 case with α = 0.9, the Dickey-Fuller t-statistic (BIC case) has power of 22 percent, Z_α has power of 44 percent, and DF-GLS has power of 60 percent. There is some tradeoff between power and size. The DF-τ̂ statistic exhibits the smallest deviation from nominal size, but it has low power. Other tests, such as the Z_α and MSB/SC(auto) statistics, have high size-adjusted power but very large size distortions. The DF-GLS statistic appears to represent a compromise, in the sense that its power is high - based on results in Elliott et al. (1992), typically as high as the asymptotic point-optimal test P_T - but its size distortions are low,

18 Consistent with the asymptotic theory, introducing generalized autoregressive conditional heteroskedasticity [GARCH, Bollerslev (1986)] has only a small effect on the empirical size or power of any of the statistics. Elliott (1993b) reports simulations with MA(1) GARCH(1,1) errors and coefficients which add to 0.9. For example, for the DF-GLS statistic, demeaned case, θ = 0 or -0.5, size and power (T = 100) differ at most by 0.03 from those in Table 1 for α = 1 to 0.7.
although not as low as the DF-τ̂ statistic. In the demeaned results, DF-GLS has sizes of 0.07-0.11, compared to the DF-τ̂ (BIC) which has sizes 0.06-0.08 (except in the extreme, θ = 0.8, case). In the detrended case, the DF-GLS has sizes of 0.07-0.11, while DF-τ̂ has sizes in the range 0.05-0.10.

Drawing the initial value from its unconditional distribution changes the rankings of size-adjusted power; in particular the size-adjusted power of DF-GLS drops, particularly for distant alternatives. However, the DF-GLS power remains above the DF-τ̂ (BIC) power in both demeaned and detrended cases, and, of course, the large size distortions of the other tests are not mitigated in this DGP. Recent Monte Carlo evidence by Pantula et al. (1992) suggests that, in the correctly specified demeaned AR(1) model, better power can be achieved against the unconditional alternative by a test based on a weighted symmetric least squares estimator. However, the unconditional case has been less completely studied than the conditional case and it seems premature to draw conclusions about which tests perform best in this setting.

3.2.5. Effects of misspecifying the deterministic trend
The discussion so far has assumed that the order of the trend has been correctly specified. If the trend is misspecified, however, then the estimators of α and the tests of α = 1 can be inconsistent [Perron and Phillips (1987), West (1987, 1988a)]. This argument can be made simply in the case where the true trend is d_t = β₀ + β₁t with β₁ a nonzero constant, but the econometrician incorrectly uses the constant-only model. Because y_t contains a linear time trend, asymptotically the OLS objective function will be minimized by first-differencing y_t, whether or not u_t is I(1), and a straightforward calculation shows that α̂ →p 1.¹⁹ It follows that ρ̂ and τ̂ tests will not be consistent. This inconsistency is transparent if one works with the functional representation of the tests: T^{-1/2}Y^μ_T ⇒ h^μ, where h^μ(λ) = β₁(λ - ½). In finite samples, the importance of this omitted variable effect depends on the magnitude of the incorrectly omitted time-trend slope relative to ω [West (1987) provides Monte Carlo evidence on this effect]. This problem extends to other types of trends as well, and in particular to misspecification of a piecewise-linear trend (a "broken" trend) as a single linear trend [see Perron (1989a, 1990b), Rappoport and Reichlin (1989) and the discussion in Section 5.2 of this chapter].

The analogy to the usual regression problem of omitted variable bias is useful here: if the trend is underspecified, unit root tests (and root estimators) are inconsistent, while if the trend is overspecified, power is reduced, even asymptotically. This contrasts with the case of mean-zero I(0) regressors, in which the reduction in power, resulting from unnecessarily including polynomials in time, vanishes asymptotically.

19 See West (1988a), Park and Phillips (1988) and Sims et al. (1990) for extensions of this result to multiple time series models.
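The underspecified-trend inconsistency just described is easy to see in a small simulation. The sketch below generates a trend-stationary series but fits the constant-only AR(1) regression; the sample size, slope and replication count are illustrative choices of our own.

```python
import numpy as np

rng = np.random.default_rng(3)
T, beta1 = 200, 1.0
t = np.arange(1, T + 1)
alphas = []
for _ in range(1000):
    y = 5.0 + beta1 * t + rng.standard_normal(T)   # I(0) errors around a linear trend
    x = y - y.mean()                               # demeaning only: the trend is omitted
    alphas.append(np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1]))
print(f"mean alpha-hat with omitted trend: {np.mean(alphas):.3f}")   # close to 1
```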
The difference is that while I(0) regressors are asymptotically uncorrelated with the included time polynomials, I(1) regressors are asymptotically correlated (with a random correlation coefficient). This asymptotic collinearity reduces the power of the unit root tests when a time trend is included. A procedure of sequential testing of the order of the trend specification prior to inference on α will result in pretest bias arising from the possibility of making a type I error in the tests for the trend order. This problem is further complicated by the dependence of the distributions of the trend coefficients and test statistics on the order of integration of the stochastic component.²⁰
3.2.6. Summary and implications for empirical practice
If one is interested in testing the null hypothesis that a series is I(1) - as opposed to testing the null that the series is I(0) or to using a consistent decision- or information-theoretic procedure to select between the I(0) and I(1) hypotheses - then the presumption must be that there is a reason that the researcher wishes to control type I error with respect to the I(1) null. If so, then a key criterion in the selection of a unit root test for practical purposes is that the finite-sample size be approximately the level of the test. Taking this criterion as primary, we can see from Tables 1 and 2 that only a few of the proposed tests effectively control size for a range of nuisance parameters. In the demeaned case, only the Dickey-Fuller τ̂^μ and DF-GLS^μ tests have sizes of 12 percent or under (excluding the extreme θ = 0.8 case). However, the τ̂^μ statistic has much lower size-adjusted power than the DF-GLS^μ statistic. Moreover, asymptotically, the DF-GLS^μ statistic can be thought of as approximately UMP since its power function nearly lies on the Neyman-Pearson power envelope in Figure 2, even though, strictly, no UMP test exists. When u₀ is drawn from its unconditional distribution, the power of the DF-GLS^μ statistic exceeds that of τ̂^μ except against distant alternatives. These results suggest that, of the tests studied here, the DF-GLS^μ statistic is to be preferred in the d_t = β₀ case.

In the detrended case, only τ̂^τ and DF-GLS^τ have sizes less than 12 percent (excepting θ = 0.8). The size-adjusted power of the DF-GLS^τ (BIC) test exceeds that of the τ̂^τ (BIC) test in all cases except u₀ unconditional, θ = 0.5 and α = 0.7. Because the differences in size distortions between the τ̂^τ and DF-GLS^τ tests are minimal, this suggests that again the DF-GLS^τ test is preferred in the detrended case.

In both the demeaned and the detrended cases, an important implication of the Monte Carlo results here and in the literature is that the choice of lag length or truncation parameter can strongly influence test performance.

20 In theory, this can be addressed by casting the trend order/integration order decision as a model selection problem and using Bayesian model selection techniques, an approach investigated by Phillips and Ploberger (1992). See the discussion in Section 6 of this chapter.

The LR and BIC
rules have the twin advantages of relieving the researcher from making an arbitrary decision about lag length and of providing reasonable tradeoffs between controlling size with longer lags and gaining size-adjusted power with shorter lags.

One could reasonably object to the emphasis on controlling size in drawing these conclusions. In many applications, particularly when the unit root test is used as a pretest, it is not clear that controlling type I error is as important as achieving desirable statistical properties in the subsequent analysis. This suggests adopting alternative strategies: perhaps testing the I(0) null or implementing a consistent classification scheme. These strategies are respectively taken up in Sections 4 and 6.
3.3. Interval estimation
Confidence intervals are a mainstay of empirical econometrics and provide more information than point estimates or hypothesis tests alone. For example, it is more informative to estimate a range of persistence measures for a given series than simply to report whether or not the persistence is consistent with there being a unit root in its autoregressive representation [see, for example, Cochrane (1988) and Durlauf (1989)]. This suggests constructing classical confidence intervals for the largest autoregressive root α, for the sum of the coefficients in the autoregressive approximation to u_t, or for the cumulative impulse response function. Alternatively, if one is interested in forecasting, then it might be desirable to use a median-unbiased estimator of α, so that forecasts (in the first-order model) would be median-unbiased. Because a median-unbiased estimator of α corresponds to a 0 percent equal-tailed confidence interval [e.g. Lehmann (1959, p. 174)], this again suggests considering the construction of classical confidence intervals for α. Moreover, a confidence interval for α would facilitate computing forecast prediction intervals which take into account the sampling uncertainty inherent in estimates of α.

The construction of classical confidence intervals for α, however, involves technical and computational complications. Only recently has this been the subject of active research. Because of the nonstandard limiting distribution at α = 1, it is evident that the usual approach of constructing confidence intervals, as, say, α̂ ± 1.96 times the standard error of α̂, has neither a finite-sample nor an asymptotic justification. This approach does not produce confidence intervals with the correct coverage probabilities, even asymptotically, when α is large. To see this, suppose that α is estimated in the regression (3.9) and that the true value of α is one. Then the usual "asymptotic 95 percent" confidence interval will contain the true value of α when the absolute value of the t-ratio testing α = 1, constructed using α̂^τ, is less than 1.96. When α = 1, however, this t-ratio has the Dickey-Fuller τ̂^τ distribution, for which Pr[|τ̂^τ| > 1.96] ≈ 0.61. That is, the purported 95 percent confidence interval actually has an asymptotic coverage rate of only 39 percent!
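The coverage failure just asserted can be checked quickly by simulation. The sketch below detrends a driftless random walk by OLS and computes the t-ratio for α = 1 from the fitted AR(1); the construction, sample size and replication count are our own simplifications, intended only to reproduce the roughly 0.61 rejection rate.

```python
import numpy as np

rng = np.random.default_rng(4)
T, nrep = 500, 4000
t = np.arange(1, T + 1)
Z = np.column_stack([np.ones(T), t])
hits = 0
for _ in range(nrep):
    y = np.cumsum(rng.standard_normal(T))              # random walk: alpha = 1
    yd = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]  # OLS detrending
    x, x1 = yd[1:], yd[:-1]
    ahat = np.dot(x, x1) / np.dot(x1, x1)
    s2 = np.mean((x - ahat * x1) ** 2)
    se = np.sqrt(s2 / np.dot(x1, x1))
    hits += abs(ahat - 1.0) / se > 1.96                # t-ratio testing alpha = 1
print(f"Pr[|tau-hat| > 1.96] at alpha = 1: {hits / nrep:.2f}")   # about 0.6
```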
It is, therefore, useful to return to first principles to develop a theory of classical interval estimation for α. A 95 percent confidence set for α, S(y₁, ..., y_T), is a set-valued function of the data with the property that Pr[α ∈ S] = 0.95 for all values of α and for all values of the nuisance parameters. In general, a confidence set can be constructed by "inverting" the acceptance region of a test statistic that has a distribution which depends on α but not on any nuisance parameters. Were there a UMP test of α = α₀ available for all null values α₀, then this test could be inverted to obtain a uniformly most accurate confidence set for α. However, as was shown in Section 3.2.1, no such UMP test exists, even in the special case of no nuisance parameters, so uniformly most accurate (or uniformly most accurate invariant or invariant-unbiased) confidence sets cannot be constructed by inverting such tests. Thus, as in the testing problem, even in the finite-sample Gaussian AR(1) model the choice of which test to invert is somewhat arbitrary.

As the discussion of Section 3 revealed, a variety of statistics for testing α = α₀ are available for the construction of confidence intervals. Dufour (1990) and Kiviet and Phillips (1992) proposed techniques for constructing exact confidence regions in Gaussian AR(1) regression with exogenous regressors, and Andrews (1993a) develops confidence sets for the Gaussian AR(1) model in the no-deterministic, demeaned and detrended cases with no additional regressors. Dufour's (1990) confidence interval is based on inverting the Durbin-Watson statistic, Kiviet and Phillips (1992) inverted the t-statistic from an augmented OLS regression, and Andrews (1993a) inverted α̂^μ - α (in the detrended case α̂^τ - α). In practice, the inversion of these test statistics is readily performed using a graph of the confidence belt for the respective statistics, which plots the critical values of the test statistic as a function of the true parameter. Inverting this graph yields those parameters which cannot be rejected for a given realization of the test statistic, providing the desired confidence interval [see Kendall and Stuart (1967, Chapter 20)].²¹

In practice one rarely, if ever, knows a priori that the true autoregressive order is one and that the errors are Gaussian, so a natural question is how to construct confidence intervals for α in the more general model (3.1). If treated in finite samples, even if one maintains the Gaussianity assumption, this problem is quite difficult because of the additional nuisance parameters describing the short-run dependence. However, as first pointed out by Cavanagh (1985), the local-to-unity asymptotics of Section 3.2.3 can be used to construct asymptotically valid confidence intervals for α when α is close to one.

21 Dufour studied linear regression with Gaussian AR(1) disturbances, of which the constant and constant/time-trend regression problems considered here are special cases. Both Dufour (1990) and Andrews (1993a) computed the exact distributions of these statistics using the Imhof method. In earlier work, Ahtola and Tiao (1984) proposed a method for constructing confidence intervals in the Gaussian AR(1) model with no intercept. Ahtola and Tiao's approach can be interpreted as inverting the score test of the null ρ = ρ₀, with two important approximations. First, they use a normal-F approximation to the distribution of the score test, which seems to work well over their tabulated range of α.
Second (and more importantly), their proposed procedure for inverting the confidence belt requires the belt to be linear and parallel, which is not the case over a suitably large range of α, even at the scale of the local-to-unity model α = 1 + c/T.
If the local-to-unity asymptotic representation of the statistic in the general I(1) case has a distribution which depends only on c [a condition satisfied by any statistic with the limiting representations in (3.23)], then this test can be inverted to construct confidence intervals for c and, thus, for α. In the finite-sample case, α cannot be determined from the data with certainty and, similarly, in the asymptotic case c cannot be known with certainty even though α is consistently estimated. However, the nesting α = 1 + c/T provides confidence intervals that shorten at the rate T⁻¹ rather than the usual T^{-1/2}.

To be concrete, the Dickey-Fuller t-statistic from the pth order autoregression (3.9) (interpreted in the Said-Dickey sense of p increasing with the sample size) has the local-to-unity distribution (3.23b), which depends only on c and, so, can be used to test the hypothesis c = c₀ against the two-sided alternative for any finite value of c₀. The critical values for this test depend on c₀. The plot of these values constitutes an asymptotic confidence belt for the local-to-unity parameter c, based on the Dickey-Fuller t-statistic. Inverting the test based on this belt provides an asymptotic local-to-unity confidence interval for c. Asymptotic confidence belts based on the Dickey-Fuller t-statistic in (3.9) and, alternatively, the modified Sargan-Bhargava (MSB) statistic are provided by Stock (1991) in both the demeaned and the detrended cases. Stock's (1991) Monte Carlo evidence suggests that the finite-sample coverage rates of the interval based on the Dickey-Fuller t-statistic are close to their asymptotic confidence levels in the presence of MA(1) disturbances, but the finite-sample coverage rates of the MSB-based statistics exhibit substantial distortions relative to their asymptotic confidence levels.

Both the finite-sample AR(1) and asymptotic AR(p) confidence intervals yield, as special cases, median-unbiased estimators of α. The OLS estimates of α are biased downwards, and both the finite-sample and asymptotic approaches typically produce median-unbiased estimates of α larger than the OLS point estimates. While this approach produces confidence intervals and median-unbiased estimators of α, the researcher might not be interested in the largest root per se but rather in some function of this root, such as the sum of the AR coefficients in the autoregressive representation. To this end, Rudebusch (1992) proposed a numerical technique based on simulation to construct median-unbiased estimators of each of the p + 1 autoregressive parameters; his algorithm searches for those autoregressive parameters for which the median of each of the AR parameters equals the observed OLS estimate for that parameter. Andrews and Chen (1992) propose a similar algorithm, except that their emphasis is the sum of the autoregressive parameters rather than the individual autoregressive parameters themselves and their calculations are done using the asymptotic local-to-unity approximations.
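A minimal sketch of the confidence-belt construction described above: for each c on a grid, simulate the local-to-unity null distribution of a statistic, store its 2.5 and 97.5 percent quantiles, and report as the confidence set all c whose belt covers the observed value. For simplicity the statistic here is T(α̂ - 1) from demeaned data rather than the Dickey-Fuller t-statistic of Stock (1991), and the observed value is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
T, nrep = 500, 2000

def stat(c, rng):
    eps = rng.standard_normal(T)
    y = np.empty(T); y[0] = eps[0]
    for s in range(1, T):                 # local-to-unity AR(1) with root 1 + c/T
        y[s] = (1 + c / T) * y[s - 1] + eps[s]
    x = y - y.mean()
    ahat = np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])
    return T * (ahat - 1.0)

c_grid = np.arange(-30.0, 5.0, 1.0)
belt = {c: np.quantile([stat(c, rng) for _ in range(nrep)], [0.025, 0.975])
        for c in c_grid}

observed = -8.0   # hypothetical realized value of T(alpha-hat - 1)
ci = [c for c, (lo, hi) in belt.items() if lo <= observed <= hi]
print(f"95 percent confidence set for c: [{min(ci)}, {max(ci)}]")
```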
A completely different approach to interval estimation, which has been the subject of considerable recent research and controversy, is the construction of Bayesian regions of highest posterior probability and an associated set of Bayesian tests of the unit AR root hypothesis. (References are given in Section 6.2.) Although these procedures examine the same substantive issue, they are not competitors of the classical methods in the sense that, when interpreted from a frequentist perspective, many of the proposed Bayesian intervals have coverage rates that differ from the stated confidence levels, even in large samples. A simple example of this occurs in the Gaussian AR(1) model with a constant and a time trend when there are flat priors on the coefficients. Then, in large samples, the Bayesian 95 percent coverage region is constructed as those values of α which are within 1.96 standard errors (conventional OLS formula) of the point estimate [Zellner (1987)]. As pointed out earlier in this subsection, if α = 1 this interval will contain the true value of α only 39 percent of the time in the detrended case. Of course these Bayesian regions have well-defined interpretations in terms of thought-experiments in which α is itself random and have optimality properties, given the priors. However, given the lack of congruence between the classical and the Bayesian intervals in this problem, and the sensitivity of the results to the choice of priors [see Phillips (1991a) and his discussants], applied researchers should be careful in interpreting these results.
4. Unit moving average roots
This section examines inference in two related models, the moving average model and the unobserved components model. The moving average model is

    y_t = d_t + u_t,   Δu_t = (1 - θL)v_t,   (4.1)

where v_t is, in general, an I(0) process satisfying (2.1)-(2.3). If θ = 1, u_t = v_t + (u₀ - v₀), so that, with the initial condition u₀ = v₀, u_t = v_t is a purely stochastic I(0) process. If |θ| < 1, then (1 - θL)⁻¹ yields a convergent series and (1 - θL) is invertible, so u_t is I(1). The convention in the literature is to refer to |θ| = 1 as the noninvertible case.

The unobserved components (UC) model considered here can be written

    y_t = d_t + u_t,   u_t = μ_t + ζ_t,   μ_t = μ_{t-1} + ν_t,   t = 1, 2, ..., T,   (4.2)
where ζ_t and ν_t are I(0) with variances σ_ζ² and σ_ν², and where d_t is a trend term as in (1.1). If ζ_t and ν_t have a nondegenerate joint distribution, then, in general, the I(1) component μ_t and the I(0) component ζ_t cannot be extracted from the observed series without error, even with known additional parametric structure; hence the "unobserved components" terminology.

It should be observed at the outset that the unit MA root/UC models are a mirror image of the unit AR root model, in the sense that the unit AR root model parameterizes the I(1) model as a point (α = 1) and the I(0) model as a continuum (|α| < 1), while in the unit MA root/UC models the reverse is true. In the latter two models, the I(0) case is parameterized as a point (θ = 1 in the unit MA root model, σ_ν² = 0 in the UC model), while the I(1) case is parameterized as a
continuum (|θ| < 1 in the MA model, σ_ν² > 0 in the UC model). This suggests that, at least qualitatively, some of the general lessons from the AR problem will carry over to the MA/UC problems. In particular, because the points θ = 1 and σ_ν² = 0 represent discontinuities in the long-run behavior of the process, it is perhaps not surprising that, as in the special case of a unit AR root, the first-order asymptotic distributions of estimators of θ and σ_ν² exhibit discontinuities at these points. In addition, just as the unit AR root model lends itself to constructing tests of the general I(1) null, the unit MA root model lends itself to constructing tests of the general I(0) null.

Although the MA model (4.1) and UC model (4.2) appear rather different, they are closely related. To see this, consider only the stochastic components of the models. In general, for suitable choices of initial conditions, all MA models (4.1) have UC representations (4.2) and all UC models (4.2) have MA representations of the form (4.1). To show the first of these statements, we need only write Δu_t = (1 - θL)v_t = (1 - θ)v_t + θΔv_t; then, cumulating Δu_t with the initial condition u₀ = v₀ yields

    u_t = (1 - θ) Σ_{s=0}^t v_s + θv_t.   (4.3)

By construction (1 - θ)Σ_{s=0}^t v_s is I(1) and θv_t is I(0). Therefore, the MA model (4.1) has the UC representation (4.3) with ν_t = (1 - θ)v_t and ζ_t = θv_t. If θ = 1, then the I(1) term in (4.3) vanishes and u_t is I(0). To argue that all UC models have MA representations of the form (4.1), it is enough to consider the two cases, σ_ν² = 0 and σ_ν² > 0. If σ_ν² = 0 and μ₀ = 0, then u_t = ζ_t, which is I(0), so (4.1) obtains with u₀ = v₀ and θ = 1. If σ_ν² > 0, then u_t is I(1), so Δu_t = ν_t + Δζ_t is I(0), and it follows that Δu_t has a Wold decomposition and hence an MA representation of the form (4.1) where |θ| < 1.

A leading special case of the UC model, which is helpful in developing intuition and which will be studied below, is when (ζ_t, ν_t) are serially uncorrelated and are mutually uncorrelated. Then Δu_t in the UC model has MA(1) autocovariances: Δu_t = Δζ_t + ν_t, so that γ_{Δu}(0) = 2σ_ζ² + σ_ν², γ_{Δu}(1) = -σ_ζ² and γ_{Δu}(j) = 0, |j| > 1. Thus Δu_t has the MA(1) representation (4.1), Δu_t = (1 - θL)e_t, where e_t is serially uncorrelated, θ solves θ + θ⁻¹ - 2 = σ_ν²/σ_ζ² and σ_e² = σ_ζ²/θ. Because EΔζ_tν_t = 0 by assumption, this UC model is incapable of producing positive autocorrelations of Δu_t, so, while all uncorrelated UC models have an MA(1) representation, the converse is not true. [The MA(1) model will, however, have a correlated UC representation and a UC representation with independent permanent and transitory components which themselves have complicated short-run dynamics; see Quah (1992).]
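The uncorrelated-UC-to-MA(1) mapping just derived is a one-line computation. A sketch, with the function name and example variances our own:

```python
import numpy as np

def uc_to_ma1(sigma2_nu, sigma2_zeta):
    """Map UC variances to MA(1) parameters: theta solves
    theta + 1/theta - 2 = sigma_nu^2/sigma_zeta^2, and sigma_e^2 = sigma_zeta^2/theta."""
    q = sigma2_nu / sigma2_zeta                 # signal-to-noise ratio
    # theta^2 - (2 + q)*theta + 1 = 0; the roots multiply to 1, so take the
    # root inside the unit circle (the invertible representation).
    theta = ((2 + q) - np.sqrt((2 + q) ** 2 - 4)) / 2
    return theta, sigma2_zeta / theta

theta, sigma2_e = uc_to_ma1(sigma2_nu=0.1, sigma2_zeta=1.0)
print(f"theta = {theta:.4f}, sigma_e^2 = {sigma2_e:.4f}")
# Check the implied first autocovariance of Delta u: -theta*sigma_e^2 = -sigma_zeta^2.
print(f"-theta*sigma_e^2 = {-theta * sigma2_e:.4f} (should be -1.0)")
```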
The UC model can equivalently be thought of as having I(1) and I(0) components, or as being a regression equation with deterministic regressor(s) d_t (which in general has unknown parameter(s) β), an I(0) error and a constant which is time-varying and follows an I(1) process. Thus the problem of testing for a unit moving average root and the problem of testing for time variation in the intercept in an otherwise standard time series regression model are closely related.

4.1. Point estimation

When the MA process is noninvertible, or nearly noninvertible, estimators of θ fail to have the standard Gaussian limiting distributions. The task of characterizing the limiting properties of estimators of θ when θ is one, or nearly one, is difficult, and the theory is less complete than in the case of nearly-unit autoregressive roots. Most of the literature has focused on the Gaussian MA(1) model with d_t = 0 and v_t = ε_t, and this model is adopted in this subsection, except as explicitly noted. One complication is that the limiting distribution depends on the specific maximand and the treatment of initial conditions. Because the objective here is pedagogical rather than to present a comprehensive review, our discussion of point estimation focuses on two specific estimators of θ, the unconditional and the conditional (on ε₁ = 0) MLE.

Suppose that the data have been transformed so that x_t = Δy_t, t = 2, ..., T. Then θ can be estimated by maximizing the Gaussian likelihood for (x₂, ..., x_T). The exact form of the likelihood depends on the treatment of the initial condition. If x₂ is treated as being drawn from its stationary distribution, so that x₂ = ε₂ - θε₁, then X = (x₂, ..., x_T)′ has covariance matrix σ_ε²Ω_u(θ), where Ω_{u,ii} = 1 + θ² and Ω_{u,i,i+1} = -θ. This is the "unconditional" case, and the Gaussian likelihood is

    Λ(θ, σ_ε²) = -½T ln 2πσ_ε² - ½ ln det(Ω_u) - X′Ω_u⁻¹X/2σ_ε²,   (4.4)
where det(Ω_u) denotes the determinant of Ω_u. Estimation proceeds by maximization of Λ in (4.4). Numerical issues associated with this maximization are discussed at the end of the subsection.

The "conditional" case sets ε₁ = 0, so that x₂ = ε₂. The conditional likelihood is given by (4.4) with Ω_u replaced by Ω_c, where Ω_c = Ω_u except that Ω_{c,11} = 1. A principal advantage of maximizing the conditional Gaussian likelihood is that the determinant of the covariance matrix does not depend on θ, so maximization can proceed by minimizing the quadratic form X′Ω_c⁻¹X. Because ε_t = x_t + θε_{t-1}, with the additional assumption that ε₁ = 0 the residuals e_t(θ) can be constructed recursively from (1 - θL)e_t(θ) = x_t, and estimation reduces to the nonlinear least squares problem of minimizing Σ_{t=2}^T e_t(θ)². If |θ| < 1, so that the process is invertible, then standard √T asymptotic theory applies. More generally, if an ARMA(p, q) process is stationary and invertible and has no common roots, then the Gaussian maximum likelihood estimator of the ARMA parameters is √T-consistent and has a normal asymptotic distribution; see Brockwell and Davis (1987, Chapter 10.8).
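A sketch of the conditional (ε₁ = 0) estimator just described: the residual recursion e_t(θ) = x_t + θe_{t-1}(θ) turns the problem into nonlinear least squares, done here by a simple grid search of our own choosing.

```python
import numpy as np

def css(theta, x):
    """Conditional sum of squares: sum of e_t(theta)^2 with e_1 = 0 imposed."""
    e = 0.0
    total = 0.0
    for xt in x:
        e = xt + theta * e            # recursion implied by (1 - theta*L) e_t = x_t
        total += e * e
    return total

rng = np.random.default_rng(6)
eps = rng.standard_normal(501)
theta0 = 0.7
x = eps[1:] - theta0 * eps[:-1]       # x_t = eps_t - theta0 * eps_{t-1}

grid = np.linspace(-0.99, 0.99, 397)
theta_hat = grid[np.argmin([css(th, x) for th in grid])]
print(f"conditional MLE of theta: {theta_hat:.3f} (true value {theta0})")
```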
In the MA(1) model, √T(θ̂_MLE - θ) is asymptotically distributed N(0, 1 - θ²). This provides a simple way to construct tests of whether θ equals some particular value. Alternatively, confidence intervals for θ can be constructed as θ̂ ± 1.96{(1 - θ̂²)/T}^{1/2} (for a 95 percent two-sided confidence interval).

These simple results fail to hold in the noninvertible case. This is readily seen by noting that the asymptotic normal approximation to the distribution of the MLE is degenerate when θ = 1. The most dramatic and initially surprising feature of this failure is the "pileup" phenomenon. In a series of Monte Carlo experiments, investigators found that the unconditional MLE took on the value of exactly one with positive probability when the true value of θ was near one, a surprising finding at the time since θ can take on any value in a continuum. Shephard (1992) and Davis and Dunsmuir (1992) attribute the initial discovery of the pileup effect to unpublished work by Kang (1975); early published simulation studies documenting this phenomenon include Ansley and Newbold (1980), Cooper and Thompson (1977), Davidson (1979, 1981), Dent and Min (1978) and Harvey (1981, pp. 136-9); also see Plosser and Schwert (1977), Dunsmuir (1981) and Cryer and Ledolter (1981).

The intuition concerning the source of the pileup effect is straightforward, and concerns the lack of identification of (θ, σ²) in the unconditional model. Note that Ω_u(θ) = θ²Ω_u(θ⁻¹); upon substituting this expression into the unconditional likelihood (4.4), one obtains Λ(θ, σ²) = Λ(θ⁻¹, θ²σ²) and Λ̄(θ) = Λ̄(θ⁻¹), where Λ̄ denotes the likelihood concentrated to be an argument only of θ. Because Λ̄ is symmetric in ln θ for θ > 0, it follows immediately that Λ̄ will have a local maximum at θ = 1 if ∂²Λ̄/∂θ²|_{θ=1} < 0, so the probability of a local maximum at θ = 1 is Pr[∂²Λ̄/∂θ²|_{θ=1} < 0]. Sargan and Bhargava (1983b, Corollary 1) [also see Pesaran (1983) and Anderson and Takemura (1986, Theorem 4.1)] provide expressions for this limiting probability in the noninvertible case, which can be calculated by interpolation of Table 1 in Anderson and Darling (1952) and is 0.657. These results were extended to the case of higher-order MA and ARMA models by Anderson and Takemura (1986), where the estimation is by Gaussian MLE when the order of the ARMA process is correctly specified. Tanaka (1990b) considered a different problem, in which v_t is a linear process which is I(0) but otherwise is only weakly restricted, but θ is estimated by using the misspecified Gaussian MA(1) likelihood. Tanaka (1990b) found that, despite the misspecification of the model order, the unconditional MLE continues to exhibit the pileup effect, in the sense that the probability of a local (but not necessarily global) maximum at θ̂ = 1 is nonzero if the true value of θ is one. Also see Tanaka and Satchell (1989) and Pötscher (1991).

Because of the close link between the UC and MA models, not surprisingly the pileup phenomenon occurs in those models as well. In this model, if the "signal-to-noise ratio" σ_ν²/σ_ζ² is zero or is in a T⁻² neighborhood of zero, then σ_ν² is estimated to be precisely zero with finite probability. However, the value of this point probability depends on the precise choice of maximand (e.g. maximum marginal likelihood or maximum profile likelihood) and the treatment of the initial condition
μ₀ (as fixed or, alternatively, as random with a variance which tends to infinity, or equivalently as being drawn from a diffuse prior). Various versions of this problem have been studied by Nabeya and Tanaka (1988), Shephard and Harvey (1990) and Shephard (1992, 1993).

Research on the limiting distribution of estimators of θ when θ is close to one is incomplete. Davis and Dunsmuir (1992) derive asymptotic distributions of the local maximizer closest to one of the unconditional likelihood, when the true value is in a 1/T neighborhood of θ = 1. Their numerical results indicate that their distributions provide good approximations, even for θ as small as 0.6 with T = 50. Their approach is to obtain representations of the first and second derivatives of the likelihoods as stochastic processes in T(1 - θ). They do not (explicitly) use the FCLT, and working through the details here would go beyond the scope of this chapter.

A remark on computation. The main technical complication that arises in the estimation of stationary and invertible ARMA(p, q) models is the numerical evaluation of the likelihood when q ≥ 1. If the sample size is small, then Ω_u can be computed and inverted directly. In sample sizes typically encountered in econometric applications, however, the direct computation of Ω_u⁻¹ is time-consuming and can introduce numerical errors. One elegant and general solution is to use the Kalman filter, which is a general device for computing the Gaussian log likelihood ℒ(y₁, ..., y_T) with the factorization ℒ(y₁, ..., y_T) = ℒ(y₁) + Σ_{t=2}^T ℒ(y_t | y_{t-1}, ..., y₁), when the model can be represented in state space form (as can general ARMA models). The Kalman filter operates by recursively computing the conditional mean and variance of y_t, which, in turn, specify the conditional likelihood ℒ(y_t | y_{t-1}, ..., y₁). The Gaussian MLE is then computed by finding the parameter vector that maximizes the likelihood. The chapter by Hamilton in this Handbook describes the particulars of the standard Kalman filter and provides a state space representation for ARMA models which can be used to compute their Gaussian MLE. The model (4.2) is a special case of unobserved components time series models, which in general can be written in state space form so that they, too, can be estimated using the Kalman filter; see Harvey (1989) and Harvey and Shephard (1992) for discussions and related examples. The literature on the estimation of stationary and invertible ARMA models is vast and it will not be covered further in this chapter. See Brockwell and Davis (1987, Chapter 8) for a discussion and references. For additional discussion of the Kalman filter with applications and a bibliography, see the chapter by Hamilton in this Handbook.
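A minimal sketch of the Kalman filter evaluation of the exact Gaussian log likelihood of the MA(1) model x_t = ε_t - θε_{t-1}. The state space form used here (Z = [1, 0], T = [[0, 1], [0, 0]], R = [1, -θ]) is one standard choice rather than the particular representation in the Hamilton chapter, and the initialization uses the unconditional state covariance; the grid search and data are our own.

```python
import numpy as np

def ma1_loglik(theta, sigma2, x):
    Tmat = np.array([[0.0, 1.0], [0.0, 0.0]])
    R = np.array([1.0, -theta])
    Z = np.array([1.0, 0.0])
    a = np.zeros(2)
    # unconditional covariance solves P = T P T' + sigma2 R R'; T is nilpotent
    P = sigma2 * (Tmat @ np.outer(R, R) @ Tmat.T + np.outer(R, R))
    ll = 0.0
    for xt in x:
        v = xt - Z @ a                       # one-step-ahead forecast error
        F = Z @ P @ Z                        # its variance
        ll += -0.5 * (np.log(2 * np.pi * F) + v * v / F)
        K = P @ Z / F                        # Kalman gain
        a, P = a + K * v, P - np.outer(K, Z @ P)
        a, P = Tmat @ a, Tmat @ P @ Tmat.T + sigma2 * np.outer(R, R)
    return ll

rng = np.random.default_rng(7)
eps = rng.standard_normal(301)
x = eps[1:] - 0.5 * eps[:-1]                 # true theta = 0.5
grid = np.linspace(-0.95, 0.95, 191)
print("theta-hat:", grid[np.argmax([ma1_loglik(t, 1.0, x) for t in grid])])
```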
4.2. Hypothesis tests

4.2.1. Tests of θ = 1 in the conditional Gaussian MA(1) model
As Sargan and Bhargava (1983b) pointed out, the pileup phenomenon means that likelihood ratio tests cannot be used for hypothesis testing at conventional
significance levels, at least using the unconditional Gaussian MLE. Given this difficulty, it is perhaps not surprising that research into testing the null of a unit MA root has been limited and has largely focused on the MA(1) conditional Gaussian model. This model, therefore, provides a natural starting point for our discussion of tests of the general I(0) null.

The conditional Gaussian MA(1) model with a general pth order polynomial time trend is

    y_t = z_t′β + u_t,   u₁ = ε₁,   Δu_t = ε_t - θε_{t-1},   t > 1,   (4.5)
where ε_t is i.i.d. N(0, σ_ε²) and z_t = (1, t, …, t^p)′. Let z = (z₁, …, z_T)′, and similarly define the T × 1 vectors y and u. Then y is distributed N(zβ, Σ_C), where Σ_{C,11} = σ_ε² and the remaining elements can be calculated directly from the moving average representation Δu_t = (1 − θL)ε_t. The problem of testing values of θ is invariant to transformations of the form y_t → ay_t + z_t′b, β → aβ + b and σ_ε² → a²σ_ε². It is therefore reasonable to restrict attention to the family of tests which are invariant to this transformation, and among that family to find the most powerful tests of θ = 1 against the fixed alternative θ = θ̄. An implication of the general results of King (1980, 1988) is that the MPI test of θ = 1 vs. θ = θ̄ rejects for small values of the statistic
ũ′Σ_C(θ̄)⁻¹ũ / û′û,  (4.6)

where û = (û₁, û₂, …, û_T)′, where {û_t} are the residuals from the OLS regression of y_t onto z_t, ũ are the GLS residuals from the estimation of (4.5) under the alternative, and Euu′ = Σ_C(θ̄) is the covariance matrix of u = (u₁, …, u_T)′ under the alternative θ̄. In the MA(1) model, the GLS transformation can be written explicitly, and GLS simplifies to the OLS regression of Y_t(θ̄) onto Z_t(θ̄), where Y₁(θ̄) = y₁ and (1 − θ̄L)Y_t(θ̄) = Δy_t, t > 1, and similarly for Z_t(θ̄). As in the case of MPI tests for an autoregressive unit root discussed in Section 3.2.1, the dependence of the MPI test statistic (4.6) on the alternative θ̄ cannot be eliminated, so there does not exist a UMPI test of θ = 1 vs. |θ| < 1. This has led researchers to propose alternative tests. A natural approach is to consider tests which have maximal power for local alternatives, that is, to consider the locally most powerful invariant test. In the special case that d_t is zero, Saikkonen and Luukkonen (1993a) show that the LMPI test has the form Tȳ²/σ̂_ε², where ȳ = T⁻¹Σ_{t=1}^T y_t and σ̂_ε² = T⁻¹Σ_{t=1}^T y_t² (which is the MLE of σ_ε² under the null). In the case d_t = β₀, Saikkonen and Luukkonen (1993a) use results in King and Hillier (1985) to derive the locally most powerful invariant unbiased test, which is based on the statistic

L^μ = T⁻² Σ_{t=1}^T (Σ_{s=1}^t y_s^μ)² / (σ̂_ε^μ)² = T⁻¹ Σ_{t=1}^T [Y_T^μ(t/T)/σ̂_ε^μ]²,  (4.7)
J.H. Stock
2794
where Y_T^μ(λ) = T^{−1/2}Σ_{s=1}^{[Tλ]} y_s^μ and (σ̂_ε^μ)² = T⁻¹Σ_{t=1}^T (y_t^μ)², where y_t^μ = y_t − ȳ [also see Tanaka (1990b)]. Note that (σ̂_ε^μ)² is the (conditional) MLE of σ_ε² under the null hypothesis. Because the statistic was derived for arbitrarily small deviations from the null, the parameter θ does not need to be estimated to construct L.²² A natural generalization of (4.7) to linear time trends is to replace the demeaned process y_t^μ by the detrended process y_t^τ:

L^τ = T⁻² Σ_{t=1}^T (Σ_{s=1}^t y_s^τ)² / (σ̂_ε^τ)² = T⁻¹ Σ_{t=1}^T [Y_T^τ(t/T)/σ̂_ε^τ]²,  (4.8)
where Y_T^τ(λ) = T^{−1/2}Σ_{s=1}^{[Tλ]} y_s^τ and (σ̂_ε^τ)² = T⁻¹Σ_{s=1}^T (y_s^τ)², where y_t^τ = y_t − z_t′β̂ and β̂ is the OLS estimator from the regression of y_t onto (1, t).

The asymptotic null distributions of L^μ and L^τ are readily obtained using the FCLT and CMT, under the maintained assumption that the order of the estimated deterministic trend is at least as great as the order of the true trend. First consider L^μ. As the second expression in (4.7) reveals, L^μ can be written as a continuous functional of Y_T^μ/σ̂_ε^μ. To obtain a limiting representation for L^μ it therefore suffices to obtain limiting results for the stochastic process Y_T^μ/σ̂_ε^μ. The limit of the numerator of this process was derived in Section 2.3 (Example 3) and is given in (2.14a) for u_t being a general I(0) process; because it is assumed in this subsection that u_t = ε_t under the null, this result applies here with ω = σ_ε. In addition, the maintained assumption that the trend is correctly specified ensures that σ̂_ε^μ →p σ_ε. It follows that Y_T^μ/σ̂_ε^μ ⇒ B^μ and that L^μ ⇒ ∫(B^μ)², where B^μ(λ) = W(λ) − λW(1) is a standard Brownian bridge. An identical argument applies to the linearly detrended case and yields the limit L^τ ⇒ ∫(B^τ)², where B^τ is the second-level Brownian bridge in (2.14b). In the leading case that d_t is a constant, L^μ has the asymptotic distribution of the Cramér–von Mises statistic derived by Anderson and Darling (1952). Nyblom and Mäkeläinen (1983, Table 1) provide critical values of the finite-sample distribution of L^μ, computed using the Imhof method for Gaussian errors. Kwiatkowski et al. (1992, Table 1) provide a table of critical values of ∫(B^μ)² and ∫(B^τ)² which agrees closely with earlier computations, e.g. MacNeill (1978, Table 2). Although the motivation for the L-statistic comes from considering the Gaussian MA(1) model, it is evident from the preceding derivation that the same asymptotic distributions obtain for MA(1) models with errors which satisfy weaker assumptions, such as being martingale difference sequences which satisfy (2.2).

²²Nabeya and Tanaka (1990b) showed that the statistic (4.7) is also locally MPI unbiased for the unconditional Gaussian MA(1) model with known d_t.
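As a concrete illustration (not part of the original chapter), a minimal sketch of the L-statistics (4.7) and (4.8) in Python/numpy; l_stat is an illustrative name, and the resulting statistic is to be compared with the critical values of ∫(B^μ)² or ∫(B^τ)² cited above.

import numpy as np

def l_stat(y, trend=False):
    # L^mu (demeaned) or L^tau (detrended) of (4.7)/(4.8):
    # T^{-2} * sum of squared partial sums of residuals / null MLE of variance
    y = np.asarray(y, dtype=float)
    T = len(y)
    if trend:
        Z = np.column_stack([np.ones(T), np.arange(1.0, T + 1)])
        resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # y_t^tau
    else:
        resid = y - y.mean()                                   # y_t^mu
    sig2 = (resid ** 2).mean()        # (sigma_hat_eps)^2 under the null
    S = np.cumsum(resid)              # accumulated residuals
    return (S ** 2).sum() / (T ** 2 * sig2)

For example, l_stat(y) computes L^μ and l_stat(y, trend=True) computes L^τ; rejection occurs for large values.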
The L-statistics (4.7) and (4.8) have intuitive interpretations. To be concrete, consider L^μ. Under the null hypothesis, y_t − β₀ is serially uncorrelated and the partial sum process of the demeaned data, Σ_{s=1}^t y_s^μ, is I(1). The statistic L^μ thus can be seen to test the null hypothesis that y_t is I(0) by testing its implication that the process of accumulated (demeaned) y_t's is I(1). Rejection occurs if L^μ is large, so the statistic tests the null that the accumulation of y_t is I(1) against the alternative that it is I(2). Comparison of (4.7) to the expression (3.18b) for the Sargan–Bhargava statistic testing the null of a unit autoregressive root in the d_t = β₀ case shows that the two statistics are closely related: the Sargan–Bhargava statistic rejects the I(1) null against the I(0) alternative when the sum of squared y_t's is small, while the LMPI statistic L^μ rejects the I(0) null against the I(1) alternative when the sum of squared accumulated y_t's is large.

Because of the similarities between the UC and MA models, not surprisingly the L-statistics can alternatively be derived as tests of σ_ν² = 0 in the UC model. In this formulation, the tests have the interpretation that they are testing the null that the regression intercept in (4.2) is constant, versus the alternative that it is time-varying, evolving as a martingale. To be concrete, suppose that y_t is generated by (4.2) with (ε_t, ν_t) i.i.d. N(0, σ_ε² diag(1, q)), μ₀ = 0, and set q = σ_ν²/σ_ε². Then (y₁, …, y_T)′ is distributed N(zβ, σ_ε²Σ_ξC(q)), where Σ_ξC(q) = I_T + qΩ*, where Ω*_{ij} = min(i, j). Again, the results of King (1980) imply that the most powerful invariant test of q = 0 against q = q̄ > 0 is a ratio of quadratic forms similar to (4.6) but involving Σ_ξC(q̄). The resulting statistic depends on q̄, so no uniformly MPI test exists.

Because there is no UMPI test, it is reasonable to examine the locally MPI test in the UC model. In the case d_t = β₀, Nyblom and Mäkeläinen (1983) derived the LMPI test statistic and showed it to be L^μ. Nyblom (1986) extended this analysis to the case d_t = β₀ + β₁t and showed the LMPI statistic to be L^τ. Nabeya and Tanaka (1988) extended these results to the general Gaussian regression problem in which coefficients on some of the variables follow a random walk while others are constant. The special case of Nabeya and Tanaka's (1988) results of interest here is when d_t = z_t′β, where z_t = (1, t, …, t^p)′ and the intercept term, μ_t in (4.2), follows a random walk. Then Nabeya and Tanaka's (1988) LM test statistic simplifies to (4.7), except that y_t^d, the residual from the OLS regression of y_t onto the time polynomials z_t, replaces y_t^μ and Y_T^d(λ) = T^{−1/2}Σ_{s=1}^{[Tλ]} y_s^d replaces Y_T^μ.²³

Despite the differences in the derivations in the MA and UC cases, the fact that the same local test statistics arise has a simple explanation. As argued above, Δu_t generated by the UC model has an MA(1) representation with parameters (θ, σ²) which solve q = θ + θ⁻¹ − 2 and θσ² = σ_ε². Thus, the distribution of (y₁, …, y_T)′ can be written as N(zβ, σ²Σ*_C(θ)), where z = (z₁, …, z_T)′, Σ*_{C,11} = (1 + q)σ_ε²/σ² = 1 − θ + θ², and where the remaining elements of Σ*_C(θ) equal those of Σ_C(θ). Thus the UC and conditional MA models are the same except for their treatment of the initial value y₁. But Σ*_{C,11}(1) = Σ_{C,11}(1), so, when θ = 1 (equivalently, when q = 0), the two models are identical.

A third interpretation of the L-tests arises from recognizing that the UC model
²³This simplification obtains from Nabeya and Tanaka's (1988) equation (2.5) by noting that the tth element of their My is y_t^d, by recognizing that, in our application, their D_X is the T × T identity matrix, and by carrying out the summation explicitly. See Kwiatkowski et al. (1992).
is a special case of time-varying-parameter models, so that the tests can be viewed as tests for a time-varying intercept. This interpretation was emphasized by Nabeya and Tanaka (1988) and by Nyblom (1989). We return to this link in Section 5.

Local optimality is not the only testing principle which can be fruitfully exploited here, and other tests of the hypothesis q = 0 in the UC model have been proposed. LaMotte and McWhorter (1978) proposed a family of exact tests for random walk coefficients, which contains the i.i.d. UC model (4.1) with d_t = β₀ and d_t = β₀ + β₁t as special cases, under the translation group y → y + zb, β → β + b. Powers of the LaMotte–McWhorter tests are tabulated by Nyblom and Mäkeläinen (1983) (constant case) and by Nyblom (1986) (time-trend case). Franzini and Harvey (1983) considered tests in the Gaussian UC model with the maintained hypothesis of nonzero drift in μ_t, which is equivalent to (4.2) with (ε_t, ν_t) i.i.d. Gaussian and d_t = β₀ + β₁t. Franzini and Harvey (1983) suggested using a point-optimal test, that is, choosing an MPI test of the form (4.6), where their recommendation corresponds to q̄ ≅ 0.75 for T = 20. Shively (1988) also examined point-optimal tests in the UC model with an intercept and suggested using the MPI tests tangent to the power envelope at, alternatively, powers of 50 percent and 80 percent, respectively corresponding to q̄ = 0.023 and 0.079 for T = 51.

4.2.2. Tests of the general I(0) null
Because economic theory rarely suggests that an error term is i.i.d. Gaussian, the Gaussian MA(1) and i.i.d. UC models analyzed in the previous subsection are too special to be of interest in most empirical applications. While the asymptotic null distributions of the L^μ and L^τ statistics obtain under weaker conditions than Gaussianity, such as u_t = ε_t where ε_t satisfies (2.2), these statistics are not asymptotically similar under the general I(0) null in which u_t is weakly dependent and satisfies (2.1)–(2.3). A task of potential practical importance, therefore, is to relax this assumption and to develop tests which are valid under the more general assumption that u_t is I(0).

The two main techniques which have been used to develop tests of the general I(0) null parallel those used to extend autoregressive unit root tests from the AR(1) model to the general I(1) model. The first, motivated by analogy to the way that Phillips and Perron (1988) handled the nuisance parameter ω in their unit root tests, is to replace the estimator of the variance of u_t in statistics such as L^μ and L^τ with an estimator of the spectral density of u_t at frequency zero; this produces "modified" L^μ and L^τ statistics.²⁴ The second, used by Saikkonen and Luukkonen (1993a, 1993b), is to transform the series using an estimated ARMA process for u_t. The device of Section 3, in which unit root tests were represented as functionals of

²⁴This approach was proposed by Park and Choi (1988) to generalize their variable addition tests, discussed in the subsequent paragraphs, to the general I(0) null. It was used by Tanaka (1990b) to extend the L^μ statistic to the general I(0) null. [Tanaka's (1990b) expression (7) is asymptotically equivalent to (4.7).] Kwiatkowski et al. (1992) used this approach to extend the L^τ statistic to the general I(0) null.
the levels process of the data, can be applied in this problem to provide a general treatment of those tests of the I(0) null which involve an explicit correction using an estimated spectral density. The main modification is that, in the I(0) case, the tests are represented as functionals of the accumulated levels process rather than the levels process itself. This general treatment produces, as special cases, the extended L^μ and L^τ statistics and the "variable addition" test statistics, G(p, q), proposed by Park and Choi (1988).

Park and Choi's (1988) G(p, q) statistic arises from supposing that the true trend is (at most) a pth order polynomial. The detrending regression is then intentionally overspecified, including polynomials of order q where q > p. If u_t is I(0), then the OLS estimators of the coefficients on these additional q − p trends are consistent for zero. If, however, u_t is I(1), then the regression of y_t on the full set of trends introduces "spurious detrending", as discussed in Example 2 of Section 2.3, and the LR test will reject the null hypothesis that the true coefficients on (t^{p+1}, …, t^q) are zero. These two observations suggest considering the modified LR statistic, G(p, q) = T(σ̃² − σ̂²)/σ̂², where σ̃² and σ̂², respectively, are the mean squared residuals from the regression of y_t onto (1, t, …, t^p) and of y_t onto (1, t, …, t^q). In functional notation, the L- and G(p, q)-tests have the representations
L:       L = g_L(f),        g_L(f) = ∫f²,  (4.9a)

G(p, q): G(p, q) = g_G(f),  g_G(f) = Σ_{j=p+1}^q (∫h_j df)²,  (4.9b)
where f denotes the random function being evaluated and h_j is the jth Legendre polynomial on the unit interval. As the representations (4.9) make clear, to study the limiting behavior of the statistics it suffices to study the behavior of the function being evaluated and then to apply the CMT. Under the general I(0) null, T^{−1/2}Σ_{s=1}^{[Tλ]}u_s ⇒ ωW, so that the general detrended process Y_T^d (defined in Section 2.3, Example 3) has the limit Y_T^d/σ̂_ε ⇒ (ω²/σ_ε²)^{1/2}B^{(p)}. This suggests modifying these statistics by evaluating the functionals using V_T^d, where

V_T^d(λ) = ω̂_SC⁻¹ T^{−1/2} Σ_{s=1}^{[Tλ]} y_s^d.  (4.10)
If ω̂²_SC is consistent for ω², then V_T^d ⇒ B^{(p)} and the asymptotic distributions of the statistics (4.9) will not depend on any nuisance parameters under the general I(0) null. The SC estimator ω̂²_SC is used for this purpose by Park and Choi (1988), Tanaka (1990b), Kwiatkowski et al. (1992) and Stock (1992). The rate conditions l_T → ∞ and l_T = o(T^{1/2}) are sufficient to ensure consistent estimation of ω under the null and, as is discussed in the next subsection, test consistency under a (fixed) alternative.
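A minimal sketch (not from the chapter) of the modified L-statistic: the functional g_L evaluated at V_T^d of (4.10), with ω̂²_SC computed here, for concreteness, with Bartlett weights (the Monte Carlo experiments below use the Parzen and QS kernels with automatic bandwidth); omega2_sc and modified_l_stat are illustrative names.

import numpy as np

def omega2_sc(e, l):
    # SC estimator of omega^2: weighted sum of sample autocovariances,
    # here with Bartlett weights k(m/l) = 1 - m/(l + 1)
    T = len(e)
    w2 = (e @ e) / T
    for m in range(1, l + 1):
        w2 += 2.0 * (1.0 - m / (l + 1.0)) * (e[m:] @ e[:-m]) / T
    return w2

def modified_l_stat(y, l, trend=False):
    # g_L evaluated at V_T^d of (4.10): OLS-detrend, accumulate, and scale
    # by the SC estimator instead of the null variance estimator
    y = np.asarray(y, dtype=float)
    T = len(y)
    cols = [np.ones(T), np.arange(1.0, T + 1)] if trend else [np.ones(T)]
    Z = np.column_stack(cols)
    e = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]    # y_t^d
    V = np.cumsum(e) / np.sqrt(T * omega2_sc(e, l))     # V_T^d(t/T)
    return (V ** 2).mean()                              # g_L(V_T^d)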
A second approach to extending the MA(1) tests to the general I(0) null, developed by Saikkonen and Luukkonen (1993a, 1993b), involves modifying the test statistic by, in effect, filtering the data. Saikkonen and Luukkonen (1993a) consider the Gaussian LMPI unbiased (LMPIU) test under an ARMA(p, q) model for the I(0) errors u = (u₁, …, u_T)′. Let the covariance matrix of u be Σ_u, and assume first that Σ_u is known. With Σ_u known, under the null hypothesis β₀ can be estimated by GLS, yielding the estimator β̂₀; let û_t denote the residuals y_t − β̂₀. The transformed GLS residuals are then given by ε̃ = Σ_u^{−1/2}û. The Gaussian LMPIU test in this model is a ratio of quadratic forms in ε̃ analogous to (4.6), where the covariance matrix in the numerator is evaluated under the θ = 1 null. In practice, the parameters of the short-run ARMA process used to construct Σ_u must be estimated; see Saikkonen and Luukkonen (1993a) for the details. Saikkonen and Luukkonen (1993b) apply this approach to extend the finite-sample point-optimal invariant tests of the form (4.6) to general I(0) errors in the d_t = β₀ case and to derive the asymptotic distribution of these tests under the null and local alternatives.²⁵

4.2.3. Consistency and local asymptotic power
Consistency. The statistics with the functional representations (4.9) reject for large values of the statistic. It follows that the tests based on the modified L^μ and L^τ statistics and on the G(p, q) statistics are consistent if V_T^d →p ∞ under the I(1) alternative. Consider, first, the case d_t = 0, so that the numerator of V_T in (4.10) is T^{−1/2}Σ_{s=1}^{[Tλ]}u_s. Under the I(1) alternative this cumulation is I(2) and T^{−3/2}Σ_{s=1}^{[Tλ]}u_s ⇒ ω∫₀^λ W(s) ds. Similarly, if l_T → ∞ but l_T = o(T^{1/2}), then ω̂²_SC has the limit ω̂²_SC/[TΣ_{m=−l_T}^{l_T} k(m/l_T)] ⇒ ω²∫W².²⁶ Combining these two results, we have

N_T^{−1/2} V_T ⇒ V*,  where V*(λ) = ∫₀^λ W(s) ds / {∫W²}^{1/2}  and  N_T = T / Σ_{m=−l_T}^{l_T} k(m/l_T).  (4.11)

Because the kernel k is bounded and l_T = o(T^{1/2}),

²⁵Bierens and Guo (1993) used a different approach to develop a test of the general I(0) null against the I(1) alternative, in which the distribution under the null is made free of nuisance parameters not by explicit filtering or estimation of the spectral density at frequency zero, but rather by using a weighted average of statistics in which the weights are sensitive to whether the I(0) or I(1) hypothesis is true.

²⁶Suppose that d_t = 0 and that the SC estimator is constructed using a fixed number l of autocovariances of y_t, rather than letting l_T → ∞; this would be appropriate were the MA order of u_t finite and known a priori. If y_t is I(1), T⁻²Σ_{t=i+1}^T y_ty_{t−i} − T⁻²Σ_{t=1}^T y_t² →p 0, i = 1, …, k, and moreover T⁻²Σ_{t=1}^T y_t² ⇒ ω²∫W². It follows by direct calculation that ω̂²_SC/[TΣ_{m=−l}^l k(m/l)] ⇒ ω²∫W². The proof for the general SC estimator entails extending this result from fixed l to a sequence l_T increasing sufficiently slowly. For details see Phillips (1991b, Appendix) for the d_t = 0 case; Kwiatkowski et al. (1992) for OLS detrending with a constant or a linear time trend; Perron (1991a) for general polynomial trends estimated by OLS; and Stock (1992) under general trend conditions including an estimated broken trend.
N_T → ∞ as T → ∞, so under the fixed I(1) alternative V_T →p ∞. Thus, tests which are continuous functionals of V_T and which reject I(0) in favor of I(1) for large realizations of V_T will be consistent against the fixed I(1) alternative. This is readily extended to general trends. For example, in the case d_t = β₀, N_T^{−1/2}V_T^μ(·) ⇒ V*^μ, where V*^μ(λ) = ∫₀^λ W^μ(s) ds/{∫(W^μ)²}^{1/2}. In the detrended case, N_T^{−1/2}V_T^τ(·) ⇒ V*^τ, where V*^τ(λ) = ∫₀^λ W^τ(s) ds/{∫(W^τ)²}^{1/2}, where W^τ is the OLS-detrended Brownian motion in (2.13b). This, in turn, implies that, under the fixed I(1) alternative, V_T^μ and V_T^τ →p ∞, and consistency of the test statistics in (4.9) follows directly.
Local asymptotic power. We examine local asymptotic power using a local version of the UC model (4.2):

y_t = d_t + u_t,  u_t = u_{0t} + H_T u_{1t},  (4.12)
where u_{0t} and u_{1t} are respectively I(0) and I(1) in the sense that (u_{0t}, Δu_{1t}) satisfy (2.1)–(2.3), and where H_T = h/T, where h is a constant. Because H_T → 0, the I(1) component of y_t in (4.12) vanishes asymptotically, so that (4.12) provides a model in which y_t is a local-to-I(0) process. For concreteness and to make the link between the UC and MA models precise, we will work with the special case of (4.12) in which u_{0t} and Δu_{1t} are mutually and serially uncorrelated and have the same variance. Then the local-to-I(0) UC model (4.12) has the MA(1) representation, Δu_t = ε_t − θ_Tε_{t−1}, where θ_T = 1 − h/T + o(T⁻¹) and where ε_t is serially uncorrelated. Thus, the local-to-I(0) parameterization H_T = h/T is asymptotically equivalent to a local-to-unit MA root with the nesting θ_T = 1 − h/T.²⁷

To analyze the local power properties of the tests in (4.9), we obtain a limiting representation of V_T^d under the local-to-I(0) model. First consider the case d_t = 0. The behavior of the numerator follows from the FCLT and CMT. Define the independent Brownian motions W₀ and W₁ respectively as the limits T^{−1/2}Σ_{s=1}^{[Tλ]}u_{0s} ⇒ ω₀W₀(·) and T^{−1/2}u_{1,[T·]} ⇒ ω₁W₁(·). By assumption, ω₀ = ω₁, so T^{−1/2}Σ_{s=1}^{[Tλ]}u_s = T^{−1/2}Σ_{s=1}^{[Tλ]}u_{0s} + hT^{−3/2}Σ_{s=1}^{[Tλ]}u_{1s} ⇒ ω₀U_h(λ), where U_h(λ) = W₀(λ) + h∫₀^λ W₁(s) ds. It can additionally be shown that, in the local-to-I(0) model (4.12), the SC estimator has the limit ω̂²_SC →p ω₀² [Elliott and Stock (1994, Theorem 2)]. Thus V_T ⇒ U_h, from which it follows that the statistics in (4.9) have the local asymptotic representation g(U_h) for their respective g functionals.
²⁷The rate T for this local nesting is consistent with the asymptotic results in the unit MA root and UC test literatures, which in general find that this nesting is the appropriate one for studying rates of convergence of the MA estimators and/or the local asymptotic power of tests. In the MA unit root literature, see Sargan and Bhargava (1983b), Anderson and Takemura (1986), Tanaka and Satchell (1989), Tanaka (1990b) and Saikkonen and Luukkonen (1993b); in the UC literature, see Nyblom and Mäkeläinen (1983), Nyblom (1986, 1989) and Nabeya and Tanaka (1988).
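As a concrete illustration (not part of the original chapter), a draw from the DGP used for Tables 3 and 4 (described in the notes to Table 3) can be sketched as follows; local_uc is an illustrative name.

import numpy as np

def local_uc(T, h, rng):
    # y_t = u_t = u0_t + (h/T)*u1_t, with (u0_t, Delta u1_t) i.i.d. N(0, I):
    # the local-to-I(0) UC model (4.12) with d_t = 0
    u0 = rng.standard_normal(T)
    u1 = np.cumsum(rng.standard_normal(T))   # the I(1) component
    return u0 + (h / T) * u1

Feeding such draws into the statistics above and averaging rejections over replications reproduces, up to Monte Carlo error, entries such as those in Table 3; h = 0 gives the size.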
Table 3
Power of MA unit root tests
[5 percent level tests, demeaned case (d_t = β₀)].ᵃ

h     L^μ      POI(0.5)  G(0,1)   G(0,2)   PE

A. T = 50
1     0.064    0.061     0.058    0.059    0.064
2     0.101    0.095     0.087    0.089    0.103
5     0.299    0.298     0.279    0.268    0.304
10    0.570    0.633     0.583    0.596    0.631
15    0.717    0.815     0.745    0.776    0.823
20    0.803    0.898     0.823    0.864    0.909
30    0.878    0.960     0.890    0.932    0.976
40    0.914    0.979     0.912    0.954    0.992

B. T = 100
1     0.064    0.066     0.065    0.060    0.064
2     0.107    0.101     0.103    0.090    0.106
5     0.319    0.321     0.311    0.289    0.332
10    0.605    0.664     0.623    0.629    0.659
15    0.765    0.841     0.779    0.807    0.845
20    0.852    0.919     0.857    0.890    0.931
30    0.923    0.974     0.924    0.952    0.985
40    0.958    0.991     0.944    0.974    0.996

C. T = 200
1     0.063    0.064     0.062    0.063    0.062
2     0.104    0.099     0.097    0.095    0.102
5     0.309    0.316     0.305    0.299    0.314
10    0.605    0.667     0.621    0.640    0.669
15    0.758    0.841     0.779    0.811    0.851
20    0.847    0.922     0.854    0.894    0.937
30    0.934    0.980     0.924    0.956    0.988
40    0.965    0.995     0.950    0.976    0.998

D. T = 1000
1     0.062    0.057     0.055    0.057    0.061
2     0.102    0.087     0.086    0.081    0.099
5     0.321    0.310     0.296    0.283    0.329
10    0.613    0.663     0.624    0.632    0.661
15    0.717    0.843     0.789    0.811    0.853
20    0.866    0.929     0.871    0.900    0.944
30    0.948    0.985     0.937    0.963    0.992
40    0.978    0.996     0.962    0.981    0.999

ᵃData were generated according to the unobserved components model, y_t = u_t, where u_t = u_{0t} + H_Tu_{1t}, (u_{0t}, Δu_{1t}) are i.i.d. N(0, I) and H_T = h/T. PE denotes the power envelope. The remaining tests are based on the indicated statistics as defined in the text. Based on 5000 Monte Carlo repetitions.
The power functions for the statistics in (4.9), along with the power envelope, are summarized in Table 3 (for the case d_t = β₀) and Table 4 (for the case d_t = β₀ + β₁t). The power functions were computed by Monte Carlo simulation for various values of T, so technically all the power functions are finite-sample, although the simulations suggest that the T = 1000 power is effectively the asymptotic local power.²⁸
Table 4
Power of MA unit root tests
[5 percent level tests, detrended case (d_t = β₀ + β₁t)].ᵃ

h     L^τ      POI(0.5)  G(1,2)   G(1,3)   PE

A. T = 50
1     0.052    0.052     0.055    0.055    0.053
2     0.062    0.061     0.062    0.062    0.062
5     0.132    0.131     0.117    0.118    0.139
10    0.322    0.348     0.259    0.299    0.349
15    0.513    0.578     0.379    0.477    0.585
20    0.659    0.746     0.469    0.603    0.744
30    0.810    0.899     0.569    0.733    0.905
40    0.880    0.954     0.618    0.795    0.958

B. T = 100
1     0.055    0.053     0.052    0.051    0.055
2     0.063    0.063     0.060    0.060    0.064
5     0.136    0.130     0.121    0.113    0.127
10    0.349    0.359     0.277    0.298    0.359
15    0.560    0.609     0.414    0.491    0.610
20    0.704    0.777     0.507    0.632    0.775
30    0.864    0.928     0.629    0.777    0.939
40    0.928    0.975     0.689    0.842    0.984

C. T = 200
1     0.054    0.049     0.053    0.050    0.055
2     0.064    0.056     0.062    0.062    0.063
5     0.136    0.122     0.111    0.125    0.136
10    0.357    0.362     0.269    0.323    0.369
15    0.569    0.610     0.415    0.517    0.613
20    0.718    0.776     0.521    0.655    0.785
30    0.880    0.935     0.640    0.792    0.948
40    0.950    0.980     0.708    0.867    0.989

D. T = 1000
1     0.052    0.054     0.052    0.052    0.051
2     0.059    0.063     0.062    0.060    0.060
5     0.127    0.123     0.120    0.116    0.131
10    0.353    0.366     0.323    0.335    0.370
15    0.576    0.624     0.521    0.554    0.629
20    0.743    0.805     0.671    0.712    0.815
30    0.905    0.953     0.825    0.868    0.963
40    0.969    0.990     0.893    0.942    0.994

ᵃSee the notes to Table 3.
In addition, the power of the point-optimal tests which are tangent to the power

²⁸Tabulations of exact power functions and the finite-sample power envelope under the Gaussian model appear in several places in the literature. Those tabulations are based on the Imhof algorithm. When results in the literature are directly comparable to those in Tables 3 and 4, they agree to within two decimals. For results in the demeaned UC model, see Nyblom and Mäkeläinen (1983) and Shively (1988); for tabulations in the detrended UC model, see Nyblom (1986). Tanaka (1990b) tabulates both finite-sample and limiting powers of the L^μ statistic, where the latter is computed by numerically inverting its limiting characteristic function [Tanaka (1990b, Theorem 2)]. Tanaka's limiting power for L^μ agrees with the T = 1000 powers in Table 3 to within the Monte Carlo error.
envelope at a power of 50 percent is also reported. In the demeaned case, this test was suggested by Shively (1988), and the test is the MPI test against the local alternative h = 7.74. In the detrended case, calculations suggest that the power envelope attains 50 percent at approximately h = 13, so the MPI test against the local alternative h = 13 is reported. This test, the POI(0.5) statistic in Table 4, is almost the same test as was proposed by Franzini and Harvey (1983): if the local-to-I(0) asymptotics are used to interpret their recommendations (which were based on a Monte Carlo experiment with T = 20), then the Franzini–Harvey statistic is the point-optimal invariant test against the local alternative h ≅ 17. (Interpreted thus, the Franzini–Harvey statistic is asymptotically MPI at a power of approximately 70 percent.) These tables summarize the power findings of Nyblom and Mäkeläinen (1983), Nyblom (1986), Shively (1988), Tanaka (1990b) and Saikkonen and Luukkonen (1993a, 1993b).

Five main conclusions emerge from these tables. First, the convergence of the finite-sample powers to the asymptotic limits appears to be relatively fast, in the sense that the T = 100 powers and T = 1000 powers typically differ by less than 0.02. Second, as was the case with tests for a unit autoregressive root, the powers deteriorate as the order of detrending increases from demeaning to linear detrending, particularly for alternatives of h near zero. For example, the L^μ statistic has a limiting power of 0.61 against h = 10, while the corresponding power for the L^τ statistic is 0.35. Third, the point-optimal tests perform better than the LMPIU test against all but the closest alternatives. Fourth, although the Park–Choi G(p, p + 1) and G(p, p + 2) tests are strictly below the power envelope, they nonetheless perform rather well and in particular have power curves only slightly below the L^μ and L^τ statistics. Fifth, it is important to emphasize that all these differences are rather modest in comparison to the large differences in powers found among the various tests for a unit AR root. For example, the Pitman efficiency of the L^μ statistic relative to the MPI test at power = 50 percent is approximately 1.1, indicating a loss of the equivalent of only 10 percent of the sample if the L^μ statistic is used in this case rather than the MPI test.

4.2.4. Finite-sample size and power
A small Monte Carlo experiment was performed to examine the finite-sample size and power of tests of the I(0) null. Unlike for tests of the I(1) null, as of this writing there have been few Monte Carlo investigations of tests of the general I(0) null; exceptions include Amano and van Norden (1992) and Kwiatkowski et al. (1992). The simulation here summarizes the results of these two studies for the L^τ statistic by using a similar design (autoregressive errors) and extends them to include the Park–Choi G(p, p + 2) statistics and to examine the effect of kernel choice on test performance. In the d_t = β₀ case, the experiment considers the modified L^μ and G(0,2) statistics (based on V_T^μ); in the d_t = β₀ + β₁t case, the statistics are the modified L^τ and G(1,3)
statistics (based on V_T^τ). The spectral density was estimated using two SC spectral estimators with a truncated automatic bandwidth selector. The automatic bandwidth is l_T = min[l̂_T, 12(T/100)^{1/2}], where l̂_T is Andrews' (1991) automatic selector based on an estimated AR(1) model. The two kernels are the Parzen kernel and the QS kernel, the latter being Andrews' (1991) optimal kernel, and the appropriate selector for each kernel is used. [The automatic bandwidth selector is truncated because, unless l̂_T is bounded in the I(1) case, it does not satisfy the o(T^{1/2}) rate condition needed for consistency as described in Section 4.2.3.] The pseudo-data were generated so that u_t is driven by an AR(1) error ν_t:

y_t = u_t,  Δu_t = (1 − θL)ν_t,  ν_t = ρν_{t−1} + ε_t,  ε_t i.i.d. N(0, 1),  (4.13)
where u₀ = 0 and ν₀ is drawn from its unconditional distribution. When |ρ| < 1 and θ = 1, y_t is I(0) and the experiment examines the size of the test. When |ρ| < 1 and |θ| < 1, y_t is I(1). When ρ = 0, this is the MA(1) model and corresponds to the local-to-unity model (4.12) with (u_{0t}, Δu_{1t}) mutually and serially uncorrelated with the same variance, in which case θ = 1 − h/T + o(T⁻¹).

Empirical size and size-adjusted power are presented for T = 100 in Table 5 (the demeaned case) and Table 6 (the detrended case). Size-adjusted power in a (ρ, θ) design, |θ| < 1, is computed using the 5 percent empirical quantile for (ρ, θ = 1) for each value of ρ.

These results suggest three conclusions. First, the choice of spectral estimator matters for size, less so for size-adjusted power. For example, if the Parzen kernel is used, the size deteriorates substantially when the serial correlation is large (ρ = 0.9). [If the Bartlett kernel is used, as suggested by Tanaka (1990b) and Kwiatkowski et al. (1992), similar size distortions arise (results not shown in these tables).] In contrast, the size is much better controlled using the QS kernel. This is true for both of the statistics examined, in both the demeaned and detrended cases. On the other hand, the size-adjusted powers for both statistics in both cases are comparable for the two spectral estimators. Interestingly, for distant alternatives the size-adjusted power declines in the ρ = 0 case for the demeaned statistics, and the decline is more pronounced for the QS statistics.

Second, a comparison of the results in Tables 3 and 4 with those in Tables 5 and 6, respectively, reveals that when ρ = 0 the finite-sample size-adjusted power is fairly close to the power predicted by the local-to-I(0) asymptotics of Section 4.2.3, at least for close and moderately close alternatives. At least in the ρ = 0 case, the use of the SC estimator seems to have little impact on either size or power. However, size-adjusted power deteriorates sharply as the autoregressive nuisance parameter increases towards one. Interestingly, detrending makes little difference in terms of size. This is noteworthy, given the large impact of detrending in the I(1) test situations.
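For concreteness, a sketch (not from the chapter) of one draw from (4.13); dgp_413 is an illustrative name and requires |ρ| < 1.

import numpy as np

def dgp_413(T, rho, theta, rng):
    # y_t = u_t, Delta u_t = (1 - theta*L) nu_t, nu_t = rho*nu_{t-1} + eps_t,
    # with u_0 = 0 and nu_0 drawn from its unconditional distribution
    eps = rng.standard_normal(T)
    nu = np.empty(T + 1)
    nu[0] = rng.standard_normal() / np.sqrt(1.0 - rho ** 2)  # unconditional draw
    for t in range(T):
        nu[t + 1] = rho * nu[t] + eps[t]
    du = nu[1:] - theta * nu[:-1]      # Delta u_t = nu_t - theta*nu_{t-1}
    return np.cumsum(du)               # cumulate from u_0 = 0

Setting θ = 1 with |ρ| < 1 generates I(0) data for the size experiments; |θ| < 1 generates I(1) data for the power experiments.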
Table 5
Size and size-adjusted power of selected tests of the I(0) null: Monte Carlo results
[5 percent level tests, demeaned case (d_t = β₀), T = 100]
[Data generating process: (1 − ρL)Δy_t = (1 − θL)ε_t, ε_t i.i.d. N(0, 1)].ᵃ

                                             ρ =
Test statistic   θ      Asymptotic power    0.0    0.9    0.75   0.5    −0.5

L^μ              1.00   0.05                0.05   0.26   0.10   0.06   0.04
P(auto)          0.95   0.32                0.29   0.26   0.25   0.25   0.29
                 0.90   0.61                0.55   0.43   0.46   0.47   0.56
                 0.80   0.87                0.69   0.53   0.58   0.60   0.79
                 0.70   0.95                0.68   0.55   0.61   0.64   0.87

L^μ              1.00   0.05                0.05   0.11   0.05   0.06   0.04
QS(auto)         0.95   0.32                0.30   0.24   0.21   0.24   0.29
                 0.90   0.61                0.57   0.39   0.36   0.39   0.56
                 0.80   0.87                0.72   0.46   0.47   0.43   0.80
                 0.70   0.95                0.67   0.49   0.49   0.44   0.88

G(0,2)           1.00   0.05                0.05   0.29   0.10   0.05   0.03
P(auto)          0.95   0.28                0.28   0.26   0.24   0.23   0.26
                 0.90   0.63                0.58   0.48   0.49   0.49   0.57
                 0.80   0.90                0.76   0.62   0.66   0.67   0.81
                 0.70   0.96                0.76   0.65   0.70   0.72   0.88

G(0,2)           1.00   0.05                0.05   0.07   0.04   0.06   0.04
QS(auto)         0.95   0.28                0.29   0.19   0.16   0.21   0.26
                 0.90   0.63                0.60   0.37   0.33   0.34   0.57
                 0.80   0.90                0.78   0.49   0.47   0.37   0.81
                 0.70   0.96                0.72   0.52   0.52   0.38   0.88

ᵃFor each statistic, the first row of entries are the empirical rejection rates under the null, that is, the empirical size of the test, based on the asymptotic critical value. The remaining entries are the size-adjusted power for the model given in the column heading. The column Asymptotic power gives the T = 1000 rejection rate for that statistic from Table 3, using θ_T = 1 − h/T. The entry below the name of each statistic indicates the spectral density estimator used. P(auto) and QS(auto) refer to the SC estimator, computed respectively using the Parzen and QS kernels, each with lag lengths chosen by the respective truncated automatic selector in Andrews (1991). Based on 5000 Monte Carlo repetitions.
Third, the differences in size-adjusted power across test statistics are modest. Because of its better size performance, we restrict the discussion to the results for the QS kernel. In the demeaned case, G(0,2) has somewhat better size-adjusted power than the modified L^μ statistic for distant alternatives when u_t is positively correlated; for θ near one, the modified L^μ statistic is more powerful. In the detrended case, G(1,3) and the modified L^τ statistic have essentially the same size-adjusted powers.

4.2.5. Summary and implications for empirical practice
The literature on tests of the general I(0) null against the I(1) alternative is still young. Subject to this caveat, the results here suggest several observations. The asymptotic power analysis of Section 4.2.3 suggests that there is little room for improvement on the performance of the currently proposed tests, at least in terms of local asymptotic power.
Table 6
Size and size-adjusted power of selected tests of the I(0) null: Monte Carlo results
[5 percent level tests, detrended case (d_t = β₀ + β₁t), T = 100]
[Data generating process: (1 − ρL)Δy_t = (1 − θL)ε_t, ε_t i.i.d. N(0, 1)].ᵃ

                                             ρ =
Test statistic   θ      Asymptotic power    0.0    0.9    0.75   0.5    −0.5

L^τ              1.00   0.05                0.05   0.29   0.11   0.06   0.04
P(auto)          0.95   0.13                0.13   0.12   0.12   0.12   0.12
                 0.90   0.35                0.34   0.23   0.25   0.25   0.32
                 0.80   0.74                0.62   0.36   0.41   0.43   0.65
                 0.70   0.91                0.64   0.40   0.47   0.50   0.81

L^τ              1.00   0.05                0.05   0.10   0.05   0.06   0.04
QS(auto)         0.95   0.13                0.13   0.10   0.09   0.12   0.13
                 0.90   0.35                0.35   0.19   0.16   0.22   0.33
                 0.80   0.74                0.65   0.28   0.25   0.25   0.67
                 0.70   0.91                0.68   0.30   0.28   0.23   0.83

G(1,3)           1.00   0.05                0.04   0.30   0.12   0.07   0.04
P(auto)          0.95   0.12                0.12   0.12   0.11   0.11   0.11
                 0.90   0.34                0.31   0.25   0.24   0.23   0.28
                 0.80   0.71                0.58   0.40   0.43   0.43   0.59
                 0.70   0.87                0.62   0.46   0.49   0.51   0.73

G(1,3)           1.00   0.05                0.04   0.13   0.07   0.07   0.04
QS(auto)         0.95   0.12                0.12   0.10   0.08   0.10   0.11
                 0.90   0.34                0.32   0.18   0.15   0.20   0.29
                 0.80   0.71                0.61   0.28   0.24   0.25   0.60
                 0.70   0.87                0.64   0.32   0.28   0.24   0.74

ᵃSee the notes to Table 5.
The various tests have asymptotic relative efficiencies fairly close to one, and the point-optimal tests (the Shively and Franzini–Harvey tests), interpreted in the local-to-I(0) asymptotic framework, have power functions that are close to the power envelope for a large range of local alternatives. The Monte Carlo results suggest, however, that there remains room for improvement in the finite-sample performance of these tests. With the Parzen kernel, the tests exhibit large size distortions; with the QS kernel, the size distortions are reduced but the finite-sample power can be well below its asymptotic limit. For autoregressive parameters not exceeding 0.75, both the G(p, p + 2) and L statistics, evaluated using the QS(auto) kernel, have Monte Carlo sizes near their asymptotic levels and have comparable power.
5. Structural breaks and broken trends
This section examines two topics: structural breaks and parameter instability in time series regression; and tests for a unit root when there are kinks or jumps in the
deterministic trend (the "broken-trend" model). At first glance these problems seem quite different. However, there are close mathematical and conceptual links which this section aims to emphasize. Mathematically, a multidimensional version of the FCLT plus CMT approach of Section 2 is readily applied to provide asymptotic representations for a variety of tests of parameter stability. [An early and sophisticated application of the FCLT to the change-point problem can be found in MacNeill (1974).] Conceptually, the unobserved components model with a small independent random walk component is in fact a special case of the more general time-varying-parameter model. Also, these topics recently have become intertwined in empirical investigations into unit roots when one maintains the possibility that the deterministic component has a single break, for example is a piecewise-linear time trend.

Section 5.1 addresses testing for and, briefly, estimation of parameter instability in time series regression with I(0) regressors, including the case when there are lagged dependent I(0) variables and, in particular, stationary autoregressions. The main empirical application of these tests is as regression diagnostics and, as an example in Section 5.1.4, the tests are used to assess the stability of the link between various monetary aggregates and output in the U.S. from 1960 to 1992. The literature on parameter instability and structural breaks is vast, and the treatment here provides an introduction to the main applications in econometric time series regression from a classical perspective. The distribution theory for the tests is nonstandard. Here, the alternatives of interest have parameters which are unidentified under the null hypothesis; for example, in the case of a one-time change in a coefficient, under the null of "no break" the magnitude of the change is zero and the break date is unidentified. Davies (1977) showed that, if parameters are unidentified under the null, standard χ² inference does not obtain, and many of the results in Section 5.1 can be seen as special cases of this more general problem. For further references on parameter instability and breaks, the reader is referred to the reviews and bibliographies in Hackl and Westlund (1989), Krishnaiah and Miao (1988), Krämer and Sonnberger (1986) and, for Bayesian work in this area, Zacks (1983) and Barry and Hartigan (1993).

Section 5.2 turns to inference about the largest root in a univariate autoregression under the maintained hypothesis that there might be one-time breaks or jumps in the deterministic component. In innovative papers, Perron (1989a, 1990b) and Rappoport and Reichlin (1989) independently suggested that the broken-trend model provides a useful description of a wide variety of economic time series. Perron (1989a) argued, inter alia, that U.S. postwar real GNP is best modeled as being I(0) around a piecewise-linear trend with a break in 1973, and Rappoport and Reichlin (1989) argued that U.S. real GNP from 1909–1970 [the Nelson–Plosser (1982) data] was stationary around a broken trend with a break in 1940. These results seem to suggest that the long-term properties of output are determined not by unit-root dynamics, but rather by rare events with lasting implications for mean long-term growth, such as World War II and the subsequent shift to more activist governmental
economic policy, or the oil shock and productivity slowdown of the mid-1970’s. Whether this view is upheld statistically is a topic of ongoing debate in which the tests of Section 5.2 play a central role.
5.1. Breaks in coefficients in time series regression

5.1.1. Tests for a single break date
Suppose y_t obeys the time series regression model

y_t = β_t′X_{t−1} + ε_t,  (5.1)
where under the null hypothesis β_t = β for all t. Throughout Section 5.1, unless explicitly stated otherwise, it is maintained that ε_t is a martingale difference sequence with respect to the σ-fields generated by {ε_{t−1}, X_{t−1}, ε_{t−2}, X_{t−2}, …}, where X_t is a k × 1 vector of regressors, which are here assumed to be constant and/or I(0) with EX_tX_t′ = Σ_X and, possibly, a nonzero mean. For convenience, further assume that ε_t is conditionally (on lagged ε_t and X_t) homoskedastic. Also, assume that T⁻¹Σ_{s=1}^{[Tλ]}X_sX_s′ →p λΣ_X, uniformly in λ for λ∈[0, 1]. Note, in particular, that X_{t−1} can include lagged dependent variables as long as they are I(0) under the null. The alternative hypothesis of a single break in some or all of the coefficients is

β_t = β,  t ≤ r,  and  β_t = β + γ,  t > r,  (5.2)
where r, k + 1 < r < T, is the "break date" (or "change point") and γ ≠ 0. When the potential break date is known, a natural test for a change in β is the Chow (1960) test, which can be implemented in asymptotically equivalent Wald, Lagrange multiplier (LM) and LR forms. In the Wald form, the test for a break at a fraction r/T through the sample is

F_T(r/T) = [SSR_{1,T} − (SSR_{1,r} + SSR_{r+1,T})] / [(SSR_{1,r} + SSR_{r+1,T})/(T − 2k)],  (5.3)
where SSR_{1,r} is the sum of squared residuals from the estimation of (5.1) on observations 1, …, r, etc. For fixed r/T, F_T(r/T) has an asymptotic χ_k² distribution under the null. When the break date is unknown, the situation is more complicated. One approach might be to estimate the break date, then compute (5.3) for that break. However, because the change point is selected by virtue of an apparent break at that point, the null distribution of the resulting test is not the same as if the break date were chosen without regard to the data. The means of determining r/T must be further specified before the distribution of the resulting test can be obtained.
A natural solution, proposed by Quandt (1960) for time series regression and extended by Davies (1977) to general models with parameters unidentified under the null, is to base inference on the LR statistic, which is the maximal F_T statistic over a range of break dates r₀, …, r₁. This yields the Quandt likelihood ratio (QLR) statistic,

QLR = max_{r = r₀, …, r₁} F_T(r/T).  (5.4)
Intuition suggests that this statistic will have power against a change in β even though the break date is unknown. The null asymptotic distribution of the QLR statistic remained unknown for many years. The FCLT and CMT, however, provide ready tools for obtaining this limit. The argument is sketched here; for details, see Kim and Siegmund (1989) and, for a quite general treatment of "sup tests" in nonlinear models, Andrews (1993b).

To obtain the limiting null distribution of the QLR statistic, let F̃_T(r/T) = SSR_{1,T} − (SSR_{1,r} + SSR_{r+1,T}) and use (5.1) to write

F̃_T(λ) = −v_T(1)′V_T(1)⁻¹v_T(1) + v_T(λ)′V_T(λ)⁻¹v_T(λ) + [v_T(1) − v_T(λ)]′[V_T(1) − V_T(λ)]⁻¹[v_T(1) − v_T(λ)],  (5.5)

where λ = r/T, v_T(λ) = T^{−1/2}Σ_{t=1}^{[Tλ]}X_{t−1}ε_t and V_T(λ) = T⁻¹Σ_{t=1}^{[Tλ]}X_{t−1}X_{t−1}′. Because ε_t is a martingale difference sequence, X_{t−1}ε_t is a martingale difference sequence. Additionally, assume throughout Section 5.1 that X_{t−1} has sufficiently limited dependence and enough moments for X_{t−1}ε_t to satisfy a multivariate martingale difference sequence FCLT, so v_T(·) ⇒ σ_εΣ_X^{1/2}W_k(·), where W_k is a k-dimensional standard Brownian motion. Also, recall that by assumption V_T(λ) →p λΣ_X uniformly in λ. By applying these two limits to the second expression in (5.5), one obtains

F̃_T(·) ⇒ σ_ε²F*(·),  (5.6)
where

F*(λ) = −W_k(1)′W_k(1) + W_k(λ)′W_k(λ)/λ + [W_k(1) − W_k(λ)]′[W_k(1) − W_k(λ)]/(1 − λ) = B_k(λ)′B_k(λ)/[λ(1 − λ)],

where B_k(λ) = W_k(λ) − λW_k(1) is a k-dimensional Brownian bridge.
Because F̃_T ⇒ σ_ε²F* and SSR_{1,T}/(T − k) →p σ_ε² under the null, (SSR_{1,r} + SSR_{r+1,T})/(T − 2k) →p σ_ε² uniformly in r. Thus F_T ⇒ F*. It follows from the continuous mapping theorem that the QLR statistic has the limiting representation

QLR ⇒ sup_{λ∈[λ₀,λ₁]} F*(λ),  (5.7)
where λ_i = lim_{T→∞} r_i/T, i = 0, 1. For fixed λ, F*(λ) has a χ_k² distribution. Andrews (1993b, Table I) reports asymptotic critical values of the functional in (5.7), computed by Monte Carlo simulation, for a range of trimming parameters and k = 1, …, 20. The critical values are much larger than the conventional fixed-break χ_k² critical values. For example, consider 5 percent critical values with truncation fractions (λ₀, λ₁) = (0.15, 0.85): for k = 1, the QLR critical value is 8.85, while the χ₁² value is 3.84; for k = 10, the QLR critical value is 27.03, while the χ₁₀² critical value is 18.3.

In practice the researcher must choose the trimming parameters r₀ and r₁. In some applications the approximate break date might be known and used to choose r₀ and r₁. Also, with nonnormal errors and small r₀ the fixed-r distribution of the F_T(r/T) statistic can be far from χ_k², so one way to control size is to choose r₀ sufficiently large, say r₀/T = 0.15 and r₁/T = 0.85.²⁹

The error process has been assumed to be serially uncorrelated. If it is serially correlated but uncorrelated with the regressors, then the distribution of the change-point test differs. In the case of a known break date, this problem is well studied and the Wald test statistic should be computed using an autocorrelation-consistent estimator of the covariance matrix; for recent work and discussion of the literature, see Andrews (1991) and Andrews and Monahan (1992). For the extension to break tests with unknown break dates, see Tang and MacNeill (1993).
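A sketch (not from the chapter) of how critical values of the limit (5.7) can be simulated: approximate W_k by normalized partial sums on a grid, form the Brownian bridge, and take the supremum of F*(λ) over the trimmed range; all names are illustrative.

import numpy as np

def qlr_limit_draws(k, n_rep=5000, n_grid=1000, lam0=0.15, lam1=0.85, seed=0):
    # Monte Carlo draws from sup of B_k(l)'B_k(l)/(l(1-l)) over [lam0, lam1]
    rng = np.random.default_rng(seed)
    grid = np.arange(1, n_grid + 1) / n_grid
    keep = (grid >= lam0) & (grid <= lam1)
    draws = np.empty(n_rep)
    for i in range(n_rep):
        W = np.cumsum(rng.standard_normal((n_grid, k)), axis=0) / np.sqrt(n_grid)
        B = W - np.outer(grid, W[-1])          # Brownian bridge, B(1) = 0
        num = (B[keep] ** 2).sum(axis=1)       # B_k'B_k on the trimmed grid
        draws[i] = (num / (grid[keep] * (1.0 - grid[keep]))).max()
    return draws

For example, np.quantile(qlr_limit_draws(1), 0.95) should be close to the value 8.85 cited above, up to simulation and discretization error.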
²⁹Functionals of F_T(λ) other than the supremum are possible. Examples include the average of F_T, perhaps over a restricted range, as studied by Andrews and Ploberger (1992) and Hansen (1990) [see Chernoff and Zacks (1964) and Gardner (1969) for historical precedents]. Andrews and Ploberger (1992) consider tests which maximize weighted average local asymptotic power, averaged over the unidentified nuisance parameters (here, the break date). The resulting family of optimal tests are weighted averages of exponentials, with the simple weighted average as the limit for nearby alternatives. The Andrews–Ploberger (1992) tests are reviewed in the chapter by Andrews in this Handbook.
The derivation of (5.6) assumes that T^{−1/2}Σ_{s=1}^{[Tλ]}X_{s−1}ε_s obeys an FCLT and that T⁻¹Σ_{s=1}^{[Tλ]}X_sX_s′ →p λΣ_X uniformly in λ. These assumptions hold if X_t contains a constant and/or I(0) regressors, but not if X_t is I(1). A sufficient condition for (5.6) not to hold is that the standard Chow test for fixed r/T does not have an asymptotic χ² distribution, since F*(λ) has a χ² distribution for any fixed λ. This will occur, in general, for I(1) regressors (although there are exceptions in cointegrating relations; see Watson's chapter in this Handbook) and, in these cases, the derivations must be modified; see Banerjee et al. (1992b), Chu and White (1992) and Hansen (1992) for examples.

In principle this approach can be extended to more than one break. A practical difficulty is that the computational demands increase exponentially in the number of breaks (all values of the two-break F-statistic need to be computed for break dates (r, s) over the range of r and s), which makes evaluating the limiting distributions currently difficult for more than two or three break dates. More importantly, positing multiple exogenous breaks raises the modeling question of whether the breaks are better thought of as stochastic or as the result of a continuous process. Indeed, this line of reasoning leads to a formulation in which the parameters change stochastically in each period by random amounts, which is the time-varying-parameter model discussed in Section 5.1.3.

A related problem is the construction of confidence intervals for the break date. A natural estimator of the break date is the Gaussian MLE λ̂, which is the value of λ∈(λ₀, λ₁) which maximizes the LR test statistic (5.3). The literature on inference about the break date is large and beyond the scope of this chapter, and we make only two observations. First, λ̂ is consistent for λ when the break magnitude is indexed to the sample size (γ = γ_T) and γ_T → 0, T^{1/2}γ_T → ∞ [Picard (1985), Yao (1987), Bai (1992)], although r̂ itself is not consistent. Second, it is possible to construct asymptotic confidence intervals for the break date, but this is not as straightforward as inverting the LR statistic using the QLR critical values, because the null for the LR statistic is no break, while the maintained hypothesis for the construction of a confidence interval is that a break exists. Picard's (1985) results can be used to construct confidence intervals for the break date by inverting a Wald-type statistic, an approach extended to time series regression with dependent errors by Bai (1992). Alternatively, finite-sample intervals can be constructed with sufficiently strong conditions on ε_t and strong conditions on X_t; see Siegmund (1988) and Kim and Siegmund (1989) for results and discussion.
5.1.2. Recursive coefficient estimates and recursive residuals

Another approach to the detection of breaks is to examine the sequence of regression coefficients estimated with increasingly large data sets, that is, to examine β̂(λ), the OLS estimator of β computed using observations 1, …, [Tλ]. These tests typically have been proposed without reference to a specific alternative, although the most commonly studied alternative is a single structural break. Related is Brown's et al.
(1975) CUSUM statistic, which rejects when the time series model systematically over- or under-forecasts y_t, more precisely, when the cumulative one-step-ahead forecast errors, computed recursively, are either too positive or too negative. The recursive coefficients and Brown's et al. (1975) recursive residuals and CUSUM statistic are, respectively, given by

β̂(λ) = [Σ_{t=1}^{[Tλ]} X_{t−1}X_{t−1}′]⁻¹ [Σ_{t=1}^{[Tλ]} X_{t−1}y_t],  (5.8)

w_t = [y_t − β̂((t − 1)/T)′X_{t−1}]/f_t,  (5.9)

CUSUM(λ) = (σ̂_w T^{1/2})⁻¹ Σ_{s=k+1}^{[Tλ]} w_s,  (5.10)
where σ̂_w = {T⁻¹Σ_{t=k+1}^T (w_t − w̄)²}^{1/2} and f_t = (1 + X_{t−1}′[Σ_{s=1}^{t−1}X_{s−1}X_{s−1}′]⁻¹X_{t−1})^{1/2} (this comes from noting that the variance of the one-step-ahead forecast error is σ_ε²f_t²). The CUSUM test rejects for large values of sup_{0≤λ≤1}|CUSUM(λ)/(1 + 2λ)|.

Because the recursive coefficients are evaluated at each point r, the distribution of the recursive coefficients differs from the usual distribution of the OLS estimator. The asymptotics readily obtain using the tools of Section 2. Under the null hypothesis β_t = β, the arguments leading to (5.6), applied here, yield
T^{1/2}(β̂(·) − β) = V_T(·)⁻¹v_T(·) ⇒ β*(·),  β*(λ) = σ_εΣ_X^{−1/2}W_k(λ)/λ,  (5.11)

[Ploberger et al. (1989), Lemma A.1]. For fixed λ, β*(λ) has the usual OLS asymptotic distribution. An immediate implication of (5.11) is that conventional "95 percent" confidence intervals, plotted as bands around the path of recursive coefficient estimates, are inappropriate, since those bands fail to handle simultaneous inference on the full plot of recursive coefficients. Combined with the CMT, (5.11) can be used to construct a formal test for parameter constancy based on recursive coefficients. An example is Ploberger's et al. (1989) "fluctuations" test [also see Sen (1980)], which rejects for large changes in the recursive coefficients, specifically when β̂(λ) − β̂(1) is large. From (5.11), note that T^{1/2}(β̂(λ) − β̂(1)) ⇒ σ_εΣ_X^{−1/2}[W_k(λ)/λ − W_k(1)] uniformly in λ. Because the full-sample OLS estimator σ̂_ε² is consistent under the null, B^{(T)}(λ) ≡ σ̂_ε⁻¹(T⁻¹Σ_{t=1}^T X_{t−1}X_{t−1}′)^{1/2} × λT^{1/2}(β̂(λ) − β̂(1)) ⇒ W_k(λ) − λW_k(1), uniformly in λ. This leads to Ploberger's et al. (1989) "fluctuations" test and its limiting representation under the null of parameter
constancy,

sup_{λ∈[0,1]} max_{1≤i≤k} |B_i^{(T)}(λ)| ⇒ sup_{λ∈[0,1]} max_{1≤i≤k} |B_{ki}(λ)|,  (5.12)

where B_i^{(T)} is the ith element of B^{(T)} and B_{ki} is the ith element of the k-dimensional Brownian bridge B_k.

The null distribution of the CUSUM test is also obtained by FCLT and CMT arguments. If X_{t−1} is strictly exogenous and ε_t is i.i.d. N(0, σ_ε²), then w_t is i.i.d. N(0, σ_ε²), so the FCLT and CMT imply

CUSUM(·) ⇒ W(·),  (5.13)
where W is a one-dimensional Brownian motion. The same limit obtains with general I(0) regressors and a constant, but the calculation is complicated and is omitted here; for the details, see Krämer et al. (1988), who prove (5.13) for time series regressions possibly including lagged dependent variables and for general i.i.d. errors (their i.i.d. assumption can be relaxed to the martingale difference assumption used here). Critical values for sup_λ|CUSUM(λ)/(1 + 2λ)| are obtained from results in Brownian motion theory; see Brown et al. (1975).

An important feature of the CUSUM statistic is that, as shown by Krämer et al. (1988), it has local asymptotic power only in the direction of the mean regressors: coefficient breaks of order T^{−1/2} on mean-zero stationary regressors will not be detected. This has an intuitive explanation. The cumulation of a mean-zero regressor will remain mean-zero (and will obey an FCLT) whether or not its true coefficient changes, while the nonzero mean of the cumulation of the constant implies that breaks in the intercept will result in systematically biased forecast errors.³⁰ This is both a limitation and an advantage, for rejection suggests a particular alternative (instability in the intercept or in the direction of the mean regressors). Several variants of the CUSUM statistic have been proposed. Ploberger and Krämer's (1992a) version, in which full-sample OLS residuals ε̂_t replace the recursive residuals w_t, is attractive because of its computational simplicity. Again, the distribution is obtained using the FCLT and CMT.

³⁰Consider the simplest case, in which y_t = ε_t under the null, while under the local alternative y_t = T^{−1/2}γ1(t > r)X_{t−1} + ε_t. Since β is known to equal zero, under the null (with γ = 0 imposed) the cumulated residuals process is just T^{−1/2}Σ_{s=1}^{[Tλ]}y_s. Under the local alternative, T^{−1/2}Σ_{s=1}^{[Tλ]}y_s = T^{−1/2}Σ_{s=1}^{[Tλ]}ε_s + γT⁻¹Σ_{s=r+1}^{[Tλ]}X_{s−1} ⇒ W(λ) + γ max(0, λ − r/T)EX_t. If EX_t is zero, the distribution is the same under the local alternative and the null; the test only has power in the direction of the mean vector EX_t. Estimation of β, as is of course done in practice, does not affect this conclusion qualitatively because the alternative is local. Also see Ploberger and Krämer (1990).
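A minimal sketch (not from the chapter) of the recursive residuals (5.9) and the CUSUM process (5.10); cusum_path is an illustrative name, with the rows of X again taken to be X_{t−1}.

import numpy as np

def cusum_path(y, X):
    # recursive residuals w_t of (5.9) for t = k+1,...,T and the CUSUM
    # process (5.10); the first k observations just identify beta
    T, k = X.shape
    w = np.empty(T - k)
    for i in range(k, T):
        b = np.linalg.lstsq(X[:i], y[:i], rcond=None)[0]   # beta through obs i
        Minv = np.linalg.inv(X[:i].T @ X[:i])
        f = np.sqrt(1.0 + X[i] @ Minv @ X[i])              # f_t of the text
        w[i - k] = (y[i] - X[i] @ b) / f                   # forecast error / f_t
    sig_w = np.sqrt(((w - w.mean()) ** 2).sum() / T)       # sigma_hat_w
    return np.cumsum(w) / (sig_w * np.sqrt(T))             # CUSUM(lambda)

The test rejects when the maximum over the path of |CUSUM(λ)/(1 + 2λ)|, with λ = t/T, exceeds the Brown et al. (1975) critical value.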
Their test statistic and its limiting null representation are

(σ̂_ε T^{1/2})⁻¹ max_{t∈[1,T]} |Σ_{s=1}^t ε̂_s| ⇒ sup_{λ∈[0,1]} |B₁(λ)|,  (5.14)
where B₁ is the one-dimensional Brownian bridge and the limit obtains using the FCLT and CMT. Other variants include Brown's et al. (1975) CUSUM-of-squares test based on w_t², and McCabe and Harrison's (1980) CUSUM-of-squares test based on OLS residuals. See Ploberger and Krämer (1990) for a discussion of the low asymptotic power of the CUSUM-of-squares test. See Deshayes and Picard (1986) and the bibliography by Hackl and Westlund (1989) for additional references.

If the regressors are I(1), the distribution theory for rolling and recursive tests changes, although it still can be obtained using the FCLT and CMT as it was throughout this chapter. See Banerjee et al. (1992b) for rolling and recursive tests with a single I(1) regressor, Chu and White (1992) for fluctuations tests in models with stochastic and deterministic trends, and Hansen (1992) for Chow-type (e.g. QLR) and LM-type [e.g. Nyblom's (1989) statistic] tests with multiple I(1) regressors in cointegrating equations. Also, the distribution of the CUSUM statistic changes if stochastically or deterministically trending regressors are included; see MacNeill (1978) and Ploberger and Krämer (1992b).

5.1.3. Tests against the time-varying-parameter model
A flexible extension of the standard regression model is to suppose that the regression coefficients evolve over time, specifically

y_t = β_t′X_{t−1} + ε_t,  β_t = β_{t−1} + ν_t,  Eν_tν_t′ = τ²G,  (5.15)
where ε_t and ν_t are uncorrelated and ν_t is serially uncorrelated. The formulation (5.15), of course, nests the standard linear regression model when τ² = 0. By setting ν_t = γ for t = r + 1 and ν_t = 0 for t ≠ r + 1, (5.15) nests the single-break model (5.2). The alternative of specific interest here, however, is when ν_t is i.i.d. N(0, τ²G) (where G is assumed to be known), so that the coefficient β_t follows a multivariate stochastic trend and thus evolves smoothly but randomly over the sample period. When combined with the additional assumption that ε_t is i.i.d. N(0, σ_ε²), this is referred to as the "time-varying-parameter" (TVP) model [see Cooley and Prescott (1976), Sarris (1973) and the reviews by Chow (1984) and Nicholls and Pagan (1985)]. Maximum likelihood estimation of the TVP model is a direct application of the Kalman filter (β_t is the unobserved state vector and y_t = β_t′X_{t−1} + ε_t is the measurement equation) and the estimation of β_t and its standard error under the alternative is well understood; see the chapter in this Handbook by Hamilton. We therefore focus on the problem of testing the null that τ² = 0.
The TVP model (5.15) nests, as a special case, the MA(1) model considered in Section 4. Setting X_t = 1 yields the unobserved components model (4.2), y_t = β₀ + u_t, where u_t = (β_t − β₀) + ε_t = Σ_{s=1}^t ν_s + ε_t. Thus the testing problem in the general TVP model can be seen as an extension of the unit MA root testing problem. Starting with Nyblom and Mäkeläinen (1983), several authors have studied the properties of locally most powerful tests of τ² = 0 against τ² > 0 in (5.15), or in models where only some of the coefficients are assumed to evolve over time (that is, where G has reduced rank); see for example King and Hillier (1985), King (1988), Nyblom (1989), Nabeya and Tanaka (1988), Leybourne and McCabe (1989), Hansen (1990), Jandhyala and MacNeill (1992) and Andrews and Ploberger (1992). [Also see Watson and Engle (1985), who consider tests against β_t following a stationary AR(1).] The treatment here follows Nyblom and Mäkeläinen (1983) and builds on the discussion in Section 4 of tests of the UC model.

To derive the LMPI test of τ² = 0 versus τ² > 0, suppose that X_{t−1} is strictly exogenous (although the asymptotics hold more generally). Under the TVP model, (5.15) can be rewritten as y_t = β₀′X_{t−1} + {(Σ_{s=1}^t ν_s)′X_{t−1} + ε_t}, where the term in curly brackets is an unobserved error. In standard matrix notation [Y denotes (y₁, …, y_T)′, X denotes (X₀, …, X_{T−1})′, etc.], the conditional distribution of Y, given X, is
$Y \sim N\big(X\beta_0,\; \sigma_\varepsilon^2\big[I_T + (\tau^2/\sigma_\varepsilon^2)V_T\big]\big), \qquad V_T = \Omega^* \odot (XGX'),$   (5.16)
where $\Omega^*_{ij} = \min(i, j)$ and $\odot$ denotes the Hadamard (elementwise) product. The testing problem is invariant to scale/translation shifts of the form $y \to ay + Xb$, so the most powerful invariant test against an alternative $\tau^2$ will be a ratio of quadratic forms involving $I_T + (\tau^2/\sigma_\varepsilon^2)V_T$. However, this depends on the alternative, so no uniformly most powerful test exists. One solution is to consider the LMPI test, which rejects for large values of $\hat e' V_T \hat e / \hat e' \hat e$, where $\{\hat e_t\}$ are the full-sample OLS residuals. Straightforward algebra shows that $T^{-2}\hat e' V_T \hat e = T^{-1}\sum_{s=1}^T S_T(s/T)' G S_T(s/T)$, where $S_T(\lambda) = T^{-1/2}\sum_{t=[T\lambda]+1}^T \hat e_t X_{t-1}$, which provides a simpler form for the test: reject if $T^{-1}\sum_{s=1}^T S_T(s/T)' G S_T(s/T)$ is large. Because this test and its limiting distribution depend on $G$, Nyblom (1989) suggested the simplification $G = (T^{-1}\sum_{t=1}^T X_{t-1}X_{t-1}')^{-1}$. Accordingly, the test rejects for large values of
$L = \hat\sigma_\varepsilon^{-2}\, T^{-1} \sum_{s=1}^T S_T(s/T)' \Big(T^{-1}\sum_{t=1}^T X_{t-1}X_{t-1}'\Big)^{-1} S_T(s/T).$   (5.17)
Conditional on $\{X_t\}$, the TVP model induces a heteroskedastic random walk into the error term, which is detected by $L$ using the cumulated products of the OLS residuals and the regressors. Nyblom (1989) derived the statistic (5.17) by applying local arguments to a likelihood for generally nonlinear, nonnormal models, and his general statistic
simplifies to (5.17) in the Gaussian linear regression model. If $X_t = 1$, (5.17) reduces to the LMPI test (4.7) of the i.i.d. null against the random walk alternative, or equivalently the test of the null of a unit MA root.³¹ Henceforth, we refer to (5.17) as the Nyblom statistic. The asymptotics of the Nyblom statistic follow from the FCLT and the CMT. As usual, $\varepsilon_t$ need not be i.i.d. normal and $X_t$ need not be strictly exogenous; rather, the weaker conditions following (5.1) are sufficient for the asymptotics. Under those weaker conditions, $\varepsilon_t X_{t-1}$ is a martingale difference sequence and, by the FCLT and CMT, $S_T(\cdot) \Rightarrow \sigma_\varepsilon \Sigma_X^{1/2} B_k^\mu(\cdot)$, where $B_k^\mu$ is a $k$-dimensional standard Brownian bridge. Because $T^{-1}\sum_{t=1}^T X_{t-1}X_{t-1}' \stackrel{p}{\to} \Sigma_X$ and $\hat\sigma_\varepsilon^2 \stackrel{p}{\to} \sigma_\varepsilon^2$, under the null hypothesis,
$L \Rightarrow \int_0^1 B_k^\mu(\lambda)' B_k^\mu(\lambda)\,\mathrm{d}\lambda.$   (5.18)
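Computationally, the Nyblom statistic is simple to assemble from full-sample OLS output. The sketch below follows the partial-sum formula for $S_T(s/T)$ and the simplification for $G$ given above; the function name and implementation details are illustrative rather than canonical.

```python
import numpy as np

def nyblom_L(y, X):
    """Sketch of Nyblom's L statistic (5.17) for a regression of y on X.

    S_T(s/T) is computed as the scaled sum of e_t * X_{t-1} over t > s,
    following the partial-sum formula in the text.
    """
    T = X.shape[0]
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat                     # full-sample OLS residuals
    sigma2 = e @ e / T                       # residual variance estimate
    scores = e[:, None] * X                  # rows are e_t * X_{t-1}
    csum = scores.cumsum(axis=0)
    S = (csum[-1] - csum) / np.sqrt(T)       # row s-1 holds S_T(s/T)
    Ginv = np.linalg.inv(X.T @ X / T)        # Nyblom's suggested G
    quad = np.einsum('si,ij,sj->s', S, Ginv, S)
    return quad.sum() / (T * sigma2)
```

Under the null, the resulting value would be compared with critical values computed from the limiting distribution in (5.18).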
The literature contains Monte Carlo results on the finite-sample power of the tests in Sections 5.1.1-5.1.3 against various alternatives. The range of alternatives considered is broad, and some preliminary conclusions have emerged. Many of the tests overreject in moderately large samples ($T = 100$) when asymptotic critical values are used. This is exacerbated if errors are nonnormal and, especially, if autoregressions have large autoregressive parameters [QLR and related tests; see Diebold and Chen (1992)]. In their Monte Carlo study of the QLR test, exponentially averaged F-tests, the CUSUM test and several other tests against alternatives of one-time breaks and random walk coefficients, Andrews et al. (1992) found that in general the weighted exponential tests performed well, and often the QLR and Nyblom tests performed nearly as well. For additional results, see Garbade (1977) and the references in Hackl and Westlund (1989, 1991).

5.1.4. Empirical application: stability of the money-output relation
At least since the work of Friedman and Meiselman (1963), one of the long-standing empirical problems in macroeconomics has been whether money has a strong and stable link to aggregate output; for a discussion and recent references, see Friedman and Kuttner (1992). Since their introduction to this literature by Sims (1972, 1980), Granger-causality tests and vector autoregressions have provided the workhorse machinery for quantifying the strength and direction of these relations in nonstructural time series models (see the chapter by Watson in this Handbook for a discussion of vector autoregressions). But for such empirical models to be useful guides for monetary policy they must be stable, and the tests of this section can play a useful role in assessing their stability. Of particular importance is whether one of the several monetary aggregates is more stably related to output than the others.
³¹To show this, rewrite $S_T(s/T)$ using the identity that the mean OLS residual is zero.
Table 7
Tests for structural breaks and time-varying parameters in the money-output relation
(Dependent variable: nominal GDP growth) (Estimation period: quarterly, 1960:2 to 1992:2).ᵃ

                                F-tests on coefficients on:
     M      r      R̄²       M         r        QLR        P-K CUSUM   Nyblom L
1    Base          0.153    3.85**             40.23***   1.43**      2.83**
2    Base   R-90   0.178    1.43      2.50*    54.15***   1.36**      2.80
3    M1            0.140    3.17**             32.57***   1.22*       1.90
4    M1     R-90   0.179    1.50      2.87**   51.02***   1.35**      2.85
5    M2            0.221    7.60***            20.82      0.73        1.31
6    M2     R-90   0.332    8.40***   3.19**   25.44      1.05        1.53
ᵃAll regressions include 3 lags each of the nominal GDP growth rate, GDP inflation and the growth rate of the monetary aggregate. The M column specifies the monetary aggregate. The r column indicates whether the 90-day U.S. Treasury bill rate is included in the regression. If the interest rate is included, it is included in differences (3 lags) and one lag of an error-correction term from a long-run money demand equation is also included. The R̄² is the usual OLS adjusted R². The F-tests are Wald tests of the hypothesis that the coefficients on the indicated variable are zero; the restriction that the error-correction term (when present) has a zero coefficient is included in the Wald test on the monetary aggregate. QLR is the Quandt (1960) likelihood ratio statistic (5.4) with symmetric 15 percent trimming; P-K CUSUM is the Ploberger-Krämer (1992a) CUSUM statistic (5.14); and the final column reports the Nyblom (1989) L statistic (5.17). Break test critical values were taken from published tables and/or were computed by Monte Carlo simulation of the limiting functionals of Brownian motion, as described in Section 2.3. Tests are significant at the *10 percent; **5 percent; ***1 percent level.
Table 7 presents regression summary statistics and three tests for parameter stability in typical money-output regressions for three monetary aggregates, the monetary base, M1, and M2, over the period 1960:2-1992:2. The results are taken from Feldstein and Stock (1994), to which the reader is referred for additional detail. Based on preliminary unit root analysis, log GDP, the log GDP deflator, log money and the 90-day U.S. Treasury bill rate are specified as having a single unit root, so that the GDP growth rate, GDP inflation, the money growth rate and the first difference of the interest rate are used in the regressions. Drawing on the cointegration evidence in Hoffman and Rasche's (1991) and Stock and Watson's (1993) studies of long-run money demand, in the models including the interest rate we model log money, log output and the interest rate as being cointegrated, so that the equations include an error-correction term, the cointegrating residual. The long-run cointegrating equation was estimated by imposing a unit long-run income elasticity and estimating the interest semi-elasticity using the Saikkonen (1991)/Phillips-Loretan (1991)/Stock-Watson (1993) "dynamic OLS" efficient estimator.³²

³²All data were taken from the Citibase data base. The hypothesis of two unit roots was rejected at the 5 percent level for each series (demeaned case) using DF-GLS tests with AR(BIC), and the tests failed to reject a single unit root at the 10 percent level (detrending for each variable except interest rates, for which the demeaned statistics were used), except for the interest rate, which rejected at the 10 percent but not 5 percent level. The 95 percent asymptotic confidence intervals, computed as in Stock (1991) by inverting the Dickey-Fuller $\hat\tau^\tau$ statistic ($\hat\tau^\mu$ for interest rates) as described in Section 3.3, for the largest autoregressive roots are: log M1, (0.821, 1.026); log M2, (0.998, 1.039); log base, (0.603, 0.882); 90-day T-bill rate, (0.838, 1.015); log GDP, (0.950, 1.037); GDP inflation, (0.876, 1.032). The results are robust to using the AR(BIC) selector with 3 < p < 8, as in the Monte Carlo simulations, except that the M2 confidence interval rises to (1.011, 1.040). For consistency, all monetary aggregates are specified in growth rates [but see Christiano and Ljungqvist (1988) and Stock and Watson (1989)]. These results leave room to argue that inflation should be entered in changes, but for comparability with other specifications in the literature inflation itself is used. There is some ambiguity about the treatment of interest rates, but to be consistent with recent investigations of long-run money demand they are treated here as I(1). The evidence on cointegration involves statistics not covered in this chapter and the reader is instead referred to Hoffman and Rasche (1991) and Stock and Watson (1993).
The main conclusions are insensitive to empirically plausible changes in the unit root specifications of interest rates and money; in particular, see Konishi et al. (1993) for F-statistics in specifications where the interest rate is assumed stationary. The Granger-causality test results indicate that including the interest rate makes base money and M1 insignificant, although M2 remains significant (this is partly due to the error-correction term). The QLR test rejects the null hypothesis of parameter stability at the 1 percent level in all specifications including base money or M1; the L-statistic rejects in the base-only specification; and the Ploberger-Krämer (1992a) CUSUM based on OLS residuals rejects in the base and M1 specifications. The hypothesis of stability is thus strongly rejected for the base-output and M1-output relations. The evidence against stability is much weaker for the M2-output relation; none of the stability tests reject at the 10 percent level. Once changes in velocity are controlled for by including the error-correction term in regression 6, both M2 and the interest rate enter significantly and there is no evidence of instability. As with any empirical investigation, some caveats are necessary: these results are based on only a few specifications, and stability in this sample is no guarantee of stability in the future. Still, these results suggest that, of the base, M1, and M2, only M2 had a stable reduced-form relationship with output over this period.
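For readers unfamiliar with the "dynamic OLS" estimator mentioned above, the following is a stylized sketch of the idea: regress the levels variable on the regressor plus leads and lags of the regressor's first difference, and read off the coefficient on the level as the long-run (cointegrating) coefficient. The lead/lag counts and variable names are illustrative assumptions and do not reproduce the specification used in the study.

```python
import numpy as np

def dynamic_ols(y, x, n_leads=2, n_lags=2):
    """Stylized dynamic OLS sketch: regress y_t on [1, x_t] plus leads
    and lags of dx_t; the coefficient on x_t estimates the long-run
    (cointegrating) slope.  Lead/lag counts here are illustrative."""
    dx = np.diff(x)                      # dx[i] = x[i+1] - x[i]
    T = len(y)
    rows, deps = [], []
    for t in range(n_lags + 1, T - n_leads):
        row = [1.0, x[t]]
        # dx[t - 1 + j] is the first difference dated t + j
        row += [dx[t - 1 + j] for j in range(-n_lags, n_leads + 1)]
        rows.append(row)
        deps.append(y[t])
    Z, yy = np.array(rows), np.array(deps)
    coef, *_ = np.linalg.lstsq(Z, yy, rcond=None)
    return coef[1]                       # long-run coefficient on x_t
```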
5.2. Trend breaks and tests for autoregressive unit roots

5.2.1. The trend-break model and effects of misspecifying the trend
Rappoport and Reichlin (1989) and Perron (1989a, 1990b) argued that a plausible model for many economic variables is stationarity around a time trend with a break, and that autoregressive unit root tests based on linear detrending as discussed in Section 3 have low power against this alternative. Two such broken-trend specifications are
(Shift in mean)   $d_t = \beta_0 + \beta_1 1(t > r),$   (5.19)

(Shift in trend)   $d_t = \beta_0 + \beta_1 t + \beta_2 (t - r)1(t > r),$   (5.20)
where $1(t > r)$ is a dummy variable which equals one for $t > r$ and zero otherwise.³³ For conciseness, attention here is restricted to the trend-shift model (5.20), a model suggested by Perron (1989a) and Rappoport and Reichlin (1989) for real GNP. It was emphasized in Section 3.2.5 that if the trend is misspecified then unit root tests can be misleading. This conclusion applies here as well. Suppose that (5.20) is correct and $r/T \to \delta$, $\delta$ fixed, $0 < \delta < 1$, but that statistics are computed by linear detrending. Then the power of the unit root test tends to zero against fixed alternatives. The intuition is simple: if a linear time trend is fitted to an I(0) process around a piecewise-linear trend, then the residuals will be I(0) around a mean-zero, V-shaped trend. These residuals have variances that grow large (with $T$) at the start and end of the sample, and standard tests will classify the residuals as having a unit root.³⁴ In the mean-shift case, Dickey-Fuller unit root tests are consistent but have low power if the mean shift is large [Perron (1989a); for Monte Carlo evidence, Hendry and Neale (1991)]. See Campbell and Perron (1991) for further discussion. This troubling effect of trend misspecification raises the question of how to test for AR unit roots in the presence of a possibly broken trend.
5.2.2. Unit root tests with broken trends
If the break date is known a priori, as Perron (1989a, 1990b) and Rappoport and Reichlin (1989) assumed, then detrending can be done by correctly specified OLS, and the asymptotic distribution theory is obtained using a straightforward extension of Sections 2 and 3. However, as Christiano (1992) and, subsequently, Banerjee et al. (1992b), Perron and Vogelsang (1992) and Zivot and Andrews (1992) pointed out, the assumption that the break date is data-independent is hardly credible in macroeconomic applications. For example, in Perron's applications to
³³Under the null hypothesis of a unit root, the mean-shift model is equivalent to assuming that there is a single additive outlier in $v_t$ at time $r + 1$, since, under the null hypothesis, (3.1) and (5.19) imply $\Delta y_t = v_t + \beta_1 1(t = r + 1)$. A third trend model is Perron's (1989a) "model C," with both a mean and a trend shift.

³⁴To show this, consider the AR(1) case, so that $\gamma_v(0) = \sigma^2$, and the Dickey-Fuller root test, $T(\hat\alpha - 1)$. If the trend is, in fact, given by (5.20), then the detrended process is $y_t^d = u_t + \tilde d_t$, where $\tilde d_t = (\beta_0 - \tilde\beta_0) + (\beta_1 - \tilde\beta_1)t + \beta_2(t - r)1(t > r)$. For $r/T \to \delta$, $\delta$ fixed, if $u_t$ is I(0) then straightforward but tedious calculations show that the scaled detrended process has the deterministic limit, $T^{-1}y_{[T\lambda]}^d \to \tilde\mu_0 + \tilde\mu_1\lambda + \beta_2(\lambda - \delta)1(\lambda > \delta)$, uniformly in $\lambda$, where $\tilde\mu_0$ and $\tilde\mu_1$ are nonrandom functions of $\beta_0, \beta_1, \beta_2$ and $\delta$. It follows that $T(\hat\alpha - 1) \to g(\delta)$, where $g$ is nonrandom. An explicit expression for $g(\delta)$ is given in Perron (1989a, Theorem 1(b)). Perron (1989a) shows that $g(\delta)$ is in the acceptance region of the detrended DF root test, so, asymptotically, the null is incorrectly accepted with probability one.
GNP, the break dates were chosen to be in the Great Depression and the 1973 oil price shock, both of which are widely recognized as having important and lasting effects on economic activity. Thus the problem becomes testing for a unit root when the break dates are unknown and determined from the data. Two issues arise here: devising tests which control for the nuisance parameters, in particular the unknown break date, and, among such tests, finding the most powerful. To date, research has focused on the first of these topics. There is little work which addresses this problem starting from the theory of optimal tests, and this is not pursued here.³⁵

The procedures in the literature for handling the unknown date of the trend break are based on a modified Dickey-Fuller test. To simplify the argument, consider the AR(1) case, so that $\omega^2 = \gamma_v(0)$. Then, as suggested by Christiano (1992), Banerjee et al. (1992b) and Zivot and Andrews (1992), one could test for a unit root by examining the minimum of the sequence of Dickey-Fuller t-statistics, constructed by first detrending the series by OLS using (5.20) for $r$ over the range $r_0, \ldots, r_1$:

$\hat t_{\min}^{DF} = \min_{\delta \in [\delta_0, \delta_1]} \hat\tau^d(\delta),$   (5.21)
where

$\hat\tau^d(\delta) = \dfrac{T^{-1}\sum_{t=2}^T \Delta y_t^d(\delta)\, y_{t-1}^d(\delta)}{\big[(\hat\sigma^d)^2(\delta)\; T^{-2}\sum_{t=2}^T (y_{t-1}^d(\delta))^2\big]^{1/2}},$

where $y_t^d(\delta) = y_t - z_t(\delta)'\hat\beta(\delta)$, where $\hat\beta(\delta) = [\sum_{t=1}^T z_t(\delta)z_t(\delta)']^{-1}[\sum_{t=1}^T z_t(\delta)y_t]$, $z_t(\delta) = [1, t, (t - [T\delta])1(t > [T\delta])]'$ and $(\hat\sigma^d)^2(\delta)$ is the sample variance of the residual from the regression of $y_t^d(\delta)$ onto $y_{t-1}^d(\delta)$. Just as the null distribution of the QLR statistic differs from the distribution of the fixed-date F-statistics, the null distribution of $\hat t_{\min}^{DF}$ differs from the distribution of $\hat\tau^d$ for a fixed break point. The approach to obtaining the null distribution is similar, namely to obtain a limiting representation for the sequence of statistics $\hat\tau^d(\delta)$, uniformly in $\delta$. Relative to the QLR statistic, this entails an additional complication, because under the null the broken-trend detrended process will be I(1). This leads to limit results for elements of $D[0,1] \times D[0,1]$. While no new tools are needed for these calculations, they are tedious and notationally cumbersome, and the reader is referred to the articles by Banerjee et al. (1992b) and Zivot and Andrews (1992) for different derivations of the same limiting representation. Not surprisingly, the critical values of the minimal DF statistic are well below the critical values of the usual linearly detrended statistic; for example, with symmetric 15 percent trimming, the one-sided 10 percent asymptotic critical value is approximately -4.13 [Banerjee et al. (1992b, Table 2)], compared with -3.12 in the linearly detrended case.
³⁵Elliott et al. (1992) show that the asymptotic Gaussian power envelope in the mean-shift model (5.19) with $\beta_1$ fixed equals the no-detrending power envelope plotted in Figure 1.
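The following sketch illustrates the mechanics of (5.21) in the simplest AR(1) case: for each candidate break fraction, detrend by OLS on $z_t(\delta)$, compute the Dickey-Fuller root t-statistic on the detrended series, and minimize over break dates. It is a schematic under those assumptions (no lag augmentation, illustrative trimming), not a substitute for the published procedures and critical values.

```python
import numpy as np

def min_df_stat(y, trim=0.15):
    """Sketch of the sequential statistic (5.21): OLS detrending on
    [1, t, (t - r) * 1(t > r)] for each break date r, then the DF
    t-statistic from regressing Delta y^d on y^d lagged once."""
    T = len(y)
    t_idx = np.arange(1, T + 1)
    stats = []
    for r in range(int(trim * T), int((1 - trim) * T) + 1):
        z = np.column_stack([np.ones(T), t_idx,
                             np.maximum(t_idx - r, 0)])  # broken trend
        b, *_ = np.linalg.lstsq(z, y, rcond=None)
        yd = y - z @ b                                   # detrended series
        ylag, dy = yd[:-1], np.diff(yd)
        rho = (ylag @ dy) / (ylag @ ylag)                # DF regression
        resid = dy - rho * ylag
        s2 = resid @ resid / (len(dy) - 1)
        stats.append(rho / np.sqrt(s2 / (ylag @ ylag)))  # t-statistic
    return min(stats)
```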
5.2.3. Finite-sample size and power
There are fewer Monte Carlo studies of the broken-trend and broken-mean unit root statistics than of the linearly detrended case, perhaps in part because the additional minimization dramatically increases the computational demands. Nonetheless, the results of Hendry and Neale (1991), Perron and Vogelsang (1992), Zivot and Andrews (1992) and Banerjee et al. (1992b) provide insights into the performance of the tests. The finite-sample distributions are sensitive to the procedures used to determine the lag length in the augmented DF regression, and the null distributions depend on the nuisance parameters even though the tests are asymptotically similar. Typically, the asymptotic critical values are too small; that is, the sizes of the tests exceed their nominal level. The extent of the distortion depends on the actual values of the nuisance parameters. Zivot and Andrews (1992) examined size distortions by Monte Carlo study of ARIMA models estimated using the Nelson-Plosser (1982) U.S. data set; for the mean-shift model (5.19), the finite-sample 10 percent critical values were found to fall in the range -4.85 to -5.05, while the corresponding asymptotic value is -4.58; for each series, tests of asymptotic level 2.5 percent rejected between 5 percent and 10 percent of the time. Perron and Vogelsang (1992) found larger rejection rates under the null when there is more negative serial correlation than present in the Zivot-Andrews simulations.

The Monte Carlo evidence confirms the view that the finite-sample power of the unit root tests is reduced by trend- or mean-shift detrending, in the sense that if the true trend is linear then introducing the additional break point reduces power. The extent of this power reduction, however, depends on the nuisance parameters and, in any event, if the broken-trend specification is correct then broken-trend detrending is necessary. The more relevant comparison is across different procedures which entail broken-trend detrending, but only limited results are available [see Perron and Vogelsang (1992) for some conclusions comparing four Dickey-Fuller-type tests with different lag length selection procedures].
5.2.4. Conclusions and practical implications
Although the research on trend-break unit root tests is incomplete, it is possible to draw some initial conclusions. On a practical level, the size distortions found in the demeaned and linearly detrended cases in Section 3 appear, if anything, to be more severe in the broken-trend case, and the power of the tests also deteriorates. One can speculate that this reflects a dwindling division between the I(1) model and other competing representations; were the trend shifts I(0) and occurring every period, then the extension of (5.20) would deliver an I(2) model for $y_t$.
A useful way to summarize the broken-trends literature is to return to our original four motivating objectives for analyzing unit roots. As a matter of data description, Perron's (1989a, 1990b) and Rappoport and Reichlin's (1989) analyses demonstrate that the broken-trend models deliver very different interpretations from conventional unit root models, emphasizing the importance of a few irregularly occurring events in determining the long-run path of aggregate variables; this warrants continued research in this area. The practical implications concerning the remaining three objectives remain largely unexplored. From a forecasting perspective, if the single-break model is taken as a metaphor for multiple irregular breaks, then one must be skeptical that out-of-sample forecasts will be particularly reliable, since another break could occur. Equally importantly, for this reason treating the break as a one-time nonrandom event presumably leads to understating the uncertainty of multistep forecasts. Little is currently known about the practical effect of misspecifying trend breaks in subsequent multivariate modeling, although the asymptotic theory of inference in vector autoregressions (VARs) with unit roots and cointegration analysis discussed in Watson's chapter in this Handbook must be modified if there are broken trends. Finally, the link between these trend-break models and economic theory is undeveloped. In any event, the statistical difficulties with inference in this area do not make one optimistic that trend-break models will sharply distinguish among economic theories, however capable they are of producing suggestive stylized facts.
6. Tests of the I(1) and I(0) hypotheses: links and practical limitations
Sections 3, 4 and 5.2 focused on inference in the I(1) and I(0) models. When inference is needed about the order of integration of a series, sometimes there is no compelling a priori reason to think that one or the other of these models is the best starting point; rather, the models might best be treated symmetrically. In this light, this section addresses three topics. Section 6.1 examines some formal links between the I(1) and I(0) models. Section 6.2 summarizes some recent work taking a different approach to these issues, in which the determination of whether a series is I(0) or I(1) is recast as a classification problem, so that the tools of Bayesian analysis and statistical decision theory can be applied. Section 6.3 then raises several practical difficulties which arise in the interpretation of both these Bayesian classification schemes and classical unit-root hypothesis tests, in light of the size distortions coupled with low power of the tests studied in the Monte Carlo experiments of Sections 3 and 4.
6.1. Parallels between the I(0) and I(1) testing problems
The historical development of tests of the I(0) and I(1) hypotheses treated the issues as conceptually and technically quite different. To a large extent, these differences
are artificial, arising from their ARMA parameterizations. Since an integrated I(0) process is I(1), a test of the I(0) null against the I(1) alternative is, up to the handling of initial conditions, equivalent to a test of the I(1) null against the I(2) alternative. In this sense, the tests of the previous sections can both be seen as tests of the I(1) null: on the one hand, against I(0) and, on the other hand, against I(2). What is interesting is that this reinterpretation is valid not just on a heuristic level but also on a technical level. To make this precise, consider the case $d_t = \beta_0$, $v_t = \varepsilon_t$. The LMPIU test of the unit MA root in (4.1) rejects for large values of the Nyblom-Mäkeläinen (1983) statistic

$L^\mu = \dfrac{T^{-2}\sum_{t=1}^T \big(\sum_{s=1}^t y_s^\mu\big)^2}{T^{-1}\sum_{t=1}^T (y_t^\mu)^2},$   (6.1)

where $y_t^\mu = y_t - \bar y$. If instead the null hypothesis is that $u_t$ is a Gaussian random walk and the alternative is that $u_t$ is an AR(1) with $|\alpha| < 1$, then one could test this hypothesis by rejecting for small values of the demeaned Sargan-Bhargava statistic

$\hat R^\mu = \dfrac{T^{-2}\sum_{t=1}^T (y_t^\mu)^2}{T^{-1}\sum_{t=2}^T (\Delta y_t^\mu)^2}.$   (6.2)
The $L^\mu$ statistic rejects if the mean square of the I(1) process, the cumulation of $y_t^\mu$, is large, while $\hat R^\mu$ rejects if the mean square of the I(1) process, $y_t^\mu$, is small. Both tests can be seen as tests of the I(1) null but, respectively, against the I(2) and I(0) alternatives.
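In sample terms, both statistics are essentially one-liners. A minimal sketch for the demeaned case follows; the function names are illustrative.

```python
import numpy as np

def L_mu(y):
    """Nyblom-Makelainen statistic (6.1): large values indicate that the
    cumulation of the demeaned series behaves like an I(2) process."""
    ymu = y - y.mean()
    S = ymu.cumsum()
    T = len(y)
    return (S @ S / T**2) / (ymu @ ymu / T)

def R_mu(y):
    """Demeaned Sargan-Bhargava statistic (6.2): small values indicate
    that the demeaned series looks I(0) rather than I(1)."""
    ymu = y - y.mean()
    dy = np.diff(ymu)
    T = len(y)
    return (ymu @ ymu / T**2) / (dy @ dy / T)
```

Large values of the first and small values of the second reject, respectively, the I(0) and I(1) nulls, mirroring the duality described above.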
6.2. Decision-theoretic classification schemes
A standard argument for using conventional hypothesis tests is that the researcher has a particular reason for wishing to control the Type I error rate. While this might be appropriate in some of the applications listed in Section 1, in others, such as forecasting, the ultimate objective of the empirical analysis is different, and classical hypothesis tests are not necessarily the best tools to achieve those objectives. In such cases, the researcher might rather be interested in having a procedure which will deliver consistent inference, in the sense that the probability of correctly classifying a process as I(1) or I(0) asymptotically tends to one; that is, the probabilities of both Type I and Type II errors tend to zero.
In theory, this can be achieved by using a sequence of critical values which tend to $-\infty$ as an appropriate function of the sample size. To be concrete, suppose that the researcher computed the Dickey-Fuller t-statistic $\hat t$, and evaluated it using critical values $b_T$. If the null is true, then $\Pr[\hat t < b_T \mid \mathrm{I}(1)] \to 0$ for any sequence $b_T \to -\infty$, so that the probability of correctly concluding that the process is I(1) tends to one. Similarly, it is plausible that, for a suitable choice of $b_T$, if the process is truly I(0) then $\Pr[\hat t < b_T \mid \mathrm{I}(0)] \to 1$ and the Type II error rate tends to zero. For such a choice of $b_T$, this would be a consistent classification scheme. Because the Dickey-Fuller t-statistic tends to $-\infty$ at the rate $T^{1/2}$ under a fixed alternative, one candidate for $b_T$ is $b_T = -k_0 - k_1 \ln T$ for some positive constants $(k_0, k_1)$. Thus the rule is:

Classify $y_t$ as I(0) if $\hat t < -k_0 - k_1 \ln T$   (6.3)
and, otherwise, classify $y_t$ as I(1). The problem with this scheme is that, in practice, the researcher is left to choose $k_0$ and $k_1$. Because the sample size is, of course, fixed in an actual data set, the conceptual device of choosing this sequence is artificial and the researcher is left with little practical guidance. One solution is to frame this as a classification or decision-theoretic problem and to apply Bayesian techniques. In this context, an observed series is classified as I(0) or I(1) based on the posterior odds ratio $\Pi_T$, which we write heuristically as
$\Pi_T = \dfrac{\pi_1}{\pi_0} B_T,$

where

$B_T = \dfrac{\Pr[(y_1, \ldots, y_T) \mid \mathrm{I}(1)]}{\Pr[(y_1, \ldots, y_T) \mid \mathrm{I}(0)]},$   (6.4)
where $\pi_1$ and $\pi_0$ are prior weights that the series is I(1) and I(0), respectively, and where $B_T$ is the Bayes ratio. If $\Pi_T > 1$, then the posterior odds favor the I(1) model and the series is classified as I(1). Although (6.4) appears simple, numerous subtleties are involved in its evaluation, and addressing these subtleties has spawned a large literature on Bayesian approaches to autoregressive unit roots; see in particular Sims (1988), Schotman and van Dijk (1990), Sims and Uhlig (1991), DeJong and Whiteman (1991a, 1991b), Diebold (1990), Sowell (1991) and the papers by Phillips (1991a) and his discussants in the special issue of the Journal of Applied Econometrics (October-December, 1991). In most cases, implementations of (6.4) have worked within specifications which require placing explicit priors over key continuous parameters, such as the largest autoregressive root. The proposed priors differ considerably and can imply substantial differences in empirical inferences [see the review by Uhlig (1992)]. Because of this dependence on priors, and given space limitations, no attempt will be made here to summarize this literature. Instead, we briefly discuss two recent approaches, by Phillips and Ploberger (1991) and Stock (1992), which provide simple ways to evaluate the posterior odds ratio (6.4) and which avoid explicit integration over priors on continuous parameters. These procedures require only
that the researcher place priors $\pi_0$ and $\pi_1 = 1 - \pi_0$ on the respective hypotheses "I(0)" and "I(1)".

Phillips and Ploberger (1991) derive their procedure from a consideration of the likelihood ratio statistic in the AR(1) model, and obtain the rule:

Classify $y_t$ as I(0) if $\hat t^2 > \ln\Big[\Big(\dfrac{\pi_1}{\pi_0}\Big)^2\Big(1 + \dfrac{\sum_{t=1}^T y_{t-1}^2}{\hat\sigma_\varepsilon^2}\Big)\Big]$   (6.5)
and, otherwise, classify $y_t$ as I(1), where $\hat t$ is the Dickey-Fuller t-statistic.³⁶ The expression (6.5) bears considerable similarity to (6.3): a unit AR root is rejected based on the Dickey-Fuller t-statistic, with a critical value that depends on the sample size. The difference here is that the critical value is data-dependent: if $y_t$ is I(1), the "critical value" will be $2\ln T + O_p(1)$, while if $y_t$ is I(0), it will be $\ln T + O_p(1)$. As Phillips and Ploberger (1992) point out, this procedure can be viewed as an extension to the I(1) case of the BIC model selection procedure, where the issue is whether to include or to exclude $y_{t-1}$ as a regressor in the DF regression (3.9). The procedure is also closely related to the predictive least squares principle; see Wei (1992).

Another approach is to evaluate the Bayes factor in (6.4) directly, using a reduced-dimensional statistic rather than the full data set. Suppose that $\phi_T$ is a statistic which is informative about the order of integration, such as a unit root test statistic; then the expression for the Bayes factor in (6.4) could be replaced with
$B_T^* = \dfrac{\Pr[\phi_T \mid \mathrm{I}(1)]}{\Pr[\phi_T \mid \mathrm{I}(0)]}.$   (6.6)
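As a schematic, the two classification rules (6.3) and (6.5) can be written as short functions of a computed Dickey-Fuller t-statistic. The constants, the default prior and the naive innovation-variance estimate below are illustrative assumptions, and the data-dependent threshold mirrors the reconstructed expression in (6.5).

```python
import numpy as np

def classify_rule_63(df_t, T, k0=2.0, k1=0.5):
    """Rule (6.3): k0, k1 are user-chosen positive constants
    (illustrative defaults); the scheme is consistent but offers no
    finite-sample guidance on their choice."""
    return "I(0)" if df_t < -k0 - k1 * np.log(T) else "I(1)"

def classify_rule_65(df_t, y, pi1=0.5, sigma2_eps=None):
    """Rule (6.5): data-dependent threshold; pi1 is the prior weight
    on I(1), sigma2_eps a (naive, illustrative) innovation-variance
    estimate from first differences."""
    if sigma2_eps is None:
        sigma2_eps = np.var(np.diff(y))
    pi0 = 1.0 - pi1
    ylag = y[:-1]
    threshold = np.log((pi1 / pi0) ** 2 * (1.0 + ylag @ ylag / sigma2_eps))
    return "I(0)" if df_t ** 2 > threshold else "I(1)"
```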
The approach in Stock (1992) is to construct a family of statistics which have limiting distributions which, on the one hand, do not depend on nuisance parameters under either the I(1) or I(0) hypothesis but, on the other hand, diverge, depending on which hypothesis is true. The results in the previous sections can, in fact, be used to construct such statistics. Consider the process $V_T$ defined in (4.10), and consider the no-deterministic case. If $y_t$ is a general I(0) process then $V_T \Rightarrow W$. On the other hand, if $y_t$ is a general I(1) process then $N_T^{-1/2}V_T \Rightarrow V^*$, where $V^*$ is defined in (4.11) and $N_T$ is the normalization given there. In either case the limiting representation of $V_T$ does not depend on any nuisance parameters. To make this concrete, consider the statistic $\phi_T = \ln L = \ln\{T^{-1}\sum_{t=1}^T V_{T,t-1}^2\}$. Then, for $u_t$ a general I(0) process, from (2.9) (in the I(0) case) and (4.11) (in the I(1) case), $\phi_T$ has the limiting representations
if I(0), $\quad \phi_T \Rightarrow \ln\Big(\displaystyle\int_0^1 W(\lambda)^2\,\mathrm{d}\lambda\Big),$   (6.7a)

³⁶Phillips and Ploberger's (1991) formula has been modified for an estimated variance as in Phillips (1992b).
if I(1), $\quad \phi_T - \ln N_T \Rightarrow \ln\Big(\displaystyle\int_0^1 V^*(\lambda)^2\,\mathrm{d}\lambda\Big).$   (6.7b)
The limiting distributions under the I(0) and I(1) models can be computed numerically from (6.7a) and (6.7b), respectively, which in turn permits the numerical evaluation of the Bayes factor (6.6) based on this statistic.

It must be stressed that, although consistent decision-theoretic procedures such as these have both theoretical and intuitive appeal, they have properties which empirical researchers might find undesirable. One is that these procedures will consistently classify local-to-I(1) processes as I(1) rather than I(0), and local-to-I(0) processes as I(0) rather than as I(1). That is, if $y_t$ is local-to-I(1) with local parameter $\alpha = 1 + c/T$, then, as the sample size increases, this process will be classified as I(1) with probability increasing to one, even though along the sequence it is always an I(0) process [see Elliott and Stock (1994) for details]. More generally, because these procedures can have large misclassification rates in finite samples (loosely, their size can be quite large), care must be taken in interpreting the results. Initial empirical applications [Phillips (1992b)] and Monte Carlo simulations [Elliott and Stock (1994), Stock (1992)] suggest that, for some applications such as forecasting and pretesting, these approaches are promising. To date, however, the investigation of the sampling properties of these and alternative procedures, and in particular the effect of their use in second-stage procedures, is incomplete. It would be premature to make concrete recommendations for empirical practice.
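The numerical evaluation referred to above can be done by direct simulation of the limiting functionals. For example, a minimal Monte Carlo sketch of the I(0) limit in (6.7a) follows; the grid size and replication count are illustrative.

```python
import numpy as np

# Simulate ln( integral_0^1 W(s)^2 ds ) by discretizing standard
# Brownian motion; under I(0), phi_T converges to this distribution.
rng = np.random.default_rng(1)
n_steps, n_reps = 1000, 5000
draws = np.empty(n_reps)
for r in range(n_reps):
    W = rng.normal(size=n_steps).cumsum() / np.sqrt(n_steps)
    draws[r] = np.log(np.mean(W ** 2))      # Riemann approximation
print("5, 50, 95 percent quantiles:",
      np.quantile(draws, [0.05, 0.5, 0.95]))
```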
6.3. Practical and theoretical limitations in the ability to distinguish I(0) and I(1) processes

6.3.1. Theory
The evidence on tests of the I(1) null yields two troubling conclusions. On the one hand, the tests have relatively low power against I(0) alternatives that might be of interest; for example, with 100 observations in the detrended case, the local-to-unity asymptotics indicate that the 5 percent one-sided MPI test has a power of 0.27 against $\alpha = 0.9$ and that the Dickey-Fuller t-test has a power of only 0.19. On the other hand, processes which are I(1) but which have moderate negative autocorrelation in first differences are incorrectly rejected with high probability; that is, the unit AR root tests exhibit substantial size distortions, although the extent of these distortions varies widely across test statistics. The same general conclusions were found for tests of the general I(0) null: the power against interesting alternatives can be low and, depending on the choice of spectral estimator, the rejection rate for null values that have substantial positive autocorrelation can be well above the asymptotic level. A natural question is how one should interpret these finite-sample size distortions. In this regard, it is useful to develop some results concerning the source of these
size distortions and whether they will persist in large samples. Section 4 examined the behavior of tests of the I(0) null in the event that $y_t$ was I(1) but local-to-I(0) in the sense (4.12), and found that the I(0) tests with functional representations had nondegenerate asymptotic power functions against these alternatives. It is natural to wonder, then, what is the behavior of tests of the I(1) null if the true process is I(1) but is local to I(0)? As a starting point, consider the $d_t = 0$ case and the sequence of models

$u_t = bT\zeta_t + \sum_{s=1}^t \eta_s, \qquad (\zeta_t, \eta_t) \text{ i.i.d. } N(0, \sigma^2 I).$   (6.8)
This is just the local-to-I(0) model (4.12), rescaled by multiplication by $(bT)^{-1}$, with $(\zeta_t, \eta_t)$ Gaussian and mutually and serially uncorrelated. If in fact $b = 0$, then $u_t$ is a Gaussian random walk, so one might consider testing this hypothesis using the Sargan-Bhargava statistic $\hat R^\mu$. A direct calculation indicates that, for $b > 0$, under this nesting $T\hat R^\mu$ has a nondegenerate limiting distribution. It follows that $\Pr[\hat R^\mu < k] = \Pr[T\hat R^\mu < Tk] \to 1$ for any constant $k$, so that the rejection probability of the test tends to one. The implication is that (unmodified) Sargan-Bhargava tests of sequences which are local to I(0) in the sense (6.8) will incorrectly reject with asymptotic probability one. The implication of this result is perhaps clearer when $u_t$ is cast in its ARIMA form, $\Delta u_t = (1 - \theta_T L)\varepsilon_t$. For finite $T$, $|\theta_T| < 1$, but the limiting result indicates that the rejection probability approaches one and so can be quite large for finite $T$.

A similar set of calculations can be made for tests of the I(0) null. Here, the relevant sequence of null models to consider are those models against which the AR unit root tests have nondegenerate local asymptotic power, namely the local-to-unity models studied in Section 3.2.3. Again let $d_t = 0$ and suppose that the I(0) null is tested using the $L^\mu$ statistic. A straightforward calculation shows that

$T^{-1}L^\mu \Rightarrow \dfrac{\int_0^1 \big(\int_0^\lambda W_c^\mu(s)\,\mathrm{d}s\big)^2 \mathrm{d}\lambda}{\int_0^1 W_c^\mu(s)^2\,\mathrm{d}s},$

where $W_c^\mu$ is the demeaned local-to-unity diffusion process defined in Section 3.2.3. It follows that $\Pr[L^\mu > k] = \Pr[T^{-1}L^\mu > T^{-1}k] \to 1$ for any constant $k$, so that the rejection probability of the test tends to one. For these processes, which are local to I(1), the $L^\mu$ test rejects with probability approaching one even though, for fixed $T$, $u_t$ has the AR(1) representation $(1 - \rho_T L)u_t = \varepsilon_t$ with $|\rho_T| < 1$.

These results elucidate the Monte Carlo findings in Sections 3 and 4. In the AR case, the implication is that there are I(1) models which are local to I(0) for which the I(1) null will be incorrectly rejected with high probability. In the MA case, there are I(0) models which are local to I(1) for which the I(0) null will be incorrectly
rejected with high probability. Thus the high false rejection rates (the size distortions) found in the Monte Carlo analysis can be expected to persist asymptotically, but the range of models suffering size distortions will decrease.

The foregoing analysis is limited both because of the tests considered and because it does not address the question of the size of the neighborhoods of these incorrect rejections; it was shown only that the neighborhoods are at least $O(T^{-1})$. In the case of I(1) tests, Pantula (1991) has provided results on the sizes of these neighborhoods for several tests of the unit AR root hypothesis in the MA(1) model. He found that these neighborhoods vanish at different rates for different tests, with the slowest rate being that of the Phillips-Perron (1988) statistic. This finding explains the particularly large size distortions of this statistic with negative MA roots, even with very large samples [e.g. $T = 500$; Pantula (1991, Table 2)]. In related work, Perron (1991d) and Nabeya and Perron (1991) provide approximations to the distribution of the OLS root estimator with sequences of negative MA and negative AR roots approaching their respective boundaries.

Because tests for the general I(0) null have only recently been developed, as of this writing there have been few empirical analyses in which both I(0) and I(1) tests are used [exceptions include Fisher and Park (1991) and Ogaki (1992)]. The foregoing theoretical results suggest, however, that there will be a range of models for which the I(1) test will reject with high probability and the I(0) test will not, although the process is I(1); for which the I(0) test will reject and the I(1) test will not, although the process is I(0); and for which both tests will reject and the process is I(1). It also seems plausible that there are models that are truly I(0) but for which both tests reject with high probability, but this has not been investigated formally. There is currently little evidence on the volume of these regions of contradictory results, although Amano and van Norden's (1992) Monte Carlo evidence suggests that they may well be large in moderate sample sizes.

In summary, tests of the general I(0) null and tests of the general I(1) null are neither similar nor unbiased. Asymptotically, the tests have size equal to their stated level for fixed null models; but problems arise when we consider sequences of null and alternative models for which the I(0) and I(1) models become increasingly close. On the one hand, there are null models which will be rejected with arbitrarily high probability; on the other hand, there are alternative models against which the tests will have power approaching the nominal level. Although these regions diminish asymptotically, in finite samples this implies that there is a range of I(0) and I(1) models amongst which the unit MA and AR root tests are unable to distinguish.
6.3.2. Practical implications
The asymptotic inability to distinguish certain I(0) and I(1) models raises the question of how these tests are to be interpreted, and this has generated great controversy in the applied literature on the practical value of unit root tests. Some
of the earliest criticisms of Nelson and Plosser's (1982) stylized fact that many economic time series contain a unit root came from Bayesian analyses [Sims (1988), DeJong and Whiteman (1991a, 1991b); see the references in Section 6.2 following equation (6.4)], although the discussion here follows the debate from a classical perspective in Blough (1992), Christiano and Eichenbaum (1990), Cochrane (1991a), Rudebusch (1992, 1993) and Stock (1990). In particular, the reader is referred to Campbell and Perron's (1991) thoughtful discussion and the comments on it by Cochrane (1991b). There has, however, been little systematic research on the practical implications of this problem, so one's view of the importance of this lack of unbiasedness remains largely a matter of judgment. Because of the prominence of this issue, it nonetheless seems appropriate to organize the ways in which such judgment can be exercised. This discussion focuses exclusively on AR unit root tests, although several of the remarks have parallels to MA unit root tests.

It is useful to return to the reasons, listed in the introduction, why one might wish to perform inference concerning orders of integration: as data description; for forecasting; as pretesting for subsequent specification analysis or testing; or for testing or distinguishing among economic theories. Although this discussion proceeds in general terms, it must be emphasized that the size and power problems vary greatly across test statistics, so that the difficulties discussed here are worse for some tests than others.

Data description. The size distortions and low power of even the best-performing tests imply that the literal interpretation of unit AR root tests as similar and unbiased tests of the I(1) null against the I(0) alternative is inappropriate. However, the Monte Carlo evidence provides considerable guidance in the interpretation of unit root test results. For some tests, such as the Dickey-Fuller t-statistic and the DF-GLS statistic, the size is well controlled over a wide range of null models, so rejection can be associated rather closely with the absence of a unit root. In contrast, the severe size distortions of the Phillips-Perron tests [or other tests with the SC spectral estimator, such as the Schmidt-Phillips (1992) MSB statistic] in the presence of moderate negative MA roots, and their low empirical rejection rates in the stationary case with moderate positive MA or second AR roots, indicate that rejection by these statistics is only secondarily associated with the presence or absence of a unit root, and instead is indicative of the extent of positive serial correlation in the process. Interpretation of results based on extant versions of these statistics using SC estimators is, thus, problematic. In any event, confidence intervals for measures of long-run persistence are arguably more informative than unit root tests themselves; constructing these confidence intervals entails testing a range of possible values of $\alpha$, not just the unit root hypothesis.

An important caveat is that the unit root tests, and thus the confidence intervals, require that the trend order be correctly specified; depending on the type of misspecification, the tests might otherwise be inconsistent. We agree with Campbell and Perron's (1991) emphasis on the importance of properly specifying the trend order
before proceeding with the classical tests, and this is an area in which one should bring economic theory to bear to the maximum extent possible. For example, a priori reasoning might suggest using a constant or shift-in-mean specification in modeling interest rates, rather than including a linear time trend. We speculate that, while one could develop a consistent downward-testing procedure, starting with the highest possible trend order and letting the test level decline with the sample size, such an approach would have high misclassification rates in moderate samples (size distortions and low power). The Bayesian approach in Phillips and Ploberger (1992) to joint selection of the trend and order of integration is theoretically appealing for fixed models, but the finite-sample performance of this approach has not yet been fully investigated.
Forecasting. Campbell and Perron (1991) and Cochrane (1991b) examined the effect of unit root pretests on forecasting performance. In their Monte Carlo experiment, data were generated by an ARMA(1,1) and forecasts were made using an autoregression. Their most striking finding was that, in models with a unit AR root and large negative correlation in first differences, the out-of-sample forecast error was substantially lower with the unit root pretest than if the true differences specification was used. This finding appeared both at short and long horizons (1 and 20 periods with a sample of size 100). In cases with less severe negative correlation or with a stationary process, little was lost by pretesting relative to using the true model. Because economic forecasting is largely done using multivariate models, these initial results do not bear directly on most professional forecasting activities. Still, they suggest that for forecasting the size distortions might be an advantage, not a problem. A promising alternative to pretesting is to forecast using median-unbiased estimates of $\alpha$, as discussed in Section 3.3. To date, however, there has been no thorough examination of whether this delivers finite-sample improvements in forecasts and forecast standard errors.

Pretests for second-stage inference. Perhaps the most common use of unit root tests is as pretests for second-stage inference: as a preliminary stage for developing a forecasting model, for formulating a cointegrated system, or for determining subsequent distribution theory. In the final of these applications, the existing distribution theory for inference in linear time series regressions conditions upon the number and location of unit roots in the system, in the sense that the orders of integration and cointegration are assumed known. In empirical work, these orders are typically unknown, so one way to proceed is to pretest for integration or cointegration and then to condition on the results of these pretests in performing second-stage inference. In practice, this can mean using a unit root pretest to decide whether a variable should enter a second-stage regression in levels or differences, as was done in the empirical application in Section 5.1.4. Alternatively, if the relationship of interest involves the level of the variable in a second-stage regression, a unit
root pretest could be used to ascertain whether standard or nonstandard distribution theory should be used to compute second-stage tests. There has, however, been little research on the implications of this use of unit root tests. Some evidence addressing this is provided by Elliott and Stock (1994), who consider a bivariate problem in which there is uncertainty about whether the regressor has a unit root. In their Monte Carlo simulation, they find that unit root pretests can induce substantial size distortions in the second-stage test. If the innovations of the regressor and the second-stage regression error are correlated, the first-stage Dickey-Fuller t-statistic and the second-stage t-statistic will be dependent, so the size of the second stage in this two-stage procedure cannot be controlled effectively, even asymptotically. Although this problem is important when this error correlation is high, in applications with more modest correlations the problem is less severe.

Formulating and testing economic theories. This is arguably the application most damaged by the problems of poor size and low power. In special cases - the martingale theories of consumption and stock prices being the leading examples - simple theories predict that the series is not only I(1) but is a martingale. In this case, the null models are circumscribed and the problems of size distortions do not arise. However, the initial appeal of unit root tests to economists was that they seemed to provide a way to distinguish between broad classes of models: on the one hand, dynamic stochastic equilibrium models (real business cycle models) in which fluctuations were optimal adjustments to supply shocks; on the other hand, Keynesian models in which fluctuations arose in large part from demand disturbances. Indeed, this was the original context in which they were interpreted in Nelson and Plosser's (1982) seminal paper. Unfortunately, there are two problems, either of which alone is fatal to such an interpretation. The first is a matter of economic theory: as argued by Christiano and Eichenbaum (1990), stochastic equilibrium models need not generate unit roots in observed output, and, as argued by West (1988b), Keynesian models can generate autoregressive roots that are very close to unity. Thus a rejection by an ideal unit root test (that is, one with no size distortions) need not invalidate a real business cycle model, and a failure to reject should not be interpreted as evidence against Keynesian models. The second is the lack of unbiasedness outlined above: even if the match between classes of macroeconomic theories and whether macroeconomic series are I(1) were exact, the size distortions and low power would mean that the outcomes of unit root tests would not discriminate among theories. In this light the idea, however appealing, that a univariate unit root test could distinguish which class of models best describes the macroeconomy seems in retrospect overly ambitious.

This said, inference about the order of integration of a time series can usefully guide the specification and empirical analysis of relations of theoretical interest in economics. For example, King and Watson (1992) and Fisher and Seater (1993) use
these techniques to provide evidence on which versions of money neutrality (long-run neutrality, superneutrality) can be investigated empirically. They show that long-run neutrality can be tested without specifying a complete model of short-run dynamics, as long as money and income are I(1). Similarly, investigations into whether there are unit roots in exchange rates have proven central to inferences about such matters as long-run purchasing power parity and the behavior of exchange rates in the presence of target zones [see Johnson (1993) and Svensson (1992) for reviews]. Finally, quantitative conclusions about the persistence in univariate series have proven to be a key starting point for modeling the long-run properties of multiple time series and cointegration analysis, an area which has seen an explosion of exciting empirical and theoretical research and is the topic of the next chapter in this Handbook.
References

Ahtola, J. and G.C. Tiao (1984) "Parameter Inference for a Nearly Nonstationary First Order Autoregressive Model", Biometrika, 71, 263-272.
Amano, R.A. and S. van Norden (1992) Unit Root Tests and the Burden of Proof, manuscript, Bank of Canada.
Anderson, T.W. (1948) "On the Theory of Testing Serial Correlation", Skandinavisk Aktuarietidskrift, 31, 88-116.
Anderson, T.W. (1959) "On Asymptotic Distributions of Estimates of Parameters of Stochastic Difference Equations", Annals of Mathematical Statistics, 30, 676-687.
Anderson, T.W. and D. Darling (1952) "Asymptotic Theory of Certain 'Goodness of Fit' Criteria Based on Stochastic Processes", Annals of Mathematical Statistics, 23, 193-212.
Anderson, T.W. and A. Takemura (1986) "Why Do Noninvertible Moving Averages Occur?", Journal of Time Series Analysis, 7, 235-254.
Andrews, D.W.K. (1991) "Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation", Econometrica, 59, 817-858.
Andrews, D.W.K. (1993a) "Exactly Median-Unbiased Estimation of First Order Autoregressive/Unit Root Models", Econometrica, 61, 139-166.
Andrews, D.W.K. (1993b) "Tests for Parameter Instability and Structural Change with Unknown Change Point", Econometrica, 61, 821-856.
Andrews, D.W.K. and H.-Y. Chen (1992) "Approximately Median-Unbiased Estimation of Autoregressive Models with Applications to U.S. Macroeconomic and Financial Time Series", Cowles Foundation Discussion Paper no. 1026, Yale University.
Andrews, D.W.K. and J.C. Monahan (1992) "An Improved Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimator", Econometrica, 60, 953-966.
Andrews, D.W.K. and W. Ploberger (1992) "Optimal Tests When a Nuisance Parameter is Identified Only Under the Alternative", Cowles Foundation Discussion Paper no. 1015, Yale University.
Andrews, D.W.K., I. Lee and W. Ploberger (1992) "Optimal Changepoint Tests for Normal Linear Regression", Cowles Foundation Discussion Paper no. 1016, Yale University.
Ansley, C.F. and P. Newbold (1980) "Finite Sample Properties of Estimators for Autoregressive Moving Average Models", Journal of Econometrics, 13, 159-183.
Arnold, L. (1973) Stochastic Differential Equations: Theory and Applications. New York: Wiley.
Bai, J. (1992) Econometric Estimation of Structural Change, unpublished Ph.D. dissertation, Department of Economics, University of California, Berkeley.
Banerjee, A., J. Dolado, J.W. Galbraith and D.F. Hendry (1992a) Cointegration, Error Correction and the Econometric Analysis of Non-Stationary Data. Oxford: Oxford University Press.
Banerjee, A., R.L. Lumsdaine and J.H. Stock (1992b) "Recursive and Sequential Tests of the Unit Root and Trend Break Hypotheses: Theory and International Evidence", Journal of Business and Economic Statistics, 10, 271-288.
Barry, D. and J.A. Hartigan (1993) "A Bayesian Analysis for Change Point Problems", Journal of the American Statistical Association, 88, 309-319.
Beach, C.M. and J.G. MacKinnon (1978) "A Maximum Likelihood Procedure for Regression with Autocorrelated Errors", Econometrica, 46, 51-58.
Beaulieu, J.J. and J.A. Miron (1993) "Seasonal Unit Roots in Aggregate U.S. Data", Journal of Econometrics, 55, 305-328.
Beran, J. (1992) "Statistical Methods for Data with Long-Range Dependence (with discussion)", Statistical Science, 7, 404-427.
Berk, K.N. (1974) "Consistent Autoregressive Spectral Estimates", Annals of Statistics, 2, 489-502.
Beveridge, S. and C.R. Nelson (1981) "A New Approach to Decomposition of Economic Time Series into Permanent and Transitory Components with Particular Attention to Measurement of the 'Business Cycle'", Journal of Monetary Economics, 7, 151-174.
Bhargava, A. (1986) "On the Theory of Testing for Unit Roots in Observed Time Series", Review of Economic Studies, 53, 369-384.
Bierens, H.J. (1993) "Higher-Order Sample Autocorrelations and the Unit Root Hypothesis", Journal of Econometrics, 57, 137-160.
Bierens, H.J. and S. Guo (1993) "Testing Stationarity and Trend Stationarity Against the Unit Root Hypothesis", Econometric Reviews, 12, 1-32.
Billingsley, P. (1968) Convergence of Probability Measures. New York: Wiley.
Blanchard, O.J. and D. Quah (1989) "The Dynamic Effects of Aggregate Demand and Supply Disturbances", American Economic Review, 79, 655-673.
Blough, S.R. (1992) "The Relationship Between Power and Level for Generic Unit Root Tests in Finite Samples", Journal of Applied Econometrics, 7, 295-308.
Bobkoski, M.J. (1983) Hypothesis Testing in Nonstationary Time Series, unpublished Ph.D. thesis, Department of Statistics, University of Wisconsin.
Bollerslev, T. (1986) "Generalized Autoregressive Conditional Heteroskedasticity", Journal of Econometrics, 31, 307-327.
Box, G.E.P. and G.M. Jenkins (1976) Time Series Analysis: Forecasting and Control, revised edition. San Francisco: Holden-Day.
Brillinger, D.R. (1981) Time Series: Data Analysis and Theory. San Francisco: Holden-Day.
Brockwell, P.J. and R.A. Davis (1987) Time Series: Theory and Methods. New York: Springer-Verlag.
Brown, B.M. (1971) "Martingale Central Limit Theorems", Annals of Mathematical Statistics, 42, 59-66.
Brown, R.L., J. Durbin and J.M. Evans (1975) "Techniques for Testing the Constancy of Regression Relationships over Time, with Comments", Journal of the Royal Statistical Society, Series B, 37, 149-192.
Campbell, J.Y. (1987) "Does Saving Anticipate Declining Labor Income? An Alternative Test of the Permanent Income Hypothesis", Econometrica, 55, 1249-1274.
Campbell, J.Y. and N.G. Mankiw (1987a) "Are Output Fluctuations Transitory?", Quarterly Journal of Economics, 102, 857-880.
Campbell, J.Y. and N.G. Mankiw (1987b) "Permanent and Transitory Components in Macroeconomic Fluctuations", American Economic Review, 77, 111-117.
Campbell, J.Y. and P. Perron (1991) "Pitfalls and Opportunities: What Macroeconomists Should Know about Unit Roots", NBER Macroeconomics Annual, 141-200.
Cavanagh, C. (1985) Roots Local To Unity, manuscript, Department of Economics, Harvard University.
Chan, N.H. (1988) "On the Parameter Inference for Nearly Nonstationary Time Series", Journal of the American Statistical Association, 83, 857-862.
Chan, N.H. (1989) "Asymptotic Inference for Unstable Autoregressive Time Series with Drifts", Journal of Statistical Planning and Inference, 23, 301-312.
Chan, N.H. and C.Z. Wei (1987) "Asymptotic Inference for Nearly Nonstationary AR(1) Processes", Annals of Statistics, 15, 1050-1063.
Chan, N.H. and C.Z. Wei (1988) "Limiting Distributions of Least Squares Estimates of Unstable Autoregressive Processes", Annals of Statistics, 16, 367-401.
Chernoff, H. and S. Zacks (1964) "Estimating the Current Mean of a Normal Distribution Which Is Subject to Changes in Time", Annals of Mathematical Statistics, 35, 999-1028.
Choi, I. (1993) "Effects of Data Aggregation on the Power of Tests for a Unit Root: A Simulation Study", Economics Letters, forthcoming.
Ch. 46: Unit Roots, Structural
Breaks and Trends
2833
Chow, G.C. (1960) “Tests of Equality Between Sets of Coefficients in Two Linear Regressions”, Econometrica, 28, 59 l-605. Chow, G.C. (1984) “Random and Changing Coefficient Models”, in: Z. Griliches and M. Intriligator, eds., Handbook ofEconometrics. Vol. 2, North-Holland: Amsterdam. Christiano, L.J. (1992) “Searching for a Break in GNP”, Journal ofBusiness and Economic Statistics, 10, 237-250. Christiano, L.C. and M. Eichenbaum (1990) “Unit Roots in GNP: Do We Know and Do We Care?” Carnegie-Rochester Conference Series on Public Policy, 32, 7-62. Christiano, L.J. and L. Ljungqvist (1988) “Money Does Granger-Cause Output in the Bivariate MoneyOutput Relation”, Journal of MonetaryEconomics, 22,217-236. Chu, C.-S., and H. White (1992) “A Direct Test for Changing Trend”, Journal ofBusiness and Economic Statistics, 10, 289-299. Clark, P.K. (1987) “The Cyclical Component of U.S. Economic Activity”, The Quarterly Journal of Economics, 102, 798-814. Clark, P.I$. (1989) “Trend Reversion in Real Output and Unemployment”, Journal of Econometrics, 40, 15-32. Cochrane, J.H. (1988) “How Big is the Random Walk in GNP?“, Journal of Political Economy, 96, 893-920. Cochrane, J.H. (1991a) “A Critique of the Application of Unit Root Tests”, Journal ofEconomic Dynamics and Control, 15,275-284. Cochrane, J.H. (1991b) “Comment on Campbell and Perron”, NBER Macroeconomics Annual 1991, 5, 201-210. Cooley, T.F. and EC. Prescott (1976) “Estimation in the Presence of Stochastic Parameter Variation”, Econometrica, 44, 167- 184. Cooper, D.M. and R. Thompson (1977) “A Note on the Estimation of the Parameters of the AutoregressiveMoving Average Process”, Biometrika, 64, 625-628. Cryer, J.D. and J. Ledolter (1981) “Small-Sample Properties of the Maximum Likelihood Estimator in the First-Order Moving Average Model”, Biometrika, 68, 691-694. Dahlhaus, R. (1989) “Efficient Parameter Estimation for Self-Similar Processes”, Annals of Statistics, 17, 1749-1766. Davidson, J.E.H. (1979) “Small Sample Properties of Estimators of the Moving Average Process”, in: E.G. Charatsis, ed., Proceedings of the Econometric Society Meeting in 1979: Selected Econometric Papers in Memory of Stephan Valananis. North-Holland: Amsterdam. Davidson, J.E.H. (1981) “Problems With the Estimation of Moving Average Processes”, Journal of Econometrics, 16, 295-310. Davidson, J.E.H., D.F. Hendry, F. Srba and S. Yeo (1978) “Econometric Modelling of the Aggregate Time-Series Relationship Between Consumer’s Expenditure and Income in the United Kingdom”, Economic Journal, 86,661&692. Davies, R.B. (1977) “Hypothesis Testing When a Nuisance Parameter is Present only Under the Alternative”, Biometrika, 64, 247-254. Davis, R.A. and W.T.M. Dunsmuir (1992) Inference for MA(l) Processes with a Root On or Near the Unit Circle, Discussion Paper no. 36, Bond University School of Business. DeJong, D.N. and C.H. Whiteman (1991a) “Reconsidering Trends and Random Walks in Macroeconomic Time Series”, Journal of Monetary Economics, 28, 221-254. DeJong, D.N. and C.H. Whiteman (1991b) “The Temporal Stability of Dividends and Stock Prices: Evidence from the Likelihood Function”, American Economic Review, 81, 600~617. DeJong, D.N., J.C. Nankervis, N.E. Savin and C.H. Whiteman (1992a) “Integration versus TrendStationarity in Macroeconomic Time Series”, Econometrica, 60, 423-434. DeJong, D.N., J.C. Nankervis, N.E. Savin and C.H. Whiteman (1992b) “The Power Problems of Unit Root Tests for Time Series with Autoregressive Errors”, Journal of Econometrics, 53, 323-43. Dent, W. 
and A.-S. Min (1978) “A Monte Carlo Study of Autoregressive Integrated Moving Average Processes”, Journal of Econometrics, 7, 23-55. Deshayes, J. and D. Picard (1986) “Off-line Statistical Analysis of Change-Point Models Using Non Parametric and Likelihood Methods”, in: M. Basseville and A. Benveniste, eds., Detection of Abrupt Changes in Signals and Dynamical Systems. Springer Verlag Lecture Notes in Control and Information Sciences, no. 77, 103-168. Dickey, D.A. (1976) Estimation and Hypothesis Testing in Nonstationary Time Series, Ph.D. dissertation, Iowa State University.
2834
J.H. Stock
Dickey, D.A. and W.A. Fuller (1979) “Distribution of the Estimators for Autoregressive Time Series with a Unit Root”, Journal ofthe American Statistical Association, 74, 427-431. Dickey, D.A. and W.A. Fuller (1981) “Likelihood Ratio Tests for Autoregressive Time Series with a Unit Root”, Econometrica, 49, 105771072. Dickey, D.A. and SC. Pantula (1987) “Determining the Order of Differencing in Autoregressive Processes”, Journal of Business and Economic Statistics, 5, 455-462. Dickey, D.A., D.P. Hasza, and W.A. Fuller (1984) “Testing for Unit Roots in Seasonal Time Series”, Journal of the American Statistical Association, 79, 355-367. Diebold, F.X. (1990) “On ‘Unit Root Econometrics’: Discussion of Geweke, Sims, and DeJong and Whiteman”, in: Proceedings of the American Statistical Association Business and Economics Statistics Session. Washington, D.C.: American Statistical Association. Diebold, F.X. (1993) “The Effect of Seasonal Adjustment Filters on Tests for a Unit Root: Discussion”, Journal of Econometrics, 55, 999103. Diebold, F.X. and C. Chen (1992) Testing Structural Stability with Endogenous Break Point: A Size Comparison of Analytic and Bootstrap Procedures, manuscript, Department of Economics, University of Pennsylvania. Diebold, F.X. and M. Nerlove (1990) “Unit Roots in Economic Time Series: A Selective Survey”, in: T.B. Fomby and G.F. Rhodes, eds., Advances in Econometrics: Co-Integration, Spurious Regressions, and Unit Roots. Greenwich, CT: JAI Press. Diebold, F.X. and G.D. Rudebusch (1989) “Long Memory and Persistence in Aggregate Output”, Journal of Monetary Economics, 24, 189-209. Diebold, F.X. and G.D. Rudebusch (1991a) “Is Consumption Too Smooth? Long Memory and the Deaton Paradox”, Review ofEconomics and Statistics, 71, l-9. Diebold, F.X. and G.D. Rudebusch (1991b) “On the Power of Dickey-Fuller Tests Against Fractional Alternatives”, Economics Letters, 35, 1555160. Diebold, F.X., S. Husted and M. Rush (1991) “Real Exchange Rates Under the Gold Standard’, Journal of Political Economy, 99, 1252-1271. Dufour, J.-M. (1990) “Exact Tests and Confidence Sets in Linear Regressions with Autocorrelated Errors”, Econometrica, 58,475-494. Dufour, J.-M. and M.L. King (1991) “Optimal Invariant Tests for the Autocorrelation Coefficient in Linear Regressions with Stationary or Nonstationary AR(l) Errors”, Journal of Econometrics, 47, 115-143. Dunsmuir, W.T.M. (1981) “Estimation for Stationary Time Series When Data Are Irregularly Spaced or Missing”, in: D.F. Findley, ed., Applied Time Series II, Academic Press: New York. Durlauf, S.N. (1989) “Output Persistence, Economic Structure and the Choice of Stabilization Policv”. Brookings Papers an Ecanomic Activity, 2,69-136. Durlauf. S.N. and P.C.B. Phillins (1988) “Trends Versus Random Walks in Time Series Analvsis”. , Econametrica, 56,1333-1354. 1 ~ ’ Elliott, G. (1993a) Efficient Tests for a Unit Root When the Initial Observation Is Drawn From Its Unconditional Distribution, manuscript, Department of Economics, Harvard University. Elliott, G. (1993b) Monte Carlo Performance of Twenty Unit Root Tests, manuscript, Department of Economics, Harvard University. Elliott, G. and J.H. Stock (1994) “Inference in Time Series Regression When the Order of Integration of a Regressor is Unknown”, Econometric Theory, forthcoming. Elliott, G., T.J. Rothenberg and J.H. Stock (1992) Efficient Tests for an Autoregressive Unit Root, NBER Technical Working Paper no. 130. Engle, R.F. and C.W.J. 
Granger( 1987) “Cointegration and Error Correction: Representation, Estimation and Testing”, Econometrica, 55, 251-276. Ethier, S.N. and T.G. Kurtz (1986) Markov Processes: Characterization and Convergence. New York: Wiley. Evans, G.B.A. and N.E. Savin (I 981a) “The Calculation of the Limiting Distribution of the Least Squares Estimator of the Parameter in a Random Walk Model”, Annals of Statistics, 9, 1114-l 118. Evans, G.B.A. and N.E. Savin (1981b) “Testing for Unit Roots: I”, Econometrica, 49, 753-779. Evans, G.B.A. and N.E. Savin (1984) “Testing for Unit Roots: II”, Econometrica, 52, 1241-1260. Fama. E.F. (1970) “Efficient Capital Markets: A Review of Theory and Empirical Work”, Journal of Finance, 25,383-417. Feldstein, M. and J.H. Stock (1994) “The Use of a Monetary Aggregate to Target Nominal GDP”, in: N.G. Mankiw, ed., Monetary Policy. Chicago: University of Chicago Press.
Ch. 46: Unit Roots, Structural Breaks and Trends
2835
Fisher, E.O. and J.Y. Park (1991) “Testing Purchasing Power Parity Under the Null Hypothesis of Cointegration”, Economic Journal, 101, 1476684. Fisher, M.E. and J.J. Seater (1993) “Long-Run Neutrality and Superneutrality in an ARIMA Framework”, American Economic Review, 83,4022415. Fox, R. and MS. Taqqu (1986) “Large-Sample Properties of Parameter Estimates for Strongly Dependent Stationary Gaussian Time Series”, Annals ofStatistics, 14, 517-532. Franzini, L. and A.C. Harvey (1983) “Testing for deterministic trend and seasonal components in time series models”, Biometrika, 70, 673-682. Friedman, B.M. and K.N. Kuttner (1992) “Money, Income, Prices and Interest Rates”, American Economic Reoiew, 82,472~492. Friedman, M. and D. Meiselman (1963) “The Relative Stability of the Investment Multiplier and Monetary Velocity in the United States, 1897-1958”, in: Stabilization Policies. Commission on Money and Credit, Englewood Cliffs, NJ: Prentice-Hall, 1655268. Fuller, W.A. (1976) Introduction to Statistical Time Series. New York:Wiley. Garbade, K. (1977) “Two Methods of Examining the Stability of Regression Coefficients”, Journal of The American Statistical Association, 72, 54463. Gardner, L.A. (1969) “On Detecting Changes in the Mean of Normal Variates”, Annals ofMathematical Statistics, 40, 116-126. Geweke, J. and S. Porter-Hudak (1983) “The Estimation and Application of Long Memory Time Series Models”, Journal of Time Series Analysis, 4, 221-238. Ghysels, E. (1990) “Unit-Root Tests and the Statistical Pitfalls of Seasonal Adjustment: The Case of U.S. Postwar Real Gross National Product”, Journal of Business and Economic Statistics, 8, 1455152. Ghysels, E. and P. Perron (1993) “The Effect of Seasonal Adjustment Filters on Tests for a Unit Root”, Journal of Econometrics, 55, 57-98. Hack], P. and A. Westlund (1989) “Statistical Analysis of ‘Structural Change’: An Annotated Bibliography”, Empirical Economics, 143, 1677192. Hackl, P. and A.H. Westlund (1991) eds., Economic Structural Change: Analysis and Forecasting. Springer-Verlag: Berlin, 95-l 19. Hall, A. (1989) “Testing for a Unit Root in the Presence of Moving Average Errors”, Biometrika, 76, 49-56. Hall, A. (1992a) “Testing for a Unit Root in Time Series Using Instrumental Variable Estimation with Pretest Data-Based Model Selection”, Journal of Econometrics, 54,223-250. Hall, A. (1992b) “Unit Root Tests After Data Based Model Selection”, Proceedings of the 1992 Summer Meetings of the American Statistical Association: Business and Economics Section. Hall, P. and CC. Heyde (1980) Martingale Limit Theory and its Applications. New York: Academic Press. Hall, R.E. (1978) “Stochastic Implications of the Life Cycle-Permanent Income Hypothesis: Theory and Evidence”, Journal of Political Economy, 86, 971-87. Hansen, B.E. (1990) Lagrange Multiplier Tests for Parameter Instability in Non-Linear Models, manuscript, Economics Department, University of Rochester. Hansen, B.E. (1992) “Tests for Parameter Instability in Regressions with I(1) Processes”, Journal of Business and Economic Statistics, 10, 321-336. Harvey, A.C. (1981) Time Series Models. Oxford: Philip Allan. Harvey, A.C. (1985) “Trends and Cycles in Macroeconomic Time Series”, Journal of Business and Economic Statistics, 3, 216-227. Harvey, A.C. (1989) Forecasting, Structural Models and the Kalman Filter. Cambridge, U.K.: Cambridge University Press. Harvey, A.C. and A. Jaeger (1993) “Detrending, Stylized Facts and the Business Cycle”, Journal of Applied Econometrics, 8, 231-248. Harvey, A.C. 
and N.G. Shephard (1992) “Structural Time Series Models”, in: G.S. Maddala, CR. Rao and H.D. Vinod, eds., Handbook of Statistics, Vol. 2, Elsevier: Amsterdam. Hasza, D.P. and W.A. Fuller (1979) “Estimation for Autoregressive Processes with Unit Roots”, Annals of Statistics, 7, 1106-1120. Hasza, D.P. and W.A. Fuller (1981) “Testing for Nonstationary Parameter Specifications in Seasonal Time Series Models”, Annals of Statistics, 10, 1209-1216. Hendry, D.F. and A.J. Neale (1991) “A Monte Carlo Study of the Effects of Structural Breaks on Tests for Unit Roots”, in. P. Hack1 and A.H. Westlund, eds., Economic Structural Change: Analysis and Forecasting. SpringerrVerlag: Berlin, 955119.
2836
J.H. Slack
Herrndorf, N.A. (1984) “A Functional Central Limit Theorem for Weakly Dependent Sequences of Random Variables”, Annals cfProhahility, 12, 141-153. Hoffman, D. and R.H. Rasche (1991) “Long-Run Income and Interest Elasticities of Money Demand in the United States”, Review ofEconomics and Statistics, 73, 665-674. ed., Statistical Inference in Hurwicz, L. (1950) “Least-Squares Bias in Time Series”, in: T. Koopmans, Dynamic Economic Models. New York: Wiley. Hylleberg, S., R.F. Engle, C.W.J. Granger and S. Yoo (1990) “Seasonal Integration and Cointegration”, Journal ofEconometrics, 44, 215-238. Jandhyala, V.K. and LB. MacNeill (1992) “On Testing for the Constancy of Regression Coefficients Under Random Walk and Change-Point Alternatives”, Econometric Theory, 8, 501-517. Jegganathan, P. (1991) “On the Asymptotic Behavior of Least-Squares Estimators in AR Time Series with Roots Near the Unit Circle”, Econometric Theory, 7, 269-306. Johnson, D. (1993) “Unit Roots, Cointegration and Purchasing Power Parity: Canda and the United States, 1870-1991”, in: B. O’Reilly and J. Murray, eds., The Exchange Rate and the Economy. Ottawa: Bank of Canada. Kang, K.M. (1975) A Comparison of Estimators of Moving Average Processes, manuscript, Australian Bureau of Statistics. Kendall, M.G. and A. Stuart (1967) The Advanced Theory ofStatistics. Vol. 2, Second edition. Charles Griffin: London. Kim, H.-J. and D. Siegmund (1989) “The Likelihood Ratio Test for a Change-Point in Simple Linear Regression”, Biometrika, 76(3), 409-23. King, M.L. (1980) “Robust Tests for Spherical Symmetry and their Application to Least Squares Regression”, Annals oJSitatistics, 8, 1265-1271. King, M.L. (1988) “Towards a Theory of Point Optimal Testing”, Econometric Reviews, 6, 169-218. King, M.L. and G.H. Hillier (1985) “Locally Best Invariant Tests of the Error Covariance Matrix of the Linear Regression Model”, Journal ofthe Royal Statistical Society, Series B, 47,98-102. King, R.G. and M.W. Watson (1992) Testing Long-Run Neutrality, NBER Working Paper no. 4156. King, R.G., C.I. Plosser, J.H. Stock and M.W. Watson (1991) “Stochastic Trends and Economic Fluctuations”, American Economic Review, 81,819-840. Kiviet, J.F. and G.D.A. Phillips (1992) “Exact Similar Tests for Unit Roots and Cointegration”, Oxford Bulletin of Economics and Statistics, 54, 349-368. Konishi, T., V.A. Ramey and C.W.J. Granger (1993) Stochastic Trends and Short-Run Relationships Between Financial Variables and Real Activity, NBER Working Paper no. 4275. Koopmans, T. (1942) “Serial Correlation and Quadratic Forms in Normal Variables”, Annals oJMathematical Statistics, 13, 14-33. Kramer, W. and H. Sonnberger (1986) The Linear Regression Model Under Test. Physica-Verlag: Heidelberg. KrPmer, W., W. Ploberger and R. Alt (1988) “Testing for Structural Change in Dynamic Models”, Economet;ica, 56, 1355-1369. Krishnaiah, P.R. and B.Q. Miao (1988) “Review about Estimation ofChange Points”, in: P.R. Krishnaiah and C.R. Rao, eds., Handbook ofstatistics. 7, New York:Elsevier. Kwiatkowski, D., P.C.B. Phillips, P. Schmidt and Y. Shin (1992) “Testing the Null Hypothesis of Stationarity Against the Alternatives of a Unit Root: How Sure Are We that Economic Time Series Have a Unit Root?“, Journal ofEconometrics, 54, 159-178. Lamotte, L.R. and A. McWhorter (1978) “An Exact Test for the Presence of Random Walk Coefficients in a Linear Regression Model”, Journal ofthe American Statistical Association, 73, 816-820. Lehmann, E.L. (1959) Testing Statistical Hypotheses. 
John Wiley and Sons: New York. Lewis, R. and G.C. Reinsel (1985) “Predi&on of Multivariate-Time Series by Autoregressive Model Fitting”, Journal OfMultiuariate Analvsis, 16. 393-411. Leybourne, S.J. and B.P.M. McCabe (lbs9) “Oh the Distributiosof Some Test Statistics for Coefficient Constancy”, Biometrika, 76, 169-177. Lo, A.W. (1991) “Long-Term Memory in Stock Market Prices”, Econometrica, 59, 1279-1314. MacNeill, I.B. (1974) “Tests for Change of Parameter at Unknown Times and Distributions of Some Related Functionals on Brownian Motion”, Annals ofstatistics, 2,950-962. MacNeill, I.B. (1978) “Properties of Sequences of Partial Sums of Polynomial Regression Residuals with Applications to Tests for Change of Regression at Unknown Times”, Annals ofsitatistics, 6,422-433. Magnus, J.R. and B. Pesaran (1991) “The Bias of Forecasts from a First-Order Autoregression”, Econometric Theory, 7, 222-235.
Ch. 46: Unit Roots, Structural
Breaks and Trends
2837
B.B. (1975) “Limit Theorems on the Self-normalized Range for Weakly and StroWdY Processes”, Zeitschrifftir Wahrscheinlichkeitstheorie und verwandte Gehiete, 3 1,271-285. Mandelbrot, B.B. and M.S. Taqqu (1979) Robust R/S Analysis of Long Run Serial Correlation, Invited paper, 42nd Session of the International Institute of Statistics. Mandelbrot, B.B. and J.W. Van Ness (1968) “Fractional Brownian Motions, Fractional Noise and Applications”, SIAM Review, 10, 4222437. Mann, H.B. and A. Wald (1943) “On the Statistical Treatment of Linear Stochastic Difference Equations”, Econometrica, 11, 1733220. Marriot, F.H.C. and J.A. Pope (1954) “Bias in the Estimation of Autocorrelations”, Biometrika, 41, 3933403. McCabe, B.P.M. and M.J. Harrison (1980) “Testing the Constancy of Regression Relationships over Time Using Least Squares Residuals”, Applied Statistics, 29, 142-148. Meese, R.A. and K.J. Singleton (1982) “On Unit Roots and the Empirical Modeling of Exchange Rates”, Journal ofFinance, 37, 1029-1035. Nabeya, S. and P. Perron (1991) Local Asymptotic Distributions Related to the AR(l) Model with Dependent Errors, Econometric Research Program Memorandum No. 362, Princeton University. Nabeya, S. and B.E. Sorensen (1992) Asymptotic Distributions of the Least Squares Estimators and Test Statistics in the Near Unit Root Model with Non-Zero Initial Value and Local Drift and Trend, manuscript, Department of Economics, Brown University. Nabeya, S. and K. Tanaka (1988) “Asymptotic Theory of a Test for the Constancy of Regression Coefficients Against the Random Walk Alternative”, Annals of Statistics, 16, 2188235. Nabeya, S. and K. Tanaka (1990a) “A General Approach to the Limiting Distribution for Estimators in Time Series Regression with Nonstable Autoregressive Errors”, Econometrica, 58, 1455163. Nabeya, S. and K. Tanaka (1990b) “Limiting Powers of Unit Root Tests in Time Series Regression”, Journal ofEconometrics, 46,247-271. Nankervis, J.C. and N.E. Savin (1988) “The Exact Moments of the Least-Squares Estimator for the Autoregressive Model: Corrections and Extensions”, Journal ofEconometrics, 37, 381-388. Nelson, CR. and H. Kang (1981) “Spurious Periodicity in Inappropriately Detrended Time Series”, Econometrica, 49, 741-751. Nelson, C.R. and C.I. Plosser (1982) “Trends and Random Walks in Macro-economic Time Series: Some Evidence and Implications”, Journal ofMonetary Economics, 10, 1399162. Newey, W. and K.D. West (1987) “A Simple Positive Semi Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix”, Econometrica, 55, 703-708. Ng, S. and P. Perron (1993a) Useful Modifications to Some Unit Root Tests with Dependent Errors and their Local Asymptotic Properties, manuscript, C.R.D.E., University of Montreal, Quebec. Ng, S. and P. Perron (1993b) Unit Root Tests in ARMA Models With Data-Dependent Methods for the Selection of the Truncation Lag, manuscript, C.R.D.E., University of Montreal, Quebec. Nicholls, S. and A. Pagan (1985) “Varying Coefficient Regression”, in: Handbook ofStatistics, 4133449. Nyblom, J. (1986) “Testing for Deterministic Linear Trend in Time Series”, Journal of the American Statistical Association, 81, 545-549. Nyblom, J. (1989) “Testing for the Constancy of Parameters Over Time”, Journal of the American Statistical Association, 84, 223-30. Nyblom, J. and T. Makellinen (1983) “Comparisons of Tests for the Presence of Random Walk Coefficients in a Simple Linear Model”, Journal ofthe American Statistical Association, 78, 856-864. Ogaki, M. 
(1992) “Engel’s Law and Cointegration”, Journal ofPolitical Economy, 100, 1027-1046. Orcutt, G.H. (1948) “A Study of the Autoregressive Nature of the Time Series Used for Tinbergen’s Model of the Economic System of the United States, 1919-1932”, Journal ofthe Royal Statistical Society B, 10, l-45. Pantula, S.G. (1989) “Testing for Unit Roots in Time Series Data”, Econometric Theory, 5, 256-271. Pantula, S.G. (1991) “Asymptotic Distributions of the Unit-Root Tests when the Process is Nearly Stationary”, Journal of Business and Economic Statistics, 9, 63-71. Pantula, S.G. and A. Hall (1991) “Testing for Unit Roots in Autoregressive Moving Average Models: An Instrumental Variable Approach”, Journal of Econometrics, 48, 325-353. Pantula, S.G., G. Gonzalez-Farias and W.A. Fuller (1992) A Comparison of Unit Root Test Criteria, manuscript, North Carolina State University. Park, J. (1990) “Testing for Unit Roots and Cointegration by Variable Addition”, in: T.B. Fomby and G.F. Rhodes, eds., Advances in Econometrics: Co-Integration. Spurious Regressions and (init Roots. Greenwich, CT: JAI Press. Mandelbrot,
Dependent
2838
J.H. Stock
Park, J. and B. Choi (1988) A New Approach to Testing for a Unit Root, Working Paper no. 88-23, Center for Analytical Economics, Cornell University. Park, J.Y. and P.C.B. Phillips (1988) “Statistical Inference in Regressions with Integrated Processes: Part I”, Econometric Theory, 4, 468-497. Perron, P. (1988) “Trends and Random Walks in Macroeconomic Time Series: Further Evidence from a New Approach”, Journal ofEconomic Dynamics and Control, 12,297-332. Perron, P. (1989a) “The Great Crash, the Oil Price Shock and the Unit Root Hypothesis”, Econometrica, 57, 1361-1401. Perron, P. (1989b) “The Calculation of the Limiting Distribution of the Least-Squares Estimator in a Near-Integrated Model”, Econometric Theory, 5,241-255. Perron, P. (1989~) “Testing for a Random Walk: A Simulation Experiment of Power When the Sampling Interval is Varied”, in: B. Raj, ed., Advances in Econometrics and Modelling. Kluwer Academic Publishers: Boston. Perron, P. (1990a) “Tests of Joint Hypothesis for Time Series Regression with a Unit Root”, in: T.B. Fomby and G.F. Rhodes, eds., Advances in Econometrics: Co-Integration, Spurious Regressions and Unit Roots. Greenwich, CT: JAI Press. Perron, P. (1990b) “Testing for a Unit Root in a Time Series with a Changing Mean”, Journal ofBusiness and Economic Statistics, 8, 153-162. Perron, P. (199la) “A Continuous Time Approximation to the Stationary First-Order Autoregressive Model”, Econometric Theory, 7, 236-252. Perron, P. (199lb) “Test Consistency with Varying Sampling Frequency”, Econometric Theory, 7, 341-368. Perron, P. (1991~) A Test for Changes in a Polynomial Trend Function for a Dynamic Time Series, manuscript, Princeton University. Perron, P. (1991d) The Adequacy of Asymptotic Approximations in the Near-Integrated Autoregressive Model with Dependent Errors, manuscript, Princeton University. Perron, P. (1992) “A Continuous Time Approximation to the Unstable First Order Autoregressive Process: The Case Without an Intercept”, Econometrica, 59, 211-236. Perron, P. and P.C.B. Phillips (1987) “Does GNP Have a Unit Root? A Reevaluation”, Economics Letters, 23, 139-145. Perron, P. and T.S. Vogelsang (1992) “Nonstationarity and Level Shifts With an Application to Purchasing Power Parity”, Journal of Business and Economic Statistics, 10, 301-320. Pesaran, M.H. (1983) “A Note on the Maximum Likelihood Estimation of Regression Models with, First-Order Moving Average Errors with Roots on the Unit Circle”, Australian Journal of Statistics, 25.442-448. Phillips, P.C.B. (1978) “Edgeworth and Saddlepoint Approximations in the First-Order Noncircular Autoregression”, Biometrika, 65, 91-98. Phillips, P.C.B. (1986) “Understanding Spurious Regression in Econometrics”, Journal of Econometrics, 33,3llL340. Phillips, P.C.B. (1987a) “Time Series Regression with a Unit Root”, Econometrica, 55, 277-302. Phillips, P.C.B. (1987b) “Towards a Unified Asymptotic Theory for Autoregression”, Biometrika, 74, 535-547. Phillips, P.C.B. (1988) “Multiple Regression with Integrated Time Series”, Contemporary Mathematics, 80, 79-105. Phillips, P.C.B. (199la) “To Criticize the Critics: An Objective Bayesian Analysis of Stochastic Trends”, Journal of Applied Econometrics, 6, 333-364. Phillips, P.C.B. (199lb) “Spectral Regression for Cointegrated Time Series”, in: W. Barnett, J. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge University Press: Cambridge, UK. Phillips, P.C.B. (1992a) “Unit Roots”, in: P. Newman, M. Milgate and J. 
Eatwell, eds., The New Palgrave Dictionary of Money and Finance. London: Macmillan, 726-730. Phillips, P.C.B. (1992b) Bayes Methods for Trending Multiple Time Series with An Empirical Application to the U.S. Economy, manuscript, Cowles Foundation, Yale University. Phillips, P.C.B. and M. Loretan (1991) “Estimating Long Run Economic Equilibria”, Reoiew of Economic Studies, 58, 407-436. Phillips, P.C.B. and S. Ouliaris (1990) “Asymptotic Properties of Residual Based Tests for Cointegration”, Econometrica, 58, 165-94.
Ch. 46: Unit Roots, Structural
Breaks and Trends
2839
Phillips, P.C.B. and P. Perron (1988) “Testing for a Unit Roots in a Time Series Regression”, Biometrika, X,335-346. Phillips, P.C.B. and W. Ploberger (1991) Time Series Modeling with a Bayesian Frame of Reference: I. Concepts and Illustrations, manuscript, Cowles Foundation, Yale University. Phillips, P.C.B. and W. Ploberger (1992) Posterior Odds Testing for a Unit Root with Data-Based Model Selection. manuscript, Cowles Foundation for Economic Research, Yale University. Phillips P.C.B. and V. Solo (1992) “Asymptotics for Linear Processes”, Annals ofstatistics, 20,971-1001. Picard, D. (1985) “Testing and Estimating Change-Points in Time Series”, Advances in Applied Probability, 176, 841-867. Ploberger, W. and W. Kramer (1990) “The Local Power of the CUSUM and CUSUM of Squares Tests”, Econometric Theory, 6335-347. Plobeger, W. and W. Kramer (1992a) “The CUSUM test with OLS Residuals”, Econometrica, 60, 271-286. Ploberger, W. and W. Kramer (1992b) A Trend-Resistant Test for Structural Change Based on OLS-Residuals, manuscript, Department of Statistics, University of Dortmund. Ploberger, W., W. Kramer and K. Kontrus (1989) “A New Test for Structural Stability in the Linear Regression Model”, Journal ofEconometrics, 40, 307-318. Plosser, C.I. and C.W. Schwert (1977) “Estimation of a Noninvertible Moving-Average Process: The Case of Overdifferencing”, Journal of Econometrics, 6, 199-224. Piitscher, B.M. (1991) “Noninvertibility and Pseudo-Maximum Likelihood Estimation of Misspecified ARMA Models”, Econometric Theory, 7, 435-449. Quah, D. (1992) “The Relative Importance of Permanent and Transitory Components: Identification and some Theoretical Bounds”, Econometrica, 60, 107-l 18. Quandt, R.E. (1960) “Tests of the Hypothesis that a Linear Regression System Obeys Two Separate Regimes”, Journal ofthe American Statistical Association, 55, 324-330. Rappoport, P. and L. Reichlin (1989) “Segmented Trends and Non-Stationary Time Series”, Economic Journal, 99, 168-177. Robinson, P.M. (1993) “Nonparametric Function Estimation for Long-Memory Time Series”, in: C. Sims, ed., Advances in Econometrics, Sixth World Congress of the Econometric Society. Cambridge: Cambridge University Press. Rothenberg, T.J. (1990) Inference in Autoregressive Models with Near-Unit Root, seminar notes, Department of Economics, University of California, Berkeley. Rudebusch, G. (1992) “Trends and Random Walks in Macroeconomic Time Series: A Reexamination”, International Economic Review, 33,661-680. Rudebusch, G.D. (1993) “The Uncertain Unit Root in Real GNP”, American Economic Review, 83, 264272. Said, S.E. and D.A. Dickey (1984) “Testing for Unit Roots in Autoregressive-Moving Average Models of Unknown Order”, Biometrika, 71, 599-608. Said, S.E. and D.A. Dickey (1985) “Hypothesis Testing in ARIMA (p, l,q) Models”, Journal of the American Statistical Association, 80, 369-374. Saikkonen, P. (1991) “Asymptotically Efficient Estimation of Cointegrating Regressions”, Econometric Theory, 7, 1-21. Saikkonen, P. (1993) “A Note on a Lagrange Multiplier Test for Testing an Autoregressive Unit Root”, Econometric Theory, 9, 343-362. Saikkonen, P. and R. Luukkonen (1993a) “Testing for Moving Average Unit Root in Autoregressive Integrated Moving Average Models”, Journal of the American Statistical Association, 88, 5966601. Saikkonen, P. and R. Luukkonen (1993b) “Point Optimal Tests for Testing the Order of Differencing in ARIMA Models”, Econometric Theory, 9, 343-362. Sargan, J.D. and A. 
Bhargava (1983a) ‘Testing Residuals from Least Squares Regression for Being Generated by the Gaussian Random Walk”. Econometrica. 51. 153-174. Sargan, J.D. and A. Bhargava (1983b) “Maximum Likelihood Estimation of Regression Models with First Order Moving Average Errors when the Root Lies on the Unit Circle”, Econometrica, 51, 799-820. Sarris, A.H. (1973) “A Bayesian Approach to Estimation of Time Varying Regression Coefficients”, Annals of Economic and Social Measurement, 2, 501-523. Satchell, SE. (1984) “Approximation to the Finite Sample Distribution for Nonstable First Order Stochastic Difference Equations”, Econometrica, 52, 1271-1289.
J.H. Stock
Schmidt. P. and P.C.B. Phillips (1992) “LM Tests for a Unit Root in the Presence of Deterministic Trends”, Oxford Bulletin of Economics and Statistics, 54, 2577288. Schotman, P. and H.K. van Dijk (1990) Posterior Analysis of Possibly Integrated Time Series with an Application to Real GNP, manuscript, Erasmus University Econometric Institute, Netherlands. Schwarz, G. (1978) “Estimating the Dimension of a Model”, Annals of Statistics, 6,461L464. Schwert. G.W. (1989) “Tests for Unit Roots: A Monte Carlo Investigation”, Journal Businessand Economic Statistics, I, 147-l 59. Sen, P.K. (1980) “Asymptotic Theory of Some Tests for a Possible Change in the Regression Slope Occurrine at an Unknown Time Point”, Zeitschrifi fir Wahrscheinlichkeitstheorie und verwandte Gebiete, 52, 2033218. Sheohard. N.G. (1992) Distribution of the ML Estimator of a MA(l) and a Local Level Model, manuscript, Nuffield College, Oxford. Shephard, N.G. (1993) “Maximum Likelihood Estimation of Regression Models with Stochastic Trend Components”, Journal of the American Statistical Association, 88, 590-595. Shephard, N.G. and A.C. Harvey (1990) “On the Probability of Estimating a Deterministic Component in the Local Level Model”, Journal of Time Series Analysis, 11, 339-347. Shiller, R.J. and P. Perron (1985) “Testing the Random Walk Hypothesis: Power versus Frequency of Observation”, Economics Letters, 18, 1271-1289. Shively, T.S. (1988) “An Exact Test for a Stochastic Coefficient in a Time Series Regression Model”, Journal of Time Series Analysis, 9, 81-88. Siegmund, D. (1988) “Confidence Sets in Change-point Problems”, International Statistical Review, 56, 31-47. Sims, C.A. (1972) “Money, Income and Causality”, American Economic Review, 62, 540-552. Sims, C.A. (1980) “Macroeconomics and Reality”, Econometrica, 48, l-48. Sims, CA. (1988) “Bayesian Skepticism on Unit Root Econometrics”, Journal of Economic Dynamics and Control, 12,463-474. Sims, C.A. and H. Uhlig (1991) “Understanding Unit Rooters: A Helicopter Tour”, Econometrica, 59, 1591-1600. Sims, CA., J.H. Stock, and M.W. Watson (1990) “Inference in Linear Time Series Models with Some Unit Roots”, Econometrica, 58, 1133144. Solo, V. (1984) “The Order of Differencing in ARIMA Models”, Journal of the American Statistical Association, 79, 916-921. Solo, V. (1989) A Simple Derivation of the Granger Representation Theorem, manuscript, Macquarie University, Sydney, Australia. Sorensen, B. (1992) “Continuous Record Asymptotics in Systems of Stochastic Differential Equations”, Econometric Theory, 8,28-51. Sowell, F. (1990) “Fractional Unit Root Distribution”, Econometrica, 58,495-506. Sowell, F. (1991) “On DeJong and Whiteman’s Bayesian Inference for the Unit Root Model”, Journal of Monetary Economics, 28, 2555264. Sowell, F.B. (1992) “Maximum Likelihood Estimation of Stationary Univariate FractionallyIntegrated Time-Series Models”, Journal of Econometrics, 53, 1655188. Stine, R.A. and P. Shaman (1989) “A Fixed Point Characterization for Bias of Autoregressive Estimates”, Annals of Statistics, 17, 1275-1284. Stock, J.H. (1987) “Asymptotic Properties of Least Squares Estimators of Cointegrating Vectors”, Econometrica, 55, 1035-1056. Stock, J.H. (1988) A Class of Tests for Integration and Cointegration, manuscript, Kennedy School of Government, Harvard University. Stock. J.H. (1990) “Unit Roots in GNP: Do We Know and Do We Care?: A Comment”. Carneaie” Rochester‘Conference on Public Policy, 26, 63-82. Stock, J.H. 
(1991) “Confidence Intervals for the Largest Autoregressive Root in U.S. Economic Time Series”, Journal of Monetary Economics, 28,435-460. Stock, J.H. (1992) ‘Deciding Between I(0) and I(l)“, NBER Technical Working Paper no. 121; Journal of Econometrics, forthcoming. Stock, J.H. and M.W. Watson (1988a) “Variable Trends in Economic Time Series”, Journal of Economic Perspectives, 2, 147-l 74. Stock, J.H. and M.W. Watson (1988b) "Testingfor Common Trends”, Journal ofthe American Statistical Association, 83, 1097-l 107.
of
Ch. 46: Unit Roots, Structural Breaks and Trends
2841
Stock, J.H. and M.W. Watson (1989) “Interpreting the Evidence on Money-Income Causality”, Journal of Econometrics, 40, 161-182. Stock. J.H. and M.W. Watson (1993) “A Simple Estimator of Cointegrating Vectors in Higher Order Integrated Systems”, Econometrica, 61, 783:820. Svensson. L.E.O. (1992) “An Interpretation of Recent Research on Exchange Rate Target Zones”, Journal of economic Pe;spectives, 6, 11&144. Tanaka, K. (1990a) “The Fredholm Approach to Asymptotic Inference in Nonstationary and Noninvertible Time Series Models”, Econometric Theory, 6, 41 l-432. Tanaka, K. (1990b) “Testing for a Moving Average Unit Root”, Econometric Theory, 6,433-444. Tanaka. K. and S.E. Satchel1 (1989) “Asvmptotic Properties of the Maximum Likelihood and Non-Linear Least-Squares E&mat&s fo> fioninvertable Moving Average Models”, Econometric Theory, $333-353. Tang, S.M. and LB. MacNeill (1993) “The Effect of Serial Correlation on Tests for Parameter Change at Unknown Time”, Annals of Statistics, 21, 552-575. Tinbergen, J. (1939) Statistical Testing of Business-Cycle Theories. Vol, II: Business Cycles in the United States of America, 1919-1932. League of Nations: Geneva. Uhlig, H. (1992) What Macroeconomists Should Know About Unit Roots As Well: The Bayesian Perspective, manuscript, Department of Economics, Princeton University. Watson, M.W. (1986) “Univariate Detrending Methods with Stochastic Trends”, Journal of Monetary Economics, 18, 49-75. Watson, M.W. and R.F. Engle (1985) “Testing for Regression Coefficient Stability with a Stationary AR(l) Alternative”, Review of Economics and Statistics, LXVII, 341-345. Wei, C.Z. (1992) “On Predictive Least Squares Principles”, Annals of Statistics, 20, l-42. West, K.D. (1987) “A Note on the Power of Least Squares Tests for a Unit Root”, Economics Letters, 24,1397-1418. West, K.D. (1988a) “Asymptotic Normality when Regressors have a Unit Root”, Econometrica, 56, 1397-1418. West, K.D. (1988b) “On the Interpretation of Near Random Walk Behavior in GNP”, American Economic Review, 78,202-208. White, J.S. (1958) “The Limiting Distribution of the Serial Correlation Coefficient in the Explosive Case”, Annals of Mathematical Statistics, 29, 1188-l 197. White, J.S. (1959) “The Limiting Distribution of the Serial Correlation Coefficient in the Explosive Case II”, Annals of Mathematical Statistics, 30, 831-834. Yao, Y.-C. (1987) “Approximating the Distribution of the Maximum Likelihood Estimate of the Change-Point in a Sequence ofIndependent Random Variables”, Annals ofStatistics, 15,1321-1328. Zacks, S. (1983) “Survey of Classical and Bayesian Approaches to the Change-Point Problem: Fixed Sample and Sequential Procedures of Testing and Estimation”, in: M.H. Rizvi et al., eds., Recent Advances in Statistics. New York: Academic Press, 245-269. Zellner (1987) “Bayesian Inference”, in: J. Eatwell, M. Milgate and P. Newman, eds., The New Pa/grave: A Dictionary of Economics, Macmillan Press: London, 208-219. Zivot, E. and D.W.K. Andrews (1992) “Further Evidence on the Great Crash, the Oil Price Shock, and the Unit Root Hypothesis”, Journal of Business and Economic Statistics, 10, 251-270.
Chapter 47
VECTOR AUTOREGRESSIONS COINTEGRATION* MARK
AND
W. WATSON
Northwestern University and Federal Reserve Bank of Chicago
Contents
Abstract 1. Introduction 2. Inference in VARs with integrated 2.1.
3.
4.
Introductory
2844 2844 2848
regressors
2848
comments
2848
2.2.
An example
2.3.
A useful lemma
2850
2.4.
Continuing
2852
2.5.
A general
with the example
2.6.
Applications
2.7.
Implications
Cointegrated
2860 for econometric
2866
practice
2870
systems
3.1.
Introductory
3.2.
Representations
2870
comments for the I(1) cointegrated
3.3.
Testing
3.4.
Estimating
3.5.
The role of constants
Structural
2854
framework
for cointegration cointegrating
2870
model
in I(1) systems
2876
vectors
2887
and trends
2894
vector autoregressions
2898
4.1.
Introductory
4.2.
The structural
4.3.
The structural
4.4.
Identification
4.5.
Estimating
variance
2898
comments moving
average
model, impulse
response
functions
and 2899
decompositions VAR representation of the structural structural
VAR
VAR models
References
2900 2902 2906
2910
*The paper has benefited from comments by Edwin Denson, Rob Engle, Neil Ericsson, Michael Horvath, Soren Johansen, Peter Phillips, Greg Reinsel, James Stock and students at Northwestern University and Studienzentrum Gerzensee. Support was provided by the National Science Foundation through grants SES-89910601 and SES-91-22463. Handbook of Econometrics, Fo~ume IV, Edited by R.F. Engle and D.L. McFadden 0 1994 Elsevier Science B.V. All rights reserved
M. W. Watson
2844
Abstract
This paper surveys three topics: vector autoregressive (VAR) models with integrated regressors, cointegration, and structural VAR modeling. The paper begins by developing methods to study potential “unit root” problems in multivariate models, and then presents a simple set of rules designed to help applied researchers conduct inference in VARs. A large number of examples are studied, including tests for Granger causality, tests for VAR lag length, spurious regressions and OLS estimators of cointegrating vectors. The survey of cointegration begins with four alternative representations of cointegrated systems: the vector error correction model (VECM), and the moving average, common trends and triangular representations. A variety of tests for cointegration and efficient estimators for cointegrating vectors are developed and compared. Finally, structural VAR modeling is surveyed, with an emphasis on interpretation, econometric identification and construction of efficient estimators. Each section of this survey is largely self-contained. Inference in VARs with integrated regressors is covered in Section 2, cointegration is surveyed in Section 3, and structural VAR modeling is the subject of Section 4. 1.
Introduction
Multivariate time series methods are widely used by empirical economists, and econometricians have focused a great deal of attention at refining and extending these techniques so that they are well suited for answering economic questions. This paper surveys two of the most important recent developments in this area: vector autoregressions and cointegration. Vector autoregressions (VARs) were introduced into empirical economics by Sims (1980), who demonstrated that VARs provide a flexible and tractable framework for analyzing economic time series. Cointegration was introduced in a series of papers by Granger (1983) Granger and Weiss (1983) and Engle and Granger (1987). These papers developed a very useful probability structure for analyzing both long-run and short-run economic relations. Empirical researchers immediately began experimenting with these new models, and econometricians began studying the unique problems that they raise for econometric identification, estimation and statistical inference. Identification problems had to be confronted immediately in VARs. Since these models don’t dichotomize variables into “endogenous” and “exogenous,” the exclusion restrictions used to identify traditional simultaneous equations models make little sense. Alternative sets of restrictions, typically involving the covariance matrix of the errors, have been used instead. Problems in statistical inference immediately confronted researchers using cointegrated models. At the heart of cointegrated models are “integrated” variables, and statistics constructed from integrated variables often behave in nonstandard ways. “Unit root” problems are present and a large research effort has attempted to understand and deal with these problems. This paper is a survey of some of the developments in VARs and cointegration that have occurred since the early 1980s. Because of space and time constraints, certain topics have been omitted. For example, there is no discussion of forecasting or data analysis; the paper focuses entirely on structural inference. Empirical
M. W. Watson
2846
proposition is testable without a complete specification of the structural model. The basic idea is that when money and output are integrated, the historical data contain permanent shocks. Long-run neutrality can be investigated by examining the relationship between the permanent changes in money and output. This raises two important econometric questions. First, how can the permanent changes in the variables be extracted from the historical time series? Second, the neutrality components of changes in money; can these proposition involves “exogenous” components be econometrically identified? The first question is addressed in Section 3, where, among other topics, trend extraction in integrated processes is discussed. The second question concerns structural identification and is discussed in Section 4. One important restriction of economic theory is that certain “Great Ratios” are stable. In the eight-variable system, five of these restrictions are noteworthy. The first four are suggested by the standard neoclassical growth model. In response to exogenous growth in productivity and population, the neoclassical growth model predicts that output, consumption and investment will grow in a balanced way. That is, even though y,, c,, and i, increase permanently in response to increases in productivity and population, there are no permanent shifts in c, - y, and i, - y,. The model also predicts that the marginal product of capital will be stable in the long run, suggesting that similar long-run stability will be present in ex-post real interest rates, r - Ap. Absent long-run frictions in competitive labor markets, real wages equal the marginal product of labor. Thus, when the production function is Cobb-Douglas (so that marginal and average products are proportional), (w - p) - (y - n) is stable in the long run. Finally, many macroeconomic models of money [e.g. Lucas (1988)] imply a stable long-run relation between real balances (m - p), output (y) and nominal interest rates (r), such as m - p = /3,y + &r; that is, these models imply a stable long-run “money demand” equation. Kosobud and Klein (1961) contains one of the first systematic investigations of these stability propositions. They tested whether the deterministic growth rates in the series were consistent with the propositions. However, in models with stochastic growth, the stability propositions also restrict the stochastic trends in the variables. These restrictions can be described succinctly. Let x, denote the 8 x 1 vector (y,, c,, i,, n,, w,, m,, pr, rJ. Assume that the forcing processes of the system (productivity, population, outside money, etc.) are such that the elements of x, are potentially I(1). The five stability propositions imply that z, = CL’X,is I(O), where 1 -1
1
-1
0 o-1
cI=
-0, 0
0
0
00
0
0
100
0
0
100
0
0
0
0
0
0
0
0 -1 0
0
10 -1
0
-B,
1
Ch. 47:
Vector Autoregressions
and Cointegration
2847
The first two columns of IXare the balanced growth restrictions, the third column is the real wage - average labor productivity restriction, the fourth column is stable long-run money demand restriction, and the last column restricts nominal interest rates to be I(0). If money and prices are I(l), Ap is I(0) so that stationary real rates imply stationary nominal rates.’ These restrictions raise two econometric questions. First, how should the stability hypotheses be tested? This is answered in Section 3.3 which discusses tests for cointegration. Second, how should the coefficients /?, and p, be estimated from the data, and how should inference about their values be carried out?* This is the subject of Section 3.4 which considers the problem of estimating cointegrating vectors. In addition to these narrow questions, there are two broad and arguably more important questions about the business cycle behavior of the system. First, how do the variables respond dynamically to exogenous shocks? Do prices respond sluggishly to exogenous changes in money? Does output respond at all? And if so, for how long? Second, what are the important sources of fluctuations in the variables. Are business cycles largely the result of supply shocks, like shocks to productivity? Or do aggregate demand shocks, associated with monetary and fiscal policy, play the dominant role in the business cycle? If the exogenous shocks of econometric interest ~ supply shocks, monetary shocks, etc. ~ can be related to one-step-ahead forecast errors, then VAR models can be used to answer these questions. The VAR, together with a function relating the one-step-ahead forecast errors to exogenous structural shocks is called a “structural” VAR. The first question ~ what is the dynamic response of the variables to exogenous shocks? ~ is answered by the moving average representation of the structural VAR model and its associated impulse response functions. The second question - what are the important sources of economic fluctuations? ~ is answered by the structural VAR’s variance decompositions. Section 4 shows how the impulse responses and variance decompositions can be computed from the VAR. Their calculation and interpretation are straightforward. The more interesting econometric questions involve issues of identification and efficient estimation in structural VAR models. The bulk of Section 4 is devoted to these topics. Before proceeding to the body of the survey, three organizational comments are useful. First, the sections of this survey are largely self contained. This means that the reader interested in structural VARs can skip Sections 2 and 3 and proceed directly to Section 4. The only exception to this is that certain results on inference in cointegrated systems, discussed in Section 3, rely on asymptotic results from Section 2. If the reader is willing to take these results on faith, Section 3 can be read without the benefit of Section 2. The second comment is that Sections 2 and
‘Since nominal rates are I(0) from the last column of a, the long run interest semielasticity of money demand, fi,, need not appear in the fourth column of a. ‘The values of BYand b, are important to macroeconomists because they determine (i) the relationship between the average growth rate of money, output and prices and (ii) the steady-state amount of seignorage associated with any given level of money growth.
M. W. Watson
2848
3 are written at a somewhat higher level than Section 4. Sections 2 and 3 are based on lecture notes developed for a second year graduate econometrics course and assumes that students have completed a traditional first year econometrics sequence. Section 4, on structural VARs, is based on lecture notes from a first year graduate course in macroeconomics and assumes only that students have a basic understanding of econometrics at the level of simultaneous equations. Finally, this survey focuses only on the classical statistical analysis of I(1) and I(0) systems. Many of the results presented here have been extended to higher order integrated systems, and these extensions will be mentioned where appropriate. 2.
Inference in VARs with integrated regressors
2.1.
Introductory
comments
Time series regressions that include integrated variables can behave very differently than standard regression models. The simplest example of this is the AR( 1) regression: y, = py,- 1 + E,, where p = 1 and E, is independent and identically distributed with mean zero and variance g2, i.i.d.(0,a2). As Stock shows in his chapter of the Handbook, p, the ordinary least squares (OLS) estimator of p, has a non-normal asymptotic distribution, is asymptotically biased, and yet is “super consistent,” converging to its true value at rate T. Estimated coefficients in VARs with integrated components, can also behave differently than estimators in covariance stationary VARs. In particular, some of the estimated coefficients behave like p, with non-normal asymptotic distributions, while other estimated coefficients behave in the standard way, with asymptotic normal large sample distributions. This has profound consequences for carrying out statistical inference, since in some instances, the usual test statistics will not have asymptotic x2 distributions, while in other circumstances they will. For example, Granger causality test statistics will often have nonstandard asymptotic distributions, so that conducting inference using critical values from the x2 table is incorrect. On the other hand, test statistics for lag length in the VAR will usually be distributed x2 in large samples. This section investigates these subtleties, with the objective of developing a set of simple guidelines that can be used for conducting inference in VARs with integrated components. We do this by studying a .model composed of I(0) and I(1) variables. Although results are available for higher order integrated systems [see Park and Phillips (1988, 1989), Sims et al. (1990) and Tsay and Tiao (1990)], limiting attention to I(1) processes greatly simplifies the notation with little loss of insight.
2.2. Many
An example of the complications
in statistical
inference
that arise in VARs with unit
Ch. 47: Vector Autoregressions
roots can be analyzed Y,=4lY,-,
2849
and Cointegration
in a simple univariate
+ 42Yt-2
AR(2) model3
(2.1)
+ ‘I,.
Assume that #i + $2 = 1 and Ic$~/ < 1, so that process contains one unit root. To keep things simple, assume that qr is i.i.d.(O, 1) and normally distributed [n.i.i.d.(O, l)]. Let x, = (y,_ 1yt_ 2)’ and C$= (4i 42)‘, so that the OLS estimator is 4 = (Cx,xi)- ’ x (U n 1ess noted otherwise, C will denote (CX,Y,) and (4~- 4) = (Cx,x:)-‘(C-V,). x:,‘=, throughout this paper.) In the covariance stationary model, the large sample distribution of C$is deduced by writing T1’2($ - 4) = (~~‘~x,x~)~‘(~~“*~x~~~), and then using a law oflarge numbers to show that T-‘Cx,x; A E(x,xj) = V, and a central limit theorem to show that T-112C~,q, -%N(O, V). These results, together with Slutsky’s theorem, imply that T1j2($ - 4) %N(O, V-l). When the process contains a unit root, this argument fails. The most obvious reason is that, when p = 1, E(x,x:) is not constant, but rather grows with t. Because of this, T-‘Cx,x: and Tel”Cxtqt no longer converge: convergence requires that Cxrxi be divided by T2 instead of T, and that CXJ, be divided by T instead of T’12. Moreover, even with these new scale factors, Tp2Cx,xi converges to a random matrix rather than a constant, and T- ’ Cx,q, converges to a non-normal random vector. However, even this argument is too simple, since the standard approach can be applied to a specific linear combination of the regressors. To see this, rearrange the regressors in (2.1) so that Y, = Y~AY,-~ +Y*Yt-I
(2.2)
+rll,
where yi = - d2 and y2 = C#J~ + b2. Regression (2.2) is equivalent to regression in the sense that the OLS estimates of 4i and 42 are linear transformations the OLS estimators of yi and y2. In terms of the transformed regressors
(2.1) of
(2.3) and the asymptotic behavior the large sample behavior
of pi and f2 (and hence 6) can be analyzed by studying
of the cross products
CAY:_,,CAY,_,Y,-,,CY:-,,
CAY~-~V~ and CY,-in,. To begin, consider the terms CAY:_ 1 and CAY,_ lvr. Since 41 + 42 = y2 = 1, AY, = - 42Ayr_l 3Many (1978).
of the insights
(2.4)
+ nl. developed
by analyzing
this example
are discussed
in Fuller (1976) and Sims
M. W. Watson
2850
Since 1c$~1< 1, Ayt (and hence Ay,_ , ) is covariance stationary with mean zero. Thus, standard asymptotic arguments imply that T- ‘CAY:_ 1 An& and T-“*CAy,_ iqt %N(O,G&). This means that the first regressor in (2.2) behaves in the usual way. “Unit root” complications arise only because of the second regressor, y, _ 1. To analyze the behavior of this regressor, solve (2.4) backwards for the level of y,: y,=u
+dQ-151+Yo+s,>
(2.5)
where rl=Ci=iyls and s,= -(l +(p2)-1C~~~(-~2)i+1ylt_i, and vi=0 for ib0 has been assumed for simplicity. Equation (2.5) is the BeveridgeeNelson (1981) decomposition of y,. It decomposes y, into the sum of a martingale or “stochastic trend” [(l + 4,)) ‘&I, a constant (y,) and an I(0) component (s,). The martingale component has a variance that grows with t, and (as is shown below) it is this component that leads to the nonstandard behavior of the cross products Cy:_ i, CY~-~AY~-~ and D-i% Other types of trending regressors also arise naturally in time series models and their presence affects the sampling distribution of coefficient estimators. For example, suppose that the AR(2) model includes a constant, so that Y,=~+~,AY,-,+y,y,~,+~lr.
(2.6)
This constant introduces two additional complications. First, a column of l’s must be added to the list of regressors. Second, solving for the level of y, as above: y,=(l
+ $+-iClt+(l
+ &-14,+yo+st.
(2.7)
The key difference between (2.5) and (2.7) is that now y, contains the linear trend (1 + 4*)-l~lt. This means that terms involving yrP 1 now contain cross products that involve linear time trends. Estimators of the coefficients in equation (2.6) can be studied systematically by investigating the behavior of cross products of(i) zero mean stationary components (like qt and Ay,_ i), (ii) constant terms, (iii) martingal,es and (iv) time trends. We digress to present a useful lemma that shows the limiting behavior of these cross products. This lemma is the key to deriving the asymptotic distribution for coefficient estimators and test statistics for linear regressions involving I(0) and I(1) variables, for tests for cointegration and for estimators of cointegrating vectors. While the AR(2) example involves a scalar process, most of the models considered in this survey are multivariate, and so the lemma is stated for vector processes.
2.3.
A useful lemma
Three key results are used in the lemma. The first is the functional theorem. Letting qt denote an n x 1 martingale difference sequence,
central limit this theorem
Ch. 47: Vector Autoregressions
2851
and Cointegration
expresses the limiting behavior of the sequence of partial sums 5, = xi= tqs, Wiener or Brownian t= l,..., T, in terms of the behavior of an II x 1 standardized motion process B(s) for 0 <s < 1.4 That is, the limiting behavior of the discrete time random walk 5, is expressed in terms of the continuous time random walk B(s). The result implies, for example, that T 1’2iJ,sTl*B(s) - N(0, s), for 0 < s < 1, where [ST] denotes the first integer less than or equal to ST. The second result used in the lemma is the continuous mapping theorem. Loosely, this theorem says that the limit of a continuous function is equal to the function evaluated at the limit of its arguments. The nonstochastic version of this theorem implies that T “C,‘= 1t = T-‘~T=I(t/T)+~$ds = $. The stochastic version implies that T-312CT=1<, = T-‘CT=1(T-“25,)jS~B(s)ds. The final result is the convergence of Tp’Cyt_lq: to the stochastic integral IkB(s) dB(s)‘, which is one of the moments directly under study. These key results are discussed in Wooldridge’s chapter of the Handbook. For our purposes they are important because they lead to the following lemma. Lemma 2.3
Let 11, be an n x 1 vector of random variables with E(v~[v~_ . . , or) = 0, E(~,~:l~,~ l,. . . , u],) = In, and bounded fourth moments. Let F(L) = C,“=,FiLi and G(L) = C,p)=, GiLi denote two matrix polynomials in the lag operator with C,&ilFil < CC and C,&ilGil < 00. Let 5, = C:=rvs, and let B(s) denote an n x 1 dimensional Brownian motion process. Then the following converge jointly: r,
.
-N)~B(s)ds, (4 T- “2CWh, *j-B(s)dW’, (b) T-‘CS,s:+l a F( 1)’ + jB(s) dB(s)‘F( l)‘, (4 T- ‘CS,CWh,l’ (d) T- ’ C C’(L)V,ICG(L)~I,I’JL CZ 1FiGi> =+ dB(s)‘F( l)‘, (4 T-3’2C~CWhr+Il’ (f) (g) (h)
T-3’%& T-‘C5,5: T-5’2Ct&
* IB(s) ds, =+(s)B(s)’ ds, =s {sB(s) ds,
where, to simplify notation 1; is denoted by J. The lemma follows from results in Chan and Wei (1988) together with standard versions of the law of large numbers and the central limit theorem for martingale difference sequences [see White (1984)]. Many versions of this lemma (often under assumptions slightly different from those stated here) have appeared in the literature. For example, univariate versions can be found in Phillips (1986, 1987a), Phillips and Perron (1988) and Solo (1984), while multivariate versions (in most cases covering higher order integrated processes) can be found in Park and Phillips (1988, 1989), Phillips and
“Throughout this paper B(s) will denote n x 1 process with independent increments
a multivariate standard Brownian motion process, i.e., an B(r) - B(s) that are distributed N(O,(r - s)l,) for r > s.
M. W. Watson
2852
Durlauf (1986) Phillips and Solo (1992) Sims et al. (1990) and Tsay and Tiao (1990). The specific regressions that are studied below fall into two categories: (i) regressions that include a constant and a martingale as regressors or, (ii) regressions that include a constant, a time trend and a martingale as regressors. In either case, the coefficient on the martingale is the parameter of interest. The estimated value of this coefficient can be calculated by including a constant or a constant and time trend in the regression, or, alternatively, by first demeaning or detrending the data. It is convenient to introduce some notation for the demeaned and detrended martingales and their limiting Brownian motion representations._Thus let tf = 5, T-‘CT= it, denote the demeaned martingale, and let 5: = t, - pi - b2t denote the detrended martingale, where pi and b, are the OLS estimators obtained from the regression of 5, onto (1 t). Then, from the lemma, a straightforward calculation yields
T-
1’2~~sTldl(S) -
sl
B(r) dr = W(s)
0
and 1
T-“2~;s,,*B(s)
-
s
1 a,(r)B(rjdr
- s
0
a,(r)B(r)dr s
= p’(s),
0
where a,(r) = 4 - 6r and a2(r) = - 6 + 12r.
2.4.
Continuing
with the example
We are now in a position a scaled version of (2.3) T”2(fl TV2
-
to complete
T-‘day:_,
Yl) = ~2)
the analysis
T-
T-3’2~Ayt_1yr-l
T-3’2bv&-l
I[ X
1’2CA~t-
T-‘&~lvt
of the AR(2) example.
lylt
1
T-2C~:-
Consider
1 -’
1
From (2.5) and result (g) of the lemma, TP2Cyf_ 1*(l + 42))2JB(s)2 ds and from (b) T~‘Cyt_iq,=(l +4,)-‘JB(s)dB(s). Finally, noting from (2.4) that Ayt= LO. This result is particularly (1 + b2L)-‘qf, (c) implies that T~3’2CAyt~1y,~1 important because it implies that the limiting scaled “X’X”matrix for the regression is block diagonal. Thus,
Ch. 47: Vector Autoregressions
and Cointegration
2853
and
Two features of these results are important. First, y*i and y**converge at different rates. These rates are determined by the variability of their respective regressors: yi is the coefficient on a regressor with bounded variance, while y2 is the coefficient on a regressor with a variance that increases at rate t. The second important feature is that ‘y*ihas an asymptotic normal distribution, while the asymptotic distribution of f2 is non-normal. Unit root complications will affect statistical inference about y2 but not yl. Now consider the estimated regression coefficients 0, and 4, in the untransformed regression. Sin_ce 6, = - fl, T~‘*(c$~ - 4*) 3 N(0, o&f). Furthermore, since 4, = pi + ‘y^*,T1/*(bl, 4,) = P*(p, - yi) + T1i2(y”2- y2) = T1/2(y”l - y,) + o,(l). That is, even though 4i depends on both 7, and y**,the “super consistency” of y**implies that its sampling error can be ignored in large samples. Thus, T1’*(Jl - 4,) 3 N(O,a&,*), so that both ~$i and b2 converge at rate T”* and have asymptotic normal distributions. Their joint distribution is more complicated. Since ~$t + c$* = y2, T1’2(41 - 4,) + Ty*(4, - c$*) = T1’*(ji2 - y2) LO and the joint asymptotic distribuiion of T1/*(dl - 4,) and T1’*(4* - 4,) is singular. The liner combination 4l + 42 converges at rate T to a non-normal distribution: T[(+, + 4,) (4, + &)I= 73, - Y,)*U + ~2)[ISB(s)2dSl-111SB(S)dB(S)l. There are two important practical consequences of these results. First, inference about 41 or about dz can be conducted in the usual way. Second, inference about the sum of coefficients 41 + e52 must be carried out using nonstandard asymptotic distributions. Under the null hypothesis, the t-statistic for testing the null H,: 4, = c converges to a standard normal random variable, while the r-statistic for testing the null hypothesis H,: +1 + c#* = 1 converges to [sB(s)* ds]-“*[JB(s)dB(s)], which is the distribution of the Dickey-Fuller T statistic (see Stock’s chapter of the Handbook). As we will see, many of the results developed for the AR(2) carry over to more general settings. First, estimates of linear combinations of regression coefficients converge at different rates. Estimators that correspond to coefficients on stationary regressors, or that can be written as coefficients on stationary regressors in a transformed regression (yl in this example), converge at rate T”* and have the usual asymptotic normal distribution. Estimators that correspond to coefficients on I( 1) regressors, and that cannot be written as coefficients on I(0) regressors in a transformed regression (y2 in this example), converge at rate T and have a nonstandard asymptotic distribution. The asymptotic distribution of test statistics is also affected by these results. Wald statistics for restrictions on coefficients corresponding to I(0) regressors have the usual asymptotic normal or x2 distributions. In
M. W. Watson
2854
general, Wald statistics for restrictions on coefficients that cannot be written as coefficients on l(0) regressors have nonstandard limiting distributions. We now demonstrate these results for the general VAR model with I(1) variables.
2.5.
A general framework
Consider
the VAR model
Y,=a+
f
@iY,_i+E,,
(2.8)
i=l
where Y, is an n x 1 vector and E, is a martingale difference sequence with constant conditional variance Z, (abbreviated mds(Z’,)) with finite fourth moments. Assume that the determinant of the autoregressive polynomial (I - @,z - @,z2 - ... - @,zp( has all of its roots outside the unit circle or at z = 1, and continue to maintain the simplifying assumption that all elements of Y, are individually I(0) or I(1).5 For simplicity, assume that there are no cross equation restrictions, so that the efficient linear estimators correspond to the equation-by-equation OLS estimators. We now study the distribution of these estimators and commonly used test statistics.’
2.5.1.
Distribution
of estimated
To begin, write the ith equation Yi,t =
xip+‘i .f
regression coejficients of the model as (2.9)
’
where yi,t is the ith element of Y,, X, = (1 Y:_ r Y:_ 2... Y:_,)’ is the (np + of regressors, /I is the corresponding vector of regression coefficients, and ith element of E,. (For notational convenience the dependence of p on i suppressed.) The OLS estimator of fi is fl= (CX,Xi)‘(CX,yi,,), so that (CX,x:)-l(Cx,Ei ,)’ As in the univariate
AR(2) model, the asymptotic
behavior
1) vector F~,~is the has been B - /I =
of b is facilitated
by
5Higher order integrated processes can also be studied using the techniques discussed here, see Park and Phillips (1988) and Sims et al. (1990). Seasonal unit roots (corresponding to zeroes elsewhere on the unit circle) can also be studied using a modification of these procedures. See Tsay and Tiao (1990) for a careful analysis of this case. 6The analysis in this section is based on a large body of work on estimation and inference in multivariate time series models with unit roots. A partial list of relevant references includes Chan and Wei (1988) Park and Phillips (1988, 1989) Phillips (1988) Phillips and Durlauf (1986), Sims et al. (1990), Stock (1987), Tsay and Tiao (1990), and West (1988). Additional references are provided in the body of the text.
Ch. 47: Vector Autoregressions and Cointegration
2855
transforming the regressors in a way that isolates the various stochastic and deterministic trends. In particular, the regressors are transformed as Z, = DX,, where D is nonsingular and Z, = (z~,,z~,~ ...z~,~)‘, where the zi,t will be referred to as “canonical” regressors. These regressors are related to the deterministic and stochastic trends given in Lemma 2.3 by the transformation
or
z, = F(L)V, -
1,
where v, = (9; 1 r: t)‘. The advantage of this transformation is that it isolates the terms of different orders of probability. For example, zi,( is a zero mean I(0) regressor, z2 t is a constant, the asymptotic behavior of the regressor z~,~ is dominated by the martingale component Fx3tt_ i, and z~,~ is dominated by the time trend Fd4t. The canonical regressors z*,~ and z~,~ are scalars, while zi f and zs,* are vectors. In the AR(2) example analyzed above, zl,* = Ay,_ i = (1 + 4,L)-iqri, so that F, i(L) = (1 + c$~L)- ‘; z~,~ is absent, since the model did not contain a constant;~,,,=y,_~=(1+~,)-‘5,_,+y,+s,_,,sothatF,,=(l+~,)-’,F,,=y, is absent since y, contains no deterandF,,(L)=&(l +$J1(l +~#~~L)-‘;andz,,, ministic drift. Sims et al. (1990) provide a general procedure for transforming regressors from an integrated VAR into canonical form. They show that Z, can always be formed so that the diagonal blocks, Fii, i > 2 have full row rank, although some blocks may be absent. They also show that F,, = 0, as shown above, whenever the VAR includes a constant. The details of their construction need not concern us since, in practice, there is no need to construct the canonical regressors. The transformation from the X, to the Z, regressors is merely an analytic device. It is useful for two reasons. First, X:D'(D')'/I = Ziy,with y = (D')'p.Thus the OLS estimators of the original and transformed models are related by 0'9= b.Second, the asymptotic properties of $ are easy to analyze because of the special structure of the regressors. Together these imply that we can study the asymptotic properties of b by first studying the asymptotic properties of y^ and then transforming these coefficients into the /?s. The transformation from X, to Z, is not unique. All that is required is some transformation that yields a lower triangular F(L) matrix. Thus, in the AR(2) example we set ~i~=Ay~_~ and ~~~=y~_~, but an alternative transformation would have set z1 f = Ay, _ 1 and z3 , = y, _ 2. Since we always transform results for
M.W. Watson
2856 the canonical
regressors Z, back into results for the “natural” regressors X,, this non-uniqueness is of no consequence. We now derive the asymptotic properties of y*constructed from the regression n x 1 martingale Y,,~= Z;v + ei f. Writing E, = ,Xj” ql, where qt is the standardized difference sequence from Lemma 2.3, then Q = CO’Q= q:o, where w’ is the ith row of ,YE12, and y*- y = (CZ,Z~)-‘(CZ&O). Lemma 2.3 can be used to deduce the asymptotic behavior of CZ,Z: and CZ,Y@O. Some care must be taken, however, since all of the z~,~elements of Z, are growing at different rates. Assume that zr,, contains k, elements, z~,~ contains k, elements, and partition y conformably with Z, as y = (yr yz y3,y4)‘, where yj are the regression coefficients corresponding to Zj,t. Let P2z Yy,= :
0
0
0
0
T1’2
0
0
0
0
0
0
TI!i, 0
kl
0 ~312
1
and consider Yu,(P - y) = (Y’U, ‘CZ,Z; Y; ‘)-‘( Y; ‘CZ,I@O). The matrix !PT multiplies the various blocks of (9, - y,),CZ,Z;, and CZ,q, by the scaling factors appropriate from the lemma. The first block of coefficients, yl, are coefficients on zero mean stationary components and are scaled up by the usual factor of T1j2; the same scaling factor is appropriate for yz, the constant term; the parameters making up y3 are coefficients on regressors that are dominated by martingales, and these need to be scaled by T; finally, y4 is a coefficient on a regressor that is dominated by a time trend and is scaled by 7’3/2. Applying the lemma, we have Y; ’ CZ,Z: !P; ’ * V, where, partitioning I/ conformably with Z,: T- l&.tz;,t
AC
T-‘&J2
-+F;,
T-2xz3,,zj
t
Fll,jF;l
(j
= 1/11? = V 227
*F33
V33, LJ
T-3~(z,,,)z T-j’2~zl,tz; T-3’2~z2,,z;
f f
A+
=
-I1-,0
= Vlj = Vi1 B(s)‘dsF;,
=sF,,
V449
= v,, = V;,,
s P F,,F,, T-2CZ2.tZ4,,
----
T- 5’2cz3,tz4,t
=aF,,
2 sB(s)dsF,, s
=
v24
=
v42y
=
v,,
=
Vk3,
for
j = 2,3,4,
Ch. 47: Vector Autoregressions
2857
and Cointegration
where the notation reflects the fact that F,, and F,, are scalars. The limiting value of this scaled moment matrix shares two important characteristics with its analogue in the univariate AR(2) model. First, V is block diagonal with Vlj = 0 for j # 1. (Recall that in the AR(2) model T-312CAyt_1yt_1 LO.) Second, many of the blocks of V contain random variables. (In the AR(2) model T-2Cy:_ 1 converged to a random variable.) Now, applying the lemma to YU, ‘CZ,q:w yields Y, ‘CZ,r]iw=- A, where, partitioning A conformably with Z,:
tq;w Gv[O,(o’o)vll]
T-li2&
= A,,
n
T- ““~z2
,q;co =
F,,
dB(s)‘w
= 4,
i T- ‘cz,
J;W
*Fa3
B(s)dB(s)‘w s I”
= A,,
Putting the results together, Y,(p - y)* V’A, and three important results follow. First, the individual coefficients converge to their values at different rates: y^i and 9, converge to their values at rate T’12, while all of the other coefficients converge more quickly. Second, the block diagonality of I/ implies that Tli2(y*, - y,) 3 N(0, cf V;,‘), where 0: = w’o = var(sf). Moreover, A, is independent of Aj forj > 1 [Chan and Wei (1988, Theorem 2.2)], so that T”‘(y^, - yl) is asymptotically independent of the other estimated coefficients. Third, all of the other coefficients will have non-normal limiting distributions, in general. This follows because Vj3 # 0 for j > 1, and A, is non-normal. A notable exception to this general result is when the canonical regressors do not contain any stochastic trends, so that z~,~ is absent from the model. In this case I/ is a constant and A is normally distributed, so that the estimated coefficients have a joint asymptotic normal distribution.’ The leading example of this is polynomial regression, when the set of regressors contains covariance stationary regressors and polynomials in time. Another important example is given by West (1988), who considers the scalar unit root AR(l) model with drift. The asymptotic distribution of the coefficients /? that correspond to the “natural” regressors X, can now be deduced. It is useful to begin with a special case of the general model, Yi,,
=
PI
+
x;,*B2
+
x;,t83
+
‘i,r>
(2.10)
‘A,, A,, and A, are jointly normally distributed since Js’dB(s)‘w is a normally distributed random variable with mean 0 and variance (o’w)J?ds.
M. W. Watson
2858
where ~i,~ = 1 for all t,~~,~ is an h x 1 vector of zero mean I(0) variables and x3,( contains the other regressors. It is particularly easy to transform this model into canonical form. First, since x~,~ = 1, we can set z~,~ = ~i,~; thus, in terms of the transformed regression, 0, = y2. Second, since the elements of x~,~ are zero mean I(0) variables, we can set the first h elements of z~,~ equal to x~,~; thus /3* is equal to the first h elements of yi. The remaining elements of z, are linear combination of the regressors that need not concern us here. In this example, since fi2 is a subset of the elements of yi, T”‘(B, - b2) is asymptotically normal and independent of the coefficients corresponding to trend and unit root regressors. This result is very useful because it provides a constructive sufficient condition for estimated coefficients to have an asymptotic normal limiting distribution: whenever the block of coefficients can be written as coefficients on zero mean I(0) regressors in a model that includes a constant term they will have a joint asymptotic normal distribution. Now consider the general model. Recall that fi= D’y*. Let dj denote the jth column of D, and partition this conformably with y, so that dj =it;j_d;jd\jdkj)), where dij and qi are the same dimension. Then thejth elem_ent of /? is pj = Cidijpi. Since the components of y^ converge at different rates, flj will converge at the slowest rate of the gi included in the sum. Thus, when d,j # 0, pj will converge at rate T1/2, the rate of convergence of $,. Distribution
2.5.2.
of Wald test statistics
Consider Wald test statistics for linear hypotheses a q x k matrix with full row rank,
of the form R/3 = r, where R is
(Recall that fi corresponds to the coefficients in the ith equation, so that W tests within-equation restrictions.) Letting Q = R(D’), an equivalent way of writing the Wald statistic is in terms of the canonical regressors Z, and their estimated coefficients y^,
w
=
(Q?- 4’CQ(%Z;)-‘Q’l6;
‘(Qr* - 4
Care must be taken when analyzing the large sample behavior of W because the individual coefficients in p converge at different rates. To isolate the different components, it is useful to assume (without loss of generality) that Q is upper triangular.* *This assumption is made without loss of generality since the constraint Qy = r (and the resulting Wald statistic) is equivalent to CQy = Cr, for nonsingular C. For any matrix Q, C can chosen so that CQ is upper triangular.
Ch. 47: Vector Autoregressions
2859
and Cointegration
Now, partition Q, conformably with 9 and the canonical regressors making up Z,, so that Q = [qij] where qij is a qi x kj matrix representing qi constraints on the kj elements in yj. These blocks are chosen SO that qii has full row rank and qij = 0 for i <j. Since the set of constraints Qy = r may not involve yi, the blocks qij might be absent for some i. Thus, for example, when the hypothesis concerns only y3, then Q is written as Q = [q31q32q33q34], where q31 = 0, q32 = 0 and q33 has full row rank. Partition r = (I; r; r; rk)’ conformably with Q, where again some of the li may be absent. Now consider the first q1 elements of Q$qll$l + q12y2 + q13p3 + q14f4. Since yj, for j > 2, converges more quickly than PI and p2, the sampling error in this vector will be dominated asymptotically by the sampling error in qllfl + q12f2. Similarly, the sampling error in the next group of q2 elements of Q9 is dominated by q22y*2, in the next q3 by q33y*3, etc. Thus, the appropriate scaling matrix for Qp--r is T"ZI
I
!i+
0
0
0
T'121 0 42
TIq3 0
0
0
0
41
Now, write the Wald statistic
But, under
0 ’
00 T3i21
94
as
the null,
T1’2(qd1+ q12f2+ q13Y*3 + q14.94 -rl) = T1’2(q1191 + q12f2- rl) + o,(l), T”-
lqqjjTj
+ . . . + qj4p4 _
rj)
=
Thus. if we let
p311q12 0 f-j=
0” q;’ ,”
1 0
0
01
0”
)
G’q&$!
then
*,(QY- r) =
e”YT(p -
y) + o&l)
and
T(j- ‘)I2 (qjj~j - rj) + O,(l),
for
j > 1.
M. W. Watson
2860
under
the nu11.9 Similarly,
it is straightforward
to show that
Finally, since Yu,(g - y)= v/-IA and Y’V,‘CZ,Z: Y/s’* V, then W=>(Ql’/-‘A)’ x (Qv-‘Q)-‘(Qv?4). The limiting distribution of W is particularly simple when qii = 0 for i > 2. In this case, all of the hypotheses of interest concern linear combinations of zero mean I(0) regressors, together with the other regression coefficients. When q12 = 0, so that the constant term is unrestricted, we have
a:w=cq11v1-r1UCq11(Czl,,z;,,)-‘q;,l-‘cq1l~~l
-?,)I + O,(l)?
so that W 3x:, . When the constraints involve other linear combinations of the regression coefficients, the asymptotic x2 distribution of the regression coefficients will not generally obtain. This analysis has only considered tests of restrictions on coefficients from the same equation. Results for cross equation restrictions are contained in Sims et al. (1990). The same general results carry over to cross equation restrictions. Namely, restrictions that involve subsets of coefficients, that can be written as coefficients on zero mean stationary regressors in regressions that include constant terms, can be tested using standard asymptotic distribution theory. Otherwise, in general, the statistics will have nonstandard limiting distributions.
2.6.
Applications Testing lag length restrictions
2.6.1. Consider
the VAR(p + s) model, P+S
Y,=Cr+
C ~iY*_i+‘r i=l
and the null hypothesis H,: Qp+ 1 = Qpt2 = ... = @p+s = 0, which says that the true model is a VAR(p). When p 2 1, the usual Wald (and LR and LM) test statistic for H, has an asymptotic x2 distribution under the null. This can be demonstrated by rewriting the regression so that the restrictions in H, concern coefficients on zero mean stationary regressors. Assume that AY, is I(0) with mean p, and then
941*is the only off-diagonal at rate T”‘.
element
appearing
in @. It appears
because
fl and f, both converge
Ch. 47: Vector Autoreyressions
2861
and Cointeyration
rewrite the model as pis-
Y,=Z+AY,_, +
2
1
Oi(AY,_i-p)+~t,
i=l
where A = ~~~~ Qi, Oi = - x$‘T:+ 1Qj and a”= c1+ Cfz:- ’ Oip. The restrictions = 0, in the original model are equivalent to 0, = = ... = Qpcs @P+l = cDp+2 @p+l=...= Op+s_ 1 in the transformed model. Since these coefficients are zero mean I(0) regressors in regression equations that contain a constant term, the test statistics will have the usual large sample x2 distribution. Testing for Granger causality
2.6.2. Consider
y2,t
the bivariate
=
‘2
+
IfI
i=l
VAR model
42l,iYl,t-i
+
IfI +*2,iY2.t-i
i=l
+ ‘2,t’
The restriction that yZ,t does not Granger-cause yl,, corresponds to the null hypothesis Ho: 412,1 = 412,2 = ... = c),,,, = 0. When (yl f y2,,) are covariance stationary, the resulting Wald, LR or LM test statistic for &is hypothesis will have a large sample x,’ distribution. When (y, t y, ,) are integrated, the distribution of the test statistic depends on the location ok &it roots in the system. For example, suppose that yl,, is I(l), but that y, , is I(0). Then, by writing the model in terms of deviations of y,,, from its mean, the’ testrictions involve only coefficients on zero mean I(0) regressors. Consequently, the test statistic has a limiting x,: distribution. When yZ,t is I(l), then the distribution of the statistic will be asymptotically x2 and y2,1 are cointegrated. When yl,, and y,,, are not cointegrated, the when Y, t Grangerlcausality test statistic will not be asymptotically x2, in general. Again, the first result is easily demonstrated by writing the model so the coefficients of interest appear as coefficients on zero mean stationary regressors. In particular, and y,,, are cointegrated, there is an I(0) linear combination of the when Y~,~ variables, say w, = yZ,r - ;l~,,~, and the model can be rewritten as
Y1.z = al +
i
&ll,iYl,t-i
+
i
+12,itwr-i -Pw) +
&l.t3
where pw is the mean of wt,E1 = ~+C~Z1~lz,i~~ and 4,l.i = 4ll.i + 412,i& i=l , . . . , p. In the transformed regression, the Granger-causality restriction corresponds to the restriction that the terms w,-i - pL, do not enter the regression. But
M. W. Watson
2862
these are zero mean I(0) regressors in a regression that includes a constant, so that the resulting test statistics will have a limiting xf distribution. When ~i,~ and y, ~ are not cointegrated, the regression cannot be transformed in this way, and the resulting test statistic will not, in general, have a limiting x2 distribution.” The Mankiw-Shapiro (1985)/Stack-West (1988) results concerning Hall’s test of the life-cycle/permanent income model can now be explained quite simply. Mankiw and Shapiro considered tests of Hall’s model based on the regression of AC, (the logarithm of consumption) onto y,_i (the lagged value of the logarithm of income). Since y,_ 1 is (arguably) integrated, its regression coefficient and tstatistic will have a nonstandard limiting distribution. Stock and West, following Hall’s (1978) original regressions, considered regressions of c, onto c,_ 1 and y,_ 1. Since, according to the life-cycle/permanent income model, c,_ 1 and y,_ 1 are cointegrated, the coefficient on y,_ 1 will be asymptotically normal and its t-statistic will have a limiting standard normal distribution. However, when y,_ 1 is replaced in the regression with m,_, (the lagged value of the logarithm of money), the statistic will not be asymptotically normal, since c, _ 1 and m,_ 1 are not cointegrated. A more detailed discussion of this example is contained in Stock and West (1988). Spurious regressions
2.6.3.
In a very influential paper in the 1970’s, Granger and Newbold (1974) presented Monte Carlo evidence reminding economists of Yule’s (1926) spurious correlation results. Specifically, Granger and Newbold showed that a large R2 and a large t-statistic were not unusual when one random walk was regressed on another, statistically independent, random walk. Their results warned researchers that standard measures of fit can be very misleading in “spurious” regressions. Phillips (1986) showed how these results could be interpreted quite simply using the framework outlined above, and his analysis is summarized here. Let Yi,, and y2,t be two independent random walks Yl,,-1
+ %,t,
Y2.t = Y2,,-
1 + E2.v
YlJ
=
where E, = (E~,~E~,~)’is an mds(ZJ with finite fourth moments, and {~i,~}~, i and {~~,~}~r,1 are mutually independent. For simplicity, set y,,, = y,,, = 0. Consider the linear regression of y2,* onto ~i,~, (2.11)
Y2.r = BYi,, + % where u, is the regression fi = 0 and u, = Y~,~.
error.
Since y, ,f and y,,, are statistically
“A detailed discussion of Granger-causality (1990) and Toda and Phillips (1993a, b).
tests in integrated
systems
in contained
independent
in Sims et al.
M. W. Watson
2864
Y2,t= PYI,, +
(2.16)
u2,t3
where u, = (~i,~ u2,J’ = DEB,wh ere E, is an mds(Z,) with finite fourth moments. Like I(1): yr,, is a the spurious regression model, both yr,, and y2,t are individually random walk, while Ay2,t follows a univariate ARMA(l, 1) process. Unlike the spurious regression model, one linear combination of the variables y2,t - fi~i,~ = u2,t is I(O), and so the variables are cointegrated. Stock (1987) derives the asymptotic distribution of the OLS estimator of cointegrating vectors. In this example, the limiting distribution is quite simple. Write
(2.17)
and let dij denote
the ijth element
Then the limiting lemma:
behavior,
of D, and Di = (di, di2) denote
or the denominator
T-~C(~,,,)~ = DlCT-2~t,t:lD;=‘D~
[
the ith row of D.
of b - j?, follows directly
from the
jB(s)B(s)ds]D;,(2.18)
where 5, is the bivariate random walk, with A& = E, and B(s) is a 2 x 1 Brownian motion process. The numerator is only slightly more difficult:
T-CYdzt=
Putting
1
3
T-'CYl,l442,t+
T-'CAY1,u2t 1 9
= D,[T-‘~5,_l~;]D;
+ Dl[T-l&~;]D;
*Dl[
+ D,D;.
j$s)dB(s).]D;
(2.19)
these two results together,
+ D, D;
I[ s D,
B(s)B(s)‘dsD;
1 -1
.
(2.20)
There are three interesting features of the limiting representation (2.20). First, j? is “super consistent,” converging to its true value at rate T. Second, while super consistent, fi is asymptotically biased, in the sense that the mean of the asymptotic distribution is not centered at zero. The constant term DID; = d,,d,, + d,,d,, that appears in the numerator of (2.20) is primarily responsible for this bias. To see the source of this bias, notice that the regressor yl,t is correlated with the error term u~,~. In standard situations, this “simultaneous equation bias” is reflected in
Ch. 47: Vector Autoregressions
2865
and Cointegration
large samples as an inconsistency in B. With cointegration, the regressor is I(1) and the error term is I(O), so no inconsistency results; the “simultaneous equations bias” shows up as bias in the asymptotic distribution of b. In realistic examples this bias can be quite large. For example, Stock (1988) calculates the asymptotic bias that would obtain in the OLS estimator of the marginal propensity to consume, obtained from a regression of consumption onto income using annual observations with a process for u, similar to that found in U.S. data. He finds that the bias is still -0.10 even when 53 years of data are used.’ ’ Thus, even though the OLS estimators are “super” consistent, they can be quite poor. The third feature of the asymptotic distribution in (2.20) involves the special independent. In case in which d,, = d,, = 0 so that u1 , and u2, are statistically this case the OLS estimator corresponds to the Gaussian maximum likelihood estimation (MLE). When d 12 = d,, = 0, (2.20) simplifies to
s , s &W&(4
(2.21)
B,(s)*ds
where B(s) is partitioned as B(s) = [B,(s)B,(s)]‘. This result is derived in Phillips and Park (1988) where the distribution is given a particularly simple and useful interpretation. To develop the interpretation, suppose for the moment that u~,~ = dZZcZ,twas n.i.i.d. (In large samples the normality assumption is not important; it is made here to derive simple and exact small sample results.) Now, consider the distribution of $ conditional on the regressors {y,,,}T= i. Since Q is n.i.i.d., the restriztion d,, = d,, = 0 implies that u~,~ is independent of {y, ,}f’ 1. This means distribution t_hat8-Dl{y,JT=, - N(0,d~,CC(Y,,,)21-‘),so that the unconditional /I - p is normal with mean zero and random covariance matrix, d:2[C(yl,t)2]-1. In large samples, T-2C(y,,,)2~dIlSB1(S)2dS, so that T(fi--/I) converges to a normal random variable with a mean of zero and random covariance matrix, (d,,ld,,)2CSBl(s)2dsl-1.Thus, T(B - - p) has an asymptotic distribution that is a random mixture of normals. Since the normal distributions in the mixture have a mean of zero, the asymptotic distribution is distributed symmetrically about zero, and thus j!?is asymptotically median unbiased. The distribution is useful, not so much for what it implies about the distribution of b, but for what it implies about the t-statistic for fi. When d,, or d,, are not equal to zero, the t-statistic for testing the null fi = /I0 has a nonstandard limiting distribution, analogous to the distribution of the Dickey-Fuller t-statistic for testing the null of a unit AR coefficient in a univariate regression. However, when d,, = d,, = 0, the t-statistic has a limiting standard normal distribution. To see “Stock (1988, Table 4). These results are for durable plus nondurable consumption is used, Stock estimates the bias to be -0.15.
consumption.
When nondurable
M. W. Watson
2866
why this is true, again consider the situation in which u~,~is n.i.i.d. When d, Z = d, I= 0, the distribution of the t-statistic for testing b = PO conditional on {yl,t},‘E 1 has an exact Student’s t distribution with T - 1 degrees of freedom. Since this distribution does not depend on {Y~,~},‘=1, this is the unconditional distribution as well. This means that in large samples, the t-statistic has a standard normal distribution. AS we will see in this next section, the Phillips and Park (1988) result carries over to a much more general setting. In the example developed here, u, = DE,is serially uncorrelated. This simplifies the analysis, but all of the results hold more generally. For example, Stock (1987) assumes that u,= D(L)&,, where D(L)=~t?L,DiL.f,lD(l)/ #O and C,?Y, i/D,/ < co. In this case,
(2.22)
where Dj(l) is thejth
row of D(1) and
Dj,i is thejth row of Di.Under the additional
assumption that d12(1) = dZl(l) = 0 and Cz?LoD,,iD;,i = 0,T(b- /I) is distributed as a mixed normal (asymptotically) and the r-statistic for testing /J’= /3, has an asymptotic normal distribution when d12(1) = dT1(l) = 0 [see Phillips and Park (1988) and Phillips (1991a)l.
2.7.
Implications
for econometric
practice
The asymptotic results presented above are important because they determine the appropriate critical values for tests of coefficient restrictions in VAR models. The results lead to three lessons that are useful for applied practice. (1) Coefficients that can be written as coefficients on zero mean I(0) regressors id regressions that include a constant term are asymptotically normal. Test statistics for restrictions on these coefficients have the usual asymptotic x2 distributions. For example, in the model Y,
=
YlZl,,
+
Y2
+
Y3Z3.t
+
Y‘$t+
Et,
where z1 f is a mean zero I(0) scalar regressor regressor: this result implies that Wald statistics totically x2.
(2.23) and z3 t is a scalar martingale for tesiing H,:y1= c is asymp-
(2) Linear combinations of coefficients that include coefficients on zero mean I(0) regressors together with coefficients on stochastic or deterministic trends will have asymptotic normal distributions. Wald statistics for testing restrictions on these
Ch. 47: Vector Autoregressions and Cointegration
2867
linear combinations will have large sample x2 distributions. Thus in (2.23) Wald statistics for testing H,: R,y, + R,y, + R,y, = r, will have an asymptotic x2 distribution if R, # 0. (3) Coefficients that cannot be written as coefficients on zero mean I(0) regressors (e.g. constants, time trends, and martingales) will, in general, have nonstandard asymptotic distributions. Test statistics that involve restrictions on these coefficients that are not a function of coefficients on zero mean I(0) regressors will, in general, have nonstandard asymptotic distributions. Thus in (2.23), the Wald statistic for testing: H,: R(y, y3 y4)’ = I has a non-x’ asymptotic distribution, as do test statistics for composite hypotheses of the form H,: R(y, y3 y4)’ = r and y1 = c. When test statistics have a nonstandard distribution, critical values can be determined by Monte Carlo methods by simulating approximations to the various functionals of B(s) appearing in Lemma 2.3. As an example, consider using Monte Carlo methods to calculate the asymptotic distribution of sum of coefficients 4i + 42 = y2 in the univariate AR(2) regression model (2.1). Section 2.4 showed that where B(s) is a scalar Brownian T(?, - ~2)=4 + ~2)CjB(S)2ds1-1CSB(s)dB(s)l, motion process. If x, is generated as a univariate Gaussian random walk, then one draw of the random variable [JB(s)‘ds] - ‘[jB(s)dB(s)] is well approximated ) with T large. (A value of T = 500 provides an by (T-2~x~)-‘(T-‘~x,Ax,+, adequate approximation for most purposes.) The distribution of T(y*, -7,) can then be approximated by taking repeated draws of (T-2Cxf)-‘(Tp ‘CX~AX~+~) multiplied by (1 + 4,). An example of this approach in a more complicated multivariate model is provided in Stock and Watson (1988). Application of these rules in practice requires that the researcher know about the presence and location of unit roots in the VAR. For example, in determining the asymptotic distribution of Granger-causality test statistics, the researcher has to know whether the candidate causal variable is integrated and, if it is integrated, whether it is cointegrated with any other variable in the regression. If it is cointegrated with the other regressors, then the test statistic has a x2 asymptotic distribution. Otherwise the test statistic is asymptotically non-X2, in general. In practice such prior information is often unavailable, and an important question is what is to be done in this case?12 The general problem can be described as follows. Let W denote the Wald test statistic for a hypothesis of interest. Then the asymptotic distribution of the Wald statistic when a unit root is present, say F(WI U), is not equal to the distribution of the statistic when no unit root is present, say F( WI N). Let cU and cN denote “Toda and Phillips (1993a, b) discuss testing for Granger causality in a situation in which the researcher knows the number of unit roots in the model but doesn’t know the cointegrating vectors. They develop a sequence of asymptotic x2 tests for the problem. When the number of unit roots in the system in unknown, they suggest pretesting for the number of unit roots. While this will lead to sensible results in many empirical problems, examples such as the one presented at the end of this section show that large pretest biases are possible.
2868
M. W. Watson
the “unit root” and “no unit root” critical values for a test with size c(. That is, cu and cN satisfy: P( W > cu( U) = P( W > cN( N) = a under the null. The problem is that cu # cN, and the researcher does not know whether U or N is the correct specification. In one sense, this is not an ususual situation. Usually, the distribution of statistics depends on characteristics of the probability distribution of the data that are unknown to the researcher, even under the null hypothesis. Typically, there is that affect the distribution of the uncertainty over certain “nuisance parameters,” statistic of interest. Yet, typically the distribution depends on the nuisance parameters in a continuous fashion, in the sense that critical values are continuous functions of the nuisance parameters. This means that asymptotically valid inference can be carried out by replacing the unknown parameters with consistent estimates. This is not possible in the present situation. While it is possible to represent the uncertainty in the distribution of test statistics as a function of nuisance parameters that can be consistently estimated, the critical values are not continuous functions of these prameters. Small changes in the nuisance parameters ~ associated with sampling error in estimates - may lead to large changes in critical values. Thus, inference cannot be carried out by replacing unknown nuisance parameters with consistent estimates. Alternative procedures are required.13 Development of these alternative procedures is currently an active area of research, and it is too early to speculate on which procedures will prove to be the most useful. It is possible to mention a few possibilities and highlight the key issues. The simplest procedure is to carry out conservative inference. That is, to use the largest of the “unit root” and “no unit root” critical values, rejecting the null when W > max(c,, cN). By construction, the size of the test is less than or equal to a. Whenever W > max(c,,c,), so that the null is rejected using either distribution, or W < min(c,, cN), so that the null is not rejected using either distribution, one need not proceed further. However a problem remains when min(c,, cN) < W < max(c,, cN). In this case, an intuitively appealing procedure is to look at the data to see which hypothesis - unit root or no unit root - seems more plausible. This approach is widely used in applications. Formally, it can be described as follows. Let y denote a statistic helpful in classifying the stochastic process as a unit root or no unit root process. (For example, y might denote a Dickey-Fuller “t-statistic” or one of the test statistics for cointegration discussed in the next section.) The procedure is then to define a region for y, say R,, and when yeR,, the critical value cu is used; otherwise the critical value cN is used. (For example, the unit root critical value might be used if the Dickey-Fuller “t-statistic” was greater than -2, and the no unit root critical value used when the DF statistic
13Alternatively, using “local-to-unity” asymptotics, the critical values can be represented as continuous functions of the local-to-unity parameter, but this parameter cannot be consistently estimated from the data. See Bobkoski (1983), Cavanagh (1985), Chan and Wei (1987), Chan (1988), Phillips (1987b) and Stock (1991).
2869
Ch. 47: Vector Autoregressions and Cointegration was
less than P(Type
-2.)
In this case, the probability
1 error) = P(W > co(y~R,)P(yeR,)
of type 1 error is + P(W > c,ly$R,)P(y$R,).
The procedure will work well, in the sense of having the correct size and a power close to the power that would obtain when the correct unit root or no unit root specification were known, if two conditions are met. First, P(~ER,) should be near 1 when the unit root specification is true, and P(y$R,) should be near 1 when the unit root specification is false, respectively. Second, P( W > cLi)yeR,) and P( W > cN )y $Ru) should be near P( W > cu 1U) and P( W > cNlN), respectively. Unfortunately, in practice neither of these conditions may be true. The first requires statistics that perfectly discriminate between the unit root and non-unit root hypotheses. While significant progress has been made in developing powerful inference procedures [e.g. Dickey and Fuller (1979), Elliot et al. (1992), Phillips of classification errors is and Ploberger (1991), Stock (1992)], a high probability unavoidable in moderate sample sizes. In addition, the second condition may not be satisfied. An example presented in Elliot and Stock (1992) makes this point quite forcefully. [Also see Cavanagh and Stock (1985).] They consider the problem of testing whether the price-divided ratio helps to predict future changes in stock prices.14 A stylized version of the model is Pt -
4 = m-
1 -
AP,= HP,- I-
d,- 1)+ %.t,
(2.24)
4 - 1)+ ~z,t,
(2.25)
where pt and d, are the logs of prices and dividends, respectively, and (E~,~E~,~))is an mds(Z’,). The hypothesis of interest is H,: p = 0. Under the null, and when 14 I < 1, the t-statistic for this null will have an asymptotic standard normal distribution; when the hypothesis 4 = 1, the t-statistic will have a unit root distribution. (The particular form of the distribution could be deduced using Lemma 2.3, and critical values could be constructed using numerical methods.) The pretest procedure involves carrying out a test of 4 = 1 in (2.24), and using the unit root critical value for the t-statistic for fi = 0 in (2.25) when 4 = 1 is not rejected. If 4 = 1 is rejected, the critical value from the standard normal distribution is used. Elliot and Stock show that the properties of this procedure depends critically on the correlation between &I f and Ed f. To see why, consider an extreme example. In the data, dividends are much smoother than prices, so that most of the variance in the price-dividend ratio comes from movements in prices and not from dividends. Thus, E~,~and E~,~are likely to be highly correlated. In the extreme case, when
14Hodrick (1992) contains an overview of the empirical literature on the predictability of stock prices using variables like the price-dividend ratio. Also see, Fama and French (1988) and Campbell (1990).
M. W. Watson
2870
they are perfectly correlated, (p - b) is proportional to (6 - 4), and the “t-statistic” for testing /3 = 0 is exactly equal to the “t-statistic” for testing 4 = 1. In this case F(WIy) is degenerate and does not depend on the null hypothesis. All of the information in the data about the hypothesis /I = 0 is contained in the pretest. While this example is extreme, it does point out the potential danger of relying on unit root pretests to choose critical values for subsequent tests.
3. 3.1.
Cointegrated
systems
Introductory
comments
An important special case of the model analyzed in Section 4 is the cointegrated VAR. This model provides a framework for studying the long-run economic relations discussed in the introduction. There are three important econometric questions that arise in the analysis of cointegrated systems. First, how can the common stochastic trends present in cointegrated systems be extracted from the data? Second, how can the hypothesis of cointegration be tested? And finally, how should unknown parameters in cointegrating vectors be estimated, and how should inference about their values be conducted? These questions are answered in this section. We begin, in Section 3.2, by studying different representations for cointegrated systems. In addition to highlighting important characteristics of cointegrated systems, this section provides an answer to the first question by presenting a general trend extraction procedure for cointegrated systems. Section 3.3 discusses the problem of testing for the order of cointegration, and Section 3.4 discusses the problem of estimation and inference for unknown parameters in cointegrating vectors. To keep the notation simple, the analysis in Sections 3.2-3.4 abstracts from deterministic components (constants and trends) in the data. The complications in estimation and testing that arise when the model contains constants and trends is the subject of Section 3.5. Only I(1) systems are considered here. Using Engle and Granger’s (1987) terminology, the section discusses only CI(1,l) systems; that is, systems in which linear combinations of I(1) and I(0) variables are I(0). Extensions for CI(d, b) systems with d and b different from 1 are presented in Johansen (1988b, 1992c), Granger and Lee (1990) and Stock and Watson (1993).
3.2.
Representations
for the I (1) cointegrated
model
Consider the VAR xr= t i=l
z7iX,_i+E,,
(3.1)
Ch. 47: Vector Autoreyressions
2871
and’Cointegration
where x, is an n x 1 vector composed of I(0) and I(1) variables, and E, is an mds(Z,). Since each of the variables in the system are I(0) or I(l), the determinantal polynomial 1n(z)1 contains at most n unit roots, with n(z) = I - Cf= 1 IIizi. When there are fewer than n unit roots, then the variables are cointegrated, in the sense that certain linear combinations of the x,‘s are I(0). In this subsection we derive four useful representations for cointegrated VARs: (1) the vector error correction VAR model, (2) the moving average representation of the first differences of the data, (3) the common trends representation of the levels of the data, and (4) the triangular representation of the cointegrated model. All of these representations are readily derived using a particular SmithhMcMillan factorization of the autoregressive polynomial 17(L). The specific factorization used here was originally developed by Yoo (1987) and was subsequently used to derive alternative representations of cointegrated systems by Engle and Yoo (1991). Some of the discussion presented here parallels the discussion in this latter reference. Yoo’s factorization of n(z) isolates the unit roots in the system in a particularly convenient fashion. Suppose that the polynomial n(z) has all of its roots on or outside the unit circle, then the polynomial can be factored as U(z) = U(z)M(z)V(z), where U(z) and V(z) are n x n matrix polynomials with all of their roots outside the unit circle, and M(z) is an n x IZdiagonal matrix polynomial with roots on or outside the unit circle. In the case of the I(1) cointegrated VAR, M(L) can be written as
MM =
[
4o
0
z
I
1 9
where Ak = (1 - L)Z, and k + r = n. This factorization is useful because it isolates all of the VAR’s nonstationarities in-the upper block of M(L). We now derive alternative representations for the cointegrated system. 3.2.1.
The vector error correction VAR model (VECM)
To derive the VECM, equation as
subtract
x,_ 1 from both
sides of (3.1) and rearrange
the
p-1
Ax, = 17x,_
1 +
C
~i’Xt-i+Ef,
i=l
whereZ7= -1,,+C;=‘=,U,= -r;l(l),andQi= -CjP=i+lnj,i=l,...,p-l.Since n(l) = U(l)M(l)V( l), and M(1) has rank r, 17 = - I7( 1) also has rank r. denote an n x r matrix whose columns form a basis for the row space of that every row of 17 can be written as a linear combination of the rows Thus, we can write 17 = &z’, where 6 is an n x r matrix with full column
(3.2)
Let GI n, so of cc!. rank.
M. W. Watson
2872
Equation
(3.2) then becomes p’l
Ax, = &‘x,_i
+ c
@iAxt_i + E,
(3.3)
i=l
or p-1
Ax,=Sw,_i
+ 1
Q+x,_~+c,,
(3.4)
i=l
where w, = IX’X,. Solving (3.4) for w,_i shows that w,_ 1 = (6’S)- *S’[Ax, C;:t @iAx,_i - EJ, so that wt is I(0). Thus, the linear combinations of the potentially I(1) elements of x, formed by the columns of a are I(O), and the columns of c1are cointegrating vectors. The VECM imposes k < n unit roots in the VAR by including first differences of all of the variables and r = n - k linear combinations of levels of the variables. The levels of x, are introduced in a special way - as w, = rz’x, - so that all of the variables in the regression are I(0). Equations of this form appeared in Sargan (1964) and the term “error correction model” was introduced in Davidson et al. (1978).15 As explained there and in Hendry and von Ungern-Sternberg (1981), CI’X,= 0 can be interpreted as the “equilibrium” of the dynamical system, w, as the vector of “equilibrium errors” and equation (3.4) describes the self correcting mechanism of the system. The moving average representation
3.2.2.
To derive the moving
0 WJ = 1, o A [
so that M(L)M(L)
I
average
representation
for Ax,, let
1 ,
= (1 - L)Z,. Then,
M(L)M(L)V(L)x,
= ti(L)u(L)-‘&,,
so that I’(L)Ax, = ti(L)Cr(L)-
*st,
‘*As Phillips and Loretan (1991) point out in their survey, continuous time formulations of error correction models were used extensively by A.W. Phillips in the 1950’s. I thank Peter Phillips for drawing this work to any attention.
Ch. 47: Vector Autoregressions and Cointegration
2873
and Ax, = C(L)&,,
(3.5)
where C(L) = V(L)-iM(L)U There are two special characteristics of the moving average representation. First, C( 1) = V(l)- ‘a( l)U(l)- ’ has rank k and is singular when k < n. This implies that the spectral density matrix of Ax, evaluated at frequency zero, (27c_iC(l)Z,C( l)‘, is singular in a cointegrated system. Second, there is a close relationship between C(1) and the matrix of cointegrating vectors ~1.In particular, a’C(1) = 0.i6 Since w, = CI’X,is I(O), Aw, = CL’AX,is I(- 1) so that its spectrum at frequency zero, (27~)~~cr’C(l)C,C(l)‘cc, vanishes. The equivalence of vector error correction models and cointegrated variables with moving average representations of the form (3.5) is provided in Granger (1983) and forms the basis of the Granger Representation Theorem [see Engle and Granger (1987)]. 3.2.3.
The common trends representation
The common trends representation follows directly from (3.5). Adding tracting C(l)&, from the right hand side of (3.5) yields Ax, = C( l)s, + [C(L) - C( l)]e,. Solving
backwards
and sub-
(34
for the level of x,,
x, = C( 1)5,+ C*(L)&, +
x0,
(3.7)
where 5, = C:= l~, and C*(L) = (1 - L)-‘[C(L) - C(l)] = C,YLOC~Li, where Cr = -x,?=i+lCj and si=O for. i < 0 is assumed. Equation (3.7) is the multivariate BeveridgeeNelson (1981) decomposition of xt; it decomposes x, into its “permanent component,” C(l)& + x0, and its “transitory component,” C*(L)s,.17 Since C( 1) has rank k, we can find a nonsingular matrix G, such that C(l)G = [A 0, .,I, where A is an n x k matrix with full column rank.” Thus C(l)& = C(l)GG-‘<,, r6To derive this result, note from (3.2) and (3.3) that 17 = -n(l) = - U(l)M(l)V(l) = 6~‘. Since M(1) has zeroes everywhere, except the lower diagonal block which is I,,x’ must be a nonsingular transformation of the last r rows of V(1). This implies that the first k columns of u’V(l)-r contain only zeroes, so that a’V(l)-‘M(l)U(l) = a’C(1) = 0. “The last component can be viewed as transitory because it has a finite spectrum at frequency zero. Since U(z) and V(z) are finite order with roots outside the unit circle, the Ci coefficients decline exponentially for large i, and thus CiilC,I is finite. Thus the CT matrices are absolutely summable, and C*(l)Z,C*(l)’ is finite. “The matrix G is not unique. One way to construct G is from the eigenvectors of A. The first k columns of G are the eigenvectors corresponding to the nonzero eigenvalues of A and the remaining eigenvectors are the last n-k columns of G.
2874
so that x, = AT, +
C*(L)&, +
X0’
(3.8)
where r, denotes the first k components of G-l<,. Equation (3.8) is the common trends representation of the cointegrated system. It decomposes the n x 1 vector x, into k “permanent components” r, and n “transitory components” C*(L)&,. These permanent components often have natural interpretations. For example, in the eight variable (y, c, i, n, w, m, p, r) system introduced in Section 1, five cointegrating vectors were suggested. In an eight variable system with five cointegrating vectors there are three common trends. In the (y, c, i, II, m, p, r) systems these trends can be interpreted as population growth, technological progress and trend growth in money. The common trends representation (3.8) is used in King et al. (1991) as a device to “extract” the single common trend in a three variable system consisting of y,c and i. The derivation of (3.8) shows exactly how to do this: (i) estimate the VECM (3.3) imposing the cointegration restrictions; (ii) invert the VECM to find the moving average representation (3.5); (iii) find the matrix G introduced below equation (3.7); and, finally, (iv) construct t, recursively from r, = r,_ 1 + e,, where e, is the first element of G- ‘E,, and where E, denotes the vector of residuals from the VECM. Other interesting applications of trend extraction in cointegrated systems are contained in Cochrane and Sbordone (1988) and Cochrane (1994).
3.2.4.
The triangular representation
The triangular representation also represents x, in terms of a set of k non-cointegrated I(1) variables. Rather than construct these stochastic trends as the latent variables r, in the common trends representation, a subset of the x, variables are used. In particular, the triangular representation takes the form:
Ax1.t= Ul,,, X2.r
-
kt
=
(3.9) (3.10)
u2,*9
where x, = (xi,, xi 1)‘,~i,~ is k x 1 and x2 .f is r x 1. The transitory components u, = cu; f u; f)’ = D(L)E,, where (as we show below) D(1) has full rank. In this presentation, the first k elements of x, are the common trends and x~,~ - px, f the I(0) linear combinations of the data. To derive this representation from the VAR (3.2), use H(L) = U(L)M(L)V(L) write
W)~W)wJx,
= E,,
are reare to
(3.11)
Ch. 47: Vector Autoregressions
and Cointegration
2815
so that M(L)V(L)x, Now, partition
=
P’(L) as VII(L)
uu(L)
[ %1(L)
u&L)
V(L) =
(3.12)
u(L)-‘&,.
1 ’
where ull(L) is k x k, u12(L) is k x r, tizl(L) is r x k and uz2(L) is r x r. Assume that the data have been ordered so that uz2(L) has all of its roots outside the unit circle. (Since V(L) has.all of its roots outside the unit circle, this assumption is made with no loss of generality.) Now, let
1,
(AL) =[ B(L)
0 I,
1 ’
where B(L) = - u~~(L)-~u~~(L). Then M(L)v(L)C(L)C(L)-lx, or, rearranging
= u(L)-‘&,
(3.13)
and simplifying,
(3.14)
where p*(L) = (1 - L)- ‘[/i’(L) - p(l)] and /I = /I( 1). Letting G(L) denote the matrix polynomial on the left hand side of (3.14), the triangular representation is obtained by multiplying equation (3.14) by G(L)-‘. Thus, in equations (3.9) and (3.10), u, = D(L)&,, with D(L) = G(L)-‘U(L)‘. When derived from the VAR (3.2), D(L) is seen to have a special structure that was inherited from the assumption that the data were generated by a finite order VAR. But of course, there is nothing inherently special or natural about the finite order VAR; it is just one flexible parameterization for the x, process. When the triangular representation is used, an alternative approach is to parameterize the matrix polynomial D(L) directly. An early empirical study using this formulation is contained in Campbell and Shiller (1987). They estimate a bivariate model of the term structure that includes long term and short term interest rates. Both interest rates are assumed to be I(l), but the “spread” or difference between the variables is assumed to be I(0). Thus, in terms of (3.9))(3.10) ~i,~ is the short term interest rate, x2 t is the long rate and /I = 1. In their empirical work, Campbell and Shiller modeled the process U, in (3.10) as a finite order VAR.
Ch. 47: Vector Autoregressions
2877
and Cointegration
the constant (z,,,) and the deterministic time trends (z,,,). Hypothesis testing when deterministic components are present is discussed in Section 3.5. There are a many tests for cointegration: some are based on likelihood methods, using a Gaussian likelihood and the VECM representation for the model, while others are based on more ad hoc methods. Section 3.3.1 presents likelihood based (Wald and Likelihood Ratio) tests for cointegration constructed from the VECM. The non-likelihood-based methods of Engle and Granger (1987) and Stock and Watson (1988) are the subject of Section 3.3.2, and the various tests are compared in Section 3.3.3. to
3.3.1.
Likelihood
In Section
based tests for cointegration”
3.2.1 the general
VECM
was written
as
p-1
Ax, = SCI’X,_ i + C @iA~,-i+~,.
(3.3)
i=l
To develop the restrictions on the parameters in (3.3) implicit in the null hypothesis, first partition the matrix of cointegrating vectors as CI= [a, up] where ~1,is an n x r. matrix whose columns are the cointegrating vectors present under the null and ~1, is the n x I, matrix of additional cointegrating vectors present under the alternative. Partition 6 conformably as 6 = [S,S,], let r =(@, Q2... Qp_i) and let z, = )‘. The VECM can then be written as (Ax;_~ Ax;_~~x:_~+~
Ax, = S&xt_
1
+
Sac+_
1
+
I-z, + et,
(3.15)
where, under the null hypothesis, the term d,~lhx~_ 1 is absent. This suggests writing the null and alternative hypotheses as Ho: 6, = 0 vs. H,: 6, # 0.21 Written in this way, the null is seen as a linear restriction on the regression coefficients in (3.15). An important complication is that the regressor cQ_ 1 depends on parameters in ~1, that are potentially unknown. Moreover, when 6, = 0, c+_ I does not enter the regression, and so the data provide no information about any unknown parameters in cls. This means that these parameters are econometrically identified only under the alternative hypothesis, and this complicates the testing problem in ways discussed by Davies (1977, 1987), and (in the cointegration context) by Engle and Granger (1987). In many applications, this may not be a problem of practical consequence, since the coefficients in a are determined by the economic theory under consideration. For example, in the (y,c, i, w, n,r,m,p) system, candidate error correction terms “Much of the discussion in this section is based on material in Horvath and Watson (1993). ZIFormally, the restriction rank@,&) = rO should be added as a qualifier to H,. Since this constraint is satisfied almost surely by unconstraiied estimators of (3.15) it can safely be ignored when constructing likelihood ratio test statistics.
with no unknown parameters are y − c, y − i, (w − p) − (y − n) and r. Only one error correction term, m − p − β_y y − β_r r, contains potentially unknown parameters. Yet, when testing for cointegration, a researcher may not want to impose specific values of potential cointegrating vectors, particularly during the preliminary data analytic stages of the empirical investigation. For example, in their investigation of long-run purchasing power parity, Johansen and Juselius (1992) suggest a two-step testing procedure. In the first step, cointegration is tested without imposing any information about the cointegrating vector. If the null hypothesis of no cointegration is rejected, a second stage test is conducted to see if the cointegrating vector takes on the value predicted by economic theory. The advantage of this two-step approach is that it can uncover cointegrating relations not predicted by the specific economic theory under study. The disadvantage is that the first stage test for cointegration will have low power relative to a test that imposes the correct cointegrating vector. It is useful to have testing procedures that can be used both when cointegrating vectors are known and when they are unknown. With these two possibilities in mind, we write r = r_k + r_u, where r_k denotes the number of cointegrating vectors with known coefficients, and r_u denotes the number of cointegrating vectors with unknown coefficients. Similarly, write r_0 = r_0k + r_0u and r_a = r_ak + r_au, where the subscripts "k" and "u" denote known and unknown respectively. Of course, the r_ak subset of "known cointegrating vectors" are present only under the alternative, and α_ak'x_t is I(1) under the null. Likelihood ratio tests for cointegration with unknown cointegrating vectors (i.e. H_0: r = r_0u vs. H_a: r = r_0u + r_au) are developed in Johansen (1988a), and these tests are modified to incorporate known cointegrating vectors (nonzero values of r_0k and r_ak) in Horvath and Watson (1993). The test statistics and their asymptotic null distributions are developed below. For expositional purposes it is convenient to consider three special cases. In the first, r_a = r_ak, so that all of the additional cointegrating vectors present under the alternative are assumed to be known. In the second, r_a = r_au, so that they are all unknown. The third case allows nonzero values of both r_ak and r_au. To keep the notation simple, the tests are derived for the r_0 = 0 null. In one sense, this is without loss of generality, since the LR statistic for H_0: r = r_0 vs. H_a: r = r_0 + r_a can always be calculated as the difference between the LR statistics for [H_0: r = 0 vs. H_a: r = r_0 + r_a] and [H_0: r = 0 vs. H_a: r = r_0]. However, the asymptotic null distribution of the test statistic does depend on r_0k and r_0u, and this will be discussed at the end of this section.

Testing H_0: r = 0 vs. H_a: r = r_ak
When r_0 = 0, equation (3.15) simplifies to

Δx_t = δ_a(α_a'x_{t−1}) + Γz_t + ε_t.   (3.16)

Since α_a'x_{t−1} is known, (3.16) is a multivariate linear regression, so that the LR, Wald
and LM statistics have their standard regression form. Letting X = [x_1 x_2 ... x_T]', X_{−1} = [x_0 x_1 ... x_{T−1}]', ΔX = X − X_{−1}, Z = [z_1 z_2 ... z_T]', ε = [ε_1 ε_2 ... ε_T]' and M_Z = [I − Z(Z'Z)^{−1}Z'], the OLS estimator of δ_a is δ̂_a = (ΔX'M_Z X_{−1}α_a)(α_a'X'_{−1}M_Z X_{−1}α_a)^{−1}, which is the Gaussian MLE. The corresponding Wald test statistic for H_0 vs. H_a is

W = [vec(δ̂_a)]'[(α_a'X'_{−1}M_Z X_{−1}α_a)^{−1} ⊗ Σ̂_ε]^{−1}[vec(δ̂_a)]
  = [vec(ΔX'M_Z X_{−1}α_a)]'[(α_a'X'_{−1}M_Z X_{−1}α_a)^{−1} ⊗ Σ̂_ε^{−1}][vec(ΔX'M_Z X_{−1}α_a)],   (3.17)

where Σ̂_ε is the usual estimator of the covariance matrix of ε_t (Σ̂_ε = T^{−1}ε̂'ε̂, where ε̂ is the matrix of OLS residuals from (3.16)), "vec" is the operator that stacks the columns of a matrix, and the second line uses the result that vec(ABC) = (C'⊗A)vec(B) for conformable matrices A, B and C. The corresponding LR and LM statistics are asymptotically equivalent to W under the null and local alternatives. The asymptotic null distribution of W is derived in Horvath and Watson (1993), where it is shown that

W ⇒ Trace{[∫B_1(s)dB(s)']'[∫B_1(s)B_1(s)'ds]^{−1}[∫B_1(s)dB(s)']},   (3.18)
where B(s) is an n × 1 Wiener process partitioned into r_a and n − r_a components B_1(s) and B_2(s), respectively. A proof of this result will not be offered here, but the form of the limiting distribution can be understood by considering a special case with Γ = 0 (so that there are no lags of Δx_t in the regression), Σ_ε = I_n, and α_a = [I_{r_a} 0]'. In this case, x_t is a random walk with n.i.i.d.(0, I_n) innovations, and (3.16) is the regression of Δx_t onto the first r_a elements of x_{t−1}, say x_{1,t−1}. Using the true value of α_a, the Wald statistic in (3.17) simplifies to

W = [vec(ΣΔx_t x'_{1,t−1})]'[(Σx_{1,t−1}x'_{1,t−1})^{−1} ⊗ I_n][vec(ΣΔx_t x'_{1,t−1})]
  = Trace[(ΣΔx_t x'_{1,t−1})(Σx_{1,t−1}x'_{1,t−1})^{−1}(Σx_{1,t−1}Δx_t')]
  = Trace[(T^{−1}ΣΔx_t x'_{1,t−1})(T^{−2}Σx_{1,t−1}x'_{1,t−1})^{−1}(T^{−1}Σx_{1,t−1}Δx_t')]
  ⇒ Trace{[∫B_1(s)dB(s)']'[∫B_1(s)B_1(s)'ds]^{−1}[∫B_1(s)dB(s)']},

where the second line uses the results that for square matrices Trace(AB) = Trace(BA), and for conformable matrices Trace(ABCD) = [vec(D')]'(C'⊗A)vec(B) [Magnus and Neudecker (1988, page 30)], and the last line follows from Lemma 2.3. This verifies (3.18) for the example.
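The algebra above maps directly into a few lines of matrix code. The following is a minimal numpy sketch of the Wald statistic in (3.17) for known cointegrating vectors; the function name, the handling of the lagged differences, and the simulated random walk illustration are assumptions for the example (no deterministic terms are included).

```python
import numpy as np

def wald_known_alpha(x, alpha, p=1):
    """Wald statistic (3.17) for H0: delta_a = 0 with known cointegrating
    vectors alpha (n x r_a); x holds the levels (T+1 x n) and p - 1 lags
    of dx are partialled out through Mz."""
    dx = np.diff(x, axis=0)                  # Delta x_t, T x n
    T, n = dx.shape
    k = p - 1
    DX, X1 = dx[k:, :], x[k:-1, :]           # Delta x_t and x_{t-1}
    if k > 0:                                # z_t = lagged differences
        Z = np.hstack([dx[k - i - 1:T - i - 1, :] for i in range(k)])
        MZ = np.eye(T - k) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
    else:
        MZ = np.eye(T)
    Xa = X1 @ alpha                          # alpha' x_{t-1}, stacked
    A = DX.T @ MZ @ Xa                       # dX' Mz X_{-1} alpha
    B = Xa.T @ MZ @ Xa                       # alpha' X'_{-1} Mz X_{-1} alpha
    delta = A @ np.linalg.inv(B)             # OLS = Gaussian MLE of delta_a
    E = MZ @ (DX - Xa @ delta.T)             # residuals from (3.16)
    Sig = E.T @ E / E.shape[0]               # Sigma_eps hat
    a = A.flatten(order='F')                 # vec() stacks columns
    return float(a @ np.kron(np.linalg.inv(B), np.linalg.inv(Sig)) @ a)

# illustration: two independent random walks (the r = 0 null), alpha = (1, -1)'
rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal((201, 2)), axis=0)
print(wald_known_alpha(x, np.array([[1.0], [-1.0]])))
```

Under the r = 0 null the printed value is a draw from the limit in (3.18), so repeating the last three lines many times is one way to approximate the critical values tabulated in Horvath and Watson (1993).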
Testing H_0: r = 0 vs. H_a: r = r_au

When α_a is unknown, the Wald test in (3.17) cannot be calculated because the regressor α_a'x_{t−1} depends on unknown parameters. However, the LR statistic can be calculated, and useful formulae for the LR statistic are developed in Anderson (1951) (for the reduced rank regression model) and Johansen (1988a) (for the VECM). In the context of the VECM (3.3), Johansen (1988a) shows that the LR statistic can be written as

LR = −T Σ_{i=1}^{r_au} ln(1 − γ_i),   (3.19)

where γ_i are the ordered squared canonical correlations between Δx_t and x_{t−1}, after controlling for Δx_{t−1}, ..., Δx_{t−p+1}. These canonical correlations can be calculated as the eigenvalues of T^{−1}S, where

S = Σ̂_ε^{−1/2}(ΔX'M_Z X_{−1})(X'_{−1}M_Z X_{−1})^{−1}(X'_{−1}M_Z ΔX)(Σ̂_ε^{−1/2})',

and where Σ̂_ε = T^{−1}(ΔX'M_Z ΔX) is the estimated covariance matrix of ε_t, computed under the null [see Anderson (1984, Chapter 12) or Brillinger (1980, Chapter 10)]. Letting λ_i(S) denote the eigenvalues of S ordered as λ_1(S) ≥ λ_2(S) ≥ ... ≥ λ_n(S), then γ_i from (3.19) is γ_i = T^{−1}λ_i(S). Since the elements of S are O_p(1) from Lemma 2.3, a Taylor series expansion of ln(1 − γ_i) shows that the LR statistic can be written as

LR = Σ_{i=1}^{r_au} λ_i(S) + o_p(1).   (3.20)
Equation (3.20) shows why the LR statistic is sometimes called the "maximal eigenvalue statistic" when r_au = 1 and the "trace statistic" when r_au = n [Johansen and Juselius (1990)].^{22} One way to motivate the formula for the LR statistic given in (3.20) is by manipulating the Wald statistic in (3.17).^{23} To see the relationship between LR and W in this case, let L(δ_a, α_a) denote the log likelihood written as a function of δ_a and α_a, and let δ̂_a(α_a) denote the MLE of δ_a for fixed α_a. When Σ_ε is known, the well known relation between the Wald and LR statistics in the linear regression model [Engle (1984)] implies that the Wald statistic can be written as

W(α_a) = 2[L(δ̂_a(α_a), α_a) − L(0, α_a)] = 2[L(δ̂_a(α_a), α_a) − L(0, 0)],   (3.21)

where the last equality follows since α_a does not enter the likelihood when δ_a = 0, and where W(α_a) is written to show the dependence of W on α_a. From (3.21), with Σ_ε known,

^{22} In standard jargon, when r_0u ≠ 0, the trace statistic corresponds to the test for the alternative r_au = n − r_0u.
^{23} See Hansen (1990b) for a general discussion of the relationship between Wald, LR and LM tests in the presence of unidentified parameters.
sup_{α_a} W(α_a) = sup_{α_a} 2[L(δ̂_a(α_a), α_a) − L(0, 0)]
               = 2[L(δ̂_a, α̂_a) − L(0, 0)]
               = LR,   (3.22)
where the sup is taken over all n × r_au matrices α_a. When Σ_ε is unknown, this equivalence is asymptotic, i.e. sup_{α_a} W(α_a) = LR + o_p(1). To calculate sup_{α_a} W(α_a), rewrite (3.17) as

W(α_a) = [vec(ΔX'M_Z X_{−1}α_a)]'[(α_a'X'_{−1}M_Z X_{−1}α_a)^{−1} ⊗ Σ̂_ε^{−1}][vec(ΔX'M_Z X_{−1}α_a)]
 = TR[Σ̂_ε^{−1/2}(ΔX'M_Z X_{−1}α_a)(α_a'X'_{−1}M_Z X_{−1}α_a)^{−1}(α_a'X'_{−1}M_Z ΔX)(Σ̂_ε^{−1/2})']
 = TR[Σ̂_ε^{−1/2}(ΔX'M_Z X_{−1})DD'(X'_{−1}M_Z ΔX)(Σ̂_ε^{−1/2})'], where D = α_a(α_a'X'_{−1}M_Z X_{−1}α_a)^{−1/2},
 = TR[D'(X'_{−1}M_Z ΔX)Σ̂_ε^{−1}(ΔX'M_Z X_{−1})D]   (3.23)
 = TR[F'CC'F],

where F = (X'_{−1}M_Z X_{−1})^{1/2}α_a(α_a'X'_{−1}M_Z X_{−1}α_a)^{−1/2} and C = (X'_{−1}M_Z X_{−1})^{−1/2}(X'_{−1}M_Z ΔX)(Σ̂_ε^{−1/2})'. Since F'F = I_{r_au},

sup_{α_a} W(α_a) = sup_{F'F=I} TR[F'(CC')F] = Σ_{i=1}^{r_au} λ_i(CC') = Σ_{i=1}^{r_au} λ_i(C'C) = LR + o_p(1),   (3.24)

where λ_i denote the ordered eigenvalues of CC', and the final two equalities follow from the standard principal components argument [for example, see Theil (1971, page 46)] and λ_i(CC') = λ_i(C'C). Equation (3.24) shows that the likelihood ratio statistic can then be calculated (up to an o_p(1) term) as the sum of the largest r_au eigenvalues of
C'C = Σ̂_ε^{−1/2}(ΔX'M_Z X_{−1})(X'_{−1}M_Z X_{−1})^{−1}(X'_{−1}M_Z ΔX)(Σ̂_ε^{−1/2})'.

To see the relationship between the formulae for the LR statistics in (3.24) and (3.20), notice that C'C in (3.24) and S in (3.20) differ only in the estimator of Σ_ε; C'C uses an estimator constructed from residuals calculated under the alternative, while S uses an estimator constructed from residuals calculated under the null.
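For concreteness, here is a minimal numpy sketch of the eigenvalue calculation that underlies (3.19), (3.20) and (3.24); it forms S with the covariance estimator computed under the null, so that γ_i = T^{−1}λ_i(S) and the sum of the largest r_au eigenvalues approximates the LR statistic. The function name and the simulated data are illustrative assumptions, not the chapter's.

```python
import numpy as np

def johansen_eigenvalues(x, p=2):
    """Descending eigenvalues of S = Sig^{-1/2}(dX'Mz X_{-1})
    (X'_{-1}Mz X_{-1})^{-1}(X'_{-1}Mz dX)(Sig^{-1/2})' from (3.19)-(3.20),
    with Sig = T^{-1} dX'Mz dX estimated under the null."""
    dx = np.diff(x, axis=0)
    T, n = dx.shape
    k = p - 1
    DX, X1 = dx[k:, :], x[k:-1, :]
    if k > 0:
        Z = np.hstack([dx[k - i - 1:T - i - 1, :] for i in range(k)])
        MZ = np.eye(T - k) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
        DX, X1 = MZ @ DX, MZ @ X1
    Sig = DX.T @ DX / DX.shape[0]
    w, V = np.linalg.eigh(Sig)                  # symmetric inverse square root
    Sroot = V @ np.diag(w ** -0.5) @ V.T
    S = Sroot @ (DX.T @ X1) @ np.linalg.solve(X1.T @ X1, X1.T @ DX) @ Sroot.T
    return np.linalg.eigvalsh(S)[::-1]          # largest first

rng = np.random.default_rng(1)
x = np.cumsum(rng.standard_normal((250, 3)), axis=0)   # r = 0 null, n = 3
lam = johansen_eigenvalues(x, p=2)
print("trace statistic:", lam.sum(), "max-eigenvalue statistic:", lam[0])
```

The trace statistic takes r_au = n (the sum of all eigenvalues), and the maximal eigenvalue statistic takes r_au = 1, matching the terminology introduced after (3.20).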
In general settings, it is not possible to derive a simple representation for the asymptotic distribution of the likelihood ratio statistic when some parameters are present only under the alternative. However, the special structure of the VECM makes such a simple representation possible. Johansen (1988a) shows that the LR statistic has the limiting asymptotic null distribution

LR ⇒ Σ_{i=1}^{r_au} λ_i(H),   (3.25)

where H = [∫B(s)dB(s)']'[∫B(s)B(s)'ds]^{−1}[∫B(s)dB(s)'] and B(s) is an n × 1 Wiener process. To understand Johansen's result, again consider the special case with Γ = 0 and Σ_ε = I_n. In this case, C'C becomes
C'C = (ΔX'X_{−1})(X'_{−1}X_{−1})^{−1}(X'_{−1}ΔX)
    = [ΣΔx_t x'_{t−1}][Σx_{t−1}x'_{t−1}]^{−1}[Σx_{t−1}Δx_t']
    = [T^{−1}ΣΔx_t x'_{t−1}][T^{−2}Σx_{t−1}x'_{t−1}]^{−1}[T^{−1}Σx_{t−1}Δx_t']
    ⇒ [∫B(s)dB(s)']'[∫B(s)B(s)'ds]^{−1}[∫B(s)dB(s)']   (3.26)

from Lemma 2.3. This verifies (3.25) for the example.
Testing H_0: r = 0 vs. H_a: r_a = r_ak + r_au

The model is now

Δx_t = δ_ak(α_ak'x_{t−1}) + δ_au(α_au'x_{t−1}) + Γz_t + ε_t,   (3.27)
where α_a has been partitioned so that α_ak contains the r_ak known cointegrating vectors, α_au contains the r_au unknown cointegrating vectors, and δ_a has been partitioned conformably as δ_a = (δ_ak δ_au). As above, the LR statistic can be approximated up to an o_p(1) term by maximizing the Wald statistic over the unknown parameters in α_au. Let M_{Zk} = M_Z − M_Z X_{−1}α_ak(α_ak'X'_{−1}M_Z X_{−1}α_ak)^{−1}α_ak'X'_{−1}M_Z denote the matrix that partials both Z and X_{−1}α_ak out of the regression (3.27). The Wald statistic (as a function of α_ak and α_au) can then be written as^{24}
W(α_ak, α_au) = [vec(ΔX'M_Z X_{−1}α_ak)]'[(α_ak'X'_{−1}M_Z X_{−1}α_ak)^{−1} ⊗ Σ̂_ε^{−1}][vec(ΔX'M_Z X_{−1}α_ak)]
 + [vec(ΔX'M_{Zk}X_{−1}α_au)]'[(α_au'X'_{−1}M_{Zk}X_{−1}α_au)^{−1} ⊗ Σ̂_ε^{−1}][vec(ΔX'M_{Zk}X_{−1}α_au)].   (3.28)
^{24} The first term in (3.28) is the Wald statistic for testing δ_ak = 0 imposing the constraint that δ_au = 0. The second term is the Wald statistic for testing δ_au = 0 with α_ak'x_{t−1} and z_t partialled out of the regression. This form of the Wald statistic can be deduced from the partitioned inverse formula.
As pointed out above, when the null hypothesis is H_0: r = r_0k + r_0u, the LR test statistic can be calculated as the difference between the LR statistics for [H_0: r = 0 vs. H_a: r = r_0 + r_a] and [H_0: r = 0 vs. H_a: r = r_0]. So, for example, when testing H_0: r = r_0u vs. H_a: r = r_0u + r_au, the LR statistic is

LR = −T Σ_{i=r_0u+1}^{r_0u+r_au} ln(1 − γ_i) = Σ_{i=r_0u+1}^{r_0u+r_au} λ_i(S) + o_p(1),   (3.31)

where γ_i are the canonical correlations defined below equation (3.19) [see Anderson (1951) and Johansen (1988a)]. Critical values for the case r_0k = r_ak = 0 and n − r_0u ≤ 5 are given in Johansen (1988a) for the trace statistic (so that the alternative is r_au = n − r_0u); these are extended to n − r_0u ≤ 11 in Osterwald-Lenum (1992), who also tabulates asymptotic critical values for the maximal eigenvalue statistic (so that r_0k = r_ak = 0 and r_au = 1). Finally, asymptotic critical values for all combinations of r_0k, r_0u, r_ak and r_au with n − r_0u ≤ 9 are tabulated in Horvath and Watson (1992).

3.3.2. Non-likelihood-based approaches
In addition to the likelihood based tests discussed in the last section, standard univariate unit root tests and their multivariate generalizations have also been used as tests for cointegration. To see why these tests are useful, consider the hypotheses H_0: r = 0 vs. H_a: r = 1, and suppose that α is known under the alternative. Since the data are not cointegrated under the null, w_t = α'x_t is I(1), while under the alternative it is I(0). Thus, cointegration can be tested by applying a standard unit root test to the univariate series w_t. To be useful in more general cointegrated models, standard unit root tests have been modified in two ways. First, modifications have been proposed so that the tests can be applied when α is unknown. Second, multivariate unit root tests have been developed for the general testing problem H_0: r = r_0 vs. H_a: r = r_0 + r_a. We discuss these two modifications in turn. Engle and Granger (1987) develop a test for the hypotheses H_0: r = 0 vs. H_a: r = 1 when α is unknown. They suggest using OLS to estimate the single cointegrating vector and applying a standard unit root test (they suggest an augmented Dickey-Fuller t-test) to the OLS residuals, ŵ_t = α̂'x_t. Under the alternative, α̂ is a consistent estimator of α, so that ŵ_t will behave like w_t. However, under the null, α̂ is obtained from a "spurious" regression (see Section 2.6.3), and the residuals from a spurious regression (ŵ_t) behave differently than non-stochastic linear combinations of I(1) variables (w_t). This affects the null distribution of unit root statistics calculated using ŵ_t. For example, the Dickey-Fuller t-statistic constructed using ŵ_t has a different null distribution than the statistic calculated using w_t, so that the usual critical values given in Fuller (1976) cannot be used for the Engle-Granger test. The correct asymptotic null distribution of the statistic is derived in Phillips and Ouliaris (1990), and is tabulated in Engle and Yoo (1987) and MacKinnon (1991). Hansen
(1990a) proposes a modification of the Engle-Granger test that is based on an iterated Cochrane-Orcutt estimator, which eliminates the "spurious regression" problem and results in test statistics with standard Dickey-Fuller asymptotic distributions under the null. Stock and Watson (1988), building on work by Fountis and Dickey (1986), propose a multivariate unit root test. Their procedure is most easily described by considering the VAR(1) model, x_t = Φx_{t−1} + ε_t, together with the hypotheses H_0: r = 0 vs. H_a: r = r_a. Under the null the data are not cointegrated, so that Φ = I_n. Under the alternative there are r_a covariance stationary linear combinations of the data, so that Φ has r_a eigenvalues that are less than one in modulus. The Stock-Watson test is based on the ordered eigenvalues of Φ̂, the OLS estimator of Φ. Writing these eigenvalues as |λ̂_1| ≥ |λ̂_2| ≥ ..., the test is based on λ̂_{n−r_a+1}, the r_a-th smallest eigenvalue. Under the null, |λ_{n−r_a+1}| = 1, while under the alternative, |λ_{n−r_a+1}| < 1. The asymptotic null distributions of T(Φ̂ − I) and T(|λ̂_i| − 1) are derived in Stock and Watson (1988), and critical values for T(|λ̂_{n−r_a+1}| − 1) are tabulated. This paper also develops the required modifications for testing in a general VAR(p) model with r_0 ≠ 0.
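A minimal sketch of the Engle-Granger two-step procedure just described follows; the helper names and the lag length are assumptions for the example, and the resulting statistic must be referred to the Phillips-Ouliaris/MacKinnon critical values rather than the Fuller (1976) tables, for the reasons given above.

```python
import numpy as np

def adf_t(w, lags=4):
    """Augmented Dickey-Fuller t-statistic (no deterministics) for
    H0: w has a unit root."""
    dw = np.diff(w)
    y = dw[lags:]
    X = np.column_stack([w[lags:-1]] +                      # w_{t-1}
                        [dw[lags - i:-i] for i in range(1, lags + 1)])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b
    s2 = u @ u / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])
    return b[0] / se

def engle_granger(x1, x2, lags=4):
    """Engle-Granger test: OLS of x2 on a constant and x1, then an
    ADF test on the OLS residuals w_hat."""
    X = np.column_stack([np.ones_like(x1), x1])
    w_hat = x2 - X @ np.linalg.lstsq(X, x2, rcond=None)[0]
    return adf_t(w_hat, lags)

rng = np.random.default_rng(2)
x1 = np.cumsum(rng.standard_normal(200))
x2 = x1 + rng.standard_normal(200)        # cointegrated pair with beta = 1
print(engle_granger(x1, x2))
```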
3.3.3. Comparison of the tests
The tests discussed above differ from one another in two important respects. First, some of the tests are constructed using the true value of the cointegrating vectors under the alternative, while others estimate the cointegrating vectors. Second, the likelihood based tests focus their attention on δ in (3.3), while the non-likelihood-based tests focus on the serial correlation properties of certain linear combinations of the data. Of course, knowledge of the cointegrating vectors, if available, will increase the power of the tests. The relative power of tests that focus on δ and tests that focus on the serial correlation properties of w_t = α'x_t is less clear. Some insight can be obtained by considering a special case of the VECM (3.3):

Δx_t = δ_a(α_a'x_{t−1}) + ε_t.   (3.32)

Suppose that α_a is known and that the competing hypotheses are H_0: r = 0 vs. H_a: r = 1. Multiplying both sides of (3.32) by α_a' yields

Δw_t = θw_{t−1} + e_t,   (3.33)
where w_t = α_a'x_t, θ = α_a'δ_a, and e_t = α_a'ε_t. Unit root tests constructed from w_t test the hypotheses H_0: θ = α_a'δ_a = 0 vs. H_a: θ = α_a'δ_a < 0, while the VECM-based LR and Wald statistics test H_0: δ_a = 0 vs. H_a: δ_a ≠ 0. Thus, unit root tests constructed from w_t focus on departures from the δ_a = 0 null in the direction of the cointegrating vector α_a. In contrast, the VECM likelihood based tests are invariant to transformations of the form Pα_a'x_{t−1} when α_a is known and Px_{t−1} when α_a is unknown,
where P is an arbitrary nonsingular matrix. Thus, the likelihood based tests are not focused in a single direction like the univariate unit root test. This suggests that tests based on w_t should perform relatively well for departures in the direction of α_a, but relatively poorly in other directions. As an extreme case, when α_a'δ_a = 0, the elements of x_t are I(2) and w_t is I(1). [The system is CI(2,1) in Engle and Granger's (1987) notation.] The elements are still cointegrated, at least in the sense that a particular linear combination of the variables is less persistent than the individual elements of x_t, and this form of cointegration can be detected by a nonzero value of δ_a in equation (3.32) even though θ = 0 in (3.33).^{26} A systematic comparison of the power properties of the various tests will not be carried out here, but one simple Monte Carlo experiment, taken from a set of experiments in Horvath and Watson (1993), highlights the power tradeoffs. Consider a bivariate model of the form given in (3.32) with ε_t ~ n.i.i.d.(0, I_2), α_a = (1 −1)' and δ_a = (δ_a1, δ_a2)'. This design implies that θ = δ_a1 − δ_a2 in (3.33), so that the unit root tests should perform reasonably well when |δ_a1 − δ_a2| is large and reasonably poorly when |δ_a1 − δ_a2| is small. Changes in δ_a have two effects on the power of the VECM likelihood based tests. In the classical multivariate regression model, the power of the likelihood based tests increases with ζ = δ²_a1 + δ²_a2. However, in the VECM, changes in δ_a1 and δ_a2 also affect the serial correlation properties of the regressor, w_{t−1} = α_a'x_{t−1}, as well as ζ. Indeed, for this design, w_t follows an AR(1) with AR coefficient 1 + θ = 1 + δ_a1 − δ_a2 [see equation (3.33)].
Table 1
Comparing power of tests for cointegration.^a Size for 5 percent asymptotic critical values and power for tests carried out at the 5 percent level.^b

                         δ_a:
Test                     (0, 0)   (0.05, 0.055)   (−0.05, 0.055)   (−0.105, 0)
DF (α known)              5.0          6.5             81.5            81.9
EG-DF (α unknown)         4.1          2.9             31.9            32.5
Wald (α known)            4.1         95.0             54.2            91.5
LR (α unknown)            4.4         86.1             20.8            60.7

^a Design is Δx_t = δ_a(α_a'x_{t−1}) + ε_t with α_a = (1 −1)', where ε_t = (ε_{1,t} ε_{2,t})' ~ n.i.i.d.(0, I_2) and t = 1, ..., 100.
^b These results are based on 10,000 replications. The first column shows rejection frequencies using asymptotic critical values. The other columns show rejection frequencies using 5 percent critical values calculated from the experiment in column 1.
^{26} This example was pointed out to me by T. Rothenberg.
Increases in the serial correlation of w_t (an AR coefficient closer to one) lead to increases in the variability of the regressor and thereby to increases in the power of the test. Table 1 shows size and power for four different values of δ_a when T = 100 in this bivariate system. Four tests are considered: (1) the Dickey-Fuller (DF) t-test using the true value of α; (2) the Engle-Granger test (EG-DF, the Dickey-Fuller t-test using a value of α estimated by OLS); (3) the Wald statistic for H_0: δ_a = 0 using the true value of α; and (4) the LR statistic for H_0: δ_a = 0 for unknown α. The table contains several noteworthy results. First, for this simple design, the size of the tests is close to the size predicted by asymptotic theory. Second, as expected, the DF and EG-DF tests perform quite poorly when |δ_a1 − δ_a2| is small. Third, increasing the serial correlation in w_t = α_a'x_t, while holding δ²_a1 + δ²_a2 constant, increases the power of the likelihood based tests. [This can be seen by comparing the δ_a = (0.05, 0.055) and δ_a = (−0.05, 0.055) columns.] Fourth, increasing δ²_a1 + δ²_a2, while holding the serial correlation in w_t constant, increases the power of the likelihood based tests. [This can be seen by comparing the δ_a = (−0.05, 0.055) and δ_a = (−0.105, 0.00) columns.] Fifth, when the DF and EG-DF tests are focused in the correct direction, their power exceeds that of the likelihood based tests. [This can be seen from the δ_a = (−0.05, 0.055) column.] Finally, there is a gain in power from incorporating the true value of the cointegrating vector. (This can be seen by comparing the DF test to the EG-DF test and the Wald test to the LR test.) A more thorough comparison of the tests is contained in Horvath and Watson (1993).
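The design behind Table 1 is easy to reproduce. The sketch below simulates one draw from the bivariate VECM used there (the function name and seed are ours) and confirms that the implied error correction term w_t is a near unit root AR(1) when δ_a1 − δ_a2 is close to zero, which is exactly the region where the unit root tests lose power.

```python
import numpy as np

def simulate_vecm(delta_a, T=100, rng=None):
    """One draw from the Table 1 design:
    dx_t = delta_a (alpha_a' x_{t-1}) + eps_t, with alpha_a = (1, -1)'
    and eps_t ~ N(0, I_2)."""
    rng = rng or np.random.default_rng()
    alpha = np.array([1.0, -1.0])
    d = np.asarray(delta_a, dtype=float)
    x = np.zeros((T + 1, 2))
    for t in range(1, T + 1):
        x[t] = x[t - 1] + d * (alpha @ x[t - 1]) + rng.standard_normal(2)
    return x

# with delta_a = (0.05, 0.055)', w_t = x1_t - x2_t is an AR(1) with
# coefficient 1 + 0.05 - 0.055 = 0.995, i.e. nearly a unit root
x = simulate_vecm((0.05, 0.055), rng=np.random.default_rng(3))
w = x[:, 0] - x[:, 1]
print(np.corrcoef(w[1:], w[:-1])[0, 1])
```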
3.4. Estimating cointegrating vectors

3.4.1. Gaussian maximum likelihood estimation (MLE) based on the triangular representation

In Section 3.2.4 the triangular representation of the cointegrated system was written as

Δx_{1,t} = u_{1,t},   (3.9)
x_{2,t} − βx_{1,t} = u_{2,t},   (3.10)
where u_t = D(L)ε_t. In this section we discuss the MLE of β under the assumption that ε_t ~ n.i.i.d.(0, I). The n.i.i.d. assumption is used only to motivate the Gaussian MLE; the asymptotic distributions of the estimators derived below follow under the weaker distributional assumptions listed in Lemma 2.3. In Section 2.6.4 we considered the OLS estimator of β in a bivariate model, and paid particular attention to the distribution of the estimator when D(L) = D with d_{12} = d_{21} = 0. In this case, x_{1,t} is weakly exogenous for β and the MLE corresponds
to the OLS estimator. Recall (see Section 2.6.4) that when d_{12} = d_{21} = 0, the OLS estimator of β has an asymptotic distribution that can be represented as a variance mixture of normals and that the t-statistic for β has an asymptotic null distribution that is standard normal. This means that tests concerning the value of β and confidence intervals for β can be constructed in the usual way; complications from the unit roots in the system can be ignored. These results carry over immediately to the vector case where x_{1,t} is k × 1 and x_{2,t} is r × 1, when u_t is serially uncorrelated and u_{1,t} is independent of u_{2,t}. Somewhat surprisingly, they also carry over to the MLE of β in the general model with u_t = D(L)ε_t, so that the errors are both serially and cross correlated. Intuition for this result can be developed by considering the static model with u_t = Dε_t and D not necessarily block diagonal. Since u_{1,t} and u_{2,t} are correlated, the MLE of β corresponds to the seemingly unrelated regression (SUR) estimator from (3.9)-(3.10). But, since there are no unknown regression coefficients in (3.9), the SUR estimator can be calculated by OLS in the regression

x_{2,t} = βx_{1,t} + γu_{1,t} + e_{2,t},   (3.34)

where γ is the coefficient from the regression of u_{2,t} onto u_{1,t}, and e_{2,t} = u_{2,t} − E[u_{2,t}|u_{1,t}] is the residual from this regression. By construction, e_{2,t} is independent of {x_{1,s}}_{s=1}^{T} for all t. Moreover, since γ is a coefficient on a zero mean stationary regressor and β is a coefficient on a martingale, the limiting scaled "X'X" matrix from the regression is block diagonal (Section 2.5.1). Thus from Lemma 2.3,
T(β̂ − β) = (T^{−1}Σ e_{2,t}x'_{1,t})(T^{−2}Σ x_{1,t}x'_{1,t})^{−1} + o_p(1)
         ⇒ Σ_{e2}^{1/2}[∫dB_2(s)B_1(s)'](Σ_{u1}^{1/2})'[Σ_{u1}^{1/2}(∫B_1(s)B_1(s)'ds)(Σ_{u1}^{1/2})']^{−1},   (3.35)
where Σ_{u1} = var(u_{1,t}), Σ_{e2} = var(e_{2,t}) and B(s) is an n × 1 Brownian motion process, partitioned as B(s) = [B_1(s)' B_2(s)']', where B_1(s) is k × 1 and B_2(s) is r × 1. Except for the change in scale factors and dimensions, equation (3.35) has the same form as (2.21), the asymptotic distribution of β̂ in the case d_{12} = d_{21} = 0. Thus, the asymptotic distribution of β̂ can be represented as a variance mixture of normals. Moreover, the same conditioning argument used when d_{12} = d_{21} = 0 implies that Wald test statistics concerning β have their usual large sample χ² distributions. Thus, inference about β can be carried out using standard procedures and standard distributions. Now suppose that u_t = D(L)ε_t. The dynamic analogue of (3.34) is
x_{2,t} = βx_{1,t} + γ(L)Δx_{1,t} + e_{2,t},   (3.36)

where γ(L)Δx_{1,t} = E[u_{2,t}|{u_{1,s}}_{s=1}^{T}] = E[u_{2,t}|{Δx_{1,s}}_{s=1}^{T}] and e_{2,t} = u_{2,t} − E[u_{2,t}|{u_{1,s}}_{s=1}^{T}]. Letting D_1(L) denote the first k rows of D(L) and D_2(L) denote
the remaining r rows, then from classical projection formulae [e.g. Whittle (1983, Chapter 5)], γ(L) = D_2(L)D_1(L^{−1})'[D_1(L)D_1(L^{−1})']^{−1}.^{27} Equation (3.36) differs from (3.34) in two ways. First, there is potential serial correlation in the error term of (3.36); second, γ(L) in (3.36) is a two-sided polynomial. These differences complicate the estimation process. To focus on the first complication, assume that γ(L) = 0. In this case, (3.36) is a regression model with a serially correlated error, so that (asymptotically) the MLE of β is just the feasible GLS estimator in (3.36). But, as shown in Phillips and Park (1988), the GLS correction has no effect on the asymptotic distribution of the estimator: the OLS and GLS estimators of β in (3.36) are asymptotically equivalent.^{28} Since the regression error e_{2,t} and the regressors {x_{1,s}}_{s=1}^{T} are independent, by analogy with the serially uncorrelated case, T(β̂ − β) will have an asymptotic distribution that can be represented as a variance mixture of normals. Indeed, the distribution will be exactly of the form (3.35), where now Σ_{u1} and Σ_{e2} represent "long-run" covariance matrices.^{29} Using conditioning arguments like those used in Section 2.6.4, it is straightforward to show that Wald test statistics constructed from the GLS estimators of β have large sample χ² distributions. However, since the errors in (3.36) are serially correlated, the usual estimator of the covariance matrix for the OLS estimator of β is inappropriate, and a serial correlation robust covariance matrix is required.^{30} Wald test statistics constructed from OLS estimators of β together with serial correlation robust estimators of covariance matrices will be asymptotically χ² and
^{27} This is the formula for the projection onto the infinite sample, i.e. γ(L)Δx_{1,t} = E[u_{2,t}|{Δx_{1,s}}_{s=−∞}^{∞}]. In general, γ(L) is two-sided and of infinite order, so that this is an approximation to E[u_{2,t}|{Δx_{1,s}}_{s=1}^{T}]. The effect of this approximation error on estimators of β is discussed below.
^{28} This can be demonstrated as follows. When γ(L) = 0, e_{2,t} = D_22(L)ε_{2,t} and Δx_{1,t} = D_11(L)ε_{1,t}, where ε_{1,t} and ε_{2,t} are the first k and last r elements of ε_t, and D_11(L) and D_22(L) are the appropriate diagonal blocks of D(L). Let C(L) = [D_22(L)]^{−1} and assume that the matrix coefficients in C(L), D_11(L) and D_22(L) are 1-summable. Letting δ = vec(β), and defining the operator L so that z_t L = Lz_t = z_{t−1}, the OLS and GLS estimators satisfy (using Lemma 2.3)

T(δ̂_OLS − δ) = [T^{−2}Σ x_{1,t}x'_{1,t} ⊗ I_r]^{−1}[T^{−1}Σ(x_{1,t} ⊗ I_r)D_22(L)ε_{2,t}] + o_p(1),
T(δ̂_GLS − δ) = [T^{−2}Σ{C(L)x_{1,t}}{x'_{1,t}C(L)'} ⊗ I_r]^{−1}[T^{−1}Σ(x_{1,t} ⊗ C(L)')ε_{2,t}]
            = [T^{−2}Σ x_{1,t}x'_{1,t} ⊗ C(1)'C(1)]^{−1}[T^{−1}Σ(x_{1,t} ⊗ C(1)')ε_{2,t}] + o_p(1).

Since C(1)^{−1} = D_22(1), T(δ̂_GLS − δ) = T(δ̂_OLS − δ) + o_p(1), so that T(δ̂_OLS − δ̂_GLS) →_p 0.
Since C(l)-’ = Dz2(l), T(go,, - 6) = T(6^,,, - 6) + o,(l), so that T($o,,, - Jo,,) LO. “The long-run covariance matrix for an n x 1 covariance stationary vector y, with absolutely summable autocovariances is x,E m Cov(y,, y,_& which is 2n times the spectral density matrix for y, at frequency zero. “See Wooldridge’s chapter of the Handbook for a thorough discussion of robust covariance matrix estimators.
be asymptotically equivalent to the statistics calculated using the GLS estimators of β [Phillips and Park (1988)]. In summary, the serial correlation in (3.36) poses no major obstacles. The two-sided polynomial γ(L) poses more of a problem, and three different solutions have been developed. In the first approach, γ(L) is approximated by a finite order (two-sided) polynomial.^{31} In this case, equation (3.36) can be estimated by GLS, yielding what Stock and Watson (1993) call the "dynamic GLS" estimator of β. Alternatively, utilizing the Phillips and Park (1988) result, an asymptotically equivalent "dynamic OLS" estimator can be constructed by applying OLS to (3.36). To motivate the second approach, assume for a moment that γ(L) were known. The OLS estimator of β would then be formed by regressing x_{2,t} − γ(L)Δx_{1,t} onto x_{1,t}. But T^{−1}Σ[γ(L)Δx_{1,t}]x'_{1,t} = T^{−1}Σ[γ(1)Δx_{1,t}]x'_{1,t} + B + o_p(1), where B = lim_{t→∞} E(y_t x'_{1,t}) with y_t = [γ(L) − γ(1)]Δx_{1,t}. (This can be verified using (c) and (d) of Lemma 2.3.) Thus, an asymptotically equivalent estimator can be constructed by regressing x_{2,t} − γ(1)Δx_{1,t} onto x_{1,t} and correcting for the "bias" term B. Park's (1992) "canonical cointegrating regression" estimator and Phillips and Hansen's (1990) "fully modified" estimator use this approach, where in both cases γ(1) and B are replaced by consistent estimators. The final approach is motivated by the observation that the low frequency movements in the data asymptotically dominate the estimator of β. Phillips (1991b) demonstrates that an efficient band spectrum regression, concentrating on frequency zero, can be used to calculate an estimator asymptotically equivalent to the MLE in (3.36).^{32} All of these suggestions lead to asymptotically equivalent estimators. The estimators have asymptotic representations of the form (3.35) (where Σ_{u1} and Σ_{e2} represent long-run covariance matrices), and thus their distributions can be represented as variance mixtures of normals. Wald test statistics computed using these estimators (and serial correlation robust covariance matrices) have the usual large sample χ² distributions under the null.
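As an illustration of the first approach, here is a minimal sketch of the dynamic OLS estimator for a bivariate system; the function name, the choice of q leads and lags, and the simulated data are assumptions for the example, and the serial correlation robust standard errors needed for inference are omitted.

```python
import numpy as np

def dynamic_ols(x1, x2, q=2):
    """Dynamic OLS: regress x2_t on a constant, x1_t and leads/lags
    -q..q of dx1_t, approximating the two-sided gamma(L) in (3.36)
    by a finite order polynomial. Returns the estimate of beta."""
    dx1 = np.diff(x1)
    T = len(x1)
    idx = np.arange(q + 1, T - q)            # usable t after trimming
    cols = [np.ones(len(idx)), x1[idx]]
    cols += [dx1[idx - 1 + j] for j in range(-q, q + 1)]   # dx1_{t+j}
    X = np.column_stack(cols)
    b = np.linalg.lstsq(X, x2[idx], rcond=None)[0]
    return b[1]                              # coefficient on x1_t

rng = np.random.default_rng(4)
u1 = rng.standard_normal(400)
x1 = np.cumsum(u1)                           # x1 is a random walk
x2 = 2.0 * x1 + 0.5 * u1 + rng.standard_normal(400)   # true beta = 2
print(dynamic_ols(x1, x2, q=2))
```

In this simulated design the error in the levels equation is correlated with the contemporaneous Δx_{1,t}, which is exactly the correlation the included lead/lag terms absorb.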
3.4.2. Gaussian maximum likelihood estimation based on the VECM
Most of the empirical work with cointegrated systems has utilized parameterizations based on the finite order VECM representation shown in equation (3.3). Exact MLEs calculated from the finite order VECM representation of the model are different from the exact MLEs calculated from the triangular representations that were developed in the last section. The reason is that the VECM imposes constraints on the coefficients in γ(L) and the serial correlation properties of e_{2,t} in (3.36).

^{31} This suggestion can be found in papers by Hansen (1988), Phillips (1991a), Phillips and Loretan (1991), Saikkonen (1991) and Stock and Watson (1993). Saikkonen (1991) contains a careful discussion of the approximation error that arises when γ(L) is approximated by a finite order polynomial. Using results of Berk (1974) and Said and Dickey (1984), he shows that a consistent estimator of γ(1) (which, as we show below, is required for an asymptotically efficient estimator of β) obtains if the order of the estimated polynomial γ(L) increases at rate T^δ for 0 < δ < 1/3.
^{32} See Hannan (1970) and Engle (1976) for a general discussion of band spectrum regression.
These restrictions were not exploited in the estimators discussed in Section 3.4.1. While these restrictions are asymptotically uninformative about β, they impact the estimator in small samples. Gaussian MLEs of β constructed from the finite order VECM (3.2) are analyzed in Johansen (1988a, 1991) and Ahn and Reinsel (1990) using the reduced rank regression framework originally studied by Anderson (1951). Both papers discuss computational approaches for computing the MLEs and, more importantly, derive the asymptotic distribution of the Gaussian MLE. There are two minor differences between the Johansen (1988a, 1991) and Ahn and Reinsel (1990) approaches. First, different normalizations are employed. Since Π = δα' = δFF^{−1}α' for any nonsingular r × r matrix F, the parameters in δ and α are not econometrically identified without further restrictions. Ahn and Reinsel (1990) use the same identifying restriction imposed in the triangular model, i.e., α' = [−β I_r]; Johansen (1991) uses the normalization α̂'Rα̂ = I_r, where R is the sample moment matrix of residuals from the regression of x_{t−1} onto Δx_{t−i}, i = 1, ..., p − 1. Both sets of restrictions are normalizations in the sense that they "just" identify the model, and lead to identical values of the maximized likelihood. Partitioning Johansen's MLE as α̂ = (α̂'_1 α̂'_2)', where α̂_1 is k × r and α̂_2 is r × r, implies that the MLE of β using Ahn and Reinsel's normalization is β̂ = −(α̂_1α̂_2^{−1})'. The approaches also differ in the computational algorithm used to maximize the likelihood function. Johansen (1988a), following Anderson (1951), suggests an algorithm based on partial canonical correlation analysis between Δx_t and x_{t−1} given Δx_{t−i}, i = 1, ..., p − 1. This framework is useful because likelihood ratio tests for cointegration are computed as a byproduct (see equation (3.19)). Ahn and Reinsel (1990) suggest an algorithm based on iterative least squares calculations. Modern computers quickly find the MLEs for typical economic systems using either algorithm. Some key results derived in both Johansen (1988a) and Ahn and Reinsel (1990) are transparent from the latter's regression formulae. As in Section 3.3, write the VECM as
I-z, + E,
=fiCX2,t-l-BXl,t~11+~Z1+&,,
(3.37)
where z_t includes the relevant lags of Δx_t and the second line imposes the Ahn-Reinsel normalization of α. Let w_{t−1} = x_{2,t−1} − βx_{1,t−1} denote the error correction term, and let θ = [vec(δ)' vec(Γ)' vec(β)']' denote the vector of unknown parameters. Using the well known relations between the vec operator and Kronecker products, vec(Γz_t) = (z'_t ⊗ I_n)vec(Γ), vec(δw_{t−1}) = (w'_{t−1} ⊗ I_n)vec(δ) and vec(δβx_{1,t−1}) = (x'_{1,t−1} ⊗ δ)vec(β). Using these expressions, and defining Q_t = [(z_t ⊗ I_n) (w_{t−1} ⊗ I_n) (x_{1,t−1} ⊗ δ)]', the Gauss-Newton iterations for estimating θ are
θ^{i+1} = θ^i + [ΣQ_t Σ̂_ε^{−1} Q'_t]^{−1}[ΣQ_t Σ̂_ε^{−1} ε_t],   (3.38)
where θ^i denotes the estimator of θ at the ith iteration, Σ̂_ε = T^{−1}Σε̂_t ε̂'_t, and Q_t and ε_t are evaluated at θ^i.^{33} Thus, the Gauss-Newton regression corresponds to the GLS regression of ε_t onto (z'_t ⊗ I_n), (w'_{t−1} ⊗ I_n) and (x'_{1,t−1} ⊗ δ). Since z_t and w_t are I(0) with zero mean and x_{1,t} is I(1), the analysis in Section 2 suggests that the limiting regression "X'X" matrix will be block diagonal, and the MLEs of δ and Γ will be asymptotically independent of the MLE of β. Johansen (1988a) and Ahn and Reinsel (1990) show that this is indeed the case. In addition, they demonstrate that the MLE of β has a limiting distribution of the same form as shown in equation (3.35) above, so that T(β̂ − β) can be represented as a variance mixture of normals. Finally, paralleling the result for MLEs from the triangular representation, Johansen (1988a) and Ahn and Reinsel (1990) demonstrate that
[ΣQ_t Σ̂_ε^{−1} Q'_t]^{−1/2}(θ̂ − θ) ⇒ N(0, I),

so that hypothesis tests and confidence intervals for all of the parameters in the VECM can be constructed using the normal and χ² distributions.
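To make the iterations concrete, the sketch below implements a stripped-down version of (3.38) for a bivariate system with p = 1 (so Γ and z_t drop out) and with the weighting matrix Σ̂_ε^{−1} replaced by the identity; the weighted iteration in (3.38) is the one whose fixed point is the exact Gaussian MLE, and the function name, starting values and simulated data here are illustrative assumptions.

```python
import numpy as np

def gauss_newton_vecm(x, beta0, delta0, iters=25):
    """Gauss-Newton iterations for the bivariate VECM with p = 1:
    dx_t = delta (x2_{t-1} - beta x1_{t-1}) + eps_t,
    theta = (delta_1, delta_2, beta)'."""
    dx, X1 = np.diff(x, axis=0), x[:-1, :]
    beta, delta = float(beta0), np.asarray(delta0, dtype=float)
    for _ in range(iters):
        w = X1[:, 1] - beta * X1[:, 0]       # error correction term
        eps = dx - np.outer(w, delta)        # T x 2 residuals
        JtJ, Jte = np.zeros((3, 3)), np.zeros(3)
        for t in range(len(w)):
            J = np.zeros((2, 3))             # J_t = -d eps_t / d theta'
            J[:, :2] = w[t] * np.eye(2)      # derivative w.r.t. delta
            J[:, 2] = -delta * X1[t, 0]      # derivative w.r.t. beta
            JtJ += J.T @ J
            Jte += J.T @ eps[t]
        step = np.linalg.solve(JtJ, Jte)
        delta, beta = delta + step[:2], beta + step[2]
    return delta, beta

rng = np.random.default_rng(5)
beta_true, delta_true = 1.5, np.array([0.1, -0.1])
x = np.zeros((301, 2))
for t in range(1, 301):
    w = x[t - 1, 1] - beta_true * x[t - 1, 0]
    x[t] = x[t - 1] + delta_true * w + rng.standard_normal(2)
print(gauss_newton_vecm(x, beta0=1.0, delta0=[0.05, -0.05]))
```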
3.4.3. Comparison and efficiency of the estimators
The estimated cointegrating vectors constructed from the VECM (3.3) or the triangular representation (3.9)-(3.10) differ only in the way that the I(0) dynamics of the system are parameterized. The VECM models these dynamics using a VAR involving the first differences Δx_{1,t}, Δx_{2,t} and the error correction terms x_{2,t} − βx_{1,t}; the triangular representation uses only Δx_{1,t} and the error correction terms. Section 3.4.1 showed that the exact parameterization of the I(0) dynamics (γ(L) and the serial correlation of the error term in (3.36)) mattered little for the asymptotic behavior of the estimator from the triangular representation. In particular, estimators of β that ignore residual serial correlation, replace γ(L) with γ(1) and adjust for bias are asymptotically equivalent to the exact MLE in (3.36). Saikkonen (1991) shows that this asymptotic equivalence extends to Gaussian MLEs constructed from the VECM: estimators of β constructed from (3.36) with appropriate nonparametric estimators of γ(1) are asymptotically equivalent to Gaussian MLEs constructed from the VECM (3.3). Similarly, test statistics for H_0: R[vec(β)] = r constructed from estimators based on the triangular representation and the VECM are also asymptotically equivalent.
^{33} Consistent initial conditions for the iterations are easily constructed from the OLS estimators of the parameters in the VAR (3.2). Let Π̂ denote the OLS estimator of Π, partitioned as Π̂ = [Π̂_1 Π̂_2], where Π̂_1 is n × (n − r) and Π̂_2 is n × r; further partition Π̂_1 = [Π̂'_11 Π̂'_21]' and Π̂_2 = [Π̂'_12 Π̂'_22]', where Π̂_11 is (n − r) × (n − r), Π̂_21 is r × (n − r), Π̂_12 is (n − r) × r and Π̂_22 is r × r. Then Π̂_2 serves as an initial consistent estimator of δ and −(Π̂_22)^{−1}Π̂_21 serves as an estimator of β. Ahn and Reinsel (1990) and Saikkonen (1992) develop efficient two-step estimators of β constructed from Π̂, and Engle and Yoo (1991) develop an efficient three-step estimator of all the parameters in the model using iterations similar to those in (3.38).
Since estimators of cointegrating vectors do not have asymptotic normal distributions, the standard analysis of asymptotic efficiency (based on comparing estimators' asymptotic covariance matrices) cannot be used. Phillips (1991a) and Saikkonen (1991) discuss efficiency of cointegrating vectors using generalizations of the standard efficiency definitions.^{34} Loosely speaking, these generalizations compare two estimators in terms of the relative probability that the estimators are contained in certain convex regions that are symmetric about the true value of the parameter vector. Phillips (1991a) shows that when u_t in the triangular representation (3.9)-(3.10) is generated by a Gaussian ARMA process, the MLE is asymptotically efficient. Saikkonen (1991) considers estimators whose asymptotic distributions can be represented by a certain class of functionals of Brownian motion. This class contains the OLS and nonlinear least squares estimators analyzed in Stock (1987), the instrumental variable estimators analyzed in Phillips and Hansen (1990), all of the estimators discussed in Sections 3.4.1 and 3.4.2, and virtually every other estimator that has been suggested. Saikkonen shows that the Gaussian MLE (or any of the estimators that are asymptotically equivalent to the Gaussian MLE) is an asymptotically efficient member of this class. Several studies have used Monte Carlo methods to examine the small sample behavior of the various estimators of cointegrating vectors. A partial list of the Monte Carlo studies is Ahn and Reinsel (1990), Banerjee et al. (1986), Gonzalo (1989), Park and Ogaki (1991), Phillips and Hansen (1990), Phillips and Loretan (1991) and Stock and Watson (1993). A survey of these studies suggests three general conclusions. First, the static OLS estimator can be very badly biased even when the sample size is reasonably large. This finding is consistent with the bias in the asymptotic distribution of the OLS estimator (see equation (2.22)) that was noted by Stock (1987). The second general conclusion concerns the small sample behavior of the Gaussian MLE based on the finite order VECM. The Monte Carlo studies discovered that, when the sample size is small, the estimator has a very large mean squared error, caused by a few extreme outliers. Gaussian MLEs based on the triangular representation do not share this characteristic. Some insight into this phenomenon is provided in Phillips (1991c), which derives the exact (small sample) distribution of the estimators in a model in which the variables follow independent Gaussian random walks. The MLE constructed from the VECM is shown to have a Cauchy distribution, and so has no integer moments, while the estimator based on the triangular representation has integer moments up to order T − n + r. While Phillips' results concern a model in which the variables are not cointegrated, the analysis is useful because it suggests that when the data are "weakly" cointegrated (as might be the case in small samples) the estimated cointegrating vector will (approximately) have these characteristics. The third general conclusion concerns the approximate Gaussian MLEs based
M. w. Wutson
2894
on the triangular representation. The small sample properties of these estimators and test statistics depend in an important way on the estimator used for the longrun covariance matrix of the data (spectrum at frequency zero), which is used to construct an estimator of y( 1) and the long-run residual variance in (3.36). Experiments in Park and Ogaki (1991), Stock and Watson (1993) and (in a different context) Andrews and Moynihan (1990), suggest that autoregressive estimators or estimators that rely on autoregressive pre-whitening outperform estimators based on simple weighted averages of sample covariances.
3.5. 3.5.1.
The role of constants
and trends
The model of deterministic
components
Thus far, deterministic components in the time series (constants and trends) have been ignored. These components are important for three reasons. First, they represent the average growth or nonzero level present in virtually all economic time series; second, they affect the efficiency of estimated cointegrating vectors and the power of tests for cointegration; finally, they affect the distribution of estimated cointegrating vectors and cointegration test statistics. Accordingly, suppose that the observed n × 1 time series y_t can be represented as
y_t = μ_0 + μ_1 t + x_t,   (3.39)
where x_t is generated by the VAR (3.1). In (3.39), μ_0 + μ_1 t represents the deterministic component of y_t, and x_t represents the stochastic component. In this section we discuss how the deterministic components affect the estimation and testing procedures that we have already surveyed.^{35} There is a simple way to modify the procedures so that they can be applied to y_t. The deterministic components can be eliminated by regressing y_t onto a constant and time trend. Letting y_t^τ denote the detrended series, the estimation and testing procedures developed above can then be used by replacing x_t with y_t^τ. This changes the asymptotic distribution of the statistics in a simple way: since the detrended values of y_t and x_t are identical, all statistics have the same limiting representations with the Brownian motion process B(s) replaced by B^τ(s), the detrended Brownian motion introduced in Section 2.3. While this approach is simple, it is often statistically inefficient because it discards all of the deterministic trend information in the data, and the relationship between these trends is often the most useful information about cointegration. To see this,
^{35} We limit discussion to linear trends in y_t for reasons of brevity and because this is the most important model for empirical applications. The results are readily extended to higher order trend polynomials and other smooth trend functions.
let α denote a cointegrating vector and consider the "stable" linear combination
α'y_t = λ_0 + λ_1 t + w_t,   (3.40)
the where JUO= CX’,U~, R, = a’~~, and w, = a’x,. In most (if not all) applications, cointegrating vector will annihilate both the stochastic trend and deterministic trend in a’y,. That is, w, will be I(0) and J., =0.36 As shown below, this means that one linear combination of the coefficients in the cointegrating vector can be consistently estimated at rate T 3’2. In contrast, when detrended data are used, the cointegrating vectors are consistently estimated at rate T. Thus, the data’s deterministic trends are the dominant source of information about the cointegrating vector and detrending the data throws this information away. The remainder of this section discusses estimation and testing procedures that utilize the data’s deterministic trends. Most of these procedures are simple modifications of the procedures that were developed above. Estimating
3.5.2. Estimating cointegrating vectors
We begin with a discussion of the MLE of cointegrating vectors based on the triangular representation. Partitioning y_t into (n − r) × 1 and r × 1 components, y_{1,t} and y_{2,t}, the triangular representation for y_t is
Δy_{1,t} = γ + u_{1,t},   (3.41)
y_{2,t} − βy_{1,t} = λ_0 + λ_1 t + u_{2,t}.   (3.42)
This is identical to the triangular representation for x_t given in (3.9)-(3.10) except for the constant and trend terms. The constant term in (3.41) represents the average growth in y_{1,t}. In most situations λ_1 = 0 in (3.42), since the cointegrating vector annihilates the deterministic trend in the variables. In this case, λ_0 denotes the mean of the error correction terms, which is unrestricted in most economic applications. Assuming that λ_1 = 0 and that λ_0 and γ are unrestricted, efficient estimation of β in (3.42) proceeds as in Section 3.4.1. The only difference is that the equations now include a constant term. As in Section 3.4.1, Wald, LR or LM test statistics for testing H_0: R[vec(β)] = r will have limiting χ² distributions, and confidence intervals for the elements of β can be constructed in the usual way. The only result from Section 3.4.1 that needs to be modified is the asymptotic distribution of β̂. This estimator is calculated from the regression of y_{2,t} onto y_{1,t}, leads and lags of Δy_{1,t} and a constant term. When the y_{1,t} data contain a trend [γ ≠ 0 in (3.41)],
cointegration. that 1, = 0.
one of the canonical regressors is a time trend (z_{2,t} from Section 2.5.1), and the estimated coefficient on the time trend converges at rate T^{3/2}. This means that one linear combination of the estimated coefficients in the cointegrating vector converges to its true value very quickly; when the model did not contain a linear trend the estimator converged at rate T. The results for MLEs based on the finite order VECM representation are analogous to those from the triangular representation. The VECM representation for y_t is derived directly from (3.2) and (3.39),
Δy_t = μ̃_1 + δ(α'x_{t−1}) + Σ_{i=1}^{p−1} Φ_i Δx_{t−i} + ε_t
     = μ̃_1 + δ(α'y_{t−1} − λ_0 − λ_1 t) + Σ_{i=1}^{p−1} Φ_i Δy_{t−i} + ε_t,   (3.43)

where μ̃_1 = (I − Σ_{i=1}^{p−1} Φ_i)μ_1, λ_0 = α'μ_0 and λ_1 = α'μ_1. Again, in most applications λ_1 = 0, and the VECM is

Δy_t = θ + δ(α'y_{t−1}) + Σ_{i=1}^{p−1} Φ_i Δy_{t−i} + ε_t,   (3.44)
where θ = μ̃_1 − δλ_0. When the only restriction on μ_1 is α'μ_1 = 0, the constant term θ is unconstrained, and (3.44) has the same form as (3.2) except that a constant term has been added. Thus, the Gaussian MLE from (3.44) is constructed exactly as in Section 3.4.2 with the addition of a constant term in all equations. The distribution of test statistics is unaffected but, for the reasons discussed above, the asymptotic distribution of the cointegrating vector changes because of the presence of the deterministic trend. In some situations the data are not trending in a deterministic fashion, so that μ_1 = 0. (For example, this is arguably the case when y_t is a vector of U.S. interest rates.) When μ_1 = 0, then μ̃_1 = 0 in (3.43), and this imposes a constraint on θ in (3.44). To impose this constraint, the model can be written as
Δy_t = δ(α'y_{t−1} − λ_0) + Σ_{i=1}^{p−1} Φ_i Δy_{t−i} + ε_t,   (3.45)
and estimated using a modification of the Gauss-Newton iterations in (3.38) or a modification of Johansen’s canonical correlation approach [see Johansen and Juselius (1990)].
3.5.3. Testing for cointegration
Deterministic trends have important effects on tests for cointegration. As discussed in Johansen and Juselius (1990) and Johansen (1991, 1992a), it is useful to consider two separate effects. First, as in (3.43)-(3.44), nonzero values of μ_0 and μ_1 affect the form of the VECM, and this, in turn, affects the form of the cointegration test statistic. Second, the deterministic components affect the properties of the regressors, and this, in turn, affects the distribution of cointegration test statistics. In the most general form of the test considered in Section 3.3.1, α was partitioned into known and unknown cointegrating vectors under both the null and alternative; that is, α was written as α = (α_0k α_0u α_ak α_au). When nonzero values of μ_0 and μ_1 are allowed, the precise form of the statistic and the resulting asymptotic null distribution depend on which of these cointegrating vectors annihilate the trend or constant [see Horvath and Watson (1993)]. Rather than catalogue all of the possible cases, the major statistical issues will be discussed in the context of two examples. The reader is referred to Johansen and Juselius (1990), Johansen (1992a) and Horvath and Watson (1993) for a more systematic treatment. In the first example, suppose that r = 0 under the null, that α is known under the alternative, and that μ_0 and μ_1 are nonzero, but that α'μ_1 = 0 is known. To be concrete, suppose that the data are aggregate time series on the logarithms of income, consumption and investment for the United States. The balanced growth hypothesis suggests two possible cointegrating relations with cointegrating vectors (1, −1, 0) and (1, 0, −1). The series exhibit deterministic growth, so that μ_1 ≠ 0, and the sample growth rates are approximately equal, so that α'μ_1 = 0 is reasonable. In this example, (3.44) is the correct specification of the VECM, with θ unrestricted under both the null and alternative and δ = 0 under the null. Comparing (3.44) with the specification with no deterministic components given in (3.3), the only difference is that x_t in (3.3) becomes y_t in (3.44) and the constant term θ is added. Thus, the Wald test for H_0: δ = 0 is constructed as in (3.17) with y_t replacing x_t and Z augmented by a column of 1's. Since α'μ_1 = 0, the regressor is α'y_{t−1} = α'x_{t−1} + α'μ_0, but since a constant is included in the regression, all of the variables are deviated from their sample means. Since the demeaned values of α'y_{t−1} and α'x_{t−1} are the same, the asymptotic null distribution of the Wald statistic for testing H_0: δ = 0 in (3.44) is given by (3.18) with B^μ(s), the demeaned Wiener process defined below Lemma 2.3, replacing B(s). The second example is the same as the first, except that now α is unknown. Equation (3.44) is still the correct VECM, with θ unrestricted under the null and alternative. The LR test statistic is calculated as in (3.19), again with y_t replacing x_t and Z augmented by a column of 1's. Now, however, the distribution of the test statistic changes in an important way. Since the regressor y_{t−1} contains a nonzero trend, it behaves like a combination of time trend and martingale components. When the n × 1 vector y_{t−1} is transformed into the canonical regressors of Section 2, this yields one regressor dominated by a time trend and n − 1 regressors dominated
by martingales. As shown in Johansen and Juselius (1990), the null distribution of the resulting LR statistic is given by (3.25), where now

H = [∫F(s)dB(s)']'[∫F(s)F(s)'ds]^{−1}[∫F(s)dB(s)'],
where F(s) is an n × 1 vector, with first n − 1 elements given by the first n − 1 elements of B^μ(s) and the last element given by the demeaned time trend, s − 1/2. (The components are demeaned because of the constant term in the regression.) Johansen and Juselius (1990) also derive the asymptotic null distribution of the LR test for cointegration with unknown cointegrating vectors when μ_1 = 0, so that (3.45) is the appropriate specification of the VECM. Tables of critical values are presented in Johansen and Juselius (1990) for n − r_0u ≤ 5 for the various deterministic trend models, and these are extended in Osterwald-Lenum (1992) for n − r_0u ≤ 11. Horvath and Watson (1992) extend the tables to include nonzero values of r_0k and r_ak. The appropriate treatment of deterministic components in cointegration and unit root tests is still unsettled and remains an active area of research. For example, Elliott et al. (1992) report that large gains in power for univariate unit root tests can be achieved by modifying standard Dickey-Fuller tests using an alternative method of detrending the data. They propose detrending the data using GLS estimators of μ_0 and μ_1 from (3.39), together with specific assumptions about initial conditions for the x_t process. Analogous procedures for likelihood based tests for cointegration can also be constructed. Johansen (1992b) develops a sequential testing procedure for cointegration in which the trend properties of the data and potential error correction terms are unknown.
4. Structural vector autoregressions

4.1. Introductory comments
Following the work of Sims (1980), vector autoregressions have been extensively used by economists for data description, forecasting and structural inference. Canova (1991) surveys VARs as a tool for data description and forecasting; this survey focuses on structural inference. We begin the discussion in Section 4.2 by introducing the structural moving average model, and show that this model provides answers to the “impulse” and “propagation” questions often asked by macroeconomists. The relationship between the structural moving average model and structural VAR is the subject of Section 4.3. That section discusses the conditions under which the structural moving average polynomial can be inverted, so that the structural shocks can be recovered from a VAR. When this is possible, a structural VAR obtains. Section 4.4 shows that the structural VAR can be interpreted
as a dynamic simultaneous equations model, and discusses econometric identification of the model’s parameters. Finally, Section 4.5 discusses issues of estimation and statistical inference.
4.2. The structural moving average model, impulse response functions and variance decompositions

In this section we study the model

y_t = C(L)ε_t,   (4.1)
where y_t is an n_y × 1 vector of economic variables and ε_t is an n_ε × 1 vector of shocks. For now we allow n_y ≠ n_ε. Equation (4.1) is called the structural moving average model, since the elements of ε_t are given a structural economic interpretation. For example, one element of ε_t might be interpreted as an exogenous shock to labor productivity, another as an exogenous shock to labor supply, another as an exogenous change in the quantity of money, and so forth. In the jargon developed for the analysis of dynamic simultaneous equations models, (4.1) is the final form of an economic model, in which the endogenous variables y_t are expressed as a distributed lag of the exogenous variables, given here by the elements of ε_t. It will be assumed that the endogenous variables y_t are observed, but that the exogenous variables ε_t are not directly observed. Rather, the elements of ε_t are indirectly observed through their effect on the elements of y_t. This assumption is made without loss of generality, since any observed exogenous variables can always be added to the y_t vector. In Section 1, a typical macroeconomic system was introduced and two broad questions were posed. The first question asked how the system's endogenous variables respond dynamically to exogenous shocks. The second question asked which shocks were the primary causes of variability in the endogenous variables. Both of these questions are readily answered using the structural moving average model. First, the dynamic effects of the elements of ε_t on the elements of y_t are determined by the elements of the matrix lag polynomial C(L). Letting C(L) = C_0 + C_1L + C_2L² + ..., where C_k is an n_y × n_ε matrix with typical element c_{ij,k}, then
aYi,t /(
=
aEj,*-k
=
a!* aEj,t
’
(4.2)
where y_{i,t} is the ith element of y_t, ε_{j,t} is the jth element of ε_t, and the last equality follows from the time invariance of (4.1). Viewed as a function of k, c_{ij,k} is called the impulse response function of ε_{j,t} for y_{i,t}. It shows how y_{i,t+k} changes in response to a one unit "impulse" in ε_{j,t}. In the classic econometric literature on distributed lag models, the impulse responses are called dynamic multipliers.
To answer the second question concerning the relative importance of the shocks, the probability structure of the model must be specified and the question must be refined. In most applications the probability structure is specified by assuming that the shocks are i.i.d.(0, Σ_ε), so that any serial correlation in the exogenous variables is captured in the lag polynomial C(L). The assumption of zero mean is inconsequential, since deterministic components such as constants and trends can always be added to (4.1). Viewed in this way, ε_t represents innovations or unanticipated shifts in the exogenous variables. The question concerning the relative importance of the shocks can be made more precise by casting it in terms of the h-step-ahead forecast errors of y_t. Let y_{t|t−h} = E(y_t|{ε_s}_{s=−∞}^{t−h}) denote the h-step-ahead forecast of y_t made at time t − h, and let a_{t|t−h} = y_t − y_{t|t−h} = Σ_{k=0}^{h−1} C_k ε_{t−k} denote the resulting forecast error. For small values of h, a_{t|t−h} can be interpreted as "short-run" movements in y_t, while for large values of h, a_{t|t−h} can be interpreted as "long-run" movements. In the limit as h → ∞, a_{t|t−h} = y_t. The importance of a specific shock can then be represented as the fraction of the variance in a_{t|t−h} that is explained by that shock; it can be calculated for short-run and long-run movements in y_t by varying h. When the shocks are mutually correlated there is no unique way to do this, since their covariance must somehow be distributed. However, when the shocks are uncorrelated the calculation is straightforward. Assume Σ_ε is diagonal with diagonal elements σ_j²; then the variance of the ith element of a_{t|t−h} is
so that
(4.3)
shows the fraction of the h-step-ahead forecast error variance in yi,t attributed to E;.~. The set of n, values of Rz,h are called the variance decomposition of y,,, at horizon h.
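To make the calculation in (4.3) concrete, here is a minimal Python sketch for a bivariate system; the MA coefficient matrices C_k and the shock variances are made-up illustrative numbers, not estimates from any model or data set.

```python
import numpy as np

# Variance decomposition (4.3) for y_t = C(L) eps_t with diagonal Sigma_eps.
# C_k and sigma2 are illustrative values only (C_k = 0 for k > 2).
C = [np.array([[1.0, 0.0], [0.5, 1.0]]),   # C_0
     np.array([[0.6, 0.2], [0.1, 0.4]]),   # C_1
     np.array([[0.3, 0.1], [0.0, 0.2]])]   # C_2
sigma2 = np.array([1.0, 0.5])              # diagonal elements of Sigma_eps

def variance_decomposition(C, sigma2, h):
    """R2[i, j]: fraction of the h-step forecast error variance of variable i
    attributed to shock j, i.e. sigma2_j * sum_{k<h} c_{ij,k}^2 / var_i."""
    Ck = np.array(C[:h])                       # only C_0, ..., C_{h-1} enter
    contrib = (Ck ** 2).sum(axis=0) * sigma2   # sigma2_j * sum_k c_{ij,k}^2
    return contrib / contrib.sum(axis=1, keepdims=True)

print(variance_decomposition(C, sigma2, h=2))  # each row sums to one
```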
4.3. The structural VAR representation

The structural VAR representation of (4.1) is obtained by inverting C(L) to yield

A(L) y_t = ε_t,   (4.4)
where A(L) = A_0 − Σ_{k=1}^∞ A_k L^k is a one-sided matrix lag polynomial. In (4.4), the exogenous shocks ε_t are written as a distributed lag of current and lagged values of y_t. The structural VAR representation is useful for two reasons. First, when the model parameters are known, it can be used to construct the unobserved exogenous shocks as a function of current and lagged values of the observed variables y_t. Second, it provides a convenient framework for estimating the model parameters: with A(L) approximated by a finite order polynomial, equation (4.4) is a dynamic simultaneous equations model, and standard simultaneous equations methods can be used to estimate the parameters.

It is not always possible to invert C(L) and move from the structural moving average representation (4.1) to the VAR representation (4.4). One useful way to discuss the invertibility problem [see Granger and Andersen (1978)] is in terms of estimates of ε_t constructed from (4.4) using truncated versions of A(L). Since a semi-infinite realization of the y_t process, {y_τ}_{τ=−∞}^t, is never available, estimates of ε_t must be constructed from (4.4) using {y_τ}_{τ=1}^t. Consider the estimator ε̂_t = Σ_{k=0}^{t−1} A_k y_{t−k} constructed from the truncated realization. If ε̂_t converges to ε_t in mean square as t → ∞, then the structural moving average process (4.1) is said to be invertible. Thus, when the process is invertible, the structural errors can be recovered as a one-sided moving average of the observed data, at least in large samples.

This definition makes it clear that the structural moving average process cannot be inverted if n_y < n_ε. Even in the static model y_t = C_0 ε_t, a necessary condition for obtaining a unique solution for ε_t in terms of y_t is that n_y ≥ n_ε. This requirement has a very important implication for structural analysis using VAR models: in general, small scale VARs can only be used for structural analysis when the endogenous variables can be explained by a small number of structural shocks. Thus, a bivariate VAR of macroeconomic variables is not useful for structural analysis if there are more than two important macroeconomic shocks affecting the variables.³⁷

In what follows we assume that n_y = n_ε. This rules out the simple cause of noninvertibility just discussed; it also assumes that any dynamic identities relating the elements of y_t when n_y > n_ε have been solved out of the model. With n_y = n_ε = n, C(L) is square and the general requirement for invertibility is that the determinantal polynomial |C(z)| has all of its roots outside the unit circle. Roots on the unit circle pose no special problems; they are evidence of overdifferencing and can be handled by appropriately transforming the variables (e.g. accumulating the necessary linear combinations of the elements of y_t). In any event, unit roots can be detected, at least in large samples, by appropriate statistical tests. Roots of |C(z)| that are inside the unit circle pose a much more difficult problem, since models with roots inside the unit circle have the same second moment properties as models with roots outside the unit circle. The simplest example of this

³⁷Blanchard and Quah (1989) and Faust and Leeper (1993) discuss special circumstances in which some structural analysis is possible when n_y < n_ε. For example, suppose that y_t is a scalar and the n_ε elements of ε_t affect y_t only through the scalar "index" e_t = D′ε_t, where D is an n_ε × 1 vector. In this case the impulse response functions can be recovered up to scale.
is the univariate MA(1) model y_t = (1 − cL)ε_t, where ε_t is i.i.d.(0, σ_ε²). The same first and second moments of y_t obtain for the model y_t = (1 − c̃L)ε̃_t, where c̃ = c⁻¹ and ε̃_t is i.i.d.(0, σ̃_ε²) with σ̃_ε² = c²σ_ε². Thus the first two moments of y_t cannot be used to discriminate between these two different models. This is important because it can lead to large specification errors in structural VAR models that cannot be detected from the data. For example, suppose that the true structural model is y_t = (1 − cL)ε_t with |c| > 1, so that the model is not invertible. A researcher using the invertible model would not recover the true structural shocks, but rather ε̃_t = (1 − c̃L)⁻¹ y_t = (1 − c̃L)⁻¹(1 − cL)ε_t = ε_t − (c̃ − c) Σ_{i=0}^∞ c̃^i ε_{t−1−i}. A general discussion of this subject is contained in Hannan (1970) and Rozanov (1967). Implications of these results for the interpretation of structural VARs are discussed in Hansen and Sargent (1991) and Lippi and Reichlin (1993). For related discussion see Quah (1986).

Hansen and Sargent (1991) provides a specific economic model in which noninvertible structural moving average processes arise. In the model, one set of economic variables, say x_t, are generated by an invertible moving average process. Another set of economic variables, say y_t, are expectational variables, formed as discounted sums of expected future x_t's. Hansen and Sargent then consider what would happen if only the y_t data were available to the econometrician. They show that the implied moving average process of y_t, written in terms of the structural shocks driving x_t, is not invertible.³⁸ The Hansen-Sargent example provides an important and constructive lesson for researchers using structural VARs: it is important to include variables that are directly related to the exogenous shocks under consideration (x_t in the example above). If the only variables used in the model are indirect indicators with important expectational elements (y_t in the example above), severe misspecification may result.
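The observational equivalence of the two parameterizations is easy to check numerically. The sketch below (with made-up parameter values) verifies that the noninvertible model with coefficient c and the invertible model with coefficient 1/c and rescaled innovation variance imply identical autocovariances.

```python
import numpy as np

# y_t = (1 - cL) eps_t with var(eps) = s2, versus y_t = (1 - (1/c)L) eps~_t
# with var(eps~) = c^2 s2: both give the same first two moments.
c, s2 = 2.0, 1.0                       # noninvertible model, |c| > 1 (made-up)
c_til, s2_til = 1.0 / c, c ** 2 * s2   # equivalent invertible model

def ma1_autocov(theta, sigma2):
    # For an MA(1): gamma_0 = (1 + theta^2) sigma2, gamma_1 = -theta sigma2.
    return (1 + theta ** 2) * sigma2, -theta * sigma2

print(ma1_autocov(c, s2))          # (5.0, -2.0)
print(ma1_autocov(c_til, s2_til))  # (5.0, -2.0): identical
```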
4.4. Identification of the structural VAR

Assuming that the lag polynomial A(L) in (4.4) is of order p, the structural VAR can be written as

A_0 y_t = A_1 y_{t−1} + A_2 y_{t−2} + ⋯ + A_p y_{t−p} + ε_t.   (4.5)
³⁸A simple version of their example is as follows: suppose that y_t and x_t are two scalar time series, with x_t generated by the MA(1) process x_t = ε_t − θε_{t−1}. Suppose that y_t is related to x_t by the expectational equation

y_t = E_t Σ_{i=0}^∞ β^i x_{t+i} = x_t + β E_t x_{t+1} = (1 − βθ)ε_t − θε_{t−1} = C(L)ε_t,

where the second and third equalities follow from the MA(1) process for x_t. It is readily verified that the root of C(z) is (1 − βθ)/θ, which may be less than 1 even when the root of (1 − θz) is greater than 1. (For example, if θ = β = 0.8, the root of (1 − θz) is 1.25 and the root of C(z) is 0.45.)
Since A_0 is not restricted to be diagonal, equation (4.5) is a dynamic simultaneous equations model. It differs from standard representations of the simultaneous equations model [see Hausman (1983)] because observable exogenous variables are not included in the equations. However, since exogenous and predetermined variables (lagged values of y_t) are treated identically for purposes of identification and estimation, equation (4.5) can be analyzed using techniques developed for simultaneous equations. The reduced form of (4.5) is

y_t = Φ_1 y_{t−1} + Φ_2 y_{t−2} + ⋯ + Φ_p y_{t−p} + e_t,   (4.6)
where Φ_i = A_0⁻¹ A_i, for i = 1,…,p, and e_t = A_0⁻¹ ε_t. A natural first question concerns the identifiability of the structural parameters in (4.5), and this is the subject taken up in this section. The well known "order" condition for identification is readily deduced. Since y_t is n × 1, there are pn² elements in (Φ_1, Φ_2,…,Φ_p) and n(n + 1)/2 elements in Σ_e = A_0⁻¹ Σ_ε (A_0⁻¹)′, the covariance matrix of the reduced form disturbances. When the structural shocks are n.i.i.d.(0, Σ_ε), these [n²p + n(n + 1)/2] parameters completely characterize the probability distribution of the data. In the structural model (4.5) there are (p + 1)n² elements in (A_0, A_1,…,A_p) and n(n + 1)/2 elements in Σ_ε. Thus, there are n² more parameters in the structural model than are needed to characterize the likelihood function, so that n² restrictions are required for identification. As usual, setting the diagonal elements of A_0 equal to 1 gives the first n restrictions. This leaves n(n − 1) restrictions that must be deduced from economic considerations.

The identifying restrictions must be dictated by the economic model under consideration. It makes little sense to discuss the restrictions without reference to a specific economic system. Here, some general remarks on identification are made in the context of a simple bivariate model explaining output and money; a more detailed discussion of identification in structural VAR models is presented in Giannini (1991). Let the first element of y_t, say y_{1,t}, denote the rate of growth of real output, and the second element of y_t, say y_{2,t}, denote the rate of growth of money.³⁹ Writing the typical element of A_k as a_{ij,k}, equation (4.5) becomes
y_{1,t} = −a_{12,0} y_{2,t} + Σ_{i=1}^p a_{11,i} y_{1,t−i} + Σ_{i=1}^p a_{12,i} y_{2,t−i} + ε_{1,t},   (4.7a)

y_{2,t} = −a_{21,0} y_{1,t} + Σ_{i=1}^p a_{21,i} y_{1,t−i} + Σ_{i=1}^p a_{22,i} y_{2,t−i} + ε_{2,t}.   (4.7b)

Equation (4.7a) is interpreted as an output or "aggregate supply" equation, with

³⁹Much of this discussion concerning this example draws from King and Watson (1993).
ε_{1,t} interpreted as an aggregate supply or productivity shock. Equation (4.7b) is interpreted as a money supply "reaction function" showing how the money supply responds to contemporaneous output, lagged variables, and a money supply shock ε_{2,t}. Identification requires n(n − 1) = 2 restrictions on the parameters of (4.7). In the standard analysis of simultaneous equation models, identification is achieved by imposing zero restrictions on the coefficients for the predetermined variables. For example, the order condition is satisfied if y_{1,t−1} enters (4.7a) but not (4.7b), and y_{1,t−2} enters (4.7b) but not (4.7a); this imposes the two constraints a_{21,1} = a_{11,2} = 0. In this case, y_{1,t−1} shifts the output equation but not the money equation, while y_{1,t−2} shifts the money equation but not the output equation. Of course, this is a very odd restriction in the context of the output-money model, since the lags in the equations capture expectational effects, technological and institutional inertia arising from production lags and sticky prices, information lags, etc. There is little basis for identifying the model with the restriction a_{21,1} = a_{11,2} = 0. Indeed, there is little basis for identifying the model with any zero restrictions on lag coefficients. Sims (1980) persuasively makes this argument in a more general context, and this has led structural VAR modelers to avoid imposing zero restrictions on lag coefficients.

Instead, structural VARs have been identified using restrictions on the covariance matrix of structural shocks Σ_ε, the matrix of contemporaneous coefficients A_0, and the matrix of long-run multipliers A(1)⁻¹. Restrictions on Σ_ε have generally taken the form that Σ_ε is diagonal, so that the structural shocks are assumed to be uncorrelated. In the example above, this means that the underlying productivity shocks and money supply shocks are uncorrelated, so that any contemporaneous cross equation impacts arise through nonzero values of a_{12,0} and a_{21,0}. Some researchers have found this a natural assumption to make, since it follows from a modeling strategy in which unobserved structural shocks are viewed as distinct phenomena which give rise to comovement in observed variables only through the specific economic interactions studied in the model. The restriction that Σ_ε is diagonal imposes n(n − 1)/2 restrictions on the model, leaving n(n − 1)/2 additional necessary restrictions.⁴⁰

These additional restrictions can come from a priori knowledge about the A_0 matrix in (4.5). In the bivariate output-money model in (4.7), if Σ_ε is diagonal, then only n(n − 1)/2 = 1 restriction on A_0 is required for identification. Thus, a priori knowledge of a_{12,0} or a_{21,0} will serve to identify the model. For example, if it was assumed that money shocks affect output only with a lag, so that ∂y_{1,t}/∂ε_{2,t} = 0, then a_{12,0} = 0, and this restriction identifies the model. The generalization of this restriction in the n-variable model produces the Wold causal chain [see Wold (1954) and Malinvaud (1980, pp. 605–608)], in which ∂y_{i,t}/∂ε_{j,t} = 0 for i < j. This restriction leads to a recursive model with A_0 lower triangular, yielding the required n(n − 1)/2 identifying restrictions. This restriction was used in Sims (1980), and has become the "default" identifying restriction implemented automatically in commercial econometric software.

⁴⁰Other restrictions on the covariance matrix are possible, but will not be discussed here. A more general discussion of identification with covariance restrictions can be found in Hausman and Taylor (1983), Fisher (1966), Rothenberg (1971) and the references cited there.
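Computationally, this recursive identification amounts to a Cholesky factorization of the reduced-form covariance matrix. A minimal sketch, assuming a bivariate system and an illustrative (made-up) Σ_e:

```python
import numpy as np

# With A_0 lower triangular with unit diagonal and Sigma_eps diagonal,
# A_0 and Sigma_eps are recovered from Sigma_e = A_0^{-1} Sigma_eps A_0^{-1}'.
Sigma_e = np.array([[1.0, 0.4],
                    [0.4, 0.8]])           # illustrative reduced-form covariance

P = np.linalg.cholesky(Sigma_e)            # lower triangular, Sigma_e = P P'
d = np.diag(P)
A0_inv = P / d                             # rescale columns to a unit diagonal
A0 = np.linalg.inv(A0_inv)
Sigma_eps = np.diag(d ** 2)                # diagonal structural variances

print(np.allclose(A0_inv @ Sigma_eps @ A0_inv.T, Sigma_e))  # True
```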
Like any identifying restriction, it should never be used automatically. In the context of the output-money example, it is appropriate under the maintained assumption that exogenous money supply shocks, and the resulting change in interest rates, have no contemporaneous effect on output. This may be a reasonable assumption for data sampled at high frequencies, but loses its appeal as the sampling interval increases.⁴¹,⁴² Other restrictions on A_0 can also be used to identify the model. Blanchard and Watson (1986), Bernanke (1986) and Sims (1986) present empirical models that are identified by zero restrictions on A_0 that don't yield a lower triangular matrix. Keating (1990) uses a related set of restrictions. Of course, nonzero equality restrictions can also be used; see Blanchard and Watson (1986) and King and Watson (1993) for examples.

An alternative set of identifying restrictions relies on long-run relationships. In the context of structural VARs these restrictions were used in papers by Blanchard and Quah (1989) and King et al. (1991).⁴³ These papers relied on restrictions on A(1) = A_0 − Σ_{i=1}^p A_i for identification. Since C(1) = A(1)⁻¹, these can alternatively be viewed as restrictions on the sum of impulse responses. To motivate these restrictions, consider the output-money example.⁴⁴ Let x_{1,t} denote the logarithm of the level of output and x_{2,t} denote the logarithm of the level of money, so that y_{1,t} = Δx_{1,t} and y_{2,t} = Δx_{2,t}. Then from (4.1),

∂x_{i,t+k}/∂ε_{j,t} = Σ_{m=0}^k ∂y_{i,t+m}/∂ε_{j,t} = Σ_{m=0}^k c_{ij,m}   (4.8)

for i, j = 1, 2, so that

lim_{k→∞} ∂x_{i,t+k}/∂ε_{j,t} = Σ_{m=0}^∞ c_{ij,m},   (4.9)

which is the ijth element of C(1). Now, suppose that money is neutral in the long run, in the sense that shocks to money have no permanent effect on the level of output. This means that lim_{k→∞} ∂x_{1,t+k}/∂ε_{2,t} = 0, so that C(1) is a lower triangular matrix.
⁴¹The appropriateness of the Wold causal chain was vigorously debated in the formative years of simultaneous equations. See Malinvaud (1980, pp. 55–58) and the references cited there.
⁴²Applied researchers sometimes estimate a variety of recursive models in the belief (or hope) that the set of recursive models somehow "brackets" the truth. There is no basis for this. Statements like "the ordering of the Wold causal chain didn't matter for the results" say little about the robustness of the results to different identifying restrictions.
⁴³For other early applications of this approach, see Shapiro and Watson (1988) and Gali (1992).
⁴⁴The empirical model analyzed in Blanchard and Quah (1989) has the same structure as the output-money example with the unemployment rate used in place of money growth.
Since A(1) = C(1)⁻¹, this means that A(1) is also lower triangular, and this yields the single extra identifying restriction that is required to identify the bivariate model. The analogous restriction in the general n-variable VAR is the long-run Wold causal chain in which ε_{i,t} has no long-run effect on y_{j,t} for j < i. This restriction implies that A(1) is lower triangular, yielding the necessary n(n − 1)/2 identifying restrictions.⁴⁵
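Numerically, this long-run identification can be implemented by taking the Cholesky factor of the long-run covariance matrix of y_t. A minimal sketch for a bivariate VAR(1), with Σ_ε normalized to the identity and illustrative (made-up) parameter values:

```python
import numpy as np

# With Sigma_eps = I and C(1) lower triangular, C(1) is the Cholesky factor
# of the long-run covariance (I - Phi(1))^{-1} Sigma_e (I - Phi(1))^{-1}'.
Phi1 = np.array([[0.5, 0.1],
                 [0.2, 0.3]])              # illustrative VAR(1) coefficients
Sigma_e = np.array([[1.0, 0.3],
                    [0.3, 0.7]])           # illustrative reduced-form covariance

F = np.linalg.inv(np.eye(2) - Phi1)        # (I - Phi(1))^{-1}
long_run_cov = F @ Sigma_e @ F.T           # = C(1) C(1)'
C1 = np.linalg.cholesky(long_run_cov)      # lower triangular: shock 2 has no
                                           # long-run effect on variable 1
A1 = np.linalg.inv(C1)                     # A(1) = C(1)^{-1}, also lower triangular
print(np.allclose(C1 @ C1.T, long_run_cov))  # True
```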
4.5. Estimating structural VAR models
This section discusses methods for estimating the parameters of the structural VAR (4.5). The discussion is centered around generalized method of moments (GMM) estimators. The relationship between these estimators and FIML estimators constructed from a Gaussian likelihood is discussed below. The simplest version of the GMM estimator is indirect least squares, which follows from the relationship between the reduced form parameters in (4.6) and the structural parameters in (4.5):

Φ_i = A_0⁻¹ A_i,   i = 1,…,p,   (4.10)

A_0⁻¹ Σ_ε (A_0⁻¹)′ = Σ_e.   (4.11)
Indirect least squares estimators are formed by replacing the reduced form parameters in (4.10) and (4.11) with their OLS estimators and solving the resulting equations for the structural parameters. Assuming that the model is exactly identified, a solution will necessarily exist. Given estimators Φ̂_i and Â_0, equation (4.10) yields Â_i = Â_0 Φ̂_i. The quadratic equation (4.11) is more difficult to solve. In general, iterative techniques are required, but simpler methods are presented below for specific models.

To derive the large sample distribution of the estimators and to "solve" the indirect least squares equations when there are overidentifying restrictions, it is convenient to cast the problem in the standard GMM framework [see Hansen (1982)]. Hausman et al. (1987) show how this framework can be used to construct efficient estimators for the simultaneous equations model with covariance restrictions on the error terms, thus providing a general procedure for forming efficient estimators in the structural VAR model. Some additional notation is useful. Let z_t = (y′_{t−1}, y′_{t−2},…,y′_{t−p})′ denote the vector of predetermined variables in the model, and let θ denote the vector of unknown parameters in A_0, A_1,…,A_p and Σ_ε. The population moment conditions that implicitly define the structural parameters are

E(ε_t z′_t) = 0,   (4.12)

⁴⁵Of course, restrictions on A_0 and A(1) can be used in concert to identify the model. See Gali (1992) for an empirical example.
E(ε_t ε′_t) = Σ_ε,   (4.13)
where ε_t and Σ_ε are functions of the unknown θ. GMM estimators are formed by choosing θ so that (4.12) and (4.13) are satisfied, or satisfied as closely as possible, with sample moments used in place of the population moments.

The key ideas underlying the GMM estimator in the structural VAR model can be developed using the bivariate output-money example in (4.7). This avoids the cumbersome notation associated with the n-equation model and arbitrary covariance restrictions. [See Hausman et al. (1987) for discussion of the general case.] Assume that the model is identified by linear restrictions on the coefficients of A_0, A_1,…,A_p and the restriction that E(ε_{1,t} ε_{2,t}) = 0. Let w_{1,t} denote the variables appearing on the right hand side of (4.7a) after the restrictions on the structural coefficients have been solved out, and let δ_1 denote the corresponding coefficients. Thus, if a_{12,0} = 0 is the only coefficient restriction in (4.7a), then only lags of y_{1,t} and y_{2,t} appear in the equation and w_{1,t} = (y′_{t−1}, y′_{t−2},…,y′_{t−p})′. If the long-run neutrality assumption Σ_{i=0}^p a_{12,i} = 0 is imposed in (4.7a), then w_{1,t} = (y_{1,t−1}, y_{1,t−2},…,y_{1,t−p}, Δy_{2,t}, Δy_{2,t−1},…,Δy_{2,t−p+1})′.⁴⁶ Defining w_{2,t} and δ_2 analogously for equation (4.7b), the model can be written as
y_{1,t} = w′_{1,t} δ_1 + ε_{1,t},   (4.14a)

y_{2,t} = w′_{2,t} δ_2 + ε_{2,t},   (4.14b)

and the GMM moment equations are:

E(z_t ε_{1,t}) = 0,   (4.15a)

E(z_t ε_{2,t}) = 0,   (4.15b)

E(ε_{1,t} ε_{2,t}) = 0,   (4.15c)

E(ε_{i,t}² − σ_{εi}²) = 0,   i = 1, 2.   (4.15d)
The sample analogues of (4.15a)–(4.15c) determine the estimators δ̂_1 and δ̂_2, while (4.15d) determines σ̂_{ε1}² and σ̂_{ε2}² as sample averages of sums of squared residuals. Since the estimators of σ_{ε1}² and σ_{ε2}² are standard, we focus on (4.15a)–(4.15c) and the resulting estimators of δ_1 and δ_2. Let u_t = (z′_t ε_{1,t}, z′_t ε_{2,t}, ε_{1,t} ε_{2,t})′ and ū = T⁻¹ Σ_t u_t denote the sample values of the second moments in (4.15a)–(4.15c). Then the GMM estimators, δ̂_1 and δ̂_2, are the values of δ_1 and δ_2 that minimize

⁴⁶If a_{12}(L) = Σ_{i=0}^p a_{12,i} L^i and a_{12}(1) = 0, then a_{12}(L) y_{2,t} = a*_{12}(L)(1 − L) y_{2,t} = a*_{12}(L) Δy_{2,t}, where a*_{12,i} = −Σ_{j=i+1}^p a_{12,j}. The discussion that follows assumes the linear restrictions on the structural coefficients are homogeneous (or zero). As usual, the only change required for nonhomogeneous (or nonzero) linear restrictions is a redefinition of the dependent variable.
ū′ Σ̂_u⁻¹ ū,   (4.16)

where Σ̂_u is a consistent estimator of E(u_t u′_t).⁴⁷ These estimators have a simple GLS or instrumental variable interpretation. To see this, let Z = (z_1 z_2 ⋯ z_T)′ denote the T × 2p matrix of instruments; let W_1 = (w_{1,1} w_{1,2} ⋯ w_{1,T})′ and W_2 = (w_{2,1} w_{2,2} ⋯ w_{2,T})′ denote the T × k_1 and T × k_2 matrices of right hand side variables; finally, let Y_1, Y_2, ε_1 and ε_2 denote the T × 1 vectors composed of y_{1,t}, y_{2,t}, ε_{1,t} and ε_{2,t}, respectively. Multiplying equations (4.14a) and (4.14b) by z_t and summing yields

Z′Y_1 = (Z′W_1) δ_1 + Z′ε_1,   (4.17a)

Z′Y_2 = (Z′W_2) δ_2 + Z′ε_2.   (4.17b)
Now, letting ε̃_i = Y_i − W_i δ̃_i for some δ̃_i,

ε̃′_1 ε̃_2 + ε̃′_1 W_2 δ̃_2 + ε̃′_2 W_1 δ̃_1 = (ε̃′_2 W_1) δ_1 + (ε̃′_1 W_2) δ_2 + ε′_1 ε_2 + quadratic terms.   (4.17c)

Stacking equations (4.17a)–(4.17c) and ignoring the quadratic terms in (4.17c) yields

Q = P_1 δ_1 + P_2 δ_2 + V,   (4.18)
where Q = [(Z′Y_1) | (Z′Y_2) | (ε̃′_1 ε̃_2 + ε̃′_1 W_2 δ̃_2 + ε̃′_2 W_1 δ̃_1)], P_1 = [(Z′W_1) | 0_{2p×k_1} | (ε̃′_2 W_1)], P_2 = [0_{2p×k_2} | (Z′W_2) | (ε̃′_1 W_2)], and V = [(Z′ε_1) | (Z′ε_2) | (ε′_1 ε_2)], where "|" denotes vertical concatenation ("stacking"). By inspection V = Tū from (4.16). Thus when Q, P_1 and P_2 are evaluated at δ̃_1 = δ̂_1 and δ̃_2 = δ̂_2, the GMM estimators coincide with the GLS estimators from (4.18). This means that the GMM estimators can be formed by iterative GLS estimation of equation (4.18), updating δ̃_1 and δ̃_2 at each iteration and using T⁻¹ Σ_t û_t û′_t as the GLS covariance matrix. Hausman et al. (1987) show that the resulting GMM estimators of δ_1, δ_2, σ_{ε1}² and σ_{ε2}² are jointly asymptotically normally distributed when the vectors (z′_t ε′_t)′ are independently distributed and standard regularity conditions hold. These results are readily extended to the structural VAR when the roots of |Φ(z)| are outside the unit circle, so that the data are covariance stationary. Expressions for the asymptotic variance of the GMM estimators are given in their paper. When some of the variables in the model are integrated, the asymptotic distribution of the estimators changes in a way like that discussed in Section 2. This issue does not seem to have been studied explicitly in the structural VAR model, although such an analysis would seem to be reasonably straightforward.⁴⁸

⁴⁷When elements of u_t and u_τ are correlated for t ≠ τ, Σ̂_u is replaced by a consistent estimator of the limiting value of the variance of T^{1/2}ū.
⁴⁸Instrumental variable estimators constructed from possibly integrated regressors and instruments are discussed in Phillips and Hansen (1990).
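As an illustration of how the sample moments in (4.15a)–(4.15c) combine into the objective (4.16), here is a minimal sketch; eps1, eps2 and Z are placeholder arrays the user must supply, and the weighting matrix is computed under the independence assumption discussed in footnote 47.

```python
import numpy as np

def gmm_objective(eps1, eps2, Z):
    """Sample GMM objective u_bar' S^{-1} u_bar for the bivariate model, where
    u_t stacks z_t * eps_1t, z_t * eps_2t and eps_1t * eps_2t."""
    u = np.column_stack([Z * eps1[:, None],      # sample analogue of (4.15a)
                         Z * eps2[:, None],      # sample analogue of (4.15b)
                         eps1 * eps2])           # sample analogue of (4.15c)
    u_bar = u.mean(axis=0)
    S = (u.T @ u) / len(u)                       # estimate of E(u_t u_t')
    return u_bar @ np.linalg.solve(S, u_bar)
```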
The paper by Hausman et al. (1987) also discusses the relationship between efficient GMM estimators and the FIML estimator constructed under the assumption that the errors are normally distributed. It shows that the FIML estimator can be written as the solution to (4.16), using a specific estimator of Σ_u appropriate under the normality assumption. In particular, FIML uses a block diagonal estimator of Σ_u, since E[(ε_{1,t}ε_{2,t})(ε_{1,t}z_t)] = E[(ε_{1,t}ε_{2,t})(ε_{2,t}z_t)] = 0 when the errors are normally distributed. When the errors are not normally distributed, this estimator of Σ_u may be inconsistent, leading to a loss of efficiency in the FIML estimator relative to the efficient GMM estimator.

Estimation is simplified when there are no overidentifying restrictions. In this case, iteration is not required, and the GMM estimators can be constructed as instrumental variable (IV) estimators. When the model is just identified, only one restriction is imposed on the coefficients in equation (4.7). This implies that one of the vectors δ_1 or δ_2 is 2p × 1, while the other is (2p + 1) × 1, and (4.17) is a set of 4p + 1 linear equations in 4p + 1 unknowns. Suppose, without loss of generality, that δ_1 is 2p × 1. Then δ̂_1 is determined from (4.17a) as δ̂_1 = (Z′W_1)⁻¹(Z′Y_1), which is the usual IV estimator of equation (4.14a) using z_t as instruments. Using this value for δ̃_1 in (4.17c) and noting that Y_2 = W_2 δ_2 + ε_2, equation (4.17c) becomes

ε̃′_1 Y_2 = (ε̃′_1 W_2) δ_2 + ε̃′_1 ε_2,   (4.19)

where ε̃_1 = Y_1 − W_1 δ̂_1 is the residual from the first equation. The GMM estimator of δ_2 is formed by solving (4.17b) and (4.19) for δ_2. This can be recognized as the IV estimator of equation (4.14b) using z_t and the residual from (4.14a) as instruments. The residual is a valid instrument because of the covariance restriction (4.15c).⁴⁹

In many structural VAR exercises, the impulse response functions and variance decompositions defined in Section 4.2 are of more interest than the parameters of the structural VAR. Since C(L) = A(L)⁻¹, the moving average parameters/impulse responses and the variance decompositions are differentiable functions of the structural VAR parameters. The continuous mapping theorem directly yields the asymptotic distribution of these parameters from the distribution of the structural VAR parameters. Formulae for the resulting covariance matrix can be determined by delta method calculations. Convenient formulae for these covariance matrices can be found in Lütkepohl (1990), Mittnik and Zadrozny (1993) and Hamilton (1994).

⁴⁹While this instrumental variables scheme provides a simple way to compute the GMM estimator using standard computer software, the covariance matrix of the estimators constructed using the usual formula will not be correct. Using ε̃_1 as an instrument introduces "generated regressor" complications familiar from Pagan (1984). Corrections to the standard formula are provided in King and Watson (1993). An alternative approach is to carry out one GMM iteration using the IV estimates as starting values. The point estimates will remain unchanged, but standard GMM software will compute a consistent estimator of the correct covariance matrix. The usefulness of residuals as instruments is discussed in more detail in Hausman (1983), Hausman and Taylor (1983) and Hausman et al. (1987).
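For the just-identified case, the two-step IV procedure described above takes only a few lines. A minimal sketch, with Y1, Y2, W1, W2 and Z as placeholder arrays the user must supply; as footnote 49 notes, the usual IV covariance formula is not valid for the second step.

```python
import numpy as np

def iv(Y, W, Z):
    """Just-identified IV estimator (Z'W)^{-1} Z'Y."""
    return np.linalg.solve(Z.T @ W, Z.T @ Y)

def estimate_structural_var(Y1, Y2, W1, W2, Z):
    delta1 = iv(Y1, W1, Z)                # step 1: usual IV for equation (4.14a)
    resid1 = Y1 - W1 @ delta1             # residual from the first equation
    Z2 = np.column_stack([Z, resid1])     # step 2: residual joins the instruments,
    delta2 = iv(Y2, W2, Z2)               # valid because of restriction (4.15c)
    return delta1, delta2
```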
Many applied researchers have instead relied on Monte Carlo methods for estimating standard errors of estimated impulse responses and variance decompositions. Runkle (1987) reports on experiments comparing the small sample accuracy of the estimators. He concludes that the delta method provides reasonably accurate estimates of the standard errors for the impulse responses, and the resulting confidence intervals have approximately the correct coverage. On the other hand, delta method confidence intervals for the variance decompositions are often unsatisfactory. This undoubtedly reflects the [0, 1] bounded support of the variance decompositions and the unbounded support of the delta method normal approximation.
References

Ahn, S.K. and G.C. Reinsel (1990) "Estimation for Partially Nonstationary Multivariate Autoregressive Models", Journal of the American Statistical Association, 85, 813–823.
Anderson, T.W. (1951) "Estimating Linear Restrictions on Regression Coefficients for Multivariate Normal Distributions", Annals of Mathematical Statistics, 22, 327–351.
Anderson, T.W. (1984) An Introduction to Multivariate Statistical Analysis, 2nd Edition. Wiley: New York.
Andrews, D.W.K. and J.C. Monahan (1990) An Improved Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimator, Cowles Foundation Discussion Paper No. 942, Yale University.
Banerjee, A., J.J. Dolado, D.F. Hendry and G.W. Smith (1986) "Exploring Equilibrium Relationships in Econometrics through Static Models: Some Monte Carlo Evidence", Oxford Bulletin of Economics and Statistics, 48(3), 253–70.
Banerjee, A., J. Dolado, J.W. Galbraith and D.F. Hendry (1993) Co-Integration, Error-Correction, and the Econometric Analysis of Non-Stationary Data. Oxford University Press: Oxford.
Basawa, I.V. and D.J. Scott (1983) Asymptotic Optimal Inference for Nonergodic Models. Springer-Verlag: New York.
Berk, K.N. (1974) "Consistent Autoregressive Spectral Estimates", Annals of Statistics, 2, 489–502.
Bernanke, B. (1986) "Alternative Explanations of the Money-Income Correlation", Carnegie-Rochester Conference Series on Public Policy. North-Holland: Amsterdam.
Beveridge, S. and C.R. Nelson (1981) "A New Approach to Decomposition of Time Series in Permanent and Transitory Components with Particular Attention to Measurement of the 'Business Cycle'", Journal of Monetary Economics, 7, 151–74.
Blanchard, O.J. and D. Quah (1989) "The Dynamic Effects of Aggregate Demand and Supply Disturbances", American Economic Review, 79, 655–73.
Blanchard, O.J. and M.W. Watson (1986) "Are Business Cycles All Alike?", in: R. Gordon, ed., The American Business Cycle: Continuity and Change. University of Chicago Press: Chicago, 123–179.
Bobkoski, M.J. (1983) Hypothesis Testing in Nonstationary Time Series, Ph.D. thesis, Department of Statistics, University of Wisconsin.
Brillinger, D.R. (1980) Time Series, Data Analysis and Theory. Expanded Edition, Holden-Day: San Francisco.
Campbell, J.Y. (1990) "Measuring the Persistence of Expected Returns", American Economic Review, 80(2), 43–47.
Campbell, J.Y. and P. Perron (1991) "Pitfalls and Opportunities: What Macroeconomists Should Know about Unit Roots", NBER Macroeconomics Annual. MIT Press: Cambridge, Mass.
Campbell, J.Y. and R.J. Shiller (1987) "Cointegration and Tests of Present Value Models", Journal of Political Economy, 95, 1062–1088. Reprinted in: R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relations: Readings in Cointegration. Oxford University Press: New York, 1991.
Canova, Fabio (1991) Vector Autoregressive Models: Specification, Estimation and Testing, manuscript, Brown University.
Cavanagh, C.L. (1985) Roots Local to Unity, manuscript, Department of Economics, Harvard University.
Cavanagh, C.L. and J.H. Stock (1985) Inference in Econometric Models with Nearly Nonstationary Regressors, manuscript, Kennedy School of Government, Harvard University.
Chan, N.H. (1988) "On Parameter Inference for Nearly Nonstationary Time Series", Journal of the American Statistical Association, 83, 857–62.
Chan, N.H. and C.Z. Wei (1987) "Asymptotic Inference for Nearly Nonstationary AR(1) Processes", The Annals of Statistics, 15, 1050–63.
Chan, N.H. and C.Z. Wei (1988) "Limiting Distributions of Least Squares Estimates of Unstable Autoregressive Processes", The Annals of Statistics, 16(1), 367–401.
Cochrane, J.H. (1994) "Permanent and Transitory Components of GNP and Stock Prices", Quarterly Journal of Economics, 109, 241–266.
Cochrane, J.H. and A.M. Sbordone (1988) "Multivariate Estimates of the Permanent Components of GNP and Stock Prices", Journal of Economic Dynamics and Control, 12, 255–296.
Davidson, J.E., D.F. Hendry, F. Srba and S. Yeo (1978) "Econometric Modelling of the Aggregate Time-Series Relationship Between Consumers' Expenditure and Income in the United Kingdom", Economic Journal, 88, 661–692.
Davies, R.B. (1977) "Hypothesis Testing When a Parameter is Present Only Under the Alternative", Biometrika, 64, 247–54.
Davies, R.B. (1987) "Hypothesis Testing When a Parameter is Present Only Under the Alternative", Biometrika, 74, 33–43.
Dickey, D.A. and W.A. Fuller (1979) "Distribution of the Estimators for Autoregressive Time Series with a Unit Root", Journal of the American Statistical Association, 74, 427–31.
Elliot, G. and J.H. Stock (1992) Inference in Time Series Regressions when there is Uncertainty about Whether a Regressor Contains a Unit Root, manuscript, Harvard University.
Elliot, G., T.J. Rothenberg and J.H. Stock (1992) Efficient Tests of an Autoregressive Unit Root, NBER Technical Working Paper 130.
Engle, R.F. (1976) "Band Spectrum Regression", International Economic Review, 15, 1–11.
Engle, R.F. (1984) "Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics", in: Z. Griliches and M. Intriligator, eds., Handbook of Econometrics. North-Holland: New York, Vol. 2, 775–826.
Engle, R.F. and C.W.J. Granger (1987) "Cointegration and Error Correction: Representation, Estimation, and Testing", Econometrica, 55, 251–276. Reprinted in: R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relations: Readings in Cointegration. Oxford University Press: New York, 1991.
Engle, R.F. and B.S. Yoo (1987) "Forecasting and Testing in Cointegrated Systems", Journal of Econometrics, 35, 143–59. Reprinted in: R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relations: Readings in Cointegration. Oxford University Press: New York, 1991.
Engle, R.F. and B.S. Yoo (1991) "Cointegrated Economic Time Series: An Overview with New Results", in: R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relations: Readings in Cointegration. Oxford University Press: New York.
Engle, R.F., D.F. Hendry and J.F. Richard (1983) "Exogeneity", Econometrica, 51(2), 277–304.
Fama, E.F. and K.R. French (1988) "Permanent and Transitory Components of Stock Prices", Journal of Political Economy, 96, 246–73.
Faust, J. and E.M. Leeper (1993) Do Long-Run Identifying Restrictions Identify Anything?, manuscript, Board of Governors of the Federal Reserve System.
Fisher, F. (1966) The Identification Problem in Econometrics. McGraw-Hill: New York.
Fisher, M.E. and J.J. Seater (1993) "Long-Run Neutrality and Superneutrality in an ARIMA Framework", American Economic Review, 83(3), 402–415.
Fountis, N.G. and D.A. Dickey (1986) Testing for a Unit Root Nonstationarity in Multivariate Time Series, manuscript, North Carolina State University.
Fuller, W.A. (1976) Introduction to Statistical Time Series. Wiley: New York.
Gali, J. (1992) "How Well does the IS-LM Model Fit Postwar U.S. Data?", Quarterly Journal of Economics, 107, 709–738.
Geweke, J. (1986) "The Superneutrality of Money in the United States: An Interpretation of the Evidence", Econometrica, 54, 1–21.
Giannini, C. (1991) Topics in Structural VAR Econometrics, manuscript, Department of Economics, Università Degli Studi Di Ancona.
Gonzalo, J. (1989) Comparison of Five Alternative Methods of Estimating Long Run Equilibrium Relationships, manuscript, UCSD.
Granger, C.W.J. (1969) "Investigating Causal Relations by Econometric Models and Cross-Spectral Methods", Econometrica, 37, 424–38.
Granger, C.W.J. (1983) Co-Integrated Variables and Error-Correcting Models, UCSD Discussion Paper 83-13.
Granger, C.W.J. and A.P. Andersen (1978) An Introduction to Bilinear Time Series Models. Vandenhoeck & Ruprecht: Göttingen.
Granger, C.W.J. and T.-H. Lee (1990) "Multicointegration", Advances in Econometrics, 8, 71–84. Reprinted in: R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relations: Readings in Cointegration. Oxford University Press: New York, 1991.
Granger, C.W.J. and P. Newbold (1974) "Spurious Regressions in Econometrics", Journal of Econometrics, 2, 111–20.
Granger, C.W.J. and P. Newbold (1976) Forecasting Economic Time Series. Academic Press: New York.
Granger, C.W.J. and A.A. Weiss (1983) "Time Series Analysis of Error-Correcting Models", in: Studies in Econometrics, Time Series and Multivariate Statistics. Academic Press: New York, 255–78.
Hall, R.E. (1978) "Stochastic Implications of the Life Cycle-Permanent Income Hypothesis: Theory and Evidence", Journal of Political Economy, 86(6), 971–87.
Hamilton, J.D. (1994) Time Series Analysis. Princeton University Press: Princeton, NJ.
Hannan, E.J. (1970) Multiple Time Series. Wiley: New York.
Hansen, B.E. (1988) Robust Inference in General Models of Cointegration, manuscript, Yale University.
Hansen, B.E. (1990a) A Powerful, Simple Test for Cointegration Using Cochrane-Orcutt, Working Paper No. 230, Rochester Center for Economic Research.
Hansen, B.E. (1990b) Inference When a Nuisance Parameter is Not Identified Under the Null Hypothesis, manuscript, University of Rochester.
Hansen, B.E. and P.C.B. Phillips (1990) "Estimation and Inference in Models of Cointegration: A Simulation Study", Advances in Econometrics, 8, 225–248.
Hansen, L.P. (1982) "Large Sample Properties of Generalized Method of Moments Estimators", Econometrica, 50, 1029–54.
Hansen, L.P. and T.J. Sargent (1991) "Two Problems in Interpreting Vector Autoregressions", in: L. Hansen and T. Sargent, eds., Rational Expectations Econometrics. Westview: Boulder.
Hausman, J.A. (1983) "Specification and Estimation of Simultaneous Equation Models", in: Z. Griliches and M. Intriligator, eds., Handbook of Econometrics. North-Holland: New York, Vol. 1, 391–448.
Hausman, J.A. and W.E. Taylor (1983) "Identification in Linear Simultaneous Equations Models with Covariance Restrictions: An Instrumental Variables Interpretation", Econometrica, 51(5), 1527–50.
Hausman, J.A., W.K. Newey and W.E. Taylor (1987) "Efficient Estimation and Identification of Simultaneous Equation Models with Covariance Restrictions", Econometrica, 55(4), 849–874.
Hendry, D.F. and T. von Ungern-Sternberg (1981) "Liquidity and Inflation Effects on Consumers' Expenditure", in: A.S. Deaton, ed., Essays in the Theory and Measurement of Consumers' Behavior. Cambridge University Press: Cambridge.
Hodrick, R.J. (1992) "Dividend Yields and Expected Stock Returns: Alternative Procedures for Inference and Measurement", The Review of Financial Studies, 5(3), 357–86.
Horvath, M. and M.W. Watson (1992) Critical Values for Likelihood Based Tests for Cointegration When Some Cointegrating Vectors May be Known, manuscript, Northwestern University.
Horvath, M. and M.W. Watson (1993) Testing for Cointegration When Some of the Cointegrating Vectors are Known, manuscript, Northwestern University.
Johansen, S. (1988a) "Statistical Analysis of Cointegrating Vectors", Journal of Economic Dynamics and Control, 12, 231–54. Reprinted in: R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relations: Readings in Cointegration. Oxford University Press: New York, 1991.
Johansen, S. (1988b) "The Mathematical Structure of Error Correction Models", in: N.U. Prabhu, ed., Contemporary Mathematics, Vol. 80: Structural Inference from Stochastic Processes. American Mathematical Society: Providence, RI.
Johansen, S. (1991) "Estimation and Hypothesis Testing of Cointegrating Vectors in Gaussian Vector Autoregression Models", Econometrica, 59, 1551–1580.
Johansen, S. (1992a) The Role of the Constant Term in Cointegration Analysis of Nonstationary Variables, Preprint No. 1, Institute of Mathematical Statistics, University of Copenhagen.
Johansen, S. (1992b) "Determination of Cointegration Rank in the Presence of a Linear Trend", Oxford Bulletin of Economics and Statistics, 54, 383–397.
Johansen, S. (1992c) "A Representation of Vector Autoregressive Processes Integrated of Order 2", Econometric Theory, 8(2), 188–202.
Johansen, S. and K. Juselius (1990) "Maximum Likelihood Estimation and Inference on Cointegration with Applications to the Demand for Money", Oxford Bulletin of Economics and Statistics, 52(2), 169–210.
Johansen, S. and K. Juselius (1992) "Testing Structural Hypotheses in a Multivariate Cointegration Analysis of the PPP and UIP for UK", Journal of Econometrics, 53, 211–44.
Keating, J. (1990) "Identifying VAR Models Under Rational Expectations", Journal of Monetary Economics, 25(3), 453–76.
King, R.G. and M.W. Watson (1993) Testing for Neutrality, manuscript, Northwestern University.
King, R.G., C.I. Plosser, J.H. Stock and M.W. Watson (1991) "Stochastic Trends and Economic Fluctuations", American Economic Review, 81, 819–840.
Kosobud, R. and L. Klein (1961) "Some Econometrics of Growth: Great Ratios of Economics", Quarterly Journal of Economics, 75, 173–98.
Lippi, M. and L. Reichlin (1993) "The Dynamic Effects of Aggregate Demand and Supply Disturbances: Comment", American Economic Review, 83(3), 644–652.
Lucas, R.E. (1972) "Econometric Testing of the Natural Rate Hypothesis", in: O. Eckstein, ed., The Econometrics of Price Determination. Board of Governors of the Federal Reserve System: Washington, D.C.
Lucas, R.E. (1988) "Money Demand in the United States: A Quantitative Review", Carnegie-Rochester Conference Series on Public Policy, 29, 137–68.
Lütkepohl, H. (1990) "Asymptotic Distributions of Impulse Response Functions and Forecast Error Variance Decompositions of Vector Autoregressive Models", Review of Economics and Statistics, 72, 116–25.
MacKinnon, J.G. (1991) "Critical Values for Cointegration Tests", in: R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relations: Readings in Cointegration. Oxford University Press: New York.
Magnus, J.R. and H. Neudecker (1988) Matrix Differential Calculus. Wiley: New York.
Malinvaud, E. (1980) Statistical Methods of Econometrics. North-Holland: Amsterdam.
Mankiw, N.G. and M.D. Shapiro (1985) "Trends, Random Walks and the Permanent Income Hypothesis", Journal of Monetary Economics, 16, 165–74.
Mittnik, S. and P.A. Zadrozny (1993) "Asymptotic Distributions of Impulse Responses, Step Responses and Variance Decompositions of Estimated Linear Dynamic Models", Econometrica, 61, 857–70.
Ogaki, M. and J.Y. Park (1990) A Cointegration Approach to Estimating Preference Parameters, manuscript, University of Rochester.
Osterwald-Lenum, M. (1992) "A Note with Quantiles of the Asymptotic Distribution of the Maximum Likelihood Cointegration Rank Test Statistics", Oxford Bulletin of Economics and Statistics, 54, 461–71.
Pagan, A. (1984) "Econometric Issues in the Analysis of Regressions with Generated Regressors", International Economic Review, 25, 221–48.
Park, J.Y. (1992) "Canonical Cointegrating Regressions", Econometrica, 60(1), 119–144.
Park, J.Y. and M. Ogaki (1991) Inference in Cointegrated Models Using VAR Prewhitening to Estimate Shortrun Dynamics, Rochester Center for Economic Research Working Paper No. 281.
Park, J.Y. and P.C.B. Phillips (1988) "Statistical Inference in Regressions with Integrated Processes: Part 1", Econometric Theory, 4, 468–97.
Park, J.Y. and P.C.B. Phillips (1989) "Statistical Inference in Regressions with Integrated Processes: Part 2", Econometric Theory, 5, 95–131.
Phillips, P.C.B. (1986) "Understanding Spurious Regressions in Econometrics", Journal of Econometrics, 33, 311–40.
Phillips, P.C.B. (1987a) "Time Series Regression with a Unit Root", Econometrica, 55, 277–301.
Phillips, P.C.B. (1987b) "Toward a Unified Asymptotic Theory for Autoregression", Biometrika, 74, 535–47.
Phillips, P.C.B. (1988) "Multiple Regression with Integrated Regressors", Contemporary Mathematics, 80, 79–105.
Phillips, P.C.B. (1991a) "Optimal Inference in Cointegrated Systems", Econometrica, 59(2), 283–306.
Phillips, P.C.B. (1991b) "Spectral Regression for Cointegrated Time Series", in: W. Barnett, ed., Nonparametric and Semiparametric Methods in Economics and Statistics. Cambridge University Press: Cambridge, 413–436.
Phillips, P.C.B. (1991c) The Tail Behavior of Maximum Likelihood Estimators of Cointegrating Coefficients in Error Correction Models, manuscript, Yale University.
Phillips, P.C.B. (1991d) "To Criticize the Critics: An Objective Bayesian Analysis of Stochastic Trends", Journal of Applied Econometrics, 6, 333–364.
Phillips, P.C.B. and S.N. Durlauf (1986) "Multiple Time Series Regression with Integrated Processes", Review of Economic Studies, 53, 473–96.
Phillips, P.C.B. and B.E. Hansen (1990) "Statistical Inference in Instrumental Variables Regression with I(1) Processes", Review of Economic Studies, 57, 99–125.
Phillips, P.C.B. and M. Loretan (1991) "Estimating Long Run Economic Equilibria", Review of Economic Studies, 58, 407–436.
Phillips, P.C.B. and S. Ouliaris (1990) "Asymptotic Properties of Residual Based Tests for Cointegration", Econometrica, 58, 165–94.
Phillips, P.C.B. and J.Y. Park (1988) "Asymptotic Equivalence of OLS and GLS in Regression with Integrated Regressors", Journal of the American Statistical Association, 83, 111–115.
Phillips, P.C.B. and P. Perron (1988) "Testing for a Unit Root in Time Series Regression", Biometrika, 75, 335–46.
Phillips, P.C.B. and W. Ploberger (1991) Time Series Modeling with a Bayesian Frame of Reference: I. Concepts and Illustrations, manuscript, Yale University.
Phillips, P.C.B. and V. Solo (1992) "Asymptotics for Linear Processes", Annals of Statistics, 20, 971–1001.
Quah, D. (1986) Estimation and Hypothesis Testing with Restricted Spectral Density Matrices: An Application to Uncovered Interest Parity, Chapter 4 of Essays in Dynamic Macroeconometrics, Ph.D. Dissertation, Harvard University.
Rothenberg, T.J. (1971) "Identification in Parametric Models", Econometrica, 39, 577–92.
Rozanov, Y.A. (1967) Stationary Random Processes. Holden-Day: San Francisco.
Runkle, D. (1987) "Vector Autoregressions and Reality", Journal of Business and Economic Statistics, 5(4), 437–442.
Said, S.E. and D.A. Dickey (1984) "Testing for Unit Roots in Autoregressive-Moving Average Models of Unknown Order", Biometrika, 71, 599–608.
Saikkonen, P. (1991) "Asymptotically Efficient Estimation of Cointegrating Regressions", Econometric Theory, 7(1), 1–21.
Saikkonen, P. (1992) "Estimation and Testing of Cointegrated Systems by an Autoregressive Approximation", Econometric Theory, 8(1), 1–27.
Sargan, J.D. (1964) "Wages and Prices in the United Kingdom: A Study in Econometric Methodology", in: P.E. Hart, G. Mills and J.N. Whittaker, eds., Econometric Analysis for National Economic Planning. Butterworths: London.
Sargent, T.J. (1971) "A Note on the Accelerationist Controversy", Journal of Money, Credit and Banking, 3, 50–60.
Shapiro, M. and M.W. Watson (1988) "Sources of Business Cycle Fluctuations", NBER Macroeconomics Annual, 3, 111–56.
Sims, C.A. (1972) "Money, Income and Causality", American Economic Review, 62, 540–552.
Sims, C.A. (1978) Least Squares Estimation of Autoregressions with Some Unit Roots, University of Minnesota, Discussion Paper No. 78-95.
Sims, C.A. (1980) "Macroeconomics and Reality", Econometrica, 48, 1–48.
Sims, C.A. (1986) "Are Forecasting Models Usable for Policy Analysis?", Quarterly Review, Federal Reserve Bank of Minneapolis, Winter.
Sims, C.A. (1989) "Models and Their Uses", American Journal of Agricultural Economics, 71, 489–494.
Sims, C.A., J.H. Stock and M.W. Watson (1990) "Inference in Linear Time Series Models with Some Unit Roots", Econometrica, 58(1), 113–44.
Solo, V. (1984) "The Order of Differencing in ARIMA Models", Journal of the American Statistical Association, 79, 916–21.
Stock, J.H. (1987) "Asymptotic Properties of Least Squares Estimators of Cointegrating Vectors", Econometrica, 55, 1035–56.
Stock, J.H. (1988) "A Reexamination of Friedman's Consumption Puzzle", Journal of Business and Economic Statistics, 6(4), 401–14.
Stock, J.H. (1991) "Confidence Intervals for the Largest Autoregressive Root in U.S. Macroeconomic Time Series", Journal of Monetary Economics, 28, 435–60.
Stock, J.H. (1992) Deciding Between I(0) and I(1), manuscript, Harvard University.
Stock, J.H. (1993) Forthcoming in: R.F. Engle and D. McFadden, eds., Handbook of Econometrics, Vol. 4. North-Holland: New York.
Stock, J.H. and M.W. Watson (1988a) "Interpreting the Evidence on Money-Income Causality", Journal of Econometrics, 40(1), 161–82.
Stock, J.H. and M.W. Watson (1988b) "Testing for Common Trends", Journal of the American Statistical Association, 83, 1097–1107. Reprinted in: R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relations: Readings in Cointegration. Oxford University Press: New York, 1991.
Stock, J.H. and M.W. Watson (1993) "A Simple Estimator of Cointegrating Vectors in Higher Order Integrated Systems", Econometrica, 61, 783–820.
Stock, J.H. and K.D. West (1988) "Integrated Regressors and Tests of the Permanent Income Hypothesis", Journal of Monetary Economics, 21(1), 85–95.
Sweeting, T. (1983) "On Estimator Efficiency in Stochastic Processes", Stochastic Processes and their Applications, 15, 93–98.
Theil, H. (1971) Principles of Econometrics. Wiley: New York.
Toda, H.Y. and P.C.B. Phillips (1993a) "Vector Autoregressions and Causality", Econometrica, 61, 1367–1393.
Toda, H.Y. and P.C.B. Phillips (1993b) "Vector Autoregressions and Causality: A Theoretical Overview and Simulation Study", Econometric Reviews, 12, 321–364.
Tsay, R.S. and G.C. Tiao (1990) "Asymptotic Properties of Multivariate Nonstationary Processes with Applications to Autoregressions", Annals of Statistics, 18, 220–50.
West, K.D. (1988) "Asymptotic Normality when Regressors Have a Unit Root", Econometrica, 56, 1397–1417.
White, H. (1984) Asymptotic Theory for Econometricians. Academic Press: New York.
Whittle, P. (1983) Prediction and Regulation by Linear Least-Square Methods. Second Edition, Revised. University of Minnesota Press: Minneapolis.
Wold, H. (1954) "Causality and Econometrics", Econometrica, 22, 162–177.
Wooldridge, J. (1993) Forthcoming in: R.F. Engle and D. McFadden, eds., Handbook of Econometrics, Vol. 4. North-Holland: New York.
Yoo, B.S. (1987) Co-Integrated Time Series: Structure, Forecasting and Testing, Ph.D. Dissertation, UCSD.
Yule, G.U. (1926) "Why Do We Sometimes Get Nonsense-Correlations Between Time-Series?", Journal of the Royal Statistical Society, 89, 1–64.
Chapter 48

ASPECTS OF MODELLING NONLINEAR TIME SERIES*

TIMO TERÄSVIRTA
Copenhagen Business School and Bank of Norway

DAG TJØSTHEIM
University of Bergen

CLIVE W.J. GRANGER
University of California, San Diego
Contents

Abstract   2919
1. Introduction   2919
2. Types of nonlinear models   2921
   2.1. Models from economic theory   2921
   2.2. Models from time series theory   2922
   2.3. Flexible statistical parametric models   2923
   2.4. State-dependent, time-varying parameter and long-memory models   2924
   2.5. Nonparametric models   2925
3. Testing linearity   2926
   3.1. Tests against a specific alternative   2927
   3.2. Tests without a specific alternative   2930
   3.3. Constancy of conditional variance   2933
4. Specification of nonlinear models   2934
5. Estimation in nonlinear time series   2937
   5.1. Estimation of parameters in parametric models   2937
   5.2. Estimation of nonparametric functions   2938
   5.3. Estimation of restricted nonparametric and semiparametric models   2942
6. Evaluation of estimated models   2945
7. Example   2946
8. Conclusions   2952
References   2953

*The work for this paper originated when TT and DT were visiting the University of California, San Diego. They wish to thank the economics and mathematics departments, respectively, of UCSD for their hospitality and John Rice and Murray Rosenblatt, in particular. The research of TT was also supported by the University of Göteborg, Bank of Norway and a grant from the Yrjö Jahnsson Foundation. DT acknowledges financial support from the Norwegian Council for Research and CWJG from NSF, Grant SES 9023037.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden © 1994 Elsevier Science B.V. All rights reserved
Abstract

This paper surveys some of the recent developments in nonlinear analysis of economic time series. The emphasis lies on stochastic models. Various classes of nonlinear models appearing in the economics and time series literature are presented and discussed. Linearity testing and estimation of nonlinear models, both parametric and nonparametric, are considered, as well as post-estimation model evaluation. Data-based nonlinear model building is illustrated with an empirical example.
1. Introduction
It is common practice for economic theories to postulate nonlinear relationships between economic variables, production functions being an example. If a theory suggests a specific functional form, econometricians can propose estimation techniques for the parameters, and asymptotic results about normality and consistency, under given conditions, are known for these estimates; see, e.g. Judge et al. (1983), White (1984) and Gallant (1987, Chapter 7). However, in many cases the theory does not provide a single specification, or specifications are incomplete and may not capture the major features of the actual data, such as trends, seasonality or the dynamics. When this occurs, econometricians can try to propose more general specifications and tests of them. There are clearly an immense number of possible parametric nonlinear models and there are also many nonparametric techniques for approximating them. Given the limited amount of data that is usually available in economics it would not be appropriate to consider many alternative models or to use many techniques. Because of the wide possibilities, the methods and models available for analysing nonlinearities are usually very flexible, so that they can provide good approximations to many different generating mechanisms. A consequence is that, with fairly small samples, the methods are inclined to over-fit, so that if the true mechanism is linear, say, with residual variance σ², the fitted model may appear to find nonlinearity and an estimated residual variance less than σ². The estimated model will then be inclined to forecast badly in the post-sample period. It is therefore necessary to have a specific research strategy for modelling nonlinear relationships between time series.

In this chapter the modelling process concentrates on a particular situation, where there is a single dependent variable y_t to be explained and x_t is a vector of exogenous variables. Let I_t be the information set

I_t: y_{t−j}, j = 1, 2,…; x_{t−j}, j = 0, 1, 2,…,   (1.1)
and denote all of the variables (and lags) used in I_t by w_t. The modelling process will then attempt to find a satisfactory approximation for f(w_t) such that

E(y_t | I_t) = f(w_t).   (1.2)

If the error is ε_t = y_t − f(w_t), then in some cases a more parsimonious representation will specifically include lagged ε's in f(·).
The strategy proposed is as follows.

(i) Test y_t for linearity, using the information I_t. As there are many possible forms of nonlinearity it is likely that no one test will be powerful against them all, so several tests may be needed.

(ii) If linearity is rejected, consider a small number of alternative parametric models and/or nonparametric estimates. Linearity tests may give guidance as to which kind of nonlinear models to consider.

(iii) These models should be estimated in-sample and compared out-of-sample. The properties of the estimated models should be checked. If a single model is required, the one that is best out-of-sample may be selected and reestimated over all available data.

The strategy is by no means guaranteed to be successful. For example, if the nonlinearity is associated with a particular feature of the data, but this feature does not occur in the post-sample evaluation period, then the nonlinear model may not perform any better than a linear model.

Section 2 of the chapter briefly considers some parametric models, Section 3 discusses tests of linearity, Section 4 reviews specification of nonlinear models, Section 5 considers estimation and Section 6 evaluation of estimated models. Section 7 contains an example and Section 8 concludes.

This survey largely deals with linearity in the conditional mean, which occurs if f(w_t) in (1.2) can be well approximated by some linear combination φ′w_t of the components of w_t. It will generally be assumed that w_t contains lagged values of y_t plus, possibly, present and lagged values of x_t, including 1. This definition avoids the difficulty of deciding whether or not processes having forms of heteroskedasticity that involve explanatory or lagged variables, such as ARCH, are nonlinear. It is clear that some tests of linearity will be confused by these types of heteroskedasticity. Recent surveys of some of the topics considered here include Tong (1990) for univariate time series, Delgado and Robinson (1992), Härdle (1990) and Tjøstheim (1994) for semi- and nonparametric techniques, Brock and Potter (1993) for linearity testing, and Granger and Teräsvirta (1993).

There has recently been a lot of interest, particularly by economic theorists, in
chaotic processes, which are deterministic series that have some of the linear properties of familiar stochastic processes. A well known example is the "tent-map" y_t = 4y_{t−1}(1 − y_{t−1}), which, with a suitable starting value in (0, 1), generates a series with all autocorrelations equal to zero and thus a flat spectrum, and so may be called a "white chaos", as a stochastic white noise also has these properties. Economic theories can be constructed which produce such processes, as discussed in Chen and Day (1992). Econometricians are unlikely to expect such models to be relevant in economics, having a strong affiliation with stochastic models, and, so far, there is no evidence of actual economic data having been generated by a deterministic mechanism. A difficulty is that there is no statistical test which has chaos as a null hypothesis, so that non-rejection of the null could be claimed to be evidence in favour of chaos. For a discussion and illustrations, see Liu et al. (1992). However, a much-used linearity test has been proposed by Brock et al. (1987), based on chaos theory, whose properties are discussed in Section 3.2. The hope in using nonlinear models is that better explanations can be provided of economic events and consequently better forecasts. If the economy were found to be chaotic, and if the generating mechanism could be discovered using some learning model, say, then forecasts would be effectively exact, without any error.
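The "white chaos" property is easy to illustrate by simulation. The sketch below (with an arbitrary starting value) iterates the map and shows that its sample autocorrelations are close to zero, like those of a stochastic white noise.

```python
import numpy as np

# Iterate the deterministic map y_t = 4 y_{t-1} (1 - y_{t-1}).
y = np.empty(10000)
y[0] = 0.3                       # arbitrary starting value in (0, 1)
for t in range(1, len(y)):
    y[t] = 4.0 * y[t - 1] * (1.0 - y[t - 1])

yc = y - y.mean()
acf = [np.dot(yc[:-k], yc[k:]) / np.dot(yc, yc) for k in range(1, 6)]
print(np.round(acf, 3))          # near zero at all lags, despite no randomness
```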
2. Types of nonlinear models

2.1. Models from economic theory
Theory can be used both to suggest possibly sensible nonlinear models and, by taking into account some optimizing behaviour with assumed cost or utility functions, to produce a model. An example is a relationship of the form

y_t = min(φ'w_t, θ'w_t) + ε_t,  (2.1)
so that y_t is the smaller of a pair of alternative linear combinations of the vector of variables used to model y_t. This model arises from a disequilibrium analysis of some simple markets, with the linear combinations representing supply and demand curves; for more discussion see Quandt (1982) and Maddala (1986). If we replace the "min condition" by another variable z_{t−d}, which may also be one of the elements of w_t but not 1, we may have

y_t = φ'w_t + θ'w_t F(z_{t−d}) + u_t,  (2.2)
where F(z_{t−d}) = 0 for z_{t−d} ≤ c and F(z_{t−d}) = 1 for z_{t−d} > c. This is a switching regression model with switching variable z_{t−d}, where d is the delay parameter; see Quandt (1982). In univariate time series analysis (2.2) is called a two-regime threshold autoregressive model; see, e.g. Tong (1990). Model (2.2) may be generalized by
assuming a continuum of regimes instead of only two. This can be done, for instance, by defining

F(z_{t−d}) = (1 + exp[−γ(z_{t−d} − c)])^{−1},  γ > 0,  (2.3)
in (2.2). Maddala (1977, p. 396) [see also Bacon and Watts (1971)] has already proposed such a generalization, which is here called a logistic smooth transition regression (LSTR) model. F may also have the form of a probability density rather than a cumulative distribution function. In the univariate case this would correspond to the exponential smooth transition autoregressive (ESTAR) model (Teräsvirta, 1994) or its well-known special case, the exponential autoregressive model (Haggan and Ozaki, 1981). The transition variable may represent changing political or policy regimes, high versus low inflation, upswings versus downswings of the business cycle and so forth. These switching models or their smooth transition counterparts occur frequently in theory which, for example, suggests changes in relationships when there is idle production capacity versus otherwise, or when unemployment is low versus high. Aggregation considerations suggest that a smooth transition regression model may often be more sensible than the abrupt change in (2.2). Some theories lead to models that have also been suggested by time series statisticians. An example is the bivariate nonlinear autoregressive model described as a "prey-predator" model by Desai (1984), taking the form
Δy_{1t} = −a + b exp(y_{2t}),
Δy_{2t} = c − d exp(y_{1t}),
where y_{1t} is the logarithm of the share of wages in national income and y_{2t} is the logarithm of the employment rate. Other examples can be found in Chen and Day (1992). The fact that some models do arise from theory justifies their consideration, but it does not imply that they are necessarily superior to other models that currently do not arise from economic theory.
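Before turning to time series models, a small simulation may help fix ideas about (2.2) and (2.3). The following sketch (not from the original chapter; all parameter values are hypothetical) generates a univariate logistic smooth transition autoregression; as γ grows, F approaches a step function and the model approaches the two-regime threshold case (2.2):

    import numpy as np

    rng = np.random.default_rng(0)
    T, gamma, c, d = 500, 5.0, 0.0, 1      # illustrative values
    y = np.zeros(T)
    for t in range(1, T):
        # logistic transition function (2.3) evaluated at z_{t-d} = y_{t-d}
        F = 1.0 / (1.0 + np.exp(-gamma * (y[t - d] - c)))
        # the AR(1) coefficient moves smoothly between 0.8 and 0.8 - 1.2 = -0.4
        y[t] = 0.8 * y[t - 1] - 1.2 * y[t - 1] * F + rng.normal(scale=0.5)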
2.2. Models from time series theory
The linear autoregressive, moving average and transfer function models have been popular in the time series literature following the work by Box and Jenkins (1970), and there are a variety of natural generalizations to nonlinear forms. If the information set being considered is

I_t = {y_{t−j}, j = 1,...,q; x_{t−i}, i = 0,...,q},  q < ∞,

denote by ε_t the residual from y_t explained by I_t and let e_{kt} be the residual from x_{kt} explained by I_t (excluding x_{kt} itself). The components of the models considered in this section are nonlinear functions such as g(y_{t−j}), h(x_{k,t−i}), G(ε_{t−j}),
H(e_{k,t−i}), plus cross-products such as y_{t−j}x_{k,t−i}, y_{t−j}ε_{t−i}, x_{k,t−j}e_{k,t−i} or ε_{t−j}e_{k,t−i}. A model would string together several such components, each with a parameter. For a given specification the model is linear in the parameters, so they can be easily estimated by OLS. The big questions concern the specification of the model: what components, functions and lags to use. There are so many possible components and combinations that the "curse of dimensionality" soon becomes apparent, so that choices of specification have to be made. Several classes of models have been considered. They include:

(i) nonlinear autoregressive models, involving only functions of the dependent variable; typically only simple mathematical functions have been considered (such as sine or cosine, sign, modulus, integer powers, logarithm of modulus or ratios of low order polynomials);
(ii) nonlinear transfer function models, using functions of the lagged dependent variable and current and lagged explanatory variables, usually separately;
(iii) bilinear models, y_t = Σ_{j,k} β_{jk} y_{t−j}ε_{t−k} + similar terms involving products of a component of x_t and a lagged residual of some kind; this can be thought of as one equation of a multivariate bilinear system, as considered by Stensholt and Tjøstheim (1987);
(iv) nonlinear moving averages, being sums of functions of lagged residuals ε_t, e_{kt};
(v) doubly stochastic models, which contain cross-products between lagged y_t and current and lagged components of x_{kt} or a random parameter process, and are discussed in Tjøstheim (1986).

Most of the models are augmented by a linear autoregressive term. There has been little consideration of mixtures of these models. Because of the difficulty of analysis, lags are often taken to be small. Specifying the lag structure in nonlinear models is discussed in Section 4. A number of results are available for some of these models, such as stability for simple nonlinear autoregressive models (Lasota and Mackey, 1989) and stationarity and invertibility of bilinear models or the autocorrelation properties of certain bilinear systems, but they are often too complicated to be used in practice. To study stability or invertibility of a specific model it is recommended that a long simulation be formed and the properties of the resulting series be studied, as sketched below. There is not a lot of experience with these models in a multivariate setting and little success in their use has been reported. At present they cannot be recommended for use in preference to the smooth transition regression model of the previous section or the more structured models of the next section. A simple nonlinear autoregressive or bilinear model with just a few terms may be worth considering from this group.
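The recommended long-simulation check takes only a few lines; the bilinear specification and its coefficients below are hypothetical, used only to show the idea:

    import numpy as np

    # Simulate y_t = a*y_{t-1} + b*y_{t-1}*e_{t-1} + e_t for a long stretch
    # and inspect the path for divergence.
    rng = np.random.default_rng(1)
    a, b, T = 0.5, 0.3, 100_000
    y = np.zeros(T)
    e_prev = 0.0
    for t in range(1, T):
        e = rng.normal()
        y[t] = a * y[t - 1] + b * y[t - 1] * e_prev + e
        e_prev = e
    print(np.abs(y).max())   # an exploding maximum would suggest instability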
2.3. Flexible statistical parametric models
A number of important modelling procedures concentrate on models of the form

y_t = φ'w_t + Σ_{j=1}^{p} β_j φ_j(γ_j'w_t) + ε_t,  (2.4)
where w_t is a vector of past y_t values and past and present values of a vector of explanatory variables x_t, plus a constant. The first component of the model is linear and the φ_j(x) are a set of specific functions of x, examples being:

(i) power series, φ_j(x) = x^j (x is generally not a lag of y);
(ii) trigonometric, φ_j(x) = sin x or cos x; (2.4) augmented by a quadratic term w_t'Aw_t gives the flexible functional forms discussed by Gallant (1981);
(iii) φ_j(x) = φ(x) for all j, where φ(x) is a "squashing function" such as a probability density function or the logistic function φ(x) = [1 + exp(−x)]^{−1}; this is a neural network model, which has been used successfully in various fields, especially as a learning model, see, e.g. White (1989) or Kuan and White (1994);
(iv) if φ_j(x) is estimated nonparametrically, by a "super-smoother", say, the method is that of "projection pursuit", as briefly described in the next section.

The first three models are dense, in the sense that theorems exist showing that any well-behaved function can be approximated arbitrarily well by a high enough choice of p, the number of terms in the sum; see, for example, Stinchcombe and White (1989). In practice, the small sample sizes available in economics limit p to a small number, say one or two, to keep the number of parameters to be estimated at a reasonable level. In theory p should be chosen using some stopping criterion or goodness-of-fit measure; in practice, a small, arbitrary value is usually chosen, or some simple experimentation is undertaken. These models are sufficiently structured to provide interesting and probably useful classes of nonlinear relationships in practice. They are natural alternatives to nonparametric and semiparametric models. A nonparametric model, as discussed in Section 2.5, produces an estimate of a function at every point in the space of explanatory variables by using some smoother, but not a specific parametric function. The distinction between parametric and nonparametric estimators is not sharp, as methods using splines or neural nets with an undetermined cut-off value indicate. This is the case, in particular, for the restricted nonparametric models considered in Section 5.3.
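A minimal sketch of fitting a model of the form (2.4) with logistic squashing functions follows. In the spirit of Lee et al. (1993), who select the γ_j in (2.4) randomly (see Section 3.2), the direction vectors are fixed at random draws so that the remaining parameters enter linearly and OLS applies; the data and all settings are hypothetical:

    import numpy as np

    rng = np.random.default_rng(2)
    # hypothetical data: y depends nonlinearly on two regressors in W
    W = rng.normal(size=(400, 2))
    y = np.sin(W[:, 0]) + 0.5 * W[:, 1] ** 2 + 0.1 * rng.normal(size=400)

    p = 2                                    # small p, as the text suggests
    G = rng.normal(size=(2, p))              # random directions gamma_j
    H = 1.0 / (1.0 + np.exp(-W @ G))         # logistic squashing of gamma_j'w_t
    Z = np.column_stack([np.ones(400), W, H])       # linear part plus p terms
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)    # OLS: linear in beta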
2.4. State-dependent, time-varying parameter and long-memory models

Priestley (1988) has discussed a very general class of models for a system taking the form

Y_t = μ(x_{t−1}) + Σ_{i=1}^{k} φ_i(x_{t−1})Y_{t−i} + ε_t
(moving average terms can also be included), where Y_t is a k × 1 stochastic vector and x_t is a "state variable" consisting of x_t = (Y_t, Y_{t−1},...,Y_{t−k+1}), which is updated by the Markov system

x_{t+1} = h(x_t) + F(x_t)x_t + v_{t+1}.
Here the φ's and the components of the matrix F are general functions which in practice will be approximated by linear or low-order polynomials. Many of the models discussed in Section 2.2 can be embedded in this form. It is clearly related to the extended Kalman filter [see Anderson and Moore (1979)] and to time-varying parameter ARMA models, where the parameters evolve according to some simple AR model; see Granger and Newbold (1986, Chapter 10). For practical use various approximations can be applied, but so far there is little actual use of these models with multivariate economic series. For most of the models considered in Section 2.2 the series are assumed to be stationary, but this is not always a reasonable assumption in economics. In a linear context many actual series are I(1), in that they need to be differenced in order to become stationary, and some pairs of variables are cointegrated, in that they are both I(1) but there exists a linear combination that is stationary. A start to generalizing these concepts to nonlinear cases has been made by Granger and Hallman (1991a,b). I(1) is replaced by a long-memory concept and cointegration by a possibly nonlinear attractor, so that y_t, x_t are each long-memory but there is a function g(·) such that y_t − g(x_t) is stationary. A nonparametric estimator for g(·) is proposed and an example provided.
2.5. Nonparametric models
Nonparametric modelling of time series does not require an explicit model, but for reference purposes it is assumed that there is the following model, where {y_t, x_t} are observed:

y_t = f(y_{t−1}, x_{t−1}) + g(y_{t−1}, x_{t−1})ε_t,  (2.5)
with {x_t} exogenous, where y_{t−1} = (y_{t−i_1},...,y_{t−i_p}) and x_{t−1} = (x_{t−j_1},...,x_{t−j_q}) are vectors of lagged variables, and {ε_t} is a sequence of martingale differences with respect to the information set I_t = {y_{t−i}, i > 0; x_{t−i}, i > 0}. The joint process {y_t, x_t} is assumed to be stationary and strongly mixing [cf. Robinson (1983)]. The model formulation can be generalized to several variables and to instantaneous transformations of exogenous variables. There has recently been a surge of interest in nonparametric modelling; for references see, for instance, Ullah (1989), Barnett et al. (1991) and Härdle (1990). The motivation is to approach the data with as much flexibility as possible, not being restricted by the straitjacket of a particular class of parametric models. However, more observations are needed to obtain estimates of comparable variability. In econometric applications the two primary quantities of interest are the conditional mean

M(y; x) = M(y_1,...,y_p; x_1,...,x_q)
        = E(y_t | y_{t−i_1} = y_1,...,y_{t−i_p} = y_p; x_{t−j_1} = x_1,...,x_{t−j_q} = x_q)  (2.6)
and the conditional variance

V(y; x) = V(y_1,...,y_p; x_1,...,x_q)
        = var(y_t | y_{t−i_1} = y_1,...,y_{t−i_p} = y_p; x_{t−j_1} = x_1,...,x_{t−j_q} = x_q).  (2.7)

The conditional mean gives the optimal least squares predictor of y_t given the lagged values y_{t−i_1},...,y_{t−i_p}; x_{t−j_1},...,x_{t−j_q}. Derivatives of M(y; x) can also have economic interpretations (Ullah, 1989) and can be estimated nonparametrically. The conditional variance can be used to study volatility. For (2.5), M(y; x) = f(y, x) and V(y; x) = σ²g²(y, x), where σ² = E(ε_t²). As pointed out in the introduction, this survey mainly concentrates on M(y; x), while it is assumed that g(y; x) ≡ 1. A problem of nonparametric modelling in several dimensions is the curse of dimensionality. As the number of lags and regressors increases, the number of observations in a unit volume element of regressor space can become very small, and it is difficult to obtain meaningful nonparametric estimates of (2.6) and (2.7). Special methods have been designed to overcome this obstacle, and they will be considered in Sections 4 and 5.3. Applying these methods often results in a model which is an end product in that no further parametric modelling is necessary. Another remedy to dimension difficulties is to apply semiparametric models. These models usually assume linear and parametric dependence on some variables, and nonparametric functional dependence on the rest. The estimation of such models as well as of restricted nonparametric ones will be considered in Section 5.3.
3. Testing linearity
When parametric nonlinear models are used for modelling economic relationships, model specification is a crucial issue. Economic theory is often too vague to allow complete specification of even a linear, let alone a nonlinear model. Usually at least the specification of the lag structure has to be carried out using the available data. As discussed in the introduction, the type of nonlinearity best suited for describing the data may not be clear at the outset either. The first step of a specification strategy for any type of nonlinear model should therefore consist of testing linearity. As mentioned above, it may not be difficult at all to fit a nonlinear model to data from a linear process, interpret the results and draw possibly erroneous conclusions. If the time series are short that may sometimes be successfully done even in situations in which the nonlinear model is not identified under the linearity hypothesis. There is more statistical theory available for linear than nonlinear models and the parameter estimation in the former models is generally simpler than in the latter. Finally, multi-step forecasting with nonlinear models is more complicated than with linear ones. Therefore the need for a nonlinear model should be considered before any attempt at nonlinear modelling.
3.1. Tests against a specific alternative
Since the estimation of nonlinear models is generally more difficult than that of linear models, it is natural to look for linearity tests which do not require estimation of any nonlinear alternative. In cases where the model is not identified under the null hypothesis of linearity, tests based on the estimation of the nonlinear alternative would normally not even be available. The score or Lagrange multiplier principle thus appears useful for the construction of linearity tests. In fact, many well-known tests in the literature are Lagrange multiplier (LM) or LM type tests. Moreover, some well-known tests, such as Tsay's (1986), which have been introduced as general linearity tests without a specific nonlinear alternative in mind, can be interpreted as LM tests against a particular nonlinear model. Other tests, not built upon the LM principle, do exist and we shall mention some of them. Recent accounts of linearity testing in nonlinear time series analysis include Brock and Potter (1993), De Gooijer and Kumar (1992), Granger and Teräsvirta (1993, Chapter 6) and Tong (1990, Chapter 5). For small-sample comparisons of some of the tests, see Chan and Tong (1986), Lee et al. (1993), Luukkonen et al. (1988a), Petruccelli (1990) and Teräsvirta et al. (1993). Consider the following nonlinear model

y_t = φ'w_t + f(θ, w_t, v_t) + u_t,  (3.1)
where w_t = (1, y_{t−1},...,y_{t−p}, x_{t1},...,x_{tk})', v_t = (u_{t−1},...,u_{t−s})', u_t = g(ψ, φ, θ, w_t, v_t)ε_t and ε_t is a martingale difference process: E(ε_t | I_t) = 0, cov(ε_t | I_t) = σ_ε², where I_t is as in (1.1). It follows that E(u_t | I_t) = 0 and cov(u_t | I_t) = σ_ε²g²(ψ, φ, θ, w_t, v_t). Assume that f is at least twice continuously differentiable with respect to the parameters θ = (θ_1,...,θ_m)'. Let f(0, w_t, v_t) ≡ 0, so that the linearity hypothesis becomes H_0: θ = 0. Here we shall concentrate on the case g ≡ 1, so that u_t ≡ ε_t. To test the linearity hypothesis, write the conditional (pseudo) logarithmic likelihood function as
l_t = c − (1/2) log σ_ε² − u_t²/(2σ_ε²).

The relevant block of the score vector, scaled by 1/√T, becomes

(1/√T) Σ_{t=1}^{T} ∂l_t/∂θ = (σ_ε²√T)^{−1} Σ_{t=1}^{T} u_t h_t,  where h_t = ∂f(θ, w_t, v_t)/∂θ.
This is the block that is nonzero under the null hypothesis. The information matrix is block diagonal, such that the diagonal element conforming to σ_ε² forms a separate block; the inverse of the block related to θ and evaluated at H_0 is proportional to (H̃'M_W H̃)^{−1}, where h̃_t is h_t evaluated at H_0 and H̃ = (h̃_1,...,h̃_T)'; see, e.g. Granger and Teräsvirta (1993, Chapter 6). Setting ũ = (ũ_1,...,ũ_T)', the test statistic, in obvious notation, has the form

LM = σ̂^{−2} ũ'H̃(H̃'M_W H̃)^{−1}H̃'ũ,  (3.2)
where M_W = I − W(W'W)^{−1}W', σ̂² = (1/T)Σ_{t=1}^{T} ũ_t², and the vector ũ consists of residuals from (3.1) estimated consistently under H_0 and g ≡ 1. Under a set of assumptions which are moment conditions for (2.2) [see White (1984, Theorem 4.25)], (3.2) has an asymptotic χ²(m) distribution when H_0 holds. A practical way of carrying out the test is by ordinary least squares as follows. (i) Regress y_t on w_t, compute the residuals ũ_t and the sum of squared residuals SSR_0. (ii) Regress ũ_t on w_t and h̃_t, and compute the sum of squared residuals SSR_1. (iii) Compute
F(m, T − n − m) = [(SSR_0 − SSR_1)/m] / [SSR_1/(T − n − m)]
with n = k + p + 1; this has an approximate F distribution under θ = 0. The use of an F test instead of the χ² test given by the asymptotic theory is recommended in small samples because of its good size and power properties; see Harvey (1990, pp. 174-175). As an example, assume w_t = (1, w̃_t')' with w̃_t = (y_{t−1},...,y_{t−p})' and f = v_t'Θw̃_t = (v_t ⊗ w̃_t)'vec(Θ), so that (3.1) is a univariate bilinear model. Then h_t = (v_t ⊗ w̃_t), h̃_t = (ṽ_t ⊗ w̃_t), and (3.2) is a linearity test against bilinearity as discussed in Weiss (1986) and Saikkonen and Luukkonen (1988). In a few cases f in (3.1) factors as follows:
f(θ, w_t) = (θ_1'w_t) f_1(θ_2, θ_3, w_t)  (3.3)
with f_1(0, θ_3, w_t) ≡ 0. Assume that θ_2 is a scalar whereas θ_3 may be a vector. This is the case for many nonlinear models such as the smooth transition regression models discussed in Section 2.1. Vector v_t is excluded for simplicity. The linearity hypothesis can be expressed as H_01: θ_1 = 0. However, H_02: θ_2 = 0 is also a valid linearity hypothesis. This is an indication of the fact that (3.1) with (3.3) is only identified under the alternative θ_2 ≠ 0 but not under θ_2 = 0. If we choose H_02 as our
starting-point, we may use the Taylor expansion

f_1(θ_2, θ_3, w_t) = f_1(0, θ_3, w_t) + h_1(θ_3, w_t)θ_2 + R_2(θ_2, θ_3, w_t),  (3.4)

where h_1 = ∂f_1/∂θ_2 evaluated at θ_2 = 0 and R_2 is the remainder. Assume furthermore that h_1 has the form

h_1 = β(θ_3)'k(w_t),  (3.5)
where β(θ_3) and k(w_t) are l × 1 vectors. Next replace f_1 in (3.3) by the first-order Taylor approximation at θ_2 = 0, f_1(θ_2, θ_3, w_t) ≈ β(θ_3)'k(w_t)θ_2. Substituting this into (3.3) and collecting terms, where ψ_1 = ψ_1(θ_1, θ_2, θ_3) and ψ_2 = θ_2ψ̃_2(θ_1, θ_3), and the vector g(w_t) contains those elements of k(w_t)w_t' that are of higher order than one, it follows that the approximation of (3.1) has the form

y_t = ψ_1'w_t + ψ_2'g(w_t) + u_t*.  (3.6)
The test can be carried out as before. After estimating (3.1) under H_0, ũ_t is regressed on w_t and g(w_t), and under H_02': ψ_2 = 0 the test statistic has an asymptotic χ²(s) distribution if g(w_t) is an s × 1 vector. From (3.6) it is seen that the original null hypothesis H_02 has been transformed into H_02': ψ_2 = 0. Approximating f_1 as in (3.4) and reparametrizing the model may be seen as a way of removing the identification problem. However, it may also be seen as a solution in the spirit of Davies (1977, 1987). Let ũ* be the residual vector from the regression (3.6). Then
ũ*'ũ* = inf_{θ_1, θ_3} ũ(θ)'ũ(θ),

where the elements ũ_t(θ) = y_t − ψ_1'w_t − θ_2ψ̃_2(θ_1, θ_3)'g(w_t) are the OLS residuals from regressing y_t on w_t and ψ̃_2(θ_1, θ_3)'g(w_t) while keeping θ_1 and θ_3 fixed. The test statistic is

F̄ = sup_{θ_1, θ_3} F(θ_2; θ_1, θ_3) = {[ũ'ũ − inf_{θ_1, θ_3} ũ(θ)'ũ(θ)]/s} / {inf_{θ_1, θ_3} ũ(θ)'ũ(θ)/(T − n − s)}.
The price of the neat asymptotic null distribution is that not all the information in the original model has been used: the original null hypothesis involved only a single parameter. As an example, assume w_t = w̃_t = (y_{t−1},...,y_{t−p})', let β(θ_3) = 1 and k(w_t) = y²_{t−1}. This gives ψ_2 = θ_2θ_1 and g(w_t) = (y³_{t−1}, y²_{t−1}y_{t−2},...,y²_{t−1}y_{t−p})'. The resulting test is the linearity test against the univariate exponential autoregressive model in Saikkonen and Luukkonen (1988). In that model, f_1 = 1 − exp[−θ_2(y_{t−1} − θ_3)²] with θ_3 = 0, θ_2 > 0. Take another example where k(w_t) = w̃_t, so that ψ_2'g(w_t) = Σ_{i=1}^{p} Σ_{j=i}^{p} φ_ij y_{t−i}y_{t−j}, and H_02': φ_ij = 0, i = 1,...,p; j = i,...,p. The test is the first of the three linearity tests against smooth transition autoregression in Luukkonen et al. (1988b) when the delay parameter d is unknown but it is assumed that 1 ≤ d ≤ p. The number of degrees of freedom in the asymptotic null distribution equals p(p + 1)/2. If w_t also contains variables other than lags of y_t, the test is a linearity test against smooth transition regression; see Granger and Teräsvirta (1993, Chapter 6). If the delay parameter is known, k(w_t) = (1, y_{t−d})', so that g(w_t) = (y_{t−1}y_{t−d},...,y²_{t−d},...,y_{t−p}y_{t−d})' and the F test has p and T − n − p degrees of freedom. In some cases the first-order Taylor series approximation is inadequate. For instance, let θ_1 = (θ_10, 0,...,0)' in (3.3), so that the only nonlinearity is described by f_1 multiplied by a constant. Assume furthermore that k(w_t) = w̃_t, so that θ_1'w_t = θ_10 and β(θ_3)'k(w_t) = β(θ_3)'w̃_t. Then the LM type test has no power against the alternative, because the auxiliary regression (3.6) is of order one, i.e. ψ_2 = 0. In such a situation a third-order Taylor series approximation of f_1 is needed for constructing a proper test; see Luukkonen et al. (1988b) for discussion.
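The auxiliary-regression form of these LM type tests is easy to program. The sketch below implements steps (i)-(iii) above for a generic matrix of added nonlinear regressors; the data-generating process and the choice of added term (a single bilinear-type product) are hypothetical:

    import numpy as np

    def linearity_F(y, W, H):
        """(i) regress y on W, keep residuals u and SSR0; (ii) regress u on
        (W, H), keep SSR1; (iii) form the F statistic of Section 3.1."""
        T, n = W.shape
        m = H.shape[1]
        u = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]
        SSR0 = u @ u
        Z = np.column_stack([W, H])
        v = u - Z @ np.linalg.lstsq(Z, u, rcond=None)[0]
        SSR1 = v @ v
        return ((SSR0 - SSR1) / m) / (SSR1 / (T - n - m))

    rng = np.random.default_rng(3)
    y = rng.normal(size=300)                      # the linear null is true here
    yt, y1, y2 = y[2:], y[1:-1], y[:-2]
    W = np.column_stack([np.ones(len(yt)), y1])   # linear AR(1) regressors
    H = (y1 * y2).reshape(-1, 1)                  # added term y_{t-1}*y_{t-2}
    print(linearity_F(yt, W, H))                  # roughly F(1, T-n-1) under H0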
3.2. Tests without a specific alternative
The above linearity tests are tests against a well-specified nonlinear alternative. There exist other tests that are intended as general tests without a specific alternative. We shall consider some of them. The first one is the Regression Error Specification Test (RESET; Ramsey, 1969). Suppose we have a linear model

y_t = φ'w_t + u_t,  (3.7)

where w_t is as in (3.1) and whose parameters we estimate by OLS. Let ũ_t, t = 1,...,T, be the estimated residuals and ŷ_t = y_t − ũ_t the fitted values. Construct an auxiliary regression

ũ_t = ψ'w_t + Σ_{j=2}^{h} δ_j ŷ_t^j + u_t*.  (3.8)
The RESET is the F test of the hypothesis H_0: δ_j = 0, j = 2,...,h, in (3.8). If w_t = (1, y_{t−1},...,y_{t−p})' and h = 2, (3.8) yields the univariate linearity test of Keenan
(1985). In fact, RESET may also be interpreted as an LM test against a well-specified alternative; see for instance Teräsvirta (1990) or Granger and Teräsvirta (1993, Chapter 6). Tsay (1986) suggested augmenting the univariate (3.7) by second-order terms, so that the auxiliary regression corresponding to (3.8) becomes

ũ_t = ψ'w_t + Σ_{i=1}^{p} Σ_{j=i}^{p} φ_ij y_{t−i}y_{t−j} + u_t*.  (3.9)
The linearity hypothesis to be tested is H_0: φ_ij = 0 for all i, j. The generalization to multivariate models is immediate. This test also has an LM type interpretation, showing that the test has power against a larger variety of nonlinear models than the RESET. This is seen by comparing (3.9) with (3.6), assuming ψ_2'g(w_t) = Σ_{i=1}^{p} Σ_{j=i}^{p} φ_ij y_{t−i}y_{t−j} as discussed in the previous section. The advantage of RESET lies in the small number of parameters in the null hypothesis. When w_t = (1, y_{t−1})' [or w_t = (1, x_{t1})'], the two tests are identical. A general linearity test can also be based on the neural network model (2.4), and such a test is presented in Lee et al. (1993). In computing the test statistic, the γ_j, j = 1,...,p, in (2.4) are selected randomly from a distribution. Teräsvirta et al. (1993) showed that this can be avoided by deriving the test by applying the LM principle. The auxiliary regression for the test becomes
ii,= +‘Wf+
i i=lj=i
t
‘PijYt-iY,-j+
i i=l
$J f
(PijkYt-iYt-jYt-fc+
‘:
(3.10)
j=ik=j
and the linearity hypothesis is H_0: φ_ij = 0, φ_ijk = 0 for all i, j, k. The simulation results in Teräsvirta et al. (1993) indicate that in small samples the test based on (3.10) often has better power than the original neural network test. There has been no mention yet of tests against piecewise linear or switching regression, or its univariate counterpart, threshold autoregression. The problem is that f_1 in (3.3) is not a continuous function of the parameters if the switch-points or thresholds are unknown. This makes the likelihood function irregular and the score principle inapplicable. Ertel and Fowlkes (1976) suggested the use of cumulative sums of recursive residuals for testing linearity. First, order the variables in ascending (or descending) order according to the transition variable. Compute the parameters recursively and consider the cumulative sum of the recursive residuals. The test is analogous to the CUSUM test that Brown et al. (1975) suggested, in which time is the transition variable and no lags of y_t are allowed in w_t. However, Krämer et al. (1988) showed that the presence of lags of y_t in the model does not affect the asymptotic null distribution of the CUSUM statistic. Even before that, Petruccelli and Davies (1986) proposed the same test for the univariate (threshold autoregressive) case; see also Petruccelli (1990). The CUSUM test may also be based on residuals
from OLS estimation using all the observations instead of recursive residuals. Ploberger and Krämer (1992) recently discussed this possibility. The CUSUM principle is not the only one available from the literature of structural change. Quandt (1960) suggested generalizing the F test (Chow, 1960) for testing parameter constancy in a linear model with known change-point by applying F̄ = sup_{t∈T} F(t), where T is a suitable set of admissible change-points.

Another much-used general test is the test of Brock et al. (1987) (the BDS test), which originated in the chaos literature and is based on the correlation integral. For a pair of time points (t, s), consider the condition

|y_{t+j} − y_{s+j}| < ε,  j = 0, 1,...,n − 1.  (3.11)

The correlation integral is defined as

C_n(ε) = lim_{T→∞} T^{−2}(number of pairs (t, s) with 1 ≤ t, s ≤ T such that (3.11) holds).

Brock et al. (1987) defined

S(n, ε) = C_n(ε) − [C_1(ε)]^n.  (3.12)
Under the hypothesis that {y_t} is an i.i.d. process, (3.12) has an asymptotic normal distribution with zero mean and a variance given in Brock et al. (1987). Note that
(3.12) depends on n and ε, which the investigator has to choose, and that the size of the test is very sensitive to these two parameters. A much more thorough discussion of the BDS test and its properties is found in Brock and Potter (1993) or Scheinkman (1990). It may be mentioned, however, that a rather long time series is needed to obtain reasonable power. Lee et al. (1993) contains some small-sample evidence on the behaviour of the BDS test, but it is not very conclusive; see Teräsvirta (1990). Linearity of a single series may also be tested in the frequency domain. Let {y_t} be stationary and have finite moments up to the sixth order. Then we can define the bispectral density f(ω_i, ω_j) of y_t, based on third moments, and

b(ω_i, ω_j) = |f(ω_i, ω_j)|² / [f(ω_i)f(ω_j)f(ω_i + ω_j)],

where f(ω_i) is the spectral density of y_t. Two hypotheses can be tested: (i) if f(ω_i, ω_j) ≡ 0 then y_t is linear and Gaussian; (ii) if b(ω_i, ω_j) ≡ b_0 > 0 then y_t is linear but not Gaussian, i.e. the parametrized linear model for {y_t} has non-Gaussian errors. Subba Rao and Gabr (1980) proposed tests of these two hypotheses. Hinich (1982) derived somewhat different tests for the same purpose. For more discussion, see, e.g. Priestley (1988) and Brockett et al. (1988). A disadvantage of these tests seems to be relatively low power in small samples. Besides, performing the tests requires more computation than carrying out most of their time domain counterparts. It has been assumed, so far, that g ≡ 1 in (3.1). If this assumption is not satisfied, the size of the test may be affected. At least the BDS test and the tests based on the bispectral density are known to be sensitive to departures from that assumption. If linearity of the conditional mean is tested against a well-specified alternative using LM type tests, some possibilities of taking conditional heteroskedasticity into account exist and will be briefly mentioned in the next section.
3.3. Constancy of conditional variance
The assumption g ≡ 1 is also a testable hypothesis. However, because conditional heteroskedasticity is discussed in Chapter 49 of this volume, testing g ≡ 1 against nonconstant conditional variance is not considered here. This concerns not only testing linearity against ARCH but also testing it against random coefficient linear regression; see, e.g. Nicholls and Pagan (1985) for further discussion of the latter situation. If f ≡ 0 and g ≡ 1 are tested jointly, a typical LM or LM type test is a sum of two separate LM (type) tests for f ≡ 0 and g ≡ 1, respectively. This is the case because under this joint null hypothesis the information matrix is block diagonal; see Granger and Teräsvirta (1993, Chapter 6). Higgins and Bera (1989) derived a joint LM test against bilinearity and ARCH. On the other hand, testing f ≡ 0 when g ≢ 1 is
a more complicated affair than it is when g ≡ 1. If g is parametrized, the null model has to be estimated under conditional heteroskedasticity. Besides, it may no longer be possible to carry out the test making use of a simple auxiliary regression; see Granger and Teräsvirta (1993). If g is not parametrized but g ≢ 1 is suspected to hold, then the tests described in Section 3.1, as well as the RESET and the Tsay test, can be made robust against g ≢ 1. Davidson and MacKinnon (1985) and Wooldridge (1990) described techniques for doing this. The present simulation evidence is not yet sufficient to fully evaluate their performance in small samples.
4. Specification of nonlinear models
If linearity tests indicate the need for a nonlinear model and economic theory does not suggest a completely specified model, then the structure of the model has to be specified from the data. This problem also exists in nonparametric modelling as a variable selection problem, because the lags needed to describe the dynamics of the process are usually unknown; see Auestad and Tjøstheim (1991) and Tjøstheim and Auestad (1994a, b). To specify univariate time series models, Haggan et al. (1984) devised a specification technique based on recursive estimation of the parameters of a linear autoregressive model. The parameters of the model were assumed to change over time in a certain fashion. Choosing a model from a class of state-dependent models, see Priestley (1988), was carried out by examining the graphs of recursive estimates. Perhaps because the family of state-dependent models is large and, thus, the possibilities are many, the technique is not easy to apply. If the class of parametric models to choose from is more restricted, more concrete specification methods may be developed. [For instance, Box and Jenkins (1970) restricted their attention to linear ARMA models.] Tsay (1989) presented a technique making use of linearity tests and visual inspection of some graphs to specify a model from the class of threshold autoregressive models. It is easy to use and seems to work well. Chen and Tsay (1993a) considered the specification of functional-coefficient autoregressive models, whereas Chen and Tsay (1993b) extended the discussion to additive functional-coefficient regression models. The key element in that procedure is the use of arranged local regressions in which the observations are ordered according to a transition variable. Lewis and Stevens (1991a) applied multivariate adaptive regression splines (MARS), see Friedman (1991), to specify adaptive spline threshold autoregressive models. Teräsvirta (1994) discussed the specification of smooth transition autoregressive models. This technique was generalized to smooth transition regression models in Granger and Teräsvirta (1993, Chapter 7) and will be considered next. Consider the smooth transition regression (STR) model with p + k + 1 independent variables

y_t = φ'w_t + (θ'w_t)F(z_t) + u_t,  (4.1)
where E{u_t | I_t} = 0, cov{u_t | I_t} = σ², I_t = {y_{t−j}, j > 0; x_{t−j,i}, i = 1,...,k, j > 0} as in (1.1), φ = (φ_0, φ_1,...,φ_m)', θ = (θ_0, θ_1,...,θ_m)', m = p + k + 1 and w_t = (1, y_{t−1},...,y_{t−p}; x_{t1},...,x_{tk})'. The alternatives for F are F(z_t) = (1 + exp[−γ(z_t − c)])^{−1}, γ > 0, which gives the logistic STR (LSTR) model, and F(z_t) = 1 − exp[−γ(z_t − c)²], γ > 0, corresponding to the exponential STR (ESTR) model. The transition variable z_t may be any element of w_t other than 1, or another variable not included in w_t. The data-based specification proceeds in three stages. First, specify a linear model to serve as a base for testing linearity. This is done by using a suitable model selection criterion. Second, test linearity against STR using the linear model as the null model. If linearity is rejected, determine the transition variable from the data. Third, choose between LSTR and ESTR models. Testing linearity against STR is not difficult. A test with power against both LSTR and ESTR when the transition variable is assumed known is obtained by proceeding as in Section 3.1. This leads to the auxiliary regression

ũ_t = β_0'w_t + β_1'w_t z_{td} + β_2'w_t z_{td}² + β_3'w_t z_{td}³ + u_t*,  (4.2)
where z_{td} is the transition variable and ũ_t is the OLS residual from the linear regression y_t = φ'w_t + u_t. If z_{td} is an element of w_t, then w_t = (1, w̃_t')' has to be replaced by w̃_t in (4.2), except in the first right-hand-side term. The linearity hypothesis is H_0: β_1 = β_2 = β_3 = 0. Equation (4.2) is also used for selecting z_{td}. The test is carried out for all candidates for z_{td}, and the one yielding the smallest p-value is selected if that value is sufficiently small. If it is not, the model is taken to be linear. This procedure is motivated as follows. Suppose there is a true STR model with a transition variable z_{td} that generated the data. Then the LM type test against that alternative has optimal power properties. If an inappropriate transition variable is selected for the test, the resulting test may still have power against the true alternative, but the power is less than if the correct transition variable is used. Thus, the strongest rejection of the null hypothesis suggests that the corresponding transition variable should be selected. For more discussion of this procedure see Teräsvirta (1994) and Granger and Teräsvirta (1993, Chapters 6 and 7). If linearity is rejected and a transition variable selected, then the third step is to choose between LSTR and ESTR models. This can be done by testing a set of nested null hypotheses within (4.2) with F tests: the hypotheses are H_03: β_3 = 0, H_02: β_2 = 0 | β_3 = 0 and H_01: β_1 = 0 | β_2 = β_3 = 0. If the p-value of the test of H_02 is the smallest, choose the ESTR model; otherwise choose the LSTR model. Specifying the lag structure of (4.1) could be done within (4.2) using an appropriate model selection criterion, but little is known about the success of such a procedure. In the existing applications, a general-to-specific approach based on estimating nonlinear STR (or STAR) models has mostly been used. The model specification problem also arises in nonparametric time series modelling. Taking model (2.5) as a starting-point, there is the question of which lags y_{t−i_1},...,y_{t−i_p}; x_{t−j_1},...,x_{t−j_q} should be included in the model. Furthermore, it should be investigated whether the functions f and g are linear or nonlinear and
whether they are additive or not. Moreover, if interaction terms are included, how should they be modelled and, more generally, can the nonparametric analysis suggest functional forms, such as the smooth transition or threshold function, or an ARCH type function for the conditional variance? These are problems of exploratory data analysis for nonlinear time series, and relatively little nonparametric work has been done in the area. Various graphical model indicators have been tried out in Tong (1990, Chapter 7), Haggan et al. (1984) and Auestad and Tjøstheim (1990), however. Perhaps the most natural quantities to look at are the lagged conditional means and variances of increasing order, i.e.

M_{y,k}(y) = E(y_t | y_{t−k} = y),  V_{y,k}(y) = var(y_t | y_{t−k} = y),
M_{x,k}(x) = E(y_t | x_{t−k} = x),  V_{x,k}(x) = var(y_t | x_{t−k} = x).  (4.3)
In univariate modelling these quantities have been extensively used, albeit informally; see Tong (1990, Chapter 7). They can give a rough idea of the type of nonlinearity involved, but they fail to reveal things like the lag structure of an additive model. A more precise and obvious alternative is to look at the functions M(y; x) and V(y; x) defined in (2.6) and (2.7), but they cannot be graphically displayed for p + q > 2, and the curse of dimensionality quickly becomes a severe problem. Auestad and Tjøstheim (1991) and Tjøstheim and Auestad (1994a) introduced projections as a compromise between M(y; x), V(y; x) and the indicators (4.3). To define projections, consider the conditional mean function M(y_{t−i_1},...,y,...,y_{t−i_p}; x_{t−j_1},...,x_{t−j_q}), in which the argument corresponding to lag i_k of y_t has been replaced by a fixed value y. The one-dimensional projector of order (p, q) projecting on lag i_k of y_t is defined by

P_{y,k}(y) = E{M(y_{t−i_1},...,y,...,y_{t−i_p}; x_{t−j_1},...,x_{t−j_q})},

the expectation being taken over the remaining arguments. The projector P_{x,k}(x) is defined in the same way. For an additive model M(y_1,...,y_p; x_1,...,x_q) = Σ_{i=1}^{p} α_i(y_i) + Σ_{j=1}^{q} β_j(x_j), it is easily seen that if all p + q lags are included in the projection operation, then

P_{y,k}(y) = α_k(y) + μ_k,  P_{x,k}(x) = β_k(x) + θ_k,  (4.4)
where μ_k = E(y_t) − E[α_k(y_t)] and θ_k = E(x_t) − E[β_k(x_t)]. Clearly the additive terms α_k(y) and β_k(x) cannot be recovered using M_{y,k} and M_{x,k} of (4.3). Projectors can be defined similarly for the conditional variance and, in principle, they reveal the structure of models having an additive conditional variance function. Both types of projectors can be estimated by replacing theoretical expectations with empirical averages and by introducing a weight function to screen off extreme data. Properties and details are given in Auestad and Tjøstheim (1991) and Tjøstheim and Auestad (1994a). Consistency is proved in Masry and Tjøstheim (1994). An important part of the model specification problem consists of singling out the significant lags i_1,...,i_p; j_1,...,j_q and the orders p and q for the conditional mean
(2.6) and the conditional variance (2.7). Auestad and Tjøstheim (1990, 1991), Tjøstheim and Auestad (1994b) and Cheng and Tong (1992) considered this problem; Granger and Lin (1991) did the same from a somewhat different point of view. Auestad and Tjøstheim adopted an approach analogous to the parametric final prediction error (FPE) criterion of Akaike (1969). They treated it only in the univariate case, but it is easily extended to the multivariate situation. Algorithms and formulae, including the heterogeneous case g ≢ 1, are given in Tjøstheim and Auestad (1994b), to which the reader is referred for details of derivation and examples with simulated and real data. Cheng and Tong (1992) discussed a closely related approach based on cross validation. An alternative and less computer intensive method is outlined by Granger and Lin (1991). They use the Kendall rank partial autocorrelation function and the bivariate information measure
∫ log{f(x, y)/[f_1(x)f_2(y)]} f(x, y) dx dy
for a pair of lags. Joe (1989) studied its properties in the i.i.d. case. Robinson (1991) considered the random process case and tests of independence. Related tests of independence and a power comparison with the BDS test are given in Skaug and Tjøstheim (1993a, b, c). Specification of semiparametric time series models is discussed in the next section together with estimation.
5. Estimation in nonlinear time series

5.1. Estimation of parameters in parametric models
For parametric nonlinear models, conditional nonlinear least squares is the most common estimation technique. If the errors are normal and independent, this is equivalent to conditional maximum likelihood. The theory derived for dynamic nonlinear models (3.1) with g ≡ 1 gives the conditions for consistency and asymptotic normality of the estimators. For an account, see, e.g. Gallant (1987, Chapter 7). Even more general conditions were recently laid out in Pötscher and Prucha (1991a, b). These conditions may be difficult to verify in practice, so that the asymptotic standard deviation estimates, confidence intervals and the like have to be interpreted with care. For discussions of estimation algorithms, see, e.g. Quandt (1983), Judge et al. (1985, Appendix B) and Bates and Watts (1988). The estimation of parameters in (2.2) may not always be straightforward. Local minima may occur, so that estimation with different starting-values is recommended. Estimation of γ in the transition function (2.3) may create problems if the transition is rapid, because there may not be sufficiently many observations in the neighbourhood of the point about which the transition takes place. The convergence of the estimate sequence
may therefore be slow and the standard deviation estimate of γ most often very large. This problem is discussed, e.g. in Bates and Watts (1988, p. 87), Granger and Teräsvirta (1993, Chapter 7), Seber and Wild (1989, pp. 480-481) and Teräsvirta (1994). For simulation evidence and estimation using real economic data sets, see also Chan and Tong (1986), Granger et al. (1993), Luukkonen (1990) and Teräsvirta and Anderson (1992). Model (2.2) may even be a switching regression model, in which case γ is not finite and, in principle, cannot be estimated. In that case convergence may still occur at some very large value, but obtaining a negative definite Hessian probably turns out to be a problem. An available alternative then is to fix γ at some sufficiently large but finite value and estimate the remaining parameters conditionally on that value; see the sketch below. The estimation of parameters becomes more complicated if the model contains lagged errors, as the bilinear model does. Subba Rao and Gabr (1984) outlined a procedure for the estimation of a bilinear model based on maximizing the conditional likelihood. Quick preliminary estimates may be obtained by using a long autoregression to estimate the residuals and then OLS for estimating the parameters keeping the residuals fixed. This is possible because the bilinear model has a simple structure in the sense that it is linear in the parameters if we regard the lagged residuals as observed. Granger and Teräsvirta (1993, Chapter 7) suggested this alternative. If the model is a switching regression or threshold autoregressive model, nonlinear least squares is an inapplicable technique because of the irregularity of the sum of squares or the likelihood function. The problem consists of the unknown switch-points or thresholds, for which unique point estimates are not available as long as the number of observations is finite. Tsay (1989) suggested specifying (approximate) switch-points from "scatterplots of t-values" in ordered (according to the switching variable) recursive regressions. As long as the recursion stays in the same regime, the t-value of a coefficient estimate converges to a fixed value. When observations from another regime are added into the regression, the coefficient estimates start changing and the t-values deviating. Tsay (1989) contains examples. The estimation of parameters within regimes is carried out by ordinary least squares. Chan (1993) showed (in the univariate case) that if the model is stationary and ergodic, the parameter estimates, including those of the thresholds, are strongly consistent; for a discussion see Tong (1990, Section 5.5.3).
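The conditional-linearity structure of the STR model makes the strategies just described easy to implement: for fixed (γ, c) the model is linear in the remaining parameters, so OLS applies and many starting-values can be tried cheaply. A minimal sketch for a univariate LSTAR of order one, with hypothetical data and an illustrative grid:

    import numpy as np

    def lstar_ssr(y, gamma, c):
        """OLS fit of y_t = b0 + b1*y_{t-1} + (b2 + b3*y_{t-1})*F + u_t,
        F = logistic transition in y_{t-1}, conditional on (gamma, c)."""
        y0, y1 = y[1:], y[:-1]
        F = 1.0 / (1.0 + np.exp(-gamma * (y1 - c)))
        X = np.column_stack([np.ones_like(y1), y1, F, y1 * F])
        coef, *_ = np.linalg.lstsq(X, y0, rcond=None)
        r = y0 - X @ coef
        return r @ r, coef

    rng = np.random.default_rng(5)
    y = rng.normal(size=400)                 # placeholder for the actual data
    grid = [(g, c) for g in (1.0, 5.0, 25.0)
                   for c in np.quantile(y, [0.25, 0.5, 0.75])]
    g_hat, c_hat = min(grid, key=lambda gc: lstar_ssr(y, *gc)[0])
    print(g_hat, c_hat, lstar_ssr(y, g_hat, c_hat)[1])

Fixing γ at a large grid value and estimating the rest conditionally, as suggested in the text for near-switching models, amounts to restricting the grid to that single γ.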
5.2. Estimation of nonparametric functions

In nonparametric estimation the most common way of estimating the conditional mean (2.6) and variance (2.7) is to apply the so-called kernel method. It is based on a kernel function k(x) which, typically, is a real continuous, bounded, symmetric function integrating to one. Usually it is required that k(x) ≥ 0 for all x, but sometimes it is advantageous to allow k(x) to take negative values, so that we may
have ∫x²k(x)dx = 0. The kernel method is explained in much greater detail in Chapter 38 of this volume. The kernel acts as a smoothing device in the estimation procedure. For quantities depending on several variables, as in (2.6) and (2.7), a product kernel can be used. Then the kernel estimates of M and V are

M̂(y_1,...,y_p; x_1,...,x_q)
  = [Σ_s y_s Π_{r=1}^{p} k_{h,1}(y_r − y_{s−i_r}) Π_{r=1}^{q} k_{h,2}(x_r − x_{s−j_r})]
    / [Σ_s Π_{r=1}^{p} k_{h,1}(y_r − y_{s−i_r}) Π_{r=1}^{q} k_{h,2}(x_r − x_{s−j_r})],  (5.1)

V̂(y_1,...,y_p; x_1,...,x_q)
  = [Σ_s y_s² Π_{r=1}^{p} k_{h,1}(y_r − y_{s−i_r}) Π_{r=1}^{q} k_{h,2}(x_r − x_{s−j_r})]
    / [Σ_s Π_{r=1}^{p} k_{h,1}(y_r − y_{s−i_r}) Π_{r=1}^{q} k_{h,2}(x_r − x_{s−j_r})]
    − [M̂(y_1,...,y_p; x_1,...,x_q)]²,  (5.2)
where k_{h,i}(x) = h_i^{−1}k_i(h_i^{−1}x), i = 1, 2. Here k_1 and k_2 are the kernel functions associated with the {y_t} and {x_t} processes, and h_1 and h_2 are the corresponding bandwidths. The bandwidth controls the width of the kernel function and thus the amount of smoothing involved. The bandwidth will depend on the total number of observations T, so that h = h(T) → 0 as T → ∞. It also depends on the dimensions p and q, but this has been suppressed in the above notation. In the following, to simplify notation, it is assumed that {y_t}, {x_t} are measured roughly on the same scale, so that the same bandwidth and the same kernel function can be used everywhere. Under regularity conditions (Robinson, 1983) it can be proved that M̂(y, x) and V̂(y, x) are asymptotically normal. More precisely,
(Th^{p+q})^{1/2}[M̂(y, x) − M(y, x)] → N(0, J^{p+q}V(y, x)/p(y, x))  (5.3)

and

(Th^{p+q})^{1/2}[V̂(y, x) − V(y, x)] → N(0, J^{p+q}s(y, x)/p(y, x)),  (5.4)
where the convergence is in distribution, J = ∫k²(x)dx and s(y, x) is defined in Auestad and Tjøstheim (1990). Several points should be noted for (5.3) and (5.4). For parametric models we have √T-consistency; for nonparametric models the rate is √(Th^{p+q}), which is slower.
The presence of p(y, x) in the denominator of the asymptotic variances in (5.3) and (5.4) means that the variance blows up close to the boundaries of the data set, and extreme care must be used there in the interpretation of M̂(y, x) and V̂(y, x). There are other aspects of practical significance that are not immediately transparent from (5.3) and (5.4). They will be discussed next.

Confidence intervals. Asymptotic confidence intervals can in principle be computed from (5.3) and (5.4) by replacing p(y, x), V(y, x) and s(y, x) by the corresponding estimated quantities. An alternative is to try to form bootstrap confidence intervals. Franke and Wendel (1990) discussed a simple example where the bootstrap performs much better than asymptotic intervals. In the general case the bootstrap developed by Künsch (1989) and Politis and Romano (1990) may be needed.

Bias. As seen from (5.3) and (5.4), M̂(y, x) and V̂(y, x) are asymptotically unbiased. For a finite sample size the bias can be substantial. Thus, reasoning as in Auestad and Tjøstheim (1990) yields

bias{M̂(y, x)} ≈ (h²I_2/2) Σ_r [∂²M/∂u_r² + 2(∂M/∂u_r)(∂p/∂u_r)/p](y, x),  (5.5)
where I_2 = ∫x²k(x)dx. A corresponding formula (Tjøstheim and Auestad, 1994a) holds for the conditional variance. A Gaussian linear model will have a linear bias in the conditional mean but, in general, the bias can lead to a misspecified model. For example, a model with a flat conditional variance (no conditional heteroskedasticity) may in fact appear to have some form of heteroskedasticity due to bias from a rapidly varying M(y, x). An example is given in Auestad and Tjøstheim (1990). Generally, V̂(y, x) is more affected by bias and has more variability than M̂(y, x). This makes it harder to reveal the structure of the conditional variance using purely nonparametric means; see, for instance, the example of conditional stock volatility in Pagan and Schwert (1990). Another problem is that misspecification of the conditional mean may mix up conditional mean and variance effects. This is, of course, a problem in parametric models as well.

Choosing the bandwidth. Comparing the variance and bias formulae (5.3)-(5.5), it is seen that the classical problem of all smoothing operations is present. As h increases, the variance decreases whereas the bias increases, and vice versa. How should h be chosen for a given data set? There are at least three approaches to this problem. The simplest solution is to compute estimates for several values of h and to select one subjectively. A
second possibility is to use asymptotic theory. From (5.3)-(5.5) it is seen that if we require that the variance and the squared bias be asymptotically balanced, then (Th^{p+q})^{−1} ~ h⁴, or h ~ T^{−1/(p+q+4)}. An extension of this argument (Truong and Stone, 1992) yields h ~ T^{−1/(p+q+2R)}, where R is a smoothness parameter. The problem of choosing the proportionality factor still remains. A discussion of this and related problems is given in Härdle (1990, Chapter 5), in Chapter 38 of this volume and in Marron (1989). The third possibility, which is the most time consuming but, possibly, the one most used in practice, is to use some form of cross validation. For details, see the above references. Simulation experiments showing considerable variability for h selected by cross validation for one and the same model have been reported.

Boundary effects. For a point (y, x) close to the boundary of the data set there will be disproportionately more points on the "inward" side of (y, x). This asymmetry implies that we are not able to integrate over the entire support of the kernel function, so that we cannot exploit the fact that ∫xk(x)dx = 0. This, in turn, means that there is an additional bias of order h due to this boundary effect. For example, for a linear regression model the estimated regression line would bend close to the boundary. The phenomenon has primarily been examined theoretically in the fixed regression design case (Rice, 1984; Müller, 1990).

Higher order kernels. Sometimes so-called higher order kernels have been suggested for reducing bias. It is seen from (5.5) that if k is chosen such that ∫x²k(x)dx = 0, the bias will effectively be reduced to the next order term in the bias expansion (typically of order h⁴). However, practical experience in the finite sample case has been mixed, and a higher order kernel does not work unless T is rather large.

Curse of dimensionality. This problem was mentioned in the introduction. It is a well-known difficulty of multidimensional data analysis and a serious one in nonparametric estimation. Although the bandwidth h typically increases somewhat as the dimensions p and q increase, this is by no means enough to compensate for the sparsity of points in a neighbourhood of a given point. There may still be some useful information left in M̂(y, x) that can be used for specification purposes (Tjøstheim and Auestad, 1994a, b) or as initial input to the iterative algorithms described in the next section, but it is of little use as an accurate estimate of M(y, x). In general one should try to avoid the curse of dimensionality by not looking at too many regressors simultaneously; i.e. by considering (2.6) and (2.7) such that, while i_p and j_q may be large, p and q are not. This requires a method for singling out significant lags nonparametrically, which was discussed in Section 4. Alternatively, the problem may be handled by applying more restricted models, which will be considered in the next section.

Other estimation methods. There are a number of alternative nonparametric
estimation methods. These are described in Härdle (1990, Chapter 3) and Hastie and Tibshirani (1990, Chapter 2). The most commonly used are spline smoothing, nearest neighbour estimation, orthogonal series expansion and the regressogram. For all of these methods there is a smoothing parameter that must be chosen, analogously to the choice of bandwidth for the kernel smoother. The asymptotic properties of the resulting estimators are roughly similar to those in kernel estimation. The spline smoother (Silverman, 1984) can be rephrased asymptotically as a kernel estimator with negative sidelobes. Diebolt (1990) applied the regressogram to testing linearity. Yakowitz (1987) considered nearest neighbour methods in time series. Further applications will be mentioned in the next section.
5.3. Estimation in restricted nonparametric and semiparametric models
As mentioned above, general nonparametric estimation with many variables leads to increased variability and problems with the curse of dimensionality. To alleviate these problems one can look at more restrictive models requiring particular forms for f and g in (2.5), or one can consider semiparametric models. This section is devoted to models of that kind.

Additive models. Virtually all restrictive models have some sort of additivity built into them. In the simplest case (using consecutive lags)

y_t = Σ_{i=1}^{p} α_i(y_{t−i}) + Σ_{j=1}^{q} β_j(x_{t−j}) + ε_t.
Regression versions of such models and generalizations with interaction terms are analysed extensively in Hastie and Tibshirani (1990) and references therein. By taking conditional expectations with respect to y_{t−i} and x_{t−j}, simple identities are obtained which can be used as a basis for an iterative algorithm for computing the unknown functions α_i and β_j. The algorithm needs initial values of these functions; one possibility is to use either projections or simply a linear model for this purpose. Some examples and theoretical properties in the pure regression case are given by Hastie and Tibshirani. See also Chen and Tsay (1993b). The ACE algorithm treats a situation in which the dependent variable may be transformed as well, so that
Θ(y_t) = Σ_i α_i(y_{t−i}) + Σ_j β_j(x_{t−j}) + ε_t.
The algorithm is perhaps best suited for a situation where α_i ≡ 0 for all i, so that there is a clear distinction between the input and output variables. The method was developed in Breiman and Friedman (1985). Some curious aspects of the ACE
algorithm are highlighted in Hastie and Tibshirani (1990, pp. 184-186). In view of the above comments it is perhaps not surprising that, in a time series example, Hallman (1990) obtained better results by using a version of backfitting (Tibshirani, 1988) than with the ACE algorithm. Chen and Tsay (1993a) considered a univariate model allowing certain interactions. Their functional-coefficient autoregressive (FCAR) model is given as

y_t = f_1(y_{t−i_1},...,y_{t−i_k})y_{t−1} + ··· + f_p(y_{t−i_1},...,y_{t−i_k})y_{t−p} + ε_t,
with i_k ≤ p. By ordering the observations according to some variable, or a known combination of them, into an "arranged" local regression, the authors proposed an iterative procedure for evaluating f_1,...,f_p and gave some theoretical properties. The procedure simplifies dramatically if all the f_j are one-dimensional. The authors fitted an FCAR model of this type to the chicken pox data of Sugihara and May (1990). The fitted model seemed to point at a threshold autoregressive model. The forecasts from such a model, subsequently fitted to the data, had an MSE at least 30% smaller than that of a seasonal ARMA model used as a comparison for forecasting 4-11 months ahead.

Projection pursuit type models. These models can be written as
y_t = Σ_{j=1}^{r} β_j(γ_j'y_{t−1} + δ_j'x_{t−1}) + ε_t,
where the β_j, j = 1,...,r, are unknown functions, γ_j and δ_j are unknown vectors determining the direction of the jth projector, and y_{t−1}, x_{t−1} are as in (2.5). An iterative procedure (Friedman and Stuetzle, 1981) exists for deriving the optimal projectors (projection pursuit step) and functions β_j. The curse of dimensionality is avoided since the smoothing part of the algorithm exploits the fact that β_j is a function of one scalar variable. For time series data, experience with this method is limited. A small simulation study that Granger and Teräsvirta (1992) conducted gave marginal improvements compared to linear model fitting for the particular nonlinear models they considered. Projection pursuit models are related to neural network models, but for the latter the functions β_j are assumed known, and often β_j = β, j = 1,...,r, thus giving a parametric model class. The fitting of neural network models is discussed in White (1989).
Regression trees, splines and MARS. Assume a model of the form y_t = f(y_{t−1}, x_{t−1}) + ε_t
and approximate f(y, x) in terms of simple basis functions B_j(y, x), so that f_appr(y, x) = Σ_j c_j B_j(y, x). In the regression tree approach (Breiman et al., 1984)
f_appr is built up recursively from indicator functions B_j(y, x) = I{(y, x) ∈ R_j}, and the regions R_j are partitioned in the next step of the algorithm according to a certain pattern. As can be expected, there are problems in fitting simple smooth functions like the linear model. Friedman (1991), in his MARS (multivariate adaptive regression splines) methodology, has made at least two important new contributions. First, to overcome the difficulty in fitting simple smooth functions, Friedman proposed not to automatically eliminate the parent region R_j in the above recursive scheme for creating subregions. In subsequent iterations both the parent region and its corresponding subregions are eligible for further partitioning. This allows for much greater flexibility. The second contribution is to replace step functions by products of linear left and right truncated regression splines. The products make it possible to include interaction terms. For a detailed discussion the reader is referred to Friedman (1991). Lewis and Stevens (1991a) applied MARS to time series, both simulated and real data. As for most of the techniques discussed in this section, a number of input parameters are needed. Lewis and Stevens recommended running the model for several sets of parameters and then selecting a final model based on various specification/fitting tests. They fitted a model to the sunspot data which has 3 one-way, 3 two-way and 7 three-way interaction terms. The MARS model produced better overall forecasts of the sunspot activity than the models applied before. In Lewis and Stevens (1991b) riverflow is fitted against temperature and precipitation, and good results are obtained. There are as yet no applications to economic data. The MARS technology appears very promising but must of course be tested more extensively on real and simulated data sets. No asymptotic theory with confidence intervals is available yet.

Stepwise series expansion of conditional densities. In a sense the conditional density p(y_t | y_{t−1}, x_{t−1}) is the most natural quantity to look at in a joint modelling of {y_t, x_t}, since predictive distributions as well as the conditional mean and variance can all be derived from this quantity. Gallant and Tauchen (1989) used this fact as their starting-point. The conditional density is estimated, to avoid the curse of dimensionality, by expanding it in Hermite polynomials. These are centred and scaled so that the conditional mean M(y, x) and variance V(y, x) play a prominent role. As a first approximation they are supposed to be linear Gaussian and of ARCH type, respectively. Gallant et al. (1992) looked at econometric applications, notably to stock market data. In particular, they investigated the relationship between the volatility of stock prices and volume. A main finding was that an asymmetry in the volatility of prices when studied by itself more or less disappears when volume is included as an additional conditioning variable. Possible asymmetry in the conditional variance function (univariate case) has recently been studied by a number of investigators using both parametric and nonparametric methods; see Engle and Ng (1993) and references therein.
Semiparametric models. Another way of trying to eliminate the difficulties in evaluating high-dimensional conditional quantities is to assume nonlinear and nonparametric dependence on some of the predictors, and parametric and usually linear dependence on others. An illustrative example is given by Engle et al. (1986), who modelled electricity sales using a number of predictor variables. It is natural to assume the impact of temperature on electricity consumption to be nonlinear, as both high and low temperatures lead to increased consumption, whereas a linear relationship may be assumed for the other regressors. A similar situation arose in Shumway et al. (1988), which is a study of mortality as a function of weather and pollution variables in the Los Angeles region. In the context of model (2.5), with a linear dependence on lags of y_t and nonlinearity with respect to the exogenous variable {x_t}, we have

y_t = Σ_{j=1}^{p} φ_j y_{t-j} + f(x_{t-1}) + ε_t.
The modelling technique would depend somewhat on the dimension of x_{t-1}. In the case where the argument of f is scalar, it can be incorporated in the backfitting algorithm of Hastie and Tibshirani (1990, p. 118). Under quite general assumptions it is possible to obtain √n-consistency for the parametric part, as demonstrated by Heckman (1986) and Robinson (1988). Powell et al. (1989) developed the theory further and gave econometric applications.
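The double-residual idea behind Robinson's (1988) √n-consistent estimator of the parametric part can be sketched in a few lines (Python with numpy; the data-generating process, bandwidth and kernel are illustrative assumptions): smooth both y_t and the linear regressor on x_{t-1}, then regress residuals on residuals.

import numpy as np

def nw(x, v, h=0.3):
    # Nadaraya-Watson estimate of E[v | x] evaluated at the sample points
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return w @ v / w.sum(axis=1)

# toy partially linear model: y_t = 0.5 y_{t-1} + cos(x_{t-1}) + eps_t
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(-2.0, 2.0, n + 1)
y = np.zeros(n + 1)
for t in range(1, n + 1):
    y[t] = 0.5 * y[t - 1] + np.cos(x[t - 1]) + 0.1 * rng.normal()

ycur, ylag, xlag = y[1:], y[:-1], x[:-1]
ry = ycur - nw(xlag, ycur)        # remove the nonparametric part from y_t
rlag = ylag - nw(xlag, ylag)      # and from the linear regressor y_{t-1}
phi_hat = np.sum(rlag * ry) / np.sum(rlag ** 2)
print(phi_hat)                    # close to the true value 0.5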
6. Evaluation of estimated models
After estimating a nonlinear time series model it is necessary to evaluate its properties, to see if the specified and estimated model may be regarded as an adequate description of the relationship it was constructed to characterize. The residuals of the model can be subjected to various tests, such as those against error autocorrelation and ARCH, and a test of normality. At least in the parametric case, linearity of the time series was tested before the model was specified, and similar tests may now be performed on the residuals to see if the model adequately characterizes the nonlinearity the tests previously suggested. For instance, Eitrheim and Teräsvirta (1993) proposed testing the STAR model against an alternative containing two additive STAR components and derived an LM type test for this purpose. The test applies to STR models as well. As to testing the null of no error autocorrelation, it should be noted that the asymptotic distribution of the Ljung-Box test statistic based on estimated residuals is not available, as the correct number of degrees of freedom is known only if the estimated model is a linear ARMA model. For this reason, Eitrheim and Teräsvirta (1993) also derived an LM test for testing the residuals of the STR model against autocorrelation. One should also study the long-term properties of the model, which generally can only be done numerically by simulating the model without noise. A bilinear model constitutes an exception, as its long-term solution is the same as that of the
corresponding linear autoregressive model. The exogenous variables should be set at a constant level, for instance equal to their sample means. If the solution path diverges, the model should be rejected and respecification attempted. Other examples of a solution are a limit cycle or a unique stable singular point. Sometimes several solutions may appear, depending on the starting-values. See, e.g., Ozaki (1985) for further discussion; a minimal sketch of such a noise-free simulation is given below.

The out-of-sample prediction of the model is an important part of the evaluation process. The precision of the forecasts should be compared to that of the forecasts from the corresponding linear model. However, as mentioned in the introduction, the results also depend on the data during the forecasting period. If there are no observations in the range in which the nonlinearity of the model makes an impact, then the forecasts cannot be expected to be more accurate than those from a linear model. The check is thus negative: if the forecasts from the nonlinear model are significantly less accurate than those from the corresponding linear one, then the nonlinear specification should be reconsidered.

A further check of the estimated model is to see whether it can reproduce a feature of interest in the data. A fitted model is considered adequate only if it is capable of doing that. The spectral density function of the time series may be such a feature. The check is carried out by bootstrapping the estimated model (linear or nonlinear), which is required to be parametric. Details and examples can be found in Tsay (1992).
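A minimal sketch of the noise-free ("skeleton") simulation described above, in Python with numpy; the classification thresholds and the fitted mean function fhat are assumptions supplied by the user.

import numpy as np

def skeleton_path(f, start_lags, n_steps=400):
    # iterate y_t = f(y_{t-1}, ..., y_{t-p}) with the noise switched off;
    # start_lags holds the p starting values, most recent value last
    path = list(start_lags)
    p = len(start_lags)
    for _ in range(n_steps):
        lags = np.array(path[-p:][::-1])   # most recent lag first
        path.append(float(f(lags)))
    return np.array(path)

# usage with a fitted conditional mean fhat(lags) -> float, exogenous
# variables held fixed at their sample means inside fhat:
# tail = skeleton_path(fhat, y_obs[-5:], 400)[-50:]
# if not np.all(np.isfinite(tail)) or np.abs(tail).max() > 1e6:
#     print("solution path diverges: reject and respecify")
# elif np.ptp(tail) < 1e-8:
#     print("unique stable singular point at", tail[-1])
# else:
#     print("non-constant long-run solution, possibly a limit cycle")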
7. Example
As an example of the parametric specification, estimation, and evaluation cycle discussed in Section 4 we shall consider modelling the seasonally unadjusted logarithmic Austrian industrial output from 1960(1) to 1986(4). This is one of the series analysed in Luukkonen and Teräsvirta (1991) and Teräsvirta and Anderson (1992). However, while those authors tested linearity of the series and rejected it, they did not report any modelling results. Our aim is to see whether we can describe the four-quarter differences (y_t) of the series, or annual growth rates, appearing in Figure 1, by a STAR model. In order to do that we first have to specify a linear autoregressive model for the series. Following Teräsvirta and Anderson we choose an AR(5) model yielded by AIC. Having done this, the second step is to test linearity against STAR using five lags and applying an F-test based on the auxiliary regression (4.2); a sketch of such a test is given below. The results are found in Table 1, where it is seen that the smallest p-value of the tests for d = 1,...,5 is obtained at d = 1. We take this value (= 0.010) to be sufficiently small to reject linearity in favour of STAR. The next step is to choose between an exponential and a logistic STAR model, assuming that d = 1. Table 1 shows that the p-values of the F-tests of both H₀₃* and H₀₁* are smaller than that of the test of H₀₂* (see Section 4), so that the decision rule discussed in Section 4 leads us to choose an LSTAR model.
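Since the auxiliary regression (4.2) is not reproduced in this excerpt, the following Python sketch should be read as an assumption about its general shape: a Taylor-expansion LM test in which y_t is regressed on its lags plus the lags interacted with y_{t-d}, y_{t-d}² and y_{t-d}³, the three blocks corresponding to β₁, β₂ and β₃ in Table 1, with an F statistic comparing restricted and unrestricted fits.

import numpy as np
from scipy import stats

def star_linearity_F(y, p, d):
    # F-type test of linearity against STAR via a third-order
    # Taylor-expansion auxiliary regression (cf. (4.2); exact form assumed)
    y = np.asarray(y)
    rows = range(max(p, d), len(y))
    Y = np.array([y[t] for t in rows])
    lags = np.array([[y[t - j] for j in range(1, p + 1)] for t in rows])
    s = np.array([y[t - d] for t in rows])          # transition variable
    X0 = np.column_stack([np.ones(len(Y)), lags])   # linear null model
    aux = [lags * (s ** k)[:, None] for k in (1, 2, 3)]
    X1 = np.column_stack([X0] + aux)
    rss0 = np.sum((Y - X0 @ np.linalg.lstsq(X0, Y, rcond=None)[0]) ** 2)
    rss1 = np.sum((Y - X1 @ np.linalg.lstsq(X1, Y, rcond=None)[0]) ** 2)
    q, dof = 3 * p, len(Y) - X1.shape[1]
    F = ((rss0 - rss1) / q) / (rss1 / dof)
    return F, 1.0 - stats.f.cdf(F, q, dof)

# e.g. star_linearity_F(y, p=5, d=1) for the AR(5) base model used here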
Figure 1. Four-quarter differences of the logarithmic index of Austrian industrial production, 1961(1)-1986(4).
Table 1
The p-values of the LM type linearity test against STAR based on (4.2) for delays d = 1,...,5, and the p-values of the model specification tests to choose between LSTAR and ESTAR for d = 1, for the four-quarter differences of the logarithmic Austrian industrial production, 1960(1)-1986(4). The linear base model is AR(5).

Null hypothesis                          d = 1    d = 2    d = 3    d = 4    d = 5
H₀:  β₁ = β₂ = β₃ = 0                    0.010    0.29     0.15     0.22     0.41
H₀₃*: β₃ = 0                             0.034
H₀₂*: β₂ = 0 | β₃ = 0                    0.24
H₀₁*: β₁ = 0 | β₂ = β₃ = 0               0.039
After selecting the type of model, the next problem is that of specifying the lag structure. An obvious way to start is to estimate the parameters of the full model (2.2) with (2.3) as the transition function. However, here, as in many other similar situations, this leads to convergence problems because some of the parameters are redundant and their estimates are highly correlated with those of other parameters.
To avoid this it is often advisable to fix γ, or even both γ and c, in (2.3) and estimate φ and θ conditionally. This helps one to put restrictions on the elements of these parameter vectors, and after finding a sensible set of parameters, the model can be re-estimated without any restrictions on γ and c. It is of course possible and sometimes even desirable to impose further restrictions on φ and θ even after this stage. Note that, apart from the usual restrictions of the type φ_j = 0 and θ_j = 0, the exclusion restrictions φ_j = −θ_j are useful. While φ_j = 0 makes the "parameter" φ_j + θ_j F equal to zero for F = 0, the latter does the same for F = 1. The final estimated LSTAR model has the form

y_t = 0.76 y_{t-1} + 0.30 y_{t-2} − 0.37 y_{t-3} − 0.63 y_{t-4} + 0.55 y_{t-5}
     (0.18)        (0.17)         (0.16)         (0.15)         (0.11)

    + (0.087 − 0.76 y_{t-1} − 0.30 y_{t-2} + 0.37 y_{t-3} + 0.63 y_{t-4} − 0.55 y_{t-5})
      (0.0081) (0.18)       (0.17)         (0.16)         (0.15)         (0.11)

    × [1 + exp{−2.2 × 24 (y_{t-1} − 0.063)}]⁻¹ + û_t,                              (7.1)
             (0.80)               (0.010)
s = 0.0217, s²/s_L² = 0.87, F_AR(6,78) = 1.2 (0.34), F_ARCH(4,90) = 0.96 (0.44), sk = 0.054, ek = 0.96, LJB = 3.9 (0.15).

The restrictions φ₀ = 0 and φ_j = −θ_j, j = 1,...,5, were suggested by the data and imposed during the lag specification stage. The figures below the parameter estimates are the estimated standard deviations based on the Hessian; the ones in parentheses following the values of the test statistics are p-values. Note that the exponent of the transition function is standardized by dividing it by the sample standard deviation of y_t [1/σ̂(y) = 24]. This is useful because γ originally is not scale-free, and standardizing makes it much easier to give γ a suitable starting-value for estimation. Furthermore, s is the estimated standard error of the residuals, s_L is ditto for the AR(5) model, F_AR(q, n) is the F-test of no autocorrelation (Section 6) against qth order autocorrelation, F_ARCH(q, n) is the LM test against ARCH of order q, sk is skewness, ek excess kurtosis, and LJB the Lomnicki-Jarque-Bera normality test.

The tests do not reveal any serious inadequacy of the model. There seems to be some excess kurtosis in the residuals, but it amounts to one half of that in the AR(5) model. Table 2 contains the results of the test of no remaining nonlinearity (Section 6). They indicate that this null hypothesis cannot be rejected at conventional significance levels. The numerical long-run solution paths converge to the same point independent of the starting-values. This allows us to conclude that (7.1) has a unique stable singular point (Section 6). Thus the model cannot be rejected on the grounds of a diverging solution path. Note, however, that the value of the solution (= 0.063) clearly exceeds the sample mean of the series, which equals 0.034. The statistical analysis of the model, so far, thus does not reveal any serious model inadequacy, and we can proceed to interpreting the estimated model. The parameters
Table 2
The p-values of the test of no remaining nonlinearity in Eitrheim and Teräsvirta (1993) performed on the residuals of LSTAR model (7.1) for delays d = 1,...,5.

d           1       2       3       4       5
F(15,75)    0.17    0.70    0.22    0.31    0.30
most easily interpreted are γ and c. The former indicates how rapidly the "parameter" vector φ + θF changes from the one extreme to the other with y_{t-1}. The larger γ, the more rapid the change. The location parameter c tells where in the range the change occurs, as F = 0.5 at y_{t-1} = c. This information is summed up by graphing F̂, and the graph appears in Figure 2. It is seen that in our example φ̂ + θ̂F̂ changes rather slowly with y_{t-1}. It may also be interesting to know how φ̂ + θ̂F̂ has varied over time. Figure 3 shows that low values of F̂ have been much more common than high ones.
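The estimated transition function can be evaluated directly from (7.1); the following few lines (Python with numpy) use only numbers reported in the model and reproduce the slow transition visible in Figure 2.

import numpy as np

def F_hat(y_lag1):
    # logistic transition function of (7.1); the exponent is standardized
    # by 1/sigma_hat(y) = 24 as explained in the text
    return 1.0 / (1.0 + np.exp(-2.2 * 24 * (y_lag1 - 0.063)))

for v in np.linspace(-0.04, 0.12, 9):
    print(f"y(t-1) = {v:6.3f}   F = {F_hat(v):5.3f}")
# F passes through 0.5 at y(t-1) = c = 0.063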
Figure 2. Transition function of model (7.1) for the four-quarter differences of the logarithmic index of Austrian industrial production.
Figure 3. Values over time of the transition function of model (7.1) for the four-quarter differences of the logarithmic index of Austrian industrial production.
It is of course impossible to interpret the individual parameter estimates φ̂_j or θ̂_j in (7.1). A study of the roots of characteristic polynomials offers a better way of interpreting (7.1), exactly as it does in the case of a linear autoregressive model. The roots can be computed at various values of F̂, of which zero and one are, perhaps, the most interesting ones. Table 3 contains the roots for F̂ = 0 and 0.5; a sketch of the computation follows the table. When F̂ = 0 the local dynamics of (7.1) are characterized by a strong cyclic component with a period of about two years. When F̂ increases this component grows weaker.
Table 3
The roots of the characteristic polynomial^a of (7.1) for F̂ = 0, 0.5.

F̂      Root               Modulus    Period
0      0.71 ± 0.63i       0.95       8.7
       −0.71 ± 0.55i      0.90       2.5
       0.75               0.75
0.5    0.51 ± 0.64i       0.82       7.0
       −0.63 ± 0.50i      0.81       2.5
       0.63               0.63

^a The characteristic polynomial is C(z) = z⁵ − Σ_{j=1}^{5} (φ̂_j + θ̂_j F̂) z^{5−j}.
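Table 3 can be reproduced directly from the estimates in (7.1): under the restrictions θ̂_j = −φ̂_j the coefficient on z^{5−j} is φ̂_j + θ̂_j F̂ = φ̂_j(1 − F̂). A short Python check (numpy assumed available):

import numpy as np

phi = np.array([0.76, 0.30, -0.37, -0.63, 0.55])     # phi_hat from (7.1)
for F in (0.0, 0.5):
    a = phi * (1.0 - F)               # phi_j + theta_j * F with theta_j = -phi_j
    roots = np.roots(np.concatenate(([1.0], -a)))    # C(z) = z^5 - sum a_j z^(5-j)
    for z in roots:
        z = complex(z)                # plain Python complex for clean printing
        if abs(z.imag) > 1e-8:
            period = 2 * np.pi / abs(np.angle(z))    # quarters per cycle
            print(f"F={F}: {z.real:+.2f}{z.imag:+.2f}i  modulus {abs(z):.2f}  period {period:.1f}")
        else:
            print(f"F={F}: {z.real:+.2f}  modulus {abs(z):.2f}")

The printed moduli and periods match Table 3, e.g. 0.95 and 8.7 quarters for the dominant complex pair at F̂ = 0.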
For F̂ = 0.5, the modulus of the corresponding pair of complex roots has decreased from 0.95 to 0.82. At the same time, the intercept has increased from zero to 0.044. Indeed, it is seen directly from (7.1) that when F̂ approaches unity the (local) cyclical variation disappears altogether. The local autoregression for F̂ = 1 is merely a white noise process with mean 0.087. Thus after entering a recession, industrial production on average is bound to recover strongly after a few quarters because of the cyclical component. On the other hand, according to (7.1), it is somewhat more difficult for a recovery to change into a recession. During a recovery the cycle is less pronounced, and the local linear approximation has a positive intercept as well. A sufficiently large negative shock may be required to depress the industrial output from high to low growth rates.

It is also illuminating to compare the residuals of (7.1) with those from the linear model. They are shown in Figure 4. It is seen that the LSTAR model explains the aftermath of the exceptionally large observation (0.14) in 1972(4) much better than the AR(5) model. This is the case because the local dynamics of the LSTAR model predict a drop to 0.086 in the next period, while the AR(5) model offers a much slower return to lower growth. The residual sum of squares of (7.1) is 85 percent of that of the AR model. After subtracting the residual for 1973(1), the same figure is 92 percent. This is still a fair improvement, but these figures do indicate that a single observation may have quite a large influence on the results. Thus one should be
Figure 4. Residuals of the LSTAR model (7.1) (solid line) and the AR(5) model (broken line) for the four-quarter differences of the logarithmic index of Austrian industrial production.
aware of the possibility that results from nonlinear modelling can be rather sensitive to data errors or other outliers of a similar kind. It does not seem unusual that an LSTAR model is useful in modelling the consequences of exceptional events. The modelling exercise with other industrial production series in Teräsvirta and Anderson (1992) showed that nonlinearity was mainly needed to describe the response of the output to large negative shocks. In the absence of such shocks, both the STAR and the AR models seemed to fit the data equally well. The main difference here is that the most important contribution of the LSTAR model (7.1) is to characterize the response of the system to a large positive shock.

The specialized nature of (7.1) becomes obvious also when the model is used for one-quarter-ahead forecasting. The observations 1987(1)-1988(4) were saved for this purpose. The root mean square error (RMSE) of the eight forecasts equals 0.023, which is about the size of the residual standard deviation of (7.1). However, the RMSE of the forecasts from the AR(5) model only equals 0.013. The test of the hypothesis that both models have the same mean square error of prediction, against the alternative that the AR(5) has the lower mean square error of the two, has p-value 0.049 (one way of carrying out such a comparison is sketched below). The reason for this outcome is that the prediction period does not contain any nonlinearity of the kind appearing during the estimation period. The simple AR model, thus, can forecast such a regular period better than the more involved LSTAR one. This example is univariate. Bacon and Watts (1971) contained probably the first application of a bivariate STR model, but the data were not economic. An application of bivariate STR models to economic data can be found in Granger et al. (1993), but as a whole the number of applications so far is small.
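The exact form of the equal-MSE test used above is not spelled out in the text; the sketch below (Python with numpy and scipy) implements one common choice, a one-sided paired t-test on squared-error differentials in the Diebold-Mariano spirit, and should be read as an assumption rather than as the authors' procedure.

import numpy as np
from scipy import stats

def equal_mse_test(e_nonlinear, e_linear):
    # H0: equal mean square prediction error;
    # H1: the linear model's mean square error is lower
    d = np.asarray(e_nonlinear) ** 2 - np.asarray(e_linear) ** 2
    t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    return t, 1.0 - stats.t.cdf(t, df=len(d) - 1)

# usage with the eight one-quarter-ahead errors saved for 1987(1)-1988(4):
# t_stat, p_value = equal_mse_test(err_lstar, err_ar5)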
8. Conclusions
This chapter is an attempt at an overview of various ways of modelling nonlinear economic relationships. Since nonlinear time series models and methods are a very large field, not all important developments have been covered. The emphasis has been on model building, and the modelling cycle, comprising linearity testing, model specification, parametric or nonparametric function estimation and model evaluation, has been highlighted. The estimation of fully specified nonlinear theory models such as disequilibrium models has not been included here. A majority of results concern the estimation of the conditional mean of a process and, therefore, the conditional variance has received less attention. This is, in part, because conditional heteroskedasticity is discussed in Chapter 49. Random coefficient models also belong under that heading and have not been considered here. Furthermore, this presentation reflects the belief that economic phenomena are more naturally characterized by stochastic rather than deterministic models, so that deterministic chaos and its applications to economics have only been briefly mentioned in the discussion.
At present the number of applications of nonlinear time series models in economics is still fairly limited. Many techniques discussed here are as yet relatively untested. However, the situation may change rather rapidly, so that in a few years the possibilities of evaluating the empirical success of present and new techniques will be considerably better than now.
References

Akaike, H. (1969) "Fitting autoregressions for predictions", Annals of the Institute of Statistical Mathematics, 21, 243-247.
Anderson, B.D.O. and J.B. Moore (1979) Optimal filtering. Englewood Cliffs, NJ: Prentice-Hall.
Andrews, D.W.K. (1993) "Tests for parameter instability and structural change with unknown change point", Econometrica, 61, 821-856.
Auestad, B. and D. Tjøstheim (1990) "Identification of nonlinear time series: First order characterization and order determination", Biometrika, 77, 669-687.
Auestad, B. and D. Tjøstheim (1991) "Functional identification in nonlinear time series", in: G. Roussas, ed., Nonparametric functional estimation and related topics. Amsterdam: Kluwer Academic Publishers, 493-507.
Bacon, D.W. and D.G. Watts (1971) "Estimating the transition between two intersecting straight lines", Biometrika, 58, 525-534.
Barnett, W.A., J. Powell and G.E. Tauchen (1991), eds., Nonparametric and semiparametric methods in econometrics and statistics. Proceedings of the 5th International Symposium in Economic Theory and Econometrics. Cambridge: Cambridge University Press.
Bates, D.M. and D.G. Watts (1988) Nonlinear regression analysis and its applications. New York: Wiley.
Box, G.E.P. and G.M. Jenkins (1970) Time series analysis, forecasting and control. San Francisco: Holden-Day.
Breiman, L. and J.H. Friedman (1985) "Estimating optimal transformations for multiple regression and correlation", Journal of the American Statistical Association, 80, 580-619 (with discussion).
Breiman, L., J.H. Friedman, R. Olshen and C.J. Stone (1984) Classification and regression trees. Belmont, CA: Wadsworth.
Brock, W.A. and S.M. Potter (1993) "Nonlinear time series and macroeconometrics", in: G.S. Maddala, C.R. Rao and H.R. Vinod, eds., Handbook of Statistics, Vol. 11. Amsterdam: North-Holland, 195-229.
Brock, W.A., W.D. Dechert and J.A. Scheinkman (1987) A test for independence based on the correlation dimension. Working paper, University of Wisconsin-Madison, Social Systems Research Institute.
Brockett, P.L., M.J. Hinich and D. Patterson (1988) "Bispectral-based tests for the detection of Gaussianity and linearity in time series", Journal of the American Statistical Association, 83, 657-664.
Brown, R.L., J. Durbin and J.M. Evans (1975) "Techniques for testing the constancy of regression coefficients over time", Journal of the Royal Statistical Society B, 37, 149-192 (with discussion).
Chan, K.S. (1990) "Testing for threshold autoregression", Annals of Statistics, 18, 1886-1894.
Chan, K.S. (1991) "Percentage points of likelihood ratio tests for threshold autoregression", Journal of the Royal Statistical Society B, 53, 691-696.
Chan, K.S. (1993) "Consistency and limiting distribution of the least squares estimator of a threshold autoregressive model", Annals of Statistics, 21, 520-533.
Chan, K.S. and H. Tong (1986) "On estimating thresholds in autoregressive models", Journal of Time Series Analysis, 7, 179-190.
Chan, K.S. and H. Tong (1990) "On likelihood ratio tests for threshold autoregression", Journal of the Royal Statistical Society B, 52, 469-476.
Chen, P. and R.H. Day (1992), eds., Non-linear dynamics and evolutionary economics. Cambridge, MA: MIT Press.
Chen, R. and R.S. Tsay (1993a) "Functional coefficient autoregressive models", Journal of the American Statistical Association, 88, 298-308.
Chen, R. and R.S. Tsay (1993b) "Nonlinear additive ARX models", Journal of the American Statistical Association, 88, 955-967.
Cheng, B. and H. Tong (1992) "On consistent nonparametric order determination and chaos", Journal of the Royal Statistical Society B, 54, 427-449.
Chow, G.C. (1960) "Testing for equality between sets of coefficients in two linear regressions", Econometrica, 28, 591-605.
Davidson, R. and J.G. MacKinnon (1985) "Heteroskedasticity-robust tests in regression directions", Annales de l'INSEE, 59/60, 183-218.
Davies, R.B. (1977) "Hypothesis testing when a nuisance parameter is present only under the alternative", Biometrika, 64, 247-254.
Davies, R.B. (1987) "Hypothesis testing when a nuisance parameter is present only under the alternative", Biometrika, 74, 33-43.
De Gooijer, J.G. and K. Kumar (1992) "Some recent developments in non-linear time series modelling, testing and forecasting", International Journal of Forecasting, 8, 135-156.
Delgado, M.A. and P.M. Robinson (1992) "Nonparametric and semiparametric methods for economic research", Journal of Economic Surveys, 6, 201-249.
Desai, M. (1984) "Econometric models of the share of wages in national income, U.K. 1855-1965", in: R.M. Goodwin, M. Krüger and A. Vercelli, eds., Nonlinear models of fluctuating growth. Lecture Notes in Economics and Mathematical Systems No. 228. New York: Springer-Verlag.
Diebolt, J. (1990) "Testing the functions defining a nonlinear autoregressive time series", Stochastic Processes and their Applications, 36, 85-106.
Eitrheim, Ø. and T. Teräsvirta (1993) Testing the adequacy of smooth transition autoregressive models. Bank of Norway, Research Department, Working Paper 1993/13.
Engle, R.F. and V. Ng (1993) "Measuring and testing the impact of news on volatility", Journal of Finance, 48, 1749-1778.
Engle, R.F., C.W.J. Granger, J. Rice and A. Weiss (1986) "Semiparametric estimates of the relation between weather and electricity sales", Journal of the American Statistical Association, 81, 310-320.
Ertel, J.E. and E.B. Fowlkes (1976) "Some algorithms for linear spline and piecewise multiple linear regression", Journal of the American Statistical Association, 71, 640-648.
Franke, J. and M. Wendel (1990) "A bootstrap approach for nonlinear autoregressions: Some preliminary results", preprint, to appear in: Proceedings of the International Conference on Bootstrapping and Related Techniques, Trier, June 1990.
Friedman, J.H. (1991) "Multivariate adaptive regression splines", Annals of Statistics, 19, 1-141 (with discussion).
Friedman, J.H. and W. Stuetzle (1981) "Projection pursuit regression", Journal of the American Statistical Association, 76, 817-823.
Gallant, A.R. (1981) "On the bias in flexible functional forms and an essentially unbiased form: The Fourier flexible form", Journal of Econometrics, 15, 211-245.
Gallant, A.R. (1987) Nonlinear statistical models. New York: Wiley.
Gallant, A.R. and G. Tauchen (1989) "Seminonparametric estimation of conditionally constrained heterogeneous processes: Asset pricing applications", Econometrica, 57, 1091-1120.
Gallant, A.R., P.E. Rossi and G. Tauchen (1992) "Stock prices and volume", Review of Financial Studies, 5, 199-242.
Granger, C.W.J. and J.J. Hallman (1991a) "Nonlinear transformations of integrated time series", Journal of Time Series Analysis, 12, 207-224.
Granger, C.W.J. and J.J. Hallman (1991b) "Long-memory processes with attractors", Oxford Bulletin of Economics and Statistics, 53, 11-26.
Granger, C.W.J. and J.L. Lin (1991) Nonlinear correlation coefficients and identification of nonlinear time series models. University of California, San Diego, Department of Economics, Discussion Paper.
Granger, C.W.J. and P. Newbold (1986) Forecasting economic time series, 2nd edition. Orlando, FL: Academic Press.
Granger, C.W.J. and T. Teräsvirta (1992) "Experiments in modeling nonlinear relationships between time series", in: M. Casdagli and S. Eubank, eds., Nonlinear modeling and forecasting. Proceedings of the Workshop on Nonlinear Modeling and Forecasting held September 1990 in Santa Fe, New Mexico. Redwood City, CA: Addison-Wesley, 189-197.
Granger, C.W.J. and T. Teräsvirta (1993) Modelling nonlinear economic relationships. Oxford: Oxford University Press.
Granger, C.W.J., T. Teräsvirta and H.M. Anderson (1993) "Modelling non-linearity over the business cycle", in: J.H. Stock and M.W. Watson, eds., Business cycles, indicators and forecasting. Chicago: University of Chicago Press, 311-325.
Haggan, V. and T. Ozaki (1981) "Modelling non-linear random vibrations using an amplitude-dependent autoregressive time series model", Biometrika, 68, 189-196.
Haggan, V., S.M. Heravi and M.B. Priestley (1984) "A study of the application of state-dependent models in non-linear time series analysis", Journal of Time Series Analysis, 5, 69-102.
Hallman, J.J. (1990) Nonlinear integrated series, cointegration and application. PhD thesis, University of California, San Diego, Department of Economics.
Hansen, B.E. (1990) Lagrange multiplier tests for parameter instability in non-linear models. Paper presented at the Sixth World Congress of the Econometric Society, Barcelona.
Härdle, W. (1990) Applied nonparametric regression. Cambridge: Cambridge University Press.
Harvey, A.C. (1990) The econometric analysis of time series, 2nd edition. Cambridge, MA: MIT Press.
Hastie, T.J. and R.J. Tibshirani (1990) Generalized additive models. London: Chapman and Hall.
Heckman, N. (1986) "Spline smoothing in a partly linear model", Journal of the Royal Statistical Society B, 48, 244-248.
Higgins, M. and A.K. Bera (1989) "A joint test for ARCH and bilinearity in the regression model", Econometric Reviews, 7, 171-181.
Hinich, M.J. (1982) "Testing for Gaussianity and linearity of a stationary time series", Journal of Time Series Analysis, 3, 169-176.
Joe, H. (1989) "Estimation of entropy and other functionals of a multivariate density", Annals of the Institute of Statistical Mathematics, 41, 683-697.
Judge, G.G., W.E. Griffiths, R.C. Hill, H. Lütkepohl and T.-C. Lee (1985) The theory and practice of econometrics, 2nd edition. New York: Wiley.
Keenan, D.M. (1985) "A Tukey non-additivity type test for time series nonlinearity", Biometrika, 72, 39-44.
Krämer, W., W. Ploberger and R. Alt (1988) "Testing for structural change in dynamic models", Econometrica, 56, 1355-1369.
Kuan, C.-M. and H. White (1994) "Artificial neural networks: An econometric perspective", Econometric Reviews, 13, 1-143 (with discussion).
Künsch, H. (1989) "The jackknife and the bootstrap for general stationary observations", Annals of Statistics, 17, 1217-1241.
Lasota, A. and M.C. Mackey (1989) "Stochastic perturbation of dynamical systems: The weak convergence of measures", Journal of Mathematical Analysis and Applications, 138, 232-248.
Lee, T.-H., H. White and C.W.J. Granger (1993) "Testing for neglected nonlinearity in time series models: A comparison of neural network methods and alternative tests", Journal of Econometrics, 56, 269-290.
Lewis, P.A.W. and J.G. Stevens (1991a) "Nonlinear modeling of time series using multivariate adaptive regression splines (MARS)", Journal of the American Statistical Association, 86, 864-877.
Lewis, P.A.W. and J.G. Stevens (1991b) Semi-multivariate nonlinear modeling of time series using multivariate adaptive regression splines (MARS). Preprint, Naval Postgraduate School.
Liu, T., C.W.J. Granger and W. Heller (1992) "Using the correlation exponent to decide whether an economic series is chaotic", Journal of Applied Econometrics, 7, S25-S39.
Luukkonen, R. (1990) On linearity testing and model estimation in non-linear time series analysis. Helsinki: Finnish Statistical Society.
Luukkonen, R. and T. Teräsvirta (1991) "Testing linearity of economic time series against cyclical asymmetry", Annales d'économie et de statistique, 20/21, 125-142.
Luukkonen, R., P. Saikkonen and T. Teräsvirta (1988a) "Testing linearity in univariate time series models", Scandinavian Journal of Statistics, 15, 161-175.
Luukkonen, R., P. Saikkonen and T. Teräsvirta (1988b) "Testing linearity against smooth transition autoregressive models", Biometrika, 75, 491-499.
Maddala, G.S. (1977) Econometrics. New York: McGraw-Hill.
Maddala, G.S. (1986) "Disequilibrium, self-selection and switching models", in: Z. Griliches and M.D. Intriligator, eds., Handbook of econometrics, Vol. 3. Amsterdam: North-Holland, 1634-1688.
Marron, S. (1989) "Automatic smoothing parameter selection: A survey", in: A. Ullah, ed., Semiparametric and nonparametric econometrics. Heidelberg: Physica-Verlag, 65-86.
Masry, E. and D. Tjøstheim (1994) "Nonparametric estimation and identification of ARCH and ARX nonlinear time series: Strong convergence and asymptotic normality", Econometric Theory (forthcoming).
Müller, H.G. (1990) Smooth optimum kernel estimators near endpoints. Preprint, University of California, Davis.
Nicholls, D.F. and A.R. Pagan (1985) "Varying coefficient regression", in: E.J. Hannan, P.R. Krishnaiah and M.M. Rao, eds., Handbook of statistics, Vol. 5. Amsterdam: Elsevier, 413-449.
Ozaki, T. (1985) "Non-linear time series models and dynamical systems", in: E.J. Hannan, P.R. Krishnaiah and M.M. Rao, eds., Handbook of statistics, Vol. 5. Amsterdam: Elsevier, 25-83.
Pagan, A.R. and G.W. Schwert (1990) "Alternative models for conditional stock volatility", Journal of Econometrics, 45, 267-290.
Petruccelli, J.D. (1990) "A comparison of tests for SETAR-type non-linearity in time series", Journal of Forecasting, 9, 25-36.
Petruccelli, J.D. and N. Davies (1986) "A portmanteau test for self-exciting threshold autoregressive-type nonlinearity", Biometrika, 73, 687-694.
Ploberger, W. and W. Krämer (1992) "The CUSUM test with OLS residuals", Econometrica, 60, 271-285.
Politis, D.N. and J.P. Romano (1990) A nonparametric resampling procedure for multivariate confidence regions in time series analysis. Technical Report, Department of Statistics, Stanford University.
Pötscher, B.M. and I.R. Prucha (1991a) "Basic structure of the asymptotic theory in dynamic nonlinear econometric models, Part I: Consistency and approximation concepts", Econometric Reviews, 10, 125-216.
Pötscher, B.M. and I.R. Prucha (1991b) "Basic structure of the asymptotic theory in dynamic nonlinear econometric models, Part II: Asymptotic normality", Econometric Reviews, 10, 253-325.
Powell, J.L., J.H. Stock and T.M. Stoker (1989) "Semiparametric estimation of index coefficients", Econometrica, 57, 1403-1430.
Priestley, M. (1988) Non-linear and non-stationary time series analysis. London and San Diego: Academic Press.
Quandt, R. (1960) "Tests of the hypothesis that a linear regression system obeys two separate regimes", Journal of the American Statistical Association, 55, 324-330.
Quandt, R. (1982) "Econometric disequilibrium models", Econometric Reviews, 1, 1-63.
Quandt, R. (1983) "Computational problems and methods", in: Z. Griliches and M.D. Intriligator, eds., Handbook of econometrics, Vol. 1. Amsterdam: North-Holland, 699-746.
Ramsey, J.B. (1969) "Tests for specification errors in classical linear least-squares regression analysis", Journal of the Royal Statistical Society B, 31, 350-371.
Rice, J. (1984) "Boundary modification for kernel regression", Communications in Statistics, Theory and Methods, 13, 893-900.
Robinson, P.M. (1983) "Non-parametric estimation for time series models", Journal of Time Series Analysis, 4, 185-208.
Robinson, P.M. (1988) "Root-N-consistent semiparametric regression", Econometrica, 56, 931-954.
Robinson, P.M. (1991) "Consistent nonparametric entropy-based testing", Review of Economic Studies, 58, 437-453.
Saikkonen, P. and R. Luukkonen (1988) "Lagrange multiplier tests for testing nonlinearities in time series models", Scandinavian Journal of Statistics, 15, 55-68.
Scheinkman, J.A. (1990) "Nonlinearities in economic dynamics", Economic Journal, 100, Supplement, 33-48.
Seber, G.A.F. and C.J. Wild (1989) Nonlinear regression. New York: Wiley.
Shumway, R.H., A.S. Azari and Y. Pawitan (1988) "Modeling mortality fluctuations in Los Angeles as functions of pollution and weather effects", Environmental Research, 45, 224-241.
Silverman, B.W. (1984) "Spline smoothing: The equivalent variable kernel method", Annals of Statistics, 12, 898-916.
Skaug, H. and D. Tjøstheim (1993a) "Nonparametric tests of serial independence", in: T. Subba Rao, ed., The M.B. Priestley Birthday Volume. London: Chapman and Hall, 207-229.
Skaug, H. and D. Tjøstheim (1993b) "A nonparametric test of serial independence based on the empirical distribution function", Biometrika, 80, 591-602.
Skaug, H. and D. Tjøstheim (1993c) Measures of distance between densities with application to testing for serial independence. Preprint, Department of Mathematics, University of Bergen.
Stensholt, B.K. and D. Tjøstheim (1987) "Multiple bilinear time series models", Journal of Time Series Analysis, 8, 221-233.
Stinchcombe, M. and H. White (1989) "Universal approximations using feedforward networks with non-sigmoid hidden layer activation functions", in: Proceedings of the International Joint Conference on Neural Networks, Washington, D.C. San Diego: SOS Printing, I: 613-618.
Subba Rao, T. and M.M. Gabr (1980) "A test for linearity of stationary time series", Journal of Time Series Analysis, 1, 145-158.
Subba Rao, T. and M.M. Gabr (1984) An introduction to bispectral analysis and bilinear time series models. Lecture Notes in Statistics, 24. New York: Springer.
Sugihara, G. and R.M. May (1990) "Nonlinear forecasting as a way of distinguishing chaos from measurement error in time series", Nature, 344, 734-741.
Teräsvirta, T. (1990) Power properties of linearity tests for time series. University of California, San Diego, Department of Economics, Discussion Paper No. 90-15.
Teräsvirta, T. (1994) "Specification, estimation and evaluation of smooth transition autoregressive models", Journal of the American Statistical Association, 89, 208-218.
Teräsvirta, T. and H.M. Anderson (1992) "Modelling nonlinearities in business cycles using smooth transition autoregressive models", Journal of Applied Econometrics, 7, S119-S136.
Teräsvirta, T., C.-F. Lin and C.W.J. Granger (1993) "Power of the neural network linearity test", Journal of Time Series Analysis, 14, 209-220.
Tibshirani, R. (1988) "Estimating optimal transformations for regression via additivity and variance stabilization", Journal of the American Statistical Association, 83, 559-568.
Tjøstheim, D. (1986) "Some doubly stochastic time series models", Journal of Time Series Analysis, 7, 51-72.
Tjøstheim, D. (1994) "Nonlinear time series: A selective review", Scandinavian Journal of Statistics (forthcoming).
Tjøstheim, D. and B. Auestad (1994a) "Nonparametric identification of nonlinear time series: Projections", Journal of the American Statistical Association, 89 (forthcoming).
Tjøstheim, D. and B. Auestad (1994b) "Nonparametric identification of nonlinear time series: Selecting significant lags", Journal of the American Statistical Association, 89 (forthcoming).
Tong, H. (1990) Non-linear time series: A dynamical system approach. Oxford: Oxford University Press.
Truong, Y.K. and C. Stone (1992) "Nonparametric function estimation involving time series", Annals of Statistics, 20, 77-97.
Tsay, R.S. (1986) "Nonlinearity tests for time series", Biometrika, 73, 461-466.
Tsay, R.S. (1989) "Testing and modeling threshold autoregressive processes", Journal of the American Statistical Association, 84, 231-240.
Tsay, R.S. (1992) "Model checking via parametric bootstraps in time series analysis", Applied Statistics, 41, 1-15.
Ullah, A. (1989), ed., Semiparametric and nonparametric econometrics. Heidelberg: Physica-Verlag.
Weiss, A. (1986) "ARCH and bilinear time series models: Comparison and combination", Journal of Business and Economic Statistics, 4, 59-70.
White, H. (1984) Asymptotic theory for econometricians. Orlando, FL: Academic Press.
White, H. (1989) "Some asymptotic results for learning in single hidden-layer feedforward network models", Journal of the American Statistical Association, 84, 1003-1013.
Wooldridge, J.M. (1990) "A unified approach to robust, regression-based specification tests", Econometric Theory, 6, 17-43.
Yakowitz, S. (1987) "Nearest-neighbour methods for time series analysis", Journal of Time Series Analysis, 8, 235-247.
Chapter 49

ARCH MODELS*

TIM BOLLERSLEV
Northwestern University and N.B.E.R.

ROBERT F. ENGLE
University of California, San Diego and N.B.E.R.

DANIEL B. NELSON
University of Chicago and N.B.E.R.
Contents

Abstract                                            2961
1. Introduction                                     2961
   1.1. Definitions                                 2961
   1.2. Empirical regularities of asset returns     2963
   1.3. Univariate parametric models                2967
   1.4. ARCH in mean models                         2972
   1.5. Nonparametric and semiparametric methods    2972
2. Inference procedures                             2974
   2.1. Testing for ARCH                            2974
   2.2. Maximum likelihood methods                  2977
   2.3. Quasi-maximum likelihood methods            2983
   2.4. Specification checks                        2984
*The authors would like to thank Torben G. Andersen, Patrick Billingsley, William A. Brock, Eric Ghysels, Lars P. Hansen, Andrew Harvey, Blake LeBaron, and Theo Nijman for helpful comments. Financial support from the National Science Foundation under grants SES-9022807 (Bollerslev), SES-9122056 (Engle), and SES-9110131 and SES-9310683 (Nelson), and from the Center for Research in Security Prices (Nelson), is gratefully acknowledged. Inquiries regarding the data for the stock market empirical application should be addressed to Professor G. William Schwert, Graduate School of Management, University of Rochester, Rochester, NY 14627, USA. The GAUSS™ code used in the stock market empirical example is available from the Inter-University Consortium for Political and Social Research (ICPSR), P.O. Box 1248, Ann Arbor, MI 48106, USA, telephone (313) 763-5010. Order "Class 5" under this article's name.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved
3. Stationary and ergodic properties                2989
   3.1. Strict stationarity                         2989
   3.2. Persistence                                 2990
4. Continuous time methods                          2992
   4.1. ARCH models as approximations to diffusions 2994
   4.2. Diffusions as approximations to ARCH models 2996
   4.3. ARCH models as filters and forecasters      2997
5. Aggregation and forecasting                      2999
   5.1. Temporal aggregation                        2999
   5.2. Forecast error distributions                3001
6. Multivariate specifications                      3002
   6.1. Vector ARCH and diagonal ARCH               3003
   6.2. Factor ARCH                                 3005
   6.3. Constant conditional correlations           3007
   6.4. Bivariate EGARCH                            3008
   6.5. Stationarity and co-persistence             3009
7. Model selection                                  3010
8. Alternative measures for volatility              3012
9. Empirical examples                               3014
   9.1. U.S. Dollar/Deutschmark exchange rates      3014
   9.2. U.S. stock prices                           3017
10. Conclusion                                      3030
References                                          3031
Abstract

This chapter evaluates the most important theoretical developments in ARCH type modeling of time-varying conditional variances. The coverage includes the specification of univariate parametric ARCH models, general inference procedures, conditions for stationarity and ergodicity, continuous time methods, aggregation and forecasting of ARCH models, multivariate conditional covariance formulations, and the use of model selection criteria in an ARCH context. Additionally, the chapter contains a discussion of the empirical regularities pertaining to the temporal variation in financial market volatility. Motivated in part by recent results on optimal filtering, a new conditional variance model for better characterizing stock return volatility is also presented.

1. Introduction
Until a decade ago the focus of most macroeconometric and financial time series modeling centered on the conditional first moments, with any temporal dependencies in the higher order moments treated as a nuisance. The increased importance played by risk and uncertainty considerations in modern economic theory, however, has necessitated the development of new econometric time series techniques that allow for the modeling of time varying variances and covariances. Given the apparent lack of any structural dynamic economic theory explaining the variation in higher order moments, particularly instrumental in this development has been the autoregressive conditional heteroskedastic (ARCH) class of models introduced by Engle (1982). Parallel to the success of standard linear time series models, arising from the use of the conditional versus the unconditional mean, the key insight offered by the ARCH model lies in the distinction between the conditional and the unconditional second order moments. While the unconditional covariance matrix for the variables of interest may be time invariant, the conditional variances and covariances often depend non-trivially on the past states of the world. Understanding the exact nature of this temporal dependence is crucially important for many issues in macroeconomics and finance, such as irreversible investments, option pricing, the term structure of interest rates, and general dynamic asset pricing relationships. Also, from the perspective of econometric inference, the loss in asymptotic efficiency from neglected heteroskedasticity may be arbitrarily large and, when evaluating economic forecasts, a much more accurate estimate of the forecast error uncertainty is generally available by conditioning on the current information set.
1.1. Definitions

Let {ε_t(θ)} denote a discrete time stochastic process with conditional mean and variance functions parametrized by the finite dimensional vector θ ∈ Θ ⊆ ℝ^m, where
θ₀ denotes the true value. For notational simplicity we shall initially assume that ε_t(θ) is a scalar, with the obvious extensions to a multivariate framework treated in Section 6. Also, let E_{t-1}(·) denote the mathematical expectation, conditional on the past of the process, along with any other information available at time t − 1. The {ε_t(θ₀)} process is then defined to follow an ARCH model if the conditional mean equals zero,
E_{t-1}(ε_t(θ₀)) = 0,    t = 1, 2, ...,    (1.1)
but the conditional variance,

σ_t²(θ₀) ≡ Var_{t-1}(ε_t(θ₀)) = E_{t-1}(ε_t²(θ₀)),    t = 1, 2, ...,    (1.2)
depends non-trivially on the sigma-field generated by the past observations, i.e. {ε_{t-1}(θ₀), ε_{t-2}(θ₀), ...}. When obvious from the context, the explicit dependence on the parameters, θ, will be suppressed for notational convenience. Also, in the multivariate case the corresponding time varying conditional covariance matrix will be denoted by Ω_t. In much of the subsequent discussion we shall focus directly on the {ε_t} process, but the same ideas extend directly to the situation in which {ε_t} corresponds to the innovations from some more elaborate econometric model. In particular, let {y_t(θ₀)} denote the stochastic process of interest with conditional mean
μ_t(θ₀) ≡ E_{t-1}(y_t),    t = 1, 2, ....    (1.3)
Note, by the timing convention, both μ_t(θ₀) and σ_t²(θ₀) are measurable with respect to the time t − 1 information set. Define the {ε_t(θ₀)} process by

ε_t(θ₀) = y_t − μ_t(θ₀),    t = 1, 2, ....    (1.4)
The conditional variance for {ε_t} then equals the conditional variance for the {y_t} process. Since very few economic and financial time series have a constant conditional mean of zero, most of the empirical applications of the ARCH methodology actually fall within this framework. Returning to the definitions in equations (1.1) and (1.2), it follows that the standardized process,
z_t(θ₀) ≡ ε_t(θ₀) σ_t²(θ₀)^{−1/2},    t = 1, 2, ...,    (1.5)
will have conditional mean zero and a time invariant conditional variance of unity. This observation forms the basis for most of the inference procedures that underlie the applications of ARCH type models. If the conditional distribution for z_t is furthermore assumed to be time invariant
with a finite fourth moment, it follows by Jensen's inequality that

E(ε_t⁴) = E(z_t⁴) E(σ_t⁴) ≥ E(z_t⁴) [E(σ_t²)]² = E(z_t⁴) [E(ε_t²)]²,
where the equality holds true for a constant conditional variance only. Given a normal distribution for the standardized innovations in equation (1.5), the unconditional distribution for ε_t is therefore leptokurtic. The setup in equations (1.1) through (1.4) is extremely general and does not lend itself directly to empirical implementation without first imposing further restrictions on the temporal dependencies in the conditional mean and variance functions. Below we shall discuss some of the most practical and popular such ARCH formulations for the conditional variance. While the first empirical applications of the ARCH class of models were concerned with modeling inflationary uncertainty, the methodology has subsequently found especially wide use in capturing the temporal dependencies in asset returns. For a recent survey of this extensive empirical literature we refer to Bollerslev et al. (1992).
1.2. Empirical regularities of asset returns
Even in the univariate case, the array of functional forms permitted by equation (1.2) is vast, and infinitely larger than can be accommodated by any parametric family of ARCH models. Clearly, to have any hope of selecting an appropriate ARCH model, we must have a good idea of what empirical regularities the model should capture. Thus, a brief discussion of some of the important regularities for asset return volatility follows.

1.2.1. Thick tails

Asset returns tend to be leptokurtic. The documentation of this empirical regularity by Mandelbrot (1963), Fama (1965) and others led to a large literature on modeling stock returns as i.i.d. draws from thick-tailed distributions; see, e.g., Mandelbrot (1963), Fama (1963, 1965), Clark (1973) and Blattberg and Gonedes (1974).

1.2.2. Volatility clustering

As Mandelbrot (1963) wrote,

... large changes tend to be followed by large changes, of either sign, and small changes tend to be followed by small changes...
This volatility clustering phenomenon is immediately apparent when asset returns are plotted through time. To illustrate, Figure 1 plots the daily capital gains on the Standard 90 composite stock index from 1928-1952 combined with the Standard and Poor's 500 index from 1953-1990. The returns are expressed in percent, and are continuously compounded. It is clear from visual inspection of the figure, and any reasonable statistical test, that the returns are not i.i.d. through time. For example, volatility was clearly higher during the 1930's than during the 1960's, as confirmed by the estimation results reported in French et al. (1987). A similar message is contained in Figure 2, which plots the daily percentage Deutschmark/U.S. Dollar exchange rate appreciation. Distinct periods of exchange market turbulence and tranquility are immediately evident. We shall return to a formal analysis of both of these two time series in Section 9 below.

Figure 1. Daily Standard and Poor's capital gains.

Volatility clustering and thick tailed returns are intimately related. As noted in Section 1.1 above, if the unconditional kurtosis of ε_t is finite, E(ε_t⁴)/[E(ε_t²)]² ≥ E(z_t⁴), where the inequality is strict unless σ_t is constant. Excess kurtosis in ε_t can therefore arise from randomness in σ_t, from excess kurtosis in the conditional distribution of ε_t, i.e., in z_t, or from both.
1.2.3. Leverage effects

The so-called "leverage effect," first noted by Black (1976), refers to the tendency for changes in stock prices to be negatively correlated with changes in stock volatility. Fixed costs such as financial and operating leverage provide a partial explanation for this phenomenon. A firm with debt and equity outstanding typically becomes more highly leveraged when the value of the firm falls. This raises the volatility of equity returns. Black (1976), however, argued that the response of stock volatility to the direction of returns is too large to be explained by leverage alone. This conclusion is also supported by the empirical work of Christie (1982) and Schwert (1989b).

Figure 2. Daily U.S. Dollar-Deutschmark appreciation.

1.2.4. Non-trading periods
Information that accumulates when financial markets are closed is reflected in prices after the markets reopen. If, for example, information accumulates at a constant rate over calendar time, then the variance of returns over the period from the Friday close to the Monday close should be three times the variance from the Monday close to the Tuesday close. Fama (1965) and French and Roll (1986) have found, however, that information accumulates more slowly when the markets are closed than when they are open. Variances are higher following weekends and holidays than on other days, but not nearly by as much as would be expected if the news arrival rate were constant. For instance, using data on daily returns across all NYSE and AMEX stocks from 1963-1982, French and Roll (1986) find that volatility is 70 times higher per hour on average when the market is open than when it is closed. Baillie and Bollerslev (1989) report qualitatively similar results for foreign exchange rates.

1.2.5. Forecastable events

Not surprisingly, forecastable releases of important information are associated with high ex ante volatility. For example, Cornell (1978) and Patell and Wolfson (1979,
1981) show that individual firms' stock return volatility is high around earnings announcements. Similarly, Harvey and Huang (1991, 1992) find that fixed income and foreign exchange volatility is higher during periods of heavy trading by central banks or when macroeconomic news is being released. There are also important predictable changes in volatility across the trading day. For example, volatility is typically much higher at the open and close of stock and foreign exchange trading than during the middle of the day. This pattern has been documented by Harris (1986), Gerety and Mulherin (1992) and Baillie and Bollerslev (1991), among others. The increase in volatility at the open at least partly reflects information accumulated while the market was closed. The volatility surge at the close is less easily interpreted.
1.2.6. Volatility and serial correlation
LeBaron (1992) finds a strong inverse relation between volatility and serial correlation for U.S. stock indices. This finding appears remarkably robust to the choice of sample period, market index, measurement interval and volatility measure. Kim (1989) documents a similar relationship in foreign exchange rate data.
1.2.7. Co-movements in volatilities

Black (1976) observed that
... there is a lot of commonality in volatility changes across stocks: a 1% market volatility change typically implies a 1% volatility change for each stock. Well, perhaps the high volatility stocks are somewhat more sensitive to market volatility changes than the low volatility stocks. In general it seems fair to say that when stock volatilities change, they all tend to change in the same direction.

Diebold and Nerlove (1989) and Harvey et al. (1992) also argue for the existence of a few common factors explaining exchange rate volatility movements. Engle et al. (1990b) show that U.S. bond volatility changes are closely linked across maturities. This commonality of volatility changes holds not only across assets within a market, but also across different markets. For example, Schwert (1989a) finds that U.S. stock and bond volatilities move together, while Engle and Susmel (1993) and Hamao et al. (1990) discover close links between volatility changes across international stock markets. The importance of international linkages has been further explored by King et al. (1994), Engle et al. (1990a), and Lin et al. (1994). That volatilities move together should be encouraging to model builders, since it indicates that a few common factors may explain much of the temporal variation in the conditional variances and covariances of asset returns. This forms the basis for the factor ARCH models discussed in Section 6.2 below.
1.2.8. Macroeconomic variables and volatility

Since stock values are closely tied to the health of the economy, it is natural to expect that measures of macroeconomic uncertainty such as the conditional variances of industrial production, interest rates, money growth, etc. should help explain changes in stock market volatility. Schwert (1989a, b) finds that although stock volatility rises sharply during recessions and financial crises and drops during expansions, the relation between macroeconomic uncertainty and stock volatility is surprisingly weak. Glosten et al. (1993), on the other hand, uncover a strong positive relationship between stock return volatility and interest rates.

1.3. Univariate parametric models

1.3.1. GARCH
Numerous parametric specifications for the time varying conditional variance have been proposed in the literature. In the linear ARCH(q) model originally introduced by Engle (1982), the conditional variance is postulated to be a linear function of the past q squared innovations,
σ_t² = ω + Σ_{i=1}^{q} α_i ε_{t-i}² ≡ ω + α(L) ε_{t-1}²,    (1.6)
where L denotes the lag or backshift operator, L^i y_t ≡ y_{t-i}. Of course, for the conditional variance in this model to be well defined and positive almost surely, the parameters must satisfy ω > 0 and α₁ ≥ 0, ..., α_q ≥ 0. Defining ν_t ≡ ε_t² − σ_t², the ARCH(q) model in (1.6) may be re-written as

ε_t² = ω + α(L) ε_{t-1}² + ν_t.    (1.7)
Since E_{t-1}(ν_t) = 0, the model corresponds directly to an AR(q) model for the squared innovations, ε_t². The process is covariance stationary if and only if the sum of the positive autoregressive parameters is less than one, in which case the unconditional variance equals Var(ε_t) ≡ σ² = ω/(1 − α₁ − ⋯ − α_q). Even though the ε_t's are serially uncorrelated, they are clearly not independent through time. In accordance with the stylized facts for asset returns discussed above, there is a tendency for large (small) absolute values of the process to be followed by other large (small) values of unpredictable sign. Also, as noted above, if the distribution for the standardized innovations in equation (1.5) is assumed to be time invariant, the unconditional distribution for ε_t will have fatter tails than the distribution for z_t. For instance, for the ARCH(1) model with conditionally normally distributed errors, E(ε_t⁴)/E(ε_t²)² = 3(1 − α₁²)/(1 − 3α₁²) if 3α₁² < 1, and E(ε_t⁴)/E(ε_t²)² = ∞ otherwise; both of which exceed the normal value of three.
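The kurtosis formula for the ARCH(1) model is easy to verify by simulation; a minimal Python sketch follows (Gaussian z_t and the parameter values are assumptions for illustration).

import numpy as np

rng = np.random.default_rng(0)
omega, alpha, n = 0.1, 0.4, 200_000       # 3*alpha^2 < 1, so kurtosis is finite
eps = np.zeros(n)
for t in range(1, n):
    sig2 = omega + alpha * eps[t - 1] ** 2           # ARCH(1) case of (1.6)
    eps[t] = np.sqrt(sig2) * rng.standard_normal()   # eps_t = sigma_t * z_t

kurt_sample = np.mean(eps ** 4) / np.mean(eps ** 2) ** 2
kurt_theory = 3 * (1 - alpha ** 2) / (1 - 3 * alpha ** 2)
print(kurt_sample, kurt_theory)           # both exceed the Gaussian value 3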
Alternatively, the ARCH(q) model may also be represented as a time varying parameter MA(q) model for ε_t,

ε_t = ζ_t (ω + α(L) ε_{t-1}²)^{1/2},    (1.8)
where {ζ_t} denotes a scalar i.i.d. stochastic process with mean zero and variance one. Time varying parameter models have a long history in econometrics and statistics. The appeal of the observationally equivalent formulation in equation (1.6) stems from the explicit focus on the time varying conditional variance of the process. For a discussion of this interpretation of ARCH models, see, e.g., Tsay (1987), Bera et al. (1993) and Bera and Lee (1993).

In empirical applications of ARCH(q) models a long lag length and a large number of parameters are often called for. To circumvent this problem Bollerslev (1986) proposed the generalized ARCH, or GARCH(p, q), model,

σ_t² = ω + Σ_{i=1}^{q} α_i ε_{t-i}² + Σ_{j=1}^{p} β_j σ_{t-j}² ≡ ω + α(L) ε_{t-1}² + β(L) σ_{t-1}².    (1.9)
For the conditional variance in the GARCH(p, q) model to be well defined, all the coefficients in the corresponding infinite order linear ARCH model must be positive. Provided that α(L) and β(L) have no common roots and that the roots of the polynomial β(x) = 1 lie outside the unit circle, this positivity constraint is satisfied if and only if all the coefficients in the infinite power series expansion for α(x)/(1 − β(x)) are non-negative. Necessary and sufficient conditions for this are given in Nelson and Cao (1992). For the simple GARCH(1,1) model, almost sure positivity of σ_t² requires that ω ≥ 0, α₁ ≥ 0 and β₁ ≥ 0. Rearranging the GARCH(p, q) model as in equation (1.7), it follows that

ε_t² = ω + [α(L) + β(L)] ε_{t-1}² − β(L) ν_{t-1} + ν_t,    (1.10)
which defines an ARMA[max(p, q), p] model for $\varepsilon_t^2$. By standard arguments, the model is covariance stationary if and only if all the roots of $\alpha(x) + \beta(x) = 1$ lie outside the unit circle; see Bollerslev (1986) for a formal proof. In many applications with high frequency financial data the estimate for $\alpha(1) + \beta(1)$ turns out to be very close to unity. This provides an empirical motivation for the so-called integrated GARCH(p, q), or IGARCH(p, q), model introduced by Engle and Bollerslev (1986). In the IGARCH class of models the autoregressive polynomial in equation (1.10) has a unit root, and consequently a shock to the conditional variance is persistent in the sense that it remains important for future forecasts of all horizons. Further discussion of stationarity conditions and issues of persistence is contained in Section 3 below. Just as an ARMA model often leads to a more parsimonious representation of the temporal dependencies in the conditional mean than an AR model, the GARCH(p, q) formulation in equation (1.9) provides a similar added flexibility over the linear ARCH model.
Higgins and Bera (1992) proposed the class of non-linear ARCH (NARCH) models:

$\sigma_t^{\gamma} = \omega + \sum_{i=1}^{q}\alpha_i|\varepsilon_{t-i}|^{\gamma} + \sum_{j=1}^{p}\beta_j\sigma_{t-j}^{\gamma}.$  (1.13)

If (1.13) is modified further by setting

$\sigma_t^{\gamma} = \omega + \sum_{i=1}^{q}\alpha_i|\varepsilon_{t-i} - \kappa|^{\gamma} + \sum_{j=1}^{p}\beta_j\sigma_{t-j}^{\gamma}$  (1.14)
for some non-zero $\kappa$, the innovations in $\sigma_t^{\gamma}$ will depend on the size as well as the sign of lagged residuals, thereby allowing for the leverage effect in stock return volatility. The formulation in equation (1.14) with $\gamma = 2$ is also a special case of Sentana's (1991) quadratic ARCH (QARCH) model, in which $\sigma_t^2$ is modeled as a quadratic form in the lagged residuals. A simple version of this model, termed asymmetric ARCH, or AARCH, was also proposed by Engle (1990). In the first order case the AARCH model becomes
$\sigma_t^2 = \omega + \alpha\varepsilon_{t-1}^2 + \delta\varepsilon_{t-1} + \beta\sigma_{t-1}^2,$  (1.15)
where a negative value of $\delta$ means that positive returns increase volatility less than negative returns. Another route for introducing asymmetric effects is to set

$\sigma_t^{\gamma} = \omega + \sum_{i=1}^{q}\left[\alpha_i^{+}I(\varepsilon_{t-i} > 0)|\varepsilon_{t-i}|^{\gamma} + \alpha_i^{-}I(\varepsilon_{t-i} \leq 0)|\varepsilon_{t-i}|^{\gamma}\right] + \sum_{j=1}^{p}\beta_j\sigma_{t-j}^{\gamma},$  (1.16)
where $I(\cdot)$ denotes the indicator function. For example, the threshold ARCH (TARCH) model of Zakoian (1990) corresponds to equation (1.16) with $\gamma = 1$. Glosten, Jagannathan and Runkle (1993) estimate a version of equation (1.16) with $\gamma = 2$. This so-called GJR model allows a quadratic response of volatility to news with different coefficients for good and bad news, but maintains the assertion that the minimum volatility will result when there is no news.¹

¹In a comparison study for daily Japanese TOPIX data, Engle and Ng (1993) found that the EGARCH and the GJR formulations were superior to the AARCH model (1.15), which simply shifts the intercept.
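To make the asymmetric news response concrete, the following sketch implements the first order conditional variance recursion of the GJR version of (1.16), i.e. $\gamma = 2$. The function name and all parameter values are hypothetical, chosen only so that bad news raises volatility more than good news:

```python
import numpy as np

def gjr_variance(eps, omega, alpha_pos, alpha_neg, beta, sig2_0):
    """Conditional variance recursion for a first order GJR model
    (equation (1.16) with gamma = 2): separate ARCH coefficients
    for positive and negative lagged residuals."""
    sig2 = np.empty(len(eps) + 1)
    sig2[0] = sig2_0
    for t in range(len(eps)):
        e = eps[t]
        sig2[t + 1] = (omega
                       + alpha_pos * (e > 0) * e**2
                       + alpha_neg * (e <= 0) * e**2
                       + beta * sig2[t])
    return sig2

# illustrative call: alpha_neg > alpha_pos produces the leverage effect
rng = np.random.default_rng(1)
eps = rng.standard_normal(5)
print(gjr_variance(eps, omega=0.05, alpha_pos=0.02,
                   alpha_neg=0.15, beta=0.9, sig2_0=1.0))
```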
Two additional classes of models have recently been proposed. These models have a somewhat different intellectual heritage but imply particular forms of conditional heteroskedasticity. The first is the unobserved components structural ARCH (STARCH) model of Harvey et al. (1992). These are state space models or factor models in which the innovation is composed of several sources of error, where each of the error sources has a heteroskedastic specification of the ARCH form. Since the error components cannot be separately observed given the past observations, the independent variables in the variance equations are not measurable with respect to the available information set, which complicates inference procedures.² Following earlier work by Diebold and Nerlove (1989), Harvey et al. (1992) propose an estimation strategy based on the Kalman filter. To illustrate the issues, consider the factor structure

$y_t = Bf_t + \varepsilon_t,$  (1.17)

where $y_t$ is an $n \times 1$ vector of asset returns, $f_t$ is a scalar factor with time invariant factor loadings, $B$, and $\varepsilon_t$ is an $n \times 1$ vector of idiosyncratic returns. If the factor follows an ARCH(1) process,

$\sigma_{f,t}^2 = \omega + \alpha f_{t-1}^2,$  (1.18)

then new estimation problems arise since $f_{t-1}$ is not observed, and $\sigma_{f,t}^2$ is not a conditional variance. The Kalman filter gives both $E_{t-1}(f_{t-1})$ and $V_{t-1}(f_{t-1})$, so the proposal by Harvey et al. (1992) is to let the conditional variance of the factor, which is the state variable in the Kalman filter, be given by $\sigma_{f,t}^2 = \omega + \alpha[E_{t-1}(f_{t-1})^2 + V_{t-1}(f_{t-1})]$.
Another important class of models is the switching ARCH, or SWARCH, model proposed independently by Cai (1994) and Hamilton and Susmel (1992). This class of models postulates that there are several different ARCH models and that the economy switches from one to another following a Markov chain. In this model there can be an extremely high volatility process which is responsible for events such as the stock market crash in October 1987. Since this could happen at any time, but with very low probability, the behavior of risk averse agents will take this into account. The SWARCH model must again be estimated using Kalman filter techniques.

The richness of the family of parametric ARCH models is both a blessing and a curse. It certainly complicates the search for the "true" model, and leaves quite a bit of arbitrariness in the model selection stage. On the other hand, the flexibility of the ARCH class of models means that in the analysis of structural economic models with time-varying volatility, there is a good chance that an appropriate parametric ARCH model can be formulated that will make the analysis tractable. For example, Campbell and Hentschel (1992) seek to explain the drop in stock prices associated with an increase in volatility within the context of an economic model. In their model, exogenous rises in stock volatility increase discount rates, lowering stock prices. Using an EGARCH model would have made their formal analysis intractable, but based on a QARCH formulation the derivations are straightforward.

²These models are sometimes also called stochastic volatility models; see Andersen (1992a) for a more formal definition.
1.4. ARCH in mean models
Many theories in finance call for an explicit tradeoff between the expected returns and the variance, or the covariance among the returns. For instance, in Merton's (1973) intertemporal CAPM model, the expected excess return on the market portfolio is linear in its conditional variance under the assumption of a representative agent with log utility. In more general settings, the conditional covariance with an appropriately defined benchmark portfolio often serves to price the assets. For example, according to the traditional capital asset pricing model (CAPM), the excess returns on all risky assets are proportional to the non-diversifiable risk as measured by the covariances with the market portfolio. Of course, this implies that the expected excess return on the market portfolio is simply proportional to its own conditional variance, as in the univariate Merton (1973) model. The ARCH in mean, or ARCH-M, model introduced by Engle et al. (1987) was designed to capture such relationships. In the ARCH-M model the conditional mean is an explicit function of the conditional variance,

$y_t = g[\sigma_t^2(\theta), \theta] + \varepsilon_t,$  (1.19)

where the derivative of the $g(\cdot,\cdot)$ function with respect to the first element is non-zero. The multivariate extension of the ARCH-M model, allowing for the explicit influence of conditional covariance terms in the conditional mean equations, was first considered by Bollerslev et al. (1988) in the context of a multivariate CAPM model. The exact formulation of such multivariate ARCH models is discussed further in Section 6 below. The most commonly employed univariate specifications of the ARCH-M model postulate a linear relationship in $\sigma_t$ or $\sigma_t^2$; e.g. $g[\sigma_t^2(\theta), \theta] = \mu + \delta\sigma_t^2$. For $\delta \neq 0$ the risk premium will be time-varying, and could change sign if $\mu < 0 < \delta$. Note that any time variation in $\sigma_t$ will result in serial correlation in the $\{y_t\}$ process.³ Because of the explicit dependence of the conditional mean on the conditional variance and/or covariance, several unique problems arise in the estimation and testing of ARCH-M models. We shall return to a discussion of these issues in Section 2.2 below.

³The exact form of this serial dependence has been formally analyzed for some simple models in Hong (1991).
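A minimal simulation sketch may help fix ideas. It generates a GARCH(1,1)-M process with the linear-in-variance mean $g = \mu + \delta\sigma_t^2$ discussed above; all parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 20_000
mu, delta = 0.0, 0.1               # illustrative risk premium parameters
omega, alpha, beta = 0.05, 0.08, 0.9

y = np.zeros(T)
eps = np.zeros(T)
sig2 = np.full(T, omega / (1 - alpha - beta))
for t in range(1, T):
    sig2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sig2[t - 1]
    eps[t] = np.sqrt(sig2[t]) * rng.standard_normal()
    y[t] = mu + delta * sig2[t] + eps[t]   # g(sig2, theta) = mu + delta * sig2

# time variation in sig2 feeds into the mean, so y inherits (weak) serial correlation
print("lag-1 autocorrelation of y:", np.corrcoef(y[1:], y[:-1])[0, 1])
```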
1.5. Nonparametric and semiparametric methods
A natural response to the overwhelming variety of parametric univariate ARCH models is to consider and estimate nonparametric models. One of the first attempts at this problem was by Pagan and Schwert (1990), who used a collection of standard nonparametric estimation methods, including kernels, Fourier series and least squares regressions, to fit models for the relation between $y_t^2$ and past $y_t$'s, and then compare the fits with several parametric formulations. Effectively, these models estimate the function $f(\cdot)$ in

$y_t^2 = f(y_{t-1}, y_{t-2}, \ldots, y_{t-p}; \theta) + \eta_t.$  (1.20)
Several problems immediately arise in estimating $f(\cdot)$, however. Because of the problems of high dimensionality, the parameter p must generally be chosen rather small, so that only a little temporal smoothing can actually be achieved directly from (1.20). Secondly, if only squares of the past $y_t$'s are used, the asymmetric terms may not be discovered. Thirdly, minimizing the distance between $y_t^2$ and $f_t = f(y_{t-1}, y_{t-2}, \ldots, y_{t-p}; \theta)$ is most effective if $\eta_t$ is homoskedastic; however, in this case it is highly heteroskedastic. In fact, if $f_t$ were the precise conditional heteroskedasticity, then $y_t^2f_t^{-1}$ and $\eta_tf_t^{-1}$ would be homoskedastic. Thus, $\eta_t$ has conditional variance $f_t^2$, so that the heteroskedasticity is actually more severe than in $y_t$. Not only does parameter estimation become inefficient, but the use of a simple $R^2$ measure as a model selection criterion is inappropriate. An $R^2$ criterion penalizes generalized least squares or maximum likelihood estimators, and corresponds to a loss function which does not even penalize zero or negative predicted variances. This issue will be discussed in more detail in Section 7. Indeed, the conclusion from the empirical analysis for U.S. stock returns conducted in Pagan and Schwert (1990) was that there was in-sample evidence that the nonparametric models could outperform the GARCH and EGARCH models, but that out-of-sample the performance deteriorated. When a proportional loss function was used, the superiority of the nonparametric models also disappeared in-sample. Any nonparametric estimation method must be sensitive to the above mentioned issues.

Gourieroux and Monfort (1992) introduce a qualitative threshold ARCH, or QTARCH, model, which has a conditional variance that is constant over various multivariate observation intervals. For example, divide the space of $y_t$ into J intervals and let $I_j(y_t)$ be 1 if $y_t$ is in the jth interval. The QTARCH model is then written as

$y_t = \sum_{i=1}^{p}\sum_{j=1}^{J}m_{ij}I_j(y_{t-i}) + \left[\sum_{i=1}^{p}\sum_{j=1}^{J}b_{ij}I_j(y_{t-i})\right]u_t,$  (1.21)

where $u_t$ is taken to be i.i.d. The $m_{ij}$ parameters govern the mean and the $b_{ij}$ parameters govern the variance of the $\{y_t\}$ process. As the sample size grows, J can be increased and the bins made smaller to approximate any process. In their most successful application, Gourieroux and Monfort (1992) add a GARCH term, resulting in the G-QTARCH(1) model, with a conditional variance given by

$\sigma_t^2 = \omega + \beta_0\sigma_{t-1}^2 + \sum_{j=1}^{J}\beta_jI_j(y_{t-1}).$  (1.22)
Interestingly, the estimates using four years of daily returns on the French stock index (CAC) showed strong evidence of the leverage effect. In the same spirit, Engle and Ng (1993) propose and estimate a partially nonparametric, or PNP, model, which uses linear splines to estimate the shape of the response to the most recent news. The name of the model reflects the fact that the long memory component is treated as parametric while the relationship between the news and the volatility is treated nonparametrically. The semi-nonparametric series expansion developed in a sequence of papers by Gallant and Tauchen (1989) and Gallant et al. (1991,1992,1993) has also been employed in characterizing the temporal dependencies in the second order moments of asset returns. A formal description of this innovative nonparametric procedure is beyond the scope of the present chapter, however.
2. Inference procedures

2.1. Testing for ARCH

2.1.1. Serial correlation and Lagrange multiplier tests
The original Lagrange multiplier (LM) test for ARCH proposed by Engle (1982) is very simple to compute, and relatively easy to derive. Under the null hypothesis it is assumed that the model is a standard dynamic regression model, which can be written as

$y_t = x_t'\beta + \varepsilon_t,$  (2.1)

where $x_t$ is a set of weakly exogenous and lagged dependent variables and $\varepsilon_t$ is a Gaussian white noise process,

$\varepsilon_t \mid I_{t-1} \sim N(0, \sigma^2),$  (2.2)

where $I_{t-1}$ denotes the available information set. Because the null is so easily estimated, the Lagrange multiplier test is a natural choice. The alternative hypothesis is that the errors are ARCH(q), as in equation (1.6). A straightforward derivation of the Lagrange multiplier test as in Engle (1984) leads to the $TR^2$ test statistic, where the $R^2$ is computed from the regression of $\varepsilon_t^2$ on a constant and $\varepsilon_{t-1}^2, \ldots, \varepsilon_{t-q}^2$. Under the null hypothesis that there is no ARCH, the test statistic is asymptotically distributed as a chi-square distribution with q degrees of freedom. The intuition behind this test is very clear. If the data are homoskedastic, then the variance cannot be predicted and variations in $\varepsilon_t^2$ will be purely random. However, if ARCH effects are present, large values of $\varepsilon_t^2$ will be predicted by large values of the past squared residuals.
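The $TR^2$ statistic is straightforward to compute from the auxiliary regression just described. A minimal sketch (the function name and implementation details are our own, not from the chapter):

```python
import numpy as np
from scipy import stats

def arch_lm_test(resid, q):
    """Engle's (1982) LM test: regress eps_t^2 on a constant and q lagged
    squared residuals; T*R^2 is asymptotically chi-square(q) under the null."""
    e2 = resid**2
    Y = e2[q:]
    X = np.column_stack([np.ones(len(Y))] +
                        [e2[q - i:-i] for i in range(1, q + 1)])
    bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid_aux = Y - X @ bhat
    r2 = 1 - resid_aux.var() / Y.var()
    tr2 = len(Y) * r2
    return tr2, stats.chi2.sf(tr2, q)   # statistic and asymptotic p-value

rng = np.random.default_rng(3)
print(arch_lm_test(rng.standard_normal(1000), q=4))  # homoskedastic data: no rejection expected
```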
While this is a simple and widely used statistic, there are several points which should be made. First, and most obvious, if the model in (2.1) is misspecified by omission of a relevant regressor or failure to account for some non-linearity or serial correlation, it is quite likely that the ARCH test will reject, as these errors may induce serial correlation in the squared errors. Thus, one cannot simply assume that ARCH effects are necessarily present when the ARCH test rejects. Second, there are several other asymptotically equivalent forms of the test, including the standard F-test from the above regression. Another version of the test simply omits the constant but subtracts the estimate of the unconditional variance, $\hat{\sigma}^2$, from the dependent variable, and then uses one half the explained sum of squares as a test statistic. It is also quite common to use asymptotically equivalent portmanteau tests, such as the Ljung and Box (1978) statistic, for $\varepsilon_t^2$. As described above, the parameters of the ARCH(q) model must be positive. Hence, the ARCH test could be formulated as a one-tailed test. When q = 1 this is simple to do, but for higher values of q the procedures are not as clear. Demos and Sentana (1991) have suggested a one-sided ARCH test which is presumably more powerful than the simple $TR^2$ test described above. Similarly, since we find that the GARCH(1,1) is often a superior model and is surely more parsimoniously parametrized, one would like a test which is more powerful for this alternative. The Lagrange multiplier principle unfortunately does not deliver such a test because, for models close to the null, $\alpha_1$ and $\beta_1$ cannot be separately identified. In fact, the LM test for GARCH(1,1) is just the same as the LM test for ARCH(1); see Lee and King (1993), who propose a locally most powerful test for ARCH and GARCH. Of course, Wald type tests for GARCH may also be computed. These too are non-standard, however. The t-statistic on $\alpha_1$ in the GARCH(1,1) model will not have a t-distribution under the null hypothesis, since there is no time-varying input and $\beta_1$ will be unidentified. Finally, likelihood ratio test statistics may be examined, although again they have an uncertain distribution under the null. Practical experience, however, suggests that the latter is a very powerful approach to testing for GARCH effects. We shall return to a more detailed discussion of these tests in Section 2.2.2 below.
2.1.2. BDS test for ARCH
The tests for ARCH discussed above are tests for volatility clustering rather than general conditional heteroskedasticity, or general non-linear dependence. One widely used test for general departures from i.i.d. observations is the BDS test introduced by Brock, Dechert and Scheinkman (1987). We will consider only the univariate version of the test; the multivariate extension is made in Baek and Brock (1992). The BDS test has inspired quite a large literature and several applications have appeared in the finance area; see, e.g. Scheinkman and LeBaron (1989), Hsieh (1991) and Brock et al. (1991). To set up the test, let $\{x_t\}_{t=1}^{T}$ denote a scalar sequence which under the null
hypothesis is assumed to be i.i.d. through time. Define the m-histories of the $x_t$ process as the vectors $(x_1, \ldots, x_m), (x_2, \ldots, x_{m+1}), (x_3, \ldots, x_{m+2}), \ldots, (x_{T-m+1}, \ldots, x_T)$. Clearly, there are $T - m + 1$ such m-histories, and therefore $(T - m + 1)(T - m)/2$ distinct pairs of m-histories. Next, define the correlation integral as the fraction of the distinct pairs of m-histories lying within a distance c of each other in the sup norm; i.e.

$C_{m,T}(c) = [(T - m + 1)(T - m)/2]^{-1}\sum_{m \leq t < s \leq T} I\left(\max_{j=0,\ldots,m-1}|x_{t-j} - x_{s-j}| < c\right).$  (2.3)
Under weak dependence conditions, $C_{m,T}(c)$ converges almost surely to a limit $C_m(c)$. By the basic properties of order-statistics, $C_m(c) = C_1(c)^m$ when $\{x_t\}$ is i.i.d. The BDS test is based on the difference $[C_{m,T}(c) - C_{1,T}(c)^m]$. Intuitively, $C_{m,T}(c) > C_{1,T}(c)^m$ means that when $x_{t-j}$ and $x_{s-j}$ are "close" for j = 1 to m - 1, i.e. $\max_{j=1,\ldots,m-1}|x_{t-j} - x_{s-j}| < c$, then $x_t$ and $x_s$ are more likely than average to be close also. In other words, nearest-neighbor methods work in predicting the $\{x_t\}$ series, which is inconsistent with the i.i.d. assumption.⁴ Brock et al. (1987) show that for fixed m and c, $T^{1/2}[C_{m,T}(c) - C_{1,T}(c)^m]$ is asymptotically normal with mean zero and variance $V(m, c)$ given by

$V(m, c) = 4\left[K(c)^m + 2\sum_{j=1}^{m-1}K(c)^{m-j}C_1(c)^{2j} + (m - 1)^2C_1(c)^{2m} - m^2K(c)C_1(c)^{2m-2}\right],$  (2.4)

where $K(c) = E\{[F(x_t + c) - F(x_t - c)]^2\}$, and $F(\cdot)$ is the cumulative distribution function of $x_t$. The BDS test is then computed as

$T^{1/2}[C_{m,T}(c) - C_{1,T}(c)^m]/\hat{V}(T, m, c)^{1/2},$  (2.5)

where $\hat{V}(T, m, c)$ denotes a consistent estimator of $V(m, c)$, details of which are given by Brock et al. (1987, 1991). For fixed $m \geq 2$ and $c > 0$, the BDS statistic in equation (2.5) is asymptotically standard normal. The BDS test has power against many, though not all, departures from i.i.d. In particular, as documented by Brock et al. (1991) and Hsieh (1991), the power against ARCH alternatives is close to that of Engle's (1982) test. For other conditionally heteroskedastic alternatives, the power of the BDS test may be superior.

⁴$C_{m,T}(c) < C_{1,T}(c)^m$ indicates the reverse of nearest-neighbors predictability. It is important not to push the nearest-neighbors analogy too far, however. For example, suppose $\{x_t\}$ is an ARCH process with a constant conditional mean of 0. In this case, the conditional mean of $x_t$ is always 0, and the nearest-neighbors analogy breaks down for minimum mean-squared-error forecasting of $x_t$. It still holds for forecasting, say, the probability that $x_t$ lies in some interval.
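The correlation integral in equation (2.3) is simple to compute directly. The sketch below evaluates $C_{m,T}(c)$ by brute force and checks the i.i.d. implication $C_m(c) \approx C_1(c)^m$; it deliberately omits the studentization by $\hat{V}(T, m, c)$, and all tuning values are illustrative:

```python
import numpy as np

def correlation_integral(x, m, c):
    """Fraction of distinct pairs of m-histories within sup-norm distance c,
    equation (2.3)."""
    T = len(x)
    n = T - m + 1
    H = np.column_stack([x[i:i + n] for i in range(m)])   # the m-histories
    # pairwise sup-norm distances between distinct m-histories
    d = np.max(np.abs(H[:, None, :] - H[None, :, :]), axis=2)
    iu = np.triu_indices(n, k=1)
    return np.mean(d[iu] < c)

rng = np.random.default_rng(4)
x = rng.standard_normal(500)
c = 1.5 * x.std()                  # within the 0.5-2 std range suggested below
c1 = correlation_integral(x, 1, c)
c3 = correlation_integral(x, 3, c)
print("C_3 - C_1^3:", c3 - c1**3)  # near zero for an i.i.d. series
```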
To illustrate, consider the following example from Brock et al. (1991), where $\sigma_t^2$ is deterministically determined by the tent map,

$\sigma_{t+1}^2 = 1 - 2|\sigma_t^2 - 0.5|,$  (2.6)

with $\sigma_1^2 \in (0, 1)$. The model is clearly heteroskedastic, but does not exhibit volatility clustering, since the empirical serial correlations of $\{\sigma_t^2\}$ approach zero in large samples for almost all values of $\sigma_1^2$. In order to actually implement the BDS test a choice has to be made regarding the values of m and c. The Monte Carlo experiments of Brock et al. (1991) suggest that c should be between 0.5 and 2 standard deviations of the data, and that T/m should be greater than 200 with m no greater than 5. For the asymptotic distribution to be a good approximation to the finite-sample behavior of the BDS test a sample size of at least 500 observations is required.
2.2. Maximum likelihood methods

2.2.1. Estimation
The procedure most often used in estimating $\theta_0$ in ARCH models involves the maximization of a likelihood function constructed under the auxiliary assumption of an i.i.d. distribution for the standardized innovations in equation (1.5). In particular, let $f(z_t; \eta)$ denote the density function for $z_t(\theta) = \varepsilon_t(\theta)/\sigma_t(\theta)$, with mean zero, variance one, and nuisance parameters $\eta \in H \subseteq R^k$. Also, let $\{y_T, y_{T-1}, \ldots, y_1\}$ refer to the sample realizations from an ARCH model as defined by equations (1.1) through (1.4), and $\psi' = (\theta', \eta')$ the combined $(m + k) \times 1$ parameter vector to be estimated for the conditional mean, variance and density functions. The log likelihood function for the tth observation is then given by

$l_t(y_t; \psi) = \ln\{f[z_t(\theta); \eta]\} - 0.5\ln[\sigma_t^2(\theta)],$ $\quad t = 1, 2, \ldots.$  (2.7)
The second term on the right hand side is a Jacobian that arises in the transformation from the standardized innovations, $z_t(\theta)$, to the observables, $y_t(\theta)$.⁵ By a standard prediction error decomposition argument, the log likelihood function for the full sample equals the sum of the conditional log likelihoods in equation (2.7),⁶

$L_T(y_T, y_{T-1}, \ldots, y_1; \psi) = \sum_{t=1}^{T} l_t(y_t; \psi).$  (2.8)
The maximum likelihood estimator (MLE) for the true parameters $\psi_0' = (\theta_0', \eta_0')$, say $\hat{\psi}_T$, is found by the maximization of equation (2.8). Assuming the conditional density and the mean and variance functions to be differentiable for all $\psi \in \Theta \times H \equiv \Psi$, $\hat{\psi}_T$ therefore solves

$S_T(y_T, y_{T-1}, \ldots, y_1; \psi) \equiv \sum_{t=1}^{T} s_t(y_t; \psi) = 0,$  (2.9)
where $s_t(y_t; \psi) = \nabla_\psi l_t(y_t; \psi)$ is the score vector for the tth observation. In particular, for the conditional mean and variance parameters,

$\nabla_\theta l_t(y_t; \psi) = f'[z_t(\theta); \eta]\,f[z_t(\theta); \eta]^{-1}\nabla_\theta z_t(\theta) - 0.5\nabla_\theta\sigma_t^2(\theta)\sigma_t^2(\theta)^{-1},$  (2.10)

where $f'[z_t(\theta); \eta]$ denotes the derivative of the density function with respect to the first element, and

$\nabla_\theta z_t(\theta) = -\nabla_\theta\mu_t(\theta)\sigma_t^2(\theta)^{-1/2} - 0.5\varepsilon_t(\theta)\sigma_t^2(\theta)^{-3/2}\nabla_\theta\sigma_t^2(\theta).$  (2.11)
In practice, the actual solution to the set of $m + k$ non-linear equations in (2.9) will have to proceed by numerical techniques. Engle (1982) and Bollerslev (1986) provide a discussion of some of the alternative iterative procedures that have been successfully employed in the estimation of ARCH models. Of course, the actual implementation of the maximum likelihood procedure requires an explicit assumption regarding the conditional density in equation (2.7). By far the most commonly employed distribution in the literature is the normal,

$f[z_t(\theta)] = (2\pi)^{-1/2}\exp[-0.5z_t(\theta)^2].$  (2.12)

⁵In the multivariate context, $l_t(y_t; \psi) = \ln\{f[\varepsilon_t(\theta)\Omega_t(\theta)^{-1/2}; \eta]\} - 0.5\ln(|\Omega_t(\theta)|)$, where $|\cdot|$ denotes the determinant.

⁶In most empirical applications the likelihood function is conditioned on a number of initial observations and nuisance parameters in order to start up the recursions for the conditional mean and variance functions. Subject to proper stationarity conditions, this practice does not alter the asymptotic distribution of the resulting MLE.
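To make the Gaussian likelihood in equations (2.7), (2.8) and (2.12) concrete, here is a minimal sketch of numerical (quasi-)maximum likelihood estimation of a constant-mean GARCH(1,1) model. The start-up convention (initializing the variance recursion at the sample variance), the optimizer, and all parameter values are illustrative assumptions, not prescriptions from the chapter:

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, y):
    """Negative Gaussian log likelihood of a constant-mean GARCH(1,1)."""
    mu, omega, alpha, beta = params
    eps = y - mu
    sig2 = np.empty(len(y))
    sig2[0] = y.var()                       # illustrative start-up value
    for t in range(1, len(y)):
        sig2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sig2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sig2) + eps**2 / sig2)

# illustrative use on simulated data
rng = np.random.default_rng(6)
T, true = 3000, (0.0, 0.1, 0.1, 0.8)        # (mu, omega, alpha, beta)
y = np.zeros(T); e = 0.0; s2 = 0.5
for t in range(T):
    s2 = true[1] + true[2] * e**2 + true[3] * s2
    e = np.sqrt(s2) * rng.standard_normal()
    y[t] = true[0] + e

res = minimize(neg_loglik, x0=[0.0, 0.05, 0.05, 0.9], args=(y,),
               bounds=[(None, None), (1e-6, None), (0.0, 1.0), (0.0, 1.0)],
               method="L-BFGS-B")
print(res.x)                                 # estimate of (mu, omega, alpha, beta)
```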
Since the normal distribution is uniquely determined by its first two moments, only the conditional mean and variance parameters enter the likelihood function in equation (2.8); i.e. $\psi = \theta$. If the conditional mean and variance functions are both differentiable for all $\theta \in \Theta$, it follows that the score vector in equation (2.10) takes the simple form,

$s_t(y_t; \theta) = \nabla_\theta\mu_t(\theta)\varepsilon_t(\theta)\sigma_t^2(\theta)^{-1} + 0.5\nabla_\theta\sigma_t^2(\theta)\sigma_t^2(\theta)^{-1}\left[\varepsilon_t(\theta)^2\sigma_t^2(\theta)^{-1} - 1\right].$  (2.13)
From the discussion in Section 2.1, the ARCH model with conditionally normal errors results in a leptokurtic unconditional distribution. However, the degree of leptokurtosis induced by the time-varying conditional variance often does not capture all of the leptokurtosis present in high frequency speculative prices. To circumvent this problem Bollerslev (1987) suggested using a standardized t-distribution with $\eta > 2$ degrees of freedom,

$f[z_t(\theta); \eta] = [\pi(\eta - 2)]^{-1/2}\,\Gamma[0.5(\eta + 1)]\,\Gamma(0.5\eta)^{-1}\left[1 + z_t(\theta)^2(\eta - 2)^{-1}\right]^{-(\eta + 1)/2},$  (2.14)
where $\Gamma(\cdot)$ denotes the gamma function. The t-distribution is symmetric around zero, and converges to the normal distribution for $\eta \to \infty$. However, for $4 < \eta < \infty$ the conditional kurtosis equals $3(\eta - 2)/(\eta - 4)$, which exceeds the normal value of three. Several other conditional distributions have been employed in the literature to fully capture the degree of tail fatness in speculative prices. The density function for the generalized error distribution (GED) used in Nelson (1991) is given by
= P/-‘2-t
1+1i~)T(~-‘)-‘exp[-0.51z,(8)~-‘)”],
(2.15)
where 2 = [2(-2/“)~(~-‘)~(3~-l)-l]l/2
(2.16)
For the tail-thickness parameter $\eta = 2$ the density equals the standard normal density in equation (2.12). For $\eta < 2$ the distribution has thicker tails than the normal, while $\eta > 2$ results in a distribution with thinner tails than the normal. Both of these candidates for the conditional density impose the restriction of symmetry. From an economic point of view the hypothesis of symmetry is of interest, since risk averse agents will induce correlation between shocks to the mean and shocks to the variance, as developed more fully by Campbell and Hentschel (1992). Engle and Gonzalez-Rivera (1991) propose to estimate the conditional density nonparametrically. The procedure they develop first estimates the parameters of the model using the Gaussian likelihood. The density of the residuals standardized by their estimated conditional standard deviations is then estimated using a linear spline with smoothness priors. The estimated density is then taken to be the true density and the new likelihood function is maximized. The use of the linear spline
simplifies the estimation in that the derivatives with respect to the conditional density are easy to compute and store, which would not be the case for kernels or many other methods. In a Monte Carlo study, this approach improved the efficiency beyond the quasi-MLE, particularly when the density was highly non-normal and skewed.
2.2.2. Testing
The primary appeal of the maximum likelihood technique stems from the well-known optimality conditions of the resulting estimators under ideal conditions. Crowder (1976) gives one set of sufficient regularity conditions for the MLE in models with dependent observations to be consistent and asymptotically normally distributed. Verification of these regularity conditions has proven extremely difficult for the general ARCH class of models, and a formal proof is only available for a few special cases, including the GARCH(1,1) model in Lumsdaine (1992a) and Lee and Hansen (1993).⁷ The common practice in empirical studies has been to proceed under the assumption that the necessary regularity conditions are satisfied. In particular, if the conditional density is correctly specified and the true parameter vector $\psi_0 \in \mathrm{int}(\Psi)$, then a central limit theorem argument yields that

$T^{1/2}(\hat{\psi}_T - \psi_0) \to N(0, A_0^{-1}),$  (2.17)
where $\to$ denotes convergence in distribution. Again, the technical difficulties in verifying (2.17) are formidable. The asymptotic covariance matrix for the MLE is equal to the inverse of the information matrix evaluated at the true parameter vector, $\psi_0$,

$A_0 = -T^{-1}\sum_{t=1}^{T}E[\nabla_\psi s_t(y_t; \psi_0)].$  (2.18)
The inverse of this matrix is less than the asymptotic covariance matrix for any other estimator, the difference being positive semidefinite. In practice, a consistent estimate for $A_0$ is available by evaluating the corresponding sample analogue at $\hat{\psi}_T$; i.e. replace $E[\nabla_\psi s_t(y_t; \psi_0)]$ in equation (2.18) with $\nabla_\psi s_t(y_t; \hat{\psi}_T)$. Furthermore, as shown below, the terms with second derivatives typically have expected value equal to zero and therefore do not need to be calculated. Under the assumption of a correctly specified conditional density, the information matrix equality implies that $A_0 = B_0$, where $B_0$ denotes the expected value of the
⁷As discussed in Section 3 below, the condition that $E[\ln(\alpha_1z_t^2 + \beta_1)] < 0$ in Lumsdaine (1992a) ensures that the GARCH(1,1) model is strictly stationary and ergodic. Note also that, by Jensen's inequality, $E[\ln(\alpha_1z_t^2 + \beta_1)] < \ln E(\alpha_1z_t^2 + \beta_1) = \ln(\alpha_1 + \beta_1)$, so the parameter region covers the interesting case $\alpha_1 + \beta_1 = 1$.
outer product of the gradients evaluated at the true parameters,

$B_0 = T^{-1}\sum_{t=1}^{T}E[s_t(y_t; \psi_0)s_t(y_t; \psi_0)'].$  (2.19)
The outer product of the sample gradients evaluated at $\hat{\psi}_T$ therefore provides an alternative covariance matrix estimator; that is, replace the summand in equation (2.19) by the sample analogues $s_t(y_t; \hat{\psi}_T)s_t(y_t; \hat{\psi}_T)'$. Since analytical derivatives in ARCH models often involve very complicated recursive expressions, it is common in empirical applications to make use of numerical derivatives to approximate their analytical counterparts. The estimator defined from equation (2.19) has the computational advantage that only first order derivatives are needed, as numerical second order derivatives are likely to be unstable.⁸

In many applications of ARCH models the parameter vector may be partitioned as $\theta' = (\theta_1', \theta_2')$, where $\theta_1$ and $\theta_2$ operate a sequential cut on $\Theta_1 \times \Theta_2 = \Theta$, such that $\theta_1$ parametrizes the conditional mean and $\theta_2$ parametrizes the conditional variance function for $y_t$. Thus, $\nabla_{\theta_2}\mu_t(\theta) = 0$, and although $\nabla_{\theta_1}\sigma_t^2(\theta) \neq 0$ for all $\theta \in \Theta$, it is possible to show that, under fairly general symmetric distributional assumptions regarding $z_t$ and for particular functional forms of the ARCH conditional variance, the information matrix for $\theta' = (\theta_1', \theta_2')$ becomes block diagonal. Engle (1982) gives conditions and provides a formal proof for the linear ARCH(q) model in equation (1.6) under the assumption of conditional normality. As a result, asymptotically efficient estimates for $\theta_{02}$ may be calculated on the basis of a consistent estimate for $\theta_{01}$, and vice versa. In particular, for the linear regression model with covariance stationary ARCH disturbances, the regression coefficients may be consistently estimated by OLS, and asymptotically efficient estimates for the ARCH parameters in the conditional variance calculated on the basis of the OLS regression residuals. The loss in asymptotic efficiency for the OLS coefficient estimates may be arbitrarily large, however. Also, the conventional OLS standard errors are generally inappropriate, and should be modified to take account of the heteroskedasticity as in White (1980). In particular, as noted by Milhoj (1985), Diebold (1987), Bollerslev (1988) and Stambaugh (1993), when testing for serial correlation in the mean in the presence of ARCH effects, the conventional Bartlett standard error for the estimated autocorrelations, given by the inverse of the square root of the sample size, may severely underestimate the true standard error.

There are several important cases in which block-diagonality does not hold. For example, block-diagonality typically fails for functional forms, such as EGARCH, in which $\sigma_t^2$ is an asymmetric function of lagged residuals. Another important exception is the ARCH-M class of models discussed in Section 1.4.

⁸In the Berndt, Hall, Hall and Hausman (1974) (BHHH) algorithm, often used in the maximization of the likelihood function, the covariance matrix from the auxiliary OLS regression in the last iteration provides an estimate of $B_0$. In a small scale Monte Carlo experiment Bollerslev and Wooldridge (1992) found that this estimator performed reasonably well under ideal conditions.
Consistent estimation of the parameters in ARCH-M models generally requires that both the conditional mean and variance functions be correctly specified and estimated simultaneously. A formal analysis of these issues is contained in Engle et al. (1987), Pagan and Hong (1991), Pagan and Sabau (1987a, 1987b) and Pagan and Ullah (1988).

Standard hypothesis testing procedures concerning the true parameter vector are directly available from equation (2.17). To illustrate, let the null hypothesis of interest be stated as $r(\psi_0) = 0$, where $r: \Theta \times H \to R^l$ is differentiable on $\mathrm{int}(\Psi)$ and $l < m + k$. If $\psi_0 \in \mathrm{int}(\Psi)$ and $\mathrm{rank}[\nabla_\psi r(\psi_0)] = l$, the Wald statistic takes the familiar form

$\xi_T^W = T\,r(\hat{\psi}_T)'\left[\nabla_\psi r(\hat{\psi}_T)\,\hat{C}_T\,\nabla_\psi r(\hat{\psi}_T)'\right]^{-1}r(\hat{\psi}_T),$
where $\hat{C}_T$ denotes a consistent estimator of the covariance matrix for the parameter estimates under the alternative. If the null hypothesis is true and the regularity conditions are satisfied, the Wald statistic is asymptotically chi-square distributed with $l$ degrees of freedom. Similarly, let $\tilde{\psi}_T$ denote the MLE under the null hypothesis. The conventional likelihood ratio (LR) statistic,

$\xi_T^{LR} = 2\left[L_T(y_T, \ldots, y_1; \hat{\psi}_T) - L_T(y_T, \ldots, y_1; \tilde{\psi}_T)\right],$

should then be the realization of a chi-square distribution with $l$ degrees of freedom if the null hypothesis is true and $\psi_0 \in \mathrm{int}(\Psi)$. As discussed already in Section 2.1 above, when testing hypotheses about the parameters in the conditional variance of estimated ARCH models, non-negativity constraints must often be imposed, so that $\psi_0$ is on the boundary of the admissible parameter space. As a result the two-sided critical value from the standard asymptotic chi-square distribution will lead to a conservative test; recent discussions of general issues related to testing inequality constraints are given in Gourieroux et al. (1982), Kodde and Palm (1986) and Wolak (1991). Another complication that often arises when testing in ARCH models, also alluded to in Section 2.1 above, concerns the lack of identification of certain parameters under the null hypothesis. This in turn leads to a singularity of the information matrix under the null and a breakdown of standard testing procedures. For instance, as previously noted, in the GARCH(1,1) model $\beta_1$ and $\omega$ are not jointly identified under the null hypothesis that $\alpha_1 = 0$. Similarly, in the ARCH-M model $\mu_t(\theta) = \mu + \delta\sigma_t^2$ with $\mu \neq 0$, the parameter $\delta$ is only identified if the conditional variance is time-varying. Thus, a standard joint test for ARCH effects and $\delta = 0$ is not feasible. Of course, such identification problems are not unique to the ARCH class of models, and a general discussion is beyond the scope of the present chapter; for a more detailed analysis along these lines we refer the reader to Davies (1977), Watson and Engle (1985) and Andrews and Ploberger (1992, 1993).
The finite sample evidence on the performance of ARCH MLE estimators and test statistics is still fairly limited; examples include Engle et al. (1985), Bollerslev and Wooldridge (1992), Lumsdaine (1992b) and Baillie et al. (1993). For the GARCH(1,1) model with conditionally normal errors, the available Monte Carlo evidence suggests that the estimate for $\alpha_1 + \beta_1$ is downward biased and skewed to the right in small samples. This bias in $\hat{\alpha}_1 + \hat{\beta}_1$ comes from a downward bias in $\hat{\beta}_1$, while $\hat{\alpha}_1$ is upward biased. Consistent with the theoretical results in Lumsdaine (1992a), there appears to be no discontinuity in the finite sample distribution of the estimators at the IGARCH(1,1) boundary; i.e. $\alpha_1 + \beta_1 = 1$. Reliable inference from the LM, Wald and LR test statistics generally does require moderately large sample sizes of at least two hundred or more observations, however.
2.3. Quasi-maximum likelihood methods
The assumption of conditional normality for the standardized innovations is difficult to justify in many empirical applications. This has motivated the use of alternative parametric distributional assumptions, such as the densities in equations (2.14) and (2.15). Alternatively, the MLE based on the normal density in equation (2.12) may be given a quasi-maximum likelihood interpretation. If the conditional mean and variance functions are correctly specified, the normal quasi-score in equation (2.13) evaluated at the true parameters $\theta_0$ will have the martingale difference property,

$E_{t-1}\left\{\nabla_\theta\mu_t(\theta_0)\varepsilon_t(\theta_0)\sigma_t^2(\theta_0)^{-1} + 0.5\nabla_\theta\sigma_t^2(\theta_0)\sigma_t^2(\theta_0)^{-1}\left[\varepsilon_t(\theta_0)^2\sigma_t^2(\theta_0)^{-1} - 1\right]\right\} = 0.$  (2.20)
Since equation (2.20) holds for any value of the true parameters, the QMLE obtained by maximizing the conditional normal likelihood function defined by equations (2.7), (2.8) and (2.12), say $\hat{\theta}_{T,QMLE}$, is Fisher-consistent; that is, $E[S_T(y_T, \ldots, y_1; \theta)] = 0$ for any $\theta \in \Theta$. Under appropriate regularity conditions this is sufficient to establish consistency and asymptotic normality of $\hat{\theta}_{T,QMLE}$. Wooldridge (1994) provides a formal discussion. Furthermore, following Weiss (1984, 1986), the asymptotic distribution for the QMLE takes the form

$T^{1/2}(\hat{\theta}_{T,QMLE} - \theta_0) \to N(0, A_0^{-1}B_0A_0^{-1}).$  (2.21)
Under appropriate, and difficult to verify, regularity conditions, the $A_0$ and $B_0$ matrices are consistently estimated by the sample counterparts from equations (2.18) and (2.19), respectively. Provided that the first two conditional moments are correctly specified, it follows from equation (2.13) that

$E_{t-1}[\nabla_\theta s_t(y_t; \theta_0)] = -\nabla_\theta\mu_t(\theta_0)\nabla_\theta\mu_t(\theta_0)'\sigma_t^2(\theta_0)^{-1} - 0.5\nabla_\theta\sigma_t^2(\theta_0)\nabla_\theta\sigma_t^2(\theta_0)'\sigma_t^2(\theta_0)^{-2}.$  (2.22)
As pointed out by Bollerslev and Wooldridge (1992), a convenient estimate of the information matrix, $A_0$, involving only first derivatives is therefore available by replacing the right hand side of equation (2.18) with the sample realizations from equation (2.22). The finite sample distribution of the QMLE and the Wald statistics based on the robust covariance matrix estimator constructed from equations (2.18), (2.19) and (2.22) has been investigated by Bollerslev and Wooldridge (1992). For symmetric departures from conditional normality, the QMLE is generally close to the exact MLE. However, as noted by Engle and Gonzalez-Rivera (1991), for non-symmetric conditional distributions both the asymptotic and the finite sample loss in efficiency may be quite large, and semiparametric density estimation, as discussed in Section 1.5, may be preferred.
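A schematic sketch of the robust covariance computation in equation (2.21) is given below, taking per-observation scores and a log-likelihood Hessian as given. The inputs here are random placeholders rather than output from an actual ARCH estimation, so only the mechanics, not the numbers, are meaningful:

```python
import numpy as np

def sandwich_cov(scores, hessian, T):
    """Robust QMLE covariance A^{-1} B A^{-1} / T from equation (2.21), with
    A estimated by minus the average Hessian (equation (2.18)) and B by the
    average outer product of per-observation scores (equation (2.19))."""
    A = -hessian / T                 # scores: T x k array; hessian: k x k
    B = scores.T @ scores / T
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv / T

# placeholder inputs: 1000 observations, 4 parameters
rng = np.random.default_rng(7)
s = rng.standard_normal((1000, 4))
H = -s.T @ s                          # info-matrix equality holds here by construction
print(np.sqrt(np.diag(sandwich_cov(s, H, 1000))))   # robust standard errors
```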
2.4. Specification checks

2.4.1. Lagrange multiplier diagnostic tests
After a model is selected and estimated, it is generally desirable to test whether it adequately represents the data. A useful array of tests can readily be constructed from calculating Lagrange multiplier tests against particular parametric alternatives. Since almost any moment condition can be formulated as the score against some alternative, these tests may also be interpreted as conditional moment tests; see Newey (1985) and Tauchen (1985). Whenever one computes a collection of test statistics, the question of the appropriate size of the full procedure arises. It is generally impossible to control precisely the size of a procedure when there are many correlated test statistics, and conventional econometric practice does not require this. When these tests are viewed as diagnostic tests, they are simply aids in the model building process and may well be part of a sequential testing procedure anyway. In this section, we will show how to develop tests against a variety of interesting alternatives to any particular model. We focus on the simplest and most useful case. Suppose we have estimated a parametric model with the assumption that each observation is conditionally normal with mean zero and variance $\sigma_t^2 = \sigma_t^2(\theta)$. Then the score can be written as a special case of (2.13),

$s_{\theta t}(\theta) = \nabla_\theta\ln\sigma_t^2(\theta)\left[\varepsilon_t^2(\theta)\sigma_t^2(\theta)^{-1} - 1\right].$  (2.23)

In order to conserve space, equation (2.23) may be written more compactly as

$s_{\theta t} = x_{\theta t}u_t,$  (2.24)

where $x_{\theta t}$ denotes the $k \times 1$ vector of derivatives of the logarithm of the conditional
variance equation with respect to the parameters $\theta$, and $u_t = \varepsilon_t^2(\theta)\sigma_t^2(\theta)^{-1} - 1$ defines the generalized residuals. From the first order conditions in equation (2.9), the MLE for $\theta$, $\hat{\theta}_T$, solves

$\sum_{t=1}^{T}\hat{s}_{\theta t} = \sum_{t=1}^{T}\hat{x}_{\theta t}\hat{u}_t = 0.$  (2.25)
Suppose that the additional set of r parameters, represented by the $r \times 1$ vector $\gamma$, have been implicitly set to zero during estimation. We wish to test whether this restriction is supported by the data. That is, the null hypothesis may be expressed as $\gamma_0 = 0$, where $\sigma_t^2 = \sigma_t^2(\theta, \gamma)$. Also, suppose that the score with respect to $\gamma$ has the same form as in equation (2.24),

$s_{\gamma t} = x_{\gamma t}u_t.$  (2.26)
Under fairly general regularity conditions, the scores themselves, when evaluated at the true parameters under the null hypothesis, $\psi_0$, will satisfy a martingale central limit theorem. Therefore,

$T^{-1/2}S_T(\psi_0) \to N(0, V),$  (2.27)
where $V = A_0$ denotes the covariance matrix of the scores. The conventional form of the Lagrange multiplier test, as in Breusch and Pagan (1979) or Engle (1984), is then given by

$\xi_T^{LM} = T^{-1}\left[\sum_{t=1}^{T}\tilde{s}_t\right]'\tilde{V}^{-1}\left[\sum_{t=1}^{T}\tilde{s}_t\right],$  (2.28)

where $\tilde{\psi} = (\tilde{\theta}, \tilde{\gamma})$ represents estimates evaluated under the null hypothesis, and $\tilde{V}$ denotes a consistent estimate of $V$. As discussed in Section 2.2, a convenient estimate of the information matrix is given by the outer product of the scores,
iiT = T- ’ c
$,&
(2.29)
t=l,T
so that the test statistic can be computed in terms of a regression. Specifically, let the $T \times 1$ vector of ones be denoted $\iota$, and the $T \times (k + r)$ matrix of scores evaluated under the null hypothesis be denoted by $\tilde{S} = \{\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_T\}'$. Then a simple form of the LM test is obtained from

$\xi_T^{LM} = \iota'\tilde{S}(\tilde{S}'\tilde{S})^{-1}\tilde{S}'\iota = TR^2,$  (2.30)

where the $R^2$ is the uncentered fraction of variance explained by the regression of a vector of ones on all the scores. The test statistic in equation (2.30) is often referred
to as the outer product of the gradient, or OPG, version of the test. It is very easy to compute. In particular, using the BHHH estimation algorithm, the test statistic is simply obtained by one step of the BHHH algorithm from the maximum achieved under the null hypothesis. Studies of this version of the LM test, such as MacKinnon and White (1985) and Bollerslev and Wooldridge (1992), often find that it has size distortions and is not very powerful, as it does not utilize the structure of the problem under the null hypothesis to obtain the best estimate of the information matrix. Of course, the $R^2$ in (2.30) will be overstated if the likelihood function has not been fully maximized under the null, so that (2.25) is not satisfied. One might recommend a first step correction by BHHH to be certain that this is achieved.

An alternative estimate of $V$ corresponding to equation (2.19) is available from taking expectations of $S'S$. In the simplified notation of this section,

$E(S'S) = \sum_{t=1}^{T}E(u_t^2x_tx_t') = E(u_t^2)\sum_{t=1}^{T}E(x_tx_t'),$  (2.31)
where it is assumed that the conditional expectation $E_{t-1}(u_t^2)$ is time invariant. Of course, this will be true if the standardized innovation $\varepsilon_t(\theta)\sigma_t^2(\theta)^{-1/2}$ has a distribution which does not depend upon time or past information, as typically assumed in estimation. Consequently, an alternative consistent estimator of $V$ is given by

$\tilde{V}_T = (T^{-1}\tilde{u}'\tilde{u})(T^{-1}\tilde{X}'\tilde{X}),$  (2.32)

where $u' = \{u_1, \ldots, u_T\}$, $X' = \{x_1, \ldots, x_T\}$, and $x_t' = \{x_{\theta t}', x_{\gamma t}'\}$. Since $\iota'\tilde{S} = \tilde{u}'\tilde{X}$, the Lagrange multiplier test based on the estimator in equation (2.32) may also be computed from an auxiliary regression,

$\xi_T^{LM} = \tilde{u}'\tilde{X}(\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{u}\,(T^{-1}\tilde{u}'\tilde{u})^{-1} = TR^2.$  (2.33)
Here the regression is of the percentage difference between the squared residuals and the estimated conditional variance on the gradient of the logarithm of the conditional variance with respect to all the parameters, including those set to zero under the null hypothesis. This test statistic is similar to one step of a Gauss-Newton iteration from an estimate under the null. It is called the Hessian estimate by Bollerslev and Wooldridge (1992) because it can also be derived by setting components of the Hessian equal to their expected value, assuming only that the first two moments are correctly specified, as discussed in Section 2.3. This version of the test has considerable intuitive appeal, as it checks for remaining conditional heteroskedasticity in $u_t$ as a function of $x_t$. It also performed better than the OPG test in the simulations reported by Bollerslev and Wooldridge (1992). This is also the version of the test used by Engle and Ng (1993) to compare various model specifications.
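A minimal sketch of the auxiliary regression form in equation (2.33), with the generalized residuals and gradients supplied as inputs; the simulated inputs merely illustrate the mechanics and stand in for quantities produced by an actual estimation:

```python
import numpy as np

def lm_test_hessian_form(u, X):
    """LM diagnostic of equation (2.33): regress the generalized residuals
    u_t = eps_t^2/sig2_t - 1 on the gradients x_t of log sig2_t with respect
    to all parameters (including those set to zero under the null); the
    statistic is T times the uncentered R^2."""
    b, *_ = np.linalg.lstsq(X, u, rcond=None)
    fitted = X @ b
    r2 = (fitted @ fitted) / (u @ u)     # uncentered R^2
    return len(u) * r2

# illustrative inputs: 500 observations, 3 candidate score directions
rng = np.random.default_rng(8)
u = rng.standard_normal(500)
X = rng.standard_normal((500, 3))
print(lm_test_hessian_form(u, X))        # roughly chi-square(r) under the null
```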
As noted by Engle and Ng (1993), the likelihood must be fully maximized under the null if the test is to have the correct size. An approach to dealing with this issue would be to first regress $\tilde{u}_t$ on $\tilde{x}_{\theta t}$ and then form the test on the basis of the residuals from this regression. The $R^2$ of this regression should be zero if the likelihood is maximized, so this is merely a numerical procedure to purge the test statistic of contributions from loose convergence criteria. Both of these procedures develop the asymptotic distribution under the null hypothesis that the model is correctly specified, including the normality assumption. Recently, Wooldridge (1990) and Bollerslev and Wooldridge (1992) have developed robust LM tests which have the same limiting distribution under any null specifying that the first two conditional moments are correct. This follows in the line of conditional moment tests for GMM or QMLE as in Newey (1985), Tauchen (1985) and White (1987, 1994). To derive these tests, consider the Taylor series expansions of the scores $S_\gamma$ and $S_\theta$ around the true parameter values,
$T^{-1/2}S_\gamma(\tilde{\theta}_T) = T^{-1/2}S_\gamma(\theta_0) + T^{-1}\frac{\partial S_\gamma(\theta_0)}{\partial\theta'}\,T^{1/2}(\tilde{\theta}_T - \theta_0),$  (2.34)

$T^{-1/2}S_\theta(\tilde{\theta}_T) = T^{-1/2}S_\theta(\theta_0) + T^{-1}\frac{\partial S_\theta(\theta_0)}{\partial\theta'}\,T^{1/2}(\tilde{\theta}_T - \theta_0),$  (2.35)
where the derivatives of the scores are evaluated at $\theta_0$. The derivatives in equations (2.34) and (2.35) are simply the $H_{\gamma\theta}$ and $H_{\theta\theta}$ elements of the Hessian, respectively. The distribution of the score with respect to $\gamma$ evaluated at $\tilde{\theta}_T$ is readily obtained from the left hand side of equation (2.34). In particular, substituting in (2.35), and using (2.26) to give the limiting distribution of the scores,
$T^{-1/2}S_\gamma(\tilde{\theta}_T) \to N(0, W),$  (2.36)
where

$W = V_{\gamma\gamma} - H_{\gamma\theta}H_{\theta\theta}^{-1}V_{\theta\gamma} - V_{\gamma\theta}H_{\theta\theta}^{-1}H_{\theta\gamma} + H_{\gamma\theta}H_{\theta\theta}^{-1}V_{\theta\theta}H_{\theta\theta}^{-1}H_{\theta\gamma}.$  (2.37)
Notice first that if the scores are the derivatives of the true likelihood, then the information matrix equality will hold, and therefore $H = V$ asymptotically. In this case we get the conventional LM test described in (2.28), computed generally either as (2.30) or (2.33). If the normality assumption underlying the likelihood is false, so that the estimates are viewed as quasi-maximum likelihood estimators, then the expressions in equations (2.36) and (2.37) are needed. As pointed out by Wooldridge (1990), any score which has the additional property that $H_{\gamma\theta}$ converges in probability to zero can be tested simply as a limiting normal with covariance matrix $V_{\gamma\gamma}$, or as a $TR^2$ type test from a regression of a vector of ones
on $\tilde{s}_{\gamma t}$. By proper redefinition of the score, such a test can always be constructed. To illustrate, suppose that $s_{\gamma t} = x_{\gamma t}u_t$, $s_{\theta t} = x_{\theta t}u_t$, and $\partial u_t/\partial\theta = -x_{\theta t}$. Also define

$s_{\gamma t}^{*} = (x_{\gamma t} - G'x_{\theta t})u_t,$  (2.38)

where

$G = \left[\sum_{t=1}^{T}x_{\theta t}x_{\theta t}'\right]^{-1}\sum_{t=1}^{T}x_{\theta t}x_{\gamma t}'.$  (2.39)
The statistic based on $s_{\gamma t}^{*}$ in equation (2.38) then tests only the part of $x_{\gamma t}$ which is orthogonal to the scores used to estimate the model under the null hypothesis. This strategy generalizes to more complicated settings, as discussed by Bollerslev and Wooldridge (1992).
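The orthogonalization in equations (2.38) and (2.39) amounts to partialling the estimation scores out of $x_{\gamma t}$. A schematic sketch is given below; the simulated inputs are placeholders, and the Wald-type quadratic form used to aggregate the purged scores is one convenient choice rather than the chapter's prescription:

```python
import numpy as np

def robust_lm(u, x_gamma, x_theta):
    """Robust LM test in the spirit of (2.38)-(2.39): purge x_gamma of the
    scores used in estimation, then test whether the purged scores
    s*_t = (x_gamma_t - G'x_theta_t) u_t have mean zero."""
    G, *_ = np.linalg.lstsq(x_theta, x_gamma, rcond=None)  # regression coefficient, eq. (2.39)
    s_star = (x_gamma - x_theta @ G) * u[:, None]
    m = s_star.mean(axis=0)
    V = np.cov(s_star, rowvar=False) / len(u)              # covariance of the mean
    return m @ np.linalg.solve(V, m)                       # approx chi-square(r) under the null

rng = np.random.default_rng(9)
u = rng.standard_normal(400)
print(robust_lm(u, rng.standard_normal((400, 2)), rng.standard_normal((400, 3))))
```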
BDS specijication
tests
As discussed in Section 2.1.2, the asymptotic distribution of the BDS test is unaffected by passing the data through a linear, e.g. ARMA, filter. Since an ARCH model typically assumes that the standardized residuals $z_t = \sigma_t^{-1}\varepsilon_t$ are i.i.d., it seems reasonable to use the BDS test as a specification test by applying it to the fitted standardized residuals from an ARCH model. Fortunately, the BDS test applied to the standardized residuals has considerable power to detect misspecification in ARCH models. Unfortunately, the asymptotic distribution of the test is strongly affected by the fitting of the ARCH model. As documented by Brock et al. (1991) and Hsieh (1991), BDS tests on the standardized residuals from fitted ARCH models reject much too infrequently. In light of the filtering properties of misspecified ARCH models, discussed in Section 4 below, this may not be too surprising. The asymptotic distribution of the BDS test for ARCH residuals has not yet been derived. One commonly employed procedure to get around this problem is to simply simulate the critical values of the test statistic; i.e. in each replication generate data by Monte Carlo methods from the specific ARCH model, then estimate the ARCH model and compute the BDS test for the standardized residuals. This approach is obviously very demanding computationally. Brock and Potter (1992) suggest another possibility for the case in which the conditional mean of the observed data is known. Applying the BDS test to the logarithm of the squared known residuals, i.e. $\ln(\varepsilon_t^2) = \ln(z_t^2) + \ln(\sigma_t^2)$, separates $\ln(\varepsilon_t^2)$ into an i.i.d. component, $\ln(z_t^2)$, and a component which can be estimated by non-linear regression methods. Under the null of a correctly specified ARCH model, $\ln(z_t^2) = \ln(\varepsilon_t^2) - \ln(\sigma_t^2)$ is i.i.d. and, subject to the regularity conditions of Brock and Potter (1992) or Brock et al. (1991), the asymptotic distribution of the BDS test is the same whether applied to $\ln(z_t^2)$ or to the fitted values $\ln(\hat{z}_t^2) = \ln(\varepsilon_t^2) - \ln(\hat{\sigma}_t^2)$. While the assumption of a known conditional mean is obviously unrealistic in some applications,
it may be a reasonable approximation for high frequency financial time series, where the noise component tends to swamp the conditional mean component.
3. Stationary and ergodic properties

3.1. Strict stationarity
In evaluating the stationarity of ARCH models, it is convenient to recursively substitute for the lagged $\varepsilon_t$'s and $\sigma_t^2$'s. For completeness, consider the multivariate case where

$\varepsilon_t = \Omega_t^{1/2}z_t,$ $\quad z_t \sim \text{i.i.d.},$ $\quad E(z_t) = 0_{n\times 1},$ $\quad E(z_tz_t') = I_{n\times n},$  (3.1)

and

$\Omega_t = \Omega(t, z_{t-1}, z_{t-2}, \ldots).$  (3.2)

Using the ergodicity criterion from Corollary 1.4.2 in Krengel (1985), it follows that strict stationarity of $\{\varepsilon_t\}_{t=-\infty}^{\infty}$ is equivalent to the conditions

$\Omega_t = \Omega(z_{t-1}, z_{t-2}, \ldots),$ with $\Omega(\cdot, \cdot, \ldots)$ measurable,  (3.3)

and

$\mathrm{Trace}(\Omega_t\Omega_t') < \infty$ a.s.  (3.4)
Equation (3.3) eliminates direct dependence of $\{\Omega_t\}$ on t, while (3.4) ensures that random shocks to $\{\Omega_t\}$ die out rapidly enough to keep $\{\Omega_t\}$ from exploding asymptotically. In the univariate EGARCH(p, q) model, for example, equation (3.2) is obtained by exponentiating both sides of the definition in equation (1.11). Since $\ln(\sigma_t^2)$ is written in ARMA(p, q) form, it is easy to see that if $(1 + \sum_{j=1}^{q}\alpha_jx^j)$ and $(1 - \sum_{i=1}^{p}\beta_ix^i)$ have no common roots, equations (3.3) and (3.4) are equivalent to all the roots of $(1 - \sum_{i=1}^{p}\beta_ix^i)$ lying outside the unit circle. Similarly, in the bivariate EGARCH model defined in Section 6.4 below, $\ln(\sigma_{1,t}^2)$, $\ln(\sigma_{2,t}^2)$ and $\rho_{12,t}$ all follow ARMA processes, giving rise to ARMA stationarity conditions. One sufficient condition for (3.4) is moment boundedness; i.e. clearly $E[\mathrm{Trace}(\Omega_t\Omega_t')^p]$ finite for some $p > 0$ implies $\mathrm{Trace}(\Omega_t\Omega_t') < \infty$ a.s. For example, Bollerslev (1986) shows that in the univariate GARCH(p, q) model defined by equation (1.9), $E(\sigma_t^2)$ is finite and $\{\varepsilon_t\}$ is covariance stationary when $\sum_{i=1}^{p}\beta_i + \sum_{j=1}^{q}\alpha_j < 1$. This is a sufficient, but not a necessary, condition for strict stationarity, however. Because ARCH processes are thick tailed, the conditions for "weak" or
covariance stationarity are often more stringent than the conditions for "strict" stationarity. For instance, in the univariate GARCH(1,1) model, (3.2) takes the form

$\sigma_t^2 = \omega\left[1 + \sum_{k=1}^{\infty}\prod_{i=1}^{k}(\beta_1 + \alpha_1z_{t-i}^2)\right].$  (3.5)
Nelson (1990b) shows that when $\omega > 0$, $\sigma_t^2 < \infty$ a.s., and $\{\varepsilon_t, \sigma_t^2\}$ is strictly stationary if and only if $E[\ln(\beta_1 + \alpha_1z_t^2)] < 0$. An easy application of Jensen's inequality shows that this is a much weaker requirement than $\alpha_1 + \beta_1 < 1$, the necessary and sufficient condition for $\{\varepsilon_t\}$ to be covariance stationary. For example, the simple ARCH(1) model with $z_t \sim N(0,1)$, $\alpha_1 = 3$ and $\beta_1 = 0$ is strictly but not weakly stationary. To grasp the intuition behind this seemingly paradoxical result, consider the terms in the summation in (3.5); i.e. $\prod_{i=1}^{k}(\beta_1 + \alpha_1z_{t-i}^2)$. Taking logarithms, it follows directly that $\sum_{i=1}^{k}\ln(\beta_1 + \alpha_1z_{t-i}^2)$ is a random walk with drift. If $E[\ln(\beta_1 + \alpha_1z_t^2)] > 0$, the drift is positive and the random walk diverges to $\infty$ a.s. as $k \to \infty$. If, on the other hand, $E[\ln(\beta_1 + \alpha_1z_t^2)] < 0$, the drift is negative and the random walk diverges to $-\infty$ a.s. as $k \to \infty$, in which case $\prod_{i=1}^{k}(\beta_1 + \alpha_1z_{t-i}^2)$ tends to zero at an exponential rate in k a.s. as $k \to \infty$. This, in turn, implies that the sum in equation (3.5) converges a.s., establishing (3.4). Measurability in (3.3) follows easily using Theorems 3.19 and 3.20 in Royden (1968). This result for the univariate GARCH(1,1) model generalizes fairly easily to other closely related ARCH models. For example, in the multivariate diagonal GARCH(1,1) model, discussed in Section 6.1 below, the diagonal elements of $\Omega_t$ follow univariate GARCH(1,1) processes. If each of these processes is stationary, the Cauchy-Schwarz inequality ensures that all of the elements in $\Omega_t$ are bounded a.s. The case of the constant conditional correlation multivariate GARCH(1,1) model in Section 6.3 is similar. The same method can also be used in a number of other univariate cases as well. For instance, when $p = q = 1$, the stationarity condition for the model in equation (1.16) is $E[\ln(\alpha_1^{+}I(z_t > 0)|z_t|^{\gamma} + \alpha_1^{-}I(z_t \leq 0)|z_t|^{\gamma} + \beta_1)] < 0$.
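The drift criterion $E[\ln(\beta_1 + \alpha_1z_t^2)] < 0$ is easy to check by Monte Carlo. A minimal sketch for the conditionally normal case, including the $\alpha_1 = 3$, $\beta_1 = 0$ example just discussed:

```python
import numpy as np

# Monte Carlo check of Nelson's (1990b) strict stationarity condition
# E[ln(beta1 + alpha1 * z^2)] < 0 for GARCH(1,1) with z ~ N(0,1)
rng = np.random.default_rng(10)
z2 = rng.standard_normal(1_000_000) ** 2

for alpha1, beta1 in [(0.1, 0.8), (3.0, 0.0), (1.0, 0.0)]:
    drift = np.mean(np.log(beta1 + alpha1 * z2))
    verdict = "strictly stationary" if drift < 0 else "not strictly stationary"
    print(f"alpha1={alpha1}, beta1={beta1}: drift ~ {drift:.3f} -> {verdict}")
# the (3.0, 0.0) case has alpha1 + beta1 > 1 (not covariance stationary)
# yet a negative drift, so it is strictly stationary, as stated in the text
```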
3.2. Persistence
The notion of “persistence” of a shock to volatility within the ARCH class of models is considerably more complicated than the corresponding concept of persistence in
the mean for linear models. This is because even strictly stationary ARCH models frequently do not possess finite moments. Suppose that $\{\sigma_t^2\}$ is strictly stationary and ergodic. Let $F(\sigma_t^2)$ denote the unconditional cumulative distribution function (cdf) for $\sigma_t^2$, and let $F_s(\sigma_t^2)$ denote the cdf of $\sigma_t^2$ given information at time $s < t$. Then for any s, $F_s(\sigma_t^2) - F(\sigma_t^2)$ converges to 0 at all continuity points as $t \to \infty$; i.e. time s information drops out of the forecast distribution as $t \to \infty$. Therefore, one perfectly reasonable definition of "persistence" would be to say that shocks fail to persist when $\{\sigma_t^2\}$ is stationary and ergodic. It is equally natural, however, to define persistence of shocks in terms of forecast moments; i.e. to choose some $q > 0$ and to say that shocks to $\sigma_t^2$ fail to persist if and only if for every s, $E_s(\sigma_t^{2q})$ converges, as $t \to \infty$, to a finite limit independent of time s information. Such a definition of persistence may be particularly appropriate when an economic theory makes a forecast moment, as opposed to a forecast distribution, the object of interest. Unfortunately, whether or not shocks to $\{\sigma_t^2\}$ "persist" depends very much on which definition is adopted. The conditional moment $E_s(\sigma_t^{2q})$ may diverge to infinity for some q, but converge to a well-behaved limit independent of initial conditions for other q, even when the $\{\sigma_t^2\}$ process is stationary and ergodic. Consider, for example, the GARCH(1,1) model, in which

$\sigma_t^2 = \omega + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2 = \omega + (\alpha_1z_{t-1}^2 + \beta_1)\sigma_{t-1}^2.$  (3.6)
The expectation of $\sigma_t^2$ as of time s is given by

$E_s(\sigma_t^2) = \omega\sum_{i=0}^{t-s-2}(\alpha_1 + \beta_1)^i + (\alpha_1 + \beta_1)^{t-s-1}\sigma_{s+1}^2.$  (3.7)
It is easy to see that $E_s(\sigma_t^2)$ converges to the unconditional variance $\omega/(1 - \alpha_1 - \beta_1)$ as $t \to \infty$ if and only if $\alpha_1 + \beta_1 < 1$. In the IGARCH model with $\omega > 0$ and $\alpha_1 + \beta_1 = 1$, it follows that $E_s(\sigma_t^2) \to \infty$ a.s. as $t \to \infty$. Nevertheless, as discussed in the previous section, IGARCH models are strictly stationary and ergodic. In fact, as shown by Nelson (1990b), in the IGARCH(1,1) model $E_s(\sigma_t^{2q})$ converges to a finite limit independent of time s information as $t \to \infty$ whenever $q < 1$. This ambiguity of "persistence" holds more generally. When the support of $z_t$ is unbounded, it follows from Nelson (1990b) that in any stationary and ergodic GARCH(1,1) model, $E_s(\sigma_t^{2q})$ diverges for all sufficiently large q, and converges for all sufficiently small q. For many other ARCH models, moment convergence may be most easily established with the methods used in Tweedie (1983b). While the relevant criterion for persistence may be dictated by economic theory, in practice tractability may also play an important role. For example, $E_s(\sigma_t^2)$, and its multivariate extension discussed in Section 6.5 below, can often be evaluated even when strict stationarity is difficult to establish, or when $E_s(\sigma_t^{2q})$ for $q \neq 1$ is intractable.
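The forecast recursion implicit in equation (3.7) is easily iterated numerically. A minimal sketch contrasting a covariance stationary case with the IGARCH boundary (all parameter values illustrative):

```python
import numpy as np

def variance_forecast(sig2_next, omega, alpha, beta, horizons):
    """Iterate E_s(sig2_{s+k}) = omega + (alpha+beta) * E_s(sig2_{s+k-1}),
    equivalent to equation (3.7); converges to omega/(1-alpha-beta) iff
    alpha + beta < 1, and drifts upward without bound in the IGARCH case."""
    out, f = [], sig2_next
    for _ in range(max(horizons)):
        out.append(f)
        f = omega + (alpha + beta) * f
    return [out[h - 1] for h in horizons]

print(variance_forecast(2.0, 0.1, 0.05, 0.90, [1, 10, 100]))  # stationary: mean-reverts
print(variance_forecast(2.0, 0.1, 0.10, 0.90, [1, 10, 100]))  # IGARCH: drifts upward
```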
Even so, in many applications, simple moment convergence criteria have not been successfully developed. This includes quite simple cases, such as the univariate GARCH(p, q) model when $p > 1$ or $q > 1$. The same is true for multivariate models, in which co-persistence is an issue. In such cases, the choice of $q = 1$ may be impossible to avoid. Nevertheless, it is important to recognize that apparent persistence of shocks may be driven by thick-tailed distributions rather than by inherent non-stationarity.
4.
Continuous time methods
ARCH models are systems of non-linear stochastic difference equations. This makes their probabilistic and statistical properties, such as stationarity, moment finiteness, consistency and asymptotic normality of MLE, more difficult than is the case with linear models. One way to simplify the analysis of ARCH models is to approximate the stochastic difference equations with more tractable stochastic differential equations. On the other hand, for certain purposes, notably in the computation of point forecasts and maximum likelihood estimates, ARCH models are more convenient than the stochastic differential equation models of time-varying volatility common in the finance literature; see, e.g. Wiggins (1987), Hull and White (1987) Gennotte and Marsh (1991), Heston (1991) and Andersen (1992a). Suppose that the process {X,} is governed by the stochastic integral equation
where {Wl> is an N x 1 standard Brownian motion, and ,u(.) and 0’12(.) are continuous functions from RN into RN and the space of N x N real matrices respectively. The starting value, X,, may be fixed or random. Following Karatzas and Shreve (1988) and Ethier and Kurtz (1986), if equation (4.1) has a unique weak-sense solution, the distribution of the (X,} process is then completely determined by the following four characteristics:9 (i) the cumulative distribution function, F(x,), of the starting point X,; (ii) the drift p(x); (iii) the conditional covariance matrix 0(x) = Q(x)“~[Q(x)“~]‘;‘~ (iv) the continuity, with probability one, of {X,} as a function of time. Our interest here is either in approximating (4.1) by an ARCH model or visa versa. To that end, consider a sequence of first-order Markov processes {,,X,}, whose ‘Formally, we in DR”[O, co), the topology. D&O, (1986)J roQ(x)l/z ISa
consider {X,} and the approximating discrete time processes {,,X,} as random variables space of right continuous functions with finite left limits, equipped with the Skorohod cc) is a complete, separable metric space [see, e.g. Chapter 3 in Ethier and Kurtz
matrix square root of L?(x), though it need not be the symmetric we require only that ~(~)‘~‘[L?(x)‘~‘]’ = a(x), not f2(~)‘~*f2(x)‘~’ = Q(x).
square
root since
Ch. 49: ARCH
2993
Models
sample paths are random step functions with jumps at times h, 2h, 3h,. . . . For each h > 0, and each non-negative integer k, define the drift and covariance functions by ,X,)/,X, = x], and Q,(x) 3 h-’ Cov[(J,Xk+, - ,,Xk)l,,XL=x], /&)---lECbX,‘+i respectively. Also, let F&,x,) denote the cumulative distribution function for ,,XO. Since (i)-(iv) completely characterize the distribution of the {X,} process, it seems intuitive that weak convergence of {,,Xt} to {X,} can be achieved by “matching” these properties in the limit as hJ0. Stroock and Varadhan (1979) showed that this is indeed the case. Theorem 4.1. [Stroock and Varadhan (1979)] Let the stochastic integral equation (4.1) have a unique weak-sense solution. Then {,,Xr} converges weakly to {X,} for hJ0 if (i’) F,,(.) -+ F(.) as hJ0 at all continuity points of F(.), (ii’) p,,(x) -p(x) uniformly on every bounded x set as hJ0, (iii’) Q,(x) + n(x) uniformly on every bounded x set as h10, (iv’) for some 6 > 0, h-‘E[ Il,,Xk+l - hXk I/’ +’ lhXk = x] + 0 uniformly on every bounded x set as h10.” This result, along with various extensions, is fundamental in all of the continuous record asymptotics discussed below. Deriving the theory of continuous time approximation for ARCH models in its full generality is well beyond the scope of this chapter. Instead, we shall simply illustrate the use of these methods by explicit reference to a diffusion model frequently applied in the options pricing literature; see e.g. Wiggins (1987). The model considers an asset price, Y,, and its instantaneous returns volatility, ot. The continuous time process for the joint evolution of (Y,, a,} with fixed starting values, (Y,, a,), is given by dY,=pY,dt+
Y,a,dW,,,
(4.2)
and d [ln($)]
= - B[ln(a,2) - al dt + ICIcl Wz,t,
(4.3)
where ,u, $,fi and c( denote the parameters of the process, and WI,, and W,., are driftless Brownian motions independent of (Y,, ci) that satisfy
Ld”:-’ 1 2.1
CdW,,, d W,,,] = ; [
;
1
dt.
(4.4)
I1 We define the matrix norm, 11’ I),by 11 A ((= [Trace(AA’)]“‘. It is easy to see why (i’)-(iii’) match (i)-(iii) in the limit as h JO. That (iv’) leads to (iv) follows from HGlder’s inequality; see Theorem 2.2 in Nelson (1990a) for a formal proof.
77 Bollersleu et al.
2994
Of course in practice, the price process is only observable at discrete time intervals. However, the continuous time model in equations (4.2)-(4.4) provides a very convenient framework for analyzing issues related to theoretical asset pricing, in general, and option pricing, in particular. Also, by Ito’s lemma equation (4.2) may be equivalently written as dyt=
p-2 (
dt+a,dW,,,, >
where y, = ln( Y,). For many purposes 4.1.
ARCH
this is a more tractable
models as approximations
differential
equation.
to diffusions
Suppose that an economic model specifies a diffusion model such as equation (4. l), where some of the state variables, including Q(x,), are unobservable. Is it possible to formulate an ARCH data generation process that is similar to the true process, in the sense that the distribution of the sample paths generated by the ARCH model and the diffusion model in equation (4.1) becomes “close” for increasingly finer discretizations? Specifically, consider tlie diffusion model given by equations (4.2)-(4.4). Strategies for approximating diffusions such as this are well known. For example, Melino and Turnbull (1990) use a standard Euler approximation in defining (y,, gt),12
p.9
ln(af+h) = ln(af)- hP[ln(cf)- a] + h1’2$z2,t+h,
(4.6)
for t = h, 2h, 3h,. . . . Here (yO, a,) is taken to be fixed, and (Zl,t, Z,,,) is assumed i.i.d. bivariate normal with mean vector (0,O) and
to be
(4.7)
Convergence of this set of stochastic difference equations to the diffusion in equations (4.2)-(4.4) as h 10 may be verified using Theorem 4.1. In particular, (i’) holds trivially, since (y,,, CJ,,)are constants. To check conditions (ii’) and (iii’), note that h-‘E,
(P -m)1
- /3[ln(a:) “See
Pardoux
and Talay (1985) for a general
- LY]
discussion
of the Euler approximation
(4.8) technique.
Ch. 49: ARCH Models
2995
and h-‘Var,
(4.9)
which matches the drift and diffusion matrix of (4.2))(4.4). Condition (iv’) is nearly with arbitrary finite trivially satisfied, since Zr,, and Z2,, are normally distributed moments. The final step of verifying that the limit diffusion has a unique weak-sense solution is often the most difficult and least intuitive part of the proof for convergence. Nelson (1990a) summarizes several sets of sufficient conditions, however, and formally shows that the process defined by (4.5)-(4.7) satisfies these conditions. While conditionally heteroskedastic, the model defined by the stochastic difference equations (4.5)-(4.7) is not an ARCH model. In particular, for p # 1 G: is not simply a function of the discretely observed sample path of {yt} combined with a startup value cri. More technically, while the conditional variance (y,,,, - y,)given the a-algebra generated by {y,,of},,< r $ f e q uals ho:, it is not, in general, the conditional variance of ( yt + h - y,) given the smaller a-algebra generated by { yr}O,h,Zh...htt,hland ci. Unfortunately, this latter conditional variance is not available in closed form.13 To create an ARCH approximation to the diffusion in (4.2)-(4.4) simply replace (4.6) by
ln(a:+,,)= lnbf) - MWf) where g(.) is measurable
- aI+ h”2g(Z,,t+J,
(4.10)
with E[Ig(Zl,t+,,)12+6] < 00 for some 6 > 0, and
(4.11)
As an ARCH model, the discretization defined by (4.5), (4.10) and (4.11) inherits the convenient properties usually associated with ARCH models, such as the easily computed likelihoods and inference procedures discussed in Section 2 above. As such, it is a far more tractable approximation to (4.2))(4.4) than the discretization defined by equations (4.5)-(4.7). To complete the formulation of the ARCH approximation, an explicit g(.) function is needed. Since E((Z,,,I)=(2/~)“2,E(Z1,t~Zl,t~)=0 and Var(lZl,ll)= 1 - (2/7t), one possible formulation would be
(4.12) 13Jacquier et al. (1994) have recently this conditional variance.
proposed
a computationally
tractable
algorithm
for computing
T Bollersleo et al.
2996
This corresponds directly to the EGARCH model in equation (1.11). Alternatively,
dZl*J
=
PtiZ,,, + $
1-P 7
2
l/2
( >
v:,,
- 1)
(4.13)
also satisfies equation (4.11).This latter specification turns out to be the asymptotically optimal filter for h JO, as discussed in Nelson and Foster (199 1,1994) and Section 4.3 below.
4.2.
Difusions as approximations to ARCH models
Now consider the question of how to best approximate a discrete time ARCH model with a continuous time diffusion. This can yield important insights into the workings of a particular ARCH model. For example, the stationary distribution of 0,” in the AR(l) version of the EGARCH model given by equaton (1.11) is intractable. However, the sequence of EGARCH models defined by equations (4.5) and (4.10)-(4.12) converges weakly to the diffusion process in (4.2)-(4.4). When /I > 0, the stationary distribution of ln(a:) is N(cr,11/‘/2/I).Nelson (1990a) shows that this is also the limit of the stationary distribution of In(a:) in the sequence of EGARCH models (4.5) and (4.10)-(4.12) as h JO. Similarly, the continuous limit may result in convenient approximations for forecast moments of the (~,,a:) process. Different ARCH models will generally result in different limit diffusions. To illustrate, suppose that the data are generated by a simple martingale model with a GARCH(l, 1) error structure as in equation (1.9). In the present notation, the process takes the form, Ltt+h
=
Y, +
%k+t, = Yt + &,+h
(4.14)
and = wh + (1 a:+,
tlh - ah”‘)a’
f
+ h1j2ae2 t+h,
where given time t information, E,+~is N(0, a;), and (x,, Note that using the notation for the GARCH(p,q) a, + fil = 1 - Bh, so for increasing sampling frequencies, of the process approach the IGARCH(l, 1) boundary Following Nelson (1990a)
(4.15) a,,) is assumed to be fixed. model in equation (1.9) i.e., as hJ0, the parameters as discussed in Section 3.
(4.16)
2991
Ch. 49: ARCH Models
and (4.17)
Thus, from Theorem
4.1 the limit diffusion
dx,=a*dW,,,
is given by (4.18)
and (4.19) where WI,, and W,,, are independent standard Brownian motions. The diffusion defined by equations (4.18) and (4.19) is quite different from the EGARCH limit in equations (4.2)-(4.4). For example, if d/2a2 > - 1, the stationary distribution of c: in (4.19) is an inverted gamma, so as h 10 and t + co, the normalized increments h-‘12(y,+h - y,) are conditionally normally distributed but unconditionally Student’s t. In particular, in the IGARCH case corresponding to 0 = 0, as hJ0 and t + co, h-‘iZ(y,+h - y,) approaches a Student’s t distribution with two degrees of freedom. In the EGARCH case, however, h - I/‘( y, +,, - y,) is conditionally normal but is unconditionally a normal-lognormal mixture. When 0: is stationary, the GARCH formulation in (1.9) therefore gives rise to unconditionally thickertailed residuals than the EGARCH model in equation (1.11).
4.3.
ARCH models as jilters and forecasters
Suppose that discretely sampled observations are only available for a subset of the state variables in (4.1), and that interest centers on estimating the unobservable state variable(s), Q(x,). Doing this optimally via a non-linear Kalman filter is computationally burdensome; see, e.g. Kitagawa (1987).14 Alternatively, the data might be passed through a discrete time ARCH model, and the resulting conditional variances from the ARCH model viewed as estimates for 0(x,). Nelson (1992) shows that under fairly mild regularity conditions, a wide variety of misspecified ARCH models consistently extract conditional variances from high frequency time series. The regularity conditions require that the conditional distribution of the observable series is not too thick tailed, and that the conditional covariance matrix moves smoothly over time. Intuitively, the GARCH filter defined by equation (1.9) “‘An approximate linear Kalman filter for a discretized version of(4.1) has been employed by Harvey et al. (1994). The exact non-linear filter for a discretized version of (4.1) has been developed by Jacquier et al. (1994). Danielson and Richard (1993) and Shephard (1993) also calculate the exact likelihood by computer intensive methods.
T. Bollerslev et al.
2998
estimates the conditional variance by averaging squared residuals over some time window, resulting in a nonparametric estimate for the conditional variance at each point in time. Many other ARCH models can be similarly interpreted. While many different ARCH models may serve as consistent filters for the same diffusion process, efficiency issues may also be relevant in the design of an ARCH model. To illustrate, suppose that the Y, process is observable at time intervals of length h, but that g: is not observed. Let 8: denote some initial estimate of the conditional variance at time 0, with subsequent estimates generated by the recursion ln(@+,) = ln(8f) + hK(8:) + h”‘g[8f,
h-“2(Y,+h
-
Y,)l.
(4.20)
The set of admissible g(.;) functions is restricted by the requirement that E,{g[af, h-“2(Yt+h - y,)]} be close to zero for small values of h.i5 Define the normalized estimation error from this filter extraction as qt = h-‘14[ln(8:) - ln(of)]. Nelson and Foster (1994) derive a diffusion approximation for qt when the data have been generated by the diffusion in equations (4.2)-(4.4) and the time interval shrinks to zero. In particular, they show that qt is approximately normally distributed, and that by choosing the g(., .) function to minimize the asymptotic variance of q,, the drift term for ln(a:) in the ARCH model, K(.), does not appear in the resulting minimized asymptotic variance for the measurement error. The effect is second order in comparison to that of the g(., .) term, and creates only an asymptotically negligible bias in qt. However, for rc(r~f)s - fi[ln(a:) - a], the leading term of this asymptotic bias also disappears. It is easy to verify that the conditions of Theorem 4.1 are satisfied for the ARCH model defined by equation (4.20) with ~(a’) = - j3[ln(a2) - a] and the variance minimizing g(., .). Thus, as a data generation process this ARCH model converges weakly to the diffusion in (4.2)-(4.4). In the diffusion limit the first _two conditional moments completely characterize the process, and the optimal ARCH filter matches these moments. The above result on the optimal choice of an ARCH filter may easily be extended to other diffusions and more general data generating processes. For example, suppose that the true data generation process is given by the stochastic difference equation analogue of (4.2)-(4.4),
Yt+Jl=Yt+h
( 41 P-y
ln(af+,) = In($)
+51,t,
- hp[ln(af)
(4.21)
- a] + h1/252,f,
(4.22)
where (rl,tgt- ‘, t,,,) is i.i.d. and m d ependent oft, h and y,, with conditional density f(lr .f, t2 .rJor) with mean (O,O), bounded 2 + 6 absolute moments, Var,({,,,) = a:, “Formally, the function (y,, uC) sets as hJ0.
must satisfy that h-‘/4E,{g[uf,
h-1’2(y,+, - y,)] } +
0uniformly
on bounded
Ch. 49: ARCH
Models
2999
and Var,(t,,,) = II/‘. This model can be shown to converge weakly to (4.2)-(4.4) as h10. The asymptotically optimal filter for the model given by equations (4.21) and (4.22) has been derived in Nelson and Foster (1994). This optimal ARCH filter when (4.21) and (4.22) are the data generation process is not necessarily the same as the optimal filter for (4.2)-(4.4). The increments in a diffusion such as (4.2))(4.4) are approximately conditionally normal over very short time intervals, whereas the innovations (rl,l, c2,J in (4.21) and (4.22) may be non-normal. This affects the properties of the ARCH filter. Consider estimating a variance based on i.i.d. draws from some distribution with mean zero. If the distribution is normal, averaging squared residuals is an asymptotically efficient method of estimating the variance. Least squares, however, can be very inefficient if the distribution is thicker tailed than the normal. This theory of robust scale estimation, discussed in Davidian and Carroll (1987) and Huber (1977) carries over to the ARCH case. For example, estimating 0: by squaring a distributed lag of absolute residuals, as proposed by Taylor (1986) and Schwert (1989a, b), will be more efficient than estimating 0: with a distributed lag of squared residuals if the conditional distribution of the innovations is sufficiently thicker tailed than the normal. One property of optimally designed ARCH filters concerns their resemblance to the true data generating process. In particular, if the data were generated by the asymptotically optimal ARCH filter, the functional form for the second conditional moment of the state variables would be the same as in the true data generating process. If the conditional first moments also match, the second order bias is similarly eliminated. Nelson and Foster (1991) show that ARCH models which match these first two conditional moments also have the desirable property that the forecasts generated by the possibly misspecified ARCH model approach the forecasts from the true model as hJ0. Thus, even when ARCH models are misspecified, they may consistently estimate the conditional variances. Unfortunately, the behavior of ARCH filters with estimated as opposed to known parameters, and the properties of the parameter estimates themselves, are not yet well understood.
5. 5. I.
Aggregation Temporal
and forecasting aggregation
The continuous record asymptotics discussed in the preceding section summarizes the approximate relationships between continuous time stochastic differential equations and discrete time ARCH models defined at increasingly higher sampling frequencies. While the approximating stochastic differential equations may result in more manageable theoretical considerations, the relationship between high frequency ARCH stochastic difference equations and the implied stochastic process for less frequently sampled, or temporally aggregated, data is often of direct importance for empirical work. For instance, when deciding on the most appropriate
T. Bollersleu et al.
3ooo
sampling interval for inference purposes more efficient parameter estimates for the low frequency process may be available from the model estimates obtained with high frequency data. Conversely, in some instances the high frequency process may be of primary interest, while only low frequency data is available. The non-linearities in ARCH models severely complicate a formal analysis of temporal aggregation. In contrast to the linear ARIMA class of models for conditional means, most parametric ARCH models are only closed under temporal aggregation subject to specific qualifications. Following Drost and Nijman (1993) we say that (E,} is a weak GARCH(p, q) process if E, is serially uncorrelated with unconditional mean zero, and c:, as defined in equation (1.9), corresponds to the best linear projection of E: on the space spanned by {1, E,_ 1, E,_ 2,. . . , tzf_1, E:_*, . . . }. More specifically,
E(Ef- c$,= E[ (Ef- fJf)E,_i] = E[ (Ef - a;)&;_i] = 0
i= 1,2,... .
(5.1)
This definition of a weak GARCH(p, q) model obviously encompasses the conventional GARCH(p,q) model in which U: is equal to the conditional expectation of E: based on the full information set at time t - 1 as a special case. Whereas the conventional GARCH(p, q) class of models is not closed under temporal aggregation, Drost and Nijman (1993) show that temporal aggregation of ARIMA models with weak GARCH(p, q) errors lead to another ARIMA model with weak GARCH(p’, q’) errors. The orders of this temporally aggregated model and the model parameters depend on the original model characteristics. To illustrate, suppose that {Ed)follows a weak GARCH(l, 1)model with parameters 0,~~ and B1. Let {Ed”‘}denote the discrete time temporally aggregated process defined at t, t + m, t + 2m,. . . . For a stock variable E?) is obtained by sampling E, every mth period. For a flow variable elm)E E,+ E,_ 1 + . . . + E,_ m+ 1. In both cases, it is possible to show that the temporally aggregated process, {E:“‘}, is also weak GARCH(l, 1) with parameters W(~)= w[l - (~1~+ /?J”]/(l - a1 - pl) and u\“” = (~1~+ BJ” - Pi”), where D\“’ is a complicated function of the parameters for the original process. Thus, a?’ + Birn)= (al + /?l)m, and conditional heteroskedasticity disappears as the sampling frequency decreases, provided that CQ+ B1 < 1. Moreover, for flow variables the conditional kurtosis of the standardized residuals, $“)[ajm)]-‘, converges to the normal value of three for less frequently sampled observations. This convergence to asymptotic normality for decreasing sampling frequencies of temporally aggregated covariance stationary GARCH(p, q) flow variables has been shown previously by Diebold (1988), using a standard central limit theorem type argument. These results highlight the fact that the assumption of i.i.d. innovations invoked in maximum likelihood estimation of GARCH models is necessarily specific to the particular sampling frequency employed in the estimation. If E,O; l is assumed i.i.d., the distribution of .$“‘[ajm’]- 1 will generally not be time invariant. Following the discussion in Section 2.3, the estimation by maximum likelihood methods could
3001
Ch. 49: ARCH Models
be given a quasi-maximum likelihood type interpretation, however. Issues pertaining to the efficiency of the resulting estimators remain unresolved. The extension of the aggregation results for the GARCH(p,q) model to other parametric specifications is in principle straightforward. The cross sectional aggregation of multivariate GARCH processes, which may be particularly relevant in the formation of portfolios, have been addressed in Nijman and Sentana (1993).
5.2.
Forecast error distributions
One of the primary objectives of econometric time series model building is often the construction of out-of-sample predictions. In conventional econometric models with time invariant innovation variances, the prediction error uncertainty is an increasing function of the prediction horizon, and does not depend on the origin of the forecast. In the presence of ARCH errors, however, the forecast accuracy will depend non-trivially on the current information set. The proper construction of forecast error intervals and post-sample structural stability tests, therefore, both require the evaluation of future conditional error variances.16 A detailed analysis of the forecast moments for various GARCH models is available in Engle and Bollerslev (1986) and Baillie and Bollerslev (1992). Although both of these studies develop expressions for the second and higher moments of the forecast error distributions, this is generally not enough for the proper construction of confidence intervals, since the forecast error distributions will be leptokurtic and time-varying. A possible solution to this problem is suggested by Baillie and Bollerslev (1992), who argue for the use of the Cornish-Fisher asymptotic expansion to take account of the higher order dependencies in the construction of the prediction error intervals. The implementation of this expansion requires the evaluation of higher order conditional moments of E,+,,’ which can be quite complicated. Interestingly, in a small scale Monte Carlo experiment, Baillie and Bollerslev (1992) find that under the assumption of conditional normality for ~~a;‘, the ninety-five percent confidence interval for multi-step predictions from the GARCH(l, 1) model, constructed under the erroneous assumption of conditional normality of E~+~[E(~:+,)] -I” for s > 1, has a coverage probability quite close to ninety-five percent. The one percent fractile is typically underestimated by falsely assuming conditional normality of the multi-step leptokurtic prediction errors, however. Most of the above mentioned results are specialized to the GARCH class of models, although extensions to allow for asymmetric or leverage terms and multivariate formulations in principle would be straightforward. Analogous results on forecasting ln(a:) for EGARCH models are easily obtained. Closed form expressions ’ bAlso, as discussed earlier, the forecasts in applications with financial data.
of the future conditional
variances
are often of direct interest
T. Bollerslev et al.
3002
for the moments of the forecast error distribution for the EGARCH model are not available, however. As discussed in Section 4.3, an alternative approximation to the forecast error distribution may be based upon the diffusion limit of the ARCH model. If the sampling frequency is “high” so that the discrete time ARCH model is a “close” approximation to the continuous time diffusion limit, the distribution of the forecasts should be “good” approximations too; see Nelson and Foster (1991). In particular, if the unconditional distribution of the diffusion limit can be derived, this would provide an approximation to the distribution of the long horizon forecasts from a strictly stationary model. Of course, the characteristics of the prediction error distribution may also be analyzed through the use of numerical methods. In particular, let fJ.s,+J denote the density function for E~+~conditional on information up through time t. Under the assumption of a time invariant conditional density function for the standardized innovations, f(~,g, ‘), the prediction error density for E~+~ is then given by the convolution
_f&t+J=
ss
... f(&t+,~~~t’,)f(&,+,-l~~+ls-l)...f(&,+,~t+1l)dE,+,-ld&,+,-2.
Evaluation of this multi-step prediction error density may proceed directly by numerical integration. This is illustrated within a Bayesian context by Geweke (1989a, b), who shows how the use of importance sampling and antithetic variables can be employed in accelerating the convergence of the Monte Carlo integration. In accordance with the results in Baillie and Bollerslev (1992) Geweke (1989a) finds that for conditional normally distributed one-step-ahead prediction errors, the shorter the forecast horizon s, and the more tranquil the periods before the origin of the forecast, the closer to normality is the prediction error distribution for E,+,. 6.
Multivariate
specifications
Financial market volatility moves together over time across assets and markets. Recognizing this commonality through a multivariate modeling framework leads to obvious gains in efficiency. Several interesting issues in the structural analysis of asset pricing theories and the linkage of different financial markets also call for an explicit multivariate ARCH approach in order to capture the temporal dependencies in the conditional variances and covariances. In keeping with the notation of the previous sections, the N x 1 vector stochastic process {E,} is defined to follow a multivariate ARCH process if E,_ i(s,) = 0, but the N x N conditional covariance matrix,
E,- (q~:)= n,, 1
(6.1)
3003
Ch. 49: ARCH Models
depends non-trivially on the past of the process. From a theoretical perspective, inference in multivariate ARCH models poses no added conceptual difficulties in comparison to the procedures for the univariate case outlined in Section 2 above. To illustrate, consider the log likelihood function for {Q, Q_ 1,. . . , cl} obtained under the assumption of conditional multivariate normality, LT(~T,~T-l,...,~I;ICI)=
-0S[TNln(27r)+
C (lnI~tI+~~~,~‘~,)l. 1=1,T
(6.2)
This function corresponds directly to the conditional likelihood function for the univariate ARCH model defined by equations (2.7), (2.8) and (2.12), and maximum likelihood, or quasi-maximum likelihood, procedures may proceed as discussed in Section 2. Of course, the actual implementation of a multivariate ARCH model necessarily requires some assumptions regarding the format of the temporal dependencies in the conditional covariance matrix sequence, {Q}. Several key issues must be faced in choosing a parametrization for Q. Firstly, the sheer number of potential parameters in a geneal formulation is overwhelming. All useful specifications must necessarily restrict the dimensionality of the parameter space, and it is critical to determine whether they impose important untested characteristics on the conditional variance process. A second consideration is whether such restrictions impose the required positive semi-definiteness of the conditional covariance matrix estimators. Thirdly, it is important to recognize whether Granger causality in variance as in Granger et al. (1986) is allowed by the chosen parametrization; that is, does the past information on one variable predict the conditional variance of another. A fourth issue is whether the correlations or regression coefficients are time-varying and, if so, do they have the same persistence properties as the variances? A fifth issue worth considering is whether there are linear combinations of the variables, or portfolios, with less persistence than individual series, or assets. Closely related is the question of whether there exist simple statistics which are sufficient to forecast the entire covariance matrix. Finally, it is natural to ask whether there are multivariate asymmetric effects, and if so how these may influence both the variances and covariances. Below we shall briefly review some of the parametrizations that have been applied in the literature, and comment on their appropriateness for answering each of the questions posed above.
6.1.
Vector ARCH and diagonal ARCH
Let vech(.) denote the vector-half operator, which stacks the lower triangular elements of an N x N matrix as an [N(N + I)/21 x 1 vector. Since the conditional covariance matrix is symmetric, vech(Q) contains all the unique elements in Q. Following Kraft and Engle (1982) and Bollerslev et al. (1988), a natural multivariate extension of the univariate GARCH(p,q) model defined in equation (1.9) is then
T.Bollerslev
3004
et al.
given by vech(L?J = W+
C Aivech(&,_i&:_i) + 1 i= 1.9
Bjvech(Q_j)
j=l,p
= W+ A(L)vech(s,_ i&r)
+ B(L)vech@~_,),
(6.3)
where W is an [N(N + 1)/2] x 1 vector, and the Ai and Bj matrices are of dimension [N(N + 1)/2] x [N(N + 1)/2]. This general formulation is termed the vet representation by Engle and Kroner (1993). It allows each of the elements in {Q} to depend on all of the most recent q past cross products of the E,‘Sand all of the most recent p lagged conditional variances and covariances, resulting in a total of [N(N + 1)/2]. [l + (p + q)N(N + 1)/2] parameters. Even for low dimensions of N and small values of p and q the number of parameters is very large; e.g. for N = 5 and p = q = 1 the unrestricted version of (6.3) contains 465 parameters. This allows plenty of flexibility to answer most, but not all, of the questions above.” However, this large number of parameters is clearly unmanageable, and conditions to ensure that the conditional covariance matrices are positive definite a.s. for all t are difficult to impose and verify; Engle and Kroner (1993) provides one set of sufficient conditions discussed below. In practice, some simplifying assumptions will therefore have to be imposed. In the diagonal GARCH(p, q) model, originally suggested by Bollerslev et al. (1988), the Ai and Bj matrices are all taken to be diagonal. Thus, the (i,j)th element in {Q} only depends on the corresponding past (i, j)th elements in {E&>and {Q}. This restriction reduces the number of parameters to [N(N + 1)/2](1 + p + q). These restrictions are intuitively reasonable, and can be interpreted in terms of a filtering estimate of each variance and covariance. However, this model clearly does not allow for causality in variance, co-persistence in variance, as discussed in Section 6.5 below, or asymmetries. Necessary and sufficient conditions on the parameters to ensure that the conditional covariance matrices in the diagonal GARCH(p, q) model are positive definite a.s. are most easily derived by expressing the model in terms of Hadamard products. In particular, define the symmetric N x N matrices A* and BT implicitly by Ai = diag[vech($)] i = 1,. . . , q, Bj = diag[vech(BT)] j = 1,. . . , p, and WE vech( W*). The diagonal model may then be written as
R,=w*+
c
A;~(E,_~E;_~)+
i=l,q
where 0 denotes the Hadamard
c
ByQ12-j,
(6.4)
j=l,P
product. la It follows now by the algebra of
“Note, that even with this number of parameters, asymmetric terms are excluded by the focus on squared residuals. “The Hadamard product of two N x N matrices A and B is defined by {AOBJij = {A}ij{B}ij; see, e.g. Amemiya (1985).
3005
Ch. 49: ARCH Models
Hadamard products, that Lit is positive definite as. for all t provided that W* is positive definite, and the AT and Bf matrices are positive semi-definite for all i= 1,..., q and j= l,..., p; see Attanasio (1991) and Marcus and Mint (1964) for a formal proof. These conditions are easy to impose and verify through a Cholesky decomposition for the parameter matrices in equation (6.4). Even simpler versions of this model which let either A* or Bf be rank one matrices, or even simply a scalar times a matrix of ones, may be useful in some applications. In the alternative representation’of the multivariate GARCH(p, q) model termed by Engle and Kroner (1993) the Baba, Engle, Kraft and Kroner, or BEKK, representation, the conditional covariance matrix is parametrized as (6.5) k=l,Xi=l,q
k=l,Kj=l,p
wherek’,A,,i=l,..., q,k=l,..,, K,andBjkj=l ,..., p,k=l,..., KareallNxN matrices. This formulation has the advantage over the general specification in equation (6.3) that Q is guaranteed to be positive definite a.s. for all t. The model in equation (6.5) still involves a total of [l + (p + q)K]N’ parameters. By taking vech((l,) we can express any model of the form (6.5) in terms of (6.3). Thus any vet model in (6.3) whose parameters can be expressed as (6.5) must be positive definite. However, in empirical applications, the structure of the Aik and Bjk matrices must be further simplified as this model is also overparametrized. A choice made by McCurdy and Stengos (1992) is to set K =p= q = 1 and make A, and B, diagonal. This leads to the simple positive definite version of the diagonal vet model
a,= w*+c(lcr;o(E,_lE:_l)+DIB;O,Rt_l, where A, = diag[a,]
6.2.
(6.6)
and B, = diag[j?,].
Factor ARCH
The Factor ARCH model can be thought of as an alternative simple parametrization of (6.5). Part of the appeal of this parametrization in applications with asset returns stems from its derivation in terms of a factor type model. Specifically, suppose that the N x 1 vector of returns y, has a factor structure with K factors given by the K x 1 vector &, and time invariant factor loadings given by the N x K matrix B: Y, =
B5, +
E,.
Assume that the idiosyncratic shocks, E,, have constant Y’, and that the factors, 5,, have conditional covariance
(6.7) conditional covariances matrix A,. Also, suppose
T. Bollersleu et al.
3006
that E, and 5, are uncorrelated, or that they have constant conditional covariance matrix of y, then equals
v,_1(y,)= n, =
Y + BA,B’.
correlations.
The
(6.8)
If A, is diagonal with elements IZkf,or if the off-diagonal elements are constant and combined into Y, the model may therefore be written as 0, = Y+
c
BkP;nk*Y
(6.9)
k=l,K
where flk denotes the kth column in B. Thus, there are K statistics which determine the full covariance matrix. Forecasts of the variances and covariances or of any portfolio of assets, will be based only on the forecasts of these K statistics. This model was first proposed in Engle (1987), and implemented empirically by Engle et al. (1990b) and Ng et al. (1992) for treasury bills and stocks, respectively. Diebold and Nerlove (1989) suggested a closely related latent factor model, (6.9’) k=l,K
in which the factor variances, S& were not functions of the past information set. An estimation approach based upon an approximate Kalman filter was used by Diebold and Nerlove (1989). More recently King et al. (1994) have estimated a similar latent factor model using theoretical developments in Harvey et al. (1994). An immediate implication of (6.8) and (6.9) is that, if K < N, there are some portfolios with constant variance. Indeed a useful way to determine K is to find how many assets are required to form such portfolios. Engle and Kozicki (1993) present this as an application of a test for common features. This test is applied by Engle and Susmel(1993) to determine whether there is evidence that international equity markets have common volatility components. Only for a limited number of pairs of the countries analyzed can a one factor model not be rejected. A second implication of the formulation in (6.8) is that there exist factorrepresenting portfolios with portfolio weights that are orthogonal to all but one set of factor loadings. In particular, consider the portfolio rkt = 4;y,, where +;Bj equals 1 ifj = k and zero otherwise. The conditional variance of rkt is then given by (6.10) where $k = 4; Y$,. Thus, the portfolios rkl have exactly the same time variation as the factors, which is why they are called factor-representing portfolios. In order to estimate this model, the dependence of the Akr’s upon the past information set must also be parametrized. The simplest assumption is that there is a set of factor-representing portfolios with univariate GARCH( 1,l) representa-
3007
Ch. 49: ARCH Models
tions. Thus, (6.11) and, therefore,
k=l,K
k=l,K
so that the factor ARCH model is a special case of the BEKK parametrization. Clearly, more general factor ARCH models would allow the factor representing portfolios to depend upon a broader information set than the simple univariate assumption underlying (6.11). Estimation of the factor ARCH model by full maximum likelihood together with several variations has been considered by Lin (1992). However, it is often convenient to assume that the factor-representing portfolios are known a priori. For example, Engle et al. (1990b) assumed the existence of two such portfolios: one an equally weighted treasury bill portfolio and one the Standard and Poor’s 500 composite stock portfolio. A simple two step estimation procedure is then available, by first estimating the univariate models for each of the factor-representing portfolios. Taking the variance estimates from this first stage as given, the factor loadings may then be consistently estimated up to a sign, by noticing that each of the individual assets has a variance process which is linear in the factor variances, where the coefficients equal the squares of the factor loadings. While this is surely an inefficient estimator, it has the advantage that it allows estimation for arbitrarily large matrices using simple univariate procedures.
6.3.
Constant
conditional
correlations
In the constant conditional correlations model of Bollerslev (1990), the time-varying conditional covariances are parametrized to be proportional to the product of the corresponding conditional standard deviations. This assumption greatly simplifies the computational burden in estimation, and conditions for 0, to be positive definite a.s. for all t are also easy to impose. More explicitly, let D, denote the N x N diagonal matrix with the conditional variances along the diagonal; i.e. {Dt}ii = {Q},, and {Dt}ij = 0 for i #j, i, j = 1,. . , N. Also, let c denote the matrix of conditional correlations; i.e. {&jij = {Q}ij[ {Q}ii. {QJ,]-“‘, i, j = 1,. . . , N. The constant conditional correlation model then assumes that r, = I- is time-invariant, so that the temporal variation in {a,} is determined solely by the time-varying conditional variances, 0
= I
DwrDl/z f
f
.
(6.13)
T. Bollerslev
3008
et al.
If the conditional variances along the diagonal in the D, matrices are all positive, and the conditional correlation matrix r is positive definite, the sequence of conditional covariance matrices {Qj is guaranteed to be positive definite a.s. for all t. Furthermore, the inverse of Q is simply given by L2, ’ = DfF”2r - 1Dle”2. Thus, when calculating the likelihood function in equation (6.2) or some other multivariate objective function involving 0,’ t = 1,. . ., T, only one matrix inversion is required for each evaluation. This is especially relevant from a computational point of view when numerical derivatives are being used. Also, by a standard multivariate SURE analogy, r may be concentrated out of the normal likelihood function by (D,- li2e,)(D,- 1/2Q’r simplifying estimation even further. Of course, the validity of the assumption of constant conditional correlations However, this particular formulation has already remains an empirical question.” been successfully applied by a number of authors, including Baillie and Bollerslev (1990), Bekaert and Hodrick (1993), Bollerslev (1990), Kroner and Sultan (1991), Kroner and Claessens (1991) and Schwert and Seguin (1990).
Bivariate
6.4.
EGARCH
A bivariate version of the EGARCH model in equation (1.11) has been introduced by Braun et al. (1992) in order to model any “leverage effects,” as discussed in Section 1.2.3, in conditional betas. Specifically, let E,,( and sP,[ denote the residuals for a market index and a second portfolio or asset. The model is then given by (6.14)
6%t = ~m.tGl,t and &
P.1=
Bp,tEm,t+
(6.15)
flp.tZp,v
where {z,,~, z~,~} is assumed to be i.i.d with mean (0,O) and identity covariance matrix. The conditional variance of the market index, a:,,, is modeled by a univariate EGARCH model,
Wi,,) = CL + 4,Jn(~8fJThe conditional BP,,
=
%J
+ R,,z,,~-
Y,( Iz,,~- 1 I - E Iz,,, - 1 I).
beta of sp,t with respect to E,,~, /?p,f, is modeled
4 + UP,,, - 1 - Al) + 11Z,,t
The coefficients
1 +
- 1ZP,l- 1 +
j.2 and 1,, allow for “leverage
19A formal moment based test for the assumption developed by Bera and Roh (1991).
G&t - 1
+
(6.16)
as
J.3Zp.t - 1.
(6.17)
effects” in BP,,. The non-market, of constant
conditional
correlations
or
has been
3009
Ch. 49: ARCH Models
idiosyncratic, variance of the second portfolio, gi,*, is parametrized as a modified univariate EGARCH model, to allow for both market and idiosyncratic news effects.
Braun et al. (1992) find that this model provides a good description for a number of industry and size-sorted portfolios.
6.5.
Stationarity
of the returns
and co-persistence
Stationarity and moment convergence criteria for various univariate specifications were discussed in Section 3 above. Corresponding convergence criteria for multivariate ARCH models are generally complex, and explicit results are only available for a few special cases. Specifically, consider the multivariate vet GARCH(l, 1) model defined in equation (6.3). Analogous to the expression for the univariate GARCH(1, 1) model in equation (3.10) the minimum mean square error forecast for vech(Q) as of time s < t takes the form
E,(vech(Q,))
= W
1
+ (A, + ~,)‘~“vech(QJ,
(6.19)
where (A 1 + II,)’ is equal to the identity matrix by definition. Let VA V- ’ denote the Jordan decomposition of the matrix A, + B,, so that (A, + Br)‘-’ = VAt-sV-1.20 Thus, E,(vech(Q)) converges to the unconditional covariance matrix of the process, W(Z - A, - B,)-‘, for t + cc a.s. if and only if the norm of the largest eigenvalue of A, + B, is strictly less than one. Similarly, by expressing the vector GARCH(p, q) model in companion first order form, it follows that the forecast moments converge, and that the process is covariance stationary if and only if the norm of the largest root of the characteristic equation II - A(x- ‘) - B(x- ‘)I = 0 is strictly less than one. A formal proof is given in Bollerslev and Engle (1993). This corresponds directly to the condition for the univariate GARCH(p, q) model in equation (1.9) where the persistence of a shock to the optimal forecast of the future conditional variances is determined by the largest root of the characteristic polynomial c((x-I) + /I(x- ‘) = I. The conditions for strict stationarity and ergodicity for the multivariate GARCH(p,q) model have not yet been established. “‘If the eigenvalues for A, + B, are all distinct, A equals the diagonal matrix of eigenvalues, and V the corresponding matrix of right eigenvectors. If some of the eigenvalues coincide, A takes the more general Jordan canonical form; see Anderson (1971) for further discussion.
T. Bollerslev et al.
3010
Results for other multivariate formulations are scarce, although in some instances the appropriate conditions may be established by reference to the univariate results in Section 3. For instance, for the constant conditional correlations model in equation (6.13), the persistence of a shock to E,(Q), and conditions for the model to be covariance stationary are simply determined by the properties of each of the N univariate conditional variance processes; i.e., E,( {Q}ii) i = 1,. . , N. Similarly, for the factor ARCH model in equation (6.9), stationarity of the model depends directly on the properties of the univariate conditional variance processes for the factor-representing porifolios; i.e. {Akf}k = 1,. . . , K. The empirical estimates for univariate and multivariate ARCH models often indicate a high degree of persistence in the forecast moments for the conditional variances; i.e. &(a:) or JY,({Q},),~) i = 1,. . . , N, for t + co. At the same time, the commonality in volatility movements suggest that this persistence may be common across different series. More formally, Bollerslev and Engle (1993) define the multivariate ARCH process to be co-persistent in variance if at least one element in E,(Q) is non-convergent a.s. for increasing forecast horizons t-s, yet there exists a non-trivial linear combination of the process, y’s,, such that for every forecast origin s, the forecasts of the corresponding future conditional variances, E,(y’Qy), converge to a finite limit independent of time s information. Exact conditions for this to occur within the context of the multivariate GARCH(p, q) model in equation (6.3) are presented in Bollerslev and Engle (1993). These results parallel the conditions for co-integration in the mean as developed by Engle and Granger (1987). Of course, as discussed in Section 3 above, for non-linear models different notions of convergence may give rise to different classifications in terms of the persistence of shocks. The focus on forecast second moments corresponds directly to the mean-variance trade-off relationship often stipulated by economic theory. To further illustrate this notion of co-persistence, consider the K-factor GARCH(p,q) model defined in equation (6.12). If some of the factor-representing portfolios have persistent variance processes, then individual assets with non-zero factor loadings on such factors will have persistence in variance, also. However, there may be portfolios which have zero factor loadings on these factors. Such portfolios will not have persistence in variance, and hence the assets are copersistent. This will generally be true if there are more assets than there are persistent factors. From a portfolio selection point of view such portfolios might be desirable as having only transitory fluctuations in variance. Engle and Lee (1993) explicitly test for such an effect between large individual stocks and a market index, but fail to find any evidence of co-persistence.
7.
Model selection
Even in linear statistical models, the problem of selecting an appropriate model is non-trivial, to say the least. The usual model selection difficulties are further compli-
Ch. 49: ARCH
3011
Models
cated in ARCH models by the uncountable infinity of functional forms allowed by equation (1.2) and the choice of an appropriate loss function. Standard model selection criteria such as the Akaike (1973) and the Schwartz (1978) criterion have been widely used in the ARCH literature, though their statistical properties in the ARCH context are unknown. This is particularly true when the validity of the distributional assumptions underlying the likelihood is in doubt. Most model selection problems focus on estimation of means and evaluate loss functions for alternative models using either in-sample criteria, possibly corrected for fitting by some form of cross-validation, or out-of-sample evaluation. The loss function of choice is typically mean squared error. When the same strategy is applied to variance estimation, the choice of mean squared error is much less clear. A loss function such as L, =
1 (Ef- CJf)’
(7.1)
1=1.T
will penalize conditional variance estimates which are different from the realized squared residuals in a fully symmetrical fashion. However, this loss function does not penalize the method for negative or zero variance estimates which are clearly counterfactual. By this criterion, least squares regressions of squared residuals on past information will have the smallest in-sample loss. More natural alternatives may be the percentage squared errors, L, =
(Ef - cTf,‘cr-“,
1
(7.2)
2=1.T the percentage L, =
C
absolute errors, or the loss function [ln(a:)
implicit in the Gaussian
likelihood
+ $a,‘].
(7.3)
1=1,T
A simple alternative close to zero is2’ L, =
which exaggerates
C [ln(sfa;2)]2. f= l.T
the interest in predicting
when residuals
are
(7.4)
The most natural loss function, however, may be one based upon the goals of the particular application. West et al. (1993) developed such a criterion from the portfolio decisions of a risk averse investor. In an expected utility comparison based on the “Pagan and Schwert (1990) used the loss functions L, and L, to compare alternative parametric and nonparametric estimators with in-sample and out-of-sample data sets. As discussed in Section 1.5, the L, in-sample comparisons favored the nonparametric models, whereas the out-of-sample tests and the loss function L, in both cases favored the parametric models.
T Bollerslev
3012
et al.
forecast of the return volatility, ARCH models turn out to fare very well. In a related context, Engle et al. (1993) assumed that the objective was to price options, and developed a loss function from the profitability of a particular trading strategy. They again found that the ARCH variance forecasts were the most profitable.
8.
Alternative measures for voiatility
Several alternative procedures for measuring the temporal variation in second order moments of time series data have been employed in the literature prior to the development of the ARCH methodology. This is especially true in the analysis of high frequency financial data, where volatility clustering has a long history as a salient empirical regularity. One commonly employed technique for characterizing the variation in conditional second order moments of asset returns entails the formation of low frequency sample variance estimates based on a time series of high frequency observations. For instance, monthly sample variances are often calculated as the sum of the squared daily returns within the month”; examples include Merton (1980) and Poterba and Summers (1986). Of course, if the conditional variances of the daily returns differ within the month, the resulting monthly variance estimates will generally be inefficient; see French et al. (1987) and Chou (1988). However, even if the daily returns are uncorrelated and the variance does not change over the course of the month, this procedure tends to produce both inefficient and biased monthly estimates; see Foster and Nelson (1992). A related estimator for the variability may be calculated from the inter-period highs and lows. Data on high and low prices within a day is readily available for many financial assets. Intuitively, the higher the variance, the higher the inter-period range. Of course, the exact relationship between the high-low distribution and the variance is necessarily dependent on the underlying distribution of the price process. Using the theory of range statistics, Parkinson (1980) showed that a high-low estimator for the variance of a continuous time random walk is more efficient than the conventional sample variance based on the same number of end-of-interval observations. Of course, the random walk model assumes that the variance remain constant within the sample period. Formal extensions of this idea to models with stochastic volatility are difficult; see also Wiggins (1991), who discusses many of the practical problems, such as sensitivity to data recording errors, involved in applying high-low estimators. Actively traded options currently exist for a wide variety of financial instruments. A call option gives the holder the right to buy an underlying security at a pre-
“Since many high frequency asset prices exhibit low but significant first order serial correlation, two times the first order autocovariance is often added to the daily variance in order to adjust for this serial dependence.
Ch. 49: ARCH Models
3013
specified price within a given time period. A put option gives the right to sell a security at a pre-specified price. Assuming that the price of the underlying security follows a continuous time random walk, Black and Scholes (1973) derived an arbitrage based pricing formula for the price of a call option. Since the only unknown quantity in this formula is the constant instantaneous variance of the underlying asset price over the life of the option, the option pricing formula may be inverted to infer the conditional variance, or volatility, implicit in the actual market price of the option. This technique is widely used in practice. However, if the conditional variance of the asset is changing through time, the exact arbitrage argument underlying the Black-Scholes formula breaks down. This is consistent with the evidence in Day and Lewis (1992) for stock index options which indicate that a simple GARCH(l, 1) model estimated for the conditional variance of the underlying index return provides statistically significant information in addition to the implied volatility estimates from the Black-Scholes formula. Along these lines Engle and Mustafa (1992) find that during normal market conditions the coefficients in the implied GARCH(l, 1) model which minimize the pricing error for a risk neutral stock option closely resemble the coefficients obtained using more conventional maximum likelihood estimation methods. 23 As mentioned in Section 4 above, much recent research has been directed towards the development of theoretical option pricing formulas in the presence of stochastic volatility; see, for instance, Amin and Ng (1993), Heston (1991), Hull and White (1987), Melino and Turnbull (1990), Scott (1987) and Wiggins (1987). While closed form solutions are only available for a few special cases, it is generally true that the higher the variance of the underlying security, the more valuable the option. Much further research is needed to better understand the practical relevance and quality of the implied volatility estimates from these new theoretical models, however. Finance theory suggests a close relationship between the volume of trading and the volatility; see Karpoff (1987) for a survey of some of the earlier contributions to this literature. In particular, according to the mixtures of distributions hypothesis, associated with Clark (1973) and Tauchen and Pitts (1983) the evolution of returns and trading volume are both determined by the same latent mixing variable that reflects the amount of new information that arrives at the market. If the news arrival process is serially dependent, volatility and trading volume will be jointly serially correlated. Time series data on trading volume should therefore be useful in inferring the behavior of the second order moments of returns. This idea has been pursued by a number of empirical studies, including Andersen (1992b), Gallant et al. (1992) and Lamoureux and Lastrapes (1990). While the hypothesis that contemporaneous trading volume is positively correlated with financial market volatility is supported
‘aMore model by maximum minus the
specifically, Engle and Mustafa (1992) estimate the parameters for the implied GARCH(1, 1) minimizing the risk neutral option pricing error defined by the discounted value of the of zero and the simulated future price of the underlying asset from the GARCH(1,l) model exercise price of the option.
3014
T. Bollerslev et al.
in the data, the result that a single latent variable jointly determines both has been formally rejected by Lamoureux and Lastrapes (1994). In a related context, modern market micro structure theories also suggest a close relationship between the behavior of price volatility and the distribution of the bid-ask spread though time. Only limited evidence is currently available on the usefulness of such a relationship for the construction of variance estimates for the returr?; see, e.g. Bollerslev and Domowitz (1993), Bollerslev and Melvin (1994) and Brock and Kleidon (1992). The use of the cross sectional variance from survey data to estimate the variance of the underlying time series has been advocated by a number of researchers. Zamowitz and Lambros (1987) discuss a number of these studies with macroeconomic variables. Of course, the validity of the dispersion across forecasts as a proxy for the variance will depend on the theoretical connection between the degree of heterogeneity and uncertainty; see Pagan et al. (1983). Along these lines it is worth noting that Rich et al. (1992) only find a weak correlation between the dispersion across the forecasts for inflation and an ARCH based estimate for the conditional variance of inflation. The availability of survey data is also likely to limit the practical relevance of this approach in many applications. In a related context, a number of authors have argued for the use of relative prices or returns across different goods or assets as a way of quantifying inflationary uncertainty or overall market volatility. Obviously, the validity of such cross sectional based measures again hinges on very stringent conditions about the structure of the market; see Pagan et al. (1983). While all of the variance estimates discussed above may give some idea about the temporal dependencies in second order moments, any subsequent model estimates should be carefully interpreted. Analogously to the problems that arise in the use of generated regressors in the mean, as discussed by Pagan (1984,1986) and Murphy and Topel(1985), the conventional standard errors for the coefficient estimates in a second stage model that involves a proxy for the variance will have to be adjusted to reflect the approximation error uncertainty. Also, if the conditional mean depends non-trivially on the conditional variance, as in the ARCH-M model discussed in Section 1.4, any two step procedure will generally result in inconsistent parameter estimates; for further analysis along these lines we refer to Pagan and Ullah (1988).
9. Empirical examples

9.1. U.S. Dollar/Deutschmark exchange rates
As noted in Section 1.2, ARCH models have found particularly wide use in the modeling of high frequency speculative prices. In this section we illustrate the empirical quasi-maximum likelihood estimation of a simple GARCH(1,1) model for a time series of daily exchange rates. Our discussion will be brief. A more detailed and thorough discussion of the empirical specification, estimation and diagnostic
testing of ARCH models is given in the next section, which analyzes the time series characteristics of more than one hundred years of daily U.S. stock returns. The present data set consists of daily observations on the U.S. Dollar/Deutschmark exchange rate over the January 2, 1981 through July 9, 1992 period, for a total of 3006 observations.²⁴ A broad consensus has emerged that nominal exchange rates over the free float period are best described as non-stationary, or I(1), type processes; see, e.g. Baillie and Bollerslev (1989). We shall therefore concentrate on modeling the nominal percentage returns; i.e. $y_t = 100[\ln(s_t) - \ln(s_{t-1})]$, where $s_t$ denotes the spot Deutschmark/U.S. Dollar exchange rate at day $t$. This is the time series plotted in Figure 2 in Section 1.2 above. As noted in that section, the daily returns are clearly not homoskedastic, but are characterized by periods of tranquility followed by periods of more turbulent exchange rate movements. At the same time, there appears to be little or no own serial dependence in the levels of the returns. These visual observations are also borne out by more formal tests for serial correlation. For instance, the Ljung and Box (1978) portmanteau test for up to twentieth order serial correlation in $y_t$ equals 19.1, whereas the same test statistic for twentieth order serial correlation in the squared returns, $y_t^2$, equals 151.9. Under the null of i.i.d. returns, both test statistics should asymptotically be the realization of a chi-square distribution with twenty degrees of freedom. Note that in the presence of ARCH, the portmanteau test for serial correlation in $y_t$ will tend to over-reject. As discussed above, numerous parametric and nonparametric formulations have been proposed for modeling the volatility clustering phenomenon. For the sake of brevity, we shall here concentrate on the results for the particularly simple MA(1)-GARCH(1,1) model,

$$y_t = \mu_0 + \theta_1\varepsilon_{t-1} + \varepsilon_t, \qquad
\sigma_t^2 = \omega_0 + \omega_1 W_t - \omega_1(\alpha_1+\beta_1)W_{t-1} + \alpha_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2, \tag{9.1}$$
where $W_t$ denotes a weekend dummy equal to one following a closure of the market. The MA(1) term is included to take account of the weak serial dependence in the mean. Following Baillie and Bollerslev (1989), the weekend dummy is entered in the conditional variance to allow for an impulse effect. The quasi-maximum likelihood estimates (QMLE) for this model, obtained by the numerical maximization of the normal likelihood function defined by equations (2.7), (2.8) and (2.12), are contained in Table 1. The first column in the table shows that the α₁ and β₁ coefficients are both highly significant at the conventional five percent level. The sum of the estimated GARCH parameters also indicates a fairly strong degree of persistence in the conditional variance process.²⁵

²⁴The rates were calculated from the ECU cross rates obtained through Datastream.
²⁵Reparametrizing the conditional variance in terms of (α₁ + β₁) and α₁, the t-test statistic for the null hypothesis that α₁ + β₁ = 1 equals 3.784, thus formally rejecting the IGARCH(1,1) model at standard significance levels.
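As a concrete illustration, the variance recursion and the normal quasi-log likelihood behind Table 1 can be sketched as follows. This is a simplified reading of equation (9.1), not the authors' code; the initialization at the sample variance and at ε₀ = 0 are assumptions, and positivity constraints on the parameters are ignored.

```python
import numpy as np

def garch_quasi_loglik(params, y, W):
    """Normal quasi-log likelihood for the MA(1)-GARCH(1,1) model (9.1).

    params = (mu0, theta1, omega0, omega1, alpha1, beta1); W[t] = 1 on
    a day following a market closure. Start-up values are assumptions.
    """
    mu0, theta1, omega0, omega1, alpha1, beta1 = params
    T = len(y)
    eps = np.zeros(T)                    # eps[0] = 0 by assumption
    sig2 = np.full(T, np.var(y))         # initialize at sample variance
    for t in range(1, T):
        # The weekend dummy enters as an impulse: its effect at t is
        # offset at t-1 so that it does not persist through the GARCH lag.
        sig2[t] = (omega0 + omega1 * W[t]
                   - omega1 * (alpha1 + beta1) * W[t - 1]
                   + alpha1 * eps[t - 1] ** 2 + beta1 * sig2[t - 1])
        eps[t] = y[t] - mu0 - theta1 * eps[t - 1]
    return -0.5 * np.sum(np.log(2 * np.pi * sig2[1:]) + eps[1:] ** 2 / sig2[1:])
```

Maximizing this function numerically over the six parameters (e.g. with a standard quasi-Newton optimizer) yields the QMLE reported in the table.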
Table 1
Quasi-maximum likelihood estimates.ᵃ

Coefficient   Jan. 2, 1981-                    Jan. 2, 1981-                    Oct. 7, 1986-
              July 9, 1992                     Oct. 6, 1986                     July 9, 1992

μ₀            0.014 (0.018)                    -0.017 (0.016)
θ₁            -0.058 (0.030) [0.027] {0.027}   0.055 (0.027) [0.027] {0.027}    -0.056 (0.014)
ω₀            0.028 (0.005)                    0.024 (0.009)                    0.035 (0.011) [0.011] {0.010}
ω₁            0.243 (0.045) [0.031] {0.022}    0.197 (0.087) [0.062] {0.046}    0.281 (0.087)
α₁            0.068 (0.009)                    0.076 (0.022)                    0.063 (0.017)
β₁            0.880 (0.015) [0.012] {0.010}    0.885 (0.028)                    0.861 (0.033) [0.031] {0.030}

ᵃRobust standard errors based on equation (2.21) are reported in parentheses, (.). Standard errors calculated from the Hessian in equation (2.18) are reported in [.]. Standard errors based on the outer product of the sample gradients in (2.19) are given in {.}.
Consistent with the stylized facts discussed in Section 1.2.4, the conditional variance is also significantly higher following non-trading periods. The second and third columns of Table 1 report the results with the same model estimated for the first and second half of the sample respectively; i.e. January 2, 1981 through October 6, 1986 and October 7, 1986 through July 9, 1992. The parameter estimates are remarkably similar across the two sub-periods.²⁶

²⁶Even though the assumption of conditional normality is violated empirically, it is interesting to note that the sum of the maximized normal quasi log likelihoods for the two sub-samples equals -1727.750 - 1597.166 = -3324.916, compared to -3328.984 for the model estimated over the full sample.

In summary, the simple model in equation (9.1) does a remarkably good job of capturing the own temporal dependencies in the volatility of the exchange rate series. For instance, the highly significant portmanteau test for serial correlation in the squares of the raw series, $y_t^2$, drops to only 21.687 for the squared standardized residuals, $\hat\varepsilon_t^2\hat\sigma_t^{-2}$. We defer our discussion of other residual based diagnostics to the empirical example in the next section.

While the GARCH(1,1) model is able to track the own temporal dependencies, the assumption of conditionally normally distributed innovations is clearly violated by the data. The sample skewness and kurtosis for $\hat\varepsilon_t\hat\sigma_t^{-1}$ equal -0.071 and 4.892, respectively. Under the null of i.i.d. normally distributed standardized residuals, the sample skewness should be the realization of a normal distribution with a mean of 0 and a variance of $6/\sqrt{3006} = 0.109$, while the sample kurtosis is asymptotically normally distributed with a mean of 3 and a variance of $24/\sqrt{3006} = 0.438$.

The standard errors for the quasi-maximum likelihood estimates reported in (.) in Table 1 are based on the asymptotic covariance matrix estimator discussed in Section 2.3. These estimates are robust to the presence of conditional excess kurtosis. The standard errors reported in [.] and {.} are calculated from the Hessian and the outer product of the gradients as in equations (2.18) and (2.19), respectively. For some of the conditional variance parameters, the non-robust standard errors are less than one half of their robust counterparts. This compares to the findings reported in Bollerslev and Wooldridge (1992), and highlights the importance of appropriately accounting for any conditional non-normality when conducting inference in ARCH type models based on a normal quasi-likelihood function.
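The moment diagnostics above are straightforward to reproduce. A minimal sketch, not from the chapter, taking a vector of fitted standardized residuals z as given and returning the sample moments together with the benchmark dispersions quoted in the text:

```python
import numpy as np

def residual_moments(z):
    """Sample skewness and kurtosis of standardized residuals z,
    plus the benchmark dispersions used in the text (6/sqrt(n) for
    the skewness and 24/sqrt(n) for the kurtosis)."""
    z = np.asarray(z)
    n, m, s = len(z), z.mean(), z.std()
    skew = np.mean((z - m) ** 3) / s ** 3
    kurt = np.mean((z - m) ** 4) / s ** 4
    return skew, kurt, 6 / np.sqrt(n), 24 / np.sqrt(n)
```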
9.2. U.S. stock prices
We next turn to modeling heteroskedasticity in U.S. stock index returns data. Drawing on the optimal filtering results of Nelson and Foster (1991, 1994), summarized in Section 4, as guidance in model selection, new very rich parametrizations are introduced.

From 1885 on, the Dow Jones corporation has published various stock indices daily. In 1928, the Standard Statistics company began publishing daily a wider index of 90 utility, industrial and railroad stocks. In 1953, the Standard 90 index was replaced by an even broader index, the Standard and Poor's 500 composite. The properties of these indices are considered in some detail in Schwert (1990).²⁷ The Dow data has one substantial chronological break, from July 30, 1914, through December 11, 1914, when the financial markets were closed following the outbreak of the First World War. The first data set we analyze is the Dow data from its inception on February 16, 1885 until the market closure in 1914. The second data set is the Dow data from the December 1914 market reopening until January 3, 1928. The third data set is the Standard 90 capital gains series beginning on January 4, 1928 and extending to the end of May 1952. The Standard 90 index data is available through the end of 1956, but we end at the earlier date because that is when the New York Stock Exchange ended its Saturday trading session, which presumably shifted volatility to other days of the week. The final data set is the S&P 500 index beginning in January 1953 and continuing through the end of 1990.

²⁷G. William Schwert kindly provided the data. Schwert's indices differ from ours after 1962, when he uses the CRSP value weighted market index. We continue to use the S&P 500 through 1990.

9.2.1. Model specification

Our basic capital gains series, $r_t$, is derived from the price index data, $P_t$, as

$$r_t \equiv 100\,\ln[P_t/P_{t-1}]. \tag{9.2}$$

Thus, $r_t$ corresponds to the continuously compounded capital gain on the index, measured in percent. Any ARCH formulation for $r_t$ may be compactly written as

$$r_t = \mu(r_{t-1}, \sigma_t^2) + \varepsilon_t, \tag{9.3}$$

and

$$\varepsilon_t = z_t\sigma_t, \qquad z_t \sim \text{i.i.d.}, \quad E[z_t] = 0, \quad E[z_t^2] = 1, \tag{9.4}$$

where $\mu(r_{t-1}, \sigma_t^2)$ and $\sigma_t$ denote the conditional mean and the conditional standard deviation, respectively. In the estimation reported below we parametrized the functional form for the conditional mean by equation (9.5).
This is very close to the specification in LeBaron (1992). The ρ₁ coefficient allows for first order autocorrelation. The σ̄² term denotes the sample mean of $r_t^2$, which is essentially equal to the unconditional sample variance of $r_t$. As noted by LeBaron (1992), serial correlation seems to be a decreasing function of the conditional variance, which may be captured by equation (9.5) through ρ₂ > 0. The parameter ρ₃ is an ARCH-M term. We assume that the conditional distribution of $\varepsilon_t$ given $\sigma_t$ is generalized t; see, e.g. McDonald and Newey (1988). The density for the generalized t-distribution takes the form
$$f(\varepsilon_t\sigma_t^{-1};\, \eta, \psi) = \frac{\eta}{2\sigma_t b\,\psi^{1/\eta} B(1/\eta, \psi)\,\bigl[1 + |\varepsilon_t|^\eta/(\psi b^\eta \sigma_t^\eta)\bigr]^{\psi + 1/\eta}}, \tag{9.6}$$

where $B(1/\eta, \psi) \equiv \Gamma(1/\eta)\Gamma(\psi)/\Gamma(1/\eta + \psi)$ denotes the beta function, $b \equiv [\Gamma(\psi)\Gamma(1/\eta)/\Gamma(3/\eta)\Gamma(\psi - 2/\eta)]^{1/2}$, and $\psi\eta > 2$, $\eta > 0$ and $\psi > 0$. The scale factor $b$ makes $\mathrm{Var}(\varepsilon_t\sigma_t^{-1}) = 1$.
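Numerically, the density (9.6) is most safely evaluated in logs, using only the gamma-function identities just stated. A minimal sketch, not from the chapter:

```python
import numpy as np
from scipy.special import gammaln

def gen_t_logpdf(eps, sigma, eta, psi):
    """Log of the generalized t density (9.6) for innovation eps, scale
    sigma and shape parameters eta, psi (requires eta * psi > 2)."""
    # log B(1/eta, psi) = ln G(1/eta) + ln G(psi) - ln G(1/eta + psi)
    log_beta = gammaln(1 / eta) + gammaln(psi) - gammaln(1 / eta + psi)
    # Scale factor b makes Var(eps / sigma) = 1.
    log_b = 0.5 * (gammaln(psi) + gammaln(1 / eta)
                   - gammaln(3 / eta) - gammaln(psi - 2 / eta))
    return (np.log(eta) - np.log(2.0) - np.log(sigma) - log_b
            - np.log(psi) / eta - log_beta
            - (psi + 1 / eta) * np.log1p(np.abs(eps) ** eta
                                         / (psi * np.exp(eta * log_b) * sigma ** eta)))
```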
One advantage of this specification is that it nests both the Student's t and the GED distributions discussed in Section 2.2 above. In particular, the Student's t-distribution sets η = 2 and ψ = ½ times the degrees of freedom. The GED is obtained for ψ = ∞. Nelson (1989, 1991) fit EGARCH models to U.S. stock index returns assuming a GED conditional distribution, and found that there were many more large standardized residuals $z_t = \varepsilon_t\sigma_t^{-1}$ than would be expected if the returns were actually conditionally GED with the estimated η. The GED has only one "shape" parameter η, which is apparently insufficient to fit both the central part and the tails of the conditional distribution. The generalized t-distribution has two shape parameters, and may therefore be more successful in parametrically fitting the conditional distribution.

The conditional variance function, $\sigma_t^2$, is parametrized using a variant of the EGARCH formulation in equation (1.11),

$$\ln(\sigma_t^2) = \omega_t + \frac{(1 + \alpha_1 L + \cdots + \alpha_q L^q)}{(1 - \beta_1 L - \cdots - \beta_p L^p)}\, g(z_{t-1}, \sigma_{t-1}^2), \tag{9.7}$$

where the deterministic component $\omega_t$ is given by

$$\omega_t = \omega_0 + \ln[1 + \omega_1 W_t + \omega_2 S_t + \omega_3 H_t]. \tag{9.8}$$
As noted in Section 1.2, trading and non-trading periods contribute differently to volatility. To also allow for differences between weekend and holiday non-trading periods, $W_t$ gives the number of weekend non-trading days between trading days $t$ and $t-1$, while $H_t$ denotes the number of holidays. Prior to May 1952, the NYSE was open for a short trading session on Saturday. Since Saturday may have been a "slow" news day and the Saturday trading session was short, we would expect low average volatility on Saturdays. The $S_t$ dummy variable equals one if trading day $t$ is a Saturday and zero otherwise.

Our specification of the news impact function, $g(\cdot,\cdot)$, is a generalization of EGARCH inspired by the optimal filtering results of Nelson and Foster (1994). In the EGARCH model in equation (1.11), $\ln(\sigma_{t+1}^2)$ is homoskedastic conditional on $\sigma_t^2$, and the partial correlation between $z_t$ and $\ln(\sigma_{t+1}^2)$ is constant conditional on $\sigma_t^2$. These assumptions may well be too restrictive, and the optimal filtering results indicate the importance of correctly specifying these moments. Our specification of $g(z_t, \sigma_t^2)$ therefore allows both moments to vary with the level of $\sigma_t^2$. Several recent papers, including Engle and Ng (1993), have suggested that GARCH, EGARCH and similar formulations may make $\sigma_t^2$ or $\ln(\sigma_t^2)$ too sensitive to outliers. The optimal filtering results discussed in Section 4 lead to the same conclusion when $\varepsilon_t$ is drawn from a conditionally heavy tailed distribution. The final form that we assume for $g(\cdot,\cdot)$ was also motivated by this observation:

$$g(z_t, \sigma_t^2) = \sigma_t^{-2\theta_0}\left[\frac{\theta_1 z_t}{1 + \theta_2|z_t|^\rho}\right] + \sigma_t^{-2\gamma_0}\left[\frac{\gamma_1\bigl(|z_t|^\rho - E|z_t|^\rho\bigr)}{1 + \gamma_2|z_t|^\rho}\right]. \tag{9.9}$$
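Read this way, the news impact function is simple to code. The sketch below implements the form of (9.9) given above (itself a reconstruction of a partly illegible equation, so treat it as indicative); the argument names and the constant abs_mom, which stands in for E|z|^ρ under the assumed innovation density, are ours. Setting θ₀ = γ₀ = θ₂ = γ₂ = 0 and ρ = 1 recovers the piecewise linear EGARCH g(z).

```python
import numpy as np

def news_impact(z, sig2, th0, th1, th2, g0, g1, g2, rho, abs_mom):
    """g(z_t, sigma_t^2) in the form of (9.9) above.

    Note sig2 is sigma^2, so sig2 ** (-th0) equals sigma ** (-2 * th0).
    abs_mom = E|z|^rho (a constant that recentres the magnitude term,
    as in EGARCH); th2, g2 > 0 downweight large |z| observations.
    """
    az = np.abs(z) ** rho
    asym = sig2 ** (-th0) * th1 * z / (1.0 + th2 * az)            # leverage term
    magn = sig2 ** (-g0) * g1 * (az - abs_mom) / (1.0 + g2 * az)  # magnitude term
    return asym + magn
```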
The γ₀ and θ₀ parameters allow both the conditional variance of $\ln(\sigma_{t+1}^2)$ and its conditional correlation with $z_t$ to vary with the level of $\sigma_t^2$. If θ₁ < 0, $\ln(\sigma_{t+1}^2)$ and $z_t$ are negatively correlated: the "leverage effect". The EGARCH model constrains θ₀ = γ₀ = 0, so that the conditional correlation is constant, as is the conditional variance of $\ln(\sigma_t^2)$. The ρ, γ₂, and θ₂ parameters give the model flexibility in how much weight to assign to the tail observations. For example, if γ₂ and θ₂ are both positive, the model downweights large $|z_t|$'s. The second term on the right hand side of equation (9.9) was motivated by the optimal filtering results in Nelson and Foster (1994), designed to make the ARCH model serve as a robust filter.

The orders of the ARMA model for $\ln(\sigma_t^2)$, $p$ and $q$, remain to be determined. Table 2 gives the maximized values of the log likelihoods from (2.7), (2.8) and (9.6) for ARMA models of order up to ARMA(3,5). For three of the four data sets, the information criterion of Schwarz (1978) selects an ARMA(2,1) model, the exception being the Dow data for 1914-1928, for which an AR(1) is selected. For linear time series models, the Schwarz criterion has been shown to consistently estimate the order of an ARMA model. As noted in Section 7, it is not known whether this result carries over to the ARCH class of models. However, guided by the results in Table 2,
Table 2
Log likelihood values for fitted models.ᵃ

Fitted model   Dow 1885-1914   Dow 1914-1928   Standard 90      S&P 500
                                               1928-1952        1953-1990

White Noise    -10036.188      -4397.693       -11110.120       -10717.199
MA(1)          -9926.781       -4272.639       -10973.417       -10658.775
MA(2)          -9848.319       -4241.686       -10834.937       -10596.849
MA(3)          -9779.491       -4233.371       -10765.259       -10529.688
MA(4)          -9750.417       -4214.821       -10740.999       -10463.534
MA(5)          -9718.642       -4198.672       -10634.429       -10433.631
AR(1)          -9554.352       -4164.093^SC    -10275.294       -10091.450
ARMA(1,1)      -9553.891       -4164.081       -10269.771       -10076.775
ARMA(1,2)      -9553.590       -4160.671       -10265.464       -10071.040
ARMA(1,3)      -9552.148       -4159.413       -10253.027       -10070.587
ARMA(1,4)      -9543.855       -4158.836       -10250.446       -10064.695
ARMA(1,5)      -9540.485       -4158.179       -10242.833       -10060.336
AR(2)          -9553.939       -4164.086       -10271.732       -10083.442
ARMA(2,1)      -9529.904^SC    -4159.011^AIC   -10237.527^SC    -10052.322^SC
ARMA(2,2)      -9529.642       -4158.428       -10235.724       -10049.237
ARMA(2,3)      -9526.865       -4157.731       -10234.556       -10049.129
ARMA(2,4)      -9525.683       -4157.569       -10234.429       -10047.962
ARMA(2,5)      -9525.560       -4155.071       -10230.418       -10046.343
AR(3)          -9553.787       -4159.227       -10270.685       -10075.441
ARMA(3,1)      -9529.410       -4158.608       -10237.462       -10049.833
ARMA(3,2)      -9526.089       -4158.230       -10228.701^AIC   -10049.044
ARMA(3,3)      -9524.644^AIC   -4157.730       -10228.263       -10042.710
ARMA(3,4)      -9524.497       -4156.823       -10227.982       -10042.284
ARMA(3,5)      -9523.375       -4154.906       -10227.958       -10040.547^AIC

ᵃThe AIC and SC indicators denote the models selected by the information criteria of Akaike (1973) and Schwarz (1978), respectively.
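The AIC and SC selections in Table 2 amount to penalized comparisons of these log likelihoods. A minimal sketch, not from the chapter, with illustrative parameter counts and sample size; the relative ranking depends only on the difference in the number of ARMA terms.

```python
import numpy as np

def select_order(logliks, n_params, T):
    """logliks[m] and n_params[m] for each candidate model m; T observations.
    Returns the models picked by AIC (Akaike, 1973) and SC (Schwarz, 1978)."""
    aic = {m: -2 * ll + 2 * n_params[m] for m, ll in logliks.items()}
    sc = {m: -2 * ll + np.log(T) * n_params[m] for m, ll in logliks.items()}
    return min(aic, key=aic.get), min(sc, key=sc.get)

# Two entries from the Dow 1885-1914 column of Table 2 (counts illustrative).
ll = {"ARMA(2,1)": -9529.904, "ARMA(3,3)": -9524.644}
k = {"ARMA(2,1)": 3, "ARMA(3,3)": 6}
print(select_order(ll, k, T=8000))  # AIC -> ARMA(3,3), SC -> ARMA(2,1)
```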
Table 3
Maximum likelihood estimates.ᵃ

Coefficient   Dow 1885-1914      Dow 1914-1928      Standard 90        S&P 500
              ARMA(2,1)          AR(1)              1928-1952          1953-1990
                                                    ARMA(2,1)          ARMA(2,1)

ω₀            -0.6682 (0.1251)   -0.6228 (0.0703)   -1.2704 (2.5894)   -0.7899 (0.2628)
ω₁            0.2013 (0.0520)    0.3059 (0.0904)    0.1011 (0.0518)    0.1286 (0.0295)
ω₂            -0.4416 (0.0270)   -0.5557 (0.0328)   -0.6534 (0.0211)   †
ω₃            0.5099 (0.1554)    0.3106 (0.1776)    0.6609 (0.1702)    0.1988 (0.1160)
ψ             3.6032 (0.8019)    2.5316 (0.5840)    4.0436 (0.9362)    3.5437 (0.7557)
η             2.2198 (0.1338)    2.4314 (0.2041)    1.7809 (0.1143)    2.1844 (0.1215)
μ₀            0.0280 (0.0112)    0.0642 (0.0222)    0.0725 (0.1139)    0.0259 (0.0113)
ρ₁            -0.0885 (0.0270)   -0.0920 (0.0418)   -0.0914 (0.0243)   0.0717 (0.0260)
ρ₂            0.2206 (0.0571)    0.3710 (0.0828)    0.2990 (0.0387)    0.2163 (0.0532)
ρ₃            0.0006 (0.0209)    0.0316 (0.0442)    0.0285 (0.0102)    0.0050 (0.0213)
γ₀            -0.1058 (0.0905)   0.0232 (0.1824)    -0.0508 (0.0687)   0.1117 (0.0908)
γ₁            0.1122 (0.0256)    0.0448 (0.0478)    0.1356 (0.0327)    0.0658 (0.0157)
γ₂            0.0245 (0.0178)    0.0356 (0.0316)    0.0168 (0.0236)    0.0312 (0.0080)
ρ             2.1663 (0.3119)    3.2408 (1.5642)    1.6881 (0.3755)    2.2477 (0.3312)
θ₀            -0.6097 (0.0758)   -0.5675 (0.1232)   -0.1959 (0.0948)   -0.1970 (0.1820)
θ₁            -0.1509 (0.0258)   -0.3925 (0.1403)   -0.1177 (0.0271)   -0.1857 (0.0287)
θ₂            0.0361 (0.0828)    0.3735 (0.3787)    -0.0055 (0.0844)   0.2286 (0.1241)
Δ₁            0.9942 (0.0033)    0.9093 (0.0172)    0.9994 (0.0009)    0.9979 (0.0011)
Δ₂            0.8759 (0.0225)    †                  0.8303 (0.0282)    0.8945 (0.0258)
α₁            -0.9658 (0.0148)   †                  -0.9511 (0.0124)   -0.9695 (0.0010)

ᵃStandard errors are reported in parentheses. The parameters indicated by a † were not estimated. The AR coefficients are decomposed as (1 - Δ₁L)(1 - Δ₂L) ≡ (1 - β₁L - β₂L²), where |Δ₁| ≥ |Δ₂|.
Table 4
Wald hypothesis tests.

Test                                    Dow 1885-1914      Dow 1914-1928      Standard 90        S&P 500
                                        ARMA(2,1)          AR(1)              1928-1952          1953-1990
                                                                              ARMA(2,1)          ARMA(2,1)

γ₂ = θ₂ = γ₀ = θ₀ = ρ - 1 = 0: χ²(5)    97.3825 (0.0000)   63.4545 (0.0000)   10.1816 (0.0703)   51.8152 (0.0000)
ω₁ = ω₃: χ²(1)                          3.3867 (0.0657)    0.0006 (0.9812)    9.8593 (0.0017)    0.3235 (0.5695)
θ₀ = γ₀ = 0: χ²(2)                      67.4221 (0.0000)   21.3146 (0.0000)   4.4853 (0.1062)    2.2024 (0.3325)
θ₀ = γ₀: χ²(1)                          17.2288 (0.0000)   7.4328 (0.0064)    1.7718 (0.1832)    1.7844 (0.1816)
η = ρ: χ²(1)                            0.0247 (0.8751)    0.2684 (0.6044)    0.0554 (0.8139)    0.0312 (0.8598)
γ₂ = b⁻ᵑψ⁻¹: χ²(1)                      14.0804 (0.0002)   10.0329 (0.0015)   14.1293 (0.0002)   14.6436 (0.0001)
η = ρ, γ₂ = b⁻ᵑψ⁻¹: χ²(2)               18.4200 (0.0001)   10.4813 (0.0053)   22.5829 (0.0000)   16.9047 (0.0002)
Table 5
Conditional moment specification tests.

Orthogonality condition                  Dow 1885-1914      Dow 1914-1928      Standard 90        S&P 500
                                         ARMA(2,1)          AR(1)              1928-1952          1953-1990
                                                                               ARMA(2,1)          ARMA(2,1)

(1)  E_T[z_t] = 0                        -0.0147 (0.0208)   -0.0243 (0.0319)   -0.0275 (0.0223)   -0.0110 (0.0202)
(2)  E_T[z_t²] = 1                       0.0007 (0.0382)    0.0007 (0.0613)    0.0083 (0.0503)    0.0183 (0.0469)
(3)  E_T[z_t|z_t|] = 0                   -0.0823 (0.0365)   -0.1122 (0.0564)   -0.1072 (0.0414)   -0.0658 (0.0410)
(4)  E_T[g(z_t, σ_t²)] = 0               0.0007 (0.0046)    0.0013 (0.0080)    0.0036 (0.0051)    0.0003 (0.0035)
(5)  E_T[(z_t² - 1)(z²_{t-1} - 1)] = 0   -0.0050 (0.0714)   -0.0507 (0.0695)   -0.0105 (0.0698)   0.1152 (0.0930)
(6)  E_T[(z_t² - 1)(z²_{t-2} - 1)] = 0   -0.0047 (0.0471)   0.0399 (0.0606)    -0.0358 (0.0815)   -0.0627 (0.0458)
(7)  E_T[(z_t² - 1)(z²_{t-3} - 1)] = 0   0.0037 (0.0385)    -0.0365 (0.0521)   0.0373 (0.0583)    -0.0171 (0.0611)
(8)  E_T[(z_t² - 1)(z²_{t-4} - 1)] = 0   0.0950 (0.0562)    -0.0658 (0.0403)   -0.0018 (0.0543)   -0.0312 (0.0426)
(9)  E_T[(z_t² - 1)(z²_{t-5} - 1)] = 0   0.0165 (0.0548)    0.0195 (0.0486)    0.0710 (0.0565)    0.0261 (0.0731)
(10) E_T[(z_t² - 1)(z²_{t-6} - 1)] = 0   -0.0039 (0.0309)   0.0343 (0.0602)    0.0046 (0.0439)    -0.0557 (0.0392)
(11) E_T[(z_t² - 1)z_{t-1}] = 0          -0.0338 (0.0290)   -0.0364 (0.0414)   -0.0253 (0.0367)   -0.0203 (0.0413)
(12) E_T[(z_t² - 1)z_{t-2}] = 0          0.0069 (0.0251)    -0.0275 (0.0395)   -0.0434 (0.0315)   -0.0378 (0.0278)
(13) E_T[(z_t² - 1)z_{t-3}] = 0          0.0110 (0.0262)    0.0290 (0.0352)    0.0075 (0.0306)    0.0292 (0.0357)
(14) E_T[(z_t² - 1)z_{t-4}] = 0          -0.0296 (0.0275)   0.0530 (0.0340)    -0.0103 (0.0292)   -0.0137 (0.0238)
(15) E_T[(z_t² - 1)z_{t-5}] = 0          -0.0094 (0.0240)   0.0567 (0.0342)    0.0153 (0.0287)    0.0064 (0.0238)
(16) E_T[(z_t² - 1)z_{t-6}] = 0          0.0281 (0.0216)    0.0038 (0.0350)    -0.0170 (0.0253)   0.0417 (0.0326)
(17) E_T[z_t z_{t-1}] = 0                0.0265 (0.0236)    0.0127 (0.0346)    0.0383 (0.0243)    0.0188 (0.0226)
(18) E_T[z_t z_{t-2}] = 0                0.0133 (0.0157)    -0.0176 (0.0283)   -0.0445 (0.0174)   -0.0434 (0.0158)
(19) E_T[z_t z_{t-3}] = 0                0.0406 (0.0158)    0.0012 (0.0262)    0.0019 (0.0175)    0.0140 (0.0152)
(20) E_T[z_t z_{t-4}] = 0                0.0580 (0.0161)    0.0056 (0.0253)    0.0211 (0.0172)    0.0169 (0.0153)
(21) E_T[z_t z_{t-5}] = 0                0.0516 (0.0163)    0.0164 (0.0251)    0.0250 (0.0174)    0.0121 (0.0158)
(22) E_T[z_t z_{t-6}] = 0                -0.0027 (0.0158)   0.0081 (0.0261)    -0.0040 (0.0172)   -0.0211 (0.0150)

(1)-(16): χ²(16)                         39.1111 (0.0010)   45.1608 (0.0001)   31.7033 (0.011)    25.1116 (0.0679)
(1)-(22): χ²(22)                         94.0156 (0.0000)   52.1272 (0.0003)   67.1231 (0.0000)   63.6383 (0.0000)
Table 3 reports the maximum likelihood estimates (MLE) for the models selected by the Schwarz criterion. Various Wald and conditional moment specification tests are given in Tables 4 and 5.

9.2.2. Persistence of shocks to volatility
As in Nelson (1989, 1991), the ARMA(2,1) models selected for three of the four data sets can be decomposed into the product of two AR(1) components, one of which has very long-lived shocks, with an AR root very close to one, the other of which exhibits short-lived shocks, with an AR root very far from one; i.e. $(1 - \beta_1 L - \beta_2 L^2) = (1 - \Delta_1 L)(1 - \Delta_2 L)$, where $|\Delta_1| \geq |\Delta_2|$. When the estimated AR roots are real, a useful gauge of the persistence of shocks in an AR(1) model is the estimated "half life"; that is, the value of $n$ for which $\Delta^n = \tfrac{1}{2}$. For the Dow 1885-1914, the Standard 90 and the S&P 500 the estimated half lives of the long-lived components are about 119 days, 4½ years and 329 days, respectively. The corresponding estimated half lives of the short-lived components are only 5.2, 3.7 and 6.2 days, respectively.²⁸,²⁹
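The half lives quoted above follow directly from the AR roots reported in Table 3. A quick check, not from the chapter, using the point estimates of Δ₁ and Δ₂:

```python
import numpy as np

def half_life(delta):
    # The value of n solving delta**n = 1/2 for an AR(1) root in (0, 1).
    return np.log(0.5) / np.log(delta)

for name, d1, d2 in [("Dow 1885-1914", 0.9942, 0.8759),
                     ("Standard 90", 0.9994, 0.8303),
                     ("S&P 500", 0.9979, 0.8945)]:
    # Prints roughly 119/5.2, 1155/3.7 and 330/6.2 days, matching the text.
    print(name, round(half_life(d1), 1), round(half_life(d2), 1), "days")
```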
[Figure 3. Conditional distribution of returns: estimated densities of the standardized residuals z for each of the four data sets. Solid: nonparametric; dashed: parametric; short dashes: standard normal.]

9.2.3. Conditional mean of returns
The estimated ρ₂ terms strongly support the results of LeBaron (1992) of a negative relationship between the conditional variance and the conditional serial correlation in returns. In particular, ρ₂ is significantly positive in each data set, both statistically and economically. For example, for the Standard 90 data, the fitted conditional first order correlation in returns is 0.17 when $\sigma_t^2$ is at the 10th percentile of its fitted sample values, but equals -0.07 when $\sigma_t^2$ is at the 90th percentile. The implied variation in returns serial correlation is similar in the other data sets. The relatively simple specification of $\mu(r_{t-1}, \sigma_t^2)$ remains inadequate, however, as can be seen from the conditional moment tests reported in Table 5. The 17th through 22nd conditions test for serial correlation in the fitted $z_t$'s at lags one through six. In each data set, significant serial correlation is found at the higher lags.
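Orthogonality conditions (17)-(22) in Table 5 are simply the sample analogues of E[z_t z_{t-j}] = 0 at lags one through six. A minimal sketch, not from the chapter, of the moments being tested and their conventional standard errors:

```python
import numpy as np

def serial_corr_conditions(z, max_lag=6):
    """Sample moments E_T[z_t z_{t-j}] for j = 1..max_lag, together with
    standard errors based on the sample dispersion of the moment itself."""
    z = np.asarray(z)
    out = []
    for j in range(1, max_lag + 1):
        m = z[j:] * z[:-j]
        out.append((j, m.mean(), m.std() / np.sqrt(len(m))))
    return out
```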
9.2.4. Conditional distribution of returns
Figure 3 plots the fitted generalized t density of the $z_t$'s against both a standard normal and a nonparametric density estimate constructed from the fitted $z_t$'s using a Gaussian kernel with the bandwidth selection method of Silverman (1986, pp. 45-48). The parametric and nonparametric densities appear quite close, with the exception of the Dow 1914-1928 data, which exhibits strong negative skewness in $\hat z_t$. Further aspects of the fitted conditional distribution are checked in the first three conditional moment specification tests reported in Table 5. These three orthogonality conditions test that the standardized residuals $\hat z_t \equiv \hat\varepsilon_t\hat\sigma_t^{-1}$ have mean zero, unit variance, and no skewness.³⁰ In the first three data sets the $\hat z_t$ series exhibit statistically significant, though not overwhelmingly so, negative skewness.

²⁸This is consistent with recent work by Ding et al. (1993), in which the empirical autocorrelations of absolute returns from several financial data sets are found to exhibit rapid decay at short lags but much slower decay at longer lags. This is also the motivation behind the permanent/transitory components ARCH model introduced by Engle and Lee (1992, 1993), and the fractionally integrated ARCH models recently proposed by Baillie et al. (1993).
²⁹Volatility in the Dow 1914-1928 data shows much less persistence. The half life associated with the AR(1) model selected by the Schwarz (1978) criterion is only about 7.3 days. For the ARMA(2,1) model selected by the AIC for this data set, the half lives associated with the two AR roots are only 24 and 3.3 days, respectively.
³⁰More precisely, the third orthogonality condition tests that $E_T[z_t|z_t|] = 0$ rather than $E_T[z_t^3] = 0$. We use this test because it requires only the existence of a fourth conditional moment for $z_t$ rather than a sixth conditional moment.
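The nonparametric curves in Figure 3 are ordinary Gaussian kernel estimates. A minimal sketch, not from the chapter, using Silverman's (1986) rule-of-thumb bandwidth:

```python
import numpy as np

def gaussian_kde(z, grid):
    """Gaussian kernel density estimate of the fitted z's evaluated on a
    1-D numpy array `grid`, with Silverman's rule-of-thumb bandwidth."""
    z = np.asarray(z)
    n = len(z)
    # Silverman (1986): h = 0.9 * min(std, IQR / 1.34) * n**(-1/5).
    iqr = np.subtract(*np.percentile(z, [75, 25]))
    h = 0.9 * min(z.std(), iqr / 1.34) * n ** (-0.2)
    u = (grid[:, None] - z[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
```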
Table 6
Frequency of tail events.ᵃ

      Dow 1885-1914        Dow 1914-1928        Standard 90          S&P 500
      ARMA(2,1)            AR(1)                1928-1952            1953-1990
                                                ARMA(2,1)            ARMA(2,1)
N     Expected   Actual    Expected   Actual    Expected   Actual    Expected   Actual

2     421.16     405       180.92     177       369.89     363       458.85     432
3     63.71      74        31.11      33        76.51      81        72.60      57
4     11.54      12        6.99       10        18.76      23        13.83      14
5     2.61       4         2.01       3         5.47       4         3.27       6
6     0.72       2         0.70       1         1.86       1         0.94       5
7     0.23       1         0.28       0         0.71       1         0.31       3
8     9.56×10⁻²  0         0.13       0         0.30       1         0.12       2
9     3.89×10⁻²  0         0.06       0         0.14       0         0.05       2
10    1.73×10⁻²  0         0.03       0         0.07       0         0.02       2
11    8.25×10⁻³  0         0.01       0         0.04       0         0.01       1

ᵃThe table reports the expected and the actual number of observations exceeding N conditional standard deviations.
The original motivation for adopting the generalized t-distribution was that the two shape parameters η and ψ would allow the model to fit both the tails and the central part of the conditional distribution. Table 6 gives the expected and the actual number of $z_t$'s in each data set exceeding N standard deviations. In the S&P 500 data, the number of outliers is still too large. In the other data sets, the tail fit seems adequate. As noted above, the generalized t-distribution nests both the Student's t (η = 2) and the GED (ψ = ∞). Interestingly, in only two of the data sets does a t-test for the null hypothesis that η = 2 reject at standard levels, and then only marginally. Thus, the improved fit appears to come from the t component rather than the GED component of the generalized t-distribution. In total, the generalized t-distribution is a marked improvement over the GED, though perhaps not over the usual Student's t-distribution. Nevertheless, the generalized t is not entirely adequate, as it does not account for the fairly small skewness in the fitted $z_t$'s, and also appears to have insufficiently thick tails for the S&P 500 data.
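The expected frequencies in Table 6 are simply the sample size times the fitted tail probabilities, which can be checked by numerical integration of the density (9.6). A minimal sketch, not from the chapter, reusing gen_t_logpdf from the earlier sketch:

```python
import numpy as np
from scipy.integrate import quad

def expected_exceedances(T, N, eta, psi):
    """Expected number of |z_t| > N in T draws from the fitted
    generalized t with unit scale (sigma = 1)."""
    pdf = lambda z: np.exp(gen_t_logpdf(z, 1.0, eta, psi))
    tail, _ = quad(pdf, N, np.inf)   # one-sided tail probability
    return 2 * T * tail              # density is symmetric about zero
```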
News impact function
In line with the results for the EGARCH model reported in Nelson (1989,1991), the “leverage effect” term 8, in the g(., .) function is significantly negative in each of the data sets, while the “magnitude effect” term y1 is always positive, and significantly so except in the Dow 1914-1928 data. There are important differences, however. The EGARCH parameter restrictions that p = 1, y0 = y2 = 8, = 8, = 0 are decisively rejected in three of the four data sets. The estimated g(zt, C-J:)functions are plotted in Figure 4, from which the differences with the piecewise linear EGARCH g(z,) formulation are apparent.
[Figure 4. Estimated news impact functions $g(z_t, \sigma_t^2)$ plotted against z at different volatility levels. Solid: median σ; dashed: low σ; short dashes: high σ. Panels shown include Dow 1885-1914 and Standard 90, 1928-1952.]
To better understand why the standard EGARCH model is rejected, consider more closely the differences between the specification of the $g(z_t, \sigma_t^2)$ function in equation (9.9) and the EGARCH formulation in equation (1.11). Firstly, the parameters γ₀ and θ₀ allow the conditional variance of $\ln(\sigma_t^2)$ and the conditional correlation between $\ln(\sigma_t^2)$ and $z_t$ to change as functions of $\sigma_t^2$. Secondly, the parameters ρ, γ₂, and θ₂ give the model an added flexibility in how much weight to assign to large versus small values of $z_t$. As reported in Table 4, the EGARCH assumption that γ₀ = θ₀ = 0 is decisively rejected in the Dow 1885-1914 and 1914-1928 data sets, but not for either the Standard 90 or the S&P 500 data sets. For none of the four data sets is the estimated value of γ₀ significantly different from 0 at conventional levels. The estimated value of θ₀ is always negative, however, and very significantly so in the first two data sets, indicating that the "leverage effect" is more important in periods of high volatility than in periods of low volatility. The intuition that the influence of large outliers should be limited by setting θ₂ > 0 and γ₂ > 0 receives mixed support from the data. The estimated values of γ₂ and three of the estimated θ₂'s are positive, but only the estimate of γ₂ for the S&P 500 data is significantly positive at standard levels. We also note that if the data is generated by a stochastic volatility model, as opposed to an ARCH model, with conditionally generalized t-distributed errors, the asymptotically optimal ARCH filter would set η = ρ and γ₂ = ψ⁻¹b⁻ᵑ. The results in Table 4 indicate that the η = ρ restriction is not rejected, but that γ₂ = ψ⁻¹b⁻ᵑ is not supported by the data. The estimated values of γ₂ are "too low" relative to the asymptotically optimal filter for the stochastic volatility model.
10. Conclusion

This chapter has focused on a wide range of theoretical properties of ARCH models. It has also presented some new important empirical results, but has not attempted to survey the literature on applications, a recent survey of which can be found in Bollerslev et al. (1992).³¹ Three of the most active lines of inquiry are prominently surveyed here, however.

The first concerns the general parametrizations of univariate discrete time models of time-varying heteroskedasticity. From the original ARCH model, the literature has focused upon GARCH, EGARCH, IGARCH, ARCH-M, AGARCH, NGARCH, QARCH, QTARCH, STARCH, SWARCH and many other formulations with particular distinctive properties. Not only has this literature been surveyed here, but it has been expanded by the analysis of variations in the EGARCH model.

Second, we have explored the relations between the discrete time models and the very popular continuous time diffusion processes that are widely used in
finance. Very useful approximation theorems have been developed, which hold with increasing accuracy when the length of the sampling interval diminishes.

The third area of important investigation concerns the analysis of multivariate ARCH processes. This problem is more complex than the specification of univariate models because of the interest in simultaneously modeling a large number of variables, or assets, without having to estimate an intractable large number of parameters. Several multivariate formulations have been proposed, but no clear winners have yet emerged, either from a theoretical or an empirical point of view.

³¹Other recent surveys of the ARCH methodology are given in Bera and Higgins (1993) and Nijman and Palm (1993).

References

Akaike, H. (1973) "Information Theory and an Extension of the Maximum Likelihood Principle", in: B.N. Petrov and F. Csáki, eds., Second International Symposium on Information Theory. Akadémiai Kiadó: Budapest.
Amemiya, T. (1985) Advanced Econometrics. Harvard University Press: Cambridge, MA.
Amin, K.I. and V.K. Ng (1993) "Equilibrium Option Valuation with Systematic Stochastic Volatility", Journal of Finance, 48, 881-910.
Andersen, T.G. (1992a) Volatility, unpublished manuscript, J.L. Kellogg Graduate School of Management, Northwestern University.
Andersen, T.G. (1992b) Return Volatility and Trading Volume in Financial Markets: An Information Flow Interpretation of Stochastic Volatility, unpublished manuscript, J.L. Kellogg Graduate School of Management, Northwestern University.
Anderson, T.W. (1971) The Statistical Analysis of Time Series. John Wiley and Sons: New York, NY.
Andrews, D.W.K. and W. Ploberger (1992) Optimal Tests when a Nuisance Parameter Is Present only under the Alternative, unpublished manuscript, Department of Economics, Yale University.
Andrews, D.W.K. and W. Ploberger (1993) Admissibility of the Likelihood Ratio Test when a Nuisance Parameter Is Present only under the Alternative, unpublished manuscript, Department of Economics, Yale University.
Attanasio, O. (1991) "Risk, Time-Varying Second Moments and Market Efficiency", Review of Economic Studies, 58, 479-494.
Baek, E.G. and W.A. Brock (1992) "A Nonparametric Test for Independence of a Multivariate Time Series", Statistica Sinica, 2, 137-156.
Baillie, R.T. and T. Bollerslev (1989) "The Message in Daily Exchange Rates: A Conditional Variance Tale", Journal of Business and Economic Statistics, 7, 297-305.
Baillie, R.T. and T. Bollerslev (1990) "A Multivariate Generalized ARCH Approach to Modeling Risk Premia in Forward Foreign Exchange Rate Markets", Journal of International Money and Finance, 9, 309-324.
Baillie, R.T. and T. Bollerslev (1991) "Intra Day and Inter Day Volatility in Foreign Exchange Rates", Review of Economic Studies, 58, 565-585.
Baillie, R.T. and T. Bollerslev (1992) "Prediction in Dynamic Models with Time Dependent Conditional Variances", Journal of Econometrics, 52, 91-113.
Baillie, R.T., T. Bollerslev and H.O. Mikkelsen (1993) Fractionally Integrated Autoregressive Conditional Heteroskedasticity, unpublished manuscript, J.L. Kellogg Graduate School of Management, Northwestern University.
Bekaert, G. and R.J. Hodrick (1993) "On Biases in the Measurement of Foreign Exchange Risk Premiums", Journal of International Money and Finance, 12, 115-138.
Bera, A.K. and M.L. Higgins (1993) "ARCH Models: Properties, Estimation and Testing", Journal of Economic Surveys, 7, 305-366.
Bera, A.K. and S. Lee (1992) "Information Matrix Test, Parameter Heterogeneity and ARCH: A Synthesis", Review of Economic Studies, 60, 229-240.
Bera, A.K. and J-S. Roh (1991) A Moment Test of the Constancy of the Correlation Coefficient in the Bivariate GARCH Model, unpublished manuscript, Department of Economics, University of Illinois, Urbana-Champaign.
Bera, A.K., M.L. Higgins and S. Lee (1993) "Interaction Between Autocorrelation and Conditional Heteroskedasticity: A Random Coefficients Approach", Journal of Business and Economic Statistics, 10, 133-142.
Berndt, E.R., B.H. Hall, R.E. Hall and J.A. Hausman (1974) "Estimation and Inference in Nonlinear Structural Models", Annals of Economic and Social Measurement, 4, 653-665.
Black, F. (1976) "Studies of Stock Price Volatility Changes", Proceedings from the American Statistical Association, Business and Economic Statistics Section, 177-181.
Black, F. and M. Scholes (1973) "The Pricing of Options and Corporate Liabilities", Journal of Political Economy, 81, 637-659.
Blattberg, R.C. and N.J. Gonedes (1974) "A Comparison of the Stable and Student Distributions as Statistical Models for Stock Prices", Journal of Business, 47, 244-280.
Bollerslev, T. (1986) "Generalized Autoregressive Conditional Heteroskedasticity", Journal of Econometrics, 31, 307-327.
Bollerslev, T. (1987) "A Conditionally Heteroskedastic Time Series Model for Speculative Prices and Rates of Return", Review of Economics and Statistics, 69, 542-547.
Bollerslev, T. (1988) "On the Correlation Structure for the Generalized Autoregressive Conditional Heteroskedastic Process", Journal of Time Series Analysis, 9, 121-131.
Bollerslev, T. (1990) "Modelling the Coherence in Short-Run Nominal Exchange Rates: A Multivariate Generalized ARCH Approach", Review of Economics and Statistics, 72, 498-505.
Bollerslev, T. and I. Domowitz (1993) "Trading Patterns and the Behavior of Prices in the Interbank Foreign Exchange Market", Journal of Finance, 48, 1421-1443.
Bollerslev, T. and R.F. Engle (1993) "Common Persistence in Conditional Variances", Econometrica, 61, 166-187.
Bollerslev, T. and M. Melvin (1994) "Bid-Ask Spreads in the Foreign Exchange Market: An Empirical Analysis", Journal of International Economics, forthcoming.
Bollerslev, T. and J.M. Wooldridge (1992) "Quasi Maximum Likelihood Estimation and Inference in Dynamic Models with Time Varying Covariances", Econometric Reviews, 11, 143-172.
Bollerslev, T., R.F. Engle and J.M. Wooldridge (1988) "A Capital Asset Pricing Model with Time Varying Covariances", Journal of Political Economy, 96, 116-131.
Bollerslev, T., R.Y. Chou and K.F. Kroner (1992) "ARCH Modeling in Finance: A Review of the Theory and Empirical Evidence", Journal of Econometrics, 52, 5-59.
Bougerol, P. and N. Picard (1992) "Stationarity of GARCH Processes and of Some Non-Negative Time Series", Journal of Econometrics, 52, 115-128.
Box, G.E.P. and G.M. Jenkins (1976) Time Series Analysis: Forecasting and Control. Second Edition. Holden Day: San Francisco, CA.
Braun, P.A., D.B. Nelson and A.M. Sunier (1992) Good News, Bad News, Volatility, and Betas, unpublished manuscript, Graduate School of Business, University of Chicago.
Breusch, T. and A.R. Pagan (1979) "A Simple Test for Heteroskedasticity and Random Coefficient Variation", Econometrica, 47, 1287-1294.
Brock, W.A. and A. Kleidon (1992) "Periodic Market Closure and Trading Volume: A Model of Intra Day Bids and Asks", Journal of Economic Dynamics and Control, 16, 451-489.
Brock, W.A. and S.M. Potter (1992) Nonlinear Time Series and Macroeconometrics, unpublished manuscript, Department of Economics, University of Wisconsin, Madison.
Brock, W.A., W.D. Dechert and J.A. Scheinkman (1987) A Test for Independence Based on the Correlation Dimension, unpublished manuscript, Department of Economics, University of Wisconsin, Madison.
Brock, W.A., D.A. Hsieh and B. LeBaron (1991) Nonlinear Dynamics, Chaos and Instability: Statistical Theory and Economic Evidence. MIT Press: Cambridge, MA.
Cai, J. (1994) "A Markov Model of Unconditional Variance in ARCH", Journal of Business and Economic Statistics, forthcoming.
Campbell, J.Y. and L. Hentschel (1992) "No News is Good News: An Asymmetric Model of Changing Volatility in Stock Returns", Journal of Financial Economics, 31, 281-318.
Chou, R.Y. (1988) "Volatility Persistence and Stock Valuations: Some Empirical Evidence Using GARCH", Journal of Applied Econometrics, 3, 279-294.
Christie, A.A. (1982) "The Stochastic Behavior of Common Stock Variances: Value, Leverage and Interest Rate Effects", Journal of Financial Economics, 10, 407-432.
Clark, P.K. (1973) "A Subordinated Stochastic Process Model with Finite Variance for Speculative Prices", Econometrica, 41, 135-156.
Cornell, B. (1978) "Using the Options Pricing Model to Measure the Uncertainty Producing Effect of Major Announcements", Financial Management, 7, 54-59.
Crowder, M.J. (1976) "Maximum Likelihood Estimation with Dependent Observations", Journal of the Royal Statistical Society, 38, 45-53.
Danielson, J. and J.-F. Richard (1993) "Accelerated Gaussian Importance Sampler with Application to Dynamic Latent Variable Models", Journal of Applied Econometrics, 8, S153-S173.
Davidian, M. and R.J. Carroll (1987) "Variance Function Estimation", Journal of the American Statistical Association, 82, 1079-1091.
Davies, R.B. (1977) "Hypothesis Testing when a Nuisance Parameter is Present only under the Alternative", Biometrika, 64, 247-254.
Day, T.E. and C.M. Lewis (1992) "Stock Market Volatility and the Information Content of Stock Index Options", Journal of Econometrics, 52, 267-288.
Demos, A. and E. Sentana (1991) Testing for GARCH Effects: A One-Sided Approach, unpublished manuscript, London School of Economics.
Diebold, F.X. (1987) "Testing for Serial Correlation in the Presence of ARCH", Proceedings from the American Statistical Association, Business and Economic Statistics Section, 323-328.
Diebold, F.X. (1988) Empirical Modeling of Exchange Rate Dynamics. Springer Verlag: New York, NY.
Diebold, F.X. and M. Nerlove (1989) "The Dynamics of Exchange Rate Volatility: A Multivariate Latent Factor ARCH Model", Journal of Applied Econometrics, 4, 1-21.
Ding, Z., R.F. Engle and C.W.J. Granger (1993) "A Long Memory Property of Stock Market Returns and a New Model", Journal of Empirical Finance, 1, 83-106.
Drost, F.C. and T.E. Nijman (1993) "Temporal Aggregation of GARCH Processes", Econometrica, 61, 909-927.
Engle, R.F. (1982) "Autoregressive Conditional Heteroskedasticity with Estimates of the Variance of U.K. Inflation", Econometrica, 50, 987-1008.
Engle, R.F. (1984) "Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics", in: Z. Griliches and M.D. Intriligator, eds., Handbook of Econometrics, Vol. II. North-Holland: Amsterdam.
Engle, R.F. (1987) Multivariate GARCH with Factor Structures - Cointegration in Variance, unpublished manuscript, Department of Economics, UCSD.
Engle, R.F. (1990) "Discussion: Stock Market Volatility and the Crash of '87", Review of Financial Studies, 3, 103-106.
Engle, R.F. and T. Bollerslev (1986) "Modelling the Persistence of Conditional Variances", Econometric Reviews, 5, 1-50, 81-87.
Engle, R.F. and G. Gonzalez-Rivera (1991) "Semiparametric ARCH Models", Journal of Business and Economic Statistics, 9, 345-359.
Engle, R.F. and C.W.J. Granger (1987) "Cointegration and Error Correction: Representation, Estimation and Testing", Econometrica, 55, 251-276.
Engle, R.F. and S. Kozicki (1993) "Testing for Common Features", Journal of Business and Economic Statistics, 11, 369-379.
Engle, R.F. and K.F. Kroner (1993) Multivariate Simultaneous Generalized ARCH, unpublished manuscript, Department of Economics, UCSD.
Engle, R.F. and G.G.J. Lee (1992) A Permanent and Transitory Component Model of Stock Return Volatility, unpublished manuscript, Department of Economics, UCSD.
Engle, R.F. and G.G.J. Lee (1993) Long Run Volatility Forecasting for Individual Stocks in a One Factor Model, unpublished manuscript, Department of Economics, UCSD.
Engle, R.F. and C. Mustafa (1992) "Implied ARCH Models from Options Prices", Journal of Econometrics, 52, 289-311.
Engle, R.F. and V.K. Ng (1993) "Measuring and Testing the Impact of News on Volatility", Journal of Finance, 48, 1749-1778.
Engle, R.F. and R. Susmel (1993) "Common Volatility in International Equity Markets", Journal of Business and Economic Statistics, 11, 167-176.
Engle, R.F., D.F. Hendry and D. Trumble (1985) "Small Sample Properties of ARCH Estimators and Tests", Canadian Journal of Economics, 18, 66-93.
Engle, R.F., D.M. Lilien and R.P. Robins (1987) "Estimating Time Varying Risk Premia in the Term Structure: The ARCH-M Model", Econometrica, 55, 391-407.
Engle, R.F., T. Ito and W-L. Lin (1990a) "Meteor Showers or Heat Waves? Heteroskedastic Intra Daily Volatility in the Foreign Exchange Market", Econometrica, 58, 525-542.
Engle, R.F., V. Ng and M. Rothschild (1990b) "Asset Pricing with a Factor ARCH Covariance Structure: Empirical Estimates for Treasury Bills", Journal of Econometrics, 45, 213-238.
Engle, R.F., C-H. Hong, A. Kane and J. Noh (1993) "Arbitrage Valuation of Variance Forecasts with Simulated Options", in: D.M. Chance and R.R. Trippi, eds., Advances in Futures and Options Research. JAI Press: Greenwich, Connecticut.
Ethier, S.N. and T.G. Kurtz (1986) Markov Processes: Characterization and Convergence. John Wiley: New York, NY.
Fama, E.F. (1963) "Mandelbrot and the Stable Paretian Hypothesis", Journal of Business, 36, 420-429.
Fama, E.F. (1965) "The Behavior of Stock Market Prices", Journal of Business, 38, 34-105.
Foster, D.P. and D.B. Nelson (1992) Rolling Regressions, unpublished manuscript, Graduate School of Business, University of Chicago.
French, K.R. and R. Roll (1986) "Stock Return Variances: The Arrival of Information and the Reaction of Traders", Journal of Financial Economics, 17, 5-26.
French, K.R., G.W. Schwert and R.F. Stambaugh (1987) "Expected Stock Returns and Volatility", Journal of Financial Economics, 19, 3-30.
Gallant, A.R. and G. Tauchen (1989) "Semi Non-Parametric Estimation of Conditionally Constrained Heterogeneous Processes: Asset Pricing Applications", Econometrica, 57, 1091-1120.
Gallant, A.R., D.A. Hsieh and G. Tauchen (1991) "On Fitting a Recalcitrant Series: The Pound/Dollar Exchange Rate 1974-83", in: W.A. Barnett, J. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge University Press: Cambridge.
Gallant, A.R., P.E. Rossi and G. Tauchen (1992) "Stock Prices and Volume", Review of Financial Studies, 5, 199-242.
Gallant, A.R., P.E. Rossi and G. Tauchen (1993) "Nonlinear Dynamic Structures", Econometrica, 61, 871-907.
Gennotte, G. and T.A. Marsh (1991) Variations in Economic Uncertainty and Risk Premiums on Capital Assets, unpublished manuscript, Department of Finance, University of California, Berkeley.
Gerity, M.S. and J.H. Mulherin (1992) "Trading Halts and Market Activity: An Analysis of Volume at the Open and the Close", Journal of Finance, 47, 1765-1784.
Geweke, J. (1989a) "Exact Predictive Densities in Linear Models with ARCH Disturbances", Journal of Econometrics, 44, 307-325.
Geweke, J. (1989b) "Bayesian Inference in Econometric Models Using Monte Carlo Integration", Econometrica, 57, 1317-1339.
Glosten, L.R., R. Jagannathan and D. Runkle (1993) "On the Relation between the Expected Value and the Volatility of the Nominal Excess Return on Stocks", Journal of Finance, 48, 1779-1801.
Gourieroux, C. and A. Monfort (1992) "Qualitative Threshold ARCH Models", Journal of Econometrics, 52, 159-199.
Gourieroux, C., A. Holly and A. Monfort (1982) "Likelihood Ratio Test, Wald Test and Kuhn-Tucker Test in Linear Models with Inequality Constraints on Regression Parameters", Econometrica, 50, 63-80.
Granger, C.W.J., R.F. Engle and R.P. Robins (1986) "Wholesale and Retail Prices: Bivariate Time-Series Modelling with Forecastable Error Variances", in: D. Belsley and E. Kuh, eds., Model Reliability. MIT Press: Cambridge, MA, pp. 1-17.
Hamao, Y., R.W. Masulis and V.K. Ng (1990) "Correlations in Price Changes and Volatility Across International Stock Markets", Review of Financial Studies, 3, 281-307.
Hamilton, J.D. and R. Susmel (1992) Autoregressive Conditional Heteroskedasticity and Changes in Regime, unpublished manuscript, Department of Economics, UCSD.
Harris, L. (1986) "A Transaction Data Study of Weekly and Intradaily Patterns in Stock Returns", Journal of Financial Economics, 16, 99-117.
Harvey, C.R. and R.D. Huang (1991) "Volatility in the Foreign Currency Futures Market", Review of Financial Studies, 4, 543-569.
Harvey, C.R. and R.D. Huang (1992) Information Trading and Fixed Income Volatility, unpublished manuscript, Department of Finance, Duke University.
Harvey, A.C., E. Ruiz and E. Sentana (1992) "Unobserved Component Time Series Models with ARCH Disturbances", Journal of Econometrics, 52, 129-158.
Harvey, A.C., E. Ruiz and N. Shephard (1994) "Multivariate Stochastic Volatility Models", Review of Economic Studies, forthcoming.
Heston, S.L. (1991) A Closed Form Solution for Options with Stochastic Volatility, unpublished manuscript, Department of Finance, Yale University.
Higgins, M.L. and A.K. Bera (1992) "A Class of Nonlinear ARCH Models", International Economic Review, 33, 137-158.
Hong, P.Y. (1991) "The Autocorrelation Structure for the GARCH-M Process", Economics Letters, 37, 129-132.
Hsieh, D.A. (1991) "Chaos and Nonlinear Dynamics: Applications to Financial Markets", Journal of Finance, 46, 1839-1878.
Huber, P.J. (1977) Robust Statistical Procedures. SIAM: Bristol, United Kingdom.
Hull, J. and A. White (1987) "The Pricing of Options on Assets with Stochastic Volatilities", Journal of Finance, 42, 281-300.
Jacquier, E., N.G. Polson and P.E. Rossi (1994) "Bayesian Analysis of Stochastic Volatility Models", Journal of Business and Economic Statistics, forthcoming.
Karatzas, I. and S.E. Shreve (1988) Brownian Motion and Stochastic Calculus. Springer-Verlag: New York, NY.
Karpoff, J.M. (1987) "The Relation Between Price Changes and Trading Volume: A Survey", Journal of Financial and Quantitative Analysis, 22, 109-126.
Kim, C.M. (1989) Nonlinear Dependence of Exchange Rate Changes, unpublished Ph.D. dissertation, Graduate School of Business, University of Chicago.
King, M., E. Sentana and S. Wadhwani (1994) "Volatility and Links Between National Stock Markets", Econometrica, forthcoming.
Kitagawa, G. (1987) "Non-Gaussian State Space Modelling of Nonstationary Time Series", Journal of the American Statistical Association, 82, 1032-1063.
Kodde, D.A. and F.C. Palm (1986) "Wald Criterion for Jointly Testing Equality and Inequality Restrictions", Econometrica, 54, 1243-1248.
Kraft, D.F. and R.F. Engle (1982) Autoregressive Conditional Heteroskedasticity in Multiple Time Series, unpublished manuscript, Department of Economics, UCSD.
Krengel, U. (1985) Ergodic Theorems. Walter de Gruyter: Berlin, Germany.
Kroner, K.F. and S. Claessens (1991) "Optimal Dynamic Hedging Portfolios and the Currency Composition of External Debt", Journal of International Money and Finance, 10, 131-148.
Kroner, K.F. and J. Sultan (1991) "Exchange Rate Volatility and Time Varying Hedge Ratios", in: S.G. Rhee and R.P. Chang, eds., Pacific-Basin Capital Markets Research, Vol. II. North-Holland: Amsterdam.
Lamoureux, C.G. and W.D. Lastrapes (1990) "Heteroskedasticity in Stock Return Data: Volume versus GARCH Effects", Journal of Finance, 45, 221-229.
Lamoureux, C.G. and W.D. Lastrapes (1994) "Endogenous Trading Volume and Momentum in Stock Return Volatility", Journal of Business and Economic Statistics, forthcoming.
LeBaron, B. (1992) "Some Relations Between Volatility and Serial Correlation in Stock Market Returns", Journal of Business, 65, 199-220.
Lee, J.H.H. and M.L. King (1993) "A Locally Most Mean Powerful Based Score Test for ARCH and GARCH Regression Disturbances", Journal of Business and Economic Statistics, 11, 17-27.
Lee, S.W. and B.E. Hansen (1993) Asymptotic Theory for the GARCH(1,1) Quasi-Maximum Likelihood Estimator, unpublished manuscript, Department of Economics, University of Rochester.
Lin, W.L. (1992) "Alternative Estimators for Factor GARCH Models - A Monte Carlo Comparison", Journal of Applied Econometrics, 7, 259-279.
Lin, W.L., R.F. Engle and T. Ito (1994) "Do Bulls and Bears Move Across Borders? International Transmission of Stock Returns and Volatility as the World Turns", Review of Financial Studies, forthcoming.
Ljung, G.M. and G.E.P. Box (1978) "On a Measure of Lack of Fit in Time Series Models", Biometrika, 65, 297-303.
Lumsdaine, R.L. (1992a) Asymptotic Properties of the Quasi-Maximum Likelihood Estimator in GARCH(1,1) and IGARCH(1,1) Models, unpublished manuscript, Department of Economics, Princeton University.
Lumsdaine, R.L. (1992b) Finite Sample Properties of the Maximum Likelihood Estimator in GARCH(1,1) and IGARCH(1,1) Models: A Monte Carlo Investigation, unpublished manuscript, Department of Economics, Princeton University.
MacKinnon, J.G. and H. White (1985) "Some Heteroskedasticity Consistent Covariance Matrix Estimators with Improved Finite Sample Properties", Journal of Econometrics, 29, 305-325.
Mandelbrot, B. (1963) "The Variation of Certain Speculative Prices", Journal of Business, 36, 394-419.
Marcus, M. and H. Minc (1964) A Survey of Matrix Theory and Matrix Inequalities. Prindle, Weber and Schmidt: Boston, MA.
McCurdy, T.H. and T. Stengos (1992) "A Comparison of Risk Premium Forecasts Implied by Parametric and Nonparametric Conditional Mean Estimators", Journal of Econometrics, 52, 225-244.
McDonald, J.B. and W.K. Newey (1988) "Partially Adaptive Estimation of Regression Models via the Generalized t Distribution", Econometric Theory, 4, 428-457.
Melino, A. and S. Turnbull (1990) "Pricing Foreign Currency Options with Stochastic Volatility", Journal of Econometrics, 45, 239-265.
Merton, R.C. (1973) "An Intertemporal Capital Asset Pricing Model", Econometrica, 41, 867-887.
Merton, R.C. (1980) "On Estimating the Expected Return on the Market", Journal of Financial Economics, 8, 323-361.
Milhoj, A. (1985) "The Moment Structure of ARCH Processes", Scandinavian Journal of Statistics, 12, 281-292.
Murphy, K. and R. Topel (1985) "Estimation and Inference in Two-Step Econometric Models", Journal of Business and Economic Statistics, 3, 370-379.
Nelson, D.B. (1989) "Modeling Stock Market Volatility Changes", Proceedings from the American Statistical Association, Business and Economic Statistics Section, 93-98.
Nelson, D.B. (1990a) "ARCH Models as Diffusion Approximations", Journal of Econometrics, 45, 7-38.
Nelson, D.B. (1990b) "Stationarity and Persistence in the GARCH(1,1) Model", Econometric Theory, 6, 318-334.
Nelson, D.B. (1991) "Conditional Heteroskedasticity in Asset Returns: A New Approach", Econometrica, 59, 347-370.
Nelson, D.B. (1992) "Filtering and Forecasting with Misspecified ARCH Models I: Getting the Right Variance with the Wrong Model", Journal of Econometrics, 52, 61-90.
Nelson, D.B. and C.Q. Cao (1992) "Inequality Constraints in the Univariate GARCH Model", Journal of Business and Economic Statistics, 10, 229-235.
Nelson, D.B. and D.P. Foster (1991) Filtering and Forecasting with Misspecified ARCH Models II: Making the Right Forecast with the Wrong Model, unpublished manuscript, Graduate School of Business, University of Chicago.
Nelson, D.B. and D.P. Foster (1994) "Asymptotic Filtering Theory for Univariate ARCH Models", Econometrica, 62, 1-41.
Newey, W.K. (1985) "Maximum Likelihood Specification Testing and Conditional Moment Tests", Econometrica, 53, 1047-1070.
Ng, V., R.F. Engle and M. Rothschild (1992) "A Multi-Dynamic Factor Model for Stock Returns", Journal of Econometrics, 52, 245-265.
Nijman, T.E. and F.C. Palm (1993) "GARCH Modelling of Volatility: An Introduction to Theory and Applications", in: A.J. de Zeeuw, ed., Advanced Lectures in Quantitative Economics. Academic Press: London.
Nijman, T.E. and E. Sentana (1993) Marginalization and Contemporaneous Aggregation in Multivariate GARCH Processes, unpublished manuscript, Center for Economic Research, Tilburg University.
Nummelin, E. and P. Tuominen (1982) "Geometric Ergodicity of Harris Recurrent Markov Chains with Applications to Renewal Theory", Stochastic Processes and Their Applications, 12, 187-202.
Pagan, A.R. (1984) "Econometric Issues in the Analysis of Regressions with Generated Regressors", International Economic Review, 25, 221-247.
Pagan, A.R. (1986) "Two Stage and Related Estimators and their Applications", Review of Economic Studies, 53, 517-538.
Pagan, A.R. and Y.S. Hong (1991) "Nonparametric Estimation and the Risk Premium", in: W.A. Barnett, J. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge University Press: Cambridge.
Pagan, A.R. and H.C.L. Sabau (1987a) On the Inconsistency of the MLE in Certain Heteroskedastic Regression Models, unpublished manuscript, University of Rochester.
Pagan, A.R. and H.C.L. Sabau (1987b) Consistency Tests for Heteroskedasticity and Risk Models, unpublished manuscript, Department of Economics, University of Rochester.
Pagan, A.R. and G.W. Schwert (1990) "Alternative Models for Conditional Stock Volatility", Journal of Econometrics, 45, 267-290.
Pagan, A.R. and A. Ullah (1988) "The Econometric Analysis of Models with Risk Terms", Journal of Applied Econometrics, 3, 87-105.
Pagan, A.R., A.D. Hall and P.K. Trivedi (1983) "Assessing the Variability of Inflation", Review of Economic Studies, 50, 585-596.
Pardoux, E. and D. Talay (1985) "Discretization and Simulation of Stochastic Differential Equations", Acta Applicandae Mathematicae, 3, 23-47.
Parkinson, M. (1980) "The Extreme Value Method for Estimating the Variance of the Rate of Return", Journal of Business, 53, 61-65.
Patell, J.M. and M.A. Wolfson (1979) "Anticipated Information Releases Reflected in Call Option Prices", Journal of Accounting and Economics, 1, 117-140.
Patell, J.M. and M.A. Wolfson (1981) "The Ex-Ante and Ex-Post Price Effects of Quarterly Earnings Announcements Reflected in Option and Stock Prices", Journal of Accounting Research, 19, 434-458.
Poterba, J. and L. Summers (1986) "The Persistence of Volatility and Stock Market Fluctuations", American Economic Review, 76, 1142-1151.
Rich, R.W., J.E. Raymond and J.S. Butler (1992) "The Relationship between Forecast Dispersion and Forecast Uncertainty: Evidence from a Survey Data-ARCH Model", Journal of Applied Econometrics, 7, 131-148.
Royden, H.L. (1968) Real Analysis. Macmillan Publishing Co.: New York, NY.
Scheinkman, J. and B. LeBaron (1989) "Nonlinear Dynamics and Stock Returns", Journal of Business, 62, 311-337.
Schwarz, G. (1978) "Estimating the Dimension of a Model", Annals of Statistics, 6, 461-464.
Schwert, G.W. (1989a) "Why Does Stock Market Volatility Change Over Time?", Journal of Finance, 44, 1115-1153.
Schwert, G.W. (1989b) "Business Cycles, Financial Crises, and Stock Volatility", Carnegie-Rochester Conference Series on Public Policy, 39, 83-126.
Schwert, G.W. (1990) "Indexes of U.S. Stock Prices from 1802 to 1987", Journal of Business, 63, 399-426.
Schwert, G.W. and P.J. Seguin (1990) "Heteroskedasticity in Stock Returns", Journal of Finance, 45, 1129-1155.
Scott, L.O. (1987) "Option Pricing when the Variance Changes Randomly: Theory, Estimation and an Application", Journal of Financial and Quantitative Analysis, 22, 419-438.
Sentana, E. (1991) Quadratic ARCH Models: A Potential Re-Interpretation of ARCH Models, unpublished manuscript, London School of Economics.
Shephard, N. (1993) "Fitting Nonlinear Time Series Models with Applications to Stochastic Variance Models", Journal of Applied Econometrics, 8, S135-S152.
Silverman, B.W. (1986) Density Estimation for Statistics and Data Analysis. Chapman and Hall: London, United Kingdom.
Stambaugh, R.F. (1993) Estimating Conditional Expectations When Volatility Fluctuates, unpublished manuscript, The Wharton School, University of Pennsylvania.
Stroock, D.W. and S.R.S. Varadhan (1979) Multidimensional Diffusion Processes. Springer-Verlag: Berlin, Germany.
Tauchen, G. (1985) "Diagnostic Testing and Evaluation of Maximum Likelihood Models", Journal of Econometrics, 30, 415-443.
Tauchen, G. and M. Pitts (1983) "The Price Variability-Volume Relationship on Speculative Markets", Econometrica, 51, 485-505.
Taylor, S. (1986) Modeling Financial Time Series. Wiley and Sons: New York, NY.
Tsay, R.S. (1987) "Conditional Heteroskedastic Time Series Models", Journal of the American Statistical Association, 82, 590-604.
Tweedie, R.L. (1983a) "Criteria for Rates of Convergence of Markov Chains, with Application to Queueing and Storage Theory", in: J.F.C. Kingman and G.E.H. Reuter, eds., Probability, Statistics, and Analysis, London Mathematical Society Lecture Note Series No. 79. Cambridge University Press: Cambridge.
Tweedie, R.L. (1983b) "The Existence of Moments for Stationary Markov Chains", Journal of Applied Probability, 20, 191-196.
Watson, M.W. and R.F. Engle (1985) "Testing for Regression Coefficient Stability with a Stationary AR(1) Alternative", Review of Economics and Statistics, 67, 341-346.
Weiss, A.A. (1984) "ARMA Models with ARCH Errors", Journal of Time Series Analysis, 5, 129-143.
Weiss, A.A. (1986) "Asymptotic Theory for ARCH Models: Estimation and Testing", Econometric Theory, 2, 107-131.
West, K.D., H.J. Edison and D. Cho (1993) "A Utility Based Comparison of Some Models for Exchange Rate Volatility", Journal of International Economics, 35, 23-45.
White, H. (1980) “A Heteroskedastic-Consistent Covariance Matrix and a Direct Test for Heteroskedasticity”, Econometrica, 48,421~448. White, H, (1987) “Specification Testing in Dynamic Models”, in: T.F. Bewley, ed., Advances in honometrics: Fifth World Congress, Vol. 1. Cambridge University Press: Cambridge. White, H. (1994) Estimation. inference and Specification Analysis, forthcoming. Wiggins, J.B. (1987) “Option Values under Stochastic Volatility: Theory and Empirical Estimates”, Journal of Financial Economics, 19, 351-372. Wiggins, J.B. (1991) “Empirical Tests of the Bias and Efficiency of the Extreme-Value Variance Estimator for Common Stocks”, Journal of Business, 64, 417-432. Wolak, F.A. (1991) “The Local Nature of Hypothesis Tests Involving Inequality Constraints in Nonlinear Models”, Econometrica, 59, 981-995. Wooldridge, J.M. (1990) “A Unified Approach to Robust Regression Based Specification Tests”, Econometric Theory, 6, 17743. Wooldridge, J.M. (1994) “Estimation and Inference for Dependent Processes”, in: R.F. Engle and D. McFadden, eds., Handbook ofEconometrics, Vol. IV. North-Holland: Amsterdam, Chapter 45. Zakoian, J.-M. (1990) Threshold Heteroskedastic Models, unpublished manuscript, CREST, INSEE. Zarnowitz, V. and L.A. Lambros (1987) “Consensus and Uncertainty in Economic Prediction”, Journal of Political Economy, 95, 591-621.
Chapter 50
STATE-SPACE JAMES
MODELS*
D. HAMILTON
University of California, San Diego
Contents Abstract 1. The state-space representation 2. The Kalman filter
3.
4.
2.1.
Overview
2.2.
Derivation
of the Kalman
2.3.
Forecasting
2.4.
Smoothed
of a linear dynamic
system
3047
filter
of the Kalman
3048
filter
with the Kalman
3051
filter
3051
inference
2.5.
Interpretation
2.6.
Time-varying
2.7.
Other
of the Kalman coefficient
filter with non-normal
disturbances
3052 3053
models
3054
extensions
Statistical inference Kalman filter
about
unknown
parameters
using the 3055
3.1.
Maximum
likelihood
3.2.
Identification
3.3.
Asymptotic
properties
3.4.
Confidence
intervals
3.5.
Empirical
application
Discrete-valued
3041 3041 3046
3055
estimation
3057 of maximum for smoothed - an analysis
likelihood estimates
3058
estimates
3060
and forecasts
of the real interest
3060
rate
3062
state variables
state-space
representation
of the Markov-switching
Linear
4.2.
Optimal
4.3.
Extensions
3067
4.4.
Forecasting
3068
filter when the state variable
follows a Markov
model
3063
4.1.
chain
3064
*I am grateful to Gongpil Choi, Robert Engle and an anonymous referee for helpful comments, and to the NSF for support under grant SES8920752. Data and software used in this chapter can be obtained at no charge by writing James D. Hamilton, Department of Economics 0508, UCSD, La Jolla, CA 92093-0508, USA. Alternatively, data and software can be obtained by writing ICPSR, Institute for Social Research, P.O. Box 1248, Ann Arbor, MI 48106, USA. Handbook of Econometrics, Volume 1 V, Edited by R.F. Engle and D.L. McFadden 0 1994 Elsevier Science B.V. All rights reserved
J.D. Hamilton
3040
5.
4.5.
Smoothed
probabilities
4.6.
Maximum
likelihood
4.7.
Asymptotic
4.8.
Empirical
Non-normal
properties application
of maximum - another
and nonlinear
5.1.
Kitagawa’s
5.2.
Extended
5.3.
Other
References
3069
grid approximation Kalman
approaches
3070
estimation likelihood
estimates
look at the real interest
state-space
3071
models
for nonlinear,
non-normal
filter to nonlinear
3071 rate
3073 state-space
models
3073 3076
state-space
models
3077
3077
3041
Ch. 50: State-Space Models
Abstract This chapter reviews the usefulness of the Kalman filter for parameter estimation and inference about unobserved variables in linear dynamic systems. Applications include exact maximum likelihood estimation of regressions with ARMA disturbances, time-varying parameters, missing observations, forming an inference about and specification of business cycle the public’s expectations about inflation, dynamics. The chapter also reviews models of changes in regime and develops the parallel between such models and linear state-space models. The chapter concludes with a brief discussion of alternative approaches to nonlinear filtering.
1.
The state-space
representation
of a linear dynamic system
Many dynamic models can usefully be written in what is known as a state-space form. The value of writing a model in this form can be appreciated by considering a first-order autoregression (1.1)
Y,+1 =$Yr+st+r,
with E, N i.i.d. N(0, a’). Future values of y for this process depend on (Y,, y,_ 1,. . . ) only through the current value y,. This makes it extremely simple to analyze the dynamics of the process, make forecasts or evaluate the likelihood function. For example, equation (1.1) is easy to solve by recursive substitution, Yf+m = 4”y, + 4m-1Ey+l + 4m-2Et+2 + ... +q5l~~+~-~ from which the optimal
+E~+~
for
m-period-ahead
E(Y,+,lY,,Y,-,,...)=~mY,.
(1.2)
m= 1,2,..., forecast
is seen to be (1.3)
The process is stable if 14 1< 1. The idea behind a state-space representation of a more complicated linear system is to capture the dynamics of an observed (n x 1) vector Y, in terms of a possibly unobserved (I x 1) vector 4, known as the state vector for the system. The dynamics of the state vector are taken to be a vector generalization of (1.1): 5,+r =F&+n,+,.
(1.4)
J.D. Hamilton
3042
Here F denotes an (r x I) matrix and the (r x 1) vector II, is taken to be i.i.d. N(0, Q). Result (1.2) generalizes to
5t+m
=
F”& + F”-~I,+~ + Fr~~+~-r
where F” denotes
+v,+,
the matrix
+ Fm-2~t+2 + ... for
m= 1,2,...,
F multiplied
(1.5)
by itself m times. Hence
Future values of the state vector depend on ({,, 4, _ 1,. . .) only through the current value 5,. The system is stable provided that the eigenvalues of F all lie inside the unit circle. The observed variables are presumed to be related to the state vector through the observation equation of the system, y, = A’.q + H’{, + w,.
(1.6)
Here yt is an (n x 1) vector of variables that are observed at date t, H’ is an (n x r) matrix of coefficients, and W, is an (n x 1) vector that could be described as measurement error; W, is assumed to be i.i.d. N(O,R) and independent of g1 and (1.6) also includes x,, a (k x 1) vector of observed v, for t= 1,2,... . Equation variables that are exogenous or predetermined and which enter (1.6) through the (n x k) matrix of coefficients A’. There is a choice as to whether a variable is defined to be in the state vector 5, or in the exogenous vector xt, and there are advantages if all dynamic variables are included in the state vector so that x, is deterministic. However, many of the results below are also valid for nondeterministic x,, as long as n, contains no information about &+, or w,+, for m = 0, 1,2,. . . beyond that yr. For example, X, could include lagged values of y or containediny,_,,y,_,,..., variables that are independent of 4, and W, for all T. The state equation (1.4) and observation equation (1.6) constitute a linear state-space repesentation for the dynamic behavior of y. The framework can be further generalized to allow for time-varying coefficient matrices, non-normal disturbances and nonlinear dynamics, as will be discussed later in this chapter. For now, however, we just focus on a system characterized by (1.4) and (1.6). Note that when x, is deterministic, the state vector 4, summarizes everything in the past that is relevant for determining future values of y,
E(Yt+ml51,5r-l,...,Yt,Y1-1,...) =EC(A’x,+,+H’5,+,+w,+,)l5,,5,-,,...,y,,y,-l,...I =A’xy+,+H’E(5,+,151,&-1,...,~t,~*-l,...) = A'x~+~ + HlF"'&.
(1.7)
3043
Ch. 50: State-Space Models
As a simple example of a system that can be written a pth-order autoregression
in state-space
- 11)= 4l(Y, - 4 + #dYte 1 - PL)+ ... + 4p(Yt-p+l
(Y,+1
- 11)+
form, consider
(1.8)
Et+12
i.i.d. N(0, a2).
E, -
Note that (1.8) can equivalently
as
41 42 ... 4p-1 4p
Yt+1 -P
1i ! Yz - P
J&p+2
be written
=
-P
1
0
Ol...OO .. .. .o 0 . .
...
0
0
Yt - P Yc-1 -P .
... ...
. 1 :
011
J&p+1
:
-P
Et+1 0 .
!I:1 +
.
(1.9)
0
The first row of (1.9) simply reproduces (1.8) and other rows assert the identity Y,_j-p=Yy,_j-p forj=O,l,..., p - 2. Equation (1.9) is of the form of (1.4) with r=p and &=(yt-PL,Yt-1
v 2+1-
(1.10)
-P~...#-p+l-P)I~
(1.11)
-(%+1,0,...,O)I,
F=
0 .. .
1 .. . 0
-0 The observation Yt = P +
(1.12)
... . .. .‘.
equation
is
H’t,,
where H’ is the first row of the (p x ,p) identity be shown to satisfy
(1.13) matrix.
The eigenvalues
of F can
(1.14) thus stability of a pth-order autoregression requires that any value 1 satisfying (1.14) lies inside the unit circle. Let us now ask what kind of dynamic system would be described if H’ in (1.13)
J.D. Hamilton
3044
is replaced
with a general
y,=/J+Cl
81 0,
(1 x p) vector,
“.
~,-IK
(1.15)
where the 8’s represent arbitrary coefficients. Suppose that 4, continues in the manner specified for the state vector of an AR(p) process. Letting the jth element of &, this would mean 41
51,t+1
42
I! 1 r 2,t+
1
r’ PJ+
1
=
...
4,-l
4%
1
0
...
0
0
0 .. .
1 .. .
... ...
0
0
0
0
...
1
0
+ 1
to evolve tjt denote
E t+1
0 .
(1.16)
.
0:I
The jth row of this system for j = 2,3,. . . , p states that <j,l+ 1 = (j- l,f, implying for
5jt=Lj51,t+l
j=1,2
(1.17)
,..., p,
for L the lag operator. The first row of (1.16) thus implies that the first element of 4, can be viewed as an AR(p) process driven by the innovations sequence {E,}: (1.18)
(1-~l~-~2~2-~~~-~p~P)S1.1+l=~,+l.
Equations y,=p+(l
(1.15) and (1.17) then imply (1.19)
+B,L’+e2L2+...+ep-1LP-1)51r.
If we subtract p from both sides of (1.19) and (1 - c$,L- q5,L2 - ... - 4,Lp), the result is (1 -&L-4,L2-
operate
on both
sides
with
... -~pLP)(yt-p)=(l+~1L1+~2L2+‘~~+ep-1LP-1) x (1 - $,L-
f#),L2 - ... - f#),LP)&,
=(1+8,L’+82L2+~~~+8p_,LP-1)E,
(1.20) by virtue of(1.18). Thusequations(l.15)and (1.16)constitute a state-space representation for an ARMA(p,p - 1) process. The state-space framework can also be used in its own right as a parsimonious time-series description of an observed vector of variables. The usefulness of forecasts emerging from this approach has been demonstrated by Harvey and Todd (1983), Aoki (1987), and Harvey (1989).
Ch. 50: State-Space Models
3045
The state-space form is particularly convenient for thinking about sums of stochastic processes or the consequences of measurement error. For example, suppose we postulate the existence of an underlying “true” variable, &, that follows an AR(l) process (1.21) with u, white noise. Suppose that 4, is not observed directly. Instead, the econometrician has available data y, that differ from 5, by measurement error w,: Y, =
5, + wt.
(1.22)
If the measurement error is white noise that is uncorrelated with t+, then (1.21) and (1.22) can immediately be viewed as the state equation and observation equation of a state-space system, with I = n = 1. Fama and Gibbons (1982) used just such a model to describe the ex ante real interest rate (the nominal interest rate i, minus the expected inflation rate 7~;). The ex ante real rate is presumed to follow an AR( 1) process, but is unobserved by the econometrician because people’s expectation 7~: is unobserved. The state vector for this application is then <, = i, - rcr - /J where p is the average ex ante real interest rate. The observed ex post real rate (y, = i, - n,) differs from the ex ante real rate by the error people make in forecasting inflation, i, - x, = p + (i, - 7rcp - p) + (7~;- 71J, which is an observation equation of the form of (1.6) with R = 1 and w, = (~1 - 7~~). If people do not make systematic errors in forecasting inflation, then w, might reasonably be assumed to be white noise. In many.economic models, the public’s expectations of the future have important consequences. These expectations are not observed directly, but if they are formed rationally there are certain implications for the time-series behavior of observed series. Thus the rational-expectations hypothesis lends itself quite naturally to a state-space representation; sample applications include Wall (1980), Burmeister and Wall (1982), Watson (1989), and Imrohoroglu (1993). In another interesting econometric application of a state-space representation, Stock and Watson (1991) postulated that the common dynamic behavior of an (n x 1) vector of macroeconomic variables yt could be explained in terms of an unobserved scalar ct, which is viewed as the state of the business cycle. In addition, each series y, is presumed to have an idiosyncratic component (denoted a,J that is unrelated to movements in yjt for i #j. If each of the component processes could be described by an AR(l) process, then the [(n + 1) x l] state vector would be
4=L a
113
Qtv . . . 2 a,,)
(1.23)
J.D. Hamilton
3046
with state equation C,
al,
= I 1: P2 .
c, al* +
:I
yz. . .
I
O. . .
l. . .
P”1 1% 00
Y “t
(1.24)
v2,t+l
equation
Yl, Y2t
1:
azr +
[$j=[j{ZGJ
and observation
UC,,+1
v1,t+ 1
... ...
0 .
... 1
a,,
1
(1.25)
a:I m
Thus yi is a parameter measuring the sensitivity of the ith series to the business cycle. To allow for @h-order dynamics, Stock and Watson replaced c, and ai, in (1.23) with the (1 xp) vectors (c,,c,_t ,..., c,-,,+r) and (~,,,a~,~_, ,..., ~_~+r) so that 4, is an [(n + 1)~ x l] vector. The scalars C#I~ in (1.24) are then replaced by (p x p) matrices Fi with the structure of (1.12), and blocks of zeros are added in between the columns of H’ in the observation equation (1.25). A related theoretical model was explored by Sargent (1989). State-space models have seen many other applications in economics. For partial surveys see Engle and Watson (1987), Harvey (1987), and Aoki (1987). 2.
The Kalman filter
For convenience, the general form of a constant-parameter is reproduced here as equations (2.1) and (2.2).
linear state-space
model
State equation 51+1 = (r x 1)
J-5,
Yt (n x
equation .4’x,
= 1)
E(w,w;)
(r x 1)
Q (r x 4
E(v,+~$+J=
Observation
(2.1)
+ v,+1
(r x r)(r x 1)
(n x k)(k x 1) =
R . (n x n)
+
WC& (n x r)(r x 1)
+
w, (n x 1)
(2.2)
Ch. 50:
State-Space
3047
Models
Writing a model in state-space form means imposing certain values (such as zero or one) on some of the elements of F, Q,A,H and R, and interpreting the other elements as particular parameters of interest. Typically we will not know the values of these other elements, but need to estimate them on the basis of observation of {y,, y,, . . . , yT} and {x1,x2,. . . ,x,}.
2.1.
Overview of the Kalman Jilter
Before discussing estimation of parameters, it will be helpful first to assume that the values of all of the elements of F, Q, A, H and R are known with certainty; the question of estimation is postponed until Section 3. The filter named for the contributions of Kalman (1960, 1963) can be described as an algorithm for calculating an optimal forecast of the value of 4, on the basis of information observed through date t - 1, assuming that the values of F, Q, A, H and R are all known. This optimal forecast is derived from a well-known result for normal variables; [see, for example, DeGroot (1970, p. 55)]. Let z1 and zZ denote (n, x 1) and (n2 x 1) vectors respectively that have a joint normal distribution:
Then the distribution of zZ conditional on z1 is N(m,Z) where m=k E=
+%~;,‘(z, -II,), .n,, -
(2.3)
f2&2;;.(2,*.
(2.4)
Thus the optimal forecast of z2 conditional on having observed z1 is given by J%,Iz,)=lr,
+&l.n;,‘(z,
with Z characterizing
-P,),
(2.5)
the mean squared error of this forecast:
EC& - Mz, - m)‘lz,l = f12, - f2,,R ;:f&.
(2.6)
To apply this result, suppose that the initial value of the state vector (et) of a state-space model is drawn from a normal distribution and that the disturbances a, and w, are normal. Let the observed data obtained through date t - 1 be summarized by the vector
J.D. Hamilton
3048
Then the distribution of 4, conditional on &_r turns out to be normal for t = 2,3,,. . , T. The mean of this conditional distribution is represented by the (r x 1) vector {,,,_ 1 and the variance of this conditional distribution is represented by the (I x r) matrix PrIt_ r. The Kalman filter is simply the result of applying (2.5) and (2.6) to each observation in the sample in succession. The input for step t of the iteration is the mean &-, and variance P,,t_I that characterize the distribution of 4, conditional on &- 1. The output for step t is the mean c$+,,, and variance P t+ I,f of I&+1 conditional on 6,. Thus the output for step t is used as the input for step t + 1.
2.2.
Derivation of the Kalman filter
The iteration is started by assuming that the initial value of the state vector gI is drawn from a normal distribution with mean denoted G$,,,and variance denoted PI,,. If the eigenvalues of F are all inside the unit circle, then the vector process defined by (2.1) is stationary, and ~,,, would be the unconditional mean of this process,
&o= 09
(2.7)
while P,,, would be the unconditional
variance
PII, = E(44:). This unconditional
variance can be calculated from’
vec(P,,,) = [I+ - (F@F)]-‘.vec(Q).
(2.f3)
Here Z,2 is the (r2 x r2) identity matrix, “0” denotes the Kronecker product and 1The unconditional expectations:
variance
of 4 can be found by postmultiplying
(2.1) by its transpose
and taking
E(5,+,r;+,)=E(~5,+ol+,)(5:F’+~:+,) = F.E(S,S;)~ If 4, is stationary, P,,, = FP,,,F’
+ w,+
then E({,+ ,S;+,)
,u;+ J.
= E(t,J;)
and the above equation
becomes
+ Q.
Applying the vet operator to this equation that vec(ABCJ = (C&I A).vec(B) produces WP,,,)
= P,,,,
= (F@WWP,I,)
+
ve4Q).
and recalling
[e.g. Magnus
and Neudecker
(1988, p. 30)]
3049
Ch. 50: State-Space Models
vec(P,,,) is the (r2 x 1) vector formed by stacking the columns of Pi,,, one on top of the other, ordered from left to right. For time-variant or nonstationary systems, s,,, could represent a guess as to the value of {i based on prior information, while Pi,, measures the uncertainty associated with this guess ~ the greater our prior uncertainty, the larger the diagonal elements of Pi,,. ’ This prior cannot be based on the data, since it is assumed in the derivations to follow that ut+i and W, are independent of
E(y,Ix,,r,-,)=A’x,+H’%,,-,.
(2.9)
From (2.2) and (2.9) the forecast error can be written
Yt -
&,I
x,9
r,- 1)= (A’& + H’5, + 4 = H’(4 -
- (A’nt + H’a,, - 1)
Et,,1)+ wt.
(2.10)
Since & _ 1 is a function of & _ 1, the term W, is independent Thus the conditional variance of (2.10) is
of both 4, and $,,_ r.
E(Cy,-E(y,lx,,r,-,)lCy,-E(y,Ix,,r,-l)l’lx,,r,-1} =H’.E{Cr,-~,,,-,1C51-~,1-ll’lrl-1}H+E(w,w:) = H’Pt,,_lH+ R. Similarly,
the conditional
*Meinhold perspective.
and Singpurwalla
covariance
between
(2.10) and the error in forecasting
(1983) gave a nice description
of the Kalman
filter from a Bayesian
J.D. Hamilton
3050
the state vector is
Thus the distribution
It then follows from (2.3) and (2.4) that &I & = ~,Ix,,Y,, where
&t = B,,-
1 +
p,,,-
on X, and C,_ 1 is
of the vector (y;, 4:)’ conditional
l~(H’P,,,- ,H+ WYY,
<,- 1 is distributed
- A’? -
et,,-l),
The final step is to calculate a forecast of &+ 1 conditional to see from (2.1) that 5, + 1I G - N& + l,f, P,+ I,t) where
(2.12)
on 5,. It is not hard
(2.14)
E+ 111 = %>
Substituting
Ptit)
(2.13)
P,(,=P,,,_l-P,(,_lH(H’P,,,_lH+R)-’H’P,,,_l.
P t+ l,t = FP,,,F’ +
W&,
Q.
(2.15)
(2.12) into (2.14) and (2.13) into (2.15), we have
(2.16) %+1/t = F~,,,-l+FP,,,-,H(H’P,,,-,~+R)-‘(y,-~’x,-H’~~,,-,), P t+l,,= FPt,,_lF’To summarize,
FPt,,_lH(H’Pt,,_lH+
R)-‘H’Pt,,_lF’+Q.
(2.17)
the Kalman filter is an algorithm for calculating the sequence where & + 1 ,f denotes the optimal forecast of 4, + 1 based {&+ &= 1 and P-T+ &= 1y on observation of (yt,yt_ i,.. .,yl,n,,x,_, ,..., x1) and Pt+l,t denotes the mean squared error of this forecast. The filter is implemented by iterating on (2.16) and (2.17) for t = 1,2,. . . ,T. If the eigenvalues of F are all inside the unit circle and there is no prior information about the initial value of the state vector, this iteration is started using equations (2.7) and (2.8). Note that the sequence {Pt+I,t}:=l is not a function of the data and can be evaluated without calculating the forecasts {tt+ llt}T= 1. Because P,+ l,t is not a function of the data, the conditional expectation of the squared forecast error is
3051
Ch. 50: State-Space Models
the same as its unconditional
expectation,
This equivalence is a consequence constant variances for II, and w,. 2.3.
An m-period-ahead
forecast
normal
distributions
with
of the state vector can be calculated
from (1.5): (2.18)
l,...,Y1,Xt,X,-l,...,X1)=FrnSt,r.
$+,,t = E(5,+,lu,,u,The error of.this forecast -
assumed
with the Kalman filter
Forecasting
4t+m
of having
can be found
by subtracting
(2.18) from (1.5),
+Fm-2vt+2+... +F’u,+,_l $+l4,= Fm(5A,,,,+Fm-1”r+l
from which it follows that the mean squared
Pt+m,t = EC(&+,-
error of the forecast
+Ut+m,
(2.18) is
%+,,,,G+, - tt+d'l
=FmPC,,(Fm)‘+Fm-‘Q(Fm-1)‘+Fm-2Q(Fm-2)’+~~~+FQF’+Q.
(2.19)
These results can also be used to describe m-period-ahead forecasts of the Applying the law of observed vector y! + ,,,, provided that {x,} is deterministic. iterated expectations to (1.7) results in (2.20)
9 t+mlr=E(yr+mIYt,Yt-1,...,Y1)=A’~,+,+H’F”$I,. The error of this forecast is Y t+m
-A+,I, = V’xt+m+ H’b+, + w,+,) - V’xt+m + H’F”&t)
= H’(5,+*- E+m,,) + Wt+m with mean squared
error
EC(y,+m -9t+mlr)(Yt+m -9,+,1,)'1= 2.4.
Smoothed
H’Pt.,,,H+ R.
(2.21)
inference
Up to this point we have been concerned with a forecast of the value of the state vector at date t based on information available at date t - 1, denoted &,- 1, or
J.D. Hamilton
3052
with an inference about the value of the state vector at date t based on currently available information, denoted &. In some applications the value of the state vector is of interest in its own right. In the example of Fama and Gibbons, the state vector tells us about the public’s expectations of inflation, while in the example of Stock and Watson, it tells us about the overall condition of the economy. In such cases it is desirable to use information through the end of the sample (date T) to help improve the inference about the historical value that the state vector took on at any particular date t in the middle of the sample. Such an inference is known as a smoothed estimate, denoted = E({,j c&). The mean squared error of this estimate is denoted PtJT = E(g, - &T)(g, - &-)I. The smoothed estimates can be calculated as follows. First we run the data through the Kalman filter, storing the sequences {Pt,,}T=i and {P+ ,}T=, as calculated from (2.13) and (2.15) and storing the sequences ($,,}T= 1 and {$t,l_ ,>,‘=1
e,,
as calculated from (2.12) and (2.14). The terminal value for {&t}Z” i then gives the smoothed estimate for the last date in the sample, I$=,~,and P,,, is its mean squared error. The sequence of smoothed estimates {&T)TE1 is then calculated in reverse order by iterating on
for t = T- 1, T- 2,. . . , 1, where J, = P,,,F’P;,‘,,,. The corresponding mean squared errors are similarly found by iterating on (2.23) inreverseorderfort=T-l,T-2,..., 13.6).
2.5.
1; see for example Hamilton (1994, Section
Interpretation of the Kalman jilter with non-normal disturbances
In motivating the Kalman filter, the assumption was made that u, and w, were normal. Under this assumption, &,_ 1 is the function of <,- 1 that minimizes
a-(4, -
Et,,1NC- %,,-l)‘l>
(2.24)
in the sense that any other forecast has a mean squared error matrix that differs from that of &,_ 1 by a positive semidefinite matrix. This optimal forecast turned out to be a constant plus a linear function of L-;. The minimum value achieved for (2.24) was denoted PIi,_ 1. If D, and w, are not normal, one can pose a related problem of choosing &, _ 1 to be a constant plus a linear function of &- i that minimizes (2.24). The solution
3053
Ch. 50: State-Space Models
to this problem turns out to be given by the Kalman filter iteration (2.16) and its unconditional mean squared error is still given by (2.17). Similarly, when the disturbances are not normal, expression (2.20) can be interpreted as the linear mean squared projection of yt +m on 5, and a constant, with (2.21) its unconditional error. Thus, while the Kalman filter forecasts need no longer be optimal for systems that are not normal, no other forecast based on a linear function of & will have a smaller mean squared error [see Anderson and Moore (1979, pp. 92298) or Hamilton (1994, Section 13.2)]. These results parallel the Gauss-Markov theorem for ordinary least squares regression.
2.6.
Time-varying coefficient models
The analysis above treated the coefficients of the matrices F, Q, A, H and R as known constants. An interesting generalization obtains if these are known functions of n,:
yt =
a(~,)+ CHbJ1’5,+ w,,
(2.26)
E(w,w:ln,, r,- 1) = W,). Here F(.), Q(.), H(.) and R( .) denote matrix-valued functions of x, and a(.) is an (n x 1) vector-valued function of x,. As before, we assume that, apart from the possible conditional heteroskedasticity allowed in (2.26), x, provides no information about 4, or w, for any t beyond that contained in c,_ r. Even if u, and w, are normal, with x, stochastic the unconditional distributions of 4, and yt are no longer normal. However, the system is conditionally normal in the following sense.3 Suppose that the distribution of 4, conditional on &_ 1 is taken to be N(&I,P,,t_,). Then 4, conditional on x, and &t has the same distribution. Moreover, conditional on x,, all of the matrices can be treated as deterministic. Hence the derivation of the Kalman filter goes through essentially as before, with the recursions (2.16) and (2.17) replaced with s t+Iit= W&
- I + FW’,I, - 1H(xt)1CHWI’~,,,- ,H(x,)+ R(4) - ’
x {u, - 44 - CWx,)l’tt,,-J, P t+ I,( = F(x,)Pt,,- 1F(x,)l-
(F(-q)Pt,,-
~H(n,)CCWxt)l’f’t~t,4x,) + R(xt)l-i
x CfWJl’f’+ - 1CF(41’) + QW 3See Theorem 6.1 in Tjestheim
(2.27)
(1986) for further discussion.
(2.28)
J.D.Hamilton
3054
It is worth noting three elements of the earlier discussion that change with time-varying parameter matrices. First, the distribution calculated for the initial state in (2.7) and (2.8) is only valid if F and Q are fixed matrices. Second, m-period-ahead forecasts of y,,, or &,., for m > 1 are no longer simple to calculate when F, H or A vary stochastically; Doan et al. (1984) suggested approxievaluated al yrir = - l,...,~l)with E(y,,21~,+t,~,,...,~l) mating W, + 2 Iyl,yt E(Y,+,~Y,,Y,-I,...,yl).Finally, if u, and W, are not normal, then the one-periodahead forecasts Et+Ilf and 9,+ Ilt no longer have the interpretation as linear projections, since (2.27) is nonlinear in x,. An important application of a state-space representation with data-dependent parameter matrices is the time-varying coefficient regression model Y, =
xi@, +w
f’
(2.29)
Here & is a vector of regression coefficients that is assumed to evolve over time according to
&+I-@=I;tSI-h+vt+v
(2.30)
Assuming the eigenvalues of F are all inside the unit circle, fi has the interpretation as the average or steady-state coefficient vector. Equation (2.30) will be recognized as a state equation of the form of (2.1) with 4, = Vpt- $). Equation (2.29) can then be written as Yt =
4
-I- x:5*+ w,,
(2.31)
which is in the form of the observation equation (2.26) with a(%,)= X$ and [L&J] = xi. Higher-order dynamics for /It are easily incorporated by, instead, defining 4: = [(B - @,‘, (B,_ 1 - @)‘,. . . , c/pt_ p+ 1 - j?,‘] as in Nicholls and Pagan (1985, p. 437). Excellent surveys of time-varying parameter regressions include Raj and Ullah (1981), Chow (1984) and Nicholls and Pagan (1985). Applications to vector autoregressions have been explored by Sims (1982) and Doan et al. (1984).
2.7.
Other extensions
The derivations above assumed no correlation between II, and VU,,though this is straightforward to generalize; see, for example, Anderson and Moore (1979, p. 108). Predetermined or exogenous variables can also be added to the state equation with few adjustments. The Kalman filter is a very convenient algorithm for handling missing observations. If y, is unobserved for some date t, one can simply skip the updating
Ch. 50: State-Space
3055
Models
equations (2.12) and (2.13) for that date and replace them with Et,*= &,_ 1 and P,,, = P,,r_ r; see Jones (1980), Harvey and Pierse (1984) and Kohn and Ansley (1986) for further discussion. Modifications of the Kalman filtering and smoothing algorithms to allow for singular or infinite P,,, are described in De Jong (1989, 1991). 3.
Statistical Maximum
3.1.
inference about unknown parameters using the Kalman filter likelihood
estimation
The calculations described in Section 2 are implemented by computer, using the known numerical values for the coefficients in the matrices F, Q, A, H and R. When the values of the matrices are unknown we can proceed as follows. Collect the unknown elements of these matrices in a vector 8. For example, to estimate theARMA(p,p-l)process(1.15)-(1.16),8=($,,4, ,..., 4p,01,02 ,..., Bp_rr~,~)‘. Make an arbitrary initial guess as to the value of t9, denoted 0(O), and calculate the sequences {&- r(@(‘))}T=1 and {Pt,t_,(B(o))}t’E1 that result from this value in (2.16) and (2.17). Recall from (2.11) that if the data were really generated from the model (2.1)-(2.2) with this value of 0, then
_hbt,r14e(0) - w4oW, 409(o))),
(3.1)
where
p,(e(O))= p(e(o))]k, + [H(e(O))]&_ 1(8(O)),
(3.2)
qe(O)) = pz(e(o))]~p,,,_ ,(e(O))] [qe(O))] + R(e(O)).
(3.3)
The value of the log likelihood is then f
logf(y,lx,,r,_,;e(O))=
-$i09(27+:$
t=1
-
f,flh
- ~vWi~~w(o9i - lb,
iOglzt(e(0))l
-)u,iw I.
(3.4)
which reflects how likely it would have been to have observed the data if 0(O)were the true value for 8. We then make an alternative guess 0(l) so as to try to achieve a bigger value of (3.4), and proceed to maximize (3.4) with respect to 8 by numerical methods such as those described in Quandt (1983), Nash and Walker-Smith (1987) or Hamilton (1994, Section 5.7). Many numerical optimization techniques require the gradient vector, or the derivative of (3.4) with respect to 0. The derivative with respect to the ith element of 8 could be calculated numerically by making a small change in the ith element
Ch. 50: State-Space
3057
Models
The vector of parameters to be estimated is 8 = (B’, 19r,8,, a)‘. By making an arbitrary guess4 at the value of 8, we can calculate the sequences {&_,(B)}T=, value for (2.16) is the and (ptlt- ,(W>T= r in (2.16) and (2.17). The starting unconditional mean of er, 0
II-1
EtY1 =
s^,,, =E
0
; Et-2
while (2.17) is started Et
P,,,=E
;
&,-I
)
0
with the unconditional
1 [E,
a* Et_1
Et_21 = : 0 0
5 - 2
From these sequences, (3.4) then provides
u(e)
and E,(0)
variance, 0
0
a2 0 0
1
i72
can be calculated
log f(YT,YT-l,...,YlIXT,XT-1,...,Xl;~).
in (3.2) and (3.3), and
(3.7)
Note that this calculation gives the exact log likelihood, not an approximation, and is valid regardless of whether 8, and e2 are associated with an invertible MA(2) representation. The parameter estimates j, 8r, 8, and b are the values that make (3.7) as large as possible.
3.2.
Identijication
The maximum likelihood estimation procedure, just described, presupposes that the model is identified, that is, it assumes that a change in any of the parameters would imply a different probability distribution for {y,},“, 1. One approach to checking for identification is to rewrite the state-space model in an alternative form that is better known to econometricians. For example, since the state-spacemode1(1.15)-(1.16)isjust another way ofwritingan ARMA(p,p - 1) process, the unknown parameters (4r,. . . ,4p, O1,. . . ,8,_ 1, p, a) can be consistently estimated provided that the roots of (1 + t9,z + 0,z2 + ... + 8,_ r.zp-r) = 0 are normalized to lie on or outside the unit circle, and are distinct from the roots of (1 -C#J1z-CJS2z2- ... - 4,~“) = 0 (assuming these to lie outside the unit circle as well). An illustration of this general idea is provided in Hamilton (1985). As another 4Numerical algorithms are usually much better behaved if an intelligent initial guess for 0”’ is used. A good way to proceed in this instance is to use OLS estimates of (3.5) to calculate an initial guess for /?, and use the estimated variance sz and autocorrelations PI and p2 of the OLS residuals to construct initial guesses for O,, O2 and c using the results in Box and Jenkins (1976, pp. 187 and 519).
J.D. Hamilton
3058
example, Yt =
the time-varying
coefficient
regression
model (2.31) can be written
x:s+4,
(3.8)
where
If X, is deterministic, equation (3.8) describes a generalized least squares regression model in which the varianceecovariance matrix of the residuals can be inferred from the state equation describing 4,. Thus, assuming that eigenvalues of F are all inside the unit circle, p can be estimated consistently as long as (l/T)CT= ix& converges to a nonsingular matrix; other parameters can be consistently estimated if higher moments of x, satisfy certain conditions [see Nicholls and Pagan (1985, p. 431)]. The question of identification has also been extensively investigated in the literature on linear systems; see Gevers and Wertz (1984) and Wall (1987) for a survey of some of the approaches, and Burmeister et al. (1986) for an illustration of how these results can be applied.
3.3.
Asymptotic
properties
of maximum likelihood
estimates
Under suitable conditions, the estimate 8 that maximizes (3.4) is consistent and asymptotically normal. Typical conditions require 8 to be identified, eigenvalues of F to be inside the unit circle, the exogenous variable x, to behave asymptotically like a full rank linearly nondeterministic covariance-stationary process, and the true value of 8 to not fall on the boudary of the allowable parameter space; see Caines (1988, Chapter 7) for a thorough discussion. Pagan (1980, Theorem 4) and Ghosh (1989) demonstrated that for particular examples of state-space models
(3.9) where $ZDST is the information matrix second derivatives of the log likelihood
x 2D.T
for a sample function:
of size T as calculated
(3.10)
=
Engle and Watson
from
(1981) showed
that the row i, column
j element
of $2D.T is
3059
Models
Ch. 50: State-Space
given by
(3.11) One option is to estimate (3.10) by (3.11) with the expectation operator dropped from (3.11). Another common practice is to assume that the limit of $ZD,T as T+ co is the same as the plim of
3-g
1 T a2w-(~,Ir,-,~-0) f
1
aeaef
3
(3.12)
e=ti
which can be calculated analytically or numerically by differentiating (3.4). Reported standard errors for 8 are then square roots of diagonal elements of (l/T)(3)-‘. It was noted above that the Kalman filter can be motivated by linear projection arguments even without normal distributions. It is thus of interest to consider as in White (1982) what happens if we use as an estimate of 8 the value that maximizes (3.4), even though the true distribution is not normal. Under certain conditions such quasi-maximum likelihood estimates give consistent and asymptotically normal estimates of the true value of 0, with
Jm-
4)
L
NO, C92d (p2,]
- ‘),
(3.13)
where $2D is the plim of (3.12) when evaluated at the true value 0, and .YoP is the limit of (l/T)CT, 1[s,(&,)] [s,(B,)]’ where
mu =
ai0gf(.hIr,-,~-0) [
ae
I 1. e=eo
An important hypothesis test for which (3.9) clearly is not valid is testing the constancy of regression coefficients [see Tanaka (1983) and Watson and Engle (1985)]. One can think of the constant-coefficient model as being embedded as a special case of (2.30) and (2.31) in which E(u,+ lu:+ J = 0 and /I1 = $. However, such a specification violates two of the conditions for asymptotic normality mentioned above. First, under the null hypothesis Q falls on the boundary of the allowable parameter space. Second, the parameters of Fare unidentified under the null. Watson and Engle (1985) proposed an appropriate test based on the general procedure of Davies (1977). The results in Davies have recently been extended by Hansen (1993). Given the computational demands of these tests, Nicholls and
J.D. Hamilton
3060
Pagan (1985, p. 429) recommended Lagrange multiplier tests for heteroskedasticity based on OLS estimation of the constant-parameter model as a useful practical approach. Other approaches are described in Nabeya and Tanaka (1988) and Leybourne and McCabe (1989). 3.4.
Confidence
intervals for smoothed estimates and forecasts
Let &,(0,) denote the optimal inference about 4, conditional on obervation of all data through date T assuming that 0, is known. Thus, for t d T, {,IT(OO) is the smoothed inference (2.22) while for r > T, &,(O,,) is the forecast (2.18). If 0, were known with certainty, the mean squared error of this inference, denoted PZIT(&,), would be given by (2.23) for r d T and (2.19) for t > T. In the case where the true value of 0 is unknown, this optimal inference is approximated by &,(o) for e^ the maximum likelihood estimate. To describe the consequences of this, it is convenient to adopt the Bayesian perspective that 8 is a random variable. Conditional on having observed all the data &-, the posterior distribution might be approximated by elc,-
N(@(lIT)@-‘).
(3.14)
where Etilr,(‘) denotes the expectation of (.) with respect to the distribution in (3.14). Thus the mean squared error of an inference based on estimated parameters is the sum of two terms. The first term can be written as E,,c,{P,,T(0)}, and might A convenient way to calculate this would be described as “filter uncertainty”. be to generate, say, 10,000 Monte Carlo draws of 8 from the distribution (3.14), run through the Kalman filter iterations implied by each draw, and calculate the average value of PrIT(0) across draws. The second term, which might be described as “parameter uncertainty”, could be estimated from the outer product of [&.(ei) - &,(i!j)] with itself for the ith Monte Carlo draw, and again averaging across Monte Carlo realizations. Similar corrections to (2.21) can be used to generate a mean squared error for the forecast of y,,, in (2.20). 3.5.
Empirical
application - an analysis of the real interest rate
As an illustration of these methods, consider Fama and Gibbons’s (1982) real interest rate example discussed in equations (1.21) and (1.22). Let y, = i, - 7c, denote
Ch. 50: State-Space Models
3061
IO.0 75 50 w
2.5 0.0 -2.5 I 60
63
66
69
72
75
78
81
84
87
90
2.00 1.75 I.50 1.25 a
1.00 075
'!____________________------__-_______________________~'
050
-
025
-
0.00
9
I 60
63
66
69
72
75
78
Eli
84
87
90
69
72
75
78
81
84
87
90
5.0
z 8 is
-2.5 -5.0
60
63
66
Figure 1. Top panel. Ex post real interest rate for the United States, qua;terly from 1960:1 to_ 1992:III, quoted at an annual rate. Middle panel. F$er_uncertainty. Solid line: P,,,(e). Dashed line: PflT(B). Bottom panel. Smoothed inferences t,,(0) along with 95 percent confidence intervals.
the observed ex post real interest rate, where i, is the nominal interest rate on 3-month U.S. Treasury bills for the third month of quarter t (expressed at an annual rate) and rr, is the inflation rate between the third month of quarter t and the third month oft + 1, measured as 400 times the change in the natural logarithm of the consumer price index. Quarterly data for y, are plotted in the top panel of Figure 1 for t= 1960:1 to 1992:III. The maximum likelihood estimates for the parameters of this model are as follows, with standard errors in parentheses, & = 0.9145,_ 1 + u, (0.041)
8” = 0.977 (0.177)
y,=1.43+r,+w, (0.93)
6, = 1.34 . (0.14)
J.D. Hamilton
3062
Here the state variable 5, = i, - 71:-p has the interpretation as the deviation of the unobserved ex ante real interest rate from its population mean p. Even if the population parameter vector 8= (4, o,,~, g,J’ were known with certainty, the econometrician still would not know the value of the ex ante real interest rate, since the market’s expected inflation 7~: is unobserved. However, the econometrician can make an educated guess as to the value of 5, based on observations of the ex post real rate through date t, treating the maximum likelihood estimate aas if known with certainty. This guess is the magnitude &,(a), and its mean squared error P,,,(a) is plotted as the solid line in the middle panel of Figure 1. The mean squared error quickly asymptotes to
which is a fixed constant owing to the stationarity of the process. The middle panel of Figure. 1 also plots the mean squared error for the smoothed inference, PrIT(a). For observations in the middle of the sample this is essentially the mean squared error (MSE) of the doubly-infinite projection
The mean squared error for the smoothed inference is slightly higher for observations near the beginning of the sample (for which the smoothed inference is unable to exploit relevant data on y,,y_ I,. . .) and near the end of the sample (for which knowledge of YT+ r, Y~+~, . . . would be useful). The bottom panel of Figure 1 plots the econometrician’s best guess as to the value of the ex ante real interest rate based on all of the data observed:
Ninety-five percent confidence intervals for this inference that take account of both the filter uncertainty P1,r(a) and parameter uncertainty due to the random nature of 8 are also plotted. Negative ex ante real interest rates during the 1970’s and very high ex ante real interest rates during the early 1980’s both appear to be statistically significant. Hamilton (1985) obtained similar results from a more complicated representation for the ex ante real interest rate.
4.
Discrete-valued
state variables
The time-varying coefficients model was advocated by Sims (1982) as a useful way of dealing with changes occurring all the time in government policy and economic institutions. Often, however, these changes take the form of dramatic, discrete events, such as major wars, financial panics or significant changes in the policy
3063
Ch. 50: State-Space Models
objectives of the central bank or taxing authority. It is thus of interest to consider time-series models in which the coefficients change only occasionally as a result of such changes in regime. Consider an unobserved scalar s, that can take on integer values 1,2,. . . , N corresponding to N different possible regimes. We can then think of a time-varying coefficient regression model of the form of (2.29), Yt =
x:Bs,+ wt
(4.1)
for x, a (k x 1) vector of predetermined or exogenous variables and w, - i.i.d. N(0, a’). Thus in the regime represented by s, = 1, the regression coefficients are given by /?r, when s, = 2, the coefficients are f&, and so on. The variable s, summarizes the “state” of the system. The discrete analog to (2.1), the state transition equation for a continuous-valued state variable, is a Markov chain in which the probability distribution of s, + 1 depends on past events only through the value of s,. If, as before, observations through date t are summarized by the vector
the assumption Prob(s,+,=
is that jlst=i,st_l=il,st_Z=i2,...,&)=Prob(st+i=jls,=i) - pij.
(4.2)
When this probability does not depend on the previous state (pij = pu for all i, j, and I), the system (4.1)-(4.2) is the switching regression model of Quandt (1958); with general transition probabilities it is the Markov-switching regression model developed by Goldfeld and Quandt (1973) and Cosslett and Lee (1985). When x, includes lagged values of y, (4.1)-(4.2) describes the Markov-switching time-series model of Hamilton (1989).
4.1.
Linear state-space
representation
of the Markov-switching
The parallel between (4.2))(4.1) and (2.1))(2.2) is instructive. (N x N) matrix whose row i, column j element is given by pji: r hl F=
P12
P21 P22 .
1
PIN
P2N
...
PNll
“’ ...
PN2 . .
“.
PNN]
.
model Let F denote
an
J.D. Hamilton
3064
Let e, denote the ith column of the (N x N) identity matrix and construct an (N x 1) vector 4, that is equal to e, when s, = i. Then the expectation of &+ 1 is an (N x 1) vector whose ith element is the probability that s,, 1 = i. In particular, the expectation of &+r conditional on knowing that s, = 1 is the first column of F. More generally,
E(5*+,l51,5r-l,...,51~a)=F5~.
(4.4)
The Markov chain (4.2) thus implies the linear state equation (4.5) where u,, 1 is uncorrelated with past values of 4,y or x. The probability that s,, 2 = j given s, = i can be calculated from Probh.2
=jlSt
=
i) =
PilPlj
+
Pi2P2j
+
“’
+
PiNPNj
=
Pljpil
+
PZjPi2
+
“’
+
PNjPiNr
which will be recognized as the row j, column i element of F2. In general, the probability that st+,,, = j given s, = i is given by the row j, column i element of F”, and
Moreover, the regression equation (4.1) can be written y, =
x:q + w,,
(4.7)
where B is a (k x N) matrix whose ith column is given by pi. Equation (4.7) will be recognized as an observation equation of the form of (2.26) with [H(x,)]’ = x:B. Thus the model (4.1)-(4.2) can be represented by the linear state-space model (2.1) and (2.26). However, the disturbance in the state equation u,, 1 can only take on a set of N2 possible discrete values, and is thus no longer normal, so that the Kalman filter applied to this system does not generate optimal forecasts or evaluation of the likelihood function.
4.2.
OptimalJilter when the state variable follows a Markov chain
The Kalman filter was described above as an iterative algorithm for calculating the distribution of the state vector 4, conditional on 5,-i. When 4, is a continuous normal variable, this distribution is summarized by its mean and variance. When the state variable is the discrete scalar s,, its conditional distribution is, instead,
3065
Ch. 50: State-Space Models
summarized
by
Prob(s,=ilG_,)
for i=1,2
,...,
N.
(4.8)
Expression (4.8) describes a set of N numbers which sum to unity. Hamilton (1989) presented an algorithm for calculating these numbers, which might be viewed as a discrete version of the Kalman filter. This is an iterative algorithm whose input at step t is the set of N numbers {Prob(s, = iIT,_ ,)}y=, and whose output is the Kalman filter, we initially assumed that {Prob(s,+, = i 1&)}r= 1. In motivating the values of F, Q, A, Hand R were known with certainty, but then showed how the filter could be used to evaluate the likelihood function and estimate these parameters. Similarly, in describing the discrete analog, we will initially assume that the values of j?l,j?Z,. . . ,&, CJ,and (pij}Tj= 1 are known with certainty, but will then see how the filter facilitates maximum likelihood estimation of these parameters. A key difference is that, whereas the Kalman filter produces forecasts that are linear in the data, the discrete-state algorithm, described below, is nonlinear. If the Markov chain is stationary and ergodic, the iteration to evaluate (4.8) can be started at date t = 1 with the unconditional probabilities. Let ni denote the unconditional probability that s, = i and collect these in an (N x 1) vector a7~~)).Noticing that scan be interpreted as E(&), this vector can be found (XI,?,..., by taking expectations of (4.5): n= Fx.
(4.9)
Although this represents a system of N equations in N unknowns, it cannot be solved for n; the matrix (IN - F) is singular, since each of its columns sums to zero. However, if the chain is stationary and ergodic, the system of (N + 1) equations represented by (4.9) along with the equation l’a=
1
(4.10)
can be solved uniquely for the ergodic probabilities (here “1” denotes vector, all of whose elements are unity). For N = 2, the solution is 711 =
(1- P22Ml-
Pll +
x2 =
(1-
Pll
Pl
A general solution (A’A)- ‘A’ where
A = ,N+*,xN [
Ml -
for scan
Z, - F 1’
1 .
an (N x 1)
1 - PA
(4.11)
+ 1- P22).
(4.12)
be calculated
from the (N + 1)th column
of the matrix
J.D. Hamilton
3066
The input for step t of the algorithm under the assumption of predetermined
is (Prob(s, = i/T,_ r)}y=,, whose ith entry or exogenous x, is the same as (4.13)
Prob(s, = i/x,, G-r). The assumption
f(~,ls,
in (4.1) was that
1 = Otr G-i) = (2Zca2)1,2 exp
- (Y, - $Bi)’ 2oZ [
For given i, xf,yt,/(, and C, the right-hand calculated. This number can be multiplied jointly observing s, = i and y,:
Expression the density
(4.15) describes of y, conditional
f(~,lx,~LJ=
1
(4.14)
side of (4.14) is a number that can be by (4.13) to produce the likelihood of
a set of N numbers on x, and &_ 1:
(for i = 1,2,. . . , N) whose sum is
(4.16)
f(Yt~st=ilxt~Ld
f i=l
If each of the N numbers in (4.15) is divided by the magnitude in (4.16), the result is the optimal inference about s, based on observation of IJ,= {yt, xr, J, _ i}: f(~,,s,= il~t~~-l) Prob(s, = i) &) = f(YtIx,~L,) . The output
Prob(s,+,
for thejth
= jl&)=
iteration
can then be calculated
2 Prob(s,+,
(4.17)
from
=j,s,=il&)
i=l
= i$l Prob(s,+,
= itI Pij.
= j/s, = i, &),Prob(s,
Prob(s,= iI&).
= iI&)
(4.18)
To summarize, let &- 1 denote an (N x 1) vector whose ith element represents Prob(s, = iIT,- J and let fit denote an (N x 1) vector whose ith element is given by (4.14). Then the sequence {&,_ , },‘=1 can be found by iterating on
3067
Ch. 50: State-Space Models
(4.19)
where “0” denotes element-by-element mu_ltiplication and 1 represents an (N x 1) vector of ones. The iteration is started with Ljl,O= ICwhere 1cis given by the solution to (4.9) and (4.10). The contemporaneous inference G$ is given by (&, 0 ?,I*)/ V’(&l,-
4.3.
10
%)I.
Extensions
The assumption that y, depends only on the current value s, of a first-order Markov chain is not really restrictive. For example, the model estimated in Hamilton (1989) was Y,-Ps;=4(Yr-l-P*
St-
)++
)+~*(Y,-z-~,:_l)+...+~p(Yr_p-~,* 1
f-P
(4.20) where SF can take on the values 1 or 0, and follows a Markov chain with Prob(s:+ 1 = jlsf = i) = p$. This can be written in the form of (4.1)-(4.2) by letting N = 2p+ ’ and defining s, = 1
if
(ST = l,s:_,
= 1,. . . ,
and
ST_, = l),
s, = 2
if
(ST = O,s:_‘, = 1,. . .,
and
s:_~ = l), (4.21)
s,=N-
1
s, = N For illustration,
F= (8 x 8)
if
(SF= l,s:_,
=O,...,
and
s*t_-p=O)’
if
(s:=O,s,*_,
=0 ,...,
and
s:_~=O).
the matrix
of transition
probabilities
when p = 2 is
-p:1
0
0
0
pT1
0
0
0
P:O
0
0
0
p&
0
0
0
0
p&
0
0
0
p&
0
0
0
PO*0
0
0
0
p&
0
0
0
P:l
0
0
0
P;l
0
0
0
0
p:o
0
0
0
Pro
0
0
0
0
p&
0
0
0
p&
0
0
0
p&
0
0
0
p&
(4.22)
J.D. Hamilton
3068
There is also no difficulty in generalizing the above method to (n x 1) vector processesyt with changing coefficients or variances. Suppose that when the process is in state s,, Yt ISt,1, - wQf
(4.23)
52,)
where n;, for example, is an (n x k) matrix of regression when s, = 1. Then we simply replace (4.14) with 1 .f‘(~,ls,=i,x~,i-~)=(~~)“,~,~~,~,~exp
- $,
- qx,),n
coefficients
appropriate
,: I(& - n;x,,
[
1) (4.24)
with other details of the recursion identical. It is more difficult to incorporate changes in regime in a moving average process such as y, = E, + Os.s,_ i. For such a process the distribution of y, depends on the such completehistory(i,_,,y,_, ,..., y,,s:,s:_, ,... , ST),and N, in a representation as (4.21), grows with the sample size T. Lam (1990) successfully estimated a related model by truncating the calculations for negligible probabilities. Approximations to the optimal filter for a linear state-space model with changing coefficient matrices have been proposed by Gordon and Smith (1990), Shumway and Stoffer (1991) and Kim (1994).
4.4.
Forecasting
Applying the law of iterated expectations based on data observed through date t is
~(5t+,lr,) =
to (4.6), the optimal
forecast
FrnE,r,
(4.25)
where I$, is the optimal inference calculated by the filter. As an example of using (4.25) to forecast yt, consider again the example This can be written as Y, = Ps; +
z,,
where z, = 4iz,-i + &z,_~ + ... + 4pzt-p + E,. If {SF} were observed, ahead .forecast of the first term in (4.26) turns out to be
E(~~~+_ls:)=~,+{~l+~“(s:-n,)}(~,-~,),
of &+,
in (4.20).
(4.26) an m-period-
(4.27)
Ch. 50: State-Space
3069
Models
where ;1= (- 1 + PT1 + p&J and rrI = (1 - p&,)/(1 - ~7~ + 1 - P&). If ~7~ and P& are both greater than i, then 0 < ,I < 1 and there is a smooth decay toward the steady-state probabilities. Similarly, the optimal forecast of z,+, based on its own lagged values can be deduced from (1.9): (4.28) where e; denotes the first row of the (p x p) identity matrix and @ denotes the (p x p) matrix on the right-hand side of (1.12). Recalling that z, = y, - psz is known if y, and ST are known, E(Yt+,l%
we can substitute
(4.27) and (4.28) into (4.26) to conclude
r,) = PO + 1%+ JrnK- “1Wl -PO) + e;@Y(Y,
-P,:)
b-1
-P,:_l) ...
- Ps;_p+lu.
&J+1
(4.29) Since (4.29) is linear in (ST}, the forecast based solely on the observed & can be found by applying the law of iterated expectations to (4.29): E(y,+,IG)
= cl0 + {x1 + AmCProb(s: = 1 I&) - rrJ}(~,
where the ith element .Fit=Yt-i+l
variables
- pO) + e;@“_F(,
(4.30)
of the (p x 1) vector j, is given by
- p. Prob(s,*_i+ r =Ol&)-pr
Prob($-i+r= 114).
The ease of forecasting makes this class of models very convenient for rationalexpectations analysis; for applications see Hamilton (1988), Cecchetti et al. (1990) and Engel and Hamilton (1990).
Smoothed
4.5.
probabilities
We have assumed that the current value of s, contains all the information in the history of states through date t that is needed to describe the probability laws for y and s:
Prob(s,+, Under
=jls,=i)=Prob(s,+,
these assumptions
=j(sf=i,sf_-l
we have, as in Kitagawa
=it_l,...,sl
=il).
(1987, p, 1033) and Kim (1994),
J.D. Hamilton
3070
that Prob(s,=j,s,+,=ilr,)=Prob(s,+,=ilrr)Prob(s,=jls,+,=i,r,) =Prob(s,+,=il&)Prob(s,=jls,+r=i,&) = Prob(s,+,
= iI&)
= Prob(s,+, = il&)
Prob(s, = j, s,, 1 = iI 4,) Prob(s,+r = iI&) Prob(s,= jl<,)Prob(s,+ 1=ils, = j) Prob(s,+, = iI&)
’
(4.31) Sum (4.31) over i= l,..., N and collect the resulting equations for j = 1,. . . , N in a vector EtIT,whose jth element is Prob(s, = jlcr): (4.32) where “( + ),, denotes element-by-element division. The smoothed probabilities are thus found by iterating on (4.32) backwards for t = T- 1, T- 2,. . . , 1.
4.6.
Maximum
likelihood
estimation
For given numerical values of the transition probabilities in F and the regression parameters such as (ZI,, . . . , I&, 52,, . , . , L2,) in (4.24), the value of the log likelihood function of the observed data is CT’ 1log f(y,Ix,, &_ r) for f(y,Ix,, &_ J given by (4.16). This can be maximized numerically. Again, the EM elgorithm is often an efficient approach [see Baum et al. (1970), Kiefer (1980) and Hamilton (1990)]. For the model given in (4.24), the EM algorithm is implemented by making an arbitrary initial guess at the parameters and calculating the smoothed probabilities. OLS regression of y,,/Prob(s, = 1 I CT) on I,,,/ Prob(s, = 1 I&.) gives a new estimate of I7r and a new estimate of J2, is provided by the sample variance matrix of these OLS residuals. Smoothed probabilities for state 2 are used to estimate ZI, and SL,, and so on. New estimates for pij are inferred from 5 Prob(s,=j,s,_,
Pi’= [f:
=il&)
t=2
Prob(s,_I =il&)]
1 ’
t=2
with the probability of the initial state calculated from pi = Prob(s, = iI &.) rather than (4.9)-(4.10). These new parameter values are then used to recalculate the smoothed probabilities, and the procedure continues until convergence.
Ch. 50: State-Space
3071
Models
When the variance depends on the state as in (4.24), there is an essential singularity in the likelihood function at 0, = 0. This can be safely ignored without consequences; for further discussion, see Hamilton (199 1).
4.7.
Asymptotic
properties
of maximum
likelihood estimates
It is typically assumed that the usual asymptotic distribution theory motivating (3.9) holds for this class of models, though we are aware of no formal demonstration of this apart from Kiefer’s (1978) analysis of i.i.d. switching regressions. Hamilton (1993) examined specification tests derived under the assumption that (3.9) holds. Two cases in which (3.9) is clearly invalid should be mentioned. First, the maximum likelihood estimate flij may well be at a boundary of the allowable parameter space (zero or one), in which case the information matrix in (3.12) need not even be positive definite. One approach in this case is to regard the value of Pij as fixed at zero or one and calculate the information matrix with respect to other parameters. Another case in which standard asymptotic distribution theory cannot be invoked is to test for the number of states. The parameter plZ is unidentified under the null hypothesis that the distribution under state one is the same as under state two. A solution to this problem was provided by Hansen (1992). Testing the specification with fewer states for evidence of omitted heteroskedasticity affords a simple alternative.
4.8.
Empirical
application ~ another look at the real interest rate
We illustrate these methods with a simplified version of Garcia and Perron’s (1993) analysis of the real interest rate. Let y, denote the ex post real interest rate data described in Section 3.5. Garcia and Perron concluded that a similar data set was well described by N = 3 different states. Maximum likelihood estimates for our data are as follows, with standard errors in parentheses:5 Y,~s, = 1 - N( 5.69 , 3.72 ),
(0.41) (1.11) y,ls,=2-N(
1.58,
1.93),
(0.16) (0.32) y,ls, = 3 - N( - 1.58 , 2.83 ), (0.30) (0.72) ‘Garcia and Perron also included the analysis described here.
p = 2 autoregressive
terms as in (4.20), which were omitted
from
J.D. Hamilton
3072 -
0
0.950
0.036 (0.030)
(0.044)
F=
0.050
0.990
(0.044)
(0.010)
0
0
0.010
0.964
(0.010)
(0.030).
The unrestricted maximum likelihood estimates for the transition probabilities occur at the boundaries with fir3 = Fiji = flS2 = 0. These values were then imposed a priori and derivatives were taken with respect to the remaining free parameters 8= (~~,P~,~~,u:,cJ$, ~~,p~~,p~~,p~J to calculate standard errors.
IO.0 75
.-
_
5.0 25-
--
_-
__
’
-
.
0.0 -2.5 -5.0
_
_
___
VtiA 60
63
66
69
72
7s
78
BI
84
87
90
60
63
66
69
72
75
78
81
El4
87
90
60
63
66
69
72
75
78
81
84
87
90
60
63
66
69
72
75
78
81
84
87
90
Figure 2. Top panel. Solid Prob(s, = i(&; 8) > 0.5
line: ex post
real
interest
rate.
Dashed
line: pi6^i,,, where
Ji,, = I if
and c?~,,= 0 otherwise. Second panel. Prob(_, = I 1I&; 6). Third Prob(s, = 21 CT; a). Fourth panel. Prob(s, = 3 1CT;0).
panel.
3073
Ch. 50: State-Space Models
Regime 1 is characterized by average real interest rates in excess of 5 percent, while regime 3 is characterized by negative real interest rates. Regime 2 represents the more typical experience of an average real interest rate of 1.58 percent. The bottom three panels of Figure 2 plot the smoothed probabilities Prob(s, = i( CT; 8) for i = 1,2 and 3, respectively. The high interest rate regime lasted from 1980:IV to 1986:11, while the negative real interest rate regime occurred during 1972:,3 to 1980:III. Regime 1 only occurred once during the sample, and yet the asymptotic standard errors reported above suggest that the transition probability @ii has a standard error of only 0.044. This is because there is in fact not just one observation useful It is exceedingly unlikely that one for estimating pi 1, but, rather, 23 observations. could have flipped a fair coin once each quarter from 1980:IV through 1986:11 and have it come up heads each time; thus the possibility that pii might be as low as 0.5 can easily be dismissed. The means fli, & and f13 corresponding to the imputed regime for each date are plotted along with the actual data for y, in the top panel of Figure 2. Garcia and Perron noted that the timing of the high real interest rate episode suggests that fiscal policy may have been more important than monetary policy in producing this unusual episode.
5.
Non-normal
and nonlinear state-space
models
A variety of approximating techniques have been suggested for the case when the disturbances I+ and W, come from a general non-normal distribution or when the state or observation equations are nonlinear. This section reviews two approaches. The first approximates the optimal filter using a finite grid and the second is known as the extended Kalman filter.
5.1.
Kitagawa’s grid approximation state-space models
for nonlinear,
non-normal
Kitagawa (1987) suggested the following general approach for nonlinear or non-normal filtering. Although the approach in principle can be applied to vector systems, the notation and computations are simplest when the observed variable (y,) and the state variable (r,) are both scalars. Thus consider t If1
=dJ(5,)+~,+1~ Yt =
(5.1)
45,)+ wt.
The disturbances
v, and w, are each i.i.d. and mutually
(5.2)
independent
and have
J.D. Hamilton
3074
densities denoted q(u,) and r(wJ, respectively. These densities need not be normal, but they are assumed to be of a known form; for example, we may postulate that u, has a t distribution with v degrees of freedom: q(ut) =
c(1 + (u:/v))-(V+l)‘*,
where c is a normalizing constant. Similarly functions of some known form; for example, in which case (5.1) would be
c$(.) and h(.) represent parametric 4(.) might be the logistic function,
1
5r+l=l+aexp(-_&)+u’+l’
(5.3)
Step t of the Kalman filter accepted as input the distribution of 5, conditional yl)’ and produced as output the distribution of &+1 on Li =(Y~-~,Y~-~,..., conditional on 6,. Under the normality assumption the input distribution was completely summarized by the mean &_ 1 and variance I’,,,_ 1. More generally, we can imagine a recursion whose input is the density f(<, I&- 1) and whose output is f(&+ 116,). These, in general, would be continuous functions, though they can be summarized by their values at a finite grid of points, denoted t(O), t(l), . . . , ttN). Thus the input for Kitagawa’s filter is the set of (N + 1) numbers
KIr,-,)I,,=,(~)
i=O,l,...,N
and the output is (5.4) with t replaced by t + 1. To derive the filter, first notice that under the assumed everything about the past that matters for y,: f(YtI5J
(5.4)
structure,
5, summarizes
=f(Y,I5,~Ld
Thus f(Y*,5,lr,-l)=f(Y,I~,)f(5,Ir,-l) = d-Y, -
~(5,)l_f-(5,1 L 1)
(5.5)
and f(Ytlr,-1)
=
m f(Y,, s -Kl
5,lL
AdL
(5.6)
Given the observed y, and the known form for I(.) and II(.), the joint density (5.5) can be calculated for each 5, = t(‘), i = 0, 1,. . . , N, and these values can then be
3075
Ch. 50: State-Space Models
used to approximate
(5.6) by
The updated density (5.5) by the constant
for 5, is obtained (5.7):
by dividing
each of the N + 1 numbers
in
fKlr,)=f(5tlYt~Ll)
JYdtlL1) f(Y,lL The joint
conditional
(5.8)
1) . density
of 5,+ 1 and 5, is then
f(rt+l,rtIrt)=f(5r+lI5t)f(51lT1) = d5,+ 1
-
4(&)l.m,
I0
(5.9)
For any pair of values t(‘) and 5”’ equation (5.9) can be evaluated at 5, = 5”’ and 5, + 1 = 5”’ from (5.8) and the form df q( .) and 4( .). The recursion is completed by:
f(5t+11Tl)l~t+,=p)= mf(5,+1,5,Ir,)I,,+,=,,j,d5, s -02
+f(t
1+lr51151)ls,+,=6(,,,6,=6ci~1,}3{5(i)-5(i-1)}. (5.10)
An approximation
logf(Yr~Yr-l~..*~
to the log likelihood
Yl) = i
The maximum likelihood values for which (5.11) is Feyzioglu and Hassett approach to a nonlinear,
1=1
h2f(Y,ILJ
can be calculated
from (5.6):
(5.11)
estimates of parameters such as a, b and v are then the greatest. (1991) provided an economic application of Kitagawa’s non-normal state-space model.
J.D. Hamilton
3076
5.2.
Extended
Consider
Kalman jilter
next a multidimensional
normal
state-space
model (5.12)
5*+1= 9(5,)+4+1, Yr =
(5.13)
44 + 45,) + WI,
where I$: IR’-+lR’, a: Rk-+IR” and h: IR’+fR”, u,-i.i.d. N(0, R). Suppose 4 (.) in (5.12) is replaced with a first-order around 4, = &,
N(O,Q) and IV,-i.i.d. Taylor’s approximation
(5.14)
5,+1=~,++,(5,-%,t)+u,+1, where
@t-“$I
4 = d&t) 0.x 1)
(r x r)
For example, suppose would be given by
(5.15)
f &=F,,
r = 1 and 4(.) is the logistic function
1 abexp(-b5,1J f+1=1+aexp(-~~~,,)+[1+aexp(-~~~I,)]2
5
as in (5.3). Then (5.14)
(&-5;,1)+ur+I.
(5.16)
If the form of 4 and any parameters it depends on [such as a and b in (5.3)] are known, then the inference &, can be constructed as a function of variables observed at date t (&) through a recursion to be described in a moment. Thus & and 4$ in (5.14) are directly observed at date t. Similarly the function h(.) in (5.13) can be linearized around I$- 1 to produce Yt =
44 + ht+ fq5t -
Et,,1)+ wt,
(5.17)
where
4 f (nx 1)
h(&,,-1)
Hi
(nxr)
=Wi)
a<:st=it,,-t
(5.18)
Again h, and H, are observed at date t - 1. The function a(.) in (5.13) need not be liearized since X, is observed directly. The idea behind the extended Kalman filter is to treat (5.14) and (5.17) as if they were the true model. These will be recognized as time-varying coefficient versions of a linear state-space model, in which the observed predetermined variable
Ch. 50: State-Space
3011
Models
4, - @&, has been added to the state equation. Retracing the logic behind the Kalman filter for this application, the input for step t of the iteration is again the forecast Et,,_ 1 and mean squared error I’+ 1. Given these, the forecast of _v~is found from (5.17): E(y,Ix,,r,-l)=a(x,)+h, = a@,) + h(&-
(5.19)
1).
The joint distribution of 4, and y, conditional on X, and 4, _ 1 continues to be given by (2.11), with (5.19) replacing the mean of yt and II, replacing H. The contemporaneous inference (2.12) goes through with the same minor modification: &, = &,,-
1+ J’,,,-,H,W:f’,,,- IH, + W- ‘br - 4x,) - 4$,,- Jl.
If (5.14) described of 6, would be
the true model,
then the optimal
forecast
(5.20)
of &+1 on the basis
To summarize, step t of the extended Kalman filter uses &,, _ 1 and I’,,,_ 1 to calculate H, from (5.18) and &, from (5.20). From these we can evaluate @t in (5.15). The output for step t is then (5.21)
$+ 111= &,t),
P~+II,=~~P,I,-,~:-(~~P,,~-,H,(H:P,,,-~H,+R)-'H:P,,~-,~:}+Q. (5.22) The recursion is started with &,, information about the initial state.
5.3.
and
P,,,
representing
the analyst’s
prior
Other approaches to nonlinear state-space models
A number of other approaches to nonlinear state-space models have been explored in the literature. See Anderson and Moore (1979, Chapter 8) and Priestly (1980, 1988) for partial surveys.
References Anderson, B.D.O. and J.B. Moore (1979) Optimal Filtering. Englewood Cliffs, New Jersey: Prentice-Hall, Inc. Ansley, C.F. and R. Kohn (1985) “Estimation, Filtering, and Smoothing in State Space Models with Incompletely Specified Initial Conditions”, Annah of Statistics, 13, 1286-13 16.
3078
J.D. Hamilton
Aoki, M. (1987) State Space Modeling of Time Series. New York: Springer Verlag. Baum, L.E., T. Petrie, G. Soules and N. Weiss (1970) “A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains”, Annals of Mathematical Statistics, 41, 164-171. Box, G.E.P. and G.M. Jenkins (1976) Time Series Analysis: Forecasting and Control, Second edition. San Francisco: Holden-Day. Burmeister, E. and K.D. Wall (1982) “Kalman Filtering Estimation of Unobserved Rational Expectations with an Application to the German Hyperinflation”, Journal of Econometrics, 20, 255-284. Burmeister, E., K.D. Wall and J.D. Hamilton (1986) “Estimation of Unobserved Expected Monthly Inflation Using Kalman Filtering”, Journal of Business and Economic Statistics, 4, 147-160. Caines, P.E. (1988) Linear Stochastic Systems. New York: John Wiley and Sons, Inc. Cecchetti, S.G., P.-S. Lam and N. Mark (1990) “Mean Reversion in Equilibrium Asset Prices”, American Economic Review, 80, 398-418. Chow, G.C. (1984) “Random and Changing Coefficient Models”, in: Z. Griliches and M.D. Intriligator, eds., Handbook of Econometrics. Vol. 2. Amsterdam: North-Holland. Cosslett, S.R. and L.-F. Lee (1985) “Serial Correlation in Discrete Variable Models”, Journal of Econometrics, 27, 79-97. Davies, R.B. (1977) “Hypothesis Testing When a Nuisance Parameter is Present Only Under the Alternative”, Biometrika, 64, 247-254. DeGroot, M.H. (1970) Optima2 Statistical Decisions. New York: McGraw-Hill. De Jong, P. (1988) “The Likelihood for a State Space Model”, Biometrika, 75, 165-169. De Jong, P. (1989) “Smoothing and Interpolation with the State-Space Model”, Journal of the American Statistical Association, 84, 1085-1088. De Jong, P. (1991) “The Diffuse Kalman Filter”, Annals of Statistics, 19, 1073-1083. Dempster, A.P., N.M. Laird and D.B. Rubin (1977) “Maximum Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society, Series B, 39, l-38. Doan, T., R.B. Litterman and C.A. Sims (1984) “Forecasting and Conditional Projection Using Realistic Prior Distributions”, Econometric Reoiews, 3, l-100. Engel, C. and J.D. Hamilton (1990) “Long Swings in the Dollar: Are They in the Data and Do Markets Know It?“, American Economic Reoiew, 80, 689-713. Engle, R.F. and M.W. Watson (1981) “A One-Factor Multivariate Time Series Model of Metropolitan Wage Rates”, Journal of the American Statistical Association, 76, 774-781. Engle, R.F. and M.W. Watson (1987) “The Kalman Filter: Applications to Forecasting and RationalExpectations Models”, in: T.F. Bewley, ed., Advances in Econometrics. Fifth World Congress, Volume I. Cambridge, England: Cambridge University Press. Fama. E.F. and M.R. Gibbons (1982) “Inflation. Real Returns. and Caoital Investment”. Journal of Monetary Economics, 9, 297-i23. ’ Feyzioglu, T. and K. Hassett (1991), A Nonlinear Filtering Technique for Estimating the Timing and Importance of Liquidity Constraints, Mimeographed, Georgetown University. Garcia, R. and P. Perron (1993), An Analysis of the Real Interest Rate Under Regime Shifts, Mimeographed, University of Montreal. Gevers, M. and V. Wertz (1984) “Uniquely Identifiable State-Space and ARMA Parameterizations for Multivariable Linear Systems”, Automatica, 20, 333-347. Ghosh, D. (1989) “Maximum Likelihood Estimation of the Dynamic Shock-Error Model”, Journal of Econometrics, 41, 121-143. Goldfeld, S.M. and R.M. Quandt (1973) “A Markov Model for Switching Regressions”, Journal of Econometrics, 1, 3-16. Gordon, K. and A.F.M. 
Smith (1990) “Modeling and Monitoring Biomedical Time Series”, Journal of the American Statistical Association, 85, 328-337. Hamilton, J.D. (1985) “Uncovering Financial Market Expectations of Inflation”, Journal of Political Economy, 93, 1224-1241. Hamilton, J.D. (1986) “A Standard Error for the Estimated State Vector of a State-Space Model”, Journal of Econometrics, 33, 387-397. Hamilton, J.D. (1988) “Rational-Expectations Econometric Analysis of Changes in Regime: An Investigation of the Term Structure of Interest Rates”, Journal of Economic Dynamics and Control, 12, 385-423.
Ch. 50: State-Space Models
3079
Hamilton, J.D. (1989) “A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle”, Econometrica, 57, 3577384. Hamilton, J.D. (1990) “Analysis ofTime Series Subject to Changes in Regime”, Journal afEconometrics, 45, 39-70. Hamilton, J.D. (1991) “A Quasi-Bayesian Approach to Estimating Parameters for Mixtures of Normal Distributions”, Journal of Business and Economic Statistics, 9, 27-39. Hamilton, J.D. (1993) Specification Testing in Markov-Switching Time Series Models, Mimeographed, University of California, San Diego. Hamilton, J.D. (1994) Time Series Analysis. Princeton, N.J.: Princeton University Press. Hannan, E.J. (1971) “The Identification Problem for Multiple Equation Systems with Moving Average Errors”, Econometrica, 39, 751-765. Hansen, B.E. (1992) “The Likelihood Ratio Test Under Non-Standard Conditions: Testing the Markov Trend Model of GNP”, Journal of Applied Econometrics, 7, S61-S82. Hansen, B.E. (1993) Inference When a Nuisance Parameter is Not Identified Under the Null Hypothesis, Mimeographed, University of Rochester. Harvey, A.C. (1987) “Applications of the Kalman Filter in Econometrics”, in: T.F. Bewley, ed., Advances in Econometrics, Fifth World Congress, Volume I. Cambridge, England: Cambridge University Press. Harvey, A.C. (1989) Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge, England: Cambridge University Press. Harvey, A.C. and G.D.A. Phillips (1979) “The Maximum Likelihood Estimation of Regression Models with Autoregressive-Moving Average Disturbances”, Biometrika, 66, 49-58. Harvey, A.C. and R.G. Pierse (1984) “Estimating Missing Observations in Economic Time Series”, Journal of the American Statistical Association, 79, 125-131. Harvey, A.C. and P.H.J. Todd (1983) “Forecasting Economic Time Series with Structural and BoxJenkins Models: A Case Study”, Journal of Business and Economic Statistics, 1, 2999307. Imrohoroglu, S. (1993) “Testing for Sunspots in the German Hyperinflation”, Journal of Economic Dynamics and Control, 17, 289-317. Jones, R.H. (1980) “Maximum Likelihood Fitting of ARMA Models to Time Series with Missing Observations”, Technometrics, 22, 3899395. Kalman, R.E. (1960) “A New Approach to Linear Filtering and Prediction Problems”, Journal of Basic Engineering, Transactions of the ASME, Series D, 82, 35545. Kalman, R.E. (1963) “New Methods in Wiener Filtering Theory”, in: J.L. Bogdanoff and F. Kozin, eds., Proceedings of the First Symposium of Engineering Applications of Random Function Theory and Probability, pp. 270-388. New York: John Wiley & Sons, Inc. Kiefer, N.M. (1978) “Discrete Parameter Variation: Efficient Estimation of a Switching Regression Model”, Econometrica, 46, 4277434. Kiefer, N.M. (1980) “A Note on Switching Regressions and Logistic Discrimination”, Econometrica, 48, 1065-1069. Kim, C.-J. (1994) “Dynamic Linear Models with Markov-Switching”, Journal ofEconometrics, 60, l-22. Kitagawa, G. (1987) “Non-Gaussian State-Space Modeling of Nonstationary Time Series”, Journal of the American Statistical Association, 82, 1032-1041. Kahn, R. and C.F. Ansley (1986) “Estimation, Prediction, and Interpolation for ARIMA Models with Missing Data”, Journal of the American Statistical Association, 81, 751-761. Lam, P.-S. (1990) “The Hamilton Model with a General Autoregressive Component: Estimation and Comparison with Other Models of Economic Time Series”, Journal of Monetary Economics, 26, 409-432. Leybourne, S.J. and B.P.M. 
McCabe (1989) “On the Distribution of Some Test Statistics for Coefficient Constancy”, Biometrika, 76, 1699177. Magnus, J.R. and H. Neudecker (1988) Matrix Differential Calculus with Applications in Statistics and Econometrics. New York: John Wiley & Sons, Inc. Meinhold, R.J. and N.D. Singpurwalla (1983) “Understanding the Kalman Filter”, American Statistician, 37, 123-127. Naba, S. and K. Tanaka (1988) “Asymptotic Theory for the Constance of Regression Coefficients Against the Random Walk Alternative”, Annals of Statistics, 16, 2188235. Nash, J.C. and M. Walker-Smith (1987) Nonlinear Parameter Estimation: An Integrated System in Basic. New York: Marcel Dekker.
3080
J.D. Hamilton
Nicholls, D.F. and A.R. Pagan (1985) “Varying Coefficient Regression”, in: E.J. Hannan, P.R. Krishnaiah and M.M. Rao, eds., Handbook of Statistics.Vol. 5. Amsterdam: North-Holland. Pagan, A. (1980) “Some Identification and Estimation Results for Regression Models with Stochastically Varying Coefficients”, Journal of Econometrics, 13, 341-363. Priestly, M.B. (1980) “State-Dependent Models: A General Approach to Non-Linear Time Series Analysis”, Journal of Time Series Analysis, 1, 47-71. Priestly, M.B. (1988) “Current Developments in Time-Series Modelling”, Journal of Econometrics, 37, 67-86. Quandt, R.E. (1958) “The Estimation of Parameters of Linear Regression System Obeying Two Separate Regimes”, /ournal of the American Statistical Association, 55, 873-880. Quandt, R.E. (1983) “Computational Problems and Methods”, in: Z. Griliches and M.D. Intriligator, eds., Handbook of Econometrics. Volume 1. Amsterdam: North-Holland. Raj, B. and A. Ullah (1981) Econometrics: A Varying Coeficients Approach. London: Groom-Helm. Sargent, T.J. (1989) “Two Models of Measurements and the Investment Accelerator”, Journal of Political Economy, 97,251-287. Shumway, R.H. and D.S. Stolfer (1982) “An Approach to Time Series Smoothing and Forecasting Using the EM Algorithm”, Journal of Time Series Analysis, 3, 253-263. Shumway, R.H. and D.S. Staffer (1991) “Dynamic Linear Models with Switching”, Journal of the American Statistical Association, 86, 7633769. Sims, C.A. (1982) “Policy Analysis with Econometric Models”, Brookings Papers on Economic Activity, 1, 1077152. Stock, J.H. and M.W. Watson (1991) “A Probability Model of the Coincident Economic Indicators”, in: K. Lahiri and G.H. Moore, eds., Leading Economic Indicators: New Approaches and Forecasting Records, Cambridge, England: Cambridge University Press. Tanaka, K. (1983) “Non-Normality of the Lagrange Multiplier Statistic for Testing the Constancy of Regression Coefficients”, Econometrica, 51, 1577-1582. Tjostheim, D. (1986) “Some Doubly St&hastic Time Series Models”, Journal of Time Series Analysis, 7, 51-72. Wall, K.D. (1980) “Generalized Expectations Modeling in Macroeconometrics”, Journal of Economic Dynamics and Control, 2, 161-184. Wall, K.D. (1987) “Identification Theory for Varying Coefficient Regression Models”, Journal of Time Series Analysis, 8, 3599371. Watson, M.W. (1989) “Recursive Solution Methods for Dynamic Linear Rational Expectations Models”, Journal of Econometrics, 41, 65-89. Watson, M.W. and R.F. Engle (1983) “Alternative Algorithms for the Estimation of Dynamic Factor, MIMIC, and Varying Coefficient Regression Models”, Journal of Econometrics, 23, 385-400. Watson, M.W. and R.F. Engle (1985) “Testing for Regression Coefficient Stability with a Stationary AR(l) Alternative”, Review of Economics and Statistics, 67, 341-346. White, H. (1982) “Maximum Likelihood Estimation of Misspecified Models”, Econometrica, 50, l-25,
Chapter 51
STRUCTURAL ESTIMATION DECISION PROCESSES* JOHN
OF MARKOV
RUST
University of Wisconsin
Contents
1. 2.
Introduction Solving MDP’s via dynamic 2.1.
Finite-horizon decision
4.
and the optimality
2.3.
Bellman’s equation,
2.4.
A geometric Overview
dynamic
series representation
of Markovian
Alternative
3.2.
Maximum
3.3.
Alternative
and Bellman’s equation
mappings
and optimality
3095
for discrete decision
processes
estimation
estimation
3.4.
Alternative
3.5.
The identification
estimation
methods: methods:
3101
of DDP’s Finite-horizon Infinite-horizon
DDP problems DDP’s
problem
Optimal
replacement
4.2.
Optimal
retirement
3118 3123 3125
3130
applications
4.1.
3099 3100
models of the “error term” likelihood
3091 3091 3094
for MDP’s
methods
methods
3.1.
programming
contraction
of solution
Econometric
3082 3088 3089
Infinite-horizon
Empirical
programming
A brief review
rules
2.2.
2.5.
3.
dynamic
programming:
of bus engines from a firm
References
3130 3134
3139
*This is an abridged version of a monograph, Stochastic Decision Processes: Theory, Computation, and Estimation written for the Leif Johansen lectures at the University of Oslo in the fall of 1991. I am grateful for generous financial support from the Central Bank of Norway and the University of Oslo and comments from John Dagsvik, Peter Frenger and Steinar Stram.
Handbook of Econometrics. Volume IV, Edited by R.F. Engle and D.L. McFadden 0 1994 Elsevier Science B. V. All rights reserved
John Rust
3082
1.
Introduction
Markov decision processes (MDP) provide a broad framework for modelling sequential decision making under uncertainty. MDP’s have two sorts of variables: state variables s, and control variables d,, both of which are indexed by time or agent t=0,1,2,3 ,..., T, where the horizon T may be infinity. A decision-maker can be represented by a set of primitives (u, p, b) where u(s,, d,) is a utility function representing the agent’s preferences at time t, p(s,+ 1(s,, d,) is a Markov transition probability representing the agent’s subjective beliefs about uncertain future states, and /3~(0,1) is the rate at which the agent discounts utility in future periods. Agents are assumed to be rational: they behave according to an optimal decision rule d, = 6(s,) that solves V,‘(s) = maxd E, { CT= o Fu(s,, d,)) s,, = s} where Ed denotes expectation with respect to the controlled stochastic process {st,d,) induced by the decision rule 6. The method of dynamic programming provides a constructive procedure for computing 6 using the valuefunction V,’ as a “shadow price” to decentralize a complicated stochastic/multiperiod optimization problem into a sequence of simpler deterministic/static optimization problems. MDP’s have been extensively used in theoretical studies because the framework is rich enough to model most economic problems involving choices made over time and under uncertainty.’ A pplications include the pioneering work on optimal inventory policy by Arrow et al. (1951), investment under uncertainty [Lucas and Prescott (1971)] optimal intertemporal consumption/savings and portfolio selection under uncertainty [Phelps (1962), Hakansson (1970), Levhari and Srinivasan (1969), Merton (1969) and Samuelson (1969)], optimal growth under uncertainty [Brock and Mirman (1972), Leland (1974)], models of asset pricing [Lucas (1978), Brock (1982)], and models of equilibrium business cycles [Kydland and Prescott (1982), Long and Plosser (1983)]. By the early 1980’s the use of MDP’s had become widespread in both micro- and macroeconomic theory as well as in finance and operations research. In addition to providing a normative theory of how rational agents “should” behave, econometricians soon realized that MDP’s might provide good empirical models of how real-world decision-makers actually behave. Most data sets take the form {dT,sy) where d; is the decision and s; is the state of an agent a at time t.2 Reduced-form estimation methods can be viewed as uncovering agents’ decision
‘Stochastic control theory can also be used to model “learning” behavior in which beliefs about unobserved stae variables and unknown parameters of the transition according to the Bayes rule. ‘In time-series data, a is fixed at 1 arid t ranges over 1,. , T. In cross-sectional data at 1 and a ranges over 1,. ., A. In panel data sets, t ranges over 1,. ., T.. where T, is periods agent a is observed (possibly different for each agent) and a ranges over 1,. the total number of agents in the sample.
agents update probabilities sets, T is fixed the number of , A where A is
Ch. 51: Structural
Estimation
ofMarkov
3083
Decision Processes
rules or, more generally, the stochastic process from which the realizations {df, $‘} were “drawn”, but are generally independent of any particular behavioral theory.3 This chapter focuses on structural estimation of MDP’s under the maintained hypothesis that {d;, ST}is a realization of a controlled stochastic process. In addition to uncovering the form of this stochastic process (and the associated decision rule 6), structural methods attempt to uncover (estimate) the primitives (u,p,B) that generated it. Before considering whether it is technically possible to estimate agents’ preferences and beliefs, we need to consider whether this is even logically possible, i.e. whether (u,p,fi) is identified. I discuss the identification problem in Section 3.5, and show that the question of identification depends on what type of data we have access to (i.e. experimental vs. no&experimental), and what kinds of a priori restrictions we are willing to impose on (u, p, p). If we only have access to non-experimental data (i.e. uncontrolled observations of agents “in the wild”), and if we are unwilling to impose any prior restrictions on (u, p, j?) beyond basic measurability and regularity conditions on u and p, then it is impossible to consistently estimate (u, p, b), i.e. the class of all MDP’s is non-parametrically unidentified. On the other hand, if we are willing to restrict u and p to a finite-dimensional parametric family, say {U= u,, p = psi 13~0 c RK}, then the primitives (u,p, /?) are identified (generically). If we are willing to impose an even stronger prior restriction, stationarity and rational expectations (RE), then we only need parametric restrictions on u in order to identify (u,p,fl) since stationarity and the RE hypothesis allow us to use non-parametric methods to consistently estimate agents’ subjective beliefs from observations of their past states and decisions. Given that we are already imposing strong prior assumptions by modelling agents’ behavior as an optimal decision rule to an MDP, it would be somewhat schizophrenic to be unwilling to impose any additional prior restrictions on (u, p, /3). In the sequel, I assume that the econometrician is willing to bring to bear prior knowledge in the form of a parametric representation for (u, p, /?).This reduces the problem of structural estimation to the technical issue of estimating a parameter vector BE0 where 0 is a compact subset of RK. The appropriate econometric method for estimating 8 depends critically on whether the control variable d, is continuous or discrete. If d, can take on a continuum of possible values we say that the MDP is a continuous decision process (CDP), and if d, can take on a finite or countable number of values then the MDP is a discrete decision process (DDP). The predominant estimation method for CDP’s is generalized method of moments (GMM) using the first order conditions from the MDP problem (stochastic Euler equations) as orthogonality conditions [Hansen (1982), Hansen and Singleton (1982)]. Hansen’s chapter (this volume) and Pakes’s (1994) survey provide excellent introductions to the literature on structural estimation methods for CDP’s. ‘For Lancaster
an overview of this literature,see Billingsley(1961), Chamberlain (1990) and Basawa and Prakasa Rao (1980).
(1984), Heckman
(1981a),
3084
John Rust
Thus chapter focuses on structural estimation of DDP’s. DDP’s are appropriate for decision problems such as whether not to quit a job [Gotz and McCall (1984)], search for a new job [Miller (1984)], have a child [Wolpin (1984)-J, renew a patent [Pakes (1986)], replace a bus or airplane engine [Rust (1987), Kennet (1994)] or retire a cement kiln [Das (1992)]. Although most of the early empirical applications of DDP’s have been for binary decision problems, this chapter shows that most of the estimation methods extend naturally to DDP’s with any finite number of possible decisions. Examples of multiple choice DDP’s include Rust’s (1989, 1993) model of retirement behavior where workers decide each period whether to work full-time, work part-time, or quit, and whether or not to apply for Social Security, and Miller’s (1984) multi-armed-bandit model of occupation choice. Since the control variable in a DDP model assumes at most a finite number of possible values, the optimal decision rule is determined by the solution to a system of inequalities rather than as a zero to a first order condition. As a result there is no analog of stochastic Euler equations to serve as orthogonality conditions for GMM estimation of 8 as in the case of CDP’s. Instead, most structural estimation methods for DDP’s require explicit calculation of the optimal decision rule 6, typically via numerical methods since analytic solutions for 6 are quite rare. Although we also discuss simulation estimators that rely on Monte Carlo simulations of the controlled stochastic process {st, d,} rather than on explicit numerical calculation of 6, all of these methods can be conceptualized as forms of nonlinear regression that search for an estimate 8 whose implied decision rule d, = 6(s,, 6) “best fits” the data {d:, s;} according to some metric. Unfortunately straightforward application of nonlinear regression methods is not possible due to three complications: (1) the “dependent variable” d, is discrete rather than continuous; (2) the functional form of 6 is generally not known a priori but rather must be derived from the solution to the stochastic control problem; (3) the “error term” E, in the “regression function” 6 is typically multi-dimensional and enters in a non-additive, non-separable fashion: d, = 6(x,, s,, 0). The basic motivation for including an error term in the DDP model is to obtain a “statistically non-degenerate” econometric model. The degeneracy of DDP models without error terms is due to a basic result of MDP theory reviewed in Section 2: the optimal decision rule 6 is a deterministic function of the state s,. Section 3.1 offers several possible interpretations for the error terms in a DDP model, but argues that the most natural and internally consistent interpretation is that E, is an unobserved state variable. Under this interpretation, we partition the full state variable s, = (x,, E,) into a subvector x, that is observed by the econometrician, and a subvector E, that is observed only by the agent. If we are willing to impose two additional restrictions on u and p, namely, that E, enters u in an additive separable (AS) fashion and that p satisfies a conditional independence (CI) condition, we can apply a number of powerful results from the literature on estimation of static discrete choice models [McFadden (1981, 1984)] to yield estimators of 0 with desirable asymptotic properties. In particular, the ASCI assumption allows us to
Ch. 51: Structural Estimation of Markou Decision Processes
3085
“integrate out” E, from the decision rule 6, yielding a non-degenerate system of conditional choice probabilities P(d,l x,, 0) for estimating 8 by the method of maximum likelihood. Under the further restriction that {E,} is an IID extreme value process we obtain a dynamic generalization of the well-known multinomial logit model,
(1.1) As far as estimation is concerned, the main difference between the static and dynamic logit models is the interpretation of the uOfunction: in the static logit model it is a one period utility function that is typically specified as a linear-in-parameters function of 13,whereas in the dynamic logit model it is the sum of a one period utility function plus the expected discounted utility in all future periods. Since the functional form of ugin DDP is generally not known a priori, its values must be computed numerically for any particular value of 8. As a result, maximum likelihood estimation of DDP models requires a “nested numerical solution algorithm” consisting of an “outer” optimization algorithm that searches over the parameter space 0 to maximize the likelihood function and an “inner” dynamic programming algorithm that solves (or approximately solves) the stochastic control problem and computes the choice probabilities P(dl x, 6) and derivatives aP(dI x, @/a0 for each trial value of 8. There are a number of fast algorithms for solving finite- and infinite-horizon stochastic control problems, but space constraints prevent more than a cursory discussion of the main methods in this chapter. Section 3.3 presents other econometric specifications for the error term that allow E,to enter u in a nonlinear, non-additive fashion, and, also, specifications with more complicated patterns of serial dependence in {E,}than is allowed by the CI assumption. Section 3.4 discusses the simulation estimator proposed by Hotz et al. (1993) that avoids the computational burden of the nested numerical solution methods, and the associated “curse of dimensionality”, i.e. the exponential rise in the amount of computer time/space required to solve a DDP problem as its “size” (measured in terms of number of possible values the state and control variables can assume) increases. However, the curse of dimensionality also has implications for the “data” and “estimation” complexity of a DDP model: as the size (i.e. the level of realism or detail) of a DDP model increases, the amount of data needed to estimate the model with an acceptable degree of precision increases more than proportionately. The problems are most severe for estimating beliefs, p. Subjective beliefs can be very slippery, high-dimensional objects to estimate. Since the optimal decision rule 6 is generally quite sensitive to the specification of p, an innaccurate or inconsistent estimate of p will contaminate the estimates of u and /I. Even under the assumption of rational expectations (which allows us to estimate p non-parametrically), the number of observations required to calculate estimates of p of specified accuracy increases exponentially with the number of state and control variables included in the model. The simulation estimator is particularly data-dependent in that it requires
3086
John Rust
accurate non-parametric estimates of agents’ conditional choice probabilities P as well as their beliefs p. Given all the difficulties involved in structural estimation, the reader might wonder why not simply estimate agents’ conditional choice probabilities P using simpler flexible parametric and non-parametric estimation methods. Of course, reduced-form methods can be used, and are quite useful for initial exploratory data analysis and judging whether more tightly parameterized structural models are misspecified. Nevertheless there is considerable interest in structural estimation methods for both intellectual and practical reasons. The intellectual reason is that structural estimation is the most direct way to assess the empirical validity of a specific MDP model: in the process of solving, estimating, and testing a particular MDP model we learn not only about the data, but the detailed implications of the theory. The practical motivation is that structural models can generate more accurate predictions of the impacts of policy changes than reduced-form models. As Lucas (1976) noted, reduced-form econometric techniques can be thought of as uncovering the form of an agent’s historical decision rule. The resulting estimate 8 can then be used to predict the agent’s behavior in the future, provided that the environment is stationary. Lucas showed that reduced-form estimates can produce very misleading forecasts of the effects of policy changes that alter the stochastic environment that agents will face in the future. 4 The reason is that a policy CI(such as government rules for payment of Social Security or welfare benefits) can affect an agent’s preferences, beliefs and discount factor. If we denote the dependence of primitives on policy as (u,,~~,fiJ, then under a new decision rule ~1’the agent’s behavior will be given by a new decision rule 6(u,., pa,, /?,,) rather than the historical decision rule 6(u,,pa, fi,). Unless there has been a lot of historical variation in policies a, reduced-form models won’t be able to estimate the independent effect of CYon 6, and, therefore, we won’t be able to predict how agents will react to a hypothetical policy Co. However if we are able to parameterize the way in which policy affects the primitives, (ub, pb, fi,), then it is a typically straightforward exercise to compute the new decision rule a(~,,, p,,, b,,) for a hypothetical policy a’. One can push this line of argument only so far, since its validity depends on. the assumption that agents really are rational expected-utility maximizers and the structural model is correctly specified. If we admit that a tightly parameterized structural model is at best an abstract and approximate representation of reality, there is no reason why a structural model necessarily yields more accurate forecasts than reduced-form models. Furthermore, because of the identification problem it is possible that we could have a situation where two distinct sets of primitives fit an historical data set equally well, but yield very different predictions about the impact of a hypothetical policy. Under such circumstances there is no objective basis for choosing one prediction over another, and we may have to go to the expense of “The limitations of reduced-form models have also been pointed out in an earlier paper by Marschak (1953), although his exposition pertained more to the static econometric model> of that period. 
These general ideas can be traced back even further to the work of Haavelmli (1944) and others at the Cowles Commission.
Ch. 51: Structural Estimation of Markov Decision Processes
3087
conducting a controlled experiment to help identify the primitives and predict the impact of a new policy u’.~ In spite of these problems, the final section of this chapter provides some empirical applications that demonstrate the ability of simple structural models to make much more accurate predictions of the effects of various policy changes than reduced-form models. Readers who are familiar with the theory of stochastic control are free to skip the brief review of theory and solution methods in Section 2 and move directly to the econometric implementation of the theory in Section 3. A general observation about the current state of the art in this literature is that, while it is easy to formulate very general and detailed MDP’s, Bellman’s “curse of dimensionality” implies that our ability to actually solve and estimate these problems is much more limited.6 However, recent research [Rust (1995b)] shows that use of random Monte Carlo integration methods does succeed in breaking the curse of dimensionality for the subclass of DDP’s. This result offers the promise that fairly realistic and detailed DDP models will be estimable in the near future. The approach of this chapter is to start with a presentation of the general theory of MDP’s and then show how various restrictions on the general theory lead to subclasses of econometric models that are feasible to estimate. The first general restriction is to exclude MDP’s formulated in continuous time. Although many of the results described in Section 3 can be generalized to continuoustime semi-Markov processes [Ahn (1993b)], there has been little progress on extending the theory to cover other types of continuous-time objects such as controlled diffusion processes. The rationale for using discrete-time models is that solutions to continuous-time problems can be arbitrarily closely approximated by solutions to corresponding discrete-time versions of the problem [cf. Gihman and Skorohod (1979, Chapter 2.3) van Dijk (1984)]. Indeed the standard approach to solving continuous-time stochastic control problems involves solving an approximate version of the problem in discrete time [Kushner (1990)]. The second restriction is implicit in the theory of stochastic control, namely the assumption that agents conform to the von NeumannMorgenstern axioms for choice under uncertainty so that their preferences can be represented by the expected value of a cardinal utility function. A number of experiments have indicated that human decision-making under uncertainty may not always be consistent with the von Neumann-Morgenstern axioms. ’ In addition, expected-utility models imply that agents are indifferent about the timing of the resolution of uncertain events, whereas human decision-makers seem to have definite preferences over the time at which uncertainty is resolved [Kreps and Porteus (1978), Chew and Epstein (1989)]. The justification for focusing on expected utility is that it remains the most tractable 5Experimental data are subject to their own problems, and it would be a mistake to think of controlled experiments as the only reliable way to predict the response to a new policy. See Heckmail (1991, 1994) for an enlightening discussion of some of these limitations. 6See Rust (1994, Section 2) for a more detailed discussion of some of the problems faced in estimating MDP’s. ‘Machina (1982) identifies the “independence axiom” as the source of many of the discrepancies.
John Rust
3088
framework for modelling choice under uncertainty.8 Furthermore, Section 3.5 shows that, from an econometric standpoint, the expected-utility framework is sufficiently rich to model virtually any type of observed behavior. Our ability to discriminate between expected utility and the more subtle non-expected-utility theories of choice under uncertainty may require quasi-econometric methods such as controlled experiments.’
2.
Solving MDP’s via dynamic programming:
A brief review
This section reviews the main results on dynamic programming in finite-horizon problems, and the functional equations that must be solved in infinite-horizon problems. Due to space constraints I only give a cursory outline of the main numerical methods for solving these functional equations, referring the reader to Puterman (1990) or Rust (1995a, 1996) for more in-depth surveys. Dejnition
2.1
A (discrete-time) l l l l l l
Markovian
decision
process consists
of the following
objects:
A time index te{O, 1,2 ,..., T}, T < 00; A state space S; A decision space D; A family of constraint sets (Dt(st) G D}; A family of transition probabilities {pt+ l(.Is,, ~&):54?‘(S)=-[0, 11);” A family of discount functions {bt(s,, d,) > 0} and single period utility functions {u,(s,, d,)} such that the utility functional U has the additively separable decomposition’ ’
(2.1)
‘Recent work by Epstein and Zin (1989) and Hansen and Sargent (1992) on models with non-separable, non-expected-utility functions shows that certain specifications are computationally and analytically tractable. Epstein and Zin have already used their specification of preferences in an empirical investigation of asset pricing. Despite these promising beginnings, the theory and computational methods for these more general problems are in their infancy, and due to space constraints, we are unable to cover these methods in this survey. 9An example of the ability of laboratory experiments to uncover discrepancies between human behavior and the predictions of expected-utility theory is the “Allias paradox” described in Machina (1982, 1987). ‘“3’(S) is the Bore1 a-algebra of measurable subsets of s. For simplicity, the rest of this chapter avoids measure-theoretic details since they are superfluous in the most commonly encountered case where both the state and control variables are discrete. See Rust (1996) for a statement of the required regularity conditions for problems with continuous state and control variables. t ’ The boldface notation denotes sequences: s = (sc,. , sT).Also, define fI,~~lO~j(sj, dj) = 1 in formula (2.1).
Ch. 51: Structural Estimation
ofMar-kmDecision
3089
Processes
The agent’s optimization problem is to choose (6,, ,6,) to solve the following problem:
an optimal
max Ed{ U(s, d)}. d=(du,....SZ-)
2.1.
decision
rule 6* =
(2.2)
Finite-horizon dynamic programming Markovian decision rules
and the optimality of
In finite-horizon problems (T < co), the optimal decision rule S* = (S,‘, . . , iif) can be computed by backward induction starting at the terminal period, T. In principle, the optimal decision at each time t can depend not only on the current state s,, but on the entire previous history of the process, d, = sT(.s,, H,_ 1) where H, = (so, d,, . . . , s,_ 1,A,_ 1). However in carrying out the process of backward induction it is easy to see that the Markovian structure of p and the additive separability of U imply that it is unnecessary to keep track of the entire previous history: the optimal decision rule depends only on the current time t and the current state s,: d, = ST(s,). For example, starting in period T we have
S,(H,-
1,
4 = argmaxWf,-
where U can be rewritten
1,
Q-, 4-h
(2.3)
as
From (2.4) it is clear that previous history H,_ 1 does not affect the optimal decision of d, in (2.3) since d, appears only in the final term ur(sr, dT) on the right hand side of (2.4). Since the final term is affected by H,_ 1 only by the multiplicative discount factor nT:,i Bj(sj, dj), it’s clear that 6, depends only on sr. Working backwards recursively, it is straightforward to verify that at each time t the optimal decision rule 6, depends only on s,. A decision rule that depends on the past history of the process only via the current state s, is called Markovian. Notice also that the optimal decision rule will generally be a deterministic function of s, because randomization can only reduce expected utility if the optimal value of d, in (2.3) is unique. This is a generic property, since if there are two distinct values of dED,(S,) that attain the maximum in (2.3) by a slight perturbation of U, we obtain a similar model where the maximizing value is unique. The valuefunction is the expected discounted value of utility over the remaining
John Rust
3090
horizon assuming an optimal policy is followed in the future. The method of dynamic programming calculates the value function and the optimal policy recursively as follows. In the terminal period VF and SF are defined by
6$)
(2.5)
= axmax u&-, 44, dTdk(sT)
VF(s,) = In periods
(2.6)
max t+(s,, dT). dTEDT(sT)
t = 0,. . . , T - 1, VT and ST are recursively
defined by
It’s straightforward to verify that at time t = 0 the value function V:(Q) represents the conditional expectation of utility over all future periods. Since dynamic programming has recursively generated the optimal decision rule 6* = (S,‘, . . . , SF), it follows that Vi(s) = maxE,{ s
U(&J)/s,
These results can be formalized
= s}.
(2.9)
as follows.
Theorem 2.1 Given an MDP that satisfies certain weak regularity conditions [see Gihman and Skorohod (1979)], 1. An optimal, non-randomized decision rule 6* exists, 2. An optimal decision rule can be found within the subclass of non-randomized Markovian strategies, 3. In the finite-horizon case (T < co) an optimal decision rule 6* can be computed by backward induction according to the recursions (2.5), . . . , (2.8), 4. In the infinite-horizon case (T = co) an optimal decision rule 6* can be approximated arbitrarily closely by the optimal decision rule Sg to an N-period problem in the sense that lim Ed, { U,(S, a)} = 1’im sup Ed{ U,(S, a)} = sup Ed{ U(S, a)}. N-m
N
N+m
6
6
(2.10)
Ch. 51: Structural Estimation ofMarkov Decision Processes
2.2.
3091
Infinite-horizon dynamic programming and Bellman’s equation
Further simplifications are possible in the case of stationary MDP’s In this case the transition probabilities and utility functions are the same for all t, and the discount functions &(s,, d,) are set equal to some constant /?E[O, 1). In the finitehorizon case the time homogeneity of u and p does not lead to any significant simplifications since there still is a fundamental non-stationarity induced by the fact that remaining utility C~_/Iju(sj, dj) depends on t. However in the infinite-horizon case, the stationary Markovian structure of the problem implies that the future looks the same whether the agent is in state s, at time t or in state s,+~ at time t + k provided that s, = s, + k. In other words, the only variable which affects the agent’s view about the future is the value of his current state s. This suggests that the optimal decision rule and corresponding value function should be time invariant, i.e. for all t 3 0 and all ES, 6?(s) = 6(s) and V:(s) = V(s). Analogous to equation (2.7), 6 satisfies 6(s) = argmax u(s, d) + p V(s’)p(ds’l s, d) 1’ dsDW [ .f where V is defined recursively
as the solution
(2.11)
to Bellman’s equation,
V(s) = max U(S,d) + p V(s’)p(ds’ 1s, d) 1’ d..,G[ s
(2.12)
It is easy to see that if a solution to Bellman’s equation exists, then it must be unique. Suppose that W(s) is another solution to (2.12). Then we have
I V(s)- W(s)1< B max I V(s’) - W(s’)Ip(ds’I s, d) s
dsD(s)
<j?supIV(s)-
W(s)/.
(2.13)
SE.7
Since 0 < fi < 1, the only solution
2.3.
to (2.13) is SUP,,~ I V(s) - W(s)/ = 0.
Bellman’s equation, contraction mappings and optimality
To establish the existence of a solution to Bellman’s equation, assume for the moment the following regularity conditions: (1) u&d) is jointly continuous and bounded in (s, d), (2) D(s) is a continuous correspondence. Let C(S) denote the vector space of all continuous, bounded functions f: S -+ R under the supremum norm, Ilf 1)= sup,,s If(s)/. Then C(S) is a Banach space, i.e. a complete normed linear
John Rust
3092
space. l2 Define an operator
r: C(S) -+ C(S) by
W(s’)p(ds’
Bellman’s
equation
can then be rewritten
Is, d)
in operator
(2.14)
notation
as
v = l-(V),
(2.15)
i.e. V is a fixed point of the mapping r. Using an argument easy to show that given any I’, WEC(S) we have
IIWY-mw An operator mapping.
to (2.13) it is
GBIIV-w.
that satisfies inequality
Theorem 2.2. (Contraction If I- is a contraction V.
similar
(2.16) (2.16) for some /?E(O, 1) is said to be a contraction
mapping theorem)
mapping
on a Banach space B, then I- has a unique
fixed point
The uniqueness of the fixed point can be established by an argument similar to (2.13). The existence of a solution is a result of the completeness of the Banach space B. Starting from any initial element of B (such as 0), the contraction property (2.16) implies that the following sequence of successive approximations forms a Cauchy sequence in B:
(0, r(o), r2(o), P(o),
. . . , r”(o), . . .}.
(2.17)
Since the Banach space B is complete, the Cauchy sequence converges to a point VEB, so existence follows by showing that V is a fixed point of r. To see this, note that a contraction r is (uniformly) continuous, so V = lim r"(o) = lim “+LZ “‘ao
r[rn-l(0)] = r(v),
(2.18)
i.e. V is indeed the required fixed point of r. We now show that given the single period decision rule 6 defined in (2.11) the stationary, infinite-horizon policy 6* = (6,6,. . .) does in fact constitute an optimal decision rule for the infinite-horizon MDP. This result follows by showing that the unique solution V(s) to Bellman’s equation coincides with the optimal value “A
space B is said to be complete
if every Cauchy
seuence in B converges
to a point in B.
3093
Ch. 51: Structural Estimation of Markov Decision Processes
function
Vz defined by
V:(s) = my
E,
f
(2.19)
~u(s~, dt)lso = s .
t=o Consider approximating the infinite-horizon horizon problem with value function
V,‘(s)= maxE, d
f i
~u(s,, d,) 1so =
t=o
s
problem
by the solution
to a finite-
I .
Since u is bounded and continuous, CT= o /Y&, d,) converges to C,“=, p’u(s,, d,) for any sequences s = (so, sr, . .) and d = (do, d,, . . .). Theorem 2.1(4) implies that for each SE& V:(s) converges to the infinite-horizon value function V;(s): lim V,‘(s) = V;(s)
(2.21)
vses.
T-02
But the contraction mapping theorem also implies that this same sequence converges to V (since V,‘= rT(0)), so V = I/F. Since V is the expected present discounted value of utility under the policy 6* (a result we demonstrate in Section 2.4), the fact that V = Vz immediately implies the optimality of 6*. A similar result can be proved under weaker conditions that allo)v u(s,d) to be an unbounded function of the state variable. As we will see in Section 3, unbounded utility functions arise in DDP problems as a consequence of assumptions about the distribution of unobserved state variables. Although the contraction mapping theorem is no longer directly applicable, one can prove the following result, a generalization of Blackwell’s theorem, under a weaker set of regularity conditions that allows for unbounded utility. Theorem 2.3 (Blackwell’s
theorem)
Given an infinite-horizon, time homogeneous MDP that satisfies certain regularity conditions [see Bhattacharya and Majumdar (1989)]; 1. A unique solution V to Bellman’s equation (2.12) exists, and it coincides with the optimal value function defined in (2.19), 2. There exists a stationary, non-randomized, Markovian optimal control 6* given by the solution to (2.1 l), 3. There is an optimal non-randomized, Markovian decision rule 6* which can be approximated by the solution 6 z to an N-period problem with utility function U,(s, d) = C;“_ o B’U(%do: lim Eg* { U,(F, (7)) = 1’rm sup E6{ U&,2)) N-rm
N
N-m
6
= sup E,(U(Z, 2)) = Ed*{ U(s’, ai>. 6
(2.22)
John Rust
3094
2.4.A geometric series representation for MDP's Presently, the most commonly used solution procedure for MDP problems involves discretizing continuous state and control variables into a finite number of possible values. This resulting class of finite state DDP problems has a simple and beautiful algebraic structure that we now review. r3 Without loss of generality we can identify the state and decision spaces as finite sets of integers { 1,. . . , S} and { 1,. . . , D},and the constraint set as { 1,. . . , D(s)} where for notational simplicity we now let S, D and D(s) denote positive integers rather than sets. It follows that a feasible stationary decision rule 6 is an S-dimensional vector satisfying ME{ 1,. . . , D(s)}, s= 1,. . . , S, and the value function V is an S-dimensional vector in the Euclidean space RS. Given 6 we can define a vector u,ER~whose ith component is u[i, d(i)], and an S x S transition probability matrix E, whose (i, j) element is p[ilj, S(j)] = Pr { st+ 1 = iIs, = j, d, = S(j)}. Bellman’s equation for a DDP reduces to T(V)(s) =
max l
W’)P(S’l
a(s, d) + P s: [
s’=
1
s, 4
1.
(2.23)
Given a stationary, Markovian decision rule 6, we define I/,eRS as the vector of expected discounted utilities under policy 6. It is straightforward to show that V, is the solution to a system of linear equations,
which can be solved by matrix inversion:
= [Z - BEa] - lU6 =
U6 +
BE@, + p2zgu, + p3E,3U,+ . . .
(2.25)
The last equation in (2.25) is simply a geometric series expansion for V, in powers of /3 and E,. As is well known, Ey = (EJNis simply the N-stage transition probability matrix, whose (i,j) element equals Pr { s, +N = i (s, = j, S}, where the presence of 6 as a conditioning argument denotes the fact that all intervening decisions satisfy dt+j=6(st,.j), j=O ,..., N. Since j?NEy~, is the expected discounted utility received in period N under policy 6, formula (2.25) can be thought of as a vector generalization of a geometric series, showing explicitly how V, equals the sum of expected discounted utilities under 6 in all future periods.14 Since Ed"is a transition probability matrix (i.e. all elements are between 0 and 1, and its rows sum to unity), it 13The geometric representtion also holds for continous state MDP’s, but in infinite-dimensional space instead of R”. “‘As Lucas (1978) notes, “a little knowledge of geometric series goes a long way”.
Ch. 51: Structural
Estimation
of Markov
Decision Processes
follows that lim,,, BNEr = 0, guaranteeing the invertibility of [IMarkovian decision rule 6 and all BE[O, 1).i5
2.5.
3095
/?EJ for any
Overview of solution methods
This section provides a brief review of solution methods for MDP’s. For a more extensive review we refer the reader to Rust (1995a). The main solution method for finite-horizon MDP’s is backward recursion, which has already been described in Section 2.1. The amount of computer time/ space required to solve a problem increases linearly in the length of the horizon T and quadratically in the number of possible state variables S, the latter result being due to the fact that the main work involved in dynamic programming is calculating the conditional expectation of future utility, which requires multiplying an S x S transition matrix by the S x 1 value function. In the infinite-horizon case there are a variety of solution methods, most of which can be viewed as different strategies for solving the Bellman functional equation. The method of successive approximations which we described in Section 2.2 is probably the most well-known solution method for infinite-horizon problems: it essentially amounts to using the solution to a finite-horizon problem with a large horizon T to approximate the solution to the infinite-horizon problem. In certain cases we can significantly accelerate successive approximations by employing the McQueen-Porteus
error bounds,
rk( v) + _bke< I/* < rk( v) + 6,e,
(2.26)
where V* is the fixed point to T,e denotes an S x 1 vector of l’s, and bk=~/(l-~)min[rk(V)-rk-l(V)], 5,‘= p/(1 - p) max[r
“(v) - r”-‘(v)].
(2.27)
The contraction property guarantees that bk and Ek approach each other geometrically at rate /?. The fact that the fixed point V* is bracketed within these bounds suggests that we can obtain an improved estimate of V* by terminating the contraction iterations when 16, - _bkl< E and setting the final estimate of V* to be the median bracketed value e.
(2.28)
151f there are continous state variables, the MDP problem still has the same representation as in (2.23, except that E, is a Markov operator (a bounded, positive linear operator with norm equal to 1) instead of an S x S transition probability matrix.
John Rust
3096
Bertsekas (1987, p. 195) shows that the rate ofconvergence of { pk;I>to V* is geometric at rate p 1A, 1,where 1, is the subdominant eigenvalue of E,.. In cases where II, I < 1, the use of the error bounds can lead to significant speed-ups in the convergence of successive approximations at essentially no extra computational cost. However in problems where Ed* has multiple ergodic sets, I& I= 1, and the error bounds will not lead to an appreciable speed improvement as illustrated in computational results in Table 5.2 of Bertsekas (1987). In relatively small scale problems (S < 10000) the method of policy iteration is generally the fastest method for computing V* and the associated optimal decision rule S*, provided the discount factor is sufficiently large (fi > 0.95). The method starts by choosing an arbitrary initial policy, &,.16 Next a policy valuation step is carried out to compute the value function V,, implied by the stationary decision rule 6,. This requires solving the linear system (2.25). Once the solution I’,, is obtained, a policy improvement step is used to generate an updated policy 6,, VdO(s’)p(s’Is, d)].
6,(s) = argmax [u(s, d) + p i l
s’=
(2.29)
1
Given 6,, one continues the cycle of policy valuation and policy improvement steps until the first iteration k such that 6, = 6, _ 1 (or alternatively V,, = I’,,- ,). It is easy to see from (2.25) and (2.29) that such a V,, satisfies Bellman’s equation (2.23), so that by Theorem 2.3 the stationary Markovian decision rule 6* = 6, is optimal. One can show that policy iteration always generates an improved policy:
v,,3 v,,_ I’
(2.30)
Since there are only a finite number D(1) x ... x D(S) of feasible stationary Markov policies, it follows that policy iteration always converges to the optimal decision rule 6* in a finite number of iterations. Policy iteration is able to find the optimal decision rule after testing an amazingly small number of trial policies 6,. However the amount of work per iteration is larger than for successive approximations. Since the number of algebraic operations needed to solve the linear system (2.25) for I’,, is of order S3, the standard policy iteration algorithm becomes impractical for S much larger than 10000.” To solve very large scale MDP problems, it seems that the best strategy is to use policy iteration, but to only attempt to approximately solve for V, in each policy evaluation step (2.25). There are a number of variants of policy iteration that avoid direct numerical solution of the linear system in (2.25) including modified policy iteration
160ne obvious choice is 6,(s) = argmax, < do .,,,[u(s, d)]. “Supercomputers using combinations of vector processing and multitasking can now routinely solve dense linear systems exceeding 1000 equations and unknowns in under 1 CPU second. See, for example, Dongara and Hewitt (1986).
Ch. 51: Structural Estimation
3097
of Markov Decision Processes
[Puterman and Shin (1978)], and adaptive state aggregation algorithms [Bertsekas and Castafion (1989)]. Puterman and Brumelle (1978, 1979) have shown that policy iteration is identical to Newton’s method for computing a zero to a nonlinear function. This insight turns out to be useful for computing fixed points to contraction mappings Y that are closely related to, but distinct from, the contraction mapping I- defined by Bellman’s equation (2.11). An example of such a mapping is Y: B + J3 defined by
Y(u)(s, d) = u(s,d) + p
1
1
exp {v(s’, d’)} p(ds’( s, d).
d’eD(s’)
(2.31)
In Section 3 we show that the fixed point to this mapping is identical to the value function ug entering the dynamic logit model (1.1). Rewriting the fixed point condition as 0 = u - Y(U), we can apply Newton’s method, generating iterations of the form a,‘+ 1 = uk - [I - ‘ul(O,)l - ‘(I - Y)(s),
(2.32)
where I denotes the identity matrix and Y’(u) is the gradient of Y evaluated at the point UEB. An argument exactly analogous to the series expansion argument used to proved the existence of [Z - /3Ed] - ’ can be used to establish that the matrix [Z - Y’(u)] - 1 is invertible, so the Newton iterations are always well-defined. Given a starting point u0 in a domain of attraction sufficiently close to the fixed point u* of Y, the Newton iterations will converge to u* at a quadratic rate: IV,,,
- v*I
1/*i2
(2.33)
for a positive constant K. Although Newton iterations yield rapid quadratic rates of convergence, it is only guaranteed to converge for initial estimates u,, in a domain of attraction of u* whereas the method of successive approximations yields much slower linear rates of convergence but are always guaranteed to converge to u* starting from any initial point uO.l8 This suggests the following hybrid method or polyalgorithm: start with successive approximations, and when the McQueen-Porteus error bounds indicate that one is sufficiently close to u*, switch to Newton iterations to rapidly converge to the solution. There is another class of methods, which Judd (1994) has termed minimum weighted residual (MWR) methods, that can be applied to solve general operator equations of the form
@(u*)= 0,
(2.34)
“Newton’s method does exhibit global convergence in finite state DDP problems due to the fadt that Newton’s method and policy iteration are identical in this case, and policy iteration converges from any starting point. Thus the domain of attraction in this case is all of R”.
John Rust
3098
where @: B + B is a nonlinear operator on a potentially infinite-dimensional Banach space B. For example, Bellman’s equation is a special case of (2.34) for @(V) = [I- r](V). Similar to policy iteration, Newton’s method becomes computationally burdensome in high-dimensional problems. To avoid this, MWR methods attempt to approximate the solution to (2.34) by restricting the search to a smaller-dimensional subspace B, spanned by the basis elements {xi, x2,. . . , xN). It follows that we can index any approximate solution UEB, by a vector c = (c,, . . . , c,)ER~: u, = ClXl + ... + CNXN.
(2.35)
Unless the true solution u* is an element of B,, @(u,) will generally be non-zero for all vectors CERN. The MWR method computes an estimate uI of u* using a value of e that solves e = argmin
1@(u,)\.
(2.36)
CERN
Variants of MWR methods can be obtained by using different subspaces B, (e.g., Legendre or Chebyshev polynomials, etc.) and different norms on Y(u,) (e.g., least squares or sup norm, etc.). In cases where B is an infinite-dimensional space (which occurs when the DDP problem contains continuous state variables), one must also choose a finite grid of points over which the norm in (2.36) is evaluated. Although I have described MWR as parameterizing the value function in terms of a small number of unknown coefficients c, there are variants of this approach that are based on parameterizations of other features of the stochastic control problem such as the decision rule 6 [Smith (1991)], or the conditional expectation operator E, [Marcet (1994)]. For simplicity, I refer to all these methods as MWR even though there are important differences in their computational implementation. The advantage of the MWR approach is that it converts the problem of finding a zero of a high-dimensional operator equation (2.34) into the problem of finding a zero to a smaller-dimensional minimization problem (2.36). MWR methods may be particularly effective for solving DDP problems with several continuous state variables, since straightforward discretization methods quickly run into the curse of dimensionality. However a disadvantage of the procedure is the computational burden of solving (2.36) given that I @(u,)I must be evaluated for each trial value of c. Typically, one uses approximate methods to evaluate 1@(u,)I, such as Gaussian quadrature or Monte Carlo integration. Another disadvantage is that MWR methods are non-iterative, i.e. previous approximations ui, ul,. . . , uN- 1 are not used to determine the next estimate uN. In practice, one must make do with a single approximation UN,however there is no analog of the McQueen-Porteus error bounds to tell us how far uN is from the true solution. Indeed, there are no general theorems proving the convergence of MWR methods as the dimension N of the subspace increases. There are also problems to be faced in cases where @ has multiple solutions V*, and when the minimization problem (2.36) has multiple local minima. Despite these unresolved problems, versions of the MWR method have proved to be effective in a
Ch. 51: Structural Estimation of Markoo Decision Processes
3099
variety of applied problems. See, for example, Kortum (1993) (who has nested the MWR solution of (2.35) in an estimation routine), and Bansal et al. (1993) who have used Marcet’s method of parameterized expectations to generate stochastic simulations of dynamic, stochastic models for use by their “non-parameteric simulation estimator”. A final class of methods uses Monte Carlo integration to avoid the computational burden of multivariate numerical integration that is the dominating factor that limits our ability to solve DDP problems. Keane and Wolpin (1994) developed a method that combines Monte Carlo integration and interpolation to dramatically reduce the solution time for large scale DDP problems with continuous multidimensional state variables. As we will see below, incorporation of unobservable state variables Eimplies that DDP problems will always have these multidimensional continuous state variables. Recently, Rust (1995b) has introduced a “random multigrid algorithm” using a random Bellman operator F that avoids the need for interpolation and repeated Monte Carlo simulations that is an inherent limiting future of Keane and Wolpin’s method. Rust showed that his algorithm succeeds in breaking the curse of dimensionality of solving the DDP problem -i.e. the amount of computer time required to solve the DDP problem increases only polynomially rather than exponentially with the dimension d of the state variables using Rust’s algorithms. These new methods offer the promise that substantially more realistic DDP models will be estimable in the near future. 3.
Econometric
methods for discrete decision processes
As we discussed in Section 1, structural estimation methods for DDP’s are fundamentally different from the Euler equation methods used to estimate CDP’s. Since the control variable is discrete, we cannot differentiate to derive first order necessary conditions characterizing the optimal decision rule 6* = (6,6,. . .). Instead each component function 6(s) is defined by a finite number of inequality conditions: l9 d&(s)-
I’d ED(S) u(s, d) + /Ii’ V*(s’)p(ds’ 1s, 6) 3 U(S,d’) + fi I’*(s’)p(ds’Is, d’) . iTI s s I
(3.1) Econometric methods for DDP’s borrow heavily from methods developed in the literature on estimation of static discrete choice models.20 The primary difference between estimation of static versus dynamic models of discrete choice is that agents’ choices are governed by the relative magnitude of the value function V rather than the single period utility function u. Even if the functional form of the latter is “For notational simplicity, this section focuses on stationary infinite-horizon DDP problems and ignores the distinction between the optimal policy 6* and its components 6* = (6,6,. .). “See McFadden (1981, 1984) for excellent surveys of the huge literature on estimation of static discrete choice models.
John Rust
3100
specified a priori, the value function is generally unknown, although it can be computed for any value of 8. To date, most empirical applications have used “nested numerical solution algorithms” that compute the best fitting estimate I!?by repeatedly solving the dynamic programming problem for each trial value of 9.
3.1.
Alternative
models of the “error term”
In addition to the numerical problems involved in computing the value function and optimal decision rule, we face the problem of how to incorporate “error terms” into the structural model. Error terms are necessary in light of “Blackwell’s theorem” (Theorem 2.3) that the optimal decision rule d = 6(s) is a deterministic function of the agent’s state s. Blackwell’s theorem implies that if we were able to observe all components of s, then a correctly specified DDP model would be able to perfectly predict agents’ behavior. Since no theory is realistically capable of perfectly predicting the behavior of human decision-makers, there are basically four ways to reconcile discrepancies between the predictions of the DP model and observed behavior: (1) optimization errors, (2) measurement errors, (3) approximation errors, and (4) unobserved state variables.‘l An optimization error causes an agent who “intends” to behave according to the optimal decision rule 6 to take an actual decision d given by d = 6(s) + r/,
(3.2)
where 9 is interpreted as an error that prevents the agent from correctly calculating or implementing the optimal action 6(s). This interpretation of discrepancies between d and 6(s) seems logically inconsistent: if the agent knew that there were random factors that lead to ex post discrepancies between intended and realized decisions, he would re-optimize taking these uncertainties into account. The resulting decision rule will generally be different from the optimal decision rule 6 when intended and realized decisions coincide. On the other hand, if q is simply a way of accounting for irrational or non-maximizing behavior, it is not clear why this behavior should take the peculiar form of random deviations from a rational decision rule 6. Given these logical difficulties, we ignore optimization errors as a way of explaining discrepancies between d and 6(s). Measurement errors, due to response or coding errors, must surely be acknowledged in most empirical studies. Measurement errors are usually much more likely to occur in continuous components of s than in the discrete values of d, although significant errors can occur in the latter as a result of classification error (e.g. defining workers as choosing to work full-time vs. part-time based on noisy measurements of total hours of work). From an econometric standpoint, measurement ‘l Another method, variables
unobserved heterogeneity, can be regarded as a special case of unobserved state in which certain components of the state vector vary over individuals but not over time.
Ch. 51: Structural
Estimation
of Markoo
Decision Processes
3101
errors in s create more serious difficulties since 6 is typically a nonlinear function of s. Unfortunately, the problem of nonlinear errors-in-variables has not yet been satisfactorily resolved in the econometrics literature. In certain cases [Eckstein and Wolpin (1989b) and Christensen and Kiefer (199 1b)], one can account for measurement error in a statistically and computationally tractable manner, although at the present time this approach seems to be highly problem-specific. An approximation error is .defined as the difference between the actual and predicted decision, E = d - 6(s). This approach amounts to an up-front admission that the DDP model is misspecified, and does not attempt to impose auxiliary statistical assumptions about the distribution of E. The existence of such errors is hard to deny since by their very nature DDP models are simplified, abstract representations of human behavior and we would never expect their predictions to be 100% correct. Under this interpretation the econometric problem is to find a specification (u, p, /I) that minimizes some metric of the approximation error such as mean squared prediction error. While this approach seems quite natural, it leads to a “degenerate” econometric model and estimators with poor asymptotic properties. The approximation error approach also suffers from ambiguity about the appropriate metric for determining whether a given model does or does not provide a good approximation to observed behavior. The final approach, unobserved state variables, is the subject of Section 3.2.
3.2.
Maximum likelihood estimation of DDP’s
The remainder of this chapter focuses on structural estimation of DDP’s with unobserved state variables. In these models the state variable s is partitioned into two components s = (x, E) where x is a state variable observed by both agent and econometrician and E is observed only by the agent. The existence of unobserved state variables is quite plausible: it is unlikely that any survey could completely record all information that is relevant to the agent’s decision-making process. It also provides a natural way to “rationalize” discrepancies between observed behavior and the predictions of the DDP model: even though the optimal decision rule d = 6(x, E)is a deterministic function, if the specification of unobservables is sufficiently “rich” any observed (x, d) combination can be explained as the result of an optimal decision by an agent for an appropriate value of E. Since E enters the decision rule 6 in a non-additive fashion, it is infeasible to estimate 0 by nonlinear least squares. The preferred method for estimating 0 is maximum likelihood using the conditional choice probability, P(dlx) =
s
Z{d = b(x,.s)}q(dEjx),
(3.3)
where q(delx) is the conditional distribution of E given x (to be defined). Even though 6 is a step function, integration over E in (3.3) leads to a conditional choice
John Rust
3102
probabilty that is a smooth function of 8 provided that the primitives (u, p, p) are smooth functions of 0 and the DDP problem satisfies certain general properties given in assumptions AS and CI below. These assumptions guarantee that the conditional choice probability has “full support”: d@(x)-=P(d(x)
> 0,
(3.4)
which is equivalent to saying that the set (&Id = 6(x, E)} has positive probability under q(de) x). We say that a specification for unobservables is saturated if (3.4) holds for all possible values of 8. The problem with an unsaturated specification is the possibility that the DDP model may be contradicted in a sufficiently large data set: i.e. one may encounter observations (x;,d;) which cannot be rationalized by any value of E or 8, i.e. P(d;)xf, 0) = 0 for all 0. This leads to practical difficulties in maximum likelihood estimation, causing the log-likelihood function to “blow up” when it encounters a “zero probability” observation. Although one might eliminate such observations to achieve convergence, the impact on the asymptotic properties of the estimator is unclear. In addition, an unsaturated specification may yield a likelihood function whose support depends on 0 or which may be a nonsmooth function of 8. Little is known about the general asymptotic properties of these “non-standard” maximum likelihood estimators.22 Borrowing from the literature on static discrete choice models [McFadden (1981)] we introduce two assumptions that are sufficient to generate a saturated specification for unobservables in a DDP model. Assumption AS The choice sets depend only on the observed state variable x: D(s) = D(x). The unobserved state variable E is a vector with at least as many components as the number of elements in D(x).~~ The utility function has the additively separable decomposition u(s, d) = u(x, d) + c(d), where c(d) is the dth component
(3.5) of the vector E.
22Result~ are available for certain special cases, such as Flinn and Heckman’s (1982) and Christensen and Kiefer’s (1991) analysis of the job search model. If wages are measured without error, this model generates the restriction that any accepted wage offer must be greater than the reservation wage (which is an implicit function of 0). This imphes that the support of the likelihood function depends on 0, resulting in a non-normal limiting distribution with certain parameters converging faster than the fi rate that is typical of standard maximum likelihood estimators. The basic result is analogous to estimating the upper bound 0 of a uniform distribution U[O, 01. The support of this distribution clearly depends f? and, as well known (Cox and Hinckley, 1974) the maximum likelihood estimator is B = max {x1,. , xAj, which converges at rate A to an exponential limiting distribution. 23For technical reasons E may have a number of superfluous components so that we may formally embed the E state vectors in a common state space e. For details, see Definition 3.1.
3103
Ch. 51: Structural Estimation of Markov Decision Processes
Figure
1. Pattern
Assumption The transition
of dependence
in controlled
stochastic
process
implied
by the CI assumption
CZ density for the controlled
Markov
process
{xt, Ed] factors as
p(dx, + 1, ds, + 1 Ix,, E,,4 = G&t + 1Ix, + , Wx, + 1Ix,, 4, where the marginal density of q(ds Ix) of the first 1D(x) ( components equal to R tD(x)iand finite absolute first moments.
(3.6) of E has support
CI is a conditional independence assumption which limits the pattern of dependence in the {x,,E,} process in two ways. First, x,+ 1 is a sufficient statistic for E,+ 1 implying that any serial dependence between E, and E,+ 1 is transmitted entirely through the observed state xt+ 1.24 Second, the probability density for x, + 1 depends only on x, and not on E,. Intuitively, CI implies that the (Ed} process is essentially a noise process superimposed on the main dynamics which are embodied by the transition probability n(dx’ lx, d). Under assumptions AS and CI Bellman’s equation has the form V(x, E) = max [0(x, d) + c(d)],
(3.7)
deD(x)
where
s
$x,4 = 4x, 4 + B VY, &(d~l y)Wy Ix, 4.
(3.8)
Equation (3.8) is the key to subsequent results. It shows that the DDP problem has the same basic structure as a static discrete choice problem except that the value function u replaces the single period utility function u as an argument of the conditional choice probability. In particular, AS-C1 yields a saturated specification for unobservables: (3.8) implies that the set {EJd = 6(x, E)} is a non-empty intersection of half-spaces in RID@)‘, and since E is continuously distributed with unbounded support, it follows that regardless of the values of (o(x,d)} the choice probability P(dlx) is positive for each dED(x). In order to formally define the class of DDP’s, we need to embed the unobserved state variables E in a common space E. Without loss of generality, we can identify each choice set D(x) as the set of integers D(x) = { 1,. . . , ID(x)j}, and let the decision space D be the set D = { 1,. . , supXGx (D(x)/}. Then we define E = RID’, and whenever 241f q(dc)x) is dependent
of x then {E,} is an IID process
which is independent
of {xr}.
John Rust
3104
1D(x)1 < 1D 1then q(ds 1x) assigns the remaining ID I- 1D(x) I “irrelevant” of E equal to some arbitrary value, say 0, with probability 1.
components
Dejinition 3.1 A discrete decision process (DDP) is an MDP satisfying the following restrictions: The decision space D = { 1,. . . , su~,,~ ID(s) I ), where su~,,~ ID(s)I < 00. l The state space S is the product space S = X x E, where X is a Bore1 subset of RJ and E = RID’. l For each SES and XEX we have D(s) = D(x) c D. l The utility function u(s, d) satisfies assumption AS. CI. l The transition probability p(ds,+ 11s,, d,) satisfies assumption l The component q(dsl x) of the transition probability p(ds1 s, d) is itself a product meaSure on R~-‘(x))x RlDl-IDfX)l whose first component has support RIDCX)I and whose second component is a unit mass on a vector of O’s of length 1D I - 1D(x) (. l
The conditional choice probability P(dl x) can be defined in terms of a function McFadden (198 1) has called the social surplus,
max [u(x, d) + .z(d)]q(de Ix).
GC{4x,d),dWx)}Ixl=
(3.9)
dsD(x) s RIDI
If we think of a population of consumers indexed by E, then G[ {u(x, d), deD(x)} Ix] is simply the expected indirect utility of choosing alternatives dED(x). G has an important property, apparently first noted by Williams (1977) and Daly and Zachary (1979), that can be thought of as a discrete analog of Roy’s identity. Theorem 3.1 If q(dslx) has finite first moments, then the social surplus has the following properties. 1. G is a convex function of {u(x, d), deD(x)}. 2. G satisfies the additivity property G[{u(x,d)
+ a,dsD(x)}Ix]
3. The partial derivative probability:
(3.9) exists, and
= c( + G[{u(x,d),d~D(x)}Ix].
of G with respect to u(x, d) equals the conditional
aGC{u(x,d),d~D(x))Ixl =Ptd,xj. a 4x, 4 From the definition simply an exercise
function
(3.10) choice
(3.11)
of G in (3.9), it is evident that the proof of Theorem 3.1(3) is in interchanging integration and differentiation. Taking the
3105
Ch. 51: Structural Estimation ofMarkov Decision Processes
partial
derivative
operator
inside the integral
sign we obtain25
a{maxdeD(y) Cub,4 + 441>
aGC{u(x,d),d~~(X)}IXl autx,d) =
a4x,
q(dE,
x)
4
Z{d = argmax [u(x,d’) + ~(dl)]}q(d~Ix)
= s
d’sD(x)
= P(dIx).
(3.12)
Note that the additivity property (3.10) implies that the conditional choice probabilities sum to 1, so P(. (x) is a well-defined probability distribution over D(x). The fact that the unobserved state variables E have unbounded support implies that the objective function in the DDP problem is unbounded. We need to introduce three extra assumptions to guarantee the existence of a solution since the general results for MDP’s given in Theorems 2.1 and 2.2 assumed that u is bounded above. Assumption BU For each dED(x), u(x,d) is an upper expectation: R(x)
3
F P’R,(x) <
semicontinuous
function
of x with bounded
co,
t=1
4 +,(4 = max R,W(dy Ix, 4 dsD(x)
R,(x) = max d&(x)
Assumption
s
max I U(Y, 4
+ 44
I NE
IMdy
Ix, 4.
(3.13)
d’eD(y)
WC
rr(dy 1x, d) is a weakly continuous function of (x, d): for each bounded continuous function h: X + R, s h(y)rc(dy)x, d) IS . a continuous function of x for each dED(x). Assumption BE Let B be the Banach space of bounded, Bore1 measurable functions h: X x D --) R under the essential supremum norm. Then UEB and for each DEB, E~EB, where Eh is defined by
Wx,d)=
s
GC(h(y,d),d~~(y))lyl~(dyIx,d).
(3.14)
Z5The interchange is justified by the Lebesgue dominated convergence theorem, since the derivative ofmax ,,,,,[u(x, d) + e(d)] with respect to u(x, d) is bounded (it equals either 0 or 1) for almost all E.
John Rust
3106
Theorem
3.2
If {s,, d,} is a DDP satisfying AS, CI and regularity the optimal decision rule 6 is given by
conditions
BU, WC and BE, then
6(x, E) = argmax [u(x, d) + c(d)],
(3.15)
dfD(x)
where u is the unique
fixed point to the contraction
mapping
Y: B + B defined by
s
u’(u)(x,4 = 4x, 4 + P GC{u(y,d'), d’Wy)) lyl4dylx, 4. Theorem
(3.16)
3.3
If {st, d,} is a DDP satisfying AS, CI and regularity conditions BU, WC and BE, then the controlled process {x,, st} is Markovian with transition probability Pr{dx r+i,d,+ilx,,dJ where the conditional
P(Qx)
=
=P(dt+llxt+l)n(dxt+llx,,d,), choice probability
aGC{u(x,d),d~D(x)}Ixl ao(x,d)
(3.17)
P(dJx) is given by
(3.18)
’
where G is the social surplus function defined in (3.9), and u is the unique to the contraction mapping Y defined in (3.16).
fixed point
The proofs of Theorems 3.2 and 3.3 are straightforward: under assumption ASCI the value function is the unique solution to Bellman’s equation given in (3.7) and (3.8). Substituting the formula for I/ given in (3.7) into the formula for u given in (3.8) we obtain u(x, d) = u(x, d) + /I
=Ukd)+B
max CU(Y,d’) +
d’ED(y)
s
M’)lq(d~Iyb(dy Ix, 4
GC(u(y,d’),d’~~(y)}Iyl~(dyIx,d).
(3.19)
The latter formula is the fixed point condition (3.16). It’s a simple exercise to verify that Y is a contraction mapping, guaranteeing the existence and uniqueness of the function u. The fact that the observed components {xf, d,} of the controlled process {xt,et,d,} is Markovian is a direct result of the CI assumption: the observed state x,+ 1 is a “sufficient statistic” for the agent’s choice d,, r. Without the CI assumption, lagged state and control variables would be useful for predicting the agent’s choice
3107
Ch. 51: Structural Estimation of Markoc Decision Processes
time t + 1 and {xl, d,} will no longer be Markovian. As we will see, this observation provides the basis for a specification test of CL For specific functional forms for q we obtain concrete formulas for the conditional choice probability P(dIx), the social surplus function G and the contraction mapping Y. For example if q(ds(x) is a multivariate extreme-value distribution we havez6 at
q(dsIx)=
y G 0.577.
expC-&(d)+Y}expC-exp{-&(d)+y}l
fl
(3.20)
deD(x)
multinomial logic formula
Then P(d(x) is given by the well-known
expf& 41
(3.21)
P(d(x) = __~~ C exp(+,d’)J d’ED(x)
where u is the fixed point to the contraction Y(u)(x,
d) = U(X, d) + B
c
mapping
exp {u(Y, 4
d’4Ky)
}
Y?
1
4dyl
x, 4
(3.22)
The extreme-value specification is especially attractive for empirical applications since the closed-form expressions for P and G avoid the need for multi-dimensional numerical integrations required for other distributions.27 A simple consequence of the extreme value specification is that the log-odds ratio for two alternatives equals the utility differential:
log
=u(x, d) -
u(x, 1).
(3.23)
Suppose that the utility function depends only on the attributes of the chosen alternative: u(x, d) = u(x,), where x = (x,, . . . , xD) is a vector of attributes of all the alternatives and xd is the attribute of the dth alternative. In this case the log-odds ratio implies a property known as independence from irrelevant alternatives (HA): the odds of choosing alternative d over alternative 1 depends only on the attributes of those two alternatives. The IIA property has a number of undesirable implications such as the “red bus/blue bus” problem noted by Debreu (1960). Note, however, that in the dynamic logit model the IIA property does not hold: the log-odds of 26The constant y in (3.18) is Euler’s constant, which shifts the extreme value distribution so it has unconditional mean zero. “Closed-form solutions for the conditional choice probability are available for the larger family of multivariate extreme-value distributions [McFadden (1977)J This family is characterized by the property that it is mu-stable, i.e. it is closed under the operation of maximization. Dagsvik (1991) showed that this class is dense in the space of all distributions for E in the sense that the conditinal choice probabilities for an arbitrary density q can be approximated arbitrarily closely by the choice probability for some multivariate extreme-value distribution.
John Rust
3108
choosing d over 1 equals the difference in the value functions u(x, d) - u(x, l), but from the definition of u(x,d) in (3.22) we see that it generally depends on the attributes of all of the other alternatives even when the single period utility function depends only on the attributes of the chosen alternative, u(x, d) = u(xJ. Thus, the dynamic logit model benefits from the computational simplifications of the extremevalue specification but avoids the IIA problem of static logit models. Although Theorems 3.2 and 3.3 appear to apply only to infinite-horizon stationary DDP problems, they actually include finite-horizon, non-stationary DDP problems as a special case. To see this, let the time index t be an additional component of x,, and assume that the process enters an absorbing state with uJx,, d,) = u(x,, t, d,) = 0 for t > T. Then Theorems 3.2 and 3.3 continue to hold, with the exception that 6, P, G, 7t and u all depend on t. The value functions o,, t = 1,. . . , T are given by the same backward recursion formulas as in the finite-horizon MDP models described in Section 2:
4x, 4 = &,
4,
u,(x,d)=~,(x,d)+~
s t+
G,C{u l(y, 4, d’@(y))I~l~t(d~lx, 4.
(3.24)
Substituting these value functions into (3.18), we obtain choice probabilities P, that depend on time. It is easy to see that the process {xt, d,} is still Markovian, but with non-stationary transition probabilities. Given panel data {xf,dp} on observed states and decisions of a collection of individuals, the full information maximum likelihood estimator I!? is defined by (3.25) Maximum likelihood estimation is complicated by the fact that even in cases where the conditional choice probability has a closed-form solution in terms of the value functions uO, the latter function does not have an a priori known functional form and is only implicitly defined by the fixed point condition (3.16). Rust (1987, 1988b) developed a nested fixed point algorithm for estimating 6: an “inner” contraction fixed point algorithm computes uO for each trial value of 0, and an “outer” hillclimbing algorithm searches for the value of 0 that maximizes J!!. In practice, 0 can be estimated by a simpler 2-stage procedure that yields consistent, asymptotically normal but inefficient estimates of 8*, and a 3-stage procedure which is asymptotically equivalent to full information maximum likelihood. Suppose we partition 8 into two components (O,, O,), where 8, is a subvector of parameters that appear only in n and O2 is a subvector of parameters that appear only in (u, L_J, ,Q In the first stage we estimate 0, using the partial likelihood estimator &;,2s 28Cox (1975) has shown that under standard will be consistent and asymptotically normally
regularity conditions, distributed.
the partial
likelihood
estimator
3109
Ch. 51: Structural Estimation ofMarkou Decision Processes
a=1
B,ERNl
(3.26)
fi rc(dxplxf_ ,,dp_ I, 0,).
6: = argmax LT(8,) 3 fi
t=1
Note that the first stage does not require a nested fixed point algorithm to solve the DDP problem. In the second stage we estimate the remaining parameters using the partial likelihood estimator 85 defined by (3.27) The second stage treats the consistent first stage estimates of n(dx, + 1Ix,, d,, 6;) as the “truth”, reducing the problem to estimating the remaining parameters 8, of (u, q, /?).It is well known that for any optimization method the number of likelihood function evaluations needed to find a maximum increases rapidly with the number of parameters being estimated. Since the second stage estimation requires a nested fixed point algorithm to solve the DDP problem at each likelihood function evaluation, any reduction in the number of parameters being estimated can lead to substantial computational savings. Note that, due to the presence of estimation error in the first stage estimate of @, the covariance matrix formed by inverting the information matrix for the partial likelihood function (3.27) will be inconsistent. Although there is a standard correction formula that yields a consistent estimate of the covariance matrix [Amemiya (1976)], in practice it is just as simple to use the consistent estimates t?p = (&, I@ from stages 1 and 2 as starting values for one or more “Newton steps” on the full likelihood function (3.25): & = @ - y$&),
(3.28)
where the “search direction” $@) = -
$gp) is given by
[a2 i0gLf(@yaeatq
- ‘[a log _C@yae]
(3.29)
and y > 0 is a step-size parameter. Ordinarily the step size y is set equal to 1, but one can also choose y to maximize Ls without changing the asymptotic properties of @. Using the well-known “information equality” we can obtain an alternative asymptotically equivalent version of (3.29) by replacing the Hessian matrix with the negative of the information matrix I^((ep)defined by
f@) =
5 5a a=1
x
i0g
P(~;I~;, @~c(x;_ IX;_ 1
1,
dp_
1,
Pyae
[ *=,1
9 aiog~(d~~xp,BP)n(xg_l~x~_l,dp_l,BP)~ae~
t=1
We call @ the Newton-step
estimator.
It is straightforward
.
(3.30)
1 to show that
this
John Rust
3110
procedure results in parameter estimates that are asymptotically equivalent to full information maximum likelihood and, as a by-product, consistent estimates of the asymptotic covariance matrix r^- 1(Qs).29 The feasibility of the nested fixed point maximum likelihood procedure depends on our ability to rapidly compute the fixed point o,, for any given value of 6, and to find the maximum of Lf or Lp in as few likelihood function evaluations as possible. At a minimum, the likelihood function should be a smooth function of 0 so that more efficient gradient optimization algorithms can be employed. The smoothness of the likelihood is also crucial for establishing the large sample properties of the maximum likelihood estimator. Since the primitives (u, p, b) are specified a priori, they can be chosen to be smooth functions of 0. The convexity of the social surplus function implies that the conditional choice probabilities are smooth functions of up Therefore the question of smoothness further reduces to finding sufficient conditions under which tI-+u, is a smooth mapping from RN into B. This follows from the implicit function theorem since the pair (0, ue) is a zero of the nonlinear operator F: RN x B -+ B defined by 0 = F(u, 0) = (I - ‘PO)(u).
(3.31)
Theorem 3.4 Under regularity conditions (A 1) to (A 13) given in Rust (1988b), a v/36’ exists and is a continuous function of 9 given by
c!$+
y&)]-l
[
!s$!
. II
” =
(3.32)
“”
The successive approximation/Newton iteration polyalgorithm described in Section 2.5 can be used to compute uO.Since Newton’s algorithm involves inverting the operator [I - Y$(u)], it follows that one can use it to compute au,,630 using formula (3.32) at negligible marginal cost. Once we have the derivatives av,/ae, it is a straightforward exercise to compute the derivatives of the likelihood, aLf/ad. This allows us to employ more efficient quasi-Newton gradient optimization algorithms to search for the maximum likelihood estimate, i?. Some of these methods require second derivatives of the likelihood function, which are significantly harder to compute. However, the information equality implies that the information matrix (3.30) is a good approximation to the negative of the Hessian of Lf in large samples. This idea forms the basis of the BHHH optimization algorithm [Berndt, Hall, Hall and Hausman (1974)] which only “In practice, the two-stage estimator & may be sufficiently far away from the maximum likelihood estimates that several Newton steps (3.28) are necessary. In this case, the Newton-step estimator is simply a way generating values for computing the full information maximum likelihood estimates in (3.25). Also, we haven’t attempted to correct the estimated standard errors for possible misspecification as in White (1982) due to the fact that such corrections require second derivatives of the likelihood function which are difficult to compute in DDP models.
Ch. 51: Structural Estimation of Markov Decision Processes
3111
requires first derivatives of the likelihood function. 3o The nested fixed point algorithm combines the successive approximation/Newton iteration polyalgorithm and the BHHH/Broyden gradient optimization algorithm in order to obtain an efficient and numerically stable method for computing the maximum likelihood estimate 6.“’ In order to derive the asymptotic properties of the maximum likelihood estimators e’, i = f, p, n, we need to make some additional assumption about the sampling process. First, we assume that the periods at which agents are observed coincide with the periods at which they make their decisions. In practice agents do not make decisions at exogenously spaced intervals of time, so it is unlikely that the particular points in time at which agents are interviewed coincide with the times they make their decisions. One way to deal with the problem is to use retrospective data on decisions made between survey dates. In order to minimize problems of time aggregation one should in principle formulate a sufficiently fine-grained model with “null actions” that allow one to model decision-making processes with randomly varying times between decisions. However if the DDP model has a significantly shorter decision interval than the observation interval, we may face the problem that the data set may not contain observations on the agent’s intervening states and decisions. In principle, this problem can be solved by using a partial likelihood function that omits the intervening periods, or a full likelihood function that “integrates out” the unobserved states and decisions in the intervening periods. The practical limitation of this approach is the “curse of dimensionality” of solving very fine-grained DP models. Next, we need to make some assumptions about the dependence between the realizations {xp, da} and {xp, dp} for agents a # b. The standard assumption is that these realizations are independent, but this may not be plausible in models where agents are affected by “macroeconomic shocks” (examples of such shocks include price, unemployment rates and news announcements).‘We assume that the observed state variable can be partitioned into two components, x, = (m,,z,), where m, represents a macroeconomic shock that is common to all agents and z, represents an idiosyncratic component that is independently distributed across agents conditional on the realization of {m,}. Sufficient conditions for such independence are given in the three assumptions below.
Assumption CZ-X The transition
probability
for the observed
state variable
4dxt +1 Ix,, 4 = ~l(dzt+1 Iz,,m,,4h(dm, +1 14.
x, = (m,, zr) is given by (3.33)
30Convergence of the BHHH method in small samples can be accelerated by Broyden and Davidon Fletcher Powell updating procedures that adaptively improve the accuracy of the information matrix approximation to the Hessian of Lr. The method also applies to maximization of the partial likelihood function Lp. “A documented example of the algorithm written in the Gauss programming language is available from the author upon request.
John Rust
3112
SI-E
Assumption
For each t 3 0 the distributions independent: Pr(dc: ,..., dcflx: Assumption For
each
,..., xf)=
of the unobserved
ti
state variable
E: are conditionally
(3.34)
q(dcpIxp).
SI-Z t > 1 the transition
the observed
state variable
Pr(dz :+i,.
.,dz;lx:,.
probabilities for the idiosyncratic xy are conditionally independent: . ., xf,d:,...,df)=
fi
components
~l(dz~+lIz~,m,,~~)
zr of
(3.35)
a=1 and,
when t = 0, the initial distributions Pr(dz&.
. , dztlm,)
of zt are independent,
conditional
on m,: (3.36)
= fi rc,(dz”,Im,). 0=1
Assumption CIIX is an additional conditional independence assumption imposed when the observed state variable x, includes macroeconomic shocks. It corresponds to an asymmetry that seems reasonable when individual agents are small relative to the economy: macroeconomic shocks can affect the evolution of agents’ idiosyncratic states {z;,E;}, but an individual’s decision df has no effect on the evolution of the {m,} process, modelled as an exogenous Markov process with between transition probability rc2.32 SI-E and SI-Z require that any correlation agents’ idiosyncratic states {zf, E;} is a result of the common macroeconomic shocks. Together with assumption CI-X, these conditions imply that realizations {zp, df} and {zp, d,b} are independent conditional on {m,}. A final assumption is needed about relative sizes of time-series and cross-sectional dimensions of the data set. There are three cases to consider: (1) the number of time-series observations for each agent is fixed and A+ co, (2) the number of cross-section observations is fixed and T, + GO,or (3) both A and T, tend to infinity. In most panel data sets the cross-sectional dimension is much larger relative to the time-series dimension, so we will focus on case 1. If we further assume that the observation period T, is fixed at a common value T for all agents a, then it is straightforward to show that, conditional on (m,, . . . , m,), the realizations of {x;, d;} are IID. This allows us to use the simpler IID strong law of large numbers and LindeberggLevy central limit theorem to establish the consistency and asymptotic normality of 8’, i = p, f, n, requiring only continuous second derivatives of the likelihood.33 32Note that while {m,} is independent of the decisions independent of their collective behavior. Thus, CI-X equilibrium in large economies. 331n cases where the observations are INID or weakly and boundedness conditions are required to establish the
of individual agents, is not inconsistent dependent, asymptotic
somewhat properties
it will generally not be with dynamic general stronger smoothness of the MLE.
Ch. 51: Structural
Estimation
ofMarkov
3113
Decision Processes
Theorem 3.5 Under assumptions AS, CI, BU, WC, BE, CI-X, SI-E and ST-Z and regularity conditions (AlO) to (A35) of Rust (1988b), the full information and 3-stage maximum likelihood estimators e’, i = f, n satisfy 1. @ is a well-defined random variable, 1 as A + co, 2. t? converges to t3* with probability of fi(e’ - e*) converges 3. the distribution information matrix Z(e*) is given by
Ice*)=-E
where the
5 a2iogP@j” , z,,m,,e*)/aea~l(m,,...,m=)} i t=1
-E i
~~a210g~l(i,~i,-l,m,-l,rZ,_l,e*~~aea~1(mo7. ..,md} t=1
5 a iog7c,(m,(m,_,,e*)/aeae I i t=1
-
=E
weakly to N[O, 1(8*)-i]
i aiogq&I”z,,m,, e*yae f: a log&it I-z,,m,.RyaB’l~m,,....mdj t=1 i t=1
+ E i: a 10g711(ZtI~,_1,m,_l,e*)/ae i f=l x f ai0g711(2tp_1,m,-,,e*)/aefj~mo,....m,)) t=1 ,f a 10gP(i7tj5t,m,,e*)/ae t=1 T
x
c alogn,(~,t~~-,,~~,-,, 1=1
+2E
e*)/a@~(mO~~~~~mT~}
f
alogP(~~I~~,m,,B*)/ae
t=1
x i a logiQt,Im,_,,0*)/asl t=1 a
i
f=l
a logn,(m,(m,_,,e*)/a8’(
1 mOy
..jmT))
/(m 0, ..2mT)}
10g~,(m,~m,_,,e*~,aea8’~.
(3.37)
John Rust
3114
Theorem
3.6
Under assumptions AS, CI, BU, WC, BE, CI-X, SI-E and SI-Z and regularity conditions (AlO) to (A35) of Rust (1988b), the 2-stage partial likelihood estimator ijp = (&, e^;)’ satisfies 1. BP is a well-defined random variable, 2. I!?~converges to 8* with probability 1 as A + co, 3. the distribution
of fi(@
- (3*) converges
weakly to N(0, z) where 2 is given
by z=A-‘flA’-‘,
where
(3.38)
A and R are given by (3.39)
where
A,, = E
f
i
a210gn1(ZtlZt_l,mt_1
,~~-1,eT)n2(m,Im,-l,8T)/
t=1
A,, = E i a2p(~~I~~,e:,e~)/ae2ad~ 2i(-o,--~-T~}~ i 1=1 nll = E f: a 10g7r1(Z,IZ,_i,m,_, i f=l
,d,-,,8:)~2(m,Im,-,,eT)/ae,
x i alOgz 1(-I-zt Z, ,,m,,,ri,,,n:)n,lm,,m~-~,~:)/au;l(m,,...,m,)}, 2=1 q2=E
f: alog711(~tI~,-1,m,-,,d”,-,,~T)~2(m,lm,~,,~T)/ae,
i
1=1
x i az@~z,,q,o;)/a0~ 21(m,,...,md}, t=1 i aP@Iz-,,q,8;)/a8, 1=1
f: t=1 (3.40)
3115
Ch. 51: Structural Estimation of Markoa Decision Processes
In addition to the distribution of the parameter vector &, we are often interested in the distribution of utility and value functions ug and u0 treated as random elements of B. This is useful, for example, in computing uniform confidence bands for these functions. The following result is essentially a Banach space version of the “delta theorem”. Theorem
3.7
If &converges with probability 1 to 6* and &[i? and v, is a smooth function of 8, then 1. 06 is a B-valued random element, 2. vi converges with probability 1 to u,,,,
- Q*] converges
weakly to N(0, Z),
3. the distribution of ,,&[ui - v,.] converges weakly to a Gaussian random element of B with mean 0 and covariance operator [au,,/a0]Z[au,,/aC3’]. We conclude this section with some comments on specification and hypothesis testing. Since structural models are often highly simplified and tightly parameterized, one has an obligation to go beyond simply reporting parameter estimates and their standard errors. If the structural model is to have any credibility, it needs to be subjected to a battery of in-sample specification tests, and if possible, out-of-sample predictive tests (a good example of the latter type of test is presented in Section 4.2). The maximum likelihood framework allows formulation of standard “holy trinity” test statistics, the Wald, Likelihood Ratio, and Lagrange Multiplier (LM) tests [see Engle (1984)]. Examples 1 and 2 below show how these statistics can be used to test the validity of various functional-form restrictions on the DDP model. Example 3 discusses the chi-square goodness-of-fit statistic, which is perhaps the most useful omnibus test of the correctness of the DDP specification. Example
1
The holy trinity can be used to conduct a “heterogeneity test” of the null hypothesis that the parameter vector 8* is the same for two or more subgroups of agents. If there are K subgroups, we can formulate the null hypothesis as K - 1 linear restrictions on a KN-dimensional full parameter vector (d,, . . . , 0,) where 0, is the Ndimensional parameter vector for subgroup k. The likelihood ratio test involves computing - 2 times the difference in the restricted and unrestricted log-likelihood functions, where we compute the restricted log-likelihood by pooling all K subgroups and estimating a single N-dimensional parameter vector 0. The Wald test statistic is a quadratic form in the K - 1 differences in the group-specific coefficient estimates, OK+ 1 - 8,, k = 1,. . . , K - 1. In this case the LM statistic is the easiest to compute since it only requires computation of a single N-dimensional parameter estimate 8for the pooled sample under the null hypothesis of no heterogeneity. The LM statistic tests whether the score of the likelihood function is approximately zero for all K subgroups. All three test statistics have an asymptotic chi-square distribu-
John Rust
3116
tion under the null, with degrees of freedom equal to the number of restrictions being tested. In the example, there are (K - l)N degrees of freedom. Computation of the Wald and LM statistics requires an estimate of the information matrix, f(&) for each of the subgroups k = 1,. . , K. Example
2
The holy trinity can be used to test the validity of the conditional independence assumption CI. Recall that CI implies that the unobserved state variable E, is independent of a,_ 1 and is conditional on the value x, of the observed state variable. This is a strong restriction although, as we will see in Section 4.6, it seems to be necessary to obtain a computationally tractable estimation algorithm. A natural way to test CI is to add some function f of the previous period control variables to the current period value function: Q(x,, II,) + af(dt_ J. Under the null hypothesis that CI is valid, the decision taken in period t - 1 will have no effect on the decision made in period t once we condition on x, since {u&x,, d), dud} constitutes a set of “sufficient statistics” for the agent’s decision in period t. Thus, a = 0 under the null hypothesis that CI holds. However under the alternative hypothesis that Cl doesn’t hold, E, and E,_ 1 will be serially correlated, even conditional, on x,, so that d,_ 1 will generally be useful for predicting the agent’s choice d,. Thus, c1# 0 under the alternative hypothesis. The Wald, Likelihood Ratio or LM statistics can be used to test the hypothesis that a = 0. For example, the Wald statistic is simply A&‘/8’(8), where d2(6i) is the asymptotic variance of oi. Example
3
The chi-square goodness-of-fit statistic provides an overall test of the null hypothesis that an econometric model is correctly specified (i.e. that the parametric model coincides with the “true” data generating process). In the case of a DDP model, this amounts to a test of the joint hypotheses that (1) agents are rational, i.e. they act “as if” their behavior is governed by an optimal decision rule from some DDP model, and (2) the particular parametric specification of this model is correct. However, the analysis of the identification problem in Section 3.5 reveals that the hypothesis of rationality per se imposes essentially no testable restrictions: the empirical content of the theory arises from additional restrictions on the primitives (u, p, j?) of the DDP model. In this sense, testing the theory is tantamount to testing the econometrician’s assumptions about the parametric functional form of (u, p, fi). Although there are other omnibus specification tests [such as White’s (1982) “information matrix test”] the chi-square goodness-of-fit test is far more useful in diagnosing the source of specification errors. There are two versions of this statistic, one for models without covariates [Pollard (1979)], and one for models with covariates [Andrews (1988,1989)]. 34 The former is useful for testing complete realizations of ?Strictly speaking, Pollard’s results are only applicable if the full likelihood includes the probability density of the initial state x,,. Otherwise the full likelihood (3.25) can be analyzed as a conditional likelihood using Andrew& analysis of chi-square tests of parametric models with covariates.
Ch. 51: Structural Estimation ofMarkou Decision Processes
the controlled
3117
{xt, d,} using the full Markov transition probability P(d,+ l I derived in Theorem 3.3, whereas the version with covariates is useful for testing the conditional choice probability P(d, Ixt, 0) and the transition probability rc(dx,+ 1 Ix,, d,, f3). Both formulations are based on a partition of the relevant space of realizations of x, and d,. In the case of the full likelihood function, the relevant space is XT x DT, and in the case of P or n it is X x D or X x X x D, respectively. We partition this space into a fixed number M of mutually exclusive cells. The cells can be randomly chosen, or chosen based on some data-dependent procedure provided (1) the total number of cells M is fixed for all sample sizes, (2) the elements of the partition are members of a Vapnikkcervonenkis (VC) class, and (3) the partition converges in probability to a fixed, non-stochastic partition whose elements are also members of a VC class. 35 If we let 0 denote the M x 1 vector of elements of this partition R = (a,, . . . , a,), we can define a vector n(L2,0) of differences between sample frequencies of fl and the predicted probabilities of the DDP model with parameter 8. In a chi-square test of the specification of the conditional choice probability P, the ith element of A(f2, f3)is given by X t+l,MdX,+l
process
1xt, d,, 0)
(3.41)
The first term in (3.41) is the sample proportion of (x,, d,) pairs falling into partition element ~2~whereas the second term is the DDP model’s prediction of the probability that (x,, ~,)ER~ Note that since the law of motion for x, is not specified, we simply average over all sample points xp. An analogous formula holds for the case of chi-square tests of the full likelihood function: in that case ~(Ri, 0) is the difference between the sample fraction of paths {xp, d;} falling in partition element LJi less the probability that these realizations fall in Q,, computed by integrating with respect to the probability measure on XT x DT generated by the controlled transition probabilityP(d,+ 1I~,+~,&r(dx,+, I x,, d,, t9).The chi-square test statistic is given by x2(0, e^,= A@, @J? + n(n, 6),
(3.42)
where 2’ is a generalized inverse of the asymptotic covariance matrix 2 (which is generally singular). Andrews (1989) showed that under the null hypothesis of correct specification, I1(R,8) converges in distribution to N(O,Z), which implies that x’(f2,6) converges in distribution to a chi-square random variable whose degrees of freedom equal the rank of Z. Andrews provides formulas for ,!? that take relatively simple forms when 6 is an asymptotically efficient estimator such as in the case of the full information maximum likelihood estimator ef. A natural strategy is to start with a chi-square test of the specification of the full likelihood (3.25). If this is rejected, one can then do a chi-square test of rc to see if the rejection is due to a misspecification of agents’ beliefs. If this is not rejected, then we have an “See Pollard (1984) for a definition of a VC class.
John Rust
3118
indication that the rejection of the model is due to a misspecification of preferences (p, u} or that the CI assumption is not valid. A further test of the CI assumption using one of the holy trinity specification tests described in Example 2 can be used to determine if CI is the source of the problem. If not, an investigation of the cells which have the largest prediction errors (3.41) may provide valuable insights on the source of specification errors in u or /?.
3.3.
Alternative
estimation methods: Finite-horizon
DDP problems
Although the previous section showed that the AS-C1 assumption leads to a simple and general estimation theory for DDP’s, it would be desirable to develop estimation methods that can relax these assumptions. This section considers estimation methods for DDP models with unobservables that are serially correlated and enter the utility function in a non-separable fashion. Unfortunately there is no general estimation theory for this class of models at present. Instead there are a variety of problem-specific specifications and estimation methods, most of which are designed for finite-horizon binary choice problems. As we will see, there are substantial theoretical and computational obstacles to developing an estimation theory for a more general class of DDP models that relaxes ASCI. This section presents two examples that illustrate the successful approaches.36 Example
1
Wolpin (1984) pioneered the nested numerical solution method for a class of binary choice models where unobservables enter the utility function u(x,~,d) in a nonadditive fashion. In a binary choice model, one does not need a 2-dimensional vector of unobservables to yield a saturated choice model: in Wolpin’s specification E is uni-dimensional and interacts with the observed state variable x,, yet the choice probability satisfies P&l Ix) > 0 for all x and is continuously differentiable in 8. Wolpin’s application concerned a Malaysian family’s choice of whether or not to conceive a child in period t: d, = 1 if a child is born in year t, d, = 0, otherwise.37 Wolpin used the following quadratic family utility function,
u&x,, E,,d,) = (0, + e,)n, - 8,n: + e3c, - &:,
(3.43)
where x, = (n,, c,) and n, is the number of children in the family and c, is total family consumption (treated as an exogenous state variable rather than as a control variable). Assuming that {st} is an IID Gaussian process with mean 0 and variance cr2 and that {x,} is a Markov process independent of {et}, the family’s optimal 36For additional examples, see the survey by Eckstein and Wolpin (1989a). “Wolpin’s model thus assumed that there is no uncertainty about fertility and that contraception is 100% effective.
Ch. 51: Structural Estimation of Markov Decision Processes
3119
decision rule d, = dt(xr, E,) can be computed by backward induction starting in the final period T at which the family is fertile. It is not difficult to see that (3.43) implies 6, is given by the threshold rule
4(x,, 8,)=
1 if i 0
if
8, > v,(x,,0)
(3.44)
E, d yl,(x,, 0).
The cutoffs (Q(x, 0)) define the value of gr such that the family is indifferent as to whether or not to have another child: for any given value of 8 and for each possible t and x, the cutoffs can be computed by backward induction. The assumption that {E,} is IID N(0, a*) t h en implies that the conditional choice probability is given by
P,(d,Ix,9 4 =
1 - @[Q(x,, 0)/a] @C?,(X,
where @ is the standard using the partial, full or equations (3.25) to (3.27) can show that the cutoffs implies that the likelihood establish that his maximum Example
@/aI
if
d= 1
if
d = 0,
(3.45)
normal CDF. One can estimate 8 maximum likelihood Newton-step maximum likelihood estimators given in of Section 3.2. Using the implicit function theorem one are smooth functions of 8. From (3.45) it is clear that this function is a smooth function of 8, allowing Wolpin to likelihood estimator has standard asymptotic properties.
2
Although Example 1 shows how one can incorporate measurement errors and unobservables that enter the utility function non-separably, it relies on the independence of the {E,} process, a strong form of CI. Pakes (1986) developed a method for estimating binary DDP’s with serially correlated observables that enter u additively. Pakes developed an optimal stopping model of whether or not to renew a patent. He used Monte Carlo simulations to “integrate out” the serially correlated {Ed} process in order to evaluate the likelihood function, avoiding a potentially intractable numerical integration problem. Let t denote the number of years a patent has been in force, and let E, denote the cash flow accruing to the patent in year t. In Europe, patent holders must pay an annual renewal fee c,, an increasing function oft. The patent holder must decide whether or not to pay the cost c, and renew the patent, or let it lapse in which case the patent is permanently cancelled and the idea is assumed to have 0 net value thereafter. In Pakes’s data set one cannot observe patent holders’ earnings E,, so the only observable state variable is x, = t, the number of years the patent is kept in force. Assuming that {et} is a first order Markov process with transition probability qt(dE,IE,_ 1), Bellman’s equation is given by VT(&)= max (0, E - c,}, (3.46)
John Rust
3120
where T is the statutory limit on patent life. Under fairly general conditions on the transition probabilities {qt} (analogous to the weak continuity conditions of Section 2) one can establish that the optimal decision rule, d = S,(E), is given by a threshold rule of the form (3.44), but in this case the cutoffs q,(0) only depend on t and the parameter 0. Pakes used a particular family {ql} to simplify the recursive calculation of the cutoffs: if
s,+i = 0
if
OGs,+i
ddZs,
if
E,+1>6
E 2
(3.47)
2’
The interpretation of (3.47) is that with probability exp{ -8is1} the patent is disthe covered to be worthless (in which case E~+~= 0 for ail k 2 l), or alternatively, patent is believed to have value next period given by E,, 1 = max{e2c,,z}, where z is an IID exponential random variable whose density is given by the third term in (3.47). The likelihood function for this problem requires computation of the probability A,(Q) that the patent lapses in year t: A,(t)= Pr{6,(e,)=0,6,_,(a,_,)=
1,...,6r(ai)=
1)
where q. is the initial distribution of returns at the time the patent was applied for, assumed to be log-normal with parameters (0,,0,). To establish the asymptotic properties of the maximum likelihood estimator 8 = (8,, . . . , &), we need to show that &(t) is a smooth function of 8. It is straightforward to show that the cutoffs q,(0) are smooth functions of 8. Pakes showed that this fact, together with the smoothing that occurs when integrating over the indicator functions in (3.48), yields a likelihood function which is continuously differentiable in tJ and has standard asymptotic properties. However, in practice the numerical integrations required in (3.48) become intractable for t larger than 3 or 4, whereas many patents are held for 20 years or more. To overcome the problem Pakes used Monte Carlo integration, calculating a consistent estimate It(e) by simulating a large number of realizations of the process {Et} and tabulating the fraction of realizations that lead to drop-outs at year t, i.e. $ < q,(d), Es 2 q,(d), s = 1,. . , t - 1. This requires a “nested simulation algorithm” consisting of an outer optimization algorithm that searches for a value of 8 to maximize the likelihood function, and an inner simulation/DP algorithm that 1. solves the DDP problem for a given value of 8, using backward induction to calculate the cutoffs {s,(e)},
Ch. 51: Structural Estimation of Markov
Decision Processes
2. simulates NSIM draws of the process (EJ to calculate &(O) and the full likelihood function J?(O)
3121
a consistent
estimate
of
Note that the realizations {EJ change each time we update 8. The main burden of Pakes’s, simulation estimator arose from having to repeatedly re-simulate NSIM realizations of (8,): the recursive calculation of the cutoffs {r,(O)} took little time in comparison. 38 An additional burden is imposed by the need to calculate numerical derivatives of the likelihood function: each iteration requires 8 separate solutions and simulations of the DDP problem, once to compute if(e) and 7 additional times to compute its partial derivatives with respect to (O,, . . . ,O,). McFadden (1989) and Pakes and Pollard (1989) subsequently developed a new class of simulation estimators that dramatically lowers the value of NSIM needed to obtain consistent and asymptotically normal parameter estimates. In fact, consistent, asymptotically normal estimates of I3 can be obtained with NSIM as small as 1, whereas consistency of Pakes’s original simulation estimators required NSIM to tend to infinity with sample size. The new approach uses minimum distance rather than maximum likelihood to estimate 8. In the case of the patent renewal problem, the estimator is defined by ij = argmin
HA(ey w,H,(e),
HA(e) = [fl,/A - l,(e), . . , +/A - &(e)-y, A=f:n,
It(e) =
(3.49)
where n, is the number of patent holders who dropped out (failed to renew) after t years, A is the total number of agents (patent holders) in the sample, W,., is a T x T positive definite weighting matrix, and {Ejl} denotes the jth simulation of the Markov process when the transition probabilities qt in (3.47) are evaluated at 8. In order to satisfy the uniformity conditions necessary to prove consistency and asymptotic normality of the estimator, it is crucial to note that one does not draw independent realizations {Etj} for distinct values of 8. Instead, at the beginning of estimation we simulate NSIM random vectors (f,, . . . , rNs,,) which are held fixed throughout the entire estimation process, and thejth draw_tj is used to construct {Ejf} for each trial value of 8. In the patent renewal problem tj is given by Tj=(f9Gl,.
. . 9 ~T~~lr~~~~7T)~
(3.50)
38Pakes varied NSIM depending on how close the outer algorithm was to converging, setting NSIM equal to a few thousand in the initial stages of the optimization when one is far from convergence and gradually increasing NSIM to 20000 in the final stages of estimation in order to get more precise estimates of the likelihood function and its derivatives.
John Rust
3122
where z”is standard normal, and the 6, and 7, are IID draws from an exponential distribution with unit mean. Using tj we can construct a realization of {Ejl} for any value of 8 using the recursion
Fjo= exp(0,
+ ~9,zS}, (3.51)
>,~‘1}max[8,~j,_,,8’,B,y”,-8,].
ej,=r{Q,Ej,-,
This simulation process insures that no extraneous simulation error is introduced in the process of minimizing over 0 in (3.49), insuring that the uniformity conditions given in Theorems 3.1 and 3.2 of Pakes and Pollard hold. These conditions imply the following asymptotic distribution for 0:
JA [S - e*] + N(0, f2 = (1 + NSIMA =
Q),
‘)(A’WA)-‘A’WZ-W’A(A’WA)-‘,
aa.(o*)/ae,
r = diag[YB*)]
- n(e*)n(e*),,
A(Q)= C&(@, ...,&(@I’.
(3.52)
By Aitken’s theorem [Theil (1971)], the most efficient choice for the weighting matrix is W = diag[YB*)]. Notice that when one uses the optimal weighting matrix, W,, the relative efficiency of the simulation estimator increases rapidly to the efficiency of maximum likelihood [which requires exact numerical integration to compute A(Q)]. The standard errors of the simulation estimator are only twice as large as maximum likelihood when NSIM equals one, and are only 10% greater when NSIM equals 10. While the “frequency simulation estimator” (3.49) substantially reduces the value of NSIM required to estimate 8, it has the drawback that the objective function is a step function in 0, so a non-differentiable search algorithm must be used which typically requires many more function evaluations to find 8 than gradient optimization methods. Thus, the frequency simulator will be feasible for problems where we can rapidly solve the DDP problem for the cutoffs {q,(0)}.“’ An important question is whether simulation methods can be used to avoid solving the DDP problem itself. Note that the main burden of DDP estimation methods involves calculation of the value function V,(s), which is itself a conditional expectation with respect to the controlled stochastic process {s,}. Can we also use Monte Carlo integration to compute an approximate value function i&s) rather than computing exactly by backward induction? The next section discusses how this might be done in the context of infinite-horizon DDP models. 39Recent progress in developing may help overcome this problem.
“smooth
simulation
estimators”
[Stern
(1989). McFadden
(1990)]
3123
Ch. 51: Structural Estimation of Markov Decision Processes
3.4.
Alternative
estimation
methods: Injinite-horizon
DDP’s
Hotz et al. (1993) introduced a simulation estimator for DDP problems that avoids the computational burden of nested numerical solution of the Bellman equation. Unlike the simulation estimators described in Section 3.3, the Hotz et al. estimator is a smooth function of 0. The simulation estimator is based on the following result of Hotz and Miller (1993). Lemma 3.1 Suppose that q(delx) has a density with finite first moments and support equal to RIDCX)I. Then for each XGX, the choice probabilities given in equation (3.18) define a 1: 1 mapping between the space of normalized value functions {UER’~(~)’Iu( 1) = 0} and the 1D(x)(-dimensional simplex dlDCx)l. In the special case where q has an extreme-value distribution, the inverse mapping has a closed-form solution: it is simply the log-odds transformation. The idea behind the simulation estimator is to use non-parametric estimation to obtain consistent estimates of the conditional choice probabilities p(dlx) and then invert these estimates to obtain consistent estimates of the normalized value function, B&d) 6(,x, 1). If we also have estimates of agents’ beliefs fi we can simulate (a consistent estimate of) the controlled stochastic process {x,, E,,d,). For any given values of x and 0 we can use the simulated realizations of {x,, E,,d,} to construct a simulation estimate of the normalized value function, i&(x, d) - f&(x, 1). At the true parameter 9*, t&(x, d) - &(x, 1) is an (asymptotically) unbiased estimate of u&x, d) - u&x, 1). Since the latter quantity can be consistently estimated by inverting non-parametric estimates of the choice probability, the simulation estimator can be roughly described as the value of 0 that minimizes the distance between the simulated value function CO(x,d) - 5(x, 1) and 0(x, d) - 0(x, l), averaged over the various (x, d) pairs in the sample. More precisely, the simulated value (SV) function estimator consists of 5 steps. For simplicity I assume that the observed state variable x has finite support and that there are no unknown parameters in the transition probability n, and that q is an IID extreme-value distribution. Step I. Invert non-parametric estimates of P(d (x) to compute consistent of the normalized value functions do(x, d) = u(x, d) - u(x, 1):
estimates
(3.53) Step 2. Use the consistent decision rule b:
estimate
of dv* to uncover
&x, E) = argmax [d 0(x, d) + c(d)]. deD(r)
a consistent
estimate
of the
(3.54)
John Rust
3124
Step 3. Using & q and 7~simulate realizations of the controlled stochastic process {xt, et}. Given an initial condition (x,, d,), this consists of the following steps: 1. Given (x,_ r, d,_ 1) draw x, from the transition
probability
~(dxtlxt~r>dt-I), 2. Given x, draw E, from the density q(EtIxt), 3. Given (xc, EJ compute d, = 8(x,, E,), 4. If t > S(A), stop, otherwise set t = t + 1 and return
to step 1.40
Repeat step 2 using each of the sample points (x;, d;), a = 1,. . . , A, t = 1,. as initial conditions. Step 4. Using function
the simulations
, T,
the simulated value
{T,Li,} from step 2, compute
I&(X,,, d,): S(A)
(3.55)
where (x,, d,) is the initial condition from step 2. Repeat step 3 using each of the sample points (x:, dp) as initial conditions. Step 5. Using the normalized
simulated value function d &,(xp, d:) and corresponding non-parametric estimates d 0(x;, df) as “data”, compute the parameter estimate 8 as the solution to
8A = argmin HA(0)‘WAHA(@, (3.56)
where WA is a K x K positive definite dimension of the instrument vector, Z;.
weighting
matrix
and
K is the
The SV estimator has several attractive features: (1) it avoids the need for repeated calculation of the fixed point (3.22), (2) the simulation error in d f&(xp,df) averages out over sample observations, so that one can estimate the underlying parameters using as few as one simulated path of (_i$,Et} per (xr, d;) observation, (3) since each term di&(x,d) is a smooth function of 8, the objective function (3.56) is a smooth function of 8 allowing use of gradient optimization algorithms to estimate e^,. Note that the simulation in step 2 needs to be done only once at the start of the estimation, so the main computational burden is the repeated evaluation of (3.55) each time 13 is updated. Although the SV estimator is consistent, its main drawback is its sensitivity to @-‘Here S(A) denotes 1, S(A)-+m as A+co.
any stopping
rule (possibly
stochastic)
with the property
that with probability
Ch. 51: Structural Estimation of Markov Decision Processes
3125
the non-parametric estimates of P(d(x) in finite samples. One can see from the log-odds transformation (3.53) that if one of the estimated choice probabilities entering the odds ratio is 0, the normalized value function will equal plus or minus co. For states x with few observations, the natural non-parametric estimator for P(dlx) ~ sample frequencies - can frequently be 0 if the true value of P(dlx) is non-zero but small. In general, the inversion process can amplify estimation errors in the non-parametric estimates of P(dIx), and these errors can result in biased estimates of 0. A related problem is that the SV estimator requires estimates of the normalized value function d u(x, d) for all possible (x, d) pairs, but many data sets will frequently have data concentrated only in a small subset of (x,d) pairs. Such cases may require parametric specifications of P(dJx) to be able to extrapolate estimates of P(dl x) to the set of (x, d) pairs where there are no observations. In their Monte Carlo study of he SV estimator, Hotz et al. found that smoothed estimates of P(d Ix) (such as produced by kernel estimators, for example) can reduce bias, but in general the SV estimator depends on having lots of data to obtain accurate non-parametric estimates of P.41
3.5.
The identi$cation problem
I conclude with a brief discussion of the identification problem for MDP’s and DDP’s. Without strong restrictions on the primitives (u,p, /?) the structural model is “non-parametrically unidentified”: there are infinitely many distinct primitives that are consistent with any given decision rule 6. Indeed, we show that the hypothesis that agents behave according to Bellman’s principle of optimality imposes no testable restrictions in the sense that we can always find primitives {/I, u, p} that “rationalize” an arbitrary decision rule 6. This result stands in contrast to the case of static choice models where we know that the hypothesis of optimization per se does imply testable restrictions.42 The absence of restrictions in the dynamic case may seem surprising given that the structure of the MDP problem already imposes a number of strong restrictions such as time additive preferences and constant intertemporal discount factors, as well as the expected utility hypothesis itself. While Bellman’s principle does not place restrictions on historical choice behavior, it can yield restrictions in choice experiments where we have at least partial control over agents’ preferences or beliefs. 43 For example, by varying agents’ beliefs from p to p’, an experiment implies a new optimal decision rule S@*, u*, p’), where /I* and u* are the agent’s “true” discount factor and utility function. This experiment imposes the 41See also Bansal et al. (1993) for another interesting non-parametric simulation estimator that might also be effective for estimating large scale DDP problems. 42These restrictions include the symmetry and negative-semidefiniteness of the Slutsky matrix [Hurwicz and Uzawa (1971)], the generalized axiom of revealed preference [Varian (1984)], and in the case ofdiscrete choice, restrictions on conditional choice probability [Block and Marschak (1960)]. 43An example’is the experiment which revealed the “Allais paradox” mentioned in Footnote 9.
John Rust
3126
following restriction defined by
on candidate
R = {CD,u)IW, u,P)= W*,
(p, u) combinations:
u*,p,> f-7 ((P,4lW,
they must lie in the set R
U,P’)= w*, u*,P’)).
(3.57)
Although experimental methods offer a valuable source of additional testable restrictions that can help narrow down the equivalence class of competing structural explanations of observed behavior, it should be clear that even extensive experimentation will be unable to uniquely identify the “true” structure of agents’ decision-making processes. As is standard in analyses of the identification problem, we will assume that the econometrician has access to an unlimited number of observations. This is justified since our results in this section are primarily negative: if the primitives {/I, u,p} cannot be identified with infinite amounts of data, then they certainly can’t be identified in finite samples. We consider general MDP’s without unobservables, since the existence of unobservables can only make the identification problem worse. In order to formulate the identification problem a la Cowles Commission, we need to translate the concepts of reduced-form and structure to the context of a nonlinear MDP model. Definition 3.2 The reduced-form
of an MDP is the agent’s optimal
decision
rule, 6.
Dejinition 3.3 The structure of an MDP is the mapping:
A: (/I, u, p} + 6 defined by
6(s) = argmax [u(s, d)],
(3.58)
deD(s)
where v is the unique
fixed point to
u(s, d) = u(s, d) + j?
s
max [u(s’, d’)]p(ds’l s, d).
(3.59)
d’sD(s’)
The rationale for identifying 6 as the reduced-form of the MDP is that it embodies all observable implications of the theory and can be consistently estimated by non-parametric regression given sufficient number of observations {s,, d,}.44 We can
“%ince the econometrician fully observes all components of (s,d), the non-parametric regression is degenerate in the sense that the model d = 6(s) has an “error term” that is identically 0. Nevertheless, a variety of non-parametric regression methods will be able to consistently estimate 6 under weak regularity conditions.
Ch. 51: Structural
Estimation
use the reduced-form of primitives. Dejnition
3127
of Markov Decision Processes
estimate
of 6 to define an equivalence
relation
over the space
3.4
Primitives
_ - (u, p, p) and (u, p, /?) are observationally
equivalent if (3.60)
Thus A ‘(6) is the equivalence class of primitives consistent with decision rule 6. Expected-utility theory implies that A(/?, U, p) = A(fl, au + b, p) for any constants a and b satisfying a > 0, so at best we will only be able to identify an agent’s preferences u modulo cardinal equivalence, i.e. up to a positive linear transformation of u. Definition 3.5 The stationary MDP problem (3.58) and (3.59) is non-parametrically ident$ed if given any reduced-form 6 in the range of A, and any primitives (u, p, /I) and (U, p, p) in A - ‘(6) we have
ix p=P, B=
(3.61)
u = aii + b,
for some constant
a and b satisfying
a > 0.
Lemma 3.2 The MDP problem
(3.58) and (3.59) is non-parametrically
unidentified.
The proof of this result is quite simple. Given any 6 in the range of A, let (p, u, p)~ A - ‘(6). Define a new set of primitives (U,p, /3) by
dds’ Is,4 = Ads’1s, 4, $s, 4 = u(s,4 + f(s) - P f(s')p(ds' Is,d), s
(3.62)
where f is an arbitrary measurable function of s. Then U is clearly not cardinally equivalent to u unless f is a constant. To see that both (u,p,/?) and (U,p,p) are observationally equivalent, note that if t&d) is the value function corresponding to primitives (u,p, fl) then we conjecture that U(s,d) = u(s, d) + f(s) is the value
3128
John Rust
function
corresponding
to (ti, p, p):
U(S,d) = ii@, d) + p
max [z?(s’,d’)]p(ds’I s, d) s
d’eD(s’)
= u(s, d) + f(s) - j3 f(s’)p(ds’( s, d) + /I max [I@‘, d’) + f(s’)]p(ds’ Is, d) s s d’ED(s’) = u(s, d) + f(s) + /i’
max [u(s’, d’)]p(ds’I s, d) s
=
u(s,
d)
+
d’eD(s’)
(3.63)
f(s).
Since v is the unique fixed point to (3.59) it follows that v + f is the unique fixed point to (3.63), so our conjecture V = v + f is indeed the unique fixed point to Bellman’s equation with primitives (ii, p, fl). Clearly {U(S,d), dud} and {I+, d) + f(s),d~D(s)} yield the same decision rule 6. It follows that (fi,u + f - BEf,p)is observationally equivalent to (/I, U, P), but u + f - /lEfis not cardinally equivalent to ll. We now ask whether Bellman’s principle places any restrictions on the decision rule 6. In the case of MDP’s Blackwell’s theorem (Theorem 2.3) does provide two restrictions: (1) 6 is Markovian, (2) 6 is deterministic. In practice, it is extremely difficult to test these restrictions empirically. Presumably we could test the first restriction by seeing whether agents’ decisions depend on lagged states s,_~ for k= 1,2,.... However given that we have not placed any a priori bounds on the dimensionality of S, the well-known trick of “expanding the state space” [Bertsekas (1987, Chapter 4)] can be used to transform an Nth order Markov process into a 1st order Markov process. The second restriction might be tested by looking for agents who make different choices in the same state: 6(s,) # 6(s,+,), for some state s, = St+k = s. However, this behavior can be rationalized by a model where the agent is indifferent between several alternatives available in state s and simply chooses one at random. The following lemma shows that Bellman’s principle implies no other restrictions beyond the two essentially untestable restrictions of Blackwell’s theorem: Lemma 3.3 Given an arbitrary measurable that 6 = A@, u,p).
mapping
6: S + D there exist primitives
The proof of this result is straightforward. Given /?E(O, 1) and transition probability p, define u by
U(S,d)=I(d=6(s)}-p.
an arbitrary
(u, p, 8) such
discount
factor
(3.64)
Ch. 51: Structural Estimation ofMarkov Decision Processes
3129
Then it is easy to see that u(,s,d) = I{d = 6(s)} is the unique solution to Bellman’s equation (3.59), so that 6 is the optimal decision rule implied by (u, P, P). If we are unwilling to place any restrictions on (u, p, P), then Lemma 3.2 shows that the resulting MDP model is non-parametrically unidentified and Lemma 3.3 shows that it has no testable implications, in the sense that we can “rationalize” any decision rule 6. However, it is clear that from an economic standpoint, many of the utility functions u + f - b_Ef will generally be implausible, as is the utility function U(S,d) = I{d = 6(s)}. These results provide a simple illustration of the need for auxiliary identifying restrictions to eliminate primitives that we deem “unreasonable” while at the same time pointing out the futility of direct non-parametric estimation of (u, p, /I). The proofs of the lemmas also indicate that in order to obtain testable implications, we will need to impose very strong identifying restrictions on u. To see this, suppose that the agents’ discount factor /I is known a priori, and suppose that we invoke the hypothesis of rational expectations: i.e. agents’ subjective beliefs about the evolution of the state variables coincide with the objectively measurable population probability measure. This implies that we can use observed realizations of the controlled process {.s,,d,} to consistently estimate the p(ds,+ 1Isf,dt), which means that we can treat both /I and p as known a priori. Note, however, that the proofs of Lemmas 3.2 and 3.3 are unchanged whether or not we know /3 and p, so it is clear that we must impose identifying restrictions directly on u itself. The usual way to impose restrictions is to assume that u is a smooth function of a vector of unknown parameters 8. However, in the case where S and D are finite sets, this is insufficient to uniquely identify the model: there is an equivalence class of 8 with non-empty interior consistent with any decision rule 6,. This latter identification problem is an artifact of the degeneracy of MDP models without unobservables. Rust (1996) provides a parallel identification analysis for the subclass of DDP models satisfying the AS-C1 assumptions presented in Section 4.3, and shows that while AS-C1 does impose testable restrictions, the primitives (u, q, rc,/?) are still non-parametrically unidentified in the absence of further identifying restrictions. However, when identifying restrictions take the form of smooth parametric specifications (u,, qe, 7tg,Be), the presence of unobservables succeeds in smoothing out the problem, and the results of Section 4.3 imply that the likelihood is a smooth function of 0. Results from differentiable topology can then be used to show that the resulting parametric model is generically identified. While the results in this section suggest that, in principle, one can always “rig” a DDP model to rationalize any given set of observations, we should emphasize that there is no guarantee that we can do it in a way that is theoretically plausible. Given the increasing size and complexity of current data sets, it can be a considerable challenge to find a plausible DDP model that is consistent with the available data, let alone one that is capable of making accurate out-of-sample forecasts of policy changes. However, in the cases where we do succeed in specifying a plausible structural model that fits the data well, we need to exercise a certain amount of caution in using the model for policy forecasting and welfare analyses, etc. 
Keep in
John Rust
3130
mind that a model will be credible only until such time as it is rejected by new, out-of-sample observations, or faces competition from alternative models that are also consistent with available data and are equally plausible. The identification analysis suggests that we can’t rule out the existence of alternative plausible, observationally equivalent models that generate completely different policy forecasts or welfare impacts. Put simply, since structural models can be falsified but never proven to be true, their predictions should always be treated as tentative and subject to continual verification. 4.
Empirical applications
I conclude this chapter with brief reviews of two empirical applications of DDP models: Rust’s (1987) model of optimal replacement of bus engines and the Lumsdaine et al. (1992) model of optimal retirement from a firm.
4.1.
Optimal replacement
of bus engines
One of the simplest applications of the specific class of DDP models given in Definition 3.1 is Rust’s (1987) application to bus engine replacement. In contrast to the macroeconomic studies that investigate aggregate replacement investment, this model goes to the other extreme and examines the replacement investment decision at the level of an individual agent. In this case the agent, Harold Zurcher, is the maintenance manager of the Madison Metropolitan Bus Company. One of the problems he faces is to decide how long to operate a bus before replacing its engine with a new or completely overhauled bus engine. We can represent Zurcher’s problem as a DDP with state variable x, equal to the cumulative mileage on a bus since last engine replacement, and control variable d, which equals 1 if Zurcher decides to replace the bus engine, and 0 otherwise. Rust assumed that Zurcher behaves as a cost minimizer, so his utility function is given by u(x,, d,, d,,&)
--or -c(O,Q,)-s,(l)
if
d,= 1
- c(x0 0,) - c,(O)
if
d, = 0,
=
(4.1)
where 0, represents the labor and parts cost of installing a replacement bus engine and c(x, 0,) represents the expected monthly operating and maintenance costs of a bus with x accumulated miles since last replacement. Implicit in the specification (4.1) is the assumption that when a bus engine is replaced, it is “as good as new”, so the state of the system regenerates to x, = 0 when d, = 1. This regeneration property is also captured in the transition probability for x,:
4x, +1 Ix,,4) =
g(x,+,-0)
if
d,= 1
g(x,+r-xx,)
if
d,=O,
3131
Ch. 51: Structural Estimation of Markov Decision Processes
where g is a probability density function. The renewal property given in equations (4.1) and (4.2) defines a regenerative optima/ stopping problem, and under an optimal decision rule d, = 6*(x,, E,, 6) the mileage process (x,} is a regenerative random walk. Using data on 8156 bus/month observations over the period 1975 to 1986, Rust estimated 8 and g using the maximum likelihood estimator (3.25). Figures 2 and 3 present the estimated value and replacement hazard functions assuming a linear cost function, c(x, 0,) = 0,x and two different values of /?. Comparing the estimated hazard function P(l Ix, 0) to the non-parametrically estimated hazard, both the dynamic (fi = 0.9999) and myopic (B = 0) models appear to fit the data equally well. In particular, both models predict that the probability of a replacement is essentially 0 for x less than 100 000 miles. However likelihood ratio and chi-square tests both strongly reject the myopic model in favor of the dynamic model: the data imply that the concave value function for p = 0.9999 fits the data better than the linear value function 8,x when B = 0. The precise value of B could not be identified: the likelihood was virtually flat for B 2 0.98, although with a very slight upward slope as /?- 1. The latter finding, together with theJinal oalue theorem [Bhattacharya and Majumdar (1989)] may indicate that Zurcher is minimizing long-run average costs rather than discounted costs: /??J(x,, d,)
(1 - /I) f
= lim max L E a{ zI
t=o
T+m
d
(4.3)
u(xA)}.
T
0
100 Mileage
300 200 (Thousands) since last replacement
Figure 2. Estimated
value functions
400
John Rust
3132
200 (Thousands)
100 Mileage
Figure
since
3. Estimated
lart hazard
300
400
replacement functions
The estimated model implies that expected monthly maintenance costs increase by $1.87 for every additional 5000 miles. Thus, a bus with 300000 miles costs an average of $112 per month more to maintain than a bus with a newly replaced engine. Rust found evidence of heterogeneity across bus types, since the estimated value of O2 for the newer 1979 model GMC buses is nearly twice as high as for the 1974 models. This finding resolves a puzzling aspect of the raw data: engine replacements for the 1979 model buses occur on average 57 000 earlier than the 1974 model buses despite the fact that the engine replacement cost for the 1979 models is 25% higher. One of the nice features of estimating the preferences at the level of a single individual is that one can evaluate the accuracy of the structural model by simply asking the individual whether the estimated utility function is reasonable.45 In this case conversations with Zurcher revealed that implied cost estimates of the structural model corresponded closely with his perceptions of operating costs, including the finding that monthly maintenance expenditures for the 1979 model GMC buses were nearly twice as high as for the 1974 models. 45Another nice feature is that we completely avoid the problem of unobserved heterogeneity that can confound attempts to estimate dynamic models using panel data. Heckman (1981a. b) provides a good discussion of this problem in the context of models for discrete panel data.
3133
Ch. 51: Structural Estimation of Markoo Decision Processes
Figure 4 illustrates the value of structural models for use in policy forecasting. In this case, we are interested in forecasting Zurcher’s demand for replacement bus engines as their price varies. Figure 4 contains two “demand curves” computed for models estimated with j? = 0.9999 and /I = 0, respectively. The maximum likelihood procedure insures that both models generate the same predictions at the actual replacement price of $4343. A reduced-form approach to forecasting bus engine demand would bypass the difficulties of structural estimation and simply regress the number of engine replacements in a given period as a function of replacement costs during the period. However, since the cost of replacement bus engines has not varied much in the past, the reduced-form approach will be incapable of generating reliable predictions of replacement demand. In terms of Figure 4, all the data would be clustered in a small ball about the intersection of the two curves: obviously many different demand functions will be able to fit the data equally well. By parameterizing our prior knowledge about the replacement problem Zurcher is facing, and by efficiently using the additional information contained in the {xp, dp} sequences, the structural model is able to generate very precise estimates of the replacement demand function.
4
2 Parta
coat
Figure 4. Expected
6 (Thouaanda) of rttplacemont
replacement
demand
8 bum engine
function
10
12
John Rust
3134
4.2.
Optimal
retirement
from
a firm
Lumsdaine, Stock and Wise (1992) (LSW) used a data set that provides a unique “natural experiment” that allows them to compare the policy forecasts of four competing structural and reduced-form models. The data consist of observations of departure rates of older office employees at a Fortune 500 company. These workers were covered by a “defined benefit” pension plan that provided substantial incentives to remain with the firm until age 55 and then substantial incentives to leave the firm before age 65. In 1982 non-managerial employees who were over 55 and vested in the pension plan were offered a temporary 1 year “window plan”. Under the window plan employees who retired in 1982 were offered a bonus equivalent to 3 to 12 months’ salary, depending on the years of service with the firm. Needless to say, the firm experienced a substantial increase in departure rates in 1982. Using data prior to 1982, LSW fit four alternative econometric models and used the fitted models to make out-of-sample forecasts of departure rates under the 1982 window plan. One of the models was a reduced-form probit model with various explanatory variables and the remaining three were different types of dynamic structural models.46 Two of the structural models were finite-horizon DDP models with a binary decision variable: continue working (d = 1) or quit (d = 0). Since quitting is treated as an absorbing state and workers were subject to mandatory retirement at age 70, the DDP model reduces to a simple finite-horizon optimal stopping problem. The observed state variable x, is the benefit (wage or pension) obtained in year t and the unobserved state variable E, is assumed to be an IID extreme-value process in the first specification and an IID Gaussian process in the other specification. LSW used the following specification for workers’ utility functions:
4% &r,44
=
x:’ + Pl
Et(l)
if
d, = 1
i (~ZQ2x,)8’ + ~~(0) if
d, = 0.
+
The two components of p = (pl, pL2)represent time-invariant worker-specific heterogeneity. In the extreme-value specification for {E,},p1 is assumed to be identically 0 and pL2is assumed to have a log-normal population distribution with mean 1 and scale parameter 8,. In the Gaussian specification for (E,}, p2 is identically 1 and pL1 is assumed to have Gaussian population distribution with mean 0 and standard deviation 8,. Although the model does not directly include “leisure” in the utility 46LSW used different sets of explanatory variables in the reduced-form model including calculated “option values” of continued work (the expected present value of benefits earned by retiring at the “optimal” age, i.e. the age at which the total present value of benefits, wages plus pensions, is maximized). Other specifications used the levels and present values of Social Security and pension benefits as well as changes in the present value of these benefits (“pension accruals”), predicted earnings in the next year of employment, and age.
3135
Ch. 51: Structural Estimation of Markou Decision Processrs
function, it is implicitly included via the parameter 0,. Thus, we expect that 8, > 1 since the additional leisure time should imply that a dollar of pension income is worth more than a dollar of wage income. The third structural model is the “option value model” developed in previous work by Stock and Wise (1990). The option value model predicts that the worker will leave the firm at the first year t in which the expected presented discounted value of benefits from departing at t exceeds the maximum of the expected values of departing at any future date. This rule differs from an optimal stopping rule generated from the solution to a DP problem by interchanging the “expectation” and “maximization” operators. This results in a temporally inconsistent decision rule in which the worker ignores the fact that as new information arrives he will be continually revising his estimate of the optimal departure date t*.47 The parameter estimates for the three structural models are presented in Table 1. There are significant differences in the parameter estimates in the three models. The
Parameter Parameter
estimates
for the option
Option
Value Models
(1)
(2)
Dynamic Extreme (1)
0,
1.00*
0,
1.902 (0.192) 0.855 (0.046) 0.168 (0.016)
B 0,
0.612 (0.072) 1.477 (0.445) 0.895 (0.083) 0.109 (0.046)
1.OQ* 1.864 (0.144) 0.618 (0.048) 0.306 (0.037) 0.00*
0,
Summary Statistics -1n Y 294.59 x2 sample 36.5 x2 window 43.9
280.32 53.5 37.5
Table 1 value and the dynamic
279.60 38.9 32.4
Notes: Estimation is by maximum *Parameter value imposed.
Programming
Value (3)
1.018 (0.045) 1.881 (0.185) 0.620 (0.063) 0.302 (0.036) 0.00*
1.187 (0.215) 1.411 (0.307) 0.583 (0.105) 0.392 (0.090) 0.407 (0.138)
277.25 36.2 33.4
likelihood. All monetary
models.
Models Normal
(2)
279.57 38.2 33.5
programing
(4) 1.oO* 2.592 (0.100) 0.899 (0.017) 0.224 (0.021) 0.OQ*
277.24 45.0 29.0
(5)
(6)
1.187 (0.110) 2.975 (0.039) 0.916 (0.013) 0.202 (0.022) 0.00*
1.109 (0.275) 2.974 (0.374) 0.920 (0.023) 0.168 (0.023) 0.183 (0.243)
276.49 40.7 25.0
276.17 41.5 24.3
values are in %lOO,OOO(1980 dollars).
“‘Thus, it is somewhat of a misnomer to call it an “option value” model since it ignores the option value of new information that is explicitly accounted for in a dynamic programming formulation. Stern (1995) shows that in many problems the computationally simpler procedure of interchanging the expectation and maximization operators yields a very poor approximation to the optimal decision rule computed by DP methods.
3136
John
Rust
Gaussian DP specification predicts a much higher implicit value for leisure (0,) than the other two models, and the extreme-value specification yields a much lower estimate of the discount factor /I. The estimated standard deviation CJof the E,‘Sis also much higher in the extreme-value specification than the Gaussian. Allowing for unobserved heterogeneity did not significantly improve the fit of the Gaussian DP model, although it did have a marginal impact on the extreme-value mode1.48 Figures 5 to 7 summarize the ability of the four models to fit the historical data. 1 ,
I -Actual
0.8
-
--.__~
PredIcted -___--.
0.6 -~
0.4 -~
50
52
5L
56
58
60
62
64
66
Age
Figure 5. Predicted versus actual 1980 departure
rates and implicit cumulative model
departures,
option value
1
0.6
-
Actual
---
Extreme
-
Normal
value
A
___
0.6
\
04
0.2
m,? _^
I
0c
52
W
Figure 6. Predicted
versus actual
“*The log-likelihoods - 277.25, respectively.
54
56
58 Age
60
1980 departure rates and implicit programming model
in the Gaussian
and extreme-value
62
cumulative
specifications
66
64
departures,
improved
dynamic
to - 276.17 and
Ch. 51: Structural
Estimation
3137
ofMarkot> Decision Processes
1 -Actual
---Predicted A
0.8
0.6
0.L
0.2
n -50
52
5L
56
56
60
62
6L
66
Age
Figure
7. Predicted
versus actual
departure
rates and implicit
cumulative
departures
probit
model
Figure 5 compares the actual departure rates (solid line) to the departure rates predicted by the option value model (dashed line). Figure 6 presents a similar comparison for the two DDP models, and Figure 7 presents the results for the best-fitting probit model. As one can see from the graphs all four models provide a relatively good fit of actual departure rates except that they all miss the pronounced peak in departure rates at age 65. Table 1 presents the x2 goodness-of-fit statistics for each of the models. The four models are generally comparable, although the option model fits slightly worse and the probit model fits slightly better than the two DDP models. The superior fit of the probit model is probably due to the inclusion of age trends that are excluded from the other models. Figures 8 to 10 summarize the ability of the models to track the shift in departure rates induced by the 1982 window plan. All forecasts were based on the estimated utility function parameters using data prior to 1982. Using these parameters, predictions were generated from all four models after incorporating the extra bonus provisions of the window plan. As is evident from Figures 8-10, the structural models were generally able to accurately predict the large increase in departure rates induced by the window plan, although once again none of the models was able to capture the peak in departure rates at age 65. On the other hand, the reduced-form probit model predicted that the window plan had essentially no effect on departure rates. Other reduced-form specifications greatly overpredicted departure rates under the window plan. The x2 goodness-of-fit statistics presented in Table 1 show that all of the structural models do a significantly better job of predicting the impact of the window plan than any of the reduced-form models.49 LSW concluded that: @The smallest x2 value for any of the reduced-form largest was 623.3.
models
under
the window
plan was 57.3, the
John Rust
3138 1 -
Actual
1981 IA\
19.32
---Actual
0.8 - -Predicted1982
I'
\ \ \
//. / 0.6.
50
52
5L
56
58
60
62
6L
66
Age
Figure 8. Predicted versus actual departure window plan, based on 1980 parameter
-
rates and implicit cumulative departures estimates, and 1981 actual rates: option
under the 1982 value model
Actual1981
---Actual1982
0.8
-Pred.Ext -Pred
Val.1982 Normal1982
0.6
50
52
5L
56
58
60
62
6L
66
Age
Figure 9. Predicted versus actual departure rates and implicit cumulative departures under window plan, based on 1980 parameter estimates, and 1981 actual rates: dynamic programming
the 1982 models
The option value and the dynamic programming models fit the data equally well, with a slight advantage to the normal dynamic programming model. Both models correctly predicted the very large increase in retirement under the window plan, with some advantage in the fit to the option value model. In short, this evidence suggests that the option value and dynamic programming models are considerably more successful than the less complex probit model in approximating the rules
3139
Ch. 51: Structural Estimation of Markov Decision Processes
0.8
06
0.L
/
0.2 -
0
A-.50
Figure
52
d.. 5L
56
/ /
I
58
/
/
1’ /
/
60
62
10. Predicted versus actual departures rates and implicit cumulative window plan, based on 1980 parameter estimates, and 1981 actual
6L
66
departures under the 1982 rates: probit model
individuals use to make retirement decisions, but that the more complex dynamic programming rule approximates behavior no better than the simpler option value rule. More definitive conclusions will have to await accumulated evidence based on additional comparisons using different data sets and with respect to different pension plan provisions. (p. 31).
References Ahn, M.Y. (1993a) Duration Dependence, Search Effort, and Labor Market Outcomes: A Structural Model of Labor Market History, manuscript, Duke University. Ahn, M.Y. (1993b) “Econometric Analysis of Sequential Discrete Choice Behavior with Duration Dependence”, manuscript, Duke University. Ahn, H. and Manski, C. (1993) “Distribution Theory for the Analysis of Binary Choice Under UncertainJournal ofEconometrics, 56(3), 270-291. ty with Nonparametric Estimation of Expectations”, Amemiya, T. (1976) “On A Two-Step Estimation of a Multivariate Logit Model”, Journal of Econometrics, 8, 13-21. Andrews, D.W.K. (1988a) “Chi-Square Diagnostic Tests for Econometric Models”, Journal of Econometrics, 37, 135-l 56. Andrews, D.W.K. (1988b) “Chi-Square Tests for Econometric Models: Theory”, Econometrica, 56, 1414-1453. Arrow, K.J., T. Harris and J. Marschak (1951) “Optimal Inventory Policy”, Econometrica, 19(3), 250-272. Bansal, R., A.R. Gallant, R. Hussey and G. Tauchen (1993) “Nonparametric Estimation of Structural Models for High Frequency Currency Market Data”, Journal of Econometrics, forthcoming. Basawa, I.V. and B.L.S. Prakasa Rao (1980) Statisrical Inference for Stochastic Processes. New York: Academic Press. Bellman, R. (1957) Dynamic Programming. Princeton University Press: Princeton. Berkovec,J. and S. Stern (1991) “Job Exit Behavior of Older Men”, Econometrica, 59(l), 189-210.
3140
John Rust
Berndt, E., B. Hall, R. Hall and J. Hausman (1974) “Estimation and Inference in Nonlinear Structural Models”, Annals ofEconomic and Social Measurement, 3, 6533665. Bertsekas, D. (1987) Dynamic Programming Deterministic and Stochastic Models, Prentice Hall: New York. Bertsekas, D. and D. Castaiion (1989) “Adaptive Aggregation Methods for Infinite Horizon Dynamic Programming”, IEEE Transactions on Automatic Control, 34(6), 5899598. Bhattacharya, R.N. and M. Majumdar (1989) “Controlled Semi-Markov Models ~ The Discounted Case”, Journal ofStatistical Planning and Inference, 21, 3655381. Billingsley, P. (1961) Statistical Inferencefor Markou Processes. University of Chicago Press: Chicago. Blackwell, D. (1962) “Discrete Dynamic Programming”, Annals ofMathematical Statistics, 33, 719~726. Annals of Mathematical Statistics, 36, 2266 Blackwell, D. (1965) “Discounted Dynamic Programming”, 235. Proceedings of the 5th BerkeCey Symposium Blackwell, D. (1967) “Positive Dynamic Programming”, on Mathematical Statistics and Probability, 1, 415-418. Block, H. and J. Marschak (1960) “Random Orderings and Stochastic Theories of Response”, in: I. Olkin, ed., Contributions to Probability and Statistics, Stanford University Press: Stanford. Boldrin, M. and L. Montrucchio (1986) “On the Indeterminacy of Capital Accumulation Paths”, Journal of Economic Theory, 40(l), 26-39. Brock, W.A. (1982) “Asset Prices in a Production Economy”, in: J.J. McCall, ed., The Economics of Information and Uncertainty, Chicago: University of Chicago Press. Brock, W.A. and L.J. Mirman (1972) “Optimal Economic Growth Under Uncertainty: The Discounted Case”, Journal of Economic Theory, 4,4799513. Chamberlain, G. (1984) “Panel Data”, in: Z. Griliches and M.D. Intrilligator, eds., Handbook of Econometrics Volume 2. North-Holland: Amsterdam. 1247-1318. Chew, S.H. and L.G. Epstein (1989) “The Structure of Preferences and Attitudes Towards the Timing and Resolution of Uncertainty”, Znternational Economic Review, 30(l), 103-l 18. Christensen, B.J. and N.M. Kiefer (1991a) “The Exact Likelihood Function for an Empirical Job Search Model”, Econometric Theory. Christensen, B.J. and N.M. Kiefer (1991b) Measurement Error in the Prototypical Job Search Model, manuscript, Cornell University. Cox, D.R. (1975) “Partial Likelihood”, Biometrika, 62(2), 2699276. Cox, D.R. and D.V. Hinkley (1974) Theoretical Statistics. Chapman and Hall: London. Dagsvik, J. (1983) “Discrete Dynamic Choice: An Extension of the Choice Models of Lute and Thurstone”, Journal of Mathematical Psychology, 27, l-43. Dagsvik, J. (1991) A Note on the Theoretical Content of the GEU Model for Discrete Choice, manuscript, Central Bureau of Statistics, Oslo. Daly, A.J. and S. Zachary, (1978) “Improved Multiple Choice Models”, in: D.A. Hensher and Q, Dalvi, eds., Determinants of Travel Choice, 3355357, Teakfield, Hampshire. Das, M. (1992) “A Micro Econometric Model of Capital Utilization and Retirement: The Case of the Cement Industry”, Review of Economic Studies, 59(2), 287-298. Daula, T. and R. Moffitt (1991) “Estimating a Dynamic Programming Model of Army Reenlistment Behavior”, Military Compensation and Personnel Retention. Debreu, G. (1960) “Review of R.D. Lute Individual Choice Behavior”, American Economic Review, 50, 1866188. Denardo, E.V. (1967) “Contraction Mappings Underlying the Theory of Dynamic Programming”, SIAM Review, 9, 1655177. Dongarra, J.J. and T. 
Hewitt (1986) “Implementing Dense Linear Algebra Algorithms Using Multitasking on the Cray X-MP-4”, SIAM Journal on Scientific and Statistical Computing, 7(l), 3477350. Eckstein, Z. and K. Wolpin (1989a) “The Specification and Estimation of Dynamic Stochastic Discrete Choice Models”, Journal of Human Resources, 24(4),‘562-598. Eckstein, Z. and K. Wolpin (1989b) “Dynamic Labour Force Participation of Married Women and Endogenous Work Experience”, Review of Economic Studies, 56, 375-390. Eckstein, Z. and K. Wolpin (1990) “Estimating a Market Equilibrium Search Model from Panel Data on Individuals”, Econometrica, 58(4), 783-808. Engle, R. (1984) “Wald, Likelihood Ratio and Lagrange Multiplier Tests in Econometrics”, in: Z. Griliches and M.D. Intrilligator, eds., Handbook of Econometrics Vol 2. North-Holland: Amsterdam.
Ch. 51: Structural
Estimation
of Markov Decision Processes
3141
Epstein, L.G. and S.E. Zin (1989) "Substitution, Risk Aversion, and the Temporal Behavior of Consumption and Asset Returns: A Theoretical Framework", Econometrica, 57(4), 937-970.
Flinn, C. and J.J. Heckman (1982) "New Methods for Analyzing Labor Force Dynamics", Journal of Econometrics, 18, 115-168.
Gihman, I.I. and A.V. Skorohod (1979) Controlled Stochastic Processes. Springer-Verlag: New York.
Gotz, G.A. and J.J. McCall (1980) "Estimation in Sequential Decisionmaking Models: A Methodological Note", Economics Letters, 6, 131-136.
Gotz, G.A. and J.J. McCall (1984) A Dynamic Retention Model for Air Force Officers, Report R-3028-AF, The RAND Corporation, Santa Monica, California.
Haavelmo, T. (1944) "The Probability Approach in Econometrics", Econometrica Supplement, 12, 1-115.
Hakansson, N. (1970) "Optimal Investment and Consumption Strategies Under Risk for a Class of Utility Functions", Econometrica, 38, 587-607.
Hansen, L.P. (1982) "Large Sample Properties of Generalized Method of Moments Estimators", Econometrica, 50, 1029-1054.
Hansen, L.P. (1994) in: R. Engle and D. McFadden, eds., Handbook of Econometrics, Vol. 4. North-Holland: Amsterdam.
Hansen, L.P. and T.J. Sargent (1980a) "Formulating and Estimating Dynamic Linear Rational Expectations Models", Journal of Economic Dynamics and Control, 2(1), 7-46.
Hansen, L.P. and T.J. Sargent (1980b) "Linear Rational Expectations Models for Dynamically Interrelated Variables", in: R.E. Lucas, Jr. and T.J. Sargent, eds., Rational Expectations and Econometric Practice. University of Minnesota Press: Minneapolis.
Hansen, L.P. and T.J. Sargent (1993) Recursive Models of Dynamic Economies, manuscript, Hoover Institution.
Hansen, L.P. and K. Singleton (1982) "Generalized Instrumental Variables Estimation of Nonlinear Rational Expectations Models", Econometrica, 50, 1269-1281.
Hansen, L.P. and K. Singleton (1983) "Stochastic Consumption, Risk Aversion, and the Temporal Behavior of Asset Returns", Journal of Political Economy, 91(2), 249-265.
Heckman, J.J. (1981a) "Statistical Models of Discrete Panel Data", in: C.F. Manski and D. McFadden, eds., Structural Analysis of Discrete Data. MIT Press: Cambridge, Massachusetts.
Heckman, J.J. (1981b) "The Incidental Parameters Problem and the Problem of Initial Conditions in Estimating Discrete-Time, Discrete-Data Stochastic Processes", in: C.F. Manski and D. McFadden, eds., Structural Analysis of Discrete Data. MIT Press: Cambridge, Massachusetts.
Heckman, J.J. (1991) Randomization and Social Policy Evaluation, NBER Technical Working Paper 107.
Heckman, J.J. (1994) "Alternative Approaches to the Evaluation of Social Programs: Econometric and Experimental Methods", in: J. Laffont and C. Sims, eds., Advances in Econometrics: Sixth World Congress, Econometric Society Monographs. Cambridge University Press.
Hotz, V.J. and R.A. Miller (1993) "Conditional Choice Probabilities and the Estimation of Dynamic Models", Review of Economic Studies, forthcoming.
Hotz, V.J., R.A. Miller, S. Sanders and J. Smith (1993) "A Simulation Estimator for Dynamic Models of Discrete Choice", Review of Economic Studies, 60, 397-429.
Howard, R. (1960) Dynamic Programming and Markov Processes. J. Wiley: New York.
Howard, R. (1971) Dynamic Probabilistic Systems: Volume 2 - Semi-Markov and Decision Processes. J. Wiley: New York.
Hurwicz, L. and H. Uzawa (1971) "On the Integrability of Demand Functions", in: J. Chipman et al., eds., Preferences, Utility, and Demand. Harcourt, Brace and Jovanovich: New York.
Judd, K. (1994) Numerical Methods in Economics, manuscript, Hoover Institution.
Keane, M. and K. Wolpin (1994) The Solution and Estimation of Discrete Choice Dynamic Programming Models by Simulation: Monte Carlo Evidence, manuscript, University of Minnesota.
Kennet, M. (1994) "A Structural Model of Aircraft Engine Maintenance", Journal of Applied Econometrics, forthcoming.
Kreps, D. and E. Porteus (1978) "Temporal Resolution of Uncertainty and Dynamic Choice Theory", Econometrica, 46, 185-200.
Kushner, H.J. (1990) "Numerical Methods for Stochastic Control Problems in Continuous Time", SIAM Journal on Control and Optimization, 28(5), 999-1048.
Kydland, F. and E.C. Prescott (1982) "Time to Build and Aggregate Fluctuations", Econometrica, 50, 1345-1370.
Lancaster, T. (1990) The Econometric Analysis of Transition Data. Cambridge University Press.
Leland, H. (1974) "Optimal Growth in a Stochastic Environment", Review of Economic Studies, 41, 75-86.
Levhari, D. and T. Srinivasan (1969) "Optimal Savings Under Uncertainty", Review of Economic Studies, 36, 153-163.
Long, J.B. and C. Plosser (1983) "Real Business Cycles", Journal of Political Economy, 91(1), 39-69.
Lucas, R.E. Jr. (1976) "Econometric Policy Evaluation: A Critique", in: K. Brunner and A.H. Meltzer, eds., The Phillips Curve and Labour Markets, Carnegie-Rochester Conference on Public Policy. North-Holland: Amsterdam.
Lucas, R.E. Jr. (1978) "Asset Prices in an Exchange Economy", Econometrica, 46, 1426-1446.
Lucas, R.E. Jr. and E.C. Prescott (1971) "Investment Under Uncertainty", Econometrica, 39(5), 659-681.
Lumsdaine, R., J. Stock and D. Wise (1992) "Three Models of Retirement: Computational Complexity vs. Predictive Validity", in: D. Wise, ed., Topics in the Economics of Aging. University of Chicago Press: Chicago.
Machina, M.J. (1982) "Expected Utility without the Independence Axiom", Econometrica, 50(2), 277-324.
Machina, M.J. (1987) "Choice Under Uncertainty: Problems Solved and Unsolved", Journal of Economic Perspectives, 1(1), 121-154.
Mantel, R. (1974) "On the Characterization of Excess Demand", Journal of Economic Theory, 7, 348-353.
Marcet, A. (1994) "Simulation Analysis of Stochastic Dynamic Models: Applications to Theory and Estimation", in: C. Sims and J. Laffont, eds., Advances in Econometrics: Proceedings of the 1990 Meetings of the Econometric Society.
Marschak, J. (1953) "Economic Measurements for Policy and Prediction", in: W.C. Hood and T.J. Koopmans, eds., Studies in Econometric Method. Wiley: New York.
McFadden, D. (1973) "Conditional Logit Analysis of Qualitative Choice Behavior", in: P. Zarembka, ed., Frontiers in Econometrics. Academic Press: New York.
McFadden, D. (1981) "Econometric Models of Probabilistic Choice", in: C.F. Manski and D. McFadden, eds., Structural Analysis of Discrete Data. MIT Press: Cambridge, Massachusetts.
McFadden, D. (1984) "Econometric Analysis of Qualitative Response Models", in: Z. Griliches and M.D. Intriligator, eds., Handbook of Econometrics, Vol. 2. North-Holland: Amsterdam, 1395-1457.
McFadden, D. (1989) "A Method of Simulated Moments for Estimation of Discrete Response Models without Numerical Integration", Econometrica, 57(5), 995-1026.
Merton, R.C. (1969) "Lifetime Portfolio Selection Under Uncertainty: The Continuous-Time Case", Review of Economics and Statistics, 51, 247-257.
Miller, R. (1984) "Job Matching and Occupational Choice", Journal of Political Economy, 92(6), 1086-1120.
Pakes, A. (1986) "Patents as Options: Some Estimates of the Value of Holding European Patent Stocks", Econometrica, 54, 755-785.
Pakes, A. (1994) "Estimation of Dynamic Structural Models: Problems and Prospects Part II: Mixed Continuous-Discrete Models and Market Interactions", in: C. Sims and J.J. Laffont, eds., Proceedings of the 6th World Congress of the Econometric Society, Barcelona, Spain. Cambridge University Press.
Pakes, A. and D. Pollard (1989) "Simulation and the Asymptotics of Optimization Estimators", Econometrica, 57(5), 1027-1057.
Phelps, E. (1962) "Accumulation of Risky Capital", Econometrica, 30, 729-743.
Pollard, D. (1979) "General Chi-Square Goodness of Fit Tests with Data-Dependent Cells", Z. Wahrscheinlichkeitstheorie verw. Gebiete, 50, 317-331.
Pollard, D. (1984) Convergence of Stochastic Processes. Springer-Verlag.
Puterman, M.L. (1990) "Markov Decision Processes", in: D.P. Heyman and M.J. Sobel, eds., Handbooks in Operations Research and Management Science, Volume 2. North-Holland/Elsevier: Amsterdam.
Rust, J. (1987) "Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher", Econometrica, 55(5), 999-1033.
Rust, J. (1988a) "Statistical Models of Discrete Choice Processes", Transportation Research, 22B(2), 125-158.
Rust, J. (1988b) "Maximum Likelihood Estimation of Discrete Control Processes", SIAM Journal on Control and Optimization, 26(5), 1006-1023.
Rust, J. (1989) "A Dynamic Programming Model of Retirement Behavior", in: D. Wise, ed., The Economics of Aging. University of Chicago Press: Chicago, 359-398.
Rust, J. (1992) Do People Behave According to Bellman's Principle of Optimality?, Hoover Institution Working Paper E-92-10.
Rust, J. (1993) How Social Security and Medicare Affect Retirement Behavior in a World of Incomplete Markets, manuscript, University of Wisconsin.
Rust, J. (1994) "Estimation of Dynamic Structural Models: Problems and Prospects Part I: Discrete Decision Processes", in: C. Sims and J.J. Laffont, eds., Proceedings of the 6th World Congress of the Econometric Society, Barcelona, Spain. Cambridge University Press.
Rust, J. (1995a) "Numerical Dynamic Programming in Economics", in: H. Amman, D. Kendrick and J. Rust, eds., Handbook of Computational Economics. North-Holland, forthcoming.
Rust, J. (1995b) Using Randomization to Break the Curse of Dimensionality, manuscript, University of Wisconsin.
Rust, J. (1996) Stochastic Decision Processes: Theory, Computation, and Estimation, manuscript, University of Wisconsin.
Samuelson, P.A. (1969) "Lifetime Portfolio Selection by Dynamic Stochastic Programming", Review of Economics and Statistics, 51, 239-246.
Sargent, T.J. (1978) "Estimation of Dynamic Labor Demand Schedules Under Rational Expectations", Journal of Political Economy, 86(6), 1009-1044.
Sargent, T.J. (1981) "Interpreting Economic Time Series", Journal of Political Economy, 89(2), 213-248.
Smith, A.A. Jr. (1991) Solving Stochastic Dynamic Programming Problems Using Rules of Thumb, Discussion Paper 818, Institute for Economic Research, Queen's University, Ontario, Canada.
Sonnenschein, H. (1973) "Do Walras' Law and Continuity Characterize the Class of Community Excess Demand Functions?", Journal of Economic Theory, 6, 345-354.
Stern, S. (1992) "A Method for Smoothing Simulated Moments of Discrete Probabilities in Multinomial Probit Models", Econometrica, 60(4), 943-952.
Stern, S. (1995) "Approximate Solutions to Stochastic Dynamic Programming Problems", Econometric Theory, forthcoming.
Stock, J. and D. Wise (1990) "Pensions, the Option Value of Work, and Retirement", Econometrica, 58(5), 1151-1180.
Stokey, N.L., R.E. Lucas, Jr. and E.C. Prescott (1989) Recursive Methods in Economic Dynamics. Harvard University Press: Cambridge, Massachusetts.
Theil, H. (1971) Principles of Econometrics. Wiley: New York.
van Dijk, N.M. (1984) Controlled Markov Processes: Time Discretization, CWI Tract 11. Center for Mathematics and Computer Science: Amsterdam.
Varian, H. (1982) "The Nonparametric Approach to Demand Analysis", Econometrica, 52(3), 945-972.
White, H. (1982) "Maximum Likelihood Estimation of Misspecified Models", Econometrica, 50, 1-26.
Williams, H.C. (1977) "On the Formation of Travel Demand Models and Economic Evaluation of User Benefit", Environment and Planning A, 9, 285-344.
Wolpin, K. (1984) "An Estimable Dynamic Stochastic Model of Fertility and Child Mortality", Journal of Political Economy, 92(5), 852-874.
Wolpin, K. (1987) "Estimating a Structural Search Model: The Transition from Schooling to Work", Econometrica, 55, 801-818.