Journal of Econometrics 164 (2011) 207–217
A family of empirical likelihood functions and estimators for the binary response model

Ron C. Mittelhammer a,∗, George Judge b

a Economic Sciences and Statistics, School of Economic Sciences, Room 101C Hulbert Hall, Washington State University, Pullman, WA 99164-6210, United States
b Graduate School, 207 Giannini Hall, University of California, Berkeley, Berkeley, CA 94720, United States

Article history: Received 26 May 2009; received in revised form 6 February 2011; accepted 25 April 2011; available online 6 May 2011.
Abstract

The minimum discrimination information principle is used to identify an appropriate parametric family of probability distributions and the corresponding maximum likelihood estimators for binary response models. Estimators in the family subsume the conventional logit model and form the basis for a set of parametric estimation alternatives with the usual asymptotic properties. Sampling experiments are used to assess finite sample performance. © 2011 Elsevier B.V. All rights reserved.
JEL classification: C13; C14; C25; C51

Keywords: Binary response models and estimators; Conditional moment equations; Cressie–Read family of likelihood functions; Information theoretic methods; Minimum power divergence; Minimum discrimination information
1. Introduction

The applied econometrics literature is replete with analyses of binary response data, where the objective is to predict response probabilities that are unobserved and unobservable. Traditionally, the estimation and inference methods used in empirical analyses of binary response have been based on indirect noisy observations, converting a fundamentally ill-posed inverse estimation problem into a well-posed problem by imposing restrictive distributional functional forms that can be analyzed via conventional parametric ML methods. By far the dominant distributional choices in empirical work involving binary response models (BRMs) have been either the probit or logit cumulative distribution function (CDF). These choices are often made on the basis of convenience or precedent, with little or no a priori justification for their use. In attempts to mitigate model misspecification issues, a variety of semiparametric models and estimators for the BRM have
∗ Corresponding author. Tel.: +1 509 335 1706; fax: +1 509 335 1173. E-mail addresses: [email protected] (R.C. Mittelhammer), [email protected] (G. Judge). doi:10.1016/j.jeconom.2011.04.002
arisen in the literature (e.g., Ichimura (1993), Klein and Spady (1993), Gabler et al. (1993), Gozalo and Linton (1994), Manski (1975), Horowitz (1992), Han (1987), Cosslett (1983), Wang and Zhou (1993)). These semiparametric methods tend to be of either the ''regression-estimating equation'' or ''maximum likelihood'' variety, each utilizing some form of nonparametric or semiparametric estimate of the probability that the binary response variable takes the value yi = 1, conditioned on the outcome of explanatory or response variables. On the basis of asymptotic performance comparisons among those estimators for which asymptotics are tractable and well developed, it is found that many of the estimators do not achieve $\sqrt{n}$ consistency. For those that do, the estimator of Klein and Spady appears to be the best in the sense of achieving the semiparametric efficiency bound. Regarding semi-nonparametric (SNP) methods, Gallant and Nychka (1987) approximate the unknown log likelihood by a polynomial of degree k in the underlying density function argument, weighted by a standard normal probability density. Takada (2001) suggests that semiparametric methods based on conventional kernel density estimators can have good finite sample performance, especially with heavy-tailed densities where the SNP method exhibits ''remarkably slow'' convergence relative to conventional kernel methods.
The semiparametric, semi-nonparametric and nonparametric alternatives have been attempts to insulate models from misspecification and provide a broader model context in which at least consistency and semiparametric efficiency might be achievable. However, as Breisch et al. (2002) note, the additional flexibility of these methods comes at a number of inferential costs.¹ These costs include trading off bias for increased variance, increasing computational complexity/effort, and requiring additional model restrictions to guarantee identification of model unknowns. They suggest that semiparametric or nonparametric methods ''are most appropriate when one wants to (a) perform exploratory research, (b) avoid, at any cost, biases due to misspecification, or (c) check whether a specified parametric structure is misspecified.''

Building on these productive efforts and insights, we use information theoretic methods to identify a family of CDFs and associated likelihood functions for the binary response model. The members of the family represent the complete set of CDFs and associated likelihood functions for the BRM that satisfy standard orthogonality conditions between covariates and noise and that are minimally power divergent from a conceivable CDF for the Bernoulli response probabilities. The approach provides a solution to the problem of which family of probability distributions to use in modeling the BRM from among the choices that are currently available in the literature. The statistical models and associated estimators represent a large set of alternatives to traditional logit and probit methods as well as to conventional semiparametric methods of analysis. Computationally, solutions for the estimators are relatively easy to obtain, even for large sample sizes. The family of probability distributions is interesting in its own right and has potential for use in statistical models beyond BRM applications.

1.1. Organization of the paper

In Section 2, the basic binary statistical model is presented and used to define a very general conditional moment representation of sample information. In Section 3, an information theoretic criterion is applied to identify the underdetermined conditional response probability distributions and specify a statistical model for estimating those probabilities. In the process we identify a family of CDFs whose members are consistent with binary data and model information. In Section 4 a maximum likelihood method for estimating the unknowns in the BRM model specification is presented. Monte Carlo sampling results are used to illustrate, under squared error loss, the finite sampling performance of the estimators in Section 5. Finally, promising new directions for research and applications are delineated in the concluding section of the paper. Some general properties of the family of CDFs, as well as computational considerations relating to the CDFs, are presented in the Appendix.

2. The basic statistical model

Assume that, on trial i = 1, 2, . . . , n, one of two alternatives is observed to occur for each of the independent nondegenerate binary random variables Yi, i = 1, 2, . . . , n, with pi, i = 1, . . . , n representing the probabilities of observing successes, yi = 1, on
1 The general objective of this paper is to offer a middle ground between rigid parametric model specifications of BRMs on the one hand, and distribution-free semiparametric and nonparametric methods on the other, while providing as much of the benefits and as little of the costs of each method as possible. In our approach, the analyst is not required to make an arbitrary distributional assumption nor introduce other statistical assumptions on the data sampling process in order to achieve identifiability, consistency, and/or efficiency in some restricted sense.
each respective trial. The value of pi is a conditional probability given by the model²

$$p_i = P(y_i = 1 \mid \mathbf{x}_i) = F(\mathbf{x}_i, \boldsymbol{\beta}), \qquad (2.1)$$

where F(·) is some CDF, xi for i = 1, . . . , n are independent outcomes of a (1 × k) random vector of covariates that affect the conditional probability that yi = 1, and β is a vector of parameters. If a specific parametric family of probability density functions underlying the binary response model is specified (e.g., logit or probit), one can define the specific functional form of the log likelihood and utilize a traditional maximum likelihood (ML) approach as a basis for estimation and inference relative to the unknown β and the response probabilities F(xi, β). If the particular choice of the parametric functional form for the distribution is also correct, then the usual ML properties of consistency, asymptotic normality, and efficiency hold (McFadden, 1974, 1984; Train, 2003; for seminal work with Bayesian response models see Zellner and Rossi (1984); see Greene (2007) for a review of conventional binary response models).

In contrast to historical empirical practice, we do not assume that the family of CDFs in (2.1) is known or that it can be specified based on precedent or convenience. Moreover, we seek to eliminate the use of model specification information that the econometrician generally does not possess. We begin by characterizing the n × 1 vector of nondegenerate Bernoulli random variables, Y, by the universally applicable stochastic representation

$$\mathbf{Y} = \mathbf{p} + \boldsymbol{\varepsilon}, \quad \text{where } E(\boldsymbol{\varepsilon}) = \mathbf{0} \text{ and } \mathbf{p} \in \times_{i=1}^{n}(0, 1). \qquad (2.2)$$
The specification in (2.2) implies only that the expectation of the random vector Y is some mean vector of Bernoulli probabilities p, and that the outcomes of Y can be decomposed into their means and noise terms, which of course is always true. The Bernoulli probabilities in (2.2) are specified to depend on the values of the covariates contained in the (n × k) matrix X through some generally applicable conditional expectation or regression type relationship

$$E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{p}(\mathbf{X}) = [p_1(\mathbf{X}_1) \mid p_2(\mathbf{X}_2) \mid \cdots \mid p_n(\mathbf{X}_n)]', \qquad (2.3)$$

where Xi is the ith row of X. We emphasize that p(X) merely denotes that the Bernoulli probabilities depend in some general way on X, and we make no assumption about the functional specification of that relationship. It follows that the conditional orthogonality condition, E[g(X)′(Y − p(X)) | X] = 0, is true for any real-valued measurable and conformable vector or matrix function g(·).³ An application of the iterated expectation theorem leads to the unconditional orthogonality condition

$$E[\mathbf{g}(\mathbf{X})'(\mathbf{Y} - \mathbf{p}(\mathbf{X}))] = \mathbf{0}. \qquad (2.4)$$
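As a quick numerical illustration of (2.4), the following sketch (our own, not part of the paper; the logistic form for p(X) is an arbitrary simulation choice, since p(X) is unknown in practice) confirms that the sample analogue of the orthogonality condition is near zero with g(X) = X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # g(X) = X here
p = 1.0 / (1.0 + np.exp(-(1.0 + 2.0 * X[:, 1])))        # some p(X); unknown in practice
Y = rng.binomial(1, p)

# sample analogue of E[g(X)'(Y - p(X))] = 0, eq. (2.4)
print(X.T @ (Y - p) / n)   # both components are close to zero
```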
The information employed represents a minimal set of statistical model assumptions for representing the unknown Bernoulli probabilities in the BRM.

2 We adopt the convention that capital letters denote random variables and lower case letters denote observed values or outcomes. An exception will be the use of ε to denote a random variable, and e to denote an outcome of the random variable, and the use of F and f to denote a CDF and PDF respectively. Bold face type denotes a vector or matrix, whereas non-bold denotes scalars.

3 Note this representation can be made even more general by allowing for the g function to depend on unknown parameters τ, as g(X; τ), where E[g(X; τ)′(Y − p(X)) | X] = 0 is also true.

3. Minimum power divergence family of CDFs for the binary response model

Given sampled binary outcomes from (2.2), we use k < n empirical moment representations, n⁻¹g(x)′(y − p) = 0, of the
orthogonality conditions (2.4) to link available data information to the unknown probabilities. At this stage, the available information allows an infinite set of probabilities to satisfy both (2.2) and the empirical moment representation of (2.4). In the absence of any additional information relating to the statistical model, a criterion and some estimation framework is needed to address the undetermined nature of (n − k) of the unknown elements in p.

3.1. Minimum discrimination information principle

As a basis for identifying the Bernoulli probabilities p, we utilize Kullback's (1959) information theoretic principle of Minimum Discrimination Information (MDI). Under the MDI principle, beginning with a distribution, f0, of some random process, if new facts become available relating to the random process, the updated distribution, f1, should be chosen so that it is as close to the original distribution, f0, as the new facts will permit. To allow f1 to deviate any further is tantamount to introducing more information than the analyst actually possesses. This leads to the question of how to measure information gain.

Divergence measures are commonly used to find the appropriate distance or difference between two probability distributions (Kumar and Taneja, 2006). These measures can be categorized as parametric, nonparametric and entropic measures of information. Parametric measures of information represent the amount of information about an unknown parameter vector θ provided by the data. Fisher's measure of information is a well-known measure of this type. Nonparametric measures represent the amount of information provided by the data in favor of one probability distribution, f0, against another, f1, or represent the distance between f0 and f1. The Kullback and Leibler (1951) measure is arguably the most well recognized member of this class. Entropy, which represents the amount of information contained in a distribution, or the amount of uncertainty associated with the outcome of an experiment, was first introduced by Shannon (1948).

3.2. Measuring information

In order to measure information gain in a general way when moving from an initial distribution, [qi1, qi2, . . . , qiM], to an updated distribution, [pi1, pi2, . . . , piM], for outcome i of a discrete distribution having M possible outcomes, we utilize the generalized Cressie–Read (CR) power divergence measure (Cressie and Read, 1984; Read and Cressie, 1988; Mittelhammer et al., 2000; Judge and Mittelhammer, 2011), represented by
$$\frac{1}{\gamma(\gamma+1)}\sum_{j=1}^{M}p_{ij}\left[\left(\frac{p_{ij}}{q_{ij}}\right)^{\gamma} - 1\right]. \qquad (3.1)$$
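To fix ideas, a minimal numerical sketch of (3.1) for discrete distributions (our own illustration; the function name and NumPy dependency are not from the paper). The γ → 0 and γ → −1 limits recover the two Kullback–Leibler directions:

```python
import numpy as np

def cr_divergence(p, q, gamma):
    """Cressie-Read power divergence (3.1) between discrete distributions p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if abs(gamma) < 1e-12:            # gamma -> 0 limit: Kullback-Leibler I(p, q)
        return np.sum(p * np.log(p / q))
    if abs(gamma + 1.0) < 1e-12:      # gamma -> -1 limit: Kullback-Leibler I(q, p)
        return np.sum(q * np.log(q / p))
    return np.sum(p * ((p / q) ** gamma - 1.0)) / (gamma * (gamma + 1.0))

# Bernoulli example: divergence of {p, 1-p} from the reference {q, 1-q}
p, q = 0.7, 0.5
for gamma in (-1.0, -0.5, 0.0, 1.0):
    print(gamma, cr_divergence([p, 1 - p], [q, 1 - q], gamma))
```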
The CR statistic subsumes a number of well-known divergence measures, including Shannon's entropy, the Kullback–Leibler measure, Freeman–Tukey divergence, Hellinger divergence, Neyman Modified Chi-square divergence, and Pearson's Chi-square divergence. In the binary case with M = 2 for each observation, i = 1, . . . , n, for any given member of the CR family, γ, and for initial reference probabilities, qi ∈ (0, 1), i = 1, . . . , n, we consider the extremum problem where the objective is to obtain the minimum power divergence (MPD) solution of
$$\min_{p_i,\, i=1,\ldots,n}\left\{\sum_{i=1}^{n}\frac{p_i\left[\left(\frac{p_i}{q_i}\right)^{\gamma} - 1\right] + (1-p_i)\left[\left(\frac{1-p_i}{1-q_i}\right)^{\gamma} - 1\right]}{\gamma(\gamma+1)}\right\} \qquad (3.2)$$

$$\text{s.t. } n^{-1}\left(\mathbf{g}(\mathbf{x})'(\mathbf{y} - \mathbf{p})\right) = \mathbf{0} \quad \text{and} \quad p_i \in (0, 1),\ \forall i, \qquad (3.3)$$
where we have suppressed the double subscripts in applying (3.1) by defining pi = pi1 = P(yi = 1 | xi) and 1 − pi = pi2, and likewise for qi. The summand in the objective function (3.2) is the power divergence of the Bernoulli probabilities, {pi, 1 − pi}, from the reference Bernoulli distribution, {qi, 1 − qi}. Note the unconstrained minimum value of (3.2) is zero, which occurs when pi = qi, ∀i, and indicates that p = [p1, . . . , pn]′ contains no additional information relative to q = [q1, . . . , qn]′.

The overall implication of the extremum formulation (3.2)–(3.3) is that the value of p is chosen from among an infinite number of possible solutions, so as to be both consistent with the sample moment equations, n⁻¹g(x)′(y − p) = 0, and minimally divergent from the initial distribution q. If q happens to satisfy the moment conditions, n⁻¹g(x)′(y − q) = 0, then the MPD solution is p = q. Otherwise, the solution is a shrinkage-type rule, where the solution for p is as minimally divergent from q as the moment constraints will allow. Pardo (2006) provides a discussion of the use of minimum divergence measures for estimation in a range of statistical model contexts.

This approach implies departure from the need to assume a particular fully specified parametric distributional structure underlying the Bernoulli probabilities and thereby reduces the possibility of statistical model misspecification. In the general BRM specification of Section 2, what is known is the universally applicable specification (2.2) together with any orthogonality conditions of the form (2.4). The values and functional form of the conditional expectation, E(Y | X) = p(X), are not assumed known, and having no basis for discriminating pi(Xi) from pj(Xj), i ≠ j, it could be argued that one begins from the standpoint of being able to specify no more than pi(Xi) = q, ∀i. One could argue further that not being able to discriminate between any of the possible values of q ∈ (0, 1) should then lead to the ''ignorance'' value of q = .5. However, foreshadowing the issue of the determination of the q vector, an alternative choice that could be defended regarding each of the n elements of q is the unconditional value of E(Y), arguing that since the effect of X on Y is unknown, one cannot distinguish the conditional from the unconditional expectation of Y, a priori. Thus, one might then specify the reference distribution q = 1ₙE(Y), and an empirical estimate of the mean of Y is straightforward to obtain.

3.3. The MPD family of CDFs underlying the BRM

The pi's can be expressed as functions of the explanatory variables and Lagrange multipliers, λ, of the sample moment conditions by solving the first-order conditions (FOCs) of the Lagrange form of (3.2)–(3.3) with respect to p. In the event that inequality constraints are binding, the FOCs can be adjusted by the complementary Kuhn and Tucker (1951) slackness conditions. The first-order conditions with respect to pi imply
$$\frac{\partial L}{\partial p_i} = 0 \;\Rightarrow\; \begin{cases} \left(\dfrac{p_i}{q_i}\right)^{\gamma} - \left(\dfrac{1-p_i}{1-q_i}\right)^{\gamma} - \mathbf{g}(\mathbf{x}_i)'\boldsymbol{\lambda}\gamma = 0 & \text{for } \gamma \neq 0, \\[2ex] \ln\left(\dfrac{p_i}{q_i}\right) - \ln\left(\dfrac{1-p_i}{1-q_i}\right) - \mathbf{g}(\mathbf{x}_i)'\boldsymbol{\lambda} = 0 & \text{for } \gamma = 0. \end{cases} \qquad (3.4)$$
When γ ≤ 0, the solutions are strictly interior to the inequality constraints because the ranges of the strictly monotonic and continuous functions $\ln\left(\frac{p_i}{q_i}\right) - \ln\left(\frac{1-p_i}{1-q_i}\right)$ and $\left(\frac{p_i}{q_i}\right)^{\gamma} - \left(\frac{1-p_i}{1-q_i}\right)^{\gamma}$ encompass the entire real line for pi ∈ [0, 1]. Accounting for the inequality constraints in (3.4), when γ > 0, the first-order condition in (3.4) together with the complementary slackness
conditions permits pi to be expressed as the following functions of the argument wi = g(xi)′λ:

$$p(w_i; q_i, \gamma) = \begin{cases} \arg_{p_i}\left[\left(\frac{p_i}{q_i}\right)^{\gamma} - \left(\frac{1-p_i}{1-q_i}\right)^{\gamma} = w_i\gamma\right] & \text{for } \gamma < 0, \\[1ex] \arg_{p_i}\left[\ln\left(\frac{p_i}{q_i}\right) - \ln\left(\frac{1-p_i}{1-q_i}\right) = w_i\right] & \text{for } \gamma = 0, \\[1ex] 1 & \text{for } \gamma > 0,\ w_i \geq \gamma^{-1}q_i^{-\gamma}, \\ \arg_{p_i}\left[\left(\frac{p_i}{q_i}\right)^{\gamma} - \left(\frac{1-p_i}{1-q_i}\right)^{\gamma} = w_i\gamma\right] & \text{for } \gamma > 0,\ w_i \in \left(-\gamma^{-1}(1-q_i)^{-\gamma},\ \gamma^{-1}q_i^{-\gamma}\right), \\ 0 & \text{for } \gamma > 0,\ w_i \leq -\gamma^{-1}(1-q_i)^{-\gamma}. \end{cases} \qquad (3.5)$$

The family of CDFs and associated likelihood functions defined by (3.5) characterize the unique and complete set of distributions that are consistent with the binary response model and the empirical orthogonality conditions, and that are minimally divergent from the choice of any possible reference distribution, q, for any divergence measure γ ∈ R. While the CDFs are defined as only implicit functions of wi for almost all γ, numerical representations of the functional relationship between pi and wi can be determined straightforwardly. Computational issues relating to the use of this family of distributions are discussed in Appendix B. Explicit closed form solutions for the CDFs exist on a Lebesgue measure zero set of γ values that includes the set of all integers. Some examples are given in (3.6)–(3.8):

$$p(w_i; q_i, -1) = \begin{cases} .5 + \dfrac{[w_i^2 + (4q_i - 2)w_i + 1]^{.5} - 1}{2w_i} & \text{if } w_i \neq 0, \\ .5 & \text{if } w_i = 0, \end{cases} \qquad (3.6)$$

$$p(w_i; q_i, 0) = \frac{q_i\exp(w_i)}{(1-q_i) + q_i\exp(w_i)}, \qquad (3.7)$$

$$p(w_i; q_i, 1) = \begin{cases} 1 & \text{for } w_i \geq q_i^{-1}, \\ q_i + q_i(1-q_i)w_i & \text{for } w_i \in \left(-(1-q_i)^{-1},\ q_i^{-1}\right), \\ 0 & \text{for } w_i \leq -(1-q_i)^{-1}. \end{cases} \qquad (3.8)$$

When γ = 0 and qi = .5, the functional form for pi in (3.7) coincides with the standard logistic binary response model. When γ = 1, the CDF in (3.8) subsumes the linear probability model. Note that the appropriate dense set of linear index models for the binary response case is defined by (3.5) when g(xi) = x′i, so that wi = xiλ. Some additional important properties of the MPD family of CDFs are given in Appendix A.

4. ML estimation and inference for BRMs based on the MPD family

In this section we consider how one might select an optimal distribution from the MPD family together with estimating the unknowns in the BRM model consisting of the values of γ, q, and the parameter vector λ, on which the value, w = g(x)′λ, depends. Henceforth, we concentrate on the case where the reference distribution is specified as q = 1ₙq, for a scalar q ∈ (0, 1), along the lines of the rationale presented at the end of Section 3.2.⁴ One might consider specifying elements of the reference distribution a priori, such as the least favorable uniform prior, q = .5, or else, q = Ê(Y) = Ȳ, on the basis of being unable to distinguish conditional from unconditional probabilities that Y = 1. In this section we consider estimating the value of q jointly with all of the other unknowns in the BRM via maximum likelihood, but a constrained MLE, with q fixed at some value, could be pursued as well.

An empirical maximum likelihood estimator of a BRM based on the family of MPD distributions can be defined through optimizing the following log likelihood function for the observed sample observations:

$$\ell(\boldsymbol{\beta}, q, \gamma \mid \mathbf{y}, \mathbf{x}) = \sum_{i=1}^{n}\left[y_i\ln\left(p(\mathbf{g}(\mathbf{x}_i)'\boldsymbol{\beta}; q, \gamma)\right) + (1-y_i)\ln\left(1 - p(\mathbf{g}(\mathbf{x}_i)'\boldsymbol{\beta}; q, \gamma)\right)\right]. \qquad (4.1)$$

It is useful to consider maximizing the likelihood in two steps, first defining the profile likelihood function for β as

$$p\ell(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{x}) \equiv \sup_{q,\gamma}\{\ell(\boldsymbol{\beta}, q, \gamma \mid \mathbf{y}, \mathbf{x})\}, \qquad (4.2)$$

and then deriving the maximum likelihood estimator of β by maximizing the profile likelihood as

$$\hat{\boldsymbol{\beta}}_{ml} = \arg\sup_{\boldsymbol{\beta}}\{p\ell(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{x})\} \equiv \arg\sup_{\boldsymbol{\beta}}\left\{\sup_{\gamma,q}\{\ell(\boldsymbol{\beta}, q, \gamma \mid \mathbf{y}, \mathbf{x})\}\right\}. \qquad (4.3)$$

One can interpret the likelihood profiling step (4.2) as determining the optimal MPD distribution associated with any choice of the β vector, and the second ML step (4.3) as determining the overall optimal estimate of the parameters of the index argument in the CDF that determines the binary response probabilities. The computational search for the maximum of the likelihood is facilitated by the concavity of the MPD CDFs for all distributions when γ ≤ 1 (see Appendix A.2).

Patefield (1977, 1985) demonstrates that in the case where the parameter space is finite dimensional Euclidean, and given some generally applicable regularity conditions, the profile likelihood, pℓ(β | y, x), can be utilized in the same way as an ordinary likelihood for purposes of defining an appropriate score function, ∂pℓ(β | y, x)/∂β, and an information matrix representation of the asymptotic covariance matrix with respect to the ML estimator, β̂ml, namely [−E ∂²pℓ(β | y, x)/∂β∂β′]⁻¹. We demonstrate in Appendix A.3 that Patefield's result, regarding likelihood function profiling, can be applied to the likelihoods defined in terms of MPD distributions. With regard to β, the score and information matrix definitions are equivalent to the appropriate submatrix of the information matrix calculated from the full likelihood, as ∂ℓ(θ | y, x)/∂β and [−E ∂²ℓ(θ | y, x)/∂θ∂θ′]⁻¹, respectively, where θ ≡ [β, q, γ]′. The asymptotic covariance matrix for β̂ml, together with the asymptotic normality of the ML estimator, n^{1/2}(β̂ml − β) →ᴸ N(0, [−E ∂²pℓ(β | y, x)/∂β∂β′]⁻¹), can be used for inference purposes in the usual way, enabling asymptotic chi-square distributions for the Wald, Lagrange Multiplier, or Likelihood Ratio test statistics.

4 One implication of this choice of reference distribution is that the same functional form of MPD distribution will ultimately be used in representing each of the conditional Bernoulli probabilities, analogous to common empirical practice where the same normal or logistic functional form is used. This is apparent from (3.5), and exemplified in the special cases (3.6)–(3.8).
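To illustrate the two-step scheme (4.2) and (4.3), the sketch below (our own construction, not the authors' GAUSS code) profiles the likelihood over a grid of q values with γ fixed at 0, so that the closed-form member (3.7) serves as the CDF; a full implementation would also search over γ, using the numerical CDF solver discussed in Appendix B:

```python
import numpy as np
from scipy.optimize import minimize

def p_mpd0(w, q):
    """Closed-form MPD CDF for gamma = 0, eq. (3.7), in a numerically stable form."""
    return 1.0 / (1.0 + ((1.0 - q) / q) * np.exp(-w))

def negloglik(beta, q, y, X):
    """Negative log likelihood (4.1), with g(x_i) = x_i and gamma = 0."""
    p = np.clip(p_mpd0(X @ beta, q), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def ml_mpd0(y, X, q_grid=np.linspace(0.1, 0.9, 17)):
    """Joint sup over (beta, q), organized around a q grid as in (4.2)-(4.3)."""
    best = None
    for q in q_grid:
        res = minimize(negloglik, np.zeros(X.shape[1]), args=(q, y, X))
        if best is None or res.fun < best[0]:
            best = (res.fun, q, res.x)
    return best  # (minimized negative log likelihood, q_hat, beta_hat)

# example: simulated logistic data with q = .5
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, p_mpd0(X @ np.array([1.0, 2.0]), 0.5))
print(ml_mpd0(y, X))
```

Nesting the optimizations in either order yields the same joint maximizer, which is why the profiling in (4.2) can be organized around whichever dimension is cheapest to grid.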
This is analogous to traditional probit or logit contexts, but rooted in a substantially more varied sampling distribution model. In empirical applications, an estimate of the asymptotic covariance of β̂ is defined by $-\left[\frac{\partial^2 p\ell(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{x})}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}'}\right]^{-1}_{\boldsymbol{\beta}=\hat{\boldsymbol{\beta}}_{ml}}$. We motivate the consistency and asymptotic normality of the MPD–ML estimator, based on some primitive and rather straightforward regularity conditions, in Appendix A.3.⁵

5 As in all applications of maximum likelihood, issues of regularity conditions arise for the asymptotics to apply. Under appropriate boundedness assumptions relating to X, and resultant boundedness properties and continuous differentiability of the MPD family of distributions, extremum estimator asymptotics apply along the lines of Hansen (1982), Newey (1991), and van der Vaart (1998). The most straightforward case is when global concavity holds, γ ≤ 1, in which case the uniform probability convergence results of Pollard (1991) apply. We provide a relatively straightforward demonstration of consistency and asymptotic normality under regularity conditions that encompass both negative and positive values of γ in Appendix A.3.

5. Finite sampling performance

To investigate the finite sample performance of the ML–MPD approach and to illustrate how the estimator can be applied, the results of Monte Carlo experiments are reported in this section. The sampling experiments are designed so that the population distribution underlying observed probability values, pi = P(yi = 1), achieves targeted mean and variability levels. They also map one-to-one with covariate values, which then define a covariate population data sampling process (DSP).

5.1. Sampling design and solution algorithms

Random samples of size n = 100, 250, and 500 of P(yi = 1) values were sampled from a Beta distribution, B(a, b), with b = 1 and a set correspondingly to achieve targeted unconditional mean values of P(yi = 1) equaling .5 or .75. Therefore in this design, a = bE(Y)/(1 − E(Y)), with E(Yi) = .5 or .75. The resulting two population distributions of P(yi = 1) values are B(1, 1) and B(3, 1), which are uniform and centered at .5, or right-skewed with a mean of .75, respectively (see Fig. 5.1). The two distributions represent a situation in which all values of P(y = 1) are either equally likely to be observed, or else are substantially more probable for P(y = 1) greater than .5 than less (note $\int_{.5}^{1}\text{Beta}(w; 3, 1)\,dw = .875$).

Fig. 5.1. Population distributions for P(Y = 1).

A linear index representation of the Bernoulli probabilities is defined as

$$P(y_i = 1) = F(\beta_0 + \beta_1 x_i) \quad \text{for } i = 1, \ldots, n, \qquad (5.1)$$

where F(·) is a cumulative distribution or link function underlying the Bernoulli probabilities, and the xi's are chosen so that xi = (F⁻¹(P(yi = 1)) − β0)/β1. The values of the linear index parameters are β0 = 1 and β1 = 2. Explicit functional forms for F(·) include two MPD distributions, MPD(q = .5, γ = −1) and MPD(q = .75, γ = 1.5), as well as a N(0, 1) distribution. The MPD(q = .5, γ = −1) distribution is a symmetric distribution around zero associated with the ''empirical likelihood'' choice, γ = −1. This distribution has the real line for its support, and has substantially fatter tails than the N(0, 1) distribution. The MPD(q = .75, γ = 1.5) distribution has finite support, is heavily skewed to the left, and has a density whose values increase at an increasing rate as the argument of the link function increases. The standard normal distribution is the link function that is optimal for the Probit model of binary response. These three different link functions are illustrated in Fig. 5.2.

Fig. 5.2. Link functions: N(0, 1), MPD(q = .5, gam = −1), MPD(q = .75, gam = 1.5).

The random outcomes of the binary responses are generated according to their respective Bernoulli probability distributions as

$$Y_i \sim \text{Bernoulli}(F(\beta_0 + \beta_1 x_i)), \quad i = 1, \ldots, n. \qquad (5.2)$$

This implies a regression relationship of the form

$$Y_i = F(\beta_0 + \beta_1 x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (5.3)$$

and serves as the definition of the residual outcome of εi. The expected mean squared error (MSE) of the probability predictions, $\int_{\mathcal{X}}(\hat{p}(x) - F(\beta_0 + \beta_1 x))^2\,dF(\beta_0 + \beta_1 x)$, is used as the measure of estimation performance, where p̂(x) denotes the prediction from an ML–MPD, Logit, or Probit estimator. Empirically, the measure is calculated as $n^{-1}\sum_{i=1}^{n}(\hat{p}(x_i) - F(\beta_0 + \beta_1 x_i))^2$.

All sampling results are based on 1000 repetitions, and calculations were performed using Aptech Systems' GAUSS 8.0 software. The Nelder–Mead derivative-free direct search algorithm (Nelder and Mead, 1965), as implemented by Jacoby et al. (1972), was programmed in GAUSS by the authors and used to solve for the ML–MPD estimator, as well as the Logit estimator and the semiparametric estimator of Klein and Spady presented ahead. The standard QNEWTON algorithm as implemented in GAUSS 8.0 software was used to solve for the Probit estimates. All attempted solutions converged and were stable, and at this number of repetitions, all empirically calculated expectations of the performance measures are accurate, with standard errors of the calculated measures being typically .0001 or less in magnitude.
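To make the design concrete, a compact sketch (ours) of the data generating steps (5.1) and (5.2); for transparency it uses the logistic member (3.7) with q = .5 as the link F, since its inverse is closed form, whereas the paper's experiments used the MPD(.5, −1), MPD(.75, 1.5), and N(0, 1) links:

```python
import numpy as np

rng = np.random.default_rng(1)
b0, b1, n = 1.0, 2.0, 250
mean_p = 0.75                          # targeted unconditional E(Y)
a = mean_p / (1.0 - mean_p)            # a = b*E(Y)/(1 - E(Y)) with b = 1, i.e. B(3, 1)

F = lambda w: 1.0 / (1.0 + np.exp(-w))       # stand-in link function
F_inv = lambda p: np.log(p / (1.0 - p))      # its closed-form inverse

p_true = rng.beta(a, 1.0, size=n)            # P(y_i = 1) drawn from Beta(a, 1)
x = (F_inv(p_true) - b0) / b1                # covariates implied by (5.1)
y = rng.binomial(1, p_true)                  # binary responses, eq. (5.2)

def mse(p_hat):
    """Empirical MSE measure of Section 5.1 against the true probabilities."""
    return np.mean((p_hat - F(b0 + b1 * x)) ** 2)
```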
5.2. Sampling results

The expected squared prediction error results for the ML–MPD, Probit and Logit estimators are displayed in Fig. 5.3 for the MPD(q = .5, γ = −1) DSP, and in Fig. 5.4 for the MPD(q = .75, γ = 1.5) DSP.⁶

6 Tables documenting the numerical results underlying the graphs are available from the authors upon request.

Fig. 5.3. MSE for P(y = 1) predictions, DSP = MPD(.5, −1).

Fig. 5.4. MSE for P(y = 1) predictions, DSP = MPD(.75, 1.5).

Fig. 5.5. MSE for P(y = 1) predictions, DSP = N(0, 1).

In implementing the ML–MPD estimator the
feasible set of distributions examined was defined by γ ∈ [−2, 2] and q ∈ [.1, .9]. The Probit estimator, which is a Quasi-ML in these two data sampling applications, is strongly dominated by the alternative estimators when the DSP is MPD(q = .5, γ = −1). The Logit estimator, another Quasi-ML estimator in these cases, outperforms the Probit estimator, but is dominated by the ML–MPD estimator. The ML–MPD estimator is MSE superior to the alternative estimators across all sample size scenarios and for both values of P(Y = 1). When the DSP is MPD(q = .75, γ = 1.5), the ML–MPD continues to be the clear estimator of choice. The Probit estimator performs somewhat better than the Logit estimator in this latter case. The consistency of the ML–MPD estimator is reflected in both of the preceding figures as n increases.

If one were omniscient and chose the (correct) parametric N(0, 1) probit model, the Probit estimator would, as indicated in Fig. 5.5, be the superior choice in MSE performance. As is well known, the Logit estimator is a good Quasi-ML choice in this case, and its MSE performance is close to the performance of the optimal Probit estimator. The superiority of the Probit estimator diminishes relative to the ML–MPD estimator as the sample size increases. If one chose to model the binary response probabilities by utilizing only symmetric distributions that are contained in the MPD family (i.e., restricting q = .5), the relative superiority of the correctly specified Probit estimator is diminished further. This is true for even the smallest sample size, as indicated by the ML–MPD(q = .5) results in Fig. 5.5.

Overall, the comparisons in Fig. 5.5 indicate both the precision gains that would occur by having correct prior information about the distributional functional form, and also the unavoidable cost when, in practice, one does not possess omniscience in the choice of the statistical model. The sampling results suggest that the MPD–ML, which is itself a Quasi-ML in this DSP application,
contributes, to a notable degree, to mitigating this cost of incorrect model choice.⁷

In general, the sampling results illustrate the estimation performance attainable from utilizing the large and varied family of MPD distributions for modeling binary response, when coupled with a ML approach for choosing optimally among the member distributions within the family. The method mitigates imprecision due to lack of knowledge of the true data sampling process underlying binary responses, and competes well with generally unknown and unattainable correct choices of the data sampling process.

5.3. Sampling comparisons to a prominent semiparametric alternative

Relative to other well-known semiparametric methods available for estimating the binary response model, the Klein–Spady (KS) estimator achieves the semiparametric asymptotic efficiency bound. To provide an indication of the relative performance of the ML–MPD and KS estimators, a simulated comparison follows. In terms of the notable collection of regularity conditions needed to establish the KS estimator's sampling properties, the identification conditions make the KS estimator inapplicable to a binary model having an intercept and one regressor variable, as in the simulations in Section 5.2. Consequently, for this comparison, we add a second iid N(0, 1) regressor variable with a coefficient of 3 to the MC sampling design, and otherwise simulate samples as before. This means we now utilize the relationship xi1 = (F⁻¹(P(yi = 1)) − (β0 + β2xi2))/β1, with xi2 ∼ iid N(0, 1). We examine two different DSPs, one being the MPD(q = .75, γ = 1.5) link function for F(·), the choice which was least favorable to the ML–MPD estimator of the two non-normal data sampling processes simulated above. We also simulate the case where the link function is N(0, 1), which was a sampling characteristic investigated by KS in their MC simulations.

In the definition of the KS estimator we use their recommended adaptive kernel density estimator. This estimator uses locally defined bandwidths which are proportional to the base optimal bandwidth of hₙ = .9 · A · n^(−.2), where, as suggested by Silverman (1986) for Gaussian kernels, A = min{σ, interquartile range/1.34}. Identification is achieved by setting the intercept to zero and the coefficient on the first regressor variable to 1 (which are the location and scale normalization conditions for identification, respectively).
7 Note that the N (0, 1) is not contained in the MPD family of distributions. However, there are a number of distributions in the family, e.g., the standard logistic, that emulate the N (0, 1) rather closely, resulting in relatively effective Quasi-ML behavior by estimators based on the MPD family.
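The base bandwidth rule just quoted is simple to compute; a small sketch (ours, with illustrative names):

```python
import numpy as np

def base_bandwidth(v):
    """h_n = .9 * A * n^(-.2), with A = min{sigma, IQR/1.34},
    as suggested by Silverman (1986) for Gaussian kernels."""
    v = np.asarray(v, dtype=float)
    q75, q25 = np.percentile(v, [75, 25])
    A = min(v.std(ddof=1), (q75 - q25) / 1.34)
    return 0.9 * A * v.size ** (-0.2)
```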
Fig. 5.6. MSE for P(y = 1) predictions, DSP = MPD(.75, 1.5).

Fig. 5.7. MSE for P(y = 1) predictions, DSP = N(0, 1).

The results of the simulation for the link function MPD(q = .75, γ = 1.5) are presented in Fig. 5.6. In this application, the ML–MPD estimator again dominates the alternatives across the three sample sizes, and for both levels of P(y = 1). The semiparametric KS estimator performs least well for the smallest sample size, but becomes MSE competitive with the Probit and Logit Quasi-ML estimators for the intermediate size sample, and is as good as or better than Probit or Logit for the largest sample size.

The results for the standard normal link function are presented in Fig. 5.7. In this case, the Probit model is the correct binary response model and, as expected, the Probit estimator dominates in MSE. We eliminated the Logit simulation results, and instead replaced them with simulation results for an ML–MPD estimator with q = .5 ∀i. This is tantamount to using only the symmetric distributions in the ML–MPD family, as we did in the simulation underlying the N(0, 1) DSP in Fig. 5.5. The two ML–MPD estimators, which are both quasi-maximum likelihood estimators in this application, produce quite accurate predictions of conditional binary probabilities in a nominal sense, but are still inferior to the Probit in a relative MSE sense. However, compared to the KS estimator, for small and moderate sample sizes, as well as for the largest sample size when the unconditional Bernoulli probability is .75, the ML–MPD estimator remains superior in predictive MSE. Moreover, the ML–MPD(q = .5) notably dominates the KS estimator in all cases, and for moderate and especially large sample sizes produces results that are quite close to the Probit model's MSE performance. Note it is to be expected, given its consistency and semiparametric efficiency properties, that KS would eventually surpass QMLEs in MSE if the sample size became large enough. It is interesting in the preceding applications that even a sample size of n = 500 was not sufficient for KS to demonstrate its eventual asymptotic superiority.⁸ We underscore that while the performance of the MPD–ML(q = .5) estimator was very good in comparison to the alternatives, the estimator incorporates information relating to the symmetry of the DSP, which would be appropriate in practice only if the econometrician had reason to believe that the underlying sampling distribution was indeed symmetric, or at least approximately so.

5.4. Simulation summary

The simulations suggest that the ML–MPD estimator can be a useful alternative to flexible semiparametric methods and non-flexible parametric methods, offering potential performance advantages over each. Given the myriad of varied link functions encompassed by the MPD class, the ML–MPD estimator has the potential for representing a wide range of binary response models as a true MLE, as well as approximating others in the context of a QMLE. The simulations suggest ML–MPD may have particular advantages in cases of small to moderate sample sizes often faced by empirical analysts. Moreover, the ML–MPD estimator has the practical advantages of a notably simple functional definition, few and familiar regularity conditions, and is relatively straightforward to compute. The sub-family of symmetric MPD distributions (i.e., those with q = .5) can also serve as a more varied alternative to choosing a specific parametric link function, such as normal or logistic, with a broader potential for performing well as an MLE, or QMLE, when one wishes to employ a model incorporating a symmetric probability distribution.

We note that in solving for the estimators, the KS estimator exhibited a number of convergence issues⁹, and its performance appeared to be sensitive to the degree of collinearity in the explanatory variables. However, for large sample sizes, the KS estimator has the potential to remain an attractive alternative. In particular, because it is consistent and achieves the semiparametric efficiency bound, it can be expected to eventually overtake QMLEs in performance as sample sizes become large enough. Despite the vast set of alternatives in the MPD class, the class is not dense in all conceivable link functions. The MPD–ML will act as a QMLE in some applications, although its scope as both an MLE and QMLE is substantially broader than traditional Probit and Logit models.

6. Summary and extensions

Representing linkages between sample information and underlying binary response outcomes through moment conditions, E[g(X)′(Y − p)] = 0, the Cressie–Read (CR) family of power divergence measures was used, in the context of the minimum discrimination information principle, to identify an associated family of CDFs and statistical models for the BRM. Maximum likelihood was used as a basis for estimating the unknowns in the model, including the choice of CDF within the family. Sampling experiments were used to illustrate the small sample properties of the ML–MPD
8 A possible contributor to this outcome is the fact that the case where P(y = 1) = .75 is characterized by high multicollinearity between the two explanatory variables, given the sampling design as described in Section 5.3. Note if the variability in sampled F⁻¹(P(yi = 1)) values diminishes, as is the case when moving from P(y = 1) = .5 to P(y = 1) = .75, where variance is more than halved, the collinearity between the Xi's necessarily increases. Klein and Spady's simulations dealt only with the case where the two explanatory variables were independent.

9 These cases were characterized by outcomes for the two RHS explanatory variables being highly collinear. Regarding frequency of issues, the most prevalent difficulties arose in the case of the N(0, 1) link function simulation with P(y = 1) = .75, in which case of 1000 simulations, the KS exhibited convergence issues in eighteen cases for n = 100, five cases for n = 250, but only for one case when n = 500.
estimator. Comparisons were made relative to conventional Probit and Logit parametric estimators, and the leading semiparametric alternative, the Klein–Spady estimator. In all of the experiments the ML–MPD estimator performed well against its parametric and semiparametric competitors. The ML–MPD approach was shown to offer an operational basis for estimating the binary response model when there is uncertainty relative to distributional choice from among the myriad of members in the minimum power divergence (MPD) family of distributions. The combined ML–MPD approach results in a method of estimating binary response models that can be consistent, asymptotically normal, and relatively efficient, even in the absence of knowledge of the true functional form of the underlying sampling distribution.

It is straightforward to extend the univariate distribution formulations of this paper to their multivariate counterparts. For example, one such extension, which subsumes the multivariate logistic distribution as a special case, begins with a multinomial specification of the minimum power divergence estimation problem in Lagrange form as

$$L(\mathbf{p}, \boldsymbol{\lambda}) = \sum_{i=1}^{n}\frac{1}{\gamma(\gamma+1)}\sum_{j=1}^{m}p_{ij}\left[\left(\frac{p_{ij}}{q_{ij}}\right)^{\gamma} - 1\right] + \sum_{j=1}^{m}\boldsymbol{\lambda}_j'\mathbf{x}'(\mathbf{y}_j - \mathbf{p}_j) + \sum_{i=1}^{n}\eta_i\left[\sum_{j=1}^{m}p_{ij} - 1\right]. \qquad (6.1)$$

Solving first-order conditions with respect to the pij's leads to the standard multivariate logistic distribution when the reference distributions are uniform.

The analytical and sampling results for the new minimum power divergence family of statistical models and estimators represent a base for considering a range of important new problems relating to binary estimation and inference. One question concerns how, in the estimation of probabilities and marginal effects, to make best use of the reference distribution, q, so that known or estimable characteristics of the Bernoulli probabilities may be taken into account. Consideration of alternative conditional moment formulations and their effect on efficiency of the resultant estimators provides another set of interesting research problems on which we have made some progress. The very substantial flexibility in the index function of the models that can be accommodated by the ML–MPD approach to binary response model estimation opens a vast array of possibilities for representing such things as endogeneity (see Judge et al., 2006, for one such extension), nonlinear impacts of regressors, heterogeneity in response processes across responders, and other model characteristics requiring flexible functions of regressor variables. These and other issues, such as a more robust basis for estimator choice in a given loss context, are the subject of ongoing research.

Acknowledgments

The authors gratefully acknowledge the helpful and substantive comments of Martin Burda, Marian Grendar, Guido Imbens, Joanne Lee, Arthur Lewbel, Art Owen, Paul Ruud, Kenneth Train, and David Wolpert. The authors are also appreciative of the insightful suggestions and comments of the Associate Editor and anonymous Journal reviewers.

Appendix A. Properties of the MPD family of CDFs

We use the notation MPD(q, γ) to denote a specific member of the MPD family of distributions identified by particular values of q and γ. A vast array of symmetric, as well as left and right-skewed probability density functions are contained within the MPD family of PDFs. To illustrate the range of possibilities, graphs of some members of the family are presented in Figs. A.1 and A.2. These graphs suggest the wide range of distributional-likelihood possibilities as γ and q take on different values.

Fig. A.1. Some symmetric MPD densities with q = .5.

Fig. A.2. Some asymmetric MPD distributions for q = .75.

A.1. Moments

The existence of moments in the MPD family of distributions and their values depend on the γ parameter. The case γ = 0 is a limiting case that defines a family of logistic distributions, and can be handled explicitly. For other γ cases, consider the general definition of the expectation of g(W) with respect to MPD(q, γ). Implicit differentiation of (3.5) with respect to w implies the following general representation of the MPD family of probability densities for nonzero γ:

$$f(w; q, \gamma) = \frac{1}{q^{-\gamma}F(w; q, \gamma)^{\gamma-1} + (1-q)^{-\gamma}(1 - F(w; q, \gamma))^{\gamma-1}} \quad \text{for } w \in \Upsilon(q, \gamma), \qquad (A.1)$$

where F(w; q, γ) denotes the cumulative distribution function and Υ(q, γ) denotes the appropriate support of the distribution. As indicated in (3.5), the support depends on q and γ if γ > 0, and Υ(q, γ) = R otherwise. Expectations can then be defined as

$$E(g(W)) = \int_{w \in \Upsilon(q,\gamma)}\frac{g(w)}{q^{-\gamma}F(w; q, \gamma)^{\gamma-1} + (1-q)^{-\gamma}(1 - F(w; q, \gamma))^{\gamma-1}}\,dw. \qquad (A.2)$$

Through a change of variables, the expectation of g(W) can be defined in the computationally more convenient form

$$E(g(W)) = \int_0^1 g\left(\gamma^{-1}\left[\left(\frac{p}{q}\right)^{\gamma} - \left(\frac{1-p}{1-q}\right)^{\gamma}\right]\right)dp. \qquad (A.3)$$
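As a numerical sanity check (ours) that the change-of-variables form (A.3) reproduces the closed-form mean given in (A.5) below, simple trapezoidal quadrature over p suffices for a γ > 0 member:

```python
import numpy as np

q, gamma = 0.6, 0.8
p = np.linspace(1e-6, 1 - 1e-6, 200_001)

# integrand of (A.3) with g(w) = w: w(p) = gamma^{-1}[(p/q)^gamma - ((1-p)/(1-q))^gamma]
w = ((p / q) ** gamma - ((1 - p) / (1 - q)) ** gamma) / gamma

mean_numeric = np.sum(0.5 * (w[1:] + w[:-1]) * np.diff(p))               # trapezoid rule
mean_closed = (q ** -gamma - (1 - q) ** -gamma) / (gamma * (gamma + 1))  # eq. (A.5)
print(mean_numeric, mean_closed)   # the two values agree to quadrature accuracy
```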
When γ > 0, moments of all orders exist for densities in the MPD family. This follows immediately from the fact that the integrand in

$$E(W^{\delta}) = \int_0^1\left(\gamma^{-1}\left[\left(\frac{p}{q}\right)^{\gamma} - \left(\frac{1-p}{1-q}\right)^{\gamma}\right]\right)^{\delta}dp \qquad (A.4)$$

is bounded for each positive integer-valued δ, finite q ∈ (0, 1), and each finite positive-valued γ. In terms of the first moment, the means of the probability densities are given by evaluating the integral (A.4) for δ = 1, resulting in

$$E(W) = \frac{q^{-\gamma} - (1-q)^{-\gamma}}{\gamma(\gamma+1)}. \qquad (A.5)$$
The second moment around the origin is obtained by solving (A.4) when δ = 2, as E ( W 2 ) = γ −2
[
Theorem A.3.1 (Consistency of MPD–ML Estimator). Consider a collection of MPD distributions indexed by the parameters {q, γ } ∈ Θ ⊂ (0, 1) × R, let β ∈ B ⊂ Rk , and assume that Υ = B × Θ is a compact set containing the true value of the parameters, {β0 , q0 , γ0 }. If the joint probability distribution of X, m(x) with support X , is such that inf {β,q,γ }∈Υ ,x∈X {p(g(x)′ β; q, γ )} = pℓ > 0 and sup{β,q,γ }∈Υ ,x∈X {p(g(x)′ β; q, γ )} = pu < 1, and (yi , xi ) ∼ iid f (y | x)m(x) = p(g(x)′ β0 ; q0 , γ0 )m(x), then the ML–MPD ˆ qˆ , γˆ } →p {β0 , q0 , γ0 }. estimator is such that {β, Proof. The log likelihood function for the case of ML–MPD estimation is given by
ℓ(β, q, γ | y, x) = =
1 + 2γ
− 2q
(1 − q) Γ (a)Γ (b)
−γ
n −
ℓ∗ (β, q, γ | yi , xi )
i =1
q−2γ + (1 − q)−2γ
−γ
215
n − [yi ln(p(g(xi )′ β; q, γ )) i =1
]
B(γ + 1, γ + 1) ,
+ (1 − yi ) ln(1 − p(g(xi )′ β; q, γ ))].
(A.6)
where, B(a, b) = Γ (a+b) , and, Γ (α) = 0 w α−1 e−w dw , are the well-known Beta and Gamma functions, respectively. The variance of the distribution then follows in the usual way by subtracting the square of (A.5) from (A.6). Distributions in the MPD family, with γ ≤ −1, do not have moments defined of any order because the integral in (A.4) is divergent for any choice of δ ≥ 1. Moments do exist for values of γ ∈ (−1, 0), but only a finite number of moments exist and how high an order of moment exists depends on the value of γ . If γ > −1, the mean of the distribution exists, and its functional representation in terms of γ and q is precisely the same as in (A.4). If γ > − 12 , the second moment about the origin, and thus the variance, exist, and the first two moments have exactly the same functional forms as in (A.5) and (A.6), respectively. In general, the moment of order δ exists provided that, γ > −δ −1 . In this case it is identical in functional form to the corresponding moment in the sub-family of MPD family distributions for which γ > 0.
∞
A.2. Concavity of ln(F (w; q, γ )) and ln(1 − F (w; q, γ )) in w It is useful for estimation purposes to identify curvature properties of both, ln(F (w; q, γ )) and ln(1 − F (w; q, γ )). As is known from the work of Pratt (1981), if these logarithmic functions of the CDF are concave in w , it follows that the log likelihood based on the CDF, with arguments, wi ≡ g(xi )′ β, i = 1, . . . , n, is concave in β. In fact, both logarithmic functions are concave in w , for any q ∈ (0, 1), so long as the MPD distribution is such that γ ≤ 1. When γ > 1, one or both of ln(F (w; q, γ )) and ln(1 − F (w; q, γ )) will not be concave, depending on the value of q. A.3. ML–MPD asymptotics and profiling the likelihood function We motivate the consistency and asymptotic normality of the ML–MPD estimator under straightforward primitive regularity conditions that encompass cases where γ ≤ 0 and γ > 0 are both considered as distributional possibilities. MPD distributions that fall into the latter set exhibit finite supports that are ‘‘nonstandard’’ relative to the usual ML asymptotics. While our regularity conditions are certainly not the most general, our invocation of what amounts to a ‘‘nondegeneracy of outcomes’’ assumption avoids boundary point issues that would lead to handling the two cases differently. We also motivate the applicability of Patefield’s (1977,1985) profiling of the likelihood function under the same regularity conditions.
(A.7)
The terms in the summand are iid outcomes from f (y | x)m(x) whose common expectation, defined via iterated expectations, is given by EX EY|X (ℓ(β, q, γ | Y, X ))
= EX [p(g(x)β0 ; q0 , γ0 ) ln(p(g(x)β; q, γ )) +(1 − p(g(x)β0 ; q0 , γ0 )) ln(1 − p(g(x)β; q, γ ))],
(A.8)
which exists by the boundedness of the bracketed expression. It follows from Khinchin’s WLLN that n− 1
n −
p
ℓ∗ (β, q, γ | yi , xi ) → EX EY |X (ℓ(β, q, γ | Y, X)).
(A.9)
i=1
Moreover, since sup | ℓ∗ (β, q, γ | y, x) |≤ < ∞ by the bound∑M n edness of P (g(x)′ β; q, γ ), the sum n−1 i=1 ℓ∗ (β, q, γ | yi , xi ) converges uniformly in (A.9), and the continuity and differentiability of ℓ∗ (β, q, γ | y, x)are manifest in EX EY|X (ℓ(β, q, γ | Y, X)). Now note that (A.8) is maximized at {β, q, γ } = {β0 , q0 , γ0 }. To demonstrate this, first examine τ (p) = p0 ln(p) + (1 − p0 ) ln(1 − p), and note that
−
p0 p2
−
(1−p0 ) (1−p)2
dτ (p) dp
=
p0 p
−
(1−p0 ) 1−p
d2 τ (p) dp2 dτ (p) dp
and
< 0 ∀p ∈ (0, 1). Then solving
=
= 0
implies that p = p0 maximizes τ (p), which in turn implies that p(g(x)′ β0 ; q0 , γ0 ) = (p(g(x)′ β; q, γ )) for (A.8) to be maximized, and {β, q, γ } = {β0 , q0 , γ0 }
ˆ qˆ , γˆ } {β0 , q0 , γ0 }. It follows that {β,
p
→
Theorem A.3.2 (Asymptotic Normality of MPD–ML Estimator). Una ˆ qˆ , γˆ } ∼ der the assumptions of Theorem A.3.1, {β, N ({β0 , q0 , γ0 },
Σ ), where Σ = −[E ∂
2 ℓ(θ|y,x)
∂θ∂θ′
1 ]− θ0 and θ ≡ {β, q, γ }.
Proof Sketch. Given the consistency of the ML–MPD estimator, asymptotic normality can be shown using a standard Taylor Series argument. See (Amemiya (1985, p. 115–123)). The needed continuous differentiability of p(g(x)′ β; q, γ ), and thus ℓ(β, q, γ | y, x), on Υ = B×Θ can be demonstrated directly via the continuity of the implicit derivatives obtained from (3.5). Theorem A.3.3 (Profiling the Likelihood of MPD–ML Estimator). Under the assumptions of Theorem A.3.3 given that the ML solution is interior to the parameter space, the asymptotic covariance matrix of the MP–MPD estimator of β can be written as [−E
∂ 2 pℓ(β|y,x) −1 ] , ∂β∂β′
pℓ(β | y, x) is the profile likelihood function for β.
where
216
R.C. Mittelhammer, G. Judge / Journal of Econometrics 164 (2011) 207–217
Proof. As presented by Patefield (1977, 1985) the likelihood function of a ML estimation problem can be profiled whenever the parameter space is finite dimensional Euclidean, the likelihood function is twice continuously differentiable, and the MLE occurs in the interior of the parameters space. The first two conditions are clearly met in the ML–MPD case on Υ = B×Θ , since the parameter space is of dimension k + 2, and the continuous differentiability of the likelihood can be demonstrated via implicit differentiation of (3.5). Appendix B. Computational issues In this section we discuss the algorithm used to represent the CDFs of the MPD family numerically, and we present some timing results to indicate how quickly computations can proceed both in calculating CDF values, as well as solving ML–MPD estimation problems. B.1. Numerical representation of MPD CDFs via interval bisection It is readily apparent, given pi and qi ∈ (0, 1) and the derivative relationships
∂
γ pi qi
−
1−pi 1−qi
∂ pi γ −1 =γ
pi
γ
qi
γ
(1 − pi )γ −1 + (1 − qi )γ
> > 0 if γ 0, < <
(B.1)
that the term, η(pi ) ≡ ( qi )γ − ( 1−qi )γ , in (3.5), is a strictly i i monotonically increasing (γ > 0) or decreasing (γ < 0) function p 1−p of p. When γ → 0, η(pi ) = ln( qi ) − ln( 1−qi ), which is strictly i i monotonically increasing in pi . Computationally, one very stable and accurate method for determining the value of pi , that satisfies (3.5) for a given value, wi = g(xi )′ λ, is the interval bisection procedure. Roughly speaking, this procedure halves sequentially generated intervals for pi ∈ (at , bt ), t = 0, 1, 2, . . ., all of which bracket the value of pi that satisfies (3.5), until at iteration t = T , pi = (aT + bT )/2, satisfies (3.5) within a prespecified level of approximation tolerance. The method has the distinct advantage of always converging to the solution in this application, although the method converges only at a linear rate. We have found that conceptually faster algorithms, such as Newton’s method with a quadratic rate of convergence, or Hally’s method (a leading example of Householder-type algorithms, see Scavo and Thoo, 1995) with a cubic rate of convergence, are too unstable and will not necessarily lead to convergence in this problem due to numerical difficulties in calculating iterative updates based on functions of first and/or second-order derivatives of η(pi ). This is especially true when solutions are close to pi = 0 or 1. The interval bisection method is derivative free, and requires only comparisons of values of η(pi ) itself. p 1−p Letting, h(pi ) ≡ (( qi )γ − ( 1−qi )γ ) − wi γ , the bisection method i i requires two initial points and b0 ∈ [0, 1] for pi , such that h(a0 ) and h(b0 ) have opposite signs. Given the strict monotonicity of h(pi ), these two initial points define an interval that brackets the solution. Such initial points always exist and are easily found in the case at hand. If γ ≤ 0, then for any finite wi γ (or simply wi in the limiting case where γ → 0 and h(pi ) is defined alternatively as, p 1 −p h(pi ) = ln( qi )−ln( 1−qi )−wi ) we have that, h(pi ) → ∞ as pi → 0 i i and, h(pi ) → −∞ as pi → 1. It follows that one can always choose initial values a0 = ε and b0 = 1 − ε , for small enough ε > 0, so that h(a0 ) > 0 and h(b0 ) < 0. If γ > 0, one must first check for boundary point solutions, which is easily done using the simple p
1−p
Table B.1 Average time to ML–MPD solution, in seconds, for MPD(.5, −1).
{n, P (y = 1)} {100, .5} .419
{100, .75} .513
{250, .5} .639
{250, .75} .854
{500, .5} 1.019
{500, .75} 1.351
Seconds
rule h(1) < 0 ⇒ pi = 1 and h(0) > 0 ⇒ pi = 0. If the solution for pi is not at either corner point, then one can choose a0 = 0 and b0 = 1 for the initial values bracketing the solution. The procedure next divides the initial interval in two by a +b computing the midpoint of the interval, c0 = 0 2 0 . If c0 happens to be a root of h(pi ) within the tolerance (i.e., number of decimals) specified, the process stops. If not, then two possibilities remain: either h(a0 ) and h(c0 ) or h(b0 ) and h(c0 ) have opposite signs and bracket the root. Select the subinterval that brackets the solution, reassign either a0 or b0 to the value c0 as appropriate, and apply the bisection step to the new reduced (by 50%) interval. Continue the bisection operation until the bracket is sufficiently small to satisfy a +b the approximation tolerance, the solution being, cT = T 2 T , after the T th bisection. Assuming an initial bracketing interval of length 1, it is evident that the absolute error in the solution after T bisections is at most 2−(T +1) . It follows that to achieve a tolerance of δ in the solution, the number of bisections needed is given by, T > log2 ( 1δ ) − 1. It is important to underscore that the solution procedure described above can be straightforwardly programmed to process the solutions pi , i = 1, . . . , n to an n × k matrix of xi values in parallel, increasing its efficiency as a solution procedure. B.2. Timing of CDF and ML–MPD estimator solutions To provide an indication of the relative efficiency with which the CDF of the MPD distributions can be calculated, the procedure was programmed in Aptech Systems’ GAUSS 8.0 software, and timed for n = 100 and 500, with γ = −1, −.75, −.25, and 0, on a single-core Intel Pentium 3.4 GHz PC running Windows XP service pack 3. Timing was based on the average time-to-solution across 10,000 sets of n × 1 CDF vectors for each combination of n and γ , with an approximation tolerance set to a very precise 1.4 × 10−14 . Data were generated randomly in the course of Monte Carlo simulations described in Section 5. There was no appreciable difference in efficiency across γ values, the differences being a few hundred-thousandths of a second. The average total time involved in solving simultaneously for the entire vector of 100 or 500 CDF values was 1.624 × 10−3 or 6.943 × 10−3 seconds, respectively. Regarding the solution time required to obtain the ML–MPD estimate of the unknown parameters, {β, q, γ }, the average time-to-solution corresponding to the simulations involving the MPD(.5, −1) case, depicted in Fig. 5.3, were calculated. Solutions were tracked for each of the scenarios depicted in the graph to yield average times for n = 100, 250, and 500, and for the case of P (y = 1) = .5 and .75 for each simulated sample size. Solutions required slightly more time for the P (y = 1) = .75 case, and of course more time was required as sample size increased. However, in all cases the solution times were essentially negligible, as indicated in Table B.1. References Amemiya, T., 1985. Advanced Econometrics. Harvard University Press, Cambridge. Breisch, R.A., Chintagunta, P.K., Matzkin, R.L., 2002. Semiparametric estimation of brand choice behavior. Journal of the American Statistical Association 97, 973–982. Cosslett, S.R., 1983. Distribution-free maximum likelihood estimation of the binary choice model. Econometrica 51, 765–782.
R.C. Mittelhammer, G. Judge / Journal of Econometrics 164 (2011) 207–217 Cressie, N., Read, T., 1984. Multinomial goodness of fit tests. Journal of the Royal Statistical Society, Series B 46, 440–464. Gabler, S., Laisney, F., Lechner, M., 1993. Seminonparametric estimation of binary choice models with an application to labor force participation. Journal of Business and Economic Statistics 11, 61–80. Gallant, R., Nychka, D.W., 1987. Semi-nonparametric maximum likelihood estimation. Econometrica 55, 363–390. Gozalo, P.L., Linton, O., 1994. Local nonlinear least squares estimation using parametric information nonparametrically, Discussion Paper No. 1075, Cowles Foundation, Yale University. Green, W., 2007. Discrete choice modeling. In: Mills, T., Patterson, K. (Eds.), Applied Econometrics, Part 4.2. In: Handbook of Econometrics, vol. 2. Palgrave, London. Han, A.K., 1987. Nonparametric analysis of a generalized regression model. Journal of Econometrics 35, 303–316. Hansen, L.P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054. Horowitz, J.L., 1992. A smoothed maximum score estimator for the binary response model. Econometrica 60, 505–531. Ichimura, H., 1993. Semiparametric least squares (sls) and weighted sls estimation of single-index models. Journal of Econometrics 58, 71–120. Jacoby, S.L.S., Kowalik, J.S., Pizzo, J.T., 1972. Iterative Methods for Nonlinear Optimization Problems. Prentice Hall, Englewood Cliffs, NJ, 1972. Judge, G, Mittelhammer, R., 2011. An Information Theoretic Approach To Econometrics, Cambridge University Press, New York (in press). Judge, G., Mittelhammer, R., Miller, D., 2006. Estimating the link function in multinomial response models under endogeneity. In: In festschrift for Stanley Johnson. California University Press. Klein, R.W., Spady, R.H., 1993. An efficient semiparametric estimator for binary response models. Econometrica 61 (2), 387–421. Kuhn, H.W., Tucker, A.W., 1951. Non-linear programming. In: Proc. 2 Berkeley Symp. on Mathematical Statistics and Probability. Univ. Calif. Press, pp. 481–492. Kullback, S., 1959. Information Theory and Statistics. John Wiley and Sons, NY. Kullback, S., Leibler, R.A., 1951. On information and inefficiency. Annals of Mathematical Statistics 22, 79–86. Kumar, P., Taneja, I., 2006. Generalized relative J-divergence measure and properties. International Journal of Contemporary Mathematical Sciences 1, 597–609. Manski, C.F., 1975. The maximum score estimation of the stochastic utility model of choice. Journal of Econometrics 3, 205–228.
217
McFadden, D., 1984. Qualitative response models. In: Griliches, Z., Intriligator, M. (Eds.), Handbook of Econometrics, vol. 2. North Holland, Amsterdam, pp. 1395–1457. McFadden, D., 1974. Conditional logit analysis of qualitative choice behavior. In: Zarembka, P. (Ed.), Frontiers of Econometrics. Academic Press, New York, pp. 105–142. Mittelhammer, R., Judge, G., Miller, D., 2000. Econometric Foundations. Cambridge University Press, New York. Nelder, J.A., Meade, R., 1965. A simplex method for function minimization. Computer Journal 7, 308–313. Newey, W., 1991. Uniform convergence in probability and stochastic equicontinuity. Econometrica 59, 1161–1167. Pardo, L., 2006. Statistical Inference Based on Divergence Measures. Chapman and Hall, Boca Raton. Patefield, W.M., 1977. On the maximized likelihood function. Sankhya 39, 92–96. Patefield, W.M., 1985. Information from the maximized likelihood function. Biometrika 72, 664–668. Pollard, D., 1991. Asymptotics for least absolute deviation regression estimators. Econometric Theory 7, 186–199. Pratt, J.W., 1981. Concavity of the log likelihood. Journal of the American Statistical Association 76, 103–106. Read, T.R., Cressie, N.A., 1988. Goodness of Fit Statistics for Discrete Multivariate Data. Springer Verlag, New York. Scavo, T.R., Thoo, J.B., 1995. On the egometry of Halley’s method. American Mathematical Monthly 102 (5), 417–426. Shannon, C.E., 1948. A mathematical theory of communications. Bell System Technical Journal 27, 623–659. Silverman, P.W., 1986. Density Estimation. Chapman and Hall, New York. Takada, T., 2001. Nonparametric density estimation: a comparative study. Economics Bulletin 3 (16), 1–10. Train, K., 2003. Discrete Choice Methods with Simulation. Cambridge University Press, New York. van der Vaart, A.W., 1998. Asymptotic Statistics. Cambridge University Press, Cambridge. Wang, W., Zhou, M., 1993. Iterative least squares estimator of binary choice models. Semiparametric approach. Department of Statistics Technical. Report No. 346 University of Kentucky. Zellner, A., Rossi, P., 1984. Bayesian analysis for dichotomous response models. Journal of Econometrics 25, 365–393.
Journal of Econometrics 164 (2011) 218–238
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Model selection criteria in multivariate models with multiple structural changes✩ Eiji Kurozumi a,∗ , Purevdorj Tuvaandorj b a
Department of Economics, Hitotsubashi University, 2-1 Naka, Kunitachi, Tokyo 186-8601, Japan
b
Department of Economics, McGill University, Canada
article
info
Article history: Received 10 June 2010 Received in revised form 14 December 2010 Accepted 26 April 2011 Available online 6 May 2011 JEL classification: C13 C32 Keywords: Structural breaks AIC Mallows’ Cp BIC Information criteria
abstract This paper considers the issue of selecting the number of regressors and the number of structural breaks in multivariate regression models in the possible presence of multiple structural changes. We develop a modified Akaike information criterion (AIC), a modified Mallows’ Cp criterion and a modified Bayesian information criterion (BIC). The penalty terms in these criteria are shown to be different from the usual terms. We prove that the modified BIC consistently selects the regressors and the number of breaks whereas the modified AIC and the modified Cp criterion tend to overfit with positive probability. The finite sample performance of these criteria is investigated through Monte Carlo simulations and it turns out that our modification is successful in comparison to the classical model selection criteria and the sequential testing procedure robust to heteroskedasticity and autocorrelation. © 2011 Elsevier B.V. All rights reserved.
1. Introduction This paper considers the selection of regressors and estimation of the number of structural changes in multivariate regression models in the possible presence of multiple structural changes. Many methods for the selection of regressors have been proposed in the econometric and statistical literature, and it is often the case in practical analyses that the regressors are selected using either testing procedures or model selection criteria. The former methods select the regressors by testing the significance of the coefficients of the regressors and deleting the insignificant coefficients from the models, while the model selection criteria choose the regressors that minimize the given risk functions. The representative model selection criteria in econometric analysis are the Akaike information criterion (AIC) by Akaike (1973), the Cp criterion by Mallows (1973) and the Bayesian information criterion (BIC) by Schwarz (1978) among others. See Burnham and Anderson (2002) and Konishi and Kitagawa (2008) for a general treatment of the model selection criteria.
✩ We are grateful to Professors Takeshi Amemiya (Editor), Kohtaro Hitomi, Yoshihiko Nishiyama, Ryo Okui, Pierre Perron, Zhongjun Qu, Katsumi Shimotsu, Toshiaki Watanabe and anonymous referees for helpful comments. All errors are our responsibility. ∗ Corresponding author. Tel.: +81 42 580 8787. E-mail address:
[email protected] (E. Kurozumi).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.04.003
In addition to the selection of the regressors, we need to consider the possibility of structural changes when we investigate data covering a relatively long sample period. In such a case we usually test for structural changes. Various tests for structural changes have been proposed in the literature, and the most commonly used tests in recent practical analyses are the suptype test of Andrews (1993) and the exponential and averagetype tests of Andrews et al. (1996) among others. These tests assume the null hypothesis of no changes against the alternative of (multiple) change(s), whereas Bai and Perron (1998) and Bai (1999) proposed tests for the null of ℓ breaks against the alternative of ℓ + 1 breaks for univariate models. These tests are extended to multivariate models by Qu and Perron (2007), who give a comprehensive treatment on the issue of the estimation, inference, and computation in a system of equations with multiple structural changes. Their treatment is general enough in that less restrictive assumptions are placed on the error term and that models such as vector autoregressions (VAR), seemingly unrelated regressions (SURE) and panel data models are included in their setup as special cases. See Perron (2006) for a review of the testing and estimation of structural changes. Once the evidence of structural breaks is found, the next step is to estimate the number of breaks. Bai (1997b, 1999), Bai and Perron (1998) and Qu and Perron (2007) proposed to implement tests for structural changes sequentially and proved that the estimated number of structural changes is consistent by letting
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
the significance level go to zero. Alternatively, in the statistical literature, the model selection criteria have been proposed to select the number of breaks. For independent normal random variables with mean shifts, Yao (1988) and Zhang and Siegmund (2007) derived the modified BIC and Ninomiya (2005) proposed to modify the AIC, while Liu et al. (1997) considered the modified BIC in regression models with i.i.d. regressors. According to these works, the penalty terms of these new criteria are different from those of the corresponding classical criteria because of the irregularity in the change points. Although these results are of interest from a statistical point of view, they cannot be directly applied to economic data because while economic time series variables are typically serially correlated, assumptions such as i.i.d. observations and regressors are made in the above papers. Exceptions are Ninomiya (2006), Davis et al. (2006) and Hansen (2009). Ninomiya (2006) considered the modified AIC in finite order autoregressive models, but derived under the assumption of known variance, while Davis et al. (2006) proposed a model selection criterion based on the minimum description length principle, but only the consistency of the break fraction estimators was discussed. Hansen (2009) established the modified Mallows’ Cp criterion but with only a single break being allowed. In this paper we develop the model selection criteria in multivariate models allowing lagged dependent variables as regressors in the possible presence of multiple structural changes in both the coefficients and the variance matrices. Our criteria have an advantage over the existing ones in that (i) multivariate models are considered, (ii) serial correlation is taken into account in models by allowing serially correlated regressors, including lagged dependent variables, (iii) structural changes in the variance matrices are allowed. We theoretically derive the AIC, Cp criterion and BIC in models with structural changes and show that the penalty terms should be modified compared with those of the corresponding classical ones. We confirm by Monte Carlo simulations that this modification of the penalty terms is very important to correctly select the regressors and the number of structural changes in finite samples. The rest of this paper is organized as follows. We explain the model and assumptions in Section 2. Section 3 establishes the modified AIC, the modified Cp criterion and the modified BIC with multiple structural breaks and discusses the consistency of these criteria. In Section 4, we investigate the finite sample performance of our model selection criteria via simulations. Concluding remarks are provided in Section 5. 2. Model and assumptions Let us consider the following n-dimensional regression model with m structural changes (m + 1 regimes): yt = Φj xjt + εt
(j = 1, . . . , m + 1
and t = Tj−1 + 1, . . . , Tj )
(1)
where yt and xjt are n × 1 and pxj × 1 vectors of observations, respectively, εt is an error term and Φj is an n × pxj unknown coefficient matrix in the jth regime. Typically, the regressor xjt includes a constant but trending regressors are not allowed in our model. We use the term pφj = npxj to denote the number of unknown coefficients in each regime, so that the total number of ∑m+1 ∑m+1 coefficients is given by pall φ = j=1 pφj = j=1 npxj . Similarly, by allowing structural changes in the variance matrix of εt , the number of unknown variance components in each regime is pσ = n(n + 1)/2 and that in all regimes is given by pall σ = (m + 1)pσ = (m + 1)n(n + 1)/2. We set T0 = 0 and Tm+1 = T , so that the total number of observations is T . In model (1) there are m structural changes (m + 1 regimes) with change points given by
219
T1 , . . . , Tm . We allow the lagged dependent variables as regressors and in that case, the initial observations of yt for t ≤ 0 are assumed to be given. Thus, model (1) includes a VAR model as a special case. Note that the different regressors and the different orders of the lagged dependent variables are allowed depending on the regimes. The main purpose of this paper is to derive the model selection criteria to choose the regressors among the p¯ x candidates for regressors and to estimate the number of structural changes m. In what follows, while we will continue using ‘‘choose the number px among the p¯ x regressors,’’ we, however, imply ‘‘choose the regressors x1t , x2t , . . . , xm+1t for all the regimes among the p¯ x candidates for regressors.’’ Model (1) can be rewritten as yt = (x′jt ⊗ In )ϕj + εt for the jth regime where ϕj = vec(Φj ), or equivalently, yt = (x′t ⊗ In )φj + εt where xt consists of all the regressors, which are collected from the element of xjt ’s without overlap, while φj may contain 0’s if the different regressors are allowed in each regime. We denote the true value of a parameter with superscript 0. For example, φj0 and Tj0 denote the true value of φj in the jth regime and the true jth break point, respectively. Hence, the data generating process is given by yt = (x′jt ⊗ In )ϕj0 + εt = (x′t ⊗ In )φj0 + εt . The following assumptions are made mainly for the derivation of the modified AIC and the modified Cp criterion. Assumption A1. (a) There exists a positive integer l0 > 0 such that for all l > l0 , the minimum eigenvalues of (1/l)
(1/l)
∑Tj0
t =Tj0 −l
∑Tj0−1 +l
t =Tj0−1 +1
xt x′t and
xt x′t are bounded away from zero (j = 1, . . . , m0 +
1). (b) t =k xt x′t is invertible for l − k > k0 for some 0 < k0 < ∞. (c) supt E ‖xt ‖4+δ < ∞ for some δ > 0.
∑l
Assumption A2. When the lagged dependent variables are allowed as regressors, all the characteristic roots associated with the lag polynomials are inside the unit circle. Assumption A3. (a) εt = (Σj0 )1/2 ηt for Tj0−1 + 1 ≤ t ≤ Tj0 (j = 1, . . . , m0 + 1), where Σj0 is a symmetric and positive definite unknown matrix and {ηt } is a martingale difference sequence with respect to Ft = σ {ηt , ηt −1 , . . . , xt +1 , xt , . . . , } with E [ηt ηt′ |Ft −1 ] = In for all t. (b) supt E ‖ηt ‖4+δ < ∞ for some δ > 0. (c) E [ηit ηjt ηkt ] = 0 (i, j, k = 1, . . . , n). (d)
(1/1Tj0 )tr
[
∑Tj0
t =Tj0−1
]2 ′ − I ) (η η n +1 t t
p
−→ κ4j , where κ4 is some
positive number and 1Tj0 = Tj0 − Tj0−1 (j = 1, . . . , m0 + 1). Assumption A4. φj0+1 − φj0 = vT δj and Σj0+1 − Σj0 = vT Ψj , where
(δj , Ψj ) ̸= 0 (j = 1, . . . , m0 ), Σj0 → Σ 0 as T → ∞ for all j and of positive numbers such that vT → 0 and √ vT is a sequence T vT /(log T )2 → ∞. Assumption A5. 0 = λ0 < λ01 < · · · < λ0m0 < λm0 +1 = 1, where Tj0 = [T λ0j ] (j = 0, . . . , m0 + 1).
Assumption A6. The following weak law of large numbers and the functional central limit theorems hold (j = 1, . . . , m0 ): 0
1
Tj −
1Tj0 t =Tj0−1 +1
p
xt x′t ⊗ (Σj0 )−1 −→ Q1j ,
220
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
1
Tj0+1
−
obtained by maximizing (2) over {(λ1 , . . . , λm ); |λj − λj−1 | ≥
p
xt x′t ⊗ (Σj0+1 )−1 −→ Q2j ,
1Tj0+1 t =Tj0 +1
ϵ for j = 1, . . . , m + 1} for small ϵ > 0 and are denoted by θˆ and Tˆ , respectively. Note that θˆ or φˆ may contain 0’s if xjt are allowed to change depending on regimes and in this case, the non-trivial elements of φˆ j equal to the elements of ϕˆ j . Under Assumptions A1–A6, Qu and Perron (2007) showed that the MLEs of the non-zero elements of φj (ϕj ) and Σj have the standard asymptotic distributions as in the case of known break points, whereas the limiting distributions of the break date estimators are given by, for j = 1, . . . , m0 ,
0
Tj −
vT
(ηt ηt − In ) ⇒ ξ1j (v), ′
−2 t =Tj0 −[vvT ] −2 Tj0 +[vvT ]
−
vT
(ηt ηt′ − In ) ⇒ ξ2j (v),
t =Tj0 +1
d
vT2 (Tˆj − Tj0 ) −→ argmax BIj (v)
0
Tj −
vT
xt ⊗ (
Σj0 −1/2
)
ηt ⇒
1/2 Q1j 1j
ηt ⇒
1/2 Q2j 2j
ζ (v),
where the superscript I denotes the dependency on the indicator function given by I (v ≤ 0),
−2 t =Tj0 −[vvT ] −2 Tj0 +[vvT ]
−
vT
xt ⊗ (
Σj0+1 −1/2
)
(3)
v
B1j (v) = √ω1j W1j (|v|) − |v| γ1j : 2 BIj (v) = B (v) = √ω W (v) − v γ : 2j 2j 2j 2j
ζ (v)
t =Tj0 +1
2
p
v≤0
(4)
v > 0,
1
for v ≥ 0, where Q1j and Q2j are positive definite matrices, −→ and ⇒ signify convergence in probability and weak convergence of the associated probability measures, respectively, each entry of ξ1j (v) and ξ2j (v) is a (nonstandard) Brownian motion process on [0, ∞) with Var (vec(ξ1j (1))) = Ω1j and Var (vec(ξ2j (1))) = Ω2j , while each element of ζ1j (v) and ζ2j (v) is a standard Brownian motion process on [0, ∞), and ξ1j (v), ξ2j (v), ζ1j (v) and ζ2j (v) are independent of each other.
ω1j =
The above assumptions satisfy or are similar to the conditions used in the existing literature. For detailed explanations, see Bai (1997a) and Bai and Perron (1998) for the univariate case and Bai (2000) and Qu and Perron (2007) for the multivariate case. Note that we do not allow serial correlation in the error term to derive the model selection criteria. This is more restrictive as compared to the assumptions in Qu and Perron (2007). However, since the lagged dependent variables are allowed as regressors and some elements of xjt can be the lagged values of the other elements, Assumption A3 may not be too restrictive for practical purposes. We should also note that the regressor xjt is not necessarily homogeneous in all the regimes. In other words, the regimewise heteroskedastic regressors are allowed in our model. It is known that the assumption of heteroskedasticity in xjt and the shrinking shifts in Assumption A4 result in the asymmetric limiting distributions of the break point estimators. This asymmetry will make the modified AIC and the modified Cp criterion take relatively complicated forms. We estimate model (1) by the quasi-maximum likelihood (QML) method, conditional on the initial values if the lagged dependent variables are allowed as regressors. Let φ = [φ1′ , φ2′ , . . . , φm′ +1 ]′ , σ = [vec(Σ1 )′ , vec(Σ2 )′ , . . . , vec(Σm+1 )′ ]′ , θ = [φ ′ , σ ′ ]′ and T = [T1 , T2 , . . . , Tm ]′ . Then, given the number of breaks and the regressors, the log-likelihood function, denoted by ℓm,px (T , θ|y, x), becomes
and W1j (v) and W2j (v) are independent standard Brownian motions on [0, ∞). Before moving to the derivation of the model selection criteria, we give the following lemma.
ℓm,px (T , θ|y, x) = − Tj 1− −
nT 2
log 2π −
j=1
2
log |Σj |
m+1
−
2 j =1 t =T +1 j−1
{yt − (x′t ⊗ In )φj }′ Σj−1 {yt − (x′t ⊗ In )φj },
Lemma 1. Let BIj (v) be defined as in (4). Then,
E max BIj (v) =
2 r1j + r1j r2j + r2j2
v
[
r1j + r2j
,
(5)
]
E aIj argmax BIj (v) v
=2
a2
2 r2j (2r1j + r2j )
γ2j
(r1j + r2j )2 [ ] E aIj argmax Bj (v)
+
a1
γ1j
2 r1j (r1j + 2r2j )
(r1j + r2j )2
,
(6)
,
(7)
v
=2
a2
γ2j
2 r2j (2r1j + r2j )
(r1j + r2j )2
−
a1
γ1j
2 r1j (r1j + 2r2j )
(r1j + r2j )2
where aIj = a1 if v ≤ 0 and aIj = a2 if v > 0, r1j = ω1j /γ1j and r2j = ω2j /γ2j . Although the above three expectations are generally expressed using complicated forms, they can be simplified in special cases. For example, when a1 = γ1j and a2 = γ2j , (6) and (7) become
m+1
− 1Tj
vec(A1j )′ Ω1j vec(A1j ) + δj′ Q1j δj , 4 1 γ1j = tr(A21j ) + δj′ Q1j δj , A1j = (Σj0 )−1/2 Ψj (Σj0 )−1/2 , 2 1 ω2j = vec(A2j )′ Ω2j vec(A2j ) + δj′ Q2j δj , 4 1 γ2j = tr(A22j ) + δj′ Q2j δj , A2j = (Σj0+1 )−1/2 Ψj (Σj0+1 )−1/2 , 2
(2)
where 1Tj = Tj − Tj−1 and the subscripts m and px signify that the log-likelihood function depends on the number of structural changes and the selected regressors, respectively. The maximum likelihood estimators (MLE) of θ and T for given m and px are
] [ 2 r1j + r1j r2j + r2j2 I I E aj argmax Bj (v) = 2 , r1j + r2j v [ ] (r2j − r1j )(r1j2 + 3r1j r2j + r2j2 ) E aIj argmax BIj (v) = 2 . (r1j + r2j )2 v In addition, if xjt is homoskedastic across the regimes, BIj (v) has a symmetric distribution with ω1j = ω2j = ωj and γ1j = γ2j = γj so
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
that the expectations reduce to
E max BIj (v) = v
[
3 2
rj ,
[
]
E aIj argmax BIj (v) = 3rj , v
]
E aIj argmax BIj (v) = 0,
= bm,px ,1 + bm,px ,2 + bm,px ,3 + bm,px ,4 ,
v
where rj = r1j = r2j . Moreover, if r1j = r2j = 1 as in the case of the Gaussian error, then we have
] [ 3 I I E (v) = , E aj argmax Bj (v) = 3, v 2 v [ ] I I E aj argmax Bj (v) = 0,
221
+ Ey ℓm,px (T 0 , θ 0 |y, x) − Ey∗ ℓm,px (T 0 , θ 0 |y∗ , x∗ ) + Ey Ey∗ ℓm,px (T 0 , θ 0 |y∗ , x∗ ) − Ey∗ ℓm,px (T 0 , θˆ |y∗ , x∗ ) + Ey Ey∗ ℓm,px (T 0 , θˆ |y∗ , x∗ ) − Ey∗ ℓm,px (Tˆ , θˆ |y∗ , x∗ )
max BIj
As in the literature we evaluate bm,px ,1 to bm,px ,4 omitting the o(1) terms for the true values of m and px to obtain the modified AIC.
pall φ
which are the same as obtained in Ninomiya (2005). Lemma 1 will be used to derive the modified AIC and the modified Cp criterion in the next section.
bm,px ,1 =
3. Derivation of model selection criteria
bm,px ,2 = 0,
− κ4j
m+1
+
2
+
4
j =1
2 m − r1j + r1j r2j + r2j2
r1j + r2j
j =1
− κ4j
m+1
bm,px ,3 =
4
j =1
In this section we derive the three model selection criteria, AIC, Cp criterion and BIC, taking structural changes into account. ¯ be the largest number of regressors More precisely, let p¯ x and m and the largest number of structural changes, respectively, which we have to prespecify. We propose to choose px and m among ¯ respectively, the p¯ x candidates for regressors and 0 ≤ m ≤ m, based on the derived model selection criteria, where though we conventionally state ‘‘choose px ,’’ we imply that we select an optimal set of regressors x1t , x2t , . . . , xm+1t among the p¯ x candidates. Note that all of the following three criteria are designed to choose the model that minimizes them.
and
+
2 m − r1j + r1j r2j + r2j2
bm,px ,4 =
r1j + r2j
j=1
pall φ 2
,
,
.
Proposition 1 suggests that the modified AIC should be defined as
ˆ y, x) + 2pall MAIC (m, px ) = −2ℓm,px (Tˆ , θ| φ m+1
−
+
κˆ 4j + 4
j =1
2 m − rˆ1j + rˆ1j rˆ2j + rˆ2j2
rˆ1j + rˆ2j
j =1
,
(10)
where ˆ denotes the consistent estimator of the corresponding parameter. The estimators of the parameters are given by κˆ 4j =
3.1. Akaike information criterion Akaike information criterion (AIC) is defined as the unbiased estimator of (−2) times the expected log-likelihood given by Ey [Ey∗ [ℓm,px (Tˆ , θˆ |y∗ , x∗ )]], which is in turn equivalent to 2 times the Kullback–Leibler (KL) information, where y∗ and x∗ have the same distribution as y and x but are independent of y and x, and Ey and Ey∗ are expectation operators with respect to y and y∗ , respectively. See, for example, Burnham and Anderson (2002) and Konishi and Kitagawa (2008). Note that Tˆ and θˆ are based not on y∗ but on y. Since we choose the model that minimizes the AIC, we can interpret the chosen model as optimal in the sense that it minimizes the KL information. Akaike (1973) proposed to estimate the expected log-likelihood by the empirical log-likelihood but he also showed that the empirical log-likelihood is the biased estimator of the expected log-likelihood in finite samples. As in Akaike (1973) we consider the following criterion that depends on the number of structural changes m and the selected regressors, which we conventionally denote as px :
ˆ j ), tr(Ω rˆ1j =
1 vec 4
ˆ j vec(Aˆ 1j ) + δˆj′ Qˆ 1j δˆj (Aˆ 1j )′ Ω , 1 tr(Aˆ 21j ) + δˆ j′ Qˆ 1j δˆ j 2
and rˆ2j =
ˆ j+1 vec(Aˆ 2j ) + δˆj′ Qˆ 2j δˆj (Aˆ 2j )′ Ω , 1 tr(Aˆ 22j ) + δˆ j′ Qˆ 2j δˆ j
1 vec 4
2
where ˆ
ˆj = Ω
Tj −
1
1Tˆj
vec(ηˆ t ηˆ t′ − In )vec(ηˆ t ηˆ t′ − In )′ ,
t =Tˆj−1 +1
δˆj = φˆ j+1 − φˆ j , ˆ j−1/2 Σ ˆ j +1 − Σ ˆj Σ ˆ j−1/2 , Aˆ 1j = Σ ˆ
ˆ j−+11/2 Σ ˆ j +1 − Σ ˆj Σ ˆ j−+11/2 , Aˆ 2j = Σ
ˆj = Σ
(8)
where bm,px (Tˆ , θˆ ) = Ey [ℓm,px (Tˆ , θˆ |y, x) − Ey∗ [ℓm,px (Tˆ , θˆ |y∗ , x∗ )]] corresponds to the bias. Since the first term on the right hand side of (8) is the maximized log-likelihood obtained by the QML estimation, we only need to evaluate the bias term explicitly. In order to calculate the bias, we decompose bm,px (Tˆ , θˆ ) into four parts as follows:
ˆ y, x) − Ey∗ ℓm,px (Tˆ , θ| ˆ y ∗ , x∗ ) = Ey ℓm,px (Tˆ , θ| ˆ y, x) − ℓm,px (T 0 , θ 0 |y, x) = Ey ℓm,px (Tˆ , θ|
bm,px (Tˆ , θˆ )
(9)
Proposition 1. Under Assumptions A1–A6 with m = m0 and px = p0x , the bias terms bm,px ,1 , bm,px ,2 , bm,px ,3 and bm,px ,4 are given by, omitting to the o(1) terms,
v
AIC (m, px ) = −2ℓm,px (Tˆ , θˆ |y, x) + 2bm,px (Tˆ , θˆ ),
say.
Qˆ 1j =
1Tˆj
Tj −
εˆ t εˆ t′
t =Tˆj−1 +1
Tˆj
1
1Tˆj
1
−
ˆ j−1 ) (xt x′t ⊗ Σ
t =Tˆj−1 +1
ˆ
and Qˆ 2j =
1
1Tˆj+1
Tj+1 −
ˆ j−+11 ) (xt x′t ⊗ Σ
t =Tˆj +1
with εˆ t = yt − (x′t ⊗ In )φˆ j for Tˆj−1 + 1 ≤ t ≤ Tˆj (j = 1, . . . , m). Although (10) is expressed in a complicated form, the modified AIC can be simplified in several interesting cases. For example,
222
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
when there are no weakly exogenous regressors and model (1) is a pure regime-wise stationary VAR model, the limiting distributions of the break point estimators are symmetric with ω1j = ω2j = ωj and γ1j = γ2j = γj as shown by Bai (2000). In this case, r1j = r2j = rj so that m+1
ˆ y, x) + 2pall MAIC (m, px ) = −2ℓm,px (Tˆ , θ| φ +
− j =1
κˆ 4j + 6
m −
rˆj .
j =1
When ηt is Gaussian, it can be shown that κ4j = n(n + 1) = 2pσ for all j. In addition, we have ω1j = γ1j and ω2j = γ2j because vec(A1j )′ Ω1j vec(A1j )/2 = tr(A21j ) and vec(A2j )′ Ω2j vec(A2j )/2 =
tr(A22j ) as shown in Remark 5 of Qu and Perron (2007). As a result, r1j = r2j = 1 so that the modified AIC reduces to all MAIC (m, px ) = −2ℓm,px (Tˆ , θˆ |y, x) + 2(pall φ + pσ ) + 6m.
(11)
Even if ηt is not Gaussian but if there are no changes in the variance matrices, the modified AIC takes a simple form. In this case, BIj (s) becomes as given by (4) with ω1j = γ1j = δj′ Q1j δj and ω2j = γ2j = δj′ Q2j δj . Again, we have r1j = r2j = 1, so that the modified AIC reduces to
ˆ y, x) + 2pall MAIC (m, px ) = −2ℓm,px (Tˆ , θ| φ + 6m
(12)
where we omit κ4 because this term is common for all the models in this case. In any case we can see that the penalty term of the modified AIC is different from the classical one, which is 2 times the number of unknown coefficients given by 2pall φ . In order to see the intuitive meaning of the penalty term of the modified AIC, we first consider the case where m is prefixed and one additional regressor is included in each regime. In this case, the third and fourth terms on the right hand side of (10) are basically the same except for the changes in the estimators of κ4j , r1j and r2j while the second term is increased by 2(m + 1)n because the difference between the ∑m+1 number of unknown coefficients in the two cases is 2 j=1 n(pxj + 1) − 2 j=1 npxj = 2(m + 1)n. Since (m + 1)n is the number of additional unknown coefficients, the penalty on the additional regressors is the same as in the classical case. Next, let us consider the case where the number of structural changes is increased from m − 1 to m. For ease of exposition, consider the case where the additional break is found in the last regime with pxm+1 regressors. From (10) the penalty term is increased by
words, the uncertainty of the break point always leads to a larger log-likelihood (or a smaller model selection criterion) and the third term of (13) can be interpreted as the penalty on this uncertainty. In fact, we can see from (3) and Lemma 1 that the third term in the parentheses of (13) is an approximation of E [γjI vT2 |Tˆm − Tm0 |], which can be seen as a measure of the uncertainty of the mth break point, where γjI = γ1j for v ≤ 0 and γjI = γ2j for v > 0. Thus, the more uncertain or more volatile the break point estimator, the heavier the penalty imposed on the modified criterion. As is seen in (11) and (12), this penalty becomes equal to 6 (=2 × 3) in special cases such as when the error is Gaussian and when there are no breaks in the variance matrix, because E [γjI vT2 |Tˆm − Tm0 |] = 3 in these cases. We also note that the third term of (13) is an increasing function of r1j = ω1j /γ1j and r2j = ω2j /γ2j , both of which become larger when the break point estimator is more volatile (for larger ω1j and ω2j ). Since this penalty on the uncertainty is positive, the modified AIC will choose a smaller number of structural changes than the classical AIC. This property will be confirmed by the simulations in a later section. 3.2. Mallows’ Cp criterion Mallows (1973) focused on the prediction of the conditional mean of a univariate model and proposed as a measure of adequacy for prediction the scaled sum of squared forecast errors. In this subsection we extend the Mallows’ Cp criterion to multivariate models with multiple structural changes by introducing a multivariate version of the scaled sum of squared forecast errors. The model minimizing the modified Cp criterion is optimal from the viewpoint of the minimization of the risk function based on the forecast errors. Let µ0t = (x′t ⊗ In )φj0 be the conditional mean of yt for Tj0−1 + 1 ≤ t ≤ Tj0 and µ ˆ t = (x′t ⊗ In )φˆ j be its estimator for Tˆj−1 + 1 ≤ t ≤ Tˆj (j = 1, . . . , m + 1). As suggested by Mallows (1973) we adopt as a measure of adequacy of prediction the trace of the scaled residual variance matrix given by
∑m+1
2 npxm+1 +
κˆ 4,m+1 2
+2
2 2 rˆ1m + rˆ1m rˆ2m + rˆ2m
rˆ1m + rˆ2m
.
m+1
Jm,px =
j =1
tr Σj0−1
Tj0 − ′ µ ˆ t − µ0t µ ˆ t − µ0t .
t =Tj0−1 +1
Let εˆ t = yt − µ ˆ t for Tˆj−1 + 1 ≤ t ≤ Tˆj as defined before and ε˜ t = yt − µ ˜ t with µ ˜ t = (x′t ⊗ In )φˆ j for Tj0−1 + 1 ≤ t ≤ Tj0 . That is, εˆ t is the residual from the ML estimation while ε˜ t is the residual when we forecast the conditional mean by (xt ⊗ In )φˆ j in the true regimes. Since yt = µ0t + εt = µ ˆ t + εˆ t = µ ˜ t + ε˜ t , Jm,px becomes
(13)
The first term in the parentheses of (13) is interpreted as the penalty on the additional regressors, while the second term corresponds to the penalty on the increased number of unknown variance components. That is, when the number of structural changes increases, we need to estimate the variance matrix in the additional new regime given by Σm+1 and κˆ 4,m+1 /2 can be interpreted as the penalty on Σm+1 . In fact, when ηt is Gaussian, κ4,m+1 /2 becomes equal to n(n + 1)/2, which is the same as the number of unknown components, pσ , in Σm+1 . On the other hand, the third term in the parentheses of (13) can be interpreted as the penalty on searching for an additional break because this term does not appear when the mth break point, Tm0 , is known but does appear when we estimate it, as is understood from the proof of Proposition 1. When we estimate the additional unknown break, we look for the break point that maximizes the log-likelihood; further, the maximizing point is not necessarily the true break date. In general, the maximization is possibly attained at a point different from the true break date in finite samples. In other
−
0
Tj − −
m+1
Jm,px =
(εt − εˆ t )′ Σj0−1 (εt − εˆ t )
j =1 t =T 0 +1 j−1
=
0
Tj − −
m+1
ε
ε −2
+
(yt − µ ˆ t )′ Σj0−1 εt
j =1 t =T 0 +1 j−1
j =1 t =T 0 +1 j−1 0
Tj − −
m+1
0
Tj − −
m+1 0−1 t t Σj
′
εˆ t′ Σj0−1 εˆ t
j =1 t =T 0 +1 j−1
=−
0
Tj − −
m+1
ε
ε +2
j =1 t =T 0 +1 j−1
Tˆj − − − εˆ t′ Σj0−1 εˆ t − m+1
j =1
0
Tj − −
m+1 0−1 t t Σj
′
t =Tˆj−1 +1
(µ ˆ t − µ0t )′ Σj0−1 εt
j =1 t =T 0 +1 j−1 0
Tj − t =Tj0−1 +1
εˆ t′ Σj0−1 εˆ t
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
ˆ
Tj − −
m+1
+
Proposition 2. Under Assumptions A1–A6 with Σj0+1 − Σj0 = vT2 Ψj
εˆ t′ Σj0−1 εˆ t
and with m = m0 and px = p0x , the expectations of the first three terms of Jm,px are given by, omitting the o(1) terms,
j=1 t =Tˆ j−1 +1
ˆ
Tj − −
m+1
= Jm,px ,1 + Jm,px ,2 + Jm,px ,3 +
εˆ t′ Σj0−1 εˆ t ,
E [Jm,px ,1 ] = −nT ,
say.
j=1 t =Tˆ j−1 +1
(14) Following Mallows’ (1973) original work, the modified Cp criterion is defined as MCp (m, px ) =
ˆ t−,m1¯ ,¯p εˆ t εˆ t′ Σ x
ˆ t ,m¯ ,¯px is the estimator of the variance matrix based on the where Σ ¯ and px = p¯ x . For a univariate most general model using m = m case (n = 1) with no structural changes, Mallows (1973) showed that E [Jm,px ,1 + Jm,px ,2 + Jm,px ,3 ] = 2px − T and hence the modified Cp criterion reduces to the classical form by ignoring −T . For our model (1) with structural changes, it is easy to see that E [Jm,px ,1 ] = −nT , which is common for all the models and can be ignored. On the other hand, it can be shown that the dominant term in Jm,px ,3 is of order vT−1 while Jm,px ,2 is Op (1). As can be seen in the Proof of Proposition 2 we have −
1(Tˆj < Tj0 )tr(A1j )vT2 (Tˆj − Tj0 )
tr(AIj ) argmax BIj (v) = J¯m,px ,3 , v
j =1
say,
(15)
where AIj = A1j when v ≤ 0 and AIj = A2j when v > 0. Then, the natural candidate for the modified Cp criterion would be MCp (m, px ) =
∑m+1 ∑Tˆj j=1
t =Tˆj−1 +1
ˆ t−,m1¯ ,¯p εˆ t + E [J¯m,px ,3 ]. Howεˆ t′ Σ x
ever, this criterion may not be informative for the choice of models. For example, when xjt is homoskedastic in all the regimes, argmaxv BIj (v) has a symmetric distribution, which implies E [J¯m,px ,3 ] = 0. In this case, the modified criterion consists of only
∑m+1 ∑Tˆj j =1
t =Tˆj−1 +1
ˆ t−,m1¯ ,¯p εˆ t , which is always minimized εˆ t′ Σ x
when we choose the most general model. In order to avoid the above problem we need to evaluate the second dominant term in Jm,px ,3 , which would be of the same order as is Jm,px ,2 . The problem here is that the second dominant term in
Jm,px ,3 will be obtained by the higher order expansion of vT2 (Tˆj − Tj0 ), which is tedious to derive in practice. We might construct a new criterion by ignoring the whole term of Jm,px ,3 but such a criterion may not be optimal from the viewpoint of a measure of adequacy for prediction. Because of the above reason, we do not construct the modified Cp criterion under Assumptions A1–A6. Instead, we consider the same criterion under more restrictive assumptions; we impose restrictions on the breaks in the variance matrices such that Σj0+1 −
= v In other words, we allow only those breaks that are smaller than the ones supposed in Assumption A4. Note that by changing the assumption, the limiting distributions of the break point estimators are not affected by the breaks in the variance matrices but depend only on the breaks in the coefficients, whereas a measure of prediction given by Jm,px still depends on the breaks in the variance matrices through Jm,px ,3 . Σj0
2 T Ψj .
γ2j
,
where γ1j = δj′ Q1j δj and γ2j = δj′ Q2j δj in this case.
ˆ
Tj − −
ˆ t−,m1¯ ,¯p εˆ t + 2pall εˆ t′ Σ φ + 6m x
j=1 t =Tˆ j−1 +1
+
m 3−
tr(Aˆ 1j )
2 j =1
γˆ1j
−
tr(Aˆ 2j )
γˆ2j
.
(16)
Note that the last penalty term might be negative depending on the asymmetric property of the limiting distributions of the break point estimators. In the symmetric case the last penalty term disappears so that the modified Cp criterion reduces to MCp (m, px ) =
ˆ
Tj − −
m+1
ˆ t−,m1¯ ,¯p εˆ t + 2pall εˆ t′ Σ φ + 6m. x
(17)
. . . , m0 ) become symmetric when there are no structural changes
m
−
tr(A2j )
For example, the two sided Brownian motions BIj (v) (j = 1,
+ 1(Tˆj > Tj0 )tr(A2j )vT2 (Tˆj − Tj0 ) + op (1) d
−
j=1 t =Tˆ j−1 +1
j=1
−→ −
tr(A1j )
γ1j
2 j =1
MCp (m, px ) =
+ E [Jm,px ,1 + Jm,px ,2 + Jm,px ,3 ]
vT Jm,px ,3 =
3
m+1
j=1 t =Tˆ j−1 +1
m −
E [Jm,px ,3 ] =
E [Jm,px ,2 ] = 2pall φ + 6m,
m −
Proposition 2 suggests that the modified Cp criterion should be given by, ignoring the o(1) terms,
ˆ
Tj − −
m+1
223
in the variance matrices, when (1) is a pure regime-wise stationary VAR model and when xjt is homogeneous across the regimes. From (16) we can see that the penalty term of the modified Cp criterion is different from the classical criterion. As in the case of the modified AIC, the second term on the right hand side of (16) is the penalty on the additional regressors while the third term can be interpreted as the penalty on the uncertainty associated with estimating the break points. The last term of (16) is related with the breaks in the variance matrices and the asymmetry of the break point estimators. From the Proof of Proposition 2 we can see that the last penalty term is an approximation of ∑ 0 2 ˆ − m j=1 tr(Aj )E [vT (Tj − Tj )] where we used the fact that A1j = A2j = (Σ 0 )−1/2 Ψ (Σ 0 )−1/2 = Aj , say, asymptotically. This implies that the additional positive penalty is imposed when Ψj > 0 and
E [vT2 (Tˆj − Tj0 )] < 0 (or Ψj < 0 and E [vT2 (Tˆj − Tj0 )] > 0) and vice versa. To interpret this property, let us assume that Σj0 for j = 1, . . . , m + 1 are known. In this case, the contribution to Jm,px from the jth regime should be given by tr[Σj0−1
∑Tj0
t =Tj0−1 +1
(µˆt −µ0t )(µ ˆ t−
µ0t )′ ] but we need to replace Tj0 with Tˆj for j = 1, . . . , m in practice. ∑Tˆj That is, the actual contribution is given by tr[Σj0−1 (µˆt − t =Tˆ +1 j−1
µ0t )(µ ˆ t −µ0t )′ ]. In this case, if Ψj > 0, or equivalently, Σj0−1 is larger than Σj0+−11 in terms of the matrix, and if Tˆj tends to be smaller than Tj0 , then the contribution to Jm,px from the jth regime decreases more than expected. In order to adjust the smaller contribution from the jth regime, we need an additional positive penalty. On the other hand, if, again, Σj0−1 is relatively large but if the distribution of Tˆj is skewed to the right, the contribution to Jm,px from the jth regime is larger than expected and hence we need to reduce this contribution by a negative penalty. Thus, the last term of (16) can be interpreted as the penalty on the overweight or underweight
224
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
on the forecast errors caused by the asymmetry of the break point estimators. We should keep in mind that the modified Cp criterion given by (16) is optimal from the viewpoint of minimizing the risk function based on the prediction errors only when the magnitude of structural changes in the variance matrices is negligibly small as compared to the magnitude of shifts in the coefficients. In general, the modified Cp may be applied only if we rule out variance breaks a priori. 3.3. Bayesian information criterion Schwarz (1978) considered the problem of model selection in the Bayesian framework, in which a model is selected based on the posterior probability. In this subsection, we derive the modified BIC for models with structural changes using the Laplace approximation technique as explained in Konishi and Kitagawa (2008). The model selected by the modified BIC is optimal from the viewpoint of the maximization of the posterior probability. Let M (m, px , T ) be a model for given m, px and T , P (M (m, px , T )) be the prior probability of a given model, fM (y|θ , x) be the probability density function (pdf) of y conditional on θ and x for a given model M = M (m, px , T ) and πM (θ ) be the prior pdf for θ . Then, the posterior probability of model M (m, px , T ) is given by gM (y|x)P (M (m, px , T )) , P (M (m, px , T )|y, x) = ∑ gM (y|x)P (M (m, px , T ))
(18)
where the summation is taken over models M (m, px , T ) and gM (y|x) is the marginal pdf of y conditional on x defined as gM (y|x) = fM (y|θ , x) πM (θ )dθ . We adopt the model that maximizes the posterior probability (18), but the maximization of (18) is equivalent to the maximization of the numerator on the right hand side of (18) because the denominator is common for all the models. Thus, we consider the maximization of gM (y|x)P (M (m, px , T )), or equivalently, the minimization of
− 2 log {gM (y|x)P (M (m, px , T ))} = −2 log gM (y|x) − 2 log P (M (m, px , T )) .
by (1/T m ) × m!/ j=1 (1 − j/T ). With the assumption that the ¯ + 1) (the uniform prior prior probability of m is given by 1/(m ¯ the prior probability of a∏ for m = 0, 1, . . . , m), given model be¯ + 1)}/ m comes αm,T /T m where αm,T = {m!/(m j=1 (1 − j/T ). • (Example 2) Let us again consider the continuous time framework as in Example 1 and suppose that the prior for m is a Poisson process with mean T β where the prior for β is noninformative (β > 0). In this case, the pdf of T1 , T2 , . . . , Tm conditional on m is given by m!/T m (Theorem 2.3.1 in Ross (Ross, 1996)) ∞ and hence the prior probability of a given model becomes 0 {e−T β (T β)m /m!} × (m!/T m )dβ = m!/T m+1 . This motivates us to assume the prior given in Assumption A7 (c) in the discrete time framework. • (Example 3) Suppose that the probability of the occurrence of structural change at each time is given by a Binomial distribution with parameter ρ , for which the prior is uniform on (0, 1). In this case, the prior probability of a given model 1 ∏m becomes 0 ρ m (1 − ρ)m dρ = (1/T m ) × m!/{(T + 1) j=1 [1 − (j − 1)/T ]}, which is of the same form as in Assumption A7 (c).
∏m
Let the MLE of θ = [φ ′ , σ ′ ]′ for a given set of break points T be θˇ = [φˇ ′ , σˇ ′ ]′ where φˇ = [φˇ 1 , φˇ 2 , . . . , φˇ m+1 ]′ and σˇ = ˇ 1 )′ , vec(Σ ˇ 2 )′ , . . . , vec(Σ ˇ m+1 )′ ]′ . Again, φˇ contains 0’s if the [vec(Σ different regressors are allowed in different regimes. Note that θˆ is the global MLE with T = Tˆ whereas θˇ is obtained for an arbitrary given T . Thus, θˇ is different from θˆ in general and they are the same only in the case where T = Tˆ . While the second term on the right hand side of (19) is given by 2m log T − 2 log αm,T , we need to evaluate the first term to obtain the modified BIC. Proposition 3. Under Assumption A7, the logarithm of the marginal pdf of y given x is expressed as log gM (y|x) = ℓm,px (T , θˇ |y, x) m+1
−
Assumption A7. (a) The conditional pdf of y, fM (y|θ, x), is Gaussian. (b) The priors for ϕ1 , . . . , ϕm+1 (the non-zero elements −1 of φ1 , . . . , φm+1 ) and Σ1−1 , . . . , Σm +1 are noninformative. (c) The prior probability of M (m, px , T ) is given by P (M (m, px , T )) = αm,T /T m where 0 < α < αm,T < α¯ < ∞. Note that we do not have to assume shrinking shifts in the Bayesian framework. Instead, we make assumptions on the priors. Assumption A7 (a) states that we base our analysis on a Gaussian distribution. We can interpret Assumption A7 (b) such that we do not have a priori information on the distributional property of the parameters. Note that the noninformative priors for the reciprocal of the variance matrices are sometimes considered in the Bayesian framework. Assumption A7 (c) is motivated from the following three examples:
log(Tj − Tj−1 ) + Op (1).
(20)
Proposition 3 suggests that, since the second term on the right hand side of (19) is given by 2m log T + O(1) from Assumption A7 (c), we should minimize, ignoring the Op (1) terms, m+1
MBIC (m, px , T ) = −2ℓm,px (T , θˇ |y, x) +
−
pφj + pσ
j =1
× log(Tj − Tj−1 ) + 2m log T .
(21)
Although it is possible to minimize (21) over m, px and T , we can conveniently modify it by using the MLEs Tˆ and θˆ . Since log(Tj − Tj−1 ) = log(Tˆj − Tˆj−1 )+ log(Tj − Tj−1 )/(Tˆj − Tˆj−1 ) = log(Tˆj − Tˆj−1 )+ Op (1) because minimization is taken over {(λ1 , . . . , λm ); |λj+1 − λj | ≥ ϵ for j = 1, . . . , m}, the right hand side of (21) becomes, ignoring the Op (1) term again, m+1
−2ℓm,px (T , θˇ |y, x) +
−
pφj + pσ log(Tˆj − Tˆj−1 )
j =1
• (Example 1) Let us first consider the continuous time framework with 0 < t < T . For a given m, let S1 , S2 , . . . , Sm be m candidates for the break dates that are independently uniformly distributed on (0, T ). Then, the pdf of Sj is 1/T for j = 1, . . . , m. In this case the break points T1 , T2 , . . . , Tm correspond to the order statistics S(1) , S(2) , . . . , S(m) where S(j) is the jth smallest value among S1 , S2 , . . . , Sm . As a result, the joint pdf of T1 , T2 , . . . , Tm is given by m!/T m . This motivates us to assume that for discrete time t = 1, 2, . . . , T the joint probability function of T1 , T2 , . . . , Tm is proportional to 1/T m and given
2
j =1
(19)
In order to evaluate (19), we make the following assumptions.
− pφj + pσ
+ 2m log T . ˇ y, x) = −ℓm,px (Tˆ , θˆ |y, x), we replace T Since minT −ℓm,px (T , θ| and θˇ with the MLEs Tˆ and θˆ and propose the following modified BIC: ˆ y, x) MBIC 1(m, px ) = −2ℓm,px (Tˆ , θ| m+1
+
−
pφj + pσ log(Tˆj − Tˆj−1 ) + 2m log T .
j =1
(22)
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
Our preliminary simulations show that this modification does not result in a significant difference in finite sample property. ˆ j − λˆ j−1 ) = log T + Op (1), Since log(Tˆj − Tˆj−1 ) = log T + log(λ we may also simplify the modified BIC (22) as
ˆ y, x) MBIC 2(m, px ) = −2ℓm,px (Tˆ , θ| all + pall φ + pσ + 2m log T .
(23)
It is not difficult to interpret the penalty terms of the two modified BICs. The second term on the right hand side of (22) all and (pall φ + pσ ) log T of (23) are the penalty on the additional unknown coefficients and variance components while 2m log T can be interpreted as the penalty on the uncertainty of the break points. Since the penalty of the classical BIC on the unknown coefficients is given by pall φ log T , we can see from (22) that MBIC 1 will tend to choose more regressors than the classical BIC for a given m ≥ 1. We also note that the modified BIC (23) takes a form similar to Yao’s (Yao, 1988) BIC, which is given by MBICy (m, px ) = −2ℓm,px (Tˆ , θˆ |y, x) all + (pall φ + pσ + m) log T .
(24)
On comparing the penalty term, we can see that our modified BIC will tend to choose less number of structural changes than Yao’s BIC.1 3.4. Consistency In this subsection we investigate whether or not the model selection criteria derived in this paper can choose the appropriate regressors and the true number of structural changes. It is well known that the BIC can consistently choose the true lag length for time series models with no structural changes while the AIC and the Cp criterion tend to choose longer lags than the true one. See, for example, Shibata (1976), Hannan (1980) and Hannan and Deistler (1988). Thus, we expect that the modified BIC can consistently choose the regressors while the modified AIC and the modified Cp criterion may choose a larger set of regressors than the true ones. In addition, since our two modified BICs take a form similar to Yao’s (Yao, 1988) BIC, which is proved to consistently estimate the number of breaks for a local level Gaussian model, we expect that our modified BIC may have the same property. In the following, we conventionally use the statement ‘‘px > p0x ,’’ which means that the true regressors are included in each regime and there are extra regressors at least in one regime. On the other hand, ‘‘px < p0x ’’ implies that some true regressors are excluded at least from one regime irrespective of whether or not the extra regressors are included in some regimes. In the case of ‘‘px = p0x ’’ each regime includes only the true regressors. Then, {px < p0x } ∪ {px > p0x } ∪ {px = p0x } covers all the possible choices of the regressors. To investigate the consistency of the estimated px and m, let MIC (m, px ) be a general expression of the model selection criterion defined as follows: m+1
MIC (m, px ) = −2ℓm,px (Tˆ , θˆ |y, x) +
−
(pφj + pσ )g1 (T )
j =1
+ mg2 (T ),
Proposition 4. Assume that Assumptions A1–A6 hold. (i) If g1 (T ) → ∞ while gi (T )/(T vT2 ) → 0 for i = 1 and 2, then ˆ = m0 and pˆ x = p0x ) → 1 as T → ∞. P (m (ii) If g2 (T ) → ∞ and g2 (T )/(T vT2 ) → 0 while g1 (T ) = O(1), then ˆ = m0 and pˆ x ≥ p0x ) → 1 as T → ∞. P (m ˆ ≥ m0 and pˆ x ≥ (iii) If gi (T ) = O(1) for i = 1 and 2, then P (m p0x ) → 1 as T → ∞. Proposition 4(i) implies that the divergence of g1 (T ) guarantees ˆ and pˆ irrespective of whether or not the consistency of both m g2 (T ) goes to infinity. Intuitively, this is because the consistency ˆ (pˆ ) requires the divergence of the penalty term when the of m extra breaks (extra regressors) are included. Since the coefficient associated with g1 (T ) increases when either the extra regressors or the extra breaks are included, we have Proposition 4(i). In the case of (ii) the penalty term does not diverge when the extra regressors are included and hence there is a positive probability of pˆ x > p0x . (iii) can be interpreted similarly. ˆ and pˆ x based on the From Proposition 4(i) we can see that m modified BICs are consistent. Of interest is that both the classical BIC and Yao’s (Yao, 1988) BIC also have the consistent property. On the other hand, the estimators based on the modified AIC are not consistent but they tend to be greater than m0 and p0x with positive probability from Proposition 4(iii). Similarly, we deduce that the modified Cp criterion does not deliver consistent estimators of m and px . Therefore, we can say that the modified BICs have more plausible property than the modified AIC and the modified Cp criterion, at least asymptotically. However, as we will see in the next section, this is not always the case in finite samples. We also note that the proposed methods do not have a small sample justification, because the modified AIC and Cp are derived by ignoring the o(1) bias terms while the numerator of (18) is evaluated only up the terms faster than Op (1) in deriving the modified BIC. They have only a large sample justification and then we need to see the finite sample performance in a later section. 3.5. Extension to more general models So far, we have considered model (1), in which regressors are assumed to be the same for all equations in the system. As a result, we exclude the case where each equation has different regressors, such as SURE models. To accommodate models with different regressors in different equations, model (1) may be extended to yt = (In ⊗ zjt′ )Sj ϕj + εt
= Xjt ϕj + εt ,
0
1
Tj −
1Tj0 t =Tj0−1 +1
p
Xjt′ (Σj0 )−1 Xjt −→ Q1j
and
0
vT 1 Treating T , T , . . . , T as usual unknown parameters, Yao (1988) obtained the 1 2 m all modified BIC by inserting the total number of parameters, which equals pall φ + pσ + m in our setup, into the penalty term of the classical BIC. Since we treat the change points in a different way, our modified BIC does not coincide with Yao’s BIC.
(26)
where Xjt = (In ⊗ zjt′ )Sj , zjt is a vector containing all the regressors in the jth regime and Sj is a selection matrix of full column rank. See also Qu and Perron (2007) for detailed explanation of the specification. Because of the existence of the selection matrix Sj , we can choose different regressors for different equations. Even for extended model (26) we obtain the same modified AIC and Cp by slightly changing Assumption A6 as
(25)
where g1 (T ) and g2 (T ) are sequences of positive non-decreasing ¯ and that p¯ x includes all the true numbers. Suppose that m0 ≤ m ˆ and pˆ x be chosen such that the MIC is minimized regressors. Let m ¯ and among the p¯ x regressors. over 0 ≤ m ≤ m
225
Tj −
1/2
Xjt′ (Σj0 )−1/2 ηt ⇒ Q1j ζ1j (v).
−2 t =Tj0 −[vvT ]
Propositions 1 and 2 are proved in the same line as in Appendix with x′t ⊗ In and xt x′t ⊗ Σj−1 replaced by Xjt and Xjt′ Σj−1 Xjt ,
226
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
Table 1 The parameter setting for simulations. No break
DGP0AR1a DGP0AR1b DGP0AR2a DGP0AR2b DGP0AR2c DGP0AR2d DGP0MA1a DGP0MA1b
φ1
φ2
θ1
0.3 0.7 0.2 0.4 0.7 1.3 0.0 0.0
0.0 0.0 0.1 0.3 −0.4 −0.6 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.7
One time break
DGP1AR1a DGP1AR1b DGP1AR2a DGP1AR2b DGP1AR2c DGP1AR2d DGP1MA1a DGP1MA1b
c1
c2
φ11
φ12
φ21
φ22
θ11
θ12
1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0
0.7 0.3 0.4 0.2 1.3 0.7 0.0 0.0
0.3 0.7 0.2 0.4 0.7 1.3 0.0 0.0
0.0 0.0 0.3 0.1 −0.6 −0.4 0.0 0.0
0.0 0.0 0.1 0.3 −0.4 −0.6 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.7 0.3
0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.7
Two time breaks
DGP2AR1a DGP2AR1b DGP2AR2a DGP2AR2b DGP2AR2c DGP2AR2d DGP2MA1a DGP2MA1b
c1
c2
c3
φ11
φ12
φ13
φ21
φ22
φ23
θ11
θ12
θ13
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 −1.0 1.0 −1.0 1.0 −1.0 1.0 −1.0
0.7 0.7 0.4 0.4 1.3 1.3 0.0 0.0
0.3 0.3 0.2 0.3 0.7 1.0 0.0 0.0
0.7 0.1 0.4 0.2 1.3 0.7 0.0 0.0
0.0 0.0 0.3 0.3 −0.6 −0.6 0.0 0.0
0.0 0.0 0.1 0.2 −0.4 −0.5 0.0 0.0
0.0 0.0 0.3 0.1 −0.6 −0.4 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.7 0.7
0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.3
0.0 0.0 0.0 0.0 0.0 0.0 0.7 0.1
respectively. As a result, there is no need for further modification of the modified AIC and the modified Cp criterion for model (26). On the other hand, it is difficult to obtain the modified BIC for model (26) because the Kronecker product structure of regressors given by x′t ⊗ In plays an important role in deriving L (φ,σ ) the modified BIC. For model (26) it can be shown that e1 dφ in expression (42) of Appendix becomes equal to
1/2 ∑ Tj t =Tj−1 +1 Xjt′ Σj−1 Xjt . Since Σj is related with Xjt in a complicated way, we cannot integrate out Σj in (44). ∏m+1 j =1
pxj /2
(2π )
Then, there is no theoretical optimality in a Bayesian sense in applying the modified BIC derived in Section 3.3 to model (26). However, we may still use the modified BIC for SURE models because Proposition 4 holds even for such models, so that the modified BIC remains consistent. Although it may be interesting to investigate the optimal form of the BIC in models with different regressors for different equations, it is beyond the scope of this paper.
¯ in practice 3.6. Choice of p¯ x and m In practice, we first have to determine the p¯ x candidates for ¯ the maximum number of possible breaks. Once regressors and m, ¯ are given, we calculate the modified information criteria p¯ x and m for each possible pair of px and m among 0 ≤ px ≤ p¯ x and ¯ and choose the model with p = pˆ x and m = m ˆ that 0 ≤ m ≤ m minimizes the information criteria. Thus, the determination of p¯ x ¯ are important in practice. and m The candidates for regressors may be chosen based on economic theory or by some theoretical reasons if lagged variables are excluded from regressors. On the other hand, if we allow lagged variables as regressors and have to determine the order of lags, the choice of p¯ x becomes somewhat arbitrary. In economic applications, it is often the case that the maximum order of lags is set not to be too large, such as within three or four years.
Similarly, there is no theoretical guideline for choosing the ¯ maximum number of possible breaks and we may choose m arbitrarily. One thing we can say is that we should take into account the possibility of breaks caused by historically large shocks and changes of economic and political systems, such as the Great Crash, the oil price shocks, the reunification of Germany, the Asian crises, etc. In other words, we may set the maximum number of breaks slightly larger than the number of historical events that possibly cause structural changes. We also note that the consideration of these historical events may help us choose the trimming parameter ϵ , which determines the minimum length of one regime (see Section 2). Since the sample size in each regime is at least ϵ T , we should choose ϵ so that two consecutive historical shocks are not included within the length of ϵ T . 4. Finite sample property In this section we investigate the finite sample property of the model selection criteria developed in the previous section. We consider univariate AR(1), AR(2) and MA(1) processes with εt ∼ i.i.d.N (0, 1) innovations and possible structural changes as the data generating process (DGP) and examine the performance of the modified criteria by estimating the lag length py as well as the number of breaks m. Note that an MA(1) process is not covered by our model (1) but we include this case in order to see the performance of the modified criteria for an AR(∞) process. In the case of no breaks, the DGP is given by DGP0 : yt = φ1 yt −1 + φ2 yt −2 + εt − θ1 εt −1 : 1 ≤ t ≤ T , where the sets of the parameters are summarized in the first panel of Table 1. DGP0AR1a corresponds to the AR(1) case with weak positive serial correlation while yt is moderately serially correlated for DGP0A1b. DGP0AR2a-b are the AR(2) cases with real valued characteristic roots, whereas DGP0AR2c-d have complex
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
roots. We choose these parameters so that φ1 + φ2 = 0.3 (the case of weak serial correlation) or 0.7 (the case of moderate serial correlation). Similarly, we consider two DGPs for the MA(1) case (DGP0MA1a-b). A process with one time break is generated by
227
where the sets of the parameters are summarized in the last panel of Table 1. For DGP2AR1a, AR2a, AR2c and MA1a, serial correlation is weakened by the first break but the process returns to the first regime after the second break (the first and third regimes have the same parameters). DGP2AR1b, AR2b, AR2d and MA1b correspond to the case where the level of the process goes down and the process becomes less persistent gradually because of structural changes. We set T = 120 or 300 while the trimming parameter ϵ is set as 0.05 or 0.15. All computations are carried out by using the GAUSS matrix language with 5000 replications.2 We estimate the AR(py ) model including a constant with structural changes and select the lag length py and the number of breaks m based on the modified AIC in (12), the modified Cp criterion in (17) and the two modified BICs in (22) and (23), with a restriction such that the lag lengths are the same in all the regimes. ¯ = 5 so that we choose a model from among We set p¯ y = 4 and m 0 ≤ py ≤ 4 and 0 ≤ m ≤ 5. To see the effect of our modification, we also estimate py and m by the classical model selection criteria and Yao’s (Yao, 1988) modified BIC. In addition, we compare the finite sample performance of the model selection criteria with that of the sequential testing procedure. Since we do not know the true lag length, we first estimate the number of breaks by the testing procedure that is robust to heteroskedasticity and autocorrelation, and then select the lag length using the estimated number of breaks. More precisely, following the suggestion by Bai and Perron (2006), we first test for the null of no breaks using the UD max test at the 5% significance level allowing different second moments of the regressors as well as the heterogeneity of the variances. Since py is unknown, we regress yt on a constant in each regime and construct the test statistic using the autocorrelation and heteroskedasticity consistent (HAC) estimate of the variance of the error terms with the prewhitening method. If the null hypothesis is rejected, we repeatedly use the sup F (ℓ + 1|ℓ) test constructed in the same way as the UD max test until it cannot reject the hypothesis. Once the number of breaks is estimated, the lag length
is estimated using the Wald test by the general to specific rule. This robust testing procedure is denoted by ‘‘Sq(rb)’’. We also consider the hybrid of the modified criteria with the testing procedure; we first estimate py and m by the modified criteria and using the estimated lag length, estimate the number of breaks with the testing procedure.3 Note that we do not use the HAC estimates of the variance to construct the test statistics in this case because the lag length is estimated. The hybrid method is denoted by ‘‘Sq(MIC ).’’ For example, ‘‘Sq(MAIC )’’ signifies that the lag length is selected by the modified AIC while the number of breaks is estimated by the testing procedure. Table 2a summarizes the simulation results for the case of no break. In the table, Panel (a) reports the frequencies that the methods correctly estimate the lag order (irrespective of the number of breaks) while the entries in Panel (b) are the frequencies of selecting the true number of breaks (irrespective of the lag order). Then, the entries in Panel (a) are the frequencies of pˆ y = 1 and pˆ y = 2 for the AR(1) and AR(2) cases, respectively, while those ˆ = 0. We focus only on the in Panel (b) are the frequencies of m estimation of the number of breaks for the MA(1) case and the ˆ = 0 because entries in this case correspond to the frequencies of m any finite order lags are incorrect. For DGP0AR1a-b, we can see that the classical AIC and Cp criterion rarely choose the true number of breaks whereas they choose the true lag order with moderate frequencies when ϵ = 0.15 and/or T = 300. On the other hand, the classical BIC has a better finite sample property, although its performance is not necessarily satisfactory when ϵ = 0.05 and T = 120. This poor finite sample performance of the classical criteria is dramatically improved by our modification; in particular, the modified BICs have a high probability of selecting the true p and m, as expected from Proposition 4. The problem of the classical criteria is that they tend to choose the larger number of structural changes. For example, in the case of DGP0AR1a with ϵ = 0.05 and T = 120 the frequencies of choosing the model with m = 5 and p = 0 for AIC, BIC and Cp are 0.456, 0.403 and 0.475, respectively. This tendency to over-estimate is well corrected by including the penalty term on the additional breaks. Comparing the modified criteria with the robust testing procedure and the hybrid methods, we find that our modified criteria work better in choosing the number of breaks in most cases. We also note that all the methods perform better for larger T and larger ϵ . For the AR(2) case with DGP0AR2a, it is difficult for all the methods to choose the true lag order whereas the modified criteria choose the true number of breaks with high frequencies. This poor estimation of the lag order is due to the fact that the coefficient associated with yt −2 is so small that shorter lags tend to be selected. Because the penalty of the modified AIC and the modified Cp criterion on the lag order is not as heavy as that of the modified BICs, the former two methods choose the true lag order with higher frequencies. For the other AR(2) case (DGP0AR2b-d) and the MA(1) case (DGP0MA1a-b), the overall performance is similar to the AR(1) case. Table 2b reports the result for the case of one time break. 
As in the case of no breaks, the classical criteria do not perform well in choosing the number of breaks because they tend to select larger breaks; this tendency to over-estimate is fixed by our modification. In this case, while the hybrid method with the modified BICs tends to choose the true number of breaks more frequently than the modified BICs for DGP1AR1a-b and AR2a-b, the relation is reversed for DGP1AR2c-d and the MA(1) case.
2 We need to efficiently calculate the maximized likelihood for the case of multiple structural changes to save computational time. The method is explained in Bai and Perron (1998, 2003) and we made use of the program provided by them.
3 We also conducted simulations for the hybrid of the classical model selection criteria, such as AIC and BIC, with the testing procedure, but the performance is poor and we do not report the results.
y = c1 + φ11 yt −1 + φ21 yt −2 + εt − θ11 εt −1 : t 1 ≤ t ≤ [T /2] DGP1 : yt = c2 + φ12 yt −1 + φ22 yt −2 + εt − θ12 εt −1 : [T /2] + 1 ≤ t < T , where the sets of the parameters are given in the second panel of Table 1. For the AR(1) (DGP1AR1a-b) and AR(2) (DGP1AR2a-d) cases, the sum of the AR coefficients changes from 0.7 to 0.3 or 0.3 to 0.7. The MA(1) case (DGP1MA1a-b) has similar structural changes. The DGP with two time breaks is given by
y = c + φ y + φ y + ε − θ ε : t 1 11 t −1 21 t −2 t 11 t −1 1 ≤ t ≤ [T /3] yt = c2 + φ12 yt −1 + φ22 yt −2 + εt − θ12 εt −1 : DGP2 : [T /3] + 1 ≤ t < [2T /3] y = c3 + φ13 yt −1 + φ23 yt −2 + εt − θ13 εt −1 : t [2T /3] + 1 ≤ t < T ,
228
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
Table 2a The frequencies of selecting the true p and m (no break). DGP0AR1a T
ϵ
DGP0AR1b 120 0.05
300 0.05
120 0.15
300 0.15
T
ϵ
120 0.05
300 0.05
120 0.15
300 0.15
0.493 0.858 0.490 0.689 0.971 0.974 0.974 0.694 0.818
0.353 0.573 0.360 0.659 0.776 0.800 0.806 0.668 0.699
0.605 0.954 0.609 0.726 0.972 0.974 0.974 0.736 0.843
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.460 0.834 0.487 0.639 0.955 0.955 0.959 0.547 0.612
0.616 0.982 0.617 0.681 0.981 0.981 0.981 0.681 0.802
0.592 0.954 0.610 0.698 0.958 0.957 0.958 0.680 0.806
0.647 0.981 0.649 0.708 0.981 0.981 0.981 0.725 0.848
0.000 0.195 0.000 0.705 0.883 0.959 0.986 0.889 0.404 0.508 0.526 0.530 0.391 0.326
0.001 0.814 0.001 0.735 0.990 0.996 0.999 0.963 0.756 0.828 0.829 0.829 0.795 0.663
0.079 0.625 0.092 0.837 0.939 0.978 0.990 0.973 0.843 0.879 0.892 0.895 0.885 0.843
0.100 0.932 0.106 0.858 0.994 0.999 1.000 0.985 0.902 0.937 0.937 0.937 0.937 0.916
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.000 0.569 0.000 0.582 0.954 0.962 0.995 0.815 0.257 0.354 0.353 0.354 0.242 0.108
0.000 0.844 0.000 0.626 0.989 0.992 0.999 0.943 0.631 0.709 0.709 0.709 0.672 0.372
0.044 0.791 0.051 0.774 0.972 0.982 0.996 0.957 0.777 0.821 0.821 0.822 0.807 0.654
0.064 0.926 0.068 0.820 0.995 0.998 0.999 0.978 0.871 0.909 0.909 0.909 0.907 0.843
120 0.05
300 0.05
120 0.15
300 0.15
T
ϵ
120 0.05
300 0.05
120 0.15
300 0.15
0.110 0.074 0.110 0.353 0.183 0.191 0.193 0.424 0.202
0.095 0.021 0.091 0.186 0.053 0.062 0.064 0.219 0.086
0.214 0.152 0.211 0.402 0.188 0.193 0.194 0.436 0.268
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.107 0.064 0.107 0.459 0.587 0.702 0.748 0.559 0.262
0.594 0.915 0.593 0.717 0.969 0.970 0.971 0.728 0.761
0.268 0.473 0.271 0.629 0.703 0.735 0.755 0.679 0.464
0.707 0.958 0.709 0.758 0.970 0.971 0.971 0.764 0.862
0.001 0.424 0.001 0.673 0.948 0.983 0.994 0.951 0.681 0.718 0.726 0.726 0.753 0.511
0.038 0.410 0.044 0.768 0.864 0.951 0.975 0.964 0.775 0.772 0.799 0.803 0.859 0.749
0.081 0.764 0.086 0.825 0.971 0.991 0.997 0.978 0.868 0.880 0.884 0.884 0.920 0.840
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.000 0.104 0.000 0.474 0.769 0.890 0.969 0.772 0.160 0.211 0.231 0.235 0.180 0.019
0.000 0.919 0.000 0.616 0.995 0.994 0.999 0.935 0.583 0.634 0.635 0.635 0.619 0.049
0.041 0.590 0.050 0.720 0.904 0.943 0.979 0.936 0.710 0.738 0.752 0.756 0.778 0.289
0.105 0.972 0.111 0.811 0.998 0.999 1.000 0.977 0.872 0.901 0.901 0.901 0.899 0.395
(a) Frequencies of selecting the true p AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.091 0.172 0.094 0.585 0.733 0.788 0.802 0.549 0.476
(b) Frequencies of selecting the true m AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb) DGP0AR2a T
ϵ
DGP0AR2b
(a) Frequencies of selecting the true p AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.069 0.001 0.073 0.146 0.039 0.055 0.062 0.214 0.066
(b) Frequencies of selecting the true m AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.000 0.069 0.000 0.596 0.762 0.912 0.963 0.870 0.334 0.390 0.415 0.420 0.346 0.243
(continued on next page)
The result for the case of two time breaks is summarized in Table 2c. It seems difficult to choose the true number of breaks for the AR(1) and AR(2) cases, especially when T = 120, while the true lag order is selected with high frequencies by the modified criteria. As a whole, the finite sample performance of the modified criteria is dominated by that of the hybrid method in many cases. On the other hand, the modified BICs work better than the hybrid method for the MA(1) case, although both perform quite well in this case. We may also be interested in the frequencies of choosing simultaneously both the true lag order and the true number of breaks, that is, choosing the true model. These frequencies are apparently smaller than the minimum values of the corresponding frequencies in Panels (a) and (b). For example, from Table 2a, the
frequencies of choosing the true p in Panel (a) and the true m in Panel (b) by MBIC1 are 0.788 and 0.959 for DGP0AR1a with T = 120 and ϵ = 0.05, which implies that the true model is selected with a probability less than 0.788. In fact, MBIC1 chooses the true model with a probability 0.780 in this case. In general, the relative performance of the methods are similar to Tables 2a–2c. The details are available upon request. To summarize our simulation results, we can say that the modified BICs perform relatively well when m0 ≤ 1 and ϵ = 0.05, while the hybrid method with the modified BICs, that is, the sequential testing procedure with the lag order selected by the modified BICs, may be recommended if one is confident that the distance between the two consecutive break fractions is not so
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
229
Table 2a (continued) DGP0AR2c T
ϵ
DGP0AR2d 120 0.05
300 0.05
120 0.15
300 0.15
T
ϵ
120 0.05
300 0.05
120 0.15
300 0.15
0.682 0.981 0.672 0.747 0.981 0.981 0.981 0.731 0.900
0.664 0.955 0.677 0.761 0.955 0.955 0.955 0.744 0.890
0.704 0.981 0.703 0.762 0.981 0.981 0.981 0.765 0.900
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.474 0.969 0.511 0.698 0.967 0.966 0.967 0.591 0.771
0.684 0.981 0.682 0.739 0.980 0.980 0.980 0.726 0.897
0.661 0.968 0.674 0.749 0.967 0.966 0.967 0.728 0.891
0.706 0.980 0.708 0.759 0.980 0.980 0.980 0.760 0.898
0.004 0.928 0.004 0.755 0.990 0.981 0.998 0.889 0.245 0.321 0.320 0.321 0.219 0.705
0.007 0.979 0.006 0.763 0.997 0.997 1.000 0.959 0.750 0.794 0.794 0.794 0.767 0.966
0.174 0.955 0.209 0.872 0.992 0.991 0.999 0.975 0.870 0.896 0.896 0.896 0.888 0.990
0.180 0.987 0.192 0.869 0.999 1.000 1.000 0.983 0.913 0.936 0.936 0.936 0.935 0.998
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.000 0.905 0.000 0.663 0.988 0.971 0.998 0.838 0.133 0.177 0.177 0.177 0.121 0.506
0.003 0.976 0.003 0.733 0.997 0.996 1.000 0.955 0.667 0.715 0.715 0.715 0.687 0.958
0.131 0.945 0.153 0.833 0.991 0.990 0.999 0.969 0.823 0.856 0.856 0.856 0.845 0.995
0.161 0.985 0.172 0.861 0.998 0.999 1.000 0.983 0.905 0.929 0.929 0.929 0.929 1.000
120 0.05
300 0.05
120 0.15
300 0.15
T
ϵ
120 0.05
300 0.05
120 0.15
300 0.15
0.010 0.967 0.010 0.817 0.998 0.999 1.000 0.970 0.820 0.920 0.920 0.920 0.845 0.885
0.179 0.942 0.210 0.892 0.995 0.997 1.000 0.982 0.902 0.964 0.964 0.965 0.925 0.940
0.176 0.980 0.188 0.894 0.999 1.000 1.000 0.987 0.926 0.969 0.969 0.969 0.952 0.970
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.071 0.983 0.074 0.886 0.998 0.992 1.000 0.955 0.173 0.389 0.388 0.389 0.125 0.958
0.194 0.997 0.184 0.900 1.000 0.999 1.000 0.984 0.834 0.871 0.871 0.871 0.835 0.997
0.427 0.988 0.502 0.936 0.999 0.998 1.000 0.991 0.932 0.957 0.956 0.957 0.938 0.999
0.478 0.998 0.503 0.934 1.000 1.000 1.000 0.993 0.961 0.971 0.971 0.971 0.965 1.000
(a) Frequencies of selecting the true p AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.441 0.954 0.473 0.723 0.955 0.954 0.955 0.615 0.828
(b) Frequencies of selecting the true m AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb) DGP0MA1a T
ϵ
DGP0MA1b
(b) Frequencies of selecting the true m AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.009 0.902 0.009 0.822 0.994 0.992 1.000 0.919 0.500 0.741 0.741 0.742 0.429 0.664
close, such as ϵ = 0.15, or if it is believed that the model has more than one break with a high probability.
5. Conclusion This paper developed the model selection criteria to select the regressors and the number of structural changes in multivariate regression models, including a VAR model as a special case. We derived the modified AIC, the modified Cp criterion and the modified BICs. The penalty terms of these criteria are determined not in ad hoc ways but based on the risk functions underlying the criteria. We showed that the modified BICs can consistently estimate the number of structural changes and the regressors while the modified AIC and the modified Cp criterion tend to choose a larger model with a positive probability. The consistency of the modified BICs is a desirable theoretical property and reflecting this nice nature they perform well in finite samples. Because it is important to consistently estimate the number of breaks, given the simulation results, we recommend the modified BICs and the hybrid method for empirical work.
Acknowledgments Kurozumi’s research was partially supported by the Ministry of Education, Culture, Sports, Science and Technology under Grantsin-Aid No. 18730142, by the 21st Century Center of Excellence Project and by the Global COE program, the Research Unit for Statistical and Empirical Analysis in Social Sciences at Hitotsubashi University. Appendix Since all the model selection criteria are derived when m = m0 and p = p0 , we omit the superscript 0 for notational convenience. Proof of Lemma 1. In this proof we omit a subscript j for notational convenience. For example, B1j (v) is abbreviated as B1 (v). As explained in Appendix B of Bai (1997a), maxv≤0 B1 (v) and maxv>0 B2 (v) are distributed as exponential distributions with parameters γ1 /ω1 and γ2 /ω2 , respectively, and hence
P max B (v) ≤ b = P max max B1 (v), max B2 (v) ≤ b
I
v
v≤0
v>0
230
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
Table 2b The frequencies of selecting the true p and m (one time break). DGP1AR1a T
DGP1AR1b 120 0.05
ϵ
300 0.05
120 0.15
300 0.15
T
ϵ
120 0.05
300 0.05
120 0.15
300 0.15
0.604 0.993 0.604 0.710 0.992 0.985 0.985 0.680 0.809
0.604 0.912 0.619 0.700 0.951 0.948 0.947 0.622 0.804
0.681 0.995 0.686 0.751 0.992 0.985 0.985 0.734 0.809
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.326 0.376 0.341 0.606 0.874 0.927 0.932 0.454 0.643
0.610 0.990 0.612 0.701 0.990 0.983 0.980 0.667 0.798
0.587 0.898 0.607 0.686 0.944 0.941 0.936 0.611 0.796
0.678 0.992 0.682 0.744 0.990 0.983 0.980 0.727 0.806
0.000 0.272 0.000 0.488 0.524 0.461 0.317 0.603 0.546 0.579 0.600 0.605 0.534 0.580
0.001 0.808 0.001 0.604 0.939 0.900 0.844 0.916 0.786 0.831 0.831 0.831 0.804 0.685
0.123 0.690 0.153 0.695 0.567 0.471 0.316 0.553 0.702 0.697 0.699 0.699 0.659 0.512
0.138 0.928 0.150 0.839 0.951 0.910 0.844 0.955 0.905 0.931 0.931 0.932 0.922 0.735
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.000 0.204 0.000 0.424 0.443 0.385 0.256 0.548 0.542 0.564 0.592 0.599 0.528 0.589
0.000 0.788 0.000 0.568 0.931 0.880 0.808 0.901 0.789 0.834 0.833 0.835 0.808 0.692
0.103 0.631 0.137 0.641 0.499 0.401 0.257 0.486 0.664 0.657 0.655 0.657 0.617 0.503
0.131 0.918 0.141 0.822 0.943 0.890 0.810 0.944 0.899 0.924 0.924 0.924 0.914 0.721
120 0.05
300 0.05
120 0.15
300 0.15
T
ϵ
120 0.05
300 0.05
120 0.15
300 0.15
0.343 0.351 0.342 0.626 0.618 0.713 0.712 0.681 0.603
0.199 0.119 0.197 0.386 0.298 0.388 0.457 0.476 0.260
0.605 0.574 0.605 0.735 0.651 0.731 0.717 0.730 0.735
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.066 0.003 0.066 0.210 0.175 0.328 0.440 0.364 0.140
0.321 0.307 0.321 0.597 0.602 0.701 0.709 0.660 0.590
0.165 0.107 0.170 0.366 0.297 0.393 0.469 0.477 0.237
0.578 0.535 0.581 0.711 0.634 0.713 0.711 0.701 0.721
0.001 0.468 0.001 0.533 0.679 0.633 0.489 0.837 0.703 0.707 0.736 0.738 0.747 0.497
0.081 0.373 0.102 0.562 0.422 0.396 0.277 0.448 0.574 0.522 0.550 0.556 0.539 0.597
0.171 0.794 0.187 0.809 0.728 0.658 0.494 0.866 0.862 0.853 0.865 0.865 0.868 0.698
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.000 0.018 0.000 0.259 0.239 0.293 0.227 0.444 0.422 0.381 0.474 0.506 0.502 0.460
0.001 0.409 0.001 0.479 0.624 0.570 0.421 0.798 0.686 0.689 0.721 0.727 0.727 0.503
0.067 0.318 0.086 0.500 0.385 0.349 0.243 0.384 0.533 0.471 0.501 0.507 0.484 0.587
0.155 0.745 0.170 0.773 0.669 0.590 0.425 0.828 0.838 0.826 0.839 0.841 0.841 0.716
(a) Frequencies of selecting the true p AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.340 0.433 0.352 0.621 0.901 0.936 0.944 0.464 0.643
(b) Frequencies of selecting the true m AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb) DGP1AR2a T
ϵ
DGP1AR2b
(a) Frequencies of selecting the true p AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.075 0.003 0.075 0.239 0.185 0.332 0.433 0.367 0.154
(b) Frequencies of selecting the true m AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.000 0.026 0.000 0.313 0.269 0.343 0.264 0.516 0.441 0.401 0.487 0.512 0.498 0.455
(continued on next page)
= P max B1 (v) ≤ b P max B2 (v) ≤ b v≤0 v>0 −(γ1 /ω1 )b −(γ2 /ω2 )b = 1−e 1−e ,
where the second equality holds because B1 (v) and B2 (v) are independent. Then, the probability density function of maxv BI (v) is given by f (b) = (γ1 /ω1 )e−(γ1 /ω1 )b + (γ2 /ω2 )e−(γ2 /ω2 )b − {(γ1 /ω1 )
+ (γ2 /ω2 )}e−{(γ1 /ω1 )+(γ2 /ω2 )}b .
Carrying out the integration b>0 bf (b)db and letting r1 = ω1 /γ1 and r2 = ω2 /γ2 , we obtain (5). Next, let vˆ = argmaxv BI (v). By change of variable with s = 2 (γ1 /ω1 )v as in Qu and Perron (2007) we can see that
ω1 argmax B˜ I (s), where γ12 s W1 (|s|) − |s| : s ≤ 0 ˜BI (s) = 2 √r W (s) − s r : s > 0, ω 2 γ vˆ =
2
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
231
Table 2b (continued) DGP1AR2c T
ϵ
DGP1AR2d 120 0.05
300 0.05
120 0.15
300 0.15
T
ϵ
120 0.05
300 0.05
120 0.15
300 0.15
0.718 0.997 0.715 0.769 0.996 0.993 0.996 0.738 0.829
0.697 0.984 0.716 0.760 0.978 0.967 0.967 0.705 0.853
0.748 0.996 0.752 0.792 0.996 0.993 0.996 0.784 0.824
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.490 0.986 0.525 0.688 0.976 0.964 0.962 0.514 0.770
0.711 0.996 0.713 0.775 0.996 0.993 0.996 0.742 0.829
0.695 0.985 0.719 0.761 0.976 0.966 0.962 0.705 0.840
0.757 0.996 0.758 0.788 0.996 0.993 0.996 0.784 0.818
0.001 0.796 0.002 0.571 0.750 0.694 0.527 0.779 0.541 0.607 0.604 0.605 0.509 0.489
0.004 0.960 0.005 0.672 0.994 0.990 0.988 0.943 0.811 0.844 0.844 0.844 0.827 0.125
0.185 0.867 0.239 0.804 0.757 0.729 0.527 0.835 0.843 0.856 0.856 0.856 0.838 0.055
0.221 0.983 0.240 0.855 0.996 0.994 0.988 0.985 0.920 0.938 0.938 0.938 0.937 0.078
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.000 0.757 0.000 0.533 0.699 0.641 0.469 0.747 0.529 0.596 0.595 0.593 0.493 0.496
0.003 0.959 0.005 0.657 0.993 0.987 0.986 0.941 0.807 0.848 0.848 0.848 0.826 0.118
0.184 0.843 0.235 0.783 0.705 0.677 0.468 0.800 0.830 0.835 0.836 0.835 0.819 0.046
0.227 0.982 0.244 0.844 0.996 0.994 0.987 0.983 0.917 0.937 0.937 0.937 0.934 0.069
120 0.05
300 0.05
120 0.15
300 0.15
T
ϵ
120 0.05
300 0.05
120 0.15
300 0.15
0.063 0.983 0.073 0.805 0.999 0.998 1.000 0.970 0.854 0.922 0.915 0.922 0.863 0.962
0.341 0.972 0.422 0.904 0.997 0.995 1.000 0.990 0.932 0.968 0.964 0.969 0.936 0.975
0.405 0.992 0.437 0.907 0.999 0.999 1.000 0.991 0.947 0.966 0.965 0.966 0.954 0.984
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.018 0.939 0.032 0.789 0.995 0.982 0.999 0.937 0.678 0.866 0.841 0.866 0.619 0.903
0.062 0.980 0.070 0.805 0.998 0.996 1.000 0.971 0.862 0.920 0.915 0.920 0.873 0.959
0.351 0.969 0.421 0.899 0.997 0.995 0.999 0.990 0.932 0.967 0.962 0.967 0.940 0.975
0.398 0.991 0.431 0.903 0.999 0.999 1.000 0.992 0.944 0.966 0.964 0.966 0.955 0.987
(a) Frequencies of selecting the true p AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.466 0.984 0.513 0.689 0.978 0.964 0.967 0.513 0.763
(b) Frequencies of selecting the true m AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb) DGP1MA1a T
ϵ
DGP1MA1b
(b) Frequencies of selecting the true m AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.020 0.939 0.030 0.782 0.995 0.980 1.000 0.933 0.684 0.868 0.842 0.869 0.622 0.906
where rω = ω2 /ω1 and rγ = γ2 /γ1 . Then, it is sufficient to calculate E [argmaxs B˜ I (s)I (s ≤ 0)] and E [argmaxs B˜ I (s)I (s > 0)] in order to obtain (6) and (7). Following Appendix B of Bai (1997a) it can be shown that the probability density function (pdf) of sˆ = argmaxs B˜ I (s) is given by
rγ 1 1 1 2 − Φ − | s | + + e(1/2){(rγ /rω )+(rγ /rω ) }|s| Φ 2 2 r 2 ω rγ 1 × − + | s | : s≤0 rω 2 √ 2 √ g (s) = − (rγ / rω ) Φ − (rγ / rω ) √s 2 2 √ 2 (rγ / rω ) + r + e(1/2)(rγ +rω )s Φ γ 2 √ √ (rγ / rω ) √ − rω + s : s > 0, 2
which is obtained based on the result on an additive process by Bhattacharya and Brockwell (1976),4 where Φ (·) denotes a cumulative distribution function of a standard normal random variable. By carrying out the integration we obtain 0
∫
sg (s)ds = −
−∞ ∫ ∞ 0
sg (s)ds =
2rγ (rγ + 2rω )
(rγ + rω )
=−
2
2rω2 (rω + 2rγ ) rγ2 (rγ + rω )2
=
2r1 (r1 + 2r2 )
(r1 + r2 )2
2 r22 (2r1 + r2 ) rγ r1 (r1 + r2 )2
,
,
where the second equalities of the above two integrals are obtained by using rω = (r2 /r1 )rγ . Noting that
] [ ∫ I I ˜ E aj argmax B (s) = a2 s
∞
sg (s)ds − a1 0
∫
0
sg (s)ds,
−∞
4 This result is obtained by replacing φ and ξ in Bai (1997a) with r and r , ω γ respectively. Note that there are typos in Eq. B.1 and the definition of g (x) in Bai (1997a). The first term on the right hand side of g (x) for x < 0 must have a negative sign.
232
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
Table 2c The frequencies of selecting the true p and m (two time breaks). DGP2AR1a T
ϵ
DGP2AR1b 120 0.05
300 0.05
120 0.15
300 0.15
T
ϵ
120 0.05
300 0.05
120 0.15
300 0.15
0.647 0.994 0.649 0.712 0.957 0.937 0.916 0.592 0.767
0.626 0.942 0.647 0.691 0.949 0.947 0.944 0.614 0.802
0.733 0.992 0.736 0.756 0.956 0.936 0.915 0.628 0.789
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.249 0.169 0.266 0.565 0.661 0.834 0.834 0.448 0.608
0.630 0.952 0.633 0.750 0.992 0.988 0.988 0.680 0.785
0.596 0.745 0.611 0.699 0.839 0.883 0.855 0.602 0.742
0.747 0.996 0.752 0.786 0.995 0.988 0.988 0.738 0.812
0.000 0.134 0.001 0.237 0.062 0.040 0.007 0.160 0.210 0.192 0.200 0.199 0.194 0.233
0.000 0.613 0.000 0.482 0.326 0.177 0.055 0.494 0.489 0.493 0.491 0.490 0.435 0.294
0.254 0.342 0.306 0.332 0.067 0.034 0.006 0.065 0.234 0.203 0.201 0.199 0.162 0.190
0.280 0.781 0.296 0.801 0.331 0.177 0.054 0.443 0.711 0.707 0.705 0.703 0.618 0.360
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.000 0.068 0.001 0.328 0.163 0.158 0.077 0.279 0.260 0.243 0.260 0.259 0.244 0.271
0.001 0.719 0.002 0.591 0.690 0.584 0.386 0.754 0.703 0.738 0.737 0.737 0.696 0.480
0.302 0.452 0.350 0.473 0.219 0.149 0.070 0.148 0.372 0.358 0.343 0.335 0.316 0.342
0.303 0.893 0.325 0.871 0.698 0.583 0.382 0.750 0.874 0.884 0.883 0.882 0.854 0.619
120 0.05
300 0.05
120 0.15
300 0.15
T
ϵ
120 0.05
300 0.05
120 0.15
300 0.15
0.453 0.475 0.454 0.613 0.842 0.900 0.922 0.605 0.610
0.215 0.236 0.214 0.483 0.524 0.605 0.661 0.609 0.313
0.671 0.652 0.674 0.712 0.861 0.907 0.924 0.636 0.759
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.075 0.002 0.075 0.212 0.149 0.313 0.452 0.343 0.153
0.330 0.237 0.330 0.568 0.540 0.676 0.661 0.647 0.517
0.173 0.076 0.171 0.355 0.255 0.369 0.476 0.431 0.223
0.572 0.429 0.573 0.684 0.587 0.697 0.667 0.702 0.657
(a) Frequencies of selecting the true p AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.418 0.545 0.443 0.615 0.913 0.941 0.943 0.483 0.621
(b) Frequencies of selecting the true m AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb) DGP2AR2a T
ϵ
DGP2AR2b
(a) Frequencies of selecting the true p AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.084 0.013 0.085 0.298 0.354 0.539 0.639 0.477 0.173
(b) Frequencies of selecting the true m AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb) DGP2AR2c
0.000 0.016 0.000 0.165 0.095 0.083 0.035 0.143 0.211 0.185 0.192 0.191 0.193 0.268
0.000 0.158 0.001 0.348 0.072 0.037 0.008 0.260 0.279 0.255 0.263 0.264 0.243 0.320
0.179 0.261 0.221 0.269 0.124 0.066 0.026 0.052 0.218 0.190 0.161 0.153 0.114 0.360
0.302 0.325 0.332 0.612 0.077 0.036 0.008 0.200 0.433 0.405 0.403 0.402 0.343 0.601
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb) DGP2AR2d
0.000 0.016 0.000 0.209 0.159 0.156 0.100 0.179 0.242 0.231 0.225 0.222 0.205 0.270
0.001 0.193 0.002 0.333 0.192 0.130 0.063 0.253 0.317 0.303 0.303 0.307 0.290 0.374
0.194 0.294 0.231 0.290 0.220 0.134 0.082 0.068 0.246 0.286 0.238 0.213 0.163 0.349
0.306 0.419 0.334 0.520 0.205 0.121 0.059 0.187 0.402 0.419 0.396 0.395 0.351 0.643
T
120 0.05
300 0.05
120 0.15
300 0.15
T
ϵ
120 0.05
300 0.05
120 0.15
300 0.15
0.740 0.997 0.740 0.798 0.985 0.981 0.973 0.737 0.849
0.712 0.973 0.736 0.741 0.958 0.954 0.948 0.666 0.856
0.796 0.997 0.797 0.823 0.985 0.982 0.972 0.783 0.848
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.491 0.984 0.529 0.686 0.968 0.955 0.941 0.506 0.813
0.733 0.998 0.733 0.774 0.996 0.989 0.990 0.720 0.783
0.710 0.984 0.730 0.757 0.968 0.956 0.941 0.686 0.831
0.783 0.998 0.787 0.799 0.996 0.988 0.990 0.763 0.789
ϵ
(a) Frequencies of selecting the true p AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(rb)
0.494 0.974 0.539 0.651 0.959 0.951 0.948 0.504 0.736
(continued on next page)
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
233
Table 2c (continued) (b) Frequencies of selecting the true m AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.000 0.313 0.002 0.317 0.085 0.091 0.011 0.353 0.230 0.242 0.242 0.242 0.216 0.074
0.004 0.873 0.005 0.607 0.624 0.566 0.247 0.881 0.707 0.735 0.734 0.734 0.713 0.003
0.316 0.359 0.398 0.534 0.083 0.092 0.010 0.214 0.381 0.372 0.371 0.370 0.341 0.003
0.363 0.912 0.390 0.894 0.628 0.574 0.247 0.907 0.915 0.920 0.920 0.920 0.913 0.002
120 0.05
300 0.05
120 0.15
300 0.15
0.067 0.983 0.090 0.800 0.998 0.995 1.000 0.975 0.853 0.931 0.917 0.932 0.856 0.967
0.520 0.989 0.607 0.936 0.946 0.900 0.598 0.955 0.965 0.978 0.972 0.970 0.956 0.991
0.581 0.996 0.626 0.951 1.000 0.999 1.000 0.995 0.977 0.986 0.983 0.986 0.980 0.995
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
DGP2MA1a
0.000 0.163 0.003 0.247 0.035 0.065 0.005 0.187 0.215 0.203 0.204 0.201 0.198 0.051
0.004 0.469 0.007 0.537 0.166 0.163 0.036 0.478 0.495 0.491 0.490 0.490 0.483 0.025
0.338 0.132 0.411 0.279 0.027 0.035 0.003 0.056 0.184 0.156 0.155 0.151 0.159 0.005
0.374 0.473 0.410 0.742 0.161 0.156 0.035 0.420 0.629 0.623 0.622 0.622 0.596 0.036
T
ϵ
120 0.05
300 0.05
120 0.15
300 0.15
AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.007 0.855 0.024 0.717 0.885 0.795 0.586 0.799 0.609 0.813 0.774 0.793 0.491 0.813
0.028 0.957 0.040 0.750 0.997 0.994 0.995 0.966 0.836 0.928 0.912 0.928 0.846 0.911
0.445 0.963 0.527 0.906 0.892 0.815 0.586 0.767 0.905 0.937 0.927 0.921 0.849 0.961
0.497 0.990 0.532 0.937 0.998 0.998 0.995 0.992 0.972 0.984 0.982 0.984 0.978 0.980
DGP2MA1b
T
ϵ
(b) Frequencies of selecting the true m AIC BIC Cp MAIC MBICy MBIC1 MBIC2 MCp Sq(MAIC ) Sq(MBICy ) Sq(MBIC1 ) Sq(MBIC2 ) Sq(MCp ) Sq(rb)
0.018 0.950 0.051 0.754 0.942 0.873 0.598 0.939 0.623 0.870 0.815 0.853 0.532 0.920
[
]
E aIj argmax B˜ I (s) = a2
∞
∫
s
sg (s)ds + a1
0
sg (s)ds,
(1/2)
| − (nT /2), we can see that ℓm,px (Tˆ , θˆ |y, x) − Ey ℓm,px (T 0 , θ 0 |y, x) m +1 − 1Tj0 1Tˆj 0 ˆ j| − =− log |Σ log |Σj | log |
Σj0
˜j = where Σ
j =1
m+1
−
(27)
2
j =1
ˆ j | − log |Σj0 | log |Σ
− 1Tˆj − 1Tj0 2
− 1Tˆj [
−
2
1 2
Σj0
ˆ j − Σj0 tr Σj0−1 Σ
|, R11 is expressed as
− 1Tˆj [ j =1
2
=−
t =Tj0−1 +1
ε˜ t ε˜ t′ /1Tˆj with ε˜ t = yt − (x′t ⊗ In )φˆ j for
2
ˆj −Σ ˜j tr Σj0−1 Σ
Tj m 1− −
2 j =1 t =Tˆj +1
′ εt − (x′t ⊗ In )(φˆ j+1 − φj0 )
j m 1− −
2 j =1 t =Tˆj +1
tr Σj0+−11 εt εt′ − tr Σj0−1 εt εt′
0
ˆ j − Σj0 Σj0−1 Σ ˆ j − Σj0 tr Σj0−1 Σ
m+1
=−
∑Tj0
T0
m+1
j =1
+ op (1),
× Σj0+−11 εt − (x′t ⊗ In )(φˆ j+1 − φj0 ) ′ − εt − (x′t ⊗ In )(φˆ j − φj0 ) Σj0−1 εt − (x′t ⊗ In )(φˆ j − φj0 )
log |Σj0 |.
ˆ j | around log | By expanding log |Σ R11 = −
=−
m+1
j =1
]
0
m+1
and R12 = −
− 1Tˆj j =1
= R11 + R12 , where R11 = −
For Tˆj < Tj0 , the first term in the square brackets on the right hand side of (28) becomes
2
− 1Tˆj
Tj0−1 + 1 ≤ t ≤ Tj0 .
2
(28)
M } for some large M (j = 1, . . . , m) because Tˆj − Tj = Op (vT−2 ) as shown by Qu and Perron (2007). We first evaluate bm,px ,1 . Since Ey [ℓm,px (T 0 , θ 0 |y, x)] = −(nT /2) log(2π ) − 0 j=1 1Tj
2
ˆ j − Σj0 Σj0−1 Σ ˆ j − Σj0 tr Σj0−1 Σ
Proof of Proposition 1. In this proof we restrict our analysis on the set given by {Tj |Tj = Tj0 + c vT−2 , −M ≤ c ≤
∑m+1
1
−
−∞
0
(6) and (7) are established.
∫
]
− 2(φˆ j+1 − φj0 )′ + op (1)
Tj −
xt ⊗ Σj0+−11 εt + (φˆ j+1 − φj0 )′
t =Tˆj +1 Tj0
ˆj −Σ ˜j tr Σj0−1 Σ
˜ j − Σj0 + tr Σj0−1 Σ
×
− t =Tˆj +1
xt x′t ⊗ Σj0+−11 (φˆ j+1 − φj0 ) + op (1).
234
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
√
Since φˆ j+1 − φj0 = (φj0+1 − φj0 ) + (φˆ j+1 − φj0+1 ) = vT δj + Op (1/ T ) and
Σj0−1 = Σj0+−11 + Σj0+−11 (Σj0+1 − Σj0 )Σj0+−11 + Σj0+−11 (Σj0+1
m+1
R11 + R12 +
− Σj0 )Σj0−1 (Σj0+1 − Σj0 )Σj0+−11 Σj0+−11
=
+v
+v
Then, by combining (29)–(33), we have, for Tˆj < Tj0 ,
0−1/2 0−1/2 A2j Σj+1 T Σj+1
0−1/2 2 0−1/2 2 A2j Σj+1 T Σj + 1
+ O(v ), 3 T
=
we can see that, for Tˆj < Tj0 , m+1
−
− 1Tˆj 2
j =1
Tj −
tr vT A1j
2 j =1
2 j =1
2 j =1
ηt ηt′ − vT2 (Tj0 − Tˆj )tr A21j
t =Tˆj +1
−
0
xt ⊗ Σj0−1 εt
−
Tj0
−
xt x′t ⊗ Σj0−1 δj + op (1).
(29)
−
t =Tˆj +1
Similarly, the second and third terms in the square brackets on the right hand side of (28) are expressed as
d
−→
m+1
−
− 1Tˆj 2
j =1
(φˆ j − φ )
˜ j − Σj0 ) tr Σj0−1 (Σ
m+1 1−
vT2
2 j =1 t =Tj0−1 +1
4
vT2 2
0
(
t =Tˆj +1 0
Tj −
δj
′
εt − (x′t ⊗ In )(φˆ j − φj0 )
′
t =Tˆj +1
m+1 1−
2 j =1
2
=
2 j =1
bm,px ,1 =
−
(φˆ j − φ )
′
xt xt ⊗
Σj0−1
(φˆ j − φ ) 0 j
pall φ 2
j
4
j =1
m+1
=
− j =1
(30)
ˆ j − Σj0 Σj0−1 Σ ˆ j − Σj0 tr Σj0−1 Σ 2 Tj0 − ′ tr (η η − I ) + op (1). t n t 41Tj0 t =T 0 +1 1
(31)
m+1 1−
− κ4j
m+1
+
j =1
4
+
2 m − r1j + r1j r2j + r2j2
r1j + r2j
j =1
2 j =1 t =Tˆj +1
.
m+1 1−
2 j=1
nT 2
vT2 (Tj0 − Tˆj ) 2
− 1Tj0
m+1
log(2π ) −
j =1
2
ˆ j| log |Σ
0
Tj −
2 j=1 t =Tj0−1 +1
(32)
=
ˆ j−1 Σj0 tr Σ
ˆ −1 (φˆ j − φj0 ), 1Tj0 (φˆ j − φj0 )′ Ey∗ [x∗t x∗′ t ] ⊗ Σj
− 1Tj0 j =1
(34)
v
= 1, . . . , m + 1) are independent chi-square
m+1
tr −vT A1j +
max BIj (v),
j =1
Ey∗ [ℓm,px (T 0 , θ 0 |y∗ , x∗ )] − Ey∗ [ℓm,px (T 0 , θˆ |y∗ , x∗ )]
(Tj0 − Tˆj ) log |Σj0 | − log |Σj0+1 |
j m 1− −
tr A21j + op (1), (33)
where the last equality is obtained by expanding log |Σj0+1 | around log |Σj0 |.
m −
we can see that
Tj0 ,
m 1−
2 j =1
−
j−1
T0
=
−
On the other hand, R12 becomes, for Tˆj < R12 =
4
j =1
+
Ey∗ [ℓm,px (T 0 , θˆ |y∗ , x∗ )] = −
t =Tj−1 +1
− 1Tˆj
− κ4j
m+1
χp2φ +
We next evaluate bm,px ,3 because bm,px ,2 = 0 is obvious. Since
t =Tj0−1 +1
Tj0 m + 1 − 1− ′ − tr ηt ηt − In + op (1). 2 j =1 0 m+1
convergence holds for Tˆj > Tj0 and the expectation of the left hand side of (34) equals bm,px ,1 , we can see using Lemma 1 that, omitting the o(1) terms,
Tj0 0 j
xt x′t ⊗ Σj0−1 δj + op (1)
distributions with pφj degrees of freedom. Since the same
nT × Σj0−1 εt − (x′t ⊗ In )(φˆ j − φj0 ) + m+1 1−
Tj − 2 ′ ˆ xt ⊗ Σj0−1 εt − Tj )tr A1j + vT δj
Tj0
where χp2φ (j
Tj0
−
xt x′t ⊗ Σj0−1 (φˆ j − φj0 )
t =Tj0−1 +1
j
=−
Tj −
t =Tˆj +1
t =Tˆj +1
− vT2 δj′
t =Tj−1 +1
0 j
Tj0
+ 2vT δj′
Tj0 − ′ tr ηt ηt − In 0
2 Tj0 m +1 − − 1 ′ + (η η − I ) tr t n t 41Tj0 t =T 0 +1 j =1 j−1 Tj0 m v − − ′ T ηt ηt − In + tr A1j 2 j=1
0
m 1−
=
ˆj −Σ ˜j tr Σj0−1 Σ
m+1 1−
1−
+
+
2
m+1 1−
ˆ j | − log |Σj0 | − log |Σ Tj0
−
2 j=1 t =Tj0−1 +1 m+1 1−
2 j=1
ˆ j−1 Σj0 tr Σ
nT 2
−1 ˆ 1Tj0 (φˆ j − φj0 )′ Ey∗ [x∗t x∗′ ] ⊗ Σ (φˆ j − φj0 ) t j
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
− 1Tj0
m+1
=
R42 + R43
m+1
+
On the other hand, R42 + R43 becomes, for Tˆj < Tj0 ,
ˆ j − Σj0 )Σj0−1 (Σ ˆ j − Σj0 ) tr Σj0−1 (Σ
4
j =1
1−
1Tj0
2 j =1
d
ˆ −1 (φˆ j − φj0 ) (φˆ j − φj0 )′ Ey∗ [x∗t x∗′ t ] ⊗ Σj
= Ey∗
t =Tˆj +1
− κ4j
0 ˆ ˆ j−+11 εt∗ − (x∗′ ×Σ t ⊗ In )(φj+1 − φj ) ′ 0 ˆ − εt∗ − (x∗′ t ⊗ In )(φj − φj ) 0 ˆ ˆ j−1 εt∗ − (x∗′ ×Σ t ⊗ In )(φj − φj )
m+1
+
4
1− 2 j=1
χp2φ ,
(35)
j
ˆ j| where the second equality is obtained by expanding log |Σ around log |Σj0 | and by using the relation
ˆ j−1 Σj0 tr Σ
ˆ j − Σj0 ) = n − tr Σj0−1 (Σ ˆ j − Σj0 )Σ ˆ j−1 (Σ ˆ j − Σj0 ) , + tr Σj0−1 (Σ
=
ˆ j−1 = Σj0−1 − Σj0−1 (Σ ˆ j − Σj0 )Σj0−1 Σ ˆ j − Σj0 )Σj0−1 . ˆ j − Σj0 )Σ ˆ j−1 (Σ + Σj0−1 (Σ From (35), we have, ignoring the o(1) terms,
− κ4j
pall φ
m+1
4
j =1
+
2
=
j =1
= R41 + R42 + R43 , m +1 1Tˆ − 1T 0 − j j ˆ j |, where R41 = log |Σ
R42 =
R43 = −
ˆ y∗t − (x∗′ t ⊗ In )φj
yt − (xt ⊗ In )φˆ j ∗
∗′
−
2
,
(38)
Proof of Proposition 2. E [Jm,px ,1 ] = −nT is obvious. For Jm,px ,2 we expand it as, for Tˆj < Tj0 ,
2
1
r1j + r2j
′
Jm,px ,2 =
j=1 t =T 0 +1 j−1 0
ˆ j | − log |Σ ˆ j +1 | log |Σ
=
2
Tj m − −
ˆ j−1 Σ ˆ j +1 − Σ ˆj (Tj0 − Tˆj ) tr Σ
ˆ j−1 Σ ˆ j +1 − Σ ˆj Σ ˆ j−1 Σ ˆ j +1 − Σ ˆj tr Σ
+ op (1).
′ (x′t ⊗ In )(φˆ j+1 − φˆ j ) Σj0−1 εt
j=1 t =Tˆ +1 j
0
Tj − − ′ 2 (µ ˆt −µ ˜ t ) + (µ ˜ t − µ0t ) Σj0−1 εt m+1
m ˆ − Tj − Tj0
2 j =1
2 m − r1j + r1j r2j + r2j2
omitting the o(1) terms.
0
Tj −
Similarly to the evaluation of R12 , by the Taylor expansion of ˆ j+1 |, R41 can be expressed as, for Tˆj < Tj0 , log |Σ
=
argmax BI (v) , j 2 v
j =1
ˆ ˆ j−1 y∗t − (x∗′ ×Σ t ⊗ In )φj .
m 1−
4
bm,px ,4 = Ey Ey∗ [ℓm,px (T 0 , θ 0 |y∗ )] − Ey∗ [ℓm,px (T 0 , θˆ |y∗ )]
′
=
2 j =1 t =Tj0−1 +1
j=1
2
R41 =
1 2 1 ′ ∗ ∗′ −1 ˆ ˆ δ j E [ x t x t ] ⊗ Σj − Tj ) δj + tr A1j
same convergence holds for Tˆj > Tj0 , we have, by Lemma 1,
ˆ ˆ j−1 y∗t − (x∗′ ×Σ t ⊗ In )φj , m+1 1−
Tj0
where γjI = γ1j when v ≤ 0 and γjI = γ2j when v > 0. Since the
2
2 j =1 t =Tˆj−1 +1
v ( 2 T
+ op (1) m − γjI d −→
ˆ y∗ , x∗ )] − Ey∗ [ℓm,px (Tˆ , θˆ |y∗ , x∗ )] Ey∗ [ℓm,px (T 0 , θ|
Tˆj −
m − j =1
For bm,px ,4 , we write
m+1 1−
(37)
Ey∗ [ℓm,px (T 0 , θ 0 |y∗ )] − Ey∗ [ℓm,px (T 0 , θˆ |y∗ )]
.
j =1
2 j =1
ˆ −1 (Tj0 − Tˆj )vT2 δj′ E [x∗t x∗′ t ] ⊗ Σj+1 δj
Thus, by combining (36) and (37), we have, for Tˆj < Tj0 ,
bm,px ,3 = Ey Ey∗ [ℓm,px (T 0 , θ 0 |y∗ , x∗ )]
− Ey∗ [ℓm,px (T 0 , θˆ |y∗ , x∗ )]
m 1−
ˆ j−1 Σ ˆ j +1 − Σ ˆj Σ ˆ j−1 Σj0 − (Tj0 − Tˆj )tr Σ ˆ j−1 Σ ˆ j +1 − Σ ˆj + (Tj0 − Tˆj )tr Σ ˆ j−+11 Σ ˆ j +1 − Σ ˆj Σ ˆ j Σj0 + op (1). ×Σ
which holds because
=
Tj m ′ − − 0 1 ˆ εt∗ − (x∗′ ⊗ I )( φ − φ ) n j + 1 t j 2 j =1
m+1
j =1
0
+ op (1) −→
235
0
Tj − −
m+1
+2
′ (x′t ⊗ In )(φˆ j − φj0 ) Σj0−1 εt
j =1 t =T 0 +1 j−1
] (36)
=
2
m − j =1
0
vT δj
′
Tj −
xt ⊗ Σj0−1 εt
t =Tˆj +1
236
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238 Tj0
−
Proof of Proposition 3. We first note that the logarithm of gM (y|x) can be expressed as
xt x′t ⊗ Σj0−1 δj
− vT δj′
t =Tˆj +1
log gM (y|x) = log
Tj0
−
+ vT δ j
′
′
xt xt ⊗
Σj0−1
δj
+2
−
0
Tj −
(φˆ j − φ )
j=1
d
−→
xt x′t ⊗ Σj0−1 (φˆ j − φj0 )
0 ′ j
t =Tj0−1 +1
+ op (1) m −
2 max BIj v
j =1
m +1 − I (v) + γ argmax Bj (v) + 2 χp2φ , j v I j
where, under the assumption of Σj+1 − Σj = v
γ = γ1j = δj′ Q1j δj when v ≤ 0 and γjI = γ2j = δj′ Q2j δj when v > 0, BIj (v) is defined as (4) with ω1j = γ1j = δj′ Q1j δj and ω2j = γ2j = δj′ Q2j δj 2 T Ψj ,
I j
because only the changes in the coefficients affect the limiting distributions of the break points in this case. Note that the same convergence holds for Tˆj > Tj0 . From Lemma 1 we can see that
] [ I I I E 2 max Bj (v) = E γj argmax Bj (v) = 3
Tj0 ,
L1 (φ, σ ) = −
Jm,px ,3
j =1
t =Tˆj−1 +1
εˆ t′ Σj0−1 εˆ t
= −
εˆ (
0 −1 t Σj + 1
′
−
Σj0−1
m −
)ˆεt
=
j =1
2
−
tr(A2j )
γ2j
dφ =
∏∫
∑Tj
t =Tj−1 +1
xjt ⊗ In yt =
∑Tj
t =Tj−1 +1
e−qj /2 dϕj
pxj /2
(2π )
−pφj /2
(1Tj )
j =1
ˇ −n/2 −pxj /2 , Σj Σj,x
(42)
∑Tj
ˇ j,x = (1Tj )−1 t =T +1 xjt x′jt . The last equality holds where Σ j−1 because the integrand on the right hand side of the second equality is the pdf of a pxj dimensional normal distribution. For L2 (σ ) we can see that
v
γ1j
∏
L2 (σ ) =
j =1
m+1
m+1
and the same convergence holds for Tˆj > Tj0 . Then, from Lemma 1 we can see, omitting the o(1) term, that m − E [Jm,px ,3 ] = E − tr(AIj ) argmax BIj (v)
m − 3 tr(A1j )
L1 (φ,σ )
(39)
v
say,
2
=
t =Tˆj +1
tr(AIj ) argmax BIj (v),
− 1 − qj ,
1/2 − Tj −p /2 xjt xjt ′ ⊗ Σj−1 e−qj /2 dϕj = (2π ) xj t =Tj−1 +1 j =1 −1/2 − Tj p /2 xjt xjt ′ ⊗ Σj−1 × (2π ) xj t =Tj−1 +1
′ εt − (x′t ⊗ In )(φˆ j+1 − φj0 )
j =1
t =Tj−1 +1
∏∫
0−1/2 0−1/2 0−1/2 2 0−1/2 × vT2 Σj A1j Σj − vT4 Σj A1j Σj × εt − (x′t ⊗ In )(φˆ j+1 − φj0 ) Tj0 m − − = tr vT2 A1j ηt ηt′ + op (1)
⇒−
m+1
j=1 t =Tˆ +1 j
j =1
xjt x′jt ⊗ Σj−1
j =1
0
=
(ϕj − ϕˇ j )
where we used the fact that
e
j=1 t =Tˆ +1 j Tj m − −
2 j =1
Tj −
′
j=1
∫
0
Tj m − −
(41)
xjt x′jt ⊗ In ϕˇ j from the first order condition on the maximization. Using this expression we can see that
0
t =Tj0−1 +1
m 1−
× (ϕj − ϕˇ j ) =
Tj −
say.
m+1
in this case. Hence, we obtain E [Jm,px ,2 ] = 6m + 2pall φ ignoring the o(1) terms. Similarly, using Σj0+1 − Σj0 = vT2 Ψj , Jm,px ,3 is expressed as, for
Tˆj − − εˆ t′ Σj0−1 εˆ t − = −
(40)
Since φ appears only in L1 while both L1 and L2 depend on σ , we first evaluate the integral of exp(L1 ) with respect to φ and next obtain the integral of exp(L1 )dφ exp(L2 ) with respect to Σ −1 . From the direct calculation we can see that
v
m+1
ˇ y, x) + L1 (φ, σ ) + L2 (σ ), = ℓm,px (T , θ|
Tˆj <
ℓm,px (T , θ|y, x) = ℓm,px (T , θˇ |y, x) ˇ σ |y, x) − ℓm,px (T , φ, σ |y, x) − ℓm,px (T , φ, ˇ σ |y, x) − ℓm,px (T , θˇ |y, x) − ℓm,px (T , φ,
j =1
v
exp ℓm,px (T , θ|y, x) πM (θ )dθ
where πM (θ )dθ = dφ dΣ −1 with dφ = dϕ1 dϕ2 · · · dϕm+1 and −1 dΣ −1 = dΣ1−1 dΣ2−1 · · · dΣm +1 from Assumption A7 (b). We expand the log-likelihood as
t =Tˆj +1 m+1
∫
.
The limiting distribution in (15) is obtained similarly to (39) under Assumption A4.
nT 2
where Sj =
∫ ∫
m+1
+
− 1Tj 2
j =1
∑Tj
t =Tj−1 +1
1 −1 ˇ log |Σj |/|Σj | − tr Sj Σj 2
(43)
εˇ t εˇ t′ . Using (42) and (43) we have
eL1 (φ,σ )+L2 (σ ) dφ dΣ −1
= enT /2
m+1
∏ j =1
∫ ×
pxj /2
(2π )
ˇ −n/2 −p /2 1Tj /2 ˇ j (1Tj ) φj Σ Σj,x
−1 (1Tj −px )/2 −(1/2)tr(Sj Σ −1 ) −1 j Σ j e d Σj . j
(44)
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
ˇ y, x) log gM (y|x) = ℓm,px (T , θ|
Note that
∫
m+1
−1 (1Tj −px )/2 −(1/2)tr(Sj Σ −1 ) −1 j Σ j e dΣj j
−
i =1
1T ∗ /2 × π n(n−1)/4 Sj−1 j [ ] n ∏ ∗ −n1Tj∗ /2 1 × Γ (1Tj∗ + 1 − i) = 2n1Tj /2 π n(n−1)/4 1Tj n1T ∗ /2 dΣj−1 2 j
2
i=1
] [ n −1T ∗ /2 ∏ 1 ˇ j (1Tj∗ + 1 − i) , × Σ Γ j where 1Tj∗ = 1Tj − pxj + n + 1 and the last equality holds because the integrand on the right hand side of the first equality is the pdf of the Wishart distribution. Then, the logarithm of (44) becomes
∫ ∫
eL1 (φ,σ )+L2 (σ ) dφ dΣ −1
=
− n1Tj∗
m+1
nT
+
2
2
j =1 n
+
−
log Γ
[
i =1
1 2
log 2 −
n(1Tj∗ + pxj ) 2
(1Tj∗ + 1 − i)
]
log 1Tj
n −
log Γ
+ Op (1).
=
n 2
] 1 ∗ (1Tj + 1 − i) 2
i=1
log(2π ) +
n − 1Tj∗ − i
ϑi 6(1Tj∗ + 1 − i)
n1Tj∗ 2
−
1Tj∗ + 1 − i 2
−
1Tj∗ + 1 − i 2
+ =
2
i=1
log
n( n + 1 ) 4
log 1Tj − log 2 −
nT 2
+ O(1),
where 0 < ϑi < 1. The second equality holds because i)/2 = n1Tj /2 − n(n + 1)/4, and ∗
(1Tj∗ − i=1 (1Tj + 1 − i) = nT + O(1)
∑n
= log 1Tj + O
∑n
i =1
∗
log(1Tj∗ + 1 − i) = log 1Tj + log 1 +
1
1Tj
−p x j + n + 2 − i
1Tj
.
Thus, we have
∫ ∫ log
log 1Tj + Op (1).
Proof of Proposition 4. (i) We will show that MIC (m, px ) − MIC (m0 , p0x ) → ∞ if m ̸= m0 or px ̸= p0x . Let Tˆ 0 and θˆ 0 be the MLEs when m = m0 and px = p0x . Then, MIC (m, px ) − MIC (m0 , p0x )
ˆ y, x) − ℓm0 ,p0 (T 0 , θ 0 |y, x) = −2 ℓm,px (Tˆ , θ| x 0 0 + 2 ℓm0 ,p0x (Tˆ , θˆ |y, x) − ℓm0 ,p0x (T 0 , θ 0 |y, x) (46)
When m < m0 or px < p0x , there exists at least one regime in which the estimators of the coefficients are inconsistent. In this case, the first term on the right hand side of (46) is greater than cT vT2 for some c > 0 with a large probability as shown in the proof of Theorem 6 of Bai (2000), while the second term is shown to be Op (1) in the same way as Lemma 13 of Bai (2000). As a result, the left hand side of (46) diverges to infinity as T → ∞. When m > m0 and px > p0x , we rewrite (46) as MIC (m, px ) − MIC (m0 , p0x )
Using the Stirling’s formula, the sum of the logarithms of the Gamma functions becomes
[
2
+ op (T vT2 ).
2
i =1
− pφj + pσ j =1
−1 (1Tj∗ −n−1)/2 −(1/2)tr(Sj Σ −1 ) ∫ Σ j e j = n ∗ 1T / 2 ∏ n1T ∗ /2 2 j π n(n−1)/4 Sj−1 j Γ 21 (1Tj∗ + 1 − i)
log
237
eL1 (φ,σ )+L2 (σ ) dφ dΣ −1
ˆ y, x) − ℓm0 ,p (Tˆ ∗ , θˆ ∗ |y, x) = −2 ℓm,px (Tˆ , θ| x − 2 ℓm0 ,px (Tˆ ∗ , θˆ ∗ |y, x) − ℓm0 ,p0x (Tˆ 0 , θˆ 0 |y, x) + c1 g1 (T ) + (m − m0 )g2 (T ),
(47)
where T ∗ and θ ∗ are the MLEs of T and θ when m = m0 and px > p0x , and c1 is the difference of the number of the unknown parameters, which is positive. Since px > p0x , the model with m = m0 and px may be seen as the true model with the zero coefficients associated with the additional regressors. Then, the first term on the right hand side of (47) can be seen as the likelihood ratio test statistic for m, which is Op (1). On the other hand, the second term is the likelihood ratio test statistic for the extra regressors, which is Op (1). As a result, the left hand side of (47) goes to infinity when m > m0 and px > p0x because c1 g1 (T ) → ∞. In exactly the same manner we can see that the left hand side of (47) diverges to infinity when m = m0 and px > p0x and when m > m0 and px = p0x because c1 is positive in both the cases. Thus, we have (i). Similarly, in the case of (ii), the left hand side of (47) goes to infinity when m > m0 and px ≥ p0x because (m − m0 )g2 (T ) → ∞, while it does not diverge when m = m0 and px ≥ p0x because c1 g1 (T ) = O(1). Hence, we have (ii). In the case of (iii) we have the same equality as (46) with the op (T vT2 ) term replaced by the Op (1) term and thus the left hand side of (46) goes to infinity when m < m0 or px < p0x . On the other hand, we can see that (47) does not go to infinity when m ≥ m0 and px ≥ p0x because g1 (T ) and g2 (T ) are O(1). As a result, we have P (m ≥ m0 and px ≥ p0x ) → 1. References
− npxj n( n + 1 ) log 1Tj + Op (1) = − − m+1
j =1
2
4
m+1
=−
− pφj + pσ j =1
2
log 1Tj + Op (1).
From (40), (41) and (45) we finally have
(45)
Akaike, H., (1973). Information Theory and an Extension of the Maximum Likelihood Principle, in: B.N. Petrov and F. Csáki (eds.), 2nd International Symposium on Information Theory, Akadémiai Kiadó. Budapest, pp. 267–281. Andrews, D.W.K., 1993. Tests for parameter instability and structural change with unknown change point. Econometrica 61, 821–856. Andrews, D.W.K., Lee, I., Ploberger, W., 1996. Optimal changepoint tests for normal linear regression. Journal of Econometrics 70, 9–38. Bai, J., 1997a. Estimation of a change point in multiple regression models. Review of Economics and Statistics 79, 551–563.
238
E. Kurozumi, P. Tuvaandorj / Journal of Econometrics 164 (2011) 218–238
Bai, J., 1997b. Estimating multiple breaks one at a time. Econometric Theory 13, 315–352. Bai, J., 1999. Likelihood ratio tests for multiple structural changes. Journal of Econometrics 91, 299–323. Bai, J., 2000. Vector autoregressive models with structural changes in regression coefficients and in variance-covariance matrices. Annals of Economics and Finance 1, 303–339. Bai, J., Perron, P., 1998. Estimating and testing linear models with multiple structural changes. Econometrica 66, 47–78. Bai, J., Perron, P., 2003. Computation and analysis of multiple structural change models. Journal of Applied Econometrics 18, 1–22. Bai, J., Perron, P., 2006. Estimating and testing linear models with multiple structural changes. In: Corbae, D., Durlauf, S.N., Hansen, B.E. (Eds.), Econometric Theory and Practice. Cambridge University Press, Cambridge. Bhattacharya, P.K., Brockwell, P.J., 1976. The minimum of an additive process with applications to signal estimation and storage theory. Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete 37, 51–75. Burnham, K.P., Anderson, D.R., 2002. Model Selection and Multimodel Inference, 2nd ed. Springer, New York. Davis, R.A., Lee, T.C.M., Rodriguez-Yam, G.A., 2006. Structural break estimation for nonstationary time series models. Journal of the American Statistical Association 101, 223–239. Hannan, E.J., 1980. The estimation of the order of an ARMA process. Annals of Statistics 8, 1071–1081. Hannan, E.J., Deistler, M., 1988. The Statistical Theory of Linear Systems. Wiley, New York.
Hansen, B.E., 2009. Averaging estimators for regressions with a possible structural break. Econometric Theory 25, 1498–1514. Konishi, S., Kitagawa, G., 2008. Information Criteria and Statistical Modeling. Springer-Verlag, New-York. Liu, J., Wu, S., Zidek, J.V., 1997. On segmented multivariate regressions. Statistica Sinica 7, 497–525. Mallows, C.L., 1973. Some comments on Cp . Technometrics 15, 661–675. Ninomiya, Y., 2005. Information criterion for Gaussian change-point model. Statistics and Probability Letters 72, 237–247. Ninomiya, Y., 2006. AIC for change-point models and its application. In: Research Memorandum 1002. Institute of Statistical Mathematics. Perron, P., 2006. Dealing with structural breaks. In: Mills, T.C., Patterson, K. (Eds.), Palgrave Handbook of Econometrics, Vol. 1: Econometric Theory. Palgrave Macmillan, New York. Qu, Z., Perron, P., 2007. Estimating and testing structural changes in multivariate regressions. Econometrica 75, 459–502. Ross, S.M., 1996. Stochastic Processes, 2nd ed. Wiley, New York. Schwarz, G., 1978. Estimating the dimension of a model. Annals of Statistics 6, 461–464. Shibata, R., 1976. Selection of the order of an autoregressive model by Akaike’s information criterion. Biometrika 63, 117–126. Yao, Y.C., 1988. Estimating the number of change-points via Schwarz’ criterion. Statistics and Probability Letters 6, 181–189. Zhang, N.R., Siegmund, D., 2007. Modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics 63, 22–32.
Journal of Econometrics 164 (2011) 239–251
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
A new method of projection-based inference in GMM with weakly identified nuisance parameters Saraswata Chaudhuri a,∗ , Eric Zivot b,1 a
Department of Economics, CB 3305, University of North Carolina, Chapel Hill, NC 27519, United States
b
Department of Economics, Box 353330, University of Washington, Seattle, WA 98195, United States
article
info
Article history: Received 19 December 2008 Received in revised form 6 May 2011 Accepted 15 May 2011 Available online 25 May 2011 JEL classification: C12 C13 C30
abstract Projection-based tests for subsets of parameters are useful because they do not over-reject the true parameter values when either it is difficult to estimate the nuisance parameters or their identification status is questionable. However, they are also often criticized for being overly conservative. We overcome this conservativeness by introducing a new projection-based test that is more powerful than the traditional projection-based tests. The new test is even asymptotically equivalent to the related plug-inbased tests when all the parameters are identified. Extension to models with weakly identified parameters shows that the new test is not dominated by the related plug-in-based tests. © 2011 Elsevier B.V. All rights reserved.
Keywords: Projection test Nuisance parameters C (α) statistic GMM Weak identification
1. Introduction The usefulness of the traditional projection principle in designing tests that are not over-sized has been well established in a series of papers by Dufour and his co-authors (see, among others, Dufour (1990, 1997), Dufour and Jasiak (2001), and Dufour and Taamouti (2005b, 2007)). However, there are two reasons why these tests are also often criticized for being overly conservative. First, they use conventional critical values that are conservative. Second, the test statistics used are typically smaller than the corresponding plug-in-based test statistics rendering the conventional critical values even more conservative. The purpose of this paper is to introduce a new method of projection-based inference that overcomes these two problems and reduces the conservativeness generally associated with the traditional projection principle. In addition, when the parameters in the model are identified, the new method can be made asymptotically equivalent to the plug-in-based methods, a feature
∗
Corresponding author. Tel.: +1 919 966 3962; fax: +1 919 966 4986. E-mail addresses:
[email protected] (S. Chaudhuri),
[email protected] (E. Zivot). 1 Tel.: +1 206 543 6715; fax: +1 206 685 7477. 0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.05.012
that can be quite useful when the plug-ins required for the latter are computationally difficult to estimate. We describe the new method in the context of testing for subsets of parameters in a moment conditions model. The setup is described below. Consider a set of parameters θ whose unknown ‘‘true value’’ θ0 is defined by the moment restrictions E [gn (wt , θ )] = 0 ⇐⇒ θ = θ0
(1.1)
where gn : S × Θ → Rk is a known measurable function possibly dependent on sample size n, Θ ⊂ Rν (ν ≤ k) is the parameter space, {wt ∈ S : t = 1, . . . , n} is the sample of observations from the sample space S, and E [.] is the expectation with respect to a probability measure P0 that considers θ0 as the true value of θ . ′ ′ ′ Consider the partition θ = (θ1′ , θ2′ )′ , θ0 = (θ01 , θ02 ) and θ1 ∈ Θ1 , θ2 ∈ Θ2 where Θ1 × Θ2 = Θ . We are interested in the projection-based inference on the subsets of parameters θ1 , i.e., in testing H 0 : θ1 = θ10 versus H 1 : θ1 ̸= θ10 , by treating θ2 as the unknown nuisance parameters. We restrict √ attention to θ10 ∈ Θ1 in the n-neighborhood of θ01 when considering the properties of the testing procedures. In Section 2 we briefly describe the traditional projection principle for such testing and explain the reasons for its conservativeness. We refer to a test as ‘‘conservative’’ if its asymptotic size is less
240
S. Chaudhuri, E. Zivot / Journal of Econometrics 164 (2011) 239–251
than the allowable rate of Type-I error pre-specified by the user. Conservativeness (in terms of size) of a test leads to low power. In Section 3 we introduce our new method of projection and show how it overcomes the conservativeness of the traditional projec√ tion methods. We also show that the new method achieves ( nlocal) asymptotic equivalence with the plug-in-based methods of inference when θ is identified (to be defined precisely in Assumption W). This is done by the use of Neyman (1959)’s C (α) principle and a form of restricted projection that is motivated by Dufour (1990) and Robins (2004). Section 4 extends the new method to models where some or all the parameters are weakly identified in the sense of Stock and Wright (2000). The extension is non-trivial and non-unique because the feasible subset tests (both projection and plug-in based), that are generally recognized to work best, and our new method are only boundedly asymptotically pivotal under the generality of our weak identification assumptions. This hinders a neat analytical comparison of power. However, an extensive Monte Carlo experiment shows that the performance of the new method is always better than the traditional projection methods and is not always dominated by the related plug-in methods proposed by Kleibergen and Mavroeidis (2009), even in relatively small samples. A subset of our simulation results is reported in Section 4. We conclude in Section 5. Proofs of all results are collected in the Appendix. We use the following notation. For an a × b matrix A with full column rank, P (A) := A(A′ A)−1 A′ and N (A) := Ia − P (A) where Ia is the a × a identity matrix. If A is symmetric and positive semi1
definite then A 2 is the lower-triangular Cholesky factor of A such 1
1′
that A = A 2 A 2 . χv2 and χv2 (1 − ϵ), for ϵ ∈ (0, 1), are used to denote the distribution and the 1 − ϵ quantile, respectively, of a central chi-square distribution with v degrees of freedom. Unless confusion may arise, we suppress the explicit dependence of all the functions on the sample size n and the data and, e.g., write gn (wt , θ ) as gt (θ∑ ). For any X1 , X2 , . . . , Xn , we use X¯ n to denote the n sample average t =1 Xt /n. 2. The traditional projection principle In this section we briefly describe the methods of inference based on the traditional projection principle and discuss the reason for their conservativeness with respect to the plug-in-based methods. The conservativeness in terms of size arises because these projection methods are only boundedly asymptotically pivotal. The conservativeness leads to the low power of the traditional projection-based methods relative to the plug-in-based methods. Our maintained assumption for the next two sections is that there is no problem with the identification of θ . This serves two purposes. First, it provides a simple setup for describing these methods. Second, while the projection-based methods are practically useful in the presence of identification problems, their low power (due to conservativeness) relative to the plug-in-based methods is difficult to justify when there are no such problems. Therefore, this setup also allows us to emphasize the benefit of using our new method of projection when we introduce it in the next section, because we show that in the absence of identification problems our new method is asymptotically equivalent to the plug-in-based methods of inference. Under the standard assumptions (to be stated clearly in Sec-
√
d
P
tion 3) that (i) ng¯n (θ0 ) − → N (0, Vgg ) and (ii) Vgg (θ ) − → Vgg (θ ) (both are continuous and positive definite, and Vgg (θ0 ) = Vgg ), the Dufour and Taamouti (2005b)-type projection-based test rejects H 0 : θ1 = θ10 if inf Sn (θ10 , θ20 ) > χk2 (1 − ϵ),
θ20 ∈Θ2
(2.1)
where −1 Sn (θ ) := ng¯n′ (θ ) Vgg (θ )¯gn (θ ).
(2.2)
In this paper we will consider ϵ ∈ (0, 1) as the allowable rate of Type-I error pre-specified by the user. The test in (2.1) is the projection-based test using the S-statistic proposed by Stock and Wright (2000) and defined in (2.2). In the context of linear instrumental variables models with serially uncorrelated and conditionally homoskedastic errors, this is the projection-based Anderson–Rubin test (see Staiger and Stock (1997), Dufour and Jasiak (2001), Dufour and Taamouti (2005b, 2007)). The asymptotic size of the test cannot exceed ϵ . This test, however, has low power in over-identified models (i.e., with k > ν ) if some, e.g., k − ν , moment restrictions are not (very) informative of the unknown parameters θ . One way to reduce the conservativeness is to choose the ν restrictions that are most informative: −1 E G′ Vgg gt (θ ) = 0 ⇐⇒ θ = θ0
(2.3)
where G := G(θ0 ), G(θ ) := limn→∞ E [∂ g¯n (θ )/∂θ ′ ] (assuming differentiability and existence). Then, under the assumptions (i), P
(ii) and, additionally, (iii) Gn (θ ) − → G(θ ) (both full column rank at θ0 and continuous), the Dufour and Taamouti (2005b)-type projection-based test rejects H 0 : θ1 = θ10 if inf LMn (θ10 , θ20 ) > χν2 (1 − ϵ)
(2.4)
θ20 ∈Θ2
where −1/2 −1/2 −1/2 LMn (θ ) := ng¯n′ (θ ) Vgg (θ )P ( Vgg (θ ) Gn (θ )) Vgg (θ )¯gn (θ ). ′
′
(2.5) Here, LMn (θ ) is the S-statistic based on the efficient choice of ν linear combinations of k (estimated) moment restrictions given in (2.3). In other words, (2.5) gives a general form of the score statistic that, depending on the hitherto unspecified estimators Gn (θ ) and Vgg (θ ), includes the GMM score statistic of Newey and West (1987), the continuous updating GMM (CU-GMM) score statistic (i.e., the K statistic of Kleibergen (2005)), the generalized empirical likelihood (GEL) score statistic of Guggenberger and Smith (2005) and other versions of score statistics (based on the GEL-implied probabilities and the naive empirical probabilities) described in Chaudhuri and Renault (2011). We call the test based on (2.4)–(2.5) the usual projection-based score test. This test is still conservative because the critical value comes from the χν2 distribution whereas the number of parameters to be tested is ν1 . To see this, suppose that there exists θn2 (θ1 ) ∈ Θ2 (indexed by θ1 ) such that −1 G′n2 (θ1 , θn2 (θ1 )) Vgg (θ1 , θn2 (θ1 ))¯gn (θ1 , θn2 (θ1 )) = 0 √ and n θn2 (θ01 ) − θ02 = Op (1),
(2.6) (2.7)
where Gn2 (θ ) denotes the last ν2 columns of Gn (θ ). If θ = θ01 , it follows under standard assumptions that 0 1
d
inf LMn (θ01 , θ2 ) ≤ LMn (θ01 , θn2 (θ01 )) − → χν21 ,
θ2 ∈Θ2
(2.8)
and, therefore, the critical value χν2 (1 − ϵ) for the usual projectionbased score test is conservative. A strategy for avoiding this problem could be defining an alternative projection-based score test that rejects H 0 : θ1 = θ10 if inf LMn1 (θ10 , θ20 ) > χν21 (1 − ϵ)
θ20 ∈Θ2
(2.9)
S. Chaudhuri, E. Zivot / Journal of Econometrics 164 (2011) 239–251
where
D1.
−1/2
LMn1 (θ ) := ng¯n (θ )Vgg ′
(θ )
−1/2 −1/2 × P ( Vgg (θ ) Gn1 (θ )) Vgg (θ )¯gn (θ ), ′
′
(2.10)
and Gn1 (θ ) denotes the first ν1 columns of Gn (θ ). Here, LMn1 (θ ) is the S-statistic based on the first ν1 of the efficient choice of ν linear combinations of k (estimated) moment restrictions given in (2.3). When θ02 is known, these ν1 moment restrictions are the most informative about the ν1 × 1 vector θ01 and are sufficient to uniquely identify it. The alternative projection-based test in (2.9), however, is still conservative. To see this, note that for any θ , −1/2 LMn (θ ) = LMn1 (θ ) + ng¯n′ (θ ) Vgg (θ )P
−1/2′ −1/2′ −1/2′ × N Vgg (θ ) Gn1 (θ ) Vgg (θ ) Gn2 (θ ) Vgg (θ) gn (θ ). The right hand side is the sum of two (almost surely) non-negative variables. When θ10 = θ01 , without the (asymptotic) column −1/2 −1/2 rank deficiency of N Vgg (θ ) Gn1 (θ ) Vgg (θ ) Gn2 (θ ) at θ = ′ ′ ′ (θ , θ (θ01 )) , it follows under standard assumptions that ′
01
′
n2
inf LMn1 (θ01 , θ2 ) ≤ LMn1 (θ01 , θn2 (θ01 ))
θ2 ∈Θ2
d
< LMn (θ01 , θn2 (θ01 )) − → χν21 ,
(2.11)
and, therefore, the critical value χν1 (1 − ϵ) is conservative. Our simulations show that the alternative projection-based score test can even be less powerful than the usual projection-based score test. Neither the usual projection-based score test nor the alternative projection-based score test achieves asymptotic equivalence with the plug-in-based score test that rejects H 0 : θ1 = θ10 if 2
LMn (θ10 , θn2 (θ10 )) > χν21 (1 − ϵ)
(2.12)
where θn2 (θ10 ) satisfies (2.6) for θ1 = θ10 . This is a serious drawback of the projection-based methods because the plug-in-based score test, under standard regularity conditions (i.e., when it works), is known to have certain local optimality properties. 3. The new projection-based method of inference In this section we describe how to overcome the aforementioned drawbacks of the usual and alternative projection-based score tests by using a C (α) form of the score statistic accompanied by a restricted but feasible projection.2 We will define an infeasible score test for H 0 : θ1 = θ10 that uses the unknown true value θ02 of the nuisance parameters θ2 as a plug-in and show that both the plug-in-based score test defined in (2.12) and the new projectionbased score test to be defined in (3.4) are asymptotically equivalent to this infeasible score test. However, first let us list a set of standard assumptions for the results discussed so far and the new ones. Assumption Θ (Assumptions on θ0 and the Parameter Space). The k ≥ ν moment restrictions in (1.1) are satisfied for θ0 ∈ interior(Θ ) where Θ is a ν -dimensional compact subset of Rν . The ′ ′ ′ partition θ0 = (θ01 , θ02 ) is such that θ0i ∈ interior(Θi ) where Θi is a νi -dimensional compact subset of Rνi for i = 1, 2 and Θ = Θ1 × Θ2 . Assumption D (Assumptions on the Moment Vector and its Derivatives). The following hold for θ ∈ N ⊂ Θ where N is a nonshrinking open neighborhood of θ0 :
2 See Bera and Bilias (2001) for a survey of the use of Neyman (1959)’s C (α) statistic in econometrics.
√
241
d
ng¯n (θ0 ) − → √ Ψg ∼ N (0, Vgg ), and limn→∞ E [ ng¯n (θ )] ̸= 0.
√
n(θ − θ0 ) ̸= op (1) ⇒
P Vgg (θ ) − → Vgg (θ ) uniformly, where Vgg (θ ) is continuous at D2. P θ0 and Vgg (θ0 ) = Vgg is positive definite. Gn (θ ) − → G(θ ) := ′ limn→∞ E [∂ g¯n (θ )/∂θ ] uniformly, G(θ ) is continuous at θ0 and G := G(θ0 ) is full column rank. −1 D3. Gn (θ )′ Vgg (θ ) Gn (θ ) is positive definite. (Assumed for conve-
nience.) Remark 1. Assumption Θ and the first part of Assumption D1 rule out problems of asymptotic size distortion due to failure of the moment restrictions (e.g., endogeneity or near exogeneity of instruments). The second part of the Assumption D rules out problems due to weak identification. Both these are explicitly incorporated in Assumption W in the next section. The infeasible score test is defined as the test that rejects H 0 : θ1 = θ10 if 0 2 LMeff n1 (θ1 , θ02 ) > χν1 (1 − ϵ),
where
(3.1)
′ −1/2 −1/2′ ¯ LMeff (θ ) := n g (θ ) V (θ ) P N V (θ ) G (θ ) n2 n1 n gg gg −1/2′ −1/2′ gn (θ ). × Vgg (θ )Gn1 (θ ) Vgg (θ )
(3.2)
The test in (3.1) is infeasible because it uses the unknown true value of the nuisance parameters θ2 . Now a word on the superscript ‘‘eff’’ that is used to mean efficient. Note that the estimator θ1eff obtained by solving for θ1 from
−1/2 −1/2 G′1 Vgg N Vgg G2 g¯n (θ1 , θ02 ) = 0 ′
−1/2
(3.3) −1/2′
−1/2′
G1 )−1 where has asymptotic variance (G′1 Vgg N (Vgg G2 )Vgg G1 and G2 are, respectively, the first ν1 and the remaining ν2 columns of G. Under standard assumptions, this asymptotic variance attains the semi-parametric efficiency bound for estimators of θ1 in moment conditions models like (1.1) when θ02 is unknown. Therefore, in some sense, the left hand side of (3.3) is an efficient score function for θ1 . The statistic LMeff n1 (θ ) is a quadratic form of an estimator of this efficient score with respect to an estimator of the inverse of its asymptotic variance (or the asymptotic variance of θ1eff ). In other words, LMeff n1 (θ ) is the S-statistic based on ν1 restrictions obtained by taking the ortho-complement of the projection of the first ν1 rows of an estimator of the efficient moment restrictions in (2.3) on its last ν2 rows. LMeff n1 (θ ) can also be interpreted as Neyman (1959)’s C (α) statistic [see Bera and Bilias (2001)]. The local optimality properties of the plug-in score test in (2.12), when they hold, are due to its asymptotic equivalence with the infeasible √ score test in (3.1) when θ10 is n-local to θ01 . Now we define the new projection method using the statistic LMeff n1 (θ ). The new projection-based score test is defined as a test that rejects H 0 : θ1 = θ10
if C2n (1 − ζ , θ10 ) = empty, or if
inf
θ20 ∈C2n (1−ζ ,θ10 )
0 0 2 LMeff n1 (θ1 , θ2 ) > χν1 (1 − τ ) ,
(3.4)
where C2n (1 − ζ , θ10 ) ⊂ Θ2 is any asymptotic (1 − ζ ) × 100% confidence region for θ2 , and ζ , τ ∈ (0, 1) are specified by the user. This is a two-step procedure. In the first step one constructs any asymptotic confidence region C2n (1 − ζ , θ10 ) for the nuisance parameters θ2 with or without imposing the null hypothesis of interest. In the second step, one rejects H 0 : θ1 = θ10 if either
242
S. Chaudhuri, E. Zivot / Journal of Econometrics 164 (2011) 239–251
0 0 C2n (1 − ζ , θ10 ) is empty or the infimum of LMeff n1 (θ1 , θ2 ) for all 0 0 2 3 θ2 ∈ C2n (1 − ζ , θ1 ) is larger than χν1 (1 − τ ).
The following lemma is key to explaining the local asymptotic properties of the new projection-based score test for testing H 0 : θ1 = θ10 . These properties are described in Theorem 3.2.
√
Lemma 3.1. Suppose that θ10 = θ01 + d1 / n ∈ Θ1 for some fixed √ d1 and consider any θ20 = θ02 + dn2 / n ∈ Θ2 for some dn2 = eff
Op (1). Then, under assumptions Θ and D, LMn1 (θ10 , θ20 ) = eff LMn1 (θ10 , θ02 ) + op (1) and converges in distribution to a non−1/2
central χν21 with non-centrality parameter d′1 G′1 Vgg
−1/2′
N (Vgg
G2 )
−1/2′ Vgg G1 d1 .
Remark 2. (a) Lemma 3.1 states that as long as √ the unknown nuisance parameters θ2 are replaced by any n-consistent 0 estimator, the statistic LMeff n1 (θ1 , θ2 ) is asymptotically equivalent to its ‘‘infeasible form’’ that replaces θ2 by its unknown true √ value θ02 . In other words, small deviations of Op (1/ n) from the unknown true value θ02 of the nuisance parameters θ2 do not affect 0 the asymptotic behavior of LMeff n1 (θ1 , θ2 ). This property does not 0 hold for the statistics LMn (θ1 , θ2 ) and LMn1 (θ10 , θ2 ). (b) Additionally when (2.6) holds for θ1 = θ10 , in which case 0 0 0 0 LMeff n1 (θ1 , θn2 (θ1 )) = LMn (θ1 , θn2 (θ1 )), and when (2.7) holds, Lemma 3.1 justifies the use of the plug-in-based score test defined in (2.12). Theorem 3.2. Let assumptions Θ and D hold. Then the following results hold for any given ζ , τ ∈ (0, 1) as n → ∞: (i) If the asymptotic coverage of C2n (1 − ζ , θ01 ) is 1 − ζ ′ ′ ′ uniformly in θ02 ∈ Θ2 such that θ0 := (θ01 , θ02 ) satisfies assumptions Θ and D, then the asymptotic size of the new projection-based score test defined in (3.4) cannot exceed min(ζ + τ , 1). (ii) If C2n (1 − ζ , θ10 ) is non-empty almost surely and if supθ 0 ∈C
√
2
2n
n‖θ20 − θ02 ‖ = Op (1), then the new projection-based
(1−ζ ,θ10 )
score test defined in (3.4) is asymptotically equivalent to the infeasible score test defined in (3.1) √for any hypothesized value θ10 ∈ Θ1 such that θ10 = θ01 + d1 / n (where d1 is fixed). Remark 3. (a) Part (i) of the theorem allows for empty confidence regions C2n (1 − ζ , θ01 ), for the sake of the discussion of weak identification in the next section, and states that the size of the new projection test is always bounded from above by min(ζ +τ , 1). For the previously stated allowable rate of Type-I error ϵ , the user can specify ζ and τ such that ζ + τ = ϵ . Our experience suggests that the upper bound is not sharp and is mainly affected by τ whenever C2n (1 − ζ , θ01 ) is non-empty. (b) Part (ii) of the theorem is remarkable. √ As long as C2n (1 − ζ , θ10 ) is non-empty and belongs in the n-neighborhood of θ02 (as will be the case when θ02 is identified), the choice of ζ does not matter asymptotically and one can safely choose τ = ϵ , i.e., the allowable rate of Type-I error.4 In this case, the new projectionbased score test is asymptotically equivalent to the infeasible score
3 If C (1 − ζ , θ 0 ) ∩ boundary(Θ ) ̸= ∅, then one can take C (1 − 2n 2 2n 1 = C2n (1 − ζ , θ10 ) ∩ interior(Θ2 ) to avoid the occurrence of 0 0 infθ 0 ∈C (1−ζ ,θ 0 ) LMeff n1 (θ1 , θ2 ) on the boundary of Θ2 . This should not create any
ζ , θ10 ) 2
2n
1
problem asymptotically because θ02 ∈ interior(Θ2 ) by Assumption Θ . 4 While we simply assume the aforementioned properties of C (1 − ζ , θ 0 ), 2n 1 extending assumption D to θ ∈ N1 × Θ2 where N1 ⊂ Θ1 is a non-shrinking open neighborhood of θ01 is typically sufficient for these to hold.
test (defined in (3.1)) that uses the unknown true value θ02 of the nuisance parameters θ2 . The plug-in-based test defined in (2.12), when it works, is also asymptotically equivalent to the infeasible score test and thus equivalent to the new projection-based test. We note that the new projection-based test may have a computational advantage over the plug-in-based test whenever it is difficult to find the restricted estimator θn2 (θ10 ) of the nuisance parameters (to be plugged in) using certain methods like the empirical likelihood (EL) one.5 (c) The new test can also be seen as a Bonferroni-type test. However, it has an important difference from the usual Bonferronitype tests (see Moon and Schorfheide (2009) and the references therein). Unlike the latter tests where standard Bonferroni arguments give an upper bound ζ + τ for the asymptotic size, the asymptotic size for the new projection-based score test is τ if the conditions in Theorem 3.2(ii) are satisfied by C2n (1 − ζ , θ01 ). This is achieved by the use of the C (α) form of the score statistic. 4. Models with weakly identified parameters In this section we allow some or all elements of θ01 and θ02 to be weakly identified.6 The generality of the results and also the exposition in the last section do not carry through completely. The problems come from two distinct sources which we briefly discuss below. 4.1. Problems and the recent developments related to weak identification First, if some elements of θ01 (or θ02 ) are weakly identified, the score statistic of Newey and West (1987) is usually not asymptotically pivotal (under H 0 ) and can lead to severe upward size distortion (see Kleibergen (2005)). The problem is due to the estimator of the Jacobian G. In this case it is important to use an estimator of G that is independent of the average moment vector g¯n (θ ) (both appropriately scaled to avoid degeneracy), at least asymptotically. Kleibergen (2005) showed that the Jacobian estimator in the CU-GMM score statistic satisfies this property. Guggenberger and Smith (2005) extended this result by establishing the first-order equivalence of the Jacobian estimator for the entire GEL class that also includes CU-GMM (see also Newey and Smith (2004)). Second, for a case where some elements of θ02 (i.e., the nuisance parameters) are weakly identified, Stock and Wright (2000) showed that it is no longer possible to √ estimate these unknown (and unspecified by H 0 ) elements ( n-)consistently. As a result, the entire GEL class of plug-in score statistics are also no longer asymptotically pivotal (under H 0 ). Projection-based
√
5 It might be tempting for the user to plug in any easy to obtain n-consistent estimator (e.g., two-step GMM) of the nuisance parameter θ2 in an EL score test and avoid the difficulty of solving a saddle-point problem while hoping to exploit the desirable higher order properties of EL established by Newey and Smith (2004) (although in a different context). Lemma 3.1 assures that this should not affect the first-order asymptotic properties of the score test. However, in simulations by Chaudhuri and Renault (2011) this often caused some upward size distortion in (very small) finite samples. Their simulations also show that projection from an (easy to obtain) confidence region of θ2 satisfying the conditions in Theorem 3.2(ii), mitigates this issue without adversely affecting the finite-sample power much and still reduces the computational burden of solving the saddle-point problem of EL by restricting the search of the nuisance parameters to a confidence region instead of the entire parameter space. This also applies to the other computationally difficult GEL score tests. 6 In this paper, identifications of θ, θ , θ and θ , θ , θ respectively are used 1
2
0
01
02
interchangeably to refer to the same thing. In particular, we say that there is an identification problem with θ/θ1 /θ2 if some elements of θ0 /θ01 /θ02 are weakly identified. We hope that this is not unduly confusing.
S. Chaudhuri, E. Zivot / Journal of Econometrics 164 (2011) 239–251
methods have been traditionally found to be theoretically and practically useful in this case because they enable the user to impose any pre-specified allowable rate of Type-I error (ϵ ) as the upper bound to the asymptotic size of the test. In a major development to the weak identification literature, Kleibergen and Mavroeidis (2009) (henceforth, KM09) established an interesting result where they showed that under certain conditions the CU-GMM plug-in score statistic (i.e., the plug-in or subset-K statistic) is boundedly pivotal asymptotically — the asymptotic distribution function under the H 0 is bounded from the right by that of a central χν21 where ν1 is the dimension of θ1 . Hence the asymptotic size of the plug-in CU-GMM score test as defined in (2.12) cannot exceed the allowable rate of Type-I error ϵ . KM09 further argued that a plug-in-based method should be preferred over the corresponding projection-based method because while the use of the standard fixed χ 2 critical values does not lead to over-rejection by either, the former has better power properties. Therefore, in addition to showing that our new projection-based method outperforms the usual and alternative projection methods, it will also be worthwhile to explore how effective the use of the restricted projection and the C (α) statistic is in reducing the difference in power from the plug-in methods (when the latter work). While our new method and all the projection methods are applicable to the GEL class of score statistics of Guggenberger and Smith (2005) we focus our attention on CU-GMM because the results of KM09 have, as of now, only been proved for it. Use of the CU-GMM score necessitates specifying the form of the estimator of the expected Jacobian, i.e., Gn (θ ). From Kleibergen (2005) we know that this is given by
−1 ¯ n (θ ) − V1g (θ ) Vgg Gn (θ) := G (θ )¯gn (θ ), −1 −1 V2g (θ ) Vgg (θ )¯gn (θ ), . . . , Vν g (θ ) Vgg (θ )¯gn (θ ) ,
(4.1)
¯ n (θ ) := ∂ g¯n (θ )/∂θ . For l = 1, . . . , ν the k × k matrices where G Vlg (θ) are obtained as a byproduct of differentiating with respect to θl the CU-GMM objective function Sn (θ ) defined in (2.2). ′
Since tests involving the CU-GMM score statistic suffer from an undesired decline in power at irrelevant parameter values, Kleibergen (2005) proposed a hybrid test, called the subsetJKLM test, that rejects H 0 : θ1 = θ10
if Sn (θ10 , θn2 (θ10 )) − LMn (θ10 , θn2 (θ10 )) > χk2−ν (1 − ζ ),
or if LMn (θ10 , θn2 (θ10 )) > χν21 (1 − τ ) ,
(4.2)
where Sn (θ ) and LMn (θ ) are as defined in (2.2) and (2.5) respectively. Kleibergen (2005) provided simulation evidence of the undesired decline in power of the K test and its removal by the JKLM test. KM09 showed that the asymptotic size of the subsetJKLM test cannot exceed ζ + τ and recommended choosing ζ , τ such that ζ + τ = ϵ , i.e., the allowable rate of Type-I error. 4.2. The new projection-based score test The new projection-based score test in the presence of weakly identified parameters is the same as that in (3.4) where Gn (θ ) is as defined in (4.1) and
243
ζ + τ = ϵ ) is better than that of the usual and alternative projection-based tests and is not necessarily dominated by that of the plug-in-based tests like the subset-K test and the subset-JKLM test (with ζ + τ = ϵ ). Our results are based on assumptions that are relatively mild as compared to those in KM09. For e.g., the existence of a central limit theorem is assumed only for the truth (see Assumption D′ 2 below), and we do not need to make the difficult to verify assumption that the plug-in estimator θn2 (θ01 ) := arg minθ 0 ∈Θ Sn (θ01 , θ20 ) is the 2
2
only θ2 ∈ Θ2 such that (2.6) (for θ1 = θ01 ) holds. We also note that the method proposed here is more general than that in Chaudhuri et al. (2010) which only applies to split-sample linear instrumental variables regressions with weak instruments. Let us now state explicitly the assumptions before we describe the properties of the new projection-based score test. First we define the identification properties of the parameters. While the characterization of identification is a straightforward extension of the framework of Stock and Wright (2000), to the best of our knowledge this is the first paper that allows for both weakly and strongly identified parameters of interest and nuisance parameters. In what follows we use the following notation to distinguish weakly and strongly identified parameters. For j = w, s, suppose that νj = ν1j + ν2j , θj = (θ1j′ , θ2j′ )′ and Θj = Θ1j × Θ2j . This notation regroups the weakly identified parameters as θw and the (strongly) identified parameters as θs (as defined in Assumption W). The true values are, when convenient, regrouped ′ ′ ′ ′ ′ ′ as θ0w = (θ01 w , θ02w ) and θ0s = (θ01s , θ02s ) respectively. When necessary, N ⊂ Θ and Nr ⊂ Θr are generically used to denote non-shrinking open neighborhoods of θ0 and θ0r for r = := Nw × N1s × 1w, 1s, 2w, 2s, w, s, 1, 2 respectively. Define N Θ2s . Assumption Θ (Continued). For l = 1, 2, suppose that Θl = Θlw × Θls and for j = w, s, suppose that θ0lj ∈ interior (Θlj ) where Θlj ⊂ Rνlj is compact. Assumption √ W (Characterization of Weak Identification). E [¯gn (θ )] n (θ )/ n + m(θs ) where: =m
n (θ ) : Θ → Rk is such that m n (θ ) → m (θ ) uniformly for (a) m (θ ) is bounded and continuous and m (θ0 ) = 0. θ ∈ N where m n (θ ) := ∂ n (θ ) → M (θ ) uniformly. , M For θ ∈ N mn (θ )/∂θ ′ , M (θ ) = [M 1w (θ ), M 1s (θ ), M 2w (θ ), M 2s (θ )] where, for l = M lj (θ ) is bounded and 1, 2 and j = w, s, the k × νlj matrix M continuous. (b) m(θs ) : Θs → Rk is a continuous function and m(θs ) = 0 if and only if θs = θs0 . For θs ∈ N1s × Θ2s , M (θs ) := ∂ m(θs )/∂θs′ is bounded and continuous. M (θ0s ) has full column rank. Here, M (θs ) = [M1 (θs ), M2 (θs )] where Ml (θs ) := ∂ m(θs )/∂θls′ for l = 1, 2. To establish the desirable asymptotic properties of the new projection-based score test in the presence of weakly identified parameters, Assumption D from the last section needs to be augmented with further assumptions following Guggenberger and Smith (2005) and Kleibergen (2005). These assumptions are listed under Assumption D′ which, hereafter, replaces Assumption D.
(4.3)
Assumption D′ (Assumptions on the Moment Vector and its Derivative).
The choice of C2n (1 − ζ , θ ), which can be empty in practice, helps to eliminate the undesired decline in power at certain irrelevant parameter values (see Chaudhuri (2008) for details). This is done in the same spirit as the subset-JKLM test. Simulations at the end of this section show that the performance of the new test (with
¯ n (θ ) := ∂ g¯n (θ )/∂θ ′ = [G1wn (θ ), G1sn (θ ), G2wn (θ ), G2sn (θ )] D′ 1. G = E [G¯ n (θ )] + op (1) uniformly for θ ∈ N where E [G¯ n (θ )] = √ n (θ )/ n + [0, M1 (θs ), 0, M2 (θs )] from ∂ E [¯gn (θ )]/∂θ ′ = M imposing interchangeability of the order of differentiation and integration (and from Assumption W).
C2n (1 − ζ , θ ) := θ ∈ Θ2 : Sn (θ , θ ) ≤ χ (1 − ζ ) . 0 1
0 2
0 1
0 1
0 2
2 k
244
S. Chaudhuri, E. Zivot / Journal of Econometrics 164 (2011) 239–251
D′ 2.
√
d
¯ wn (θ0 ) − E [G¯ wn (θ0 )]) − → [Ψg′ , Ψw′ ] where7 n g¯n′ (θ0 ), v ec ′ (G
[
] Vgg (θ0 ) Ψg k×k ∼ N 0, V (θ0 ) := Vwg (θ0 ) Ψw
kνw ×k
Vg w (θ0 ) k×kνw
.
Vww (θ0 ) kνw ×kνw
Vgg (θ ) is bounded, continuous and positive definite, and P Vgg (θ ) − → Vgg (θ ) uniformly for θ ∈ N. Vwg (θ) is boun′ ded and continuous. Vwg (θ) := [ V1g (θ), . . . , Vν′1w ,g (θ ), P V′ (θ ), . . . , Vν′ +ν ,g (θ)]′ − → Vwg (θ ) uniformly for θ ∈ ν1 +1,g
1
2w
N .8 For l = ν1w + 1, . . . , ν1 , ν1 + ν2w + 1, . . . , ν the −1 k × k matrices Vlg (θ) are such that Vlg (θ) Vgg (θ) = Op (1) uniformly for θ ∈ N . (See (4.1) for the notation.)
Remark 4. (a) Assumption D′ 1 implies that G(θ ) := limn→∞ ¯ n (θ)] = [G1w (θ ), G1s (θ ), G2w (θ ), G2s (θ )] = [0, M1 (θs ), 0, E [G M2 (θs )]. Hence G(θ0 ) = [G1w (θ0 ), G1s (θ0 ), G2w (θ0 ), G2s (θ0 )] = [0, M1 (θ0s ), 0, M2 (θ0s )] has rank νs under Assumption W(b) and ′ satisfies the local identification condition for θs = (θ1s , θ2s′ )′ .
instead of N to ensure (b) Some results are assumed√to hold in N that for θ10 = θ01 + d1 / n (for some fixed d1 ), the region √ C2n (1 − ζ , θ10 ) defined in (4.3) belongs in the n-neighborhood of θ02 almost surely whenever ν2w = 0, i.e., whenever there are no weakly identified nuisance parameters. The following results assume Θ , W and D′ . In Lemma 4.1 we discuss the asymptotic properties of the C (α) form of the score statistic LMeff n1 (θ ) defined in (3.2) and using Gn (θ ) from (4.1). In Theorem 4.2 we discuss the properties of the new projection-based score test defined in (3.4) and using C2n (1 − ζ , θ10 ) from (4.3). Lemma 4.1. Let Assumptions Θ , W and D′ hold. Then the following eff results hold for LMn1 (θ1 , θ2 ) defined in (3.2) using Gn (θ ) from (4.1), as n → ∞: d
eff
(i) LMn1 (θ01 , θ02 ) − → χν21 . √ (ii) Suppose that νw = 0 and θ10 = θ01 + d1 / n ∈ Θ1 for some √ fixed d1 and consider any θ20 = θ02 + dn2 / n ∈ Θ2 for some eff
eff
dn2 = Op (1). Then, LMn1 (θ10 , θ20 ) = LMn1 (θ10 , θ02 )+ op (1) and converges in distribution to a non-central χν21 with non-centrality parameter −1/2
d1 M1 (θ0s )Vgg ′
′
−1/2′
(θ0 )N Vgg
(θ0 )M2 (θ0s )
Remark 5. (a) Asymptotic size: Theorem 4.2(i) gives an upper bound to the asymptotic size of the new projection-based score test and extends Theorem 3.2(i) to cases where some or all elements of the parameters of interest (θ1 ) and/or nuisance parameters (θ2 ) can be weakly identified. In the context of the plug-in-based tests such upper bounds were recently provided by KM09. Under their assumptions, plugging in the restricted CUGMM estimator of θ2 (which, additionally, needs to be the only θ2 ∈ Θ2 satisfying (2.6) for θ1 = θ01 ) results in an upper bound ϵ for the asymptotic size of the plug-in test in (2.12).9 Choosing ζ , τ such that ζ + τ = ϵ matches both the upper bounds. (b) Analytical comparison of power: We emphasize that under the generality of our Assumption W the feasible plug-in-based tests described in KM09 and our new test are only boundedly asymptotically pivotal. While the plug-in-based tests (when they work) are analytically shown by KM09 to be more powerful than the traditional projection-based tests, their argument does not hold when the plug-in tests are compared with the new projection test. Our only analytical result related to power is given in Theorem 4.2(ii) and is limited in scope (also see Remark 5(c)).10 Although under the full generality of Assumption W we were unable to provide a comprehensive analytical comparison of the plug-in and the new projection methods, our Monte Carlo simulations show that when the upper bound ϵ for the size of the plug-in-based tests is matched with that of the new test, no test dominates uniformly in terms of finite-sample power. (c) Undesired decline in power and the choice of the first-step confidence region: Theorem 4.2(ii) is weaker than Theorem 3.2(ii) because even under the null hypothesis the particular choice of C2n (1 −ζ , θ10 ) can lead to empty confidence regions in the first step of the new projection-based score test. While allowing C2n (1 − ζ , θ10 ) to be empty has a positive effect on power and, in particular, some undesired decline in power due to the use of the CU-GMM score is eliminated, it also makes the size of the test dependent on ζ and breaks the asymptotic equivalence with a size τ infeasible score test.11 On the other hand, with this particular choice of C2n (1 − ζ , θ10 ), computation is easier (empty C2n (1 − ζ , θ10 )—see e.g. Table 3—eliminates the need for the second-step test) and at the same time the new test is not dominated in terms of power by either the subset-K or the subset-JKLM test. (d) Choice of ζ and τ : Like for the subset-JKLM test defined in (4.2), the performance of the new projection-based score test now depends on the choice of ζ . There are many possible choices of
−1/2 × Vgg (θ0 )M1 (θ0s )d1 . ′
Theorem 4.2. Let Assumptions Θ , W and D′ hold. Then the following properties of the new projection-based score test defined in (3.4), using Gn (θ) from (4.1) and C2n (1 − ζ , θ10 ) from (4.3), hold for any given ζ , τ ∈ (0, 1) as n → ∞: (i) The asymptotic size of the new projection-based score test cannot exceed min(ζ + τ , 1). (ii) If νw = 0, then the new projection-based score test is asymptotically more powerful than the infeasible score test defined in (3.1) using Gn (θ ) from (4.1) and ϵ = τ and √ for any hypothesized value θ10 ∈ Θ1 such that θ10 = θ01 + d1 / n (where d1 is fixed).
7 The partition of Ψ (θ) = [Ψ ′ (θ), Ψ ′ (θ)]′ , V (θ) = [V (θ), V (θ)] = w gw g1 g2 1w 2w Vw′ g (θ) and Vww (θ) = (Vll′ (θ))l,l′ =1,2 follows the partition of θw = (θ1′ w , θ2′ w )′ , i.e., the partition of the weakly identified elements of θ into those from θ1 and θ2 respectively. 8 For convenience of reading the messy notation let us point out that the k × k P
√
√
matrix Vlg (θ0 ) − → Vlg (θ0 ) = Asym.Cov( n∂ g¯n (θ0 )/∂θl , n g¯n (θ0 )) where θl is the lth element (l = 1, . . . , ν1w , ν1 + 1, . . . , ν1 + ν2w ) of the vector θ .
9 Discussions of the plug-in-based tests are qualified by the phrase ‘‘when they work’’ because, following KM09, additional assumptions beyond Assumptions Θ , W and D are required for the tests to be boundedly asymptotically pivotal and hence for the upper bound to be valid. 10 As in Remark 3(b), we note here that the plug-in-based score test defined
in (2.12) with ϵ = τ , when it works, is also asymptotically equivalent to the infeasible score test and hence is covered by this result. However, as will be seen in the simulations under the conditions of Theorem 4.2(ii), even with ϵ = ζ + τ for the plug-in-based test, there is no observable difference between its rejection frequencies and that of the new projection-based score test when ζ is small relative to τ and the conditions of Theorem 4.2(ii) are satisfied. 11 To avoid this, other confidence regions can be used, e.g., those obtained by imposing H 0 and then subsequently inverting the K test or the modified quasilikelihood ratio test or even the Wald test for θ2 treating θ1 = θ10 as given. While the former two regions are known to have correct coverage probabilities (under H 0 ), they are also computationally extremely demanding [see Mikusheva (2010)]. The Wald confidence region does not have correct coverage probability when ν2w > 0, i.e., when there are weakly identified nuisance parameters, but this problem can be avoided by using KM09’s results. We do not, however, recommend the use of these confidence regions because the resulting projection-based test is always less powerful than the subset-K test with ϵ = τ , and does not have any theoretical or practical advantage over the subset-K test. KM09’s criticism of projection methods apply here.
S. Chaudhuri, E. Zivot / Journal of Econometrics 164 (2011) 239–251 WI-CaseI
WI-Case II
100
100 plug-in score/sub-K (5%) usual-proj score (5%) new-proj score(.5% + 4.5%) new-proj score (2% + 3%)
80
80
60
60
40
40
20
20
0 -5
0
5
0 -5
0
WI-Case III 100
80
80
60
60
40
40
20
20
-0.15
-0.1
-0.05
0
5
WI-Case IV
100
0 -0.2
245
0.05
0.1
0.15
0.2
0 -0.2
-0.15
-0.1
-0.05
0
0.05
0.1
0.15
0.2
Fig. 1. Plotted are the rejection rates of different tests against the horizontal axis = hypothesized θ1 − true θ1 . Results are based on 5000 Monte Carlo trials with k = 2 instruments/moments and sample size n = 100. The known upper bound for the asymptotic size of all these tests is 5%. The JKLM test is not defined for just-identified models. Table 1 Four special cases of weak identification for Monte Carlo study.
ν2w = 1, ν2s = 0 WI-Case I
ν2w = 0, ν2s = 1 WI-Case II
ν1w = 1, ν1s = 0
θ1 : weakly identified θ2 : weakly identified
θ1 : weakly identified θ2 : (strongly) identified
ν1w = 0, ν1s = 1
WI-Case III θ1 : (strongly) identified θ2 : weakly identified
WI-Case IV θ1 : (strongly) identified θ2 : (strongly) identified
ζ and τ such that ζ + τ = ϵ , the allowable rate of Type-I error. While Kleibergen (2005) recommended choosing τ close to ϵ (e.g., ζ = 4.5%, τ = .5% when ϵ = 5%), our simulations indicate that other choices can sometimes yield better power. (e) Restricted projection and other score statistics: The restricted projection from the first-step confidence region instead of the entire nuisance parameter space Θ2 , when applied to the usual projection-based score test (defined in (2.4)) or the alternative projection-based score test (defined in (2.9)) does not reduce their conservativeness much. As a result, while the upper bound from Theorem 4.2(i) continues to hold in such cases, the power property described in Theorem 4.2(ii) does not. Intuitively, whenever C2n (1 − ζ , θ10 ) defined in (4.3) is non-empty, it contains the CU-GMM estimator θn2 (θ10 ) := arg minθ2 ∈Θ2 Sn (θ10 , θ2 ). Therefore, results in (2.8) and (2.11) regarding the conservativeness of these two tests still apply and any increase in the rejection frequency due to the restricted projection happens only if the first-step confidence region is empty. On the other hand, the restricted projection is useful for the new test only because it uses the C (α) form of the score statistic. 4.3. Monte Carlo study In this subsection we perform a Monte Carlo experiment to compare the finite-sample rejection rates of various projectionbased score tests and also the subset-K and subset-JKLM tests of Kleibergen (2005). We illustrate the observations in Remark 5 with the help of these simulations. Following Dufour and Taamouti
(2005a), we draw wt = (yt , X1t , X2t , Zt′ )′ for t = 1, . . . , n(= 100) such that yt = X1t θ01 + X2t θ02 + ut , X1t = Zt′ Π1 + U1t , X2t = Zt′ Π2 + U2t
i.i.d
where (ut , U1t , U2t ) ∼ N
1 0.8 0.8
0, Σ =
0.8 1 0.3
0.8 0.3 1
.
The individual instruments in Zt are generated as i.i.d. N (0, 1) variables but are kept fixed over simulations. We report the results for k = 2, 4 and 8 instruments. √ The matrix Π = [Π1 , Π2 ] is constructed such that Π = C/ n where C = [C1 , C2 ] and the elements of Cj are set at 1.1547 and 20 for j = 1, 2 to represent ‘‘weak identification’’ and ‘‘strong identification’’ of θ1 and θ2 respectively.12 From the general setup of weak identification described in Assumption W, we consider the following four special cases listed in Table 1 for the simulations.13 The true values of θ1 and θ2 are set at θ01 = 0.5 and θ02 = 1. We vary the hypothesized value θ10 and report the empirical rejection rates of various nominal 5% tests when θ10 = θ01 in Table 2 to show finite-sample size. We report the same for a grid of values around θ10 − θ01 = 0 in Figs. 1–3 to show finite-sample power. Results are based on 5000 simulations. We consider a variety of feasible tests that share the following two characteristics: (i) all tests use fixed critical values that are quantiles of χ 2 distributions; and (ii) the only theoretical result known to be valid under WI-Cases I–IV is that the asymptotic size of these tests is bounded from above by the same number, 5%. These tests are: (1) the CU-GMM score/subset-K test defined in (2.12) with ϵ = 5%; (2) the usual projection-based score test defined in (2.4) with ϵ = 5%; (3) the new projection-based score
12 For all the cases considered, the minimum eigenvalue of the concentration matrix varies from 0.4762 to 1.3329 whenever θ1 and/or θ2 is weakly identified. 13 Additional simulation results based on related specifications are available upon request.
246
S. Chaudhuri, E. Zivot / Journal of Econometrics 164 (2011) 239–251
Table 2 Reported are the rates (as percentages) at which the tests reject the true value θ01 of the parameter of interest θ1 . Results are based on 5000 Monte Carlo trials with k = 4, 8 instruments/moments and sample size n = 100. The known upper bound for the asymptotic size of all these tests is 5% which is enforced either by choosing ϵ = 5% directly for one-step tests like subset-K or usual-proj score, or by choosing ζ and τ such that ζ + τ = 5% for the rest of the tests (which are all two step by nature). The frequencies are missing for sub-JKLM when k = 2 because the test is not defined in just-identified models. k
WI-Case
plug-in score/subset-K ϵ = 5%
ϵ = 5%
usual-proj score
new-proj score ζ = .5% τ = 4.5%
ζ = .5% τ = 4.5%
sub-JKLM
new-proj score ζ = 1% τ = 4%
ζ = 1% τ = 4%
sub-JKLM
new-proj score ζ = 2% τ = 3%
ζ = 2% τ = 3%
sub-JKLM
2 2 2 2 4 4 4 4 8 8 8 8
I II III IV I II III IV I II III IV
2.6 5.2 2.6 5.2 5.3 5.3 4.2 5.1 7.9 6.2 6.4 5
0.4 1.5 0.4 1.5 0.7 1.4 0.7 1.4 1 1.9 1 1.5
2.4 4.7 2.4 4.7 2.2 4.6 2.6 4.6 2 5.2 2.1 4.8
– – – – 4.9 5.5 3.9 5.2 7.6 6.3 6.3 5.3
2 4 2 4 2.2 4.4 2.5 4.3 2 5 2.1 4.6
– – – – 4.6 5.4 3.7 5.1 7.3 6.2 6.1 5.4
1.4 3 1.4 4 2 3.8 2.5 3.7 2.4 4.7 2.6 4.4
– – – – 4.2 5.5 3.7 5.3 6.6 6.3 5.7 5.9
Table 3 Reported are the rates (as percentages) for obtaining an empty first-step confidence region C2n (1 − ζ , θ10 ) where ζ = .5%, θ10 and θ01 are respectively the hypothesized and the true values of the parameter of interest θ1 . Results are based on 5000 Monte Carlo trials with k = 2, 4, 8 instruments/moments and sample size n = 100.
θ10 − θ01 −5 −4.5 −4 −3.5 −3 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
WI-Case I
θ10 − θ01
WI-Case II
k=2
k=4
k=8
k=2
k=4
k=8
0.1 0.1 0.1 0.1 0.1 0.08 0.08 0.08 0.06 0.04 0.04 0.76 1.68 0.52 0.26 0.18 0.16 0.16 0.16 0.16 0.14
0.3 0.2 0.2 0.2 0.2 0.2 0.2 0.1 0.0 0.1 0.1 1.3 2.8 2.2 1.6 1.1 0.8 0.7 0.7 0.6 0.6
0.6 0.5 0.5 0.5 0.5 0.4 0.3 0.2 0.2 0.1 0.1 1.2 2.3 2.6 2.3 2.0 1.6 1.5 1.4 1.3 1.3
2.62 2.4 2.24 2.1 1.9 1.68 1.52 1.24 0.78 0.52 0.14 1.66 17.88 20.02 15.26 12.04 9.96 8.88 7.92 7.2 6.76
5.2 5.0 4.6 4.1 3.7 3.3 2.7 1.8 1.1 0.5 0.2 2.9 33.4 38.1 30.3 24.8 21.1 18.6 17.0 15.6 14.5
5.2 5.0 4.7 4.3 3.9 3.5 2.8 2.1 1.3 0.8 0.5 3.4 30.9 37.1 29.2 23.6 20.5 18.2 16.3 15.1 14.2
test defined in (3.4); and (4) the subset-JKLM test defined in (4.2). For the last two tests we consider a series of choices of ζ and τ such that ζ + τ = 5%. For brevity, we only report a small subset of the results for these tests.14 From Table 2 it is clear that the finite-sample size of these tests is, in general, less than 5% with the exception of the case of the plug-in-based tests – the subset-K and subset-JKLM tests – which tend to slightly over-reject the truth in over-identified models. In terms of finite-sample power, the new projection-based score test is vastly superior to the usual (and the alternative) projection-based score test. In fact, its finite-sample power is often very similar to that of the plug-in-based tests. While unlike the simulations in Kleibergen (2005), ours do not show the dramatic undesired decline in finite-sample power of the subset-K test, there seems to be a small dip in WI-Case II (in Figs. 2 and 3 to the right of the point θ10 − θ01 = 0) that is, as expected, corrected by
14 Simulation results for the alternative projection-based score test defined in (2.9) with ϵ = 5% are not reported because this test seems to have unusually low finite-sample power. Simulation results for the projection-based S test defined in (2.1) with ϵ = 5% are not reported because Chaudhuri (2008) already documented the relatively poor finite-sample power properties for this test.
−0.2 −0.18 −0.16 −0.14 −0.12 −0.1 −0.08 −0.06 −0.04 −0.02 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2
WI-Case III
WI-Case IV
k=2
k=4
k=8
k=2
k=4
k=8
3.06 2.94 2.76 2.5 2.2 1.46 0.96 0.34 0.12 0.04 0.04 0.14 0.26 0.78 1.26 1.98 2.54 2.82 3.08 3.24 3.36
7.2 7.2 7.1 6.8 6.5 5.6 3.7 1.9 0.5 0.1 0.1 0.2 0.7 1.2 2.1 2.8 3.2 3.5 3.8 4.0 4.2
8.4 8.9 9.2 9.5 9.5 9.1 7.6 3.7 1.1 0.3 0.2 0.5 1.0 1.8 2.6 2.9 3.3 3.6 3.9 4.1 4.3
85.06 74.62 60.9 46.5 31.46 18.9 8.92 3.72 1.32 0.36 0.14 0.42 1.52 5.4 14.28 30.24 53.04 73.72 88.9 96.64 99.44
99.4 97.7 92.9 82.5 64.4 41.4 21.1 8.4 2.2 0.5 0.4 0.8 2.7 10.0 26.6 52.7 78.7 93.6 98.7 99.9 100.0
99.9 99.3 96.4 87.8 69.6 44.4 23.0 9.3 3.0 1.2 0.5 0.9 3.0 9.9 25.6 51.3 75.8 91.8 98.3 99.8 100.0
the subset-JKLM test and the new projection-based score test. Our simulations show that the new projection-based score test can also be more powerful than the plug-in-based tests.15 This is illustrated for WI-Case III in Fig. 3 (and less clearly in Fig. 2) for alternatives to the left of the point θ10 − θ01 = 0. This remarkable result counters KM09’s claim that the projection-based tests cannot be more powerful than the plug-in-based tests. Their claim is of course correct for the traditional projection-based score tests. It may appear that the restricted projection is the main driving force behind the good power performance of the new projectionbased score test and hence could be applied to the usual projectionbased score test and the alternative projection-based score test to make their performance comparable to that of the new test. This intuition, however, is not correct. While, as mentioned in Remark 5(e), such restricted projection does indeed improve the power performance of the usual and the alternative projectionbased score tests, the improvement is not sufficient to make them comparable to the new test. To see this, first note that the restricted projection applied to the usual and alternative methods results in
15 A similar point was made, albeit less convincingly, in Zivot and Chaudhuri (2009).
S. Chaudhuri, E. Zivot / Journal of Econometrics 164 (2011) 239–251 WI-CaseI
247
WI-Case II
100
100 plug-in score/sub-K (5%) usual-proj score (5%) new-proj score(.5% + 4.5%) sub-JKLM (.5% + 4.5%) new-proj score (2% + 3%) sub-JKLM (2% + 3%)
80 60
80 60
40
40
20
20
0 -5
0
5
0 -5
0
WI-Case III
WI-Case IV
100
100
80
80
60
60
40
40
20
20
0 -0.2
-0.15
-0.1
-0.05
0
0.05
5
0.1
0.15
0.2
0 -0.2
-0.15
-0.1
-0.05
0
0.05
0.1
0.15
0.2
Fig. 2. Plotted are the rejection rates of different tests against the horizontal axis = hypothesized θ1 − true θ1 . Results are based on 5000 Monte Carlo trials with k = 4 instruments/moments and sample size n = 100. The known upper bound for the asymptotic size of all these tests is 5%. WI-CaseI
WI-Case II
100
100 plug-in score/sub-K (5%) usual-proj score (5%) new-proj score(.5% + 4.5%) sub-JKLM (.5% + 4.5%) new-proj score (2% + 3%) sub-JKLM (2% + 3%)
80 60
80 60
40
40
20
20
0 -5
0
5
0 -5
0
WI-Case III
WI-Case IV
100
100
80
80
60
60
40
40
20
20
0 -0.2
-0.15
-0.1
-0.05
0
0.05
5
0.1
0.15
0.2
0 -0.2
-0.15
-0.1
-0.05
0
0.05
0.1
0.15
0.2
Fig. 3. Plotted are the rejection rates of different tests against the horizontal axis = hypothesized θ1 − true θ1 . Results are based on 5000 Monte Carlo trials with k = 8 instruments/moments and sample size n = 100. The known upper bound for the asymptotic size of all these tests is 5%.
tests that reject H 0 : θ1 = θ10 if
powerful than their hybrid (of projection and plug-in) versions that reject H 0 : θ1 = θ10 if
Restricted Usual: C2n (1 − ζ , θ ) = empty, or 0 1
inf
θ20 ∈C2n (1−ζ ,θ10 )
LMn (θ10 , θ20 ) > χν2 (1 − τ ),
(4.4)
LMn (θ10 , θn2 (θ10 )) > χν2 (1 − τ ),
Restricted Alternative: C2n (1 − ζ , θ ) = empty, or 0 1
inf
θ20 ∈C2n (1−ζ ,θ10 )
LMn1 (θ , θ ) > χν1 (1 − τ ). 0 1
0 2
2
Hybrid Usual: C2n (1 − ζ , θ10 ) = empty, or (4.6)
Hybrid Alternative: C2n (1 − ζ , θ ) = empty, or 0 1
(4.5)
Now note that whenever C2n (1 − ζ , θ10 ) is non-empty it will, by construction, contain θn2 (θ10 ), i.e., the CU-GMM estimator of θ2 0 restricted by H . Therefore, the tests in (4.4) and (4.5) are less
LMn1 (θ10 , θn2 (θ10 )) > χν21 (1 − τ ).
(4.7)
In Fig. 4 we plot the finite-sample power of these two tests along with that of the new projection-based score test (all with ζ = .5% and τ = 4.5%). The new projection-based score test is still more powerful than the hybrid tests and hence the restricted usual and
248
S. Chaudhuri, E. Zivot / Journal of Econometrics 164 (2011) 239–251 WI-Case I, k = 2
WI-Case II, k = 2
WI-Case III, k = 2
WI-Case IV, k = 2
100
100
100
100
80
80
80
80
60
60
60
60
40
40
40
40
20
20
20
20
0 -5
0
5
0 -5
WI-Case I, k = 4
0
5
0 -0.2
WI-Case II, k = 4
-0.1
0
0.1
0.2
0 -0.2
WI-Case III, k = 4
100
100
100
80
80
80
80
60
60
60
60
40
40
40
40
20
20
20
20
0
5
0 -5
WI-Case I, k = 8
0
5
0 -0.2
WI-Case II, k = 8
-0.1
0
0.1
0.2
0 -0.2
WI-Case III, k = 8
100
100
100
80
80
80
80
60
60
60
60
40
40
40
40
20
20
20
20
0
5
0 -5
0
hybrid usual-proj score (.5% + 4.5%)
5
0 -0.2
-0.1
0
0.1
hybrid alt-proj score (.5% + 4.5%)
0.1
0.2
-0.1
0
0.1
0.2
WI-Case IV, k = 8
100
0 -5
0
WI-Case IV, k = 4
100
0 -5
-0.1
0.2
0 -0.2
-0.1
0
0.1
0.2
new-proj score(.5% + 4.5%)
Fig. 4. Plotted are the rejection rates of the hybrid tests described in (4.6) and (4.7) and the new projection-based score test against the horizontal axis = hypothesized θ1 − true θ1 . Results are based on 5000 Monte Carlo trials with k = 2, 4, 88 instruments/moments and sample size n = 100. For all three, ζ = .5% and τ = 4.5% and hence the known upper bound for the asymptotic size of all three tests is 5%.
restricted alternative projection-based score tests. As mentioned before in Remark 5(e), although the contribution of the empty first-step confidence region (listed in Table 3) to the finite-sample power is same for all three tests in Fig. 4, it is the extra power due to the use of the C (α) form that leads to the superior performance of the new projection-based score test. 5. Conclusion Projection-based methods of inference on subsets of parameters have been traditionally considered useful for obtaining tests that do not over-reject the true parameter values when either it is difficult to estimate the nuisance parameters or their identification status is questionable. However, these tests are also often criticized for being overly conservative. In this paper we tried to address the problem of conservativeness by introducing a new method of projection-based inference for subsets of parameters. The new projection-based test substantially outperforms the traditional projection-based tests in terms of power. The new test relies on a C (α) form of the score statistic for the parameters of interest and a restricted projection for the nuisance parameters. In the context of moment conditions models without any identification problem, this new projection-based test is even asymptotically equivalent to an infeasible test that uses the unknown true value of the nuisance parameters as the plug-in. This also leads to the asymptotic equivalence with the feasible plug-in-based tests (when they work) that plug in an estimator of the nuisance parameters obtained by, say, solving a set of first conditions after imposing the null hypothesis. The result is remarkable because, to our knowledge, the existing projection-based methods in the literature do not possess this desirable asymptotic property. The result also has practical relevance especially when it is difficult or impossible to obtain a point estimator of the nuisance parameters to be plugged in to
implement the feasible plug-in-based tests (see Chaudhuri and Renault (2011)). Kim (2009) makes a promising advance in this direction by applying the idea of restricted projection for inference in models with partially identified parameters. In the context of inference using CU-GMM in models with weakly identified parameters, the new method of projection also conclusively serves the main purpose of our paper, that is, it reduces the conservativeness of the traditional projection-based methods of inference substantially. Interestingly, the simulations also indicate that with our preferred choice of restricted projection, the finite-sample power of the new projection-based test can be greater than that of the plug-in-based methods (with matched upper bounds for asymptotic size) against certain alternatives. Acknowledgments We are grateful to R. Gallant (co-editor), an anonymous associate editor, three anonymous referees, T. Richardson and J. Robins for their extremely helpful comments and important suggestions. We also thank E. Renault, F. Kleibergen, S. Mavroeidis, J.M. Dufour, R. Startz, M. Carrasco, D. Andrews, J. Hill, S. Park, participants at UBC, U. Montreal, NCSU, UNC Chapel Hill, Queen’s, Waterloo, U. Washington, the ESRC Study Group Conference (2007), Midwest Econometrics Conference (2007) and the winter (2008) and summer (2009) meetings of the Econometric Society, for helpful suggestions. Appendix Proof of Lemma 3.1. First note that for n large enough, θ 0 := (θ10 , θ20 ) ∈ N , and √ hence applying assumption D2 after a mean√ value expansion of ng¯n (θ 0 ) around ng¯n (θ0 ), it follows that
S. Chaudhuri, E. Zivot / Journal of Econometrics 164 (2011) 239–251 0 LMeff n1 (θ ) =
−1/2 ng¯n (θ0 ) + G1 d1 + G2 dn2 Vgg
0 inf can also be expressed as θn2 (θ10 ) = θ02 + dinf n2 (θ1 )/ n for 0 inf some dn2 (θ1 ) = Op (1), and, therefore, for n large enough,
−1/2 −1/2 G1 × P N Vgg G2 Vgg √ −1/2′ × Vgg ng¯n (θ0 ) + G1 d1 + G2 dn2 + op (1) √ ′ −1/2 −1/2′ P N Vgg G2 = ng¯n (θ0 ) + G1 d1 Vgg √ −1/2′ −1/2′ × Vgg G1 Vgg ng¯n (θ0 ) + G1 d1 + op (1) ′
′
′
−1/2′
−1/2′
G2 Vgg
G2
= 0. Convergence in distribution to the non-central χν1 distribud −1/2′ √ tion follows from Assumption that Vgg ng¯n (θ0 ) − → D on noting −1/2′ −1/2′ N (0, Ik ) and that rank of P N Vgg G2 Vgg G1 is ν1 . 2
Remark 6. In the rest of the proofs when we discuss the size of the tests for H 0 : θ1 = θ10 or the asymptotic coverage of the confidence regions for θ02 under H 0 : θ1 = θ10 , it is worthwhile to note that the results hold for any true values θ02 that (along with θ01 ) characterize the model by satisfying Assumptions Θ and Assumption D for results in Section 3, and Θ , W and D′ for results in Section 4. In this regard the result of Lemma 3.1 is important because it shows, from considering √ the sequence of the nuisance parameters θ20 = θ02 + dn2 / n, that there is no discontinuity 0 in the asymptotic distribution of the statistic LMeff n1 (θ1 , θ2 ) at θ02 . For all other statistics considered in this paper, the discontinuity due to such a sequence of nuisance parameters manifests in the non-centrality parameter of the asymptotic distribution. A formal characterization of the model and the appropriate sequence of nuisance parameters (required for the asymptotic approximation of the exact size) is possible by extending, e.g., equations (2.5) and (3.16) of Guggenberger (2010) and imposing h11 = 0. However, this is not done here explicitly because our main results do not establish the asymptotic size of the sub-vector tests; rather they establish an upper bound to it by using the asymptotic size of the full-vector weak-identification-robust tests that are already well documented in Kleibergen (2005).16 Proof of Theorem 3.2. (i) First note that given the correct asymptotic coverage, θ02 is contained in C2n (1 − ζ , θ01 ) with probability approaching 1 − ζ . Now, conditionally on θ02 ∈ C2n (1 − ζ , θ01 ), Lemma 3.1 implies that inf
θ20 ∈C2n (1−ζ ,θ01 )
0 eff 2 LMeff n1 (θ01 , θ2 ) ≤ LMn1 (θ01 , θ02 ) ≤ χν1 (1 − τ )
with probability approaching 1 − τ (for the last inequality). Therefore, from the definition of the new projection-based score test in (3.4), it follows by Bonferroni arguments that the (unconditional) asymptotic size of the test cannot exceed 1 − (1 − ζ )(1 − τ ) ≤ ζ + τ . (ii) C2n (1 − ζ , θ10 ) is assumed to be non-empty almost surely and 0 0 hence infθ 0 ∈C (1−ζ ,θ 0 ) LMeff n1 (θ1 , θ2 ) exists except for a negligible 2
2n
1
set that should not affect the rest of the √ arguments required for the result. Now since supθ 0 ∈C (1−ζ ,θ 0 ) n‖θ20 − θ02 ‖ = Op (1), the 2n 2 1 nth element of the sequence inf {θn2 (θ10 ) ∈ C2n (1 − ζ , θ10 ), where the infimum
inf
θ20 ∈C2n (1−ζ ,θ10 )
′
inf (θ10 , θn2 (θ10 )) ∈ N . Hence, the result follows directly from 0 eff 0 Lemma 3.1 which states that LMeff n1 (θ1 , θ2 ) = LMn1 (θ1 , θ02 ) + √ op (1) for any θ2 that is n-local to θ02 .
0 = LMeff n1 (θ1 , θ02 ) + op (1).
The second equality follows on noting that N Vgg
249
√
′
√
0 0 LMeff n1 (θ1 , θ2 ) is attained},
Proof of Lemma 4.1. (i) For l = 1, 2, define Λl as a diagonal matrix of order νl such that the first νlw diagonal elements are √ n and the rest of the diagonal elements are 1. Note that for any θ , the statistic LMeff n1 (θ ) is numerically invariant under postmultiplication of Gn1 (θ ) by Λ1 and of Gn2 (θ ) by Λ2 . ′ From √ Assumptions W and D 1 we know that g¯n (θ0 ) = Op (1/ n). Hence for l = 1, 2 and q = (l − 1)ν1 + νlw , further application of Assumption D′ gives
−1 ¯ lsn (θ0 ) − Glsn (θ0 ) = G Vq+1,g (θ0 ) Vgg (θ0 )¯gn (θ0 ), . . . , −1 Vq+νls ,g (θ0 ) Vgg (θ0 )¯gn (θ0 ) √ P = [Ml (θ0s ) + op (1)] − [Op (1) × Op (1/ n)] − → Ml (θ0s ). (A.1) On the other hand, from Assumptions W(a) and D′ we know that (r ) for l = 1, 2, the rth column of Glwn (θ0 ), denoted by Glsn (θ0 ), is such that for q = (l − 1)ν1 + r,
√
√
(r )
√
(r )
−1 ¯ lwn (θ0 ) − Vqg (θ0 ) Vgg nG (θ0 ) ng¯n (θ0 ),
n Glwn (θ0 ) =
√
¯ (lwr )n (θ0 ) − E [G¯ (lwr )n (θ0 )] n G
=
√ √ −1 − Vqg (θ0 ) Vgg (θ0 ) ng¯n (θ0 ) + nE [G¯ (lwr )n (θ0 )], d
−1 − → [Ψl[r ] − Vlg[r ] (θ0 )Vgg (θ0 )Ψg ]
(r ) (θ0 ) = G∗[r ] (say) +M lw l [r ]
where Ψl
(A.2)
denotes the k × 1 block containing the (r − 1)k +
Vqg (θ0 ) denotes 1th to rkth rows of Ψl , and Vlg (θ0 ) = plim the k × k block containing the (r − 1)k + 1th to rkth rows of [r ] Vlg (θ0 ). Note that, by construction and from Assumption D′ 2, Ψl − [r ]
−1 (θ0 )Ψg and Ψg are asymptotically jointly normally Vlg (θ0 )Vgg distributed, uncorrelated, and hence independent. Now, for l = ∗[ν ] ∗[1] ∗[2] 1, 2, defining G∗l := [Gl , Gl , . . . , Gl lw , Ml (θ0s )], it follows ′ from (A.1), (A.2) and Assumption D 1 that [r ]
d
−1/2 −1/2 LMeff → Ψg′ Vgg (θ0 )P N Vgg (θ0 )G∗2 n1 (θ01 , θ02 ) − ′
−1/2′ −1/2′ (θ0 )G∗1 Vgg (θ0 )Ψg ∼ χν21 , × Vgg conditionally on G∗1 , G∗2 , and hence unconditionally because each element of these two matrices is independent of Ψg . (ii) This follows from Lemma 3.1 once we note that for n large enough, θ 0 ∈ N and that the conditions of the lemma are also satisfied here. In particular, now for l = 1, 2 and q = (l − 1)ν1 +νlw ,
−1 0 ¯ ln (θ 0 ) − Gln (θ 0 ) = G Vq+1,g (θ 0 ) Vgg (θ )¯gn (θ 0 ), . . . , −1 0 Vq+νls ,g (θ 0 ) Vgg (θ )¯gn (θ 0 ) [ √ 1 0 0 −1 0 ¯ = Gln (θ ) − ng¯n (θ 0 ), . . . , √ Vq+1,g (θ )Vgg (θ ) n
16 Noting that LM (θ) = LMeff (θ) + LM (θ) where the last statistic is the same n n2 n1 as (2.9) with the subscripts 1 replaced by 2, the remark is also seen to be applicable to the infeasible test, by appealing to Kleibergen’s results.
1
−1 0 Vq+νls ,g (θ 0 ) Vgg (θ ) √
n
√
] ng¯n (θ ) 0
250
S. Chaudhuri, E. Zivot / Journal of Econometrics 164 (2011) 239–251
√ = [Ml (θ0 ) + op (1)] − Op (1/ n) √ ng¯n (θ0 ) + M1 (θ0s )d1 + M2 (θ0 )dn2
probability of the infeasible test of the hypothesis H 0 : θ1 = θ10 is lim Pθ0 (Dn ) = lim Pθ0 (Dn ∩ An ) + lim Pθ0 (Dn ∩ Acn ) n
n
n
≤ lim Pθ0 (An ) + lim Pθ0 (Dn ∩ Acn )
P
− → Ml (θ0 ),
n
n
and the probability limit, which does not depend on the deviation from θ0 , is the same as Gl in the statement of Lemma 3.1. (The subscript s is irrelevant here because no parameter is weakly identified, i.e., νw = ν1w = ν2w = 0.)
= lim Pθ0 (An ) + lim Pθ0 (Acn )Pθ0 (Dn |Acn )
Proof of Theorem 4.2. (i) First note that Assumption D′ 2 gives
= lim Pθ0 (An ) + lim Pθ0 (Acn )Pθ0 (Cn |Bn )
n n
n
−1 Sn (θ0 ) − → Ψg′ Vgg (θ0 )Ψg ∼ χk2 and this holds for any θ0 ∈ Θ
op (1) uniformly for θ2 ∈ Θ20 := {θ20 ∈ Θ2 : θ20 = θ02 + dn2 where dn2 ̸= op (1)}. Second, note that from Assumptions W and D′ , we know that for any fixed s ∈ (0, 1/2), n2s−1 Sn (θ10 , θ2 ) − −1 ns (θ2 − θ02 )′ M2′ (θ0 )Vgg (θ0 )M2 (θ0 )ns (θ2 − θ02 ) = op (1) uniformly
for θ2 ∈ Θ2s := {θ20 ∈ Θ2 : θ20 = θ02 + dn2 /ns where dn2 ̸= op (1)}. Therefore, by Assumptions W(b) and D′ 2 i.e., by respectively using m(θ) = 0 ⇐⇒ θ = θ0 , Vgg (θ ) is positive definite for θ ∈ N and M2 (θ0 ) is full column rank, it follows that the probability limits of Sn (θ10 , θ2 )/n and n2s−1 Sn (θ10 , θ2 ) are greater than 0 uniformly for θ2 in respectively Θ20 and Θ2s (for s ∈ (0, 1/2)). Therefore, the statistic Sn (θ10 , θ20 ), inverting which C2n (1 − ζ , θ10 ) is obtained, diverges to +∞ in both cases. Hence any θ20 = θ02 + dn2 /ns (where dn2 ̸= op (1) and s ∈ [0, 1/2)) has a zero probability of being contained in C2n (1 − ζ , θ10 ) as n → ∞.17 This proves the result that C2n (1 − ζ , θ10 ), if non-empty, can only contain points in the √ n-neighborhood of θ02 . [Part 2:] Define the following events: An := C2n (1 − ζ , θ10 ) is empty
Bn :=
θ20 ∈ C2n (1 − ζ , θ10 ) ⇒ θ20 √ = θ02 + d2n / n ∈ Θ2 for some d2n = Op (1)
Cn :=
inf
θ20 ∈C2n (1−ζ ,θ10 )
0 0 2 LMeff n1 (θ1 , θ2 ) > χν1 (1 − τ )
0 2 Dn := LMeff n1 (θ1 , θ02 ) > χν1 (1 − τ ) .
An implication of what has been just proved in [Part 1] is that Bn and Acn are asymptotically the same event. Hence conditioning on Bn instead of on Acn and vice versa does not change the asymptotic probability of any event. Now note that for any true value θ0 ∈ Θ satisfying Assumptions Θ , W and D′ , the asymptotic rejection
17 There is no need to consider θ 0 = θ + d /ns for s < 0, because for sufficiently 02 n2 2 large n this will be almost surely outside Θ2 (compact) and hence, by definition, 0 cannot be contained in C2n (1 − ζ , θ1 ) with non-zero probability.
n
(from Lemma 4.1(ii))
satisfying Assumptions Θ , W and D′ . Therefore, C2n (1 − ζ , θ10 ), defined in (4.3), contains θ02 with probability approaching 1 − ζ when H 0 is true (i.e., θ10 = θ01 ). Hence the proof follows in exactly the same way as that of Theorem 3.2(i), using Lemma 4.1(i).
[Part 1:] The result will be proved on the basis of the following two observations. First note that from Assumptions W and D′ 2, −1 we know that Sn (θ10 , θ2 )/n − m′ (θ01 , θ2 )Vgg (θ01 , θ2 )m(θ01 , θ2 ) =
n
(from [Part 1])
d
(ii) This proof has two parts — the first part establishes that under the conditions of the theorem, the region C2n (1 − ζ , θ10 ), if non√ empty, can only contain points in the n-neighborhood of θ02 . The second part uses this result and follows the same technique as in the proof of Theorem 3.2(ii) to establish our final result.
n
= lim Pθ0 (An ) + lim Pθ0 (Acn )Pθ0 (Dn |Bn )
= lim Pθ0 (An ) + lim Pθ0 (Acn )Pθ0 (Cn |Acn ) n
n
(from [Part 1])
= lim Pθ0 (An ) + lim Pθ0 (Cn ∩ Acn ), n
n
which is the asymptotic rejection probability of the new projectionbased score test. The proof of the equality third from the bottom follows that of Theorem 3.2(ii) on utilizing Lemma 4.1(ii), which establishes the √invariance of the asymptotic distribution of 0 LMeff n-local deviations from the true θ02 . n1 (θ1 , θ2 ) under References Bera, A.K., Bilias, Y., 2001. Rao’s score, Neyman’s C (α) and Silvey’s LM tests: an essay on historical developments and some new results. Journal of Statistical Planning and Inference 97, 9–44. Chaudhuri, S., 2008. Projection-type score tests for subsets of parameters. Ph.D. thesis, University of Washington. Chaudhuri, S., Renault, E., 2011. Finite-sample improvements of score tests by the use of implied probabilities from generalized empirical likelihood. Tech. Rep. University of North Carolina, Chapel Hill. Chaudhuri, S., Richardson, T., Robins, J., Zivot, E., 2010. Split-sample score tests in linear instrumental variables regression. Econometric Theory 26, 1820–1837. Dufour, J.M., 1990. Exact tests and confidence sets in linear regressions with autocorrelated errors. Econometrica 58, 475–494. Dufour, J.M., 1997. Some impossibility theorems in econometrics with applications to structural and dynamic models. Econometrica 65, 1365–1388. Dufour, J.M., Jasiak, J., 2001. Finite sample limited information inference methods for structural equations and models with generated regressors. International Economic Review 42, 815–843. Dufour, J.M., Taamouti, M., 2005a. Further results on projection-based inference in iv regressions with weak, collinear or missing instruments, discussion Paper. Dufour, J.M., Taamouti, M., 2005b. Projection-based statistical inference in linear structural models with possibly weak instruments. Econometrica 73, 1351–1365. Dufour, J.M., Taamouti, M., 2007. Further results on projection-based inference in iv regressions with weak, collinear or missing instruments. Journal of Econometrics 139, 133–153. Guggenberger, P., 2010. On the asymptotic size distortion of tests when instruments locally violate the exogeneity assumption. Tech. Rep. University of California, San Diego. Guggenberger, P., Smith, R., 2005. Generalized empirical likelihood estimators and tests under partial, weak and strong identification. Econometric Theory 21, 667–709. Kim, K.I., 2009. Inference for subsets of parameters in partially identified models. Tech. Rep, University of Minnesota. Kleibergen, F., 2005. Testing parameters in GMM without assuming that they are identified. Econometrica 73, 1103–1123. Kleibergen, F., Mavroeidis, S., 2009. Inference on subsets of parameters in GMM without assuming identification. Tech. Rep. Brown University Working Paper. Mikusheva, A., 2010. Robust confidence sets in the presence of weak instruments. Journal of Econometrics 157, 236–247. Moon, H.R., Schorfheide, F., 2009. Estimation with overidentifying inequality moment conditions. Journal of Econometrics 153, 136–154. Newey, W.K., Smith, R.J., 2004. Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72, 219–255. Newey, W.K., West, K.D., 1987. Hypothesis testing with efficient method of moments estimation. International Economic Review 28, 777–787. Neyman, J., 1959. Optimal asymptotic test of composite statistical hypothesis. In: Grenander, U. (Ed.), Probability and Statistics, The Harald Cramer Volume. Almqvist and Wiksell, Uppsala, pp. 313–334.
S. Chaudhuri, E. Zivot / Journal of Econometrics 164 (2011) 239–251 Robins, J.M., 2004. Optimal structural nested models for optimal sequential decisions. In: Lin, D.Y., Heagerty, P. (Eds.), Proceedings of the Second Seattle Symposium on Biostatistics. Springer, New York. Staiger, D., Stock, J.H., 1997. Instrumental variables regression with weak instruments. Econometrica 65, 557–586.
251
Stock, J.H., Wright, J.H., 2000. GMM with weak identification. Econometrica 68, 1055–1096. Zivot, E., Chaudhuri, S., 2009. Comment: weak instrument robust tests in GMM and the new Keynesian Phillips curve by F. Kleibergen and S. Mavroeidis. Journal of Business and Economic Statistics 27, 328–331.
Journal of Econometrics 164 (2011) 252–267
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Measuring correlations of integrated but not cointegrated variables: A semiparametric approach✩ Yiguo Sun a , Cheng Hsiao b,c,∗ , Qi Li d a
Department of Economics, University of Guelph, Guelph, ON, N1G 2W1, Canada
b
Department of Economics, University of Southern California, Los Angeles, CA 90089-0253, USA
c
Department of Economics and Finance, City University of Hong Kong, Hong Kong
d
Department of Economics, Texas A&M University, College Station, TX 77843-4228, USA
article
info
Article history: Received 22 July 2010 Received in revised form 18 April 2011 Accepted 17 May 2011 Available online 12 June 2011 JEL classification: C13 C14 C20
abstract Many macroeconomic and financial variables are integrated of order one (or I (1)) processes and are correlated with each other but not necessarily cointegrated. In this paper, we propose to use a semiparametric varying coefficient approach to model/capture such correlations. We propose two consistent estimators to study the dependence relationship among some integrated but not cointegrated time series variables. Simulations are used to examine the finite sample performances of the proposed estimators. © 2011 Elsevier B.V. All rights reserved.
Keywords: Integrated time series Non-cointegration Semiparametric varying coefficient models
1. Introduction In this paper we consider a semiparametric varying coefficient model as follows: Yt = Xt⊤ θ (Zt ) + ut ,
t = 1, 2, . . . , n,
(1.1)
where Xt (a∑d × 1 vector) and ut (a scalar) are integrated series; ∑t t i.e., Xt = ζ + X and u = ϵ + u0 , and ζt and ϵt s 0 t s s=1 s =1 are stationary processes with zero means and finite long-run variances and satisfy some memory and moment conditions to be given in Section 3. Also, X0 and u0 are Op (1) random variables with Xt ≡ 0 and ut ≡ 0 for t < 0. And, θ (·) is a d × 1 vector of smooth
✩ We gratefully acknowledge the valuable comments from two anonymous referees, an associate editor and Peter Robinson that have greatly improved our paper. We would also like to thank K.S. Chan for providing several references cited in our paper. Yiguo Sun acknowledges the financial support from the Social Science and Humanities Research Council of Canada (SSHRC) grant 410-2009-0109. Li’s research is partially supported by the NSF of China (Project #: 70773005). ∗ Corresponding author at: Department of Economics, University of Southern California, Los Angeles, CA 90089-0253, USA. Tel.: +1 213 740 8335; fax: +1 213 740 8543. E-mail address:
[email protected] (C. Hsiao).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.05.013
measurable and squared integrable functions of a scalar stationary variable Zt .1 When Yt and Xt are cointegrated in a model either with a (parametric) nonlinear or with a (semiparametric) varying cointegration vector of functions of time or of some other stationary covariates, its limiting theories have been studied by Juhl (2005), Cai et al. (2009), and Xiao (2009). In this paper, we extend this literature by allowing the dependent variable Yt to be not cointegrated with the I (1) regressors Xt in the above semiparametric framework; that is, this paper considers the case that ut is an integrated process such that model (1.1) depicts a semiparametric regression model that can be used to measure the correlation of integrated but not cointegrated variables. Model (1.1) is of interest because many macro and financial time series are I (1) and correlated, but not necessarily cointegrated. For instance, let Y be the exchange rates between two countries, X be the price indices and Z be the interest rate differentials of these two countries. Should purchasing power parity (PPP) hypothesis holds, ut will be I (0). However, empirical studies often fail to support the PPP hypothesis (e.g., Taylor and Taylor
1 It is straightforward to generalize the results to a vector Z case. For expositional t simplicity, we will only consider the scalar Zt case in this paper, as allowing Zt to be a high-dimensional variable will not shed extra insight in to our theory.
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267
(2004)), but this does not mean that exchange rates are not related to the inflation rates. Similarly, one can consider Y and X as stock market indices and Z be the interest rate differential. With perfect information and arbitrage, the law of one price would imply Yt − Xt θ (Zt ) = ut , where ut is an I (0) process. However, imperfect information and many domestic factors could also lead to stock market indices to move in different directions so that ut might be an I (1) process. Nevertheless, θ(Zt ) remains of interest in the investigation of the spill-over effects from the variation of Xt to that of Yt . Similar examples can be found in the investigation of interest rate parity (e.g. Mark (2001)) or equity parity (e.g. Gavin (1989)).2 Another example that model (1.1) could be of interest is that (Yt , Xt⊤ , Wt ) are individually I (1) processes and cointegrated so that Yt = Xt⊤ θ (Zt ) + Wt δ + ϵt = Xt⊤ θ (Zt ) + ut , where ϵt is I (0), and 1, −θ (Zt )⊤ , δ are the cointegrating vectors. However, if Wt is not observable, then the composite error in model (1.1), ut = Wt δ + ϵt , is an I (1) process, and Yt and Xt are not cointegrated. Therefore, the I (1) error term ut may be due to omitted variables and/or measurement errors. Indeed, many macroeconomic and financial data are measured with errors; e.g., Barnett and Serletis (1990) and Blundell and Stoker (2005), where with integrated aggregate data, the measurement errors could be either I (0) or I (1) processes. Also, poor proxy could arise when the true data are not directly observable; e.g. consumer demand and consumer consumption may not be the same. When facing poor proxy problems, it seems plausible that I (1) measurement errors are of concern; see the debate on the PPP hypothesis in Xu (2003) and Taylor and Taylor (2004), where the test and forecasting results of the PPP hypothesis heavily rely on the carefully-chosen price indices among several alternatives. In this paper, we show that it is possible to obtain consistent estimator of the unknown smooth function θ (·) even when ut is an I (1) process. Thus, the theory obtained in this paper provides a general method for estimating the dependence among non-cointegrated I (1) processes in a flexible semiparametric framework not limited to model (1.1). It can be used by researchers to expand their repertoire to find meaningful relations among integrated variables even when the researchers suspect the possibility of persistent measurement errors in their data or of non-cointegration due to missing relevant integrated factors. For instance, stock market volatilities are highly persistent and can be viewed as I (1), or near I (1) processes (say, fractionally integrated processes).3 Ray and Tsay (2000) found strong evidence of the existence of common factors generating persistent stock volatilities and the volatilities of stocks of companies in the same business sector sharing more in volatility persistency. Although the existence of common factors can lead to co-movement of stock volatilities, there are also many other (domestic) factors that affect a domestic market’s volatility. It is rare that the domestic stock market volatility is cointegrated with a foreign stock market volatility even though the volatilities from different national stock markets are likely to be related with each other, especially during globally bearish market periods.4 Furthermore,
253
the relation between national stock market volatilities may well be time-varying depending on the varying risk premium. For instance, changes in exchange rate may lead to changes in risk premium, hence may enhance the volatility of stock markets; see Aloui (2007) and Bali and Wu (2010). Therefore, a varying coefficient model, with exchange rate changes as the (nonparametric) state variable entering the varying coefficient function, may provide a flexible approach to examine the spill-over effects of stock market volatility. Monte Carlo simulations reported in Section 4.2 show that our proposed estimators perform quite well when the data are near I (1) processes. The remaining parts of the paper is organized as follows. Section 2 presents the model and discusses the estimation methods. Section 3 provides asymptotic results of our proposed estimators. In Section 4, we report Monte Carlo simulation results to examine the finite sample performance of the proposed estimators. Finally, Section 5 concludes the paper. All the mathematical proofs are relegated to the Appendix. 2. The model and the proposed estimators As explained in the introduction section, we aim to find a consistent nonparametric estimator of θ (·) in the following semiparametric varying coefficient model, Yt = Xt⊤ θ (Zt ) + ut ,
(2.1)
where Xt (a d × 1 vector) and ut (a scalar) are all non-stationary I (1) processes, Zt is a scalar stationary process, and θ (·) is a d × 1 vector of unspecified smooth (three-time differentiable) functions to be estimated. The increments (or first differences) of all the I (1) variables have zero means and finite long-run variances, and the unknown coefficient function θ (·) has a constant finite mean and variance. We will delay more detailed assumptions regarding model (2.1) to Section 3. Below we first examine the OLS estimator’s behavior if one estimates model (2.1) by the least squares method. We assume that the multivariate functional central limit theorem applies to the (d + 1)-dimensional vector of the partial sums ⊤ n−1/2 X[nr] , n−1/2 u[nr]
⊤
H⇒
studied volatility spillovers across international stock markets via multivariate GARCH-type models. GARCH setup and its various extensions are widely used in modeling stock volatility; see Davidson (2004) on some of the recent theoretical results on GARCH-type models.
⊤
BX (r )⊤ , Bu (r )
for r ∈ [0, 1],
⊤
where BX (r ) , Bu (r ) is the (d + 1)-dimensional Brownian motion processes with a zero mean vector and finite and nonsingular variance–covariance matrix, [a] is the integer part of a positive number a, and ‘‘H⇒’’ denotes the weak convergence with respect to the Skorohod metric. The OLS estimator of model (2.1) is given by
θˆ0 =
⊤
n −
⊤
−1 Xt Xt
t =1
n −
Xt Xt⊤ θ (Zt ) + ut = An1 + An2 ,
(2.2)
t =1
where An1 =
∑n
t =1
Xt Xt⊤
−1 ∑n
t =1
Xt Xt⊤ θ (Zt ) and An2 =
∑n
t =1
−1 ∑ n
Xt Xt⊤ t =1 Xt ut . It is well established that under some regularity conditions (as given in Section 3) and applying the multivariate functional central limit theorem and continuous mapping theorem, we have
An2 =
n
−2
n −
−1 ⊤
Xt Xt
n−2
t =1
2 We owe Chan for the references of Mark (2001) and Gavin (1989). 3 Bollerslev and Mikkelsen (1996) and Baillie et al. (1996) treated conditional volatilities as fractionally integrated processes, and Christensen and Nielsen (2007) suggested to use changes of volatility as the M-term to estimate a GARCH-M type return model. 4 Using a different approach, Koutmos and Booth (1995) and Karolyi (1995)
d
1
[∫
BX (r )BX (r )⊤ dr
→ 0
n −
X t ut
t =1
]−1 ∫
1
def
BX (r )Bu (r )dr = θ¯1 ,
(2.3)
0
where θ¯1 is an Oe (1) random variable.5 We show in the Appendix that
5 For any positive non-stochastic sequence a , we say that a random variable is n Oe (an ), if it is Op (an ) but not op (an ). It means that the random variable has an exact probability order of an .
254
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267
n− 2
An1 =
n −
−1 Xt Xt⊤
n− 2
t =1
n −
The estimation of α(·) may itself be of interest because it implies that one can consistently estimate the partial effects, ∂∂z θ (z ) =
Xt Xt⊤ θ (Zt )
t =1
= E [θ (Zt )] + n−2
n −
−1 Xt Xt⊤
t =1
n− 2
n −
Xt Xt⊤ α(Zt )
t =1
p
→ E [θ (Zt )] .
(2.4)
Therefore, taking together (2.2)–(2.4), we obtain d θˆ0 − E [θ (Zt )] → θ¯1 .
Yt = Xt⊤ c0 + Xt⊤ α (Zt ) + ut . (2.5)
The above result states that when ut is an I (1) process, the OLS estimator θˆ0 deviates from the average value of the random coefficient by a variable of stochastic order of Oe (1). However, in the Appendix (see the arguments given at the end of the proof of Lemma A.2), we show that one could consistently estimate E [θ(Zt )] by the OLS estimator θˆ0 if the error terms ut were an I (0) process. Phillips (2009) showed that the kernel estimator is inconsistent estimator for nonparametric spurious regression models. One naturally expects that such inconsistency result will be carried over to our semiparametric regression model because of the I (1) error term. This conjecture is supported by our theory. Consider the standard local constant kernel estimator of θ (z ) given by
ˇ z) = θ(
n −
−1 Xt Xt⊤ Ktz
t =1
n −
Xt Yt Ktz ,
(2.6)
t =1
where Ktz = K ((Zt − z )/h), K (·) is the kernel function satisfying conditions given in Section 3, and h is the smoothing parameter. In the Appendix, we establish the following result:
ˇ z ) − θ (z ) θ( [∫ 1 ] −1 ∫ d → BX (r )BX (r )⊤ dr 0
1
def
BX (r )Bu (r )dr = θ¯2 ,
(2.7)
0
where θ¯1 and θ¯2 have the same distribution. As θ¯2 = Oe (1), θˇ (z ) is not a consistent estimator of θ (z ) due to the I (1) error term ut . The OLS estimator inconsistently estimating the average value of the coefficient function and the kernel estimator inconsistently estimating the unknown coefficient function are both due to the same reason: the error term is an integrated process whose impact is non-ignorable even asymptotically. Comparing (2.5) with (2.7), it is natural to conjecture that one can asymptotically cancel out the persistency of the I (1) error terms by taking the difference of the kernel estimator θˇ (z ) and the OLS estimator θˆ0 . However, taking the difference between the two estimators not only asymptotically cancels out the impact of the non-stationary error terms but also removes the mean value of the unknown coefficient function. Therefore, in doing so, the best one can do is to def
estimate α(z ) = θ (z )− E [θ (Zt )], the zero mean random coefficient function. Therefore, we will aim to show p ˇ z ) − θˆ0 → θ( θ (z ) − c0 = α(z ), (2.8) where c0 = E [θ (Zt )]. This motivates our first estimator of α(z )
given by
n −
θˇ (Zt )Mnt ,
(2.10)
t =1
where Mnt = Mn (Zt ) is the trimming function that trims away observations near the boundary of the support of Zt . We will discuss more on the need of the trimming function in the Appendix.
(2.11)
Then, adding/subtracting Xt α( ˜ Zt ) in (2.11) gives ⊤
def
˜ Zt ) = Xt⊤ c0 + vt , Y˜t = Yt − Xt⊤ α(
(2.12)
˜ Zt ) + ut . Because α( ˜ Zt ) is a consistent where vt = Xt α (Zt ) − α( estimator of α(Zt ), Y˜t mimics the stochastic properties of Yt − Xt⊤ α (Zt ), and the stochastic property of vt is dominated by that ⊤
of ut . Taking a first difference of (2.12), we obtain
1Y˜t = ζt⊤ c0 + 1vt ,
(2.13)
where 1Y˜t = Y˜t − Y˜t −1 , ζt = Xt − Xt −1 , and 1vt = vt − vt −1 . Then, regressing 1Y˜t on ζt by the least squares method gives the OLS estimator of c0 below,
c˜0 =
n −
−1 ζt ζt
⊤
t =2
n −
ζt 1Y˜t .
(2.14)
t =2
With α( ˜ z ) and c˜0 , we obtain an estimator of θ (z ) given by
θ˜ (z ) = α( ˜ z ) + c˜0 .
(2.15)
The consistency of θ˜ (z ) follows directly from the consistencies of α( ˜ z ) and c˜0 . Alternatively, one can use α( ˆ Zt ) to replace α( ˜ Zt ) in the above calculation; i.e., one can replace Y˜t by Yˆt = Yt − XtT α( ˆ Zt ) in (2.12). We will use cˆ0 to denote the resulting estimator of c0 and denote by θˆ (z ) = cˆ0 + α( ˆ z ) an alternative estimator of θ (z ). The additional assumptions ensuring the consistency of c˜0 and θ˜ (·) (or cˆ0 and θˆ (·)) as well as the asymptotic results of our proposed estimators of α(z ) are the subject of the next section. 3. The asymptotic analysis We first introduce some notation. For any q > 0 and an m × 1
q 1/q
vector ξt , we denote by ‖ξt ‖q = the Lq -norm, j=1 E ξjt and we use ‘‘‖·‖’’ without any subscript to denote the Euclidean ∑m
d
p
norm; Im is the m × m identify matrix; ‘‘→’’ and ‘‘→’’ denote convergence in distribution and convergence in probability, respectively. For a differentiable function g : R → Rm , we denote g (s) (z ) = ds g (z )/dz s . Also, we use M to denote a finite positive constant whose value may change from place to place. We now list regularity conditions sufficient for establishing the consistency of α( ˜ z ) and α( ˆ z ).
⊤
(A1) Let ηt = ζt⊤ , ϵt , Zt be a (d + 2)-dimensional random vector, where ζt = Xt − Xt −1 and ϵt = ut − ut −1 have zero means. ηt is a strictly stationary, α -mixing sequence of size −p/ (p − 2) with a finite, positive definite, long-run variance matrix, and ‖ηt ‖4 < M < ∞, where p = 2 +δ for some small δ > 0. Also, both X0 and u0 are of order Op (1) with Xt ≡ 0 and ut ≡ 0 for t < 0. (A2) (i) The variable Zt has a Lebesgue density f (z ) and infz ∈S f (z ) > 0, where S , the support of Zt , is a compact subset of R. Also, (Zs , Zt ) has a Lebesgue joint density function fs,t (z1 , z2 ).
α( ˜ z ) = θˇ (z ) − θˆ0 , (2.9) ˆ ˇ where θ0 and θ (z ) are defined in (2.2) and (2.6), respectively. Our second estimator of α(z ) is given by α( ˆ z ) = θˇ (z ) − n−1
∂ α(z ). Indeed, this may be the main interest in many studies. ∂z However, naturally, one may also want to know the average value of the varying coefficient curve, c0 . We therefore illustrate our estimator of c0 below, delaying to Section 3 its consistency result. To obtain an estimator for c0 , we replace θ (Zt ) by its identity θ (Zt ) = c0 + α(Zt ) and rewrite (2.1) as
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267
(ii) f (z ), θ (z ), E (ζt |z ), and E ‖ζt ‖2 |z are three times continuously differentiable in the vicinity of an interior point z ∈ S . And, fs,t (z1 , z2 ) are three times continuously differentiable in the vicinity of the interior point (z , z ) ∈ S × S. (iii) ζt⊤ θ (Zt )2+δ˜ < M < ∞ and ‖θ (Zt )‖2+δ˜ < M < ∞
for some δ˜ > δ > 0. (A3) The kernel function K (u) is a bounded, symmetric (around zero) probabilitydensity function on interval [−1, 1]. Also, we denote µj = uj K (u) du and vj = uj K 2 (u) du. (A4) h → 0, nh → ∞ and nh5 = O (1) as n → ∞. Assumption A1 ensures that the multivariate functional central
⊤
⊤ limit theorem can be applied to X[nr] , u[nr] for any r ∈ [0, 1]; see Lemma 3.1 below. It allows ζt and ϵt to be contemporaneously correlated such as cov(ζt , ϵt ) ̸= 0, provided that they are α -mixing processes with the mixing coefficients satisfying some rate of decaying condition, and that ηt satisfies some moment conditions. Assumption A2 requires that the density is bounded below by a positive constant in the compact support of Zt . It also gives regularity conditions on smoothness and moment restrictions of the density function and some other functionals of Zt . The threetime differentiable conditions ensure that these functions have Taylor expansions up to the third order so as to yield a leading asymptotic center term of the form J (z )h2 + Op (h3 ), where J (z ) is a leading asymptotic center defined in Theorem 3.1 below. The kernel function having a compact support in Assumption A3 is not essential and can be removed at the cost of lengthy proofs. Specifically, the Gaussian kernel is allowed. Assumption A4 ensures that the proposed estimators are consistent as sample size n → ∞, and nh5 = O(1) allows for optimal smoothing, but it rules out the under-smoothing data case. Denote Ktz = K h−1 (Zt − z ) and K˘ tz = Ktz − EKtz . We need the following lemma for deriving the limiting results of the proposed estimators with the proof delayed to the Appendix.
Lemma 3.1. Under Assumptions A1–A4, the following multivariate functional central limit theorem holds
n−1/2
[nr ] −
ζt
t =1
n−1/2
[nr ] −
ϵt
t =1
[nhv0 f (z )]−1/2
[nr ] −
K˘ tz
BX (r ) H⇒ Bu (r ) , WK (r )
(3.1)
⊤
0
0
1
.
(3.2)
Note that in Lemma 3.1, WK (·) is a standard Brownian motion
⊤
process independent of BX (·)⊤ , Bu (·) . Theorems 3.1 and 3.2 below give asymptotic results for α( ˜ z ) and α( ˆ z ), respectively. The proofs are delayed to Appendix A.1.
Theorem 3.1. Under Assumptions A1–A4, we have, for all interior point z ∈ S ,
√
d
nh[α( ˜ z ) − α(z ) − h2 J (z )] → Λ,
∫ 1 ν0 1 BX (r )Bu (r )dWK (r ) B− (X ,2) f (z ) 0 ∫ 1 −1 −1 ⊤ BX (r ) BX (r ) dWK (r ) B(X ,2) B(X ,u) , − B(X ,2)
Λ=
(3.3)
0
1
B(X ,2) = 0 BX (r )BX (r )⊤ dr , B(X ,u) = µ2 [θ (1) (z )f (1) (z )/f (z ) + θ (2) (z )/2].
1 0
BX (r )Bu (r )dr, and J (z ) =
The derivation of Theorem 3.1 does not require zero contemporaneous correlation between the increments of the I (1) processes, 1ut and 1Xt . Therefore, Xt is allowed to be endogenous. ∑ In (2.10), α( ˆ z ) = θˇ (z ) − n−1 nt=1 θˇ (Zt )Mnt requires the cal-
culation of θˇ (Zt ) for all t = 1, . . . , n. Therefore, the continuously differentiable condition in the vicinity of the interior point z ∈ S as imposed by Assumption A2 (i) will not be sufficient to ensure the consistency of α( ˆ z ), and we need to strengthen it to a uniform condition. In addition, to obtain the limiting distribution of α( ˆ z ), we need to strengthen our assumption further.
(A2∗ ) (i) The variable Zt has a Lebesgue density f (z ) and infz ∈S f (z ) > 0, where S , the support of Zt , is a compact subset of R. Also, (Zs , Zt ) has a Lebesgue joint density function fs,t (z1 , z2 ). (ii) f (z ), θ (z ), E (ζt |z ), and E ‖ζt ‖2 |z are continuous on S and three times continuously differentiable in the interior of S . And, fs,t (z1 , z2 ) is continuous on S × S and three times continuously differentiable along the diagonal line (or at all interior points (z , z )) in S × S . (A5) (i) The sequences {(ζt , ϵt )}nt=1 and {Zt }nt=1 are independent of each other. (ii) {Zt }nt=1 is a strictly stationary β -mixing process with the β -mixing coefficients satisfying β(t ) = O(t −τ ) for all t, where τ > 2λ−1 (1 + λ) for some λ ∈ (0, 1). (iii) E (ϵt |ζt ) = 0 and E ζt ζt⊤ is nonsingular. Assumption A2∗ strengthens Assumption A2 to the support of Zt . Note that Assumption A5 implies β(t )λ/(1+λ) = O(t −ρ ) for some λ ∈ (0, 1) and ρ > 2. As a β -mixing condition implies an α -mixing condition, Assumption A5(ii) is stronger than Assumption A1.
√
nh α( ˆ z ) − α(z ) − h2 [J (z ) − E (J (Zt ))] = Λ˚ ,
˚ has the same distribution as Λ defined in (3.3). where Λ
where BX (r )⊤ , Bu (r ) , WK (r ) is a (d + 2)-dimensional Brownian motion with a zero mean vector and finite covariance matrix
Σ
where
Theorem 3.2. Under Assumptions A1–A5, and A2∗ , we have, for all interior point z ∈ S ,
t =1
255
Assumptions A1–A4 are sufficient to show the consistency of α( ˆ z ); i.e., α( ˆ z ) − α(z ) − h2 [J (z ) − E (J (Zt ))] = op (1). The independence between {(ζt , ϵt )}nt=1 and {Zt }nt=1 of Assumption A5(i) is a sufficient but not a necessary condition to show that Λ˚ and Λ have the same distribution, although this assumption significantly simplifies our proof of the limiting distribution in Theorem 3.2. Note that the asymptotic center of α( ˆ z ) is smaller than that of α( ˜ z ) on average. With the above additional assumptions, we are now ready to present the consistency results for c˜0 , θ˜ (z ), cˆ0 and θˆ (z ). Theorem 3.3. Under Assumptions A1–A5 and A2∗ we have
(i) c˜0 − c0 = Op (h2 + (nh)−1/2 ), θ˜ (z ) − θ (z ) = Op (h2 + (nh)−1/2 ); (ii) cˆ0 − c0 = Op (h2 + (nh)−1/2 ), θˆ (z ) − θ (z ) = Op (h2 + (nh)−1/2 ).
256
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267
The proof of Theorem 3.3 is given in Appendix A.2. Again, Assumption A5(i) is stronger than necessary, it can be relaxed to E (ϵt |ζt , Zt ) = 0 and E [(ζt +1 − ζt ) J (Zt )] = 0. However, we would like to emphasize that Assumption A5 (iii) is necessary for Theorem 3.3 to hold. To see this, noting that c˜0 is the OLS estimator of c0 based on (2.13), where 1vt contains a term equal to 1ut = ϵt , we need E (ϵt |ζt ) = 0 (i.e., Assumption A5(iii)) such that the impact of this term vanishes asymptotically. Theorems 3.1–3.3 show that our proposed semiparametric curve estimators have the same rate of convergence as in the case with stationary data. In fact, it can be shown that our estimators are robust to whether the error terms are an I (0), a near I (1) or an I (1) process. Assumption A5 requires that the innovations that generate Xt and ut to be independent of the stationary covariate Zt sequence. However, the regularity conditions on the consistency of α(·) ˜ is quite weak and standard. If the main interest is to estimate the marginal effects dθ (z )/dz, the stronger condition A5 is not needed as dα(z )/dz = dθ (z )/dz and α( ˜ z ) is a consistent estimator of α(z ) without the need of Assumption A5. Also, in some cases, Assumption A5 (i) may be relaxed to that (ζt , Zt ) is uncorrelated with ϵt or E (ϵt |ζt , Zt ) = 0; hence, conditional heteroskedastic error is allowed. For example, when P (Zt = Zt −1 ) > 0. To see this, we take the first difference to both sides of Eq. (1.1), which gives
1Yt = Xt⊤ θ (Zt ) − Xt⊤−1 θ (Zt −1 ) + ϵt .
(3.4)
Up until now, we only consider the case that all the components of Xt are I (1) processes, which rules out the case that Xt can contain a constant. Consider the following model, Yt = (1, Xt⊤ )
γ (Zt ) = γ (Zt ) + Xt⊤ θ (Zt ) + ut , θ (Zt )
t = 1, 2, . . . , n.
(3.7)
In this case, one can ignore the γ (Zt ) term and treat γ (Zt ) + ut as the composite error term. It can be shown that our proposed estimators α( ˜ z ) and α( ˆ z ) remain consistent with the same asymptotic distributions. This is similar to the linear regression model case, where the OLS estimator for the coefficients associated with I (1) variables remains consistent even if one ignores the presence of I (0) regressors. We summarize the above analysis in a proposition below and give a brief proof in Appendix A.2. Proposition 3.4. The conclusions of Theorems 3.1–3.3 remain unchanged if Yt is generated by (3.7) provided that γ (z ) is a uniformly bounded differentiable function over z ∈ S . 4. Monte Carlo simulations In this section, we examine the finite sample performances of the semiparametric estimators α(·) ˆ , α(·) ˜ as well as cˆ0 and θˆ (·).6 In Section 4.1 we consider the case that both Xt and ut are I (1) processes, and in Section 4.2 we consider the case that Xt and ut are near I (1) processes.
Applying to the sub-data set satisfying Zt = Zt −1 , we obtain
1Yt = 1Xt⊤ θ (Zt ) + ϵt ,
(3.5)
where Eq. (3.5) is a standard varying coefficient model with all the variables being stationary. The asymptotic results for various semiparametric estimators (local linear or local constant) can be found in Cai et al. (2000), Fan et al. (2000), Li et al. (2002), Fan et al. (2003), among others. The assumption that P (Zt = Zt −1 ) > 0 is not unrealistic even for time series data. For instance, if one has monthly observations of Zt , the Federal Reserve fund rate, then for most months, Zt = Zt −1 as the Fed does not change the fund rate on a monthly basis. Even when P (Zt = Zt −1 ) = 0, one may not need the complete independence between {(ζt , ϵt )}nt=1 and {Zt }nt=1 at all the times. Let us rewrite Eq. (3.4) as
1Yt = 1Xt⊤ θ (Zt ) + Xt⊤−1 [θ (Zt ) − θ (Zt −1 )] + ϵt .
(3.6)
Multiplying both sides of Eq. (3.6) by a kernel weight function K ((Zt − Zt −1 )/h), one may derive a consistent estimator of the smooth coefficient function θ (·). That is, only data with |Zt − Zt −1 | ≤ h will be used in estimating the regression model. With Zt −1 close to Zt , we have θ (Zt −1 ) close to θ (Zt ), and the term XtT−1 [θ (Zt ) − θ (Zt −1 )] can be made negligible by choosing some small value of h. However, in order for the term associated with the I (1) regressor√Xt −1 to be negligible, some under-smoothing condition such as √nh2 = o(1) may be needed because Xt −1 needs to be divided by n to make it an Op (1) random variable. In addition to the undesired under-smoothing condition, this method also suffers more of the ‘curse of dimensionality’ problem because one effectively only uses the data satisfying both |Zt − z | ≤ h and |Zt −1 − z | ≤ h when estimating θ (z ), while our earlier estimator only requires the condition |Zt − z | ≤ h. We leave it as a future research topic to find an alternative consistent estimator of θ (z ) (or c0 ) under conditions weaker than Assumption A5(i) and without triggering the ‘curse of dimensionality’ problem.
4.1. Xt and ut are I (1) processes We consider the following data generating process (DGP): Yt = Xt θ (Zt ) + ut ,
(t = 1, . . . , n),
where Xt and ut are both I (1) variables, and Zt is stationary. Specifically, Xt = Xt −1 + ζt with X0 = 0 and ζt is i.i.d. N (0, σζ2 ); ut = ut −1 + ϵt with u0 = 0 and ϵt is i.i.d. N (0, σϵ2 ); zt = vt + vt −1 and vt is i.i.d. uniform [0, 2]. We consider two different α(·) functions: (a) α(z ) = sin(π z ) − E [sin(π Zt )] and (b) α(z ) = z − .5z 2 − E Zt − .5Zt2 (so that E [α(Zt )] = 0). In both cases, we have θ (z ) = c0 + α(z ), and we choose c0 = 1 and 2. It is easy to show that α(·) ˜ and α(·) ˆ are invariant to different c0 values as c0 is canceled out by construction of the estimators. Hence, we only report the case of c0 = 1 for brevity. We choose two different combinations for (σϵ , σζ ): (i) The case of (σϵ , σζ ) = (1, 2) means that the increment of ut has a smaller variance relative to that of Xt (small noise to signal ratio); (ii) The case of (σϵ , σζ ) = (1, 1) corresponds to an equal variance case for the increments of Xt and ut . The sample sizes are n = 100, 200 and 400. The number of replications is 10,000. We use a standard normal kernel function with the smoothing parameter equal to h = σˆ z n−1/5 , where σˆ z is the sample standard deviation of {Zt }nt=1 . We compute the sample average mean squared errors (or AMSEs) for α(·) ˆ , α(·) ˜ , θˆ (·) and cˆ0 as follows: AMSE (α(·)) ˆ =J
−1
J −
n
j =1
AMSE (α(·)) ˜ = J −1
J − j =1
−1
n − 2 αˆ j (Zt ) − α(Zt ) , t =1
n − 2 −1 n α˜ j (Zt ) − α(Zt ) , t =1
6 The performances of c˜ and θ(·) ˜ ˆ ; are very similar to those of cˆ0 and θ(·) 0 therefore, we omit these results to save space.
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267
257
Table 1 Average mean squared errors when α(z ) = sin(π z ) − E [sin(π Zt )]. (σϵ = 1, σζ = 2)
n
100 200 400
α(·) ˜
α(·) ˆ
ˆ θ(·)
cˆ0
(σϵ = 1, σζ = 1) α(·) ˜ α(·) ˆ
ˆ θ(·)
cˆ0
.0821 .0428 .0234
.0799 .0422 .0225
.1537 .0848 .0436
.0830 .0443 .0224
.1388 .0767 .0428
.2403 .1306 .0703
.1083 .0577 .0297
.1356 .0749 .0418
Table 2 Average mean squared errors when α(z ) = z − .5z 2 − E (Zt − .5Zt2 ). (σϵ = 1, σζ = 2)
AMSE (θˆ (·)) = J
−1
AMSE (ˆc0 ) = J −1
J −
n
−1
n
α(·) ˜
α(·) ˆ
ˆ θ(·)
cˆ0
(σϵ = 1, σζ = 1) α(·) ˜ α(·) ˆ
ˆ θ(·)
cˆ0
100 200 400
.00868 .00487 .00275
.00552 .00319 .00188
.00945 .00551 .00325
.00762 .00435 .00235
.0654 .0372 .0221
.0909 .0504 .0297
.0331 .0172 .0095
n −
j=1
t =1
J −
2
cˆ0,j − c0
2 θˆj (Zt ) − θ (Zt ) ,
,
j =1
where J is the number of replications (J = 10,000), and the subscript j refers to the result from the jth replication. The estimation results are reported in Tables 1 and 2. Both Tables 1 and 2 show the following features. First, all the AMSEs of α(·) ˆ , α(·) ˜ , θˆ (·) and cˆ0 are much smaller in magnitude for the case of (σϵ , σζ ) = (1, 2) than the case of (σϵ , σζ ) = (1, 1). This is expected since the former has a smaller noise to signal ratio. Second, the AMSE for each and every estimator decreases as the sample size increases, which shows the consistency of the estimator. When comparing the results in Table 1 with those in Table 2, we see the latter much smaller than the former. This is because quadratic function θ (z ) (or α(z )) in Table 2 is smoother than the sine function θ (z ) (or α(z )) in Table 1. The above simulations results assume that Zt follows a short memory MA(1) process. Below we report results on some different processes for Zt . First we let Zt follow an AR(1) process: (i) Zt = ρz Zt −1 + vt , where vt is i.i.d. uniform [−1, 1] and ρz = 0.5.7 Also, Assumption A2 implies that Zt has a bounded support. We now use Monte Carlo simulations to examine whether this assumption can be relaxed to an unbounded support case. We consider two more different data generating processes for Zt : (ii) Zt is i.i.d. N (0, 1) which has an unbounded support but with a quite thin tail; (iii) Zt is i.i.d. t (3) (or a Student’s t-distribution with 3 degrees of freedom), which is a fat-tailed distribution with no finite moment of order greater than two. Xt and ut are generated the same way as above. To save space, we only consider the case of (σϵ , σζ ) = (1, 2) and θ(z ) = 1+α(z ) with α(z ) = sin(π z )−E [sin(π Zt )]. The simulation results are reported in Table 3. First, we examine the results for case (i), where Zt follows an AR(1) process with a density function bounded away from zero in its support. Comparing the results of Table 1 with those of Table 3, we observe that the sample AMSEs for an AR(1) Zt case are very similar to those of an MA(1) Zt case. This shows that our estimators are not sensitive to the serial correlation pattern of Zt , provided that Zt is a stationary process with a density function that is bounded away from zero in its support.
7 As we set Z = 0, and the i.i.d. innovations v are drawn independent of Z , the 0 t 0 theorem in Athreya and Pantula (1986, p. 187), ensures that the stationary AR(1) process Zt generated this way is a geometric strong mixing sequence.
.0609 .0350 .0210
Next, we investigate case (ii) that Zt is i.i.d. N (0, 1). First, we observe that the sample AMSEs decrease as sample size n increases for all the three estimators: α(·) ˆ , θˆ (·) and cˆ0 . This suggests that our proposed estimators are likely to be consistent when Zt has an unbounded support with a thin tail distribution such as a N (0, 1) distribution. However, when comparing case (ii) with case (i), we see that the sample AMSEs for case (ii) are significantly larger than those of case (i) where the distribution of Zt has a bounded support with the density function bounded away from zero in its support. Finally, we examine the result that Zt has a fat-tailed t (3) distribution. In this case, Zt has a finite second moment but does not have any finite moments of order higher than two. Table 3 clearly shows the inconsistency of θˆ (·) and α(·) ˆ in this case. The AMSEs of θˆ (·) and α(·) ˆ do not decrease, and in fact, they increase as sample increases. Therefore, one needs to be cautious in practice if one is suspicious that Zt may have a fat-tailed distribution. One way to avoid a fat-tail distributed Zt is to trim out data with extreme Zt values so that f (z ) is bounded away from zero on the trimmed set. 4.2. Xt and ut are near I (1) processes In this section, we consider exactly the same data generating processes as discussed in Section 4.1 except that now Xt and ut are near I (1) rather than I (1) processes. Specifically, we generate Xt and ut by Xt = ρx Xt −1 + ζt with ζt being drawn from an i.i.d. N (0, σζ2 ) and X0 from N (0, σζ2 /(1 −ρx2 )), and ut = ρu ut −1 +ϵt with
ϵt being drawn from an i.i.d. N (0, σϵ2 ) and u0 from N (0, σϵ2 /(1 − ρu2 )). In addition, Zt = vt + vt −1 , and vt is drawn from an i.i.d. uniform [0, 2]. To save space, we only consider the case that α(z ) = sin(π z ) − E [sin(π Zt )]. Since all data are stationary now, θˇ (z ) is a consis∑n def tent estimator of θ (z ), and c¯0 = n−1 t =1 θˇ (Zt ) is a consistent
estimator of c0 . We compare the finite sample performances of θˇ (·) and c¯ with our proposed estimator θˆ (·) and cˆ0 for (ρx , ρu ) = (0.90, 0.90), (0.95, 0.95), (0.97, 0.97), (0.99, 0.99). Usually the first lag autocorrelation coefficient of national stock volatility is around 0.95 to 0.98 (e.g., Sun et al. (2010)). Hence, our choice of ρx and ρu are consistent with the empirical stock volatility data. The simulation results are reported in Table 4. From Table 4, we observe the followings. (i) Our proposed estimators θˆ (·) and cˆ0 have smaller AMSEs than those of θˇ (·) and c¯0 for all the different (ρx , ρu ) combinations and all the different sample sizes considered. (ii) As the data becomes closer to I (1) processes, the relative performances of our proposed estimators improve more over the conventional estimators. For example, for the case of n = 400, when (ρx , ρu ) = (.90, .90), the AMSE of θˇ (·) is only about 20% higher than that of θˆ (·). However, when
258
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267 Table 3 Three different distributions for Zt . n
100 200 400
(ii) Zt ∼ N (0, 1)
(i) Zt ∼ AR(1)
(iii) Zt ∼ t (3)
α(·) ˆ
ˆ θ(·)
cˆ0
α(·) ˆ
ˆ θ(·)
cˆ0
α(·) ˆ
ˆ θ(·)
cˆ0
.0747 .0488 .0304
.1372 .0875 .0556
.0718 .0426 .0278
.1739 .1239 .0845
.3453 .2357 .1625
.1175 .1140 .0794
.4299 1.496 4.217
.7804 1.803 4.457
.3528 .3076 .2417
Table 4 Xt and ut are near I (1) processes. n
100 200 400
(ρx , ρu ) = (.90, .90) ˇ ˆ θ(·) θ(·) c¯0
cˆ0
(ρx , ρu ) = (.95, .95) ˇ ˆ θ(·) θ(·) c¯0
cˆ0
.2987 .1987 .1286
.0977 .0502 .0249
.0366 .0149 .0067
.3883 .2472 .1534
.1872 .0998 .0499
.0535 .0210 .0090
.2858 .1621 .0821
.0721 .0285 .0119
.7515 .5472 .3250
.5490 .4009 .2219
.1255 .0575 .0252
.2445 .1613 .1094
(ρx , ρu ) = (.97, .97) 100 200 400
.4870 .3088 .1855
.2603 .1734 .1143
.2517 .1665 .1114
(ρx , ρu ) = (.99, .99) .3251 .2025 .1275
(ρx , ρu ) = (.97, .97), the AMSE of θˇ (·) is almost doubled that of ˆ . Moreover, the AMSE of θˇ (·) becomes more than double that of θ(·) ˆ when (ρx , ρu ) = (.99, .99). The improvement for estimating θ(·) c0 is even more dramatic. For n = 100, the AMSE of c¯0 is about four times as large as the AMSE of cˆ0 when (ρx , ρu ) = (.90, .90), and this AMSE ratio becomes almost 10 fold when (ρx , ρu ) = (.99, .99).
This Appendix contains two subsections. Appendix A.1 provides the proofs for Theorems 3.1 and 3.2, and Appendix A.2 gives proofs for Theorem 3.3 and Proposition 3.4.
The simulation results reported in this section are consistent with our theoretical analysis. When the error term ut is an I (1) or a near I (1) process, the conventional kernel-based estimator of θ (·) becomes inconsistent or inaccurate, while our proposed estimator is consistent and accurate regardless of whether the error term ut is an I (1) or a near I (1) process.
1 B (r )BX (r )⊤ 1 −1 0 X dr , B(X ,u) = 0 BX (r )Bu (r )dr , Ktz = K h (Zt − z ) , K˘ tz = Ktz − EKtz , θt = θ (Z∑ shortt ), and α t = ∑ ∑ θt − E (θt ). Also, ∑n we use∑the ∑n n
5. Conclusions Most macroeconomic and finance variables show strong persistency, and many of them are measured with errors or even unobservable. This may render correlated but not cointegrated time series for actually observed or artificially constructed proxy variables, although the variables with true values could be cointegrated as predicted by economic or finance theory. In this paper we suggest using a flexible semiparametric varying coefficient model to capture the correlation among integrated but not cointegrated variables. We propose two consistent estimators to estimate the unknown smooth coefficient function and establish the consistency of the proposed estimators. The current study is limited to the case that the nonparametric component Zt has a density function that is bounded away from zero in its support. It would be useful to extend the asymptotic analysis to the case that Zt has an unbounded support such as a normal distribution case. The simulation results reported in Section 4 show that our proposed estimators do not lead to consistent estimation of θ (·) when Zt has a fat-tailed distribution. It would be very useful if alternative estimation methods can be found that are robust to the fat-tailed distribution of Zt . The theoretical results of this paper may also be useful in developing new cointegration tests based on semiparametric models with an I (1) error term. Finally, we hope to be able to generalize the consistent model specification tests to our framework; i.e., testing the null hypothesis of a parametric functional form of θ (·) by allowing the error term ut to be an I (1) process. These challenging problems are left as future research topics.
Appendix. Mathematical proofs
A.1. Mathematical proofs of Theorems 3.1 and 3.2 Throughout this Appendix, we denote B(X ,2) =
hand notation t and t s̸=t to denote t =1 and t =1 s̸=t , respectively. For readers’ convenience, below we give modified versions of the strong mixing inequality of Lemma 2.1 of McLeish (1975) and Theorem 3.2 of de Jong and Davidson (2000). The presentation will be based on the assumptions imposed in this paper. Denote by Ft = σ (ωt : i ≤ t , n ≥ 1) the smallest σ -field containing the past history of {ωt } for all t ≤ n, n ∈ N , the set of natural number. For 1 ≤ p1 ≤ p2 < ∞ and any m > 0, McLeish (1975) strong mixing inequality states that
‖E (ωt +m |Ft ) − E (ωt )‖p1 −1 −1 ≤ 2 21/p1 + 1 [α (m)]p1 −p2 ‖ωt ‖p2 .
(A.1)
de Jong and Davidson (2000) gave a different proof of the multivariate functional central theorem under quite weak regularity conditions. Theorem 3.2 of de Jong and Davidson (2000): Let Wnt be an d × 1 vector of array and assume that for every d-vector ξ of unit ∑[nr] ˜ nt = ξ ⊤ Wnt , and Un (r ) = length, W t =1 Wnt for r ∈ [0, 1]. Then, Un H⇒ U, where U is a d-dimensional Gaussian process having almost sure continuous sample paths and independent increments, if there exists a positive constant array cnt such that ˜ nt : the following condition holds for W
˜ nt (i) E W
∑ ˜ nt = 0, nt=1 W
2
bounded uniformly in t and n;
˜ nt /cnt is Lp = 1, and W
≤ dnt v (m), where Fnt,+t −mm = σ 2 W , . . . , Wn,t +m is the smallest σ -field containing n,t −m Wn,t −m , . . . , Wn,t +m for all n, dnt /cnt is bounded uniformly in t and n, and v (m) = O m−1/2−ϵ for some small ϵ > 0;
˜ nt − E W ˜ nt |Fnt,+t −mm (ii) W
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267
−1 (iii) For some sequence bn such that bn = o (n) and bn = −1 o (1), letting rn = nbn , Mni = max(i−1)bn +1≤t ≤bn cnt and
−1/2
Mn,rn +1 = maxrn bn +1≤n cnt , max1≤i≤rn +1 Mni = o bn
∑rn
(iv) limn→∞ E [Un (r )2 ] is finite for all r ∈ [0, 1], and limϕ→0 supr ∈[0,1−ϕ ] lim supn→∞
∑[n(r +ϕ)] [nr]+1
2 cnt = 0.
Note that under Assumptions A1 and A2, the σ -field defined m here, Fnt,+ t −m , only needs to depend on the information sets of Wn,t −m , . . . , Wn,t +m , while de Jong and Davidson (2000) considered more general case than cited here. Proof of Lemma 3.1. By Assumptions A1 and A2, we can verify the conditions required by Corollary 4.2 of Wooldridge and White (1988) or Theorem 3.2 of de Jong and Davidson (2000) and obtain the following multivariate functional central limit theorem, for any r ∈ [0, 1],
n
−1/2
[nr ] −
ζt , n ⊤
−1/2
[nr ] −
ϵt , (nhv0 f (z ))
−1/2
t =1
t =1
[nr ] −
⊤ K˘ tz
t =1
⊤ H⇒ BX (r )⊤ , Bu (r ) , WK (r ) , (A.2) ⊤ is a (d + 1) × 1 column vector of where BX (r )⊤ , Bu (r ) Brownian motion process with a zero mean and covariance matrix of r Σ and is independent of the Standard Brownian motion WK (r ). Here, Σ is a finite nonsingular matrix under Assumption A1. ∑[nr ] The result that (nhv0 f (z ))−1/2 t =1 K˘ tz is asymptotically independent of the other two terms and has an asymptotic variance r is obtained from the following results: (i) The asymptotic ∑[nr] covariance between (nh)−1/2 t =1 K˘ tz and any one of the other two terms in the lemma is of order O −1/2
−
(A.3)
t =1 d
BX (r ) drE ζt⊤ αt .
→
(A.5)
0
Similarly, we obtain n − Γn3 = E ζt αt⊤ n−3/2 Xt −1 + op (1) t =1 1
∫
d → E ζt αt⊤
BX (r ) dr .
(A.6)
0
Finally, we will derive the limiting result of Γn1 , applying the method used in the proof of Theorem 4.2 in Hansen (1992). For ⊤ r ∈ [0, 1], let Un (r ) = Un,[nr] = n−1 X[nr] X[nr] and Vn (r ) = n−1/2 t =1 αt . Under Assumption A1, we actually can extend Lemma 3.1 such that Vn (r ) H⇒ Bα (r ) holds jointly with the partial sums appearing in Lemma 3.1, where (BX (r )⊤ , Bu (r ) , Bα (r )⊤ , WK (r ))⊤ is a (2d + 2)-dimensional Brownian motion with a zero mean and finite and nonsingular covariance matrix. Then, applying Theorem 3.1 of Hansen (1992), we have r r ∑[nr] n−3/2 t =1 Xt −1 Xt⊤−1 αt = 0 Un (s) dVn (s) = 0 Un (s) dQn (s) +
∑[nr]
r 0
Λ∗n,t = n−1/2
d
Un (s) dQn (s) →
t −
r 0
BX (s) BX (s)⊤ dBα (s) = Op (1)
Un,i − Un,i−1 wi − n−1/2 Un,t wt +1
(A.7)
i =1
Proof. If θ (Zt ) ≡ c0 , a vector of constants, we have n−3/2 t =1 Xt Xt⊤ α (Zt ) ≡ 0, and the lemma holds true for this trivial case. Below, we prove this lemma for the case that θ (·) is a vector of non-constant measurable smooth functions. Simple calculations give
∑n
n3/2 t =1
Xt −1 E ζt⊤ αt + op (1)
1
∫
t =1
n 1 −
n −
Γn2 = n−3/2
and
Xt Xt⊤ α (Zt ) = Op (1) .
(A.4)
n−1/2 Xt and closely following ∑n the proof of Theorem 3.3 of Hansen (1992), we obtain n−3/2 t =1 Xt −1 et = op (1). It follows
Λ∗n,[nr] , where
n
t =1
⊤ ζ αt < M < ∞ under Assumptions A1 and A2(ii). Therefore, t 2 we obtain Γn4 = O n−1/2 . Now, we consider Γn2 .Decomposing Γ∑ n2 into two components ∑n n yields Γn2 = Xt −1 E ζt⊤ αt + n−3/2 t =1 Xt −1 et , where et = t = 1 ζt⊤ αt − E ζt⊤ αt and E (et ) = 0. Applying Hölder’s inequality −1 and McLeish’s strong inequality (A.1), we obtain ⊤ mixing ∑ ∑n E [n n −1 | |] e ≤ 2E ζ α < M < ∞ and Var n e = t t t=1 t t =1 t Op n−1 by Assumptions A1 and A2(ii). Then, denoting Unt =
h ; (ii) Under Assumptions
Lemma A.1. If Assumptions A1 and A2 hold, we have
ζt ζt⊤ αt
where the definitions of Γnj for j = 1, 2, 3, 4 should be obvious from the context. Below, we will show that Γn1 , Γn2 and Γn3 are all of order Op (1), but Γn4 = Op n−1/2 . Consider Γn4 first. Under Assumption A1 and applying Lemma 2.1 of White and Domowitz (1984), we obtain that ζt ζt⊤ αt is an α -mixing process of size −p/ (p − 2).Then, applying Hölder’s inequality, we have E (‖Γn4 ‖) ≤ n−1/2 E ‖ζt ‖ ζt⊤ αt ≤ n−1/2 ‖ζt ‖2
∑n
The following two lemmas are used to obtain the limiting property of the OLS estimator under model (2.1). Both the proof of Theorem 4.2 in Hansen (1992) and the proof of Theorem 4.1 in de Jong and Davidson (2000) can be used to show weak convergence to stochastic integral results, while the latter uses weaker assumptions than the former. As we imposed similar assumptions as in Hansen (1992), we will use Hansen’s (1992) method to show the weak convergence to stochastic integral results below.
1 − n3/2
= Γn1 + Γn2 + Γn3 + Γn4 ,
√
˘ A1–A4, the variance of (nh) t =1 Ktz converges to v0 f (z ). This completes the proof of Lemma 3.1.
n−3/2
+
and
2 −1 ; i=1 Mnt = O bn
259 n
Xt Xt αt = ⊤
=
n 1 −
n3/2 t =1 n 1 −
n3/2 t =1
×
n − t =1
(Xt −1 + ζt ) (Xt −1 + ζt ) αt ⊤
with wi = k=1 E (αi+k |Fi ) and Ft = σ (Unt , Zt : i ≤ t , n ≥ 1) being the smallest sigma-field containing the past history of {(Unt , Zt )} for all n. It remains to show Λ∗n,[nr] H⇒ r Λ, a finite vector. Applying Minkowski’s inequality and McLeish’s strong mixing inequality of (A.1), we obtain for some λ2 > λ1 > 2,
∑∞
‖wi ‖λ1 ≤
∞ −
‖E (αi+k |Fi )‖λ1
k=1
Xt −1 Xt⊤−1 αt +
Xt −1 ζt⊤ αt +
1
≤
n3/2
n 1 −
n3/2 t =1
∞ −
6α (k)1/λ1 −1/λ2 ‖αi+k ‖λ2
k=1
ζt Xt⊤−1 αt
≤ 6M
∞ − k=1
k−p/(p−2)(1/λ1 −1/λ2 ) ‖αi+k ‖λ2 < M < ∞,
(A.8)
260
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267
if ‖αi ‖λ2 < ∞ and p/ (p − 2) (1/λ1 − 1/λ2 ) > 1. Let 1/λ1 − 1/λ2
˜ 2 + δ˜ for some δ˜ > δ > 0. Then, λ1 and λ2 s.t. λ2 > λ1 = = δ/ 2 + δ˜ λ2 / 2 + (1 + λ2 ) δ˜ > 2 can be used to obtain (A.8). Applying Chebyshev’s inequality, obtain for any given can∑ √ small √ we n ξ > 0, Pr sup1≤i≤n ‖wi ‖ > nξ ≤ nξ ≤ i=1 Pr ‖wi ‖ >
∑n √
nξ
i=1
−λ1
E ‖wi ‖λ1 ≤ Mn1−λ1 /2 ξ −λ1 → 0 as n → ∞, it
then follows that sup1≤i≤n ‖wi ‖ = op
√
n . Therefore, we obtain
sup n−1/2 Un,t wt +1
1≤t ≤n
p ≤ sup n−1 Xt Xt⊤ sup n−1/2 wt +1 → 0. 1≤t ≤n
(A.9)
1≤t ≤n
Combining the results above, we obtain d 1 θˆ0 − c0 → B− (X ,2) B(X ,u) .
This completes the proof of Lemma A.2.
From the proof of Lemma A.2, it is obvious that, if ut were an I (0) process, θˆ0 − c0 = Op n−1/2 ; i.e., if Yt and Xt were cointegrated with the varying cointegration vector θ (Zt ), then the OLS estimator, θˆ0 , would consistently estimate c0 = E [θ (Zt )]. Therefore, substantial difference between the OLS estimator θˆ0 and the semiparametric estimator cˆ0 (or c˜0 ) would suggest that Yt and Xt are not cointegrated; i.e., ut is not an I (0) process. The following three lemmas are used to derive the limiting properties of the proposed kernel estimators of model (2.1).
Next, we have n
−1/2
t −
Un,i − Un,i−1 wi = n
−3/2
i=1
Lemma A.3. If Assumptions A1–A4 hold, we have
t −
⊤
Xi Xi −
Xi−1 Xi⊤ −1
wi
i=1
= n−3/2
t −
n −1/2 −
n3 hv0 f (z )
t =1
⊤ Xi−1 ζi⊤ + ζi Xi⊤ wi , −1 + ζi ζi
d
∑t where n−3/2 i=1 ζi ζi⊤ wi = Op n−1/2 as E sup1≤t ≤n n−3/2 ∑t ⊤ ≤ n−1/2 ‖ζi ‖22λ1 /(λ1 −1) × ‖wi ‖λ1 = Op n−1/2 i=1 ζi ζi wi ∑t by (A.8). Then following the proof of Γn2 , we obtain n−3/2 i=1 ⊤ ⊤ −3/2 ∑t ⊤ ⊤ −3/2 X ζ + ζi Xi−1 wi = n n i=1 Xi−1 E ζi wi + E ζi wi ∑it−1 i ∗ i=1 Xi−1 + op (1) = Op (1). Therefore, we have shown Λn,t = Op (1).
d 1 θˆ0 − E [θ (Zt )] → B− (X ,2) B(X ,u) .
Γn = √
+√
n3 h t =1
Xt Yt = n−2
n −
Xt Xt⊤ θt + n−2
n −
t =1 n −
Op
X t ut
n −
Xt ut + n−2
t =1
−
Xt Xt⊤
n −
Xt Xt⊤ αt ,
t =1
Xt Yt
−1
= c0 + n
−2
−
⊤
n− 2
Xt Xt
t
−1 n− 2
−
Xt Xt⊤
n− 2
t
n −
−1 −
Xt Xt⊤
n− 2
t
X t ut
Xt Xt⊤ αt
n −
Xt Xt = n
t =1
⊤ n − Xt Xt d → B(X ,2) √ √ n
t =1
n
n
− X t ut d √ √ → B(X ,u) . t =1
n
n
∑n
t =1
⊤˘ n E ζ K − tz t 1 Xt −1 = + op (1) √ n t =1
d
∫
→
(A.12)
n
X t ut = n− 1
(A.11)
t =1
−1
h , and Γn3 =
h/n . Therefore, Γn1 is the leading term of Γn .
h−1/2 Γn2 Xt ut + Op (n−1/2 ),
√
Following ∑ the Proof of Theorem 3.3 of Hansen (1992), we show that (nh)−1 nt=1 n−1/2 Xt −1 etz = op (1). Therefore,
where the last equality follows from Lemma A.1. By Lemma 3.1 and the continuous mapping theorem, we have ⊤
√
t =1
= c0 + n−2
n − t =1
t =1
ζt ζt⊤ K˘ tz
ζt ζt⊤ K˘ tz ≤ E ζt ζ ⊤ K˘ tz ≤ E E ζt ζ ⊤ |Zt Ktz + E ζt ζ ⊤ E (Ktz ) = t t t ∑n O (h) by Assumptions A1 and A2(i). It follows n−1 t =1 ζt ζt⊤ K˘ tz = Op (h). Therefore, we have Γn3 = Op h/n . (A.15) −1/2 ∑n −1/2 ⊤˘ Consider Γn2 = n3 h + n3 h t =1 Xt −1 E ζt Ktz ∑n ⊤˘ ⊤˘ and E (etz ) = 0. t =1 Xt −1 etz , where etz = ζt Ktz − E ζt Ktz
t
−
n3 h t = 1
−1 − t
n− 2
n −
1 ζt Xt⊤−1 K˘ tz + √
Consider Γn3 first. We have E n−1
t =1
Xt Xt⊤ E (θt ) + n−2
n
n −
1
Xt −1 ζt⊤ K˘ tz
⊤ = Γn1 + Γn2 + Γn2 + Γn3 ,
(A.10)
t =1
n −
n −
1
n3 h t =1
we have
−2
n 1 − Xt Xt⊤ K˘ tz = √ Xt −1 Xt⊤−1 K˘ tz n3 h t =1 n3 h t = 1
where we will show Γn1 = Op (1) , Γn2 = Op
t =1
+
n −
1
+√
Proof. As
θˆ0 =
(A.14)
Proof. Simple calculations give
Lemma A.2. If Assumptions A1 and A2 hold, we have
= n− 2
BX (r ) BX (r )⊤ dWK (r ). 0
Combining the results above completes the proof of Lemma A.1.
n− 2
1
∫
→
i =1
n −
Xt Xt⊤ K˘ tz
(A.13)
h
n
1
BX (r ) drE ζt⊤ |z f (z ) ,
0
which gives Γn2 = Op
√
h .
−1/2 ∑n
⊤ ˘ Now, consider Γn1 = n3 h t =1 Xt −1 Xt −1 Ktz . For r ∈ ⊤ −1 [0, 1], let Un (r ) = Un,[nr] = n X[nr] X[nr] and Vn (r ) = (nhv0
f (z ))−1/2
r 0
∑[nr]
t =1
K˘ tz . Then,
Un (s) dVn (s) =
−1/2 ∑[nr]
n3 hv0 f (z )
⊤ ˘ = t =1 Xt −1 Xt −1 Ktz ∗ U s dQ s + Λ , where by Lemma 3.1 ( ) ( ) n n n,[nr] 0
r
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267
and Theorem 3.1 of Hansen (1992), we have
r 0
BX (s) BX (s) dWK (s) and ⊤
−1/2
= (nh)
Λ∗n,t
t −
Un,i − Un,i−1 wi − n
−1/2
d
Un (s) dQn (s) →
Un,t wt +1
r 0
(A.16)
261
Lemma A.5. If Assumption A1–A4 hold, and z is an interior point of S , we have n −1/2 − d n hv0 f (z ) Xt ut K˘ tz → 3
with wi = (v0 f (z ))−1/2
∑∞
k=1
E K˘ i+k,z |Fi . It remains to show
Λ∗n,[nr] = op (1) for any r ∈ [0, 1].
Following the proof of Lemma A.1, we can show that ‖wi ‖λ1 = O h1/λ2 for all i, where λ2 > λ1 > 2. Applying Chebyshev’s inequality, we obtain for any 0, given√small ξ > √ ∑n ∑n Pr sup1≤i≤n ‖wi ‖ > nhξ ≤ nhξ ≤ i=1 Pr ‖wi ‖ > i =1 √ −λ1 nhξ E ‖wi ‖λ1 ≤ M (nh)1−λ1 /2 ξ −λ1 → 0 as n → ∞, it then √ follows that sup1≤i≤n ‖wi ‖ = op nh . Consequently, we have sup (nh)−1/2 Un,t wt +1
1≤t ≤n
1≤t ≤n
d 1 θˇ (z ) − θ (z ) − h2 J (z ) → B− (X ,2) B(X ,u) ,
Proof. By adding and subtracting terms, we have
−1
−
Xt Xt Ktz
t
=
−1 −
⊤
Xt Xt Ktz
t
Xi−1 ζi + ⊤
ζi Xi⊤−1
+ ζi ζi
⊤
−1
−
= θ (z ) +
t −1/2 −
Xt (Xt⊤ θt + ut )Ktz
t
Un,i − Un,i−1 wi
i=1
Xt Yt Ktz
t
−
t
= n h
−
⊤
3
(A.20)
where J (z ) = µ2 [θ (1) (z )f (1) (z )/f (z ) + θ (2) (z )/2].
Moreover,
(A.19)
Lemma A.6. If Assumption A1–A4 hold, we have for all interior points z ∈ S,
1 ≤t ≤n
−
BX (r ) Bu (r ) dWK (r ) .
Proof. As both Xt and ut are I (1) processes, applying the Proof of Lemma A.3 with Xt Xt⊤ replaced by Xt ut proves Lemma A.5.
θˇ (z ) =
p ≤ sup n−1 Xt Xt⊤ sup (nh)−1/2 wt +1 → 0.
1
0
t =1
i=1
(nh)−1/2
∫
⊤
Xt Xt Ktz
t
wi ×
i=1
−
Xt Xt⊤ [θt − θ (z )] + ut Ktz .
t
t −1/2 − −1/2 = n3 h Xi−1 ζi⊤ + ζi Xi⊤ , −1 wi + op (nh)
It follows that
i=1
−
−1 −
−1/2 ⊤ as i=1 ζi ζi wi = op (nh) t −1/2 − E sup n3 h ζi ζi⊤ wi 1 ≤t ≤n i=1 −1/2 2 ‖ζi ‖2λ1 /(λ1 −1) ‖wi ‖λ1 = Op (nh)−1/2 h1/λ2 . ≤ M (nh)
θˇ (z ) − θ (z ) =
Following the proof of Γn2 in Lemma A.1, we can show
where the definitions of B1n (z ) and B2n ∑(z ) should be apparent. ∑ By Lemma A.3, we have (n2 h)−1 t Xt Xt⊤ Ktz = (n2 h)−1 t
where n3 h
that n3 h
−1/2 ∑t
−1/2 ∑t
Xi−1 ζi⊤ + ζi Xi⊤ −1 wi = Op
i=1
√
h . Therefore,
(v0 f (z ))−1/2 Γn1 =
t
+
Op
d
1
→
BX (s) BX (s)⊤ dWK (s) = Op (1) .
(A.17)
nh
3 −1/2
B1n (z ) =
(nh)
Xt Xt θtz = Op (1) , ⊤˘
−
Xt Xt Ktz
Xt Xt⊤ E {[θt − θ (z )] Ktz } +
Xt Xt⊤ [θt − θ (z )] Ktz
t
−1
n2 h
−1 −
Xt Xt⊤ E (Ktz )
t
(A.18)
t
−1
−1 ⊤
×
. It follows that
t
Lemma A.4. If Assumptions A1–A4 hold, and z is an interior point of S , then we have n −
−
= h2
−3/2
2 3 −1 ∑
X Xt⊤ [θt − θ (z )] Ktz = n h
t
0
This completes the proof of Lemma A.3.
(A.21)
Xt Xt⊤ E (Ktz ) + Op (nh)−1/2 . By Lemma A.4, we have n2 h3 t
Un (s) dQn (s) + op (1)
Xt ut Ktz
t
≡ B1n (z ) + B2n (z ),
0
∫
Xt Xt⊤ Ktz
−1 −
t
∑
1
Xt Xt⊤ [θt − θ (z )] Ktz
t
−
taking together the results above gives Λ∗n,t = op (1). Hence, we have
∫
Xt Xt⊤ Ktz
n2 h 3
−1 −
+ Op ((nh)−1/2 )
Xt Xt⊤ E [(θt − θ (z ))Ktz ]
t
t =1
where θ˘tz = [θt − θ (z )] Ktz − E {[θt − θ (z )] Ktz }. −1/2 ∑[nr] ⊤ −1 ˘ Proof. Let Vn (r ) = nh3 X[nr] X[nr] t =1 θtz and Un (r ) = n for any r ∈ [0, 1]. The proof will closely follow that of Lemma A.3 with some tedious calculations. Therefore, we omit the proof here.
+ Op
nh
3 −1/2
= E [(θt − θ (z ))Ktz ]/E (Ktz ) + Op ( h/n) ≡ h2 J (z ) + Op (h4 ) + Op ( h/n).
(A.22)
262
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267
Πn1 =
By Lemmas A.3 and A.5, we have
−
B2n (z ) =
−
Xt Xt⊤ Ktz
t
=
−1 −
n h
Xt Xt E (Ktz ) ⊤
t
×
n2 h
−1 −
X t ut
as 1n1 = Op (1) and 1n2 = Op (nh)−1/2 by (A.14). Applying Lemmas 3.1 and A.3 and the continuous mapping theorem, we obtain
∫ 1 ν0 −1 d 1 Πn1 → − BX (r ) BX (r )⊤ dWK (r ) B− B(X ,2) (X ,2) B(X ,u) . f (z ) 0
Xt ut E (Ktz ) + Op ((nh)−1/2 )
(A.26)
−1
= n
−
−2
Xt XtT
n
−2
−
t
Xt ut + Op ((nh)
−1/2
)
Taking all the results above completes the proof of Theorem 3.1.
t
d
1 → B− (X ,2) B(X ,u) .
(A.23)
This completes the proof of Lemma A.6.
Remark. We see that because ut is I (1), θˇ (z ) − θ (z ) does not converge to zero. Proof of Theorem 3.1. Combining (A.11) and (A.21)–(A.23), if nh5 = O (1), h → 0 and nh → ∞ as n → ∞, we have
√
nh(α( ˜ z ) − α(z ) − h2 J (z ))
−1 − − ⊤ nh Xt Xt Ktz Xt ut Ktz
√
t
t
−
−
−1
−
Xt Xt⊤
t
√
Xt ut + Op ( h)
Proof of Theorem 3.2. Before proving Theorem 3.2, we first briefly discuss the need of the trimming function Mnt = Mn (Zt ). Without loss of generality, we assume that S = [a, b], where −∞ < a < b < ∞, a and b are constants. To avoid boundary bias problem, we choose Mnt = 1(a + δn ≤ Zt ≤ b − δn ), where 1(A) is an indicator function taking value 1 if A holds true and 0 otherwise. δn is a non-stochastic sequence of real numbers that satisfies √the conditions: δn → 0, h/δn → 0 as n → ∞. For example, δn = h is allowed. The use of the trimming function is needed (theoretically) to avoid the slow convergence rate of θˇ (z ) when z falls into the boundary region; i.e., [a, a + h] ∪ [b − h, b]. For notational simplicity, we will omit the trimming function Mnt below. By (A.21), the proposed estimator is given by
αˆ (z ) = θˇ (z ) − n−1
t
−
n− 2
−
+
= θ (z ) − n
+ B2n (z ) − n
−1 ⊤
Xt Xt Ktz
− Xt ut K˘ tz √ √ + Op ( h)
t
n
t
nh
(A.24)
where applying (A.14) and (A.19) and the continuous mapping theorem, we have 1 − n2 h
=
×
→
−1 ⊤
Xt Xt Ktz
t
− Xt ut K˘ tz √ t
EKtz − n2 h
n
nh
−1 ⊤
Xt Xt
−1/2
+ Op (nh)
and
t
n
θ (Zt ) + B1n (z ) − n
−1
ν0 −1 B f (z ) (X ,2)
n −
B1n (Zt )
t =1 n −
B2n (Zt )
est = Kˇ st /Eˇ t (Kst ) ,
(A.28)
Obviously, Ktt = Eˇ t (Ktt ) = K (0), but ett = θˇtt = 0. Denote
1
BX (r )Bu (r )dWK (r ).
(A.25)
1n1 = n−2
n −
Next, let 1n = n = 1n1 + 1n2 , where t Xt Xt Ktz /EKtz ∑ ∑ 1n1 = n−2 t Xt Xt⊤ and 1n2 = n−2 t Xt Xt⊤ K˘ tz /E (Ktz ). By 1 1 Theorem 4.3 in Poirier (1995, p. 627), we have 1− = 1− n n1 − − 1 1 −1 1 1− 1− n1 1n2 1n1 1n2 + Id n1 . Then, we have
∑
⊤
Xt Xt⊤
and
t =1
0
−2
−1
θˇst = (θs − θt ) Kts − Eˇ t [(θs − θt ) Kst ] .
nh
∫
= Λn1 (z ) + Λn2 (z ) + Λn3 (z ) , (A.27) ∑ n where Λn1 (z ) = θ (z ) − n−1 t =1 θ (Zt ), Λn2 (z ) = B1n (z ) − ∑n ∑ −1 n (Zt ) and Λn1 (z ) = B2n (z ) − n−1 nt=1 B2n (Zt ). Λn1 (z ) t =1 B1n ∑ = α (z ) + Op n−1/2 since n−1 nt=1 θ (Zt ) = E [θ (Zt )] + Op (n−1/2 ) = c0 + Op (n−1/2 ). Below, we will show Λn2 (z ) = Op h2 and Λn3 (z ) = Op (nh)−1/2 . We first denote Eˇ t [g (Zs , Zt )] = g (Zs , Zt ) f (Zs ) dZs , where g (Zs , Zt ) is a measurable function of (Zs , Zt ). Then, let Kst = K ((Zs − Zt ) /h), Kˇ st = Kst − Eˇ t (Kst ) ,
− Xt ut K˘ tz √ t
t =1
= Πn1 + Πn2 + Op ( h),
Πn2 =
n −
−1 − n− 2 Xt Xt⊤ Xt ut
√
−1
t
1 − n2 h
θˇ (Zt )
t =1
t
EKtz
t
n − t =1
−1 − √ Ktz − 2 ⊤ Xt Xt = nh n
d
−
t
−1/2 + Op ((nh) )
t
=
t
− √ 1 −1 −2 = − nh1− Xt ut 1 + op (1) n1 1n2 1n1 n
Xt ut Ktz
−1 2
−1 1 −2 nh 1− n − 1n1 n
t
−1
√
1n2 (Zt ) = n
−2
n −
(A.29) Xs Xs Kst /Eˇ t (Kst ) . ⊤ˇ
s =1
By (A.22) and noting that Kst /Eˇ t (Kst ) = 1 + Kˇ st /Eˇ t (Kst ), we have
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267 n
n− 1
−
B1n (Zt )
+ Id
−1
263
1 −2 1− n1 n
− s
t =1
−1
n − −
= n− 1
n −
= n− 1
s
−
n− 2
Xs Xs⊤ + n−2
−
s
Xs Xs⊤
s
1 −3 − 1− n1 n
−1
Kˇ st
n− 2
Eˇ t (Kst )
−
− s
Xs Xs⊤
1 −1 − 1− n1 n
s
t =1
×
−
1 1n2 (Zt ) 1− n1
− s
t =1
=n
×
1 −1 − 1− n1 n
Eˇ t (Kst )
√
n −
nhΛn3 (z ) =
1n2 (Zt )
t =1
n −
B1n (Zt )
t =1
= h {B (z ) − E [J (Zt )]} + op h 2
2
+ Op
h/n .
(A.30)
and
s =1
πn2 (Zt ) = n
n −
√
nh B2n (z ) − n
−1
n −
B2n (Zt ) (A.32)
Lemma A.7. Let zt ∈ Rp be a strictly stationary β -mixing process, i1 < i2 < · · · < ik be arbitrary integers, F (zi1 , . . . , zik ) the distribution function for (zi1 , . . . , zik ). For any j (0 ≤ j ≤ k − 1), dFj = dF (zi1 , . . . , zij )dF (zij +1 , . . . , zik ) and dF0 = dF (zi1 , . . . , zik ). Let hn (zi1 , . . . , zik ) be a Borel function such that, for some λ > 0, . . . Rkp |hn (zi1 , . . . , zik )|1+λ dF (zi1 , . . . , zij )dF (zij +1 , . . . , zik ) ≤ Mn . Then
∫ ∫ ··· ≤
Now, denoting
−2
1 1n2 (Zt ) 1− n1 πn2 (Zt ) 1 + op (1)
As we will frequently use Lemma 3.1 of Yoshihara (1976) in our proofs, we present Yoshihara’s (1976) Lemma 3.1 below for readers’ convenience.
J (Zt ) + op h2 ,
Xs us
n −
A.2. Mathematical proof for Theorem 3.3 and Proposition 3.4
where Lemmas A.3 and A.9 given in Appendix A.2 are used. Therefore, taking this result with (A.22), we have
n −
1 1n2 (Zt ) 1− n1 πn1 1 + op (1)
which completes the proof of Theorem 3.2.
t =1
t =1
πn1 = n−2
n −
√ = Πn1 + Πn2 + op (nh)−1/2 + Op h ,
Eˇ [(θs − θt ) Kst ] + op h2 ˇEt (Kst )
Λn2 (z ) = B1n (z ) − n−1
πn2 (Zt )
t =1
1 Id − 1− n1 1n2 (Zt )
n −
Eˇ t (Kst )
where the last equation results from Lemma A.9. Taking this result with (A.24), we have
(θs − θt ) Kst Xs Xs⊤ 1 + op (1) Eˇ t (Kst )
n −
= h 2 n− 1
1 + op (1)
t =1
Eˇ t [(θs − θt ) Kst ] 1 + op (1) + op h2 ˇEt (Kst )
= n− 1
s n −
Kst
−1 −1 1 = 1− h , n1 πn1 + op n
n − Eˇ t [(θs − θt ) Kst ] −1 t =1
X s us
t =1 1 −1 − 1− n1 n
(θs − θt ) Kst 1 −3 − 1− n1 n Eˇ t (Kst )
n
−
t =1
(θs − θt ) Kst Eˇ t (Kst )
n − −
1 −3 = 1− n1 n
Kst Eˇ t (Kst )
1 1n2 (Zt ) 1− n1
1 −1 −1 = 1− n1 πn1 + 1n1 n
−1 −1 −2 −1 1 −1 1n1 n 1− n1 − 1n1 1n2 (Zt ) 1n1 1n2 (Zt ) + Id
Xs Xs⊤
n −
Kst Eˇ t (Kst )
t =1
t =1
×
X s us
s
t =1
(θs − θt ) Kst Xs Xs⊤ Eˇ t (Kst )
n
= n− 1
1 −3 = 1− n1 n
Xs Xs⊤ (θs − θt ) Kst
n − −
s
t =1
−
−
Xs Xs⊤ Kst
s
t =1
×
X s us
(A.31) Xs us Kˇ st /Eˇ t (Kst ) ,
Rkp
hn (zi1 , . . . , zik )dF0 −
4Mn1/(1+λ)
[β(ij+1 − ij )]
∫
λ/(1+λ)
∫ ··· Rkp
hn (zi1 , . . . , zik )dFj
.
In this section, Kˇ st , Eˇ t (Kst ), est , 1n1 , 1n2 (Zt ) , πn1 , and πn2 (Zt ) are defined the same way as in (A.28), (A.29) and (A.31). Also, we denote χn = {X1 , . . . , Xn , u1 , . . . , un }. The following lemma is used by the Proofs of Theorem 3.3 and Lemma A.9.
s=1
Lemma A.8. If Assumptions A1–A5 and A2* hold, we have
by (A.23), we have n
−1
n −
B2n (Zt ) = n
−1
n − −
t =1
−1 ⊤
Xs Xs Kst
s
t =1
s
= n− 1
n − − s
t =1
×
−
X s us
s
= n− 1
n − t =1
Xs Xs⊤ +
− s
def
−
Xs Xs⊤
Kˇ st
−1
Eˇ t (Kst )
Kst Eˇ t (Kst )
1 −1 −1 1− n1 − 1n1 1n2 (Zt ) 1n1 1n2 (Zt )
Xs us Kst
Dn = n− 4 h − 2
n − n − n − n −
|cov(est , es′ t ′ )| = O n−2 h−1 . (A.33)
t =1 t ′ =1 s=1 s′ =1
Proof. Let Dn = Dn1 + Dn2 + Dn3 + Dn4 , where Dn1 is for the case that t = t ′ and s = s′ , Dn2 is for the case that t = t ′ but s ̸= s′ , Dn3 is for the case that t ̸= t ′ but s = s′ , and Dn4 is for the case that t ̸= t ′ and s ̸= s′ . Firstly, as sups,t Var (est ) = O (h), we have
Dn1 = n−4 h−2
n − n − t =1 s=1
Var (est ) = O n−2 h−1 .
(A.34)
264
o
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267
we will show that Dn2 and Dn3 are both of Secondly, −1 2 n h
the proof for D∑ similarity of the proofs. In this case, n3 due ∑to ∑ n n Dn2 = n−4 h−2 t =1 s=1 s′ ̸=s |cov(est , es′ t )|. Applying Berbee (1979) coupling method, we rewrite cov(est , es′ t ) as follows cov(est , es′ t ) = cov(est , es′ t ) − cov (est , es′ t ) ∗
(A.35)
where the expectation for cov ∗ (est , es′ t ) is taken with respect to v(est , es′ t ) the joint pdf fs′ ,s (Zs′ , Zs ) f (Zt ), and the expectation for co is taken with respect to the joint pdf f (Zs′ ) f (Zs ) f (Zt ). Hence, v(est , es′ t ) = 0. Therefore, applying Lemma A.7 gives co
|cov(est , es′ t )| ≤ Mh2/(1+λ) × β λ/(1+λ) min t − s, t − s′ + β λ/(1+λ) s − s′ ,
−−−
h2/(1+λ)
(A.36)
∑ ∑ −1 ∑s−1 ≤ Mn−4 h−2 nt=4 ts= 3 t ′ =2 ∑ t ′ −1 2 −1 1/3 ′ ′ | co v( e , e )| = o ( n h ) . Let m = [ 1 / h ] be the st st s′ =1 integer part of 1/h1/3 . We now bound Dn4 separately for three cases: (i) t − s′ ≤ m; (ii) t − s′ > m and t − t ′ ≤ m; (iii) t − t ′ > m. Conformably, we rewrite Dn4 = Dn4,(i) + Dn4,(ii) + Dn4,(iii) . For cases (i) and (ii), we have |cov(est , es′ t ′ )| ≤ E |est es′ t ′ | + |E (est ) E (es′ t ′ )| ≤ Mh2 . As the number of summations is of order
nm3 for case (i) and n2 m2 for case (ii), we have
n − t −1 t − m−1 − t ′ −1 − −
s − t′
(A.37)
(A.41)
s
Proof. Evidently, we only need to give the proofs for (A.39) and (A.41). To simplify notation, without loss of generality, we treat Xt as a scalar; otherwise the result holds for each element of Γn1 (and Γn3 ). Consider Γn1 first. Under Assumption A2, we have sup1≤t ≤n
t ′ =2
−τ λ/(1+λ)
x−m−1
∫
3
2
∫
z −1
(y − z )−τ λ/(1+λ)
Dn4 = n−4 h−2 O nm3 h2 + n2 m2 h2 + n4−τ λ/(1+λ) h2/(1+λ)
2
n h
+O
n2 h
−1
,
2
n h
Hence, by sup |Xt | = Op
√ n
and
1 ≤t ≤n
sup |ut | = Op
n ,
√
(A.42)
1 ≤t ≤n
we obtain
n − n −
h1/(1+λ) β λ/(1+λ) (|s − t |)
−1 −1 1/(1+λ)
= Op n h h = op (nh)−1 (A.43) ∑ ∑ n n because n−1 t =1 s=1 β λ/(1+λ) (|s − t |) ≤ M under Assumption A5(ii). Now, we consider the conditional variance of Γ˜ n1 given χn . Again, by Assumption A5(i), we have n − n − n − n −
Xs us us′ Xs′ cov(est , es′ t ′ )
Applying Lemma A.8 gives v ar (Γ˜ n1 |χn ) (A.38)
−1
|E (est )| ≤ Mh1/(1+λ) β (|t − s|)λ/(1+λ) .
t =1 t ′ =1 s=1 s′ =1
1
n − n − n − n − |cov(est , es′ t ′ )| . = Op n−4 h−2
n
2−τ λ/(1+λ) (1−λ)/(1+λ)
h
= Op n−2 h−1 . By Markov’s inequality, we then obtain Γ˜ n1 − E Γ˜ n1 |χn = Op (n−1 h−1/2 ). By (A.43), we have Γ˜ n1 = op
Taking (A.37) and (A.38) together, we obtain
−1
−1 ∑n ∑n = n3 h t =1 =1 1s+λ Xs us E (est ). As Eˇ (est ) = 0 and for λ ∈ (0, 1) , Kˇ st /f (Zt ) f (Zt ) f (Zs ) dZt dZs ≤ Mh, applying Lemma A.7 gives By Assumption A5(i), we have E Γ˜ n1 |χn
t =1 t ′ =1 s=1 s′ =1
s ′ =1
x−1
∫
= Γ˜ n1 1 + op (1) .
v ar (Γ˜ n1 |χn ) = n−6 h−2
∼ n4−τ λ/(1+λ) h2/(1+λ) .
=o
Xs Xs⊤ θˇst /Eˇ t (Kst ) = op n−1 .
t =1 s =1
× dsdzdydx
n − −
For case (iii), applying Lemma A.7, we have |cov(est , es′ t ′ )| ≤ Mh2/(1+λ) β λ/(1+λ) s − t ′ for t > s > t ′ > s′ . The leading term of n3 h2 Dn4,(iii) is given by
4
(A.40)
t =1
= O n−3 h−1 + O n−2 h−2/3 −1 = o n2 h .
∼ h2/(1+λ)
1n2 (Zt ) = op n−1 h−1 ,
E Γ˜ n1 |χn = Op n−2 h−1
Dn4,(i) + Dn4,(ii) = O nm3 h2 + n2 m2 h2 n−4 h−2
n
n −
Finally, we will show Dn4
=o
Γn3 = n−3
def
under Assumption A5(ii) with λ ∈ (0, 1). Hence, we obtain Dn2 = o ( n2 h ) − 1 .
Γn2 = n−1
t =1 s=1
(t − s) + β λ/(1+λ) s − s′ = O n h(−2λ)/(1+λ) = o (n2 h)−1 , × β −2
∫
(A.39)
n − n −1 − Xs us Kˇ st /f (Zt ) 1 + op (1) Γn1 = n3 h
s
t
t =4 s =3
πn2 (Zt ) = op n−1 h−1 ,
t =1
λ/(1+λ)
n −
−1 ˇ h Et (Kst ) − f (Zt ) = Op h2 . Hence,
and it follows that
Dn2 ≤ Mn−4 h−2
Γn1 = n−1
t =1
v(est , es′ t ) + co v(est , es′ t ), + cov ∗ (est , es′ t ) − co
h2/(1+λ)
Lemma A.9. If Assumptions A1–A5 and A2* hold, we have
. We will give the detailed proof for Dn2 and omit
if λ ∈ (0, 1) and τ > 2λ−1 (1 + λ). This completes the proof of Lemma A.8.
(nh)−1 . This com-
pletes the proof of (A.39).
−1 ∑n ∑ 2 ˇ t =1 s Xs θst / 1+λ def f (Zt ) 1 + op (1) = Γ˜ n3 1 + op (1) . As θˇst /f (Zt ) f (Zt ) 2+λ ˇ f (Zs ) dZt dZs ≤ Mh for λ ∈ (0, 1) and E (ξst ) = 0, where ξst = θˇst /f (Zt ), applying Lemma A.7 gives Consider Γn3 . Again, we have Γn3 =
|E (ξst )| ≤ Mh(2+λ)/(1+λ) β (|t − s|)λ/(1+λ) .
n3 h
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267
Hence, by (A.42), we obtain
Secondly, we obtain n
E Γ˜ n3 |χn = Op n−2 h−1
n
−−
h(2+λ)/(1+λ) β λ/(1+λ) (|s − t |)
n −
An2,2 = −n−1
−1 1/(1+λ)
= Op n h
−1
= op n
.
(A.44)
n −1 −
= n− 1
n − n − n − n −
+ n−1 ζ2 X1⊤ B1n (Z1 ) n −1 − = n− 1 (ζt +1 − ζt ) Xt⊤ B1n (Zt )
Xs2 Xs2′ cov(ξst , ξs′ t ′ )
t =1 t ′ =1 s=1 s′ =1
t =2
n − n − n − n − |cov(ξst , ξs′ t ′ )| . = Op n−4 h−2
+ Op n−1/2 h2
t =1 t ′ =1 s=1 s′ =1
Proof of Theorem 3.3. To simplify notation, we denote α˜ t = ∑n ⊤ −1 α( ˜ Z ) and θˇt = θˇ (Zt ). By (2.14), we have c˜0 = t =2 ζt ζt ∑n t ⊤ ˜ ˜ ˜ t + Xt⊤−1 α˜ t −1 = t =2 ζt 1Yt , where 1Yt = Yt − Yt −1 − Xt α ⊤ ⊤ ⊤ Xt (αt − α˜ t ) − Xt −1 (αt −1 − α˜ t −1 ) + ζt c0 +ϵt . Therefore, we have
−
c˜0 = c0 +
−1
n
ζt ζt⊤
n
−
t =2
ζt ϵt +
n −
t =2
−1
−
def
= c0 + An1 + n−1
+ op h
∑n−1
n −
An2,3 = −n−1
=n ∑n
−1
Xt⊤ (αt − α˜ t ) − Xt⊤−1 (αt −1 − α˜ t −1 )
n −1 −
(ζt +1 − ζt ) Xt⊤ B2n (Zt ) − n−1 ζn Xn⊤ B2n (Zn )
n −1 −
(ζt +1 − ζt ) Xt
⊤
−1
=n
n −1 −
(ζt +1 − ζt ) Xt
⊤
⊤ ⊤ t =2 ζt Xt B2n (Zt ) − Xt −1 B2n (Zt −1 ) .
∑n
2
n h
n −1 −
−1 Xs Xs Et (Kst ) ⊤ˇ
s=1
n−1 −
×
(ζt +1 − ζt ) Xt⊤
2
n −1 − × n2 h Xs us Eˇ t (Kst )
= n− 1
n−1 −
An2,1 = 1n1 πn1 + 1n1 1n3
1 (ζt +1 − ζt ) Xt⊤ 1− n1 πn1
t =2
t =2 1 = 1− n1 πn1 + Op n
Xs Xs Et (Kst )
s=1
−1
−1/2
−1 ⊤ˇ
s =1
−1
n −1 −
n h
Firstly, applying Lemma A.1 yields
ζt ζt
t =2
Substituting the above result into An2∑ , we can write An2 = 1 An2,1 + An2,2 + An2,3 , where An2,1 = n−1 t ζt 1− n1(πn1 + 1n3 ), ⊤ ∑n ⊤ −1 An2,2 = −n t =2 ζt Xt B1n (Zt ) − Xt −1 B1n (Zt −1 ) , and An2,3 =
n
Xs us Kst
s=1
+ Op n−1/2 = An2,3,1 + An2,3,2 1 + op (1) + Op n−1/2 ,
−1
n −
n −1 − × n2 h Xs us Kˇ st + Eˇ t (Kst ) 1 + op (1)
[B1n (Zt −1 ) + B2n (Zt −1 )] .
⊤
Xs Xs Kst
s=1
An2,3,1 = n−1
n −
−1 ⊤
t =2
1 ⊤ = ζt⊤ 1− n1 (πn1 + 1n3 ) − Xt [B1n (Zt ) + B2n (Zt )]
−1
by (A.23)
+ Op n−1/2
where
n − s=1
t =2
1 −1 = Xt⊤ −B1n (Zt ) − B2n (Zt ) + 1− n1 πn1 + 1n1 1n3 1 −1 − Xt⊤−1 −B1n (Zt −1 ) − B2n (Zt −1 ) + 1− n1 πn1 + 1n1 1n3
−n − 1
ζt Xt⊤ B2n (Zt ) − Xt⊤−1 B2n (Zt −1 )
t =2
An2 ,
∑n where An1 = ζ t =2 ζt ζt t =2 ζt ϵt and An2 = n ⊤ t =2 t Xt (αt − α˜ t ) − Xt⊤−1 (αt −1 − α˜ t −1 ) . An1 = Op n−1/2 given ∑n E (ϵt |ζt ) = 0 and Var n−1 t =2 ζt ϵt = O n−1 by McLeish’s −1 ∑n strong mixing inequality. Below, we will show n−1 t =2 ζt ζt⊤ 2 An2 = Op h + Op (nh)−1/2 . ∑ Define 1n3 = n−2 t Xt Xt⊤ αt . Then (A.11) can be written as 1 θˆ0 = c0 + 1− n1 (πn1 + 1n3 ). Combining this with (A.21), we have
+ Xt⊤−1
(A.46)
where following the proof of Lemma A.3, we can show n−1 t =2 (ζt +1 − ζt ) Xt⊤ B (Zt ) = Op (1) given the fact that E [(ζt +1 − ζt ) B(Zt )⊤ ] = 0 under Assumption A5(i). Similar manipulations give
−1
⊤ −1
+ Op n−1/2 h2 = Op h2 ,
t =2
t =2
∑n
t =2
t =2
−1 ζt ζt⊤
(ζt +1 − ζt ) Xt⊤ B (Zt )
+ n−1 ζ2 X1⊤ B2n (Z1 ) n −1 − = n−1 (ζt +1 − ζt ) Xt⊤ B2n (Zt ) + Op n−1/2
t =2 n −
−
2
t =2
ζt Xt⊤ (αt − α˜ t ) − Xt⊤−1 (αt −1 − α˜ t −1 )
= n− 1 h 2
= n−1
ζt ζt⊤
n
×
by (A.22)
n−1
˜ Following the proof of Lemma A.8, we show v ar (Γn3 |χn ) = Op n−2 h . Again, by Markov’s inequality, we obtain Γ˜ n3 − E Γ˜ n3 |χn = Op (n−1 h1/2 ). Therefore, by (A.44), we have Γ˜ n3 = op (n−1 ). This proves (A.41). Hence, the proof of Lemma A.9 is complete.
(ζt +1 − ζt ) Xt⊤ B1n (Zt ) − n−1 ζn Xn⊤ B1n (Zn )
t =2
Moreover, we have
v ar (Γ˜ n3 |χn ) = n−6 h−2
ζt Xt⊤ B1n (Zt ) − Xt⊤−1 B1n (Zt −1 )
t =2
t =1 s=1
265
.
(A.45)
=− n
−1
n − t =2
ζt ζt
⊤
−1/2 1 1− , n1 πn1 + Op n
(A.47)
266
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267
and
Taking together this result with (A.45)–(A.47) yields c˜0 − c0 = Op (h2 + (nh)−1/2 ).
n−1
An2,3,2 = n−1
−
(ζt +1 − ζt ) Xt⊤
Also, Theorem 3.1 indicates α( ˜ z ) = α (z ) + Op (h2 + (nh)−1/2 ). We therefore obtain
t =2
×
n2 h
n −1 −
−1 Xs Xs⊤ Eˇ t (Kst )
θ˜ (z ) = α( ˜ z ) + c˜0 = θ (z ) + Op (h2 + (nh)−1/2 ),
s=1
which shows that θ˜ (z ) is a consistent estimator of θ (z ) under Assumption A4. The proof for the consistency of cˆ0 and θˆ (z ) are similar, and thus are omitted here. This completes the proof of Theorem 3.3.
n −1 − × n2 h Xs us Kˇ st s =1
= n− 1
n−1 −
2 −1 1 (ζt +1 − ζt ) Xt⊤ 1− n1 n h
t =2
×
n −
Xs us Kˇ st /f (Zt ) 1 + op (1)
def
= Γn 1 + op (1)
Proof of Proposition 3.4. θˇ (z ) is still defined as in (2.6). When there is a (varying) intercept term γ (Zt ), θˇ (z ) contains the following extra term: def
as sup1≤t ≤n h−1 Eˇ t (Kst ) − f (Zt ) = O h2 by Assumptions A2 and A3. Below, we will show Γn = Op (nh)−1/2 . In the following proof, without loss of generality, we again treat Xt as a scalar; otherwise the result holds for each element of Γn . By Assumption A5(i), we have E (Γn |χn ) = n h 3
n −1 −1 −
(ζt +1 − ζt ) Xt 1n1
−1
n −
t =2
−1
s=1
Mn =
−
√
√ nMn =
×
=
d
n
−1
n
[∫
n
1
BX (r )BX (r ) dr ⊤
] −1 ∫
1
BX (r )dr
γ (z )
0
= Op (1). It follows that Mn
−1/2
=
∑
t
Xt Xt⊤ Ktz
−1 ∑
t
Xt Ktz γ (Zt )
=
Op n uniformly over z ∈ S since supz ∈S |γ (z )| ≤ M < ∞. Hence, we have shown that Mn = Op (n−1/2 ) = op (h2 + (nh)−1/2 ). Therefore, the inclusion of an additive bounded function γ (·) in the regression model is asymptotically negligible as far as the estimation of θ (·) is concerned. This completes the proof of Proposition 3.4.
β λ/(1+λ) (|s − t |)
s =1
(A.48)
because E |ζt +1 − ζt | < M by Assumption A1 and n t =2 s=1 β λ/(1+λ) (|s − t |) ≤ M under Assumption A5(ii). Now, we consider the conditional variance of Γn given χn . Again, by Assumption A5(i), we have
∑n−1 ∑n −1
def
Dn = v ar (Γn |χn )
(ζt +1 − ζt )
t =2 t ′ =2 s=1 s′ =1
× Xt Xs us us′ Xs′ Xt ′ (ζt ′ +1 − ζt ′ ) cov(est , es′ t ′ ) n−1 − n−1 − n − n − |cov(est , es′ t ′ )| , = Op n−3 h−2
− Xt X ⊤ √ √t
0
t =2
n−1 − n −1 − n − n −
n
−1
→
h1/(1+λ) β λ/(1+λ) (|s − t |)
= n− 6 h − 2
t
|ζt +1 − ζt |
= Op n−1/2 h−1 h1/(1+λ) = op (nh)−1/2
Xt E [Ktz γ (Zt )] 1 + op (1)
t
n −1 − |ζt +1 − ζt | = Op n−3/2 h−1 h1/(1+λ)
×
Xt Xt E (Ktz )
− Xt × n− 1 √ [E (Ktz )]−1 E [Ktz γ (Zt )] 1 + op (1)
s =1
n −
−1 ⊤
t
t =2
×
−
Xs us E (est ) ,
Hence, applying (A.42), we obtain
n −
n
− t
|E (est )| ≤ Mh1/(1+λ) β (|t − s|)λ/(1+λ) .
|E (Γn |χn )| = Op n−3/2 h
Xt Ktz γ (Zt ).
t
Now,
s =1
n−1 − −1
−
Xt XtT Ktz
t
where applying Lemma A.7 gives
(A.50)
(A.49)
t =2 t ′ =2 s=1 s′ =1
where the last equation is obtained by (A.42) and sup1≤t ≤n E ‖ζt +1 − ζt ‖2 ≤ M < ∞. Applying Lemma A.8 gives Dn = Op (n−1 h−1 ). By Markov’s inequality, we obtain Γn − E (Γn |χn ) = Op ((nh)−1/2 ). By (A.48), we have Γn = Op ((nh)−1/2 ). Hence, we have An2,3,2 = Op ((nh)−1/2 ).
References Aloui, C., 2007. Price and volatility spillovers between exchange rates and stock indexes for the pre- and post-euro period. Quantitative Finance 7, 669–685. Athreya, K.B., Pantula, S.G., 1986. A note on strong mixing of ARMA processes. Statistics & Probability Letters 4, 187–190. Baillie, R.T., Bollerslev, T., Mikkelsen, H.O., 1996. Fractionally integrated generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 74, 3–30. Bali, T.G., Wu, L., 2010. The role of exchange rates in intertemporal risk-return relations. Journal of International Money and Finance 29, 1670–1686. Barnett, W.A., Serletis, A., 1990. A dispersion-dependency diagnostic test for aggregation error: with applications to monetary economics and income distribution. Journal of Econometrics 43, 5–34. Berbee, H.C.P., 1979. Random walks with stationary increments and renewal Theory. Math. Cent. Tracts, Amstardam. Blundell, R., Stoker, T.M., 2005. Heterogeneity and aggregation. Journal of Economic Literature 43, 347–391. Bollerslev, T., Mikkelsen, H.O., 1996. Modeling and pricing long-memory in stock market volatility. Journal of Econometrics 73, 151–184. Cai, Z., Fan, J., Yao, Q., 2000. Functional-coefficient regression models for nonlinear time series. Journal of the American Statistical Association 95, 941–956. Cai, Z., Li, Q., Park, J., 2009. Functional-Coefficient Models for Nonstationary Time Series Data. Journal of Econometrics 148, 101–113. Christensen, B.J., Nielsen, M.O., 2007. The effect of long memory in volatility on stock market fluctuations. Review of Economics and Statistics 89, 684–700. Davidson, J., 2004. Moment and memory properties of linear conditional heteroscedastic models, and a new model. Journal of Business and Economic Statistics 22, 16–29.
Y. Sun et al. / Journal of Econometrics 164 (2011) 252–267 de Jong, R.M., Davidson, J., 2000. The functional central limit theorem and weak convergence to stochastic integrals I. Econometric Theory 16, 621–642. Fan, J., Cai, Z., Li, R., 2000. Efficient estimation and inferences of varying-coefficient models. Journal of the American Statistical Association 95, 465–501. Fan, J., Yao, Q., Cai, Z., 2003. Adaptive varying coefficient linear models. Journal of the Royal Statistical Society (Series B) 65, 57–80. Gavin, M., 1989. The stock market and exchange rate dynamics. Journal of International Money and Finance 8, 181–200. Hansen, B.E., 1992. Convergence to stochastic integrals for dependent heterogeneous processes. Econometric Theory 8, 489–500. Juhl, T., 2005. Functional coefficient models under unit root behavior. Econometrics Journal 8, 197–213. Karolyi, G., 1995. A multivariate GARCH model of international transmissions of stock returns and volatility: the case of the United States and Canada. Journal of Business & Economic Statistics 13, 11–25. Koutmos, G., Booth, G.G., 1995. Asymmetric volatility transmission in international stock markets. Journal of International Money and Finance 14, 747–762. Li, Q., Huang, C., Li, D., Fu, T., 2002. Semiparametric smooth coefficient models. Journal of Business and Economic Statistics 20, 412–422. Mark, N.C., 2001. International Marcoeconomics and Finance. Blackwell, New York. McLeish, D.L., 1975. A maximal inequality and dependent strong laws. The Annals of Probability 3, 829–839.
267
Phillips, P.C.B., 2009. Local limit theory and spurious nonparametric regression. Econometric Theory 25, 1466–1497. Poirier, D.J., 1995. Intermediate Statistics and Econometrics: a Comparative Approach. The MIT Press. Ray, B.K., Tsay, R.S., 2000. Long-range dependence in daily stock volatilities. Journal of Business & Economic Statistics 18, 254–262. Sun, Y., Hsiao, C., Li, Q., 2010. Volatility spill over effects: empirical evidence based semiparametric analysis (manuscript). Taylor, A.M., Taylor, M.P., 2004. The purchasing power parity debate. Journal of Economic Perspective 18, 135–158. White, H., Domowitz, I., 1984. Nonlinear regression with dependent observations. Econometrica 52, 143–162. Wooldridge, J.M., White, H., 1988. Some invariance principles and central limit theorems for dependent heterogeneous processes. Econometric Theory 4, 210–230. Xiao, Z., 2009. Functional coefficient cointegration models. Journal of Econometrics 152, 81–92. Xu, Z.H., 2003. Purchasing power parity, price indices, and exchange rate forecasts. Journal of International Money and Finance 22, 105–130. Yoshihara, K., 1976. Limiting behavior of U-statistics for stationary, absolutely regular processes. Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete 35, 237–252.
Journal of Econometrics 164 (2011) 268–293
Contents lists available at ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Generalized spectral testing for multivariate continuous-time models Bin Chen a,∗ , Yongmiao Hong b,c a
Department of Economics, University of Rochester, Rochester, NY, 14627, United States
b
Department of Economics & Department of Statistical Science, Cornell University, Ithaca, NY 14850, United States
c
Wang Yanan Institute for Studies in Economics (WISE) & MOE Key Laboratory in Econometrics, Xiamen University, Xiamen 361005, China
article
info
Article history: Received 17 December 2008 Received in revised form 21 May 2011 Accepted 1 June 2011 Available online 15 June 2011 JEL classification: C4 E4 G0
abstract We develop an omnibus specification test for multivariate continuous-time models using the conditional characteristic function, which often has a convenient closed-form or can be accurately approximated for many multivariate continuous-time models in finance and economics. The proposed test fully exploits the information in the joint conditional distribution of underlying economic processes and hence is expected to have good power in a multivariate context. A class of easy-to-interpret diagnostic procedures is supplemented to gauge possible sources of model misspecification. Our tests are also applicable to discrete-time distribution models. Simulation studies show that the tests provide reliable inference in finite samples. © 2011 Elsevier B.V. All rights reserved.
Keywords: Affine jump–diffusion model Conditional characteristic function Discrete-time distribution model Generalized cross-spectrum Lévy processes Model specification test Multivariate continuous-time model
1. Introduction Multivariate continuous-time models have proved to be versatile and productive tools in finance and economics (e.g., Andersen et al. (2002), Chernov et al. (2003), Dai and Singleton (2003), Pan and Singleton (2008), Jarrow et al. (2010), and Piazzesi (2010)). Compared with that of discrete-time models, the econometric analysis of continuous-time models is often challenging. In the past decade or so, substantial progress has been made in developing estimation methods for continuous-time models.1 However, relatively little effort has been devoted to specification and evaluation of continuous-time models. A continuous-time model essentially specifies the transition density of underlying processes. Model misspecification generally renders inconsistent parameter estimators and their variance–covariance matrix estimators, yielding misleading conclusions in inference and hypothesis testing.
∗
Corresponding author. E-mail addresses:
[email protected] (B. Chen),
[email protected] (Y. Hong). 1 Sundaresan (2001) points out that ‘‘perhaps the most significant development in the continuous-time field during the last decade has been the innovations in econometric theory and in the estimation techniques for models in continuous time.’’ For other reviews of this literature, see (e.g.) Tauchen (1997) and Ait-Sahalia (2007). 0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.06.001
Correct model specification is also crucial for valid economic interpretations of model parameters. More importantly, a misspecified model can lead to large errors in pricing, hedging and managing risk. However, economic theories usually do not suggest any concrete functional form for continuous-time models; the choice of a model is somewhat arbitrary. A prime example of this practice is the pricing and hedging literature, where continuous-time models are generally assumed to have a functional form that results in a closed-form pricing formula. The models, though convenient, are often incorrect or suboptimal. To avoid this pitfall, the development of reliable specification tests for continuous-time models is necessary. In a pioneer paper, Ait-Sahalia (1996a) develops a nonparametric test for univariate diffusion models. By observing that the drift and diffusion functions completely characterize the stationary (i.e., marginal) density of a diffusion model, Ait-Sahalia (1996a) compares the model-implied stationary density with a smoothed kernel density estimator based on discretely sampled data.2 Gao and King (2004) develop a simulation procedure to improve the finite sample performance of Ait-Sahalia’s (1996a) test. These tests are 2 Ait-Sahalia (1996a) also proposes a transition density-based test that exploits the ‘‘transition discrepancy’’ characterized by the forward and backward Kolmogorov equations, although the marginal density-based test is most emphasized.
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
convenient to implement. Nevertheless, they may pass over a misspecified model that has a correct stationary density. Hong and Li (2005) develop an omnibus nonparametric specification test for continuous-time models. The test uses the transition density, which depicts the full dynamics of a continuoustime process. When a univariate continuous-time model is correctly specified, the probability integral transform (PIT) of data via the model-implied transition density is i.i.d. U[0, 1]. Hong and Li (2005) check the joint hypothesis of i.i.d. U [0, 1] by using a smoothed kernel estimator of the joint density of the PIT series. However, Hong and Li’s (2005) approach cannot be extended to a multivariate model. The PIT series with respect to a modelimplied multivariate transition density are no longer i.i.d U[0, 1], even if the model is correctly specified. In an application, Hong and Li (2005) evaluate multivariate affine term structure models (ATSMs) for interest rates by using the marginal PIT for each state variable. This is analogous to Singleton’s (2001) conditional characteristic function-based limited-information maximum likelihood (LML-CCF) estimator in a related context, which is based on the conditional density function of an individual state variable. As pointed out by Singleton (2001), ‘‘the LML-CCF estimator fully exploits the information in the conditional likelihood function of the individual yj,t +1 , but not the information in the joint conditional distribution of yt +1 .’’ This generally leads to an asymptotic efficiency loss in estimation. Similarly, Hong and Li’s (2005) test, when applied to each state variable of a multivariate process, may fail to detect misspecification in the joint dynamics of state variables. In particular, the test may easily overlook misspecification in the conditional correlations between state variables. Moreover, the use of the transition density may not be computationally convenient because the transition densities of most continuous-time models have no closed-form. Gallant and Tauchen (1996) propose a class of Efficient Method of Moments (EMM) tests that can be used to test multivariate continuous-time models. They propose a χ 2 test for model misspecification, and a class of appealing diagnostic t-tests that can be used to gauge possible sources for model failure. Since these tests are by-products of the EMM algorithm, they cannot be used when the model is estimated by other methods. This may limit the scope of these tests’ otherwise useful applications. Bhardwaj et al. (2008) consider a simulation-based test, which is an extension of Andrews’ (1997) conditional Kolmogorov test, for multivariate diffusion processes. The limiting distribution of the test statistic is not nuisance parameter free and hence asymptotically critical values must be obtained via the block bootstrap, which may be time-consuming. There have been other tests for univariate diffusion models in the recent literature. Ait-Sahalia et al. (2009) propose some tests by comparing the model-implied transition density and distribution function with their nonparametric counterparts. Chen et al. (2008) also propose a transition density-based test using a nonparametric empirical likelihood approach. Li (2007) focuses on the parametric specification of the diffusion function by measuring the distance between the model-implied diffusion function and its kernel estimator. These approaches could be extended to multivariate continuous-time models. 
However, all these tests maintain the Markov assumption for the data generating process (DGP), and consider the finite order lag only. If the DGP is non-Markov, these tests may miss some dynamic misspecifications. This paper proposes a new approach to testing the adequacy of a multivariate continuous-time model that uses the full information of the joint dynamics of state variables. In a multivariate context, modeling the joint dynamics of state variables is important in many applications (e.g., Alexander (1998)). For example, as the conditional correlations between asset returns change over time, the specific weight allocated to each asset within a portfolio should
269
be adjusted accordingly. Similarly, hedging requires knowledge of conditional correlations between the returns on different assets within the hedge. Conditional correlations are also important in pricing structured products such as rainbow options, which are based on multiple underlying assets whose prices are correlated. In the term structure literature, models of interest rate term structure impose dynamic cross-sectional restrictions, as implied by the no-arbitrage condition on bond yields of different maturities. This joint dynamics can be used to investigate the transmission mechanism that transfers the impact of government policy from spot rates to longer-term yields. There is a conflict between the flexibilities of modeling instantaneous conditional variances and instantaneous conditional correlations of bond yields. Dai and Singleton (2000) find that not only do swap rate data consistently call for negative conditional correlations between bond yields, but also factor correlations help explain the shape of the term structure of bond yields’ volatilities. Indeed, as Engle (2002) points out, ‘‘the quest for reliable estimates of correlations between financial variables has been the motivation for countless academic articles, practitioner conferences and Wall Street research’’. There has been a long history of using the CF in estimation and hypotheses testing in statistics and econometrics. To name a few, Koutrouvelis (1980) constructs a chi-squared goodness-of-fit test for simple null hypotheses with empirical characteristic function (ECF). Fan (1997) takes the CF approach to testing multivariate distributions. But both tests maintain the i.i.d. assumption and hence are not suitable for the time series data. Recently, Su and White (2007) test conditional independence by comparing the unrestricted and restricted CCFs via a kernel regression. All above works deal with discrete-time models, in recent years, the CF approach has attracted an increasing attention in the continuoustime literature. For most continuous-time models, the transition density has no closed-form, which makes estimation of and testing for continuous-time models rather challenging. However, for a general class of affine jump–diffusion (AJD) models (e.g., Duffie et al. (2000)) and time-changed Lévy processes (e.g., Carr and Wu (2003, 2004)), the CCF has a closed-form as an exponential-affine function of state variables up to a system of ordinary differential equations. This fact has been exploited to develop new estimation methods for multifactor continuous-time models in the literature. Specifically, Chacko and Viceira (2003) suggest a spectral GMM estimator based on the average of the differences between the ECF and the model-implied CF. Jiang and Knight (2002) derive the unconditional joint CF of an AJD model and use it to develop some GMM and ECF estimation procedures. Singleton (2001) proposes both time-domain estimators based on the Fourier transform of the CCF, and frequency-domain estimators directly based on the CCF. By extending Carrasco and Florens (2000), Carrasco et al. (2007) propose GMM estimators with a continuum of moment conditions (C-GMM) via the CF. All these estimation methods differ in their uses of the conditional information set. As of yet, no attempt has been made in the literature to use the CF to test continuoustime models, although Carrasco et al. 
(2007) do mention that ‘‘the future work will have to refine our results on estimation of nonMarkovian processes and latent states as well as develop tests in the framework of CF-based continuum of moment conditions’’. In light of the convenient closed-form of the CCF of AJD models, we provide a new test of the adequacy of multivariate continuoustime models. The multivariate continuous-time models can include jumps and the underlying DGPs of observable state variables need not be Markov. Naturally, as a special case, our test can be used to check univariate continuous-time models. Compared with the existing tests for continuous-time models in the literature, our approach has several main advantages. First, because the CCF is the Fourier transform of the transition density, our omnibus test fully exploits the information in the
270
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
joint transition density of state variables rather than only the information in the transition densities of individual state variables. Thus, it can detect misspecifications in the joint transition density even if the transition density of each state variable is correctly specified. When the underlying multivariate continuous-time process is Markov, our omnibus test is consistent against any model misspecification. This is unattainable by some existing tests for multivariate continuous-time models. Moreover, we have used a novel generalized cross-spectral approach, which embeds the CCF in a spectral framework, thus enjoying the appealing features of spectral analysis. For example, it checks many lags. This is particularly useful when the DGP is non-Markov. Second, besides the omnibus test, we propose a class of diagnostic tests by differentiating the generalized cross-spectrum of the state vector. These tests can evaluate how well a continuoustime model captures various specific aspects of the joint dynamics and they are easy to interpret. In particular, these tests can provide valuable information about neglected dynamics in conditional means, conditional variances, and conditional correlations of state variables, respectively. Therefore, they complement Gallant and Tauchen’s (1996) popular EMM-based individual t-tests. All our omnibus test and diagnostic tests are derived from a unified framework. Third, our tests are applicable to a wide variety of continuoustime models and discrete-time multivariate distribution models, since we impose regularity conditions on the CCF of discretely observed samples with some fixed sample frequency, rather than on stochastic differential equations (SDEs). By using the CCF to characterize the adequacy of a model, our tests are most convenient whenever the model has a closed-form CCF and many popular continuous-time models in finance (e.g., the class of multivariate AJD models and the class of time-changed Lévy processes) have a closed-form CCF, although they have no closedform transition density. Of course, our tests can also evaluate the multivariate continuous-time models with no closed-form CCF. In this case, we need to recover the model-implied CCF by using inverse Fourier transforms or simulation methods. Unlike tests based on CFs in the statistical literature, which often have nonstandard asymptotic distributions, our tests have a convenient null asymptotic N (0, 1) distribution. Fourth, we do not require a particular estimation method. √ Any T -consistent parametric estimators can be used. Parameter estimation uncertainty does not affect the asymptotic distribution of our test statistics. One can proceed as if true model parameters were known and equal to parameter estimates. This makes our tests easy to implement, particularly in view of the notorious difficulty of estimating multivariate continuous-time models. The only inputs needed to calculate the test statistics are the discretely observed data and the model-implied CCF. In Section 2, we introduce the framework, state the hypotheses, and characterize the correct specification of a multivariate continuous-time model. In Section 3, we propose a generalized cross-spectral omnibus test, and in Section 4 we derive the asymptotic null distribution of our omnibus test and discuss its asymptotic power property. In Section 5, we develop a class of generalized cross-spectral derivative tests that focuses on various specific aspects of the joint dynamics of a time series model. 
In Section 6, we assess the reliability of the asymptotic theory in finite samples by simulation. Section 7 concludes our work. All mathematical proofs are collected in Appendix. A GAUSS code to implement our tests is available from the authors upon request. Throughout the paper, we will use C to denote a generic bounded constant, ‖·‖ for the Euclidean norm, and A∗ for the complex conjugate of A.
2. Hypotheses of interest For concreteness, we focus on a multivariate continuous-time setup.3 For a given complete probability space (Ω , F , P) and an information filtration (Ft ), we assume that a N × 1 state vector Xt is a continuous-time DGP in some state space D ⊂ RN . We permit but do not require Xt to be Markov, which is often assumed in the continuous-time modeling in finance and macroeconomics.4 In financial modeling, the following class M of continuous-time models is often used to capture the dynamics of Xt 5 : dXt = µ (Xt , θ) dt + σ (Xt , θ) dWt + dJt (θ),
θ ∈ 2,
(2.1)
where Wt is an N × 1 standard Brownian motion in R , 2 is a finite-dimensional parameter space, µ : D × 2 → RN is a drift function (i.e., instantaneous conditional mean), σ : D × 2 → RN ×N is a diffusion function (i.e., instantaneous conditional standard deviation), and Jt is a pure jump process whose jump size follows a probability distribution ν : D × 2 → R+ and whose jump times arrive with intensity λ : D × 2 → R+ .6 We allow some state variables to be unobservable. One example is the stochastic volatility (SV) model, where the volatility, which is a proxy for the information inflow, is a latent process. The setup (2.1) is a general multivariate specification that nests most existing continuous-time models in finance and economics. For example, suppose we restrict the drift µ (·, ·), the instantaneous covariance matrix σ (·, ·) σ (·, ·)′ and the jump intensity λ (·, ·) to be affine functions of the state vector Xt ; namely, N
µ(Xt , θ) = K0 + K1 Xt , [σ(Xt , θ)σ(Xt , θ)′ ]jl = [H0 ]jl + [H1 ]jl Xt , j, l = 1, . . . , N , ′ λ (Xt , θ) = L0 + L1 Xt ,
(2.2)
where K0 ∈ RN , K1 ∈ RN ×N , H0 ∈ RN ×N , H1 ∈ RN ×N ×N , L0 ∈ R, and L1 ∈ RN are unknown parameters. Then we obtain the class of AJD models of Duffie et al. (2000). It is well known that for a continuous-time model characterized by a SDE, the specification of the drift µ(Xt , θ), the diffusion σ(Xt , θ) and the jump process Jt (θ) completely determines the joint transition density of Xt . We use p(x, t |Fs , θ) to denote the model-implied transition density of Xt = x given Fs , where s < t. Suppose Xt has a true transition density, say p0 (x, t |Fs ). Then the continuous-time model is correctly specified for the full dynamics of Xt if there exists some parameter value θ0 ∈ 2 such that
H0 : p(·,t |Fs , θ0 ) = p0 (·, t |Fs ) almost surely (a.s.) and for all t , s, s < t .
(2.3)
3 Our test is applicable to both continuous-time and discrete-time models but we focus on a continuous-time setup due to the following reasons. Our approach is most convenient when the CCF has a closed-form and many popular continuoustime models in finance have no closed-form transition density, but do have a closedform CCF. On the other hand, to our knowledge, no CF-based test is available to check the specification of continuous-time models, although the CF approach has been used in the estimation of continuous-time models. Hence, our test nicely fills the gap in the literature. 4 However, Easley and O’Hara (1992) develop an economic structural model and show that financial time series, such as prices and volumes, are likely non-Markov. 5 Despite the fact that most continuous-time models characterized by SDEs in the literature are Markov, it is still important to allow Xt to be non-Markov. Even the null model is Markov, to allow for non-Markov DGPs under the alternative will ensure the power of the test against a wider range of misspecification, particularly, dynamic misspecification. On the other hand, the model may be Markov but involves some latent variables, as is the case of SV models. As a result, observable state variables themselves will be non-Markov, even under the null. 6 We assume that the functions µ, σ , ν and λ are regular enough to have a unique strong solution to (2.1). See (e.g.) Ait-Sahalia (1996a), Duffie et al. (2000) and GenonCatalot et al. (2000) for more discussions.
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
Alternatively, if for all θ ∈ 2, we have
Given the equivalence between the transition density and the CCF, we can write the hypotheses of interest H0 in (2.3) versus HA in (2.4) as follows:
HA : p(·, t |Fs , θ) ̸= p0 (·, t |Fs ) for some s < t with positive probability measure,
(2.4)
then the continuous-time model is misspecified for the full dynamics of Xt . The transition density characterization can be used to test correct specification of a continuous-time model. When Xt is univariate, Hong and Li (2005) propose a kernel-based test for a continuous-time model by checking whether the PIT Zt (θ0 ) ≡
Xt
∫
p(x, t |It −1 , θ0 )dx ∼ i.i.d.U [0, 1],
(2.5)
−∞
which holds under H0 , where It −1 = {Xt −1 , Xt −21 , . . .} is the information set available at time t − 1, where 1 is a fixed sampling interval for an observed sample. The i.i.d. U [0, 1] property for the PIT has also been used in other contexts (e.g., Diebold et al. (1998)). However, there are some limitations to this approach. For example, for most continuous-time diffusion models (except such simple diffusion models as Vasicek’s (1977) model), the transition densities, have no closed-form. Most importantly, the PIT cannot be applied to the multivariate joint transition density p(x, t |Fs , θ), because when N > 1, Zt (θ0 ) =
X1t
∫
∫
XNt
··· −∞
p(x, t |It −1 , θ 0 )dx
(2.6)
−∞
is no longer i.i.d. U [0, 1] even if H0 holds. Hong and Li (2005) suggest using the PIT for each state variable. This is valid, but it does not make full use of the information contained in the joint distribution of Xt . In particular, it may miss important model misspecification in the joint dynamics of Xt . For example, consider the DGP
κ11 0 θ1 − X1,t dt κ21 κ22 θ2 − X2,t σ11 0 W1,t + d , 0 σ22 W2,t where W1,t , W2,t are independent standard Brownian motions and κ21 ̸= 0. Suppose we fit the data using the model X1,t κ11 0 θ1 − X1,t d = dt X2,t 0 κ22 θ2 − X2,t σ11 0 W1,t + d . 0 σ22 W2,t
d
X1,t X2,t
=
Then this model is misspecified because it ignores correlations in drift. Now, following Hong and Li (2005), we calculate the gener alized residuals Z1,t , Z2,t , Z1,t −1 , Z2,t −1 , . . . , where Z1,t and Z2,t are the PITs of X1,t and X2,t with respect to the conditional density models p(X1,t , t |Xt −1 , X2,t , θ) and p(X2,t , t |Xt −1 , θ) respectively, and θ = (κ11 , κ22 , θ1 , θ2 , σ11 , σ22 )′ . Then Hong and Li’s (2005) test will have no power because each of these PITs is an i.i.d. U [0, 1] sequence respectively. As the Fourier transform of the transition density, the CCF can capture the full dynamics of Xt . Let ϕ(u, t , |Fs , θ) be the modelimplied CCF of Xt , conditional on Fs at time s < t; that is,
ϕ(u, t |Fs , θ) ≡ Eθ exp iu′ Xt |Fs ∫ √ = exp iu′ x p(x, t |Fs , θ)dx, u ∈ RN , i = −1,
271
(2.7)
RN
where Eθ (·|Fs ) denotes the conditional expectation under the model-implied transition density p(x, t |Fs , θ). Note that for a Markov model, the filtration Fs can be replaced by Xs .
H0 : E exp iu′ Xt |Fs = ϕ(u, t |Fs , θ0 ) a.s. for all u ∈ RN
and for some θ0 ∈ 2
(2.8)
versus
HA : E exp iu′ Xt |Fs ̸= ϕ(u, t |Fs , θ)
with positive probability for all θ ∈ 2.
(2.9) Xt tT=11
Suppose we have a discrete random sample { } of size T . For notational simplicity, we set 1 = 1 below, where time is measured in units of the sampling interval of data.7 Also, we denote It −1 = {Xt −1 , Xt −2 , . . .}, the information set available at time t − 1. Define the process Zt (u, θ) ≡ exp iu′ Xt − ϕ(u, t |It −1 , θ),
u ∈ RN .
(2.10)
Then H0 is equivalent to the following martingale difference sequence (MDS) characterization: E [Zt (u, θ0 )|It −1 ] = 0
for all u ∈ RN
and some θ0 ∈ 2, a.s.
(2.11)
We may call Zt (u, θ) a ‘‘CCF-based generalized residual’’ of the continuous-time model M . This can be seen obviously from the auxiliary regression exp iu′ Xt = ϕ(u, t |It −1 , θ0 ) + Zt (u, θ0 ),
(2.12)
where ϕ(u, t |It −1 , θ0 ) is the regression model for the dependent variable exp iu′ Xt conditional on information set It −1 , and Zt (u, θ0 ) is a MDS regression disturbance (under H0 ). In principle, we can always obtain the CCF by the inverse Fourier transform, provided the transition density of Xt is given. Even if the CCF or the transition density has no closed-form, we can accurately approximate the model transition density by using (e.g.) the Hermite expansion method of Ait-Sahalia (2002), the simulation methods of Brandt and Santa-Clara (2002) and Pedersen (1995), or the closed-form approximation method of Duffie et al. (2003), and then calculating the Fourier transform. Nevertheless, our test is most convenient when the CCF has a closed-form. AJD models are a class of continuous-time models with a closedform CCF, developed and popularized by Dai and Singleton (2000), Duffie and Kan (1996), and Duffie et al. (2000). These models have proven fruitful in capturing the dynamics of economic variables, such as interest rates, exchange rates and stock prices. It has been shown (e.g., Duffie et al. (2000)) that for AJDs, the CCF of Xt conditional on It −1 is a closed-form exponential-affine function of Xt −1 :
ϕ(u, t |It −1 , θ) = exp αt −1 (u) + βt −1 (u)′ Xt −1 ,
(2.13)
where αt −1 : R → R and βt −1 : R → R satisfy the complexvalued Riccati equations: N
N
N
1 β˙ t = K′1 βt + β′t H1 βt + L1 g βt − 1 , 2 1 ′ α˙ t = K′0 βt + βt H0 βt + L0 g βt − 1 ,
(2.14)
2
with boundary conditions βT (u) = iu and aT (u) = 0.8 7 Our approach can be adapted to the case of 1 ̸= 1 easily. See Singleton (2001, p. 117) for related discussion. 8 Assuming that the spot rate is an affine function of the state vector X and that t Xt follows an affine diffusion, Duffie and Kan (1996) show that the yield of the zero coupon bond has a closed-form CCF. Assuming that the spot rate is a quadratic function of the normally distributed state vector, Ahn et al. (2002) also derive the closed-form CCF for the yield of the zero coupon bond.
272
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
Affine SV models are another popular class of multivariate continuous-time models with a closed-form CCF (e.g., Heston (1993), Bates (1996, 2000), Bakshi et al. (1997), Das and Sundaram (1999)). SV models can capture salient properties of volatility such as randomness and persistence. Basic affine SV models, affine SV models with Poisson jumps and affine SV models with Lévy processes have been widely used in modeling asset return dynamics as they allow for closed-form solutions for European option prices (e.g., Heston (1993), Das and Sundaram (1999), Chacko and Viceira (2003), Carr and Wu (2003, 2004), and Huang and Wu (2003)). To test SV models, where Vt is a latent process, we need to modify the characterization (2.11) to make it operatable. Generally, we put Xt = (X′1,t , X′2,t )′ , where X1,t ⊂ RN1 denotes the observable state variables, X2,t ⊂ RN2 denotes the unobservable state variables, and N1 + N2 = N. Also, partition u conformably as u = ′ ′ ′ u1 , u2 . Then we can define Z1,t (u1 , θ) = exp iu′1 X1,t − φ(u1 , t |I1,t −1 , θ),
j
j J X2,t −1 j=1 . By the Bayes rule,
step forward to get the new particles { ˆ we have p x2,t −1 |I1,t −1 , θ
p x1,t −1 |I1,t −2 , θ
,
where p x2,t −1 |I1,t −2 , θ
∫
p x2,t −1 |x2,t −2 , I1,t −2 , θ p x2,t −2 |I1,t −2 , θ dx2,t −2 .
=
We can approximate p x2,t −1 |I1,t −1 , θ up to some proportionality; namely,
ˆ 2,t −1 , Iˆ1,t −2 , θ pˆ x2,t −1 |Iˆ1,t −1 , θ ∝ pˆ x1,t −1 |X
×
J −
πtj−1 pˆ x2,t −1 |Xˆ 2,t −2 , Iˆ1,t −2 , θ ,
j =1
φ(u1 , t |I1,t −1 , θ) ≡ Eθ [exp iu′1 X1,t |I1,t −1 ] = Eθ {ϕ[(u1 , 0 ) , t |It −1 , θ]|I1,t −1 }, I1,t −1 = X1,t −1 , X1,t −2 , . . . is the information set on the observables that is available at time t − 1 and the second equality follows ′
′ ′
from the law of iterated expectations. Then we have E Z1,t (u1 , θ0 ) |I1,t −1 = 0,
for all u1 ∈ RN1
and some θ0 ∈ 2, a.s.
(2.15)
This provides a basis for constructing operational tests for multivariate continuous-time models with partially observable state variables.9 Although the model-implied CCF ϕ (u, t |It −1 , θ), where It −1 = (I1,t −1 , I2,t −1 ), may have a closed-form, the conditional expectation φ(u1 , t |I1,t −1 , θ) generally has no closed-form. However, one can approximate it accurately by using some simulation techniques. For almost all continuous-time models characterized by SDEs in the literature, the CCF ϕ u, t |It −1, θ = ϕ u, t |Xt −1, θ is a Markov process. In this case, Eθ {ϕ[(u′1 , 0′ )′ , t |It −1 , θ]|I1,t −1 }
∫
=
}
p x1,t −1 |x2,t −1 , I1,t −2 , θ p x2,t −1 |I1,t −2 , θ
where
=
J
ˆ 2,t −2 }j=1 one The key of this method is to propagate particles {X
ϕ[(u′1 , 0′ )′ , t |X1,t −1 , x2,t −1 , θ]p(x2,t −1 |I1,t −1 , θ)dx2,t −1 ,
where p(x2,t −1 |I1,t −1 , θ) is the model-implied transition density of the unobservable X2,t −1 given the observable information I1,t −1 . Noting that models with latent variables are the leading examples that may yield non-Markov observables in the continuous-time literature, we discuss several popular methods to estimate the model-implied CCF based on observables. We first consider particle filters, which have been developed by Gordon et al. (1993), Pitt and Shephard (1999) and Johannes et al. (2009). The term ‘‘particle’’ was first used by Kitagawa (1996) in this literature to denote the simulated discrete data with random support. Particle filters are the class of simulation filters that recursively approximate the filtering random variable x2,t −1 |I1,t −1 , θ by
ˆ 12,t −1 , Xˆ 22,t −1 , . . . , Xˆ J2,t −1 with discrete probability ‘‘particles’’ X J 1 mass of πt −1 , πt2−1 , . . . , πt −1 . Hence a continuous variable is approximately a discrete one with random support. These discrete points are viewed as samples from p x2,t −1 |I1,t −1 , θ and as J → ∞, the particles can approximate the conditional density increasingly well. 9 This characterization has been used in Singleton (2001, Equation 50) to estimate affine asset pricing models with unobservable components.
∑J
ˆ 2,t −1 , Iˆ1,t −2 , θ) and where pˆ (x1,t −1 |X
j=1
πtj−1 pˆ (x2,t −1 |Xˆ 2,t −2 ,
Iˆ1,t −2 , θ) can be viewed as the likelihood and prior respectively. As pointed out by Gordon et al. (1993), the particle filters require ˆ 2t −1 can be that the likelihood function can be evaluated and that X ˆ ˆ sampled from pˆ (x2,t −1 |X2,t −2 , I1,t −2 , θ). These can be achieved by using time-discretized solutions to the SDEs.10 To implement particle filters, we can use Johannes et al. (2009) and Pitt and Shephard (1999) algorithm. First we generate a simj J ˆM ˆM ˆ ˆ ulated sample {(X 2,t −2 ) }j=1 , where X2,t −2 = {X2,t −2 , X2,t −2+ 1 , M
. . . , Xˆ 2,t −2+ M −1 } and M is an integer. Then we simulate them one M
step forward, evaluate the likelihood function, and set
πtj−1 =
j ˆ ˆM pˆ [x1,t −1 |(X 2,t −1 ) , I1,t −2 , θ] J ∑ j =1
,
j = 1, . . . , J .
j ˆ ˆM pˆ [x1,t −1 |(X 2,t −1 ) , I1,t −2 , θ] j
J
Finally, we resample J particles with weights {πt −1 }j=1 to obtain a 11
new random sample of size J. It has been shown (e.g., Bally and Talay (1996), Del Moral et al. (2001)) that with both large J and M, this algorithm sequentially generates valid simulated samples from p(x2,t −1 |I1,t −1 , θ). Hence φ(u1 , t |I1,t −1 , θ) can be estimated by Monte Carlo averages.12 The second method to approximate p(x2,t −1 , t − 1|I1,t −1 , θ) is Gallant and Tauchen’s (1998) SNP-based reprojection technique, which can characterize the dynamic response of a partially observed nonlinear system to its past observable history. First, ˆ 1,t −1 }Jt =2 and {Xˆ 2,t −1 }Jt =2 we can generate simulated samples {X from the continuous-time model, where J is a large integer. ˆ 2,t −1 }Jt =2 onto a Hermite Then, we project the simulated data {X series representation of the transition density p(x2,t −1 , t −
ˆ 1,t −1, Xˆ 1,t −2 , . . . Xˆ 1,t −L ), where L denotes a truncation lag order. 1|X 10 There are many discretization methods in practice, such as the Euler scheme, the Milstein scheme, and the explicit strong scheme. See (e.g.) Kloeden et al. (1994) for more discussion. 11 This is called sampling/importance resampling (SIR) in the literature. Alternative methods include rejection sampling and Markov chain Monte Carlo (MCMC) algorithm. See Doucet et al. (2001) and Pitt and Shephard (1999) for more discussion. 12 In a related estimation context, Chacko and Viceira (2003), Jiang and Knight (2002) and Singleton (2001) derive analytical expressions for Eθ {ϕ[(u′1 , 0′ )′ , t |It −1 , θ]|I˜1,t −1 } for some suitable subset I˜1,t −1 of I1,t −1 . For example, Chacko and Viceira (2003) obtain a closed-form expression for E [ϕlog S (u, t |It −1 , θ)| log St −1 ], by integrating out Vt −1 . This is computationally more convenient, but it is less efficient.
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
With a suitable choice of L via some information criteria such as AIC or BIC, we can approximate p(x2,t −1 , t − 1|Iˆ1,t −1 , θ) arbitrarily well. The final step is to evaluate the estimated density function at the observed data in the conditional information set. See Gallant and Tauchen (1998) for more discussion. For models whose CCF is exponentially affine in Xt −1 ,13 we can ˆ u1 , t |I1,t −1 , θ). also adopt Bates’ (2007) approach to compute φ( First, at time t = 1, we initialize the CCF of the latent vector X2,t −1 conditional on I1,t −1 at its unconditional CF. Then, by exploiting the Markov property and the affine structure of the CCF, we can evaluate the model-implied CCF conditional on data observed through period t, namely, Eθ [ϕ(u, t |Xt −1 , θ)|I1,t −1 ], and thus an estimator for φ(u1 , t |I1,t −1 , θ) is obtained.
We now propose an omnibus specification test for the adequacy of a multivariate continuous-time model by using the CCF characterization in (2.11) or (2.15). For notational convenience, below we focus on the case where Xt is fully observable. The proposed procedures are readily applicable to the case where only X1,t is observable, with Z1,t (u, θ) and I1,t −1 replacing Zt (u, θ) and It −1 respectively. It is not a trivial task to check (2.11) because the MDS property must hold for each u ∈ RN , and because {Zt (u, θ)} may display serial dependence in higher order conditional moments. Any test for (2.11) should be robust to time-varying conditional heteroskedasticity and higher order moments of unknown form in {Zt (u, θ)}. To check the MDS property of Zt (u, θ), we substantively extend Hong’s (1999) univariate generalized spectrum to a multivariate generalized cross-spectrum.14 The generalized spectrum is an analytic tool for nonlinear time series that embeds the CF in a spectral framework. It can capture nonlinear dynamics while maintaining the nice features of spectral analysis. Because the dimension of It −1 can be infinity, we encounter the so-called ‘‘curse of dimensionality’’ problem in checking the MDS property in (2.11). Fortunately, the generalized spectral approach provides a solution to tackle this difficulty. It checks many lags in a pairwise manner, thus avoiding the ‘‘curse of dimensionality’’. Define the generalized covariance function
Γj (u, v) = cov[Zt (u, θ), exp iv′ Xt −|j| ],
u, v ∈ RN .
(3.1)
With the generalized cross-covariance Γj (u, v), we can define the generalized cross-spectrum F (ω, u, v) =
1
a ‘‘flat’’ spectrum: F (ω, u, v) = F0 (ω, u, v) ≡
∞ −
2π j=−∞
Γj (u, v) exp (−ijω) ,
ω ∈ [−π , π], u, v ∈ RN ,
(3.2)
where ω is the frequency. This is the Fourier transform of the generalized covariance function Γj (u, v) and thus contains the same information as Γj (u, v) . An advantage of generalized crossspectral analysis is that it can capture cyclical patterns caused by both linear and nonlinear cross dependence. Examples include volatility spillover, the comovements of tail distribution clustering between state variables, and asymmetric spillover of business cycles cross different sectors or countries. Another attractive feature of F (ω, u, v) is that it does not require the existence of any moment condition on Xt . Under H0 , we have Γj (u, v) = 0 for all u, v ∈ RN and all j ̸= 0. Consequently, the generalized cross-spectrum F (ω, u, v) becomes
13 Examples include AJD models (Duffie et al., 2000) and time-changed Lévy processes (Carr and Wu, 2003, 2004). 14 This is not a trivial extension since Hong’s (1999) test is under an i.i.d. assumption.
1 2π
Γ0 (u, v),
ω ∈ [−π , π], u, v ∈ RN .
(3.3)
We can test H0 by checking whether a consistent estimator for F (ω, u, v) is flat with respect to frequency ω. Any significant deviation from a flat generalized cross-spectrum is evidence of model misspecification. Suppose we have a discretely observed sample {Xt }Tt=1 of size T . Then we estimate the generalized covariance Γj (u, v) by its sample analogue
Γˆ j (u, v) =
3. Omnibus testing
273
1
T −
T − |j| t =|j|+1
ˆ exp iv′ Xt −|j| − ϕˆ j (v) , Zˆt (u, θ)
u, v ∈ RN ,
(3.4)
where Ď
ˆ ˆ ≡ exp iu′ Xt − ϕ(u, t |I , θ), Zˆt (u, θ) t −1
Ď
It is the observed information set available at time t that may
√
ˆ involve certain initial values, estimator for ∑ θ is a T-consistent θ0 and ϕˆ j (v) = (T − |j|)−1 Tt=|j|+1 exp iv′ Xt −|j| is the empirical unconditional CF of Xt . We note that the information set It −1 = {Xt −1 , Xt −2 , . . .} dates back to past infinity and is not feasible. Thus, when Xt is non-Markov, we may need to assume some ˆ . Therefore, we have to initial values in computing ϕ(u, t |It −1 , θ) Ď replace It −1 with a truncated information set It −1 , which contains some initial values. We provide a condition (see Assumption A.5 in Section 4) to ensure that the use of initial values has no impact on the asymptotic distribution of the proposed test statistics. A consistent estimator for F0 (ω, u, v) is Fˆ0 (ω, u, v) =
1 2π
Γˆ 0 (u, v),
ω ∈ [−π , π], u, v ∈ RN .
(3.5)
Consistent estimation for F (ω, u, v) is more challenging. We use a smoothed kernel estimator Fˆ (ω, u, v) =
1
T −1 −
2π t =1−T
(1 − |j| /T )1/2 k(j/p)Γˆ j (u, v)e−ijω ,
ω ∈ [−π , π], u, v ∈ RN ,
(3.6)
where p ≡ p(T ) → ∞ is a bandwidth or an effective lag order, and k : R → [−1, 1] is a kernel function, assigning weights to various lags. Examples of k(·) include the Bartlett kernel, the Parzen kernel and the Quadratic-Spectral (QS) kernel. In (3.6), the factor (1 − |j| /T )1/2 is a finite sample correction. It could be replaced by unity. Under regularity conditions, Fˆ (ω, u, v) and Fˆ0 (ω, u, v) are consistent for F (ω, u, v) and F0 (ω, u, v) respectively. These estimators converge to the same limit under H0 but they generally converge to different limits under HA , giving the power of the test. We can measure the distance between Fˆ (ω, u, v) and Fˆ0 (ω, u, v) by the quadratic form
πT
∫∫∫
2
π −π
2 ˆ F (ω, u, v) − Fˆ0 (ω, u, v) dωdW (u) dW (v)
T −1
=
− j =1
k2 (j/p)(T − j)
∫∫ 2 ˆ Γj (u, v) dW (u) dW (v) ,
(3.7)
where the equality follows by Parseval’s identity, W : RN → R+ is a positive nondecreasing right-continuous function, and the unspecified integrals are all taken over the support of W (·). An example of W (·) is the N (0, IN ) CDF, where IN is a N × N identity
274
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
matrix. The function W (·) can also be a step function, analogous to the CDF of a discrete random vector. Our omnibus test statistic for H0 against HA is a standardized version of (3.7): Qˆ (0, 0) =
T −1 −
k2 (j/p)(T − j)
j=1
×
] ∫∫ 2 ˆ ˆ (0, 0), (3.8) D Γj (u, v) dW (u) dW (v) − Cˆ (0, 0)
T −1 −
k2 (j/p)(T − j)−1
∫∫ T 2 − ˆ ˆ Zt (u, θ)
t =j+1
j=1
2 × ψˆ t −j (v) dW (u) dW (v) , ˆ (0, 0) = 2 D
T −2 − T −2 −
k2 (j/p)k2 (l/p)
j=1 l=1
∫∫∫∫ ×
dW (u1 ) dW (u2 ) dW (v1 ) dW (v2 )
1 × T − max(j, l)
T − t =max(j,l)+1
2
ˆ Zˆt (u2 , θ) ˆ ψˆ t −j (v1 )ψˆ t −l (v2 ) , Zˆt (u1 , θ)
iv′ Xt
∑ ′ − ϕ( ˆ v), and ϕ( ˆ v) = T −1 Tt=1 eiv Xt . The ˆ (0, 0) are the approximately mean and factors Cˆ (0, 0) and D ˆ t (v) = e where ψ
Assumption A.6. k : R → [−1, 1] is a symmetric function that is continuous at zero and all points in R except for a finite number of points, with k (0) = 1 and k (z ) ≤ c |z |−b for some b > 12 as z → ∞. Assumption A.7. W : RN → R+ is a nondecreasing right-conti nuous function with RN dW (u) < ∞ and RN ‖u‖4 dW (u) < ∞. Furthermore, W = w dv for some nonnegative weighting function w and some measure v , where W is absolutely continuous with respect to v and w is symmetric about the origin.
where the centering and scaling factors Cˆ (0, 0) =
Ď
Assumption A.5. Let It be the observed feasible information set available at time t that may involve certain initial values. Then ∑T Ď supu∈RN limT →∞ t =1 E supθ∈2 |ϕ(u, t |It −1 , θ) − ϕ(u, t |It −1 , 2 θ)| ≤ C .
variance of the quadratic form in (3.7). They have taken into account the impact of higher order dependence in the generalized residual {Zt (u, θ0 )}. As a result, the Qˆ (0, 0) test is robust to conditional heteroskedasticity and time-varying higher order conditional moments of unknown form in {Zt (u, θ0 )}. In practice, when W (·) is continuous, Qˆ (0, 0) can be calculated by numerical integration or simulation. This may be computationally costly when the dimension N of Xt is large. Alternatively, one can use a finite number of grid points for u and v, which is equivalent to using a discrete CDF. For example, we can generate finitely many numbers of u and v from the N (0, IN ) distribution. This will reduce the computational cost, but at a cost of some power loss. 4. Asymptotic theory To derive the null asymptotic distribution of the test statistic Qˆ (0, 0) and investigate its asymptotic power property, we impose following regularity conditions. Assumption A.1. A discrete-time sample {Xt }Tt =11 , where 1 ≡ 1 is the sampling interval, is observed at equally spaced discrete times. Assumption A.2. Let ϕ (u, t |It −1 , θ) be the CCF of Xt given It −1 ≡ {Xt −1 , Xt −2 , . . .} for a time series model for Xt . (i) For each θ ∈ 2, each u ∈ RN , and each t, ϕ (u, t |It −1 , θ) is measurable with respect to It −1 ; (ii) for each θ ∈ 2, each u ∈ RN , and each t, ϕ (u, t |It −1 , θ) is twice continuously differentiable with respect ∑T to θ ∈ 2 with probability one; (iii) supu∈RN limT →∞ T −1 t =1 ∂ ϕ(u, t |It −1 , θ)‖2 ≤ C and supu∈RN limT →∞ T −1 E supθ∈2 ‖ ∂θ
∑T
t =1
∂ E supθ∈2 ‖ ∂θ∂θ ′ ϕ (u, t |It −1 , θ) ‖ ≤ C . 2
√
Assumption A.3. θˆ is a parameter estimator such that T (θˆ − θ∗ ) = OP (1), where θ∗ ≡ p limT →∞ θˆ and θ∗ = θ0 under H0 . ∂ Assumption A.4. {Xt , ϕ (u, t |It −1 , θ0 ) , ∂θ ϕ (u, t |It −1 , θ 0 )} is a strictly stationary β -mixing process with the mixing coefficient |β (l)| ≤ Cl−ν for some constant ν > 2.
Assumption A.1 imposes some regularity conditions on the discretely observed random sample. Both univariate and multivariate continuous-time or discrete-time processes are covered, and we allow but do not require Xt to be Markov. This distinguishes our test from all other tests in the literature. We emphasize that it is important to allow the DGP to be non-Markov, even if one is interested in testing the adequacy of a Markov model. This is because the misspecification of a Markov model may come from not only the improper specification in functional forms, but also the violation of the Markov assumption. The non-Markov assumption for the DGPs of observable state variables is also necessary when some state variables are latent, as is the case of SV models. There are two kinds of asymptotics in the literature on continuous-time models. The first is to let the sampling interval 1 → 0. This implies that the number of observations per unit of time tends to infinity. The second is to let the time horizon T → ∞. As argued by Ait-Sahalia (1996b), the first approach hardly matches the way in which new data are added to the sample. Even if such ultra-high-frequency data are available, market micro-structural problems are likely to complicate the analysis considerably. Hence, like Ait-Sahalia (1996b) and Singleton (2001), we fix the sampling interval 1 and derive the asymptotic properties of our test for an expanding sampling period (i.e., T → ∞). Unlike Ait-Sahalia (1996a,b), however, we do not impose additional conditions on the SDEs for Xt . Instead, we impose conditions directly on the model-implied CCF, which is equivalent to imposing conditions on the model-implied transition density. For these reasons, our approach is more general and the proposed tests are applicable to both continuous-time and discrete-time models. Assumption A.2 imposes regularity conditions on the CCF of the multivariate time series model. As the CCF is the Fourier transform of the transition density, we can ensure Assumption A.2 by imposing the following conditions on the model-implied transition density p (x, t |It −1 , θ): (i) for each t, each x ∈ D, and each θ ∈ 2, p (x, t |It −1 , θ) is measurable with respect to It −1 ; (ii) for each t, each x ∈ D, and each given It −1 , p (x, t |It −1 , θ) is twice continuously differentiable with respect to θ ∈ 2 with probabil∑T ∂ ln p(x, t | ity one; and (iii) supx∈D limT →∞ T −1 t =1 E supθ∈2 ‖ ∂θ
It −1 , θ)‖2 ≤ C , and supx∈D limT →∞ T −1 (x, t |It −1 , θ) ‖ ≤ C .
√
∑T
t =1
∂ E supθ∈2 ‖ ∂θ∂θ ′ ln p 2
Assumption A.3 requires a T -consistent estimator θˆ under H0 . Examples include asymptotically optimal and suboptimal estimators, such as Gallant and Tauchen’s (1996) EMM, Singleton’s (2001) ML-CCF and GMM-CCF, Ait-Sahalia’s (2002) and Ait-Sahalia and Kimmel’s (2010) approximated MLE, Carrasco et al.’s (2007) C-GMM, and Chib et al.’s (2010) MCMC method. We do not require any asymptotically most efficient estimator or a specified estimator. This is attractive for practitioners given the notorious
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
difficulty of efficient estimation of many multivariate nonlinear time series models. Assumption A.4 is a regularity condition on the temporal de∂ pendence of the process {Xt , ϕ (u, t |It −1 , θ0 ) , ∂θ ϕ(u, t |It −1 , θ0 )}. The β -mixing assumption is a standard condition for discrete time series analysis. In the continuous-time context, almost all continuous-time models in the literature are Markov. For Markov ∂ models, ϕ (u, t |It −1 , θ0 ) and ∂θ ϕ (u, t |It −1 , θ0 ) are functions of Xt −1 . Thus, Assumption A.4 holds if the discretely observed random sample {Xt }Tt=1 is a strictly stationary β -mixing process with |β (l)| ≤ Cl−ν for some constant ν > 2.15 Assumption A.5 is a start-up value condition, which ensures Ď that the impact of initial values (if any) assumed in It −1 is asymptotically negligible. This condition holds automatically for Markov models and easily for many non-Markov models. For simplicity, we illustrate such an initial value condition by a √ discrete-time GARCH model: Xt = εt ht , where ht = ω +α ht −1 + β Xt2−1 and {εt } is i.i.d.N (0, 1). Here we have ϕ (u, t |It −1 , θ) = exp(− 21 u2 ht ), where It −1 = {Xt −1 , Xt −2 , . . .}. Furthermore, we Ď
have It −1 = {Xt −1 , Xt −2 , . . . , X1 , h¯ 0 }, where h¯ 0 is an initial value assumed for h0 . By recursive substitution, we obtain
Ď
var (Xt |It −1 , θ) − var Xt |It −1 , θ
=ω+β
t −2 −
α j Xt2−1−j
j =0
+ βα t −1 h0 − ω − β
t −2 −
α j Xt2−1−j − βα t −1 h¯ 0 .
j =0
It follows that sup
T −
u∈R t =1
Ď
2
E sup ϕ (u, t |It −1 , θ) − ϕ u, t |It −1 , θ
≤ sup
θ∈Θ
T −
u∈R t =1
E sup θ∈2
2 1 − exp 1 u2 var (X |I , θ) − var X |IĎ , θ t t − 1 t t −1 2 1 × 2 exp 2 u ht T − u2 βα t −1 h0 − h¯ 0 ≤C ≤ sup E sup exp u2 ω θ∈2 u∈RN t =1 provided ω > 0, 0 < α, β < 1, α + β < 1 and E h¯ 0 < ∞. Assumption A.6 is the regularity condition on the kernel function. The continuity of k (·) at 0 and k (0) = 1 ensure that the bias of the generalized cross-spectral estimator Fˆ (ω, u, v) vanishes to zero asymptotically as T → ∞. The condition on the tail behavior of k (·) ensures that higher order lags have asymptotically negligible impact on the statistical properties of Fˆ (ω, u, v). Assumption A.6 covers most commonly used kernels. For kernels with bounded support, such as the Bartlett and Parzen kernels, b = ∞. For kernels with unbounded support, b is some finite positive real number. For example, b = 2 for the QuadraticSpectral kernel. Assumption A.7 imposes mild conditions on the function W (·). The CDF of any symmetric distribution with finite fourth moment satisfies Assumption A.7. Note that W (·) can be the CDF of continuous or discrete distribution and w(·) is its Radon–Nikodym 15 Suggested by Hansen and Scheinkman (1995) and Ait-Sahalia (1996a), one set of sufficient conditions for the β -mixing when N = 1 is: (i) limx→l or x→u σ (x, θ) π (x, θ) = 0; and (ii) limx→l or x→u |σ (x, θ) /{2µ (x, θ)−σ (x, θ) [∂σ (x, θ) / ∂ x]} < ∞, where l and u are left and right boundaries of Xt with possibly l = −∞ and/or u = +∞, and π (x, θ) is the model-implied marginal density.
275
derivative with respect to v . If W (·) is the CDF of some continuous distribution, v is Lebesgue measure; if W (·) is the CDF of some discrete distribution, v is some counting measure. This provides a convenient way to implement our tests, because we can avoid high-dimensional numerical integrations by using a finite number of grid points for u and v. This is equivalent to using the CDF of a discrete random vector. We now state the asymptotic distribution of the omnibus test Qˆ (0, 0) under H0 . Theorem 1. Suppose Assumptions A.1–A.7 hold, and p = cT λ for 1+δ < λ < (3 + 4b1−2 )−1 , where 0 < c , δ < ∞ and ν is defined νδ d
in Assumption A.4. Then Qˆ (0, 0) → N (0, 1) under H0 as T → ∞. An important feature of Qˆ (0, 0) is that the use of the estimated ˆ in place of the true unobservable generalized residuals {Zˆt (u, θ)} generalized residuals {Zt (u, θ0 )} has no impact on the limiting distribution of Qˆ (0, 0). One can proceed as if the true parameter value θ0 were known and equal to θˆ . Intuitively, the parametric estimator θˆ converges to θ0 faster than the nonparametric estimator Fˆ (ω, u, v) converges to F (ω, u, v) as T → ∞. Consequently, the limiting distribution of Qˆ (0, 0) is solely determined by Fˆ (ω, u, v), and replacing θ0 by θˆ has no impact √ asymptotically. This delivers a convenient procedure, because any T -consistent estimator can be used. We allow for weakly dependent data and data dependence has some impact on the feasible range of the bandwidth p. The condition on the tail behavior of the kernel function k(·) also has some impact. For kernels with bounded support (e.g., the Bartlett and 6 Parzen), λ < 13 because b = ∞. For the QS kernel (b = 2), λ < 19 . These conditions are mild. Next, we investigate the asymptotic power of Qˆ (0, 0) under HA . Theorem 2. Suppose Assumptions A.1–A.7 hold, and p = cT λ for 0 < λ < 12 and 0 < c < ∞. Then as T → ∞, 1
∞ ∫∫ − 1 Γj (u, v)2 dW (u) dW (v) Qˆ (0, 0) →p √ T D (0, 0) j=1
p2
∫∫∫ π π |F (ω, u, v) − F0 (ω, u, v)|2 = √ 2 D (0, 0) −π × dωdW (u) dW (v) , where D (0, 0) = 2
∞
∫
k (z ) dz 4
∫∫
|Σ0 (u1 , u2 )|2 dW (u1 ) dW (u2 )
0
∫∫ − ∞ Ωj (v1 , v2 )2 dW (v1 ) dW (v2 ) , × j=−∞
and Σ0 (u, v) cov(e
iu′ Xt
,e
= cov Zt u, θ∗ , Zt v, θ∗ , and Ωj (u, v) = t −|j| ).
iv′ X
The function G (ω, u, v) is the generalized spectral density of the state vector {Xt }. It captures temporal dependence in {Xt }. The dependence of the constant D (0, 0) on G (ω, u, v) is due to the fact that the conditioning variable exp iv′ Xt −|j| is a time series process. Following Bierens (1982) and Stinchcombe and White (1998), N we have that for j > 0, Γ j (u, v) = 0 for all u, vN ∈ R if ∗ and only if E Zt (u, θ )|Xt −j = 0 a.s. for all u ∈ R . Suppose E Zt (u, θ∗ )|Xt −j ̸= 0 at some lag j > 0 under HA . Then we have
2 Γj (u, v) dW (u) dW (v) ≥ C > 0 for any function W (·)
that is positive, monotonically increasing and continuous, with unbounded support on RN . As a result, P [Qˆ (0, 0) > C (T )] → 1 for
276
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
any sequence of constants {C (T ) = o(T /p1/2 )}. Thus Qˆ (0, 0) has asymptotic unit power at any given significance level α ∈ (0, 1), whenever E Zt (u, θ∗ )|Xt −j is nonzero at some lag j > 0 under HA . Note Markov process Xt , we always have a multivariate that for E Zt u, θ∗ |Xt −j ̸= 0 at least for some j > 0 under HA . Hence, Qˆ (0, 0) is consistent against HA when Xt is Markov. Gallant and Tauchen’s (1996) EMM test does not have this property. process Xt , the hypothesis that E For a non-Markovian Zt (u, θ0 )|Xt −j = 0 a.s. for all u ∈ RN and some θ0 ∈ 2 and all j > 0 is not equivalent to the hypothesis that E [Zt (u, θ0 )|It −1 ] = 0 a.s. for all u ∈ RN and some θ0 ∈ 2. The latter implies the former but not vice versa. This is the price we have to pay for dealing with the difficulty of the ‘‘curse of dimensionality’’. Nevertheless, our test is expected to have power against a wide range of nonMarkovian processes, since we check many lag orders. The use of a large number of lags might cause the loss of power, due to the loss of a large number of degrees of freedom. Fortunately, such power loss is substantially alleviated for Qˆ (0, 0), thanks to the downward weighting by k2 (·) for higher order lags. Generally speaking, the state vector Xt is more affected by the recent events than the remote past events. In such scenarios, equal weighting to each lag is not expected to be powerful. Instead, downward weighting is expected to enhance better power because it discounts past information. Thus, we expect that the power of our test is not so sensitive to the choice of the lag order. This is confirmed by our simulation studies below. 5. Diagnostic testing When a multivariate time series model M is rejected by the omnibus test Qˆ (0, 0), it would be interesting to explore possible sources of the rejection. For example, one may like to know whether the misspecification comes from conditional mean/drift dynamics, conditional variance/diffusion dynamics, or conditional correlations between state variables. Such information, if any, will be valuable in reconstructing the model. The CCF is a convenient and useful tool to gauge possible sources of model misspecification. As is well known, the CCF can be differentiated to obtain conditional moments. We now develop a class of diagnostic tests by differentiating the generalized crossspectrum F (ω, u, v). This class of diagnostic tests can provide useful information about how well a multivariate time series model can capture the dynamics of various conditional moments and conditional cross-moments of state variables. Define an N × 1 index vector m = (m1 , m2 , . . . , mN )′ ,
∑N
=
c =1
(m,0)
(0, v) = cov
If m = (0, 1), then
Γj
(m,0)
(0, v) = icov X2t − Eθ (X2t |It −1 ), exp iv′ Xt −|j| .
Thus, the choice of |m| = 1 can be used to check misspecifications in the conditional mean dynamics of X1t and X2t respectively. • Case 2: |m| = 2. We have m = (2, 0), (0, 2) or (1, 1). If m = (2, 0),
Γj
(m,0)
(0, v) = i2 cov X1t2 − Eθ (X1t2 |It −1 ), exp iv′ Xt −|j| .
If m = (0, 2),
Γj
(m,0)
(0, v) = i2 cov X2t2 − Eθ (X2t2 |It −1 ), exp iv′ Xt −|j| .
Finally, if m = (1, 1),
Γj
(m,0)
(0, v) = i2 cov × X1t X2t − Eθ (X1t X2t |It −1 ) , exp iv′ Xt −|j| .
We now define the class of diagnostic test statistics as follows: Qˆ (m, 0) =
T −1 −
∞ −
2π j=−∞
Γj
(m,0)
(0, v) exp (−ijω) ,
N ∏
∫ 2 ˆ (m,0) ˆ ˆ (m, 0), × Γj (0, v) dW (v) − C (m, 0) D
c =1
− Eθ
′ mc (iXct ) It −1 , exp iv Xt −|j| . c =1
N ∏
Here, as before, Eθ (·|It −1 ) is the conditional expectation under the model-implied transition density p (x, t |It −1 , θ).
(5.3)
where the centering and scaling factors Cˆ (m, 0) =
T −1 −
k2 (j/p)
j =1
1 T −j
∫ T 2 − ˆ (m) ˆ 2 ˆ Zt (0, θ) ψt −j (v) dW (v) ,
×
t =|j|+1
T −2 − T −2 −
k2 (j/p)k2 (l/p)
j=1 l=1
∫ ∫ T − 1 ˆ (m) ˆ 2 × Zt (0, θ) T − max(j, l) t =max(j,l)+1 2 ˆ ˆ × ψt −j (v1 )ψt −j (v2 ) dW (v1 ) dW (v2 ) ,
(5.1)
(iXct )mc
k2 (j/p)(T − j)
j =1
∂ m1 ∂ mN · · · F (ω, u , v ) m1 mN ∂ uN ∂ u1 u =0 1
(5.2)
Thus, the choice of |m| = 2 can be used to check model misspecifications in the conditional volatility of state variables as well as their conditional correlations.
mc . Then
where the derivative of the generalized cross-covariance function
Γj
• Case 1: |m| = 1. We have m = (1, 0) or m = (0, 1). If m = (1, 0), (m,0) Γj (0, v) = icov X1t − Eθ (X1t |It −1 ), exp iv′ Xt −|j| .
ˆ (m, 0) = 2 D
where mc ≥ 0 for all 1 ≤ c ≤ N, and put |m| = we define the generalized cross-spectral derivative F (0,m,0) (ω, 0, v) ≡
To gain insight into the generalized cross-spectral derivative F (0,m,0) (ω, 0, v), we consider a bivariate process Xt = (X1t , X2t )′ and examine the cases of |m| = 1 and |m| = 2 respectively:
and (m) ˆ = Zˆt (0, θ)
N ∏
(iXct )
c =1
mc
− Eθˆ
N ∏
(iXct )
mc
|ItĎ−1
.
c =1
To derive the limit distribution of Qˆ (m, 0) under H0 , we impose some moment conditions.
∑T (i) limT →∞ T −1 t =1 E supθ∈2 [ ]2 ∂ ∂ m1 ∂ mN ∂θ ∂ um1 1 · · · ∂ umN N ϕ (u, t |It −1 , θ) |u=0 ≤ C ;
Assumption A.8.
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
(ii) limT →∞ T
∑T −1
E sup
θ∈2 t =1 [ ]2 ∂2 m m 1 N ∂ ∂ ∂θ∂θ′ ∂ um1 1 · · · ∂ umN N ϕ (u, t |It −1 , θ) |u=0 ≤ C ; ∑T (iii) limT →∞ T −1 t =1 E supθ∈2 4 ∂ m1 m · · · ∂ mmN ϕ (u, t |It −1 , θ) |u=0 ≤ C ; N ∂ u1 1 ∂ uN 8(1+δ)mc (iv) E ΠcN=1 Xct ≤ C.
277
6. Finite sample performance It is unclear how well the asymptotic theory can provide reliable reference and guidance when applied to financial time series data, which is well known to display highly persistent serial dependence. We now investigate the finite sample performance of the proposed tests for the adequacy of a multivariate affine diffusion model. 6.1. Simulation design
m1 mN Assumption A.9. {Xt , ∂ m1 · · · ∂ mN ϕ (u, t |It −1 , θ0 ) ,
∂ ∂ m1 [ ∂θ ∂ um1
···
1
∂ mN m ∂ uN N
∂ u1
∂ uN
ϕ (u, t |It −1 , θ0 ) |u=0 ]} is a strictly stationary β -
mixing process with the mixing coefficient |β (l)| ≤ Cl−ν for some constant ν > 2. Ď
Assumption A.10. Let It be the observed feasible information set available at time t that may involve certain initial values. Then ∑T m1 mN limT →∞ t =1 E supθ∈2 | ∂ m1 · · · ∂ mN ϕ (u, t |It −1 , θ) |u=0
−
∂ m1 m ∂ u1 1
···
∂ mN m ∂ uN N
∂ u1
∂ uN
ϕ(u, t |ItĎ−1 , θ)|u=0 |2 ≤ C .
Theorem 3. Suppose Assumptions A.1, A.3 and A.6–A.10 hold for +δ some pre-specified m and p = cT λ for 1νδ < λ < (3 + 4b1−2 )−1 , where 0 < c , δ < ∞ and ν is defined in Assumption A.9. Then d
Qˆ (m, 0) → N (0, 1) under H0 as T → ∞. Like the omnibus test Qˆ (0, 0), Qˆ (m, 0) has a convenient asymptotic N (0, 1) distribution and parameter estimation uncertainty√in θˆ has no impact on the asymptotic distribution of Qˆ (m, 0). Any T -consistent estimator can be used. Moreover, different choices of m can examine various specific dynamic aspects of the state vector Xt and thus provide information on how well a multivariate time series model fits various aspects of the conditional distribution of Xt .16 It should be pointed out that the generalized cross-spectral derivative tests may fail to detect certain specific model misspecifications. In particular, they will fail to detect some time-invariant model misspecifications. For example, suppose a time series model assumes a zero conditional correlation between state variables, while there exists a nonzero but time-invariant conditional correlation. Then the derivative of the generalized covariance in (5.2) is identically zero. In this case, the correlation test will fail to detect such time-invariant conditional correlation. Of course, the tests Qˆ (m, 0) will generally have power against time-varying misspecifications. These diagnostic tests are designed to test specification of various conditional moments, i.e., whether the conditional moments of Xt are correctly specified given the discrete sample information. We note that the first two conditional moments differ from the instantaneous conditional mean (drift) and instantaneous conditional variance (squared diffusion). In general, the conditional moments tested here are functions of drift, diffusion and jump (see, e.g., Eqs. (6.2) and (6.3) in Section 6). Therefore, careful interpretation is needed. However, our simulation studies below show that the tests Qˆ (m, 0) with |m| = 1 and 2 are very indicative of drift and diffusion misspecifications respectively.
To examine the size of Qˆ (m, 0) under H0 , we consider the following DGP:
• DGP0 (Uncorrelated Gaussian Diffusion): κ11 0 0 θ1 − X1t X1t θ2 − X2t dt 0 κ22 0 d X2t = θ3 − X3t X3t 0 0 κ33 σ11 0 0 W1t + 0 σ22 0 d W2t . W3t 0 0 σ33
We set (κ11 , κ22 , κ33 , θ1 , θ2 , θ3 , σ11 , σ22 , σ33 ) = (0.5, 1, 2, 0, 0, 0, 1, 1, 1). We use the Euler scheme to simulate 1000 data sets of the random sample {Xt }Tt =11 at the monthly frequency for T = 250, 500 and 1000 respectively. These sample sizes correspond to about twenty to one hundred years of monthly data. Each simulated sample path is generated using 120 intervals per month. We then discard 119 out of every 120 observations, obtaining discrete observations at the monthly frequency. For each data set, we use MLE to estimate model (6.1), with no restrictions on the intercept coefficients. All data are generated using the Matlab 6.1 random number generator on a PC. With a diagonal matrix κ = diag(κ11 , κ22, κ33 ), DGP0 is an uncorrelated 3-factor Gaussian diffusion process. As shown in Duffee (2002), the Gaussian diffusion model has analytic expressions for the conditional mean and the conditional variance respectively: E (Xt |Xs ) = I − e−κ(t −s) θ + e−κ(t −s) Xs ,
var(Xt |Xs ) =
t
∫
(6.2)
e−κ(t −m) 66′ e−κ (t −m) dm ′
s
2 σ11 σ2 [1 − e2κ11 (s−t ) ], 22 2κ11 2κ22 2 σ33 2κ22 (s−t ) × [1 − e ], [1 − e2κ33 (s−t ) ] , (6.3) 2κ33 where θ = (θ1 , θ2 , θ3 )′ and 6 = diag σ11, σ22 , σ33 . To investigate the power of Qˆ (m, 0) in distinguishing model
= diag
(6.1) from alternative processes, we also generate data from five affine diffusion processes respectively:
• DGP1 [Correlated Gaussian Diffusion, with Constant Correlation in Drift]: X1t d X2t X3t
=
0.5 −0.5 0.5
16 In the literature, there have been diagnostic tests based on PITs to gauge possible sources of model rejection (e.g., Diebold et al. (1998), Hong and Li (2005)). As pointed out by Bontemps and Meddahi (2010), however, the diagnostic tests based on the PIT have some drawback: when one rejects a specification of transformed variables, it is difficult to know how to modify the model for original state variables to obtain a correct specification. However, our tests are directly based on state variables and thus avoid this drawback.
(6.1)
+
1 0 0
0 1 0.5 0 1 0
−X1t −X2t dt −X3t 0 0 2
0 W1t 0 d W2t 1 W3t
;
• DGP2 [Uncorrelated CIR (Cox et al., 1985) Diffusion]: 0.5 0 0 2 − X1t X1t 0 1 0 1 − X2t dt d X2t = X3t 0 0 2 1 − X3t
(6.4)
278
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
X1t
+ 0
0 W1t d W2t ; 0
0
X2t
0
W3t
X3t
0
(6.5)
• DGP3 [Correlated Gaussian Diffusion, with Constant Correlation in Diffusion]: X1t d X2t X3t
=
0.5 0 0
+
0 1 0
1 0.5 0.5
−X1t −X2t dt −X3t
0 0 2
0 1 0.5
0 W1t 0 d W2t W3t 1
;
(6.6)
• DGP4 [Correlated CIR Diffusion]: 0.5 0 0 2 − X1t X1t 0 1 0 1 − X2t dt d X2t = 1 − X3t X3t 0 0 2 X1t 0 W1t 0 + 0.5X1t d W2t . X2t 0 W3t 0.5 X1t 0.5 X2t X3t • DGP5 [Mixture of Gaussian and CIR Processes]: X1t 0.5 0 0 2 − X1t 0 1 0 1 − X2t dt d X2t = X3t 0 0 2 −X3t W1t X1t 0 0 + 0.5 X X 0 d W2t . 1t
0
2t
0
1
specified, but there are dynamic misspecifications in conditional variances and conditional covariances of state variables. DGP5 is a mixture of Gaussian and CIR diffusion processes, where {X1t , X2t } is a correlated 2-factor CIR diffusion process and {X3t } is a univariate Gaussian process. Under DGP5, model (6.1) is misspecified for the conditional variances of X1t and X2t and the conditional covariance between X1t and X2t . The conditional means of X1t , X2t and X3t and the conditional covariances between X1t and X3t , X2t and X3t are correctly specified. For each of DGPs 1–5, we generate 500 data sets of the random sample {Xt }Tt =11 by the Euler scheme, for T = 250, 500 and 1000 respectively at the monthly sample frequency. For each data set, we estimate model (6.1) via MLE. Because model (6.1) is misspecified under all five DGPs, our omnibus test Qˆ (0, 0) is expected to have nontrivial power under DGPs 1–5, provided the sample size T is sufficiently large. We will also examine how diagnostic tests Qˆ (m, 0) for |m| > 0 can reveal information about various model misspecifications. 6.2. Monte Carlo evidence
(6.7)
(6.8)
W3t
With a nondiagonal matrix κ, DGP1 is a correlated 3-factor Gaussian diffusion process, a special case of the canonical A0 (3) defined in Dai and Singleton (2000). Under DGP1, model (6.1) is misspecified for the drift function but is correctly specified for the diffusion function. Eqs. (6.2) and (6.3) show that the conditional means and the conditional variances of X2t and X3t (but not X1t ), and the conditional covariances between X1t , X2t and X3t are misspecified, if model (6.1) is used to fit the data generated from DGP1. However, the misspecifications in the conditional variances and conditional correlations of Xt are time-invariant. DGP2 is an uncorrelated 3-factor CIR diffusion process, which has a non-central χ 2 transition distribution. Under DGP2, model (6.1) is correctly specified for the drift function but is misspecified for the diffusion function because it fails to capture the ‘‘level effect’’. If model (6.1) is used to fit the data generated from DGP2, the conditional means and conditional covariances of X1t , X2t and X3t are correctly specified, but their conditional variances are misspecified. DGP3 is another correlated Gaussian diffusion process, where the correlations between state variables come from diffusions rather than drifts. Under DGP3, model (6.1) is correctly specified for the drift function but is misspecified for the diffusion function. If model (6.1) is used to fit data generated from DGP3, the conditional means of X1t , X2t and X3t are correctly specified, but the conditional variances of X2t and X3t , and the conditional covariances between X1t , X2t and X3t are misspecified. Specifically, model (6.1) assumes zero conditional correlations between state variables, whereas there exist nonzero but time-invariant conditional correlations between state variables under DGP3. DGP4 is a correlated 3-factor CIR diffusion process, where the conditional variances and covariances of state variables depend on state variables. If we use model (6.1) to fit data generated from DGP4, the conditional means of X1t , X2t and X3t are correctly
To reduce computational costs, we generate u and v from a N (0, I3 ) distribution, with each u and v having 30 grid points in R3 respectively. We use the Bartlett kernel, which has a bounded support and is computationally efficient. Our simulation experience suggests that the choices of W (·) and k (·) have little impact on both size and power of the Qˆ (m, 0) tests.17 Like Hong (1999), we use a data-driven pˆ via a plug-in method that minimizes the asymptotic integrated mean squared error of the generalized cross-spectral Fˆ (ω, u, v), with the Bartlett kernel k¯ (·) used in some preliminary generalized cross-spectral estimators. To examine the sensitivity of the choice of the preliminary bandwidth p¯ on the size and power of the tests, we consider p¯ in the range of 10–40.18 Table 1 reports the rejection rates (in terms of percentage) of Qˆ (m, 0) under DGP0 at the 10% and 5% significance levels, using the asymptotic theory. The omnibus Qˆ (0, 0) test tends to underreject when T = 250, but it improves as T increases. The size Qˆ (0, 0) is not very sensitive to the preliminary lag order p. For example, when T = 250, the rejection rate at the 5% level attains its maximum 3.8% at p = 10, and attains its minimum 2.5% at p = 31, 33, 34. For T = 1000, the rejection rate at the 5% level attains its maximum 6.2% at p = 17, 18, 19, 35, and attains its minimum 5.6% at p¯ = 11. We also consider the diagnostic tests Qˆ (m, 0) for |m| = 1, 2, which check model misspecifications in conditional means, conditional variances and conditional correlations of state variables. The Qˆ (m, 0) tests have similar size patterns to Qˆ (0, 0), except that some of them tend to overreject a bit when T = 1000. Overall, both omnibus and diagnostic tests have reasonable sizes at both the 10% and 5% levels and the sizes are robust to the choice of the preliminary lag order p¯ . Next, we turn to examine the power of Qˆ (m, 0). Tables 2–6 report the rejection rates of Qˆ (m, 0) under DGPs 1–5 at the 10% and 5% levels respectively. Under DGP1, model (6.1) ignores nonzero constant correlations in drifts. The omnibus test Qˆ (0, 0) is able to detect such model misspecification. The rejection rate of Qˆ (0, 0) is about 65% at the 5% level when T = 1000. Because only the misspecifications of the conditional means in X2t and X3t are time-varying when model (6.1) is used to fit the data from DGP1, we expect that the mean tests, Qˆ ((0, 1, 0)′ , 0) and √
√
17 We have tried the Parzen kernel for k (·) and the Uniform(−2 3, 2 3) CDF for W (·), obtaining similar results. 18 We have tried p¯ in the range of 1–10 and results are similar.
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
279
Table 1 Sizes of specification tests under DGP0. Lag order
10
T
250
20
α Qˆ (0, 0)
.10 .056
.05 .038
.10 .072
.05 .045
.10 .088
.05 .056
.10 .056
.05 .033
.10 .087
.05 .047
.10 .086
.05 .061
500
1000
250
500
1000
Qˆ 1
Qˆ ((1, 0, 0), 0) Qˆ ((0, 1, 0), 0) Qˆ ((0, 0, 1), 0)
.072 .075 .085
.035 .043 .053
.090 .073 .074
.068 .046 .048
.078 .087 .076
.051 .052 .046
.057 .071 .078
.037 .036 .046
.084 .080 .073
.060 .048 .043
.080 .085 .075
.045 .057 .045
Qˆ 2
Qˆ ((2, 0, 0), 0) Qˆ ((0, 2, 0), 0) Qˆ ((0, 0, 2), 0)
.069 .066 .066
.044 .034 .035
.077 .092 .106
.050 .049 .065
.097 .114 .104
.066 .075 .064
.058 .050 .057
.033 .024 .031
.064 .080 .099
.036 .038 .058
.100 .120 .105
.065 .072 .065
Qˆ 3
Qˆ ((1, 1, 0), 0) Qˆ ((0, 1, 1), 0) Qˆ ((1, 0, 1), 0)
.090 .079 .098
.057 .049 .052
.096 .094 .080
.057 .063 .049
.106 .115 .106
.068 .075 .075
.078 .071 .076
.049 .044 .038
.089 .090 .075
.054 .061 .046
.113 .111 .107
.075 .072 .076
Lag order
30
Qˆ (0, 0)
.064
.026
.082
.049
.082
.059
.057
40 .029
.086
.047
.093
.059
Qˆ 1
Qˆ ((1, 0, 0), 0) Qˆ ((0, 1, 0), 0) Qˆ ((0, 0, 1), 0)
.059 .064 .066
.033 .034 .039
.080 .072 .075
.048 .048 .044
.073 .079 .073
.045 .045 .047
.060 .064 .059
.034 .031 .031
.083 .074 .079
.043 .051 .036
.068 .073 .082
.041 .052 .041
Qˆ 2
Qˆ ((2, 0, 0), 0) Qˆ ((0, 2, 0), 0) Qˆ ((0, 0, 2), 0)
.049 .041 .052
.028 .019 .019
.061 .070 .088
.031 .036 .055
.092 .114 .101
.055 .066 .054
.043 .035 .042
.020 .016 .020
.059 .062 .077
.025 .030 .050
.090 .104 .093
.049 .061 .054
Qˆ 3
Qˆ ((1, 1, 0), 0) Qˆ ((0, 1, 1), 0) Qˆ ((1, 0, 1), 0)
.066 .073 .067
.040 .039 .035
.091 .085 .063
.050 .052 .042
.111 .108 .103
.077 .066 .062
.059 .074 .066
.032 .039 .031
.085 .073 .059
.042 .048 .039
.108 .107 .088
.071 .063 .059
Notes: (i) DGP0 is an uncorrelated Gaussian diffusion process, given in Eq. (6.1); (ii) Qˆ (0, 0) is the omnibus test; Qˆ 1 , Qˆ 2 and Qˆ 3 are conditional mean tests, conditional variance tests and conditional correlation tests respectively; (iii) The p values are based on the results of 1000 iterations.
Table 2 Powers of specification tests under DGP1. Lag order
10
T
250
20
α Qˆ (0, 0)
.10 .120
.05 .078
.10 .304
.05 .214
.10 .746
.05 .674
.10 .122
.05 .082
.10 .282
.05 .212
.10 .722
.05 .650
500
1000
250
500
1000
Qˆ 1
Qˆ ((1, 0, 0), 0) Qˆ ((0, 1, 0), 0) Qˆ ((0, 0, 1), 0)
.096 .226 .616
.064 .158 .498
.060 .432 .914
.040 .352 .886
.078 .834 1.00
.052 .776 .998
.086 .198 .570
.046 .122 .468
.070 .422 .906
.044 .320 .850
.068 .830 1.00
.052 .770 .998
Qˆ 2
Qˆ ((2, 0, 0), 0) Qˆ ((0, 2, 0), 0) Qˆ ((0, 0, 2), 0)
.072 .066 .072
.042 .036 .050
.098 .096 .098
.058 .052 .060
.120 .126 .124
.082 .096 .088
.066 .058 .064
.028 .026 .038
.086 .082 .086
.042 .052 .048
.114 .130 .112
.074 .090 .080
Qˆ 3
Qˆ ((1, 1, 0), 0) Qˆ ((0, 1, 1), 0) Qˆ ((1, 0, 1), 0)
.078 .082 .092
.060 .040 .050
.098 .088 .108
.068 .056 .072
.102 .138 .112
.070 .086 .068
.066 .078 .084
.046 .042 .046
.106 .080 .100
.068 .060 .062
.106 .142 .106
.068 .100 .078
Lag order
30
Qˆ (0, 0)
.112
.072
.272
.200
.690
.618
.114
.064
.260
.200
.656
.568
.078 .170 .530
.044 .096 .430
.078 .388 .876
.058 .284 .832
.064 .796 1.00
.052 .736 .996
.078 .156 .500
.050 .084 .384
.088 .366 .856
.060 .264 .814
.068 .772 .998
.040 .688 .992
.060 .054 .062
.024 .026 .030
.072 .084 .072
.036 .042 .046
.106 .126 .104
.066 .084 .070
.054 .050 .052
.024 .022 .034
.056 .080 .070
.030 .040 .038
.096 .112 .100
.052 .074 .070
.066 .072 .074
.030 .036 .048
.098 .082 .088
.070 .058 .058
.104 .142 .102
.064 .098 .074
.050 .068 .078
.022 .034 .048
.102 .076 .090
.062 .056 .050
.106 .136 .108
.066 .096 .066
Qˆ 1
Qˆ 2
Qˆ 3
Qˆ ((1, 0, 0), 0) Qˆ ((0, 1, 0), 0) Qˆ ((0, 0, 1), 0) Qˆ ((2, 0, 0), 0) Qˆ ((0, 2, 0), 0) Qˆ ((0, 0, 2), 0) Qˆ ((1, 1, 0), 0) Qˆ ((0, 1, 1), 0) Qˆ ((1, 0, 1), 0)
40
Notes: (i) DGP1 is a correlated Gaussian diffusion process with correlation in drift, given in Eq. (6.4); (ii) Qˆ (0, 0) is the omnibus test; Qˆ 1 , Qˆ 2 and Qˆ 3 are conditional mean tests, conditional variance tests and conditional correlation tests respectively; (iii) The p values are based on the results of 500 iterations.
280
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
Table 3 Powers of specification tests under DGP2. Lag order
10
T
250
20
α Qˆ (0, 0)
.10 .204
.05 .150
.10 .412
.05 .360
.10 .832
.05 .770
.10 .186
.05 .116
.10 .390
.05 .320
.10 .802
.05 .748
500
1000
250
500
1000
Qˆ 1
Qˆ ((1, 0, 0), 0) Qˆ ((0, 1, 0), 0) Qˆ ((0, 1, 0), 0)
.050 .058 .070
.024 .034 .046
.056 .066 .066
.024 .040 .044
.044 .048 .058
.032 .034 .038
.046 .048 .074
.024 .028 .054
.048 .054 .066
.020 .026 .036
.060 .048 .068
.038 .030 .042
Qˆ 2
Qˆ ((2, 0, 0), 0) Qˆ ((0, 2, 0), 0) Qˆ ((0, 0, 2), 0)
.952 .958 .764
.922 .924 .678
1.00 1.00 .980
.996 1.00 .970
1.00 1.00 1.00
1.00 1.00 1.00
.932 .922 .680
.900 .882 .576
1.00 1.00 .968
.996 1.00 .942
1.00 1.00 1.00
1.00 1.00 1.00
Qˆ 3
Qˆ ((1, 1, 0), 0) Qˆ ((0, 1, 1), 0) Qˆ ((1, 0, 1), 0)
.080 .080 .072
.044 .050 .038
.082 .082 .094
.050 .062 .064
.080 .094 .106
.056 .066 .075
.062 .066 .062
.038 .040 .034
.062 .080 .078
.038 .048 .042
.086 .090 .102
.060 .058 .070
Lag order
30
Qˆ (0, 0)
.174
.110
.360
.296
.768
.694
40 .156
.088
.340
.278
.724
.660
Qˆ 1
Qˆ ((1, 0, 0), 0) Qˆ ((0, 1, 0), 0) Qˆ ((0, 0, 1), 0)
.038 .048 .078
.024 .024 .050
.042 .054 .052
.024 .032 .036
.058 .048 .068
.026 .024 .044
.042 .044 .070
.018 .028 .046
.044 .056 .050
.020 .030 .028
.054 .046 .072
.030 .026 .042
Qˆ 2
Qˆ ((2, 0, 0), 0) Qˆ ((0, 2, 0), 0) Qˆ ((0, 0, 2), 0)
.916 .910 .600
.872 .850 .512
.998 1.00 .942
.992 .996 .886
1.00 1.00 1.00
1.00 1.00 1.00
.896 .882 .560
.858 .806 .444
.998 .996 .892
.992 .996 .832
1.00 1.00 1.00
1.00 .998 .998
Qˆ 3
Qˆ ((1, 1, 0), 0) Qˆ ((0, 1, 1), 0) Qˆ ((1, 0, 1), 0)
.060 .060 .048
.038 .036 .034
.066 .076 .068
.038 .046 .052
.086 .080 .094
.062 .054 .070
.058 .058 .044
.028 .034 .028
.060 .076 .068
.030 .040 .050
.088 .072 .088
.058 .054 .066
Notes: (i) DGP2 is an uncorrelated CIR diffusion process, given in Eq. (6.5); (ii) Qˆ (0, 0) is the omnibus test; Qˆ 1 , Qˆ 2 and Qˆ 3 are conditional mean tests, conditional variance tests and conditional correlation tests respectively; (iii) The p values are based on the results of 500 iterations.
Table 4 Powers of specification tests under DGP3. Lag order
10
T
250
20
α Qˆ (0, 0)
.10 .468
.05 .366
.10 .868
.05 .792
.10 1.00
.05 1.00
.10 .406
.05 .310
.10 .804
.05 .714
.10 .992
.05 .980
500
1000
250
500
1000
Qˆ 1
Qˆ ((1, 0, 0), 0) Qˆ ((0, 1, 0), 0) Qˆ ((0, 1, 0), 0)
.032 .082 .092
.014 .056 .054
.062 .080 .102
.030 .052 .064
.054 .072 .084
.030 .038 .056
.036 .090 .086
.024 .054 .052
.044 .076 .098
.022 .050 .068
.048 .072 .092
.032 .044 .066
Qˆ 2
Qˆ ((2, 0, 0), 0) Qˆ ((0, 2, 0), 0) Qˆ ((0, 0, 2), 0)
.074 .080 .094
.060 .052 .060
.076 .122 .096
.040 .088 .060
.128 .116 .084
.078 .072 .052
.066 .068 .084
.038 .040 .040
.076 .110 .072
.036 .080 .042
.122 .094 .074
.084 .056 .046
Qˆ 3
Qˆ ((1, 1, 0), 0) Qˆ ((0, 1, 1), 0) Qˆ ((1, 0, 1), 0)
.080 .082 .094
.044 .044 .052
.124 .094 .118
.078 .068 .082
.094 .112 .088
.056 .074 .058
.056 .054 .078
.042 .032 .042
.126 .102 .100
.072 .062 .064
.086 .102 .092
.050 .070 .064
Lag order
30
Qˆ (0, 0)
.368
.248
.748
.656
.974
.962
40 .318
.226
.706
.606
.958
.938
Qˆ 1
Qˆ ((1, 0, 0), 0) Qˆ ((0, 1, 0), 0) Qˆ ((0, 1, 0), 0)
.032 .080 .080
.020 .036 .044
.050 .072 .098
.022 .040 .068
.048 .084 .092
.034 .046 .060
.034 .078 .066
.022 .040 .040
.054 .070 .094
.022 .040 .070
.050 .084 .084
.032 .052 .054
Qˆ 2
Qˆ ((2, 0, 0), 0) Qˆ ((0, 2, 0), 0) Qˆ ((0, 0, 2), 0)
.056 .062 .058
.028 .028 .024
.068 .104 .070
.028 .066 .040
.108 .082 .074
.070 .050 .042
.044 .054 .044
.022 .022 .012
.062 .090 .070
.026 .060 .032
.102 .078 .066
.072 .046 .044
Qˆ 3
Qˆ ((1, 1, 0), 0) Qˆ ((0, 1, 1), 0) Qˆ ((1, 0, 1), 0)
.052 .050 .064
.030 .026 .032
.114 .096 .098
.072 .048 .054
.072 .084 .082
.038 .060 .052
.052 .044 .060
.024 .026 .028
.102 .076 .092
.062 .042 .052
.070 .086 .082
.038 .046 .050
Notes: (i) DGP3 is a correlated Gaussian diffusion process with correlation in diffusion, given in Eq. (6.6); (ii) Qˆ (0, 0) is the omnibus test; Qˆ 1 , Qˆ 2 and Qˆ 3 are conditional mean tests, conditional variance tests and conditional correlation tests respectively; (iii) The p values are based on the results of 500 iterations.
Table 5
Powers of specification tests under DGP4.

  Lag   T      α    Q̂(0,0)   Q̂1:(1,0,0) (0,1,0) (0,0,1)   Q̂2:(2,0,0) (0,2,0) (0,0,2)   Q̂3:(1,1,0) (0,1,1) (1,0,1)
  10    250    .10  .946      .036  .068  .092              .982  .958  .906              .834  .832  .800
               .05  .924      .026  .052  .058              .970  .930  .858              .768  .768  .722
        500    .10  .994      .052  .074  .094              1.00  1.00  1.00              .992  .986  .978
               .05  .990      .034  .044  .070              1.00  .996  1.00              .980  .974  .956
        1000   .10  1.00      .114  .090  .078              1.00  1.00  1.00              1.00  1.00  1.00
               .05  1.00      .070  .064  .054              1.00  1.00  1.00              1.00  1.00  1.00
  20    250    .10  .914      .036  .078  .090              .972  .944  .846              .800  .786  .762
               .05  .886      .022  .044  .048              .942  .896  .780              .712  .716  .682
        500    .10  .988      .060  .056  .076              1.00  .998  .998              .982  .978  .970
               .05  .988      .026  .040  .056              1.00  .996  .990              .972  .968  .952
        1000   .10  1.00      .090  .088  .086              1.00  1.00  1.00              1.00  1.00  1.00
               .05  1.00      .062  .058  .044              1.00  1.00  1.00              1.00  1.00  1.00
  30    250    .10  .892      .034  .066  .080              .954  .916  .802              .766  .742  .728
               .05  .844      .018  .028  .044              .926  .854  .720              .672  .646  .640
        500    .10  .988      .050  .056  .072              1.00  .996  .990              .974  .974  .966
               .05  .982      .022  .038  .042              .996  .996  .976              .954  .952  .942
        1000   .10  1.00      .078  .088  .086              1.00  1.00  1.00              1.00  1.00  1.00
               .05  1.00      .056  .054  .038              1.00  1.00  1.00              1.00  1.00  1.00
  40    250    .10  .868      .030  .048  .064              .940  .888  .762              .730  .698  .696
               .05  .820      .014  .024  .034              .912  .798  .668              .630  .622  .590
        500    .10  .988      .046  .056  .070              .998  .996  .986              .966  .964  .956
               .05  .974      .014  .038  .040              .996  .990  .960              .942  .940  .930
        1000   .10  1.00      .074  .084  .080              1.00  1.00  1.00              1.00  1.00  1.00
               .05  1.00      .044  .050  .038              1.00  1.00  1.00              1.00  1.00  1.00

Notes: (i) DGP4 is a correlated CIR diffusion process, given in Eq. (6.7); (ii) Q̂(0,0) is the omnibus test; Q̂1, Q̂2 and Q̂3 are the conditional mean, conditional variance and conditional correlation tests respectively, where a column label (m) abbreviates Q̂((m)′, 0); (iii) the p-values are based on the results of 500 iterations.
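All entries in Tables 2–6 are built from the generalized residuals Ẑ_t(u, θ̂) = e^{iu′X_t} − φ(u, t|I_{t−1}, θ̂) and the sample generalized covariances Γ̂_j(u, v) of Eq. (3.4). The sketch below evaluates Γ̂_j at one (u, v) pair. The model CCF `ccf_model` is a user-supplied placeholder for the fitted model, and the empirical centering of e^{iv′X_{t−j}} follows the ψ̂ convention suggested by the decompositions in the appendix; this is a sketch under stated assumptions rather than the authors' code.

```python
import numpy as np

def gamma_hat(j, X, u, v, ccf_model):
    """Sample generalized covariance Gamma_hat_j(u, v) for lag j > 0.
    X is a (T, N) array of observations; ccf_model(u, t) must return the fitted
    conditional CCF phi(u, t | I_{t-1}, theta_hat) -- a placeholder assumption."""
    T = X.shape[0]
    t_idx = np.arange(j, T)                      # t = j+1, ..., T (0-based: j .. T-1)
    Z = np.exp(1j * X[t_idx] @ u) - np.array([ccf_model(u, t) for t in t_idx])
    e_lag = np.exp(1j * X[t_idx - j] @ v)
    psi = e_lag - e_lag.mean()                   # centered by the lag-j sample mean
    return (Z * psi).mean()
```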
Table 6
Powers of specification tests under DGP5.

  Lag   T      α    Q̂(0,0)   Q̂1:(1,0,0) (0,1,0) (0,0,1)   Q̂2:(2,0,0) (0,2,0) (0,0,2)   Q̂3:(1,1,0) (0,1,1) (1,0,1)
  10    250    .10  .438      .070  .062  .078              .938  .974  .076              .750  .094  .072
               .05  .366      .044  .042  .042              .902  .946  .048              .648  .062  .044
        500    .10  .726      .072  .054  .064              1.00  1.00  .106              .982  .096  .112
               .05  .676      .046  .026  .038              .998  1.00  .062              .964  .064  .076
        1000   .10  .970      .098  .058  .056              1.00  1.00  .116              1.00  .100  .106
               .05  .960      .072  .038  .036              1.00  1.00  .078              1.00  .072  .070
  20    250    .10  .458      .068  .068  .066              .918  .946  .066              .714  .072  .062
               .05  .366      .032  .036  .042              .888  .910  .038              .610  .054  .034
        500    .10  .728      .064  .052  .070              1.00  1.00  .096              .978  .088  .110
               .05  .670      .044  .030  .046              1.00  1.00  .054              .952  .052  .068
        1000   .10  .970      .092  .058  .072              1.00  1.00  .116              1.00  .108  .110
               .05  .958      .060  .038  .038              1.00  1.00  .080              1.00  .066  .072
  30    250    .10  .444      .056  .066  .066              .904  .912  .060              .672  .068  .050
               .05  .350      .030  .036  .030              .866  .858  .036              .572  .042  .028
        500    .10  .718      .054  .050  .070              1.00  1.00  .096              .968  .082  .090
               .05  .652      .036  .036  .034              .996  .996  .052              .936  .044  .062
        1000   .10  .964      .082  .054  .072              1.00  1.00  .108              1.00  .100  .096
               .05  .948      .054  .032  .038              1.00  1.00  .070              1.00  .066  .066
  40    250    .10  .430      .052  .058  .070              .890  .892  .052              .644  .058  .040
               .05  .338      .028  .028  .030              .832  .814  .022              .522  .032  .020
        500    .10  .700      .058  .048  .062              1.00  1.00  .086              .960  .066  .088
               .05  .634      .030  .034  .032              .990  .996  .044              .912  .040  .052
        1000   .10  .968      .076  .050  .066              1.00  1.00  .104              1.00  .094  .088
               .05  .950      .050  .030  .036              1.00  1.00  .066              1.00  .068  .064

Notes: (i) DGP5 is a mixture of Gaussian and CIR processes, given in Eq. (6.8); (ii) Q̂(0,0) is the omnibus test; Q̂1, Q̂2 and Q̂3 are the conditional mean, conditional variance and conditional correlation tests respectively, where a column label (m) abbreviates Q̂((m)′, 0); (iii) the p-values are based on the results of 500 iterations.
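Each cell in Tables 2–6 is an empirical rejection frequency: the fraction of the 500 replications in which the test statistic exceeds the upper-tail N(0,1) critical value (an upper-tailed comparison is the natural convention here, since by Theorem A.4 the statistics diverge to +∞ under misspecification). A minimal sketch of the tabulation, with the statistic computation `compute_Q` left as a user-supplied stub:

```python
import numpy as np
from scipy.stats import norm

def rejection_rates(compute_Q, n_rep=500, levels=(0.10, 0.05), seed=0):
    """Empirical rejection frequencies of an asymptotically N(0,1), upper-tailed
    statistic. compute_Q(rng) should simulate one sample from the chosen DGP,
    fit model (6.1), and return the statistic -- all user-supplied."""
    rng = np.random.default_rng(seed)
    stats = np.array([compute_Q(rng) for _ in range(n_rep)])
    return {a: float(np.mean(stats > norm.ppf(1.0 - a))) for a in levels}
```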
Q̂((0,0,1)′,0), will be powerful, but the variance and covariance tests Q̂(m,0) with |m| = 2 will have no power. This is indeed confirmed in our simulation. Table 2 shows that the Q̂((0,1,0)′,0) and Q̂((0,0,1)′,0) tests are able to capture the time-varying mean misspecifications in X_{2t} and X_{3t}, as their rejection rates are about 75% and 99% respectively at the 5% level when T = 1000.[19] However, the rejection rates of all variance and covariance tests Q̂(m,0) for |m| = 2 are close to the significance levels.

Under DGP2, model (6.1) ignores the so-called "level effect" in diffusions. The omnibus test Q̂(0,0) has good power under DGP2, with a rejection rate of 75% at the 5% level when T = 1000. Note that the omnibus test Q̂(0,0) is less powerful than the conditional variance tests, because Q̂(0,0) has to check all possible directions whereas only the conditional variances are misspecified under DGP2. The variance tests Q̂((2,0,0)′,0), Q̂((0,2,0)′,0) and Q̂((0,0,2)′,0) have excellent power, which increases with T and approaches unity when T = 1000. Interestingly, the powers of the diagnostic tests for conditional means and conditional covariances are close to the significance levels, indicating that these diagnostic tests do not overreject correctly specified conditional means and conditional covariances of the state variables.

Under DGP3, model (6.1) is correctly specified for both the conditional means and the conditional variances of the state variables but is misspecified for the conditional correlations between state variables, because it ignores the nonzero but time-invariant correlations in the diffusions. As expected, the omnibus test Q̂(0,0) has excellent power when model (6.1) is used to fit the data generated from DGP3. The power of Q̂(0,0) increases significantly with the sample size T and approaches unity when T = 1000. However, our correlation diagnostic tests fail to capture the misspecifications in the correlations, because they are time-invariant.

Under DGP4, model (6.1) is correctly specified for the conditional means of the state variables but is misspecified for the conditional variances and conditional correlations, because it ignores both the "level effect" and the time-varying conditional correlations in the diffusions. The omnibus test Q̂(0,0) has excellent power when (6.1) is used to fit data generated from DGP4. This is consistent with the fact that DGP4 deviates most from model (6.1). The Q̂(m,0) tests with |m| = 2 have good power in detecting misspecifications in the conditional variances and correlations of the state variables. The mean tests Q̂(m,0) with |m| = 1 have no power, because the conditional means of the state variables are correctly specified.

Under DGP5, model (6.1) is correctly specified for the conditional means of the state variables but is misspecified for the conditional variances of X_{1t} and X_{2t} and the conditional correlation between X_{1t} and X_{2t}. Again, the omnibus test Q̂(0,0) has good power under DGP5, with a rejection rate of 95% at the 5% level when T = 1000. The variance tests Q̂((2,0,0)′,0) and Q̂((0,2,0)′,0) and the covariance test Q̂((1,1,0)′,0) have excellent power, which increases with T and attains unity when T = 1000. The powers of the other diagnostic tests are close to the significance levels, because those conditional moments are correctly specified.

To sum up, we observe:

• The Q̂(m,0) tests have reasonable sizes in finite samples. Although the omnibus test Q̂(0,0) tends to underreject when T = 250, it improves dramatically as the sample size T increases. The sizes of all tests Q̂(m,0) are robust to the choice of a preliminary lag order used to estimate the generalized cross-spectrum.

• The omnibus test Q̂(0,0) has good omnibus power in detecting various model misspecifications. It has reasonable power even when the sample size T is as small as 250. This demonstrates the nice feature of the proposed cross-spectral approach, which can capture various model misspecifications.

• The diagnostic tests Q̂(m,0) can check various specific aspects of model misspecification. Generally speaking, the mean tests Q̂(m,0) with |m| = 1 can detect misspecification in the drifts; the variance and covariance tests Q̂(m,0) with |m| = 2 can check misspecifications in the variances and covariances respectively. However, the correlation tests fail to detect neglected nonzero but time-invariant conditional correlations in the diffusions.

[19] The power of the mean test Q̂((1,0,0)′,0) for the first state variable X_{1t} is close to the significance level. This is expected because the drift function of X_{1t} is correctly specified.

7. Conclusion

The CCF-based estimation of multivariate continuous-time models has attracted increasing attention in econometrics. We have complemented this literature by proposing a CCF-based omnibus specification test for the adequacy of a multivariate continuous-time model, which has not been attempted in the previous literature. The proposed test can be applied to a variety of univariate and multivariate continuous-time models, including those with jumps. It is also applicable to testing the adequacy of discrete-time dynamic multivariate distribution models. The most appealing feature of our omnibus test is that it fully exploits the information in the joint dynamics of the state variables and thus can capture misspecification in modeling the joint dynamics, which may be easily missed by existing procedures. Indeed, when the underlying economic process is Markov, our omnibus test is consistent against any type of model misspecification. We assume that the DGP of the state variables may not be Markov. This not only ensures the power of the proposed tests against a wider range of misspecification but also makes our approach applicable to testing models with latent variables, such as SV models.

Our omnibus test is supplemented by a class of diagnostic procedures, which are obtained by differentiating the CCF and focus on specific aspects of the joint dynamics, such as whether there is neglected dynamics in the conditional means, conditional variances, and conditional correlations of the state variables. Such information is useful for practitioners in reconstructing a misspecified model. Our procedures are most useful when the CCF of a multivariate time series model has a closed form, as is the case for the class of AJD models and the class of time-changed Lévy processes that have been commonly used in the literature. All test statistics follow a convenient asymptotic N(0,1) distribution, and they are applicable to various estimation methods, including suboptimal consistent estimators. Moreover, parameter estimation uncertainty has no impact on the asymptotic distribution of the test statistics. Simulation studies show that the proposed tests perform reasonably well in finite samples.

Acknowledgments

We thank the co-editor, A. Ronald Gallant, and two referees for careful and constructive comments. We also thank Yacine Ait-Sahalia, Zongwu Cai, Jianqing Fan, Jerry Hausman, Roger Klein, Chung-ming Kuan, Arthur Lewbel, Norm Swanson, Yiu Kuen Tse, Chunchi Wu and Zhijie Xiao, and seminar participants at Boston College, Rutgers University and Singapore Management University for their comments and discussions. Any remaining errors are solely ours.
Appendix. Mathematical appendix

Throughout the appendix, we let $\tilde Q(0,0)$ be defined in the same way as $\hat Q(0,0)$ in (3.8), with the unobservable generalized residual sample $\{Z_t(u,\theta_0)\}_{t=1}^T$ replacing the estimated generalized residual sample $\{\hat Z_t(u,\hat\theta)\}_{t=1}^T$. Also, $C \in (1,\infty)$ denotes a generic bounded constant.

Proof of Theorem 1. The proof of Theorem 1 consists of the proofs of Theorems A.1 and A.2 below.

Theorem A.1. Under the conditions of Theorem 1, $\hat Q(0,0) - \tilde Q(0,0) \xrightarrow{p} 0$.

Theorem A.2. Under the conditions of Theorem 1 and $q = p^{1+\frac{1}{4b-2}}(\ln^2 T)^{\frac{1}{2b-1}}$, $\tilde Q(0,0) \xrightarrow{d} N(0,1)$.

Proof of Theorem A.1. Put $T_j \equiv T - |j|$, and let $\tilde\Gamma_j(u,v)$ be defined in the same way as $\hat\Gamma_j(u,v)$ in (3.4), with $\hat Z_t(u,\hat\theta)$ replaced by $Z_t(u,\theta_0)$. To show $\hat Q(0,0) - \tilde Q(0,0) \xrightarrow{p} 0$, it suffices to show
\[
\hat D^{-\frac12}(0,0) \int\!\!\int \sum_{j=1}^{T-1} k^2(j/p)\, T_j \bigl[ |\hat\Gamma_j(u,v)|^2 - |\tilde\Gamma_j(u,v)|^2 \bigr]\, dW(u)\,dW(v) \xrightarrow{p} 0, \tag{A.1}
\]
$p^{-1}[\hat C(0,0) - \tilde C(0,0)] = O_P(T^{-\frac12})$, and $p^{-1}[\hat D(0,0) - \tilde D(0,0)] = o_P(1)$, where $\tilde C(0,0)$ and $\tilde D(0,0)$ are defined in the same way as $\hat C(0,0)$ and $\hat D(0,0)$ in (3.8), with $\hat Z_t(u,\hat\theta)$ replaced by $Z_t(u,\theta_0)$. For space, we focus on the proof of (A.1); the proofs for $p^{-1}[\hat C(0,0) - \tilde C(0,0)] = O_P(T^{-\frac12})$ and $p^{-1}[\hat D(0,0) - \tilde D(0,0)] = o_P(1)$ are straightforward. We note that it is necessary to obtain the convergence rate $O_P(pT^{-\frac12})$ for $\hat C(0,0) - \tilde C(0,0)$ to ensure that replacing $\hat C(0,0)$ with $\tilde C(0,0)$ has asymptotically negligible impact given $p/T \to 0$.

To show (A.1), we first decompose
\[
\int\!\!\int \sum_{j=1}^{T-1} k^2(j/p)\, T_j \bigl[ |\hat\Gamma_j(u,v)|^2 - |\tilde\Gamma_j(u,v)|^2 \bigr]\, dW(u)\,dW(v) = \hat A_1 + 2\,\mathrm{Re}(\hat A_2), \tag{A.2}
\]
where
\[
\hat A_1 = \int\!\!\int \sum_{j=1}^{T-1} k^2(j/p)\, T_j \bigl| \hat\Gamma_j(u,v) - \tilde\Gamma_j(u,v) \bigr|^2\, dW(u)\,dW(v),
\qquad
\hat A_2 = \int\!\!\int \sum_{j=1}^{T-1} k^2(j/p)\, T_j \bigl[ \hat\Gamma_j(u,v) - \tilde\Gamma_j(u,v) \bigr] \tilde\Gamma_j(u,v)^*\, dW(u)\,dW(v),
\]
where $\mathrm{Re}(\hat A_2)$ is the real part of $\hat A_2$ and $\tilde\Gamma_j(u,v)^*$ is the complex conjugate of $\tilde\Gamma_j(u,v)$. Then (A.1) follows from Propositions A.1 and A.2 below, and $p \to \infty$ as $T \to \infty$.

Proposition A.1. Under the conditions of Theorem 1, $\hat A_1 = O_P(1)$.

Proposition A.2. Under the conditions of Theorem 1, $p^{-\frac12}\hat A_2 \xrightarrow{p} 0$.

Proof of Proposition A.1. Put $\psi_t(v) \equiv e^{iv'X_t} - \varphi(v)$ and $\varphi(v) \equiv E(e^{iv'X_t})$. Then straightforward algebra yields that for $j > 0$,
\[
\hat\Gamma_j(u,v) - \tilde\Gamma_j(u,v)
= T_j^{-1} \sum_{t=j+1}^{T} \bigl[ \hat Z_t(u,\hat\theta) - Z_t(u,\theta_0) \bigr] \psi_{t-j}(v)
+ \bigl[ \varphi(v) - \hat\varphi_j(v) \bigr]\, T_j^{-1} \sum_{t=j+1}^{T} \bigl[ \hat Z_t(u,\hat\theta) - Z_t(u,\theta_0) \bigr]
= \hat B_{1j}(u,v) + \hat B_{2j}(u,v), \ \text{say.} \tag{A.3}
\]
It follows that $\hat A_1 \le 2 \sum_{a=1}^{2} \sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{aj}(u,v)|^2\, dW(u)\,dW(v)$. Proposition A.1 follows from Lemmas A.1 and A.2 below, and $p/T \to 0$.

Lemma A.1. $\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{1j}(u,v)|^2\, dW(u)\,dW(v) = O_P(1)$.

Lemma A.2. $\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{2j}(u,v)|^2\, dW(u)\,dW(v) = O_P(p/T)$.

We now show these lemmas. Throughout, we put $a_T(j) \equiv k^2(j/p)\, T_j^{-1}$.

Proof of Lemma A.1. A second order Taylor series expansion yields
\[
\hat B_{1j}(u,v) = -(\hat\theta - \theta_0)'\, T_j^{-1} \sum_{t=j+1}^{T} \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta_0)\,\psi_{t-j}(v)
- \frac12 (\hat\theta - \theta_0)'\, T_j^{-1} \sum_{t=j+1}^{T} \frac{\partial^2}{\partial\theta\,\partial\theta'}\varphi(u,t|I_{t-1},\bar\theta)\,\psi_{t-j}(v)\,(\hat\theta - \theta_0)
+ T_j^{-1} \sum_{t=j+1}^{T} \bigl[ \varphi(u,t|I_{t-1},\hat\theta) - \varphi^{\dagger}(u,t|I_{t-1},\hat\theta) \bigr] \psi_{t-j}(v)
= -\hat B_{11j}(u,v) - \hat B_{12j}(u,v) + \hat B_{13j}(u,v), \ \text{say,} \tag{A.4}
\]
for some $\bar\theta$ between $\hat\theta$ and $\theta_0$. For the second term in (A.4), we have
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{12j}(u,v)|^2\, dW(u)\,dW(v)
\le C \bigl| \sqrt{T}(\hat\theta - \theta_0) \bigr|^4 \left[ T^{-1}\sum_{t=1}^{T} \sup_{u\in\mathbb{R}^N}\sup_{\theta\in\Theta} \Bigl| \frac{\partial^2}{\partial\theta\,\partial\theta'}\varphi(u,t|I_{t-1},\theta) \Bigr| \right]^2 \sum_{j=1}^{T-1} a_T(j) \int\!\!\int dW(u)\,dW(v) = O_P(p/T),
\]
where we made use of the fact that
\[
\sum_{j=1}^{T-1} a_T(j) = \sum_{j=1}^{T-1} k^2(j/p)\, T_j^{-1} = O(p/T), \tag{A.5}
\]
given $p = cT^{\lambda}$ for $\lambda \in (0,1)$, as shown in Hong (1999, (A.15), p. 1213). For the third term in (A.4), we have
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{13j}(u,v)|^2\, dW(u)\,dW(v)
\le 4 \sum_{j=1}^{T-1} a_T(j) \int\!\!\int \left[ \sum_{t=j+1}^{T} \bigl| \varphi(u,t|I_{t-1},\hat\theta) - \varphi^{\dagger}(u,t|I_{t-1},\hat\theta) \bigr| \right]^2 dW(u)\,dW(v) = O_P(p/T), \tag{A.6}
\]
where we have used Assumptions A.5–A.7.
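The rate in (A.5) is easy to check numerically. The sketch below evaluates Σ_{j=1}^{T−1} k²(j/p)/T_j for the Bartlett kernel k(z) = (1 − |z|)1(|z| ≤ 1) and compares it with p/T; the Bartlett kernel and the bandwidth rule are our illustrative choices — any kernel satisfying Assumption A.6 behaves analogously.

```python
import numpy as np

def sum_aT(T, p):
    """Sum_{j=1}^{T-1} k^2(j/p) / (T - j) for the Bartlett kernel k(z) = max(1 - |z|, 0)."""
    j = np.arange(1, T)
    k = np.clip(1.0 - j / p, 0.0, None)
    return float(np.sum(k ** 2 / (T - j)))

for T in (250, 1000, 4000):
    p = int(4 * T ** 0.2)            # an illustrative bandwidth p = c * T**lambda
    print(T, p, round(sum_aT(T, p), 5), round(p / T, 5))   # same order of magnitude
```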
For the first term in (A.4), we have
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{11j}(u,v)|^2\, dW(u)\,dW(v)
\le \bigl| \sqrt{T}(\hat\theta - \theta_0) \bigr|^2 \sum_{j=1}^{T-1} k^2(j/p) \int\!\!\int \Bigl| T_j^{-1}\sum_{t=j+1}^{T} \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta_0)\,\psi_{t-j}(v) \Bigr|^2 dW(u)\,dW(v) = O_P(1), \tag{A.7}
\]
as is shown below. Put
\[
\eta_j(u,v) \equiv E\Bigl[ \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta_0)\,\psi_{t-j}(v) \Bigr]
= \mathrm{cov}\Bigl[ \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta_0),\, \psi_{t-j}(v) \Bigr].
\]
Then we have $\sup_{u,v\in\mathbb{R}^{2N}} \sum_{j=1}^{\infty} |\eta_j(u,v)| \le C$ by Assumption A.4. Next, expressing the moments by cumulants via well-known formulas (e.g., Hannan (1970, (5.1), p. 23), for real-valued processes), we can obtain
\[
T_j\, E\Bigl| T_j^{-1}\sum_{t=j+1}^{T} \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta_0)\,\psi_{t-j}(v) - \eta_j(u,v) \Bigr|^2
\le \sum_{\tau=-T_j}^{T_j} \Bigl| \mathrm{cov}\Bigl[ \frac{\partial}{\partial\theta}\varphi\bigl(u,\max(0,\tau)+2\,\big|\,I_{\max(0,\tau)+1},\theta_0\bigr),\ \frac{\partial}{\partial\theta}\varphi\bigl(u,\max(0,\tau)+2-\tau\,\big|\,I_{\max(0,\tau)+1-\tau},\theta_0\bigr)' \Bigr] \Bigr|\, \bigl|\Omega_\tau(v,-v)\bigr|
+ \sum_{\tau=-T_j}^{T_j} \bigl| \eta_{j+|\tau|}(u,-v)\,\eta_{j-|\tau|}(u,v) \bigr|
+ \sum_{\tau=-T_j}^{T_j} \bigl| \kappa_{j,|\tau|,j+|\tau|}(u,v) \bigr| \le C, \tag{A.8}
\]
given Assumption A.4, where $\kappa_{j,l,\tau}(u,v)$ is the fourth order cumulant of the joint distribution of the process $\{\frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta_0),\ \psi_{t-j}(v),\ \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta_0),\ \psi_{t-j}^*(v)\}$. See also (A.7) of Hong (1999, p. 1212). Consequently, from (A.5), (A.8), $|k(\cdot)| \le 1$, and $p/T \to 0$, we have
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j\, E\int\!\!\int \Bigl| T_j^{-1}\sum_{t=j+1}^{T} \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta_0)\,\psi_{t-j}(v) \Bigr|^2 dW(u)\,dW(v)
\le C \sum_{j=1}^{T-1} \int\!\!\int |\eta_j(u,v)|^2\, dW(u)\,dW(v) + C \sum_{j=1}^{T-1} a_T(j)
= O_P(1) + O_P(p/T) = O_P(1).
\]
Hence (A.7) is $O_P(1)$. The desired result of Lemma A.1 follows from (A.5)–(A.7).

Proof of Lemma A.2. We first decompose $\hat B_{2j}(u,v)$ into
\[
\hat B_{2j}(u,v) = \bigl[ \varphi(v) - \hat\varphi_j(v) \bigr]\, T_j^{-1} \sum_{t=j+1}^{T} \bigl[ \varphi(u,t|I_{t-1},\theta_0) - \varphi(u,t|I_{t-1},\hat\theta) \bigr]
+ \bigl[ \varphi(v) - \hat\varphi_j(v) \bigr]\, T_j^{-1} \sum_{t=j+1}^{T} \bigl[ \varphi(u,t|I_{t-1},\hat\theta) - \varphi^{\dagger}(u,t|I_{t-1},\hat\theta) \bigr]
= \hat B_{21j}(u,v) + \hat B_{22j}(u,v), \ \text{say.}
\]
For the first term, we have
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{21j}(u,v)|^2\, dW(u)\,dW(v)
\le \sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int \bigl| \varphi(v) - \hat\varphi_j(v) \bigr|^2\, T_j^{-1} \sum_{t=j+1}^{T} \bigl| \varphi(u,t|I_{t-1},\theta_0) - \varphi(u,t|I_{t-1},\hat\theta) \bigr|^2\, dW(u)\,dW(v) = O_P(p/T), \tag{A.9}
\]
where we made use of the fact that $E|\varphi(v) - \hat\varphi_j(v)|^2 \le CT_j^{-1}$, given Assumption A.4 and
\[
T_j^{-1} \sum_{t=j+1}^{T} \bigl| \varphi(u,t|I_{t-1},\theta_0) - \varphi(u,t|I_{t-1},\hat\theta) \bigr|^2
\le T_j\, \bigl| \hat\theta - \theta_0 \bigr|^2\, T^{-1}\sum_{t=1}^{T} \sup_{u\in\mathbb{R}^N}\sup_{\theta\in\Theta} \Bigl| \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta) \Bigr|^2 = O_P(1).
\]
For the second term, we have
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{22j}(u,v)|^2\, dW(u)\,dW(v)
\le \sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int \bigl| \varphi(v) - \hat\varphi_j(v) \bigr|^2\, T_j^{-1} \sum_{t=j+1}^{T} \sup_{u\in\mathbb{R}^N}\sup_{\theta\in\Theta} \bigl| \varphi(u,t|I_{t-1},\theta) - \varphi^{\dagger}(u,t|I_{t-1},\theta) \bigr|^2\, dW(u)\,dW(v) = O_P(p/T), \tag{A.10}
\]
where we again made use of the fact that $E|\varphi(v) - \hat\varphi_j(v)|^2 \le CT_j^{-1}$, given Assumptions A.4–A.7. The desired result of Lemma A.2 follows from (A.9) and (A.10).

Proof of Proposition A.2. Given the decomposition in (A.3), we have
\[
\bigl| [\hat\Gamma_j(u,v) - \tilde\Gamma_j(u,v)]\, \tilde\Gamma_j(u,v)^* \bigr| \le \sum_{a=1}^{2} |\hat B_{aj}(u,v)|\,|\tilde\Gamma_j(u,v)|,
\]
where the $\hat B_{aj}(u,v)$ are defined in (A.3), $a = 1,2$. We first consider the term with $a = 2$. By the Cauchy–Schwarz inequality, we have
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{2j}(u,v)|\,|\tilde\Gamma_j(u,v)|\, dW(u)\,dW(v)
\le \Bigl[ \sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{2j}(u,v)|^2\, dW(u)\,dW(v) \Bigr]^{\frac12}
\Bigl[ \sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\tilde\Gamma_j(u,v)|^2\, dW(u)\,dW(v) \Bigr]^{\frac12}
= O_P(p^{\frac12}/T^{\frac12})\, O_P(p^{\frac12}) = o_P(p^{\frac12}), \tag{A.11}
\]
given Lemma A.2 and $p/T \to 0$, where $p^{-1}\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\tilde\Gamma_j(u,v)|^2\, dW(u)\,dW(v) = O_P(1)$ by Markov's inequality, the m.d.s. hypothesis of $Z_t(u,\theta_0)$, and (A.5).
For $a = 1$, by (A.4) and the triangular inequality, we have
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{1j}(u,v)|\,|\tilde\Gamma_j(u,v)|\, dW(u)\,dW(v)
\le \sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{11j}|\,|\tilde\Gamma_j|\, dW\,dW
+ \sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{12j}|\,|\tilde\Gamma_j|\, dW\,dW
+ \sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{13j}|\,|\tilde\Gamma_j|\, dW\,dW. \tag{A.12}
\]
For the first term in (A.12), we have
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{11j}(u,v)|\,|\tilde\Gamma_j(u,v)|\, dW(u)\,dW(v)
\le |\hat\theta - \theta_0| \sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int \Bigl| T_j^{-1}\sum_{t=j+1}^{T} \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta_0)\,\psi_{t-j}(v) \Bigr|\, \bigl| \tilde\Gamma_j(u,v) \bigr|\, dW(u)\,dW(v)
= O_P\bigl(1 + p/T^{\frac12}\bigr) = o_P\bigl(p^{\frac12}\bigr),
\]
given $p \to \infty$, $p/T \to 0$, Assumptions A.2, A.3, A.6 and A.7, and $T_j\, E|\tilde\Gamma_j(u,v)|^2 \le C$. Note that we have made use of the fact that
\[
E\Bigl| T_j^{-1}\sum_{t=j+1}^{T} \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta_0)\,\psi_{t-j}(v)\ \tilde\Gamma_j(u,v) \Bigr|
\le \Bigl[ E\Bigl| T_j^{-1}\sum_{t=j+1}^{T} \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta_0)\,\psi_{t-j}(v) \Bigr|^2 \Bigr]^{\frac12}
\Bigl[ E\bigl| \tilde\Gamma_j(u,v) \bigr|^2 \Bigr]^{\frac12}
\le C\Bigl[ |\eta_j(u,v)| + CT_j^{-\frac12} \Bigr]\, T_j^{-\frac12},
\]
by (A.7), and consequently,
\[
|\hat\theta - \theta_0| \sum_{j=1}^{T-1} k^2(j/p)\, T_j\, E\int\!\!\int \Bigl| T_j^{-1}\sum_{t=j+1}^{T} \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta_0)\,\psi_{t-j}(v) \Bigr|\, \bigl|\tilde\Gamma_j(u,v)\bigr|\, dW(u)\,dW(v)
\le C \sum_{j=1}^{T-1} \int\!\!\int |\eta_j(u,v)|\, dW(u)\,dW(v) + CT^{-\frac12}\sum_{j=1}^{T-1} k^2(j/p)
= O\bigl(1 + p/T^{\frac12}\bigr),
\]
given $|k(\cdot)| \le 1$ and Assumption A.7. For the second term in (A.12), we have
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{12j}(u,v)|\,|\tilde\Gamma_j(u,v)|\, dW(u)\,dW(v)
\le CT\bigl|\hat\theta - \theta_0\bigr|^2 \Bigl[ T^{-1}\sum_{t=1}^{T}\sup_{u\in\mathbb{R}^N}\sup_{\theta\in\Theta}\Bigl| \frac{\partial^2}{\partial\theta\,\partial\theta'}\varphi(u,t|I_{t-1},\theta) \Bigr|^2 \Bigr]
\sum_{j=1}^{T-1} k^2(j/p)\int\!\!\int \bigl| \tilde\Gamma_j(u,v) \bigr|\, dW(u)\,dW(v)
= O_P\bigl(p/T^{\frac12}\bigr),
\]
by the Cauchy–Schwarz inequality, Markov's inequality, Assumptions A.2, A.3, A.6 and A.7, and $E|\tilde\Gamma_j(u,v)|^2 \le CT_j^{-1}$. For the third term in (A.12), we have
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{13j}(u,v)|\,|\tilde\Gamma_j(u,v)|\, dW(u)\,dW(v)
\le C \sum_{t=1}^{T}\sup_{u\in\mathbb{R}^N}\sup_{\theta\in\Theta}\bigl| \varphi(u,t|I_{t-1},\theta) - \varphi^{\dagger}(u,t|I_{t-1},\theta) \bigr|
\sum_{j=1}^{T-1} k^2(j/p)\int\!\!\int \bigl| \tilde\Gamma_j(u,v) \bigr|\, dW(u)\,dW(v)
= O_P\bigl(p/T^{\frac12}\bigr),
\]
by the Cauchy–Schwarz inequality, Markov's inequality, Assumptions A.2, A.3 and A.5–A.7, and $E|\tilde\Gamma_j(u,v)|^2 \le CT_j^{-1}$. Hence, we have
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{1j}(u,v)|\,|\tilde\Gamma_j(u,v)|\, dW(u)\,dW(v)
= O_P\bigl(1 + p/T^{\frac12}\bigr) + O_P\bigl(p/T^{\frac12}\bigr) + O_P\bigl(p/T^{\frac12}\bigr) = o_P\bigl(p^{\frac12}\bigr). \tag{A.13}
\]
Combining (A.11) and (A.13) then yields the result of this proposition.

Proof of Theorem A.2. We first state a lemma, which will be used in the proof.

Lemma A.3. Suppose that $\mathcal{I}_m^n$ are the $\sigma$-fields generated by a stationary $\beta$-mixing process $X_i$ with mixing coefficient $\beta(i)$. For some positive integers $m$ let $y_i \in \mathcal{I}_{s_i}^{t_i}$ where $s_1 < t_1 < s_2 < t_2 < \cdots < t_m$ and suppose $t_i - s_i > \tau$ for all $i$. Assume further that $\|y_i\|_{p_i}^{p_i} = E|y_i|^{p_i} < \infty$ for some $p_i > 1$ for which $Q = \sum_{i=1}^{l} p_i^{-1} < 1$. Then
\[
\Bigl| E\prod_{i=1}^{l} y_i - \prod_{i=1}^{l} E(y_i) \Bigr| \le 10\,(l-1)\,\beta^{(1-Q)}(\tau)\, \prod_{i=1}^{l} \|y_i\|_{p_i}.
\]

Proof of Lemma A.3. See Theorem 5.4 of Roussas and Ioannides (1987).

Let $q = p^{1+\frac{1}{4b-2}}(\ln^2 T)^{\frac{1}{2b-1}}$. We shall show Propositions A.3 and A.4 below.

Proposition A.3. Under the conditions of Theorem 1,
\[
p^{-\frac12} \sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int \bigl| \tilde\Gamma_j(u,v) \bigr|^2\, dW(u)\,dW(v)
= p^{-\frac12}\,\tilde C(0,0) + p^{-\frac12}\,\tilde V_q + o_P(1),
\]
where
\[
\tilde V_q = \sum_{t=2q+2}^{T} \int\!\!\int Z_t(u,\theta_0) \sum_{j=1}^{q} a_T(j)\,\psi_{t-j}(v) \Bigl[ \sum_{s=j+1}^{t-2q-1} Z_s^*(u,\theta_0)\,\psi_{s-j}^*(v) \Bigr]\, dW(u)\,dW(v).
\]
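For later use it may help to record the case l = 2 of Lemma A.3 explicitly — it is the covariance (Davydov-type) inequality invoked in (A.21)–(A.23) and in the proofs of Lemmas A.9–A.12 below. This specialization is our addition; only the general statement appears in the paper.

```latex
% Lemma A.3 with l = 2: for y_1, y_2 measurable with respect to sigma-fields
% separated by a gap larger than tau, with E|y_i|^{p_i} < infinity and
% 1/p_1 + 1/p_2 < 1,
\[
\bigl| \operatorname{cov}(y_1, y_2) \bigr|
  = \bigl| E(y_1 y_2) - E(y_1)\,E(y_2) \bigr|
  \;\le\; 10\,\beta^{\,1 - 1/p_1 - 1/p_2}(\tau)\, \|y_1\|_{p_1}\, \|y_2\|_{p_2}.
\]
% Taking p_1 = p_2 = 2(1+\delta) gives the exponent \delta/(1+\delta) that
% appears in the mixing bounds (A.21)-(A.23).
```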
Proposition A.4. Under the conditions of Theorem 1, $\tilde D^{-\frac12}(0,0)\,\tilde V_q \xrightarrow{d} N(0,1)$.

Proof of Proposition A.3. We first decompose
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int \bigl| \tilde\Gamma_j(u,v) \bigr|^2\, dW(u)\,dW(v)
= \int\!\!\int \sum_{j=1}^{T-1} a_T(j) \Bigl| \sum_{t=j+1}^{T} Z_t(u,\theta_0)\,\psi_{t-j}(v) \Bigr|^2\, dW(u)\,dW(v)
+ \int\!\!\int \sum_{j=1}^{T-1} a_T(j) \Bigl| \sum_{t=j+1}^{T} Z_t(u,\theta_0) \Bigr|^2 \bigl| \varphi(v) - \hat\varphi_j(v) \bigr|^2\, dW(u)\,dW(v)
\]
\[
+ 2\,\mathrm{Re} \int\!\!\int \sum_{j=1}^{T-1} a_T(j) \sum_{t=j+1}^{T} Z_t(u,\theta_0)\,\psi_{t-j}(v)\, \sum_{t=j+1}^{T} Z_t^*(u,\theta_0)\,\bigl[ \varphi(v) - \hat\varphi_j(v) \bigr]^*\, dW(u)\,dW(v)
\equiv \tilde M + \tilde R_1 + 2\,\mathrm{Re}(\tilde R_2). \tag{A.14}
\]
Next we write
\[
\tilde M = \sum_{j=1}^{T-1} a_T(j) \int\!\!\int \sum_{t=j+1}^{T} \bigl| Z_t(u,\theta_0) \bigr|^2 \bigl| \psi_{t-j}(v) \bigr|^2\, dW(u)\,dW(v)
+ 2\,\mathrm{Re} \sum_{j=1}^{T-1} a_T(j) \int\!\!\int \sum_{t=j+2}^{T} \sum_{s=j+1}^{t-1} Z_t(u,\theta_0)\,Z_s^*(u,\theta_0)\,\psi_{t-j}(v)\,\psi_{s-j}^*(v)\, dW(u)\,dW(v)
\equiv \tilde C_1(0,0) + 2\,\mathrm{Re}(\tilde U), \tag{A.15}
\]
where we further decompose
\[
\tilde U = \sum_{t=2q+2}^{T} \int\!\!\int Z_t(u,\theta_0) \sum_{j=1}^{t-1} a_T(j)\,\psi_{t-j}(v) \sum_{s=j+1}^{t-2q-1} Z_s^*(u,\theta_0)\,\psi_{s-j}^*(v)\, dW(u)\,dW(v)
+ \sum_{t=2}^{T} \int\!\!\int Z_t(u,\theta_0) \sum_{j=1}^{t-1} a_T(j)\,\psi_{t-j}(v) \sum_{s=\max(j+1,\,t-2q)}^{t-1} Z_s^*(u,\theta_0)\,\psi_{s-j}^*(v)\, dW(u)\,dW(v)
\equiv \tilde U_1 + \tilde R_3, \tag{A.16}
\]
where in the first term $\tilde U_1$ we have $t - s > 2q$, so that we can bound it with the mixing inequality; in the second term $\tilde R_3$ we have $0 < t - s \le 2q$. Finally, we write
\[
\tilde U_1 = \sum_{t=2q+2}^{T} \int\!\!\int Z_t(u,\theta_0) \sum_{j=1}^{q} a_T(j)\,\psi_{t-j}(v) \sum_{s=j+1}^{t-2q-1} Z_s^*(u,\theta_0)\,\psi_{s-j}^*(v)\, dW(u)\,dW(v)
+ \sum_{t=2q+2}^{T} \int\!\!\int Z_t(u,\theta_0) \sum_{j=q+1}^{t-1} a_T(j)\,\psi_{t-j}(v) \sum_{s=j+1}^{t-2q-1} Z_s^*(u,\theta_0)\,\psi_{s-j}^*(v)\, dW(u)\,dW(v)
\equiv \tilde V_q + \tilde R_4, \tag{A.17}
\]
where the first term $\tilde V_q$ is contributed by the lag orders $j$ from $1$ to $q$, and the second term $\tilde R_4$ is contributed by the lag orders $j > q$. It follows from (A.14)–(A.17) that
\[
\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int \bigl| \tilde\Gamma_j(u,v) \bigr|^2\, dW(u)\,dW(v)
= \tilde C_1(0,0) + 2\,\mathrm{Re}(\tilde V_q) + \tilde R_1 + 2\,\mathrm{Re}(\tilde R_2 + \tilde R_3 + \tilde R_4).
\]
It suffices to show Lemmas A.4–A.8 below, which imply $p^{-\frac12}[\tilde C_1(0,0) - \tilde C(0,0)] = o_P(1)$ and $p^{-\frac12}\tilde R_a = o_P(1)$, $a = 1,\ldots,4$, given $q = p^{1+\frac{1}{4b-2}}(\ln^2 T)^{\frac{1}{2b-1}}$ and $p = cT^{\lambda}$ for $0 < \lambda < \bigl(3 + \frac{1}{4b-2}\bigr)^{-1}$.

Lemma A.4. Let $\tilde C_1(0,0)$ be defined as in (A.15). Then $\tilde C_1(0,0) - \tilde C(0,0) = O_P(p/T^{\frac12})$.

Lemma A.5. Let $\tilde R_1$ be defined as in (A.14). Then $\tilde R_1 = O_P(p/T)$.

Lemma A.6. Let $\tilde R_2$ be defined as in (A.14). Then $\tilde R_2 = O_P(p/T^{\frac12})$.

Lemma A.7. Let $\tilde R_3$ be defined as in (A.16). Then $\tilde R_3 = O_P(qp/T^{\frac12})$.

Lemma A.8. Let $\tilde R_4$ be defined as in (A.17). Then $\tilde R_4 = O_P(p^{2b}\ln T/q^{2b-1})$.

Proof of Lemma A.4. The result follows by Markov's inequality and $E|\tilde C_1(0,0) - \tilde C(0,0)| \le Cp/T^{1/2}$, given $\sum_{j=1}^{T-1}(j/p)\,a_T(j) = O(p/T)$ and $E|\varphi(v) - \hat\varphi_j(v)|^4 = O(T^{-2})$, which are shown in Hong (1999) and below, respectively. Let $t_1,\ldots,t_4$ be distinct integers with $|j|+1 \le t_i \le T$, let $|j|+1 \le r_1 < \cdots < r_4 \le T$ be the permutation of $t_1,\ldots,t_4$ in ascending order, and let $d_c$ be the $c$-th largest difference among $r_{l+1} - r_l$, $l = 1,2,3$. By Lemma 1 of Yoshihara (1976), we have
\[
\sum_{\substack{|j|+1\le r_1<\cdots<r_4\le T\\ d_1 = r_2 - r_1}}
\Bigl| E\Bigl[ \bigl(e^{iv'X_{r_1-j}}-\varphi(v)\bigr)\bigl(e^{iv'X_{r_2-j}}-\varphi(v)\bigr)\bigl(e^{iv'X_{r_3-j}}-\varphi(v)\bigr)\bigl(e^{iv'X_{r_4-j}}-\varphi(v)\bigr) \Bigr] \Bigr|
\le 4\,C^{\frac{1}{1+\delta}} \sum_{r_1=|j|+1}^{T-3}\sum_{r_2=r_1+1}^{T-2} (r_2-r_1)^2\,\beta^{\frac{\delta}{1+\delta}}(r_2-r_1)
\le 4\,T\,C^{\frac{1}{1+\delta}} \sum_{r=|j|+1}^{T} r^2\,\beta^{\frac{\delta}{1+\delta}}(r) = O(T).
\]
Similarly, the sums over the configurations in which $d_1$ is one of the other gaps are also $O(T)$. Further, it can be shown in a similar way that if the indices are not all distinct from each other, we have
\[
\sum_{\substack{|j|+1\le t_1,t_2,t_3\le T\\ t_1,t_2,t_3\ \text{different}}}
\Bigl| E\Bigl[ \bigl(e^{iv'X_{t_1-j}}-\varphi(v)\bigr)^2\bigl(e^{iv'X_{t_2-j}}-\varphi(v)\bigr)\bigl(e^{iv'X_{t_3-j}}-\varphi(v)\bigr) \Bigr] \Bigr| = O(T^2),
\]
and
\[
\sum_{\substack{|j|+1\le t_1,t_2\le T\\ t_1,t_2\ \text{different}}}
E\Bigl| \bigl(e^{iv'X_{t_1-j}}-\varphi(v)\bigr)^2\bigl(e^{iv'X_{t_2-j}}-\varphi(v)\bigr)^2 \Bigr| = O(T^2).
\]
Therefore, $E|\varphi(v) - \hat\varphi_j(v)|^4 = O(T^{-2})$.

Proof of Lemma A.5.
\[
E|\tilde R_1| \le \sum_{j=1}^{T-1} a_T(j) \int\!\!\int \Bigl[ E\Bigl| \sum_{t=j+1}^{T} Z_t(u,\theta_0) \Bigr|^4 \Bigr]^{\frac12}
\Bigl[ E\bigl| \varphi(v) - \hat\varphi_j(v) \bigr|^4 \Bigr]^{\frac12}\, dW(u)\,dW(v) = O(p/T),
\]
where we have used the fact that $E|\sum_{t=j+1}^{T} Z_t(u,\theta_0)|^4 \le CT_j^2$ by Rosenthal's inequality and $E|\varphi(v) - \hat\varphi_j(v)|^4 = O(T^{-2})$.

Proof of Lemma A.6. By the m.d.s. property of $Z_t(u,\theta_0)$ and the Cauchy–Schwarz inequality, we have
\[
E|\tilde R_2| \le 2\sum_{j=1}^{T-1} a_T(j) \int\!\!\int \Bigl[ E\Bigl| \sum_{t=j+1}^{T} Z_t(u,\theta_0)\,\psi_{t-j}(v) \Bigr|^2 \Bigr]^{\frac12}
\Bigl[ E\Bigl| \sum_{t=j+1}^{T} Z_t(u,\theta_0)\,\bigl[\varphi(v) - \hat\varphi_j(v)\bigr] \Bigr|^2 \Bigr]^{\frac12}\, dW(u)\,dW(v)
\]
\[
\le 2\sum_{j=1}^{T-1} a_T(j) \int\!\!\int \Bigl[ E \sum_{t=j+1}^{T} \bigl| Z_t(u,\theta_0)\,\psi_{t-j}(v) \bigr|^2 \Bigr]^{\frac12}
\Bigl[ E\Bigl| \sum_{t=j+1}^{T} Z_t(u,\theta_0) \Bigr|^4 \Bigr]^{\frac14}
\Bigl[ E\bigl| \varphi(v) - \hat\varphi_j(v) \bigr|^4 \Bigr]^{\frac14}\, dW(u)\,dW(v)
= O\bigl(p/T^{\frac12}\bigr),
\]
given $E|\varphi(v) - \hat\varphi_j(v)|^4 = O(T^{-2})$.
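The fourth-moment rate E|φ(v) − φ̂_j(v)|⁴ = O(T⁻²), which drives Lemmas A.4–A.6, can be sanity-checked by simulation. The sketch below does so for i.i.d. standard normal data — an assumption made purely for the check; the lemma itself covers β-mixing observations.

```python
import numpy as np

rng = np.random.default_rng(0)
v = 1.0
phi = np.exp(-v ** 2 / 2.0)                  # E exp(i*v*X) for X ~ N(0, 1)
for T in (250, 1000, 4000):
    draws = [abs(np.exp(1j * v * rng.standard_normal(T)).mean() - phi) ** 4
             for _ in range(2000)]
    m4 = float(np.mean(draws))
    print(T, m4, m4 * T ** 2)                # m4 * T^2 should stabilize as T grows
```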
B. Chen, Y. Hong / Journal of Econometrics 164 (2011) 268–293
Proof of Lemma A.7. By the m.d.s. property of Zt (u, θ0 ), Minkowski’s inequality and (A.5), we have
Further, it can be shown in a similar way that
− |j|+1≤r1 <···
×
e
′ E eiv Xr1 −j − ϕ (v)
iv′ Xr −j 2
2
=O T
− ϕ (v)
e
iv′ Xr −j 3
′ − ϕ (v) eiv Xr4 −j − ϕ (v)
e
Zs (u, θ0 ) ψs∗−j (v)dW ∗
s=max(j+1,t −2q)
2 E eiv′ Xt1 −j − ϕ (v)
|j|+1≤t1 ,t2 ,t3 ≤T t1 ,t2 ,t3 different
t −1 −
×
Similar to the above, we can show that if rs are not distinct to each other, we have
×
∫∫ T t −1 2 − − ˜ E R3 = E aT (j) Zt (u, θ0 ) ψt −j (v) j =1 t =2
.
−
287
T − t −1 −
⩽
aT (j)
E Zt (u, θ0 ) ψt −j (v)
∫∫
j =1
t =2
2 (u) dW (v)
2 2 1/2 × Zs∗ (u, θ0 ) ψs∗−j (v) dW (u) dW (v) s=max(j+1,t −2q) t −1 −
iv′ Xt −j 2
′ − ϕ (v) eiv Xt3 −j − ϕ (v)
= O T2 , 2
≤ CTq
and
T −1 −
2 aT (j)
= O(p2 q2 /T ).
j=1
− |j|+1≤t1 ,t2 ≤T t1 ,t2 different
2 ′ 2 iv Xt −j E eiv′ Xt1 −j − ϕ (v) 2 e − ϕ v ( )
2
=O T
. 4
Therefore, E ϕ (v) − ϕˆ j (v) = O T −2 .
∫∫ t −1 T − 2 − ˜ aT (j) Zt (u, θ0 ) ψt −j (v) E E R4 = j=q+1 t =2q+2 2 t− 2q−1 − ∗ ∗ × Zs (u, θ0 ) ψs−j (v)dW (u) dW (v) s=j+1 ∫ ∫ t −1 T − − 4 1/4 aT (j) E Zt (u, θ0 ) ψt −j (v) ⩽
Proof of Lemma A.5.
E |R˜ 1 | ≤
T −1 −
aT (j)
∫∫
j =1
4 21 T − E Zt (u, θ0 ) t =j+1
4 21 × E ϕ (v) − ϕˆ j (v) dW (u) dW (v) = O (p/T ) , ∑ T
4
Zt (u, θ0 )
≤ CTj2 by 4 −2 Rosenthal’s inequality and E ϕ (v) − ϕˆ j (v) = O T . where we have used the fact E
t =j +1
Proof of Lemma A.6. By the m.d.s. property of Zt (u, θ0 ), the Cauchy–Schwarz inequality, we have
E |R˜ 2 | ≤ 2
T −1 − j=1
aT (j)
∫∫
2 21 T − E Zt (u, θ0 ) ψt −j (v) t =j+1
2 21 T − × E Z (u, θ0 ) ϕ (v) − ϕˆ j (v) dW (u) dW (v) t =j+1 t ≤2
T −1 − j =1
21 ∫∫ − T 2 Z (u, θ0 ) ψ 2 (v) aT (j) E t t −j t =j +1
j =q +1
t =2q+2
2 4 1/4 2q−1 t −− × E Zs∗ (u, θ0 ) ψs∗−j (v) dW (u) dW (v) s=j+1 ⩽ CT
2
2
T −1 −
aT (j)
⩽ CT
j=q+1
T −1 −
2
2 (j/p)
−2b −1
Tj
j =q +1
= O(p4b ln2 T /q4b−2 ), given Assumption A.6 (i.e., k(z ) ⩽ C |z |−b as z → ∞). Proof of Proposition A.4. We rewrite V˜ q = Vq (t ) =
∫∫
Zt (u, θ0 )
q −
∑T
t =2q+2
Vq (t ), where
aT (j)ψt −j (v)Hj,t −2q−1 (u, v)
j =1
× dW (u) dW (v), ∑t −2q−1 ∗ ∗ and Hj,t −2q−1 (u, v) = s=j+1 Zs (u, θ0 )ψs−j (v). We apply Brown’s 1 (1971) martingale limit theorem, which states var(2 Re V˜ q )− 2 2 Re d
4 14 T − × E Zt (u, θ0 ) t =j+1
V˜ q → N (0, 1) if var(2 Re V˜ q )−1
4 41 × E ϕ (v) − ϕˆ j (v) dW (u) dW (v) 1 = O p/T 2 , 4 given E ϕ (v) − ϕˆ j (v) = O T −2 .
Proof of Lemma A.8. By the m.d.s. property of Zt (u, θ0 ) and Minkowski’s inequality, we have
T −
2
2 Re Vq (t )
1
t =2q+2
1 × 2 Re Vq (t ) > η · var(2 Re V˜ q ) 2 → 0 ∀η > 0, var(2 Re V˜ q )−1
T − t =2q+2
E
2
2 Re Vq (t )
p |It −1 → 1.
(A.18) (A.19)
288
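Conditions (A.18) and (A.19) are, respectively, the conditional Lindeberg condition and the convergence of conditional variances in Brown's (1971) martingale central limit theorem. In generic notation (our paraphrase of the theorem, with ξ_t = 2 Re V_q(t)):

```latex
% Brown (1971): for a square-integrable martingale difference sequence {\xi_t, \mathcal{F}_t}
% with \sigma_T^2 = \sum_{t \le T} E(\xi_t^2),
% \sigma_T^{-1} \sum_{t \le T} \xi_t converges in distribution to N(0,1), provided
\[
\sigma_T^{-2} \sum_{t \le T} E\!\left[\xi_t^2\,\mathbf{1}\{|\xi_t| > \eta\,\sigma_T\}\right] \to 0
\ \ \forall\,\eta > 0,
\qquad
\sigma_T^{-2} \sum_{t \le T} E\!\left[\xi_t^2 \mid \mathcal{F}_{t-1}\right] \xrightarrow{\,p\,} 1 .
\]
```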
First, we compute $\mathrm{var}(2\,\mathrm{Re}\,\tilde V_q)$. By the m.d.s. property of $Z_t(u,\theta_0)$ under $\mathbb{H}_0$, we have
\[
E(\tilde V_q^2) = E\Biggl[ \sum_{t=2q+2}^{T} \int\!\!\int Z_t(u,\theta_0) \sum_{j=1}^{q} a_T(j)\,\psi_{t-j}(v) \sum_{s=j+1}^{t-2q-1} Z_s^*(u,\theta_0)\,\psi_{s-j}^*(v)\, dW(u)\,dW(v) \Biggr]^2
\]
\[
= \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int \sum_{t=2q+2}^{T}\sum_{s=j+1}^{t-2q-1}
E\bigl[ Z_t(u_1,\theta_0)\,Z_t(u_2,\theta_0)\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2) \bigr]\,
E\bigl[ Z_s^*(u_1,\theta_0)\,Z_s^*(u_2,\theta_0)\,\psi_{s-j}^*(v_1)\,\psi_{s-l}^*(v_2) \bigr]\, dW^{(4)}
+ \sum_{t=2q+2}^{T} V_1(t) + \sum_{t=2q+2}^{T} V_2(t) + \sum_{t=2q+2}^{T} V_3(t), \tag{A.20}
\]
where $dW^{(4)}$ abbreviates $dW(u_1)\,dW(u_2)\,dW(v_1)\,dW(v_2)$, $V_1(t)$ collects the covariances between the $t$- and $s$-blocks, and $V_2(t)$, $V_3(t)$ collect the cross terms with $s_1 \ne s_2$ for $j = l$ and $j \ne l$ respectively. The first term in (A.20) is $O(p)$, as shown in (A.25); the remaining terms are $o(p)$ following the arguments below:
\[
\sum_{t=2q+2}^{T} |V_1(t)| \le \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int \sum_{t=2q+2}^{T}\sum_{s=j+1}^{t-2q-1} \beta^{\delta/(1+\delta)}(t-s)\,
\Bigl[ E\bigl| Z_t(u_1,\theta_0)\,Z_t(u_2,\theta_0)\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2) \bigr|^{2(1+\delta)} \Bigr]^{\frac{1}{2(1+\delta)}}
\Bigl[ E\bigl| Z_s^*(u_1,\theta_0)\,Z_s^*(u_2,\theta_0)\,\psi_{s-j}^*(v_1)\,\psi_{s-l}^*(v_2) \bigr|^{2(1+\delta)} \Bigr]^{\frac{1}{2(1+\delta)}} dW^{(4)}
= O\bigl( p^2 T^{-1} q^{-\nu\delta/(1+\delta)+1} \bigr), \tag{A.21}
\]
\[
\sum_{t=2q+2}^{T} |V_2(t)| \le 2\sum_{j=1}^{q} a_T^2(j) \int\!\!\int\!\!\int\!\!\int \sum_{t=2q+2}^{T}\sum_{s_1=j+1}^{t-2q-1}\sum_{s_2=j+1}^{s_1-1} \beta^{\delta/(1+\delta)}(t-s_1)\,
\Bigl[ E\bigl| Z_t(u_1,\theta_0)\,Z_t(u_2,\theta_0)\,\psi_{t-j}(v_1)\,\psi_{t-j}(v_2) \bigr|^{2(1+\delta)} \Bigr]^{\frac{1}{2(1+\delta)}}
\Bigl[ E\bigl| Z_{s_1}^*(u_1,\theta_0)\,Z_{s_2}^*(u_2,\theta_0)\,\psi_{s_1-j}^*(v_1)\,\psi_{s_2-j}^*(v_2) \bigr|^{2(1+\delta)} \Bigr]^{\frac{1}{2(1+\delta)}} dW^{(4)}
= O\bigl( p\, q^{-\nu\delta/(1+\delta)+1} \bigr), \tag{A.22}
\]
\[
\sum_{t=2q+2}^{T} |V_3(t)| \le 4\sum_{j=2}^{q}\sum_{l=1}^{j-1} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int \sum_{t=2q+2}^{T}\sum_{s_1=j+1}^{t-2q-1}\sum_{s_2=l+1}^{s_1-1} \beta^{\delta/(1+\delta)}(t-s_1)\,
\Bigl[ E\bigl| Z_t(u_1,\theta_0)\,Z_t(u_2,\theta_0)\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2) \bigr|^{2(1+\delta)} \Bigr]^{\frac{1}{2(1+\delta)}}
\Bigl[ E\bigl| Z_{s_1}^*(u_1,\theta_0)\,Z_{s_2}^*(u_2,\theta_0)\,\psi_{s_1-j}^*(v_1)\,\psi_{s_2-l}^*(v_2) \bigr|^{2(1+\delta)} \Bigr]^{\frac{1}{2(1+\delta)}} dW^{(4)}
= O\bigl( p^2 q^{-\nu\delta/(1+\delta)+1} \bigr). \tag{A.23}
\]
By stationarity, the first term in (A.20) equals
\[
\frac12 \sum_{j=1}^{q}\sum_{l=1}^{q} k^2(j/p)\,k^2(l/p) \int\!\!\int\!\!\int\!\!\int
\bigl| E\bigl[ Z_{j+l}(u_1,\theta_0)\, Z_{j+l}(u_2,\theta_0)\,\psi_l(v_1)\,\psi_j(v_2) \bigr] \bigr|^2\, dW^{(4)}\,[1+o(1)].
\]
Similarly, we can obtain
\[
E(\tilde V_q^{*2}) = \frac12 \sum_{j=1}^{q}\sum_{l=1}^{q} k^2(j/p)\,k^2(l/p)
\int\!\!\int\!\!\int\!\!\int \bigl| E\bigl[ Z_{j+l}(u_1,\theta_0)\,Z_{j+l}(u_2,\theta_0)\,\psi_l(v_1)\,\psi_j(v_2) \bigr] \bigr|^2\, dW^{(4)}\, [1+o(1)], \tag{A.24}
\]
and
\[
E|\tilde V_q|^2 = \frac12 \sum_{j=1}^{q}\sum_{l=1}^{q} k^2(j/p)\,k^2(l/p)
\int\!\!\int\!\!\int\!\!\int \bigl| E\bigl[ Z_{j+l}(u_1,\theta_0)\,Z_{j+l}^*(u_2,\theta_0)\,\psi_l(v_1)\,\psi_j^*(v_2) \bigr] \bigr|^2\, dW^{(4)}\, [1+o(1)].
\]
Because $dW(\cdot)$ weighs sets symmetric about zero equally, we have $E|\tilde V_q|^2 = E(\tilde V_q^2) = E(\tilde V_q^{*2})$. Hence,
\[
\mathrm{var}(2\,\mathrm{Re}\,\tilde V_q) = E(\tilde V_q^2) + E(\tilde V_q^{*2}) + 2E|\tilde V_q|^2
= 2\sum_{j=1}^{q}\sum_{l=1}^{q} k^2(j/p)\,k^2(l/p) \int\!\!\int\!\!\int\!\!\int
\bigl| E\bigl[ Z_{j+l}(u_1,\theta_0)\,Z_{j+l}(u_2,\theta_0)\,\psi_l(v_1)\,\psi_j(v_2) \bigr] \bigr|^2\, dW^{(4)}\,[1+o(1)].
\]
Put $C(0,j,l) \equiv E\{[Z_{j+l}(u_1,\theta_0)\,Z_{j+l}(u_2,\theta_0) - \Sigma_0(u_1,u_2;\theta_0)]\,\psi_l(v_1)\,\psi_j(v_2)\}$, where $\Sigma_j(u_1,u_2;\theta_0) \equiv E[Z_t(u_1,\theta_0)\,Z_{t-j}(u_2,\theta_0)]$. Then
\[
E\bigl[ Z_{j+l}(u_1,\theta_0)\,Z_{j+l}(u_2,\theta_0)\,\psi_l(v_1)\,\psi_j(v_2) \bigr] = C(0,j,l) + \Sigma_0(u_1,u_2;\theta_0)\,\Omega_{l-j}(v_1,v_2),
\]
\[
\bigl| E\bigl[ Z_{j+l}\,Z_{j+l}\,\psi_l\,\psi_j \bigr] \bigr|^2
= |C(0,j,l)|^2 + \bigl| \Sigma_0(u_1,u_2;\theta_0) \bigr|^2 \bigl| \Omega_{l-j}(v_1,v_2) \bigr|^2
+ 2\,\mathrm{Re}\bigl[ C(0,j,l)\,\Sigma_0^*(u_1,u_2;\theta_0)\,\Omega_{l-j}^*(v_1,v_2) \bigr].
\]
Given $\sum_{j=-\infty}^{\infty}\sum_{l=-\infty}^{\infty} |C(0,j,l)| \le C$ and $|k(\cdot)| \le 1$, we have
\[
\mathrm{var}(2\,\mathrm{Re}\,\tilde V_q)
= 2 \sum_{j=1}^{q}\sum_{l=1}^{q} k^2(j/p)\,k^2(l/p) \int\!\!\int \bigl| \Omega_{l-j}(v_1,v_2) \bigr|^2\, dW(v_1)\,dW(v_2)
\int\!\!\int \bigl| \Sigma_0(u_1,u_2;\theta_0) \bigr|^2\, dW(u_1)\,dW(u_2)\,[1+o(1)]
\]
\[
= 2p \sum_{m=1-q}^{q-1} \Bigl[ p^{-1}\!\!\sum_{j=|m|+1}^{q} k^2(j/p)\,k^2\bigl((j-m)/p\bigr) \Bigr]
\int\!\!\int \bigl| \Omega_m(v_1,v_2) \bigr|^2\, dW(v_1)\,dW(v_2)
\int\!\!\int \bigl| \Sigma_0(u_1,u_2;\theta_0) \bigr|^2\, dW(u_1)\,dW(u_2)\,[1+o(1)]
\]
\[
= 4\pi p \int_0^{\infty} k^4(z)\,dz \int\!\!\int\!\!\int_{-\pi}^{\pi} \bigl| G(\omega,v_1,v_2) \bigr|^2\, d\omega\, dW(v_1)\,dW(v_2)
\int\!\!\int \bigl| \Sigma_0(u_1,u_2;\theta_0) \bigr|^2\, dW(u_1)\,dW(u_2)\,[1+o(1)], \tag{A.25}
\]
where we used the fact that for any given $m$, $p^{-1}\sum_{j=|m|+1}^{q} k^2(j/p)\,k^2((j-m)/p) \to \int_0^{\infty} k^4(z)\,dz$ as $p \to \infty$.

We now verify condition (A.18). Noting that $E|H_{j,t-2q-1}(u,v)|^4 \le Ct^2$ for $1 \le j \le q$, given the m.d.s. property of $Z_t(u,\theta_0)$ and Rosenthal's inequality (cf. Hall and Heyde (1980, p. 23)), we have
\[
E|V_q(t)|^4 \le \Biggl[ \sum_{j=1}^{q} a_T(j) \int\!\!\int \Bigl( E\bigl| Z_t(u,\theta_0)\,\psi_{t-j}(v)\,H_{j,t-2q-1}(u,v) \bigr|^4 \Bigr)^{\!1/4} dW(u)\,dW(v) \Biggr]^4
\le C\,t^2 \Bigl[ \sum_{j=1}^{q} a_T(j) \Bigr]^4 = O\bigl( p^4 t^2/T^4 \bigr).
\]
It follows that $\sum_{t=2q+2}^{T} E|V_q(t)|^4 = O(p^4/T) = o(p^2)$ given $p^2/T \to 0$. Thus, (A.18) holds.

Next, we verify condition (A.19). Put $\Sigma_t(u_1,u_2;\theta_0) \equiv E[Z_t(u_1,\theta_0)\,Z_t(u_2,\theta_0)\,|\,I_{t-1}]$. Then
\[
E\bigl[ V_q^2(t)\,\big|\,I_{t-1} \bigr]
= \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int
\Sigma_t(u_1,u_2;\theta_0)\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2)\,H_{j,t-2q-1}(u_1,v_1)\,H_{l,t-2q-1}(u_2,v_2)\, dW^{(4)}
\]
\[
= \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int
E\bigl[ \Sigma_t(u_1,u_2;\theta_0)\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2) \bigr]\, H_{j,t-2q-1}(u_1,v_1)\,H_{l,t-2q-1}(u_2,v_2)\, dW^{(4)}
+ \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int
\tilde L_t^{j,l}(u_1,u_2,v_1,v_2)\, H_{j,t-2q-1}(u_1,v_1)\,H_{l,t-2q-1}(u_2,v_2)\, dW^{(4)}
\equiv S_1(t) + V_4(t), \ \text{say,} \tag{A.26}
\]
where $\tilde L_t^{j,l}(u_1,u_2,v_1,v_2) \equiv \Sigma_t(u_1,u_2;\theta_0)\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2) - E\bigl[ \Sigma_t(u_1,u_2;\theta_0)\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2) \bigr]$. We further decompose
\[
S_1(t) = \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int
E\bigl[ \Sigma_t\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2) \bigr]\, E\bigl[ H_{j,t-2q-1}(u_1,v_1)\,H_{l,t-2q-1}(u_2,v_2) \bigr]\, dW^{(4)}
+ \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int
E\bigl[ \Sigma_t\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2) \bigr]\, \bigl\{ H_{j,t-2q-1}(u_1,v_1)\,H_{l,t-2q-1}(u_2,v_2) - E\bigl[ H_{j,t-2q-1}(u_1,v_1)\,H_{l,t-2q-1}(u_2,v_2) \bigr] \bigr\}\, dW^{(4)}
\equiv V_0(t) + S_2(t), \ \text{say,} \tag{A.27}
\]
where
\[
V_0(t) = \sum_{j=1}^{q}\sum_{l=1}^{q} \min(t-2q-1-j,\ t-2q-1-l)\, a_T(j)\,a_T(l)
\int\!\!\int\!\!\int\!\!\int \bigl| E\bigl[ \Sigma_t(u_1,u_2;\theta_0)\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2) \bigr] \bigr|^2\, dW^{(4)}
= E\bigl[ V_q^2(t) \bigr] - V_1(t) - V_2(t) - V_3(t),
\]
where $V_j(t)$, $j = 1,2,3$, are defined in (A.20). Put
\[
L_s^{j,l}(u_1,u_2,v_1,v_2) \equiv Z_s(u_1,\theta_0)\,Z_s(u_2,\theta_0)\,\psi_{s-j}(v_1)\,\psi_{s-l}(v_2)
- E\bigl[ Z_s(u_1,\theta_0)\,Z_s(u_2,\theta_0)\,\psi_{s-j}(v_1)\,\psi_{s-l}(v_2) \bigr]. \tag{A.28}
\]
Then we write
\[
S_2(t) = \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int
E\bigl[ Z_t(u_1,\theta_0)\,Z_t(u_2,\theta_0)\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2) \bigr]
\sum_{s=\max(j,l)}^{t-2q-1} L_s^{j,l}(u_1,u_2,v_1,v_2)\, dW^{(4)}
\]
\[
+ 2 \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int
E\bigl[ Z_t(u_1,\theta_0)\,Z_t(u_2,\theta_0)\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2) \bigr]
\sum_{s=\max(j,l)+1}^{t-2q-1} \sum_{\tau=l+1}^{s-1} Z_s(u_1,\theta_0)\,\psi_{s-j}(v_1)\,Z_\tau(u_2,\theta_0)\,\psi_{\tau-l}(v_2)\, dW^{(4)}
\equiv V_5(t) + S_3(t), \ \text{say,} \tag{A.29}
\]
where $S_3(t)$ is split according to the gap between $s$ and $\tau$:
\[
S_3(t) = \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int
E\bigl[ Z_t\,Z_t\,\psi_{t-j}\,\psi_{t-l} \bigr] \sum_{0 < s-\tau \le 2q} Z_s\,\psi_{s-j}\,Z_\tau\,\psi_{\tau-l}\, dW^{(4)}
+ \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int
E\bigl[ Z_t\,Z_t\,\psi_{t-j}\,\psi_{t-l} \bigr] \sum_{s-\tau > 2q} Z_s\,\psi_{s-j}\,Z_\tau\,\psi_{\tau-l}\, dW^{(4)}
\equiv V_6(t) + V_7(t), \ \text{say.} \tag{A.30}
\]
It follows from (A.26), (A.27), (A.29) and (A.30) that
\[
\sum_{t=2q+2}^{T} \bigl\{ E\bigl[ V_q^2(t)\,\big|\,I_{t-1} \bigr] - E\bigl[ V_q^2(t) \bigr] \bigr\}
= \sum_{t=2q+2}^{T} \Bigl[ \sum_{a=4}^{7} V_a(t) - \sum_{a=1}^{3} V_a(t) \Bigr].
\]
It suffices to show Lemmas A.9–A.12 below, which imply $E\bigl| \sum_{t=2q+2}^{T} E[V_q^2(t)|I_{t-1}] - E[V_q^2(t)] \bigr|^2 = o(p^2)$ given $q = p^{1+\frac{1}{4b-2}}(\ln^2 T)^{\frac{1}{2b-1}}$ and $p = cT^{\lambda}$ for $0 < \lambda < \bigl(3+\frac{1}{4b-2}\bigr)^{-1}$. Thus, condition (A.19) holds, and so $\tilde Q(0,0) \xrightarrow{d} N(0,1)$ by Brown's (1971) theorem.

Lemma A.9. Let $V_4(t)$ be defined as in (A.26). Then $E\bigl| \sum_{t=2q+2}^{T} V_4(t) \bigr|^2 = O\bigl(p/q^{\nu\delta/(1+\delta)}\bigr)$.

Lemma A.10. Let $V_5(t)$ be defined as in (A.29). Then $E\bigl| \sum_{t=2q+2}^{T} V_5(t) \bigr|^2 = O(qp^4/T)$.

Lemma A.11. Let $V_6(t)$ be defined as in (A.30). Then $E\bigl| \sum_{t=2q+2}^{T} V_6(t) \bigr|^2 = O(qp^4/T)$.

Lemma A.12. Let $V_7(t)$ be defined as in (A.30). Then $E\bigl| \sum_{t=2q+2}^{T} V_7(t) \bigr|^2 = O\bigl(Tpq^{-\nu\delta/(1+\delta)+1}\bigr)$.

Proof of Lemma A.9. Recalling the definition of $\tilde L_t^{j,l}(u_1,u_2,v_1,v_2)$ as in (A.26), we can obtain by Minkowski's inequality
\[
E\Bigl| \sum_{t=2q+2}^{T} V_4(t) \Bigr|^2
\le \Biggl[ \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \sum_{t=2q+2}^{T} \int\!\!\int\!\!\int\!\!\int
\Bigl( E\bigl| \tilde L_t^{j,l}(u_1,u_2,v_1,v_2)\, H_{j,t-2q-1}(u_1,v_1)\, H_{l,t-2q-1}(u_2,v_2) \bigr|^2 \Bigr)^{\!1/2} dW^{(4)} \Biggr]^2,
\]
and applying the mixing inequality of Lemma A.3 to the centered factor $\tilde L_t^{j,l}$, together with $E|H_{j,t-2q-1}(u,v)|^4 \le Ct^2$, delivers the stated bound.
Proof of Lemma A.10. Recalling the definition of $L_s^{j,l}(u_1,u_2,v_1,v_2)$ in (A.28), we have
\[
E\Bigl| \sum_{s=\max(j,l)}^{t-2q-1} L_s^{j,l}(u_1,u_2,v_1,v_2) \Bigr|^2
= \sum_{s=\max(j,l)}^{t-2q-1} E\bigl| L_s^{j,l} \bigr|^2
+ \sum_{0<|s-\tau|\le 2q} E\bigl[ L_s^{j,l}\,L_\tau^{j,l} \bigr]
+ \sum_{|s-\tau|>2q} E\bigl[ L_s^{j,l}\,L_\tau^{j,l} \bigr],
\]
where, by Lemma A.3, the last term is bounded by
\[
\sum_{|s-\tau|>2q} \beta^{\delta/(1+\delta)}(2q)\, \Bigl[ E\bigl| L_s^{j,l} \bigr|^{2(1+\delta)} \Bigr]^{\frac{1}{2(1+\delta)}}
\Bigl[ E\bigl| L_\tau^{j,l} \bigr|^{2(1+\delta)} \Bigr]^{\frac{1}{2(1+\delta)}},
\]
so that $E| \sum_s L_s^{j,l} |^2 = O(tq)$. It follows that
\[
E\Bigl| \sum_{t=2q+2}^{T} V_5(t) \Bigr|^2
\le \Biggl[ \sum_{t=2q+2}^{T} \sum_{j=1}^{q}\sum_{l=1}^{q} a_T(j)\,a_T(l) \int\!\!\int\!\!\int\!\!\int
\bigl| E\bigl[ Z_t(u_1,\theta_0)\,Z_t(u_2,\theta_0)\,\psi_{t-j}(v_1)\,\psi_{t-l}(v_2) \bigr] \bigr|
\Bigl( E\Bigl| \sum_{s=\max(j,l)}^{t-2q-1} L_s^{j,l} \Bigr|^2 \Bigr)^{\!1/2} dW^{(4)} \Biggr]^2 = O(qp^4/T).
\]

Proof of Lemma A.11. By Minkowski's inequality and (A.5),
\[
E|V_6(t)|^2 \le C\,t\,q \Bigl[ \sum_{j=1}^{q} a_T(j) \Bigr]^4 = O\bigl(tqp^4/T^4\bigr),
\]
since for each $(j,l)$ the double sum over $0 < s-\tau \le 2q$ contains $O(tq)$ terms; hence
\[
E\Bigl| \sum_{t=2q+2}^{T} V_6(t) \Bigr|^2 \le \Biggl[ \sum_{t=2q+2}^{T} \bigl( E|V_6(t)|^2 \bigr)^{1/2} \Biggr]^2 = O(qp^4/T),
\]
given Assumption A.7.

Proof of Lemma A.12. The result that $E| \sum_{t=2q+2}^{T} V_7(t) |^2 = O\bigl(Tpq^{-\nu\delta/(1+\delta)+1}\bigr)$ follows from Minkowski's inequality, $p \to \infty$, and the fact that
\[
E|V_7(t)|^2 = \sum_{j_1=1}^{q}\sum_{j_2=1}^{q}\sum_{l_1=1}^{q}\sum_{l_2=1}^{q} a_T(j_1)\,a_T(j_2)\,a_T(l_1)\,a_T(l_2)
\int_{\mathbb{R}^{8N}} E\bigl[ Z_0(u_{11},\theta_0)\,Z_0(u_{21},\theta_0)\,\psi_{-j_1}(v_{11})\,\psi_{-l_1}(v_{21}) \bigr]\,
E\bigl[ Z_0^*(u_{12},\theta_0)\,Z_0^*(u_{22},\theta_0)\,\psi_{-j_2}^*(v_{12})\,\psi_{-l_2}^*(v_{22}) \bigr]
\]
\[
\times \Biggl\{ \sum_{s=2q+2}^{t-2q-1} E\bigl[ Z_s(u_{11},\theta_0)\,Z_s(u_{12},\theta_0)\,\psi_{s-j_1}(v_{11})\,\psi_{s-j_2}(v_{12}) \bigr]
\sum_{\tau=\max(l_1,l_2)+1}^{s-2q-1} E\bigl[ Z_\tau^*(u_{21},\theta_0)\,Z_\tau^*(u_{22},\theta_0)\,\psi_{\tau-l_1}^*(v_{21})\,\psi_{\tau-l_2}^*(v_{22}) \bigr]
+ R_t \Biggr\}\, dW^{(8)}
= O\bigl( t^2 p\, T^{-4} \bigr) + O\bigl( t^3 p\, q^{-\nu\delta/(1+\delta)+1}\, T^{-4} \bigr),
\]
where $dW^{(8)}$ abbreviates the eight-fold product measure and the remainder $R_t$, arising from the terms with $|s-\tau| \le 2q$ or with dependence across blocks, is bounded via Lemma A.3 by $\beta^{\delta/(1+\delta)}(2q)$ times the corresponding $2(1+\delta)$-norms, which contributes the second rate.

Proof of Theorem 2. The proof of Theorem 2 consists of the proofs of Theorems A.3 and A.4 below.

Theorem A.3. Under the conditions of Theorem 2, $(p^{1/2}/T)\bigl[\hat Q(0,0) - \tilde Q(0,0)\bigr] \xrightarrow{p} 0$.

Theorem A.4. Under the conditions of Theorem 2,
\[
(p^{1/2}/T)\,\tilde Q(0,0) \xrightarrow{p} \bigl( D(0,0) \bigr)^{-1/2}
\int\!\!\int\!\!\int_{-\pi}^{\pi} \bigl| F(\omega,u,v) - F_0(\omega,u,v) \bigr|^2\, d\omega\, dW(u)\,dW(v).
\]
Proof of Theorem A.3. It suffices to show that
\[
T^{-1} \int\!\!\int \sum_{j=1}^{T-1} k^2(j/p)\, T_j \bigl[ |\hat\Gamma_j(u,v)|^2 - |\tilde\Gamma_j(u,v)|^2 \bigr]\, dW(u)\,dW(v) \xrightarrow{p} 0, \tag{A.31}
\]
$p^{-1}[\hat C(0,0) - \tilde C(0,0)] = O_P(1)$, and $p^{-1}[\hat D(0,0) - \tilde D(0,0)] \xrightarrow{p} 0$, where $\tilde C(0,0)$ and $\tilde D(0,0)$ are defined in the same way as $\hat C(0,0)$ and $\hat D(0,0)$ in (3.8), with $\hat Z_t(u)$ replaced by $Z_t(u,\theta_*)$. Since the proofs for $p^{-1}[\hat C(0,0) - \tilde C(0,0)] = O_P(1)$ and $p^{-1}[\hat D(0,0) - \tilde D(0,0)] \xrightarrow{p} 0$ are straightforward, we focus on the proof of (A.31). From (A.5), the Cauchy–Schwarz inequality, and the fact that $T^{-1}\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\tilde\Gamma_j(u,v)|^2\, dW(u)\,dW(v) = O_P(1)$, as is implied by Theorem A.4 (the proof of Theorem A.4 does not depend on Theorem A.3), it suffices to show that $T^{-1}\hat A_1 \xrightarrow{p} 0$, where $\hat A_1$ is defined as in (A.2). Given (A.3), we shall show that $T^{-1}\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{aj}(u,v)|^2\, dW(u)\,dW(v) \xrightarrow{p} 0$, $a = 1,2$.

We first consider $a = 1$. By the Cauchy–Schwarz inequality and $|\psi_{t-j}(v)| \le 2$, we have
\[
|\hat B_{1j}(u,v)|^2 \le CT_j^{-1} \sum_{t=j+1}^{T} \bigl| \varphi(u,t|I_{t-1},\theta_*) - \varphi(u,t|I_{t-1},\hat\theta) \bigr|^2
+ CT_j^{-1} \sum_{t=j+1}^{T} \bigl| \varphi(u,t|I_{t-1},\hat\theta) - \varphi^{\dagger}(u,t|I_{t-1},\hat\theta) \bigr|^2
\]
\[
\le CT_j^{-1}\, T\bigl| \hat\theta - \theta_* \bigr|^2 \Bigl[ T^{-1}\sum_{t=1}^{T}\sup_{u\in\mathbb{R}^N}\sup_{\theta\in\Theta} \Bigl| \frac{\partial}{\partial\theta}\varphi(u,t|I_{t-1},\theta) \Bigr|^2 \Bigr]
+ CT_j^{-1} \sum_{t=j+1}^{T}\sup_{u\in\mathbb{R}^N}\sup_{\theta\in\Theta} \bigl| \varphi(u,t|I_{t-1},\theta) - \varphi^{\dagger}(u,t|I_{t-1},\theta) \bigr|^2
= O_P\bigl(T_j^{-1}\bigr).
\]
It follows from (A.4) and Assumption A.7 that $T^{-1}\sum_{j=1}^{T-1} k^2(j/p)\, T_j \int\!\!\int |\hat B_{1j}(u,v)|^2\, dW(u)\,dW(v) = O_P(p/T)$. The proof for $a = 2$ is similar. This completes the proof of Theorem A.3.

Proof of Theorem A.4. The proof is very similar to Hong (1999, Proof of Thm. 5) for the case $(m,l) = (0,0)$. The consistency result follows from (a) $p^{-1}\sum_{j=1}^{T-1} k^4(j/p) \to \int_0^{\infty} k^4(z)\,dz$; (b) $\int\!\!\int\!\!\int_{-\pi}^{\pi} |\hat F(\omega,u,v) - F(\omega,u,v)|^2\, d\omega\, dW(u)\,dW(v) \xrightarrow{p} 0$; (c) $\hat C(0,0) = O_P(p)$; and (d) $\hat D(0,0) \xrightarrow{p} D(0,0)$.

Proof of Theorem 3. The proof is very similar to that of Theorem 1, with $\hat Z_t^{(m)}(0,\hat\theta)$, $Z_t^{(m)}(0,\theta_0)$, $\hat\Gamma_j^{(m,0)}(0,v)$, $\hat C(m,0)$ and $\hat D(m,0)$ replacing $\hat Z_t(u,\hat\theta)$, $Z_t(u,\theta_0)$, $\hat\Gamma_j(u,v)$, $\hat C(0,0)$ and $\hat D(0,0)$, respectively.
References

Ahn, D., Dittmar, R., Gallant, A.R., 2002. Quadratic term structure models: theory and evidence. Review of Financial Studies 15, 243–288.
Ait-Sahalia, Y., 1996a. Testing continuous-time models of the spot interest rate. Review of Financial Studies 9, 385–426.
Ait-Sahalia, Y., 1996b. Nonparametric pricing of interest rate derivative securities. Econometrica 64, 527–560.
Ait-Sahalia, Y., 2002. Maximum-likelihood estimation of discretely sampled diffusions: a closed-form approach. Econometrica 70, 223–262.
Ait-Sahalia, Y., 2007. Estimating continuous-time models using discretely sampled data. In: Blundell, B., Torsten, P., Newey, W. (Eds.), Advances in Economics and Econometrics, Theory and Applications, Ninth World Congress. Cambridge University Press, Cambridge.
Ait-Sahalia, Y., Fan, J., Peng, H., 2009. Nonparametric transition-based tests for diffusions. Journal of the American Statistical Association 104, 1102–1116.
Ait-Sahalia, Y., Kimmel, R., 2010. Estimating affine multifactor term structure models using closed-form likelihood expansions. Journal of Financial Economics 98, 113–144.
Alexander, C.O., 1998. Volatility and correlation: methods, models and applications. In: Risk Management and Analysis: Measuring and Modelling Financial Risk.
Andersen, T.G., Benzoni, L., Lund, J., 2002. An empirical investigation of continuous-time equity return models. Journal of Finance 57, 1239–1284.
Andrews, D.W.K., 1997. A conditional Kolmogorov test. Econometrica 65, 1097–1128.
Bakshi, G., Cao, C., Chen, Z., 1997. Empirical performance of alternative option pricing models. Journal of Finance 52, 2003–2049.
Bally, V., Talay, D., 1996. The law of the Euler scheme for stochastic differential equations: I. Convergence rate of the distribution function. Probability Theory and Related Fields 104, 43–60.
Bates, D., 1996. Jumps and stochastic volatility: exchange rate processes implicit in PHLX deutsche mark options. Review of Financial Studies 9, 69–107.
Bates, D., 2000. Post-'87 crash fears in the S&P 500 futures option market. Journal of Econometrics 94, 181–238.
Bates, D., 2007. Maximum likelihood estimation of latent affine processes. Review of Financial Studies 19, 909–965.
Bhardwaj, G., Corradi, V., Swanson, N.R., 2008. A simulation based specification test for diffusion processes. Journal of Business & Economic Statistics 26, 176–193.
Bierens, H., 1982. Consistent model specification tests. Journal of Econometrics 20, 105–134.
Bontemps, C., Meddahi, N., 2010. Testing distributional assumptions: a GMM approach. Working paper, University of Montreal.
Brandt, M., Santa-Clara, P., 2002. Simulated likelihood estimation of diffusions with an application to exchange rate dynamics in incomplete markets. Journal of Financial Economics 63, 161–210.
Brown, B.M., 1971. Martingale limit theorems. Annals of Mathematical Statistics 42, 59–66.
Carr, P., Wu, L., 2003. The finite moment log stable process and option pricing. Journal of Finance 58, 753–777.
Carr, P., Wu, L., 2004. Time-changed Lévy processes and option pricing. Journal of Financial Economics 71, 113–141.
Carrasco, M., Florens, J.P., 2000. Generalization of GMM to a continuum of moment conditions. Econometric Theory 16, 797–834.
Carrasco, M., Chernov, M., Florens, J.P., Ghysels, E., 2007. Efficient estimation of jump diffusions and general dynamic models with a continuum of moment conditions. Journal of Econometrics 140, 529–573.
Chacko, G., Viceira, L., 2003. Spectral GMM estimation of continuous-time processes. Journal of Econometrics 116, 259–292.
Chen, S.X., Gao, J., Tang, C.Y., 2008. A test for model specification of diffusion processes. Annals of Statistics 36, 162–198.
Chernov, M., Gallant, A.R., Ghysels, E., Tauchen, G., 2003. Alternative models for stock price dynamics. Journal of Econometrics 116, 225–257.
Chib, S., Pitt, M.K., Shephard, N., 2010. Likelihood based inference for diffusion driven state space models. Working paper, University of Oxford.
Cox, J.C., Ingersoll, J.E., Ross, S.A., 1985. A theory of the term structure of interest rates. Econometrica 53, 385–407.
Dai, Q., Singleton, K., 2000. Specification analysis of affine term structure models. Journal of Finance 55, 1943–1978.
Dai, Q., Singleton, K., 2003. Term structure dynamics in theory and reality. Review of Financial Studies 16, 631–678.
Das, S., Sundaram, R., 1999. Of smiles and smirks: a term structure perspective. Journal of Financial and Quantitative Analysis 34, 211–239.
Del Moral, P., Jacod, J., Protter, P., 2001. The Monte Carlo method for filtering with discrete-time observations. Probability Theory and Related Fields 120, 346–368.
Diebold, F.X., Gunther, T., Tay, A., 1998. Evaluating density forecasts with applications to financial risk management. International Economic Review 39, 863–883.
Doucet, A., Freitas, N., Gordon, N., 2001. Sequential Monte Carlo Methods in Practice. Springer-Verlag.
Duffee, G., 2002. Term premia and interest rate forecasts in affine models. Journal of Finance 57, 405–443.
Duffie, D., Kan, R., 1996. A yield-factor model of interest rates. Mathematical Finance 6, 379–406.
Duffie, D., Pan, J., Singleton, K., 2000. Transform analysis and asset pricing for affine jump-diffusions. Econometrica 68, 1343–1376.
Duffie, D., Pedersen, L., Singleton, K., 2003. Modeling credit spreads on sovereign debt: a case study of Russian bonds. Journal of Finance 55, 119–159.
Easley, D., O'Hara, M., 1992. Time and the process of security price adjustment. Journal of Finance 47, 577–605.
Engle, R., 2002. Dynamic conditional correlation — a simple class of multivariate GARCH models. Journal of Business and Economic Statistics 20, 339–350.
Fan, Y., 1997. Goodness-of-fit tests for a multivariate distribution by the empirical characteristic function. Journal of Multivariate Analysis 62, 36–63.
Gallant, A.R., Tauchen, G., 1996. Which moments to match? Econometric Theory 12, 657–681.
Gallant, A.R., Tauchen, G., 1998. Reprojecting partially observed systems with application to interest rate diffusions. Journal of the American Statistical Association 93, 10–24.
Gao, J., King, M., 2004. Adaptive testing in continuous-time diffusion models. Econometric Theory 20, 844–882.
Genon-Catalot, V., Jeantheau, T., Laredo, C., 2000. Stochastic volatility models as hidden Markov models and statistical applications. Bernoulli 6, 1051–1079.
Gordon, N., Salmond, D., Smith, A., 1993. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F 140, 107–113.
Hall, P., Heyde, C., 1980. Martingale Limit Theory and its Application. Academic Press.
Hannan, E., 1970. Multiple Time Series. Wiley, New York.
Hansen, L.P., Scheinkman, J.A., 1995. Back to the future: generating moment implications for continuous-time Markov processes. Econometrica 63, 767–804.
Heston, S., 1993. A closed-form solution for options with stochastic volatility with applications to bond and currency options. Review of Financial Studies 6, 327–344.
Hong, Y., 1999. Hypothesis testing in time series via the empirical characteristic function: a generalized spectral density approach. Journal of the American Statistical Association 94, 1201–1220.
Hong, Y., Li, H., 2005. Nonparametric specification testing for continuous-time models with applications to term structure of interest rates. Review of Financial Studies 18, 37–84.
Huang, J., Wu, L., 2003. Specification analysis of option pricing models based on time-changed Lévy processes. Journal of Finance 59, 1405–1439.
Jarrow, R., Li, H., Liu, S., Wu, C., 2010. Reduced-form valuation of callable corporate bonds: theory and evidence. Journal of Financial Economics 95, 227–248.
Jiang, G., Knight, J., 2002. Estimation of continuous-time processes via the empirical characteristic function. Journal of Business and Economic Statistics 20, 198–212.
Johannes, M., Polson, N., Stroud, J., 2009. Optimal filtering of jump diffusions: extracting latent states from asset prices. Review of Financial Studies 22, 2759–2799.
Kitagawa, G., 1996. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics 5, 1–25.
Kloeden, P., Platen, E., Schurz, H., 1994. Numerical Solution of SDE through Computer Experiments. Springer-Verlag, Berlin, Heidelberg, New York.
Koutrouvelis, I.A., 1980. A goodness-of-fit test of simple hypotheses based on the empirical characteristic function. Biometrika 67, 238–240.
Li, F., 2007. Testing the parametric specification of the diffusion function in a diffusion process. Econometric Theory 23, 221–250.
Pan, J., Singleton, K., 2008. Default and recovery implicit in the term structure of sovereign CDS spreads. Journal of Finance 63, 2345–2384.
Pedersen, A.R., 1995. A new approach to maximum likelihood estimation for stochastic differential equations based on discrete observations. Scandinavian Journal of Statistics 22, 55–71.
Piazzesi, M., 2010. Affine term structure models. In: Ait-Sahalia, Y., Hansen, L. (Eds.), Handbook of Financial Econometrics. pp. 691–766.
Pitt, M.K., Shephard, N., 1999. Filtering via simulation: auxiliary particle filters. Journal of the American Statistical Association 94, 590–599.
Roussas, G., Ioannides, D., 1987. Moment inequalities for mixing sequences of random variables. Stochastic Analysis and Applications 5, 60–120.
Singleton, K., 2001. Estimation of affine asset pricing models using the empirical characteristic function. Journal of Econometrics 102, 111–141.
Stinchcombe, M., White, H., 1998. Consistent specification testing with nuisance parameters present only under the alternative. Econometric Theory 14, 295–324.
Su, L., White, H., 2007. A consistent characteristic-function-based test for conditional independence. Journal of Econometrics 141, 807–834.
Sundaresan, S., 2001. Continuous-time methods in finance: a review and an assessment. Journal of Finance 55, 1569–1622.
Tauchen, G., 1997. New minimum chi-square methods in empirical finance. In: Kreps, D., Wallis, K. (Eds.), Advances in Econometrics: Seventh World Congress. Cambridge University Press.
Vasicek, O., 1977. An equilibrium characterization of the term structure. Journal of Financial Economics 5, 177–188.
Yoshihara, K., 1976. Limiting behavior of U-statistics for stationary, absolutely regular processes. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 35, 237–252.
Journal of Econometrics 164 (2011) 294–309
How many consumers are rational?

Stefan Hoderlein∗

Department of Economics, Boston College, 140 Commonwealth Avenue, Chestnut Hill, MA 02467, United States
Article info
Article history: Received 26 September 2008; received in revised form 9 June 2011; accepted 10 June 2011; available online 23 June 2011.
JEL classification: C12; C14; D12; D01.

Abstract
Rationality places strong restrictions on individual consumer behavior. This paper is concerned with assessing the validity of the integrability constraints imposed by standard utility maximization, arising in classical consumer demand analysis. More specifically, we characterize the testable implications of negative semidefiniteness and symmetry of the Slutsky matrix across a heterogeneous population without assuming anything on the functional form of individual preferences. Our approach employs nonseparable models and is centered around a conditional independence assumption, which is sufficiently general to allow for endogenous regressors. Using British household data, we show that rationality is an acceptable description for large parts of the population. © 2011 Elsevier B.V. All rights reserved.

Keywords: Nonparametric; Integrability; Testing rationality; Nonseparable models; Demand; Nonparametric IV.
1. Introduction

Economic theory yields strong implications for the actual behavior of individuals. In the standard utility maximization model, for instance, economic theory places important restrictions on individual responses to changes in prices and wealth, the so-called integrability constraints. However, when we want to evaluate the validity of the integrability conditions using real data, we face the following problem: in a typical consumer data set, we observe individuals only under a very limited number of different price–wealth combinations, often only once. Consequently, observations on "comparable" individuals have to be taken into consideration. But this conflicts with the important notion of unobserved heterogeneity, a notion that is supported by the common empirical finding that individuals with the same covariates vary dramatically in their actual consumption behavior. Thus, general and unrestrictive models for handling unobserved heterogeneity are essential for testing integrability. This task is complicated by the fact that any new method also has to be able to deal with endogeneity, which in consumer demand arises because of the correlation of wealth with unobservable preferences.
∗ Tel.: +1 617 552 6042. E-mail address: [email protected].
Testing integrability constraints dates back at least to the early work of Stone (1954), and has spurred the extensive research on (parametric) flexible functional form demand systems (e.g., the Translog, Jorgensen et al. (1982), and the Almost Ideal, Deaton and Muellbauer (1980)). Nonparametric analysis of some derivative constraints was performed by Stoker (1989) and Härdle et al. (1991), but none of these has its focus on modeling unobserved heterogeneity. More closely related to our approach is Lewbel (2001), who analyzes integrability constraints in a purely exogenous setting. In comparison to his work, we make three contributions. First, we show how to handle endogeneity. Second, even restricted to the exogenous case, some of our results (e.g., on negative semidefiniteness) are new and more general. Third, we propose, discuss and implement nonparametric test statistics. An alternative method for checking some integrability constraints is revealed preference analysis; see Blundell et al. (2003), and references therein. In this paper, we extend the recent work on nonseparable models – in particular Imbens and Newey (2009) and Altonji and Matzkin (2005), who treat the estimation of average structural derivatives when regressors are endogenous – to testing integrability conditions. Even though our application comes from traditional consumer demand, most of the analysis is by no means confined to this setup and could be applied to any standard utility maximization problem with linear budget constraints. Moreover,
the spirit of the analysis could be extended to cover, e.g., decisions under uncertainty or nonlinear budget constraints. The main contribution of this paper is the formal clarification of the implications of this conditional independence assumption for testing integrability constraints with data. Much of the second section will be specifically devoted to this issue: we devise very general tests for Slutsky negative semidefiniteness and symmetry, as well as for homogeneity. In the third section we apply these concepts to British FES data. The results are affirmative as far as the validity of the integrability conditions is concerned. One feature of our approach is that we are able to look in detail at various subpopulations of interest, and are thus able to shed light on potential sources of rejections of rationality. To give an example which has recently received a great deal of attention in the demand literature, we consider the question of whether estimating demand models using household data is fundamentally flawed because it neglects household composition effects; see, e.g., Browning and Chiappori (1998), Fortin and Lacroix (1997) and Cherchye and Vermeulen (2008). To analyze the hypothesis that rejections of rationality using household data are attributable to, say, bargaining behavior amongst individually rational household members, we look at single households and two person households separately. The results are somewhat surprising in the sense that we do not find two person households to be less rational than singles. After this empirical exercise, we conclude this paper with an outlook. The Appendix contains proofs, graphs and summary statistics. A more detailed description of the econometric methods may be obtained from a supplement that is available on the author's webpage.

2. The demand behavior of a heterogeneous population

2.1. Structure of the model and assumptions

Our model of consumer demand in a heterogeneous population consists of several building blocks. As is common in consumer demand, we assume that – for a fixed preference ordering – there is a causal relationship between budget shares, a [0, 1]-valued random L-vector denoted by W, and regressors of economic importance, namely log prices P and log total expenditure Y, real valued random vectors of length L and 1, respectively. Let X = (P′, Y)′ ∈ R^{L+1}. To capture the notion that preferences vary across the population, we assume that there is a random variable V ∈ V, where V is a Borel space,1 which denotes preferences (or more generally, decision rules). We assume that heterogeneity in preferences is partially caused by observable differences in individuals' characteristics (e.g., age), which we denote by the real valued random G-vector Q. Hence, we let V = ϑ(Q, A), where ϑ is a fixed V-valued mapping defined on the sets Q × A of possible values of (Q, A), and where the random variable A (again taking values in a Borel space A) covers residual unobserved heterogeneity in a general fashion. What can we learn from repeated cross section data about individual rationality in a heterogeneous population? As in the treatment effect literature, we would ideally want to allow for (infinitely) many types, and for the parameters to be not necessarily finite but infinite dimensional. Even if we were to confine ourselves to heterogeneous nonlinear models with finite parameters, beyond rather special finite dimensional examples like linear random coefficient models (see, e.g., Hoderlein et al.
1 Technically: V is a set that is homeomorphic to a Borel subset of the unit interval endowed with the Borel σ-algebra. This includes the case where V is an element of a Polish space, e.g., the space of random piecewise continuous utility functions.
(2010)), little is known about identification. The only case of point identification of general L-dimensional vectors of unobservables involves triangularity (see, e.g., Chesher (2003)); however, such an approach seems ill suited for demand applications, as any ordering amongst L goods in terms of unobservables (say, good 1 depends on unobservable A1 only, good 2 on unobservables A1 and A2, etc.) seems arbitrary and hence hard to justify. Therefore we formalize the heterogeneous population as W = φ(X, V) = φ(X, ϑ(Q, A)), for a general mapping φ, an infinite dimensional vector V (unobservable preferences), and A (unobservable drivers of preferences, say, genetic disposition). Obviously, neither φ nor fV nor fA is nonparametrically identified. Still, for any fixed value of (Q, A), say (q0, a0), we obtain a demand function having standard properties, and the hope is that when forming some type of average over the unobserved heterogeneity A, rationality properties of individual demand may be preserved in some form. As mentioned, a contribution of this paper is that we allow for dependence between the unobserved heterogeneity A and all regressors of economic interest. To this end, we introduce a real valued random K-vector of instruments S. Note that S may contain exogenous elements of X, which serve as their own instruments. Having defined all elements of our model, we are now in a position to summarize it formally, including observable covariates Q:

Assumption 2.1. Let all variables be defined as above. The formal model of consumer demand in a heterogeneous population is given by

W = φ(X, ϑ(Q, A))   (2.1)
X = µ(S, Q, U)   (2.2)
where φ is a fixed R^L-valued Borel mapping defined on the sets X × V of possible values of (X, V). Analogously, µ is a fixed R^{L+1}-valued Borel mapping defined on the sets S × Q × U of possible values of (S, Q, U). Moreover, µ is invertible in U for every value of (S, Q).

Assumption 2.1 defines the nonparametric demand system with (potentially) endogenous regressors as a system of nonseparable equations. These models are called nonseparable because they do not impose an additive specification for the unobservable random terms (in our case A or U). Note that in the case of endogeneity of X this requires that U be solved for, because the residuals have to be employed actively in a control function fashion. This implies that we know enough about the structure of µ to be able to solve for U. The most common assumption in the literature on nonseparable models is strict monotonicity of µ in U (Chesher, 2003; Imbens and Newey, 2009), which is compatible with our approach. Nevertheless, to allow the applied researcher to select the most appropriate specification, we do not confine ourselves to this structure in the theory part. Like any applied researcher, in the application below we have to choose a specification. We specify µ to be the conditional mean function, and consequently U to be additive mean regression residuals, for reasons that will become obvious shortly. However, our approach allows one to choose a different specification in case the economic application suggests otherwise (say, because price endogeneity is an issue, where additivity is questionable). For instance, in the case of several endogenous regressors, µ could be a triangular system of nonseparable functions, or a location-scale model; these options are both compatible with our theoretical analysis.
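To fix ideas, the following minimal sketch shows how the control function residual might be recovered under the location-scale specification just mentioned. The routine npreg is a placeholder for any nonparametric conditional mean estimator (e.g., a local polynomial fit); none of the names refer to the paper's actual implementation.

```python
# A minimal sketch, assuming a generic nonparametric regression routine
# 'npreg' (placeholder): recover the control function residual U under the
# location-scale specification X = psi(S,Q) + sigma2(S,Q)*U, with
# E[U|S,Q] = 0 and V[U|S,Q] = 1.
import numpy as np

def control_function_residual(X, S, Q, npreg):
    """X: (n,) endogenous regressor; S: (n,) instrument; Q: (n,) covariate;
    npreg(regressors, outcome) returns fitted conditional means."""
    SQ = np.column_stack([S, Q])
    psi_hat = npreg(SQ, X)                  # estimate of E[X|S,Q]
    e = X - psi_hat
    var_hat = npreg(SQ, e ** 2)             # estimate of V[X|S,Q]
    return e / np.sqrt(np.maximum(var_hat, 1e-12))  # standardized residual U_hat
```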
How should we think of the factor U in economic terms in the demand example? As mentioned above, the control function IV structure is motivated by the economic assumption of separability in life cycle optimization models. The drawback of these models is that few of them have an explicit solution. However, the few models that do have an explicit solution fall into our category, as we now demonstrate with the CARA (constant absolute risk aversion) model introduced by Caballero (1990, 1991), which is arguably one of the leading models analyzed in the consumption literature.

A formal example: In the intertemporal consumption model with CARA preferences, an agent is characterized by an instantaneous utility function u(x; δt) = −δt^{−1} exp(−δt x), and is assumed to face an interest rate r = R − 1 (for simplicity riskless and constant in what follows). A natural definition of heterogeneity is that precisely this preference parameter δt is heterogeneous across individuals. We assume that there are preference shocks over time; specifically, we specify that δ_{t+1} = δt + γ_{t+1}, where the preference shock γ_{t+1} is not predictable given income in period t + 1 and all information in period t (denoted Ft), i.e., E[γ_{t+1}|S_{t+1}, Ft] = 0. As a consequence, E[δt|St, F_{t−1}] = δ_{t−1}. Note that Xt, i.e., total consumption in period t, is precisely our total expenditure variable; it is endogenous because it embodies intertemporal preferences δt that may be correlated with demand preferences A. We are now going to establish that these economic assumptions result in an additively separable mean regression residual in the IV equation that reflects this correlated preference heterogeneity. Below, after Assumption 2.2, we will further establish that this residual satisfies the assumptions required to control for endogeneity in the outcome equation. Formally, the setup we consider looks as follows: Given information relevant for the consumption decision in period t (i.e., Ft), the individual optimizes

max_{(X_{t+τ})_{τ=0,...,T̃−t}} Et [ Σ_{τ=0}^{T̃−t} β^τ u(X_{t+τ}; δt) ],

where Et denotes conditional expectation with respect to Ft, T̃ is the terminal period, and u is a utility function of the CARA class.2 He maximizes this objective function subject to the dynamic constraints

Q_{t+1} = (Mt − Xt)R
M_{t+1} = Q_{t+1} + S_{t+1},

where S_{t+1} is the consumer's idiosyncratic income in period t + 1, Mt denotes nonhuman assets in period t, and total wealth in period t, Qt, is generated as above. The most common example in the literature has St follow an ARMA process (possibly with a unit root), so for the purpose of this discussion we choose S_{t+1} = St + Σ_{t+1}. In this model, consumption (total expenditure) takes the form Xt = St + τ1(T̃, t)Qt + τ2(T̃, t)Γ, where Γ is generally a function of δt and some higher moments of the conditional distribution of Σ_{t+1}, and τ1, τ2 are deterministic functions of T̃ and t (see Caballero, 1990, 1991), specifically τ1(T̃, t) = (T̃ − t + 1). For simplicity, we consider the standard specification in the consumption literature and assume that log income follows a random walk with Gaussian innovations which are iid over time, see Deaton (1992). In this case, Γ = −0.5δt σ², where σ² is the variance of the income shock Σt. It is moreover customary in this literature to assume that for the working age population, T̃ − t is large and approximately constant across the population,3 in which case the relationship between log total expenditure and log income becomes Xt = St + κ_{1t}Qt − 0.5σ²δt, with κ_{1t} a constant that may depend on the time period, but is invariant across the population in any given period. Since Qt = (M_{t−1} − X_{t−1})R, Qt is a function of the entire relevant past information for the consumption decision, as embodied in X_{t−1}. Taking expectations conditional on (St, Qt) therefore means that we are conditioning on St and F_{t−1}. Hence,

E[Xt|St, Qt] = E[Xt|St, F_{t−1}] = St + κ_{1t}Qt − 0.5σ²δ_{t−1},

and since Xt = St + κ_{1t}Qt − 0.5σ²(δ_{t−1} + γt), we obtain

Xt = E[Xt|St, Qt] − 0.5σ²γt = E[Xt|St, Qt] + Ut,   (2.3)

where Ut = −0.5σ²γt and E[Ut|St, Qt] = E[Ut|St, F_{t−1}] = 0. This means that in a cross section for fixed t, the mean independent error term contains preference heterogeneity as reflected in γt as the only unobservable that varies across the population; the scale (−0.5σ²) does not matter for arguments involving the dependence structure. We will discuss the remaining dependence structure in more detail after introducing our key independence condition formally; however, it is already obvious from the above discussion that (omitting t from now on, as we are in a single cross section) S is independent of γ, and hence also of U conditional on Q, and that the structure provided by the leading explicitly solvable macroeconomic model results in an additive specification with (at least) mean independent errors in this heterogeneous population. Note that an additive structure would also arise in the case of quadratic preferences analyzed in the seminal paper by Hall (1978) that started the modern consumption literature. After this example, we next specify the precise independence conditions we require. Writing Z = (Q′, U)′, we formalize the notion of independence between instruments and unobservables as follows:

Assumption 2.2. Let all variables be as defined above. Then we require that

F_{A|S,Z} = F_{A|Z}.   (2.4)

2 Observe that the individual does not consider parameter uncertainty, i.e., he does not maximize over ‘‘future selves’’, but takes his current self as given.
3 Otherwise, we simply have to condition on the age of an individual as well, as we do in the application anyway.
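The mechanics of the example can be made concrete with a small simulation. All functional forms and parameter values below are hypothetical and chosen only to mirror the structure of the example: δ_{t−1} is a function of past information embodied in Qt, the preference shock γt is unpredictable given (St, Qt), and the mean regression residual Ut = −0.5σ²γt is therefore uncorrelated with the instrument.

```python
# Illustrative simulation of the CARA example; all parameter values are
# hypothetical. delta_lag is a function of Q (past information), and the
# preference shock gamma is drawn independently of (S, Q).
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
sigma, kappa = 1.0, 0.05                          # income shock s.d.; stand-in for kappa_{1t}

Q = np.exp(0.3 * rng.standard_normal(n))          # wealth term Q_t = (M_{t-1} - X_{t-1})R
delta_lag = 1.0 + 0.2 * np.log(Q)                 # delta_{t-1}, known given F_{t-1}
S = np.sqrt(Q) + sigma * rng.standard_normal(n)   # income; the innovation is 'news'
gamma = 0.1 * rng.standard_normal(n)              # preference shock, mean zero given (S, Q)

# Total expenditure as in the text: X_t = S_t + kappa*Q_t - 0.5*sigma^2*delta_t
X = S + kappa * Q - 0.5 * sigma**2 * (delta_lag + gamma)

# E[X|S,Q] replaces delta_t by delta_{t-1}; the residual is U = -0.5*sigma^2*gamma
EX = S + kappa * Q - 0.5 * sigma**2 * delta_lag
U = X - EX
print(np.corrcoef(U, S)[0, 1])                    # close to zero: U carries only gamma
```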
There are also some differentiability and regularity conditions involved, which are summarized in Assumption 2.3 in the Appendix. However, Assumption 2.2 is the key identification assumption and merits a thorough discussion. Assume for a moment that all regressors are exogenous, i.e., S ≡ X and U ≡ 0. Then this assumption states that X, in our case wealth and prices, and unobserved heterogeneity are independently distributed, conditional on individual attributes. To give an example: Suppose that in order to determine the effect of wealth on consumption, we are given data on the demand of individuals, their wealth, and the following attributes: ‘‘education in years’’ and ‘‘gender’’. Take now a typical subgroup of the population, e.g., females having received 12 years of education. Assume that there are two wealth classes for this subgroup, rich and poor, and two types of preferences, types 1 and 2. Then, for both rich and poor women in this subgroup, the proportion of type 1 and type 2 preferences has to be identical. This assumption is of course restrictive. Note, however, that A (the infinite dimensional parameter driving preferences) and regressors of economic interest may still be correlated unconditionally across the population. Moreover, any of the Z may be arbitrarily correlated with A, an issue we take up below when we specify Z to contain a time trend. Now turn to the case of endogenous X and instruments S that are different from X. The assumption is best illustrated through
the above example, where for simplicity we neglect household attributes.4

A formal example (cont.): Assume we face again a heterogeneous population who choose their total consumption in period t (i.e., their total expenditure) by maximizing a CARA preference ordering, with parameter δt that varies across the population. As discussed above, this assumption results in an IV equation which relates total expenditure to labor income and the preference parameter in an additive form. Recall Eq. (2.3), and drop the time subscript for simplicity:

X = E[X|S, Q] − 0.5σ²γ = E[X|S, Q] + U,   (2.5)

where γ is the preference shock in period t, and σ² is the variance of the income process. We argue now that it is precisely this U that satisfies the control function Assumption 2.2. Note first that A contains all types of preference parameters for the decision on how to allocate total expenditure between various goods. A common assumption that is well grounded empirically is that labor supply of the household head is inelastic, and in most industrialized countries pretty much fixed at 40 hours a week. The varying factor is hence the wage rate, which is exogenously determined in the labor market. To proceed, note that Q reflects all information relevant for the consumption decision dated period t − 1. Hence, conditional on Q, S solely reflects new (labor) market information about the individuals' wages, say η, for which it is natural to assume (and indeed in line with the literature) that it is not predictable given period t − 1 information embodied in Q. Moreover, it is also quite natural to assume that this new market level information shock (say, a shock on the production sector of the economy, e.g., an oil price shock) is independent of the individual's preferences (U, A). Hence it is natural to assume that η ⊥ (U, A)|Q and, as a consequence, S ⊥ (U, A)|Q. However, this implies that S ⊥ A|Q, U, which combined with Z = (Q, U) yields precisely Assumption 2.2.

As mentioned in the introduction, the independence Assumption 2.2 is potentially restrictive in any specific application, especially when it interacts with limitations of the data. In our application, for instance, we pool observations from 25 years. To avoid assuming stability of preferences over 25 years, we include a nonparametric time trend in the set of conditioning variables. What is the consequence of this standard practice in parametric models, and how do our assumptions translate into time invariance or variability of all the various elements in our model? To answer this question as precisely as possible, we introduce the key elements of our model formally in this special case. First, observe that we neither observe individuals repeatedly, nor do we perform large-T (time series) analysis. Thus, we introduce time as a discrete random variable, denoted T, with realizations t ∈ {1, . . . , T̄}. The point in time an individual is surveyed is from this perspective just another household characteristic. Moreover, like any other conditioning variable, it may be arbitrarily correlated with unobservables. To see what the implications of our assumption for the economic primitives are, consider first the exogenous case, which is the subcase where X is its own instrument and µ is the identity. In the following, we assume that Q = T, i.e., T is the only characteristic, and that X and S are scalar, all of which is immaterial for our argumentation and could be relaxed easily. The model is given by

W = φ(X, ϑ(T, A)),   (2.6)

4 Else, household attributes would enter as additional conditioning variables, as outlined in the previous paragraph.
and Assumption 2.2 requires that A ⊥ X|T. What does this imply for the time variability or invariance of our model? First, the function φ is assumed to be time invariant, meaning that given preferences and the budget set, individuals always follow the same decision rule. This does not imply that preferences are stable over time, or that the distribution of preferences is stable. It merely implies that individuals always make their decision in the same fashion. This assumption would, for instance, be violated if in one time period individuals determine their consumption as the maximizer of a constrained utility maximization problem, while in another time period they use the minimizer (or some alternative decision rule). Since we want to assess rationality (i.e., the validity of utility maximization as decision rule), our exercise would not make sense if we were to allow for time variability of this decision rule. Second, it is also straightforward to see that the individual may have time varying preferences, as Vt = ϑ(t, A) may vary freely with t. Finally, for the same reason the distribution of preferences Vt may vary arbitrarily across time. While this shows that including a time trend in a nonparametric fashion allows for completely general behavior of preferences over time in the exogenous setting, the specification becomes potentially more restrictive once we allow for endogeneity. We treat this case again in the setup of our application, defined by outcome Eq. (2.6) and IV Eq. (2.3), i.e.:

Wt = φ(Xt, ϑ(t, Qt, A))   (2.7)
Xt = E[Xt|St, Qt] − 0.5σ²γt = E[Xt|St, Qt] + Ut,   (2.8)
where we have now added t subscripts to make the time structure transparent. Recall from the discussion after Assumption 2.1 that the structure of the IV equation holds for any t. The same is true for the mean independence condition E[Ut|St, Qt] = 0. As such, forming the regression E[X|S, Q, T = t] for all t = 1, . . . , T̄, which implicitly stratifies the population according to the time period t in which it is surveyed, produces a residual that is mean independent of the regressors in any period. Similarly, the key identification assumption becomes St ⊥ (Ut, At)|Qt, for all t = 1, . . . , T̄. As is easy to see, all arguments in the discussion after Eq. (2.5) hold in an exactly analogous fashion; the arguments involving exogeneity of the market shock η in a single cross section extend naturally if we perform analysis with several cross sections, but always condition on T = t. In sum, including a time trend in a nonparametric fashion does not alter the conclusions drawn from a single cross section, because the identification analysis is done period by period. This discussion establishes that the conditions for identification in the population are fairly natural given both the demand and the consumption literature. To the degree that in the application the choice of bandwidth implicitly pools observations across a few adjacent time periods (as opposed to only using the data cross section by cross section, as is done in the identification analysis), or that intertemporal preferences come from a different class than the quadratic/CARA for at least part of the population, we expect a potential (small sample) misspecification bias resulting from our correction for endogeneity. However, in line with other papers in this literature (e.g., Haag et al. (2009b), Blundell et al. (2010)), we do not find that correcting for endogeneity changes the results in our application in any appreciable way. Thus, while we would like to add the cautionary remark that we impose structure which may be restrictive in some applications, we think of the choice of assumptions as defensible and/or inconsequential for the application put forward in this paper.
2.2. Implications for observable behavior

Given these assumptions and notations, we concentrate first on the relation between theoretical quantities and the identified objects, specifically m(ξ, z) = E[W|X = ξ, Z = z] and M(s, z) = E[W|S = s, Z = z], which denote the conditional mean regression functions using either endogenous regressors and controls, or instruments and controls. Finally, let σ{X} denote the information set (σ-algebra) spanned by X, and Ξ− be the Moore–Penrose pseudo-inverse of a matrix Ξ. More specifically, we focus on the following questions: 1. How are the identified marginal effects (i.e., Dx m or Dx M) related to the theoretical derivatives Dx φ? 2. How and under what kind of assumptions do observable elements allow inference on key elements of economic theory? Especially, what do we learn about homogeneity, adding up, as well as negative semidefiniteness and symmetry of the Slutsky matrix, which in the standard consumer demand setup we consider (with budget shares as dependent variables, and log prices and log total expenditure as regressors), and in the underlying heterogeneous population (defined by φ, x, and v), takes the form

S(x, v) = Dp φ(x, v) + ∂y φ(x, v)φ(x, v)′ + φ(x, v)φ(x, v)′ − diag{φ(x, v)}.

Here, diag{φ} denotes the matrix having the φj, j = 1, . . . , L, on the diagonal and zeros off the diagonal. The second question will be the subject of Propositions 2.2 and 2.3. We start, however, with Lemma 2.1, which answers the question on the relationship between estimable and theoretical derivatives. All proofs may be found in the Appendix.

Lemma 2.1. Let all the variables and functions be as defined above. Assume that Assumptions 2.1–2.3 hold. Then,
(i) E[Dx φ(X, V)|X, Z] = Dx m(X, Z) (a.s.),
(ii) E[Dx φ(X, V)|S, Z] = Ds M(S, Z) Ds µ(S, Z)− (a.s.).

Ad (i): This result establishes that every individual's empirically obtained marginal effect is the best approximation (in the sense of minimizing distance with respect to the L2-norm) to the individual's theoretical marginal effect. Note that it is only the local conditional average of the derivative of φ that may be identified, and not the function φ itself. Suppose we were given data on consumption, total expenditure, ‘‘education in years’’ and ‘‘gender’’ as above. Then, by use of the marginal effect Dx m(X, Z), we may identify the average marginal total expenditure effect on consumption of, e.g., all female high school graduates, but not the marginal effect of every single individual. Arguably, for most purposes the average effect on the female graduates is all that decision makers care about. Ad (ii): This result illustrates that Assumption 2.2 has several implications on observables, depending on which conditioning set is used. As the preceding discussion shows, conditioning on (X, Z) seems natural, as the subgroups formed have a direct economic interpretation. However, recall that σ{X, Z} ⊆ σ{S, Z}. Hence, σ{S, Z} should be employed, because conditional expectations using σ{S, Z} are closer in L2-norm to the true derivatives. Altonji and Matzkin (2005) derive an estimator for E[Dx φ(X, V)|X, Z] by integrating (i) over U, but conditioning on (X, Z). Since our focus is on testing economic restrictions, we avoid the integration step, as in many cases it reduces the power of all test statistics. Therefore we always give results using both σ{X, Z} and σ{S, Z}. The former has a more clear-cut economic interpretation; the latter yields tests of higher power.
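Since all tests below are built from local polynomial estimates of levels and derivatives of m, the following univariate local-linear sketch conveys the basic mechanics of estimating m and Dx m at a point. The paper's actual implementation (systems of equations, higher-order polynomials, generated regressors) is described in the supplement; this fragment only illustrates the type of estimator.

```python
# Minimal local-linear sketch of the building block in Lemma 2.1(i):
# the level m(x0) and the derivative of the mean regression at x0,
# obtained from a kernel-weighted least squares fit.
import numpy as np

def local_linear(x, w, x0, h):
    """Return (level, slope) of E[w|x] at x0 with bandwidth h."""
    u = (x - x0) / h
    k = np.exp(-0.5 * u**2)                     # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])
    WX = X * k[:, None]
    beta = np.linalg.solve(X.T @ WX, WX.T @ w)  # weighted LS coefficients
    return beta[0], beta[1]                     # m_hat(x0), slope at x0

# toy data: a budget share as a noisy function of log total expenditure
rng = np.random.default_rng(1)
y = rng.normal(5.0, 0.5, 5000)
w = 0.6 - 0.05 * y + 0.02 * rng.standard_normal(5000)
level, slope = local_linear(y, w, x0=5.0, h=0.3)
print(level, slope)                             # roughly 0.35 and -0.05
```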
We now turn to the question of which economic properties in a heterogeneous population have testable counterparts. This problem bears some similarities with the literature on aggregation over agents in demand theory, because taking conditional expectations can be seen as an aggregation step. We introduce the following notation: Let ej denote the jth unit vector of length L + 1, let Ej = (e1, . . . , ej, 0, . . . , 0), and let ι be the vector containing only ones. Now we are in a position to state the following proposition, which is concerned with adding up and homogeneity of degree zero, two properties which are related to the linear budget set:

Proposition 2.2. Let all the variables and functions be defined as above, and suppose that Assumptions 2.1 and 2.3 hold. Then, ι′φ = 1 (a.s.) ⇒ ι′m = 1 and ι′M = 1 (a.s.). If, in addition, F_{A|X,Z}(a, ξ, z) = F_{A|X,Z}(a, ξ + λ, z) holds for all ξ ∈ X and λ > 0, then
φ(X, V) = φ(X + λ, V) (a.s.) ⇒ m(X, Z) = m(X + λ, Z) (a.s.).

Under Assumptions 2.1–2.3 we obtain that

φ(X, V) = φ(X + λ, V) (a.s.) ⇒ Dp m ι + ∂y m = 0 (a.s.), and
φ(X, V) = φ(X + λ, V) (a.s.) ⇒ Ds M Ds µ− EL ι + Ds M Ds µ− eL+1 = 0 (a.s.).
Finally, if we also assume Assumptions 2.1 and 2.3, µ(S, Z) = µ(S + λ, Z), as well as F_{A|S,Z}(a, s, z) = F_{A|S,Z}(a, s + λ, z), then

φ(X, V) = φ(X + λ, V) (a.s.) ⇒ M(S, Z) = M(S + λ, Z) (a.s.).

Remark 2.2. Note that we do not require any type of independence for adding up to carry through to the world of observables. This is very comforting since – in the absence of any direct way of testing this restriction – adding up is imposed on the regressions by deleting one equation. The homogeneity part of Proposition 2.2 is ordered according to the severity of assumptions: Homogeneity carries through to the regression conditioning on endogenous regressors and controls under a homogeneity assumption on the cdf. This assumption is weaker than Assumption 2.2, as it is obviously implied by conditional independence. The derivative implications are generally true under conditional independence. This is particularly useful for testing homogeneity using the regression including instruments, as for this regression to inherit homogeneity an implausible additional homogeneity assumption, µ(S, Z) = µ(S + λ, Z), has to be fulfilled.

The following proposition is concerned with the Slutsky matrix. We need again some notation. Let V[G, H|F] denote the conditional covariance matrix between two random vectors G and H, conditional on some σ-algebra F, and V[H|F] be the conditional covariance matrix of a random vector H. We will also make use of the second moment regressions, i.e., m2(ξ, z) = E[WW′|X = ξ, Z = z] and M2(s, z) = E[WW′|S = s, Z = z]. Moreover, denote by vec the operator that stacks an m × q matrix columnwise into an mq × 1 column vector, and by vec−1 the operation that stacks an mq × 1 column vector columnwise into an m × q matrix. We also abbreviate negative semidefiniteness by nsd. Finally, for any square matrix B, let B̄ = B + B′.

Proposition 2.3. Let all the variables and functions be defined as above. Suppose that Assumptions 2.1–2.3 hold. Then, the following implications hold almost surely:

S nsd ⇒ Dp m + (Dp m)′ + ∂y m2 + 2(m2 − diag{m}) nsd, and
S nsd ⇒ Ds M Ds µ− EL + (Ds M Ds µ− EL)′ + vec−1(Ds vec[M2] Ds µ− eL+1) +
2(M2 − diag{M}) nsd.

However, if and only if V[∂y φ, φ|X, Z], respectively V[∂y φ, φ|S, Z], is symmetric do we have, almost surely, S symmetric ⇒ Dp m + ∂y m m′ symmetric, and S symmetric ⇒ Ds M Ds µ− EL + Ds M Ds µ− eL+1 M′ symmetric.
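For concreteness, the observable matrices appearing in Proposition 2.3 (exogenous case) can be assembled from estimated levels and derivatives as in the following sketch. The inputs Dp_m, dy_m2, dy_m, m and m2 stand for local polynomial estimates at a fixed position and are placeholders, not the paper's code.

```python
# Hypothetical helper assembling the test matrices of Proposition 2.3
# (exogenous case) from nonparametric estimates at a position (x0, z0).
import numpy as np

def nsd_test_matrix(Dp_m, dy_m2, m, m2):
    """Dp m + (Dp m)' + dy m2 + 2(m2 - diag{m}): the matrix whose nsd is tested."""
    return (Dp_m + Dp_m.T) + dy_m2 + 2.0 * (m2 - np.diag(m))

def symmetry_test_matrix(Dp_m, dy_m, m):
    """Dp m + dy m m': the matrix whose symmetry is tested."""
    return Dp_m + np.outer(dy_m, m)
```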
Remark 2.3. The importance of this proposition lies in the fact that it allows for testing the key elements of rationality without having to specify the functional form of the individual demand functions or their distribution in a heterogeneous population. Indeed, the only element that has to be specified correctly is a conditional independence assumption. Suppose now we see any of these conditions rejected in the observable (generally nonparametric) regression at a position (x, z). Recalling the interpretation of the conditional expectation as an average (over a ‘‘neighborhood’’), this proposition tells us that there exists a set of positive measure in the population (‘‘some individuals in this neighborhood’’) which does not conform with the postulates of rationality. An interesting question is when the reverse implications hold as well, i.e., when one can deduce properties of S directly from the properties of the observable elements. This issue is related to the concept of completeness raised in Blundell et al. (2007). It is our conjecture that some of the concepts may be transferred, but a detailed treatment is beyond the scope of this paper and is left for future research. The basic intuition for the results is that expectations are linear operators in the functions involved, implying that many properties on the individual level remain preserved. However, difficulties arise when nonlinearities are present, as is the case with conventional tests for Slutsky symmetry. That we are able to circumvent this problem in the case of negative semidefiniteness relies on the fact that we are able to transform the property on the individual level so that it is linear in the squared regression. Proposition 2.3 illustrates clearly that appending ‘‘an additive error capturing unobserved heterogeneity’’ and proceeding as if the mean regression m has the properties of individual demand is not the way to solve the problem of unobserved heterogeneity. Note that we may always append a mean independent additive error, since φ = m + (φ − m) = m + ε. The crux is that the error is generally a function of y and p. For instance, the potentially nonsymmetric part of the Slutsky matrix becomes
S = Dp m + ∂y m m′ + Dp ε + ∂y m ε′ + ∂y ε m′ + ∂y ε ε′,
and the last four terms will not vanish in general. Returning to Proposition 2.3, one should note a key difference between negative semidefiniteness and symmetry. For the former we may provide an ‘‘if’’ characterization without any assumptions other than the basic ones. To obtain a similar result for symmetry, we have to invoke an additional assumption about the conditional covariance matrix V[∂y φ, φ|X, Z]. This matrix is unobservable – at least without any further identifying assumptions. Note that this assumption is (implicitly) imposed in all of the demand literature, since symmetry is inherited by Dp m + ∂y m m′ only under this assumption. Conversely, if this additional assumption does not hold, we are able to test at most for homogeneity, adding up and Slutsky negative semidefiniteness. This amounts to demand behavior generated by complete, but not necessarily transitive, preferences. Details of this demand theory of the weak axiom can be found in Kihlstrom et al. (1976) and Quah (2000). Furthermore, note some parallels with the aggregation literature in economic theory: Only adding up and homogeneity carry immediately through to the mean regression. This result is similar in spirit to the Mantel–Sonnenschein theorem, where only these two properties are inherited by aggregate demand. It is also well known in this literature that the aggregation of negative semidefiniteness (usually shown for the Weak Axiom) is more straightforward than that of symmetry. Also, a matrix similar to V[∂y φ, φ] has been used in this literature (as ‘‘increasing dispersion’’, see Härdle et al. (1991)). Finally, there is an issue about what can be learned from data. To continue our discussion at the beginning of this section, in general
the underlying function φ is not identified. One implication is the following: Suppose there are two groups in the population, types 1 and 2. Both types could violate rationality; however, it is entirely possible that the average does not violate rationality. As such, our results provide a lower bound on the violations of rationality in a population. A more precise answer could be obtained if a linear structure were assumed. However, in this paper we do not take the linearity assumption for granted. Consequently, under our assumptions this is all we can say about consumer rationality, given the data and mean regression tools.

3. Empirical implementation

In this section we discuss all matters pertaining to the empirical implementation: We first give a brief sketch of the econometric methods, then we provide an overview of the data, mention some issues regarding the econometric methods, and finally present the results.

3.1. Econometric specifications and methods

From our identification results in Propositions 2.2 and 2.3, we are able to characterize testable implications of economic theory on nonparametric mean regressions. It is imperative to note that these mean regressions are only the reduced form model, and not a specification of the structural model. As discussed above, these empirically estimable objects are local averages. Consequently, they depend on the position, and we will evaluate them at a set of representative positions (actually, random draws from the sample). As a characterization of our empirical results, we will give the percentage of positions at which we reject the null. Needless to say, this may both under- and overestimate the true proportion of violations. In the one case, the average may by accident behave nicely while all individuals behave irrationally. In the other, the average may not conform with rationality while it is only relatively few nonrational individuals that cause this violation. However, since we are evaluating the population at a large number of local averages over relatively small (but potentially overlapping) neighborhoods, we may expect both effects to be of secondary importance and to balance to some degree. In the spirit of our above results, we could argue that the percentage of rejections obtained using the best approximations given the data is itself at least a good approximation to the underlying fraction of irrational individuals, given the information at our disposal. As the above discussion should have made clear, the precise testable implications will depend on the independence assumption that a researcher deems realistic. In the demand literature, it is log total expenditure that is taken to be endogenous, and labor income is taken as additional instrument, see Lewbel (1999). The basic reason is preference endogeneity: since the broad aggregates of goods typically considered (e.g., ‘‘Food’’) explain much of total expenditure, it is suspected that the preferences that determine demand for these broad categories of goods and those which determine total nondurable consumption are dependent. Prices, in turn, are assumed to be exogenous because, unlike in IO approaches where demand for individual goods is analyzed and price endogeneity may be suspected on the grounds that firms charge different consumers differently or that there are unobserved quality differences, the broad categories of goods typically analyzed are invariant to these types of considerations.
Still, one may question this exogeneity assumption, and we can only emphasize that nothing in our chain of argumentation precludes handling this endogeneity in precisely the same way as total expenditure endogeneity. For the purpose of recovering U, we have to specify the equation relating the endogenous regressor, total expenditure Y, to the instruments. In this paper, we have settled for the general nonparametric form Y = µ(S, Q, U) = ψ(S, Q) + σ2(S, Q)U, with the
normalization E[U|S, Q] = 0 and V[U|S, Q] = 1. In terms of the CARA example discussed above, this allows the variance of the income shocks to depend on household observables. In supplementary material that may be downloaded from the author's webpage, we discuss issues in the implementation of the estimators and test statistics in detail. Here we just mention briefly that we estimate all regression functions by local polynomials, and form pointwise nonparametric tests using these estimators. We evaluate all tests at a grid of n = 3000 observations that are iid draws from the data. To derive the asymptotic distribution of our test statistic at each of these observations, we apply bootstrap procedures. In the case of testing homogeneity, we use the fact that the null hypothesis m(P, Y, Z) = m(P + λ, Y + λ, Z) may be reformulated, by taking derivatives, as

Dp m(ζ0)ι + ∂y m(ζ0) = 0 for all ζ0 ∈ supp(P, Y, Z),   (3.1)
where we let ζ0 = (p0, y0, z0), and analogously for the endogenous case. This hypothesis is easily rewritten as Rθ(ζ0) = 0, where R denotes the (L − 1) × (L − 1)(L + K + 2) matrix R = I_{L−1} ⊗ [0, ι′_{L+1}, 0′_K], and θ(ζ0) is the vector of stacked levels m(ζ0) and h-scaled derivatives, (hDp m(ζ0), h∂y m(ζ0), hDz m(ζ0)). Next, suppose that θ̂(ζ0) denotes a local polynomial estimator of θ(ζ0) with generated regressors, defined in detail in the econometric supplement to this paper. In addition, assume again a multiplicative heteroscedastic error structure, i.e., Wi − m(Pi, Yi, Zi) = Σ1(Pi, Yi, Zi)ηi. A natural test statistic is then:
Ψ̂^Hom,Ex(ζ0) = (R[θ̂(ζ0) − h²b̂ias(ζ0)])′ (R[Σ̂1Σ̂1′(ζ0) ⊗ Â(ζ0)]R′)^{−1} R[θ̂(ζ0) − h²b̂ias(ζ0)],

where Â(ζ0) = f̂X(ζ0)^{−1} B^{−1}CB^{−1}, B and C are matrices of kernel constants, and b̂ias(ζ0) is a pre-estimator for the bias.5 Then, by a trivial corollary to the large sample theory of local polynomial estimators with generated regressors found in the supplementary material, Ψ̂^Hom,Ex →d χ2_{L−1}. This test can be thought of as being based on the confidence interval at a certain point, and is a standard pointwise test of derivatives, modified by the fact that we now consider systems of equations with generated regressors. Constructing a bootstrap sample under the null, i.e., with homogeneity imposed, is straightforward: simply subtract log total expenditure from all the log prices and omit the log total expenditure regressor from the regression (this imposes homogeneity on the regression). Adding the bootstrap residuals ε*_i to the homogeneity-imposed regression produces the bootstrap sample (Y*_i, X*_i) = (Y*_i, Xi). The limiting distribution, as well as arguments showing the consistency of the bootstrap, are then straightforward from the asymptotics of local polynomials.
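A schematic version of this bootstrap construction is given below. The routine fit again stands for an arbitrary nonparametric regression, and the Rademacher sign draw is merely one simple, illustrative way of generating bootstrap residuals; the exact resampling scheme used in the paper is described in the supplement.

```python
# Sketch of a bootstrap sample under the homogeneity null: regress budget
# shares on relative log prices (log p minus log total expenditure) and the
# controls, omitting y, then perturb the fitted values with resampled
# residuals. 'fit(X, Y)' is a placeholder returning fitted values.
import numpy as np

def bootstrap_sample_under_null(W, logp, y, Z, fit, rng):
    """W: (n x L-1) budget shares; logp: (n x L-1) log prices;
    y: (n,) log total expenditure; Z: (n x d) controls."""
    P_rel = logp - y[:, None]                  # impose homogeneity via relative prices
    m0 = fit(np.column_stack([P_rel, Z]), W)   # homogeneity-imposed regression
    resid = W - m0
    signs = rng.choice([-1.0, 1.0], size=len(W))[:, None]   # wild-type draw
    return m0 + resid * signs                  # bootstrap outcome under H0
```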
The case of testing symmetry is more involved. To derive a test for the first case, stack the L(L − 1)/2 nonlinear restrictions at a fixed position ζ0,

R_kl(θ, ζ0) = [∂_{pk} m_l(ζ0) + ∂y m_l(ζ0)m_k(ζ0)] − [∂_{pl} m_k(ζ0) + ∂y m_k(ζ0)m_l(ζ0)] = 0, k, l = 1, . . . , L − 1, k > l,

into a vector R. Consequently, ‘‘Dp m(ζ0) + ∂y m(ζ0)m(ζ0)′ symmetric for all ζ0 ∈ supp(P, Y, Z)’’ becomes ‘‘R(θ, ζ0) = 0 for all ζ0 ∈ supp(P, Y, Z)’’. This suggests using the quadratic form
Γ̂^Sym,Ex(ζ0) = R(θ̂, ζ0)′ R(θ̂, ζ0),
5 Since the bias contains largely second derivatives, we may use a local quadratic or cubic estimator for the second derivative, with a substantial amount of undersmoothing.
and check whether this is significantly bigger than zero. Again, building upon the large sample theory of local polynomial estimators with generated regressors, we could derive the limiting distribution. However, this is even more involved than before, and the bootstrap appears to be the method of choice. An adaptation of the idea of Haag et al. (2009b) for deriving a bootstrap version of the test statistic allows us to determine an estimator for the true distribution of the test statistic under the null (by exploiting the structure of the kernel estimator, as well as the fact that under the null the bias vanishes). This has the added benefit that the consistency of the bootstrap may be established along similar lines as in Haag et al. (2009b). Finally, a bootstrap procedure for testing negative semidefiniteness may be devised using a similar idea of Härdle and Hart (1993). Essentially, the idea is to look at the bootstrap distribution of the largest eigenvalue. This procedure is consistent provided there is no multiplicity of eigenvalues, which is not a problem in our application. For technical details we refer again to the supplementary material that may be downloaded from the author's webpage.
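The largest-eigenvalue idea can be summarized in a few lines. Given a point estimate of the (symmetrized) Slutsky-type matrix and bootstrap draws generated under the null, the bootstrap p-value compares the largest eigenvalue of the estimate with its resampled distribution; all names below are placeholders, not the paper's code.

```python
# Minimal sketch of the largest-eigenvalue test of negative semidefiniteness
# at a point, in the spirit of Haerdle and Hart (1993).
import numpy as np

def nsd_pvalue(S_hat, S_boot):
    """S_hat: (L x L) estimate; S_boot: (B x L x L) bootstrap draws under H0."""
    lam = np.linalg.eigvalsh(0.5 * (S_hat + S_hat.T))[-1]      # largest eigenvalue
    lam_boot = np.array([np.linalg.eigvalsh(0.5 * (S + S.T))[-1] for S in S_boot])
    return float(np.mean(lam_boot >= lam))                     # bootstrap p-value
```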
3.2. Data

The FES reports a yearly cross section of labor income, expenditures, demographic composition and other characteristics of about 7000 households every year. We use the years 1974–1999, but exclude the respective Christmas periods, as they contain too much irregular behavior. We focus on two types of subpopulations: one person and two person households. Specifically, in the one person household category we consider all single person households, but we also stratify according to men and women. In the category of two person households we narrow the subpopulation down a bit further to include households where both members are adults and at least one is working. This is going to be our benchmark subpopulation, not least because it was the one commonly used in the parametric demand system literature, see Lewbel (1999). We provide several tables showing various summary statistics of our data in the Appendix. The expenditures on all goods are grouped into three categories. The first category is related to food consumption and consists of the subcategories food bought, food out (catering) and tobacco, which are self-explanatory. The second category contains expenditures which are related to the house, namely housing (a more heterogeneous category; it consists of rent or mortgage payments) and household goods and services. Finally, the last group consists of motoring and fuel expenditures, categories that are often related to energy prices. For brevity, we call these categories food, housing and energy. These broader categories are formed because more detailed accounts suffer from infrequent purchases (recall that the recording period is 14 days) and are thus often underreported. Together they account for approximately half of total expenditure on average, leaving a large fourth residual category. We removed outliers by excluding the upper and lower 5% of the population in each of the three groups, and have removed zero budget shares for any good, as the rationality restrictions we want to test need not hold at corner solutions. Labor income is constructed as in the definition of the households below average income (HBAI) study. It is roughly defined as net labor income after taxes, but including state transfers. We employ the remaining household covariates as regressors (including measures of wealth), and include a time trend separately in the set of covariates. We use principal components to reduce the vector of household covariates to one (approximately continuous) factor and consider a monthly time trend taking 310 possible values, mainly because we require continuous covariates for nonparametric estimation. While this has some additional advantages, it is arguably ad hoc. However, we performed
some robustness checks, like altering the component, adding parametric indices to the regressions, or including the time trend in a second factor, and the results do not change in an appreciable fashion; see Appendix C for the treatment of time in particular. Finally, the basis for forming prices comes from time series on individual goods in the Retail Price Index, which is published at the National Statistics Online web site. It is monthly data spanning all 25 years we consider. As an aside, the FES was collected to construct exactly these price indices. We use the wealth of information on individual price time series, as well as the detailed expenditure information on all subcategories of goods, to form the more correct Stone–Lewbel (SL) prices, pioneered by Stone and refined by Lewbel (1989); see also Hoderlein and Mihaleva (2008) for details and applications to standard demand models. Roughly speaking, SL prices use the fact that within a category of goods (say, food), people differ in their tastes for the individual goods. While the common practice of using aggregate price indices implicitly assumes that all individuals have identical Cobb–Douglas preferences for all goods within a category, SL prices allow all individuals to have heterogeneous Cobb–Douglas preferences, meaning that the standard practice is contained as a (restrictive) special case. For this reason, SL prices should always be used when possible (they create less of an aggregation-across-goods bias, see Lewbel (1989) and Hoderlein and Mihaleva (2008) for details). Due to the variation in weights across individuals, these prices have the added advantage that they provide valuable cross section variation, which allows us to correct for a monthly time trend. This completes our list of variables.
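As an illustration of the SL construction, under within-group Cobb–Douglas preferences the group price index is a geometric mean of the elementary prices, with the household's own within-group budget shares as weights. The sketch below uses the simple Stone form of this index; the exact Cobb–Douglas index differs by a share-dependent constant (see Lewbel (1989)), and all numbers shown are made up.

```python
# Sketch of a household-specific Stone-Lewbel group price index:
# a budget-share-weighted geometric mean of the elementary prices,
# expressed in logs. Variable names and values are illustrative.
import numpy as np

def stone_lewbel_log_price(log_prices, expenditures):
    """log_prices: (J,) log prices of the goods in one group;
    expenditures: (J,) this household's spending on those goods."""
    shares = expenditures / expenditures.sum()   # within-group budget shares
    return float(shares @ log_prices)            # household-specific log group price

# e.g., the 'food' group price for one household:
log_p_food = stone_lewbel_log_price(np.log([1.10, 0.95, 2.30]),
                                    np.array([40.0, 25.0, 10.0]))
```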
Fig. 1. Budget share food as function of total expenditure.
3.3. Empirical results in detail

3.3.1. Results for the benchmark group: two person households

We start out with our benchmark group, i.e., two person households, both of whom are working (from now on ‘‘couples’’). As already mentioned, this subpopulation has often been selected in demand applications using the FES, because working couples are believed to be relatively homogeneous and to exhibit less measurement error. Although they are not the focus of this paper, we display in the Appendix some nonparametric estimates of the regression function and its derivatives for this subpopulation, because they are the building blocks for our test statistics. In Figs. 1–3 in the Appendix we show the budget shares of the three categories of goods against log total expenditure. Note that the food and energy budget shares are everywhere downward sloping in total expenditure, whereas housing is only weakly decreasing or increasing, but in absolute value relatively constant across the expenditure range. This conforms well with results obtained elsewhere in the literature, see Lewbel (1999). Moreover, we also show some compensated own price, as well as compensated cross price, effects of the symmetrized Slutsky matrix in Figs. 4–9. The figures show that the compensated own price effects of food, energy and housing are negative, as predicted by theory. The cross price effects are smaller (in absolute value) in comparison, but note the relatively large confidence bands around all the derivatives, which arise because of our unrestrictive specification. The negative and dominant own price effects combined with the large standard errors are indicative that negative semidefiniteness may not be rejected. All functions are plotted at the mean level of all other regressors, and obviously at other values of the regressors the picture varies. The largely negative compensated own price effects, however, remain preserved. Turning to the test statistics, they are constructed such that the implications of economic theory are true under the null. Moreover, the tests are performed pointwise at the individual observations. A rejection at an individual observation means
Fig. 2. Budget share energy as function of total expenditure.
that the original condition, e.g., negative semidefiniteness in the underlying heterogeneous population, cannot hold in the ‘‘neighborhood’’ of an individual. Although this gives an accurate picture of the behavior on the individual level, the consequence is a flood of results, and we aggregate across the population as a means of condensing them. Hence, our interest centers on the proportion of the population for which a certain property is not violated. Before discussing the results in detail, two important issues have to be clarified. The first concerns the functional form. Using the test of Haag et al. (2009a), we reject the null that the regression is a Quadratic Almost Ideal with a p-value of 0.001. Indeed, this finding strongly suggests using flexible nonparametric tools; otherwise, rejections of economic hypotheses may occur just because we have chosen the wrong functional form. The second issue is endogeneity: We use total expenditure as a regressor, which is potentially endogenous. Employing again a Haag et al. (2009a) type test, we reject the hypothesis that U should be excluded from the regression with a p-value of 0.01. Although this strongly suggests that a control function correction
Fig. 3. Budget share housing as function of total expenditure.
Fig. 5. Compensated own price effect energy as function of total expenditure.
Fig. 4. Compensated own price effect food as function of total expenditure.
Fig. 6. Compensated own price effect housing as function of total expenditure.
Table 3.1
Percentage of two person households in accordance with homogeneity.

Hypothesis                                        0.95
Homogeneity under exogeneity                      0.991
Homogeneity under endogeneity                     0.994
for endogeneity should be adopted, we will see below that in terms of economic restrictions the results are largely comparable. The main results are condensed in Tables 3.1–3.3. We report the results using total expenditure under both exogeneity and endogeneity, where we handle the latter by including control function residuals. The results using the instruments-plus-controls regressions are similar to those from the standard endogeneity correction, and are therefore not reported here. Table 3.1 shows the results of the test of the homogeneity hypothesis. The first column shows the hypothesis that is being tested, while the second column displays the proportion of the population for which we do not reject the respective null hypothesis at the 0.95 confidence level.
Table 3.2
Percentage of two person households in accordance with Slutsky symmetry.

Hypothesis                                        0.95
Symmetry under exogeneity                         0.85
Symmetry under exogeneity and homogeneity         0.93
Symmetry under endogeneity and homogeneity        0.93
Table 3.3
Percentage of couples in accordance with negative semidefiniteness.

Hypothesis                                                    0.95
Negative semidefiniteness under exogeneity                    0.92
Negative semidefiniteness under exogeneity and homogeneity    0.97
Negative semidefiniteness under endogeneity and homogeneity   0.98
From these results it is obvious that homogeneity is well accepted, whether or not we correct for endogeneity. The results improve slightly if we move to a scenario of endogeneity, but this was to be expected, as we also add one regressor (the control function residuals), and hence the standard errors become larger. However, the difference is only
Fig. 7. Compensated cross price effect of energy price on food as function of total expenditure.
Fig. 8. Compensated cross price effect of food price on energy as function of total expenditure.
marginal. Overall, the results are so good that we will now also report results where we impose homogeneity. Next, the results for symmetry are displayed in Table 3.2. As before, the percentage of non-rejections at the 0.95 significance level is displayed. Observe that we first consider the exogenous setup, then impose homogeneity, and finally also correct for endogeneity. Homogeneity should generally be imposed because it is hard to imagine circumstances under which symmetry or negative semidefiniteness holds while homogeneity does not (and we have already seen that homogeneity is likely to hold across the entire population). Accounting for homogeneity improves the result substantially, which is particularly comforting as the imposition of homogeneity actually reduces the number of regressors, i.e., the standard errors become smaller. The inclusion of endogeneity improves the result only very marginally, but this was to be expected because of the increase in standard errors. With respect to the violations, no clear pattern emerges, i.e., no clustering at certain household characteristics.
Fig. 9. Compensated cross price effect of energy price on housing as function of total expenditure.
Finally, consider the hypothesis of negative semidefiniteness. We incorporate the results of the symmetry test by considering this property for the symmetrized Slutsky matrix. With respect to the individual results, we report both the uncompensated and compensated price effect (i.e., Slutsky) matrices. The percentages of the population for which negative semidefiniteness cannot be rejected at the 0.95 confidence level are shown in Table 3.3. Note that negative semidefiniteness is quite well accepted, even in the baseline scenario with exogeneity. Imposing homogeneity, the Slutsky matrix is more often negative semidefinite, which is particularly comforting as the imposition of homogeneity again reduces the number of regressors. The percentage of the population for which all three hypotheses hold is between 0.74 and 0.88, depending on whether we consider the case with exogeneity or whether we impose homogeneity. Finally, endogeneity generally seems to play a minor role in this application.6 As before, there is no obvious structure in the remaining rejections. In summary, these results establish that our quite general nonparametric approach works well in data sets of the size typically found in applications, and that it corroborates economic theory. As the main econometric caveat for our analysis, the following point should nevertheless be mentioned: some point estimates of the Slutsky matrix appear not to be symmetric. The same is, perhaps to a smaller extent, also true of negative semidefiniteness. However, it is especially the cross price effects that are measured with very low precision, and hence have huge standard errors. This familiar problem of parametric demand analysis (cf. Lewbel (1999)) is aggravated by the high dimensional nonparametric approach taken here. Hence, we should stress the fact that we were not able to reject these hypotheses. However, we want to emphasize squarely that this is inevitable given the generality of our assumptions: if we are not willing to assume more, this is what we obtain. While we can safely rule out mistakes from wrongly assuming a certain functional form or too much homogeneity across individuals, the lower precision is the price to pay. A logical next step would be to carefully introduce semiparametric structure, but we leave this for future research. Nevertheless, it is noteworthy that
6 This disagrees with some findings in the literature which use the same data, in particular Blundell et al. (2007).
overall, economic theory provides an acceptable hypothesis for the population, as is indicated by results beyond the 90% mark. Hence, at least in this specific subpopulation, economic theory fares well. Since the issue of correcting for time effects may be very important, we report several robustness checks in a separate Appendix. The first is with time added to the principal component (i.e., it enters through a factor structure with other covariates). The results in this specification are very comparable; if anything, they are slightly better. Then we also stratify the data set according to whether the household is interviewed before 1988 or thereafter. The results remain stable in every subsample; the marginal decrease in the percentages of rational individuals is easily explained by small sample issues. We conclude that our results for couples remain fairly robust to various (semiparametric) specifications of the time trend, as well as to the introduction of endogeneity. That the acceptance of rationality is less than 100% in this subpopulation may be attributed to a number of factors. One of them is that the data are collected at the household level, while rationality restrictions pertain to individuals. The passage from individual decisions to those of the household is less than trivial, as has recently been emphasized (see Browning and Chiappori (1998), Fortin and Lacroix (1997) and Cherchye and Vermeulen (2008), amongst others). Hence we are now going to consider one person households and compare them with the composite ones we have just analyzed.

3.3.2. Comparison of two person and single households

This subsection considers the behavior of the same tests, involving the same nonparametric building blocks. We are again using the FES data, but now we focus on single households. Economic theory only makes statements about individual behavior; in light of this, the fact that we have treated two person households like individuals may seem problematic. To give a particularly simple example, if one person solely decides about buying bundles of goods under one price regime, while the other does so under another price regime, there is no reason why even basic rationality requirements like the Weak Axiom of Revealed Preference should hold for the composite household. More generally, we face a problem of aggregation of preferences, and this problem has spurred extensive research, see Browning and Chiappori (1998), Fortin and Lacroix (1997) and Cherchye and Vermeulen (2008). An upshot of this literature is that we would expect single households to perform better than couples when it comes to obeying rationality restrictions. The following results, however, show a less clear cut picture. We report the results separately for men (Tables 3.4–3.6) and women (Tables 3.7–3.9), but we have also pooled both groups without obtaining materially different results. Summary statistics for all subpopulations can again be found in the Appendix. We start out with the population of single men, and we consider homogeneity first, because it is a basic requirement for the following analysis. We obtain similarly high acceptance rates as before, which is comforting. Note also that we do not find much difference with respect to the correction for endogeneity; the overall result for testing homogeneity is qualitatively similar to, but slightly worse than, the results for two person households. Next, we turn to symmetry of the Slutsky matrix.
The results are again summarized in the following table. They are rather similar to – if marginally worse than – those for couples, but observe that the results actually get worse when endogeneity is corrected for. Since the correction for endogeneity involves the introduction of control functions, and hence would result in tests of lower power due to the increase in dimensionality of the regressions, the fact that the number of rejections of rationality increases casts doubt on this specific correction for endogeneity. The same conclusion (i.e., that correcting for endogeneity does not
Table 3.4
Percentage of single males in accordance with homogeneity.
Hypothesis                        0.95
Homogeneity under exogeneity      0.95
Homogeneity under endogeneity     0.95

Table 3.5
Percentage of single males in accordance with Slutsky symmetry.
Hypothesis                                      0.95
Symmetry under exogeneity                       0.80
Symmetry under exogeneity and homogeneity       0.86
Symmetry under endogeneity and homogeneity      0.85

Table 3.6
Percentage of single males in accordance with Slutsky negative semidefiniteness.
Hypothesis                                                      0.95
Negative semidefiniteness under exogeneity                      0.83
Negative semidefiniteness under exogeneity and homogeneity      0.86
Negative semidefiniteness under endogeneity and homogeneity     0.86

Table 3.7
Percentage of single females in accordance with homogeneity.
Hypothesis                        0.95
Homogeneity under exogeneity      0.86
Homogeneity under endogeneity     0.85

Table 3.8
Percentage of single females in accordance with Slutsky symmetry.
Hypothesis                                      0.95
Symmetry under exogeneity                       0.81
Symmetry under exogeneity and homogeneity       0.95
Symmetry under endogeneity and homogeneity      0.94

Table 3.9
Percentage of single females in accordance with negative semidefiniteness.
Hypothesis                                                      0.95
Negative semidefiniteness under exogeneity                      0.93
Negative semidefiniteness under exogeneity and homogeneity      0.95
Negative semidefiniteness under endogeneity and homogeneity     0.95
change the results materially and, if anything, only worsens them) is corroborated by the results with respect to negative semidefiniteness, which are displayed in Table 3.6. It is roughly two thirds of this subpopulation for which all restrictions hold. In summary, we find largely unchanged, but slightly worse, results compared with couples.

The results do not change qualitatively when we consider single females, see Tables 3.7–3.9. Roughly speaking, homogeneity fares a bit worse than for males. However, when it comes to symmetry under exogeneity, the results become very comparable; women even outperform men once homogeneity is imposed (Table 3.8). Finally, with respect to negative semidefiniteness of the Slutsky matrix, single women outperform both their male counterparts and couples: the fraction in accordance with negative semidefiniteness is substantially higher than for male singles. This may be seen as a hint that at least the female part of the singles population exhibits more rational behavior. Again, it is approximately two thirds of the population for which all three restrictions hold simultaneously; however, once we impose homogeneity, more than 90% of the female population exhibits both a negative semidefinite and a symmetric Slutsky matrix. The results do not change in any material way when we pool men and women into one ‘‘single’’ category; if anything the results
improve slightly compared to each category in isolation. However, the overall fluctuations between the various subpopulations are relatively minor and do not show any clear pattern. Hence we draw the conclusion that the rationality restrictions are well, though not perfectly, accepted in all subpopulations. At this point we would like to emphasize that there may be issues of measurement error with single households; in particular, it is quite possible that they form parts of extended households living at several locations, and/or that their consumption behavior is partially determined by institutions, in particular their workplaces. As such, we also want to express our skepticism about whether 100% compliance with rationality is ever to be expected. In summary, the conclusion we draw from our comparison is that it is hard to detect a material difference between collective decision makers and individuals in our data. If anything, individuals behave slightly worse in terms of rationality, but we acknowledge that this may be due to limitations in our data.
4. Conclusion

In this paper we have established that it is really only necessary to impose one substantial identifying restriction in order to perform most of demand analysis empirically, i.e., the conditional independence Assumption 2.2. Once this assumption is in place (which, as has been pointed out, may at times be challenging to verify), nothing else apart from regularity conditions has to be imposed to test conclusively the main elements of demand theory, in particular homogeneity and negative semidefiniteness.

It is a key feature of our approach that by and large no other material assumption needs to be imposed. This is particularly true for the functional form of the outcome equation or population homogeneity assumptions. There are two qualifications to this statement. First, in the endogenous case, which is probably less relevant for the application in this paper but may be more relevant in other applications, the IV equation needs to be specified in a way that allows one to solve for a control function. In this paper we have argued that additive separability of the IV equation in a heterogeneous preference parameter is natural in the scenario analyzed here, since the consumption function in the two leading explicitly solvable intertemporal optimization models is additive in such a parameter (Caballero, 1990, 1991). In other applications, an alternative qualitative assumption like monotonicity may be more natural. Second, while the above statement is true without any qualification for homogeneity and negative semidefiniteness of the Slutsky matrix, symmetry of the Slutsky matrix turns out to be the only major implication of rationality that will only ever be testable under additional identification assumptions. The doubts on the empirical verifiability of this property suggest that economic theory should perhaps concentrate on a model of demand based on the Weak Axiom, such as in Kihlstrom et al. (1976) or Quah (2000), whose predictions are entirely testable.

The conditional independence assumption plays a critical role in this paper. It may have quite complex economic implications, and may be rightfully questioned in some applications. Hence, much as in the older parametric literature, the credibility of the results hinges on the quality of the data. However, the assumption is sufficiently general to nest a variety of scenarios including control function IV and proxy variables, which is not detailed in this paper but straightforward. The main task of the researcher is to specify the precise form of the independence assumptions that he believes to be most realistic on economic grounds. This paper has established that from this starting point on, empirical economic analysis can proceed without any major additional restriction. In our application we exemplify how this can be done, and discuss evidence about the empirical extent of the difference in consumer behavior between collective households and individuals. We would like to point out that our approach allows us to focus on this question without additional assumptions, and we hope that this example encourages future applications using a similar framework.

Acknowledgments

I have received helpful comments from two anonymous referees, an associate editor, and from the editor Peter Robinson, as well as from Richard Blundell, Andrew Chesher, Joel Horowitz, Arthur Lewbel, Rosa Matzkin, Whitney Newey and from seminar participants at Berkeley, Brown, ESEM, Copenhagen, Madrid, Northwestern, LSE, Stanford, Tilburg, UCL and Yale. The usual disclaimer applies. I am also indebted to Sonya Mihaleva for excellent research assistance.

Appendix A. Proofs and regularity conditions
A.1. Regularity conditions

Assumption 2.3. Assume that the demand functions $\phi$ are continuously differentiable in $x$ for all $x \in X \subseteq \mathbb{R}^{L+1}$. This restricts preferences to be continuous, strictly convex and locally nonsatiated, with associated utility functions everywhere twice differentiable. Assume in addition that $\mu$ is continuously differentiable in $s$ for all $s \in S \subseteq \mathbb{R}^{K+1}$, and that $D_s\mu$ has full column rank almost surely. Assume that preferences are additively separable over time, which justifies the use of total expenditure as wealth. Moreover, we confine ourselves to observationally distinct preferences, i.e., if $v_j, v_k \in V$ and $v_j \neq v_k$, then there exists a set $X \subseteq \mathbb{R}^{L+1}$ with $P(X) > 0$ such that $\forall x \in X: D_x\phi(x, v_j) \neq D_x\phi(x, v_k)$. Finally, we require the following condition for dominated convergence: there exists a function $g$ such that $\|\mathrm{vec}[D_x\phi(x, \vartheta(q, a))]\| \leq g(a)$, with $\int g(a)\, F_A(da) < \infty$, uniformly in $(x, q)$.

A.2. Proofs of Lemmata and Propositions

Proof of Lemma 2.1. First recall that, by definition, $0 \leq W \leq 1$. Thus, the expectation exists and $E[|W|] \leq c < \infty$, where $c$ is a generic constant (the same holds for the second moment). From this it follows that all conditional expectations exist as well, and are bounded. Next, let $x, z$ be fixed, but arbitrary. Then
$$D_x m(x, q, u) = D_x \int_A \phi(x, \vartheta(q, a))\, F_{A|X,Q,U}(da, x, q, u) = D_x \int_A \phi(x, \vartheta(q, a))\, F_{A|Q,U}(da, q, u),$$
due to Assumption 2.2 $\Longrightarrow F_{A|X,Q,U} = F_{A|Q,U}$. Using dominated convergence, we obtain that the rhs equals
$$\int_A D_x\phi(x, \vartheta(q, a))\, F_{A|Q,U}(da, q, u). \quad (A.1)$$
But due to 2.2 this is (a version of) $E[D_x\phi \mid X = x, Z = z]$. Upon inserting random variables for the fixed $x, z$ the statement follows. For (ii), by the same arguments $D_s M(s, z) = E[D_x\phi \mid S = s, Z = z]\, D_s\mu(s, z)$. Postmultiplying by $D_s\mu(s, z)^{-}$ produces the result.
Proof of Proposition 2.2. Adding up follows trivially by taking conditional expectations of $\iota'\phi = 1$ (a.s.). To see homogeneity, recall that
$$m(x + \lambda, q, u) = \int_A \phi(x + \lambda, \vartheta(a, q))\, F_{A|X,Q,U}(da, x + \lambda, q, u).$$
Inserting $\phi(x + \lambda, v) = \phi(x, v)$ as well as $F_{A|X,Q,U}(a, x + \lambda, q, u) = F_{A|X,Q,U}(a, x, q, u)$ produces
$$\int_A \phi(x + \lambda, \vartheta(a, q))\, F_{A|X,Q,U}(da, x + \lambda, q, u) = \int_A \phi(x, \vartheta(a, q))\, F_{A|X,Q,U}(da, x, q, u) = m(x, q, u).$$
The same argumentation also holds for
$$M(s + \lambda, q, u) = \int_A \phi(\mu(s + \lambda, q, u) + \lambda, \vartheta(a, q))\, F_{A|S,Q,U}(da, s, q, u),$$
using additionally that $\mu(s + \lambda, q, u) = \mu(s, q, u)$. Finally, the statements about the derivatives follow straightforwardly from the fact that homogeneity implies $D_p\phi\,\iota + \partial_y\phi = 0$ (a.s.), in connection with Lemma 2.1.

Proof of Proposition 2.3. Ad negative semidefiniteness: Let $A(\omega)$, $\omega \in \Omega$, denote any random matrix. If $p'A(\omega)p \leq 0$ for all $\omega \in \Omega$ and all $p \in \mathbb{R}^L$, then, upon taking expectations w.r.t. an arbitrary probability measure $F$, it follows that
$$\int p'A(\omega)p\, F(d\omega) \leq 0 \iff p'\left[\int A(\omega)\, F(d\omega)\right]p \leq 0,$$
for all $p \in \mathbb{R}^L$. From this, $S$ nsd (a.s.) $\Rightarrow$ $E[S|X, Z]$ nsd (a.s.) is immediate. Let $E[S|X, Z] = B$, and note that since the definition of negative semidefiniteness of a square matrix $B$ of dimension $L$ involves the quadratic form $p'Bp \leq 0$, we see that if we put $\bar{B} = B + B'$, we have
$$p'\bar{B}p = 2p'Bp \quad \text{for all } p \in \mathbb{R}^L,$$
and $\bar{B}$ is symmetric, implying that $B$ is negative semidefinite if and only if $\bar{B}$ is negative semidefinite. From
$$B = E[S|X, Z] = E[D_p\phi|X, Z] + E[\partial_y\phi\,\phi'|X, Z] + E[\phi\phi'|X, Z] - E[\mathrm{diag}(\phi)|X, Z] = B_1 + B_2 + B_3 + B_4$$
it follows that $\bar{B} = B + B' = B_1 + B_2 + B_3 + B_4 + B_1' + B_2' + B_3' + B_4' = \bar{B}_1 + \bar{B}_2 + 2(B_3 + B_4)$, since $B_3$ and $B_4$ are symmetric. Thus we have that
$$S \text{ nsd (a.s.)} \Rightarrow \bar{B}_1 + \bar{B}_2 + 2(B_3 + B_4) \text{ nsd (a.s.)}.$$
From Lemma 2.1 it is apparent that $\bar{B}_1 = D_p m + D_p m'$. To see that $\bar{B}_2 = \partial_y m_2(x, z)$, first note that due to the boundedness of $W$ the second moments and conditional moments exist, so that
$$\partial_y m_2(x, z) = \partial_y \int_A \phi(x, \vartheta(q, a))\phi'(x, \vartheta(q, a))\, F_{A|X,Z}(da; x, z) = E[\partial_y(\phi\phi')|X = x, Z = z] = E[\partial_y\phi\,\phi' + \phi\,\partial_y\phi'|X = x, Z = z] = \bar{B}_2.$$
$B_3$ and $B_4$ are trivial. Upon inserting random variables, the statement follows. The proof using the regression with instruments and controls follows by the same arguments in connection with Lemma 2.1(iii).

Ad symmetry: To see the ‘‘if’’ direction, note first that $S$ is symmetric iff $K = D_p\phi + \partial_y\phi\,\phi'$ is symmetric, which implies that $E[K|Y, Z]$ is symmetric, since $A_{ij} = E[K_{ij}|Y, Z] = E[K_{ji}|Y, Z] = A_{ji}$, where the subscript $ij$ denotes the $ij$th element of the matrix. This implies in turn that
$$E[K|Y, Z] = E[D_p\phi|Y, Z] + E[\partial_y\phi\,\phi'|Y, Z] = E[D_p\phi|Y, Z] + E[\partial_y\phi|Y, Z]E[\phi'|Y, Z] + V[\partial_y\phi, \phi|Y, Z] \quad (A.2)$$
is symmetric, from which $E[D_p\phi|Y, Z] + E[\partial_y\phi|Y, Z]E[\phi'|Y, Z]$ is symmetric if $V[\partial_y\phi, \phi|Y, Z]$ is assumed to be symmetric. By Lemma 2.1 this equals $D_p m + \partial_y m\, m'$. To establish the ‘‘only if’’ direction, we have to show that $V[\partial_y\phi, \phi|Y, Z]$ not symmetric implies that $S$ symmetric does not imply that $D_p m + \partial_y m\, m'$ is symmetric. To this end, assume again that $S$ is symmetric, and consider (A.2). In this case, $E[K|Y, Z]$ is symmetric, but since $V[\partial_y\phi, \phi|Y, Z]$ is not symmetric we obtain that
$$D_p m + \partial_y m\, m' = E[K|Y, Z] - V[\partial_y\phi, \phi|Y, Z]$$
is not symmetric either.
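The symmetrization step used above — that $B$ is negative semidefinite if and only if $\bar{B} = B + B'$ is — is easy to confirm numerically. The following minimal check is our own illustration, not part of the original proof; all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))            # an arbitrary, generally non-symmetric matrix
B_bar = B + B.T                            # its symmetrization
P = rng.standard_normal((4, 1000))         # many random directions p

quad_B = np.einsum("ij,ik,kj->j", P, B, P)          # p'Bp for each column p of P
quad_B_bar = np.einsum("ij,ik,kj->j", P, B_bar, P)  # p'(B + B')p

# p'(B + B')p = 2 p'Bp, so both quadratic forms have the same sign everywhere:
assert np.allclose(quad_B_bar, 2 * quad_B)
```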
Appendix B

See Table B.1.

Appendix C. Results of tests for robustness of specification

In this part of the Appendix we present results from the same tests applied to two specifications that differ from the main specification in the way time is treated. The first uses time as part of the principal component (i.e., time enters through a factor structure); the second separates the sample into two subsamples, one containing all observations before 1988, the other all observations dating from 1988 onwards. Both specifications reveal no material differences from the leading specification with a separate time trend and all observations, and we conclude that our analysis is robust to small changes in the way we correct for time effects.

C.1. Time in principal components

Table C.1 shows results for tests of the homogeneity hypothesis. The results are very comparable to those using a separate time trend. Overall, the results are so good that we now focus on the results where we impose homogeneity, for the same theoretical and empirical reasons as above. The results for symmetry are displayed in Table C.2. As before, the percentages of non-rejections at the significance levels 0.90 and 0.95 are displayed. Symmetry seems somewhat less universally accepted, in particular if we do not impose homogeneity. Finally, consider the negative semidefiniteness hypothesis. The percentage of the population for which negative semidefiniteness cannot be rejected at the 0.95 confidence level is given in Table C.3: obviously, negative semidefiniteness is widely accepted. All three hypotheses jointly hold for between 76% and 90% of the population, depending on whether we impose homogeneity and correct for endogeneity. Compared to a separate time trend, the results are roughly similar, though even more in favor of the rationality restrictions arising from utility maximization.

C.2. Time stratified according to years

C.2.1. Before 1988

Table C.4 shows results for tests of the homogeneity hypothesis.
Table B.1
Summary statistics of data: household characteristics, income and budget shares after outlier removal.

Couples
Variable            Minimum   1st quartile   Median   Mean     3rd quartile   Maximum
Number of female    1         1              1        1.001    1              2
Number of retired   0         0              0        0.046    0              1
Number of earners   1         1              2        1.749    2              2
Age of HHhead       19        30             47       45       57             86
Fridge              0         1              1        0.995    1              1
Washing machine     0         1              1        0.887    1              1
Centr. heating      0         1              1        0.829    1              1
TV                  0         1              1        0.877    1              1
Video               0         0              0        0.377    1              1
PC                  0         0              0        0.061    0              1
Number of cars      0         1              1        1.315    2              4
Number of rooms     1         5              5        5.363    6              15
ln.HHincome         3.818     4.802          5.314    5.289    5.805          6.602
BS GROUP 1          0.048     0.115          0.158    0.166    0.209          0.349
Food bought         0.004     0.092          0.135    0.148    0.189          0.591
Catering            0         0.019          0.041    0.049    0.069          0.277
Tobacco             0         0              0        0.018    0.026          0.220
BS GROUP 2          0.09      0.218          0.294    0.306    0.386          0.598
Housing             0         0.122          0.195    0.207    0.280          0.582
HHgoods             0         0.021          0.033    0.043    0.089          0.643
HHservices          0         0.020          0.031    0.043    0.051          0.415
BS GROUP 3          0.029     0.104          0.164    0.182    0.242          0.486
Motoring            0         0.053          0.107    0.128    0.187          0.470
Fuel                0         0.030          0.045    0.053    0.066          0.416

Singles
Variable            Minimum   1st quartile   Median    Mean      3rd quartile   Maximum
Number of female    0.0000    0.0000         1.0000    0.6296    1.0000         1.0000
Number of retired   0.0000    0.0000         0.0000    0.0000    0.0000         0.0000
Number of earners   1.0000    1.0000         1.0000    1.0000    1.0000         1.0000
Age of HHhead       18.0000   29.0000        38.0000   40.6347   52.5000        88.0000
Fridge              0.0000    1.0000         1.0000    0.9745    1.0000         1.0000
Washing machine     0.0000    0.0000         1.0000    0.6850    1.0000         1.0000
Centr. heating      0.0000    0.0000         1.0000    0.7292    1.0000         1.0000
TV                  0.0000    1.0000         1.0000    0.7571    1.0000         1.0000
Video               0.0000    0.0000         0.0000    0.4631    1.0000         1.0000
PC                  0.0000    0.0000         0.0000    0.0506    0.0000         1.0000
Number of cars      0.0000    0.0000         1.0000    0.7034    1.0000         4.0000
Number of rooms     1.0000    3.0000         4.0000    4.3004    5.0000         11.0000
ln.HHincome         3.3478    4.5103         4.9846    4.9219    5.3841         6.0112
BS GROUP 1          0.0481    0.1234         0.1725    0.1832    0.2336         0.4041
Food bought         0.0000    0.0634         0.1013    0.1118    0.1485         0.3823
Catering            0.0000    0.0177         0.0400    0.0512    0.0723         0.2997
Tobacco             0.0000    0.0000         0.0000    0.0201    0.0276         0.3863
BS GROUP 2          0.1045    0.2637         0.3628    0.3647    0.4596         0.6626
Housing             0.0000    0.1645         0.2605    0.2657    0.3555         0.6275
HHgoods             0.0000    0.0093         0.0281    0.0582    0.0733         0.6811
HHservices          0.0000    0.0263         0.0419    0.0523    0.0635         0.4495
BS GROUP 3          0.0174    0.0675         0.1327    0.1599    0.2243         0.5048
Motoring            0.0000    0.0000         0.0717    0.1029    0.1626         0.4973
Fuel                0.0000    0.0297         0.0458    0.0570    0.0720         0.5048

Single men
Variable            Minimum   1st quartile   Median    Mean      3rd quartile   Maximum
Number of retired   0.0000    0.0000         0.0000    0.0000    0.0000         0.0000
Number of earners   1.0000    1.0000         1.0000    1.0000    1.0000         1.0000
Age of HHhead       19.0000   29.0000        36.0000   38.6878   48.0000        84.0000
Fridge              0.0000    1.0000         1.0000    0.9589    1.0000         1.0000
Washing machine     0.0000    0.0000         1.0000    0.6245    1.0000         1.0000
Centr. heating      0.0000    0.0000         1.0000    0.7340    1.0000         1.0000
TV                  0.0000    0.0000         1.0000    0.7391    1.0000         1.0000
Video               0.0000    0.0000         0.0000    0.4996    1.0000         1.0000
PC                  0.0000    0.0000         0.0000    0.0830    0.0000         1.0000
Number of cars      0.0000    0.0000         1.0000    0.8281    1.0000         4.0000
Number of rooms     1.0000    3.0000         4.0000    4.2951    5.0000         11.0000
ln.HHincome         3.3807    4.6823         5.1340    5.0627    5.4860         6.2027
BS GROUP 1          0.0435    0.1152         0.1653    0.1802    0.2353         0.4105
Food bought         0.0000    0.0484         0.0829    0.0937    0.1250         0.3773
Catering            0.0000    0.0256         0.0525    0.0652    0.0907         0.3337
Tobacco             0.0000    0.0000         0.0000    0.0213    0.0309         0.3863
BS GROUP 2          0.0829    0.2481         0.3474    0.3549    0.4561         0.6622
Housing             0.0000    0.1789         0.2665    0.2758    0.3631         0.6234
HHgoods             0.0000    0.0045         0.0146    0.0396    0.0436         0.5304
HHservices          0.0000    0.0215         0.0361    0.0478    0.0572         0.4360
BS GROUP 3          0.0146    0.0665         0.1366    0.1691    0.2375         0.5365
Motoring            0.0000    0.0018         0.0871    0.1199    0.1870         0.5229
Fuel                0.0000    0.0251         0.0396    0.0492    0.0613         0.5048

Single women
Variable            Minimum   1st quartile   Median    Mean      3rd quartile   Maximum
Number of retired   0.0000    0.0000         0.0000    0.0000    0.0000         0.0000
Number of earners   1.0000    1.0000         1.0000    1.0000    1.0000         1.0000
Age of HHhead       18.0000   29.0000        40.0000   41.7008   54.0000        88.0000
Fridge              0.0000    1.0000         1.0000    0.9829    1.0000         1.0000
Washing machine     0.0000    0.0000         1.0000    0.7218    1.0000         1.0000
Centr. heating      0.0000    0.0000         1.0000    0.7287    1.0000         1.0000
TV                  0.0000    1.0000         1.0000    0.7668    1.0000         1.0000
Video               0.0000    0.0000         0.0000    0.4528    1.0000         1.0000
PC                  0.0000    0.0000         0.0000    0.0370    0.0000         1.0000
Number of cars      0.0000    0.0000         1.0000    0.6263    1.0000         3.0000
Number of rooms     1.0000    3.0000         4.0000    4.2810    5.0000         11.0000
ln.HHincome         3.3478    4.4325         4.9210    4.8487    5.3068         5.8917
BS GROUP 1          0.0532    0.1263         0.1747    0.1841    0.2303         0.4023
Food bought         0.0000    0.0736         0.1118    0.1221    0.1612         0.3823
Catering            0.0000    0.0133         0.0336    0.0432    0.0637         0.2653
Tobacco             0.0000    0.0000         0.0000    0.0188    0.0227         0.1990
BS GROUP 2          0.1207    0.2788         0.3695    0.3732    0.4630         0.6626
Housing             0.0000    0.1607         0.2571    0.2616    0.3477         0.6275
HHgoods             0.0000    0.0156         0.0382    0.0697    0.0913         0.6811
HHservices          0.0000    0.0291         0.0449    0.0554    0.0672         0.4495
BS GROUP 3          0.0192    0.0688         0.1293    0.1524    0.2114         0.4798
Motoring            0.0000    0.0000         0.0647    0.0907    0.1467         0.4437
Fuel                0.0000    0.0330         0.0505    0.0617    0.0770         0.3572
Table C.1
Percentage of two person households in accordance with homogeneity.
Hypothesis                        0.95
Homogeneity under exogeneity      0.986
Homogeneity under endogeneity     0.991

Table C.2
Percentage of two person households in accordance with Slutsky symmetry.
Hypothesis                                      0.95
Symmetry under exogeneity                       0.84
Symmetry under exogeneity and homogeneity       0.93
Symmetry under endogeneity and homogeneity      0.93

Table C.3
Percentage of couples in accordance with negative semidefiniteness.
Hypothesis                                                      0.95
Negative semidefiniteness under exogeneity                      0.987
Negative semidefiniteness under exogeneity and homogeneity      0.998
Negative semidefiniteness under endogeneity and homogeneity     0.998

Table C.4
Percentage of two person households in accordance with homogeneity.
Hypothesis                        0.95
Homogeneity under exogeneity      0.998
Homogeneity under endogeneity     1.0

Table C.5
Percentage of two person households in accordance with Slutsky symmetry.
Hypothesis                     0.95
Symmetry under exogeneity      0.89
Symmetry under endogeneity     0.88

Table C.6
Percentage of couples in accordance with negative semidefiniteness.
Hypothesis                                       0.95
Negative semidefiniteness under exogeneity       0.81
Negative semidefiniteness under endogeneity      0.81

Table C.7
Percentage of two person households in accordance with homogeneity.
Hypothesis                        0.95
Homogeneity under exogeneity      0.998
Homogeneity under endogeneity     0.998

Table C.8
Percentage of two person households in accordance with Slutsky symmetry.
Hypothesis                     0.95
Symmetry under exogeneity      0.91
Symmetry under endogeneity     0.92
The results are even stronger in support of homogeneity than in the non-stratified data. In fact, it seems that homogeneity is universally accepted. Overall, the results are so good that we will now only report results where we impose homogeneity. The results for symmetry are displayed in Table C.5. As before, the percentage of non-rejections at the significance level 0.95 is displayed. These results are roughly in line with (if slightly weaker than) the results in the full sample. Any differences are readily explained
by the fact that we have a smaller sample; hence we would expect to find more variability. Finally, consider the negative semidefiniteness hypothesis. The percentage of the population for which negative semidefiniteness cannot be rejected at the 0.95 confidence level is given in Table C.6: a similar remark as in the test for symmetry applies. Note that we do not find an effect of controlling for potential endogeneity on our results.

C.2.2. After 1987

Table C.7 shows results for tests of the homogeneity hypothesis. The same remark as in the previous subsection applies—homogeneity is always accepted, so we will impose it from now on. Next, the results for symmetry are displayed in Table C.8. As before, the percentage of non-rejections at the significance level 0.95 is displayed. Finally, consider the negative semidefiniteness hypothesis: the percentage of the population for which negative semidefiniteness cannot be rejected at the 0.95 confidence level is displayed in Table C.9. Observe that we do not find any significant effect of endogeneity, and the results remain virtually indistinguishable. In summary, splitting the sample in two parts produces materially the same
Table C.9
Percentage of couples in accordance with negative semidefiniteness.
Hypothesis                                       0.95
Negative semidefiniteness under exogeneity       0.82
Negative semidefiniteness under endogeneity      0.83
results, though in particular negative semidefiniteness fares somewhat worse (see Table C.9). Our results remain robust regardless of the specification. Once we allow for sufficient heterogeneity in preferences across covariates as well as time, regardless of how we slice the data, utility maximization seems a fairly good – but not perfect – description of actual individual behavior.

Appendix D. Supplementary data

Supplementary material related to this article can be found online at doi:10.1016/j.jeconom.2011.06.015.

References

Altonji, J., Matzkin, R., 2005. Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica 73, 1053–1103.
Blundell, R., Browning, M., Crawford, I., 2003. Nonparametric Engel curves and revealed preference. Econometrica 71, 205–240.
Blundell, R., Chen, X., Kristensen, D., 2007. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica 75 (6), 1613–1669.
Blundell, R., Horowitz, J., Parey, M., 2010. Measuring the Price Responsiveness of Gasoline Demand: Economic Shape Restrictions and Nonparametric Demand Estimation. CeMMAP Working Papers CWP11/09, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
Browning, M., Chiappori, P.-A., 1998. Efficient intra-household allocations: a general characterization and empirical tests. Econometrica 66, 1241–1278.
Caballero, R., 1990. Consumption puzzles and precautionary savings. Journal of Monetary Economics 25, 113–136.
Caballero, R., 1991. Earnings uncertainty and aggregate wealth accumulation. American Economic Review 81, 859–871.
Cherchye, L., Vermeulen, F., 2008. Nonparametric analysis of household labor supply: goodness of fit and power of the unitary and the collective model. Review of Economics and Statistics 90, 267–274.
Chesher, A., 2003. Identification in nonseparable models. Econometrica 71, 1405–1441.
Deaton, A., 1992. Understanding Consumption. Clarendon Lectures in Economics. Oxford University Press, Oxford.
Deaton, A., Muellbauer, J., 1980. An almost ideal demand system. American Economic Review 70, 312–326.
Fortin, B., Lacroix, G., 1997. A test of the unitary and the collective model of household labour supply. Economic Journal 107, 933–955.
Haag, B., Hoderlein, S., Mihaleva, S., 2009a. Testing Homogeneity of Degree Zero. Working Paper, Brown University.
Haag, B., Hoderlein, S., Pendakur, K., 2009b. Testing and imposing Slutsky symmetry. Journal of Econometrics 153, 33–50.
Hall, R., 1978. Stochastic implications of the life-cycle permanent income hypothesis: theory and evidence. Journal of Political Economy 86, 971–987.
Härdle, W., Hart, J., 1993. A bootstrap test for positive definiteness of income effects matrices. Econometric Theory 8, 276–290.
Härdle, W., Hildenbrand, W., Jerison, M., 1991. Empirical evidence on the law of demand. Econometrica 59, 1525–1549.
Hoderlein, S., Klemelae, J., Mammen, E., 2010. Reconsidering the random coefficient model. Econometric Theory 26, 804–837.
Hoderlein, S., Mihaleva, S., 2008. Increasing the price variation in a repeated cross section. Journal of Econometrics 147, 316–326.
Imbens, G., Newey, W., 2009. Identification and estimation of triangular simultaneous equations models without additivity. Econometrica 77, 1481–1512.
Jorgenson, D., Lau, L., Stoker, T., 1982. The transcendental logarithmic model of individual behavior. In: Basman, R., Rhodes, G. (Eds.), Advances in Econometrics, vol. 1. JAI Press.
Kihlstrom, R., Mas-Colell, A., Sonnenschein, H., 1976. The demand theory of the weak axiom of revealed preference. Econometrica 44, 971–978.
Lewbel, A., 1989. Identification and estimation of equivalence scales under weak separability. Review of Economic Studies 56, 311–316.
Lewbel, A., 1999. Consumer demand systems and household expenditure. In: Pesaran, H., Wickens, M. (Eds.), Handbook of Applied Econometrics. Blackwell Handbooks in Economics.
Lewbel, A., 2001. Demand systems with and without errors. American Economic Review 91, 611–618.
Quah, J., 2000. Weak Axiomatic Demand Theory. Working Paper, Nuffield College, Oxford.
Stoker, T., 1989. Tests of additive derivative constraints. Review of Economic Studies 56, 535–552.
Stone, R., 1954. Linear expenditure systems and demand analysis: an application to the pattern of British demand. Economic Journal 64, 511–527.
Journal of Econometrics 164 (2011) 310–330
Estimating a common deterministic time trend break in large panels with cross sectional dependence Dukpa Kim ∗ Department of Economics, University of Virginia, Monroe Hall, McCormick Road, Charlottesville, VA 22903, United States
Article info

Article history: Received 22 February 2010. Received in revised form 9 November 2010. Accepted 12 June 2011. Available online 21 June 2011.

JEL classification: C33

Keywords: Structural break; Deterministic trend; Panel data

Abstract
1. Introduction While the issues related to structural breaks have drawn a lot of attention in both econometrics and statistics, relatively small attention has been paid to common breaks in large panels. The extension of structural break models to the panel data setup is important because a structural break is regarded as an exogenous shock with permanent effects on the distributions of economic variables and such a shock is likely to have impacts on many economic variables simultaneously. However, the econometric tools currently available are mostly designed for univariate models, for example Bai (1997) and Bai and Perron (1998), or multivariate models where the number of equations are fixed, for example Bai et al. (1998). In this paper, we develop an estimation procedure for a common deterministic time trend break in large panels with cross sectional dependence. The dependent variable in each equation consists of a deterministic trend and an error component. The deterministic trend is assumed to have a change in the intercept, slope or both. The date of change is common for all equations. The estimation method is simply minimizing the sum of squared residuals (SSR) for all possible break dates.
∗ Tel.: +1 434 924 7581; fax: +1 434 982 2904. E-mail address: [email protected].
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.06.018
The statistical properties of this break date estimate heavily depend on the specification of the error process. Serial correlation in each equation and cross sectional dependence are particularly important factors. In order to allow strong cross equation correlation, we assume that the errors have a common factor structure. That is, the error process in each equation is the sum of an individual specific error and common factors scaled by an individual loading vector. Also, for the serial correlation, both the common factors and the individual specific errors are assumed to be either stationary linear processes or unit root processes. Under various combinations of the aforementioned specifications, we establish the consistency of the proposed break date estimate and derive its limiting distribution. The employed asymptotic framework is the joint limit theory analyzed in Phillips and Moon (1999), where both the time span and the number of equations grow without assuming any particular growth path. The joint limit theory is often preferred to the sequential limit theory, since the results obtained under the former are more robust to the relative size of the time and cross sectional spans. We show that strong correlations in both the time and cross sectional dimensions slow down the rate of convergence of the break date estimate. The rate of convergence is slower when the error processes have a unit root than when they are stationary. This finding is in parallel with the univariate result shown by Perron and Zhu (2005). In the absence of cross sectional dependence, the rate of convergence is increasing in the number of equations and thus is faster than the univariate case. On the other hand, when
common factors exist and their loading vectors are correlated with the intercept and slope change parameters, the rate of convergence reduces to the univariate case and the efficiency gain in terms of the rate of convergence disappears. The limiting distribution of the break date estimate is normal when only slope changes are allowed. When intercept changes are also allowed, it is non-standard but analogous to the univariate case studied by Perron and Zhu (2005). We discuss a simple method to form confidence intervals using these limiting distributions. There are two important papers closely related to this paper. Perron and Zhu (2005) analyze structural breaks in deterministic trends in univariate models. Indeed, our models are panel extensions of theirs. Bai (2010) analyzes common mean shifts in panel data models. Bai’s (2010) work differs from ours in that it concerns only mean shifts without linear time trends and the errors are only stationary without common factors. Also, Bai (2010) analyzes breaks in variances. The rest of the paper is organized as the following. Section 2 introduces the models and assumptions. Section 3 presents the main theoretical results. Section 4 demonstrates the validity of the asymptotic results via Monte Carlo experiments. Section 5 shows an empirical illustration. Section 6 contains concluding remarks. All mathematical proofs are collected in the Appendix.
2. Models and assumptions

We consider panel data models where the dependent variable in each equation consists of a deterministic time trend and an error component:
$$y_{ti} = d_{ti} + u_{ti}, \quad (i = 1, \ldots, N \text{ and } t = 1, \ldots, T).$$
The deterministic trend is assumed to have a break and we consider three cases:
$$d_{ti} = \begin{cases} \mu_i + \beta_i t + \gamma_i B_t & \text{Model I (Joint Broken Trend)} \\ \mu_i + \beta_i t + \theta_i C_t + \gamma_i B_t & \text{Model II (Local Disjoint Trend)} \\ \mu_i + \beta_i t + \theta_i C_t & \text{Model III (Mean Shift)} \end{cases}$$
where
$$C_t = \begin{cases} 0 & \text{if } t \leq T_1 \\ 1 & \text{if } t > T_1 \end{cases} \quad \text{and} \quad B_t = (t - T_1)C_t.$$
The regression coefficients are not restricted to be common across equations, and thus instead of pooling the data we apply the ordinary least squares procedure equation by equation.

Models I and II are studied in Perron and Zhu (2005) for the univariate case. Perron and Zhu (2005) analyze another model, namely the global disjoint trend model, which uses $B_t^{dj}$ instead of $B_t$, where $B_t^{dj} = tC_t$. In fact, Perron and Zhu (2005) show that a more general model, given by $d_{ti} = \mu_i + \beta_i t + (T^\alpha \kappa_i)C_t + \gamma_i B_t$, encompasses both the global and local disjoint models as special cases. With $\alpha = 0$, it becomes the local disjoint model and with $\alpha = 1$, it becomes the global disjoint model. They further argue that all their results pertaining to the global disjoint model hold more generally for $\alpha > 1/2$ when the error process has a unit root and for $\alpha > 0$ when the error process is stationary. Carrion-i-Silvestre et al. (2009) also consider this model with $\alpha = 1/2 + \eta$, $\eta > 0$, where the break date is estimated from quasi differenced data. Making the intercept change parameter increasing in $T$ is a simple device to provide asymptotic theory suitable for large intercept shifts. We do not analyze this model because the break date estimate under this specification is consistent at an extremely fast rate even with one time series, so that panel data do not give much additional efficiency gain of practical importance.

Model III is a mean shift model which Bai (2010) analyzes for panel data. However, we allow much stronger cross sectional correlation than Bai (2010) and obtain additional results. Also, our results for Model III continue to hold without the linear time trend $\beta_i t$.

Assumption 1. The true break date $T_1$ is unknown and the break fraction $\lambda_1 = T_1/T \in [\pi, 1 - \pi]$, $\pi \in (0, 1/2)$, is fixed for all $T$.

Note that the break date $T_1$ is common for all equations, and the break fraction $\lambda_1$ is fixed as the sample size grows. The trimming of the potential location of $\lambda_1$ by $\pi$ is a simple device to ensure that the regressor matrix is of full column rank, and $\pi$ can be arbitrarily small in practice. The assumption of a common break date is not new to the literature. The closest is Bai (2010). Other developments in large panel data models include De Wachter and Tzavalis (2004, 2005), Emerson and Kao (2000) and Kao et al. (2007). Bai et al. (1998) and Qu and Perron (2007) are examples in multivariate systems of a fixed size.

Remark 1. A natural generalization of Assumption 1 would be to model heterogeneous individual responses to a common shock. That is, the break date in equation $i$, say $T_i$, is such that
$$T_i = T_c + \Delta T_i, \quad \text{for } i = 1, \ldots, N,$$
where $\Delta T_i$ is an integer valued random variable satisfying $\Delta T_i \sim IID(0, \sigma^2 < \infty)$. The rationale behind this is that one common event causes a structural break for every equation, but the time lead or lag of the actual effect from the date of the common event varies among equations. Having $\sigma^2 = 0$ is equivalent to Assumption 1. The common break date estimator in this paper naturally becomes an estimator for $T_c$. While this specification might be more appealing in practice, the limiting distribution of the break date estimate becomes dependent on the distribution of $\Delta T_i$, which may not always be estimated from the data. See Kim (2010) for some related results. We will use Assumption 1 throughout the paper so that the models remain fully estimable.
Now, let $\iota = (1, \ldots, 1)'$, $\tau = (1, \ldots, T)'$, $C = (C_1, \ldots, C_T)'$, $B = (B_1, \ldots, B_T)'$, and define $X_{T_1}$ to be the collection of all regressors and $\Pi_i$ the corresponding regression coefficients for equation $i$, that is,
$$X_{T_1} = \begin{cases} [\iota, \tau, B] & \text{Model I} \\ [\iota, \tau, C, B] & \text{Model II} \\ [\iota, \tau, C] & \text{Model III} \end{cases} \quad \text{and} \quad \Pi_i = \begin{cases} (\mu_i, \beta_i, \gamma_i)' & \text{Model I} \\ (\mu_i, \beta_i, \theta_i, \gamma_i)' & \text{Model II} \\ (\mu_i, \beta_i, \theta_i)' & \text{Model III.} \end{cases}$$
Then, we write each equation in matrix notation as
$$\underset{(T \times 1)}{Y_i} = \underset{(T \times 3 \text{ or } T \times 4)}{X_{T_1}} \underset{(3 \times 1 \text{ or } 4 \times 1)}{\Pi_i} + \underset{(T \times 1)}{U_i} \quad (1)$$
where $Y_i = (y_{i1}, \ldots, y_{iT})'$ and $U_i = (u_{i1}, \ldots, u_{iT})'$. The entire $N$ equation system is written as
$$Y = X_{T_1}\Pi + U \quad (2)$$
where $Y = [Y_1, \ldots, Y_N]$, $\Pi = [\Pi_1, \ldots, \Pi_N]$, and $U = [U_1, \ldots, U_N]$. Also, define row vectors $\mu = (\mu_1, \ldots, \mu_N)$, $\beta = (\beta_1, \ldots, \beta_N)$, $\theta = (\theta_1, \ldots, \theta_N)$ and $\gamma = (\gamma_1, \ldots, \gamma_N)$. Then, an alternative expression for $\Pi$ is $[\mu', \beta', \gamma']'$, $[\mu', \beta', \theta', \gamma']'$ and $[\mu', \beta', \theta']'$ for Models I, II and III respectively. Denote by $T_b$ a generic break date and by $\lambda = T_b/T$ a generic break fraction. $X_{T_b}$ is defined analogously to $X_{T_1}$. Then, the sum of squared residuals for each $T_b$ is given by
$$SSR(T_b) = tr\left[Y'(I - P_{T_b})Y\right]$$
where $P_{T_b} = X_{T_b}(X_{T_b}'X_{T_b})^{-1}X_{T_b}'$ and $tr[\cdot]$ is the trace operator. The estimated break date $\hat{T}_1$ is the date that minimizes the sum of squared residuals:
$$\hat{T}_1 = \arg\min_{T_b} SSR(T_b) \quad \text{and} \quad \hat{\lambda}_1 = \hat{T}_1/T.$$
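For concreteness, a minimal implementation of this estimator might look as follows. This is our own illustrative sketch in Python under the paper's setup; the function name, the data layout and the trimming default are our choices, not the paper's:

```python
import numpy as np

def estimate_break_date(Y, model="II", trim=0.15):
    """Minimize SSR(T_b) = tr[Y'(I - P_{T_b})Y] over admissible break dates.

    Y is a (T, N) array whose columns are the equations.
    Returns (T_b_hat, lambda_hat); `trim` plays the role of pi in Assumption 1.
    """
    T, N = Y.shape
    t = np.arange(1, T + 1, dtype=float)
    ssr = {}
    for Tb in range(int(trim * T), int((1 - trim) * T) + 1):
        C = (t > Tb).astype(float)            # intercept-shift dummy C_t
        B = (t - Tb) * C                      # slope-shift regressor B_t = (t - T_b)C_t
        X = {"I":   np.column_stack([np.ones(T), t, B]),
             "II":  np.column_stack([np.ones(T), t, C, B]),
             "III": np.column_stack([np.ones(T), t, C])}[model]
        coef, *_ = np.linalg.lstsq(X, Y, rcond=None)   # equation-by-equation OLS
        ssr[Tb] = np.sum((Y - X @ coef) ** 2)          # trace of the residual cross-product
    Tb_hat = min(ssr, key=ssr.get)
    return Tb_hat, Tb_hat / T
```

Because the regressors are common across equations, the $N$ least squares fits for a given candidate date are obtained in a single call, so the grid search costs one regression per admissible $T_b$.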
Remark 2. We point out an important identification issue in Model II. In order to emphasize the break date, write $C_t = C_t(T_1)$ and $B_t = B_t(T_1)$. Also, let $D_t(T_1)$ be an impulse dummy (i.e., $D_t(T_1) = 1$ if $t = T_1 + 1$, and 0 otherwise). Then, note that
$$\theta C_t(T_1) + \gamma B_t(T_1) = (\theta + \gamma)D_t(T_1) + (\theta + \gamma)C_t(T_1 + 1) + \gamma B_t(T_1 + 1)$$
or
$$\theta C_t(T_1) + \gamma B_t(T_1) = -\theta D_t(T_1 - 1) + (\theta - \gamma)C_t(T_1 - 1) + \gamma B_t(T_1 - 1).$$
This means that, if $\theta = -\gamma$ or $\theta = 0$, there are two break dates ($T_1$ and $T_1 + 1$, or $T_1 - 1$ and $T_1$) that can generate exactly the same time trend. This feature is of less concern if we are interested only in the consistency of the break fraction estimate $\hat{\lambda}_1$. However, with a panel of data we will establish the consistency of the break date estimate $\hat{T}_1$, and the model should be assumed not to have any identification issue of this kind.

The statistical properties of $\hat{T}_1$ depend on both the cross sectional and serial correlations of $u_{ti}$, and we assume that the error component $u_{ti}$ consists of two terms:
$$u_{ti} = h_i'F_t + e_{ti} \quad (3)$$
where $F_t$ is an $r \times 1$ vector of latent common factors, $h_i$ is a factor loading and $e_{ti}$ is an individual specific error. Under the decomposition in (3), we can write (1) and (2) as
$$\underset{(T \times 1)}{Y_i} = \underset{(T \times 3 \text{ or } T \times 4)}{X_{T_1}} \underset{(3 \times 1 \text{ or } 4 \times 1)}{\Pi_i} + \underset{(T \times r)}{F}\underset{(r \times 1)}{h_i} + E_i \quad \text{and} \quad Y = X_{T_1}\Pi + FH + E$$
where $H = [h_1, \ldots, h_N]$, $F = [F_1, \ldots, F_T]'$, and $E = [E_1, \ldots, E_N]$. As a matter of notation, $L$ denotes a lag operator and $E$ a mathematical expectation. For a matrix $A$, $\|A\| = (tr(A'A))^{1/2}$. We also use $\to$ to denote the convergence of non-random elements, $\xrightarrow{p}$ convergence in probability and $\xrightarrow{d}$ convergence in distribution. We make the following assumptions on the error component:

Assumption 2. (i) The vector of common factors $F_t$ is such that $E\|F_t\|^{2+\delta} < \infty$, $\delta > 0$, and $F_t = C(L)w_t$ where $w_t \sim iid(0, I_r)$ and $C(L) = \sum_{j=0}^{\infty} C_jL^j$ with $\sum_{j=0}^{\infty} j\|C_j\| < M$ and $\det(C(z)) \neq 0$ for all $|z| \leq 1$. $M$ is a generic finite positive number which depends on neither $T$ nor $N$. (ii) For each equation $i$, the individual specific error $e_{ti}$ is such that $E|e_{ti}|^{2+\delta} < \infty$, $\delta > 0$, and $e_{ti} = d_i(L)\varepsilon_{ti}$ where $d_i(L) = \sum_{j=0}^{\infty} d_{ij}L^j$ with $d_{i0} = 1$, $\sum_{j=0}^{\infty} j|d_{ij}| < M$ and $d_i(z) \neq 0$ for all $|z| \leq 1$. Furthermore, $\varepsilon_{ti} = \sigma_i\eta_{ti}$, where $\eta_{ti} \sim iid(0, 1)$ across both $i$ and $t$ with $\sigma_i^2 < M$. (iii) $F_t$ and $e_{ti}$ are independent.

The linear process assumption for the common factors and the individual specific errors makes it particularly convenient to apply the Functional Central Limit Theorem (see Phillips and Solo, 1992) and the Joint Limit Central Limit Theorem (see Phillips and Moon, 1999), which are two of the main tools used to derive the asymptotic distribution of the break date estimate. The assumption that both the long-run and short-run variances of the individual specific errors are bounded uniformly in $i$ implies that our asymptotic results may not be suitable if there is one series with too large a variance. Also, the primitive shocks $w_t$ and $\varepsilon_{ti}$ are iid over $t$ and thus $F_t$ and $e_{ti}$ are strictly stationary processes. We show in the next section that this strict stationarity is an important feature in deriving the asymptotic distributions of the break date estimate in Models II and III. Some mild cross sectional dependence is often allowed among the individual specific errors in common factor papers such as Bai and Ng (2004). We will keep the independence assumption for simplicity, especially in applying the Joint Limit Central Limit Theorem. However, it is well expected¹ that the rates of convergence in this paper will remain the same under this modification, while the limiting distribution should be adjusted.

Assumption 3. (i) $(1 - L)F_t$ satisfies Assumption 2(i) and $F_0 = 0$. (ii) $(1 - L)e_{ti}$ satisfies Assumption 2(ii) and $e_{0i} = 0$. (iii) $F_t$ and $e_{ti}$ are independent.
In Section 3, we derive our asymptotic results using Assumptions 2 and 3 alternately. We also discuss the cases where Assumptions 2 and 3 are mixed, so that there are integrated common factors and stationary individual specific errors or vice versa. See the remark following Theorem 5. We make the following assumptions for the break parameters and the factor loadings.

Assumption 4. $N^{-1}\theta\theta' \to A_{\theta\theta} \neq 0$, $N^{-1}\theta\gamma' \to A_{\theta\gamma} \neq 0$, $N^{-1}\gamma\gamma' \to A_{\gamma\gamma} \neq 0$, and $N^{-1}\gamma D\Sigma_\varepsilon D\gamma' \to S_{\gamma\gamma} \neq 0$, where $\Sigma_\varepsilon = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_N^2\}$ and $D = \mathrm{diag}\{d_1(1), \ldots, d_N(1)\}$. $\max\{\gamma_1^2, \ldots, \gamma_N^2\} = O(1)$.

Assumption 5. $h_i$, $\theta_i$ and $\gamma_i$ are such that $N^{-1}H\gamma' \to A_{H\gamma}$, $N^{-1}HH' \to A_{HH}$, and $N^{-1}H\theta' \to A_{H\theta}$, where $A_{H\gamma}$, $A_{HH}$, and $A_{H\theta}$ are some fixed matrices.

In Section 3, we show that the limiting (uncentered) cross moments $A_{H\gamma}$ and $A_{H\theta}$ play an important role for both the rate of convergence and the asymptotic distribution of the break date estimate. When $A_{H\gamma} \neq 0$ and $A_{H\theta} \neq 0$, the results are similar to those of the univariate case obtained by Perron and Zhu (2005). On the other hand, when $A_{H\gamma} = 0$ and $A_{H\theta} = 0$, the break date estimate is consistent at a faster rate than in the univariate case.

3. Asymptotic results

First we provide the consistency of the break date estimate $\hat{T}_1$ or break fraction estimate $\hat{\lambda}_1$. The notation $(T, N) \to \infty$ means that both $T$ and $N$ jointly go to infinity, and we do not assume that they follow any particular path or grow sequentially.

Theorem 1 (Stationary Errors). Suppose that Assumptions 1, 2, 4 and 5 hold. Then, we have the following results.

A. Model I (Joint Broken Trend)
(i) As $N \to \infty$ with $T$ fixed, $|\hat{T}_1 - T_1| = o_p(1)$, if $h_i = 0$ and $d_i(L) = 1$ for all $i$.
(ii) As $(T, N) \to \infty$, $|\hat{T}_1 - T_1| = O_p(T^{-1/2}N^{-1/2})$, if $h_i = 0$ for all $i$.
(iii) As $(T, N) \to \infty$, $|\hat{T}_1 - T_1| = O_p(T^{-1/2})$, if $A_{H\gamma} \neq 0$.

B. Model II (Local Disjoint Trend) and Model III (Mean Shift)
(iv) As $N \to \infty$ with $T$ fixed, $|\hat{T}_1 - T_1| = o_p(1)$, if $h_i = 0$ and $d_i(L) = 1$ for all $i$.
(v) As $(T, N) \to \infty$, $|\hat{T}_1 - T_1| = o_p(1)$, if $h_i = 0$ for all $i$.
(vi) As $(T, N) \to \infty$, $|\hat{T}_1 - T_1| = O_p(1)$, if $A_{H\gamma} \neq 0$ and $A_{H\theta} \neq 0$.
1 Although it is not reported in this paper, we derived such a result using the sequential limit theory in an earlier version of this paper.
As shown in (i) and (iv), the break date estimate is consistent even with $T$ fixed if $u_{ti}$ is independent across both $t$ and $i$. Such a result is not generally true when there is either serial or cross sectional dependence in the errors. In Section 4, we provide some related Monte Carlo simulation results. (ii) and (v) pertain to the case of serially correlated individual specific errors with no common factors ($h_i = 0$ for all $i$). The consistency of the break date estimate is shown as both $N$ and $T$ increase. This result has more significance in Models II and III than in Model I, because in Model I consistency of the break date estimate obtains even in the univariate model, whereas in Models II and III it obtains only with large panels. The rate of convergence is $\sqrt{TN}$ in Model I, but it depends on the relative magnitude of $N$ and $T$ in Models II and III, and we simply denote it as $o_p(1)$. (iii) and (vi) state that when there are common factors with $A_{H\gamma} \neq 0$ and $A_{H\theta} \neq 0$, the rate of convergence does not depend on the number of equations $N$ and thus reduces to that of the univariate model obtained by Perron and Zhu (2005). This is a natural consequence of the strong cross sectional dependence generated by the common components.

Theorem 2 (Integrated Errors). Suppose that Assumptions 1 and 3–5 hold. Then we have the following results.

A. Model I (Joint Broken Trend)
(i) As $(T, N) \to \infty$, $|\hat{\lambda}_1 - \lambda_1| = O_p(T^{-1/2}N^{-1/2})$, if $h_i = 0$ for all $i$.
(ii) As $(T, N) \to \infty$, $|\hat{\lambda}_1 - \lambda_1| = O_p(T^{-1/2})$, if $A_{H\gamma} \neq 0$.

B. Model II (Local Disjoint Trend)
(iii) As $(T, N) \to \infty$, $|\hat{\lambda}_1 - \lambda_1| = o_p(T^{-1/2})$, if $h_i = 0$ for all $i$.
(iv) As $(T, N) \to \infty$, $|\hat{\lambda}_1 - \lambda_1| = O_p(T^{-1/2})$, if $A_{H\gamma} \neq 0$.

When the error terms are integrated, the break date cannot be estimated without a slope change and thus we analyze only Models I and II. The break date estimate $\hat{T}_1$ in both models is not consistent whether there are common components or not. This is the reason why we present the results in terms of the break fraction estimate $\hat{\lambda}_1 = \hat{T}_1/T$, which is consistent at least at rate $T^{1/2}$. When there are no common factors, the rate of convergence of $\hat{\lambda}_1$ is $\sqrt{TN}$ in Model I but not available in Model II. In both models, $\hat{\lambda}_1$ is consistent at rate $T^{1/2}$ if $A_{H\gamma} \neq 0$, and this rate is the same as in the univariate case. An implication from Theorem 2A and Theorem 1B is that estimating the date of a mean shift after differencing is more efficient than estimating the date of a slope change in the raw data.

Theorem 3 (Stationary Errors). Suppose that Assumptions 1, 2, 4 and 5 hold. As $(T, N) \to \infty$, we have the following results:

A. Model I (Joint Broken Trend)
(i) If $N/T^3 \to 0$ and $h_i = 0$ for all $i$,
$$T^{3/2}N^{1/2}(\hat{\lambda} - \lambda_1) \xrightarrow{d} N\left(0,\ \frac{4S_{\gamma\gamma}}{(1 - \lambda_1)\lambda_1 A_{\gamma\gamma}^2}\right).$$
(ii) If $A_{H\gamma} \neq 0$,
$$T^{3/2}(\hat{\lambda} - \lambda_1) \xrightarrow{d} N\left(0,\ \frac{4A_{H\gamma}'C(1)C(1)'A_{H\gamma}}{(1 - \lambda_1)\lambda_1 A_{\gamma\gamma}^2}\right).$$

B. Model II (Local Disjoint Trend) and Model III (Mean Shift)
(iii) Define the stochastic process $S^*(m)$ such that $S^*(0) = 0$, $S^*(m) = S_1(m)$ for $m < 0$ and $S^*(m) = S_2(m)$ for $m > 0$. Then
$$T(\hat{\lambda} - \lambda_1) \xrightarrow{d} m_\infty^{I} = \arg\min_m S^*(m)$$
where for Model II
$$S_1(m) = \sum_{k=m+1}^{0}\left(A_{\theta\theta} + A_{\gamma\gamma}k^2 + 2A_{\gamma\theta}k\right) - 2\sum_{k=m+1}^{0} F_k'\left(A_{H\theta} + kA_{H\gamma}\right), \quad m = -1, -2, \ldots$$
$$S_2(m) = \sum_{k=1}^{m}\left(A_{\theta\theta} + A_{\gamma\gamma}k^2 + 2A_{\gamma\theta}k\right) + 2\sum_{k=1}^{m} F_k'\left(A_{H\theta} + kA_{H\gamma}\right), \quad m = 1, 2, \ldots$$
if $A_{H\theta} \neq 0$ and $A_{H\gamma} \neq 0$, and for Model III
$$S_1(m) = A_{\theta\theta}|m| - 2\sum_{k=m+1}^{0} F_k'A_{H\theta}, \quad m = -1, -2, \ldots$$
$$S_2(m) = A_{\theta\theta}m + 2\sum_{k=1}^{m} F_k'A_{H\theta}, \quad m = 1, 2, \ldots$$
if $A_{H\theta} \neq 0$.

In Model I, the distribution of the break fraction estimate is approximated by a normal distribution irrespective of the presence of common factors. In both cases, the limiting variance depends on the true break fraction $\lambda_1$ and the break parameters $\gamma_i$. The closer $\lambda_1$ is to the mid-point of the time span and the larger the slope changes are, the smaller the limiting variance is. Note that the result in (i) is derived with the additional assumption that $N/T^3 \to 0$. This additional condition is not restrictive in most time series panels where $N$ and $T$ are of similar size or $T$ is much larger than $N$. In Models II and III with $A_{H\theta} \neq 0$ and $A_{H\gamma} \neq 0$, the limiting distribution is derived using the strict stationarity of $F_t$ given in Assumption 2. It is highly non-standard but analogous to the univariate case. It depends on the exact distribution of $F_t$ and is of little use in practice unless an assumption on the distribution of $F_t$ is made.

In order to form a confidence interval from the above results, various parameters should be estimated. The break parameters $\gamma$ and $\theta$ (and thus $A_{\gamma\gamma}$ and $A_{\theta\theta}$) can be consistently estimated via least squares using the estimated break date in Models I and III, but the least squares estimate for $\theta$ using the estimated break date is inconsistent in Model II, as shown in Perron and Zhu (2005). Hence, asymptotically valid confidence intervals are available only for Models I and III. For Model I, the two limiting distributions in Theorem 3(i) and (ii) differ only by $N^{-1}S_{\gamma\gamma}$ versus $A_{H\gamma}'C(1)C(1)'A_{H\gamma}$, once the different rates of convergence are taken into account. Both $N^{-1}S_{\gamma\gamma}$ and $A_{H\gamma}'C(1)C(1)'A_{H\gamma}$ can be interpreted as the long-run variance of the cross sectional average of the errors weighted by the slope change parameters, because given the assumption of independent individual specific errors
$$\frac{1}{N}\sum_{i=1}^{N} u_{ti}\gamma_i = \frac{1}{N}\sum_{i=1}^{N}\left(F_t'h_i + e_{ti}\right)\gamma_i = \begin{cases} F_t'A_{H\gamma} + O_p(N^{-1/2}) & \text{if } A_{H\gamma} \neq 0 \\ \dfrac{1}{N}\displaystyle\sum_{i=1}^{N} e_{ti}\gamma_i & \text{if } h_i = 0 \text{ for all } i \end{cases} \quad (4)$$
where the long-run variance of $F_t'A_{H\gamma}$ corresponds to $A_{H\gamma}'C(1)C(1)'A_{H\gamma}$ and that of $N^{-1}\sum_{i=1}^{N} e_{ti}\gamma_i$ to $N^{-1}S_{\gamma\gamma}$ for any given $N$. Hence, a confidence interval for $T^{3/2}(\hat{\lambda} - \lambda_1)$ which is formed using the long-run variance of the weighted average of the errors is valid for both Theorem 3(i) and (ii). This is an important fact since it implies that prior knowledge of the existence of common factors is not required and all the issues related to the estimation of common factors and loadings can be avoided.
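One way to implement this suggestion is sketched below: it estimates the long-run variance of the $\hat{\gamma}$-weighted cross sectional average of the residuals with a Bartlett (Newey–West) kernel. The kernel, the bandwidth rule and all names are our own assumptions; the paper does not prescribe a specific long-run variance estimator here:

```python
import numpy as np

def lrv_weighted_average(U_hat, gamma_hat, bandwidth=None):
    """Bartlett-kernel long-run variance of v_t = N^{-1} sum_i u_{ti} gamma_i.

    U_hat     : (T, N) matrix of estimated residuals.
    gamma_hat : (N,) vector of estimated slope-change coefficients.
    """
    T, N = U_hat.shape
    v = U_hat @ gamma_hat / N                 # weighted cross sectional average, cf. eq. (4)
    v = v - v.mean()
    if bandwidth is None:
        bandwidth = int(4 * (T / 100.0) ** (2.0 / 9.0))  # a common rule of thumb (our choice)
    lrv = v @ v / T                           # lag-0 term
    for j in range(1, bandwidth + 1):
        w = 1.0 - j / (bandwidth + 1.0)       # Bartlett weight
        lrv += 2.0 * w * (v[j:] @ v[:-j]) / T
    return lrv
```

The resulting estimate, together with $\hat{\lambda}$ and an estimate of $A_{\gamma\gamma}$, can then be plugged into the normal limit of Theorem 3 to form an interval for the break fraction.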
For Model III with $A_{H\theta} \neq 0$, a practical choice would be to assume $F_t'A_{H\theta}$ to be a Gaussian process. Then, the process $S^*(m)$ can be simulated once the autocovariance function of $F_t'A_{H\theta}$ is given. Since
$$\frac{1}{N}\sum_{i=1}^{N} u_{ti}\theta_i = F_t'A_{H\theta} + O_p(N^{-1/2}),$$
the autocovariance function can be estimated from the cross sectional average of the errors weighted by the mean shift parameters.

In Models II and III without common factors, the limiting distribution depends on the relative magnitude of $N$ and $T$ and is of a complicated form. We take an alternative approach that still yields a non-degenerate limiting distribution of the break fraction estimate, namely to assume that the break parameters are of order $N^{-1/2}$.

Assumption 6. (i) $\gamma = N^{-1/2}\dot{\gamma}$, $\theta = N^{-1/2}\dot{\theta}$, $\theta\theta' \to \dot{A}_{\theta\theta}$, $\theta\gamma' \to \dot{A}_{\theta\gamma}$, $\gamma\gamma' \to \dot{A}_{\gamma\gamma} \neq 0$, and $\gamma D\Sigma_\varepsilon D\gamma' \to \dot{S}_{\gamma\gamma} \neq 0$, where $\Sigma_\varepsilon = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_N^2\}$ and $D = \mathrm{diag}\{d_1(1), \ldots, d_N(1)\}$. (ii) $\max\{\dot{\theta}_1^2, \ldots, \dot{\theta}_N^2\} = O(1)$, $\max\{\dot{\gamma}_1^2, \ldots, \dot{\gamma}_N^2\} = O(1)$. (iii) $\Xi_1 = \lim_N N^{-1}\sum_{i=1}^{N} \dot{\theta}_i^2\Gamma_i$ and $\Xi_2 = \lim_N N^{-1}\sum_{i=1}^{N} \dot{\gamma}_i^2\Gamma_i$, where $\Gamma_i$ is any finite size matrix whose $(p, q)$ element is the autocovariance of $e_{ti}$ at lag $p - q$.

The limiting distribution of the break date estimate under this alternative assumption is given in the next theorem.

Theorem 4 (Stationary Errors). Suppose that Assumptions 1, 2, 5 and 6 hold and that $h_i = 0$ for all $i$. Let $N_1 = (N_{-m+1,1}, \ldots, N_{m,1})'$ and $N_2 = (N_{-m+1,2}, \ldots, N_{m,2})'$ be multivariate normal such that $N_1 = B_1W$ and $N_2 = B_2W$ with $\Xi_1 = B_1B_1'$, $\Xi_2 = B_2B_2'$ and $W \sim N(0, I_{2m})$. Define the stochastic process $V^*(m)$ such that $V^*(0) = 0$, $V^*(m) = V_1(m)$ for $m < 0$ and $V^*(m) = V_2(m)$ for $m > 0$. Then, we have, as $(T, N) \to \infty$ and $N^2/T \to 0$,
$$T(\hat{\lambda} - \lambda_1) \xrightarrow{d} m_\infty^{II} = \arg\min_m V^*(m)$$
where, for Model II,
$$V_1(m) = \sum_{k=m+1}^{0}\left(\dot{A}_{\theta\theta} + \dot{A}_{\gamma\gamma}k^2 + 2\dot{A}_{\gamma\theta}k\right) - 2\sum_{k=m+1}^{0}\left(N_{1,k} + kN_{2,k}\right), \quad m = -1, -2, \ldots$$
$$V_2(m) = \sum_{k=1}^{m}\left(\dot{A}_{\theta\theta} + \dot{A}_{\gamma\gamma}k^2 + 2\dot{A}_{\gamma\theta}k\right) + 2\sum_{k=1}^{m}\left(N_{1,k} + kN_{2,k}\right), \quad m = 1, 2, \ldots$$
and, for Model III,
$$V_1(m) = \dot{A}_{\theta\theta}|m| - 2\sum_{k=m+1}^{0} N_{1,k}, \quad m = -1, -2, \ldots$$
$$V_2(m) = \dot{A}_{\theta\theta}m + 2\sum_{k=1}^{m} N_{1,k}, \quad m = 1, 2, \ldots$$

The process $V^*(m)$ resembles the process $S^*(m)$ in the previous theorem. However, an important difference is that $V^*(m)$ is defined in terms of a multivariate normal variate and does not depend on the exact distribution of the error process. Asymptotically valid confidence intervals can be formed for Model III by simulating the process $V^*(m)$. In fact, the procedure to simulate the process $S^*(m)$ based on the weighted cross sectional average of the errors would simulate $N^{-1}V^*(m)$ in the absence of common factors, and the minimizer of $N^{-1}V^*(m)$ has the same distribution as $m_\infty^{II}$. To see this, note that the autocovariance function of $N^{-1}\sum_{i=1}^{N} u_{ti}\theta_i$, if put in a matrix properly, corresponds to $N^{-1}\Xi_1$, and $A_{\theta\theta}$ corresponds to $N^{-1}\dot{A}_{\theta\theta}$. Hence, prior knowledge of the existence of common factors is again not required. The next theorem states the limiting distribution of the break fraction estimate when the error terms are integrated.
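One way this simulation could be implemented for Model III is sketched below. It draws the Gaussian vector $N_1$ from a Toeplitz covariance matrix built from the estimated autocovariances (our stand-in for $\Xi_1$), evaluates $V^*(m)$ on a truncated grid, and records the minimizer. The truncation point $M$, the ridge term and all names are our own choices, not taken from the paper:

```python
import numpy as np

def simulate_mstar_quantiles(A_theta, acov, M=50, reps=20000, q=(0.025, 0.975), seed=0):
    """Simulate arg min_m V*(m) for Model III and return quantiles of the minimizer.

    A_theta : estimate corresponding to A-dot_{theta,theta} in Theorem 4.
    acov    : autocovariances (lags 0, 1, ...) of the theta-weighted cross
              sectional average of the errors; they pin down Xi_1.
    """
    rng = np.random.default_rng(seed)
    k = np.arange(-M + 1, M + 1)                        # support of the Gaussian vector N_1
    cov = np.zeros(2 * M)
    n = min(len(acov), 2 * M)
    cov[:n] = np.asarray(acov)[:n]
    Sigma = cov[np.abs(np.subtract.outer(k, k))]        # Toeplitz autocovariance matrix
    chol = np.linalg.cholesky(Sigma + 1e-8 * np.eye(2 * M))  # ridge guards against non-PSD truncation
    m_vals = np.concatenate([-np.arange(1, M), [0], np.arange(1, M + 1)])
    mins = np.empty(reps)
    for r in range(reps):
        N1 = chol @ rng.standard_normal(2 * M)
        neg = N1[k <= 0][::-1]                          # N_{1,0}, N_{1,-1}, ..., N_{1,-M+1}
        pos = N1[k >= 1]                                # N_{1,1}, ..., N_{1,M}
        V_neg = A_theta * np.arange(1, M) - 2 * np.cumsum(neg)[:M - 1]   # V_1(m), m < 0
        V_pos = A_theta * np.arange(1, M + 1) + 2 * np.cumsum(pos)       # V_2(m), m > 0
        V = np.concatenate([V_neg, [0.0], V_pos])       # V*(0) = 0
        mins[r] = m_vals[np.argmin(V)]
    return np.quantile(mins, q)
```

Quantiles $(q_L, q_U)$ of the simulated minimizer can then be turned into a confidence interval for the break date, e.g. $[\hat{T}_1 - q_U,\ \hat{T}_1 - q_L]$.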
Theorem 5 (Integrated Errors). Suppose that Assumptions 1 and 3–5 hold. As $(T, N) \to \infty$, we have the following results:

A. Model I (Joint Broken Trend)
(i) If $N/T \to \kappa$, $0 \leq \kappa < \infty$, and $h_i = 0$ for all $i$,
$$\sqrt{TN}(\hat{\lambda} - \lambda_1) \xrightarrow{d} N\left(\frac{\sqrt{\kappa}\,\bar{\sigma}_d^2(1 - 2\lambda_1)}{5A_{\gamma\gamma}(1 - \lambda_1)\lambda_1},\ \frac{S_{\gamma\gamma}}{15A_{\gamma\gamma}^2(1 - \lambda_1)\lambda_1}\right)$$
with $\bar{\sigma}_d^2 = \lim_N N^{-1}\sum_{i=1}^{N} d_i(1)^2\sigma_i^2$.
(ii) If $A_{H\gamma} \neq 0$,
$$\sqrt{T}(\hat{\lambda} - \lambda_1) \xrightarrow{d} N\left(0,\ \frac{A_{H\gamma}'C(1)C(1)'A_{H\gamma}}{15A_{\gamma\gamma}^2(1 - \lambda_1)\lambda_1}\right).$$

B. Model II (Local Disjoint Trend)
(iii) Let $W_F(\cdot)$ be an $r$-dimensional standard Wiener process, and define
$$\xi_{11} = \left[\int_0^1 W_F(r)dr,\ \int_0^1 rW_F(r)dr,\ \int_{\lambda_1}^1 W_F(r)dr,\ \int_{\lambda_1}^1 (r - \lambda_1)W_F(r)dr\right]',$$
$$\xi_{21} = \left[0,\ 0,\ W_F(\lambda_1),\ \int_{\lambda_1}^1 W_F(r)dr\right]',$$
$$\xi_3 = \int_0^{\lambda_1} \frac{r(3r - 2\lambda_1)}{\lambda_1^2}\, dW_F(r)', \quad \xi_4 = \int_{\lambda_1}^1 \frac{(r - 1)(3r - 2\lambda_1 - 1)}{(1 - \lambda_1)^2}\, dW_F(r)'.$$
Then
$$T(\hat{\lambda} - \lambda_1) \xrightarrow{d} m_\infty^{III} = \arg\min_m Z^*(m),$$
where the stochastic process $Z^*(m)$ is defined such that $Z^*(0) = 0$, $Z^*(m) = Z(m) + m^2\xi_4 C(1)'A_{H\gamma}$ for $m < 0$ and $Z^*(m) = Z(m) + m^2\xi_3 C(1)'A_{H\gamma}$ for $m > 0$, with
$$Z(m) = A_{\gamma\gamma}\frac{|m|^3}{3} + m\, tr\left[2\Omega_1\xi_{11}C(1)'A_{HH}C(1)\xi_{21}' - \Omega_1\Sigma_f\Omega_1'\xi_{11}C(1)'A_{HH}C(1)\xi_{11}'\right] + m\,\bar{\sigma}_d^2\,\frac{2(2\lambda_1^2 - 46\lambda_1 + 45)}{15\lambda_1},$$
where $\Omega_1$ and $\Sigma_f$ are fixed matrices whose entries are functions of $\lambda_1$ only, and $\bar{\sigma}_d^2$ is as defined above.
In Model I without common factors, the limiting distribution is obtained under the assumption that $N/T \to \kappa$, $0 \leq \kappa < \infty$, which implies that $T$ should not be too small relative to $N$, and it has a non-zero mean that is $O(\sqrt{\kappa})$. This bias is unlikely to be negligible unless $T$ is extremely large relative to $N$. For example, when $T$ is five times larger than $N$, $\sqrt{\kappa} \simeq 0.45$. Also, note that this
bias does not exist when λ_1 = 0.5. See Section 4 for some Monte Carlo simulation results. In Model II, the limiting distribution is again highly non-standard, but the process Z*(m) is defined in terms of a standard Wiener process. When there are no common factors,

 Z*(m) = A_γγ |m|³/3 + m σ̄_d² · 2(2λ_1² − 46λ_1 + 45)/(15λ_1)

and thus √T(λ̂ − λ_1) →p 0, as claimed in Theorem 2(iii). A non-degenerate limiting distribution is not available in this case. The limiting variance and bias of the limiting distribution in Model I can be estimated from the first difference of the estimated residuals using a method discussed for the stationary case. For Model II, the intercept shift parameter θ again cannot be consistently estimated.

Remark 3. The cases where Assumptions 2 and 3 are mixed are also of interest. These can be analyzed using Lemmas A.6 and A.7 in the Appendix. If the common factors are integrated, they are always the dominant terms in the error processes, and the asymptotic results are invariant to whether the individual specific errors are integrated or not. In other words, Theorem 2(ii) and (iv) and Theorem 5(ii) and (iii) continue to hold with stationary individual specific errors. Similar results can be drawn for the opposite case. When there are stationary common factors with A_Hγ ≠ 0, Theorem 2(i) still holds under the extra condition that N/T² → 0, and Theorems 2(iii) and 5(i) hold with no extra condition.

Remark 4. Although we focus on a single break in this paper, multiple breaks may be handled using the one-at-a-time approach of Bai (1997) or the simultaneous approach of Bai and Perron (1998). The main result from these papers is that the break date estimates are asymptotically independent of each other, and thus each break date estimate can be treated as if it were the only one.

4. Monte Carlo simulation

In this section, we demonstrate the appropriateness of our asymptotic results via Monte Carlo experiments. In all experiments, the number of replications is 2000. We present the results pertaining to Models I and II only, since Model III shares its most important features with Model II. The data are generated as

 y_ti = d_ti + u_ti,  (i = 1, ..., N and t = 1, ..., T)
 u_ti = h_i′F_t + e_ti
 d_ti = γ_i B_t (Model I);  d_ti = θ_i C_t + γ_i B_t (Model II).
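The design above translates directly into code. The sketch below generates a panel under Model I or II with AR(1) idiosyncratic errors and an optional AR(1) common factor, and estimates the common break date by the pooled SSR minimization that defines the estimator. It is a minimal illustration, not the paper's code; the parameter ranges follow the designs of Figs. 1-3, and the function names and trimming choice are ours.

```python
import numpy as np

def simulate_panel(N, T, T1, model="I", rho=0.0, factor=False, seed=0):
    """Generate y_ti = d_ti + u_ti as in Section 4 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    B = np.maximum(t - T1, 0.0)          # broken trend (t - T1)+
    Cd = (t > T1).astype(float)          # level-shift dummy 1(t > T1)
    gamma = rng.uniform(0.0, 0.3, N)
    theta = rng.uniform(0.1, 0.4, N)
    d = np.outer(B, gamma)
    if model == "II":
        d += np.outer(Cd, theta)
    e = rng.standard_normal((T, N))
    for s in range(1, T):                # AR(1) idiosyncratic errors
        e[s] += rho * e[s - 1]
    u = e
    if factor:
        F = np.zeros(T)
        w = rng.standard_normal(T)
        for s in range(1, T):            # AR(1) common factor
            F[s] = 0.6 * F[s - 1] + w[s]
        u = u + np.outer(F, rng.uniform(0.0, 1.0, N))
    return d + u

def estimate_break(y, model="I"):
    """Common break date estimate: arg min over Tb of the pooled SSR from
    regressing each series on (1, t, (t-Tb)+) for Model I, or
    (1, t, 1(t>Tb), (t-Tb)+) for Model II."""
    T, N = y.shape
    t = np.arange(1, T + 1, dtype=float)
    best, best_ssr = None, np.inf
    for Tb in range(2, T - 2):           # trim the sample ends
        cols = [np.ones(T), t, np.maximum(t - Tb, 0.0)]
        if model == "II":
            cols.insert(2, (t > Tb).astype(float))
        X = np.column_stack(cols)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        ssr = np.sum((y - X @ coef) ** 2)
        if ssr < best_ssr:
            best, best_ssr = Tb, ssr
    return best
```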
The pre-break intercepts and slopes are set at zero (μ_i = β_i = 0 for all i), since the Monte Carlo results are exactly invariant to these parameter values. Fig. 1 illustrates the consistency of the break date estimate in the absence of common components (i.e., h_i = 0 for all i) when T is fixed. T is 10 and the true break date T_1 is 3. The number of cross sectional units N is set at 1, 100, 300, and 500. The first two rows correspond to Model I and the last two to Model II. γ_i is drawn from U(0, 1) for both models and θ_i is drawn from U(0.2, 0.7) for Model II. In the first and third rows of panels, the errors are iid N(0, 1). The bar graph in each panel depicts the relative frequencies of the break date estimates. Note that in both models the relative frequency of the true break date approaches one as N increases. On the other hand, in the second and fourth rows, autocorrelated errors are used with autoregressive parameter 0.8; that is, e_ti = 0.8e_{t−1,i} + ε_ti with ε_ti ∼ iid N(0, 1). As N increases, the probability mass function
concentrates on a wrong break date; it concentrates on the midpoint in Model I and on the fourth and fifth dates with similar probabilities in Model II. Figs. 2 and 3 show the estimated probability density of the break date estimate around the true break date (T̂_1 − T_1) as both T and N increase. γ_i is drawn from U(0, 0.3) for both models and θ_i is drawn from U(0.1, 0.4) for Model II. Each of the four columns corresponds to T of 100, 200, 300, and 500, respectively. The true break date T_1 is always 0.3T. There are four lines displayed in each panel, corresponding to N of 1, 10, 50, and 100, respectively. The panels in the first row show the case with iid errors, where u_ti = ε_ti ∼ iid N(0, 1). The panels in the second row show the case with autocorrelated errors, where u_ti = e_ti, e_ti = ρ_i e_{t−1,i} + ε_ti, ε_ti ∼ iid N(0, 1) and ρ_i is drawn from U(0.4, 0.7). The panels in the third row show the case with a common factor, where e_ti is the same as in the second row, F_t = 0.6F_{t−1} + w_t with w_t ∼ iid N(0, 1), and h_i is 0.5 for N = 1 and drawn from U(0, 1) for the other values of N. The first observation from Fig. 2 is that the displayed densities are all bell shaped, as expected from the asymptotic normality given in Theorem 3. Also, the densities concentrate around the origin as T and N increase when there is no common component. When there is a common factor, for any given value of N, the empirical densities look more concentrated around the origin as T increases. However, for any given T, the empirical densities are almost on top of each other for all values of N except 1. This is due to the fact that the rate of convergence of the break date estimate does not depend on N in the presence of common components. The picture is somewhat different for Model II, though some important features are shared with Model I. In Fig. 3, the densities show bimodality, especially for large N, which is related to the non-standard asymptotic distribution of the break date estimate given in Theorem 3. In the absence of a common component, the empirical densities concentrate around the origin as T and N increase. When there is a common component, the densities do not concentrate as T, N, or both increase, since the break date estimate is inconsistent in Model II. Figs. 4 and 5 correspond to the case with integrated errors. Each panel displays the empirical probability densities of T^{-1/2}(T̂_1 − T_1). The results in the first row are obtained using integrated individual specific errors and no common component; thus u_ti = e_ti, e_ti = e_{t−1,i} + ε_ti, ε_ti ∼ iid N(0, 1). In the second row, the common factor is integrated, so that F_t = F_{t−1} + w_t with w_t ∼ iid N(0, 1) and h_i is 0.5 for N = 1 and drawn from U(0, 1) for the other values of N, while e_ti = ε_ti and ε_ti ∼ iid N(0, 1). In the third row, both the common factor and the individual specific errors are integrated. The slope change parameter γ_i and the intercept change parameter θ_i are the same as in Figs. 2 and 3. In Fig. 4, the densities are again bell shaped. In the first row, a distinct feature is that the densities not only concentrate around the mode but also shift to the right as N increases for any given T. This shift is the bias term provided in Theorem 5(i). In the second row, where an integrated common factor exists but the individual specific shocks are iid over T, the densities are centered at the origin, although some asymmetry is observed, especially for small T.
Furthermore, because of the common component, the densities do not concentrate as N increases. In the third row, where both the common factor and the individual specific shocks are integrated, the results are intermediate between the two previous cases. The densities displayed in Fig. 5 are somewhat similar to those in Fig. 4. The densities concentrate around the mode as N increases for any given T without common components. When there is a common component, the densities do not concentrate. Also, the bimodality appearing in the stationary case is not present.
Fig. 1. Consistency of the break date estimate: histogram of T̂_1 with T_1 = 3.
5. Application

As an empirical illustration, we estimate a common slope change in disaggregate price indices. The data are the logs of the seasonally adjusted quarterly disaggregate price indices for Gross Domestic Product of the United States, obtained from the Bureau of Economic Analysis website.² The data set has 55 series, each spanning 1984.Q1 to 2008.Q2; hence there are 98 observations in each series. Price indices are often analyzed in first difference form because they then become inflation rates. There are a number of empirical studies on the statistical properties of inflation series, but their conclusions are somewhat conflicting, depending on the time span and the statistical methods employed. For example, Pivetta and Reis (2007) find the inflation rate very persistent, which implies that the price index series is integrated twice (I(2)). On the other hand, Clark (2006) finds that the average persistence in disaggregate inflation rates is below aggregate persistence and that a further reduction in persistence is observed when a mean shift is taken into account. Clark (2006) uses one common break date, 1993.Q1, which he obtains by averaging the break date estimates from each series. We take Clark's (2006) view that price indices are generated by Model I with integrated error processes, which also means that inflation rates are generated by Model III with stationary errors. Furthermore, we assume that there are common components. We scale each series by its sample standard deviation because our asymptotic results assume that no single series has a dominating variance. Of course, this scaling does not affect the location of the break. The main results are presented in Table 1. The asymptotic distributions used to form confidence intervals are those in Theorems 5(ii) and 3(iii) for Models I and III, respectively. For Model I, the long-run variance parameter A′_Hγ C(1)C(1)′A_Hγ is estimated from the first difference of the weighted cross sectional average of the residuals given in (4).

² Table 1.5.4. Price Indexes for Gross Domestic Product, Expanded Detail.
Fig. 2. Estimated density of T̂_1 − T_1, Model I with stationary errors.

Table 1
Estimated common break date in GDP price indices.

                                                 Model I with I(1) errors       Model III with I(0) errors
                                                 T̂_1       95% C.I.             T̂_1       95% C.I.
 Common break date in the disaggregate indices   1992.Q4   [91.Q4, 93.Q4]       1992.Q2   [91.Q4, 92.Q4]
 Break date in the aggregate price index         1991.Q4   [89.Q3, 94.Q1]       1991.Q3   [82.Q1, 01.Q4]
In particular, we applied a heteroskedasticity and autocorrelation consistent covariance matrix estimator with the Quadratic Spectral window and selected the bandwidth parameter using Andrews' (1991) data dependent method with an AR(1) approximation. For Model III, F_t′A_Hθ is assumed to be an AR(p) process with Gaussian innovations. From the cross sectional average N^{-1} Σ_{i=1}^N û_ti θ̂_i, we first selected the autoregressive order using the Akaike information criterion and then estimated the autoregressive parameters and the innovation variance by least squares. Using these estimates, the S*(m) process is simulated 5000 times and the relevant quantiles of the minimizer m∞_I are obtained. The estimated date of common slope change in the 55 disaggregate price indices is 1992.Q4, with the 95% confidence interval
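A minimal sketch of the long-run variance step just described, assuming the input v is the first difference of the weighted cross sectional average of the residuals: it implements the Quadratic Spectral kernel with Andrews' (1991) AR(1) plug-in bandwidth S_T = 1.3221(α̂(2)T)^{1/5}, where α̂(2) = 4ρ̂²/(1 − ρ̂)⁴ for an AR(1) approximation. The function names are hypothetical.

```python
import numpy as np

def qs_kernel(x):
    """Quadratic Spectral kernel (Andrews, 1991)."""
    x = np.asarray(x, dtype=float)
    out = np.ones_like(x)
    nz = x != 0
    a = 6.0 * np.pi * x[nz] / 5.0
    out[nz] = 3.0 / a**2 * (np.sin(a) / a - np.cos(a))
    return out

def lrv_qs_andrews(v):
    """Long-run variance of a scalar series v with the QS window and
    Andrews' (1991) data-dependent bandwidth (AR(1) approximation)."""
    v = np.asarray(v, dtype=float) - np.mean(v)
    T = v.size
    rho = (v[1:] @ v[:-1]) / (v[:-1] @ v[:-1])   # AR(1) approximation
    alpha2 = 4.0 * rho**2 / (1.0 - rho) ** 4     # Andrews' alpha(2)
    S = 1.3221 * (alpha2 * T) ** 0.2             # QS bandwidth
    lrv = v @ v / T
    for h in range(1, T - 1):
        lrv += 2.0 * qs_kernel(h / S) * (v[h:] @ v[:-h]) / T
    return lrv
```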
ranging from 1991.Q4 to 1993.Q4. The estimated date of common mean shift in the 55 inflation rates is 1992.Q2, with the 95% confidence interval from 1991.Q4 to 1992.Q4. These two point estimates differ by only two quarters. The confidence interval in Model III is only one year long (4 periods), half the length of the two-year confidence interval in Model I. On the other hand, the estimated date of slope change in the aggregate GDP price index is 1991.Q4 and the estimated date of mean shift in aggregate inflation is 1991.Q3. Although these estimates differ from their common break counterparts by at most a year, they are associated with extremely large confidence intervals. The 95% confidence interval is four and a half years long for Model I and a striking twenty years long for Model III. In the presence of
Fig. 3. Estimated density of T̂_1 − T_1, Model II with stationary errors.
the common components, the common break date estimate does not have a faster rate of convergence than a break date estimate from a univariate series. Nevertheless, the former yields much tighter confidence intervals than the latter in this particular empirical example. In Table 2, we report the break date estimates from each of the 55 series. Most break date estimates are in the early 1990s, though we observe a handful of break date estimates in the 2000s, especially in Exports, Imports, and Government Consumption Expenditures and Gross Investment. However, these estimates from individual series are often associated with very wide confidence intervals. The averages of these break dates are 1995.Q4 and 1997.Q2 for Models I and III, respectively, and the average lengths of the confidence intervals are 5.5 years and 9.5 years. It is interesting to note that the confidence intervals from Model III are on average almost twice as wide as those from Model I, despite the faster rate of convergence. We may suspect from Table 2 that there are actually two breaks, one in the 1990s and the other in the 2000s. In the following, we estimate two break dates using the one-at-a-time approach³ of Bai (1997). That is, we estimate the second break date conditional on the break dates reported in Table 1, and then we re-estimate the first break date conditional on the second break date estimate. Given the two break dates, we re-form the 95% confidence intervals for both breaks. The results are reported in Table 3. When only one break date is estimated in the presence of two breaks, the common break date estimate is consistent for the break date that gives the greatest reduction in the sum of squared residuals. In contrast, the average of all individual estimates may not be consistent for any of the break dates unless all series share one common dominating break date. As expected, the estimated second break dates are around 2003.Q4. The confidence intervals for these second break dates are somewhat wider than those for the first break dates. Also, the
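A compact sketch of this sequential scheme, under the simplifying assumption that each series is regressed on a linear trend plus one broken-trend regressor per break (the Model I form used for the price indices); the helper names are illustrative.

```python
import numpy as np

def ssr_given_breaks(y, breaks):
    """Pooled SSR from regressing each series on (1, t) plus one
    broken-trend regressor (t - Tb)+ per break date in `breaks`."""
    T, _ = y.shape
    t = np.arange(1, T + 1, dtype=float)
    X = np.column_stack([np.ones(T), t] +
                        [np.maximum(t - Tb, 0.0) for Tb in breaks])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coef) ** 2)

def one_at_a_time(y, first_break, trim=2):
    """Bai's (1997) sequential scheme as applied in the text: estimate the
    second break conditional on the first, then re-estimate the first
    conditional on the second."""
    T, _ = y.shape
    grid = [Tb for Tb in range(trim, T - trim) if Tb != first_break]
    Tb2 = min(grid, key=lambda Tb: ssr_given_breaks(y, [first_break, Tb]))
    grid1 = [Tb for Tb in range(trim, T - trim) if Tb != Tb2]
    Tb1 = min(grid1, key=lambda Tb: ssr_given_breaks(y, [Tb, Tb2]))
    return Tb1, Tb2
```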
³ I thank a referee who pointed out a mistake in the application of the one-at-a-time approach in an earlier draft of this paper.
Table 2
Estimated individual break dates in disaggregate GDP price indices.

  #   Price index series                                              Model I with I(1) errors      Model III with I(0) errors
                                                                      T̂_1       95% C.I.            T̂_1       95% C.I.
  1   Gross domestic product                                          1991.Q4   [89.Q3, 94.Q1]      1991.Q3   [82.Q1, 01.Q4]
  2   Personal consumption expenditures                               1992.Q1   [90.Q3, 93.Q3]      1990.Q4   [84.Q2, 97.Q2]
  3   Goods                                                           1992.Q2   [89.Q4, 94.Q4]      1990.Q4   [82.Q1, 99.Q2]
  4   Durable goods                                                   1995.Q2   [94.Q3, 96.Q1]      1995.Q1   [94.Q3, 95.Q3]
  5   Motor vehicles and parts                                        1997.Q1   [95.Q3, 98.Q3]      1997.Q1   [94.Q4, 99.Q2]
  6   Furnishings and durable household equipment                     1996.Q3   [95.Q2, 97.Q4]      1998.Q2   [96.Q3, 00.Q1]
  7   Recreational goods and vehicles                                 1993.Q4   [92.Q3, 95.Q1]      1995.Q1   [93.Q4, 96.Q1]
  8   Other durable goods                                             1992.Q4   [91.Q1, 94.Q3]      1991.Q4   [79.Q2, 04.Q1]
  9   Nondurable goods                                                1991.Q1   [84.Q3, 97.Q3]      2006.Q4   [97.Q2, 16.Q2]
 10   Food and beverages purchased for off-premises consumption       1991.Q1   [87.Q3, 94.Q3]      2006.Q4   [03.Q2, 10.Q3]
 11   Clothing and footwear                                           1992.Q2   [91.Q1, 93.Q3]      1991.Q4   [90.Q3, 93.Q1]
 12   Gasoline and other energy goods                                 2002.Q1   [98.Q3, 05.Q3]      2002.Q1   [97.Q3, 06.Q2]
 13   Other nondurable goods                                          1992.Q2   [91.Q2, 93.Q2]      1992.Q2   [91.Q2, 93.Q2]
 14   Services                                                        1991.Q4   [90.Q4, 92.Q4]      1992.Q1   [91.Q2, 92.Q4]
 15   Household consumption expenditures (for services)               1992.Q2   [91.Q3, 93.Q1]      1992.Q1   [91.Q2, 92.Q4]
 16   Housing and utilities                                           1989.Q2   [87.Q1, 91.Q3]      1990.Q3   [87.Q2, 94.Q1]
 17   Health care                                                     1993.Q1   [92.Q2, 93.Q4]      1993.Q1   [92.Q4, 93.Q2]
 18   Transportation services                                         1992.Q1   [88.Q4, 95.Q2]      2007.Q2   [02.Q2, 12.Q1]
 19   Recreation services                                             1991.Q4   [90.Q2, 93.Q2]      1991.Q3   [89.Q4, 93.Q1]
 20   Food services and accommodations                                1990.Q2   [88.Q4, 91.Q4]      1991.Q2   [87.Q4, 95.Q2]
 21   Financial services and insurance                                1992.Q4   [88.Q3, 97.Q1]      1984.Q3   [84.Q2, 84.Q4]
 22   Other services                                                  1993.Q2   [90.Q1, 96.Q3]      1995.Q2   [84.Q3, 05.Q4]
 23   Final consumption expenditures of nonprofit inst. serving
      households                                                      1994.Q2   [93.Q1, 95.Q3]      1993.Q3   [90.Q1, 96.Q4]
 24   Gross output of nonprofit inst.                                 1991.Q2   [89.Q4, 92.Q4]      1991.Q2   [88.Q3, 94.Q1]
 25   Less: Receipts from sales of goods and services by
      nonprofit inst.                                                 1992.Q4   [91.Q3, 94.Q1]      1992.Q4   [90.Q4, 94.Q3]
 26   Gross private domestic investment                               2003.Q4   [01.Q2, 06.Q2]      2003.Q3   [97.Q1, 10.Q2]
 27   Fixed investment                                                2003.Q4   [01.Q2, 06.Q2]      2003.Q3   [96.Q4, 10.Q2]
 28   Nonresidential                                                  1990.Q3   [85.Q3, 95.Q3]      2003.Q3   [97.Q1, 10.Q1]
 29   Structures                                                      2002.Q3   [01.Q3, 03.Q3]      2003.Q4   [00.Q4, 06.Q4]
 30   Equipment and software                                          1991.Q3   [89.Q3, 93.Q3]      1991.Q1   [79.Q1, 03.Q1]
 31   Information processing equipment and software                   1991.Q1   [87.Q1, 95.Q1]      1991.Q1   [79.Q1, 03.Q2]
 32   Computers and peripheral equipment                              1991.Q2   [87.Q1, 95.Q3]      2001.Q4   [89.Q2, 14.Q2]
 33   Software                                                        1997.Q1   [93.Q4, 00.Q2]      1998.Q2   [94.Q1, 02.Q3]
 34   Other                                                           1993.Q3   [92.Q1, 95.Q1]      1992.Q3   [86.Q4, 98.Q3]
 35   Industrial equipment                                            1992.Q1   [90.Q1, 94.Q1]      1991.Q1   [81.Q3, 00.Q1]
 36   Transportation equipment                                        1992.Q2   [89.Q1, 95.Q3]      1992.Q1   [88.Q1, 96.Q1]
 37   Other equipment                                                 1992.Q3   [90.Q3, 94.Q3]      1991.Q1   [85.Q3, 96.Q3]
 38   Residential                                                     2000.Q1   [96.Q2, 03.Q4]      2006.Q4   [00.Q4, 12.Q3]
 43   Exports                                                         2004.Q1   [01.Q3, 06.Q3]      2003.Q3   [98.Q3, 08.Q3]
 44   Goods                                                           2003.Q3   [00.Q3, 06.Q3]      2003.Q3   [97.Q2, 09.Q2]
 45   Services                                                        1991.Q4   [87.Q1, 96.Q3]      2003.Q4   [97.Q2, 10.Q2]
 46   Imports                                                         2004.Q1   [01.Q2, 06.Q4]      2007.Q3   [06.Q4, 08.Q2]
 47   Goods                                                           2004.Q1   [01.Q2, 06.Q4]      2007.Q3   [06.Q4, 08.Q2]
 48   Services                                                        1990.Q4   [82.Q1, 99.Q3]      1985.Q1   [84.Q1, 86.Q1]
 49   Government consumption expenditures and gross investment        2002.Q4   [01.Q3, 04.Q1]      2003.Q4   [02.Q2, 05.Q2]
 50   Federal                                                         2001.Q4   [99.Q2, 04.Q2]      2001.Q4   [97.Q3, 06.Q1]
 51   National defense                                                2001.Q4   [99.Q4, 03.Q4]      2001.Q4   [98.Q3, 05.Q2]
 52   Consumption expenditures                                        2001.Q3   [99.Q3, 03.Q3]      2001.Q4   [99.Q3, 03.Q4]
 53   Gross investment                                                1987.Q1   [85.Q4, 88.Q2]      1987.Q4   [86.Q2, 89.Q3]
 54   Nondefense                                                      2002.Q4   [93.Q2, 12.Q2]      2002.Q4   [91.Q3, 14.Q1]
 55   Consumption expenditures                                        2002.Q4   [91.Q4, 13.Q4]      2002.Q4   [91.Q1, 14.Q2]
 56   Gross investment                                                1991.Q2   [88.Q2, 94.Q2]      1991.Q1   [82.Q2, 99.Q4]
 57   State and local                                                 2003.Q4   [02.Q2, 05.Q2]      2004.Q1   [02.Q3, 05.Q3]
 58   Consumption expenditures                                        2003.Q3   [01.Q3, 05.Q3]      2004.Q2   [01.Q3, 07.Q1]
 59   Gross investment                                                2003.Q4   [02.Q4, 04.Q4]      2004.Q1   [03.Q3, 04.Q3]
Table 3
Sequentially estimated common break dates in GDP price indices.

                                                 Model I with I(1) errors         Model III with I(0) errors
                                                 Break date   95% C.I.            Break date   95% C.I.
 Common break date in the disaggregate indices   1991.Q4      [90.Q4, 92.Q4]      1991.Q1      [90.Q1, 92.Q1]
                                                 2003.Q4      [03.Q1, 04.Q3]      2003.Q4      [03.Q1, 04.Q2]
 Break date in the aggregate price index         1991.Q4      [90.Q3, 93.Q1]      1991.Q3      [88.Q3, 94.Q4]
                                                 2004.Q1      [02.Q3, 05.Q3]      2003.Q4      [97.Q3, 09.Q3]
Fig. 4. Estimated density of (T̂_1 − T_1)/√T, Model I with integrated errors.
confidence intervals for the common break dates are tighter than their aggregate counterparts.

6. Conclusion

We show how to estimate a common deterministic trend break in large panels by minimizing the sum of squared residuals. The statistical properties of the proposed break date estimate depend on a number of model assumptions. The form of the break, that is, whether it is only a slope change or a slope change combined with an intercept shift, affects the rate of convergence and the form of the limiting distribution of the break date estimate. The amount of correlation in both the time and cross sectional dimensions affects the rate of convergence: strong correlation in either dimension results in a slower rate. In particular, the strong cross sectional dependence generated by common factors makes the rate of convergence the same as in the univariate case and thus eliminates the benefit of the panel data.
Fig. 5. Estimated density of (T̂_1 − T_1)/√T, Model II with integrated errors.

Acknowledgments

I am grateful to two anonymous referees, an associate editor and the editor for their many helpful comments. I also thank Jushan Bai, Serena Ng, Pierre Perron and other seminar participants at Boston University, Columbia University and the 2010 International Symposium on Econometric Theory and Applications (SETA) for their many constructive comments.

Appendix

The proofs for Models II and III are very similar, and we present only those for Model II to conserve space. Let Δ_T = diag{T^{-1/2}, T^{-3/2}, T^{-3/2}} for Model I and Δ_T = diag{T^{-1/2}, T^{-3/2}, T^{-1/2}, T^{-3/2}} for Model II. Also, ι̃_b = (ι̃_b(1), ..., ι̃_b(T))′, ι_b = (ι_b(1), ..., ι_b(T))′, α = (α_1, ..., α_T)′ and κ = (κ_1, ..., κ_T)′, where, if Tb > T1,

 ι̃_b(t) ≡ 0 if 1 ≤ t ≤ T1;  (t − T1)/(Tb − T1) if T1 + 1 ≤ t ≤ Tb;  1 if Tb + 1 ≤ t ≤ T;

if Tb < T1,

 ι̃_b(t) ≡ 0 if 1 ≤ t ≤ Tb;  (t − Tb)/(T1 − Tb) if Tb + 1 ≤ t ≤ T1;  1 if T1 + 1 ≤ t ≤ T;

and if Tb = T1,

 ι̃_b(t) = ι_b(t) ≡ 0 if 1 ≤ t ≤ T1;  1 if T1 + 1 ≤ t ≤ T.

If Tb ≥ T1, α_t = 1 if T1 + 1 ≤ t ≤ Tb and 0 otherwise, and κ_t = t − T1 if T1 + 1 ≤ t ≤ Tb and 0 otherwise. If T1 > Tb, α_t = −1 if Tb + 1 ≤ t ≤ T1 and 0 otherwise, and κ_t = −(t − T1) if Tb + 1 ≤ t ≤ T1 and 0 otherwise.

The following results pertaining to the deterministic terms are taken from Perron and Zhu (2005) and we will use them without proofs.

Lemma A.1. (i) Δ_T(X′_Tb X_Tb − X′_T1 X_T1)Δ_T = |Tb − T1| O(T^{-1});
(ii) ι̃′_b X_Tb Δ_T = O(T^{1/2});
(iii) α′X_Tb Δ_T = |Tb − T1| O(T^{-1/2});
(iv) κ′X_Tb Δ_T = |Tb − T1|² O(T^{-1/2});
(v) ι̃′_b(I − P_Tb)ι̃_b = O(T);
(vi) Δ_T X′_T1 X_T1 Δ_T = Σ_a + o_p(1), where the symmetric matrix Σ_a is given by

 Σ_a = [ 1            1/2                (1 − λ_1)²/2
         1/2          1/3                (1 − λ_1)²(λ_1 + 2)/6
         (1 − λ_1)²/2 (1 − λ_1)²(λ_1 + 2)/6  (1 − λ_1)³/3 ]

and

 Σ_a = [ 1             1/2                   1 − λ_1        (1 − λ_1)²/2
         1/2           1/3                   (1 − λ_1²)/2   (1 − λ_1)²(λ_1 + 2)/6
         1 − λ_1       (1 − λ_1²)/2          1 − λ_1        (1 − λ_1)²/2
         (1 − λ_1)²/2  (1 − λ_1)²(λ_1 + 2)/6 (1 − λ_1)²/2   (1 − λ_1)³/3 ]

for Models I and II, respectively.
We prove the following results for the stationary error processes.

Lemma A.2. Under Assumption 2, we have for all Tb, as T, N → ∞:
(i) Δ_T X′_Tb Eγ′ = O_p(N^{1/2}) and ι̃′_b Eγ′ = O_p(T^{1/2}N^{1/2});
(ii) Δ_T X′_Tb EH′ = O_p(N^{1/2}) and ι̃′_b EH′ = O_p(T^{1/2}N^{1/2});
(iii) in Model I, ι̃′_b(I − P_Tb)Eγ′ = O_p(T^{1/2}N^{1/2}) and ι̃′_b(I − P_Tb)EH′ = O_p(T^{1/2}N^{1/2});
(iv) α′F = |Tb − T1|^{1/2} O_p(1), α′Eθ′ = |Tb − T1|^{1/2} O_p(N^{1/2}), and α′EH′ = |Tb − T1|^{1/2} O_p(N^{1/2});
(v) κ′F = |Tb − T1|^{3/2} O_p(1), κ′Eγ′ = |Tb − T1|^{3/2} O_p(N^{1/2}), and κ′EH′ = |Tb − T1|^{3/2} O_p(N^{1/2});
(vi) in Model II, α′(I − P_Tb)Eθ′ = |Tb − T1|^{1/2} O_p(N^{1/2}), α′(I − P_Tb)EH′ = |Tb − T1|^{1/2} O_p(N^{1/2}), κ′(I − P_Tb)Eγ′ = |Tb − T1|^{3/2} O_p(N^{1/2}), and κ′(I − P_Tb)EH′ = |Tb − T1|^{3/2} O_p(N^{1/2}).

Proof. (i) From the definitions of Δ_T and X_Tb, we need to consider T^{-1/2}ι′Eγ′, T^{-1/2}C′Eγ′, T^{-3/2}τ′Eγ′, and T^{-3/2}B′Eγ′. Let r_i(h) be the autocovariance function of e_ti. First note that

 Var(T^{-1/2}ι′Eγ′) = T^{-1} Var(Σ_{i=1}^N ι′E_i γ_i) = T^{-1} Σ_{i=1}^N γ_i² Var(ι′E_i)
  = Σ_{i=1}^N γ_i² Σ_{h=−T+1}^{T−1} (1 − |h|/T) r_i(h)
  ≤ Σ_{i=1}^N γ_i² Σ_{h=−∞}^{∞} |r_i(h)|
  = Σ_{i=1}^N γ_i² σ_i² Σ_{h=−∞}^{∞} |Σ_{k=0}^{∞} d_{i,k+|h|} d_{i,k}|
  ≤ Σ_{i=1}^N γ_i² σ_i² Σ_{k=0}^{∞} Σ_{h=−∞}^{∞} |d_{i,k+|h|}| |d_{i,k}|
  ≤ 2M³ Σ_{i=1}^N γ_i² = O(N).

Since the elements in C, T^{-1}τ, T^{-1}B, and ι̃_b are no greater than those in ι, the results related to these terms can be proved analogously.
(ii) Completely analogous to (i).
(iii) Note that |ι̃′_b(I − P_Tb)Eγ′| ≤ |ι̃′_b Eγ′| + |ι̃′_b P_Tb Eγ′|, where ι̃′_b Eγ′ = O_p(T^{1/2}N^{1/2}) from (i), and

 ι̃′_b P_Tb Eγ′ = ι̃′_b X_Tb Δ_T (Δ_T X′_Tb X_Tb Δ_T)^{-1} Δ_T X′_Tb Eγ′ = O(T^{1/2}) O(1) O_p(N^{1/2}).

(iv) Since only |Tb − T1| elements of α are non-zero and all are bounded by one,

 Var(|Tb − T1|^{-1/2}α′Eγ′) = |Tb − T1|^{-1} Σ_{i=1}^N γ_i² Var(α′E_i)
  = Σ_{i=1}^N γ_i² Σ_{h=−T+1}^{T−1} (|Tb − T1|^{-1} Σ_{j=1}^{T−|h|} α_j α_{j+|h|}) r_i(h)
  ≤ Σ_{i=1}^N γ_i² Σ_{h=−∞}^{∞} |r_i(h)| ≤ 2M³ Σ_{i=1}^N γ_i² = O(N).

(v) Recall that all elements of |Tb − T1|^{-1}κ are bounded by one and only |Tb − T1| of them are non-zero. The rest of the proof is completely analogous to (iv).
(vi) Using Lemma A.1,

 α′(I − P_Tb)Eθ′ = α′Eθ′ − α′X_Tb Δ_T (Δ_T X′_Tb X_Tb Δ_T)^{-1} Δ_T X′_Tb Eθ′ = |Tb − T1|^{1/2} O_p(N^{1/2}) + |Tb − T1| O(T^{-1/2}) O(1) O_p(N^{1/2}).

The second term is always of a smaller order of magnitude than the first because |Tb − T1|^{1/2}/T^{1/2} is at most O_p(1) and can even be o_p(1) if |Tb − T1| = o_p(T). A similar argument holds for the other terms. □

We prove the following results for the integrated error processes.

Lemma A.3. Under Assumption 3, we have for all Tb, as T, N → ∞:
(i) Δ_T X′_Tb Eγ′ = O_p(TN^{1/2}) and ι̃′_b Eγ′ = O_p(T^{3/2}N^{1/2});
(ii) Δ_T X′_Tb EH′ = O_p(TN^{1/2}) and ι̃′_b EH′ = O_p(T^{3/2}N^{1/2});
(iii) in Model I, ι̃′_b(I − P_Tb)Eγ′ = O_p(T^{3/2}N^{1/2}) and ι̃′_b(I − P_Tb)EH′ = O_p(T^{3/2}N^{1/2});
(iv) α′F = |Tb − T1| O_p(T^{1/2}), α′Eθ′ = |Tb − T1| O_p(T^{1/2}N^{1/2}), and α′EH′ = |Tb − T1| O_p(T^{1/2}N^{1/2});
(v) κ′F = |Tb − T1|² O_p(T^{1/2}), κ′Eγ′ = |Tb − T1|² O_p(T^{1/2}N^{1/2}), and κ′EH′ = |Tb − T1|² O_p(T^{1/2}N^{1/2});
(vi) in Model II, α′(I − P_Tb)Eθ′ = |Tb − T1| O_p(T^{1/2}N^{1/2}), α′(I − P_Tb)EH′ = |Tb − T1| O_p(T^{1/2}N^{1/2}), κ′(I − P_Tb)Eγ′ = |Tb − T1|² O_p(T^{1/2}N^{1/2}), and κ′(I − P_Tb)EH′ = |Tb − T1|² O_p(T^{1/2}N^{1/2}).

Proof. For (i)–(iii), we show the result for T^{-1/2}ι′Eγ′, the first element of Δ_T X′_Tb Eγ′, only; the rest can be shown analogously. Let ξ_ti = (1 − L)e_ti with autocovariance function r_i(h). First note that E_i = Gξ_i, where ξ_i = (ξ_{1i}, ..., ξ_{Ti})′ and G is a lower triangular matrix of ones. Let a_T = (a_{1,T}, ..., a_{T,T})′ = T^{-1}G′ι; each element of a_T is bounded by 1. Then

 Var(T^{-3/2}ι′Eγ′) = T^{-3} Σ_{i=1}^N γ_i² Var(ι′Gξ_i) = T^{-1} Σ_{i=1}^N γ_i² Var(a′_T ξ_i)
  = Σ_{i=1}^N γ_i² Σ_{h=−T+1}^{T−1} (T^{-1} Σ_{j=1}^{T−|h|} a_{j,T} a_{j+|h|,T}) r_i(h)
  ≤ Σ_{i=1}^N γ_i² Σ_{h=−∞}^{∞} |r_i(h)| ≤ 2M³ Σ_{i=1}^N γ_i² = O(N).

Hence, the result follows.
(iv) Let b_T = (b_{1,T}, ..., b_{T,T})′ = |Tb − T1|^{-1}G′α. Then b_T has at most the larger of T1 and Tb non-zero positive elements, all bounded by one, and

 Var(|Tb − T1|^{-1}T^{-1/2}α′Eγ′) = (|Tb − T1|² T)^{-1} Σ_{i=1}^N γ_i² Var(α′Gξ_i) = T^{-1} Σ_{i=1}^N γ_i² Var(b′_T ξ_i)
  = Σ_{i=1}^N γ_i² Σ_{h=−T+1}^{T−1} (T^{-1} Σ_{j=1}^{T−|h|} b_{j,T} b_{j+|h|,T}) r_i(h) ≤ 2M³ Σ_{i=1}^N γ_i² = O(N).

(v) and (vi) follow from a similar argument. □
As the limit counterpart of ι̃_b(t), we define a continuous function f̃_ιb(r) over [0, 1]: if λ > λ_1,

 f̃_ιb(r) = 0 if 0 ≤ r ≤ λ_1;  (r − λ_1)/(λ − λ_1) if λ_1 ≤ r < λ;  1 if λ ≤ r ≤ 1;

if λ_1 > λ,

 f̃_ιb(r) = 0 if 0 ≤ r ≤ λ;  −(r − λ)/(λ_1 − λ) if λ ≤ r < λ_1;  −1 if λ_1 ≤ r ≤ 1;

and if λ = λ_1,

 f̃_ιb(r) = f_ιb(r) = 0 if 0 ≤ r < λ_1;  1 if λ_1 ≤ r ≤ 1.

Also, let f(r, λ_1) = (1, r, (r − λ_1)⁺)′, f_2(r, λ_1) = (1, r, 1(r ≥ λ_1), (r − λ_1)⁺)′, g(r, λ, λ_1) = (0, 0, (λ − λ_1)f̃_ιb(r))′, and let f̃*_ιb(r) be the projection residual of the least squares regression of f̃_ιb(r) on f(r, λ_1).

Lemma A.4. As T → ∞, we have the following results under Assumption 2. In Model I,
(i) T^{-1/2}ι̃′_b(I − P_Tb)F ⇒ ξ′_F, where ξ_F ∼ N(0, λ_1(1 − λ_1)C(1)C(1)′/4);
(ii) Δ_T X′_T1 F ⇒ ∫_0^1 f(r, λ_1) dW_F(r)′ C(1)′;
(iii) Δ_T(X_T1 − X_Tb)′F ⇒ ∫_0^1 g(r, λ, λ_1) dW_F(r)′ C(1)′.
In Model II,
(iv) Δ_T X′_T1 F ⇒ ∫_0^1 f_2(r, λ_1) dW_F(r)′ C(1)′.

Lemma A.5. As T → ∞, we have the following results under Assumption 3. In Model I,
(i) T^{-3/2}ι̃′_b(I − P_Tb)F ⇒ ∫_0^1 f̃*_ιb(r) W_F(r)′ dr C(1)′;
(ii) T^{-1}Δ_T X′_T1 F ⇒ ∫_0^1 f(r, λ_1) W_F(r)′ dr C(1)′;
(iii) T^{-1}Δ_T(X_T1 − X_Tb)′F ⇒ ∫_0^1 g(r, λ, λ_1) W_F(r)′ dr C(1)′.
In Model II,
(iv) T^{-1}Δ_T X′_T1 F ⇒ ∫_0^1 f_2(r, λ_1) W_F(r)′ dr C(1)′.

The proofs for Lemmas A.4 and A.5 can be found in Perron and Zhu (2005). Now note that we have by definition

 SSR(T̂_1) − SSR(T_1) ≤ 0   (A.1)

and, as in Perron and Zhu (2005), we can write

 SSR(T̂_1) − SSR(T_1) = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)(X_T1 − X_Tb)Π] + 2 tr[Π′(X_T1 − X_Tb)′(I − P_Tb)U] + tr[U′(P_T1 − P_Tb)U] ≡ (XX) + 2(XU) + (UU).   (A.2)

This decomposition will be used repeatedly below.

A.1. Proof of Theorem 1

Theorem 1 follows from Lemma A.6 below. Consider Model I with h_i = 0 and d_i(L) = 1 for all i. Suppose that |T̂_1 − T_1| is not o_p(1) as N increases with T fixed. Note that (XX) is strictly positive and its order of magnitude holds in its strict sense, as shown in the proof of Lemma A.6. Then (XX) is of strictly greater order of magnitude than the other two terms, and as N increases, (A.2) cannot be negative with probability one. This contradicts the fact that T̂_1 minimizes the SSR among all possible break dates, including the true one, and it follows that |T̂_1 − T_1| is o_p(1).

When both T and N increase and h_i = 0 for all i, suppose that |T̂_1 − T_1| is B_T O_p(T^{-1/2}N^{-1/2}), where B_T is any sequence of numbers increasing in T. Then (XX) is the dominating term and the same contradiction argument holds, so |T̂_1 − T_1| = o_p(T^{-1/2}). Hence |T̂_1 − T_1| is at most O_p(T^{-1/2}N^{-1/2}). When A_Hγ ≠ 0, the contradiction argument holds whenever |T̂_1 − T_1| is B_T O_p(T^{-1/2}); hence |T̂_1 − T_1| is at most O_p(T^{-1/2}). The proof for Model II is completely analogous.

Lemma A.6. Under Assumptions 1, 2, 4 and 5, we have for all generic Tb:
(i) In Model I,
 (XX) = |Tb − T1|² O(TN);
 (XU) = |Tb − T1| O_p(T^{1/2}N^{1/2}) if h_i = 0 for all i, and |Tb − T1| O_p(T^{1/2}N) if A_Hγ ≠ 0;
 (UU) = |Tb − T1| O_p(T^{-1}N^{1/2}) if h_i = 0 and d_i(L) = 1 for all i, and |Tb − T1| O_p(T^{-1}N) otherwise.
(ii) In Model II,
 (XX) = |Tb − T1|³ O(N) + |Tb − T1| O(N);
 (XU) = |Tb − T1|^{3/2} O_p(N^{1/2}) + |Tb − T1|^{1/2} O_p(N^{1/2}) if h_i = 0 for all i, and |Tb − T1|^{3/2} O_p(N) + |Tb − T1|^{1/2} O_p(N) if A_Hγ ≠ 0;
 (UU) = |Tb − T1|^{1/2} O_p(T^{-1/2}N^{1/2}) if h_i = 0 and d_i(L) = 1 for all i, and |Tb − T1|^{1/2} O_p(T^{-1/2}N) otherwise.

Proof. (i) For Model I,

 (XX) = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)(X_T1 − X_Tb)Π] = |Tb − T1|² tr[γ′ι̃′_b(I − P_Tb)ι̃_b γ] = |Tb − T1|² ι̃′_b(I − P_Tb)ι̃_b (γγ′) = |Tb − T1|² O(NT),

where the last equality holds from Assumption 4 and Perron and Zhu's (2005) Lemma 1. Also note that lim_T T^{-1}ι̃′_b(I − P_Tb)ι̃_b > 0, as shown in Perron and Zhu (2005), and thus (XX) is |Tb − T1|² O(NT) in its strict sense, meaning that it is not o(|Tb − T1|² TN). From Lemmas A.2 and A.4 and Assumption 5, it follows that

 (XU) = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)U] = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)FH′] + tr[Π′(X_T1 − X_Tb)′(I − P_Tb)E]
  = |Tb − T1| ι̃′_b(I − P_Tb)FHγ′ + |Tb − T1| ι̃′_b(I − P_Tb)Eγ′ = |Tb − T1| O_p(T^{1/2}N) + |Tb − T1| O_p(T^{1/2}N^{1/2}).

Hence, (XU) = |Tb − T1| O_p(T^{1/2}N^{1/2}) if h_i = 0 for all i, and |Tb − T1| O_p(T^{1/2}N) if A_Hγ ≠ 0.   (A.3)

Now, consider a decomposition of (UU):

 tr[U′(P_T1 − P_Tb)U] = tr[U′(X_T1 − X_Tb)Δ_T(Δ_T X′_T1 X_T1 Δ_T)^{-1}Δ_T X′_T1 U]
  + tr[U′X_Tb Δ_T(Δ_T X′_Tb X_Tb Δ_T)^{-1}Δ_T(X′_Tb X_Tb − X′_T1 X_T1)Δ_T(Δ_T X′_T1 X_T1 Δ_T)^{-1}Δ_T X′_T1 U]
  + tr[U′X_Tb Δ_T(Δ_T X′_Tb X_Tb Δ_T)^{-1}Δ_T(X_T1 − X_Tb)′U] = R_1 + R_2 + R_3.   (A.4)

Write

 R_1 = tr[(Δ_T X′_T1 X_T1 Δ_T)^{-1}Δ_T X′_T1 UU′(X_T1 − X_Tb)Δ_T] = vec((Δ_T X′_T1 X_T1 Δ_T)^{-1})′ vec(Δ_T X′_T1 UU′(X_T1 − X_Tb)Δ_T),

where vec((Δ_T X′_T1 X_T1 Δ_T)^{-1}) = O_p(1) from Lemma A.1. Now,

 vec(Δ_T X′_T1 UU′(X_T1 − X_Tb)Δ_T) = vec(Δ_T X′_T1 FHH′F′(X_T1 − X_Tb)Δ_T) + vec(Δ_T X′_T1 FHE′(X_T1 − X_Tb)Δ_T)
  + vec(Δ_T X′_T1 EH′F′(X_T1 − X_Tb)Δ_T) + vec(Δ_T X′_T1 EE′(X_T1 − X_Tb)Δ_T) = r_11 + r_12 + r_13 + r_14.

If h_i = 0 for all i, then r_11, r_12, and r_13 are zeros. If not, HH′ = O(N) from Assumption 5. Also, from Lemmas A.2 and A.4, Δ_T X′_T1 F = O_p(1) and Δ_T X′_T1 EH′ = O_p(N^{1/2}). The first two columns of X_T1 − X_Tb are zeros and we only need to consider the third column, which is |Tb − T1| ι̃_b. From Lemmas A.2 and A.4, Δ_T(X_T1 − X_Tb)′F = |Tb − T1| O_p(T^{-1}) and Δ_T(X_T1 − X_Tb)′EH = |Tb − T1| O_p(T^{-1}N^{1/2}). Hence, r_11 = |Tb − T1| O_p(T^{-1}N), r_12 = |Tb − T1| O_p(T^{-1}N^{1/2}), and r_13 = |Tb − T1| O_p(T^{-1}N^{1/2}). For the term r_14,

 Δ_T X′_T1 EE′(X_T1 − X_Tb)Δ_T = Σ_{i=1}^N Δ_T X′_T1 E_i E_i′(X_T1 − X_Tb)Δ_T,

where Δ_T X′_T1 E_i E_i′(X_T1 − X_Tb)Δ_T has a non-zero mean and is independent across i. Hence r_14 = |Tb − T1| O_p(T^{-1}N). Collecting these terms, R_1 = |Tb − T1| O_p(T^{-1}N), and a similar argument shows that R_2 = |Tb − T1| O_p(T^{-1}N) and R_3 = |Tb − T1| O_p(T^{-1}N). Therefore, (UU) = |Tb − T1| O_p(T^{-1}N). This order of magnitude is an upper bound for the general case. However, suppose that h_i = 0 and d_i(1) = 1 for all i. Then,

 tr[U′(P_T1 − P_Tb)U] = tr[(P_T1 − P_Tb) Σ_{i=1}^N E_i E_i′] = Σ_{i=1}^N E_i′(P_T1 − P_Tb)E_i

and E[tr((P_T1 − P_Tb)E_i E_i′)] = 0. Hence, (UU) = |Tb − T1| O_p(T^{-1}N^{1/2}).

(ii) Consider Model II. The difference is that there is an extra regressor:

 (XX) = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)(X_T1 − X_Tb)Π] = tr[(θ′α′ + γ′κ′)(I − P_Tb)(αθ + κγ)]
  = κ′(I − P_Tb)κ (γγ′) + 2κ′(I − P_Tb)α (θγ′) + α′(I − P_Tb)α (θθ′),

where κ′(I − P_Tb)κ = |Tb − T1|³ O(1), κ′(I − P_Tb)α = |Tb − T1|² O(1), and α′(I − P_Tb)α = |Tb − T1| O(1). Depending on the order of magnitude of |Tb − T1|, the dominant term is either κ′(I − P_Tb)κ or α′(I − P_Tb)α. Hence we can write

 (XX) = |Tb − T1|³ O(N) + |Tb − T1| O(N).

Also, note that lim |Tb − T1|^{-3}κ′(I − P_Tb)κ > 0 and lim |Tb − T1|^{-1}α′(I − P_Tb)α > 0 as |Tb − T1| → ∞, so (XX) is |Tb − T1|³ O(N) or |Tb − T1| O(N) in its strict sense. For (XU), write

 (XU) = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)U] = tr[(θ′α′ + γ′κ′)(I − P_Tb)U].

It follows from Lemma A.2 that

 tr[θ′α′(I − P_Tb)U] = tr[θ′α′(I − P_Tb)FH] + tr[θ′α′(I − P_Tb)E] = |Tb − T1|^{1/2} O_p(N) + |Tb − T1|^{1/2} O_p(N^{1/2})

and

 tr[γ′κ′(I − P_Tb)U] = tr[γ′κ′(I − P_Tb)FH] + tr[γ′κ′(I − P_Tb)E] = |Tb − T1|^{3/2} O_p(N) + |Tb − T1|^{3/2} O_p(N^{1/2}).

Hence (XU) = |Tb − T1|^{3/2} O_p(N^{1/2}) + |Tb − T1|^{1/2} O_p(N^{1/2}) if h_i = 0 for all i, and |Tb − T1|^{3/2} O_p(N) + |Tb − T1|^{1/2} O_p(N) if A_Hγ ≠ 0. For (UU), consider the decomposition in (A.4). Now, X_T1 − X_Tb = [0, 0, α, |Tb − T1| ι̃_b], and it follows that Δ_T(X_T1 − X_Tb)′F = |Tb − T1|^{1/2} O_p(T^{-1/2}) and Δ_T(X_T1 − X_Tb)′EH = |Tb − T1|^{1/2} O_p(T^{-1/2}N^{1/2}). Using an approach similar to Model I, we obtain that (UU) = |Tb − T1|^{1/2} O_p(T^{-1/2}N^{1/2}) if h_i = 0 and d_i(L) = 1 for all i, and |Tb − T1|^{1/2} O_p(T^{-1/2}N) otherwise. □

A.2. Proof of Theorem 2

Theorem 2 follows from Lemma A.7 below, using the same argument as in the proof of Theorem 1.

Lemma A.7. Under Assumptions 1 and 3–5, we have for all generic Tb:
(i) In Model I,
 (XX) = |Tb − T1|² O(TN);
 (XU) = |Tb − T1| O_p(T^{3/2}N^{1/2}) if h_i = 0 for all i, and |Tb − T1| O_p(T^{3/2}N) if A_Hγ ≠ 0;
 (UU) = |Tb − T1| O_p(TN).
(ii) In Model II,
 (XX) = |Tb − T1|³ O(N);
 (XU) = |Tb − T1|² O_p(T^{1/2}N^{1/2}) if h_i = 0 for all i, and |Tb − T1|² O_p(T^{1/2}N) if A_Hγ ≠ 0;
 (UU) = |Tb − T1| O_p(TN).
Proof. (i) (XX) is the same as in Lemma A.6. The rest of the results follow from Lemma A.3, using an argument similar to the proof of Lemma A.6. □

A.3. Proof of Theorem 3

(i) Consider Model I with h_i = 0 for all i. Let m_T = N^{1/2}T^{1/2}(Tb − T1) and D(C) = {Tb : |Tb − T1| < CN^{-1/2}T^{-1/2}} for a positive number C. On the set D(C), (XX) = O_p(1), (XU) = O_p(1) and (UU) = O_p(T^{-3/2}N^{1/2}). Here the term (UU) is asymptotically negligible if N/T³ → 0, and

 (XX) = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)(X_T1 − X_Tb)Π] = (Tb − T1)² ι̃′_b(I − P_Tb)ι̃_b γγ′ = m_T² [T^{-1}ι̃′_b(I − P_Tb)ι̃_b][N^{-1}γγ′]

and

 (XU) = (Tb − T1) ι̃′_b(I − P_Tb)Eγ′ = m_T (TN)^{-1/2} ι̃′_b(I − P_Tb)Eγ′ + o_p(1).

Hence, on the set D(C),

 m*_T = arg min_{m_T on D(C)} [(XX) + 2(XU) + o_p(1)]
  = arg min_{m_T on D(C)} [m_T² (T^{-1}ι̃′_b(I − P_Tb)ι̃_b)(N^{-1}γγ′) + 2m_T (TN)^{-1/2} ι̃′_b(I − P_Tb)Eγ′ + o_p(1)].

Note that lim_T T^{-1}ι̃′_b(I − P_Tb)ι̃_b = (1 − λ_1)λ_1/4 and lim_N N^{-1}γγ′ = A_γγ. Recall that e_ti = d_i(1)ε_ti + ẽ_{t−1,i} − ẽ_ti with ẽ_ti = Σ_{k=0}^∞ d̄_ik ε_{t−k,i} and d̄_ik = Σ_{j=k+1}^∞ d_ij. Let ε_ti = σ_i η_ti with η_ti ∼ iid(0, 1), η_i = (η_{1i}, ..., η_{Ti})′ and Δẽ_i = (ẽ_{0i} − ẽ_{1i}, ..., ẽ_{T−1,i} − ẽ_{Ti})′. Then

 (TN)^{-1/2} ι̃′_b(I − P_Tb)Eγ′ = N^{-1/2} Σ_{i=1}^N C_i Q_{i,T} + N^{-1/2} Σ_{i=1}^N R_{i,T},   (A.5)

where

 C_i = d_i(1)σ_i γ_i,  Q_{i,T} = T^{-1/2} ι̃′_b(I − P_Tb)η_i,  R_{i,T} = T^{-1/2} ι̃′_b(I − P_Tb)Δẽ_i γ_i.

Q_{i,T} is iid(0, T^{-1}ι̃′_b(I − P_Tb)ι̃_b) across i for all T. Since Q_{i,T} = T^{-1/2}ι̃′_b(I − P_Tb)η_i ⇒ Q_i ∼ N(0, λ_1(1 − λ_1)/4), Q²_{i,T} converges in distribution to Q²_i by the continuous mapping theorem. Also, E(Q²_{i,T}) = T^{-1}ι̃′_b(I − P_Tb)ι̃_b → E(Q²_i) = (1 − λ_1)λ_1/4, which shows that Q²_{i,T} is uniformly integrable in T. Assumptions 2 and 4 imply that

 max_i C_i² / Σ_{i=1}^N C_i² ≤ N^{-1} [max_i d_i(1)²σ_i²γ_i²] / [N^{-1}γDΣ_ε Dγ′] ≤ N^{-1} max{γ_1², ..., γ_N²} M³ / [N^{-1}γDΣ_ε Dγ′] = O(N^{-1}).

Therefore, from the Joint Limit CLT (Theorem 3 in Phillips and Moon, 1999),

 N^{-1/2} Σ_{i=1}^N C_i Q_{i,T} →d N(0, [(1 − λ_1)λ_1/4] S_γγ)

as (N, T) → ∞. It remains to show that N^{-1/2} Σ_{i=1}^N R_{i,T} = o_p(1). First,

 ι̃′_b Δẽ_i/√T = |Tb − T1|^{-1}(ẽ_{T1,i} + ··· + ẽ_{Tb−1,i})/√T − ẽ_{T,i}/√T  if Tb > T1,
  = −|Tb − T1|^{-1}(ẽ_{Tb,i} + ··· + ẽ_{T1−1,i})/√T + ẽ_{T,i}/√T  if Tb < T1,

and (ẽ_{T1,i} − ẽ_{T,i})/√T if T1 = Tb. Since |Tb − T1| = o(1) relative to T on the set D(C), and since Assumptions 2 and 4 give Var(ẽ_{T1,i}γ_i) = σ_i²γ_i² Σ_{k=0}^∞ d̄²_ik ≤ max{γ_1², ..., γ_N²} M³, we have

 N^{-1/2} Σ_{i=1}^N T^{-1/2} ι̃′_b Δẽ_i γ_i = o_p(1),

and the same argument applies to (NT)^{-1/2} Σ_{i=1}^N ẽ_{T,i}γ_i. Similarly,

 N^{-1/2} Σ_{i=1}^N T^{-1/2} ι̃′_b P_Tb Δẽ_i γ_i = N^{-1/2} Σ_{i=1}^N T^{-1/2} ι̃′_b X_Tb Δ_T (Δ_T X′_Tb X_Tb Δ_T)^{-1} Δ_T X′_Tb Δẽ_i γ_i = O(1) N^{-1/2} Σ_{i=1}^N T^{-1/2} Δ_T X′_Tb Δẽ_i γ_i = o_p(1),

so N^{-1/2} Σ_{i=1}^N R_{i,T} = o_p(1). Therefore, as (N, T) → ∞,

 m*_T ⇒ N(0, 4S_γγ/((1 − λ_1)λ_1 A²_γγ)).

Since N^{1/2}T^{1/2}(T̂_1 − T_1) = O_p(1), we can choose C large enough that the probability that T̂_1 ∈ D(C) is arbitrarily close to one, and the statement in the theorem follows.

For (ii), let m_T = √T(Tb − T1) and D(C) = {Tb : |Tb − T1| < CT^{-1/2}} for a positive number C. On the set D(C), (UU)/N is asymptotically negligible,

 (XX)/N = m_T² [T^{-1}ι̃′_b(I − P_Tb)ι̃_b][N^{-1}γγ′]

and

 (XU)/N = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)U]/N = (Tb − T1) ι̃′_b(I − P_Tb)FHγ′/N + o_p(1) = m_T ξ_F A_Hγ + o_p(1),
where ξ_F =d N(0, λ_1(1 − λ_1)C(1)C(1)′/4) = O_p(1). Hence, using the same argument as before,

 m*_T ⇒ −[4/((1 − λ_1)λ_1)] A^{-1}_γγ ξ_F A_Hγ ∼ N(0, [4/((1 − λ_1)λ_1)] A^{-2}_γγ A′_Hγ C(1)C(1)′A_Hγ).

For (iii), consider Model II with A_Hγ ≠ 0. From Theorem 1, |T̂_1 − T_1| = O_p(1). Let m_T = Tb − T1 and D(C) = {Tb : |Tb − T1| < C} for a positive number C. On the set D(C),

 (XX)/N = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)(X_T1 − X_Tb)Π]/N = tr[Π′(X_T1 − X_Tb)′(X_T1 − X_Tb)Π]/N + o_p(1)
  = tr[(α, κ)′(α, κ) [θθ′ θγ′; γθ′ γγ′]]/N + o_p(1) = tr[(α, κ)′(α, κ) [A_θθ A_θγ; A_γθ A_γγ]] + o_p(1)
  = Σ_{t=Tb+1}^{T1} [A_θθ + A_γγ(t − T1)² + 2A_γθ(t − T1)] + o_p(1)  if Tb < T1,
  = Σ_{t=T1+1}^{Tb} [A_θθ + A_γγ(t − T1)² + 2A_γθ(t − T1)] + o_p(1)  if Tb > T1,

 (XU)/N = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)U]/N = tr[(θ′α′ + γ′κ′)(I − P_Tb)U]/N = tr[(θ′α′ + γ′κ′)U]/N + o_p(1)
  = tr[(α, κ)′FH(θ′, γ′)]/N + o_p(1) = (α, κ)′F(A_Hθ, A_Hγ)′ + o_p(1)
  = −Σ_{t=Tb+1}^{T1} [F_t′A_Hθ + (t − T1)F_t′A_Hγ] + o_p(1)  if Tb < T1,
  = Σ_{t=T1+1}^{Tb} [F_t′A_Hθ + (t − T1)F_t′A_Hγ] + o_p(1)  if Tb > T1,

and (UU)/N = o_p(1). Since F_t is strictly stationary under Assumption 2, define S*(m) such that S*(0) = 0, S*(m) = S_1(m) for m < 0 and S*(m) = S_2(m) for m > 0, with

 S_1(m) = Σ_{k=m+1}^{0} (A_θθ + A_γγ k² + 2A_γθ k) − 2 Σ_{k=m+1}^{0} (F_k′A_Hθ + kF_k′A_Hγ),  m = −1, −2, ...
 S_2(m) = Σ_{k=1}^{m} (A_θθ + A_γγ k² + 2A_γθ k) + 2 Σ_{k=1}^{m} (F_k′A_Hθ + kF_k′A_Hγ),  m = 1, 2, ....

Then, on the set D(C),

 m*_T = arg min_{m_T on D(C)} {(XX) + 2(XU)}/N + o_p(1) = arg min_{m_T on D(C)} S*(m) + o_p(1).

The limiting distribution of m*_T is arbitrarily close to that of |T̂_1 − T_1| for large C. □

A.4. Proof of Theorem 4

Consider Model II under Assumption 6 with h_i = 0 and d_i(L) = 1 for all i. Lemma A.6 implies that |T̂_1 − T_1| = O_p(1). Let m_T = Tb − T1 and D(C) = {Tb : |Tb − T1| < C} for a positive number C. On the set D(C),

 (XX) = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)(X_T1 − X_Tb)Π] = tr[Π′(X_T1 − X_Tb)′(X_T1 − X_Tb)Π] + o_p(1)
  = tr[(α, κ)′(α, κ) [Ȧ_θθ Ȧ_θγ; Ȧ_γθ Ȧ_γγ]] + o_p(1)
  = Σ_{t=Tb+1}^{T1} [Ȧ_θθ + Ȧ_γγ(t − T1)² + 2Ȧ_γθ(t − T1)] + o_p(1)  if Tb < T1,
  = Σ_{t=T1+1}^{Tb} [Ȧ_θθ + Ȧ_γγ(t − T1)² + 2Ȧ_γθ(t − T1)] + o_p(1)  if Tb > T1,

 (XU) = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)U] = tr[(θ′α′ + γ′κ′)(I − P_Tb)U] = tr[(θ′α′ + γ′κ′)U] + o_p(1) = α′Eθ′ + κ′Eγ′ + o_p(1)
  = −Σ_{t=Tb+1}^{T1} [N^{-1/2}Σ_{i=1}^N e_ti θ̇_i + (t − T1)N^{-1/2}Σ_{i=1}^N e_ti γ̇_i] + o_p(1)  if Tb < T1,
  = Σ_{t=T1+1}^{Tb} [N^{-1/2}Σ_{i=1}^N e_ti θ̇_i + (t − T1)N^{-1/2}Σ_{i=1}^N e_ti γ̇_i] + o_p(1)  if Tb > T1,

and (UU) = o_p(1) if N²/T → 0. Define Ξ_1 = lim_N N^{-1} Σ_{i=1}^N θ̇_i² Γ_i and Ξ_2 = lim_N N^{-1} Σ_{i=1}^N γ̇_i² Γ_i, where Γ_i is a 2C × 2C matrix whose (p, q) element is the autocovariance of e_ti at lag p − q. Also, define N_1 = (N_{−C+1,1}, ..., N_{C,1})′ and N_2 = (N_{−C+1,2}, ..., N_{C,2})′ such that N_1 = B_1 W and N_2 = B_2 W, where Ξ_1 = B_1 B_1′, Ξ_2 = B_2 B_2′ and W ∼ N(0, I_{2C}). Then, from the Lindeberg–Feller CLT and the Cramér–Wold device,

 N^{-1/2} Σ_{i=1}^N (e_{Tb−C+1,i}, ..., e_{Tb+C,i})′ θ̇_i →d N_1  and  N^{-1/2} Σ_{i=1}^N (e_{Tb−C+1,i}, ..., e_{Tb+C,i})′ γ̇_i →d N_2.
Hence, on the set D(C), if N²/T → 0,

 m*_T = arg min_{m_T on D(C)} {(XX) + 2(XU)} + o_p(1) = arg min_{m_T on D(C)} V*(m) + o_p(1),

where the stochastic process V*(m) is such that V*(0) = 0, V*(m) = V_1(m) for m < 0 and V*(m) = V_2(m) for m > 0, with

 V_1(m) = Σ_{k=m+1}^{0} (Ȧ_θθ + Ȧ_γγ k² + 2Ȧ_γθ k) − 2 Σ_{k=m+1}^{0} (N_{1,k} + kN_{2,k}),  m = −1, −2, ...
 V_2(m) = Σ_{k=1}^{m} (Ȧ_θθ + Ȧ_γγ k² + 2Ȧ_γθ k) + 2 Σ_{k=1}^{m} (N_{1,k} + kN_{2,k}),  m = 1, 2, ....

The limiting distribution of m*_T is arbitrarily close to that of T(λ̂ − λ_1) for large C. □

A.5. Proof of Theorem 5

Consider first (ii), Model I with A_Hγ ≠ 0. Let m_T = (Tb − T1)/√T and D(C) = {Tb : |Tb − T1| < C√T} for a positive number C. On the set D(C), (XX) = O_p(T²N), (XU) = O_p(T²N) and (UU) = O_p(T^{3/2}N), and we can ignore the term (UU), since it is of strictly smaller order of magnitude. Then,

 m*_T = arg min_{m_T on D(C)} {(XX) + 2(XU)}/(T²N) + o_p(1).

Also, note that

 (XX)/(T²N) = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)(X_T1 − X_Tb)Π]/(T²N) = (Tb − T1)² ι̃′_b(I − P_Tb)ι̃_b γγ′/(T²N)
  = m_T² [T^{-1}ι̃′_b(I − P_Tb)ι̃_b] A_γγ + o_p(1) = m_T² [(1 − λ_1)λ_1/4] A_γγ + o_p(1)

and

 (XU)/(T²N) = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)U]/(T²N) = (Tb − T1) ι̃′_b(I − P_Tb)FHγ′/(T²N) + o_p(1)
  = m_T ∫_{λ_1}^1 W*_F(r)′ dr C(1)′A_Hγ + o_p(1),

where W*_F(r) is the residual function from a continuous time least squares regression of W_F(r) on f(r, λ_1). Hence, on the set D(C),

 m*_T = arg min_{m_T on D(C)} [m_T² ((1 − λ_1)λ_1/4) A_γγ + 2m_T ∫_{λ_1}^1 W*_F(r)′ dr C(1)′A_Hγ] + o_p(1)

and

 m*_T ⇒ −[4/((1 − λ_1)λ_1)] A^{-1}_γγ ∫_{λ_1}^1 W*_F(r)′ dr C(1)′A_Hγ ∼ N(0, [2/(15A²_γγ)] A′_Hγ C(1)C(1)′A_Hγ),

using the fact that

 ∫_{λ_1}^1 W*_F(r)′ dr C(1)′A_Hγ ∼ N(0, [λ_1²(1 − λ_1)²/120] A′_Hγ C(1)C(1)′A_Hγ).

Since √T(λ̂ − λ_1) is arbitrarily close to m*_T, the assertion in the theorem follows.

For (i), let m_T = N^{1/2}(Tb − T1)/√T and D(C) = {Tb : |Tb − T1| < CN^{-1/2}√T} for a positive number C. On the set D(C), (XX) = O_p(T²), (XU) = O_p(T²) and (UU) = O_p(T^{3/2}N^{1/2}). Here the term (UU) is O_p(T²) if N/T → κ ≥ 0. Hence,

 m*_T = arg min_{m_T on D(C)} {(XX) + 2(XU) + (UU)}/T².

The first term is such that

 (XX)/T² = m_T² [(1 − λ_1)λ_1/4] A_γγ + o_p(1).

The second term is

 (XU)/T² = (Tb − T1) ι̃′_b(I − P_Tb)Eγ′/T² = m_T N^{-1/2} Σ_{i=1}^N T^{-3/2} ι̃′_b(I − P_Tb)E_i γ_i.

From the Joint Limit CLT,

 N^{-1/2} Σ_{i=1}^N T^{-3/2} ι̃′_b(I − P_Tb)E_i γ_i →d N(0, [λ_1²(1 − λ_1)²/120] S_γγ).   (A.6)

This result follows similarly to Theorem 3(i). Recall that (1 − L)e_ti = d_i(1)ε_ti + ẽ_{t−1,i} − ẽ_ti with ẽ_ti = Σ_{k=0}^∞ d̄_ik ε_{t−k,i} and d̄_ik = Σ_{j=k+1}^∞ d_ij, so that e_ti = d_i(1) Σ_{j=1}^t ε_ji + ẽ_{0i} − ẽ_ti. Let ε_ti = σ_i η_ti with η_ti ∼ iid(0, 1), η_i = (η_{1i}, ..., η_{Ti})′, and let G be a T × T lower triangular matrix of ones. Then

 N^{-1/2} Σ_{i=1}^N T^{-3/2} ι̃′_b(I − P_Tb)E_i γ_i = N^{-1/2} Σ_{i=1}^N C_i Q_{i,T} + o_p(1),

where C_i = d_i(1)σ_i γ_i and Q_{i,T} = T^{-3/2} ι̃′_b(I − P_Tb)Gη_i. Q_{i,T} is iid(0, T^{-3}ι̃′_b(I − P_Tb)GG′(I − P_Tb)ι̃_b) across i for all T. Since Q_{i,T} ⇒ Q_i ∼ N(0, λ_1²(1 − λ_1)²/120), Q²_{i,T} converges in distribution to Q²_i by the continuous mapping theorem. Also, E(Q²_{i,T}) = T^{-3}ι̃′_b(I − P_Tb)GG′(I − P_Tb)ι̃_b → E(Q²_i) = λ_1²(1 − λ_1)²/120, which shows that Q²_{i,T} is uniformly integrable in T. Assumptions 3 and 4 imply that max_i C_i²/Σ_i C_i² = O(N^{-1}). This proves (A.6) from the Joint Limit CLT (Theorem 3 in Phillips and Moon, 1999). For (UU), consider the decomposition in (A.4):

 tr[U′(P_T1 − P_Tb)U] = R_1 + R_2 + R_3.
Note that, as shown in Perron and Zhu (2005),

 Δ_T(X′_Tb X_Tb − X′_T1 X_T1)Δ_T = −(Tb − T1)T^{-1}Σ_f + o(1)

with

 Σ_f = [ 0            0              1 − λ_1
         0            0              (1 − λ_1²)/2
         1 − λ_1      (1 − λ_1²)/2   (1 − λ_1)² ].

Similarly as before, since h_i = 0 for all i,

 Δ_T X′_T1 UU′(X_T1 − X_Tb)Δ_T = Δ_T X′_T1 EE′(X_T1 − X_Tb)Δ_T,

where

 [1/((Tb − T1)TN)] Δ_T X′_T1 EE′(X_T1 − X_Tb)Δ_T = N^{-1} Σ_{i=1}^N [1/((Tb − T1)T)] Δ_T X′_T1 E_i E_i′(X_T1 − X_Tb)Δ_T
  = N^{-1} Σ_{i=1}^N [d_i(1)²σ_i²/((Tb − T1)T)] Δ_T X′_T1 Gη_i η_i′G′(X_T1 − X_Tb)Δ_T + o_p(1)
  = N^{-1} Σ_{i=1}^N C_i Q_{i,T} + o_p(1),   (A.7)

with C_i = d_i(1)²σ_i² and Q_{i,T} = [1/((Tb − T1)T)] Δ_T X′_T1 Gη_i η_i′G′(X_T1 − X_Tb)Δ_T. Q_{i,T} is iid over i for all T, and

 Q_{i,T} ⇒ Q_i = [∫_0^1 f(r, λ_1)W_i(r)dr] (0, 0, ∫_{λ_1}^1 W_i(r)dr)

as T → ∞, with

 E(Q_i) = Σ_e = [ 0  0  (2 − 3λ_1² + λ_1³)/6
                  0  0  (5 + λ_1⁴ − 6λ_1²)/24
                  0  0  (−7λ_1⁴ + 16λ_1³ − 6λ_1² − 8λ_1 + 5)/24 ].

To show the uniform integrability of Q_{i,T}, define L_{1,i,T} = (Tb − T1)^{-1} Δ_T(X_T1 − X_Tb)′Gη_i and L_{2,i,T} = T^{-1} Δ_T X′_T1 Gη_i; then Q_{i,T} = L_{2,i,T} L′_{1,i,T}. Note that ‖L_{j,i,T}‖², j = 1, 2, are uniformly integrable in T because ‖L_{j,i,T}‖² ⇒ ‖L_{ji}‖² and E‖L_{j,i,T}‖² → E‖L_{ji}‖² as T → ∞, where

 L_{1i} = (0, 0, ∫_{λ_1}^1 W_i(r)′dr)′  and  L_{2i} = ∫_0^1 f(r, λ_1)W_i(r)dr.

This also implies that sup_T E‖Q_{i,T}‖ < ∞. Now, let the event A(M) = {‖Q_{i,T}‖ > M}. Then,

 sup_T E[‖Q_{i,T}‖ 1_{A(M)}] ≤ sup_T (E[‖L_{1,i,T}‖² 1_{A(M)}])^{1/2} (E[‖L_{2,i,T}‖² 1_{A(M)}])^{1/2} → 0 as M → ∞,

because, for j = 1, 2,

 sup_T E[‖L_{j,i,T}‖² 1_{A(M)}] ≤ sup_T E[‖L_{j,i,T}‖² 1_{A(M)} 1_{B(√M)}] + sup_T E[‖L_{j,i,T}‖² 1_{A(M)} 1_{B^c(√M)}]
  ≤ sup_T E[‖L_{j,i,T}‖² 1_{B(√M)}] + √M sup_T P(A(M)) ≤ sup_T E[‖L_{j,i,T}‖² 1_{B(√M)}] + sup_T E‖Q_{i,T}‖/√M → 0 as M → ∞,

where the event B(√M) = {‖L_{j,i,T}‖² > √M}. Since Q_{i,T} is uniformly integrable, we can apply the Joint Probability Limits Theorem (Corollary 1, Phillips and Moon, 1999) to (A.7), and thus as (T, N) → ∞,

 N^{-1} Σ_{i=1}^N C_i Q_{i,T} →p σ̄_d² E(Q_i) = σ̄_d² Σ_e,

where σ̄_d² = lim_N N^{-1} Σ_{i=1}^N d_i(1)²σ_i². Also,

 [1/(T²N)] Δ_T X′_T1 EE′ X_Tb Δ_T →p σ̄_d² E[∫_0^1 f(r, λ_1)W_i(r)dr ∫_0^1 f(r, λ_1)′W_i(r)dr] = σ̄_d² Σ_d,

where

 Σ_d = [ 1/3                              5/24                              (−λ_1⁴ + 4λ_1³ − 8λ_1 + 5)/24
         5/24                             2/15                              (−λ_1⁵ + 10λ_1³ − 25λ_1 + 16)/120
         (−λ_1⁴ + 4λ_1³ − 8λ_1 + 5)/24    (−λ_1⁵ + 10λ_1³ − 25λ_1 + 16)/120 (7λ_1⁵ − 20λ_1⁴ + 10λ_1³ + 20λ_1² − 25λ_1 + 8)/60 ].

Collecting terms shows that, as (T, N) → ∞,

 (UU)/T² = m_T √κ σ̄_d² tr[2Σ_a^{-1}Σ_e − Σ_a^{-1}Σ_f Σ_a^{-1}Σ_d] + o_p(1) = m_T √κ σ̄_d² (2λ_1 − 1)/10.

Therefore,

 m*_T = arg min_{m_T on D(C)} {(XX) + 2(XU) + (UU)}/T²
  = arg min_{m_T on D(C)} [m_T² ((1 − λ_1)λ_1/4) A_γγ + 2m_T N(√κ σ̄_d² (2λ_1 − 1)/20, (λ_1²(1 − λ_1)²/120) S_γγ) + o_p(1)]

and

 m*_T ⇒ −[4/((1 − λ_1)λ_1)] A^{-1}_γγ N(√κ σ̄_d² (2λ_1 − 1)/20, (λ_1²(1 − λ_1)²/120) S_γγ)
  ∼ N(√κ (σ̄_d²/A_γγ)(1 − 2λ_1)/(5(1 − λ_1)λ_1), 2S_γγ/(15A²_γγ)).
For (iii), consider Model II with A_Hγ ≠ 0. Let m_T = (Tb − T1)/√T and D(C) = {Tb : |Tb − T1| < C√T} for a positive number C. On the set D(C), (XX) = O_p(T^{3/2}N), (XU) = O_p(T^{3/2}N) and (UU) = O_p(T^{3/2}N); hence no term is negligible, and

 m*_T = arg min_{m_T on D(C)} {(XX) + 2(XU) + (UU)}/(T^{3/2}N).

Note that, with κ as defined in the proof of Lemma A.7,

 (XX)/(T^{3/2}N) = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)(X_T1 − X_Tb)Π]/(T^{3/2}N) = κ′(I − P_Tb)κ γγ′/(T^{3/2}N) + o_p(1)
  = (κ′κ/T^{3/2})(γγ′/N) + o_p(1) = [|Tb − T1|³/(3T^{3/2})](γγ′/N) + o_p(1) = (|m_T|³/3) A_γγ + o_p(1),

using the fact that κ′X_T1 Δ_T = |Tb − T1|² O_p(T^{-1/2}). Under Assumption 5,

 (XU)/(T^{3/2}N) = tr[Π′(X_T1 − X_Tb)′(I − P_Tb)U]/(T^{3/2}N) = κ′(I − P_Tb)Uγ′/(T^{3/2}N) + o_p(1)

and

 κ′Uγ′/(T^{3/2}N) = T^{-3/2}κ′F(Hγ′/N) + o_p(1)
  = (m_T²/2) W_F(λ_1)C(1)′(Hγ′/N) + o_p(1)  if Tb ≥ T1,
  = −(m_T²/2) W_F(λ_1)C(1)′(Hγ′/N) + o_p(1)  otherwise,

while

 T^{-1/2}κ′X_T1 Δ_T = (m_T²/2)(1, λ_1, 0, 0) if Tb ≥ T1,  and  (m_T²/2)(1, λ_1, 1, 0) otherwise.

Combining these, we obtain

 (XU)/(T^{3/2}N) = (m_T²/2)[W_F(λ_1) − (1, λ_1, 0, 0)Ω_1 ∫_0^1 f_2(r, λ)W_F(r)′dr] C(1)′A_Hγ + o_p(1)  if Tb ≥ T1,
  = (m_T²/2)[−W_F(λ_1) − (1, λ_1, 1, 0)Ω_1 ∫_0^1 f_2(r, λ)W_F(r)′dr] C(1)′A_Hγ + o_p(1)  otherwise,

which reduces to

 (m_T²/2) ∫_0^{λ_1} [(3r − 2rλ_1)/λ_1²] dW_F(r)′ C(1)′A_Hγ + o_p(1) = (m_T²/2) ξ_3 C(1)′A_Hγ + o_p(1)  if Tb ≥ T1,
 (m_T²/2) ∫_{λ_1}^1 [(r − 1)(3r − 2λ_1 − 1)/(1 − λ_1)²] dW_F(r)′ C(1)′A_Hγ + o_p(1) = (m_T²/2) ξ_4 C(1)′A_Hγ + o_p(1)  otherwise.

For (UU), consider the decomposition in (A.4) again. Note that

 Δ_T(X′_Tb X_Tb − X′_T1 X_T1)Δ_T = −(Tb − T1)T^{-1}Σ_f

with Σ_f as given in Theorem 5. From Lemma A.3,

 T^{-1}Δ_T X′_Tb FH ⇒ ∫_0^1 f_2(r, λ)W_F(r)′dr C(1)′H ≡ ξ_11 C(1)′H

with f_2(r, λ) = (1, r, 1(r ≥ λ_1), (r − λ)⁺)′, and

 (Tb − T1)^{-1}Δ_T(X_T1 − X_Tb)′FH ⇒ [0, 0, W_F(λ_1)′, ∫_{λ_1}^1 W_F(r)′dr]′ C(1)′H ≡ ξ_21 C(1)′H.

Also,

 Δ_T X′_T1 UU′(X_T1 − X_Tb)Δ_T = Δ_T X′_T1 (FH + E)(FH + E)′(X_T1 − X_Tb)Δ_T
  = Δ_T X′_T1 FHH′F′(X_T1 − X_Tb)Δ_T + Δ_T X′_T1 EE′(X_T1 − X_Tb)Δ_T + o_p(T^{3/2}N)

and

 [1/((Tb − T1)TN)] Δ_T X′_T1 EE′(X_T1 − X_Tb)Δ_T = N^{-1} Σ_{i=1}^N [1/((Tb − T1)T)] Δ_T X′_T1 E_i E_i′(X_T1 − X_Tb)Δ_T
  = N^{-1} Σ_{i=1}^N C_i Q_{i,T} + o_p(1),   (A.8)

where C_i = d_i(1)²σ_i² and Q_{i,T} = [1/((Tb − T1)T)] Δ_T X′_T1 Gη_i η_i′G′(X_T1 − X_Tb)Δ_T. Q_{i,T} is iid over i for all T, and

 Q_{i,T} ⇒ Q_i = [∫_0^1 f_2(r, λ_1)W_i(r)dr] (0, 0, W_i(λ_1), ∫_{λ_1}^1 W_i(r)dr)

as T → ∞. Since Q_{i,T} is uniformly integrable, we can apply the Joint Probability Limits Theorem (Corollary 1, Phillips and Moon, 1999) to (A.8), and thus as (T, N) → ∞,

 N^{-1} Σ_{i=1}^N C_i Q_{i,T} →p σ̄_d² E(Q_i) = σ̄_d² Σ_e,

where σ̄_d² = lim_N N^{-1} Σ_{i=1}^N d_i(1)²σ_i² and
 E(Q_i) = Σ_e = [ 0  0  (2λ_1 − λ_1²)/2           (2 − 3λ_1² + λ_1³)/6
                  0  0  (3λ_1 − λ_1³)/6           (5 + λ_1⁴ − 6λ_1²)/24
                  0  0  λ_1(1 − λ_1)              (1 − 3λ_1² + 2λ_1³)/3
                  0  0  (λ_1 − 2λ_1² + λ_1³)/2    (−7λ_1⁴ + 16λ_1³ − 6λ_1² − 8λ_1 + 5)/24 ].

Also,

 [1/(T²N)] Δ_T X′_T1 EE′ X_Tb Δ_T →p σ̄_d² E[∫_0^1 f_2(r, λ_1)W_i(r)dr ∫_0^1 f_2(r, λ_1)′W_i(r)dr] = σ̄_d² Σ_d,

where

 Σ_d = [ 1/3                            5/24                             (2 − 3λ_1² + λ_1³)/6            (−λ_1⁴ + 4λ_1³ − 8λ_1 + 5)/24
         5/24                           2/15                             (5 + λ_1⁴ − 6λ_1²)/24           (−λ_1⁵ + 10λ_1³ − 25λ_1 + 16)/120
         (2 − 3λ_1² + λ_1³)/6           (5 + λ_1⁴ − 6λ_1²)/24            (1 − 3λ_1² + 2λ_1³)/3           (−7λ_1⁴ + 16λ_1³ − 6λ_1² − 8λ_1 + 5)/24
         (−λ_1⁴ + 4λ_1³ − 8λ_1 + 5)/24  (−λ_1⁵ + 10λ_1³ − 25λ_1 + 16)/120 (−7λ_1⁴ + 16λ_1³ − 6λ_1² − 8λ_1 + 5)/24  (7λ_1⁵ − 20λ_1⁴ + 10λ_1³ + 20λ_1² − 25λ_1 + 8)/60 ].

Collecting terms shows that, as (T, N) → ∞,

 (UU)/(T^{3/2}N) = m_T tr[2Ω_1(ξ_11 C(1)′A_HH C(1)ξ_21′ + σ̄_d² Σ_e) − Ω_1 Σ_f Ω_1(ξ_11 C(1)′A_HH C(1)ξ_11′ + σ̄_d² Σ_d)] + o_p(1)
  = m_T tr[2Ω_1 ξ_11 C(1)′A_HH C(1)ξ_21′ − Ω_1 Σ_f Ω_1 ξ_11 C(1)′A_HH C(1)ξ_11′] + m_T σ̄_d² · 2(2λ_1² − 46λ_1 + 45)/(15λ_1) + o_p(1).

Therefore, the result in the theorem follows. □
References

Andrews, D.W.K., 1991. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817–858.
Bai, J., 1997. Estimation of a change point in multiple regression models. Review of Economics and Statistics 79, 551–563.
Bai, J., 2010. Common breaks in means and variances for panel data. Journal of Econometrics 157, 78–92.
Bai, J., Lumsdaine, R.L., Stock, J.H., 1998. Testing for and dating breaks in multivariate time series. Review of Economic Studies 65, 395–432.
Bai, J., Perron, P., 1998. Estimating and testing linear models with multiple structural changes. Econometrica 66, 47–78.
Bai, J., Ng, S., 2004. A panic attack on unit roots and cointegration. Econometrica 72, 1127–1177.
Carrion-i-Silvestre, J.L., Kim, D., Perron, P., 2009. GLS-based unit root tests with multiple structural breaks under both the null and alternative hypotheses. Econometric Theory 25, 1754–1792.
Clark, T.E., 2006. Disaggregate evidence on the persistence of consumer price inflation. Journal of Applied Econometrics 21, 563–587.
De Wachter, S., Tzavalis, E., 2004. Detection of structural breaks in linear dynamic panel data models. QM University of London Working Paper 505.
De Wachter, S., Tzavalis, E., 2005. Monte Carlo comparison of model and moment selection and classical inference approaches to break detection in panel data models. Economics Letters 88, 91–96.
Emerson, J., Kao, C., 2000. Testing for structural change of a time trend regression in panel data. Manuscript, Syracuse University.
Kao, C., Trapani, L., Urga, G., 2007. Modelling and testing for structural breaks in panels with common and idiosyncratic stochastic trends. Center for Policy Research Working Paper No. 92, Maxwell School of Citizenship and Public Affairs, Syracuse University.
Kim, D., 2010. Common local breaks in time trends for large panel data. Manuscript, Department of Economics, University of Virginia.
Perron, P., Zhu, X., 2005. Structural breaks with deterministic and stochastic trends. Journal of Econometrics 129, 65–119.
Phillips, P.C.B., Moon, H.R., 1999. Linear regression limit theory for nonstationary panel data. Econometrica 67, 1057–1111.
Phillips, P.C.B., Solo, V., 1992. Asymptotics for linear processes. Annals of Statistics 20 (2), 971–1001.
Pivetta, F., Reis, R., 2007. The persistence of inflation in the United States. Journal of Economic Dynamics and Control 31 (4), 1326–1358.
Qu, Z., Perron, P., 2007. Estimating and testing structural changes in multivariate regressions. Econometrica 75, 459–502.
Journal of Econometrics 164 (2011) 331–344
Testing and detecting jumps based on a discretely observed process Yingying Fan a,∗ , Jianqing Fan b,1 a
Information and Operations Management Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, United States
b
Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, United States
article
info
Article history: Received 29 January 2010 Received in revised form 5 June 2011 Accepted 12 June 2011 Available online 12 July 2011 JEL classification: C12 C14
abstract We propose a new nonparametric test for detecting the presence of jumps in asset prices using discretely observed data. Compared with the test in Aït-Sahalia and Jacod (2009), our new test enjoys the same asymptotic properties but has smaller variance. These results are justified both theoretically and numerically. We also propose a new procedure to locate the jumps. The jump identification problem reduces to a multiple comparison problem. We employ the false discovery rate approach to control the probability of type I error. Numerical studies further demonstrate the power of our new method. © 2011 Elsevier B.V. All rights reserved.
Keywords: Jump diffusion process Test for jumps High frequency Stable convergence False discovery rate
1. Introduction The discontinuities of asset prices, so-called ‘‘jumps’’, play important roles in pricing and managing risks of many financial instruments such as asset returns, option prices, interest rates, and exchange rates. Recently, researchers have proved the existence of jumps and studied their financial implications in the literature both empirically and theoretically. See, for example, Merton (1976), Duffie et al. (2000), Pan (2002), Johannes (2004) and Andersen et al. (2007). The wide availability of high-frequency data for a host array of financial instruments makes it feasible to accurately detect the locations of jumps with little time delay. The interest in testing and identifying jumps has surged recently. For example, Aït-Sahalia (2004) introduces methods to separate jumps from diffusion. Jiang and Oomen (2008) propose a test statistic that measures the impact of jumps on the third- and higher-order return moments and is directly related to the profit/loss function of a variance swap replication strategy. Barndorff-Nielsen and Shephard (2006) introduce a test statistic based on the bipower
∗ Corresponding author. Tel.: +1 213 740 9916; fax: +1 213 740 7313.
E-mail addresses: [email protected] (Y. Fan), [email protected] (J. Fan).
1 Tel.: +1 609 258 7924.
doi:10.1016/j.jeconom.2011.06.014
variation of the asset price, which is consistent and asymptotically normal with mean zero under the null hypothesis of no jumps, and which converges in probability to some negative number depending on the jump sizes under the alternative hypothesis. A nonparametric test statistic was proposed by Aït-Sahalia and Jacod (2009), which converges to two different deterministic numbers that are independent of the dynamics of the diffusion process, depending on whether the sample path has or does not have jumps. Fan and Wang (2007) develop wavelet methods to estimate jump locations and jump sizes from a jump-diffusion process that is discretely observed with market microstructure noise. Lee and Mykland (2008) introduce and study a nonparametric test to detect jump arrival times up to the intra-day level. Their test statistic not only detects the presence of jumps but also gives estimates of the realized jump sizes in asset prices. Jacod and Todorov (2009) consider a bivariate asset price process and propose tests to decide whether these processes have common jumps or not using discretely observed data on a finite time interval. Other tests include Carr and Wu (2003), Mancini (2003) and Johannes et al. (2004a,b).

The asset price X_t is assumed to follow an Itô semimartingale process and is observed at discrete time points t_i = iΔ, i = 0, 1, ..., n. In this paper, we consider high-frequency data; i.e., we assume that Δ → 0 as n → ∞ and that T = nΔ is a fixed positive number. To simplify the notation, we suppress the dependence of Δ on n when it causes no confusion. Aït-Sahalia and Jacod (2009)
propose a nonparametric test statistic of the form

$$S_n = \frac{\sum_{i=1}^{[n/K]} |X_{iK\Delta} - X_{(i-1)K\Delta}|^p}{\sum_{i=1}^{n} |X_{i\Delta} - X_{(i-1)\Delta}|^p},$$
where K is a positive integer, p > 3, and [z] denotes the integer part of z. This is the ratio of power variations at two time scales (Zhang et al., 2005). The intuition behind the test statistic S_n is that if there is a jump in the time interval ((i−1)Δ, iΔ], then the magnitude of the increment ΔX_i = X_{iΔ} − X_{(i−1)Δ} is large and independent of the sampling interval Δ, whereas the magnitude of ΔX_i is small and depends on Δ when there is no jump in the interval. A high power of the increment ΔX_i further separates the magnitudes of |ΔX_i|^p in the previous two cases. Since the increments containing jumps are much larger than those that do not, their contribution to the summation dominates all other terms. As a result, S_n behaves very differently when the sample path on the time interval [0, T] has jumps from the case when there is no jump. In fact, S_n converges to 1 when a jump is present, and it converges to K^{p/2−1} in the absence of jumps, as formally stated in (7). This limiting result holds for any Itô semimartingale X_t with no need to estimate the model parameters, and thus it is genuinely nonparametric. It serves as the basis for separating jumps from diffusion.

Aït-Sahalia and Jacod (2009) establish asymptotic distribution theorems for their test statistic. From their result, it can easily be seen that the asymptotic variance of S_n increases rapidly with both K and p. The inflation of the variance reduces the power of the test. Reducing the variance of the test statistic will undoubtedly increase the power of the test, which is particularly important when the sample size is not large. This is exactly the case when the test statistic is applied to a small window of data to detect whether there is any jump in that window.

We proceed with an idea of variance reduction and propose a new test statistic. Note that the numerator of the test statistic S_n uses only the subsequence {X_{iKΔ} : i = 0, 1, ..., [n/K]}. For each ℓ = 1, ..., K, we can construct a similar test statistic S_{n,ℓ} whose numerator uses the data points {X_{(ℓ−1+iK)Δ} : i = 0, 1, ..., [n/K] − 1}, resulting in test statistics S_{n,ℓ}, ℓ = 1, ..., K, that have the same asymptotic distribution. Therefore, a proper linear combination of them can reduce the variance while leaving the mean unchanged, and thus gives a more powerful test statistic. This approach requires the study of the joint behavior of these test statistics under the assumption of jumps. Rooted in our analytical studies, a new test statistic, which is the average of those K test statistics, is proposed. We show that this new test statistic is the optimal one among all linear combinations of S_{n,ℓ} in terms of variance reduction.

In addition to detecting the presence of jumps, this paper also contributes to detecting the locations of jumps using the newly proposed test statistic. Our main idea is to first divide the whole time interval [0, T] into many non-overlapping subintervals of equal length 2a_nΔ, with a_n → ∞ as n → ∞ and a_nΔ fixed, and then to apply the new test statistic to each subinterval. This reduces the problem of jump identification to a multiple comparison problem. Using the False Discovery Rate (FDR) approach, one can decide which subintervals contain jumps at a given level. Therefore, the jump locations can be identified within an accuracy of 2a_nΔ. In the identified jump intervals, we can further locate jumps by comparing the magnitudes of increments.
This FDR approach can be applied to many tests such as the ones proposed by Barndorff-Nielsen and Shephard (2006) and Jiang and Oomen (2008), and thus allows us to use these tests to locate the jumps. For this reason, our proposed method is very general. The rest of the paper is organized as follows. Section 2 introduces the model and assumption. In Section 3, we construct
the new test and present its asymptotic properties. Section 4 constructs the critical region and studies its asymptotic size and power. A new testing procedure to locate the jumps is proposed in Section 5. In Sections 6 and 7, we present the simulation studies and a real data application, respectively. Section 8 provides some discussions. All proofs are presented in the Appendix.

2. Model and assumption

Consider a one-dimensional asset price process X_t on the probability space (Ω, F, (F_t)_{t≥0}, P). We are interested in testing for jumps of the process X_t over the time interval [0, T]. In this paper, we assume that X_t is an Itô semimartingale that can be represented as

$$X_t = X_0 + \int_0^t b_s \, ds + \int_0^t \sigma_s \, dW_s + \int_0^t \!\! \int_E \kappa \circ \delta(s, x) \, (\mu - \nu)(ds, dx) + \int_0^t \!\! \int_E \kappa' \circ \delta(s, x) \, \mu(ds, dx), \tag{1}$$
where W_t is an F_t-adapted standard Brownian motion and µ is a Poisson random measure on [0, t] × E, with (E, ℰ) an auxiliary measurable space on the probability space (Ω, F, (F_t)_{t≥0}, P). The intensity measure of µ is ν(ds, dx) = ds ⊗ λ(dx), with λ some finite or σ-finite measure on (E, ℰ). The function δ(ω, t, x) is predictable, the function κ is continuous with compact support satisfying κ(x) = x in a neighborhood of 0, and κ′(x) = x − κ(x). We further assume that b_t and σ_t are F_t-measurable, with σ_t being another Itô semimartingale:
$$\sigma_t = \sigma_0 + \int_0^t \tilde b_s \, ds + \int_0^t \tilde\sigma_s \, dW_s + \int_0^t \tilde\sigma'_s \, dW'_s + \int_0^t \!\! \int_E \kappa \circ \tilde\delta(s, x) \, (\mu - \nu)(ds, dx) + \int_0^t \!\! \int_E \kappa' \circ \tilde\delta(s, x) \, \mu(ds, dx), \tag{2}$$
where W′_t is another standard Brownian motion independent of W_t and µ, and δ̃ is a predictable function. This model was also considered in Aït-Sahalia and Jacod (2009).

Let ΔX_t = X_t − X_{t−} be the jump size of the process X at time t. Clearly, ΔX_t = 0 when there is no jump at time t, and |ΔX_t| > 0 whenever there is a jump at time t. Define τ = inf{s ∈ [0, T] : ΔX_s ≠ 0} to be the first jump time on the time interval [0, T]. Then we usually, but not necessarily, have τ = 0 almost surely when the jump activity is infinite, and we have τ > 0 almost surely when the jump activity is finite.

We introduce some notation to facilitate the presentation. Let

$$B(p)_t = \sum_{s \le t} |\Delta X_s|^p \quad \text{and} \quad A(p)_t = \int_0^t |\sigma_s|^p \, ds.$$

Suppose on the time interval [0, t] we have discrete observations at t_i = iΔ for i = 0, 1, ..., [t/Δ]. We define

$$B(p, \Delta)_t = \sum_{j=1}^{[t/\Delta]} |X_{j\Delta} - X_{(j-1)\Delta}|^p. \tag{3}$$

For each ℓ = 1, ..., K, let

$$B(p, K\Delta)_{\ell, t} = \sum_{j=1}^{[t/(K\Delta)] - 1} |X_{(\ell-1+jK)\Delta} - X_{(\ell-1+(j-1)K)\Delta}|^p.$$

Then both B(p, Δ)_t and B(p, KΔ)_{ℓ,t} estimate B(p)_t.
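As a concrete illustration of these power variations, the following sketch computes B(p, Δ)_t and B(p, KΔ)_{ℓ,t} from an array of discrete observations X_0, X_Δ, X_{2Δ}, ...; the function names and the NumPy implementation are our own illustration, not the authors' code, and edge counts may differ from the paper's upper limits by one term.

```python
import numpy as np

def B_delta(X, p):
    # B(p, Delta)_t: power variation of all one-step increments
    return np.sum(np.abs(np.diff(X)) ** p)

def B_K_delta(X, p, K, ell):
    # B(p, K*Delta)_{ell,t}: power variation of the K-spaced subsequence
    # starting at index ell - 1, for ell = 1, ..., K
    return np.sum(np.abs(np.diff(X[ell - 1::K])) ** p)
```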
Jacod (2008) proves the following convergence results:
$$p > 2 \;\Longrightarrow\; B(p, \Delta)_t \xrightarrow{P} B(p)_t; \qquad X \text{ is continuous} \;\Longrightarrow\; \frac{\Delta^{1-p/2}}{m_p} B(p, \Delta)_t \xrightarrow{P} A(p)_t, \tag{4}$$
where m_p = E|Z|^p and Z is a standard Gaussian random variable. The different behavior of the statistic B(p, Δ)_t, depending on whether the process X has jumps or not, gives the theoretical foundation for jump detection. We define

$$\delta'_t(\omega) = \begin{cases} \int_E \kappa \circ \delta(\omega, t, x) \, \lambda(dx) & \text{if the integral exists,} \\ \infty & \text{otherwise.} \end{cases}$$
Before presenting the main results, we make the following assumption, which is similar to that in Aït-Sahalia and Jacod (2009).

Assumption 1.
(a) All paths t → b_t are locally bounded.
(b) All paths t → b̃_t, t → σ̃_t and t → σ̃′_t are right-continuous with left limits.
(c) All paths t → δ(ω, t, x) and t → δ̃(ω, t, x) are left-continuous with right limits.
(d) All paths t → sup_{x∈E} |δ(ω, t, x)|/γ(x) and t → sup_{x∈E} |δ̃(ω, t, x)|/γ(x) are locally bounded, where γ is a deterministic nonnegative function satisfying ∫_E [γ(x)² ∨ 1] λ(dx) < ∞.
(e) All paths t → δ′_t(ω) are left-continuous with right limits on [0, τ(ω)).
(f) ∫_0^t |σ_u| du > 0 a.s. for any t > 0.

Throughout the paper we consider a stochastic process X_t over a fixed time interval [0, T]. Thus, an application of the localization procedure shows that if any of the theorems to be presented later hold under the assumption

$$|b_t| + |\sigma_t| + |\tilde b_t| + |\tilde\sigma_t| + |\tilde\sigma'_t| \le M, \quad |\delta(t, x)| \le \gamma(x), \quad |\tilde\delta(t, x)| \le \gamma(x), \quad \text{and} \quad \gamma(x) \le M \tag{5}$$

for some positive constant M, then they hold under Assumption 1 as well. Therefore, (5) is implicitly assumed to be true by Assumption 1.
3. Test statistics

To simplify the notation, we will drop the dependence of the test statistic on t whenever there is no confusion. For instance, we write B(p, KΔ)_ℓ for B(p, KΔ)_{ℓ,t}. Based on the convergence results in (4), Aït-Sahalia and Jacod (2009) propose the following statistic to test for jumps:

$$\hat S(p, K)_\Delta = \frac{B(p, K\Delta)_1}{B(p, \Delta)}. \tag{6}$$

This test statistic behaves differently for sample paths without jumps from those with jumps. In fact, they proved that Ŝ(p, K)_Δ converges in probability to the variable S(p, K) given by

$$S(p, K, \omega) = \begin{cases} 1 & \text{on the event } \Omega_t^j, \\ K^{p/2-1} & \text{on the event } \Omega_t^c, \end{cases} \tag{7}$$

where Ω_t^j = {ω : X_s(ω) is discontinuous on [0, t]} and Ω_t^c = {ω : X_s(ω) is continuous on [0, t]}. Note that the event Ω_t^j means that the model has jumps, whereas the event Ω_t^c does not mean that the model is continuous. In fact, Ω_t^c could also correspond to the case where the model has jumps but none fall in the time interval [0, t]. The test statistic Ŝ(p, K)_Δ enjoys nice properties: it is invariant under the scaling of X_t, and its limiting behavior is independent of the dynamics of the process X_t.

Aït-Sahalia and Jacod (2009) also derived the asymptotic distribution of Ŝ(p, K)_Δ. The asymptotic mean of Ŝ(p, K)_Δ is 1 when jumps are present and is K^{p/2−1} when there are no jumps. Its asymptotic variance is a complicated function of K and p that increases with both K and p. This indicates that although larger K and larger p can separate the asymptotic means further apart, the improvement would very likely be masked by the inflation of the asymptotic variance. Therefore, it is important to reduce the dependence of the asymptotic variance on K or p, which motivated our work.

For a given K ≥ 2, there are K test statistics of the similar form:

$$\hat S(p, K)_\ell = \frac{B(p, K\Delta)_\ell}{B(p, \Delta)}, \qquad \ell = 1, \ldots, K. \tag{8}$$

Thanks to the similarity of their mathematical forms, they have the same asymptotic distribution. Intuitively, for finite K, {Ŝ(p, K)_ℓ}_{ℓ=1}^K should have asymptotic correlation 1. Contrary to this intuition, the asymptotic correlation is less than 1, as formally presented in Theorems 1 and 2. This suggests that the test statistic defined in (6) can be improved by linearly combining the K test statistics defined in (8). We thus propose a new test statistic

$$\hat S(p, K, \omega) = \sum_{\ell=1}^{K} \omega_\ell \hat S(p, K)_\ell,$$

where ω = (ω_1, ..., ω_K) is a weight vector satisfying Σ_{ℓ=1}^K ω_ℓ = 1. It is critical to derive the asymptotic joint distribution of the K test statistics Ŝ(p, K)_ℓ, ℓ = 1, ..., K. A similar technique was used by Aït-Sahalia et al. (2005) with p = 2, but there it was mainly used for the reduction of measurement errors. Define

$$D(p) = \sum_{s \le t} |\Delta X_s|^p \left(\sigma_{s-}^2 + \sigma_s^2\right).$$

The following theorem gives the asymptotic joint distribution of B(p, Δ) and Σ_{ℓ=1}^K ω_ℓ B(p, KΔ)_ℓ when jumps are present in the sample path of X_t.

Theorem 1. Assume that Assumption 1 holds, Δ → 0, and p > 3. Then the pair of variables

$$\Delta^{-1/2}\left(B(p, \Delta) - B(p), \; \sum_{\ell=1}^{K} \omega_\ell B(p, K\Delta)_\ell - B(p)\right)$$

converges stably in law to a bivariate random vector (Z(p), Z(p) + Σ_{ℓ=1}^K ω_ℓ Z′(p, K, ℓ)), where Z(p) and Z′(p, K, ℓ), given by (29) in Appendix A.2, are defined on an extension (Ω̃, F̃, (F̃_t)_{t≥0}, P̃) of the original filtered space (Ω, F, (F_t)_{t≥0}, P) and have mean zero conditional on F. Conditional on F, Z(p) and Z′(p, K, ℓ) are independent, and the Z′(p, K, ℓ) have the following conditional variances and covariances:

$$E\left[Z'(p, K, \ell_1) Z'(p, K, \ell_2) \mid \mathcal{F}\right] = p^2 \left[\frac{K-1}{2} - \frac{1}{K}\left(K - |\ell_2 - \ell_1|\right)|\ell_2 - \ell_1|\right] D(2p-2). \tag{9}$$

Furthermore, if the processes X and σ have no common jumps, Z(p) and Z′(p, K, ℓ) are F-conditionally Gaussian.

Although Theorem 1 does not exclude the situation when X_t is continuous, both B(p) and D(p) are zero in the absence of jumps. Thus we only use Theorem 1 to derive the asymptotic distribution of Ŝ(p, K, ω) under the assumption of jumps. Since the asymptotic distributions are derived for arbitrary linear combinations, the above results also give the asymptotic joint distribution of the K + 1 random variables B(p, Δ), B(p, KΔ)_1, ..., B(p, KΔ)_K. In view of (9), the conditional correlation between Z′(p, K, ℓ1) and
Z′(p, K, ℓ2) can be small if |ℓ1 − ℓ2| is close to K/2. In particular, when K = 2, the conditional correlation between Z′(p, K, 1) and Z′(p, K, 2) is zero. Therefore, by choosing ω = (1/2, 1/2), compared with Ŝ(p, K)_Δ we reduce the asymptotic variance by a factor of 1/2. For general K, with a proper choice of ω, the reduction of the variance of the new test statistic Ŝ(p, K, ω) can also be significant.

Corollary 1. Assume that the conditions of Theorem 1 hold. Then, conditional on Ω^j, Δ^{−1/2}(Ŝ(p, K, ω) − 1) converges stably in law to a random variable S(p, K, ω)^j which, conditional on F, has mean zero and variance

$$E\left[(S(p, K, \omega)^j)^2 \mid \mathcal{F}\right] = p^2 \left[\frac{K-1}{2} - \sum_{i,j=1}^{K} \omega_i \omega_j \left(1 - K^{-1}|i-j|\right)|i-j|\right] \frac{D(2p-2)}{B(p)^2}. \tag{10}$$

Moreover, if the processes X and σ have no common jumps, then conditional on F, S(p, K, ω)^j has a Gaussian distribution.

We would like to remark that although in models (1) and (2) the processes X and σ are driven by the same Poisson random measure µ, the jump behaviors of X and σ can be very different and even independent if the functions δ and δ̃ are chosen appropriately. This was also pointed out by Jacod (2007, Page 20). In fact, if δ and δ̃ are chosen in a way such that X and σ have independent jump behaviors, then with probability 1, X and σ have no common jumps. Thus, it is not restrictive to exclude the common-jump case in Theorem 1 and Corollary 1.

Notice that under the assumption of jumps, the optimal weight vector ω_opt in terms of minimizing the variance of S(p, K, ω)^j is the solution to the following quadratic optimization problem:

$$\arg\max_{\omega} \; \sum_{i,j=1}^{K} \omega_i \omega_j \left(1 - K^{-1}|i-j|\right)|i-j| \quad \text{subject to} \quad \sum_{i=1}^{K} \omega_i = 1.$$

It is not hard to show that its solution is of equal weights due to the exchangeability of the weights, i.e., ω_opt = K^{−1}1. The corresponding variance in (10) is

$$\frac{(2K-1)(K-1)p^2}{6K} \cdot \frac{D(2p-2)}{B(p)^2}. \tag{11}$$

The optimal choice of K in terms of variance reduction is clearly attained at K = 2.

Deriving the asymptotic joint distribution of Ŝ(p, K)_ℓ, ℓ = 1, ..., K, when X_t is continuous is nontrivial. We can show that the optimal weight vector ω under the assumption of no jumps coincides with the optimal weight vector ω_opt = K^{−1}1 under the assumption of jumps. This is clear once we observe that, due to the similarity of the definitions of Ŝ(p, K)_ℓ, ℓ = 1, ..., K, the asymptotic covariance of Ŝ(p, K)_{ℓ1} and Ŝ(p, K)_{ℓ2} depends only on |ℓ1 − ℓ2| for 1 ≤ ℓ1, ℓ2 ≤ K. Therefore, minimizing the asymptotic variance of Ŝ(p, K, ω) yields ω_opt = K^{−1}1. Hereinafter, we denote Ŝ(p, K) = Ŝ(p, K, ω_opt). The following theorem characterizes the asymptotic joint distribution of K^{−1} Σ_{ℓ=1}^K B(p, KΔ)_ℓ and B(p, Δ) when X is continuous.

Theorem 2. Under Assumption 1, if Δ → 0, p ≥ 2, and X is continuous, then the pair of variables

$$\Delta^{-1/2}\left(\Delta^{1-p/2} B(p, \Delta) - m_p A(p), \;\; \Delta^{1-p/2} K^{-1} \sum_{\ell=1}^{K} B(p, K\Delta)_\ell - K^{p/2-1} m_p A(p)\right)$$

converges stably in law to a bivariate random vector (Y(p), Y′(p)), which is defined on an extension (Ω̃, F̃, (F̃_t)_{t≥0}, P̃) of the original filtered space (Ω, F, (F_t)_{t≥0}, P). Conditional on F, (Y(p), Y′(p)) is Gaussian with mean zero and variance-covariance

$$E(Y(p)^2 \mid \mathcal{F}) = (m_{2p} - m_p^2) A(2p), \quad E((Y'(p))^2 \mid \mathcal{F}) = K^{-2} v(p, K) A(2p), \quad E(Y(p) Y'(p) \mid \mathcal{F}) = \tilde v(p, K) A(2p),$$

where ṽ(p, K) = cov(|U_1|^p, |U_1 + √(K−1) U_2|^p) and

$$v(p, K) = K^p (m_{2p} - m_p^2) + 2 \sum_{\ell=1}^{K-1} \operatorname{cov}\left(\left|\sqrt{K-\ell}\, U_1 + \sqrt{\ell}\, U_2\right|^p, \; \left|\sqrt{\ell}\, U_2 + \sqrt{K-\ell}\, U_3\right|^p\right),$$

with m_p = E|U_1|^p and U_1, U_2, and U_3 being independent standard Gaussian random variables.

Corollary 2 is a direct consequence of Theorem 2. It formally states the asymptotic distribution of Ŝ(p, K) when X is continuous.

Corollary 2. Assume that the conditions of Theorem 2 hold. Then, Δ^{−1/2}(Ŝ(p, K) − K^{p/2−1}) converges stably in law to a random variable S(p, K)^c which, conditional on F, is Gaussian with mean zero and variance

$$E\left[(S(p, K)^c)^2 \mid \mathcal{F}\right] = M(p, K) \frac{A(2p)}{A(p)^2}, \tag{12}$$

where M(p, K) = (m_p^2 K)^{−1} [K^{−1} v(p, K) − 2K^{p/2} ṽ(p, K) + K^{p−1}(m_{2p} − m_p^2)], and v(p, K), ṽ(p, K), and m_p are given in Theorem 2. The test statistic Ŝ(p, K)_Δ in Aït-Sahalia and Jacod (2009) has asymptotic variance

$$M^*(p, K) \frac{A(2p)}{A(p)^2} \tag{13}$$

with M^*(p, K) being some constant depending only on p and K when X_t is continuous, and it has asymptotic variance

$$\frac{(K-1)p^2}{2} \cdot \frac{D(2p-2)}{B(p)^2} \tag{14}$$

when there are jumps. Comparing (11) with (14), we see that when the sample path contains jumps, the best linear combination reduces the variance by a factor of (K + 1)/(3K). In particular, when K = 2, the variance reduction is by a factor of 1/2. Under the assumption of no jumps, the factor of variance reduction is given by 1 − M(p, K)/M^*(p, K). Fig. 1 depicts the ratio M(p, K)/M^*(p, K) for p = 4 and 5. Hereinafter, we focus on the following test statistic:

$$\hat S(p, K) = \hat S(p, K, \omega_{\mathrm{opt}}). \tag{15}$$
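To illustrate (15), here is a minimal sketch of the equal-weight statistic Ŝ(p, K) next to the single-offset statistic Ŝ(p, K)_Δ of Aït-Sahalia and Jacod (2009); it reuses the hypothetical helpers B_delta and B_K_delta from the sketch in Section 2, and the function names are ours.

```python
def S_AJ(X, p=4, K=2):
    # S(p, K)_Delta: two-scale power-variation ratio with offset ell = 1 only
    return B_K_delta(X, p, K, 1) / B_delta(X, p)

def S_FF(X, p=4, K=2):
    # S(p, K) = S(p, K, w_opt): equal-weight average over the K offsets
    denom = B_delta(X, p)
    return np.mean([B_K_delta(X, p, K, ell) for ell in range(1, K + 1)]) / denom
```

Values near 1 point to a jump in the sample path; values near K^{p/2−1} (= 2 for p = 4 and K = 2) point to a continuous path.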
4. Testing for jumps

4.1. Estimation of asymptotic variance

The asymptotic variance of Ŝ(p, K) is a function of D(p), A(p), and B(p), which are unknown and need to be estimated. The same estimators as those in Aït-Sahalia and Jacod (2009) are used in this paper. Specifically, the estimator for D(p) is defined as

$$\hat D(p) = \frac{1}{m_n \Delta} \sum_{i=1}^{n} |\Delta_i^n X|^p \sum_{j \in I_{n,t}(i)} (\Delta_j^n X)^2 \, 1_{\{|\Delta_j^n X| \le \alpha \Delta^{\gamma}\}},$$

where Δ_i^n X = X_{iΔ} − X_{(i−1)Δ}, α > 0, γ ∈ (0, 1/2), m_n is the local window size for estimating σ, and I_{n,t}(i) = {j ∈ ℕ : j ≠ i, 1 ≤ j ≤ [t/Δ], |i − j| ≤ m_n} is a local window of length 2m_nΔ → 0 around iΔ. A realized truncated p-th variation is used to estimate A(p); that is, for α > 0 and γ ∈ (0, 1/2), the estimator is defined as

$$\hat A(p)_t = \frac{\Delta^{1-p/2}}{m_p} \sum_{i=1}^{[t/\Delta]} |\Delta_i^n X|^p \, 1_{\{|\Delta_i^n X| \le \alpha \Delta^{\gamma}\}}.$$

The estimator B(p, Δ) defined in (3) is used to estimate B(p). It has been proved by Jacod (2008) and Aït-Sahalia and Jacod (2009) that the above three estimators are consistent. Let

$$\hat V_j = \frac{(2K-1)(K-1)p^2 \Delta}{6K} \cdot \frac{\hat D(2p-2)}{B(p, \Delta)^2} \quad \text{and} \quad \hat V_c = \Delta M(p, K) \frac{\hat A(2p)}{\hat A(p)^2}. \tag{16}$$

Fig. 1. Plot of M(p, K)/M*(p, K) against K for p = 4 (solid curve) and p = 5 (dashed curve).

The following corollary, which is a consequence of Theorems 1 and 2 and the property of stable convergence, gives the asymptotic distribution of the standardized test statistic.

Corollary 3. Assume that Assumption 1 holds and Δ → 0. We have:
(a) If p > 3, then restricted on the set Ω^j the random variable

$$(\hat V_j)^{-1/2} (\hat S(p, K) - 1) \tag{17}$$

converges stably in law to a random variable which, conditional on F, has mean 0 and variance 1, and which is standard Gaussian provided that the processes X and σ have no common jumps.
(b) If X is continuous and p ≥ 2, then the random variable

$$(\hat V_c)^{-1/2} (\hat S(p, K) - K^{p/2-1}) \tag{18}$$

converges stably in law to a random variable that is standard Gaussian conditional on F.

The beauty of the results in Corollary 3 is that they translate composite hypotheses asymptotically into two simple hypotheses. When there are jumps in the sample path, Ŝ(p, K) asymptotically has conditional distribution N(1, V̂_j), whereas when the model is continuous, Ŝ(p, K) asymptotically has conditional distribution N(K^{p/2−1}, V̂_c). Hence, the Neyman-Pearson lemma can be used to test for jumps.

4.2. Testing existence of jumps

Throughout this section, we consider p > 3. We aim at testing the existence of jumps in a fixed time interval [0, T] using the available observations X_{iΔ}, i = 0, 1, ..., n, with T = nΔ. Thus the null and alternative hypotheses are, respectively,

H0: There is no jump in the interval [0, T];
H1: There are jumps in the interval [0, T].

Note that the sets under which the null and alternative hypotheses hold are Ω_T^c and Ω_T^j, which are subsets of Ω instead of subsets of some parameter space. For a critical value x of the above hypothesis testing problem, the type I error is given by α_n(x) = P(Ŝ(p, K) ≤ x | H0), and the power function is β_n(x) = P(Ŝ(p, K) ≤ x | H1). We have the following asymptotic theorem for α_n(x) and β_n(x).

Theorem 3. Assume that Assumption 1 holds and the critical value x ∈ (1, K^{p/2−1}). Then, we have:
(a) α_n(x) → 0; that is, the critical region {Ŝ(p, K) ≤ x} has an asymptotic size 0.
(b) Let P(Ω_T^j) > 0. Then, the power function satisfies β_n(x) → 1 as n → ∞.

Similarly, we can show that the above asymptotic results hold for the type I error and power function if the roles of the null and alternative hypotheses are switched.

Our new test Ŝ(p, K) is asymptotically more powerful than the test Ŝ(p, K)_Δ of Aït-Sahalia and Jacod (2009). To understand this, recall that if we want to compare the power of tests, we fix their sizes at the same level. We add subscripts "FF" and "AJ" to V̂_c and V̂_j to denote the estimated asymptotic variances of Ŝ(p, K) and Ŝ(p, K)_Δ under H0 and H1, respectively. Then for any critical value x, the size of Ŝ(p, K) is approximately Φ((x − K^{p/2−1})/(V̂_{FF}^c)^{1/2}), and the size of the Aït-Sahalia and Jacod test is approximately Φ((x − K^{p/2−1})/(V̂_{AJ}^c)^{1/2}), where Φ(x) is the cumulative distribution function of the standard Gaussian. If the critical value of our test Ŝ(p, K) is x_FF ∈ (1, K^{p/2−1}), then to make the Aït-Sahalia and Jacod test have approximately the same size, the critical value of Ŝ(p, K)_Δ should be approximately

$$x_{AJ} = K^{p/2-1} + \left(\frac{\hat V_{AJ}^c}{\hat V_{FF}^c}\right)^{1/2} (x_{FF} - K^{p/2-1}). \tag{19}$$

Since we have proved in Corollary 1 that V̂_{AJ}^c > V̂_{FF}^c, it follows easily that x_AJ < x_FF. Let Z be a standard Gaussian random variable. Then the corresponding power of our new test is

$$P(\hat S(p, K) \le x_{FF} \mid \Omega_t^j) \approx P\left(Z \le \frac{x_{FF} - 1}{(\hat V_{FF}^j)^{1/2}}\right),$$

and the corresponding power of the Aït-Sahalia and Jacod test is

$$P(\hat S(p, K)_\Delta \le x_{AJ} \mid \Omega_t^j) \approx P\left(Z \le \frac{x_{AJ} - 1}{(\hat V_{AJ}^j)^{1/2}}\right).$$

In view of Corollary 2, we have V̂_{FF}^j < V̂_{AJ}^j. This inequality together with x_FF > x_AJ ensures that (x_FF − 1)/(V̂_{FF}^j)^{1/2} > (x_AJ − 1)/(V̂_{AJ}^j)^{1/2}, which indicates that our test is asymptotically more powerful than the Aït-Sahalia and Jacod test.

Although Theorem 3 shows that asymptotically any constant between 1 and K^{p/2−1} can serve as a critical value and yield
asymptotic size 0 and asymptotic power 1, the finite-sample performance of the test based on these critical values can be very different. Next, we discuss how to choose the critical value in practice. By Corollary 3, conditional on F, the null and alternative distributions of Ŝ(p, K) are both asymptotically Gaussian. Thus, asymptotically the likelihood ratio test rejects H0 whenever

$$\Lambda = \frac{\varphi_0(\hat S(p, K))}{\varphi_1(\hat S(p, K))} \tag{20}$$

is small, where ϕ_0(·) and ϕ_1(·) are the conditional asymptotic densities of Ŝ(p, K) under H0 and H1, respectively. So the critical value can be chosen using classical likelihood ratio test theory. Another way to choose the critical value is to use the asymptotic normality of the test statistic Ŝ(p, K) under the null hypothesis, as in Aït-Sahalia and Jacod (2009).

An alternative approach is to minimize the sum of the probabilities of type I and type II errors. Unlike other scientific hypothesis testing problems, we do not have a strong preference here in distinguishing the null and alternative hypotheses. This is particularly the case when the test procedure is applied to detecting the location of possible jumps, as will be done in the next section. In this case, the critical value is the minimizer of α_n(x) + (1 − β_n(x)). By Corollary 3, asymptotically the critical value minimizes the function

$$g(x) = \Phi\left(\frac{x - K^{p/2-1}}{\sqrt{\hat V_c}}\right) + \Phi\left(\frac{1 - x}{\sqrt{\hat V_j}}\right).$$

It can be shown that there exists a unique minimizer x_0 of g(x), where x_0 ∈ (1, K^{p/2−1}) and solves the equation ϕ_0(x) = ϕ_1(x), with ϕ_0 and ϕ_1 defined in (20). Denote the corresponding rejection region by

$$R = \{\hat S(p, K) \le x_0\}. \tag{21}$$
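For a numerical illustration, x_0 can be found by solving ϕ_0(x) = ϕ_1(x) on (1, K^{p/2−1}) once the two variance estimates V̂_c and V̂_j are available. The following root-finding sketch, with our own function and variable names, assumes these estimates are supplied.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def critical_value(p, K, V_c, V_j):
    # phi0: asymptotic null density N(K^{p/2-1}, V_c); phi1: alternative N(1, V_j)
    m0 = K ** (p / 2.0 - 1.0)
    f = lambda x: (norm.pdf(x, loc=m0, scale=np.sqrt(V_c))
                   - norm.pdf(x, loc=1.0, scale=np.sqrt(V_j)))
    # phi0 - phi1 is negative near 1 and positive near K^{p/2-1},
    # so a sign change exists on the open interval
    return brentq(f, 1.0 + 1e-10, m0 - 1e-10)
```

The rejection region (21) is then {Ŝ(p, K) ≤ x_0}.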
In the simulation study and real data analysis of this paper, we will use the rejection region defined in (21) when testing whether a fixed time interval has jumps or not.

5. Detecting jump locations

We have discussed how to test whether a fixed time interval contains jumps. Naturally, the next question is to locate these jumps if there are any. Throughout this section, we assume finite jump activity; that is, there are only finitely many jumps in [0, T].

5.1. Test statistic as a process

Our idea for locating jumps is to apply our test to many small local time intervals in this period and then decide whether these time intervals contain jumps or not. Let the width of the local time interval be fixed at 2W. For each t ∈ [W, T − W], apply our new test statistic to the data in the time interval [t − W, t + W], resulting in the value Ŝ(p, K)_t. This defines a process Ŝ(p, K)_t over the time interval [W, T − W]. Suppose the number of jumps N_J in [0, T] is finite. Denote the successive jump times by J_1, J_2, ..., J_{N_J}. Then for any t ∈ (J_i − W, J_i + W), the interval [t − W, t + W], over which Ŝ(p, K)_t is defined, contains the i-th jump. Define
$$A_j = \bigcup_{i=1}^{N_J} (J_i - W, J_i + W), \tag{22}$$

which is the union of a finite number of time intervals with equal length 2W. Let E = {J_i − W, i = 1, ..., N_J} be the union of left endpoints, and define A_c = [W, T − W] \ (A_j ∪ E). Corollaries 1 and 2 give the behavior of our test statistic Ŝ(p, K)_t at a fixed time point t: for t ∈ A_j, it should be approximately 1, and for t ∈ A_c, it should be approximately K^{p/2−1}.

Fig. 2. Test statistics Ŝ(p, K)_t and Ŝ(p, K)_{1,t}, with p = 4 and K = 2, as processes of time. The price process X_t is generated from model (25), with Δ = 1 min and J_s = 0.05. There are 4 jumps in total, located at time points 1115, 4474, 6512, and 6941. (a) X_t in a one-month period; (b) Ŝ(p, K)_{1,t} with local window size a_n = 90; (c) Ŝ(p, K)_t with local window size a_n = 90.

Fig. 2 shows the test statistic Ŝ(p, K)_t as a process of time for a simulated sample path, detailed in Section 7. In this simulation, we generate the asset price process X_t at time points iΔ, i = 0, 1, ..., n, and calculate the test statistic Ŝ(p, K)_{iΔ} for each i ∈ {[W/Δ], ..., n − [W/Δ]} using data in the local window [iΔ − W, iΔ + W]. It is clear that the process Ŝ(p, K)_{iΔ} hovers around two values. The standard deviation of Ŝ(p, K)_{iΔ} in the state of no jumps is much larger than that in the state with jumps. It is also clear that our new method is less volatile than the test statistic in Aït-Sahalia and Jacod (2009), which is consistent with our theory. From Fig. 2(c), it is clear that there are four jumps in the process, as there are four flat intervals with values around 1. The midpoints of these intervals are our estimated jump locations. Even though the process Ŝ(p, K)_t at several other locations has a value close to one, those are not jump points, as a jump point implies that the process should be around one over an interval of width approximately 2W.

5.2. Locating jumps using the FDR approach

Given observations X_{iΔ}, i = 0, 1, ..., n, over a fixed time interval [0, T], one can first construct the test statistic Ŝ(p, K)_T using all observations and then test whether there are jumps in [0, T] using the rejection region (21). If the hypothesis of jumps cannot be rejected, one further uses the procedure described in the previous section to locate the jumps. In practical applications, one can apply the procedure using non-overlapping local intervals. More specifically, we divide the time interval [0, T] into non-overlapping subintervals, each of length 2a_nΔ, and construct the test statistic Ŝ(p, K)_{(2i−1)a_nΔ} using data in the interval [(2i−2)a_nΔ, 2ia_nΔ] for i = 1, 2, ..., [n/(2a_n)]. Then, the sequence of test statistics {Ŝ(p, K)_{(2i−1)a_nΔ}, i = 1, 2, ..., [n/(2a_n)]} is expected to have normal distributions with different means and standard deviations on intervals with and without jumps. Thus, locating jumps is equivalent to a multiple comparison problem. Since Ŝ(p, K)_{(2i−1)a_nΔ} has a smaller variance (see Fig. 1) when there are jumps in [2(i−1)a_nΔ, 2ia_nΔ], the null hypotheses for this multiple-comparison problem are chosen to be

H_{0i}: There are jumps in the interval [2(i−1)a_nΔ, 2ia_nΔ],
i = 1, ..., [n/(2a_n)]. For each H_{0i}, the corresponding test statistic is Ŝ(p, K)_{(2i−1)a_nΔ}, whose null distribution is asymptotically Gaussian with mean 1 and variance
$$\frac{(2K-1)(K-1)p^2 \Delta}{6K} \cdot \frac{D(2p-2)_{2ia_n\Delta} - D(2p-2)_{2(i-1)a_n\Delta}}{\left(B(p)_{2ia_n\Delta} - B(p)_{2(i-1)a_n\Delta}\right)^2}$$

by Theorem 1. It is well known that controlling the size of each individual null hypothesis results in low power in the multiple-comparison problem, especially when there are many hypotheses. Thus, using the rejection point given in (21) is not realistic here. Therefore, we propose to use the adaptive control of the False Discovery Rate (FDR) procedure of Benjamini and Hochberg (1995) to control the type I error. The false discovery rate is defined as E(V/R), where R is the total number of hypotheses rejected and V is the number of hypotheses rejected by mistake. The following procedure is proposed by Benjamini and Hochberg (1995) to control the FDR (a sketch in code follows below):

(1) Specify an allowable false discovery rate α.
(2) Estimate the total number of true null hypotheses, i.e., the number of intervals with jumps N_J^*, and denote the estimated value by N̂_J^*.
(3) Calculate the p-value P_i corresponding to the null hypothesis H_{0i}. Rank the [n/(2a_n)] p-values from low to high: P_{(1)} ≤ P_{(2)} ≤ ... ≤ P_{([n/(2a_n)])}.
(4) Let k be the largest i for which P_{(i)} ≤ iα/N̂_J^*. Reject all H_{0i}, i = 1, ..., k.

In practice, N_J^* is unknown and needs to be estimated. We propose to estimate N_J^* as

$$\hat N_J^* = \frac{1}{1-c} \sum_{i=1}^{[n/(2a_n)]} 1(P_i > c), \tag{23}$$

where c ∈ (0, 1) is some constant, and 1(P_i > c) is an indicator function taking the value 1 if P_i > c and the value 0 if P_i ≤ c. To understand why this is an estimator of N_J^*, note that among the [n/(2a_n)] hypotheses to be tested, the P-values of the true nulls are uniformly distributed over the interval [0, 1]. For large c, it is reasonable to assume that all P-values falling in [c, 1] are contributed by the true nulls. Then, theoretically, a P-value falls in [c, 1] with probability (1 − c)N_J^*/[n/(2a_n)]. On the other hand, empirically, the estimated fraction of P-values in the interval [c, 1] is Σ_{i=1}^{[n/(2a_n)]} 1(P_i > c)/[n/(2a_n)]. Matching the theoretical value with the empirical value yields the estimator proposed above. In practical implementations, we find that the results are not sensitive to c as long as c is not very close to 0.
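The following sketch combines the step-up rule (1)-(4) with the plug-in estimate (23). The one-sided p-value convention (large Ŝ favors the no-jump alternative, in view of Corollary 3) and all names are our own illustration, not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def fdr_jump_intervals(S_stats, V_j, alpha=0.05, c=0.05):
    # H0i: interval i contains a jump, so S_i ~ N(1, V_j[i]) under the null
    S_stats, V_j = np.asarray(S_stats, float), np.asarray(V_j, float)
    pvals = 1.0 - norm.cdf((S_stats - 1.0) / np.sqrt(V_j))
    m = len(pvals)
    n_jump_hat = max(np.sum(pvals > c) / (1.0 - c), 1.0)   # estimator (23)
    order = np.argsort(pvals)
    # step (4): largest k with P_(k) <= k * alpha / N_hat, then reject H01..H0k
    passed = np.nonzero(pvals[order] <= alpha * np.arange(1, m + 1) / n_jump_hat)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size > 0:
        reject[order[:passed[-1] + 1]] = True
    return ~reject   # True = interval kept as containing a jump
```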
6. Simulation studies

Throughout this section, the power p is fixed at 4, as it is the smallest integer greater than 3 (required by Theorem 1). To simulate the stochastic diffusion process, the Euler scheme is employed. We discard the burn-in period, i.e., the first 500 data points of the whole series, to avoid the starting-value effect. More accurate simulations can be obtained by using the methods described in Fan (2005). To ease the presentation, we use "FF" to denote our new test, "AJ" to denote the Aït-Sahalia and Jacod test, "LM" to denote the Lee and Mykland (2008) test, "BNS" to denote the Barndorff-Nielsen and Shephard (2006) test, and "JO" to denote the Jiang and Oomen (2008) test.

6.1. Continuous diffusion process

The first model is the following continuous stochastic volatility process, which is taken from Aït-Sahalia and Jacod (2009):
$$dX_t/X_t = \sigma_t \, dW_t, \qquad v_t = \sigma_t^2, \qquad dv_t = \kappa(\beta - v_t)\,dt + \gamma v_t^{1/2}\, dB_t, \tag{24}$$

where W_t and B_t are both Brownian motions with E[dW_t dB_t] = ρ dt.
Table 1
Means and standard deviations of the FF test and the AJ test when data are generated from model (24).

Frequency   Test   K = 2             K = 3             K = 4
30 s        AJ     2.0006 (0.0800)   3.0068 (0.1633)   4.0188 (0.2608)
            FF     2.0020 (0.0568)   3.0042 (0.1214)   4.0059 (0.1960)
1 min       AJ     2.0094 (0.1132)   3.0029 (0.2364)   4.0111 (0.3591)
            FF     2.0047 (0.0842)   3.0023 (0.1703)   3.9965 (0.2739)
2 min       AJ     1.9973 (0.1554)   2.9826 (0.3225)   3.9880 (0.5111)
            FF     1.9942 (0.1127)   2.9888 (0.2357)   3.9821 (0.3827)
3 min       AJ     1.9964 (0.1917)   2.9971 (0.3928)   4.0203 (0.6109)
            FF     1.9965 (0.1397)   2.9946 (0.2930)   3.9973 (0.4836)
We simulate 1000 sample paths of prices over a half-month period (T = 1/24) with parameters β^{1/2} = 0.5, γ = 0.5, κ = 5, and ρ = −0.5. The sampling frequencies are Δ = 30 s, 1 min, 2 min, and 3 min. In each of the 1000 simulations, for each K = 2, 3, and 4, the test statistic Ŝ(p, K)_ℓ and our new test statistic Ŝ(p, K) are computed to test whether there are any jumps in the half-month interval. We compare the sample means and standard deviations of Ŝ(p, K)_ℓ and Ŝ(p, K) across the 1000 simulations in Table 1. Although for each fixed K there are K test statistics Ŝ(p, K)_ℓ, ℓ = 1, ..., K, the simulation results show that they all have very similar means and variances, as expected. Thus, only the mean and standard deviation of the first of them are presented in Table 1. It is easy to see that the standard deviations of our new test statistic are smaller than those of Ŝ(p, K)_ℓ, while the mean values are approximately the same. These results are in line with our asymptotic theory.

We next check the probabilities of type I error of the AJ test and the FF test. To better compare these two tests, we first calculate the rejection point x_FF of the FF test using (21), and then we calculate the rejection point of the AJ test according to (19). Thus, the FF test and the AJ test have the same type I error. Both the AJ test and the FF test, for all four values of the sampling interval Δ and all choices of K, have a probability of type I error of 0, which is consistent with our Theorem 3. As a comparison, we also list the probabilities of type I error of the LM test, the BNS test, and the JO test at three significance levels: α = 0.1, 0.01, and 0.001. The rejection region of the LM test is calculated by using the distribution derived in Lemma 1 of Lee and Mykland (2008), the rejection point of the BNS test is calculated based on the asymptotic normal distribution of the adjusted ratio jump test proposed in Section 2.2 of Barndorff-Nielsen and Shephard (2006), and the rejection point of the JO test is derived by using the asymptotic normal distribution of the ratio test given in Theorem 2.1 of their paper. For all three tests, if there are any parameters that need to be estimated, we follow the suggestions of the original papers. In all cases, the LM test, BNS test, and JO test have higher probabilities of type I error, which, as expected, are close to the size of the test (see Table 2).

6.2. Diffusion process with Poisson jumps

The model we use to generate the data is

$$dX_t/X_t = \sigma_t \, dW_t + J_t \, dN_t, \qquad v_t = \sigma_t^2, \qquad dv_t = \kappa(\beta - v_t)\,dt + \gamma v_t^{1/2}\, dB_t, \tag{25}$$

where N_t is a Poisson process with intensity λ = 12 × 4, J_t measures the jump size, and W_t and B_t are both Brownian motions with E[dW_t dB_t] = ρ dt. This model is similar to that in Aït-Sahalia and Jacod (2009). In the simulations, the parameters are chosen to be the same as before; that is, β^{1/2} = 0.5, γ = 0.5, κ = 5, and ρ = −0.5. The jump size J_t is generated as J_t = 0.01 β^{1/2} U, where U is a random variable uniformly distributed over the interval [−2, −1] ∪ [1, 2]. Thus, the jump size is at most 0.02 times the average volatility level and at least 0.01 times the average volatility level. Since we consider a half-month period (T = 1/24) and λ = 12 × 4, on average there are 2 jumps in this period.

With the simulated sample paths, we first compare the means and standard deviations of Ŝ(p, K) with those of Ŝ(p, K)_ℓ. Due to the similarities of Ŝ(p, K)_ℓ for ℓ = 1, ..., K, only the results for the first one are presented. Table 3 summarizes the comparison results. The conclusions are the same as before: our new test statistic has the same mean values but smaller standard deviations.

To further compare the test statistics, we calculate the probabilities of type II errors of both methods with the type I error fixed at the same level. The rejection points of the FF test are calculated by using (21). The rejection point of the AJ test is calculated by using (19) to ensure that it has approximately the same type I error. The comparison results are shown in Table 4. We see that the probability of type II error increases when the data are sampled less frequently, as expected. Table 4 also shows that our new test statistic outperforms the AJ test statistic in all cases. The probabilities of type II error of the LM, BNS, and JO tests with significance levels α = 0.1, 0.01, and 0.001 are listed in the last three columns of Table 4. Recall that with critical values x_AJ and x_FF, the AJ and the FF tests both have zero probability of type I error (see Table 2). To fairly compare the type II errors, all tests should be evaluated at a significance level smaller than α = 0.001. In fact, as shown in Tables 2 and 4, although the LM test has the smallest probabilities of type II errors, it has much larger probabilities of type I errors (see Table 2) than the FF test. To better compare the type II error of the FF test with that of the LM test, we choose the critical value of the FF test such that the FF test has the same type I error as the LM test when α = 0.1. That is, we fix the type I error of the FF test at levels 0.121, 0.110, 0.144 and 0.145 for Δ = 30 s, 1 min, 2 min, and 3 min, respectively. The corresponding type II errors are listed in the table and marked as FF*. So we only need to compare FF* with the LM test at α = 0.1 for type II errors. It can be seen that with the same type I error, our FF test has the smallest type II errors when Δ = 30 s and 1 min, and the LM test has the smallest type II errors when the sampling frequency is lower.
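For reproducibility, here is a minimal Euler-scheme sketch for model (25) (and, with lam = 0, for model (24)). The discretization, seeding, and variance floor are our own choices based on the text, not the authors' code.

```python
import numpy as np

def simulate_model_25(n, delta, beta_sqrt=0.5, gamma=0.5, kappa=5.0,
                      rho=-0.5, lam=48.0, burn_in=500, seed=0):
    # Euler scheme for dX/X = sigma dW + J dN, v = sigma^2,
    # dv = kappa(beta - v) dt + gamma sqrt(v) dB, corr(dW, dB) = rho,
    # N Poisson with intensity lam (= 12 x 4 in the text), and jump size
    # J = 0.01 * beta^{1/2} * U with U uniform on [-2,-1] U [1,2]
    rng = np.random.default_rng(seed)
    beta = beta_sqrt ** 2
    m = n + burn_in
    z1, z2 = rng.standard_normal(m), rng.standard_normal(m)
    dW = np.sqrt(delta) * z1
    dB = np.sqrt(delta) * (rho * z1 + np.sqrt(1.0 - rho ** 2) * z2)
    n_jump = rng.poisson(lam * delta, m)
    X, v = np.empty(m + 1), np.empty(m + 1)
    X[0], v[0] = 1.0, beta
    for i in range(m):
        v[i + 1] = max(v[i] + kappa * (beta - v[i]) * delta
                       + gamma * np.sqrt(v[i]) * dB[i], 1e-12)
        jump = 0.0
        if n_jump[i] > 0:
            U = rng.uniform(1.0, 2.0) * rng.choice([-1.0, 1.0])
            jump = 0.01 * beta_sqrt * U
        X[i + 1] = X[i] * (1.0 + np.sqrt(v[i]) * dW[i] + jump)
    return X[burn_in:]   # discard the burn-in period, as in the text
```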
Table 2
Probabilities of type I errors of the FF, AJ, BNS, LM, and JO tests when data are generated from model (24). AJ and FF entries are for K = 2, 3, 4; BNS, LM, and JO entries are for significance levels α = 0.1, 0.01, 0.001.

Frequency   AJ (K = 2/3/4)   FF (K = 2/3/4)   BNS (α = 0.1/0.01/0.001)   LM (α = 0.1/0.01/0.001)   JO (α = 0.1/0.01/0.001)
30 s        0 / 0 / 0        0 / 0 / 0        0.090 / 0.014 / 0          0.121 / 0.015 / 0.001     0.219 / 0.024 / 0.002
1 min       0 / 0 / 0        0 / 0 / 0        0.106 / 0.010 / 0.001      0.110 / 0.010 / 0         0.219 / 0.026 / 0.002
2 min       0 / 0 / 0        0 / 0 / 0        0.111 / 0.005 / 0.001      0.144 / 0.011 / 0.002     0.210 / 0.020 / 0.003
3 min       0 / 0 / 0        0 / 0 / 0        0.120 / 0.010 / 0          0.145 / 0.012 / 0.002     0.219 / 0.021 / 0.001
Table 3
Means and standard deviations of the FF test and the AJ test when data are generated from model (25).

Frequency   Test   K = 2             K = 3             K = 4
30 s        AJ     1.1616 (0.1932)   1.3379 (0.3392)   1.5006 (0.4886)
            FF     1.1639 (0.1691)   1.3314 (0.3158)   1.4952 (0.4553)
1 min       AJ     1.2736 (0.2486)   1.5225 (0.4259)   1.8003 (0.6268)
            FF     1.2687 (0.2182)   1.5271 (0.3957)   1.7886 (0.5752)
2 min       AJ     1.3921 (0.2944)   1.7825 (0.5425)   2.1910 (0.7852)
            FF     1.3872 (0.2496)   1.7832 (0.4764)   2.1747 (0.7047)
3 min       AJ     1.4896 (0.3431)   1.9796 (0.6188)   2.4758 (0.9101)
            FF     1.4846 (0.2815)   1.9718 (0.5391)   2.4617 (0.8057)

6.3. Diffusion process with Cauchy jumps

In this subsection we consider a diffusion process with Cauchy jumps. We generate data from the following model:

$$dX_t/X_t = \sigma_t \, dW_t + J \, dY_t, \tag{26}$$

where σ_t is simulated in the same way as in Section 6.2, J > 0 measures the size of the jumps relative to the volatility level, and Y_t is a Cauchy process with characteristic function E exp(iuY_t) = exp{−t|u|/2}. Note that the above model has infinite jump activity. We consider a half-month period (T = 1/24) and set J = 0.8. To save space, we omit the means and standard deviations of the AJ and FF tests. The probabilities of type II errors for the five tests are summarized in Table 5. As discussed in the last section, in order to make the comparison of the probabilities of type II errors fair, the LM, BNS and JO tests should be evaluated at a significance level smaller than α = 0.001. To make the comparison more precise, we also calculate the type II errors of the FF test when its type I errors are fixed at the same levels as those for the LM test with α = 0.1 (see Table 2). These type II errors are marked as FF* in the table. Thus, we only need to compare FF* with the LM test with α = 0.1. It can be seen that in this setting, the FF test outperforms the BNS and JO tests, and the LM test has the best performance. We would like to note that the LM test is derived under the assumption that, under the null hypothesis, the returns are normally distributed. Such an assumption is hard to validate for data even at daily frequency. All of our simulations fall in such a scenario, which gives advantages to the LM test. On the other hand, the AJ and FF tests do not require such an assumption.
6.4. Estimation of jump locations

In this section, the procedure proposed in Section 5 is used to identify the locations of jumps in a one-month (T = 1/12) period. The data are simulated from model (25) with parameter λ = 12 × 21; i.e., there is one jump per day on average. As in the previous section, the jump size J_t is generated as J_t = 0.01 β^{1/2} U, with U a random variable uniformly distributed over [−2, −1] ∪ [1, 2]. We study the performance of our new test procedure with three different sampling frequencies: Δ = 1 min, 3 min, and 5 min.

We use a two-step method to identify the exact location of jumps. In the first step, we divide the one-month period into many non-overlapping intervals of equal length, and then identify the jump intervals using the FDR approach discussed in Section 5. In the second step, for each identified jump interval, we locate the jumps by comparing the magnitudes of the increments X_{iΔ} − X_{(i−1)Δ}. More specifically, in the first step we choose the window size 2W = 2a_nΔ, with a_n a positive integer, and we divide the one-month period into [n/(2a_n)] non-overlapping intervals of equal length.
Table 4
Probabilities of type II errors of the FF, AJ, BNS, LM, and JO tests when data are generated from model (25). FF* represents the FF test with type I errors fixed at levels 0.121, 0.110, 0.144, and 0.145 for Δ = 30 s, 1 min, 2 min, and 3 min, respectively. AJ, FF, and FF* entries are for K = 2, 3, 4; BNS, LM, and JO entries are for α = 0.1, 0.01, 0.001.

Frequency   AJ (K = 2/3/4)          FF (K = 2/3/4)          FF* (K = 2/3/4)         BNS (α = 0.1/0.01/0.001)   LM                      JO
30 s        0.456 / 0.565 / 0.653   0.095 / 0.110 / 0.125   0 / 0 / 0               0.103 / 0.262 / 0.458      0.015 / 0.020 / 0.021   0.038 / 0.063 / 0.084
1 min       0.669 / 0.751 / 0.825   0.265 / 0.302 / 0.333   0.010 / 0.005 / 0.003   0.282 / 0.600 / 0.792      0.025 / 0.029 / 0.032   0.084 / 0.149 / 0.206
2 min       0.816 / 0.891 / 0.915   0.637 / 0.692 / 0.718   0.038 / 0.042 / 0.048   0.566 / 0.851 / 0.951      0.019 / 0.054 / 0.069   0.150 / 0.301 / 0.396
3 min       0.874 / 0.926 / 0.956   0.637 / 0.692 / 0.718   0.115 / 0.110 / 0.111   0.671 / 0.909 / 0.981      0.042 / 0.093 / 0.124   0.262 / 0.437 / 0.561
Table 5
Probabilities of type II errors of the FF, AJ, BNS, LM, and JO tests when data are generated from model (26). FF* represents the FF test with type I errors fixed at levels 0.121, 0.110, 0.144, and 0.145 for Δ = 30 s, 1 min, 2 min, and 3 min, respectively. AJ, FF, and FF* entries are for K = 2, 3, 4; BNS, LM, and JO entries are for α = 0.1, 0.01, 0.001.

Frequency   AJ (K = 2/3/4)          FF (K = 2/3/4)          FF* (K = 2/3/4)         BNS (α = 0.1/0.01/0.001)   LM                      JO
30 s        0.280 / 0.312 / 0.359   0.148 / 0.159 / 0.167   0.015 / 0.020 / 0.017   0.041 / 0.073 / 0.106      0 / 0.003 / 0.008       0.058 / 0.104 / 0.122
1 min       0.380 / 0.437 / 0.453   0.210 / 0.219 / 0.232   0.045 / 0.047 / 0.041   0.069 / 0.137 / 0.174      0.009 / 0.015 / 0.028   0.089 / 0.147 / 0.174
2 min       0.506 / 0.533 / 0.550   0.281 / 0.284 / 0.293   0.085 / 0.081 / 0.076   0.117 / 0.200 / 0.250      0.020 / 0.048 / 0.071   0.129 / 0.199 / 0.236
3 min       0.537 / 0.571 / 0.604   0.312 / 0.325 / 0.333   0.126 / 0.121 / 0.116   0.153 / 0.248 / 0.310      0.033 / 0.082 / 0.118   0.148 / 0.243 / 0.283
In each of these intervals, our new test statistic FF and the test statistic AJ are applied using the data points in that interval. Section 5 shows that the variance of either test statistic is much smaller at the jump locations than at the continuous time points. Thus, the null hypothesis corresponding to the i-th time interval is chosen to be

H_{0,i}: There are jumps in the i-th time interval.

If the i-th null hypothesis is rejected, we conclude that there is no jump in the i-th interval. The procedure introduced in Section 5 is used to control the FDR of this multiple-comparison problem. The FDR is controlled at 5%, and the number of true hypotheses is estimated by using (23) with c = 0.05. Practical implementation suggests that the result is not sensitive to c as long as c is not very close to 0. Since the local window size a_n chosen in the first step is usually small, it is reasonable to assume that, with high probability, each identified jump interval has at most one jump. Thus, in the second step we locate the jump in each interval by identifying the largest increment |X_{iΔ} − X_{(i−1)Δ}| in that interval. The BNS and JO tests are applied in the same way to locate jumps.

Since we are interested in classifying between "jump" and "non-jump", our problem is essentially a two-class classification problem. Thus, we borrow the two classification measures, sensitivity and specificity, to evaluate our test procedure. These two measures are defined as

Sensitivity = (# of correctly identified jumps by a test) / (# of true jumps in total),
Specificity = (# of correctly identified non-jumps by a test) / (# of true non-jumps in total).

Although larger values of these two measures indicate a better performance of a test, in practice there are tradeoffs between sensitivity and specificity. To understand this, just imagine that if a test rejects all null hypotheses, then the sensitivity defined above will be 1, but the specificity defined above will be 0. On the other hand, if a test fails to reject any null hypothesis, then the sensitivity is 0 and the specificity is 1. So to compare different tests, we need to combine the results of sensitivity and specificity.

The specificity for the LM test is different from the specificity for the other tests. To understand this, notice that the FF, AJ, BNS, and JO tests are all defined over local windows with 2a_n observations, while the LM test is defined over each sampling interval [iΔ, (i + 1)Δ]. If, according to the FF, AJ, or BNS test, there is no jump in a local window, then none of the sampling intervals in that local window contain jumps. Thus, the specificities for these four tests are defined by considering each local window as a test unit, while the specificity for the LM test is defined by considering each sampling interval as a test unit. To make it more comparable, we redefine the specificity of the LM test using each local window as a unit as well. That is, if any time point in a local window is identified by the LM test as a jump time point, then the whole interval is identified by the LM test as a jump interval. Thus, the specificity we use in this paper takes the form

Specificity = (# of correctly identified non-jump intervals by a test) / (# of true non-jump intervals in total).

Since sensitivity and specificity usually trade off against each other, we report weighted averages of sensitivities and specificities. For window sizes 2a_n = 30, 60, 90, and 120, the sensitivities and specificities of the test statistics are computed and their weighted averages are calculated. To save space, we only list the results when the weight for sensitivity is fixed at 0.5; see Tables 6-8. The significance level α for the LM test is chosen to be 0.001, as explained before.
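Computed over interval-level indicator arrays, the two measures and the weighted average reported in the tables look as follows; the boolean encoding and names are our own sketch.

```python
import numpy as np

def sensitivity_specificity(identified, truth):
    # identified, truth: boolean arrays over intervals (True = jump interval)
    identified, truth = np.asarray(identified), np.asarray(truth)
    sens = identified[truth].mean() if truth.any() else 0.0
    spec = (~identified)[~truth].mean() if (~truth).any() else 0.0
    return sens, spec

def weighted_average(sens, spec, w=0.5):
    # w = weight on sensitivity, as in Tables 6-9
    return w * sens + (1.0 - w) * spec
```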
Table 6
Weighted averages of means of sensitivities and specificities when Δ = 1 min over 1000 simulations. The weight for sensitivity is 0.5, and the data are generated from model (25) with λ = 12 × 21. The LM test gives 0.5897.

           Test   K = 2    K = 4    K = 6    BNS      JO
a_n = 15   AJ     0.5069   0.5590   0.5557   0.5470   0.6075
           FF     0.5930   0.6179   0.6078
a_n = 30   AJ     0.5383   0.5893   0.5820   0.5595   0.5945
           FF     0.6286   0.6090   0.5884
a_n = 45   AJ     0.5634   0.5952   0.5811   0.5595   0.5842
           FF     0.6209   0.5633   0.5430
a_n = 60   AJ     0.5872   0.5885   0.5700   0.5562   0.5744
           FF     0.5981   0.5350   0.5194
Table 7
Weighted averages of means of sensitivities and specificities when Δ = 3 min over 1000 simulations. The weight for sensitivity is 0.5, and the data are generated from model (25) with λ = 12 × 21. The LM test gives 0.5043.

           Test   K = 2    K = 4    K = 6    BNS      JO
a_n = 15   AJ     0.3308   0.3939   0.3958   0.5049   0.5160
           FF     0.4196   0.4660   0.4658
a_n = 30   AJ     0.3542   0.4282   0.4354   0.5040   0.5093
           FF     0.4706   0.4976   0.4964
a_n = 45   AJ     0.3892   0.4636   0.4669   0.5034   0.5062
           FF     0.4936   0.5025   0.5014
a_n = 60   AJ     0.4154   0.4782   0.4830   0.5032   0.5043
           FF     0.5002   0.5015   0.5008
Table 8
Weighted averages of means of sensitivities and specificities when Δ = 5 min over 1000 simulations. The weight for sensitivity is 0.5, and the data are generated from model (25) with λ = 12 × 21. The LM test gives 0.5013.

           Test   K = 2    K = 4    K = 6    BNS      JO
a_n = 15   AJ     0.2645   0.3392   0.3438   0.5017   0.5020
           FF     0.3635   0.4221   0.4283
a_n = 30   AJ     0.3002   0.3897   0.4009   0.5005   0.5001
           FF     0.4299   0.4767   0.4794
a_n = 45   AJ     0.3430   0.4375   0.4451   0.4996   0.4992
           FF     0.4669   0.4954   0.4971
a_n = 60   AJ     0.3714   0.4620   0.4681   0.4989   0.4987
           FF     0.4797   0.4988   0.4993
Fig. 3. Mean of weighted averages of sensitivity and specificity for the FF, AJ, and LM tests over 1000 simulations when Δ = 1 min and a_n = 30. The x-axis represents the weight for sensitivity. The data are generated from model (25) with λ = 12 × 21.
Fig. 4. Mean of weighted averages of sensitivity and specificity for the FF, AJ, and LM tests over 1000 simulations when Δ = 3 min and a_n = 30. The x-axis represents the weight for sensitivity. The data are generated from model (25) with λ = 12 × 21.
It can be seen from Table 6 that the weighted averages of sensitivity and specificity of the FF test are larger than those of the other tests in most cases when K = 2. Unreported simulation results show that the FF and AJ tests have much higher sensitivities than the other tests, while the LM test usually has low sensitivity and high specificity. To better illustrate the idea, we plot the weighted averages of the FF, AJ, and LM tests when a_n = 30 in Fig. 3. The x-axis represents the weight for sensitivity, ranging from 0 to 1. It can be seen that most of the time, the FF test has larger weighted averages than either the AJ test or the LM test. The LM test has specificity of almost 1 and sensitivity of less than 0.2, indicating that the LM test classifies most intervals as non-jump intervals.

When the sampling frequency Δ becomes 3 min, all tests perform worse. In fact, as shown in Table 7, the weighted averages become smaller than those in Table 6. Note that for the LM, BNS, and JO tests, the weighted averages are close to 0.5. After inspecting the simulation results, we found that the LM, BNS, and JO tests classify almost all intervals as non-jump intervals, resulting in specificity 1 and sensitivity 0. This can be confirmed by Fig. 4, which shows the weighted averages of the FF, AJ, and LM tests as a function of the weights when a_n = 30. The same as before, the x-axis represents the weight for sensitivity. It can be seen that the FF and AJ tests have much higher sensitivities, and most of the time, the FF test has larger weighted averages than the other two tests. The BNS and JO tests perform very similarly to the LM test in this setting. When Δ = 5 min, the results are very similar to those for Δ = 3 min.

6.5. Impact of microstructure noise
In this subsection, we compare the performance of the AJ, FF, LM, BNS, and JO tests when market microstructure noise is present. The underlying asset price process is generated from (25) over a one-month period (T = 1/12), and the observed asset price process is generated as

$$X_t^* = X_t + \varepsilon_t, \tag{27}$$

where ε_t ∼ i.i.d. N(0, σ_0² β) are the market microstructure noises and β, defined in (25), represents the mean volatility level. Three different noise levels are considered: small (σ_0 = 0.01%), medium (σ_0 = 0.1%), and large (σ_0 = 1%). Due to space limits, we only present the results for local window size 2a_n = 30 and Δ = 1 min. The comparison results can be found in Table 9.
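A one-function sketch of the contamination in (27), with β = 0.25 matching β^{1/2} = 0.5 from the simulations; the names and the default seed are ours.

```python
import numpy as np

def add_microstructure_noise(X, sigma0, beta=0.25, seed=0):
    # X*_t = X_t + eps_t with eps_t i.i.d. N(0, sigma0^2 * beta), as in (27)
    rng = np.random.default_rng(seed)
    return np.asarray(X) + rng.normal(0.0, sigma0 * np.sqrt(beta), size=len(X))
```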
Table 9
Means of sensitivities and specificities of the different methods when small, medium, and large microstructure noises are present. The local window size is 2a_n = 30, and the sampling frequency is Δ = 1 min. AJ and FF entries are for K = 2, 4, 6; the BNS, LM, and JO tests do not depend on K.

Noise     Method   Sensitivity (K = 2/4/6)      Specificity (K = 2/4/6)
Small     AJ       0.7404 / 0.6280 / 0.5933     0.2568 / 0.4693 / 0.4946
          FF       0.6275 / 0.4854 / 0.4451     0.5282 / 0.7126 / 0.7304
          BNS      0.1044                       0.9971
          LM       0.1769                       1
          JO       0.2356                       0.9749
Medium    AJ       0.7389 / 0.6498 / 0.6147     0.1520 / 0.3423 / 0.3845
          FF       0.6698 / 0.5473 / 0.5002     0.3175 / 0.5355 / 0.5842
          BNS      0.0004                       0.9960
          LM       0.1218                       1
          JO       0.2000                       0.9776
Large     AJ       0.0582 / 0.0173 / 0.0110     0.5357 / 0.8685 / 0.9083
          FF       0.0121 / 0.0009 / 0.0006     0.9114 / 0.9950 / 0.9960
          BNS      0.0017                       0.9964
          LM       0                            1
          JO       0.0021                       0.9968
With large noise, almost all tests fail to identify any jumps. For small and medium noise levels, the LM, BNS, and JO tests tend to miss most of the true jumps and thus have low sensitivities, while the FF and AJ tests have much higher sensitivities.

7. Data analysis

We apply our new test to the high-frequency stock price data of Microsoft Corporation from May 1, 2007 to May 31, 2007. The sampling frequency is chosen to be 1 min, and the total sample size is 8591. We first apply our new test with K = 2 to the entire data set; the value of our new test statistic is 1.195. The rejection region of the null hypothesis that there are no jumps in this one-month period is obtained by using (21), and the value of x₀ is 1.244. Thus, the null hypothesis is rejected.

We then locate the jumps. To this end, we calculate our new test statistic Ŝ(p, K)_t at each data point with indices in {a_n, a_n + 1, …, 8591 − a_n}. Jumps occurring when the market closes and opens are not very interesting. To exclude these jumps, we choose the local window size 2a_n = 78; that is, there are roughly 5 local windows per day for the FF test. As discussed in Section 5, if there is no microstructure noise, the process Ŝ(p, K)_t should hover around the two values 1 and 2, with flat regions corresponding to the jumps. Due to the influence of the microstructure noise, the sample path of Ŝ(p, K)_t is somewhat wiggly. To reduce the influence of the noise, we smooth the process Ŝ(p, K)_t using the wavelet method. Fig. 5 shows the smoothed curve of Ŝ(p, K)_t as a process of time. We see that there are a few flat regions in the plot that may indicate jump intervals. However, due to the market microstructure noise in the real data, it is still somewhat hard to determine which time intervals have jumps by looking at the plot alone. Thus, we further divide the one-month period into many small non-overlapping time intervals of equal length 2a_n = 78. The FDR procedure is employed to identify the intervals with jumps, with the false discovery rate controlled at the 5% level. Among the 110 time periods, 11 are detected with jumps, and the corresponding jump locations are listed in Table 10. We see that most of the intervals identified by the FDR approach correspond to a flat region in the plot of Ŝ(p, K)_t, which is in line with our theory.

Table 10
Identified time periods with jumps using the stock price data of Microsoft Corporation from May 1, 2007 to May 31, 2007 with 1-min frequency.

5/1/2007    10:52:55
5/2/2007     9:38:57
5/2/2007    14:37:57
5/4/2007     9:39:58
5/4/2007    13:41:57
5/11/2007    9:52:59
5/14/2007   10:50:58
5/15/2007   10:02:59
5/17/2007   15:34:57
5/30/2007   13:59:58
5/31/2007   13:30:57

Fig. 5. Top panel: stock price data of Microsoft Corporation from May 1, 2007 to May 31, 2007 with 1-min frequency. Bottom panel: Ŝ(p, K)_t as a process of time with 2a_n = 78.

8. Discussions

We have observed that several nonparametric test statistics similar to the one in Aït-Sahalia and Jacod (2009) can be constructed to detect whether a continuous-time process has a continuous sample path or not. We have derived their asymptotic joint distribution, which shows that they are not highly correlated. As a consequence, we have proposed to combine these test statistics linearly to form a new one that has the same asymptotic properties as the original ones but a smaller variance, and we have given the optimal weights explicitly. The critical region for testing the existence of jumps in a fixed time period is constructed by minimizing the sum of the type I and type II errors. A new test procedure based on a test process, which is obtained from the original asset price process by applying the new test locally over time, is proposed. This test process concentrates around two known constants, indicating whether or not a jump occurs at a time point. Thus, the problem of jump identification becomes a multiple-comparison problem, and the FDR procedure has been proposed to control the type I error. Simulation studies and a real data application have justified the theoretical results and further demonstrated the performance of our new test method.
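The interval-by-interval selection used in Section 7 is the Benjamini and Hochberg (1995) step-up rule. A minimal sketch is given below; the per-interval p-values are taken as given inputs here, and the toy numbers are purely illustrative assumptions, not the Microsoft data or the paper's exact p-value construction.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q.

    Returns a boolean array marking the rejected (jump-flagged) intervals.
    """
    p = np.asarray(pvalues)
    m = p.size
    order = np.argsort(p)
    # Largest k with p_(k) <= (k/m) q; reject hypotheses 1..k in sorted order.
    thresholds = (np.arange(1, m + 1) / m) * q
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # last sorted index meeting the bound
        reject[order[: k + 1]] = True
    return reject

# Toy usage: 110 non-overlapping intervals, a handful with genuine jumps.
rng = np.random.default_rng(1)
pvals = rng.uniform(size=110)
pvals[:8] = rng.uniform(0, 1e-4, size=8)   # strong signals in 8 intervals
flagged = benjamini_hochberg(pvals, q=0.05)
print("intervals flagged:", np.nonzero(flagged)[0])
```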
Acknowledgments
YF's research was partially supported by NSF Grant DMS-0906784 and a 2010 Zumberge Individual Award from USC's James H. Zumberge Faculty Research and Innovation Fund. JF's research was partially supported by NSF Grants DMS-0704337 and DMS-0714554. We sincerely thank the Editor, Associate Editor and referees for their constructive comments that significantly improved the paper.

Appendix. Proofs

A.1. A general result

Before proving Theorem 1, we first prove a more general result. Let n be the number of observations in the time interval [0, t]. Define

V^n(f)_t = Σ_{i=1}^{n} f(Δ_i^n X),
V̄^n_{ℓ+1}(f)_t = Σ_{j=1}^{[n/K]} f(X_{(ℓ+jK)Δ} − X_{(ℓ+(j−1)K)Δ}), for ℓ = 0, …, K − 1,

and V(f)_t = Σ_{s≤t} f(ΔX_s). Consider an auxiliary space (Ω′, F′, P′) which supports the following variables and processes:

• a sequence of uniform random variables on [0, 1], denoted by (κ_q);
• 3K sequences of i.i.d. standard Gaussian variables denoted by (Ū_{q,−K+1}), (Ū_{q,−K+2}), …, (Ū_{q,2K−1}); another two sequences of i.i.d. standard Gaussian variables denoted by (U_q) and (U′_q);
• a sequence of uniform random variables on the finite set {0, 1, …, K − 1}, denoted by (L_q);
• three standard Brownian motions W¹, W², and W³;

and all these processes and variables are mutually independent. Next define Ω̃ = Ω × Ω′, F̃ = F ⊗ F′, P̃ = P ⊗ P′. Extend the variables X_t, b_t, … defined on Ω and W¹, W², W³, U_q, … defined on Ω′ to the product space Ω̃. For simplicity, we use the same notations to denote these variables. Hereinafter we write Ẽ for the expectation with respect to P̃. Let (S_q)_{q≥1} be the successive jump times of X, which are stopping times. Define F̃_t as the smallest right-continuous filtration of F̃ containing the filtration (F_t) such that, with respect to F̃_t, the processes W¹_t, W²_t and W³_t are adapted and all the random variables and processes defined above are F̃_{S_q}-measurable for all q. Therefore, we have an extension (Ω̃, F̃, (F̃_t)_{t≥0}, P̃) of the original space (Ω, F, (F_t)_{t≥0}, P). Then, the following results hold for V^n(f)_t, V̄^n_ℓ(f)_t, ℓ = 1, …, K, and V(f)_t.

Theorem 4. Let f be C² with f(0) = f′(0) = f″(0) = 0. Given Assumption 1, the multivariate process

( (1/√Δ)(V^n(f)_t − V(f)_{Δ[t/Δ]}), (1/√Δ)(V̄^n_1(f)_t − V(f)_{KΔ[n/K]}), …, (1/√Δ)(V̄^n_K(f)_t − V(f)_{(K[n/K]+(K−1))Δ}) )

converges stably in law to the process (Z(f′)_t, Z(f′)_t + Z′_1(f′)_t, …, Z(f′)_t + Z′_K(f′)_t), where

Z(f)_t = Σ_{q: S_q≤t} f(ΔX_{S_q}) R_q and Z′_ℓ(f)_t = Σ_{q: S_q≤t} f(ΔX_{S_q}) (R_−(q, ℓ−1) + R_+(q, ℓ−1)),

with R_q, R_−(q, ℓ−1) and R_+(q, ℓ−1) defined in the proof below.

Proof. This proof, which has two steps, is an extension of the proof of Theorem 8 in Aït-Sahalia and Jacod (2009). First, we prove the results for a special class of f, and then we extend the results to a more general function f.

Step 1. Assume that f vanishes on [−2Kε, 2Kε] for some ε > 0. Let S_q be the successive jump times of the Poisson process μ([0, t] × {x : γ(x) > ε}). Let ΔX_{S_q} be the size of the q-th jump, and set

X(ε)_t = X_t − Σ_{q: S_q≤t} ΔX_{S_q}.

Define Ω_n(t, ε) to be the set of all ω such that each interval [0, t] ∩ (iΔ, (i + 2K − 1)Δ] contains at most one S_q(ω), |X(ε)_{iΔ} − X(ε)_{(i−1)Δ}| ≤ 2ε for all i ≤ t/Δ, the first jump time S_1(ω) > KΔ, and |X(ε)_{iΔ} − X(ε)_{(i−2K+1)Δ}| ≤ 2ε for all 2K − 1 ≤ i ≤ t/Δ. Next, for each q, on the set {(iK + j)Δ < S_q ≤ (iK + j + 1)Δ} for i ≥ 1 and 0 ≤ j < K, define

L(n, q) = j and M(n, q) = S_q/Δ − (iK + j).

We further define

α_−(n, q) = (W_{S_q} − W_{(iK+j)Δ})/√Δ,
α_+(n, q) = (W_{(iK+j+1)Δ} − W_{S_q})/√Δ,
β_−(n, q, ℓ) = (W_{(iK+j)Δ} − W_{(iK+ℓ−K1{ℓ>j})Δ})/√Δ,
β_+(n, q, ℓ) = (W_{(iK+ℓ+K1{ℓ≤j})Δ} − W_{(iK+j+1)Δ})/√Δ,

where 0 ≤ ℓ ≤ K − 1. For the process X(ε)_t, define the increments

R_−(n, q, ℓ) = X(ε)_{(iK+j)Δ} − X(ε)_{(iK+ℓ−K1{ℓ>j})Δ} and R_+(n, q, ℓ) = X(ε)_{(iK+ℓ+K1{ℓ≤j})Δ} − X(ε)_{(iK+j+1)Δ}

with 0 ≤ ℓ ≤ K − 1, and R^n_q = Δ^n_{iK+j+1} X(ε) = X(ε)_{(iK+j+1)Δ} − X(ε)_{(iK+j)Δ}. Finally, for each ℓ = 0, …, K − 1 we define

R_q = √κ_q U_q σ_{S_q−} + √(1 − κ_q) U′_q σ_{S_q},
R′^n_{q,ℓ} = R^n_q + R_−(n, q, ℓ) + R_+(n, q, ℓ),
R_−(q, ℓ) = Σ_{l=ℓ−K1{ℓ>L_q}}^{L_q} Ū_{q,l} σ_{S_q−},
R_+(q, ℓ) = Σ_{l=L_q+1}^{ℓ+K1{ℓ≤L_q}} Ū_{q,l} σ_{S_q}.

Using a similar idea as that in Aït-Sahalia and Jacod (2009), that is, extending the proof of Lemma 6.2 of Jacod and Protter (1998), we obtain

( L(n,q), M(n,q), α_−(n,q), α_+(n,q), β_−(n,q,0), β_+(n,q,0), …, β_−(n,q,K−1), β_+(n,q,K−1) )_{q≥1}
→^{L−(s)} ( L_q, κ_q, √κ_q U_q, √(1−κ_q) U′_q, Σ_{l=0}^{L_q} Ū_{q,l}, Σ_{l=L_q+1}^{K} Ū_{q,l}, …, Σ_{l=ℓ−K1{L_q<ℓ}}^{L_q} Ū_{q,l}, Σ_{l=L_q+1}^{ℓ+K1{L_q≥ℓ}} Ū_{q,l}, …, Σ_{l=K1{L_q≥K−1}−1}^{L_q} Ū_{q,l}, Σ_{l=L_q+1}^{K−1+K1{L_q≥K−1}} Ū_{q,l} )_{q≥1},

where →^{L−(s)} denotes the stable convergence in law. The above result yields

(1/√Δ) ( R^n_q, R_−(n,q,1), R_+(n,q,1), …, R_−(n,q,K), R_+(n,q,K) )_{q≥1} →^{L−(s)} ( R_q, R_−(q,1), R_+(q,1), …, R_−(q,K), R_+(q,K) )_{q≥1}.  (28)

Since f(x) = 0 for |x| ≤ 2Kε, we obtain that on the set Ω_n(t, ε) and for all s ≤ t,

V^n(f)_s − V(f)_{Δ[s/Δ]} = Σ_{q: S_q≤Δ[s/Δ]} [f(ΔX_{S_q} + R^n_q) − f(ΔX_{S_q})] = Σ_{q: S_q≤Δ[s/Δ]} f′(ΔX_{S_q} + R̃^n_q) R^n_q,

where R̃^n_q is between ΔX_{S_q} and ΔX_{S_q} + R^n_q. Similarly, it can be shown that

V̄^n_ℓ(f)_s − V(f)_{KΔ[s/(KΔ)]+(ℓ−1)Δ} = Σ_{q: S_q≤KΔ[s/(KΔ)]+(ℓ−1)Δ} f′(ΔX_{S_q} + R̃′^n_{q,(ℓ−1)}) R′^n_{q,(ℓ−1)},

where R̃′^n_{q,ℓ} is between ΔX_{S_q} and ΔX_{S_q} + R′^n_{q,ℓ}. Since R^n_q, R_−(n, q, ℓ) and R_+(n, q, ℓ) converge to 0 for all ℓ = 0, …, K − 1, R̃^n_q and R̃′^n_{q,ℓ} converge to ΔX_{S_q}, and Ω_n(t, ε) → Ω̃. The continuity of f′ together with (28) leads to the results in Theorem 4.

Step 2. For the general case, the idea of the proof is the same as that in Aït-Sahalia and Jacod (2009). Combining Steps 1 and 2 above yields the stable convergence results in Theorem 4.

A.2. Proof of Theorem 1

Theorem 4 applied with the function f(x) = |x|^p yields the stable convergence in law of the process

( (1/√Δ)(B(p, Δ)_t − B(p)_{Δ[t/Δ]}), (1/√Δ) Σ_{ℓ=1}^K a_ℓ (B(p, KΔ)_{ℓ,t} − B(p)_{KΔ[t/(KΔ)]+(ℓ−1)Δ}) )

to (Z(f′)_t, Z(f′)_t + Σ_{j=1}^K a_j Z′_j(f′)_t) with f′(x) = p|x|^{p−1} sgn(x). Hence, if we let

Z(p)_t = Z(f′)_t and Z′(p, K, ℓ)_t = Z′_ℓ(f′)_t  (29)

with Z(f′)_t and Z′_ℓ(f′)_t defined in Theorem 4, the desired limit follows. It has been proved in Aït-Sahalia and Jacod (2009) that for any positive integer K,

(B(p)_t − B(p)_{KΔ[t/(KΔ)]+(ℓ−1)Δ})/√Δ →^P 0.

Then, the convergence results in Theorem 1 are proved.

We next calculate the variance and covariance of Z′(p, K, ℓ)_t. Note that for q₁ < q₂, any pairs of random vectors (R_−(q₁,·), R_+(q₁,·)) and (R_−(q₂,·), R_+(q₂,·)) are independent, regardless of the value of ℓ. Thus, we only need to consider the correlation between R_−(q,·) and R_+(q,·) for a fixed q. For ℓ₁ ≤ ℓ₂,

E(R_−(q, ℓ₁)R_−(q, ℓ₂)|F) = E[ L_q − (ℓ₁ − K1{L_q < ℓ₁}) ∨ (ℓ₂ − K1{L_q < ℓ₂}) | F ] σ²_{S_q−} = [ (K−1)/2 − (1/K)[K − (ℓ₂−ℓ₁)](ℓ₂−ℓ₁) ] σ²_{S_q−}.

Similarly, we have

E(R_+(q, ℓ₁)R_+(q, ℓ₂)|F) = [ (K−1)/2 − (1/K)[K − (ℓ₂−ℓ₁)](ℓ₂−ℓ₁) ] σ²_{S_q}.

Moreover, for any 0 ≤ ℓ₁, ℓ₂ ≤ K − 1, E(R_−(q, ℓ₁)R_+(q, ℓ₂)|F) = 0. From the above three equations, we deduce that for any ℓ₁ ≤ ℓ₂,

E(Z′(p, K, ℓ₁)_t Z′(p, K, ℓ₂)_t |F) = Σ_{q: S_q≤t} f′(ΔX_{S_q})² [ E(R_−(q, ℓ₁−1)R_−(q, ℓ₂−1)|F) + E(R_+(q, ℓ₁−1)R_+(q, ℓ₂−1)|F) ] = p² [ (K−1)/2 − (1/K)(K − ℓ₂ + ℓ₁)(ℓ₂ − ℓ₁) ] D(p)_t,

and

E(Z(p)²_t |F) = (p²/2) Σ_{s≤t} |ΔX_s|^{2p−2} (σ²_{s−} + σ²_s).

This completes the proof.

A.3. Proof of Corollary 1

This is similar to the proof of Theorem 3 in Aït-Sahalia and Jacod (2009).

A.4. Proof of Theorem 2

Consider the 2-dimensional function f = (f₁, f₂) with f₁(x₁, …, x_K) = |x₁ + ⋯ + x_K|^p and f₂(x₁, …, x_K) = |x₁|^p. Write ρ_σ^{⊗K}(f) = ∫ f(x) ρ_σ^{⊗K}(dx) with ρ_σ^{⊗K} the K-fold tensor product of the law N(0, σ²). Define

V′(f, K, Δ)_t = Σ_{i=1}^{n−K+1} f(Δ^n_i X/√Δ, …, Δ^n_{i+K−1} X/√Δ),  (30)

where Δ^n_i X = X_{iΔ} − X_{(i−1)Δ}. By Theorem 7.1 of Jacod (2007), the 2-dimensional processes

(1/√Δ) ( Δ V′(f, K, Δ)_t − ∫_0^t ρ^{⊗K}_{σ_u}(f) du )  (31)

converge stably in law to a continuous process V̄′(f, K) defined on an extension (Ω̃, F̃, P̃) of the original space (Ω, F, P), which conditionally on the σ-field F is a centered Gaussian R²-valued process with independent increments, satisfying

Ẽ[ V̄′(f_i, K)_t V̄′(f_j, K)_t | F ] = ∫_0^t R^{ij}_{σ_u}(f, K) du.

Here, R^{ij}_σ(f, K) is defined as

R^{ij}_σ(f, K) = Σ_{ℓ=−K+1}^{K−1} E[ f_i(σU_K, …, σU_{2K−1}) f_j(σU_{ℓ+K}, …, σU_{ℓ+2K−1}) ] − (2K − 1) E[f_i(σU_1, …, σU_K)] E[f_j(σU_1, …, σU_K)],  (32)

where (U_i)_{i≥1} are independent standard Gaussian random variables. By the definition of f, we can derive that

R^{11}_σ(f, K) = σ^{2p} Σ_{ℓ=0}^{K−1} [ cov(|√ℓ U₁ + √(K−ℓ) U₂|^p, |√(K−ℓ) U₂ + √ℓ U₃|^p) + cov(|√(K−ℓ) U₁ + √ℓ U₂|^p, |√ℓ U₂ + √(K−ℓ) U₃|^p) ]
= σ^{2p} K^p (m_{2p} − m_p²) + 2σ^{2p} Σ_{ℓ=1}^{K−1} cov(|√(K−ℓ) U₁ + √ℓ U₂|^p, |√ℓ U₂ + √(K−ℓ) U₃|^p),
R^{12}_σ(f, K) = σ^{2p} K cov(|U₁ + √(K−1) U₂|^p, |U₁|^p),
R^{22}_σ(f, K) = σ^{2p} (m_{2p} − m_p²).

Next note that V′(f₂, K, Δ)_t = Δ^{−p/2} B(p, Δ)_t and

Δ^{−p/2} K^{−1} Σ_{l=1}^{K} B(p, KΔ)_l = K^{−1} Σ_{i=1}^{[n/K]} f₁(Δ^n_{(i−1)K+1} X/√Δ, …, Δ^n_{(i+1)K−1} X/√Δ).

Define Y(p)_t = K^{−1} V̄′(f₁, K)_t and Y′(p, a)_t = V̄′(f₂, K)_t with V̄′(f₁, K)_t and V̄′(f₂, K)_t defined in (32). In view of (30), the bivariate process

Δ^{−1/2} ( Δ^{1−p/2} B(p, Δ) − m_p A(p), Δ^{1−p/2} K^{−1} Σ_{ℓ=1}^K B(p, KΔ)_ℓ − K^{p/2−1} m_p A(p) )

converges stably to the 2-dimensional process (Y(p)_t, Y′(p, a)_t). The results in Theorem 2 then follow. This completes the proof.

A.5. Proof of Corollary 2

Similar to the proof of Theorem 3 in Aït-Sahalia and Jacod (2009).

A.6. Proof of Theorem 3

(a) By Corollary 3, when X_t is continuous, (V̂_c)^{−1/2}(Ŝ(p, K) − K^{p/2−1}) converges stably in law to N(0, 1). Since x ∈ (1, K),

(K^{p/2−1} − x)/√V̂_c →^{a.s.} +∞.

Thus,

α_{n,t} = P( (Ŝ(p, K) − K^{p/2−1})/√V̂_c < (x − K^{p/2−1})/√V̂_c | H₀ ) −→ 0.

In the case when X_t is not continuous but the sample path is continuous over [0, T], using a similar technique to that in Theorem 6 of Aït-Sahalia and Jacod (2009) completes the proof.

(b) When the sample path of X_t exhibits jumps, it follows from Corollary 3 that (V̂_j)^{−1/2}(Ŝ(p, K) − 1) converges stably in law to N(0, 1). Since 1 < x < K, we know that (x − 1)/√V̂_j →^{a.s.} +∞ as Δ → 0. Thus, the following result holds:

P( Ŝ(p, K) < x | H₁ ) = P( (Ŝ(p, K) − 1)/√V̂_j < (x − 1)/√V̂_j | H₁ ) −→ 1.

This completes the proof.

References
Aït-Sahalia, Y., 2004. Disentangling diffusion from jumps. Journal of Financial Economics 74, 487–528.
Aït-Sahalia, Y., Jacod, J., 2009. Testing for jumps in a discretely observed process. The Annals of Statistics 37, 184–222.
Aït-Sahalia, Y., Mykland, P.A., Zhang, L., 2005. How often to sample a continuous-time process in the presence of market microstructure noise. Review of Financial Studies 18, 351–416.
Andersen, T.G., Bollerslev, T., Diebold, F.X., 2007. Roughing it up: including jump components in the measurement, modeling and forecasting of return volatility. Review of Economics and Statistics 89, 701–720.
Barndorff-Nielsen, O.E., Shephard, N., 2006. Econometrics of testing for jumps in financial economics using bipower variation. Journal of Financial Econometrics 4, 1–30.
Benjamini, Y., Hochberg, Y., 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, 289–300.
Carr, P., Wu, L., 2003. What type of process underlies options? A simple robust test. Journal of Finance 58, 2581–2610.
Duffie, D., Pan, J., Singleton, K., 2000. Transform analysis and asset pricing for affine jump-diffusions. Econometrica 68, 1343–1376.
Fan, J., 2005. A selective overview of nonparametric methods in financial econometrics (with discussion). Statistical Science 20, 317–357.
Fan, J., Wang, Y., 2007. Multi-scale jump and volatility analysis for high-frequency financial data. Journal of the American Statistical Association 102, 1349–1362.
Jacod, J., 2007. Statistics and high-frequency data: SEMSTAT Seminar. Tech. rep., Université de Paris-6.
Jacod, J., 2008. Asymptotic properties of realized power variations and related functionals of semimartingales. Stochastic Processes and their Applications 118, 517–559.
Jacod, J., Protter, P., 1998. Asymptotic error distributions for the Euler method for stochastic differential equations. Annals of Probability 26, 267–307.
Jacod, J., Todorov, V., 2009. Testing for common arrivals of jumps for discretely observed multidimensional processes. Annals of Statistics 37, 1792–1838.
Jiang, G.J., Oomen, R., 2008. Testing for jumps when asset prices are observed with noise - a "swap variance" approach. Journal of Econometrics 144, 352–370.
Johannes, M., 2004. The statistical and economic role of jumps in continuous-time interest rate models. Journal of Finance 59, 227–260.
Johannes, M., Polson, N., Stroud, J., 2004a. Nonlinear filtering of stochastic differential equations with jumps. Manuscript.
Johannes, M., Polson, N., Stroud, J., 2004b. Sequential parameter estimation in stochastic volatility models with jumps. Manuscript.
Lee, S., Mykland, P.A., 2008. Jumps in financial markets: a new nonparametric test and jump dynamics. Review of Financial Studies 21, 2535–2563.
Mancini, C., 2003. Estimation of the characteristics of jump of a general Poisson-diffusion process. Scandinavian Actuarial Journal 1, 42–52.
Merton, R., 1976. Option pricing when underlying stock returns are discontinuous. Journal of Financial Economics 3, 125–144.
Pan, J., 2002. The jump-risk premia implicit in options: evidence from an integrated time series study. Journal of Financial Economics 63, 3–50.
Zhang, L., Mykland, P.A., Aït-Sahalia, Y., 2005. A tale of two time scales: determining integrated volatility with noisy high-frequency data. Journal of the American Statistical Association 100, 1394–1411.
Journal of Econometrics 164 (2011) 345–366
Robust trend inference with series variance estimator and testing-optimal smoothing parameter

Yixiao Sun
Department of Economics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0508, United States

Article history: Received 22 February 2010; Received in revised form 11 June 2011; Accepted 13 June 2011; Available online 21 June 2011.

JEL classification: C13; C14; C32; C51.

Keywords: Asymptotic expansion; F-distribution; Hotelling's T² distribution; Long run variance; Robust standard error; Series method; Testing-optimal smoothing parameter choice; Trend inference; Type I and type II errors.

Abstract: The paper develops a novel testing procedure for hypotheses on deterministic trends in a multivariate trend stationary model. The trends are estimated by the OLS estimator and the long run variance (LRV) matrix is estimated by a series type estimator with carefully selected basis functions. Regardless of whether the number of basis functions K is fixed or grows with the sample size, the Wald statistic converges to a standard distribution. It is shown that critical values from the fixed-K asymptotics are second-order correct under the large-K asymptotics. A new practical approach is proposed to select K that addresses the central concern of hypothesis testing: the selected smoothing parameter is testing-optimal in that it minimizes the type II error while controlling for the type I error. Simulations indicate that the new test is as accurate in size as the nonstandard test of Vogelsang and Franses (2005) and as powerful as the corresponding Wald test based on the large-K asymptotics. The new test therefore combines the advantages of the nonstandard test and the standard Wald test while avoiding their main disadvantages (power loss and size distortion, respectively). © 2011 Elsevier B.V. All rights reserved.
1. Introduction

Trend regression is a simple and yet important regression in economic and climatic time series analysis. In this paper, we consider a linear trend regression with multiple dependent variables. For example, the dependent variables may consist of GDPs from a number of countries; Vogelsang and Franses (2005) provide more empirical examples. Estimation of the trends is relatively easy, as the equation-by-equation OLS estimator is asymptotically as efficient as the system GLS estimator. Hence, for point estimation, there is no need to take error autocorrelation into account in large samples. However, trend inference is subtle, as the variance of the OLS trend estimator depends on the long run variance (LRV) of the error process. Since the LRV is proportional to the spectral density of the error process evaluated at zero, many nonparametric spectral density methods can be used to estimate
the LRV. Commonly used methods are mostly kernel-based. In this paper, we consider estimating the LRV using nonparametric series methods. The resulting series LRV estimator is the sample variance of the regression coefficients in a nonparametric series regression. The smoothing parameter in the series LRV estimator is the number of basis functions employed.

When the number of basis functions K is fixed, the LRV estimator is inconsistent and converges to a scaled Wishart distribution. The underlying scale cancels out with the corresponding scale in the limiting distribution of the trend estimator. Hence, when K is fixed, the Wald statistic converges to a pivotal nonstandard distribution. The fixed-K asymptotics is in the spirit of the fixed-b asymptotics of Kiefer and Vogelsang (2002a,b, 2005). This type of asymptotics captures the randomness of the LRV estimator, and tests based on it often have better finite sample size properties than those based on consistent LRV estimates. Jansson (2004) and Sun et al. (2008, SPJ hereafter) provide some theoretical justifications for the nonstandard asymptotic theory. We design a set of basis functions so that the fixed-K asymptotic distribution becomes the standard F distribution. For these basis
functions, the series LRV estimator is asymptotically invariant to the intercepts and trend parameters. As a result, it does not suffer from the bias arising from the estimation uncertainty of the model parameters. This is in contrast with the conventional kernel LRV estimators, where the estimation uncertainty gives rise to a demeaning bias; see, for example, Hannan (1957) and Ng and Perron (1994). By selecting the basis functions appropriately, we completely remove this type of bias to the order we care about. This is a desirable property, as we generally prefer an estimator with fewer bias terms, especially in hypothesis testing. Another advantage of using the new series LRV estimator lies in its convenience in practical use, as critical values from the fixed-K asymptotics are readily available from statistical tables and software packages.

While the LRV estimator is inconsistent when K is finite, it becomes consistent when K grows with the sample size at a certain rate. The smoothing parameter K is an important tuning parameter that determines the asymptotic properties of the LRV estimator. Following the conventional approaches (e.g., Andrews, 1991; Newey and West, 1987, 1994), Phillips (2005) chooses the smoothing parameter K to minimize the asymptotic MSE of the LRV estimator. Such a choice of the smoothing parameter is designed to be optimal in the MSE sense for the point estimation of the LRV, but is not necessarily best suited for semiparametric testing. Through its effect on the LRV estimator, the smoothing parameter K affects the type I and type II errors of the associated test. It is thus sensible that the choice of K should take these properties into account.

To develop an optimal choice of K for semiparametric testing, we first have to decide on which test to use. We can employ the traditional Wald test, which is based on the Wald statistic and uses a chi-square distribution as the reference distribution. Alternatively, we can employ the new F* test given in this paper, which is based on a modified Wald statistic and uses an F-distribution as the reference distribution. We find that critical values from the F distribution are high-order correct under the conventional large-K asymptotics. A direct implication is that the F* test generally has smaller size distortion than the traditional Wald test. On the basis of this theoretical result and the emphasis on size control in the econometrics literature, we employ the F* test to conduct inference on the trend parameters.

One of the main contributions of the paper is to develop an optimal procedure for selecting the smoothing parameter K that addresses the central concern of semiparametric testing. The ultimate goal of any testing problem is to achieve smaller type I and type II errors. However, these two types of errors often move in opposite directions. We can control one type of error while trying to minimize the other. In this paper, we propose to choose K to minimize the type II error subject to the constraint that the type I error is bounded. The resulting optimal K is said to be testing-optimal for the given bound. The bound is defined to be κα, where α is the nominal type I error and κ > 1 is a parameter that captures the user's tolerance for the discrepancy between the nominal and true type I errors. The proposed approach to selecting the testing-optimal K requires asymptotic measurements of the type I and type II errors of the F* test.
These measurements are provided by means of high-order asymptotic expansions of the finite sample distribution of the F* statistic under the null and local alternative hypotheses. In a transformed space, the null hypothesis is a fixed point, while the alternative hypothesis we consider is a random point uniformly distributed on the sphere centered at the fixed null. The radius of the sphere is chosen so that the power of the test is 75% under the first-order asymptotics. This strategy is similar to that used in the optimal testing literature. In the absence of a uniformly most powerful test, it is often recommended to pick a reasonable
point under the alternative and construct an optimal test against this particular point alternative. It is hoped that the resulting test, although not uniformly most powerful, is reasonably close to the power envelope. Here we use the same idea and select the radius of the sphere according to the power requirement. We hope that the smoothing parameter that is optimal for the chosen radius also works well for other points under the alternative hypothesis. This is confirmed by our Monte Carlo study. The testing-optimal K that maximizes the local asymptotic power while preserving size in large samples is fundamentally different from the MSE-optimal K . The testing-optimal K depends on the sign of the nonparametric bias, the hypothesis under consideration and the permitted tolerance for the type I error while the MSE-optimal K does not. When the permitted tolerance becomes sufficiently small, the testing-optimal K is of smaller order than the MSE-optimal K . Our criterion for K selection is a testing-focused criterion in that it aims at the testing problem and takes the specific hypothesis into consideration. The paper that is most closely related to the present paper is SPJ where robust inference for the mean of a scalar time series is considered. In SPJ, the optimal smoothing parameter minimizes a loss function that is defined to be a weighted sum of the type I and type II errors. Our procedure can also be cast in this framework with the Lagrange multiplier for the constrained minimization problem as the relative weight. The main difference is that our weight is implicitly defined through the tolerance parameter κ . For a given κ , the weight may be different across different data generating processes. In contrast, in the SPJ procedure, the weight is specified a priori and is thus fixed. Both procedures require a user-chosen parameter: the tolerance parameter or the weight. The tolerance parameter is often easier to choose as it involves only the type I error while the weight is more difficult to choose as it depends on both type I and type II errors. This is an advantage of the new procedure proposed here. The same procedure is used in Sun et al. (in press) for robust mean inference with exponentiated kernels. The series LRV estimator has been considered in the literature under different names. It belongs to the class of multi-window or multi-taper estimators (Thomson, 1982) and the class of filterbank estimators (Stoica and Moses, 2005, ch. 5). In the simulation and signal processing literature, the weighted area estimator of Foley and Goldman (1999) is a series LRV estimator with particular basis functions. In econometrics, Phillips (2005) embeds this estimator in a framework of automated regression. Müller (2007) motivates it from the perspective of robust LRV estimation. The fixed-K type of asymptotics has some precursors in the literature. Foley and Goldman (1999) approximate the distribution of their autocorrelation robust t-statistic by a t distribution. As we show later, the t-distribution belongs to the class of fixed-K asymptotic distributions. For some basis functions, the working paper version of Müller (2007) contains the fixed-K asymptotics and F approximation. However, the basis functions considered here are different from the existing literature. They do not constitute a complete basis system, and are designed to eliminate the demeaning effect and the detrending effect at the same time. 
To the best of my knowledge, the paper is the first to explore the relationship between the fixed-K asymptotics and the conventional large-K asymptotics in trend estimation and inference. It is also the first to propose a testing-optimal smoothing parameter choice in this setting. The rest of the paper is organized as follows. Section 2 describes the basic setting and the limiting distribution of the trend estimator. Section 3 motivates the series LRV estimator and establishes its asymptotic properties under the fixed-K and large-K asymptotics. Section 4 investigates the limiting distribution of the Wald statistic under both fixed-K and large-K asymptotics. Section 5 gives a
high-order expansion of the finite sample distribution of the modified Wald statistic. On the basis of this expansion, Section 6 proposes a selection rule for K that is most suitable for implementation in semiparametric testing. The next section reports simulation evidence on the performance of the new procedure. The last section provides some concluding discussion. Proofs are given in the Appendix.

2. The model and preliminaries

Consider n trend-stationary time series denoted by (y_{1t}, …, y_{nt})′ with t = 1, 2, …, T. We assume that the data generating process is

y_{it} = α_i + β_i t + u_{it},  t = 1, 2, …, T, i = 1, 2, …, n,  (1)

where u_{it} is a weakly dependent process with zero mean. Our focus of interest is on the inference about the trend parameters {β_i}.

Assumption 1. Let u_t = (u_{1t}, …, u_{nt})′. We assume that

u_t = C(L)ε_t = Σ_{j=0}^∞ C_j ε_{t−j},

where ε_t ∼ iid(0, Σ), E‖ε_t‖^v < ∞ for some v ≥ 4,

Σ_{j=0}^∞ j^a ‖C_j‖ < ∞ for a > 3, C(1)Σ C(1)′ > 0,

and ‖·‖ is the matrix Euclidean norm.

Under the above assumption, the process u_t admits the following BN (Beveridge and Nelson, 1981) decomposition:

u_t = C(1)ε_t + ũ_{t−1} − ũ_t for ũ_t = Σ_{j=0}^∞ C̃_j ε_{t−j}, C̃_j = Σ_{s=j+1}^∞ C_s,  (2)

where Σ_{j=0}^∞ ‖C̃_j‖² < ∞. Using this decomposition and following Phillips and Solo (1992), we can prove that

(1/√T) Σ_{t=1}^{[Tr]} u_t →_d Λ W_n(r), as T → ∞,  (3)

where W_n(r) is an n × 1 vector of standard independent Wiener processes and Λ = [C(1)Σ C(1)′]^{1/2} is the matrix square root of the long run variance matrix Ω of u_t:

Ω = ΛΛ′ = Σ_{j=−∞}^∞ E u_t u′_{t−j} = C(1)Σ C(1)′.

To represent the OLS estimator of the model parameters, we introduce the following notation:

y_i = (y_{i1}, …, y_{iT})′, Y = (y_1, y_2, …, y_n), u_i = (u_{i1}, …, u_{iT})′, u = (u_1, …, u_n),
X_t = (1, t)′, X = (X_1, …, X_T)′, θ = (θ_1, θ_2, …, θ_n) with θ_i = (α_i, β_i)′.

The OLS estimator of θ is then given by θ̂_OLS = (X′X)^{−1}X′Y. If the errors are second-order stationary, then the OLS estimator is asymptotically equivalent to the GLS estimator. In addition, because (1) is a seemingly unrelated regression (SUR) with the same regressors in each equation, the OLS estimator is equivalent to the SUR estimator, which is the GLS estimator that accounts for contemporaneous correlation across the series. Thus, the simple OLS estimator has some nice optimality properties. Vogelsang and Franses (2005) make the same point.

Let D = diag(T^{−1/2}, T^{−3/2}). Then, for u_t defined in Assumption 1,

D X′X D = [ 1, T^{−2} Σ_{t=1}^T t ; T^{−2} Σ_{t=1}^T t, T^{−3} Σ_{t=1}^T t² ] → [ 1, ∫_0^1 r dr ; ∫_0^1 r dr, ∫_0^1 r² dr ]

and

D X′u = [ (1/√T) Σ_{t=1}^T u′_t ; (1/√T)(1/T) Σ_{t=1}^T t u′_t ] →_d [ ∫_0^1 dW_n(r)′ Λ′ ; ∫_0^1 r dW_n(r)′ Λ′ ].

Therefore,

D^{−1}(θ̂_OLS − θ) →_d [ 1, ∫_0^1 r dr ; ∫_0^1 r dr, ∫_0^1 r² dr ]^{−1} [ ∫_0^1 dW_n(r)′ Λ′ ; ∫_0^1 r dW_n(r)′ Λ′ ] = [ 6 ∫_0^1 (2/3 − r) dW_n(r)′ Λ′ ; 12 ∫_0^1 (r − 1/2) dW_n(r)′ Λ′ ].

So the OLS estimator β̂_OLS of β satisfies

T^{3/2}(β̂_OLS − β) →_d 12 Λ ∫_0^1 (r − 1/2) dW_n(r) =_d N(0, 12Ω).
3. Series LRV estimator and its asymptotic properties

To conduct inference regarding β, we need to first estimate the LRV matrix Ω. In the next subsection, we motivate the series LRV estimator we use in this paper.

3.1. Motivation of series LRV estimator

Consider the kernel-based estimator proposed by Phillips et al. (2006, 2007, PSJ hereafter):

Ω̂_PSJ = (1/T) Σ_{r=1}^T Σ_{s=1}^T û_r K_ρ((r − s)/T) û′_s,

where K_ρ(x) = [K(x)]^ρ for some second-order kernel function K(·). This estimator is consistent when ρ → ∞ at a certain rate. Assume that K(·) is even, continuous and positive semidefinite. By Mercer's theorem (Mercer, 1909), we can write

K_ρ(r − s) = Σ_{k=1}^∞ λ_k φ_k(r) φ_k(s),  (4)

where {λ_k} is a sequence of eigenvalues and {φ_k(r)} is an orthonormal sequence of eigenfunctions corresponding to the eigenvalues λ_k. It can be shown that

Σ_{k=1}^∞ λ_k = 1 and Σ_{k=1}^∞ λ_k² = O(1/√ρ) as ρ → ∞.  (5)

With this representation of K_ρ(r − s), we can write

Ω̂_PSJ = Σ_{k=1}^∞ λ_k [ (1/√T) Σ_{r=1}^T φ_k(r/T) û_r ][ (1/√T) Σ_{s=1}^T φ_k(s/T) û_s ]′ := Σ_{k=1}^∞ λ_k Ω̂_k,  (6)

where

Ω̂_k = Λ̂_k Λ̂′_k and Λ̂_k = (1/√T) Σ_{t=1}^T φ_k(t/T) û_t.
In the above expression, λ_k decays to zero as k increases. The intuition is that, as k increases, the eigenfunction φ_k(r) becomes more concentrated on high frequency components, and we should impose progressively less weight on these components in order to capture the long run (low frequency) properties of the underlying time series. In addition, for each k, λ_k → 0 as ρ → ∞. Implicitly, the PSJ estimator employs a soft thresholding method where the weight λ_k approaches zero but is not equal to zero for any given k.

Instead of soft thresholding, we can also consider the hard thresholding estimator:

Ω̂ = (1/K) Σ_{k=1}^K Ω̂_k,  (7)

where K is a positive integer. This estimator truncates the infinite sum in (6) and assigns equal weights to the remaining terms. In other words, the infinite sequence (λ_1, …, λ_K, …) is replaced by (1/K, 1/K, …, 1/K, 0, …). For this sequence, Σ_{k=1}^∞ λ_k = 1 and Σ_{k=1}^∞ λ_k² = 1/K. Comparing the squared sum with that in (5), we can see that K plays the role of √ρ in the PSJ estimator. This can also be seen by comparing the asymptotic biases and the asymptotic variances of these two estimators.

As will be shown below, with appropriately chosen φ_k, each summand Ω̂_k is an asymptotically unbiased estimator of Ω. We refer to LRV estimators of the form Ω̂_k as direct LRV estimators, so that Ω̂ is an average of K direct LRV estimators.

Note that Λ̂_k is approximately the regression coefficient obtained by regressing the time series û_t on the regressor φ_k(t/T)/√T. Ω̂_k is the part of 'the total sum of squares' Σ_{t=1}^T û_t û′_t that is explained by the basis function φ_k(·)/√T. This explained sum of squares may be regarded as another way of thinking about the long run variance matrix: the contributions to the variation of û_t that are due to low frequency variations in the series.

To obtain more flexible estimators of the form Ω̂, we can use any orthogonal basis functions to construct Ω̂. For example, we may use polynomial basis functions. In addition, the basis functions do not have to form a complete basis system. In fact, we use incomplete basis functions below in order to remove the demeaning and detrending effects. We have thus obtained a general class of LRV estimators. For convenience, we refer to them as series LRV estimators, as they are based on nonparametric series regressions.

The series LRV estimator has different interpretations. First, it can be regarded as a multiple-window estimator with window function φ_k(t/T)/√T; see Thomson (1982) and Percival and Walden (1993). In the econometrics literature, Sun (2006) applies the multiple-window estimator to the estimation of realized volatility. The robust long run variance estimators derived by Müller (2007) also belong to the class of multiple-window estimators. In a different context and for a different model, Müller (2007) has established the fixed-K asymptotics given in Section 3.2. Phillips (2005) gives an alternative motivation of the multiple-window estimator and establishes its asymptotic properties. Second, when φ_k(1 − x) = φ_k(x), we can write

Λ̂_k = (1/√T) Σ_{τ=0}^{T−1} φ_k(1 − τ/T) û_{T−τ},

which can be regarded as the output from applying a linear filter to the residual process û_t. The transfer function of the linear filter is

H_k(ω) = (1/√T) Σ_{τ=0}^{T−1} φ_k(1 − τ/T) exp(iτω).

To capture the long run behavior of the process, we require that H_k(ω) be concentrated around the origin. That is, H_k(ω) resembles a band pass filter that passes low frequencies within a certain range and rejects (attenuates) frequencies outside that range. Hence, Ω̂_k can also be regarded as a filter-bank estimator, and Ω̂ is a simple average of these filter-bank estimators. Finally, Ω̂ can be regarded as the sample variance of the regression coefficients {Λ̂_k, k = 1, 2, …, K}. By construction, it is automatically positive semidefinite, a desirable property for practical use.

Many series LRV estimators can be obtained by choosing different basis functions. However, in nonparametric series estimation, it is conventional wisdom that the choice of basis functions is often less important than the choice of the smoothing parameter. For this reason, we employ the basis functions that are most convenient for practical use and focus on the problem of selecting the smoothing parameter K.
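A direct implementation of the series estimator (7) is sketched below. The cosine basis φ_k(t/T) = √2 cos(2πkt/T) used here anticipates the choice adopted in Section 3.2; the choice of K is left to the user, and the i.i.d. toy residuals are an assumption for the usage example only.

```python
import numpy as np

def series_lrv(U_hat, K):
    """Series LRV estimator (7): average of K direct estimators Omega_k.

    U_hat is a T x n residual matrix. Each Lambda_k is the coefficient from
    projecting the residuals on the basis function phi_k(t/T)/sqrt(T); the
    estimator is positive semidefinite by construction.
    """
    T, n = U_hat.shape
    t = np.arange(1, T + 1) / T
    omega = np.zeros((n, n))
    for k in range(1, K + 1):
        phi = np.sqrt(2.0) * np.cos(2 * np.pi * k * t)  # cosine basis
        lam = U_hat.T @ phi / np.sqrt(T)                 # Lambda_k, an n-vector
        omega += np.outer(lam, lam)                      # Omega_k = Lambda_k Lambda_k'
    return omega / K

# Usage: for i.i.d. N(0, I) residuals the LRV is the identity matrix.
rng = np.random.default_rng(0)
print(series_lrv(rng.standard_normal((500, 2)), K=12).round(2))
```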
3.2. Fixed-K asymptotics

In this subsection, we establish the asymptotic distribution of Ω̂ under the assumption that K is fixed. Let û_t = y_t − ȳ − (t − t̄)β̂. Then

(1/√T) Σ_{t=1}^{[Tr]} û_t = (1/√T) Σ_{t=1}^{[Tr]} [ u_t − ū − (t − t̄)(β̂ − β) ]
→ Λ[W_n(r) − r W_n(1)] − Λ [ ∫_0^r (s − 1/2) ds ] [ ∫_0^1 (s − 1/2)² ds ]^{−1} ∫_0^1 (s − 1/2) dW_n(s) := Λ V_n(r),

where

V_n(r) = W_n(r) − r W_n(1) − 6r(r − 1) ∫_0^1 (t − 1/2) dW_n(t).

Using summation and integration by parts and invoking the continuous mapping theorem, we obtain, for fixed K and under Assumption 1:

Ω̂ →_d Λ (1/K) Σ_{k=1}^K [ ∫_0^1 φ_k(r) dV_n(r) ][ ∫_0^1 φ_k(s) dV_n(s) ]′ Λ′
= Λ (1/K) Σ_{k=1}^K [ ∫_0^1 φ̃_k(r) dW_n(r) ][ ∫_0^1 φ̃_k(s) dW_n(s) ]′ Λ′ := Λ (1/K) Σ_{k=1}^K ζ_k ζ′_k Λ′,  (8)

where

φ̃_k(r) = φ_k(r) − ∫_0^1 φ_k(s) ds − 12 [ ∫_0^1 φ_k(s)(s − 1/2) ds ] (r − 1/2)  (9)

is the transformed basis function and

ζ_k = ∫_0^1 φ̃_k(r) dW_n(r).

We call the above asymptotics the fixed-K asymptotics. This is similar to the fixed-b asymptotics of Kiefer and Vogelsang (2005). Common choices of φ_k are the sine and cosine trigonometric polynomials. In fact, using a simple Fourier expansion and assuming that K(·) is even, we can show that the eigenfunctions in (4) are the sine and cosine functions. A subset of the cosine functions, φ_k(r) = √2 cos(πkr), k = 0, 1, …, enjoys the desirable property that

φ̃_k(r) = φ_k(r) for k = 0, 2, 4, ….
So not only are {φ_k(r)} orthonormal, but so are their transforms as defined in (9). Note that the first basis function, with k = 0, is redundant since Σ_{t=1}^T û_t = 0. We therefore take

φ_k(t/T) = √2 cos(2πkt/T), for k = 1, 2, …, K,

as our data windows or basis functions. Similar to the Hanning window (1 − cos(2πt/T))/2, the above functions have small side lobes, and their Fourier transforms decay to zero rapidly. As a result, the associated LRV estimator has a small bias due to spectral leakage (Priestley, 1981, p. 563). This is an especially desirable feature for hypothesis testing, where bias reduction is more important than in the point estimation of the LRV.

With the above cosine basis functions, ζ_k is iid N(0, I_n). As a result, Σ_{k=1}^K ζ_k ζ′_k has a Wishart distribution W_n(I_n, K), so Ω̂ converges to a scaled Wishart distribution. In the scalar case, the limiting distribution reduces to the scaled chi-square distribution χ²_K/K. In general, for any conforming constant vector z, z′Ω̂z/z′Ωz converges in distribution to χ²_K/K. This result can be used to test hypotheses regarding Ω, and the resulting test may have better size properties. See Phillips et al. (2006, 2007) and Hashimzade and Vogelsang (2007) for the same point based on conventional kernel estimators. We do not pursue this extension here, as our main focus is on the inference for β.
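The scaled chi-square limit is easy to verify by simulation. The sketch below is a quick Monte Carlo check for a scalar series, with i.i.d. N(0, 1) errors as the assumed data generating process (so Ω = 1); it is illustrative only.

```python
import numpy as np
from scipy import stats

def simulate_fixed_k_limit(T=500, K=8, reps=2000, seed=0):
    """Monte Carlo check that K * Omega_hat / Omega is close to chi^2_K
    for a scalar i.i.d. N(0, 1) series after OLS detrending.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    X = np.column_stack([np.ones(T), t])
    P = X @ np.linalg.solve(X.T @ X, X.T)          # projection on (1, t)
    phi = np.sqrt(2.0) * np.cos(2 * np.pi * np.outer(np.arange(1, K + 1), t / T))
    draws = np.empty(reps)
    for r in range(reps):
        u = rng.standard_normal(T)
        u_hat = u - P @ u                           # detrended residuals
        lam = phi @ u_hat / np.sqrt(T)              # K direct estimates Lambda_k
        draws[r] = np.sum(lam ** 2)                 # = K * Omega_hat
    return stats.kstest(draws, stats.chi2(K).cdf)

print(simulate_fixed_k_limit())   # KS test should not reject chi^2_K
```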
3.3. Large-K asymptotics

While the fixed-K asymptotics may capture the randomness of Ω̂ very well, it does not reflect the usual nonparametric bias or Parzen bias of Ω̂. In this section, we consider the asymptotic properties of Ω̂ when both K and T go to infinity such that K/T → 0.

Theorem 2. Let Assumption 1 hold. As K → ∞ such that K/T → 0, we have

(a) E Ω̂ − Ω = (K²/T²) B + o(K²/T²) + O(1/T);
(b) var(vec(Ω̂)) = (1/K)(Ω ⊗ Ω)(I_{n²} + K_{nn})(1 + o(1)) + O(1/T),

where

B = −(2π²/3) Σ_{h=−∞}^∞ h² Γ_u(h), Γ_u(h) = E u_t u′_{t−h},

K_{nn} is the n² × n² commutation matrix, and I_{n²} is the n² × n² identity matrix.

Theorem 2 extends Theorem 1 of Phillips (2005), which is applicable only to scalar time series with known mean. The bias term here is different from that given in Theorem 1(i) of Phillips (2005). This is because the basis functions we use are a subset of the basis functions in Phillips (2005). The advantage of dropping √2 cos(π(2k − 1)r), k = 1, 2, …, is that the estimation uncertainty of θ does not affect the bias and variance calculations in large samples. More specifically, we show in the proof that Ω̂ is asymptotically equivalent to

Ω̃ = (1/K) Σ_{k=1}^K [ (1/√T) Σ_{t=1}^T φ_k(t/T) u_t ][ (1/√T) Σ_{s=1}^T φ_k(s/T) u_s ]′,  (10)

an estimator that is based on the true but unknown error term u_t. This result is in sharp contrast to existing results in the HAC estimation literature. For conventional kernel HAC estimators, the estimation uncertainty in the model parameters gives rise to a higher-order bias term, which is typically of the same order of magnitude as the asymptotic variance. The higher-order bias is not captured in the first-order conventional asymptotic theory, although it is reflected in the nonstandard fixed-b asymptotics; see for example SPJ. We have thus provided a novel way to eliminate the effect of the estimation uncertainty of the model parameters on the LRV estimation. We note in passing that the estimation uncertainty may also be eliminated using recursive OLS residuals.

Theorem 2(b) characterizes the asymptotic behavior of the exact variance. This result is different from Theorem 1(ii) of Phillips (2005), as the latter provides only the variance of the limiting distribution of Ω̂. In terms of moment calculations, our results are stronger than those in Phillips (2005).

Theorem 2 shows that the asymptotic bias is increasing in K and the asymptotic variance is decreasing in K. This is different from what is typically observed in nonparametric series regression. The reason is that here we use the series estimator in the time domain to effectively implement nonparametric smoothing in the frequency domain. Given the reciprocal relationship between time domain smoothing and frequency domain smoothing, a larger K in the time domain, which corresponds to a smaller amount of smoothing in the time domain, is equivalent to a larger amount of smoothing in the frequency domain. Hence, a larger K in the present setting leads to a larger asymptotic bias and a smaller asymptotic variance.

Let

MSE(Ω̂, W) = E[ vec(Ω̂ − Ω)′ W vec(Ω̂ − Ω) ]

be the mean squared error of vec(Ω̂) with weighting matrix W. It follows from Theorem 2 that, up to smaller order terms,

MSE(Ω̂, W) = tr{ W E[ vec(Ω̂ − Ω) vec(Ω̂ − Ω)′ ] } = (K⁴/T⁴) vec(B)′ W vec(B) + (1/K) tr[ W(Ω ⊗ Ω)(I_{n²} + K_{nn}) ].

So the MSE-optimal K is given by

K_{MSE} = [ tr[ W(Ω ⊗ Ω)(I_{n²} + K_{nn}) ] / (4 vec(B)′ W vec(B)) ]^{1/5} T^{4/5}.  (11)

This approach to the optimal K choice is the same as that for the bandwidth choice in kernel LRV estimators. See, for example, Andrews (1991).
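The plug-in rule (11) can be evaluated directly once B and Ω are available. The sketch below takes B and Ω as given inputs; the identity weighting W = I and the placeholder value of B in the usage example are assumptions for illustration, not estimates.

```python
import numpy as np

def mse_optimal_k(B, Omega, T, W=None):
    """MSE-optimal K from (11) with weighting matrix W (identity by default)."""
    n = Omega.shape[0]
    W = np.eye(n * n) if W is None else W
    # Commutation matrix K_nn: K_nn vec(A) = vec(A') for n x n matrices A.
    Knn = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            Knn[i * n + j, j * n + i] = 1.0
    variance_term = np.trace(W @ np.kron(Omega, Omega) @ (np.eye(n * n) + Knn))
    vecB = B.reshape(-1, order="F")            # column-major vec(B)
    bias_term = 4.0 * vecB @ W @ vecB
    return (variance_term / bias_term) ** 0.2 * T ** 0.8

# Example with assumed 2 x 2 inputs (B is a placeholder, not estimated here):
Omega = np.array([[1.0, 0.3], [0.3, 1.0]])
B = -0.5 * Omega
print(round(mse_optimal_k(B, Omega, T=500), 1))
```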
4. Autocorrelation robust inference for trend parameters

The hypotheses of interest in this paper are H₀: Rβ = r against H₁: Rβ ≠ r, where R is a p × n matrix and r is a p × 1 vector. The usual Wald statistic F_{T,OLS} for testing H₀ against H₁ is given by

F_{T,OLS} = [ R T^{3/2}(β̂_OLS − β) ]′ [ 12 R Ω̂ R′ ]^{−1} [ R T^{3/2}(β̂_OLS − β) ].

When p = 1, we can construct the usual t-statistic

t_{T,OLS} = R T^{3/2}(β̂_OLS − β) / [ 12 R Ω̂ R′ ]^{1/2}.

4.1. Fixed-K asymptotics

Under the fixed-K asymptotics and the null hypothesis,
F_{T,OLS} →_d 12 [ RΛ ∫_0^1 (r − 1/2) dW_n(r) ]′ [ RΛ ( (1/K) Σ_{k=1}^K [ ∫_0^1 φ̃_k(r) dW_n(r) ][ ∫_0^1 φ̃_k(s) dW_n(s) ]′ ) Λ′R′ ]^{−1} [ RΛ ∫_0^1 (r − 1/2) dW_n(r) ].

It turns out that the scaling factor in the asymptotic distribution of Ω̂ cancels out with that in the asymptotic distribution of T^{3/2}(β̂_OLS − β). To see this, we represent the distribution of R_{p×n} Λ W_n(r) by R* W*_p(r) for some p × p matrix R* and p-dimensional Brownian motion W*_p(r). Then, for a fixed K, we have

F_{T,OLS} →_d 12 [ ∫_0^1 (r − 1/2) dW*_p(r) ]′ [ (1/K) Σ_{k=1}^K ξ_k ξ′_k ]^{−1} [ ∫_0^1 (r − 1/2) dW*_p(r) ] := η′ [ (1/K) Σ_{k=1}^K ξ_k ξ′_k ]^{−1} η,

where

η = √12 ∫_0^1 (r − 1/2) dW*_p(r) and ξ_k = ∫_0^1 φ̃_k(r) dW*_p(r).

So the limiting distribution of F_{T,OLS} does not depend on Λ and is pivotal. Since ∫_0^1 φ̃_k(r) dr = 0 for all k and, by the construction of φ̃_k in (9), cov(η, ξ_k) = √12 ∫_0^1 (r − 1/2) φ̃_k(r) dr = 0, η and ξ_k are independent, as both are normal random variables. In addition, ξ_k ∼ iid N(0, I_p), and Σ_{k=1}^K ξ_k ξ′_k has a Wishart distribution W_p(I_p, K). Hence the limiting distribution of F_{T,OLS} is Hotelling's T²-distribution (Hotelling, 1931):

F_{T,OLS} →_d T²(p, K).

Since, for K ≥ p,

[(K − p + 1)/(pK)] T²(p, K) ∼ F_{p,K−p+1},

we have

[(K − p + 1)/(pK)] F_{T,OLS} →_d F_{p,K−p+1} := (χ²_p/p) / (χ²_{K−p+1}/(K − p + 1)),

where χ²_p and χ²_{K−p+1} denote independent χ² random variables. When p = 1, the above result reduces to t_{T,OLS} →_d t_K; that is, the t-statistic converges to the t-distribution with K degrees of freedom. These fixed-K asymptotic results can also be proved directly using standard techniques from multivariate statistical analysis.

We have therefore shown that, under the fixed-K asymptotics, the scaled Wald statistic converges weakly to the F distribution with degrees of freedom (p, K − p + 1) and the t-statistic converges to the t-distribution with K degrees of freedom. These results are very handy, as critical values from the F distribution or the t distribution can be easily obtained from statistical tables or standard econometrics packages.

Under the local alternative hypothesis

H₁(δ²): Rβ = r + c/T^{3/2},

where c = (12RΩR′)^{1/2} c̃ for some p × 1 vector c̃, we have, for K ≥ p,

[(K − p + 1)/(pK)] F_{T,OLS} →_d [(K − p + 1)/(pK)] (η + c̃)′ [ (1/K) Σ_{k=1}^K ξ_k ξ′_k ]^{−1} (η + c̃) := F_{p,K−p+1}(δ²),  (12)

a noncentral F distribution with degrees of freedom (p, K − p + 1) and noncentrality parameter

δ² = c̃′c̃ = c′ [(12RΩR′)^{−1/2}]′ (12RΩR′)^{−1/2} c = c′ (12RΩR′)^{−1} c.

This result follows from Proposition 8.2 in Bilodeau and Brenner (1999), where the notation F_c is the canonical F distribution (Bilodeau and Brenner, 1999, page 42). Similarly, the t-statistic converges to the noncentral t distribution with K degrees of freedom and noncentrality parameter δ = (12RΩR′)^{−1/2} c = c̃.

The local alternative power depends on c only through the noncentrality parameter δ² = ‖c̃‖², the squared length of the vector c̃. The direction of c̃ does not matter. Hence, for the first-order asymptotics given here, it is innocuous to assume that c̃ is uniformly distributed on the sphere S_p(δ) = {x ∈ R^p : ‖x‖ = δ}. It turns out that this assumption greatly simplifies the development of the high-order expansions in later sections.

4.2. Large-K asymptotics

When K → ∞ such that K/T → 0, the LRV estimator Ω̂ is consistent. As a consequence,

F_{T,OLS} →_d χ²_p under H₀ and F_{T,OLS} →_d χ²_p(δ²) under H₁(δ²).

When p = 1,

t_{T,OLS} →_d N(0, 1) under H₀ and t_{T,OLS} →_d N(δ, 1) under H₁(δ²).
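The testing recipe implied by the fixed-K results, namely scaling the Wald statistic by (K − p + 1)/K and referring it to F_{p,K−p+1} critical values, can be coded compactly. The sketch below is self-contained and illustrative; the toy AR(1) data generating process is an assumption used only for the usage example.

```python
import numpy as np
from scipy import stats

def trend_wald_test(Y, R, r, K, alpha=0.05):
    """Wald test of H0: R beta = r using the series LRV estimator,
    scaled by (K - p + 1)/K and referred to the F_{p, K-p+1} distribution."""
    T, n = Y.shape
    p = R.shape[0]
    t = np.arange(1, T + 1, dtype=float)
    X = np.column_stack([np.ones(T), t])
    theta = np.linalg.solve(X.T @ X, X.T @ Y)
    beta_hat = theta[1]
    U = Y - X @ theta
    # Series LRV estimator with cosine basis phi_k(t/T) = sqrt(2) cos(2 pi k t/T)
    phi = np.sqrt(2.0) * np.cos(2 * np.pi * np.outer(np.arange(1, K + 1), t / T))
    Lam = phi @ U / np.sqrt(T)                   # K x n matrix of Lambda_k'
    Omega_hat = Lam.T @ Lam / K
    V = 12.0 * R @ Omega_hat @ R.T               # asy. variance of R T^{3/2}(b-beta)
    g = T ** 1.5 * (R @ beta_hat - r)
    F = g @ np.linalg.solve(V, g)
    F_star = (K - p + 1) / K * F                 # finite sample corrected statistic
    crit = p * stats.f.ppf(1 - alpha, p, K - p + 1)
    return F_star, crit, F_star > crit

# Toy usage: test H0: beta = 0 for one series with AR(1) errors (assumed DGP).
rng = np.random.default_rng(0)
T = 500
u = np.zeros(T)
for s in range(1, T):
    u[s] = 0.6 * u[s - 1] + rng.standard_normal()
Y = (0.5 + u)[:, None]                           # true trend coefficient is 0
print(trend_wald_test(Y, R=np.array([[1.0]]), r=np.array([0.0]), K=12))
```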
To compare the fixed-K asymptotics with the large-K asymptotics, we evaluate the difference in their 1 − α quantiles. Let Fpα,K −p+1 be the 1 − α quantile of the Fp,K −p+1 distribution and Fpα,∞
be the 1 − α quantile of the Fp,∞ ≡ χp2 /p distribution. In other
words, pFpα,∞ ≡ χpα is the 1 − α quantile of the χp2 distribution. Denote Gp (·) the CDF of a χ 2 random variable with degrees of freedom p. By definition and with a slight abuse of notation, we have,
Y. Sun / Journal of Econometrics 164 (2011) 345–366
where 1 = 1′α , 1′β
as K → ∞, 1 − α = P Fp,K −p+1 <
Fpα,K −p+1
pFpα,K −p+1
=P χ < 2 p
= EGp pFpα,K −p+1 Gp pFpα,∞
′
χK2−p+1
×E
(K − p + 1)
pFpα,K −p+1
χ
−
pFpα,∞
1 + G′′p pFpα,∞ 2
2
2 K −p+1
− (K − p + 1) α α ′
pFpα,∞
+o
1
Hence, θˆGLS − θ and 1 are independent. In addition,
uˆ = InT − In ⊗ X X ′ X
K2
K
E v ec θˆGLS − θ 1′ = 0.
= Gp pFpα,∞ + Gp pFp,∞ pFp,K −p+1 − pFpα,∞ 2 1 + G′′p pFpα,∞ pFpα,∞ K α 1 α + o pFp,K −p+1 − pFp,∞ + o .
−1
X′
v ec (u) ,
and thus
E v ec θˆGLS − θ uˆ ′ = (In ⊗ X )′ V −1 (In ⊗ X )
(13)
ˆ. So θˆGLS is independent of both 1 and Ω Let FT ,GLS be the Wald statistic based on the GLS estimator:
pFpα,K −p+1 = χpα −
α
1 Gp χp
′′
K G′p χpα
α 2
χp
ˆ R′ FT ,GLS = RT 3/2 (βˆ GLS − β) R12Ω
+o
1
K
as K → ∞.
(14)
−1
RT 3/2 (βˆ GLS − β).
Using the asymptotic equivalence of the OLS and GLS estimators and the above two independence conditions, we can prove the following Lemma. Lemma 3. Let Assumption 1 hold and assume that εt ∼ iidN (0, Σ ). Then for K ≥ p,
But
Gp χpα 1 α χp − p + 2 , − ′ α = α 2χp Gp χp ′′
(a) P (b) P
hence
1 α χp − p + 2 χpα + o = χpα +
−1
−1 ′ × (In ⊗ X )′ V −1 V InT − In ⊗ X X ′ X = 0. X
Therefore
pFpα,K −p+1
and more explicitly
It follows from the asymptotic equivalence of θˆOLS and θˆGLS that Ec ′ 1β 1′β c = O(1/T ) for any vector c. See Grenander and Rosenblatt (1957). It is easy to show that
+ Gp pFp,∞
pFpα,K −p+1
×E
(K − p + 1) α
′
−1 1 = (In ⊗ X )′ (In ⊗ X ) (In ⊗ X )′ −1 − (In ⊗ X )′ V −1 (In ⊗ X ) (In ⊗ X )′ V −1 v ec (u) .
(K − p + 1) χK2−p+1
χK2−p+1
=
351
2K
1
K
−1 v ec θˆGLS − θ = (In ⊗ X )′ V −1 (In ⊗ X ) × (In ⊗ X )′ V −1 v ec (u) .
(K −p+1)
K K
FT ,OLS
= EGp z K −Kp+1 Ξ −1 + O T1 ,
FT ,GLS < z
−1 1/2 1/2 ˆ R′ Ξ = e′β RΩ R′ RΩ RΩ R′ eβ , − 1 / 2 R12ΩT ,GLS R′ RT 3/2 (βˆ GLS − β) , eβ = −1/2 RT 3/2 (βˆ GLS − β) R12ΩT ,GLS R′ 3/2 ˆ and ΩT ,GLS = v ar T βGLS − β 12.
Therefore the critical values from the F -distribution are larger than those from the χ 2 -distribution, reflecting the randomness in the denominator of Up to the order o(1/K ), the Wald statistic. the correction term χpα − p + 2 χpα /(2K ) increases with p and decreases with K . So when K is small or p is large, the difference between the F and χ 2 approximations may be large.
In this section, we consider a high-order expansion of the Wald statistic in order to design a testing-optimal procedure to select K . We make the simplification assumption that ut is normal, which facilitates the derivations. The assumption could be relaxed but at the cost of much greater complexity, see for example, Sun and Phillips (2009). Let V = v ar (v ec (u)), then the GLS estimator of θ satisfies
(K −p+1)
where
.
5. High-order expansion of the finite sample distribution
ˆ affects the Lemma 3 shows that the estimation uncertainty of Ω distribution of the Wald statistic only through Ξ . Taking a Taylor expansion, we have Ξ − 1 = 1 + L + Q + op
1 K
+
K2
T2
+ Op
1
T
,
ˆ − Ω and Q is quadratic in Ω ˆ − Ω . The where L is linear in Ω exact expressions for L and Q are not important here but are given in the Proof of Theorem 4. Plugging this stochastic expansion into Lemma 3, we obtain a high-order expansion of the finite sample distribution of FT ,OLS for the case where K → ∞ such that K /T → 0.
Similarly, the OLS estimator satisfies
Theorem 4. Let Assumption 1 hold and assume that εt N (0, Σ ). If K → ∞ such that K /T → 0, then
−1 v ec θˆOLS − θ = (In ⊗ X )′ (In ⊗ X ) (In ⊗ X )′ v ec (u) .
P
So
v ec θˆOLS − θ = v ec θˆGLS − θ + 1,
(K − p + 1) K
FT ,OLS < z
1 + G′′p (z ) z 2 + o K
= Gp (z ) +
1
K
+o
K2 T2
∼ iid
K2 ′ Gp (z ) z B¯ T2
+O
1
T
(15)
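The quantity B̄ in (15) is straightforward to evaluate for simple parametric error models. The sketch below does so for independent AR(1) errors, an assumed data generating process used purely for illustration (it combines B from Theorem 2 with the definition of B̄ above).

```python
import numpy as np

def b_bar_ar1(rho, R, sigma2=1.0, hmax=2000):
    """B = -(2 pi^2 / 3) sum_h h^2 Gamma_u(h) and
    B_bar = tr[(R B R') (R Omega R')^{-1}] / p for independent AR(1) errors
    u_it = rho_i u_{i,t-1} + e_it (assumed DGP for illustration)."""
    rho = np.asarray(rho, dtype=float)
    h = np.arange(1, hmax + 1)
    gamma0 = sigma2 / (1.0 - rho ** 2)
    # sum over h != 0 of h^2 Gamma(h); h and -h contribute equally
    s = np.array([np.sum(h ** 2 * r ** h) for r in rho])
    B = np.diag(-(2 * np.pi ** 2 / 3) * gamma0 * 2 * s)
    Omega = np.diag(gamma0 * (1 + rho) / (1 - rho))   # AR(1) long run variance
    p = R.shape[0]
    RBR = R @ B @ R.T
    ROR = R @ Omega @ R.T
    return np.trace(RBR @ np.linalg.inv(ROR)) / p

R = np.array([[1.0, 0.0]])
print(b_bar_ar1(rho=[0.5, 0.3], R=R))  # negative for positive autocorrelation
```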
The first term in (15) comes from the standard chi-square approximation of the Wald statistic. The second term captures the nonparametric bias of the LRV estimator, while the third term reflects the variance of the LRV estimator. The result is analogous to those obtained by SPJ for Gaussian location models and by Sun and Phillips (2009, SP hereafter) for general linear GMM models with stationary data. However, there is an important difference. For the conventional kernel estimators used in SPJ and SP, the asymptotic expansion contains a term that reflects the bias due to the estimation error of the model parameters. Such a term does not appear here, because the basis functions we employ are asymptotically orthogonal to the regressors.

To understand the relationship between the fixed-K and large-K asymptotics, we develop an expansion of the limiting pF_{p,K−p+1} distribution as in (13):

P(pF_{p,K−p+1} < z) = G_p(z) + (1/K) G″_p(z) z² + o(1/K),

as K → ∞. Comparing this with Theorem 4, we find that the fixed-K asymptotics captures one of the higher-order terms in the high-order expansion under the large-K asymptotics. Plugging z = pF^α_{p,K−p+1} into the above equation yields

1 − α = G_p(pF^α_{p,K−p+1}) + (1/K) G″_p(pF^α_{p,K−p+1}) (pF^α_{p,K−p+1})² + o(1/K).

This implies that

P( [(K−p+1)/K] F_{T,OLS} < pF^α_{p,K−p+1} ) = 1 − α + (K²/T²) G′_p(pF^α_{p,K−p+1}) (pF^α_{p,K−p+1}) B̄ + o(K²/T²) + o(1/K) + O(1/T).  (16)

Therefore, the use of the critical value pF^α_{p,K−p+1} removes the variance term K^{−1}G″_p(z)z² in the high-order expansion. The size distortion is then of order O(K²/T²). In contrast, if the critical value from the conventional χ²_p distribution is used, the size distortion is of order O(K²/T²) + O(1/K). So when K³/T² → 0, using the critical value pF^α_{p,K−p+1} should lead to size improvements. We have thus shown that critical values from the fixed-K asymptotics are second-order correct under the large-K asymptotics. The fixed-K asymptotic distribution of F_{T,OLS} is K(K−p+1)^{−1} pF_{p,K−p+1}, while its first-order large-K asymptotic distribution is χ²_p. When K is fixed, the two distributions are different. Hence, the large-K asymptotic approximation is not even first-order valid under the fixed-K asymptotics.

Theorem 4 gives an expansion of the distribution of K^{−1}(K−p+1)F_{T,OLS}. The factor K^{−1}(K−p+1) is a finite sample correction factor. Without this correction, we can show that, up to smaller order terms,

P(F_{T,OLS} < χ^α_p) = G_p(χ^α_p) + (K²/T²) G′_p(χ^α_p) χ^α_p B̄ − (1/K) G′_p(χ^α_p) χ^α_p (p − 1) + (1/K) G″_p(χ^α_p) (χ^α_p)².

Comparing this with (15), we find that the above expansion has an additional term −K^{−1}G′_p(χ^α_p)χ^α_p(p − 1). For any given critical value χ^α_p, this term is negative and grows with p, the number of restrictions in the hypothesis. As a result, the error in the rejection probability, or the error in the coverage probability, tends to be larger for larger p. This explains why conventional confidence regions tend to have large under-coverage when the dimension of the problem is high.

In the rest of the paper, we use the finite sample corrected Wald statistic

F*_{T,OLS} = [(K−p+1)/K] F_{T,OLS}

and employ the critical value pF^α_{p,K−p+1} to perform our test. For convenience, we refer to F*_{T,OLS} as the F* statistic and to the test as the F* test. F*_{T,OLS} can be viewed as the standard Wald statistic but using the following estimator of RΩR′:

R̃ΩR′ = [1/(K−p+1)] Σ_{k=1}^K R Ω̂_k R′.

So the finite sample correction factor (K−p+1)/K can be viewed as a degrees-of-freedom adjustment. The following theorem gives the type I and type II errors of the F* test.

Theorem 5. Let Assumption 1 hold and assume that ε_t ∼ iid N(0, Σ). If K → ∞ such that K/T → 0, then:

(a) The type I error of the F* test is

P( F*_{T,OLS} > pF^α_{p,K−p+1} ) = α − (K²B̄/T²) G′_p(χ^α_p) χ^α_p + o(K²/T²) + o(1/K) + O(1/T).  (17)

(b) Under the local alternative H₁(δ²): Rβ = r + (12RΩR′)^{1/2} c̃/T^{3/2}, where c̃ is uniformly distributed on the sphere S_p(δ) = {x ∈ R^p : ‖x‖ = δ}, the type II error of the F* test is

P( F*_{T,OLS} < pF^α_{p,K−p+1} | H₁(δ²) ) = G_{p,δ²}(χ^α_p) + (K²B̄/T²) G′_{p,δ²}(χ^α_p) χ^α_p + (1/K) Q_{p,δ²}(χ^α_p) (χ^α_p)² + o(K²/T²) + o(1/K) + O(1/T),  (18)

where G_{p,δ²}(·) and G′_{p,δ²}(·) are the CDF and pdf of the noncentral χ² distribution with p degrees of freedom and noncentrality parameter δ², and

Q_{p,δ²}(z) = G″_{p,δ²}(z) − [G″_p(z)/G′_p(z)] G′_{p,δ²}(z) = (δ²/(2z)) G′_{(p+2),δ²}(z).

Theorem 5(a) follows from Theorem 4. The uniformity of c̃ on a sphere enables us to use a similar argument to prove Theorem 5(b). A key point in the proof of Theorem 4 is that e_β is uniformly distributed on the unit sphere S_p(1), which follows from the rotation invariance of the multivariate standard normal distribution. The uniformity of c̃ ensures that the same property holds for the corresponding statistic

e_{βδ} = [ (12 R Ω_{T,GLS} R′)^{−1/2} R T^{3/2}(β̂_GLS − β) + c̃ ] / ‖ (12 R Ω_{T,GLS} R′)^{−1/2} R T^{3/2}(β̂_GLS − β) + c̃ ‖

under the local alternative hypothesis.

The quantity Q_{p,δ²}(χ^α_p) reflects the difference in the curvatures of the two CDFs G_p(z) and G_{p,δ²}(z) at the point z = χ^α_p. When we use the second-order correct critical value pF^α_{p,K−p+1}, the variance term is removed under the null. However, due to the difference in curvatures, a variance term remains under the local alternative hypothesis. The O(1/K) term in Theorem 5(b) captures this effect. Since Q_{p,δ²}(z) > 0 for all z > 0, this term decreases monotonically as K increases. According to this term alone, the value of K should be chosen as large as possible. This is not surprising: in order to improve the power of the F* test, we should minimize the randomness of the LRV estimator, which calls for a large K value. However, a large K value may produce a large bias, which may lead to power loss or size distortion. In the next section, we show that there is an opportunity to select K to trade off the bias effect and the variance effect on the size and power properties.
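The trade-off described by Theorem 5 can be tabulated directly from the error expansions. The sketch below evaluates the approximate type I and type II errors on a grid of K; the values of B̄ and δ² are treated as known inputs here, whereas in practice both must be estimated, so the numbers used are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def error_approximations(K, T, p, delta2, B_bar, alpha=0.05):
    """Approximate type I and type II errors of the F* test from Theorem 5.

    Uses e_I = alpha - (K^2 B_bar / T^2) G'_p(x) x and the corresponding
    e_II expansion, with x = chi2 critical value; B_bar, delta2 assumed known.
    """
    chi_a = stats.chi2.ppf(1 - alpha, p)
    gp = stats.chi2.pdf(chi_a, p)                   # G'_p
    gp_d = stats.ncx2.pdf(chi_a, p, delta2)         # G'_{p, delta^2}
    gp2_d = stats.ncx2.pdf(chi_a, p + 2, delta2)    # G'_{(p+2), delta^2}
    bias = K ** 2 * B_bar / T ** 2 * chi_a
    e1 = alpha - bias * gp
    e2 = (stats.ncx2.cdf(chi_a, p, delta2) + bias * gp_d
          + delta2 / (2 * K) * gp2_d * chi_a)
    return e1, e2

for K in (4, 8, 16, 32, 64):
    e1, e2 = error_approximations(K, T=500, p=2, delta2=10.0, B_bar=-0.5)
    print(f"K={K:3d}: type I ~ {e1:.4f}, type II ~ {e2:.4f}")
```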
6. Optimal smoothing parameter selection

In this section, we provide a novel approach to smoothing parameter selection that is most suitable for semiparametric testing.

6.1. Optimal K formula

In view of the asymptotic expansion in (17) and ignoring the higher-order terms, we can approximate the type I error of the F* test by

e_I = α − (K²B̄/T²) G′_p(χ^α_p) χ^α_p.

Similarly, from (18), the type II error of the F* test can be approximated by

e_II = G_{p,δ²}(χ^α_p) + (K²B̄/T²) G′_{p,δ²}(χ^α_p) χ^α_p + (1/K)(δ²/2) G′_{(p+2),δ²}(χ^α_p) χ^α_p.

We choose K to minimize the approximate type II error while controlling for the approximate type I error. More specifically, we solve

min_K e_II,  s.t. e_I ≤ κα,

where κ is a constant greater than 1. Ideally, the type I error is less than or equal to the nominal type I error α. In finite samples, there is always some approximation error, and we allow for some discrepancy by introducing the tolerance factor κ. For example, when α = 5% and κ = 1.2, we aim to control the type I error such that it is not greater than 6%. We may allow κ to depend on the sample size T. For a larger sample size, we may require κ to take smaller values.

Note that both the type I and type II errors depend on the asymptotic bias of the estimator RΩ̂R′ through B̄, the relative bias of estimating the variance of RT^{3/2}(β̂_OLS − β). Our testing-oriented criterion is in sharp contrast with the MSE criterion, which depends on a quadratic form of the asymptotic bias of Ω̂. In large samples, the quadratic form is of smaller order than the bias itself. So for testing problems, it is more important to reduce the bias of the LRV estimator than it is for point estimation of the LRV matrix. In addition, the quadratic form is invariant to the sign of B: the MSE-optimal K is the same for B and −B. In contrast, for the testing-optimal K, the sign of B (and hence that of B̄) is of vital importance, as shown below.

The solution to the minimization problem depends on the sign of B̄. When B̄ > 0, the constraint e_I ≤ κα is not binding and we have the unconstrained minimization problem min_K e_II. The optimal K is

K_opt = [δ² G′_{(p+2),δ²}(χ^α_p) / (4B̄ G′_{p,δ²}(χ^α_p))]^{1/3} T^{2/3}.  (19)

When B̄ < 0, the constraint e_I ≤ κα may be binding and we have to use the Kuhn–Tucker theorem to search for the optimum. Let λ be the Lagrange multiplier, and define

L(K, λ) = G_{p,δ²}(χ^α_p) + (K²B̄/T²) G′_{p,δ²}(χ^α_p) χ^α_p + (1/K)(δ²/2) G′_{(p+2),δ²}(χ^α_p) χ^α_p + λ[α − (K²B̄/T²) G′_p(χ^α_p) χ^α_p − κα].  (20)

It is easy to show that at the optimal K, the constraint e_I ≤ κα is indeed binding and λ > 0. Hence, the optimal K is

K_opt = [(κ − 1)α / (|B̄| G′_p(χ^α_p) χ^α_p)]^{1/2} T,  (21)

and the corresponding Lagrange multiplier is

λ_opt = G′_{p,δ²}(χ^α_p)/G′_p(χ^α_p) + |B̄|^{1/2} δ² G′_{(p+2),δ²}(χ^α_p) (χ^α_p)^{3/2} [G′_p(χ^α_p)]^{1/2} / (4[(κ − 1)α]^{3/2} T).

Formulas (19) and (21) can be written collectively as

K_opt = {δ² G′_{(p+2),δ²}(χ^α_p) / [4B̄ (G′_{p,δ²}(χ^α_p) − λ_opt G′_p(χ^α_p))]}^{1/3} T^{2/3},  (22)

where λ_opt is as given in Box I:

λ_opt = 0,  if B̄ > 0;
λ_opt = G′_{p,δ²}(χ^α_p)/G′_p(χ^α_p) + |B̄|^{1/2} δ² G′_{(p+2),δ²}(χ^α_p) (χ^α_p)^{3/2} [G′_p(χ^α_p)]^{1/2} / (4[(κ − 1)α]^{3/2} T),  if B̄ < 0.
(Box I)

The function L(K, λ) is a weighted sum of the type I and type II errors with weight given by the optimal Lagrange multiplier. When the size distortion is expected to be negative, the optimal Lagrange multiplier is zero and we assign all weight to the type II error. In this case, the expansion rate of the optimal K is O(T^{2/3}). When the size distortion is expected to be positive, the Lagrange multiplier is positive. In this case, the loss function is a genuine weighted sum of type I and type II errors. The optimal K has an expansion rate that increases with the tolerance on the type I error. When the permitted tolerance is very low, so that κ − 1 ∼ 1/T², the optimal K is bounded. The fixed-K rule can therefore be interpreted as assigning increasingly more weight to the type I error as the sample size increases. On the other hand, when the permitted tolerance is high, so that κ − 1 = O(1), the optimal K has an expansion rate of O(T), which is faster than the MSE-optimal expansion rate.

All else being equal, the optimal K decreases with |B̄|. This is expected, as the asymptotic bias of Ω̂ increases with both K and |B̄|. When |B̄| is large, we should choose a small K to offset the bias effect. The formula for K_opt depends on the noncentrality parameter δ². For practical implementation, we suggest choosing δ² such that the first-order power of the test, as measured by 1 − G_{p,δ²}(χ^α_p), is 75%. That is, we solve 1 − G_{p,δ²}(χ^α_p) = 75% for a given p and a given significance level α. As usual, we consider α = 5% and 10%. The value of δ² can be easily computed using standard statistical programs. Since K is an integer greater than or equal to p, in practice we take max{⌈K_opt⌉, p} as the K value, where ⌈·⌉ is the ceiling function.

To sum up, when the size distortion is expected to be negative, the expansion rate of the optimal K is O(T^{2/3}). When the size distortion is expected to be positive, the optimal K has an expansion rate that increases with the tolerance on the type I error. The expansion rate can range from O(1) when the permitted tolerance is very low to O(T) when the permitted tolerance is very high.
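As a concrete illustration of the recipe above, the following sketch (hypothetical inputs B_bar and T; it assumes only scipy's chi2 and ncx2 routines, where the ncx2 noncentrality argument equals δ²) solves 1 − G_{p,δ²}(χ^α_p) = 75% for δ² and evaluates the K_opt formulas (19) and (21):

import numpy as np
from scipy.stats import chi2, ncx2
from scipy.optimize import brentq

alpha, p, kappa, T = 0.05, 2, 1.1, 300
chi_a = chi2.ppf(1 - alpha, p)

# delta^2 solving 1 - G_{p,delta^2}(chi^alpha_p) = 0.75
delta2 = brentq(lambda d2: 1 - ncx2.cdf(chi_a, p, d2) - 0.75, 1e-6, 100.0)

def K_opt(B_bar):
    g1 = ncx2.pdf(chi_a, p, delta2)          # G'_{p,delta^2}(chi^alpha_p)
    g2 = ncx2.pdf(chi_a, p + 2, delta2)      # G'_{(p+2),delta^2}(chi^alpha_p)
    if B_bar > 0:                            # constraint not binding, Eq. (19)
        K = (delta2 * g2 / (4 * B_bar * g1)) ** (1 / 3) * T ** (2 / 3)
    else:                                    # binding constraint, Eq. (21)
        gp = chi2.pdf(chi_a, p)
        K = np.sqrt((kappa - 1) * alpha / (abs(B_bar) * gp * chi_a)) * T
    return max(int(np.ceil(K)), p)           # enforce K >= p as in the text

print(delta2, K_opt(0.5), K_opt(-0.5))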
6.2. Data-driven implementation

The optimal K in (22) depends on the data generating process only through the parameter B̄. We can therefore write K_opt = K_opt(B̄). The unknown parameter B̄ can be estimated by a standard plug-in procedure based on a simple parametric model like a VAR (e.g. Andrews (1991)). More specifically, the plug-in procedure involves the following steps. First, we estimate the model using the OLS estimator and compute the residuals û_t. Second, we specify a multivariate approximating parametric model and fit the model to û_t by the standard OLS method. Third, we treat the fitted model as if it were the true model for the process {u_t} and compute B̄ as a function of the parameters of the parametric model. Plugging the estimate of B̄ into (22) gives the automatic bandwidth K̂.

Suppose we use a VAR(1) as the approximating parametric model for u_t. Let Â be the estimated parameter matrix and Σ̂ be the estimated innovation covariance matrix. Then the plug-in estimates of Ω and B are

Ω̂ = (I_n − Â)^{−1} Σ̂ (I_n − Â′)^{−1},  (23)

B̂ = −(2π²/3)(I_n − Â)^{−3} [ÂΣ̂ + Â²Σ̂Â′ + Â²Σ̂ − 6ÂΣ̂Â′ + Σ̂(Â′)² + ÂΣ̂(Â′)² + Σ̂Â′] (I_n − Â′)^{−3}.  (24)

For the plug-in estimates under a general VAR(p) model, we refer to Andrews (1991) for the corresponding formulas. Given the plug-in estimates of Ω and B, the data-driven automatic bandwidth can be computed as

K̂*_opt = max{K̂_opt(B̄(R, B̂, Ω̂)), p}.  (25)
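A minimal sketch of the plug-in steps (23)–(24) might look as follows; u_hat (the T × n matrix of detrending residuals) is an assumed input, and the ordering of the terms inside the bracket of B follows the reconstruction of (24) above:

import numpy as np

def var1_plugin(u_hat):
    # Fit a VAR(1) by OLS: u_t = A u_{t-1} + eps_t
    Y, X = u_hat[1:], u_hat[:-1]
    A = np.linalg.lstsq(X, Y, rcond=None)[0].T
    eps = Y - X @ A.T
    Sigma = eps.T @ eps / len(eps)               # innovation covariance
    n = A.shape[0]
    I = np.eye(n)
    M = np.linalg.inv(I - A)
    Omega = M @ Sigma @ M.T                      # Eq. (23)
    inner = (A @ Sigma + A @ A @ Sigma @ A.T + A @ A @ Sigma
             - 6 * A @ Sigma @ A.T + Sigma @ A.T @ A.T
             + A @ Sigma @ A.T @ A.T + Sigma @ A.T)
    B = -(2 * np.pi ** 2 / 3) * np.linalg.matrix_power(M, 3) @ inner \
        @ np.linalg.matrix_power(np.linalg.inv(I - A.T), 3)  # Eq. (24)
    return Omega, B

In the scalar AR(1) case this bracket reduces to 2ρ(1 − ρ)²σ², so B = −(4π²/3)ρσ²/(1 − ρ)⁴, which agrees with −(2π²/3)Σ_h h²Γ_u(h); this consistency check supports the reconstruction of (24).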
7. Simulation evidence

This section provides some simulation evidence on the finite sample performance of the F* test based on the plug-in procedure that minimizes the type II error while controlling for the type I error. As in Vogelsang and Franses (2005), we set n = 6. The error follows either a VAR(1) or a VMA(1) process:

u_t = Au_{t−1} + √(1 − ρ²) ε_t  or  u_t = Aε_{t−1} + √(1 − ρ²) ε_t,

where A = ρI_n, ε_t = (v_{1t} + μf_t, v_{2t} + μf_t, …, v_{nt} + μf_t)′/√(1 + μ²), and (v_t, f_t)′ is a Gaussian multivariate white noise process with unit variance. Under this specification, the six time series all follow the same VAR(1) or VMA(1) process with ε_t ∼ iid N(0, Σ) for

Σ = [1/(1 + μ²)] I_n + [μ²/(1 + μ²)] J_n,

where J_n is a matrix of ones. The parameter μ determines the degree of cross-dependence among the time series considered. When μ = 0, the six series are uncorrelated with each other. When μ = 1, the six series have the same pairwise correlation coefficient of 0.5. The variance–covariance matrix of u_t is normalized so that the variance of each series u_{it} is equal to one for all values of |ρ| < 1. For the VAR(1) process, Ω = (1 − ρ²)(I_n − A)^{−1}Σ(I_n − A′)^{−1}. For the VMA(1) process, Ω = (1 − ρ²)(I_n + A/√(1 − ρ²))Σ(I_n + A/√(1 − ρ²))′. For the model parameters, we take ρ = 0, 0.25, 0.50, 0.75 and set μ = 0 and 1. We set the intercepts and slopes to zero, as the tests we consider are invariant to those parameters.
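For concreteness, here is a sketch of the VAR(1) error DGP just described, with the unit-variance normalization built in (the function name and seed handling are illustrative, not from the paper):

import numpy as np

def simulate_var1(T, n=6, rho=0.5, mu=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Sigma = (np.eye(n) + mu ** 2 * np.ones((n, n))) / (1 + mu ** 2)
    L = np.linalg.cholesky(Sigma)
    u = np.zeros((T, n))
    scale = np.sqrt(1 - rho ** 2)           # normalizes var(u_it) to one
    for t in range(1, T):
        eps = L @ rng.standard_normal(n)    # common-factor innovations
        u[t] = rho * u[t - 1] + scale * eps
    return u  # a burn-in period would be added in a careful implementation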
For each test, we consider two significance levels, α = 5% and α = 10%, two choices of the tolerance parameter, κ = 1.1 and 1.2, and two sample sizes, T = 300 and 500. As in Vogelsang and Franses (2005), we consider the following null hypotheses: H01: β1 = 0; H02: β1 = β2 = 0; H03: β1 = β2 = β3 = 0; H04: β1 = β2 = ⋯ = β6 = 0, where p = 1, 2, 3, 6, respectively. The corresponding matrix R is the first p rows of the identity matrix I6. To explore the finite sample size of the tests, we generate data under these null hypotheses. To compare the power of the tests, we generate data under the local alternative hypothesis H₁(δ²).

We examine the finite sample performance of three different testing methods. The first is the new F* test, which is based on the modified Wald statistic and the testing-optimal K and uses the F-distribution as the reference distribution. The second is the conventional Wald test, which is based on the unmodified Wald statistic and the MSE-optimal K and uses the χ²_p distribution as the reference distribution. The last is the test proposed by Vogelsang and Franses (2005), which is based on the Bartlett kernel LRV estimator with bandwidth equal to the sample size and uses the nonstandard asymptotic theory. The three methods are referred to as 'New', 'MSE', and 'VF', respectively, in the tables and figures below.

Table 1 gives the empirical type I error of the three testing methods for the VAR(1) error with sample size T = 300, tolerance parameter κ = 1.1, and μ = 1. The table also includes a hybrid procedure that employs the MSE-optimal K and critical values from the F-distribution. The only difference between the conventional method and the hybrid method lies in the critical values used. More specifically, let K̂_mse be the plug-in estimate of the MSE-optimal K given in (11) and F̂_{T,OLS}(K̂_mse) be the associated Wald statistic. The hybrid method rejects the null if F̂_{T,OLS}(K̂_mse) is larger than the critical value (pK̂_mse)(K̂_mse − p + 1)^{−1} F^α_{p,K̂_mse−p+1}, where F^α_{p,K̂_mse−p+1} is the α-level critical value of the F distribution F_{p,K̂_mse−p+1}. In contrast, the conventional method uses critical values from the χ²_p distribution. The significance level is 5%, which is also the nominal type I error.

Several patterns emerge. First, as is clear from the table, the conventional method has a large size distortion. The size distortion increases with both the error dependence and the number of restrictions being jointly tested. This result is consistent with our theoretical analysis. The size distortion can be very severe. For example, when ρ = 0.75 and p = 6, the empirical type I error of the test is 0.5328, which is far from 0.05, the nominal type I error. Using the F critical values eliminates the distortion to a great extent.
[Fig. 1. Size-adjusted power of different testing procedures for VAR(1) error with T = 300, κ = 1.1 and p = 1. Panels: (a) ρ = 0; (b) ρ = 0.25; (c) ρ = 0.5; (d) ρ = 0.75.]

Table 1
Type I error of different tests for VAR(1) error with T = 300, κ = 1.1 and μ = 1.

           New      MSE      Hybrid   VF         New      MSE      Hybrid   VF
           p = 1                                 p = 2
ρ = 0      0.0517   0.0549   0.0508   0.0491     0.0505   0.0609   0.0499   0.0477
ρ = 0.25   0.0571   0.0681   0.0590   0.0519     0.0614   0.0814   0.0601   0.0521
ρ = 0.50   0.0579   0.0784   0.0619   0.0557     0.0622   0.1023   0.0659   0.0571
ρ = 0.75   0.0637   0.0976   0.0665   0.0627     0.0685   0.1492   0.0682   0.0717
           p = 3                                 p = 6
ρ = 0      0.0505   0.0666   0.0495   0.0497     0.0504   0.0949   0.0508   0.0484
ρ = 0.25   0.0647   0.0965   0.0611   0.0552     0.0815   0.1645   0.0664   0.0602
ρ = 0.50   0.0704   0.1353   0.0685   0.0636     0.0936   0.2730   0.0753   0.0787
ρ = 0.75   0.0781   0.2236   0.0731   0.0881     0.1124   0.5328   0.0776   0.1401

This is especially true when the size distortion is large. Intuitively, larger size distortion occurs when K is smaller, so that the LRV estimator has larger variation. This is the scenario in which the difference between the F critical values and the χ² critical values is larger. Second, the size distortion of the new method and the VF method is substantially smaller than that of the conventional method. This is because both tests employ asymptotic approximations that capture the estimation uncertainty of the LRV estimator. The smaller size distortion of the new method is also consistent with that of the hybrid method, as both are based on F-approximations. Third, compared with the VF method, the new method has similar size distortion. Since the bandwidth is set equal to the sample size, the VF method is designed to achieve the smallest possible size distortion. Given this observation, we can conclude that the new method succeeds in controlling the type I error.

Due to the approximation error, the bound we impose on the approximate type I error does not fully control the empirical type I error. This is demonstrated in Table 1. The quality of the approximation depends on the persistence of the time series. When the time series is highly persistent, the first-order asymptotic bias of the LRV estimator may not approximate the finite sample bias very well. As a result, the approximate type I error, which is based on the first-order asymptotic bias, may not fully capture the empirical type I
error. So it is important to keep in mind that the empirical type I error may still be larger than the nominal type I error even if we exert some control over the approximate type I error.

Figs. 1–4 present the finite sample power under the VAR(1) error for different values of p. We compute the power using the 5% empirical finite sample critical values obtained from the null distribution. So the finite sample power is size-adjusted and power comparisons are meaningful. The parameter configuration is the same as that for Table 1, except that the DGP is generated under the local alternatives. Two observations can be drawn from these figures. First, the new test has higher power than the VF test in most cases, except when the error dependence is very high and the number of restrictions being jointly tested is large. When the error dependence is low, the selected K value is relatively large and the variance of the associated LRV estimator is small. In contrast, the LRV estimator used in the VF test is inconsistent and is therefore expected to have a large variance. As a result, the new test is more powerful than the VF test. On the other hand, when the error dependence is high, the selected K values are small. In this case, both the VF test and the new test employ an LRV estimator with large variance. The VF test can be more powerful in this scenario. Second, the new test is as powerful as the conventional Wald test. This result is encouraging, as the size accuracy of a test is often achieved at the cost of sizable power loss.
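The size-adjusted power computation just described amounts to the following small sketch (stats_null and stats_alt are assumed arrays of simulated test statistics under the null and under the local alternative):

import numpy as np

def size_adjusted_power(stats_null, stats_alt, level=0.05):
    # empirical critical value taken from the simulated null distribution
    crit = np.quantile(stats_null, 1 - level)
    return np.mean(stats_alt > crit)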
[Fig. 2. Size-adjusted power of different testing procedures for VAR(1) error with T = 300, κ = 1.1 and p = 2. Panels: (a) ρ = 0; (b) ρ = 0.25; (c) ρ = 0.5; (d) ρ = 0.75.]

[Fig. 3. Size-adjusted power of different testing procedures for VAR(1) error with T = 300, κ = 1.1 and p = 3. Panels: (a) ρ = 0; (b) ρ = 0.25; (c) ρ = 0.5; (d) ρ = 0.75.]
An example is the VF test. While it has more accurate size than the corresponding Wald test based on the Bartlett kernel LRV estimator, it is also less powerful. See Vogelsang and Franses (2005) for details. To shed further light on the size and power properties of the new test and the corresponding Wald test under the conventional asymptotics, we present the mean and median of the selected K
values for sample size T = 300 in Table 2. The (sample) mean and median are computed over 10,000 simulation replications. It is clear that for both the testing-oriented criterion and the MSE criterion, the mean and median of the selected K value decrease with the error dependence. While the sample mean and median of the MSE-optimal K are about the same, the sample mean of the testing-optimal K is greater than its sample median, implying that the testing-optimal K has relatively few high values.
[Fig. 4. Size-adjusted power of different testing procedures for VAR(1) error with T = 300, κ = 1.1 and p = 6. Panels: (a) ρ = 0; (b) ρ = 0.25; (c) ρ = 0.5; (d) ρ = 0.75.]

Table 2
The selected K values based on the testing-oriented criterion and the MSE criterion for VAR(1) error with T = 300, κ = 1.1, and μ = 1.

           K̂test, p = 1    K̂test, p = 2    K̂test, p = 3    K̂test, p = 6    K̂mse
           Mean   Median   Mean   Median   Mean   Median   Mean   Median   Mean   Median
ρ = 0      53     44       62     52       62     53       53     50       63     62
ρ = 0.25   29     24       33     27       37     29       45     37       36     36
ρ = 0.50   16     13       17     14       19     14       25     20       21     21
ρ = 0.75   9      6        9      6        11     7        15     12       11     10
When the number of constraints is small, e.g. p = 1, 2, 3, the testing-optimal K is smaller than the MSE-optimal K. This explains why the type I error of the new test is smaller than or about the same as that of the hybrid test. When the number of constraints is 6, the testing-optimal K is larger, which explains the higher size-adjusted power of the new test as compared to the hybrid test. When the sample size increases to 500, the testing-optimal K becomes smaller than the MSE-optimal K for all values of p considered. In this case, the new test has smaller size distortion than the hybrid test for all parameter configurations considered but is also slightly less powerful.

Table 3
Type I error of different tests for VMA(1) error with T = 300 and κ = 1.1.

           New      MSE      Hybrid   VF         New      MSE      Hybrid   VF
           p = 1                                 p = 2
ρ = 0      0.0517   0.0549   0.0508   0.0491     0.0505   0.0609   0.0499   0.0477
ρ = 0.25   0.0546   0.0627   0.0540   0.0510     0.0537   0.0727   0.0522   0.0511
ρ = 0.50   0.0516   0.0655   0.0539   0.0522     0.0522   0.0804   0.0514   0.0514
ρ = 0.75   0.0502   0.0657   0.0531   0.0531     0.0517   0.0827   0.0511   0.0513
           p = 3                                 p = 6
ρ = 0      0.0505   0.0666   0.0495   0.0497     0.0504   0.0949   0.0508   0.0484
ρ = 0.25   0.0561   0.0850   0.0534   0.0531     0.0661   0.1430   0.0579   0.0560
ρ = 0.50   0.0561   0.0989   0.0543   0.0544     0.0644   0.1940   0.0553   0.0599
ρ = 0.75   0.0567   0.1066   0.0541   0.0544     0.0645   0.2154   0.0560   0.0615

Table 3 presents the simulated type I errors for the VMA(1) error process. The qualitative observations for the VAR(1) error
remain valid. In fact, these qualitative observations hold for other parameter configurations, such as different sample sizes and different values of μ. All else being equal, the size distortion of the new method for κ = 1.2 is slightly larger than that for κ = 1.1. This is expected, as we have a higher tolerance for the type I error when the value of κ is larger. Figs. 5–8 present the power curves under the VMA(1) error. The figures reinforce and strengthen the observations for the VAR(1) error. It is clear now that the new test is more powerful than the VF test and is as powerful as the conventional Wald test based on the MSE-optimal K and the χ² approximation. This is true for all parameter combinations considered. In simulations not reported here, we have considered VAR(1) and VMA(1) errors with negative values of ρ and hypotheses of
[Fig. 5. Size-adjusted power of different testing procedures for VMA(1) error with T = 300, κ = 1.1 and p = 1. Panels: (a) ρ = 0; (b) ρ = 0.25; (c) ρ = 0.5; (d) ρ = 0.75.]

[Fig. 6. Size-adjusted power of different testing procedures for VMA(1) error with T = 300, κ = 1.1 and p = 2. Panels: (a) ρ = 0; (b) ρ = 0.25; (c) ρ = 0.5; (d) ρ = 0.75.]
the form β1 = β2 = ⋯ = β_{j₀} for some j₀ ≥ 2. For some of these configurations, B̄ > 0. Regardless of the sign of B̄, the new F* test is often as powerful as, albeit sometimes slightly less powerful than, the conventional Wald test we consider here. On the other hand, the new F* test is much more accurate in size than the Wald test. In terms of the type I error and size-adjusted power, the new F* test dominates the VF test in an overall sense. Compared to the
hybrid test, the new F* test achieves a smaller type I error for a medium sample size at the cost of a very small power loss.

8. Conclusion

The paper proposes a novel approach to multivariate trend inference in the presence of nonparametric autocorrelation. The
[Fig. 7. Size-adjusted power of different testing procedures for VMA(1) error with T = 300, κ = 1.1 and p = 3. Panels: (a) ρ = 0; (b) ρ = 0.25; (c) ρ = 0.5; (d) ρ = 0.75.]

[Fig. 8. Size-adjusted power of different testing procedures for VMA(1) error with T = 300, κ = 1.1 and p = 6. Panels: (a) ρ = 0; (b) ρ = 0.25; (c) ρ = 0.5; (d) ρ = 0.75.]
inference procedure is based on a series-type LRV estimator. Compared to the conventional kernel-type LRV estimators, the series LRV estimator enjoys two advantages. First, it is asymptotically invariant to the intercept and trend parameters. This property releases us from worrying about the estimation uncertainty of those parameters. Second, the associated (modified) Wald statistic converges to a standard distribution regardless of the asymptotic specification of the smoothing parameter. This property releases practitioners from the computational burden of simulating nonstandard critical values.

As a primary contribution of this paper, we propose a new method to select the smoothing parameter in the series LRV estimator. The optimal smoothing parameter is selected to minimize the type II error, and hence maximize the power of the test, while
controlling for the type I error. This new selection criterion is fundamentally different from the MSE criterion for the point estimation of the LRV. Depending on the permitted tolerance on the type I error, the expansion rate of the testing-optimal smoothing parameter may be larger or smaller than that of the MSE-optimal smoothing parameter. The fixed smoothing parameter rule can be interpreted as exerting increasingly tight control on the type I error. Monte Carlo experiments show that the size of the new testing procedure is as accurate as the nonstandard test of Vogelsang and Franses (2005) with bandwidth equal to the sample size. It is also as powerful as the conventional Wald test that is based on the series LRV estimator and uses the MSE-optimal smoothing parameter.

The idea of testing-optimal smoothing parameter choice can be extended to the usual kernel HAC estimator. Sun (2010) considers kernel HAC estimation in a general GMM framework and develops a testing-optimal procedure for smoothing parameter choice. The method in Sun (2010) can be adopted for trend estimation and inference, leading to a testing-oriented bandwidth choice for the VF type test.

Acknowledgments
The author gratefully acknowledges partial research support from NSF under Grant No. SES-0752443. He thanks Peter Robinson, the coeditor and two anonymous referees for helpful comments and suggestions.

Appendix. Appendix of proofs

Proof of Theorem 2. Part (a). Let x̄ denote the sample mean of a sequence x_1, …, x_T. Then

(1/√T) Σ_{t=1}^{T} φ_k(t/T) û_t = (1/√T) Σ_{t=1}^{T} φ_k(t/T)[u_t − ū − (t − t̄)(β̂ − β)] = (1/√T) Σ_{t=1}^{T} φ̌_k(t/T) u_t,

where

φ̌_k(t/T) = φ_k(t/T) − (1/T) Σ_{s=1}^{T} φ_k(s/T) − [Σ_{s=1}^{T} φ_k(s/T)(s − s̄) / Σ_{s=1}^{T} (s − s̄)²] (t − t̄).

Some calculations show that

(1/T) Σ_{s=1}^{T} φ_k(s/T) = 0,  Σ_{s=1}^{T} φ_k(s/T)(s − s̄) = T/√2,  Σ_{s=1}^{T} (s − s̄)² = T(T² − 1)/12.

Hence

φ̌_k(t/T) = φ_k(t/T) − [6√2 T/(T² − 1)](t/T − 1/2 − 1/(2T)) = φ_k(t/T) + O(1/T),

uniformly in 1 ≤ t ≤ T. Define

Ω̃ = (1/K) Σ_{k=1}^{K} [(1/√T) Σ_{t=1}^{T} φ_k(t/T) u_t][(1/√T) Σ_{s=1}^{T} φ_k(s/T) u_s]′.  (26)

It is easy to see that the bias and variance of Ω̃ are the same as those of Ω̂ up to order O(1/T). Therefore it suffices to compute the bias and variance of Ω̃.

It follows from Eq. (22) in Phillips (2005) that

EΩ̃ − Ω = (1/K) Σ_{k=1}^{K} Σ_{h=−L_T}^{L_T} (1/T) Σ_{1≤t,t+h≤T} φ_k(t/T)[φ_k((t + h)/T) − φ_k(t/T)] Γ_u(h) + o(K²/T²) + O(1/T),

where Γ_u(h) = E(u_t u′_{t−h}) and the truncation lag L_T → ∞ satisfies

L_T K/T + L_T^{3/2} K/T = o(1).  (27)

Next consider

φ_k(t/T) φ_k((t + h)/T) = cos(2πkh/T) + cos(2πk(2t + h)/T),  φ_k(t/T)² = 1 + cos(4πkt/T),  (28)

so that

(1/T) Σ_{1≤t,t+h≤T} φ_k(t/T)[φ_k((t + h)/T) − φ_k(t/T)] = (1/T) Σ_{1≤t,t+h≤T} [cos(2πkh/T) − 1 + cos(4πk(t + h/2)/T) − cos(4πkt/T)].  (29)

Taking the first term of (29), averaging over k, and using the facts that |h| ≤ L_T and that L_T satisfies (27), we get

(1/K) Σ_{k=1}^{K} [cos(2πkh/T) − 1] = −(1/K) Σ_{k=1}^{K} (1/2)(2πkh/T)² (1 + o(1)) = −(2π²h²/3)(K²/T²)(1 + o(1)).

Approximating the sums by integrals, we can show that

(1/K) Σ_{k=1}^{K} (1/T) Σ_{1≤t,t+h≤T} [cos(4πk(t + h/2)/T) − cos(4πkt/T)] = o(K²h²/T²)

uniformly over h ∈ [−L_T, L_T]. Hence

(1/K) Σ_{k=1}^{K} (1/T) Σ_{1≤t,t+h≤T} φ_k(t/T)[φ_k((t + h)/T) − φ_k(t/T)] = −(2π²h²/3)(K²/T²)(1 + o(1)) + O(1/T).

So

EΩ̃ − Ω = −[2π²K²/(3T²)] Σ_{h=−L_T}^{L_T} (1 − |h|/T) h² Γ_u(h) + o(K²/T²) + O(1/T) = −[2π²K²/(3T²)] Σ_{h=−∞}^{∞} h² Γ_u(h) + o(K²/T²) + O(1/T),

as desired.

Part (b). Using the device in Phillips and Solo (1992), we have the BN decomposition

u_t = C(1)ε_t + ũ_{t−1} − ũ_t,  for ũ_t = Σ_{j=0}^{∞} C̃_j ε_{t−j},  C̃_j = Σ_{s=j+1}^{∞} C_s.

Plugging this into the definition of Ω̃ yields Ω̃ = R_1 + R_2 + R_2′ + R_3, where

R_1 = (1/K) Σ_{k=1}^{K} [(1/√T) Σ_{t=1}^{T} φ_k(t/T) C(1)ε_t][(1/√T) Σ_{τ=1}^{T} φ_k(τ/T) C(1)ε_τ]′,
R_2 = (1/K) Σ_{k=1}^{K} [(1/√T) Σ_{t=1}^{T} φ_k(t/T) C(1)ε_t][(1/√T) Σ_{τ=1}^{T} φ_k(τ/T)(ũ_{τ−1} − ũ_τ)]′,
R_3 = (1/K) Σ_{k=1}^{K} [(1/√T) Σ_{t=1}^{T} φ_k(t/T)(ũ_{t−1} − ũ_t)][(1/√T) Σ_{τ=1}^{T} φ_k(τ/T)(ũ_{τ−1} − ũ_τ)]′.

It is not hard but tedious to show that var(vec(Ω̃)) = var(vec(R_1)) + O(1/T). To save space, we omit the details, but they are available upon request. Now

var(vec(R_1)) = [1/(K²T²)] Σ_{k=1}^{K} Σ_{ℓ=1}^{K} Σ_{t,τ,p,q=1}^{T} φ_k(t/T) φ_k(τ/T) φ_ℓ(p/T) φ_ℓ(q/T) cov(vec(C(1)ε_t ε′_τ C(1)′), vec(C(1)ε_p ε′_q C(1)′))
= (Ω ⊗ Ω)(I_{n²} + K_{nn}) (1/K²) Σ_{k=1}^{K} Σ_{ℓ=1}^{K} [(1/T) Σ_{t=1}^{T} φ_k(t/T) φ_ℓ(t/T)]² (1 + o(1))
= (1/K)(Ω ⊗ Ω)(I_{n²} + K_{nn})(1 + o(1)),

where the last step uses the orthonormality of the basis functions, in particular [(1/T) Σ_{t=1}^{T} φ_k(t/T)²]² = 1 + o(1). Therefore

var(vec(Ω̃)) = (1/K)(Ω ⊗ Ω)(I_{n²} + K_{nn})(1 + o(1)) + O(1/T),

as stated. □
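For reference, a minimal sketch of the series estimator (26) with the cosine basis φ_k(x) = √2 cos(2πkx), which is the basis implied by the trigonometric identities used above (u_hat, the matrix of detrended residuals, is an assumed input):

import numpy as np

def series_lrv(u_hat, K):
    T = u_hat.shape[0]
    x = np.arange(1, T + 1) / T
    Omega = np.zeros((u_hat.shape[1], u_hat.shape[1]))
    for k in range(1, K + 1):
        phi = np.sqrt(2) * np.cos(2 * np.pi * k * x)
        Lam = (phi[:, None] * u_hat).sum(axis=0) / np.sqrt(T)  # inner sum of (26)
        Omega += np.outer(Lam, Lam)
    return Omega / K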
Proof of Lemma 3. Part (a). We write the statistic F_{T,GLS} as

F_{T,GLS} = [RT^{3/2}(β̂_GLS − β)]′ (12RΩ_{T,GLS}R′)^{−1/2} (12RΩ_{T,GLS}R′)^{1/2} (12RΩ̂R′)^{−1} (12RΩ_{T,GLS}R′)^{1/2} (12RΩ_{T,GLS}R′)^{−1/2} [RT^{3/2}(β̂_GLS − β)]
= ‖(12RΩ_{T,GLS}R′)^{−1/2} RT^{3/2}(β̂_GLS − β)‖² e′_β (12RΩ_{T,GLS}R′)^{1/2}(12RΩ̂R′)^{−1}(12RΩ_{T,GLS}R′)^{1/2} e_β
:= ΘΞ + O_p(1/T),

where

Θ = ‖(12RΩ_{T,GLS}R′)^{−1/2} RT^{3/2}(β̂_GLS − β)‖²  and  Ξ = e′_β (RΩR′)^{1/2}(RΩ̂R′)^{−1}(RΩR′)^{1/2} e_β.

Here we have used

Ω_{T,GLS} = (1/12) var(T^{3/2}(β̂_GLS − β)) = Ω(1 + O(1/T)).

Note that Θ is independent of Ξ because (i) β̂_GLS − β is independent of Ω̂, and (ii) Θ is the squared length of a standard normal vector while e_β is the direction of this vector; by the rotation invariance of standard normals, the length is independent of the direction. Hence

P[(K − p + 1)K⁻¹ F_{T,GLS} < z] = P[(K − p + 1)K⁻¹ ΘΞ < z] + O(1/T) = E G_p(zK(K − p + 1)⁻¹ Ξ⁻¹) + O(1/T).

Part (b). Let Δβ = β̂_OLS − β̂_GLS and define

ζ_{1T} = 2[RT^{3/2}Δβ]′ (12RΩ̂R′)⁻¹ (12RΩ_{T,GLS}R′)^{1/2} e_β,  ζ_{2T} = [RT^{3/2}Δβ]′ (12RΩ̂R′)⁻¹ [RT^{3/2}Δβ],

and ζ_T = √Θ ζ_{1T} + ζ_{2T}. Then F_{T,OLS} = F_{T,GLS} + ζ_T. Noting that Θ is independent of ζ_{1T}, ζ_{2T} and Ξ, we have

P[(K − p + 1)K⁻¹ F_{T,OLS} < z] = P[(K − p + 1)K⁻¹ (ΘΞ + √Θ ζ_{1T} + ζ_{2T} + O_p(1/T)) < z] = E F(ζ_{1T}, ζ_{2T}, Ξ) + O(1/T),

where

F(a, b, c) = P[(K − p + 1)K⁻¹ (Θc + √Θ a + b) < z].

Then

E F(ζ_{1T}, ζ_{2T}, Ξ) = E F(0, 0, Ξ) + E F′_1(0, 0, Ξ) ζ_{1T} + O(Eζ²_{1T}) + O(E|ζ_{1T} ζ_{2T}|) + O(Eζ_{2T}) = E F(0, 0, Ξ) + E F′_1(0, 0, Ξ) ζ_{1T} + O(1/T),

where F′_1 = ∂F/∂a. Here we have used O(Eζ²_{1T}) = O(1/T) and O(Eζ_{2T}) = O(1/T), which follow from E(c′Δβ Δβ′ c) = O(1/T) for any constant vector c. Next, let f_{e_β}(x) be the pdf of e_β. Since e_β is independent of Ω̂ and Δβ, we have

E F′_1(0, 0, Ξ) ζ_{1T} = ∫ E{F′_1(0, 0, x′(RΩR′)^{1/2}(RΩ̂R′)⁻¹(RΩR′)^{1/2} x) 2[RT^{3/2}Δβ]′(12RΩ̂R′)⁻¹(12RΩ_{T,GLS}R′)^{1/2} x} f_{e_β}(x) dx.

Writing Ω̂ = Ω̂(u) and Δβ = Δβ(u) as functions of the error process u, we have Ω̂(u) = Ω̂(−u) and Δβ(u) = −Δβ(−u). So

E{F′_1(0, 0, x′(RΩR′)^{1/2}(RΩ̂R′)⁻¹(RΩR′)^{1/2} x) 2[RT^{3/2}Δβ]′(12RΩ̂R′)⁻¹(12RΩ_{T,GLS}R′)^{1/2} x} = 0  for all x.

As a result, E F′_1(0, 0, Ξ) ζ_{1T} = 0. We have therefore shown that E F(ζ_{1T}, ζ_{2T}, Ξ) = E F(0, 0, Ξ) + O(1/T), and so

P[(K − p + 1)K⁻¹ F_{T,OLS} < z] = P[(K − p + 1)K⁻¹ ΘΞ < z] + O(1/T) = P[(K − p + 1)K⁻¹ F_{T,GLS} < z] + O(1/T),

as desired. □
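The independence of length and direction invoked above is easy to check by simulation (a toy check, not part of the paper's argument):

import numpy as np

rng = np.random.default_rng(1)
p, n_sim = 3, 200_000
z = rng.standard_normal((n_sim, p))
theta = (z ** 2).sum(axis=1)               # squared length, distributed chi^2_p
e = z / np.sqrt(theta)[:, None]            # direction, uniform on S_p(1)
# sample correlations between the length and functions of the direction
print(np.corrcoef(theta, e[:, 0])[0, 1])        # approximately 0
print(np.corrcoef(theta, e[:, 0] ** 2)[0, 1])   # approximately 0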
Proof of Theorem 4. We write Ξ = Ξ(Ω̂) and proceed to take a Taylor expansion of Ξ(Ω̂) around Ξ(Ω) = 1. To this end, we first compute the derivatives of Ξ⁻¹(Ω̂) with respect to Ω̂:

dΞ⁻¹(Ω̂) = −Ξ⁻² dΞ(Ω̂) = −Ξ⁻² e′_β (RΩR′)^{1/2} d[(RΩ̂R′)⁻¹] (RΩR′)^{1/2} e_β
= Ξ⁻² e′_β (RΩR′)^{1/2}(RΩ̂R′)⁻¹ (R dΩ̂ R′)(RΩ̂R′)⁻¹(RΩR′)^{1/2} e_β
= Ξ⁻² {[e′_β (RΩR′)^{1/2}(RΩ̂R′)⁻¹ R] ⊗ [e′_β (RΩR′)^{1/2}(RΩ̂R′)⁻¹ R]} d vec(Ω̂).

Hence

∂Ξ⁻¹(Ω̂)/∂vec(Ω̂)′ = Ξ⁻² [e′_β (RΩR′)^{1/2}(RΩ̂R′)⁻¹ R] ⊗ [e′_β (RΩR′)^{1/2}(RΩ̂R′)⁻¹ R].

Evaluating the above derivative at Ω yields

∂Ξ⁻¹(Ω)/∂vec(Ω̂)′ = [e′_β (RΩR′)^{−1/2} R] ⊗ [e′_β (RΩR′)^{−1/2} R].

Next, we compute the second-order derivative, d[∂Ξ⁻¹(Ω̂)/∂vec(Ω̂)′] = D_1 + D_2 + D_3, where

D_1 = −2Ξ⁻³ dΞ [e′_β(RΩR′)^{1/2}(RΩ̂R′)⁻¹R] ⊗ [e′_β(RΩR′)^{1/2}(RΩ̂R′)⁻¹R],
D_2 = Ξ⁻² [e′_β(RΩR′)^{1/2} d[(RΩ̂R′)⁻¹] R] ⊗ [e′_β(RΩR′)^{1/2}(RΩ̂R′)⁻¹R],
D_3 = Ξ⁻² [e′_β(RΩR′)^{1/2}(RΩ̂R′)⁻¹R] ⊗ [e′_β(RΩR′)^{1/2} d[(RΩ̂R′)⁻¹] R] = D_2 K_{nn}.

Evaluating D_1 + D_2 + D_3 at Ω gives the Hessian

∂²Ξ⁻¹(Ω)/∂vec(Ω̂)∂vec(Ω̂)′ = 2[R′(RΩR′)^{−1/2} e_β e′_β (RΩR′)^{−1/2} R] ⊗ [R′(RΩR′)^{−1/2} e_β e′_β (RΩR′)^{−1/2} R] − [R′(RΩR′)⁻¹ R] ⊗ [R′(RΩR′)^{−1/2} e_β e′_β (RΩR′)^{−1/2} R] (I_{n²} + K_{nn}) := J_1 + J_2.

A second-order Taylor expansion then gives

Ξ⁻¹(Ω̂) = 1 + L + Q + remainder,  (30)

where

L = [∂Ξ⁻¹(Ω)/∂vec(Ω̂)′] vec(Ω̂ − Ω) = e′_β (RΩR′)^{−1/2} R(Ω̂ − Ω) R′ (RΩR′)^{−1/2} e_β,  Q = (1/2) vec(Ω̂ − Ω)′ (J_1 + J_2) vec(Ω̂ − Ω).

We proceed to compute the expected values of L and Q. As a by-product, we obtain the order of the remainder term. For notational simplicity, let X = (X_1, …, X_p)′ = e_β ∈ R^p. It is easy to see that X is uniformly distributed on the surface of the p-dimensional sphere with center 0 and radius 1. It follows from Khokhlov (2006) that the density of X_1, the first element of X, is

f_{X_1}(x) = [Γ(p/2)/(√π Γ((p − 1)/2))](1 − x²)^{(p−3)/2},  x ∈ [−1, 1],

where Γ(·) is the gamma function. Therefore

EX_1² = ∫_{−1}^{1} x² f_{X_1}(x) dx = 1/p,  (31)
EX_1⁴ = ∫_{−1}^{1} x⁴ f_{X_1}(x) dx = 3/(p(p + 2)).  (32)

By definition, E Σ_{i=1}^{p} X_i² = 1. Using the permutation symmetry of the distribution of X, we have p EX_1⁴ + p(p − 1) EX_1²X_2² = E(Σ_{i=1}^{p} X_i²)² = 1, which implies that

EX_1²X_2² = 1/(p(p + 2)).  (33)

Using (31) and EX_1X_2 = 0, we have

EL = E e′_β (RΩR′)^{−1/2} R (EΩ̂ − Ω) R′ (RΩR′)^{−1/2} e_β = (K²/T²) E e′_β (RΩR′)^{−1/2} RBR′ (RΩR′)^{−1/2} e_β (1 + o(1)) + O(1/T)
= (K²/T²)(1/p) tr[(RΩR′)^{−1/2} RBR′ (RΩR′)^{−1/2}](1 + o(1)) + O(1/T) = (K²/T²) B̄ (1 + o(1)) + O(1/T),

where we used E(e_β e′_β) = p⁻¹ I_p.

To compute E(Q), we note that Q consists of two terms. The first term is

(1/2) E vec(Ω̂ − Ω)′ J_1 vec(Ω̂ − Ω) = E[e′_β (RΩR′)^{−1/2} R(Ω̂ − Ω) R′ (RΩR′)^{−1/2} e_β]² = E[Σ_{i,j} Σ_{ℓ,m} A_{ij} A_{ℓm} X_i X_j X_ℓ X_m | A],

where A = (A_{ij}) = (RΩR′)^{−1/2} R(Ω̂ − Ω) R′ (RΩR′)^{−1/2}. Plugging in (32) and (33) and using the permutation symmetry of X,

E[Σ_{i,j} Σ_{ℓ,m} A_{ij} A_{ℓm} X_i X_j X_ℓ X_m | A] = E Σ_i A_{ii}² X_i⁴ + E Σ_{i≠m} (A_{ii}A_{mm} + 2A_{im}²) X_i² X_m²
= [3/(p(p + 2))] E Σ_i A_{ii}² + [1/(p(p + 2))] E Σ_{i≠m} (A_{ii}A_{mm} + 2A_{im}²) = [1/(p(p + 2))] E{2 tr(AA) + [tr(A)]²}.

Now

E tr(AA) = E vec(A)′ vec(A) = (1/K)(p² + p) + o(1/K) + O(K²/T²) + O(1/T),

using var(vec(Ω̂)) = (1/K)(Ω ⊗ Ω)(I_{n²} + K_{nn})(1 + o(1)) + O(1/T) and the representation K_{nn} = Σ_{i=1}^{n} Σ_{j=1}^{n} e_i e′_j ⊗ e_j e′_i, with e_i the ith column unit vector of order n; see Magnus and Neudecker (1979, Definition 3.1). Next, since K_{nn} vec(A) = vec(A′), we have

E[tr(A)]² = E[vec(R′(RΩR′)⁻¹R)′ vec(Ω̂ − Ω)]² = (1/K) vec(R′(RΩR′)⁻¹R)′ (Ω ⊗ Ω)(I_{n²} + K_{nn}) vec(R′(RΩR′)⁻¹R) + o(1/K) + O(K²/T²)
= (2/K) tr[(RΩR′)⁻¹RΩR′(RΩR′)⁻¹RΩR′] + o(1/K) + O(K²/T²) = 2p/K + o(1/K) + O(K²/T²).

Therefore

(1/2) E vec(Ω̂ − Ω)′ J_1 vec(Ω̂ − Ω) = [1/(p(p + 2))][2(p² + p)/K + 2p/K] + o(1/K) + o(K²/T²) + O(1/T) = 2/K + o(1/K) + o(K²/T²) + O(1/T).

A similar calculation shows that the second term is

(1/2) E vec(Ω̂ − Ω)′ J_2 vec(Ω̂ − Ω) = −(1/K)(1/p)(p² + p) + o(1/K) + o(K²/T²) + O(1/T).

Hence

EQ = 2/K − (p + 1)/K + o(1/K) + o(K²/T²) + O(1/T) = −(1/K)(p − 1) + o(1/K) + o(K²/T²) + O(1/T).

It is also easy to show that

EL² = (1/K) E{[e′_β(RΩR′)^{−1/2}R ⊗ e′_β(RΩR′)^{−1/2}R](Ω ⊗ Ω)(I_{n²} + K_{nn})[e′_β(RΩR′)^{−1/2}R ⊗ e′_β(RΩR′)^{−1/2}R]′} + o(1/K) + o(K²/T²) + O(1/T)
= (2/K) E(e′_β e_β)² + o(1/K) + o(K²/T²) + O(1/T) = 2/K + o(1/K) + o(K²/T²) + O(1/T).

Controlling the remainder in (30) in the same way, we obtain

Ξ⁻¹(Ω̂) = 1 + L + Q + o_p(1/K + K²/T²) + O_p(1/T).  (34)

Using the above expansion, we have

P[(K − p + 1)K⁻¹ F_{T,GLS} < z] = P[Θ < zK(K − p + 1)⁻¹(1 + L + Q)] + o(1/K) + o(K²/T²) + O(1/T)
= E G_p(zK(K − p + 1)⁻¹(1 + L + Q)) + o(1/K) + o(K²/T²) + O(1/T)
= G_p(zK(K − p + 1)⁻¹) + G′_p(z) z E(L + Q) + (1/2) G″_p(z) z² EL² + o(1/K) + o(K²/T²) + O(1/T)
= G_p(zK(K − p + 1)⁻¹) + (K²/T²) G′_p(z) z B̄ − (1/K) G′_p(z) z (p − 1) + (1/K) G″_p(z) z² + o(1/K) + o(K²/T²) + O(1/T).

It is easy to show that G_p(zK(K − p + 1)⁻¹) = G_p(z) + (1/K) G′_p(z) z (p − 1) + o(1/K), and hence

P[(K − p + 1)K⁻¹ F_{T,GLS} < z] = G_p(z) + (K²/T²) G′_p(z) z B̄ + (1/K) G″_p(z) z² + o(1/K) + o(K²/T²) + O(1/T),

as desired. By Lemma 3, the same expansion holds with F_{T,GLS} replaced by F_{T,OLS}. □
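A quick Monte Carlo check of the sphere moments (31)–(33) used in the proof (illustrative only):

import numpy as np

rng = np.random.default_rng(2)
p, n_sim = 5, 500_000
z = rng.standard_normal((n_sim, p))
x = z / np.linalg.norm(z, axis=1, keepdims=True)  # uniform on the unit sphere
print(np.mean(x[:, 0] ** 2), 1 / p)                       # Eq. (31)
print(np.mean(x[:, 0] ** 4), 3 / (p * (p + 2)))           # Eq. (32)
print(np.mean(x[:, 0] ** 2 * x[:, 1] ** 2), 1 / (p * (p + 2)))  # Eq. (33)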
Proof of Theorem 5. Part (a). It follows from Theorem 4 that

P(F*_{T,OLS} > pF^α_{p,K−p+1}) − α = −(K²/T²) G′_p(pF^α_{p,K−p+1}) pF^α_{p,K−p+1} B̄ + o(1/K) + o(K²/T²) + O(1/T).  (35)

But pF^α_{p,K−p+1} = χ^α_p + o(1), hence

P(F*_{T,OLS} > pF^α_{p,K−p+1}) − α = −(K²B̄/T²) G′_p(χ^α_p) χ^α_p + o(1/K) + o(K²/T²) + O(1/T).

Part (b). The F_{T,GLS} statistic can be written as

F_{T,GLS} = [RT^{3/2}(β̂_GLS − β) + (12RΩR′)^{1/2} c̃]′ (12RΩ̂R′)⁻¹ [RT^{3/2}(β̂_GLS − β) + (12RΩR′)^{1/2} c̃]
= ‖(12RΩ_{T,GLS}R′)^{−1/2} RT^{3/2}(β̂_GLS − β) + c̃‖² e′_{βδ} (12RΩ_{T,GLS}R′)^{1/2}(12RΩ̂R′)⁻¹(12RΩ_{T,GLS}R′)^{1/2} e_{βδ} + O_p(1/T),

where ‖c̃‖² = δ². Let

e_{βδ} = [(12RΩ_{T,GLS}R′)^{−1/2} RT^{3/2}(β̂_GLS − β) + c̃] / ‖(12RΩ_{T,GLS}R′)^{−1/2} RT^{3/2}(β̂_GLS − β) + c̃‖;

then F_{T,GLS} = Θ_δ Ξ_δ + O_p(1/T), where

Θ_δ = ‖(12RΩ_{T,GLS}R′)^{−1/2} RT^{3/2}(β̂_GLS − β) + c̃‖²,  Ξ_δ = e′_{βδ} (RΩR′)^{1/2}(RΩ̂R′)⁻¹(RΩR′)^{1/2} e_{βδ},

and Θ_δ is independent of Ξ_δ. In addition, Θ_δ ∼ χ²_p(δ²) and e_{βδ} is uniformly distributed on the unit sphere S_p(1). Using the same calculation as in the proof of Theorem 4, we have

P[(K − p + 1)K⁻¹ F_{T,GLS} < pF^α_{p,K−p+1} | H₁(δ²)] = E G_{p,δ²}(pF^α_{p,K−p+1} K(K − p + 1)⁻¹ Ξ_δ⁻¹) + O(1/T)
= G_{p,δ²}(pF^α_{p,K−p+1}) + (K²B̄/T²) G′_{p,δ²}(pF^α_{p,K−p+1}) pF^α_{p,K−p+1} + (1/K) G″_{p,δ²}(pF^α_{p,K−p+1})(pF^α_{p,K−p+1})² + o(1/K) + o(K²/T²) + O(1/T).

Plugging in

pF^α_{p,K−p+1} = χ^α_p − (1/K)[G″_p(χ^α_p)/G′_p(χ^α_p)](χ^α_p)² + o(1/K),

we have

P[(K − p + 1)K⁻¹ F_{T,GLS} < pF^α_{p,K−p+1} | H₁(δ²)] = G_{p,δ²}(χ^α_p) + (K²B̄/T²) G′_{p,δ²}(χ^α_p) χ^α_p + (1/K) Q_{p,δ²}(χ^α_p)(χ^α_p)² + o(1/K) + o(K²/T²) + O(1/T),

where

Q_{p,δ²}(χ^α_p) = G″_{p,δ²}(χ^α_p) − [G″_p(χ^α_p)/G′_p(χ^α_p)] G′_{p,δ²}(χ^α_p).

Now it is known that

G′_{p,δ²}(z) = e^{−δ²/2} Σ_{j=0}^{∞} [(δ²/2)^j/j!] z^{j+p/2−1} e^{−z/2} / [2^{j+p/2} Γ(j + p/2)],  (36)

and thus

G″_{p,δ²}(z) = e^{−δ²/2} Σ_{j=0}^{∞} [(δ²/2)^j/j!] {z^{j+p/2−1} e^{−z/2} / [2^{j+p/2} Γ(j + p/2)]} [(2j + p − 2 − z)/(2z)]
= −G′_{p,δ²}(z)(z + 2 − p)/(2z) + e^{−δ²/2} Σ_{j=0}^{∞} [(δ²/2)^j/j!] {z^{j+p/2−1} e^{−z/2} / [2^{j+p/2} Γ(j + p/2)]} (j/z)
= G′_{p,δ²}(z) [G″_p(z)/G′_p(z)] + e^{−δ²/2} Σ_{j=0}^{∞} [(δ²/2)^j/j!] {z^{j+p/2−1} e^{−z/2} / [2^{j+p/2} Γ(j + p/2)]} (j/z).

Hence

Q_{p,δ²}(z) = e^{−δ²/2} Σ_{j=0}^{∞} [(δ²/2)^j/j!] {z^{j+p/2−1} e^{−z/2} / [2^{j+p/2} Γ(j + p/2)]} (j/z).

Some algebra shows that

Q_{p,δ²}(z) = [δ²/(2z)] G′_{(p+2),δ²}(z),

completing the proof of the theorem. □
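The closed form for Q_{p,δ²} can be verified numerically, with G″ computed by central differences (illustrative only):

import numpy as np
from scipy.stats import chi2, ncx2

p, d2, z, h = 3, 4.0, 7.81, 1e-5
Gpp_c = (chi2.pdf(z + h, p) - chi2.pdf(z - h, p)) / (2 * h)          # G''_p(z)
Gpp_nc = (ncx2.pdf(z + h, p, d2) - ncx2.pdf(z - h, p, d2)) / (2 * h)  # G''_{p,d2}(z)
Q = Gpp_nc - Gpp_c / chi2.pdf(z, p) * ncx2.pdf(z, p, d2)
print(Q, d2 / (2 * z) * ncx2.pdf(z, p + 2, d2))  # the two values should agree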
References

Andrews, D.W.K., 1991. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817–854.
Beveridge, S., Nelson, C.R., 1981. A new approach to decomposition of economic time series into permanent and transitory components with particular attention to measurement of the business cycle. Journal of Monetary Economics 7, 151–174.
Bilodeau, M., Brenner, D., 1999. Theory of Multivariate Statistics. Springer.
Foley, R.D., Goldsman, D., 1999. Confidence intervals using orthonormally weighted standardized time series. ACM Transactions on Modeling and Computer Simulation 19 (4), 297–325.
Grenander, U., Rosenblatt, M., 1957. Statistical Analysis of Stationary Time Series. Wiley, New York.
Hannan, E.J., 1957. The variance of the mean of a stationary process. Journal of the Royal Statistical Society, Series B 19 (2), 282–285.
Hashimzade, N., Vogelsang, T.J., 2007. Fixed-b asymptotic approximation of the sampling behavior of nonparametric spectral density estimators. Journal of Time Series Analysis 29 (1), 142–162.
Hotelling, H., 1931. The generalization of Student's ratio. Annals of Mathematical Statistics 2, 360–378.
Jansson, M., 2004. On the error of rejection probability in simple autocorrelation robust tests. Econometrica 72, 937–946.
Khokhlov, V.I., 2006. The uniform distribution on a sphere in R^s: properties of projections. Theory of Probability and its Applications 50 (3), 386–399.
Kiefer, N.M., Vogelsang, T.J., 2002a. Heteroskedasticity-autocorrelation robust testing using bandwidth equal to sample size. Econometric Theory 18, 1350–1366.
Kiefer, N.M., Vogelsang, T.J., 2002b. Heteroskedasticity-autocorrelation robust standard errors using the Bartlett kernel without truncation. Econometrica 70, 2093–2095.
Kiefer, N.M., Vogelsang, T.J., 2005. A new asymptotic theory for heteroskedasticity-autocorrelation robust tests. Econometric Theory 21, 1130–1164.
Magnus, J.R., Neudecker, H., 1979. The commutation matrix: some properties and applications. The Annals of Statistics 7 (2), 381–394.
Mercer, J., 1909. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, Series A 209, 415–446.
Müller, U.K., 2007. A theory of robust long-run variance estimation. Journal of Econometrics 141 (2), 1331–1352.
Newey, W.K., West, K.D., 1987. A simple, positive semidefinite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703–708.
Newey, W.K., West, K.D., 1994. Automatic lag selection in covariance estimation. Review of Economic Studies 61, 631–654.
Ng, S., Perron, P., 1994. The exact error in estimating the spectral density at the origin. Journal of Time Series Analysis 17 (4), 379–408.
Percival, D., Walden, A.T., 1993. Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques. Cambridge University Press.
Phillips, P.C.B., 2005. HAC estimation by automated regression. Econometric Theory 21, 116–142.
Phillips, P.C.B., Solo, V., 1992. Asymptotics for linear processes. Annals of Statistics 20, 971–1001.
Phillips, P.C.B., Sun, Y., Jin, S., 2006. Spectral density estimation and robust hypothesis testing using steep origin kernels without truncation. International Economic Review 21, 837–894.
Phillips, P.C.B., Sun, Y., Jin, S., 2007. Long-run variance estimation and robust regression using sharp origin kernels with no truncation. Journal of Statistical Planning and Inference 137, 985–1023.
Priestley, M.B., 1981. Spectral Analysis and Time Series. Academic Press, New York.
Stoica, P., Moses, R., 2005. Spectral Analysis of Signals. Pearson Prentice Hall.
Sun, Y., 2006. Best quadratic unbiased estimators of integrated variance in the presence of market microstructure noise. Working Paper, Department of Economics, UC San Diego.
Sun, Y., 2010. Let's fix it: fixed-b asymptotics versus small-b asymptotics in heteroscedasticity and autocorrelation robust inference. Working Paper, Department of Economics, UC San Diego.
Sun, Y., Phillips, P.C.B., 2009. Bandwidth choice for interval estimation in GMM regression. Working Paper, Department of Economics, UC San Diego.
Sun, Y., Phillips, P.C.B., Jin, S., 2008. Optimal bandwidth selection in heteroskedasticity-autocorrelation robust testing. Econometrica 76, 175–194.
Sun, Y., Phillips, P.C.B., Jin, S., 2011. Power maximization and size control in heteroskedasticity and autocorrelation robust tests with exponentiated kernels. Econometric Theory (in press).
Thomson, D.J., 1982. Spectrum estimation and harmonic analysis. IEEE Proceedings 70, 1055–1096.
Vogelsang, T.J., Franses, P.H., 2005. Testing for common deterministic trend slopes. Journal of Econometrics 126, 1–24.
Journal of Econometrics 164 (2011) 367–381

Realized Laplace transforms for estimation of jump diffusive volatility models✩

Viktor Todorov (Department of Finance, Kellogg School of Management, Northwestern University, Evanston, IL 60208, United States), George Tauchen (Department of Economics, Duke University, Durham, NC 27708, United States; corresponding author, Tel.: +1 919 660 1812), Iaryna Grynkiv (Department of Economics, Duke University, Durham, NC 27708, United States). E-mail addresses: [email protected] (V. Todorov), [email protected] (G. Tauchen), [email protected] (I. Grynkiv).

Article history: Received 12 September 2010; received in revised form 7 June 2011; accepted 13 June 2011; available online 21 June 2011. JEL classification: C51; C52; G12. Keywords: Jumps; High-frequency data; Laplace transform; Stochastic volatility models. doi:10.1016/j.jeconom.2011.06.016. © 2011 Elsevier B.V. All rights reserved.

Abstract: We develop an efficient and analytically tractable method for estimation of parametric volatility models that is robust to price-level jumps. The method entails first integrating intra-day data into the Realized Laplace Transform of volatility, which is a model-free estimate of the daily integrated empirical Laplace transform of the unobservable volatility. The estimation is then done by matching moments of the integrated joint Laplace transform with those implied by the parametric volatility model. In the empirical application, the best fitting volatility model is a non-diffusive two-factor model where low activity jumps drive its persistent component and more active jumps drive the transient one.

✩ We would like to thank the editor, an Associate Editor, a referee, as well as Torben Andersen, Tim Bollerslev, Andrew Patton, and many conference and seminar participants for numerous suggestions and comments. Todorov's work was partially supported by NSF grant SES-0957330.

1. Introduction

Stochastic volatility and price-level jumps are two well-recognized features of asset price dynamics that differ in economically significant ways. Given the substantial compensations for these two risks demanded by investors, as evident from option prices, it is of central importance to better understand their dynamic characteristics. When using coarser sampling of stock prices, such as daily data, the separation of volatility from price-level jumps becomes relatively difficult and, most important, depends crucially on the correct modeling of all aspects of the asset model; misspecification of any one feature of the model for the asset dynamics can lead to erroneous evidence about the role and significance of each of those risks. On the other hand, the availability of high-frequency data provides a model-free way of disentangling the key features of the price dynamics.

Price-level jumps have recently been studied very extensively.¹ There is substantial parametric and non-parametric evidence for rare but very sharp price movements, as would be expected from a compound Poisson process. There is also some evidence for smaller, more vibrant price-level jumps, as predicted by models built on Lévy processes with so-called Blumenthal–Getoor indexes above zero.² This evidence is documented by Ait-Sahalia and Jacod (2009a) for liquid individual equity prices and by Todorov and Tauchen (2011a) for index options prices; see also the references therein. The aim of this paper is to remain robust to these price jumps while developing an enhanced methodology for understanding the latent volatility dynamics. The robustness is achieved by using high-frequency data in conjunction with techniques that effectively filter out the irrelevant parts of returns.

¹ See, for example, Barndorff-Nielsen and Shephard (2004, 2006) and Ait-Sahalia and Jacod (2009b), among many others, with evaluations of jump tests in Huang and Tauchen (2005), and quite comprehensively in Theodosiou and Zikes (2009).
² All Lévy processes can be divided into equivalence classes according to the Blumenthal–Getoor index (Blumenthal and Getoor, 1961), which lies in [0,2). The index of the relatively quiescent variance gamma process is 0; that of the Cauchy process is 1.0; and the indices of more vibrant processes are closer to the upper bound of 2.0.
The newly developed estimation method is for parametric volatility models. Restriction to parametric models is motivated by the fact we only directly observe financial prices, not the volatility, so the information regarding volatility dynamics is embedded more deeply into the process. Extracting this information pathwise is, in effect, a de-convolution effort that can be done nonparametrically only at a rate equal in general to the fourth root of the sample size, which is slow. It might be thought that a way around this problem is to use volatility-sensitive financial derivatives, e.g., options, but that is not generally the case. As noted by Todorov and Tauchen (2011a), a market-based volatility index based on a portfolio of option prices such as the VIX is actually a two-level convolution of the underlying volatility process; one convolution is the formation of the forward integrated variance while the other is the integration implicit in market’s computation of the risk-neutral expectation of the forward variance. The risk-neutral volatility expectation contains a rather non-trivial volatility risk premium, and hence we need a model for the latter before we use volatility derivatives in the estimation. Thus, it seems the only way to discriminate more sharply across models, without imposing assumptions on the pricing of risk, is to use a relatively efficient parametric method that exploits the full strength of available high-frequency data. Candidate volatility models and estimators already pervade the literature, but the evidence to date on the empirically most credible model is not very conclusive. The list of potential models includes, for example, purely diffusive affine models, affine jump diffusions, and pure jump models driven by Lévy processes of various activity levels. The evidence, however, at best suggests only the general features of the appropriate model. It seems clear that volatility is comprised of at least two factors, one very slow moving and the other quickly mean reverting, and there is also some evidence for volatility jumps. On these points see the findings regarding volatility factors in Bollerslev and Zhou (2002) and Barndorff-Nielsen and Shephard (2002) who use highfrequency data, along with Andersen et al. (2002) and Chernov et al. (2003) for earlier evidence from low frequency data; for volatility jumps see Todorov and Tauchen (2011a) along with earlier evidence using low frequency data provided by Eraker et al. (2003). Though suggestive, these findings taken together do not define a well-specified parsimonious model that jointly captures the distributional and dynamic characteristics of volatility. This rather hazy view of volatility dynamics should not be surprising, given the above-noted fact that volatility dynamics are embedded so deeply into the financial price process. Our strategy to sharpen estimation precision entails a new moment-based estimator applied to daily aggregates of trigonometric transforms of the high-frequency returns; specifically, we use the Realized Laplace transform of volatility proposed by Todorov and Tauchen (2011b). This computation is a simple sum of cosine transformations of the high-frequency increments, and it is an empirical measure that embodies more information regarding volatility than the now-conventional measures of daily variability. 
Todorov and Tauchen (2011b) show that the Realized Laplace transform is a nonparametric, consistent and asymptotically normal estimate of the unobservable integrated Laplace transform of volatility over a fixed interval of time. The above paper derives the asymptotic behavior of this nonparametric measure when joint long-span fill-in asymptotics is used. Importantly, this measure is robust to the presence of jumps in the price, a desideratum discussed above. In this paper we build on the above nonparametric results and show how they can be used for efficient and robust parametric estimation of the stochastic volatility process. As seen in Section 3 below, the Realized Laplace transform conveys information about the (joint) Laplace transform of
volatility, and similarly the candidate volatility models are most easily expressed in terms of their characteristic functions.3 Thus some aspects of our estimation methodology are in common with the long tradition of estimation based on the characteristic function (Parzen, 1962) and conditional characteristic functions as in Feuerverger and Mureika (1977), Singleton (2001), Jiang and Knight (2002), Yu (2004), Bates (2006) and Carrasco et al. (2007). For example, we face issues similar to those of the continuum of moment conditions and also the numerical difficulties associated with computing inverse integral transforms. However, an important difference with prior work comes from the fact that we use the high-frequency data to ‘‘integrate out’’ in a model-free way the components of the process not directly linked to volatility such as the driving martingales in the price-level and the price jumps. So, unlike previous work using the empirical characteristic function, we cannot work directly with the empirical joint Laplace transform of the process of interest (the latent volatility) but rather we use a daily integrated version of it. A key feature of the proposed method is its analytical tractability. Moments of the integrated joint Laplace transform of the volatility can be computed via one-dimensional numerical integration as soon as the joint characteristic function of stochastic volatility is known in closed form. This is the case for a variety of models, including the class of affine jump-diffusion models of Duffie et al. (2000, 2003). Given the wide applicability of this class in financial applications, it is important to explore its full flexibility and verify whether specification in this class can capture the key characteristics of volatility risk present in the data. Our method provides a convenient and efficient way to do that. The full description of the method in Section 2.2 requires a moderate amount of detail, but a précis is as follows: Using the analytical tractability of Laplace transforms, we can form at any lag a continuum of estimating equations defined on the twodimensional nonnegative orthant R2+ that span the information in the integrated joint Laplace transform of current and lagged volatility. Then, kernel-based averaging schemes are used to condense the continuum of equations to individual estimating equations, but over regions of the orthant instead of all of R2+ , as would be the case in Paulson et al. (1975), Knight and Yu (2002), and Jiang and Knight (2010), who previously consider kernelbased averaging. Because the estimating equations are additively separable functions of the data and parameters, we can undertake minimum distance estimation (GMM) with a weighting matrix that is a model-free fixed function of the data. In effect, then, the information from the various estimating equations formed by regional kernel averaging at different lags all gets weighted together in a data-optimal manner to form the chi-squared estimation criterion function. The proposed method can be compared with previous work on parametric (or semi-parametric) estimation of continuous-time stochastic volatility models. First, there is an earlier statistics literature on estimation of diffusion models from high-frequency data, see e.g. Prakasa Rao (1988) and the many references therein. 
The key difference with our method is that these papers are all concerned with estimation of directly observed Markov models, while our focus is estimating processes with hidden states, e.g., volatility and jumps. Indeed, our Realized Laplace transform measure is aimed exactly at estimating the volatility hidden in the price.
3 For any scalar random variable X the characteristic function is c(ω) = E[e^{iωX}], ω ∈ R, while for a non-negatively supported random variable Y the real Laplace transform is L(u) = E[e^{−uY}], u ∈ R₊. The multivariate extensions are obvious and both transforms are one-to-one with their probability distribution function, but the domain of the real Laplace transform is, of course, smaller.
Compared next with the earlier literature on estimation of stochastic volatility asset pricing models based on coarser frequencies, e.g., days, there are two key differences. First, we disentangle volatility from price jumps nonparametrically and second we perform estimation as if volatility is directly observable. These aspects of our estimation method provide both efficiency and robustness gains towards the complicated problem of modeling price jumps and their relation with the volatility jumps. On the other hand, our method is only for estimation of the volatility dynamics, and therefore we will need to combine it with existing estimation methods in order to make an inference for the rest of the return dynamics, e.g., the price jumps. Nevertheless, being able to estimate efficiently and robustly the volatility specification should help in pinning down the rest of the return dynamics both robustly and more efficiently.4 More recently, Bollerslev and Zhou (2002), Barndorff-Nielsen and Shephard (2002), Corradi and Distaso (2006) and Todorov (2009) consider estimation of stochastic volatility models using either method of moments or QML estimation on a model-free estimate of the daily Integrated Variance constructed from highfrequency data.5 In our Monte Carlo application we compare our procedure with the above alternative (in particular we compare with QML estimation on a measure of the Integrated Variance). We consider a variety of one-factor pure-continuous and pure-jump volatility models as well as a two-factor volatility model. In all of the simulated models the price contains jumps. We find that our inference based on the Realized Laplace transform performs well without any significant biases unlike the QML estimation based on the high-frequency estimate of the Integrated Variance. The latter has significant biases coming from the high-frequency estimation. Also, comparison with the Cramer–Rao efficiency bound for an infeasible observation scheme of daily spot variance further reveals that our method has good efficiency properties and is able to extract the relevant information about volatility in the highfrequency data. In the empirical application to the S&P 500 index futures sampled at five minutes, the method appears to discriminate reasonably well across a broad class of volatility models, and it shows promise for generating interesting new insights about the dynamics of volatility. We find that volatility exhibits transient and persistent shifts, which in itself is not too surprising, but interestingly both components of volatility appear to move through jumps and possibly without a diffusive component. The persistent shifts in volatility happen through a process with low activity, mainly rare big jumps. On the other hand, the driving Lévy process for the transient factor is far more active—it has big spikes but also a lot of small jumps which capture day to day variation in volatility. The finding that volatility appears to follow a two-factor pure-jump process needs to be tempered by other evidence that the preferred model does encounter some problems reproducing shifts from very low to very high levels of volatility as occur in the data. The rest of the paper is organized as follows. In Section 2 we develop our method for estimating the parametric volatility specification of a stochastic process observed at high-frequency. In Section 3 we present the various jump-diffusion models for the stochastic volatility that we estimate on simulated and observed
4 Similarly, Bollerslev and Zhou (2002), Barndorff-Nielsen and Shephard (2002) and Corradi and Distaso (2006) focus mainly on estimation of the stochastic volatility specification. 5 Recently Dobrev and Szerszen (2010) consider MCMC estimation of the volatility dynamics of discrete-time stochastic volatility models using a measure of the Integrated Variance as an extra observation equation and provide a strong example of the gains from incorporating high-frequency data in the estimation.
369
data. Section 4 contains the Monte Carlo study. In this section we also compare our method with a feasible alternative based on QML estimation on Integrated Variance. Section 5 contains the empirical application based on high-frequency S&P 500 Index futures data. Section 6 concludes. All technical details are given in Appendix. 2. Parametric estimation via the realized Laplace transform Assume we observe at discrete points in time a process X, defined on some filtered probability space Ω , F , (Ft )t ≥0 , P that has the following dynamics dXt = αt dt +
∫
Vt dWt +
δ(t −, x)µ( ˜ dt , dx),
(1)
R
where αt and Vt are càdlàg processes (and Vt ≥ 0); Wt is a Brownian motion; µ is a homogeneous Poisson measure with compensator (Lévy measure) ν(x)dx; δ(t , x): R+ × R → R is càdlàg in t and µ( ˜ ds, dx) = µ(ds, dx) − ν(x)dxds. Our interest in the paper is estimation of parametric models for the stochastic volatility process Vt which is robust to the specification of the rest of the components in the model, i.e., the drift term αt and the price jumps. Importantly, given prior empirical evidence in e.g., Andersen et al. (2002) and Chernov et al. (2003), we will be interested in developing an estimation method that does not depend on the Vt being a Markov process (with respect to its own filtration). We note that in (1) we have implicitly assumed that the process contains a diffusive component and we are interested in its stochastic volatility. Nonparametric tests in Todorov and Tauchen (2010) show that this assumption is satisfied for the S&P 500 index that we are going to use in our empirical study. More generally, however, if the observed price does not have a diffusive component, then the construction of the Realized Laplace Transform below needs to be modified in an important way. Otherwise the estimation will lead to inconsistent results. We do not consider this complication here and instead refer to Todorov and Tauchen (2011c) for the theoretical analysis. 2.1. Constructing the realized Laplace transform We start with constructing the Realized Laplace Transform on which the proposed estimation is based. We assume that we observe the log-price process X at the equidistant times 1 0, 1n , . . . , 1, 1 + π , n+ + π , . . . , ni + ⌊(i − 1)/n⌋π , . . . , T + n π (T − 1). Our unit interval will be the trading part of a day where we have high-frequency price records and the intervals (1, 1 + π ), (2, 2 + π ), . . . are the periods from close to open where we do not have high-frequency price observations. n denotes the number of high-frequency returns within the trading interval and T is the total span of the data.6 For simplicity we will denote the log-price increment over a high-frequency interval as ∆ni X = X i +⌊(i−1)/n⌋π − X i−1 +⌊(i−1)/n⌋π . Our results in the paper will rely n
n
on joint asymptotic arguments: fill-in (n → ∞) and long-span (T → ∞). Using the high-frequency data over a given trading day [(t − 1)(1 + π ), (t − 1)(1 + π ) + 1], we can estimate in a modelfree way the empirical Laplace transform of the latent variance process over that interval. In particular, provided price jumps are of finite variation and under some further mild regularity restrictions (satisfied for all parametric models considered here), Todorov and Tauchen (2011b) show
6 ⌊x⌋ denotes the smallest integer not exceeding x and we will set it to zero if x < 0.
370
V. Todorov et al. / Journal of Econometrics 164 (2011) 367–381 nt
1
Zt (u) =
−
n i=n(t −1)+1
∫
cos
(t −1)(1+π)+1
= (t −1)(1+π)
√
√
2u n∆ni X
directly seen from the following. First, we denote the joint Laplace transform of the vector (Vt1 , . . . , VtK ) as
√
e−uVs ds + Op 1/ n ,
u ≥ 0.
(2)
We refer to Zt (u) as the Realized Laplace transform of the variance over the day. The robustness with respect to jumps of Zt (u) follows from the following property of the cosine function: | cos(x) − cos(y)| ≤ K (|x − y| ∧ 1) for any x, y ∈ R. We note that unlike existing jump–robust volatility measures where robustness is achieved either through explicit truncation of the price increments or through staggering by considering products of adjacent price increments, our measure has a ‘‘built-in’’ truncation through the choice of the function that transforms the increments. We also stress that the above result in (2) holds regardless of the volatility specification (as long as it is an Itô semimartingale) and in particular the result is robust to specifications in which the price and volatility jumps are strongly dependent as data would suggest.7 A formal analysis of the jump–robustness properties of Zt (u) is given in Todorov and Tauchen (2011b). Next, using long-span asymptotics combined with standard stationarity and ergodicity conditions for Vt , and provided T /n → 0, Todorov and Tauchen (2011b) show that8
T ∫ √ 1 − (t −1)(1+π)+1 −uVs e ds + op (1/ T ), L ( u ) = V T t =1 (t −1)(1+π) [∫ (t −1)(1+π)+1 T − 1 V (u, v; k) = L e−uVs ds T − k ( t − 1 )( 1 +π) k+1 ] ∫ (t −k−1)(1+π)+t = 1 √ × e−v Vs ds + op (1/ T ),
(3)
where we define T 1−
T t =1
V (u, v; k) = L
Zt (u), T −
1
T − k t =k+1
(4) Zt (u)Zt −k (v). 9
V (u, v; k) −→ LV (u, v; k), L
P
P
u, v ≥ 0, k ∈ Z,
(5)
for
LV (u) = E e−uVs ,
LV (u, v; k) = E
∫
k(1+π )+1
e−uVs ds
1
∫
k(1+π)
e−v Vs ds .
(6)
0
LV (u) is the Laplace transform of Vs and LV (u, v; k) is just an integrated joint Laplace transform of the variance during two unit V (u) and L V (u, v; k) are their intervals which are k days apart; L sample counterparts. The connection between LV (u, v; k) and the joint Laplace transform of the variance at two points in time can be
7 The Monte Carlo analysis in Section 4 further shows that various realistic features of the return dynamics, such as price jumps, leverage effect and joint price and volatility jumps, have very limited finite-sample effects in the estimation. 8 In fact, as shown in Todorov and Tauchen (2011b), the speed condition T /n → 0
can be relaxed to the much weaker one T /n → 0 for the class of affine jumpdiffusion stochastic volatility models that we estimate in this paper. 9 We refer to Todorov and Tauchen (2011b) for the technical conditions required for this long-span asymptotics as well as the associated CLT with it.
,
u = (u1 , . . . , uK ) ≥ 0,
t = (t1 , . . . , tK ) ≥ 0.
LV (u, v; k) =
(7)
∫
k
∫
1
LV ([u, v]; [t + s, s])dsdt
k−t k −1 ∫ k+1 ∫ k+1−t
+ ∫
k
LV ([u, v]; [t + s, s])dsdt
0
k
(t − k + 1)LV ([u, v]; [t , 0])dt
= k −1
∫
k+1
( k + 1 − t )LV ([u, v]; [t , 0])dt ,
+
(8)
k
where we have used the shorthand k = k(1 + π ). Thus, although we cannot estimate the joint Laplace transform of the stochastic variance process at two arbitrary points in time, we can get V (u, v; k). The potential loss of very close to it by the use of L information that occurs in estimation based on LV (u, v; k) instead of LV ([u, v]; [t + k, t ]) is for volatility dynamics with very short persistence.
In the infeasible case when the variance process Vt is directly observed (at integer times), one can match the empirical and model-implied joint Laplace transform at a given lag K . As shown in Feuerverger and Mureika (1977), see also Carrasco et al. (2007), appropriate weighting of these moments can lead to asymptotic equivalence to the estimation equations
T t =K +1
∇ρ p(Vt |Vt −1 , Vt −2 , . . . , Vt −K ; ρ) = 0,
(9)
where henceforth we denote with ρ the vector of parameters and p(Vt |Vt −1 , Vt −2 , . . . , Vt −K ; ρ) stands for the probability density of Vt conditional on the vector (Vt −1 , Vt −2 , . . . , Vt −K ). The estimation equations in (9) achieve the Cramer–Rao efficiency bound in the case when Vt is Markovian of order K . Our case is more complicated as we do not observe Vt and hence for the estimation problem here we can ‘‘only’’ work with LV (u, v; k) instead of LV ([u, v]; [t + k, t ]). Using the above analysis, we propose to base inference on matching the modelV (u, v; k). If implied LV (u, v; k) with the sample estimate L the volatility is constant within a day, then exactly as above appropriate weighting of these moment conditions will yield the Cramer–Rao efficiency bound based on daily direct observations of Vt . More specifically, our vector of moment conditions is given by mT (ρ) =
2
i=1
ui Vti
Then by a change of variable, and using the fact that Vt is a stationary process and therefore LV ([u, v]; [t , s]) = LV ([u, v]; [t − s, 0]) where t ≥ s, we can write for k ≥ 1
T 1 −
From here using a standard Law of Large Numbers, we easily have
V (u) −→ LV (u), L
K ∑
2.2. Estimation methodology
(t −k−1)(1+π)
V (u) = L
LV (u; t) = E e
−
∫
V (u, v; k) − LV (u, v; k|ρ) L
Rj,k
× ω(du, dv)
,
Rj,k ⊂ R2+ ,
(10)
j=1,...,J ,k=1,...,K
where we describe the construction of the regions Rj,k and the weight measure ω(du, dv) below. Then our estimator is minimum distance (GMM) with estimating equations in the vector given in
V. Todorov et al. / Journal of Econometrics 164 (2011) 367–381
converging in probability to a (10) and some weight matrix W positive definite matrix W : T (ρ). ρˆ = argmin mT (ρ)′ Wm
(11)
ρ
to be an estimate of the optimal weight matrix defined We set W by the asymptotic variance of the empirical moments to be V (u, v; k)ω(du, dv). Note that because of the matched, i.e., R L j,k
separability of data from parameters in the moment vector mT (ρ), using only the data. we construct our optimal weight matrix W This in particular means that all moments are weighted the same way regardless of the model that is estimated, provided the set of moments used in the estimation is kept the same of course. Consistency and asymptotic normality of the estimator in (11) follows from classical conditions required for GMM estimation, see e.g., Hansen (1982), as well as Theorem 2 in Todorov and Tauchen (2011b) that guarantee that (3) above holds. The intuition behind our estimator is as follows. We split R2+ into regions. Within the regions we weight the distance (u, v; k) − L(u, v; k|ρ) by the weight measure ω(du, dv), while L we let the data determine (through the optimal weight matrix the relative importance of each pair region-lag. This approach W) can be viewed as a feasible alternative to the use of a continuum of moment conditions based on L(u, v; k|ρ) (over u, v and k) as in the case when the stochastic process of interest (here Vt ) is fully observable considered in Carrasco et al. (2007). While for many models, e.g., the affine jump-diffusion class, the joint Laplace transform is known analytically, this is not the case generally for the integrated one, L(u, v; k|ρ), and hence the latter has to be evaluated by (one-dimensional) numerical integration which is quick and easy. On the other hand, using a continuum of points (u, v, k) in the estimation would (generally) involve highdimensional numerical integrations which are unstable. This can be viewed as the price to be paid for ‘‘making’’ Vt from latent to ‘‘observable’’. The weight measure ω(du, dv) we consider here is of the form
δ(ui ,vi ) e−0.5(ui +vi )/c for δx denoting Dirac delta at the point x, and we set c = 0.50 × umax where umax is the maximum value of u and v that we use in the estimation. We explain how we set 2
∑
2
2
i
umax later in our numerical applications: the goal is to pick umax such that [0, umax ] × [0, umax ] contains ‘‘most of’’ the information 2 2 2 in L(u, v; k|ρ). e−0.5(u +v )/c weighs the information coming from the different points in the region that is included, with points closer to (0, 0) receiving more weight. Given our choice of c, the lowest weight corresponds to the density of a normal distribution at 2 standard deviations and this is exactly the region where the normal density has curvature (and hence weighs differently the different points (u, v)). This is similar to the use of the Gaussian kernel in empirical characteristic function based estimation in Jiang and Knight (2002) and Carrasco et al. (2007). From a practical point of view, using Dirac deltas would probably not lead to much loss of information as the joint Laplace transform is typically rather smooth. The regions that we look at in the estimation are of the form ′
Rj,k = {(u v) ∈ [bj,k umax bj,k umax ] × [b′j,k umax bj,k umax ]}, j = 1, . . . , J , k = 1, . . . , K , ′ bj,k
(12)
where bj,k and are all in [0, 1], and in the numerical applications in the next sections we will explain how to choose them. The general idea is to cover all useful information in the (integrated) joint Laplace transform, making sure at the same time that the regions contain sufficiently different information so that we do not end up with a perfectly correlated set of moment conditions in the GMM. bj,k , b′j,k ,
371
This approach stands in contrast to existing kernel averaging approaches that enforce the same kernel averaging scheme – almost always Gaussian – over the entire continuum. Our approach uses a GMM weight matrix on top of kernel-weighting over regions, and it also differs from a strategy like that of Carrasco and Florens (2000, 2002) which entails a significant computational burden, at least in the present context, to determine a modelbased kernel that is optimal conditional on model validity. Here the weighting over the continuum of estimating equations is modelfree, and we force all models to confront the same vector of datadetermined conditions. 3. Parametric volatility specifications The proposed estimation procedure based on the Realized Laplace transform is particularly easy to implement in volatility models for which the joint Laplace transform is known in closed form (or up to numerical integration). The general affine jump-diffusion models proposed in Duffie et al. (2000, 2003) are a leading example. They have been used widely in many finance applications. Therefore, in our Monte Carlo study as well as the empirical application we illustrate the proposed estimation technique to estimate multi-factor affine jump-diffusion models for the unobserved market variance process. Using earlier evidence from daily estimation, e.g., Andersen et al. (2002) and Chernov et al. (2003), we will focus on two-factor specifications Vt = V1t + V2t ,
t > 0,
dVit = κi (θi − Vit )dt + σi
Vit dWit + dLit ,
i = 1, 2,
(13)
where Lit are Lévy subordinators with Lévy measures νi (dx). The list of the particular model specifications that we estimate and compare performance are
• Pure-continuous volatility model: one or two factor specification of (13) with Lit ≡ 0. • Pure-jump volatility model: one or two factor specification of (13) with σi ≡ θi ≡ 0 and jump measure of L specified with (15) below.
• Continuous-jump volatility model: one-factor is pure-continuous and the other is pure-jump with jump measure of L specified with (15) below. The pure-continuous volatility models are just a superposition of the standard square-root diffusion processes. The pure-jump volatility factors are also known as non-Gaussian OU models, see e.g., Barndorff-Nielsen and Shephard (2001a). In those models the volatility factor moves only through positive jumps and it reverts afterwards back to its unconditional mean level till another jump arrives (infinite activity, but finite variation, jumps are allowed). The marginal distribution is infinitely divisible (see e.g., Sato (1999)) and hence by the Lévy–Khinchine theorem can be represented (identified uniquely) by its Lévy measure. Here we follow an approach proposed by Barndorff-Nielsen and Shephard (2001a) and model the process by specifying the marginal distribution of the volatility factor and we back out from it the model for the driving Lévy process. This has the advantage that the parameters controlling the memory of the volatility process are separated from those controlling its distribution. In particular we work with pure-jump volatility factors whose marginal distribution is that corresponding to the increments of a tempered stable process (Carr et al., 2002; Rosiński, 2007). The latter is known to be a very flexible distribution. Its corresponding Lévy density is given by
νVi (x) = ci
e−λi x x1+αi
1{x>0} ,
ci > 0, λi > 0, αi < 1.
(14)
372
V. Todorov et al. / Journal of Econometrics 164 (2011) 367–381
Table 1 Parameter setting for the Monte Carlo. Case
Paramaters
One-factor pure-continuous models A κ1 = 0.50 θ1 = 1.0 B and B–L κ1 = 0.15 θ1 = 1.0 C κ1 = 0.03 θ1 = 1.0 One-factor pure-jump models D κ1 = 0.50 α1 = 0.5 E and E–L κ1 = 0.15 α1 = 0.5 Two-factor pure-jump model κ1 = 0.03 α1 = 0.5 F κ2 = 1.00 α2 = 0.5
σ1 = 0.5 σ1 = 0.2 σ1 = 0.1 c1 = 0.7979 c1 = 0.7979
λ1 = 2.0 λ1 = 2.0
c1 = 0.7596 c2 = 0.2257
λ1 = 5.0 λ2 = 1.0
Note: In cases A, B and C, Vt is independent from Wt while in case B–L we have corr (Wt , Wit ) = −0.5 × t. In cases D, E and F, volatility and price jumps are independent. In case E–L, price jumps are equal in magnitude to volatility jumps.
The parameters αi control the small jumps in the volatility factors, while λi control the big jumps. A value αi < 0 corresponds to finite activity jumps and αi ≥ 0 to infinite activity. Intuitively, the activity of the volatility jumps determines the vibrancy of the volatility factor trajectories. There are two special cases of (14): the case α = 0.0 corresponds to Gamma marginal distribution (which is also the marginal distribution of the square-root diffusion) and the case α = 0.5 corresponds to Inverse Gaussian marginal distribution. Using (34) in Appendix, we have that the Lévy density of the driving Lévy process for our pure-jump volatility factors with the specified marginal distribution by (14) is given by
νLi (x) =
αi ci κi e−λi x xαi +1
+
ci λi κi e−λi x xαi
1{x>0} .
(15)
4. Monte Carlo study We next test the performance of our estimation method on simulated data from the following models for the stochastic volatility: one-factor square-root diffusion, one-factor non-Gaussian OU model with Inverse Gaussian marginal distribution, and two-factor superposition of the above non-Gaussian OU model. The different simulated models are summarized in Table 1. In all models the mean of Vt is set to 1 (variance reported in daily percentage units). The different cases differ in the volatility persistence, the volatility of volatility and the presence of volatility jumps.10 Also, in each of the scenarios, except for case E–L, we set the price jumps to be of Lévy type with the following Lévy density (i.e., jump compensator) e− x
2
νX (x) = 0.2 × √ , π
(16)
which corresponds to compound Poisson jumps with normally distributed jump size. The selected values of the parameters in (16) imply variance due to price jumps is 0.1, which is consistent with earlier non-parametric evidence. In case E–L price and volatility jumps arrive at the same time and are equal in magnitude (but the price jumps might be with negative sign as well) and this case allows us to explore the small sample effect of dependence between price and volatility jumps as well as the robustness of our results to presence of infinite-activity price jumps. In each simulated scenario we have T = 5000 days and we sample n = 80 times during the day, which mimics our available data in the empirical application, and for simplicity we also set π = 0, i.e., there is no overnight period. The Monte Carlo results
10 We do not consider a pure-jump alternative to case C, i.e., a persistent onefactor pure-jump model as the numerical integrations needed for the Cramer–Rao bound are relatively unstable since the integrands have too much oscillation.
Fig. 1. Regions in R2+ for fixed lag used in estimation. The shaded regions are the ones used in the set of moment conditions MC1 (together with their projections on the u axis), with the numbers in them corresponding to the lags.
are based on 1000 replications. Finally, each estimation is done via the MCMC approach of Chernozhukov and Hong (2003) to classical estimation, with length of the MCMC chain of 15,000.11 The weight is computed using Parzen kernel with lag length of 70 matrix W days. The results are summarized in Tables 2 and 3. ′V (u) is around −0.01.12 This Our choice of umax is such that L V (umax ) ≈ 0.005 which in turn corresponded to umax resulted in L of around 8 for the simulated models. Therefore, in the estimation, 1 − we choose umax by the simpler rule umax = L V (0.005) which satisfies our target in terms of the derivative of the estimated Laplace transform. The moment conditions that we use are (a) regions [0.1umax 0.2umax ] × [0 0], [0.3umax 0.5umax ] × [0 0] and [0.6umax 0.9umax ] × [0 0], (b) squares [0.1umax 0.2umax ]2 , [0.3umax 0.5umax ]2 and [0.6umax 0.9umax ]2 for lags k = 1, (c) square [0.1umax 0.2umax ]2 for lag k = 5, 10, 30, (d) square [0.3umax 0.5umax ]2 for lag k = 5, 10, 30. Fig. 1 displays the above regions with the lag lengths k entered within the block and the one-dimensional regions are the heavily shaded segments along the abscissa; in subsequent diagnostic work we use other lag lengths and include the ‘‘off-diagonal’’ blocks in Fig. 1 in the estimation. We refer to the set of moment conditions immediately above as MC1. This results in 12 moment conditions, and as we confirm later in the Monte Carlo, this moment vector captures well, in a relatively parsimonious way, the information in the data about the distribution and memory of volatility. Finally, in each of the two-dimensional regions above we evaluate the integrated joint Laplace transform only in the four edges. This is done to save on computational time and does not have a significant effect on the estimation. For the one-factor models in our Monte Carlo, we can compare the efficiency of our estimation method with the infeasible case when we observe directly the latent variance process at daily frequency. The Cramer–Rao efficiency bound for the latter observational scheme is easily computable in the one-factor volatility setting with details provided in Appendix. Note that our benchmark is a daily variance and not a continuous record of the latter. A continuous record of Vt would imply that the parameter σ in the square-root model and α and c in the one-factor nonGaussian OU model can be inferred from a fixed span of data without estimation error. Instead, our goal with this comparison
11 Of course any other optimization method can be used for finding the parameter estimates than our MCMC-based approach. 12 Note that L′ (u) ≡ E(V e−uV ) is strictly decreasing in u. V
V. Todorov et al. / Journal of Econometrics 164 (2011) 367–381
373
Table 2 Monte Carlo results: one-factor models. Par
True value
RLT-based estimation
QML estimation
CRB
Median
MAD
SE
Median
MAD
SE
0.5000 1.0000 0.5000
0.4734 1.0106 0.4921
0.0267 0.0113 0.0098
0.0211 0.0120 0.0126
0.2999 1.0057 0.4068
0.2001 0.0087 0.0932
0.0093 0.0121 0.0062
0.0187 0.0142 0.0063
0.1500 1.0000 0.2000
0.1474 1.0097 0.2014
0.0107 0.0119 0.0059
0.0132 0.0165 0.0072
0.1855 1.0040 0.2488
0.0355 0.0114 0.0488
0.0085 0.0159 0.0035
0.0086 0.0188 0.0021
0.1500 1.0000 0.2000
0.1505 1.0115 0.2044
0.0089 0.0146 0.0062
0.0137 0.0165 0.0081
0.1808 0.9983 0.2430
0.0308 0.0111 0.0430
0.0081 0.0160 0.0035
0.0086 0.0188 0.0021
0.0300 1.0000 0.1000
0.0336 1.0059 0.1019
0.0040 0.0305 0.0035
0.0049 0.0413 0.0049
0.1121 0.9971 0.2050
0.0821 0.0284 0.1050
0.0199 0.0553 0.0079
0.0035 0.0465 0.0010
0.5000 0.5000 0.7979 2.0000
0.4893 0.5042 0.7845 1.9283
0.0218 0.0140 0.0477 0.1252
0.0296 0.0270 0.0943 0.1896
0.3032 – 0.7664 1.8361
0.1968 – 0.0315 0.1639
0.0118 – 0.0163 0.0743
0.0032 0.0095 0.0342 0.0810
0.1500 0.5000 0.7979 2.0000
0.1511 0.5149 0.7541 1.8996
0.0077 0.0317 0.1008 0.2164
0.0124 0.0469 0.1647 0.3026
0.1396 – 0.7491 1.7513
0.0106 – 0.0488 0.2487
0.0087 – 0.0274 0.1255
0.0002 0.0058 0.0294 0.1042
0.1500 0.5000 0.7979 2.0000
0.1519 0.5128 0.7580 1.8793
0.0078 0.0268 0.0803 0.1897
0.0125 0.0380 0.1263 0.2377
0.1424 – 0.7487 1.7181
0.0084 – 0.0492 0.2819
0.0088 – 0.0270 0.1221
0.0002 0.0058 0.0294 0.1042
Case A
κ1 θ1 σ1
Case B
κ1 θ1 σ1
Case B–L
κ1 θ1 σ1
Case C
κ1 θ1 σ1
Case D
κ1 α1 c1
λ1 Case E
κ1 α1 c1
λ1 Case E–L
κ1 α1 c1
λ1
Note: MAD stands for median absolute value around the true value; SE stands for standard error; CRB stands for Cramer–Rao bound; the reported CRBs for Case E are italicized because the associated numerical integrations for the inverse Fourier transform described in Appendix were very delicate, mainly for κ1 ; the difficulties with κ1 indirectly affects the values for the other three parameters because of the matrix inversion.
here is to gauge the potential loss of efficiency due to the use of our moments based on L(u, v; k) instead of working directly with the infeasible daily transitional density of the latent variance. In the one-factor models, we also compare our estimator with a feasible alternative using the high-frequency data that has been widely used to date. It is based on performing inference on the Integrated Variance defined as
∫ IVt =
(t −1)(1+π )+1 (t −1)(1+π )
Vs ds,
t ∈ Z.
Integrated Variance is of course unobserved, but it can be substituted with a model-free estimate from the high-frequency data. One possible such estimate that we use here is the Truncated Variance, proposed originally by Mancini (2009), defined as nt −
TVt (α, ϖ ) =
i=n(t −1)+1
(17)
In many models, and in particular the ones we use in our Monte Carlo, see e.g., Meddahi (2003) and Todorov (2009), the Integrated Variance follows an ARMA process whose coefficients are known functions of the structural parameters for the volatility (for the simulated one-factor models it is ARMA(1,1), see Appendix for the details). Then, one way of estimation based on the Integrated Variance is to match moments like mean, variance and covariance, see e.g., Bollerslev and Zhou (2002) and Corradi and Distaso (2006) in the continuous setting and Todorov (2009) in the presence of jumps. An alternative, following Barndorff-Nielsen and Shephard (2002), that we use here to compare our method with, is to do a Gaussian Quasi-Maximum Likelihood for the sequence {IVt }t ∈Z .13 The details of the necessary computations are given in the Appendix.
13 Barndorff-Nielsen and Shephard (2002) apply the method in the context of no jumps, and so use the Realized Variance. Also, unlike our use of QML here, BarndorffNielsen and Shephard (2002) take into account the error in the Realized Variance in measuring the Integrated Variance (the error does not matter in the fill-in asymptotic limit but can have a finite sample effect). We do not take this error into account to be on par with our proposed method which similarly does not account for the error in measuring LV (u, v; k) from high-frequency data.
|∆ni X |2 1{|∆ni X |≤αn−ϖ } ,
ϖ ∈ (0, 1/2),
α > 0, (18)
where here we use ϖ = 0.√ 49, i.e., a value very close to 1/2 and we further set α = 3 × BVt for BVt denoting the Bipower Variation of Barndorff-Nielsen and Shephard (2004) over the day (which is another consistent estimator of the Integrated Variance in the presence of jumps): BVt =
π
nt −
2 i=n(t −1)+2
|∆ni−1 X ||∆ni X |.
(19)
Under certain regularity conditions, see e.g., Jacod (2008), the Truncated Variance is a model-free consistent and asymptotically normal estimator (for the fill-in asymptotics) of the (unobservable) Integrated Variance defined in (17). The asymptotic justification for the joint fill-in and long-span asymptotics of the QML estimator in the presence of price jumps can be done exactly as in Todorov (2009). Table 2 summarizes the results from the Monte Carlo for the different one-factor models with Table 3 reporting the rejection rates for the corresponding test of overidentifying restrictions. In the case of the QML estimation of the pure-jump model we fix the parameter α at its true value, since the QML estimation cannot identify such richly specified marginal distribution of
374
V. Todorov et al. / Journal of Econometrics 164 (2011) 367–381 Table 4 Monte Carlo results for RLT-based estimation of case F.
Table 3 Monte Carlo results: J-test. Case
A B B–L C
df
Nominal Size
9 9 9 9
Case
1%
5%
4.29 8.00 3.70 0.87
8.95 17.00 17.90 4.13
D E E–L F
df
8 8 8 4
Nominal Size
Par
True value
Median
MAD
SE
1%
5%
2.60 1.90 3.10 6.50
9.80 5.50 8.80 19.98
κ1 α1
0.0300 0.5000 0.7596 5.0000 1.0000 0.5000 0.2257 1.0000
0.0306 0.5031 0.7808 4.8609 1.0000 0.4531 0.2482 1.0620
0.0072 0.0340 0.0963 0.6966 0.0995 0.1097 0.0551 0.2151
0.0118 0.0695 0.1809 1.1049 0.1581 0.1855 0.1077 0.3740
c1
λ1 κ2 α2 c2
14
the volatility. We can see from the table that our proposed method behaves quite well. It is virtually unbiased for almost all parameters—the only exception perhaps is the parameter λ1 in the pure-jump models which is the hardest parameter to estimate (recall it controls the big jumps in volatility). However its bias is still very small, especially when compared with the precision of its estimation. Comparing the standard errors of our estimator with Cramer– Rao bounds for the infeasible scheme of daily observations of Vt , which is used as a benchmark, we see that the performance of our proposed method is generally very good. There are, however, a few notable deviations from full efficiency. One of the reasons for this (for some of the parameters) is in the observational scheme: our estimator is based on integrated volatility measure whereas the Cramer–Rao bounds are for daily observations of spot volatility. For example, we see that for the square-root diffusion the standard errors for σ1 are somewhat bigger than the efficient bound. Note, however, that the efficiency bound for this parameter can be driven to 0 by considering more frequent (than daily) observations of the spot volatility. The same observation can be made also for the parameters c1 , α1 and κ1 in the pure-jump model. The second reason for the deviation from efficiency is in the use of high-frequency data in estimating the integrated joint Laplace transform of volatility in our estimation method. This effect can be seen by noting, for example, that the wedge between the standard errors of our estimator and the Cramer–Rao efficiency bounds widens by going from the square-root diffusive models to the purejump ones. The effect of discretization error in the latter class of models should be bigger because of the volatility jumps. Looking at cases B–L and E–L, we can note that the dependence between the price and volatility innovations, either via correlated Brownian motions or dependent price and volatility jumps, has virtually no finite-sample distortion on the estimation. The same conclusion can be made regarding the presence of price jumps in X that are infinitely active as in the case E–L. Overall, the only effect of the presence of price jumps and the frequency of sampling on our RLT-based estimation is in the standard errors. Comparing our estimator with the feasible alternative of QML on the Truncated Variance, we can see that overall the former behaves much better than the latter. The main problem of the QML estimation is that it is significantly biased for the parameters controlling memory and persistence of volatility across all considered cases. The standard errors of the QML estimators are smaller for the low persistence cases (even from the efficiency bounds) and much higher for the high persistence case, but this is hard to interpret because of the very significant biases in the estimation. What is the reason for the significant biases in the QML estimation based on the Truncated Variance? Using a Central Limit Theorem for the fill-in asymptotics we have approximately TVt (α, ϖ ) ≈
∫
(t −1)(1+π)+1
λ2
Notation as in Table 2. The results are based on 800 Monte Carlo replications.
where conditional on the volatility process, ϵt is a Gaussian error (whose volatility depends on Vs and does not shrink as the sampling frequency n increases). Then note that the objective function of the QML estimator involves squares of the Integrated Variance. Substituting TVt (α, ϖ ) for IVt in the objective function therefore introduces an error in the latter whose expectation is not zero (as it will involve squares of ϵt ) and this in turn generates the documented biases. This error of course will decrease as we sample more frequently, i.e., as n → ∞, but it clearly has a very strong effect on the precision of the QML estimator for the frequency we are interested in. A possible solution to this problem of the QML estimation based on the Truncated Variance is to recognize the approximation in (20) and derive an expression for the variance of ϵt as done for example in Barndorff-Nielsen and Shephard (2002) in the context of no price jumps. We can use similar reasoning as above to explain why our proposed RLT-based estimation does not suffer from the above problem. A Central Limit Theorem, see Todorov and Tauchen (2011b), implies Zt (u) ≈
∫
(t −1)(1+π )+1 (t −1)(1+π )
1 e−uVs ds + √ ϵ˜t , n
(21)
where conditional on the volatility process, ϵ˜t is a Gaussian error (whose volatility depends on Vs and u and does not shrink as n increases) with E(˜ϵt ϵ˜s ) = 0 for t ̸= s and t , s ∈ Z. Note that our V (u, v; k) contains all the estimation is based on the idea that L information for Vt in the high-frequency data and hence uses only these moment conditions in the estimation without any further nonlinear transformations. Using the approximation (21), we see V (u, v; k) are (approximately) unbiased for LV (u, v; k).15 that L Turning to the test for overidentifying restrictions, we can see from Table 3 that overall the test performance is satisfactory although in some of the cases there is moderate over-rejection. The worst performance is for cases B and B–L in which the finite sample size of the test is bigger with 10% from its nominal level. Such finite sample over-rejections though are consistent with prior evidence for GMM reported in Andersen and Sørensen (1996) particularly when the number of moment conditions is large (as is the case for scenarios B and B–L). Finally the results for case F are given in Table 4. We see that our estimator behaves well in this richly parameterized model. Some of the parameters have small biases (particularly α2 ) but they are insignificant compared with the magnitude of the associated standard errors. The hardest parameters to estimate are those of the transient factor, which is also twice as volatile as the persistent factor. 5. Empirical application 5.1. Initial data analysis
1
Vs ds + √ ϵt , n
(20)
We continue next with our empirical application where we use 5-min level data on the S&P 500 futures index covering the
14 Of course a method of moment estimator based on the Truncated Variation could have identified that moment.
15 For more formal discussion about the asymptotic bias in Z (u), see Todorov and t Tauchen (2011b).
(t −1)(1+π)
V. Todorov et al. / Journal of Econometrics 164 (2011) 367–381
375
Table 5 Estimation results: one-factor models. Pure-continuous Estimate
Parameter
Estimate
κ1
0.0146 (0.0027) 2.2437 (0.1078) 0.2082 (0.0132)
κ1
2.6834 (0.1926) 0.5593 (0.0247) 0.2047 (0.0178) 1.9459 (0.2669)
θ1 σ1
J Test (df) (P-Val)
period January 1, 1990, to December 31, 2008. Each day has 80 high-frequency returns. The data frequency is sparse enough so that microstructure-noise related issues are of no series concern here.16 The ratio of the overnight returns variance to the mean realized variance over the trading day is 0.3437, and we therefore set π to this number. On Fig. 2 we plot the raw high-frequency data used in our estimation as well as (a log transformation of) the Truncated Variation, which as explained in Section 4 is a modelfree measure for the daily Integrated Variance. The high-frequency returns have clearly distinguishable spikes, which underscores the importance of using volatility measure robust to jumps as is the case for the Realized Laplace Transform. Also the bottom panel of the figure suggests a complicated dynamic structure of the stochastic volatility with both persistent and transient volatility spikes present. Before turning to the estimation we need to modify slightly our analysis because of the well-known presence of a diurnal deterministic within-day pattern in volatility, see e.g., Andersen and Bollerslev (1997). To this end, Vt in (1) needs to be replaced by t = Vt × f (t − ⌊t /(1 + π )⌋) where Vt is our original stationary V volatility process and f (s) is a positive differentiable deterministic function on [0, 1] that captures the diurnal pattern. Then we correct our original Realized Laplace transform for the deterministic pattern in the volatility by replacing Zt with nt −
n i=n(t −1)+1
gi =
T n−
cos
√
√
−1/2
2u n fi
|∆nit X |2 1(|∆nit X | ≤ α n−ϖ ),
T t =1 i = 1, . . . , nT , α > 0, ϖ ∈ (0, 1/2),
1{fi ̸=0} ∆ni X ,
g =
324.36(6) (p = 0.00)
α1 c1
λ1
209.64(5)
(p = 0.00)
Note: The set of moment conditions MC0 defined in the text is used in the estimation. Standard errors for the parameter estimates are reported in parentheses.
Fig. 2. S&P 500 index data.
1 Zt (u) =
Pure-jump
Parameter
n 1−
n i=1
gi fi = , g gi ,
(22)
where it = t − 1 + i −[i/n]n, for i = 1, . . . , nT and t = 1, . . . , T . As for the construction of the Truncated Variance in Section 4, we set √ α = 3 × BVt and ϖ = 0.49. We further put tilde to all estimators of Section 2 in which Zt is replaced with Zt . In this setting of diurnal V (u) and patterns in volatility (5) will still hold when we replace L V (u, v; k) with L V (u) and L V (u, v; k).17 Intuitively, L fi estimates the deterministic component of the stochastic variance, and then in Zt (u) we ‘‘standardize’’ the high-frequency increments by it.
16 For example, the autocorrelations in the 5-min returns series are very small and insignificant. 17 One can further quantify the effect from the correction for diurnal pattern on
V (u, v; k), see Todorov and Tauchen (2011b), but this effect the standard errors of L is relatively small and therefore we ignore it in the subsequent work.
5.2. Estimation results We proceed with the estimation of the different volatility models discussed in Section 3. In our estimation, we set umax = 1 ′ − L V (0.1) which results in a derivative LV (u) at umax of around −0.01 and a value of umax close to 8 exactly as in the Monte Carlo. Later we check the robustness of our findings with respect to the choice of umax . In the estimation we use the same set of moment conditions as in the Monte Carlo but we drop initially the last three moments in (d) of MC1, which results in 9 moment conditions. We refer to this reduced set of moments as MC0. As for the estimation on the simulated data, for all results here the optimal weighting matrix is estimated using a Parzen kernel with lag length √ of 70. Also, we always impose the stationarity restriction σi ≤ 2θi κi for i = 1, 2 for the square-root processes.18 The results for the one-factor volatility models are given in Table 5. Not surprisingly these models cannot fit the data very well as evidenced by the extremely large values of the J test. The pure-jump model performs far better than the pure-continuous model and this is because it is more flexible in the type of marginal distribution for the volatility it can generate. We also note that the estimated mean reversion parameters in the two models are very different as both models struggle to match the initial fast drop in the autocorrelation of the volatility caused by the many short-term volatility spikes evident from the bottom panel of Fig. 2. We turn next to the two-factor stochastic volatility models. The estimation results for these models are given in Table 6. As we see from the table, the J tests for the two-factor models drop significantly as these models have a better chance to capture simultaneously the short-lived spikes in volatility together with its more persistent shifts. At the same time the performance of the models differ significantly: two-factor pure-continuous and continuous-jump models have still significant difficulties in matching the moments from the data, unlike the two-factor purejump model. Where is that difference in performance coming from? Looking at the mean-reversion parameter estimates, we see that they are quite similar across models: one is very persistent (capturing the persistent shifts in volatility) and the other one is very fast mean-reverting (capturing the short-term volatility spikes). Also, the implied mean of the volatility across models is very similar. Where the models start to differ, which explains their different success, is the ability to generate volatility of volatility in the different factors. First, the pure-continuous model cannot
18 The estimation is done as in the Monte Carlo via MCMC but the length of the chain is 500,000 draws.
376
V. Todorov et al. / Journal of Econometrics 164 (2011) 367–381
Table 6 Estimation results: two-factor models. Pure-continuous
Continuous-jump
Pure-jump
Parameter
Estimate
Parameter
Estimate
Parameter
Estimate
κ1
0.0186 (0.0024) 0.6100 (0.0423) 0.1504 (0.0052)
κ1
0.0176 (0.0106) 0.4152 (0.0702) 0.1209 (0.0148)
κ1
0.0122 (0.0045) 0.3630 (0.1459) 0.1826 (0.0666) 0.1836 (0.1852) 2.2167 (0.1779) 0.5692 (0.0864) 0.1143 (0.0286) 0.6039 (0.1509)
θ1 σ1
κ2 θ2 σ2
J Test (df) (P-Val)
1.3776 (0.1336) 0.4434 (0.0168) 1.1052 (0.0523)
θ1 σ1
κ2 α2 c2
λ2
97.87(3)
(p = 0.000)
2.3605 (0.4000) 0.5403 (0.0548) 0.1050 (0.0246) 0.3420 (0.1110) 39.64(2) (p = 0.000)
α1 c1
λ1 κ2 α2 c2
λ2
1.61 (1)
(p = 0.205)
Note: The set of moment conditions MC0 defined in the text is used in the estimation. Standard errors for the parameter estimates are reported in parentheses.
generate enough volatility of volatility both in the persistent and the transient volatility components. This explains its very bad performance. This fact is most clearly seen by noting that for both factors, the parameters are on the boundary of the stationarity restriction (which generates the highest possible volatility of the factors). When we move from the pure-continuous to the continuous-jump model, we can see a significant improvement of the fit: the J test drops approximately by half. It is interesting to note that in this model the persistent volatility factor is the squareroot diffusion and the pure-jump factor captures the transient day-to-day moves in volatility. Now the volatility of the transient factor can increase. Indeed, its coefficient of variation (standard deviation over mean) rises from 1.00 to 2.01. However, as for the pure-continuous model, the persistent factor is on the boundary of the stationarity condition as the model is struggling to reproduce the pattern of the persistent shifts in ‘‘observed’’ volatility. This shows also that our set of moment conditions identify not only the unconditional distribution of the volatility and its persistence, but it also extracts from the data information about the volatility of the persistent and transient shifts in volatility. Finally, when we model the two volatility factors to be of purejump type, we see that the J test falls to a level that corresponds to p = 0.205, i.e., such a specification does not ‘‘struggle’’ any more to fit the moments in the set MC0. We discuss briefly the parameter estimates of this best performing two-factor pure-jump model. First, the implied mean of the persistent V1 is 0.7580 while that of the transient V2 is 0.2913. This implies that the estimated unconditional mean of the diffusive volatility is 1.05 (recall we quote in daily percentage units). Note that the Realized Laplace Transform captures only the diffusive volatility and is robust to the price jumps. It has ‘‘built-in’’ truncation and we did not have to remove the ‘‘big’’ price increments in its construction to make it robust to jumps as is done in the Truncated Variation. We will later compare the estimated model’s implications for the Integrated Variance with those observed in the data (via the model-free Truncated Variation). The half life of the persistent factor is 56.81 and of the transient is 0.32. This provides a good fit for the persistent and transient shocks in the volatility observed in the bottom panel of Fig. 2. The coefficient of variation for the persistent factor is 2.1395 while that for the transient factor is 1.5627. Interestingly, the data ‘‘requires’’ quite a volatile persistent factor in addition to the already present volatile transient factor.
Fig. 3. The Lévy densities of the volatility jumps are defined in (15) and the parameter estimates used in the calculation are reported in the last column of Table 6. The solid line corresponds to V1t and the dotted line to V2t .
On Fig. 3 we contrast the implied Lévy densities of the driving Lévy processes of the two factors. As seen from the figure, the Lévy density of the transient factor is above that of the persistent one except for the very big jumps (not shown on the figure). The transient factor contains more jumps than the persistent one and their effect on the future value of volatility quickly dies out. The slight increase in the wedge between the two densities for the smaller jump sizes is a manifest of α2 > α1 , i.e., the transient volatility factor is more vibrant.19 To sum up, the estimation results suggest a very persistent component of volatility which moves mainly through big jumps and fast mean-reverting component which is much more vibrant and captures day-to-day moves in volatility as well as occasional spikes in it. This behavior of the components of the volatility can be clearly seen from Fig. 4 which plots a simulation of the two factors over a period of length as that of our sample.
19 Note that since both driving jump processes of the volatility factors are infinitely active, their Lévy densities explode at 0.
V. Todorov et al. / Journal of Econometrics 164 (2011) 367–381
377
Table 7 Two-factor pure-jump model diagnostics. Alternative u-cutoffs
Alternative moments
Par
V (umax ) = 0.15 L
V (umax ) = 0.05 L
MC1
MC2
MC3
MC4
κ1
0.0119 (0.0053) 0.5252 (0.1205) 0.1490 (0.0526) 0.0835 (0.1905) 2.0505 (0.1763) 0.4450 (0.1791) 0.1274 (0.0481) 0.5877 (0.2128)
0.0123 (0.0043) 0.1994 (0.1504) 0.2594 (0.0787) 0.3374 (0.2094) 2.3683 (0.1640) 0.5675 (0.0879) 0.1283 (0.0346) 0.7693 (0.2174)
0.0118 (0.0043) 0.3067 (0.1584) 0.1981 (0.0631) 0.1993 (0.1707) 2.6130 (0.1584) 0.5888 (0.1331) 0.1066 (0.0383) 0.6094 (0.1948)
0.0228 (0.0041) 0.5059 (0.1698) 0.1501 (0.0850) 0.1233 (0.2865) 2.7780 (0.1381) 0.4525 (0.1677) 0.1331 (0.0528) 0.7979 (0.2644)
0.0145 (0.0045) 0.1120 (0.1212) 0.2148 (0.0512) 0.2303 (0.1271) 2.6168 (0.4359) 0.5899 (0.04570) 0.1361 (0.0246) 0.9915 (0.1464)
0.0229 (0.0041) 0.5059 (0.1690) 0.1501 (0.0846) 0.1233 (0.2855) 2.7780 (0.1380) 0.4525 (0.1671) 0.1331 (0.0526) 0.7979 (0.2637)
1.00 (1) p = 0.310
3.23 (1) p = 0.072
14.00 (4) p = 0.007
10.59(4) p = 0.032
26.81 (5) p = 0.000
23.44(5) p = 0.000
α1 c1
λ1 κ2 α2 c2
λ2 J Test (df) P-Val
Note: The alternative sets of moment conditions MC1, MC2, MC3, and MC4 are defined in the text. Standard errors for the parameter estimates are reported in parentheses.
Fig. 4. Simulated volatility from the estimated two-factor pure-jump model with parameter estimates reported in the last column of Table 6. The units on the horizontal axes are simulated (trading) days.
We next test the performance of the best performing model, i.e., the two-factor pure-jump model, in two ways. First, we add additional moments of the integrated joint Laplace transform and also try alternative cutoff levels umax . The list of the alternative sets of moment conditions we test the model are listed below (1) MC1: Defined in Section 4, (2) MC2: We replace (d) in MC1 with region [0.6umax 0.9umax ]2 for lags k = 5, 10, 30, (3) MC3: We replace (d) in MC1 with region [0.1umax 0.2umax ] × [0.6umax 0.9umax ] for lags k = 1, 5, 10, 30, (4) MC4: We replace (d) in MC1 with region [0.6umax 0.9umax ] × [0.1umax 0.2umax ] for lags k = 1, 5, 10, 30. The results of all the above robustness checks are reported in Table 7. Looking at the overall fit first, i.e., the J test, we can see that under the alternative truncation rules for picking umax as well as the first two alternative moment sets MC1 and MC2, the model fit is relatively good. The worst of those cases is the set of moment conditions MC1 where the value of the J test corresponds to a p-value of only 0.7% but this is consistent with the slight finite-sample overrejections of the J test reported in the Monte Carlo experiment. On the other hand, the model struggles with the moment sets MC3 and MC4. In these two regions, u is high and v is low and vice versa (recall the definition in 4). Intuitively,
the high/low level of u puts relatively more importance to very low/high levels of Vt (for this moment condition) and the same holds true for the link between v and Vt −k . Given the above interpretation of the added moments in these two moment sets, the model clearly appears to struggle in matching simultaneously the persistence in volatility and the frequency and speed with which volatility moves from very low to very high levels and vice versa. Turning next to the parameter estimates of the model across the different estimation setups reported in Table 7, we can see that the parameters controlling the persistence of the volatility factors and the marginal distribution of the transient volatility factor are relatively stable. On the other hand, the parameters controlling the persistent volatility factor vary somewhat over the different cases, most notably the estimates of α1 . Apart from the fact that the latter is hard to identify (which is reflected in its relatively big standard error), this is suggestive of some model misspecification of the persistent component of the volatility. Our second test for the model performance is to verify how successfully it can fit the moments of the Truncated Variation (which is directly observed). The latter has not been used in the estimation as our inference is based only on the Realized Laplace Transform, and hence this provides a stringent test for the model performance. In Table 8, we compare the first and the second moment as well as the autocorrelation of log[TVt (α, ϖ ) + 1] implied by the model with that in our data. The transformation log(1 + x) behaves like x for small values of x but is more robust to the outliers, and hence this transformation of the Truncated Variation is much more reliably estimated from the high-frequency data. This is the reason why we use it in our analysis here.20 As seen from the table, the model can very comfortably match the moments of the Truncated Variation estimated from the data.21 This is due to the fact that our estimation procedure selected the model not only by its fit to the mean, variance and persistence of volatility, but rather by its ability to fit the whole transitional density of the volatility.
20 This is similar to the transformations of measures of realized variation used in e.g., Andersen et al. (2003) when constructing reduced-form based volatility forecasts. 21 We also redid the calculations in Table 8 by ‘‘removing’’ the deterministic
component of volatility in constructing TVt (α, ϖ ) using fi in (22). This resulted in very small changes in the last column of Table 8.
378
V. Todorov et al. / Journal of Econometrics 164 (2011) 367–381
Table 8 Two-factor pure-jump model diagnostics: implied moments of truncated variation. Moment
Model-implied
95% CI from data
E (log[TVt (α, ϖ ) + 1]) E (log[TVt (α, ϖ ) + 1])2 AC 1 of log[TVt (α, ϖ ) + 1] AC 5 of log[TVt (α, ϖ ) + 1] AC 10 of log[TVt (α, ϖ ) + 1] AC 30 of log[TVt (α, ϖ ) + 1]
0.5488 0.5255 0.8763 0.8260 0.7857 0.6518
[0.4253 0.5613] [0.2739 0.5564] [0.7948 0.9051] [0.6240 0.8420] [0.5423 0.8114] [0.2324 0.6921]
Note: The model implied moments are computed from a long simulation of the estimated two-factor pure-jump model with parameters reported in the last column of Table 6. Standard errors for the confidence intervals in the last column are computed using the Parzen kernel with lag length of 70.
6. Conclusion In this paper we propose an efficient method for estimation of parametric models for the volatility of general Itô semimartingales sampled at high-frequencies. The estimation is based on the model-free Realized Laplace Transform of volatility proposed in Todorov and Tauchen (2011b) and is robust to the presence of jumps in the price dynamics. The technique is particularly tractable and easy to apply in volatility models with joint characteristic function known in closed form up to a relatively easy numerical integration. The latter is the case for the class of the general affine jump-diffusion models of Duffie et al. (2000, 2003), which are widely used in financial applications. A Monte Carlo assessment documents good robustness and efficiency properties of the estimator in empirically plausible settings. The empirical application illustrates the ability of the proposed estimator to extract important information in the data regarding the dynamic properties of volatility. Our method identifies two components of volatility, which is consistent with earlier empirical evidence. However, the method has the power to discriminate among different models for the dynamic properties of the two volatility components, and in particular indicates they are of pure jump type. In the preferred volatility model, the transient volatility component has occasional big spikes but also a lot of small jumps that capture day-to-day variations. On the other hand the persistent volatility factor moves mainly through big jumps— its dynamics are somewhat similar to a regime switching type model but with gradual decay of the high volatility regime. Finally, our estimation is robust to price level jumps in a way that avoids extra tuning parameters, so inferences regarding volatility are not influenced by an incorrect specification for the price jumps. The general dynamics of volatility and price jumps are well known to be quite different, which, in the literature, motivates modeling of volatility separately from price jumps. In subsequent applications, of course, one needs to reassemble the pieces and therefore naturally be interested also in the price jumps, as they form an important part (between five to fifteen percent) of the total variability risk associated with the asset. Inference for the jump part of the price in the general case is complicated as different parameters of the jump specification can be estimated at various rates even in the relatively simple i.i.d. setting as shown in Ait-Sahalia and Jacod (2008). In the case of finite activity jumps, however, one can adopt the approach of Bollerslev and Todorov (forthcoming) for estimation of jump tails. We leave the systematic study of the problem of parametric estimation of the jump component of the price for future work. Appendix
A.1. Laplace transforms for affine jump-diffusion models

In general, if $V_t$ is a superposition of independent factors, i.e., if $V_t = \sum_{j=1}^{k} V_{jt}$, then we have
$$L_V(u;t) = \prod_{j=1}^{k} L_{V_j}(u;t). \qquad (23)$$
Therefore, for our inference methods, we will need a formula for $L_{V_j}(u;t)$ for each of the individual factors. We do the calculations first for a general affine jump-diffusion volatility factor, and then we specialize to its two special cases: the pure-continuous (square-root diffusion) and pure-jump (non-Gaussian OU) models. For simplicity, in the subsequent calculations we refer to the factor as $V$, i.e., we drop the subscript. Also, in what follows we denote with lower case $l$ the log-Laplace transforms (both marginal and joint). We denote
$$\psi(u) = \int_{\mathbb{R}} (e^{ux}-1)\,\nu(dx), \qquad u\in\mathbb{C} \text{ with } \Re(u)\le 0, \qquad (24)$$
where recall $\nu(dx)$ is the Lévy measure of the Lévy subordinator $L_t$; $\psi(u)$ is the characteristic exponent of $L_1$. We note that $l_L(u)\equiv\psi(iu)$ for $u\in\mathbb{R}$, where recall $l_L(u)$ denotes the log-Laplace transform. Set $f(t,v) = \mathbb{E}\big(e^{iuV_T}\mid V_t=v\big)$. Then $f(t,v)$ solves the following partial integro-differential equation:
$$\frac{\partial f}{\partial t} + \kappa(\theta-v)\frac{\partial f}{\partial v} + \frac{1}{2}\sigma^2 v\,\frac{\partial^2 f}{\partial v^2} + \int_{\mathbb{R}}\big(f(t,v+z)-f(t,v)\big)\,\nu(dz) = 0, \qquad (25)$$
with terminal condition $f(T,v) = e^{iuv}$. Guessing a solution of the form
$$f(t,v) = e^{\alpha(u,T-t)+\beta(u,T-t)v}, \qquad (26)$$
reduces the problem to the following system of ODEs:
$$\alpha' = \kappa\theta\,\beta + \psi(\beta), \qquad \alpha(u,0) = 0, \qquad \beta' = -\kappa\beta + \frac{\sigma^2}{2}\beta^2, \qquad \beta(u,0) = iu, \qquad (27)$$
where $\alpha'$ and $\beta'$ denote derivatives with respect to the time argument. Thus, finally, for $u\in\mathbb{R}$ and $T\ge t$, we have:
$$\mathbb{E}\big(e^{iuV_T}\mid\mathcal{F}_t\big) = \exp\big(\alpha(u,T-t)+\beta(u,T-t)V_t\big),$$
$$\alpha(u,s) = \kappa\theta\int_0^s \beta(u,z)\,dz + \int_0^s \psi(\beta(u,z))\,dz, \qquad \beta(u,s) = \frac{\kappa\, iu\, e^{-\kappa s}}{\kappa - iu\,\sigma^2(1-e^{-\kappa s})/2}. \qquad (28)$$

A.1.1. Square-root diffusion

Specializing (28) for this case, we get
$$L_V([u,v];[t,s]) = \left(1+\frac{u}{c(|t-s|)}\right)^{-2\kappa\theta/\sigma^2} L_V\!\left(\frac{u\,e^{-\kappa|t-s|}}{1+u/c(|t-s|)}+v\right), \qquad (29)$$
where
$$c(z) = \frac{2\kappa}{\sigma^2(1-e^{-\kappa z})}. \qquad (30)$$
The marginal distribution of the square-root diffusion is Gamma, see e.g. Cont and Tankov (2004, p. 476), and we have
$$L_V(u) = \left(\frac{1}{1+u\sigma^2/(2\kappa)}\right)^{2\kappa\theta/\sigma^2}. \qquad (31)$$
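To make the formulas above concrete, here is a minimal Python sketch (ours, not the authors' code; the trapezoidal quadrature for the integrals in (28) is our own choice) that evaluates $\beta(u,s)$ in closed form, obtains $\alpha(u,s)$ by numerical integration as in (28), and implements the square-root-diffusion marginal Laplace transform (31):

```python
import numpy as np

def beta(u, s, kappa, sigma2):
    """Closed-form Riccati solution beta(u, s) from (28)."""
    iu = 1j * u
    return kappa * iu * np.exp(-kappa * s) / (kappa - iu * sigma2 * (1.0 - np.exp(-kappa * s)) / 2.0)

def alpha(u, s, kappa, theta, sigma2, psi, ngrid=2000):
    """alpha(u, s) from (28) via trapezoidal quadrature; psi is the
    characteristic exponent (24) of the driving subordinator
    (psi identically zero for the pure square-root diffusion)."""
    z = np.linspace(0.0, s, ngrid)
    b = beta(u, z, kappa, sigma2)
    return np.trapz(kappa * theta * b + psi(b), z)

def cir_marginal_laplace(u, kappa, theta, sigma2):
    """Marginal (Gamma) Laplace transform of the square-root diffusion, eq. (31)."""
    return (1.0 + u * sigma2 / (2.0 * kappa)) ** (-2.0 * kappa * theta / sigma2)

# Conditional characteristic function E[exp(iu V_T) | F_t] per (28), with illustrative parameters:
kappa, theta, sigma2, u, s, v_t = 0.5, 1.0, 0.2, 2.0, 1.0, 1.3
cf = np.exp(alpha(u, s, kappa, theta, sigma2, psi=lambda b: 0.0 * b) + beta(u, s, kappa, sigma2) * v_t)
```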
A.1.2. Non-Gaussian OU process

Specializing (28) for this case (or even by direct computation), we get
$$L_V([u,v];[t,s]) = L_V\!\left(u\,e^{-\kappa|t-s|}+v\right)\times\exp\left(\int_0^{|t-s|} l_L\!\left(u\,e^{-\kappa z}\right)dz\right). \qquad (32)$$
For the non-Gaussian OU model there is a very convenient link (for the purposes of volatility modeling) between the Laplace transform of the driving Lévy subordinator $L_t$ and that of the process $V_t$. In particular, we have (see e.g. Barndorff-Nielsen and Shephard (2001a))
$$l_L(u) = u\kappa\times l_V'(u), \qquad u\ge 0. \qquad (33)$$
Hence, once we specify the Laplace transform of the marginal, we can determine that of the driving Lévy subordinator $L_t$, and from here easily calculate the joint Laplace transform $L_V(u,v;t,s)$. Further, the Lévy densities of $V_t$ and $L_t$, $\nu_V$ and $\nu_L$ respectively, are linked via (Barndorff-Nielsen and Shephard, 2001a; Sato, 1999)
$$\nu_L(x) = -\kappa\big(\nu_V(x) + x\,\nu_V'(x)\big). \qquad (34)$$

Example. Non-Gaussian OU model with tempered stable marginal distribution. The log-Laplace transform of $V$, i.e., the log-Laplace transform of the tempered stable process, is
$$l_V(u) = \begin{cases} c\,\Gamma(-\alpha)\big[(\lambda+u)^{\alpha}-\lambda^{\alpha}\big], & \text{if } \alpha\in(0,1),\\ -c\log(1+u/\lambda), & \text{if } \alpha=0, \end{cases} \qquad (35)$$
where $\Gamma(-\alpha) = -\frac{1}{\alpha}\Gamma(1-\alpha)$ for $\alpha\in(0,1)$ and $\Gamma$ is the standard Gamma function. From here, using (33), we easily get
$$l_L(u) = \begin{cases} c\,\Gamma(-\alpha)\,\alpha\kappa\, u\,(\lambda+u)^{\alpha-1}, & \text{if } \alpha\in(0,1),\\ -\dfrac{c\kappa u}{\lambda+u}, & \text{if } \alpha=0, \end{cases} \qquad (36)$$
$$\int_0^{|t-s|} l_L\!\left(u\,e^{-\kappa z}\right)dz = \begin{cases} c\,\Gamma(-\alpha)\big[(\lambda+u)^{\alpha} - (\lambda+u\,e^{-\kappa|t-s|})^{\alpha}\big], & \text{if } \alpha\in(0,1),\\ -c\big[\log(\lambda+u) - \log(\lambda+u\,e^{-\kappa|t-s|})\big], & \text{if } \alpha=0. \end{cases} \qquad (37)$$

A.2. Details on the simulation of volatility models in the Monte Carlo

To keep notation simple, we continue to remove the subscript for the volatility factors. The simulation of the square-root diffusion is done by a standard Euler scheme. The simulation of the non-Gaussian OU processes is done via the following scheme (recall that we need the volatility process $V_t$ on the grid $0,\frac{1}{n},\frac{2}{n},\ldots,T$):
$$V_{\frac{i}{n}} = e^{-\kappa/n}\left(V_{\frac{i-1}{n}} + \int_{\frac{i-1}{n}}^{\frac{i}{n}} e^{\kappa\left(s-\frac{i-1}{n}\right)}\,dL_s\right) \approx e^{-\kappa/n}\left(V_{\frac{i-1}{n}} + \sum_{j=1}^{m} e^{\kappa\frac{j-1}{nm}}\left(L_{\frac{i-1}{n}+\frac{j}{nm}} - L_{\frac{i-1}{n}+\frac{j-1}{nm}}\right)\right), \qquad i=1,\ldots,nT. \qquad (38)$$
In our case $T=5000$, $n=80$ and $m=80$, which corresponds to a discretization of around 4 s. The simulation of the driving Lévy subordinator in the Monte Carlo is done as follows. We make use of the following representation of $L_t$ for $\alpha\ge 0$, which follows immediately from (34) (see also e.g. Barndorff-Nielsen and Shephard (2001b)):
$$L_t \stackrel{d}{=} L_{1t} + L_{2t}, \qquad L_{1t}\perp L_{2t}, \qquad (39)$$
where $L_{1t}$ is a Lévy process with Lévy measure $\kappa c\alpha\,\dfrac{e^{-\lambda x}}{x^{1+\alpha}}\,1_{\{x>0\}}$, and
$$L_{2t} = \sum_{j=1}^{N_t} Y_j, \qquad N_t \sim \text{Poisson process with intensity } t\times\kappa c\lambda^{\alpha}\Gamma(1-\alpha), \qquad Y_j\sim G(1-\alpha,\lambda),$$
where $G(a,b)$ stands for the Gamma distribution with probability density $\frac{b^a x^{a-1}}{\Gamma(a)}\,e^{-bx}\,1_{\{x>0\}}$ for $a,b>0$.

For $\alpha=0.5$, $L_{1t}$ has an Inverse-Gaussian distribution, denoted as $IG(\mu,\nu)$, with parameters $\mu = \frac{1}{2}\,\frac{\Gamma(0.5)\,\kappa c t}{\sqrt{\lambda}}$ and $\nu = \frac{1}{2}\,(\kappa c t)^2\,(\Gamma(0.5))^2$. The Laplace transform of a variable $Y$ with $Y\sim IG(\mu,\nu)$ is given by
$$\mathbb{E}\big(e^{-uY}\big) = \exp\left(\frac{\nu}{\mu}\left(1-\sqrt{1+2\mu^2 u/\nu}\right)\right).$$
To simulate $Y\sim IG(\mu,\nu)$, do the following: draw $x\sim N(0,1)$ and $u\sim U(0,1)$, and denote $z = \mu + \frac{\mu^2 x^2}{2\nu} - \frac{\mu}{2\nu}\sqrt{4\mu\nu x^2 + \mu^2 x^4}$. Then
$$Y = \begin{cases} z & \text{if } u < \mu/(\mu+z),\\ \mu^2/z & \text{otherwise}. \end{cases}$$
Finally, the simulation of the driving Lévy subordinators in the estimated two-factor pure-jump volatility model in Section 5, in which $\alpha_i\ne 0.5$, is done using (39) together with a shot-noise decomposition of the Lévy measure of $L_{1t}$ in (39), with 500,000 shot-noise terms on average in each discretization period.
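The following Python sketch illustrates the simulation scheme in (38)–(39) for the case α = 0.5 (our own illustration with arbitrary parameter values, not the paper's code): IG increments of $L_{1t}$ are drawn with the accept–reject transformation described above, the compound-Poisson part $L_{2t}$ is added, and the increments are fed through the recursion (38).

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(0)

def ig_draw(mu, nu, size=1):
    """Sample from IG(mu, nu) via the transformation method in the text."""
    x = rng.standard_normal(size)
    u = rng.uniform(size=size)
    z = mu + mu**2 * x**2 / (2 * nu) - (mu / (2 * nu)) * np.sqrt(4 * mu * nu * x**2 + mu**2 * x**4)
    return np.where(u < mu / (mu + z), z, mu**2 / z)

def subordinator_increment(dt, kappa, c, lam, alpha=0.5):
    """Increment of L over an interval of length dt using the decomposition (39)."""
    # L1 part: IG with the parameters given in the text (t = dt); valid for alpha = 0.5.
    mu = 0.5 * gamma(0.5) * kappa * c * dt / np.sqrt(lam)
    nu = 0.5 * (kappa * c * dt)**2 * gamma(0.5)**2
    l1 = ig_draw(mu, nu)[0]
    # L2 part: compound Poisson with Gamma(1 - alpha, lam) jumps.
    n_jumps = rng.poisson(dt * kappa * c * lam**alpha * gamma(1 - alpha))
    l2 = rng.gamma(1 - alpha, 1 / lam, size=n_jumps).sum()
    return l1 + l2

def simulate_ou(v0, kappa, c, lam, n=80, m=80, T=10):
    """Recursion (38): V on the grid 0, 1/n, ..., T, with m subintervals per step.
    (The paper uses T = 5000; a small T keeps this sketch fast.)"""
    v = np.empty(n * T + 1)
    v[0] = v0
    for i in range(1, n * T + 1):
        incr = sum(np.exp(kappa * (j - 1) / (n * m)) *
                   subordinator_increment(1 / (n * m), kappa, c, lam)
                   for j in range(1, m + 1))
        v[i] = np.exp(-kappa / n) * (v[i - 1] + incr)
    return v
```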
A.3. ARMA representations for integrated variance in one-factor models

Easy but rather tedious calculations show, see e.g. Todorov (2011), that the ARMA(1,1) representation of $IV$ in the one-factor pure-continuous and pure-jump models is given by
$$(IV_t-\mu) - e^{-\kappa}(IV_{t-1}-\mu) = \epsilon_t + \varphi\,\epsilon_{t-1}, \qquad (40)$$
where $\epsilon_t$ is white noise, i.e., $\mathbb{E}(\epsilon_t\epsilon_s)=0$ for $t\ne s$. In both cases we have
$$\varphi = \frac{1+e^{-2\kappa}-2\eta e^{-\kappa} - \sqrt{(1+e^{-2\kappa}-2e^{-\kappa}\eta)^2 - 4(\eta-e^{-\kappa})^2}}{2(\eta-e^{-\kappa})}, \qquad \eta = \frac{e^{-\kappa}(e^{-\kappa}-1)^2}{2(e^{-\kappa}+\kappa-1)}. \qquad (41)$$
For the rest of the parameters in the ARMA representation we have:

• pure-continuous model
$$\mu = \theta, \qquad \mathrm{Var}(\epsilon_t) = \frac{\sigma^2\theta\, e^{-\kappa}}{2\kappa^3\varphi}\left[(e^{-\kappa}-1)^2 - 2(e^{-\kappa}+\kappa-1)\right], \qquad (42)$$

• pure-jump model
$$\mu = c\,\Gamma(1-\alpha)\,\lambda^{\alpha-1}, \qquad \mathrm{Var}(\epsilon_t) = \frac{c\,e^{-\kappa}\lambda^{\alpha-2}\left(\alpha\Gamma(2-\alpha)+\Gamma(3-\alpha)\right)}{2\kappa^2\varphi}\left[(e^{-\kappa}-1)^2-2(e^{-\kappa}+\kappa-1)\right]. \qquad (43)$$

The QML estimators are found by maximizing the Gaussian likelihood
$$-\frac{1}{2T}\sum_{t=1}^{T}\frac{\epsilon_t^2}{\mathrm{Var}(\epsilon_t)} - \frac{1}{2}\log\big(\mathrm{Var}(\epsilon_t)\big), \qquad (44)$$
where, for a given parameter vector, $\epsilon_t$ is determined recursively from the data by $\epsilon_t = (IV_t-\mu) - e^{-\kappa}(IV_{t-1}-\mu) - \varphi\,\epsilon_{t-1}$ with $\epsilon_0 = 0$ (note that the MA coefficient is smaller than 1 in absolute value). $IV_t$ is estimated from the high-frequency data via (18)–(19).
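In code, the recursion and the likelihood (44) might look as follows (a minimal sketch under our own naming; `iv` is an array of integrated-variance estimates, and the ARMA quantities µ, φ, Var(ϵ) come from (41)–(43)):

```python
import numpy as np

def qml_objective(iv, mu, kappa, phi, var_eps):
    """Gaussian quasi-likelihood (44) for the ARMA(1,1) representation (40).

    iv      : array of integrated variance estimates IV_1, ..., IV_T
    mu      : unconditional mean of IV, eq. (42) or (43)
    kappa   : mean-reversion parameter (AR coefficient is exp(-kappa))
    phi     : MA coefficient from (41), |phi| < 1
    var_eps : innovation variance from (42) or (43)
    """
    dev = iv - mu
    eps = np.empty_like(dev)
    eps_prev = 0.0  # epsilon_0 = 0, as in the text
    for t in range(len(dev)):
        # first lag is unavailable for t = 0; initialise the deviation at its mean (zero)
        lag = dev[t - 1] if t > 0 else 0.0
        eps[t] = dev[t] - np.exp(-kappa) * lag - phi * eps_prev
        eps_prev = eps[t]
    T = len(eps)
    return -(eps**2).sum() / (2 * T * var_eps) - 0.5 * np.log(var_eps)
```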
A.4. Computing the Cramér–Rao lower bound

We wish to compute the Cramér–Rao lower bound for the parameter vector $\rho$ of the model defined by (13) in the main text, on the assumption that the volatility process $V_t$ is observed at integer values of $t$. Let $p(V_{t+1}|V_t,\rho)$ denote the model-implied transition density, so the task is to compute the information matrix
$$\mathbb{E}\left[\frac{\partial}{\partial\rho}\log\big[p(V_{t+1}|V_t,\rho)\big]\left(\frac{\partial}{\partial\rho}\log\big[p(V_{t+1}|V_t,\rho)\big]\right)'\right]. \qquad (45)$$
A.4.1. Pure-continuous volatility models, cases A–C

For these cases the parameter vector is $\rho=(\kappa\ \theta\ \sigma)'$ and the conditional density of $2c(\rho)V_{t+1}$ is non-central chi-squared, where
$$c(\rho) = \frac{2\kappa}{\sigma^2\,(1-e^{-\kappa\Delta})}, \qquad (46)$$
the degrees-of-freedom parameter is
$$df(\rho) = \frac{4\kappa\theta}{\sigma^2}, \qquad (47)$$
and the non-centrality parameter is
$$\nu_t(\rho) = 2c(\rho)\,e^{-\kappa\Delta}\,V_t. \qquad (48)$$
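This density is readily evaluated with standard software; for example, a short Python sketch (ours) using scipy's noncentral chi-squared distribution and the change of variables implied by (46)–(48):

```python
import numpy as np
from scipy.stats import ncx2

def cir_transition_density(v_next, v_now, kappa, theta, sigma, delta=1.0):
    """Exact CIR transition density: 2c * V_{t+1} | V_t is noncentral chi-squared
    with df (47) and noncentrality (48), so p(v_next | v_now) = 2c * ncx2.pdf(2c*v_next, df, nc)."""
    c = 2 * kappa / (sigma**2 * (1 - np.exp(-kappa * delta)))   # eq. (46)
    df = 4 * kappa * theta / sigma**2                           # eq. (47)
    nc = 2 * c * np.exp(-kappa * delta) * v_now                 # eq. (48)
    return 2 * c * ncx2.pdf(2 * c * v_next, df, nc)
```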
Thus the gradient term for (45) is
$$\frac{\partial}{\partial\rho}\log\big[p(V_{t+1}|V_t,\rho)\big] = \frac{\partial}{\partial\rho}\log\Big\{2c(\rho)\, n\big[2c(\rho)V_{t+1}\,\big|\,df(\rho),\nu_t(\rho)\big]\Big\}, \qquad (49)$$
where $n(\cdot\,|\,df,\nu)$ is the noncentral chi-squared density. We use numerical derivatives, which appeared quite stable and accurate, to compute the gradient term immediately above, and then Monte Carlo to compute the expectation in (45).

A.4.2. Pure-jump volatility models, cases D and E

For these cases $\rho = (\kappa\ \alpha\ c\ \lambda)'$. We need to work from the conditional characteristic function, because the density $p(V_{t+1}|V_t,\rho)$ is not available in convenient closed form. Define the conditional characteristic function
$$\psi(u,V_t,\rho) = \mathbb{E}\big(e^{iuV_{t+1}}\mid V_t\big), \qquad (50)$$
and from the Fourier inversion formula the transition density is, up to a constant,
$$\int_{\mathbb{R}_+}\mathrm{Re}\left(e^{-iuV_{t+1}}\,\psi(u,V_t,\rho)\right)du, \qquad (51)$$
with gradient
$$\int_{\mathbb{R}_+}\mathrm{Re}\left(e^{-iuV_{t+1}}\,(\partial/\partial\rho)\psi(u,V_t,\rho)\right)du. \qquad (52)$$
Recalling that $(\partial/\partial\rho)\log(p) = (1/p)(\partial/\partial\rho)p$, then whenever the derivatives exist and the magnitude of the characteristic function (and its derivatives) is dominated by an integrable function not depending on $\rho$, (45) becomes the expression in Box I:
$$\mathbb{E}\left[\frac{\displaystyle\int_{\mathbb{R}_+}\mathrm{Re}\left(e^{-iuV_{t+1}}\,\frac{\partial}{\partial\rho}\psi(u,V_t,\rho)\right)du\,\int_{\mathbb{R}_+}\mathrm{Re}\left(e^{-iuV_{t+1}}\,\frac{\partial}{\partial\rho'}\psi(u,V_t,\rho)\right)du}{\displaystyle\left(\int_{\mathbb{R}_+}\mathrm{Re}\left(e^{-iuV_{t+1}}\,\psi(u,V_t,\rho)\right)du\right)^{2}}\right]. \qquad (53)$$
(Box I.)
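To illustrate, a Python sketch of (50)–(53) (ours, not the paper's implementation: the characteristic function is the tempered-stable OU form implied by (28) and (37), finite differences replace the analytic gradient below, and scipy's `quad` stands in for the QAWF/QAWO/QAGIU routines discussed next):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as Gamma

def log_psi(u, v_t, rho, delta=1.0):
    """Conditional log characteristic function (50) for the tempered-stable OU
    factor: iu * V_t * exp(-kappa*delta) plus the integral (37) evaluated at iu."""
    kappa, alpha, c, lam = rho
    iu = 1j * u
    G = c * Gamma(-alpha) * ((lam - iu)**alpha - (lam - iu * np.exp(-kappa * delta))**alpha)
    return iu * v_t * np.exp(-kappa * delta) + G

def density(v_next, v_now, rho):
    """Transition density up to a constant, eq. (51)."""
    f = lambda u: np.real(np.exp(-1j * u * v_next + log_psi(u, v_now, rho)))
    return quad(f, 0, np.inf, limit=200)[0]

def score(v_next, v_now, rho, h=1e-5):
    """(d/d rho) log p via (51)-(52), using finite differences in rho."""
    p = density(v_next, v_now, rho)
    grad = np.zeros(len(rho))
    for k in range(len(rho)):
        rho_h = np.array(rho, dtype=float)
        rho_h[k] += h
        grad[k] = (density(v_next, v_now, rho_h) - p) / (h * p)
    return grad

# The information matrix (53) is then the Monte Carlo average of
# np.outer(score(...), score(...)) over simulated pairs (V_t, V_{t+1}).
```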
For Cases D and E the gradient of the characteristic function is
$$(\partial/\partial\kappa)\,\psi(u,V_t,\rho) = \psi(u,V_t,\rho)\left(V_t + \alpha c\,\Gamma(-\alpha)\,(\lambda-iu\,e^{-\kappa\Delta})^{\alpha-1}\right)iu\,\Delta\, e^{-\kappa\Delta};$$
$$(\partial/\partial\alpha)\,\psi(u,V_t,\rho) = -\psi(u,V_t,\rho)\,c\,\Gamma(-\alpha)\Big[\big(\Psi(1-\alpha)+1/\alpha\big)\big((\lambda-iu)^{\alpha} - (\lambda-iu\,e^{-\kappa\Delta})^{\alpha}\big) + \log(\lambda-iu)\,(\lambda-iu)^{\alpha} - \log(\lambda-iu\,e^{-\kappa\Delta})\,(\lambda-iu\,e^{-\kappa\Delta})^{\alpha}\Big];$$
$$(\partial/\partial c)\,\psi(u,V_t,\rho) = \psi(u,V_t,\rho)\,\Gamma(-\alpha)\big((\lambda-iu)^{\alpha} - (\lambda-iu\,e^{-\kappa\Delta})^{\alpha}\big);$$
$$(\partial/\partial\lambda)\,\psi(u,V_t,\rho) = \psi(u,V_t,\rho)\,c\,\Gamma(-\alpha)\,\alpha\big((\lambda-iu)^{\alpha-1} - (\lambda-iu\,e^{-\kappa\Delta})^{\alpha-1}\big),$$
where $\Psi$ denotes the digamma function.

To compute (53), the inner integrals with respect to $u$ are done numerically, while the outer integration over the joint distribution of $(V_t, V_{t+1})$ is done by Monte Carlo. The integrand $\mathrm{Re}\left(e^{-iuV_{t+1}}(\partial/\partial\rho)\psi(u,V_t,\rho)\right)$ in (53) may exhibit highly oscillatory behavior, which makes numerical integration difficult. Filipovic et al. (2010) employ the special routines QAWF and QAWO from the GNU Scientific Library to numerically compute the Fourier integral of a characteristic function. However, those routines assume that the main source of oscillations in the integrand is the sine (or cosine) factor coming from the Fourier transform. Wild oscillations of the characteristic function and its gradient, which in our case happen for several values of $V_t$, $V_{t+1}$, and $\rho$, may make the QAWF and QAWO algorithms fail. In such cases, we employ adaptive Gauss–Kronrod integration. This numerical integration procedure handles the oscillatory behavior of integrands quite well and is implemented both in Matlab and the GNU Scientific Library. In Case D, when $\kappa = 0.5$, we can use the QAWF and QAWO algorithms to compute the inner integrals in (53). However, in Case E, when $\kappa$ decreases to 0.15, the frequency of oscillations in the characteristic function and its derivatives increases so much that QAWF and QAWO stop working. Case E, therefore, is computed using the QAGIU routine, which implements adaptive Gauss–Kronrod integration.

References

Ait-Sahalia, Y., Jacod, J., 2008. Fisher's information for discretely sampled Lévy processes. Econometrica 76, 727–761.
Ait-Sahalia, Y., Jacod, J., 2009a. Estimating the degree of activity of jumps in high frequency financial data. Annals of Statistics 37, 2202–2244.
Ait-Sahalia, Y., Jacod, J., 2009b. Testing for jumps in a discretely observed process. Annals of Statistics 37, 184–222.
Andersen, T., Benzoni, L., Lund, J., 2002. An empirical investigation of continuous-time equity return models. Journal of Finance 57, 1239–1284.
Andersen, T., Sørensen, B., 1996. GMM estimation of a stochastic volatility model: a Monte Carlo study. Journal of Business and Economic Statistics 14, 328–352.
Andersen, T.G., Bollerslev, T., 1997. Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance 4, 115–158.
Andersen, T.G., Bollerslev, T., Diebold, F.X., Labys, P., 2003. Modeling and forecasting realized volatility. Econometrica 71, 579–625.
Barndorff-Nielsen, O., Shephard, N., 2001a. Non-Gaussian Ornstein–Uhlenbeck-based models and some of their uses in financial economics. Journal of the Royal Statistical Society Series B 63, 167–241.
Barndorff-Nielsen, O., Shephard, N., 2001b. Normal modified stable processes. Theory of Probability and Mathematical Statistics 65, 1–19.
Barndorff-Nielsen, O., Shephard, N., 2002. Econometric analysis of realized volatility and its use in estimating stochastic volatility models. Journal of the Royal Statistical Society Series B 64, 253–280.
Barndorff-Nielsen, O., Shephard, N., 2004. Power and bipower variation with stochastic volatility and jumps. Journal of Financial Econometrics 2, 1–37.
Barndorff-Nielsen, O., Shephard, N., 2006. Econometrics of testing for jumps in financial economics using bipower variation. Journal of Financial Econometrics 4, 1–30.
Bates, D., 2006. Maximum likelihood estimation of latent affine models. Review of Financial Studies 19, 909–965.
Blumenthal, R., Getoor, R., 1961. Sample functions of stochastic processes with independent increments. Journal of Mathematics and Mechanics 10, 493–516.
Bollerslev, T., Todorov, V., 2011. Estimation of jump tails. Econometrica (forthcoming).
Bollerslev, T., Zhou, H., 2002. Estimating stochastic volatility diffusion using conditional moments of integrated volatility. Journal of Econometrics 109, 33–65.
Carr, P., Geman, H., Madan, D., Yor, M., 2002. The fine structure of asset returns: an empirical investigation. Journal of Business 75, 305–332.
Carrasco, M., Chernov, M., Florens, J., Ghysels, E., 2007. Efficient estimation of general dynamic models with a continuum of moment conditions. Journal of Econometrics 140, 529–573.
Carrasco, M., Florens, J.P., 2000. Generalization of GMM to a continuum of moment conditions. Econometric Theory 16, 797–834.
Carrasco, M., Florens, J.P., 2002. GMM estimation using the empirical characteristic function. Technical report, Institut National de la Statistique et des Etudes Economiques.
Chernov, M., Gallant, R., Ghysels, E., Tauchen, G., 2003. Alternative models for stock price dynamics. Journal of Econometrics 116, 225–257.
Chernozhukov, V., Hong, H., 2003. An MCMC approach to classical estimation. Journal of Econometrics 115, 293–346.
Cont, R., Tankov, P., 2004. Financial Modelling with Jump Processes. Chapman and Hall, Boca Raton, Florida, USA.
Corradi, V., Distaso, W., 2006. Semiparametric comparison of stochastic volatility models using realized measures. Review of Economic Studies 73, 635–667.
Dobrev, D., Szerszen, P., 2010. The information content of high-frequency data for estimating equity return models and forecasting risk. Technical report.
Duffie, D., Filipović, D., Schachermayer, W., 2003. Affine processes and applications in finance. Annals of Applied Probability 13 (3), 984–1053.
Duffie, D., Pan, J., Singleton, K., 2000. Transform analysis and asset pricing for affine jump-diffusions. Econometrica 68, 1343–1376.
Eraker, B., Johannes, M., Polson, N., 2003. The impact of jumps in volatility and returns. Journal of Finance 58, 1269–1300.
Feuerverger, A., Mureika, R., 1977. The empirical characteristic function and its application. Annals of Statistics 5, 88–97.
Filipovic, D., Mayerhofer, E., Schneider, P., 2010. Transition density approximations for multivariate affine jump diffusion processes. Working paper.
Hansen, L., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.
Huang, X., Tauchen, G., 2005. The relative contributions of jumps to total variance. Journal of Financial Econometrics 3, 456–499.
Jacod, J., 2008. Asymptotic properties of power variations and associated functionals of semimartingales. Stochastic Processes and their Applications 118, 517–559.
Jiang, G., Knight, J., 2002. Estimation of continuous-time processes via empirical characteristic function. Journal of Business and Economic Statistics 20, 198–212.
Jiang, G.J., Knight, J.L., 2010. ECF estimation of Markov models where the transition density is unknown. Econometrics Journal 13, 245–270.
Knight, J.L., Yu, J., 2002. Empirical characteristic function in time series estimation. Econometric Theory 18, 691–721.
Mancini, C., 2009. Non-parametric threshold estimation for models with stochastic diffusion coefficient and jumps. Scandinavian Journal of Statistics 36, 270–296.
Meddahi, N., 2003. ARMA representation of integrated and realized variances. The Econometrics Journal 6, 334–355.
Parzen, E., 1962. On estimation of a probability density function and mode. Annals of Mathematical Statistics 33, 1065–1076.
Paulson, A.S., Holcomb, E.W., Leitch, R.A., 1975. The estimation of the parameters of the stable laws. Biometrika 62, 163–170.
Prakasa Rao, B., 1988. Statistical inference from sampled data for stochastic processes. In: Contemporary Mathematics, vol. 80. Amer. Math. Soc., Providence.
Rosiński, J., 2007. Tempering stable processes. Stochastic Processes and their Applications 117, 677–707.
Sato, K., 1999. Lévy Processes and Infinitely Divisible Distributions. Cambridge University Press, Cambridge, UK.
Singleton, K., 2001. Estimation of affine asset pricing models using the empirical characteristic function. Journal of Econometrics 102, 111–141.
Theodosiou, M., Zikes, F., 2009. A comprehensive comparison of alternative tests for jumps in asset prices. Technical report, Imperial College London, Business School.
Todorov, V., 2009. Estimation of continuous-time stochastic volatility models with jumps using high-frequency data. Journal of Econometrics 148, 131–148.
Todorov, V., 2011. Econometric analysis of jump-driven stochastic volatility models. Journal of Econometrics 160, 12–21.
Todorov, V., Tauchen, G., 2010. Activity signature functions for high-frequency data analysis. Journal of Econometrics 154, 125–138.
Todorov, V., Tauchen, G., 2011a. Volatility jumps. Journal of Business and Economic Statistics 29, 356–371.
Todorov, V., Tauchen, G., 2011b. The realized Laplace transform of volatility. Working paper.
Todorov, V., Tauchen, G., 2011c. Realized Laplace transforms for pure-jump semimartingales. Working paper.
Yu, J., 2004. Empirical characteristic function estimation and its applications. Econometric Reviews 23, 93–123.
Journal of Econometrics 164 (2011) 382–403
Semi-nonparametric estimation and misspecification testing of diffusion models

Dennis Kristensen*
Columbia University, United States; CREATES, United States¹
Article info

Article history: Available online 14 July 2011

JEL classification: C12; C13; C14; C22

Keywords: Diffusion process; Kernel estimation; Nonparametric; Specification testing; Semiparametric; Transition density
Abstract

Novel transition-based misspecification tests of semiparametric and fully parametric univariate diffusion models based on the estimators developed in [Kristensen, D., 2010. Pseudo-maximum likelihood estimation in two classes of semiparametric diffusion models. Journal of Econometrics 156, 239–259] are proposed. It is demonstrated that transition-based tests in general lack power in detecting certain departures from the null since they integrate out local features of the drift and volatility. As a solution to this, tests that directly compare drift and volatility estimators under the relevant null and alternative are also developed which exhibit better power against local alternatives.
1. Introduction

In this study, we develop semi-nonparametric estimators and misspecification tests of the so-called drift and diffusion functions in univariate diffusion models given low-frequency observations. The proposed estimators and tests provide the researcher with tools to investigate whether a given parametric specification of the drift and diffusion function is correct, and allow drift and diffusion specifications to be tested separately from each other. This is in contrast to existing methods found in the literature, which simultaneously test correct specification of drift and diffusion terms. Our estimation and testing procedure takes as its starting point two classes of semiparametric diffusion models introduced in Kristensen (2010): in the first class, the drift term is known up to a finite-dimensional parameter while the diffusion term is left unspecified; in the second class, the diffusion term is of parametric form while the drift term is unknown. Kristensen (2010) develops estimators of the parametric component for a given model in either of the two classes. We demonstrate how the unspecified term in any of these semiparametric diffusion models can be estimated nonparametrically using kernel methods.
* Corresponding address: Department of Economics, International Affairs Building, MC 3308, 420 West 118th Street, New York, NY 10027, United States. E-mail address: [email protected].
¹ Center for Research in Econometric Analysis of Time Series, funded by the Danish National Research Foundation.
These estimators are useful as guides in the search for a correct parametric specification since they provide information about the shape of the unspecified term. In addition, the estimators help us to develop novel misspecification tests of diffusion models. We suggest two sets of tests. First, we propose tests for a given semiparametric diffusion model against a fully nonparametric alternative. Second, tests for a fully parametric model against either of its two semiparametric alternatives are developed. Our test statistics are chosen as weighted L²-distances of the estimators of the so-called transition density obtained under null and alternative, respectively. In addition, we also consider tests that directly compare drift or diffusion estimators. We explore the asymptotic properties of the tests both under null and alternative, and obtain a number of interesting results. First, our transition-based test of a given semiparametric model against the fully nonparametric alternative is under the null first-order asymptotically equivalent to tests of fully parametric models as developed in Aït-Sahalia et al. (2009) and Li and Tkacz (2006). This is due to the fact that estimators of the transition density under the semiparametric and parametric null, respectively, both converge with parametric rate, and as such the asymptotic distributions of our test statistics are completely driven by the fully nonparametric transition density estimator. The parametric rate of the semiparametric transition density estimator appears because computation of transition densities for low-frequency observations involves integration of both the drift and the diffusion term (see e.g. Kristensen, 2008). This integration functions as an additional smoothing mechanism that speeds up the convergence rate
of the semiparametric estimator of the transition density even though it involves kernel estimators. Second, our proposed transition-based test statistics of the fully parametric model against either of the two semiparametric alternatives converge with parametric rate under the null. This is non-standard within the class of tests based on L²-distance measures of semi-nonparametric density estimators, which in general converge with nonparametric rate. Instead, our transition-based tests for the fully parametric null share similarities with the class of semiparametric estimators and tests that exhibit parametric rate (see e.g. Andrews, 1994; Corradi and Swanson, 2005; Whang and Andrews, 1993). Third, we study the power properties of the tests by considering their performance under contiguous alternatives. In particular, we show that transition-based tests are not very suitable for detection of high-frequency departures in a diffusion framework. This is in contrast to density-based tests in a standard, discrete-time setting. This perhaps surprising result is due to the fact that local features of the drift and diffusion terms are integrated out in the computation of transition densities, and so local deviations get blurred out. It should be stressed that this problem is not special to our particular tests, but is shared by all other transition-based tests of diffusion models in the literature such as Aït-Sahalia et al. (2009). As such, our power analysis should be of general interest. The lack of power against local alternatives leads us to propose two alternative tests of the parametric null against semiparametric alternatives based on direct comparison of drift and diffusion function estimators obtained under null and alternatives. We examine their asymptotic properties both under null and alternative: they converge with a slower rate than the transition-based tests, and thus are dominated by transition-based tests in terms of detecting global alternatives. On the other hand, the tests are better at detecting local deviations of drift and diffusion functions from the null, and so have better power against local alternatives. As such, they complement our transition-based tests. Finally, we conduct a higher-order analysis of the proposed tests under the null. This analysis demonstrates that first-order asymptotic distributions obtained under the null may be a poor proxy of their finite-sample distributions. We therefore propose a Markov bootstrap method that we hope will provide a better approximation of the finite-sample distributions of the test statistics. This conjecture is supported by simulation results in Aït-Sahalia et al. (2009) and Li and Tkacz (2006), who propose similar bootstrap procedures for their tests. The proposed tests and their theoretical analysis add to a growing literature on specification testing of diffusion models. This class of models is widely used in describing dynamics of asset-pricing variables such as interest rates, stock prices, and exchange rates; see for example Björk (2004) for an overview. Since economic theory imposes few restrictions on asset-price dynamics, statistical techniques are usually employed in the search for a correct specification. The literature on testing diffusion model specifications can roughly be divided up into two categories depending on whether high-frequency data is assumed available or not.
If high-frequency data is observed, simple nonparametric kernel-regression estimators of drift and diffusion terms can be used to test for correct specification (Bandi and Phillips, 2005; Corradi and White, 1999; Li, 2007; Negri and Nishiyama, 2009). In principle, these tests do not rely on stationarity, which is an advantage over the approach taken here. On the other hand, the asymptotic properties of estimators and associated tests do rely on the time distance between observations shrinking to zero; thus, estimators and tests will potentially be severely biased if only low-frequency data is available (see Nicolau, 2003). To avoid the bias issues associated with high-frequency based tests, alternative tests based on a fixed time distance between
observations have been developed. Aït-Sahalia (1996b) proposes to test for correct specification using a weighted L²-distance to measure discrepancies between the marginal density under the null and the alternative. This class of tests was originally proposed in Bickel and Rosenblatt (1973) in a cross-sectional setting; see also Fan (1994) and Gouriéroux and Tenreiro (2001). Since the test of Aït-Sahalia (1996b) is only able to detect discrepancies in the marginal density, it is not consistent against all alternatives. This observation led to the development of tests based on transition densities, since these fully characterise diffusion models. Our transition-based tests are most related to the ones developed in Aït-Sahalia et al. (2009) and Li and Tkacz (2006), where fully nonparametric and parametric estimators of the transition density are compared. In a similar spirit, Hong and Li (2005) propose a test where transformed versions of the transition densities are compared, while Chen et al. (2009) employ empirical likelihood techniques. These tests are all designed to examine the correct parametric specification of the drift and diffusion function jointly. In contrast, we are able to test the specification of each of the two functions characterising the model separately. Our local power analysis complements the one carried out in Aït-Sahalia et al. (2009). They specify alternatives in terms of the transition densities and find that transition-based tests have the ability to detect local deviations from the null at a better rate than CvM-type tests. However, given that the end goal is to test for the correct specification of the drift and diffusion term, we instead specify our alternatives directly in terms of these. By doing so, we obtain some rather different power results for transition-based tests. In particular, we show that they are not able to detect local alternatives at a higher rate compared to CvM-type tests. These seemingly contradictory results are due to the fact that Aït-Sahalia et al. (2009) specify their alternatives in terms of the transition density while we focus on deviations in terms of the underlying drift and diffusion functions. Since, as already noted above, the transition density involves integration over the drift and diffusion function, local features in these get smoothed out in the transition density and are therefore not easily detected. Our tests based on direct comparison of the drift and diffusion function estimates under null and alternative are related to the marginal density tests of Aït-Sahalia (1996b) and Huang (1997). However, our proposed tests involve non-trivial transformations of the marginal density and its derivatives and as such are able to detect different, more natural alternatives compared to their tests. Instead of comparing transition densities, Kolmogorov–Smirnov (KS) type tests have been proposed by Bhardwaj et al. (2008) and Corradi and Swanson (2005), where estimators of the cumulative distribution functions (cdf's) are compared. This on the one hand means that their tests converge with parametric rate under the null and as such are more powerful at detecting certain global alternatives compared to transition-based tests. On the other hand, KS-type tests are known to have difficulties detecting local deviations from the null; a shortcoming that density-based tests do not suffer from (see e.g. Escanciano, 2009; Eubank and LaRiccia, 1992).
Finally, Kristensen (2010) proposes some specification tests which appear to be the only existing tests based on low-frequency data that allow for testing correct specifications of the drift and diffusion terms separately. However, Kristensen (2010) does not supply a complete asymptotic theory. Moreover, as with CvM and KS type tests, his proposed Hausman-type tests of fully parametric models will in general have low power against local alternatives, since they are based only on matching estimators of the parametric component obtained under the null and under alternatives. In particular, his tests may not be consistent against all alternatives. In contrast, we base our tests on estimators of the nonparametric component under the alternative, and so expect them to enjoy better power properties.
The remainder of the paper is organised as follows: in Section 2, we lay out the general framework of our analysis. The semi-nonparametric estimators of the drift and diffusion term are presented and their asymptotic properties derived in Section 3. In Section 4, we propose a number of different test statistics for a parametric specification against semi- and nonparametric alternatives and investigate their asymptotic behaviour. We discuss related tests in Section 5, while bootstrap versions of the test statistics are developed in Section 6. The finite-sample performance of the estimators is examined through a simulation study in Section 7. We conclude in Section 8. All proofs have been relegated to the Appendix.

2. Framework

Consider the continuous-time process $\{X_t\} = \{X_t : t\ge 0\}$ solving the following univariate Markov diffusion model,
$$dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dW_t, \qquad (1)$$
where $\{W_t\}$ is a standard Brownian motion. The domain of $\{X_t\}$ takes the form of an open interval $I = (l,r)$ where $-\infty\le l < r\le\infty$. The functions $\mu : I\to\mathbb{R}$ and $\sigma^2 : I\to\mathbb{R}_+$ are the so-called drift and diffusion term respectively. The dynamics of the process are described by the transition densities $p(y|x;t)$, $t\ge 0$, describing the conditional distributions,
$$P(X_{s+t}\in A\mid X_s = x) = \int_A p(y|x;t)\,dy, \qquad A\subseteq I,\ s,t\ge 0.$$
For diffusion models as given in Eq. (1), the transition density can be expressed as the solution to the following partial differential equation (PDE) (see Friedman, 1976):
$$\frac{\partial p(y|x;t)}{\partial t} = \mathcal{A}\big[\mu,\sigma^2\big]\,p(y|x;t), \qquad t>0,\ (x,y)\in I\times I, \qquad (2)$$
with boundary condition $\lim_{t\to 0} p_t(y|x) = \delta(y-x)$. Here, $\mathcal{A}[\mu,\sigma^2]$ denotes the infinitesimal generator,
$$\mathcal{A}\big[\mu,\sigma^2\big]\,p(y|x;t) = \mu(x)\,\frac{\partial p(y|x;t)}{\partial x} + \frac{1}{2}\sigma^2(x)\,\frac{\partial^2 p(y|x;t)}{\partial x^2},$$
and $\delta(\cdot)$ Dirac's delta function. Thus, the drift and diffusion function fully characterise the transition density, and we will write $p(y|x;t,\mu,\sigma^2)$ for the solution mapping that takes any drift and diffusion function into the corresponding transition density as given implicitly through the PDE in Eq. (2). We are interested in testing parametric specifications of the drift and diffusion function. We will throughout work under the maintained (nonparametric) hypothesis that $\{X_t\}$ is a Markov diffusion process,
$$H_{NP}: \{X_t\} \text{ solves Eq. (1) with } \sigma^2(\cdot) \text{ and } \mu(\cdot) \text{ unspecified.}$$
In the existing literature, tests have been developed for a fully parametric diffusion specification against this nonparametric alternative. The joint fully parametric hypothesis takes the form
$$H_{P}: \sigma^2(\cdot) = \sigma^2(\cdot;\theta_{0,1}) \text{ and } \mu(\cdot) = \mu(\cdot;\theta_{0,2}) \text{ for some } (\theta_{0,1},\theta_{0,2})\in\Theta_1\times\Theta_2,$$
where $\Theta_k\subseteq\mathbb{R}^{d_k}$, $k = 1, 2$. Thus, under $H_P$, both drift and diffusion functions are known up to some finite-dimensional parameter. A plethora of tests of $H_P$ vs. $H_{NP}$ exist; see, for example, Aït-Sahalia et al. (2009), Bhardwaj et al. (2008), Chen et al. (2009), Hong and Li (2005) and Li and Tkacz (2006). Most of these studies base their tests on comparison of estimators of (potentially transformed versions of) the transition density under the null and the alternative. However, in case of rejection of $H_P$, such tests are not informative regarding whether misspecification of the drift, the diffusion, or both is the cause of rejection. This motivates us to introduce the following two semiparametric hypotheses, which allow us to test for misspecification of the drift and diffusion term separately from each other:
$$H_{SP,1}: \sigma^2(\cdot) = \sigma^2(\cdot;\theta_{0,1}) \text{ for some } \theta_{0,1}\in\Theta_1, \qquad (3)$$
and
$$H_{SP,2}: \mu(\cdot) = \mu(\cdot;\theta_{0,2}) \text{ for some } \theta_{0,2}\in\Theta_2. \qquad (4)$$
If a model satisfies $H_{SP,1}$ ($H_{SP,2}$), the drift (diffusion) term is unspecified, and the model is semiparametric. Also note that if a model satisfies both $H_{SP,1}$ and $H_{SP,2}$, then both drift and diffusion are specified and the model is fully parametric. In particular, we have the following nesting of the hypotheses: $H_P\subseteq H_{SP,k}\subseteq H_{NP}$ for $k = 1, 2$. In the next section, we first develop tests of each of the two semiparametric hypotheses, $H_{SP,1}$ and $H_{SP,2}$, against the nonparametric alternative. Secondly, we propose tests of $H_P$ against each of the two semiparametric hypotheses. Together, the tests enable the econometrician to first test for the correct specification of, say, the drift term ($H_{SP,2}$ vs. $H_{NP}$), and then (if $H_{SP,2}$ is accepted) the correct specification of the diffusion term ($H_P$ vs. $H_{SP,2}$). In order to develop our tests, we first obtain estimators of the drift and diffusion functions under the two semiparametric hypotheses. The estimators rely on the assumption of stationarity. Suppose that $\{X_t\}$ is strictly stationary and ergodic, in which case it has a stationary marginal density, which we denote $\pi$. This density satisfies $\int_A\pi(x)\,dx = P(X_t\in A)$ for any $t\ge 0$ and Borel set $A\subseteq I$, and can be written in the following form:
$$\pi(x) = \frac{M_{x^*}}{\sigma^2(x)}\,\exp\left[2\int_{x^*}^{x}\frac{\mu(y)}{\sigma^2(y)}\,dy\right], \qquad (5)$$
for some point $x^*\in\mathrm{int}\,I$ and normalisation factor $M_{x^*} > 0$, c.f. Karlin and Taylor (1981, Section 15.6). One can invert the expression in Eq. (5) to obtain expressions for either the drift or the diffusion function:
$$\mu(x) = \frac{1}{2\pi(x)}\,\frac{\partial}{\partial x}\big[\sigma^2(x)\pi(x)\big], \qquad (6)$$
$$\sigma^2(x) = \frac{2}{\pi(x)}\int_l^x \mu(y)\pi(y)\,dy. \qquad (7)$$
(Indeed, differentiating Eq. (5) shows that $\partial_x[\sigma^2(x)\pi(x)] = 2\mu(x)\pi(x)$, which gives Eq. (6); integrating this identity back up from the boundary $l$ gives Eq. (7).)
From these expressions, we can identify the drift (diffusion) function from the diffusion (drift) term together with the marginal density; this point was already made in Wong (1964), and further pursued in Aït-Sahalia (1996a), Hansen and Scheinkman (1995) and Kristensen (2010). In particular, this allows us to identify the unspecified term under each of the two semiparametric hypotheses.

3. Semi-nonparametric estimators

We develop specific drift and diffusion estimators based on the identification scheme presented in the previous section: suppose that we have $n+1$ observations available from Eq. (1), $X_0, X_\Delta, X_{2\Delta},\ldots, X_{n\Delta}$, where $\Delta > 0$ is the fixed time distance between observations; without loss of generality, we normalise the time distance to $\Delta\equiv 1$ in the following. Under the relevant semiparametric hypothesis, $H_{SP,1}$ or $H_{SP,2}$, we assume that a preliminary estimator of the parametric component, $\theta_1$ or $\theta_2$, is available. We make no assumptions about where the preliminary estimators come from, and merely require that they are sufficiently regular. One particular class of estimators are the pseudo-MLEs proposed in Kristensen (2010), but we do not restrict
ourselves to these, and the estimator of Aït-Sahalia (1996a) could also be used in the case of a linear drift specification. Given estimators of the parametric components, we now just need to obtain an estimator of the marginal density, $\pi$. We here propose to use kernel methods to estimate it,
$$\hat\pi(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x-X_i), \qquad (8)$$
where $K_h(z) = K(z/h)/h$, $K$ is a kernel, and $h > 0$ is a bandwidth; see Robinson (1983) for an introduction to kernel density estimators in a time series setting. We then combine estimators of the parametric component and the marginal density to obtain an estimator of the unspecified term. First, consider $H_{SP,1}$: in this case, the diffusion term is parameterised and an estimator $\hat\theta_1$ is available together with the kernel estimator $\hat\pi$. We then estimate $\mu$ by substituting $\sigma^2(x;\hat\theta_1)$ and $\hat\pi$ into Eq. (6):
$$\hat\mu(x) = \frac{1}{2\hat\pi(x)}\,\frac{\partial}{\partial x}\big[\sigma^2(x;\hat\theta_1)\hat\pi(x)\big]. \qquad (9)$$
Under $H_{SP,2}$, we have a parametric estimator of the drift parameter, $\hat\theta_2$, which together with $\hat\pi$ can be used to estimate the diffusion term. Two alternative estimators present themselves: an obvious estimator would be to directly substitute $\mu(y;\hat\theta_2)$ and $\hat\pi$ into Eq. (7), $\tilde\sigma^2(x) = \frac{2}{\hat\pi(x)}\int_l^x \mu(y;\hat\theta_2)\hat\pi(y)\,dy$. However, $\int_l^x \mu(y)\pi(y)\,dy$ can be estimated without bias by a sample average, $\frac{1}{n}\sum_{i=1}^{n} I\{X_i\le x\}\mu(X_i)\to^P \int_l^x \mu(y)\pi(y)\,dy$, where $I\{\cdot\}$ is the indicator function. So we suggest to estimate $\sigma^2(x)$ by
$$\hat\sigma^2(x) = \frac{2}{\hat\pi(x)}\,\frac{1}{n}\sum_{i=1}^{n} I\{X_i\le x\}\,\mu(X_i;\hat\theta_2). \qquad (10)$$
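A compact Python sketch of the estimators (8)–(10) (our illustration; a Gaussian kernel is assumed, and `sigma2_param`, `mu_param` denote the parametric specifications evaluated at the preliminary estimates):

```python
import numpy as np

def gaussian_kernel(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def pi_hat(x, data, h):
    """Kernel density estimator (8)."""
    return gaussian_kernel((x - data[:, None]) / h).mean(axis=0) / h

def mu_hat(x, data, h, sigma2_param, eps=1e-5):
    """Drift estimator (9) under H_SP1; the x-derivative is taken numerically."""
    g = lambda s: sigma2_param(s) * pi_hat(s, data, h)
    return (g(x + eps) - g(x - eps)) / (2 * eps) / (2 * pi_hat(x, data, h))

def sigma2_hat(x, data, h, mu_param):
    """Diffusion estimator (10) under H_SP2, using the unbiased sample average."""
    avg = np.array([(mu_param(data) * (data <= xi)).mean() for xi in np.atleast_1d(x)])
    return 2 * avg / pi_hat(x, data, h)
```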
To establish the asymptotic properties of the two estimators, we impose regularity conditions on the model:

A.1 (i) The drift $\mu(\cdot)$ and diffusion $\sigma^2(\cdot) > 0$ are continuously differentiable. (ii) There exists a twice continuously differentiable function $V : \mathbb{R}\to\mathbb{R}_+$ with $V(x)\to\infty$ as $|x|\to\infty$, and constants $b, c > 0$, such that
$$\mu(x)V'(x) + \frac{1}{2}\sigma^2(x)V''(x) \le -cV(x) + b. \qquad (11)$$

A.2 The marginal density $\pi$ is differentiable of order $m+1\ge 3$ with bounded, uniformly continuous derivatives, and satisfies $\int_I \pi(x)^{1-q}\,dx < \infty$ for some $q > 0$. The conditional density $p(y|x)\equiv p(y|x;1)$ is uniformly differentiable of order $m$ with $\sup_{x,y\in I} p(y|x)\pi(x) < \infty$.

A.3 The parametric drift and diffusion function satisfy:
1. $\theta_1\mapsto\sigma^2(x;\theta_1)$ is continuously differentiable satisfying $\|\partial^{ij}_{x,\theta_1}\sigma^2(x;\theta_1)\|\le V(x)$, $i,j = 0,1$.
2. $\theta_2\mapsto\mu(x;\theta_2)$ is continuously differentiable, satisfying $\|\partial^{i}_{\theta_2}\mu(x;\theta_2)\|\le V(x)$, $i = 0,1$.

A.4 For $k\in\{1,2\}$: the estimator $\hat\theta_k$ satisfies $\hat\theta_k = \theta_k^* + \frac{1}{n}\sum_{i=1}^{n}\psi_{SP,k}(X_i|X_{i-1}) + o_P(1/\sqrt{n})$ with $E[\psi_{SP,k}(X_1|X_0)] = 0$ and $E[\|\psi_{SP,k}(X_1|X_0)\|^{2+\delta}] < \infty$ for some $\delta > 0$. Under $H_{SP,k}$, $\theta_k^* = \theta_{0,k}$.
Assumption (A.1) is sufficient for a stationary and geometrically β-mixing solution to exist, as shown in Meyn and Tweedie (1993); alternative mixing conditions for diffusion processes can be found in Chen et al. (2010) and Hansen and Scheinkman (1995). We will throughout assume that we have observed this solution. Some of the results stated in this section actually go through under weaker mixing conditions, but since in the next section we need β-mixing of geometric order to employ U-statistics results for dependent sequences (see Gouriéroux and Tenreiro, 2001), we impose this restriction throughout for clarity. Most models found in the finance literature satisfy (A.1) under suitable restrictions on the parameters.

The existence of $m+1$ derivatives of $\pi$ assumed in (A.2), combined with the use of an $m$th-order kernel as given in (B.1) below, allows us to control the bias of the kernel density estimator and its first derivative. The smoothness of $\pi$ as measured by its number of derivatives, $m$, determines how much the bias can be reduced. The condition that $\pi$ is $m$ times differentiable is satisfied if $\mu$ and $\sigma^2$ are $m-1$ and $m$ times differentiable respectively, c.f. Eq. (5). The tail condition imposed on $\pi$ in (A.2) is used to obtain uniform convergence results for the semiparametric drift and diffusion estimators when analysing the associated semiparametric estimator of the transition density (see Lemma 2 in Section 4). The parameter $q > 0$ measures the thickness of the tails of the marginal distribution, and is used to control the asymptotic impact of the trimming introduced in the next section. The conditions on the transition density in (A.2) together with (A.1) allow us to bound the variance of $\hat\pi$, and will also become useful when analysing nonparametric estimators of the transition density in Section 4.

Assumption (A.3) in conjunction with (A.1) implies that the following two moments exist: $E[\|\partial^i_{\theta_2}\mu(X_0;\theta_2)\|] < \infty$ and $E[\|\partial^{ij}_{x,\theta_1}\sigma^2(X_0;\theta_1)\|] < \infty$. These are used when demonstrating uniform convergence of the semi-nonparametric estimators.

Assumption (A.4) imposes restrictions on the estimator $\hat\theta_k$ obtained under $H_{SP,k}$, $k = 1, 2$. The assumption is formulated so that it holds both under $H_{SP,1}$ and $H_{SP,2}$ respectively, and under the nonparametric alternative. Under the relevant null, there exists a parameter value $\theta_{0,k}$ such that either Eq. (3) or (4) holds. It is then implicitly assumed that $\theta_k^* = \theta_{0,k}$, such that $\sigma^2(x;\theta_1^*) = \sigma^2(x)$ and $\mu(x;\theta_2^*) = \mu(x)$ respectively. If the relevant semiparametric null is false, no parameter value exists such that either Eq. (3) or (4) holds. As such, $\theta_k^*$ is a pseudo-true value in the sense that $\sigma^2(x;\theta_1^*)\ne\sigma^2(x)$ and $\mu(x;\theta_2^*)\ne\mu(x)$ respectively, and $\theta_k^*$ is just some parameter value that the estimator converges towards. Under the null, Assumption (A.4) is satisfied in great generality for most well-behaved estimators: for fully parametric MLEs, Aït-Sahalia (2002) gives conditions for (A.4) to hold, while Kristensen (2010) gives conditions under which semiparametric pseudo-MLEs satisfy the conditions. Under the alternatives, we expect that (A.4) will still hold in great generality by employing arguments similar to those in White (1982). For some of our results, the conditions imposed on the parametric estimators in (A.4) can be weakened to the requirement that they merely converge at a faster rate than the kernel estimator. However, for simplicity we maintain the stronger assumptions of (A.4) throughout.

Finally, we restrict the class of kernel functions to belong to the following family:

B.1 The kernel $K$ is differentiable, and there exist constants $C,\eta > 0$ such that
$$\big|K^{(i)}(z)\big| \le C|z|^{-\eta}, \qquad \big|K^{(i)}(z) - K^{(i)}(z')\big| \le C|z-z'|, \qquad i = 0, 1,$$
where $K^{(i)}(z)$ denotes the $i$th derivative. Furthermore, $\int_{\mathbb{R}} K(z)\,dz = 1$, $\int_{\mathbb{R}} z^j K(z)\,dz = 0$ for $1\le j\le m-1$, and $\int_{\mathbb{R}} |z|^m K(z)\,dz < \infty$.

This class includes most standard kernels, including the Gaussian and Uniform kernels. We are now able to state pointwise convergence results for the estimators of the unspecified term under the two semiparametric nulls:
Theorem 1. Assume that (A.1)–(A.4) and (B.1) hold. Then for any point $x$ in the interior of $I$:

1. Under $H_{SP,1}$: as $nh^3\to\infty$,
$$\sqrt{nh^3}\,\{\hat\mu(x)-\mu(x)-h^m\kappa_m B_\mu(x)\} \to^d N\big(0, V_\mu(x)\big),$$
where $\kappa_m = \int_{\mathbb{R}} K(z)z^m\,dz$, and
$$B_\mu(x) = \frac{\sigma^2(x)\,\pi^{(m+1)}(x)}{2\pi(x)}, \qquad V_\mu(x) = \frac{\sigma^4(x)}{4\pi(x)}\int_{\mathbb{R}} K^{(1)}(z)^2\,dz.$$
2. Under $H_{SP,2}$: as $nh\to\infty$,
$$\sqrt{nh}\,\{\hat\sigma^2(x)-\sigma^2(x)-h^m\kappa_m B_{\sigma^2}(x)\} \to^d N\big(0, V_{\sigma^2}(x)\big),$$
where
$$B_{\sigma^2}(x) = \frac{\sigma^2(x)\,\pi^{(m)}(x)}{\pi(x)}, \qquad V_{\sigma^2}(x) = \frac{\sigma^4(x)}{\pi(x)}\int_{\mathbb{R}} K^2(z)\,dz.$$

The above result allows the researcher to plot the two estimators together with pointwise confidence bands. The pointwise asymptotic variances for $\hat\mu(x)$ and $\hat\sigma^2(x)$ can be estimated by
$$\hat V_\mu(x) = \frac{\sigma^4(x;\hat\theta_1)}{4\hat\pi(x)}\int_{\mathbb{R}} K^{(1)}(z)^2\,dz, \qquad \hat V_{\sigma^2}(x) = \frac{\hat\sigma^4(x)}{\hat\pi(x)}\int_{\mathbb{R}} K(z)^2\,dz. \qquad (12)$$

One can easily show, as is standard for kernel-based estimators, that both semi-nonparametric estimators are asymptotically independent across distinct points. This facilitates inference, for example when constructing pointwise confidence bands. The rate of convergence of $\hat\mu$ is slower than that of $\hat\sigma^2$. This owes to the fact that $\hat\mu$ depends on both $\hat\pi$ and its first derivative, $\hat\pi^{(1)}$, while $\hat\sigma^2$ is only a function of $\hat\pi$. The density derivative has a slower weak convergence rate than $\hat\pi$, which the drift estimator inherits. For example, with a second-order kernel, the bias is of order $O(h^2)$ for both estimators, while the variance is of order $O(1/(nh^3))$ and $O(1/(nh))$ for the drift and diffusion estimator respectively. Solving for the optimal bandwidth, we find that the optimal rates for the drift and diffusion estimator are $n^{-2/7}$ and $n^{-2/5}$ respectively. This substantially slower rate of convergence of the drift estimator, due to its larger variance, is also found elsewhere in the literature: Gobet et al. (2004) find similar results for their sieve estimators of the drift and diffusion term and coin the nonparametric estimation of $\mu$ an ''ill-posed problem''. Given high-frequency observations of a stationary diffusion, Bandi and Phillips (2003) demonstrate that kernel estimators of $\mu(x)$ and $\sigma^2(x)$ have variances of order $O(1/(n\Delta h))$ and $O(1/(nh))$ respectively as $\Delta\to 0$ and $n\Delta\to\infty$.

4. Goodness-of-fit testing

We here develop tests of correct specifications of the drift and/or diffusion function. Our main focus will be on tests based on the transition density of the Markov process $\{X_t\}$, where a given null is tested against a given alternative by comparing estimators of the transition density obtained under the null and the alternative respectively. However, motivated by a power analysis of the proposed transition-based tests, we will also develop tests that directly compare drift and diffusion estimators under null and alternative. The two following subsections develop tests of the semiparametric and fully parametric hypotheses respectively and examine their properties.

4.1. Semiparametric specification tests

We consider testing either $H_{SP,1}$ or $H_{SP,2}$ against $H_{NP}$. In order to present our tests, we first introduce some additional notation: recall that we have normalised the time distance between observations to $\Delta = 1$, such that $p(y|x) := p(y|x;1)$ is the transition density of the observed Markov chain, $X_i$, $i = 1,\ldots,n$. Let $f(y,x) = p(y|x)\pi(x)$ denote the corresponding joint density of $(X_i, X_{i-1})$. Under either of the two semiparametric hypotheses, restrictions are imposed on the drift and diffusion term. Using Eqs. (6)–(7), we define the restricted drift and diffusion terms under the respective nulls as
$$\mu_{SP,1}(x) = \frac{1}{2\pi(x)}\,\frac{\partial}{\partial x}\big[\sigma^2(x;\theta_1^*)\pi(x)\big], \qquad \sigma^2_{SP,1}(x) = \sigma^2(x;\theta_1^*), \qquad (13)$$
$$\mu_{SP,2}(x) = \mu(x;\theta_2^*), \qquad \sigma^2_{SP,2}(x) = \frac{2}{\pi(x)}\int_l^x \mu(y;\theta_2^*)\pi(y)\,dy. \qquad (14)$$
We let $p_{SP,k}(y|x;\theta_k) := p_{SP,k}(y|x;1,\theta_k)$ denote the transition density corresponding to the restricted drift and diffusion functions under $H_{SP,k}$, $k = 1, 2$, at $t = \Delta = 1$. It can for example be represented as the solution (at $t = 1$) to the PDE in Eq. (2) with the restricted drift and diffusion functions plugged in. When evaluated at the (pseudo-)true parameter value we simply write $p_{SP,k}(y|x) = p_{SP,k}(y|x;\theta_k^*)$. Under the nonparametric hypothesis, $H_{NP}$, the drift and diffusion functions are left completely unspecified, and so we propose to estimate the unrestricted transition density, $p(y|x)$, under the alternative using standard kernel methods. A standard kernel estimator of the transition density for the observed data is
$$\hat p_{NP}(y|x) = \frac{\hat f_{NP}(y,x)}{\hat\pi_{NP}(x)},$$
where, for some bandwidth $h_{NP} > 0$,
$$\hat f_{NP}(y,x) = \frac{1}{n}\sum_{i=1}^{n} K_{h_{NP}}(X_i-y)\,K_{h_{NP}}(X_{i-1}-x), \qquad \hat\pi_{NP}(x) = \frac{1}{n}\sum_{i=1}^{n} K_{h_{NP}}(X_{i-1}-x).$$
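In code, the nonparametric transition density estimator above amounts to (a minimal Python sketch of ours, with a Gaussian kernel and `x_path` holding the observations X_0, ..., X_n):

```python
import numpy as np

def khn(z, h):
    """Scaled Gaussian kernel K_h(z) = K(z/h)/h."""
    return np.exp(-0.5 * (z / h)**2) / (np.sqrt(2 * np.pi) * h)

def p_np_hat(y, x, x_path, h_np):
    """Kernel transition density estimator p_NP(y|x) = f_NP(y, x) / pi_NP(x)."""
    lead, lag = x_path[1:], x_path[:-1]
    f_joint = np.mean(khn(lead - y, h_np) * khn(lag - x, h_np))
    pi_marg = np.mean(khn(lag - x, h_np))
    return f_joint / pi_marg
```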
Note that two different bandwidths are now being employed: under the semiparametric null, we use the bandwidth $h$ in the estimation of the univariate marginal density, while under the alternative $h_{NP}$ is used to obtain a nonparametric estimator of the bivariate transition density.

Next, we obtain an estimator of the transition density under either of the two semiparametric hypotheses, $p_{SP,k}(y|x)$. In both cases, we have drift and diffusion estimators available as developed in the previous section. These could in principle be used to obtain an estimator of $p_{SP,k}(y|x)$ by plugging them into the PDE in Eq. (2) and then solving w.r.t. $p(y|x;t)$ (at $t = 1$). However, to establish theoretical properties of the resulting semiparametric estimator of the transition density, we have to modify the drift and diffusion estimators proposed in the previous section to control their tail behaviour. We first introduce a class of trimming functions $\tau_a(z)$:

B.2 The trimming function $\tau_a : \mathbb{R}\to[0,1]$, $a > 0$, satisfies $\tau_a(z) = 1$ for $z\ge a$ and $\tau_a(z) = 0$ for $z\le a/2$.

A simple way of constructing $\tau_a(z)$ is to choose a cdf $F$ with support $[0,1]$ and define $\tau_a(z) = F((2z-a)/a)$, which then in great generality will satisfy (B.2); see also Andrews (1995, p. 572).
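For instance (our illustration of this construction), taking F to be the smoothstep cdf F(z) = 3z² − 2z³ on [0, 1] gives:

```python
import numpy as np

def trim(z, a):
    """Trimming function tau_a(z) = F((2z - a)/a) with F(z) = 3z^2 - 2z^3 on [0, 1];
    equals 0 for z <= a/2 and 1 for z >= a, as required by (B.2)."""
    s = np.clip((2 * np.asarray(z, dtype=float) - a) / a, 0.0, 1.0)
    return 3 * s**2 - 2 * s**3
```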
Given the trimming function, we redefine the estimators under the two semiparametric hypotheses, where we now use subscripts to differentiate between the two nulls,
$$\hat\mu_{SP,1}(x) = \frac{\hat\tau_a(x)}{2\hat\pi(x)}\,\frac{\partial}{\partial x}\big[\sigma^2(x;\hat\theta_1)\hat\pi(x)\big], \qquad \hat\sigma^2_{SP,1}(x) = \hat\tau_a(x)\,\sigma^2(x;\hat\theta_1) + \sigma^2\big(1-\hat\tau_a(x)\big), \qquad (15)$$
$$\hat\mu_{SP,2}(x) = \hat\tau_a(x)\,\mu(x;\hat\theta_2), \qquad \hat\sigma^2_{SP,2}(x) = \frac{2\hat\tau_a(x)}{\hat\pi(x)}\int_l^x \mu(y;\hat\theta_2)\hat\pi(y)\,dy + \sigma^2\big(1-\hat\tau_a(x)\big), \qquad (16)$$
where $\hat\tau_a(x) := \tau_a(\hat\pi(x))$, $a = a_n > 0$ is a trimming sequence, and $\sigma^2 > 0$ a constant. The inclusion of the additional term $\sigma^2(1-\hat\tau_a(x))$ in the diffusion estimator guarantees that it is strictly positive for all $x\in I$ for $n$ sufficiently large. The motivation for the trimming is two-fold: first, by combining results of Andrews (1995) and Kristensen (2009), the trimming of the semi-nonparametric component is used to show that $\hat\mu_{SP,1}(x)\to^P\tau_a(\pi(x))\mu_{SP,1}(x)$ and $\hat\sigma^2_{SP,2}(x)\to^P\tau_a(\pi(x))\sigma^2_{SP,2}(x)$ uniformly over $x\in I$, $k = 1, 2$, c.f. Lemma 9. We will then let $a\to 0$ at a suitable rate such that the trimming has no first-order effect asymptotically, $\tau_a(\pi(x))\mu_{SP,1}(x)\approx\mu_{SP,1}(x)$ and $\tau_a(\pi(x))\sigma^2_{SP,2}(x)\approx\sigma^2_{SP,2}(x)$; see, for example, Ai (1997) and Robinson (1988) for similar applications of trimming. Second, the trimming of the parametric component is introduced to ensure that the associated transition density exists: due to trimming, $\hat\mu_{SP,k}$ and $\hat\sigma^2_{SP,k}$ are bounded and $\hat\sigma^2_{SP,k} > 0$, and we can therefore apply standard results to ensure that the associated diffusion process has a well-defined transition density; see, for example, Friedman (1976). Given the above re-defined semiparametric drift and diffusion estimators, we define our estimator of the corresponding transition density, $\hat p_{SP,k}(y|x)$, as the solution to the following PDE at $t = 1$:
$$\frac{\partial \hat p_{SP,k}(y|x;t)}{\partial t} = \mathcal{A}\big[\hat\mu_{SP,k},\hat\sigma^2_{SP,k}\big]\,\hat p_{SP,k}(y|x;t), \qquad t>0,\ (x,y)\in I\times I. \qquad (17)$$
While the theoretical analysis of the estimator will rely on the above representation, its actual computation can be done using numerical techniques as developed in, amongst others, Aït-Sahalia (2002) and Kristensen and Shin (2008); see also Kristensen (2010, Section 5). A crude simulation-based alternative is sketched below.
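As a crude, self-contained alternative to the PDE solvers cited above (our illustration only, not the estimator used in the paper's theory), one can approximate the transition density implied by given drift and diffusion estimates by Euler simulation plus kernel smoothing:

```python
import numpy as np

def transition_density_mc(y, x, mu, sigma2, t=1.0, n_steps=100, n_paths=50_000, h=0.05, seed=0):
    """Approximate p(y|x; t) for dX = mu(X) dt + sqrt(sigma2(X)) dW by simulating
    Euler paths started at x and kernel-smoothing the endpoints."""
    rng = np.random.default_rng(seed)
    dt = t / n_steps
    v = np.full(n_paths, float(x))
    for _ in range(n_steps):
        dw = rng.standard_normal(n_paths) * np.sqrt(dt)
        v = v + mu(v) * dt + np.sqrt(np.maximum(sigma2(v), 0.0)) * dw
    z = (y - v) / h
    return np.mean(np.exp(-0.5 * z**2)) / (np.sqrt(2 * np.pi) * h)
```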
We propose to test $H_{SP,k}$ by comparing the transition density estimates through the following statistic,
$$T_{SP,k} = \int_I\int_I \big[\hat p_{SP,k}(y|x) - \hat p_{NP}(y|x)\big]^2\, w(y,x)\,dydx, \qquad (18)$$
for some weighting function $w$; a numerical sketch of its computation follows.
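Concretely, TSP,k can be approximated by a Riemann sum on a grid; a minimal sketch (ours), assuming `p_sp` and `p_np` are vectorised callables returning the two transition density estimates and `w` the weighting function:

```python
import numpy as np

def t_sp(p_sp, p_np, w, grid):
    """Weighted L2 distance (18) between two transition density estimates,
    approximated by a Riemann sum over a uniform grid on I x I."""
    dx = grid[1] - grid[0]
    xx, yy = np.meshgrid(grid, grid, indexing="ij")
    diff = p_sp(yy, xx) - p_np(yy, xx)
    return np.sum(diff**2 * w(yy, xx)) * dx * dx
```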
Similar test statistics have been considered in Aït-Sahalia et al. (2009) and Li and Tkacz (2006), but in a different context, namely that of testing fully parametric models against a nonparametric alternative. By appropriate choice of $w$, the tests can be interpreted as second-order approximations of the generalised likelihood-ratio tests, cf. Aït-Sahalia et al. (2009, p. 1105) and Fan et al. (2001). Other transition-based distance measures could be used: for example, measures based on the Kullback–Leibler divergence (Robinson, 1991), the empirical likelihood (Chen et al., 2009), or integral transforms (Hong and Li, 2005). We focus on $T_{SP,k}$, but conjecture that theoretical results for other distance measures could be derived by following the same proof strategy as used here. As a first step towards establishing asymptotic properties of $T_{SP,k}$, we investigate the properties of $\hat p_{SP,k}(y|x)$. The analysis will rely on the representation of $\hat p_{SP,k}(y|x)$ as the solution to Eq. (17) at $t = 1$. To ensure that the solution exists (asymptotically) and is sufficiently regular, we impose the following assumption on the transition density:
A.5 The transition density under $H_{SP,k}$, $p_{SP,k}(y|x;t,\theta)$ for $t > 0$, exists as a solution to Eq. (2) and satisfies $\big|\partial_x^i p_{SP,k}(y|x;t,\theta)\big|\le\gamma(y|x;t)$ for $(t,x,y,\theta)\in(0,\Delta]\times I^2\times\Theta$, $i = 0, 1, 2$, where
$$\gamma(y|x;t) = c_1\,\frac{|y|^{\lambda_1}+|x|^{\lambda_1}}{t^{\alpha_1}}\,\exp\left[-c_2\,\frac{|y|^{\lambda_2}+|x|^{\lambda_2}}{t^{\alpha_2}}\right] \qquad (19)$$
for constants $c_j,\alpha_j,\lambda_j > 0$, $j = 1, 2$.

The above assumption is high level. It would be preferable to give more primitive conditions in terms of the underlying drift and diffusion functions for the above regularity conditions to hold. However, to our knowledge, the only known sufficient conditions for existence of a solution to Eq. (2) are overly restrictive and, for example, require that drift and diffusion functions are bounded, c.f. Friedman (1976). Such boundedness restrictions are violated by most standard models used in the literature, and rule out that the process is mixing, c.f. Chen et al. (2010). We therefore impose the high-level conditions in (A.5) instead, which are similar to the conditions imposed in Kristensen (2010). We conjecture that the assumption could be replaced by alternative conditions such as the ones in Aït-Sahalia (2002). Finally, with $m\ge 2$ and $q > 0$ given in (A.2), we impose the following conditions on the bandwidth and trimming parameter to ensure that $\hat\mu_{SP,k}(x)$ and $\hat\sigma^2_{SP,k}(x)$ converge sufficiently fast:

H.1 $\sqrt{n}\,h\,a^6/\log(n)\to\infty$, $\sqrt{n}\,h^3 a^4/\log(n)\to\infty$, $n^{1/4}h^m a^{-3}\to 0$, $\sqrt{n}\,h^m a^{-1}\to 0$, and $\sqrt{n}\,a^{q/2}\to 0$.

H.2 $\sqrt{n}\,h\,a^4/\log(n)\to\infty$, $n^{1/4}h^m a^{-2}\to 0$, $\sqrt{n}\,h^m a^{-1}\to 0$, and $\sqrt{n}\,a^{q/2}\to 0$.
Depending on whether we work under $H_{SP,1}$ or $H_{SP,2}$, we will impose (H.1) or (H.2) respectively. The conditions involve both $h$ and $a$ and impose restrictions on how fast they can jointly go to zero. They are used to control higher-order bias and variance terms appearing in $\hat p_{SP,k}(y|x)$; in particular, they ensure that the kernel-based estimators of the relevant semi-nonparametric component under the null converge with rate $o_P(n^{-1/4})$ uniformly over $\{x : \pi(x)\ge a\}$, and that the trimming has no first-order impact on the semiparametric transition density estimator. Utilising arguments developed in Kristensen (2008, 2010), we are now able to establish the following asymptotic expansion of the transition density estimator:

Lemma 2. For $k\in\{1,2\}$: Assume that (A.1)–(A.5), (B.1)–(B.2) and (H.k) hold. Then
$$\hat p_{SP,k}(y|x) = p_{SP,k}(y|x) + \frac{1}{n}\sum_{i=1}^{n} D_{k,i}(y|x) + o_P(1/\sqrt{n}), \qquad k = 1, 2,$$
uniformly over $(y,x)$ in any compact set of $I\times I$. Here, $D_{k,i}(y|x) = D_k(X_i, X_{i-1}; y, x)$ is given in Eq. (34). In particular, $E[D_{k,i}(y|x)] = 0$ and $E[D_{k,i}^2(y|x)] < \infty$ for all $(x,y)$.

From the above lemma, we see that $\hat p_{SP,k}(y|x)$ is $\sqrt{n}$-consistent. This holds despite the fact that nonparametric kernel estimators are employed as inputs in the computation of $\hat p_{SP,k}(y|x)$. The reason for this perhaps surprising result can be found in the representation of $\hat p_{SP,k}(y|x)$ as a solution to a PDE: as such, the computation of $\hat p_{SP,k}(y|x)$ involves integrating over the drift and diffusion estimator, which in turn speeds up the convergence rate; for more details, we refer to Kristensen (2008, 2010). An important consequence of the above lemma is that $\hat p_{SP,k}(y|x)$ converges at a faster rate than $\hat p_{NP}(y|x)$, so we can exchange $\hat p_{SP,k}(y|x)$ for the unknown density
in the derivation of the asymptotic properties of $T_{SP,k}$. Finally, we note that the theorem holds both under $H_{SP,k}$ and the alternative: under the null, $p_{SP,k}(y|x)$ equals the true data-generating transition density, while under the alternative it corresponds to the drift or diffusion restriction evaluated at the pseudo-true value.

To derive the asymptotic properties of the test statistics, we impose the following restriction on the weighting function:

B.3 The weighting function $w : I\times I\to\mathbb{R}_+$ is continuous with compact support.

The assumption of a fixed, compact support of $w$ is made in order to control the tail behaviour of the estimators of transition densities. This assumption is fairly standard and is, for example, also imposed in Aït-Sahalia et al. (2009); a similar restriction is imposed in Li and Tkacz (2006), who assume compact support of the joint density $f(y,x)$. Under (B.3), $T_{SP,k}$ can only detect departures from $H_{SP,k}$ that reveal themselves in the density within the support of $w$. However, under suitable regularity conditions on the tail behaviour of $w$, the drift and the diffusion, one should be able to allow for weighting functions with unbounded support, see e.g. Kristensen (2010) and Li and Tkacz (2006). This would lead to more complicated proofs, however, and we therefore maintain (B.3) for simplicity. In the following, let $(f\ast g)(z) = \int_{\mathbb{R}} f(u)g(u+z)\,du$ denote the convolution of any two functions $f$ and $g$. We then have the following results for the asymptotic properties of the two tests:

Theorem 3. For $k\in\{1,2\}$: Assume that (A.1)–(A.5), (B.1)–(B.3) and (H.k) hold. Then under $H_{SP,k}$:

(i) If $\lambda_n^2 := nh_{NP}^{2m+2}\to\lambda^2 < \infty$, the following expansion holds:
nhNP TSP,k − mSP = vSP Un +
hNP v¯ SP U¯ n + λn σv Vn + σ¯ v V¯ n
+1 + nh2m Bk + OP NP
log (n)2
nh3NP
where (Un , Vn ) and U¯ n , V¯ n both converge towards bivariate standard normal distributions,
[∫
1
]2
p(y|x)
∫
w (y, x) dydx, π (x) p(y|x) 1 K 2 (z ) dz × w (y, x) dydx, + 2 2 nhNP R R π ( x) [∫ ]2 ∫ 2 vSP =2 p2 (y|x)w (y, x) dydx, (K ∗ K )2 (z ) dz × mSP =
nh2NP
K (z ) dz 2
×
R2
R
∫
∫
I ×I
R
2 and the parameters Bk , v¯ SP , σv2 and σ¯ v2 are given in the proof. +1 (ii) In particular, if nh3NP / log (n)2 → ∞ and nh2m → 0, NP
nhNP
TSP,k − mSP
vSP
→d N (0, 1) .
The first part of the theorem states an asymptotic expansion of TSP,k , k = 1, 2, under weak restrictions on the bandwidth. The limiting distribution is in this general case quite involved and not easily evaluated. One could adjust the proposed test statistics by following the ideas of Bickel and Rosenblatt (1973) and Fan (1994) in order to remove the higher-order terms λn σv Vn + σ¯ v V¯ n +1 and OP nh2m . This however would have consequences for the NP resulting tests’ power properties, c.f. Fan (1994). Under additional restrictions on the bandwidth hNP , we obtain a standard normal distribution of the tests which is similar to the results reported in Aït-Sahalia et al. (2009, Theorems 1 and 2), and Li and Tkacz (2006, Theorem 1). In particular, as in these studies, the asymptotic distribution is entirely determined by the nonparametric estimator, pˆ NP (y|x), since the estimator of the
transition density under the null converges with parametric rate. This is the reason for that the asymptotic expansions in the first part are the same for both tests. It should also be noted that the asymptotic distribution in the second part is not affected by the dependence structure in data and is identical to the one found when data is i.i.d., see e.g. Fan (1994). In order for the resulting test in the second part to become 2 operational, consistent estimates of mSP and vSP have to be obtained. This can easily be done by substituting the unknown quantities entering these for their estimates (either under the null or the alternative); see e.g. Li and Tkacz (2006, p. 867). Next, we investigate the power of the proposed tests. To this end, we introduce the following two sequences of contiguous alternatives: c HSP ,1 : µn (x) =
1
∂ 2 σn (x) π (x) ,
2π (x) ∂ x
σn2 (x) = σ 2 x; θ1∗ + an gn (x) , and ∗ c HSP ,2 : µn (x) = µ x; θ2 + an gn (x) ,
σn2 (x) =
2
π (x)
x
∫
µn (y) π (y) dy. l
Here, an → 0 is a real-valued sequence and gn : I → R is a sequence of functions such that an gn (x) captures the deviation from the posited parametric specification of the drift or diffusion function. In particular, if an = 0 then HSP,k is true, k = 1, 2. On the other hand, if limn→∞ an ̸= 0 then the null is asymptotically false and our tests will trivially detect the departure with probability 1. Instead, we will let an → 0 and investigate at which rate we can detect contiguous alternatives. The drifting alternatives are formulated such that the resulting data-generating diffusion model remains stationary for each given n ≥ 1. In particular, the unspecified term can be identified by Eqs. (6) and (7) respectively so that the nonparametric estimators remain consistent. It should be stressed that the above alternatives are different from the ones considered in, for example, Fan (1994) and Aït-Sahalia et al. (2009) who specify alternatives in terms of the corresponding (transition) density. Since our focus is on testing for the correct specification of the drift and diffusion function, our alternatives seem to be the more natural ones though. We now proceed to analyse how the departures in the drift and the diffusion term is reflected in the corresponding transition densities. We do this by adopting the approach of Gouriéroux and Tenreiro (2001) who also considers density-based tests under contiguous alternatives: We keep the probability measure P fixed which is possible since the Brownian motion {Wt } remains fixed while the drift and diffusion term drifts. Under either of the two sequences of contiguous alternatives, we have a corresponding sequence of data-generating transition densities, pn (y|x) = p(y|x; 1, µn , σn2 ). Since the transition density pn (y|x) is drifting, the observed data will also drift. We emphasise this by explicitly writing data as a triangular array, Xn,i : i = 1, . . . , n . Utilising that pn (y|x) and pSP,k (y|x) := pSP,k y|x; θk∗ both solve a PDE on the form given in Eq. (2), we obtain the following relationship between the two (see Proof of Theorem 4):
(n)
pn (y|x) = pSP,k (y|x) + an γSP,k (y|x) + a2n RSP,k (y|x) ,
(20)
(n)
where γSP,k (y|x) is the functional partial derivative of the transition density w.r.t. µ, σ 2 in the direction of the departure, while RSP,1 (y|x) = O(supw∈I {gn (w)2 + gn′ (w)2 }) and RSP,2 (y|x) = O(supw∈I gn (w)2 ) are remainder terms. With p¯ µ,k (y, x, w) and
D. Kristensen / Journal of Econometrics 164 (2011) 382–403
(n)
p¯ σ 2 ,k (y, x, w) defined in Eq. (37), γSP,k (y|x) , k = 1, 2, are given by: (n) γSP ,1
(y|x) =
(n) γSP ,2 (y|x) =
∂ ∂w
∫
1
2 ∫I
+ ∫
I
[gn (w) π (w)]
π (w)
(21)
gn (w) p¯ µ,2 (y, x, w) dw dt
I
w
∫ ∫
+2 I
gn (u) π (u) du
l
1
π (w)
p¯ σ 2 ,2 (y, x, w) dw. (22)
(n) γSP ,k
The function (y|x) captures how the deviation in the drift or diffusion term translates into deviations in terms of the true and posited transition density. Note that this mapping involves integrating over the deviation gn (x) appearing in the drift and diffusion term. This is due to the fact that any given diffusion transition density implicitly integrates over the underlying drift and diffusion terms as noted earlier. The pseudo-true values θ1∗ = θn∗,1 and θ2∗ = θn∗,2 will in general drift under the contiguous alternatives. However, we will maintain Assumption (A.4) since as noted earlier it will hold even if the null is false by combining the arguments of White (1982) for estimators in misspecified models with LLN and CLT results for triangular, mixing arrays (see e.g. Wooldridge and White, 1988). The specification of the alternatives in terms of the pseudo-true values is quite standard in the literature analysing local power properties of tests. For example, Assumption (A.6)(ii) in conjunction with (A.4) is similar to Assumption (P1) in Härdle and Mammen (1993) who provide a detailed discussion of pseudo-true values and their estimators in a regression setting. To analyse the behaviour of c ˆ θn∗,k under HSP ,k suppose that θk = arg maxθk Qk,n (θk ) , Qk,n (θk ) = qk Xn,i |Xn,i−1 ; θk , such as, for example, the semiparametric MLE of Kristensen (2010). Then,
∑n
i =1
θn∗,k = arg max E qk Xn,i |Xn,i−1 ; θk θk ∈Θk ∫ = arg max pn (y|x) qk (y|x; θk ) π (x) dydx. θk ∈Θk
Assuming that qk (y|x; θk ) identifies θ0,k under HSP,k , θn∗,k → θ0,k as an → 0 since pn (y|x) → pSP,k (y|x). Giving the drifting nature of the pseudo-true values, the chosen specification of the local alternatives may look a bit odd. Note however that for any given sequence of local alternatives, µn (x) and c σn (x), we can always write them on the form of HSP ,k , k = 1, 2, by choosing gn (x) = [µ(x) − µ(x; θn∗,2 )]/an and gn (x) = [σn2 (x) −
σ 2 (x; θn∗,1 )]/an respectively. As a specific example, consider the folc 2 ˜c lowing alternative specification of, for example, HSP ,1 , HSP,1 : σn (x) 2 = σ (x; θ0,1 ) + an g˜n (x) for some fixed value θ0,1 . Combine a firstorder Taylor expansion of the first-order condition characterising θn∗,k (k = 1, 2) as defined above with Eq. (27) to obtain that, as an → 0,
θk∗ ≃ θ0,k − an H0−1 Bn , where we have used that π(x)dydx = 0, and
∂ qk (y|y; θ0,k )/(∂θk )pSP,k (y|x; θ0,k )
∂ 2 qk y|y; θ0,k pSP,k (y|x) π (x) dydx, ∂θk ∂θk′ ∫ ∂ qk y|y; θ0,k (n) Bn,k := γSP,k (y|x) π (x) dydx. ∂θk c c ˜ SP We can now write H ,1 on the form of HSP,1 : By a Taylor expansion, c ˜ HSP,1 can be rewritten as ∫
=σ
2
389
x; θ1 − an σ˙
∗
x; θ0,1 H0−,11 Bn,1 + an g˜n (x)
2
x; θ1 + an gn (x) ,
∗
where gn (x) := g˜n (x) − σ˙ 2 x; θ0,1 H0−,11 Bn,1 captures both the misspecification and the bias that this induces in the estimation under the null. Thus, the notation and the results get a lot simpler c ˜c by working with HSP ,k instead of HSP,k . We will throughout restrict an and gn to satisfy:
p¯ µ,1 (y, x, w) dw
gn (w) p¯ σ 2 ,1 (y, x, w) dw,
σn2 (x) ≃ σ
2
A.6 gn (x) is continuously differentiable with compact support uniformly in n, and an = o (1). Combining the expansion in Eq. (20) with limit results for kernel estimators based on triangular mixing arrays as found in Kristensen (2009) and Gouriéroux and Tenreiro (2001), we obtain the following theorem: Theorem 4. For k ∈ {1, 2}: Assume that (A.1)–(A.6), (B.1)–(B.3) 2 c 3 and (H.k) hold. Then under HSP → ∞ and ,k , as nhNP / log (n) 2m+1 nhNP → 0:
nhNP TSP,k − mSP
= vSP Un,1 + nhNP a2n
∫ ∫ I
I
(n) 2 γSP ,k (y|x)
× w (y, x) dydx + a4n R¯ 2SP,k + oP (1) ,
(23)
where R¯ SP,1 = O(supw∈I {gn (w)2 + gn′ (w)2 }) and R¯ SP,2 = O(supw∈I gn (w)2 ). The above asymptotic expansion of the test statistic under contiguous alternatives corresponds to the ones found in AïtSahalia et al. (2009, Theorem 3) and Fan (1994, Theorem 3.6), (n) except that the deviation from the null, γSP,k (y|x), here takes a more complicated form since it is expressed in terms of the underlying deviations from the hypothesised drift and diffusion function. We now proceed to consider different specifications of the deviation from the null, gn (x). First, we first consider so-called Pitman alternatives where gn (x) = g (x) is fixed. From Eq. (23), we easily see that both tests can in general only detect global alternatives for which limn→∞ nhNP a2n > 0. In particular, they can detect alternatives that vanish with rate an = O n−2/5 when the
bandwidth is chosen to vanish with rate hNP = O n−1/5 . This shows that our tests are less powerful than CvM and KS type tests which can detect alternatives at parametric rate. The above results are in accordance with the analysis of kernel-based specification tests where the alternatives are directly expressed in terms of the density of interest, c.f. Aït-Sahalia et al. (2009) and Fan (1994). Another important point is that for particular directions (choices (n) (n) of g (w)), γSP,1 (y|x) and γSP,2 (y|x) are zero and so the transitionbased tests cannot detect such alternatives. This is in contrast to the analysis in Aït-Sahalia et al. (2009) who conclude that transitionbased tests are consistent against all fixed alternatives. To investigate whether the above mentioned drawback of our test relative to cdf-based tests is peculiar to Pitman alternatives, we consider ‘‘local’’ deviations as originally proposed in Rosenblatt (1975). These take the form gn (x) = g ((x − x0 ) /bn ) for some sequence bn → 0 and some x0 ∈ I. For this class of drift and diffusion alternatives, the corresponding deviations in terms of the transition densities satisfy
(n) γSP ,1 (y|x) =
H0,k :=
∫
1 2bn
g′
w − x0
+
g
w − x0
I
1 π ′ (x0 ) 2 π (x0 )
bn
I
∫
≃
bn
π ′ (w) p¯ µ,1 (y, x, w) dw π (w)
p¯ σ 2 ,1 (y, x, w) dw
p¯ µ,1 (y, x, x0 ) ×
+ bn p¯ σ 2 ,1 (y, x, x0 ) ×
∫
∫
g ′ (z ) dz
g (z ) dz ,
390
D. Kristensen / Journal of Econometrics 164 (2011) 382–403
and, similarly,
∫
(n) ¯ µ,2 (y, x, x0 ) × g (z ) dz + 2bn π (x0 ) γSP ,2 (y|x) ≃ bn p ∫ r ∫ 1 × p¯ σ 2 ,2 (y, x, w) dw × g (z ) dz . x0 π (w)
By plugging the above expressions into Eq. (23), we find that our tests are only able to detect local alternatives for which 2 2 limn→∞ g (z ) NP an bn > 0. Moreover, alternatives which satisfy nh ′ dz = g (z ) dz = 0 cannot be detected by T , while T SP,1 SP,2 can not detect alternatives satisfying g (z ) dz = 0. This shows that our transition-based test statistics also have problems detecting high-frequency/local features in the drift and diffusion function. In particular, the above rates are the same as the one for KS and CvM type tests. The above power results are not particular to our setting, and apply more generally. In particular, the transition-based tests of fully parametric models proposed in Aït-Sahalia et al. (2009) also suffer from poor power against local alternatives: In Section 5, we revisit one of their tests and show that this also suffers from low power against alternatives in terms of the drift and diffusion term. Our findings for local (‘‘high-frequency’’) alternatives are somewhat analogous to the negative results reported for tests based on cumulative distribution functions (cdf’s) such as the KS and CvM tests: High-frequency departures, as formulated in terms of the density, cannot be detected by such tests since the departures are integrated out in the computation of the cdf’s, see e.g. Escanciano (2009) and Eubank and LaRiccia (1992). However, such tests are on the other hand more powerful at detecting global Pitman alternatives compared to tests based on transition densities such as ours, since the former can detect certain global alternatives at parametric rate. In conclusion, it appears as if tests based on L2 -distance measures of transition densities may not be very appropriate for detection of alternatives in terms of the underlying drift and diffusion functions. One way to detect deviating features of these functions would be to obtain estimates of these under null and alternative and compare those directly instead of through the corresponding transition densities. Suppose that we have at our disposal fully nonparametric estimators of the drift and diffusion 2 term, say µ ˆ NP (x) and σˆ NP (x) such as the sieve estimators developed in Chen et al. (2009) and Gobet et al. (2004). Two natural classes of test statistics would then be T¯SP,1 =
∫ I
T¯SP,2 =
∫
2 2 2 [σˆ NP (x) − σˆ SP ¯ (x) dx, ,1 (x)] w
(24)
[µ ˆ NP (x) − µ ˆ SP,2 (x)]2 w ¯ (x) dx, I
for some weighting function w ¯ (x). We expect that such tests would have better power properties against local alternatives. This conjecture is supported by the theoretical results found in the next section where we demonstrate that in testing of fully parametric models against semiparametric alternatives this class of tests indeed have better local power properties. It would be of interest to develop similar results for the above two test statistics, but this is hampered by the fact that existing fully nonparametric estimators are quite complicated to analyse theoretically. We therefore leave this for future research.
on an indirect comparison of the null and alternative through the corresponding transition density estimates. The second will directly compare the drift and diffusion estimates obtained under null and alternative. As we shall see, these two classes of tests have radically different asymptotic behaviour. First, we introduce our transition-based tests: Under the alternative, we have the semiparametric estimate, pˆ SP,k (y|x), while under the null we assume an estimator of the parameters, θ˜ = (θ˜1 , θ˜2 ), is available. Under the null, the model is fully specified and the estimator θ˜ could arrive from a range of standard parametric estimation methods such as maximum-likelihood (Aït-Sahalia, 2002; Kristensen and Shin, 2008) and method of moments (Bibby et al., 2009; Hansen and Scheinkman, 1995). Associated with the fully parametric family of diffusion models under HP , there exists a family of transition densities; this can be obtained by, for example, plugging the parametric drift and diffusion specification into Eq. (2). We denote this family pP (y|x; t , θ ) = p y|x; t , µ (·; θ ) , σ 2 (·; θ ) , and we will again suppress the dependence on t when evaluated at t = 1. The estimated transition density under the null is then given by pˆ P (y|x) := pP (y|x; θ˜ ). As with the semiparametric transition density estimator, pˆ P (y|x) can in general not be written on closed form and numerical approximations have to be employed (Aït-Sahalia, 2002; Kristensen and Shin, 2008). Given pˆ P (y|x) and pˆ SP,k (y|x), we then propose to test HP against HSP,k by:
∫ ∫ I
2
pˆ P (y|x) − pˆ SP,k (y|x)
T P ,k =
w (y, x) dydx,
k = 1, 2.
I
To analyse the asymptotic properties of these two tests, we impose the following assumptions on the parametric model and its estimators: ∑n A.7 The estimator θ˜ satisfies θ˜ = θ ∗∗ + i=1 ψP (Xi |Xi−1 )/n + √ oP (1/ n) with E [ψP (X1 |X0 )] = 0 and E [‖ψP (X1 |X0 )‖2+δ ] < ∞ for some δ > 0. Under HP , θ ∗∗ = θ0 . A.8 The transition density under HP , pP (y|x; θ ), and its first two derivatives w.r.t. θ exist, and they are all continuous w.r.t. (y, x) for all θ . As with the estimators under the semiparametric nulls, (A.6) allows for misspecification and will only assume that θ ∗∗ is equal to the true value when working under HP . We will in general suppress dependence on θ when evaluated at θ = θ ∗∗ . Sufficient conditions for the above assumption to hold for the MLE can be found in Aït-Sahalia (2002) and for GMM-type estimators in Bibby et al. (2009). As we shall see, to derive the asymptotic distribution of TP ,k under the null, it√ is critical that the estimators of the parametric components are n-asymptotically normally distributed. This is in contrast to the semiparametric tests, TSP,k , where we only need that they converge at a sufficiently fast rate. Theorem 5. For k ∈ {1, 2}: Assume that (A.1)–(A.5), (A.7)–(A.8), (B.1)–(B.3) , and (H.k) hold. Then under HP : nTP ,k →d
∫ ∫
Zk2 (y, x) w (y, x) dydx,
for k = 1, 2, where Zk (y, x) is a Gaussian process with covariance kernel
¯ (x, y) , x′ , y′ = Σ0 (x, y) , x′ , y′ Σ ∞ − + Σi (x, y) , x′ , y′ , i=1
4.2. Parametric specification tests In this section, we develop tests of the fully parametric hypothesis, HP , against either of the two semiparametric ones, HSP,1 or HSP,2 . We will consider two types of tests: The first is similar in spirit to the tests considered in the previous section and based
∂ pP (y|x; θ ) ψ − D y | x ( ) P ,0 k,0 ∂θ ′ ] ′ ′ ′ ′ ∂ pP (y |x ; θ ) × ψ − D y | x , P ,i k ,i ∂θ ′
Σi (x, y) , x′ , y′ = E
and ψP ,i := ψP (Xi |Xi−1 ).
[
D. Kristensen / Journal of Econometrics 164 (2011) 382–403
The above test statistic has the interesting property that it converges with parametric rate even though it involves nonparametric kernel estimators. This is due to the fact that the transition density under the semiparametric alternative, pˆ SP,k (y|x), converges with parametric rate. Moreover, the limiting distributions depend on the asymptotics of the underlying parametric estimators. Both these features are in contrast to the ones of the semiparametric transition-based tests. Instead, the asymptotic behaviour of TP ,k , k = 1, 2, is similar to those of omnibus tests such as the KS and CvM test; see, for example, Bhardwaj et al. (2008, Theorem 3) and Escanciano (2009). These omnibus-type features of the tests in particular means that they are able to detect ‘‘global’’ alternatives with parametric rate; on the other hand, due to the integration involved when computing the transition densities, TP ,k cannot detect local (or high-frequency) departures. To see this, consider the following two contiguous alternatives: HPc ,1 : µn (x) = µ x; θ2∗∗ + an gn (x) ,
σn2 (x) = σ 2 x; θ1∗∗
(25)
and HPc ,2 : µn (x) = µ x; θ2∗∗ ,
(26)
σn2 (x) = σ 2 x; θ1∗∗ + an gn (x) .
Here, HPc ,k will be used to examine the power properties of TP ,k . Note that under HPc ,1 the diffusion function is correctly specified and as such it is a constant sequence; this is to ensure that the maintained assumption, HSP,1 is correct. Similarly with HPc ,2 . As with the semiparametric pseudo-true values, θ1∗∗ and θ2∗∗ will in general be drifting under HPc ,2 and HPc ,1 respectively, and the c c discussion following the introduction of HSP ,1 and HSP,2 also applies here. As before, we let pn (y|x) = pn y|x; µn , σn2 denote the data generating transition density, where µn and σn2 are given either by HPc ,1 or HPc ,2 . We then obtain under HPc ,k : (n)
pP (y|x) = pn (y|x) + an γP ,k (y|x) + a2n RP (y|x) ,
(27)
where RP (y|x) = O supw∈I |gn (w)| ,
γP(,n1) (y|x) =
1
γP(,n2) (y|x) =
∫
2
I
∫
2
gn (w) p¯ µ (y, x, w) dw, I
gn (w) p¯ σ 2 (y, x, w) dw,
and p¯ µ (y, x, w) and p¯ σ 2 (y, x, w) are defined as in Eq. (37) except that pP (y|x) replaces pSP,k (y|x). The expression in Eq. (27) can then in turn be used to derive asymptotic expansions of the tests under contiguous alternatives: Theorem 6. For k ∈ {1, 2}: Assume that (A.1)–(A.8), (B.1)–(B.3), and (H.k) hold. Then under HPc ,k :
∫ ∫ nTP ,k =
Zk2 (y, x) w (y, x) dydx
∫ ∫
γP(,nk) (y|x)2 w (y, x) dydx I I ∫ ∫ √ (n) + 2 nan Zk (x, y) γP ,k (y|x) w (y, x) dydx I I √ + OP na4n R2P + O a2n R¯ P / n , where R¯ P = O supw∈I |gn (w)|2 . +
na2n
From the expression in Theorem 6, it is easily seen that TP ,k can detect global alternatives on the form gn (x) = g (x) for which limn→∞ na2n > 0. Thus, it can detect global alternatives vanishing
391
at parametric rate, an = O n
−1/2
. Note however, that certain
(n)
alternatives for which γP ,k (y|x) = 0 cannot be detected (see Escanciano, 2009 for more details). Furthermore, local alternatives on the form gn (x) = g ((x − x0 ) /bn ) are not as easily detected. For this class of alternatives, we obtain
γP(,n1) (y|x) =
∫ g I
x − x0 bn
p¯ µ (y, x, w) dw
= bn p¯ µ (y, x, x0 ) ×
∫
g (z ) dz , I
(n)
and similarly for γP ,2 (y|x). Thus, deviations can only be detected if I g (z ) dz ̸= 0 and limn→∞ na2n b2n > 0. This is akin to the semiparametric tests, and the discussion of these also applies here. In conclusion, transition-based tests may not be suitable when the interest lies in detecting local, ‘‘high-frequency’’ departures in the drift and diffusion function from the null. The problem with the transition-based tests lies in the fact that they integrate out the deviations appearing in the drift and/or diffusion function. We therefore introduce two alternative test statistics that directly compare the fully parametric and semiparametric estimators of the drift and diffusion function. Define T¯P ,1 = T¯P ,2 =
∫
[µ(x; θ˜1 ) − µ ˆ SP,1 (x)]2 w ¯ (x) dx,
∫I I
(28) 2 2 [σ 2 (x; θ˜2 ) − σˆ SP ¯ (x) dx, ,2 (x)] w
for some weighting function w ¯ : I → R+ . These tests are similar to the ones proposed in Eq. (24). We will assume that w ¯ has compact support which in particular implies that trimming of the seminonparametric estimators is not required; thus, we may use the ones given in Eqs. (9)–(10) instead of Eqs. (15)–(16). Here, T¯P ,k tests HP against HSP,k , k = 1, 2. The intuition behind these two alternative test statistics is similar to the one for TP ,1 and TP ,2 , but instead of measuring deviations from the null in terms of the transition densities we now directly measure discrepancies appearing in the drift or diffusion functions. To get a better understanding of what T¯P,k is actually testing, it is worth noting that under the null T¯P,1 ≈ I0 (ω10 ) + I1 (ω11 ) and T¯P ,2 ≈ I0 (ω2 ), where Ik (ω) =
∫
[πˆ (k) (x) − π (k) (x)]2 ω (x) dx,
(29)
I
for k = 0, 1, and ω10 , ω11 and ω2 are appropriately chosen weighting functions (see the proof of Theorem 7 below for details). This highlights that T¯P ,1 and T¯P ,2 to a large extent are testing the correct specification of the marginal density as implied by the parametric specification under HP against its nonparametric alternative. As such the tests are similar to the ones proposed in Aït-Sahalia (1996b) and Huang (1997). This could also seem to indicate that one could instead use Ik (ω) , k = 0, 1, to test HP against HSP,1 and HSP,2 . However, observe that T¯P ,1 and T¯P ,2 involve nontrivial transformations of the marginal density and therefore test different directions of departure from the null with special emphasis on the correct specification of the drift and diffusion respectively. In particular, when one specifies deviations from the null in terms of the drift and diffusion terms, then I0 (ω) and I1 (ω) will distort some of the local features in the drift and diffusion term; see the discussion following Theorem 8 below for more details. We also note that T¯P ,2 shares some similarities with the specification tests proposed in Corradi and White (1999) and Li (2007). These two studies are only concerned with testing the correct specification of the diffusion term, and propose to test a
392
D. Kristensen / Journal of Econometrics 164 (2011) 382–403
given specification of σ 2 using T¯P ,2 as given in Eq. (28) except that they employ the nonparametric estimator of σ 2 (·) proposed in Florens-Zmirou (1993); see also Bandi and Phillips (2003). The advantage of the estimator of Florens-Zmirou (1993) is that it does not require as input a preliminary estimator of the drift function (as ours do). On the other hand, the estimator of Florens-Zmirou (1993) requires high-frequency observations and is only consistent as time distance between observations shrinks to zero, ∆ → 0, sufficiently fast as n → ∞ (c.f. Nicolau, 2003). So for low frequency data, the tests of Corradi and White (1999) and Li (2007) will most likely have lower power. The theorem is shown under the following regularity condition on the weighting function:
that study, this removes some of the higher-order terms in the asymptotic expansions of the resulting test statistics such that weaker restrictions on allowable bandwidth sequences are needed. However, as demonstrated in Fan (1994), these modifications alter the power properties of the tests. We now examine the power properties of the tests to add further insight to their (asymptotic) performance. We do this by revisiting the sequence of alternatives specified in Eqs. (25) and (26). Theorem 8. Assume (A.1)–(A.7), (B.1)–(B.4) hold. Then: (i) Under HPc ,1 , as nh3 → ∞, and nh3/2+2m → 0:
¯ P ,1 nh5/2 T¯P ,1 − m
B.4 The weighting function w ¯ : I → R+ is continuous and has compact support. The discussion that followed Assumption B.3 also applies here. We are now able to derive the following result concerning the asymptotic distributions of the tests under the null:
I
where Un,1 → N (0, 1). (ii) Under HPc ,2 , as nh → ∞, and nh1/2+2m → 0, d
¯ P ,2 nh1/2 T¯P ,2 − m
Theorem 7. Assume (A.1)–(A.5), (A.7), (B.1) and (B.4) hold. Then under HP :
(i) As nhm+5 → 0, nh4m+5/2 → 0 and nh1/2 / log (n)2 → ∞, nh
¯ ¯ 5/2 TP ,1 − mP ,1 v¯ P,1
→d N (0, 1) ,
σ 4 ( x) w ¯ (x) dx 3 4nh R π ( x) ∫ ∫ 4I 1 σ (x) w ¯ 2 (x) π (1) (x)2 + K 2 (z ) dz dx, 4nh R π 4 (x) I ∫ ∫ (1) 2 1 π 2 (x)σ 8 (x) w ¯ 2 (x) dx. = K ∗ K (1) (z ) dz 1
v¯ P2,1
8
(ii) As nh nh
K (1) (z )2 dz ×
1/2 TP ,2
∫
v¯ P2,2
The above expressions reveal that T¯P ,1 and T¯P,2 can only detect global alternatives on the form gn (x) = g (x) for which limn→∞ nh5/2 a2n > 0 and limn→∞ nh1/2 a2n > 0 respectively. Thus, they are less powerful than TP ,k , k = 1, 2, in this regard. However, they are better at detecting local deviations from the null: for alternatives on the form gn (x) = g ((x − x0 ) /bn ), we obtain
∫
gn2
¯ (x) dydx = (x) w
¯ P ,2 d −m → N (0, 1) , v¯ P,2
σ 4 ( x) w ¯ (x) dx, nh R π (x) I ∫ ∫ 8 σ (x) w ¯ 2 (x) dx. = 32 (K ∗ K )2 (z ) dz × 2 π (x) R I 4
∫
K 2 (z ) dz ×
∫
¯ P ,k and v¯ P2,k can be obtained by subConsistent estimates of m stituting the unknown quantities entering these, that is, σ 2 (x) and π (x), for their estimates. As part of the proof of Theorem 7, we derive asymptotic expansions of the two test statistics similar to those stated for the semiparametric test statistics in Theorem 3. These expansions include additional higher-order terms which vanish under the restrictions imposed on the bandwidth in Theorem 7. In contrast to the transition-based tests, TP ,1 and TP,2 , the above alternative tests converge with nonparametric rates and have standard normal distributions. This owes to the fact that the computation of T¯P ,1 and T¯P ,2 does not involve integration over 2 the semi-nonparametric estimators, µ ˆ SP,1 (x) and σˆ SP ,2 (x). As a consequence, the asymptotic properties are similar to those of other kernel-based test statistics, c.f. Theorems 1 and 3. One could consider a number of modified versions of the above test statistics by following the ideas of Kristensen (2007) and replace the parametric estimators of the drift (in T¯P ,1 ) or diffusion (in T¯P ,2 ) with kernel smoothed versions. As shown in
∫
I
g
2
x − x0 bn
I
= bn w ¯ (x0 )
→ 0, nh4m+1/2 → 0, and nh3/2 / log (n)2 → ∞,
where
¯ P ,2 = m
I
R
2m+1
¯
∫
= v¯ P ,2 Un,2 + nh1/2 a2n ∫ × gn2 (x) w ¯ (x) dx + oP (1) ,
where Un,2 →d N (0, 1).
where
¯ P ,1 = m
= v¯ P ,1 Un,1 + nh5/2 a2n ∫ × gn2 (x) w ¯ (x) dx + oP (1) ,
∫
w ¯ (x) dx
g 2 (z ) dz . I
Thus, the tests can detect alternatives for which I g (z ) dz = 0, and the rates at which they can detect alternatives are limn→∞ nh5/2 a2n bn > 0 and limn→∞ nh1/2 a2n bn > 0 respectively. For suitable choices of h, high-frequency alternatives can therefore be detected by T¯P ,1 and T¯P ,2 at a better rate compared to TP ,1 and TP ,2 ; see Rosenblatt (1975) and Ghosh and Huang (1991) for related results. The results of Theorems 6 and 8 are comparable to the ones found in the literature on testing for correct specifications of distributions using either nonparametric kernel density estimators or cumulative density function estimators (see e.g. (Eubank and LaRiccia, 1992)). In conclusion, depending on the type of alternatives of interest, one should either employ TP ,k or T¯P ,k , k = 1, 2.
5. Related tests We here briefly discuss the misspecification tests proposed by Aït-Sahalia (1996b) and Aït-Sahalia et al. (2009) in relation to our tests. As noted in the previous section, the two tests T¯P ,1 and T¯P ,2 are somewhat similar to the one proposed in Aït-Sahalia (1996b) which is on the form I0 (m) as given in Eq. (29). This test was originally proposed to test HP against HNP , but as noted above it seems more suitable for testing the parametric hypothesis against either HSP,1 or HSP,2 . To see how our tests perform relative to this one, we analyse the power properties of a test based on I0 (m): Consider again the contiguous alternative HPc ,1 . Using Eq. (5), we obtain the following marginal density implied by the null,
πP (x) =
Mx∗
σP2 (x)
[ ∫
x
exp 2 x∗
] µP (y) dy , σP2 (y)
D. Kristensen / Journal of Econometrics 164 (2011) 382–403
while the sequence of contiguous densities are given by
] µP (y) + an gn (x) πn (x) = 2 exp 2 dy σP (x) σP2 (y) x∗ [ ∫ x ] an gn (x) = πP (x) exp 2 dy . 2 x∗ σP (y) x
[ ∫
Mx∗
Thus, by using the same arguments as in the proof of Theorem 8, we obtain under HPc ,1 that nh1/2 {I0 (m) − c0 } = v0 Un + nh1/2
∫ ×
[ ∫
x
exp 2 x∗
I
an gn (x)
σP2 (y)
]
dy − 1 πP (x) m (x) dx + oP (1) ,
for suitably defined parameters c0 and v0 , and where Un →d N (0, 1). This shows that I0 (m) is not tailored to detect the deviation, gn (x). In particular, gn (x) is integrated over twice which has as consequence that I0 (m) will suffer from similar issues as the transition-based tests. In contrast, T¯P ,1 is designed to directly capture any deviations between µP (x) and µ (x), c.f. Theorem 8(i). A similar analysis can be carried out under HPc ,2 . Finally, to demonstrate that the reported poor power against both global and local deviations is a general problem for transitionbased tests, we revisit one of the tests developed in Aït-Sahalia et al. (2009). They propose to test HP against HNP by
∫ ∫
pˆ P (y|x) − pˆ NP (y|x)
TP = I
2
w (y, x) dydx.
I
This test has the same asymptotic distribution as TSP,k , k = 1, 2, under the null of HP . To investigate its power properties, we consider the following contiguous alternative, HPc : µn (x) = µ x; θ2∗∗ + an fn (x) ,
σn2 (x) = σ 2 x; θ1∗∗ + an gn (x) , for two sequences fn (x) and gn (x) that measure deviations in the drift and diffusion term respectively. By following the same arguments as employed previously, it is easily shown that (n)
pP (y|x) = pn (y|x) + an γP
(y|x) + a2n RP (y|x) , where RP (y|x) = O supw fn2 (w) + gn2 (w) and ∫ γP(n) (y|x) = fn (w) p¯ µ (y, x, w) dw I ∫ + gn (w) p¯ σ 2 (y, x, w) dw. I
By using the same arguments as in the proof of Theorem 3, we now obtain that nhNP {TP − mSP } ≃ vSP Un,1 + nhNP a2n
∫ ∫
γP(n) (y|x)2 w (y, x) dydx + a4n R¯ 2P .
× I
I
From this expression, we see that TP cannot detect Pitman alternatives on the form fn (x) = f (x) and gn (x) = g (x) for which f (w) p¯ µ (y, x, w) dw = 0 and I g (w) p¯ σ 2 (y, x, w) dw = 0. I Moreover, for local alternatives on the form fn (x) = f ((x − x0 )/bn ) and gn (x) = g ((x − x0 ) /bn ),
∫ γP(n) (y|x) ≃ bn p¯ µ (y, x, x0 ) × f (z ) dz + p¯ σ 2 (y, x, x0 ) I ∫ × g (z ) dz , I
and so TP can only detect local alternatives for which lim n→∞ nhNP a2n b2n > 0. Finally, note that alternatives which satisfy I f (z ) dz = g (z ) dz = 0 are not detectable. I
393
Aït-Sahalia et al. (2009) also conduct a power analysis of TP and conclude that it can detect local alternatives without any of the aforementioned problems. This seeming contradiction between our results and the ones of Aït-Sahalia et al. (2009) are due to different formulations of alternatives: while Aït-Sahalia et al. (2009) express their alternatives in terms of the transition density, we formulate them directly in terms of the underlying drift and diffusion functions. In conclusion, departures from the drift and diffusion functions imposed under the null are in general not easily detected by transition-based tests since the deviations are smoothed out when the drift and diffusion functions are plugged into the transition density. 6. Markov bootstrap tests The asymptotic distributions of the proposed test statistics derived in the previous section ignore several higher-order terms that will affect the finite-sample distributions: first, all asymptotic distributions, except the ones of TP ,1 and TP ,2 , do not involve estimation errors due to unknown parametric components and additional covariance terms due to dependence in data. Second, they all are based on first-order linearisations of the test statistics and thereby ignore second-order terms. Third, various bias terms due to the kernel smoothing are not present. Fourth, in the implementation, we need to estimate unknown quantities entering the asymptotic distributions, which adds additional estimation errors to the tests. In finite samples, the distributions will clearly depend on these additional components, and as such one could fear that the asymptotic distribution stated in the theorems may deliver a poor finite sample approximations. We therefore propose Markov bootstrap versions of the tests which are expected to perform better than the ones relying on approximations based on the asymptotic distribution. The simulation studies in Aït-Sahalia et al. (2009) and Li and Tkacz (2006) of Bootstrap versions of their nonparametric tests support this conjecture. In the Markov bootstrap versions of the tests, we draw a new sample from the transition density under the relevant null, and use this sample to approximate the relevant distributions. The proposed bootstrap is similar to the one proposed by Fan (1995) in a cross-sectional setting and Li and Tkacz (2006) in a time series setting. We also note that our proposal shares some similarities with the Markov bootstrap procedures examined in Horowitz (2003) and Andrews (2005) but in different settings, while Bhardwaj et al. (2008) and Corradi and Swanson (2005) propose to use a block bootstrap in conjunction with their specification tests for diffusion models. Let in the following Tn denote any one of the test statistics developed in the previous section, and pˆ 0 (y|x) and πˆ 0 denote the transition density and stationary density estimated under the relevant null (HSP,1 , HSP,2 or HP ). The proposed bootstrap then proceeds as follows: Step 1 Draw X0∗ ∼ πˆ 0 , and recursively Xi∗ ∼ pˆ 0 (·|Xi∗−1 ), i = 1, . . . , n. n Step 2 Replace the data {Xi }ni=1 with the bootstrap sample Xi∗ i=1 in the computation of estimators and test statistics; we denote the resulting test statistic Tn∗ . Step 3 Repeat Step 1–2 B ≥ 1 times, each new sample being independent of the previous ones, yielding Tn∗,1 , . . . , Tn∗,B . Use the empirical distribution of these to estimate the distribution of Tn . 
The initialisation in Step 1 could be exchanged for X0∗ = X0 since we have a geometrically ergodic Markov chain. Since pˆ 0 (y|x) in general is not available on closed form, we propose to draw
394
D. Kristensen / Journal of Econometrics 164 (2011) 382–403
Fig. 1. Estimates of µ (x) for the CKLS (i) model. Full line = true function, dotted line = mean of estimate, plusses = 95% confidence interval.
from it by utilising an Euler discretisation scheme (see e.g. (Corradi and Swanson, 2005); (Gouriéroux et al., 1993)). This will involve an additional error, but this can be controlled for by choosing a sufficiently small time step. By relying on arguments similar to those in Bhardwaj et al. (2008), Corradi and Swanson (2005) and Li and Tkacz (2006), one should be able to show that the proposed Bootstrap versions of the parametric tests are consistent under suitable conditions. It should be noted though that in order to show consistency of the Bootstrap versions of the semiparametric tests, we first need to ensure that the bootstrap sample as generated by pˆ SP,k (y|x) is stationary and β -mixing. To this end, we need to further modify the semiparametric estimator of the drift functions, to ensure mean reversion. One could potentially also use the Markov bootstrap to construct confidence bands for the semiparametric estimators. 7. Finite-sample performance of estimators We here examine how the semi-nonparametric estimators perform in finite samples. We choose as data generating models the CKLS model of Chan et al. (1992), dXt = {β1 + β2 Xt } dt +
α
α1 Xt 2 dWt ,
(CKLS)
and a restricted version of the model proposed in Aït-Sahalia (1996b), dXt = β1 + β2 Xt + β3 Xt2 + β4 Xt−1 dt +
α α1 Xt 2 dWt .
(AS)
The data-generating parameters are chosen to match the estimates obtained when fitting the model by MLE to the Eurodollar interest rate data considered in Aït-Sahalia (1996a,b). The parameter estimates satisfy the β -mixing conditions found in Aït-Sahalia (1996b) such that (A.1) holds. We measure time in years and set the time distance to ∆ = 1/252, thereby effectively ignoring holidays and weekends, and consider two sample sizes, n = 2500, 5000. For each sample, we estimate the two following semiparametric models when either CKLS or AS is the data generating process respectively: CKLS 1: µ (x) unknown and σ 2 (x) = α1 xα2 ; CKLS 2:
µ (x) = β1 + β2 x and σ 2 (x) unknown; AS 1: µ (x) unknown and σ 2 (x) = α1 xα2 ; and AS 2: µ (x) = β1 + β2 x + β3 x2 + β4 x−1 and σ 2 (x) unknown. The parameters of the semiparametric models are estimated using the method proposed in Kristensen (2010). Once the parametric component has been estimated, we calculate µ ˆ ( x) and σˆ 2 (x) for models in Class 1 and 2 respectively. We also estimate the fully parametric models (CKLS) and (AS) by MLE which allows us to compare the semiparametric and parametric estimates. In order to evaluate the likelihood in both the parametric and semiparametric case, we employ the simulated likelihood method of Kristensen and Shin (2008). This is implemented by simulating N = 100 values for each observation, using the Euler scheme with a step length of δ = ∆/10 (see (Kristensen, 2010), for more details). We first investigate the behaviour of the semi-nonparametric estimators for the CKLS model. We consider two sets of data generating parameter values, (i) α = (1.8207, 2.6217), β = (0.0344, −0.2921) and (ii) α = (0.1547, 1.7079), β = (0.0271, −0.4455). These are estimates from the Eurodollar data set using (i) the full sample 1973–1995 and (ii) the subsample 1982–1995. The first parameter set generates high volatility and low mean reversion while the second one generates just the opposite behaviour. In Figs. 1 and 2, pointwise means and confidence bands of the fully parametric and semi-nonparametric drift estimates are plotted for the parameters (i) and (ii) respectively. For (i), Fig. 1 shows that the semi-nonparametric drift estimator performs well in the range x ∈ [0.03, 0.12] while it is rather imprecise in tails. This is probably a consequence of that the process rarely visits outside this interval and that the strong persistence makes the nonparametric density estimator more biased. This is confirmed by the performance reported in Fig. 2 where the semi-nonparametric drift estimator becomes more precise in the tails with increased mean reversion. In Figs. 3 and 4, the diffusion estimators are plotted. For both choices of parameter values, the estimator is very imprecise out in the right tail of the support. Moreover, a decrease in the volatility seemingly leads to a further deterioration of the performance. Interestingly, the shape of the mean of the semi-nonparametric diffusion estimator in Fig. 4 is very similar to the one reported in Aït-Sahalia (1996a).
D. Kristensen / Journal of Econometrics 164 (2011) 382–403
395
Fig. 2. Estimates of µ (x) for the CKLS (ii) model. Full line = true function, dotted line = mean of estimate, plusses = 95% confidence interval.
Fig. 3. Estimates of σ 2 (x) for the CKLS (i) model. Full line = true function, dotted line = mean of estimate, plusses = 95% confidence interval.
Next, we examine the behaviour of the AS model. We do this with the parameters fitted to the full sample. In Figs. 5 and 6 respectively, the drift and diffusion estimators are plotted. The parametric drift estimator is not very precise which owes to the fact that the drift parameters in the AS model are difficult to pin down, see also Kristensen (2010, Section 6). The seminonparametric drift estimator performs fairly well, and has more or less the same level of precision as the parametric one. The performance of the semi-nonparametric diffusion estimator is not quite so good though.
8. Concluding remarks Extensions of our results to multivariate diffusion models would be of interest. However, our identification scheme cannot readily be extended to general multivariate diffusion models, since the link between the invariant density, the drift and the diffusion term utilised here does not necessarily hold in higher dimensions. However, if one is willing to restrict attention to multivariate models which does satisfy this relation, the proposed estimation and testing procedures should still work. For example, one may
396
D. Kristensen / Journal of Econometrics 164 (2011) 382–403
Fig. 4. Estimates of σ 2 (x) for the CKLS(ii) model. Full line = true function, dotted line = mean of estimate, plusses = 95% confidence interval.
Fig. 5. Estimates of µ (x) for the AS model. Full line = true function, dotted line = mean of estimate, plusses = 95% confidence interval.
consider the class of d-dimensional diffusions with drift µ : Rd → Rd and diffusion σ 2 : Rd → Rd×d , where the following relationship holds between the drift and diffusion, d − ∂ 2 1 µi (x) = σij (x)π (x) . 2π (x) j=1 ∂ xj
(30)
This restriction is for example imposed by Chen et al. (2009) in their nonparametric study of multivariate diffusion models. Again,
π (x) can be estimated by kernel density methods which together with a parametric specification for σ 2 will lead to the same type of estimators considered here. As revealed in the power analysis in Sections 4.1 and 4.2, a more suitable class of tests for testing a fully nonparametric diffusion alternative against either the semiparametric or fully parametric nulls would be ones proposed in Eq. (24). The analysis of these would be a useful addition to the one conducted here.
D. Kristensen / Journal of Econometrics 164 (2011) 382–403
397
Fig. 6. Estimates of σ 2 (x) for the AS model. Full line = true function, dotted line = mean of estimate, plusses = 95% confidence interval.
π (1) (x) σ 2 x; ; θ0,1 √ 3 nh πˆ (x) − π (x) 2 2π (x) √ 2 + nh3 O |πˆ (1) (x) − π (1) (x) |2 + πˆ (x) − π (x) .
Acknowledgements
−
This paper is a revised version of a chapter of my Ph.D. thesis at the LSE. I wish to thank my supervisor, Oliver Linton, for helpful advice and encouragement, and Jianqing Fan for stimulating discussions. Parts of this paper appeared in an earlier version titled ‘‘Estimation in Two Classes of Semiparametric Diffusion Models’’. The author wishes to thank participants at the conference on ‘‘Semiparametric Methods in Economics and Finance’’ at the LSE, and a seminar at National University of Singapore for comments and suggestions. Financial Support from the Danish Social Sciences Research Council (through CREATES) and the National Science Foundation (grant no. SES-0961596) is gratefully acknowledged. Parts of this research was conducted while the author visited Princeton University and University of Copenhagen whose hospitality is gratefully acknowledged.
Using standard methods for kernel estimators, see Robinson (1983), we obtain, as nh1+2i → ∞,
√
(31) where Vi (x) = π (x) K (i) (z ) dz , i = 0, 1, while the two remainder terms in A1 (x) are negligible. The weak convergence result in the first part of the theorem now follows from Slutsky’s Theorem. To show the second part of the theorem, write
2
+
for some θ¯1,i ∈ [θ0,1 , θˆ1 ], i = 0, 1. We expand A1 (x) in terms of π (i) (x) , i = 0, 1: nh3 A
x
∫
µ(y; θ0,2 )π (y) dy
1
−
1
πˆ (x) π (x) µ(Xi ; θˆ2 ) − µ(Xi ; θ0,2 ) I {Xi ≤ x} l
[ ] πˆ (1) (x) π (1) (x) µ ˆ (x) − µ (x) = σ x; θ0,1 − 2 πˆ (x) π ( x) 2 2 ∂σ x; ; θ0,1 1 ∂σ (x; θˆ1 ) + − 2 ∂x ∂x πˆ (1) (x) 2 + σ (x; θˆ1 ) − σ 2 x; ; θ0,1 2πˆ (x) =: A1 (x) + A2 (x) + A3 (x) . √ We have Ai (x) = OP 1/ n , i = 2, 3, since, by (A.4), ∂ i+1 σ 2 x; θ¯1,i ∂ i σ 2 (x; θˆ1 ) ∂ i σ 2 x; ; θ0,1 − = (θˆ − θ ) ∂ xi ∂ xi ∂ xi ∂θ ′ √ = OP 1/ n ,
√
2
Proof of Theorem 1. To show the first part of the theorem, write 1
σˆ 2 (x) − σ 2 (x) = 2
Appendix A. Proofs
d
nh1+2i πˆ (i) (x) − π (i) (x) − hm κm π (m+i) (x) → N (0, Vi (x)) ,
σ 2 x; ; θ0,1 √ 3 (1) nh [πˆ (x) − π (1) (x)] 1 ( x) = 2π (x)
+
n 1 −
2
πˆ (x) n
i=1
n 1−
2
πˆ (x) n
µ(Xi ; θ0,2 )I {Xi ≤ x} −
x
∫
µ(y; θ0,2 )π (y) dy l
i=1
=: B1 (x) + B2 (x) + B3 (x) , √ where B3 (x) = OP 1/ n by the CLT for mixing processes, c.f. Doukhan et al. (1994), and B2 (x) =
2
n 1 − ∂µ(Xi ; θ¯0,2 )
πˆ (x) n i=1 = OP n−1/2 ,
∂θ ′
I {Xi ≤ x} (θˆ2 − θ0,2 )
for some θ¯2 ∈ [θ0,2 , θˆ2 ]. Regarding B1 (x), first note that 1
πˆ (x)
−
1
π (x)
1 πˆ (x) − π (x) π 2 ( x) 2 πˆ (x) − π (x) + 3 , 4 λπˆ (x) + (1 − λ) π (x)
=−
398
D. Kristensen / Journal of Econometrics 164 (2011) 382–403
for some λ ∈ [0, 1]. Using standard results for kernel estimators, see Robinson (1983), the second term on the left hand side is OP (h2m ) + OP (1/ (nh)). The weak convergence result now follows from Eq. (31) and Slutsky’s Theorem. Proof of Lemma 2. Define µ ¯ SP,k (x; θk ) = τa (π(x))µSP,k (x; θk ) and 2 2 2 σ¯ SP ( x ; θ ) = τ (π ( x ))σ k a ,k SP,k (x; θk ) + σ (1 − τa (π(x))), and let p¯ SP,k (y|x; θk ) denote the transition density corresponding to these trimmed versions. In the following we suppress their dependence on the parameter when evaluated at the true value. We employ (Kristensen, 2010, Lemma 5) in conjunction with the uniform convergence results in Lemma 9 to obtain that pˆ SP,k (y|x) = p¯ SP,k (y|x) + ∇ p¯ (y|x)
√ 2 2 × µ ˆ SP,k − µ ¯ SP,k , σˆ SP ¯ SP ,k − σ ,k + oP 1/ n ,
under the conditions imposed on the bandwidth in (H.k), k = 1, 2, where ∇ p (y|x) dµ, dσ 2 is the pathwise derivative of p¯ SP,k (y|x)
w.r.t. the drift and diffusion function in the direction dµ, dσ 2 . It is the solution (at t = 1) to the following PDE,
∂∇ p¯ (y|x; t ) 2 ¯ (y|x; t ) + A dµ, dσ 2 = A µSP,k , σSP ,k ∇ p ∂t × p¯ SP,k (y|x; t ) , (32) with ∇ p¯ (y|x; 0) dµ, dσ 2 = 0. The solution at t = 1 can be represented as:
∇ p¯ (y|x) dµ, dσ 2 =
∂ p¯ SP,k (y|w; t ) ∂w 0 I ∫ 1∫ × p¯ SP,k (w|x; t ) dwdt + dσ 2 (w) ∫
1
∫
dµ (w)
0
I
∂ 2 p¯ SP,k (y|w; t ) × p¯ SP,k (w|x; t ) dw dt . (33) ∂w2 Using Kristensen (2010, Lemma 5), it follows that ∂ i p¯ SP,k (y|x) /∂ xi = ∂ i pSP,k (y|x)/∂ xi + O(aq ), i√= 0, 1, 2, where q > 0 is q i i given in Assumption (A.3). √ Thus, as na → 0, ∂ p¯ SP,k (y|x)/∂ x = ∂ i pSP,k (y|x)/∂ xi + o(1/ n), which in turn implies that
√ ∇ p¯ (y|x; 0) dµ, dσ 2 = ∇ p (y|x) dµ, dσ 2 + o 1/ n , where ∇ p (y|x) dµ, dσ 2 is the pathwise derivative of the untrimmed transition density, pSP,k (y|x). This pathwise derivative has the same representation as ∇ p¯ (y|x) dµ, dσ 2 given in Eq. (33), but with pSP,k (y|x; t ) replacing p¯ SP,k (y|x; t ) on the right hand side. We now analyse appearing in the representa the two integrals 2 tion of ∇ p (y|x) dµ, dσ 2 with dµ, dσ 2 = (µ ˆ SP,k −µSP,k , σˆ SP ,k −
) for the two classes of semiparametric estimators. First consider the estimators under HSP,1 : Proceeding as in Kristensen (2010, Proof of Theorem 2), under the conditions imposed on bandwidth and the trimming sequence, σ
∫
∂ 2 σSP,1 (z1 ) 2π0 (z1 ) ∂ z1 ] ∫ 1 ∂ pSP,1 (y|z1 ; t ) × pSP,1 (z1 |x; t )dt , ∂ z1 0 ∫ 1∫ 2 ∂σSP ,1 (w; θ ) D1,2 (z1 , z2 , y, x) = ψSP,1 (z1 |z2 ) ∂θ 0 I ∂ 2 pSP,1 (y|w; t ) × pSP,1 (w|x; t ) dw dt . ∂w2 1
D1,1 (z1 , y, x) = −
Next, consider the estimators under HSP,2 : Again, proceeding as Kristensen (2010, Proof of Theorem 2), we obtain under the conditions imposed on the bandwidth and the trimming sequence that 1
∫
∂ 2 pSP,2 (y|w; t ) 2 2 σˆ SP ,2 (w) − σSP,2 (w) ∂w2 I n √ 1− × pSP,2 (w|x; t ) dwdt = D2,1 (Xi , y, x) + oP 1/ n ,
∫
0
n i=1
while, by the mean-value theorem, 1
∫
∫
µ ˆ SP,2 (w) − µSP,2 (w)
∂ pSP,2 (y|w; t )
∂w n − √ 1 D2,2 (Xi , Xi−1 , y, x) + oP 1/ n , =
0
pSP,2 (w|x; t ) dw dt
I
n i =1
where
∂ 2 pSP,2 (y|w; t ) π (w) ∂w2 0 l 2 σSP z ( ,2 1 ) × pSP,2 (w|x; t ) dw dt − 2 2 π (z1 ) ∫ 1 2 ∂ pSP,2 (y|z1 ; t ) × pSP,2 (z1 |x; t ) dt , ∂ z12 0 D2,2 (z1 , z2 , y, x) = ψSP,2 (z1 |z2 ) ∫ 1∫ ∂µSP,2 (w; θ) ∂ pSP,2 (y|w; t ) × pSP,2 (w|x; t ) dw dt . ∂θ ∂w 0 I D2,1 (z1 , y, x) = 2µSP,2 (z1 )
1
∫
∫
z1
1
The claimed result now holds with Dk,i (y|x) = Dk,1 (Xi , y, x) + Dk,2 (Xi , Xi−1 , y, x) ,
k = 1, 2. (34)
2 SP,k
Proof of Theorem 3. First note that we can replace pˆ SP,k (y|x), k = 1√ , 2, by pSP,k (y|x) = p (y|x) in the following since it converges with n-rate, c.f. Lemma 2, and we now proceed to analyse
∂ pSP,1 (y|w; t ) µ ˆ SP,1 (w) − µ ¯ SP,1 (w) pSP,1 (w|x; t ) dt ∂w 0 I n √ 1− = D1,1 (Xi , y, x) + oP 1/ n ,
TSP :=
1
∫
n i =1
while, by the mean-value theorem,
∫
where
2 ∂ 2 pSP,1 (y|w; t ) 2 σˆ SP,1 (w) − σ¯ SP (w) ,1 ∂w2 0 I × pSP,1 (w|x; t ) dw dt n √ 1− = D1,2 (Xi , Xi−1 , y, x) + oP 1/ n , 1
∫
n i =1
∫ ∫ I
p(y|x) − pˆ NP (y|x)
2
w (y, x) dydx.
I
By a Taylor expansion in terms of f and π ,
∫ ∫ TSP = I
f (y|x)
π0 (x)
I
−
fˆNP (y, x)
πˆ NP (x)
2 w (y, x) dydx = I + ¯I + R,
where I :=
¯I :=
∫ ∫
f (y, x) − fˆNP (y, x)
∫I ∫I I
I
2
m (y, x) dydx,
2 ¯ (x) dx, π (x) − πˆ NP (x) m
D. Kristensen / Journal of Econometrics 164 (2011) 382–403
¯ (x) := with m(y, x) := w(y, x)/π 2 (x) and m π 4 (x), and R = OP
I
f (y, x)w(y, x)dy/
4 4 ˆ sup |f (y, x) − fNP (y, x)| + OP sup |π (x) − πˆ NP (x)| .
x,y∈I
x∈I
+2 Under the requirement that λ2 = limn→∞ nh2m < ∞, it follows NP from (Gouriéroux and Tenreiro, 2001, Theorem 4.1) that
nhNP {I − µ − B} = vSP Un +
+ oP 1/2 nhNP
√
√
+1 nhm NP σv Vn
+1 nhm + oP (1) , NP
(35)
√ m+1/2 ¯I − µ ¯ − B¯ = v¯ SP U¯ n + nhNP σ¯ v V¯ n √ m+1/2 nhNP + oP (1) + oP
(36)
µ :=
[∫
]2
∫
× f (y, x)m1 (y, x) dydx, I ×I ∫ 1 µ ¯ := K 2 (z ) dz × π (x)m2 (x) dx, nhNP R I ∫ [∫ u1 − y u1 − y 1 K K B := 2 hNP hNP I ×I I ×I hNP 2 nh2NP
K (z ) dz 2
B¯ :=
∫ [∫ I
v =2
1 hNP
I
[∫
u−x
K
hNP
m (y, x) dydx,
π (u)du − π (x) ]2
∫
]2
(K ∗ K ) (z ) dz 2
2
B = R2
I ×I
−
=
i+j
hNP
i,j≤m
I ×I
hm NP
∇ pn (y|x) [dµn , dσn ] =
∫
dµn (w) p¯ µ,k (y, x, w) dw I
I
m (y, x) dydx
∂ i+j f (y, x) ∂ xi ∂ y j
∫
K (z1 ) K (z2 ) z1i z2 dz1 dz2
∫
dµn (x) =
[
∂ π (x) B¯ = × ∂ xm I ×I 2m ¯ =: h2m NP b + o hNP . ∫
[
m
]2
¯ (x)dx × m
(37)
c 2 2 2 x; θ1∗ = Consider first HSP ,1 : In this case, dσn (x) = σn (x)−σ an gn (x) while
m (y, x) dydx
1
∫
0
j
R2
dσn2 (w) p¯ σ 2 ,k (y, x, w) dw,
∂ pSP,k (y|w, t ) pSP,k (w|x, t ) dt , ∂w ∫ 1 2 ∂ pSP,k (y|w, t ) pSP,k (w|x, t ) dt . p¯ σ 2 ,k (y, x, w) := ∂w2 0 p¯ µ,k (y, x, w) :=
=
[∫
1
∂
2π (x) ∂ x an ∂ 2π (x) ∂ x
2 σn2 (x) − σSP ,1 (x) π (x)
[gn (x) π (x)] .
Plugging these two expressions into ∇ pn (y|x) [dµn , dσn ] above we (n) c obtain the expression of γSP,1 (y|x) in Eq. (21). Next, consider HSP ,2 . Here, dµn (x) = µn (x) − µSP,2 (x) = an gn (x), and
and similarly h2m NP
I
Next, by the same arguments as in the Proof of Lemma 2,
where p¯ µ,k (y, x, w) and p¯ σ 2 ,k (y, x, w) are defined as (k = 1, 2)
]2 ∂ 2m f (y, x) m (y, x) dydx ∂ xm ∂ y m I ×I [∫ ]4 × K (z ) z m dz + o h2m NP R 2m =: h2m NP b + o hNP = h2m NP ×
nhNP
+
2 + o
= nhNP {I − µ − B} + nhNP ¯I − µ ¯ − B¯ + nhNP B + B¯ + nhNP R = vSP Un + hNP v¯ SP U¯ n + λn σv Vn + σ¯ v V¯ n log (n)2 2m+1 ¯ + nhNP b + b + OP 3
∫ 2 |dµn (w)|2 + dσn2 (w) γ¯ (w|x) dw. Bn (x) =
2
∫
nhNP {TSP − µ − µ} ¯
K (z1 ) K (z2 ) [f (y + z1 hNP , x + z2 hNP )
.
In total,
− f (y, x)] dz1 dz2
nh3NP
∫
[∫
log (n)2
where ∇ pn (y|x) [dµn , dσn ] is the pathwise derivative w.r.t. the drift and diffusion term, c.f. Proof of Lemma 2, and
and (Un , Vn ) and U¯ n , V¯ n both converge towards a bivariate standard Normal distribution. Due to the smoothness conditions imposed on p (y|x) and π (x) and K being an mth order kernel,
∫
pn (y|x) − pSP,k (y|x) − ∇ pn (y|x) [dµn , dσn ] ≤ CBn (x) ,
2
I
Lemma 5),
R
R
2 σSP ,k (x) are the drift and diffusion term. By Kristensen (2010,
¯ (x) dx, m
× f (y, x)m (y, x) dydx, I ×I ∫ ∫ 2 ¯ 2 (x) dx, v¯ SP = 2 (K ∗ K )2 (z ) dz × π 2 (x)m 2 SP
4m+1 nhNP R = OP nhNP + OP
c Proof of Theorem 4. Consider the contiguous alternative HSP ,k (k = 1, 2): Since the pseudo-true parameter values are drifting, note that the restricted drift and diffusion terms implied by the null as given in Eqs. (13) and (14) respectively are also drifting. We 2 2 therefore have that σSP ,k (x) = σSP,k,n (x) and µSP,k (x) = µSP,k.n (x) are sequences, and so in turn is the associated transition density, pSP,k (y|x) = pSP,k,n (y|x). We suppress their dependence on n in the following for notational ease. Define the deviations from the null hypothesis, dµn (x) = µn (x) 2 − µSP,k (x) and dσn2 (w) = σn2 (x) − σSP ,k (x), where µSP,k (x) and
R
∫
× f (u1 , u2 )du1 du2 − f (y, x)
Finally, applying (Kristensen, 2009, Theorem 1) together with standard arguments for the bias components of the kernel density estimators,
where λ2n is defined in the theorem; this proves the first part of the theorem. The second part is a direct consequence of this representation.
where 1
399
K (z )z dz m
]2 +o
h2m NP
dσn2 (x) =
R
=
2
π0 (x) 2an
π0 (x)
x
∫
dµn (y) π (y) dy l x
∫
gn (y) π (y) dy. l
400
D. Kristensen / Journal of Econometrics 164 (2011) 382–403
(n)
∫ ∫
Substituting those into γSP,2 (y|x), we obtain the expression in Eq. (22) with p¯ µ,2 (y, x, w) and p¯ σ 2 ,2 (y, x, w) defined above. We now proceed as in the Proof of Theorem 3:
∫ ∫
pSP,k (y|x) − pˆ NP (y|x)
TSP,k = I
2
I
w (y, x) dydx
n
I
I2 := I
=
∂ p(y|x; θ ) ψP ,i − Dk,i (y|x) . ∂θ ′
Let C ⊆ I × I denote the (compact) support of w (y, x). We then wish to show that Zn,k (x, y) weakly converges on C towards the stochastic process Zk (x, y) defined in the theorem. By Lemma 2, Assumption (A.5) and the CLT for stationary and mixing sequences (Doukhan et al., 1994), Zn,k (x, y) →d Zk (x, y) for any given (x, y) ∈ C . Appealing to standard arguments from empirical process theory, see e.g. Doukhan et al. (1995), it follows by Lemma 2 and Assumption (A.6) that Zn,k (x, y) is stochastically equicontinuous. The result now follows by the Continuous Mapping Theorem.
∫ ∫
2
fSP,k (y, x) − fˆNP (y, x)
∫I ∫I
fn (y|x) − fˆNP (y, x) −
I
mn,1 (y, x) dydx
(n) γSP ,k
2 (y|x) π (x)
× mn,1 (y, x) dydx ∫ ∫ 2 = fn (y, x) − fˆNP (y, x) mn,1 (y, x) dydx I I ∫ ∫ (n) 2 2 + γSP ,k (y|x) π (x) mn,1 (y, x) dydx I I ∫ ∫ (n) +2 fˆNP (y, x) − fn (y, x) γSP,k I
∫ ∫
nhNP TSP,k − mSP
∫I ∫I = I
Proof of Theorem 5. Under Assumption (A.6), the parametric estimator satisfies
\[
\hat p_P(y|x) - p(y|x) = \frac{\partial p_P(y|x;\theta)}{\partial\theta'}\big(\tilde\theta - \theta\big) + O_P\big(\|\tilde\theta - \theta\|^2\big)
= \frac{1}{n}\sum_{i=1}^n \frac{\partial p_P(y|x;\theta)}{\partial\theta'}\psi_{P,i} + o_P\big(1/\sqrt n\big),
\]
where $p_P(y|x;\theta) = p(y|x)$ under the null, while Lemma 2 supplies us with an expansion of $\hat p_{SP,k}(y|x)$. Substituting these two expansions into $T_{P,k}$ yields:
\[
T_{P,k} = \int_I\!\!\int_I \big[\hat p_P(y|x) - \hat p_{SP,k}(y|x)\big]^2 w(y,x)\,dy\,dx
= \int_I\!\!\int_I \big[\{\hat p_P(y|x) - p(y|x)\} - \{\hat p_{SP,k}(y|x) - p(y|x)\}\big]^2 w(y,x)\,dy\,dx
\]
\[
= \frac{1}{n}\int_I\!\!\int_I Z_{n,k}^2(x,y)\, w(y,x)\,dy\,dx + o_P\Big(\frac{1}{n}\Big),
\]
where $Z_{n,k}(x,y)$ is an empirical process,
\[
Z_{n,k}(x,y) := \frac{1}{\sqrt n}\sum_{i=1}^n \Big[\frac{\partial p(y|x;\theta)}{\partial\theta'}\psi_{P,i} - D_{k,i}(y|x)\Big]. \qquad (38)
\]
Let $C \subseteq I \times I$ denote the (compact) support of $w(y,x)$. We then wish to show that $Z_{n,k}(x,y)$ weakly converges on $C$ towards the stochastic process $Z_k(x,y)$ defined in the theorem. By Lemma 2, Assumption (A.5) and the CLT for stationary and mixing sequences (Doukhan et al., 1994), $Z_{n,k}(x,y) \to_d Z_k(x,y)$ for any given $(x,y) \in C$. Appealing to standard arguments from empirical process theory (see e.g. Doukhan et al., 1995), it follows by Lemma 2 and Assumption (A.6) that $Z_{n,k}(x,y)$ is stochastically equicontinuous. The result now follows by the Continuous Mapping Theorem.
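The limit mechanism in this proof, a pointwise CLT for the empirical process combined with stochastic equicontinuity and the Continuous Mapping Theorem, is the same one that drives classical integrated squared empirical-process statistics. As a self-contained toy illustration (unrelated to the paper's estimators), the Cramér-von Mises statistic $n\int_0^1 (F_n(t) - t)^2\,dt$ for uniform data converges to $\int_0^1 Z(t)^2\,dt$ for a Brownian bridge $Z$, whose mean is $1/6$:

```python
import numpy as np

def cramer_von_mises(u):
    """n * int_0^1 (F_n(t) - t)^2 dt for a sample u from U(0,1),
    via the exact closed form 1/(12n) + sum_i ((2i-1)/(2n) - u_(i))^2."""
    n = len(u)
    u = np.sort(u)
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + np.sum(((2 * i - 1) / (2 * n) - u) ** 2)

rng = np.random.default_rng(1)
stats = [cramer_von_mises(rng.uniform(size=200)) for _ in range(5000)]
print(np.mean(stats))   # close to 1/6, the mean of the limit law
```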
Proof of Theorem 6. The representation of the sequence of transition densities given in Eq. (27) follows by the same arguments as in the Proof of Theorem 4. The power analysis now follows along the lines of integral-based tests (see e.g. Escanciano, 2009):
\[
T_{P,k} = \int_I\!\!\int_I \big[\hat p_P(y|x) - \hat p_{SP,k}(y|x)\big]^2 w(y,x)\,dy\,dx
\]
\[
= \int_I\!\!\int_I \big[\{\hat p_P(y|x) - p_P(y|x)\} - \{\hat p_{SP,k}(y|x) - p_n(y|x)\} + \{p_n(y|x) - p_P(y|x)\}\big]^2 w(y,x)\,dy\,dx
\]
\[
= \int_I\!\!\int_I \big[\{\hat p_P(y|x) - p_P(y|x)\} - \{\hat p_{SP,k}(y|x) - p_n(y|x)\}\big]^2 w(y,x)\,dy\,dx
+ \int_I\!\!\int_I \big[p_n(y|x) - p_P(y|x)\big]^2 w(y,x)\,dy\,dx
\]
\[
\quad + 2\int_I\!\!\int_I \big[\{\hat p_P(y|x) - p_P(y|x)\} - \{\hat p_{SP,k}(y|x) - p_n(y|x)\}\big]\big[p_n(y|x) - p_P(y|x)\big]\, w(y,x)\,dy\,dx
=: I_1 + I_2 + I_3.
\]
The first term, $I_1$, can be analysed analogously to the Proof of Theorem 5, while, by Eq. (27),
\[
I_2 = \int_I\!\!\int_I \gamma_{P,k}^{(n)}(y|x)^2\, w(y,x)\,dy\,dx + O\big(R_P^2\big).
\]
Finally, with $Z_{n,k}(x,y)$ defined in Eq. (38), $I_3$ can be written as
\[
I_3 = \frac{2}{\sqrt n}\int_I\!\!\int_I Z_{n,k}(x,y)\,\gamma_{P,k}^{(n)}(y|x)\, w(y,x)\,dy\,dx + O_P\big(R_P/\sqrt n\big).
\]
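Under these drifting alternatives, the deterministic power shift picked up by the statistic is the middle term $I_2 \approx \int\!\!\int \gamma_{P,k}^{(n)}(y|x)^2\, w(y,x)\,dy\,dx$. The quadrature sketch below evaluates such a noncentrality integral, with $\gamma$ and $w$ replaced by arbitrary made-up stand-ins; the paper's $\gamma_{P,k}^{(n)}$ depends on the model and the drifting parameters and is not reproduced here.

```python
import numpy as np

# Made-up stand-ins for the deviation gamma(y|x) and the weight w(y, x).
gamma = lambda y, x: (y - 0.5 * x) * np.exp(-0.25 * (y ** 2 + x ** 2))
w = lambda y, x: np.exp(-0.5 * (y ** 2 + x ** 2))

# Tensor-product Riemann quadrature of the noncentrality integral.
z = np.linspace(-6.0, 6.0, 1201)
dz = z[1] - z[0]
Y, X = np.meshgrid(z, z, indexing="ij")
shift = np.sum(gamma(Y, X) ** 2 * w(Y, X)) * dz ** 2
print(shift)   # the local-power shift contributed by I_2
```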
Proof of Theorem 7. First consider $\bar T_{P,1}$: since the drift estimator under the null and the diffusion estimator under the alternative both converge with parametric rate, we may replace them with the true, unknown ones and redefine our estimators as:
\[
\tilde\mu(x) = \frac{1}{2}\frac{\partial\sigma^2(x)}{\partial x} + \frac{1}{2}\sigma^2(x)\,\frac{\pi^{(1)}(x)}{\pi(x)}, \qquad
\hat\mu(x) = \frac{1}{2}\frac{\partial\sigma^2(x)}{\partial x} + \frac{1}{2}\sigma^2(x)\,\frac{\hat\pi^{(1)}(x)}{\hat\pi(x)}.
\]
Next, by a Taylor expansion w.r.t. $\pi(x)$ and $\pi'(x)$,
\[
\bar T_{P,1} = \int_I \big[\hat\pi^{(1)}(x) - \pi^{(1)}(x)\big]^2 \omega_1(x)\,dx
+ \int_I \big[\hat\pi(x) - \pi(x)\big]^2 \omega_2(x)\,dx + R =: I_1 + I_2 + R,
\]
where $\omega_1(x) := \sigma^4(x)\bar w(x)/\big(4\pi^2(x)\big)$, $\omega_2(x) := \sigma^4(x)\bar w(x)\pi^{(1)}(x)^2/\big(4\pi^4(x)\big)$, and
\[
R = O_P\Big(\sup_{x\in I}\big|\hat\pi'(x) - \pi'(x)\big|^4\Big) + O_P\Big(\sup_{x\in I}\big|\hat\pi(x) - \pi(x)\big|^4\Big).
\]
First, consider $I_1$: with $\bar\pi^{(1)}(x) = E\big[\hat\pi^{(1)}(x)\big]$, write
\[
I_1 = \int_I \big[\bar\pi^{(1)}(x) - \pi^{(1)}(x)\big]^2\omega_1(x)\,dx
+ 2\int_I \big[\hat\pi^{(1)}(x) - \bar\pi^{(1)}(x)\big]\big[\bar\pi^{(1)}(x) - \pi^{(1)}(x)\big]\omega_1(x)\,dx
+ \int_I \big[\hat\pi^{(1)}(x) - \bar\pi^{(1)}(x)\big]^2\omega_1(x)\,dx
=: I_{11} + I_{12} + I_{13}.
\]
Using that, uniformly over $x \in I$,
\[
\bar\pi^{(1)}(x) = \pi^{(1)}(x) + h^m\pi^{(m+1)}(x) + o(h^m), \qquad (39)
\]
the first term can be written as
\[
I_{11} = h^{2m}\int_I \pi^{(m+1)}(x)^2\,\omega_1(x)\,dx + o\big(h^{2m}\big).
\]
The second term, again using Eq. (39), satisfies
\[
I_{12} = \frac{2h^m}{nh^2}\sum_{i=1}^n \int_I \Big\{K^{(1)}\Big(\frac{x - X_i}{h}\Big) - E\Big[K^{(1)}\Big(\frac{x - X_0}{h}\Big)\Big]\Big\}\big[\pi^{(m+1)}(x) + o(1)\big]\,\omega_1(x)\,dx
= \frac{2h^m}{n}\sum_{i=1}^n G_n(X_i) + o_P\Big(\frac{h^m}{\sqrt n}\Big),
\]
where $G_n(u)$ is given by
\[
G_n(u) = \frac{1}{h^2}\int_I \Big\{K^{(1)}\Big(\frac{x - u}{h}\Big) - E\Big[K^{(1)}\Big(\frac{x - X_0}{h}\Big)\Big]\Big\}\,\pi^{(m+1)}(x)\,\omega_1(x)\,dx.
\]
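The expansions of $I_{12}$ and $I_{13}$ are driven by the kernel estimator of the density derivative, $\hat\pi^{(1)}(x) = (nh^2)^{-1}\sum_{i=1}^n K^{(1)}\big((x - X_i)/h\big)$, which is implicit in the displays above. A minimal sketch of this estimator for a Gaussian kernel; the sample, evaluation point and bandwidth are illustrative choices only:

```python
import numpy as np

def gaussian_kernel_deriv(z):
    """K^(1)(z) = -z * phi(z) for the Gaussian kernel K = phi."""
    return -z * np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def density_deriv(x, data, h):
    """Kernel estimator of pi^(1)(x): (1/(n h^2)) sum_i K^(1)((x - X_i)/h)."""
    return np.mean(gaussian_kernel_deriv((x - data) / h)) / h ** 2

rng = np.random.default_rng(2)
data = rng.standard_normal(5000)
# For N(0,1) data the true derivative is pi^(1)(x) = -x * phi(x),
# about -0.242 at x = 1.
print(density_deriv(1.0, data, h=0.3))
```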
The third term can be rewritten as
\[
I_{13} = \int_I \Big[\frac{1}{nh^2}\sum_{i=1}^n \Big\{K^{(1)}\Big(\frac{x - X_i}{h}\Big) - E\Big[K^{(1)}\Big(\frac{x - X_0}{h}\Big)\Big]\Big\}\Big]^2 \omega_1(x)\,dx
= \frac{1}{n^2 h^{5/2}}\sum_{i,j=1}^n H_n\big(X_i, X_j\big),
\]
where
\[
H_n(u,v) := \frac{1}{h^{3/2}}\int_I \Big\{K^{(1)}\Big(\frac{x - u}{h}\Big) - E\Big[K^{(1)}\Big(\frac{x - X_0}{h}\Big)\Big]\Big\}
\Big\{K^{(1)}\Big(\frac{x - v}{h}\Big) - E\Big[K^{(1)}\Big(\frac{x - X_0}{h}\Big)\Big]\Big\}\,\omega_1(x)\,dx.
\]
Splitting the double sum into its diagonal and off-diagonal parts,
\[
I_{13} = \frac{E[H_n(X_0,X_0)]}{nh^{5/2}} + \frac{1}{nh^{5/2}}H_n + \frac{1}{n^2h^{5/2}}\sum_{i\ne j} E\big[H_n(X_i,X_j)\big] + o_P\Big(\frac{1}{nh^{5/2}}\Big),
\]
where
\[
H_n := \frac{2}{n}\sum_{i<j}\big\{H_n(X_i,X_j) - E[H_n(X_i,X_j)]\big\}
\quad\text{and}\quad
G_n := \frac{2}{\sqrt n}\sum_{i=1}^n G_n(X_i).
\]
The mean component satisfies
\[
E[H_n(X_0,X_0)] = \frac{1}{h^{3/2}}\int_I E\Big[\Big\{K^{(1)}\Big(\frac{x - X_0}{h}\Big) - E\Big[K^{(1)}\Big(\frac{x - X_0}{h}\Big)\Big]\Big\}^2\Big]\,\omega_1(x)\,dx
\]
\[
= \frac{1}{h^{1/2}}\int_I\!\!\int_{\mathbb R} K^{(1)}(u)^2\,\omega_1(x + uh)\,\pi(x + uh)\,du\,dx + O\big(h^{1/2}\big)
= \frac{1}{h^{1/2}}\int_{\mathbb R} K^{(1)}(z)^2\,dz \times \int_I \omega_1(x)\,\pi(x)\,dx + O\big(h^{1/2}\big),
\]
such that $E[H_n(X_0,X_0)]/\big(nh^{5/2}\big) = \mu_1 + o\big(1/(nh^{5/2})\big)$, where
\[
\mu_1 = \frac{1}{nh^3}\int_{\mathbb R} K^{(1)}(z)^2\,dz \times \int_I \omega_1(x)\,\pi(x)\,dx
= \frac{1}{4nh^3}\int_{\mathbb R} K^{(1)}(z)^2\,dz \times \int_I \frac{\sigma^4(x)\,\bar w(x)}{\pi(x)}\,dx.
\]
Combining these expressions and following the arguments in Gouriéroux and Tenreiro (2001, pp. 182-184) (see also Huang, 1997), we then obtain
\[
I_1 = \mu_1 + \frac{1}{nh^{5/2}}H_n + \frac{h^m}{\sqrt n}G_n + h^{2m}\int_I \pi^{(m+1)}(x)^2\,\omega_1(x)\,dx
+ o_P\Big(\frac{1}{nh^{5/2}}\Big) + o_P\Big(\frac{h^m}{\sqrt n}\Big) + o_P\big(h^{2m}\big).
\]
We can now appeal to the arguments of Gouriéroux and Tenreiro (2001, Proof of Theorem 3.2): given that $(H_n, G_n)$ converges towards a bivariate normal distribution with covariance zero and marginal variances $v_{P,1}^2$ and $\sigma_{P,1}^2$ (see Gouriéroux and Tenreiro, 2001, Theorem 3.1, for their expressions), it follows that
\[
nh^{5/2}\{I_1 - \mu_1\} = v_{P,1} U_n + \sqrt n\, h^{m+5/2}\,\sigma_{P,1} V_n + O_P\big(nh^{2m+5/2}\big) + o_P\big(\sqrt n\, h^{m+5/2}\big).
\]
Here, one can verify that
\[
v_{P,1}^2 = 2\int_{\mathbb R}\big(K^{(1)} * K^{(1)}\big)^2(z)\,dz \times \int_I \pi^2(x)\,\omega_1^2(x)\,dx
= \frac{1}{8}\int_{\mathbb R}\big(K^{(1)} * K^{(1)}\big)^2(z)\,dz \times \int_I \frac{\sigma^8(x)\,\bar w^2(x)}{\pi^2(x)}\,dx.
\]
Next, by a direct application of Gouriéroux and Tenreiro (2001, Theorem 3.2),
\[
nh^{1/2}\{I_2 - \mu_2\} = \bar v_{P,1}\bar U_n + \sqrt n\, h^{m+1/2}\,\bar\sigma_{P,1}\bar V_n + O_P\big(nh^{2m+1/2}\big) + o_P\big(\sqrt n\, h^{m+1/2}\big), \qquad (40)
\]
where $(\bar U_n, \bar V_n)$ converges in distribution towards a bivariate standard normal distribution, and
\[
\mu_2 = \frac{1}{nh}\int_{\mathbb R} K^2(z)\,dz \times \int_I \pi(x)\,\omega_2(x)\,dx
= \frac{1}{4nh}\int_{\mathbb R} K^2(z)\,dz \times \int_I \frac{\sigma^4(x)\,\bar w(x)\,\pi^{(1)}(x)^2}{\pi^3(x)}\,dx,
\]
\[
\bar v_{P,1}^2 = 2\int_{\mathbb R}(K * K)^2(z)\,dz \times \int_I \pi^2(x)\,\omega_2^2(x)\,dx
= \frac{1}{8}\int_{\mathbb R}(K * K)^2(z)\,dz \times \int_I \frac{\sigma^8(x)\,\bar w^2(x)\,\pi^{(1)}(x)^4}{\pi^6(x)}\,dx.
\]
Finally, by Kristensen (2009, Theorem 1) and standard kernel bias evaluations,
\[
nh^{5/2} R = O_P\big(nh^{4m+5/2}\big) + O_P\Big(\frac{\log(n)^2}{nh^{1/2}}\Big).
\]
In total,
\[
nh^{5/2}\big\{\bar T_{P,1} - \mu_1 - \mu_2\big\} = nh^{5/2}\{I_1 - \mu_1\} + nh^{5/2}\{I_2 - \mu_2\} + nh^{5/2} R
\]
\[
= v_{P,1}U_n + h^2\bar v_{P,1}\bar U_n + \sqrt n\, h^{m+5/2}\big(\sigma_{P,1}V_n + \bar\sigma_{P,1}\bar V_n\big)
+ o_P\big(\sqrt n\, h^{m+5/2}\big) + O_P\big(nh^{4m+5/2}\big) + O_P\Big(\frac{\log(n)^2}{nh^{1/2}}\Big).
\]
The first part of the theorem now follows under the conditions imposed on the bandwidth.
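The centering and variance constants above involve only model-free kernel functionals: $\int_{\mathbb R} K^{(1)}(z)^2\,dz$ enters $\mu_1$, $\int_{\mathbb R} K^2(z)\,dz$ enters $\mu_2$, and the convolution functionals $\int_{\mathbb R}(K^{(1)}*K^{(1)})^2(z)\,dz$ and $\int_{\mathbb R}(K*K)^2(z)\,dz$ enter the variances. These are easy to evaluate numerically; a brute-force sketch for the Gaussian kernel, with arbitrary grid and truncation:

```python
import numpy as np

z = np.linspace(-12.0, 12.0, 6001)
dz = z[1] - z[0]
k = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)     # K = phi
k1 = -z * k                                          # K^(1)

int_k2 = np.sum(k ** 2) * dz          # int K(z)^2 dz   = 1/(2 sqrt(pi))
int_k1sq = np.sum(k1 ** 2) * dz       # int K^(1)(z)^2 dz = 1/(4 sqrt(pi))
conv = np.convolve(k1, k1, mode="same") * dz         # (K^(1) * K^(1))(z)
int_conv_sq = np.sum(conv ** 2) * dz  # int (K^(1) * K^(1))^2(z) dz
print(int_k2, int_k1sq, int_conv_sq)  # ~0.2821, ~0.1410, ...
```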
Next, consider $\bar T_{P,2}$: by the same arguments as before, we may set
\[
\tilde\sigma^2(x) = \frac{2}{\pi(x)}\int_l^x \mu(y)\,\pi(y)\,dy, \qquad
\hat\sigma^2(x) = \frac{2}{\hat\pi(x)}\int_l^x \mu(y)\,\pi(y)\,dy
\]
in the following. Thus, using that $2\int_l^x \mu(y)\pi(y)\,dy = \pi(x)\sigma^2(x)$ together with the mean-value theorem,
\[
\bar T_{P,2} = \int_I \big[\tilde\sigma^2(x) - \hat\sigma^2(x)\big]^2\,\bar w(x)\,dx
= 4\int_I \Big[\frac{1}{\pi(x)} - \frac{1}{\hat\pi(x)}\Big]^2\Big[\int_l^x \mu(y)\,\pi(y)\,dy\Big]^2\,\bar w(x)\,dx
\]
\[
= \int_I \big[\pi(x) - \hat\pi(x)\big]^2\,\frac{\pi^2(x)}{\hat\pi^2(x)}\,\frac{\sigma^4(x)\,\bar w(x)}{\pi^2(x)}\,dx
= \int_I \big[\pi(x) - \hat\pi(x)\big]^2\,\frac{\sigma^4(x)\,\bar w(x)}{\pi^2(x)}\,dx + R_n,
\]
where $R_n = O_P\big(\sup_{x\in I}|\pi(x) - \hat\pi(x)|^4\big)$ collects the higher-order terms of the expansion, and
\[
\sup_{x\in I}\big|\pi(x) - \hat\pi(x)\big|^4 = O_P\big(h^{4m}\big) + O_P\Big(\frac{\log(n)^2}{n^2h^2}\Big).
\]
From Gouriéroux and Tenreiro (2001, Theorem 3.2) and the usual bias expressions, we now obtain
\[
nh^{1/2}\big\{\bar T_{P,2} - \bar m_{P,2}\big\} = \bar v_{P,2} U_n + \sqrt n\, h^{m+1/2}\,\sigma_{P,2} V_n + O_P\big(nh^{2m+1/2}\big) + nh^{1/2} R_n,
\]
where
\[
\bar m_{P,2} = \frac{1}{nh}\int_{\mathbb R} K^2(z)\,dz \times \int_I \frac{\sigma^4(x)\,\bar w(x)}{\pi^2(x)}\,\pi(x)\,dx
= \frac{1}{nh}\int_{\mathbb R} K^2(z)\,dz \times \int_I \frac{\sigma^4(x)\,\bar w(x)}{\pi(x)}\,dx,
\]
\[
\bar v_{P,2}^2 = 2\int_{\mathbb R}(K * K)^2(z)\,dz \times \int_I \pi^2(x)\Big[\frac{\sigma^4(x)\,\bar w(x)}{\pi^2(x)}\Big]^2 dx
= 2\int_{\mathbb R}(K * K)^2(z)\,dz \times \int_I \frac{\sigma^8(x)\,\bar w^2(x)}{\pi^2(x)}\,dx.
\]
This yields the second part of the theorem.
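The identity $\pi(x)\sigma^2(x) = 2\int_l^x \mu(y)\pi(y)\,dy$ underlying $\tilde\sigma^2$ and $\hat\sigma^2$ can be checked numerically on any stationary scalar diffusion. A sketch for the Ornstein-Uhlenbeck process $dX_t = -\theta X_t\,dt + \sigma\,dW_t$, whose stationary law is $N(0, \sigma^2/(2\theta))$; the parameter values, boundary proxy and grid are illustrative, and the recovered diffusion coefficient should be the constant $\sigma^2$:

```python
import numpy as np

theta, sigma2 = 1.0, 0.5
var = sigma2 / (2.0 * theta)                # stationary variance

def pi(x):                                  # stationary density N(0, var)
    return np.exp(-0.5 * x ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mu(x):                                  # OU drift
    return -theta * x

l, x0 = -10.0, 1.0                          # lower-boundary proxy, eval point
grid = np.linspace(l, x0, 200001)
dx = grid[1] - grid[0]
integral = np.sum(mu(grid) * pi(grid)) * dx
print(2.0 * integral / pi(x0))              # ~ sigma2 = 0.5
```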
Proof of Theorem 8. Consider first $\bar T_{P,1}$, where the drift under the null, $\mu_P(x)$ say, can be written as $\mu_P(x) = \mu_n(x) + g_n(x)$. Then,
\[
\bar T_{P,1} = \int_I \big[\hat\mu(x) - \mu_n(x)\big]^2\,\bar w(x)\,dx + \int_I g_n^2(x)\,\bar w(x)\,dx
+ 2\int_I \big[\hat\mu(x) - \mu_n(x)\big]\,g_n(x)\,\bar w(x)\,dx.
\]
The first term is treated as in the Proof of Theorem 7, while the third term is a higher-order term which can be ignored. Regarding $\bar T_{P,2}$, the diffusion term under the null can be written as $\sigma_P^2(x) = \sigma_n^2(x) + g_n(x)$ such that
\[
\bar T_{P,2} = \int_I \big[\hat\sigma^2(x) - \sigma_n^2(x)\big]^2\,\bar w(x)\,dx + \int_I g_n^2(x)\,\bar w(x)\,dx
+ 2\int_I \big[\hat\sigma^2(x) - \sigma_n^2(x)\big]\,g_n(x)\,\bar w(x)\,dx,
\]
and we proceed as with $\bar T_{P,1}$.

Appendix B. Auxiliary lemmas

Lemma 9. Assume that (A.1)-(A.4) and (B.1)-(B.2) hold. Then:
\[
\sup_x \big|\hat\mu_{SP,1}(x) - \tau_a(\pi(x))\,\mu_{SP,1}(x)\big|
= \sum_{k=0}^{1}\Big\{O_P\big(n^{-1/2}\log(n)\,a^{-2+k}\,h^{-(1+2k)/2}\big) + O_P\big(a^{-2+k}\,h^m\big)\Big\},
\]
\[
\sup_x \big|\hat\sigma_{SP,2}^2(x) - \tau_a(\pi(x))\,\sigma_{SP,2}^2(x)\big|
= O_P\big(n^{-1/2}\log(n)\,a^{-2}\,h^{-1/2}\big) + O_P\big(a^{-2}\,h^m\big).
\]

Proof. This follows along the same lines as Kristensen (2010, Proofs of Lemmas 9 and 10).
References

Ai, C., 1997. A semiparametric maximum likelihood estimator. Econometrica 65, 933-963.
Aït-Sahalia, Y., 1996a. Nonparametric pricing of interest rate derivative securities. Econometrica 64, 527-560.
Aït-Sahalia, Y., 1996b. Testing continuous-time models of the spot interest rate. Review of Financial Studies 9, 385-426.
Aït-Sahalia, Y., 2002. Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. Econometrica 70, 223-262.
Aït-Sahalia, Y., Fan, J., Peng, H., 2009. Nonparametric transition-based tests for jump diffusions. Journal of the American Statistical Association 104, 1102-1116.
Andrews, D.W.K., 1994. Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica 62, 43-72.
Andrews, D.W.K., 1995. Nonparametric kernel estimation for semiparametric models. Econometric Theory 11, 560-596.
Andrews, D.W.K., 2005. Higher-order improvements of the parametric bootstrap for Markov processes. In: Andrews, D.W.K., Stock, J.H. (Eds.), Identification and Inference for Econometric Models: A Festschrift in Honor of T.J. Rothenberg. Cambridge University Press, Cambridge.
Bandi, F.M., Phillips, P.C.B., 2003. Fully nonparametric estimation of scalar diffusion models. Econometrica 71, 241-283.
Bandi, F.M., Phillips, P.C.B., 2005. A simple approach to the parametric estimation of potentially nonstationary diffusions. Journal of Econometrics 137, 354-395.
Bhardwaj, G., Corradi, V., Swanson, N.R., 2008. A simulation-based specification test for diffusion processes. Journal of Business & Economic Statistics 26, 176-193.
Bibby, B.M., Jacobsen, M., Sørensen, M., 2009. Estimating functions for discretely sampled diffusion-type models. In: Aït-Sahalia, Y., Hansen, L.P. (Eds.), Handbook of Financial Econometrics, vol. 1. Elsevier, Amsterdam.
Bickel, P.J., Rosenblatt, M., 1973. On some global measures of the deviations of density function estimates. Annals of Statistics 1, 1071-1095.
Björk, T., 2004. Arbitrage Theory in Continuous Time, second ed. Oxford University Press, Oxford.
Chan, K.C., Karolyi, G.A., Longstaff, F.A., Sanders, A.B., 1992. An empirical comparison of alternative models of the short-term interest rate. Journal of Finance 47, 1209-1227.
Chen, S.X., Gao, J., Tang, C.Y., 2009. A test for model specification of diffusion processes. Annals of Statistics 36, 167-198.
Chen, X., Hansen, L.P., Scheinkman, J., 2009. Nonlinear principal components and long run implications of multivariate diffusions. Annals of Statistics 37, 4279-4312.
Chen, X., Hansen, L.P., Carrasco, M., 2010. Nonlinearity and temporal dependence. Journal of Econometrics 155, 155-169.
Corradi, V., Swanson, N.R., 2005. Bootstrap specification tests for diffusion processes. Journal of Econometrics 124, 117-148.
Corradi, V., White, H., 1999. Specification tests for the variance of a diffusion. Journal of Time Series Analysis 20, 253-270.
Doukhan, P., Massart, P., Rio, E., 1994. The central limit theorem for strongly mixing processes. Annales de l'Institut Henri Poincaré, Section B 30, 63-82.
Doukhan, P., Massart, P., Rio, E., 1995. Invariance principles for absolutely regular empirical processes. Annales de l'Institut Henri Poincaré, Section B 31, 393-427.
Escanciano, J.C., 2009. On the lack of power of omnibus specification tests. Econometric Theory 25, 162-194.
Eubank, R.L., LaRiccia, V.N., 1992. Asymptotic comparison of Cramér-von Mises and nonparametric function estimation techniques for testing goodness-of-fit. Annals of Statistics 20, 2071-2086.
Fan, Y., 1994. Testing the goodness-of-fit of a parametric density function by kernel method. Econometric Theory 10, 316-356.
Fan, Y., 1995. Bootstrapping a consistent nonparametric goodness-of-fit test. Econometric Reviews 14, 367-382.
Fan, J., Zhang, C., Zhang, J., 2001. Generalized likelihood ratio statistics and Wilks phenomenon. Annals of Statistics 29, 153-193.
Florens-Zmirou, D., 1993. On estimating the diffusion coefficient from discrete observations. Journal of Applied Probability 30, 790-804.
Friedman, A., 1976. Stochastic Differential Equations and Applications, vol. 1. Academic Press, New York.
Ghosh, B.K., Huang, W.-M., 1991. The power and optimal kernel of the Bickel-Rosenblatt test for goodness of fit. Annals of Statistics 19, 999-1009.
Gobet, E., Hoffmann, M., Reiß, M., 2004. Nonparametric estimation of scalar diffusions based on low frequency data. Annals of Statistics 32, 2223-2253.
Gouriéroux, C., Monfort, A., Renault, E., 1993. Indirect inference. Journal of Applied Econometrics 8, S85-S118.
Gouriéroux, C., Tenreiro, C., 2001. Local power properties of kernel based goodness of fit tests. Journal of Multivariate Analysis 78, 161-190.
Hansen, L.P., Scheinkman, J.A., 1995. Back to the future: generating moment implications for continuous time Markov processes. Econometrica 63, 767-804.
Härdle, W., Mammen, E., 1993. Comparing nonparametric versus parametric regression fits. Annals of Statistics 21, 1926-1947.
Hong, Y., Li, H., 2005. Nonparametric specification testing for continuous-time models with application to spot interest rates. Review of Financial Studies 18, 37-84.
Horowitz, J.L., 2003. Bootstrap methods for Markov processes. Econometrica 71, 1049-1082.
Huang, L.-S., 1997. Testing goodness-of-fit based on a roughness measure. Journal of the American Statistical Association 92, 1399-1402.
Karlin, S., Taylor, H.M., 1981. A Second Course in Stochastic Processes. Academic Press, New York.
Kristensen, D., 2007. Nonparametric estimation and misspecification testing of diffusion models. CREATES Research Papers 2007-1, University of Aarhus.
Kristensen, D., 2008. Estimation of partial differential equations with applications in finance. Journal of Econometrics 144, 392-408.
Kristensen, D., 2009. Uniform convergence rates of kernel estimators with heterogeneous, dependent data. Econometric Theory 25, 1433-1445.
Kristensen, D., 2010. Pseudo-maximum likelihood estimation in two classes of semiparametric diffusion models. Journal of Econometrics 156, 239-259.
Kristensen, D., Shin, Y., 2008. Estimation of dynamic models with nonparametric simulated maximum likelihood. CREATES Research Papers 2008-58, University of Aarhus.
Li, F., 2007. Testing the parametric specification of the diffusion function in a diffusion process. Econometric Theory 23, 221-250.
Li, F., Tkacz, G., 2006. A consistent bootstrap test for conditional density functions with time-series data. Journal of Econometrics 133, 863-886.
Meyn, S.P., Tweedie, R.L., 1993. Stability of Markovian processes III: Foster-Lyapunov criteria for continuous-time processes. Advances in Applied Probability 25, 518-548.
Negri, I., Nishiyama, Y., 2009. Goodness of fit test for ergodic diffusion processes. Annals of the Institute of Statistical Mathematics 61, 919-928.
Nicolau, J., 2003. Bias reduction in nonparametric diffusion coefficient estimation. Econometric Theory 19, 754-777.
Robinson, P.M., 1983. Nonparametric estimators for time series. Journal of Time Series Analysis 4, 185-297.
Robinson, P.M., 1988. Root-n-consistent semiparametric regression. Econometrica 56, 931-954.
Robinson, P.M., 1991. Consistent nonparametric entropy-based testing. Review of Economic Studies 58, 437-453.
Rosenblatt, M., 1975. A quadratic measure of deviation of two-dimensional density estimates and a test of independence. Annals of Statistics 3, 1-14.
Whang, Y.-J., Andrews, D.W.K., 1993. Tests of specification for parametric and semiparametric models. Journal of Econometrics 57, 277-318.
White, H., 1982. Maximum likelihood estimation of misspecified models. Econometrica 50, 1-25.
Wong, E., 1964. The construction of a class of stationary Markoff processes. In: Bellman, R. (Ed.), Sixteenth Symposium in Applied Mathematics - Stochastic Processes in Mathematical Physics and Engineering. American Mathematical Society, Providence, pp. 264-276.
Wooldridge, J.M., White, H., 1988. Some invariance principles and central limit theorems for dependent heterogeneous processes. Econometric Theory 4, 210-230.