Journal of Econometrics 165 (2011) 1–4
Editorial
Moment Restriction-Based Econometric Methods: An overview
1. Introduction

One of the primary missions of econometric theory is to provide statistical methods for estimating unknown key quantities and for testing hypotheses about these quantities. Given a sufficiently large number of observations, the optimal statistical strategy for estimation and testing is based on the likelihood principle. Implementing this strategy, however, requires the joint (or conditional) distribution of the economic variables, and economic theory is typically unable to supply a full model that completely specifies the statistical structure, namely a parametric statistical model. In empirical studies, it is common to use the maximum likelihood method to estimate unknown parameters under the assumption of, say, normality of the disturbances, as in regression analysis and limited dependent variable models, among others. If the specified model is correct, optimal results can be obtained asymptotically. There is, however, a severe limitation: the estimates or test statistics may be unreliable if the assumed parametric model is, in fact, incorrect. In short, there exists a trade-off between efficiency (under the null hypothesis of correct specification) and robustness (under the alternative of incorrect specification). In view of this possibility, it may be more sensible to assume a less restrictive model or structure which will yield more robust statistical outcomes. A viable alternative to the parametric approach is to assume only certain conditional or unconditional moment conditions without specifying a fully parametric model. Indeed, it has become common practice to use partially specified econometric models that are described in terms of a variety of moment restrictions. Econometric theory has contributed significantly to modelling using moment restrictions over the last half-century.
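As a minimal numerical illustration of estimation from a moment restriction alone (our sketch with simulated data; the numbers and variable names are invented and do not come from any paper in this issue), consider the single unconditional moment condition E[x(y − xβ)] = 0, whose sample analogue identifies β without any distributional assumption on the disturbance:

```python
import numpy as np

# Sketch: estimate beta in y = x*beta + u from the single moment
# restriction E[x*(y - x*beta)] = 0, with no distributional
# assumption on u. All data below are simulated for illustration.
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
u = rng.standard_t(df=5, size=n)      # deliberately non-normal errors
beta_true = 2.0
y = beta_true * x + u

# Sample analogue: mean(x*(y - x*b)) = 0 solves in closed form as
# b = mean(x*y) / mean(x*x).
beta_hat = np.mean(x * y) / np.mean(x * x)
```

Replacing x by an instrument z in the moment condition gives the usual IV estimator, and stacking further moment conditions leads to the GMM setting.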
Formally, moment restriction-based econometric modelling is a broad class which includes the parametric, semiparametric and nonparametric approaches. To state the obvious, moments and conditional moments are themselves nonparametric quantities, so estimating or testing certain unconditional or conditional moments without further restrictions is a nonparametric procedure. If the model is specified in part, up to some finite-dimensional parameters, this yields semiparametric estimates or tests; examples include partially linear regression models and index models. If we use the score to construct moment restrictions for estimating finite-dimensional parameters, this yields maximum likelihood (ML) estimates. Semiparametric and nonparametric settings based on moment restrictions have been the main concern in the literature and, in our view, comprise the most important and interesting topics in this context. Some of the basic advantages and disadvantages of the parametric, semiparametric and nonparametric approaches are given in Table 1.
Parametric modelling performs best if the model is correctly specified, but the optimal properties generally disappear under misspecification. Nonparametric modelling is hardly ever incorrect, as it typically does not assume any structure other than some smoothness conditions, and hence does not face the consequences of possible misspecification. However, nonparametric estimates are sometimes difficult to interpret, for example, in the case of nonparametric multiple regression. On the other hand, nonparametric methods provide a very useful tool for testing, as in specification tests, where it is possible to obtain reliable results in comparing parametric versus nonparametric models or semiparametric versus nonparametric models. Semiparametric modelling is, in most cases, a good compromise in estimation between the parametric and nonparametric approaches, as it balances efficiency against robustness. Consequently, moment restriction-based econometric methods can easily fit into either the nonparametric or the semiparametric framework. The instrumental variables (IV) method is perhaps the most basic and useful tool in econometric estimation based on moment restrictions. IV is used when least squares (LS) estimation is inappropriate due to endogeneity, in models such as simultaneous equation systems, regressions with errors in variables, and autoregressive models with serially correlated disturbances, among others. When we have more instruments than are required for identification, we can use the generalized method of moments (GMM), as proposed by Hansen (1982). GMM is undoubtedly one of the most widely used methods in contemporary econometrics, and is typically a semiparametric estimator, where the moment condition is described in terms of the observations, a finite number of parameters, and function(s) relating these quantities.
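The two-step GMM recipe for a linear model with more instruments than parameters can be sketched as follows (a hedged illustration on simulated data; the design and numbers are ours, and the weighting scheme is the standard textbook one rather than that of any specific paper discussed here):

```python
import numpy as np

# Two-step GMM for y = x*beta + u with an endogenous regressor x and
# two instruments (overidentified). Simulated illustration only.
rng = np.random.default_rng(1)
n = 5_000
z = rng.normal(size=(n, 2))                  # instruments
v = rng.normal(size=n)
x = z @ np.array([1.0, 0.5]) + v             # first stage
u = 0.8 * v + rng.normal(size=n)             # endogeneity: Cov(x, u) > 0
beta_true = 1.5
y = beta_true * x + u
X = x[:, None]

def gmm_beta(W):
    """Closed-form minimiser of the GMM objective g(b)' W g(b),
    with g(b) = Z'(y - X b)/n, for a model linear in b."""
    A = X.T @ z @ W @ z.T @ X
    c = X.T @ z @ W @ z.T @ y
    return np.linalg.solve(A, c)

b_ols = (x @ y) / (x @ x)            # OLS is biased upward here
b1 = gmm_beta(np.eye(2))             # step 1: identity weight matrix
e = y - X @ b1                       # first-step residuals
S = (z * (e**2)[:, None]).T @ z / n  # estimated moment covariance
b2 = gmm_beta(np.linalg.inv(S))      # step 2: efficient weighting
```

The first step uses an identity weight matrix; the second re-weights the moments by the inverse of their estimated covariance, which is where the efficiency gain from overidentification comes from.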
It is widely known in the statistics literature that Godambe (1960) proposed what is essentially an equivalent method, namely the estimating equation method, which bridges ML estimation and the method of moments. Estimation methods that have the same first-order asymptotic properties as GMM were proposed in the 1980s and 1990s, such as the empirical likelihood and exponential tilting estimators, and, further, a general class called generalized empirical likelihood estimators, which includes all of the above as special cases. We would also like to mention an important class of methods in econometric analysis, the entropy-based econometric methods, which take an information-theoretic approach but are closely related to the present theme, as well as to the maximum likelihood approach. We refer to Golan (2002), which appeared in another special volume of the Journal of Econometrics, and the articles therein. Specification analysis is a classical and important issue in econometrics. A classical example is the t-test in linear regression, where we are interested in whether a certain independent variable
Table 1
Advantages and disadvantages of parametric, semiparametric and nonparametric methods.

Parametric
  Advantages: (i) Efficient estimation. (ii) Sample size can be small. (iii) Powerful tests against parametric alternatives.
  Disadvantages: (i) Estimation inconsistency under model misspecification. (ii) Test inconsistency under incorrect parametric alternatives.

Semiparametric
  Advantages: Tries to balance efficiency versus robustness.
  Disadvantages: Inconsistency under model misspecification.

Nonparametric
  Advantages: (i) Robustness. (ii) Can have powerful tests even if the alternative is incorrect.
  Disadvantages: (i) Inefficient estimation. (ii) Requires large samples. (iii) Can be difficult to interpret estimates.
is useful in accounting for the variation in the dependent variable. Various extensions to nonparametric and semiparametric settings have been considered in the literature since the 1980s, by Bierens (1990), among others. This is a specification test for conditional moments, where both the null and alternative hypotheses are nonparametric. These tests are called ‘‘omitted variable’’ tests rather than nonparametric significance tests. Härdle and Mammen (1993), for example, propose a specification test of a parametric regression versus a nonparametric regression. Semiparametric modelling partially specifies the structure, and it is often an issue whether the specified component can be justified on the basis of the data. For example, whether the regression function has a single-index structure is something which should be checked in semiparametric single-index models. Such tests of semiparametric models versus their nonparametric counterparts have also been developed, for example, in Whang and Andrews (1993) and Yatchew (1992), among many others. The many-moments and weak-instruments problems are important empirical issues in GMM estimation and its variants, and have been studied since the late 1990s. When there are many moment conditions, the bias tends to become large. This phenomenon has been known in the context of estimation for simultaneous equation systems since the 1980s. Kunitomo (1980) examined the asymptotic properties of the two-stage least squares estimator under many instruments, while Morimune (1983) provided higher-order approximations to the distribution of the limited information maximum likelihood estimator under the same conditions. Weak instruments are instrumental variables which have very low correlations with the endogenous explanatory variables. In such a case, the resulting estimates become very unstable, in that a small change in the data can lead to a large change in the estimates.
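This instability can be seen in a small simulation (entirely our illustration; the instrument strengths `pi` are made up): the sampling distribution of the just-identified IV estimator is tight when the instrument is strong and extremely dispersed when it is nearly irrelevant.

```python
import numpy as np

def iv_draws(pi, n=200, reps=2000, seed=2):
    """Monte Carlo draws of the simple (just-identified) IV estimator
    of beta; pi is the first-stage coefficient, so pi near 0 means a
    weak instrument. True beta is 1."""
    rng = np.random.default_rng(seed)
    est = np.empty(reps)
    for r in range(reps):
        z = rng.normal(size=n)
        v = rng.normal(size=n)
        x = pi * z + v                     # first stage
        u = 0.5 * v + rng.normal(size=n)   # endogeneity
        y = 1.0 * x + u
        est[r] = (z @ y) / (z @ x)         # IV estimator
    return est

strong = iv_draws(pi=1.0)
weak = iv_draws(pi=0.05)

# Interquartile range: the weak-instrument sampling distribution
# is far more dispersed than the strong-instrument one.
iqr = lambda a: np.subtract(*np.percentile(a, [75, 25]))
```

With the weak instrument the denominator z'x is frequently close to zero, so small changes in the data produce large swings in the estimate, exactly the phenomenon described above.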
The many-moments problem and the weak-instruments problem tend to occur simultaneously, because it is more likely that some instruments are weak when there are many instruments. It is worth mentioning some directions in moment restriction-based econometrics in which econometricians are currently interested. In financial econometrics and empirical finance, forecasting of Value-at-Risk (VaR) and the corresponding quantiles is of increasing importance in developing optimal risk management strategies. Another important area is the analysis of treatment effects, or program evaluation, in which the quantities of interest are identified with respect to moments or conditional moments. Various methods of identification and estimation have been developed during the last two decades, and have been usefully applied. A further important direction of recent advances is set estimation by moment inequalities. This kind of approach is mainly used in the econometric analysis of games, because economic agents facing game-theoretic situations typically select their actions by comparing the outcomes from a variety of strategies. Further theoretical developments and implementation will probably shed light on the behaviour of economic agents. The purpose of this special issue on ‘‘Moment Restriction-Based Econometric Methods’’ is to highlight some areas in which novel econometric methods have contributed significantly
to the analysis of moment restrictions, specifically: asymptotic theory for nonparametric regression with spatial data (Robinson, 2011), a control variate method for stationary processes (Amano and Taniguchi, 2011), method of moments estimation and identifiability of semiparametric nonlinear errors-in-variables models (Wang and Hsiao, 2011), properties of the CUE estimator and a modification with moments (Hausman et al., 2011), finite sample properties of alternative estimators of coefficients in a structural equation with many instruments (Anderson et al., 2011), instrumental variable estimation in the presence of many-moment conditions (Okui, 2011), estimation of conditional moment restrictions without assuming parameter identifiability in the implied unconditional moments (Hsu and Kuan, 2011), moment-based estimation of smooth transition regression models with endogenous variables (Areosa et al., 2011), a consistent nonparametric test for nonlinear causality (Nishiyama et al., 2011), and linear programming-based estimators in simple linear regression (Preve and Medeiros, 2011). It is our hope that these invaluable papers, most of which were presented at a conference at Kyoto University in honour of Professor Kimio Morimune’s 60th birthday, will encourage others to undertake research in a variety of challenging areas associated with moment restriction-based econometric methods. Two excellent papers (among many) by Kimio Morimune are related to the main theme of the special issue. Although it was known that maximum likelihood estimation often led to an appropriate answer in parametric models, standard asymptotic theory often gave several asymptotically equivalent efficient methods. Takeuchi and Morimune (1985) gave a definitive classical answer to this problem in the context of a parametric estimation problem of structural equations, which corresponds to the estimating equation problem and the reduced rank regression problem (see also Anderson et al., 1986).
Morimune (1989) also gave a definitive answer to the statistical question of the standard use of t-tests in the structural equation problem, which is an impressive result.

2. Overview

In the first paper, Peter Robinson (London School of Economics) considers ‘‘Asymptotic theory for nonparametric regression with spatial data’’, specifically, nonparametric regression with spatial, or spatio-temporal, data. The conditional mean of a dependent variable, given explanatory variables, is a nonparametric function, while the conditional covariance reflects spatial correlation. Conditional heteroscedasticity is also allowed, as well as non-identically distributed observations. Instead of mixing conditions, a (possibly non-stationary) linear process is assumed for disturbances, allowing for long-range, as well as short-range, dependence, while the decay in dependence in explanatory variables is described using a measure based on the departure of the joint density from the product of marginal densities. A basic triangular array setting is employed, with the aim of covering various patterns of spatial observation. Sufficient conditions are established for consistency and asymptotic normality of kernel regression estimates. When the cross-sectional dependence is sufficiently mild, the asymptotic variance in the central limit theorem is the same as when observations are
independent; otherwise the rate of convergence is slower. The author discusses application of the established conditions to spatial autoregressive models, and to models defined on a regular lattice. Continuing the theme of moment restriction-based econometric methods, Tomoyuki Amano (Waseda University) and Masanobu Taniguchi (Waseda University) develop a ‘‘Control variate method for stationary processes’’. The sample mean is a natural estimator of the population mean based on independently and identically distributed observations. However, if a control variate is available, the control variate method is known to reduce the variance of the sample mean. The control variate method usually assumes that the variable of interest and the control variable are independently and identically distributed. The authors instead assume that these variables are stationary processes with spectral density matrices, that is, they are dependent, and propose an estimator of the mean of the stationary process of interest using a control variate method based on a nonparametric spectral estimator. It is shown that this estimator improves on the sample mean in the sense of mean squared error. The authors also extend the analysis to the case where the mean dynamics take a regression form, and propose a control variate estimator for the regression coefficients which improves on the least squares estimator (LSE). Numerical studies show how their estimator improves on the LSE. Liqun Wang (University of Manitoba) and Cheng Hsiao (University of Southern California and Xiamen University) consider ‘‘Method of moments estimation and identifiability of semiparametric nonlinear errors-in-variables models’’. Their paper deals with a nonlinear errors-in-variables model, where the distributions of the unobserved predictor variables and of the measurement errors are nonparametric.
Using the instrumental variable approach, the authors propose method of moments estimators for the unknown parameters, as well as simulation-based estimators to overcome the possible computational difficulty of minimizing an objective function that involves multiple integrals. Both estimators are consistent and asymptotically normally distributed under fairly general regularity conditions. Moreover, root-n consistent semiparametric estimators and a rank condition for model identifiability are derived using the combined methods of the nonparametric technique and Fourier deconvolution. In the fourth paper, Jerry Hausman (MIT), Konrad Menzel (MIT), Randall Lewis (MIT) and Whitney Newey (MIT) analyse ‘‘Properties of the CUE estimator and a modification with moments’’. Specifically, the authors analyse the properties of the continuous updating estimator (CUE) proposed by Hansen, Heaton and Yaron (1996), which has been suggested as a solution to the finite sample bias problems of the two-step GMM estimator. The authors show that the estimator has high dispersion in finite samples under weak identification, and propose the regularized CUE (RCUE) as a solution to this problem. The RCUE solves a modification of the first-order conditions for the CUE estimator. The authors show that the RCUE is asymptotically equivalent to the CUE under many-weak-moment asymptotics, as in Newey and Windmeijer (2008). They provide extensive Monte Carlo results showing the low bias and dispersion properties of the new estimator. ‘‘On finite sample properties of alternative estimators of coefficients in a structural equation with many instruments’’ leads T.W. Anderson (Stanford University), Naoto Kunitomo (University of Tokyo) and Yukitoshi Matsushita (University of Tokyo) to compare four different estimation methods for the coefficients of a linear structural equation with instrumental variables.
As the classical methods, the authors consider the limited information maximum likelihood (LIML) estimator and the two-stage least squares (TSLS) estimator, and as the semiparametric estimation methods, they consider the maximum empirical likelihood (MEL) estimator and the generalized method of moments (GMM) (or the estimating equation) estimator. Tables and figures showing
the distribution functions of the four estimators are given for a sufficiently wide range of parameter values to cover most linear models of interest, and the authors include some heteroscedastic and nonlinear cases. The authors find that the LIML estimator performs well in terms of bounded loss functions and probabilities when the number of instruments is large, that is, for micro-econometric models with ‘‘many instruments’’, in the terminology of the recent econometric literature. In the sixth paper, ‘‘Instrumental variable estimation in the presence of many-moment conditions’’, Ryo Okui (Hong Kong University of Science and Technology) develops shrinkage methods for addressing the ‘‘many-instruments’’ problem in the context of instrumental variable estimation. It has been observed that instrumental variable estimators may behave poorly if the number of instruments is large. This problem can be addressed by shrinking the influence of a subset of instrumental variables. The procedure can be understood as a two-step process: first, shrink some of the OLS coefficient estimates from the regression of the endogenous variables on the instruments; then use the predicted values of the endogenous variables (based on the shrunk coefficient estimates) as the instruments. The shrinkage parameter is chosen to minimize the asymptotic mean square error. The optimal shrinkage parameter has a closed form, which makes it easy to implement. A Monte Carlo study shows that the shrinkage method works well and performs better in many situations than existing instrument selection procedures. Shih-Hsun Hsu (National Chengchi University) and Chung-Ming Kuan (National Taiwan University) examine ‘‘Estimation of conditional moment restrictions without assuming parameter identifiability in the implied unconditional moments’’.
A well-known difficulty in estimating conditional moment restrictions is that the parameters of interest need not be globally identified by the implied unconditional moments. The authors propose an approach to constructing a continuum of unconditional moments that can ensure parameter identifiability. These unconditional moments depend on instruments generated from a ‘‘generically comprehensively revealing’’ function, and they are further projected along the exponential Fourier series. The objective function is based on the resulting Fourier coefficients, from which an estimator can be easily computed. A novel feature of the new method is that the full continuum of unconditional moments is incorporated into each Fourier coefficient. The authors show that, when the number of Fourier coefficients in the objective function grows at a proper rate, the proposed estimator is consistent and asymptotically normally distributed. An efficient estimator is also readily obtained via the conventional two-step GMM method. The simulations confirm that the proposed estimator compares favourably with that of Dominguez and Lobato (2004) in terms of bias, standard error, and mean squared error. In the eighth paper, Waldyr Dutra Areosa (Pontifical Catholic University of Rio de Janeiro), Michael McAleer (Erasmus University Rotterdam) and Marcelo C. Medeiros (Pontifical Catholic University of Rio de Janeiro) evaluate ‘‘Moment-based estimation of smooth transition regression models with endogenous variables’’. Nonlinear regression models have been widely used for time series and cross-section data sets. For analysing univariate and multivariate time series data, in particular, smooth transition regression (STR) models have been shown to be very useful for representing and capturing asymmetric behaviour.
Most STR models have been applied to univariate processes, and a variety of assumptions have been made, including stationary or cointegrated processes, uncorrelated, homoscedastic or conditionally heteroscedastic errors, and weakly exogenous regressors. Under the assumption of exogeneity, the standard method of estimation is nonlinear least squares. The authors relax the assumption of weakly exogenous regressors and discuss moment-based methods for estimating STR models.
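As background for readers unfamiliar with STR models, a generic logistic STR specification can be sketched as follows (a textbook form with invented coefficients; this is not the authors' exact model or estimator):

```python
import numpy as np

def logistic_transition(s, gamma, c):
    """Logistic transition function G(s; gamma, c) in (0, 1):
    gamma controls the smoothness of the regime switch and c the
    threshold location."""
    return 1.0 / (1.0 + np.exp(-gamma * (s - c)))

# Generic STR model: y = b0 + b1*x + (d0 + d1*x) * G(s) + u.
# The transition variable s moves the regression smoothly between
# the regime (b0, b1) and the regime (b0 + d0, b1 + d1).
rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
s = rng.normal(size=n)                    # transition variable
G = logistic_transition(s, gamma=5.0, c=0.0)
y = 0.5 + 1.0 * x + (1.0 + 2.0 * x) * G + 0.1 * rng.normal(size=n)
```

At s = c the model is exactly halfway between the two regimes; under exogeneity the parameters are typically fitted by nonlinear least squares, which is the assumption the authors relax.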
They analyse the properties of the STR model with endogenous variables, provide a diagnostic test of linearity under endogeneity, develop an estimation procedure and a misspecification test for the STR model, present results of Monte Carlo simulations to show the usefulness of the model and estimation method, and provide an empirical application to inflation rate targeting in Brazil. They show that STR models with endogenous variables can be specified and estimated by a straightforward application of existing results. Yoshihiko Nishiyama (Kyoto University), Kohtaro Hitomi (Kyoto Institute of Technology), Yoshinori Kawasaki (Institute of Statistical Mathematics) and Kiho Jeong (Kyungpook National University) develop ‘‘A consistent nonparametric test for nonlinear causality’’. Following Granger (1969), many tests have been proposed for causality of economic time series. Most have been concerned only with ‘‘linear causality in mean’’. This is undoubtedly of primary interest, but the dependence between series may be nonlinear, and/or may operate not only through the conditional mean, as in conditional heteroscedasticity models. The authors propose a nonparametric test for possibly nonlinear causality. Taking into account that dependence in higher-order moments is an important issue, especially in financial time series, the authors also consider a test for causality up to the Kth conditional moment. Statistically, their new test can be considered as a nonparametric omitted variable test in time series regression. A desirable property of the test is that it has nontrivial power against square-root-of-T local alternatives, where T is the sample size. It is possible to develop a test statistic accordingly if there is some knowledge regarding the alternative hypothesis. Furthermore, the authors show that the test statistic includes most of the established omitted variable test statistics as special cases asymptotically.
The null asymptotic distribution is not normal, but the critical regions can be calculated by simulation. Monte Carlo experiments show that the proposed test has good size and power properties. In the final paper, Daniel Preve (Singapore Management University) and Marcelo C. Medeiros (Pontifical Catholic University of Rio de Janeiro) consider ‘‘Linear programming-based estimators in simple linear regression’’. The authors introduce a linear programming estimator (LPE) for the slope parameter in a constrained linear regression model with a single regressor. The LPE is interesting because it can be superconsistent in the presence of an endogenous regressor and, hence, preferable to the ordinary least squares estimator (LSE). One advantage of the LPE is that it does not require an instrument. The authors investigate the statistical properties of the LPE in two different cases. In the first, the regressor is assumed to be fixed in repeated samples, and in the second, the regressor is stochastic and potentially endogenous. In both cases, the strong consistency and exact finite sample distribution of the LPE are established. Conditions under which the LPE is consistent in the presence of serially correlated and heteroscedastic errors are also given. Finally, the authors describe how the LPE can be extended to the case with multiple regressors, and conjecture that the extended estimator is consistent under conditions analogous to those given in the paper. Finite sample properties of the LPE and extended LPE in comparison with the LSE and instrumental variable estimator (IVE) are investigated in a simulation study. It is hoped that this collection of papers by some of the leading experts in the field of ‘‘Moment Restriction-Based Econometric Methods’’ will be of wide interest to a diverse range of econometricians, statisticians, and quantitative researchers.
It is a genuine pleasure to acknowledge all of the contributors for preparing their innovative submissions in a timely manner, and for actively participating, together with others, in the rigorous review process.

Acknowledgements

The authors wish to thank numerous referees for their helpful, insightful and timely reviews of the papers that comprise this special issue. The second author wishes to acknowledge the financial support of the Australian Research Council, National Science Council, Taiwan, and the Japan Society for the Promotion of Science.

References

Amano, T., Taniguchi, M., 2011. Control variate method for stationary processes. Journal of Econometrics 165 (1), 20–29.
Anderson, T.W., Kunitomo, N., Matsushita, Y., 2011. On finite sample properties of alternative estimators of coefficients in a structural equation with many instruments. Journal of Econometrics 165 (1), 58–69.
Anderson, T.W., Kunitomo, N., Morimune, K., 1986. Comparing single-equation estimators in a simultaneous system. Econometric Theory 2, 1–32.
Areosa, W.D., McAleer, M., Medeiros, M.C., 2011. Moment-based estimation of smooth transition regression models with endogenous variables. Journal of Econometrics 165 (1), 100–111.
Bierens, H.J., 1990. A consistent conditional moment test of functional form. Econometrica 58 (6), 1443–1458.
Godambe, V.P., 1960. An optimum property of regular maximum likelihood estimation. Annals of Mathematical Statistics 31, 1208–1211.
Golan, A., 2002. Information and entropy econometrics—Editor’s view. Journal of Econometrics 107 (1), 1–15.
Hansen, L.P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50 (4), 1029–1054.
Härdle, W., Mammen, E., 1993. Comparing nonparametric versus parametric regression fits. Annals of Statistics 21, 1926–1947.
Hausman, J., Menzel, K., Lewis, R., Newey, W., 2011. Properties of the CUE estimator and a modification with moments. Journal of Econometrics 165 (1), 45–57.
Hsu, S.-H., Kuan, C.-M., 2011. Estimation of conditional moment restrictions without assuming parameter identifiability in the implied unconditional moments. Journal of Econometrics 165 (1), 87–99.
Kunitomo, N., 1980. Asymptotic expansions of the distributions of estimators in a linear functional relationship and simultaneous equations. Journal of the American Statistical Association 75, 693–700.
Morimune, K., 1983. Approximate distributions of k-class estimators when the degree of overidentification is large compared with the sample size. Econometrica 51, 821–841.
Morimune, K., 1989. t tests in a structural equation. Econometrica 57, 1341–1360.
Nishiyama, Y., Hitomi, K., Kawasaki, Y., Jeong, K., 2011. A consistent nonparametric test for nonlinear causality. Journal of Econometrics 165 (1), 112–127.
Okui, R., 2011. Instrumental variable estimation in the presence of many moment conditions. Journal of Econometrics 165 (1), 70–86.
Preve, D., Medeiros, M.C., 2011. Linear programming-based estimators in simple linear regression. Journal of Econometrics 165 (1), 128–136.
Robinson, P.M., 2011. Asymptotic theory for nonparametric regression with spatial data. Journal of Econometrics 165 (1), 5–19.
Takeuchi, K., Morimune, K., 1985. Third order efficiency of the extended maximum likelihood estimator in a linear simultaneous equation system. Econometrica 53, 177–200.
Wang, L., Hsiao, C., 2011. Method of moments estimation and identifiability of semiparametric nonlinear errors-in-variables models. Journal of Econometrics 165 (1), 30–44.
Whang, Y., Andrews, D.W.K., 1993. Tests of specification for parametric and semiparametric models. Journal of Econometrics 57, 277–318.
Yatchew, A.J., 1992. Nonparametric regression tests based on least squares. Econometric Theory 8, 435–451.
Naoto Kunitomo
Faculty of Economics, University of Tokyo, Japan

Michael McAleer
Econometric Institute, Erasmus School of Economics, Erasmus University Rotterdam and Tinbergen Institute, The Netherlands
Institute of Economic Research, Kyoto University, Japan

Yoshihiko Nishiyama∗
Institute of Economic Research, Kyoto University, Japan
E-mail address: [email protected].

Available online 12 May 2011

∗ Corresponding editor.
Journal of Econometrics 165 (2011) 5–19
Asymptotic theory for nonparametric regression with spatial data

P.M. Robinson∗
London School of Economics, United Kingdom
Article history: Available online 12 May 2011

JEL classification: C13; C14; C21

Keywords: Nonparametric regression; Spatial data; Weak dependence; Long range dependence; Heterogeneity; Consistency; Central limit theorem
Abstract

Nonparametric regression with spatial, or spatio-temporal, data is considered. The conditional mean of a dependent variable, given explanatory ones, is a nonparametric function, while the conditional covariance reflects spatial correlation. Conditional heteroscedasticity is also allowed, as well as nonidentically distributed observations. Instead of mixing conditions, a (possibly non-stationary) linear process is assumed for disturbances, allowing for long-range, as well as short-range, dependence, while decay in dependence in explanatory variables is described using a measure based on the departure of the joint density from the product of marginal densities. A basic triangular array setting is employed, with the aim of covering various patterns of spatial observation. Sufficient conditions are established for consistency and asymptotic normality of kernel regression estimates. When the cross-sectional dependence is sufficiently mild, the asymptotic variance in the central limit theorem is the same as when observations are independent; otherwise, the rate of convergence is slower. We discuss the application of our conditions to spatial autoregressive models, and models defined on a regular lattice. © 2011 Elsevier B.V. All rights reserved.
1. Introduction A distinctive challenge facing analysts of spatial econometric data is the possibility of spatial dependence. Typically, dependence is modelled as a function of spatial distance, whether the distance be geographic or economic, say, analogous to the modelling of dependence in time series data. However, unlike with time series, there is usually no natural ordering to spatial data. Moreover, forms of irregular spacing of data are more common with spatial than time series data, and this considerably complicates modelling and developing rules of statistical inference. Often, as with cross-sectional and time series data, some (parametric or nonparametric) regression relation or conditional moment restriction is of interest in the modelling of spatial data. If the spatial dependence in the left-hand-side variable is entirely explained by the regressors, such that the disturbances are independent, matters are considerably simplified, and the development of rules of large sample statistical inference is, generally speaking, not very much harder than if the actual observations were independent. In parametric regression models, ordinary least squares can then typically deliver efficient inference (in an asymptotic Gauss–Markov sense, at least). Andrews (2005) has developed the theory to allow for arbitrarily strong forms of
∗ Tel.: +44 20 7955 7516; fax: +44 20 7955 6592. E-mail address: [email protected].

0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.05.002
dependence in the disturbances, but with the data then generated by random sampling, an assumption that is not necessarily plausible in practice. Substantial activity has taken place in the modelling of spatial dependence, and consequent statistical inference, and this is relevant to handling dependence in disturbances. In the statistical literature, lattice data have frequently been discussed. Here, there is equally-spaced sampling in each of d ≥ 2 dimensions, to extend the equally-spaced time series setting (d = 1). Familiar time series models, such as autoregressive-moving-averages, have been extended to lattices (see e.g. Whittle, 1954). In parametric modelling there are greater problems of identifiability than in time series, and the ‘‘edge effect’’ complicates statistical inference (see Guyon, 1982, Dahlhaus and Künsch, 1987, Robinson and Vidal Sanz, 2006 and Yao and Brockwell, 2006). Nevertheless there is a strong sense in which results from time series can be extended. Unfortunately economic data typically are not recorded on a lattice. If the observation locations are irregularly-spaced points in geographic space, it is possible to consider, say, Gaussian maximum likelihood estimation based on a parametric model for dependence defined continuously over the space, though a satisfactory asymptotic statistical theory has not yet been developed. However, even if we feel able to assign a (relative) value to the distance between each pair of data points, we may not have the information to plot the data in, say, 2-dimensional space. Partly as a result, ‘‘spatial autoregressive’’ (SAR) models of Cliff and Ord (1981) have become popular. Here, n spatial observations (or disturbances) are modelled as a linear transformation of
n independent and identically distributed (i.i.d.) unobservable random variables, the n × n transformation matrix being usually known apart from finitely many unknown parameters (often only a single such parameter). While we use the description ‘‘autoregressive’’, forms of the model can be analogous to time series moving average, or autoregressive-moving-average, models, not just autoregressive ones, see (2.9). While a relatively ad hoc form of model, the flexibility of SAR has led to a considerable number of applications (see e.g. Arbia, 2006). SAR, and other structures, have been used to model disturbances, principally in parametric, in particular linear, regression models (see e.g. Kelejian and Prucha, 1999 and Lee, 2002). On the other hand, nonparametric regression has become a standard tool of econometrics, at least in large cross-sectional data sets, due to a recognition that there can be little confidence that the functional form is linear, or of a specific nonlinear type. Estimates of the nonparametric regression function are typically obtained at several fixed points by some method of smoothing. In a spatial context, nonparametric regression has been discussed by, for example, Tran and Yakowitz (1993) and Hallin et al. (2004a). The most commonly used kind of smoothed nonparametric regression estimate in econometrics is still the Nadaraya–Watson kernel estimate. While originally motivated by i.i.d. observations, its asymptotic statistical behaviour has long been studied in the presence of stationary time series dependence. Under forms of weak dependence, it has been found that not only does the Nadaraya–Watson estimate retain its basic consistency property, but more surprisingly it has the same limit distribution as under independence (see, e.g. Roussas, 1969, Rosenblatt, 1971 and Robinson, 1983). 
The latter finding is due to the ‘‘local’’ character of the estimate, and contrasts with experience with parametric regression models, where dependence in disturbances generally changes the limit distribution, and entails efficiency loss. The present paper establishes consistency and asymptotic distribution theory for the Nadaraya–Watson estimate in a framework designed to be applied to various kinds of spatial data. It would be possible to describe a theory that mimics fairly closely that for the time series case. In particular, strong mixing time series were assumed by Robinson (1983) in asymptotic theory for the Nadaraya–Watson estimate, and various mixing concepts have been generalised to d ≥ 2 dimensions in the random fields literature, where they have been employed in asymptotic theory for various parametric, nonparametric and semiparametric estimates computed from spatial data; a less global condition, in a similar spirit, was employed by Pinkse et al. (2007). We prefer to assume, in the case of the disturbances in our nonparametric regression, a linear (in independent random variables) structure, that explicitly covers both lattice linear autoregressive-moving-average and SAR models (with a scale factor permitting conditional or unconditional heteroscedasticity). Our framework also allows for a form of strong dependence (analogous to that found in long memory time series), a property ruled out by the mixing conditions usually assumed in asymptotic distribution theory. In this respect, it seems we also fill some gap in the time series literature because we allow our regressors to be stochastic, unlike in the fixed-design nonparametric regressions with long memory disturbances covered by Hall and Hart (1990) and Robinson (1997).
As a further, if secondary, innovation, while we have to assume some (mild) falling off of dependence in the regressors as their distance increases, we do not require these to be identically distributed across observations (as in Andrews (1995)). It should be added that our asymptotic theory is of the ‘‘increasing domain’’ variety. ‘‘Infill asymptotics’’, on a bounded domain, is popular in some research on spatial statistics and could be employed here, but is likely to yield less useful results. For example it has been found in some settings
that under infill asymptotics estimates are inconsistent, by virtue of converging to a nondegenerate probability limit. The following section describes our basic model and setting. Section 3 introduces the Nadaraya–Watson kernel estimate. Detailed regularity conditions are presented in Sections 4 and 5 for consistency and asymptotic distribution theory, respectively, the proofs resting heavily on a sequence of lemmas, which are stated and proved in appendices. Section 6 discusses implications of our conditions and of our results in particular spatial settings. 2. Nonparametric regression in a spatial setting We consider the conditional expectation of a scalar observable Y given a d-dimensional vector observable X. We have n observations on (Y, X). It is convenient to treat these as triangular arrays, that is, we observe the scalar Yin and the d × 1 vector Xin, for 1 ≤ i ≤ n, where statistical theory will be developed with n increasing without bound. The triangular array structure of Y is partly a consequence of allowing a triangular array structure for the disturbances (the difference between Y and its conditional expectation) in the model, to cover in particular a common specification of the SAR model. But there is a more fundamental reason for it, and for treating the X observations as a triangular array also. We can identify each of the indices i = 1, . . . , n with a location in space. In regularly-observed time series settings, these indices correspond to equi-distant points on the real line, and what we usually mean is evident by letting n increase. However there is ambiguity when these are points in space. For example, consider n points on a 2-dimensional regularly-spaced lattice, where both the number (n1) of rows and the number (n2) of columns increases with n = n1 n2. If we choose to list these points in lexicographic order (say first row left → right, then second row etc.)
then as n increases there would have to be some re-labelling, as the triangular array permits. Another consequence of this listing is that dependence between locations i and j is not always naturally expressed as a function of the difference i − j, even if the process is stationary (unlike in a time series). For example, this is so if the dependence is isotropic. Of course in this lattice case we can naturally label the locations by a bivariate index, and model dependence relative to this. However, there is still ambiguity in how n1 and/or n2 increase as n increases, and in any case we do not wish to be restricted to 2-dimensional lattice data; we could have a higher-dimensional lattice (as with spatio-temporal data, for example) or irregularly-spaced data, or else data modelled using a SAR model, in which only some measures of distance between each pair of observations are employed. As a result our conditions tend to be of a ‘‘global’’ nature, in the sense that all n locations are involved, with n increasing, and thus are also relatively unprimitive, sometimes requiring a good deal of work to check in individual cases, but this seems inevitable in order to potentially cover many kinds of spatial data. As a consequence of the triangular array structure many quantities in the paper deserve an n subscript. To avoid burdening the reader with excessive notational detail we will however tend to suppress the n subscript, while reminding the reader from time to time of the underlying n-dependence. Thus, for example, we write Xi for Xin and Yi for Yin. We consider a basic conditional moment restriction of the form

E(Yi | Xi) = g(Xi),  1 ≤ i ≤ n, n = 1, 2, …,  (2.1)

where g(x) : R^d → R is a smooth, nonparametric function. We wish to estimate g(x) at fixed points x. Note that g is constant over i (and n). However (anticipating Nadaraya–Watson estimation, which entails density cancellation asymptotically) we will assume that the Xi have probability densities, fi(x) = fin(x), that are unusually allowed to vary across i, though unsurprisingly, given
the need to obtain a useful asymptotic theory, they do have to satisfy some homogeneity restrictions, and the familiar identically-distributed case affords simplification. The Xi are also not assumed independent across i, but to satisfy ‘‘global’’ assumptions requiring some falling-off in dependence (see e.g. Assumption A6). A key role is played by an assumption on the disturbances

Ui = Uin = Yi − g(Xi),  1 ≤ i ≤ n, n = 1, 2, ….  (2.2)

We assume

Ui = σi(Xi) Vi,  1 ≤ i ≤ n, n = 1, 2, …,  (2.3)

where σi(x) = σin(x) and Vi = Vin are both scalars; as desired the first, and also second, moment of σi(Xi) exists; and, for all n = 1, 2, …, {Vi, 1 ≤ i ≤ n} is independent of {Xi, 1 ≤ i ≤ n}. We assume that

E(Vi) = 0,  1 ≤ i ≤ n, n = 1, 2, …,  (2.4)

implying immediately the conditional moment restriction E{Ui | Xi} = 0, 1 ≤ i ≤ n, n = 1, 2, …. As the σi²(x) are unknown functions, if Vi has finite variance, with no loss of generality we fix

Var{Vi} ≡ 1,  (2.5)

whence

Var{Yi | Xi} = Var{Ui | Xi} = σi²(Xi),  1 ≤ i ≤ n, n = 1, 2, …,  (2.6)
so conditional heteroscedasticity is permitted. We do not assume the σi²(x) are constant across i, thus allowing unconditional heteroscedasticity also, though again homogeneity restrictions will be imposed. Dependence across i is principally modelled via the Vi. For many, though not all, of our results we assume

Vi = Σ_{j=1}^∞ aij εj,  1 ≤ i ≤ n, n = 1, 2, …,  (2.7)

where for each n, the εj, j ≥ 1, are independent random variables with zero mean, and the nonstochastic weights aij = aijn are at least square-summable over j, whence with no loss of generality we fix

Σ_{j=1}^∞ aij² ≡ 1,  1 ≤ i ≤ n, n = 1, 2, ….  (2.8)
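A hedged numerical sketch of (2.7)–(2.8): with time-series-style weights aij = a_{i−j}, taking a_k proportional to (k+1)^(−0.8) gives weights that are square-summable but not summable, so the resulting Vi display the long-range dependence that the linear structure is designed to accommodate (the decay exponent, truncation level, and sample size below are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

n, m = 500, 5000   # m truncates the infinite sum in (2.7) for simulation

# Time-series-style weights a_ij = a_{i-j}: a_k ~ (k+1)^(-0.8) is
# square-summable but NOT summable, the long-range dependent case.
# (The exponent and truncation level are illustrative.)
a = (np.arange(m, dtype=float) + 1.0) ** -0.8
a /= np.sqrt(np.sum(a ** 2))          # impose the normalisation (2.8)

eps = rng.standard_normal(n + m)      # independent innovations
# V_i = sum_k a_k * eps_{i-k}: overlapping weighted sums of innovations
V = np.convolve(eps, a, mode="valid")[:n]
```

The slowly decaying weights make nearby Vi strongly correlated; since Σ|a_k| diverges, summability-based (and typical mixing-based) arguments would not apply here, while the linear structure (2.7) still does.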
When the εj have finite variance we fix Var{εi} = 1, implying (2.5). An alternative to the linear dependence structure (2.7) is some form of mixing condition, which indeed could cover some heterogeneity as well as dependence. In fact mixing could be applied directly to the Xi and Ui, avoiding the requirement of independence between {Vi} and {Xi}, or simply to the observable {Yi, Xi}. Mixing conditions, when applied to our triangular array, would require a notion of falling off of dependence as |i − j| increases, which, as previously indicated, is not relevant to all spatial situations of interest. Moreover, we allow for a stronger form of dependence than mixing; we usually do not require, for example, that the aij are summable with respect to j, and thence cover forms of long-range dependence analogous to those familiar in time series analysis. The linear structure (2.7) obviously covers equally-spaced stationary time series, where aij is of the form ai−j, and lattice extensions, where the infinite series is required not only to model long range dependence but also a finite-degree autoregressive structure in Vi. Condition (2.7) also provides an extension of SAR models. These typically imply

Vi = Σ_{j=1}^n aij εj,  1 ≤ i ≤ n, n = 1, 2, …,  (2.9)
so there is a mapping from n independent innovations εi to the n possibly dependent Vi. In particular, we may commence from a parametric structure

(In − ω1 W1 − · · · − ω_{m1} W_{m1}) U = (In − ω_{m1+1} W_{m1+1} − · · · − ω_{m1+m2} W_{m1+m2}) σ ε,  (2.10)

where the integers m1, m2 are given, In is the n × n identity matrix, U = (U1, …, Un)′, ε = (ε1, …, εn)′, the ωi are unknown scalars, σ is an unknown scale factor, and the Wi = Win are given n × n ‘‘weight’’ matrices (satisfying further conditions in order to guarantee identifiability of the ωi), and such that the matrix on the left hand side multiplying U is nonsingular. Of course (2.10) is similar to the autoregressive-moving-average structure for stationary time series, but that is generated from an infinite sequence of innovations, not only n such (though of course n will increase in the asymptotic theory). There seems no compelling reason to limit the number of innovations to n in spatial modelling, and (2.9) cannot cover forms of long-range dependence, unless somehow the sums Σ_{j=1}^n aij are permitted to increase in n without bound, which is typically ruled out in the SAR literature.

3. Kernel regression estimate

We introduce a kernel function K(u) : R^d → R, satisfying at least
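As an illustration of (2.9)–(2.10), the simplest SAR specification takes m1 = 1, m2 = 0, so U = (In − ωW)^{−1} σε maps n i.i.d. innovations into n dependent disturbances. The ring-shaped weight matrix below is a hypothetical example chosen for transparency, not a specification from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 100
# Hypothetical row-normalised weight matrix on a ring: each unit's
# neighbours are the two adjacent units.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

omega, sigma = 0.6, 1.0        # |omega| < 1 keeps I - omega*W nonsingular here
eps = rng.standard_normal(n)

# (2.10) with m1 = 1, m2 = 0:  (I_n - omega*W) U = sigma * eps
A = np.linalg.inv(np.eye(n) - omega * W)
U = A @ (sigma * eps)          # n dependent disturbances from n i.i.d. innovations
```

The rows of A play the role of the weights aij in (2.9): each disturbance is a linear combination of all n innovations, with influence decaying in ring distance.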
∫_{R^d} K(u) du = 1.  (3.1)

The Nadaraya–Watson kernel estimate of g(x), for a given x ∈ R^d, is

ĝ(x) = ĝn(x) = v̂(x)/f̂(x),  (3.2)

where

f̂(x) = f̂n(x) = (nh^d)^{−1} Σ_{i=1}^n Ki(x),  v̂(x) = v̂n(x) = (nh^d)^{−1} Σ_{i=1}^n Yi Ki(x),  (3.3)

with

Ki(x) = Kin(x) = K((x − Xi)/h),  (3.4)
and h = hn is a scalar, positive bandwidth sequence, such that h → 0 as n → ∞. Classically, the literature is concerned with a sequence Xi of identically distributed variables, having probability density f(x), with Xi observed at i = 1, …, n, so fi(x) ≡ f(x). In this case f̂(x) estimates f(x), and v̂(x) estimates g(x)f(x), so that ĝ(x) estimates g(x). The last conclusion results also in our possibly non-identically distributed, triangular array setting, because under suitable additional conditions,
f̂(x) − f̄(x) →p 0,  (3.5)

v̂(x) − g(x) f̄(x) →p 0,  (3.6)

where

f̄(x) = f̄n(x) = (1/n) Σ_{i=1}^n fi(x).  (3.7)

It follows from (3.5) and (3.6) and Slutsky’s theorem that

ĝ(x) →p g(x),  (3.8)

so long as lim_{n→∞} f̄(x) > 0. In fact though we establish (3.5), we do not employ this result in establishing (3.8), but instead a more subtle argument (that avoids continuity of f̄(x)). The consistency
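A minimal numerical sketch of the estimate (3.2)–(3.4) for d = 1, using a Gaussian kernel and an i.i.d. design (the regression g(x) = sin x and all constants are illustrative; the paper’s point is that the same formula remains consistent under the spatial dependence described above):

```python
import numpy as np

rng = np.random.default_rng(3)

def nw_estimate(x0, X, Y, h):
    """Nadaraya-Watson estimate g_hat(x0) = v_hat(x0)/f_hat(x0) as in
    (3.2)-(3.4), for d = 1 with a Gaussian kernel."""
    K = np.exp(-0.5 * ((x0 - X) / h) ** 2) / np.sqrt(2.0 * np.pi)
    f_hat = np.mean(K) / h        # (n h)^{-1} sum_i K_i(x0)
    v_hat = np.mean(Y * K) / h    # (n h)^{-1} sum_i Y_i K_i(x0)
    return v_hat / f_hat

# Illustrative design: g(x) = sin(x) with i.i.d. disturbances.
n, h = 2000, 0.2
X = rng.uniform(-2.0, 2.0, n)
Y = np.sin(X) + 0.3 * rng.standard_normal(n)

g0 = nw_estimate(1.0, X, Y, h)    # target value: sin(1.0)
```

The density in the denominator cancels asymptotically, which is why the same ratio works when the fi vary across i, as in the triangular array setting above.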
results are presented in the following section, with Section 5 then establishing a central limit theorem for ĝ(x).
4. Consistency of kernel regression estimate

We introduce first some conditions of a standard nature on the kernel function K(u).

Assumption A1. K(u) is an even function, and

sup_{u∈R^d} |K(u)| + ∫_{R^d} |K(u)| du < ∞.  (4.1)

Assumption A2(ξ). As ‖u‖ → ∞,

K(u) = O(‖u‖^{−ξ}).  (4.2)

For ξ > d, A2(ξ) plus the first part of A1 implies the second part of A1. Note for future use that, for ε > 0,

sup_{‖u‖≥ε/h} |K(u)| = O(h^ξ).  (4.3)

Assumption A3.

K(u) ≥ 0,  u ∈ R^d.  (4.4)

Assumption A3 excludes higher-order kernels, but can be avoided if conditions on the Xi are slightly strengthened. The following condition on the bandwidth h is also standard.

Assumption A4. As n → ∞,

h + (nh^d)^{−1} → 0.  (4.5)

For ε > 0 define

α(x; ε) = αn(x; ε) = sup_{‖w‖<ε} f̄(x − w).  (4.6)

Assumption A5(x). For some ε > 0,

lim_{n→∞} α(x; ε) < ∞.  (4.7)

Assumption A6(x, y). The joint density of Xi and Xj, fij(x, y) = fijn(x, y), exists for all i, j, and for some ε > 0,

lim_{n→∞} ρ(x, y; ε) = 0,  (4.8)

where

ρ(u, v; ε) = ρn(u, v; ε) = sup_{‖w‖<ε} |m(u − w, v − w)|,  (4.9)

m(u, v) = mn(u, v) = (1/n²) Σ_{i,j=1; i≠j}^n {fij(u, v) − fi(u) fj(v)}.  (4.10)

Assumption A7(x). For some ε > 0,

lim_{n→∞} inf_{‖u‖<ε} f̄(x − u) > 0.  (4.11)

Assumption A8. g is continuous at x, and

lim_{n→∞} (1/n) Σ_{i=1}^n E|g(Xi)| < ∞.  (4.12)

Define

μi(u) = μin(u) = σi²(u) fi(u),  νij(u, v) = νijn(u, v) = σi(u) σj(v) fij(u, v).  (4.13)

Assumption A9(x, y). For some ε > 0,

lim_{n→∞} max_{1≤i≤n} sup_{‖u‖<ε} μi(x − u) < ∞ and lim_{n→∞} max_{1≤i,j≤n} sup_{‖u‖<ε, ‖v‖<ε} νij(x − u, y − v) < ∞;  (4.14)

also

lim_{n→∞} max_{1≤i≤n} E σi²(Xi) < ∞.  (4.15)

Assumption A10. For n = 1, 2, …, {Xi, 1 ≤ i ≤ n} is independent of {Vi, 1 ≤ i ≤ n}, (2.4) and (2.5) hold, and the covariances

γij = γijn = Cov(Vi, Vj),  1 ≤ i, j ≤ n, n = 1, 2, …,  (4.16)

satisfy

lim_{n→∞} (1/n²) Σ_{i,j=1; i≠j}^n γij = 0.  (4.17)

Theorem 1. Let Assumptions A1, A2(ξ) for ξ > 2d, A3, A4, A5(x), A6(x, x), A7, A8, A9(x, x) and A10 hold. Then

ĝ(x) →p g(x),  as n → ∞.  (4.18)

Proof. For any η > 0,

P(|ĝ(x) − g(x)| > η) ≤ P(|(nh^d)^{−1} Σ_{i=1}^n {Yi − g(x)} Ki(x)| > η²) + P(f̂(x) < η).  (4.19)

Using Lemmas 1 and 2, for ζ > η,

P(f̂(x) < η) ≤ P(|f̂(x) − E{f̂(x)}| > ζ − η) ≤ Var{f̂(x)}/(ζ − η)² → 0.  (4.20)

It remains to show that the first probability on the right of (4.19) is negligible. But

(nh^d)^{−1} Σ_{i=1}^n {Yi − g(x)} Ki(x) = r1(x) + r2(x),  (4.21)

where

r1(x) = r1n(x) = (nh^d)^{−1} Σ_{i=1}^n Ui Ki(x),  (4.22)

r2(x) = r2n(x) = (nh^d)^{−1} Σ_{i=1}^n {g(Xi) − g(x)} Ki(x),  (4.23)

whence it remains only to apply Lemmas 3 and 4.
The linear representation (2.7) (or (2.9)) is not imposed in Theorem 1. To provide also a consistency result that avoids finite variance of Vi , while on the other hand strengthening dependence, we employ (2.7). In this setting, σi (Xi ) no longer represents a standard deviation, but simply a scale factor. For D > 0, define ′′ ′′ εi′ = εin′ = εi 1(|εi | ≤ D), εi = εin = εi − εin′ . Assumption A11. (2.7) holds, where, for all n ≥ 1, {Xi , 1 ≤ i ≤ n} is independent of {εi , i ≥ 1}, the εi are independent random
P.M. Robinson / Journal of Econometrics 165 (2011) 5–19
variables with zero mean and
′′
lim sup sup E εi = 0,
(4.24)
D→∞ n≥1 i≥1
sup sup
∞ ∞ − − aij + sup max aij < ∞.
(4.25)
n≥1 1≤i≤n j=1
n≥1 j≥1 i=1
Notice that if the εi do in fact have finite variance (taken to be 1),
γij =
∞ −
aik ajk
(4.26)
k=1
so (cf. (4.17)) under (4.25) lim
n i,j=1
ever proceed from the rightmost expression in (5.4) by establishing asymptotic normality of w −1/2 r1 (x), along with E {| r2 (x)|} = o(w 1/2 ) and f (x) − f (x) →p 0. We begin by showing the latter result, or rather f (xi ) − f (xi ) →p 0 for any finitely many distinct points x1 , . . . , xp in Rd , because our goal in fact is to establish asymptotic joint normality of the appropriately normed vector Gn − G, where
′
, ′ gn (x1 ), . . . , g n ( xp ) . G = Gn = G = g (x1 ), . . . , g (xp )
(5.5)
We introduce a condition on
n 1 −
n→∞
9
1/2 asymptotic joint normality of nhd f ( x) − E f (x) and d 1/2 [ nh v(x) − E { v(x)}], and the properties E f ( x) − f ( x) = −1/2 −1/2 o nhd , E { v(x)} − g (x)f (x) = o nhd . We how
γij < ∞.
(4.27)
β(x; ε) = βn (x; ε) = sup f (x − w) − f (x) .
Theorem 2. Let Assumptions A1, A2(ξ ) for ξ > 2d, A3, A4, A5(x), A6(x, x), A7, A8, A9(x, x) and A11 hold. Then
g (x) →p g (x),
as n → ∞.
(4.28)
Proof. Identical to that of Theorem 2, but Lemma 5 is used in place of Lemma 4.
Assumption B1(x). For all δ > 0 there exists ε > 0 such that for all sufficiently large n
β(x; ε) < δ.
(5.7)
Theorem 3. Let Assumptions A1, A2(ξ ) for ξ > 2d, A4, A6(xi , xi ) and B1(xi ) hold, i = 1, . . . , p. Then
f (xi ) − f (xi ) →p 0,
5. Asymptotic normality of kernel regression estimate
i = 1, . . . , p.
(5.8)
Proof. Follows from Lemmas 1 and 6. Define n h−d −
s = sn =
n2
γii = nh
d −1
,
(5.1)
(5.2)
the third equality in (5.1) using (2.8). Note that as n → ∞, s → 0 under Assumption A4 and t → 0 under Assumption A10. In the present section we establish a central limit theorem for w−1/2 { g (x) − g (x)}, where w = wn = s if t = O(s) and w = t if s = o(t ), as n → ∞. Under our conditions, s decays at the standard
−1
rate nhd , whereas t is zero when the Vi are uncorrelated, and decays faster than s when Vi is short-range dependent, but slower when Vi is long-range dependent. The modulus operator in t is to ensure non-negativity; however, because n −
n −
γii +
γij = Var
i,j=1;i̸=j
i=1
n − Vi
≥ 0,
(5.3)
i=1
the second term on the left cannot be negative when s = o(t ), so when the t normalization applies the modulus operator is unnecessary. We have
g (x) − g (x) =
v(x) − g (x) f (x) r1 (x) r 2 ( x) = + , f ( x) f ( x) f (x)
(5.4)
where r1 (x), r2 (x) are given in (4.22), (4.23). There is a basic difference in our method of proof from that in, say, Robinson (1983). There (where the assumptions imply that g (x) is
nhd
1/2
-consistent) the delta method was used, after es-
tablishing asymptotic joint normality of and
Assumption B2(x). g satisfies a Lipschitz condition of degree θ ∈ (0, 1] in a neighbourhood of x. Assumption B3. For the same θ as in Assumption B2(x), h2θ /w → 0 as n → ∞.
i=1
n 1 − t = tn = 2 γij , n i,j=1 i̸=j
(5.6)
‖w‖<ε
nhd
1/2 f (x) − f (x)
1/2 nhd v(x) − g (x)f (x) , which in turn follows from
Assumption B4. (2.7) and (2.8) hold, where for all n, {Xi , 1 ≤ i ≤ n} is independent of {εi , i ≥ 1}, and the εi are independent random variables with zero mean, unit variance, and
′′
lim sup sup E εi 2
D→∞ n≥1 i≥1
= 0,
(5.9)
′′
where εi is as defined before Assumption A11. Define
2 2 n ∞ n − − − b = bn = sup aij / aij .
j≥1
i =1
j =1
(5.10)
i =1
Assumption B5. lim sup
n −
n→∞ j≥1 i=1
a2ij < ∞,
(5.11)
and lim b → 0.
(5.12)
n→∞
Assumption B6. When s = o(t ), n −
γij γik = o(n3 t ),
as n → ∞.
(5.13)
i,j,k=1
P.M. Robinson / Journal of Econometrics 165 (2011) 5–19

Assumption B7. The densities $f_i$ of $X_i$ ($1 \le i \le n$), $f_{i_1 i_2}$ of $(X_{i_1}, X_{i_2})$ ($1 \le i_1 < i_2 \le n$), $f_{i_1 i_2 i_3}$ of $(X_{i_1}, X_{i_2}, X_{i_3})$ ($1 \le i_1 < i_2 < i_3 \le n$), and $f_{i_1 i_2 i_3 i_4}$ of $(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})$ ($1 \le i_1 < i_2 < i_3 < i_4 \le n$) are bounded uniformly in large $n$ in neighbourhoods of all combinations of arguments $x_1, \dots, x_p$.

Assumption B8(x, y). For all $\delta > 0$ there exists $\varepsilon > 0$ such that for all sufficiently large $n$
$$\max_{1 \le i \le n} \sup_{\|w\| < \varepsilon} |f_i(x - w) - f_i(x)| < \delta, \qquad (5.14)$$
$$\max_{1 \le i \le n} \sup_{\|w\| < \varepsilon} |\sigma_i^2(x - w) - \sigma_i^2(x)| < \delta, \qquad (5.15)$$
$$\max_{1 \le i < j \le n} \sup_{\|w\| < \varepsilon,\, \|z\| < \varepsilon} |f_{ij}(x - w, y - z) - f_{ij}(x, y)| < \delta. \qquad (5.16)$$

Assumption B9(x). For some $\varepsilon > 0$,
$$\lim_{n \to \infty} \max_{1 \le i \le n} \Big[ \sup_{\|u\| < \varepsilon} \sigma_i^4(x - u) + E\,\sigma_i^4(X_i) \Big] < \infty. \qquad (5.17)$$

Define also
$$\rho_{i_1 i_2}(u_1, u_2, \varepsilon) = \rho_{i_1 i_2 n}(u_1, u_2, \varepsilon) = \sup_{\|v_s\| < \varepsilon,\, s = 1, 2} \big| f_{i_1 i_2}(u_1 - v_1, u_2 - v_2) - f_{i_1}(u_1 - v_1) f_{i_2}(u_2 - v_2) \big|, \qquad (5.18)$$
$$\rho_{i_1 i_2 i_3}(u_1, u_2, u_3, \varepsilon) = \rho_{i_1 i_2 i_3 n}(u_1, u_2, u_3, \varepsilon) = \sup_{\|v_s\| < \varepsilon,\, s = 1, 2, 3} \big| f_{i_1 i_2 i_3}(u_1 - v_1, u_2 - v_2, u_3 - v_3) - f_{i_1}(u_1 - v_1) f_{i_2 i_3}(u_2 - v_2, u_3 - v_3) \big|, \qquad (5.19)$$
$$\rho_{i_1 i_2 i_3 i_4}(u_1, u_2, u_3, u_4, \varepsilon) = \rho_{i_1 i_2 i_3 i_4 n}(u_1, u_2, u_3, u_4, \varepsilon) = \sup_{\|v_s\| < \varepsilon,\, s = 1, \dots, 4} \big| f_{i_1 i_2 i_3 i_4}(u_1 - v_1, u_2 - v_2, u_3 - v_3, u_4 - v_4) - f_{i_1 i_2}(u_1 - v_1, u_2 - v_2) f_{i_3 i_4}(u_3 - v_3, u_4 - v_4) \big|. \qquad (5.20)$$

Assumption B10. For $(u_1, u_2, u_3, u_4) = (x_{i_1}, x_{i_2}, x_{i_3}, x_{i_4})$, $i_s = 1, \dots, p$, $s = 1, \dots, 4$,
$$\lim_{n \to \infty} \frac{1}{n^2} {\sum_{i_s = 1,\, s = 1, 2}^{n}}{}' \rho_{i_1 i_2}(u_1, u_2, \varepsilon) = 0, \qquad (5.21)$$
$$\lim_{n \to \infty} \frac{1}{n^2 t} {\sum_{i_s = 1,\, s = 1, 2, 3}^{n}}{}' \gamma_{i_2 i_3}\, \rho_{i_1 i_2 i_3}(u_1, u_2, u_3, \varepsilon) = 0, \qquad (5.22)$$
$$\lim_{n \to \infty} \frac{1}{n^4 t^2} {\sum_{i_s = 1,\, s = 1, \dots, 4}^{n}}{}' \gamma_{i_1 i_2} \gamma_{i_3 i_4}\, \rho_{i_1 i_2 i_3 i_4}(u_1, u_2, u_3, u_4, \varepsilon) = 0, \qquad (5.23)$$
where each primed sum omits terms with any equalities between the relevant $i_s$.

Assumption B11.
$$\lim_{n \to \infty} \sum_{k=1}^{\infty} \Big| \sum_{i,j=1;\, i \ne j}^{n} a_{ik} a_{jk} \Big| \Big/ \sum_{k=1}^{\infty} \sum_{i,j=1;\, i \ne j}^{n} a_{ik} a_{jk} < \infty. \qquad (5.24)$$

Assumption B12. There exists a function $\varphi(x)$, $x \in \mathbb{R}^d$, such that
$$\bar f(x) \sim \varphi(x), \quad \text{as } n \to \infty. \qquad (5.25)$$
When $t = O(s)$ there exists a function $\lambda(x)$, $x \in \mathbb{R}^d$, such that
$$\frac{1}{n} \sum_{i=1}^{n} \mu_i(x) \sim \lambda(x), \quad \text{as } n \to \infty; \qquad (5.26)$$
when $s = O(t)$ there exists a function $\psi(x, y)$, $x, y \in \mathbb{R}^d$, such that
$$\frac{1}{n^2 t} \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij}\, \nu_{ij}(x, y) \sim \psi(x, y), \quad \text{as } n \to \infty. \qquad (5.27)$$
Denote $\Phi = \mathrm{diag}\{\varphi(x_1), \dots, \varphi(x_p)\}$, $\Lambda = \mathrm{diag}\{\lambda(x_1), \dots, \lambda(x_p)\}$, and by $\Psi$ the $p \times p$ matrix with $(i, j)$th element $\psi(x_i, x_j)$. Define also
$$\kappa = \int_{\mathbb{R}^d} K^2(u)\, du, \qquad (5.28)$$
which is finite under Assumption A1.

Assumption B13(q). $\mathrm{Rank}\{\Psi\} = q \in [1, p]$.

The reason for allowing $q < p$ is discussed in the following section. Define by $J^{(q)}$ any $q \times p$ matrix formed from $q$ rows of $I_p$; for given $q$ there are $\binom{p}{q}$ of these, but we do not distinguish between them in our notation.

Theorem 4. Let Assumptions A1, A2($\xi$) for $\xi > 4d$, A4, A7($x_i$), A8, A9($x_i, x_j$), A10, B1($x_i$), B2($x_i$), B3–B5, B7, B8($x_i, x_j$), B9($x_i$), B10–B12 and B13(q) hold, $i, j = 1, \dots, p$.
(i) If also $t = o(s)$,
$$s^{-1/2}(\hat G - G) \to_d N(0,\, \kappa \Phi^{-1} \Lambda \Phi^{-1}), \quad \text{as } n \to \infty; \qquad (5.29)$$
(ii) if also $t \sim \chi s$,
$$s^{-1/2}(\hat G - G) \to_d N(0,\, \Phi^{-1}(\kappa \Lambda + \chi \Psi)\Phi^{-1}), \quad \text{as } n \to \infty; \qquad (5.30)$$
(iii) if also $s = o(t)$, for any $J^{(q)}$,
$$t^{-1/2} J^{(q)}(\hat G - G) \to_d N(0,\, J^{(q)} \Phi^{-1} \Psi \Phi^{-1} J^{(q)\prime}), \quad \text{as } n \to \infty. \qquad (5.31)$$

Proof. From (5.4),
$$w^{-1/2}(\hat g - g) = \hat f^{-1} w^{-1/2}(\hat r_1 + \hat r_2), \qquad (5.32)$$
where
$$\hat f = \hat f_n = \mathrm{diag}\{\hat f(x_1), \dots, \hat f(x_p)\}, \qquad \hat r_i = \hat r_{in} = (\hat r_i(x_1), \dots, \hat r_i(x_p))', \quad i = 1, 2. \qquad (5.33)$$
We deduce $\hat f \to_p \Phi$ from Lemmas 1 and 6, and $\hat r_2 = o_p(w^{1/2})$ from Lemma 7 and Assumptions B2($x_i$), $i = 1, \dots, p$, and B3. We have now to allow for the possibility that $\Psi$ is singular, i.e. $q < p$. This only affects part (iii) of the theorem, but no generality is lost by giving a single proof for all three cases for $J^{(q)}(\hat G - G)$. Define
$$\Sigma = \kappa \Lambda \text{ if } t = o(s); \quad = \Psi \text{ if } s = o(t); \quad = \kappa \Lambda + \chi \Psi \text{ if } t \sim \chi s,$$
and thus define
$$\Sigma^{(q)} = J^{(q)} \Sigma J^{(q)\prime}, \qquad \hat r_1^{(q)} = J^{(q)} \hat r_1. \qquad (5.34)$$
It remains to prove
$$w^{-1/2}\, \hat r_1^{(q)} \to_d N(0, \Sigma^{(q)}). \qquad (5.35)$$
We can write
$$\hat r_1^{(q)} = \sum_{j=1}^{\infty} Z_j^{(q)} \varepsilon_j, \qquad Z_j^{(q)} = Z_{jn}^{(q)} = \frac{1}{n h^d} \sum_{i=1}^{n} J^{(q)} K_i\, \sigma_i(X_i)\, a_{ij}, \qquad (5.36)$$
where
$$K_i = (K_i(x_1), \dots, K_i(x_p))'. \qquad (5.37)$$
For positive integer $N = N_n$, increasing with $n$, define
$$\hat r_1^{(q)*} = \hat r_{1n}^{(q)*} = \sum_{j=1}^{N} Z_j^{(q)} \varepsilon_j, \qquad \hat r_1^{(q)\#} = \hat r_{1n}^{(q)\#} = \hat r_1^{(q)} - \hat r_1^{(q)*}. \qquad (5.38)$$
By Lemma 9, there exists an $N$ sequence such that $\hat r_1^{(q)\#} = o_p(w^{1/2})$. For such $N$, consider
$$T = T_n = E\{\hat r_1^{(q)*} \hat r_1^{(q)*\prime} \mid X\} = \sum_{j=1}^{N} Z_j^{(q)} Z_j^{(q)\prime}, \qquad (5.39)$$
and introduce a $q \times q$ matrix $P = P_n$ such that $T = PP'$. For $n$ large enough, $T$ is positive definite under our conditions. For a $q \times 1$ vector $\tau$ such that $\tau'\tau = 1$, write
$$c^* = c_n^* = \tau' P^{-1} \hat r_1^{(q)*}, \qquad (5.40)$$
so $E\,c^{*2} = 1$. We show that, conditionally on $\{X_1, \dots, X_n;\ \text{all } n \ge 1\}$,
$$c^* \to_d N(0, 1), \qquad (5.41)$$
whence by the Cramér–Wold device,
$$P^{-1} \hat r_1^{(q)*} \to_d N(0, I_q), \qquad (5.42)$$
which implies unconditional convergence; then for a $q \times q$ matrix $\Pi$ such that $\Sigma^{(q)} = \Pi \Pi'$ it follows that
$$w^{-1/2}\, \Pi^{-1} \hat r_1^{(q)*} \to_d N(0, I_q) \qquad (5.43)$$
if $w^{-1/2} \Pi^{-1} P$ converges in probability to an orthogonal matrix, which is implied if $w^{-1} \Pi^{-1} P P' \Pi^{-1\prime} \to_p I_q$, i.e. if
$$w^{-1} T \to_p \Sigma^{(q)}. \qquad (5.44)$$
But
$$E\{T\} = E\{\hat r_1^{(q)} \hat r_1^{(q)\prime}\} + E\{\hat r_1^{(q)\#} \hat r_1^{(q)\#\prime} - \hat r_1^{(q)} \hat r_1^{(q)\#\prime} - \hat r_1^{(q)\#} \hat r_1^{(q)\prime}\}, \qquad (5.45)$$
and the norm of the final expectation is $o(w)$ by the Schwarz inequality and Lemmas 8 and 9, while $w^{-1} E\{\hat r_1^{(q)} \hat r_1^{(q)\prime}\} \to \Sigma^{(q)}$ from Lemma 8. Lemma 10 completes the proof of (5.44). To prove (5.41), write
$$c^* = \sum_{j=1}^{N} z_j^{(q)} \varepsilon_j, \qquad z_j^{(q)} = z_{jn}^{(q)} = \tau' P^{-1} Z_j^{(q)}. \qquad (5.46)$$
Since $\{z_j^{(q)} \varepsilon_j\}$ is a martingale difference sequence, and
$$\sum_{j=1}^{N} z_j^{(q)2} = 1, \qquad (5.47)$$
(5.41) follows, from e.g. Scott (1973), if, for any $\eta > 0$,
$$\sum_{j=1}^{N} E\big\{ z_j^{(q)2} \varepsilon_j^2\, 1(|z_j^{(q)} \varepsilon_j| > \eta) \mid X_1, \dots, X_n \big\} \to_p 0, \qquad (5.48)$$
as $n \to \infty$. Now for any $r > 0$, $\{|z_j^{(q)} \varepsilon_j| > \eta\} \subset \{\varepsilon_j^2 > \eta/r\} \cup \{z_j^{(q)2} > \eta r\}$, so by independence of the $\varepsilon_j$ and $X_1, \dots, X_n$ the left side is bounded by
$$\sum_{j=1}^{N} z_j^{(q)2}\, E\{\varepsilon_j^2\, 1(\varepsilon_j^2 > \eta/r)\} + \sum_{j=1}^{N} z_j^{(q)2}\, 1(z_j^{(q)2} > \eta r), \qquad (5.49)$$
which, from (5.47), is bounded by
$$\max_{j \ge 1} E\{\varepsilon_j^2\, 1(\varepsilon_j^2 > \eta/r)\} + \frac{1}{\eta r} \sum_{j=1}^{N} z_j^{(q)4}. \qquad (5.50)$$
The first term can be made arbitrarily small for small enough $r$, while the second is $o_p(1)$ by Lemma 11.

6. Discussion

Under conditions motivated by spatial or spatio-temporal settings, we have established consistency of the Nadaraya–Watson estimate under relatively broad conditions, and asymptotic normality under stronger conditions. Our discussion focusses on the relevance of some of the conditions, and some implications of these results.

1. Assumption A5(x) is implied by $\lim_{n\to\infty} \max_{1\le i\le n} \sup_{\|w\|<\varepsilon} f_i(x-w) < \infty$, and with identically distributed $X_i$ both requirements are equivalent to boundedness of $f$ in a neighbourhood of $x$. Other conditions on $X_i$ likewise simplify. In the case of a regularly-spaced stationary time series $\{X_i\}$, we have $f_{ij}(x, y) = f_{|i-j|}(x, y)$, so
$$m(x, y) = \frac{2}{n} \sum_{j=1}^{n-1} \Big(1 - \frac{j}{n}\Big) \{f_j(x, y) - f(x) f(y)\}. \qquad (6.1)$$
In this setting, with $d = 1$, Castellana and Leadbetter (1986) established pointwise consistency of kernel probability density estimates for regularly-spaced time series under the assumption
$$\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} \sup_{x, y \in \mathbb{R}^d} |f_j(x, y) - f(x) f(y)| = 0. \qquad (6.2)$$
Even after adding stationarity to our conditions, Assumption A6(x, y), used for consistency of density (Theorem 3) as well as regression (Theorems 1 and 2) estimates, is milder than (6.2), both because in (6.2) the modulus is inside the summation and because the supremum is over all $x, y \in \mathbb{R}^d$. Castellana and Leadbetter (1986) showed that (6.2) holds in the case of a scalar Gaussian process $\{X_i\}$ with lag-$j$ autocovariance tending to zero as $j \to \infty$, and thus covered arbitrarily strong long range dependence. These observations extend to the vector case with $d > 1$, whence Assumption A6(x, y) is satisfied also. Moreover, (5.21)–(5.23) of Assumption B10 are of a similar character, because all involve convergence to zero, with no rate, of weighted averages of density-based measures of dependence (though suprema are now inside the summations). If we employed instead an asymptotic normality proof for the $\hat f(x_i)$ in proving Theorem 4, a possibility mentioned in Section 5, density-based conditions on $X_i$ would, however, have to entail rates, as indeed would mixing conditions. (Castellana and Leadbetter, 1986, like other authors, used mixing conditions in their central limit theorem for density estimates from stationary time series.)
2. With respect to the conditions on $V_i$, the requirement A10 for consistency in Theorem 1 allows an arbitrarily slow decay in contributions from $\gamma_{ij}$ as, say, $|i - j|$ diverges, and
under stationarity is satisfied by arbitrarily strong long range dependence. On the other hand, avoiding a second moment in Theorem 2 rules out long range dependence, while easily covering conventional forms of weak dependence. Asymptotic normality in Theorem 4 also permits general long range dependence, though the actual strength of this in part determines the precise outcome, including the convergence rate; see parts (i)–(iii). Assumption B11 means that changing the sign of any negative $a_{ij}$ would not make $t$ decay slower, and could be satisfied more generally if the $a_{ij}$ are eventually positive as $j$ increases. The Lindeberg condition (5.12) was checked for linear time series with arbitrarily strong dependence in the central limit theorem for the sample mean by Ibragimov and Linnik (1971), and for fixed-design nonparametric time series regression by Robinson (1997) (where, incidentally, the kind of trichotomy observed in parts (i)–(iii) of Theorem 4 in our stochastic design setting does not occur). Assumption B6 is a very mild additional decay restriction on the $\gamma_{ij}$.
3. While overall our conditions on $\{V_i\}$ and $\{X_i\}$ are neither stronger nor weaker than those employed under a mixing framework, in some respects ours provide a more precise tool. Theorem 4 indicates exactly when the usual kind of result with $(nh^d)^{1/2}$ convergence ceases to apply, whereas a mixing rate can usually only be interpreted as sufficient. On the other hand, the linearity of $V_i$ plays an important role in the achievement of asymptotic normality throughout Theorem 4, despite possible long range dependence. Extending the nonlinear-functions-of-Gaussian-processes conditions employed in the long range dependent time series literature would sometimes yield non-normal limits, especially given our allowance for strong dependence in $X_i$.
4. It is important to stress that the question of which of parts (i)–(iii) of Theorem 4 is relevant depends on $h$, as well as the strength of dependence in $V_i$.
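Which part of Theorem 4 applies is governed by the growth of $\sum_{i \ne j} \gamma_{ij}$. A small numerical sketch (with hypothetical correlation decays chosen for illustration, not taken from the paper) contrasts weak dependence, where the double sum is $O(n)$, with long range dependence, where it grows faster:

```python
import numpy as np

def gamma_sum(n, gamma):
    # sum_{i != j} gamma(|i - j|) for a stationary correlation sequence
    lags = np.arange(1, n)
    return 2.0 * np.sum((n - lags) * gamma(lags))

n_values = [200, 400, 800]
# weak dependence: geometrically decaying correlations, so the sum is O(n)
weak = [gamma_sum(n, lambda j: 0.5 ** j) for n in n_values]
# long range dependence: gamma_ij ~ |i - j|^{-0.4}, so the sum grows like n^{1.6}
strong = [gamma_sum(n, lambda j: j ** -0.4) for n in n_values]

for n, w, s in zip(n_values, weak, strong):
    print(n, w / n, s / n)  # w/n settles down; s/n keeps growing
```

In the first case the sum per observation stabilises, so part (i) is relevant for any admissible bandwidth; in the second it diverges, so parts (ii) or (iii) can take over depending on $h$.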
More precisely, under the weak dependence condition $\sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} = O(n)$, part (i) is relevant however we choose $h$, subject to Assumptions A4 and B3, but not otherwise. Putting the conditions together, (5.29) occurs when
$$\frac{h^d}{n} \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} + \frac{1}{n h^d} + n h^{d + 2\theta} \to 0, \quad \text{as } n \to \infty; \qquad (6.3)$$
(5.30) occurs when
$$h \sim \Big( \chi n \Big/ \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} \Big)^{1/d}, \quad \chi > 0, \qquad \frac{1}{n^2} \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} + n^{2(1 + \theta/d)} \Big/ \Big( \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} \Big)^{1 + 2\theta/d} \to 0, \quad \text{as } n \to \infty; \qquad (6.4)$$
and (5.31) occurs when
$$n \Big/ h^d \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} + \frac{1}{n^2} \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} + n^2 h^{2\theta} \Big/ \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} \to 0, \quad \text{as } n \to \infty. \qquad (6.5)$$
These conditions also reflect the dimension $d$ of $X_i$, and the curse of dimensionality is always of concern with smoothed nonparametric estimation; Gao et al. (2006) and Lu et al. (2007) consider semiparametric and additive models, respectively, in different spatial settings from ours.
5. Our results do not directly address the issue of bandwidth choice, which is always of practical concern in smoothed nonparametric estimation, though they have some implications for it. By adding a bias calculation under twice differentiability of $g$ to the variance implications of part (i) of Theorem 4, we can reproduce the usual minimum-mean-squared-error optimal rate for $h$ of $n^{-1/(d+4)}$. As our results stand, we do not exploit this degree of smoothness (see Assumptions B2 and B3), and to do so would require a stronger restriction on the dependence in $X_i$, similar to that mentioned in the penultimate sentence of point 1 above. However, the condition $t \sim \chi s$ for part (ii) of Theorem 4 prescribes a rate for $h$ which ignores bias, while the part (iii) convergence rate does not directly depend on $h$, so cannot contribute to an optimal bandwidth calculation.
6. Part (i) of Theorem 4 reproduces the classical asymptotic independence across distinct fixed points familiar from the settings of independent observations and mixing time series. Since $t = o(s)$ entails $\sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} = o(n/h^d)$, the result nevertheless also holds under some degree of long range dependence in $V_i$, while, at least in the Gaussian case, strong versions of long range dependence in $X_i$ are permitted. The allowance for non-stationarity in both processes leads to a more complicated form of asymptotic variance, the $i$th diagonal element of $\kappa \Phi^{-1} \Lambda \Phi^{-1}$ reducing to the familiar $\kappa \sigma^2(x_i)/f(x_i)$ under identity of distribution. To carry out inference, such as setting confidence regions, we need to consistently estimate the diagonal elements; the extent to which this is possible in our more general setting would require discussion that goes beyond the scope of the present paper.
7. Consistent variance estimation is an even more challenging proposition in parts (ii) and (iii) of Theorem 4, in part due to the difficulty of estimating $t$. This in turn stems in part from the possible non-stationarity of $V_i$, and estimating $t$ would require imposing some additional structure to limit this. It also stems from the implied long range dependence in $V_i$ in both parts (ii) and (iii); note that $t$ is analogous to quantities arising in the ''HAC'' literature of econometrics, which extends earlier statistical literature (see e.g. Hannan, 1957 and Eicker, 1963), but there weak dependence is typically assumed, in which case we are back to part (i). However, at least for stationary $V_i$, proposals of Robinson (2005) in the long range dependent time series case may be extendable. The results in parts (ii) and (iii) are much less attractive for practical use due to the non-diagonality of $\Psi$, and even less so than immediately meets the eye. Notice that if the $X_i$ are i.i.d., $\Psi$ has unit rank for all $p$, so estimates are undesirably perfectly correlated. The same kind of outcome was observed by Robinson (1991) in kernel probability density estimates from long range dependent time series data. Unit rank could result more generally: under similar conditions on $X_i$ to those in Assumption B10, the numerator of the left side of (5.27) differs from
$$\sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij}\, \sigma_i(x) \sigma_j(y) f_i(x) f_j(y) \qquad (6.6)$$
by $o\big(\sum_{i,j=1}^{n} \gamma_{ij}\big)$. Then $\Psi$ has unit rank for all $p$ if the $X_i$ are identically distributed, and is possibly not well conditioned more generally. Nevertheless it is of interest that here non-identity of distribution has the potential to desirably increase rank.
8. Consider some implications of our setting for data observed on a rectangular lattice. For simplicity we focus on a 2-dimensional lattice, where observations are recorded at points $(t_1, t_2)$, for $t_1 = 1, \dots, n_1$, $t_2 = 1, \dots, n_2$, so $n = n_1 n_2$ (though the interval of observation can differ across dimensions).
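For the rectangular-lattice setting, the row-major correspondence between lattice points $(t_1, t_2)$ and a single index $i$ with $n = n_1 n_2$ can be sketched as follows (a minimal illustration, not code from the paper):

```python
def to_index(t1, t2, n2):
    # i = n2*(t1 - 1) + t2, for t1 = 1..n1, t2 = 1..n2
    return n2 * (t1 - 1) + t2

def to_lattice(i, n2):
    # inverse map: recover (t1, t2) from i = 1..n1*n2
    t1, t2 = divmod(i - 1, n2)
    return t1 + 1, t2 + 1

n1, n2 = 3, 4
pairs = [(t1, t2) for t1 in range(1, n1 + 1) for t2 in range(1, n2 + 1)]
assert [to_index(t1, t2, n2) for (t1, t2) in pairs] == list(range(1, n1 * n2 + 1))
assert all(to_lattice(to_index(t1, t2, n2), n2) == (t1, t2) for (t1, t2) in pairs)
```

The map is a bijection between the $n_1 n_2$ lattice points and $\{1, \dots, n\}$, which is all the triangular-array indexing requires.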
We can regard either or both of $n_1, n_2$ as increasing with $n$. The $(t_1, t_2)$th observation can be indexed by $i = n_2(t_1 - 1) + t_2$, say, in our triangular array setting. With this correspondence, consider a process such that $V_{in} = v(t_1, t_2)$. We might define $v(t_1, t_2)$ for $t_1, t_2 = 0, \pm 1, \dots$, and take it to be stationary. Then under broad conditions $v(t_1, t_2)$ has a ''half-plane'' linear representation in terms of orthogonal, homoscedastic innovations, analogous to the Wold representation for time series (see e.g. Whittle, 1954). However, for Theorems 2 and 4 we require a linear representation for $V_i$ in independent innovations, so for non-Gaussian $v(t_1, t_2)$ a general, multilateral representation would be preferred, namely
$$v(t_1, t_2) = \sum_{j_1, j_2 = -\infty}^{\infty} b(t_1 - j_1, t_2 - j_2)\, e(j_1, j_2), \qquad (6.7)$$
with independent $e(j_1, j_2)$ and square-summable $b(j_1, j_2)$. To produce a correspondence with (2.7) we might read off the $j_1, j_2$ in a kind of spiral fashion: taking $j = 1$ when $(j_1, j_2) = (0, 0)$, then $j = 2, \dots, 9$ correspond to the points $(-1, -1), (-1, 0), \dots$, going clockwise around the square with vertices $(\pm 1, \pm 1)$, with $j = 10, \dots, 25$ generated analogously from the square with vertices $(\pm 2, \pm 2)$, and so on. If a ''half-plane'' representation is desired we simply omit the relevant points on each square. A correspondence between the $a_{ijn}$ and moving average weights $b(t_1 - j_1, t_2 - j_2)$ then follows. Now the $b(t_1 - j_1, t_2 - j_2)$ might be chosen to be the moving average weights in unilateral or multilateral spatial analogues of autoregressive moving averages (see e.g. Whittle, 1954, Guyon, 1982 and Robinson and Vidal Sanz, 2006). These models have the weak dependence to place them firmly in the setting of part (i) of Theorem 4. But (6.7) can also describe long range dependence, in either or both dimensions, so parts (ii) and (iii) of Theorem 4 can also be relevant. Notice that
$$\sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} = n {\sum_{j_i = 1 - n_i,\, i = 1, 2}^{n_i - 1}}{}' \Big(1 - \frac{|j_1|}{n_1}\Big)\Big(1 - \frac{|j_2|}{n_2}\Big)\, \mathrm{Cov}\{v(0, 0), v(j_1, j_2)\},$$
where the primed sum omits $(j_1, j_2) = (0, 0)$. Tran (1990) and Hallin et al. (2001, 2004b), for example, established consistency results for kernel density estimates with lattice spatial data, under different conditions from those in Theorem 3.
9. Data that are irregularly-observed in space and/or time pose far greater problems in both computation and deriving asymptotic theory for many statistical procedures. Describing a model for irregularly-spaced observations from an underlying continuous model can be difficult even in the time series case; from a first-order stochastic differential equation, Robinson (1977) derived a (time-varying autoregressive) model, but this kind of outcome is difficult to extend to higher-order models, or spatial processes. Nonparametric regression, on the other hand, is readily applied, though detailed checking of our conditions would be far more difficult than in the regularly-spaced circumstances just described, especially as observation locations might be regarded as stochastically rather than deterministically generated. However, at least some formal correspondence with our triangular array setting can be constructed. Suppose we have an underlying Gaussian process, where again for simplicity we consider 2 dimensions only, and denote it $v(t_1, t_2)$, though now $t_1$ and $t_2$ might be real-valued. Then with some ordering the $n$ $v(t_1, t_2)$ become the $n$ $V_i$. Moreover, (conditionally on the locations when these are stochastic) the $V_i$ are Gaussian. Denote by $\Xi$ the covariance matrix of the vector $V = (V_1, \dots, V_n)'$. In a general irregularly-spaced framework $\Xi$ has little structure to exploit. However, we can write $\Xi = AA'$, where, for positive definite $\Xi$, the $n \times n$ matrix $A$ is uniquely defined only up to premultiplication by an orthogonal matrix. Due to Gaussianity we can write $V = A\eta$, where $\eta \sim N(0, I_n)$. We deduce (2.7), indeed (2.9), by taking $a_{ij}$ to be the $(i, j)$th element of $A$ and $\varepsilon = \eta$.
10. Processes in the SAR class (2.10) are more directly placed in our framework. Consider the special case $m_1 = 1$, $m_2 = 0$ of (2.10), i.e.
$$(I_n - \omega W) U = \sigma \varepsilon, \quad 0 < |\omega| < 1, \qquad (6.8)$$
with nonstochastic $n \times n$ $W = W_n$ having row sums normalised to 1. Note that (6.8) generally implies unconditional heteroscedasticity in the $U_i$, as can be covered by our $\sigma_i(X_i) = \sigma_i$. As noted by Lee (2002), it follows that $S = S_n = I_n - \omega W$ is non-singular, and thus we have a solution to (6.8) of form $U_i = \sum_{j=1}^{n} b_{ij} \varepsilon_j$, with $\mathrm{Var}\{U_i\} = \sum_{j=1}^{n} b_{ij}^2 = \sigma_i^2$. Thus on taking $a_{ij} = b_{ij}/\sigma_i$ we have (2.9) with (2.8). Moreover, because $\sum_{i,j=1}^{n} \gamma_{ij} = \sigma^2 \mathbf{1}' S^{-1} S^{-1\prime} \mathbf{1}$, where $\mathbf{1} = \mathbf{1}_n$ is the $n \times 1$ vector of 1's, it follows that if $S^{-1}$ has uniformly bounded column sums (for which a condition is given in Lemma 1 of Lee, 2002), then $\sum_{i,j=1}^{n} \gamma_{ij} = O(n)$, so ''weak dependence'' is implied, and part (i) of Theorem 4 applies.

Acknowledgments

This research was supported by ESRC Grant RES-062-23-0036, a Cátedra de Excelencia at Universidad Carlos III de Madrid, and Spanish Plan Nacional de I+D+I Grant SEJ2007-62908/ECON. The paper has been improved as a result of the helpful comments of two referees, Supachoke Thawornkaiwong and Marcia Schafgans.

Appendix A. Technical lemmas for Section 4

Lemma 1. Let Assumptions A1, A2($\xi$) for $\xi > 2d$, A4, A5(x), and A6(x, x) hold. Then as $n \to \infty$, for some $\varepsilon > 0$,
$$\mathrm{Var}\{\hat q(x)\} = O\big((nh^d)^{-1} \alpha(x; \varepsilon) + n^{-1} h^{2(\xi - d)} + \rho(x, x; \varepsilon) + h^{\xi - 2d}\big) \to 0. \qquad (A.1)$$
Proof. We have
$$\mathrm{Var}\{\hat q(x)\} = \frac{1}{(nh^d)^2} \Big[ \sum_{i=1}^{n} \mathrm{Var}\{K_i(x)\} + \sum_{i,j=1;\, i \ne j}^{n} \mathrm{Cov}\{K_i(x), K_j(x)\} \Big]. \qquad (A.2)$$
The first term in the square brackets is bounded by
$$n \int_{\mathbb{R}^d} K^2\Big(\frac{x - w}{h}\Big) \bar f(w)\, dw = n h^d \int_{\mathbb{R}^d} K^2(u) \bar f(x - hu)\, du = n h^d \Big\{ \int_{\|hu\| < \varepsilon} K^2(u) \bar f(x - hu)\, du + \int_{\|hu\| \ge \varepsilon} K^2(u) \bar f(x - hu)\, du \Big\}, \qquad (A.3)$$
for any $\varepsilon > 0$. The first term in the braces is bounded by $C\alpha(x; \varepsilon)$, where, throughout, $C$ denotes an arbitrarily large generic constant. The second term in the braces is bounded by
$$\sup_{\|u\| \ge \varepsilon/h} K^2(u) \int_{\mathbb{R}^d} \bar f(x - hu)\, du = O(h^{2\xi - d}), \qquad (A.4)$$
so
$$\frac{1}{(nh^d)^2} \sum_{i=1}^{n} \mathrm{Var}\{K_i(x)\} = O\big((nh^d)^{-1} \alpha(x; \varepsilon) + n^{-1} h^{2(\xi - d)}\big). \qquad (A.5)$$
The second term in the square brackets in (A.2) is
$$(nh^d)^2 \Big\{ \int_{J_1(\varepsilon)} + 2\int_{J_2(\varepsilon)} + \int_{J_3(\varepsilon)} \Big\} K(u) K(v)\, m(x - hu, x - hv)\, du\, dv, \qquad (A.6)$$
where $J_1(\varepsilon) = J_{1n}(\varepsilon) = \{u, v : \|hu\| < \varepsilon, \|hv\| < \varepsilon\}$, $J_2(\varepsilon) = J_{2n}(\varepsilon) = \{u, v : \|hu\| < \varepsilon, \|hv\| \ge \varepsilon\}$, $J_3(\varepsilon) = J_{3n}(\varepsilon) = \{u, v : \|hu\| \ge \varepsilon, \|hv\| \ge \varepsilon\}$. The first integral in the braces is bounded by
$$\rho(x, x; \varepsilon) \Big( \int_{\mathbb{R}^d} |K(u)|\, du \Big)^2. \qquad (A.7)$$
The second integral is bounded by
$$2 \sup_{\|v\| \ge \varepsilon/h} |K(v)| \int_{u \in \mathbb{R}^d,\, \|hv\| \ge \varepsilon} |K(u)|\, |m(x - hu, x - hv)|\, du\, dv = O(h^{\xi - 2d}). \qquad (A.8)$$
The third integral is bounded by
$$\sup_{\|u\| \ge \varepsilon/h} |K(u)| \sup_{\|v\| \ge \varepsilon/h} |K(v)| \int_{\mathbb{R}^{2d}} \frac{1}{n^2} \sum_{i,j=1;\, i \ne j}^{n} \{f_{ij}(x - hu, x - hv) + f_i(x - hu) f_j(x - hv)\}\, du\, dv = O(h^{2\xi - 2d}). \qquad (A.9)$$
Thus
$$\frac{1}{(nh^d)^2} \sum_{i,j=1;\, i \ne j}^{n} \mathrm{Cov}\{K_i(x), K_j(x)\} = O(\rho(x, x; \varepsilon) + h^{\xi - 2d}). \qquad (A.10)$$

Lemma 2. Let Assumptions A1, A2($\xi$) for $\xi > d$, A3, A4 and A7(x) hold. Then
$$\lim_{n \to \infty} E\{\hat f(x)\} > 0. \qquad (A.11)$$
Proof. We have
$$E\{\hat f(x)\} \ge \int_{\|u\| < \varepsilon/h} K(u) \bar f(x - hu)\, du - \sup_{\|u\| \ge \varepsilon/h} |K(u)| \int_{\|u\| \ge \varepsilon/h} \bar f(x - hu)\, du \ge \inf_{\|u\| < \varepsilon} \bar f(x - u) \int_{\|u\| < \varepsilon/h} K(u)\, du - O(h^{\xi - d}) > \frac{\zeta}{2},$$
for $n$ sufficiently large and some $\zeta > 0$, by Assumption A7(x).

Lemma 3. Let Assumptions A1, A2($\xi$) for $\xi > d$, A4, A5(x), and A8 hold. Then
$$E|\hat r_2(x)| \to 0, \quad \text{as } n \to \infty. \qquad (A.12)$$
Proof. We have
$$E\Big| \sum_{i=1}^{n} \{g(X_i) - g(x)\} K_i(x) \Big| \le n h^d \int_{\mathbb{R}^d} |K(u)|\, |g(x - hu) - g(x)|\, \bar f(x - hu)\, du$$
$$\le n h^d \sup_{\|u\| \le \varepsilon} |g(x - u) - g(x)| \sup_{\|u\| \le \varepsilon} \bar f(x - u) \int_{\mathbb{R}^d} |K(u)|\, du + n \sup_{\|u\| \ge \varepsilon/h} |K(u)| \Big\{ \frac{1}{n} \sum_{i=1}^{n} E|g(X_i)| + |g(x)| \Big\} \qquad (A.13)$$
$$\le \delta n h^d + C n h^{\xi}, \qquad (A.14)$$
for any $\delta > 0$ and sufficiently large $n$; division by $n h^d$, with $\xi > d$, completes the proof.

Lemma 4. Let Assumptions A1, A2($\xi$) for $\xi > 2d$, A4, A9, A10 hold. Then
$$E\,\hat r_1(x)^2 \to 0, \quad \text{as } n \to \infty. \qquad (A.15)$$
Proof. The left side of (A.15) is
$$\frac{1}{(nh^d)^2} \Big[ \sum_{i=1}^{n} E\{\sigma_i^2(X_i) K_i^2(x)\} + \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij}\, E\{\sigma_i(X_i) \sigma_j(X_j) K_i(x) K_j(x)\} \Big], \qquad (A.16)$$
recalling that $\gamma_{ii} = \mathrm{Var}\{V_i\} = 1$. The first expectation is
$$h^d \Big\{ \int_{\|hu\| < \varepsilon} K^2(u) \mu_i(x - hu)\, du + \int_{\|hu\| \ge \varepsilon} K^2(u) \mu_i(x - hu)\, du \Big\}. \qquad (A.17)$$
The first term in braces is bounded by
$$C \sup_{\|u\| < \varepsilon} \mu_i(x - u). \qquad (A.18)$$
The second term is bounded by
$$\sup_{\|u\| \ge \varepsilon/h} K^2(u) \int_{\mathbb{R}^d} \mu_i(x - hu)\, du \le C h^{2\xi - d} E\,\sigma_i^2(X_i). \qquad (A.19)$$
The second expectation in (A.16) is
$$\int_{\mathbb{R}^{2d}} K\Big(\frac{x - w}{h}\Big) K\Big(\frac{x - z}{h}\Big) \nu_{ij}(w, z)\, dw\, dz = h^{2d} \Big\{ \int_{J_1(\varepsilon)} + 2\int_{J_2(\varepsilon)} + \int_{J_3(\varepsilon)} \Big\} K(u) K(v)\, \nu_{ij}(x - hu, x - hv)\, du\, dv. \qquad (A.20)$$
Proceeding as in the proof of Lemma 1, this is bounded by
$$C h^{2d} \sup_{\|u\| < \varepsilon,\, \|v\| < \varepsilon} \nu_{ij}(x - u, x - v) + h^{2d} \Big\{ 2 \sup_{\|u\| \ge \varepsilon/h} |K(u)| \sup_{v \in \mathbb{R}^d} |K(v)| + \sup_{\|u\| \ge \varepsilon/h} |K(u)|^2 \Big\} \int_{\mathbb{R}^{2d}} \nu_{ij}(x - h_n u, x - h_n v)\, du\, dv, \qquad (A.21)$$
whose last term is bounded by $C h^{\xi} E\{\sigma_i(X_i) \sigma_j(X_j)\} \le C h^{\xi} E\{\sigma_i^2(X_i)\}^{1/2} E\{\sigma_j^2(X_j)\}^{1/2} = O(h^{\xi})$ uniformly in $i, j$. Thus (A.16) is
$$O\Big( \frac{1}{(nh^d)^2} \Big\{ n(h^d + h^{2\xi - d}) + (h^{2d} + h^{\xi}) \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} \Big\} \Big) = O\Big( \frac{1}{nh^d} + \frac{1}{n^2} \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} \Big) \to 0, \qquad (A.22)$$
by Assumptions A4 and A10.

Lemma 5. Let Assumptions A1, A2($\xi$) for $\xi > 2d$, A4, A9(x, x) and A11 hold. Then
$$E|\hat r_1(x)| \to 0, \quad \text{as } n \to \infty. \qquad (A.23)$$
Proof. For $D > 0$ introduce $\varepsilon_j' = \varepsilon_j 1(|\varepsilon_j| \le D)$, $\varepsilon_j'' = \varepsilon_j - \varepsilon_j'$, and
$$V_i' = V_{in}' = \sum_{j=1}^{\infty} a_{ij} \varepsilon_j', \quad U_i' = U_{in}' = \sigma_i(X_i)\big(V_i' - E V_i'\big), \qquad V_i'' = V_{in}'' = \sum_{j=1}^{\infty} a_{ij} \varepsilon_j'', \quad U_i'' = U_{in}'' = \sigma_i(X_i)\big(V_i'' - E V_i''\big). \qquad (A.24)$$
Then
$$\frac{1}{nh^d} \sum_{i=1}^{n} U_i' K_i(x) \qquad (A.25)$$
has mean zero and variance
$$\frac{1}{(nh^d)^2} \Big[ \sum_{i=1}^{n} E\{\sigma_i^2(X_i) K_i^2(x)\} \sum_{k=1}^{\infty} a_{ik}^2 \mathrm{Var}\{\varepsilon_k'\} + \sum_{i,j=1;\, i \ne j}^{n} E\{\sigma_i(X_i) \sigma_j(X_j) K_i(x) K_j(x)\} \sum_{k=1}^{\infty} a_{ik} a_{jk} \mathrm{Var}\{\varepsilon_k'\} \Big] \le \frac{C}{nh^d} + \frac{C}{n^2} \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} \to 0, \qquad (A.26)$$
as $n \to \infty$, from the proof of Lemma 4. On the other hand,
$$E\Big| \frac{1}{nh^d} \sum_{i=1}^{n} U_i'' K_i(x) \Big| \le \frac{2}{nh^d} \sum_{i=1}^{n} E|\sigma_i(X_i) K_i(x)| \sum_{j=1}^{\infty} |a_{ij}|\, E|\varepsilon_j''|. \qquad (A.27)$$
In a similar way as before, $E|\sigma_i(X_i) K_i(x)| \le C h^d$, whence (A.27) is bounded by
$$C \max_{k \ge 1} E|\varepsilon_k''| \Big( 1 + \max_{1 \le i \le n} \sum_{k=1}^{\infty} |a_{ik}| \Big) \to 0, \qquad (A.28)$$
as $D \to \infty$, uniformly in $n$, using Assumption A11. Since $D$ may be taken arbitrarily large, the proof is complete. $\qquad$ (A.29)

Appendix B. Technical lemmas for Section 5

Lemma 6. Let Assumptions A1, A2($\xi$) for $\xi > d$, A4 and B1(x) hold. Then for all $\delta > 0$ there exists $\varepsilon > 0$ such that
$$|E\{\hat f(x)\} - \bar f(x)| \le C(\beta(x; \varepsilon) + h^{\xi - d}) < C\delta, \qquad (B.1)$$
for all sufficiently large $n$.
Proof. We have
$$E\{\hat f(x)\} - \bar f(x) = \int_{\mathbb{R}^d} K(u)\{\bar f(x - hu) - \bar f(x)\}\, du = \Big\{ \int_{\|hu\| < \varepsilon} + \int_{\|hu\| \ge \varepsilon} \Big\} K(u)\{\bar f(x - hu) - \bar f(x)\}\, du, \qquad (B.2)$$
for $\varepsilon > 0$. The first term is bounded by $\beta(x; \varepsilon) \int_{\mathbb{R}^d} |K(u)|\, du$. The second term is bounded by
$$\sup_{\|u\| \ge \varepsilon/h} |K(u)| \Big\{ \int_{\|hu\| \ge \varepsilon} \bar f(x - hu)\, du + \bar f(x) \int_{\|u\| \ge \varepsilon/h} |K(u)|\, du \Big\} = O(h^{\xi - d}),$$
to complete the proof.

Lemma 7. Let Assumptions A1, A2($\xi$) for $\xi > d + \theta$, A4, A5(x), A8 and B2(x) hold. Then
$$E\|\hat r_2(x)\| = O(h^{\theta}), \quad \text{as } n \to \infty. \qquad (B.3)$$
Proof. This is very similar to the proof of Lemma 3, and is thus omitted.

Lemma 8. Let Assumptions A1, A2($\xi$) for $\xi > 2d$, A4, A9(x, y), B4, B8(x, y), B10 and B11 hold. Then, as $n \to \infty$,
$$\mathrm{Cov}\{\hat r_1(x), \hat r_1(y)\} \sim \kappa \lambda(x) s, \quad \text{if } x = y \text{ and } t = o(s); \qquad (B.4)$$
$$= o(s), \quad \text{if } x \ne y \text{ and } t = o(s); \qquad (B.5)$$
$$\sim \psi(x, y)\, t, \quad \text{if } s = o(t); \qquad \sim \{\kappa \lambda(x) + \chi \psi(x, y)\} s, \quad \text{if } t \sim \chi s,\ \chi \in (0, \infty). \qquad (B.6)$$
Proof. Proceeding as in the proof of Lemma 4, $\mathrm{Cov}\{\hat r_1(x), \hat r_1(y)\}$ is
$$\frac{1}{(nh^d)^2} \sum_{i=1}^{n} E\{\sigma_i^2(X_i) K_i(x) K_i(y)\} \qquad (B.7)$$
$$+ \frac{1}{(nh^d)^2} \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij}\, E\{\sigma_i(X_i) \sigma_j(X_j) K_i(x) K_j(y)\}. \qquad (B.8)$$
When $x = y$, from (A.17), (B.7) equals
$$\frac{h^{-d}}{n^2} \sum_{i=1}^{n} \int_{\mathbb{R}^d} K^2(u)\, \mu_i(x - hu)\, du. \qquad (B.9)$$
The difference between the integral and $\kappa \sigma_i^2(x) f_i(x)$ is
$$\int_{\mathbb{R}^d} K^2(u)\{\sigma_i^2(x - hu) - \sigma_i^2(x)\} f_i(x - hu)\, du + \sigma_i^2(x) \int_{\mathbb{R}^d} K^2(u)\{f_i(x - hu) - f_i(x)\}\, du. \qquad (B.10)$$
The first term is bounded by
$$\kappa \max_{1 \le i \le n} \sup_{\|u\| < \varepsilon} |\sigma_i^2(x - u) - \sigma_i^2(x)| \sup_{\|u\| < \varepsilon} f_i(x - hu) + \max_{1 \le i \le n} \int_{\|hu\| \ge \varepsilon} K^2(u)\{\sigma_i^2(x - hu) + \sigma_i^2(x)\} f_i(x - hu)\, du = o(1) + O(h^{2\xi - d}), \qquad (B.11)$$
uniformly in $i$, proceeding as in the proof of Lemma 4. The second term is bounded by
$$\kappa \max_{1 \le i \le n} \sigma_i^2(x) \sup_{\|hu\| < \varepsilon} |f_i(x - hu) - f_i(x)| + \max_{1 \le i \le n} \sigma_i^2(x) \Big\{ \sup_{\|u\| \ge \varepsilon/h} K^2(u) \int f_i(x - hu)\, du + f_i(x) \int_{\|u\| \ge \varepsilon/h} K^2(u)\, du \Big\}, \qquad (B.12)$$
which is, by essentially the same proof, $o(1) + O(h^{2\xi - d})$ uniformly in $i$. Thus, when $x = y$, there is a sequence $\Delta = \Delta_n \to 0$ such that
$$(B.7) \sim \kappa\, \frac{h^{-d}}{n^2} \sum_{i=1}^{n} \{\mu_i(x) + \Delta\} \sim \kappa \lambda(x)\, s. \qquad (B.13)$$
When $x \ne y$, (B.7) equals
$$\frac{h^{-d}}{n^2} \sum_{i=1}^{n} \int_{\mathbb{R}^d} K(u)\, K\Big(u + \frac{y - x}{h}\Big)\, \mu_i(x - hu)\, du. \qquad (B.14)$$
Splitting the range of integration into the sets $\|u\| > \|x - y\|/2h$ and $\|u\| \le \|x - y\|/2h$, and noting that the latter inequality implies that $\|u + (y - x)/h\| \ge \|x - y\|/2h$, the integral in (B.14) is bounded by
$$2 \sup_{\|u\| > \|x - y\|/2h} |K(u)| \int_{\mathbb{R}^d} |K(u)|\, \mu_i(x - hu)\, du \le C h^{\xi}, \qquad (B.15)$$
while
$$\frac{h^{-d}}{n^2} \sum_{i=1}^{n} \mu_i(x)\, \Delta = o(s), \qquad (B.16)$$
so that (B.7) $= o(s)$ when $x \ne y$. As in (A.20), (B.8) is
$$\frac{1}{n^2} \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij} \int_{\mathbb{R}^{2d}} K(u) K(v)\, \nu_{ij}(x - hu, y - hv)\, du\, dv. \qquad (B.17)$$
Now $\nu_{ij}(x - u, y - v) - \nu_{ij}(x, y)$ can be written
$$\{\sigma_i(x - u) - \sigma_i(x)\}\sigma_j(y - v) f_{ij}(x - u, y - v) + \sigma_i(x)\{\sigma_j(y - v) - \sigma_j(y)\} f_{ij}(x - u, y - v) + \sigma_i(x)\sigma_j(y)\{f_{ij}(x - u, y - v) - f_{ij}(x, y)\}. \qquad (B.18)$$
By proceeding much as before with each of these three terms, it may thus be seen that
$$(B.17) \sim \frac{1}{n^2} \sum_{i,j=1;\, i \ne j}^{n} \gamma_{ij}\{\nu_{ij}(x, y) + \Delta\} \sim \psi(x, y)\, t. \qquad (B.19)$$
Combining (B.13), (B.16) and (B.19) yields (B.4)–(B.6).

Lemma 9. Let Assumptions A1, A2($\xi$) for $\xi > 2d$, A4, A9($x_i, x_j$), B4, B8($x_i, x_j$), $i, j = 1, \dots, p$, B10 and B11 hold. Then there exists a sequence $N = N_n$, increasing with $n$, such that
$$E\|\hat r_1^{\#}\|^2 = o(w), \quad \text{as } n \to \infty. \qquad (B.20)$$
Proof. We have
$$E\|\hat r_1^{\#}\|^2 = \frac{1}{(nh^d)^2} \sum_{i,j=1}^{n} E\{K_i' K_j\, \sigma_i(X_i) \sigma_j(X_j)\} \sum_{k=N+1}^{\infty} a_{ik} a_{jk}. \qquad (B.21)$$
From the proof of Lemma 8 the expectation is
$$h^d \sum_{k=1}^{p} \mu_i(x_k)\, 1(i = j)(1 + o(1)) + \mathrm{tr}(D_{ij})\, 1(i \ne j)(1 + o(1)), \qquad (B.22)$$
uniformly in $i, j$, where $1(\cdot)$ denotes the indicator function and $D_{ij}$ is the $p \times p$ matrix with $(k, l)$th element $h^{2d} \nu_{ij}(x_k, x_l)$. Thus for large enough $n$
$$E\|\hat r_1^{\#}\|^2 \le \frac{C}{nh^d} \max_{1 \le i \le n} \sum_{k=N+1}^{\infty} a_{ik}^2 + \frac{C h^{2d}}{(nh^d)^2} \Big| \sum_{i,j=1;\, i \ne j}^{n} \sum_{k=N+1}^{\infty} a_{ik} a_{jk} \Big|. \qquad (B.23)$$
First suppose $t = O(s)$. By the Cauchy inequality the second term of (B.23), namely
$$\frac{C h^{2d}}{(nh^d)^2} \Big| \sum_{i,j=1;\, i \ne j}^{n} \sum_{k=N+1}^{\infty} a_{ik} a_{jk} \Big|, \qquad (B.24)$$
is bounded by
$$\frac{C}{n^2} \Big( \sum_{i=1}^{n} \Big( \sum_{k=N+1}^{\infty} a_{ik}^2 \Big)^{1/2} \Big)^2. \qquad (B.25)$$
In view of (2.8),
$$\lim_{N \to \infty} \max_{1 \le i \le n} \sum_{k=N+1}^{\infty} a_{ik}^2 = 0, \qquad (B.26)$$
so that, for an $N = N_n$ sequence increasing fast enough, $E\|\hat r_1^{\#}\|^2 = o(s) = o(w)$. Now suppose $s = o(t)$. We have
$$\Big| \sum_{i,j=1;\, i \ne j}^{n} \sum_{k=N+1}^{\infty} a_{ik} a_{jk} \Big| \le \sum_{k=N+1}^{\infty} \Big| \sum_{i,j=1;\, i \ne j}^{n} a_{ik} a_{jk} \Big|. \qquad (B.27)$$
Now
$$\sum_{k=1}^{\infty} b_k \equiv 1, \qquad (B.28)$$
where
$$b_k = b_{kn} = \sum_{i,j=1;\, i \ne j}^{n} a_{ik} a_{jk} \Big/ \sum_{k=1}^{\infty} \sum_{i,j=1;\, i \ne j}^{n} a_{ik} a_{jk}. \qquad (B.29)$$
Thus
$$\sum_{k=N+1}^{\infty} b_k \to 0, \quad \text{as } n \to \infty, \qquad (B.30)$$
for a suitably increasing $N$ sequence.
From (5.24), (B.27) and (B.30),
$$E\|\hat r_1^{\#}\|^2 \le \frac{C h^{2d}}{(nh^d)^2} \Big| \sum_{i,j=1;\, i \ne j}^{n} \sum_{k=N+1}^{\infty} a_{ik} a_{jk} \Big| = o(t) = o(w), \quad \text{as } n \to \infty. \qquad (B.31)$$

Lemma 10. Let Assumptions A1, A2($\xi$) for $\xi > 4d$, A4, A9($x_i, x_j$), B4, B6 and B7, B9($x_i$) and B10 hold, $i, j = 1, \dots, p$. Then
$$E\|T - E\{T\}\|^2 = o(w^2), \quad \text{as } n \to \infty. \qquad (B.32)$$
Proof. It suffices to check (B.32) in case $p = 1$, so we put $x_1 = x$. We have
$$E\{\|T - E\{T\}\|^2\} = \sum_{j,k=1}^{N} \big[ E\{Z_j^2 Z_k^2\} - E\{Z_j^2\} E\{Z_k^2\} \big]. \qquad (B.33)$$
The summand is
$$\frac{1}{(nh^d)^4} \sum_{i_s = 1,\, s = 1, \dots, 4}^{n} a_{i_1 j} a_{i_2 j} a_{i_3 k} a_{i_4 k} \Big[ E\Big\{ \prod_{s=1}^{4} K_{i_s}(x) \sigma_{i_s}(X_{i_s}) \Big\} - E\Big\{ \prod_{s=1,2} K_{i_s}(x) \sigma_{i_s}(X_{i_s}) \Big\} E\Big\{ \prod_{s=3,4} K_{i_s}(x) \sigma_{i_s}(X_{i_s}) \Big\} \Big]. \qquad (B.34)$$
The quadruple sum yields terms of seven kinds, depending on the nature of equalities, if any, between the $i_s$, and bearing in mind the fact that $i_1, i_2$ are linked with $j$, and $i_3, i_4$ are linked with $k$. Symbolically, denote such a term $\langle A, B, C, D \rangle - \langle A, B \rangle \langle C, D \rangle$ when all $i_s$ are unequal, and repeat the corresponding letters in case of any equalities. The other six kinds of term are thus
$$\langle A, A, A, B \rangle - \langle A, A \rangle \langle A, B \rangle, \quad \langle A, A, B, C \rangle - \langle A, A \rangle \langle B, C \rangle, \quad \langle A, B, A, B \rangle - \langle A, B \rangle \langle A, B \rangle, \quad \langle A, A, B, B \rangle - \langle A, A \rangle \langle B, B \rangle, \quad \langle A, B, A, C \rangle - \langle A, B \rangle \langle A, C \rangle, \quad \langle A, A, A, A \rangle - \langle A, A \rangle \langle A, A \rangle. \qquad (B.35)$$
For an $\langle A, B, C, D \rangle - \langle A, B \rangle \langle C, D \rangle$ term, the quantity in square brackets in (B.34) is
$$h^{4d} \int_{\mathbb{R}^{4d}} \{f_{i_1 i_2 i_3 i_4}(x - hu_1, x - hu_2, x - hu_3, x - hu_4) - f_{i_1 i_2}(x - hu_1, x - hu_2) f_{i_3 i_4}(x - hu_3, x - hu_4)\} \prod_{s=1}^{4} K(u_s)\, \sigma_{i_s}(x - hu_s)\, du_s. \qquad (B.36)$$
By arguments similar to those in Lemma 4, the contribution to (B.33) is thus bounded by
$$\frac{C}{n^4} {\sum_{i_s = 1,\, s = 1, \dots, 4}^{n}}{}' \gamma_{i_1 i_2} \gamma_{i_3 i_4}\, \rho_{i_1 i_2 i_3 i_4}(x, x, x, x, \varepsilon), \qquad (B.37)$$
for some $\varepsilon > 0$. This is $o(t^2)$ by (5.23), and thus $o(s^2)$ if $t = O(s)$. In a similar way, the contribution of an $\langle A, A, B, C \rangle - \langle A, A \rangle \langle B, C \rangle$ term is bounded by
$$\frac{C h^{-d}}{n^3} {\sum_{i_s = 1,\, s = 1, 2, 3}^{n}}{}' \gamma_{i_2 i_3}\, \rho_{i_1 i_2 i_3}(x, x, x, \varepsilon). \qquad (B.38)$$
This is $o(st)$, which is $o(s^2)$ if $t = O(s)$, and $o(t^2)$ if $s = o(t)$. Likewise the contribution of an $\langle A, A, B, B \rangle - \langle A, A \rangle \langle B, B \rangle$ term is bounded by
$$\frac{C h^{-2d}}{n^4} {\sum_{i_s = 1,\, s = 1, 2}^{n}}{}' \rho_{i_1 i_2}(x, x, \varepsilon). \qquad (B.39)$$
This is $o(s^2)$, and thus $o(t^2)$ if $s = o(t)$. The remaining terms in (B.35) are handled by showing that the individual components of each difference are $o(w^2)$. The $\langle A, B, A, C \rangle$ contribution is (using Assumption B6) bounded by
$$\frac{C h^{-d}}{n^4} {\sum_{i_s = 1,\, s = 1, 2, 3}^{n}}{}' \gamma_{i_1 i_2} \gamma_{i_1 i_3} \sup_{\|u_s\| < \varepsilon} f_{i_1 i_2 i_3}(x - u_1, x - u_2, x - u_3) = o(st); \qquad (B.40)$$
the $\langle A, B \rangle \langle A, C \rangle$ one by
$$\frac{C}{n^4} {\sum_{i_s = 1,\, s = 1, 2, 3}^{n}}{}' \gamma_{i_1 i_2} \gamma_{i_1 i_3} \sup_{\|u_s\| < \varepsilon} f_{i_1 i_2}(x - u_1, x - u_2) \sup_{\|u_s\| < \varepsilon} f_{i_1 i_3}(x - u_1, x - u_3) = o(st); \qquad (B.41)$$
the $\langle A, B, A, B \rangle$ one by
$$\frac{C h^{-2d}}{n^4} {\sum_{i_s = 1,\, s = 1, 2}^{n}}{}' \gamma_{i_1 i_2}^2 \sup_{\|u_s\| < \varepsilon} f_{i_1 i_2}(x - u_1, x - u_2) = O(s^2 t); \qquad (B.42)$$
the $\langle A, B \rangle \langle A, B \rangle$ one by
$$\frac{C}{n^4} {\sum_{i_s = 1,\, s = 1, 2}^{n}}{}' \gamma_{i_1 i_2}^2 \Big( \sup_{\|u_s\| < \varepsilon} f_{i_1 i_2}(x - u_1, x - u_2) \Big)^2 = o(s^2 t); \qquad (B.43)$$
the $\langle A, A, A, B \rangle$ one by
$$\frac{C h^{-2d}}{n^4} {\sum_{i_s = 1,\, s = 1, 2}^{n}}{}' \gamma_{i_1 i_2} \sup_{\|u_s\| < \varepsilon} f_{i_1 i_2}(x - u_1, x - u_2) = O(s^2 t); \qquad (B.44)$$
the $\langle A, A \rangle \langle A, B \rangle$ one by
$$\frac{C h^{-d}}{n^4} {\sum_{i_s = 1,\, s = 1, 2}^{n}}{}' \gamma_{i_1 i_2} \sup_{\|u\| < \varepsilon} f_{i_1}(x - u) \sup_{\|u_s\| < \varepsilon} f_{i_1 i_2}(x - u_1, x - u_2) = o(s^2 t); \qquad (B.45)$$
the $\langle A, A, A, A \rangle$ one by
$$\frac{C h^{-3d}}{n^4} \sum_{i=1}^{n} \sup_{\|u\| < \varepsilon} f_i(x - u) = O(s^3); \qquad (B.46)$$
and the $\langle A, A \rangle \langle A, A \rangle$ one by
$$\frac{C h^{-2d}}{n^4} \sum_{i=1}^{n} \Big( \sup_{\|u\| < \varepsilon} f_i(x - u) \Big)^2 = o(s^3). \qquad (B.47)$$
Combining these bounds gives (B.32).
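The normalisation $\sum_j z_j^{(q)2} = 1$ underlying the martingale-difference argument of (5.46)–(5.50) can be illustrated by simulation (a hypothetical sketch; the weights below are arbitrary, not the $z_j^{(q)}$ of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
N, reps = 50, 4000
z = rng.normal(size=N)
z /= np.sqrt(np.sum(z ** 2))      # enforce the normalisation sum_j z_j^2 = 1
eps = rng.normal(size=(reps, N))  # independent innovations with E eps_j^2 = 1
c = eps @ z                       # c* = sum_j z_j eps_j, one draw per row
print(np.var(c))                  # close to 1, matching E c*^2 = 1
```

Given the weights, each draw of $c^*$ has variance exactly $\sum_j z_j^2 = 1$, which is what the conditioning on $X_1, \dots, X_n$ exploits.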
Lemma 11. Let Assumptions A1, A2($\xi$) for $\xi > 4d$, A4, B4, B5 and B7, B9($x_i$) and B10 hold, $i = 1, \dots, p$. Then
$$\sum_{j=1}^{N} z_j^{(q)4} = O_p(b + n^{-1} + s + t), \quad \text{as } n \to \infty. \qquad (B.48)$$
Proof. Arguing as before and applying (5.44), the left side is $O_p(1)$ times a positive random variable whose expectation is bounded by
$$\frac{C w^{-2}}{(nh^d)^4} \sum_{i_s = 1,\, s = 1, \dots, 4}^{n} \sum_{j=1}^{N} a_{i_1 j} a_{i_2 j} a_{i_3 j} a_{i_4 j} \big[ h^{4d}\, 1(i_s \ne i_t,\ s, t = 1, \dots, 4,\ s \ne t) + h^{3d}\, 1(i_1 = i_2 \ne i_3, i_4;\ i_3 \ne i_4) + h^{2d}\, 1(i_1 = i_2 \ne i_3 = i_4) + h^{d}\, 1(i_1 = i_2 = i_3 = i_4) \big]$$
$$\le \frac{C w^{-2}}{n^4} \sum_{j=1}^{N} \Big\{ \Big( \sum_{i=1}^{n} a_{ij} \Big)^4 + h^{-d} \Big( \sum_{i=1}^{n} a_{ij} \Big)^2 \sum_{i=1}^{n} a_{ij}^2 + h^{-2d} \Big( \sum_{i=1}^{n} a_{ij}^2 \Big)^2 + h^{-3d} \sum_{i=1}^{n} a_{ij}^4 \Big\}. \qquad (B.49)$$
We have first
$$\sum_{j=1}^{N} \Big( \sum_{i=1}^{n} a_{ij} \Big)^4 \le \sup_{j \ge 1} \Big( \sum_{i=1}^{n} a_{ij} \Big)^2 \sum_{j=1}^{\infty} \Big( \sum_{i=1}^{n} a_{ij} \Big)^2 \le b \Big( \sum_{j=1}^{\infty} \Big( \sum_{i=1}^{n} a_{ij} \Big)^2 \Big)^2. \qquad (B.50)$$
Now
$$\sum_{j=1}^{\infty} \Big( \sum_{i=1}^{n} a_{ij} \Big)^2 = \sum_{k=1}^{\infty} \Big( \sum_{i,j=1;\, i \ne j}^{n} a_{ik} a_{jk} + \sum_{i=1}^{n} a_{ik}^2 \Big) = O\Big( \sum_{i,j=1}^{n} \gamma_{ij} \Big), \qquad (B.51)$$
as $n \to \infty$, the penultimate bound using Assumption B11. We have
$$\frac{w^{-2}}{n^4} \Big( \sum_{i,j=1}^{n} \gamma_{ij} \Big)^2 \le t^2 s^{-2} \le C \ \text{if } t = O(s); \qquad \le t^{-2} t^2 \le C \ \text{if } s = o(t). \qquad (B.52)$$
Thus the contribution to (B.49) from the first expression in its braces is $O(b)$. Next
$$\sum_{j=1}^{N} \Big( \sum_{i=1}^{n} a_{ij} \Big)^2 \sum_{i=1}^{n} a_{ij}^2 \le \sup_{j \ge 1} \sum_{i=1}^{n} a_{ij}^2 \sum_{j=1}^{\infty} \Big( \sum_{i=1}^{n} a_{ij} \Big)^2 = O\Big( \sum_{i,j=1}^{n} \gamma_{ij} \Big), \qquad (B.53)$$
as $n \to \infty$, the second and last bounds using Assumption B5 and (B.51). But
$$\frac{h^{-d} w^{-2}}{n^4} \sum_{i,j=1}^{n} \gamma_{ij} \le h^{-d} t s^{-2}/n^2 \le C/n \ \text{if } t = O(s); \qquad \le h^{-d} t^{-1}/n^2 \le C/n \ \text{if } s = O(t). \qquad (B.54)$$
Thus the contribution to (B.49) from the second expression in its braces is $O(n^{-1})$. Next
$$\frac{C w^{-2} h^{-2d}}{n^4} \sum_{j=1}^{N} \Big( \sum_{i=1}^{n} a_{ij}^2 \Big)^2 \le \frac{C w^{-2} h^{-2d}}{n^4} \sup_{j \ge 1} \sum_{i=1}^{n} a_{ij}^2 \sum_{j=1}^{\infty} \sum_{i=1}^{n} a_{ij}^2 \le C w^{-2} s^2/n \le C/n, \qquad (B.55)$$
for both $t = O(s)$ and $s = o(t)$. Finally
$$\frac{C w^{-2} h^{-3d}}{n^4} \sum_{j=1}^{N} \sum_{i=1}^{n} a_{ij}^4 \le \frac{C w^{-2} h^{-3d}}{n^4} \sum_{i=1}^{n} \sum_{j=1}^{N} a_{ij}^4 \le C w^{-2} h^{-3d}/n^3 \le C s^3 s^{-2} = Cs \ \text{if } t = O(s); \qquad \le C s^3 t^{-2} = o(t) \ \text{if } s = o(t), \qquad (B.56)$$
and both tend to zero in view of Assumptions A10 and A4. Combining the four contributions gives (B.48).
Journal of Econometrics 165 (2011) 20–29
Control variate method for stationary processes

Tomoyuki Amano, Masanobu Taniguchi*
Department of Applied Mathematics, School of Fundamental Science and Engineering, Waseda University, Tokyo, 169-8555, Japan

Article history: Available online 12 May 2011.
JEL classification: C02; C10; C13; C14; C15; C22.
Keywords: Control variate method; Stationary processes; Spectral density matrix; Nonparametric spectral estimator.
Abstract: The sample mean is one of the most natural estimators of the population mean based on an independent, identically distributed (i.i.d.) sample. However, if some control variate is available, the control variate method is known to reduce the variance of the sample mean. The control variate method usually assumes that the variable of interest and the control variable are i.i.d. Here we assume instead that these variables are stationary processes with a joint spectral density matrix, i.e. dependent. We then propose an estimator of the mean of the stationary process of interest, constructed by the control variate method from nonparametric spectral estimators. It is shown that this estimator improves on the sample mean in the sense of mean square error. The analysis is also extended to the case where the mean dynamics take the form of a regression; we then propose a control variate estimator for the regression coefficients which improves on the least squares estimator (LSE). Numerical studies illustrate how our estimator improves on the LSE. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

The sample mean is one of the most natural estimators of the population mean based on an i.i.d. sample. When some control variable vector is available (a random vector which is possibly correlated with the variable of interest), it is known that the control variate method, using the information in the control variate vector, reduces the variance of the sample mean. That is, if Ȳ is the sample mean of an i.i.d. sample {Y_i}_{i=1}^n with unknown mean μ_Y and X is a control variable vector with known mean vector μ_X, then for any constant vector b the control variate estimator μ̂_Y(b) = Ȳ − b′(X − μ_X) of μ_Y has mean μ_Y and variance Var[Ȳ] − 2b′Cov[Ȳ, X] + b′Σ_X b, where Σ_X is the covariance matrix of X and Cov[Ȳ, X] is the covariance vector between Ȳ and X. Hence, if 2b′Cov[Ȳ, X] > b′Σ_X b, the variance of the control variate estimator is smaller than that of the sample mean. This method has been discussed for the case when the sample and control variable are i.i.d. Lavenberg and Welch (1981) review analyses of the control variate method developed up to that date. In that paper the value b* of the vector b which minimizes the variance of the control variate estimator is derived and a confidence interval for μ̂_Y(b*) is constructed. However, in practice, since the correlation between Ȳ and X is unknown, this b* is not known
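The classical i.i.d. construction just described can be sketched as follows. This is our illustration, not code from the paper; the function name and the simulated data are hypothetical, and the variance-minimizing coefficient is replaced by its sample plug-in b = Cov(X)^{-1} Cov(X, Y):

```python
import numpy as np

def control_variate_mean(y, x, mu_x):
    """Classical i.i.d. control variate estimate of E[Y].

    y: (n,) draws of the variable of interest; x: (n, m) draws of the
    control vector with known mean mu_x.  Uses the plug-in coefficient
    b = Cov(X)^{-1} Cov(X, Y), the sample analogue of the variance-
    minimizing b* discussed in the introduction.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    cov_xx = xc.T @ xc / len(y)             # sample Sigma_X
    cov_xy = xc.T @ yc / len(y)             # sample Cov(X, Y)
    b = np.linalg.solve(cov_xx, cov_xy)     # estimated b*
    return y.mean() - b @ (x.mean(axis=0) - mu_x)

# illustration with a single control: Y = 2 + U + noise, X = U, E[X] = 0
rng = np.random.default_rng(0)
n = 10_000
u = rng.standard_normal(n)
y = 2.0 + u + rng.standard_normal(n)
est = control_variate_mean(y, u[:, None], np.array([0.0]))
```

Because the control absorbs the common component U, the adjusted mean has roughly half the variance of the plain sample mean in this setup.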
* Corresponding author. Tel.: +81 3 5286 8095. E-mail addresses: [email protected] (T. Amano), [email protected] (M. Taniguchi).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.05.003
and an estimator b̂* of b* is proposed. In general the control variate estimator involving the estimator b̂* is not unbiased, and the confidence interval cannot be constructed easily; they also discuss these problems. Rubinstein and Marcus (1985) extend the results to the case when the sample mean Ȳ is a multidimensional vector, the multidimensional control variate estimator being μ̂_Y(B) = Ȳ − B(X − M_X), where B is an arbitrary matrix and X is a control variate vector with mean vector M_X. They give the matrix B* which minimizes the determinant of E{μ̂_Y(B) μ̂_Y(B)′}, called the generalized variance of μ̂_Y(B). They also introduce an estimator B̂* of B* and discuss the confidence ellipsoid. Nelson (1990) proves a central limit theorem for the control variate estimator. Since much control variate theory has been developed under a specific probability structure (usually the normal distribution) for the sample and control variates, a number of authors have introduced remedies for violations of these assumptions; Nelson (1990) gives a systematic analytical evaluation of them. In recent years this method has been applied in financial engineering (e.g., Glasserman (2004) and Chan and Wong (2006)). Since the control variate theory is usually discussed under the assumption that the sample and control variates are i.i.d., in this paper, when the sample is generated from a stationary process and some control variable process is available, we propose an estimator θ̂_C of the mean of the process of interest by using the control variate method. It is then shown that this estimator improves on the sample mean in the sense of mean square error (MSE). The estimator θ̂_C is expressed in terms of nonparametric estimators of the spectra of the process of interest and the control variate process. We also apply this analysis to the case when the mean dynamics
is of the form of a regression. A control variate estimator for the regression coefficients is proposed and shown to improve on the LSE in the sense of MSE. Numerical studies show how our estimators behave. Our results have potential applications in various fields, econometrics in particular. This paper is organized as follows. In Section 2 we introduce an estimator θ̂_C of the mean of a stationary process by the control variate method. Section 3 shows that this estimator θ̂_C improves on the sample mean in the sense of MSE. In Section 4, control variate estimators for a mean of regression form are proposed and shown to improve on the LSE. Section 5 provides numerical studies which show how our estimators improve on the sample mean and the LSE. Proofs of the theorems are relegated to Section 6. Throughout the paper we denote the set of all integers by Z, and by ‖·‖ the Euclidean norm.
2. Setting

One of the most fundamental estimators of the population mean is the sample mean. It is known that if the sample is i.i.d. and some control variable is available, then, using the information about the control variate X and its mean μ_X, the control variate method improves the mean square error of the sample mean. In this section we apply this method to the case when the sample is generated by a stationary process and some control variate process is available, and introduce an estimator of the mean which improves on the sample mean. Suppose that {Y(t); t ∈ Z} is a scalar-valued process with mean E[Y(t)] = θ and {X(t); t ∈ Z} is another m-dimensional process with mean vector E[X(t)] = 0, possibly correlated with {Y(t)}. We are interested in the estimation of θ. Let Z(t) ≡ (Y(t), X′(t))′. The following assumptions are imposed.

Assumption 2.1. {Z(t); t ∈ Z} is generated by the linear process

Z(t) = Σ_{j=0}^∞ B(j) ε(t − j) + θ⃗,  (2.1)

where θ⃗ = (θ, 0, 0, …, 0)′ is an (m+1)-dimensional vector, the B(j) are (m+1)×(m+1) matrices, and {ε(t)} is a sequence of i.i.d. (m+1)-dimensional random vectors with mean vector 0 and covariance matrix K.

Henceforth |U|, U_{i,j} and v_i denote the sum of the absolute values of the elements of the matrix U, the (i, j)-th element of U, and the i-th element of the vector v, respectively.

Assumption 2.2. (i) Det[Σ_{u=0}^∞ B(u) z^u] = 0 has no roots in the unit disc {z ∈ C; |z| ≤ 1}.
(ii) The coefficient matrices B(u) satisfy

Σ_{u=0}^∞ |u|⁴ |B(u)| < ∞.  (2.2)

Let Cum(Q1, …, Qk) denote the joint cumulant of random variables Q1, …, Qk. We assume the following.

Assumption 2.3. For k = 3, 4, …,

C_k^ε ≡ sup_{a1,…,ak} |Cum(ε_{a1}(0), …, ε_{ak}(0))| < ∞  (2.3)

and

Σ_{L=1}^∞ Σ_ν C_{n1}^ε ⋯ C_{nP}^ε z^L / L! < ∞  (2.4)

for z in a neighborhood of 0, where the inner summation Σ_ν runs over all indecomposable partitions (see Brillinger (2001), p. 20) ν = (ν1, …, νP), with ν_p having n_p > 1 elements, p = 1, …, P, of the table

1      2
3      4
⋮      ⋮
2L−1   2L.

Write

Cum_{a1,…,ak}(t1, …, t_{k−1}) = Cum{Z_{a1}(t1), …, Z_{a_{k−1}}(t_{k−1}), Z_{ak}(0)}  (2.5)

and

C_k ≡ sup_{a1,…,ak} Σ_{t1,…,t_{k−1}=−∞}^∞ |Cum_{a1,…,ak}(t1, …, t_{k−1})|.  (2.6)

Then Assumptions 2.1–2.3 imply

Σ_{t1,…,t_{k−1}=−∞}^∞ {1 + |t_j|} |Cum_{a1,…,ak}(t1, …, t_{k−1})| < ∞  (2.7)

for j = 1, …, k − 1 and any k-tuple a1, …, ak with k = 2, 3, …, and

Σ_{L=1}^∞ Σ_ν C_{n1} ⋯ C_{nP} z^L / L! < ∞,  (2.8)

where the summation Σ_ν is defined as in (2.4) (see Brillinger (2001), p. 48). From Assumptions 2.1 and 2.2, it is seen that the process {Z(t)} is a stationary process with nonsingular spectral density matrix (e.g., Brillinger (2001)). We write the spectral density matrix as

f(λ) = [ f_YY(λ)  f_YX(λ)
         f_XY(λ)  f_XX(λ) ].  (2.9)

From Assumption 2.2, it follows that R(s) = {Cum_{i,j}(s)} satisfies

Σ_{s=−∞}^∞ |s|⁴ |R(s)| < ∞  (2.10)

(e.g., Brillinger (2001), p. 46). Suppose that partial observations {Y(0), Y(1), …, Y(n−1)} and {X(−M_n), X(−M_n+1), …, X(0), …, X(n−1)} are available, where M_n = O(n^β) (1/4 ≤ β < 1/3). Based on the observations we introduce the following estimator θ̂_C of θ:

θ̂_C ≡ (1/n) Σ_{t=0}^{n−1} [ Y(t) − Σ_{u=0}^{M_n} â_n′(u) X(t − u) ],  (2.11)

where â_n(u) = (1/2π) ∫_{−π}^π Â_n(λ) exp(iuλ) dλ and Â_n(λ) = f̂_XX(λ)^{−1} f̂_XY(λ). Here f̂_XX(λ) and f̂_XY(λ) are, respectively, nonparametric estimators of f_XX(λ) and f_XY(λ), defined as

f̂_XY(λ) ≡ (2π/n) Σ_{s=1}^{n−1} W_n(λ − 2πs/n) I_XY(2πs/n),  (2.12)
f̂_XX(λ) ≡ (2π/n) Σ_{s=1}^{n−1} W_n(λ − 2πs/n) I_XX(2πs/n),  (2.13)

where I_XY(μ) and I_XX(μ) are submatrices of the periodogram

I_n(μ) ≡ (1/2πn) ( Σ_{t=0}^{n−1} Z(t) e^{itμ} ) ( Σ_{t=0}^{n−1} Z(t) e^{itμ} )*  (2.14)
       = [ I_YY(μ)  I_YX(μ)
           I_XY(μ)  I_XX(μ) ]  (say),  (2.15)

and {W_n(λ)} are weight functions described in the next section. Â_n(λ) and â_n(u) are shown to be consistent estimators of A(λ) = f_XX(λ)^{−1} f_XY(λ) and a(u) = (1/2π) ∫_{−π}^π A(λ) exp(iuλ) dλ, respectively. In the next section we show that the proposed estimator θ̂_C improves on the sample mean in the sense of the mean square error (MSE).
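The construction of θ̂_C above can be sketched numerically for a scalar control process. This is a hedged illustration, not the paper's code: the flat (Daniell-type) smoother stands in for the general window W_n, the FFT discretization stands in for the integrals defining â_n(u), and the function name and defaults are ours:

```python
import numpy as np

def theta_hat_c(y, x, m_n=20, halfwidth=15):
    """Sketch of the control variate mean estimator theta_hat_C of (2.11)
    for a scalar control process X with known mean 0.

    Spectra are estimated by flat smoothing of (cross-)periodograms; the
    1/(2*pi*n) periodogram normalization cancels in the ratio A = f_YX/f_XX,
    so it is omitted.  m_n plays the role of the paper's M_n.
    """
    n = len(y)
    fy = np.fft.fft(y - np.mean(y))
    fx = np.fft.fft(x - np.mean(x))
    i_xx = (fx * np.conj(fx)).real          # periodogram of X (unnormalized)
    i_yx = fy * np.conj(fx)                 # cross-periodogram of Y on X
    width = 2 * halfwidth + 1
    kern = np.ones(width) / width

    def smooth(p):  # circular moving average over Fourier frequencies
        ext = np.concatenate([p[-halfwidth:], p, p[:halfwidth]])
        return np.convolve(ext, kern, mode="valid")

    a_lam = smooth(i_yx) / smooth(i_xx)     # plays the role of \hat A_n(lambda)
    a_u = np.fft.ifft(a_lam).real           # \hat a_n(u), u = 0, ..., n-1
    # filtered series Y(t) - sum_{u=0}^{m_n} \hat a_n(u) X(t - u)
    resid = np.asarray(y, dtype=float).copy()
    for u in range(m_n + 1):
        resid -= a_u[u] * np.roll(x, u)
    return resid[m_n:].mean()               # drop the wrapped warm-up values

# illustration: Y(t) = 1 + u(t) + v(t), X(t) = u(t), so theta = 1
rng = np.random.default_rng(0)
n = 4000
u = rng.standard_normal(n)
v = rng.standard_normal(n)
y = 1.0 + u + v
theta_c = theta_hat_c(y, u)
```

Since the control here carries the common component u(t), the filter leaves essentially v(t), and the mean of the filtered series estimates θ with reduced variance.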
3. Asymptotic theory

In this section we elucidate the asymptotics of θ̂_C. Initially, we state the following assumption on {W_n(λ)}.

Assumption 3.1. (i)

W_n(λ) = N_n W(N_n λ),  (3.1)

where N_n = O(n^{1/3}) and positive, and W(x) is bounded, even, non-negative and satisfies

∫_{−∞}^∞ W(x) dx = 1.  (3.2)

(ii) W_n(λ) can be expanded as W_n(λ) = (1/2π) Σ_l w(l/N_n) e^{−ilλ}, where w(x) is a continuous, even function with w(0) = 1, |w(x)| ≤ 1 and ∫_{−∞}^∞ w(x)² dx < ∞, and satisfies lim_{|x|→0} (1 − w(x))/|x| = k1 < ∞ for some constant k1.

Then we get the following theorem.

Theorem 3.1. Suppose Assumptions 2.1–2.3 and 3.1. Then it holds that

lim_{n→∞} n E|θ̂_C − θ|² = 2π ( f_YY(0) − f_YX(0) f_XX(0)^{−1} f_XY(0) ).  (3.3)

It is known that the asymptotic variance of the sample mean Ȳ_n ≡ (1/n) Σ_{t=0}^{n−1} Y(t) is 2π f_YY(0) (e.g., Brillinger (2001), Theorem 5.2.1). Since

2π ( f_YY(0) − f_YX(0) f_XX(0)^{−1} f_XY(0) ) ≤ 2π f_YY(0),  (3.4)

we observe that θ̂_C improves on Ȳ_n in the sense of MSE.

4. Regression models

We assume that {Y(t); t ∈ Z} is a trend model whose mean E[Y(t)] = μ(t) = φ′(t)θ is a time-dependent function. Here φ(t) = (φ1(t), …, φJ(t))′ and θ = (θ1, …, θJ)′. Let {X(t); t ∈ Z} be another m-dimensional process with mean vector E[X(t)] = 0, possibly correlated with {Y(t)}. We now apply the control variate method to estimate the parameter θ. Let Z(t) ≡ (Y(t), X′(t))′, t ∈ Z. We impose the following assumption.

Assumption 4.1. {Z(t); t ∈ Z} is generated by the linear process

Z(t) = Σ_{j=0}^∞ B(j) ε(t − j) + (μ(t), 0, …, 0)′,  (4.1)

where the B(j) are (m+1)×(m+1) matrices satisfying Assumption 2.2 and the ε(t) are i.i.d. random vectors with mean vector 0 and covariance matrix K.

For convenience we define η(t) by (η(t), X′(t))′ = Σ_{j=0}^∞ B(j) ε(t − j); then, as discussed in Section 2, (η(t), X′(t))′ has the spectral density matrix

f(λ) = [ f_ηη(λ)  f_ηX(λ)
         f_Xη(λ)  f_XX(λ) ].  (4.2)

Suppose that partial observations {Y(0), Y(1), …, Y(n−1)} and {X(−M_n), X(−M_n+1), …, X(0), …, X(n−1)} are available. We define nonparametric estimators f̂_XX(λ) and f̂_Xη̂(λ) of the spectral densities f_XX(λ) and f_Xη(λ), respectively, as

f̂_XX(λ) ≡ (2π/n) Σ_{s=1}^{n−1} W_n(λ − 2πs/n) I_XX(2πs/n),  (4.3)
f̂_Xη̂(λ) ≡ (2π/n) Σ_{s=1}^{n−1} W_n(λ − 2πs/n) I_Xη̂(2πs/n),  (4.4)

where

I_XX(μ) ≡ (1/2πn) ( Σ_{t=0}^{n−1} X(t) e^{itμ} ) ( Σ_{t=0}^{n−1} X(t) e^{itμ} )*,  (4.5)
I_Xη̂(μ) ≡ (1/2πn) ( Σ_{t=0}^{n−1} X(t) e^{itμ} ) ( Σ_{t=0}^{n−1} η̂(t) e^{itμ} )*,  (4.6)

and η̂(t) = Y(t) − φ′(t) θ̄_LSE with θ̄_LSE = (φ′φ)^{−1} φ′ Y (the least squares estimator of θ). Let Â(λ) = f̂_XX(λ)^{−1} f̂_Xη̂(λ) and â(u) = (1/2π) ∫_{−π}^π Â(λ) exp(iuλ) dλ.

Now we propose an estimator θ̂_LSE^C of θ:

θ̂_LSE^C = (φ′φ)^{−1} φ′ (Y − Ŵ_M),  (4.7)

where Y = (Y(1), …, Y(n))′, φ = (φ(1), …, φ(n))′ and Ŵ_M = (Ŵ_M(1), …, Ŵ_M(n))′ with

Ŵ_M(t) = Σ_{u=0}^{M_n} â′(u) X(t − u).  (4.8)

To describe the asymptotics of θ̂_LSE^C, we impose the following Grenander conditions.

Assumption 4.2. Let c_{j,k}^n(h) = Σ_{t=1}^{n−h} φ_j(t + h) φ_k(t) = Σ_{t=1−h}^{n} φ_j(t + h) φ_k(t). The c_{j,k}^n(h) satisfy the following conditions.
(i) c_{j,j}^n(0) = O(n^γ), j = 1, …, J, for some γ > 0.
(ii) lim_{n→∞} φ_j²(n + 1) / c_{j,j}^n(0) = 0, j = 1, …, J.  (4.9)
(iii) lim_{n→∞} c_{j,k}^n(h) / ( c_{j,j}^n(0) c_{k,k}^n(0) )^{1/2} = m_{jk}(h).  (4.10)

We may take φ1(t) = 1 (constant), which evidently satisfies Assumption 4.2; hence the regression part φ(t) of {Y(t)} may include a constant. We define the J × J matrix m_φφ(u) by m_φφ(u) = {m_{jk}(u)}. From Brillinger (2001, p. 175), there exists a J × J matrix-valued function G_φφ(λ), −π < λ ≤ π, whose entries are of bounded variation, such that

m_φφ(u) = ∫_{−π}^π exp(iuλ) dG_φφ(λ)  (4.11)

for u = 0, ±1, …. Under these assumptions, we obtain the following theorem.

Theorem 4.1. Suppose Assumptions 2.3, 3.1, 4.1 and 4.2. Then

lim_{n→∞} n^γ E[ (θ̂_LSE^C − θ)(θ̂_LSE^C − θ)′ ] = 2π m_φφ(0)^{−1} ∫_{−π}^π f_{η−V,η−V}(λ) dG_φφ(λ) m_φφ(0)^{−1},  (4.12)

where f_{η−V,η−V}(λ) = f_{η,η}(λ) − f_{ηX}(λ) f_XX(λ)^{−1} f_{Xη}(λ) is the spectral density of η(t) − V(t). Here V(t) = Σ_{u=0}^∞ a′(u) X(t − u), a(u) = (1/2π) ∫_{−π}^π A(λ) exp(iuλ) dλ, and A(λ) = f_XX(λ)^{−1} f_{Xη}(λ).
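The regression control variate estimator of (4.7) can be sketched as follows. This is our simplified illustration, not the paper's implementation: the filter coefficients a(u) are estimated by a time-domain least-squares regression of the LSE residuals on lags of X, a stand-in for the smoothed-periodogram construction of (4.3)–(4.6), and all names, defaults, and simulated data are ours:

```python
import numpy as np

def cv_regression_estimator(y, x, phi, m_n=10):
    """Sketch of the control variate regression estimator of (4.7).

    phi: (n, J) regression design; x: (n,) scalar control process with
    mean 0.  The filter on lags 0..m_n of X is fitted by least squares
    to the LSE residuals (a surrogate for the spectral filter a(u)).
    """
    theta_lse = np.linalg.lstsq(phi, y, rcond=None)[0]   # \bar{theta}_LSE
    eta_hat = y - phi @ theta_lse                        # LSE residuals \hat{eta}(t)
    lags = np.column_stack([np.roll(x, u) for u in range(m_n + 1)])
    a_hat = np.linalg.lstsq(lags[m_n:], eta_hat[m_n:], rcond=None)[0]
    w_hat = lags @ a_hat                                 # plays the role of \hat{W}_M(t)
    # least squares on the filtered series, dropping wrapped warm-up rows
    return np.linalg.lstsq(phi[m_n:], (y - w_hat)[m_n:], rcond=None)[0]

# hypothetical illustration: phi(t) = (1, cos(pi t / 4))', theta = (1, 1)'
rng = np.random.default_rng(1)
n = 4000
t = np.arange(n)
phi = np.column_stack([np.ones(n), np.cos(np.pi * t / 4)])
u = rng.standard_normal(n)
v = rng.standard_normal(n)
x = u + 0.5 * np.roll(u, 1)          # control process correlated with u
y = phi @ np.array([1.0, 1.0]) + u + v
theta_reg = cv_regression_estimator(y, x, phi)
```

Here the filtered disturbance is essentially v(t), so the second-stage least squares estimates θ with a smaller error covariance than the plain LSE, mirroring (4.12) versus (4.13).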
Note that the least squares estimator θ̄_LSE of θ has the following asymptotic variance:

lim_{n→∞} n^γ E[ (θ̄_LSE − θ)(θ̄_LSE − θ)′ ] = 2π m_φφ(0)^{−1} ∫_{−π}^π f_{η,η}(λ) dG_φφ(λ) m_φφ(0)^{−1},  (4.13)

where f_{η,η}(λ) is the spectral density of η(t). It is seen that

f_{η−V,η−V}(λ) ≡ f_{η,η}(λ) − f_{ηX}(λ) f_XX(λ)^{−1} f_{Xη}(λ) ≤ f_{η,η}(λ),  (4.14)

which implies that the asymptotic covariance matrix of θ̂_LSE^C is smaller than that of θ̄_LSE.

Table 1
The SMSE of θ̂_C and Ȳ.

a1    a2    SMSE_θ̂C      SMSE_Ȳ       SMSE_Ȳ − SMSE_θ̂C
0.1   0.1   0.00062385   0.00205319   0.00142934
0.1   0.2   0.00030573   0.00196825   0.00166252
0.1   0.3   0.00012184   0.00201872   0.00189688
0.2   0.1   0.00068064   0.00193495   0.00125431
0.2   0.2   0.00042055   0.00206655   0.00164600
0.2   0.3   0.00020560   0.00192683   0.00172123
0.3   0.1   0.00071185   0.00201960   0.00130775
0.3   0.2   0.00047111   0.00195619   0.00148508
0.3   0.3   0.00027794   0.00198016   0.00170222
0.3   4.0   0.00068665   0.00199143   0.00130478

Table 2
SMSE_C and SMSE_LSE (φ(t) = (1, t)′).

a2    SMSE_C       SMSE_LSE     SMSE_LSE − SMSE_C
0.1   0.00786141   0.01047323   0.00261182
0.2   0.00413770   0.00543288   0.00129518
0.3   0.00493462   0.00533173   0.00039711

Table 3
SMSE_C and SMSE_LSE (φ(t) = (1, cos(πt/4))′).

a2    SMSE_C       SMSE_LSE     SMSE_LSE − SMSE_C
0.1   0.00594558   0.00837971   0.00243413
0.2   0.00747352   0.00749953   0.00002601
0.3   0.00913403   0.00940378   0.00026975

Table 4
Ŝ_C(N) and Ŝ(N).

Ŝ_C(N)     Ŝ(N)       S(N)
891.0968   891.4768   891
5. Numerical study

In this section we examine our control variate estimators numerically. By simulation, we compare the control variate estimator with the sample mean in Example 5.1 and with the least squares estimator in Example 5.2. Example 5.3 deals with real financial data. We then see how our estimators improve on the sample mean and the least squares estimator.

Example 5.1. Let the process of interest {Y(t)} and the control process {X(t)} be generated by

Y(t) = u(t) + v(t),  (5.1)
X(t) = a1 u(t) + 0.4 u(t − 1) + a2 v(t),  (5.2)

respectively, where a1, a2 are constants. Here {u(t)} and {v(t)} are mutually independent, and each is i.i.d. N(0, 1). Based on 1000 observations of {Y(t)} and {X(t)}, we first evaluate the sample mean square error (SMSE) of the control variate estimator θ̂_C and the sample mean Ȳ in the setting of (2.1). In what follows we set M_n = 20. Table 1 shows the SMSE of θ̂_C and Ȳ for various values of a1, a2 over 5000 replications. From Table 1, we see that SMSE_Ȳ − SMSE_θ̂C becomes larger as the coefficient a2 of the error process v(t) grows, which implies that if the control variate is highly correlated with the disturbance, then θ̂_C is better than Ȳ. However, excessive influence of the disturbance worsens the performance of θ̂_C.

Example 5.2. Let the process of interest {Y(t)} and the control process {X(t)} be generated by

Y(t) = μ(t) + u(t) + v(t),  (5.3)
X(t) = 0.3 u(t) + 0.4 u(t − 1) + a2 v(t),  (5.4)

respectively, where a2 is a constant, μ(t) = θ′φ(t), φ(t) is a regression function and θ is a vector-valued parameter. Here {u(t)} and {v(t)} are mutually independent, and each is i.i.d. N(0, 1). The sample sizes of Y(t) and X(t) are both 1000. For these samples the control variate estimator θ̂_LSE^C and the least squares estimator θ̄_LSE are calculated. We repeated this procedure 5000 times (writing the i-th control variate and least squares estimators as θ̂_LSE^{(i)C} and θ̄_LSE^{(i)}) and evaluated

SMSE_C ≡ (1/5000) Σ_{i=1}^{5000} ‖θ̂_LSE^{(i)C} − θ‖²  and  SMSE_LSE ≡ (1/5000) Σ_{i=1}^{5000} ‖θ̄_LSE^{(i)} − θ‖²

for the regression functions φ(t) = (1, t)′ and φ(t) = (1, cos(πt/4))′. The true value of θ = (θ1, θ2)′ is taken to be (1, 1)′.
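A reduced-scale Monte Carlo in the spirit of Example 5.1 can be sketched as follows. This is our illustration with smaller n and fewer replications than the paper, and with a time-domain least-squares surrogate for the spectral filter of θ̂_C, so the numbers only qualitatively track Table 1:

```python
import numpy as np

def example51_smse(a1, a2, n=500, reps=200, m_n=20, seed=0):
    """Monte Carlo SMSE comparison for the design (5.1)-(5.2), theta = 0.

    theta_hat_C is approximated by regressing Y on lags 0..m_n of X
    (a surrogate for the smoothed-periodogram filter); returns the
    SMSE of the control variate estimator and of the sample mean.
    """
    rng = np.random.default_rng(seed)
    err_cv, err_mean = [], []
    for _ in range(reps):
        u = rng.standard_normal(n + 1)
        v = rng.standard_normal(n)
        y = u[1:] + v                                   # Y(t) = u(t) + v(t)
        x = a1 * u[1:] + 0.4 * u[:-1] + a2 * v          # X(t) as in (5.2)
        lags = np.column_stack([np.roll(x, k) for k in range(m_n + 1)])
        a_hat = np.linalg.lstsq(lags[m_n:], (y - y.mean())[m_n:], rcond=None)[0]
        theta_c = (y - lags @ a_hat)[m_n:].mean()       # control variate estimate
        err_cv.append(theta_c ** 2)
        err_mean.append(y.mean() ** 2)                  # sample mean estimate
    return np.mean(err_cv), np.mean(err_mean)

smse_c, smse_mean = example51_smse(0.1, 0.3)
```

As in the third row of Table 1, the control variate estimator's SMSE comes out well below that of the sample mean when a2 is large, since X then tracks the disturbance v closely.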
For various values of a2, Table 2 shows SMSE_C and SMSE_LSE for φ(t) = (1, t)′, and Table 3 shows those for φ(t) = (1, cos(πt/4))′. From Tables 2 and 3, we observe that SMSE_C is smaller than SMSE_LSE; that is, the control variate estimator also improves on the least squares estimator. The next example deals with real financial data.

Example 5.3. We calculate the control variate estimator θ̂_C and the sample mean Ȳ of NIPPON OIL CORPORATION's log return of stock price {Y(t)} from 7/19/2007 to 12/12/2007, taking as the control variate process the difference between the Yen–Euro exchange rate and its sample mean from 7/4/2007 to 12/11/2007. Then, using the control variate estimator θ̂_C and the sample mean Ȳ, we forecast NIPPON OIL CORPORATION's stock price S(N) at N = 12/13/2007; that is, the forecasts of S(N) are calculated as Ŝ_C(N) ≡ exp(θ̂_C + log S(N − 1)) and Ŝ(N) ≡ exp(Ȳ + log S(N − 1)).
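The one-step-ahead price forecast in Example 5.3 simply exponentiates the estimated mean log return on top of the last log price. A minimal sketch, with made-up numbers rather than the NIPPON OIL data:

```python
import math

def forecast_price(prev_price, mean_log_return):
    """One-step forecast S_hat(N) = exp(theta_hat + log S(N-1))."""
    return math.exp(mean_log_return + math.log(prev_price))

# hypothetical inputs: last price S(N-1) and an estimated mean log return
s_prev = 890.0
theta_hat = 0.001
s_hat = forecast_price(s_prev, theta_hat)   # equals s_prev * exp(theta_hat)
```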
The results are given in Table 4. From Table 4, the prediction value Sˆ C (N ) is nearer to the true value S (N ) than Sˆ (N ), which implies the prediction by the control variate estimator is better than that by the sample mean. There are many fields (econometrics, natural sciences, medical sciences etc.) where we should estimate the statistical models for data of interest under the circumstance that we can use some C related variables. In such situations, our estimators θˆC and θˆLSE can be used, and are more efficient than the usual estimators. 6. Proofs This section provides the proofs of the theorems. Now, we prepare the following lemmas. Lemma 6.1 is from Brillinger (2001, Theorem 8.3.1). Lemma 6.1. Let {Y (t ); t ∈ Z} and {X (t ); t ∈ Z} be a stationary process with mean θ and p-dimensional vector valued stationary process
24
T. Amano, M. Taniguchi / Journal of Econometrics 165 (2011) 20–29
with mean vector 0, respectively. The process (Y (t ), X (t )′ )′ ; t ∈ Z is also supposed to be jointly stationary with spectral density matrix
fYY (λ) fXY (λ)
fYX (λ) . fXX (λ)
−
(6.1)
b′ (u)X (t − u)
(6.2)
π
2π
−π
∑∞
b (u)X (t − u) is then given ′
u =0
1 3
+
− β ) and k ∈ N, we have (6.5)
+
Proof of Lemma 6.2. From Theorem 7.7.3 of Brillinger (2001), for arbitrary ϵ > 0, there exists C1 > 0 such that
|fˆXX (λ) − fXX (λ)| ≤ C1 Ln,ϵ a.s.,
(6.6)
(6.7)
|fˆXY (λ) − fXY (λ)| ≤ C3 Ln,ϵ
−
E ‖Aˆ (λ) − A(λ)‖2k = E ‖fˆXX (λ)−1 fˆXY (λ) − fXX (λ)−1 fXY (λ)‖2k = E ‖(fˆXX (λ)−1 − fXX (λ)−1 )fˆXY (λ)
(6.9)
Proof of Theorem 3.1. We obtain
u =0
Mn
aˆ n (v)X (s − v) ′
u = Mn + 1
a′ (u)X (t − u) −
(6.11)
(6.15)
n −1 n −1 1 −−
n t =0 s=0
E
∞ −
a (u)X (t − u) ′
u=Mn +1
∞ −
a (v)X (s − v) ′
(6.16)
n −1 n −1 1 −−
n t =0 s=0
E
Mn − (ˆa′n (u) − a′ (u))X (t − u)
u=0
(6.17)
n −1 n −1 2 −−
n t =0 s=0
Y (t ) − θ −
E
∞ −
a (u)X (t − u) ′
u=0
(6.18)
n −1 n −1 2 −−
n t =0 s=0
E
∞ −
a (u)X (t − u) ′
u=Mn +1
(6.19)
′ For convenience, let V (t ) ≡ Y (t ) − θ − u=0 a (u)X (t − u) and a˜ n (u) ≡ aˆ n (u) − a(u). Initially we evaluate (6.14). From Lemma 6.1, (2.10) (e.g., Brillinger (2001), Theorem 5.2.1), it is seen that
a (u)X (t − u) ′
1
n−1 − n −1 −
2π n t =0 s=0
E [V (t ) × V (s)]
(6.20)
→ 2π (fYY (0) − fYX (0)fXX (0)−1 fXY (0)) as n → ∞. (6.21) ∑∞ Next we evaluate (6.15). Let B(λ) ≡ u=0 B(u) exp(−iuλ). Since Assumption 2.2 is satisfied and A(λ) is a homorophic function of B(λ), then by Brillinger (2001, p. 78), it holds that a(u)’s satisfy ∞ −
|u|4 |a(u)| < ∞,
(6.22)
u =0
hence, we have
u =0
− (ˆa′n (u) − a′ (u))X (t − u) (6.12) u=0
a (v)X (s − v) ′
(6.10)
v=0
+
a (u)X (t − u)
u=0
∞ −
(6.14) = 2π ·
nE |θˆC − θ|2 Mn n−1 n−1 − 1 −− ′ = E Y (t ) − θ − aˆ n (u)X (t − u)
Mn
′
∑∞
Now we proceed to prove Theorem 3.1.
∞ −
Y (t ) − θ −
E
∞ −
v=0
= O(Ln,ϵ )2k .
n t =0 s=0
Mn − ′ ′ × (ˆan (v) − a (v))X (s − v) .
+ fXX (λ)−1 (fˆXY (λ) − fXY (λ))‖2k ≤ O(Ln,ϵ )2k E ‖fˆXY (λ)‖2k + O(Ln,ϵ )2k
Y (t ) − θ −
(6.14)
v=0
(6.8)
for some C3 > 0, hence we obtain
E
a (v)X (s − v)
Mn − ′ ′ × (ˆan (v) − a (v))X (s − v)
where C2 is independent of λ. Similarly as in the above, we have
1 −−
′
v=0
−
−1 −1 (λ)| ≤ C2 Ln,ϵ a.s., (λ) − fXX |fˆXX
∞ −
∞ −
Mn − ′ ′ × (ˆan (v) − a (v))X (s − v)
where C1 is independent of λ, hence, there exists C2 > 0
a (u)X (t − u)
v=Mn +1
1 where Ln,ϵ = n− 3 +ϵ .
Y (s) − θ −
′
u=0
n t =0 s=0
×
E ‖Aˆ (λ) − A(λ)‖2k = O(Ln,ϵ )2k ,
−
∞ −
v=Mn +1
Lemma 6.2. For any given ϵ (0 < ϵ <
Y (t ) − θ −
E
n −1 n −1 2 −−
×
(6.4)
n t =0 s=0
Y ( s) − θ −
n−1 n−1
(6.13)
v=0
(6.3)
fYY (λ) − fYX (λ)fXX (λ)−1 fXY (λ).
=
n t =0 s =0
+ fXX (λ)−1 fXY (λ) exp(iuλ)dλ.
The spectral density of y(t ) − θ − by
×
− (ˆa′n (v) − a′ (v))X (s − v)
n −1 n −1 1 −−
×
where {b(u)} is an absolutely summable filter. This process is minimized in the sense of mean square if and only if
∫
v=Mn +1
u =0
1
a′ (v)X (s − v)
v=0
=
b(u) =
∞ −
a′ (v)X (s − v) +
v=0
Consider the following process Y (t ) − θ −
∞ −
Mn
∞ −
Y (s) − θ −
×
− u > Mn
|a(u)| ≤
1 − Mn4 u>M n
|u|4 |a(u)| = o
1 n
.
(6.23)
T. Amano, M. Taniguchi / Journal of Econometrics 165 (2011) 20–29
n −1 − n−1 ∞ 2 − − ′ |(6.15)| = E V (t ) × a (v)X (s − v) n t =0 s=0 v=M +1
Then by (6.33), (6.24)
n
2 21 ∞ − ′ (6.25) ≤ (E |V (t )| ) E a (v)X (s − v) v=M +1 n t =0 s=0 n ∫ ′ ∞ π − 1 = 2n(E |V (t )|2 ) 2 a(v)eivλ fXX (λ) n−1 n−1 2 −−
v=Mn +1
a(v)e−ivλ
dλ
(6.26)
v=Mn +1
2 12 ∞ π − 1 ≤ nK (E |V (t )|2 ) 2 a(v)eivλ dλ −π v=M +1 ∫
(6.27)
for some finite K , and
2 2 ∫ π ∞ ∞ − − ivλ |a(v)| . a(v)e dλ ≤ O −π v=M +1 v=M +1
(6.28)
n
n
Then from (6.23), it is seen that (6.15) tends to 0 as n → ∞. For (6.16), from Brillinger (2001) Theorem 5.2.1, (2.10) implies lim
n→∞
n s,t =0
R(s − t ) = 2π fXX (0),
1
E tr (Bn (u, v)B′n (u, v)Bn (u, v)B′n (u, v))
=
n −1 −
1 n4
E {X ′ (t1 − u)X (t2 − u)X ′ (t3 − u)X (t4 − u)
t1 ,...,t8 =0
× X ′ (t5 − v)X (t6 − v)X ′ (t7 − v)X (t8 − v)} n −1 − − C1 ≤ 4 sup |E {Xa1 (t1 − r1 )Xa2 (t2 − r2 ) a1 ,...,a8 r =u,v t1 ,...,t8 =0 i
× Xa3 (t3 − r3 )Xa4 (t4 − r4 ) × Xa5 (t5 − r5 )Xa6 (t6 − r6 )Xa7 (t7 − r7 )Xa8 (t8 − r8 )}| ∞ − 1 ≤ C2 sup |Cuma1 ,...,a8 (t1 , . . . , t7 )| 3 n
+
1 n2
(6.29)
a1 ,...,a8 t ,...,t =−∞ 1 7
∞ −
sup
a1 ,...,a8 t ,...,t =−∞ 1 5
∞ −
×
|Cuma1 ,...,a6 (t1 , . . . , t5 )|
|Cuma7 ,a8 (t7 )|
t8 =−∞
then ∞ −
|(6.16)| =
∞ −
a (u)
+
′
u=Mn +1 v=Mn +1
×
n−1 n−1 1 −−
n t =0 s=0
≤C
E [X (t − u)X ′ (s − v)]a(v)
(6.30)
2
∞ −
n2
(6.31)
n2
|(6.17)| =
Mn − Mn 1−
n u=0 v=0
E (ˆan (u) − a(u))′
n −1 − n−1 −
X (t − u)
n u=0 v=0
π
π
∫
E
−π
≤ 4π
n
ei(uλ+vµ) (Aˆ (λ) − A(λ))′ Bn (u, v)
+
1 n
u,v λ,µ
and Neudecker (1999, p. 201), for arbitrary matrices A, B, C ,
|tr A BC | ≤ (tr A A) (tr C (B BC )) ′
1 4
|Cuma1 ,...,a3 (t1 , t2 )|
|Cuma4 ,...,a6 (t4 , t5 )|
|R(t1 )|
∞ −
∞ −
|R(t2 )|
t2 =−∞
∞ − t3 =−∞
< ∞,
≤ (tr A′ A) (tr B′ BB′ B) (tr CC ′ CC ′ ) .
|R(t3 )|
∞ −
|R(t4 )|
t4 =−∞
(6.35)
1
1 4
|Cuma7 ,a8 (t7 )|
t7 =−∞
|(6.17)| = O(Mn Ln,ϵ )2 = O(nβ− 3 +ϵ )2 ,
= (tr A′ A) 2 (tr (B′ B)′ (CC ′ )) 2 1 2
|Cuma7 ,a8 (t7 )|
where C1 and C2 are some constants. Hence from Lemma 6.2, it follows that
1 2 1
1
∞ −
a1 ,...,a8 t ,t =−∞ 1 2
t1 =−∞
∞ − t7 =−∞
sup
∞ −
+
− A(λ)) Bn (u, v)(Aˆ (µ) − A(µ))]|, (6.32) ∑n−1 ∑n−1 ′ where Bn (u, v) = t =0 s=0 X (t − u)X (s − v). From Magnus ′
|Cuma1 ,...,a4 (t1 , . . . , t3 )|
|Cuma5 ,a6 (t5 )|
t4 ,t5 =−∞
sup sup |E [(Aˆ (λ)
1 2
a1 ,...,a8 t ,...,t =−∞ 1 3
∞ −
×
|Cuma1 ,...,a4 (t1 , . . . , t3 )|
|Cuma5 ,...,a8 (t5 , . . . , t7 )| ∞ −
sup
∞ −
−π
′
′
a1 ,...,a8 t ,...,t =−∞ 1 3
t5 =−∞
× (Aˆ (µ) − A(µ))dλdµ 2 2 Mn
n
×
× X ′ (s − v)(ˆan (v) − a(v)) =
1
t =0 s=0
∫ Mn − Mn 1−
∞ −
sup
t5 ,...,t7 =−∞
+
|Cuma1 ,...,a5 (t1 , . . . , t4 )|
|Cuma6 ,...,a8 (t6 , t7 )|
∞ −
×
for some C . Thus (6.16) tends to 0 as n → ∞. For (6.17) we have
a1 ,...,a8 t ,...,t =−∞ 1 4
t6 ,t7 =−∞
1
∞ −
sup
∞ −
×
+
|a(u)|
1
u = Mn + 1
′
(6.34)
By (2.7), (2.10) and Theorem 2.3.1 of Brillinger (2001), we have
n
n
n −1 1 −
1 1 × B′n (u, v))) 4 (E ‖Aˆ (µ) − A(µ)‖4 ) 4 .
n4
12
∞ −
×
|E [(Aˆ (λ) − A(λ))′ Bn (u, v)(Aˆ (µ) − A(µ))]| 1 ≤ (E ‖Aˆ (λ) − A(λ)‖2 ) 2 (E tr (Bn (u, v)B′n (u, v)Bn (u, v)
2 21
−π
25
(6.33)
which tends to 0 as n → ∞.
(6.36)
26
T. Amano, M. Taniguchi / Journal of Econometrics 165 (2011) 20–29
For (6.18), similarly to (6.17) we have

|(6.18)| ≤ (1/n) Σ_{v=0}^{M_n} E| Σ_{t=0}^{n−1} (Y(t) − θ) Σ_{s=0}^{n−1} X′(s − v)(â_n(v) − a(v)) |
≤ (1/n) Σ_{v=0}^{M_n} { Σ_{t₁,t₂=0}^{n−1} Σ_{s₁,s₂=0}^{n−1} E[(Y(t₁) − θ)(Y(t₂) − θ) X′(s₁ − v)X(s₂ − v)] }^{1/2} (E‖â_n(v) − a(v)‖²)^{1/2},   (6.37)–(6.41)

and the remaining terms are bounded analogously by

M_n sup_v (E‖â_n(v) − a(v)‖²)^{1/2} Σ_{u=0}^{∞} ‖a(u)‖ (n^{−4} E tr(B_n(u, v)B_n′(u, v)B_n(u, v)B_n′(u, v)))^{1/4}   (6.42)–(6.44)
= O(M_n L_{n,ε}),   (6.45)

which implies that (6.18) converges to 0 as n → ∞. For (6.19), similarly we obtain

|(6.19)| ≤ (1/n²) { E| Σ_{t=0}^{n−1} Σ_{u=M_n+1}^{∞} a′(u)X(t − u) |² }^{1/2} { E| Σ_{v=0}^{M_n} (â_n(v) − a(v))′ Σ_{s=0}^{n−1} X(s − v) |² }^{1/2} = O(M_n L_{n,ε}/n),   (6.46)

which implies that (6.19) converges to 0 as n → ∞.

For the proof of Theorem 4.1, we need the following lemma.

Lemma 6.3. Under the same assumptions as in Theorem 4.1, for any given ε (0 < ε < 1/3 − β) and k ∈ N, we have

E‖Â(λ) − A(λ)‖^k = O(L_{n,ε})^k.   (6.47)

Proof of Lemma 6.3. Since η̂(t) is represented as

η̂(t) = Y(t) − θ̄′_{LSE} φ(t) = Y(t) − θ′φ(t) − (θ̄_{LSE} − θ)′φ(t) = η(t) − (θ̄_{LSE} − θ)′φ(t),   (6.48)

it is seen that

f̂_{Xη̂}(λ) = f̂_{Xη}(λ) − (θ̄_{LSE} − θ)′ f̂_{Xφ}(λ),   (6.49)

where f̂_{Xη}(λ) ≡ (2π/n) Σ_{s=1}^{n−1} W_n(λ − 2πs/n) I_{Xη}(2πs/n), f̂_{Xφ}(λ) ≡ (2π/n) Σ_{s=1}^{n−1} W_n(λ − 2πs/n) I_{Xφ}(2πs/n) and I_{Xη}, I_{Xφ} are the periodograms of X(t) and η(t), X(t) and φ(t), respectively. Hence, for an arbitrary positive integer k we have

E‖Â(λ) − A(λ)‖^k = E‖f̂_{XX}^{−1}(λ) f̂_{Xη̂}(λ) − A(λ)‖^k
= E‖f̂_{XX}^{−1}(λ) f̂_{Xη}(λ) − A(λ) − (θ̄_{LSE} − θ)′ f̂_{XX}^{−1}(λ) f̂_{Xφ}(λ)‖^k
≤ M ( E‖f̂_{XX}^{−1}(λ) f̂_{Xη}(λ) − A(λ)‖^k + E‖(θ̄_{LSE} − θ)′ f̂_{XX}^{−1}(λ) f̂_{Xφ}(λ)‖^k ),   (6.50)–(6.52)

where M is a constant. From Lemma 6.2, we have

E‖f̂_{XX}^{−1}(λ) f̂_{Xη}(λ) − A(λ)‖^k = O(n^{−1/3+ε})^k.   (6.53)–(6.55)

For E‖(θ̄_{LSE} − θ)′ f̂_{XX}^{−1}(λ) f̂_{Xφ}(λ)‖^k, from Magnus and Neudecker (1999, p. 201), for an arbitrary vector a and matrix B,

‖a′B‖ = (tr aa′BB′)^{1/2} ≤ (tr aa′aa′)^{1/4} (tr BB′BB′)^{1/4} = ‖a‖ (tr BB′BB′)^{1/4}.   (6.56)

Then, applying this inequality we have

‖a′BC‖ ≤ ‖a‖ (tr BB′BB′)^{1/4} (tr CC′CC′)^{1/4}.   (6.57)

Then by this inequality,

E‖(θ̄_{LSE} − θ)′ f̂_{XX}^{−1}(λ) f̂_{Xφ}(λ)‖^k ≤ {E‖θ̄_{LSE} − θ‖^{2k}}^{1/2} {E[tr(f̂_{XX}^{−1}(λ))⁴]^k}^{1/4} {E[tr f̂_{Xφ}(λ)f̂_{Xφ}(λ)′f̂_{Xφ}(λ)f̂_{Xφ}(λ)′]^k}^{1/4}.   (6.58)–(6.59)

For {E‖θ̄_{LSE} − θ‖^{2k}}^{1/2}, Theorem 5.11.1 of Brillinger (2001) and its proof imply that the covariance matrix of n^{γ/2}(θ̄_{LSE} − θ) converges and its cumulants of order greater than 2 are o(1); then

{E‖θ̄_{LSE} − θ‖^{2k}}^{1/2} = O((1/n^{γ/2})^k).   (6.60)

Next, for {E[tr(f̂_{XX}^{−1}(λ))⁴]^k}^{1/4}, from Theorem 7.7.3 of Brillinger (2001), this term is bounded for large n. Finally we evaluate {E[tr f̂_{Xφ}(λ)f̂_{Xφ}(λ)′f̂_{Xφ}(λ)f̂_{Xφ}(λ)′]^k}^{1/4}. Let the (i, j)-th component of f̂_{Xφ}(λ) be f̂_{Xφ}^{i,j}(λ) and d_{X_i}(λ) = Σ_{t=0}^{n−1} X_i(t)e^{itλ}, d_{φ_j}(λ) = Σ_{t=0}^{n−1} φ_j(t)e^{itλ}; then from the definition

f̂_{Xφ}^{i,j}(λ) = (1/n²) Σ_{s=0}^{n−1} W_n(λ − 2πs/n) d_{X_i}(2πs/n) d_{φ_j}(2πs/n).   (6.61)

Hence from Brillinger (2001, p. 95),

E[f̂_{Xφ}^{i,j}(λ) f̂_{Xφ}^{k,l}(λ)] = (2π/n³) Σ_{s=0}^{n−1} W_n²(λ − 2πs/n) d_{φ_j}(2πs/n) d_{φ_l}(2πs/n) E[I_{X_i,X_k}(λ)] + o(1/n).   (6.62)

Since, due to Theorem 5.2.2 of Brillinger (2001), E[I_{X_i,X_k}(λ)] = O(1), from Assumption 3.1 W_n(λ) = O(n^{1/3}) uniformly in λ, and from Assumption 4.2,

|E[f̂_{Xφ}^{i,j}(λ) f̂_{Xφ}^{k,l}(λ)]| = O( (1/n³) Σ_{s=0}^{n−1} W_n²(λ − 2πs/n) |d_{φ_j}(2πs/n) d_{φ_l}(2πs/n)| ) = O( n^{−2/3} (1/n) Σ_{s=0}^{n−1} |d_{φ_j}(2πs/n) d_{φ_l}(2πs/n)| ) = O(n^{−2/3} n^γ).   (6.63)–(6.65)

Since, from the definition, [tr f̂_{Xφ}(λ)f̂_{Xφ}(λ)′f̂_{Xφ}(λ)f̂_{Xφ}(λ)′]^k is a 4k-th degree polynomial of the f̂_{Xφ}^{i,j}(λ), and the covariance matrix of n^{−2/3} n^{γ/2} f̂_{Xφ}(λ) converges with cumulants of order greater than 2 being o(1) (Theorem 7.7.1 of Brillinger (2001)), we obtain

{E[tr f̂_{Xφ}(λ)f̂_{Xφ}(λ)′f̂_{Xφ}(λ)f̂_{Xφ}(λ)′]^k}^{1/4} = O((n^{−2/3} n^{γ/2})^k),   (6.66)

which implies

E‖(θ̄_{LSE} − θ)′ f̂_{XX}^{−1}(λ) f̂_{Xφ}(λ)‖^k = O(n^{−2/3})^k.   (6.67)

The lemma is proved.

Next we proceed to prove Theorem 4.1.

Proof of Theorem 4.1. For convenience, let W_M(t) = Σ_{u=0}^{M_n} a′(u)X(t − u), W^M(t) = Σ_{u=M_n+1}^{∞} a′(u)X(t − u), W(t) = Σ_{u=0}^{∞} a′(u)X(t − u), V(t) = Y(t) − φ′(t)θ − W(t), W^M = (W^M(1), …, W^M(n))′, W_M = (W_M(1), …, W_M(n))′, W = (W(1), …, W(n))′, and V = (V(1), …, V(n))′. We have

n^γ E[(θ̂_{LSE} − θ)(θ̂_{LSE} − θ)′]
= n^γ E[(φ′φ)^{−1} φ′(Y − φθ − Ŵ_M)(Y − φθ − Ŵ_M)′ φ(φ′φ)^{−1}]
= n^γ E[(φ′φ)^{−1} φ′{(Y − φθ − W) + W^M − (Ŵ_M − W_M)}{(Y − φθ − W) + W^M − (Ŵ_M − W_M)}′ φ(φ′φ)^{−1}]   (6.68)
= n^γ E[(φ′φ)^{−1} φ′VV′ φ(φ′φ)^{−1}]   (6.69)
+ n^γ E[(φ′φ)^{−1} φ′W^M W^{M′} φ(φ′φ)^{−1}]   (6.70)
+ n^γ E[(φ′φ)^{−1} φ′V W^{M′} φ(φ′φ)^{−1}]   (6.71)
+ n^γ E[(φ′φ)^{−1} φ′W^M V′ φ(φ′φ)^{−1}]   (6.72)
+ n^γ E[(φ′φ)^{−1} φ′(Ŵ_M − W_M)(Ŵ_M − W_M)′ φ(φ′φ)^{−1}]   (6.73)
− n^γ E[(φ′φ)^{−1} φ′V(Ŵ_M − W_M)′ φ(φ′φ)^{−1}]   (6.74)
− n^γ E[(φ′φ)^{−1} φ′(Ŵ_M − W_M)V′ φ(φ′φ)^{−1}]   (6.75)
− n^γ E[(φ′φ)^{−1} φ′W^M(Ŵ_M − W_M)′ φ(φ′φ)^{−1}]   (6.76)
− n^γ E[(φ′φ)^{−1} φ′(Ŵ_M − W_M)W^{M′} φ(φ′φ)^{−1}].   (6.77)

Initially we evaluate (6.69). From Theorem 5.11.1 of Brillinger (2001), we can see that (6.69) converges to

2π m_{φφ}(0)^{−1} ∫_{−π}^{π} f_{η−V,η−V}(λ) dG_{φφ}(λ) m_{φφ}(0)^{−1}.   (6.78)

Next for (6.70) we have

(6.70) = ((1/n^γ)φ′φ)^{−1} E[(1/n^γ) φ′W^M (φ′W^M)′] ((1/n^γ)φ′φ)^{−1}.   (6.79)–(6.80)

Due to Assumption 4.2, the elements of (1/n^γ)φ′φ converge. The (i, j)-th element of E[(1/n^γ) φ′W^M (φ′W^M)′] is represented as

(1/n^γ) E[ Σ_{s=1}^{n} φ_i(s)W^M(s) · Σ_{t=1}^{n} φ_j(t)W^M(t) ] = Σ_{h=−n+1}^{n−1} R_{W^M}(h) · (1/n^γ) Σ_{t=0}^{n−|h|} φ_i(t + h)φ_j(t),   (6.81)

where R_{W^M}(h) = E[W^M(t + h)W^M(t)]. From W^M(t) = Σ_{u=M_n+1}^{∞} a′(u)X(t − u) and Assumption 4.2, (1/n^γ) Σ_{t=0}^{n−|h|} φ_i(t + h)φ_j(t) converges, and from (6.23) Σ_{h=−n+1}^{n−1} R_{W^M}(h) = o(n^{−1/2}). Thus we obtain

(6.70) = o(n^{−1/2}).   (6.82)

Hence (6.70) converges to 0. For (6.71) and (6.72) we obtain

(6.71) = ((1/n^γ)φ′φ)^{−1} E[(1/n^γ) φ′V (φ′W^M)′] ((1/n^γ)φ′φ)^{−1}.   (6.83)

The (i, j)-th element of (1/n^γ) E[φ′V (φ′W^M)′] is represented as

(1/n^γ) E[ Σ_{s=1}^{n} φ_i(s)V(s) · Σ_{t=1}^{n} φ_j(t)W^M(t) ] = Σ_{h=−n+1}^{n−1} R_{VW^M}(h) · (1/n^γ) Σ_{t=0}^{n−|h|} φ_i(t + h)φ_j(t),   (6.84)

where R_{VW^M}(h) = E[V(t + h)W^M(t)]. Similar to (6.70), we have

(6.71) = o(n^{−1/2}),   (6.85)

which implies (6.71) converges to 0.
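The matrix-norm inequalities (6.56) and (6.57) from Magnus and Neudecker, used repeatedly in these bounds, are purely deterministic and easy to verify numerically. A minimal sketch with random matrices (the dimensions are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def s4(M):
    """The Schatten-4-type factor (tr M M' M M')**(1/4)."""
    MMt = M @ M.T
    return np.trace(MMt @ MMt) ** 0.25

for _ in range(1000):
    a = rng.normal(size=4)
    B = rng.normal(size=(4, 5))
    C = rng.normal(size=(5, 6))
    # (6.56): ||a'B|| = (tr aa'BB')^{1/2} <= ||a|| (tr BB'BB')^{1/4}
    lhs56 = np.linalg.norm(a @ B)
    assert abs(lhs56 - np.trace(np.outer(a, a) @ (B @ B.T)) ** 0.5) < 1e-8
    assert lhs56 <= np.linalg.norm(a) * s4(B) + 1e-9
    # (6.57): ||a'BC|| <= ||a|| (tr BB'BB')^{1/4} (tr CC'CC')^{1/4}
    assert np.linalg.norm(a @ B @ C) <= np.linalg.norm(a) * s4(B) * s4(C) + 1e-9
print("ok")
```

The second inequality follows from the first because the Schatten-4 factor dominates the operator norm, which is why the product bound in (6.57) never fails in the loop above.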
Similarly (6.72) converges to 0.

Now we evaluate (6.73). For convenience let W̃_M ≡ Ŵ_M − W_M; then we have

(6.73) = ((1/n^γ)φ′φ)^{−1} E[(1/n^γ) φ′W̃_M (φ′W̃_M)′] ((1/n^γ)φ′φ)^{−1},   (6.86)

and the (i, j)-th element of (1/n^γ) E[φ′W̃_M (φ′W̃_M)′] is represented as

(1/n^γ) E[ Σ_{s=1}^{n} φ_i(s)W̃_M(s) · Σ_{t=1}^{n} φ_j(t)W̃_M(t) ]
= (1/n^γ) E[ Σ_{s=1}^{n} φ_i(s) Σ_{u=0}^{M_n} (â(u) − a(u))′X(s − u) · Σ_{t=1}^{n} φ_j(t) Σ_{v=0}^{M_n} (â(v) − a(v))′X(t − v) ]   (6.87)
= Σ_{u=0}^{M_n} Σ_{v=0}^{M_n} E[ (â(u) − a(u))′ · (1/n^γ) Σ_{s=1}^{n} φ_i(s)X(s − u) ( Σ_{t=1}^{n} φ_j(t)X(t − v) )′ · (â(v) − a(v)) ].   (6.88)

Let C_n(u, v) = (1/n^γ)( Σ_{s=1}^{n} φ_i(s)X(s − u) )( Σ_{t=1}^{n} φ_j(t)X(t − v) )′; then applying (6.34) we have

(6.88) = Σ_{u=0}^{M_n} Σ_{v=0}^{M_n} ∫_{−π}^{π} ∫_{−π}^{π} e^{i(uλ+vµ)} E[(Â(λ) − A(λ))′ C_n(u, v)(Â(µ) − A(µ))] dλ dµ
≤ 4π² M_n² sup_{u,v} sup_{λ,µ} |E[(Â(λ) − A(λ))′ C_n(u, v)(Â(µ) − A(µ))]|   (6.89)
≤ 4π² M_n² sup_{u,v} sup_{λ,µ} (E‖Â(λ) − A(λ)‖²)^{1/2} (E tr(C_n(u, v)C_n′(u, v)C_n(u, v)C_n′(u, v)))^{1/4} (E‖Â(µ) − A(µ)‖⁴)^{1/4}.   (6.90)

From Assumption 4.2, (1/n^γ)φ′φ converges, and due to Theorem 5.11.1 of Brillinger (2001) and its proof, the covariance matrix of n^{γ/2}(φ′φ)^{−1} Σ_{t=1}^{n} φ_i(t)X(t − v) converges and its cumulants of order greater than 2 are o(1); hence E tr(C_n(u, v)C_n′(u, v)C_n(u, v)C_n′(u, v)) < ∞. Hence from Lemma 6.3,

(1/n^γ) E[ Σ_{s=1}^{n} φ_i(s)W̃_M(s) · Σ_{t=1}^{n} φ_j(t)W̃_M(t) ] = O(M_n²) sup_{u,v} sup_{λ,µ} (E‖Â(λ) − A(λ)‖²)^{1/2} (E‖Â(µ) − A(µ)‖⁴)^{1/4} = O(M_n² L_{n,ε}²),   (6.91)

which implies (6.73) converges to 0.   (6.92)

For (6.74), (6.75) it is seen that

(6.74) = ((1/n^γ)φ′φ)^{−1} E[(1/n^γ) φ′V (φ′W̃_M)′] ((1/n^γ)φ′φ)^{−1},   (6.93)

and the (i, j)-th element of (1/n^γ) E[φ′V (φ′W̃_M)′] is represented as

(1/n^γ) E[ Σ_{s=1}^{n} φ_i(s)V(s) · Σ_{t=1}^{n} φ_j(t)W̃_M(t) ] = Σ_{v=0}^{M_n} E[ (1/n^γ) Σ_{s=1}^{n} φ_i(s)V(s) ( Σ_{t=1}^{n} φ_j(t)X(t − v) )′ (â(v) − a(v)) ].   (6.94)

Let C_n¹(v) = (1/n^γ)( Σ_{s=1}^{n} φ_i(s)V(s) )( Σ_{t=1}^{n} φ_j(t)X(t − v) )′; then, applying (6.56),

(6.94) = Σ_{v=0}^{M_n} ∫_{−π}^{π} e^{ivµ} E[C_n¹(v)(Â(µ) − A(µ))] dµ
≤ 2π M_n sup_v sup_µ |E[C_n¹(v)(Â(µ) − A(µ))]|
≤ 2π M_n sup_v sup_µ (E‖Â(µ) − A(µ)‖²)^{1/2} (E‖C_n¹(v)‖²)^{1/2}.   (6.95)

From (2.7), (6.22) and p. 46 of Brillinger (2001), V(t) satisfies

Σ_{t₁,…,t_{k−1}=−∞}^{∞} {1 + |t_j|} |Cum(V(t₁), …, V(t_{k−1}))| < ∞

for j = 1, …, k − 1. Hence, similar to (6.90), E‖C_n¹(v)‖² < ∞. As is proven in (6.73),

(1/n^γ) E[ Σ_{s=1}^{n} φ_i(s)V(s) · Σ_{t=1}^{n} φ_j(t)W̃_M(t) ] = O(M_n L_{n,ε}),   (6.96)

which implies that (6.74) and (6.75) converge to 0.

For (6.76) and (6.77) we have

(6.76) = ((1/n^γ)φ′φ)^{−1} E[(1/n^γ) φ′W^M (φ′W̃_M)′] ((1/n^γ)φ′φ)^{−1},   (6.97)

and the (i, j)-th element of E[(1/n^γ) φ′W^M (φ′W̃_M)′] is represented as

(1/n^γ) E[ Σ_{s=1}^{n} φ_i(s)W^M(s) · Σ_{t=1}^{n} φ_j(t)W̃_M(t) ]
= Σ_{v=0}^{M_n} E[ Σ_{u=M_n+1}^{∞} a(u)′ · (1/n^γ) Σ_{s=1}^{n} φ_i(s)X(s − u) ( Σ_{t=1}^{n} φ_j(t)X(t − v) )′ (â(v) − a(v)) ]
= Σ_{v=0}^{M_n} ∫_{−π}^{π} e^{ivµ} E[ Σ_{u=M_n+1}^{∞} a(u)′ C_n(u, v)(Â(µ) − A(µ)) ] dµ
≤ 2π M_n sup_v sup_µ E| Σ_{u=M_n+1}^{∞} a(u)′ C_n(u, v)(Â(µ) − A(µ)) |
≤ 2π M_n Σ_{u=M_n+1}^{∞} ‖a(u)‖ sup_{u,v} sup_µ (E tr(C_n(u, v)C_n′(u, v)C_n(u, v)C_n′(u, v)))^{1/4} (E‖Â(µ) − A(µ)‖⁴)^{1/4}.   (6.98)

Similar to (6.73), (6.28) implies

(1/n^γ) E[ Σ_{s=1}^{n} φ_i(s)W^M(s) · Σ_{t=1}^{n} φ_j(t)W̃_M(t) ] = O(M_n L_{n,ε}/n),   (6.99)

which implies (6.76) converges to 0. Similarly, convergence to 0 of (6.77) is proved.

Acknowledgment

The authors would like to thank the referees for their comments, which improved the original version of this paper.

References
Brillinger, D.R., 2001. Time Series: Data Analysis and Theory. Society for Industrial and Applied Mathematics.
Chan, N.H., Wong, H.Y., 2006. Simulation Techniques in Financial Risk Management. Wiley, New York.
Glasserman, P., 2004. Monte Carlo Methods in Financial Engineering. Springer, New York.
Lavenberg, S.S., Welch, P.D., 1981. A perspective on the use of control variables to increase the efficiency of Monte Carlo simulations. Manage. Sci. 27, 322–335.
Magnus, J.R., Neudecker, H., 1999. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, New York.
Nelson, B.L., 1990. Control variate remedies. Oper. Res. 38, 974–992.
Rubinstein, R.Y., Marcus, R., 1985. Efficiency of multivariate control variates in Monte Carlo simulation. Oper. Res. 33, 661–677.
Journal of Econometrics 165 (2011) 30–44
Method of moments estimation and identifiability of semiparametric nonlinear errors-in-variables models

Liqun Wang (a), Cheng Hsiao (b, c, d)

(a) Department of Statistics, University of Manitoba, Winnipeg, Manitoba, Canada R3T 2N2
(b) Department of Economics, University of Southern California, Los Angeles, CA 90089-0253, USA
(c) Wang Yanan Institute for Studies in Economics, Xiamen University, China
(d) City University of Hong Kong, Hong Kong, China

Article history: Available online 12 May 2011.

JEL classification: C13; C14; C15.

Keywords: Fourier deconvolution; Identifiability; Instrumental variables; Measurement error; Method of moments; Root-n consistency; Semiparametric estimator; Simulation-based estimator.

Abstract: This paper deals with a nonlinear errors-in-variables model where the distributions of the unobserved predictor variables and of the measurement errors are nonparametric. Using the instrumental variable approach, we propose method of moments estimators for the unknown parameters and simulation-based estimators to overcome the possible computational difficulty of minimizing an objective function which involves multiple integrals. Both estimators are consistent and asymptotically normally distributed under fairly general regularity conditions. Moreover, root-n consistent semiparametric estimators and a rank condition for model identifiability are derived using the combined methods of the nonparametric technique and Fourier deconvolution. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

Measurement error occurs frequently (e.g. Aigner et al., 1984; Fuller, 1987; Hsiao, 1992). If a model is linear in variables, the issue of random measurement error can often be overcome through the use of the instrumental variable method. If a model is nonlinear in variables, the conventional instrumental variable method, in general, does not yield consistent estimators of the unknown parameters when the variables are subject to random measurement errors (e.g. Amemiya, 1985, 1990; Hsiao, 1989). To obtain consistent estimators for nonlinear measurement error models, some researchers assume that the measurement error variances tend to zero as the sample size increases to infinity (e.g. Wolter and Fuller, 1982; Amemiya, 1985, 1990; Stefanski and Carroll, 1985; Amemiya and Fuller, 1988). Alternatively, other researchers assume that the conditional distribution of the unobserved predictor variable given its observed proxy is known up to a finite-dimensional parameter (e.g. Hsiao, 1989, 1992). Later, Li (2002) and Schennach (2004) studied models with replicate observations, while Schennach (2007) used the instrumental variable approach. In addition, various special nonlinear models have been investigated, e.g., polynomial models with a scalar predictor variable (Cheng and Schneeweiss, 1998; Hausman et al., 1991, 1995; Huang and Huwang, 2001), and limited dependent variable models (Weiss, 1993; Wang, 1998, 2002; Wang and Hsiao, 2007). Another stream of investigation consists of non- or semi-parametric methods with the assumption that the measurement error is univariate and its distribution is either completely known or is normal with an unknown variance parameter (e.g., Fan and Truong, 1993; Carroll et al., 1999; Taupin, 2001; Carroll et al., 2004; Delaigle, 2007).

In this paper, we consider the method of moments estimation of a general nonlinear measurement error model. Specifically, we consider the model

Y = g(X; θ_0) + ε,   (1.1)

where Y ∈ R, X ∈ R^k, ε is the random error and θ_0 ∈ R^p is a vector of unknown parameters. In general, g(x; θ_0) is nonlinear in x. Suppose that X is unobservable; instead we observe

Z = X + δ,   (1.2)
L. Wang, C. Hsiao / Journal of Econometrics 165 (2011) 30–44
where δ is a random measurement error. Further, we assume that an instrumental variable W ∈ R^l exists and is related to X through

X = Γ_0 W + U,   (1.3)

where Γ_0 is a k × l matrix of unknown parameters which has rank k and U is independent of W with E(U) = 0. The random errors in (1.1) and (1.2) are supposed to satisfy E(ε | X, Z, W) = 0 and E(δ | X, W) = 0. The functional forms of the distributions of X, ε and δ are unknown. In this sense model (1.1)–(1.3) is semiparametric. In this model, the observed variables are (Y, Z, W). Our primary interest is to estimate θ_0, Γ_0 and the distribution F_U of U.

Model (1.1)–(1.3) has been considered by several authors. Wang and Hsiao (1995) derived a rank condition for identifiability and proposed a semiparametric estimator under the condition that g(x; θ_0) is integrable. Later, the integrability condition was relaxed by Schennach (2007), who used the generalized function technique and achieved more general identifiability conditions. In addition, assuming the model to be identifiable, Newey (2001) derived a consistent estimator when F_U(u) belongs to a parametric family and a consistent semiparametric estimator when F_U(u) is nonparametric but may be approximated by a parametric family.

In this paper, we use the approach of Wang and Hsiao (1995) and extend their results to a general g(x; θ_0) which is not necessarily integrable. In particular, for the case of a parametric distribution f_U(u; φ_0) we propose method of moments estimators for θ and φ which are shown to be consistent and asymptotically normally distributed under fairly general regularity conditions. Simulation-based estimators are also considered to overcome the possible computational difficulty of minimizing an objective function which involves multiple integrals. For the case of a nonparametric distribution F_U(u), we combine the nonparametric technique with Fourier deconvolution to obtain a root-n consistent estimator for θ and a kernel-based estimator for the density of U.
Moreover, this approach results in a surprisingly simple condition for the identifiability of a nonlinear errors-in-variables model. The paper is organized as follows. In Section 2 we introduce the method of moments estimators and derive their consistency and asymptotic normality. In Section 3 we construct simulation-based estimators and show their asymptotic properties. In Section 4 we propose a nonparametric estimator for the density of U and a root-n consistent semiparametric estimator for θ . Section 5 gives a rank condition for model identifiability and illustrative examples. Finally, conclusions and discussions are contained in Section 6, whereas proofs are given in Section 7.
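The inconsistency of conventional estimators under nonlinear measurement error, noted above, is easy to see in simulation. The sketch below (illustrative parameter values, not taken from the paper) fits Y = θX² by naively replacing the unobserved X with its proxy Z, and shows the resulting attenuation toward a limit well below the true θ:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
theta = 1.0

X = rng.normal(size=n)                              # true (unobserved) predictor
Z = X + rng.normal(scale=np.sqrt(0.5), size=n)      # observed proxy with error
Y = theta * X**2 + rng.normal(scale=0.2, size=n)    # outcome

# Naive least squares of Y on Z^2, treating Z as if it were X
theta_naive = np.sum(Y * Z**2) / np.sum(Z**4)

# Population limit: E[X^2 Z^2] / E[Z^4] = 3.5 / 6.75 = 0.5185..., not 1.0
print(theta_naive)
```

For a quadratic model the naive estimator converges to E[X²Z²]/E[Z⁴], which with these variances is about 0.52 rather than the true value 1.0; unlike the linear case, this bias does not vanish by simply instrumenting Z with W.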
2. Method of moments estimator

In this section we propose a method of moments estimator for a nonlinear errors-in-variables model under the assumption that the distribution f_U(u; φ_0) of U is known up to a vector of unknown parameters φ_0 ∈ Φ ⊂ R^q. The case where the distribution of U is nonparametric is treated in Sections 4 and 5.

First, substituting (1.3) into (1.2) results in a usual linear regression equation

E(Z | W) = Γ_0 W.   (2.1)

Therefore Γ_0 can be consistently estimated by the least squares fitting of Z on W. Moreover, by model assumptions we have

E(Y | W) = ∫ g(Γ_0 W + u; θ_0) f_U(u; φ_0) du   (2.2)

and

E(YZ | W) = ∫ (Γ_0 W + u) g(Γ_0 W + u; θ_0) f_U(u; φ_0) du.   (2.3)

Throughout the paper, all integrals are taken over the space R^k. It follows that θ_0 and φ_0 can be estimated using a nonlinear least squares method, given that they are identifiable by (2.2) and (2.3). Since it is straightforward to estimate Γ_0, in the following we focus on the estimation of θ_0 and φ_0.

First, we use some examples to demonstrate that θ_0 and φ_0 may indeed be estimated using (2.2) and (2.3). To simplify notation, we consider the case where Γ_0 = 1, all variables are scalars and U ~ N(0, φ). For the same reason, we suppress the subscript zero in θ_0 and denote it as θ.

Example 2.1. Linear model g(x; θ) = θx. For this model, it is easy to find E(Y | W) = θW and E(YZ | W) = θφ + θW², from which both θ and φ can be consistently estimated by the nonlinear least squares method.

Example 2.2. Polynomial model g(x; θ) = θ_1 + θ_2 x². In this case, we have E(Y | W) = (θ_1 + θ_2 φ) + θ_2 W² and E(YZ | W) = (θ_1 + 3θ_2 φ)W + θ_2 W³. Again, it is clear that θ_2, θ_1 + θ_2 φ and θ_1 + 3θ_2 φ can be consistently estimated and, therefore, so can θ_1 and φ.

Example 2.3. Exponential model g(x; θ) = θ_1 exp(θ_2 x), where θ_1 θ_2 ≠ 0. For this model, we have E(Y | W) = θ_1 exp(θ_2 W + θ_2² φ/2) and E(YZ | W) = θ_1 (θ_2 φ + W) exp(θ_2 W + θ_2² φ/2). Now θ_2 and θ_1 exp(θ_2² φ/2) can be consistently estimated from the first equation, and θ_1 θ_2 φ exp(θ_2² φ/2) from the second. It follows that θ_1 and φ can be consistently estimated too.

Let ψ = (θ′, φ′)′ and Ψ = Θ × Φ, which is assumed to be compact in R^{p+q}. The true parameter value of the model is denoted by ψ_0 ∈ Ψ. To simplify notation, let Z̃ = (1, Z′)′ and x̃ = (1, x′)′. Then through variable substitution, (2.2) and (2.3) can be written together as

E(Y Z̃ | W) = ∫ x̃ g(x; θ_0) f_U(x − Γ_0 W; φ_0) dx.   (2.4)

For every v ∈ R^k and ψ ∈ Ψ, define

m(v; ψ) = ∫ x̃ g(x; θ) f_U(x − v; φ) dx.   (2.5)

Then it is clear that m(Γ_0 W; ψ_0) = E(Y Z̃ | W). Suppose (Y_j, Z_j, W_j), j = 1, 2, …, n, is an i.i.d. random sample with finite moments EY² < ∞, E‖YZ‖² < ∞ and nonsingular EWW′, where ‖·‖ denotes the Euclidean norm. Further, let ρ̂_j(ψ) = Y_j Z̃_j − m(Γ̂ W_j; ψ), where

Γ̂ = ( Σ_{j=1}^n Z_j W_j′ )( Σ_{j=1}^n W_j W_j′ )^{−1}   (2.6)

is the least squares estimator of Γ_0. Then the method of moments estimator (MME) for ψ is defined as ψ̂_n = argmin_{ψ∈Ψ} Q_n(ψ), where

Q_n(ψ) = Σ_{j=1}^n ρ̂_j′(ψ) A_j ρ̂_j(ψ),   (2.7)

and A_j = A(W_j) is a nonnegative definite matrix which may depend on W_j. Throughout the paper, let γ = vec Γ denote the vector consisting of the columns of Γ, where vec is the so-called vectorization operator. We also assume that the parameter space of γ is a compact subset of R^{kl} containing the true value γ_0 = vec Γ_0. The consistency of ψ̂_n can be derived in traditional fashion by establishing the uniform convergence of Q_n(ψ)/n to a nonstochastic function Q(ψ) which has a unique minimizer ψ_0 ∈ Ψ. To achieve this, we assume the following regularity conditions, where µ denotes Lebesgue measure.
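The closed-form conditional moments in Examples 2.1–2.3 above can be verified directly by simulation. A quick check of Example 2.2 (the polynomial case) at a fixed W, under the scalar setup Γ_0 = 1, U ~ N(0, φ), with illustrative error scales chosen here for the sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
theta1, theta2, phi = 0.5, 2.0, 0.4
w = 1.3                                     # condition on a fixed W = w

U = rng.normal(scale=np.sqrt(phi), size=n)
X = w + U                                   # X = Gamma0*W + U with Gamma0 = 1
delta = rng.normal(scale=0.6, size=n)       # measurement error in Z
Z = X + delta
eps = rng.normal(scale=0.3, size=n)
Y = theta1 + theta2 * X**2 + eps

# Example 2.2: E(Y|W)  = (theta1 + theta2*phi) + theta2*W^2
#              E(YZ|W) = (theta1 + 3*theta2*phi)*W + theta2*W^3
ey_exact = theta1 + theta2 * phi + theta2 * w**2
eyz_exact = (theta1 + 3 * theta2 * phi) * w + theta2 * w**3

print(np.mean(Y) - ey_exact, np.mean(Y * Z) - eyz_exact)  # both near 0
```

Both sample averages match the stated formulas up to Monte Carlo error, confirming that (θ_1, θ_2, φ) enter the two observable moments in an identifiable way.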
Assumption 1. g(x; θ) is a measurable function of x for each θ ∈ Θ and is continuous in θ ∈ Θ (a.e. µ). f_U(u; φ) is continuously differentiable with respect to (w.r.t.) u for each φ ∈ Φ and is continuous in φ ∈ Φ (a.e. µ). Furthermore, E‖A(W)‖(|Y|² + ‖YZ‖²) < ∞,

E‖A(W)‖ [ ∫ sup_{ψ,γ} |g(x; θ) f_U(x − ΓW; φ)| (‖x‖ + 1) dx ]² < ∞   (2.8)

and

E‖A(W)‖ ‖W‖² [ ∫ sup_{ψ,γ} ‖g(x; θ) ∂f_U(x − ΓW; φ)/∂u′‖ (‖x‖ + 1) dx ]² < ∞,   (2.9)

where the supremum is taken within the compact parameter spaces of ψ and γ.

Assumption 2. E[ρ(ψ) − ρ(ψ_0)]′ A(W)[ρ(ψ) − ρ(ψ_0)] = 0 if and only if ψ = ψ_0, where ρ(ψ) = Y Z̃ − m(Γ_0 W; ψ).

Assumptions 1 and 2 are common in the literature of nonlinear regression. Assumption 2 is a high-level condition for identifiability; some sufficient conditions for identifiability are given in Section 5. Assumption 1 ensures that the objective function Q_n(ψ) is continuous and converges uniformly in ψ. The following example shows that (2.8) and (2.9) are generally satisfied, e.g., when g(x; θ) is a polynomial in x and U has a normal distribution.

Example 2.4. Suppose g(x; θ) = θx, U ~ N(0, φ), all variables are scalars and the parameter spaces are compact intervals [θ_min, θ_max], [φ_min, φ_max] and [Γ_min, Γ_max]. Then, for every x, w ∈ R and θ, φ, Γ in their respective parameter spaces,

|g(x; θ) f_U(x − Γw; φ)|(|x| + 1) = (|θx|/√(2πφ)) exp(−(x − Γw)²/(2φ)) (|x| + 1)
≤ (θ_max |x|(|x| + 1)/√(2πφ_min)) exp(−(x² + Γ_min² w²)/(2φ_max)) [exp(Γ_max xw/φ_max) + exp(Γ_min xw/φ_max)],

which clearly satisfies (2.8) if, e.g., A(W) is an identity matrix. Similarly, it is easy to see that (2.9) is satisfied too.

Theorem 2.1. Under Assumptions 1 and 2, ψ̂_n → ψ_0 a.s., as n → ∞.

To derive the asymptotic normality for ψ̂_n, we assume further regularity conditions as follows.

Assumption 3. There exist open subsets θ_0 ∈ Θ_0 ⊂ Θ and φ_0 ∈ Φ_0 ⊂ Φ, in which g(x; θ) is twice continuously differentiable w.r.t. θ (a.e. µ) and f_U(u; φ) is twice continuously differentiable w.r.t. φ (a.e. µ). Furthermore, γ_0 has an open neighborhood, such that the first two derivatives of g(x; θ) f_U(x − Γw; φ) w.r.t. ψ are uniformly bounded by a function η(x, w), which satisfies

E‖A(W)‖ [ ∫ η(x, W)(‖x‖ + 1) dx ]² < ∞.   (2.10)

Assumption 4. For ψ_0 ∈ Ψ,

E‖A(W)‖ ‖W‖² [ ∫ sup_γ ‖(∂g(x; θ_0)/∂θ) ∂f_U(x − ΓW; φ_0)/∂u′‖ (‖x‖ + 1) dx ]² < ∞   (2.11)

and

E‖A(W)‖ ‖W‖² [ ∫ sup_γ |g(x; θ_0)| ‖∂²f_U(x − ΓW; φ_0)/∂φ∂u′‖ (‖x‖ + 1) dx ]² < ∞,

where the supremum is taken within the open subset of γ.

Assumption 5. The matrix

B = E[ (∂ρ′(ψ_0)/∂ψ) A(W) (∂ρ(ψ_0)/∂ψ′) ]

is nonsingular, where

∂ρ(ψ)/∂θ′ = − ∫ x̃ (∂g(x; θ)/∂θ′) f_U(x − Γ_0 W; φ) dx   (2.12)

and

∂ρ(ψ)/∂φ′ = − ∫ x̃ g(x; θ) (∂f_U(x − Γ_0 W; φ)/∂φ′) dx.   (2.13)

Again, Assumptions 3–5 are commonly used regularity conditions that are sufficient for the asymptotic normality of method of moments estimators. Together with the Dominated Convergence Theorem (DCT), Assumption 3 implies that the first derivative of Q_n(ψ) admits the first-order Taylor expansion and that the second derivative of Q_n(ψ) converges uniformly. Moreover, it ensures that the first derivative ∂ρ(ψ)/∂ψ′ exists and is given by (2.12) and (2.13), while Assumption 4 implies that the first derivative ∂ρ(ψ)/∂γ′ exists and is given by

∂ρ(ψ)/∂γ′ = ∫ x̃ g(x; θ) (∂f_U(x − Γ_0 W; φ)/∂u′) dx (W ⊗ I_k)′,   (2.14)

where ⊗ stands for the Kronecker product (Magnus and Neudecker, 1988, p. 30). Finally, Assumption 5 and the DCT imply that the second derivative of Q_n(ψ) has a non-singular limiting matrix. Again, it is easy to see that Assumptions 3–5 are satisfied for the polynomial model g(x; θ) and the normal random error U.

Theorem 2.2. Under Assumptions 1–5, as n → ∞, √n(ψ̂_n − ψ_0) →_L N(0, B⁻¹DCD′B⁻¹), where

D = [ I_{p+q}, E( (∂ρ′(ψ_0)/∂ψ) A(W) (∂ρ(ψ_0)/∂γ′) ) (EWW′ ⊗ I_k)⁻¹ ],

C = [ C₁₁  C₂₁′ ; C₂₁  C₂₂ ],

C₁₁ = E[ (∂ρ′(ψ_0)/∂ψ) A(W) ρ(ψ_0)ρ′(ψ_0) A(W) (∂ρ(ψ_0)/∂ψ′) ],

C₂₁ = E[ (W ⊗ (Z − Γ_0 W)) ρ′(ψ_0) A(W) (∂ρ(ψ_0)/∂ψ′) ]

and C₂₂ = E[WW′ ⊗ (Z − Γ_0 W)(Z − Γ_0 W)′]. Furthermore,

B = plim_{n→∞} (1/n) Σ_{j=1}^n (∂ρ̂_j′(ψ̂_n)/∂ψ) A_j (∂ρ̂_j(ψ̂_n)/∂ψ′)

and

DCD′ = plim_{n→∞} (1/4n) (∂Q_n(ψ̂_n)/∂ψ)(∂Q_n(ψ̂_n)/∂ψ′),

where

∂Q_n(ψ)/∂ψ = 2 Σ_{j=1}^n (∂ρ̂_j′(ψ)/∂ψ) A_j ρ̂_j(ψ).
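For the linear model of Example 2.1 the integrals in (2.5) are available in closed form, so the MME of (2.6)–(2.7) can be sketched end to end. The implementation below is illustrative only: it uses A_j = I, a crude grid search in place of a numerical optimizer, and parameter values chosen for the demonstration; it recovers (θ, φ) near the truth:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
theta0, phi0, gamma0 = 2.0, 0.5, 1.0

W = rng.normal(size=n)
U = rng.normal(scale=np.sqrt(phi0), size=n)
X = gamma0 * W + U
Z = X + rng.normal(scale=0.5, size=n)           # observed proxy (1.2)
Y = theta0 * X + rng.normal(scale=0.3, size=n)  # outcome (1.1), g(x) = theta*x

# Stage 0: least squares of Z on W, eq. (2.6)
gamma_hat = np.sum(Z * W) / np.sum(W * W)
V = gamma_hat * W

# For g(x;theta) = theta*x and U ~ N(0,phi): m(v;psi) = (theta*v, theta*phi + theta*v^2)
def Qn(theta, phi):
    r1 = Y - theta * V                          # residual for E(Y|W)
    r2 = Y * Z - theta * phi - theta * V**2     # residual for E(YZ|W)
    return np.sum(r1**2 + r2**2)                # eq. (2.7) with A_j = identity

thetas = np.linspace(1.5, 2.5, 51)
phis = np.linspace(0.1, 1.0, 46)
Q = np.array([[Qn(t, p) for p in phis] for t in thetas])
it, ip = np.unravel_index(np.argmin(Q), Q.shape)
theta_hat, phi_hat = thetas[it], phis[ip]
print(theta_hat, phi_hat)   # close to (2.0, 0.5)
```

In a real application the grid search would be replaced by a numerical minimizer over Ψ, and the two-stage weighting with A(W) = Ĥ⁻¹ described below improves efficiency.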
The asymptotic covariance of ψ̂_n depends on the weight A(W). A natural question is how to choose A(W) to obtain the most efficient estimator. To answer this question, we first write D = (I_{p+q}, G), so that DCD′ = C₁₁ + GC₂₁ + C₂₁′G′ + GC₂₂G′. It is easy to see that the last three terms in DCD′ are due to the least squares estimation of Γ. To simplify discussion, assume for the moment that Γ_0 is known, so that these three terms do not appear in DCD′. The following discussion remains valid when Γ_0 is unknown and estimated using a subset of the sample (Y_j, Z_j, W_j), j = 1, 2, …, n, while Q_n(ψ) is constructed using the rest of the sample points. Then the independence of the sample points implies that C₂₁ = 0. Since ∂ρ′(ψ_0)/∂ψ depends on W only, matrix C₁₁ can be written as

C₁₁ = E[ (∂ρ′(ψ_0)/∂ψ) A(W) H A(W) (∂ρ(ψ_0)/∂ψ′) ],

where H = H(W) = E[ρ(ψ_0)ρ′(ψ_0)|W]. Then, analogous to the weighted (nonlinear) least squares estimation, we have

B⁻¹C₁₁B⁻¹ ≥ [ E( (∂ρ′(ψ_0)/∂ψ) H⁻¹ (∂ρ(ψ_0)/∂ψ′) ) ]⁻¹   (2.15)

(in the sense that the difference of the left-hand and right-hand sides is nonnegative definite), and the lower bound is attained for A(W) = H⁻¹ in both B and C₁₁ (Hansen, 1982; Abarin and Wang, 2006). In practice, however, H depends on unknown parameters and therefore needs to be estimated. This suggests the following two-stage procedure of estimation. First, minimize Q_n(ψ) with identity matrix A(W) = I_{k+1} to obtain the first-stage estimator ψ̂_n. Secondly, estimate H = H(W) by a nonparametric method such as a kernel estimator, or by Ĥ = (1/n) Σ_{j=1}^n ρ̂_j(ψ̂_n)ρ̂_j′(ψ̂_n) for models where H does not depend on W, and then minimize Q_n(ψ) again with A(W) = Ĥ⁻¹ to obtain the second-stage estimator ψ̂̂_n. Since Ĥ is consistent for H, the asymptotic covariance of ψ̂̂_n is given by the right-hand side of (2.15). Consequently ψ̂̂_n is asymptotically more efficient than the first-stage estimator ψ̂_n. More detailed discussions about the so-called feasible generalized least squares estimators can be found in, e.g., Amemiya (1974) and Gallant (1987, Chapter 5).

3. Simulation-based estimator

The numerical computation of the MME ψ̂_n or adaptive generalized method of moments estimator is straightforward if the explicit form of m(v; ψ) can be obtained. However, explicit forms of the integrals in (2.5) can be difficult or impossible to derive (for instance, if g is logistic and f_U is normal). In this case, one may use a simulation-based approach to approximate the multiple integrals, in which they are simulated by Monte Carlo methods such as importance sampling. First, choose a known density h(x) and generate an i.i.d. random sample {x_{js}, s = 1, 2, …, 2S, j = 1, 2, …, n} from h(x). Then approximate m(ΓW_j; ψ) by the Monte Carlo simulators

m_S(ΓW_j; ψ) = (1/S) Σ_{s=1}^{S} x̃_{js} g(x_{js}; θ) f_U(x_{js} − ΓW_j; φ)/h(x_{js})

and

m_{2S}(ΓW_j; ψ) = (1/S) Σ_{s=S+1}^{2S} x̃_{js} g(x_{js}; θ) f_U(x_{js} − ΓW_j; φ)/h(x_{js}),

where x̃_{js} = (1, x_{js}′)′, and let ρ̂_{j,S}(ψ) = Y_j Z̃_j − m_S(Γ̂W_j; ψ) and ρ̂_{j,2S}(ψ) = Y_j Z̃_j − m_{2S}(Γ̂W_j; ψ). It is easy to see that m_S(ΓW_j; ψ) and m_{2S}(ΓW_j; ψ) are unbiased simulators for m(ΓW_j; ψ), because by construction, E[m_S(ΓW_j; ψ)|W_j] = E[m_{2S}(ΓW_j; ψ)|W_j] = m(ΓW_j; ψ). Finally, the simulation-based estimator (SBE) for ψ is defined by ψ̂_{n,S} = argmin_{ψ∈Ψ} Q_{n,S}(ψ), where

Q_{n,S}(ψ) = Σ_{j=1}^{n} ρ̂_{j,S}′(ψ) A_j ρ̂_{j,2S}(ψ).   (3.1)

In addition, using two independent sets of simulated points in ρ̂_{j,S} and ρ̂_{j,2S} guarantees Q_{n,S}(ψ) to be an unbiased simulator for Q_n(ψ) in the sense that they have the same conditional expectation given the data (Y_j, Z_j, W_j), j = 1, 2, …, n. This ''simulation-by-parts'' has an important consequence that the following consistency and asymptotic normality of ψ̂_{n,S} hold for a fixed S. In contrast, most simulation-based methods in the literature require that S → ∞.

Since Q_{n,S}(ψ) does not involve integrals any more, it is continuous in and differentiable with respect to ψ, as long as the functions g(x; θ) and f_U(u; φ) have these properties. In particular, the first derivative ∂ρ_{j,S}′(ψ)/∂ψ consists of

∂ρ_{j,S}′(ψ)/∂θ = − (1/S) Σ_{s=1}^{S} (∂g(x_{js}; θ)/∂θ) x̃_{js}′ f_U(x_{js} − ΓW_j; φ)/h(x_{js}),

∂ρ_{j,S}′(ψ)/∂φ = − (1/S) Σ_{s=1}^{S} (∂f_U(x_{js} − ΓW_j; φ)/∂φ) x̃_{js}′ g(x_{js}; θ)/h(x_{js}),

and the first derivative ∂ρ_{j,2S}′(ψ)/∂ψ is given similarly.

Theorem 3.1. Suppose that the support of h(x) covers the support of |g(x; θ)| f_U(x − v; φ) for all v ∈ R^k and ψ ∈ Ψ. Then the simulation estimator ψ̂_{n,S} has the following properties:
1. Under Assumptions 1 and 2, ψ̂_{n,S} → ψ_0 a.s., as n → ∞.
2. Under Assumptions 1–5, √n(ψ̂_{n,S} − ψ_0) →_L N(0, B⁻¹DC_S D′B⁻¹), where

C_S = [ C_{S,11}  C₂₁′ ; C₂₁  C₂₂ ]

and

2C_{S,11} = E[ (∂ρ_{1,S}′(ψ_0)/∂ψ) A₁ ρ_{1,2S}(ψ_0)ρ_{1,2S}′(ψ_0) A₁ (∂ρ_{1,S}(ψ_0)/∂ψ′) ]
+ E[ (∂ρ_{1,S}′(ψ_0)/∂ψ) A₁ ρ_{1,2S}(ψ_0)ρ_{1,S}′(ψ_0) A₁ (∂ρ_{1,2S}(ψ_0)/∂ψ′) ].

Furthermore,

DC_S D′ = plim_{n→∞} (1/4n) (∂Q_{n,S}(ψ̂_n)/∂ψ)(∂Q_{n,S}(ψ̂_n)/∂ψ′),

where

∂Q_{n,S}(ψ)/∂ψ = Σ_{j=1}^{n} [ (∂ρ̂_{j,S}′(ψ)/∂ψ) A_j ρ̂_{j,2S}(ψ) + (∂ρ̂_{j,2S}′(ψ)/∂ψ) A_j ρ̂_{j,S}(ψ) ].

Although ψ̂_{n,S} is feasible in general, the simulation approximation of ρ_j(ψ) by ρ_{j,S}(ψ) and ρ_{j,2S}(ψ) may cause efficiency loss. The following corollary shows that the efficiency loss due to simulation is of magnitude O(1/S); the proof is completely analogous to that of Corollary 4 of Wang (2004) and hence is omitted.

Corollary 3.2. Under the conditions of Theorem 3.1,

C_{S,11} = C₁₁ + (1/2S) E[ (∂(ρ_{11} − ρ_1)′/∂ψ) A₁ ρ_1 ( (∂ρ_1′/∂ψ) A₁ (ρ_{11} − ρ_1) )′ ]
+ (1/2S) E[ (∂(ρ_{11} − ρ_1)′/∂ψ) A₁ (ρ_{12} − ρ_1) ( (∂(ρ_{12} − ρ_1)′/∂ψ) A₁ (ρ_{11} − ρ_1) )′ ],   (3.2)

where ρ_1 = ρ_1(ψ_0) and ρ_{js} = Y_j Z̃_j − x̃_{js} g(x_{js}; θ_0) f_U(x_{js} − Γ_0 W_j; ψ_0)/h(x_{js}) is the summand in ρ_{j,S}(ψ_0) = Σ_{s=1}^{S} ρ_{js}/S.

Asymptotically, the importance density h(x) has no effect on the efficiency of ψ̂_{n,S}, as long as it satisfies the condition of Theorem 3.1. In practice, however, the choice of h(x) will affect the finite sample variances of the Monte Carlo estimators m_S(ΓW_j; ψ) and m_{2S}(ΓW_j; ψ). Theoretically, the best choice of h(x) is proportional to the absolute value of the integrand ‖x̃ g(x; θ) f_U(x − ΓW; ψ)‖. Practically, a density close to being proportional to the integrand is a good choice.

4. Semiparametric estimator

In this and the next section, we relax the parametric restriction on the distribution of U and instead assume that F_U is nonparametric. We derive a semiparametric estimator for θ and a kernel-based nonparametric estimator for F_U using moment equations (2.2) and (2.3), which become
∫
g (Γ0 W + u; θ0 )dFU (u)
(4.1)
and E (YZ | W ) =
∫
(Γ0 W + u)g (Γ0 W + u; θ0 )dFU (u).
(4.2)
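The importance-sampling simulators above can be sketched numerically. The following is a minimal one-dimensional illustration, assuming a logistic $g$ and a normal $f_U$ (the case mentioned above where the integral has no closed form); the parameter values and the choice of $h$ are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x, theta):
    # illustrative logistic regression function g(x; theta)
    return 1.0 / (1.0 + np.exp(-theta * x))

def f_U(u, sigma):
    # measurement-error density f_U, here N(0, sigma^2)
    return np.exp(-0.5 * (u / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def h(x, scale=3.0):
    # importance density h(x): a wide normal covering the integrand's support
    return np.exp(-0.5 * (x / scale) ** 2) / (scale * np.sqrt(2 * np.pi))

def m_S(v, theta, sigma, S, scale=3.0):
    # Monte Carlo simulator of (3.1): draw x_s ~ h, average g * f_U / h
    x = rng.normal(0.0, scale, size=S)
    return np.mean(g(x, theta) * f_U(x - v, sigma) / h(x, scale))

# with S large, m_S(v) approximates the integral of g(x) f_U(x - v) over x
approx = m_S(0.5, theta=1.0, sigma=0.5, S=200_000)
```

A wider $h$ trades a larger per-draw variance for coverage of the integrand's support, which is the finite-sample effect of $h$ discussed above.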
The basic idea is to apply Fourier deconvolution to (4.1) or (4.2) to separate $\theta$ and $F_U$. This approach is based on the following assumptions.

Assumption 6. The distribution of $W$ is absolutely continuous with respect to the Lebesgue measure and has support $\mathbb R^l$.

Assumption 7. $g(x;\theta_0)(\|x\|+1) \in L^1(\mathbb R^k)$, the space of all absolutely integrable functions on $\mathbb R^k$. Furthermore, the set $T = \{t\in\mathbb R^k : \tilde g(t;\theta_0)\ne 0\}$ is dense in $\mathbb R^k$, where $\tilde g(t;\theta_0) = \int e^{-it'x}g(x;\theta_0)\,dx$ is the Fourier transform of $g(x;\theta_0)$ and $i=\sqrt{-1}$.

The integrability of $g(x;\theta_0)$ in Assumption 7 implies the existence of the Fourier transform $\tilde g(t;\theta_0)$. Roughly speaking, the second part of the assumption means that the zeros of $\tilde g(t;\theta_0)$ are isolated points in $\mathbb R^k$. The examples given at the end of the next section show that this condition is fairly general. Further discussion and a possible generalization of this condition are given in Remark 5.1. For every $v\in\mathbb R^k$, let

$$m_1(v) = \int g(v+u;\theta_0)\,dF_U(u). \tag{4.3}$$
Since Γ0 has full rank, Assumption 6 implies that m1 (Γ0 W ) = E (Y | W ) is fully observable on Rk . Moreover, Assumption 7 implies that m1 (v) ∈ L1 (Rk ). Taking the Fourier transformation on both sides of (4.3) yields
$$\tilde m_1(t) = \int e^{-it'v}m_1(v)\,dv = \int e^{-it'x}g(x;\theta_0)\,dx \cdot \int e^{it'u}\,dF_U(u) = \tilde g(t;\theta_0)\,\tilde f_U(t), \tag{4.4}$$

where $\tilde f_U(t) = \int e^{it'u}\,dF_U(u)$ is the characteristic function of $U$ (a slight abuse of notation: throughout this article the tilde denotes the inverse Fourier transform for $\tilde f_U$ only). It follows from (4.4) that, for any $t\in T$, $\tilde f_U(t)$ is uniquely determined by

$$\tilde f_U(t) = \tilde m_1(t)/\tilde g(t;\theta_0). \tag{4.5}$$

Further, because any characteristic function is uniformly continuous on $\mathbb R^k$, Assumption 7 implies that the value of $\tilde f_U(t)$ at any zero of $\tilde g(t;\theta_0)$ is also uniquely determined. If, in addition, $\tilde f_U(t)\in L^1(\mathbb R^k)$, then the density of $U$ exists and is given by

$$f_U(u;\theta_0) = \frac{1}{(2\pi)^k}\int e^{it'u}\,\frac{\tilde m_1(t)}{\tilde g(t;\theta_0)}\,dt. \tag{4.6}$$
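The deconvolution identities (4.4)–(4.6) can be illustrated on a grid with the FFT. The sketch below assumes a Gaussian $g$ and a normal $U$ (both illustrative); thresholding the small values of $\tilde g$ plays the role of the truncation set $C_n$ introduced later in this section.

```python
import numpy as np

# One-dimensional sketch of (4.4)-(4.6) on a grid. g and f_U are illustrative.
n, L = 1024, 40.0
dx = L / n
x = (np.arange(n) - n // 2) * dx
g = np.exp(-x**2 / 2)                                            # g(x; theta_0)
fU = np.exp(-x**2 / (2 * 0.5**2)) / (0.5 * np.sqrt(2 * np.pi))   # true f_U

# m1(v) = integral of g(v + u) f_U(u) du, computed by direct quadrature (4.3)
m1 = (np.exp(-((x[:, None] + x[None, :]) ** 2) / 2) * fU).sum(axis=1) * dx

# Discrete Fourier transforms of m1 and g; (4.5) gives f~U = m~1 / g~
m1_t = np.fft.fft(np.fft.ifftshift(m1))
g_t = np.fft.fft(np.fft.ifftshift(g))
ratio = np.zeros_like(m1_t)
keep = np.abs(g_t) >= 1e-8 * np.abs(g_t).max()   # discrete analogue of C_n
ratio[keep] = m1_t[keep] / g_t[keep]

# Inversion (4.6): recover the density of U on the grid
fU_rec = np.fft.fftshift(np.fft.ifft(ratio)).real / dx
err = np.max(np.abs(fU_rec - fU))
```

Without the `keep` threshold the division blows up where $\tilde g$ underflows, which is exactly why the estimator below integrates only over a set on which $|\tilde g|$ is bounded away from zero.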
This expression can be substituted into (4.1) and (4.2), so that the method of moments estimator for $\theta$ can be obtained by minimizing an objective function similar to (2.7). Details of this construction are given below. First, let $\hat\Gamma$ denote the least squares estimator in (2.6) and $V_j = \hat\Gamma W_j$, $j = 1, 2, \ldots, n$. Then the density function $f_V(v)$ of $V = \Gamma_0W$ and the conditional mean function $m_1(v)$ are estimated by

$$\hat f_V(v) = \frac{1}{na_n^k}\sum_{j=1}^{n}K\left(\frac{v - V_j}{a_n}\right) \tag{4.7}$$

and

$$\hat m_1(v) = \frac{1}{na_n^k}\sum_{j=1}^{n}Y_jK\left(\frac{v - V_j}{a_n}\right)\Big/\hat f_V(v), \tag{4.8}$$

where $K(\cdot)$ is a kernel function and $a_n$ is the bandwidth satisfying $0 < a_n \to 0$ as $n\to\infty$. Second, let $\hat{\tilde m}_1(t) = \int_{B_n}e^{-it'v}\hat m_1(v)\,dv$, where $B_n = \{v\in\mathbb R^k : |\hat f_V(v)|\ge b_n\}$, $0 < b_n\to 0$ as $n\to\infty$, and define, for each $\theta\in\Theta$,

$$\hat f_U(u;\theta) = \frac{1}{(2\pi)^k}\int_{C_n}e^{it'u}\,\frac{\hat{\tilde m}_1(t)}{\tilde g(t;\theta)}\,dt,$$

where $C_n = \{t\in\mathbb R^k : \|t\|\le 1/c_n,\ |\tilde g(t;\theta)|\ge c_n\}$ and $0 < c_n\to 0$ as $n\to\infty$. Finally, the semiparametric estimator (SPE) for $\theta$ is defined as $\hat\theta_n = \arg\min_{\theta\in\Theta}Q_n(\theta)$, where

$$Q_n(\theta) = \sum_{j=1}^{n}\hat\rho_j'(\theta)A_j\hat\rho_j(\theta) \tag{4.9}$$

with $\hat\rho_j(\theta) = Y_j\tilde Z_j - \int\tilde x g(x;\theta)\hat f_U(x - \hat\Gamma W_j;\theta)\,dx$. The consistency of $\hat\theta_n$ can be derived similarly as for the MME $\hat\psi_n$. However, as in many cases, e.g. Robinson (1988), the derivation becomes much more complicated because of the presence of the first-stage nonparametric estimators in $Q_n(\theta)$, which have convergence rates lower than $\sqrt n$. To achieve $\sqrt n$-consistency, higher order kernels are usually used, combined with certain smoothness conditions on the density and conditional mean functions.

Assumption 8. There exists an integer $d\ge 1$ such that $f_V(v)$, $m_1(v)f_V(v)$ and their partial derivatives of order 1 through $d$ are continuous and uniformly bounded on $\mathbb R^k$.

Assumption 9. The kernel function $K(v)$ is bounded on $\mathbb R^k$ and, for the integer $d$ in Assumption 8, satisfies: (1) $\int K(v)\,dv = 1$; $\int v_1^{d_1}v_2^{d_2}\cdots v_k^{d_k}K(v)\,dv = 0$ for $d_j\ge 0$ and $1\le\sum_{j=1}^k d_j\le d-1$; and $\int|v_1^{d_1}v_2^{d_2}\cdots v_k^{d_k}K(v)|\,dv < \infty$ for $d_j\ge 0$ and $\sum_{j=1}^k d_j = d$; (2) for every $1\le j\le k$, $\sup_{v\in\mathbb R^k}\|\partial K(v)/\partial v_j\|(\|v\|+1) < \infty$; and (3) $K(v)\in L^1(\mathbb R^k)$ and $\int e^{it'v}K(v)\,dv\in L^1(\mathbb R^k)$.
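A minimal sketch of the kernel estimators (4.7) and (4.8) for $k=1$, with a Gaussian kernel; the data-generating process and bandwidth below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5000
V = rng.normal(0.0, 1.0, n)                  # V_j = Gamma_hat W_j (illustrative)
Y = np.sin(V) + rng.normal(0.0, 0.2, n)      # so m1(v) = E(Y | V = v) = sin(v)
a_n = n ** (-1 / 5)                           # bandwidth a_n -> 0

def K(u):
    # Gaussian kernel (satisfies Assumption 9 with d = 2)
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def fV_hat(v):
    # (4.7): kernel density estimator of f_V
    return K((v - V) / a_n).mean() / a_n

def m1_hat(v):
    # (4.8): Nadaraya-Watson estimator of the conditional mean m1(v)
    w = K((v - V) / a_n)
    return (Y * w).mean() / a_n / fV_hat(v)

est = m1_hat(0.7)    # should be close to sin(0.7)
```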
Assumption 10. $\sup_\Theta|g(x;\theta)|(\|x\|+1)\in L^1(\mathbb R^k)$ and $\sup_\Theta|\tilde m_1(t)/\tilde g(t;\theta)|(\|t\|+1)\in L^1(\mathbb R^k)$.

Assumption 11. $g(x;\theta)$ is a measurable function of $x$ for each $\theta\in\Theta$ and is continuous in $\theta\in\Theta$ (a.e. $\mu$). Furthermore, $E\|A(W)\|(\|W\|^2+1)<\infty$, $E\|A(W)\|(|Y|^2+\|YZ\|^2)<\infty$ and

$$E\|A(W)\|\left[\int\sup_\Theta|g(x;\theta)f_U(x-\Gamma_0W;\theta)|(\|x\|+1)\,dx\right]^2 < \infty. \tag{4.10}$$
Assumption 12. $E[\rho(\theta)-\rho(\theta_0)]'A(W)[\rho(\theta)-\rho(\theta_0)] = 0$ if and only if $\theta=\theta_0$, where $\rho(\theta) = Y\tilde Z - \int\tilde xg(x;\theta)f_U(x-\Gamma_0W;\theta)\,dx$.

Assumptions 8 and 9 have been used by Robinson (1988) and Andrews (1995) to achieve uniform convergence of their kernel estimators of conditional mean functions. Assumption 10 guarantees that the Fourier transform $\tilde g(t;\theta)$ exists for all $\theta\in\Theta$ and that the density $f_U(u;\theta)$ exists and is given by (4.6). This assumption may be weakened to require only that the density $f_U(u;\theta)$ exists and is piecewise continuous, in which case $f_U(u;\theta)$ may be defined by the usual inversion formula or the so-called principal value of the integral on the right-hand side of (4.6) (Walker, 1988). Similarly to Assumption 2, Assumption 12 is a high-level identifiability assumption, which is implied by the conditions of Theorem 5.1 in the next section.

Theorem 4.1. Suppose Assumptions 6–12 hold and, as $n\to\infty$, $\sqrt n\,a_n^{k+1}b_n^3c_n^{k+1}\to\infty$, $a_n^db_n^{-3}c_n^{-k-1}\to 0$ and $c_n^{-k-1}\int_{B_n^c}|m_1(v)|\,dv\xrightarrow{P}0$, where $B_n^c$ is the complement of $B_n$ in $\mathbb R^k$. Then, as $n\to\infty$, (1) $\hat\theta_n\xrightarrow{P}\theta_0$; (2) $\sup_{u\in\mathbb R^k}|\hat f_U(u;\hat\theta_n)-f_U(u;\theta_0)|\xrightarrow{P}0$; and (3) for every $t\in\mathbb R^k$ such that $\tilde g(t;\theta_0)\ne 0$, $\hat{\tilde f}_U(t;\hat\theta_n) = \hat{\tilde m}_1(t)/\tilde g(t;\hat\theta_n)\xrightarrow{P}\tilde f_U(t;\theta_0)$.

The above semiparametric estimator involves three tuning parameters. In practice, these parameters can be chosen as follows. First, take the bandwidth $a_n = n^{-a}$, where $0 < a < 1/(2(k+1))$ can be chosen according to a certain optimality criterion for the kernel estimators in (4.7) and (4.8). Second, the quantity $d_n = \int_{B_n^c}|m_1(v)|\,dv$ reflects the tail behavior of $m_1(v)$ as $\|v\|\to\infty$, which can be evaluated for the given model $g(x;\theta_0)$ and density $f_V(v)$. Suppose $d_n = o(n^{-\delta})$ for some $\delta>0$. Then $c_n = n^{-c\delta/(k+1)}$, $0<c<1$, satisfies $c_n^{-k-1}d_n\to 0$. Finally, choose $b>0$ and $0<c<1$ such that $3b+c\delta < \min\{ad,\ 1/2-a(k+1)\}$. Then $b_n = n^{-b}$ satisfies $\sqrt n\,a_n^{k+1}b_n^3c_n^{k+1} = n^{1/2-a(k+1)-3b-c\delta}\to\infty$ and $a_n^db_n^{-3}c_n^{-k-1} = n^{-ad+3b+c\delta}\to 0$.

Similar to the simulation-based estimator of Section 3, we can also construct a simulated version of the semiparametric estimator. Specifically, $\hat\rho_j(\theta)$ in (4.9) can be replaced by Monte Carlo simulators such as $\hat\rho_{j,S}(\theta)$ and $\hat\rho_{j,2S}(\theta)$ in (3.1). Then a simulation-based semiparametric estimator (SBSPE) $\hat\theta_{n,S}$ can be defined by minimizing the simulated version of $Q_n(\theta)$, i.e.,

$$Q_{n,S}(\theta) = \sum_{j=1}^{n}\hat\rho'_{j,S}(\theta)A_j\hat\rho_{j,2S}(\theta), \tag{4.11}$$

where $\hat\rho_{j,S}(\theta) = Y_j\tilde Z_j - \frac1S\sum_{s=1}^{S}\tilde x_{js}g(x_{js};\theta)\hat f_U(x_{js}-\hat\Gamma W_j;\theta)/h(x_{js})$ and $\hat\rho_{j,2S}(\theta)$ is defined similarly using $\{x_{js},\ s = S+1, S+2, \ldots, 2S\}$.
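Since $a_n$, $b_n$ and $c_n$ are all powers of $n$, verifying the rate conditions of Theorem 4.1 reduces to sign checks on exponents of $n$. A small checker (the values of $k$, $d$ and $\delta$ below are illustrative):

```python
def check_rates(k, d, a, b, c, delta):
    # a_n = n^-a, b_n = n^-b, c_n = n^(-c*delta/(k+1)), d_n = o(n^-delta)
    e1 = 0.5 - a * (k + 1) - 3 * b - c * delta  # exponent in sqrt(n) a^(k+1) b^3 c^(k+1)
    e2 = -a * d + 3 * b + c * delta             # exponent in a^d b^-3 c^(-k-1)
    e3 = (c - 1) * delta                        # exponent in c_n^(-k-1) d_n
    return e1 > 0 and e2 < 0 and e3 < 0

# 3b + c*delta < min(a*d, 1/2 - a(k+1)) guarantees all three conditions:
k, d, delta = 1, 4, 1.0
a = 0.2                    # 0 < a < 1/(2(k+1)) = 0.25
b, c = 0.02, 0.03          # 3b + c*delta = 0.09 < min(0.8, 0.1)
ok = check_rates(k, d, a, b, c, delta)
```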
Moreover, from Section 3 it is easy to see that $\hat\theta_{n,S}$ has the same properties given in Theorem 4.1 for the SPE $\hat\theta_n$. The asymptotic normality of $\hat\theta_{n,S}$ can also be established in a similar way, under the following further assumptions.

Assumption 13. $\Theta$ contains an open neighborhood $\Theta_0$ of $\theta_0$ such that (1) $g(x;\theta)$ is twice continuously differentiable w.r.t. $\theta$ in $\Theta_0$; (2) the first two derivatives of $\hat{\tilde m}_1(t)/\tilde g(t;\theta)$ w.r.t. $\theta$ and its first derivative w.r.t. $\gamma$ are uniformly bounded, in $\Theta_0$ and an open and bounded neighborhood of $\gamma_0$, by $\eta(t)>0$ which satisfies $\int_{C_n}\eta(t)\,dt<\infty$; (3) $g(x;\theta)f_U(x-\Gamma_0W;\theta)$ has the same properties given in Assumptions 3 and 4 for $g(x;\theta)f_U(x-\Gamma_0W;\phi)$.

Assumption 14. The kernel function $K(v)$ admits second order partial derivatives and $\sup_{v\in\mathbb R^k}\|\partial^2K(v)/\partial v\partial v'\|<\infty$.
Assumption 15. $\sup_u|f_U^c(u;\theta_0)-f_U(u;\theta_0)| = o_p(n^{-1/2})$, where

$$f_U^c(u;\theta_0) = \frac{1}{(2\pi)^k}\int_{C_n}\frac{e^{it'u}}{\tilde g(t;\theta_0)}\int_{B_n}e^{-it'v}m_1(v)\,dv\,dt.$$
Assumption 16. The matrix

$$B = E\left[\frac{\partial\rho'(\theta_0)}{\partial\theta}A(W)\frac{\partial\rho(\theta_0)}{\partial\theta'}\right]$$

is nonsingular.

The asymptotic normality of our estimator relies on the asymptotic behavior of the empirical process $\xi_j = (\xi'_{1j},\xi'_{2j},\xi'_{3j},\xi'_{4j})'$, where

$$\xi_{1j} = \frac{\partial\rho'_{j,S}(\theta_0)}{\partial\theta}A_j\rho_{j,2S}(\theta_0), \tag{4.12}$$

$$\xi_{2j} = \frac{\partial\rho'_0(\theta_0)}{\partial\theta}\frac{A_j}{(2\pi)^k}\,E_j\int\tilde xg(x;\theta_0)\int_{C_n}\frac{e^{-it'(x-\Gamma_0W)}}{\tilde g(t;\theta_0)}\int\frac{e^{-it'v}\eta_j(v)}{f_V(v)}\,dv\,dt\,dx, \tag{4.13}$$

$$\xi'_{3j} = \frac{\rho'_0(\theta_0)A_j}{(2\pi)^k}\,E_j\int\tilde xg(x;\theta_0)\,\frac{\partial}{\partial\theta}\left[\int_{C_n}\frac{e^{-it'(x-\Gamma_0W)}}{\tilde g(t;\theta)}\int\frac{e^{-it'v}\eta_j(v)}{f_V(v)}\,dv\,dt\right]_{\theta=\theta_0}dx \tag{4.14}$$

and $\xi_{4j} = W_j\otimes(Z_j-\Gamma_0W_j)$. Here $\rho_0(\theta_0) = YZ - E(YZ\mid W)$, $E_j$ denotes the conditional expectation given the $j$-th observation $(W_j, Y_j, Z_j)$, and

$$\eta_j(v) = \frac{Y_j-m_1(v)}{a_n^k}K\left(\frac{v-\Gamma_0W_j}{a_n}\right) - E\left[\frac{Y_j-m_1(v)}{a_n^k}K\left(\frac{v-\Gamma_0W_j}{a_n}\right)\right]. \tag{4.15}$$
Then, we have the following result.

Theorem 4.2. In addition to the conditions of Theorem 4.1, suppose $\sqrt n\,a_n^{k+1}b_n^2\to\infty$ and $\sqrt n\,a_n^{2d}b_n^{-2}c_n^{-k-1}\to 0$, where $d$ is as in Assumption 8. Then under Assumptions 13–16, as $n\to\infty$, $\sqrt n(\hat\theta_{n,S}-\theta_0)\xrightarrow{L}N(0, B^{-1}DCD'B^{-1})$, where $C = \lim_{n\to\infty}E\xi_j\xi'_j$ and

$$D = \left[I_p,\ I_p,\ I_p,\ E\frac{\partial\rho'_j(\theta_0)}{\partial\theta}A_j\frac{\partial\rho_j(\theta_0)}{\partial\gamma'}\left(EWW'\otimes I_k\right)^{-1}\right].$$

It is easy to see that $\xi_{1j} = \xi_{5j}+\xi_{6j}$, where

$$\xi_{5j} = \frac{\partial\rho'_j(\theta_0)}{\partial\theta}A_j\rho_j(\theta_0),$$

$$\xi_{6j} = \frac{\partial\Delta'_{j,2S}}{\partial\theta}A_j\rho_j(\theta_0) + \frac{\partial\rho'_j(\theta_0)}{\partial\theta}A_j\Delta_{j,S} + \frac{\partial\Delta'_{j,2S}}{\partial\theta}A_j\Delta_{j,S}$$

and

$$\Delta_{j,S} = E(Y_j\tilde Z_j\mid W_j) - \frac1S\sum_{s=1}^{S}\frac{\tilde x_{js}\,g(x_{js};\theta_0)f_U(x_{js}-\Gamma_0W_j;\theta_0)}{h(x_{js})}.$$
Moreover, the asymptotic covariance matrix of $\hat\theta_{n,S}$ consists of the approximation error of $\hat\Gamma$ for $\Gamma_0$ ($\xi_{4j}$), the approximation error of $\hat f_U$ for $f_U$ ($\xi_{2j}$ and $\xi_{3j}$), the sampling error $\xi_{5j}$ and the simulation error $\xi_{6j}$. It is easy to see that if $\Gamma_0$ is known then $\xi_{4j}=0$, and if $f_U$ is known then $\xi_{2j}=\xi_{3j}=0$. Therefore, if $\Gamma_0$ and $f_U$ are known, the asymptotic covariance matrix of our simulation estimator depends only on the sampling error $\xi_{5j}$ and the simulation error $\xi_{6j}$. Since $E\xi_{6j}\xi'_{6j} = O(S^{-1})$, the impact of the simulation error can be reduced by increasing the simulation size $S$.
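The $O(S^{-1})$ behavior of the simulation error is easy to illustrate: an importance-sampling average of $S$ draws, as in $\Delta_{j,S}$, has error variance proportional to $1/S$, so doubling $S$ roughly halves it. A toy check with an illustrative integrand:

```python
import numpy as np

rng = np.random.default_rng(2)

def sim_error_var(S, reps=4000):
    # error of an S-draw Monte Carlo average of exp(-x^2), x ~ N(0,1);
    # the exact value of E[exp(-X^2)] is 1/sqrt(3)
    errs = []
    for _ in range(reps):
        x = rng.normal(size=S)
        errs.append(np.mean(np.exp(-x**2)) - 1 / np.sqrt(3))
    return np.var(errs)

v1, v2 = sim_error_var(50), sim_error_var(100)
ratio = v1 / v2   # close to 2, reflecting variance O(1/S)
```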
5. Identifiability

Identifiability is a long-standing and difficult problem in nonlinear errors-in-variables models. It has both theoretical and practical importance, but very few results have been obtained so far because of its mathematical complexity. In the literature, this problem has usually been avoided by assuming that the distributions of certain unobserved variables or random errors are known, or it has been ignored altogether in applied work. For a model with errors-in-variables to be identifiable, additional information such as validation data, repeated measurements or instrumental variables is needed (Fuller, 1987; Carroll et al., 1995). Hausman et al. (1991) showed that the polynomial model is identifiable using instrumental variables. Also using the IV approach, Wang and Hsiao (1995) obtained identifiability for models with integrable $g(x;\theta_0)$. Further, Schennach (2007) showed that identifiability holds for general models that are not necessarily integrable.

In this section, we use the framework of the previous section to derive a rank condition for identifiability of model (1.1)–(1.3). First, $\Gamma_0$ is clearly identifiable by (2.1) and the least squares method. In the previous section, we demonstrated that $F_U$ is uniquely determined by $\theta_0$ and $\Gamma_0$ through (4.5). In the following, we study the identifiability of $\theta_0$ using (4.1) and (4.2), given that $\Gamma_0$ is identified. Analogous to (4.3), for every $v\in\mathbb R^k$, let

$$m_2(v) = \int(v+u)\,g(v+u;\theta_0)\,dF_U(u). \tag{5.1}$$

Then $m_2(\Gamma_0W) = E(YZ\mid W)$ and Assumption 7 implies that $m_2(v)\in L^1(\mathbb R^k)$. Now, integrating both sides of (4.3) and (5.1) and applying the Fubini Theorem, we obtain

$$\int m_1(v)\,dv = \int g(x;\theta_0)\,dx := g_1(\theta_0), \tag{5.2}$$

$$\int m_2(v)\,dv = \int xg(x;\theta_0)\,dx := g_2(\theta_0). \tag{5.3}$$

The left-hand sides of (5.2) and (5.3) are observable, and closed forms of the integrals on the right-hand sides can be obtained because the functional form of $g(x;\theta_0)$ is known. By the Rank Theorem (Zeidler, 1986, page 178), a sufficient condition for (5.2)–(5.3) to have a unique solution $\theta_0$ in its neighborhood is that the Jacobian matrix (e.g., Hsiao, 1983)

$$J(\theta_0) = \left[\frac{\partial g_1(\theta_0)}{\partial\theta},\ \frac{\partial g'_2(\theta_0)}{\partial\theta}\right] \tag{5.4}$$

has full rank. Thus, we have the following result.

Theorem 5.1. Under Assumptions 6 and 7, a sufficient condition for $\theta_0$ and $F_U$ to be identifiable is $\operatorname{rank}J(\theta_0) = p$.

From a practical point of view, the integrability of $g(x;\theta_0)$ in Assumption 7 is not as restrictive as it appears, because in many real problems the possible values of $X$ are bounded. In this sense a truncated model which vanishes outside a sufficiently large compact set can be used, which satisfies Assumption 7. From a theoretical point of view, the integrability of $g(x;\theta_0)$ may be weakened to the following assumption.

Assumption 17. $E|g(X;\theta_0)|(\|X\|+1)<\infty$.

To see this, let $g_n(x;\theta_0) = g(x;\theta_0)1(\|x\|<T_n)$, where $1(\cdot)$ is the indicator function and $T_n\to\infty$ (e.g., $T_n = cn^a$ for some $c>0$ and $a>0$), and modify $m_1(v)$ and $m_2(v)$ in (4.3) and (5.1) as

$$m_{1,n}(v) = \int g_n(v+u;\theta_0)\,dF_U(u), \qquad m_{2,n}(v) = \int(v+u)\,g_n(v+u;\theta_0)\,dF_U(u).$$

Then $m_{1,n}(\Gamma_0W)$ and $m_{2,n}(\Gamma_0W)$ approximate $E(Y\mid W)$ and $E(YZ\mid W)$, respectively, in the following sense.

Theorem 5.2. Under Assumption 17, it holds that

$$\lim_{n\to\infty}E|m_{1,n}(\Gamma_0W) - E(Y\mid W)| = 0 \tag{5.5}$$

and

$$\lim_{n\to\infty}E\|m_{2,n}(\Gamma_0W) - E(YZ\mid W)\| = 0. \tag{5.6}$$

Since $E(Y\mid W)$ and $E(YZ\mid W)$ can be consistently estimated by nonparametric methods, together with Assumption 6, Theorem 5.2 implies that $m_{1,n}(\Gamma_0W)$ and $m_{2,n}(\Gamma_0W)$ are (asymptotically) observable on $\mathbb R^k$. Therefore the results of Theorem 5.1 hold with $g(x;\theta_0)$ replaced by $g_n(x;\theta_0)$ in Assumption 7 and in the Jacobian matrix $J(\theta_0)$.

In the rest of this section, we use some examples to illustrate how to apply Theorem 5.1 to check model identifiability. Again, we consider cases where all variables are scalars and $\Gamma_0 = 1$. In this case, we need only verify Assumption 7 or 17.
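In practice the rank condition of Theorem 5.1 can also be checked numerically when $g_1$ and $g_2$ have no convenient closed form. A sketch for a scalar model with $k=1$ and $\Gamma_0 = 1$ (the model, the finite-difference step and the tolerance are illustrative):

```python
import numpy as np
from scipy.integrate import quad

def g1_g2(theta, g):
    # (5.2)-(5.3): g1 = integral of g(x; theta), g2 = integral of x g(x; theta)
    g1 = quad(lambda x: g(x, theta), -np.inf, np.inf)[0]
    g2 = quad(lambda x: x * g(x, theta), -np.inf, np.inf)[0]
    return np.array([g1, g2])

def jacobian_rank(theta, g, eps=1e-6):
    # J(theta) = [dg1/dtheta, dg2/dtheta], by central finite differences
    J = (g1_g2(theta + eps, g) - g1_g2(theta - eps, g)) / (2 * eps)
    J = J.reshape(1, 2)             # p = 1 parameter, k + 1 = 2 moments
    return np.linalg.matrix_rank(J, tol=1e-8)

# Example 5.1's model g(x; theta) = exp(-theta x^2): rank 1, hence identifiable
rank = jacobian_rank(1.0, lambda x, th: np.exp(-th * x**2))
```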
Example 5.1. Exponential model $g(x;\theta) = e^{-\theta x^2}$, $\theta>0$. Clearly $g(x;\theta)$ is integrable. Further, the second part of Assumption 7 is satisfied because $\tilde g(t;\theta) = \sqrt{\pi/\theta}\,e^{-t^2/4\theta}$. To check the rank condition, we calculate $g_1(\theta) = \int e^{-\theta x^2}dx = \sqrt{\pi/\theta}$ and $g_2(\theta) = \int xe^{-\theta x^2}dx = 0$. It follows that $J(\theta) = (-\sqrt\pi/(2\theta\sqrt\theta),\ 0)$, which has rank one. Therefore by Theorem 5.1 the model is identifiable.
It is easy to see that a necessary condition for rank J (θ0 ) = p is p ≤ k + 1, because J (θ0 ) has dimensions p by k + 1.
Remark 5.1. If the second condition in Assumption 7 is violated but there exists $1\le j\le k$ such that the set $\{t\in\mathbb R^k : \tilde g_j(t;\theta_0)\ne 0\}$ is dense in $\mathbb R^k$, where $\tilde g_j(t;\theta_0) = \int e^{-it'x}x_jg(x;\theta_0)\,dx$ and $x_j$ is the $j$-th coordinate of $x = (x_1, x_2, \ldots, x_k)$, then $\tilde f_U(t)$ can still be identified by using the $j$-th equation in (5.1). This is easy to see by taking the Fourier transformation of both sides of (5.1), which yields $\tilde m_2(t) = \tilde f_U(t)\int e^{-it'x}xg(x;\theta_0)\,dx$. Moreover, if $\tilde f_U(t)$ is analytic, then the second condition in Assumption 7 can be further weakened to the assumption that $\tilde g(t;\theta_0)\ne 0$ for some $t\in\mathbb R^k$. This follows from the facts that the continuity of $\tilde g(t;\theta_0)$ implies that $\tilde g(t;\theta_0)\ne 0$ in an open neighborhood, and that any analytic function is uniquely determined by its values on a finite segment of the complex plane. Note that any distribution admitting a moment generating function has an analytic characteristic function (Lukacs, 1970, pp. 197–198). Examples of such distributions include the uniform, normal, double-exponential and many discrete distributions.

Example 5.2. Linear model $g(x;\theta) = \theta x$. For this model, Assumption 17 is satisfied if $E\|X\|^2<\infty$. Further, since for any $T>0$,

$$\tilde g(t;\theta) = \int_{-T}^{T}e^{-it'x}\theta x\,dx = 2i\theta\left[\frac{T\cos(tT)}{t} - \frac{\sin(tT)}{t^2}\right],$$

the second condition in Assumption 7 is satisfied. To check the rank condition, we calculate $g_1(\theta) = \theta\int_{-T}^{T}x\,dx = 0$ and $g_2(\theta) = \theta\int_{-T}^{T}x^2\,dx = 2\theta T^3/3$. Therefore $J(\theta) = (0,\ 2T^3/3)$, which has rank one. Hence by Theorem 5.1 the model is identifiable.
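The closed-form truncated transform in Example 5.2 can be confirmed by direct numerical integration over $[-T, T]$ (the test values of $T$, $\theta$ and $t$ below are arbitrary):

```python
import numpy as np

# Trapezoidal quadrature of the truncated Fourier transform of theta*x
T, theta, t = 2.0, 1.5, 0.7
x = np.linspace(-T, T, 200_001)
f = np.exp(-1j * t * x) * theta * x
numeric = ((f[1:] + f[:-1]) / 2).sum() * (x[1] - x[0])

# closed form from Example 5.2
closed = 2j * theta * (T * np.cos(t * T) / t - np.sin(t * T) / t**2)
gap = abs(numeric - closed)
```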
Example 5.3. Polynomial model $g(x;\theta) = \theta_1x+\theta_2x^2$. In this case, Assumption 17 is satisfied if $E\|X\|^3<\infty$. Further, because for any $T>0$,

$$\tilde g(t;\theta) = 2T\cos(tT)\left[\frac{i\theta_1}{t} + \frac{2\theta_2}{t^2}\right] + 2\sin(tT)\left[\frac{\theta_2T^2}{t} - \frac{i\theta_1}{t^2} - \frac{2\theta_2}{t^3}\right],$$
the second condition in Assumption 7 is clearly satisfied. Again, it is straightforward to calculate $g_1(\theta) = 2\theta_2T^3/3$ and $g_2(\theta) = 2\theta_1T^3/3$. Hence

$$J(\theta) = \frac{2T^3}{3}\begin{bmatrix}0 & 1\\ 1 & 0\end{bmatrix},$$

which has rank two. Therefore the model is identifiable.

Example 5.4. Exponential model $g(x;\theta) = \exp(\theta x)$, $\theta\ne 0$. For this model, Assumption 17 becomes $\int e^{\theta x}dF_X(x)<\infty$ and $\int\|x\|e^{\theta x}dF_X(x)<\infty$, which are satisfied if $X$ has a normal distribution. Since for any $T>0$ and $t\ne -i\theta$,

$$\tilde g(t;\theta) = \frac{e^{(\theta-it)T} - e^{-(\theta-it)T}}{\theta-it},$$

the second condition in Assumption 7 is satisfied. Moreover, since for any $\theta$, $g_1(\theta) = (e^{\theta T}-e^{-\theta T})/\theta$ and

$$g_2(\theta) = \frac{T(e^{\theta T}+e^{-\theta T})}{\theta} - \frac{e^{\theta T}-e^{-\theta T}}{\theta^2},$$

the Jacobian matrix $J(\theta)$ is of rank one, which implies that the model is identifiable.

Example 5.5. Consider $g(x;\theta) = \theta_1+\theta_2x^{\theta_3}$, where $\theta_2\theta_3\ne 0$. Clearly this model would be identifiable if $X$ were observable. Since now the rank of $J(\theta)$ is at most two while $p = 3$, the model cannot be identified by (4.1) and (4.2).

Example 5.6. Let $g(x;\theta) = (\theta_1+\theta_2x)^2$, $\theta_2\ne 0$. Again, $\theta_1$ and $\theta_2$ would be identifiable if $X$ were observable. However, $\theta_1$ and $\theta_2$ are not identifiable by (4.1) and (4.2) if $\theta_1^2 = \theta_2^2$, because then $J(\theta)$ is singular. Note that if the prior restriction $\theta_1 = \theta_2$ or $\theta_1 = -\theta_2$ is imposed, then the model again becomes identifiable. However, these restrictions imply very different model specifications.

6. Conclusions and discussion

Consistent estimation and identifiability of general nonlinear errors-in-variables models with multivariate predictor variables and possibly non-normal random errors have been challenging problems for decades. Most researchers rely on restrictive conditions to achieve consistent estimation, or treat more general models at the expense of the accuracy of estimation (e.g., the approximately consistent approach). Moreover, most methods in the literature are designed for the case where either validation or replicate data are available. In this paper, we use the instrumental variable approach to study a general model, where the predictor variable is multivariate and the distributions of the measurement error and the random error in the regression equation are nonparametric. Root-$n$ consistent parametric and semiparametric estimators for the model are developed using the method of moments. A rank condition for model identifiability is derived by combining nonparametric techniques and Fourier deconvolution.

It is possible to generalize the prediction equation (1.3) to a nonlinear one, say $X = \Gamma(W)+U$. All the results of this paper should follow analogously, provided the function $\Gamma(\cdot)$ can be consistently estimated with convergence rate $\sqrt n$. The latter is generally satisfied if $\Gamma(\cdot)$ is parametric and estimated by the usual nonlinear least squares method.

The independence between $W$ and $U$ is stronger than the usual instrumental variable assumption that they are uncorrelated. As pointed out by a referee, this assumption can be relaxed through parametric modeling of the conditional distribution $f_{U|W}(u|w;\phi_0)$ instead of the marginal distribution $f_U(u;\phi_0)$. However, it is not clear, and deserves future research, how such an extension is possible in the nonparametric case. Another issue that should be investigated in future research is the finite sample properties of the proposed estimators, which may be studied through extensive and carefully designed simulation studies.

7. Proofs

7.1. Proof of Theorem 2.1

Since $f_U(u;\phi)$ is continuously differentiable with respect to $u$ (Assumption 1), by (2.9) and the Dominated Convergence Theorem (DCT), $\hat\rho_j(\psi)$ and hence $Q_n(\psi)$ are continuously differentiable with respect to $\gamma = \operatorname{vec}\Gamma$. For sufficiently large $n$, therefore, $Q_n(\psi)$ has the first-order Taylor expansion about $\gamma_0 = \operatorname{vec}\Gamma_0$:

$$Q_n(\psi) = \sum_{j=1}^{n}\rho'_j(\psi)A_j\rho_j(\psi) + 2\sum_{j=1}^{n}\rho'_j(\psi,\tilde\gamma)A_j\frac{\partial\rho_j(\psi,\tilde\gamma)}{\partial\gamma'}(\hat\gamma-\gamma_0), \tag{7.1}$$

where $\rho_j(\psi) = Y_j\tilde Z_j - m(\Gamma_0W_j;\psi)$, $\rho_j(\psi,\tilde\gamma) = Y_j\tilde Z_j - m(\tilde\Gamma W_j;\psi)$,

$$\frac{\partial\rho_j(\psi,\tilde\gamma)}{\partial\gamma'} = \int\tilde xg(x;\theta)\frac{\partial f_U(x-\tilde\Gamma W_j;\phi)}{\partial u'}\,dx\,(W_j\otimes I_k)'$$

and $\tilde\gamma = \operatorname{vec}\tilde\Gamma$ satisfies $\|\tilde\gamma-\gamma_0\|\le\|\hat\gamma-\gamma_0\|$. Further, since $g(x;\theta)$ and $f_U(u;\phi)$ are continuous in $\theta$ and $\phi$ respectively, by (2.8) and the DCT, $\rho_j(\psi)$ is continuous in $\psi$ and, moreover,

$$E\sup_\psi|\rho'_1(\psi)A_1\rho_1(\psi)| \le E\|A_1\|\sup_\psi\|\rho_1(\psi)\|^2 \le 2E\|A_1\|(|Y_1|^2+\|Y_1Z_1\|^2) + 2E\|A_1\|\left[\int\sup_\psi|g(x;\theta)f_U(x-\Gamma_0W_1;\phi)|(\|x\|+1)\,dx\right]^2 < \infty.$$

It follows from the uniform law of large numbers (ULLN; Jennrich, 1969, Theorem 2) that the first term on the right-hand side of (7.1) satisfies

$$\sup_\psi\left|\frac1n\sum_{j=1}^{n}\rho'_j(\psi)A_j\rho_j(\psi) - Q(\psi)\right| \xrightarrow{a.s.} 0, \tag{7.2}$$

where $Q(\psi) = E\rho'_1(\psi)A_1\rho_1(\psi)$. Similarly, since by the Cauchy–Schwarz inequality and (2.9),

$$\left[E\sup_{\psi,\gamma}\left|\rho'_1(\psi,\gamma)A_1\frac{\partial\rho_1(\psi,\gamma)}{\partial\gamma'}\right|\right]^2 \le E\|A_1\|\sup_{\psi,\gamma}\|\rho_1(\psi,\gamma)\|^2\,E\|A_1\|\sup_{\psi,\gamma}\left\|\frac{\partial\rho_1(\psi,\gamma)}{\partial\gamma'}\right\|^2$$
$$\le kE\|A_1\|\sup_{\psi,\gamma}\|\rho_1(\psi,\gamma)\|^2\,E\|A_1\|\|W_1\|^2\left[\int\sup_{\psi,\gamma}\left|g(x;\theta)\frac{\partial f_U(x-\Gamma W_1;\phi)}{\partial u'}\right|(\|x\|+1)\,dx\right]^2 < \infty,$$

again by the ULLN we have

$$\sup_{\psi,\gamma}\left|\frac1n\sum_{j=1}^{n}\rho'_j(\psi,\gamma)A_j\frac{\partial\rho_j(\psi,\gamma)}{\partial\gamma'}\right| = O(1)\ \text{(a.s.)}$$
and therefore

$$\sup_\psi\left|\frac1n\sum_{j=1}^{n}\rho'_j(\psi,\tilde\gamma)A_j\frac{\partial\rho_j(\psi,\tilde\gamma)}{\partial\gamma'}(\hat\gamma-\gamma_0)\right| \le \sup_{\psi,\gamma}\left|\frac1n\sum_{j=1}^{n}\rho'_j(\psi,\gamma)A_j\frac{\partial\rho_j(\psi,\gamma)}{\partial\gamma'}\right|\|\hat\gamma-\gamma_0\| \xrightarrow{a.s.} 0. \tag{7.3}$$

It follows from (7.1)–(7.3) that

$$\sup_\psi\left|\frac1nQ_n(\psi) - Q(\psi)\right| \xrightarrow{a.s.} 0. \tag{7.4}$$

Further, because $E(\rho_1(\psi_0)\mid W_1) = 0$ and $\rho_1(\psi)-\rho_1(\psi_0)$ depends on $W_1$ only, we have

$$E[\rho'_1(\psi_0)A_1(\rho_1(\psi)-\rho_1(\psi_0))] = E[E(\rho'_1(\psi_0)\mid W_1)A_1(\rho_1(\psi)-\rho_1(\psi_0))] = 0,$$

which implies $Q(\psi) = Q(\psi_0) + E[(\rho_1(\psi)-\rho_1(\psi_0))'A_1(\rho_1(\psi)-\rho_1(\psi_0))]$. By Assumption 2, $Q(\psi)\ge Q(\psi_0)$ and the equality holds if and only if $\psi=\psi_0$. It follows that $Q(\psi)$ attains a unique minimum at $\psi_0\in\Psi$ and, therefore, by Amemiya (1973, Lemma 3), $\hat\psi_n\xrightarrow{a.s.}\psi_0$.

7.2. Proof of Theorem 2.2

By Assumption 3 and the DCT, the first derivative $\partial Q_n(\psi)/\partial\psi$ exists and has the first-order Taylor expansion in the open neighborhood $\Psi_0\subset\Psi$ of $\psi_0$. Since $\partial Q_n(\hat\psi_n)/\partial\psi = 0$ and $\hat\psi_n\xrightarrow{a.s.}\psi_0$, for sufficiently large $n$ we have

$$\frac{\partial Q_n(\psi_0)}{\partial\psi} + \frac{\partial^2Q_n(\tilde\psi)}{\partial\psi\partial\psi'}(\hat\psi_n-\psi_0) = 0, \tag{7.5}$$

where $\|\tilde\psi-\psi_0\|\le\|\hat\psi_n-\psi_0\|$. The first derivative of $Q_n(\psi)$ in (7.5) is given by

$$\frac{\partial Q_n(\psi)}{\partial\psi} = 2\sum_{j=1}^{n}\frac{\partial\hat\rho'_j(\psi)}{\partial\psi}A_j\hat\rho_j(\psi), \tag{7.6}$$

where $\partial\hat\rho'_j(\psi)/\partial\psi$ consists of

$$\frac{\partial\hat\rho'_j(\psi)}{\partial\theta} = -\int\frac{\partial g(x;\theta)}{\partial\theta}\tilde x'f_U(x-\hat\Gamma W_j;\phi)\,dx \tag{7.7}$$

and

$$\frac{\partial\hat\rho'_j(\psi)}{\partial\phi} = -\int\frac{\partial f_U(x-\hat\Gamma W_j;\phi)}{\partial\phi}\tilde x'g(x;\theta)\,dx. \tag{7.8}$$

The second derivative in (7.5) is given by

$$\frac{\partial^2Q_n(\psi)}{\partial\psi\partial\psi'} = 2\sum_{j=1}^{n}\left[\frac{\partial\hat\rho'_j(\psi)}{\partial\psi}A_j\frac{\partial\hat\rho_j(\psi)}{\partial\psi'} + \left(\hat\rho'_j(\psi)A_j\otimes I_{p+q}\right)\frac{\partial\operatorname{vec}(\partial\hat\rho'_j(\psi)/\partial\psi)}{\partial\psi'}\right],$$

where $\partial\operatorname{vec}(\partial\hat\rho'_j(\psi)/\partial\psi)/\partial\psi'$ consists of

$$\frac{\partial\operatorname{vec}(\partial\hat\rho'_j(\psi)/\partial\theta)}{\partial\theta'} = -\int\tilde x\otimes\frac{\partial^2g(x;\theta)}{\partial\theta\partial\theta'}f_U(x-\hat\Gamma W_j;\phi)\,dx,$$
$$\frac{\partial\operatorname{vec}(\partial\hat\rho'_j(\psi)/\partial\theta)}{\partial\phi'} = -\int\tilde x\otimes\frac{\partial g(x;\theta)}{\partial\theta}\frac{\partial f_U(x-\hat\Gamma W_j;\phi)}{\partial\phi'}\,dx,$$
$$\frac{\partial\operatorname{vec}(\partial\hat\rho'_j(\psi)/\partial\phi)}{\partial\theta'} = -\int\tilde x\otimes\frac{\partial f_U(x-\hat\Gamma W_j;\phi)}{\partial\phi}\frac{\partial g(x;\theta)}{\partial\theta'}\,dx,$$
$$\frac{\partial\operatorname{vec}(\partial\hat\rho'_j(\psi)/\partial\phi)}{\partial\phi'} = -\int\tilde x\otimes\frac{\partial^2f_U(x-\hat\Gamma W_j;\phi)}{\partial\phi\partial\phi'}g(x;\theta)\,dx.$$

It follows from Assumption 3 that

$$E\sup_{\psi,\gamma}\left\|\frac{\partial\rho'_1(\psi,\gamma)}{\partial\psi}A_1\frac{\partial\rho_1(\psi,\gamma)}{\partial\psi'}\right\| \le E\|A_1\|\sup_{\psi,\gamma}\left[\left\|\frac{\partial\rho_1(\psi,\gamma)}{\partial\theta'}\right\|^2 + \left\|\frac{\partial\rho_1(\psi,\gamma)}{\partial\phi'}\right\|^2\right] < \infty.$$

Similarly, by Assumption 3 we have

$$E\sup_{\psi,\gamma}\left\|\left(\rho'_1(\psi,\gamma)A_1\otimes I_{p+q}\right)\frac{\partial\operatorname{vec}(\partial\rho'_1(\psi,\gamma)/\partial\psi)}{\partial\psi'}\right\| \le (p+q)\left[E\|A_1\|\sup_{\psi,\gamma}\|\rho_1(\psi,\gamma)\|^2\,E\|A_1\|\sup_{\psi,\gamma}\left\|\frac{\partial\operatorname{vec}(\partial\rho'_1(\psi,\gamma)/\partial\psi)}{\partial\psi'}\right\|^2\right]^{1/2} < \infty.$$

It follows from the ULLN and Amemiya (1973, Lemma 4) that

$$\frac{1}{2n}\frac{\partial^2Q_n(\tilde\psi)}{\partial\psi\partial\psi'} \xrightarrow{a.s.} E\left[\frac{\partial\rho'_1(\psi_0)}{\partial\psi}A_1\frac{\partial\rho_1(\psi_0)}{\partial\psi'} + \left(\rho'_1(\psi_0)A_1\otimes I_{p+q}\right)\frac{\partial\operatorname{vec}(\partial\rho'_1(\psi_0)/\partial\psi)}{\partial\psi'}\right] = B, \tag{7.9}$$

where the last equality holds because $\partial\operatorname{vec}(\partial\rho'_1(\psi_0)/\partial\psi)/\partial\psi'$ depends on $W_1$ only and therefore

$$E\left[\left(\rho'_1(\psi_0)A_1\otimes I_{p+q}\right)\frac{\partial\operatorname{vec}(\partial\rho'_1(\psi_0)/\partial\psi)}{\partial\psi'}\right] = E\left[\left(E(\rho'_1(\psi_0)\mid W_1)A_1\otimes I_{p+q}\right)\frac{\partial\operatorname{vec}(\partial\rho'_1(\psi_0)/\partial\psi)}{\partial\psi'}\right] = 0.$$

Further, by Assumption 4 and the DCT, $\partial Q_n(\psi_0)/\partial\psi$ is continuously differentiable with respect to $\gamma$ and hence, for sufficiently large $n$, has the first-order Taylor expansion about $\gamma_0$:

$$\frac{\partial Q_n(\psi_0)}{\partial\psi} = 2\sum_{j=1}^{n}\frac{\partial\rho'_j(\psi_0)}{\partial\psi}A_j\rho_j(\psi_0) + \frac{\partial^2\tilde Q_n(\psi_0)}{\partial\psi\partial\gamma'}(\hat\gamma-\gamma_0), \tag{7.10}$$

where
$$\frac{\partial^2\tilde Q_n(\psi_0)}{\partial\psi\partial\gamma'} = 2\sum_{j=1}^{n}\left[\frac{\partial\rho'_j(\psi_0,\tilde\gamma)}{\partial\psi}A_j\frac{\partial\rho_j(\psi_0,\tilde\gamma)}{\partial\gamma'} + \left(\rho'_j(\psi_0,\tilde\gamma)A_j\otimes I_{p+q}\right)\frac{\partial\operatorname{vec}(\partial\rho'_j(\psi_0,\tilde\gamma)/\partial\psi)}{\partial\gamma'}\right],$$

$$\frac{\partial\rho_j(\psi_0,\tilde\gamma)}{\partial\gamma'} = \int\tilde xg(x;\theta_0)\frac{\partial f_U(x-\tilde\Gamma W_j;\phi_0)}{\partial u'}\,dx\,(W_j\otimes I_k)',$$

$\partial\operatorname{vec}(\partial\rho'_j(\psi_0,\tilde\gamma)/\partial\psi)/\partial\gamma'$ consists of

$$\frac{\partial\operatorname{vec}(\partial\rho'_j(\psi_0,\tilde\gamma)/\partial\theta)}{\partial\gamma'} = \int\tilde x\otimes\frac{\partial g(x;\theta_0)}{\partial\theta}\frac{\partial f_U(x-\tilde\Gamma W_j;\phi_0)}{\partial u'}\,dx\,(W_j\otimes I_k)',$$
$$\frac{\partial\operatorname{vec}(\partial\rho'_j(\psi_0,\tilde\gamma)/\partial\phi)}{\partial\gamma'} = \int g(x;\theta_0)\,\tilde x\otimes\frac{\partial^2f_U(x-\tilde\Gamma W_j;\phi_0)}{\partial\phi\partial u'}\,dx\,(W_j\otimes I_k)',$$

and $\tilde\gamma = \operatorname{vec}\tilde\Gamma$ satisfies $\|\tilde\gamma-\gamma_0\|\le\|\hat\gamma-\gamma_0\|$. Similarly to (7.9), by Assumption 4 we can show that

$$\frac{1}{2n}\frac{\partial^2\tilde Q_n(\psi_0)}{\partial\psi\partial\gamma'} \xrightarrow{a.s.} E\left[\frac{\partial\rho'_1(\psi_0)}{\partial\psi}A_1\frac{\partial\rho_1(\psi_0)}{\partial\gamma'}\right]. \tag{7.11}$$

By definition (2.6), $\hat\Gamma-\Gamma_0 = \left[\sum_{j=1}^{n}(Z_j-\Gamma_0W_j)W'_j\right]\left[\sum_{j=1}^{n}W_jW'_j\right]^{-1}$, which can be written as

$$\hat\gamma-\gamma_0 = \operatorname{vec}(\hat\Gamma-\Gamma_0) = \left(\sum_{j=1}^{n}W_jW'_j\otimes I_k\right)^{-1}\sum_{j=1}^{n}W_j\otimes(Z_j-\Gamma_0W_j)$$

(Magnus and Neudecker, 1988, p. 30). Hence (7.10) can be written as

$$\frac{\partial Q_n(\psi_0)}{\partial\psi} = 2D_n\sum_{j=1}^{n}T_j,$$

where

$$D_n = \left[I_{p+q},\ \frac{1}{2n}\frac{\partial^2\tilde Q_n(\psi_0)}{\partial\psi\partial\gamma'}\left(\frac1n\sum_{j=1}^{n}W_jW'_j\otimes I_k\right)^{-1}\right], \qquad T_j = \begin{bmatrix}\dfrac{\partial\rho'_j(\psi_0)}{\partial\psi}A_j\rho_j(\psi_0)\\[2mm] W_j\otimes(Z_j-\Gamma_0W_j)\end{bmatrix}.$$

By the Law of Large Numbers, $\left(\frac1n\sum_jW_jW'_j\otimes I_k\right)^{-1}\xrightarrow{a.s.}(EW_1W'_1\otimes I_k)^{-1}$, which together with (7.11) implies

$$D_n \xrightarrow{a.s.} \left[I_{p+q},\ E\left(\frac{\partial\rho'_1(\psi_0)}{\partial\psi}A_1\frac{\partial\rho_1(\psi_0)}{\partial\gamma'}\right)\left(EW_1W'_1\otimes I_k\right)^{-1}\right] = D.$$

Moreover, since $T_j$, $j = 1, 2, \ldots, n$ are i.i.d., by the Central Limit Theorem $\frac{1}{\sqrt n}\sum_{j=1}^{n}T_j\xrightarrow{L}N(0, C)$, where $C = ET_1T'_1$. Therefore, by Slutsky's Theorem, we have

$$\frac{1}{2\sqrt n}\frac{\partial Q_n(\psi_0)}{\partial\psi} \xrightarrow{L} N(0, DCD'). \tag{7.12}$$

Finally, the theorem follows from (7.5), (7.9) and (7.12).

7.3. Proof of Theorem 3.1

We only sketch the proofs here because they are similar to the proofs of Theorems 2.1 and 2.2; details are available in Wang and Hsiao (2008). The proof of Theorem 3.1.1 is based on the first-order Taylor expansion of $Q_{n,S}(\psi)$ about $\gamma_0$:

$$Q_{n,S}(\psi) = \sum_{j=1}^{n}\rho'_{j,S}(\psi)A_j\rho_{j,2S}(\psi) + \sum_{j=1}^{n}\left[\rho'_{j,S}(\psi,\tilde\gamma)A_j\frac{\partial\rho_{j,2S}(\psi,\tilde\gamma)}{\partial\gamma'} + \rho'_{j,2S}(\psi,\tilde\gamma)A_j\frac{\partial\rho_{j,S}(\psi,\tilde\gamma)}{\partial\gamma'}\right](\hat\gamma-\gamma_0), \tag{7.13}$$

where $\|\tilde\gamma-\gamma_0\|\le\|\hat\gamma-\gamma_0\|$, $\rho_{j,S}(\psi) = Y_j\tilde Z_j - m_S(\Gamma_0W_j;\psi)$, $\rho_{j,S}(\psi,\tilde\gamma) = Y_j\tilde Z_j - m_S(\tilde\Gamma W_j;\psi)$,

$$\frac{\partial\rho_{j,S}(\psi,\tilde\gamma)}{\partial\gamma'} = \frac1S\sum_{s=1}^{S}\frac{\tilde x_{js}\,g(x_{js};\theta)}{h(x_{js})}\frac{\partial f_U(x_{js}-\tilde\Gamma W_j;\phi)}{\partial u'}\,(W_j\otimes I_k)'$$

and $\rho_{j,2S}(\psi,\tilde\gamma)$, $\partial\rho_{j,2S}(\psi,\tilde\gamma)/\partial\gamma'$ are given similarly. By Assumption 1, the ULLN and the conditional independence of $\rho_{1,S}(\psi)$ and $\rho_{1,2S}(\psi)$, we can show that

$$\sup_\psi\left|\frac1nQ_{n,S}(\psi) - E\rho'_{1,S}(\psi)A_1\rho_{1,2S}(\psi)\right| \xrightarrow{a.s.} 0, \tag{7.14}$$

where

$$E\rho'_{1,S}(\psi)A_1\rho_{1,2S}(\psi) = E\left[E(\rho'_{1,S}(\psi)\mid W_1,Y_1,Z_1)\,A_1\,E(\rho_{1,2S}(\psi)\mid W_1,Y_1,Z_1)\right] = Q(\psi),$$

which has been shown in the proof of Theorem 2.1 to attain a unique minimum at $\psi_0\in\Psi$. Therefore $\hat\psi_{n,S}\xrightarrow{a.s.}\psi_0$ follows from Amemiya (1973, Lemma 3).

The proof of Theorem 3.1.2 is based on the first-order Taylor expansion of $\partial Q_{n,S}(\psi)/\partial\psi$ in an open neighborhood $\Psi_0\subset\Psi$ of $\psi_0$:

$$\frac{\partial Q_{n,S}(\psi_0)}{\partial\psi} + \frac{\partial^2Q_{n,S}(\tilde\psi)}{\partial\psi\partial\psi'}(\hat\psi_{n,S}-\psi_0) = 0, \tag{7.15}$$

where $\|\tilde\psi-\psi_0\|\le\|\hat\psi_{n,S}-\psi_0\|$. Analogous to (7.9), we can show that

$$\frac1n\frac{\partial^2Q_{n,S}(\tilde\psi)}{\partial\psi\partial\psi'} \xrightarrow{a.s.} 2B. \tag{7.16}$$

Again, by Assumption 4, $\partial Q_{n,S}(\psi_0)/\partial\psi$ has the first-order Taylor expansion about $\gamma_0$:

$$\frac{\partial Q_{n,S}(\psi_0)}{\partial\psi} = \sum_{j=1}^{n}\left[\frac{\partial\rho'_{j,S}(\psi_0)}{\partial\psi}A_j\rho_{j,2S}(\psi_0) + \frac{\partial\rho'_{j,2S}(\psi_0)}{\partial\psi}A_j\rho_{j,S}(\psi_0)\right] + \frac{\partial^2\tilde Q_{n,S}(\psi_0)}{\partial\psi\partial\gamma'}(\hat\gamma-\gamma_0),$$

which can be rewritten as

$$\frac{\partial Q_{n,S}(\psi_0)}{\partial\psi} = 2D_{n,S}\sum_{j=1}^{n}T_{j,S},$$

where
−1 n 1 ∂ 2 Q˜ n,S (ψ0 ) − = Ip+q , Wj Wj′ ⊗ Ik 2 ∂ψ∂γ ′ j =1
Dn,S
and T j ,S =
∂ρj′,S (ψ0 )
∂ψ
1 2
Aj ρj,2S (ψ0 ) +
∂ρj′,2S (ψ0 ) ∂ψ
2Wj ⊗ (Zj − Γ0 Wj )
Aj ρj,S (ψ0 )
.
Then, analogous to (7.11), by Assumption 4 we can show that 1 ∂ 2 Q˜ n,S (ψ0 )
∂ψ∂γ ′
n
∂ρ1′ (ψ0 ) ∂ρ1 (ψ0 ) A1 −→ 2E ∂ψ ∂γ ′ [
a.s.
]
(2π )k |fˆU (u; θ ) − fU (u; θ )| ∫ ∫ ˆ˜ ˜ 1 (t ) ˜ ′ m it ′ u m1 (t ) − m1 (t ) ≤ e dt + dt eit u Cn g˜ (t ; θ ) g˜ (t ; θ ) Cnc ∫ ˆ ∫ m ˜ 1 (t ) − m ˜ 1 (t ) m ˜ 1 (t ) dt ≤ dt + ˜ (t ; θ ) g˜ (t ; θ ) Cn Cnc g ∫ ∫ m ˜ 1 (t ) 1 ˆ˜ 1 (t ) − m ˜ 1 (t )|dt + ≤ sup |m dt cn Cn g˜ (t ; θ ) Cnc Θ ∫ ∫ 2k ˆ 1 (v) − m1 (v)|dv + ≤ k+1 |m |m1 (v)|dv cn
and hence
a.s.
Dn,S −→
[
D.
=
+
] −1 ∂ρ1′ (ψ0 ) ∂ρ1 (ψ0 ) ′ Ip+q , E A1 EW1 W1 ⊗ Ik ∂ψ ∂γ ′
Cnc
(7.17)
Bcn
Bn
∫
m ˜ 1 (t ) sup dt . g˜ (t ; θ ) Θ
Since, by (7.20), limn→∞ P (infBn |fV (v)| ≥ bn /2) = 1, with probability approaching one, we have
∫
ˆ 1 (v) − m1 (v)|dv ≤ |m
Further, by the Central Limit Theorem we have Bn n 1 −
√
n j =1
L
Tj,S − → N (0, CS ),
CS = ET1,S T1′ ,S =
CS ,11 CS ,21
CS′ ,21 CS ,22
=
4
[ E
∂ρ1′ ,S (ψ0 ) ∂ψ
,
×
∫
bn 2 bn
ˆ 1 (v) − m1 (v)|fV (v)dv |m Bn
ˆ 1 (v) − m1 (v)| = op (cnk+1 ), sup |m Bn
sup sup |fˆU (u; θ ) − fU (u; θ )| = op (1). θ∈Θ
A1 ρ1,2S (ψ0 ) +
∂ρ1′ ,2S (ψ0 ) ∂ψ
ρ1′ ,2S (ψ0 )A1
∂ Qn,S (ψ0 ) L − → N (0, DCS D′ ). ∂ψ 2 n 1
√
(7.19)
Finally, Theorem 3.1.2 follows from (7.15), (7.16) and (7.19). 7.4. Proof of Theorem 4.1 First, by Andrews (1995, Theorem 2), under Assumptions 8 and 9, the kernel estimators in (4.7) and (4.8) satisfy k−1 sup |fˆV (v) − fV (v)| = Op (n−1/2 a− ) + Op (adn ) n
(7.20)
v∈Rk
and k−1 −2 2 ˆ 1 (v) − m1 (v)| = Op (n−1/2 a− sup |m bn ) + Op (adn b− n n ).
Further, for any θ ∈ Θ and u ∈ Rk ,
Qn (θ ) =
(7.21)
n −
ρj (θ )′ Aj ρj (θ ) + 2
j =1
]
and CS ,22 = E [(W1 ⊗ (Z1 − Γ0 W1 ))(W1 ⊗ (Z1 − Γ0 W1 ))′ ] = C22 . It follows from (7.17) and (7.18) we have
Bn
(7.23)
u
Now, we write
A1 ρ1,S (ψ0 )
∂ρ1,S (ψ0 ) ∂ρ1,2S (ψ0 ) + ρ1′ ,S (ψ0 )A1 ∂ψ ′ ∂ψ ′ [ ′ ] ∂ρ1,S (ψ0 ) 1 ∂ρ1,S (ψ0 ) ′ = E A1 ρ1,2S (ψ0 )ρ1,2S (ψ0 )A1 2 ∂ψ ∂ψ ′ [ ′ ] ∂ρ1,S (ψ0 ) 1 ∂ρ1,2S (ψ0 ) + E A1 ρ1,2S (ψ0 )ρ1′ ,S (ψ0 )A1 , 2 ∂ψ ∂ψ ′ [ 1 CS ,21 = E (W1 ⊗ (Z1 − Γ0 W1 )) 2 ] ∂ρ1,S (ψ0 ) ∂ρ1,2S (ψ0 ) ′ × ρ1′ ,2S (ψ0 )A1 + ρ (ψ ) A 0 1 1 ,S ∂ψ ′ ∂ψ ′ = C21
2
where the last equality follows by (7.21) and condition of Theorem 4.1. It follows from (7.22) and Assumption 10, that
CS ,11 1
≤
(7.18)
where
(7.22)
+
n −
ρj′ (θ )Aj (ρˆ j (θ ) − ρj (θ ))
j=1
n −
(ρˆ j (θ ) − ρj (θ ))′ Aj (ρˆ j (θ ) − ρj (θ )),
(7.24)
j =1
where ρj (θ ) = Yj Z˜j −
x˜ g (x; θ )fU (x − Γ0 Wj ; θ )dx. Then, since
|fU (x − Γ0 Wj ; θ ) − fU (x − Γˆ Wj ; θ )| ∫ 1 ˜ 1 (t ) −it ′ Γ0 Wj −it ′ Γˆ Wj it ′ x m dt = ( e − e ) e g˜ (t ; θ ) (2π )k ∫ ˜ (t ) 1 1 −it ′ Γ0 Wj −it ′ Γˆ Wj m dt e − e ≤ (2π )k g˜ (t ; θ ) ∫ m 2‖Γˆ − Γ0 ‖ ‖Wj ‖ ˜ 1 (t ) dt , ≤ ‖ t ‖ k (2π ) g˜ (t ; θ ) we have
‖ρˆ j (θ ) − ρj (θ )‖ ∫ ≤ ‖˜xg (x; θ )‖ |fU (x − Γ0 Wj ; θ ) − fˆU (x − Γˆ Wj ; θ )|dx ∫ ≤ ‖˜xg (x; θ )‖ |fU (x − Γ0 Wj ; θ ) − fU (x − Γˆ Wj ; θ )|dx ∫ + ‖˜xg (x; θ )‖ |fU (x − Γˆ Wj ; θ ) − fˆU (x − Γˆ Wj ; θ )|dx ∫ m 2‖Γˆ − Γ0 ‖ ‖Wj ‖ ˜ 1 (t ) dt ≤ ‖ t ‖ sup k (2π ) g˜ (t ; θ ) Θ ∫ × sup |g (x; θ )|(‖x‖ + 1)dx Θ ∫ + sup sup |fˆU (u; θ ) − fU (u; θ )| sup |g (x; θ )|(‖x‖ + 1)dx θ∈Θ
Θ
u
= α1 ‖Γˆ − Γ0 ‖ ‖Wj ‖ + α2 sup sup |fˆU (u; θ ) − fU (u; θ )|, θ∈Θ
u
L. Wang, C. Hsiao / Journal of Econometrics 165 (2011) 30–44
41
where α1 and α2 are positive constants. Further, by the Cauchy–Schwarz inequality,

(E sup_Θ ‖A1‖(‖W1‖ + 1)‖ρ1(θ)‖)² ≤ E‖A1‖(‖W1‖ + 1)² E‖A1‖ sup_Θ ‖ρ1(θ)‖²
≤ 2E‖A1‖(‖W1‖² + 1) [ E‖A1‖(‖Y1‖² + ‖Y1Z1‖²) + E‖A1‖ ( ∫ sup_Θ ‖x̃ g(x; θ) fU(x − Γ0W1; θ)‖ dx )² ] < ∞,

which implies (1/n) Σ_{j=1}^n ‖Aj‖(‖Wj‖ + 1) sup_Θ ‖ρj(θ)‖ = Op(1). Therefore

sup_Θ | (1/n) Σ_{j=1}^n ρ′_j(θ) Aj (ρ̂j(θ) − ρj(θ)) | ≤ (1/n) Σ_{j=1}^n ‖Aj‖ sup_Θ ‖ρj(θ)‖ ‖ρ̂j(θ) − ρj(θ)‖
≤ [ α1 ‖Γ̂ − Γ0‖ + α2 sup_{θ∈Θ} sup_{u∈R^k} |f̂U(u; θ) − fU(u; θ)| ] × (1/n) Σ_{j=1}^n ‖Aj‖(‖Wj‖ + 1) sup_Θ ‖ρj(θ)‖ = op(1),   (7.25)

and analogously,

sup_Θ (1/n) Σ_{j=1}^n (ρ̂j(θ) − ρj(θ))′ Aj (ρ̂j(θ) − ρj(θ)) ≤ (1/n) Σ_{j=1}^n ‖Aj‖ sup_Θ ‖ρ̂j(θ) − ρj(θ)‖²
≤ [ 2α1² ‖Γ̂ − Γ0‖² + 2α2² sup_{θ∈Θ} sup_{u∈R^k} |f̂U(u; θ) − fU(u; θ)|² ] × (1/n) Σ_{j=1}^n ‖Aj‖(‖Wj‖² + 1) = op(1).   (7.26)

Since

E sup_Θ ρ′_1(θ) A1 ρ1(θ) ≤ E‖A1‖ sup_Θ ‖ρ1(θ)‖² ≤ 2E‖A1‖(|Y1|² + ‖Y1Z1‖²) + E‖A1‖ ( ∫ sup_Θ ‖x̃ g(x; θ) fU(x − Γ0W1; θ)‖ dx )² < ∞,

by the ULLN we have

sup_Θ | (1/n) Σ_{j=1}^n ρ′_j(θ) Aj ρj(θ) − Q(θ) | = op(1),   (7.27)

where Q(θ) = E ρ1(θ)′ A1 ρ1(θ). It follows from (7.24)–(7.27) that (1/n)Q̂n(θ) →p Q(θ) uniformly in θ ∈ Θ. Analogous to the proof of Theorem 2.1, by Assumption 12 we can show that Q(θ) attains a unique minimum at θ0 ∈ Θ. Therefore θ̂n →p θ0 follows from Amemiya (1973, Lemma 3).

Furthermore, Theorem 4.1.2 follows from (7.23) and Amemiya (1973, Lemma 4). To prove Theorem 4.1.3, note that for any t ∈ R^k, since g̃(t; θ0) ≠ 0 and g̃(t; θ) is continuous in θ, there exists an open neighborhood Θt of θ0 in Θ such that inf_{θ∈Θt} |g̃(t; θ)| ≥ c > 0. Therefore, for any θ ∈ Θt,

|f̂̃U(t; θ) − f̃U(t; θ)| ≤ (1/c) ∫_{Bn} |m̂1(v) − m1(v)| dv + (1/c) ∫_{Bn^c} |m1(v)| dv,

which implies that sup_{θ∈Θt} |f̂̃U(t; θ) − f̃U(t; θ)| = op(1). The result then follows from Amemiya (1973, Lemma 4).

7.5. Proof of Theorem 4.2

By Assumption 13 the first derivative ∂Qn,S(θ)/∂θ exists and has a first-order Taylor expansion in a neighborhood of θ0. Since θ̂n,S →p θ0, for sufficiently large n we have ∂Qn,S(θ̂n,S)/∂θ = 0 and

0 = ∂Qn,S(θ0)/∂θ + (∂²Qn,S(θ̃)/∂θ ∂θ′)(θ̂n,S − θ0),   (7.28)

where θ̃ satisfies ‖θ̃ − θ0‖ ≤ ‖θ̂n,S − θ0‖. The first derivative in (7.28) is given by

∂Qn,S(θ)/∂θ = Σ_{j=1}^n [ (∂ρ̂′_{j,S}(θ)/∂θ) Aj ρ̂_{j,2S}(θ) + (∂ρ̂′_{j,2S}(θ)/∂θ) Aj ρ̂_{j,S}(θ) ],

where

∂ρ̂′_{j,S}(θ)/∂θ = −(1/S) Σ_{s=1}^S (1/h(xjs)) [ (∂g(xjs; θ)/∂θ) f̂U(xjs − Γ̂Wj; θ) + g(xjs; θ) (∂f̂U(xjs − Γ̂Wj; θ)/∂θ) ] x̃′js

and

∂f̂U(xjs − Γ̂Wj; θ)/∂θ = −(1/(2π)^k) ∫_{Cn} e^{it′(xjs − Γ̂Wj)} (m̂̃1(t)/g̃²(t; θ)) (∂g̃(t; θ)/∂θ) dt.

The second derivative in (7.28) is given by

∂²Qn,S(θ)/∂θ ∂θ′ = Σ_{j=1}^n [ (∂ρ̂′_{j,S}(θ)/∂θ) Aj (∂ρ̂_{j,2S}(θ)/∂θ′) + (ρ̂′_{j,2S}(θ) Aj ⊗ Ip) ∂vec(∂ρ̂′_{j,S}(θ)/∂θ)/∂θ′ ] + Σ_{j=1}^n [ (∂ρ̂′_{j,2S}(θ)/∂θ) Aj (∂ρ̂_{j,S}(θ)/∂θ′) + (ρ̂′_{j,S}(θ) Aj ⊗ Ip) ∂vec(∂ρ̂′_{j,2S}(θ)/∂θ)/∂θ′ ],

where ∂vec(∂ρ̂′_{j,S}(θ̃)/∂θ)/∂θ′ is given by
−(1/S) Σ_{s=1}^S (1/h(xjs)) [ x̃js ⊗ (∂²g(xjs; θ)/∂θ ∂θ′) f̂U(xjs − Γ̂Wj; θ) + x̃js ⊗ (∂g(xjs; θ)/∂θ)(∂f̂U(xjs − Γ̂Wj; θ)/∂θ′) + x̃js ⊗ (∂f̂U(xjs − Γ̂Wj; θ)/∂θ)(∂g(xjs; θ)/∂θ′) + x̃js ⊗ g(xjs; θ)(∂²f̂U(xjs − Γ̂Wj; θ)/∂θ ∂θ′) ],

and ∂²f̂U(xjs − Γ̂Wj; θ)/∂θ ∂θ′ is given by

(1/(2π)^k) ∫_{Cn} e^{it′(xjs − Γ̂Wj)} (m̂̃1(t)/g̃²(t; θ)) [ (2/g̃(t; θ)) (∂g̃(t; θ)/∂θ)(∂g̃(t; θ)/∂θ′) − ∂²g̃(t; θ)/∂θ ∂θ′ ] dt.

Since ρ̂j,S(θ) and its derivatives are continuous in θ by Assumption 13, analogous to the proof of (7.16) it can be shown that

(1/n) ∂²Qn,S(θ̃)/∂θ ∂θ′ →p E[ (∂ρ′_{j,S}(θ0)/∂θ) Aj (∂ρ_{j,2S}(θ0)/∂θ′) + (ρ′_{j,2S}(θ0) Aj ⊗ Ip) ∂vec(∂ρ′_{j,S}(θ0)/∂θ)/∂θ′ ] = E[ (∂ρ′_j(θ0)/∂θ) Aj (∂ρ_j(θ0)/∂θ′) ] = B.   (7.29)

In the following let f̄V(v) and m̄1(v) be defined similarly to f̂V(v) and m̂1(v) in (4.7) and (4.8), but with Γ̂ substituted by Γ0. Similarly, f̄U(u; θ) and ρ̄j,S(θ) are used for the same situation. Then, by Assumption 13, ∂Qn,S(θ0)/∂θ has the first-order Taylor expansion about γ0:

∂Qn,S(θ0)/∂θ = Σ_{j=1}^n [ (∂ρ̄′_{j,S}(θ0)/∂θ) Aj ρ̄_{j,2S}(θ0) + (∂ρ̄′_{j,2S}(θ0)/∂θ) Aj ρ̄_{j,S}(θ0) ] + (∂²Qn,S(θ0)/∂θ ∂γ′)(γ̂ − γ0) + op(√n),   (7.30)

where

∂²Qn,S(θ)/∂θ ∂γ′ = Σ_{j=1}^n [ (∂ρ̄′_{j,S}(θ)/∂θ) Aj (∂ρ̄_{j,2S}(θ)/∂γ′) + (ρ̄′_{j,2S}(θ) Aj ⊗ I_{p+q}) ∂vec(∂ρ̄′_{j,S}(θ)/∂θ)/∂γ′ ] + Σ_{j=1}^n [ (∂ρ̄′_{j,2S}(θ)/∂θ) Aj (∂ρ̄_{j,S}(θ)/∂γ′) + (ρ̄′_{j,S}(θ) Aj ⊗ I_{p+q}) ∂vec(∂ρ̄′_{j,2S}(θ)/∂θ)/∂γ′ ],

with

∂ρ̄_{j,S}(θ)/∂γ′ = −(1/S) Σ_{s=1}^S (x̃js g(xjs; θ)/h(xjs)) ∂f̄U(xjs − Γ0Wj; θ)/∂γ′

and

∂f̄U(xjs − Γ0Wj; θ)/∂γ′ = (1/(2π)^k) ∫_{Cn} (e^{it′(xjs − Γ0Wj)}/g̃(t; θ)) [ ∂m̄̃1(t)/∂γ′ − i m̄̃1(t)(Wj ⊗ t)′ ] dt.

Here

∂m̄̃1(t)/∂γ′ = ∫_{Bn} e^{−it′v} (∂m̄1(v)/∂γ′) dv,

∂m̄1(v)/∂γ′ = (1/(f̄V(v) n an)) Σ_{j=1}^n (m̄1(v) − Yj) (∂Ka(v − Vj)/∂v′)(Wj ⊗ Ik)′,

and

Ka(v − Vj) = (1/a_n^k) K((v − Vj)/an).

In the above we have slightly abused notation by using Vj = Γ0Wj. This will not cause confusion, since Γ̂Wj will not appear any more subsequently. Again, analogous to (7.11), by Assumptions 13 and 14 we can show that

(1/n) ∂²Qn,S(θ0)/∂θ ∂γ′ →p 2E[ (∂ρ′_j(θ0)/∂θ) Aj (∂ρ_j(θ0)/∂γ′) ].

In the following, because all functions of the parameters are evaluated at θ0, we omit the argument to further simplify notation. In addition, let ujs = xjs − Γ0Wj, gjs = g(xjs; θ0), hjs = h(xjs), and let f̄js be similarly defined. Now we express the first term of (7.30) in terms of ξ1j, ξ2j, ξ3j, ξ4j. To this end, we write

Σ_{j=1}^n (∂ρ̄′_{j,S}/∂θ) Aj ρ̄_{j,2S} = Σ_{j=1}^n (∂ρ′_{j,S}/∂θ) Aj ρ_{j,2S} + Σ_{j=1}^n (∂ρ′_{j,S}/∂θ) Aj (ρ̄_{j,2S} − ρ_{j,2S}) + Σ_{j=1}^n (∂ρ̄′_{j,S}/∂θ − ∂ρ′_{j,S}/∂θ) Aj ρ_{j,2S} + Σ_{j=1}^n (∂ρ̄′_{j,S}/∂θ − ∂ρ′_{j,S}/∂θ) Aj (ρ̄_{j,2S} − ρ_{j,2S}).   (7.31)

Then by Assumption 15 we have

ρ̄_{j,S} − ρ_{j,S} = −(1/S) Σ_{s=1}^S (x̃js gjs/hjs)(f̄js − f^c_js) + op(√n),

where

f̄js − f^c_js = (1/(2π)^k) ∫_{Cn} (e^{−it′ujs}/g̃(t; θ0)) ∫_{Bn} e^{−it′v} (m̄1(v) − m1(v)) dv dt
= (1/(n(2π)^k)) Σ_{ℓ=1}^n ∫_{Cn} (e^{−it′ujs}/g̃(t; θ0)) ∫_{Bn} (e^{−it′v}/f̄V(v)) (Yℓ − m1(v)) Ka(v − Vℓ) dv dt
= (1/(n(2π)^k)) Σ_{ℓ=1}^n ∫_{Cn} (e^{−it′ujs}/g̃(t; θ0)) ∫_{Bn} (e^{−it′v}/f̄V(v)) ηℓ dv dt + (1/(n(2π)^k)) Σ_{ℓ=1}^n ∫_{Cn} (e^{−it′ujs}/g̃(t; θ0)) ∫_{Bn} (e^{−it′v}/f̄V(v)) E(Yℓ − m1(v)) Ka(v − Vℓ) dv dt,   (7.32)
and ηℓ is defined in (4.15). Since, by (7.24), lim_{n→∞} P(inf_{Bn} |f̄V(v)| ≥ bn/2) = 1 and lim_{n→∞} P(inf_{Bn} |fV(v)| ≥ bn/2) = 1, with probability approaching one the absolute value of the second term of the last equation of (7.32) satisfies

| (1/(n(2π)^k)) Σ_{ℓ=1}^n ∫_{Cn} (e^{−it′ujs}/g̃(t; θ0)) ∫_{Bn} (e^{−it′v}/f̄V(v)) E(Yℓ − m1(v)) Ka(v − Vℓ) dv dt |
≤ (1/n) Σ_{ℓ=1}^n ∫_{Cn} |g̃(t; θ0)|^{−1} ∫_{Bn} (1/|f̄V(v)|) |E(Yℓ − m1(v)) Ka(v − Vℓ)| dv dt
≤ (4/b²n) ∫_{Cn} |g̃(t; θ0)|^{−1} dt ∫ |E(Yℓ − m1(v)) Ka(v − Vℓ)| fV(v) dv = Op(a^d_n b^{−2}_n c^{−k−1}_n) = op(n^{−1/2}),

where the last two equalities follow from Assumptions 13 and 14 and the theorem conditions. Furthermore, since Σ_ℓ ηℓ = Op(√n),

(1/n) Σ_{ℓ=1}^n ∫_{Cn} (e^{−it′ujs}/g̃(t; θ0)) ∫_{Bn} e^{−it′v} (1/f̄V(v) − 1/fV(v)) ηℓ dv dt = op(n^{−1/2})

and, similarly,

(1/n) Σ_{ℓ=1}^n ∫_{Cn} (e^{−it′ujs}/g̃(t; θ0)) ∫_{Bn^c} (e^{−it′v}/fV(v)) ηℓ dv dt = op(n^{−1/2}).

It follows that

f̄js − f^c_js = (1/(n(2π)^k)) Σ_{ℓ=1}^n ∫_{Cn} (e^{−it′ujs}/g̃(t; θ0)) ∫ (e^{−it′v}/fV(v)) ηℓ dv dt + op(n^{−1/2}).

Therefore, the second term on the right-hand side of (7.31) equals

−(1/(nS(2π)^k)) Σ_{j=1}^n (∂ρ′_{j,2S}/∂θ) Aj Σ_{s=1}^S (x̃js gjs/hjs) Σ_{ℓ=1}^n ∫_{Cn} (e^{−it′ujs}/g̃(t; θ0)) ∫ (e^{−it′v}/fV(v)) ηℓ dv dt + op(√n) = nUn + op(√n),

where Un is a so-called U-statistic of degree two. By Theorem 1 of Section 3.2.1 of Lee (1990, p. 76), nUn = Σ_{j=1}^n ξ2j + op(√n). Further, since in ρ̄_{j,2S} and ρ_{j,2S} only x̃js gjs and g̃(t; θ0) involve θ, it can be shown completely analogously that the third term on the right-hand side of (7.31) equals Σ_{j=1}^n ξ3j + op(√n). Finally, it is obvious that the fourth term is op(√n). It follows that

Σ_{j=1}^n (∂ρ̄′_{j,S}(θ0)/∂θ) Aj ρ̄_{j,2S}(θ0) = Σ_{j=1}^n (ξ1j + ξ2j + ξ3j) + op(√n),

where ξ1j, ξ2j, ξ3j are given in (4.12)–(4.14). Then we can rewrite (7.30) as

∂Qn,S(θ0)/∂θ = 2Dn,S Σ_{j=1}^n ξj + op(√n),   (7.33)

where

Dn,S = [ Ip, Ip, Ip, (1/n)(∂²Qn,S(θ0)/∂θ ∂γ′)((1/n) Σ_{j=1}^n Wj W′j ⊗ Ik)^{−1} ] →p D.

Since E[Wj ⊗ (Zj − Γ0Wj)] = 0 and E ηj(v) = 0, we have E ξj = 0. Furthermore, by the model assumptions and Assumption 13, the covariance matrix C = E ξj ξ′j is finite. It follows by the Lindeberg–Lévy CLT and the Slutsky theorem that

n^{−1/2} ∂Qn,S(θ0)/∂θ →L N(0, DCD′),

where D and C are as in Theorem 4.2. The theorem then follows from (7.28), (7.29), (7.33) and Assumption 16.

7.6. Proof of Theorem 5.2

Let FW and FX respectively denote the distribution functions of W and X. Then by (4.1) and the Fubini theorem,

EW |m1,n(Γ0W) − E(Y | W)| = ∫ |m1,n(Γ0w) − E(Y | w)| dFW(w)
= ∫ | ∫ g(Γ0w + u; θ0) 1(‖Γ0w + u‖ ≥ Tn) dFU(u) | dFW(w)
≤ ∫∫ |g(Γ0w + u; θ0)| 1(‖Γ0w + u‖ ≥ Tn) dFU(u) dFW(w)
= ∫∫ |g(x; θ0)| 1(‖x‖ ≥ Tn) dFU(x − Γ0w) dFW(w)
= ∫ |g(x; θ0)| 1(‖x‖ ≥ Tn) dFX(x),

where the last equality follows from (1.3). Since E|g(X; θ0)| < ∞, the right-hand side of the last equation tends to zero as Tn → ∞, which implies (5.5). The proof of (5.6) is analogous.
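The proofs above repeatedly use the kernel regression estimator m̄1(v) built from the kernel Ka. A minimal sketch of such a Nadaraya–Watson-type estimator (one-dimensional, Gaussian kernel; the regression function, bandwidth, and sample size are illustrative only, not the paper's design):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
V = rng.uniform(-2.0, 2.0, n)             # design points (plays the role of V_j)
Y = np.sin(V) + 0.1 * rng.normal(size=n)  # noisy responses around a toy m1(v) = sin(v)
a_n = 0.2                                 # bandwidth

def m1_hat(v):
    """Nadaraya-Watson estimate of E[Y | V = v] with a Gaussian kernel."""
    w = np.exp(-0.5 * ((v - V) / a_n) ** 2)
    return (w @ Y) / w.sum()

est = m1_hat(0.5)
```

With a bandwidth shrinking at a suitable rate, such estimates are uniformly consistent for the regression function on compact sets, which is the property the proofs exploit through bounds of the form sup |m̂1(v) − m1(v)|.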
The authors are grateful to the anonymous referees for their helpful comments and suggestions. A preliminary version of this paper was completed in Wang and Hsiao (1995) while Liqun Wang was visiting the Department of Economics, University of Southern California. He wishes to thank the department for its hospitality. This work was partially supported by National Science Foundation grants SBR 94-09540 and SBR 96-19330, the Swiss National Science Foundation, and the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

Abarin, T., Wang, L., 2006. Comparison of GMM with second-order least squares estimator in nonlinear models. Far East Journal of Theoretical Statistics 20, 179–196.
Aigner, D.J., Hsiao, C., Kapteyn, A., Wansbeek, T., 1984. Latent variable models in econometrics. In: Griliches, Z., Intriligator, M.D. (Eds.), Handbook of Econometrics, vol. II. North-Holland, Amsterdam.
Amemiya, T., 1973. Regression analysis when the dependent variable is truncated normal. Econometrica 41, 997–1016.
Amemiya, T., 1974. The nonlinear two-stage least-squares estimator. Journal of Econometrics 2, 105–110.
Amemiya, Y., 1985. Instrumental variable estimator for the nonlinear errors-in-variables model. Journal of Econometrics 28, 273–289.
Amemiya, Y., 1990. Two-stage instrumental variable estimators for the nonlinear errors-in-variables model. Journal of Econometrics 44, 311–332.
Amemiya, Y., Fuller, W.A., 1988. Estimation for the nonlinear functional relationship. Annals of Statistics 16, 147–160.
Andrews, D.W.K., 1995. Nonparametric kernel estimation for semiparametric models. Econometric Theory 11, 560–596.
Carroll, R.J., Maca, J.D., Ruppert, D., 1999. Nonparametric regression in the presence of measurement error. Biometrika 86, 541–554.
Carroll, R.J., Ruppert, D., Crainiceanu, C.M., Tosteson, T.D., Karagas, M.R., 2004. Nonlinear and nonparametric regression and instrumental variables. Journal of the American Statistical Association 99, 736–750.
Carroll, R.J., Ruppert, D., Stefanski, L.A., 1995. Measurement Error in Nonlinear Models. Chapman & Hall, London.
Cheng, C., Schneeweiss, H., 1998. Polynomial regression with errors in the variables. Journal of the Royal Statistical Society B60, 189–199.
Delaigle, A., 2007. Nonparametric density estimation from data with a mixture of Berkson and classical errors. Canadian Journal of Statistics 35, 89–104.
Fan, J., Truong, Y.K., 1993. Nonparametric regression with errors in variables. Annals of Statistics 21, 1900–1925.
Fuller, W.A., 1987. Measurement Error Models. Wiley, New York.
Gallant, A.R., 1987. Nonlinear Statistical Models. Wiley, New York.
Hansen, L.P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.
Hausman, J., Newey, W., Ichimura, H., Powell, J., 1991. Identification and estimation of polynomial errors-in-variables models. Journal of Econometrics 50, 273–295.
Hausman, J., Newey, W., Powell, J., 1995. Nonlinear errors in variables: estimation of some Engel curves. Journal of Econometrics 65, 205–233.
Hsiao, C., 1983. Identification. In: Griliches, Z., Intriligator, M.D. (Eds.), Handbook of Econometrics, vol. I. North-Holland, Amsterdam.
Hsiao, C., 1989. Consistent estimation for some nonlinear errors-in-variables models. Journal of Econometrics 41, 159–185.
Hsiao, C., 1992. Nonlinear latent variable models. In: Matyas, L., Sevestre, P. (Eds.), The Econometrics of Panel Data. pp. 242–261.
Huang, S., Huwang, L., 2001. On the polynomial structural relationship. The Canadian Journal of Statistics 29, 495–512.
Jennrich, R.I., 1969. Asymptotic properties of non-linear least squares estimators. Annals of Mathematical Statistics 40, 633–643.
Lee, A.J., 1990. U-Statistics: Theory and Practice. Marcel Dekker, New York.
Li, T., 2002. Robust and consistent estimation of nonlinear errors-in-variables models. Journal of Econometrics 110, 1–26.
Lukacs, E., 1970. Characteristic Functions. Hafner Publishing Company, New York.
Magnus, J.R., Neudecker, H., 1988. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, New York.
Newey, W.K., 2001. Flexible simulated moment estimation of nonlinear errors-in-variables models. Review of Economics and Statistics 83, 616–627.
Robinson, P.M., 1988. Root-N-consistent semiparametric regression. Econometrica 56, 931–954.
Schennach, S.M., 2004. Estimation of nonlinear models with measurement error. Econometrica 72, 33–75.
Schennach, S.M., 2007. Instrumental variable estimation of nonlinear errors-in-variables models. Econometrica 75, 201–239.
Stefanski, L.A., Carroll, R.J., 1985. Covariate measurement error in logistic regression. Annals of Statistics 13, 1335–1351.
Taupin, M., 2001. Semi-parametric estimation in the nonlinear structural errors-in-variables model. Annals of Statistics 29, 66–93.
Walker, J.S., 1988. Fourier Analysis. Oxford University Press, New York.
Wang, L., 1998. Estimation of censored linear errors-in-variables models. Journal of Econometrics 84, 383–400.
Wang, L., 2002. A simple adjustment for measurement errors in some limited dependent variable models. Statistics & Probability Letters 58, 427–433.
Wang, L., 2004. Estimation of nonlinear models with Berkson measurement errors. Annals of Statistics 32, 2559–2579.
Wang, L., Hsiao, C., 1995. A simulated semiparametric estimation of nonlinear errors-in-variables models. Working Paper. Department of Economics, University of Southern California.
Wang, L., Hsiao, C., 2007. Two-stage estimation of limited dependent variable models with errors-in-variables. Econometrics Journal 10, 426–438.
Wang, L., Hsiao, C., 2008. Method of moments estimation and identifiability of semiparametric nonlinear errors-in-variables models. Technical Report. Department of Statistics, University of Manitoba.
Weiss, A.A., 1993. Some aspects of measurement error in a censored regression model. Journal of Econometrics 56, 169–188.
Wolter, K.M., Fuller, W.A., 1982. Estimation of nonlinear errors-in-variables models. Annals of Statistics 10, 539–548.
Zeidler, E., 1986. Nonlinear Functional Analysis and its Applications I. Springer-Verlag, New York.
Journal of Econometrics 165 (2011) 45–57
Properties of the CUE estimator and a modification with moments

Jerry Hausman (a,∗), Randall Lewis (c), Konrad Menzel (b), Whitney Newey (a)

a MIT, Department of Economics, United States
b New York University, Department of Economics, United States
c Yahoo! Labs, United States
Article history: Available online 29 June 2011
Abstract

In this paper, we analyze properties of the Continuous Updating Estimator (CUE) proposed by Hansen et al. (1996), which has been suggested as a solution to the finite sample bias problems of the two-step GMM estimator. We show that the estimator should be expected to perform poorly in finite samples under weak identification; in particular, the estimator is not guaranteed to have finite moments of any order. We propose the Regularized CUE (RCUE) as a solution to this problem. The RCUE solves a modification of the first-order conditions for the CUE estimator and is shown to be asymptotically equivalent to CUE under many weak moment asymptotics. Our theoretical findings are confirmed by extensive Monte Carlo studies. © 2011 Published by Elsevier B.V.
1. Introduction 2SLS is by far the most-used estimator for the simultaneous equation problem. However, it is now well-recognized that 2SLS can exhibit substantial finite sample (second-order) bias when the model is over-identified and the first stage partial r-squared is low. The initial recommendation to solve this problem was to do LIML, e.g. Bekker (1994) or Staiger and Stock (1997). However, Hahn, Hausman, and Kuersteiner (HHK, 2004) demonstrated that the ‘‘moment problem’’ of LIML led to undesirable estimates in this situation. Morimune (1983) analyzed both the bias in 2SLS and the lack of moments in LIML. While it was long known that LIML did not have finite sample moments, it was less known that this lack of moments led to the undesirable property of considerable dispersion in the estimates, e.g. the interquartile range was much larger than 2SLS. HHK developed a jackknife 2SLS (J2SLS) estimator that attenuated the 2SLS bias problem and had good dispersion properties. They found in their empirical results that the J2SLS estimator or the Fuller estimator, which modifies LIML to have moments, did well on both the bias and dispersion criteria. Since the Fuller estimator had smaller second-order MSE, HHK recommended using the Fuller estimator. However, Bekker and van der Ploeg (2005) and Hausman, Newey and Woutersen (HNW, 2005) recognized that both Fuller
∗ Corresponding author.
E-mail addresses: [email protected] (J. Hausman), [email protected], [email protected] (R. Lewis), [email protected] (K. Menzel), [email protected] (W. Newey).
0304-4076/$ – see front matter © 2011 Published by Elsevier B.V. doi:10.1016/j.jeconom.2011.05.005
and LIML are inconsistent with heteroscedasticity as the number of instruments becomes large in many-instrument sequences (Bekker, 1994; Chao and Swanson, 2005). Since econometricians recognize that heteroscedasticity is often present, this finding presents a problem. Hausman, Newey, Woutersen, Chao and Swanson (HNWCS, 2007) solved this problem by proposing jackknife LIML (HLIML) and jackknife Fuller (HFull) estimators that are consistent in the presence of heteroscedasticity. Newey and Windmeijer (2009a) showed that the continuous updating estimator (CUE), which is the GMM-like generalization of LIML from Hansen et al. (1996), is efficient relative to jackknife instrumental variables with many weak instruments. LIML does not have moments, so HNWCS (2007) recommend using HFull, which does have moments. A problem is that if serial correlation or clustering exists, neither HLIML, nor HFull, nor the CUE with a heteroskedasticity-consistent weighting matrix is consistent. The CUE solves this problem if the weighting matrix is made robust to serial correlation or clustering. The CUE also allows treatment of nonlinear specifications, which the above estimators do not allow for. However, CUE suffers from the moment problem and exhibits wide dispersion.1 GMM does not suffer from the no-moments problem, but like 2SLS, GMM has finite sample bias that grows with the number of moments. In this paper we modify the first-order conditions solved by the CUE by adding a term of order T^{-1} in order to address the
1 While a formal proof of the absence of moments of CUE does not exist, empirical investigations, e.g. Guggenberger (2005), demonstrate that CUE likely has no moments. If homoskedasticity is imposed on the weight matrix, CUE is LIML, which is known to have no moments.
J. Hausman et al. / Journal of Econometrics 165 (2011) 45–57
no-moments/large dispersion problem. To first order the variance of the estimator is the same as GMM or CUE, so no large sample efficiency is lost. The resulting estimator demonstrates considerably reduced bias relative to GMM and reduced dispersion relative to CUE. Thus, we expect the new estimator will be useful for empirical research. We next consider a similar approach but use a class of functions which permits us to specify an estimator with all integral moments existing. Lastly, we demonstrate how this approach can be extended to the entire family of Generalized Empirical Likelihood (GEL) estimators. In the next section we specify a general nonlinear model of the type where GMM is often used. We then consider the finite sample bias of GMM. We discuss how CUE partly removes the bias but evidence demonstrates the presence of the no-moments/large dispersion problem. We next consider LIML and associated estimators under large instrument (moment) asymptotics and discuss why LIML, Fuller and HLIM and HFull will be inconsistent when time series or spatial correlation is present. We then give a specific modification to CUE and specify the ‘‘regularized’’ CUE estimator, RCUE, and we prove that a version of the estimator has moments in the case of linear moment restrictions. Lastly, we investigate our new estimators empirically. We find that they do not have either the bias arising from a large number of instruments (moments) or the no moment problem of CUE and LIML and perform well when heteroscedasticity and serial correlation are present. Thus, while further research is required to determine the optimal parameters of the modifications we present, the current estimators should allow improved empirical results in applied research.
2. Setup and motivation

Consider a sample of T strictly stationary observations of random vectors W1, . . . , WT. The K-dimensional parameter of interest β0 is defined by an M-dimensional vector of unconditional moment restrictions

E[gt(β0)] = 0,

for the moment functions gt(β) = g(Wt, β). Denote

ĝ(β) = (1/T) Σ_{t=1}^T gt(β),   ḡ(β) = E[gt(β)].

Also, let

ξT(β) = ĝ(β) − ḡ(β),   ΩT(β) = T E[ξT(β) ξT(β)′] = T · Var(ĝ(β)),

and let G^k_t(β) denote the kth column of the Jacobian of the moment functions, Gt(β) := ∂gt(β)/∂β. Define the sequence µT as the smallest eigenvalue of T Ḡ(β0)′ Ω_T^{−1} Ḡ(β0), where Ḡ(β) = E[Gt(β)] and ΩT = ΩT(β0). In the linear model with one regressor, this sequence has the same rate as the concentration parameter. Also, denote by ν^k_t = G^k_t(β0) − Ḡ^k(β0) the kth column of the demeaned Jacobian. In the following discussion, the time subscripts will be omitted for notational ease wherever possible.

The two-step GMM estimator minimizes the criterion function

Q_T^GMM(β) = ĝ(β)′ Ω̂^{−1} ĝ(β),

where Ω̂ is an estimate of ΩT. Similarly to Han and Phillips (2006) and Newey and Windmeijer (2009a), we can explain the bias in the two-step GMM estimator by calculating the expectation of the objective function when Ω̂ is replaced by ΩT:

E[ĝ(β)′ Ω_T^{−1} ĝ(β)] = ḡ(β)′ Ω_T^{−1} ḡ(β) + 2E[ḡ(β)′ Ω_T^{−1} ξT(β)] + E[ξT(β)′ Ω_T^{−1} ξT(β)]
= ḡ(β)′ Ω_T^{−1} ḡ(β) + (1/T) tr(Ω_T^{−1} ΩT(β)) + oP(1/T).

The first term is minimized at β0, but the second ‘‘noise’’ term will in general have a nonzero derivative at β0, and therefore the objective function is not minimized at the truth. This problem leads to asymptotic bias that increases with the number of moments, as in Newey and Smith (2004), and to inconsistency of two-step GMM with many weak moments, as shown by Han and Phillips (2006).

The Continuous Updating Estimator (CUE) proposed by Hansen et al. (1996) replaces Ω̂ with a consistent estimator Ω̂T(β) of ΩT(β), giving the objective function

Q_T^CUE(β) = ĝ(β)′ Ω̂(β)^{−1} ĝ(β).

Following the same steps as before, the expectation of the infeasible objective function using the variance ΩT(β) as weighting matrix is given by

E[ĝ(β)′ ΩT(β)^{−1} ĝ(β)] = ḡ(β)′ ΩT(β)^{−1} ḡ(β) + (1/T) tr(ΩT(β)^{−1} ΩT(β)) = ḡ(β)′ ΩT(β)^{−1} ḡ(β) + M/T.

Here the second term does not depend on β, eliminating that source of bias. Elimination of the noise term, and hence bias reduction, depends crucially on using the correct weighting matrix ΩT(β)^{−1}. For example, consider linear IV regression with moment restrictions of the form gt(β) = z′_t(yt − xtβ) and the homoskedasticity-consistent variance estimate

Ω̂(β) = (y − Xβ)′(y − Xβ) Z′Z/T²,

where Z = [Z1, . . . , ZT]′, X = [X1, . . . , XT]′, and y = (y1, . . . , yT)′. Here the CUE objective function is

Q_T^LIML(β) = (y − Xβ)′ Z(Z′Z)^{−1} Z′(y − Xβ) / [(y − Xβ)′(y − Xβ)],

and the estimator that minimizes this function is LIML. The LIML estimator is known to be consistent under many and many weak instruments asymptotics with homoskedastic and not autocorrelated disturbances (Bekker, 1994; Stock and Yogo, 2005). However, LIML loses this property if errors are heteroskedastic (Bekker and van der Ploeg, 2005; Hausman et al., 2007) or autocorrelated. This problem occurs because the noise term in the objective function no longer disappears when Ω̂(β) does not estimate T · Var(ĝ(β)). In particular, in the presence of clustering, LIML will typically not be consistent under many or many weak instruments. Other estimators that also rely on independence, like jackknife instrumental variables and the HLIM estimator of Hausman et al. (2007), will also not be consistent under clustering. One can eliminate the bias term by choosing Ω̂(β) to be robust to heteroskedasticity, clustering, or cross-section dependence. A general autocorrelation and heteroskedasticity consistent version of Ω̂(β) would eliminate the bias term in general settings, as previously shown by Donald and Newey (2000).

There is a finite sample analog to this calculation. In finite samples the objective function Q_T^CUE(β) may have a jackknife form, which also explains why the CUE has reduced bias. This was shown for i.i.d. data by Newey and Windmeijer (2008). Here we show it for certain kinds of dependence. Suppose that gt(β0) and gs(β0) are correlated for all (s, t) ∈ S, where S is a set of index pairs. For example, for τ-dependent data S = {(s, t) : |t − s| ≤ τ}, and for clustered data S = {(s, t) : s ∈ c(t)}, where c(t) is the set of observations that are clustered with t. Under appropriate conditions a consistent estimator of the variance of ĝ(β) is

Ω̂(β) = (1/T) Σ_{(s,t)∈S} g̃s(β) g̃t(β)′.
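The objects in this section are easy to compute directly. A small sketch contrasting the homoskedastic-weighting CUE objective Q_T^LIML (the variance ratio above) with a heteroskedasticity-robust CUE objective, on simulated data with no endogeneity (all names and parameter values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 200, 4
Z = rng.normal(size=(T, M))                        # instruments
x = Z @ np.array([0.5, 0.3, 0.0, 0.0]) + rng.normal(size=T)
y = 1.0 * x + rng.normal(size=T)                   # true beta = 1

def q_liml(b):
    # (y-Xb)'Z(Z'Z)^{-1}Z'(y-Xb) / (y-Xb)'(y-Xb): CUE with homoskedastic weight
    e = y - x * b
    Pe = Z @ np.linalg.solve(Z.T @ Z, Z.T @ e)
    return (e @ Pe) / (e @ e)

def q_cue(b):
    # ghat' Omega(b)^{-1} ghat with a heteroskedasticity-robust Omega(b)
    e = y - x * b
    g = Z * e[:, None]                             # T x M moment contributions
    return g.mean(0) @ np.linalg.solve((g.T @ g) / T, g.mean(0))

grid = np.linspace(0.0, 2.0, 201)
b_liml = grid[np.argmin([q_liml(b) for b in grid])]
```

q_liml is a ratio of quadratic forms and lies in [0, 1]; minimizing it over beta reproduces LIML, while q_cue changes only the weighting so as to remain valid under heteroskedasticity.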
J. Hausman et al. / Journal of Econometrics 165 (2011) 45–57
the linear model: Assume that the model is linear with gt (β) = ∑ ˆ (β) = Tt=1 g˜t (β)˜gt (β)′ /T . Then let Zt (yt − Xt′ β) and Ω
Then it follows that
ˆ (β)−1 gˆ (β) = gˆ (β)′ Ω
1 − T 2 (s,t )̸∈S
+ =
ˆ (β)−1 gt (β) gs (β)′ Ω
1 − T 2 (s,t )∈S
1 − T 2 (s,t )̸∈S
47
¯ −1 G¯ + α I )−1 G¯ ′ Ω ¯ −1 g˜ (0). β¯ = −(G¯ ′ Ω
ˆ (β)−1 gt (β) gs (β)′ Ω
ˆ (β)−1 gt (β) + gs (β)′ Ω
(2)
We can now give conditions for the existence of finite sample moments for this estimator: M T
Proposition 1. Suppose the moment functions gi (β) are linear in β . Then for any positive α , γ , the estimator defined in (2) has moments equal to the number of moments of the product ‖gi (0)‖‖Gi (β)‖.
.
Here the CUE objective function has a jackknife form, that eliminates cross products that are dependent on each other. Thus, ˆ (β) were replaced by its expectation, the expectation of the if Ω CUE objective function would be minimized at the truth. Therefore, ˆ (β), the bias when the dependence of the data is accounted for in Ω of CUE should be smaller than the two-step GMM estimator. 3. A modification of CUE with finite moments It has been shown that the finite sample distribution of LIML does not have integral moments, Fuller (1977), and we would expect CUE to inherit these properties for most settings, especially if identification is weak, see Guggenberger (2005). To describe a modification of the CUE that would solve this moment problem, we use a transformation of the observations similarly to Kitamura and Stutzer (1997) to allow for autocorrelation or clustering. For a T × T matrix A let g˜t (β) := e′t A(g1 (β), . . . , gT (β))′ .
4. Large sample properties In this section we derive large sample properties of the RCUE estimator defined in Eq. (1), and show that the proposed estimator is asymptotically equivalent to the CUE estimator under many weak moment asymptotics. More specifically, we will consider the distribution of the estimator when the number of M grows to infinity with sample size, and some eigenvalues of the matrix G′ Ω −1 G shrink to zero as specified in Assumption 1 below. Many weak moment asymptotics are motivated by theoretical and Monte Carlo results that suggest an improved approximation to finite sample distributions of moment-based estimators, see e.g. Hansen et al. (2008) and references therein. For our formal results, we will assume that W1 , . . . , WT are i.i.d., which will allow us to make use of the asymptotic results under many weak moment sequences derived in Newey and Windmeijer (2009a,b), henceforth NW. For the derivation of the asymptotic distribution of the RCUE, we will also denote Utk = νtk − E[Gkt gt′ ]Ω (β0 )−1 gt , Ut = [Ut1 , . . . , UtK ], and Λ = limT →∞ µ1 E[Ut′−1 Ut ] T
1 (1{|s − t | < τ })s,t ≤T Possible choices for A are A := T −τ which corresponds to a matrix with the Bartlett kernel function used for the Newey–West variance estimator. In the presence of one-way clustering, the robust variance matrix can be obtained
using the transformation A :=
{s ∈ c (t )}
1 1 nc ( t )
s,t ≤T
, a matrix
of indicator variables that are equal to one divided by the size of the cluster if the observations s and t belong to the same cluster ∑T ˜ (β) = ˜t (β) and Ω c (t ), and zero otherwise. For g˜ (β) = T1 t =1 g 1 T
∑T
t =1
g˜t (β)˜gt (β)′ the CUE is
˜ (β)−1 g˜ (β). βˆ = arg min g˜ (β)′ Ω The estimator proposed in this paper modifies the first-order condition of the CUE estimator in order to reduce its dispersion in finite samples. To describe this estimator let α and γ be positive constants satisfying properties to be specified below and let
ˆ + γ I, ¯ =Ω ˆ (β) Ω ˜ = ∂ g˜ (β)/∂β, ˆ G and T ∂ − ¯= 1 ˆ ′Ω ˆ ˆ ¯ −1 g˜t (β) G 1 − g˜ (β) g˜t (β). T t =1 ∂β
p
M
µ2T
is
bounded, and (ii) µT G Ω G → H, where the eigenvalues of H are bounded from below by λH > 0. −2 ′
−1
Note that it is straightforward to weaken this assumption by allowing the degree of identification for different components of the parameter to vary, but we will stick to this more restrictive assumption for expositional simplicity. The following assumption gives the conditions on J (β) = JT (β) in the definition of the RCUE in Eq. (1), where the subscript indicates that we allow the regularization term to be a function of the sample. p
Assumption 2. (i) JT (β) → J0 (β) uniformly in β , where J0 (β) is a K -dimensional function such that the smallest eigenvalue min eig(∇β J0 (β)) > C a.s. (ii) The expectations PT = E[JT (β0 ) JT (β0 )′ ] → P0 and QT = E[gt JT (β0 )′ ] → Q0 , where the norms of P0 and Q0 are bounded by a constant. For the validity of the asymptotic approximation, we will have to impose relatively standard regularity conditions on the moment functions and their derivatives, which will not be stated in full here. Instead, we refer the reader to NW.
Let J (β) be a vector-valued function of β of dimension K such that all eigenvalues of the Jacobian ∇β J (β) are bounded from below uniformly in β . A default choice for this function is the identity J (β) = β . Then the new estimator βˆ RCUE solves
¯ ′Ω ¯ −1 g˜ (βˆ RCUE ) + α J (βˆ RCUE ) = 0. G
Assumption 1. There is a sequence µT → ∞ such that (i)
(1)
The Monte Carlo experiments reported below strongly suggest that this estimator solves the moment problem of the CUE, though we do not prove this here. Also, for α = γ = 0, these first-order conditions are the same as those solved by the CUE. To motivate this modification, we will now give a version of the estimator in (1) with J (β) = β that has finite moments for
Assumption 3. Assumptions 2, 3 and 6–8 from NW hold. Assumption 2 in NW is a global identification condition, Assumption 3 imposes boundedness and equicontinuity on the second moments of gt (β), uniform consistency of estimators for the variances of gt (β0 ) and Gt (β0 ), and restricts how the Jacobians vary in the parameter around β0 . Furthermore, assumption 6 (not needed for consistency) imposes the restriction E[‖gt (β0 )‖4 ] + E[‖Gt (β0 )‖4 ] MT → 0 on the growth rate of the number of moments. 2 Proposition 2. Suppose µ− T T α → 0 and γ → 0. Then under Assumptions 1–3, the RCUE estimator is consistent, p βˆ RCUE − β0 → 0
J. Hausman et al. / Journal of Econometrics 165 (2011) 45–57
This can be shown using any general consistency result for Z-estimators (e.g. Theorem 5.9 in van der Vaart (1998)), noting that the CUE estimator is consistent by Theorem 2 in NW, and that the rank of the derivative of T times the first summand in the first-order condition (1) defining the RCUE grows at rate µT, whereas T times the second summand remains bounded. Uniform convergence of the first-order condition follows from an argument similar to that in the proof of Theorem 2 in NW.

Proposition 3. Suppose α T µT⁻¹ → 0 and µT γ → 0. Then under the assumptions of Proposition 2, RCUE and CUE are asymptotically equivalent, µT (β̂RCUE − β̂CUE) →p 0.
From this and Theorem 1 in NW, it follows that RCUE is consistent and has the same asymptotic distribution as CUE in their Theorem 3 under the many weak moment sequence.

Corollary 1. Under the assumptions of Proposition 3, the RCUE estimator converges in distribution,

µT (β̂RCUE − β0) →d N(0, H⁻¹ΛH⁻¹).

In order to account for the effect of the regularization in the knife-edge case T α → a > 0 and T γ → c > 0, we also give the asymptotic distribution of the RCUE as a function of the limits (a, c) in the Appendices. Note that our asymptotic results were obtained assuming i.i.d. observations. However, the proofs of Lemmas A.10 and A.11 in Newey and Windmeijer (2009a) continue to hold with slight modifications under clustering if the number of clusters increases with sample size, so that the previous results can be extended to that case. The derivation of the distribution of CUE becomes very complicated in the presence of general autocorrelation, and we leave it for future research. We conjecture that under suitable stationarity assumptions (e.g. strong mixing) the result will essentially continue to hold, except that for inference the variances of the moments and the Jacobians would have to be replaced with HAC estimates.

5. Monte Carlo study

We performed a simulation study to investigate the properties of the proposed estimators, ‘‘Regularized’’ CUE (RCUEGMM), defined by the first-order conditions in Eq. (5), and RCUELS (specified in Eq. (6)), and compare their performance with GMM, LIML, and CUE. In particular, we find that RCUE fixes the no-moments problem present in CUE while significantly reducing the over-identification bias of GMM in both the linear and nonlinear cases.

5.1. Baseline design: linear model without autocorrelation

For the simulation study we followed the heteroskedastic design of HNWCS (2007) in order to use empirically relevant parameter choices for the severity of the endogeneity and heteroskedasticity. We then introduced autocorrelation into the model in the form of groupwise clustering and further generalized the design to allow for nonlinear models. The baseline design is as follows:

yi = xi β + εi  (3)
xi = z1i π + νi  (4)
εi = ρνi + √(1 − ρ²) φ θ1i + √(1 − φ²) θ2i

with θ1i ∼ N(0, z1i²) and νi, θ2i ∼ i.i.d. N(0, 1), in order to allow for heteroskedasticity. The estimators we chose to compare for the linear model simulations are 2-step GMM, LIML, Fuller (c = 1), CUE, and the two RCUE estimators. For RCUE, we found c = 1 to perform well in the simulations, although the choice of the optimal constant will be a topic of future research. We also report the results for c = 4 for illustrative purposes. These values of c are the most widely used for the Fuller estimator, since they lead to a mean-unbiased or a minimum-MSE estimator, respectively. In the nonlinear model, we restricted our attention to GMM, CUE, and the two RCUE estimators with c = 1 and c = 4.

In Table 1, we compare the median bias of the several estimators. All of the estimators have biases which increase with the degree of over-identification, but the bias of GMM is much more pronounced, as theory predicts, even approaching the OLS bias (unreported) for small values of the concentration parameter µ² (more weakly identified models) and higher degrees of over-identification, M − K. On the other hand, the bias of CUE and of LIML (under homoskedasticity) is much smaller, though also increasing in M − K and decreasing in µ². The RCUE estimators perform similarly to CUE, although they do exhibit a slightly larger median bias.

In Tables 2 and 3, we see how it is possible for the RCUE estimators to restore the moments of CUE and lead to lower dispersion of the estimates. In Table 2, for each set of model parameters, the interquartile range for CUE is greater than for the RCUE estimators. In Table 3, the result is qualitatively similar, except that the difference is much greater, particularly for lower µ² and higher M − K. Thus, these results favor the use of RCUE instead of CUE. In Table 4, we see that GMM's mean bias is increasing in the degree of over-identification, M − K, and is in all cases greater than the bias for the RCUE estimators.
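The baseline design (3)–(4) with the heteroskedastic error above can be sketched as follows; the values of φ, the first-stage coefficient, and the instrument layout are our own illustrative assumptions rather than the exact HNWCS (2007) calibration.

```python
import numpy as np

# Sketch of the baseline design (3)-(4): theta_1i ~ N(0, z_1i^2) makes the
# error heteroskedastic; nu_i and theta_2i are standard normal. Parameter
# values here are illustrative assumptions, not the HNWCS calibration.

def simulate_baseline(T=400, M=5, beta=1.0, rho=0.3, phi=0.5, pi1=0.3, seed=0):
    rng = np.random.default_rng(seed)
    z1 = rng.normal(size=T)
    Z = np.column_stack([z1] + [rng.normal(size=T) for _ in range(M - 1)])
    nu = rng.normal(size=T)
    theta1 = np.abs(z1) * rng.normal(size=T)         # Var(theta1 | z1) = z1^2
    theta2 = rng.normal(size=T)
    eps = (rho * nu + np.sqrt(1 - rho**2) * phi * theta1
           + np.sqrt(1 - phi**2) * theta2)
    x = pi1 * z1 + nu                                # Eq. (4)
    y = beta * x + eps                               # Eq. (3)
    return y, x, Z

y, x, Z = simulate_baseline()
```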
In addition, the mean bias for the RCUE estimators is well defined and stable for all simulation parameters, while the bias of both CUE and LIML demonstrates the instability of their first moment, which arises from the ‘‘moment problem’’ of non-existence of moments for these estimators. The simulated variance and MSE of the RCUE estimators in Tables 5 and 6 further support the claim that RCUE adjusts CUE in such a way as to restore its moments and decrease estimator dispersion. Further, the first two moments of the RCUE estimators are comparable to those of the HFUL estimator in HNWCS (2007) (unreported). Other than the apparent inconsistency of LIML, as evidenced by its erratic performance, no other substantive differences in relative performance emerge when testing the estimators under heteroskedasticity.

5.2. Linear model with clustering

The next set of simulations was generated by drawing G clusters based on the design in Eqs. (3) and (4), where the disturbances (εi, νi) are generated as
νgi = γν vg + √(1 − γν²) vgi,  vg, vgi ∼ i.i.d. N(0, 1)
εgi = ρνgi + √(1 − ρ²) θgi,  θgi ∼ i.i.d. N(0, 1)

for a cluster correlation coefficient γν = 0, 0.5, g = 1, …, G, and i = 1, …, ng, where ng is the number of observations in the gth cluster. We also allow for clustering in the instrumental variables,

Zgi = γz ζg + √(1 − γz²) ζgi

choosing parameter values γz = 0, 0.5. Once we allow for clustering in the first stage with G = 20, 50 clusters given the total sample size of T = 400 in Tables 7 and 10, we first note the large median and mean biases arising
Table 1
Linear model under homoskedasticity and heteroskedasticity — median bias; ρ = 0.3, T = 400.
[Columns: R²ε|z1², CP, M, RCUE1, RCUE4, RCUE21, RCUE24, GMM, CUE, LIML, FULL1; numerical entries not recoverable from the source.]
Simulation results based on 6250 replications.
Table 2
Linear model under homoskedasticity and heteroskedasticity — interquartile range; ρ = 0.3, T = 400.
[Columns: R²ε|z1², CP, M, RCUE1, RCUE4, RCUE21, RCUE24, GMM, CUE, LIML, FULL1; numerical entries not recoverable from the source.]
Simulation results based on 6250 replications.
from many moments persist in GMM, even after correcting for clustering. Also, LIML, which is typically suggested as a solution to the many moments problem, has substantial bias under various conditions. On the other hand, in terms of median bias in Table 7, CUE performs quite well, with much less bias than GMM, although it also exhibits a bias which increases in the degree of over-identification. The RCUE estimators have larger bias, but it still remains much smaller than that of GMM. The same results arise in Tables 8 and 9 in terms of the quantiles. Quantiles of the RCUE estimators are smaller than quantiles of CUE in all cases, and Tables 10–12 provide preliminary evidence for the existence of the first two moments under autocorrelation. There we find that the mean, variance, and MSE exist for the RCUE estimators, but not for LIML or CUE, whose performance further suggests the moment problem. We also tested HLIM and HFUL from HNWCS (2007) as well as CUE with a heteroskedasticity-consistent covariance matrix, but all three estimators performed similarly to LIML and FULL1 in demonstrating sizable bias.
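The groupwise-clustered disturbance construction of Section 5.2 can be sketched directly; the function name and default sizes below are our own illustrative choices.

```python
import numpy as np

# Sketch of the clustered disturbance nu_gi = gamma_nu * v_g
# + sqrt(1 - gamma_nu^2) * v_gi; names and defaults are illustrative.

def clustered_nu(G=20, n_g=20, gamma_nu=0.5, seed=0):
    rng = np.random.default_rng(seed)
    v_g = rng.normal(size=G)                         # common cluster shock
    v_gi = rng.normal(size=(G, n_g))                 # idiosyncratic shocks
    nu = gamma_nu * v_g[:, None] + np.sqrt(1 - gamma_nu**2) * v_gi
    return nu.ravel()                                # stacked by cluster
```

Each νgi has unit variance, and two observations in the same cluster have correlation γν² (0.25 at γν = 0.5), which is what the clustered designs above exploit.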
Table 3
Linear model under homoskedasticity and heteroskedasticity — nine decile range; ρ = 0.3, T = 400.
[Columns: R²ε|z1², CP, M, RCUE1, RCUE4, RCUE21, RCUE24, GMM, CUE, LIML, FULL1; numerical entries not recoverable from the source.]
Simulation results based on 6250 replications.

Table 4
Linear model under homoskedasticity and heteroskedasticity — mean bias; ρ = 0.3, T = 400.
[Columns: R²ε|z1², CP, M, RCUE1, RCUE4, RCUE21, RCUE24, GMM, CUE, LIML, FULL1; numerical entries not recoverable from the source, though the CUE and LIML columns contain entries on the order of 10¹¹, reflecting the non-existence of moments.]
Simulation results based on 6250 replications.
5.3. Nonlinear specification

For the Monte Carlo study in Tables 13–18, we changed the specification of the second stage of the baseline model to

yi = exp{xi β} + εi
xi = z1i π + νi

which has a structure similar to the CCAPM specification in Stock and Wright (2000). The results resemble those in the linear case. In Tables 13 and 16, GMM shows substantial median and mean bias increasing
with the degree of over-identification, while CUE and the RCUE estimators exhibit much less median bias in all cases. In addition, in Tables 16 and 17, the first two moments of the RCUE estimators exist in all cases, while they do not appear to exist for CUE.² Thus, the nonlinear simulation results suggest the wide array of problems to which the RCUE estimators
² The RCUE2 estimator appears to behave contrary to what the theory would predict, as evidenced by its erratic variances in Table 17. The choice of the optimal c will be investigated further to test this estimator.
Table 5
Linear model under homoskedasticity and heteroskedasticity — variance of estimates; ρ = 0.3, T = 400.
[Columns: R²ε|z1², CP, M, RCUE1, RCUE4, RCUE21, RCUE24, GMM, CUE, LIML, FULL1; numerical entries not recoverable from the source, though the CUE and LIML columns contain entries on the order of 10²⁶, reflecting the non-existence of moments.]
Simulation results based on 6250 replications.
Table 6
Linear model under homoskedasticity and heteroskedasticity — mean square error; ρ = 0.3, T = 400.
[Columns: R²ε|z1², CP, M, RCUE1, RCUE4, RCUE21, RCUE24, GMM, CUE, LIML, FULL1; numerical entries not recoverable from the source, though the CUE and LIML columns contain entries on the order of 10²⁶, reflecting the non-existence of moments.]
Simulation results based on 6250 replications.
can be applied, as they are capable of removing the bias present in GMM under quite general conditions while decreasing the dispersion of the CUE estimator to permit the existence of moments.

6. Extensions

In this section, we will propose a way of generalizing our approach to the class of Generalized Empirical Likelihood (GEL) estimators. As shown by Newey and Smith (2004), the CUE is also the solution to the dual problem

max_{π,β} ∑_{t=1}^{T} ϱ(πt)  s.t.  ∑_{t=1}^{T} πt g̃t(β) = 0, and ∑_{t=1}^{T} πt = 1  (5)

where

ϱ(v) = −(v − 1)²/2.
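For a fixed β, the dual problem (5) is a quadratic program in the πt and can be checked numerically. The sketch below is our own illustration with a random stand-in for the moment contributions g̃t(β), using a generic constrained optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Numerical sketch of the dual problem (5) for a *fixed* beta: maximize
# sum_t rho(pi_t), rho(v) = -(v - 1)^2 / 2, subject to sum_t pi_t g_t = 0
# and sum_t pi_t = 1. The draws g are an illustrative stand-in for g~_t(beta).

rng = np.random.default_rng(0)
T, M = 50, 2
g = rng.normal(size=(T, M))

def neg_objective(pi):                       # minimize -sum_t rho(pi_t)
    return 0.5 * np.sum((pi - 1.0) ** 2)

constraints = [
    {"type": "eq", "fun": lambda pi: pi @ g},           # moment conditions
    {"type": "eq", "fun": lambda pi: pi.sum() - 1.0},   # weights sum to one
]
res = minimize(neg_objective, np.full(T, 1.0 / T),
               method="SLSQP", constraints=constraints)
pi_hat = res.x
```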
Table 7
Linear model under groupwise clustering — median bias; ρ = 0.3, T = 400, γν = 0.5.
[Columns: CP, M, Z clustering, Clusters, RCUE1, RCUE4, RCUE21, RCUE24, GMM, CUE, LIML, FULL1; numerical entries not recoverable from the source.]
Simulation results based on 12,000 replications.

Table 8
Linear model under groupwise clustering — interquartile range; ρ = 0.3, T = 400, γν = 0.5.
[Columns: CP, M, Z clustering, Clusters, RCUE1, RCUE4, RCUE21, RCUE24, GMM, CUE, LIML, FULL1; numerical entries not recoverable from the source.]
Simulation results based on 12,000 replications.
Taking derivatives, we obtain the first-order condition

∑_{t=1}^{T} π̂t(β) (∂g̃t(β)/∂β)′ Ω̃(β)⁻ g̃(β) = 0,

where A⁻ denotes the generalized (Moore–Penrose) inverse of a square matrix A, and

π̂t(β) := (1 − g̃(β)′ Ω̃(β)⁻ g̃t(β)) / ∑_{s=1}^{T} (1 − g̃(β)′ Ω̃(β)⁻ g̃s(β)).

The π̂t have an interpretation as ‘‘empirical probabilities’’ in the Generalized Empirical Likelihood framework. Now consider a modification of the first-order conditions in which we replace π̂t(β̂) with

π̃tα := (1 − g̃(β̂)′ Ω̃(β̂)⁻ g̃t(β̂) + αT) / ∑_{s=1}^{T} [1 + αT − g̃(β̂)′ Ω̃(β̂)⁻ g̃s(β̂)]

for some sequence αT > 0 with the limiting properties specified in Proposition 3. We hold all terms except g̃(β) fixed at β̂, and solve
Table 9
Linear model under groupwise clustering — nine decile range; ρ = 0.3, T = 400, γν = 0.5.
[Columns: CP, M, Z clustering, Clusters, RCUE1, RCUE4, RCUE21, RCUE24, GMM, CUE, LIML, FULL1; numerical entries not recoverable from the source.]
Simulation results based on 12,000 replications.

Table 10
Linear model under groupwise clustering — mean bias; ρ = 0.3, T = 400, γν = 0.5.
[Columns: CP, M, Z clustering, Clusters, RCUE1, RCUE4, RCUE21, RCUE24, GMM, CUE, LIML, FULL1; numerical entries not recoverable from the source, though the CUE column contains entries on the order of 10¹², reflecting the non-existence of moments.]
Simulation results based on 12,000 replications.
for β:

0 = ∑_{t=1}^{T} π̃tα (∂g̃t(β)/∂β)′ Ω̄⁻¹ g̃(β)
  ∝ ∑_{t=1}^{T} [1 − g̃(β̂)′ Ω̃(β̂)⁻ g̃t(β̂) + αT] (∂g̃t(β)/∂β)′ Ω̄⁻¹ g̃(β)  (6)

where, as in (1), we replace the variance estimate with Ω̄ := Ω̃(β̂) + γI.
From this expression, it can be seen that this modification of the first-order condition for the CUE corresponds to the definition of RCUE in (1) with J(β) := G̃(β̂)′ Ω̄⁻¹ g̃(β), which is – up to the ridge adjustment to the inverse variance matrix – the first-order condition of two-step (efficient) GMM with the variance matrix and the Jacobian evaluated at the CUE β̂. An interpretation of this modification is that it adjusts the estimates of the ‘‘empirical probabilities’’ toward 1/T and skews the first-order conditions toward those for the two-step (efficient) GMM estimator, which has moments up to the degree of
Table 11
Linear model under groupwise clustering — variance of estimates; ρ = 0.3, T = 400, γν = 0.5.
[Columns: CP, M, Z clustering, Clusters, RCUE1, RCUE4, RCUE21, RCUE24, GMM, CUE, LIML, FULL1; numerical entries not recoverable from the source, though the CUE column contains entries on the order of 10²⁷, reflecting the non-existence of moments.]
Simulation results based on 12,000 replications.

Table 12
Linear model under groupwise clustering — mean square error; ρ = 0.3, T = 400, γν = 0.5.
[Columns: CP, M, Z clustering, Clusters, RCUE1, RCUE4, RCUE21, RCUE24, GMM, CUE, LIML, FULL1; numerical entries not recoverable from the source, though the CUE column contains entries on the order of 10²⁷, reflecting the non-existence of moments.]
Simulation results based on 12,000 replications.
over-identification. This modification could be interpreted as a regularization in that it penalizes variation in the ‘‘empirical distribution’’ defined by the estimated probabilities π̂t. This idea of regularizing the ‘‘empirical probabilities’’ in the definition of CUE through the dual problem can be extended to the entire family of Generalized Empirical Likelihood (GEL) estimators, which solve the problem (5) for a general concave function ϱ(v). As shown in Newey and Smith (2004), the GEL estimator solves

∑_{t=1}^{T} π̂t G̃t(β)′ [∑_{t=1}^{T} kt g̃t(β) g̃t(β)′]⁻¹ g̃(β) = 0
J. Hausman et al. / Journal of Econometrics 165 (2011) 45–57
Table 13
Nonlinear model under homoskedasticity — median bias; T = 400.

ρ̂     CP     M    GMM     CUE     RCUE1   RCUE4   RCUE21  RCUE24
0.4   12.50   5    0.050   0.006   0.023   0.036   0.023   0.012
0.4   12.50  10    0.083   0.008   0.030   0.053   0.021   0.017
0.4   12.50  30    0.115   0.015   0.047   0.061   0.035   0.040
0.4   12.50  50    0.124   0.019   0.045   0.058   0.038   0.041
0.4   25.00   5    0.029   0.005   0.014   0.021   0.013   0.007
0.4   25.00  10    0.056   0.005   0.017   0.032   0.012   0.008
0.4   25.00  30    0.096   0.011   0.026   0.043   0.017   0.017
0.4   25.00  50    0.108   0.017   0.034   0.046   0.027   0.027
Simulation results based on 7000 replications.
Table 14
Nonlinear model under homoskedasticity — interquartile range; T = 400.

ρ̂     CP     M    GMM     CUE     RCUE1   RCUE4   RCUE21  RCUE24
0.4   12.50   5    0.165   0.239   0.205   0.184   0.188   0.222
0.4   12.50  10    0.127   0.272   0.216   0.170   0.223   0.252
0.4   12.50  30    0.080   0.304   0.236   0.169   0.259   0.272
0.4   12.50  50    0.067   0.298   0.228   0.165   0.255   0.262
0.4   25.00   5    0.126   0.161   0.149   0.139   0.145   0.157
0.4   25.00  10    0.105   0.183   0.157   0.134   0.166   0.178
0.4   25.00  30    0.073   0.226   0.186   0.141   0.208   0.217
0.4   25.00  50    0.062   0.237   0.195   0.147   0.216   0.221
Simulation results based on 7000 replications.
Table 15
Nonlinear model under homoskedasticity — nine decile range; T = 400.

ρ̂     CP     M    GMM     CUE     RCUE1   RCUE4   RCUE21  RCUE24
0.4   12.50   5    0.467   0.800   0.638   0.572   0.483   0.656
0.4   12.50  10    0.341   0.954   0.695   0.552   0.583   0.774
0.4   12.50  30    0.205   1.369   0.860   0.572   0.806   0.991
0.4   12.50  50    0.167   1.388   0.883   0.606   0.915   1.040
0.4   25.00   5    0.342   0.467   0.421   0.388   0.385   0.440
0.4   25.00  10    0.280   0.543   0.456   0.385   0.454   0.518
0.4   25.00  30    0.186   0.786   0.611   0.443   0.649   0.731
0.4   25.00  50    0.154   0.941   0.730   0.513   0.766   0.837
Simulation results based on 7000 replications.
Table 16
Nonlinear model under homoskedasticity — mean bias; T = 400.

ρ̂     CP     M    GMM     CUE      RCUE1   RCUE4   RCUE21  RCUE24
0.4   12.50   5    0.043   −0.024   0.015   0.029   0.023    0.001
0.4   12.50  10    0.083   −0.028   0.034   0.057   0.044    0.034
0.4   12.50  30    0.118   −0.074   0.054   0.073   0.015    0.025
0.4   12.50  50    0.127   −0.094   0.048   0.066   0.038    0.042
0.4   25.00   5    0.022   −0.007   0.003   0.012   0.006   −0.006
0.4   25.00  10    0.055   −0.009   0.011   0.029   0.005   −0.004
0.4   25.00  30    0.098   −0.042   0.017   0.043   0.000   −0.003
0.4   25.00  50    0.111   −0.057   0.028   0.047   0.015    0.013
Simulation results based on 7000 replications.
Table 17
Nonlinear model under homoskedasticity — variance of estimates; T = 400.

ρ̂     CP     M    GMM     CUE     RCUE1   RCUE4   RCUE21  RCUE24
0.4   12.50   5    0.023   0.245   0.048   0.042   0.022   0.040
0.4   12.50  10    0.011   0.328   0.051   0.036   2.245   2.901
0.4   12.50  30    0.004   0.271   0.068   0.036   1.013   0.986
0.4   12.50  50    0.003   0.675   0.070   0.035   0.408   0.229
0.4   25.00   5    0.012   0.144   0.020   0.017   0.015   0.020
0.4   25.00  10    0.008   0.167   0.024   0.017   0.020   0.028
0.4   25.00  30    0.003   0.107   0.041   0.021   0.042   0.052
0.4   25.00  50    0.002   0.495   0.055   0.026   0.364   0.154
Simulation results based on 7000 replications.
Table 18
Nonlinear model under homoskedasticity — mean square error; T = 400.

ρ̂     CP     M    GMM     CUE     RCUE1   RCUE4   RCUE21  RCUE24
0.4   12.50   5    0.025   0.245   0.048   0.043   0.023   0.040
0.4   12.50  10    0.018   0.329   0.052   0.039   2.246   2.903
0.4   12.50  30    0.018   0.276   0.071   0.041   1.013   0.986
0.4   12.50  50    0.019   0.684   0.073   0.039   0.410   0.231
0.4   25.00   5    0.012   0.144   0.020   0.017   0.015   0.020
0.4   25.00  10    0.011   0.167   0.025   0.018   0.020   0.028
0.4   25.00  30    0.013   0.109   0.041   0.023   0.042   0.052
0.4   25.00  50    0.015   0.498   0.056   0.028   0.364   0.154
Simulation results based on 7000 replications.
where
\[ \hat\pi_t = \frac{\varrho'(\hat\lambda'\tilde g_t(\beta))}{\sum_{s=1}^{T}\varrho'(\hat\lambda'\tilde g_s(\beta))}, \qquad \hat\lambda = -\Bigg[\frac{1}{T}\sum_{t=1}^{T}k_t\,\tilde g_t(\beta)\tilde g_t(\beta)'\Bigg]^{-1}\tilde g(\beta), \]
and
\[ k_t = \frac{\big[\varrho'(\hat\lambda'\tilde g_t)+1\big]\big/\hat\lambda'\tilde g_t}{\frac{1}{T}\sum_{s=1}^{T}\big[\varrho'(\hat\lambda'\tilde g_s)+1\big]\big/\hat\lambda'\tilde g_s}. \]
Under regularity conditions, \(\hat\lambda = O_p(1/\sqrt{T})\).
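As a numerical illustration of these definitions (not code from the paper), the inner dual problem can be solved for the Empirical Likelihood member of the GEL family, \(\varrho(v) = \log(1-v)\), for which \(\varrho'(v) = -1/(1-v)\) satisfies the normalizations \(\varrho'(0) = \varrho''(0) = -1\). The moment data below are simulated placeholders, and `el_lambda` is a hypothetical helper name:

```python
import numpy as np

rng = np.random.default_rng(0)
T, m = 200, 3
# placeholder moment data: g_t evaluated at some trial beta, with a small
# non-zero mean so that the multiplier lambda-hat is non-trivial
g = rng.normal(0.05, 1.0, size=(T, m))

def el_lambda(g, iters=50):
    """Newton iteration for the EL inner problem: max_l sum_t log(1 - l'g_t)."""
    lam = np.zeros(g.shape[1])
    for _ in range(iters):
        r = 1.0 - g @ lam                    # r_t = 1 - lambda'g_t
        grad = -(g / r[:, None]).sum(axis=0)
        hess = -(g / r[:, None] ** 2).T @ g  # negative definite on r > 0
        lam = lam - np.linalg.solve(hess, grad)
    return lam

lam = el_lambda(g)
# implied probabilities: pi_t = rho'(lambda'g_t) / sum_s rho'(lambda'g_s)
w = 1.0 / (1.0 - g @ lam)
pi = w / w.sum()
print(pi.sum())               # probabilities sum to one
print(np.abs(pi @ g).max())   # reweighted moments are (numerically) zero
```

The last line checks the defining property of the implied probabilities: at the solution of the dual problem, the moments reweighted by \(\hat\pi_t\) are zero.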
By the mean value theorem,
\[ \hat\pi_t \propto \varrho'(0) + \varrho''(0)\hat\lambda'\tilde g_t(\beta) + \tfrac{1}{2}\varrho'''(\bar\lambda'\tilde g_t(\beta))\big(\hat\lambda'\tilde g_t(\beta)\big)^2 = -\Bigg(1 - \tilde g(\beta)'\Bigg[\frac{1}{T}\sum_{t=1}^{T}k_t\,\tilde g_t(\beta)\tilde g_t(\beta)'\Bigg]^{-1}\tilde g_t(\beta)\Bigg) + O_p\bigg(\frac{1}{T}\bigg), \]
where \(\bar\lambda\) lies between zero and \(\hat\lambda\), and without loss of generality we impose the normalizations \(\varrho'(0) = \varrho''(0) = -1\). Also, by L'Hospital's rule,
\[ k_t = 1 + \frac{\varrho'''(0)}{2}\,\hat\lambda'\big(\tilde g(\beta) - \tilde g_t(\beta)\big) + O_p\bigg(\frac{1}{T}\bigg). \]
Plugging this expression back into the expansion for \(\hat\pi_t\), it can be seen that \(\hat\pi_t^{GEL} = \hat\pi_t^{CUE} + O_p(1/T)\), so that in large samples the projection argument made for CUE will be valid for all GEL estimators. It should also be noted that if the third moments of \(g_t(\beta_0)\) are zero and the fourth moments are bounded, all GEL estimators are the same as CUE up to second order.

The Monte Carlo results in the previous section indicated that CUE still has some bias for a large degree of over-identification, most of which arises from the estimation of the optimal weighting matrix \(\hat\Omega\). Newey and Smith (2004) showed that, in the class of GEL estimators, for Empirical Likelihood (EL) the estimation of the weighting matrix \(\Omega(\beta)\) does not contribute to the higher-order bias, so applying the proposed modification to the EL estimator should be expected to produce even better finite sample properties than the modified CUE. We leave this topic for future research.

7. Conclusion

In this paper we attempt to solve two problems: the bias in GMM, which increases with the number of instruments (moment conditions), and the ‘‘no moment’’ problem of CUE, which leads to wide dispersion of the estimates. To date, much of the discussion of these problems has been under the assumption of homoscedasticity and absence of serial correlation, which limits their use in empirical research. HNWCS (2007) propose a solution to the homoscedasticity problem for the linear model. In this paper we propose a solution to the more general problem for linear and nonlinear specifications when both heteroscedasticity and serial correlation may be present. We propose a class of estimators, ‘‘regularized CUE’’ or RCUE, that reduce the moment problem and reduce the dispersion of the CUE estimator. The Monte Carlo results accord well with the theory and demonstrate that the new RCUE estimator performs better than existing estimators. As topics for future research, we recommend investigation of the optimal value of the parameter in the RCUE estimators and application of the estimators to the class of Maximum Empirical Likelihood estimators, which is likely to decrease estimator bias even more.

Acknowledgment

Newey thanks the NSF for research support.

Appendix A. Proofs
Proof of Proposition 2. Since the matrix \(\hat\Omega(\hat\beta)\) is positive semidefinite, the inverse \((\hat\Omega(\hat\beta) + \gamma I)^{-1}\) is positive definite by standard arguments. Hence, the matrix product \(\bar G'\bar\Omega^{-1}\bar G\) is positive semidefinite with probability 1, and we can bound the eigenvalues of \(\hat M = \bar G'\bar\Omega^{-1}\bar G + \alpha I\) from below by \(\alpha\), so that by the Cauchy–Schwarz inequality
\[ \|\bar\beta\| \le (\alpha\gamma)^{-1}\,\|\tilde g(0)\|\,\|\tilde G\| \]
with probability one, where we denote the Euclidean norm of a matrix \(A\) by \(\|A\| = \operatorname{tr}(A'A)^{1/2}\). Therefore, the estimator has moments at least up to the number of moments of the data divided by two for any choice of \(\alpha > 0\) and \(\gamma > 0\).

Proof of Proposition 3. This proof follows closely the proof of Theorem 3 of Newey and Windmeijer (2009a) (in the following quoted as NW). The first-order condition for RCUE given in Eq. (1) is
\[ 0 = \hat D(\beta) := \tilde G(\hat\beta)'\bar\Omega(\gamma)^{-1}\hat g(\beta) + \alpha_T J(\beta). \]
By the mean value theorem,
\[ 0 = \hat D(\hat\beta_R) = \hat D(\beta_0) + \nabla_\beta \hat D(\bar\beta)\,(\hat\beta_R - \beta_0). \]
Since, by Proposition 2, RCUE is consistent, \(\bar\beta \to_p \beta_0\). Also, as shown by Newey and Windmeijer (2009a), CUE is consistent, so that \(\bar\beta - \beta_0 \to_p 0\), and for \(\bar\Omega(\gamma) = \hat\Omega(\hat\beta_{CUE}) + \gamma I_K\) we have \(\mu_T\,\|\bar\Omega(\gamma)^{-1} - \Omega^{-1}\| \to_p 0\) as \(\mu_T\gamma \to 0\).
Also, by continuity of \(\nabla^2_{\beta\beta'}Q(\cdot)\) and Lemma A.13 in NW,
\[ \frac{T}{\mu_T^2}\,\nabla^2_{\beta\beta}\hat Q(\bar\beta) \to_p H + \frac{T\alpha_T}{\mu_T^2}\,\nabla_\beta J_0(\beta_0). \]
In the proof of Lemma A.12 in NW, replacing
\[ X_t = \frac{1}{\mu_T}\,\lambda'\bar G'\hat\Omega^{-1}g_t + \frac{\alpha_T}{\mu_T}\,\lambda' J(\beta_0), \]
where \(\lambda\) is a \(k\)-dimensional vector with \(\|\lambda\| = 1\), we obtain
\[ T\,E[X_t^2] = \frac{T}{\mu_T^2}\,\lambda'\bar G'\hat\Omega^{-1}\bar G\lambda + \frac{T\alpha_T^2}{\mu_T^2}\,\lambda' P_T\lambda + \frac{2T\alpha_T}{\mu_T^2}\,\lambda'\bar G'\hat\Omega^{-1}Q_T\lambda, \]
where \(P_T = E[J_T(\beta_0)J_T(\beta_0)']\) and \(Q_T = E[g_t J_T(\beta_0)']\), with \(\lambda' P_T\lambda \to \lambda' H\lambda\), so that by their argument
\[ \frac{\sqrt T}{\mu_T}\,\hat D(\beta_0) \to_d N\Bigg(\lim \frac{\sqrt T\,\alpha_T}{\mu_T}\,J(\beta_0),\; H + \Lambda\Bigg). \]
Also we have that \(\mu_T^{-2}T\hat M \to H + \lim T\alpha\mu_T^{-2}J_\beta\). Defining \(\Sigma := \operatorname{plim}\mu_T^{-2}T\,G'\Omega^{-3/2}(I - P_{\Omega^{1/2}G})\Omega^{-3/2}G\), by a Central Limit Theorem and Slutsky's Theorem we have
\[ \mu_T\begin{bmatrix} \bar M^{-1}(\bar h_\gamma - \bar M_\gamma\bar M^{-1}\bar h - B_{\gamma,T}) \\ \bar M^{-1}(\bar h_\alpha - J_\beta(\bar\beta)\bar M^{-1}\bar h - B_{\alpha,T}) \end{bmatrix} \to_d N\Bigg(0,\; \begin{bmatrix} H^{-1}\Sigma H^{-1} & H^{-1}\Sigma H^{-1}J_\beta H^{-1} \\ H^{-1}J_\beta H^{-1}\Sigma H^{-1} & H^{-1}J_\beta H^{-1}J_\beta H^{-1} \end{bmatrix}\Bigg). \]
Denoting \(a = \lim \mu_T^{-2}T\alpha\) and \(c = \lim\gamma\), we obtain from Slutsky's Theorem that for the expansion in Eq. (7),
\[ \mu_T(\bar\beta - \hat\beta_{CUE} - B_T) \to_d N\big(0,\; H^{-1}V(a,c)H^{-1}\big), \]
where \(B_T = \bar M^{-1}(B_{\alpha,T}\,\alpha + B_{\gamma,T}\,\gamma)\), so that
\[ B_T \to B(a,c) = a\,H^{-1}(J_\beta + cH_\gamma)H^{-1}J_\beta\beta_0 \]
and
\[ V(a,c) = a^2 J_\beta H^{-1}J_\beta + ac\,(\Sigma H^{-1}J_\beta + J_\beta H^{-1}\Sigma) + c^2\Sigma. \]
Therefore, following the proof of Theorem 3 in NW,
\[ \mu_T(\hat\beta_{RCUE} - \hat\beta) \to_p 0 \]
as \(\mu_T \to \infty\), so that the asymptotic distribution for RCUE is the same as that for CUE given in Theorem 3 in NW.

Appendix B. Expansion of RCUE in α and γ
Recall that in the linear case, from Eq. (1), the regularized estimator can be expressed as
\[ \bar\beta = \bar\beta(\alpha,\gamma) = \big[\tilde G'\bar\Omega(\gamma)^{-1}\bar G + \alpha J_\beta\big]^{-1}\tilde G'\bar\Omega(\gamma)^{-1}\bar g(0). \]
In the following, we are going to denote
\[ M(\alpha,\gamma) = \tilde G'\bar\Omega(\gamma)^{-1}\bar G + \alpha J_\beta, \qquad h(\alpha,\gamma) = \tilde G'\bar\Omega(\gamma)^{-1}\big(\bar g(0) - \bar G\hat\beta_{CUE}\big) - \alpha J_\beta\hat\beta_{CUE}, \]
and \(\bar v = \bar g(0) - \bar G\hat\beta_{CUE}\). Since \(\alpha = \gamma = 0\) corresponds to \(\hat\beta_{CUE}\), and denoting derivatives with respect to \(\alpha\) and \(\gamma\) with subscripts, we have by the mean value theorem that
\[ \bar\beta = \hat\beta_{CUE} + \bar M^{-1}(\bar h_\gamma - \bar M_\gamma\bar M^{-1}\bar h)\,\gamma + \bar M^{-1}(\bar h_\alpha - J_\beta(\tilde\beta)\bar M^{-1}\bar h)\,\alpha, \]
where
\[ \bar M := M(\bar\alpha,\bar\gamma), \quad \bar h = h(\bar\alpha,\bar\gamma), \quad \bar h_\gamma := -\tilde G(\hat\beta_{CUE})'\bar\Omega(\bar\gamma)^{-1}\bar\Omega(\bar\gamma)^{-1}\bar v, \quad \bar M_\gamma = -\tilde G(\hat\beta_{CUE})'\bar\Omega(\bar\gamma)^{-1}\bar\Omega(\bar\gamma)^{-1}\tilde G(\tilde\beta), \]
and \(\tilde\beta = \bar\beta(\bar\alpha,\bar\gamma)\) for some intermediate value \((\bar\alpha,\bar\gamma)\) between zero and \((\alpha,\gamma)\).
By consistency of \(\hat\beta_{CUE}\), \(T^{1/2}\bar v \to_d N(0,\Omega)\). By Assumption 3 in NW, the eigenvalues of \(\Omega(\beta)\) are bounded away from zero by a constant, so that by \(\gamma \to 0\), consistency of \(\hat\beta_{CUE}\), and \(\hat\Omega(\bar\beta) - \Omega(\bar\beta) \to_p 0\) under Assumption 3, we have \(\bar\Omega(\bar\gamma) - \Omega(\beta_0) \to_p 0\) in the usual matrix norm. Also \(\bar G \to_p G\) and \(\tilde G \to_p G\), where \(G = G(\beta_0)\). Therefore, setting \(B_{\gamma,T} = \alpha\bar M_\gamma\bar M^{-1}J_\beta\hat\beta_{CUE}\) and \(B_{\alpha,T} = (J_\beta\bar M^{-1} - I)J_\beta\hat\beta_{CUE}\),
\[ \mu_T^{-1}T\big(\bar h_\gamma - \bar M_\gamma\bar M^{-1}\bar h - B_{\gamma,T}\big) = \mu_T^{-1}T\,\bar G'\Omega^{-1/2}(I - P_{\Omega^{1/2}G})\Omega^{-1/2}\bar v + o_P(1) \]
and
\[ \mu_T\big(\bar h_\alpha - J_\beta(\bar\beta)\bar M^{-1}\bar h - B_{\alpha,T}\big) = \mu_T\,J_\beta(G'\Omega^{-1}G)^{-1}G'\Omega^{-1}\bar v + o_P(1). \]
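As an illustration only, the linear expansion above can be exercised numerically. The sketch below uses a simulated linear IV design with \(g_t(\beta) = z_t(y_t - x_t\beta)\) and, as a loudly flagged assumption, takes \(J_\beta = I\); the paper's actual regularization term \(J(\beta)\) is not reproduced here, and all design values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
T, m = 500, 8
Pi = np.full(m, 0.25)                       # assumed instrument strength
z = rng.normal(size=(T, m))
u, v = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=T).T
x = z @ Pi + v                              # endogenous regressor
y = 0.5 * x + u                             # hypothetical true beta = 0.5

Gbar = (z * x[:, None]).mean(axis=0)        # (1/T) sum_t z_t x_t
g0 = (z * y[:, None]).mean(axis=0)          # gbar(0) = (1/T) sum_t z_t y_t

def beta_bar(alpha, gamma, b0=0.0):
    """Sketch of beta(alpha, gamma) = [G'(Omega + gamma I)^{-1} G + alpha J]^{-1}
    G'(Omega + gamma I)^{-1} gbar(0), with J = I assumed and the weighting
    matrix evaluated at a preliminary estimate b0."""
    e = y - x * b0
    Omega = (z * (e ** 2)[:, None]).T @ z / T
    W = np.linalg.inv(Omega + gamma * np.eye(m))
    return (Gbar @ W @ g0) / (Gbar @ W @ Gbar + alpha)

b_unreg = beta_bar(alpha=0.0, gamma=0.0)    # alpha = gamma = 0: no regularization
b_reg = beta_bar(alpha=0.01, gamma=0.01)    # small regularization, slight shrinkage
print(b_unreg, b_reg)
```

Both values should be close to the true coefficient in a strongly identified design; the regularized version trades a small shrinkage bias for bounded moments, in line with Proposition 2.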
References
Bekker, P., 1994. Alternative approximations to the distributions of instrumental variables estimators. Econometrica 62 (3), 657–681.
Bekker, P., van der Ploeg, J., 2005. Instrumental variable estimation based on grouped data. Statistica Neerlandica 59, 239–267.
Chao, J., Swanson, N., 2005. Consistent estimation with a large number of weak instruments. Econometrica 73 (5), 1673–1692.
Donald, S., Newey, W., 2000. A jackknife interpretation of the continuous updating estimator. Economics Letters 67, 239–243.
Fuller, W., 1977. Some properties of a modification of the limited information estimator. Econometrica 45 (4), 939–954.
Guggenberger, P., 2005. Monte-Carlo evidence suggesting a no moment problem of the continuous updating estimator. Economics Bulletin 3 (13), 1–6.
Hahn, J., Hausman, J., Kuersteiner, G., 2004. Estimation with weak instruments: accuracy of higher-order bias and MSE approximations. Econometrics Journal 7, 272–306.
Han, Ch., Phillips, P., 2006. GMM with many moment conditions. Econometrica 74 (1), 147–192.
Hansen, C., Hausman, J., Newey, W., 2008. Estimation with many instrumental variables. Journal of Business and Economic Statistics 26, 398–422.
Hansen, L., Heaton, J., Yaron, A., 1996. Finite-sample properties of some alternative GMM estimators. Journal of Business and Economic Statistics 14 (3), 262–280.
Hausman, J., Newey, W., Woutersen, T., 2005. IV estimation with heteroskedasticity and many instruments. Working paper, MIT.
Hausman, J., Newey, W., Woutersen, T., Chao, J., Swanson, N., 2007. Instrumental variable estimation with heteroskedasticity and many instruments. Working paper, MIT.
Kitamura, Y., Stutzer, M., 1997. An information-theoretic alternative to generalized method of moments estimation. Econometrica 65 (4), 861–874.
Morimune, K., 1983. Approximate distributions of k-class estimators when the degree of overidentifiability is large compared with the sample size. Econometrica 51 (3), 821–841.
Newey, W., Smith, R., 2004. Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72 (1), 219–255.
Newey, W., Windmeijer, F., 2009a. Generalized method of moments with many weak moment conditions. Econometrica 77 (3), 687–719.
Newey, W., Windmeijer, F., 2009b. Supplement to ‘‘Generalized method of moments with many weak moment conditions’’. Econometrica 77, Supplemental Material.
Staiger, D., Stock, J., 1997. Instrumental variables regression with weak instruments. Econometrica 65 (3), 557–586.
Stock, J., Wright, J., 2000. GMM with weak identification. Econometrica 68 (5), 1055–1096.
Stock, J., Yogo, M., 2005. Asymptotic distributions of instrumental variables statistics with many instruments. In: Andrews, D.W.K., Stock, J.H. (Eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, pp. 109–120.
van der Vaart, A., 1998. Asymptotic Statistics. Cambridge University Press.
Journal of Econometrics 165 (2011) 58–69
On finite sample properties of alternative estimators of coefficients in a structural equation with many instruments✩

T.W. Anderson (a), Naoto Kunitomo (b,∗), Yukitoshi Matsushita (b)

(a) Department of Statistics and Department of Economics, Stanford University, United States
(b) Graduate School of Economics, University of Tokyo, Japan
Article info
Article history: Available online 13 May 2011
Keywords: Finite sample properties; Empirical likelihood; GMM; Simultaneous equations with many instruments; Limited information maximum likelihood; Nonlinear LIML
Abstract
We compare four different estimation methods for the coefficients of a linear structural equation with instrumental variables. As the classical methods we consider the limited information maximum likelihood (LIML) estimator and the two-stage least squares (TSLS) estimator, and as the semi-parametric estimation methods we consider the maximum empirical likelihood (MEL) estimator and the generalized method of moments (GMM) (or the estimating equation) estimator. Tables and figures of the distribution functions of the four estimators are given for enough values of the parameters to cover most linear models of interest, and we include some heteroscedastic and nonlinear cases. We have found that the LIML estimator performs well in terms of bounded loss functions and probabilities when the number of instruments is large, that is, in micro-econometric models with ‘‘many instruments’’ in the terminology of the recent econometric literature. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

In some recent microeconometric applications many instrumental variables have been used in estimating important structural equations. This feature may be due to the availability of a large number of cross-sectional observations and of other instrumental variables. One empirical example of this kind often cited in the econometric literature is Angrist and Krueger (1991); there is some discussion by Bound et al. (1995). Because there are some distinctive aspects when the number of instrumental variables is large, we investigate the basic properties of the standard estimation methods for microeconometric models. These new developments suggest reconsidering the traditional estimation methods. There is a growing recent literature on related problems; see Bekker (1994), Staiger and Stock (1997), Newey and Smith (2004) and Chao and Swanson (2005), for instance, among many others. The study of estimating a single structural equation in econometric models has led to the development of several estimation methods as alternatives to the least squares estimation
✩ TWA08-7-25-II (Final Version for JOE). This is the second part of a revised version of Discussion Paper CIRJE-F-321 (Graduate School of Economics, University of Tokyo, February 2005), which was presented at the Econometric Society World Congress 2005 in London (August 2005) and at the Kyoto conference (July 2006) in honor of Professor Kimio Morimune. We thank Cheng Hsiao, Yoichi Arai, the editor of this volume and the referees for comments on earlier versions.
∗ Corresponding author. E-mail address: [email protected] (N. Kunitomo).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.05.006
method. The classical examples in the econometric literature are the limited information maximum likelihood (LIML) method and the instrumental variables (IV) method, including the two-stage least squares (TSLS) method (Anderson and Rubin, 1949, 1950). See Anderson and Sawa (1979), Anderson et al. (1982), Mariano (1982), Morimune (1985) and Davidson and MacKinnon (1993), for instance, for studies of their finite sample properties. As a semi-parametric estimation method, the generalized method of moments (GMM) estimation, originally proposed by Hansen (1982), which is essentially the same as the estimating equation (EE) method, has been used in econometric applications. Also the maximum empirical likelihood (MEL) method has been proposed and has received attention (see Owen, 2001). For sufficiently large sample sizes the LIML and the TSLS estimators have approximately the same distribution in the standard large sample asymptotic theory, but their exact distributions can be quite different for the sample sizes occurring in practice. Similarly, although the GMM and the MEL estimators have approximately the same distribution under the more general heteroscedastic disturbances in the standard large sample asymptotic theory, their exact distributions can be quite different for the sample sizes occurring in practice. The main purpose of this paper is to give information about the small sample properties of the exact cumulative distribution functions (cdf's) of these four different estimators for a wide range of parameter values; they have some asymptotic optimalities. We shall pay special attention to the finite sample properties of alternative estimators when we have many instruments in the simultaneous equations. Since it is quite difficult to obtain
T.W. Anderson et al. / Journal of Econometrics 165 (2011) 58–69
the exact densities and cdf's of these estimators, the numerical information makes possible the comparison of properties of alternative estimation methods. We intentionally use the classical estimation setting of a linear structural equation when we have a set of instrumental variables, but we shall also mention some heteroscedastic models and nonlinear models as illustrations. It is our intention to make precise comparisons of alternative estimation procedures in the simplest possible case, which has many applications. It is possible to generalize our formulation in several directions, including many types of nonlinearity and heteroscedasticity, as our examples illustrate. The present paper corresponds to the second part of our work on the problem (Anderson et al., 2005); the first part (Anderson et al., 2010) gave the asymptotic justification for the finite sample findings. An important approach to the study of the finite sample properties of alternative estimators is to obtain asymptotic expansions of the exact distributions in normalized forms. The leading term in the asymptotic expansions in the standard large sample theory is the same for all estimators, but the higher-order terms are different. See Fujikoshi et al. (1982), Takeuchi and Morimune (1985), Anderson et al. (1986), Kunitomo (1987) and their citations for the LIML and the TSLS estimators, and Kunitomo and Matsushita (2009) for the MEL and the GMM estimators. Newey and Smith (2004) considered the bias and the mean squared errors of some estimators in more general nonlinear cases. It should be noted, however, that the mean and the mean squared errors of the exact distributions of estimators are not necessarily the same as the mean and the mean squared errors of the asymptotic expansions of the distributions of estimators. In fact the LIML estimator does not possess any moment of positive integer order (see Mariano and Sawa (1972) and Phillips (1980)).
We shall investigate the exact cumulative distributions of the LIML, MEL, GMM, and TSLS estimators directly in a systematic way. We shall compare the estimators on the basis of probabilities of statistical interest, such as significance levels and confidence intervals. In Section 2 we state the models and alternative estimation methods of unknown parameters. Then in Section 3 we shall explain our tables and figures of the finite sample distributions of alternative estimators and discuss their finite sample properties including simple heteroscedastic cases and nonlinear cases. In Section 4 we shall discuss the approximations of the distribution functions based on their asymptotic expansions. Then the conclusions of our study will be given in Section 5. Tables and figures are gathered in the Appendix.
2. Model and alternative estimation methods of a structural equation with instruments

Let a single linear structural equation in the econometric model be given by
\[ y_{1i} = \beta_2' y_{2i} + \gamma_1' z_{1i} + u_i \quad (i = 1,\ldots,n), \tag{2.1} \]
where \(y_{1i}\) and \(y_{2i}\) are a scalar and a \(G_2\times 1\) vector of endogenous variables, \(z_{1i}\) is a vector of \(K_1\) included exogenous variables in (2.1), \(\gamma_1\) and \(\beta_2\) are \(K_1\times 1\) and \(G_2\times 1\) vectors of unknown parameters, and \(u_i\) are mutually independent disturbance terms with \(E(u_i) = 0\) and \(E(u_i^2) = \sigma^2\) \((i = 1,\ldots,n)\). We assume that (2.1) is one equation in a system of \(1+G_2\) endogenous variables \(y_i' = (y_{1i}, y_{2i}')'\). The vector of \(K\,(= K_1+K_2)\) instrumental variables \(z_i\) satisfies the orthogonality condition \(E[z_i u_i] = E[z_i(y_{1i} - \beta_2' y_{2i} - \gamma_1' z_{1i})] = 0\) \((i = 1,\ldots,n)\). The reduced form is
\[ Y = Z\Pi + V, \tag{2.2} \]
where \(Y = (y_i')\) is the \(n\times(1+G_2)\) matrix of endogenous variables, \(Z = (Z_1, Z_2) = (z_i')\) is the \(n\times K\) matrix of \(K_1+K_2\) instrument vectors \(z_i = (z_{1i}', z_{2i}')'\), and \(V = (v_i')\) is the \(n\times(1+G_2)\) matrix of disturbances with \(E(v_i) = 0\) and the (positive definite) covariance matrix
\[ \Omega = \begin{bmatrix} \omega_{11} & \omega_{12} \\ \omega_{21} & \Omega_{22} \end{bmatrix}, \tag{2.3} \]
and \(\Pi\) is the \((K_1+K_2)\times(1+G_2)\) matrix of coefficients. The relation between the coefficients in (2.1) and (2.2) is
\[ \begin{bmatrix} \pi_{11} & \Pi_{12} \\ \pi_{21} & \Pi_{22} \end{bmatrix}\begin{bmatrix} 1 \\ -\beta_2 \end{bmatrix} = \begin{bmatrix} \gamma_1 \\ 0 \end{bmatrix}, \tag{2.4} \]
and \((\pi_{21}, \Pi_{22})\) is a \(K_2\times(1+G_2)\) matrix of coefficients.

The maximum empirical likelihood (MEL) estimator for the vector of parameters in (2.1) is defined by maximizing the Lagrange form
\[ L_{1n}^{*}(\nu,\theta) = \sum_{i=1}^{n}\log(np_i) - \mu\Bigg(\sum_{i=1}^{n}p_i - 1\Bigg) - n\nu'\sum_{i=1}^{n}p_i z_i\big(y_{1i} - \gamma_1' z_{1i} - \beta_2' y_{2i}\big), \]
where \(\mu\) and \(\nu\) are a scalar and a vector of Lagrange multipliers, and \(p_i\) \((i = 1,\ldots,n)\) is the weighted probability function to be chosen. The above maximization is the same as maximizing
\[ L_{1n}(\nu,\theta) = -\sum_{i=1}^{n}\log\big(1 + \nu' z_i[y_{1i} - \gamma_1' z_{1i} - \beta_2' y_{2i}]\big), \tag{2.5} \]
where \(\hat\mu = n\) and \([n\hat p_i]^{-1} = 1 + \nu' z_i[y_{1i} - \gamma_1' z_{1i} - \beta_2' y_{2i}]\) (see Qin and Lawless (1994) or Owen (2001)). By differentiating (2.5) with respect to \(\nu\) and combining the resulting equation for \(\hat p_i\) \((i = 1,\ldots,n)\), we have the relations
\[ \sum_{i=1}^{n}\hat p_i z_i\big(y_{1i} - \hat\gamma_1' z_{1i} - \hat\beta_2' y_{2i}\big) = 0 \quad\text{and}\quad \hat\nu = \Bigg[\sum_{i=1}^{n}\hat p_i u_i(\hat\theta)^2 z_i z_i'\Bigg]^{-1}\frac{1}{n}\sum_{i=1}^{n}u_i(\hat\theta)z_i, \]
where \(u_i(\hat\theta) = y_{1i} - \hat\gamma_1' z_{1i} - \hat\beta_2' y_{2i}\) and \(\hat\theta = (\hat\gamma_{1.EL}', \hat\beta_{2.EL}')'\) is the maximum empirical likelihood (MEL) estimator of the vector of unknown parameters \(\theta = (\gamma_1', \beta_2')'\). The MEL estimator of \(\theta\) in the linear models can be written as the solution of the equations \(\hat\nu'\sum_{i=1}^{n}\hat p_i z_i\,[-(z_{1i}', y_{2i}')] = 0\), which implies
\[ \sum_{i=1}^{n}\hat p_i\binom{z_{1i}}{y_{2i}}z_i'\Bigg[\sum_{i=1}^{n}\hat p_i u_i(\hat\theta)^2 z_i z_i'\Bigg]^{-1}\frac{1}{n}\sum_{i=1}^{n}z_i(z_{1i}', y_{2i}')\binom{\hat\gamma_{1.EL}}{\hat\beta_{2.EL}} = \sum_{i=1}^{n}\hat p_i\binom{z_{1i}}{y_{2i}}z_i'\Bigg[\sum_{i=1}^{n}\hat p_i u_i(\hat\theta)^2 z_i z_i'\Bigg]^{-1}\frac{1}{n}\sum_{i=1}^{n}z_i y_{1i}. \tag{2.6} \]
The GMM estimator can be written as the solution of (2.6) when \(u_i(\hat\theta)\) is replaced by \(u_i(\tilde\theta)\), where \(\tilde\theta\) is a consistent initial estimator of \(\theta\) (TSLS was used), and the fixed probability weight functions \(p_i = 1/n\) \((i = 1,\ldots,n)\) are used (see Hayashi, 2000, for instance).

In order to relate the MEL and GMM estimators to the LIML and TSLS estimators, we consider the homogeneity conditions \(\sum_{i=1}^{n}\hat p_i u_i(\theta)^2 z_i z_i' = \sigma^2\frac{1}{n}\sum_{i=1}^{n}z_i z_i'\) and \(\frac{1}{n}\sum_{i=1}^{n}u_i(\theta)^2 = \sigma^2\). The resulting maximization problem under the homogeneity restrictions requires \(\hat\nu = (1/\hat\sigma^2)\big[\sum_{i=1}^{n}z_i z_i'\big]^{-1}\sum_{i=1}^{n}z_i u_i(\hat\theta)\) \((u_i(\hat\theta) = y_{1i} - \hat\gamma_1' z_{1i} - \hat\beta_2' y_{2i})\). Then by using the approximation \(\log(1+x)\sim x\), (2.5) becomes approximately
\[ L_{2n}(\theta) = (-n)\,\frac{\Big[\sum_{i=1}^{n}(y_{1i} - \gamma_1' z_{1i} - \beta_2' y_{2i})z_i'\Big]\Big[\sum_{i=1}^{n}z_i z_i'\Big]^{-1}\Big[\sum_{i=1}^{n}z_i(y_{1i} - \gamma_1' z_{1i} - \beta_2' y_{2i})\Big]}{\sum_{i=1}^{n}(y_{1i} - \gamma_1' z_{1i} - \beta_2' y_{2i})^2}, \tag{2.7} \]
which is \((-n)\) times the variance ratio. The minimum of the variance ratio gives the LIML estimator \(\hat\beta_{LI}\,(= (1, -\hat\beta_{2.LI}')')\) of \(\beta = (1, -\beta_2')'\), which is the solution of
\[ \Bigg[\frac{1}{n}G - \lambda\frac{1}{n-K}H\Bigg]\hat\beta_{LI} = 0, \tag{2.8} \]
where \(n - K > 0\) and \(\lambda\) is the smallest root of \(|(1/n)G - l(1/(n-K))H| = 0\). Here we use the notation \(G = Y'Z_{2.1}A_{22.1}^{-1}Z_{2.1}'Y\), \(H = Y'(I_n - Z(Z'Z)^{-1}Z')Y\), \(A_{22.1} = Z_{2.1}'Z_{2.1}\), \(Z_{2.1} = Z_2 - Z_1 A_{11}^{-1}A_{12}\), and
\[ A = \begin{bmatrix} Z_1' \\ Z_2' \end{bmatrix}(Z_1, Z_2) = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \tag{2.9} \]
which is a nonsingular matrix (a.s.). The TSLS estimator \(\hat\beta_{TS} = (1, -\hat\beta_{2.TS}')'\) of \(\beta = (1, -\beta_2')'\) is given by
\[ Y_2' Z_{2.1} A_{22.1}^{-1} Z_{2.1}' Y \begin{bmatrix} 1 \\ -\hat\beta_{2.TS} \end{bmatrix} = 0. \tag{2.10} \]
It minimizes the numerator of the variance ratio \(L_{2n}\) and corresponds to the GMM estimator if we put \(\hat p_i = 1/n\) \((i = 1,\ldots,n)\) and impose the homogeneity condition. The statistical methods of LIML and TSLS estimation were originally developed by Anderson and Rubin (1949, 1950); LIML is the (classical) maximum likelihood estimation method with limited information on the instrumental variables. When the disturbances are homoscedastic and normally distributed, \(G\) and \(H\) are sufficient statistics, and the LIML and TSLS estimators depend only on them.

The nonlinear LIML estimator can be defined by substituting \(u_i(\theta) = f_i(y_{1i}, z_{1i}, y_{2i}, \theta)\) for \(u_i(\theta) = y_{1i} - \gamma_1' z_{1i} - \beta_2' y_{2i}\) \((i = 1,\ldots,n)\) and minimizing the variance ratio in (2.7). (The nonlinear TSLS estimator is defined in the same way. The alternative or standard nonlinear LIML and TSLS extensions have been discussed in Chapter 8 of Amemiya (1985).)

3. Evaluation of exact distribution functions and tables

3.1. Parameterization

The evaluation of the cdf's of estimators we have used is based on simulation. In order to describe our evaluation method, we use an expanded formulation and notation of the classical study of Anderson et al. (1982), except that here the sample size is \(n\). We concentrate on the comparison of the estimators of the coefficient parameter of the endogenous variables, and we shall investigate the finite sample distributions of the estimator expressed as
\[ F(x) = \Pr\Bigg\{\frac{1}{\sigma}\big(\Pi_{22}' A_{22.1}\Pi_{22}\big)^{1/2}\big(\hat\beta_2 - \beta_2\big) \le x\Bigg\} \tag{3.1} \]
for \(x = (x_1,\ldots,x_{G_2})\). The limit of (3.1) in the large sample asymptotics is \(N_{G_2}(0, I_{G_2})\) for any (asymptotically) efficient estimator under the homoscedasticity assumption. It is easier to interpret the distribution functions in this form rather than with some other normalization. We use the notation of the noncentrality
\[ \Delta = \Omega_{22}^{-1/2}\Pi_{22}' A_{22.1}\Pi_{22}\Omega_{22}^{-1/2} \tag{3.2} \]
and the standardized vector of coefficients
\[ \alpha = \frac{1}{\sqrt{\omega_{11.2}}}\,\Omega_{22}^{1/2}\big(\beta_2 - \Omega_{22}^{-1}\omega_{21}\big), \tag{3.3} \]
where \(\omega_{11.2} = \omega_{11} - \omega_{12}\Omega_{22}^{-1}\omega_{21}\). When \(G_2 = 1\) in particular, Anderson et al. (1982) have utilized the fact that the explicit distributions of (3.1) for the normalized LIML estimator and the normalized TSLS estimator under the standard case (that is, the disturbances are homoscedastic and normally distributed) depend only on the key parameters \(K_2\), \(n-K\), \(\alpha\) and \(\delta^2 = \Delta\) (see Anderson (1974) for details). Notice that \(\Omega_{22}^{-1}\omega_{21}\) is the regression coefficient of \(v_{1i}\) on \(v_{2i}\) and \(\omega_{11.2}\) is the conditional variance of \(v_{1i}\) given \(v_{2i}\). When \(G_2 = 1\), we rewrite \(\eta = -\alpha/\sqrt{1+\alpha^2} = (\omega_{12} - \omega_{22}\beta_2)/[\sigma\sqrt{\omega_{22}}]\) \((\omega_{22} = \Omega_{22})\), which is the correlation coefficient between the two random variables \(u_i\) and \(v_{2i}\) (or \(y_{2i}\)), and it is the coefficient of simultaneity in the structural equation of the simultaneous equations system. The numerator of the noncentrality parameter \(\delta^2\) represents the additional explanatory power due to \(y_{2i}\) over \(z_{1i}\) in the structural equation, and its denominator is the error variance of \(y_{2i}\). Hence the noncentrality \(\delta^2\) determines how well the equation is defined in the simultaneous equations system, and \(n-K\) is the degrees of freedom of \(H\), which estimates \(\Omega\) in the LIML method; it is not relevant to the TSLS method. The normalization part in (3.1) can also be written as the square root of \(\delta^2/(1+\alpha^2)\times[\omega_{22}/\omega_{11.2}]\). The distribution of (3.1) does not depend on the units of measurement of \(y_{1i}\) and \(y_{2i}\). Some econometricians use the terminology ‘‘many weak instruments’’ for the cases when \(K_2\) is large while \(\delta^2\) is not as large as \(n\), such that \(\delta^2/n \to 0\) and \(\delta^2/K_2 \to a\,(> 0)\). We have tried to choose the key parameter values to make useful interpretations.

3.2. Simulation procedure

By using Monte Carlo simulations we obtain empirical cdf's of estimators of the coefficients of the endogenous variables in the structural equation as follows. We generate a set of random numbers by using the system of (2.1) and
\[ y_{2i} = \Pi_2' z_i + v_{2i}, \tag{3.4} \]
where \(z_i \sim N(0, I_K)\), \(u_i \sim N(0, \sigma^2)\), \(\Pi_2\) is a \(K\times G_2\) matrix of coefficients and \(v_{2i} \sim N_{G_2}(0, \Omega_{22})\) with \(E(u_i v_{2i}) = \omega_{21} - \Omega_{22}\beta_2\) \((i = 1,\ldots,n)\). Since the model of (2.1) and (3.4) is consistent with the reduced form (2.2), we have \(u_i = v_{1i} - \beta_2' v_{2i}\), \(\sigma^2 = \omega_{11} - 2\beta_2'\omega_{12} + \beta_2'\Omega_{22}\beta_2\), and \(z_i\) are independent of \(u_i\) and \(v_{2i}\) \((i = 1,\ldots,n)\) in the homoscedastic case. We take a set of true values of the parameters \(\beta_2, \gamma_1, \sigma^2\) to satisfy the restrictions in (2.1) and (3.4) given the value of \(\alpha\), and then we control the elements of \(\Delta\) by setting values for the \((1+K_2)\)-vectors \(\Pi_2 = (\pi_{2j})\).¹

Following Owen (2001), the maximization in the MEL estimation has been done in two steps: the inner loop for the numerical calculation of the Lagrange multiplier in (2.5) and the outer loop for the minimization with respect to the unknown parameters. We have used a derivative-based maximization routine in the inner loop and a simplex-method based optimization algorithm in the outer loop, utilizing (2.6). There is a non-trivial computational problem in the MEL estimation when the noncentrality parameter is close to zero, as pointed out by Mittelhammer et al. (2005), for instance. Therefore we have made computations for cases where we did not have a problem in numerical convergence.² For the LIML and TSLS estimators as in (2.8) and (2.10), there are simple ways to express \(\hat\beta_2 - \beta_2\) in terms of the two matrices \(G\) and \(H\) (see Anderson et al., 1982, for instance). For each simulation we generated a set of random variables for the disturbance terms and exogenous variables. In each simulation the number of repetitions was 5,000.

¹ In order to examine whether our results strongly depend on the specific values of parameters, however, we have done several simulations for other values of parameters.
² The numerical convergence of the outer loop is not guaranteed in the MEL estimation (Owen, 2001) and we have confirmed that there can be some problems in the extreme cases.
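For concreteness, the scalar \((G_2 = 1)\) relations among \(\alpha\), \(\eta\), and \(\delta^2\) stated above can be computed directly; all numeric values below are hypothetical, not taken from the paper's designs:

```python
import math

# assumed reduced-form covariance elements (G2 = 1) and true beta2
omega11, omega12, omega22 = 1.0, -0.5, 1.0
beta2 = 0.4

# sigma^2 = omega11 - 2 beta2 omega12 + beta2^2 omega22
sigma = math.sqrt(omega11 - 2 * beta2 * omega12 + beta2**2 * omega22)

# simultaneity coefficient eta = (omega12 - omega22 beta2) / (sigma sqrt(omega22))
eta = (omega12 - omega22 * beta2) / (sigma * math.sqrt(omega22))
# invert eta = -alpha / sqrt(1 + alpha^2) to recover the standardized alpha
alpha = -eta / math.sqrt(1 - eta**2)

# noncentrality delta^2 = Pi22' A22.1 Pi22 / omega22 (numerator value assumed)
pi22_A_pi22 = 30.0
delta2 = pi22_A_pi22 / omega22
print(eta, alpha, delta2)
```

The recovered `alpha` satisfies \(\eta = -\alpha/\sqrt{1+\alpha^2}\) exactly, which is the consistency check between (3.3) and the definition of the simultaneity coefficient.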
Fig. 1. CDF of standardized estimators: \(n-K = 300\), \(K_2 = 30\), \(\alpha = (1,1)'\), \(\Delta = \big(\begin{smallmatrix}100 & 1.50\\ 1.50 & 50\end{smallmatrix}\big)\), \(u_i \sim N(0,1)\).
As an illustration we give Fig. 1 for the distribution functions of the estimators in normalized terms when \(G_2 = 2\) and the disturbances are normally distributed. They are given in terms of the two marginal distributions of (3.1), \(F_1(x_1) = F(x_1, \infty)\) and \(F_2(x_2) = F(\infty, x_2)\), respectively. Each limiting distribution is \(N(0,1)\).³ In this example we have set the parameters \(n-K = 300\), \(K_2 = 30\), \(\alpha = (1,1)'\) \((\omega_{12} = (\rho, \rho),\ \rho = -1/\sqrt{3})\), \(\Omega_{22} = I_2\), and
\[ \Delta = \begin{bmatrix} 100 & 1.5 \\ 1.5 & 50 \end{bmatrix}. \]
Although we have investigated aspects of the distributions of the four estimators when G₂ > 1, as in Fig. 1, each depends on many parameters, and too many tables and figures would be required to obtain useful information in a systematic way. Thus, from now on, we give tables (Tables 1–9) and figures (Figs. 2–14) only for G₂ = 1. We control the values of the key parameters in order to compare the distributions of the estimators in a limited number of cases in a systematic way.

In order to investigate the effects of nonnormal disturbances on the distributions of the estimators, we used many nonnormal distributions, but we report only two cases, in which the distributions of the disturbances are skewed or fat-tailed. As the first case, we generated a set of random variables (y₁ᵢ, y₂ᵢ, zᵢ) by using (2.1) and (3.4) with uᵢ = −(χᵢ²(3) − 3)/√6, where the χᵢ²(3) are χ² random variables with 3 degrees of freedom. As the second case, we took the t-distribution with 5 degrees of freedom for the disturbance terms. Also, in order to investigate the effects of heteroscedastic disturbances on the distributions of the estimators, we considered the form E(u²ᵢ) = σ²(zᵢ) and, in particular, the typical example uᵢ = ‖zᵢ‖u∗ᵢ (i = 1, …, n), where the u∗ᵢ (i = 1, …, n) are homoscedastic disturbance terms.

The empirical cdf's of the estimators are consistent for the corresponding true cdf's. In addition to the empirical cdf's, we used a smoothing technique based on cubic spline functions to estimate their percentiles. The distributions are tabulated in standardized terms because this form of tabulation makes comparisons and interpolation easier. Each table includes the three quartiles, the 5th and 95th percentiles, and the interquartile range of the distribution. To evaluate the accuracy of our estimates based on the Monte Carlo experiments, we compared the empirical and exact cdf's of the
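The three disturbance designs just described can be generated in a few lines. This is a sketch in our own code, not the authors' simulation program; the sample size, instrument dimension, and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 500, 10
Z = rng.standard_normal((n, K))                 # instrument matrix with rows z_i

# Skewed case: u_i = -(chi_i^2(3) - 3)/sqrt(6); chi^2(3) has mean 3 and variance 6,
# so u_i has mean 0 and variance 1.
u_skew = -(rng.chisquare(3, size=n) - 3.0) / np.sqrt(6.0)

# Fat-tailed case: t-distribution with 5 degrees of freedom.
u_fat = rng.standard_t(5, size=n)

# Heteroscedastic case: u_i = ||z_i|| u*_i with homoscedastic u*_i,
# so that E(u_i^2 | z_i) = sigma^2(z_i) = ||z_i||^2.
u_star = rng.standard_normal(n)
u_het = np.linalg.norm(Z, axis=1) * u_star

print(u_skew.shape, u_fat.shape, u_het.shape)
```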
3 It is possible to give figures for each component of the coefficient vector, but then the normalizations for the components become messy and the comparison with the limiting distribution may become less clear.
Fig. 2. n − K = 30, K₂ = 3, α = 1, δ² = 30.
Fig. 3. n − K = 100, K₂ = 10, α = 1, δ² = 100.
Two-Stage Least Squares (TSLS) estimator, which corresponds to the GMM estimator when û²ᵢ is replaced by a constant (namely σ²), that is, when the variance–covariance matrix is homoscedastic and known. The exact distribution of the TSLS estimator has been studied and tabulated extensively by Anderson and Sawa (1979). We do not report the details of our results, but we have found that the differences between the exact cdf and its estimate are less than
Table 1. n − K = 30, K₂ = 3, α = 1. For δ² = 30 and δ² = 100, the table reports the 5th percentile (X05), lower quartile (L.QT), median (MEDN), upper quartile (U.QT), 95th percentile (X95) and interquartile range (IQR) of the standardized LIML, MEL, TSLS and GMM estimators under normal disturbances, with the N(0, 1) benchmark −1.65, −0.67, 0, 0.67, 1.65 (IQR 1.35). (The individual entries are not recoverable from this copy.)
Table 2. n − K = 100, K₂ = 10, α = 1, for δ² = 50 and δ² = 100; same layout and normal benchmark as Table 1. (The individual entries are not recoverable from this copy.)
Table 3. n − K = 300, K₂ = 30, α = 1, for δ² = 50 and δ² = 100; same layout and normal benchmark as Table 1. (The individual entries are not recoverable from this copy.)
Table 4. n − K = 100, K₂ = 10, α = 1, δ² = 50, for the nonnormal disturbances uᵢ = −(χ²(3) − 3)/√6 and uᵢ = t(5); same layout and normal benchmark as Table 1. (The individual entries are not recoverable from this copy.)
Table 5. α = 1, δ² = 100, uᵢ = ‖zᵢ‖ϵᵢ, for the two designs n − K = 30, K₂ = 3 and n − K = 100, K₂ = 10; same layout and normal benchmark as Table 1. (The individual entries are not recoverable from this copy.)
Table 6. n − K = 300, K₂ = 30, α = 1, uᵢ = ‖zᵢ‖ϵᵢ, for δ² = 50 and δ² = 100; same layout and normal benchmark as Table 1. (The individual entries are not recoverable from this copy.)
Fig. 4. n − K = 300, K₂ = 30, α = 1, δ² = 100.
Fig. 5. n − K = 100, K₂ = 10, α = 1, δ² = 50, uᵢ = −(χ²(3) − 3)/√6.
Table 7. n − K = 1000, K₂ = 100, α = 1, δ² = 100.

         Normal    uᵢ = N(0, 1)                  uᵢ = ‖zᵢ‖ϵᵢ
                   LIML     TSLS     GMM         LIML     TSLS     GMM
X05      −1.65     −1.82    −4.46    −4.51       −1.84    −4.44    −4.49
L.QT     −0.67     −0.78    −3.89    −3.92       −0.81    −3.91    −3.93
MEDN      0         0.00    −3.53    −3.53        0.01    −3.54    −3.53
U.QT      0.67      0.89    −3.14    −3.12        0.93    −3.17    −3.12
X95       1.65      2.39    −2.57    −2.49        2.51    −2.59    −2.51
IQR       1.35      1.67     0.75     0.80        1.74     0.75     0.81
0.005 in most cases, and the maximum difference is about 0.008. Hence our estimates of the cdf's are accurate to two digits.

3.3. Distributions of the MEL and LIML estimators

For α = 0, the densities of the LIML and MEL estimators are close to symmetric (see Table 8 and Fig. 12). As α increases there is some slight asymmetry, but the median stays very close to zero. For given α, K₂, and n, the lack of symmetry decreases as δ² increases (see Tables 1–3 and Figs. 2–4). For given α, δ², and n, the asymmetry increases with K₂. The main finding from the tables is that the distributions of the MEL and LIML estimators are roughly symmetric around the true parameter value and almost median-unbiased. This finite sample property holds even when K₂ is fairly large. At the same time, their distributions have relatively long tails. As δ² → ∞, the distributions approach N(0, 1); however, for small values of δ² there are appreciable probabilities outside the range of 3 or 4 asymptotic standard deviations (ASDs). When δ² is extremely small, we cannot ignore the tail probabilities for practical purposes (see Table 9). As δ² increases, the spread of the normalized distribution decreases. Also, the distribution of the LIML estimator has slightly tighter tails than that of the MEL estimator. For given α, K₂, and δ², the spread decreases as n increases; it tends to increase with K₂ and decrease with α. Some of our findings on the MEL estimator in this subsection were also pointed out by Guggenberger (2005).

3.4. Distributions of the GMM and TSLS estimators

We have included tables of the distributions of the GMM and TSLS estimators. However, since they are quite similar in most cases, we have included only the distribution of the GMM estimator in many figures.
The most striking feature of the distributions of the GMM and TSLS estimators is that they are skewed towards the left for α > 0 (and towards the right for α < 0), and the distortion
Fig. 6. n − K = 100, K₂ = 10, α = 1, δ² = 50, uᵢ = t(5).
increases with α and K₂. The MEL and LIML estimators are close to median-unbiased in each case, while the GMM and TSLS estimators are biased. As K₂ increases, this bias becomes more serious; for K₂ = 10, 30 and 100, the median is less than −1.0 ASD. If K₂ is large, the GMM and TSLS estimators substantially underestimate the true parameter. This fact definitely favors the MEL and LIML estimators over the GMM and TSLS estimators. However, when K₂ is as small as 3, the GMM and TSLS estimators are very similar to the MEL estimator, and their distributions have tighter tails. The distributions of the MEL and LIML estimators approach normality faster than those of the GMM and TSLS estimators, due primarily to the bias of the latter. In particular, when α ≠ 0 and K₂ = 10–100 (Figs. 3, 4 and 14), the actual 95th percentiles of the GMM estimator are substantially different from the 1.96 of the standard normal. This implies that conventional hypothesis testing about a structural coefficient based on the normal approximation is very likely to seriously underestimate the actual significance level. The 5th and 95th percentiles of the MEL and LIML estimators are much closer to those of the standard normal distribution even when K₂ is large. These observations on the distributions of the MEL and GMM estimators are analogous to the earlier findings on the distributions of the LIML and TSLS estimators by Anderson et al. (1982) and Morimune (1983) under normal disturbances in the same setting of the linear simultaneous equations system.
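The qualitative pattern described here, a GMM/TSLS median bias that grows with K₂ while LIML stays near median-unbiased, can be reproduced in a few lines. This is a hypothetical sketch in our own code, not the authors' program: LIML is computed through the smallest characteristic root of (Y′Y, Y′MY) as a k-class estimator, and the design values (n, K₂, instrument strength, simultaneity ρ) are chosen only for illustration:

```python
import numpy as np

def tsls_liml(y1, y2, Z):
    """k-class estimates of beta2 in y1 = beta2*y2 + u with instruments Z
    (no included exogenous variables): k = 1 gives TSLS, k = LIML root gives LIML."""
    Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)          # projection onto the instrument space
    Y = np.column_stack([y1, y2])
    W, Wm = Y.T @ Y, Y.T @ Y - Y.T @ (Pz @ Y)       # Y'Y and Y'MY with M = I - Pz
    kappa = np.linalg.eigvals(np.linalg.solve(Wm, W)).real.min()  # LIML root (>= 1)

    def kclass(k):
        num = (1 - k) * (y2 @ y1) + k * (y2 @ (Pz @ y1))
        den = (1 - k) * (y2 @ y2) + k * (y2 @ (Pz @ y2))
        return num / den

    return kclass(1.0), kclass(kappa)

rng = np.random.default_rng(2)
n, K2, beta2, rho = 300, 30, 1.0, 0.8               # many instruments, strong simultaneity
est = {"tsls": [], "liml": []}
for _ in range(200):
    Z = rng.standard_normal((n, K2))
    v = rng.standard_normal(n)
    u = rho * v + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    y2 = Z @ np.full(K2, 0.1) + v                   # modest instrument strength
    y1 = beta2 * y2 + u
    t, l = tsls_liml(y1, y2, Z)
    est["tsls"].append(t); est["liml"].append(l)

# TSLS median drifts away from beta2 = 1; LIML stays close to median-unbiased.
print(np.median(est["tsls"]), np.median(est["liml"]))
```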
Table 8. n − K = 300, K₂ = 30, δ² = 100.

         Normal    α = 0                                    α = 5
                   LIML     MEL      TSLS     GMM           LIML     MEL      TSLS     GMM
X05      −1.65     −1.90    −2.16    −1.44    −1.53         −1.43    −1.52    −3.14    −3.24
L.QT     −0.67     −0.78    −0.90    −0.60    −0.66         −0.64    −0.69    −2.63    −2.65
MEDN      0         0.00    −0.02     0.00    −0.01          0.00    −0.02    −2.22    −2.22
U.QT      0.67      0.78     0.86     0.60     0.64          0.73     0.76    −1.77    −1.73
X95       1.65      1.93     2.14     1.46     1.56          1.98     2.14    −1.02    −0.96
IQR       1.35      1.56     1.76     1.19     1.30          1.37     1.45     0.86     0.92

Table 9. α = 1.

         Normal    n − K = 100, K₂ = 3, δ² = 5              n − K = 100, K₂ = 10, δ² = 10
                   LIML     MEL      TSLS     GMM           LIML     MEL      TSLS     GMM
X05      −1.65     −1.78    −1.84    −1.66    −1.68         −1.72    −2.16    −2.04    −2.09
L.QT     −0.67     −0.70    −0.73    −0.95    −0.97         −0.77    −0.90    −1.47    −1.59
MEDN      0        −0.08    −0.10    −0.51    −0.52         −0.06    −0.14    −1.09    −1.08
U.QT      0.67      0.81     0.80     0.02     0.02          1.00     0.94    −0.68    −0.64
X95       1.65      4.37     4.71     1.16     1.22          4.45     4.40     0.02     0.11
IQR       1.35      1.51     1.53     0.97     0.99          1.77     1.84     0.79     0.85

Fig. 7. n − K = 100, K₂ = 10, α = 1, δ² = 100, uᵢ = ‖zᵢ‖ϵᵢ.

Fig. 9. n − K = 100, K₂ = 10, α = 1, δ² = 100.
Fig. 8. n − K = 1000, K₂ = 100, α = 1, δ² = 100, uᵢ = ‖zᵢ‖ϵᵢ.

3.5. Effect of the difference between the structural coefficient and the error regression

Before the development of inference for the model of simultaneous equations, the structural coefficient, say β₂, was estimated by the sample regression of y₁ on y₂, that is, by Ordinary Least Squares (OLS). That estimation procedure could result in very biased estimates. The LIML and TSLS estimators were developed to improve on the OLS estimator, but the TSLS estimation ignores the information on Ω in (2.3). Table 8 and Fig. 12 compare the estimation procedures for some different values of α. The bias of the (normalized) TSLS estimator increases with α; the median goes from 0 at α = 0 to −2.22 at α = 5. The interquartile range goes from 1.19 at α = 0 to 0.86 at α = 5, as compared to the LIML estimator (from 1.56 to 1.37). However, the 95th percentile goes from 1.46 to −1.02; that is, when α is as large as 5, the probability of a negative estimate exceeds 0.95. In effect, α is a nuisance parameter: it can have a large effect on the bias of the TSLS estimator. In a sense the TSLS estimator has the defect of the OLS estimator, though not as extreme.

3.6. Effects of nonnormality and heteroscedasticity

Because the distributions of the estimators depend on the distributions of the disturbance terms, we have investigated the effects of nonnormality and heteroscedasticity of disturbances of the form E(u²ᵢ) = σ²(zᵢ). We use the normalization

F(x) = Pr([Π′₂₂Q₂₂.₁Π₂₂]^{1/2}(β̂₂ − β₂) ≤ x),   (3.5)
Among our many tables, we show Tables 5–7 and Figs. 7 and 8 as representative heteroscedastic cases (σ²(zᵢ) = ‖zᵢ‖²), following Hayashi (2000). We also show Table 4 and Figs. 5 and 6 for the representative nonnormal disturbances we have chosen (χ²-type and t distributions). From our tables, the comparisons of the distributions of the four estimators remain approximately valid even when the distributions of the disturbances are nonnormal or heteroscedastic in the sense specified above. The bias and skewness of the distributions have relatively large effects, and they often dominate the effects of nonnormality and heteroscedasticity. Thus the effects of heteroscedasticity and nonnormality of the disturbances on the exact distributions of the alternative estimators are of secondary importance in our setting.

When the disturbance terms are heteroscedastic with many instruments, Anderson et al. (2010) assumed a 6th-order moment condition for the disturbances and the key condition

plim_{n→∞} (1/n) ∑ᵢ₌₁ⁿ [p⁽ⁿ⁾ᵢᵢ − cₙ]² = 0,   (3.6)

where p⁽ⁿ⁾ᵢᵢ = (Z₂.₁A₂₂.₁⁻¹Z′₂.₁)ᵢᵢ and cₙ = K₂/n. They have shown that the LIML estimator still has desirable asymptotic properties. Two typical examples satisfying this condition are (i) the case cₙ → 0 (p⁽ⁿ⁾ᵢᵢ → 0) and (ii) the case of dummy variables whose components are all 1 or −1, so that (1/n)A₂₂.₁ = I_{K₂} and p⁽ⁿ⁾ᵢᵢ = K₂/n (i = 1, …, n). When (3.6) is not satisfied with many instruments, the LIML estimator may have some bias in extreme cases. Kunitomo (2008) has considered some modifications of the LIML estimation.⁴

Fig. 10. Nonlinear case I: n − K = 100, K₂ = 10, α = 1, δ² = 50.
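A numerical check of condition (3.6) for a given instrument design is straightforward. The following sketch uses a hypothetical Z₂.₁ of our own (taking no included exogenous variables, so A₂₂.₁ = Z′₂.₁Z₂.₁); with i.i.d. Gaussian instruments the sample analogue of the left side of (3.6) is close to zero:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K2 = 400, 20
Z21 = rng.standard_normal((n, K2))     # hypothetical Z_{2.1}

A221 = Z21.T @ Z21
# p_ii^(n): diagonal of the projection matrix Z_{2.1} A_{22.1}^{-1} Z_{2.1}'
p = np.einsum('ij,jk,ik->i', Z21, np.linalg.inv(A221), Z21)
c_n = K2 / n
crit = float(np.mean((p - c_n) ** 2))  # sample analogue of the left side of (3.6)

print(round(float(p.sum()), 6), round(crit, 6))  # trace equals K2; criterion is near zero
```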
Fig. 11. Nonlinear case II: n − K = 100, K₂ = 10, α = 1, δ² = 50.

3.7. On the nonlinear LIML estimator

Whether our observations on the alternative estimators are specific to linear structural equation models is an interesting question. Although there are many possible nonlinear models, we report only some results on two nonlinear cases with two endogenous variables (y₁ᵢ, y₂ᵢ; i = 1, …, n). The first case of nonlinearity is a linear structural equation with nonlinear instruments
E[ (w′ᵢ, (wᵢ⊗²)′)′ (y₁ᵢ − β₂y₂ᵢ) | wᵢ ] = 0,   (3.7)

where y₂ᵢ = (w′ᵢ, (wᵢ⊗²)′)Π₂ + v₂ᵢ, wᵢ = (w₁ᵢ, …, w_mᵢ)′, wᵢ⊗² = (w²₁ᵢ, …, w²_mᵢ)′, wᵢ follows N(0, I_m), and E[·|wᵢ] is the conditional expectation given wᵢ. We have set m = 5, so that the number of instruments is K = 10. As the second nonlinear example, we have a nonlinear structural equation
E[zᵢ(y₁ᵢ − β₂y²₂ᵢ) | zᵢ] = 0,   (3.8)

where y₂ᵢ = z′ᵢΠ₂ + v₂ᵢ, zᵢ follows d(1, …, 1)′ + N_K(0, Ω_z) (d is a constant), and
Fig. 12. n − K = 300, K₂ = 30, α = 0, δ² = 100.
where Q₂₂.₁ = Z′₂[R − RZ₁(Z′₁RZ₁)⁻¹Z′₁R]Z₂, R = Z(Z′ΣZ)⁻¹Z′ and Σ = diag(σ²(zᵢ)). When σ²(zᵢ) = σ², we have Q₂₂.₁ = σ⁻²A₂₂.₁ in (3.1). The limit of (3.5) is N_{G₂}(0, I_{G₂}) for the MEL and GMM estimators in the large sample asymptotics. In this case the asymptotic variance–covariance matrices for the LIML and TSLS estimators could be slightly larger than those of the MEL and GMM estimators.
4 When we have time series data for the simultaneous equations model, some parametric models for the heteroscedasticity of disturbances have been developed, as one referee pointed out. We have examined some possibilities, including stationary GARCH models, and have found that the essential conclusions on the alternative estimators are unchanged. In order to obtain the limiting normality of the LIML estimator, we need some moment conditions for the disturbances, and thus careful analysis of stationarity, for instance. Since the problem is related to fast-growing concerns in time series econometrics, we do not discuss it in detail here. See McAleer (2005), for instance.
Ω_z = [ 0   0
        0   I_{K−1} ].

In this case we have set the number of instruments K = 10 and used zᵢ⊗² as the instrumental variables in the nonlinear estimation. We have used the nonlinear LIML, MEL and GMM estimators mentioned at the end of Section 2. We give the cdf's of the nonlinear LIML, MEL and GMM estimators for Case 1 and Case 2 in Figs. 10 and 11, respectively. We have normalized the estimators of the coefficient β₂ as in the linear cases so that we can compare the finite sample properties of the alternative estimators. As for the parameters, α was constructed as before, and δ² was constructed so that the resulting normalized LIML estimator has the limiting N(0, 1) distribution in the large sample asymptotics. Since the methods of evaluating the cdf's are basically the same as in the linear cases, we omit the details. The most important observation is that the finite sample properties of the nonlinear LIML, MEL and GMM estimators are similar to those we have discussed for the linear cases.

4. Discussion on distributions of estimators

4.1. The moments and Monte Carlo experiments

We have mentioned the fact that some estimators do not necessarily possess exact moments under reasonable assumptions. The first moment of a scalar random variable X is said to be infinite, or not to exist, if for any given positive constant c there is a constant a such that

∫₋ₐᵃ |x| dF(x) > c,

where F(·) is the cdf of X. In this case E(X) is not defined as a finite number. However, a Monte Carlo experiment can still be conducted and the sample mean calculated. What kind of conclusion can be drawn? As a simple illustration of the problem of interpreting Monte Carlo experiments, we take i.i.d. observations Xᵢ (i = 1, …, n) from N(θ⁻¹, 1) with θ ≠ 0. As a reasonable estimator of θ we take θ̂ = X̄ₙ⁻¹, where X̄ₙ = n⁻¹∑ᵢ₌₁ⁿ Xᵢ. Then

√n θ⁻²(X̄ₙ⁻¹ − θ) →d N(0, 1),

but we can still calculate the sample bias and MSE in Monte Carlo experiments even though we know that the MSE is +∞. We have confirmed in our experiments that the sample bias and MSE of θ̂ calculated from Monte Carlo experiments are not stable and are not reliable quantities, even on average over many replications. The extension of the above example to the problem of estimating simultaneous equations can be made. It suggests that before conducting a Monte Carlo experiment, a mathematical study should be carried out to verify that the parameter being estimated actually exists (see Mariano and Sawa (1972) for an early development). Our method of analysis in this paper is free from this issue because we compare the estimators on the basis of probabilities.

4.2. Approximations of finite sample distributions

The exact distribution functions of the alternative estimators in the general case are very complicated, but it is possible to derive asymptotic expansions of their density functions, as shown by Anderson (1974), Anderson and Sawa (1973), and Kunitomo and Matsushita (2009). Although the asymptotic expansions in the general case (G₂ ≥ 1) look complicated even for the linear simultaneous equations, they give some useful information, in particular when G₂ = 1 and the disturbances are normally distributed. In the (standard) large sample asymptotics, the noncentrality (or concentration) parameter divided by n is assumed to approach a limit as n → ∞. It is convenient to use the noncentrality parameter given by

μ² = (1 + α²) Π′₂₂A₂₂.₁Π₂₂ / ω₂₂ = (1 + α²)δ²   (4.1)

and the semi-parametric parameter given by

τ = 2 [(1 + α²)/ω₂₂] (1, 0)Q₁₁⁻¹QD′FDQQ₁₁⁻¹(1, 0)′,   (4.2)

where Q = (D′MD)⁻¹, Q₁₁ = (Π′₂₂M₂₂.₁Π₂₂)⁻¹, M₂₂.₁ = plim_{n→∞} n⁻¹A₂₂.₁, M = plim_{n→∞} n⁻¹∑ᵢ₌₁ⁿ zᵢz′ᵢ, F = plim_{n→∞} n⁻¹∑ᵢ₌₁ⁿ z′ᵢ[M⁻¹ − D(D′MD)⁻¹D′]zᵢ zᵢz′ᵢ, and D = [Π₂, (I_{G₂}, O)′] is a (K₁ + K₂) × (G₂ + K₁) coefficient matrix. We need this semi-parametric factor because we estimate the variance–covariance matrix in the MEL and GMM estimation. For the GMM estimator, an asymptotic expansion of its distribution function as n → ∞ (K₂ fixed) when the disturbances are normally distributed (N(0, σ²)) and μ² is proportional to n is given by

P( [Π′₂₂A₂₂.₁Π₂₂]^{1/2} σ⁻¹ (β̂₂.GMM − β₂) ≤ x )
  = Φ(x) + { −(α/μ)[x² − (K₂ − 1)] − (1/(2μ²))[(τ + (K₂ − 1)²α² − (K₂ − 1))x + (1 − 2K₂α²)x³ + α²x⁵] } φ(x) + O(μ⁻³).   (4.3)

For the MEL estimator, an asymptotic expansion of its distribution function as n → ∞ (K₂ fixed) is given by

P( [Π′₂₂A₂₂.₁Π₂₂]^{1/2} σ⁻¹ (β̂₂.MEL − β₂) ≤ x )
  = Φ(x) + { −(α/μ)x² − (1/(2μ²))[(τ + K₂ − 1)x + (1 − 2α²)x³ + α²x⁵] } φ(x) + O(μ⁻³),   (4.4)
where Φ(·) and φ(·) are the cdf and the density function of the standard normal distribution, respectively. The asymptotic expansions of the distributions of the TSLS and LIML estimators are (4.3) and (4.4), respectively, with τ = 0; see Anderson and Sawa (1973) and Anderson (1974). They agree with Fujikoshi et al. (1982) for the LIML and TSLS estimators (G₂ ≥ 1). Because τ > 0, the contribution of the semiparametric methods is the additional term τ/μ² in the asymptotic mean squared errors (AMSE). As a numerical illustration we give Fig. 9, which shows the finite sample distributions and the approximate distributions of the LIML, MEL and GMM estimators in normalized form as in (3.1). Since the limiting distributions of the above estimators are N(0, 1) in the large sample asymptotics, the N(0, 1) benchmark is denoted by ''o''.

4.3. An alternative approximation

As we have shown in Anderson et al. (2010) (Part I of our study), there is an alternative asymptotic theory for the case in which the number of excluded instruments K₂ (say K₂ₙ) depends on the sample size n. Kunitomo (1980, 1982) and Morimune (1983) were the earlier developers of this theory. For more recent developments, see Bekker (1994) and Chao and Swanson (2005),
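The x = 0 evaluations of the expansions are easy to check numerically. The following is our own illustration (not the authors' code), taking K₂ = 10, α = 1, δ² = 100 as in Fig. 9, so that μ² = 200; the O(μ⁻²) terms of (4.3) and (4.4) all carry a factor x and vanish at x = 0:

```python
import math

def phi(x):  # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):  # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

alpha, K2, delta2 = 1.0, 10, 100.0
mu = math.sqrt((1.0 + alpha**2) * delta2)      # mu^2 = (1 + alpha^2) delta^2 = 200, cf. (4.1)

# O(1/mu) terms of (4.3) and (4.4) evaluated at x = 0:
gmm_at_0 = Phi(0.0) - (alpha / mu) * (0.0**2 - (K2 - 1)) * phi(0.0)  # 1/2 + alpha(K2-1)/(mu sqrt(2 pi))
mel_at_0 = Phi(0.0) - (alpha / mu) * 0.0**2 * phi(0.0)               # exactly 1/2

print(round(gmm_at_0, 3), round(mel_at_0, 3))  # GMM already far from median-unbiased
```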
Fig. 13. n − K = 300, K₂ = 30, α = 1, δ² = 50.
Fig. 14. n − K = 1000, K₂ = 100, α = 1, δ² = 100.
for instance. We consider a typical approximation when K₂ₙ/n → c (0 ≤ c < 1) (and μ²/n is approximately constant) as n → ∞. For the LIML estimator, an asymptotic expansion of its distribution function when G₂ = 1 as n → ∞ is

P( √n (β̂₂.LI − β₂) ≤ x ) = Φ_Ψ(x) − (1/√n)(|Ω|^{1/2}/σ²) α x² φ_Ψ(x) + O(n⁻¹),   (4.5)

where Ω = E(vᵢv′ᵢ), and Φ_Ψ(·) and φ_Ψ(·) are the cdf and the density function of N(0, Ψ), respectively, with

Ψ = σ²Φ₂₂.₁⁻¹ + c∗ Φ₂₂.₁⁻¹|Ω|Φ₂₂.₁⁻¹,   (4.6)

c∗ = c/(1 − c), and Φ₂₂.₁ = lim_{n→∞} (1/n)Π′₂₂A₂₂.₁Π₂₂.

By setting x = 0 in (4.3) for the GMM estimator, we have 1/2 + α(K₂ − 1)/[μ√(2π)] + O(μ⁻³). By setting x = 0 in (4.4) and (4.5) for the MEL and LIML estimators, we have 1/2 + O(μ⁻³) and 1/2 + O(n⁻¹), respectively. When α ≠ 0, the bias of the GMM estimator (and the TSLS estimator) is proportional to K₂/μ, which increases rapidly if K₂ is large in comparison with the noncentrality μ². If μ² is proportional to K₂, for instance, the left-hand-side probability is far from 1/2 whenever α ≠ 0. On the other hand, the MEL and LIML estimators are almost median-unbiased, and this property holds even if K₂ₙ is proportional to n. As numerical illustrations, we give the approximations based on the asymptotic expansion (4.5) up to O(n⁻¹/²), labeled ''Asymptotic-Expansion (large-K₂)'', in Table 10 and Figs. 13 and 14, which give several approximations of the finite sample distributions of the LIML estimator when K₂ is relatively large. As we might expect, in these cases the normal approximation based on the large-K₂ theory (discussed in Part I) is better than the normal approximation based on the standard large sample theory. (In Table 10 and Figs. 13 and 14 we have used h = 1 + (n/μ²)[K₂/(n − K)], which is approximately Ψ × Φ₂₂.₁/σ², as the normalized variance of the limiting distribution.) We also find that the approximations based on (4.5) are even better than the normal approximations. This observation has an important implication for the testing problem (see Matsushita (2006)).

5. Conclusions

First, the distributions of the MEL and GMM estimators are asymptotically equivalent in the sense of the limiting distributions in the standard large sample asymptotic theory, but their exact distributions are substantially different in finite samples. The relation between their distributions is quite similar to that between the distributions of the LIML and TSLS estimators. The MEL and LIML estimators are to be preferred to the GMM and TSLS estimators if K₂ is large. In some microeconometric models and models of panel data, it is a common feature that K₂ is fairly large. For such situations we have shown (Anderson et al. (2010)) that the LIML estimator has asymptotic optimality in the large-K₂ asymptotics sense. Stronger conditions seem to be needed for the MEL estimator, but its finite sample properties are often similar to those of the corresponding LIML estimator.⁵

Second, the large-sample normal approximation in the large-K₂ asymptotic theory is relatively accurate for the MEL and LIML estimators. Hence the usual methods based on asymptotic standard deviations often give reasonable inferences. On the other hand, for the GMM and TSLS estimators the sample size should be very large to justify the use of procedures based on normality, in particular when K₂ is large.

Third, it is recommended to use the probability of concentration as a criterion of comparison, because some estimators do not possess any (exact) moments, and hence the sample bias and mean squared errors of such estimators in Monte Carlo simulations can be unstable and unreliable. This is why we directly considered the finite sample distribution functions of the alternative estimation methods. The probability criterion we have adopted roughly corresponds to a bounded loss function.

To summarize, the most important conclusion from this study of the small sample distributions of the four alternative estimators is that the GMM and TSLS estimators can be badly biased in some cases, and in that sense their use is risky. The MEL and LIML estimators, on the other hand, may have a little more variability, with some chance of extreme values, but their distributions are centered at the true parameter value. The LIML estimator has tighter tails than the MEL estimator, and in this sense the former is more attractive than the latter. Besides, the computational burden of the LIML estimation is not heavy. It is interesting that the LIML estimation was initially invented by Anderson and Rubin (1949, 1950). Other estimation methods, including the TSLS, GMM, and MEL estimation methods, have since been developed with several different motivations and purposes. Now there are practical situations in econometric applications where the LIML estimation has an advantage over other estimation methods.⁶
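The Monte Carlo instability behind the third point (the θ̂ = X̄ₙ⁻¹ example of Section 4.1, whose exact MSE is +∞) can be reproduced directly: batch-to-batch sample MSEs refuse to settle down even with many replications. This is a sketch of our own with arbitrary seed and design values:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 2.0, 50, 2000

def batch_mse():
    # theta_hat = 1 / Xbar_n with X_i ~ N(1/theta, 1); the exact MSE of theta_hat is +inf,
    # because Xbar_n puts positive density near zero.
    xbar = rng.normal(1.0 / theta, 1.0, size=(reps, n)).mean(axis=1)
    return float(np.mean((1.0 / xbar - theta) ** 2))

mses = [batch_mse() for _ in range(10)]
print([round(m, 2) for m in mses])   # sample MSEs vary erratically across batches
```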
5 We have reported the results for the estimation problem, but they have a number of important implications for the testing problem. See Morimune (1989), Matsushita (2006) and Anderson and Kunitomo (2007), for instance.
6 Although we have investigated the nonlinear models and the heteroscedastic models to some extent, the question remains to what extent our conclusions are
Table 10. n − K = 300, K₂ = 30, α = 1, δ² = 50. 'A.exp' denotes the approximation based on the asymptotic expansion (large-K₂); 'Diff' is ''Asymptotic-Expansion'' minus LIML. See Section 4.2.

x       LIML    MEL     GMM     N(0,1)  N(0,h)   A.exp    Diff
−3.0    0.000   0.002   0.042   0.001   0.005    0.000     0.000
−2.5    0.003   0.010   0.165   0.006   0.015    0.000    −0.003
−2.0    0.019   0.034   0.417   0.023   0.041    0.011    −0.008
−1.4    0.085   0.110   0.736   0.081   0.112    0.080    −0.005
−1.0    0.170   0.195   0.870   0.159   0.193    0.169    −0.001
−0.8    0.227   0.248   0.918   0.212   0.244    0.226    −0.001
−0.6    0.291   0.308   0.950   0.274   0.301    0.291    −0.001
−0.4    0.359   0.369   0.971   0.345   0.364    0.359     0.000
−0.2    0.430   0.432   0.983   0.421   0.431    0.430     0.000
0       0.500   0.494   0.991   0.500   0.500    0.500     0.000
0.2     0.567   0.556   0.996   0.579   0.569    0.568     0.001
0.4     0.630   0.616   0.998   0.655   0.636    0.631     0.000
0.6     0.687   0.670   1.000   0.726   0.699    0.688     0.001
0.8     0.739   0.718   1.000   0.788   0.756    0.739     0.000
1.0     0.783   0.756   1.000   0.841   0.807    0.783     0.000
1.4     0.852   0.822   1.000   0.919   0.888    0.855     0.003
2.0     0.920   0.895   1.000   0.977   0.959    0.928     0.008
2.5     0.953   0.931   1.000   0.994   0.985    0.964     0.011
3.0     0.972   0.953   1.000   0.999   0.995    0.985     0.013

X05     −1.63   −1.82   −2.95   −1.65   −1.90    −1.59     0.04
L.QT    −0.75   −0.79   −2.30   −0.67   −0.78    −0.72     0.03
MEDN     0.00    0.02   −1.85    0.00    0.00     0.00     0.00
U.QT     0.85    0.97   −1.37    0.67    0.78     0.85     0.00
X95      2.48    2.94   −0.60    1.65    1.90     2.27    −0.21
IQR      1.60    1.76    0.93    1.35    1.56     1.57    −0.03
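The variance h used for the N(0, h) column of Table 10 can be reproduced from the formula in Section 4.3. Note that the total sample size is n = (n − K) + K; as our own assumption (the table does not state K₁), we take K = K₂ = 30, i.e. no additional included exogenous variables:

```python
import math

n_minus_K, K2, alpha, delta2 = 300, 30, 1.0, 50.0
K = K2                                     # assumed: K1 = 0, so K = K2 (not stated in the table)
n = n_minus_K + K                          # = 330 under that assumption
mu2 = (1.0 + alpha**2) * delta2            # = 100, cf. (4.1)
h = 1.0 + (n / mu2) * (K2 / n_minus_K)     # variance of the large-K2 normal approximation

x95 = 1.6449 * math.sqrt(h)                # 95th percentile of N(0, h)
iqr = 2 * 0.6745 * math.sqrt(h)            # interquartile range of N(0, h)
print(round(h, 2), round(x95, 2), round(iqr, 2))  # close to the N(0, h) column (1.90, 1.56)
```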
Appendix
Notes on tables. In Tables 1–9 the distributions are tabulated in standardized terms, that is, in terms of (3.1) or (3.5). The tables include the three quartiles, the 5th and 95th percentiles, and the interquartile range of the distribution for each case. Since the limiting distributions of (3.1) or (3.5) for the MEL and GMM estimators in the standard large sample asymptotic theory are N(0, 1) as n → ∞, we added the standard normal case as the benchmark. In Table 10 we also give the normal approximations based on the large-K₂ theory and the approximations based on the asymptotic expansions.

Notes on figures. In the figures the cdf's of the LIML, MEL and GMM estimators are shown in standardized terms, that is, in terms of (3.1) or (3.5) in the linear models. (The cdf of the TSLS estimator is quite similar to that of the GMM estimator in all cases, and it was omitted in many cases.) Fig. 1, with G₂ = 2, gives the two marginal distributions in (3.1) or (3.5); the other figures are for G₂ = 1. Dotted lines are used for the distributions of the GMM estimator. For comparative purposes we give the standard normal distribution as the benchmark in each case. In Figs. 13 and 14 we also give the normal approximations based on the large-K₂ theory and the approximations based on the asymptotic expansions. We have used a similar method for the heteroscedastic cases and the nonlinear cases.

References

Amemiya, T., 1985. Advanced Econometrics. Blackwell.
Anderson, T.W., 1974. An asymptotic expansion of the distribution of the limited information maximum likelihood estimate of a coefficient in a simultaneous equation system. Journal of the American Statistical Association 69, 565–573.
valid in more general situations including the case of extremely weak instruments. There are many related important issues, which should be future topics.
Anderson, T.W., Kunitomo, N., 2007. On likelihood ratio tests of structural coefficients: Anderson–Rubin (1949) revisited. Discussion Paper CIRJE-F-499. Graduate School of Economics, University of Tokyo. http://www.e.utokyo.ac.jp/cirje/research/dp/2007.
Anderson, T.W., Kunitomo, N., Matsushita, Y., 2005. A new light from old wisdoms: alternative estimation methods of simultaneous equations and microeconometric models. Discussion Paper CIRJE-F-399. Graduate School of Economics, University of Tokyo.
Anderson, T.W., Kunitomo, N., Matsushita, Y., 2010. On asymptotic optimality of the LIML estimator with possibly many instruments. Journal of Econometrics 157, 191–204.
Anderson, T.W., Kunitomo, N., Morimune, K., 1986. Comparing single-equation estimators in a simultaneous equation system. Econometric Theory 2, 1–32.
Anderson, T.W., Kunitomo, N., Sawa, T., 1982. Evaluation of the distribution function of the limited information maximum likelihood estimator. Econometrica 50, 1009–1027.
Anderson, T.W., Rubin, H., 1949. Estimation of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics 20, 46–63.
Anderson, T.W., Rubin, H., 1950. The asymptotic properties of estimates of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics 21, 570–582.
Anderson, T.W., Sawa, T., 1973. Distributions of estimates of coefficients of a single equation in a simultaneous system and their asymptotic expansion. Econometrica 41, 683–714.
Anderson, T.W., Sawa, T., 1979. Evaluation of the distribution function of the two-stage least squares estimate. Econometrica 47, 163–182.
Angrist, J.D., Krueger, A., 1991. Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics 106, 979–1014.
Bekker, P.A., 1994. Alternative approximations to the distributions of instrumental variables estimators. Econometrica 62, 657–681.
Bound, J., Jaeger, D.A., Baker, R.M., 1995. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variables is weak. Journal of the American Statistical Association 90, 443–450.
Chao, J., Swanson, N., 2005. Consistent estimation with a large number of weak instruments. Econometrica 73, 1673–1692.
Davidson, R., MacKinnon, J.G., 1993. Estimation and Inference in Econometrics. Oxford University Press.
Fujikoshi, Y., Morimune, K., Kunitomo, N., Taniguchi, M., 1982. Asymptotic expansions of the distributions of the estimates of coefficients in a simultaneous equation system. Journal of Econometrics 18 (2), 191–205.
Guggenberger, P., 2005. Finite-sample evidence suggesting a heavy tail problem of the generalized empirical likelihood estimator. Unpublished manuscript. Department of Economics, UCLA.
Hansen, L., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.
Hayashi, F., 2000. Econometrics. Princeton University Press.
T.W. Anderson et al. / Journal of Econometrics 165 (2011) 58–69

Kunitomo, N., 1980. Asymptotic expansions of distributions of estimators in a linear functional relationship and simultaneous equations. Journal of the American Statistical Association 75, 693–700.
Kunitomo, N., 1982. Asymptotic efficiency and higher order efficiency of the limited information maximum likelihood estimator in large econometric models. Technical Report No. 365. Institute for Mathematical Studies in the Social Sciences, Stanford University.
Kunitomo, N., 1987. A third order optimum property of the ML estimator in a linear functional relationship model and simultaneous equation system in econometrics. Annals of the Institute of Statistical Mathematics 39, 575–591.
Kunitomo, N., 2008. Improving the LIML estimation with many instruments and persistent heteroscedasticities. Discussion Paper CIRJE-F-576. Graduate School of Economics, University of Tokyo.
Kunitomo, N., Matsushita, Y., 2009. Asymptotic expansions and higher order properties of semi-parametric estimators in a system of simultaneous equations. Journal of Multivariate Analysis 100 (8), 1727–1751.
Mariano, R.S., 1982. Analytical small-sample distribution theory in econometrics: the simultaneous equations case. International Economic Review 23, 503–533.
Mariano, R.S., Sawa, T., 1972. The exact finite sample distribution of the limited information maximum likelihood estimator in the case of two included endogenous variables. Journal of the American Statistical Association 67, 159–163.
Matsushita, Y., 2006. Estimation and testing in a structural equation with possibly many instruments. Unpublished Ph.D. Dissertation. Graduate School of Economics, University of Tokyo.
McAleer, M., 2005. Automated inference and learning in modeling financial volatility. Econometric Theory 21, 232–261.
Mittelhammer, R., Judge, G., Schoenberg, R., 2005. Empirical evidence concerning the finite sample performance of EL-type structural equation estimation and inference methods. In: Andrews, D., Stock, J. (Eds.), Identification and Inference for Econometric Models. Cambridge University Press, pp. 285–305.
Morimune, K., 1983. Approximate distributions of k-class estimators when the degree of overidentification is large compared with sample size. Econometrica 51 (3), 821–841.
Morimune, K., 1985. Estimation and testing in econometric models. Kyoritsu, Tokyo (in Japanese).
Morimune, K., 1989. t test in a structural equation. Econometrica 57 (6), 1341–1360.
Newey, W.K., Smith, R., 2004. Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72, 219–255.
Owen, A.B., 2001. Empirical Likelihood. Chapman and Hall.
Phillips, P.C.B., 1980. The exact finite sample density of instrumental variables estimators in an equation with n + 1 endogenous variables. Econometrica 48, 861–878.
Qin, J., Lawless, J., 1994. Empirical likelihood and general estimating equations. The Annals of Statistics 22, 300–325.
Staiger, D., Stock, J., 1997. Instrumental variables regression with weak instruments. Econometrica 65, 557–586.
Takeuchi, K., Morimune, K., 1985. Third order efficiency of the extended maximum likelihood estimators in a simultaneous equation system. Econometrica 53, 177–200.
Journal of Econometrics 165 (2011) 70–86
Instrumental variable estimation in the presence of many moment conditions✩

Ryo Okui∗

Institute of Economic Research, Kyoto University, Yoshida-Honmachi, Sakyo, Kyoto 606-8501, Japan
Article history: Available online 13 May 2011
JEL classification: C21; C31
Keywords: TSLS; LIML; Shrinkage estimator; Instrumental variables
Abstract: This paper develops shrinkage methods for addressing the ''many instruments'' problem in the context of instrumental variable estimation. It has been observed that instrumental variable estimators may behave poorly if the number of instruments is large. This problem can be addressed by shrinking the influence of a subset of instrumental variables. The procedure can be understood as a two-step process of shrinking some of the OLS coefficient estimates from the regression of the endogenous variables on the instruments, then using the predicted values of the endogenous variables (based on the shrunk coefficient estimates) as the instruments. The shrinkage parameter is chosen to minimize the asymptotic mean square error. The optimal shrinkage parameter has a closed form, which makes it easy to implement. A Monte Carlo study shows that the shrinkage method works well and performs better in many situations than do existing instrument selection procedures. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

This paper proposes new solutions to the problem of instrumental variable (IV, hereafter) estimation in the presence of many instruments. In this situation, we can estimate the model and make some inferences using a minimal subset of instruments. However, with a small number of instruments, we lose efficiency, which results in relatively large standard errors. We might try to increase the number of instruments in order to reduce the standard error of the estimate. It turns out that this approach may be misleading in finite samples. An IV estimator with many instruments may behave poorly and can be sensitive to the number of instruments. In particular, the two-stage least squares (TSLS, hereafter) estimator generates a bias whose order is proportional to the number of instruments (e.g., see Kunitomo (1980), Morimune (1983), or Bekker (1994)).1

✩ A previous version of this paper was circulated under the title ''Shrinkage methods for instrumental variable estimation''. This study is part of the author's dissertation project at the University of Pennsylvania. The author gratefully acknowledges the hospitality of Yale University, where part of this paper was written. The author thanks his advisor, Yuichi Kitamura, for his helpful guidance. Gregory Kordas, Petra Todd and Frank Schorfheide also provided helpful supervision. The author obtained valuable comments from two anonymous referees, Dylan Small, Yoshihiko Nishiyama, Sokbae Lee, Donald Andrews, Peter Phillips, Taisuke Otsu, Whitney Newey, Naoto Kunitomo, members of the Penn Empirical Micro Discussion Group, and seminar participants at various institutes. The author is also indebted to Claire Lim, Shalini Roy and Jason Harburger. The author acknowledges financial support from the Research Grants Council of Hong Kong under Project No. HKUST643907. All remaining errors are the author's.
∗ Tel.: +81 75 753 7191. E-mail address: [email protected].
1 Morimune (1985) is a good reference that summarizes the development of research on this topic up to the mid-1980s. Unfortunately, the book is available only in Japanese.

An example where this problem occurs in
empirical work is the paper by Angrist and Krueger (1991).2 Bound et al. (1996) illustrate how the problem of many instruments arises in Angrist and Krueger's (1991) work.3

Existing solutions to the ''many instruments'' problem usually involve instrument selection. Donald and Newey (2001) propose minimizing the asymptotic mean square error as a criterion for choosing the number of instruments. Small (2002) proposes a criterion function motivated by the Akaike Information Criterion for choosing instruments. Hall and Peixe (2003) also consider another information criterion for instrument selection. Their information criterion consists of two terms. The first term is based on the canonical correlations, which measure the relevance of moment conditions. The second term penalizes the number of moment conditions.

This paper introduces a new procedure for IV estimation based on shrinkage methods. That is, we reconstruct the estimating equation of an IV estimator, which is a weighted sum of sample moment conditions, by shrinking some elements of the weighting vector.

2 They estimate the return to an additional year of schooling. They show that quarter of birth can be an instrument for years of schooling. Their set of instruments includes quarter-of-birth variables and their interactions with year-of-birth and state-of-birth variables. The number of (excluded) instruments is 180 in one of their specifications.
3 Even though Bound et al. (1996) emphasize the weak instrument problem, Table 1 in their paper indicates that Angrist and Krueger's (1991) data do not suffer from the bias of the TSLS estimator if we use a minimal subset of instruments. See also Hansen et al. (2008). Actually, there are two problems: one is the ''many instruments'' problem and the other is that the additional instruments are weak. This paper focuses on the ''many instruments'' problem. Chao and Swanson (2005) and Stock and Yogo (2005) discuss the consequences of a large number of weak instruments, and Anderson et al. (forthcoming) provide extensive simulation results.

This idea can also be interpreted as shrinking part of the
OLS coefficient estimates from the regression of the endogenous variables on the instruments and then using the predicted values of the endogenous variables, based on the shrunk coefficient estimates, as the instruments.

One nontrivial question is how to choose the shrinkage parameter. We propose to choose the shrinkage parameter by minimizing the Nagar (1959)-type approximation of the mean square error. The optimal shrinkage parameter has a closed form, which makes it easy to implement. Alternatively, we may consider choosing the shrinkage parameter in a similar way to the well-known James–Stein estimator. However, the James–Stein shrinkage rule is not optimal, and in shrinkage TSLS estimation there is a crucial difference between the two. Note that the James–Stein shrinkage rule has just an order-K term, where K is the number of instruments; however, the optimal shrinkage parameter has an order-K² term. The shrinkage parameter given by the James–Stein shrinkage rule is therefore larger than desired when the number of instruments is large. This shows the importance of the mean square error calculation in choosing the shrinkage parameter.

In the statistical literature, it has been observed that shrinkage methods perform well, and, moreover, they often work better than selection methods (e.g., see Hastie et al. (2001), Section 3.4.5). The key decision involved in selection methods is which instruments to discard. Even though we alleviate the many-instruments problem by doing so, we also ignore the information that the discarded instruments might reveal. On the other hand, shrinkage methods not only mitigate the many-instruments problem but also enable the use of the information that would be lost by discarding variables. Shrinkage procedures can thus become excellent alternatives to selection methods in IV estimation.

A limitation of the shrinkage method proposed here is that it requires us to specify the set of ''main'' IVs which are a priori known to be strong.
While this requirement may be restrictive in some applications, there are situations in which it is possible to specify the set of ''main'' IVs in a natural way. For example, Angrist and Krueger (1991) use quarter-of-birth variables and their interactions with year-of-birth or state-of-birth variables as instruments. In this case, the quarter-of-birth variables may be considered as ''main'' instruments and the interactions may be considered as other instruments. We note that selection methods such as those of Donald and Newey (2001) typically require a different assumption, namely that an ordering of instruments is prespecified, to make them computationally feasible and to justify the method theoretically.

Even though there is hardly any literature that explicitly considers the application of shrinkage methods in IV estimation, Chamberlain and Imbens (2004) consider a procedure, called random effect quasi-maximum likelihood (REQML), which could be categorized as a shrinkage method. They impose a random effect structure on the coefficients in the regression of the endogenous variable on the instruments and then maximize the likelihood that takes the random effect structure into account. REQML has several attractive features, such as being interpretable as a Bayes procedure. However, extending the idea of their paper to different settings may not be trivial. For example, the appropriate way to construct the likelihood function of a conditional moment restriction model with conditional heteroscedasticity is not necessarily clear. Moreover, it is also not clear what the appropriate way would be to impose a random effect structure in such a model. The procedure presented here can be extended to different models, as shown by Okui (2005), who considers conditional moment restriction models and dynamic panel data models. Another limitation of REQML is that handling a situation with multiple endogenous regressors is not straightforward.
On the other hand, the method considered in this paper is applicable
to such a situation. Finally, we derive an approximation of the mean square error of the estimator and choose the shrinkage parameter by minimizing the approximate mean square error, while Chamberlain and Imbens (2004) do not consider the mean square error of the estimator.

The kernel-weighted GMM in ARMA models by Kuersteiner (unpublished manuscript) is also related to the ideas explored here.4 Another related paper is Carrasco (unpublished manuscript). Her idea is different from the one considered here. Her approach involves regularization of the inverse of the covariance matrix of instruments while our approach is to shrink some of the coefficient estimates in the first-stage regression.

The rest of this paper is organized as follows. The next section introduces the shrinkage TSLS estimator, explains the motivation for the method and presents the theoretical results. Section 3 proposes the shrinkage limited information maximum likelihood estimator. Results from Monte Carlo experiments are included in Section 4. Discussions and possible extensions are presented in Section 5.

We use the following notation throughout the paper. For a sequence of vectors, {Ai}, we define A as A = (A1′, A2′, . . . , An′)′. For a matrix A, we define ‖A‖ = √tr(A′A) (the usual Euclidean norm), and PA = A(A′A)⁻¹A′.

2. The shrinkage TSLS estimator

2.1. Model and procedure

Following Donald and Newey (2001), we consider the model:
$$y_i = Y_i'\gamma + x_{1i}'\beta + \epsilon_i = W_i'\delta + \epsilon_i,$$
$$W_i = \begin{pmatrix} Y_i \\ x_{1i} \end{pmatrix} = f(x_i) + u_i = \begin{pmatrix} E(Y_i \mid x_i) \\ x_{1i} \end{pmatrix} + \begin{pmatrix} \eta_i \\ 0 \end{pmatrix}, \quad i = 1, \ldots, N,$$
where yi is a scalar outcome variable, Yi is a d1 × 1 vector of endogenous variables, xi is a vector of exogenous variables, ϵi and ui are unobserved random variables with second moments which do not depend on xi, and f is an unknown function of xi. Let fi = f(xi). The set of instruments is (Xi′, Z̄i′). Xi is an m × 1 vector of main instruments and Z̄i is a K × 1 vector of other instruments. They are functions of xi. The included exogenous variable, x1i, is a part of Xi. We employ this semiparametric structure because it allows us to easily analyze the model with many instruments. Another reason is that this paper intends to compare instrument selection methods and shrinkage methods, and, to this end, it is better to use the same structure as Donald and Newey (2001), whose selection method will be compared with the shrinkage procedures in the Monte Carlo section.

In the current model, the asymptotic variance of a √N-consistent regular estimator cannot be smaller than σ²H̄⁻¹, where σ² = E(ϵi²|xi) and H̄ = E(fi fi′) (Chamberlain, 1987). It can be achieved if fi can be written as a linear combination of the instruments.

4 We note that there is a pair of kernel functions and bandwidths under which the kernel-weighted GMM and the shrinkage TSLS are equivalent. They are K(u) = 1 for |u| < c and K(u) = s for |u| ≥ c, where s is the shrinkage parameter, c is equal to the ratio of the number of main instruments to the total number of instruments, and the bandwidth is equal to the total number of instruments. We note that the choices of bandwidth and shrinkage parameter are not equivalent. Roughly speaking, the shrinkage TSLS chooses the kernel function given the bandwidth, whereas the kernel-weighted GMM chooses the bandwidth given the kernel function. Thus, there is a fundamental difference between the kernel-weighted GMM and shrinkage methods. The kernel-weighted GMM can be regarded as a way to exploit all information from the order of instruments, which is clear in ARMA models, while this paper implicitly considers situations where the order is not clear.

Likewise, if there is a linear combination of the
instruments that is close to fi, then the asymptotic variance of the IV estimator is small. This observation implies that using many instruments is desirable in terms of asymptotic variance. However, an IV estimator with many instruments may behave poorly in finite samples and can be sensitive to the number of instruments. Furthermore, if a set of instruments can approximate fi well, then adding more instruments is not helpful for reducing the asymptotic variance, since it cannot be smaller than σ²H̄⁻¹. It is, therefore, important to consider how to handle a large number of instruments.

We consider a situation similar to that of Chamberlain and Imbens (2004) where we have two sets of instruments, X and Z̄. Among the IVs, we typically have ''main'' instruments, which guarantee the identification of the parameter, δ, and are more important for estimation than the other instruments. We denote these ''main'' instruments as X. We consider shrinking the effect of Z̄ on the estimation of δ.

The meaning of ''main'' can differ among situations. For example, suppose that we consider a (possibly misspecified) linear (in parameters) model for the relationship between the endogenous regressors and instruments, as in West et al. (2009). The main instruments (i.e., X) in this case would be the terms appearing in the model we specify, and the other instruments (i.e., Z̄) are other functions of the instruments. Another example could be the case where a number of instruments are generated by multiplying the main instruments by regional dummies or time dummies. For instance, as discussed in Section 1, the quarter-of-birth variables may be considered as ''main'' instruments in the case of Angrist and Krueger (1991). Note that we are able to estimate δ using only those main instruments if the number of the main instruments is larger than the number of the endogenous variables. However, such an estimate may have a large standard error.
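As a concrete illustration of this setup, the sketch below simulates a design of this type in Python. The design itself (one endogenous regressor, a single strong main instrument, K weaker other instruments, a linear first stage, and all coefficient values) is our own hypothetical choice, not one taken from the paper:

```python
import numpy as np

def simulate(N=500, K=20, delta=1.0, c_main=0.5, c_other=0.05, rho=0.5, seed=0):
    """Simulate y_i = W_i' delta + eps_i with W_i = f(x_i) + u_i:
    one endogenous regressor, one strong 'main' instrument X and
    K weaker 'other' instruments Zbar; corr(u_i, eps_i) = rho."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, 1))       # main instrument
    Zbar = rng.standard_normal((N, K))    # other instruments
    f = c_main * X[:, 0] + c_other * Zbar.sum(axis=1)  # f(x_i) = E(W_i | x_i)
    eps = rng.standard_normal(N)
    u = rho * eps + np.sqrt(1.0 - rho**2) * rng.standard_normal(N)
    W = f + u                             # endogenous regressor
    y = delta * W + eps
    return y, W, X, Zbar

y, W, X, Zbar = simulate()
print(y.shape, W.shape, X.shape, Zbar.shape)  # (500,) (500,) (500, 1) (500, 20)
```

With data generated this way one can compare an IV estimator that uses X only with one that uses all K + 1 instruments, which is the comparison motivating the shrinkage procedure.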
Even though using more instruments is a way to reduce the standard error of the estimate, it is commonly observed that IV estimators with many instruments behave poorly (e.g., Morimune (1983) and Bound et al. (1996)). The shrinkage TSLS estimator is introduced to address this ''many instruments'' problem. In this section, the shrinkage TSLS estimator is discussed; the shrinkage LIML estimator is discussed in the next section.

Now, we describe the procedure. Let Z = (I − PX)Z̄ so that X′Z = 0. It is important to note that Z in our discussion may not be the matrix of the instruments itself but its orthogonalized version in applications. The TSLS estimator of δ using all the instruments is the solution to W′P(X,Z)(y − Wδ) = 0, and the TSLS estimator using only the main instruments solves W′PX(y − Wδ) = 0. The shrinkage TSLS estimator, δ̂tsls,s, is defined as the solution to
$$(1-s)\,W'P_X(y - W\delta) + s\,W'P_{(X,Z)}(y - W\delta) = W'P_X(y - W\delta) + s\,W'P_Z(y - W\delta) = 0,$$
and it is
$$\hat\delta_{tsls,s} = (W'P^sW)^{-1}W'P^sy,$$
for a shrinkage parameter, s, where Pˢ = (1 − s)PX + sP(X,Z) = PX + sPZ. The shrinkage TSLS estimator is obtained by solving a weighted average of the estimating equation for the TSLS using only the main instruments and that using all instruments. By introducing the shrinkage parameter, s, we can reduce the effect of adding Z into the set of instruments. The shrinkage parameter, s, lies between 0 and 1; the choice s = 0 leads to the TSLS estimator using only X, and likewise setting s = 1 yields the TSLS estimator using all the instruments. A more detailed discussion is found in the next subsection.

To operationalize this procedure, a method for choosing s is needed. We recommend the following choice of s because it is an estimator of the shrinkage parameter that minimizes the
asymptotic mean square error (see Section 2.2):
$$\hat s^{*} = \frac{\hat\sigma_\epsilon^2\,\hat\lambda'\hat H^{-1}(W'P_ZW/N)\hat H^{-1}\hat\lambda}{\hat\lambda'\hat H^{-1}\hat\sigma_{u\epsilon}\hat\sigma_{u\epsilon}'\hat H^{-1}\hat\lambda\,(K^2/N) + \hat\sigma_\epsilon^2\,\hat\lambda'\hat H^{-1}(W'P_ZW/N)\hat H^{-1}\hat\lambda}, \qquad (1)$$
where λ̂ is the (possibly estimated) weighting vector chosen by the researcher, σ̂ϵ² and σ̂uϵ are the estimates of σϵ² = E(ϵi²) and σuϵ = E(uiϵi) based on the residuals from a preliminary estimation, and Ĥ = W′(PX + PZ)W/N, which is an estimate of H = f′f/N (the matrix appearing in the first-order asymptotic variance). Note that the number of instrumental variables should increase with the sample size in order to estimate H.5

2.2. Theoretical results

We demonstrate the asymptotic properties of the shrinkage TSLS under the following assumptions, which are similar to those imposed in Donald and Newey (2001).

Assumption 1. {yi, Wi, xi} are i.i.d., E(ϵi²|xi) = σϵ² > 0, and E(‖ηi‖⁴|xi) and E(|ϵi|⁴|xi) are bounded.
Assumption 2. (i) H̄ ≡ E(fi fi′) exists and is nonsingular. (ii) There exists πK such that E(‖f(x) − πK(X′, Z̄′)‖²) → 0 as K → ∞.

Assumption 3. (i) E{(ϵ, u′)′(ϵ, u′)|xi} is constant. (ii) (X, Z)′(X, Z) is nonsingular with probability one. (iii) maxi≤N PZ,ii →p 0. (iv) fi is bounded.

Assumption 1 imposes restrictions on the moments of the random variables in the model, which are standard in the literature. Assumption 2(i) guarantees the identification of the parameter δ. With Assumption 2(ii), the asymptotic variance of the TSLS estimator or the shrinkage TSLS estimator under K → ∞ is σ²H̄⁻¹. Assumption 3(i) imposes homoscedasticity of the error terms. Assumption 3(ii) and (iii) impose restrictions on the probabilistic nature of the instruments and the rate of K. Assumption 3(iv) is employed for simplicity and can be relaxed at the cost of making the results and the proofs much more complicated.

The first theorem establishes the consistency and asymptotic normality of the shrinkage TSLS estimator.

Theorem 1. Suppose that Assumptions 1–3 are satisfied. If (sK)²/N →p 0 and either s →p 1 or E(fi Zi′) = 0, then δ̂tsls,s − δ →p 0 and
$$\sqrt{N}\,(\hat\delta_{tsls,s} - \delta) \to_d N(0, \sigma_\epsilon^2 \bar H^{-1}).$$
The condition E(fiZi′) = 0 means that Z is a matrix of totally irrelevant instruments, and in that case the shrinkage parameter does not need to go to 1. However, when Z is relevant, the shrinkage parameter must converge to 1 in order to achieve the semiparametric efficiency bound. This theorem justifies the use of the shrinkage TSLS estimator. Unfortunately, this result also indicates that the conventional first-order asymptotic analysis is neither strong enough to investigate the effect of shrinkage, nor able to provide any guidance in choosing the shrinkage parameter, s. This is similar to the case of selecting the number of instruments. The first-order asymptotic results do not tell us how many instruments should be used; for this, we have to look at a higher-order expansion.

5 Note that the function f(·) is unknown and it cannot be written as a linear combination of a finite number of instruments in general. This is the reason that the number of instruments should increase with N in order to estimate H = f′f/N.

Given this observation, we propose to choose the shrinkage parameter to minimize a higher-order asymptotic mean square error. The notion of the asymptotic mean square error employed here is similar to the Nagar-type asymptotic
expansion (Nagar, 1959). Following Donald and Newey (2001), we approximate the mean square error, E{(δ̂tsls,s − δ0)(δ̂tsls,s − δ0)′|x}, by σϵ²H⁻¹ + S(s), where
$$N(\hat\delta_{tsls,s} - \delta_0)(\hat\delta_{tsls,s} - \delta_0)' = \hat Q(s) + \hat r(s), \qquad E\{\hat Q(s)\mid x\} = \sigma_\epsilon^2 H^{-1} + S(s) + T(s),$$
H = f′f/N, and {r̂(s) + T(s)}/tr{S(s)} = op(1) as K → ∞ and N → ∞. First, we divide N(δ̂tsls,s − δ0)(δ̂tsls,s − δ0)′ into two parts, Q̂(s) and r̂(s), and discard r̂(s), which goes to zero more quickly than S(s) does. Then, we take the expectation of Q̂(s) conditional on the exogenous variable, x, and ignore the term T(s), which goes to zero more quickly than S(s) does. The term σϵ²H⁻¹ corresponds to the first-order asymptotic variance. Hence, S(s) is the nontrivial and dominant term in the mean square error, and our goal is to find S(s).

This Nagar-type approximation is popular in the IV estimation literature but not common in the shrinkage literature, which mainly focuses on exact finite sample properties. See, for example, Phillips (1983) for a survey of finite sample properties of IV estimators. We have several reasons to investigate the Nagar approximation even though the usual shrinkage literature does not use it. First, a finite sample parametric approach may not be very convincing because it relies on a distributional assumption. Second, an exact finite sample approach usually gives us results that are too complicated to be meaningful. The application of the Nagar approximation provides a clear result, which leads to an easily implementable procedure for choosing the optimal shrinkage parameter. Lastly, this approach makes comparison with Donald and Newey (2001) easier, as they also use the Nagar expansion.

The next theorem gives the form of the mean square error under K → ∞, N → ∞ and an exogenous shrinkage parameter.

Theorem 2. Suppose that Assumptions 1–3 are satisfied. Under (sK)²/N → 0 and either (1 − s) = Op(K²/N) or E(fiZi′) = 0, Nagar's decomposition holds with
$$S(s) = H^{-1}\left(\sigma_{u\epsilon}\sigma_{u\epsilon}'\,\frac{(sK)^2}{N} + \sigma_\epsilon^2\,\frac{f'(I-P^s)(I-P^s)f}{N}\right)H^{-1}.$$
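As a numerical sanity check of the bias–variance trade-off embodied in S(s), the following sketch evaluates a scalar version of the expression on a grid, under assumed placeholder values of σϵ², σuϵ, K, N and the quadratic forms f′PZ f/N and f′(I − PX)f/N (none of these numbers come from the paper). It uses the identity f′(I − Pˢ)(I − Pˢ)f = f′(I − PX)f − s(2 − s) f′PZ f, which holds because PX PZ = 0:

```python
import numpy as np

# Assumed scalar design values (our placeholders, not from the paper):
sigma_eps2 = 1.0    # sigma_epsilon^2
sigma_ueps = 0.5    # sigma_{u epsilon}
K, N = 30, 500
fPZf_N = 0.2        # f' P_Z f / N, strength of the other instruments
fMXf_N = 0.4        # f' (I - P_X) f / N (>= fPZf_N by construction)

def S(s):
    """Scalar S(s): squared-bias term (sK)^2/N times sigma_ueps^2 plus the
    second-order variance term, via
    f'(I - P^s)(I - P^s)f = f'(I - P_X)f - s(2 - s) f'P_Z f."""
    bias2 = sigma_ueps**2 * (s * K)**2 / N
    var2 = sigma_eps2 * (fMXf_N - s * (2.0 - s) * fPZf_N)
    return bias2 + var2

# Scalar specialization of the optimal shrinkage parameter:
s_star = sigma_eps2 * fPZf_N / (sigma_ueps**2 * K**2 / N + sigma_eps2 * fPZf_N)

grid = np.linspace(0.0, 1.0, 100001)
s_num = grid[np.argmin(S(grid))]
print(round(s_star, 4), round(s_num, 4))  # the grid minimizer matches the formula
```

The grid minimizer agrees with the closed-form parameter, and both improve on the no-shrinkage choice s = 1, illustrating why shrinking pays off when K² is large relative to N.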
The Appendix contains the proof, which is similar to the proof of Proposition 1 in Donald and Newey (2001). The first term in the brackets on the right-hand side of the equation corresponds to the square of the bias term. Introducing the shrinkage parameter mitigates the bias caused by using many instruments. The second term in the brackets corresponds to the second-order variance term. Note that the formula in Donald and Newey (2001) is given by setting s = 1, as s = 1 corresponds to the standard TSLS estimator.

Given this formula, our task is to find an s that minimizes the mean square error of a linear combination of the estimator, λ′S(s)λ (λ may be estimated). The optimal shrinkage parameter is:
$$s^{*} = \frac{\sigma_\epsilon^2\,\lambda'H^{-1}(f'P_Zf/N)H^{-1}\lambda}{\lambda'H^{-1}\sigma_{u\epsilon}\sigma_{u\epsilon}'H^{-1}\lambda\,(K^2/N) + \sigma_\epsilon^2\,\lambda'H^{-1}(f'P_Zf/N)H^{-1}\lambda}.$$
This form is very intuitive: the optimal shrinkage parameter is an increasing function of a measure of the strength of the instruments, f ′ PZ f /N, and a decreasing function of the number of instruments, K . The optimal shrinkage parameter lies between 0 and 1, which is a natural parameter space for the shrinkage parameter. The event s∗ = 1 occurs when σuϵ = 0. In this case, the OLS estimator is consistent, and we should make the estimator close to the OLS estimator by using all the instruments. The standard case is f ′ PZ f /N →p c > 0 and s∗ →p 1. This means that if Z is a valid instrument, then asymptotically we do not shrink and achieve semiparametric efficiency. If f ′ PZ f /K 2 →p 0, which occurs when Z is an irrelevant instrument, s∗ →p 0. We
can defend against completely weak instruments by introducing the shrinkage parameter. The weak instruments case in the Staiger and Stock (1997) sense (see also Chao and Swanson (2005)) occurs when f′PZ f/K² →p c > 0. Then, s∗ →p s̄ where 0 < s̄ < 1. Even though we do not consider this case formally, we conjecture that the shrinkage TSLS can even utilize information from weak instruments.

The optimal shrinkage parameter has a K²-order term. This is the main difference of this shrinkage rule compared with that of James–Stein, which has just a K-order term.6 This might imply that if we employ the James–Stein shrinkage rule naively, we shrink the effect of the instruments less than desired when the number of instruments is large. This observation indicates the importance of choosing the shrinkage parameter based on the asymptotic mean square error.

If there is only one endogenous variable, or, in other words, Yi is a scalar, the choice of λ does not affect the optimal shrinkage parameter, which is
$$s^{*} = \frac{\sigma_\epsilon^2\,\bar Y'P_Z\bar Y/N}{\sigma_{\eta\epsilon}^2\,(K^2/N) + \sigma_\epsilon^2\,\bar Y'P_Z\bar Y/N},$$
where Ȳ = (E(Y1|x1), . . . , E(YN|xN))′.

The optimal shrinkage parameter depends on the unknown parameters. A natural estimator of the optimal shrinkage parameter is given by (1), and the following theorem justifies its use. Donald and Newey (2001) also present a similar result to justify their selection procedure.7

Theorem 3. Suppose that Assumptions 1–3 are satisfied and σ̂ϵ² →p σϵ², σ̂uϵ →p σuϵ and λ̂ − λ →p 0. Then, {S(ŝ∗) − S(s∗)}/S(s∗) = op(1).

3. The shrinkage LIML estimator

We can extend the idea of the shrinkage TSLS to the limited information maximum likelihood (LIML) estimator. The LIML estimator minimizes (y − Wδ)′P(X,Z)(y − Wδ)/{(y − Wδ)′(y − Wδ)}. The shrinkage LIML estimator δ̂liml,s is defined as:
$$\hat\delta_{liml,s} = \operatorname*{argmin}_\delta\left[(1-s)\,\frac{(y-W\delta)'P_X(y-W\delta)}{(y-W\delta)'(y-W\delta)} + s\,\frac{(y-W\delta)'P_{(X,Z)}(y-W\delta)}{(y-W\delta)'(y-W\delta)}\right] = \operatorname*{argmin}_\delta\,\frac{(y-W\delta)'P^s(y-W\delta)}{(y-W\delta)'(y-W\delta)}.$$

Let vi = ui − ϵiσuϵ/σϵ² and define Σv = E(vivi′). The next theorem derives the asymptotic mean square error of the shrinkage LIML estimator. We assume the third-moment condition E(ϵi²vi) = 0 to simplify the formula.

6 Suppose that there is only one endogenous regressor. Then, the James–Stein shrinkage rule for the first-stage regression gives ŝ = {1 − σ̂u²(K − 2)/(W′PZW)}.
7 Note that, in general, this result does not imply that the estimator with ŝ∗ attains the minimum of S(s). If ŝ∗ were constructed using samples that are independent of the data used in estimation of δ, then this theorem would imply that the estimators with ŝ∗ have the same second-order MSE properties as those with s∗. However, we usually use the same samples to estimate s∗ and δ, which makes it very difficult to prove that the estimators with ŝ∗ are second-order equivalent to those with s∗.

Theorem 4. Suppose that Assumptions 1–3 are satisfied, Σv ̸= 0, E(‖ηi‖⁵|xi) and E(|ϵi|⁵|xi) are bounded, and E(ϵi²vi) = 0. Then, under
sK/N →p 0 and 1 − s = Op(sK/N) or E(f_i Z_i′) = 0, we have

δ̂_liml,s →p δ,    √N(δ̂_liml,s − δ) →d N(0, σ_ε² H̄⁻¹),

and

S(s) = H⁻¹ { σ_ε² Σ_v s²K/N + σ_ε² f′(I − P^s)(I − P^s)f/N } H⁻¹.
The proof of Theorem 4 is in the Appendix. The proof is similar to the proof of Proposition 2 in Donald and Newey (2001). As before, we propose to choose the shrinkage parameter by minimizing λ′S(s)λ with respect to s. The optimal shrinkage parameter is

s* = σ_ε² λ′H⁻¹(f′P_Z f/N)H⁻¹λ / { σ_ε² λ′H⁻¹Σ_v H⁻¹λ (K/N) + σ_ε² λ′H⁻¹(f′P_Z f/N)H⁻¹λ }.
If there is only one endogenous variable, the minimization does not depend on λ and the optimal shrinkage parameter has the form:

s* = σ_ε² (Ȳ′P_Z Ȳ/N) / { (σ_η²σ_ε² − σ_ηε²)(K/N) + σ_ε² (Ȳ′P_Z Ȳ/N) },
where σ_η² = E(η_i²) and σ_ηε = E(η_i ε_i). The optimal shrinkage parameter has a K-order term, while that of the shrinkage TSLS has a K²-order term. We should shrink the effect of the other instruments less in the shrinkage LIML than in the shrinkage TSLS when K is large. This observation is consistent with the established result that the LIML estimator is more robust against the number of instruments than is the TSLS estimator (see, e.g., Anderson et al. (forthcoming)). We also note that the optimal shrinkage parameter always lies between 0 and 1 (0 ≤ s* ≤ 1). We note that the James–Stein shrinkage parameter has a K-order term too. In fact, for both the optimal and the James–Stein shrinkage parameters, the order of 1 − s is K/N when f′P_Z f/N →p c > 0. This observation implies that the James–Stein shrinkage parameter has an optimal property in terms of rate in the LIML estimation. However, the James–Stein shrinkage parameter does not take into account the correlation between the error term of the first-stage regression and that of the second-stage regression, and thus the estimator based on the James–Stein shrinkage parameter is not expected to behave well.

4. Monte Carlo simulation

This section reports the results of the Monte Carlo experiments.⁸ The aims of these experiments are to see how the shrinkage estimators behave in finite samples and to compare the shrinkage methods with other estimation methods. Comparison with the instrument selection procedure in Donald and Newey (2001) is one of the main purposes of this study. To make this comparison easier, we borrow their experimental design.
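The scalar-case optimal LIML shrinkage parameter above can be estimated by plugging in sample analogues. The sketch below is a naive illustration, not the paper's own estimator: the pilot estimate `delta_pilot`, the residual-based variance estimates, and the use of W′P_Z W/N as a stand-in for Ȳ′P_Z Ȳ/N are all assumptions of this sketch.

```python
import numpy as np

def liml_shrinkage_weight(y, W, X, Z, delta_pilot):
    """Naive plug-in for the scalar-case optimal LIML shrinkage parameter
    s* = se2*Q / {(seta2*se2 - setae^2) K/N + se2*Q}."""
    N, K = Z.shape
    Xc = X.reshape(-1, 1)
    Z = Z - Xc @ np.linalg.lstsq(Xc, Z, rcond=None)[0]   # enforce Z'X = 0
    eps = y - W * delta_pilot                            # structural residuals
    R = np.column_stack([Xc, Z])
    eta = W - R @ np.linalg.lstsq(R, W, rcond=None)[0]   # first-stage residuals
    se2 = eps @ eps / N
    seta2 = eta @ eta / N
    setae = eta @ eps / N
    # Stand-in for Ybar' P_Z Ybar / N (an assumption of this sketch).
    Q = W @ (Z @ np.linalg.lstsq(Z, W, rcond=None)[0]) / N
    num = se2 * Q
    den = (seta2 * se2 - setae ** 2) * K / N + se2 * Q
    # The optimal parameter always lies in [0, 1], so clip the estimate.
    return float(np.clip(num / den, 0.0, 1.0))
```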
4.1. Design

Our data-generating process is the following model:

y_i = W_i δ + ε_i,
W_i = π′(X_i, Z̄_i′)′ + u_i,

for i = 1, ..., N, where W_i is a scalar, δ is the scalar parameter of interest, X_i is a scalar random variable, Z̄_i is a K × 1 vector of random variables, (X_i, Z̄_i′)′ ∼ i.i.d. N(0, I_{K+1}) and⁹

(ε_i, u_i)′ ∼ i.i.d. N( (0, 0)′, [1 c; c 1] ).

Let K̄ be the total number of instruments so that K̄ = K + 1. The variable X_i is the ''main'' instrument, and Z̄_i is the vector of additional instruments. We fix the true value of δ at δ = 0.1, and we examine how well each estimator estimates δ. In this framework, each experiment is indexed by the vector of specifications (N, K̄, c, {π}), where N represents the sample size. We use N = 100 and N = 500, and we set K̄ = 20 if N = 100 and K̄ = 25 if N = 500. The degree of endogeneity is summarized in c, and we set c = 0.1, 0.5 and 0.9. The number of replications is 5000 for N = 100 and 2500 for N = 500. Hahn and Hausman (2002) observe that the theoretical R² of the first-stage regression is given by R²_f = π′π/(π′π + 1). While we try four different specifications of π, which are stated later, we specify π such that it always satisfies π′π = R²_f/(1 − R²_f). We try R²_f = 0.1 and 0.01. The first specification of π is a case where the instruments are all equally important:

Model (a): π_k = √( R²_f / {K̄(1 − R²_f)} )  ∀k.

This case is difficult, as not only are all the instruments equally important, they are also all weak. Using only the first instrument is not appropriate. Using all the instruments might cause the ''many-instruments'' problem. As there is no reason to prefer some instruments to others, selection methods are not very effective. This is also problematic for shrinkage methods, as the main instrument itself is weak and the other instruments are as important as the main one. The second model considered is

Model (b): π_1 = c(K̄),   π_k = c(K̄)/(K̄ − 1)  ∀k > 1,

where c(K̄) is chosen to satisfy π′π = R²_f/(1 − R²_f). The first instrument is strong but the others are weak. This data-generating process seems relevant to many applications. Often, we know that the instruments at hand guarantee the identification of the parameter of interest. However, the estimate using only those instruments has a relatively large standard error, which prevents us from drawing sharp conclusions. In this case, even if we are aware that other possible instruments are relatively weak, we may want to increase the number of instruments to reduce the standard error. Thirdly, we consider the data-generating process used in Donald and Newey (2001):

Model (c): π_k = c(K̄) (1 − k/(K̄ + 1))⁴.

The strength of the instruments decreases moderately in k. An instrument selection procedure, such as that proposed by Donald and Newey (2001), would be suitable in this situation. Lastly, we consider the following data-generating process:

Model (d): π_k = 0 for k ≤ K̄/2;   π_k = c(K̄) (1 − (k − K̄/2)/(K̄/2 + 1))⁴ for k > K̄/2.

The first half of the instruments are completely redundant and the second half of the instruments are informative. In this sense, the instrument ordering in Model (d) is ''wrong'', and the results of this design are useful for seeing the effect of the pre-specified ordering of instruments in selection methods and the effect of a ''wrongly chosen'' main instrument in shrinkage methods.

⁸ This Monte Carlo simulation was conducted with Linux Ox 4.1a (Doornik, 2006).
⁹ We also consider a design under which (ε_i, u_i) is generated by the product of a random variable with a t-distribution with 3 degrees of freedom and bivariate normal random variables. The results from this non-normal design are very similar to those reported in the paper and are not presented here. The additional tables that summarize the results from the non-normal design are available from the author upon request.
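The four designs can be generated directly under the constraint π′π = R²_f/(1 − R²_f). The sketch below is illustrative, not the paper's code: for Models (b)–(d) the profile of π is set first and then rescaled to hit the constraint, which plays the role of c(K̄), and the Model (b) weights for k > 1 are reconstructed from the text as proportional to 1/(K̄ − 1).

```python
import numpy as np

def make_pi(model, Kbar, R2f):
    """First-stage coefficients for Models (a)-(d), k = 1, ..., Kbar."""
    target = R2f / (1.0 - R2f)                 # required value of pi'pi
    k = np.arange(1, Kbar + 1)
    if model == "a":
        return np.full(Kbar, np.sqrt(target / Kbar))
    if model == "b":
        pi = np.full(Kbar, 1.0 / (Kbar - 1.0))
        pi[0] = 1.0                            # one strong instrument
    elif model == "c":
        pi = (1.0 - k / (Kbar + 1.0)) ** 4     # strength decays in k
    else:                                      # model "d": first half redundant
        pi = np.where(k <= Kbar / 2, 0.0,
                      (1.0 - (k - Kbar / 2.0) / (Kbar / 2.0 + 1.0)) ** 4)
    return pi * np.sqrt(target / (pi @ pi))    # rescaling = role of c(Kbar)

def simulate(model, N, Kbar, c, R2f, rng):
    """One sample: y = 0.1 W + eps, W = (X, Zbar')'pi + u, corr(eps, u) = c."""
    pi = make_pi(model, Kbar, R2f)
    XZ = rng.standard_normal((N, Kbar))        # (X_i, Zbar_i')' ~ N(0, I)
    e = rng.multivariate_normal([0.0, 0.0], [[1.0, c], [c, 1.0]], size=N)
    eps, u = e[:, 0], e[:, 1]
    W = XZ @ pi + u
    y = 0.1 * W + eps
    return y, W, XZ[:, 0], XZ[:, 1:]
```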
A number of estimators are studied. The first is the TSLS estimator with all available instruments (TSLS). The second is the TSLS estimator with Donald and Newey's (2001) optimal selection of the number of instruments (DNTSLS). The third estimator is the TSLS estimator with Hall and Peixe's (2003) selection of instruments (HP). The next two estimators are the shrinkage TSLS estimators with different choices of shrinkage parameters. The first uses the true optimal shrinkage parameter (OSTSLS), which is infeasible in practice. The performance of OSTSLS can be seen as an upper bound on the performance of shrinkage procedures. The other uses the estimated (i.e., feasible) optimal shrinkage parameter (STSLS). We consider three LIML-type estimators: the LIML estimator with all instruments (LIML), the LIML estimator with the Donald and Newey (2001) optimal selection of the number of instruments (DNLIML), and the (feasible) shrinkage LIML estimator (SLIML). Lastly, we consider the REQML estimator with all available instruments (REQML). To compute DNTSLS, STSLS, DNLIML and SLIML, preliminary estimates are obtained with the number of instruments that minimizes the first-stage cross-validation criteria.¹⁰ The cross-validation criteria are used for R̂(K) (see Donald and Newey (2001)) in the selection criteria of Donald and Newey (2001). For the Hall and Peixe (2003) method, we use the penalty function that corresponds to the BIC (Eq. (14) in Hall and Peixe (2003)). The selection methods (DNTSLS, HP and DNLIML) are applied given the ordering of the instruments, so that they choose only the number of instruments. The ordering of the instruments in each model is associated with the index k. We note that the instruments are ordered according to their strength in Models (a)–(c). On the other hand, the order of the instruments is ''wrong'' in Model (d) in the sense that redundant instruments come first in the ordering.
The shrinkage methods are applied with the first instrument as the ''main'' instrument. We note that the first instrument is the strongest instrument in Models (a)–(c). However, in Model (d), the first instrument is redundant.

4.2. Measures

For each estimator, we compute the median bias (median bias), the difference between the 0.1 and 0.9 quantiles (Dec. Reg.) and the median absolute deviation (MAD), following Donald and Newey (2001). We use these ''robust'' measures because of concerns about the existence of moments of the estimators. For example, it is well known that the LIML estimator does not possess any finite moments and, in fact, we encounter an extremely large value of the mean square error of the LIML estimator in the simulations. A disadvantage of using these robust measures is that the relationship between the theoretical results and the simulation results becomes less clear. To overcome this issue at least partially, we also compute the root-mean truncated square error (RMTSE):

[ E min{ (δ̂ − δ)², 2 } ]^{1/2}
for each estimator δ̂.¹¹ This measure is always finite and should be closely related to the mean square error. We also compute the coverage rate (Cov. Rate) of a 95% confidence interval based on each estimator. To construct the confidence intervals used to compute the coverage probabilities, we use the following estimate of the asymptotic variance, following Donald and Newey (2001). The estimators examined here, except REQML, have the common form δ̂ = (Ŵ′W)⁻¹Ŵ′y (i.e., Ŵ = W for OLS, Ŵ = P_(X,Z)W for TSLS, etc.). The estimate of the variance, V̂, is given by:

V̂ = (1/N) ε̂′ε̂ (Ŵ′W)⁻¹ Ŵ′Ŵ (W′Ŵ)⁻¹,

¹⁰ Alternatively, we may use the Mallows criteria to choose the number of instruments for the preliminary estimates. Using the Mallows criteria reduces the computational time substantially, particularly when the sample size is large, and the results do not change by much. Nonetheless, in this experiment, we use the cross-validation criteria following Donald and Newey (2001).
¹¹ A similar measure is used in Chamberlain and Imbens (2004).
where ε̂ = y − Wδ̂. For the REQML estimator, the coverage probability is that of the confidence interval obtained by inverting the likelihood ratio test (see Chamberlain and Imbens (2004, p. 302)). We also compute the coverage probabilities based on Bekker's (1994) asymptotic variance estimator for the LIML-type estimators (see also Hansen et al. (2008)). It is denoted ''B. Cov. Rate'' in the tables. Bekker's asymptotic variance estimator is consistent for the asymptotic variance even if the number of instruments is proportional to the sample size and has the following formula:

V̂_B = {1/(N − trace(P))} ε̂′ε̂ (Ŵ′W)⁻¹ [ Ŵ′Ŵ − λ W′(I − P)ε̂ ε̂′(I − P)W / {ε̂′(I − P)ε̂} ] (W′Ŵ)⁻¹,
where λ = ε̂′Pε̂/(ε̂′ε̂) and P is P_(X,Z) for LIML, the projection matrix spanned by the selected instruments for DNLIML, and P^s for SLIML. There are two differences between V̂ and V̂_B. One is that V̂ has N where V̂_B has the total number of degrees of freedom, N − trace(P). This makes V̂ tend to be smaller than V̂_B. The other is that the matrix in the middle for V̂ is Ŵ′Ŵ, but the corresponding matrix for V̂_B is Ŵ′Ŵ minus some positive definite matrix. This difference makes V̂ tend to be larger than V̂_B. Because of these competing effects, we cannot determine which of V̂ and V̂_B is larger in general.

4.3. Results

The results of the experiments are summarized in Tables 1–8.¹² The mark '∗' indicates that the number is more than 1000. First, we summarize the performance of TSLS and LIML. If the endogeneity is small (c = 0.1), TSLS performs well. The bias of TSLS is negligible and the diversity of TSLS is also very small. We note that all the estimators have negligible biases when c = 0.1, but the ''Dec. Reg.'' of TSLS is smaller than that of the other estimators. The coverage rate based on TSLS is also close to 0.95 in those cases. On the other hand, LIML outperforms TSLS in the cases with c = 0.9. In those cases, the bias of TSLS is very large and the coverage rate based on TSLS is too low. LIML has a relatively small bias and yields better coverage rates in those cases. Nonetheless, when c = 0.9 and R²_f = 0.01, LIML exhibits some bias and the confidence interval based on LIML is not so reliable. We now compare selection methods and shrinkage methods. The first comparison is between DNTSLS and HP. While HP typically

¹² Another result of the experiments which is interesting and is not included in the tables concerns the computational times of the estimators. For illustration, the computational time of each estimator relative to TSLS in Model (a) with N = 100 and R²_f = 0.1 is presented in the following table: TSLS 1; DNTSLS 74.19; HP 6.23; OSTSLS 3.07; STSLS 42.56; LIML 21.40; DNLIML 60.38; SLIML 74.85; REQML 160.09. We note that the computational times of DNTSLS, STSLS, DNLIML and SLIML include the time required to obtain the preliminary estimate. Unfortunately, the computational times depend heavily on the actual implementation of each procedure (i.e., how to obtain the preliminary estimates and how to estimate R̂(K) for the selection methods of Donald and Newey (2001)), which makes it difficult to provide a conclusive argument.
Table 1
Monte Carlo results. Model (a), R²_f = 0.1. For N = 100 and N = 500 and c = 0.1, 0.5 and 0.9, the table reports the median bias, Dec. Reg., MAD, RMTSE, Cov. Rate and B. Cov. Rate of TSLS, DNTSLS, HP, OSTSLS, STSLS, LIML, DNLIML, SLIML and REQML.
yields better coverage rates than DNTSLS, HP does not perform as well as DNTSLS in other measures. We note that the method of Hall and Peixe is intended to detect (completely) redundant instruments and it does not take into account the bias-variance trade-off in the number of instruments. However, the instruments in Models (a)–(c) are not completely redundant, although they are weak. The instruments in Model (d) are ordered wrongly and the second half of the instruments are not redundant. This is perhaps the reason why Hall and Peixe’s method does not work well in the current setting. The next comparison is between DNTSLS and STSLS. Generally, STSLS performs well in Models (a) and (b), and DNTSLS does well in Model (c), although STSLS is better in Model (c) with little endogeneity. Typically, the good performance of STSLS is because the diversity of STSLS is less than that of DNTSLS, which is indicated by the values of ‘‘Dec. Reg.’’ of STSLS and DNTSLS. A remarkable phenomenon is that there are cases where DNTSLS performs substantially worse than TSLS does. On the other hand, STSLS is usually better than TSLS is. When the endogeneity is small, both DNTSLS and STSLS are outperformed by TSLS, but the performance of STSLS is better compared to that of DNTSLS. A similar phenomenon is observed when we compare DNLIML and SLIML. While SLIML achieves improvement on LIML generally, the relative performance of DNLIML compared with LIML is not stable; in some cases, DNLIML does much better than LIML but there are also cases where DNLIML does much worse than LIML. DNLIML
usually performs well in the low-endogeneity cases where TSLStype estimators perform well. In the high-endogeneity cases that are suitable for LIML-type estimators, SLIML is usually best in Models (a) and (b). In Model (c) with c = 0.9, DNLIML is usually best though the differences between DNLIML and SLIML are small. The selection methods do not work well in Model (d) except that DNLIML improves LIML when the degree of endogeneity is low (c = 0.1). On the other hand, the performances of the shrinkage methods are similar to those of the estimators that use all the available instruments. In Model (d), the main instrument is weak and it is optimal to set the shrinkage parameter very close to 1 so that the optimal shrinkage estimators are very similar to the estimators that use all the instruments. The Monte Carlo results show that the shrinkage methods perform as well as they can when the main instrument is weak, while the selection methods may not perform well in such a situation. REQML exhibits a very small bias, even when the LIML-type estimators exhibit non-negligible biases. The coverage rate of the confidence interval based on REQML is also close to the nominal level in any situation. These results are consistent with the findings of Flores-Lagunes (2007). However, REQML has a very large diversity (i.e., its ‘‘Dec. Reg.’’ is very large). Because of the large diversity, REQML is not attractive in terms of ‘‘MAD’’ nor ‘‘RMTSE’’. Lastly, we compare STSLS with OSTSLS to see the effect of the estimation errors in the shrinkage parameter. We notice that, in some cases, in particular when c = 0.9 and R2f = 0.01, these two perform differently, although we do also observe cases in which
R. Okui / Journal of Econometrics 165 (2011) 70–86
77
Table 2
Monte Carlo results. Model (a), R²_f = 0.01. For N = 100 and N = 500 and c = 0.1, 0.5 and 0.9, the table reports the median bias, Dec. Reg., MAD, RMTSE, Cov. Rate and B. Cov. Rate of TSLS, DNTSLS, HP, OSTSLS, STSLS, LIML, DNLIML, SLIML and REQML.
their performances are similar. We may be able to obtain a better estimator by improving the estimation of the shrinkage parameter, but this is beyond the scope of this paper. The confidence intervals based on the LIML-type estimators are conservative in many cases and using Bekker’s method makes the confidence intervals have coverage rates close to 0.95 in those cases. However, when the coverage rate is much smaller than 0.95, using Bekker’s method tends to intensify this problem.13 The confidence intervals based on STSLS are improved by using Bekker’s method. However, REQML is still better in terms of coverage rate. We conclude this section by summarizing the Monte Carlo results. REQML is highly recommended if we are concerned about bias or coverage rate. If we are concerned about the risk of the estimators (either in terms of median absolute deviation or in terms of mean square error), then we should consider selection methods or shrinkage methods. Selection methods are recommended when the rank ordering of the strength of the instruments is clear. Otherwise, shrinkage methods are recommended. Moreover, we observe that shrinkage methods generally can improve the estimators with all the instruments, while there are cases in which selection methods may perform substantially worse than just using all instruments does. 13 Hansen et al. (2008) report that Bekker’s method improves the coverage rates. However, they compare Bekker’s asymptotic variance estimator with ˆ ′ W )−1 , not Vˆ . (ˆϵ ′ ϵˆ /N )(W
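The two variance estimators V̂ and V̂_B compared in the coverage results can be computed directly from the formulas in Section 4.2. A minimal numpy sketch follows, assuming W and Ŵ are stored as N × d arrays and P is the relevant N × N projection matrix; the helper name `variance_estimates` is illustrative, not from the paper:

```python
import numpy as np

def variance_estimates(y, W, What, P, delta_hat):
    """Standard estimate V-hat and Bekker's (1994) estimate V_B for
    delta-hat = (What'W)^{-1} What'y."""
    N = len(y)
    eps = y - W @ delta_hat                      # residuals
    A = np.linalg.inv(What.T @ W)
    V = (eps @ eps / N) * A @ (What.T @ What) @ A.T
    lam = eps @ P @ eps / (eps @ eps)
    Me = eps - P @ eps                           # (I - P) eps
    # Rank-one correction term in Bekker's formula.
    corr = lam * np.outer(W.T @ Me, Me @ W) / (Me @ eps)
    VB = (eps @ eps / (N - np.trace(P))) * A @ (What.T @ What - corr) @ A.T
    return V, VB
```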
5. Discussion

The idea of shrinkage as stated in this paper can easily be extended to general moment restriction models, although finding an optimal way to shrink the effect of the moment conditions might be demanding. Several extensions are found in Okui (2005), which considers conditional moment restriction models and dynamic panel data models. Investigating a way to choose the shrinkage parameter in general moment restriction models is of interest, although it might be challenging. We leave this problem for future investigation. We may also consider a method that chooses s and K simultaneously to minimize the asymptotic mean square error. Such a shrinkage-selection hybrid method may be worth further investigation. Another useful extension is to handle multiple groups of instruments. Note that this paper focuses on the situation in which we have only two groups of instruments: the main instruments and the others. If we have more than two groups of instruments, we need to shrink them group by group. The optimal shrinkage parameter would be calculated in a way similar to that presented here. The crucial assumption is that we know to which group a particular instrument belongs. We may also think of hybrid methods that adaptively partition instruments and shrink them group by group. For the estimation of a multivariate normal mean, George (1986) provides an interesting discussion of a method for handling a situation with several candidates for the partition. We consider hybrid methods to be a promising direction for future research.
Table 3
Monte Carlo results. Model (b), R²_f = 0.1. For N = 100 and N = 500 and c = 0.1, 0.5 and 0.9, the table reports the median bias, Dec. Reg., MAD, RMTSE, Cov. Rate and B. Cov. Rate of TSLS, DNTSLS, HP, OSTSLS, STSLS, LIML, DNLIML, SLIML and REQML.
In this paper, we assume that all the instruments are orthogonal to the error term. However, it is also important to examine the validity of instruments in practice. As considered by Hall and Peixe (2003), we may apply the method of Andrews (1999) first in order to eliminate invalid instruments and then apply the shrinkage method. Investigating the properties of such a procedure is also an interesting future research topic. Finally, the higher-order efficiency property of selection and shrinkage estimators would be an interesting theoretical question. Takeuchi and Morimune (1985) show the higher-order efficiency of LIML-type estimators whose asymptotic bias is adjusted. Anderson et al. (forthcoming) and Anderson et al. (2010) obtain a similar result in the presence of many instruments. Their framework excludes the possibility of instrument selection or shrinkage estimation. It is an important future research topic to explore how to establish the efficiency properties of IV estimators with instrument selection and of shrinkage estimators.

Appendix. Proofs

This appendix contains the proofs of the theorems. Hereafter, all expectations are conditional on x. We follow the same steps as the derivation of the asymptotic mean square error in Donald and Newey (2001). Some of the results used as lemmas are proved in that paper. We will employ Lemma A.1 in Donald and Newey (2001) to show Theorems 2 and 4. The estimator examined has the form √N(δ̂ − δ) = Ĥ⁻¹ĥ. We define h = f′ε/√N and H = f′f/N.
Lemma 1 (Donald and Newey (2001), Lemma A.1). If there is a decomposition ĥ = h + T^h + Z^h, Ĥ = H + T^H + Z^H,

Â(s) + Z^A(s) = (h + T^h)(h + T^h)′ − hh′H⁻¹T^H′ − T^H H⁻¹hh′,

such that T^h = op(1), h = Op(1), H = Op(1), the determinant of H is bounded away from zero with probability 1, ρ_{K,N} = op(1), ‖T^H‖² = op(ρ_{K,N}), ‖T^h‖‖T^H‖ = op(ρ_{K,N}), ‖Z^h‖ = op(ρ_{K,N}), ‖Z^H‖ = op(ρ_{K,N}), Z^A(s) = op(ρ_{K,N}), and E{Â(s)|x} = σ²H + HS(s)H + op(ρ_{K,N}), then

N(δ̂ − δ_0)(δ̂ − δ_0)′ = Q̂(s) + r̂(s),  E{Q̂(s)|x} = σ_ε²H⁻¹ + S(s) + T(s),  {r̂(s) + T(s)}/tr(S(s)) = op(1),

as K → ∞ and N → ∞.

We state two technical lemmas and their proofs; these lemmas will be used to prove the theorems. First, recall that Z′X = 0 and P^s = P_X + sP_Z, where P_X = X(X′X)⁻¹X′ and P_Z = Z(Z′Z)⁻¹Z′.

Lemma 2. Suppose Assumptions 1–3 are satisfied. Then we have (1) tr(P^s) = m + sK; (2) Σ_i (P^s_ii)² = op(sK); (3) Σ_{i≠j} P^s_ii P^s_jj = (m + sK)² + op(sK); (4) Σ_{i≠j} P^s_ij P^s_ij = (m + s²K) + op(sK); (5) h = f′ε/√N = Op(1) and H = f′f/N = Op(1).
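Parts (1) and (4) of Lemma 2 rest on the algebra tr(P^s) = m + sK and P^s′P^s = P_X + s²P_Z. These identities can be checked numerically; the sketch below uses arbitrary illustrative dimensions and enforces Z′X = 0 by residualizing Z on X:

```python
import numpy as np

rng = np.random.default_rng(7)
N, m, K, s = 200, 2, 10, 0.3
X = rng.standard_normal((N, m))
Z = rng.standard_normal((N, K))
Z = Z - X @ np.linalg.lstsq(X, Z, rcond=None)[0]       # enforce Z'X = 0

def proj(A):
    """Projection matrix onto the column space of A."""
    return A @ np.linalg.inv(A.T @ A) @ A.T

Ps = proj(X) + s * proj(Z)
# Lemma 2(1): tr(P^s) = m + sK.
assert np.isclose(np.trace(Ps), m + s * K)
# P^s'P^s = P_X + s^2 P_Z, so tr(P^s'P^s) = m + s^2 K (used in part 4).
assert np.isclose(np.trace(Ps.T @ Ps), m + s ** 2 * K)
```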
Table 4
Monte Carlo results. Model (b), R²_f = 0.01. For N = 100 and N = 500 and c = 0.1, 0.5 and 0.9, the table reports the median bias, Dec. Reg., MAD, RMTSE, Cov. Rate and B. Cov. Rate of TSLS, DNTSLS, HP, OSTSLS, STSLS, LIML, DNLIML, SLIML and REQML.
Proof. First note that (sK)⁻¹ = Op(1). For part 1,

tr(P^s) = tr(P_X) + s·tr(P_Z) = m + sK.

Assumption 3 and Lemma 2(1) imply that

Σ_i (P^s_ii)² ≤ max_i(P^s_ii) tr(P^s) = op(1)(m + sK) = op(sK).

This proves part 2. Also, these results imply that

Σ_{i≠j} P^s_ii P^s_jj = Σ_i P^s_ii Σ_j P^s_jj − Σ_i (P^s_ii)² = (m + sK)² + op(sK),

which shows part 3. To show part 4, first we observe that

Σ_{i≠j} P^s_ij P^s_ij = tr(P^s′P^s) − Σ_i (P^s_ii)².

Now, P^s′P^s = (P_X + sP_Z)(P_X + sP_Z) = P_X + s²P_Z and tr(P^s′P^s) = m + s²K. As we know that Σ_i (P^s_ii)² = op(sK) from part 2 of this lemma,

Σ_{i≠j} P^s_ij P^s_ij = m + s²K + op(sK).

Part 5 is Lemma A.2(v) in Donald and Newey (2001).

Let e^s_f = f′(I − P^s)(I − P^s)f/N and Δ_s = tr(e^s_f).

Lemma 3. Suppose Assumptions 1–3 are satisfied and s →p 1 or E(f_i Z_i′) = 0. Then, we have (1) Δ_s = op(1); (2) f′(I − P^s)ε/√N = Op(Δ_s^{1/2}); (3) u′P^sε = Op(sK); (4) E(u′P^sεε′P^su) = σ_uε σ_uε′(m + sK)² + (σ_ε²Σ_u + σ_uε σ_uε′)(m + s²K) + op(sK); (5) E(f′εε′P^su) = Σ_i f_i P^s_ii E(ε_i²u_i′) = Op(sK); (6) Δ_s/√N = op(sK/N + Δ_s); (7) E(hh′H⁻¹u′f/N) = Σ_i f_i f_i′H⁻¹E(ε_i²u_i)f_i′/N² = Op(1/N); (8) E{f′(I − P^s)εε′P^su/N} = op(Δ_s^{1/2}√(sK)/√N).

Proof. As (I − P^s)(I − P^s) = I − P + (s − 1)²P_Z by simple algebra,

f′(I − P^s)(I − P^s)f/N = f′(I − P)f/N + (s − 1)² f′P_Z f/N.

The first term is op(1) by Lemma A.3(i) in Donald and Newey (2001) and the second term converges to 0 if s →p 1 or f′P_Z f/N →p 0. Therefore, Δ_s = op(1). Next, we observe that E{f′(I − P^s)ε/√N} = 0 and

E{ f′(I − P^s)ε ε′(I − P^s)f / N } = σ_ε² f′(I − P^s)(I − P^s)f/N = σ_ε² e^s_f.

Therefore, f′(I − P^s)ε/√N = Op(Δ_s^{1/2}) by the Chebyshev inequality. This shows part 2. For part 3, the Cauchy–Schwartz inequality says that each element of u′P^sε is less than [tr(u′P^su)(ε′P^sε)]^{1/2}. As E(u′P^su) = σ_u²(m + sK) = Op(sK) and, similarly, E(ε′P^sε) = Op(sK), the Markov inequality implies that u′P^sε/√N = Op(sK/√N).
80
R. Okui / Journal of Econometrics 165 (2011) 70–86
Table 5
Monte Carlo results. R2f = 0.1.
[Entries not recoverable from this extraction: columns Median bias, Dec. Reg., MAD, RMTSE, Cov. Rate and B. Cov. Rate for TSLS, DNTSLS, HP, OSTSLS, STSLS, LIML, DNLIML, SLIML and REQML; panels for N = 100 and N = 500 and for Model (c) with c ∈ {0.1, 0.5, 0.9}.]
To give 4, observe that E(u_i P_ij^s ϵ_j ϵ_k P_kl^s u_l′) = 0 if one of (i, j, k, l) is different from all the rest. Also, E(ϵ_i² u_i u_i′) is bounded by Assumption 1. Therefore, we have

E(u′P^s ϵϵ′P^s u) = Σ_i (P_ii^s)² E(ϵ_i² u_i u_i′) + Σ_{i≠j} E(u_i P_ii^s ϵ_i ϵ_j P_jj^s u_j′) + Σ_{i≠j} E(u_i P_ij^s ϵ_j ϵ_i P_ij^s u_j′) + Σ_{i≠j} E(u_i P_ij^s ϵ_j² P_ji^s u_i′)
= O_p(1) Σ_i (P_ii^s)² + σ_uϵ σ′_uϵ Σ_{i≠j} P_ii^s P_jj^s + (σ_ϵ² Σ_u + σ_uϵ σ′_uϵ) Σ_{i≠j} P_ij^s P_ij^s
= o_p(sK) + σ_uϵ σ′_uϵ (m + sK)² + (σ_ϵ² Σ_u + σ_uϵ σ′_uϵ)(m + s²K),

by (2), (3) and (4) of Lemma 2. Assumption 1 also implies that

E(f′ϵϵ′P^s u) = Σ_{i,j,k} f_i P_jk^s E(ϵ_i ϵ_j u_k′) = Σ_i f_i P_ii^s E(ϵ_i² u_i′)

and, furthermore, together with Assumption 3 and Lemma 2(1), that

‖Σ_i f_i P_ii^s E(ϵ_i² u_i′)‖ ≤ Σ_i P_ii^s · ‖f_i‖ · ‖E(ϵ_i² u_i′)‖ = O_p(sK),

which gives 5.

To prove 6, first we consider the function of a: sK/a + a, which is convex and the minimum value of which is 2√(sK) with the minimizer a = √(sK). If ∆_s = 0, then (∆_s^{1/2}/√N)/{(sK)/N + ∆_s} = 0, and for ∆_s ≠ 0,

(∆_s^{1/2}/√N)/{(sK)/N + ∆_s} = {sK/(√(∆_s)√N) + √(∆_s)√N}^{-1} ≤ 1/√(sK) → 0

as sK → ∞. Part 7 is Lemma A.3(vii) in Donald and Newey (2001).

For part 8, let Q = I − P^s and, for some a and b, let ζ_i = f_a(x_i, z_i) and μ_i = E(ϵ_i² u_ib) P_ii^s. Now, the (a, b)th element of E{f′(I − P^s)ϵϵ′P^s u} satisfies

|E Σ_{i,j,k,l} ζ_i Q_ij ϵ_j ϵ_k P_kl^s u_lb| = |Σ_{i,j} ζ_i Q_ij E(ϵ_j² u_jb) P_jj^s| = |ζ′Qμ| ≤ |ζ′QQζ|^{1/2} |μ′μ|^{1/2},

where the inequality is the Cauchy–Schwarz inequality. Now, |ζ′QQζ|^{1/2} = O_p((N∆_s)^{1/2}) by the definition of ∆_s. |μ′μ| ≤ C Σ_i (P_ii^s)² for some constant C by Assumption 1 and, applying Lemma 2(2), we have |μ′μ| = o_p(sK). Therefore, we have

E{f′(I − P^s)ϵϵ′P^s u/N} = O_p((N∆_s)^{1/2}) o_p(√(sK)) O_p(1/N) = o_p(∆_s^{1/2} √(sK)/√N).
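The convexity fact behind part 6, namely that a ↦ sK/a + a is minimized at a = √(sK) with minimum value 2√(sK), is easy to confirm numerically (the value of sK below is arbitrary and only illustrative):

```python
import numpy as np

sK = 37.0
a = np.linspace(0.1, 50.0, 100_000)
vals = sK / a + a

# The minimizer should be a = sqrt(sK) and the minimum value 2*sqrt(sK).
assert abs(a[np.argmin(vals)] - np.sqrt(sK)) < 0.01
assert abs(vals.min() - 2 * np.sqrt(sK)) < 1e-3
print("minimizer:", a[np.argmin(vals)], "minimum:", vals.min())
```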
Table 6
Monte Carlo results. R2f = 0.01.
[Entries not recoverable from this extraction: columns Median bias, Dec. Reg., MAD, RMTSE, Cov. Rate and B. Cov. Rate for TSLS, DNTSLS, HP, OSTSLS, STSLS, LIML, DNLIML, SLIML and REQML; panels for N = 100 and N = 500 and for Model (c) with c ∈ {0.1, 0.5, 0.9}.]
A.1. Proof of Theorems 1 and 2

Proof. The shrinkage TSLS estimator has the form:

√N(δ̂_tsls,s − δ_0) = Ĥ_s^{-1} ĥ_s, where Ĥ_s = W′P^s W/N and ĥ_s = W′P^s ϵ/√N.

Also, Ĥ_s and ĥ_s are decomposed as

ĥ_s = h + T_1^h + T_2^h, with T_1^h = −f′(I − P^s)ϵ/√N and T_2^h = u′P^s ϵ/√N;
Ĥ_s = H + T_1^H + T_2^H + Z^H, with T_1^H = −f′(I − P^s)f/N, T_2^H = (u′f + f′u)/N and
Z^H = {u′P^s u − u′(I − P^s)f − f′(I − P^s)u}/N.

We show that the conditions of Lemma 1 are satisfied and that S(s) has the form given in the theorem. Note that it is enough to show that a term is o_p((sK)²/N + ∆_s) in order to show that it is o_p(ρ_K,N), because o_p((sK)²/N + ∆_s) = o_p(ρ_K,N). Now, h = O_p(1) and H = O_p(1) by Lemma 2(5). As T^h = T_1^h + T_2^h = −f′(I − P^s)ϵ/√N + u′P^s ϵ/√N, Lemma 3(2) and (3) say that T_1^h = O_p(∆_s^{1/2}) and T_2^h = O_p(sK/√N), so T^h = O_p(∆_s^{1/2}) + O_p(sK/√N). ∆_s = o_p(1) by Lemma 3(1) and sK/√N = o_p(1) by (sK)²/N = o_p(1). Therefore, T^h = o_p(1). Next,

T_1^H = −f′(I − P^s)f/N = −e_f^s − s(1 − s) f′P_Z f/N = O_p((sK)²/N + ∆_s).

T_2^H = O_p(1/√N) by the CLT. Note that (sK)²/N + ∆_s = o_p(1). Then, each of {(sK)²/N + ∆_s}², N^{-1} and {(sK)²/N + ∆_s}/N is o(ρ_K,N), which implies ‖T^H‖² = o_p(ρ_K,N).

Now, we analyze ‖T^h‖ · ‖T^H‖. We have seen that T^h = O_p(∆_s^{1/2}) + O_p(sK/√N) and T^H = O_p(∆_s) + O_p(1/√N). Now, O_p(∆_s^{3/2}) = o_p(∆_s) = o_p(ρ_K,N) by Lemma 3(1); O_p(∆_s^{1/2}/√N) = o_p(sK/N + ∆_s) = o_p(ρ_K,N) by Lemma 3(6); O_p(sK ∆_s^{1/2}/√N) = o_p(ρ_K,N), as sK ∆_s^{1/2}/√N ≤ (sK)²/N + ∆_s and ∆_s^{1/2} = o_p(1) by Lemma 3(1); and O_p(sK/N) = o_p(ρ_K,N). Therefore, ‖T^h‖ · ‖T^H‖ = o_p(ρ_K,N). As Z^h = 0 in our case, ‖Z^h‖ = o_p(ρ_K,N). The last part, for which we need to show o_p(ρ_K,N), is ‖Z^H‖. Now, Z^H = u′P^s u/N − u′(I − P^s)f/N − f′(I − P^s)u/N, where the first term is O_p(sK/N) = o_p(ρ_K,N) and the second and third terms are O_p(∆_s^{1/2}/√N) = o_p(sK/N + ∆_s) = o_p(ρ_K,N) by Lemma 3(6). Therefore, we have ‖Z^H‖ = o_p(ρ_K,N).

Note that we have shown Ĥ = H + o_p(1) and ĥ = h + o_p(1). Then, Theorem 1 holds by the LLN, the CLT and the Slutsky lemma.

The discussion above indicates Z^A(s) = 0 and

Â(s) = (h + T_1^h + T_2^h)(h + T_1^h + T_2^h)′ − hh′H^{-1}(T_1^H + T_2^H)′ − (T_1^H + T_2^H)H^{-1}hh′.
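For reference, the estimator analyzed in this proof is computable directly from its definition, δ̂_tsls,s = (W′P^s W)^{-1} W′P^s y with P^s = P_X + s P_Z. A minimal sketch on simulated data (the data-generating design below is illustrative and mine, not the paper's Monte Carlo design):

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, K = 500, 2, 20
s = 0.5          # shrinkage weight on the doubtful instruments Z
delta0 = 1.0

# Illustrative design: strong instruments X, many weak instruments Z.
X = rng.standard_normal((N, m))
Z0 = rng.standard_normal((N, K))
Z = Z0 - X @ np.linalg.solve(X.T @ X, X.T @ Z0)  # make Z orthogonal to X
e = rng.standard_normal(N)
u = 0.5 * e + rng.standard_normal(N)             # endogeneity: corr(u, e) > 0
W = X @ np.ones(m) + Z @ np.full(K, 0.1) + u     # first stage
y = delta0 * W + e

def proj(A):
    return A @ np.linalg.solve(A.T @ A, A.T)

P_s = proj(X) + s * proj(Z)                      # P^s = P_X + s P_Z

# Shrinkage TSLS: delta_hat = (W'P^s W)^{-1} W'P^s y (scalar delta here).
delta_hat = (W @ P_s @ y) / (W @ P_s @ W)
print(delta_hat)
```

Setting s = 1 recovers ordinary TSLS on all instruments, and s = 0 discards Z entirely; intermediate s downweights the doubtful block rather than making an all-or-nothing selection.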
Table 7
Monte Carlo results. R2f = 0.1.
[Entries not recoverable from this extraction: columns Median bias, Dec. Reg., MAD, RMTSE, Cov. Rate and B. Cov. Rate for TSLS, DNTSLS, HP, OSTSLS, STSLS, LIML, DNLIML, SLIML and REQML; panels for N = 100 and N = 500 and for Model (d) with c ∈ {0.1, 0.5, 0.9}.]
Now, we calculate the expectation of each term in A(s). First of all, E(hh′) = E(f′ϵϵ′f/N) = σ_ϵ² H. Second,

E(h T_1^h′) = E{−f′ϵϵ′(I − P^s)f/N} = −σ_ϵ² f′(I − P^s)f/N.

Similarly, E(T_1^h h′) = −σ_ϵ² f′(I − P^s)f/N. Third,

E(h T_2^h′) = E(f′ϵϵ′P^s u/N) = O_p(sK/N)

by Lemma 3(5). This implies that E(T_2^h h′) = O_p(sK/N) too. Fourth,

E(T_1^h T_1^h′) = E{f′(I − P^s)ϵ ϵ′(I − P^s)f/N} = σ_ϵ² f′(I − P^s)(I − P^s)f/N.

Fifth,

E(T_1^h T_2^h′) = −E{f′(I − P^s)ϵϵ′P^s u/N} = o_p(∆_s^{1/2} √(sK)/√N)

by Lemma 3(8). Again, we have E(T_2^h T_1^h′) = o_p(∆_s^{1/2} √(sK)/√N). Sixth,

E(T_2^h T_2^h′) = E(u′P^s ϵϵ′P^s u/N) = σ_uϵ σ′_uϵ (sK)²/N + O_p(sK/N) + o_p((sK)²/N)

by Lemma 3(4). Seventh,

E(hh′H^{-1} T_1^H′) = −E{f′ϵϵ′f H^{-1} f′(I − P^s)f / N²} = −σ_ϵ² f′(I − P^s)f/N.

Also, we have E(T_1^H H^{-1} hh′) = −σ_ϵ² f′(I − P^s)f/N. Finally, Lemma 3(7) implies that

E(hh′H^{-1} T_2^H′) = E{hh′H^{-1}(u′f + f′u)/N} = O_p(1/N)

and E(T_2^H H^{-1} hh′) = O_p(1/N). Therefore, we have

E(Â(s)) = σ_ϵ² H − 2σ_ϵ² f′(I − P^s)f/N + σ_ϵ² f′(I − P^s)(I − P^s)f/N + σ_uϵ σ′_uϵ (sK)²/N + 2σ_ϵ² f′(I − P^s)f/N + O_p(sK/N) + O_p(1/N) + o_p(ρ_K,N)
= σ_ϵ² H + σ_uϵ σ′_uϵ (sK)²/N + σ_ϵ² f′(I − P^s)(I − P^s)f/N + o_p(ρ_K,N),

where the last equality holds because 1/N = o_p(ρ_K,N), sK/N = o_p(ρ_K,N) and o_p(∆_s^{1/2} √(sK)/√N) = o_p(ρ_K,N) by the fact that ∆_s^{1/2} √(sK)/√N ≤ sK/N + ∆_s.
Table 8
Monte Carlo results. R2f = 0.01.
[Entries not recoverable from this extraction: columns Median bias, Dec. Reg., MAD, RMTSE, Cov. Rate and B. Cov. Rate for TSLS, DNTSLS, HP, OSTSLS, STSLS, LIML, DNLIML, SLIML and REQML; panels for N = 100 and N = 500 and for Model (d) with c ∈ {0.1, 0.5, 0.9}.]
A.2. Proof of Theorem 3

Proof. Under the assumptions, we have W′(P_X + P_Z)W/N − H →_p 0 and W′P_Z W/N − f′P_Z f/N →_p 0. Let

V ≡ σ_ϵ² λ′H^{-1} f′P_Z f H^{-1}λ / (N λ′H^{-1} σ_uϵ σ′_uϵ H^{-1}λ),
V̂ ≡ σ̂_ϵ² λ̂′Ĥ^{-1} W′P_Z W Ĥ^{-1}λ̂ / (N λ̂′Ĥ^{-1} σ̂_uϵ σ̂′_uϵ Ĥ^{-1}λ̂).

Then, V̂ − V = o_p(1), and s* and ŝ* can be written as 1 − (1 + VN/K²)^{-1} and 1 − (1 + V̂N/K²)^{-1}, respectively.

Suppose that f′P_Z f/N → c > 0 for some c. Then

ŝ* − s* = (V̂N/K² − VN/K²)/{(1 + V̂N/K²)(1 + VN/K²)} = (V̂ − V)(K²/N)/{(K²/N + V̂)(K²/N + V)} = o_p(K²/N).

This implies that

H{S(ŝ*) − S(s*)}H = (ŝ*² − s*²) σ_uϵ σ′_uϵ K²/N + {(1 − ŝ*)² − (1 − s*)²} σ_ϵ² f′P_Z f/N = o_p(K²/N),

by the continuous mapping theorem. As S(s*) is at least O_p(K²/N) in this case, the result holds.

Next, suppose that f′P_Z f = O_p(K), which occurs when Z is a matrix of irrelevant instruments. Then, s* = O_p(1/K) and S(s*) = O_p(1/N). As N(V̂ − V)/K = o_p(1), we have

ŝ* − s* = (V̂N/K² − VN/K²)/{(1 + V̂N/K²)(1 + VN/K²)} = (1/K) · {N(V̂ − V)/K}/{(1 + V̂N/K²)(1 + VN/K²)} = o_p(1/K).

It follows therefore that S(ŝ*) − S(s*) = o_p(1/N).

A.3. Proof of Theorem 4

First, we show the consistency of the shrinkage LIML estimator and derive its asymptotic distribution under sK/N → 0. Now, our δ̂ is δ̂_liml,s = argmin_δ (y − Wδ)′P^s(y − Wδ)/(y − Wδ)′(y − Wδ).

Lemma 4. Suppose that Assumptions 1–3 are satisfied. Then, under sK/N → 0 and s → 1 or E(f_i Z_i′) = 0, it follows that δ̂_liml,s →_p δ_0.

Proof. Define W̄ ≡ (y, W) and D_0 ≡ (δ_0, I). W̄ can be written as W̄ = W D_0 + ϵ e_1′, where e_1 is the first unit vector. Let Â = W̄′P^s W̄/N and A = D_0′ H̄ D_0. Observing Lemma A.4 and the proof of Lemma A.5 in Donald and Newey (2001), it is enough to show that Â →_p A.
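The closed form s* = 1 − (1 + VN/K²)^{-1} used throughout the proof of Theorem 3 is trivial to compute once an estimate of V is in hand. A minimal sketch (the function name and the test values of V are mine, purely for illustration):

```python
def s_star(V, N, K):
    """Optimal shrinkage weight s* = 1 - (1 + V N / K^2)^{-1}."""
    return 1.0 - 1.0 / (1.0 + V * N / K**2)

N, K = 500, 10
# V -> 0 (Z irrelevant): s* -> 0, the doubtful instruments are shrunk away.
print(s_star(0.0, N, K))    # 0.0
# Large V (Z informative): s* close to 1, the instruments are essentially kept.
print(s_star(10.0, N, K))   # 50/51, about 0.98
```

The formula interpolates smoothly between dropping and keeping the doubtful instrument block, which is exactly the behavior the two cases of the proof (f′P_Z f/N → c > 0 versus f′P_Z f = O_p(K)) contrast.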
The term Â has the following decomposition:

Â = D_0′{f′f/N − f′(I − P^s)f/N + u′P^s f/N + f′P^s u/N + u′P^s u/N}D_0 + D_0′(W′P^s ϵ/N)e_1′ + e_1(ϵ′P^s W/N)D_0 + e_1(ϵ′P^s ϵ/N)e_1′.

First, we have f′f/N →_p H̄ by the LLN. f′(I − P^s)f/N = f′(I − P)f/N + (1 − s) f′P_Z f/N →_p 0 by Lemma A.2(1) in Donald and Newey (2001) and because s →_p 1 or f′P_Z f/N →_p 0. E(ϵ′P^s ϵ) = tr{P^s E(ϵϵ′)} = σ_ϵ²(m + sK), which implies that ϵ′P^s ϵ/N →_p 0 by the Markov inequality. Similarly, we can show that u′P^s u/N →_p 0. Let W_j be the jth column of W. Then,

|W_j′P^s ϵ/N| ≤ (W_j′P^s W_j/N)^{1/2}(ϵ′P^s ϵ/N)^{1/2} ≤ (W_j′W_j/N)^{1/2}(ϵ′P^s ϵ/N)^{1/2} = O_p(1) o_p(1) = o_p(1).

The first inequality is the Cauchy–Schwarz inequality, and the second inequality comes from the fact that I − P^s is positive semi-definite: I − P^s = I − P + (1 − s)P_Z, where I − P and P_Z are positive semi-definite and 1 − s ≥ 0. It follows therefore that W′P^s ϵ/N →_p 0, and f′P^s u/N →_p 0 similarly. Summing up, we have Â →_p A.
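The objective minimized by the shrinkage LIML estimator, Λ(δ) = (y − Wδ)′P^s(y − Wδ)/(y − Wδ)′(y − Wδ), can be solved exactly as a generalized eigenvalue problem using the W̄ = (y, W) device from the proof above: the minimizing coefficient vector β = (1, −δ′)′ is the eigenvector attached to the smallest eigenvalue of (W̄′W̄)^{-1}W̄′P^s W̄. A sketch under an illustrative design of my own (one endogenous regressor, exogenous X kept at full weight, doubtful Z at weight s):

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, K = 400, 2, 15
s = 0.6

X = rng.standard_normal((N, m))
Z0 = rng.standard_normal((N, K))
Z = Z0 - X @ np.linalg.solve(X.T @ X, X.T @ Z0)  # residualize: P_X P_Z = 0
e = rng.standard_normal(N)
u = 0.6 * e + rng.standard_normal(N)
W1 = X @ np.ones(m) + Z @ np.full(K, 0.2) + u    # endogenous regressor
delta0 = 1.0
y = delta0 * W1 + X @ np.array([0.5, -0.5]) + e
W = np.column_stack([W1, X])                     # regressors: W1 plus X

def proj(A):
    return A @ np.linalg.solve(A.T @ A, A.T)

P_s = proj(X) + s * proj(Z)                      # shrinkage projection

# Minimize beta' Wbar'P^s Wbar beta / beta' Wbar'Wbar beta over beta = (1, -d')':
# the smallest eigenvalue of (Wbar'Wbar)^{-1} Wbar'P^s Wbar gives the minimizer.
Wbar = np.column_stack([y, W])
M = np.linalg.solve(Wbar.T @ Wbar, Wbar.T @ P_s @ Wbar)
vals, vecs = np.linalg.eig(M)
beta = vecs[:, np.argmin(vals.real)].real
d_hat = -beta[1:] / beta[0]
print(d_hat[0])  # coefficient on the endogenous regressor
```

With s = 1 this reduces to ordinary LIML on the full instrument set; the shrinkage only changes the weight the objective places on the Z directions.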
Lemma 5. Suppose that Assumptions 1–3 are satisfied, sK/N →_p 0 and s →_p 1 or E(f_i Z_i′) = 0. Then, we have

√N(δ̂_liml,s − δ_0) →_d N(0, σ_ϵ² H̄^{-1}).

Proof. Let A^s(δ) ≡ (y − Wδ)′P^s(y − Wδ)/N and B(δ) ≡ (y − Wδ)′(y − Wδ)/N. Define Λ(δ) ≡ A^s(δ)/B(δ) so that δ̂_liml,s = argmin_δ Λ(δ). Let Λ_δ(δ) and Λ_δδ(δ) be the gradient and Hessian of Λ(δ), respectively. A standard Taylor expansion shows that

√N(δ̂_liml,s − δ_0) = −Λ_δδ(δ̃)^{-1} √N Λ_δ(δ_0) = {σ̃_ϵ² Λ_δδ(δ̃)/2}^{-1} {−σ̃_ϵ² √N Λ_δ(δ_0)/2}

for some mean value δ̃. Now, we have

Λ_δ(δ) = B(δ)^{-1}{A_δ^s(δ) − Λ(δ)B_δ(δ)},
Λ_δδ(δ) = B(δ)^{-1}{A_δδ^s(δ) − Λ(δ)B_δδ(δ)} − B(δ)^{-1}{B_δ(δ)Λ_δ(δ)′ + Λ_δ(δ)B_δ(δ)′}.

As δ̂_liml,s →_p δ_0 by Lemma 4, δ̃ →_p δ_0, which implies that B(δ̃) →_p σ_ϵ² and B_δ(δ̃) →_p −2σ_uϵ. As before, A^s(δ̃) →_p 0, which implies Λ(δ̃) →_p 0. Also, we have A_δ^s(δ̃) →_p 0, which gives Λ_δ(δ̃) →_p 0. Lastly, we have A_δδ^s(δ̃) = 2W′P^s W/N →_p 2H̄ and B_δδ(δ̃) = 2W′W/N →_p 2E(W_i W_i′). Therefore, σ̃_ϵ² Λ_δδ(δ̃)/2 →_p H̄. This also implies that Λ_δ = O_p(1/√N). Then,

σ̃_ϵ² Λ_δδ(δ̃)/2 = W′P^s W/N − Λ(δ_0) W′W/N + O_p(1/√N)
= H − f′(I − P^s)f/N + u′P^s f/N + f′P^s u/N + u′P^s u/N + O_p(sK/N) + O_p(1/√N),

by Λ(δ_0) = O_p(sK/N). As in the proof of Theorem 2, we have f′(I − P^s)f/N = O_p(∆_s + sK/N). It holds also that u′P^s f/N = O_p(1/√N) and u′P^s u/N = O_p(sK/N). Summing up, we have σ̃_ϵ² Λ_δδ(δ̃)/2 = H + O_p(∆_s^{1/2} + √(sK)/√N).

Consider the gradient term. First, define α̂ = W′ϵ/ϵ′ϵ and α = σ_uϵ/σ_ϵ². α̂ − α = O_p(1/√N) by the CLT. We have the following decomposition:

−σ̃_ϵ² √N Λ_δ(δ_0)/2 = W′P^s ϵ/√N − (ϵ′P^s ϵ/ϵ′ϵ) W′ϵ/√N
= h − f′(I − P^s)ϵ/√N + v′P^s ϵ/√N − (α̂ − α) ϵ′P^s ϵ/√N
= h + O_p(∆_s^{1/2} + √(sK)/√N + sK/N).

h →_d N(0, σ_ϵ² H̄) by the CLT. Lemma 2(1) and the Chebyshev inequality say that f′(I − P^s)ϵ/√N = o_p(1). A similar argument as in the proof of Lemma 2(4), together with E(v_i ϵ_i) = 0, implies that v′P^s ϵ/√N = O_p(√(sK/N)) = o_p(1). ϵ′P^s ϵ = O_p(sK), as we see in the proof of Lemma 4. It follows, therefore, that (α̂ − α) ϵ′P^s ϵ/√N = O_p(sK/N) = o_p(1). We thus have −σ̃_ϵ² √N Λ_δ(δ_0)/2 →_d N(0, σ_ϵ² H̄). In conclusion, we have

√N(δ̂_liml,s − δ_0) →_d H̄^{-1} N(0, σ_ϵ² H̄) = N(0, σ_ϵ² H̄^{-1}).

Define Λ̂ = min_δ (y − Wδ)′P^s(y − Wδ)/(y − Wδ)′(y − Wδ) and Λ̃ = ϵ′P^s ϵ/(N σ_ϵ²). Also note that, in the LIML case, to show that a term is o_p(ρ_K,N), it is enough to show that it is o_p(sK/N + ∆_s).

Lemma 6. Suppose that Assumptions 1–3 are satisfied, sK/N →_p 0 and 1 − s = o_p(K/N) or E(f_i Z_i′) = 0. Then, it follows that

Λ̂ = Λ̃ − (σ̃_ϵ²/σ_ϵ² − 1)Λ̃ − h′H^{-1}h/(2N σ_ϵ²) + R̂_Λ = Λ̃ + o_p(sK/N)

and √N R̂_Λ = o_p(ρ_K,N).

Proof. We expand Λ̂ = Λ(δ̂) around the true value δ_0. Then,

Λ̂ = Λ(δ_0) − Λ_δ(δ_0)′{Λ_δδ(δ_0)}^{-1}Λ_δ(δ_0)/2 + O_p(1/N^{3/2})
= Λ̃ − (σ̃_ϵ²/σ_ϵ² − 1)Λ̃ + {(σ̃_ϵ² − σ_ϵ²)²/(σ̃_ϵ² σ_ϵ²)}Λ̃ − Λ_δ(δ_0)′{Λ_δδ(δ_0)}^{-1}Λ_δ(δ_0)/2 + O_p(1/N^{3/2}).

We can see from the proof of Lemma 5 that

−σ̃_ϵ² √N Λ_δ(δ_0)/2 = h + O_p(∆_s^{1/2} + √(sK)/√N) and σ̃_ϵ² Λ_δδ(δ_0)/2 = H + O_p(∆_s^{1/2} + √(sK)/√N).

Then, we have

Λ_δ(δ_0)′{Λ_δδ(δ_0)}^{-1}Λ_δ(δ_0)/2 = h′H^{-1}h/(2N σ_ϵ²) + O_p(∆_s^{1/2}/N + √(sK)/N^{3/2}).

Also, it follows that σ̃_ϵ²/σ_ϵ² − 1 = O_p(1/√N) by the CLT and the delta method. These results give the first equation of the lemma as

Λ̂ = Λ̃ − (σ̃_ϵ²/σ_ϵ² − 1)Λ̃ − h′H^{-1}h/(2N σ_ϵ²) + O_p(∆_s^{1/2}/N + √(sK)/N^{3/2}) + O_p(sK/N²) + O_p(1/N^{3/2}),

and all the remainder terms are o_p(ρ_K,N). The second equation in the lemma is given by the fact that Λ̃ = O_p(sK/N).

Lemma 7. Suppose that Assumptions 1–3 are satisfied and sK/N →_p 0. Then,
1. u′P^s u/N − Λ̃ Σ_u = o_p(sK/N);
2. E(h Λ̃ ϵ′v/√N) = {(m + sK)/N} Σ_i f_i E(ϵ_i² v_i′)/N + O_p(sK/N²);
3. E(hh′H^{-1}h/√N) = O_p(1/N).
Proof. We begin with the proof of 1. E(Λ̃) = tr{P^s E(ϵϵ′)}/(N σ_ϵ²) = (m + sK)/N and

E{Λ̃ − (m + sK)/N}² = E(ϵ′P^s ϵϵ′P^s ϵ)/(N² σ_ϵ⁴) − {(m + sK)/N}²
= {σ_ϵ⁴(m + sK)² + o_p((sK)²)}/(N² σ_ϵ⁴) − {(m + sK)/N}² = o_p((sK/N)²),

by Lemma 3(4) with u replaced by ϵ. This gives {Λ̃ − (m + sK)/N}Σ_u = o_p(sK/N). We also have E(u′P^s u) = (m + sK)Σ_u and u′P^s u/N − {(m + sK)/N}Σ_u = o_p(sK/N). Therefore, 1 is proved.

We observe that

E(h Λ̃ ϵ′v/√N) = Σ_{i,j,k,l} E(f_i ϵ_i ϵ_j P_jk^s ϵ_k ϵ_l v_l′)/(N² σ_ϵ²)
= Σ_i f_i P_ii^s E(ϵ_i⁴ v_i′)/(N² σ_ϵ²) + 2 Σ_{i≠j} f_i P_ij^s E(ϵ_j² v_j′)/N² + Σ_{i≠j} f_i P_jj^s E(ϵ_i² v_i′)/N²
= O_p(sK/N²) + o_p(sK/N²) + {(m + sK)/N} Σ_i f_i E(ϵ_i² v_i′)/N,

which gives 2. Part 3 is Lemma A.8(iii) in Donald and Newey (2001).
Proof of Theorem 4. The consistency and the asymptotic normality of the shrinkage LIML estimator stem from Lemmas 4 and 5. The shrinkage LIML estimator has the following representation:

√N(δ̂_liml,s − δ_0) = Ĥ^{-1} ĥ, ĥ = W′P^s ϵ/√N − Λ̂ W′ϵ/√N, Ĥ = W′P^s W/N − Λ̂ W′W/N.

As in the case of TSLS, we verify the assumptions of Lemma 1. First, Ĥ and ĥ have the following decompositions:

ĥ = h + Σ_{i=1}^5 T_i^h + Z^h, Ĥ = H + Σ_{i=1}^3 T_i^H + Z^H,

where

T_1^h = −f′(I − P^s)ϵ/√N = O_p(∆_s^{1/2}),
T_2^h = v′P^s ϵ/√N = O_p(√(sK)/√N),
T_3^h = −Λ̃ h = O_p(sK/N),
T_4^h = −Λ̃ v′ϵ/√N = O_p(sK/N),
T_5^h = {h′H^{-1}h/(2√N σ_ϵ²)} σ_uϵ = O_p(1/√N),
Z^h = −(Λ̂ − Λ̃)h − √N(σ̃_ϵ²/σ_ϵ² − 1)Λ̃(u′ϵ/N − σ_uϵ) + {h′H^{-1}h/(2√N σ_ϵ²)}(u′ϵ/N − σ_uϵ) − √N R̂_Λ u′ϵ/N,

and

T_1^H = −f′(I − P^s)f/N = O_p(sK/N + ∆_s),
T_2^H = (u′f + f′u)/N = O_p(1/√N),
T_3^H = −Λ̃ H = O_p(sK/N),
Z^H = −u′(I − P^s)f/N − f′(I − P^s)u/N + u′P^s u/N − Λ̃ u′u/N − Λ̃(u′f + f′u)/N − (Λ̂ − Λ̃)W′W/N.

h = O_p(1) and H = O_p(1) by Lemma 2(5). T^h = o_p(1), as all of ∆_s^{1/2}, √(sK)/√N, sK/N and 1/√N are o_p(1).

‖T^H‖² consists of terms of order (sK/N + ∆_s)², 1/N, (sK/N)², (sK/N + ∆_s)/√N, (sK/N + ∆_s) · sK/N and sK/N^{3/2}. It is easy to see that all of them are o_p(ρ_K,N). It follows that ‖T^H‖² = o_p(ρ_K,N). Similarly, ‖T^h‖ · ‖T^H‖ consists of terms of order (sK/N + ∆_s)o_p(1), ∆_s^{1/2}/√N, sK/N, 1/N and sK/N · o_p(1). A simple inspection and Lemma 3(6) say that all of them are o_p(ρ_K,N). That gives ‖T^h‖ · ‖T^H‖ = o_p(ρ_K,N).

To show Z^h = o_p(ρ_K,N), we investigate each term of Z^h. (Λ̂ − Λ̃)h = o_p(sK/N)O_p(1) = o_p(ρ_K,N) by Lemma 6. √N(σ̃_ϵ²/σ_ϵ² − 1)Λ̃(u′ϵ/N − σ_uϵ) = O_p(1)O_p(sK/N)O_p(1/√N) = O_p(sK/N^{3/2}) = o_p(ρ_K,N) by the CLT and the delta method. {h′H^{-1}h/(2√N σ_ϵ²)}(u′ϵ/N − σ_uϵ) = O_p(1/√N)O_p(1/√N) = O_p(1/N) = o_p(ρ_K,N) by the CLT. √N R̂_Λ u′ϵ/N = o_p(ρ_K,N)O_p(1) = o_p(ρ_K,N) by the LLN and Lemma 6. Therefore, Z^h = o_p(ρ_K,N).

Similarly, each term of Z^H is shown to be o_p(ρ_K,N). u′(I − P^s)f/N = O_p(∆_s^{1/2}/√N) = o_p(ρ_K,N), where the first equality can be verified as in the proof of Lemma 3(2) and the second equality is Lemma 3(6). u′P^s u/N − Λ̃ u′u/N = u′P^s u/N − Λ̃ Σ_u − Λ̃(u′u/N − Σ_u) = o_p(sK/N) + O_p(sK/N)o_p(1) = o_p(ρ_K,N) by Lemma 7(1) and the LLN. The CLT implies that Λ̃(u′f + f′u)/N = O_p(sK/N)O_p(1/√N) = o_p(ρ_K,N). Finally, (Λ̂ − Λ̃)W′W/N = o_p(sK/N)O_p(1) = o_p(ρ_K,N) by the LLN and Lemma 6. Hence, we have Z^H = o_p(ρ_K,N).

Consider the decomposition:

(h + Σ_{i=1}^5 T_i^h)(h + Σ_{i=1}^5 T_i^h)′ − hh′H^{-1} Σ_{i=1}^3 T_i^H′ − Σ_{i=1}^3 T_i^H H^{-1} hh′ = A(s) + Z^A(s),

where

A(s) ≡ hh′ + Σ_{i=1}^5 h T_i^h′ + Σ_{i=1}^5 T_i^h h′ + (T_1^h + T_2^h)(T_1^h + T_2^h)′ − hh′H^{-1} Σ_{i=1}^3 T_i^H′ − Σ_{i=1}^3 T_i^H H^{-1} hh′,
Z^A(s) ≡ Σ_{i=3}^5 T_i^h (T_1^h + T_2^h)′ + (T_1^h + T_2^h) Σ_{i=3}^5 T_i^h′ + Σ_{i=3}^5 T_i^h Σ_{i=3}^5 T_i^h′.

Z^A(s) consists of terms of order (sK/N)², sK/N^{3/2}, 1/N, ∆_s^{1/2} sK/N, ∆_s^{1/2}/√N, (sK/N)^{3/2} and √(sK)/N. All of them are o_p(ρ_K,N) by a simple inspection and Lemma 3(6). Z^A(s) = o_p(ρ_K,N).

What remains to be shown is the expectation of A(s). As we saw in the TSLS case, we have E(hh′) = σ_ϵ² H, E(h T_1^h′) = E(T_1^h h′) = −σ_ϵ² f′(I − P^s)f/N, E(T_1^h T_1^h′) = σ_ϵ² f′(I − P^s)(I − P^s)f/N, E(T_1^h T_2^h′) = o_p(∆_s^{1/2} √(sK)/√N) = o_p(ρ_K,N) and, similarly, E(T_2^h T_1^h′) = o_p(ρ_K,N).
E(hh′H^{-1} T_1^H′) = E(T_1^H H^{-1} hh′) = −σ_ϵ² f′(I − P^s)f/N, E(hh′H^{-1} T_2^H′) = O_p(1/N) = o_p(ρ_K,N) and, similarly, E(T_2^H H^{-1} hh′) = o_p(ρ_K,N). A similar argument as in the proof of Lemma 3(6), noting that E(v_i ϵ_i) = 0, gives

E(T_2^h T_2^h′) = σ_ϵ² Σ_v s²K/N + o_p(ρ_K,N).

Lemma 7(3) shows

E(h T_5^h′) = E[{hh′H^{-1}h/(2N σ_ϵ²)} σ′_uϵ] = O_p(1/N) = o_p(ρ_K,N).

Similarly, E(T_5^h h′) = o_p(ρ_K,N). Lemma 7(2) gives

E(h T_4^h′) = −E(h Λ̃ ϵ′v/√N) = −(sK/N) Σ_i f_i E(ϵ_i² v_i′)/N + o_p(ρ_K,N).

Also, we have E(h T_2^h′) = Σ_i f_i P_ii^s E(ϵ_i² v_i′)/N. Letting ζ̂ ≡ Σ_i f_i P_ii^s E(ϵ_i² v_i′)/N − (sK/N) Σ_i f_i E(ϵ_i² v_i′)/N, we have E(h T_2^h′) + E(h T_4^h′) = ζ̂ + o_p(ρ_K,N) and E(T_2^h h′) + E(T_4^h h′) = ζ̂′ + o_p(ρ_K,N).

Summing up, we have

E(A(s)) = σ_ϵ² H − 2σ_ϵ² f′(I − P^s)f/N + σ_ϵ² f′(I − P^s)(I − P^s)f/N + σ_ϵ² Σ_v s²K/N + 2σ_ϵ² f′(I − P^s)f/N + ζ̂ + ζ̂′ + o_p(ρ_K,N)
= σ_ϵ² H + σ_ϵ² Σ_v s²K/N + σ_ϵ² f′(I − P^s)(I − P^s)f/N + ζ̂ + ζ̂′ + o_p(ρ_K,N).

Note that under E(ϵ_i² v_i) = 0, ζ̂ = 0.
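The Monte Carlo exercises summarized in Tables 4 through 8 compare median bias and dispersion of the plain and shrinkage estimators. A stripped-down version of such an experiment (an illustrative design with a fixed shrinkage weight s, not the paper's exact setup or its data-driven ŝ*) looks like:

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, K, R = 100, 1, 10, 200
delta0, s = 1.0, 0.5

def proj(A):
    return A @ np.linalg.solve(A.T @ A, A.T)

def one_rep():
    X = rng.standard_normal((N, m))
    Z0 = rng.standard_normal((N, K))
    Z = Z0 - X @ np.linalg.solve(X.T @ X, X.T @ Z0)
    e = rng.standard_normal(N)
    u = 0.5 * e + rng.standard_normal(N)
    W = X[:, 0] + Z @ np.full(K, 0.05) + u   # Z weak, echoing the low-R2f designs
    y = delta0 * W + e
    P1 = proj(X) + proj(Z)        # s = 1: ordinary TSLS on all instruments
    Ps = proj(X) + s * proj(Z)    # shrinkage TSLS
    return (W @ P1 @ y) / (W @ P1 @ W), (W @ Ps @ y) / (W @ Ps @ W)

est = np.array([one_rep() for _ in range(R)])
med_bias = np.median(est, axis=0) - delta0
print("median bias (TSLS, STSLS):", med_bias)
```

Extending this loop with MAD, truncated RMSE and coverage rates of the associated confidence intervals reproduces the column structure of the tables.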
References

Andrews, D.W.K., 1999. Consistent moment selection procedures for generalized method of moments estimation. Econometrica 67 (3), 543–564.
Anderson, T.W., Kunitomo, N., Matsushita, Y., 2008. On finite sample properties of alternative estimators of coefficients in a structural equation with many instruments. Journal of Econometrics (forthcoming).
Anderson, T.W., Kunitomo, N., Matsushita, Y., 2010. On the asymptotic optimality of the LIML estimator with possibly many instruments. Journal of Econometrics 157, 191–204.
Angrist, J.D., Krueger, A.B., 1991. Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics 106 (4), 979–1014.
Bekker, P.A., 1994. Alternative approximations to the distributions of instrumental variable estimators. Econometrica 62 (3), 657–681.
Bound, J., Jaeger, D.A., Baker, R.M., 1996. Problems with instrumental variables estimation when correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association 90 (430), 443–450.
Carrasco, M., 2010. A regularization approach to the many instruments problem (unpublished manuscript).
Chamberlain, G., 1987. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics 34, 305–334.
Chamberlain, G., Imbens, G., 2004. Random effects estimators with many instrumental variables. Econometrica 72 (1), 295–306.
Chao, J.C., Swanson, N.R., 2005. Consistent estimation with a large number of weak instruments. Econometrica 73 (5), 1673–1692.
Donald, S.G., Newey, W.K., 2001. Choosing the number of instruments. Econometrica 69 (5), 1161–1191.
Doornik, J.A., 2006. An Object-oriented Matrix Programming Language — Ox 4. Timberlake Consultants Ltd.
Flores-Lagunes, A., 2007. Finite sample evidence of IV estimators under weak instruments. Journal of Applied Econometrics 22 (3), 677–694.
George, E.I., 1986. Combining minimax shrinkage estimators. Journal of the American Statistical Association 81 (394), 437–445.
Hahn, J., Hausman, J., 2002. A new specification test for the validity of instrumental variables. Econometrica 70 (1), 163–189.
Hall, A.R., Peixe, F.P.M., 2003. A consistent method for the selection of relevant instruments. Econometric Reviews 22 (3), 269–287.
Hansen, C., Hausman, J., Newey, W.K., 2008. Estimation with many instrumental variables. Journal of Business and Economic Statistics 26 (4), 398–422.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
Kuersteiner, G.M., 2002. Mean squared error reduction for GMM estimators of linear time series models (unpublished manuscript).
Kunitomo, N., 1980. Asymptotic expansions of the distributions of estimators in a linear functional relationship and simultaneous equations. Journal of the American Statistical Association 75, 693–700.
Morimune, K., 1983. Approximate distributions of k-class estimators when the degree of overidentification is large compared with the sample size. Econometrica 51 (3), 821–841.
Morimune, K., 1985. Keizai Moderu no Suitei to Kentei (Estimation and Testing in Economic Models). Kyouritsu Shuppan, Tokyo, Japan (in Japanese).
Nagar, A.L., 1959. The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica 27 (4), 575–595.
Okui, R., 2005. Instrumental variable estimation with many moment conditions with applications to dynamic panel data models. Ph.D. Thesis, University of Pennsylvania.
Phillips, P.C.B., 1983. Exact small sample theory in the simultaneous equations model. In: Griliches, Z., Intriligator, M.D. (Eds.), Handbook of Econometrics, vol. 1. North-Holland Publishing Company (chapter 8).
Small, D., 2002. Inference and model selection for instrumental variables regression. Ph.D. Thesis, Stanford University.
Staiger, D., Stock, J.H., 1997. Instrumental variables regression with weak instruments. Econometrica 65 (3), 557–586.
Stock, J.H., Yogo, M., 2005. Asymptotic distribution of instrumental variables statistics with many weak instruments. In: Andrews, D.W.K., Stock, J.H. (Eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Cambridge University Press, pp. 109–120.
Takeuchi, K., Morimune, K., 1985. Third-order efficiency of the extended maximum likelihood estimators in a simultaneous equation system. Econometrica 53 (1), 177–200.
West, K., Wong, K., Anatolyev, S., 2009. Instrumental variables estimation of heteroskedastic linear models using all lags of instruments. Econometric Reviews 26 (5), 441–467.
Journal of Econometrics 165 (2011) 87–99
Estimation of conditional moment restrictions without assuming parameter identifiability in the implied unconditional moments✩

Shih-Hsun Hsu a, Chung-Ming Kuan b,∗
a Department of Economics, National Chengchi University, Taiwan
b Department of Finance, National Taiwan University, Taipei 106, Taiwan

Article history: Available online 12 May 2011
JEL classification: C12; C22
Keywords: Conditional moment restrictions; Fourier coefficients; Generically comprehensive revealing function; Global identifiability; GMM

Abstract
A well-known difficulty in estimating conditional moment restrictions is that the parameters of interest need not be globally identified by the implied unconditional moments. In this paper, we propose an approach to constructing a continuum of unconditional moments that can ensure parameter identifiability. These unconditional moments depend on the ''instruments'' generated from a ''generically comprehensively revealing'' function, and they are further projected along the exponential Fourier series. The objective function is based on the resulting Fourier coefficients, from which an estimator can be easily computed. A novel feature of our method is that the full continuum of unconditional moments is incorporated into each Fourier coefficient. We show that, when the number of Fourier coefficients in the objective function grows at a proper rate, the proposed estimator is consistent and asymptotically normally distributed. An efficient estimator is also readily obtained via the conventional two-step GMM method. Our simulations confirm that the proposed estimator compares favorably with that of Domínguez and Lobato (2004, Econometrica) in terms of bias, standard error, and mean squared error. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

To estimate the parameters in conditional moment restrictions, it is typical to find a finite set of unconditional moment restrictions implied by the original restrictions and apply a suitable estimation method, such as the generalized method of moments (GMM) of Hansen (1982) and Hansen and Singleton (1982), or the empirical likelihood method of Qin and Lawless (1994) and Kitamura (1997). A leading example is the instrumental-variable estimation method for regression models. This approach hinges on the assumption that the parameters in the conditional restrictions can be globally identified by the implied unconditional restrictions. Under this assumption, estimator consistency is not really an issue and holds under suitable regularity conditions. Therefore, much research interest focuses on estimator efficiency, e.g., Chamberlain (1987), Newey (1990, 1993), Carrasco and Florens (2000) and Donald et al. (2003). Domínguez and Lobato (2004) challenge the assumption of global identifiability and show that the unconditional moments, when chosen arbitrarily, need not be equivalent to the original conditional restrictions. The identification problem may arise
✩ This paper was originally entitled: ''Consistent Parameter Estimation for Conditional Moment Restrictions.''
∗ Corresponding author. E-mail address: [email protected] (C.-M. Kuan).
doi:10.1016/j.jeconom.2011.05.008
even when the unconditional moments are based on the so-called optimal instruments. Without assuming global identifiability, Domínguez and Lobato (2004) introduce the ''instruments'' generated from an indicator function and construct a continuum of unconditional moment restrictions that can identify the parameters of interest. However, their method has some disadvantages. First, the indicator function takes only the values one and zero and hence may not represent the information in the conditioning variables well. Second, their estimation method does not utilize the full continuum of moment restrictions, which may result in further efficiency loss (Carrasco and Florens, 2000). Third, it is not easy to obtain an efficient estimate from their consistent estimate. In this paper, we propose a different approach to constructing a continuum of unconditional moments that can ensure parameter identifiability. These unconditional moments depend on the ''instruments'' generated from the class of ''generically comprehensively revealing'' (GCR) functions (Stinchcombe and White, 1998), and these moments are further projected along the exponential Fourier series. The objective function is then based on the resulting Fourier coefficients, from which an estimator can be easily computed. A novel feature of our method is that it utilizes all possible information in the conditioning variables because all unconditional moments have been incorporated into each Fourier coefficient. Moreover, an efficient estimator can be obtained via the conventional two-step GMM method, which is computationally simpler than that of Carrasco and Florens (2000).
We first show that the proposed estimator is consistent and asymptotically normally distributed when the number of Fourier coefficients in the objective function grows at a proper rate. We also specialize to the ''instruments'' generated from the exponential function, a special case in the class of GCR functions. For such instruments, the unconditional moments and Fourier coefficients have analytic forms, which greatly facilitate estimation in practice. Our simulations confirm that, under various settings, the proposed consistent and efficient estimators perform significantly better than that of Domínguez and Lobato (2004) in terms of bias, standard error, and mean squared error. The proposed estimators also outperform the estimator based on the optimal instruments. This paper is organized as follows. We introduce the new class of consistent estimators in Section 2 and establish its consistency and asymptotic normality in Section 3. The efficient estimator based on the proposed consistent estimator is discussed in Section 4. The simulation results are reported in Section 5. Section 6 concludes this paper. All proofs are deferred to the Appendix.

2. Consistent estimation

We are interested in estimating θ_o, the q × 1 vector of unknown parameters, in the following conditional moment restriction:
E[h(Y, θ_o) | X] = 0, with probability one (w.p.1),   (1)

where h is a p × 1 vector of functions, Y is an r × 1 vector of data variables, and X is an m × 1 vector of conditioning variables. Without loss of generality, we shall work on the case that X is bounded with probability one; see, e.g., Bierens (1994, Theorem 3.2.1). It is well known that (1) is equivalent to the unconditional moment restriction:

E[h(Y, θ_o) f(X)] = 0,   (2)

for all measurable functions f, where each f(X) may be interpreted as an ''instrument'' that helps to identify θ_o. In practice, one typically forms an estimating function by subjectively choosing certain instruments, such as the squares and cross products of the elements of X. This would not be a problem in a linear model if the resulting unconditional moments exactly identify θ_o. Yet, when h is nonlinear in θ_o, Domínguez and Lobato (2004) show that θ_o is not necessarily identified when the unconditional moments are determined arbitrarily, and its identifiability may depend on the marginal distributions of the conditioning variables X. This concern is practically relevant because models with nonlinear restrictions are quite common in econometric applications; see, e.g., Hansen and Singleton (1982) and Hansen and West (2002).¹ One way to ensure parameter identifiability is to employ a class of instruments that span a space of functions of X (Bierens, 1982, 1990; Stinchcombe and White, 1998). Domínguez and Lobato (2004) set the instruments as 1(X ≤ τ) = ∏_{j=1}^m 1(X_j ≤ τ_j), where 1(A) is the indicator function of the event A. This leads to a continuum of unconditional moments indexed by τ that are equivalent to (1):

E[h(Y, θ_o) 1(X ≤ τ)] = 0,   τ ∈ R^m.   (3)

Then, θ_o can be globally identified by an L²-norm of these moments, i.e.,

θ_o = arg min_{θ∈Θ} ∫_{R^m} |E[h(Y, θ) 1(X ≤ τ)]|² dP(τ),   (4)

with P(τ) a distribution function of τ, where |·| denotes the Euclidean norm. A natural choice of P(τ) is P_X(τ), the distribution function of X. The L²-norm in (4) can then be well approximated by the sample average. Domínguez and Lobato (2004) suggest the following estimator:

θ̂_DL(T) = arg min_{θ∈Θ} (1/T) Σ_{k=1}^T | (1/T) Σ_{t=1}^T h(y_t, θ) 1(x_t ≤ τ_k) |²,   (5)

where y_t and x_t are the sample observations of Y and X, respectively, and τ_k = x_k, k = 1, . . . , T. Clearly, this is a GMM estimator based on T unconditional moments induced by the indicator function.²

2.1. A class of consistent estimators

The indicator function is not the only choice for the desired instruments; Stinchcombe and White (1998) demonstrate that any GCR function will also do. Specifically, let Λ_G(T) be the collection of λ(X) = G(A(X, τ)) with τ ∈ T ⊂ R^{m+1} and A(X, τ) = τ_0 + Σ_{j=1}^m X_j τ_j. Λ_G is said to be GCR if, for all T with a nonempty interior, the uniform closure of the span of Λ_G(T) contains C(B) for every compact B, where C(B) is the set of all bounded and continuous functions on B. The function G is said to be GCR if Λ_G is GCR. Corollary 3.9 of Stinchcombe and White (1998) shows that a real analytic function is GCR if and only if it is not a polynomial (of a finite degree).³ Note that polynomials of finite degree are not uniformly dense in the set of all continuous and bounded functions and hence cannot be GCR. Legitimate choices of G are, for example, the exponential function (Bierens, 1982, 1990) or the logistic function (White, 1989). The discussion above suggests that, when G is GCR, (2) holds with G(A(X, τ)) as legitimate instruments and τ in an arbitrarily chosen index set T in R^{m+1}. The unconditional moment restrictions induced by a GCR function are
E[h(Y, θ_o) G(A(X, τ))] = 0,   for almost all τ ∈ T ⊂ R^{m+1},   (6)

where T may be a small subset with a nonempty interior. Note that the indicator function is not GCR; hence (3) must hold for all τ in R^m. Similar to (4), θ_o now can be globally identified by the L²-norm of (6):

θ_o = arg min_{θ∈Θ} ∫_T |E[h(Y, θ) G(A(X, τ))]|² dP(τ).   (7)

In contrast with Domínguez and Lobato (2004), there is no natural choice of P(τ), and it is not easy to find a proper sample counterpart of the L²-norm in (7). Although an objective function for estimating θ_o can be constructed using randomized τ, the resulting estimate is arbitrary and may not be preferred. In this paper, we take a different approach to deriving a class of consistent estimators for θ_o without assuming parameter identifiability in the implied unconditional moments. This approach finds a condition equivalent to the L²-norm in (7). To this end, we project the unconditional moments in (6) along the exponential Fourier series and obtain

E[h(Y, θ) G(A(X, τ))] = (1/(2π)^{m+1}) Σ_{k∈S} C_{G,k}(θ) exp(i k′τ),
¹ Hansen and West (2002) studied the papers published in seven top economics journals in 1990 and 2000 and found that, among the 35 articles that employed the GMM technique, 14 dealt with models with nonlinear restrictions.
² Alternatively, Ai and Chen (2003) and Kitamura et al. (2004) consider nonparametric estimation methods that deal with the conditional moments directly.
³ A function is said to be analytic if it locally equals its Taylor expansion at every point of its domain.
where S := {k = [k_0, k_1, . . . , k_m]′ ∈ Z^{m+1}} with k_i = 0, ±1, ±2, . . . , and C_{G,k}(θ) is a Fourier coefficient:

C_{G,k}(θ) = ∫_T E[h(Y, θ) G(A(X, τ))] exp(−i k′τ) dτ = E[ h(Y, θ) ∫_T G(A(X, τ)) exp(−i k′τ) dτ ],   k ∈ S.

It can be seen that each C_{G,k}(θ) incorporates the full continuum of the original instruments G(A(X, τ)) into a new instrument:

ϕ_{G,k}(X) = ∫_T G(A(X, τ)) exp(−i k′τ) dτ,   (8)

in which the index parameter τ has been integrated out.

We shall use the following notations. Given a complex number f, let f̄ denote its complex conjugate and Re(f) and Im(f) denote its real and imaginary parts, respectively. For a vector of complex numbers f, its complex conjugate, real part and imaginary part are defined elementwise. Then, |f|² = f′f̄. Apart from a scaling factor, Parseval's Theorem implies that the L²-norm in (7) is equivalent to

Σ_{k∈S} |C_{G,k}(θ)|² = Σ_{k∈S} |E[h(Y, θ) ϕ_{G,k}(X)]|².

It follows that θ_o can be identified as

θ_o = arg min_{θ∈Θ} Σ_{k∈S} |E[h(Y, θ) ϕ_{G,k}(X)]|²,   (9)

where the right-hand side no longer involves τ; cf. (7). By replacing E[h(Y, θ) ϕ_{G,k}(X)] in (9) with its sample counterpart, an objective function for estimating θ_o is readily obtained. It is well known that C_{G,k}(θ) → 0 as |k| tends to infinity by Bessel's inequality. This suggests that the new instruments ϕ_{G,k}(X), and hence E[h(Y, θ) ϕ_{G,k}(X)], contain little information for identifying θ_o when |k| is large. As such, we may omit ''remote'' Fourier coefficients and compute an estimator of θ_o as

θ̂(G, K_T) = arg min_{θ∈Θ} Σ_{k∈S(K_T)} | (1/T) Σ_{t=1}^T h(y_t, θ) ϕ_{G,k}(x_t) |²,   (10)

where S(K_T) is a subset of S with k_i = 0, ±1, . . . , ±K_T, such that K_T grows with T but at a slower rate. The proposed estimator (10) depends on the function G, and it is also a GMM estimator based on (2K_T + 1)^{m+1} unconditional moments with the identity weighting matrix. Hence, θ̂(G, K_T) is not an efficient estimator in general. Note that the Domínguez–Lobato estimator (5) relies only on a finite number of unconditional moments determined by the sample observations. By contrast, the proposed estimator (10) utilizes all possible information in estimation because each ϕ_{G,k} has included the full continuum of the instruments required for identifying θ_o. Our estimator is also computationally simpler than that of Carrasco and Florens (2000), which requires preliminary estimation of a covariance operator and its eigenvalues and eigenfunctions. Moreover, a regularization parameter must be determined in practice so as to ensure the invertibility of the estimated covariance operator.

2.2. A specific estimator

To compute the proposed estimator, we follow Bierens (1982, 1990) and set G as the exponential function. This choice has some advantages relative to the indicator function. First, the indicator function takes only the values one and zero, whereas the exponential function is more flexible and hence may better represent the information in the conditioning variables. That is, the exponential function may generate better instruments for identifying θ_o. Second, the exponential function is smooth and hence is convenient in an optimization program. Further, exp(A(X, τ)) with τ ∈ R^{m+1} and exp(X′τ) with τ ∈ R^m only differ by a constant and hence play the same role in function approximation (Stinchcombe and White, 1998). By employing exp(X′τ) as a desired instrument, we are able to reduce the dimension of integration in (7) by one, i.e., T ⊂ R^m, and the summation in (9) is over S = {k = [k_1, . . . , k_m]′ ∈ Z^m}. More importantly, choosing exp(X′τ) results in an analytic form for the instrument ϕ_{exp,k}, which in turn facilitates estimation in practice. In particular, setting T = [−π, π]^m, the new instruments that integrate out τ are

ϕ_{exp,k}(X) = ∫_T exp(X′τ) exp(−i k′τ) dτ = ϕ_{exp,k_1}(X_1) ··· ϕ_{exp,k_m}(X_m),   (11)

where

ϕ_{exp,k_j}(X_j) = ∫_{−π}^{π} exp(X_j τ_j) exp(−i k_j τ_j) dτ_j = (−1)^{k_j} · 2 sinh(π X_j) / (X_j − i k_j),   j = 1, . . . , m,

and sinh(w) = (exp(w) − exp(−w))/2. Based on ϕ_{exp,k}(X), θ_o can be identified as in (9). The proposed estimator thus reads

θ̂(exp, K_T) = arg min_{θ∈Θ} Σ_{k∈S(K_T)} | (1/T) Σ_{t=1}^T h(y_t, θ) ϕ_{exp,k}(x_t) |²,   (12)

where k is m × 1.

2.3. Implementing the proposed method

To summarize, the proposed consistent estimator for θ_o in the conditional moment restriction (1) can be computed via the following steps.
1. Choose a GCR function G, a subset T ⊂ R^{m+1} with a nonempty interior, and an integer K_T smaller than T.
2. Denote S(K_T) := {k = [k_0, k_1, . . . , k_m]′, k_i = 0, ±1, ±2, . . . , ±K_T}, and compute the instrument ϕ_{G,k}(X) = ∫_T G(A(X, τ)) exp(−i k′τ) dτ for k ∈ S(K_T). When G is the exponential function and T = [−π, π]^m, ϕ_{G,k}(X) has the analytic form given in (11).
3. Using a GMM estimation program with the identity weighting matrix, compute the proposed estimator as θ̂(G, K_T) = arg min_{θ∈Θ} Σ_{k∈S(K_T)} | (1/T) Σ_{t=1}^T h(y_t, θ) ϕ_{G,k}(x_t) |².

Thus far, it is not clear how the number of required Fourier coefficients, K_T, should be determined. For convenience, we suggest choosing K_T such that there is not much change between the estimates with K_T and K_T + 1. That is, given a tolerance level q%, K_T is chosen when

‖θ̂(G, K_T + 1) − θ̂(G, K_T)‖ / ‖θ̂(G, K_T)‖ < q%,

where ‖·‖ denotes a vector norm.
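The analytic form in (11) is easy to check numerically. The sketch below (our own Python/NumPy illustration, not the authors' GAUSS code) compares the closed form of ϕ_{exp,k} for univariate X against direct numerical integration over T = [−π, π]:

```python
import numpy as np

def phi_exp_closed(x, k):
    # Closed form of phi_{exp,k}(x) in (11) for univariate X on T = [-pi, pi].
    return (-1.0) ** k * 2.0 * np.sinh(np.pi * x) / (x - 1j * k)

def phi_exp_numeric(x, k, n=200_001):
    # Trapezoidal approximation of int_{-pi}^{pi} exp(x*tau) * exp(-i*k*tau) dtau.
    tau = np.linspace(-np.pi, np.pi, n)
    f = np.exp(x * tau) * np.exp(-1j * k * tau)
    dt = tau[1] - tau[0]
    return dt * (0.5 * f[0] + f[1:-1].sum() + 0.5 * f[-1])

x = 0.37  # a point in (0, 1), like the logistic-transformed data used in Section 5
for k in range(-3, 4):
    assert abs(phi_exp_closed(x, k) - phi_exp_numeric(x, k)) < 1e-6
```

Because the instrument has this closed form, evaluating the objective in (12) requires no numerical integration.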
3. Asymptotic properties

We now establish the asymptotic properties of the proposed estimator θ̂(G, K_T). To ease our illustration and proofs, we begin our analysis with the case that m = 1; the univariate X is denoted as X (no boldface). The asymptotic properties for the case with multivariate X are discussed in Section 3.3.

3.1. Consistency

We impose the following conditions.

[A1] The observed data (y_t′, x_t)′, t = 1, . . . , T, are independent realizations of (Y′, X)′.
[A2] For each θ ∈ Θ, h(·, θ) is measurable, and for each y ∈ R^r, h(y, ·) is continuous on Θ, where Θ is a compact subset of R^q. Also, θ_o in Θ is the unique solution to E[h(Y, θ)|X] = 0.
[A3] E[sup_{θ∈Θ} |h(Y, θ)|²] < ∞.
[A4] G is real analytic but not a polynomial such that, w.p.1, sup_{τ∈T} |G(A(X, τ))| < ∞, sup_{τ∈T} |G_i(A(X, τ))| < ∞, and sup_{τ∈T} |G_{ij}(A(X, τ))| < ∞, where G_i(A(X, τ)) = ∂G(A(X, τ))/∂τ_i and G_{ij}(A(X, τ)) = ∂²G(A(X, τ))/(∂τ_i ∂τ_j), for i, j ∈ {0, 1}.

These conditions are convenient and quite standard in the GMM literature. They may be relaxed at the expense of more technicality. For example, it is possible to extend [A1] to allow for weakly dependent and heterogeneously distributed data; see, e.g., Gallant and White (1988) and Chen and White (1996). Note that in [A2], θ_o is assumed to be the unique solution to the original conditional restrictions; we do not require θ_o to be the unique solution to some implied, unconditional moment restrictions. As in Stinchcombe and White (1998), [A4] requires G to be real analytic but not a polynomial. [A4] also imposes additional restrictions on G and its derivatives, yet it still permits quite general G functions. Setting T = [−π, π]², the instruments resulting from G are

ϕ_{G,k}(X) = ∫_{[−π,π]²} G(A(X, τ)) exp(−i k′τ) dτ.   (13)

Here, k = (k_0, k_1)′. Define c(k_i) = |k_i| for k_i ≠ 0 and c(k_i) = 1 for k_i = 0, i = 0, 1. The result below provides a bound on ϕ_{G,k}(X).

Lemma 3.1. Given [A4], |ϕ_{G,k}(X)| ≤ ∆/[c(k_0) c(k_1)] w.p.1, where ∆ is a real number.

Define the sample counterpart of C_{G,k}(θ) as

m_{G,k,T}(θ) = (1/T) Σ_{t=1}^T h(y_t, θ) ϕ_{G,k}(x_t).

With Lemma 3.1, we are able to characterize the approximating capability of m_{G,k,T}(θ).

Lemma 3.2. Given [A1]–[A4], if K_T → ∞ and K_T = o(T^{1/2}), then

sup_{θ∈Θ} Σ_{k_0,k_1=−K_T}^{K_T} |m_{G,k,T}(θ) − C_{G,k}(θ)|² →_P 0,

where →_P stands for convergence in probability.

Lemma 3.2 implies

Σ_{k_0,k_1=−K_T}^{K_T} |m_{G,k,T}(θ)|² →_P Σ_{k_0,k_1=−∞}^{∞} |C_{G,k}(θ)|²,   (14)

uniformly for all θ in Θ. As θ_o is the unique minimizer of the right-hand side of (14), the consistency result below follows from Theorem 2.1 of Newey and McFadden (1994).

Theorem 3.3. Given [A1]–[A4], if K_T → ∞ and K_T = o(T^{1/2}), then θ̂(G, K_T) →_P θ_o as T → ∞.

For the estimator θ̂(exp, K_T) in (12), note that exp(Xτ) satisfies [A4] with τ a scalar. It is easy to deduce that Lemma 3.1 holds with |ϕ_{exp,k}(X)| ≤ ∆/k. In analogy with Lemma 3.2, we also have

sup_{θ∈Θ} Σ_{k=−K_T}^{K_T} |m_{exp,k,T}(θ) − C_{exp,k}(θ)|² →_P 0,   (15)

when K_T = o(T). The result below follows from (15) and is analogous to Theorem 3.3.

Corollary 3.4. Given [A1]–[A3], if K_T → ∞ and K_T = o(T), then θ̂(exp, K_T) →_P θ_o as T → ∞.

3.2. Asymptotic normality

Recall that the Fourier coefficient C_{G,k}(θ) can be expressed as

E[h(Y, θ) ϕ_{G,k}(X)] = ∫_{[−π,π]²} E[h(Y, θ) G(A(X, τ))] exp(−i k′τ) dτ,

which is the integral of the product of two functions in τ, i.e., E[h(Y, θ) G(A(X, ·))] and exp(−i k′ ·). To establish asymptotic normality, we work on E[h(Y, θ) G(A(X, ·))] and its sample counterpart directly. This requires some results in the function space, as given below.

Consider functions in the Hilbert space L²[−π, π]. The inner product of two p × 1 vectors of functions f and g is ⟨f, g⟩ = ∫_{−π}^{π} f(τ)′ ḡ(τ) dτ, and the norm induced by the inner product is ⟨f, f⟩^{1/2}. A random element U has mean E(U) if E[⟨U, g⟩] = ⟨E(U), g⟩ for any g in L²[−π, π]. The covariance operator K associated with U is, for any g in L²[−π, π], Kg = E[⟨U − E(U), g⟩ (U − E(U))] such that

(Kg)(τ) = E[⟨U − E(U), g⟩ (U(τ) − E(U(τ)))] = [ Σ_{i=1}^{p} ∫_{−π}^{π} κ_{ji}(τ, s) g_i(s) ds ]_{j=1,...,p},

with the kernel κ_{ji}(τ, s) = E[(U_j(τ) − EU_j(τ))(U_i(s) − EU_i(s))]. U is said to be Gaussian if for any g in L²[−π, π], ⟨U, g⟩ has a normal distribution on R with mean ⟨E(U), g⟩ and variance ⟨Kg, g⟩. Analogous results also hold in L²([−π, π]^m). For more discussions on random elements in Hilbert space, see, e.g., Chen and White (1998) and Carrasco and Florens (2000).

In view of (10), θ̂(G, K_T) must satisfy the first order condition:

0 = Σ_{k_0,k_1=−K_T}^{K_T} [∇_θ m_{G,k,T}(θ)′ m̄_{G,k,T}(θ) + ∇_θ m̄_{G,k,T}(θ)′ m_{G,k,T}(θ)] = Σ_{k_0,k_1=−K_T}^{K_T} 2 Re(∇_θ m_{G,k,T}(θ)′ m̄_{G,k,T}(θ)),

where ∇_θ m_{G,k,T}(θ) is a p × q matrix with ∇_{θ_i} m_{G,k,T}(θ) its ith column. A mean-value expansion of m_{G,k,T}(θ̂(G, K_T)) about θ_o gives

m_{G,k,T}(θ̂(G, K_T)) = m_{G,k,T}(θ_o) + ∇_θ m_{G,k,T}(θ_T†)(θ̂(G, K_T) − θ_o),

where θ_T† is between θ̂(G, K_T) and θ_o, and its value may be different for each row of the matrix ∇_θ m_{G,k,T}(θ_T†). Thus,

Σ_{k_0,k_1=−K_T}^{K_T} Re( ∇_θ m_{G,k,T}(θ̂(G, K_T))′ [m_{G,k,T}(θ_o) + ∇_θ m_{G,k,T}(θ_T†)(θ̂(G, K_T) − θ_o)] ) = 0.   (16)

To derive the limiting distribution of the normalized θ̂(G, K_T), we impose the following conditions.
[A5] θ_o is in the interior of Θ.
[A6] For each y, h(y, ·) is continuously differentiable in a neighborhood N of θ_o such that E[sup_{θ∈N} ‖∇_θ h(Y, θ)‖²] < ∞, where ‖·‖ is a matrix norm.
[A7] The q × q matrix M_q, with the (i, j)th element ⟨E[∇_{θ_i} h(Y, θ_o) G(A(X, ·))], E[∇_{θ_j} h(Y, θ_o) G(A(X, ·))]⟩, is symmetric and positive definite.
[A8] T^{−1/2} Σ_{t=1}^T h(y_t, θ_o) G(A(x_t, ·)) →_D Z, where →_D denotes convergence in distribution, and Z is a p-dimensional Gaussian random element that has mean zero and the covariance operator K with (Kg)(τ) = E[⟨h(Y, θ_o) G(A(X, ·)), g⟩ (h(Y, θ_o) G(A(X, τ)))], for any p-dimensional function g.

Here, [A5] is needed for the mean-value expansion; [A6] is a standard ''smoothness'' condition in nonlinear models. By [A7], M_q is invertible so that the normalized estimator has a unique representation, as given in (17). We directly assume functional convergence in [A8] for convenience; this condition is the same as Assumption 11 in Carrasco and Florens (2000). To ensure such convergence, one may also impose primitive conditions on h, G and the data; see, e.g., Chen and White (1998).

To study the behavior of the normalized estimator via (16), we give two limiting results for the terms on the right-hand side of (16).

Lemma 3.5. Given [A1]–[A6], if K_T → ∞ and K_T = o(T^{1/4}), then

Σ_{k_0,k_1=−K_T}^{K_T} Re(∇_θ m_{G,k,T}(θ̂(G, K_T))′ ∇_θ m_{G,k,T}(θ_T†)) →_P Σ_{k_0,k_1=−∞}^{∞} ∇_θ C_{G,k}(θ_o)′ ∇_θ C̄_{G,k}(θ_o).

The limit in Lemma 3.5 is precisely the matrix M_q defined in [A7], because its (i, j)th element is

Σ_{k=−∞}^{∞} ∇_{θ_i} C_{G,k}(θ_o)′ ∇_{θ_j} C̄_{G,k}(θ_o) = ⟨E[∇_{θ_i} h(Y, θ_o) G(A(X, ·))], E[∇_{θ_j} h(Y, θ_o) G(A(X, ·))]⟩,

by the Multiplication theorem (e.g. Stuart, 1961).

Lemma 3.6. Given [A1]–[A6], if K_T → ∞ and K_T = o(T^{1/4}), then

Σ_{k_0,k_1=−K_T}^{K_T} Re(∇_θ m_{G,k,T}(θ̂(G, K_T))′ √T m_{G,k,T}(θ_o)) = Σ_{k_0,k_1=−∞}^{∞} ∇_θ C_{G,k}(θ_o)′ √T m_{G,k,T}(θ_o) + o_P(1).

With Lemmas 3.5 and 3.6, (16) can be expressed as

√T (θ̂(G, K_T) − θ_o) = −M_q^{−1} [ Σ_{k_0,k_1=−∞}^{∞} ∇_θ C_{G,k}(θ_o)′ √T m_{G,k,T}(θ_o) ] + o_P(1).   (17)

The functional convergence condition [A8] now ensures that the term in the square bracket on the right-hand side of (17) has a limiting normal distribution, which in turn leads to the asymptotic normality of θ̂(G, K_T).

Theorem 3.7. Given [A1]–[A8], if K_T → ∞ and K_T = o(T^{1/4}), then

√T (θ̂(G, K_T) − θ_o) →_D N(0, V),

where V = M_q^{−1} Ω_q M_q^{−1} and Ω_q is a q × q matrix with the (i, j)th element:

⟨E[∇_{θ_i} h(Y, θ_o) G(A(X, ·))], K E[∇_{θ_j} h(Y, θ_o) G(A(X, ·))]⟩.

For the estimator θ̂(exp, K_T) with G(A(X, τ)) = exp(Xτ), it can be verified that the results analogous to Lemmas 3.5 and 3.6 hold when K_T is o(T^{1/2}). In particular,

Σ_{k=−K_T}^{K_T} Re(∇_θ m_{exp,k,T}(θ̂(exp, K_T))′ ∇_θ m_{exp,k,T}(θ_T†)) →_P Σ_{k=−∞}^{∞} ∇_θ C_{exp,k}(θ_o)′ ∇_θ C̄_{exp,k}(θ_o),   (18)

which is the matrix M_q with the (i, j)th element:

⟨E[∇_{θ_i} h(Y, θ_o) exp(X ·)], E[∇_{θ_j} h(Y, θ_o) exp(X ·)]⟩,

and

Σ_{k=−K_T}^{K_T} Re(∇_θ m_{exp,k,T}(θ̂(exp, K_T))′ √T m_{exp,k,T}(θ_o)) = Σ_{k=−∞}^{∞} ∇_θ C_{exp,k}(θ_o)′ √T m_{exp,k,T}(θ_o) + o_P(1).   (19)

In this case, (17) becomes

√T (θ̂(exp, K_T) − θ_o) = −M_q^{−1} Σ_{k=−∞}^{∞} ∇_θ C_{exp,k}(θ_o)′ √T m_{exp,k,T}(θ_o) + o_P(1),   (20)

which also has a limiting normal distribution. The result below is analogous to Theorem 3.7.

Corollary 3.8. Given [A1]–[A3] and [A5]–[A8], if K_T → ∞ and K_T = o(T^{1/2}), then

√T (θ̂(exp, K_T) − θ_o) →_D N(0, V),

where V = M_q^{−1} Ω_q M_q^{−1} and Ω_q is a q × q matrix with the (i, j)th element:

⟨E[∇_{θ_i} h(Y, θ_o) exp(X ·)], K E[∇_{θ_j} h(Y, θ_o) exp(X ·)]⟩.

For estimation of V in Corollary 3.8, note from (18) that M_q can be consistently estimated by

Σ_{k=−K_T}^{K_T} ∇_θ m_{exp,k,T}(θ̂(exp, K_T))′ ∇_θ m_{exp,k,T}(θ̂(exp, K_T)).

From [A8] and (19), Ω_q can be consistently estimated by the real part of

Σ_{k=−K_T}^{K_T} Σ_{ℓ=−K_T}^{K_T} [∇_θ m_{exp,ℓ,T}(θ̂(exp, K_T))′] × [ (1/T) Σ_{t=1}^T h(y_t, θ̂(exp, K_T)) ϕ_{exp,ℓ}(x_t) ϕ_{exp,k}(x_t) h(y_t, θ̂(exp, K_T))′ ] × [∇_θ m_{exp,k,T}(θ̂(exp, K_T))].

A consistent estimator of V is readily computed from these two estimators.
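To make the plug-in estimator of M_q concrete, the sketch below (our own Python/NumPy illustration, assuming a hypothetical scalar model h(y, θ) = y − θx, so that ∇_θ m_{exp,k,T}(θ) = −(1/T) Σ_t x_t ϕ_{exp,k}(x_t)) sums ∇_θ m′ ∇_θ m̄ over k = −K_T, . . . , K_T and checks that conjugate symmetry over k makes the sum real and positive:

```python
import numpy as np

def phi_exp(x, k):
    # Closed-form instrument on T = [-pi, pi]; see (11).
    return (-1.0) ** k * 2.0 * np.sinh(np.pi * x) / (x - 1j * k)

def M_hat(x, K):
    # Plug-in estimator of M_q based on (18), for h(y, theta) = y - theta * x:
    # grad_theta m_{exp,k,T} = -(1/T) * sum_t x_t * phi_{exp,k}(x_t)  (scalar q = p = 1).
    total = 0j
    for k in range(-K, K + 1):
        grad_k = -np.mean(x * phi_exp(x, k))
        total += grad_k * np.conj(grad_k)  # grad' times its conjugate, scalar case
    return total

rng = np.random.default_rng(0)
x = rng.uniform(0.05, 0.95, size=500)  # bounded conditioning variable, as in Section 5
M = M_hat(x, K=5)
assert abs(M.imag) < 1e-9   # real, by conjugate symmetry phi_{exp,-k} = conj(phi_{exp,k})
assert M.real > 0.0         # positive definiteness in the scalar case
```

The same loop extends to vector θ by stacking the columns ∇_{θ_i} m_{exp,k,T} into a p × q matrix before forming the products.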
3.3. The results for multivariate X

We now extend the asymptotic properties above to the case with multivariate X. Recall that X is an m × 1 vector of conditioning variables. Setting T = [−π, π]^{m+1}, the proposed instruments based on G are

ϕ_{G,k}(X) = ∫_{[−π,π]^{m+1}} G(A(X, τ)) exp(−i k′τ) dτ,

where k = (k_0, k_1, . . . , k_m)′. The required conditions for the asymptotics are unchanged, except that [A4] is changed to [A4′].

[A4′] G is real analytic but not a polynomial such that, w.p.1,

sup_{τ∈T} | ∂^j G(A(X, τ)) / ∏_{i=0}^m (∂τ_i)^{l_i} | < ∞,

where i = 0, 1, . . . , m, j = 1, . . . , m, and l_i = 0, 1, . . . , j such that Σ_{i=0}^m l_i = j.

Again, let c(k_i) = |k_i| for k_i ≠ 0 and c(k_i) = 1 for k_i = 0, i = 0, 1, . . . , m. Similar to Lemma 3.1, we obtain the following bound on ϕ_{G,k}(X) when X is multivariate.

Lemma 3.9. Given [A4′], |ϕ_{G,k}(X)| ≤ ∆/[∏_{i=0}^m c(k_i)] w.p.1, where ∆ is a real number.

With Lemma 3.9, the results below include Theorems 3.3 and 3.7 as special cases. Note that the growth rates of K_T depend on m, the dimension of X.⁴ The results for the specific estimator θ̂(exp, K_T) can be obtained similarly.

Theorem 3.10. Given [A1]–[A3] and [A4′], if K_T → ∞ and K_T = o(T^{1/(m+1)}), then θ̂(G, K_T) →_P θ_o as T → ∞.

Theorem 3.11. Given [A1]–[A3], [A4′] and [A5]–[A8], if K_T → ∞ and K_T = o(T^{1/(2m+2)}), then

√T (θ̂(G, K_T) − θ_o) →_D N(0, V),

where V = M_q^{−1} Ω_q M_q^{−1} and Ω_q is a q × q matrix with the (i, j)th element:

⟨E[∇_{θ_i} h(Y, θ_o) G(A(X, ·))], K E[∇_{θ_j} h(Y, θ_o) G(A(X, ·))]⟩.

4. Efficient estimation

It now remains to show how an efficient estimator can be computed; this is the topic to which we now turn. Following Newey (1990, 1993) and Domínguez and Lobato (2004), an efficient estimate may be obtained from the proposed consistent estimate via an additional Newton–Raphson step. That is, an efficient estimator can be computed as

θ̃_T = θ̂(G, K_T) − [∇_{θθ′} Q_T(θ̂(G, K_T))]^{−1} ∇_θ Q_T(θ̂(G, K_T)),

where Q_T(θ) is an objective function for the efficient estimator that can locally identify θ_o, and ∇_θ Q_T(θ) and ∇_{θθ′} Q_T(θ) are its gradient vector and Hessian matrix, both evaluated at the consistent estimate θ̂(G, K_T). In practice, identifying such an objective function and estimating its gradient and Hessian matrix may not be as straightforward as one would like (Newey, 1990, 1993). Carrasco and Florens (2000) consider efficient estimation based on the objective function that takes into account the covariance structure:

θ_o = arg min_{θ∈Θ} ∫_T |K^{−1/2} E[h(Y, θ) exp(τX)]|² dP(τ),

where K is the covariance operator introduced in Section 3.2, and the corresponding estimation method is based on projection along preliminary estimates of the eigenfunctions of K. There are some drawbacks of this approach. First, this estimator depends on various user-chosen parameters and hence is arbitrary to some extent. Second, the generalized inverse of the covariance operator exists only for a subset of Hilbert space, namely, the reproducing kernel Hilbert space. Moreover, it is difficult to generalize their results to allow for multivariate X.

Alternatively, an efficient estimator is readily computed via the conventional two-step GMM method. As ϕ_{G,k}(X) is complex, we now consider ϕ_{G,k}^r(X) and ϕ_{G,k}^i(X), the real and imaginary parts of ϕ_{G,k}(X).⁵ Equivalent to (9), θ_o can also be identified as

θ_o = arg min_{θ∈Θ} Σ_{k∈S} ( |E[h(Y, θ) ϕ_{G,k}^r(X)]|² + |E[h(Y, θ) ϕ_{G,k}^i(X)]|² ),

where the minimum of this objective function is zero. A new set of unconditional moment restrictions now consists of E[h(Y, θ_o) ϕ_{G,k}^r(X)] = 0 and E[h(Y, θ_o) ϕ_{G,k}^i(X)] = 0 with k ∈ S. Given ϕ_{G,k}^r(X) = ϕ_{G,−k}^r(X) and ϕ_{G,k}^i(X) = −ϕ_{G,−k}^i(X) for any k ∈ S, some of these unconditional moment restrictions are redundant and can be omitted. Let q_t(θ, G, K_T) = h(y_t, θ) ⊗ Z_{G,K_T}(x_t), where Z_{G,K_T}(x_t) is the (2K_T + 1) × (4K_T + 1)^m-dimensional vector that contains ϕ_{G,k}^r(x_t) and ϕ_{G,k}^i(x_t), where k = [k_0, k_1, . . . , k_m]′ with k_0 = 0, 1, . . . , K_T, and k_i = 0, ±1, . . . , ±K_T for i = 1, 2, . . . , m. The sample counterpart of the asymptotic covariance matrix of q_t(θ, G, K_T) is

V_T(θ, G, K_T) = (1/T) Σ_{t=1}^T q_t(θ, G, K_T) q_t(θ, G, K_T)′.

Evaluating the inverse of V_T at the consistent estimate θ̂(G, K_T) and taking the resulting matrix as the weighting matrix, an efficient GMM estimator of θ_o is

θ̃(G, K_T) = arg min_{θ∈Θ} [ (1/T) Σ_{t=1}^T q_t(θ, G, K_T) ]′ V_T^{−1}(θ̂(G, K_T), G, K_T) [ (1/T) Σ_{t=1}^T q_t(θ, G, K_T) ].   (21)

In the homoskedasticity case that E[h(y_t, θ_o) h(y_t, θ_o)′ | X] is constant, V_T simplifies to

V_T(θ, G, K_T) = [ (1/T) Σ_{t=1}^T h(y_t, θ) h(y_t, θ)′ ] ⊗ [ (1/T) Σ_{t=1}^T Z_{G,K_T}(x_t) Z_{G,K_T}(x_t)′ ].

⁴ The dimension m affects the growth rates of K_T only through the implication rule and the generalized Chebyshev inequality in the proofs.
⁵ For example, when X is univariate and G is the exponential function,

ϕ_{exp,k}^r(X) = (−1)^k [2X/(X² + k²)] sinh(πX),
ϕ_{exp,k}^i(X) = (−1)^k [2k/(X² + k²)] sinh(πX),

are the real and imaginary parts of ϕ_{exp,k}(X).
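The real and imaginary parts in footnote 5, and the conjugate symmetries used to drop redundant moments, can be checked against the complex closed form in (11). A small sketch (our own check in Python/NumPy):

```python
import numpy as np

def phi_exp(x, k):
    # Complex closed form of phi_{exp,k}(x) from (11), univariate case.
    return (-1.0) ** k * 2.0 * np.sinh(np.pi * x) / (x - 1j * k)

def phi_exp_real(x, k):
    # Real part, as given in footnote 5.
    return (-1.0) ** k * (2.0 * x / (x ** 2 + k ** 2)) * np.sinh(np.pi * x)

def phi_exp_imag(x, k):
    # Imaginary part, as given in footnote 5.
    return (-1.0) ** k * (2.0 * k / (x ** 2 + k ** 2)) * np.sinh(np.pi * x)

for x in (0.2, 0.8):
    for k in range(-4, 5):
        z = phi_exp(x, k)
        assert abs(z.real - phi_exp_real(x, k)) < 1e-9
        assert abs(z.imag - phi_exp_imag(x, k)) < 1e-9
        # Symmetries that make the negative-k moments redundant:
        assert abs(phi_exp_real(x, k) - phi_exp_real(x, -k)) < 1e-9
        assert abs(phi_exp_imag(x, k) + phi_exp_imag(x, -k)) < 1e-9
```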
Table 1
Models in Domínguez and Lobato (2004) with exogenous regressors.

                               X ~ N(0, 1)                    X ~ N(1, 1)
T     Estimator          Bias      SE       MSE         Bias      SE       MSE
50    θ̂_NLS             −0.0011   0.0498   0.0025      −0.0002   0.0225   0.0005
      θ̂_OPIV            −0.0168   0.1961   0.0387      −2.2072   0.7923   5.4992
      θ̂_DL              −0.0058   0.1373   0.0189      −0.0011   0.0484   0.0023
      θ̂(exp, K_T)       −0.0024   0.0776   0.0060      −0.0004   0.0253   0.0006
      θ̃(exp, K_T)       −0.0017   0.0614   0.0038      −0.0001   0.0248   0.0006
100   θ̂_NLS             −0.0009   0.0347   0.0012      −0.0002   0.0158   0.0002
      θ̂_OPIV             0.0021   0.0833   0.0070      −2.2266   0.7890   5.5801
      θ̂_DL              −0.0037   0.0840   0.0071      −0.0003   0.0336   0.0011
      θ̂(exp, K_T)       −0.0008   0.0501   0.0025      −0.0001   0.0174   0.0003
      θ̃(exp, K_T)       −0.0010   0.0394   0.0016      −0.0001   0.0166   0.0003
200   θ̂_NLS             −0.0002   0.0239   0.0006       0.0001   0.0110   0.0001
      θ̂_OPIV            −0.0018   0.0601   0.0036      −2.2382   0.7769   5.6129
      θ̂_DL              −0.0010   0.0580   0.0034       0.0004   0.0244   0.0006
      θ̂(exp, K_T)       −0.0005   0.0343   0.0012       0.0002   0.0123   0.0002
      θ̃(exp, K_T)       −0.0003   0.0261   0.0007       0.0001   0.0115   0.0001

NOTE: 1. The DGP is Y = θ_o² X + θ_o X² + ε, ε ~ N(0, 1), where θ_o = 1.25, X ~ N(0, 1) or X ~ N(1, 1). 2. We set K_T = 5 for θ̂(exp, K_T) and θ̃(exp, K_T).
The inverse of V_T is easy to compute in practice because the first term in V_T is positive definite and the inverse of the second term can be obtained by any generalized inverse method. Under the imposed conditions, it is reasonable to expect θ̃(G, K_T) to be asymptotically as efficient as the estimator of Carrasco and Florens (2000). By treating Z_{G,K_T}(x_t) as a class of approximating functions, the results in Donald et al. (2003) may be employed to establish the asymptotic properties of the efficient estimator (21).6 It should be emphasized that, with the proposed unconditional moments, two-step GMM estimation is not the only way to obtain an efficient estimator; other methods, such as empirical likelihood estimation (e.g., Qin and Lawless, 1994) and continuously updated estimation (e.g., Hansen et al., 1996), would also serve.

5. Simulations

In this section, we focus on the finite-sample performance of the proposed consistent and efficient estimators, θ̂(exp, K_T) and θ̃(exp, K_T). We compare their performance with that of the nonlinear least squares (NLS) estimator,

$$\hat\theta_{NLS} = \arg\min_{\theta\in\Theta}\frac{1}{T}\sum_{t=1}^{T}|h(y_t,\theta)|^2,$$

and the DL estimator of Domínguez and Lobato (2004), θ̂_DL in (5). Our comparison is based on the bias, standard error (SE), and mean squared error (MSE) of these estimators. The parameter estimates are computed using the GAUSS optimization procedure OPTMUM with the BFGS algorithm. In all experiments, the sample sizes are T = 50, 100, 200, and the number of replications is 5000. In each replication, we randomly draw 10 initial values for each estimator and choose the estimate that yields the smallest value of the objective function. The data x_t are transformed using the logistic mapping exp(x_t)/[1 + exp(x_t)], so that they are bounded between 0 and 1. We set K_T = 5 for the proposed estimators; the effect of different K_T on the proposed estimator is examined in Section 5.4.
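The replication protocol (several random initial values, keep the candidate with the smallest objective) can be sketched as follows. This is our illustrative stand-in: the shrinking-step local search replaces the BFGS routine used in the paper, and `multistart_minimize` is a hypothetical helper name.

```python
import numpy as np

def multistart_minimize(obj, bounds, n_starts=10, n_shrink=60, seed=0):
    """Multi-start 1-D minimization: draw random initial values, refine each
    with a simple shrinking-step search, keep the smallest objective value."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    best_x, best_f = None, np.inf
    for x0 in rng.uniform(lo, hi, n_starts):
        x = float(x0)
        f = obj(x)
        step = (hi - lo) / 10.0
        for _ in range(n_shrink):
            improved = True
            while improved:                      # walk while the current step helps
                improved = False
                for cand in (x - step, x + step):
                    fc = obj(cand)
                    if fc < f:
                        x, f, improved = cand, fc, True
            step *= 0.7                          # then refine the step size
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f
```

With `obj` set to a GMM or NLS criterion and `bounds` covering the region of interest, this reproduces the "10 initial values, smallest objective wins" rule used in each replication.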
6 Some stronger conditions are needed. For example, when G is the exponential function and X is univariate, Theorems 5.3 and 5.4 in Donald et al. (2003) require the growth rate of K_T to be o(T^{1/2}). This is more restrictive than the rate required for the consistent estimator θ̂(exp, K_T); cf. Corollary 3.4.
5.1. The experiments in Domínguez and Lobato (2004)

Following Domínguez and Lobato (2004), we postulate the following nonlinear model with exogenous regressors:

$$Y = \theta_o^2 X + \theta_o X^2 + \epsilon,\qquad \epsilon\sim N(0,1),$$

where θ_o = 1.25 is the unique solution to the conditional moment restriction E(ε|X) = 0. We consider two cases: X ∼ N(0,1) and X ∼ N(1,1). When X ∼ N(0,1), θ_o = 1.25 is the only real solution to the unconditional moment restriction resulting from the "feasible" optimal instrument (2θX + X²); the other two solutions are complex: −0.625 ± 1.0533i. When X ∼ N(1,1), in addition to θ_o = 1.25, θ = −1.25 and θ = −3 also satisfy the unconditional moment restriction with the "feasible" optimal instrument. In this case, 1.25 is the global minimum of the NLS objective function; the other two solutions are only local minima. For comparison, our simulations here also include the optimal instrumental variable (OPIV) estimator:
$$\hat\theta_{OPIV} = \arg\min_{\theta\in\Theta}\left[\frac{1}{T}\sum_{t=1}^{T}\bigl(y_t - \theta^2 x_t - \theta x_t^2\bigr)\bigl(2\theta x_t + x_t^2\bigr)\right]^2,$$
which is different from the NLS estimator; cf. Domínguez and Lobato (2004, p. 1608). The simulation results are summarized in Table 1. In most cases, the NLS estimator outperforms the other estimators in terms of bias, SE, and MSE, as it should. On the other hand, θ̂_OPIV has severe bias and large SE and is dominated by the other estimators. Note that when X ∼ N(1,1), the existence of three possible solutions (1.25, −1.25, and −3) suggests that the bias of the OPIV estimator should be close to −2.25. This is confirmed in our simulation.7 It is also clear that the proposed consistent and efficient estimators, θ̂(exp, K_T) and θ̃(exp, K_T), both dominate the DL estimator in terms of bias, SE, and MSE in all cases. Also, there is no significant difference between these two estimators and the DL estimator in terms of the speed of convergence. When the sample size is not too small (T = 100, 200), the performance of the proposed efficient estimator is comparable with that of the NLS estimator. It
7 Note that Domínguez and Lobato (2004) report a much smaller bias (about −0.4) under the same simulation design.
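The three solutions quoted above for X ∼ N(1,1) can be verified directly from the population moment. The check below is ours, not part of the paper; it uses the standard N(1,1) moments E[X] = 1, E[X²] = 2, E[X³] = 4, E[X⁴] = 10, and `opiv_moment` is a hypothetical helper name.

```python
# Population moment of the OPIV condition,
#   g(theta) = E[(Y - theta^2 X - theta X^2)(2 theta X + X^2)],
# for Y = theta_o^2 X + theta_o X^2 + eps, X ~ N(1,1), eps independent of X.
def opiv_moment(theta, theta_o=1.25):
    a = theta_o**2 - theta**2          # coefficient on X in the residual
    b = theta_o - theta                # coefficient on X^2 in the residual
    # E[X * instrument] = 2*theta*E[X^2] + E[X^3];  E[X^2 * instrument] = 2*theta*E[X^3] + E[X^4]
    return a * (2 * theta * 2 + 4) + b * (2 * theta * 4 + 10)

for root in (1.25, -1.25, -3.0):       # the three solutions reported in the text
    assert abs(opiv_moment(root)) < 1e-12
```

All three values zero out the moment condition, which is why the OPIV criterion fails to identify θ_o = 1.25 uniquely in this design.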
Table 2 Models with an endogenous regressor.

| ρ    | Est.          | T = 50: Bias / SE / MSE     | T = 100: Bias / SE / MSE    | T = 200: Bias / SE / MSE    |
| 0.01 | θ̂_NLS        |  0.0014 / 0.0315 / 0.0010   |  0.0010 / 0.0212 / 0.0005   |  0.0015 / 0.0146 / 0.0002   |
|      | θ̂_DL         | −0.0108 / 0.1129 / 0.0129   | −0.0037 / 0.0800 / 0.0064   | −0.0006 / 0.0560 / 0.0031   |
|      | θ̂(exp, K_T)  | −0.0001 / 0.0557 / 0.0031   |  0.0007 / 0.0368 / 0.0014   |  0.0007 / 0.0239 / 0.0006   |
|      | θ̃(exp, K_T)  | −0.0005 / 0.0492 / 0.0024   |  0.0001 / 0.0331 / 0.0011   |  0.0005 / 0.0218 / 0.0005   |
| 0.1  | θ̂_NLS        |  0.0102 / 0.0314 / 0.0011   |  0.0103 / 0.0212 / 0.0006   |  0.0098 / 0.0147 / 0.0003   |
|      | θ̂_DL         | −0.0137 / 0.1170 / 0.0139   | −0.0054 / 0.0826 / 0.0068   | −0.0021 / 0.0565 / 0.0032   |
|      | θ̂(exp, K_T)  | −0.0025 / 0.0568 / 0.0032   | −0.0006 / 0.0358 / 0.0013   | −0.0001 / 0.0241 / 0.0006   |
|      | θ̃(exp, K_T)  | −0.0019 / 0.0503 / 0.0025   | −0.0001 / 0.0325 / 0.0011   |  0.0030 / 0.0217 / 0.0005   |
| 0.3  | θ̂_NLS        |  0.0324 / 0.0314 / 0.0020   |  0.0311 / 0.0210 / 0.0014   |  0.0314 / 0.0145 / 0.0012   |
|      | θ̂_DL         | −0.0100 / 0.1183 / 0.0141   | −0.0067 / 0.0836 / 0.0070   | −0.0046 / 0.0574 / 0.0033   |
|      | θ̂(exp, K_T)  | −0.0029 / 0.0571 / 0.0033   | −0.0023 / 0.0369 / 0.0014   | −0.0011 / 0.0242 / 0.0006   |
|      | θ̃(exp, K_T)  |  0.0016 / 0.0512 / 0.0026   |  0.0004 / 0.0335 / 0.0011   |  0.0003 / 0.0220 / 0.0005   |
| 0.5  | θ̂_NLS        |  0.0539 / 0.0310 / 0.0039   |  0.0527 / 0.0206 / 0.0032   |  0.0522 / 0.0144 / 0.0029   |
|      | θ̂_DL         | −0.0120 / 0.1237 / 0.0154   | −0.0088 / 0.0852 / 0.0073   | −0.0029 / 0.0593 / 0.0035   |
|      | θ̂(exp, K_T)  | −0.0049 / 0.0575 / 0.0033   | −0.0033 / 0.0360 / 0.0013   | −0.0009 / 0.0249 / 0.0006   |
|      | θ̃(exp, K_T)  |  0.0024 / 0.0522 / 0.0027   |  0.0012 / 0.0324 / 0.0011   |  0.0005 / 0.0224 / 0.0005   |
| 0.7  | θ̂_NLS        |  0.0746 / 0.0291 / 0.0064   |  0.0736 / 0.0202 / 0.0058   |  0.0731 / 0.0139 / 0.0055   |
|      | θ̂_DL         | −0.0168 / 0.1268 / 0.0164   | −0.0079 / 0.0851 / 0.0073   | −0.0051 / 0.0588 / 0.0035   |
|      | θ̂(exp, K_T)  | −0.0102 / 0.0606 / 0.0038   | −0.0039 / 0.0369 / 0.0014   | −0.0026 / 0.0252 / 0.0006   |
|      | θ̃(exp, K_T)  |  0.0011 / 0.0536 / 0.0029   |  0.0020 / 0.0340 / 0.0012   |  0.0012 / 0.0226 / 0.0005   |
| 0.9  | θ̂_NLS        |  0.0968 / 0.0287 / 0.0102   |  0.0951 / 0.0187 / 0.0094   |  0.0939 / 0.0131 / 0.0090   |
|      | θ̂_DL         | −0.0153 / 0.1287 / 0.0168   | −0.0082 / 0.0866 / 0.0076   | −0.0034 / 0.0587 / 0.0035   |
|      | θ̂(exp, K_T)  | −0.0112 / 0.0623 / 0.0040   | −0.0048 / 0.0372 / 0.0014   | −0.0010 / 0.0239 / 0.0006   |
|      | θ̃(exp, K_T)  |  0.0021 / 0.0574 / 0.0033   |  0.0028 / 0.0336 / 0.0011   |  0.0007 / 0.0216 / 0.0005   |

NOTE: 1. The DGP is Y = θ_o² Z + θ_o Z² + ε and Z = X + ν, with (ε, ν)′ ∼ N(0, [[1, ρ], [ρ, 1]]), where θ_o = 1.25, ρ = 0.01, 0.1, 0.3, 0.5, 0.7, 0.9, and X ∼ N(0,1) is independent of ε and ν. 2. We set K_T = 5 for θ̂(exp, K_T) and θ̃(exp, K_T).
is somewhat surprising to see that, compared with the proposed consistent estimator, our efficient estimator has not only smaller SE and MSE but also slightly smaller bias in many cases. The trade-off between SE and bias can be seen in the experiments in Section 5.3.

5.2. Model with an endogenous regressor

We extend the previous experiment to the case in which there is an endogenous regressor. The model specification is Y = θ_o² Z + θ_o Z² + ε and Z = X + ν, with

$$\begin{bmatrix}\epsilon\\ \nu\end{bmatrix}\sim N\left(0,\begin{bmatrix}1&\rho\\ \rho&1\end{bmatrix}\right),$$

where θ_o = 1.25, ρ = 0.01, 0.1, 0.3, 0.5, 0.7, 0.9, and X ∼ N(0,1) is independent of ε and ν. Given this specification, E(ε|X) = 0. The simulation results are collected in Table 2. It is clear that the biases of these estimators all increase with ρ. In particular, the NLS estimator has very large biases, and these biases do not diminish as the sample size increases. This should not be surprising because the NLS estimator is inconsistent (due to the endogenous regressor). On the other hand, the proposed consistent and efficient estimators perform remarkably well. They have much smaller bias than the NLS estimator, and they again outperform the DL estimator in terms of bias, SE, and MSE for every ρ and every sample size. Although the NLS estimator typically has a smaller SE, the proposed estimators may yield smaller MSE when the correlation between ε and ν is not too small (e.g., ρ ≥ 0.3).
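Drawing from this design is straightforward with a Cholesky-style construction of the correlated pair (ε, ν). The sketch below is ours (`simulate` is a hypothetical helper name), following the DGP of this subsection:

```python
import numpy as np

def simulate(T, rho, theta_o=1.25, seed=0):
    """One sample from the endogenous-regressor DGP of Section 5.2."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=T)                       # exogenous instrument
    eps = rng.normal(size=T)
    # (eps, nu) jointly normal, unit variances, correlation rho
    nu = rho * eps + np.sqrt(1.0 - rho**2) * rng.normal(size=T)
    z = x + nu                                   # endogenous: corr(z, eps) != 0
    y = theta_o**2 * z + theta_o * z**2 + eps
    return y, z, x
```

Because z is correlated with ε while x is not, least squares in z is inconsistent but moments built from functions of x remain valid, which is exactly what Table 2 illustrates.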
5.3. Noisy disturbances

We now examine the effect of the disturbance variance on the performance of the various estimators. The model is again

$$Y = \theta_o^2 X + \theta_o X^2 + \epsilon,\qquad \epsilon\sim N(0,\sigma^2),$$

where θ_o = 1.25, X is uniformly distributed on (−1, 1) and independent of ε, and σ² = 1, 4, and 9. It can be verified that there are three solutions to the unconditional moment restriction resulting from the "feasible" optimal instrument (2θX + X²): θ = 1.25 and θ = (−25 ± √145)/40, where 1.25 is the global minimum. The results are summarized in Table 3. We note first that, in contrast with the results in Table 1, the NLS estimator is no longer the best estimator, even though there is a unique global minimum and the regressor is exogenous. The proposed consistent estimator has smaller bias than all of the other estimators in all cases, except that its bias is slightly larger than that of the NLS estimator when σ² = 1 and T = 200. In terms of MSE, the proposed consistent estimator dominates the DL estimator in all cases and outperforms the OPIV estimator when σ² is not too large. Although the proposed efficient estimator has larger bias than θ̂(exp, K_T) in most cases, it still outperforms the other estimators in terms of bias in all cases (except when T = 50 and σ² = 1). Moreover, the efficient estimator has the smallest MSE in all cases with T = 100, 200, and its MSE is only slightly larger than that of the NLS estimator for T = 50. As far as MSE is concerned, the proposed efficient estimator is to be preferred to the other estimators.

5.4. The proposed estimator with various K_T

We now examine the effect of K_T on the performance of the proposed estimator. The model specification is the same as that in
Table 3 Models with different disturbance variances.

| σ² | Estimator      | T = 50: Bias / SE / MSE      | T = 100: Bias / SE / MSE     | T = 200: Bias / SE / MSE     |
| 1  | θ̂_NLS         | −0.4144 / 0.9391 / 1.0535    | −0.2227 / 0.7166 / 0.5631    | −0.0680 / 0.4055 / 0.1690    |
|    | θ̂_OPIV        | −1.0993 / 0.9027 / 2.0230    | −1.1143 / 0.9086 / 2.0669    | −1.1370 / 0.9230 / 2.1444    |
|    | θ̂_DL          | −0.8468 / 1.0712 / 1.8643    | −0.6325 / 0.9782 / 1.3567    | −0.4303 / 0.8550 / 0.9160    |
|    | θ̂(exp, K_T)   | −0.3146 / 1.0875 / 1.2814    | −0.1543 / 0.8050 / 0.6717    | −0.0685 / 0.4931 / 0.2478    |
|    | θ̃(exp, K_T)   | −0.4241 / 0.9645 / 1.1100    | −0.2105 / 0.7089 / 0.5467    | −0.0559 / 0.3707 / 0.1405    |
| 4  | θ̂_NLS         | −0.7696 / 1.1993 / 2.0303    | −0.6318 / 1.1061 / 1.6223    | −0.4155 / 0.9399 / 1.0559    |
|    | θ̂_OPIV        | −1.1114 / 0.9403 / 2.1192    | −1.0835 / 0.9315 / 2.0415    | −1.1021 / 0.9157 / 2.0529    |
|    | θ̂_DL          | −1.2274 / 1.1969 / 2.9387    | −1.0644 / 1.1360 / 2.4232    | −0.8628 / 1.0637 / 1.8757    |
|    | θ̂(exp, K_T)   | −0.5048 / 1.5702 / 2.7199    | −0.3924 / 1.2976 / 1.8374    | −0.2346 / 1.0388 / 1.1340    |
|    | θ̃(exp, K_T)   | −0.7614 / 1.2632 / 2.1750    | −0.5727 / 1.0948 / 1.5262    | −0.3573 / 0.8932 / 0.9252    |
| 9  | θ̂_NLS         | −0.9124 / 1.3041 / 2.5328    | −0.8078 / 1.2110 / 2.1187    | −0.6299 / 1.1001 / 1.6067    |
|    | θ̂_OPIV        | −1.1412 / 0.9744 / 2.2517    | −1.1136 / 0.9458 / 2.1345    | −1.0870 / 0.9238 / 2.0347    |
|    | θ̂_DL          | −1.3675 / 1.3043 / 3.5709    | −1.2492 / 1.1965 / 2.9918    | −1.0814 / 1.1259 / 2.4369    |
|    | θ̂(exp, K_T)   | −0.5511 / 1.8035 / 3.5556    | −0.4478 / 1.5662 / 2.6530    | −0.3624 / 1.3195 / 1.8720    |
|    | θ̃(exp, K_T)   | −0.8602 / 1.3805 / 2.6453    | −0.7479 / 1.2245 / 2.0584    | −0.5786 / 1.0817 / 1.5046    |

NOTE: 1. The DGP is Y = θ_o² X + θ_o X² + ε, ε ∼ N(0, σ²), where θ_o = 1.25, X is uniformly distributed on (−1, 1) and independent of ε, and σ² = 1, 4, and 9. 2. We set K_T = 5 for θ̂(exp, K_T) and θ̃(exp, K_T).
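The two additional roots quoted in Section 5.3 can be checked directly. With X uniform on (−1, 1), E[X] = E[X³] = 0, E[X²] = 1/3, and E[X⁴] = 1/5, so the population moment based on the "feasible" optimal instrument factors as follows (a worked check of ours, not part of the original derivation):

```latex
\begin{aligned}
E\bigl[(Y-\theta^2X-\theta X^2)(2\theta X+X^2)\bigr]
  &= (\theta_o^2-\theta^2)\,2\theta\,E[X^2] + (\theta_o-\theta)\,E[X^4] \\
  &= (\theta_o-\theta)\Bigl[\tfrac{2\theta}{3}(\theta_o+\theta)+\tfrac{1}{5}\Bigr].
\end{aligned}
```

The bracketed factor vanishes when 10θ² + 10θ_oθ + 3 = 0; for θ_o = 1.25 this gives θ = (−25 ± √145)/40, the two local solutions, while θ = θ_o remains the global one.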
Table 4 The performance of θ̂(exp, K_T) with various K_T: ρ = 0.5.

T = 100
| K_T  | Bias     | Bias (+%) | SE     | SE (+%)  | MSE    | MSE (+%) |
| 1    | −0.0033  | –         | 0.0378 | –        | 0.0014 | –        |
| 2    | −0.0032  | −0.6704   | 0.0373 | −1.3585  | 0.0014 | −2.6884  |
| 3    | −0.0032  | −0.2632   | 0.0371 | −0.5527  | 0.0014 | −1.0981  |
| 4    | −0.0032  | −0.1372   | 0.0370 | −0.2945  | 0.0014 | −0.5858  |
| 5    | −0.0032  | −0.0835   | 0.0369 | −0.1817  | 0.0014 | −0.3617  |
| 6    | −0.0032  | −0.0560   | 0.0369 | −0.1230  | 0.0014 | −0.2447  |
| 7    | −0.0032  | −0.0401   | 0.0369 | −0.0886  | 0.0014 | −0.1763  |
| 8    | −0.0032  | −0.0301   | 0.0368 | −0.0668  | 0.0014 | −0.1330  |
| 9    | −0.0032  | −0.0234   | 0.0368 | −0.0521  | 0.0014 | −0.1038  |
| 10   | −0.0032  | −0.0187   | 0.0368 | −0.0418  | 0.0014 | −0.0832  |
| 15   | −0.0032  | −0.0560   | 0.0368 | −0.1258  | 0.0014 | −0.2504  |
| 20   | −0.0032  | −0.0279   | 0.0367 | −0.0631  | 0.0014 | −0.1256  |
| θ̂_DL | −0.0081 | –         | 0.0868 | –        | 0.0076 | –        |

T = 200
| K_T  | Bias     | Bias (+%) | SE     | SE (+%)  | MSE    | MSE (+%) |
| 1    | −0.0011  | –         | 0.0257 | –        | 0.0007 | –        |
| 2    | −0.0011  | 0.8887    | 0.0253 | −1.3333  | 0.0006 | −2.6445  |
| 3    | −0.0011  | 0.3617    | 0.0252 | −0.5452  | 0.0006 | −1.0856  |
| 4    | −0.0011  | 0.1945    | 0.0251 | −0.2911  | 0.0006 | −0.5804  |
| 5    | −0.0011  | 0.1210    | 0.0251 | −0.1798  | 0.0006 | −0.3587  |
| 6    | −0.0011  | 0.0824    | 0.0250 | −0.1217  | 0.0006 | −0.2429  |
| 7    | −0.0011  | 0.0597    | 0.0250 | −0.0877  | 0.0006 | −0.1751  |
| 8    | −0.0012  | 0.0452    | 0.0250 | −0.0662  | 0.0006 | −0.1321  |
| 9    | −0.0012  | 0.0354    | 0.0250 | −0.0517  | 0.0006 | −0.1031  |
| 10   | −0.0012  | 0.0285    | 0.0250 | −0.0414  | 0.0006 | −0.0827  |
| 15   | −0.0012  | 0.0863    | 0.0250 | −0.1247  | 0.0006 | −0.2489  |
| 20   | −0.0012  | 0.0436    | 0.0249 | −0.0626  | 0.0006 | −0.1249  |
| θ̂_DL | −0.0025 | –         | 0.0592 | –        | 0.0035 | –        |

NOTE: 1. The DGP is the same as that for Table 2. 2. +% stands for the percentage change when K_T increases. 3. The bias (SE, MSE) entries look the same across different K_T because the program recorded only 4 digits after the decimal point; these numbers actually differ, as can be seen in the columns of percentage changes.

Table 5 The performance of θ̂(exp, K_T) with various K_T: ρ = 0.9.

T = 100
| K_T  | Bias     | Bias (+%) | SE     | SE (+%)  | MSE    | MSE (+%) |
| 1    | −0.0047  | –         | 0.0379 | –        | 0.0015 | –        |
| 2    | −0.0047  | −0.4598   | 0.0374 | −1.2201  | 0.0014 | −2.4021  |
| 3    | −0.0047  | −0.1770   | 0.0372 | −0.4955  | 0.0014 | −0.9786  |
| 4    | −0.0047  | −0.0909   | 0.0371 | −0.2638  | 0.0014 | −0.5214  |
| 5    | −0.0047  | −0.0547   | 0.0371 | −0.1627  | 0.0014 | −0.3216  |
| 6    | −0.0047  | −0.0363   | 0.0370 | −0.1100  | 0.0014 | −0.2175  |
| 7    | −0.0047  | −0.0258   | 0.0370 | −0.0792  | 0.0014 | −0.1567  |
| 8    | −0.0047  | −0.0193   | 0.0370 | −0.0597  | 0.0014 | −0.1181  |
| 9    | −0.0047  | −0.0149   | 0.0370 | −0.0466  | 0.0014 | −0.0922  |
| 10   | −0.0047  | −0.0119   | 0.0370 | −0.0374  | 0.0014 | −0.0739  |
| 15   | −0.0047  | −0.0354   | 0.0369 | −0.1124  | 0.0014 | −0.2223  |
| 20   | −0.0047  | −0.0175   | 0.0369 | −0.0564  | 0.0014 | −0.1115  |
| θ̂_DL | −0.0089 | –         | 0.0860 | –        | 0.0075 | –        |

T = 200
| K_T  | Bias     | Bias (+%) | SE     | SE (+%)  | MSE    | MSE (+%) |
| 1    | −0.0021  | –         | 0.0251 | –        | 0.0006 | –        |
| 2    | −0.0021  | −1.0072   | 0.0248 | −1.3657  | 0.0006 | −2.7036  |
| 3    | −0.0021  | −0.4112   | 0.0247 | −0.5580  | 0.0006 | −1.1091  |
| 4    | −0.0021  | −0.2183   | 0.0246 | −0.2979  | 0.0006 | −0.5928  |
| 5    | −0.0021  | −0.1342   | 0.0245 | −0.1840  | 0.0006 | −0.3663  |
| 6    | −0.0021  | −0.0905   | 0.0245 | −0.1245  | 0.0006 | −0.2480  |
| 7    | −0.0021  | −0.0650   | 0.0245 | −0.0897  | 0.0006 | −0.1788  |
| 8    | −0.0021  | −0.0489   | 0.0245 | −0.0677  | 0.0006 | −0.1348  |
| 9    | −0.0020  | −0.0381   | 0.0245 | −0.0528  | 0.0006 | −0.1053  |
| 10   | −0.0020  | −0.0305   | 0.0244 | −0.0424  | 0.0006 | −0.0844  |
| 15   | −0.0020  | −0.0916   | 0.0244 | −0.1276  | 0.0006 | −0.2540  |
| 20   | −0.0020  | −0.0458   | 0.0244 | −0.0640  | 0.0006 | −0.1275  |
| θ̂_DL | −0.0048 | –         | 0.0577 | –        | 0.0033 | –        |

NOTE: 1. The DGP is the same as that in Table 2. 2. +% stands for the percentage change when K_T increases. 3. The bias (SE, MSE) entries look the same across different K_T because the program recorded only 4 digits after the decimal point; these numbers actually differ, as can be seen in the columns of percentage changes.

Section 5.2, where the regressor is endogenous. We consider the cases in which ρ equals 0.1, 0.5, and 0.9, and the sample sizes T = 50, 100, and 200. We simulate the DL estimator and θ̂(exp, K_T) with K_T = 1, 2, …, 10, 15, 20. We do not consider the NLS estimator because its performance is too poor when the regressor is endogenous. To ease the computational burden, we do not simulate the efficient estimator here. We report only the results for ρ = 0.5 and ρ = 0.9, each with
T = 100, 200, in Tables 4 and 5. In addition to the bias, SE, and MSE, we also report their percentage changes as K_T increases. For instance, for ρ = 0.9 and T = 100, the bias decreases by 0.46%, the SE by 1.22%, and the MSE by 2.40% when K_T increases from 1 to 2. These tables show that, as K_T increases, the proposed estimator becomes more efficient (with a smaller SE), while its bias
typically decreases.8 The percentage changes of the bias and SE are small; in most cases, such changes are less than 0.1% once K_T is greater than 5 or 6. These results suggest that the first few Fourier coefficients indeed contain most of the information for identifying θ_o; further increases of K_T yield only marginal improvements in the bias and SE. Note that the proposed estimator again dominates the DL estimator in terms of bias, SE, and MSE in all cases.
6. Concluding remarks

This paper is concerned with consistent and efficient estimation under conditional moment restrictions, without assuming that the parameters can be identified by the implied unconditional moments. We propose an approach to constructing unconditional moments that identify the parameter of interest; the consistent and efficient estimators are then readily computed using the conventional GMM method. Our simulations confirm that the proposed estimators perform very well in finite samples and compare favorably with existing estimators, such as that proposed by Domínguez and Lobato (2004). It must be emphasized that we need not confine ourselves to GMM estimation: based on the proposed moment conditions, other estimation methods, such as the empirical likelihood method, can also be employed to obtain consistent and/or efficient estimators.

The proposed estimator may be further improved. First, it is important to establish a criterion for determining the (optimal) number of required Fourier coefficients, K_T, in the objective function. Second, as different GCR functions result in different sets of unconditional moment conditions, and hence different estimators, it would be very useful to know, in practice, how to choose among such estimators. Moreover, it is interesting to examine whether the proposed method remains effective when the instruments are "weak". Chao and Swanson (2005) show that consistent estimation may be possible when the number of weak instruments increases to infinity at a suitable rate. As our approach provides a systematic way to increase the number of unconditional moments so as to identify parameters, it may still work even when instruments are weak. These topics all require thorough analysis and hence are left to future research.

Acknowledgments

We would like to thank three anonymous referees for their constructive comments. We are indebted to Zongwu Cai, John Chao, Jin-Chuan Duan, James Hamilton, Jinyong Hahn, Yongmiao Hong, Ying-Ying Lee, Joon Park, M. Hashem Pesaran, Peter C. B. Phillips, Max Stinchcombe, Allan Timmermann, Yoon-Jae Whang, and Hal White for their comments on early drafts of this paper. We also thank the seminar participants at Boston College, Rice, Toronto, UCSD, U. of Hong Kong, USC, UT Austin, and the SETA 2007 meeting in Hong Kong for their inputs. The research support from the National Science Council of the Republic of China (NSC95-2415-H001-034) for C.-M. Kuan is gratefully acknowledged.

8 In the lower panel of Table 4, the bias actually increases with K_T. This ill behavior may be due to the convergence criterion in our procedure; the criterion for the gradient of estimated coefficients is set to 10^{-4}.

Appendix

Proof of Lemma 3.1. Let Δ be a generic constant whose value varies in different cases. Recall that A(X, τ) = τ_0 + τ_1 X and X is univariate. We have

$$\varphi_{G,k}(X) = \int_{-\pi}^{\pi}\int_{-\pi}^{\pi} G(\tau_0+\tau_1X)\exp(-ik_0\tau_0)\exp(-ik_1\tau_1)\,d\tau_0\,d\tau_1 = \int_{-\pi}^{\pi}\left[\int_{-\pi}^{\pi} G(\tau_0+\tau_1X)\exp(-ik_0\tau_0)\,d\tau_0\right]\exp(-ik_1\tau_1)\,d\tau_1.$$

By integration by parts, for k_0, k_1 ≠ 0, the term in the square brackets above can be expressed as

$$\int_{-\pi}^{\pi} G(\tau_0+\tau_1X)\exp(-ik_0\tau_0)\,d\tau_0 = \frac{i}{k_0}\Bigl[\underbrace{(-1)^{k_0}\bigl[G(\pi+\tau_1X)-G(-\pi+\tau_1X)\bigr]}_{Q_1(\tau)} - \underbrace{\int_{-\pi}^{\pi} G_0(\tau_0+\tau_1X)\exp(-ik_0\tau_0)\,d\tau_0}_{Q_2(\tau)}\Bigr].$$

Then,

$$\varphi_{G,k}(X) = \frac{i}{k_0}\int_{-\pi}^{\pi}\bigl[Q_1(\tau)-Q_2(\tau)\bigr]\exp(-ik_1\tau_1)\,d\tau_1,$$

so that

$$|\varphi_{G,k}(X)| \le \frac{1}{|k_0|}\left|\int_{-\pi}^{\pi} Q_1(\tau)\exp(-ik_1\tau_1)\,d\tau_1\right| + \frac{1}{|k_0|}\left|\int_{-\pi}^{\pi} Q_2(\tau)\exp(-ik_1\tau_1)\,d\tau_1\right|.$$

Again by integration by parts,

$$\int_{-\pi}^{\pi} Q_1(\tau)\exp(-ik_1\tau_1)\,d\tau_1 = \frac{(-1)^{k_0}i}{k_1}\Bigl[(-1)^{k_1}\bigl[G(\pi+\pi X)-G(-\pi+\pi X)-G(\pi-\pi X)+G(-\pi-\pi X)\bigr] - \int_{-\pi}^{\pi}\bigl[G_1(\pi+\tau_1X)-G_1(-\pi+\tau_1X)\bigr]\exp(-ik_1\tau_1)\,d\tau_1\Bigr],$$

and

$$\int_{-\pi}^{\pi} Q_2(\tau)\exp(-ik_1\tau_1)\,d\tau_1 = \frac{i}{k_1}\Bigl[(-1)^{k_1}\int_{-\pi}^{\pi}\bigl[G_0(\tau_0+\pi X)-G_0(\tau_0-\pi X)\bigr]\exp(-ik_0\tau_0)\,d\tau_0 - \int_{-\pi}^{\pi}\Bigl(\int_{-\pi}^{\pi} G_{01}(\tau_0+\tau_1X)\exp(-ik_0\tau_0)\,d\tau_0\Bigr)\exp(-ik_1\tau_1)\,d\tau_1\Bigr].$$

Given [A4], we have
$$\left|\int_{-\pi}^{\pi} Q_1(\tau)\exp(-ik_1\tau_1)\,d\tau_1\right| \le \frac{1}{|k_1|}\Bigl[4\sup_{\tau\in\mathcal T}|G(\tau_0+\tau_1X)| + 2\int_{-\pi}^{\pi}\sup_{\tau\in\mathcal T}|G_1(\tau_0+\tau_1X)|\,|\exp(-ik_1\tau_1)|\,d\tau_1\Bigr] \le \frac{\Delta}{|k_1|},$$

and

$$\left|\int_{-\pi}^{\pi} Q_2(\tau)\exp(-ik_1\tau_1)\,d\tau_1\right| \le \frac{1}{|k_1|}\Bigl[2\int_{-\pi}^{\pi}\sup_{\tau\in\mathcal T}|G_0(\tau_0+\tau_1X)|\,|\exp(-ik_0\tau_0)|\,d\tau_0 + \int_{-\pi}^{\pi}\int_{-\pi}^{\pi}\sup_{\tau\in\mathcal T}|G_{01}(\tau_0+\tau_1X)|\,|\exp(-ik_0\tau_0)|\,d\tau_0\,|\exp(-ik_1\tau_1)|\,d\tau_1\Bigr] \le \frac{\Delta}{|k_1|}.$$

It follows that |φ_{G,k}(X)| ≤ Δ/(|k_0||k_1|) for k_0, k_1 ≠ 0. Similarly, we can show that |φ_{G,k}(X)| ≤ Δ/|k_1| for k_0 = 0 and k_1 ≠ 0, and that |φ_{G,k}(X)| ≤ Δ/|k_0| for k_0 ≠ 0 and k_1 = 0. Also, it is clear that |φ_{G,0}(X)| ≤ Δ. The proof is thus complete.

Proof of Lemma 3.2. Again let Δ denote a generic constant whose value varies in different cases. Define

$$\eta_{G,k,t} = h(y_t,\theta)\varphi_{G,k}(x_t) - E[h(Y,\theta)\varphi_{G,k}(X)],$$

for t = 1, …, T and k = (k_0, k_1)′. By Lemma 3.1, |φ_{G,k}(X)| ≤ Δ/[c(k_0)c(k_1)]. With [A3], we have

$$E[|\eta_{G,k,t}|^2] \le E\bigl[|h(Y,\theta)|^2\,|\varphi_{G,k}(X)|^2\bigr] \le \frac{\Delta}{c(k_0)^2c(k_1)^2}.$$

Under [A1], these bounds lead to

$$\sum_{k_0,k_1=-K_T}^{K_T} E\left|\frac{1}{T}\sum_{t=1}^{T}\eta_{G,k,t}\right|^2 \le \frac{4\Delta}{T}\sum_{k_0=1}^{K_T}\frac{1}{k_0^2}\sum_{k_1=1}^{K_T}\frac{1}{k_1^2} + \frac{2\Delta}{T}\sum_{k_0=1}^{K_T}\frac{1}{k_0^2} + \frac{2\Delta}{T}\sum_{k_1=1}^{K_T}\frac{1}{k_1^2} + \frac{\Delta}{T} \le \frac{\Delta}{T},$$

by the fact that $\sum_{k=1}^{n}k^{-2} \le 2-1/n \le 2$. It follows from the implication rule and the generalized Chebyshev inequality that

$$P\left(\sum_{k_0,k_1=-K_T}^{K_T}\left|\frac{1}{T}\sum_{t=1}^{T}\eta_{G,k,t}\right|^2 \ge \varepsilon\right) \le \sum_{k_0,k_1=-K_T}^{K_T} P\left(\left|\frac{1}{T}\sum_{t=1}^{T}\eta_{G,k,t}\right|^2 \ge \frac{\varepsilon}{(2K_T+1)^2}\right) \le \frac{(2K_T+1)^2}{\varepsilon}\sum_{k_0,k_1=-K_T}^{K_T} E\left|\frac{1}{T}\sum_{t=1}^{T}\eta_{G,k,t}\right|^2 \le \frac{(2K_T+1)^2\Delta}{\varepsilon T},$$

which holds uniformly in θ because Δ does not depend on θ. It is clear that this bound can be made arbitrarily small when K_T = o(T^{1/2}).

Proof of Theorem 3.3. The proposed estimator, θ̂(G, K_T), is the solution to the left-hand side of (14). Hence, it must converge to the unique minimizer, θ^o, of the right-hand side of (14) by Theorem 2.1 of Newey and McFadden (1994).

Proof of Corollary 3.4. Given G(A(X, τ)) = exp(Xτ), we have from the text that (15) holds when K_T = o(T). Analogous to (14), we obtain

$$\sum_{k=-K_T}^{K_T}|m_{\exp,k,T}(\theta)|^2 \xrightarrow{P} \sum_{k=-\infty}^{\infty}|C_{\exp,k}(\theta)|^2,$$

uniformly in θ. The assertion again follows from Theorem 2.1 of Newey and McFadden (1994).

Proof of Lemma 3.5. Given [A1]–[A4] and K_T = o(T^{1/4}), θ̂(G, K_T) →_P θ^o. Hence, θ†_T → θ^o. With [A6], we can apply a standard argument to get

$$\nabla_\theta m_{G,k,T}(\hat\theta(G,K_T)) - \nabla_\theta m_{G,k,T}(\theta^o) \xrightarrow{P} 0, \qquad \nabla_\theta m_{G,k,T}(\theta^\dagger_T) - \nabla_\theta m_{G,k,T}(\theta^o) \xrightarrow{P} 0.$$

Also note that $\nabla_\theta C_{G,k}(\theta^o)'\nabla_\theta \bar C_{G,k}(\theta^o)$ is real and

$$\sum_{k_0,k_1=-K_T}^{K_T}\nabla_\theta C_{G,k}(\theta^o)'\nabla_\theta \bar C_{G,k}(\theta^o) \to \sum_{k_0,k_1=-\infty}^{\infty}\nabla_\theta C_{G,k}(\theta^o)'\nabla_\theta \bar C_{G,k}(\theta^o).$$

Therefore, it suffices to show that

$$\sum_{k_0,k_1=-K_T}^{K_T}\bigl(\nabla_\theta m_{G,k,T}(\theta^o)'\nabla_\theta \bar m_{G,k,T}(\theta^o) - \nabla_\theta C_{G,k}(\theta^o)'\nabla_\theta \bar C_{G,k}(\theta^o)\bigr) \xrightarrow{P} 0.$$

We can show that this convergence holds elementwise. For simplicity of notation, we drop the subscript G and the argument θ^o and write $\eta_{i,k} = \nabla_{\theta_i}m_{k,T} - E[\nabla_{\theta_i}m_{k,T}]$. The (i, j)th element of the matrix above can be expressed as $\eta'_{i,k}\nabla_{\theta_j}\bar m_{k,T} + \nabla_{\theta_i}C'_k\bar\eta_{j,k}$. We need to show that

$$\sum_{k_0,k_1=-K_T}^{K_T}\bigl(\eta'_{i,k}\nabla_{\theta_j}\bar m_{k,T} + \nabla_{\theta_i}C'_k\bar\eta_{j,k}\bigr) \xrightarrow{P} 0.$$

Again by the implication rule and the generalized Chebyshev inequality, we have

$$P\left(\sum_{k_0,k_1=-K_T}^{K_T}\bigl|\eta'_{i,k}\nabla_{\theta_j}\bar m_{k,T} + \nabla_{\theta_i}C'_k\bar\eta_{j,k}\bigr| \ge \epsilon\right) \le \sum_{k_0,k_1=-K_T}^{K_T}P\left(\bigl|\eta'_{i,k}\nabla_{\theta_j}\bar m_{k,T} + \nabla_{\theta_i}C'_k\bar\eta_{j,k}\bigr| \ge \frac{\epsilon}{(2K_T+1)^2}\right) \le \frac{(2K_T+1)^2}{\epsilon}\sum_{k_0,k_1=-K_T}^{K_T}E\bigl[\bigl|\eta'_{i,k}\nabla_{\theta_j}\bar m_{k,T} + \nabla_{\theta_i}C'_k\bar\eta_{j,k}\bigr|\bigr] \le \frac{(2K_T+1)^2}{\epsilon}\sum_{k_0,k_1=-K_T}^{K_T}\Bigl\{[E|\eta_{i,k}|^2]^{1/2}[E|\nabla_{\theta_j}m_{k,T}|^2]^{1/2} + [E|\nabla_{\theta_i}C_k|^2]^{1/2}[E|\bar\eta_{j,k}|^2]^{1/2}\Bigr\}.$$

By [A1], [A6], and Lemma 3.1,

$$E|\nabla_{\theta_j}m_{k,T}|^2 = \frac{1}{T}\,E|\nabla_{\theta_j}h(Y,\theta)\varphi_k(X)|^2 \le \frac{\Delta}{Tc(k_0)^2c(k_1)^2}.$$
Similarly, $|\nabla_{\theta_i}C_k|^2 \le \Delta/[c(k_0)^2c(k_1)^2]$, and

$$E|\eta_{i,k}|^2 = E|\nabla_{\theta_i}m_{k,T}|^2 - E|\nabla_{\theta_i}C_k|^2 \le E|\nabla_{\theta_i}m_{k,T}|^2 \le \frac{\Delta}{Tc(k_0)^2c(k_1)^2}.$$

Putting these results together we have, similar to the proof of Lemma 3.2,

$$P\left(\sum_{k_0,k_1=-K_T}^{K_T}\bigl|\eta'_{i,k}\nabla_{\theta_j}\bar m_{k,T} + \nabla_{\theta_i}C'_k\bar\eta_{j,k}\bigr| \ge \epsilon\right) \le \frac{(2K_T+1)^2}{\epsilon}\sum_{k_0,k_1=-K_T}^{K_T}\left[\frac{\Delta}{Tc(k_0)^2c(k_1)^2} + \frac{\Delta}{\sqrt T\,c(k_0)^2c(k_1)^2}\right] \le \frac{(2K_T+1)^2\Delta}{\epsilon\sqrt T},$$

which can be made arbitrarily small when K_T = o(T^{1/4}).

Proof of Lemma 3.6. Similar to the proof of Lemma 3.5, given [A1]–[A6] and K_T = o(T^{1/4}), θ̂(G, K_T) →_P θ^o, it is thus sufficient to show that

$$\sum_{k_0,k_1=-K_T}^{K_T}\bigl[\nabla_\theta m_{G,k,T}(\theta^o) - \nabla_\theta C_{G,k}(\theta^o)\bigr]'\sqrt T\,\bar m_{G,k,T}(\theta^o) \xrightarrow{P} 0,$$

since $\nabla_\theta m_{G,k,T}(\hat\theta(G,K_T)) - \nabla_\theta m_{G,k,T}(\theta^o) \xrightarrow{P} 0$. Again, let $\eta_{i,k} = \nabla_{\theta_i}m_{G,k,T}(\theta^o) - E[\nabla_{\theta_i}m_{G,k,T}(\theta^o)]$; by the implication rule and the generalized Chebyshev inequality, we have

$$P\left(\sum_{k_0,k_1=-K_T}^{K_T}\bigl|\eta'_{i,k}\sqrt T\,\bar m_{G,k,T}(\theta^o)\bigr| \ge \epsilon\right) \le \sum_{k_0,k_1=-K_T}^{K_T}P\left(\bigl|\eta'_{i,k}\sqrt T\,\bar m_{G,k,T}(\theta^o)\bigr| \ge \frac{\epsilon}{(2K_T+1)^2}\right) \le \frac{(2K_T+1)^2}{\epsilon}\sum_{k_0,k_1=-K_T}^{K_T}E\bigl[\bigl|\eta'_{i,k}\sqrt T\,\bar m_{G,k,T}(\theta^o)\bigr|\bigr] \le \frac{(2K_T+1)^2}{\epsilon}\sum_{k_0,k_1=-K_T}^{K_T}[E|\eta_{i,k}|^2]^{1/2}\bigl[E|h(Y,\theta^o)\varphi_{G,k}(X)|^2\bigr]^{1/2} \le \frac{(2K_T+1)^2\Delta}{\epsilon\sqrt T},$$

where the last inequality, given [A1], is due to the fact that

$$E|\sqrt T\,m_{G,k,T}(\theta^o)|^2 = E\left|\frac{1}{\sqrt T}\sum_{t=1}^{T}h(y_t,\theta^o)\varphi_{G,k}(x_t)\right|^2 = E|h(Y,\theta^o)\varphi_{G,k}(X)|^2 \le \frac{\Delta}{c(k_0)^2c(k_1)^2},$$

and, from the proof of Lemma 3.5,

$$E|\eta_{i,k}|^2 \le \frac{\Delta}{Tc(k_0)^2c(k_1)^2}.$$

The proof is complete because this bound can be made arbitrarily small when K_T = o(T^{1/4}) and T → ∞.

Proof of Theorem 3.7. From [A8], we know that $T^{-1/2}\sum_{t=1}^{T}h(y_t,\theta^o)G(A(x_t,\cdot)) \xrightarrow{D} \mathcal Z$, where $\mathcal Z$ is a Gaussian random element in $L^2([-\pi,\pi]^2)$ with the covariance operator K. By invoking the multiplication theorem, we have

$$\sum_{k_0,k_1=-K_T}^{K_T}\nabla_\theta C_{G,k,T}(\theta^o)'\sqrt T\,\bar m_{G,k,T}(\theta^o) = \sum_{k_0,k_1=-\infty}^{\infty}\nabla_\theta C_{G,k}(\theta^o)'\sqrt T\,\bar m_{G,k,T}(\theta^o) + o_P(1),$$

where the ith element of the right-hand side, with $\nabla_\theta C_{G,k}(\theta^o)$ built from $E[\nabla_\theta h(Y,\theta^o)G(A(X,\cdot))]$, is

$$\left\langle \nabla_{\theta_i}E[h(Y,\theta^o)G(A(X,\cdot))],\ \frac{1}{\sqrt T}\sum_{t=1}^{T}h(y_t,\theta^o)G(A(x_t,\cdot))\right\rangle, \qquad i=1,\dots,p.$$

It follows that

$$\sum_{k_0,k_1=-K_T}^{K_T}\nabla_\theta C_{G,k,T}(\theta^o)'\sqrt T\,\bar m_{G,k,T}(\theta^o) = \bigl(\langle \nabla_{\theta_i}E[h(Y,\theta^o)G(A(X,\cdot))],\ \mathcal Z\rangle\bigr)_{i=1,\dots,p} + o_P(1) \xrightarrow{D} N(0,\Omega_q).$$

The assertion follows from (17).

Proof of Corollary 3.8. In this case, [A8] ensures that $T^{-1/2}\sum_{t=1}^{T}h(y_t,\theta^o)\exp(x_t\,\cdot) \xrightarrow{D} \mathcal Z$, where $\mathcal Z$ is a Gaussian random element in $L^2[-\pi,\pi]$ with the covariance operator K. Analogous to the proof of Theorem 3.7, the conclusion follows from (20).
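The two devices invoked repeatedly in the Appendix, the implication rule and the generalized Chebyshev (Markov-type) inequality, combine into the following chain; this compact restatement is ours, with N standing for the (2K_T + 1)² moment indices:

```latex
P\Bigl(\sum_{k=1}^{N} |Z_k| \ge \varepsilon\Bigr)
\;\le\; \sum_{k=1}^{N} P\Bigl(|Z_k| \ge \frac{\varepsilon}{N}\Bigr)
\;\le\; \frac{N}{\varepsilon}\sum_{k=1}^{N} E|Z_k|.
```

The first inequality holds because the event on the left implies that at least one |Z_k| exceeds ε/N; the second applies Markov's inequality term by term.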
References

Ai, C., Chen, X., 2003. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica 71, 1795–1843.
Bierens, H.J., 1982. Consistent model specification tests. Journal of Econometrics 20, 105–134.
Bierens, H.J., 1990. A consistent conditional moment test of functional form. Econometrica 58, 1443–1458.
Bierens, H.J., 1994. Topics in Advanced Econometrics. Cambridge University Press, New York.
Carrasco, M., Florens, J.-P., 2000. Generalization of GMM to a continuum of moment conditions. Econometric Theory 16, 797–834.
Chamberlain, G., 1987. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics 34, 305–334.
Chao, J.C., Swanson, N.R., 2005. Consistent estimation with a large number of weak instruments. Econometrica 73, 1673–1692.
Chen, X., White, H., 1996. Laws of large numbers for Hilbert space-valued mixingales with applications. Econometric Theory 12, 284–304.
Chen, X., White, H., 1998. Central limit and functional central limit theorems for Hilbert-valued dependent heterogeneous arrays with applications. Econometric Theory 14, 260–284.
Domínguez, M.A., Lobato, I.N., 2004. Consistent estimation of models defined by conditional moment restrictions. Econometrica 72, 1601–1615.
Donald, S.G., Imbens, G.W., Newey, W.K., 2003. Empirical likelihood estimation and consistent tests with conditional moment restrictions. Journal of Econometrics 117, 55–93.
Gallant, A.R., White, H., 1988. A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Basil Blackwell, Oxford, UK.
Hansen, B.E., West, K.D., 2002. Generalized method of moments and macroeconomics. Journal of Business and Economic Statistics 20, 460–469.
Hansen, L.P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.
Hansen, L.P., Heaton, J., Yaron, A., 1996. Finite-sample properties of some alternative GMM estimators. Journal of Business and Economic Statistics 14, 262–280.
Hansen, L.P., Singleton, K.J., 1982. Generalized instrumental variables estimation of nonlinear rational expectations models. Econometrica 50, 1269–1286.
Kitamura, Y., 1997. Empirical likelihood methods with weakly dependent processes. Annals of Statistics 25, 2084–2102.
Kitamura, Y., Tripathi, G., Ahn, H., 2004. Empirical likelihood-based inference in conditional moment restriction models. Econometrica 72, 1667–1714.
Newey, W.K., 1990. Efficient instrumental variables estimation of nonlinear models. Econometrica 58, 809–837.
Newey, W.K., 1993. Efficient estimation of models with conditional moment restrictions. In: Maddala, G., Rao, C., Vinod, H. (Eds.), Handbook of Statistics, vol. 11. Elsevier Science, pp. 419–454.
Newey, W.K., McFadden, D., 1994. Large sample estimation and hypothesis testing. In: Engle, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. IV. Elsevier Science, pp. 2111–2245.
Qin, J., Lawless, J., 1994. Empirical likelihood and generalized estimating equations. Annals of Statistics 22, 300–325.
Stinchcombe, M.B., White, H., 1998. Consistent specification testing with nuisance parameters present only under the alternative. Econometric Theory 14, 295–325.
Stuart, R.D., 1961. An Introduction to Fourier Analysis. Halsted Press, New York.
White, H., 1989. An additional hidden unit test for neglected nonlinearity. In: Proceedings of the International Joint Conference on Neural Networks, vol. 2. IEEE Press, New York, pp. 451–455.
Journal of Econometrics 165 (2011) 100–111
Contents lists available at SciVerse ScienceDirect
Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom
Moment-based estimation of smooth transition regression models with endogenous variables

Waldyr Dutra Areosa a,b, Michael McAleer c,d,e, Marcelo C. Medeiros a,*

a Department of Economics, Pontifical Catholic University of Rio de Janeiro, Brazil
b Banco Central do Brasil, Brazil
c Econometric Institute, Erasmus School of Economics, Erasmus University Rotterdam, The Netherlands
d Tinbergen Institute, The Netherlands
e Department of Quantitative Economics, Complutense University of Madrid, Spain

article info

Article history: Available online 19 May 2011

Keywords: Smooth transition; Nonlinear models; Nonlinear instrumental variables; Generalized method of moments; Endogeneity; Inflation targeting
abstract Nonlinear regression models have been widely used in practice for a variety of time series and crosssection datasets. For purposes of analyzing univariate and multivariate time series data, in particular, smooth transition regression (STR) models have been shown to be very useful for representing and capturing asymmetric behavior. Most STR models have been applied to univariate processes, and have made a variety of assumptions, including stationary or cointegrated processes, uncorrelated, homoskedastic or conditionally heteroskedastic errors, and weakly exogenous regressors. Under the assumption of exogeneity, the standard method of estimation is nonlinear least squares. The primary purpose of this paper is to relax the assumption of weakly exogenous regressors and to discuss moment-based methods for estimating STR models. The paper analyzes the properties of the STR model with endogenous variables by providing a diagnostic test of linearity of the underlying process under endogeneity, developing an estimation procedure and a misspecification test for the STR model, presenting the results of Monte Carlo simulations to show the usefulness of the model and estimation method, and providing an empirical application for inflation rate targeting in Brazil. We show that STR models with endogenous variables can be specified and estimated by a straightforward application of existing results in the literature. © 2011 Elsevier B.V. All rights reserved.
1. Introduction and motivation

Nonlinear regression models have been widely used in practice for a variety of time series and cross-section datasets (see Granger and Teräsvirta (1993) for some examples in economics). For purposes of analyzing univariate and multivariate time series data, in particular, smooth transition regression (STR) models, initially proposed in univariate form by Chan and Tong (1986) and further developed in Luukkonen et al. (1988) and Teräsvirta (1994, 1998), have been shown to be very useful for representing and capturing asymmetric behavior.1 van Dijk et al. (2002) provide a useful review of time series STR models.
∗ Corresponding address: Department of Economics, Pontifical Catholic University of Rio de Janeiro, Rua Marquês de São Vicente 225, Gávea, Rio de Janeiro, RJ, Brazil.
E-mail addresses: [email protected] (M. McAleer), [email protected] (M.C. Medeiros).
1 The term ''smooth transition'' in its present meaning first appeared in Bacon and Watts (1971). They presented their smooth transition model as a generalization of models of two intersecting lines with an abrupt change from one linear regression to another at some unknown change point. Goldfeld and Quandt (1972, pp. 263–264) generalized the so-called two-regime switching regression model using the same idea.
2 See McAleer (2005) for a discussion of univariate and multivariate conditional volatility models.
doi:10.1016/j.jeconom.2011.05.009

W.D. Areosa et al. / Journal of Econometrics 165 (2011) 100–111

Most STR models have been applied to univariate processes under a variety of assumptions. For example, although stationarity is imposed in the vast majority of time series applications, Choi and Saikkonen (2004a,b) considered the case of STR models with cointegrated variables. Conditional heteroskedasticity has been analyzed in several papers, for example, in Lundbergh and Teräsvirta (1998) and Li et al. (2002).2 However, in stationary time series or cross-section applications, the covariates have been assumed to be weakly exogenous with respect to the parameters of interest. Under the assumption of exogeneity, the standard method of estimation is nonlinear least squares, and the asymptotic properties of the estimators have been discussed in Mira and Escribano (2000), Suarez-Fariñas et al. (2004), and Medeiros and Veiga (2005), among others. Nonlinear least squares is equivalent to quasi-maximum likelihood or, when the errors are Gaussian, to conditional maximum likelihood.

The primary purpose of this paper is to relax the assumption of weakly exogenous regressors and to provide a generalized method of moments (GMM) estimator for recovering the parameters of STR models. The estimator considered here is equivalent to the nonlinear instrumental variables (IV) estimator proposed by Amemiya (1974). The paper analyzes the properties of the STR model with endogenous variables by providing a diagnostic test of linearity of the underlying process under endogeneity, developing an estimation procedure and a misspecification test for the STR model, presenting the results of Monte Carlo simulations to show the usefulness of the model and estimation method, and providing an empirical application for inflation rate targeting in Brazil.

Although the treatment of nonlinear IV methods dates back to Amemiya (1974), the estimation of STR models with endogenous regressors does not yet seem to have been analyzed. The only exception is Caner and Hansen (2004), where the authors consider a threshold model with endogenous regressors. However, in their case, they assume that the transition (threshold) variable is weakly exogenous. Furthermore, most previous work has focused on independent and identically distributed (IID) data and not on time series models. We show that STR models with endogenous variables can be specified and estimated by a straightforward application of existing results in the literature under mild regularity conditions.

The rest of the paper is organized as follows. Section 2 briefly reviews the literature on moment-based estimation for nonlinear regression models. The model and the main assumptions are described in Section 3, while the linearity test is discussed in Section 4. The estimation procedure and the asymptotic properties of the estimators are analyzed in Section 5. Section 6 describes some misspecification tests.
Section 7 presents Monte Carlo simulations in order to evaluate the finite sample properties of the tests and the estimation procedure. An empirical application for inflation rate targeting in Brazil is presented in Section 8. Finally, Section 9 concludes the paper. All technical proofs are given in the Appendix.

2. Generalized method of moments and instrumental variable estimation for nonlinear regression

As in Amemiya (1974), consider the following assumption.

Assumption 1. The sequence {yt}Tt=1, T > 0, is generated from the following nonlinear model:

    yt = g(xt; ψ0) + ut,    (1)

where g(xt; ψ0) is a nonlinear function of the covariates xt ∈ Rqx and is indexed by the ''true'' parameter vector ψ0 ∈ Ψ ⊆ RK, and {ut}Tt=1 is a sequence of zero mean random variables, E(ut) = 0, ∀t. Furthermore, E(u²t) = σ²0 < ∞, ∀t, and E(ut us) = 0, ∀t ≠ s. Finally, the variables xt are endogenous in the sense that E(ut|xt) ≠ 0.

Under Assumption 1, E(yt|xt) ≠ g(xt; ψ0) and the nonlinear least squares estimator (NLSE) of the parameters of interest ψ0 might be inconsistent as long as E[ut ġ(xt; ψ0)] ≠ 0, where ġ(xt; ψ0) = ∂g(xt; ψ)/∂ψ evaluated at ψ = ψ0.

Consider a vector of exogenous (instrumental) variables wt ∈ Rqw and define a set of valid instruments zt ≡ z(wt) ∈ Rqz, qz ≥ K, where z(wt) : Rqw → Rqz is a vector-valued function of wt, such that E(zt ut) = 0. Therefore, we have qz moment conditions that can be cast into a generalized method of moments (GMM) framework. Defining Yt = (yt, x′t, z′t)′ and h(Yt; ψ) = (1/T) ΣTt=1 zt[yt − g(xt; ψ)] ≡ (1/T) ΣTt=1 zt ut(ψ), the GMM estimator is the solution to the following nonlinear optimization problem:

    ψ̂GMM = argmin_{ψ∈Ψ} [h(Yt; ψ)′ Ω̂⁻¹ h(Yt; ψ)],    (2)

where Ω̂ is any consistent estimator of Ω = E[u²t zt z′t] = σ²0 E[zt z′t].³ Hence, treating σ²0 as a constant and using (1/T) ΣTt=1 zt z′t as a consistent estimator of E[zt z′t], the GMM approach is equivalent to the modified nonlinear IV estimator discussed in Amemiya (1974). Eq. (2) can be rewritten as

    ψ̂GMM = argmin_{ψ∈Ψ} [(1/T) ΣTt=1 zt[yt − g(xt; ψ)]]′ [(1/T) ΣTt=1 zt z′t]⁻¹ [(1/T) ΣTt=1 zt[yt − g(xt; ψ)]]
          = argmin_{ψ∈Ψ} (1/T) [y − g(X; ψ)]′ Z(Z′Z)⁻¹Z′ [y − g(X; ψ)]
          = argmin_{ψ∈Ψ} (1/T) [y − g(X; ψ)]′ PZ [y − g(X; ψ)],    (3)

where y = (y1, ..., yT)′, X = (x1, ..., xT)′ is a (T × qx) matrix of endogenous variables, g(X; ψ) = [g(x1; ψ), ..., g(xT; ψ)]′, and Z = (z1, ..., zT)′ is a (T × qz) matrix of valid instruments.

When the model is nonlinear only in the variables, Kelejian (1971) showed consistency of the IV estimator when polynomials of the exogenous variables are used as instruments. Bowden and Turkington (1981) compared different IV estimators for the nonlinear-in-variables model. Amemiya (1974) first discussed the estimation of (1) when the function g(xt; ψ0) is nonlinear both in the parameters and in the variables. He proved consistency and asymptotic normality of (3) for IID data when the instruments are assumed to be fixed in repeated samples, and also demonstrated efficiency of the estimator when the model is nonlinear only in the parameters.

From the first-order conditions of the optimization problem (3), a key (rank) condition for identification is that

    plim_{T→∞} (1/T) Z′ ġ(X; ψ0)

is of full rank. Thus, the instruments Z must be correlated with the gradient of the nonlinear function. Even when zt is highly correlated with the endogenous variables xt, this might not be the case for zt and ġ(xt; ψ0). Thus, instruments that are strong in a linear framework might be rather weak in a nonlinear setting (see Stock et al. (2002) for a recent review of weak instruments). Amemiya (1975) showed that the optimal instruments are given by

    z(wt) = E[ġ(xt; ψ0)|wt].    (4)

More recently, Newey (1990) considered asymptotically efficient IV estimation of nonlinear models in an IID framework based on nonparametric estimation of the optimal set of instruments in (4). More specifically, he considered the estimation of the conditional moment in (4) by two different nonparametric techniques, namely nearest-neighbor and series (sieve) approximation. The latter is closely related to the polynomial estimation discussed earlier in Kelejian (1971).

³ Note that under Assumption 1, {h(Yt; ψ)}Tt=1 is an uncorrelated sequence.
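To fix ideas, the projected objective in (3) can be sketched numerically for a toy model that is nonlinear in the variables but linear in the parameters (Kelejian's case), so that the minimizer has a closed form. The DGP, the function g, and all parameter values below are our own illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Toy sketch of the nonlinear IV estimator (3). The model is nonlinear in
# the variables but linear in the parameters, so minimizing
# (1/T)[y - g(X; psi)]' P_Z [y - g(X; psi)] has a closed-form solution.
rng = np.random.default_rng(0)
T = 1000

w = rng.normal(size=T)                    # exogenous variable
v = rng.normal(size=T)                    # reduced-form error
u = 0.7 * v + rng.normal(size=T)          # structural error, correlated with v
x = w + v                                 # endogenous covariate: E(u|x) != 0
psi0 = np.array([0.5, -1.0])              # "true" parameter vector
R = np.column_stack([np.ones(T), np.exp(0.5 * x)])   # g(x; psi) = R @ psi
y = R @ psi0 + u

# Polynomial instruments z_t = (1, w, w^2, w^3)' and P_Z = Z(Z'Z)^{-1}Z'
Z = np.column_stack([np.ones(T), w, w**2, w**3])
PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)

# Nonlinear IV / GMM estimate: the minimizer of the projected sum of squares
psi_hat = np.linalg.solve(R.T @ PZ @ R, R.T @ PZ @ y)

# Nonlinear least squares (here, OLS on R) ignores the endogeneity and is
# inconsistent, as discussed below Assumption 1
psi_nls = np.linalg.solve(R.T @ R, R.T @ y)
```

With these instruments, psi_hat should be close to psi0 in large samples, while psi_nls is biased because exp(0.5 x_t) is correlated with u_t through v_t.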
Amemiya (1975) discussed different limited-information estimators of the nonlinear simultaneous equation model and compared their covariance matrices. The author considered the following linear reduced form for xt:

    xt = Π0 wt + vt,    (5)

where E(vt|wt) = 0, and {vt}Tt=1 is a sequence of zero mean IID random variables which are correlated with the structural errors ut. Defining W = (w1, ..., wT)′, Π̂ = (W′W)⁻¹W′X, V̂ = X − WΠ̂, and u(ψ) = y − g(X; ψ), Amemiya (1975) proposed the modified IV (MIV) estimator given by

    ψ̂MIV = argmin_{ψ∈Ψ} QMIV,T(ψ)
          = argmin_{ψ∈Ψ} [y − g(X; ψ)]′[I − V̂(V̂′V̂)⁻¹V̂′][y − g(X; ψ)]
          = argmin_{ψ∈Ψ} {u(ψ)′u(ψ) − u(ψ)′V̂(V̂′V̂)⁻¹V̂′u(ψ)},    (6)
where I is a (T × T) identity matrix. He showed that (6) is more efficient than (3). The estimator given in (6) is equivalent to the one-step maximum likelihood estimator, given the parameters of the linear reduced form of the exogenous variables. Finally, Amemiya (1977) considered the maximum likelihood and three-stage least squares estimators of the general nonlinear simultaneous equations model. More recently, Newey and Powell (2003) considered IV estimation of nonparametric models.

3. The model and main assumptions

We write a smooth transition regression (STR) model as a special case of (1), and consider the following assumption about the data generating process (DGP).

Assumption 2 (Data Generating Process). The sequence {yt}Tt=1 is generated by

    yt = β′01 x̃t + β′02 x̃t f(st; γ0, c0) + ut,    (7)

where f(st; γ0, c0) is the logistic function given by

    f(st; γ0, c0) = 1/[1 + e^{−γ0(st − c0)}],    (8)

x̃t = (1, x′L,t)′, xt = (x′L,t, st)′ ∈ Rqx is the vector of covariates,⁴ E(ut|xt) ≠ 0, and E(u²t) = σ²0 < ∞.

In this case, g(xt; ψ0) ≡ β′01 x̃t + β′02 x̃t f(st; γ0, c0) and ψ0 = (β′01, β′02, γ0, c0)′ ∈ RK. The structural parameters to be estimated are ψ0 and σ²0. The STR model can be considered as a regime-switching model that allows for two limiting regimes associated with the extreme values of the transition function, f(st; γ, c) = 0 and f(st; γ, c) = 1, where the transition from one regime to the other is smooth. The parameter c can be interpreted as the threshold between the two regimes, in the sense that the logistic function changes monotonically from 0 to 1 as st increases, with f(c; γ, c) = 0.5. The parameter γ determines the smoothness of the transition from one regime to the other. As γ becomes very large, the logistic function f(st; γ, c) approaches the indicator function and, consequently, the change of f(st; γ, c) from 0 to 1 becomes instantaneous at st = c.

We make the following assumptions about the parameters of the model.

Assumption 3 (Identification). The parameter vector ψ0 is interior to a compact parameter space, Ψ. Furthermore, γ0 > 0 and c0 is interior to the support of the probability distribution of st. If the distribution of st has infinite support, then −∞ < c̲ < c0 < c̄ < ∞.

Assumption 3 is a standard assumption for identification of STR models. The restriction on γ0 avoids the lack of identification caused by the symmetric behavior of the logistic function. The vector of endogenous variables follows a linear reduced form, as in the following assumption.

Assumption 4 (Reduced Form). wt ∈ Rqw is a vector of exogenous variables such that:
(1) xt = Π0 wt + vt;
(2) E(ut|wt) = 0, ∀t;
(3) E(vt|wt) = 0, ∀t;
(4) E(vt|Ft−1) = 0, where Ft−1 is the σ-field generated by {x′t−j, w′t−j, ut−j : j ≥ 1}; and
(5) setting et = (ut, v′t)′, E(et e′τ) = δτ,t Σ, where δτ,t = 1 if τ = t and 0 if τ ≠ t, and

    Σ = [ σ²0   Σ′uv
          Σuv   Σv  ].

We consider that there is a set of valid instruments satisfying the assumption below.

Assumption 5 (Instruments). Z ≡ [z(w1), ..., z(wT)]′ is a (T × qz), qz ≥ K, matrix of instruments, such that:
(1) zt ≡ z(wt) : Rqw → Rqz is a linear or nonlinear function of wt, such that E(|zt|) < ∞;
(2) plim_{T→∞} (1/T) Z′Z exists and is nonsingular;
(3) (1/T) Z′ġ(X; ψ) converges in probability uniformly in ψ ∈ Nψ0, where Nψ0 is a neighborhood of ψ0; and
(4) plim_{T→∞} (1/T) Z′ġ(X; ψ0) exists and is of full rank.

Furthermore, the error term is such that:

Assumption 6 (Error Term). {ut}Tt=1 is a martingale difference sequence, such that E(ut|Ft−1) = 0, where Ft−1 is defined as in Assumption 4.

In this paper we consider only models with stationary variables.

Assumption 7 (Stationarity). The sequence {Yt}Tt=1, where Yt = (yt, x′t, z′t)′, is stationary and ergodic. Furthermore, E(xt x′t |st|^{6+δ}) < ∞ for some δ > 0.

The last moment condition in Assumption 7 is important in order to guarantee the existence of the relevant moments in the linearity test to be presented in the next section.

4. Linearity testing against smooth transition regression

Consider an STR model as in (7). A convenient null hypothesis of linearity is H0: γ = 0, against the alternative Ha: γ > 0. Note that model (7) is not identified under the null hypothesis. In order to remedy this problem, we follow Teräsvirta (1994) and expand the logistic function f(st; γ, c) into a third-order Taylor expansion around the null hypothesis γ = 0. After merging terms, the resulting model is⁵

    yt = α′1 x̃t + α′2 x̃t st + α′3 x̃t s²t + α′4 x̃t s³t + u*t,    (9)

⁴ If st is an element of xL,t, then xt = xL,t.
⁵ If st is an element of xt, then the resulting model is yt = α′1 xt + α′2 xt st + α′3 xt s²t + α′4 xt s³t + u*t.
where u*t = ut + R(st; γ, c), R(st; γ, c) is the remainder, and

    α1 = β01 + (1/2 − γ0c0/4 − γ³0c³0/96)β02,
    α2 = (γ0/4 + γ³0c²0/32)β02,
    α3 = −(γ³0c0/32)β02,
    α4 = (γ³0/96)β02.

A new convenient null hypothesis is H0: α2 = α3 = α4 = 0. Note that (9) is a nonlinear-in-variables regression model with endogenous regressors, as discussed in Davidson and MacKinnon (1993, pp. 224–226).

In order to derive the linearity test, consider the following notation. Set y as in Section 2. Define g(X; ψ*) ∈ RT as a vector whose typical row is given by the function g(xt; ψ*) = α′1 x̃t + α′2 x̃t st + α′3 x̃t s²t + α′4 x̃t s³t, where ψ* = (α′1, α′2, α′3, α′4)′. Furthermore, set the restricted and unrestricted parameter estimates as ψ̂*r = (α̂′1, 0′, 0′, 0′)′ and ψ̂*u = (α̂′1, α̂′2, α̂′3, α̂′4)′, respectively. Finally, set PZ = Z(Z′Z)⁻¹Z′, where Z is a (T × qz), qz ≥ K, matrix of valid instruments, as in Section 2, formed by linear and/or nonlinear functions of the exogenous variables, wt.

The linearity test is equivalent to an F-test in an instrumental variables regression and can be carried out in stages, as follows (see Davidson and MacKinnon (1993, pp. 226–232) for a discussion).

(1) Estimate (9) under the null and compute SSRr = ‖PZ[y − g(X; ψ̂*r)]‖².
(2) Estimate the unrestricted model (9) and compute SSRu = ‖PZ[y − g(X; ψ̂*u)]‖².
(3) Compute the statistic⁶

    F = [(SSRr − SSRu)/k] / {‖y − g(X; ψ̂*u)‖²/(T − k)}.    (10)

Under the null hypothesis, the statistic F is asymptotically distributed as an F distribution with k and T − k degrees of freedom, where k is the number of restrictions tested.

In Eq. (9) the regressors are products of the endogenous variables, and the optimal set of instruments, as discussed in Amemiya (1975), is formed by power functions of the exogenous variables. For example, suppose that

    xt = ( xL,t )   ( θ′x   0  ) ( wx,t )   ( vx,t )
         (  st  ) = (  0   θs ) ( ws,t ) + ( vs,t ).

In this case, the optimal set of instruments is zt = (1, w′x,t, w′x,t ws,t, w′x,t w²s,t, w′x,t w³s,t)′. It is important that the same set of instruments is used in each step of the procedure described above (for further details, see Davidson and MacKinnon (1993)).

⁶ If st is an element of xt, then F = [(SSRr − SSRu)/k] / {‖y − g(X; ψ̂*u)‖²/(T − k)}.

5. Parameter estimation

5.1. Main results

In this paper we consider two methods to estimate the STR model with endogenous covariates. The first is the GMM estimator in (3). The second is the modified nonlinear IV estimator defined in (6).

When the transition variable is exogenous, the reduced form for yt can be written as

    yt = π′01 w̃t + π′02 w̃t f(st; γ0, c0) + ξt,    (11)

where w̃t = (1, w′t)′, x̃t = Π̃0 w̃t + ṽt with ṽt = (0, v′t)′, π01 = Π̃′0 β01, π02 = Π̃′0 β02, and the error term is given by ξt = ut + β′02 ṽt f(st; γ0, c0). It is clear that, under Assumption 4, E[ξt w̃t f(st; γ0, c0)] = 0 and the parameters of (11) can be estimated by nonlinear least squares. Furthermore, γ0 and c0 are both identified. This fact opens the possibility of two-step estimation: first compute estimates γ̂ and ĉ of γ0 and c0, respectively, using (11), then substitute γ̂ and ĉ into (7) and estimate β01 and β02. One advantage is that, given γ̂ and ĉ, the STR model becomes a nonlinear-in-variables model. This is the spirit of the estimator proposed by Caner and Hansen (2004). Here we take a different route by considering a possibly endogenous transition variable.

The following theorem follows directly from Theorem 8.1.1 in Amemiya (1985, p. 246).

Theorem 1 (Consistency). Under Assumptions 2–5, ψ̂GMM →p ψ0 and ψ̂MIV →p ψ0.

In order to state the asymptotic normality result, we have to consider an additional assumption.

Assumption 8 (Asymptotic Normality). (1/T) Z′ ∂²g(X; ψ)/(∂ψi ∂ψ′) converges in probability uniformly in ψ ∈ Nψ0, where ψi is the ith element of ψ.

Theorem 2 (Asymptotic Normality). Under Assumptions 2–8,

    √T(ψ̂GMM − ψ0) →d N(0, σ²0 A⁻¹GMM),

where AGMM = E[ġ(X; ψ0)′ PZ ġ(X; ψ0)]. Furthermore,

    √T(ψ̂MIV − ψ0) →d N(0, σ²0 A⁻¹MIV),

where AMIV = E{G⁻¹[σ²* G + (σ²0 − σ²*)H]G⁻¹}, G = ġ(X; ψ0)′[I − v(v′v)⁻¹v′]ġ(X; ψ0), v = (v1, ..., vT)′, σ²* = σ²0 − Σ′uv Σ⁻¹v Σuv, and H = ġ(X; ψ0)′ PZ ġ(X; ψ0).

5.2. The choice of instruments

Set ft,0 ≡ f(st; γ0, c0). Hence,

    ġ(xt; ψ0) = [∂g(xt; ψ)/∂β′1, ∂g(xt; ψ)/∂β′2, ∂g(xt; ψ)/∂γ, ∂g(xt; ψ)/∂c]′ evaluated at ψ = ψ0
              = [x̃′t, x̃′t ft,0, β′02 x̃t ft,0(1 − ft,0)(st − c0), −β′02 x̃t ft,0(1 − ft,0)γ0]′.

It is clear that the gradient depends on the structural parameters. In order to compute the ''optimal'' instruments as in (4), we adopt the following procedure:

(1) Start from an initial and consistent estimate of ψ0, say ψ̃. For example, consider estimators of type (3) or (6) with any set of valid instruments. Compute ġ(xt; ψ̃).
(2) Regress ġ(xt; ψ̃) on wt and on the powers and cross-products of the elements of wt; compute the fitted values ĝ̇(xt; ψ̃).
(3) Set zt = ĝ̇(xt; ψ̃) and re-estimate the parameters.

As mentioned in Section 2, Newey (1990) showed that the procedure above can be seen as a series nonparametric approximation to (4). He also proved that this procedure yields efficient estimation in an IID framework. The optimality of such a procedure in a time series context is yet to be proved, but this is beyond the scope of the paper.

6. Model evaluation

The goal of this section is to discuss a number of misspecification tests for STR models with parameters estimated by moment-based techniques such as the ones previously described. One natural diagnostic test is the J-test for overidentifying conditions proposed by Hansen (1982). Another set of useful tests can be developed on the basis of Gauss–Newton regressions (GNR), as discussed in Davidson and MacKinnon (1993, pp. 226–232).

Define ût = yt − g(xt; ψ̂) and consider the following GNR:

    ût = PZ ġ(xt; ψ̂)b + et,    (12)

where {et} is a sequence of errors and ψ̂ is any consistent estimator of ψ0. As observed in Davidson and MacKinnon (1993), the ordinary least squares (OLS) estimate of b must be zero, and this fact can be used as a measure of the accuracy of the nonlinear optimization routine employed. Thus, testing H0: b = 0 in (12) is a very simple diagnostic check.

Another useful diagnostic is to augment Eq. (12) with nonlinear terms and test for neglected nonlinearity, such as additional regimes. For example, we can consider the following GNR:

    ût = PZ̃ ġ(xt; ψ̂)b + α′1 PZ̃ x̃t st + α′2 PZ̃ x̃t s²t + α′3 PZ̃ x̃t s³t + et,    (13)

where PZ̃ is a new projection matrix. The resulting null hypothesis is H0: α1 = α2 = α3 = 0. The procedure is a simple F-test in an OLS regression.

7. Monte Carlo experiment

The goal of this section is to evaluate the finite sample performance of the linearity test. Two different DGPs will be used, defined as follows.

(1) Model A (IID observations):

    yt = −0.2 + 1.4xt + (0.6 − 2.3xt)f(xt; γ0, −2.0) + ut,
    xt = θwt + vt,

where γ0 = 0 or 10, ut = vt + et, vt ∼ NID(0, 1), et ∼ NID(0, 1), and wt ∼ NID(0, 1).

(2) Model B (weakly dependent observations):

    yt = −0.2 + 1.4xt + (0.6 − 2.3xt)f(xt; γ0, −2.0) + ut,
    xt = θwt + vt,
    wt = 0.8wt−1 + ξt,

where γ0 = 0 or 10, ut = vt + et, vt ∼ NID(0, 1), et ∼ NID(0, 1), and ξt ∼ NID(0, 1).

Both models have a single endogenous variable, xt, which is also the transition variable. The data generated from Model A are independent and identically distributed, while Model B generates weakly dependent data, as wt follows a linear autoregressive (AR) model. We generate 2000 replications of each model with 100, 250, and 500 observations. Models with γ0 = 0 will be used to evaluate the empirical size of the linearity test.
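One replication of the Monte Carlo design for Model A can be sketched as follows; the staged IV F-test follows the procedure of Section 4, while the helper names and the seed are our own choices. With γ0 = 10 this is a power experiment; setting γ0 = 0 turns it into a size experiment:

```python
import numpy as np

# One replication of Model A (toy sketch). Since s_t = x_t, the test
# regressors of Eq. (9) collapse to powers of x_t (see footnote 5).
rng = np.random.default_rng(42)
T, theta, gamma0, k = 100, 1.0, 10.0, 3      # k = number of restrictions

def f(s, gamma, c):
    """Logistic transition function, Eq. (8)."""
    return 1.0 / (1.0 + np.exp(-gamma * (s - c)))

w = rng.normal(size=T)                       # exogenous variable
v = rng.normal(size=T)
u = v + rng.normal(size=T)                   # u_t = v_t + e_t (endogeneity)
x = theta * w + v                            # endogenous transition variable
y = -0.2 + 1.4 * x + (0.6 - 2.3 * x) * f(x, gamma0, -2.0) + u

# Instruments z_t = (1, w, w^2, w^3, w^4)' and projection matrix P_Z
Z = np.column_stack([w**j for j in range(5)])
PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)

def iv_ssr(R):
    """SSR ||P_Z (y - R b_hat)||^2, steps (1)-(2) of the test procedure."""
    b = np.linalg.lstsq(PZ @ R, PZ @ y, rcond=None)[0]
    return np.sum((PZ @ (y - R @ b))**2), y - R @ b

ssr_r, _ = iv_ssr(np.column_stack([x**j for j in range(2)]))      # restricted
ssr_u, res_u = iv_ssr(np.column_stack([x**j for j in range(5)]))  # unrestricted
F = ((ssr_r - ssr_u) / k) / (res_u @ res_u / (T - k))
print(F)   # compare with the F(k, T - k) critical value
```

Because the restricted regressor space is nested in the unrestricted one, SSRr ≥ SSRu and the statistic is nonnegative by construction.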
Fig. 1. Empirical size of the linearity test (Model A) across different values of θ . The nominal significance level is 0.05 and the number of observations is 100.
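The size experiment of Fig. 1 can be imitated in miniature with a small rejection-frequency loop. This is our sketch, not the paper's code: it uses 200 replications instead of 2000, and 2.70 as an approximate 5% critical value of the F(3, 97) distribution:

```python
import numpy as np

def linearity_F(theta, rng, T=100, k=3):
    """F statistic of Section 4 for one draw of Model A under H0 (gamma_0 = 0)."""
    w = rng.normal(size=T)
    v = rng.normal(size=T)
    u = v + rng.normal(size=T)                        # u_t = v_t + e_t
    x = theta * w + v                                 # endogenous regressor
    y = -0.2 + 1.4 * x + (0.6 - 2.3 * x) * 0.5 + u    # gamma_0 = 0 => f = 1/2
    Z = np.column_stack([w**j for j in range(5)])     # z_t = (1, w, ..., w^4)'
    PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)

    def iv_ssr(R):
        b = np.linalg.lstsq(PZ @ R, PZ @ y, rcond=None)[0]
        return np.sum((PZ @ (y - R @ b))**2), y - R @ b

    ssr_r, _ = iv_ssr(np.column_stack([x**j for j in range(2)]))
    ssr_u, res = iv_ssr(np.column_stack([x**j for j in range(5)]))
    return ((ssr_r - ssr_u) / k) / (res @ res / (T - k))

rng = np.random.default_rng(7)
rates = {}
for theta in (0.1, 1.0, 2.0):
    draws = np.array([linearity_F(theta, rng) for _ in range(200)])
    rates[theta] = float(np.mean(draws > 2.70))       # approx. 5% critical value
print(rates)   # empirical rejection rates under H0, by instrument strength
```

A larger θ strengthens the instruments, which is the channel behind the size pattern discussed below.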
As discussed in Section 4, linearity testing involves the estimation of a model with endogenous variables that is linear in the parameters but nonlinear in the variables. This type of specification and estimation has been considered in the literature since Kelejian (1971). The optimal choice of instruments has been discussed in several papers, as mentioned in Section 2. Here we focus on estimators as in (3). For both models, the set of instruments is zt = (1, wt, w²t, w³t, w⁴t)′. Our choice of instruments is quite natural, as the regressors in the test equations are powers of the endogenous variables (see Eq. (9)).

We consider different values of the parameter θ in both models in order to evaluate the strength of the set of instruments. The higher the value of θ, the stronger are the instruments. For example, when Model A is considered, the correlation between xt and wt is given by ρx,w = θ/√(1 + θ²). Clearly, the correlation between powers of xt and wt will also be a function of θ.

The linearity test results for Model A are presented in Figs. 1 and 2. We report both the empirical size and power of the linearity test for 100 observations and a nominal significance level of 0.05. When θ is close to zero (the instruments are weak), the test is heavily undersized. On the other hand, the empirical size approaches the nominal size as θ increases (the instruments are quite strong). This fact highlights the harmful influence of weak instruments on the performance of the linearity test. However, it seems that the power of the test is less affected by the strength of the instruments. The results concerning Model B are shown in Figs. 3 and 4. As in the previous case, the linearity test is undersized, especially when θ is near zero, and the empirical size approaches the nominal level as θ increases in absolute value. The power of the test goes to one as θ increases in absolute value.

Tables 1 and 2 show the estimation results. The parameters of Models A and B are estimated by the modified nonlinear IV estimator in (6). The nonlinear IV estimator (3) was also used but, as the estimates are less precise and have large outliers, we show only the results concerning the modified estimator. We report results for 100, 250, and 500 observations. We consider only the case where θ = 1. As can be seen by inspection of Tables 1 and 2, apart from γ, all the parameters are estimated quite well, and the precision improves, as expected, as the sample size increases. Skewness approaches 0 (symmetric distribution) and the kurtosis coefficient tends to 3 as the sample size increases, indicating convergence of the estimator to a normally distributed random variable. Finally,
although, on average, the estimates of γ are much higher than the true value, this is caused mainly by a few extreme observations. When the median is used as a measure of central tendency, the results improve substantially.

Fig. 2. Empirical power of the linearity test (Model A) across different values of θ. The nominal significance level is 0.05 and the number of observations is 100.

Fig. 3. Empirical size of the linearity test (Model B) across different values of θ. The nominal significance level is 0.05 and the number of observations is 100.

Fig. 4. Empirical power of the linearity test (Model B) across different values of θ. The nominal significance level is 0.05 and the number of observations is 100.

Table 1
Monte Carlo simulations: parameter estimates for Model A.
[Layout: for each sample size (100, 250, and 500 observations), rows for the parameters β11 (true value −0.2), β12 (1.4), β21 (0.6), β22 (−2.3), γ (10), and c (−2), with columns Mean, Median, Std. dev., Skewness, and Kurtosis.]
The table shows the mean, median, standard deviation, skewness and kurtosis of each parameter estimate over 2000 replications for different sample sizes. The parameters are estimated by minimizing (6), as proposed in Amemiya (1975). The instruments used are zt = (1, wt, w²t, w³t, w⁴t, w⁵t)′.

Table 2
Monte Carlo simulations: parameter estimates for Model B.
[Layout: same as Table 1.]
The table shows the mean, median, standard deviation, skewness and kurtosis of each parameter estimate over 2000 replications for different sample sizes. The parameters are estimated by minimizing (6), as proposed in Amemiya (1975). The instruments used are zt = (1, wt, w²t, w³t, w⁴t, w⁵t)′.

8. Application

8.1. Inflation targeting in Brazil
STR models have been successfully applied to describe the behavior of various macroeconomic time series (see, for example, van Dijk et al. (2002)). In this section, we analyze the Brazilian inflation rate series after the adoption of inflation targeting (IT) in mid-1999 to illustrate the modeling cycle for STR models. Since the early 1990s, a growing number of central banks in industrial and emerging countries have considered the adoption
of an IT framework, including the USA. The IT literature points out that much of its benefit can be attributed to its impact on inflation expectations.7 Woodford (2004) argues that the most important achievement of inflation targeting central banks has not been the reorientation of the goals of monetary policy toward a stronger emphasis on controlling inflation, but rather the development of an approach to the conduct of policy that focuses on a clearly defined target. Accordingly, one important advantage of commitment to an appropriately chosen policy rule is that it facilitates public understanding of policy, which is crucial in order for monetary policy to be most effective.8

This seems to be the case in Brazil. As noted by Cerisola and Gelos (2005), the adoption of an explicit and public target for inflation influenced the expectations of private agents. The authors examine the macroeconomic determinants of survey inflation expectations in Brazil since the adoption of inflation targeting in 1999. The results suggest that the inflation targeting framework has helped anchor expectations, with the dispersion of inflation expectations declining considerably. They also find that the inflation target has been instrumental in shaping expectations, while the importance of past inflation in determining expectations appears to be relatively low.

Soon after changing to a floating exchange rate regime in 1999, Brazil adopted an explicit inflation targeting framework as part of an extensive program of economic reforms. This development ended a period during which the exchange rate had been the main anchor for monetary policy. The mounting uncertainties after the floating of the Real in early 1999 prompted the implementation of a stricter inflation targeting framework, one that would represent a firm commitment to prevent inflation from getting out of control.
Moreover, the relatively loose fiscal stance at the outset of the new regime, as well as the lack of formal operational autonomy of the Central Bank, presented additional challenges to the conduct of monetary policy, in particular to the construction of credibility. In order to deal with these concerns, the Central Bank has adopted a flexible and accountable approach to conducting policy. For instance, even when the targets were breached and revised, the process was undertaken in a very transparent manner through open letters from the Central Bank. As noted in Mishkin (2004), the role of the Central Bank in this accomplishment provides a good example for other emerging markets considering adopting inflation targeting: the way the Central Bank articulated the reasons why the initial inflation target was missed, how it responded to the shock, and how it planned to return to its longer-run inflation goal.

The new regime has been tested in a number of different ways during its short lifetime, with the intensity and frequency of shocks being unprecedented. Despite challenging conditions, the new monetary framework has proved to be an effective guide for expectations. Even when current inflation deviated from the established targets, monetary policy under inflation targeting was, for much of the time, capable of keeping inflation expectations in line with the official inflation targets. In the following section, we formally analyze how the adoption of an explicit target for inflation affects inflation dynamics and monetary policy.
7 See Mishkin and Schmidt-Hebbel (2001) for a survey of early experiences with inflation targeting. Ball and Sheridan (2003) present a more pessimistic view of the experience to date.
8 In Woodford's (2004, p. 16) own words: ''For not only do expectations about policy matter, but, at least under current conditions, very little else matters''.
8.2. Analytical framework for the inflation process: the Phillips Curve

The standard approach to characterizing the inflation process is some kind of Phillips Curve relation. Specifically, the so-called ''New-Keynesian'' Phillips Curve (NKPC) is a log-linear approximation around the steady state of the aggregated pricing decisions of individual firms, and relates inflation positively to the output gap:9
πt = β1 xt + β2 Et πt+1 + ut ,

where xt is the output gap, πt is the inflation rate, Et πt+1 ≡ E(πt+1 |Ft) is expected future inflation conditional on the information set available at time t, and ut is a cost-push shock. Although theoretically appealing, this curve has problems when faced with the facts, specifically because of the absence of any endogenous persistence. In order to deal with this limitation, Galí and Gertler (1999) propose a model where a fraction of the firms use a cost-free rule-of-thumb based on lagged inflation to readjust their prices. The resulting equation is
πt = β1 πt−1 + β2 xt + β3 Et πt+1 + ut .

Even if the central bank stabilizes the output gap from now on, current inflation would not stabilize immediately, as it is influenced by its own recent history. Alves and Areosa (2005) argue that in an inflation targeting economy it is natural to assume that pricing decisions should also incorporate the inflation target. The authors propose the following extension:
πt − π*t = β1 xt + β2 Et [πt+1 − π*t+1] + ut , with π*t = (1 − λ)πt−1 + λπ*,

where 0 < λ < 1 and π* is the inflation target. The investigation of the presence of nonlinear mechanisms in the Phillips Curve has been an important topic in the recent literature because of its implications for monetary policy.10 Following a long tradition that goes back to Cukierman and Wachtel (1979) and Logue and Willett (1976), we argue that the level of inflation and the spread of expectations across individuals are positively related. Therefore, we consider the following family of nonlinear Phillips Curves:
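The reference rate π*t in this extension is simply a convex combination of lagged inflation and the announced target. A minimal sketch of the recursion (the function name, λ, and the numerical values are purely illustrative, not taken from the paper's data):

```python
def reference_inflation(pi_lag, target, lam):
    """Reference rate pi*_t = (1 - lam) * pi_{t-1} + lam * pi_star from the
    Alves and Areosa (2005) specification, with 0 < lam < 1."""
    return (1.0 - lam) * pi_lag + lam * target

# Illustrative values: with lam = 0.5, inflation running at 8% and a 4% target
# give a 6% reference rate; a larger lam pulls the reference toward the target.
ref = reference_inflation(pi_lag=0.08, target=0.04, lam=0.5)   # 0.06
```

A larger λ anchors the reference rate more tightly to the target, so deviations of inflation from π*t are measured against a more forward-looking benchmark.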
πt = π̄ + Σ_{j=a}^{A} β^L_{1j} πt−j + Σ_{j=c}^{C} β^L_{2j} xt−j + β^L_3 Et π̃ + [π̂ + Σ_{j=a}^{A} β^N_{1j} πt−j + Σ_{j=c}^{C} β^N_{2j} xt−j + β^N_3 Et π̃] f(σ̃^π_t ; γ, c) + ut ,
where Et π̃ is a measure of future inflation expectations (measured as deviations from the inflation target), σ̃^π_t is a measure of expectations uncertainty, and f(σ̃^π_t ; γ, c) is the logistic function, as in (8). In the STR model, the two regimes are associated with small and large values of the transition variable, zt, relative to the location parameter, c. This type of regime-switching can be convenient for modeling asymmetric responses of a monopolistic price setter, where the regimes of the STR are related to the uncertainty of inflation expectations. The parameter c can be interpreted as the tolerance level of uncertainty around the value that the price setter considers critical, and the parameter γ determines the smoothness of the transition from one regime to the other.
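The role of γ and c can be sketched numerically. Since Eq. (8) is not reproduced in this section, the sketch below assumes the standard first-order logistic STR form; the parameter values are hypothetical, chosen near the magnitudes estimated later:

```python
import numpy as np

def logistic_transition(s, gamma, c):
    """Logistic transition f(s; gamma, c) in [0, 1] of a standard first-order STR:
    s is the transition variable (here, expectations uncertainty), c the location
    (tolerance level), and gamma the smoothness of the regime switch."""
    return 1.0 / (1.0 + np.exp(-gamma * (np.asarray(s, dtype=float) - c)))

s = np.linspace(0.0, 2.0, 201)
f_smooth = logistic_transition(s, gamma=2.0, c=1.05)    # gentle transition
f_sharp = logistic_transition(s, gamma=19.0, c=1.05)    # close to a threshold model

# At s = c both curves equal 0.5; with gamma around 19 the function is near 0
# (regime 1, low uncertainty) well below c and near 1 (regime 2) well above c.
```

As γ → ∞ the logistic collapses to an indicator function and the STR nests a threshold model; small γ yields a gradual mix of the two regimes.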
9 The model of nominal rigidities proposed by Calvo (1983) is used in this case.
10 See Schaling (1999), Laxton et al. (1995), Eliasson (1999), Nobay and Peel (2000), and Musso et al. (2007).
W.D. Areosa et al. / Journal of Econometrics 165 (2011) 100–111

Fig. 5. Variables used in the Phillips Curve. Panels: (a) π; (b) x; (c) Et π̃; (d) σ̃^π_t; (e) ∆q; (f) ∆i.
8.3. Estimation

Now we examine whether there is evidence that the Brazilian inflation rate followed a nonlinear process between April 2000 and June 2007. The rationale is that the dynamics of inflation were different during periods of increased inflation uncertainty. We estimate linear and nonlinear models in order to compare the results. As both inflation expectations and expected inflation uncertainty are clearly endogenous, the nonlinear Phillips Curve proposed here is estimated by the nonlinear IV methods described above. Different sets of instruments are used in order to check the robustness of the results.

8.3.1. Data

The data sources are the Banco Central do Brasil (Central Bank of Brazil, hereafter BCB) and Ipea (Research Institute in Applied Economics).11 As a measure of the annualized monthly inflation rate, πt, we consider the Brazilian broad consumer price index (IPCA), used to gauge the Brazilian inflation targets. The output gap, xt, is measured by de-trended, seasonally adjusted industrial production.12 Inflation expectations are obtained from a daily survey that the BCB conducts among financial institutions and consulting firms. The survey asks what firms expect for end-of-year inflation in the current and in the following years. The BCB discloses the mean, the median, and the standard deviation of the inflation expectations. Our measure of inflation expectations is the median of the expectations across agents. Expected inflation uncertainty is the standard deviation of the inflation expectations across agents.
11 All series are available at www.bcb.gov.br and www.ipeadata.gov.br.
12 Carneiro (2000) showed that linear de-trending is a reasonable strategy to compute the Brazilian output gap.
The Brazilian inflation targeting regime sets year-end inflation targets for the current and the following two years. As it is necessary to have a single measure of the deviation of inflation from the target, we use a weighted average of the current and following years' expected deviations of inflation from the target, where the weight on each year depends on the number of months remaining until the end of the current year. Formally:
Et π̃ = (mt/12) × [Et π(0) − π*(0)] + ((12 − mt)/12) × [Et π(1) − π*(1)],   (14)

σ̃^π_t = (mt/12) × σ^π_{t,(0)} + ((12 − mt)/12) × σ^π_{t,(1)},   (15)
where mt is the number of months remaining until the end of the current year, Et π(0) and Et π(1) are, respectively, the expected inflation for the current and following years, and σ^π_{t,(0)} and σ^π_{t,(1)} are the standard deviations of the inflation expectations for the current and following years, respectively. Similarly, π*(0) and π*(1) are the targets for the current year and the following year, respectively. The values of σ̃^π_t are normalized by the estimated unconditional standard deviation of the series. Let qt be the real exchange rate and it the nominal interest rate, given by the Selic rate, which is the Central Bank's primary monetary policy instrument. The Selic rate is the average interest rate on overnight inter-bank loans collateralized by government bonds that are registered with, and traded on, the Sistema Especial de Liquidação e Custódia (Selic). The Central Bank of Brazil Monetary Policy Committee (COPOM) establishes a target for the Selic interest rate, and the Central Bank's open market desk executes regular liquidity management operations in the domestic money market, with the goal of keeping the daily Selic interest rate at the target level. Fig. 5 illustrates the time evolution of the series used.
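Eqs. (14) and (15) amount to a simple month-weighted average. A minimal sketch, with hypothetical survey numbers (the function names and example values are ours, not from the BCB data):

```python
def weighted_expected_deviation(m, e_pi_0, target_0, e_pi_1, target_1):
    """Eq. (14): Et pi~ = (m/12)[Et pi(0) - pi*(0)] + ((12 - m)/12)[Et pi(1) - pi*(1)],
    where m is the number of months remaining in the current year."""
    return (m / 12.0) * (e_pi_0 - target_0) + ((12 - m) / 12.0) * (e_pi_1 - target_1)

def weighted_uncertainty(m, sigma_0, sigma_1):
    """Eq. (15): the same weights applied to the cross-agent standard deviations
    of the expectations for the current and following years."""
    return (m / 12.0) * sigma_0 + ((12 - m) / 12.0) * sigma_1

# Hypothetical survey numbers: 3 months left in the year, current-year
# expectations well above a 4% target, next-year expectations close to it.
dev = weighted_expected_deviation(m=3, e_pi_0=7.5, target_0=4.0, e_pi_1=4.6, target_1=4.0)
unc = weighted_uncertainty(m=3, sigma_0=0.8, sigma_1=0.5)
# dev = 0.25 * 3.5 + 0.75 * 0.6 = 1.325; unc = 0.25 * 0.8 + 0.75 * 0.5 = 0.575
```

As the end of the current year approaches (m → 0), the weight shifts entirely to the following year's expectation, so the measure never collapses onto an almost-realized outcome.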
8.3.2. Estimates

First, we estimate the linear Phillips Curve:

πt = π̄ + β1 πt−1 + β2 xt−1 + β3 Et π̃ + ut ,

where the error is assumed to be a martingale difference sequence, E(ut |Ft−1) = 0. We consider only the first lag of inflation and of the output gap, as residual analysis shows no evidence of remaining serial correlation. In order to estimate the model parameters, we consider the following choices for the exogenous variables, wt, and the set of instruments, zt:

(1) Instrument set 1 (IS1):
wt = (πt−1, . . . , πt−4, xt−1, xt−2, Et−1 π̃, Et−2 π̃, σ̃^π_{t−1}, σ̃^π_{t−2}, ∆it−1, ∆it−2, ∆qt−1, ∆qt−2)′,
zt = (πt−1, . . . , πt−4, xt−1, xt−2, Et−1 π̃, Et−2 π̃, σ̃^π_{t−1}, σ̃^π_{t−2}, ∆it−1, ∆it−2, ∆qt−1, ∆qt−2)′.

(2) Instrument set 2 (IS2):
wt = (πt−1, xt−1, Et−1 π̃, σ̃^π_{t−1}, ∆it−1, ∆qt−1)′,
zt = (πt−1, π²t−1, xt−1, x²t−1, Et−1 π̃, (Et−1 π̃)², σ̃^π_{t−1}, (σ̃^π_{t−1})², ∆it−1, ∆i²t−1, ∆qt−1, ∆q²t−1)′.

(3) Instrument set 3 (IS3):
wt = (πt−1, πt−2, xt−1, xt−2, Et−1 π̃, Et−2 π̃, σ̃^π_{t−1}, σ̃^π_{t−2}, ∆it−1, ∆it−2)′,
zt = (πt−1, πt−2, xt−1, xt−2, Et−1 π̃, Et−2 π̃, σ̃^π_{t−1}, σ̃^π_{t−2}, ∆it−1, ∆it−2)′.

(4) Instrument set 4 (IS4):
wt = (πt−1, xt−1, Et−1 π̃, σ̃^π_{t−1}, ∆it−1)′,
zt = (πt−1, π²t−1, xt−1, x²t−1, Et−1 π̃, (Et−1 π̃)², σ̃^π_{t−1}, (σ̃^π_{t−1})², ∆it−1, ∆i²t−1)′.

By choosing different sets of instruments, we may not only check the robustness of our results, but also evaluate the effects of having nonlinear combinations of exogenous variables as potential instruments. The results of the linear estimation are reported in Table 3. Some important facts emerge from the table. First, when nonlinear instruments are used (IS2 and IS4), the persistence of past inflation (inflation inertia) is higher, while the effects of inflation expectations and of the past output gap are lower. Second, the inclusion of real exchange rates as instruments does not alter the estimation results. Finally, the test described in Section 4 strongly rejects the null hypothesis of linearity, regardless of which instruments are used.

Table 3
Linear Phillips Curve: parameter estimates.

Coefficient | IS1 | IS2 | IS3 | IS4
π̄ | 0.0310 (0.0066) | 0.0302 (0.0065) | 0.0313 (0.0065) | 0.0301 (0.0065)
πt−1 | 0.1576 (0.1072) | 0.2130 (0.1041) | 0.1322 (0.1110) | 0.2151 (0.1042)
xt−1 | 0.5153 (0.1759) | 0.4738 (0.1738) | 0.5344 (0.1779) | 0.4722 (0.1738)
Et π̃ | 3.1373 (0.5031) | 2.7942 (0.4803) | 3.2947 (0.5315) | 2.7811 (0.4807)
Linearity test (p-value) | 0.0006 | 0.0038 | 0.0008 | 0.0090

The table shows parameter estimates for the equation πt = π̄ + β1 πt−1 + β2 xt−1 + β3 Et π̃ + ut, where πt is the annualized monthly inflation rate, xt is the output gap, and Et π̃ is the inflation expectation defined as in (14). The parameters are estimated with four different sets of instruments (IS1–IS4). Standard errors are in parentheses. The table also reports the p-value of the linearity test. The transition variable is σ̃^π_t.

As linearity is strongly rejected, we proceed by estimating a smooth transition version of the Phillips Curve. Our specification has the following form:

πt = π̄ + β^L_1 πt−1 + β^L_3 xt−1 + β^L_4 Et π̃ + [π̂ + β^N_1 πt−1 + β^N_3 xt−1 + β^N_4 Et π̃] f(σ̃^π_t ; γ, c) + ut .

We present the estimates in Table 4. We report only the results concerning instrument sets 3 (IS3) and 4 (IS4); the results with instrument sets 1 (IS1) and 2 (IS2) are not substantially different, and hence are omitted. For each instrument set, Table 4 reports two different sets of estimates: one using the two-step procedure to compute the ''optimal'' instruments, as in Section 5.2, and another using only zt as the instruments (''raw'' instruments), that is, without the optimality correction.

Table 4
Nonlinear Phillips Curve: parameter estimates.
Instrument set 3

Coefficient | ''Raw'': First regime | ''Raw'': Second regime | ''Optimal'': First regime | ''Optimal'': Second regime
π̄ | 0.0267 (0.0092) | 0.0192 (0.0164) | 0.0257 (0.0096) | 0.0222 (0.0185)
πt−1 | 0.3118 (0.1542) | −0.3075 (0.2070) | 0.3999 (0.1558) | −0.2312 (0.2107)
xt−1 | 0.0417 (0.2894) | 0.8853 (1.0809) | 0.0108 (0.2921) | 0.8221 (1.0960)
Et π̃ | 1.8351 (0.3823) | 1.8613 (1.1632) | 0.6341 (0.3932) | 2.0609 (1.2495)
γ | 19.2964 (−) | | 19.3367 (−) |
c | 1.0661 (0.1130) | | 1.0479 (0.1030) |

Instrument set 4

Coefficient | ''Raw'': First regime | ''Raw'': Second regime | ''Optimal'': First regime | ''Optimal'': Second regime
π̄ | 0.0296 (0.0094) | 0.0163 (0.0184) | 0.0264 (0.0095) | 0.0220 (0.0194)
πt−1 | 0.3313 (0.1522) | −0.2252 (0.2049) | 0.3832 (0.1536) | −0.2046 (0.2102)
xt−1 | 0.0044 (0.2879) | 0.8956 (1.0201) | 0.0043 (0.2930) | 0.8477 (1.0472)
Et π̃ | 0.8445 (0.3892) | 2.3388 (1.2209) | 0.9162 (0.3964) | 1.7211 (1.2165)
γ | 18.6842 (−) | | 18.7301 (−) |
c | 1.0829 (0.1035) | | 1.0668 (0.1084) |

The table shows the parameter estimates of the model πt = π̄ + β^L_1 πt−1 + β^L_3 xt−1 + β^L_4 Et π̃ + [π̂ + β^N_1 πt−1 + β^N_3 xt−1 + β^N_4 Et π̃] f(σ̃^π_t ; γ, c) + ut. The label ''Raw'' instruments means that the optimality correction of Section 5.2 is not used. The numbers in parentheses are the standard errors of the estimates.
The results can be summarized as follows. First, the estimated location of the transition, ĉ, is almost the same in all the cases considered, and the transition is relatively smooth, although there are not many observations along the transition path. The analysis of the logistic function in Fig. 6 confirms that the regime switches occur in periods when expectations uncertainty is higher. Indeed, the period can be separated into three sub-samples: (i) before 2001, the implementation phase; (ii) 2001–2002, the stress test; and (iii) after 2002, the restoration of credibility. Hence, we can characterize the two extreme regimes as low uncertainty (regime 1) and high uncertainty (regime 2). Second, persistence (inflation inertia) is high in the first regime, but almost vanishes in the high uncertainty regime, although it is worth noting the low significance of the coefficient.13 In addition, the output gap is significant only when inflation uncertainty is high. Finally, inflation expectations are, as expected, more important in regime 2 (high uncertainty).
13 This may be due to the restricted number of observations.
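The estimation logic above can be illustrated on simulated data. The following sketch is not the authors' code: it assumes a single-lag, two-regime STR with one endogenous regressor, a hypothetical data-generating process, and estimates the parameters by minimizing the ''raw''-instrument NL2SLS criterion (residuals projected on the instrument space), without the optimality correction of Section 5.2. For fixed (γ, c) the model is linear in the remaining parameters, so these are concentrated out by 2SLS and the nonlinear pair is found by a coarse grid search:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(s, gamma, c):
    return 1.0 / (1.0 + np.exp(-gamma * (s - c)))

# Hypothetical DGP: endogenous regressor x, two-regime STR with transition s.
T = 500
z = rng.normal(size=(T, 3))                       # instruments
v = rng.normal(size=T)
x = z @ np.array([0.6, -0.4, 0.3]) + v            # endogenous regressor
s = np.abs(rng.normal(size=T)) + 0.5              # exogenous transition variable
u = 0.5 * v + rng.normal(scale=0.3, size=T)       # error correlated with x
y = 0.1 + 0.2 * x + 0.8 * x * logistic(s, 10.0, 1.0) + u

# "Raw" instrument matrix and its projection matrix.
Z = np.column_stack([np.ones(T), z, s, s * z[:, 0]])
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)

def nl2sls_criterion(gamma, c):
    """Concentrated NL2SLS: for fixed (gamma, c) the model is linear in the
    remaining parameters, which are estimated by 2SLS in closed form."""
    X = np.column_stack([np.ones(T), x, x * logistic(s, gamma, c)])
    beta = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
    resid = y - X @ beta
    return resid @ P @ resid, beta

# Coarse grid search over the nonlinear parameters (illustration only).
best = min((nl2sls_criterion(g, c) + (g, c)
            for g in (2.0, 5.0, 10.0, 20.0)
            for c in (0.8, 0.9, 1.0, 1.1, 1.2)),
           key=lambda t: t[0])
crit, beta_hat, gamma_hat, c_hat = best
```

In practice a derivative-based optimizer over all parameters and proper STR starting values would replace the grid, but the concentrated criterion keeps the sketch short and numerically stable.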
Fig. 6. Estimated logistic function. Panel (a): transition function versus transition variable (IS3). Panel (b): transition function across time (IS3). Panel (c): transition function versus transition variable (IS4). Panel (d): transition function across time (IS4).
8.4. Implementing IT: before 2000

Despite the adoption of IT in Brazil occurring during a foreign exchange crisis, the transition to the new regime in 1999 was relatively smooth. Against the pessimistic views, inflation at the end of 1999 remained in single digits (8.9%), while annual GDP grew by almost 1% (0.8%). The response of the Brazilian government and the BCB to the crisis combined fiscal consolidation, a strong commitment to price stability, and external financial support. The analysis of the logistic function in Fig. 6 confirms that the economy was in a low uncertainty period (regime 1). After the initial transition phase, with the normalization of financial conditions and under the effects of significant interest rate cuts, inflation ended 2000 at the 6% mid-point target, with robust economic growth of 4.4%. During this period, our first-regime estimates show the irrelevance of the output gap for inflation dynamics, which was driven by lagged inflation (0.3–0.4) and inflation expectations (0.6–1.8). However, during 2000 a series of important shocks occurred: oil prices had doubled since 1999, while the prices of technology firms fell sharply with the NASDAQ meltdown. At the same time, monetary policy conditions were tightened in the USA, with the Federal Funds Rate being raised to 6.5% in May 2000, from 5.5% at the end of 1999. By the end of 2000, while the performance of the economy was positive, the accelerated rate of growth of the Brazilian economy, combined with the US and global slowdown, pointed to difficulties ahead. The Brazilian economic recovery that began at the end of 1999 was based on strong credit expansion, increasing exports of industrial goods, and agricultural price recovery. This recovery, however, combined with increasing oil prices and the US slowdown, adversely affected the balance of trade, which entered negative territory (12-month accumulated) in September 2000, after a period of recovery following the depreciation of the Real in early 1999. The Brazilian core IPCA inflation started to show an upward trend after November 2000.

8.5. Inflation targeting under stress: 2001–2002

The year 2001 was marked by a series of adverse shocks, most notably the Argentine default, officially announced in the fourth quarter of 2001, the energy crisis in Brazil, and the September 11, 2001 attacks. In the beginning of the year, consumer price inflation was above expectations, while the core inflation trend was incompatible with the 4% inflation target for the year. After reducing the Selic rate to 15.25% in January, the BCB started in March the first monetary policy tightening cycle of the inflation targeting regime. After an initial 50 basis point increase, the tightening cycle was interrupted only in July, with the Selic rate reaching 19%. The policy rate remained unchanged from August 2001 to February 2002, when the Central Bank began the easing process, although for a brief period of time. The series of adverse events during 2001 produced a significant exchange rate depreciation, hovering around 20%. At the end of 2001, inflation reached 7.7% (3.7 percentage points above the 4% target) and the economy grew 1.3%. The logistic function in Fig. 6 shows that the economy was in a high uncertainty period (regime 2). In this scenario, lagged inflation is no longer a good proxy for future inflation, which explains why persistence is low in the period. The increased relevance of inflation expectations (1.7–2.3) and the output gap (0.82–0.89) in inflation dynamics highlights the importance of anchoring expectations. Even though the target was not reached, the results obtained in the face of an extremely adverse scenario were satisfactory, revealing the inflation targeting regime as an effective and flexible framework to pin down expectations. Inflation expectations for 2002, gauged at the end of 2001, were still below 5%. The swift monetary policy reaction after the September 11, 2001 terrorist attacks kept expectations under control and made economic agents believe that the 2001 adverse inflationary shock would dissipate during the following year.
The year 2002 began with the view that the end of the energy crisis, combined with an improved international environment, would allow some flexibility in the conduct of monetary policy. In fact, a considerable exchange rate appreciation occurred (from 2.80 R$/US$ just after September 11 to 2.40 R$/US$ at the beginning of May 2002). In this context, monetary policy was relaxed at the beginning of the year, with the Selic rate being reduced from 19% in February to 18% in June. Later in the year, however, the uncertainty associated with the presidential election set off an unprecedented confidence crisis, leading to a sharp exchange rate depreciation and to very unfavorable debt dynamics. During that time, although a number of arguments arose suggesting that particular circumstances had distorted the monetary policy transmission mechanism, dooming the fight against inflation, Brazil did succeed in securing disinflation through monetary tightening, with a perceptible contribution from the aggregate demand transmission channel. The commitment assumed by the new President to sustain sound macroeconomic policies, combining fiscal discipline, a floating exchange rate regime, and the inflation targeting framework, was crucial to dissipate the fears associated with changes in the course of the economy and related to debt sustainability. From September to December 2002, the Central Bank increased its policy rate from 18% to 25%. However, the sharp exchange rate depreciation during the year yielded a considerable increase in inflation, which ended 2002 at 12.5%, with modest GDP growth of 1.9%. Although the inflation targeting regime was unable to anchor expectations during that year, the months that followed this episode proved that inflation targeting has been a useful framework to align market expectations with government objectives.

8.6. Reconstructing credibility: after 2002

In January 2003, the Central Bank sent an open letter to the Minister of Finance explaining why the inflation targets were breached, and made explicit estimates of the size of the shocks and their persistence. The Central Bank added to the original inflation target for 2003 (4%) part of the breach experienced in the previous year, to account for inertia effects (inflation carryover from the 2002 shock) and for the impact on administered prices that, by contract provisions, are adjusted according to past inflation. These two effects led the Central Bank to adjust the inflation target for 2003 to 8.5%. The Central Bank made explicit reference to the fact that, after the sharp increase in inflation in 2002, attempting to achieve the original inflation target of 4% for 2003 would require a sizable output sacrifice. Inflation in 2003 fell by more than 3 percentage points, ending at 9.3%, close to the adjusted target, and GDP declined by a modest 0.2%. The Central Bank was not able to achieve this on its own. The new government not only supported the inflation targeting regime, but also pursued tight spending policies that resulted in a primary budget surplus of 4.3% of GDP in 2003. In line with these facts, the analysis of the logistic function in Fig. 6 confirms that the economy returned to regime 1 (the low uncertainty period), with inflation primarily driven by lagged inflation and inflation expectations. The strong recovery in 2004, with growth reaching almost 5% and employment increasing at a double-digit rate, required a gradual but firm response from the Central Bank to fight emerging inflationary pressures and to prevent these pressures from contaminating inflation expectations. From September 2004 to May 2005, the Central Bank raised its policy rate by 3.75 percentage points to 19.75%. Moreover, the government announced in September 2004 a change in the primary surplus target for 2004, from 4.25% to 4.5% of GDP.
Inflation, despite some acceleration during the second half of 2004, ended the year at 7.6%, which was above the 5.5% target, but within the tolerance interval.
In September 2004, when it became clear to the Central Bank that the 5.5% target for 2004 would not be fulfilled, and it was possible to project the 2004 deviation with greater accuracy, the Central Bank announced 5.1% as its operational target to be pursued in 2005.

9. Conclusions

In this paper we considered the estimation of smooth transition regression models with endogenous variables. Different nonlinear instrumental variable (IV) estimation methods were discussed, and the asymptotic properties of the estimators were analyzed when the data are formed by weakly dependent stochastic processes. A linearity test based on the Taylor expansion of the logistic function was extended to the case of endogenous regressors, and its small sample properties were checked through simulations. The small sample properties of the nonlinear IV estimators were also analyzed by simulation. Finally, a nonlinear Phillips Curve for emerging economies was estimated with Brazilian data under an inflation targeting regime. The empirical results showed strong support for a nonlinear specification of the Phillips Curve, where the transitions were related to inflation uncertainty with respect to the target.

Acknowledgments

The second author is most grateful for the financial support of the Australian Research Council and the National Science Council, Taiwan. The third author acknowledges CNPq for partial financial support. The authors wish to thank participants at the ''Recent Developments in Econometric Theory'' conference, Kyoto, Japan, July 2006, and the ''International Symposium on Recent Developments of Time Series Econometrics'', Xiamen, China, May 2008, for helpful comments and suggestions. This paper should not be reported as representing the views of the Banco Central do Brasil. The views expressed in the paper are those of the authors and do not necessarily reflect those of the Banco Central do Brasil. The comments of anonymous referees are gratefully acknowledged.
Appendix. Proofs

A.1. Proof of Theorem 1

Assumptions 2 and 3 guarantee that the model is identified. Under Assumptions 4–6, the proof follows from the same steps as in the proof of Theorem 8.1.1 in Amemiya (1985, p. 246).

A.2. Proof of Theorem 2

The proof of the first part of the theorem follows along the same lines as the proof of Theorem 8.1.2 of Amemiya (1985, p. 247). The second part of the theorem follows from the same steps as in Amemiya (1975, p. 381).

References

Alves, S., Areosa, W., 2005. Targets and inflation dynamics. Texto para Discussão 100, Banco Central do Brasil.
Amemiya, T., 1974. The nonlinear two-stage least-squares estimator. Journal of Econometrics 2, 105–110.
Amemiya, T., 1975. The nonlinear limited-information maximum-likelihood estimator and the modified nonlinear two-stage least-squares estimator. Journal of Econometrics 3, 375–386.
Amemiya, T., 1977. The maximum likelihood and the nonlinear three-stage least squares estimator in the general nonlinear simultaneous equation model. Econometrica 45, 955–968.
Amemiya, T., 1985. Advanced Econometrics. Harvard University Press, Cambridge.
Bacon, D.W., Watts, D.G., 1971. Estimating the transition between two intersecting lines. Biometrika 58, 525–534.
Ball, L., Sheridan, N., 2003. Does inflation targeting matter? Working Paper 9577, NBER.
Bowden, R., Turkington, D., 1981. A comparative study of instrumental variables estimators for nonlinear simultaneous models. Journal of the American Statistical Association 76, 988–995.
Calvo, G., 1983. Staggered prices in a utility-maximizing framework. Journal of Monetary Economics 12, 383–398.
Caner, M., Hansen, B., 2004. Instrumental variable estimation of a threshold model. Econometric Theory 20, 813–843.
Carneiro, D., 2000. Inflation targeting in Brazil: what difference does a year make? Texto para Discussão 429, Pontifícia Universidade Católica do Rio de Janeiro.
Cerisola, M., Gelos, R., 2005. What drives inflation expectations in Brazil? An empirical analysis. Working Paper WP/05/109, IMF.
Chan, K.S., Tong, H., 1986. On estimating thresholds in autoregressive models. Journal of Time Series Analysis 7, 179–190.
Choi, I., Saikkonen, P., 2004a. Cointegrating smooth transition regressions. Econometric Theory 20, 301–340.
Choi, I., Saikkonen, P., 2004b. Testing linearity in cointegrating smooth transition regressions. Econometrics Journal 7, 341–365.
Cukierman, A., Wachtel, P., 1979. Differential inflationary expectations and the variability of the rate of inflation: theory and evidence. American Economic Review 69, 595–609.
Davidson, R., MacKinnon, J.G., 1993. Estimation and Inference in Econometrics. Oxford University Press, New York, NY.
Eliasson, A.-C., 1999. Is the short-run Phillips curve nonlinear? Empirical evidence for Australia, Sweden and the United States. Working Paper Series in Economics and Finance 330, Stockholm School of Economics.
Galí, J., Gertler, M., 1999. Inflation dynamics: a structural econometric analysis. Journal of Monetary Economics 44, 195–222.
Goldfeld, S.M., Quandt, R., 1972. Nonlinear Methods in Econometrics. North Holland, Amsterdam.
Granger, C.W.J., Teräsvirta, T., 1993. Modelling Nonlinear Economic Relationships. Oxford University Press, Oxford.
Hansen, L., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.
Kelejian, H., 1971. Two-stage least squares and econometric systems linear in parameters but nonlinear in the endogenous variables. Journal of the American Statistical Association 66, 373–374.
Laxton, D., Meredith, G., Rose, D., 1995. Asymmetric effects of economic activity on inflation. IMF Staff Papers 42, 344–374.
Li, W.K., Ling, S., McAleer, M., 2002. Recent theoretical results for time series models with GARCH errors. Journal of Economic Surveys 16, 245–269.
Logue, D., Willett, T., 1976. A note on the relation between the rate and variability of inflation. Economica 43, 151–158.
Lundbergh, S., Teräsvirta, T., 1998. Modelling economic high-frequency time series with STAR-STGARCH models. Working Paper Series in Economics and Finance 291, Stockholm School of Economics.
Luukkonen, R., Saikkonen, P., Teräsvirta, T., 1988. Testing linearity against smooth transition autoregressive models. Biometrika 75, 491–499.
McAleer, M., 2005. Automated inference and learning in modeling financial volatility. Econometric Theory 21, 232–261.
Medeiros, M., Veiga, A., 2005. A flexible coefficient smooth transition time series model. IEEE Transactions on Neural Networks 16, 97–113.
Mira, S., Escribano, A., 2000. Nonlinear time series models: consistency and asymptotic normality of NLS under new conditions. In: Barnett, W.A., Hendry, D., Hylleberg, S., Teräsvirta, T., Tjøstheim, D., Würtz, A. (Eds.), Nonlinear Econometric Modeling in Time Series Analysis. Cambridge University Press, pp. 119–164.
Mishkin, F., 2004. Can inflation targeting work in emerging market countries? In: Conference in Honor of Guillermo Calvo.
Mishkin, F., Schmidt-Hebbel, K., 2001. One decade of inflation targeting in the world: what do we know and what do we need to know? Working Paper 8397, NBER.
Musso, A., Stracca, L., van Dijk, D., 2007. Instability and nonlinearity in the Euro area Phillips Curve. Working Paper Series 811, European Central Bank.
Newey, W., 1990. Efficient instrumental variable estimation of nonlinear models. Econometrica 58, 809–837.
Newey, W., Powell, J., 2003. Instrumental variable estimation of nonparametric models. Econometrica 71, 1565–1578.
Nobay, A., Peel, D., 2000. Optimal monetary policy with a nonlinear Phillips Curve. Economics Letters 67, 159–164.
Schaling, E., 1999. The non-linear Phillips Curve and inflation forecast targeting: symmetric versus asymmetric monetary policy rules. Working Paper Series 98, Bank of England.
Stock, J., Wright, J., Yogo, M., 2002. A survey of weak instruments and weak identification in generalized method of moments. Journal of Business and Economic Statistics 20, 518–529.
Suarez-Fariñas, M., Pedreira, C.E., Medeiros, M.C., 2004. Local global neural networks: a new approach for nonlinear time series modeling. Journal of the American Statistical Association 99, 1092–1107.
Teräsvirta, T., 1994. Specification, estimation, and evaluation of smooth transition autoregressive models. Journal of the American Statistical Association 89, 208–218.
Teräsvirta, T., 1998. Modelling economic relationships with smooth transition regressions. In: Ullah, A., Giles, D.E.A. (Eds.), Handbook of Applied Economic Statistics. Dekker, pp. 507–552.
van Dijk, D., Teräsvirta, T., Franses, P.H., 2002. Smooth transition autoregressive models: a survey of recent developments. Econometric Reviews 21, 1–47.
Woodford, M., 2004. Inflation targeting and optimal monetary policy. Federal Reserve Bank of St. Louis Review 86, 14–41.
Journal of Econometrics 165 (2011) 112–127
A consistent nonparametric test for nonlinear causality—Specification in time series regression✩

Yoshihiko Nishiyama a,∗, Kohtaro Hitomi b, Yoshinori Kawasaki c, Kiho Jeong d

a Kyoto University, Japan
b Kyoto Institute of Technology, Japan
c Institute of Statistical Mathematics, Japan
d Kyungpook National University, South Korea
Article history: Available online 13 May 2011

JEL classification: C12; C14; C32

Keywords: Nonlinear causality; Causality up to Kth moment; Nonparametric test; Omitted variables test; Local alternatives
Abstract

Since the pioneering work by Granger (1969), many authors have proposed tests of causality between economic time series. Most of them are concerned only with ''linear causality in mean'', that is, whether a series linearly affects the (conditional) mean of the other series. This is no doubt of primary interest, but dependence between series may be nonlinear, and/or may operate not only through the conditional mean. Indeed, conditionally heteroskedastic models have been widely studied recently. The purpose of this paper is to propose a nonparametric test for possibly nonlinear causality. Taking into account that dependence in higher order moments is becoming an important issue, especially in financial time series, we also consider a test for causality up to the Kth conditional moment. Statistically, we can also view this test as a nonparametric omitted variable test in time series regression. A desirable property of the test is that it has nontrivial power against T^{1/2}-local alternatives, where T is the sample size. Also, we can form a test statistic accordingly if we have some knowledge of the alternative hypothesis. Furthermore, we show that the test statistic includes most of the omitted variable test statistics as special cases asymptotically. The null asymptotic distribution is not normal, but we can easily calculate the critical regions by simulation. Monte Carlo experiments show that the proposed test has good size and power properties. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

Causality between variables has been one of the main interests in time series econometrics since the pioneering work of Granger (1969). We propose a nonparametric test for Granger-type causality. Conceptually similar works are Bierens and Ploberger (1997), Chen and Fan (1999), Robinson (1989) and Hidalgo (2000). The first three papers proposed nonparametric tests of certain conditional moment restrictions, while the last introduced a nonparametric Granger causality test in the frequency domain for weakly stationary linear processes. Hidalgo (2000) is mainly concerned with testing under long range dependent observations, but his test does not have power against some alternatives involving nonlinear dynamics. We construct a test statistic based on moment conditions allowing for nonlinear dependence. It has nontrivial
✩ This research was partially supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) Grant-in-Aid for the 21st Century COE Program, and by Japan Society for the Promotion of Science Grants-in-Aid 15330040, 19530182 and 22330067.
∗ Corresponding address: Yoshida, Sakyo, Kyoto, 606-8501, Japan.
E-mail address: [email protected] (Y. Nishiyama).
0304-4076/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2011.05.010
power against T^{1/2}-local alternatives, where T is the sample size. The null asymptotic distribution is non-Gaussian, but we can easily calculate the critical region by simulation. When applied to regression analysis for cross section data, this test reduces to a nonparametric omitted variable test, or a significance test of regressors, which was considered in Okui and Hitomi (2002).

Causality is not an easy concept to capture philosophically, but Granger (1969) gave a practical definition to deal with it in the context of time series analysis. Suppose we have a two-dimensional time series (x_t, y_t), t = 1, ..., T, and we are interested in whether there exists any causality between x and y. Granger's definition is that y_t is said to cause x_t in mean if

E[x_t - P(x_t | x_{t-1}, ..., x_1)]^2 > E[x_t - P(x_t | x_{t-1}, ..., x_1, y_{t-1}, ..., y_1)]^2,   (1.1)

where P(A_t | B_t) is the optimum linear (least squares) predictor of A_t given B_t (see Granger (1969, p. 429)); this is denoted y_t → x_t. Otherwise y_t ↛ x_t. An interpretation of this definition is that y_t causes x_t when we can improve the linear prediction of x_t using the information carried by y_{t-1}, ..., y_1. Granger remarks that this definition of causality means "linear causality in mean". Under the linearity assumption that the process has a
Y. Nishiyama et al. / Journal of Econometrics 165 (2011) 112–127
representation y_t = Σ_{j=-∞}^{∞} α_j x_{t-j} + u_t, we can test the null hypothesis H_0: y_t ↛ x_t against H_1: y_t → x_t, as in Sims (1972) or Hosoya (1977), using the property that H_0 is equivalent to α_j = 0 for all j < 0. This approach has been the most commonly used since Sims (1972) (see, e.g., Geweke (1982), Sims et al. (1990), Toda and Phillips (1993), Hosoya (1991) and Lütkepohl and Poskitt (1996)). To the best of our knowledge, Hidalgo (2000) is the newest result along this line, allowing for long range dependence without a distributional specification. However, this approach may fail to detect some nonlinear causal relationships. The reason is that these tests are based on linear projections of the series of interest, so the error terms are only uncorrelated with, not independent of, the series of interest. In much of the aforementioned research, the authors apply frequency domain analysis where causality is captured through cross-spectra, or covariances, of y and x. But covariances can easily be zero under nonlinear relationships even if the two variables are dependent, so tests based on covariances are unlikely to possess good power against certain alternatives.

We propose a nonparametric test which has power even when the observations are nonlinearly dependent. For this purpose, we replace the linear projections by the optimum predictor, namely the conditional expectation; that is, we rewrite (1.1) to define possibly nonlinear causality as

E[x_t - E(x_t | x_{t-1}, ..., x_1)]^2 > E[x_t - E(x_t | x_{t-1}, ..., x_1, y_{t-1}, ..., y_1)]^2.

Straightforward calculation gives

E[x_t - E(x_t | x_{t-1}, ..., x_1, y_{t-1}, ..., y_1)]^2 = E[x_t - E(x_t | x_{t-1}, ..., x_1)]^2 - E[E(x_t | x_{t-1}, ..., x_1, y_{t-1}, ..., y_1) - E(x_t | x_{t-1}, ..., x_1)]^2.

Thus, we define "y_t (possibly nonlinearly) causes x_t in mean" if

E[E(x_t | x_{t-1}, ..., x_1, y_{t-1}, ..., y_1) - E(x_t | x_{t-1}, ..., x_1)]^2 > 0,   (1.2)

and we call this simply "causality in mean" throughout this paper. We test the null hypothesis

H_0: E[E(x_t | x_{t-1}, ..., x_1, y_{t-1}, ..., y_1) - E(x_t | x_{t-1}, ..., x_1)]^2 = 0   (1.3)

or, equivalently,

H_0: E(x_t | x_{t-1}, ..., x_1, y_{t-1}, ..., y_1) = E(x_t | x_{t-1}, ..., x_1) w.p.1   (1.4)

against the alternative hypothesis (1.2). Here w.p.1 abbreviates "with probability one".

We construct test statistics based on the moment condition (1.4) for causality in mean. This is, in statistical terms, a test for omitted variables in time series regression. Many such tests have been proposed in the literature, for example in Bierens (1982, 1990), Bierens and Ploberger (1997), Chen and Fan (1999), Fan and Li (1996) and Robinson (1989) for i.i.d. and time series observations, among others. All these papers except the last two have nontrivial power against √T-local alternatives. The test proposed in this paper also has nontrivial power against √T-local alternatives, but it has an advantage over the previous ones in that its power properties can be controlled easily and directly. Furthermore, we show that these previously proposed tests can in fact be rewritten as special cases of the test statistic proposed below by selecting the user-determined components suitably; we discuss this in detail in Section 5. Hiemstra and Jones (1994) also consider testing for nonlinear Granger causality, but their approach differs from ours because they look at the joint cumulative distribution, not conditional moments; their definition of causality is slightly stronger than the present one. We also refer to Qiao et al. (2009), who applied the Hiemstra and Jones test to investigate consumer attitudes.

The next section provides the test statistic for causality in mean, the regularity conditions, and the null distribution. Section 3 explains the power properties of the test. Section 4 treats causality in higher order moments. Section 5 discusses the power properties as well as special cases of the test. We report Monte Carlo results in Section 6. Section 7 concludes. The proofs of the theorems and lemmas are in the Appendix.

2. Test statistic for causality in mean

2.1. Hypotheses and the corresponding moment conditions

This section gives a heuristic argument for how to test (1.4) and then presents the test statistic. We restrict ourselves to the case when x_t follows a nonlinear AR model of the form

E[x_t | x_{t-1}, ..., x_{t-p}, y_{t-1}, ..., y_{t-q}] = m(x_{t-1}, ..., x_{t-p}, y_{t-1}, ..., y_{t-q}),

and (x_t, y_t) is a strictly stationary process. We assume p and q are fixed and known integers, and m(·) is an unknown function satisfying certain smoothness conditions. Denote

X_{t-1} = (x_{t-1}, ..., x_{t-p}),  Y_{t-1} = (y_{t-1}, ..., y_{t-q}),  Z_{t-1} = (X_{t-1}, Y_{t-1}),

and put g(X_{t-1}) = E[x_t | X_{t-1}]; then the null hypothesis m(Z_{t-1}) = g(X_{t-1}) is equivalent to the event E(u_t | Z_{t-1}) = 0, where u_t = x_t - g(X_{t-1}). Therefore, we can represent the null and alternative hypotheses, respectively, as

H_0: P[E(u_t | Z_{t-1}) = 0] = 1

and

H_1: P[E(u_t | Z_{t-1}) = 0] < 1.

We further rewrite the hypotheses in terms of unconditional moment restrictions. Let s_X = {s(·) | E[s(X_{t-1})^2] < ∞} and s_Z = {s(·) | E[s(Z_{t-1})^2] < ∞} be Hilbert (L^2) spaces. We can decompose s_Z into s_X and s_X^⊥, where s_X^⊥ is the Hilbert space orthogonal to s_X. That is, any function p(z) ∈ s_Z can be represented as p(z) = p_X(x) + p_{X⊥}(z) with p_X(x) ∈ s_X and p_{X⊥}(z) ∈ s_X^⊥. Noting that u_t is orthogonal to s_X by construction, we have

E(u_t | Z_{t-1}) = 0 ⟺ E[u_t p(Z_{t-1})] = 0 for all p(z) ∈ s_X^⊥.

For a technical reason, we slightly modify this null hypothesis to the so-called "density-weighted" version:

E(u_t f(X_{t-1}) | Z_{t-1}) = 0 ⟺ E[u_t f(X_{t-1}) p(Z_{t-1})] = 0 for all p(z) ∈ s_X^⊥.

These two representations are obviously equivalent. Thus, we can rewrite the null and alternative hypotheses as

H_0: E[u_t f(X_{t-1}) p(Z_{t-1})] = 0 for all p(z) ∈ s_X^⊥

and

H_1: E[u_t f(X_{t-1}) p(Z_{t-1})] ≠ 0 for some p(z) ∈ s_X^⊥.
2.2. Test statistics

We first heuristically describe the idea behind the test statistic. Let H(z) = {h_i(z)}_{i=1}^∞ be a complete basis over s_X^⊥ satisfying

Var[u_t f(X_{t-1}) h_i(Z_{t-1})] = 1   (2.1)

and

Cov[u_t f(X_{t-1}) h_i(Z_{t-1}), u_t f(X_{t-1}) h_j(Z_{t-1})] = 0 for all i ≠ j.   (2.2)

Given a sample {(x_t, y_t)}_{t=1}^T, define, for r = max(p, q) + 1 and i = 1, ..., k_T,

a_i = (1/√T) Σ_{t=r}^T u_t f(X_{t-1}) h_i(Z_{t-1}),   (2.3)

for some k_T < T which is determined later. Appealing to a central limit theorem for martingale difference sequences, we have a_i →_d N(0, 1) under the null, while |a_i| explodes to ∞ for some i under the alternative as T → ∞, because u_t has a non-zero conditional mean given Z_{t-1} and {h_i(Z)} spans a basis of s_X^⊥. Therefore we can consider a test combining these quantities as follows. Let {w_i}_{i=1}^∞ be a user-determined summable positive sequence, such as w_i = 0.9^i; then

S_T = Σ_{i=1}^{k_T} w_i a_i^2 →_d Σ_{i=1}^∞ w_i ε_i^2   (2.4)

where the ε_i are i.i.d. N(0, 1) random variables. It is obvious that S_T → ∞ under the alternative because some a_i must explode.

We can construct H(Z) as follows. Given a user-determined basis {q_i(Z)}_{i=1}^∞ of s_Z, [{q_i(Z) - r_i(X)} f(X)]_{i=1}^∞ forms a basis of s_X^⊥, where r_i(X) = E[q_i(Z_{t-1}) | X_{t-1} = X]. Let

Q(Z)' = (q_1(Z) - r_1(X), ..., q_{k_T}(Z) - r_{k_T}(X)) f(X) = (Q_1(Z), ..., Q_{k_T}(Z))'

and M = E[u_t^2 f(X_{t-1})^2 Q(Z_{t-1}) Q(Z_{t-1})']. Then, supposing M is positive definite,

H(Z) = (h_1(Z), ..., h_{k_T}(Z))' = M^{-1/2} Q(Z)   (2.5)

satisfies the required conditions (2.1) and (2.2).

S_T provides a population quantity to test the null hypothesis (1.4); however, (2.4) is infeasible. We propose a feasible plug-in test statistic. The unknown components in (2.3) are u_t, f(X) and h_i(Z), which are replaced by the regression residual and their estimates. We estimate the density by a kernel method:

f̂(X) = (1/(T h^p)) Σ_{t=p+1}^T K((X - X_{t-1})/h),   (2.6)

where K(·) is a kernel function and h is a bandwidth decaying to zero as T → ∞. We obtain the residuals straightforwardly by

û_t = x_t - ĝ(X_{t-1}),   (2.7)

where

ĝ(X) = (1/(T h^p f̂(X))) Σ_{t=p+1}^T K((X - X_{t-1})/h) x_t   (2.8)

is a nonparametric kernel estimate of g(X). Using (2.6) and (2.7), we estimate u_t f(X_{t-1}) by

û_t f̂(X_{t-1}) = x_t f̂(X_{t-1}) - (1/(T h^p)) Σ_{s=p+1}^T K((X_{t-1} - X_{s-1})/h) x_s.   (2.9)

We construct estimates of h_i(Z) by replacing the unknown quantities in (2.5) as follows. Since r_i(X) = E[q_i(Z_{t-1}) | X_{t-1} = X] is unknown, we estimate it by the Nadaraya–Watson kernel estimator

r̂_i(X) = Ê[q_i(Z_{t-1}) | X_{t-1} = X] = (1/(T h^p f̂(X))) Σ_{t=p+1}^T K((X - X_{t-1})/h) q_i(Z_{t-1}),

where f̂(X) is used again. Then we have an estimate of Q(Z):

Q̂(Z)' = (q_1(Z) - r̂_1(X), ..., q_{k_T}(Z) - r̂_{k_T}(X)) f̂(X) = (Q̂_1(Z), ..., Q̂_{k_T}(Z))'.

M is estimated by its feasible sample analogue,

M̂ = (1/T) Σ_{t=r}^T û_t^2 f̂(X_{t-1})^2 Q̂(Z_{t-1}) Q̂(Z_{t-1})'.

Using these estimates, we finally produce {ĥ_i(Z)}_{i=1}^{k_T} by

Ĥ(Z) = (ĥ_1(Z), ..., ĥ_{k_T}(Z))' = M̂^{-1/2} Q̂(Z).   (2.10)

Substituting (2.9) and (2.10) into (2.3), we obtain a sample analogue of (2.4),

Ŝ_T = Σ_{i=1}^{k_T} w_i â_i^2,   (2.11)

where

â_i = (1/√T) Σ_{t=r}^T û_t f̂(X_{t-1}) ĥ_i(Z_{t-1}).

We give several remarks. First, we use the density weight because u_t = x_t - E[x_t | X_{t-1}] and q(Z_{t-1}) - E[q(Z_{t-1}) | X_{t-1}] involve a denominator of f(X_{t-1}). If we replaced it with its estimate in constructing a feasible statistic, the random denominator could cause a numerical problem by taking very small values in practice, resulting in very large values of |Ê[x_t | X_{t-1}]| or |Ê[q(Z_{t-1}) | X_{t-1}]|. We would like to avoid such an unstable feature of the test statistic; many papers in nonparametric and semiparametric statistics treat this problem similarly. Second, the procedure allows for conditional heteroscedasticity in u_t, because H(Z) absorbs it through the construction of (2.1) and (2.2). It is convenient in practice that we need not explicitly estimate the conditional variance V(u_t | Z_{t-1}), which would entail a multistep procedure. Third, we subtract the mean in constructing Q̂(z) in order for (1/√T) Σ_t {m̂(X_{t-1}) - m(X_{t-1})} ĥ_i(Z_{t-1}) to be degenerate for any function m(X) and its kernel estimate m̂(X); this is convenient in proving the asymptotic equivalence of a_i and â_i. Fourth, it is also possible to consider a test statistic with equal weights, k_T^{-1} Σ_{i=1}^{k_T} a_i^2; however, it does not have power against √T-local alternatives. In practice it is not easy to say which performs better in small samples, but the present procedure performs better in large samples; we discuss this in Section 3. Finally, it is also possible to test the same hypothesis by the integrated conditional moment (ICM) tests of Bierens (1984, 1990) and others. These also have power against √T-local alternatives, but unlike ours, it is not clear how much power they have in which direction of departure from the null. Moreover, if one has an idea of the direction in which the alternative may depart from the null, this information can be incorporated into the present test by choosing w_i and q_i(Z) suitably. We discuss this issue in detail in Section 3.

2.3. The null distribution

We give a set of regularity conditions and the null distribution of the test statistic proposed in the previous section.

Definition 1. Let {z_t}, t = 1, ..., T be a strictly stationary time series defined on a probability space (Ω, F, P). Let F_s^t be the σ-algebra generated by {z_s, ..., z_t}. Then we say {z_t}, t = 1, ..., T is absolutely regular when

β(k) = E[ sup_{A ∈ F_k^∞} |P(A | F_{-∞}^0) - P(A)| ] → 0 as k → ∞.
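Before proceeding, the kernel ingredients (2.6)–(2.8) of the plug-in construction in Section 2.2 can be made concrete numerically. The following is a minimal sketch, not the authors' code: it assumes a Gaussian product kernel and p = 1, and the function and variable names (`gauss_kernel`, `density_and_regression`) are ours.

```python
import numpy as np

def gauss_kernel(u):
    # product Gaussian kernel: prod_j (2*pi)^(-1/2) exp(-u_j^2 / 2)
    return np.prod(np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi), axis=-1)

def density_and_regression(x, X_lags, h):
    """Sample analogues of (2.6)-(2.8): kernel density f_hat(X_{t-1}),
    Nadaraya-Watson regression g_hat(X_{t-1}), residuals u_hat."""
    T, p = X_lags.shape
    diffs = (X_lags[:, None, :] - X_lags[None, :, :]) / h  # (T, T, p)
    Kmat = gauss_kernel(diffs)                             # K((X_t - X_s)/h)
    f_hat = Kmat.sum(axis=1) / (T * h ** p)                # (2.6)
    g_hat = Kmat @ x / (T * h ** p * f_hat)                # (2.8)
    u_hat = x - g_hat                                      # (2.7)
    return f_hat, g_hat, u_hat

# toy usage: an AR(1) null with p = 1, bandwidth of the form C * T^(-0.3)
rng = np.random.default_rng(0)
T = 300
x = np.zeros(T + 1)
for t in range(1, T + 1):
    x[t] = 0.65 * x[t - 1] + rng.standard_normal()
f_hat, g_hat, u_hat = density_and_regression(x[1:], x[:-1].reshape(-1, 1),
                                             h=7.0 * T ** -0.3)
```

The full statistic Ŝ_T additionally requires r̂_i, Q̂, M̂^{-1/2} and the weights w_i as in (2.9)–(2.11).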
We call the coefficient β(k) the coefficient of absolute regularity. We note that the φ-mixing condition implies absolute regularity, and absolute regularity implies strong mixing. We assume the following.

Assumption A1. z_t = (x_t, y_t), t = 1, ..., T is a strictly stationary and absolutely regular sequence of stochastic vectors with absolute regularity coefficient β(k) = O(k^{-(2+η)/η}) for some η ∈ (0, 1).

Assumption A2. E{|x_t|^{2+δ} f(X_{t-1})^{1+δ} + x_t^{2(2+δ)} f(X_{t-1})^{4(2+δ)}} < ∞, where δ > η/(1-η).

Assumption A3. Put u = (u_1, ..., u_p) and let K: R^p → R be an Lth order kernel function satisfying K(-u) = K(u),

∫ u_1^{l_1} ··· u_p^{l_p} K(u) du = 1 if l_1 + ··· + l_p = 0; = 0 if 0 < l_1 + ··· + l_p < L; ≠ 0 for some l_1 + ··· + l_p = L,

∫ K(u)^2 du + ∫ ‖u‖^L |K(u)| du < ∞, and ‖u‖^L |K(u)| → 0 as ‖u‖ → ∞. The bandwidth h is a positive constant decaying to zero, satisfying T^{-1} h^{-p} + √T h^L = o(1) as T → ∞.

Assumption A4. {q_i(Z)}_{i=1}^∞ forms a basis for s_Z, satisfying sup_Z |q_i(Z)| ≤ C_q < ∞ for all i and sup_Z |Σ_{i=1}^{k_T} {q_i(Z) - r_i(X)}^2| < ζ_0^2(k_T) for some sequence ζ_0^2(k_T) → ∞ as T → ∞.

Assumption A5. (k_T^2 / T){ζ_0^2(k_T)^2 + k_T^2} → 0 as T → ∞.

Assumption A6. For X = (x_1, ..., x_p), suppose f(X), g(X) f(X), r_i(X) f(X),

m_{1ij}(X) = E[u_t^2 Q_i(Z_{t-1}) Q_j(Z_{t-1}) | X_{t-1} = X] f(X)^2

and

m_{2j}(X) = E[u_t^2 {q_j(Z_{t-1}) - r_j(X)} | X_{t-1} = X] f(X)^3

are L times differentiable. Furthermore, writing

h^{(l_1,...,l_p)}(X) = ∂^{l_1+···+l_p} h(X) / ∂ξ_1^{l_1} ··· ∂ξ_p^{l_p},
(hk)^{(l_1,...,l_p)}(X) = ∂^{l_1+···+l_p} h(X)k(X) / ∂ξ_1^{l_1} ··· ∂ξ_p^{l_p},

for generic functions h(X) and k(X), assume

E{(g(X_{t-1})^2 + u_t^2 + u_t^2 f(X_{t-1})) f^{(l_1,...,l_p)}(X_{t-1})^2} < ∞,
E[{1 + u_t^2 f(X_{t-1})^4} f(X_{t-1})^2 (gf)^{(l_1,...,l_p)}(X_{t-1})^2] < ∞,
E{(f(X_{t-1})^2 + u_t^2)(r_i f)^{(l_1,...,l_p)}(X_{t-1})^2} < ∞,
E{m_{1ij}^{(l_1,...,l_p)}(X_{t-1})^2 + m_{2j}^{(l_1,...,l_p)}(X_{t-1})^2} < ∞,

for all i, j and all l_1, ..., l_p satisfying 0 ≤ l_1, ..., l_p ≤ L, l_1 + ··· + l_p = L.

Assumption A7. The eigenvalues of M are bounded and bounded away from zero uniformly in k_T.

A1 is a standard assumption in nonparametric time series analysis, but we believe it can be relaxed to a strong mixing condition in view of recent results such as Gao and King (2004) and Hansen (2008). A2 is used in proving that the second order terms of the U-statistics appearing in the proof are asymptotically negligible. A3 is standard in the asymptotic theory of nonparametric regression. In A4, ζ_0^2(k_T) = k_T or k_T^2 typically, depending on the choice of {q_i(Z)}, as in Newey (1997). We believe A4 and A5 are stronger than necessary, but they make the proof of the theorem much easier; similar assumptions were made in Newey (1997) in obtaining the asymptotic theory of series regression estimators. A6 guarantees that the projection terms of the U-statistics in the proof are asymptotically negligible. Newey (1997) also makes an assumption similar to A7.

Theorem 1. If Assumptions A1–A7 hold, then

Ŝ_T = Σ_{i=1}^{k_T} w_i â_i^2 →_d Σ_{i=1}^∞ w_i ε_i^2 as T → ∞

under H_0, where the ε_i are i.i.d. N(0, 1).
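The limit law Σ_i w_i ε_i^2 in Theorem 1 is easy to simulate, which is how critical values are obtained in Section 6 (where the choice w_i = 0.9^i yields an estimated upper 5% critical value of 14.38). A sketch under the assumption that truncating the infinite sum at a point where the geometric weights are negligible is acceptable; the function name is ours.

```python
import numpy as np

def null_critical_value(w, level=0.05, reps=50_000, seed=1):
    """Upper-`level` quantile of sum_i w_i * eps_i^2, eps_i i.i.d. N(0,1):
    the null limit distribution of Theorem 1, truncated at len(w) terms."""
    rng = np.random.default_rng(seed)
    draws = (rng.standard_normal((reps, len(w))) ** 2) @ w
    return float(np.quantile(draws, 1.0 - level))

# geometric weights w_i = 0.9^i as in the Monte Carlo section;
# 100 terms make the truncation error negligible at this scale
w = 0.9 ** np.arange(1, 101)
cv = null_critical_value(w)   # close to the 14.38 reported in Section 6
```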
3. Power of the test under non-local and √T-local alternatives

The previous section proposed a test statistic for causality in mean and derived its null distribution. We now discuss its power properties. First, under non-local alternatives, the DGP can be written as

x_t = g(X_{t-1}) + κ(Z_{t-1}) + v_t,

for some non-constant function κ(Z_{t-1}), where v_t = x_t - E(x_t | Z_{t-1}). Because

u_t = x_t - E(x_t | X_{t-1}) = x_t - g(X_{t-1}) = κ(Z_{t-1}) + v_t,

we have

E{u_t f(X_{t-1}) h_i(Z_{t-1})} = E[{κ(Z_{t-1}) + v_t} f(X_{t-1}) h_i(Z_{t-1})] = E[κ(Z_{t-1}) f(X_{t-1}) h_i(Z_{t-1})].

This value is non-zero for at least some i: if E[κ(Z_{t-1}) f(X_{t-1}) h_i(Z_{t-1})] were zero for all i, κ(Z) would have to be identically zero, since {h_i(Z)} is a complete basis. This is the source of the power of the test, in the sense that

(1/√T) a_i →_p E[κ(Z_{t-1}) f(X_{t-1}) h_i(Z_{t-1})] = κ_i,   (3.1)

which results in

Ŝ_T ≈ T Σ_{i=1}^{k_T} w_i κ_i^2,   (3.2)

and at least one κ_i is non-zero.

To clarify the power structure of the test, suppose we set the w_i in decreasing order without loss of generality, and suppose further, for simplicity, that only one of the κ's is non-zero, say equal to 1. If κ_1 = 1 then Ŝ_T ≈ T w_1, while Ŝ_T ≈ T w_{k_T} if κ_{k_T} = 1. Obviously the test has larger power in the former case, because we reject the null when Ŝ_T is large. Therefore, if the direction of departure from the null is closer to the basis functions with smaller indices i, the test is more likely to reject the null; that is, Ŝ_T possesses more power for discrepancies from the null toward h_i(z) than toward h_j(z) when w_i > w_j. Because there is no intrinsic ordering of the basis functions, we can in fact interchange them. Therefore, if we have prior knowledge of the direction of the alternatives, we can use this information to order the â_i so as to induce better power properties. This is, we believe, a great advantage of this test over ICM tests, which cannot control the directions in which they have power. To illustrate, suppose we know that the alternative is likely to be the null regression function plus sin(Z_{t-1}); then, by choosing h_1(Z_{t-1}) = sin(Z_{t-1}) - E[sin(Z_{t-1}) | X_{t-1}], we expect large power from the test.

The following theorem states that Ŝ_T has nontrivial power against √T-local alternatives.

Theorem 2. Let κ: R^{p+q} → R be a measurable function in the space s_X^⊥. Suppose the following local alternative is correct:

H_la: E(u_t | Z_{t-1}) = (1/√T) κ(Z_{t-1}).

Then, if Assumptions A1–A7 hold,

Ŝ_T →_d Σ_{i=1}^∞ w_i (ε_i + τ_i)^2

under H_la, where τ_i = E[κ(Z_t) f(X_t) h_i(Z_t)].
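The role of the weight ordering discussed above can be seen directly in the limit experiment of Theorem 2: the same drift placed on a heavily weighted early direction is rejected far more often than on a lightly weighted later one. A simulation sketch of that limit experiment (our function name; the weights w_i = 0.9^i and the critical value 14.38 are those reported in Section 6):

```python
import numpy as np

def limit_power(w, tau, cv, reps=40_000, seed=2):
    """P( sum_i w_i (eps_i + tau_i)^2 > cv ): the rejection probability
    of the limit experiment in Theorem 2 under drift tau."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((reps, len(w)))
    return float(np.mean(((eps + tau) ** 2) @ w > cv))

w = 0.9 ** np.arange(1, 101)
cv = 14.38   # simulated upper 5% critical value for these weights (Section 6)

tau_early = np.zeros(100); tau_early[0] = 2.0   # drift tau_1 = 2 toward h_1
tau_late = np.zeros(100); tau_late[19] = 2.0    # same drift toward h_20
p_early = limit_power(w, tau_early, cv)
p_late = limit_power(w, tau_late, cv)
# p_early exceeds p_late: the test is most sensitive to departures
# in the heavily weighted directions
```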
We omit the proof because it is obvious, noting that a_i →_d N(τ_i, 1) under H_la.

We now show, by providing a counterexample, that a flat weight, w_i = 1/√k_T for all i, for example, is not permitted if the test is to have nontrivial power against √T-local alternatives. In order for the statistic to be O_p(1) under the null, we need to modify it to

S'_T = (1/√k_T) Σ_{i=1}^{k_T} a_i^2 - √k_T →_d N(0, 2) under H_0.

Suppose a √T-local alternative of κ(Z) = h_1(Z) f(X), namely

H_la: x_t = g(X_{t-1}) + h_1(Z_{t-1}) f(X_{t-1}) / √T + v_t

with, for simplicity, conditional homoscedasticity E(u_t^2 | Z_{t-1}) = const, is correct. Then a_i →_d N(τ_i, 1), and thus E(a_i^2) ≈ 1 + τ_i^2, where τ_1 ≠ 0 and τ_2 = τ_3 = ··· = 0. In this case, we have

S'_T = (1/√k_T) Σ_{i=1}^{k_T} E(a_i^2) - √k_T + (1/√k_T) Σ_{i=1}^{k_T} {a_i^2 - E(a_i^2)}
     = (1/√k_T) Σ_{i=1}^{k_T} τ_i^2 + N(0, 2)
     = τ_1^2 / √k_T + N(0, 2).

Then it obviously does not have power against this simple √T-local alternative, because the first term in the last expression disappears asymptotically. Therefore, a flat weight does not in general provide a test with power against √T-local alternatives.

4. Nonparametric test for causality up to the Kth moment

In view of recent empirical studies in financial econometrics, dependence in second, third or fourth moments appears to be an issue of interest. We can provide a tool to examine whether a causal relationship exists in such a higher order sense using the test proposed above. The main idea is the same as in the previous sections, but we need to be careful in interpreting the results, as explained below. Here we again restrict ourselves to the case where the series of interest follows a stationary nonlinear AR process under the null. To illustrate the motivation for such higher-order causality, consider, for example, the following nonlinear dependence between the series:

x_t = g(x_{t-1}) + σ(y_{t-1}) ε_t,   (4.1)

where {y_t} is a stationary time series, and g(·) and σ(·) are unknown functions satisfying certain conditions for stationarity. Then y_{t-1} obviously has no information for predicting x_t, in the sense that E(x_t | x_{t-1}, y_{t-1}) = E(x_t | x_{t-1}), but it has information for predicting x_t^2. This type of modeling is becoming increasingly popular in analyzing, for instance, financial data. In general, we may wish to know whether y_{t-1} is useful in predicting x_t^K for a given positive integer K. It looks possible to define "causality in the Kth moment" similarly to causality in mean; that is, we say "y_t causes x_t in the Kth moment" if

E[x_t^K - E(x_t^K | x_{t-1}, ..., x_1)]^2 > E[x_t^K - E(x_t^K | x_{t-1}, ..., x_1, y_{t-1}, ..., y_1)]^2.   (4.2)

Then, by a manipulation similar to the derivation of (1.4), the null hypothesis of non-causality in the Kth moment corresponding to the alternative (4.2) can be written as

H_0: E(x_t^K | x_{t-1}, ..., x_1, y_{t-1}, ..., y_1) = E(x_t^K | x_{t-1}, ..., x_1) w.p.1.   (4.3)

We should point out, however, that we need to be careful in understanding this definition, because we will almost always conclude that there exists causality in the Kth moment if there exists causality in the Mth moment (M < K). We illustrate this for K = 2 and M = 1. Suppose the DGP is

x_t = g(x_{t-1}, y_{t-1}) + ε_t,   (4.4)

where ε_t is independent of x_{t-1}, y_{t-1}. Obviously y_t causes x_t in mean. In terms of the present definition of causality in the Kth moment, we conclude that y_t causes x_t in the second moment, because

E(x_t^2 | x_{t-1}, y_{t-1}) = E({g(x_{t-1}, y_{t-1}) + ε_t}^2 | x_{t-1}, y_{t-1}) = g(x_{t-1}, y_{t-1})^2 + E(ε_t^2) ≠ E(x_t^2 | x_{t-1})

in general. One way of understanding this is that (4.2) states that y_t causes x_t up to the Kth moment rather than in the Kth moment. Formally, we define causality up to the Kth moment in the following way. We say y_t does not cause x_t up to the Kth moment if

E(x_t^k | x_{t-1}, ..., x_1, y_{t-1}, ..., y_1) = E(x_t^k | x_{t-1}, ..., x_1) w.p.1 for all k = 1, ..., K.   (4.5)

With this definition, there exists causality up to the second moment in both (4.1) and (4.4). When K = 1, this reduces to non-causality in mean. We note that it is also possible to define higher order causality in other ways (see, e.g., Comte and Lieberman (2000) and references therein), where one checks whether the conditional variance of x_t, given lagged x and y, depends on lagged y in defining causality in variance.

In view of the above definition, it is easy to construct a test statistic for each k. We simply replace x_t in (2.7)–(2.9) by x_t^k and construct statistics â_i^{(k)}, and then Ŝ_T^{(k)}, for each k = 1, ..., K. Specifically,

Ŝ_T^{(k)} = Σ_{i=1}^{k_T} w_i {â_i^{(k)}}^2,

where

â_i^{(k)} = (1/√T) Σ_{t=r}^T û_t^{(k)} f̂(X_{t-1}) ĥ_i^{(k)}(Z_{t-1}),

for

û_t^{(k)} f̂(X_{t-1}) = x_t^k f̂(X_{t-1}) - (1/(T h^p)) Σ_{s=p+1}^T K((X_{t-1} - X_{s-1})/h) x_s^k,

M̂^{(k)} = (1/T) Σ_{t=p+1}^T {û_t^{(k)}}^2 f̂(X_{t-1})^2 Q̂(Z_{t-1}) Q̂(Z_{t-1})',

Ĥ^{(k)}(Z) = (ĥ_1^{(k)}(Z), ..., ĥ_{k_T}^{(k)}(Z))' = {M̂^{(k)}}^{-1/2} Q̂(Z).

It is, however, not easy to combine these statistics into one statistic, because they are mutually correlated. One way of testing the joint null (4.5) is to estimate the correlation, say by bootstrap or other simulation methods, and obtain the critical value for a suitably combined statistic; this would be computationally expensive. For practical use, we recommend testing causality successively, especially when K is small. We implement the test for k = 1 first; if the null of non-causality in mean is rejected, we stop there, as we already know that there is causality in the first moment. If it is not rejected, we go ahead to causality in the second moment, k = 2. We continue this procedure until the null is rejected at some stage or k reaches K. Obviously there is a statistical problem in that we do not know the overall size of this successive method, but at least when K is small, say 2 or 3, we
may not need to be too careful about it in practice. We also point out that if u_t is symmetric conditionally on Z_{t-1}, we expect Ŝ_T^{(1)} and Ŝ_T^{(2)} to be asymptotically independent, because â_i^{(1)} and â_j^{(2)} are asymptotically uncorrelated for all i and j.

For the asymptotic theory of Ŝ_T^{(k)}, we strengthen the moment condition to the following.

Assumption A2′. E{|x_t|^{k(2+δ)} f(X_{t-1})^{1+δ} + x_t^{2k(2+δ)} f(X_{t-1})^{4(2+δ)}} < ∞, where δ > η/(1-η).

Theorem 3. Under Assumptions A1, A2′, A3–A7,

Ŝ_T^{(k)} = Σ_{i=1}^{k_T} w_i {â_i^{(k)}}^2 →_d Σ_{i=1}^∞ w_i ε_i^2

for k = 1, ..., K.

The proof is omitted because it is straightforward.

5. Special cases — representation of other omitted variable tests

A typical alternative test is the integrated conditional moment (ICM) type test of Bierens (1984, 1990), Bierens and Ploberger (1997), Chen and Fan (1999) and others. This type of test is constructed as follows. Letting

z_T(ξ) = (1/√T) Σ_{t=1}^T û_t exp(ξ' X_{t-1}),

we know that under the null

T_ICM = ∫ z_T(ξ)^2 dµ →_d Σ_{i=1}^∞ τ_i ε_i^2,

and it explodes under the alternatives. Here, {τ_i} are the eigenvalues of the function Γ(ξ_1, ξ_2) = E[exp(ξ_1' X_t) exp(ξ_2' X_t)], and the ε_i are i.i.d. Gaussian random variables. This test also has nontrivial power against √T-local alternatives, like the proposed test. However, it is not clear against which directions of alternatives T_ICM has more or less power. We believe our test is more convenient than the others, because it is much clearer to users how much power the test has in each specific direction of the alternatives. More specifically, in the present notation, supposing w_1 > w_2 > ···, the test (2.11) has the largest power against the alternative x_t = g(X_{t-1}) + h_1(Z_{t-1}) + u_t, and the smallest power against the alternative x_t = g(X_{t-1}) + h_{k_T}(Z_{t-1}) + u_t.

The test statistic proposed above may look quite primitive; however, it is in fact a quite general class, in the sense that it includes important omitted variable test statistics as special cases. We first show that the ICM test is a special case. If we set

w_i = τ_i,  h_i(Z) = τ_i^{-1/2} f(X)^{-1} ∫ exp(ξ' Z) ψ_i(ξ) dξ,

where ψ_i(ξ) is the eigenfunction of Γ(ξ_1, ξ_2) corresponding to τ_i, then we can show that

S_T = Σ_{i=1}^{k_T} w_i a_i^2 = T_ICM + o_p(1).

Therefore, T_ICM is asymptotically a special case of the present test for {w_i} and {h_i(Z)} chosen as above. Second, the test of omitted variables in nonparametric regression by Hong and White (1995), T_HW, can be written as

T_HW = Σ_{i=1}^{k_T} w_i a_i^2 - Σ_{i=1}^{k_T} w_i + o_p(1) = S_T - √k_T / σ^2 + o_p(1)

with

w_i = 1/(σ^2 √k_T),  h(Z) = (h_1(Z), ..., h_{k_T}(Z))' = E{Q(Z_t) Q(Z_t)'}^{-1} f(X)^{-1} Q(Z),

and σ^2 = E(u_t^2 | X_{t-1}). The omitted variable test by Fan and Li (1996) also has essentially the same structure as T_HW, and thus can be represented similarly. The test by Von Neumann (1941) is also written as

T_VN = Σ_{i=1}^T w_i a_i^2 - Σ_{i=1}^T w_i + o_p(1),

where

w_i = 1/(σ^2 √T),  h(Z) = (h_1(Z), ..., h_{k_T}(Z))' = E{Q(Z_t) Q(Z_t)'}^{-1} f(X)^{-1} Q(Z).

We can also show that the omitted variable tests by Stute (1997) and Whang (2000) can be represented in terms of S_T as above. We also note that the T_VN and T_HW tests do not have power against √T-local alternatives because of their flat weights, as discussed in Section 3. We may point out that there is an advantage in rewriting these special-case test statistics in the weighted squared sum form, in terms of w_i and h_i(Z), in order to see in which direction(s) of alternatives these tests have power. In some cases, we can show that ICM tests have power only against a very limited direction of alternatives, because only the first (largest) weight w_1 is extremely large, while the others, w_2, w_3, ..., are relatively very small (see Hitomi (2000)). This structure of the ICM and other tests cannot be changed, whereas the proposed test can change it directly.
6. A Monte Carlo study

In this section we report the results of a Monte Carlo study investigating the performance of the proposed test statistic. Throughout this section, x_t is the time series of primary interest, while we want to know whether or not another time series y_t (in lags) accounts for the variation of x_t. Let {η_t} and {ε_t} be the innovation processes of x_t and y_t respectively; we assume both are identically and independently standard normally distributed.

6.1. Testing causality in mean

We start with the four simulation settings below, which can be considered time series analogues of the experiments carried out in Okui and Hitomi (2002). The results in this section were generated using Ox version 6.1 (see Doornik, 2007).

DGP 0: x_t = 0.65 x_{t-1} + η_t,                                  y_t = -0.3 y_{t-1} + ε_t
DGP 1: x_t = 0.65 x_{t-1} + 0.2 y_{t-1} + η_t,                    y_t = -0.3 y_{t-1} + ε_t
DGP 2: x_t = 0.65 x_{t-1} + 0.2 y_{t-1} + 0.4 sin(-2 y_{t-1}) + η_t,  y_t = -0.3 y_{t-1} + ε_t
DGP 3: x_t = 0.65 x_{t-1} + 0.2 y_{t-1}^2 + η_t,                  y_t = -0.3 y_{t-1} + ε_t

Because there is no causal relationship between x_t and y_t in DGP 0, the null hypothesis of non-Granger causality should be maintained; this experiment illuminates the size properties of our test. DGP 1 covers the case of linear vector autoregressive models. In DGPs 2 and 3, the target time series x_t depends on the lagged covariate time series y_{t-1} in a nonlinear fashion. In other words, we observe x_t and y_t and want to know whether y_t causes x_t in Granger's sense, but we do not know in what functional form x_t depends on y_t.
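The four designs above are straightforward to reproduce. A minimal sketch (the function name is ours), assuming, as stated above, i.i.d. standard normal innovations:

```python
import numpy as np

def simulate_dgp(dgp, T, seed=0):
    """Generate (x_t, y_t), t = 1..T, from DGP 0-3 of Section 6.1."""
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal(T + 1)   # innovations of x
    eps = rng.standard_normal(T + 1)   # innovations of y
    x = np.zeros(T + 1)
    y = np.zeros(T + 1)
    for t in range(1, T + 1):
        if dgp == 0:
            cause = 0.0
        elif dgp == 1:
            cause = 0.2 * y[t - 1]
        elif dgp == 2:
            cause = 0.2 * y[t - 1] + 0.4 * np.sin(-2.0 * y[t - 1])
        else:  # dgp == 3
            cause = 0.2 * y[t - 1] ** 2
        x[t] = 0.65 * x[t - 1] + cause + eta[t]
        y[t] = -0.3 * y[t - 1] + eps[t]
    return x[1:], y[1:]

x0, y0 = simulate_dgp(0, T=500)   # no causality: size experiment
x3, y3 = simulate_dgp(3, T=500)   # causality through y^2 only, so the
                                  # linear cross-covariances with y vanish
```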
Table 1
Empirical size and power of the proposed test (Ŝ_T^(1)) and Hidalgo's nonparametric causality test (HNC).

        DGP 0            DGP 1            DGP 2            DGP 3
T       Ŝ_T^(1)  HNC     Ŝ_T^(1)  HNC     Ŝ_T^(1)  HNC     Ŝ_T^(1)  HNC
100     0.049    0.104   0.092    0.427   0.115    0.221   0.109    0.118
200     0.047    0.108   0.236    0.675   0.335    0.343   0.286    0.135
300     0.049    0.129   0.455    0.859   0.614    0.466   0.565    0.150
400     0.053    0.127   0.614    0.913   0.778    0.539   0.781    0.149
500     0.054    0.130   0.760    0.964   0.911    0.627   0.899    0.158
1000    0.049    0.139   0.989    0.999   1.000    0.871   0.998    0.188
The specification of our test is as follows. We used the normal kernel K(u) = (2π)^{−1/2} exp(−u²/2) for the nonparametric regression. Our bandwidth choice is C × T^{−0.3}, where T is the length of the time series; C is about 7 except in the T = 100 case. For the basis {q_i(Z)} we choose the eight functions (actually T × 8 numbers) sin(y_{t−1}), cos(y_{t−1}), sin(y_{t−1})sin(x_{t−1}), sin(y_{t−1})cos(x_{t−1}), cos(y_{t−1})sin(x_{t−1}), cos(y_{t−1})cos(x_{t−1}), sin(2y_{t−1}) and cos(2y_{t−1}), so the number k_T is set to 8 in all the experiments. The weight function is chosen as w_i = 0.9^i to construct the test statistic Ŝ_T^(1). The critical value of the asymptotic distribution is calculated by a Monte Carlo simulation with this choice of w_i; the upper 5% critical value is estimated as 14.38.

As a competitor to our test, we employ the nonparametric Granger causality test of Hidalgo (2000). The basic idea of Hidalgo's test is to use a nonparametric estimate of the cross-spectrum, by which linear causality from one time series to another is determined. It should be noted that the main concern of Hidalgo (2000) is to allow long memory time series in causality testing. He also showed that his test has power against √T-local alternatives, which will be confirmed in our simulation study, too.

The results are summarized in Table 1. In each experiment the number of iterations is fixed at 1000. Each entry shows the empirical rejection rate of the null hypothesis of non-Granger causality; hence the two columns under DGP 0 show empirical size, and the others show empirical power. Our test for causality in mean is denoted Ŝ_T^(1), following the notation in Section 4, while Hidalgo's nonparametric Granger causality test is abbreviated HNC. As is clearly seen, the size of Ŝ_T^(1) is quite decent, while the HNC test seems to over-reject the null in this simulation. Hidalgo's test depends on how many lagged cross-covariance terms are used to construct the test statistic, the choice of M in his notation.
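The reported critical value can be reproduced, approximately, by simulating the truncated limit law Σ_{i≥1} w_i ε_i² with w_i = 0.9^i. The truncation point and number of draws below are our own choices, not from the original study.

```python
import numpy as np

# Approximate the upper 5% point of the limit distribution
# sum_{i>=1} w_i * eps_i**2, eps_i iid N(0,1), with w_i = 0.9**i.
# The infinite sum is truncated at m terms; 0.9**200 is negligible.
rng = np.random.default_rng(12345)
m, draws = 200, 50_000
w = 0.9 ** np.arange(1, m + 1)
s = (w * rng.standard_normal((draws, m)) ** 2).sum(axis=1)
crit = np.quantile(s, 0.95)
print(round(crit, 2))  # close to the 14.38 reported in the text
```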
As he points out, it is possible to use a model selection procedure to choose a plausible M; our choice of M here is the largest integer not exceeding T^{1/4}, the slowest rate within the admissible range. Because we basically want to compare power, not size, we did not try to tune M to minimize the size distortion of HNC. What is remarkable is the power of Ŝ_T^(1) in the DGP 2 and DGP 3 cases, especially when the sample sizes are fairly large. As expected, it performs much better than Hidalgo's test in general when there exists nonlinear causality. However, it is worth mentioning that Hidalgo's test shows high power in the DGP 1 experiment even in the small sample case (T = 100). This is because DGP 1 is exactly the situation Hidalgo's test is designed for, namely linear causality, and we will revisit this issue in the experiments with √T-local alternatives.
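To fix ideas, the first-stage statistic can be assembled from the ingredients just listed (normal kernel, bandwidth CT^{−0.3}, the eight trigonometric basis functions, weights 0.9^i). The code below is our simplified reading of the construction — the density weighting, the centering of the basis functions on X_{t−1}, and the standardization matrix are implemented crudely — so it should be taken as an illustration, not the authors' code.

```python
import numpy as np

def s_hat(x, y, C=7.0, decay=0.9):
    """Sketch of the first-stage statistic: density-weighted kernel
    residuals of x_t on x_{t-1}, projected on eight trigonometric
    basis functions of (y_{t-1}, x_{t-1}) and standardized.
    Illustrative only; compare the result with the 14.38 cutoff."""
    T = len(x)
    xt, xlag, ylag = x[1:], x[:-1], y[:-1]
    n = len(xt)
    h = C * T ** (-0.3)                           # bandwidth from the text
    K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    Kmat = K((xlag[:, None] - xlag[None, :]) / h) / h
    f_hat = Kmat.mean(axis=1)                     # kernel density estimate
    gf_hat = (Kmat * xt[None, :]).mean(axis=1)    # estimate of (g*f)(X_{t-1})
    ufhat = xt * f_hat - gf_hat                   # u_t f(X_{t-1}), estimated
    # the eight basis functions q_i(Z_{t-1}) listed in the text
    Q = np.column_stack([
        np.sin(ylag), np.cos(ylag),
        np.sin(ylag) * np.sin(xlag), np.sin(ylag) * np.cos(xlag),
        np.cos(ylag) * np.sin(xlag), np.cos(ylag) * np.cos(xlag),
        np.sin(2 * ylag), np.cos(2 * ylag)])
    # crude centering on X_{t-1} (subtract a kernel regression on x_{t-1})
    r_hat = (Kmat @ Q) / Kmat.sum(axis=1, keepdims=True)
    Qt = (Q - r_hat) * f_hat[:, None]
    # standardize so the projections are roughly N(0,1) under the null
    Mhat = (Qt * (ufhat ** 2)[:, None]).T @ Qt / n
    lam, E = np.linalg.eigh(Mhat)
    lam = np.maximum(lam, 1e-12)                  # numerical guard
    a = (E @ np.diag(lam ** -0.5) @ E.T) @ (Qt.T @ ufhat) / np.sqrt(n)
    w = decay ** np.arange(1, Q.shape[1] + 1)     # weights w_i = 0.9**i
    return float(w @ a ** 2)
```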
6.2. Power under √T-local alternatives
Whether a statistical test has power against √T-local alternatives is a theoretical issue more than a practical one. But in order to confirm the theory, we carry out a set of simulations under local alternatives. For example, in conjunction
Table 2
Empirical power of the proposed test (Ŝ_T^(1)) and Hidalgo's nonparametric causality test (HNC): case of local alternatives.

T      DGP 1L           DGP 2L           DGP 3L
       Ŝ_T^(1)  HNC     Ŝ_T^(1)  HNC     Ŝ_T^(1)  HNC
100    0.092    0.427   0.115    0.221   0.109    0.118
200    0.133    0.457   0.166    0.238   0.156    0.128
300    0.164    0.500   0.228    0.267   0.224    0.137
400    0.151    0.498   0.213    0.265   0.235    0.138
500    0.166    0.529   0.213    0.289   0.224    0.144
1000   0.157    0.569   0.227    0.326   0.279    0.162
with the DGP 2 setting, we put the model in the alternative hypothesis as
$$x_t = 0.65x_{t-1} + \frac{2}{\sqrt{T}}y_{t-1} + \frac{4}{\sqrt{T}}\sin(-2y_{t-1}) + \eta_t,$$
while the DGP of y_t is unchanged. We refer to this specification as DGP 2L. DGP 1L and DGP 3L are defined by dividing the deviation from the null by √T in DGP 1 and DGP 3, though we do not display the equations explicitly. Note that choosing T = 100 coincides with the specification in the previous simulations for non-local alternatives, so the figures in the first row of Table 2 are, naturally, more or less the same as those in the first row of Table 1. When T grows from 100 to 1000, the coefficient of y_{t−1} shrinks from 0.2 to about 0.06.

As Table 2 shows, Ŝ_T^(1) does have power against √T-local alternatives. It is striking that Ŝ_T^(1) exhibits quite a good performance in the DGP 3L case, while the empirical rejection rate of the HNC test stays as low as its empirical size, which suggests that the HNC test has virtually no power for DGP 3L. However, the HNC test is superior to our test Ŝ_T^(1) in DGP 1L and DGP 2L. DGP 1L is a case of a linear vector autoregressive model, and the HNC test is specifically designed for such a linear case. DGP 2L even includes a linear causal variable (y_{t−1}) on the right-hand side of the DGP of x_t. Ŝ_T^(1) captures the causal effect of the linear term by the kernel method, which may be a roundabout way.

6.3. Testing causality up to the second moment

To demonstrate the usefulness of the methodology proposed in Section 4, we consider the following nonlinear time series model as DGP 4.

DGP 4: y_t = σ_t η_t, σ_t² = ω + α y_{t−1}² + β σ_{t−1}², η_t ∼ NID(0, 1),
x_t = 0.65x_{t−1} + √(1 + y_{t−1}²) ε_t, ε_t ∼ NID(0, 1),

where ω = 0.05, α = 0.1 and β = 0.8. In other words, y_t is generated by a GARCH(1,1) model and enters the variance of the innovation of x_t. So x_t does not depend on y_{t−1} in mean, but there exists causality in the second moment. Therefore, at least theoretically, neither Ŝ_T^(1) (our test in mean) nor Hidalgo's nonparametric Granger causality test will be able to detect this type of dependency, but our test Ŝ_T^(2), based on the kernel regression of x_t², is expected to detect the causality.
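DGP 4 can be simulated directly, and the moment structure it is designed to exhibit — no causality in mean, causality in the second moment — is easy to verify on simulated data. The sketch below is our own illustration; the parameter values follow the text.

```python
import numpy as np

def simulate_dgp4(T, burn=500, rng=None,
                  omega=0.05, alpha=0.1, beta=0.8):
    """DGP 4: y_t is GARCH(1,1); its lag enters only the *variance*
    of x_t's innovation, so y causes x in the second moment but not
    in mean.  Sketch of the design described in the text."""
    rng = np.random.default_rng(rng)
    n = T + burn
    eta, eps = rng.standard_normal(n), rng.standard_normal(n)
    y, x = np.zeros(n), np.zeros(n)
    sig2 = omega / (1 - alpha - beta)   # start at unconditional variance
    for t in range(1, n):
        sig2 = omega + alpha * y[t - 1] ** 2 + beta * sig2
        y[t] = np.sqrt(sig2) * eta[t]
        x[t] = 0.65 * x[t - 1] + np.sqrt(1 + y[t - 1] ** 2) * eps[t]
    return x[burn:], y[burn:]

x, y = simulate_dgp4(20_000, rng=0)
# no causality in mean, but causality in the second moment:
print(np.corrcoef(x[1:], y[:-1])[0, 1])            # near zero
print(np.corrcoef(x[1:] ** 2, y[:-1] ** 2)[0, 1])  # clearly positive
```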
Table 3
Empirical size and power of successive tests when causality in the second moment exists (DGP 4) and when causality both in mean and in variance exists (DGP 5).

T      DGP 4                      DGP 5
       Ŝ_T^(1)  Ŝ_T^(2)  HNC     Ŝ_T^(1)  Ŝ_T^(2)  HNC
100    0.075    0.011    0.166   0.178    0.049    0.435
200    0.046    0.075    0.185   0.271    0.268    0.646
300    0.059    0.168    0.208   0.388    0.497    0.797
400    0.066    0.268    0.211   0.466    0.672    0.867
500    0.080    0.375    0.217   0.605    0.820    0.918
1000   0.072    0.766    0.257   0.920    1.000    0.995
The results for DGP 4 are summarized in the left three columns of Table 3. As for HNC, the figures are hard to interpret; they may reflect further size distortion or weak power. In contrast, the numbers in the Ŝ_T^(1) column are all close to the nominal size (5%), and the power in the Ŝ_T^(2) column grows as the sample size T gets large. Here we note that the figures in the Ŝ_T^(2) column are empirical rejection frequencies based on the number of cases not rejected by Ŝ_T^(1). For example, in the T = 1000 case, Ŝ_T^(1) rejected the null of non-Granger causality in mean 72 times out of 1000. For the accepted 928 cases we perform the test by Ŝ_T^(2), which leads to 711 rejections of non-causality in the second moment; the 'power' is thus 0.766.

As a final case, we consider a situation where x_t depends on y_{t−1} in the second moment as well as in mean.

DGP 5: x_t = 0.65x_{t−1} + 0.2y_{t−1} + √(1 + y_{t−1}²) η_t, y_t = −0.3y_{t−1} + ε_t.

This is a challenging situation for the test we propose. We expect Ŝ_T^(1) to detect causality from the 0.2y_{t−1} term, but then the procedure stops and we do not proceed to the second stage with Ŝ_T^(2). Hence, just as in DGP 4, we apply Ŝ_T^(2) when Ŝ_T^(1) fails to detect the causality in mean. The right three columns of Table 3 show that Ŝ_T^(1) suffers at small sample sizes but finally has high power at T = 1000. When the sample size is large, Ŝ_T^(2) manages to detect the causality in the second moment even if Ŝ_T^(1) failed to detect the causality in mean. As for HNC, it again shows considerably good performance because the true DGP includes linear causality from y_{t−1}. Though the overall size control could be difficult, combining HNC with Ŝ_T^(2) might help in some practical situations.

6.4. Application to price-volume causality

As an application to empirical data, we investigate the causality between stock price and traded volume. Price-volume charting has been very popular in technical trading, and it is often said that a change in volume precedes the price movement: typically, a rise in price follows an increase in volume, while a fall in price is often observed after a shrinkage in traded volume. This is sometimes called the 'counter-clockwise' relationship between volume and price. We use weekly price and volume data for Nikkei 225 futures. The sample period starts in the first week of September 1988 and ends in the fourth week of October 2007, giving a sample size of T = 1000. To ensure stationarity, both price and volume are transformed into percentage changes; we call these variables dp_t and dv_t respectively. Our primary concern is whether there is any causality from the past volume change to the present price change. Hence we give dv_{t−1} the role of y_{t−1} and perform the nonparametric causality test in mean using Ŝ_T^(1). The result is Ŝ_T^(1) = 11.54 < 14.38,
which leads to accepting the null of non-causality in mean. We then proceed to the second stage and regress dp_t² nonparametrically on dv_{t−1}. The resulting test statistic is Ŝ_T^(2) = 20.81 > 14.38, which implies rejection of non-causality in the second moment. As a check, we examine the reverse causal relationship, exchanging the roles of dp_t and dv_t. The results are Ŝ_T^(1) = 9.95 < 14.38 and Ŝ_T^(2) = 8.85 < 14.38; hence neither causality in mean nor causality in the second moment is detected. Combining these results, we may conclude that the volatility of Nikkei 225 futures is caused by its past volume change, but there seems to be no evidence for causality in the reverse direction.

7. Concluding remarks

Granger causality tests proposed previously do not detect certain nonlinear causal relationships. The reason is that they construct test statistics based on a linear representation of the time series, appealing to, say, the Wold decomposition, whereas we know only that the error terms are uncorrelated with the series of interest, rather than independent. Observing this fact, we proposed a nonparametric testing procedure for Granger-type causality that allows any form of nonlinear dependence. The test can be viewed as a test for omitted variables in time series regression. We show that the test has nontrivial power against √T-local alternatives, like ICM. But it has an advantage over other omitted variable tests in that its power properties are explicit, and thus we can incorporate a priori information, if any, in forming a test. Furthermore, it is possible to show that previously proposed omitted variable tests can be represented asymptotically in terms of the present test. The Monte Carlo study shows that the test performs well, with decent empirical size in general and good power for both causality in mean and causality up to the second moment. Compared with Hidalgo's test, Ŝ_T does somewhat worse when the series in fact exhibit linear dependence in mean, but in the case of nonlinear dependence, especially nonlinear dependence in the conditional variance, it clearly outperforms Hidalgo's test. We also provide an empirical example with financial data.

Appendix A. Proofs of theorems

Proof of Theorem 1.
Denote
$$a = \frac{1}{\sqrt{T}}\sum_{t=r}^{T} u_t f(X_{t-1})h(Z_{t-1}) = \frac{1}{\sqrt{T}}\sum_{t=r}^{T} u_t f(X_{t-1})M^{-1/2}Q(Z_{t-1}),$$
$$\hat{a} = \frac{1}{\sqrt{T}}\sum_{t=r}^{T} \hat{u}_t \hat{f}(X_{t-1})\hat{h}(Z_{t-1}) = \frac{1}{\sqrt{T}}\sum_{t=r}^{T} \hat{u}_t \hat{f}(X_{t-1})\hat{M}^{-1/2}\hat{Q}(Z_{t-1}),$$
and $W = \mathrm{diag}(w_1,\ldots,w_{k_T})$. We would like to show that
$$a'Wa - \sum_{i=1}^{\infty} w_i\epsilon_i^2 \stackrel{p}{\to} 0 \tag{A.1}$$
and
$$\hat{a}'W\hat{a} - a'Wa \stackrel{p}{\to} 0 \tag{A.2}$$
as $T \to \infty$.

We first prove (A.1). For any $\epsilon > 0$ and $\delta > 0$, we can choose $m$ such that $\sum_{i=m+1}^{\infty} w_i < \delta\epsilon/2$ because $\{w_i\}$ is a summable sequence. As $m$ is a finite number independent of $T$, $m < k_T < T$ holds for sufficiently large $T$. Then we have
$$P\left(\left|\sum_{i=m+1}^{k_T} w_i a_i^2 - \sum_{i=m+1}^{\infty} w_i\epsilon_i^2\right| > \delta\right) \le \frac{E\sum_{i=m+1}^{k_T} w_i a_i^2 + E\sum_{i=m+1}^{\infty} w_i\epsilon_i^2}{\delta} = \frac{\sum_{i=m+1}^{k_T} w_i + \sum_{i=m+1}^{\infty} w_i}{\delta} \le \frac{2\sum_{i=m+1}^{\infty} w_i}{\delta} < \epsilon,$$
where the first inequality is due to the Markov inequality and the equality uses $E(a_i^2) = E(\epsilon_i^2) = 1$. We look at the remaining first $m$ elements. We only need to show
$$(a_1,\ldots,a_m)' \stackrel{d}{\to} N(0, I_m). \tag{A.3}$$
Let $\lambda = (\lambda_1,\ldots,\lambda_m)'$ be any fixed $m$-vector; then
$$\sum_{i=1}^{m}\lambda_i a_i = \frac{1}{\sqrt{T}}\sum_{t=r}^{T} u_t f(X_{t-1})\sum_{i=1}^{m}\lambda_i h_i(Z_{t-1}) = \frac{1}{\sqrt{T}}\sum_{t=r}^{T} v_t,$$
where $E(v_t|Z_{t-1}) = 0$ and $V(v_t) = \sum_{i=1}^{m}\lambda_i^2$. Appealing to the CLT for stationary martingale differences (see, e.g., Theorem 7.11 of Bierens (2004)) with the Cramér–Wold device, we have (A.3), which completes the proof of (A.1).

We prove (A.2) next. Noting that
$$\hat{a}'W\hat{a} - a'Wa = (\hat{a}-a)'W(\hat{a}-a) + (\hat{a}-a)'Wa + a'W(\hat{a}-a)$$
and
$$|(\hat{a}-a)'Wa|^2 = |a'W(\hat{a}-a)|^2 \le \|(\hat{a}-a)'W^{1/2}\|^2\|W^{1/2}a\|^2 \le (\hat{a}-a)'W(\hat{a}-a)\cdot a'Wa,$$
it suffices to show
$$(\hat{a}-a)'W(\hat{a}-a) \stackrel{p}{\to} 0 \tag{A.4}$$
as $a'Wa = O_p(1)$ by (A.1). Write further
$$\hat{a}-a = \frac{1}{\sqrt{T}}\sum_{t=r}^{T}\{\hat{u}_t\hat{f}(X_{t-1})\hat{M}^{-1/2}\hat{Q}(Z_{t-1}) - u_tf(X_{t-1})M^{-1/2}Q(Z_{t-1})\}$$
$$= \frac{1}{\sqrt{T}}\sum_{t=r}^{T}\{\hat{u}_t\hat{f}(X_{t-1}) - u_tf(X_{t-1})\}M^{-1/2}Q(Z_{t-1}) + \left(\hat{M}^{-1/2}-M^{-1/2}\right)\frac{1}{\sqrt{T}}\sum_{t=r}^{T}u_tf(X_{t-1})Q(Z_{t-1})$$
$$\quad + \frac{1}{\sqrt{T}}\sum_{t=r}^{T}u_tf(X_{t-1})M^{-1/2}\{\hat{Q}(Z_{t-1})-Q(Z_{t-1})\} + \text{smaller order terms}$$
$$= A_{1T} + A_{2T} + A_{3T} + \text{smaller order terms}.$$
Then
$$A_{1T}'WA_{1T} \stackrel{p}{\to} 0, \tag{A.5}$$
$$A_{2T}'WA_{2T} \stackrel{p}{\to} 0 \tag{A.6}$$
and
$$A_{3T}'WA_{3T} \stackrel{p}{\to} 0 \tag{A.7}$$
imply (A.4). Therefore, (A.5), (A.6) and (A.7) suffice to show (A.2).

We start with (A.6). Because
$$A_{2T} = \left(\hat{M}^{-1/2}-M^{-1/2}\right)\frac{1}{\sqrt{T}}\sum_{t=r}^{T}u_tf(X_{t-1})Q(Z_{t-1}),$$
we have
$$A_{2T}'WA_{2T} \le \|A_{2T}'W^{1/2}\|^2 \le \|A_{2T}\|^2\|W^{1/2}\|^2 \le \|\hat{M}^{-1/2}-M^{-1/2}\|^2\left\|\frac{1}{\sqrt{T}}\sum_{t=r}^{T}u_tf(X_{t-1})Q(Z_{t-1})\right\|^2\|W^{1/2}\|^2.$$
As $\{w_i\}$ is summable, $\|W^{1/2}\|^2 = \sum_{i=1}^{k_T}w_i < \infty$. By Lemmas 5 and 6 and Assumption A7,
$$\|\hat{M}^{-1/2}-M^{-1/2}\|^2 \le \frac{k_T\|\hat{M}-M\|^2}{\lambda_{\min}^3(M)} = O_p\left(\frac{k_T}{T}\{\zeta_0^4(k_T)+k_T^2\}\right).$$
Also,
$$E\left\|\frac{1}{\sqrt{T}}\sum_{t=r}^{T}u_tf(X_{t-1})Q(Z_{t-1})\right\|^2 = \sum_{i=1}^{k_T}E\left(\frac{1}{\sqrt{T}}\sum_{t=r}^{T}u_tf(X_{t-1})Q_i(Z_{t-1})\right)^2 = \sum_{i=1}^{k_T}E\{u_t^2f(X_{t-1})^2Q_i(Z_{t-1})^2\} \le 4C_q^2k_TE\{u_t^2f(X_{t-1})^3\} \le Ck_T.$$
Combining them, we have
$$A_{2T}'WA_{2T} = O_p\left(\frac{k_T^2}{T}\{\zeta_0^4(k_T)+k_T^2\}\right),$$
and thus (A.6) holds by Assumption A5.

Because
$$E(A_{1T}'WA_{1T}) = \sum_{i=1}^{k_T}w_iE\left[\frac{1}{\sqrt{T}}\sum_{t=r}^{T}\{\hat{u}_t\hat{f}(X_{t-1})-u_tf(X_{t-1})\}h_i(Z_{t-1})\right]^2 \le C\sup_{1\le i\le k_T}E\left[\frac{1}{\sqrt{T}}\sum_{t=r}^{T}\{\hat{u}_t\hat{f}(X_{t-1})-u_tf(X_{t-1})\}h_i(Z_{t-1})\right]^2,$$
it is sufficient to show, by the Markov inequality, that
$$E\left[\frac{1}{\sqrt{T}}\sum_{t=r}^{T}\{\hat{u}_t\hat{f}(X_{t-1})-u_tf(X_{t-1})\}h_i(Z_{t-1})\right]^2 \to 0 \tag{A.8}$$
as $T\to\infty$ uniformly in $i$ in order to prove (A.5). We rewrite the quantity in the square brackets in a U-statistic form as follows:
$$\frac{1}{\sqrt{T}}\sum_{t=r}^{T}\{\hat{u}_t\hat{f}(X_{t-1})-u_tf(X_{t-1})\}h_i(Z_{t-1}) = \sqrt{T}\binom{T}{2}^{-1}\sum_{t=r}^{T-1}\sum_{s=t+1}^{T}U_i(W_t,W_s),$$
where $W_t = (x_t, Z_{t-1})$ and
$$U_i(W_t,W_s) = \frac{1}{2}\left[\left\{(x_t-x_s)\frac{1}{h^p}K\left(\frac{X_{t-1}-X_{s-1}}{h}\right)-u_tf(X_{t-1})\right\}h_i(Z_{t-1}) + \left\{(x_s-x_t)\frac{1}{h^p}K\left(\frac{X_{t-1}-X_{s-1}}{h}\right)-u_sf(X_{s-1})\right\}h_i(Z_{s-1})\right].$$
We consider the projection, which will be used later. For $W = (x,Z)$ and $u = x - g(X)$,
$$u_{1i}(W) = E\{U_i(W_t,W)\} = -\frac{1}{2}h_i(Z)\int g(X+hu)f(X+hu)K(u)\,du + \frac{1}{2}xh_i(Z)\int f(X+hu)K(u)\,du - \frac{1}{2}uf(X)h_i(Z).$$
For the Hoeffding decomposition, put $\phi_i(W_t,W_s) = U_i(W_t,W_s) - u_{1i}(W_t) - u_{1i}(W_s)$ and write
$$\sqrt{T}\binom{T}{2}^{-1}\sum_{t=r}^{T-1}\sum_{s=t+1}^{T}U_i(W_t,W_s) = \frac{2}{\sqrt{T}}\sum_{t=r}^{T}u_{1i}(W_t) + \sqrt{T}\binom{T}{2}^{-1}\sum_{t=r}^{T-1}\sum_{s=t+1}^{T}\phi_i(W_t,W_s).$$
Then we want to show, in order to prove (A.8), that
$$\mathrm{Var}\left(\frac{1}{\sqrt{T}}\sum_{t=r}^{T}u_{1i}(W_t)\right) \to 0 \tag{A.9}$$
and
$$E\left[\sqrt{T}\binom{T}{2}^{-1}\sum_{t=r}^{T-1}\sum_{s=t+1}^{T}\phi_i(W_t,W_s)\right]^2 \to 0 \tag{A.10}$$
uniformly in $i$.

We first prove (A.9). We immediately know that
$$E\{u_{1i}(W_t)|W_s\} = 0 \tag{A.11}$$
for $t > s$ because $E(u_t|Z_{t-1}) = 0$, $E(x_t|Z_{t-1}) = g(X_{t-1})$ under $H_0$, and $E\{h_i(Z_{t-1})|X_{t-1}\} = 0$. Also, by Assumptions A3 and A6, we can show
$$u_{1i}(W) = h^Lk_{1i}(W) + o(h^L)$$
with
$$k_{1i}(W) = \frac{h_i(Z)}{2L!}\sum_{\substack{0\le l_1,\ldots,l_p\le L\\ l_1+\cdots+l_p=L}}\left(\prod_{i=1}^{p}\int u_i^{l_i}K(u)\,du\right)\left\{xf^{(l_1,\ldots,l_p)}(X) - (gf)^{(l_1,\ldots,l_p)}(X)\right\},$$
where $f^{(l_1,\ldots,l_p)}(X) = \partial^Lf(X)/\partial x_1^{l_1}\cdots\partial x_p^{l_p}$. Based on the above, it is straightforward to prove (A.9) because $E\{u_{1i}(W_t)\} = 0$ and $\mathrm{Var}\{u_{1i}(W_t)\} = O(h^{2L})$ by Assumptions A4, A6 and (A.11).

We follow the same track as Yoshihara (1976) to prove (A.10). The square of the double sum involves squared terms $\sum\sum\phi_i(W_t,W_s)^2$, terms with triple summation such as $\sum_{t}\sum_{s}\sum_{u}\phi_i(W_t,W_s)\phi_i(W_t,W_u)$, and terms with quadruple summation of the form $\sum_{t}\sum_{s}\sum_{u}\sum_{v}\phi_i(W_t,W_s)\phi_i(W_u,W_v)$. We evaluate the expectation of each type of term, obtaining bounds as in Yoshihara (1976). Let $F_{t_1,\ldots,t_m}(w_1,\ldots,w_m)$ be the joint distribution function of $W_{t_1},\ldots,W_{t_m}$, and let $F(w) = F_{t_1,\ldots,t_m}(w,\infty,\ldots,\infty)$.

First, under Assumptions A2 and A4, we have
$$\int|\phi_i(w_1,w_2)|^{2+\delta}\,dF(w_1)\,dF(w_2) \le \int|U_i(w_1,w_2)|^{2+\delta}\,dF(w_1)\,dF(w_2) + C$$
$$\le C\int|x_1-x_2|^{2+\delta}\frac{1}{h^{p(2+\delta)}}\left|K\left(\frac{X_1-X_2}{h}\right)\right|^{2+\delta}|h_i(Z_1)|^{2+\delta}\,dF(w_1)\,dF(w_2) + C \le C\{h^{-p(1+\delta)}+1\} = C_{1h}. \tag{A.12}$$
The last inequality is valid uniformly in $i$ because $|h_i(Z)| \le 2C_qf(X)$ due to Assumption A4. Lemma 4 yields
$$\left|\int\phi_i(w_1,w_2)^2\,dF_{t,s}(w_1,w_2) - \int\phi_i(w_1,w_2)^2\,dF(w_1)\,dF(w_2)\right| \le 4C_{1h}^{2/(2+\delta)}\beta(s-t)^{\delta/(2+\delta)},$$
but (A.12) holds also for $\delta = 0$, and thus $\int\phi_i(w_1,w_2)^2\,dF(w_1)\,dF(w_2) = O(h^{-p})$. Combining them, we have
$$E\{\phi_i(W_t,W_s)^2\} \le C\{h^{-p} + h^{-2p(1+\delta)/(2+\delta)}\beta(s-t)^{\delta/(2+\delta)}\}. \tag{A.13}$$
Because
$$\sum_{t=1}^{T-1}\sum_{s=t+1}^{T}\beta(s-t)^{\delta/(2+\delta)} = \sum_{t=1}^{T-1}\sum_{s=t+1}^{T}(s-t)^{-(2+\eta)\delta/\{\eta(2+\delta)\}} = \sum_{t=1}^{T-1}\sum_{s=t+1}^{T}(s-t)^{-1-\gamma} \le CT$$
for $\gamma = 2(\delta-\eta)/\{\eta(2+\delta)\} > 0$, we have
$$E\left[T\binom{T}{2}^{-2}\sum_{t=r}^{T-1}\sum_{s=t+1}^{T}\phi_i(W_t,W_s)^2\right] \le C\left(\frac{1}{Th^p} + \frac{h^{2p/(2+\delta)}}{T^2h^{2p}}\right), \tag{A.14}$$
which goes to zero because of Assumption A3.

We consider the triple-summation terms
$$T\binom{T}{2}^{-2}\sum_{t=r}^{T-2}\sum_{s=t+1}^{T-1}\sum_{u=s+1}^{T}\phi_i(W_t,W_s)\phi_i(W_t,W_u)$$
next. We have, similarly to the computation of (A.12),
$$\int|\phi_i(w_1,w_2)\phi_i(w_1,w_3)|^{1+\delta/2}\,dF_{t,s}(w_1,w_2)\,dF_u(w_3) \le C(h^{-p\delta}+1) = C_{2h}.$$
Then, by Lemma 4, we have
$$\left|\int\phi_i(w_1,w_2)\phi_i(w_1,w_3)\,dF_{t,s,u}(w_1,w_2,w_3) - \int\phi_i(w_1,w_2)\phi_i(w_1,w_3)\,dF_{t,s}(w_1,w_2)\,dF(w_3)\right| \le 4C_{2h}^{2/(2+\delta)}\beta(s-t)^{\delta/(2+\delta)}.$$
Noting $\int\phi_i(w_1,w_3)\,dF(w_3) = 0$ by construction, we have
$$|E\{\phi_i(W_t,W_s)\phi_i(W_t,W_u)\}| \le 4C_{2h}^{2/(2+\delta)}\beta(s-t)^{\delta/(2+\delta)} \le Ch^{-2p\delta/(2+\delta)}(s-t)^{-1-\gamma},$$
which yields
$$E\left[T\binom{T}{2}^{-2}\sum_{t=r}^{T-2}\sum_{s=t+1}^{T-1}\sum_{u=s+1}^{T}\phi_i(W_t,W_s)\phi_i(W_t,W_u)\right] \le C\frac{h^{(2-\delta)p/(2+\delta)}}{Th^p}.$$
We can pick $\delta < 2$ as long as $E|x_t|^{2+\delta}$ exists. Thus, the expectation of the triple sum decays to zero.

For the quadruple summation, we consider the cases (i) $t<s<u<v$ and $s-t > v-u$ and (ii) $t<s<u<v$ and $s-t \le v-u$ as in Yoshihara (1976). Because both cases can be handled similarly, we consider only the first case (i). Note that
$$\int|\phi_i(w_1,w_2)\phi_i(w_3,w_4)|^{1+\delta/2}\,dF_{t,s,u}(w_1,w_2,w_3)\,dF(w_4) \le C(h^{-p\delta}+1) = C_{3h}.$$
Then
$$\left|\int\phi_i(w_1,w_2)\phi_i(w_3,w_4)\,dF_{t,s,u,v}(w_1,w_2,w_3,w_4) - \int\phi_i(w_1,w_2)\phi_i(w_3,w_4)\,dF_{t,s,u}(w_1,w_2,w_3)\,dF(w_4)\right| \le 4C_{3h}^{2/(2+\delta)}\beta(v-u)^{\delta/(2+\delta)}.$$
Noting $\int\phi_i(w_3,w_4)\,dF(w_4) = 0$, we have
$$|E\{\phi_i(W_t,W_s)\phi_i(W_u,W_v)\}| \le 4C_{3h}^{2/(2+\delta)}\beta(s-t)^{\delta/(2+\delta)},$$
and thus
$$E\left[T\binom{T}{2}^{-2}\sum_{t<s<u<v}\phi_i(W_t,W_s)\phi_i(W_u,W_v)\right] \le CT^{-\gamma}\frac{h^{2p(\delta-\eta-\eta\delta)/\{\eta(2+\delta)\}}}{(Th^p)^{2(\delta-\eta)/\{\eta(2+\delta)\}}}.$$
By assumption ($\eta < 1$ and $\delta > \eta/(1-\eta)$) and $Th^p \to \infty$, this decays to zero as $T \to \infty$. Combining these results, we have (A.10).

Now, to prove (A.7), similarly to the proof of (A.5), it suffices to show
$$E(A_{3Ti}^2) \to 0 \tag{A.15}$$
uniformly in $i$, where $A_{3Ti}$ is the $i$th element of $A_{3T}$, namely
$$A_{3Ti} = \frac{1}{\sqrt{T}}\sum_{t=r}^{T}u_tf(X_{t-1})m_i'\{\hat{Q}(Z_{t-1})-Q(Z_{t-1})\},$$
and $m_i'$ denotes the $i$th row of $M^{-1/2}$. It is easily seen that $\|m_i\| \le 1/\lambda_{\min}(M) < C$. We write $A_{3Ti}$ in U-statistic form,
$$A_{3Ti} = \sqrt{T}\binom{T}{2}^{-1}\sum_{t=r}^{T-1}\sum_{s=t+1}^{T}V_i(W_t,W_s),$$
where
$$V_i(W_t,W_s) = \frac{1}{2}\left[\{u_tf(X_{t-1})-u_sf(X_{s-1})\}\frac{1}{h^p}K\left(\frac{X_{t-1}-X_{s-1}}{h}\right)m_i'\{Q(Z_{t-1})-Q(Z_{s-1})\} - \{u_tf(X_{t-1})h_i(Z_{t-1})+u_sf(X_{s-1})h_i(Z_{s-1})\}\right].$$
The projection is, for $W = (x,Z)$, $u = x-g(X)$, $q(Z) = Q(Z)/f(X)$ and $r(X) = E\{q(Z_{t-1})|X_{t-1}=X\}$,
$$v_{1i}(W) = E\{V_i(W_t,W)\} = \frac{1}{2}uf(X)m_i'\int\{q(Z)-r(X+hu)\}f(X+hu)K(u)\,du + \frac{1}{2}uf(X)m_i'\{q(Z)-r(X)\}f(X).$$
For the Hoeffding decomposition, put $\psi_i(W_t,W_s) = V_i(W_t,W_s) - v_{1i}(W_t) - v_{1i}(W_s)$ and write
$$\sqrt{T}\binom{T}{2}^{-1}\sum_{t=r}^{T-1}\sum_{s=t+1}^{T}V_i(W_t,W_s) = \frac{2}{\sqrt{T}}\sum_{t=r}^{T}v_{1i}(W_t) + \sqrt{T}\binom{T}{2}^{-1}\sum_{t=r}^{T-1}\sum_{s=t+1}^{T}\psi_i(W_t,W_s).$$
Then we want to show, in order to prove (A.15), that
$$\mathrm{Var}\left(\frac{1}{\sqrt{T}}\sum_{t=r}^{T}v_{1i}(W_t)\right) \to 0 \tag{A.16}$$
and
$$E\left[\sqrt{T}\binom{T}{2}^{-1}\sum_{t=r}^{T-1}\sum_{s=t+1}^{T}\psi_i(W_t,W_s)\right]^2 \to 0 \tag{A.17}$$
uniformly in $i$. It is straightforward to prove that $\{v_{1i}(W_t)\}$ is a martingale difference sequence, and we can further show
$$v_{1i}(W) = h^Lk_{3i}(W) + o(h^L)$$
with
$$k_{3i}(W) = \frac{\{x-g(X)\}f(X)}{2L!}\sum_{\substack{0\le l_1,\ldots,l_p\le L\\ l_1+\cdots+l_p=L}}\left(\prod_{i=1}^{p}\int u_i^{l_i}K(u)\,du\right)\left\{m_i'q(Z)f^{(l_1,\ldots,l_p)}(X) - (m_i'rf)^{(l_1,\ldots,l_p)}(X)\right\}.$$
Therefore, (A.16) is true. For (A.17), we basically need to look at $V_i(W_t,W_s)$, as the $v_{1i}(W_t)$ are of much smaller order. We omit the proof of (A.17) because it can be treated in exactly the same manner as (A.10).
Y. Nishiyama et al. / Journal of Econometrics 165 (2011) 112–127
Appendix B. Lemmas
+
The following lemma by Yoshihara (1976) is useful to prove Theorem 1. The proof is omitted. Lemma 4. Let {zt }, t = 1, . . . , T be an absolutely regular sequence of random variables with coefficient β(k). Let t1 < · · · < tn be integers. Let F (a, b) be the distribution function of zta , . . . , ztb (a ≤ b). Let s(ξ ) = s(ξ1 , . . . , ξn ) be a Borel-measurable function. Then for δ > 0,
The next two lemmas are used to prove the theorem.
Let Wt be Wt = (xt , Zt −1 ), and define U1 (Wt , Ws ) and its projection uc1 (w) as follows,
=
ζ04 (kT ) T
k2T T
+ C3
ζ02 (kT )kT T
.
Proof. Recall rj (Xt −1 ) = E [qj (Zt −1 )|Xt −1 ] for j = 1, . . . , kT , and ut = xt − g (Xt −1 ) where g (Xt −1 ) = E [xt |Xt −1 ], and let q˜j (Zt −1 ) =
ˆ − M is, qj (Zt −1 ) − rj (Xt −1 ). The (j, l) element of M 1− 2 4 Aj,l = uˆ t fˆ (Xt −1 )qˆ˜ j (Zt −1 )qˆ˜ l (Zt −1 ) T 2 4 − E ut f (Xt −1 )˜qj (Zt −1 )˜ql (Zt −1 )
2
fˆ (Xt −1 ) =
1 − Xs−1 − Xt −1 Thp s=2
K
h
xt ut f (Xt −1 )˜qj (Zt −1 )˜ql (Zt −1 )
Thp s=2
xs K
A1j,l =
,
2 −− T2
1 −
=
fˆ (Xt −1 ).
=
−
T 2−
T t =2
T t =1
ut f 3 (Xt −1 )˜qj (Zt −1 )˜ql (Zt −1 )
× {ˆg (Xt −1 )fˆ (Xt −1 ) − g (Xt −1 )f (Xt −1 )} T 1− 2 3 + ut f (Xt −1 )˜ql (Zt −1 )qj (Zt −1 ){fˆ (Xt −1 ) − f (Xt −1 )} T t =1
−
+
−
T 1−
T t =1 T 1−
T t =1 T 1−
T t =1
u2t f 3 (Xt −1 )˜ql (Zt −1 ){ˆrj (Xt −1 )fˆ (Xt −1 ) − rj (Xt −1 )f (Xt −1 )} u2t f 3 (Xt −1 )˜qj (Zt −1 )ql (Zt −1 ){fˆ (Xt −1 ) − f (Xt −1 )} u2t f 3 (Xt −1 )˜qj (Zt −1 ){ˆrl (Xt −1 )fˆ (Xt −1 ) − rl (Xt −1 )f (Xt −1 )}
h
,
T
T
U1 (Wt , Ws )
s
2−
xt ut f 3 (Xt −1 )˜qj (Zt −1 )˜ql (Zt −1 )f (Xt −1 )
t
2−
uc1 (Wt ) −
t
2− T
xt ut f 3 (Xt −1 )
t T −1 T 1 − −
T 2 t =r s=t +1
φ1 (Wt , Ws )
2 − E xt ut q˜ j (Zt −1 )˜ql (Zt −1 )|Xt −1 T t
+
xt ut f 3 (Xt −1 )˜qj (Zt −1 )˜ql (Zt −1 ){fˆ (Xt −1 ) − f (Xt −1 )}
T 2−
Xs−1 − Xt −1
− xt ut q˜ j (Zt −1 )˜ql (Zt −1 ) f 4 (Xt −1 ) + O(hL )
Decompose Aj,l as the followings, Aj,l =
t
× q˜ j (Zt −1 )˜ql (Zt −1 )f (Xt −1 ) +
h
K Thp
h
by the stationarity of Wt . In order for the Hoeffding decomposition, put φ1 (Wt , Ws ) = U1 (Wt , Ws )−uc1 (Wt )−uc1 (Ws ). With U1 (Wt , Ws ) and its projection, A1j,l is represented as
−
Xs−1 − Xt −1
1
= E [xt ut f 3 (Xt −1 )˜qj (Zt −1 )˜ql (Zt −1 )|Xt −1 = X ]
and T 1 −
K
Xs−1 − Xt −1
E [xs us f 3 (Xs−1 )˜qj (Zs−1 )˜ql (Zs−1 )|Xs−1 = X ]
qj (Zs−1 ) Thp s=2 Xs−1 − Xt −1 ×K fˆ (Xt −1 ) h
uˆ t = xt −
Thp
Note that
T
qˆ˜ j (Zt −1 ) = qj (Zt −1 ) −
1
3
1 xt ut f 3 (Xt −1 )˜qj (Zt −1 )˜ql (Zt −1 )f (Xt −1 ) 2 + E [xt ut f 3 (Xt −1 )˜qj (Zt −1 )˜ql (Zt −1 )|Xt −1 ]f (Xt −1 ) + Op (hL ) .
where T
uc1 (Wt ) = E [U1 (W , Ws )]|W =Wt
= + C2
1
+ xs us f 3 (Xs−1 )˜qj (Zs−1 )˜ql (Zs−1 )
Lemma 5. Under A1–A6,
2
u2t f 4 (Xt −1 )˜qj (Zt −1 )˜ql (Zt −1 )
− E u2t f 4 (Xt −1 )˜qj (Zt −1 )˜ql (Zt −1 ) + (smaller order terms) = A1j,l + A2j,l + A3j,l + A4j,l + A5j,l + A6j,l + A7j,l + (smaller order terms).
≤ 3M 1/(1+δ) {β(tj+1 − tj )}δ/(1+δ) providing M ≡ |s(ξ )|1+δ dF (1, j)dF (j + 1, n) exists.
T t =1
U1 (Wt , Ws )
∫ ∫ s(ξ )dF (1, n) − s(ξ )dF (1, j)dF (j + 1, n)
ˆ − M ≤ C1 E M
T 1−
123
T −1 T 1 − −
T 2 t =r s=t +1
φ1 (Wt , Ws ).
(B.1)
First we show that the 3rd term of R.H.S is a smaller order term than ∑T −1 ∑T the first term of R.H.S. Var T12 t =r s=t +1 φ1 (Wt , Ws ) includes E φ12 (Wt , Ws ) , E [φ1 (Wt , Ws )φ1 (Wt , Wu )] and E [φ1 (Wt , Ws )φ1 (Wu , Wv )], we evaluate each term.
∫
|φ1 (w1 , w2 )|2+δ dF (w1 )dF (w2 ) ∫ ≤ |U1 (w1 , w2 )|2+δ dF (w1 )dF (w2 ) + C 2+δ ∫ 1 4 2+δ K X1 − X2 ≤ C |x1 u1 f (X1 )| p(2+δ) h
(B.2)
h
× dF (w1 )dF (w2 ) + C ≤ C {h−p(1+δ) + 1} = C1h .
(B.3)
124
Y. Nishiyama et al. / Journal of Econometrics 165 (2011) 112–127
Lemma 4 yields
∫ ∫ φ1 (w1 , w2 )2 dFt ,s (w1 , w2 ) − φ1 (w1 , w2 )2 dF (w1 )dF (w2 ) 2
in Yoshihara (1976). Because both cases can be handled similarly, we consider only the first case (i). Note that
∫
δ 2+δ
≤ 4C1h2+δ β(s − t )
but (B.3) holds also for δ = 0, and thus φ1 (w1 , w2 )2 dF (w1 ) dF (w2 ) = O(h−p ). Combining them, we have
∫
E {φ1 (Wt , Ws )2 } =
2p(1+δ) 2+δ
Then
δ
β(s − t ) 2+δ }.
∫ φ1 (w1 , w2 )φ1 (w3 , w4 )dFt ,s,u,v (w1 , w2 , w3 , w4 ) ∫ − φ1 (w1 , w2 )φ1 (w3 , w4 )dFt ,s,u (w1 , w2 , w3 )dF (w4 )
Because T −1 − T −
δ
β(s − t ) 2+δ =
t =1 s =t +1
T −1 − T −
(s − t )
(2+η)δ − η( 2+δ)
t =1 s=t +1
=
T −1 − T −
2
(s − t )
−1−γ
≤ CT ,
for γ = 2(δ − η)/{η(2 + δ)} > 0, we have E
2
=
1 T
φ1 (Wt , Ws )2
1
≤
T
t =r s=t +1
C
1 Thp
We consider
+
h
2p 2+δ
T 2 h2p
−2 ∑ T 2
T −2 t =r
=o
1
T
C
2
1 Thp
1
+ T 2h
2p(1+δ) 2+δ
.
and thus
−2 − − − − E T φ ( W , W )φ ( W , W ) 1 t s 1 u v 2 t <s
∑T −1 ∑T s=t +1
u=s+1
≤
φ1 (Wt , Ws )φ1 (Wt , Wu )
=
1+ 2δ
|φ1 (w1 , w2 )φ1 (w1 , w3 )| dFt ,s (w1 , w2 )dFu (w3 ) ∫ ∫ δ δ ≤ |φ1 (w1 , w2 )|1+ 2 |φ1 (w1 , w3 )|1+ 2 dFu (w3 ) × dFt ,s (w1 , w2 )
1 T
CT
2pδ −γ − 2+δ
1 Ch T
h
(Thp )
A1j,l
2
∫ φ1 (w1 , w2 )φ1 (w1 , w3 )dFt ,s,u (w1 , w2 , w3 ) ∫ − φ1 (w1 , w2 )φ1 (w1 , w3 )dFt ,s (w1 , w2 )dF (w3 ) δ
≤ 4C2h2+δ β(s − t ) 2+δ . Noting φ1 (w1 , w3 )dF (w3 ) = 0 by construction, we have 2
|E {φ1 (Wt , Ws )φ1 (Wt , Wu )}| ≤ 4C2h2+δ β(s − t ) ≤ Ch
δ − 22p +δ
δ 2+δ
2(δ−η) η(2+δ)
1
T
.
=
4 T
[ [ E
E xt ut q˜ j (Zt −1 )˜ql (Zt −1 )|Xt −1
T
For the quadruple summation, we consider the cases (i) t < s < u < v and s − t > v − u and (ii) t < s < u < v and s − t ≤ v − u as
4
T
T
then kT − kT −
E
A1j,l
2
j=1 l=1
≤
4 T
≤4
−2 − T −2 − T −1 − T T φ1 (Wt , Ws )φ1 (Wt , Wu ) E 2 t =r s=t +1 u=s+1 2−δ 1 h 2+δ 1 ≤ C p =o .
]
2 ] − xt ut q˜ j (Zt −1 )˜ql (Zt −1 ) f (Xt −1 ) + O(h2L ) 4 1 ≤ E x2t u2t q˜ 2j (Zt −1 )˜q2l (Zt −1 )f 8 (Xt −1 ) + o ,
(s − t )−1−γ
which yields
Th
=o
Then, by Lemma 4, we have
T
.
2p(δ−η−ηδ) η(2+δ)
Since the first term of the right hand side of (B.1) is a sum of stationary martingale difference random variables, E
≤ Ch−pδ = C2h .
2
δ
|E [φ1 (Wt , Ws )φ1 (Wu , Wv )]| ≤ 4C3h2+δ β(s − t ) 2+δ ,
next. We have, similarly to the computation of (B.3),
∫
δ
≤ 4C3h2+δ β(v − u) 2+δ . Noting φi (w3 , w4 )dF (w4 ) = 0, we have
t =1 s=t +1
−2 − T −1 − T T
× dFt ,s,u (w1 , w2 , w3 ) ≤ C (h−pδ + 1) = C3h .
φ1 (w1 , w2 )2 dFt ,s (w1 , w2 )
≤ C {h−p + h−
δ
|φ1 (w1 , w2 )φ1 (w3 , w4 )|1+ 2 dFt ,s,u (w1 , w2 , w3 )dF (w4 ) ∫ ∫ 1+ 2δ 1+ 2δ ≤ |φ1 (w1 , w2 )| |φ1 (w3 , w4 )| dF (w4 )
E x2t u2t
kT − j=1
ζ04 (kT ) T
q˜ 2j (Zt −1 )
kT −
q˜ 2l (Zt −1 )f 8 (Xt −1 )
l =1
E x2t u2t f 8 (Xt −1 )
= o(1).
Similarly, A2j,l =
2 − ut q˜ j (Zt −1 )˜ql (Zt −1 )m(Xt −1 )f 4 (Xt −1 ) T t
− E [ut q˜ j (Zt −1 )˜ql (Zt −1 )|Xt −1 ]xt f 4 (Xt −1 ) + O(hL ) + (small order term).
Y. Nishiyama et al. / Journal of Econometrics 165 (2011) 112–127
The first term of the right hand side is a sum of martingale difference random variables since xt = ut + m(Xt −1 ) and ut q˜ j (Zt −1 )˜ql (Zt −1 )m(Xt −1 )f (Xt −1 ) 4
+ ut E [ut q˜ j (Zt −1 )˜ql (Zt −1 )|Xt −1 ]xt f (Xt −1 ),
8
− E
E
T
T j=1
2 ut q˜ j (Zt −1 )˜ql (Zt −1 )g (Xt −1 )f 4 (Xt −1 )
E u4t f 8 (Xt −1 )
T
kT − kT − E (A5j,l )2 = o(1). j =1 l =2
1 −
A4j,l =
ut q˜ j (Zt −1 )˜ql (Zt −1 )g (Xt −1 )f 4 (Xt −1 )
T
u2t q˜ l (Zt −1 )rj (Xt −1 )
t
− E u2t q˜ l (Zt −1 )|Xt −1 qj (Zt −1 ) f 4 (Xt −1 ) + O hL .
2 1 4 4 + E E [ut q˜ j (Zt −1 )˜ql (Zt −1 )|Xt −1 ]xt f (Xt −1 ) + o T T
The first term of the right hand side is a sum of martingale difference random variables since qj (Zt −1 ) = q˜ j (Zt −1 ) + rj (Xt −1 ) and
4 2 2 E ut q˜ j (Zt −1 )˜q2l (Zt −1 )g 2 (Xt −1 )f 8 (Xt −1 ) T 2 1/2 8 + E ut q˜ j (Zt −1 )˜ql (Zt −1 )g (Xt −1 )f 4 (Xt −1 ) T
2 1/2 × E E [ut q˜ j (Zt −1 )˜ql (Zt −1 )|Xt −1 ]xt f 4 (Xt −1 )
Thus
4 + E E [ut q˜ j (Zt −1 )˜ql (Zt −1 )|Xt −1 ]2 x2t f 8 (Xt −1 ) . T
E
u2t q˜ l (Zt −1 )rj (Xt −1 ) − E u2t q˜ l (Zt −1 )|Xt −1 qj (Zt −1 ) f 4 (Xt −1 )
kT − kT −
E
A2j,l
4 T
+
E u2t
A4j,l
2
kT − j =1
T j=1 l=1
kT −
q˜ 2l (Zt −1 )g 2 (Xt −1 )f 8 (Xt −1 )
l =1
4Cq4 E u2t g 2 (Xt −1 )f 8 (Xt −1 )
≤4
j =1
l =1
kT − kT −
E
+ 4Cq4
The term $A3_{j,l}$ can be written as
\[
A3_{j,l} = -\frac{1}{T}\sum_{t}\left[u_t^2\,\tilde q_l(Z_{t-1})q_j(Z_{t-1}) - E\!\left(u_t^2\,\tilde q_l(Z_{t-1})q_j(Z_{t-1})\mid X_{t-1}\right)\right]f^4(X_{t-1}) + O(h^L),
\]
and both terms of the right-hand side are martingale differences. Thus
\[
E(A3_{j,l})^2 \le \frac{1}{T}\,E\!\left[u_t^4\,\tilde q_l^2(Z_{t-1})q_j^2(Z_{t-1})f^8(X_{t-1})\right] + o\!\left(\frac{1}{T}\right),
\]
and, since $|q_j(\cdot)|\le C_q$ and $\sum_{l=1}^{k_T}\tilde q_l^2(z)\le \zeta_0^2(k_T)$,
\[
\sum_{j=1}^{k_T}\sum_{l=1}^{k_T}E(A3_{j,l})^2
\le C_q^2\,\frac{1}{T}\sum_{j=1}^{k_T}\sum_{l=1}^{k_T}E\!\left[u_t^4\,\tilde q_l^2(Z_{t-1})f^8(X_{t-1})\right]
\le C_q^2\,\frac{\zeta_0^2(k_T)k_T}{T}\,E\!\left[u_t^4 f^8(X_{t-1})\right] = o(1).
\]
The term $A4_{j,l}$ is handled in the same way: its summand can be written as
\[
\left\{u_t^2\,\tilde q_l(Z_{t-1})r_j(X_{t-1}) - E\!\left(u_t^2\,\tilde q_l(Z_{t-1})r_j(X_{t-1})\mid X_{t-1}\right)\right\}f^4(X_{t-1})
- \tilde q_j(Z_{t-1})E\!\left(u_t^2\,\tilde q_l(Z_{t-1})r_j(X_{t-1})\mid X_{t-1}\right)f^4(X_{t-1}),
\]
so that analogous arguments, using the Cauchy–Schwarz inequality together with the bounds $|q_j(\cdot)|\le C_q$ and $\sup_z\sum_{l}\tilde q_l^2(z)\le\zeta_0^2(k_T)$, give
\[
\sum_{j=1}^{k_T}\sum_{l=1}^{k_T}E(A2_{j,l})^2 = o(1)
\qquad\text{and}\qquad
\sum_{j=1}^{k_T}\sum_{l=1}^{k_T}E(A4_{j,l})^2 = o(1).
\]
Since $A5_{j,l}$ has the same structure as $A3_{j,l}$, we also obtain $\sum_{j=1}^{k_T}\sum_{l=1}^{k_T}E(A5_{j,l})^2 = o(1)$.
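The step from the martingale-difference representation to the $1/T$ variance bound is used repeatedly above; for completeness, the underlying computation (with $d_t$ denoting a generic martingale-difference summand and $\mathcal{F}_{t-1}$ the relevant filtration) is:

```latex
% Variance of a normalized sum of martingale differences:
% if E(d_t | F_{t-1}) = 0, all cross terms vanish.
\begin{aligned}
E\Big(\frac{1}{T}\sum_{t=1}^{T} d_t\Big)^{2}
  &= \frac{1}{T^{2}}\sum_{t=1}^{T} E(d_t^{2})
   + \frac{2}{T^{2}}\sum_{s<t} E\big[d_s\,E(d_t \mid \mathcal{F}_{t-1})\big] \\
  &= \frac{1}{T^{2}}\sum_{t=1}^{T} E(d_t^{2})
   \;\le\; \frac{1}{T}\,\max_{t} E(d_t^{2}).
\end{aligned}
```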
Y. Nishiyama et al. / Journal of Econometrics 165 (2011) 112–127
Similarly,
\[
\sum_{j=1}^{k_T}\sum_{l=1}^{k_T}E(A6_{j,l})^2 = o(1), \qquad\text{since } A6_{j,l} = A4_{l,j}.
\]
The last term is
\[
A7_{j,l} = \frac{1}{T}\sum_{t=1}^{T}\big\{m_{j,l}(W_t) - E[m_{j,l}(W_t)]\big\},
\qquad\text{where } m_{j,l}(W_t) \equiv u_t^2 f^4(X_{t-1})\tilde q_j(Z_{t-1})\tilde q_l(Z_{t-1}).
\]
Then
\[
E(A7_{j,l})^2 = \frac{1}{T^2}\sum_{t}\sum_{s\le t}\mathrm{Cov}\big(m_{j,l}(W_t), m_{j,l}(W_s)\big)
\le \frac{2}{T^2}\sum_{t}\sum_{i=0}^{t-1}\big|\mathrm{Cov}\big(m_{j,l}(W_t), m_{j,l}(W_{t-i})\big)\big|.
\]
Since an absolutely regular process is an $\alpha$-mixing process and $\alpha_m \le \beta_m$, the mixing inequality implies
\[
\big|\mathrm{Cov}\big(m_{j,l}(W_t), m_{j,l}(W_{t-i})\big)\big|
\le 2\big(2^{1-1/(2+\delta)}+1\big)\,\beta_i^{1-2/(2+\delta)}\,\big\|m_{j,l}(W_t)\big\|_{2+\delta}^2,
\]
where $\|m_{j,l}(W_t)\|_{2+\delta} = \{E|m_{j,l}(W_t)|^{2+\delta}\}^{1/(2+\delta)}$. Thus
\[
E(A7_{j,l})^2 \le 4\big(2^{1-1/(2+\delta)}+1\big)\,\big\|m_{j,l}(W_t)\big\|_{2+\delta}^2\,\frac{1}{T^2}\sum_{t}\sum_{i=0}^{t-1}\beta_i^{1-2/(2+\delta)},
\]
and, since $\|m_{j,l}(W_t)\|_{2+\delta}^2 \le C_q^4\,\big\|u_t^2 f^4(X_{t-1})\big\|_{2+\delta}^2$ and $\sum_{i}\beta_i^{1-2/(2+\delta)} < \infty$,
\[
\sum_{j=1}^{k_T}\sum_{l=1}^{k_T}E(A7_{j,l})^2
\le 4C_q^4\big(2^{1-1/(2+\delta)}+1\big)\,k_T^2\,\big\|u_t^2 f^4(X_{t-1})\big\|_{2+\delta}^2\,\frac{1}{T^2}\sum_{t}\sum_{i=0}^{t-1}\beta_i^{1-2/(2+\delta)}
= O\!\left(\frac{k_T^2}{T}\right) = o(1).
\]
Combining the above bounds,
\[
E\big\|\hat M - M\big\|^2
\le \sum_{j=1}^{k_T}\sum_{l=1}^{k_T}\Big\{E(A1_{j,l})^2 + E(A2_{j,l})^2 + \cdots + E(A7_{j,l})^2\Big\}
\le C_1\frac{\zeta_0^4(k_T)}{T} + C_2\frac{k_T^2}{T} + C_3\frac{\zeta_0^2(k_T)k_T}{T}
\le C\max\left\{\frac{\zeta_0^4(k_T)}{T}, \frac{k_T^2}{T}, \frac{\zeta_0^2(k_T)k_T}{T}\right\} = o(1).
\]

Lemma 6. Let $X$ be a symmetric and positive definite $k \times k$ matrix. Suppose further $X$ has eigenvalues $\lambda_k(X) \ge \cdots \ge \lambda_1(X) > 0$ and corresponding $k \times k$ eigenvector matrix $E_X$ satisfying $E_X E_X' = E_X' E_X = I_k$. Similarly, let $M$ be a $k \times k$ symmetric and positive definite matrix with eigenvalue decomposition $M = E\Lambda E'$, where $\Lambda = \mathrm{diag}(\lambda_k, \ldots, \lambda_1)$ with $\lambda_k \ge \cdots \ge \lambda_1 > 0$ and $EE' = E'E = I_k$. It is possible to define symmetric matrices $X^{1/2} = E_X \Lambda_X^{1/2} E_X'$ and $X^{-1/2} = E_X \Lambda_X^{-1/2} E_X'$, with $\Lambda_X = \mathrm{diag}(\lambda_1(X), \ldots, \lambda_k(X))$, satisfying $X^{1/2}X^{1/2} = X$ and $X^{-1/2}X^{-1/2} = X^{-1}$; define $M^{1/2} = E\Lambda^{1/2}E'$ and $M^{-1/2} = E\Lambda^{-1/2}E'$ likewise. Then:

(i) If $X^{-1/2}$ is differentiable with respect to $X$,
\[
\frac{\partial\,\mathrm{vec}(X^{-1/2})}{\partial\,\mathrm{vec}(X)'} = -(X^{-1/2} \otimes X^{-1/2})(X^{1/2} \otimes I_k + I_k \otimes X^{1/2})^{-1}.
\]
(ii) If $X^{-1/2}$ is differentiable at $X = M$,
\[
\mathrm{vec}(X^{-1/2} - M^{-1/2}) = -(M^{-1/2} \otimes M^{-1/2})(M^{1/2} \otimes I_k + I_k \otimes M^{1/2})^{-1}\,\mathrm{vec}(X - M) + o(\|X - M\|).
\]
(iii) $\displaystyle \|X^{-1/2} - M^{-1/2}\|^2 \le \frac{k\|X - M\|^2}{\lambda_1^3} + o(\|X - M\|^2)$.

Proof. Let $X^{1/2} = E_X \Lambda_X^{1/2} E_X'$; then $X^{1/2}X^{-1/2} = I_k$ and $X^{1/2}X^{1/2} = X$. Differentiating both equalities, we have
\[
(dX^{1/2})X^{-1/2} + X^{1/2}(dX^{-1/2}) = dI_k = 0
\qquad\text{and}\qquad
(dX^{1/2})X^{1/2} + X^{1/2}(dX^{1/2}) = dX.
\]
Because $\mathrm{vec}(ABC) = (C' \otimes A)\mathrm{vec}(B)$ and $(A \otimes B)(C \otimes D) = AC \otimes BD$ for conformable matrices $A, B, C, D$, we obtain
\[
\mathrm{vec}(dX^{-1/2}) = -(I_k \otimes X^{1/2})^{-1}(X^{-1/2} \otimes I_k)\,\mathrm{vec}(dX^{1/2}) = -(X^{-1/2} \otimes X^{-1/2})\,\mathrm{vec}(dX^{1/2}) \tag{B.4}
\]
and
\[
\mathrm{vec}(dX^{1/2}) = (I_k \otimes X^{1/2} + X^{1/2} \otimes I_k)^{-1}\,\mathrm{vec}(dX). \tag{B.5}
\]
Substituting (B.5) into (B.4), we have
\[
\mathrm{vec}(dX^{-1/2}) = -(X^{-1/2} \otimes X^{-1/2})(I_k \otimes X^{1/2} + X^{1/2} \otimes I_k)^{-1}\,\mathrm{vec}(dX),
\]
which completes the proof for (i). It is straightforward that (ii) holds if $X^{-1/2}$ is differentiable at $X = M$. To prove (iii), using (ii), we have
\[
\begin{aligned}
\|X^{-1/2} - M^{-1/2}\|^2
&= \mathrm{vec}(X^{-1/2} - M^{-1/2})'\,\mathrm{vec}(X^{-1/2} - M^{-1/2}) \\
&= \mathrm{vec}(X - M)'(I_k \otimes M^{1/2} + M^{1/2} \otimes I_k)^{-1}(M^{-1} \otimes M^{-1})(M^{1/2} \otimes I_k + I_k \otimes M^{1/2})^{-1}\,\mathrm{vec}(X - M) + o(\|X - M\|^2) \\
&\le \|\mathrm{vec}(X - M)\|^2\,\big\|(I_k \otimes M^{1/2} + M^{1/2} \otimes I_k)^{-1}(M^{-1} \otimes M^{-1})(M^{1/2} \otimes I_k + I_k \otimes M^{1/2})^{-1}\big\| + o(\|X - M\|^2).
\end{aligned}
\]
We evaluate the norms in the last quantity. We obviously have $\|\mathrm{vec}(X - M)\|^2 = \|X - M\|^2$. Because $M^{1/2} = E\Lambda^{1/2}E'$ and $M^{-1} = E\Lambda^{-1}E'$, we have
\[
M^{1/2} \otimes I_k + I_k \otimes M^{1/2} = (E \otimes E)(\Lambda^{1/2} \otimes I_k + I_k \otimes \Lambda^{1/2})(E' \otimes E')
\quad\text{and}\quad
M^{-1} \otimes M^{-1} = (E \otimes E)(\Lambda^{-1} \otimes \Lambda^{-1})(E' \otimes E').
\]
Therefore, using $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$ for invertible matrices $A, B$, we have
\[
\begin{aligned}
\big\|(I_k \otimes M^{1/2} &+ M^{1/2} \otimes I_k)^{-1}(M^{-1} \otimes M^{-1})(M^{1/2} \otimes I_k + I_k \otimes M^{1/2})^{-1}\big\|^2 \\
&= \big\|(E \otimes E)(\Lambda^{1/2} \otimes I_k + I_k \otimes \Lambda^{1/2})^{-1}(\Lambda^{-1} \otimes \Lambda^{-1})(\Lambda^{1/2} \otimes I_k + I_k \otimes \Lambda^{1/2})^{-1}(E' \otimes E')\big\|^2 \\
&\le \mathrm{tr}\big((\Lambda^{1/2} \otimes I_k + I_k \otimes \Lambda^{1/2})^{-2}(\Lambda^{-2} \otimes \Lambda^{-2})(\Lambda^{1/2} \otimes I_k + I_k \otimes \Lambda^{1/2})^{-2}\big) \\
&= \sum_{i=1}^{k}\sum_{j=1}^{k}\frac{1}{(\sqrt{\lambda_i} + \sqrt{\lambda_j})^4}\cdot\frac{1}{\lambda_i^2\lambda_j^2}
\;\le\; \frac{k^2}{\lambda_1^6}.
\end{aligned}
\]
We use that $\Lambda^{1/2} \otimes I_k + I_k \otimes \Lambda^{1/2}$ and $\Lambda^{-1} \otimes \Lambda^{-1}$ are diagonal matrices in the last inequality. Thus
\[
\|X^{-1/2} - M^{-1/2}\|^2 \le \frac{k\|X - M\|^2}{\lambda_1^3} + o(\|X - M\|^2).
\]
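Lemma 6(iii) is easy to sanity-check numerically. The sketch below (assuming NumPy; the matrix dimensions, seed, and perturbation size are illustrative) builds the symmetric inverse square root by eigendecomposition, exactly as in the lemma's construction, and verifies the bound for a small perturbation of M:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4

def inv_sqrt(A):
    # symmetric inverse square root via eigendecomposition, as in Lemma 6
    w, V = np.linalg.eigh(A)
    return V @ np.diag(w ** -0.5) @ V.T

# random symmetric positive definite M, and a small symmetric perturbation X of M
B = rng.standard_normal((k, k))
M = B @ B.T + k * np.eye(k)
D = rng.standard_normal((k, k)) * 1e-4
X = M + (D + D.T) / 2

lam1 = np.linalg.eigvalsh(M).min()          # smallest eigenvalue of M
lhs = np.linalg.norm(inv_sqrt(X) - inv_sqrt(M), "fro") ** 2
rhs = k * np.linalg.norm(X - M, "fro") ** 2 / lam1 ** 3

# the o(||X - M||^2) remainder is negligible for this perturbation size
assert lhs <= rhs
```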
References

Bierens, H.J., 1982. Consistent model specification tests. Journal of Econometrics 20 (1), 105–134.
Bierens, H.J., 1984. Model specification testing of time series regressions. Journal of Econometrics 26 (3), 323–353.
Bierens, H.J., 1990. A consistent conditional moment test of functional form. Econometrica 58 (6), 1443–1458.
Bierens, H.J., 2004. Introduction to the Mathematical and Statistical Foundations of Econometrics. Cambridge University Press.
Bierens, H.J., Ploberger, W., 1997. Asymptotic theory of integrated conditional moment tests. Econometrica 65 (5), 1129–1151.
Chen, X., Fan, Y., 1999. Consistent hypothesis testing in semiparametric and nonparametric models for econometric time series. Journal of Econometrics 91, 373–401.
Comte, F., Lieberman, O., 2000. Second-order noncausality in multivariate GARCH processes. Journal of Time Series Analysis 21 (5), 535–557.
Doornik, J.A., 2007. Object-Oriented Matrix Programming Using Ox, 3rd ed. Timberlake Consultants Press, London. www.doornik.com.
Fan, Y., Li, Q., 1996. Consistent model specification tests: omitted variables and semiparametric functional forms. Econometrica 64 (4), 865–890.
Gao, J., King, M., 2004. Adaptive estimation in continuous-time diffusion models. Econometric Theory 20, 844–882.
Geweke, J., 1982. Measurement of linear dependence and feedback between multiple time series. Journal of the American Statistical Association 77, 303–323.
Granger, C.W.J., 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37 (3), 424–438.
Hansen, B.E., 2008. Uniform convergence rates for kernel estimation with dependent data. Econometric Theory 24 (3), 726–748.
Hiemstra, C., Jones, D., 1994. Testing for linear and nonlinear Granger causality in the stock price–volume relation. Journal of Finance 49 (5), 1639–1664.
Hidalgo, J., 2000. Nonparametric test for causality with long-range dependence. Econometrica 68 (6), 1465–1491.
Hitomi, K., 2000. Common structure of consistent misspecification tests and a new test. Mimeo.
Hong, Y., White, H., 1995. Consistent specification testing via nonparametric series regression. Econometrica 63 (5), 1133–1159.
Hosoya, Y., 1977. On the Granger condition for non-causality. Econometrica 45 (7), 1735–1736.
Hosoya, Y., 1991. The decomposition and measurement of the interdependency between second-order stationary processes. Probability Theory and Related Fields 88, 429–444.
Lütkepohl, H., Poskitt, D.S., 1996. Testing for causation using infinite order vector autoregressive processes. Econometric Theory 12, 61–87.
Newey, W.K., 1997. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics 79 (1), 147–168.
Okui, R., Hitomi, K., 2002. A consistent test for omitted variables in nonparametric regression. Mimeo.
Qiao, Z., McAleer, M., Wong, W.K., 2009. Linear and nonlinear causality between changes in consumption and consumer attitudes. Economics Letters 102, 161–164.
Robinson, P.M., 1989. Hypothesis testing in semiparametric and nonparametric models for econometric time series. Review of Economic Studies 56, 511–534.
Sims, C.A., 1972. Money, income, and causality. American Economic Review 62, 540–552.
Sims, C.A., Stock, J.H., Watson, M.W., 1990. Inference in linear time series models with some unit roots. Econometrica 58, 113–144.
Stute, W., 1997. Nonparametric model checks for regression. Annals of Statistics 25 (2), 613–641.
Toda, H.Y., Phillips, P.C.B., 1993. Vector autoregressions and causality. Econometrica 61, 1367–1393.
Von Neumann, J., 1941. Distribution of the ratio of the mean squared successive difference to the variance. Annals of Mathematical Statistics 12, 367–395.
Whang, Y.J., 2000. Consistent bootstrap tests of parametric regression functions. Journal of Econometrics 98 (1), 27–46.
Yoshihara, K., 1976. Limiting behavior of U-statistics for stationary, absolutely regular processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 35, 237–252.
Journal of Econometrics 165 (2011) 128–136
Linear programming-based estimators in simple linear regression

Daniel Preve a,∗, Marcelo C. Medeiros b

a Department of Statistics, Uppsala University and School of Economics, Singapore Management University, Box 513, 751 20 Uppsala, Sweden
b Pontifical Catholic University of Rio de Janeiro, Brazil

Article history: Available online 12 May 2011

Keywords: Linear regression; Endogeneity; Linear programming estimator; Quasi-maximum likelihood estimator; Exact distribution
Abstract. In this paper we introduce a linear programming estimator (LPE) for the slope parameter in a constrained linear regression model with a single regressor. The LPE is interesting because it can be superconsistent in the presence of an endogenous regressor and, hence, preferable to the ordinary least squares estimator (LSE). Two different cases are considered as we investigate the statistical properties of the LPE. In the first case, the regressor is assumed to be fixed in repeated samples. In the second, the regressor is stochastic and potentially endogenous. For both cases the strong consistency and exact finite-sample distribution of the LPE are established. Conditions under which the LPE is consistent in the presence of serially correlated, heteroskedastic errors are also given. Finally, we describe how the LPE can be extended to the case with multiple regressors and conjecture that the extended estimator is consistent under conditions analogous to the ones given herein. Finite-sample properties of the LPE and extended LPE in comparison to the LSE and instrumental variable estimator (IVE) are investigated in a simulation study. One advantage of the LPE is that it does not require an instrument. © 2011 Elsevier B.V. All rights reserved.
1. Introduction The use of certain linear programming estimators in time series analysis is well documented. See, for instance, Davis and McCormick (1989), Feigin and Resnick (1994) and Feigin et al. (1996). LPEs can yield much more precise estimates than traditional methods such as conditional least squares (e.g. Datta et al., 1998; Nielsen and Shephard, 2003). The limited success of these estimators in applied work can be partially explained by the fact that their point process limit theory complicates the use of their asymptotics for inference (e.g. Datta and McCormick, 1995). In regression analysis, it is well known that the ordinary least squares estimator is inconsistent for the regression parameters when the error term is correlated with the explanatory variables of the model. In this case an instrumental variables estimator or the generalized method of moments may be used instead. In economics, such endogenous explanatory variables could be caused by measurement error, simultaneity or omitted variables. To the authors’ knowledge, however, there has so far been no attempt to investigate the statistical properties of LP-based estimators in a cross-sectional setting. In this paper we show that LPEs can, under certain circumstances, be a preferable alternative
∗ Corresponding author. E-mail addresses: [email protected] (D. Preve), [email protected] (M.C. Medeiros).
doi:10.1016/j.jeconom.2011.05.011
to LS and IV estimators for the slope parameter in a simple linear regression model. We look at two types of regressors which are likely to be of practical importance. First, we introduce LPEs to the simple case of a non-stochastic regressor. Second, we consider the general case of a stochastic, and potentially endogenous, regressor. For both cases we establish the strong consistency and exact finite-sample distribution of an LPE for the slope parameter. The LPE can be used in situations where the regressor is strictly positive. For example, in empirical finance, we can consider regressions involving volatility and volume. In labor economics a possible application is the regression between income and schooling, for example. The remainder of the paper is organized as follows. In Section 2, we establish the strong consistency and exact finite-sample distribution of the LPE when (1) the explanatory variable is non-stochastic, and (2) the explanatory variable is stochastic and potentially endogenous. In Section 3, we discuss how our results can be extended to other endogenous specifications and give conditions under which the LPE is consistent in the presence of serially correlated, heteroskedastic errors. We also describe how the LPE can be extended to the case with multiple regressors. Section 4 reports the simulation results of a Monte Carlo study comparing the LPE and extended LPE to the LSE and IVE. Section 5 concludes. Mathematical proofs are collected in the Appendix. An extended Appendix available on request from the authors contains some results mentioned in the text but omitted from the paper to save space.
D. Preve, M.C. Medeiros / Journal of Econometrics 165 (2011) 128–136

Table 1
Ratio distributions with accompanying moments of βˆn. Fz(z) is the cdf of the ratio z = u1/x1, with parameter θ = θu/θx, on which the moments are based. Results hold for γ = 0 and n > 2. Γ(·) is the gamma function.

| Ratio | Fz(z), z > 0 | E(βˆn − β) | Var(βˆn − β) |
|---|---|---|---|
| Exp(θu)/Exp(θx) | 1 − 1/(1 + θ⁻¹z) | θ/(n − 1) | θ²n/((n − 2)(n − 1)²) |
| U(0, θu)/U(0, θx) | z/(2θ) for z ≤ θ; 1 − θ/(2z) for z > θ | (2θ/(n + 1))[1 + 1/((n − 1)2ⁿ)] | O(1/n²) |
| Ra(θu)/Ra(θx) | 1 − 1/(1 + θ⁻²z²) | (θ√π/2) Γ(n − 1/2)/Γ(n) | θ²[1/(n − 1) − (π/4)Γ²(n − 1/2)/Γ²(n)] |

2. Assumptions and results

2.1. Non-stochastic explanatory variable

The first regression model we consider is
yi = βxi + ui, ui = α + εi, i = 1, …, n,
where the response variable yi and the explanatory variable xi are observed, and ui is the unobserved non-zero mean random error. β is the unknown regression parameter of interest. We assume that {xi} is a nonrandom sequence of strictly positive reals, whereas {ui} is a sequence of independent identically distributed (i.i.d.) nonnegative random variables (RVs). For ease of exposition we assume that E(ui) = α. The potentially unknown distribution function Fu of ui is allowed to roam freely subject only to the restriction that it is supported on the nonnegative reals. A well known continuous probability distribution with nonnegative support is the Weibull distribution, which can approximate the shape of a Gaussian distribution quite well.

A 'quick and dirty' estimator of the slope parameter, based on the nonnegativity of the random errors, is given by

βˆ = min{y1/x1, …, yn/xn}.  (1)

This estimator has been used to estimate β in certain constrained first-order autoregressive time series models, yi = βxi + ui, with xi = yi−1 (e.g. Datta and McCormick, 1995; Nielsen and Shephard, 2003). As it happens, (1) may be viewed as the solution to the linear programming problem of maximizing the objective function f(β) = β subject to the n constraints yi − βxi ≥ 0. Because of this we will sometimes refer to βˆ as an LPE. Regardless of whether the regressor is stochastic or non-stochastic, (1) is also the maximum likelihood estimator (MLE) of β when the errors are exponentially distributed. What is interesting, however, is that βˆ consistently estimates β for a wide range of error distributions; thus the LPE is also a quasi-MLE. Assumption 1 holds throughout the section.

Assumption 1. Let yi = βxi + ui (i = 1, …, n) where ui = α + εi and
(i) {xi} is a nonrandom sequence of strictly positive reals,
(ii) 0 is not a limit point of S ≡ {x1, x2, …},
(iii) {ui} is an i.i.d. sequence of nonnegative RVs,
(iv) inf{u : Fu(u) > 0} = 0,
(v) E(εi) = 0.

Note that β can be any real number and that conditions (iii) and (v) combined imply that the mean of ui is α ≥ 0. Since βˆn − β = Rn, where Rn = min{ui/xi}, it is clear that P(βˆn − β ≤ z) = 0 for all z < 0 and, hence, the LPE is positively biased. Moreover, as (1) is nonincreasing in the sample size, its accuracy either remains the same or improves as n increases. Proposition 1 gives the exact distribution of the LPE in the case of a non-stochastic regressor.

Proposition 1. Under Assumption 1,

P(βˆn − β ≤ z) = 1 − ∏_{i=1}^{n} [1 − Fu(xi z)].

The proof of the proposition follows from the observation that P(βˆn − β ≤ z) = P(Rn ≤ z) = 1 − P(u1 > x1 z, …, un > xn z), and condition (iii) of Assumption 1. By condition (iv), Fu(u) > 0 for every u > 0, implying that βˆ consistently estimates β.¹ Intuitively, this is because the left-tail condition on ui implies that the probability of obtaining an error arbitrarily close to 0 is nonzero and, hence, that (1) is likely to be precise in large samples.

¹ If xi instead is assumed to be strictly negative then the estimator max{yi/xi} is strongly consistent for β.

Corollary 1. Under Assumption 1, βˆn → β almost surely as n → ∞.

From Corollary 1 it follows that α (the unknown mean of the error term) can be consistently estimated by

αˆ = (1/n) ∑_{i=1}^{n} (yi − βˆxi),  (2)

the sample mean of the residuals, under fairly weak conditions.²

² If, under Assumption 1, α < ∞ and if n⁻¹ ∑_{i=1}^{n} xi is O(1) as n → ∞.

It is worth noting that the MLE of β satisfies the stochastic inequality βˆML ≤ βˆ. Regardless of whether xi is stochastic or non-stochastic, in some cases the LPE will be equal to βˆML. For instance, it is readily verified that if the random errors are (1) exponentially distributed with non-zero density function (1/a)exp{−u/a} for u ≥ 0,

βˆML = βˆ, aˆML = αˆ,  (3)

and (2) uniformly distributed on the interval [0, b],

βˆML = βˆ, bˆML = max{yi − βˆxi}.  (4)

As an illustration of Proposition 1 in action, Corollary 2 shows that the exact distribution of βˆ − β when the errors are Weibull distributed, and the regressor is non-stochastic, is also Weibull. The Weibull distribution, with distribution function 1 − exp{−(u/a)ᵇ} for u ≥ 0, nests the well known exponential (b = 1) and Rayleigh (b = 2) distributions.

Corollary 2. Let the regression errors be Weibull distributed. Then, under Assumption 1,

P(βˆn − β ≤ z) = 1 − exp{−[z / (a(∑_{i=1}^{n} xiᵇ)^{−1/b})]ᵇ},

if z ≥ 0 and 0 otherwise. Hence, βˆn − β is Weibull with scale parameter a(∑_{i=1}^{n} xiᵇ)^{−1/b} and shape parameter b.

For example, in view of Corollary 2 with b = 1 it is clear that ∑_{i=1}^{n} xi(βˆn − β) is exponentially distributed with scale parameter a. Moreover, by (3) and basic results of large sample theory, the statistic (1/αˆn) ∑_{i=1}^{n} xi(βˆn − β) is asymptotically standard exponential.
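As a quick illustration of the estimator (1) and of the exponential-ratio row of Table 1, the following small Monte Carlo sketch can be used (assuming NumPy; the seed, sample size, and number of replications are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
beta, n, reps = 2.5, 100, 2000
theta_u, theta_x = 1.0, 1.0          # u ~ Exp(theta_u), x ~ Exp(theta_x)
theta = theta_u / theta_x

err = np.empty(reps)
for r in range(reps):
    x = rng.exponential(theta_x, n)  # strictly positive regressor
    u = rng.exponential(theta_u, n)  # nonnegative errors, inf of support = 0
    y = beta * x + u
    err[r] = np.min(y / x) - beta    # LPE error; equals min(u_i / x_i) >= 0

# Table 1 (exponential ratio, gamma = 0): E(bhat_n - beta) = theta / (n - 1)
mc_mean, exact_mean = err.mean(), theta / (n - 1)
```

With n = 100 the exact mean is θ/(n − 1) ≈ 0.0101, and the Monte Carlo average should agree with it to roughly three decimal places.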
Table 2 Simulation results: univariate regression with i.i.d. uniformly distributed errors—specification (i). Each table entry, based on 1000 Monte Carlo replications, reports the empirical bias/mean squared error (MSE) of different estimators for the slope parameter β1 = 2.5 in the univariate regression yi = 2.5x1i + ui , where x1i = v1i + γ ui , v1i ∼ χ 2 (3) and ui ∼ U (0, 10). The following estimators are considered: the ordinary least squares estimator (LSE), the instrumental variable estimator (IVE) and the linear programming estimator (LPE). For the IVE, the variable v1i is used as an instrument. Finally, for both the LSE and IVE an intercept is included in the regression equation. Different sample sizes (n) and levels of endogeneity (γ ) are considered.
[Table 2 entries: empirical bias and MSE of the LSE, IVE and LPE for β1, for n = 50, 100, 200, 500, 1000, 2000 and γ = 0, 0.25, 0.5, 1.]
2.2. Stochastic explanatory variable

The second regression model we consider is
yi = βxi + ui, xi = vi + γui, ui = α + εi,
where {vi} is an i.i.d. sequence of nonnegative RVs, {ui} and {vi} are mutually independent, and γ ≥ 0, such that Cov(xi, ui) = γVar(ui). The parameter γ is potentially unknown. For this model the explanatory variable and error are uncorrelated if and only if γ = 0. In this case E(yi | xi) = βxi + α. Here γ > 0 is a typical setting in which the LSE of β is inconsistent.³

³ More specifically, βˆLS → β + γVar(ui)/[Var(vi) + γ²Var(ui)] in probability as n → ∞, provided the variances exist.

Assumption 2 holds throughout the section.

Assumption 2. Let yi = βxi + ui (i = 1, …, n) where ui = α + εi and
(i) xi = vi + γui for some γ ≥ 0,
(ii) {ui} and {vi} are mutually independent i.i.d. sequences of nonnegative RVs,
(iii) inf{u : Fu(u) > 0} = 0,
(iv) P(vi = 0) = 0,
(v) E(εi) = 0.

Conditions (i) through (iv) ensure that xi is strictly positive and, hence, that (1) is well-defined. Also for this case the exact distribution of the LPE can be obtained. For ease of exposition, we only give the result for the important special case when the related distributions are continuous.

Proposition 2. Let ui and vi be (absolutely) continuous RVs with pdfs fu and fv, respectively, and let 1{·} denote the indicator function. Then, under Assumption 2, P(βˆn − β ≤ z) = 1 − [1 − Fz(z)]ⁿ, where

Fz(z) = 1{z > 0} ∫₀ᶻ ∫₀^∞ x fv(x) fu(tx) dx dt  (5)
if γ = 0, and

Fz(z) = 1{0 < z < 1/γ} ∫₀ᶻ ∫₀^∞ x fv(x − γtx) fu(tx) dx dt + 1{z ≥ 1/γ}

otherwise.

For a simple example, consider the case when ui and vi are standard exponentially distributed RVs and γ is non-zero. Then, in view of Proposition 2,

Fz(z) = 1{0 < z < 1/γ} ∫₀ᶻ ∫₀^∞ x exp{−x[1 + (1 − γ)t]} dx dt + 1{z ≥ 1/γ}.

Hence, if γ ≠ 1,

Fz(z) = 1{0 < z < 1/γ} ∫₀ᶻ dt/[1 + (1 − γ)t]² + 1{z ≥ 1/γ}
     = 1{0 < z < 1/γ} [1/(1 − γ) − 1/((1 − γ)[1 + (1 − γ)z])] + 1{z ≥ 1/γ}.

Similarly, if γ = 1 then Fz(z) = 1{0 < z < 1} z + 1{z ≥ 1}.

Proposition 3. Under Assumption 2, βˆn → β almost surely as n → ∞.

Table 3
Simulation results: bivariate regression with i.i.d. uniformly distributed errors—specification (ii). Each table entry, based on 1000 Monte Carlo replications, reports the empirical bias/mean squared error (MSE) of different estimators for the slope parameters β1 = 2.5 and β2 = −1.5 in the bivariate regression yi = 2.5x1i − 1.5x2i + ui, where x1i = v1i + γui, x2i = v2i + γui, with v1i ∼ χ²(3) and v2i ∼ χ²(4), and ui ∼ U(0, 10). The following estimators are considered: the ordinary least squares estimator (LSE), the instrumental variable estimator (IVE) and the extended linear programming estimator (LPE). For the IVE, the variables v1i and v2i are used as instruments. Finally, for both the LSE and IVE an intercept is included in the regression equation. Different sample sizes (n) and levels of endogeneity (γ) are considered.
[Table 3 entries: empirical bias and MSE of the LSE, IVE and extended LPE for β1 and β2, for n = 50, 100, 200, 500, 1000, 2000 and γ = 0, 0.25, 0.5, 1.]
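The closed-form cdf in the exponential example can be checked by simulation. Note that for γ ≠ 1 the expression simplifies algebraically to z/(1 + (1 − γ)z) on 0 < z < 1/γ. A sketch (assuming NumPy; the value of γ, the sample size, and the evaluation grid are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, n = 0.5, 200_000
u = rng.exponential(1.0, n)
v = rng.exponential(1.0, n)
z = u / (v + gamma * u)              # the ratio u_1/x_1 from Proposition 2

def F_z(q, g):
    # closed form for standard exponential u, v and 0 < g, g != 1
    if q <= 0:
        return 0.0
    if q >= 1 / g:
        return 1.0
    return q / (1 + (1 - g) * q)     # simplified form of the display above

# empirical cdf of the simulated ratio should match the closed form
for q in (0.1, 0.5, 1.0, 1.5):
    assert abs((z <= q).mean() - F_z(q, gamma)) < 0.01
```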
Table 4
Simulation results: univariate regression with heteroskedastic errors—specification (iii). Each table entry, based on 1000 Monte Carlo replications, reports the empirical bias/mean squared error (MSE) of different estimators for the slope parameter β1 = 2.5 in the univariate regression yi = 2.5x1i + σi ui, where x1i = v1i + γui, v1i ∼ χ²(3), σi² = 0.25 + 0.75√(i/n) and ui ∼ U(0, 12). The following estimators are considered: the ordinary least squares estimator (LSE), the instrumental variable estimator (IVE) and the linear programming estimator (LPE). For the IVE, the variable v1i is used as an instrument. Finally, for both the LSE and IVE an intercept is included in the regression equation. Different sample sizes (n) and levels of endogeneity (γ) are considered.
[Table 4 entries: empirical bias and MSE of the LSE, IVE and LPE for β1, for n = 50, 100, 200, 500, 1000, 2000 and γ = 0, 0.25, 0.5, 1.]
Hence, under Assumption 2, the LPE is strongly consistent in the presence of an endogenous regressor. Once more, it follows that αˆ in (2) is consistent for α under fairly weak conditions.⁴

⁴ If, under Assumption 2, both E(vi) and α are finite.

3. Extensions

In the previous section we aimed for clarity at the expense of generality. For example, in the case with a stochastic regressor, we assumed that xi = vi + γui even though other endogenous specifications, such as xi = vi uiᵞ, also are possible. In this section we discuss how the results of Section 2 can be extended.

3.1. Serially correlated, heteroskedastic errors

A proof similar to that of Theorem 3.1 in Preve (2011) shows that the LPE remains consistent for certain serially correlated error specifications such as

ui = α + εi + ∑_{k=1}^{q} ψk εi−k,

or ui = α + εi + ψεi−1εi−2. Consistency also holds for certain heteroskedastic specifications. Because of this, βˆ can be used to seek sources of misspecification in the errors.

Proposition 4. Let yi = βxi + σi ui (i = 1, …, n) where
(i) xi = vi + γh(σi)ui for some γ ≥ 0 and h : (0, ∞) → (0, ∞),
(ii) {vi} is an i.i.d. sequence of nonnegative RVs, mutually independent of {ui}, with P(vi = 0) = 0,
(iii) {σi} is a deterministic sequence of strictly positive reals with sup{σi} < ∞,
(iv) {ui} is a sequence of m-dependent identically distributed nonnegative RVs with inf{u : Fu(u) > 0} = 0.

Then, βˆn → β in probability as n → ∞. The endogeneity parameter γ, the map h(·) and the specification of {σi} are potentially unknown. m ∈ N is finite and also potentially unknown.
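Proposition 4 can be illustrated numerically. The sketch below (assuming NumPy; the variance path σi² = 0.25 + 0.75√(i/n) mirrors the simulation captions, while the other design choices are illustrative, not from the paper) shows the LPE error shrinking with n despite endogeneity and heteroskedasticity:

```python
import numpy as np

rng = np.random.default_rng(7)
beta, gamma = 2.5, 0.5
errors = []
for n in (500, 5000, 50000):
    i = np.arange(1, n + 1)
    sigma = np.sqrt(0.25 + 0.75 * np.sqrt(i / n))  # bounded deterministic scales
    v = rng.chisquare(3, n)                        # strictly positive v_i
    u = rng.uniform(0, 10, n)                      # nonnegative, inf of support = 0
    x = v + gamma * u                              # endogenous regressor
    y = beta * x + sigma * u
    errors.append(np.min(y / x) - beta)            # LPE error = min of sigma_i*u_i/x_i
```

The recorded errors are nonnegative by construction and should be close to zero at the largest sample size.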
Table 5
Simulation results: bivariate regression with heteroskedastic errors—specification (iv). Each table entry, based on 1000 Monte Carlo replications, reports the empirical bias/mean squared error (MSE) of different estimators for the slope parameters β1 = 2.5 and β2 = −1.5 in the bivariate regression yi = 2.5x1i − 1.5x2i + σi ui, where x1i = v1i + γui, x2i = v2i + γui, with v1i ∼ χ²(3) and v2i ∼ χ²(4), σi² = 0.25 + 0.75√(i/n) and ui ∼ U(0, 12). The following estimators are considered: the ordinary least squares estimator (LSE), the instrumental variable estimator (IVE) and the extended linear programming estimator (LPE). For the IVE, the variables v1i and v2i are used as instruments. Finally, for both the LSE and IVE an intercept is included in the regression equation. Different sample sizes (n) and levels of endogeneity (γ) are considered.
γ =0
n
LSE
IVE
β1
β2
LPE
β1
β2
β1
β2
Bias
MSE
Bias
MSE
Bias
MSE
Bias
MSE
Bias
MSE
Bias
MSE
50 100 200 500 1000 2000
−0.002
0.003 0.001 0.001 0.000 0.000 0.000
−0.001
0.002 0.001 0.000 0.000 0.000 0.000
−0.002
0.003 0.001 0.001 0.000 0.000 0.000
−0.001
0.002 0.001 0.000 0.000 0.000 0.000
0.026 0.013 0.007 0.003 0.001 0.001
0.002 0.001 0.000 0.000 0.000 0.000
0.002 0.001 0.001 0.000 0.000 0.000
0.001 0.000 0.000 0.000 0.000 0.000
n
γ = 0.25
0.001 0.001 −0.001 −0.000 0.000
0.001
−0.001 0.000 0.000 −0.000
LSE
0.001
−0.001 0.000 0.000 −0.000
IVE
β1
β2
Bias 50 100 200 500 1000 2000
0.001 0.001 −0.001 −0.000 0.000
MSE
0.031 0.029 0.030 0.028 0.028 0.029
0.003 0.002 0.002 0.001 0.001 0.001
Bias 0.025 0.021 0.021 0.021 0.021 0.021
LPE
β1
β2
MSE
Bias
MSE
0.003 0.001 0.001 0.001 0.001 0.001
−0.000 −0.001
0.002 0.001 0.001 0.000 0.000 0.000
0.001
−0.000 −0.000 0.000
Bias 0.002 0.002 0.000 0.000 0.001 0.000
β1
β2
MSE
Bias
MSE
Bias
MSE
0.002 0.001 0.001 0.000 0.000 0.000
0.028 0.014 0.007 0.003 0.001 0.001
0.003 0.001 0.000 0.000 0.000 0.000
0.001 0.001 0.001 0.000 0.000 0.000
0.001 0.000 0.000 0.000 0.000 0.000
γ = 0.5
n
LSE
IVE
β1
β2
Bias 50 100 200 500 1000 2000
MSE
0.060 0.057 0.055 0.055 0.055 0.054
0.006 0.005 0.004 0.003 0.003 0.003
Bias 0.044 0.042 0.042 0.042 0.041 0.041
LPE
β1
β2
β1
β2
MSE
Bias
MSE
Bias
MSE
Bias
MSE
Bias
MSE
0.004 0.003 0.002 0.002 0.002 0.002
−0.001
0.003 0.001 0.001 0.000 0.000 0.000
−0.001 −0.000
0.002 0.001 0.001 0.000 0.000 0.000
0.025 0.014 0.007 0.003 0.001 0.001
0.002 0.001 0.000 0.000 0.000 0.000
0.003 0.001 0.000 0.000 0.000 0.000
0.001 0.000 0.000 0.000 0.000 0.000
0.001
−0.001 −0.000 −0.000 −0.001
0.000 0.000 −0.000 0.000
γ =1
n
LSE
IVE
β1
β2
Bias 50 100 200 500 1000 2000
MSE
0.100 0.100 0.098 0.096 0.096 0.095
0.012 0.011 0.010 0.010 0.009 0.009
Bias 0.078 0.074 0.072 0.071 0.072 0.072
β2
β1
β2
MSE
Bias
MSE
Bias
MSE
Bias
MSE
Bias
MSE
0.008 0.006 0.006 0.005 0.005 0.005
−0.004 −0.000 −0.000 −0.000 −0.001 −0.000
0.003 0.001 0.001 0.000 0.000 0.000
−0.000
0.002 0.001 0.001 0.000 0.000 0.000
0.024 0.013 0.007 0.003 0.001 0.001
0.002 0.001 0.000 0.000 0.000 0.000
0.004 0.001 0.001 0.000 0.000 0.000
0.001 0.001 0.000 0.000 0.000 0.000
The σi are scaling constants which express the possible heteroskedasticity. The map h(·) allows for a heteroskedastic regressor. Condition (iii) is quite general and allows for various standard specifications, including abrupt breaks or smooth transitions such as σi = √(σ0² + (σ1² − σ0²) i/n). If σi is not a function of n, then the convergence of β̂n is almost sure.
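To see Proposition 4 at work, here is a minimal Python sketch (assumptions: the LPE is computed as β̂n = min_i yi/xi, the p = 1 solution of the linear program described in Section 3.2; h(s) = s; all numerical values are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
n, beta, gamma = 5000, 2.5, 0.5

# Condition (iv): 1-dependent, identically distributed, nonnegative errors
# whose distribution has essential infimum 0: u_i = w_i * (1 + 0.8 * w_{i-1}).
w = rng.uniform(0, 1, n + 1)
u = w[1:] * (1 + 0.8 * w[:-1])

# Condition (iii): deterministic, strictly positive, bounded scale sequence
# with a smooth transition: sigma_i^2 = 0.25 + 0.75 * i/n.
idx = np.arange(1, n + 1)
sigma = np.sqrt(0.25 + 0.75 * idx / n)

# Conditions (i)-(ii): v_i ~ chi^2(3) and an endogenous regressor with h(s) = s.
v = rng.chisquare(3, n)
x = v + gamma * sigma * u

y = beta * x + sigma * u
beta_hat = np.min(y / x)  # the LPE; never below beta since the errors are nonnegative
print(beta_hat)
```

In runs of this sketch β̂n lands close to β = 2.5 despite the serial dependence, heteroskedasticity and endogeneity, in line with Proposition 4.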
3.2. Multiple regressors

Let yi = ∑_{j=1}^p βj xji + ui (i = 1, . . . , n) and, along the lines of Feigin and Resnick (1994), let β̂ = (β̂1, . . . , β̂p)′ be the solution to the linear programming problem of maximizing the objective function f(β1, . . . , βp) = ∑_{j=1}^p βj subject to the n constraints yi − ∑_{j=1}^p βj xji ≥ 0. Note that (1) is the solution to the above problem for the special case when p = 1. The finite-sample and asymptotic properties of the extended LPE β̂ are the subject of further research. We conjecture that the extended LPE consistently estimates β = (β1, . . . , βp)′ under conditions analogous to
Assumption 2. The proposed estimator is easily computable using standard numerical computing environments such as MATLAB. Our simulations indicate that the extended LPE can have very reasonable finite-sample properties, also in the presence of heteroskedastic or serially correlated errors.5

4. Monte Carlo results

In this section we report simulation results concerning the estimation of the slope parameters β1 and β2 in the regression

yi = β1 x1i + β2 x2i + σi ui,
x1i = v1i + γ ui,
x2i = v2i + γ ui,   i = 1, . . . , n,   (6)
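For illustration, the extended LPE of Section 3.2 applied to data generated from (6) can be sketched in Python (a hedged stand-in for the paper's MATLAB linprog code, using scipy.optimize.linprog; the values σi = 1, γ = 0.5, n = 2000 and the seed are illustrative choices, not the paper's):

```python
import numpy as np
from scipy.optimize import linprog

def extended_lpe(y, X):
    """Extended LP estimator: maximize sum_j beta_j subject to
    y_i - sum_j beta_j * x_ji >= 0 for all i (betas unrestricted in sign)."""
    n, p = X.shape
    # linprog minimizes, so negate the objective; constraints are X beta <= y.
    res = linprog(c=-np.ones(p), A_ub=X, b_ub=y,
                  bounds=[(None, None)] * p, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x

# Illustrative data from (6) with sigma_i = 1 and gamma = 0.5.
rng = np.random.default_rng(0)
n, gamma = 2000, 0.5
u = rng.uniform(0, 10, size=n)             # nonnegative errors
x1 = rng.chisquare(3, size=n) + gamma * u  # endogenous regressors
x2 = rng.chisquare(4, size=n) + gamma * u
y = 2.5 * x1 - 1.5 * x2 + u

beta_hat = extended_lpe(y, np.column_stack([x1, x2]))
print(beta_hat)  # expected to be close to (2.5, -1.5)
```

Note that, unlike the IVE, the sketch uses no instrument; the estimator exploits only the nonnegativity of the errors.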
where v1i is a chi-square RV with three degrees of freedom, v1i ∼ χ2(3), and v2i is a chi-square RV with four degrees of freedom, v2i ∼ χ2(4). The sequences {v1i} and {v2i} are mutually independent. We write ui ∼ U(0, b) to indicate that ui is uniformly distributed on the interval [0, b] and consider different specifications of (6):

(i) β1 = 2.5, β2 = 0, σi = 1 and ui ∼ U(0, 10).6
(ii) Same specification as in (i) but with β2 = −1.5.
(iii) β1 = 2.5, β2 = 0, σi = √(0.25 + 0.75 i/n) and ui ∼ U(0, √12) with Var(ui) = 1.
(iv) Same specification as in (iii) but with β2 = −1.5.
(v) β1 = 2.5, β2 = 0, σi = 1 and ui = wi(1 + 0.8wi−1) with i.i.d. noise wi ∼ U(0, 10).
(vi) Same specification as in (v) but with β2 = −1.5.

5 Sample MATLAB code can be downloaded from http://www.mysmu.edu/staff/danielpreve.

Table 6
Simulation results: univariate regression with serially correlated errors—specification (v). Each table entry, based on 1000 Monte Carlo replications, reports the empirical bias/mean squared error (MSE) of different estimators for the slope parameter β1 = 2.5 in the univariate regression yi = 2.5x1i + ui, where x1i = v1i + γ ui, v1i ∼ χ2(3) and ui = wi(1 + 0.8wi−1) with i.i.d. noise wi ∼ U(0, 10). The following estimators are considered: the ordinary least squares estimator (LSE), the instrumental variable estimator (IVE) and the linear programming estimator (LPE). For the IVE, the variable v1i is used as an instrument. Finally, for both the LSE and IVE an intercept is included in the regression equation. Different sample sizes (n) and levels of endogeneity (γ) are considered.

[Table entries (bias and MSE of the LSE, IVE and LPE for β1; n = 50, 100, 200, 500, 1000, 2000; γ = 0, 0.25, 0.5, 1) are not recoverable from this extraction and are omitted.]

For the odd-numbered specifications, which are all simple regressions, we use the LPE in (1). For the even-numbered specifications we use the extended LPE described in Section 3 and compute it
6 Hence, α = 5 and εi ∼ U(−5, 5) in this specification.
using the MATLAB function linprog. We report the empirical bias and mean squared error (MSE) over 1000 Monte Carlo replications and consider the following estimators: the LSE, the IVE and the LPE. We consider different sample sizes and levels of endogeneity. The simulation results are shown in Tables 2–7. In general, the results indicate that the LPE has a higher bias than the IVE but a substantially lower MSE, suggesting that the LPE has a considerably smaller variance than the IVE. For example, for specification (v) with γ = 0.5 and n = 200 the MSE of the IVE and LPE is 1.411 and 0.006, respectively. Similarly, the results for the extended LPE indicate that it can be consistent in the presence of heteroskedastic or serially correlated errors and that its variability is much lower than that of the IVE.

Table 7
Simulation results: bivariate regression with serially correlated errors—specification (vi). Each table entry, based on 1000 Monte Carlo replications, reports the empirical bias/mean squared error (MSE) of different estimators for the slope parameters β1 = 2.5 and β2 = −1.5 in the bivariate regression yi = 2.5x1i − 1.5x2i + ui, where x1i = v1i + γ ui, x2i = v2i + γ ui, with v1i ∼ χ2(3) and v2i ∼ χ2(4), and ui = wi(1 + 0.8wi−1) with i.i.d. noise wi ∼ U(0, 10). The following estimators are considered: the ordinary least squares estimator (LSE), the instrumental variable estimator (IVE) and the extended linear programming estimator (LPE). For the IVE, the variables v1i and v2i are used as instruments. Finally, for both the LSE and IVE an intercept is included in the regression equation. Different sample sizes (n) and levels of endogeneity (γ) are considered.

[Table entries (bias and MSE of the LSE, IVE and LPE for β1 and β2; n = 50, 100, 200, 500, 1000, 2000; γ = 0, 0.25, 0.5, 1) are not recoverable from this extraction and are omitted.]

5. Conclusions

In this paper we have established the exact finite-sample distribution of an LPE for the slope parameter in a constrained simple linear regression model when (1) the regressor is nonstochastic, and (2) the regressor is stochastic and potentially endogenous. The exact distribution may be used for statistical inference or to bias-correct the LPE. In addition, we have shown that the LPE is strongly consistent under fairly general conditions on the related distributions. In particular, the LPE is robust to various heavy-tailed specifications and its functional form indicates that it can be insensitive to outliers in yi or xi. We have also identified a number of cases where the LPE is superconsistent. In contrast to existing results for the LPE in a time series setting, our results, in a cross-sectional setting, are valid also in the case when the slope parameter is negative. We provided conditions under which the LPE is consistent in the presence of serially correlated, heteroskedastic errors and described how the LPE can be extended to the case with multiple regressors. Our simulation results indicated that the LPE and extended LPE can have very reasonable finite-sample properties compared to the LSE and IVE, also in the presence of heteroskedastic or serially correlated errors. Clearly, one advantage of the LPE is that, in contrast to the IVE, it does not require an instrument.
Acknowledgments

We are grateful to conference participants at the Singapore Econometrics Study Group Meeting (2009, Singapore) for their comments and suggestions. We also wish to thank Michael McAleer and two referees for very helpful comments. All remaining errors are our own. The first author gratefully acknowledges partial research support from the Jan Wallander and Tom Hedelius Research Foundation (grant P 2006-0166:1) and the Sim Kee Boon Institute for Financial Economics at Singapore Management University. The second author wishes to thank CNPq/Brazil for partial financial support.

Appendix

Proof of Corollary 1. Note that β̂n → β almost surely if and only if Rn → 0 almost surely. By condition (ii), 0 is not a limit point of S; hence, there exists a δ > 0 such that the two sets (−δ, δ) and S are disjoint. Let c = δ/2 and let ϵ > 0 be arbitrary. Then, in view of Proposition 1,

P(|Rn| > ϵ) = ∏_{i=1}^n [1 − Fu(xi ϵ)] ≤ [1 − Fu(c ϵ)]^n,

where the inequality follows from conditions (i)–(ii). By (iv), Fu(u) > 0 for every u > 0. Hence, Rn → 0 in probability as n → ∞. Finally, since R1, . . . , Rn forms a stochastically decreasing sequence, it follows that Rn → 0 almost surely.

Proof of Proposition 2. By (ii),

P(β̂n − β ≤ z) = 1 − P(u1/x1 > z)^n = 1 − [1 − P(u1/x1 ≤ z)]^n = 1 − [1 − Fz(z)]^n,

where Fz(z) is the cdf of z = u1/x1. Let fu1,x1(u, x) denote the joint pdf of u1 and x1 = v1 + γ u1, and fu1(u) the marginal pdf of u1. Denote by fx1|u1=u(x) the conditional pdf of x1 given that u1 = u. Then, for u > 0,

fu1,x1(u, x) = fx1|u1=u(x) fu1(u) = fv1(x − γ u) fu1(u),   (7)

where fv1(v) is the pdf of v1. By Theorem 3.1 in Curtiss (1941),

fz(z) = F′z(z) = ∫_{−∞}^{∞} |x| fu1,x1(zx, x) dx = ∫_0^{∞} x fu1,x1(zx, x) dx.   (8)

Now consider the case when γ > 0. By (7) and (8),

fz(z) = ∫_0^{∞} x fv1(x − γ zx) fu1(zx) dx,

for 0 < z < 1/γ and zero otherwise. Hence,

Fz(z) = 1{0 < z < 1/γ} ∫_0^z ∫_0^{∞} x fv1(x − γ tx) fu1(tx) dx dt + 1{z ≥ 1/γ}.

The proof when γ = 0 is analogous.

Proof of Proposition 3. Let ϵ > 0 be arbitrary. Then, by (i),

P(|Rn| > ϵ) = P[u1 > ϵ(γ u1 + v1), . . . , un > ϵ(γ un + vn)] ≤ P(u1 > ϵ v1, . . . , un > ϵ vn) = P(u1 > ϵ v1)^n,

where the last equality follows from (ii). A simple proof by contradiction, based on a geometric argument, shows that P(u1 > ϵ v1) < 1. Hence, Rn → 0 in probability as n → ∞ and once more the strong convergence of β̂n = β + Rn follows.

References

Curtiss, J.H., 1941. On the distribution of the quotient of two chance variables. The Annals of Mathematical Statistics 12 (4), 409–421.
Datta, S., Mathew, G., McCormick, W.P., 1998. Nonlinear autoregression with positive innovations. Australian & New Zealand Journal of Statistics 40 (2), 229–239.
Datta, S., McCormick, W.P., 1995. Bootstrap inference for a first-order autoregression with positive innovations. Journal of the American Statistical Association 90, 1289–1300.
Davis, R.A., McCormick, W.P., 1989. Estimation for first-order autoregressive processes with positive or bounded innovations. Stochastic Processes and their Applications 31, 237–250.
Feigin, P.D., Kratz, M.F., Resnick, S.I., 1996. Parameter estimation for moving averages with positive innovations. The Annals of Applied Probability 6 (4), 1157–1190.
Feigin, P.D., Resnick, S.I., 1994. Limit distributions for linear programming time series estimators. Stochastic Processes and their Applications 51, 135–165.
Nielsen, B., Shephard, N., 2003. Likelihood analysis of a first-order autoregressive model with exponential innovations. Journal of Time Series Analysis 24 (3), 337–344.
Preve, D., 2011. Linear programming-based estimators in nonnegative autoregression. Manuscript.