ECONOMETRIC ANALYSIS OF FINANCIAL AND ECONOMIC TIME SERIES
ADVANCES IN ECONOMETRICS
Series Editors: Thomas B. Fomby and R. Carter Hill

Recent Volumes:

Volume 15: Nonstationary Panels, Panel Cointegration, and Dynamic Panels, Edited by Badi Baltagi
Volume 16: Econometric Models in Marketing, Edited by P. H. Franses and A. L. Montgomery
Volume 17: Maximum Likelihood Estimation of Misspecified Models: Twenty Years Later, Edited by Thomas B. Fomby and R. Carter Hill
Volume 18: Spatial and Spatiotemporal Econometrics, Edited by J. P. LeSage and R. Kelley Pace
Volume 19: Applications of Artificial Intelligence in Finance and Economics, Edited by J. M. Binner, G. Kendall and S. H. Chen
ADVANCES IN ECONOMETRICS
VOLUME 20, PART B
ECONOMETRIC ANALYSIS OF FINANCIAL AND ECONOMIC TIME SERIES

EDITED BY
THOMAS B. FOMBY Department of Economics, Southern Methodist University, Dallas, TX 75275
DEK TERRELL Department of Economics, Louisiana State University, Baton Rouge, LA 70803
Amsterdam – Boston – Heidelberg – London – New York – Oxford – Paris – San Diego – San Francisco – Singapore – Sydney – Tokyo
ELSEVIER B.V. Radarweg 29 P.O. Box 211 1000 AE Amsterdam, The Netherlands
ELSEVIER Inc. 525 B Street, Suite 1900 San Diego CA 92101-4495 USA
ELSEVIER Ltd The Boulevard, Langford Lane, Kidlington Oxford OX5 1GB UK
ELSEVIER Ltd 84 Theobalds Road London WC1X 8RR UK
© 2006 Elsevier Ltd. All rights reserved.

This work is protected under copyright by Elsevier Ltd, and the following terms and conditions apply to its use:

Photocopying
Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use. Permissions may be sought directly from Elsevier’s Rights Department in Oxford, UK: phone (+44) 1865 843830, fax (+44) 1865 853333, e-mail: [email protected]. Requests may also be completed on-line via the Elsevier homepage (http://www.elsevier.com/locate/permissions). In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 20 7631 5555; fax: (+44) 20 7631 5500. Other countries may have a local reprographic rights agency for payments.

Derivative Works
Tables of contents may be reproduced for internal circulation, but permission of the Publisher is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations.

Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter. Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier’s Rights Department, at the fax and e-mail addresses noted above.

Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

First edition 2006

British Library Cataloguing in Publication Data
A catalogue record is available from the British Library.

ISBN-10: 0-7623-1273-4
ISBN-13: 978-0-7623-1273-3
ISSN: 0731-9053 (Series)
∞ The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper). Printed in The Netherlands.
Working together to grow libraries in developing countries www.elsevier.com | www.bookaid.org | www.sabre.org
CONTENTS

DEDICATION ix

LIST OF CONTRIBUTORS xi

INTRODUCTION
Thomas B. Fomby and Dek Terrell xiii

REMARKS BY ROBERT F. ENGLE III AND SIR CLIVE W. J. GRANGER, KB
Given During Third Annual Advances in Econometrics Conference at Louisiana State University, Baton Rouge, November 5–7, 2004

GOOD IDEAS
Robert F. Engle III xix

THE CREATIVITY PROCESS
Sir Clive W. J. Granger, KB xxiii

REALIZED BETA: PERSISTENCE AND PREDICTABILITY
Torben G. Andersen, Tim Bollerslev, Francis X. Diebold and Ginger Wu 1

ASYMMETRIC PREDICTIVE ABILITIES OF NONLINEAR MODELS FOR STOCK RETURNS: EVIDENCE FROM DENSITY FORECAST COMPARISON
Yong Bao and Tae-Hwy Lee 41

FLEXIBLE SEASONAL TIME SERIES MODELS
Zongwu Cai and Rong Chen 63

ESTIMATION OF LONG-MEMORY TIME SERIES MODELS: A SURVEY OF DIFFERENT LIKELIHOOD-BASED METHODS
Ngai Hang Chan and Wilfredo Palma 89

BOOSTING-BASED FRAMEWORKS IN FINANCIAL MODELING: APPLICATION TO SYMBOLIC VOLATILITY FORECASTING
Valeriy V. Gavrishchaka 123

OVERLAYING TIME SCALES IN FINANCIAL VOLATILITY DATA
Eric Hillebrand 153

EVALUATING THE ‘FED MODEL’ OF STOCK PRICE VALUATION: AN OUT-OF-SAMPLE FORECASTING PERSPECTIVE
Dennis W. Jansen and Zijun Wang 179

STRUCTURAL CHANGE AS AN ALTERNATIVE TO LONG MEMORY IN FINANCIAL TIME SERIES
Tze Leung Lai and Haipeng Xing 205

TIME SERIES MEAN LEVEL AND STOCHASTIC VOLATILITY MODELING BY SMOOTH TRANSITION AUTOREGRESSIONS: A BAYESIAN APPROACH
Hedibert Freitas Lopes and Esther Salazar 225

ESTIMATING TAYLOR-TYPE RULES: AN UNBALANCED REGRESSION?
Pierre L. Siklos and Mark E. Wohar 239

BAYESIAN INFERENCE ON MIXTURE-OF-EXPERTS FOR ESTIMATION OF STOCHASTIC VOLATILITY
Alejandro Villagran and Gabriel Huerta 277

A MODERN TIME SERIES ASSESSMENT OF ‘‘A STATISTICAL MODEL FOR SUNSPOT ACTIVITY’’ BY C. W. J. GRANGER (1957)
Gawon Yoon 297

PERSONAL COMMENTS ON YOON’S DISCUSSION OF MY 1957 PAPER
Sir Clive W. J. Granger, KB 315

A NEW CLASS OF TAIL-DEPENDENT TIME-SERIES MODELS AND ITS APPLICATIONS IN FINANCIAL TIME SERIES
Zhengjun Zhang 317
DEDICATION

Volume 20 of Advances in Econometrics is dedicated to Rob Engle and Sir Clive Granger, winners of the 2003 Nobel Prize in Economics, for their many valuable contributions to the econometrics profession. The Royal Swedish Academy of Sciences cited Rob ‘‘for methods of analyzing economic time series with time-varying volatility (ARCH)’’ while Clive was cited ‘‘for methods of analyzing economic time series with common trends (cointegration).’’ Of course, these citations are meant for public consumption but we specialists in time series analysis know their contributions go far beyond these brief citations. Consider some of Rob’s other contributions to our literature: Aggregation of Time Series, Band Spectrum Regression, Dynamic Factor Models, Exogeneity, Forecasting in the Presence of Cointegration, Seasonal Cointegration, Common Features, ARCH-M, Multivariate GARCH, Analysis of High Frequency Data, and CAViaR. Some of Sir Clive’s additional contributions include Spectral Analysis of Economic Time Series, Bilinear Time Series Models, Combination Forecasting, Spurious Regression, Forecasting Transformed Time Series, Causality, Aggregation of Time Series, Long Memory, Extreme Bounds, Multi-Cointegration, and Non-linear Cointegration. No doubt, their Nobel Prizes are richly deserved. And the 48 authors of the two parts of this volume think likewise. They have authored some very fine papers that contribute nicely to the same literature that Rob’s and Clive’s research helped build. For more information on Rob’s and Clive’s Nobel prizes you can go to the Nobel Prize website http://nobelprize.org/economics/laureates/2003/index.html. In addition to the papers that are contributed here, we are publishing remarks by Rob and Clive on the nature of innovation in econometric research that were given during the Third Annual Advances in Econometrics Conference at Louisiana State University in Baton Rouge, November 5–7, 2004. We think you will enjoy reading their remarks. You come away with the distinct impression that, although they may claim they were ‘‘lucky’’ or ‘‘things just happened to fall in place,’’ their orientation toward building models that solve practical problems has served them and our profession very well.
We hope the readers of this two-part volume enjoy its contents. We feel fortunate to have had the opportunity of working with these fine authors and putting this volume together.

Thomas B. Fomby
Department of Economics
Southern Methodist University
Dallas, Texas 75275

Dek Terrell
Department of Economics
Louisiana State University
Baton Rouge, Louisiana 70803

[Photographs: Robert F. Engle III, 2003 Nobel Prize Winner in Economics; Sir Clive W. J. Granger, KB, 2003 Nobel Prize Winner in Economics]
LIST OF CONTRIBUTORS

Torben G. Andersen
Department of Finance, Kellogg School of Management, Northwestern University, IL, USA
Yong Bao
Department of Economics, University of Texas, TX, USA
Tim Bollerslev
Department of Economics, Duke University, NC, USA
Zongwu Cai
Department of Mathematics & Statistics, University of North Carolina, NC, USA
Ngai Hang Chan
Department of Statistics, The Chinese University of Hong Kong, New Territories, Hong Kong
Rong Chen
Department of Information and Decision Sciences, College of Business Administration, The University of Illinois at Chicago, IL, USA
Francis X. Diebold
Department of Economics, University of Pennsylvania, PA, USA
Robert F. Engle III
Stern School of Business, New York University, NY, USA
Valeriy V. Gavrishchaka
Alexandra Investment Management, LLC, NY, USA
Sir Clive W. J. Granger, KB
Department of Economics, University of California at San Diego, CA, USA
Eric Hillebrand
Department of Economics, Louisiana State University, LA, USA
Gabriel Huerta
Department of Mathematics and Statistics, University of New Mexico, NM, USA
Dennis W. Jansen
Department of Economics, Texas A&M University, TX, USA
Tze Leung Lai
Department of Statistics, Stanford University, CA, USA
Tae-Hwy Lee
Department of Economics, University of California, CA, USA
Hedibert Freitas Lopes
Graduate School of Business, University of Chicago, IL, USA
Wilfredo Palma
Departamento de Estadística, Pontificia Universidad Católica de Chile (PUC), Santiago, Chile
Esther Salazar
Instituto de Matemática, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
Pierre L. Siklos
Department of Economics, Wilfrid Laurier University, Ontario, Canada
Alejandro Villagran
Department of Mathematics and Statistics, University of New Mexico, NM, USA
Zijun Wang
Private Enterprise Research Center, Texas A&M University, TX, USA
Mark E. Wohar
Department of Economics, University of Nebraska–Omaha, NE, USA
Ginger Wu
Department of Banking and Finance, University of Georgia, GA, USA
Haipeng Xing
Department of Statistics, Columbia University, NY, USA
Gawon Yoon
School of Economics, Kookmin University, Seoul, South Korea
Zhengjun Zhang
Department of Statistics, University of Wisconsin-Madison, WI, USA
INTRODUCTION

Thomas B. Fomby and Dek Terrell

The editors are pleased to offer the following papers to the reader in recognition and appreciation of the contributions to our literature made by Robert Engle and Sir Clive Granger, winners of the 2003 Nobel Prize in Economics. Please see the previous dedication page of this volume. The basic themes of this part of Volume 20 of Advances in Econometrics are time-varying betas of the capital asset pricing model, analysis of predictive densities of nonlinear models of stock returns, modeling multivariate dynamic correlations, flexible seasonal time series models, estimation of long-memory time series models, the application of the technique of boosting in volatility forecasting, the use of different time scales in Generalized Auto-Regressive Conditional Heteroskedasticity (GARCH) modeling, out-of-sample evaluation of the ‘Fed Model’ in stock price valuation, structural change as an alternative to long memory, the use of smooth transition autoregressions in stochastic volatility modeling, the analysis of the ‘‘balancedness’’ of regressions analyzing Taylor-type rules of the Fed Funds rate, a mixture-of-experts approach for the estimation of stochastic volatility, a modern assessment of Clive’s first published paper on sunspot activity, and a new class of models of tail-dependence in time series subject to jumps. Of course, we are also pleased to include Rob’s and Clive’s remarks on their careers and their views on innovation in econometric theory and practice that were given at the Third Annual Advances in Econometrics Conference held at Louisiana State University, Baton Rouge, on November 5–7, 2004. Let us briefly review the specifics of the papers presented here. In the first paper, ‘‘Realized Beta: Persistence and Predictability,’’ Torben Andersen, Tim Bollerslev, Francis Diebold, and Jin Wu review the literature on the one-factor Capital Asset Pricing Model (CAPM) for the purpose of coming to a better understanding of the variability of the betas of such models. They do this by flexibly modeling betas as the ratio of the integrated stock and market return covariance and integrated market variance in a way that allows, but does not impose, fractional integration and/or cointegration.
They find that, although the realized variances and covariances fluctuate widely and are highly persistent and predictable, the realized betas, which are simple nonlinear functions of the realized variances and covariances, display much less persistence and predictability. They conclude that the constant beta CAPM, as bad as it may be, is nevertheless not as bad as some popular conditional CAPMs. Their paper provides some very useful insight into why allowing for time-varying betas may do more harm than good when estimated from daily data. They close by sketching an interesting framework for future research using high-frequency intraday data to improve the modeling of time-varying betas. Yong Bao and Tae-Hwy Lee in their paper ‘‘Asymmetric Predictive Abilities of Nonlinear Models for Stock Returns: Evidence from Density Forecast Comparison’’ investigate the nonlinear predictability of stock returns when the density forecasts are evaluated and compared instead of the conditional mean point forecasts. They use the Kullback–Leibler Information Criterion (KLIC) divergence measure to characterize the extent of misspecification of a forecast model. Their empirical findings suggest that the out-of-sample predictive abilities of nonlinear models for stock returns are asymmetric in the sense that the right tails of the return series are predictable via many of the nonlinear models while they find no such evidence for the left tails or the entire distribution. Zongwu Cai and Rong Chen introduce a new class of flexible seasonal time series models to characterize trend and seasonal variation in their paper ‘‘Flexible Seasonal Time Series Models.’’ Their model consists of a common trend function over periods and additive individual trend seasonal functions that are specific to each season within periods. A local linear approach is developed to estimate the common trend and seasonal trend functions. The consistency and asymptotic normality of the proposed estimators are established under weak α-mixing conditions and without specifying the error distribution. The proposed methodologies are illustrated with a simulated example and two economic and financial time series, which exhibit nonlinear and nonstationary behavior. In ‘‘Estimation of Long-Memory Time Series Models: A Survey of Different Likelihood-Based Methods’’ Ngai Hang Chan and Wilfredo Palma survey the various likelihood-based techniques for analyzing long memory in time series data. The authors classify these methods into the following categories: Exact maximum likelihood methods, maximum likelihood methods based on autoregressive approximations, Whittle estimates, Whittle estimates with autoregressive truncation, approximate estimates based on the Durbin–Levinson algorithm, state–space based estimates for
the autoregressive fractionally integrated moving average (ARFIMA) models, and estimation of stochastic volatility models. Their review provides a succinct survey of these methodologies as well as an overview of the important related problems such as the maximum likelihood estimation with missing data, influence of subsets of observations on estimates, and the estimation of seasonal long-memory models. Performances and asymptotic properties of these techniques are compared and examined. Interconnections and finite sample performances among these procedures are studied and applications to financial time series of these methodologies are discussed. In ‘‘Boosting-Based Frameworks in Financial Modeling: Application to Symbolic Volatility Forecasting’’ Valeriy Gavrishchaka suggests that Boosting (a novel ensemble learning technique) can serve as a simple and robust framework for combining the best features of both analytical and data-driven models and more specifically discusses how Boosting can be applied for typical financial and econometric applications. Furthermore, he demonstrates some of the capabilities of Boosting by showing how a Boosted collection of GARCH-type models for the IBM stock time series can be used to produce more accurate forecasts of volatility than both the best single model of the collection and the widely used GARCH(1,1) model. Eric Hillebrand studies different generalizations of GARCH that allow for several time scales in his paper ‘‘Overlaying Time Scales in Financial Volatility Data.’’ In particular he examines the nature of the volatility in four measures of the U.S. stock market (S&P 500, Dow Jones Industrial Average, CRSP equally weighted index, and CRSP value-weighted index) as well as the exchange rate of the Japanese Yen against the U.S. Dollar and the U.S. federal funds rate. In addition to analyzing these series using the conventional ARCH and GARCH models he uses three models of multiple time scales, namely, fractional integration, two-scale GARCH, and wavelet analysis in conjunction with Heterogeneous ARCH (HARCH). Hillebrand finds that the conventional ARCH and GARCH models miss the important short correlation time scale in the six series. Based on a holding sample test, the multiple time scale models, although offering an improvement over the conventional ARCH and GARCH models, still did not completely model the short correlation structure of the six series. However, research in extending volatility models in this way appears to be promising. The ‘‘Fed Model’’ postulates a cointegrating relationship between the equity yield on the S&P 500 and the bond yield. In their paper, ‘‘Evaluating the ‘Fed Model’ of Stock Price Valuation: An Out-of-Sample Forecasting Perspective,’’ Dennis Jansen and Zijun Wang evaluate the Fed Model as a
vector-error correction forecasting model for stock prices and for bond yields. They compare out-of-sample forecasts of each of these two variables from a univariate model and various versions of the Fed Model including both linear and nonlinear vector error correction models. They find that for stock prices the Fed Model improves on the univariate model for longer-horizon forecasts, and the nonlinear vector-error correction model performs even better than its linear version. In their paper, ‘‘Structural Change as an Alternative to Long Memory in Financial Time Series,’’ Tze Leung Lai and Haipeng Xing note that volatility persistence in GARCH models and spurious long memory in autoregressive models may arise if the possibility of structural changes is not incorporated in the time series model. Therefore, they propose a structural change model that allows changes in the volatility and regression parameters at unknown times and with unknown changes in magnitudes. Their model is a hidden Markov model in which the volatility and regression parameters can continuously change and are estimated by recursive filters. As their hidden Markov model involves gamma-normal conjugate priors, there are explicit recursive formulas for the optimal filters and smoothers. Using NASDAQ weekly return data, they show how the optimal structural change model can be applied to segment financial time series by making use of the estimated probabilities of structural breaks. In their paper, ‘‘Time Series Mean Level and Stochastic Volatility Modeling by Smooth Transition Autoregressions: A Bayesian Approach,’’ Hedibert Lopes and Esther Salazar propose a Bayesian approach to model the level and variance of financial time series based on a special class of nonlinear time series models known as the logistic smooth transition autoregressive (LSTAR) model. They propose a Markov Chain Monte Carlo (MCMC) algorithm for the levels of the time series and then adapt it to model the stochastic volatilities. The LSTAR order of their model is selected by the three information criteria Akaike information criterion (AIC), Bayesian information criterion (BIC), and Deviance information criterion (DIC). They apply their algorithm to one synthetic and two real time series, namely the Canadian Lynx data and the SP500 return series, and find the results encouraging when modeling both the levels and the variance of univariate time series with LSTAR structures. Relying on Robert Engle’s and Clive Granger’s many and varied contributions to econometric analysis, Pierre Siklos and Mark Wohar examine some key econometric considerations involved in estimating Taylor-type rules for U.S. data in their paper ‘‘Estimating Taylor-Type Rules: An Unbalanced Regression?’’ They focus on the roles of unit roots, cointegration,
structural breaks, and nonlinearities to make the case that most existing estimates are based on unbalanced regressions. A variety of their estimates reveal that neglecting the presence of cointegration results in the omission of a necessary error correction term and that Fed reactions during the Greenspan era appear to have been asymmetric. They further argue that error correction and nonlinearities may be one way to estimate Taylor rules over long samples when the underlying policy regime may have changed significantly. Alejandro Villagran and Gabriel Huerta propose a Bayesian Mixture-of-Experts (ME) approach to estimating stochastic volatility in time series in their paper ‘‘Bayesian Inference on Mixture-of-Experts for Estimation of Stochastic Volatility.’’ They use as their ‘‘experts’’ the ARCH, GARCH, and EGARCH models to analyze the stochastic volatility in the U.S. dollar/German Mark exchange rate and conduct a study of the volatility of the Mexican stock market (IPC) index using the Dow Jones Industrial (DJI) index as a covariate. They also describe the estimation of predictive volatilities and their corresponding measure of uncertainty given by a Bayesian credible interval using the ME approach. In the applications they present, it is interesting to see how the posterior probabilities of the ‘‘experts’’ change over time and to conjecture why the posterior probabilities changed as they did. Sir Clive Granger published his first paper ‘‘A Statistical Model for Sunspot Activity’’ in 1957 in the prestigious Astrophysical Journal. As a means of recognizing Clive’s many contributions to the econometric analysis of time series and celebrating the near 50th anniversary of his first publication, one of his students and now professor, Gawon Yoon, has written a paper that provides a modern time series assessment of Clive’s first paper. In ‘‘A Modern Time Series Assessment of ‘A Statistical Model for Sunspot Activity’ by C. W. J. Granger (1957),’’ Yoon reviews Granger’s statistical model of sunspots containing two parameters representing an amplitude factor and the occurrence of minima, respectively. At the time Granger’s model accounted for about 85% of the total variation in the sunspot data. Interestingly, Yoon finds that, in the main, Granger’s model quite nicely explains the more recent occurrence of sunspots despite the passage of time. Even though it appears that some of the earlier observations that Granger had available were measured differently from later sunspot numbers, Granger’s simple two-parameter model still accounts for more than 80% of the total variation in the extended sunspot data. This all goes to show (as Sir Clive would attest) that simple models can also be useful models. With respect to Yoon’s review of Granger’s paper, Sir Clive was kind enough to offer remarks that the editors have chosen to publish immediately
following Yoon’s paper. In reading Clive’s delightful remarks we come to know (or some of us remember, depending on your age) how difficult it was to conduct empirical analysis in the 1950s and 1960s. As Clive notes, ‘‘Trying to plot by hand nearly two hundred years of monthly data is a lengthy task!’’ So econometric researchers post-1980 have many things to be thankful for, not the least of which is fast and inexpensive computing. Nevertheless, Clive and many other young statistical researchers at the time were undaunted. They were convinced that quantitative research was important and a worthwhile endeavor regardless of the expense of time and they set out to investigate what the naked eye could not detect. You will enjoy reading Clive’s remarks knowing that he is always appreciative of the comments and suggestions of colleagues and that he is an avid supporter of best practices in statistics and econometrics. In the final paper of Part B of Volume 20, ‘‘A New Class of Tail-Dependent Time Series Models and Its Applications in Financial Time Series,’’ Zhengjun Zhang proposes a new class of models to determine the order of lag-k tail dependence in financial time series that exhibit jumps. His base model is a specific class of maxima of moving maxima processes (M3 processes). Zhang then improves on his base model by allowing for possible asymmetry between positive and negative returns. His approach adopts a hierarchical model structure: first, one applies, say, a GARCH(1,1) model to obtain estimated standard deviations; then, based on the standardized returns, one applies M3 and Markov process modeling to characterize the tail dependence in the time series. Zhang demonstrates his model and his approach using the S&P 500 Index. As he points out, estimates of the parameters of the proposed model can be used to compute the value at risk (VaR) of the investments whose returns are subject to jump processes.
GOOD IDEAS

Robert F. Engle III

The Nobel Prize is given for good ideas – very good ideas. These ideas often shape the direction of research for an academic discipline. These ideas are often accompanied by a great deal of work by many researchers. Most good ideas don’t get prizes but they are the centerpieces of our research and our conferences. At this interesting Advances in Econometrics conference hosted by LSU, we’ve seen lots of new ideas, and in our careers we have all had many good ideas. I would like to explore where they come from and what they look like. When I was growing up in suburban Philadelphia, my mother would sometimes take me over to Swarthmore College to the Physics library. It was a small dusty room with windows out over a big lawn with trees. The books cracked when I opened them; they smelled old and had faded gold letters on the spine. This little room was exhilarating. I opened books by the famous names in physics and read about quantum mechanics, elementary particles and the history of the universe. I didn’t understand too much but kept piecing together my limited ideas. I kept wondering whether I would understand these things when I was older and had studied in college or graduate school. I developed a love of science and the scientific method. I think this is why I studied econometrics; it is the place where theory meets reality. It is the place where data on the economy tests the validity of economic theory. Fundamentally I think good ideas are simple. In Economics, most ideas can be simplified until they can be explained to non-specialists in plain language. The process of simplifying ideas and explaining them is extremely important. Often the power of the idea comes from simplification of a collection of complicated and conflicting research. The process of distilling out the simple novel ingredient is not easy at all and often takes lots of fresh starts and numerical examples. Discouragingly, good ideas boiled down to their essence may seem trivial. I think this is true of ARCH and Cointegration and many other Nobel citations. But, I think we should not be
offended by this simplicity, but rather we should embrace it. Of course it is easy to do this after 20 years have gone by; but the trick is to recognize good ideas early. Look for them at seminars or when reading or refereeing or editing. Good ideas generalize. A good idea, when applied to a new situation, often gives interesting insights. In fact, the implications of a good idea may be initially surprising. Upon reflection, the implications may be of growing importance. If ideas translated into other fields give novel interpretations to existing problems, this is a measure of their power. Often good ideas come from examining one problem from the point of view of another. In fact, the ARCH model came from such an analysis. It was a marriage of theory, time series and empirical evidence. The role of uncertainty in rational expectations macroeconomics was not well developed, yet there were theoretical reasons why changing uncertainty could have real effects. From a time series point of view a natural solution to modeling uncertainty was to build conditional models of variance rather than the more familiar unconditional models. I knew that Clive’s test for bilinearity based on the autocorrelation of squared residuals was often significant in macroeconomic data, although I suspected that the test was also sensitive to other effects such as changing variances. The idea for the ARCH model came from combining these three observations to get an autoregressive model of conditional heteroskedasticity. Sometimes a good idea can come from attempts to disprove proposals of others. Clive traces the origin of cointegration to his attempt to disprove a David Hendry conjecture that a linear combination of the two integrated series could be stationary. From trying to show that this was impossible, Clive proved the Granger Representation theorem that provides the fundamental rationale for error correction models in cointegrated systems. My first meeting in Economics was the 1970 World Congress of the Econometric Society in Cambridge England. I heard many of the famous economists of that generation explain their ideas. I certainly did not understand everything but I wanted to learn it all. I gave a paper at this meeting at a session organized by Clive that included Chris Sims and Phoebus Dhrymes. What a thrill. I have enjoyed European meetings of the Econometric Society ever since. My first job was at MIT. I had a lot of chances to see good ideas; particularly good ideas in finance. Myron Scholes and Fischer Black were working on options theory and Bob Merton was developing continuous time finance. I joined Franco Modigliani and Myron on Michael Brennan’s dissertation committee where he was testing the CAPM. Somehow I missed
the opportunity to capitalize on these powerful ideas and it was only many years later that I moved my research in this direction. I moved to UCSD in 1976 to join Clive Granger. We studied many fascinating time series problems. Mark Watson was my first PhD student at UCSD. The ARCH model was developed on sabbatical at LSE, and when I returned, a group of graduate students contributed greatly to the development of this research. Tim Bollerslev and Dennis Kraft were among the first, Russ Robins and Jeff Wooldridge and my colleague David Lilien were instrumental in helping me think about the finance applications. The next 20 years at UCSD were fantastic in retrospect. I don’t think we knew at the time how we were moving the frontiers in econometrics. We had great visitors and faculty and students and every day there were new ideas. These ideas came from casual conversations and a relaxed mind. They came from brainstorming on the blackboard with a student who was looking for a dissertation topic. They came from ‘‘Econometrics Lunch’’ when we weren’t talking about gossip in the profession. Universities are incubators of good ideas. Our students come with good ideas, but they have to be shaped and interpreted. Our faculties have good ideas, which they publish and lecture on around the world. Our departments and universities thrive on good ideas that make them famous places for study and innovation. They also contribute to spin-offs in the private sector and consulting projects. Good ideas make the whole system work and it is so important to recognize them in all their various forms and reward them. As a profession we are very protective of our ideas. Often the origin of the idea is disputable. New ideas may have only a part of the story that eventually develops; who gets the credit? While such disputes are natural, it is often better in my opinion to recognize previous contributions and stand on their shoulders thereby making your own ideas even more important. I give similar advice for academics who are changing specialties; stand with one foot in the old discipline and one in the new. Look for research that takes your successful ideas from one field into an important place in a new discipline. Here are three quotations that I think succinctly reflect these thoughts. ‘‘The universe is full of magical things, patiently waiting for our wits to grow sharper’’ Eden Philpotts ‘‘To see what is in front of one’s nose requires a constant struggle.’’ George Orwell ‘‘To select well among old things is almost equal to inventing new ones’’ Nicolas Charles Trublet
There is nothing in our chosen career that is as exhilarating as having a good idea. But a very close second is seeing someone develop a wonderful new application from your idea. The award of the Nobel Prize to Clive and me for our work in time series is an honor to all of the authors who contributed to the conference and to this volume. I think the prize is really given to a field and we all received it. This gives me so much joy. And I hope that someone in this volume will move forward to open more doors with powerful new ideas, and receive her own Nobel Prize. Robert F. Engle III Remarks Given at Third Annual Advances in Econometrics Conference Louisiana State University Baton Rouge, Louisiana November 5–7, 2004
THE CREATIVITY PROCESS

Sir Clive W. J. Granger, KB

In 1956, I was searching for a Ph.D. topic and I selected time series analysis as being an area that was not very developed and was potentially interesting. I have never regretted that choice. Occasionally, I have tried to develop other interests but after a couple of years away I would always return to time series topics where I am more comfortable. I have never had a long-term research topic. What I try to do is to develop new ideas, topics, and models, do some initial development, and leave the really hard, rigorous stuff to other people. Some new topics catch on quickly and develop a lot of citations (such as cointegration), others are initially ignored but eventually become much discussed and applied (causality, as I call it), some develop interest slowly but eventually deeply (fractionally integrated processes), some have long term, steady life (combination of forecasts), whereas others generate interest but eventually vanish (bilinear models, spectral analysis). The ideas come from many sources, by reading literature in other fields, from discussions with other workers, from attending conferences (time distance measure for forecasts), and from general reading. I will often attempt to take a known model and generalize and expand it in various ways. Quite frequently these generalizations turn out not to be interesting; I have several examples of general I(d) processes where d is not real or not finite. The models that do survive may be technically interesting but they may not prove useful with economic data, providing an example of a so-called ‘‘empty box’’; bilinear models and I(d), d non-integer, could be examples. In developing these models one is playing a game. One can never claim that a new model will be relevant, only that it might be. Of course, when using the model to generate forecasts, one has to assume that the model is correct, but one must not forget this assumption. If the model is correct, the data will have certain properties that can be proved, but it should always be remembered that other models may generate the same properties,
for example I(d), d a fraction, and break processes can give similar ‘‘long memory’’ autocorrelations. Finding properties of data and then suggesting that a particular model will have generated the data is a dangerous game. Of course, once the research has been done one faces the problem of publication. The refereeing process is always a hassle. I am not convinced that delaying an interesting paper (I am not thinking of any of my own here) by a year or more to fix a few minor difficulties is actually helping the development of our field. Rob and I had initial rejections of some of our best joint papers, including the one on cointegration. My paper on the typical spectral shape took over three and a half years between submission and publication, and it is a very short paper. My favorite editors’ comment was that ‘‘my paper was not very good (correct) but it was very short,’’ and as they just had that space to fill they would accept. My least favorite comment was a rejection of a paper with Paul Newbold because ‘‘it has all been done before.’’ As we were surprised at this we politely asked for citations. The referee had no citations; he just thought it must have been done before. The paper was published elsewhere. For most of its history time series theory considered conditional means, but later conditional variances. The next natural development would be conditional quantiles, but this area is receiving less attention than I expected. The last stages are initially conditional marginal distributions, and finally conditional multivariate distributions. Some interesting theory is starting in these areas but there is an enormous amount to be done. The practical aspects of time series analysis are rapidly changing with improvements in computer performance. Now many, fairly long series can be analyzed jointly. For example, Stock and Watson (1999) consider over 200 macro series. However, the dependent series are usually considered individually, whereas what we are really dealing with is a sample from a 200-dimensional multivariate distribution, assuming the processes are jointly stationary. How to even describe the essential features of such a distribution, which is almost certainly non-Gaussian, in a way that is useful to economists and decision makers is a substantial problem in itself. My younger colleagues sometimes complain that we old guys solved all the interesting easy questions. I do not think that was ever true and is not true now. The higher we stand the wider our perspective; I hope that Rob and I have provided, with many others, a suitable starting point for the future study in this area.
REFERENCE

Stock, J. H., & Watson, M. W. (1999). A comparison of linear and nonlinear univariate models for forecasting macroeconomic time series. In: R. F. Engle & H. White (Eds), Cointegration, causality, and forecasting: A festschrift in honour of Clive W. J. Granger. Oxford: Oxford University Press.
Sir Clive W. J. Granger, KB
Remarks Read at Third Annual Advances in Econometrics Conference
Louisiana State University
Baton Rouge, Louisiana
November 5–7, 2004
REALIZED BETA: PERSISTENCE AND PREDICTABILITY*

Torben G. Andersen, Tim Bollerslev, Francis X. Diebold and Ginger Wu

ABSTRACT

A large literature over several decades reveals both extensive concern with the question of time-varying betas and an emerging consensus that betas are in fact time-varying, leading to the prominence of the conditional CAPM. Set against that background, we assess the dynamics in realized betas, vis-à-vis the dynamics in the underlying realized market variance and individual equity covariances with the market. Working in the recently popularized framework of realized volatility, we are led to a framework of nonlinear fractional cointegration: although realized variances and covariances are very highly persistent and well approximated as fractionally integrated, realized betas, which are simple nonlinear functions of those realized variances and covariances, are less persistent and arguably best modeled as stationary I(0) processes. We conclude by drawing implications for asset pricing and portfolio management.
* We dedicate this paper to Clive W. J. Granger, a giant of modern econometrics, on whose broad shoulders we are fortunate to stand. This work was supported by the National Science Foundation and the Guggenheim Foundation.
Econometric Analysis of Financial and Economic Time Series/Part B
Advances in Econometrics, Volume 20, 1–39
Copyright © 2006 by Elsevier Ltd. All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1016/S0731-9053(05)20020-8
1. INTRODUCTION

One of the key insights of asset pricing theory is also one of the simplest: only systematic risk should be priced. Perhaps not surprisingly, however, there is disagreement as to the sources of systematic risk. In the one-factor capital asset pricing model (CAPM), for example, systematic risk is determined by covariance with the market (Sharpe, 1963; Lintner, 1965a, b), whereas, in more elaborate pricing models, additional empirical characteristics such as firm size and book-to-market are seen as proxies for another set of systematic risk factors (Fama & French, 1993).1 As with most important scientific models, the CAPM has been subject to substantial criticism (e.g., Fama & French, 1992). Nevertheless, to paraphrase Mark Twain, the reports of its death are greatly exaggerated. In fact, the one-factor CAPM remains alive and well at the frontier of both academic research and industry applications, for at least two reasons. First, recent work reveals that it often works well – despite its wrinkles and warts – whether in traditional incarnations (e.g., Ang & Chen, 2003) or more novel variants (e.g., Cohen, Polk, & Vuolteenaho, 2002; Campbell & Vuolteenaho, 2004). Second, competing multi-factor pricing models, although providing improved statistical fit, involve factors whose economic interpretations in terms of systematic risks remain unclear, and moreover, the stability of empirically motivated multi-factor asset pricing relationships often appears tenuous when explored with true out-of-sample data, suggesting an element of data mining.2 In this paper, then, we study the one-factor CAPM, which remains central to financial economics nearly a half century after its introduction. A key question within this setting is whether stocks’ systematic risks, as assessed by their correlations with the market, are constant over time – i.e., whether stocks’ market betas are constant. And if betas are not constant, a central issue becomes how to understand and formally characterize their persistence and predictability vis-à-vis their underlying components. The evolution of a large literature over several decades reveals both extensive concern with this question and, we contend, an eventual implicit consensus that betas are likely time-varying.3 Several pieces of evidence support our contention. First, leading texts echo it. For example, Huang and Litzenberger (1988) assert that ‘‘It is unlikely that risk premiums and betas on individual assets are stationary over time’’ (p. 303). Second, explicitly dynamic betas are often modeled nonstructurally via time-varying parameter regression, in a literature tracing at least to the early ‘‘return to normality’’ model of Rosenberg (1973), as implemented in the CAPM by Schaefer, Brealey, Hodges, and Thomas (1975). Third, even in the absence of
explicit allowance for time-varying betas, the CAPM is typically estimated using moving estimation windows, usually of 5–10 years, presumably to guard against beta variation (e.g., Fama, 1976; Campbell, Lo, & MacKinlay, 1997). Fourth, theoretical and empirical inquiries in asset pricing are often undertaken in conditional, as opposed to unconditional, frameworks, the essence of which is to allow for time-varying betas, presumably because doing so is viewed as necessary for realism. The motivation for the conditional CAPM comes from at least two sources. First, from a theoretical perspective, financial economic considerations suggest that betas may vary with conditioning variables, an idea developed theoretically and empirically in a large literature that includes, among many others, Dybvig and Ross (1985), Hansen and Richard (1987), Ferson, Kandel, and Stambaugh (1987), Ferson and Harvey (1991), Jagannathan and Wang (1996), and Wang (2003).4 Second, from a different and empirical perspective, the financial econometric volatility literature (see Andersen, Bollerslev, & Diebold, 2005, for a recent survey) has provided extensive evidence of wide fluctuations and high persistence in asset market conditional variances, and in individual equity conditional covariances with the market. Thus, even from a purely statistical viewpoint, market betas, which are ratios of time-varying conditional covariances and variances, might be expected to display persistent fluctuations, as in Bollerslev, Engle, and Wooldridge (1988). In fact, unless some special cancellation occurs – in a way that we formalize – betas would inherit the persistence features that are so vividly present in their constituent components. Set against this background, we assess the dynamics in betas vis-à-vis the widely documented persistent dynamics in the underlying variance and covariances. We proceed as follows: In Section 2, we sketch the framework, both economic and econometric, in which our analysis is couched. In Section 3, we present the empirical results with an emphasis on analysis of persistence and predictability. In Section 4, we formally assess the uncertainty in our beta estimates. In Section 5, we offer summary, conclusions, and directions for future research.
2. THEORETICAL FRAMEWORK

Our approach has two key components. First, in keeping with the recent move toward nonparametric volatility measurement, we cast our analysis within the framework of realized variances and covariances, or equivalently, empirical quadratic variation and covariation. That is, we do not entertain a null
hypothesis of period-by-period constant betas, but instead explicitly allow for continuous evolution in betas. Our ‘‘realized betas’’ are (continuous-record) consistent for realizations of the underlying ratio between the integrated stock and market return covariance and the integrated market variance.5 Second, we work in a flexible econometric framework that allows for – without imposing – fractional integration and/or cointegration between the market variance and individual equity covariances with the market.
2.1. Realized Quarterly Variances, Covariances, and Betas

We provide estimates of quarterly betas, based on nonparametric realized quarterly market variances and individual equity covariances with the market. The quarterly frequency is appealing from a substantive financial economic perspective, and it also provides a reasonable balance between efficiency and robustness to microstructure noise. Specifically, we produce our quarterly estimates using underlying daily returns, as in Schwert (1989), so that the sampling frequency is quite high relative to the quarterly horizon of interest, yet low enough so that contamination by microstructure noise is not a serious concern for the highly liquid stocks that we study. The daily frequency further allows us to utilize a long sample of data, which is not available when sampling more frequently. Suppose that the logarithmic N \times 1 vector price process, p_t, follows a multivariate continuous-time stochastic volatility diffusion,

    dp_t = \mu_t \, dt + \Omega_t^{1/2} \, dW_t,   (1)

where W_t denotes a standard N-dimensional Brownian motion, and both the process for the N \times N positive definite diffusion matrix, \Omega_t, and the N-dimensional instantaneous drift, \mu_t, are strictly stationary and jointly independent of the W_t process. For our purposes it is helpful to think of the Nth element of p_t as containing the log price of the market and the ith element of p_t as containing the log price of the ith individual stock included in the analysis, so that the corresponding covariance matrix contains both the market variance, say \sigma^2_{M,t} = \Omega_{(NN),t}, and the individual equity covariance with the market, \sigma_{iM,t} = \Omega_{(iN),t}. Then, conditional on the sample path realization of \mu_t and \Omega_t, the distribution of the continuously compounded h-period return, r_{t+h,h} \equiv p_{t+h} - p_t, is

    r_{t+h,h} \mid \sigma\{\mu_{t+\tau}, \Omega_{t+\tau}\}_{\tau=0}^{h} \sim N\left( \int_0^h \mu_{t+\tau} \, d\tau, \; \int_0^h \Omega_{t+\tau} \, d\tau \right),   (2)
where \sigma\{\mu_{t+\tau}, \Omega_{t+\tau}\}_{\tau=0}^{h} denotes the \sigma-field generated by the sample paths of \mu_{t+\tau} and \Omega_{t+\tau}, for 0 \le \tau \le h. The integrated diffusion matrix \int_0^h \Omega_{t+\tau} \, d\tau therefore provides a natural measure of the true latent h-period volatility.6 The requirement that the innovation process, W_t, is independent of the drift and diffusion processes is rather strict and precludes, for example, the asymmetric relations between return innovations and volatility captured by the so-called leverage or volatility feedback effects. However, from the results in Meddahi (2002), Barndorff-Nielsen and Shephard (2003), and Andersen, Bollerslev, and Meddahi (2004), we know that the continuous-record asymptotic distribution theory for the realized covariation continues to provide an excellent approximation for empirical high-frequency realized volatility measures.7 As such, even if the conditional return distribution result (2) does not apply in full generality, the evidence presented below, based exclusively on the realized volatility measures, remains trustworthy in the presence of asymmetries in the return innovation–volatility relations. By the theory of quadratic variation, we have that, under weak regularity conditions and regardless of the presence of leverage or volatility feedback effects,

    \sum_{j=1,\ldots,[h/\Delta]} r_{t+j\Delta,\Delta} \, r'_{t+j\Delta,\Delta} - \int_0^h \Omega_{t+\tau} \, d\tau \to 0   (3)

almost surely for all t as the sampling frequency of the returns increases, or \Delta \to 0. Thus, by summing sufficiently finely sampled high-frequency returns, it is possible to construct ex-post realized volatility measures for the integrated latent volatilities that are asymptotically free of measurement error. This contrasts sharply with the common use of the cross-product of the h-period returns, r_{t+h,h} \, r'_{t+h,h}, as a simple ex post (co)variability measure. Although the squared return (innovation) over the forecast horizon provides an unbiased estimate for the integrated volatility, it is an extremely noisy estimator, and predictable variation in the true latent volatility process is typically dwarfed by measurement error. Moreover, for longer horizons any conditional mean dependence will tend to contaminate this variance measure. In contrast, as the sampling frequency is increased, the impact of the drift term vanishes, thus effectively annihilating the mean. These assertions remain valid if the underlying continuous time process in Eq. (1) contains jumps, so long as the price process is a special semimartingale, which will hold if it is arbitrage-free. Of course, in this case the limit of the summation of the high-frequency returns will involve an additional
jump component, but the interpretation of the sum as the realized h-period return volatility remains intact. Finally, with the realized market variance and realized covariance between the market and the individual stocks in hand, we can readily define and empirically construct the individual equity ‘‘realized betas.’’ Toward that end, we introduce some formal notation. Using an initial subscript to indicate the corresponding element of a vector, we denote the realized market volatility by

    \hat{v}^2_{M,t,t+h} = \sum_{j=1,\ldots,[h/\Delta]} r^2_{(N),t+j\Delta,\Delta}   (4)

and we denote the realized covariance between the market and the ith individual stock return by

    \hat{v}_{iM,t,t+h} = \sum_{j=1,\ldots,[h/\Delta]} r_{(i),t+j\Delta,\Delta} \, r_{(N),t+j\Delta,\Delta}   (5)

We then define the associated realized beta as

    \hat{\beta}_{i,t,t+h} = \frac{\hat{v}_{iM,t,t+h}}{\hat{v}^2_{M,t,t+h}}   (6)

Under the assumptions invoked for Eq. (1), this realized beta measure is consistent for the true underlying integrated beta in the following sense:

    \hat{\beta}_{i,t,t+h} \to \beta_{i,t,t+h} = \frac{\int_0^h \Omega_{(iN),t+\tau} \, d\tau}{\int_0^h \Omega_{(NN),t+\tau} \, d\tau}   (7)
almost surely for all t as the sampling frequency increases, or \Delta \to 0. A number of comments are in order. First, the integrated return covariance matrix, \int_0^h \Omega_{t+\tau} \, d\tau, is treated as stochastic, so both the integrated market variance and the integrated covariances of individual equity returns with the market over [t, t+h] are ex ante, as of time t, unobserved and governed by a non-degenerate (and potentially unknown) distribution. Moreover, the covariance matrix will generally vary continuously and randomly over the entire interval, so the integrated covariance matrix should be interpreted as the average realized covariation among the return series. Second, Eq. (3) makes it clear that the realized market volatility in (4) and the realized covariance in (5) are continuous-record consistent estimators of the (random) realizations of the underlying integrated market volatility and covariance. Thus, as a corollary, the realized beta will be consistent for the integrated beta, as stated in (7). Third, the general representation here
encompasses the standard assumption of a constant beta over the measurement or estimation horizon, which is attained for the degenerate case of the \Omega_t process being constant throughout each successive h-period measurement interval, or \Omega_t = \Omega. Fourth, the realized beta estimation procedure in Eqs. (4)–(6) is implemented through a simple regression (without a constant term) of individual high-frequency stock returns on the corresponding market return. Nonetheless, the interpretation is very different from a standard regression, as the Ordinary Least Squares (OLS) point estimate now represents a consistent estimator of the ex post realized regression coefficient obtained as the ratio of unbiased estimators of the average realized covariance and the realized market variance. The associated continuous-record asymptotic theory developed by Barndorff-Nielsen and Shephard (2003) explicitly recognizes the diffusion setting underlying this regression interpretation and hence facilitates the construction of standard errors for our beta estimators.
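To make the construction in Eqs. (4)–(6) concrete, the following minimal Python sketch (ours, not the authors') simulates one ‘‘quarter’’ of daily stock and market returns from a constant-coefficient, zero-drift special case of Eq. (1) and then forms the realized market variance, realized covariance, and realized beta. The parameter values, the 63-day quarter, and the function names are illustrative assumptions only.

```python
import numpy as np

def simulate_quarter(n_days=63, beta=1.2, sigma_m=0.01, sigma_e=0.015, seed=0):
    """Simulate 'daily' stock and market returns from a constant-coefficient,
    zero-drift special case of Eq. (1); with a constant covariance matrix the
    integrated beta over the quarter equals `beta` by construction."""
    rng = np.random.default_rng(seed)
    r_market = sigma_m * rng.standard_normal(n_days)
    r_stock = beta * r_market + sigma_e * rng.standard_normal(n_days)
    return r_stock, r_market

def realized_beta(r_stock, r_market):
    """Realized beta over the window, following Eqs. (4)-(6): realized
    covariance with the market divided by the realized market variance."""
    v2_market = np.sum(r_market ** 2)      # Eq. (4): realized market variance
    v_cov = np.sum(r_stock * r_market)     # Eq. (5): realized covariance
    return v_cov / v2_market               # Eq. (6): realized beta

if __name__ == "__main__":
    r_stock, r_market = simulate_quarter()
    print("realized beta:", realized_beta(r_stock, r_market))
```

Because the simulated covariance matrix is constant, the printed realized beta should lie close to the assumed value of 1.2, with the deviation reflecting the finite-sample measurement error discussed above; the same number is also the slope from a no-intercept OLS regression of the stock returns on the market returns, consistent with the regression interpretation of the estimator.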
2.2. Nonlinear Fractional Cointegration: A Common Long-Memory Feature in Variances and Covariances

The possibility of common persistent components is widely recognized in modern multivariate time-series econometrics. It is also important for our analysis, because there may be common persistence features in the underlying variances and covariances from which betas are produced. The idea of a common feature is a simple generalization of the well-known cointegration concept. If two variables are integrated but there exists a function f of them that is not, we say that they are cointegrated, and we call f the cointegrating function. More generally, if two variables have property X but there exists a function of them that does not, we say that they have common feature X. A key situation is when X corresponds to persistence, in which case we call the function of the two variables that eliminates the persistence the copersistence function. It will prove useful to consider linear and nonlinear copersistence functions in turn. Most literature focuses on linear copersistence functions. The huge cointegration literature pioneered by Granger (1981) and Engle and Granger (1987) deals primarily with linear common long-memory I(1) persistence features. The smaller copersistence literature started by Engle and Kozicki (1993) deals mostly with linear common short-memory I(0) persistence features. The idea of fractional cointegration, suggested by Engle and Granger (1987) and developed by Cheung and Lai (1993) and Robinson and Marinucci (2001),
among others, deals with linear common long-memory I(d) persistence features, 0 < d < 1/2. Our interest is closely related but different. First, it centers on nonlinear copersistence functions, because betas are ratios. There is little literature on nonlinear common persistence features, although they are implicitly treated in Granger (1995). We will be interested in nonlinear common long-memory I(d) persistence features, 0 < d < 1/2, effectively corresponding to nonlinear fractional cointegration.8 Second, we are interested primarily in the case of known cointegrating relationships. That is, we may not know whether a given stock's covariance with the market is fractionally cointegrated with the market variance, but if it is, then there is a good financial economic reason (i.e., the CAPM) to suspect that the cointegrating function is the ratio of the covariance to the variance. This provides great simplification. In the integer-cointegration framework with known cointegrating vector under the alternative, for example, one could simply test the cointegrating combination for a unit root, or test the significance of the error-correction term in a complete error-correction model, as in Horvath and Watson (1995). We proceed in analogous fashion, examining the integration status (generalized to allow for fractional integration) of the realized market variance, realized individual equity covariances with the market, and realized market betas.

Our realized beta series are unfortunately relatively short compared to the length required for formal testing and inference procedures regarding (fractional) cointegration, as the fractional integration and cointegration estimators proposed by Geweke and Porter-Hudak (1983), Robinson and Marinucci (2001), and Andrews and Guggenberger (2003) tend to behave quite erratically in small samples. In addition, there is considerable measurement noise in the individual beta series, so that influential outliers may have a detrimental impact on our ability to discern the underlying dynamics. Hence, we study the nature of the long-range dependence and short-run dynamics in the realized volatility measures and realized betas through intentionally less formal but arguably more informative graphical means, and via some robust procedures that utilize the joint information across many series, to which we now turn.
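For concreteness, the log-periodogram regression of Geweke and Porter-Hudak (1983) mentioned above can be sketched as follows. This is a generic textbook version, not the estimator actually run in the paper, and the bandwidth choice (the first $T^{1/2}$ Fourier frequencies) is an assumption.

```python
import numpy as np

def gph_d(x: np.ndarray, bandwidth_power: float = 0.5) -> float:
    """Geweke-Porter-Hudak log-periodogram estimate of the fractional
    integration parameter d, using the first m = T**bandwidth_power
    Fourier frequencies."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    T = x.size
    m = int(np.floor(T ** bandwidth_power))
    freqs = 2.0 * np.pi * np.arange(1, m + 1) / T        # Fourier frequencies
    dft = np.fft.fft(x)[1 : m + 1]
    periodogram = (np.abs(dft) ** 2) / (2.0 * np.pi * T)
    regressor = -np.log(4.0 * np.sin(freqs / 2.0) ** 2)  # GPH regressor
    X = np.column_stack([np.ones(m), regressor])
    coef, *_ = np.linalg.lstsq(X, np.log(periodogram), rcond=None)
    return float(coef[1])                                 # slope estimates d
```

With roughly 150 quarterly observations, m is only about a dozen frequencies, which is precisely why such estimates behave erratically here and why the paper leans on graphical and robust evidence instead.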
3. EMPIRICAL ANALYSIS

We examine primarily the realized quarterly betas constructed from daily returns. We focus on the dynamic properties of market betas vis-à-vis the
dynamic properties of their underlying covariance and variance components. We quantify the dynamics in a number of ways, including explicit measurement of the degree of predictability in the tradition of Granger and Newbold (1986).
3.1. Dynamics of Quarterly Realized Variance, Covariances and Betas

This section investigates the realized quarterly betas constructed from daily returns obtained from the Center for Research in Security Prices from July 1962 to September 1999. We take the market return $r_{m,t}$ to be the return on the 30-stock Dow Jones Industrial Average (DJIA), and we study the subset of 25 DJIA stocks as of March 1997 with complete data from July 2, 1962 to September 17, 1999, as detailed in Table 1. We then construct quarterly realized DJIA variances, individual equity covariances with the market, and betas, 1962:3–1999:3 (149 observations).

In Fig. 1, we provide a time-series plot of the quarterly realized market variance, with fall 1987 included (top panel) and excluded (bottom panel). It is clear that the realized variance is quite persistent and, moreover, that the fall 1987 volatility shock is unlike any other ever recorded, in that volatility reverts to its mean almost instantaneously. In addition, our subsequent computation of asymptotic standard errors reveals that the uncertainty associated with the fall 1987 beta estimate is enormous, to the point of rendering it entirely uninformative. In sum, it is an exceptional outlier with potentially large influence on the analysis, and it is measured with huge imprecision. Hence, following many other authors, we drop the fall 1987 observation from this point onward.

In Figs. 2 and 3, we display time-series plots of the 25 quarterly realized covariances and realized betas.9 Like the realized variance, the realized covariances appear highly persistent. The realized betas, in contrast, appear noticeably less persistent. This impression is confirmed by the statistics presented in Table 2: the median Ljung–Box Q-statistic (through displacement 12) is 84 for the realized covariance, but only 47 for the realized beta, although both are of course significant relative to a $\chi^2(12)$ distribution.10 The impression of reduced persistence in realized betas relative to realized covariances is also confirmed by the sample autocorrelation functions for the realized market variance, the realized covariances with the market, and the realized betas shown in Fig. 4.11 Most remarkable is the close correspondence between the shape of the realized market variance correlogram and the realized covariance correlograms. This reflects an extraordinarily high degree of dependence in the correlograms across the individual realized
Table 1. The Dow Jones Thirty.

Company Name | Ticker | Data Range
Alcoa Inc. | AA | 07/02/1962 to 09/17/1999
Allied Capital Corporation | ALD | 07/02/1962 to 09/17/1999
American Express Co. | AXP (a) | 05/31/1977 to 09/17/1999
Boeing Co. | BA | 07/02/1962 to 09/17/1999
Caterpillar Inc. | CAT | 07/02/1962 to 09/17/1999
Chevron Corp. | CHV | 07/02/1962 to 09/17/1999
DuPont Co. | DD | 07/02/1962 to 09/17/1999
Walt Disney Co. | DIS | 07/02/1962 to 09/17/1999
Eastman Kodak Co. | EK | 07/02/1962 to 09/17/1999
General Electric Co. | GE | 07/02/1962 to 09/17/1999
General Motors Corp. | GM | 07/02/1962 to 09/17/1999
Goodyear Tire & Rubber Co. | GT | 07/02/1962 to 09/17/1999
Hewlett–Packard Co. | HWP | 07/02/1962 to 09/17/1999
International Business Machines Corp. | IBM | 07/02/1962 to 09/17/1999
International Paper Co. | IP | 07/02/1962 to 09/17/1999
Johnson & Johnson | JNJ | 07/02/1962 to 09/17/1999
JP Morgan Chase & Co. | JPM (a) | 03/05/1969 to 09/17/1999
Coca–Cola Co. | KO | 07/02/1962 to 09/17/1999
McDonald's Corp. | MCD (a) | 07/05/1966 to 09/17/1999
Minnesota Mining & Manufacturing Co. | MMM | 07/02/1962 to 09/17/1999
Philip Morris Co. | MO | 07/02/1962 to 09/17/1999
Merck & Co. | MRK | 07/02/1962 to 09/17/1999
Procter & Gamble Co. | PG | 07/02/1962 to 09/17/1999
Sears, Roebuck and Co. | S | 07/02/1962 to 09/17/1999
AT&T Corp. | T | 07/02/1962 to 09/17/1999
Travelers Group Inc. | TRV (a) | 10/29/1986 to 09/17/1999
Union Carbide Corp. | UK | 07/02/1962 to 09/17/1999
United Technologies Corp. | UTX | 07/02/1962 to 09/17/1999
Wal–Mart Stores Inc. | WMT (a) | 11/20/1972 to 09/17/1999
Exxon Corp. | XON | 07/02/1962 to 09/17/1999

Note: Company names, tickers, and the range of the data examined. We use the Dow Jones Thirty as of March 1997. (a) Stocks with incomplete data, which we exclude from the analysis.
covariances with the market, as shown in Fig. 5. In Fig. 4, this dependence makes the median covariance correlogram appear as a very slightly dampened version of that for the market variance. This contrasts sharply with the lower and gently declining pattern for the realized beta autocorrelations. Intuitively, movements of the realized market variance are largely reflected in movements of the realized covariances; as such, they largely "cancel" when we form ratios (realized betas). Consequently, the correlation structure across
the individual realized beta series in Fig. 6 is much more dispersed than is the case for the realized covariances in Fig. 5. This results in an effective averaging of the noise, and the point estimates of the median correlation values are effectively zero beyond 10 quarters for the beta series.12

Fig. 1. Time Series Plot of Quarterly Realized Market Variance, Fall 1987 (a) Included, (b) Excluded. Note: The two panels show the time series of quarterly realized market variance, with the 1987:4 outlier included (a) and excluded (b). The sample covers the period from 1962:3 through 1999:3, for a total of 149 observations. We calculate the realized quarterly market variances from daily returns.
Fig. 2. Time Series Plots of Quarterly Realized Covariances. Note: The time series of quarterly realized covariances, with the 1987:4 outlier excluded, are shown. The sample covers the period from 1962:3 through 1999:3, for a total of 148 observations. We calculate the realized quarterly covariances from daily returns.

Fig. 3. Time Series Plots of Quarterly Realized Betas. Note: The time series of quarterly realized betas, with the 1987:4 outlier excluded, are shown. The sample covers the period from 1962:3 through 1999:3, for a total of 148 observations. We calculate the realized quarterly betas from daily returns.
Table 2. The Dynamics of Quarterly Realized Market Variance, Covariances and Betas.

Market variance, $v^2_{mt}$: Q = 109.50, ADF1 = -5.159, ADF2 = -3.792, ADF3 = -4.014, ADF4 = -3.428.

cov(r_mt, r_it):
     | Q      | ADF1   | ADF2   | ADF3   | ADF4
Min. | 47.765 | -6.188 | -4.651 | -4.621 | -4.023
0.10 | 58.095 | -5.880 | -4.383 | -4.469 | -3.834
0.25 | 69.948 | -5.692 | -4.239 | -4.352 | -3.742
0.50 | 84.190 | -5.478 | -4.078 | -4.179 | -3.631
0.75 | 100.19 | -5.235 | -3.979 | -4.003 | -3.438
0.90 | 119.28 | -4.915 | -3.777 | -3.738 | -3.253
Max. | 150.96 | -4.499 | -3.356 | -3.690 | -2.986
Mean | 87.044 | -5.435 | -4.085 | -4.159 | -3.580
S.D. | 24.507 |  0.386 |  0.272 |  0.250 |  0.239

b_it:
     | Q      | ADF1   | ADF2   | ADF3   | ADF4
Min. | 6.634  | -8.658 | -6.750 | -5.482 | -5.252
0.10 | 15.026 | -7.445 | -6.419 | -5.426 | -4.877
0.25 | 26.267 | -6.425 | -5.576 | -5.047 | -4.294
0.50 | 46.593 | -6.124 | -5.026 | -3.896 | -3.728
0.75 | 66.842 | -5.431 | -4.188 | -3.724 | -3.313
0.90 | 106.67 | -4.701 | -3.404 | -3.225 | -2.980
Max. | 134.71 | -4.600 | -3.315 | -2.808 | -2.493
Mean | 53.771 | -6.090 | -4.925 | -4.245 | -3.838
S.D. | 35.780 |  1.026 |  0.999 |  0.802 |  0.729

Note: Summary of the time-series dependence structure of the quarterly realized market variance, covariances, and realized betas. Q denotes the Ljung–Box portmanteau statistic for up to 12th-order autocorrelation, and ADFi denotes the augmented Dickey–Fuller unit root test statistic, with intercept and with i augmentation lags. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded, for a total of 148 observations. We calculate the quarterly realized variance, covariances, and betas from daily returns.
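The entries in Table 2 are standard statistics: the Ljung–Box portmanteau statistic through displacement 12, and augmented Dickey–Fuller statistics with an intercept and a fixed number i of augmentation lags. A minimal sketch of how they might be computed for one quarterly series follows; the hand-rolled Q-statistic and the use of statsmodels' adfuller are illustrative assumptions, not the authors' code.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

def ljung_box_q(x, lags: int = 12) -> float:
    """Ljung-Box portmanteau statistic through the given displacement,
    to be compared against a chi-square(lags) distribution."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    T = x.size
    denom = float(x @ x)
    rho = np.array([x[: T - k] @ x[k:] / denom for k in range(1, lags + 1)])
    return float(T * (T + 2) * np.sum(rho ** 2 / (T - np.arange(1, lags + 1))))

def adf_fixed_lags(x, lags: int) -> float:
    """Augmented Dickey-Fuller statistic with intercept and exactly `lags`
    augmentation lags, matching the ADFi columns of Table 2."""
    stat, *_ = adfuller(np.asarray(x, dtype=float),
                        maxlag=lags, regression="c", autolag=None)
    return float(stat)

# Hypothetical usage for one realized-covariance series `rcov` (148 quarters):
# q12 = ljung_box_q(rcov, 12)
# adf_stats = {i: adf_fixed_lags(rcov, i) for i in range(1, 5)}
```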
The work of Andersen et al. (2001a) and Andersen et al. (2003), as well as that of many other authors, indicates that asset return volatilities are well-described by a pure fractional noise process, typically with the degree of integration around d ≈ 0.4.13 That style of analysis is mostly conducted on high-frequency data. Very little work has been done on long memory in equity variances, market covariances, and market betas at the quarterly frequency, and it is hard to squeeze accurate information about d directly from the fairly limited quarterly sample. It is well-known, however, that if a flow variable is I(d), then it remains I(d) under temporal aggregation. Hence, we can use the results of analyses of high-frequency data, such as Andersen et al. (2003), to help us analyze the quarterly data. After some experimentation, and in keeping with the typical finding that d ≈ 0.4, we settled on d = 0.42.

In Fig. 7, we graph the sample autocorrelations of the quarterly realized market variance, the median realized covariances with the market, and the median realized betas, all prefiltered by $(1-L)^{0.42}$. It is evident that the dynamics in the realized variance and covariances are effectively annihilated by filtering with $(1-L)^{0.42}$, indicating that the pure fractional noise process with d = 0.42 is indeed a good approximation to their dynamics. Interestingly, however, filtering the realized betas with $(1-L)^{0.42}$ appears to produce overdifferencing, as evidenced by the fact that the first autocorrelation of the fractionally differenced betas is often negative. Compare, in particular, the median sample autocorrelation function for the prefiltered realized covariances to the median sample autocorrelation function for the prefiltered realized betas. The difference is striking in the sense that the first autocorrelation coefficient for the betas is negative and much larger in absolute value than those for all of the subsequent lags. Recall that the standard error band for the median realized beta (not shown in the lower panels, as it depends on the unknown cross-sectional dependence structure) should be considerably narrower than for the other series in Fig. 7, thus likely rendering the first-order correlation coefficient for the beta series significantly negative. This finding can be seen to be reasonably consistent across the individual prefiltered covariance and beta correlation functions displayed in Figs. 8 and 9. If fractional differencing of the realized betas by $(1-L)^{0.42}$ may be "too much," then the question naturally arises as to how much differencing is "just right."
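The prefilter $(1-L)^{0.42}$ can be applied by truncating the binomial expansion of the fractional-differencing operator at the start of the sample. A minimal sketch, with the simple truncation rule taken as an assumption rather than as the authors' exact implementation:

```python
import numpy as np

def frac_diff(x: np.ndarray, d: float) -> np.ndarray:
    """Apply the fractional-differencing filter (1 - L)**d, truncating the
    binomial expansion at the start of the sample.  d = 0.42 corresponds to
    the prefilter used for the realized variance and covariance series."""
    x = np.asarray(x, dtype=float)
    T = x.size
    w = np.empty(T)
    w[0] = 1.0
    for k in range(1, T):                 # binomial coefficients of (1 - L)**d
        w[k] = w[k - 1] * (k - 1.0 - d) / k
    # observation t uses only the t+1 available past values
    return np.array([np.dot(w[: t + 1], x[t::-1]) for t in range(T)])
```

The first few filtered observations rest on very short expansions and are typically discarded or interpreted with caution.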
Fig. 4. Sample Autocorrelations of Quarterly Realized Market Variance, Median Sample Autocorrelations of Quarterly Realized Covariances, and Median Sample Autocorrelations of Quarterly Realized Betas. Note: The first 36 sample autocorrelations of the quarterly realized market variance, the medians across individual stocks of the first 36 sample autocorrelations of the quarterly realized covariances, and the medians across individual stocks of the first 36 sample autocorrelations of the quarterly realized betas are shown. The dashed lines denote Bartlett's approximate 95 percent confidence band in the white-noise case. Q denotes the Ljung–Box portmanteau statistic for up to 12th-order autocorrelation. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded, for a total of 148 observations. We calculate the quarterly realized variance, covariances, and betas from daily returns.

Fig. 5. Sample Autocorrelations of Quarterly Realized Covariances. Note: The first 36 sample autocorrelations of the quarterly realized covariances are shown. The dashed lines denote Bartlett's approximate 95 percent confidence band in the white-noise case. Q denotes the Ljung–Box portmanteau statistic for up to 12th-order autocorrelation. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded, for a total of 148 observations. We calculate the quarterly realized covariances from daily returns.

Fig. 6. Sample Autocorrelations of Quarterly Realized Betas. Note: The first 36 sample autocorrelations of the quarterly realized betas are shown. The dashed lines denote Bartlett's approximate 95 percent confidence band in the white-noise case. Q denotes the Ljung–Box portmanteau statistic for up to 12th-order autocorrelation. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded, for a total of 148 observations. We calculate the quarterly realized betas from daily returns.

Fig. 7. (a) Sample Autocorrelations of Quarterly Realized Market Variance Prefiltered by $(1-L)^{0.42}$; Median Sample Autocorrelations of Quarterly Realized (b) Covariances and (c) Betas. Note: The three panels (a–c) show the first 36 sample autocorrelations of the quarterly realized market variance, the medians across individual stocks of the first 36 sample autocorrelations of the quarterly realized covariances, and the medians across individual stocks of the first 36 sample autocorrelations of the quarterly realized betas, all prefiltered by $(1-L)^{0.42}$. The dashed lines denote Bartlett's approximate 95 percent confidence band in the white-noise case. Q denotes the median of the Ljung–Box portmanteau statistic for up to 12th-order autocorrelation. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded, for a total of 148 observations. We calculate the quarterly realized variance from daily returns.

Fig. 8. Sample Autocorrelations of Quarterly Realized Covariances Prefiltered by $(1-L)^{0.42}$. Note: The first 36 sample autocorrelations of the quarterly realized covariances prefiltered by $(1-L)^{0.42}$ are shown. The dashed lines denote Bartlett's approximate 95 percent confidence band in the white-noise case. Q denotes the Ljung–Box portmanteau statistic for up to 12th-order autocorrelation. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded, for a total of 148 observations. We calculate the quarterly realized covariances from daily returns.

Fig. 9. Sample Autocorrelations of Quarterly Realized Betas Prefiltered by $(1-L)^{0.42}$. Note: The first 36 sample autocorrelations of the quarterly realized betas prefiltered by $(1-L)^{0.42}$ are shown. The dashed lines denote Bartlett's approximate 95 percent confidence band in the white-noise case. Q denotes the Ljung–Box portmanteau statistic for up to 12th-order autocorrelation. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded, for a total of 148 observations. We calculate the quarterly realized betas from daily returns.
Some experimentation revealed that differencing the betas by $(1-L)^{0.20}$ was often adequate for eliminating the dynamics. However, for short samples it is almost impossible to distinguish low-order fractional integration from persistent but strictly stationary dynamics. We are particularly interested in the latter alternative, where the realized betas are I(0). To explore this possibility, we fit simple AR(p) processes to realized betas, with p selected by the Akaike Information Criterion (AIC). We show the estimated roots in Table 3, all of which are indicative of covariance stationarity. In Fig. 10, we show the sample autocorrelation functions of quarterly realized betas prefiltered by the estimated AR(p) lag-operator polynomials. The autocorrelation functions are indistinguishable from those of white noise.

Taken as a whole, the results suggest that realized betas are integrated of noticeably lower order than are the market variance and the individual equity covariances with the market, corresponding to a situation of nonlinear fractional cointegration. I(d) behavior, with d ∈ [0, 0.25], appears accurate for betas, whereas the market variance and the individual equity covariances with the market are better approximated as I(d) with d ∈ [0.35, 0.45]. Indeed, there is little evidence against an assertion that betas are I(0), whereas there is strong evidence against such an assertion for the variance and covariance components.

3.2. Predictability

Examination of the predictability of realized beta and its components provides a complementary perspective and additional insight. Granger and Newbold (1986) propose a measure of the predictability of covariance stationary series under squared-error loss, patterned after the familiar regression $R^2$,

$$G(j) = \frac{\mathrm{var}(\hat{x}_{t+j,t})}{\mathrm{var}(x_t)} = 1 - \frac{\mathrm{var}(e_{t+j,t})}{\mathrm{var}(x_t)}, \qquad (8)$$

where j is the forecast horizon of interest, $\hat{x}_{t+j,t}$ the optimal (i.e., conditional mean) forecast, and $e_{t+j,t} = x_{t+j} - \hat{x}_{t+j,t}$. Diebold and Kilian (2001) define a generalized measure of predictability, building on the Granger–Newbold measure, as

$$P(L, \Omega, j, k) = 1 - \frac{E(L(e_{t+j,t}))}{E(L(e_{t+k,t}))}, \qquad (9)$$

where L denotes the relevant loss function, $\Omega$ is the available univariate or multivariate information set, j the forecast horizon of interest, and k a long but not necessarily infinite reference horizon.
Table 3. Inverted Roots of AR(p) Models for Quarterly Realized Betas.

Stock | Modulus of dominant inverted root
AA | 0.57
ALD | 0.50
BA | 0.80
CAT | 0.39
CHV | 0.80
DD | 0.20
DIS | 0.86
EK | 0.73
GE | 0.50
GM | 0.84
GT | 0.83
HWP | 0.36
IBM | 0.76
IP | 0.10
JNJ | 0.33
KO | 0.79
MMM | 0.47
MO | 0.83
MRK | 0.81
PG | 0.72
S | 0.59
T | 0.87
UTX | 0.77
UK | 0.80
XON | 0.58

Note: For each stock, an AR(p) model is fit to the quarterly realized betas by least squares, with p selected by the AIC, and the table reports the modulus of the dominant inverted root of the estimated autoregressive lag-operator polynomial $(1 - \hat{\Phi}_1 L - \hat{\Phi}_2 L^2 - \cdots - \hat{\Phi}_p L^p)$. All moduli lie strictly inside the unit circle, consistent with covariance stationarity. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded, for a total of 148 observations. We calculate the quarterly realized variance, covariances, and betas from daily returns.
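The procedure behind Table 3 only requires least-squares AR fits, an information criterion, and the roots of the estimated lag polynomial. A self-contained sketch (the simple AIC variant and the maximum lag order are assumptions, not the authors' code):

```python
import numpy as np

def fit_ar_ols(x: np.ndarray, p: int):
    """Least-squares AR(p) fit with intercept; returns (phi, sigma2, aic)."""
    x = np.asarray(x, dtype=float)
    T = x.size
    Y = x[p:]
    X = np.column_stack([np.ones(T - p)] +
                        [x[p - k: T - k] for k in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coef
    sigma2 = float(resid @ resid) / (T - p)
    # standard AIC variant (up to constants); effective sample differs
    # slightly across p, which is ignored here for simplicity
    aic = np.log(sigma2) + 2.0 * (p + 1) / (T - p)
    return coef[1:], sigma2, aic

def ar_roots_by_aic(x: np.ndarray, max_p: int = 8):
    """Select p by the AIC and return the inverted roots of the estimated
    AR lag polynomial; moduli inside the unit circle indicate stationarity."""
    fits = {p: fit_ar_ols(x, p) for p in range(1, max_p + 1)}
    best_p = min(fits, key=lambda p: fits[p][2])
    phi = fits[best_p][0]
    # inverted roots solve z**p - phi_1 z**(p-1) - ... - phi_p = 0
    inv_roots = np.roots(np.concatenate(([1.0], -phi)))
    return best_p, inv_roots, np.abs(inv_roots)
```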
Regardless of the details, the basic idea of predictability measurement is simply to compare the expected loss of a short-horizon forecast to the expected loss of a very long-horizon forecast. The former will be much smaller than the latter if the series is highly predictable, as the available conditioning information will then be very valuable. The Granger–Newbold measure, which is the canonical case of the Diebold–Kilian measure (corresponding to $L(e) = e^2$, univariate $\Omega$, and $k = \infty$), compares the 1-step-ahead forecast error variance to the $\infty$-step-ahead forecast error variance, i.e., the unconditional variance of the series being forecast (assuming that it is finite).
Fig. 10. Sample Autocorrelations of Quarterly Realized Betas Prefiltered by $(1 - \hat{\Phi}_1 L - \hat{\Phi}_2 L^2 - \cdots - \hat{\Phi}_p L^p)$. Note: The first 36 sample autocorrelations of the quarterly realized betas prefiltered by $(1 - \hat{\Phi}_1 L - \hat{\Phi}_2 L^2 - \cdots - \hat{\Phi}_p L^p)$, where $\hat{\Phi}_1, \hat{\Phi}_2, \ldots, \hat{\Phi}_p$ are the least squares estimates of the parameters of AR(p) models fit to the realized betas, with p selected by the AIC, are shown. The dashed lines denote Bartlett's approximate 95 percent confidence band in the white-noise case. Q denotes the Ljung–Box portmanteau statistic for up to 12th-order autocorrelation. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded, for a total of 148 observations. We calculate the quarterly realized variance, covariances, and betas from daily returns.
In what follows, we use predictability measures to provide additional insight into the comparative dynamics of the realized variances and covariances versus the realized betas. Given the strong evidence of fractional integration in the realized market variance and covariances, we maintain the pure fractional noise process for the quarterly realized market variance and the realized covariances, namely ARFIMA(0, 0.42, 0). We then calculate the Granger–Newbold predictability G(j) analytically, conditional upon the ARFIMA(0, 0.42, 0) dynamics, and we graph it in Fig. 11 for j = 1, ..., 7 quarters.14 The graph starts out as high as 0.4 and decays only slowly over the first seven quarters. If the realized beta likewise follows a pure fractional noise process but with a smaller degree of integration, say ARFIMA(0, 0.20, 0), which we argued was plausible, then the implied predictability is much lower, as also shown in Fig. 11.

As we also argued, however, the integration status of the realized betas is difficult to determine. Hence, for the realized betas we also compute Granger–Newbold predictability using an estimated AR(p) sieve approximation to produce estimates of $\mathrm{var}(e_{t+j,t})$ and $\mathrm{var}(x_t)$; this approach is valid regardless of whether the true dynamics are short-memory or long-memory. In Fig. 12 we plot the beta predictabilities, which remain noticeably smaller and more quickly decaying than the covariance predictabilities, as is further clarified by comparing the median beta predictability, also included in Fig. 11, to the market variance and equity covariances predictability. It is noteworthy that the shorter-run beta predictability, up to about four quarters, implied by the AR(p) dynamics is considerably higher than for the I(0.20) dynamics. Due to the long-memory feature of the I(0.20) process, this eventually reverses beyond five quarters.
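The predictability numbers behind the long-memory benchmarks in Fig. 11 only require the Wold (moving-average) coefficients of the assumed process. A sketch, with the reference horizon k = 40 taken from the figure notes; this is an illustrative calculation under stated assumptions, not the authors' code.

```python
import numpy as np

def arfima_ma_coefficients(d: float, n: int) -> np.ndarray:
    """Wold (moving-average) coefficients b_0, ..., b_{n-1} of the pure
    fractional noise model (1 - L)**d y_t = eps_t."""
    b = np.empty(n)
    b[0] = 1.0
    for i in range(1, n):
        b[i] = b[i - 1] * (d + i - 1.0) / i
    return b

def predictability(b: np.ndarray, j: int, k: int = 40) -> float:
    """Granger-Newbold-style predictability P_j = 1 - var(e_{t+j,t}) / var(e_{t+k,t}),
    with forecast-error variances built from the MA coefficients."""
    c = np.cumsum(b ** 2)
    return float(1.0 - c[j - 1] / c[k - 1])

# Long-memory benchmarks used in the text (horizons j = 1, ..., 7).
b42 = arfima_ma_coefficients(0.42, 41)   # market variance and covariances
b20 = arfima_ma_coefficients(0.20, 41)   # one candidate for the betas
p_var = [predictability(b42, j) for j in range(1, 8)]
p_beta = [predictability(b20, j) for j in range(1, 8)]
```

For the AR(p) sieve version in Fig. 12, the same predictability formula applies once the MA coefficients are obtained by inverting the fitted AR polynomial.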
4. ASSESSING PRECISION: INTERVAL ESTIMATES OF BETAS

Thus far, we have largely abstracted from the presence of estimation error in the realized betas. It is possible to assess the (time-varying) estimation error directly using formal continuous-record asymptotics.
4.1. Continuous-Record Asymptotic Standard Errors

We first use the multivariate asymptotic theory recently developed by Barndorff-Nielsen and Shephard (2003) to assess the precision of our realized betas which are, of course, estimates of the underlying integrated
Fig. 11. Predictability of Market Volatility, Individual Equity Covariances with the Market, and Betas. Note: We define predictability as $P_j = 1 - \mathrm{var}(e_{t+j,t})/\mathrm{var}(e_{t+40,t})$, where $\mathrm{var}(e_{t+j,t}) = \sigma^2 \sum_{i=0}^{j-1} b_i^2$, $\sigma^2$ is the variance of the innovation $\epsilon_t$, and the $b_i$'s are moving-average coefficients; i.e., the Wold representation is $y_t = (1 + b_1 L + b_2 L^2 + b_3 L^3 + \cdots)\epsilon_t$. We approximate the dynamics using a pure long-memory model, $(1-L)^{0.42} y_t = \epsilon_t$, in which case $b_0 = 1$ and $b_i = b_{i-1}(d + i - 1)/i$, and plot $P_j$ for $j = 1, \ldots, 7$ in the solid line. Moreover, because we take d = 0.42 for market volatility and for all covariances with the market, all of their predictabilities are the same at all horizons. As one approximation for the dynamics of the betas we use a pure long-memory model, $(1-L)^{0.20} y_t = \epsilon_t$, with the analogous coefficients, and plot $P_j$ for $j = 1, \ldots, 7$ in the dotted line. We also approximate the beta dynamics using an AR(p) model, with the autoregressive lag order p determined by the AIC, and plot the median of $P_j$ for $j = 1, \ldots, 7$ among all 25 stocks in the mixed dotted line. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded, for a total of 148 observations.
betas. This helps us in thinking about separating "news from noise" when examining temporal movements in the series. From the discussion above, realized beta for stock i in quarter t is simply

$$\hat{\beta}_{it} = \frac{\sum_{j=1}^{N_t} r_{ijt}\, r_{mjt}}{\sum_{j=1}^{N_t} r_{mjt}^2}, \qquad (10)$$
where $r_{ijt}$ is the return of stock i on day j of quarter t, $r_{mjt}$ the return of the DJIA on day j of quarter t, and $N_t$ the number of units (e.g., days) into which quarter t is partitioned.15 Under appropriate regularity conditions that allow for non-stationarity in the series, Barndorff-Nielsen and Shephard (2003)
Fig. 12. Predictability of Betas Based on AR(p) Sieve Approximation of Dynamics. Note: We define predictability as $P_j = 1 - \mathrm{var}(e_{t+j,t})/\mathrm{var}(y_t)$, where $\mathrm{var}(e_{t+j,t}) = \sigma^2 \sum_{i=0}^{j-1} b_i^2$, $\mathrm{var}(y_t) = \sigma^2 \sum_{i=0}^{\infty} b_i^2$, $\sigma^2$ is the variance of the innovation $\epsilon_t$, and the $b_i$'s correspond to the moving-average coefficients in the Wold representation for $y_t$. We approximate the dynamics using an AR(p) model, with the autoregressive lag order p determined by the AIC, and plot $P_j$ for $j = 1, \ldots, 7$. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded, for a total of 148 observations. We calculate the quarterly realized betas from daily returns.
derive the limiting distribution of realized beta. In particular, as $N_t \to \infty$,

$$\frac{\hat{\beta}_{it} - \beta_{it}}{\sqrt{\left(\sum_{j=1}^{N_t} r_{mjt}^2\right)^{-2} \hat{\gamma}_{it}}} \;\Rightarrow\; N(0, 1), \qquad (11)$$

where

$$\hat{\gamma}_{it} = \sum_{j=1}^{N_t} a_{ij}^2 - \sum_{j=1}^{N_t - 1} a_{ij}\, a_{ij+1} \qquad (12)$$

and

$$a_{ij} = r_{ijt}\, r_{mjt} - \hat{\beta}_{it}\, r_{mjt}^2. \qquad (13)$$
Thus, a feasible and asymptotically valid α-percent confidence interval for the underlying integrated beta is

$$\hat{\beta}_{it} \pm z_{\alpha/2} \sqrt{\left(\sum_{j=1}^{N_t} r_{mjt}^2\right)^{-2} \hat{\gamma}_{it}}, \qquad (14)$$

where $z_{\alpha/2}$ denotes the appropriate critical value of the standard normal distribution.

In Fig. 13, we plot the pointwise 95 percent confidence intervals for the quarterly betas. They are quite wide, indicating that daily sampling is not adequate to drive out all measurement error. They are, given the width of the bands, moreover, consistent with the conjecture that there is only limited (short range) dependence in the realized beta series. The continuous-record asymptotics discussed above directly point to the advantage of using finer sampled data for improved beta measurements. However, the advent of reliable high-frequency intraday data is, unfortunately, a relatively recent phenomenon, and we do not have access to such data for the full 1962:3–1999:3 sample period used in the empirical analysis so far. Nonetheless, to see how the reduction in measurement error afforded by the use of finer sampled intradaily data manifests itself empirically in more reliable inference, we reproduce in Fig. 14 the pointwise 95 percent confidence bands for the quarterly betas over the shorter 1993:1–1999:3 sample. These bands may be compared directly to the corresponding quarterly realized beta standard error bands over the identical time span based on a 15-min sampling scheme reported in Fig. 15.16 The improvement is readily visible in the narrowing of the bands. It is also evident from Fig. 15 that there is quite pronounced positive dependence in the realized quarterly beta measures.

Fig. 13. Ninety-Five Percent Confidence Intervals for Quarterly Beta, Long Sample, Daily Sampling. Note: The time series of 95 percent confidence intervals for the underlying quarterly integrated beta, calculated using the results of Barndorff-Nielsen and Shephard (2003), are shown. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded. We calculate the realized quarterly betas from daily returns.

Fig. 14. Ninety-Five Percent Confidence Intervals for Quarterly Beta, Short Sample, Daily Sampling. Note: The time series of 95 percent confidence intervals for the underlying quarterly integrated beta, calculated using the results of Barndorff-Nielsen and Shephard (2003), are shown. The sample covers the period from 1993:2 through 1999:3. We calculate the realized quarterly betas from daily returns.

Fig. 15. Ninety-Five Percent Confidence Intervals for Quarterly Beta, Short Sample, 15-Min Sampling. Note: The time series of 95 percent confidence intervals for the underlying quarterly integrated beta, calculated using the results of Barndorff-Nielsen and Shephard (2003), are shown. The sample covers the period from 1993:2 through 1999:3. We calculate the realized quarterly betas from 15-min returns.
In other words, the high-frequency beta measures importantly complement the results for the betas obtained from the lower frequency daily data, by more clearly highlighting the dynamic evolution of individual security betas. In the web appendix to this paper (www.ssc.upenn.edu/fdiebold), we perform a preliminary analysis of realized betas computed from high-frequency data over the shorter 7-year sample period. The results are generally supportive of the findings reported here, but the relatively short sample available for the analysis invariably limits the power of our tests for fractional integration and nonlinear cointegration. In the concluding remarks to this paper, we also sketch a new and powerful econometric framework that we plan to pursue in future, much more extensive, work using underlying high-frequency data.

4.2. HAC Asymptotic Standard Errors

As noted previously, the quarterly realized betas are just regression coefficients computed quarter-by-quarter from CAPM regressions using intraquarter daily data. One could obtain consistent estimates of the standard errors of those quarterly regression-based betas using HAC approaches, such as Newey–West, under the very stringent auxiliary assumption that the period-by-period betas are constant. For comparison to the continuous-record asymptotic bands discussed above, we also compute these HAC standard error bands. In Fig. 16, we provide the Newey–West 95 percent confidence intervals for the quarterly realized betas. Comparing the figure to Fig. 13, there is not much difference in the assessment of the estimation uncertainty inherent in the quarterly beta measures obtained from the two alternative procedures based on daily data. However, as noted above, there are likely important gains to be had from moving to high-frequency intraday data.
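As a compact illustration of Eqs. (10)–(14), the quarterly point estimate and its continuous-record confidence band can be computed from the within-quarter returns as follows. This is a sketch, not the authors' code; the fixed z = 1.96 corresponds to a 95 percent interval.

```python
import numpy as np

def realized_beta_ci(r_i, r_m, z: float = 1.96):
    """Realized beta for one quarter (Eq. 10) together with the feasible
    continuous-record confidence interval of Eq. (14), built from the
    Barndorff-Nielsen and Shephard (2003) asymptotics in Eqs. (11)-(13)."""
    r_i = np.asarray(r_i, dtype=float)
    r_m = np.asarray(r_m, dtype=float)
    sum_m2 = np.sum(r_m ** 2)
    beta_hat = np.sum(r_i * r_m) / sum_m2                 # Eq. (10)
    a = r_i * r_m - beta_hat * r_m ** 2                   # Eq. (13)
    gamma_hat = np.sum(a ** 2) - np.sum(a[:-1] * a[1:])   # Eq. (12)
    se = np.sqrt(gamma_hat) / sum_m2                      # from Eq. (11)
    return beta_hat, (beta_hat - z * se, beta_hat + z * se)
```

For the Newey–West comparison in Fig. 16, the same no-intercept regression could instead be estimated with HAC standard errors (for example, via an OLS fit with a HAC covariance option in statsmodels), under the constant-within-quarter-beta assumption noted above.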
5. SUMMARY, CONCLUDING REMARKS, AND DIRECTIONS FOR FUTURE RESEARCH

We have assessed the dynamics and predictability in realized betas, relative to the dynamics in the underlying market variance and covariances with the market. Key virtues of the approach include the fact that it does not require an assumed volatility model, and that it does not require an assumed model of time variation in beta. We find that, although the realized variances and covariances fluctuate widely and are highly persistent and predictable (as is
Fig. 16. Ninety-Five Percent Confidence Intervals for Quarterly Beta, Long Sample, Daily Sampling (Newey–West). Note: The time series of Newey–West 95 percent confidence intervals for the underlying quarterly integrated beta are shown. The sample covers the period from 1962:3 through 1999:3, with the 1987:4 outlier excluded. We calculate the realized quarterly betas from daily returns.
well-known), the realized betas, which are simple non-linear functions of the realized variances and covariances, display much less persistence and predictability.

The empirical literature on systematic risk measures, as captured by beta, is much too large to be discussed in a sensible fashion here. Before closing, however, we do want to relate our approach and results to the literature on latent factor models and two key earlier papers that have important implications for the potential time variation of betas and the further use of the techniques developed here.

First, our results are closely linked to the literature on the latent factor volatility model, as studied by a number of authors, including Diebold and Nerlove (1989), Harvey, Ruiz, and Shephard (1994), King, Sentana, and Wadhwani (1994), Fiorentini, Sentana, and Shephard (1998), and Jacquier and Marcus (2000). Specifically, consider the model

$$r_{it} = b_i f_t + v_{it}, \qquad v_{it} \overset{iid}{\sim} (0, \omega_i^2), \qquad f_t \mid I_t \sim (0, h_t), \qquad (15a)$$

$$\mathrm{cov}(v_{it}, v_{jt'}) = 0, \quad \forall\, i \neq j,\; t \neq t', \qquad (15b)$$

where i, j = 1, ..., N, and t = 1, ..., T. The ith and jth time-t conditional (on $h_t$) variances, and the ijth conditional covariance, for arbitrary i and j, are then given by

$$h_{it} = b_i^2 h_t + \omega_i^2, \qquad h_{jt} = b_j^2 h_t + \omega_j^2, \qquad \mathrm{cov}_{ijt} = b_i b_j h_t. \qquad (16)$$
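Before turning to the implications spelled out in the next paragraph, a toy simulation of the one-factor structure in Eqs. (15)–(16) makes the mechanism concrete. All numerical values below (63 trading days per quarter, a loading of 1.2, an idiosyncratic standard deviation of 0.01) are made up for illustration, and the market return is identified with the factor; this is not the authors' calibration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_days, T_quarters = 63, 148           # hypothetical quarterly calendar
b_i, omega_i = 1.2, 0.01               # factor loading and idiosyncratic std

rv_m, rbeta = [], []
for t in range(T_quarters):
    h_t = 1e-4 * np.exp(rng.normal())                  # factor variance redrawn each quarter
    f = rng.normal(0.0, np.sqrt(h_t), N_days)          # market/factor return
    r = b_i * f + rng.normal(0.0, omega_i, N_days)     # Eq. (15a) for one stock
    rv_m.append(np.sum(f ** 2))                        # realized market variance
    rbeta.append(np.sum(r * f) / rv_m[-1])             # realized beta

print(f"CV of realized market variance: {np.std(rv_m) / np.mean(rv_m):.2f}")
print(f"CV of realized beta:            {np.std(rbeta) / np.mean(rbeta):.2f}")
```

The realized variance (and, analogously, the realized covariances) inherits the wide swings in $h_t$, while the realized beta hovers around the constant loading $b_i$, varying only through idiosyncratic sampling noise.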
Assume, as is realistic in financial contexts, that all betas are nonnegative, and consider what happens as ht increases, say: all conditional variances increase, and all pairwise conditional covariances increase. Hence, the market variance increases, and the covariances of individual equities with the market increase. Two observations are immediate: (1) both the market variance and the covariances of individual equities with the market are time-varying, and (2) because the market variance moves together with the covariances of individual equities with the market, the market betas may not vary as much – indeed in the simple one-factor case sketched here, the betas are constant, by construction! The upshot is that wide fluctuations in the market variance and individual equity covariances with the market, yet no variation in betas, is precisely what one expects to see in a latent (single) factor volatility model. It is also, of course, quite similar to what we found in the data: wide variation and persistence in market variance and individual equity covariances with the market, yet less variation and persistence in betas. Notice, also the remarkable similarity in the correlograms for the individual realized covariances in Fig. 5. This is another indication of a strong coherence in the dynamic evolution of
the individual covariances, consistent with the presence of one dominant underlying factor.

Second, our results also complement and expand upon those of Braun et al. (1995), who study the discrepancy in the time series behavior of betas relative to the underlying variances and covariances for 12 industry portfolios using bivariate Exponential Generalized Autoregressive Conditional Heteroskedasticity (EGARCH) models. They also find variation and persistence in the conditional variances and covariances, and less variation and persistence in betas. Moreover, they find the strong asymmetric relationship between return innovations and future return volatility to be entirely absent in the conditional betas.17 Hence, at the portfolio level they document similar qualitative behavior between the variances and covariances relative to the betas as we do. However, their analysis is linked directly to a specific parametric representation, it studies industry portfolios, and it never contemplates the hypothesis that the constituent components of beta, the variances and covariances, may be of a long memory form. This latter point has, of course, been forcefully argued by numerous subsequent studies. Consequently, our investigation can be seen as a substantive extension of their findings performed in a fully nonparametric fashion.

Third, our results nicely complement and expand upon those of Ghysels (1998), who argues that the constant beta CAPM, as bad as it may be, is nevertheless not as bad as some popular conditional CAPMs. We provide some insight into why allowing for time-varying betas may do more harm than good when estimated from daily data, even if the true underlying betas display significant short memory dynamics: it may not be possible to estimate reliably the persistence or predictability in individual realized betas, so good in-sample fits may be spurious artifacts of data mining.18 We also establish that there should be a real potential for the use of high-frequency intraday data to resolve this dilemma.

In closing, therefore, let us sketch an interesting framework for future research using high-frequency intraday data, which will hopefully deliver superior estimates of integrated volatilities by directly exploiting insights from the continuous-record asymptotics of Barndorff-Nielsen and Shephard (2003). Consider the simple state-space representation:

$$\hat{\beta}_{i,t} = \beta_{i,t} + u_{i,t}, \qquad (17a)$$

$$\beta_{i,t} = a_0 + a_1 \beta_{i,t-1} + v_{i,t}, \qquad (17b)$$

$$u_{i,t} \sim N\!\left(0,\; \Big(\sum_{j=1}^{N_t} r_{mjt}^2\Big)^{-2} \hat{\gamma}_{it}\right), \qquad v_{i,t} \sim N(0, \sigma_{v,i,t}^2). \qquad (17c)$$
The measurement Equation (17a) links the observed realized beta to the unobserved true underlying integrated beta by explicitly introducing a normally distributed error with the asymptotically valid variance obtained from the continuous-record distribution of Barndorff-Nielsen and Shephard (2003). The transition Equation (17b) is a standard first-order autoregression with potentially time-varying error variance.19 The simplest approach would be to let $v_{i,t}$ have a constant variance, but it is also straightforward to let the variance change with the underlying variability in the realized beta measure, so that the beta innovations become more volatile as the constituent parts, the market variance and the covariance of the stock return with the market, increase. This approach directly utilizes the advantages of high-frequency intraday beta measurements by incorporating estimates of the measurement errors to alleviate the errors-in-variables problem, while explicitly recognizing the heteroskedasticity in the realized beta series. We look forward to future research along these lines.
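A minimal univariate Kalman filter for the state space (17a)–(17c), under the constant-transition-variance simplification mentioned above, could look as follows. The parameters a0, a1, and the transition variance q are assumptions that would in practice be estimated, for example by maximum likelihood; this is a sketch, not the framework's eventual implementation.

```python
import numpy as np

def kalman_filter_beta(beta_hat: np.ndarray, meas_var: np.ndarray,
                       a0: float, a1: float, q: float) -> np.ndarray:
    """Filtered estimates of the integrated betas from observed realized
    betas `beta_hat`, with known time-varying measurement variances
    `meas_var` from the continuous-record asymptotics, and an AR(1)
    transition with innovation variance q.  Assumes |a1| < 1."""
    T = len(beta_hat)
    b_filt = np.empty(T)
    # initialize at the stationary mean and variance of the transition
    b_pred = a0 / (1.0 - a1)
    P_pred = q / (1.0 - a1 ** 2)
    for t in range(T):
        K = P_pred / (P_pred + meas_var[t])              # Kalman gain
        b_filt[t] = b_pred + K * (beta_hat[t] - b_pred)  # update, Eq. (17a)
        P_filt = (1.0 - K) * P_pred
        b_pred = a0 + a1 * b_filt[t]                     # predict, Eq. (17b)
        P_pred = a1 ** 2 * P_filt + q
    return b_filt
```

Allowing q to vary with the level of the realized market variance, as suggested above, only requires replacing the constant q inside the loop with a period-specific value.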
NOTES

1. The Roll (1977) critique is also relevant. That is, even if we somehow knew what factor(s) should be priced, it is not clear that the factor proxies measured in practice would correspond to the factor required by the theory.
2. See Keim and Hawawini (1999) for a good discussion of the difficulty of interpreting additional empirically motivated factors in terms of systematic risk.
3. There are of course qualifications, notably Ghysels (1998), which we discuss subsequently.
4. The idea of conditioning in the CAPM is of course not unrelated to the idea of multi-factor pricing mentioned earlier.
5. The underlying theory and related empirical strategies are developed in Andersen, Bollerslev, Diebold, and Labys (2001b, 2003), Andersen, Bollerslev, Diebold, and Ebens (2001a), Andersen and Bollerslev (1998), and Barndorff-Nielsen and Shephard (2003). Here, we sketch only the basics; for a more rigorous treatment in the framework of special semimartingales, see the survey and unification by Andersen et al. (2005).
6. This notion of integrated volatility already plays a central role in the stochastic volatility option pricing literature, in which the price of an option typically depends on the distribution of the integrated volatility process for the underlying asset over the life of the option. See, for example, the well-known contribution of Hull and White (1987).
7. Formal theoretical asymptotic justification for this finding has very recently been provided by Barndorff-Nielsen and Shephard (2004). 8. One could, of course, attempt a linear cointegration approach by taking logs of the realized volatilities and covariances, but there is no theoretical reason to expect all covariances to be positive, and our realized covariance measures are indeed sometimes negative, making logarithmic transformations problematic. 9. We compute the quarterly realized variance, covariances, and betas from slightly different numbers of observations due to the different numbers of trading days across the quarters. 10. Note also that the Dickey–Fuller statistics indicate that unit roots are not present in the market variance, individual equity covariances with the market, or market betas, despite their persistent dynamics. 11. For the realized covariances and realized betas, we show the median autocorrelations functions. 12. The standard error band (under the null of an i.i.d. series) indicated in Fig. 4 is only valid for the realized market variance. It should be lower for the two other series, reflecting the effective averaging in constructing the median values. In fact, it should be considerably lower for the beta series due to the near uncorrelated nature of the underlying beta dynamics, while the appropriate reduction for the covariance series would be less because of the strong correlation across the series. We cannot be more precise on this point without imposing some direct assumptions on the correlation structure across the individual series. 13. A partial list of references not written by the present authors includes Breidt, Crato, and de Lima (1998), Comte and Renault (1998), Harvey (1998), and Robinson (2001), as well as many of the earlier papers cited in Baillie (1996). 14. Note that only one figure is needed, despite the many different realized covariances, because all G(j) are identical, as all processes are assumed to be ARFIMA(0, 0.42, 0). 15. N has a time subscript because the number of trading days varies slightly across quarters. 16. The high-frequency tick-by-tick data underlying the 15-min returns was obtained from the TAQ (Trade And Quotation) database. We refer the reader to the web appendix to this paper, available at www.ssc.upenn.edu/~fdiebold, for a more detailed description of the data capture, return construction, and high-frequency beta measurements. 17. In contrast, on estimating a similar EGARCH model for individual daily stock returns, Cho and Engle (2000) find that daily company specific betas do respond asymmetrically to good and bad news. 18. Chang and Weiss (1991) argue that a strictly stationary Autoregressive Moving Average (ARMA) (1,1) model provides an adequate representation for most individual quarterly beta series. Their sample size is smaller than ours, however, and their estimated betas are assumed to be constant over each quarter. Also, they do not provide any separate consideration of the persistence of the market variance or the individual covariances with the market. 19. Generalization to an arbitrary ARMA process or other stationary structures for the evolution in the true betas is, of course, straightforward.
ACKNOWLEDGMENTS For useful discussion we thank participants at the UCSD Conference on Predictive Methodology and Applications in Economics and Finance (in honor of Granger), and the London School of Economics Fifth Financial Markets Group Conference on Empirical Finance. We also thank seminar participants at the University of Pennsylvania, as well as Andrew Ang, Michael Brandt, Mike Chernov, Graham Elliott, Eric Ghysels, Rich Lyons, Norman Swanson, and Mark Watson.
REFERENCES Andersen, T. G., & Bollerslev, T. (1998). Answering the skeptics: Yes, standard volatility models do provide accurate forecasts. International Economic Review, 39, 885–905. Andersen, T. G., Bollerslev, T., & Diebold, F. X. (2005). Parametric and nonparametric volatility measurement. In: L. P. Hansen & Y. Ait-Sahalia (Eds), Handbook of financial econometrics. Amsterdam: North-Holland. Andersen, T., Bollerslev, T., Diebold, F. X., & Ebens, H. (2001a). The distribution of realized stock return volatility. Journal of Financial Economics, 61, 43–76. Andersen, T., Bollerslev, T., Diebold, F. X., & Labys, P. (2001b). The distribution of realized exchange rate volatility. Journal of the American Statistical Association, 96, 42–55. Andersen, T., Bollerslev, T., Diebold, F. X., & Labys, P. (2003). Modeling and forecasting realized volatility. Econometrica, 71, 579–626. Andersen, T., Bollerslev, T., & Meddahi, N. (2004). Analytic evaluation of volatility forecasts. International Economic Review, 45, 1079–1110. Andrews, D. W. K., & Guggenberger, P. (2003). A bias-reduced log-periodogram regression estimator of the long-memory parameter. Econometrica, 71, 675–712. Ang, A., & Chen, J. C. (2003). CAPM over the long run: 1926–2001. Manuscript, Columbia University and University of Southern California. Baillie, R. T. (1996). Long memory processes and fractional integration in econometrics. Journal of Econometrics, 73, 5–59. Barndorff-Nielsen, O. E., & Shephard, N. (2003). Econometric analysis of realized covariation: High frequency covariance, regression and correlation in financial economics. Manuscript, Oxford: Nuffield College. Barndorff-Nielsen, O. E., & Shephard, N. (2004). A feasible central limit theory for realised volatility under leverage. Manuscript, Oxford: Nuffield College. Bollerslev, T., Engle, R. F., & Wooldridge, J. (1988). A capital asset pricing model with timevarying covariances. Journal of Political Economy, 96, 113–131. Braun, P. A., Nelson, D. B., & Sunier, A. M. (1995). Good news, bad news, volatility, and betas. Journal of Finance, 50, 1575–1603. Breidt, F. J., Crato, N., & de Lima, P. (1998). On the detection and estimation of long memory in stochastic volatility. Journal of Econometrics, 73, 325–334.
Campbell, J. Y., Lo, A. W., & MacKinlay, A. C. (1997). The econometrics of financial markets. Princeton, NJ: Princeton University Press. Campbell, J. Y., & Vuolteenaho, T. (2004). Bad beta, good beta. American Economic Review, 94, 1249–1275. Chang, W.-C., & Weiss, D. E. (1991). An examination of the time series properties of beta in the market model. Journal of the American Statistical Association, 86, 883–890. Cheung, Y.-W., & Lai, K. S. (1993). A fractional cointegration analysis of purchasing power parity. Journal of Business and Economic Statistics, 11, 103–112. Cho, Y. H., & Engle, R. F. (2000). Time-varying and asymmetric effects of news: Empirical analysis of blue chip stocks. Manuscript, Rutgers University and New York University. Cohen, R., Polk, C., & Vuolteenaho, T. (2002). Does risk or mispricing explain the cross-section of stock prices. Manuscript, Kellogg School, Northwestern University. Comte, F., & Renault, E. (1998). Long memory in continuous time stochastic volatility models. Mathematical Finance, 8, 291–323. Diebold, F. X., & Kilian, L. (2001). Measuring predictability: Theory and macroeconomic implications. Journal of Applied Econometrics, 16, 657–669. Diebold, F. X., & Nerlove, M. (1989). The dynamics of exchange rate volatility: A multivariate latent factor ARCH model. Journal of Applied Econometrics, 4, 1–22. Dybvig, P. H., & Ross, S. A. (1985). Differential information and performance measurement using a security market line. Journal of Finance, 40, 383–400. Engle, R. F., & Granger, C. W. J. (1987). Co-integration and error correction: Representation, estimation and testing. Econometrica, 55, 251–276. Engle, R. F., & Kozicki, S. (1993). Testing for common features. Journal of Business and Economic Statistics, 11, 369–383. Fama, E. F. (1976). Foundations of finance. New York: Basic Books. Fama, E. F., & French, K. R. (1992). The cross section of expected stock returns. Journal of Finance, 47, 427–465. Fama, E. F., & French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33, 3–56. Ferson, W. E., & Harvey, C. R. (1991). The variation of economic risk premiums. Journal of Political Economy, 99, 385–415. Ferson, W. E., Kandel, S., & Stambaugh, R. F. (1987). Tests of asset pricing with time-varying expected risk premiums and market betas. Journal of Finance, 42, 201–220. Fiorentini, G., Sentana, E., & Shephard, N. (1998). Exact likelihood-based estimation of conditionally heteroskedastic factor models. Working Paper. University of Alcante, CEMFI and Oxford University. Geweke, J., & Porter-Hudak, S. (1983). The estimation and application of long memory time series models. Journal of Time Series Analysis, 4, 221–238. Ghysels, E. (1998). On stable factor structures in the pricing of risk: Do time-varying betas help or hurt?. Journal of Finance, 53, 549–573. Granger, C. W. J. (1981). Some properties of time series data and their use in econometric model specification. Journal of Econometrics, 16, 121–130. Granger, C. W. J. (1995). Modeling nonlinear relationships between extended-memory variables. Econometrica, 63, 265–279. Granger, C. W. J., & Newbold, P. (1986). Forecasting economic time series (2nd edition). Orlando: Academic Press.
Hansen, L. P., & Richard, S. F. (1987). The role of conditioning information in deducing testable restrictions implied by dynamic asset pricing models. Econometrica, 55, 587–613. Harvey, A. C. (1998). Long memory in stochastic volatility. In: J. Knight & S. Satchell (Eds), Forecasting volatility in financial markets. London, UK: Butterworth-Heinemann. Harvey, A. C., Ruiz, E., & Shephard, N. (1994). Multivariate stochastic variance models. Review of Economic Studies, 61, 247–264. Horvath, M. T. K., & Watson, M. W. (1995). Testing for cointegration when some of the cointegrating vectors are prespecified. Econometric Theory, 11, 952–984. Huang, C., & Litzenberger, R. H. (1988). Foundations for financial economics. Amsterdam: North-Holland. Hull, J., & White, A. (1987). The pricing of options on assets with stochastic volatilities. Journal of Finance, 42, 381–400. Jacquier, E., & Marcus, A. J. (2000). Market volatility and asset correlation structure. Working Paper, Boston College. Jagannathan, R., & Wang, Z. (1996). The conditional CAPM and the cross section of expected returns. Journal of Finance, 51, 3–53. Keim, D., & Hawawini, G. (1999). The cross section of common stock returns: A review of the evidence and some new findings. Manuscript, Wharton School. King, M., Sentana, E., & Wadhwani, S. (1994). Volatility and links between national stock markets. Econometrica, 62, 901–933. Lintner, J. (1965a). Security prices, risk and maximal gains from diversification. Journal of Finance, 20, 587–615. Lintner, J. (1965b). The valuation of risky assets and the selection of risky investments in stock portfolios and capital budgets. Review of Economics and Statistics, 47, 13–37. Meddahi, N. (2002). A theoretical comparison between integrated and realized volatility. Journal of Applied Econometrics, 17, 479–508. Robinson, P. M. (2001). The memory of stochastic volatility models. Journal of Econometrics, 101, 195–218. Robinson, P. M., & Marinucci, D. (2001). Semiparametric fractional cointegration analysis. Journal of Econometrics, 105, 225–247. Roll, R. (1977). A critique of the asset pricing theory’s tests. Journal of Financial Economics, 4, 129–176. Rosenberg, B. (1973). Random coefficient models: The analysis of a cross section of time series by stochastically convergent parameter regression. Annals of Economic and Social Measurement, 2, 399–428. Schaefer, S., Brealey, R., Hodges, S., & Thomas, H. (1975). Alternative models of systematic risk. In: E. Elton & M. Gruber (Eds), International capital markets: An inter- and intracountry analysis (pp. 150–161). Amsterdam: North-Holland. Schwert, G. W. (1989). Why does stock market volatility change over time?. Journal of Finance, 44, 1115–1153. Sharpe, W. F. (1963). A simplified model for portfolio analysis. Management Science, 9, 227–293. Wang, K. Q. (2003). Asset pricing with conditioning information: A new test. Journal of Finance, 58, 161–196.
ASYMMETRIC PREDICTIVE ABILITIES OF NONLINEAR MODELS FOR STOCK RETURNS: EVIDENCE FROM DENSITY FORECAST COMPARISON

Yong Bao and Tae-Hwy Lee

ABSTRACT

We investigate predictive abilities of nonlinear models for stock returns when density forecasts are evaluated and compared instead of the conditional mean point forecasts. The aim of this paper is to show whether the in-sample evidence of strong nonlinearity in mean may be exploited for out-of-sample prediction and whether a nonlinear model may beat the martingale model in out-of-sample prediction. We use the Kullback–Leibler Information Criterion (KLIC) divergence measure to characterize the extent of misspecification of a forecast model. The reality check test of White (2000) using the KLIC as a loss function is conducted to compare the out-of-sample performance of competing conditional mean models. In this framework, the KLIC measures not only model specification error but also parameter estimation error, and thus we treat both types of errors as loss. The conditional mean models we use for the daily closing S&P 500 index returns include the martingale difference,
ARMA, STAR, SETAR, artificial neural network, and polynomial models. Our empirical findings suggest the out-of-sample predictive abilities of nonlinear models for stock returns are asymmetric in the sense that the right tails of the return series are predictable via many of the nonlinear models, while we find no such evidence for the left tails or the entire distribution.
1. INTRODUCTION

While there has been some evidence that financial returns may be predictable (see, e.g., Lo & MacKinlay, 1988; Wright, 2000), it is generally believed that financial returns are very close to a martingale difference sequence (MDS). The evidence against MDS is usually stronger from in-sample specification tests than from out-of-sample predictability tests using standard evaluation criteria such as the mean squared forecast error (MSFE) and mean absolute forecast error (MAFE). In this paper, we investigate if this remains true when we evaluate forecasting models in terms of density forecasts instead of the conditional mean point forecasts using MSFE and MAFE. We examine whether the evidence for nonlinear predictability of financial returns, and its significance, depends on whether we use point forecast evaluation criteria (MSFE and MAFE) or probability density forecasts. As Clements and Smith (2000, 2001) show, traditional measures such as MSFE may mask the superiority of nonlinear models, whose predictive abilities may be more evident through density forecast evaluation. Motivated by the encouraging results of Clements and Smith (2000, 2001), we compare the density forecasts of various linear and nonlinear models for the conditional mean of the S&P 500 returns by using the method of Bao, Lee, and Saltoglu (2004, BLS henceforth), where Kullback and Leibler's (1951) Information Criterion (KLIC) divergence measure is used for characterizing the extent of misspecification of a density forecast model. In BLS's framework, the KLIC captures not only model specification error but also parameter estimation error. To compare the performance of density forecast models in the tails of stock return distributions, we also follow BLS by using the censored likelihood functions to compute the tail minimum KLIC. The reality check test of White (2000) is then constructed using the KLIC as a loss function. We find that, for the entire distribution and the left tails, the S&P 500 daily closing returns are not predictable via various linear
or nonlinear models, and the MDS model performs best for out-of-sample forecasting. However, from the right tail density forecast comparison of the S&P 500 data, we find, surprisingly, that the MDS model is dominated by many nonlinear models. This suggests that the out-of-sample predictive abilities of nonlinear models for stock returns are asymmetric. This paper proceeds as follows. In Section 2, we examine the nature of the in-sample nonlinearity using the generalized spectral test of Hong (1999). Section 3 presents various linear and nonlinear models we use for the out-of-sample analysis. In Section 4, we compare these models for the S&P 500 return series employing the density forecast approach of BLS. Section 5 concludes. Throughout, we define y_t = 100(ln P_t − ln P_{t−1}), where P_t is the S&P 500 index at time t.
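For concreteness, the return definition above is a one-line transformation of the price series; a minimal sketch follows (the index levels are made up, not the actual data).

```python
import numpy as np

def log_returns(prices):
    """Continuously compounded returns in percent: y_t = 100*(ln P_t - ln P_{t-1})."""
    prices = np.asarray(prices, dtype=float)
    return 100.0 * np.diff(np.log(prices))

print(log_returns([1400.0, 1414.0, 1407.0]))   # two daily returns from three price levels
```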
2. IN-SAMPLE TEST FOR MARTINGALE DIFFERENCE

We will first explore serial dependence (i.e., any departure from IID) in the S&P 500 returns using Hong's (1999) generalized spectrum. In particular, we are interested in finding significant and predictable nonlinearity in the conditional mean even when the returns are linearly unpredictable. The basic idea is to transform a strictly stationary series y_t to e^{iuy_t} and consider the covariance function between the transformed variables e^{iuy_t} and e^{ivy_{t−|j|}}:

$$\sigma_j(u, v) \equiv \mathrm{cov}\big(e^{iuy_t},\, e^{ivy_{t-|j|}}\big) \qquad (1)$$

where i ≡ √−1, u, v ∈ (−∞, ∞), and j = 0, ±1, …. Suppose that {y_t}_{t=1}^{T} has a marginal characteristic function φ(u) ≡ E(e^{iuy_t}) and a pairwise joint characteristic function φ_j(u, v) ≡ E(e^{i(uy_t + vy_{t−|j|})}). Straightforward algebra yields σ_j(u, v) = φ_j(u, v) − φ(u)φ(v). Because φ_j(u, v) = φ(u)φ(v) for all u, v if and only if y_t and y_{t−|j|} are independent, σ_j(u, v) can capture any type of pairwise serial dependence over various lags. When sup_{u,v∈(−∞,∞)} Σ_{j=−∞}^{∞} |σ_j(u, v)| < ∞, the Fourier transform of σ_j(u, v) exists:

$$f(\omega, u, v) \equiv \frac{1}{2\pi}\sum_{j=-\infty}^{\infty} \sigma_j(u, v)\, e^{-ij\omega}, \qquad \omega \in [-\pi, \pi] \qquad (2)$$

Like σ_j(u, v), f(ω, u, v) can capture all pairwise serial dependencies in {y_t} over various lags. Hong (1999) calls f(ω, u, v) a "generalized spectral
density" of {y_t}, and shows that f(ω, u, v) can be consistently estimated by

$$\hat{f}_n(\omega, u, v) \equiv \frac{1}{2\pi}\sum_{j=1-n}^{n-1}\big(1 - |j|/n\big)^{1/2}\, k(j/p)\, \hat{\sigma}_j(u, v)\, e^{-ij\omega} \qquad (3)$$

where σ̂_j(u, v) ≡ φ̂_j(u, v) − φ̂_j(u, 0)φ̂_j(0, v) is the empirical generalized covariance, φ̂_j(u, v) ≡ (n − |j|)^{−1} Σ_{t=|j|+1}^{n} e^{i(uy_t + vy_{t−|j|})} is the empirical pairwise characteristic function, p ≡ p_n a bandwidth or lag order, and k(·) a kernel function or "lag window". Commonly used kernels include the Bartlett, Daniell, Parzen, and Quadratic–Spectral kernels. When {y_t} is IID, f(ω, u, v) becomes a "flat" generalized spectrum:

$$f_0(\omega, u, v) \equiv \frac{1}{2\pi}\,\sigma_0(u, v), \qquad \omega \in [-\pi, \pi]$$

Any deviation of f(ω, u, v) from the flat spectrum f_0(ω, u, v) is evidence of serial dependence. Thus, to detect serial dependence, we can compare f̂_n(ω, u, v) with the estimator

$$\hat{f}_0(\omega, u, v) \equiv \frac{1}{2\pi}\,\hat{\sigma}_0(u, v), \qquad \omega \in [-\pi, \pi]$$

To explore the nature of serial dependence, one can compare the derivative estimators

$$\hat{f}_n^{(0,m,l)}(\omega, u, v) \equiv \frac{1}{2\pi}\sum_{j=1-n}^{n-1}\big(1 - |j|/n\big)^{1/2}\, k(j/p)\, \hat{\sigma}_j^{(m,l)}(u, v)\, e^{-ij\omega}$$

$$\hat{f}_0^{(0,m,l)}(\omega, u, v) \equiv \frac{1}{2\pi}\,\hat{\sigma}_0^{(m,l)}(u, v)$$

where σ̂_j^{(m,l)}(u, v) ≡ ∂^{m+l} σ̂_j(u, v)/∂^m u ∂^l v for m, l ≥ 0. Just as the characteristic function can be differentiated to generate various moments, generalized spectral derivatives can capture various specific aspects of serial dependence, thus providing information on possible types of serial dependence. Hong (1999) proposes a class of tests based on the quadratic norm

$$Q\big(\hat{f}_n^{(0,m,l)}, \hat{f}_0^{(0,m,l)}\big) \equiv \int\!\!\int\!\!\int_{-\pi}^{\pi}\Big|\hat{f}_n^{(0,m,l)}(\omega, u, v) - \hat{f}_0^{(0,m,l)}(\omega, u, v)\Big|^{2}\, d\omega\, dW_1(u)\, dW_2(v)$$

$$= \frac{2}{\pi}\int\!\!\int \sum_{j=1}^{n-1} k^{2}(j/p)\,(1 - j/n)\,\big|\hat{\sigma}_j^{(m,l)}(u, v)\big|^{2}\, dW_1(u)\, dW_2(v)$$

where the second equality follows by Parseval's identity, and the unspecified integrals are taken over the support of W_1(·) and W_2(·), which are positive
and nondecreasing weighting functions that set weight about zero equally. The generalized spectral test statistic M(m, l) is a standardized version of the quadratic norm. Given (m, l), M(m, l) is asymptotically one-sided N(0,1) under the null hypothesis of serial independence, and thus the upper-tailed asymptotic critical values are 1.65 and 2.33 at the 5% and 1% levels, respectively. We may first choose (m, l) = (0, 0) to check if there exists any type of serial dependence. Once generic serial dependence is discovered using M(0,0), we may use various combinations of (m, l) to check specific types of serial dependence. For example, we can set (m, l) = (1, 0) to check whether there exists serial dependence in mean. This checks whether E(y_t | y_{t−j}) = E(y_t) for all j > 0, and so it is a suitable test for the MDS hypothesis. It can detect a wide range of deviations from MDS. To explore whether there exists linear dependence in mean, we can set (m, l) = (1, 1). If M(1,0) is significant but M(1,1) is not, we can speculate that there may exist only nonlinear dependence in mean. We can go further to choose (m, l) = (1, l) for l = 2, 3, 4, testing if cov(y_t, y_{t−j}^l) = 0 for all j > 0. These essentially check whether there exist ARCH-in-mean, skewness-in-mean, and kurtosis-in-mean effects, which may arise from the existence of time-varying risk premium, asymmetry, and improper account of the concern over large losses, respectively. Table 1 lists a variety of spectral derivative tests and the types of dependence they can detect, together with the estimated M(m, l) statistics.1

We now use the generalized spectral test to explore serial dependence of the daily S&P 500 closing return series, retrieved from finance.yahoo.com. They are from January 3, 1990 to June 30, 2003 (T = 3403). The statistic M(m, l) involves the choice of a bandwidth p in its computation, see Hong (1999, p. 1204). Hong proposes a data-driven method to choose p. This method still involves the choice of a preliminary bandwidth p̄. Simulations in Hong (1999) show that the choice of p̄ is less important than that of p. We consider p̄ in the range 6–15 to examine the robustness of M(m, l) with respect to the choice of p̄. We use the Daniell kernel, which maximizes the asymptotic power of M(m, l) over a class of kernels. We have also used the Bartlett, Parzen, and Quadratic–Spectral kernels, whose results are similar to those based on the Daniell kernel and are not reported in this paper. Table 1 reports the values of M(m, l) for p̄ = 6, 9, 12, 15. The results for various values of p̄ are quite similar. M(m, l) has an asymptotic one-sided N(0,1) distribution, so the asymptotic critical value at the 5% level is 1.65. The M(0,0) statistic suggests that the random walk hypothesis is strongly rejected. In contrast, the correlation test M(1,1) is insignificant, implying
Table 1. Generalized Spectral Tests.

Test                   Statistic M(m, l)   Test Function σ_j^(m,l)(u, v)     Preliminary Bandwidth
                                                                             p̄ = 6    p̄ = 9    p̄ = 12   p̄ = 15
IID                    M(0, 0)             σ_j(u, v)                          51.02    58.23    63.75    67.85
MDS                    M(1, 0)             cov(y_t, e^{iv y_{t−j}})           17.40    18.28    18.67    19.04
Correlation            M(1, 1)             cov(y_t, y_{t−j})                   0.10     0.44     0.63     0.68
ARCH-in-mean           M(1, 2)             cov(y_t, y_{t−j}^2)                56.24    55.83    55.36    54.84
Skewness-in-mean       M(1, 3)             cov(y_t, y_{t−j}^3)                 0.11     0.38     0.50     0.51
Kurtosis-in-mean       M(1, 4)             cov(y_t, y_{t−j}^4)                29.85    29.99    29.57    29.18
Nonlinear ARCH         M(2, 0)             cov(y_t^2, e^{iv y_{t−j}})         62.15    70.75    76.71    81.30
Leverage               M(2, 1)             cov(y_t^2, y_{t−j})                 9.25     8.57     8.52     8.57
Linear ARCH            M(2, 2)             cov(y_t^2, y_{t−j}^2)             172.87   182.41   188.53   193.64
Conditional skewness   M(3, 0)             cov(y_t^3, e^{iv y_{t−j}})          7.63     6.98     6.64     6.36
Conditional skewness   M(3, 3)             cov(y_t^3, y_{t−j}^3)              27.82    26.66    26.69    26.83
Conditional kurtosis   M(4, 0)             cov(y_t^4, e^{iv y_{t−j}})         17.16    18.17    19.12    20.10
Conditional kurtosis   M(4, 4)             cov(y_t^4, y_{t−j}^4)              35.56    35.22    35.22    35.25
Note: All generalized spectral test statistics M(m, l) are asymptotically one-sided N(0, 1), and thus upper-tailed asymptotic critical values are 1.65 and 2.33 at the 5% and 1% levels, respectively. M(0, 0) is to check if there exists any type of serial dependence. M(1, 0) is to check whether there exists serial dependence in mean. To explore whether there exists linear dependence in mean, we can set (m, l) = (1, 1). If M(1, 0) is significant but M(1, 1) is not, we can speculate that there may exist only nonlinear dependence in mean. We choose (m, l) = (1, l) with l = 2, 3, 4, to test if E(y_t | y_{t−j}^l) = 0 for all j > 0. The PN model is to exploit the nonlinear predictive evidence of polynomials found from M(1, l) with l = 2, 3, 4.
that {yt} is an uncorrelated white noise. This, however, does not necessarily imply that {yt} is an MDS. Indeed, the martingale test M(1,0) strongly rejects the martingale hypothesis as its statistic is above 17. This implies that the S&P 500 returns, though serially uncorrelated, has a nonzero mean conditional on its past history. Thus, suitable nonlinear time series models may be able to predict the future returns. The polynomial (PN) model (to be discussed in the next section) is to exploit the nonlinear predictive evidence of the lth power of returns, as indicated by the M(1,l) statistics. The test M(2,0) shows possibly nonlinear time-varying volatility, and the linear ARCH test M(2,2) indicates very strong linear ARCH effects. We also observe that the leverage effect (M(2,1)) is significant and there exist significant conditional skewness as evidenced from M(3,0) and M(3,3), and large conditional kurtosis as evidenced from M(4,0) and M(4,4). It is important to explore the implications of these in-sample findings of nonlinearity in the conditional mean. The fact that the S&P 500 return series
is not an MDS implies it may be predictable in the conditional mean. In the next section, we will use various linear and nonlinear time series models to examine this issue.
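To fix ideas, the building block of the M(m, l) statistics, the empirical generalized covariance σ̂_j(u, v), can be computed directly from its definition. The sketch below is not Hong's full test (which additionally requires the kernel weighting, the weighting functions W_1 and W_2, and the standardization); it only illustrates the quantity being aggregated, and it runs on simulated white noise rather than the S&P 500 series.

```python
import numpy as np

def pairwise_ecf(y, j, u, v):
    """Empirical pairwise characteristic function
    phi_hat_j(u, v) = (n-|j|)^{-1} * sum_t exp(i*(u*y_t + v*y_{t-|j|}))."""
    y = np.asarray(y, dtype=float)
    j = abs(j)
    lead = y[j:]
    lag = y[:len(y) - j] if j > 0 else y
    return np.mean(np.exp(1j * (u * lead + v * lag)))

def generalized_covariance(y, j, u, v):
    """sigma_hat_j(u, v) = phi_hat_j(u, v) - phi_hat_j(u, 0)*phi_hat_j(0, v);
    it is (approximately) zero for all (u, v) when y_t and y_{t-|j|} are independent."""
    return (pairwise_ecf(y, j, u, v)
            - pairwise_ecf(y, j, u, 0.0) * pairwise_ecf(y, j, 0.0, v))

# illustration on simulated IID noise: the statistic should be near zero
rng = np.random.default_rng(1)
y = rng.standard_normal(500)
print(abs(generalized_covariance(y, j=1, u=1.0, v=1.0)))
```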
3. CONDITIONAL MEAN MODELS

Let F_{t−1} be the information set containing information about the process {y_t} up to and including time t − 1. Since our interest is to investigate the predictability of stock returns in the conditional mean μ_t = E(y_t | F_{t−1}), we assume that y_t is conditionally normally distributed and the conditional variance σ_t^2 = E(ε_t^2 | F_{t−1}) follows a GARCH(1,1) process σ_t^2 = ω + α ε_{t−1}^2 + β σ_{t−1}^2, where ε_t = y_t − μ_t. We consider the following nine models for μ_t in three classes:

(i) the MDS model

    MDS:        y_t = ε_t

(ii) four linear autoregressive moving average (ARMA) models

    Constant:   y_t = a_0 + ε_t
    MA(1):      y_t = a_0 + b_1 ε_{t−1} + ε_t
    ARMA(1,1):  y_t = a_0 + a_1 y_{t−1} + b_1 ε_{t−1} + ε_t
    AR(2):      y_t = a_0 + a_1 y_{t−1} + a_2 y_{t−2} + ε_t

(iii) four nonlinear models, namely, the polynomial (PN), neural network (NN), self-exciting threshold autoregressive (SETAR), and smooth transition autoregressive (STAR) models,

    PN(2,4):    y_t = a_0 + a_1 y_{t−1} + a_2 y_{t−2} + Σ_{j=1}^{2} Σ_{i=2}^{4} a_{ij} y_{t−j}^{i} + ε_t
    NN(2,5):    y_t = a_0 + a_1 y_{t−1} + a_2 y_{t−2} + Σ_{i=1}^{5} δ_i G(γ_{0i} + Σ_{j=1}^{2} γ_{ji} y_{t−j}) + ε_t
    SETAR:      y_t = a_0 + a_1 y_{t−1} + a_2 y_{t−2} + (b_0 + b_1 y_{t−1} + b_2 y_{t−2}) 1(y_{t−1} > c) + ε_t
    STAR:       y_t = a_0 + a_1 y_{t−1} + a_2 y_{t−2} + (b_0 + b_1 y_{t−1} + b_2 y_{t−2}) G(γ(y_{t−1} − c)) + ε_t
where G(z) = 1/(1 + e^{−z}) is a logistic function and 1(·) denotes an indicator function that takes 1 if its argument is true and 0 otherwise. Note that the four nonlinear models nest the AR(2) model. All the above models have been used in the literature, with apparently mixed results on the predictability of stock returns. Hong and Lee (2003) use the AR, PN, and NN models. McMillan (2001) and Kanas (2003) find evidence supporting the NN and STAR models, while Bradley and Jansen (2004) find no evidence for the STAR model. We note that these authors use the MSFE criterion for out-of-sample forecasting evaluation. Racine (2001) finds no predictability evidence using the NN model. Anderson, Benzoni, and Lund (2002) use the MA model for estimation. The results from the generalized spectral test reported in Table 1 suggest that E(y_t | F_{t−1}) is time-varying in a nonlinear manner, because M(1,1) is insignificant but M(1,0) is significant. Also, we note that the M(1,l) statistics are significant with l = 2, 4 but not with l = 1, 3, indicating that volatility and tail observations may have some predictive power for the returns but not the skewness of the return distribution. The PN model is to exploit this nonlinear predictive evidence of the lth order power of the lagged returns.
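The conditional mean and variance specifications above are simple enough to write down directly. The following sketch codes the SETAR and STAR means and the GARCH(1,1) variance recursion; all parameter values are illustrative placeholders, not estimates from this paper.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def setar_mean(y1, y2, a, b, c):
    """SETAR conditional mean: AR(2) plus an extra AR(2) piece when y_{t-1} > c."""
    base = a[0] + a[1] * y1 + a[2] * y2
    shift = b[0] + b[1] * y1 + b[2] * y2
    return base + shift * (y1 > c)

def star_mean(y1, y2, a, b, gamma, c):
    """STAR conditional mean: the regime shift enters smoothly via G(gamma*(y_{t-1}-c))."""
    base = a[0] + a[1] * y1 + a[2] * y2
    shift = b[0] + b[1] * y1 + b[2] * y2
    return base + shift * logistic(gamma * (y1 - c))

def garch11_variance(eps, omega, alpha, beta):
    """GARCH(1,1) recursion: sigma^2_t = omega + alpha*eps^2_{t-1} + beta*sigma^2_{t-1}."""
    sig2 = np.empty(len(eps))
    sig2[0] = omega / (1.0 - alpha - beta)    # start at the unconditional variance
    for t in range(1, len(eps)):
        sig2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sig2[t - 1]
    return sig2

# illustrative parameter values only
a, b = (0.0, 0.1, 0.05), (0.02, -0.2, 0.1)
print(setar_mean(0.5, -0.3, a, b, c=0.0), star_mean(0.5, -0.3, a, b, gamma=2.0, c=0.0))
```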
4. OUT-OF-SAMPLE TEST FOR MARTINGALE DIFFERENCE We now examine if the in-sample evidence of nonlinear predictability of the S&P 500 returns from the generalized spectral test in the previous section may be carried over to the out-of-sample forecasting. While the in-sample generalized spectral test does not involve parameter estimation of a particular nonlinear model, the out-of-sample test requires the estimation of model parameters since it is based on some particular choice of nonlinear models. Despite the strong nonlinearity found in the conditional mean from the in-sample tests, model uncertainty and parameter uncertainty usually make the out-of-sample results much weaker than the in-sample nonlinear evidence, see Meese and Rogoff (1983). Given model uncertainty, econometricians tend to search for a proper model over a large set of candidate models. This can easily cause the problem of data snooping, see Lo and MacKinlay (1990). In order to take care of the data snooping bias, we follow the method of White (2000) in our comparison of multiple competing models. Moreover, we use the KLIC
measure as our loss function in the comparison. As emphasized by BLS (2004), the KLIC measure captures both the loss due to model specification error and the loss due to parameter estimation error. It is important to note that in comparing forecasting models we treat parameter estimation error as a loss. Now, we briefly discuss the BLS test.2

4.1. The BLS Test

Suppose that {y_t} has a true, unknown conditional density function f(y_t) ≡ f(y_t | F_{t−1}). Let ψ(y_t; θ) = ψ(y_t | F_{t−1}; θ) be a one-step-ahead conditional density forecast model with parameter vector θ, where θ ∈ Θ is a finite-dimensional vector of parameters in a compact parameter space Θ. If ψ(·; θ_0) = f(·) for some θ_0 ∈ Θ, then the one-step-ahead density forecast is correctly specified and hence optimal, as it dominates all other density forecasts for any loss function (e.g., Diebold, Gunther, & Tay, 1998; Granger & Pesaran, 2000a, b). As our purpose is to compare the out-of-sample predictive abilities of competing density forecast models, we consider two subsamples {y_t}_{t=1}^{R} and {y_t}_{t=R+1}^{T}: we use the first sample to estimate the unknown parameter vector θ and the second sample to compare the out-of-sample density forecasts. In practice, it is rarely the case that we can find an optimal model, as all the models can be possibly misspecified. Our task is then to investigate which model can approximate the true model most closely. We have to first define a metric to measure the distance of a given model to the truth and then compare different models in terms of this distance. The adequacy of a postulated distribution may be appropriately measured by the KLIC divergence measure between f(·) and ψ(·): I(f : ψ, θ) = E[ln f(y_t) − ln ψ(y_t; θ)], where the expectation is with respect to the true distribution. Following Vuong (1989), we define the distance between a model and the true density as the minimum of the KLIC

$$I(f : \psi, \theta_*) = E\big[\ln f(y_t) - \ln \psi(y_t; \theta_*)\big] \qquad (4)$$

and θ_* is the pseudo-true value of θ, the parameter value that gives the minimum I(f : ψ, θ) for all θ ∈ Θ (e.g., White, 1982). The smaller this distance, the closer the model ψ(·; θ) is to the true density. Thus, we can use this measure to compare the relative distance of a battery of competing models to the true model f_t(·). However, I(f : ψ, θ_*) is generally unknown, since we cannot observe f(·) and thereby we cannot evaluate the expectation in (4). Under some regularity conditions, it can nevertheless be shown that E[ln f(y_t) − ln ψ(y_t; θ_*)] can be consistently estimated by

$$\hat{I}(f : \psi) = \frac{1}{n}\sum_{t=R+1}^{T}\big[\ln f(y_t) - \ln \psi(y_t; \hat{\theta}_{t-1})\big] \qquad (5)$$
where n = T − R and θ̂_{t−1} is consistently estimated from a rolling sample {y_{t−1}, …, y_{t−R}} of size R. But we still do not know f(·). For this, we utilize the equivalence relationship between ln[f(y_t)/ψ(y_t; θ̂_{t−1})] and the log-likelihood ratio of the inverse normal transform of the probability integral transform (PIT) of the actual realizations of the process with respect to the models' density forecast. This equivalence relationship enables us to consistently estimate I(f : ψ, θ_*) and hence to conduct the out-of-sample comparison of possibly misspecified models in terms of their distance to the true model. The (out-of-sample) PIT of the realization of the process with respect to the model's density forecast is defined as

$$u_t = \int_{-\infty}^{y_t} \psi(y; \hat{\theta}_{t-1})\, dy, \qquad t = R+1, \ldots, T \qquad (6)$$
It is well known that if ψ(y_t; θ̂_{t−1}) coincides with the true density f(y_t) for all t, then the sequence {u_t}_{t=R+1}^{T} is IID and uniform on the interval [0, 1] (U[0,1] henceforth). This provides a powerful approach to evaluating the quality of a density forecast model. Our task, however, is not to evaluate a single model, but to compare a battery of competing models. Our purpose of utilizing the PITs is to exploit the following equivalence between ln[f(y_t)/ψ(y_t; θ̂_{t−1})] and the log-likelihood ratio of the transformed PIT and hence to construct the distance measure. The inverse normal transform of the PIT is

$$x_t = \Phi^{-1}(u_t) \qquad (7)$$
where Φ(·) is the cumulative distribution function (CDF) of the standard normal distribution. If the sequence {u_t}_{t=R+1}^{T} is IID U[0, 1], then {x_t}_{t=R+1}^{T} is IID standard normal N(0, 1) (IID N(0, 1) henceforth). More importantly, Berkowitz (2001, Proposition 2, p. 467) shows that

$$\ln\big[f(y_t)/\psi(y_t; \hat{\theta}_{t-1})\big] = \ln\big[p(x_t)/\phi(x_t)\big] \qquad (8)$$

where p(·) is the density of x_t and φ(·) the standard normal density. Therefore, the distance of a density forecast model to the unknown true model can be equivalently estimated by the departure of {x_t}_{t=R+1}^{T} from IID N(0, 1),

$$\tilde{I}(f : \psi) = \frac{1}{n}\sum_{t=R+1}^{T}\big[\ln p(x_t) - \ln \phi(x_t)\big] \qquad (9)$$
We transform the departure of ψ(·; θ) from f(·) to the departure of p(·) from IID N(0, 1). To specify the departure from IID N(0, 1), we want p(·) to be as flexible as possible to reflect the true distribution of {x_t}_{t=R+1}^{T}, and at the same time it can be IID N(0, 1) if the density forecast model coincides with the true model. We follow Berkowitz (2001) by specifying {x_t}_{t=R+1}^{T} as an AR(L) process

$$x_t = q' X_{t-1} + \sigma Z_t \qquad (10)$$

where X_{t−1} = (1, x_{t−1}, …, x_{t−L})′, q is an (L+1) × 1 vector of parameters, and Z_t is IID distributed. We specify a flexible distribution for Z_t, say, p(Z_t; c), where c is a vector of distribution parameters such that when c = c* for some c* in the parameter space, p(Z_t; c*) is IID N(0, 1). A test for IID N(0, 1) of {x_t}_{t=R+1}^{T} per se can be constructed by testing elements of the parameter vector β = (q′, σ, c′)′, say, q = 0, σ = 1, and c = c*. We assume the semi-nonparametric (SNP) density of order K of Gallant and Nychka (1987) for Z_t:

$$p(Z_t; c) = \frac{\Big(\sum_{k=0}^{K} c_k Z_t^{k}\Big)^{2}\,\phi(Z_t)}{\int_{-\infty}^{+\infty}\Big(\sum_{k=0}^{K} c_k u^{k}\Big)^{2}\,\phi(u)\, du} \qquad (11)$$

where c_0 = 1 and c = (c_0, …, c_K)′. Setting c_k = 0 (k = 1, …, K) gives p(Z_t) = φ(Z_t). Given (10) and (11), the density of x_t is

$$p(x_t; \beta) = \frac{p\big((x_t - q' X_{t-1})/\sigma;\, c\big)}{\sigma},$$

which degenerates into IID N(0,1) by setting β = β* = (0′, 1, 0′)′. Then Ĩ(f : ψ) as defined in (9) can be estimated by

$$\tilde{I}(f : \psi, \beta) = \frac{1}{n}\sum_{t=R+1}^{T}\bigg[\ln \frac{p\big((x_t - q' X_{t-1})/\sigma;\, c\big)}{\sigma} - \ln \phi(x_t)\bigg].$$

The likelihood ratio test statistic of the adequacy of the density forecast model ψ(·; θ) in Berkowitz (2001) is simply the above formula with p(·) = φ(·). As β is unknown, we estimate Ĩ(f : ψ) by

$$\tilde{I}(f : \psi, \hat{\beta}_n) = \frac{1}{n}\sum_{t=R+1}^{T}\bigg[\ln \frac{p\big((x_t - \hat{q}_n' X_{t-1})/\hat{\sigma}_n;\, \hat{c}_n\big)}{\hat{\sigma}_n} - \ln \phi(x_t)\bigg] \qquad (12)$$
where β̂_n = (q̂_n′, σ̂_n, ĉ_n′)′ is the maximum likelihood estimator that maximizes (1/n) Σ_{t=R+1}^{T} ln p(x_t; β).
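To make the transformation concrete, the sketch below computes the PIT and its inverse normal transform for a Gaussian one-step-ahead forecast and evaluates a KLIC-type loss in the spirit of Equation (9), using a Gaussian AR(1) for p(·), i.e. the K = 0 special case of the AR(L)–SNP(K) family, fitted here by least squares rather than full maximum likelihood for brevity. The data are simulated; nothing below reproduces the paper's estimates.

```python
import numpy as np
from scipy.stats import norm

def pit_and_transform(y_out, mu, sigma):
    """PIT of realized returns under one-step-ahead N(mu_t, sigma_t^2) density
    forecasts, followed by the inverse normal transform x_t = Phi^{-1}(u_t)."""
    u = norm.cdf(y_out, loc=mu, scale=sigma)
    u = np.clip(u, 1e-10, 1 - 1e-10)      # guard against u exactly 0 or 1
    return norm.ppf(u)

def klic_loss_gaussian_ar1(x):
    """Average of ln p(x_t) - ln phi(x_t) with p(.) a Gaussian AR(1) fitted to x."""
    x0, x1 = x[:-1], x[1:]
    slope, intercept = np.polyfit(x0, x1, 1)
    resid = x1 - (slope * x0 + intercept)
    s = resid.std(ddof=0)
    ll_p = norm.logpdf(x1, loc=slope * x0 + intercept, scale=s)
    ll_phi = norm.logpdf(x1)
    return np.mean(ll_p - ll_phi)

# toy usage: a correctly specified N(0,1) forecast should give a loss near zero
rng = np.random.default_rng(2)
y = rng.standard_normal(300)
x = pit_and_transform(y, mu=np.zeros_like(y), sigma=np.ones_like(y))
print(klic_loss_gaussian_ar1(x))
```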
To check the performance of a density forecast model in certain regions of the distribution, we can easily modify our distance measure so that it is tailored to the tail parts only. For the lower (left) tail, we define the censored random variable

$$x_t^{\tau} = \begin{cases} x_t & \text{if } x_t < \tau \\ \tau \equiv \Phi^{-1}(\alpha) & \text{if } x_t \ge \tau \end{cases} \qquad (13)$$

For example, τ = −1.645 for α = 0.05, the left 5% tail. As before, we consider an AR model (10) with Z_t distributed as in (11). Then the censored random variable x_t^τ has the distribution function

$$p^{\tau}(x_t^{\tau}; \beta) = \bigg[\frac{p\big((x_t - q' X_{t-1})/\sigma;\, c\big)}{\sigma}\bigg]^{\mathbf{1}(x_t < \tau)} \bigg[1 - P\Big(\frac{\tau - q' X_{t-1}}{\sigma};\, c\Big)\bigg]^{\mathbf{1}(x_t \ge \tau)} \qquad (14)$$

in which P(·; c) is the CDF of the SNP density function and is calculated in the way discussed in BLS. Given p^τ(x_t^τ; β), the (left) tail minimum KLIC divergence measure can be estimated analogously:

$$\tilde{I}^{\tau}(f : \psi, \hat{\beta}_n^{\tau}) = \frac{1}{n}\sum_{t=R+1}^{T}\big[\ln p^{\tau}(x_t^{\tau}; \hat{\beta}_n^{\tau}) - \ln f^{\tau}(x_t^{\tau})\big] \qquad (15)$$

where f^τ(x_t^τ) = [φ(x_t)]^{1(x_t < τ)} [1 − Φ(τ)]^{1(x_t ≥ τ)} and β̂_n^τ maximizes (1/n) Σ_{t=R+1}^{T} ln p^τ(x_t^τ; β). For the upper (right) tail, we define the censored random variable

$$x_t^{\tau} = \begin{cases} \tau \equiv \Phi^{-1}(\alpha) & \text{if } x_t \le \tau \\ x_t & \text{if } x_t > \tau \end{cases} \qquad (16)$$

For example, τ = 1.645 for α = 0.95, the right 5% tail. Then the censored random variable x_t^τ has the distribution function

$$p^{\tau}(x_t^{\tau}; \beta) = \bigg[\frac{p\big((x_t - q' X_{t-1})/\sigma;\, c\big)}{\sigma}\bigg]^{\mathbf{1}(x_t > \tau)} \bigg[P\Big(\frac{\tau - q' X_{t-1}}{\sigma};\, c\Big)\bigg]^{\mathbf{1}(x_t \le \tau)} \qquad (17)$$

and the (right) tail minimum KLIC divergence measure can be estimated by (15) with p^τ(x_t^τ; β) given by (17). Therefore, given (6) and (7), we are able to estimate the minimum distance measure (4) by (12) (or its tail counterpart by (15)), which is proposed by BLS as a loss function to compare the out-of-sample predictive abilities of a set of l + 1 competing models, each of which can be possibly misspecified. To
establish the notation, with models indexed by k (k = 0, 1, …, l), let density forecast model k be denoted by ψ_k(y_t; θ_k). Model comparison between a single model (model k) and the benchmark model (model 0) can be conveniently formulated as hypothesis testing of some suitable moment conditions. Consider constructing the loss differential

$$d_k = d_k(\psi_0, \psi_k) = \big[\ln f(y_t) - \ln \psi_0(y_t; \theta_0^*)\big] - \big[\ln f(y_t) - \ln \psi_k(y_t; \theta_k^*)\big] \qquad (18)$$

where 1 ≤ k ≤ l. Note that E(d_k) = I(f : ψ_0, θ_0^*) − I(f : ψ_k, θ_k^*) is the difference in the minimum KLIC of models 0 and k. When we compare multiple l models against a benchmark jointly, the null hypothesis of interest is that the best model is no better than the benchmark:

$$H_0: \max_{1 \le k \le l} E(d_k) \le 0 \qquad (19)$$

To implement this test, we follow White (2000) to bootstrap the following test statistic

$$\bar{V}_n \equiv \max_{1 \le k \le l} \sqrt{n}\,\big(\bar{d}_{k,n} - E(d_k)\big) \qquad (20)$$
where d̄_{k,n} = Ĩ(f : ψ_0, β̂_{0,n}) − Ĩ(f : ψ_k, β̂_{k,n}), and Ĩ(f : ψ_0, β̂_{0,n}) and Ĩ(f : ψ_k, β̂_{k,n}) are estimated by (12) for models 0 and k, with the normal-inversed PIT {x_t}_{t=R+1}^{T} constructed using θ̂_{0,t−1} and θ̂_{k,t−1} estimated by a rolling-sample scheme, respectively. A merit of using the KLIC-based loss function for comparing forecasting models is that d̄_{k,n} incorporates both model specification error and parameter estimation error (note that x_t is constructed using θ̂ rather than θ*), as argued by BLS (2004). To obtain the p-value for V̄_n, White (2000) suggests using the stationary bootstrap of Politis and Romano (1994). This bootstrap p-value for testing H_0 is called the "reality check p-value" for data snooping. As discussed in Hansen (2001), White's reality check p-value may be considered as an upper bound of the true p-value, since it sets E(d_k) = 0. Hansen (2001) considers a modified reality check test, which removes the "bad" models from the comparison and thereby improves the size and the power of the test. The reality check to compare the performance of density forecast models in the tails can be implemented analogously.

4.2. Results of the BLS Test

We split the sample used in Section 2 into two parts (roughly into two halves): one for in-sample estimation of size R = 1703 and another for out-of-sample density forecast of size n = 1700. We use a rolling-sample scheme.
That is, the first density forecast is based on observations 1 through R (January 3, 1990–September 24, 1996), the second density forecast is based on observations 2 through R+1 (January 4, 1990–September 25, 1996), and so on. The results are presented in Tables 2–4, with each table computing the statistics with different ways of selecting L and K. We present the reality check results with the whole density (100% with a ¼ 1:00), three left tails (10%, 5%, 1% with a ¼ 0:10; 0.05, 0.01), and three right tails (10%, 5%, 1% with a ¼ 0:90; 0.95, 0.99). With the AR(L)–SNP(K) model as specified in (10) and (11), we need to choose L and K. In Table 2, we fix L ¼ 3 and K ¼ 5: We minimize the Akaike information criteria (AIC) in Table 3, and the Schwarz information criteria (SIC) in Table 4, for the selection of L and K from the sets of {0, 1, 2, 3} for L and {0, 1,y, 8} for K. The out-of-sample average KLIC loss (denoted as ‘‘Loss’’ in tables) as well as the reality check p-value of White (2000) (denoted as ‘‘p1’’) and the modified reality check p-value of Hansen (2001) (denoted as ‘‘p2’’) are presented in these tables. In comparing the models using the reality check tests, we regard each model as a benchmark model and it is compared with the remaining eight models. We set the number of bootstraps for the reality check to be 1,000 and the mean block length to be 4 for the stationary bootstrap of Politis and Romano (1994). The estimated out-of-sample KLIC (12) and its censored versions as defined from (15) with different values of t (each corresponding to different a) are reported in Tables 2–4. Note that in these tables a small value of the out-of-sample average loss and a large reality check p-value indicate that the corresponding model is a good density forecast model, as we fail to reject the null hypothesis that the other remaining eight models is no better than that model. As our purpose is to test for the MDS property of the S&P 500 returns in terms of out-of-sample forecasts, we examine the performance of the MDS model as the benchmark (Model 0 with k ¼ 0) in comparison with the remaining eight models indexed by k ¼ 1;y, l (l ¼ 8). The eight competing models are Constant, MA, ARMA, AR, PN, NN, SETAR, and STAR. Table 2 shows the BLS test results computed using L ¼ 3 and K ¼ 5: First, comparing the entire return density forecasts with a ¼ 100%; the ~ : c0 ; b^ 0;n Þ ¼ 0:0132; that is the KLIC loss value for the MDS model is Iðf smallest loss, much smaller than those of the other eight models. The large reality check p-values for the MDS model as the benchmark (p1 ¼ 0:982 and p2 ¼ 0:852) indicate that none of the other eight models are better than the MDS model, confirming the efficient market hypothesis that the S&P 500 returns may not be predictable using linear or nonlinear models.
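Operationally, the reality check reduces to recentering the loss differentials and resampling them with the stationary bootstrap. The sketch below follows that recipe under White's (2000) centering E(d_k) = 0, with a mean block length of 4 as in the text; the loss-differential matrix is simulated, and the function and variable names are ours rather than BLS's.

```python
import numpy as np

def stationary_bootstrap_indices(n, mean_block, rng):
    """Index path for the Politis-Romano (1994) stationary bootstrap."""
    p = 1.0 / mean_block
    idx = np.empty(n, dtype=int)
    idx[0] = rng.integers(n)
    for t in range(1, n):
        idx[t] = rng.integers(n) if rng.random() < p else (idx[t - 1] + 1) % n
    return idx

def reality_check_pvalue(d, n_boot=1000, mean_block=4, seed=0):
    """Reality check p-value for H0: max_k E(d_k) <= 0, where d is an (n, l) array
    of loss differentials (benchmark loss minus model-k loss, here KLIC-based)."""
    rng = np.random.default_rng(seed)
    n, _ = d.shape
    v_hat = np.sqrt(n) * d.mean(axis=0).max()
    v_boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = stationary_bootstrap_indices(n, mean_block, rng)
        # bootstrap statistic recentered at the sample means of the differentials
        v_boot[b] = np.sqrt(n) * (d[idx].mean(axis=0) - d.mean(axis=0)).max()
    return np.mean(v_boot >= v_hat)

# toy example: two competing models with loss differentials fluctuating around zero
rng = np.random.default_rng(3)
d = 0.01 * rng.standard_normal((1700, 2))
print(reality_check_pvalue(d))
```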
Table 2. Tail
Reality Check Results Based on AR(3)-SNP(5).
Model
Left Tail
Right Tail Loss
p1
p2
L
K
5 5 5 5 5 5 5 5 5
0.0409 0.0359 0.0352 0.0359 0.0363 0.0348 0.0365 0.0360 0.0366
0.000 0.415 0.794 0.442 0.309 0.627 0.269 0.396 0.243
0.000 0.394 0.686 0.397 0.298 0.615 0.261 0.372 0.238
3 3 3 3 3 3 3 3 3
5 5 5 5 5 5 5 5 5
3 3 3 3 3 3 3 3 3
5 5 5 5 5 5 5 5 5
0.0229 0.0205 0.0206 0.0206 0.0208 0.0170 0.0210 0.0213 0.0208
0.000 0.040 0.039 0.040 0.031 0.981 0.022 0.019 0.029
0.000 0.040 0.039 0.040 0.031 0.518 0.022 0.019 0.029
3 3 3 3 3 2 3 3 3
5 5 5 5 5 5 5 5 5
3 3 3 3 3 3 3 3 3
5 5 5 5 5 5 5 5 5
0.0073 0.0068 0.0068 0.0068 0.0068 0.0058 0.0068 0.0068 0.0068
0.042 0.172 0.167 0.160 0.151 0.932 0.150 0.135 0.169
0.042 0.172 0.167 0.160 0.151 0.914 0.150 0.135 0.169
3 3 3 3 3 3 3 3 3
5 5 5 5 5 5 5 5 5
Loss
p1
p2
L
K
100%
MDS Constant MA ARMA AR PN NN SETAR STAR
0.0132 0.0233 0.0237 0.0238 0.0171 0.0346 0.0239 0.0243 0.0238
0.982 0.370 0.357 0.357 0.547 0.071 0.355 0.354 0.354
0.852 0.370 0.357 0.357 0.486 0.071 0.355 0.354 0.354
3 3 3 3 3 3 3 3 3
5 5 5 5 5 5 5 5 5
10%
MDS Constant MA ARMA AR PN NN SETAR STAR
0.0419 0.0467 0.0472 0.0468 0.0469 0.0466 0.0469 0.0476 0.0470
1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.730 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
3 3 3 3 3 3 3 3 3
5%
MDS Constant MA ARMA AR PN NN SETAR STAR
0.0199 0.0218 0.0217 0.0217 0.0211 0.0226 0.0211 0.0211 0.0213
1.000 0.002 0.003 0.003 0.023 0.000 0.026 0.040 0.017
0.671 0.002 0.003 0.003 0.000 0.000 0.000 0.004 0.000
1%
MDS Constant MA ARMA AR PN NN SETAR STAR
0.0068 0.0074 0.0074 0.0074 0.0075 0.0074 0.0075 0.0076 0.0074
1.000 0.148 0.163 0.178 0.132 0.240 0.128 0.064 0.181
0.992 0.148 0.163 0.178 0.132 0.240 0.128 0.064 0.181
Note: ‘‘Loss’’ refers to is the out-of-sample averaged loss based on the KLIC measure; ‘‘p1’’ and ‘‘p2’’ are the reality check p-values of White’s (2000) and Hansen’s (2001) tests, respectively, where each model is regarded as a benchmark model and is compared with the remaining eight models. We use an AR(3)–SNP(5) model for the transformed PIT {xt}. We retrieve the S&P 500 returns series from finance.yahoo.com. The sample observations are from January 3, 1990 to June 30, 2003 (T ¼ 3303), the in-sample observations are from January 3, 1990 to September 24, 1996 (R ¼ 1703), and the out-of-sample observations are from September 25, 1996 to June 30, 2003 (n ¼ 1700).
Table 3. Tail
Reality Check Results Based on Minimum AIC.
Model
Left Tail
Right Tail Loss
p1
p2
L
K
8 8 8 8 8 8 8 8 8
0.0658 0.0593 0.0582 0.0588 0.0594 0.0628 0.0597 0.0584 0.0591
0.050 0.717 0.960 0.839 0.701 0.239 0.619 0.845 0.775
0.050 0.675 0.947 0.813 0.655 0.239 0.571 0.808 0.737
3 3 3 3 3 3 3 3 3
8 8 8 8 8 8 8 8 8
2 2 2 3 3 3 3 3 1
8 7 8 8 8 8 8 8 8
0.0390 0.0359 0.0359 0.0359 0.0360 0.0515 0.0362 0.0357 0.0359
0.333 0.852 0.917 0.926 0.929 0.006 0.841 0.913 0.908
0.092 0.665 0.840 0.865 0.885 0.006 0.741 0.753 0.826
3 3 3 3 3 2 3 3 3
8 8 8 8 8 7 8 8 8
3 2 2 2 2 2 2 2 2
7 7 7 7 7 7 7 7 8
0.0122 0.0113 0.0116 0.0118 0.0118 0.0092 0.0118 0.0112 0.0117
0.008 0.042 0.018 0.019 0.020 0.989 0.020 0.060 0.022
0.008 0.042 0.018 0.019 0.020 0.564 0.020 0.060 0.022
3 3 3 3 3 3 3 3 3
8 8 8 8 8 7 8 7 8
Loss
p1
p2
L
K
100%
MDS Constant MA ARMA AR PN NN SETAR STAR
0.0247 0.0231 0.0266 0.0236 0.0266 0.0374 0.0238 0.0266 0.0268
0.719 0.938 0.534 0.874 0.533 0.238 0.853 0.539 0.532
0.719 0.933 0.534 0.862 0.533 0.238 0.845 0.539 0.532
3 3 3 3 3 1 3 3 3
8 3 7 3 7 8 3 7 8
10%
MDS Constant MA ARMA AR PN NN SETAR STAR
0.0583 0.0652 0.0651 0.0647 0.0649 0.0650 0.0649 0.0653 0.0653
1.000 0.064 0.070 0.078 0.075 0.076 0.075 0.068 0.063
0.767 0.064 0.070 0.078 0.075 0.076 0.075 0.068 0.063
3 2 3 3 3 2 3 3 3
5%
MDS Constant MA ARMA AR PN NN SETAR STAR
0.0315 0.0334 0.0339 0.0340 0.0336 0.0356 0.0337 0.0334 0.0705
1.000 0.628 0.468 0.425 0.495 0.256 0.487 0.529 0.000
0.973 0.473 0.320 0.266 0.340 0.129 0.327 0.386 0.000
1%
MDS Constant MA ARMA AR PN NN SETAR STAR
0.0098 0.0113 0.0114 0.0114 0.0116 0.0116 0.0116 0.0113 0.0229
0.973 0.321 0.322 0.330 0.279 0.292 0.279 0.314 0.000
0.939 0.111 0.131 0.137 0.108 0.109 0.107 0.109 0.000
Note: ‘‘Loss’’ refers to is the out-of-sample averaged loss based on the KLIC measure; ‘‘p1’’ and ‘‘p2’’ are the reality check p-values of White’s (2000) and Hansen’s (2001) tests, respectively, where each model is regarded as a benchmark model and is compared with the remaining eight models; ‘‘L’’ and ‘‘K’’ are the AR and SNP orders, respectively, chosen by the minimum AIC criterion in the AR(L)–SNP(K) models, L ¼ 0 ,y, 3, K ¼ 0 ,y, 8 for the transformed PIT {xt}. We retrieve the S&P 500 returns series from finance.yahoo.com. The sample observations are from January 3, 1990 to June 30, 2003 (T ¼ 3303), the in-sample observations are from January 3, 1990 to September 24, 1996 (R ¼ 1703), and the out-of-sample observations are from September 25, 1996 to June 30, 2003 (n ¼ 1700).
Table 4. Reality Check Results Based on Minimum SIC. Tail
Model
Left Tail
Right Tail Loss
p1
p2
L
K
8 8 8 8 8 8 8 8 8
0.0658 0.0593 0.0582 0.0588 0.0594 0.0628 0.0597 0.0584 0.0591
0.050 0.717 0.960 0.839 0.701 0.239 0.619 0.845 0.775
0.050 0.675 0.947 0.813 0.655 0.239 0.571 0.808 0.737
3 3 3 3 3 3 3 3 3
8 8 8 8 8 8 8 8 8
2 2 3 3 3 3 3 3 1
7 7 7 7 7 7 7 7 8
0.0390 0.0359 0.0359 0.0359 0.0360 0.0515 0.0362 0.0357 0.0359
0.333 0.852 0.917 0.926 0.929 0.006 0.841 0.913 0.908
0.092 0.665 0.840 0.865 0.885 0.006 0.741 0.753 0.826
3 3 3 3 3 2 3 3 3
8 8 8 8 8 7 8 8 8
3 3 3 3 3 3 3 3 2
1 3 3 3 3 3 3 3 8
0.0016 0.0014 0.0000 0.0000 0.0000 0.0014 0.0000 0.0015 0.0000
0.077 0.090 1.000 1.000 1.000 0.109 0.852 0.095 0.910
0.077 0.090 0.999 0.999 0.997 0.109 0.718 0.095 0.757
3 3 3 3 3 3 3 3 3
2 2 1 1 1 2 1 2 1
Loss
p1
p2
L
K
100%
MDS Constant MA ARMA AR PN NN SETAR STAR
0.0196 0.0211 0.0216 0.0215 0.0215 0.0342 0.0214 0.0221 0.0213
0.950 0.628 0.562 0.570 0.580 0.157 0.599 0.538 0.596
0.950 0.628 0.562 0.570 0.580 0.157 0.599 0.538 0.596
1 1 1 1 1 2 1 1 1
3 3 3 3 3 4 3 3 3
10%
MDS Constant MA ARMA AR PN NN SETAR STAR
0.0583 0.0652 0.0651 0.0647 0.0649 0.0650 0.0649 0.0653 0.0653
1.000 0.064 0.070 0.078 0.075 0.076 0.075 0.068 0.063
0.767 0.064 0.070 0.078 0.075 0.076 0.075 0.068 0.063
3 2 3 3 3 2 3 3 3
5%
MDS Constant MA ARMA AR PN NN SETAR STAR
0.0306 0.0334 0.0328 0.0328 0.0324 0.0343 0.0325 0.0323 0.0705
0.991 0.384 0.416 0.411 0.495 0.250 0.489 0.522 0.000
0.871 0.213 0.241 0.238 0.318 0.098 0.311 0.357 0.000
1%
MDS Constant MA ARMA AR PN NN SETAR STAR
0.0000 0.0043 0.0044 0.0044 0.0045 0.0044 0.0045 0.0045 0.0229
1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.544 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Note: ‘‘Loss’’ refers to is the out-of-sample averaged loss based on the KLIC measure; ‘‘p1’’ and ‘‘p2’’ are the reality checks p-values of White’s (2000) and Hansen’s (2001) tests, respectively, where each model is regarded as a benchmark model and is compared with the remaining eight models; ‘‘L’’ and ‘‘K’’ are the AR and SNP orders, respectively, chosen by the minimum SIC criterion in the AR(L)–SNP(K) models, L ¼ 0 ,y, 3, K ¼ 0 ,y, 8, for the transformed PIT {xt}. We retrieve the S&P 500 returns series from finance.yahoo.com. The sample observations are from January 3, 1990 to June 30, 2003 (T ¼ 3303), the in-sample observations are from January 3, 1990 to September 24, 1996 (R ¼ 1703), and the out-of-sample observations are from September 25, 1996 to June 30, 2003 (n ¼ 1700).
Next, comparing the left tails with a ¼ 10%; 5%, 1%, we find the results are similar to those for the entire distribution. That is, the MDS model has the smallest KLIC loss values for these left tails, much smaller than those of the other eight models. The reality check p-values for the MDS model as the benchmark are all very close to one, indicating that none of the eight models are better than the MDS, confirming the efficient market hypothesis in the left tails of the S&P 500 returns. Interestingly and somewhat surprisingly, when we look at the right tails with a ¼ 90%; 95%, 99% (i.e., the right 10%, 5%, and 1% tails, respectively), the results are exactly the opposite to those for the left tails. That is, the KLIC loss values for the MDS model for all of these three right tails are the largest, larger than those of the other eight competing models. The reality check p-values with the MDS model as the benchmark are zero or very close to zero, indicating that some of the other eight models are significantly better than the MDS model, hence rejecting the efficient market hypothesis in the right tails of the S&P 500 return density. This implies that the S&P 500 returns may be more predictable when the market goes up than when it goes down, during the sample period from January 3, 1990 to June 30, 2003. To our knowledge, this empirical finding is new to the literature, obtained as a benefit of using the BLS method that permits evaluation and comparison of the density forecasts on a particular area of the return density. As the right tail results imply, the S&P 500 returns in the right tails are predictable via some of the eight linear and nonlinear competing models considered in this paper. To see the nature of the nonlinearity in mean, we compare the KLIC loss values of these models. It can be seen that the PN model has the smallest loss values for all the three right tails. The reality check p-values show that the PN model (as a benchmark) is not dominated by any of the remaining models for the three 10%, 5%, and 1% right tails. The other three nonlinear models (NN, SETAR, and STAR) are often worse than the linear models in terms of out-of-sample density forecasts. We note that, while PN is the best model for the right tails, it is the worst model for forecasting the entire density (a ¼ 100%). Summing up, the significant in-sample evidence from the generalized spectral statistics M(1, 2) and M(1, 4) reported in Table 1 suggests that the squared lagged return and the fourth order power of the lagged returns (i.e. the conditional kurtosis representing the influence of the tail observations) have a predictive power for the returns. The out-of-sample evidence adds that this predictability of the squared lagged return and the fourth order power of the lagged returns is asymmetric, i.e., significant only when the market goes up.
Table 3 shows the BLS test results computed with L and K chosen by the AIC. The results of Table 3 are similar to those of Table 2. Comparing the entire distribution with a ¼ 100%; the large reality check p-values for the MDS model as the benchmark (p1 ¼ 0:719 and p2 ¼ 0:719) indicate that none of the other eight models are better than the MDS model (although the KLIC loss value for the MDS model is not the smallest). This confirms the market efficiency hypothesis that the S&P500 returns may not be predictable using linear or nonlinear models. The results for the left tails with a ¼ 10%; 5%, 1% are also comparable to those in Table 2. That is, the KLIC loss values for the MDS model for these left tails are the smallest. The reality check p-values with the MDS model as the benchmark are all close to one (p1 ¼ 1:000; 1.000, 0.973 and p2 ¼ 0:767; 0.973, 0.939), indicating that none of the other eight models are better than the MDS model, again confirming the market efficiency hypothesis in the left tails of the S&P 500 returns. The right tail results are different from those for the left tails, as in Table 2. The MDS model for the right tails is generally worse than the other eight models. The reality check p-values of Hansen (2001) for the MDS model as the benchmark are very small (p2 ¼ 0:050; 0.092 and 0.008) for the three right tails, indicating that some of the eight models are significantly better than the MDS model. This shows that the S&P 500 returns may be more predictable when the market goes up than when it goes down – the nonlinear predictability is asymmetric. Table 4 shows the BLS test results computed with L and K chosen by the SIC. The results are very similar to those of Tables 2 and 3. Comparing the entire distribution with a ¼ 100%; the KLIC loss value for the benchmark MDS model is the smallest with the large reality check p-values (p1 ¼ 0:950 and p2 ¼ 0:950). The results for the left tails with a ¼ 10%; 5%, 1% are consistent with those in Tables 2 and 3. That is, the KLIC loss values for the MDS model for these left tails are the smallest. The reality check p-values with the MDS as the benchmark are all close to one (p1 ¼ 1:000; 0.991, 1.000 and p2 ¼ 0:767; 0.871, 0.544), indicating that none of the other eight models are better than the MDS model, again confirming the market efficiency hypothesis in the left tails. The story for the right tails is different from that for the left tails, as in Tables 2 and 3. The MDS model for the right tails is generally worse than the other eight models. The reality check p-values of Hansen (2001) for the MDS model as the benchmark are very small (p2 ¼ 0:050; 0.092, and 0.077) for the three right tails, indicating that some of the linear and nonlinear models are significantly better than the MDS model. This shows that the S&P 500 returns may be more predictable when the market goes up than when it goes down.
60
YONG BAO AND TAE-HWY LEE
5. CONCLUSIONS In this paper, we examine nonlinearity in the conditional mean of the daily closing S&P 500 returns. We first examine the nature of the nonlinearity using the in-sample test of Hong (1999), where it is found that there are strong nonlinear predictable components in the S&P 500 returns. We then explore the out-of-sample nonlinear predictive ability of various linear and nonlinear models. The evidence for out-of-sample nonlinear predictability is quite weak in the literature, and it is generally accepted that stock returns follow a martingale. While most papers in the literature use MSFE, MAFE, or some economic measures (e.g., wealth or returns) to evaluate nonlinear models, we use the density forecast approach in this paper. We find that for the entire distribution the S&P 500 daily closing returns are not predictable, and various nonlinear models we examine are no better than the MDS model. For the left tails, the returns are not predictable. The MDS model is the best in the density forecast comparison. For the right tails, however, the S&P 500 daily closing returns are found to be predictable by using some linear and nonlinear models. Hence, the out-of-sample nonlinear predictability of the S&P 1500 daily closing returns is asymmetric. These findings are robust to the choice of L and K in computation of the BLS statistics. We note that the asymmetry in the nonlinear predictability which we have found is with regards to the two tails of the return distribution. Our results may not imply that the bull market is more predictable than the bear market because stock prices can fall (left tail) or rise (right tail) under both market conditions. Nonlinear models that incorporate the asymmetric tail behavior as well as the bull and bear market regimes would be an interesting model to examine, which we leave for future research.
NOTES 1. We would like to thank Yongmiao Hong for sharing his GAUSS code for the generalized spectral tests. 2. The GAUSS code is available upon request.
ACKNOWLEDGEMENT We would like to thank the co-editor Tom Fomby and an anonymous referee for helpful comments. All remaining errors are our own.
Asymmetric Predictive Abilities of Nonlinear Models for Stock Returns
61
REFERENCES Anderson, T. G., Benzoni, L., & Lund, J. (2002). An empirical investigation of continuous-time equity return models. Journal of Finance, 57, 1239–1284. Bao, Y., Lee, T.-H., & Saltoglu, B. (2004). A test for density forecast comparison with applications to risk management. Working paper, University of California, Riverside. Berkowitz, J. (2001). Testing density forecasts with applications to risk management. Journal of Business and Economic Statistics, 19, 465–474. Bradley, M. D., & Jansen, D. W. (2004). Forecasting with a nonlinear dynamic model of stock returns and industrial production. International Journal of Forecasting, 20, 321–342. Clements, M. P., & Smith, J. (2000). Evaluating the forecast densities of linear and non-linear models: applications to output growth and unemployment. Journal of Forecasting, 19, 255–276. Clements, M. P., & Smith, J. (2001). Evaluating forecasts from SETAR models of exchange rates. Journal of International Money and Finance, 20, 133–148. Diebold, F. X., Gunther, T. A., & Tay, A. S. (1998). Evaluating density forecasts with applications to financial risk management. International Economic Review, 39, 863–883. Gallant, A. R., & Nychka, D. W. (1987). Semi-nonparametric maximum likelihood estimation. Econometrica, 55, 363–390. Granger, C. W. J., & Pesaran, M. H. (2000a). A decision theoretic approach to forecasting evaluation. In: W. S. Chan, W. K. Li & Howell Tong (Eds), Statistics and finance: An interface. London: Imperial College Press. Granger, C. W. J., & Pesaran, M. H. (2000b). Economic and statistical measures of forecast accuracy. Journal of Forecasting, 19, 537–560. Hansen, P. R. (2001). An unbiased and powerful test for superior predictive ability. Working paper, Stanford University. Hong, Y. (1999). Hypothesis testing in time series via the empirical characteristic function: A generalized spectral density approach. Journal of the American Statistical Association, 84, 1201–1220. Hong, Y., & Lee, T.-H. (2003). Inference on predictability of foreign exchange rates via generalized spectrum and nonlinear time series models. Review of Economics and Statistics, 85(4), 1048–1062. Kanas, A. (2003). Non-linear forecasts of stock returns. Journal of Forecasting, 22, 299–315. Kullback, L., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86. Lo, A. W., & MacKinlay, A. C. (1988). Stock market prices do not follow random walks: Evidence from a simple specification test. Review of Financial Studies, 1, 41–66. Lo, A. W., & MacKinlay, A. C. (1990). Data-snooping biases in tests of financial asset pricing models. Review of Financial Studies, 3, 175–208. McMillan, D. G. (2001). Nonlinear predictability of stock market returns: Evidence from nonparametric and threshold Models. International Review of Economics and Finance, 10, 353–368. Meese, R. A., & Rogoff, K. (1983). The out of sample failure of empirical exchange rate models: Sampling error or misspecification. In: J. Frenkel (Ed.), Exchange rates and international economics. Chicago: University of Chicago Press. Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89, 1303–1313.
62
YONG BAO AND TAE-HWY LEE
Racine, J. (2001). On the nonlinear predictability of stock returns using financial and economic variables. Journal of Business and Economic Statistics, 19(3), 380–382. Vuong, Q. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307–333. White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25. White, H. (2000). A reality check for data snooping. Econometrica, 68(5), 1097–1126. Wright, J. H. (2000). Alternative variance-ratio tests using ranks and signs. Journal of Business and Economic Statistics, 18, 1–9.
FLEXIBLE SEASONAL TIME SERIES MODELS Zongwu Cai and Rong Chen ABSTRACT In this article, we propose a new class of flexible seasonal time series models to characterize the trend and seasonal variations. The proposed model consists of a common trend function over periods and additive individual trend (seasonal effect) functions that are specific to each season within periods. A local linear approach is developed to estimate the trend and seasonal effect functions. The consistency and asymptotic normality of the proposed estimators, together with a consistent estimator of the asymptotic variance, are obtained under the a-mixing conditions and without specifying the error distribution. The proposed methodologies are illustrated with a simulated example and two economic and financial time series, which exhibit nonlinear and nonstationary behavior.
1. INTRODUCTION The analysis of seasonal time series has a long history that can be traced back to Gilbart (1854, 1856, 1865). Most of the time series, particularly economic and business time series, consist of trends and seasonal effects and many of them are nonlinear (e.g., Hylleberg, 1992; Franses, 1996, 1998). Econometric Analysis of Financial and Economic Time Series/Part B Advances in Econometrics, Volume 20, 63–87 Copyright r 2006 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(05)20022-1
63
64
ZONGWU CAI AND RONG CHEN
There are basically three major types of trend and seasonality studied in the literature, and each requires different modeling approaches. When the trend and seasonality are determined to be stochastic, differences and seasonal differences of proper order are often taken to transform the time series to stationarity, which is then modeled by standard (seasonal) autoregressive and moving average (ARMA) models (e.g., Box, Jenkins, & Reinsel, 1994; Shumway & Stoffer, 2000; Pena, Tiao, & Tsay, 2001). This method is commonly used to model economic and financial data, such as the quarterly earnings data (e.g., Griffin, 1977; Shumway, 1988; Burman & Shumway, 1998; Franses, 1996, 1998). On the other hand, when the trend and seasonality are determined to be deterministic, linear (or nonlinear) additive or multiplicative trends and seasonal components are used, sometimes accompanied with an irregular noise component. More detailed discussion on these models can be found in the books by Shumway (1988), Brockwell and Davis (1991), and Franses (1996, 1998), and references therein. A third approach is the stable seasonal pattern model studied by Marshall and Oliver (1970, 1979), Chang and Fyffe (1971), and Chen and Fomby (1999). It assumes that given the season total, the probability portion of each period within the season remains constant across different periods. For more types of trend and seasonality studied in the literature, the reader is referred to the books by Hylleberg (1992), Franses (1996, 1998), and Ghysels and Osborn (2001). In this article, we take the deterministic trend and seasonality approach. It is often observed, particularly in economic time series, that a seasonal pattern is clearly present but the seasonal effects change from period to period in some relatively smooth fashion. In addition, in many cases the trend cannot be fitted well by straight lines or even higher order polynomials because the characteristics of smooth variation change over time. These phenomena can be easily seen from the quarterly earnings series of Johnson & Johnson from 1960 to 1980. The data are taken from Table 7 of Shumway (1988, p. 358). Figure 1(a) shows the time series plot of the series. Clearly, there is an overall exponentially increasing trend with several different rates and a regular yearly seasonal component increasing in amplitude. A stochastic approach would first take a logarithm transformation (Fig. 1(b)), then a first order difference (Fig. 1(c)) and finally fit a conventional or seasonal autoregressive integrated moving-average model. But as pointed out by Burman and Shumway (1998), this approach does not seem to be satisfactory for this series due to its high nonlinearity and non-stationarity. It can be seen from Fig. 1(d), which displays the time series plot for each quarter over years, that the trends for all seasons are similar, increasing
65
Flexible Seasonal Time Series Models
15 2 10 1 5 0
0
(a)
0
20
40
60
80
0.4
(b)
0
20
40
60
80
1970
1975
1980
Quarter 1 Quarter 2 Quarter 3 Quarter 4
15
0.2 10
0.0 -0.2
5 -0.4 -0.6 0
(c)
0
2
20
40
60
80
1970
1975
1980
(d)
1960
1965
Quarter 1 Quarter 2 Quarter 3 Quarter 4
1 0 -1 -2 -3
(e)
1960
1965
Fig. 1. Johnson & Johnson Quarterly Earnings Data from 1960–80. (a) Original. (b) Log-transformed Series. (c) Difference of the Log-transformed Series. (d) Quarter-by-Year Time Series Plots with the Legend: Solid Line for Quarter 1, Dotted Line for Quarter 2, Dashed Line for Quarter 3, and Dotted-Dashed Line for Quarter 4. (e) Quarter-by-Year Time Series Plots for the Observed Seasonal Data with the Same Legend as in (d).
66
ZONGWU CAI AND RONG CHEN
smoothly but nonlinearly. In this article, we propose a flexible model that is capable of capturing such features in a time series. Denote a seasonal time series as yt1 ; . . . ; ytd ;
t ¼ 1; 2; . . . ,
where d is the number of seasons within a period. A general deterministic trend and seasonal component model assumes the form (1)
ytj ¼ T t þ S tj þ etj
different periods within a season, and where Tt is the common trend same to P d Stj is the seasonal effect, satisfying j¼1 S tj ¼ 0: A standard parametric model assumes a parametric function for the common trend Tt, such as linear or polynomial functions. The seasonal effects are usually assumed to be the same for different periods, that is, Stj ¼ Sj for j ¼ 1; . . . ; d and all t, only if seasonality is postulated to be deterministic, an assumption which has been criticized in the seasonality literature. Motivated by the Johnson & Johnson quarterly earnings data, Burman and Shumway (1998) considered the following semi-parametric seasonal time series model ytj ¼ aðtÞ þ bðtÞgj þ etj ;
t ¼ 1; . . . ; n;
j ¼ 1; . . . ; d
(2)
where a(t) is regarded as a common trend component and {gj} are seasonal factors. Hence, the overall seasonal effect changes over periods in accordance with the modulating function b(t). Implicitly, model (2) assumes that the seasonal effect curves have the same shape (up to a multiplicative constant) for all seasons. This assumption may not hold for some cases, as shown in the Johnson & Johnson series (Fig. 1(e)). To estimate the functions and parameters in model (2), Burman and Shumway (1998) proposed two methods: nonlinear ANOVA and smoothing spline. They established the consistency for the estimators of trend, modulation function, and seasonal factors and the asymptotic normality for the estimators of the seasonal factors. Inspired by the Johnson & Johnson quarterly earnings data of which the seasonal effect curves do not have the same shape over years (see Fig. 1(e)), we propose a more general flexible seasonal effect model having the following form: yij ¼ aðti Þ þ bj ðti Þ þ eij ;
i ¼ 1; . . . ; n;
j ¼ 1; . . . ; d
(3)
where ti ¼ i=n; a( ) is a smooth trend function in [0, 1], {bj( )} are smooth seasonal effect functions in [0, 1], either fixed or random, subject to a set of
67
Flexible Seasonal Time Series Models
constraints, and the error term eij is assumed to be stationary and strong mixing. Note that for convenience in study of the asymptotic properties, we use ti for time index in (3), instead of the integer t. As in model (2), the following constraints are needed for fixed seasonal effects: d X j¼1
bj ðtÞ ¼ 0
for all t
(4)
reflecting the fact that the sum of all seasons should be zero for the seasonal factor. For random seasonal effects, bðtÞ ¼ ðb1 ðtÞ; . . . ; bd ðtÞÞ0 is the vector of random seasonal effects with E(10 d(b(t)) ¼ 0 and a certain covariance structure, where, throughout, 1d denotes a d 1 unit vector. However, in this article, our focus is only on the fixed effect case and the stationary errors. Of course, it is warranted to further investigate the random effect case and it would be a very interesting topic to consider cases with nonstationary errors such as the integrated processes and stationary/nonstationary exogenous variables. The rest of this article is organized as follows. In Section 2, we study in detail the flexible seasonal time series model (3). A local linear technique is used to estimate the trend and seasonal functions, and the asymptotic properties of the resulting estimators are studied. Section 3 is devoted to a Monte Carlo simulation study and the analysis of two economic and financial time series data. All the technical proofs are given in the appendix.
2. MODELING PROCEDURES Combination of (3) and (4) in a matrix expression leads to (5)
Yi ¼ Ahðti Þ þ ei where yi1 B . C . C Yi ¼ B @ . A; yid 0
1
A¼
1d1 1
!
Id1 ; 10d1
1 aðtÞ C B B b1 ðtÞ C C B hðtÞ ¼ B . C; B .. C @ A bd1 ðtÞ 0
and
0
1 ei1 B . C . C ei ¼ B @ . A eid
Id is the d d identity matrix, and the error term ei is assumed to be stationary with E(ei) ¼ 0 and cov(ei, ej) ¼ R(ij). Clearly, model (5) includes model (2) as a special case.
68
ZONGWU CAI AND RONG CHEN
2.1. Local Linear Estimation For estimating a( ) and {bj( )} in (3), a local linear method is employed, although a general local polynomial method is also applicable. Local (polynomial) linear methods have been widely used in nonparametric regression during recent years due to their attractive mathematical efficiency, bias reduction, and adaptation of edge effects (e.g., Fan & Gijbels, 1996). Assuming throughout that a( ) and {bj( )} have a continuous second derivative in [0, 1], then a( ) and {bj( )} can be approximated by a linear function at any fixed time point 0rtr1 as follows: aðti Þ ’ a0 þ b0 ðti tÞ and bj ðti Þ ’ aj þ bj ðti tÞ;
1 j d 1,
where C denotes the first order Taylor approximation. Hence h(ti)Ca+b (tit), where a ¼ h(t) and b ¼ h(1)(t) ¼ dh(t)/dt and (5) is approximated by a Yi ’ Zi þ ei b
where Zi ¼ (A, (tit) A). Therefore, the locally weighted sum of least squares is n X i¼1
Yi Zi
0 a a Yi Zi K h ðti tÞ b b
(6)
where K h ðuÞ ¼ Kðu=hÞ=h, K( ) is the kernel function, and h ¼ hn 40 is the bandwidth satisfying h- 0 and nh-N as n-N, which controls the amount of smoothing used in the estimation. By minimizing (6) with respect to a and b, we obtain the local linear ^ estimate of h(t), denoted by hðtÞ; which is a^ : It is easy to show that the minimizers of (6) are given by a^ ¼ b^
S n;0 ðtÞA S n;1 ðtÞA
S n;1 ðtÞA
S n;2 ðtÞA
!1
Tn;0 ðtÞ Tn;1 ðtÞ
!
where for k 0; S n;k ðtÞ ¼ n1
n X ðti tÞk K h ðti tÞ i¼1
and
Tn;k ðtÞ ¼ n1
n X i¼1
ðti tÞk K h ðti tÞYi
69
Flexible Seasonal Time Series Models
^ is given by Hence, the local linear estimator hðtÞ ^ ¼ A1 S n;2 ðtÞTn;0 ðtÞ S n;1 ðtÞTn;1 ðtÞ ¼ A1 hðtÞ S n;0 ðtÞ S n;2 ðtÞ S2n;1 ðtÞ where Si ðtÞ ¼
n X i¼1
Si ðtÞYi
(7)
½S n;2 ðtÞ S n;1 ðtÞðti tÞK h ðti tÞ n o n S n;0 ðtÞS n;2 ðtÞ S 2n;1 ðtÞ
Remark 1. The local weighted least squares estimator from (6) does not take into the account of correlation between the periods et and es. It can be improved by incorporating the correlation between periods. For example, given a known correlation structure, we can replace (6) with 0 a a 1=2 1=2 YZ Kh R1 Kh Y Z b b where Ynd l and Znd 2d are obtained by stacking Yi and Zi, respectively, 1=2 and S is the covariance matrix of stacked ei’s, and Kh is a diagonal 1=2 matrix with (id+j)-th diagonal element being K h ðti tÞð1 i n; 1 j d 1Þ: When the correlation between the periods are unknown but can be modeled by certain parametric models such as the ARMA models, it is then possible to use an iterative procedure to estimate jointly the nonparametric function and the unknown parameters. In this article, we adopt the simple locally weighted least squares approach and use Eq. (6) to construct our estimator. Remark 2. Note that many other nonparametric smoothing methods can be used here. The locally weighted least squares method is just one of the choices. There is a vast literature in theory and empirical study on the comparison of different methods (e.g., Fan & Gijbels, 1996). Remark 3. The restriction to the locally weighted least squares method suggests that normality is at least being considered as a baseline. However, when non-normality is clearly present, a robust approach would be considered; see Cai and Ould-Said (2003) for details about this aspect in nonparametric regression estimation for time series. Remark 4. The expression (7) has a nice interpretation which eases computation. Let mj ðtÞ ¼ aðtÞ þ bj ðtÞ be the seasonal mean curve for the jth season. Then m^ j ðtÞ; the local linear estimator based on the series y1j, y2j, y , ynj
70
ZONGWU CAI AND RONG CHEN
(and ignore all other observations), can be easily shown to be m^ j ðtÞ ¼
n X i¼1
S i ðtÞyij
which, in conjunction with (7), implies that ^ ¼ A1 lðtÞ ^ hðtÞ
^ ¼ ðm^ 1 ðtÞ; . . . ; m^ d ðtÞÞ0 : A computation of the inverse of A leads to where lðtÞ a^ ðtÞ ¼
d 1X m^ ðtÞ; d j¼1 j
and
b^ j ðtÞ ¼ m^ j ðtÞ a^ ðtÞ;
1jd
(8)
Hence one needs to estimate only the low dimension individual mean curves fm^ j ðtÞg to obtain a^ ðtÞ and fb^ j ðtÞg; which makes the implementations of the estimator much easier. However, we must point out that this feature is unique to the local linear estimator for model (5) and does not apply to other models. It is clear from (8) that fmj ðÞg can be estimated using different bandwidths, which can deal with the situation in which the seasonal mean functions have different degrees of smoothness. Remark 5. Bandwidth selection is always one of the most important parts of any nonparametric procedure. There are several bandwidth selectors in the literature, including the generalized cross-validation of Wahba (1977), the plug-in method of Jones, Marron, and Sheather (1996), and the empirical bias method of Ruppert (1997), among others. They all can be used here. A comparison of different procedures can be found in Jones, Marron, and Sheather (1996). In this article, we use a procedure proposed in Fan, Yao, and Cai (2003), which combines the generalized crossvalidation and the empirical bias method. 2.2. Asymptotic Theory The estimation method described in Section 2.1 can accommodate both fixed and random designs. Here, our focus is on fixed design. The reason is that it might be suitable for pure time series data, such as financial and economic data. Since data are observed in time order as in Burman and Shumway (1998, p. 137), we assume that ti ¼ tni ¼ i=n for simplicity although the theoretical results developed later still hold for non-equal spaced design points, and that ei ¼ eni. In such a case, we assume that, for each n, {en1, y , enn} have the same joint distribution as {n1, y , nn}, where nt,
Flexible Seasonal Time Series Models
71
t ¼ y,1, 0, 1, y, is a strictly stationary time series defined on a probability space ðO; A; PÞ and taking values on Rd : This type of assumption is commonly used in fixed-design regression for time series contexts. Detailed discussions on this respect can be found in Roussas (1989), Roussas, Tran, and Ioannides (1992), and Tran, Roussas, Yakowitz, and Van (1996) for nonparametric regression estimation for dependent data. Traditionally, the error component in a deterministic trend and seasonal component like model (1) is assumed to follow certain linear time series models such as an ARMA process. Here we consider a more general structure – the a-mixing process, which includes many linear and nonlinear time series models as special cases (see Remark 6). Our theoretical results here are derived under the a-mixing assumption. For reference convenience, the mixing coefficient is defined as aðiÞ ¼ sup jPðA \ BÞ PðAÞPðBÞj : A 2 F01 ; B 2 F1 i
where Fba is the s-algebra generated by fni gbi¼a : If a(i)-0 as i-N, the process is called strongly mixing or a-mixing.
Remark 6. Among various mixing conditions used in the literature, a-mixing is reasonably weak and is known to be fulfilled for many linear and nonlinear time series models under some regularity conditions. Gorodetskii (1977) and Withers (1981) derived the conditions under which a linear process is a-mixing. In fact, under very mild assumptions, linear autoregressive and more generally bilinear time series models are a-mixing with mixing coefficients decaying exponentially. Auestad and Tjøstheim (1990) provided illuminating discussions on the role of a-mixing (including geometric ergodicity) for model identification in nonlinear time series analysis. Chen and Tsay (1993) showed that the functional autoregressive process is geometrically ergodic under certain conditions. Further, Masry and Tjøstheim (1995, 1997) and Lu (1998) demonstrated that under some mild conditions, both autoregressive conditional heteroscedastic processes and nonlinear additive autoregressive models with exogenous variables, particularly popular in finance and econometrics, are stationary and a-mixing. Roussas (1989) considered linear processes without satisfying the mixing condition. Potentially our results can be extended to such cases. Assumptions: A1. Assume that the kernel K(u) is symmetric and satisfies the Lipschitz condition and u K(u) is bounded, and that a( ) and {bj( )} have a continuous second derivative in [0,1].
72
ZONGWU CAI AND RONG CHEN
A2. For each n, {en1, y , enn} have the same joint distribution as {n1, y ,nn}, where nt, t ¼ y, 1, 0, 1, y, is a strictly stationary time series with the covariance matrix R(kl) ¼ cov(nk, nl) defined on a probability space ðO; A; PÞ and taking values on
Remark 7. Let rjm(k) denote the (j, m)-th element of R(k). By the Davydov’s inequality P (e.g., Corollary A.2 in Hall and Heyde, 1980), Assumption A2 implies 1 k¼1 jrjm ðkÞjo1:
Note that all the asymptotic results here assume that the number of periods n-N Rbut the number of seasons R within a period d is fixed. PDefine, for kZ0, mk ¼ uk KðuÞ du and uk ¼ uk K 2 ðuÞ d u: Let R0 ¼ 1 k¼1 RðkÞ; which exists by Assumption A2 (see Remark 7), and hð2Þ ðtÞ ¼ d 2 hðtÞ=dt2 : ^ Now we state the asymptotic properties of the local linear estimator hðtÞ: A sketch of the proofs is relegated to the appendix. Theorem 1. Under Assumptions A1 and A2, we have 2
^ hðtÞ h m hð2Þ ðtÞ þ oðh2 Þ ¼ Op ððn hÞ1=2 Þ hðtÞ 2 2 Theorem 2. Under Assumptions A1–A3, we have pffiffiffiffiffiffi h2 2 ð2Þ ^ n h hðtÞ hðtÞ m2 h þ oðh Þ ! Nð0; Ry Þ 2
where Ry ¼ u0 A1 R0 ðA1 Þ0 : Clearly, the asymptotic variance of a^ ðtÞ is v0 d 2 10d R0 1d :
^ hðtÞ ¼ Op ðh2 þ Remark 8. As a consequence of Theorem 1, hðtÞ 1=2 ^ ðn hÞ Þ so that hðtÞ is a consistent estimator of h(t). From Theorem ^ is given by 2, the asymptotic mean square error (AMSE) of hðtÞ h2 2 2 trðRy Þ m jjh ðtÞjj22 þ nh 4 2 2 1=5 ð2Þ 1=5 which gives the optimal bandwidth, hopt ¼ n P PtrðRy m2 2 Þjjh ðtÞjj2 P ; by minimizing the AMSE, where jjAjj2 ¼ ð i j a2ij Þ1=2 and trðAÞ ¼ i aii if A ¼ (aij) is a square matrix. Hence, the optimal convergent rate of the AMSE ¼
73
Flexible Seasonal Time Series Models
^ is of the order of n4/5, as one would have expected. Also, AMSE for hðtÞ the asymptotic variance of the estimator does not depend on the time point t. More importantly, it shows that the asymptotic variance of the estimator depends on not only the covariance structure of the seasonal P effects (R(0) ¼ var(ei)) but also the autocorrelations over periods ( 1 k¼1 RðkÞ). Finally, it is interesting to note that in general, the elements (except the first one) in the first row of Sy are not zero, unless S0 ¼ s2 I: This implies that a^ ðtÞ and b^ j ðtÞ (1rjrd1) may be asymptotically correlated. Remark 9. In practice, it is desirable to have a quick and easy imple^ mentation to estimate the asymptotic variance of hðtÞ to construct a pointwise confidence interval. The explicit expression of the asymptotic variance in Theorem 2 provides two direct estimators. From Lemma A1 in the Appendix, for any 0oto1, we have ! n X h 1 Ry ¼ lim A var eni K h ðti tÞ ðA1 Þ0 n!1 n i¼1 0
b n0 Q b ðA1 Þ0 ; b y ¼ A1 Q Hence a direct (naive) estimator of Sy is given by R n0 P n 1=2 1 ^ b n0 ¼ ðh n Þ where Q i¼1 fYi Ahðti Þg K h ðti tÞ: However, in the finite b y might depend on t. To overcome this shortcoming, an altersample, R native way to construct the estimation of R0 is to use some popular approach in econometric literature such as heteroskedasticity consistent (HC) or heteroskedasticity and autocorrelation consistent (HAC) method; see White (1980), Newey and West (1987, 1994). b be the forecast of YðtÞ based on the mean function. Remark 10. Let YðtÞ ^ b Then, YðtÞ ¼ AhðtÞ and ^ ^ YðtÞ YðtÞ ¼ AðhðtÞ hðtÞÞ þ et
^ and so that the forecast variance, by ignoring the correlation between hðtÞ et and the bias, is 0 ^ ^ var ðYðtÞ YðtÞÞ A var ðhðtÞ hðtÞÞA
þ Rð0Þ u0 ðn hÞ1 R0 þ Rð0Þ
by Theorem 2. Therefore, the standard error estimates for the predictions can be taken to be the square root of the diagonal elements of ^ 0 þ Rð0Þ; b ^ 0 is given in the above remark and Rð0Þ b u0 ðnP hÞ1 R where R ¼ n 0 1 ^ b n ðY A hðt ÞÞðY A hðt ÞÞ : Note that the above formula is just i i i i i¼1 an approximation. A further theoretical and empirical study on the predictive utility based on model (3) is warranted.
74
ZONGWU CAI AND RONG CHEN
3. EMPIRICAL STUDIES In this section, we use a simulated example and two real examples to illustrate our proposed models and the estimation procedures. Throughout this section, we use the Epanechnikov kernel, KðuÞ ¼ 0:75ð1 u2 ÞIðjuj 1Þ and the bandwidth selector mentioned in Remark 5. Example 1. A Simulated Example. : We begin the illustration with a simulated example of the flexible seasonal time series model. For this simulated example, the performance of the estimators is evaluated by the mean absolute deviation error (MADE) ¼ n1 0
n0 X
j ¼ n1 0
n0 X
for aðÞ and
k¼1
k¼1
jb aðuk Þ aðuk Þj
jb bj ðuk Þ bj ðuk Þj
for bj ðÞ; where fuk ; k ¼ 1; . . . ; n0 g are the grid points from (0, 1). In this simulated example, we consider the following nonlinear seasonal time series model yij ¼ aðti Þ þ bj ðti Þ þ eij;
i ¼ 1; . . . ; n;
j ¼ 1; . . . ; 4
where ti ¼ i=n aðxÞ
¼
b1 ðxÞ ¼ b2 ðxÞ ¼
b4 ðxÞ ¼ b3 ðxÞ ¼
expð0:7 þ 3:5xÞ
3:1x2 þ 17:1x4 28:1x5 þ 15:7x6 0:5x2 þ 15:7x6 15:2x7
0:2 þ 4:8x2 7:7x3 b1 ðxÞ b2 ðxÞ b4 ðxÞ
for 0ox 1: Here, the errors eij ¼ e4ði1Þþj ; where {et} are generated from the following AR(1) model: et ¼ 0:9et1 þ t
and et is generated from N(0, 0.12). The simulation is repeated 500 times for each of the sample sizes n ¼ 50; 100, and 300. Figure 2 shows the results of a typical example for sample size n ¼ 100: The typical example is selected in such a way that its total MADE value (¼ þ 1 þ 2 þ 3 þ 4 ) is equal to
75
Flexible Seasonal Time Series Models
15
15
10
10
5
5
0
Season 1 Season 2 Season 3 Season 4
0 0
100
200
300
400
0
(a)
(b)
15
1
20
40
60
80
100
60
80
100
0 10 -1 5
-2 -3
0 0 (c)
Season 1 Season 2 Season 3 Season 4
20
40
60
80
100
0
20
40
(d)
Fig. 2. Simulation Results for Example 1. (a) The Time Series Plot of the Time Series {yt}. (b) The Time Series Plots for {yij} for Each Season (Thin Lines) with True Trend Function a( ) (Thick Solid Line) with the Legend: Solid Line for Season 1, Dotted Line for Season 2, Dashed Line for Season 3, and Dotted-Dashed Line for Season 4. (c) The Local Linear Estimator (Dotted Line) of the Trend Function a( ) (Solid Line). (d) The Local Linear Estimator (Thin Lines) of the Seasonal Effect Functions {bj( )} (Thick Solid Lines) with the Legend: Solid Line for b1( ), Dotted Line for b2( ), Dashed Line for b3( ), and Dotted-Dashed Line for b4( ).
the median in the 500 replications. Figure 2(a) presents the time series {y(i1)4+j} and Figure 2(b) gives the time series plots of each season {yij} (thin lines) with the true trend function aðÞ (thick solid line). Figures 2(c) and 2(d) display, respectively, the estimated aðÞ and fbj ðÞg (thin lines), together with their true values (thick lines). They show that the estimated values are very close to the true values. The median and standard deviation (in parentheses) of the 500 MADE values are summarized in
76
ZONGWU CAI AND RONG CHEN
Table 1. n 50 100 300
The Median and Standard Deviation of the 500 MADE Values.
e
e1
e2
e3
e4
0.1202(0.0352) 0.0997(0.0286) 0.0692(0.0161)
0.0284(0.0077) 0.0204(0.0051) 0.0135(0.0031)
0.0271(0.0083) 0.0214(0.0059) 0.0151(0.0037)
0.0244(0.0071) 0.0179(0.0044) 0.0119(0.0027)
0.0263(0.0073) 0.0192(0.0057) 0.0122(0.0034)
Table 1, which shows that all the MADE values decrease as n increases. This reflects the asymptotic results. Overall, the proposed modeling procedure performs fairly well. Example 2. Johnson & Johnson Earnings Series: We continue our study on the Johnson & Johnson quarterly earnings data, described in Section 1. There are two components of the quarterly earnings data: (1) a fourperiod seasonal component and (2) an adjacent quarter component which describes the seasonally adjusted series. As discussed in Section 1, we fit the series using yij ¼ aðti Þ þ bj ðti Þ þ eij ; i ¼ 1; . . . ; 21; j ¼ 1; . . . ; 4 (9) P4 subject to the constraint j¼1 bj ðtÞ ¼ 0 for all t. Together with the observed data for each quarter (thin lines) over years, the estimated trend function b aðÞ (thick solid line) is plotted in Fig. 3(a), and the estimated seasonal effect functions fb^ j ðÞg (thick lines) are displayed in Fig. 3(b) with the observed seasonal data y¯ ij ¼ yij
4 1X y 4 j¼1 ij
for each quarter (thin lines) over years. With the automatically selected bandwidth, the fitted and observed values are very close to each other. For comparison purpose, we calculate the overall mean squared error 21 X 4 n o2 1 X yij a^ ðti Þ b^ j ðti Þ 84 i¼1 j¼1
which is 0.0185, with the effective number of parameters (e.g., Hastie & Tibshirani, 1990; Cai & Tiwari, 2000) being 44.64. It seems to be a significant improvement of the semi-parametric model (2) of Burman and Shumway (1998), which has a MSE of 0.0834 with 44 effective number of parameters.
77
Flexible Seasonal Time Series Models 2 15
Quarter 1 Quarter 2 Quarter 3 Quarter 4
Quarter 1 Quarter 2 Quarter 3 Quarter 4
1 0
10
-1 5 -2 -3
0 (a) 18 16
1960
1965
1970
1975
1980
(b)
1960
1965
1970
1975
1980
Model (11) Estimated Actual Value Model (2) Model (12)
14 12 10 8
Forecast
6
(c)
75
80
85
Fig. 3. Johnson & Johnson Quarterly Earnings Data from 1960–80. (a) Quarterby-Year Time Series Plots (Thin Lines) with the Estimated Trend Functions (Thick Line) with the Same Legend: Solid Line for Quarter 1, Dotted Line for Quarter 2, Dashed Line for Quarter 3, and Dotted-Dashed Line for Quarter 4. (b) Quarterby-Year Time Series Plots for the Observed Seasonal Data (Thin Lines) with the Estimated Seasonal Effect Functions (Thick Lines) with the Same Legend as in (a). (c) Four-Quarter Forecasts for 1981 Based on: Model (9), Dotted Line, Model (2), Dotted-Dashed Line, and Model (10), Short-Dashed Line, together with ‘‘Estimated Actual’’ Value – Dashed Line.
Finally, to evaluate the prediction performance, we use the whole dataset (all 84 quarters from 1960 to 1980) to fit the model and forecast the next four quarters (in 1981) by using one-step ahead prediction y^ 22;j ¼ a^ ðt22 Þ þ b^ j ðt22 Þ for 1 j 4 in model (9). The standard error estimates for the predictions (see Remark 10) are 0.110, 0.106, 0.186, and 0.177, respectively.
78
ZONGWU CAI AND RONG CHEN
Also, we consider the predictions using the following seasonal regression model, popular in the seasonality literature (e.g., Hylleberg, 1992; Franses, 1996, 1998; Ghysels & Osborn, 2001), for the first order difference of the logarithm transformed data zt ¼ log ðyt Þ logðyt1 Þ
4 X j¼1
yj Dj;t þ t ;
2 t 84
(10)
where {Dj,t} are seasonal dummy variables. Figure 3(c) reports the predictions for the mean value function based on models (9) (dotted line) and (10) (short-dashed line), connected with the observed values (solid line) for the last thirteen quarters (from quarters 72 to 84). In contrast to the semiparametric model proposed by Burman and Shumway (1998) which forecasts a slight decreasing bend in the trend function for 1981 (the dashed-dotted line), our model predicts an upward trend. Indeed, a^ ðt22 Þ ¼ 16:330; which is much larger than the last two values of the estimated trend function a^ ðt21 Þ ¼ 14:625 and a^ ðt20 Þ ¼ 12:948: For comparison purpose, in Fig. 3(c), together with the forecasting results based on model (2) from Burman and Shumway (1998) in dashed-dotted line and model (10) in shortdashed line, we show, in dashed line, the ‘‘estimated actual’’ earnings for the year 1981. They were ‘‘estimated’’ since we were not able to obtain the exact definition of the earnings of the original series and had to resort to estimating the actual earning (using the average) in the scale of the original data from various sources. Fig. 3(c) shows that the forecasting results based on model (9) is closer to the ‘‘estimated actual’’ earnings than those based on model (2) from Burman and Shumway (1998) and model (10), although neither forecasting result for the second quarter of 1981 is close to the ‘‘estimated actual’’ earning, and that model (10) does not predict well. Example 3. US Retail Sales Series: In this example we apply the proposed methods to the monthly US retail sales series (not seasonally adjusted) from January of 1967 to December of 2000 (in billions of US dollars). The data can be downloaded from the web site at http://marketvector.com. The US retail sales index is one of the most important indicators of the US economy. There are vast studies of this series in the literature (e.g., Franses, 1996, 1998; Ghysels & Osborn, 2001). Figure 4(a) represents the series plotted as 12 curves, each corresponding to a specific month of each calendar year and the yearly averages plotted in the thick solid line. They show that the trends for all seasons (months) are similar, basically increasing but nonlinearly. Figure 4(b) displays the monthly growth over years and shows pattern of high seasonal effects. It also can be observed
79
Flexible Seasonal Time Series Models
300
60
250
40
200
20
150
0
100 −20 50 −40
(a)
1970 1975 1980 1985 1990 1995 2000
(b)
1970 1975 1980 1985 1990 1995 2000
250 40 200 20 150 0 100 −20 50 −40
(c)
1970 1975 1980 1985 1990 1995 2000
(d)
1970 1975 1980 1985 1990 1995 2000
10
5
0
−5
−10
(e)
1970 1975 1980 1985 1990 1995 2000
Fig. 4. US Retail Sales Data from 1967–2000. (a) Quarter-by-Year Time Series Plots with Yearly Averages (Thick Solid Line). (b) Quarter-by-Year Time Series Plots for the Observed Seasonal Data with the Same Legend as in (a). (c) Estimated Trend Function (Solid Line) with the Yearly Averages (Dashed Line). (d) Estimated Seasonal Functions with the Same Legend as in (a). (e) Estimated Seasonal Functions for Months from March to November.
80
ZONGWU CAI AND RONG CHEN
that the sales were decreasing over the years for the first four months in the year and September, and increasing for the rest seven months. In particular, the decreasing/increasing rates for January, February, and December were much faster than those for the rest nine months. Retail sales are known to be large in the fourth quarter because of Christmas spending and small in the first quarter. This is confirmed from Fig. 4(b) where the monthly curve for December was significantly higher than the rest months and those for January and February were much lower than the rest. We fit the series using the following nonparametric seasonal model yij ¼ aðti Þ þ bj ðti Þ þ eij ; with the constraints
P12
j¼1 bj ðti Þ
i ¼ 1; . . . ; 34;
and
j ¼ 1; . . . ; 12
¼ 0 for all ti. The overall mean squared error
34 X 12 1 X fy a^ ðti Þ b^ j ðti Þg2 408 i¼1 j¼1 ij
is 4.64 and the correlation coefficient between the estimated data and the actual data is 0.9996. Figure 4(c) depicts the estimated trend function with the yearly averages (dotted line) and it shows that the trend is increasing nonlinearly (possibly exponentially). The estimated seasonal effect functions are plotted in Fig. 4(d) for all months. From these plots we observe a stable pattern – the seasonal effect functions for the first four months and September stay below the average and the rest of them are above the average. Also, the monthly curves drift apart and some of them cross each other. Particularly, the monthly curves for January, February, and December dominate those for the rest nine months. To get a better picture of the months from March to November, the estimated seasonal effect functions for these nine months are depicted in Fig. 4(e). It appears that some patterns change starting from 1999.
NOTE 1. This stationary assumption might not be suitable for many economic applications, and the models might allow the errors to be non-stationary such as an integrated process. This topic is still open and beyond the scope of this article, although, a further investigation is warranted.
Flexible Seasonal Time Series Models
81
ACKNOWLEDGMENTS The authors are grateful to the co-editor T. Fomby and the referee for their very detailed and constructive commends and suggestions, and Professors D. Findley, T. Fomby, R.H. Shumway, and R.S. Tsay as well as the audiences at the NBER/NSF Time Series Conference at Dallas 2004 for their helpful and insightful comments and suggestions. Cai’s research was supported in part by the National Science Foundation grants DMS-0072400 and DMS-0404954 and funds provided by the University of North Carolina at Charlotte. Chen’s research was supported in part by the National Science Foundation grants DMS-0073601, 0244541 and NIH grant R01 Gm068958.
REFERENCES Auestad, B., & Tjøstheim, D. (1990). Identification of nonlinear time series: First order characterization and order determination. Biometrika, 77, 669–687. Box, G. E. P., Jenkins, G. M., & Reinsel, G. G. (1994). Time series analysis, forecasting and control (3rd ed.). Englewood Cliffs, NJ: Prentice Hall. Brockwell, P. J., & Davis, R. A. (1991). Time series theory and methods. New York: Springer. Burman, P., & Shumway, R. H. (1998). Semiparametric modeling of seasonal time series. Journal of Time Series Analysis, 19, 127–145. Cai, Z., Fan, J., & Yao, Q. (2000). Functional-coefficient regression models for nonlinear time series. Journal of the American Statistical Association, 95, 941–956. Cai, Z., & Ould-Said, E. (2003). Local M-estimator for nonparametric time series. Statistics and Probability Letters, 65, 433–449. Cai, Z., & Tiwari, R. C. (2000). Application of a local linear autoregressive model to BOD time series. Environmetrics, 11, 341–350. Chang, S. H., & Fyffe, D. E. (1971). Estimation of forecast errors for seasonal style-Goods sales. Management Science, Series B 18, 89–96. Chen, R., & Fomby, T. (1999). Forecasting with stable seasonal pattern models with an application of Hawaiian tourist data. Journal of Business & Economic Statistics, 17, 497–504. Chen, R., & Tsay, R. S. (1993). Functional-coefficient autoregressive models. Journal of the American Statistical Association, 88, 298–308. Fan, J., & Gijbels, I. (1996). Local polynomial modelling and its applications. London: Chapman and Hall. Fan, J., Yao, Q., & Cai, Z. (2003). Adaptive varying-coefficient linear models. Journal of Royal Statistical Society, Series B 65, 57–80. Franses, P. H. (1996). Periodicity and stochastic trends in economic time series. New York: Cambridge University Press. Franses, P. H. (1998). Time series models for business and economic forecasting. New York: Cambridge University Press. Ghysels, E., & Osborn, D. R. (2001). The econometric analysis of seasonal time series. New York: Cambridge University Press.
82
ZONGWU CAI AND RONG CHEN
Gilbart, J. W. (1854). The law of the currency as exemplified in the circulation of country bank notes in England since the passing of the Act of 1844. Statistical Journal, 17. Gilbart, J. W. (1856). The laws of currency in Scotland. Statistical Journal, 19. Gilbart, J. W. (1865). Logic for the million, a familiar exposition of the art of reasoning. London: Bell and Daldy. Gorodetskii, V. V. (1977). On the strong mixing property for linear sequences. Theory of Probability and Its Applications, 22, 411–413. Griffin, P. A. (1977). The time series behavior of quarterly earnings: Preliminary evidence. Journal of Accounting Research, 15, 71–83. Hall, P., & Heyde, C. C. (1980). Martingale limit theory and its applications. New York: Academic Press. Hastie, T. J., & Tibshirani, R. (1990). Generalized additive models. London: Chapman and Hall. Hylleberg, S. (1992). The historical perspective. In: S. Hylleberg (Ed.), Modelling Seasonality. Oxford: Oxford University Press. Jones, M. C., Marron, J. S., & Sheather, S. J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91, 401–407. Lu, Z. (1998). On the ergodicity of non-linear autoregressive model with an autoregressive conditional heteroscedastic term. Statistica Sinica, 8, 1205–1217. Masry, E., & Tjøstheim, D. (1995). Nonparametric estimation and identification of nonlinear ARCH time series: Strong convergence and asymptotic normality. Econometric Theory, 11, 258–289. Masry, E., & Tjøstheim, D. (1997). Additive nonlinear ARX time series and projection estimates. Econometric Theory, 13, 214–252. Marshall, K. T., & Oliver, R. M. (1970). A constant work model for student enrollments and attendance. Operations Research Journal, 18, 193–206. Marshall, K. T., & Oliver, R. M. (1979). Estimating errors in student enrollment forecasting. Research in Higher Education, 11, 195–205. Newey, W. K., & West, K. D. (1987). A simple, positive-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55, 703–708. Newey, W. K., & West, K. D. (1994). Automatic lag selection in covariance matrix estimation. Review of Economic Studies, 61, 631–653. Pena, D., Tiao, G. C., & Tsay, R. S. (2001). A course in time series analysis. New York: Wiley. Roussas, G. G. (1989). Consistent regression estimation with fixed design points under dependence conditions. Statistics and Probability Letters, 8, 41–50. Roussas, G. G., Tran, L. T., & Ioannides, D. A. (1992). Fixed design regression for time series: Asymptotic normality. Journal of Multivariate Analysis, 40, 262–291. Ruppert, D. (1997). Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. Journal of the American Statistical Association, 92, 1049–1062. Shumway, R. H. (1988). Applied statistical time series analysis. Englewood Cliffs, NJ: PrenticeHall. Shumway, R. H., & Stoffer, D. S. (2000). Time series analysis & its applications. New York: Springer-Verlag. Tran, L., Roussas, G., Yakowitz, S., & Van, B. T. (1996). Fixed-design regression for linear time series. The Annals of Statistics, 24, 975–991. Wahba, G. (1977). A survey of some smoothing problems and the method of generalized crossvalidation for solving them. In: P. R. Krisnaiah (Ed.), Applications of Statistics (pp. 507–523). Amsterdam, North Holland.
83
Flexible Seasonal Time Series Models
White, H. (1980). A heteroskedasticity-consistent covariance matrix and a direct test for heteroskedasticity. Econometrica, 48, 817838. Withers, C. S. (1981). Conditions for linear processes to be strong mixing. Zeitschrift fur Wahrscheinlichkeitstheorie verwandte Gebiete, 57, 477–480.
APPENDIX: PROOFS Note that the same notations in Section 2 are used here. Throughout this appendix, we denote by C a generic constant, which may take different values at different appearances. Lemma A1. Under the assumptions of Theorem 1, we have varðQn0 Þ ¼ u0 R0 þ oð1Þ
and
Qn1 ¼ op ð1Þ
where for k ¼ 0; 1; Qnk ¼ h1=2 n1=2
n X ðti tÞk eni K h ðti tÞ i¼1
Proof. By the stationarity of fnj g; varðQn0 Þ ¼ n1 h
X
1k;ln
¼ n1 h Rð0Þ I1 þ I2
Rðk lÞK h ðtk tÞK h ðtl tÞ
n X k¼1
K 2h ðtk tÞ þ 2n1 h
X
1lokn
Rðk lÞK h ðtk tÞK h ðtl tÞ
Clearly, by the Riemann sum approximation of an integral, Z 1 I1 Rð0Þh K 2h ðu tÞdu u0 Rð0Þ 0
Since nh-N, there exists cn-N such that cn/(nh)-0. Let S1 ¼ {(k, l): 1rklrcn; 1rlokrn} 1. Then, I2 is P and S2 ¼ {(k, 1) : 1rlokrn}S P ð Þ; denoted by split into two terms as S1 ð Þ; denoted by I21, and SP 2 I22. Since K( ) is bounded, then, K h ðÞ C=h and n1 nk¼1 K h ðtk tÞ C: In conjunction with the Davydov’s inequality (e.g., Corollary A.2 in
84
ZONGWU CAI AND RONG CHEN
Hall and Heyde, 1980), we have, for the (j, m)-th element of I22, X I22ðjmÞ Cn1 h jrjm ðk lÞjK h ðtk tÞðK h ðtl tÞ S2
1
Cn h Cn1 C
X S2
n X k¼1
X
a
ad=ð2þdÞ ðk lÞK h ðtk tÞK h ðtl tÞ
K h ðtk tÞ
d=ð2þdÞ
k1 4cn
X
k1 4cn
ad=ð2þdÞ ðk1 Þ
ðk1 Þ Ccd n !0
by Assumption A2 and the fact that cn-N. For any ðk; lÞ 2 S1 ; by Assumption A1 jK h ðtk tÞ K h ðtl tÞj Ch1 ðtk tl Þ=h Ccn =ðnh2 Þ n1 X X I21ðjmÞ 2n1 h rjm ðk lÞfK h ðtk tÞ K h ðtl tÞgK h ðtl tÞ l¼1 1klc n
Ccn n2 h1
Ccn n2 h1
n1 X
X
l¼1 1klcn
n1 X l¼1
jrjm ðk lÞjK h ðtl tÞ
K h ðtl tÞ
X k1
jrjm ðkÞj Ccn =ðn hÞ ! 0
by Remark 7 and the fact that cn/(nh)-0. Therefore, I21ðjmÞ ¼ 2n1 h ¼ 2n1 h
n1 X
X
l¼1 1klcn
n1 X l¼1
rjm ðk lÞK h ðtk tÞK h ðtl tÞ
K 2h ðtl tÞ
X
1klcn
rjm ðk lÞ þ I212ðjmÞ ! 2u0
Thus, var ðQn0 Þ ! u0 Rð0Þ þ 2
1 X k¼1
!
RðkÞ
¼ u0 R0
1 X k¼1
rjm ðkÞ
85
Flexible Seasonal Time Series Models
On the other hand, by Assumption A1, we have var ðQn1 Þ ¼ n1 h
X
1k;ln
Cn1 h
Rðk lÞðtk tÞðtl tÞK h ðtk tÞK h ðtl tÞ
X
1k;ln
jRðk lÞj C h
1 X
k¼1
jRðkÞj ! 0
This proves the lemma. Proof of Theorem 1. By the Riemann sum approximation of an integral, S n;k ðtÞ ¼ hk mk þ oð1Þ
(A.1)
By the Taylor expansion, for any kZ0 and ti in a neighborhood of t n1
n X ðti tÞk hðti ÞK h ðti tÞ i¼1
1 ¼ Sn;k ðtÞhðtÞ þ S n;kþ1 ðtÞhð1Þ ðtÞ þ Sn;kþ2 ðtÞhð2Þ ðtÞ þ oðh2 Þ 2 so that by (7), S2 ðtÞ Sn;1 ðtÞS n;3 ðtÞ ð2Þ ^ ¼ hðtÞ þ 1 n;2 h ðtÞ hðtÞ 2 Sn;0 ðtÞS n;2 ðtÞ S 2n;1 ðtÞ þ oðh2 Þ þ A1
n X i¼1
S i ðtÞeni
Then, by (A.1), 2
^ hðtÞ h m hð2Þ ðtÞ þ oðh2 Þ ¼ A1 hðtÞ 2 2
n X i¼1
Si ðtÞeni
which implies that 2 pffiffiffiffiffiffi ^ hðtÞ h m @2 hðtÞ=@t2 þ oðh2 Þ n h hðtÞ 2 2 S n;2 ðtÞ Qn0 sn;1 ðtÞQn1 ¼ A1 S n;0 ðtÞS n;2 ðtÞ S2n;1 ðtÞ
ðA:2Þ
where both Qn0 and Qn1 are defined in lemma A1. An application of lemma A1 proves the theorem.
86
ZONGWU CAI AND RONG CHEN
Proof of Theorem 2. It follows from (A.2) that the term 12h2 m2 @2 hðtÞ=@t2 on the right hand side of (A.2) serves as the asymptotic bias. Also, by (A.1) and lemma A1, we have 2 pffiffiffiffiffiffi ^ hðtÞ h m @2 hðtÞ=@t2 þ oðh2 Þ n h hðtÞ 2 2 ¼ A1 f1 þ oð1ÞgfQn0 þ op ð1Þg
^ which implies that to establish the asymptotic normality of hðtÞ; one only needs to consider the asymptotic normality for Qn0. To this end, the Crame´r-Wold device is used. For anyPunit vector d 2
(A.3)
Now, the Doob’s small-block and large-block technique is used. Namely, partition {1,y, n} into 2qn+1 subsets with large-block of size j k rn ¼ ðnhÞ1=2
and small-block of size
j k sn ¼ ðnhÞ1=2 = log n ,
where
qn ¼
n . rn þ sn
Then, qna(sn)rC n(t1)/2 h(t+1)/2 logt n, where t ¼ ð2 þ dÞð1 þ dÞ=d; and qn a(sn)-0 by Assumption A3. Let r j ¼ jðrn þ sn Þ and define the random variables, for 0rjrqn1, r j þrn
Zj ¼
X
i¼r j þ1
Zn;i ;
xj ¼
rjþ1 X
i¼r j þrn þ1
Z n;i ;
and
Qn;3 ¼
n X
Z n;i
i¼r qn þ1
Pqn1 0 Then, Pqn1 d Qn0 ¼ Qn,1+Qn,2+Qn,3, where Qn1 ¼ j¼0 Zj and Qn;2 ¼ j¼0 xj : Next we prove the following four facts: (i) as n-N, 2 E Qn;2 ! 0;
2 E Qn;3 ! 0
(A.4)
87
Flexible Seasonal Time Series Models
(ii) as n-N and y2d ðtÞ defined as in (A.3), we have qX n 1 j¼0
EðZ2j Þ ! y2d
(iii) for any t and n-N, qY n 1 E½expðitZj Þ ! 0 E½expðitQn;1 Þ j¼0
(A.5)
(A.6)
and (iv) for every 40;
qX n 1 j¼0
h n oi E Z2j I jZj j yd !0
(A.7)
(A.4) implies that Qn,2 and Qn,3 are asymptotically negligible in probability. (A.6) shows that the summands {Zj} in Qn,1 are asymptotically independent, and (A.5) and (A.7) are the standard Lindeberg-Feller conditions for asymptotic normality of Qn,1 for the independent setup. The rest proof is to establish (A.4)–(A.7) and it can be done by following the almost same lines as those used in the proof of Theorem 2 in Cai, Fan, and Yao (2000) with some modifications. This completes the proof of Theorem 2.
This page intentionally left blank
88
ESTIMATION OF LONG-MEMORY TIME SERIES MODELS: A SURVEY OF DIFFERENT LIKELIHOOD-BASED METHODS Ngai Hang Chan and Wilfredo Palma ABSTRACT Since the seminal works by Granger and Joyeux (1980) and Hosking (1981), estimations of long-memory time series models have been receiving considerable attention and a number of parameter estimation procedures have been proposed. This paper gives an overview of this plethora of methodologies with special focus on likelihood-based techniques. Broadly speaking, likelihood-based techniques can be classified into the following categories: the exact maximum likelihood (ML) estimation (Sowell, 1992; Dahlhaus, 1989), ML estimates based on autoregressive approximations (Granger & Joyeux, 1980; Li & McLeod, 1986), Whittle estimates (Fox & Taqqu, 1986; Giraitis & Surgailis, 1990), Whittle estimates with autoregressive truncation (Beran, 1994a), approximate estimates based on the Durbin–Levinson algorithm (Haslett & Raftery, 1989), state-space-based maximum likelihood estimates for ARFIMA models (Chan & Palma, 1998), and estimation of stochastic volatility models (Ghysels, Harvey, & Renault, 1996; Breidt, Crato, & de Lima, 1998; Chan & Petris, 2000) among others. Given the diversified Econometric Analysis of Financial and Economic Time Series/Part B Advances in Econometrics, Volume 20, 89–121 Copyright r 2006 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(05)20023-3
89
90
NGAI HANG CHAN AND WILFREDO PALMA
applications of these techniques in different areas, this review aims at providing a succinct survey of these methodologies as well as an overview of important related problems such as the ML estimation with missing data (Palma & Chan, 1997), influence of subsets of observations on estimates and the estimation of seasonal long-memory models (Palma & Chan, 2005). Performances and asymptotic properties of these techniques are compared and examined. Inter-connections and finite sample performances among these procedures are studied. Finally, applications to financial time series of these methodologies are discussed.
1. INTRODUCTION Long-range dependence has become a key aspect of time series modeling in a wide variety of disciplines including econometrics, hydrology and physics, among many others. Stationary long-memory processes are defined by autocorrelations decaying slowly to zero or spectral density displaying a pole at zero frequency. A well-known class of long-memory models is the autoregressive fractionally integrated moving average (ARFIMA) processes introduced by Granger and Joyeux (1980) and Hosking (1981). An ARFIMA process {yt} is defined by BÞd yt ¼ YðBÞt
FðBÞð1
(1)
where FðBÞ ¼ 1 þ f1 B þ þ fp Bp and YðBÞ ¼ 1 þ y1 B þ þ yq Bq are the autoregressive and moving average (ARMA) operators, respectively; (1 B)d is the fractional operator defined by the binomial exP differencing d j pansion ð1 BÞd ¼ 1 j¼0 ðj ÞB and {et} a white noise sequence with variance s2 : Under the assumption that the roots of the polynomials F(B) and Y(B) are outside the unit circle and |d|o1/2, the ARFIMA(p,d,q) process is second-order stationary and invertible. The spectral density of this process is f ðoÞ ¼
s2 j1 2p
eio j
2d
j1
Fðeio Þj 2 j1
Yðeio Þj2
and its ACF may be written as gðkÞ ¼
Z
p p
f ðoÞeiok do
(2)
Estimation of long-memory models has been considered by a large number of authors. Broadly speaking, most of the estimation methodologies
Estimation of Long-Memory Time Series Models
91
proposed in the literature can be classified into the time-domain and the spectral-domain procedures. In the first group, we have the exact maximum likelihood estimators (MLE) and the quasi maximum likelihood estimators (QMLE), see for example the works by Granger and Joyeux (1980), Sowell (1992) and Beran (1994a), among others. In the second group, we have for instance, the Whittle and the semi-parametric estimators (see Fox & Taqqu, 1986; Giraitis & Surgailis, 1990; Robinson, 1995) among others. In this article, we attempt to give an overview of this plethora of estimation methodologies, examining their advantages and disadvantages, computational aspects such as their arithmetic complexity, finite sample behavior and asymptotic properties. The remaining of this paper is structured as follows. Exact ML methods are reviewed in Section 2, including the Cholesky decomposition, the Levinson–Durbin algorithm and state-space methodologies. Section 3 discusses approximate ML methods based on truncations of the infinite autoregressive (AR) expansion of the long-memory process, including the Haslett and Raftery estimator and the Beran method. Truncations of the infinite moving average (MA) expansion of the process are discussed in Section 4 along with the corresponding Kalman filter recursions. The spectrum-based Whittle method and semi-parametric procedures are studied in Section 5. Extensions to the non-Gaussian case are also addressed in this section. The problem of parameter estimation of time series with missing values is addressed in Section 6 along with an analysis of the effects of data gaps on the estimates. Section 7 discusses methodologies for estimating time series displaying both persistence and cyclical behavior. Long-memory models for financial time series are addressed in Section 8, while final remarks are presented in Section 9.
2. EXACT MAXIMUM LIKELIHOOD METHOD Under the assumption that the process {yt} is Gaussian and has zero mean, the log-likelihood function may be written as 1 1 0 1 log det Gy Y Gy Y (3) 2 2 where Y ¼ ðy1 ; . . . ; yn Þ0 ; Gy ¼ varðY Þ and y is the model parameter vector. Hence, the ML estimate is given by y^ ¼ argmaxy LðhÞ: Expression (3) involves the calculation of the determinant and the inverse of Gy. As described next, the well-known Cholesky decomposition algorithm LðhÞ ¼
92
NGAI HANG CHAN AND WILFREDO PALMA
can be used to carry out these calculations. Further details can be found in Sowell (1992). Here we give an overview of this and other related methodologies for computing (3) such as the Levinson–Durbin algorithm and the Kalman filter recursions. 2.1. Cholesky Decomposition Since the variance–covariance matrix Gy is positive definite, it can be decomposed as Gy ¼ L1 L01 where L1 is a lower triangular matrix. This Q Cholesky decomposition provides the determinant det Gy ¼ ðdet L1 Þ2 ¼ ni¼1 l 2ii ; where lii denotes the ithdiagonal element of L1. Furthermore, the inverse of Gy can be obtained as Gy 1 ¼ ðL1 1 Þ0 L1 1 ; where the inverse of L1 can be computed by means of a very simple procedure, see for example Press, Teukolsky, Vetterling, and Flannery (1992, 89ff). Observe that while the inversion of a nonsingular square matrix n n has arithmetic complexity of order Oðn3 Þ; the Cholesky algorithm is of order Oðn3 =6Þ; cf. Press et al. (1992, p. 34).
2.2. Levinson–Durbin Algorithm Since for large sample sizes, the Cholesky algorithm could be inefficient, faster methods for calculating the log-likelihood function have been developed. These numerical procedures, designed to exploit the Toeplitz structure of the variance–covariance matrix of an second-order stationary process, are based on the seminal works by Levinson (1947) and Durbin (1960). Let y^ 1 ¼ 0 and y^ tþ1 ¼ ft1 yt þ þ ftt y1 for t ¼ 1; . . . ; n 1 be the onestep predictors of the process {yt} based on the finite past fy1 ; . . . yt 1 g; where the partial regression coefficients ftj are given by the recursive equations tP1 ftt ¼ gðtÞ ft 1;i gðt iÞ =nt 1 i¼1
ftj ¼ ft
1;j
n0 ¼ gð0Þ nt ¼ nt 1 ½1
ftt ft
f2tt ;
1;t j ;
j ¼ 1; . . . ; n
t ¼ 1; . . . ; n
1
1
93
Estimation of Long-Memory Time Series Models
Now, if et ¼ yt y^ t is the prediction error and e ¼ ðe1 ; . . . L2 Y ; where L2 is the lower triangular matrix 0 1 B f 1 B 11 B B f22 f21 1 B L2 ¼ B f33 f32 f31 1 B B B .. .. B . . @ fn
1;n 1
fn
1;n 2
fn
; en Þ0 ; then e ¼
1;1 1
1 C C C C C C C C C C A
0 Thus, Gy may be decomposed as Qn Gy ¼ L2 DL20; 1where 0 D1¼ diagðn0 ; . . . ; nn 1 Þ: Consequently, det Gy ¼ j¼1 nj 1 and Y Gy Y ¼ e D e: Hence, the log-likelihood function may be written as n n 1X 1X LðhÞ ¼ lognt 1 e2 =nt 1 2 t¼1 2 t¼1 t
The arithmetic complexity of this algorithm is Oð2n2 Þ for a linear stationary process, see Ammar (1996). However, for some Markovian processes such as the family of ARMA models, the Levinson–Durbin algorithm can be implemented in only OðnÞ operations, see for example Section 5.3 of Brockwell and Davis (1991). Unfortunately, ARFIMA models are not Markovian and this reduction in operations count does not apply to them. 2.3. Calculation of Autocovariances A critical aspect in the implementation of the Cholesky and the Levinson– Durbin algorithms is the calculation of the ACF process. A closed form expression for the ACF of an ARFIMA model is given in Sowell (1992). Here, we briefly review this and other alternative methods of obtaining the ACF of a long-memory process. Observe that the polynomial F(B) defined in (1) can be written as p Y FðBÞ ¼ ð1 ri BÞ i¼1
where {ri} are the roots of the polynomial F(z 1). Assuming that all these roots have multiplicity one, it can be deduced from (2) that p q X X gðkÞ ¼ s2 cðiÞxj Cðd; p þ i k; rj Þ i¼ q j¼1
94
NGAI HANG CHAN AND WILFREDO PALMA
h Q xj ¼ rj pi¼1 ð1 Gð1 2dÞGðd þ hÞ cðd; h; rÞ ¼ Gð1 d þ hÞGð1 dÞGðdÞ
with cðiÞ ¼
Pminðq;q
1Þ j¼maxð0;lÞ yj yj l ;
½r2p F ðd þ h; 1; 1
ri rj Þ
d þ h; rÞ þ F ðd
Q
maj ðrj
h; 1; 1
rm Þ d
i
1
h; rÞ
and
1
where GðÞ is the Gamma function and F(a, b, c, x) is the Gaussian hypergeometric function, see (Gradshteyn & Ryzhik, 2000, Section 9.1). An alternative method for calculating the ACF is the so-called splitting method, see Bertelli and Caporin (2002). This technique is based on the decomposition of the model into its ARMA and its fractional integrated (FI) parts. Let g1 ðÞ be the ACF of the ARMA component and g2 ðÞ be the ACF of the fractional noise. Then, the ACF of the corresponding ARFIMA process is given by the convolution of these two functions: gðkÞ ¼
1 X
g1 ðhÞg2 ðk
m X
g1 ðhÞg2 ðk
h¼ 1
hÞ
If this infinite sum is truncated to m summands, we obtain the approximation gðkÞ
h¼ m
hÞ
Thus, the ACF can be efficiently calculated with a great level of precision, see for instance the numerical experiments reported by Bertelli and Caporin (2002) for further details.
2.4. Exact State-Space Method Another method for computing exact ML estimates is provided by the statespace system theory. In this Section, we review the application of Kalman filter techniques to long-memory processes. Note that since these processes are not Markovian, all the state space representations are infinite dimensional as shown by Chan and Palma (1998). Despite this fact, the Kalman filter equations can be used to calculate the exact log-likelihood (3) in a finite number of steps. Recall that a causal ARFIMA(p, d, q) process {yt} has a linear process representation given by yt ¼
YðBÞ ð1 FðBÞ
BÞ d t ¼
1 X j¼0
jj t
j
(4)
95
Estimation of Long-Memory Time Series Models
P 1 j where jj are the coefficients of jðzÞ ¼ 1 zÞ d : An j¼0 jj z ¼ YðzÞFðzÞ ð1 infinite-dimensional state-space representation may be constructed as follows. From Eq. (4), a state space system may be written as
where X t ¼ ½yðtjt 2
0 60 F ¼4 .. .
1
0
0
0 .. .
1 0 .. .
X tþ1 ¼ FX t þ Ht
(5)
yt ¼ GX t þ t
(6)
1Þ yðt þ 1jt
1Þ yðt þ 2jt
1Þ . . . 0
(7)
yðtjjÞ ¼ E½yt jyj ; yj 1 ; . . . (8) 3 7 H ¼ ½j1 ; j2 ; . . . 0 ; and G ¼ ½1; 0; 0; . . . 5;
(9)
The log-likelihood function can be evaluated by directly applying the Kalman recursive equations in Proposition 12.2.2 of Brockwell and Davis (1991) to the infinite-dimensional system. The Kalman algorithm is as follows: Let Ot ¼ ðoðtÞ ij Þ be the state estimation error covariance matrix at time t. The Kalman equations for the infinite-dimensional system are given by X^ 1 ¼ E½X 1 O1 ¼ E½X 1 X 01
(10)
E½X^ 1 X^ 0 1
Dt ¼ oðtÞ 11 þ 1 oijðtþ1Þ ¼ oðtÞ iþ1;jþ1 þ ji jj and
(11) (12)
ðtÞ ðoðtÞ iþ1;1 þ ji Þðojþ1;1 þ jj Þ
oðtÞ 11 þ 1
ðtþ1Þ ðtþ1Þ ðtþ1Þ X^ tþ1 ¼ ðX^ 1 ; X^ 2 ; . . . Þ0 ¼ ðX^ i Þ0i¼1;2;...
(13)
(14)
where ðtþ1Þ ðtÞ X^ i ¼ X^ iþ1 þ
ðyt
ðtÞ X^ 1 ÞðoðtÞ iþ1;1 þ ji Þ
oðtÞ 11 þ 1 ðtþ1Þ
y^ tþ1 ¼ G X^ tþ1 ¼ X^ 1
(15)
(16)
96
NGAI HANG CHAN AND WILFREDO PALMA
and the log-likelihood function is given by ( ) n n X 1 1X ðyt y^ t Þ2 2 n log 2p þ LðhÞ ¼ log Dt þ n log s þ 2 2 s t¼1 Dt t¼1
Although the state space representation of an ARFIMA model is infinitedimensional, the exact likelihood function can be evaluated in a finite number of steps. Specifically, we have the following theorem due to Chan and Palma (1998). Theorem 1. Let fy1 ; . . . ; yn g be a finite sample of an ARFIMA(p,d,q) process. If O1 is the variance of the initial state X1 of the infinite-dimensional representation (5)–(9), then the computation of the exact likelihood function (3) depends only on the first n components of the Kalman Eqs. (10)–(16).
It is worth noting that as a consequence of Theorem 1, given a sample of n observations from an ARFIMA process, the calculation of the exact likelihood function is based only on the first n components of the state vector. Therefore, the remaining infinitely many components of the state vector can be omitted from the computations. The arithmetic complexity of this algorithm is Oðn3 Þ: Therefore, it is comparable to the Cholesky decomposition but it is less efficient than the Levinson–Durbin procedure. The Kalman approach is advisable for moderate sample sizes or for handling time series with missing values. The state space framework provides a simple solution to this problem. Note that the Levinson–Durbin method is no longer appropriate when the series displays missing values since the variance–covariance matrix of the incomplete data does not have a Toeplitz structure.
2.5. Asymptotic Results for the Exact MLE The consistency, asymptotic normality and efficiency of the MLE have been established by Yajima (1985) for the fractional noise process and by Dahlhaus (1989) for the general case including the ARFIMA model. Let h^ n be the value that maximizes the exact log-likelihood where h ¼ ðf1 ; . . . ; fp ; f1 ; . . . ; yq ; d; s Þ0 is a p þ q þ 2 dimensional parameter vector and let h0 be the true parameter. Assume that the regularity conditions listed in Dahlhaus (1989) hold.
97
Estimation of Long-Memory Time Series Models
Theorem 2. (Consistency) p h^ nffiffiffi ! h0 in probability as n-N. (Central Limit Theorem) nðh^ n h0 Þ ! Nð0; G 1 ðh0 ÞÞ; as n-N, where GðhÞ ¼ ðGij ðhÞÞ with R n@ log kðo;hÞon@ log kðo;hÞo 1 p dw Gij ðhÞ ¼ 4p p @yi @yj (17) P1 and kðo; hÞ ¼ j j¼0 cj ðhÞeijo j2 (Efficency) h^ n is an efficient estimator of h0.
The next two sections discuss ML methods based on AR and MA approximations.
3. AUTOREGRESSIVE APPROXIMATIONS Since the computation of exact ML estimates is computationally highly demanding, several authors have considered the use of AR approximations to speed up the calculation of parameter estimates. In particular, this approach has been adopted by Granger and Joyeux (1980), Li and McLeod (1986), Haslett and Raftery (1989), Beran (1994b), Shumway and Stoffer (2000) and Bhansali and Kokoszka (2003), among others. Most of these techniques are based on the following estimation strategy. Consider a long-memory process {yt} defined by the AR(N) expansion yt
p1 ðhÞyt
1
p2 ðhÞyt
¼ t
2
where pj(h) are the coefficients of FðBÞY ðBÞð1 BÞd : In practice, only a finite number of observations is available, fy1 ; . . . ; yn g; therefore the following truncated model is considered yt
p1 ðhÞyt
1
1
pm ðhÞyy
m
¼ ~t
for motrn. The ML estimate h^ n is found by minimizing L0 ðhÞ ¼
n X
t¼mþ1
½yt
p1 ðhÞyt
1
pm ðhÞyt
m
2
Upon this basic framework, many refinements can be made to improve the quality of these estimates. In what follows next, we describe some of these refinements. For simplicity, any estimator produced by the maximization of an approximation of the Gaussian likelihood function (3) will be called QMLE.
98
NGAI HANG CHAN AND WILFREDO PALMA
3.1. Haslett–Raftery Method The following technique was proposed by Haslett and Raftery (1989). Consider the ARFIMA process (1). An approximate one-step predictor of yt is given by y^ t ¼ FðBÞYðBÞ
1
t 1 X
ftj yt
(18)
j
j¼1
with prediction error variance vt ¼ varðyt
y^ t Þ ¼ s2y k
t 1 Y ð1 j¼1
f2jj Þ
where s2y ¼ varðyt Þ; k is the ratio of the innovations variance to the variance of the ARMA(p,q) process as given by Eq. (3.4.4) of Box, Jenkins, and Reinsel (1994) and ! t Gðj dÞGðt d j þ 1Þ ftj ¼ ; for j ¼ 1; . . . ; t j Gð dÞGðt d þ 1Þ To avoid the computation of a large number of coefficients ftj, the last term of the predictor (18) is approximated by t 1 X
ftj yt
j
M X
ftj yt
j¼1
j¼1
j
t 1 X
p j yt
j
(19)
j¼Mþ1
since ftj pj for large j, cf. Hosking (1981), where for simplicity pj denotes pj(h) and ajbj means that aj/bj-1, as j-N. An additional approximation is made to the second term on the righthand-side of (19): " d # t 1 X M 1 y¯ Mþ1;t 1 M pj yt j MpM d 1 t j¼Mþ1 where y¯ Mþ1;t maximizing
1 M
¼t
1 1 2M
1 M j¼Mþ1 yj :
Pt
L1 ðhÞ ¼ constant
Hence, a QMLE h^ n is obtained by 1 n log½s^ 2 ðhÞ 2
99
Estimation of Long-Memory Time Series Models
with s^ 2 ðhÞ ¼
n 1X ðyt y^ t Þ2 n t¼1 vt
The arithmetic complexity of the Haslett and Raftery method is of order OðnMÞ: For a fixed M, the algorithm is of order OðnÞ; which is much faster compared to the Levinson–Durbin method. Haslett and Raftery (1989) suggest M ¼ 100: Note that, when M ¼ n; the exact ML estimated is obtained. However, the arithmetic complexity in that case becomes Oðn2 Þ and no gain is obtained as compared to the Levinson–Durbin approach. 3.2. Beran Method Beran (1994a) proposed the following version of the AR approximation approach. Assume that the following Gaussian innovation sequence t ¼ yt
1 X j¼1
pj ðhÞyt
j
Since the values {yt, t r 0} are not observed, an approximate innovation sequence {ut} can be obtained by assuming that yt ¼ 0 for t r 0, ut ðhÞ ¼ yt
t 1 X j¼1
pj ðhÞyt
j
for t ¼ 2; . . . ; n: Let rt ðhÞ ¼ ut ðhÞ=y1 and h ¼ ðs ; f1 ; . . . ; fp ; y1 ; . . . ; yq ; dÞ: Then, a QMLE for h is provided by the minimization of n X r2t ðhÞ L2 ðhÞ ¼ 2n log ðy1 Þ þ t¼2
Now, by taking partial derivatives with respect to h, the minimization problem is equivalent to solving the non-linear equations n X frt ðhÞ_rt ðhÞ E½rt ðhÞ_rt ðhÞg ¼ 0 (20) t¼2 0 @rt ðhÞ where r_t ðhÞ ¼ @r@yt ðhÞ : ; . . . ; @yr 1
The arithmetic complexity of this method is Oðn2 Þ; that is, comparable to the Levinson–Durbin algorithm. Unlike the Haslett and Raftery method, the Beran approach uses the same variance for all the errors ut. Hence, its performance may be poor for short time series.
100
NGAI HANG CHAN AND WILFREDO PALMA
3.2.1. Asymptotic Behavior The QMLE based on the AR approximations share the same asymptotic properties with the exact MLE. The following results are due to Beran (1994a). Let h~ n be the value that solves (20). Then, Theorem 3. (Consistency) p h~ nffiffiffi! h0 ; in probability, as n-N. (Central Limit Theorem) nðh~ n h0 Þ ! N ð0; G 1 ðh0 ÞÞ; as n-N, with G(h0) is given in (17). (Efficiency) h~ n is an efficient estimator of h0.
4. MOVING AVERAGE APPROXIMATIONS A natural alternative to AR approximations is the truncation of the infinite MA expansion of a long-memory process. The main advantage of this approach is the easy implementation of the Kalman filter recursions and the simplicity of the analysis of the theoretical properties of the ML estimates. Furthermore, if differencing is applied to the series, then the resulting truncation has less error variance than the AR approximation. A causal representation of an ARFIMA(p, d, q) process {yt} is given by yt ¼
1 X
c j t
(21)
j
j¼0
On the other hand, we may consider an approximate model for (21) given by yt ¼
m X
c j t
(22)
j
j¼0
which corresponds to a MA(m) process in contrast to the MA(N) process (21). A canonical state space representation of the MA(m) model (22) is given by X tþ1 ¼ FX t þ Ht yt ¼ GX t þ t
with F¼
0
Im
0...
0
X t ¼ ½yðtjt
yðt þ jjt
1
;
G ¼ ½1 0 0 . . . 0;
1Þ; yðt þ 1jt
H ¼ ½c1 . . . cm 0
1Þ; . . . ; yðt þ m
1Þ ¼ E½ytþj jyt 1 ; yt 2 ; . . .
1jt
1Þ0
101
Estimation of Long-Memory Time Series Models
The approximate representation of a causal ARFIMA(p, d, q) has computational advantages over the exact one. In particular, the order of the MLE algorithm is reduced from Oðn3 Þ to OðnÞ: A brief discussion about the Kalman filter implementation of this state space system follows.
4.1. Kalman Recursions Let the initial conditions be X^ 1 ¼ E½X 1 and O1 ¼ E½X 1 X 01 E½X^ 1 X^ 0 1 : The recursive Kalman equations may be written as follows (cf. Chan, 2002, Section 11.3): Dt ¼ GOt G 0 þ s2
(23)
Yt ¼ F Ot G 0 þ S
(24)
Otþ1 ¼ F Ot F 0 þ Q
Yt Dt 1 Y0t
X^ tþ1 ¼ F X^ t þ Yt Dt 1 ðyt y^ t ¼ GX^ t
G X^ t Þ
(25)
(26)
for t ¼ 1; 2; . . . ; n; where Q ¼ var(Het), se ¼ var(et) and S ¼ cov(Het, et). The log-likelihood function, excepting a constant, is given by ( ) n n X 1 X ðyt y^ t ðyÞÞ2 LðhÞ ¼ log Dt ðyÞ þ 2 t¼1 Dt ðyÞ t¼1
where h ¼ ðf1 ; . . . fp ; y1 ; . . . ; yq ; d; s2 Þ is the parameter vector associated with the ARFIMA representation (1). In order to evaluate the log-likelihood ^ function LðhÞ; we may choose the initial conditions P1 X 1 ¼ E½X 1 ¼ 0 and 0 O1 ¼ E½X 1 X 1 ¼ ½oði; jÞi; j¼1; 2; ...; where oði; jÞ ¼ k¼0 ciþk cjþk : The evolution of the state estimation and its variance, Ot, is given by the following recursive equations. Let di ¼ 1 if i 2 f0; 1; . . . ; m 1g and di ¼ 0 otherwise. Furthermore, let dij ¼ di dj : Then, the elements of Ot+1 and X^ tþ1 in (25) and (26) are as follows: otþ1 ði; jÞ ¼ ot ði þ 1; j þ 1Þdij þ ci cj ½ot ði þ 1; 1Þdi þ ci ½ot ðj þ 1; 1Þdj þ cj ot ð1; 1Þ þ 1
102
NGAI HANG CHAN AND WILFREDO PALMA
and ðot ði þ 1; 1Þdi þ ci Þðyt X^ tþ1 ðiÞ ¼ X^ t ði þ 1Þdi þ ot ð1; 1Þ þ 1
X^ t ð1ÞÞ
In addition, y^ t ¼ G X^ t ¼ X^ t ð1Þ In order to speed up the algorithm, we can difference the series {yt} so that the MA expansion of the differenced series converges faster. To this end, consider the MA expansion of the differenced process zt ¼ ð1
BÞyt ¼
1 X
j j t
j
(27)
j¼0
where jj ¼ cj cj 1 : If we truncate this expansion after m components, an approximate model can be written as zt ¼
m X
jj t
j
(28)
j¼0
The main advantage of this approach is that the coefficients jj converge to zero faster than the coefficients cj do. Thus, a smaller truncation parameter is necessary to achieve a good level of approximation. A state space representation of this truncated model is given by, see for example Section 12.1 of Brockwell and Davis (1991), 2 3 j1 6 . 7 0 Im 1 . 7 X tþ1 ¼ Xt þ 6 4 . 5 t 0 0 jm and
zt ¼ 1
0
0
...
0 X t þ t
The Gaussian log-likelihood of the truncated model (28) may be written as 1 1 0 log det T n; m ðhÞ Z T n; m ðhÞ 1 Zn 2n 2n n Rp where ½T n; m ðhÞr; s¼1; ... ;n ¼ p f~m;h ðoÞeioðr sÞ do is the covariance matrix of Z n ¼ ðz1 . . . zn Þ0 with f~m;h ðoÞ ¼ s2 jjm ðeio Þj2 and jm ðeio Þ ¼ 1 þ j1 eio þ þ j1m emio : Ln ðhÞ ¼
Estimation of Long-Memory Time Series Models
103
In this case, the matrices involved in the truncated Kalman equations are of order m m. Therefore, only m2 evaluations are required for each iteration and the algorithm has an order n m2. For a fixed truncation parameter m, the calculation of the likelihood function is only of order OðnÞ for the approximate ML method. Thus, for very large samples, it may be desirable to consider truncating the Kalman recursive equations after m components. With this truncation, the number of operations required for a single evaluation of the log-likelihood function is reduced to an order of OðnÞ: This approach is discussed in Chan and Palma (1998), where they show the following result. Theorem 4. (Consistency) Assume that m ¼ nb with b40, then as n-N, h~ n;m ! h0 ; in probability. (Central that m ¼ nb with bZ1/2, then as npffiffiffi ~ Limit Theorem) Suppose 1 N, nðhn; m h0 Þ ! Nð0; G ðh0 ÞÞ; where G(h) is given in (17). (Efficiency) Assume that m ¼ nb with b Z 1/2, then h~ n; m ; is an efficient estimator of y0. Observe that both AR(m) and MA(m) approximations produce algorithms with arithmetic complexity of order OðnÞ: However, the quality of the approximation is governed by the truncation parameter m. Bondon and Palma (2005) prove that the variance of the truncation error for an AR(m) approximation is of order Oð1=nÞ: On the other hand, it can be easily shown that for the MA(m) case, this quantity is of order Oðn2d 1 Þ: Furthermore, when the differenced approach is used, the truncation error variance is of order Oðn2d 3 Þ:
5. WHITTLE APPROXIMATIONS Another approach to obtain approximate ML estimates is based on the calculation of the periodogram by means of the Fast Fourier Transform (FFT) and the use of the Whittle approximation of the Gaussian loglikelihood function. This approach produces fast numerical algorithms for computing parameter estimates, since the calculation of the FFT has an arithmetic complexity Oðn log2 ðnÞÞ (cf. Press et al., 1992, p. 498).
104
NGAI HANG CHAN AND WILFREDO PALMA
5.1. Whittle Approximation of the Gaussian Likelihood Function Consider the Gaussian process Y ¼ ðy1 ; . . . ; yn Þ0 with zero mean and variance Gh. Then the log-likelihood function divided by the sample size is given by 1 1 0 1 log det Gy Y Gy Y LðhÞ ¼ (29) 2n 2n Observe that the variance–covariance matrix Gh can be expressed in terms of the spectral density of the process fh( ) as follows: ðGh Þij ¼ gh ði jÞ where gh ðkÞ ¼
1 2p
Z
p p
f h ðoÞ exp ðiokÞ do
In order to obtain the method proposed by Whittle (1951), two approximations are made. Following the result by Grenander and Szego¨ (1958) that Z 1 1 p log det Gh ! log f h ðoÞ do n 2p p as n-N, the first term in (29) is approximated by Z 1 1 p log det Gy log f h ðoÞ do 2n 4p p On the other hand, the second term in (29) is approximated by
Z p n X n X 1 0 1 1 Y Gy Y yl f h 1 ðoÞ exp ½ioðl jÞdo yj 2p 4pn p l¼1 j¼1 Z p n X n X 1 yl yj exp ½ioðl jÞdo f h 1 ðoÞ ¼ 4pn p l¼1 j¼1 Z p n X 1 ¼ yj exp ðiojÞj2 do f h 1 ðoÞj 4pn p j¼1 Z 1 p IðoÞ do ¼ 4p p f h ðoÞ P where IðoÞ ¼ 1n j nj¼1 ðyj y¯ Þ exp ðiojÞj2 is the periodogram of the series {yt}. Thus, the log-likelihood function is approximated by
Z p Z p 1 IðoÞ do (30) L3 ðhÞ ¼ log f h ðoÞdo þ 4p p p f h ðoÞ
Estimation of Long-Memory Time Series Models
105
5.2. Discrete Version The evaluation of the log-likelihood function (30) requires the calculation of integrals. To simplify this computation, the integrals can be substituted by Riemann sums as follows: Z p n 2p X log f h ðoÞdo log f h ðoj Þ n j¼1 p and Z
n Iðoj Þ IðoÞ 2p X dog n j¼1 f h ðoj Þ p f h ðoÞ
p
where oj ¼ 2pj n are the Fourier frequencies. Thus, a discrete version of the log-likelihood function (30) is ( ) n n X Iðoj Þ 1 X log f h ðoj Þ þ L4 ðhÞ ¼ f ðoj Þ 2n j¼1 j¼1 h 5.3. Alternative Versions Further simplifications of the Whittle log-likelihood function can be made. For example, by assuming that the spectral density is normalized as Z p log f h ðoÞdo ¼ 0 (31) p
then the Whittle log-likelihood function is reduced to Z 1 p IðoÞ do L5 ðhÞ ¼ 4p p f h ðoÞ with the discrete version L6 ðhÞ ¼
n Iðoj Þ 1 X 2n j¼1 f h ðoj Þ
Observe that by virtue of the well-known Szego¨–Kolmogorov formula Z p 1 2 log f h ðoÞdo s ¼ 2p exp 2p p the normalization (31) is equivalent to setting s2 ¼ 2p:
106
NGAI HANG CHAN AND WILFREDO PALMA
5.4. Asymptotic Results The asymptotic behavior of the Whittle estimator is similar to that of the exact MLE. The following theorem combines results by Fox and Taqqu (1986) and Dahlhaus (1989). ðiÞ Theorem 5. Let h^ n be the value that maximizes the log-likelihood function Li ðhÞ for i ¼ 3; . . . ; 6 for a Gaussian process {yp t}.ffiffiffi Then, under ðiÞ ðiÞ some regularity conditions, h^ n is consistent and nðh^ n h0 Þ ! Nð0; Gh01 Þ; as n - N.
5.5. Non-Gaussian Processes All of the above methods apply to Gaussian processes. When this assumption in dropped, it is still possible to find well-behaved Whittle estimates. In particular, Giraitis and Surgailis (1990) have studied the estimates based on the maximization of the log-likelihood function L5 ðhÞ for a general class of linear processes with independent innovations. Consider the process {yt} generated by the Wold decomposition yt ¼
1 X j¼0
cj ðhÞt
j
P 2 where et is an i.i.d. sequence with finite four cumulant and 1 j¼0 cj ðhÞ o1: The following result, due to Giraitis and Surgailis (1990), establishes the consistency and the asymptotic normality of the Whittle estimate under these circumstances. Theorem 6. Let h^ n be the value that maximizes the log-likelihood function Lffiffi5ffi ðhÞ: Then, under some regularity conditions, h^ n is consistent and p ^ n hn h0 ! Nð0; Gh01 Þ; as n-N. Note that this theorem does not assume the normality of the process. 5.6. Semi-Parametric Methods Another generalization of the Whittle estimator is the Gaussian semiparametric estimation method proposed by Robinson (1995). This estimation approach does not require the specification of a parametric model for the data. It only relies on the specification of the shape of the spectral density of the time series.
107
Estimation of Long-Memory Time Series Models
Consider the stationary process {yt} with spectral density satisfying f ðoÞ Go1
2H
as o-0+, with G 2 ð0; 1Þ and H 2 ð0; 1Þ: Note that for an ARFIMA model, the terms G and H correspond to s2 =2p½yð1Þ=fð1Þ2 and 1/2+d, respectively. H is usually called the Hurst parameter. Let the objective function Q(G, H) to be minimized be given by ( ) m oj2H 1 1X 1 2H QðG; HÞ ¼ Iðoj Þ log Goj þ m j¼1 G
^ H) ^ be the value that where m is an integer satisfying mon/2. Let (G; minimizes Q(G, H). Then, under some regularity conditions of the spectral density and 1 m þ !0 m n as n-N, the following result due to Robinson (1995) holds: Theorem 7. Let H0 be the true
parameter. The pffiffiffiffi value of the Hurst estimator H^ is consistent and mðH^ H^ 0 Þ ! N 0; 14 ; as n-N. 5.7. Numerical experiments
Table 1 displays the results from several simulations comparing five ML estimation methods for Gaussian processes: Exact MLE, Haslett and Raftery’s approach, AR(40) approximation, MA(40) approximation and the Whittle method. The process considered is a fractional noise ARFIMA(0, d, 0) with three values of the long-memory parameter: Table 1. Finite Sample Behavior of ML Estimates of ARFIMA(0, d, 0) Models. Sample Size n ¼ 250 and Truncation m ¼ 40 for AR(m) and MA(m) Approximations. d 0.40 0.25 0.10
mean stdev mean stdev mean stdev
Exact
HR
AR
MA
Whittle
0.371210 0.047959 0.229700 0.051899 0.082900 0.049260
0.372320 0.048421 0.230400 0.051959 0.083240 0.049440
0.376730 0.057392 0.229540 0.056487 0.083910 0.052010
0.371310 0.050048 0.229810 0.051475 0.084410 0.049020
0.391760 0.057801 0.221760 0.060388 0.065234 0.051646
108
NGAI HANG CHAN AND WILFREDO PALMA
d ¼ 0:1; 0:25; 0:4; Gaussian innovations with mean zero and variance 1 and sample size n ¼ 250: The mean and standard deviations of the estimates are based on 1,000 repetitions. All the simulations were carried out by means of Splus programs, available upon request. From this table, it seems that all estimates are somewhat downward biased for the three values of d considered. All the estimators, excepting the Whittle, seem to behave similarly in terms of both bias and standard error. On the other hand, the Whittle method has less bias for d ¼ 0:4; but greater bias for d ¼ 0:1: Besides, this procedure seems to have greater standard error than the other estimators, for the three values of d. The empirical parameter estimate standard deviations of all the method considered are close to its theoretical value 0.04931. In the next section we discuss the application of the ML estimation methodology to time series with missing values.
6. ESTIMATION OF INCOMPLETE SERIES The Kalman filter recursive Eqs. (23)–(26) can be modified to calculate the log-likelihood function for incomplete series, as described in Palma and Chan (1997). In this case, we have Dt ¼ GOt G0 þ s2w Yt ¼ F O t G 0 þ S
Otþ1 ¼ X^ tþ1 ¼
(
(
F Ot F 0 þ Q
Yt Dt 1 Y0t 0
yt missing
F Ot F þ Q
F X^ t þ Yt Dt 1 ðyt F X^ t
yt known
G X^ t Þ
yt known yt missing
Let Kn be the set indexing the observed values of the process {yt}. The Kalman recursive log-likelihood function is given by ( ) X X ðy 1 y^ t Þ2 t 2 1 LðhÞ ¼ r log 2p þ log Dt þ r log s 2 2 s t2K Dt t2K n
n
oðtÞ 11
where r is the number of observed values and Dt ¼ þ 1 is the variance of the best predictor y^ t : This form of the likelihood function may be used to efficiently calculate Gaussian ML estimates. One-step predictions and their corresponding standard deviations are obtained directly from the recursive
109
Estimation of Long-Memory Time Series Models
Kalman filter equations without further computation, see Palma and Chan (1997) for details. 6.1. Effect of Data Irregularities and Missing Values on ML Estimates
Water Level
To illustrate the potential dramatic effects of data irregularities such as repeated or missing values on the parameter estimation of a long-memory process, consider the well-known Nile river data, cf. Beran (1994b), shown in Fig. 1. Panel (a) displays the original data. From this plot, it seems that in several periods, the data were repeated year after year in order to complete the series. Panel (b) shows the same series, but without those repeated values (filtered series). This procedure has been discussed in Palma and Del Pino (1999). Table 2 shows the fitted parameters of an ARFIMA(0, d, 0) using an AR(40) approximation along the Kalman filter, for both the original and the data without repetitions. Observe that for the original data, the estimate of the long-memory parameter is 0.5, indicating that the model has reached the non-stationary boundary. On the other hand, for the filtered data, the estimate of d belongs to the stationary region. Thus, in this particular case, the presence of data 15 14 13 12 11 10 9 600
800
1000
1200 Year
1400
1600
1800
600
800
1000
1200
1400
1600
1800
Water Level
(a) 15 14 13 12 11 10 9
(b)
Fig. 1.
Year
Nile River Data (622 A.D. – 1921 A.D.). Panel (a) Original Data and panel (b) Filtered Data.
110
Table 2.
NGAI HANG CHAN AND WILFREDO PALMA
Nile River Data: ML Estimates for the Fitted ARFIMA(0, d, 0) Model.
Series Panel (a) Panel (b)
d^
td^
s^
0.5000 0.4337
25.7518 19.4150
0.6506 0.7179
Table 3. Finite Sample Behavior of ML Estimates of ARFIMA(0, d, 0) Processes with Missing Values. Sample Size n ¼ 250 and Truncation m ¼ 40 for AR(m) and MA(m) Approximations. d
NA’s 0%
0.40
15% 30% 0%
0.25
15% 30% 0%
0.10
15% 30%
AR
MA
Expected stdv
mean stdev
0.380470 0.056912
0.374390 0.048912
0.049312
mean stdev
0.376730 0.060268
0.369120 0.055014
0.053550
mean stdev
0.366650 0.065166
0.363860 0.059674
0.058935
mean stdev
0.225600 0.056810
0.226900 0.052110
0.049312
mean stdev
0.223500 0.065440
0.224800 0.060040
0.053550
mean stdev
0.219800 0.072920
0.220800 0.062510
0.058935
mean stdev
0.081145 0.054175
0.081806 0.051013
0.049312
mean stdev
0.086788 0.058106
0.081951 0.052862
0.053550
mean stdev
0.080430 0.067788
0.081156 0.061118
0.058935
irregularities such as the replacement of missing data with repeated values induces non-stationarity. On the other hand, when the missing value is appropriately taken care of, the resulting model is stationary (cf. Palma & Del Pino, 1999). Table 3 displays the results from Monte Carlo simulations of approximate ML estimates for fractional noise ARFIMA(0, d, 0) with missing values at
111
Estimation of Long-Memory Time Series Models
random. The sample size chosen was n ¼ 250 and the AR and MA truncations are m ¼ 40 for both cases. The long-memory parameters are d ¼ 0:1; 0:25; 0:40 and s2 ¼ 1: The number of missing values are 38 (15% of the sample) and 75 (30% of the sample) and were selected randomly for each sample. Note that the bias and the standard deviation of the estimates seem to increase as the number of missing values increases. On the other hand, the sample standard deviation of the estimates seems to be close to the expected values, for the MA approximation. For the AR approximation, these values are greater than expected. The expected standard deviation used here is pffiffiffiffiffiffiffiffiffiffiffiffiffiffi 6=p2 nn ; where n is the number of observed values.
7. ESTIMATION OF SEASONAL LONG-MEMORY MODELS In practical applications, many researchers have found time series exhibiting both long-range dependence and cyclical behavior. For instance, this phenomenon occurs for the inflation rates studied by Hassler and Wolters (1995), revenue series analyzed by Ray (1993), monetary aggregates considered by Porter-Hudak (1990), quarterly gross national product and shipping data discussed by Ooms (1995) and monthly flows of the Nile River studied by Montanari, Rosso, and Taqqu (2000). Several statistical methodologies have been proposed to model this type of data. For instance, Gray, Zhang, and Woodward (1989) propose the generalized fractional or Gegenbauer (GARMA) processes, Porter-Hudak (1990) discusses seasonal fractionally integrated autoregressive moving average (SARFIMA) models, Hassler (1994) introduces the flexible seasonal fractionally integrated processes (flexible ARFISMA) and Woodward, Cheng, and Gray (1998) introduce the k-GARMA processes. Furthermore, the statistical properties of these models have been investigated by Giraitis and Leipus (1995), Chung (1996), Arteche and Robinson (2000), Velasco and Robinson (2000) and Giraitis, Hidalgo, and Robinson (2001), among others. A rather general class of Gaussian seasonal long-memory processes is specified by the spectral density f ðoÞ ¼ gðoÞjoj
a
mi r Y Y i¼1 j¼1
jo
oij j
ai
(32)
112
NGAI HANG CHAN AND WILFREDO PALMA
where o 2 ð p; p; 0 a; ai o1; i ¼ 1; . . . ; r; gðoÞ is a symmetric, strictly positive, continuous and bounded function and oij6¼0 are known poles for j ¼ 1; . . . ; mi ; i ¼ 1; . . . ; r: To ensure the symmetry of f, it is assumed that for any i ¼ 1; . . . ; r; j ¼ 1; . . . mi ; there is one and only one 1rj0 rmi such that oij ¼ oij0 : The spectral density of many widely used models such as SARFIMA and k-factor GARMA satisfy specification (32). The exact ML estimation of processes satisfying (32) has been recently studied by Palma and Chan (2005) who have established the following result: Theorem 8. Let h^ n be the exact MLE for a process satisfying (32) and h0 the true parameter. Then, under some regularity conditions we have (Consistency) h^ n!p h0 as n-N. (Central limit theorem) ThepML ffiffiffi estimate, h^ n ; satisfies the following limiting distribution as n-N: nðh^ n h0 Þ ! Nð0; Gðh0 Þ 1 Þ; where G(h0) is given by (17). (Efficiency) The ML estimate, h^ n ; is asymptotically an efficient estimate of h0. 7.1. Monte Carlo Studies In order to assess the finite sample performance of the ML estimates in the context of long-memory seasonal series, a number of Monte Carlo simulations were conducted for the class of SARFIMA(p, d, q) (P, ds, Q)s models described by the following difference equation (cf. Porter-Hudak, 1990): fðBÞFðBs Þð1
BÞd yt ¼ yðBÞYðBs Þt
where ft g are standard normal random variables, fðBÞ ¼ 1 f1 B fp Bp ; FðBs Þ ¼ 1 F1 Bs FP BsP ; yðBÞ ¼ 1 þ y1 B þ þ yq Bq ; YðBs Þ ¼ 1 þ Y1 Bs þ þ YQ BsQ ; polynomials f(B) and y(B) have no common zeros, F(Bs) and Y(Bs) have no common zeros, and the roots of these polynomials are outside the unit circle. Table 4 shows the results from simulations for SARFIMA(0, d, 0) (0, ds, 0)s models. The estimates of d^ and d^s reported in columns seven and eight of Table 4 and the estimated standard deviations displayed in the last two columns of the table are based on 1,000 repetitions. The MLE are computed by means of an extension to SARFIMA models of the state space representations of long-memory processes, see Chan and Palma (1998) for details. The theoretical values of the standard deviations of the estimated parameters are based on the formula (17). In general, analytic expressions for the
113
Estimation of Long-Memory Time Series Models
Table 4. Finite Sample Performance of ML Estimates of SARFIMA(0, d, 0) (0, ds, 0)s Models for Several Values of s, n, d and ds. Period 4 6 6 6 12 12
n
d
ds
sðdÞ
sðd s Þ
d^
d^s
^ ^ dÞ sð
^ d^s Þ sð
1000 500 1000 2000 3000 1000
0.100 — 0.150 0.200 — 0.200
0.300 0.300 0.200 0.200 0.250 0.100
0.025 — 0.025 0.018 — 0.025
0.025 0.035 0.025 0.018 0.014 0.025
0.090 — 0.140 0.195 — 0.194
0.299 0.286 0.198 0.196 0.252 0.094
0.027 — 0.026 0.018 — 0.025
0.028 0.036 0.026 0.018 0.016 0.027
integral in (17) are difficult to obtain for an arbitrary period s. For an SARFIMA(0, d, 0) (0, ds, 0)s model, the matrix G(h) can be written as 0 2 1 p c 6 A GðhÞ ¼ @ 2 c p6 Rp with c ¼ p1 p log j2 sinðo2 Þj log j2 sinðso2 Þj do: An interesting feature of the asymptotic variance of the parameters is that for an SARFIMA(0, d, 0) (0, ds, 0)s process, the variance of d^ is the same as the variance of d^s : From Table 4, note that the estimates and their standard deviations are close to the theoretical values, for all the sample sizes and combinations of parameters investigated.
8. HETEROSKEDASTIC TIME SERIES Evidence of long-memory behavior in returns and/or empirical volatilities has been observed by several authors, see for example Robinson (1991) and references therein. Accordingly, several models have been proposed in the econometric literature to explain the combined presence of long-range dependence and conditional heteroskedasticity. In particular, a class of models that has received considerable attention is the ARFIMA–GARCH (generalized autoregressive conditional heteroskedastic) process, see for example Ling and Li (1997). In this model, the returns have long memory and the noise has a conditional heteroskedasticity structure. A related class of interesting models is the extension of the ARCH(p) processes first introduced by Engle (1982) to the ARCH(N) models to encompass the longer dependence observed in many squared financial series. On the other hand, extensions of the stochastic volatility processes to the long-memory case
114
NGAI HANG CHAN AND WILFREDO PALMA
have produced the so-called long-memory stochastic volatility models (LMSV). In this section, we discuss briefly some of these well-known econometric models.
8.1. ARFIMA–GARCH Model An ARFIMA(p,d,q)–GARCH(r,s) process is defined by the discrete-time equation FðBÞð1
BÞd ðyt
jFt
mÞ ¼ YðBÞt
Nð0; ht Þ s r X X bj ht ai 2t þ ht ¼ a0 þ 1
i¼1
j
j¼1
where Ft 1 is the s-algebra generated by the past observations yt 1 ; y t 2 ; . . . : Most econometric models dealing with long memory and heteroskedastic behaviors are non-linear, in the sense that the noise sequence is not necessarily independent. An approximate MLE h^ is obtained by maximizing the conditional log-likelihood LðhÞ ¼
n
1 X 2t log ht þ 2n t¼1 ht
(33)
The asymptotic behavior of this estimate was formally established by Ling and Li (1997). Let h ¼ ðh1 ; h2 Þ0 ; where h1 ¼ ðf1 ; . . . ; fp ; y1 ; . . . ; yq ; dÞ0 is the parameter vector involving the ARFIMA components and h2 ¼ ða0 ; . . . ; ar ; b1 ; . . . ; bs Þ0 is the parameter vector containing the GARCH component. The following result correspond to Theorem 3.2 of Ling and Li (1997). Theorem 9. Let h^ n be the value that maximizes the conditional log-likelihood function (33). Then, some regularity conditions, h^ n is a pffiffiffi under ^ consistent estimate and nðhn h0 Þ ! Nð0; O 1 Þ; as n-N , where O ¼ diag(O1,O2) with " # 1 @t @t 1 @ht @ht þ O1 ¼ E ht @h1 @h01 2h2t @h1 @h01
115
Estimation of Long-Memory Time Series Models
and "
1 @ht @ht O2 ¼ E 0 2h2t @h2 @h2
#
8.2. Arch-Type Models The ARFIMA–GARCH process described in the previous subsection is adequate for modeling long-range dependence in returns of financial time series. However, as described by Rosenblatt (1961) and Palma and Zevallos (2004), the squares of an ARFIMA–GARCH process have only intermediate memory for d 2 ð0; 1=4Þ: In fact, for any d 2 ð0; 1=2Þ; the ACF of the ~ squared series behaves like k2d 1 ; where k denotes the k-th lag and d~ ¼ 2d 1=2: Consequently, the long-memory parameter of the squared series d~ is always smaller than the long-memory parameter of the original series d, ~ i.e. dod for do1=2: Since in many financial applications the squared returns have the same or greater level of autocorrelation, the theoretical reduction in the memory that affects the squares of an ARFIMA–GARCH process may not be appropriate in practice. This situation leads one to consider other classes of processes to model the dependence of the squared returns directly. For instance, Robinson (1991) proposed the following extension of the ARCH(p) introduced by Engle (1982), y t ¼ st x t s2t ¼ a0 þ
1 X
aj y2t
j
j¼1
which can be formally written as y2t ¼ a0 þ nt þ
1 X
aj y2t
j
(34)
j¼1
where s2t ¼ E½y2t jyt 1 ; yt 2 ; . . . ; nt ¼ y2t s2t is a martingale difference sequence, {xt} a sequence of independent and identically distributed random variables and a0 a positive constant, cf. Eqs. (1.31), (1.33) and (1.35) of Robinson (2003), respectively. When the coefficients {aj} of (34) are specified by an ARFIMA(p,d,q) model, the resulting process corresponds to the (fractionally integrated
116
NGAI HANG CHAN AND WILFREDO PALMA
GARCH) FIGARCH(p,d,q) model which is defined by FðBÞð1
BÞd y2t ¼ o þ YðBÞnt
where o is a positive constant, cf. Baillie, Bollerslev, & Mikkelsen, (1996). As noted by Karanasos, Psaradakis, & Sola (2004), this process is strictly stationary and ergodic but not square integrable. 8.2.1. Estimation Consider the quasi log-likelihood function LðhÞ ¼
1 log ð2pÞ 2
n
1X 2 log s2t þ t2 2 t¼1 st
(35)
where h ¼ ðo; d; f1 ; . . . ; fp ; y1 ; . . . ; y1 Þ: A QMLE h^ n can be obtained by maximizing (35). But, even though this estimation approach has been widely used in many practical applications, to the best of our knowledge, asymptotic results for these estimators remain an open issue. For a recent study about this problem, see for example Caporin (2002). 8.3. Stochastic Volatility Stochastic volatility models have been addressed by Harvey, Ruiz, and Shephard (1994), Ghysels et al. (1996) and Breidt et al. (1998), among others. These processes are defined by and
r t ¼ st x t st ¼ s exp ðvt =2Þ
(36)
where {xt} is a independent, identically distributed sequence with mean zero and variance one and {vt} is a stationary process independent of {xt}. In particular, {vt} can be specified as a long-memory ARFIMA(p,d,q) process. The resulting process is called LMSV model. From (36), we can write logðr2t Þ ¼ logðs2t Þ þ logðx2t Þ
logðs2t Þ ¼ logðs2 Þ þ vt
Let yt ¼ logðr2t Þ; m ¼ logðs2 Þ þ E½logðx2t Þ and t ¼ logðx2t Þ Then yt ¼ m þ vt þ t
E½logðx2t Þ: (37)
117
Estimation of Long-Memory Time Series Models
Consequently, the transformed process {yt} corresponds to a stationary long-memory process plus an independent noise. The ACF of (37) is given by gy ðkÞ ¼ gv ðkÞ þ s2 d0 ðkÞ where d0 ðkÞ ¼ 1 for k ¼ 0 and d0 ðkÞ ¼ 0 otherwise. Furthermore, the spectral density of {yt}, fy, is given by f y ðoÞ ¼ f v ðoÞ þ
s2 2p
where fv is the spectral density of the long-memory process {vt}. In particular, if the process {vt} is an ARFIMA(p,d,q) model FðBÞð1
BÞd vt ¼ YðBÞZt
(38)
and h ¼ ðd; s2Z ; s2 ; f1 ; . . . ; fp ; y1 ; . . . ; yq Þ0 is the parameter vector that specifies model (38), then the spectral density is given by YðexpðioÞ2 s2Z s2 f h ðoÞ ¼ 2 þ 2d 2p 1 expðioÞ FðexpðioÞ 2p
Breidt et al. (1998) consider the estimation of the parameter h by means of the spectral-likelihood estimator obtained by minimizing L7 ðhÞ ¼
n=2
Iðoj Þ 2p X log f h ðoj Þ þ f h ðoj Þ n j¼1
where fh(o) is given by (39). Let h^ be the value that minimizes L7 ðhÞ over the parameter space Y. Breidt et al. (1998) prove the following result. Theorem 10. Assume that the parameter vector h is an element of the compact parameter space Y and assume that f h1 ¼ f h2 implies that h1 ¼ h2. Let h0 be the true parameter value. Then h^ n ! h0 in probability as n-N. Other estimation procedures for LMSV using state space systems can be found in Chan and Petris (2000) and Section 11 of Chan (2002).
118
NGAI HANG CHAN AND WILFREDO PALMA
Table 5. Quasi ML Estimation of Long-Memory Stochastic Volatility Models with an ARFIMA(0, d, 0) Specification for vt, for Different Values of d. d 0.10 0.25 0.40
d^
s^ Z
^ S.D.ðdÞ
S.D.(s^ Z )
0.0868 0.2539 0.4139
9.9344 10.0593 10.1198
0.0405 0.0400 0.0415
0.4021 0.4199 0.3773
8.4. Numerical Experiments The finite sample performance of the spectral-likelihood estimator is analyzed here by means of Monte Carlo simulations. The model pffiffiffiinvestigated is the LMSV with an ARFIMA(0,d,0) structure, s ¼ p 2; xt follows a standard normal distribution, sZ ¼ 10 and the sample size is n ¼ 400. The results displayed in Table 5 are based on 1,000 replications. From Table 5, observe that estimates of both the long-memory parameter d and the scale parameter sZ are close to their true values. On the other hand, the standard deviations of d^ and s^ Z seem to be similar for all the values of d simulated. However, to the best of our knowledge there are no formally established results for the asymptotic distribution of these QMLE yet.
9. SUMMARY In this article, a number of estimation techniques for long-memory time series have been reviewed together with their corresponding asymptotic results. Finite sample behaviors of these techniques were studied through Monte Carlo simulations. It is found that they are relatively comparable in terms of finite sample performance. However, in situations like missing data or long-memory seasonal time series, some approaches such as the MLE or truncated MLE seems to be more efficient than their spectral domain counterparts such as the Whittle approach. Clearly, long-memory time series is an exciting and important topic in econometrics as well as many other disciplines. This article does not attempt to cover all of the important aspects of this exciting field. Interested readers may find many actively pursued topics in this area in the recent monograph of Robinson (2003). It is hoped that this article offers a focused and practical introduction to the estimation of long-memory time series.
Estimation of Long-Memory Time Series Models
119
ACKNOWLEDGMENT We would like to thank an anonymous referee for helpful comments and Mr. Ricardo Olea for carrying out some of the simulations reported in this paper. This research is suppored in part by HKSAR-RGC Grants CUHK 4043/02P, 400305 and 2060268 and Fondecyt Grant 1040934. Part of this research was conducted when the second author was visiting the Institute of Mathematical Sciences at the Chinese University of Hong Kong. Support from the IMS is gratefully acknowledged.
REFERENCES Ammar, G. S. (1996). Classical foundations of algorithms for solving positive definite Toeplitz equations. Calcolo, 33, 99–113. Arteche, J., & Robinson, P. M. (2000). Semiparametric inference in seasonal and cyclical long memory processes. Journal of Time Series Analysis, 21, 1–25. Baillie, R. T., Bollerslev, T., & Mikkelsen, H. O. (1996). Fractionally integrated generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 74, 3–30. Beran, J. (1994a). On a class of M-estimators for Gaussian long-memory models. Biometrika, 81, 755–766. Beran, J. (1994b). Statistics for long-memory processes. New York: Chapman and Hall. Bertelli, S., & Caporin, M. (2002). A note on calculating autocovariances of long-memory processes. Journal of Time Series Analysis, 23, 503–508. Bhansali, R. J., & Kokoszka, P. S. (2003). Prediction of long-memory time series. In: P. Doukhan, G. Oppenheim & M. Taqqu (Eds), Theory and applications of long-range dependence (pp. 355–367). Boston: Birkha¨user. Bondon, P., & Palma, W. (2005). Prediction of strongly dependent time series. Preprint, Pontificia Universidad Catolica de Chile, Santiago, Chile. Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (1994). Time series analysis, forecasting and control. Englewood Cliffs: Prentice-Hall. Breidt, F. J., Crato, N., & de Lima, P. (1998). The detection and estimation of long memory in stochastic volatility. Journal of Econometrics, 83, 325–348. Brockwell, P. J., & Davis, R. A. (1991). Time series: Theory and methods. New York: SpringerVerlag. Caporin, M. (2002). Long memory conditional heteroskedasticity and second order causality. Ph.D. Dissertation, Universita de Venezia, Venezia. Chan, N. H. (2002). Time series, applications to finance. Wiley Series in Probability and Statistics. New York: Wiley Chan, N. H., & Palma, W. (1998). State space modeling of long-memory processes. The Annals of Statistics, 26, 719–740. Chan, N. H. & Petris, G. (2000). Recent developments in heteroskedastic financial series. In: W. S. Chan, W. K. Li & H. Tong (Eds), Statistics and finance (pp. 169–184). (Hong Kong, 1999). London: Imperial College Press.
120
NGAI HANG CHAN AND WILFREDO PALMA
Chung, C. F. (1996). A generalized fractionally integrated autoregressive moving-average process. Journal of Time Series Analysis, 17, 111–140. Dahlhaus, R. (1989). Efficient parameter estimation for self-similar processes. The Annals of Statistics, 17, 1749–1766. Durbin, J. (1960). The fitting of time series models. Review of The International Statistical Institute, 28, 233–244. Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50, 987–1007. Fox, R., & Taqqu, M. S. (1986). Large-sample properties of parameter estimates for strongly dependent stationary Gaussian time series. Annals of Statistics, 14, 517–532. Ghysels, E., Harvey, A. C., & Renault, E. (1996). Stochastic volatility. In: G. S. Maddala & C. R. Rao (Eds), Statistical methods in finance (Vol. 14 of Handbook of Statistics. pp. 119–191). Amsterdam: North-Holland. Giraitis, L., Hidalgo, J., & Robinson, P. M. (2001). Gaussian estimation of parametric spectral density with unknown pole. The Annals of Statistics, 29, 987–1023. Giraitis, L., & Leipus, R. (1995). A generalized fractionally differencing approach in longmemory modeling. Lietuvos Matematikos Rinkinys, 35, 65–81. Giraitis, L., & Surgailis, D. (1990). A central limit theorem for quadratic forms in strongly dependent linear variables and its application to asymptotical normality of Whittle’s estimate. Probability Theory and Related Fields, 86, 87–104. Gradshteyn, I. S., & Ryzhik, I. M. (2000). Table of integrals, series, and products. San Diego, CA: Academic Press Inc. Granger, C. W. J., & Joyeux, R. (1980). An introduction to long-memory time series models and fractional differencing. Journal of Time Series Analysis, 1, 15–29. Gray, H. L., Zhang, N. F., & Woodward, W. A. (1989). On generalized fractional processes. Journal of Time Series Analysis, 10, 233–257. Grenander, U., & Szego¨, G. (1958). Toeplitz forms and their applications. California monographs in mathematical sciences. Berkeley: University of California Press. Harvey, A. C., Ruiz, E., & Shephard, N. (1994). Multivariate stochastic variance models. Review of Economic Studies, 61, 247–265. Hassler, U. (1994). (Mis)specification of long memory in seasonal time series. Journal of Time Series Analysis, 15, 19–30. Hassler, U., & Wolters, J. (1995). Long memory in inflation rates: International evidence. Journal of Business and Economic Statistics, 13, 37–45. Haslett, J., & Raftery, A. E. (1989). Space-time modelling with long-memory dependence: Assessing Ireland’s wind power resource. Journal of Applied Statistics, 38, 1–50. Hosking, J. R. M. (1981). Fractional differencing. Biometrika, 68, 165–176. Karanasos, M., Psaradakis, Z., & Sola, M. (2004). On the autocorrelation properties of longmemory GARCH processes. Journal of Time Series Analysis, 25, 265–281. Levinson, N. (1947). The Wiener RMS (root mean square) error criterion in filter design and prediction. Journal of Mathematical Physics of the Massachusetts Institute of Technology, 25, 261–278. Li, W. K., & McLeod, A. I. (1986). Fractional time series modelling. Biometrika, 73, 217–221. Ling, S., & Li, W. K. (1997). On fractionally integrated autoregressive moving-average time series models with conditional heteroscedasticity. Journal of the American Statistical Association, 92, 1184–1194.
Estimation of Long-Memory Time Series Models
121
Montanari, A., Rosso, R., & Taqqu, M. S. (2000). A seasonal fractional ARIMA model applied to Nile River monthly flows at Aswan. Water Resources Research, 36, 1249–1259. Ooms, M. (1995). Flexible seasonal long memory and economic time series. Rotterdam. Technical report Econometric Institute, Erasmus University. Palma, W., & Chan, N. H. (1997). Estimation and forecasting of long-memory processes with missing values. Journal of Forecasting, 16, 395–410. Palma, W., & Chan, N. H. (2005). Efficient estimations of seasonal long range dependent processes. Journal of Time Series Analysis, 26, 863–892. Palma, W., & Del Pino, G. (1999). Statistical analysis of incomplete long-range dependent data. Biometrika, 86, 965–972. Palma, W., & Zevallos, M. (2004). Analysis of the correlation structure of square time series. Journal of Time Series Analysis, 25, 529–550. Porter-Hudak, S. (1990). An application of the seasonal fractionally differenced model to the monetary aggregates. Journal of the American Statistical Association, Applications and Case Studies, 85, 338–344. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in FORTRAN. Cambridge: Cambridge University Press. Ray, B. K. (1993). Long-range forecasting of IBM product revenues using a seasonal fractionally differenced ARMA model. International Journal of Forecasting, 9, 255–269. Robinson, P. M. (1991). Testing for strong serial correlation and dynamic conditional heteroskedasticity in multiple regression. Journal of Econometrics, 47, 67–84. Robinson, P. M. (1995). Gaussian semiparametric estimation of long range dependence. The Annals of Statistics, 23, 1630–1661. Robinson, P. M. (2003). Time series with long memory. Oxford: Oxford University Press. Rosenblatt, M. (1961). Independence and dependence. In: J. Neyman (Ed.), Proceeding of the 4th Berkeley Symposium On Mathematics Statistics and Probability (Vol. II, pp. 431– 443). Berkeley, CA: University of California Press. Shumway, R. H., & Stoffer, D. S. (2000). Time series analysis and its applications. New York: Springer-Verlag. Sowell, F. (1992). Maximum likelihood estimation of stationary univariate fractionally integrated time series models. Journal of Econometrics, 53, 165–188. Velasco, C., & Robinson, P. M. (2000). Whittle pseudo-maximum likelihood estimation for nonstationary time series. Journal of the American Statistical Association, 95, 1229–1243. Whittle, P. (1951). Hypothesis testing in time series analysis. New York: Hafner. Woodward, W. A., Cheng, Q. C., & Gray, H. L. (1998). A k-factor GARMA long-memory model. Journal of Time Series Analysis, 19, 485–504. Yajima, Y. (1985). On estimation of long-memory time series models. The Australian Journal of Statistics, 27, 303–320.
This page intentionally left blank
122
BOOSTING-BASED FRAMEWORKS IN FINANCIAL MODELING: APPLICATION TO SYMBOLIC VOLATILITY FORECASTING Valeriy V. Gavrishchaka ABSTRACT Increasing availability of the financial data has opened new opportunities for quantitative modeling. It has also exposed limitations of the existing frameworks, such as low accuracy of the simplified analytical models and insufficient interpretability and stability of the adaptive data-driven algorithms. I make the case that boosting (a novel, ensemble learning technique) can serve as a simple and robust framework for combining the best features of the analytical and data-driven models. Boosting-based frameworks for typical financial and econometric applications are outlined. The implementation of a standard boosting procedure is illustrated in the context of the problem of symbolic volatility forecasting for IBM stock time series. It is shown that the boosted collection of the generalized autoregressive conditional heteroskedastic (GARCH)-type models is systematically more accurate than both the best single model in the collection and the widely used GARCH(1,1) model.
Econometric Analysis of Financial and Economic Time Series/Part B Advances in Econometrics, Volume 20, 123–151 Copyright r 2006 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(05)20024-5
123
124
VALERIY V. GAVRISHCHAKA
1. INTRODUCTION Increasing availability of the high-frequency and multisource financial data has opened new opportunities for quantitative modeling and problem solving in a wide range of financial and econometric applications. These include such challenging problems as forecasting of the financial time series and their volatilities, portfolio strategy discovery and optimization, valuation of complex derivate instruments, prediction of rare events (e.g., bankruptcy and market crashes), and many others. Such data availability also allows more accurate testing and validation of the existing models and frameworks to expose their limitations and/or confirm their advantages. For example, analytical and simple parametric models are usually clear and stable, but lack sufficient accuracy due to simplified assumptions. Adaptive data-driven models (mostly non-parametric or semi-parametric) can be very accurate in some regimes, but lack consistent stability and explanatory power (‘‘black-box’’ feature). Moreover, the multivariate nature of many problems and data non-stationarity results in ‘‘dimensionality curse’’ (Bishop, 1995) and incompleteness of the available training data that is prohibitive for a majority of the machine learning and statistical algorithms. Recently, the large margin classification techniques emerged as a practical result of the statistical learning theory, in particular, support vector machines (SVMs) (Vapnik, 1995, 1998; Cristianini & Shawe-Taylor, 2000). A large margin usually implies good generalization performance. The margin is the distance of the example to the class separation boundary. In contrast to many existing machine learning and statistical techniques, a large margin classifier generates decision boundaries with large margins to almost all training examples. This results in superior generalization ability on out-of-sample data. SVMs for classification and regression have been successfully applied to many challenging practical problems. Recent successful applications of the SVM-based adaptive systems include market volatility forecasting (Gavrishchaka & Ganguli, 2003), financial time series modeling (Van Gestel et al., 2001), bankruptcy prediction (Fan & M. Palaniswami, 2000), space weather forecasting (Gavrishchaka & Ganguli, 2001a, b), protein function classification (Cai, Wang, Sun, & Chen, 2003), cancer diagnostics (Chang, Wu, Moon, Chou, & Chen, 2003), face detection and recognition (Osuna, Freund, & Girosi, 1997) as well as many other financial, scientific, engineering, and medical applications. In most cases, SVMs demonstrated superior results compared to other advanced machine learning and statistical techniques, including different types of neural networks (NNs).
Boosting-Based Frameworks in Financial Modeling
125
Although SVMs are significantly more tolerant to data dimensionality and incompleteness (Cristianini & Shawe-Taylor, 2000; Gavrishchaka & Ganguli, 2001a, b), they can still fail in many important cases where available data does not allow generating stable and robust data-driven models. Moreover, SVM framework, like many other machine learning algorithms (including NNs), largely remains to be a ‘‘black box’’ with its limited explanatory power. Also, it is not easy to introduce a priori problem-specific knowledge into the SVM framework, except for some flexibility in the problem-dependent choice of the kernel type. One of the machine learning approaches to compensate for the deficiency of the individual models is to combine several models to form a committee (e.g., Bishop, 1995; Hastie, Tibshirani, & Friedman, 2001; Witten & Frank, 2000). The committee can compensate limitations of the individual models due to both incomplete data and specifics of the algorithms (e.g., multiple local minima in the NN error surface). A number of different ensemble learning (model combination) techniques to build optimal committees have been proposed over the years in several research communities (e.g., Granger, 1989; Clemen, 1989; Drucker et al., 1994; Hastie, Tibshirani, & Friedman, 2001, and references therein). The work of Bates and Granger (1969) is considered to be the first seminal work on forecast combining (model mixing) in econometrics and financial forecasting. In the early machine learning literature, the same area of research was called ‘‘evidence combination’’ (e.g., Barnett, 1981). Recently, there was a resurgence of interest in model combination in machine learning community (e.g., Schapire, 1992; Drucker et al., 1994; Dietterich, 2000; Ratsch, 2001). Modern research is mainly focused on the novel techniques that are suited for challenging problems with such features as a large amount of noise, limited number of training data, and highdimensional patterns. These types of problems often arise in financial applications dealing with high-frequency and heterogeneous data (e.g., Gavrishchaka & Ganguli, 2003 and references therein), bioinformatics and computational drug discovery (e.g., Scholkopf, Tsuda, & Vert, 2004), space weather forecasting (e.g., Gleisner & Lundstedt, 2001; Gavrishchaka & Ganguli, 2001a, b), medical diagnostic systems (e.g., Gardner, 2004), and many others. Many modern ensemble learning algorithms perform a special manipulation with the training data set to compensate for the data incompleteness and its numerous consequences. Among them are bagging, crossvalidating committees, and boosting (e.g., Drucker et al., 1994; Ratsch, 2001; Hastie et al., 2001). The advantage of the ensemble learning approach is not only the possibility of the accuracy and stability improvement, but also its ability to
combine a variety of models: analytical, simulation-based, and data-driven. This latter feature can significantly improve the explanatory power of the combined model if its building blocks are sufficiently simple and well-understood models. However, ensemble learning algorithms can be susceptible to the same problems and limitations as standard machine learning and statistical techniques. Therefore, the optimal choice of both the base model pool and ensemble learning algorithms with good generalization qualities and tolerance to data incompleteness and dimensionality is very important.

A very promising ensemble learning algorithm that combines many desirable features is boosting (Valiant, 1984; Schapire, 1992; Ratsch, 2001). Boosting and its specific implementations such as AdaBoost (Freund & Schapire, 1997) have been actively studied and successfully applied to many challenging classification problems (Drucker, Schapire, & Simard, 1993; Drucker et al., 1994; Schwenk & Bengio, 1997; Opitz & Maclin, 1999). Recently, boosting has been successfully applied to such practical problems as document routing (Iyer et al., 2000), text categorization (Schapire & Singer, 2000), face detection (Xiao, Zhu, & Zhang, 2003), pitch accent prediction (Sun, 2002), and others. One of the main features that sets boosting apart from other ensemble learning frameworks is that it is a large margin classifier similar to SVM; the connection between SVM and boosting has been rigorously established (Schapire et al., 1998). This ensures superior generalization ability and better tolerance to incomplete data compared to other ensemble learning techniques. Similar to SVM, the boosting generalization error does not directly depend on the dimensionality of the input space (Ratsch, 2001). Therefore, boosting is also capable of working with high-dimensional patterns. The other distinctive and practically important feature of boosting is its ability to produce a powerful classification system starting with just a "weak" classifier as a base model (Ratsch, 2001).

In most cases discussed in the literature, boosting is used to combine data-driven models based on machine learning algorithms such as classification trees, NNs, etc. In this article, I will stress the advantages of the boosting framework that allow an intelligent combination of existing analytical and other simplified and parsimonious models specific to the field of interest. In this way, one can combine the clarity and stability typical for analytical and parsimonious parametric models with the accuracy usually achieved only by the best adaptive data-driven algorithms. Moreover, since the underlying components are well-understood and accepted models, the obtained ensemble is not a pure "black box."

In the next two sections, I provide a short overview of the boosting key features and its relation to existing approaches as well as the description of
the algorithm itself. After that, potential applications of boosting in quantitative finance and financial econometrics are outlined. Finally, a realistic example of a successful boosting application to symbolic volatility forecasting is given. In particular, boosting is used to combine different types of GARCH models for next-day threshold forecasting of IBM stock volatility. The ensemble obtained with a regularized AdaBoost is shown to be consistently superior to both the single best model and the GARCH(1,1) model on both training (in-sample) and test (out-of-sample) data.
2. BOOSTING: THE MAIN FEATURES AND RELATION TO OTHER TECHNIQUES

Since the first formulation of adaptive boosting as a novel ensemble learning (model combination) algorithm (Schapire, 1992), researchers from different fields have demonstrated its relation to various existing techniques and frameworks. Such active research was inspired by boosting's robust performance in a wide range of applications (Drucker et al., 1993, 1994; Schwenk & Bengio, 1997; Opitz & Maclin, 1999; Schapire & Singer, 2000; Xiao et al., 2003). The most common conclusion in the machine learning community is that boosting represents one of the most robust ensemble learning methods. Computational learning theorists also proved boosting's association with the theory of margins and SVMs, which belong to an important class of large-margin classifiers (Schapire et al., 1998). Recent research in the statistics community shows that boosting can also be viewed as an optimization algorithm in function space (Breiman, 1999). Statisticians consider boosting as a new class of learning algorithms that Friedman named "gradient machines" (Friedman, 1999), since boosting performs a stagewise greedy gradient descent. This relates boosting to particular additive models and to matching pursuit known within the statistics literature (Hastie et al., 2001). There are also arguments that the original class of model mixing procedures is not in competition with boosting but rather can coexist inside or outside a boosting algorithm (Ridgeway, 1999).

In this section, I review boosting's key features in the context of its relation to other techniques. Since the structure and the most practical interpretation of the boosting algorithm naturally relates it to ensemble learning techniques, the comparison of boosting with other approaches for model combination will be my main focus. Conceptual similarities and differences with other algorithms will be demonstrated from several different angles whenever possible.
The practical value of model combination is exploited by practitioners and researchers in many different fields. In one of his papers, Granger (1989) summarizes the usefulness of combining forecasts: "The combination of forecasts is a simple, pragmatic, and sensible way to possibly produce better forecasts." The basic idea of ensemble learning algorithms, including boosting, is to combine relatively simple base hypotheses (models) for the final prediction. The important question is why and when an ensemble is better than a single model. In the machine learning literature, three broad reasons why good ensembles can be constructed are often mentioned (e.g., Dietterich, 2000). First, there is a purely statistical reason. The amount of training data is usually too small (data incompleteness), and learning algorithms can find many different models (from the model space) with comparable accuracy on the training set. However, these models capture only certain regimes of the whole dynamics or mapping, which becomes evident in out-of-sample performance. There is also a computational reason related to learning algorithm specifics such as multiple local minima on the error surface (e.g., NNs and other adaptive techniques). Finally, there is a representational reason when the true model cannot be effectively represented by a single model from a given set even for an adequate amount of training data. Ensemble methods have the promise of reducing these key shortcomings of standard learning algorithms and statistical models.

One of the quantitative and explanatory measures for the analysis and categorization of ensemble learning algorithms is error diversity (e.g., Brown, Wyatt, Harris, & Yao, 2005 and references therein). In particular, the ambiguity decomposition (Krogh & Vedelsby, 1995) and the bias-variance-covariance decomposition (Geman, Bienenstock, & Dourstat, 1992) provide a quantification of diversity for linearly weighted ensembles by connecting it back to an objective error criterion: mean-squared error. Although these frameworks have been formulated for regression problems, they can also be useful for the conceptual analysis of classifier ensembles (Brown et al., 2005). For simplicity, I consider only the ambiguity decomposition, which is formulated for convex combinations and is a property of an ensemble trained on a single dataset. Krogh and Vedelsby (1995) proved that at a single data point the quadratic error of the ensemble is guaranteed to be less than or equal to the average quadratic error of the base models:

$(f - y)^2 = \sum_t w_t (f_t - y)^2 - \sum_t w_t (f_t - f)^2$   (1)

where $f$ is a convex combination ($\sum_t w_t = 1$) of the base models $f_t$:

$f = \sum_t w_t f_t$   (2)
This result directly shows the effect of error variability among the base models. The first term in (1) is the weighted average error of the base models. The second is the ambiguity term, measuring the amount of variability among the ensemble member answers for the considered pattern. Since this term is always positive, it is subtracted from the first term, which means that the ensemble is guaranteed a lower error than the average individual error. The larger the ambiguity term (i.e., error diversity), the larger the ensemble error reduction. However, as the variability of the individual models rises, so does the value of the first term. Therefore, in order to achieve the lowest ensemble error, one should strike the right balance between error diversity and individual model accuracy (similar to the bias-variance compromise for a single model). The same is conceptually true for classification ensembles (Hansen & Salamon, 1990).

The above discussion offers some clues about why and when model combination can work in practice. However, the most important practical question is how to construct robust and accurate ensemble learning methods. In econometric applications, the main focus is usually on equal weight and Bayesian model combination methods (e.g., Granger, 1989; Clemen, 1989). These methods provide a simple and well-grounded procedure for the combination of base models chosen by a practitioner or a researcher. The accuracy of the final ensemble crucially depends on the accuracy and diversity of the base models. In some cases, a priori knowledge and problem-dependent heuristics can help to choose an appropriate pool of base models. However, equal weight and Bayesian combination methods do not provide any algorithmic procedures for the search, discovery, and building of base models that are suitable for productive combination in an ensemble. Such procedures would be especially important in complex high-dimensional problems with limited a priori knowledge and a lack of simple heuristics. Without such algorithms for the automatic generation of models with significant error diversity, it could be difficult or impossible to create powerful and compact ensembles from simple base models.
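As a concrete illustration of the decomposition (1), the following minimal check (a sketch using only NumPy; the model outputs, weights, and target are arbitrary illustrative values, not taken from this study) verifies numerically that the ensemble error equals the weighted average error minus the ambiguity term:

import numpy as np

# Illustrative check of the ambiguity decomposition (1) at a single data point.
y = 1.0
f_t = np.array([0.6, 1.4, 0.9])               # base-model predictions at one point
w_t = np.array([0.5, 0.3, 0.2])               # convex weights (sum to one)

f_bar = np.sum(w_t * f_t)                     # ensemble prediction, Eq. (2)
ensemble_err = (f_bar - y) ** 2               # left-hand side of Eq. (1)
avg_err = np.sum(w_t * (f_t - y) ** 2)        # weighted average error of the base models
ambiguity = np.sum(w_t * (f_t - f_bar) ** 2)  # diversity (ambiguity) term

print(ensemble_err, avg_err - ambiguity)      # the two numbers coincide (0.01 here)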
In the modern machine learning literature, the main focus is on ensemble learning algorithms suited for challenging problems dealing with a large amount of noise, a limited amount of training data, and high-dimensional patterns (e.g., Ratsch, 2001; Buhlmann, 2003). Several modern ensemble learning techniques relevant for these types of applications are based on training data manipulation as a source of base models with significant error diversity. These include such algorithms as bagging ("bootstrap aggregation"), cross-validating committees, and boosting (e.g., Ratsch, 2001; Witten & Frank, 2000; Hastie et al., 2001, and references therein).

Bagging is a typical representative of "random sample" techniques in ensemble construction. In bagging, instances are randomly sampled, with replacement, from the original training dataset to create a bootstrap set of the same size (e.g., Witten & Frank, 2000; Hastie et al., 2001). By repeating this procedure, multiple training data sets are obtained. The same learning algorithm is applied to each data set and multiple models are generated. Finally, these models are linearly combined as in (2) with equal weights. Such a combination reduces the variance part of the model error and the instability caused by training set incompleteness. Unlike basic equal weight and Bayesian combination methods, bagging offers a direct procedure to build base models with potentially significant error diversity (the last term in (1)). Recently, a number of successful econometric applications of bagging have been reported and compared with other model combination techniques (e.g., Kilian & Inoue, 2004). Bagging exploits the instability inherent in learning algorithms. For example, it can be successfully applied to NN-based models. However, bagging is not efficient for algorithms that are stable, i.e., whose output is not sensitive to small changes in the input (e.g., parsimonious parametric models). Bagging is also not suitable for consistent bias reduction.
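The bagging procedure can be summarized in a few lines. The sketch below assumes only NumPy and a user-supplied factory make_model returning any estimator with fit/predict methods (an illustrative assumption, not a requirement of any specific library):

import numpy as np

def bagged_ensemble(X, y, make_model, n_models=25, rng=None):
    """Bagging: train base models on bootstrap resamples of the training set
    and average them with equal weights, as in Eq. (2)."""
    rng = np.random.default_rng(rng)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # bootstrap sample, with replacement
        m = make_model()                   # any estimator exposing fit/predict
        m.fit(X[idx], y[idx])
        models.append(m)
    # equal-weight combination of the base models
    return lambda X_new: np.mean([m.predict(X_new) for m in models], axis=0)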
Intuitively, combining multiple models helps when these models are significantly different from one another and each one treats a reasonable portion of the data correctly. Ideally, the models should complement one another, each being an expert in a part of the domain where the performance of the other models is not satisfactory. The boosting method for combining multiple models exploits this insight by explicitly seeking and/or building models that complement one another (Valiant, 1984; Schapire, 1992; Ratsch, 2001). Unlike bagging, boosting is iterative. Whereas in bagging individual models are built separately, in boosting each new model is influenced by the performance of those built previously. Boosting encourages new models to become experts for instances handled incorrectly by earlier ones. A final difference is that boosting weights the obtained models by their performance, i.e., the weights are not equal as in bagging. Unlike bagging and similar "random sample" techniques, boosting can reduce both the bias and variance parts of the model error.

The initial motivation for boosting was a procedure that combines the outputs of many "weak" classifiers to produce a powerful committee (Valiant, 1984; Schapire, 1992; Ratsch, 2001). The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers. The predictions from all of them are then combined through a weighted majority vote to produce the final prediction. As iterations proceed, observations that are difficult to classify correctly receive an ever-increasing influence through larger weight assignment. Each successive classifier trained on the weighted error function is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence. Empirical comparative studies of different ensemble methods often indicate superiority of boosting over other techniques (Drucker et al., 1993, 1994; Schwenk & Bengio, 1997; Opitz & Maclin, 1999). Using probably approximately correct (PAC) theory, it was shown that if the base learner is just slightly better than random guessing, AdaBoost is able to construct an ensemble with arbitrarily high accuracy (Valiant, 1984). Thus, boosting can be effectively used to construct powerful ensembles from the very simplistic "rules of thumb" known in the considered field.

The distinctive features of boosting can also be illustrated from the error diversity point of view. In constructing the ensemble, an algorithm can either take information about error diversity into account or not. In the first case, the algorithm explicitly tries to optimize some measure of diversity while building the ensemble. This allows ensemble learning techniques to be categorized as explicit and implicit diversity methods. While implicit methods rely on randomness to generate diverse trajectories in the model space, explicit methods deterministically choose different paths in the space. In this context, bagging is an implicit method. It randomly samples the training patterns to produce different sets for each model, and no measure is taken to ensure the emergence of error diversity. On the other hand, boosting is an explicit method. It directly manipulates the training data distributions through specific weight changes to ensure some form of diversity in the set of base models. As mentioned before, equal weight and Bayesian model combination methods do not provide any direct algorithmic tools to manipulate the error diversity of the base models.

Boosting also has a tremendous advantage over other ensemble methods in terms of interpretation. At each iteration, boosting trains new models or searches the pool of existing models for those that are complementary to the already chosen models. This often leads to a very compact but accurate ensemble with a clear interpretation in problem-specific terms. This could also
provide an important byproduct in the form of automatic discovery of complementary models that may have value of their own in the area of application. Boosting also offers a flexible framework for the incorporation of other ensemble learning techniques. For example, at each boosting iteration, instead of taking just one best model, one can use a mini-ensemble of models that are chosen or built according to some other ensemble learning technique. From my empirical experience with boosting, I find it useful to form an equal weight ensemble of several comparable best models at each boosting iteration. Often, this leads to superior out-of-sample performance compared to the standard boosted ensemble. Of course, there are many different ways to combine boosting with other ensemble techniques that can be chosen according to the specifics of the application.

So far, the main boosting features have been illustrated through its relation to other ensemble learning methods. However, one of the most distinctive and important features of boosting relates it to the large margin classifiers. It was found that a boosted ensemble is a large margin classifier and that SVM and AdaBoost are intimately related (Schapire et al., 1998). Both boosting and SVM can be viewed as attempting to maximize the margin, except that the norm used by each technique and the optimization procedures are different (Ratsch, 2001). It is beyond the scope of this paper to review the theory of margins and structural risk minimization that is the foundation of large margin classifiers, including SVM (Vapnik, 1995, 1998). However, it should be mentioned that the margin of a data example is its minimal distance (in the chosen norm) to the classifier decision boundary. Large margin classifiers attempt to find a decision boundary with a large (or maximum) margin for all training data examples. Intuitively, this feature should improve the generalization ability (i.e., out-of-sample performance) of the classifier. A rigorous treatment of this topic can be found in Vapnik (1995, 1998) and Ratsch (2001).

The relation to large margin classifiers partially explains the robust generalization ability of boosting. It also clarifies its ability to work with high-dimensional patterns. Many traditional statistical and machine learning techniques often suffer from the "curse of dimensionality" (Bishop, 1995), i.e., an inability to handle high-dimensional inputs. The upper bound of the boosting generalization error depends on the margin and not on the dimensionality of the input space (Ratsch, 2001). Hence, it is possible to handle high-dimensional data and learn efficiently if the data can be separated with a large margin. It has been shown that boosting's superiority over other techniques is more pronounced in high-dimensional problems (e.g., Buhlmann, 2003).
3. ADAPTIVE BOOSTING FOR CLASSIFICATION

In this section, the technical details of a typical boosting algorithm and its properties are presented. For clarity, I will describe boosting only for the two-class classification problem. A standard formalization of this task is to estimate a function $f: X \to Y$, where $X$ is the input domain (usually $R^n$) and $Y = \{-1, +1\}$ is the set of possible class labels. An example $(x_i, y_i)$ is assigned to the class +1 if $f(x) \ge 0$ and to the class -1 otherwise. The optimal function $f$ from a given function set $F$ is found by minimizing an error function (empirical risk) calculated on a training set $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ with a chosen loss function $g(y, f(x))$:

$\epsilon[f, S] = \frac{1}{N} \sum_{n=1}^{N} g(y_n, f(x_n))$   (3)

Unlike the typical choice of squared loss in regression problems, here one usually considers the 0/1 loss:

$g(y, f(x)) = I(-y f(x))$   (4)

where $I(z) = 0$ for $z < 0$ and $I(z) = 1$ otherwise. In a more general case, a cost measure of interest can be introduced through the multiplication of $g(y_n, f(x_n))$ by the normalized weight $w_n$ of the training example.

Regularized AdaBoost for two-class classification consists of the following steps (Ratsch, 2001; Ratsch et al., 1999, 2001):

$w_n^{(1)} = 1/N$   (5)

$\epsilon_t = \sum_{n=1}^{N} w_n^{(t)} I(-y_n h_t(x_n))$   (6)

$\gamma_t = \sum_{n=1}^{N} w_n^{(t)} y_n h_t(x_n)$   (7)

$\alpha_t = \frac{1}{2} \log \frac{1 + \gamma_t}{1 - \gamma_t} - \frac{1}{2} \log \frac{1 + C}{1 - C}$   (8)

$w_n^{(t+1)} = w_n^{(t)} \exp[-\alpha_t y_n h_t(x_n)] / Z_t$   (9)

$f(x) = \frac{1}{\sum_{t=1}^{T} \alpha_t} \sum_{t=1}^{T} \alpha_t h_t(x)$   (10)
Here $N$ is the number of training data points, $x_n$ the model/classifier input of the $n$th data point, $y_n$ the corresponding class label (i.e., -1 or +1), $T$ the number of boosting iterations, $w_n^{(t)}$ the weight of the $n$th data point at the $t$th iteration, $Z_t$ the weight normalization constant at the $t$th iteration, $h_t(x_n) \in [-1, +1]$ the best base hypothesis (model) at the $t$th iteration, $C$ a regularization (soft margin) parameter, and $f(x)$ the final weighted linear combination of the base hypotheses (models).

Boosting starts with equal and normalized weights for all training data (step (5)). A base classifier (model) $h_t(x)$ is trained using the weighted error function $\epsilon_t$ (step (6)). If a pool of several types of base classifiers is used, then each of them is trained and the best one (according to $\epsilon_t$) is chosen at the current iteration. The training data weights for the next iteration are computed in steps (7)-(9). It is clear from step (9) that at each boosting iteration, data points misclassified by the current best hypothesis (i.e., $y_n h_t(x_n) < 0$) are penalized by a weight increase for the next iteration. In subsequent iterations, AdaBoost constructs progressively more difficult learning problems that are focused on hard-to-classify patterns. This process is controlled by the weighted error function (6). Steps (6)-(9) are repeated at each iteration until one of the stop criteria $\gamma_t \le C$ (i.e., $\epsilon_t \ge \frac{1}{2}(1 - C)$) or $\gamma_t = 1$ (i.e., $\epsilon_t = 0$) occurs. The first stop condition means that, for the current classification problem, a classifier with accuracy better than random guessing (corrected by the regularization multiplier $(1 - C)$) cannot be found. The second stop condition means that a perfect classifier has been found and the formulation of the next iteration's classification problem focusing on misclassified examples is not needed. Step (10) represents the final combined (boosted) model that is ready to use. The model classifies an unknown example as class +1 when $f(x) > 0$ and as -1 otherwise. It should be noted that the performance of the final ensemble (10) is evaluated according to the original error function given by (3) and (4) (not by (6)).

Details of more general versions of the regularized AdaBoost and its extensions are given in Ratsch (2001). The original AdaBoost algorithm (Freund & Schapire, 1997) is recovered from (5)-(10) for $C = 0$. Similar to SVM, the regularization parameter $C$ represents the so-called soft margin. A soft margin is required to accommodate cases where it is not possible to find a boundary that fully separates data points from different classes. Regularization (soft margin) is especially important for financial applications, where a large noise-to-signal ratio is typical.
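To make the steps concrete, below is a minimal sketch of (5)-(10) in Python (NumPy only). It follows the variant used later in this article, where the base hypotheses form a fixed, countable pool of callables returning values in [-1, +1]; in the more conventional setting, h_t would instead be retrained on the weighted data at each iteration. All names are illustrative:

import numpy as np

def regularized_adaboost(X, y, model_pool, T=50, C=0.0):
    """Sketch of the regularized AdaBoost of Eqs. (5)-(10) for labels y in {-1,+1}.
    `model_pool` is a fixed list of callables h(X) -> values in [-1, +1]; at each
    iteration the pool member with the smallest weighted error is selected."""
    N = len(y)
    w = np.full(N, 1.0 / N)                          # step (5): equal initial weights
    chosen, alphas = [], []
    for _ in range(T):
        # step (6): weighted 0/1 error of every candidate; pick the best one
        errors = [np.sum(w * (y * h(X) <= 0)) for h in model_pool]
        h = model_pool[int(np.argmin(errors))]
        gamma = np.sum(w * y * h(X))                 # step (7): edge of the best model
        if gamma <= C or gamma >= 1.0:               # stop criteria discussed above
            break
        # step (8): model weight with the soft-margin correction
        alpha = (0.5 * np.log((1 + gamma) / (1 - gamma))
                 - 0.5 * np.log((1 + C) / (1 - C)))
        # step (9): re-weight the training points and normalize
        w = w * np.exp(-alpha * y * h(X))
        w /= w.sum()
        chosen.append(h)
        alphas.append(alpha)
    if not alphas:
        raise RuntimeError("no base model beat the regularized random-guess threshold")
    a = np.array(alphas)

    # step (10): normalized weighted combination; classify by the sign of f(x)
    def f(X_new):
        return np.sum([ai * hi(X_new) for ai, hi in zip(a, chosen)], axis=0) / a.sum()
    return f

The only design choice made here beyond (5)-(10) is the argmin over a finite pool; replacing that line with a call that fits a weak learner on the weighted sample recovers the conventional formulation.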
One of the most important properties of AdaBoost is its fast convergence to a hypothesis consistent with the training sample, provided the base learning algorithm produces hypotheses with error rates consistently smaller than 1/2. A theorem on the exponential convergence of the original AdaBoost ($C = 0$) states that the error (given by (3) and (4)) of the ensemble (10) on the training set is bounded above by $2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}$ (Freund & Schapire, 1997). Thus, if the error rate in each iteration is bounded from above by $\epsilon_t \le \frac{1}{2} - \frac{1}{2}\mu$ (for some $\mu > 0$), then the training error decreases exponentially in the number of iterations: $\exp(-T\mu^2/2)$. This and the other theorems for the original AdaBoost have been generalized to the regularized case ($C > 0$) (Ratsch, 2001).

Motivated by boosting's success in the classification setting, a number of researchers have attempted to transfer boosting techniques to regression problems, and a number of different boosting-like algorithms for regression have been proposed (e.g., Drucker, 1997; Friedman, 1999; Ratsch, 2001). However, for clarity and owing to the existence of many practically important classification problems, I focus on boosting for classification in this article.
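As a quick numerical illustration of the training-error bound quoted above (a sketch that assumes a constant edge, $\epsilon_t = 1/2 - \mu/2$ at every iteration, purely for illustration), the two expressions can be compared directly:

import numpy as np

mu, T = 0.1, 100
eps = np.full(T, 0.5 - 0.5 * mu)
bound = 2.0 ** T * np.prod(np.sqrt(eps * (1.0 - eps)))  # 2^T * prod_t sqrt(eps_t (1 - eps_t))
print(bound, np.exp(-T * mu ** 2 / 2.0))                # ~0.605 vs. ~0.607: exponential decay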
4. BOOSTING FRAMEWORKS IN FINANCIAL AND ECONOMETRIC APPLICATIONS

Several conceptually different boosting frameworks can be proposed for financial and econometric applications. In the most straightforward way, boosting can be employed as yet another advanced data-driven algorithm expanding the collection of tools used for building empirical models. However, models built in this way would still have a "black-box" nature and potentially limited stability. In our opinion, the most appealing feature of boosting, especially for financial applications, is that it combines the power of an advanced learning algorithm with the ability to use existing models specific to the field of interest. This important feature has never been addressed in the machine learning and related literature on boosting. The mentioned combination can easily be achieved by using well-understood and accepted models as base models (hypotheses) in a boosting framework. This allows a priori problem-specific knowledge to be integrated in a natural way, which can enhance the final model's performance and stability, especially in cases of severe incompleteness of the training data. Moreover, since the final model is a weighted linear combination of industry-standard components, the conclusions of such a model will be better accepted by practitioners and decision makers compared to a pure "black-box" model. In the following, only examples of this type of boosting application will be discussed.
Any mature field of quantitative finance has its own set of established and well-understood models. In many cases it is very desirable to improve the accuracy of these models without losing their clarity and stability. In the following, I will give several examples from different fields. One of them (symbolic volatility forecasting) will be used as an illustration of a realistic boosting application in the next section, where encouraging results and all steps of the boosting application to this practically important problem will be presented.

4.1. Typical Classification Problems

It is convenient to distinguish four large groups of potential boosting frameworks in quantitative finance and financial econometrics. The first group includes well-defined classification models that can be used as base hypotheses in the standard boosting framework ((5)-(10)). For example, bankruptcy prediction models based on accounting and market data are used to classify companies as bankrupt/non-bankrupt (usually for a time horizon of 1 year). These models are based on linear discriminant analysis (LDA), logit, and other statistical and machine learning frameworks (e.g., Hillegeist, Keating, Cram, & Lundstedt, 2004; Fan & Palaniswami, 2000; Fanning & Cogger, 1994). Typical industry-standard models include Altman's Z-score (LDA) (Altman, 1968), the Ohlson model (logit) (Ohlson, 1980), and variations of the Black-Scholes-Merton model (Hillegeist et al., 2004). One can use the same inputs (financial ratios) and framework (LDA, logit, etc.) as in the well-accepted models and build a classifier on a weighted training set at every boosting iteration. The obtained boosted combination will still be based on a trusted and simple framework while it may have significantly better performance compared to the single base model.
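As an illustration of this first group, the sketch below fits a logit-type bankruptcy classifier on the AdaBoost-weighted training set at a single boosting iteration; it assumes scikit-learn's LogisticRegression purely for convenience, and the specific inputs (financial ratios) and labels are placeholders rather than a prescription of this article:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weighted_logit(X, y, w):
    """One boosting iteration for a logit-style bankruptcy classifier: fit on the
    current AdaBoost weights w and return a hypothesis h(X) -> {-1, +1}.
    X holds the chosen financial ratios; y is +1 (bankrupt) / -1 (non-bankrupt)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, (y > 0).astype(int), sample_weight=w)   # weighted training set
    return lambda X_new: 2 * clf.predict(X_new) - 1    # map {0, 1} back to {-1, +1}

The returned callable can be used directly as h_t in the boosting loop sketched in Section 3, with the weighted refit replacing the fixed-pool search.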
4.2. Symbolic Time Series Forecasting

Prediction of financial time series (stocks, market indexes, foreign exchange rates, etc.) and their volatilities is one of the most challenging and generally unresolved problems. A time series predictor with sufficient accuracy is a desired component in both trading and risk management applications. In many practical cases, forecasting of the actual future value of the time series is not required. Instead, it is enough to forecast a symbolically encoded value. For example, the model can predict whether the future value will be supercritical or subcritical relative to some threshold value, or just the direction of change (up or down). In many practical problems, switching from the full regression problem (i.e., actual value prediction) to the symbolic encoding allows the accuracy of the prediction to be increased, since practically unimportant small fluctuations and/or noise are removed by this procedure (Gavrishchaka & Ganguli, 2001b; Tino, Schittenkopf, Dorffner, & Dockner, 2000). The classification problem that is obtained from the original time series forecasting problem by symbolic encoding is a second group suited for the standard boosting framework (5)-(10). Boosting can combine existing analytical and other parsimonious time series models. For example, the GARCH-type framework (Engle, 1982; Bollerslev, 1986) is an industry standard for volatility modeling. GARCH-type models can serve as the base hypotheses pool in a boosting framework for volatility forecasting. Details and preliminary results of the boosting application to symbolic volatility forecasting will be discussed in the next section.

4.3. Portfolio Strategy Discovery and Optimization

Discovery and optimization of portfolio strategies is an important area where boosting may also prove to be useful. Simple trading rules based on well-known market indicators often have significant performance and stability limitations. They can also be used by many market participants, easily making the market efficient with respect to these sets of trading strategies. On the other hand, adaptive strategies, capable of utilizing more subtle market inefficiencies, are often of the "black-box" type with limited interpretability and stability. Combining basic (intuitive) trading strategies in a boosting framework may noticeably increase return (decrease risk) while preserving the clarity of the original building blocks.

Straightforward application of the standard boosting framework (5)-(10) to trading strategy optimization is not possible, since it is a direct optimization rather than a classification problem. However, for a large class of optimization objective functions, boosting for classification can easily be transformed into a novel framework that can be called "boosting for optimization." For example, one can require the returns (r) generated by the strategy on a given time horizon (t) to be above a certain threshold (r_c). By calculating strategy returns on a series of shifted intervals of length t and using a two-class encoding (r >= r_c and r < r_c), one obtains a symbolically encoded time series as previously discussed (see the sketch at the end of this section). The encoding can be based on a more sophisticated condition that combines different measures of profit maximization and risk minimization according to the utility function of interest.

Contrary to the symbolic time series forecasting discussed earlier, here the purpose is not correct classification, but rather maximization of one class
(i.e., r >= r_c) population. Formally, one still has a classification problem where the boosting framework (5)-(10) can be applied. However, in this case, one has a very uneven sample distribution between the two classes. Therefore, not all theoretical results that have been rigorously proven for boosting can automatically be assumed to be valid here. Nevertheless, our preliminary empirical studies indicate consistent stability and robustness of the boosting-for-optimization framework when applied to technical stock/index trading. In the case of trading strategy optimization, the usage of the boosting output is also different. Instead of using the weighted linear combination of the base hypotheses as a model for classification, one should use the boosting weights for strategy portfolio allocation. This means that the initial capital should be distributed among the different base strategies in amounts according to the weights obtained from boosting. Details of the boosting application to portfolio strategy optimization will be given in our future work.

4.4. Regression Problems

Finally, the fourth group of applications explicitly requires boosting for regression (Ratsch, 2001). For example, time series forecasting models can be combined in their original form without casting them into a classification problem through symbolic encoding. Another practically important application of boosting for regression may be the construction of the pricing surface for complex derivative instruments using simplified analytical or numerical valuation models as base hypotheses. This may significantly improve the representation accuracy of the real market pricing surface compared to a single pricing model. A boosting-based pricing framework could be especially useful for less common or novel derivative instruments where the valuation models used by different market participants are not yet standardized. This type of boosting application can be considered as hybrid derivative pricing, in contrast to pure empirical pricing approaches based on machine learning algorithms (e.g., Hutchinson, Lo, & Poggio, 1994; Gencay & Qi, 2001).
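To make the strategy-encoding and allocation ideas of Section 4.3 concrete, here is a minimal sketch (the use of summed log returns, the horizon, and the threshold are illustrative assumptions, not specifications from this article):

import numpy as np

def encode_strategy_returns(returns, horizon, r_c):
    """Two-class encoding of strategy performance on shifted windows of length
    `horizon`: +1 if the cumulative (log) return is at least r_c, -1 otherwise."""
    r = np.asarray(returns, dtype=float)
    cum = np.array([r[i:i + horizon].sum() for i in range(len(r) - horizon + 1)])
    return np.where(cum >= r_c, 1, -1)

def allocate_capital(alphas, capital):
    """Distribute the initial capital across the base strategies according to the
    (normalized) boosting weights alpha_t."""
    a = np.asarray(alphas, dtype=float)
    return capital * a / a.sum()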
5. SYMBOLIC VOLATILITY FORECASTING

In this section, I consider a boosting-based framework for symbolic volatility forecasting together with all the details of its implementation and application. Stock market index and foreign exchange rate volatilities are very
important quantities for derivative pricing, portfolio risk management, and as one of the components used for decision making in trading systems. Volatility may also be the main component in quantitative volatility trading (e.g., Dunis & Huang, 2002).

I start with a short overview of existing deterministic volatility models and their limitations; stochastic volatility models are not considered here. A common example of the deterministic volatility models is the autoregressive conditional heteroskedasticity (ARCH) family (Engle, 1982; Bollerslev, 1986). These models assume a particular stochastic process for the returns and a simple functional form for the volatility. Volatility in these models is an unobservable (latent) variable. The most widely used model of this family is the generalized ARCH (GARCH) process (Bollerslev, 1986). The GARCH(p,q) process defines volatility as

$\sigma_t^2 = a_0 + \sum_{i=1}^{p} a_i r_{t-i}^2 + \sum_{i=1}^{q} b_i \sigma_{t-i}^2$   (11)

where the return process is defined as

$r_t = \sigma_t z_t$   (12)

Here $z_t$ is an identically and independently distributed (i.i.d.) random variable with zero mean and variance 1. The most common choice for the return stochastic model ($z_t$) is a Gaussian (Wiener) process. The parameters $a_i$ and $b_i$ are estimated from historical data by maximizing the likelihood function (LF), which depends on the assumed return distribution.

To resolve some of the GARCH limitations, a number of extensions have been proposed and used by practitioners. For example, since the GARCH model depends only on the absolute values of the returns ($r_{t-i}^2$), it does not cover the leverage effect. Different forms of the leverage effect have been introduced in the EGARCH (Nelson, 1991), TGARCH (Zakoian, 1994), and PGARCH (Ding, Granger, & Engle, 1993) models given below:

$\ln(\sigma_t^2) = a_0 + \sum_{i=1}^{p} a_i \frac{|r_{t-i}| + g_i r_{t-i}}{\sigma_{t-i}} + \sum_{i=1}^{q} b_i \ln(\sigma_{t-i}^2)$   (13)

$\sigma_t^2 = a_0 + \sum_{i=1}^{p} a_i r_{t-i}^2 + \sum_{i=1}^{p} g_i S_{t-i} r_{t-i}^2 + \sum_{i=1}^{q} b_i \sigma_{t-i}^2$   (14)

$\sigma_t^d = a_0 + \sum_{i=1}^{p} a_i (|r_{t-i}| + g_i r_{t-i})^d + \sum_{i=1}^{q} b_i \sigma_{t-i}^d$   (15)
Here $g_i$ is a leverage parameter. For the TGARCH model, $S_{t-i} = 1$ for $r_{t-i} < 0$ and $S_{t-i} = 0$ otherwise. For PGARCH, $d$ is a positive number that can also be estimated together with the $a_i$, $b_i$, and $g_i$ coefficients.

GARCH and its extensions are the most common choices of volatility model among practitioners. The GARCH process can reproduce a number of known stylized volatility facts, including clustering and mean reversion (e.g., Engle & Patton, 2001; Tsay, 2002). Explicit specification of the stochastic process and the simplified (linear) functional form for the volatility allow simple analysis of the model properties and its asymptotic behavior. However, the assumptions of the ARCH-type models also impose significant limitations. Model parameter estimation from market data is practical only for low-order models (small p and q), i.e., in general, it is difficult to capture direct long memory effects. Volatility multiscale effects are also not covered (Dacorogna, Gencay, Muller, Olsen, & Pictet, 2001). Finally, the model outputs an unobservable quantity, which makes it difficult to quantify the prediction accuracy and compare with other models. Some of these restrictions are relaxed in the GARCH model extensions. However, the majority of the limitations mentioned cannot be resolved in a self-consistent fashion.
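For reference, a minimal sketch of the GARCH(1,1) special case of (11), used later as one of the base models, is shown below. The parameters are assumed to have been estimated beforehand (e.g., by maximum likelihood), and the initialization with the unconditional sample variance is a common convention rather than a prescription of this article:

import numpy as np

def garch11_volatility(returns, a0, a1, b1):
    """GARCH(1,1) filter from Eq. (11): sigma_t^2 = a0 + a1*r_{t-1}^2 + b1*sigma_{t-1}^2.
    Returns the in-sample conditional volatility path sigma_t."""
    sigma2 = np.empty(len(returns))
    sigma2[0] = np.var(returns)        # common initialization: unconditional variance
    for t in range(1, len(returns)):
        sigma2[t] = a0 + a1 * returns[t - 1] ** 2 + b1 * sigma2[t - 1]
    return np.sqrt(sigma2)

The next-day variance forecast used for the threshold classification below is then simply a0 + a1*r_T^2 + b1*sigma_T^2, i.e., one more application of the same recursion.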
All the advantages and limitations of the GARCH-type models mentioned in the previous two paragraphs suggest that this framework, accepted by many practitioners, is theoretically sound and clear and can reproduce the major signatures empirically observed in real time series. On the other hand, its forecasting accuracy may be insufficient for a number of important applications and regimes (e.g., Gavrishchaka & Ganguli, 2003). Accuracy can be improved by using more complex models that are often of the "black-box" type. For example, we have recently proposed an SVM-based volatility model (Gavrishchaka & Ganguli, 2003) with encouraging results for both foreign exchange and stock market data. Previously, different NN-based frameworks have also been proposed for the same purpose (Donaldson & Kamstra, 1997; Schittenkopf, Dorffner, & Dockner, 1998). However, the limited stability and explanation facility of these "black-box" frameworks may restrict their usage to that of complementary models (Gavrishchaka & Ganguli, 2003). Boosting offers an alternative that combines the advantages of accepted industry models such as GARCH with a significant accuracy improvement.

In many practical applications, symbolic volatility forecasting may be acceptable and even desirable due to its effective filtering of noise and unimportant small fluctuations. For example, for many trading and risk management purposes, it is important (and sufficient) to know whether the future volatility value will be above or below a certain threshold. This problem is of classification type, and the standard boosting framework (5)-(10) can be applied. Below I give a detailed example of the boosting application to this problem.

I consider a boosting-based framework for symbolic (threshold) volatility forecasting with a base hypotheses pool consisting of the GARCH-like models (Eqs. (11)-(15)) of different types and input dimensions. In the following, I restrict the pool to 100 models that are obtained from all possible combinations of the model types (GARCH, EGARCH, TGARCH, and PGARCH), return input dimension (from 1 to 5), and σ input dimension (from 1 to 5). In this case, the number of possible base hypotheses at each boosting iteration is countable and fixed. Therefore, the best model at each iteration is just the one of the 100 models with the minimal classification error on the weighted training set. This differs from the more conventional usage of boosting, where the available base models are not countable and, at each iteration, the best model is obtained by training base models (NN, classification tree, etc.) on a weighted training set (Ratsch, 2001).

As a proxy for the realized volatility that will be compared with a given threshold to give a supercritical/subcritical classification (encoded as "+1" and "-1"), the standard deviation of the daily returns computed over the last five business days is used. The threshold value of the daily realized volatility is taken to be 2%, which corresponds to 32% annualized volatility. Although the exact threshold value is problem dependent, the threshold considered here is a reasonable boundary between the low volatility regime (usually taken to be 20-25% of annual volatility) and higher volatility regimes.
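A minimal sketch of this encoding step (daily log returns, a five-day standard deviation as the realized-volatility proxy, and the 2% threshold; the function and variable names are illustrative):

import numpy as np

def realized_vol_labels(prices, window=5, threshold=0.02):
    """Symbolic encoding used below: daily log returns, realized volatility proxied
    by the standard deviation over the last `window` business days, and a +1/-1
    label for volatility above/below the daily threshold (2% by default)."""
    returns = np.diff(np.log(prices))                  # r_i = ln(S_i / S_{i-1})
    vol = np.array([returns[i - window:i].std()
                    for i in range(window, len(returns) + 1)])
    labels = np.where(vol >= threshold, 1, -1)
    return returns, vol, labels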
It is difficult to find an adequate measure of realized volatility that can be used to test GARCH model performance (e.g., Tsay, 2002). The reason is that the GARCH model outputs instantaneous (i.e., "true") volatility, which is a latent (non-observable) variable. Any volatility measure computed on a real data set will be an approximation due to volatility time dependence. In many academic studies, the performance of GARCH-type models is measured by comparing the model output $\sigma_i^2$ with the single-day squared return $r_i^2$ through the autocorrelation of the $r_i^2/\sigma_i^2$ time series (Tsay, 2002). However, although the usage of $r_i^2$ is appealing due to its direct theoretical meaning ($r_i^2/\sigma_i^2$ should approach i.i.d. according to (12)), the single-day $r_i^2$ is never used as a volatility measure in practice. The daily $r_i^2$ time series is usually extremely noisy and can have large magnitudes that are unattainable by any practical volatility model. A volatility measure based on a five-day standard deviation significantly removes the noise and makes the magnitude comparable to GARCH outputs, while still preserving the instantaneous nature to some extent. On the other hand, a standard deviation based on a large data interval gives just the unconditional averaged volatility, which cannot be compared with the GARCH conditional volatility. The choice of the realized volatility measure should be imposed by the requirements of the particular application. The purpose of boosting is not a "pure" test of the GARCH framework as a model for the "true" (latent) volatility. Instead, GARCH models are used as base experts that capture the main generic features of the volatility, and boosting is used to make an accurate forecast of a particular practically important measure that is related, but not identical, to the "true" volatility.

I have applied the boosting framework for symbolic volatility forecasting to the last several years of IBM stock data. The daily closing prices adjusted for splits and dividends from finance.yahoo.com have been used. For benchmark purposes, the boosting results are compared to the best model in the base hypothesis collection (as determined on the training set) and to the industry-standard GARCH(1,1) model. To demonstrate the sensitivity of the results to market conditions, data samples shifted in time are used. The first data sample, corresponding to shift = 0 in all figures, includes 3 years of IBM stock daily return data (April 18, 2001 to March 4, 2004). The return is computed as $r_i = \ln(S_i/S_{i-1})$, where $S_i$ is the stock closing price on day i. The first 2/3 of the data sample is used as training data and the remaining part as test data. All subsequent samples are obtained by shifting backward in time by 10 business days. For simplicity and clarity, I do not use a validation set that would allow choosing the best regularization parameter C in the regularized AdaBoost algorithm (5)-(10). Instead, I use a fixed value C = 0.05 that shows good performance across all samples. An adaptive choice of C should only improve the performance of the boosting-based model.

In Fig. 1a, I compare the performance of the boosting model (solid line), the single best model from the base model collection (dashed line), and the GARCH(1,1) model (dotted line) on the training sets of all shifted data intervals (samples). As a performance measure, I use the hit rate, defined as the number of correctly classified data points divided by the total number of data points (i.e., the maximum possible hit rate is 1). This measure is directly related to the standard classification error function given by (3) and (4).

Fig. 1. (a) Hit Rate of the Boosting Model (Solid Line), Single Best Model from the Base Model Collection (Dashed Line), and GARCH(1,1) Model (Dotted Line) on the Training Sets of All Shifted Data Intervals; (b) Percentage Change of the Boosting Model Hit Rate Relative to that of the Best Single Model (Training Sets); (c) Percentage Change of the Boosting Model Hit Rate Relative to that of the GARCH(1,1) Model (Training Sets).
It is clear that the boosting models produce the maximum hit rates at all times. For some samples, the boosting performance is significantly superior to the other two models. Note that the best single model (i.e., its type and input dimensions) changes from sample to sample; the best model is the model with the maximum hit rate for a given sample. To further quantify the boosting model's superiority over the two benchmark models, I compute the percentage change of the boosting model hit rate relative to the benchmark model. This number is given by 100 × (HR - HR0)/HR0, where HR and HR0 are the hit rates of the boosting and benchmark models, respectively. In Fig. 1b, this relative measure is shown with the best single model as the benchmark. One can see that in some regimes, boosting can enhance performance by up to 25%. The similar result with the GARCH(1,1) model as the benchmark is shown in Fig. 1c. In this case, boosting can increase the hit rate by up to 40%.

Although boosting demonstrated superior performance on the training set, the true measure of model performance is on the test (out-of-sample) data, which demonstrates the generalization ability of the model. In Fig. 2, I show exactly the same measures as in Fig. 1, but for the test sets of the shifted samples. The best single model is still the best model on the training set, as would be the choice in a real application where performance on the test set is not known at the time of training. From Fig. 2, it is clear that the boosting model remains superior to both benchmark models on the test set. Moreover, the relative change in hit rate even increases on the test set compared to the training set. Thus, boosting demonstrates robust out-of-sample performance and confirms its solid generalization ability.

Finally, I would like to illustrate what resources were provided by the pool of base models (hypotheses) such that boosting was able to achieve a significant performance improvement compared to the single base model. Fig. 3 shows the number of models that correctly classify a data point, normalized to the number of all models in the pool. As an example, I consider just one sample with shift = 0 (a: training set, b: test set). It is clear that for every data point there is a significant number of models that classify it correctly. However, if one considers even the best single model, a significant number of points will be misclassified. Fortunately, the majority of the points misclassified by the best model can be correctly classified by some complementary models (which may be noticeably inferior to the best model on the whole set). Thus, boosting robustly finds these complementary models at each iteration, which results in a much more accurate final combined model.
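For completeness, the hit rate and the relative-change measure used in Figs. 1 and 2 can be computed as follows (a trivial sketch; the prediction arrays are assumed to hold the ±1 labels produced by the sign of f(x)):

import numpy as np

def hit_rate(y_true, y_pred):
    """Fraction of correctly classified points (the measure plotted in Figs. 1-2)."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def relative_change(hr, hr_benchmark):
    """Percentage change of the boosting hit rate relative to a benchmark model."""
    return 100.0 * (hr - hr_benchmark) / hr_benchmark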
Fig. 2. The Same Measures as in Fig. 1, but for the Test (Out-of-Sample) Sets.
Fig. 3. Number of the Models that Correctly Classify a Data Point Normalized to the Number of all Models in the Pool for One Sample with Shift = 0 (a: Training Set, b: Test Set).
In the presented illustrative application, I have considered a rather restrictive pool of base volatility models. However, even with that pool, boosting was able to achieve a significant improvement in performance. In practice, one can expand this pool by including more advanced and flexible models such as heterogeneous ARCH (HARCH) (Dacorogna et al., 2001)
(which accounts for multiscale effects), fractionally integrated GARCH (FIGARCH) (Baillie, Bollerslev, & Mikkelsen, 1996) (which models short- and long-term memory), and other models, including those outside the GARCH family. In addition, hybrid volatility models (e.g., Day & Lewis, 1992), which combine delayed historical inputs with available implied volatility data, could also be included in the base model pool. A more detailed study of the boosting-based models for volatility forecasting is warranted.
6. DISCUSSION AND CONCLUSION

Limitations of the existing modeling frameworks in quantitative finance and econometrics, such as the low accuracy of simplified analytical models and the insufficient interpretability and stability of the best data-driven algorithms, have been outlined. I have made the case that boosting (a novel ensemble learning technique) can serve as a simple and robust framework for combining the clarity and stability of analytical models with the accuracy of adaptive data-driven algorithms. A standard boosting algorithm for classification (regularized AdaBoost) has been described and compared to other ensemble learning techniques. I have pointed out that boosting combines the power of a large margin classifier, characterized by robust generalization ability, with an open framework that allows simple and industry-accepted models to be used as building blocks. This distinguishes boosting from other powerful machine learning techniques, such as NNs and SVMs, that often have a "black-box" nature.

Potential boosting-based frameworks for different financial applications have been outlined. Some of them can use the standard boosting framework without modifications; others require reformulation of the original problem to be used with boosting. A detailed description of these frameworks and the results of their application will be presented in our future works.

The details of a typical boosting operation have been demonstrated on the problem of symbolic volatility forecasting for the IBM stock time series, using a two-class threshold encoding of the predicted volatility. It has been shown that the boosted collection of GARCH-type models performs consistently better than both the best single model in the collection and the widely used GARCH(1,1) model. Depending on the regime (period in time), the classification hit rate of the boosted collection on the test (out-of-sample) set can be up to 240% higher than that of the GARCH(1,1) model and up to 45% higher than that of the best single model in the collection. For the training set, these numbers are 40% and 25%, respectively.
Boosting performance is also very stable. In the worst case/regime, performance of the boosted collection just becomes comparable to the best single model in the collection. Detailed comparative studies of boosting and other ensemble learning (model combination) methods for typical econometric applications are beyond the scope of this work. However, to understand the real practical value of boosting and its limitations, such studies should be performed in the future.
ACKNOWLEDGMENTS This work is supported by Alexandra Investment Management. I thank the referee who evaluated this paper and the editor for the valuable comments and suggestions.
REFERENCES

Altman, E. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23, 589–609.
Baillie, R. T., Bollerslev, T., & Mikkelsen, H.-O. (1996). Fractionally integrated generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 74, 3.
Barnett, J. A. (1981). Computational methods for a mathematical theory of evidence. Proceedings of IJCAI, 868.
Bates, J. M., & Granger, C. W. J. (1969). The combination of forecasts. Operational Research Quarterly, 20, 451.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31, 307.
Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation, 11(7), 1493–1518.
Brown, G., Wyatt, J., Harris, R., & Yao, X. (2005). Diversity creation methods: A survey and categorisation. Journal of Information Fusion, 6, 1.
Cai, C. Z., Wang, W. L., Sun, L. Z., & Chen, Y. Z. (2003). Protein function classification via support vector machine approach. Mathematical Biosciences, 185, 111.
Chang, R. F., Wu, W.-J., Moon, W. K., Chou, Y.-H., & Chen, D.-R. (2003). Support vector machine for diagnosis of breast tumors on US images. Academy of Radiology, 10, 189.
Clemen, R. T. (1989). Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5, 559.
Cristianini, N., & Shawe-Taylor, J. (2000). Introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Dacorogna, M. M., Gencay, R., Muller, U., Olsen, R. B., & Pictet, O. V. (2001). An introduction to high-frequency finance. San Diego: Academic Press.
Day, T. E., & Lewis, C. M. (1992). Stock market volatility and the information content of stock index options. Journal of Econometrics, 52, 267.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In: J. Kittler & F. Roli (Eds), First international workshop on multiple classifier systems, Lecture Notes in Computer Science (pp. 1–15). Heidelberg: Springer Verlag.
Ding, Z., Granger, C. W., & Engle, R. F. (1993). A long memory property of stock market returns and a new model. Journal of Empirical Finance, 1, 83.
Donaldson, R. G., & Kamstra, M. (1997). An artificial neural network-GARCH model for international stock return volatility. Journal of Empirical Finance, 4, 17.
Drucker, H. (1997). Improving regressors using boosting techniques. In: D. H. Fisher (Ed.), Proceedings of the fourteenth international conference on machine learning (ICML 1997) (pp. 107–115). Nashville, TN, USA, July 8–12. Morgan Kaufmann, ISBN 1-55860-486-3.
Drucker, H., Cortes, C., Jackel, L. D., LeCun, Y., & Vapnik, V. (1994). Boosting and other ensemble methods. Neural Computation, 6, 1289–1301.
Drucker, H., Schapire, R. E., & Simard, P. Y. (1993). Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7, 705.
Dunis, C. L., & Huang, X. (2002). Forecasting and trading currency volatility: An application of recurrent neural regression and model combination. Journal of Forecasting, 21, 317.
Engle, R. F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of UK inflation. Econometrica, 50, 987.
Engle, R. F., & Patton, A. J. (2001). What good is a volatility model? Quantitative Finance, 1, 237.
Fan, A., & Palaniswami, M. (2000). Selecting bankruptcy predictors using a support vector machine approach. Proceedings of the IEEE-INNS-ENNS international joint conference on neural networks, IJCNN 2000, Neural computing: New challenges and perspectives for the new millennium (Vol. 6, pp. 354–359). Como, Italy, July 24–27.
Fanning, K., & Cogger, K. (1994). A comparative analysis of artificial neural networks using financial distress prediction. International Journal of Intelligent Systems in Accounting, Finance, and Management, 3(3), 241–252.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55, 119.
Friedman, J. H. (1999). Greedy function approximation. Technical report (February). Department of Statistics, Stanford University.
Gardner, A. (2004). A novelty detection approach to seizure analysis from intracranial EEG. Ph.D. thesis, School of Electrical and Computer Engineering, Georgia Institute of Technology.
Gavrishchaka, V. V., & Ganguli, S. B. (2001a). Support vector machine as an efficient tool for high-dimensional data processing: Application to substorm forecasting. Journal of Geophysical Research, 106, 29911.
Gavrishchaka, V. V., & Ganguli, S. B. (2001b). Optimization of the neural-network geomagnetic model for forecasting large-amplitude substorm events. Journal of Geophysical Research, 106, 6247.
Gavrishchaka, V. V., & Ganguli, S. B. (2003). Volatility forecasting from multiscale and high-dimensional market data. Neurocomputing, 55, 285.
Geman, S., Bienenstock, E., & Dourstat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1.
Gencay, R., & Qi, M. (2001). Pricing and hedging derivative securities with neural networks: Bayesian regularization, early stopping and bagging. IEEE Transactions on Neural Networks, 12(4), 726–734.
150
VALERIY V. GAVRISHCHAKA
Gleisner, H., & Lundstedt, H. (2001). Auroral electrojet predictions with dynamic neural networks. Journal of Geophysical Research, 106, 24, 541. Granger, C. W. J. (1989). Combining forecasts – Twenty years later. Journal of Forecasting, 8, 167. Hansen, L. K., & Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 993. Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. Berlin: Springer. Hillegeist, S. A., Keating, E. K., Cram, D. P., & Lundstedt, K. G. (2004). Assessing the probability of bankruptcy. Review of Accounting Studies, 9, 5–34. Kluwer Academic Publishers. Hutchinson, J. M., Lo, A., & Poggio, A. W. (1994). A nonparametric approach to pricing and hedging derivative securities via learning networks. Journal of Finance, 31, 851. Iyer, R. D., Lewis, D. D., Schapire, R. E., Singer, Y., & Singhal, A. (2000). Boosting for document routing, In: Proceedings of the ninth international conference on information and knowledge management. Kilian, L., & Inoue, A. (2004). Bagging time series models. Econometric society 2004 North American summer meetings. Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. Neural Information Processing Systems (NIPS), 7, 231. Nelson, D. B. (1991). Conditional heteroskedasticity in asset returns: A new approach. Econometrica, 59, 347. Ohlson, J. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, 19, 109. Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169. Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. Conference on Computer Vision and Patterns Recognition (CVPR’97). IEEE Computer Society, 130. Ratsch, G. (2001). Robust boosting via convex optimization: Theory and applications. Ph.D. thesis, Potsdam University. Ratsch, G., et al. (1999). Regularizing AdaBoost. In: M. S. Kearns, S. A. Solla & D. A. Cohn (Eds), Advances in neural information processing systems, (Vol. 11, pp. 564–570). Cambridge, MA: MIT Press. Ratsch, G., et al. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287. Ridgeway, G. (1999). The state of boosting. Computing Science and Statistics, 31, 172. Schapire, R. E. (1992). The design and analysis of efficient learning algorithms. Ph.D. thesis, Cambridge, MA: MIT Press. Schapire, R. E., & Singer, Y. (2000). BoostTexter: A boosting-based system for text categorization. Machine Learning, 39, 135. Schapire, R. E., Freund, Y., Bartlett, P. L., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26, 1651. Schittenkopf, C., Dorffner, G., & Dockner, E. J. (1998). Volatility prediction with mixture density networks. In: L. Niklasson, M. Boden & T. Ziemke (Eds), ICANN ’98 – Proceedings of the 8th international conference on artificial neural networks (p. 929). Berlin: Springer. Scholkopf, B., Tsuda, K., & Vert, J. -P. (Eds) (2004). Kernel methods in computational biology (computational molecular biology). Cambridge, MA: MIT Press.
Boosting-Based Frameworks in Financial Modeling
151
Schwenk, H., & Bengio, Y. (1997). AdaBoosting neural networks. In: W. Gerstner, A. Germond, M. Hasler & J.-D. Nicoud (Eds), Proceedings of the Int. Conf. on Artificial Neural Networks (ICANN’97) (Vol. 1327 of LNCS, pp. 967–972). Berlin, Springer. Sun, X. (2002). Pitch accent prediction using ensemble machine learning. In: Proceedings of the 7th international conference on spoken language processing (ICSLP). Denver, CO, USA, September 16–20. Tino, P., Schittenkopf, C., Dorffner, G., & Dockner, E. J. (2000). A symbolic dynamics approach to volatility prediction. In: Y. S. Abu-Mostafa, B. LeBaron, A. W. Lo & A. S. Weigend (Eds), Computational finance 99 (pp. 137–151). Cambridge, MA: MIT Press. Tsay, R. S. (2002). Analysis of Financial Time Series. New York: Wiley. Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134. Van Gestel, T., Suykens, J., Baestaens, D., Lambrechts, A., Lanckriet, G., Vandaele, B., De Moor, B., Vandewalle, J. (2001). Financial time series prediction using least squares support vector machines within the evidence framework. IEEE Transactions on Neural Networks, Special Issue on Neural Networks in Financial Engineering, 12, 809 Vapnik, V. (1995). The nature of statistical learning theory. Heidelberg: Springer Verlag. Vapnik, V. (1998). Statistical learning theory. New York: Wiley. Witten, I. H., & Frank, E. (2000). Data mining. San Francisco, CA: Morgan Kaufmann Publishers. Xiao, R., Zhu, L., & Zhang, H. -J. (2003). Boosting chain learning for object detection. In: Proceedings of the ninth IEEE international conference on computer vision (ICCV03). Nice, France. Zakoian, J.-M. (1994). Threshold heteroskedastic models. Journal of Economic Dynamics Control, 18, 931.
This page intentionally left blank
152
OVERLAYING TIME SCALES IN FINANCIAL VOLATILITY DATA Eric Hillebrand ABSTRACT Apart from the well-known, high persistence of daily financial volatility data, there is also a short correlation structure that reverts to the mean in less than a month. We find this short correlation time scale in six different daily financial time series and use it to improve the short-term forecasts from generalized auto-regressive conditional heteroskedasticity (GARCH) models. We study different generalizations of GARCH that allow for several time scales. On our holding sample, none of the considered models can fully exploit the information contained in the short scale. Wavelet analysis shows a correlation between fluctuations on long and on short scales. Models accounting for this correlation as well as long-memory models for absolute returns appear to be promising.
1. INTRODUCTION Generalized auto-regressive conditional heteroskedasticity (GARCH) usually indicates high persistence or slow mean reversion when applied to financial data (e.g., Engle & Bollerslev, 1986; Ding, Granger, & Engle, 1993; Engle & Patton, 2001). This finding corresponds to the visually discernable Econometric Analysis of Financial and Economic Time Series/Part B Advances in Econometrics, Volume 20, 153–178 Copyright r 2006 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(05)20025-7
153
154
ERIC HILLEBRAND
volatility clusters that can be found in the time series of squared or absolute returns from financial objects. The presence of change-points in the data, however, prevents GARCH from reliably identifying the long time scale that is associated with the high persistence (Diebold, 1986; Lamoureux & Lastrapes, 1990; Mikosch & Starica, 2004; Hillebrand, 2005). Recently, the possibility of the presence of several correlation time scales in financial volatility data has been discussed (Fouque, Papanicolaou, Sircar, & Sølna, 2003; Chernov, Gallant, Ghysels, & Tauchen, 2003; Gallant & Tauchen, 2001). Using six different time series of daily financial data, we show that autoregressive conditional heteroskedasticity (ARCH), GARCH, and Integrated GARCH (IGARCH) fail to capture a short correlation structure in the squared returns with a time scale of less than 20 days. This serial correlation with low persistence can be used to improve the short-term forecast error from GARCH models. Several models that are being proposed to capture different time scales are discussed and estimated on our data set. Unfortunately, none is able to fully exploit the information contained in the short scale in our data set. In correspondence to the results of Ding and Granger (1996), ARFIMA estimation of absolute, as opposed to squared, returns yields a consistent improvement of the 1-day forecast on our holding sample. Wavelet transformation of the volatility time series from our data set reveals that there is correspondence between long- and short-correlation scales. This supports the findings of Mu¨ller et al. (1997) and Ghashgaie, Breymann, Peinke, Talkner, and Dodge (1996). The heterogeneous ARCH (HARCH) model of Mu¨ller et al. (1997), which allows for correlation between long and short-scales, indeed uses the short-scale information for the stock market part of our data set best. In Section 2, we introduce our data set and discuss the ARCH, GARCH, and IGARCH results, which turn out to be very typical for financial data. Section 2.3 identifies the short-correlation scale that is being missed by GARCH, using spectral methods as well as ARMA estimation, which provides a benchmark for forecast improvement over GARCH. In Section 3, we segment our time series using recent results for change-point detection in volatility data. GARCH is then estimated locally on the segments. As change-points cause GARCH to show spurious high persistence, the idea is that accounting for change-points may reveal the short scale within the segments. This is indeed the case, but does not help to improve forecasts for our holding sample. Section 4 discusses and estimates an array of models designed to capture several time scales in volatility. We begin with fractionally integrated models and estimate ARFIMA and FIGARCH on our data set. ARFIMA estimation of absolute returns yields a consistent
155
Overlaying Time Scales in Financial Volatility Data
improvement of the 1-day ahead forecast on our holding sample. Next, a two-scale version of GARCH by Engle and Lee (1999) is estimated. Section 4.3 demonstrates the correspondence between long and short scales with wavelet analysis. The HARCH model of Mu¨ller et al. (1997) is then estimated. We conclude that, on our data set, none of the considered models captures the short scale as well as the simple ARMA fit to the residual in Section 2.3. For the stock market section of our data, HARCH yields the relatively largest improvements in the short-term forecast errors.
2. INTEGRATED VOLATILITY 2.1. Data We use an array of daily financial data. Four series measure the U.S. stock market: the S&P 500 index (sp) and the Dow Jones Industrial Average (dj), obtained from Datastream, as well as the CRSP equally weighted index (cre) and the CRSP value-weighted index (crv). Next, we consider the exchange rate of the Japanese Yen against the U.S. Dollar (yd, Datastream series BBJPYSP) and the U.S. federal funds rate (ffr). These series cover the period January 4, 1988 through Decmber 31, 2003 and contain 4,037 observations, except for the Yen/Dollar exchange rate, which is recorded in London and therefore contains only 3,991 observations.
2.2. ARCH and GARCH In the ARCH(p) model (Engle, 1982) the log-returns of a financial asset are given by rt ¼ mt þ t ; t ¼ 1; . . . ; T, t jFt h t ¼ s2 þ
1
Nð0; ht Þ,
p X
ai ð2t
1
s2 Þ,
(1) (2) (3)
i¼1
where mt ¼ Et 1 rt is the conditional mean function, F Ptpdenotes the filtration that models the information set, and s2 ¼ o=ð1 i¼1 ai Þ is the unconditional mean with o being a constant. For the conditional mean mt different models are used, for example ARMA processes, regressions on exogenous
156
ERIC HILLEBRAND
variables, or simply a constant. Since we focus on the persistence of the conditional volatility process, we will use a constant in this study. P The GARCH(p,q) model (Bollerslev, 1986) adds the term qi¼1 bi ðht i s2 Þ to the conditional variance Eq. (3). The unconditional mean becomes Pp P q s2 ¼ o=ð1 a i¼1 i i¼1 bi Þ Empirically, GARCH models often turn out to be more parsimonious than ARCH models since these often need high lag orders to capture the dynamics of economic time series. There are different ways to measure persistence in the context of GARCH models, as discussed, for example, in Engle and Patton (2001) and Nelson (1990). The most commonly used measure is the sum of the autoregressive parameters, l :¼
p X i¼1
ai þ
q X
bi .
(4)
i¼1
The parameter l can be interpreted as the fraction of a shock to the volatility process that is carried forward per unit of time. Therefore (1 l) is the fraction of the shock that is washed out per unit of time. Hence, 1/(1 l) is the average time needed to eliminate the influence of a shock. The closer l is to unity, the more persistent is the effect of a change in ht. The stationarity condition is lo1. A stationary but highly persistent process returns slowly to its mean, a process with low persistence reverts quickly to its mean. Different persistences therefore imply different times of mean reversion. In this study, the term ‘‘time scales’’ is understood in this sense. Estimations of the conditional volatility process with GARCH models usually indicate high persistence (Engle & Bollerslev, 1986; Bollerslev, Chou, & Kroner, 1992; Bollerslev & Engle, 1993; Bollerslev, Engle, & Nelson, 1994; Baillie, Bollerslev, & Mikkelsen, 1996; Ding, Granger, & Engle, 1993, Ding & Granger, 1996; Andersen & Bollerslev, 1997). The estimated sum l of the autoregressive parameters is often close to one. In terms of time scales, l ¼ 0:99 corresponds to 100 days or 5 months mean reversion time for daily data, whereas l ¼ 0:999 corresponds to 1000 days or 4 years. Therefore, it is difficult to identify the long scale numerically and interpret it economically. Engle and Bollerslev (1986) suggested that the integrated GARCH, or IGARCH model reflect the empirical fact of high persistence. In the IGARCH model, the likelihood function of the GARCH model is maximized subject to the constraint that l ¼ 1: In the IGARCH(1, 1) specification this amounts to the parameter restriction b ¼ 1 a: The estimated conditional volatility process is not stationary, a shock has indefinite influence on the level of volatility.
157
Overlaying Time Scales in Financial Volatility Data
Table 1 compares GARCH, ARCH, and IGARCH estimations on our data set. Within the class of model orders up to GARCH(5,5), the Bayes Information Criterion favors GARCH(1, 1) for all series. Contrary to that, it indicates lag orders between 8 and 20 for ARCH estimations. For GARCH estimations, the sum of the autoregressive parameters is almost one for all series, with the CRSP equally weighted index showing the lowest persistence with a mean reversion time of 27 days and the volatility of the federal funds rate indicating non-stationarity. The ARCH estimations show a very different picture of persistence, l is of the order of 0.80–0.85 for the stock market, 0.5 for the yen–dollar exchange rate, and still close to one for the federal funds rate. This indicates that another, shorter scale of the order of a few days may be present in the data.
Table 1.
ARCH, GARCH, and IGARCH Estimation.
sp
dj
Panel A: GARCH(1, 1) ðht ¼ o þ a2t a+b MAE(1) MAE(20) MAE(60)
0.995 7.24e-5 6.24e-5 4.59e-5
Panel B: ARCH ðht ¼ o þ p P
ai MAE(1) MAE(20) MAE(60)
14 0.855 1.48 1.03 1.03
0.047 1.05 1.31 1.64
1
q P
j¼1
crv
ffr
yd
0.991 7.08e-5 5.76e-5 4.48e-5
1.0 0.0254 0.0182 0.0158
0.976 3.35e-5 3.18e-5 2.89e-5
þ bht 1 Þ
0.990 4.63e-5 5.33e-5 4.06e-5
0.963 3.36e-5 3.99e-5 3.95e-5
aj 2t j Þ 11 0.796 1.58 1.03 1.02
Panel C: IGARCH(1, 1) ðht ¼ o þ a2t a MAE(1) MAE(20) MAE(60)
cre
0.058 1.03 1.39 1.63
8 0.831 1.44 1.02 1.00 1
þ ð1
14 0.866 1.34 1.04 1.02
20 0.989 0.64 0.90 0.91
9 0.494 0.85 0.98 1.03
0.072 1.07 1.41 1.74
0.169 0.98 2.86 6.77
0.050 0.94 1.06 1.15
aÞht 1 Þ
0.013 1.09 1.31 1.39
Note: The sample consists of daily returns of the S&P 500 index (sp), Dow Jones Industrial Average (dj), CRSP equally-weighted index (cre), CRSP value-weighted index (crv), federal funds rate (ffr), and yen per dollar exchange rate (yd) from January 4, 1988 through October 6, 2003. The MAEs are calculated using the holding sample October 7, 2003 through December 31, 2003. The MAE for the ARCH and IGARCH models are stated in percentages of the forecast error from the GARCH(1, 1) model.
158
ERIC HILLEBRAND
The last three rows of every panel show the mean absolute errors (MAE) for a 1-, 20-, and 60-day forecast of the conditional volatility process when compared to the squared returns of the holding sample October 7, 2003 through December 31, 2003. That is, MAEð1Þ ¼ 2t ht ; t ¼ Oct 7; 2003; NovP 3; 2003 1 2 ht ; MAEð20Þ ¼ 20 t t¼Oct 7; 2003
MAEð60Þ
¼
1 60
Dec 31; P2003
t¼Oct 7; 2003
2 t
ht :
The mean absolute errors of the GARCH(1, 1) forecast are taken as a benchmark. The forecast errors of other models are expressed as percentages of the GARCH(1, 1) error. In terms of forecast performance on the holding sample considered here, GARCH dominates ARCH for the stock market data. For the federal funds rate and the exchange rate, ARCH seems to be a better forecast model. One might suspect that the stock market volatility is in fact integrated and that GARCH(1, 1) with l close to one reflects this fact better, thereby providing better forecasts than ARCH. This interpretation, however, cannot hold as IGARCH(1, 1) forecasts are worse than GARCH across all series and forecast horizons, the only exceptions being the single-day forecasts of the exchange rate and the federal funds rate. Further, IGARCH performs worse than ARCH for the 20-day and 60-day forecasts of stock market volatility but generally better on the single-day horizon. This is a counterintuitive result as the non-stationarity of IGARCH should reflect the long-term behavior better than the short term. Note that the GARCH(1, 1) and ARCH models in Table 1 display an unexpected pattern in the forecast error: The error often decreases with increasing forecast horizon. This indicates that these models miss short-term dynamics and capture long-term dynamics better.
2.3. The Missed Short Scale The GARCH model aims to capture the entire correlation structure of the volatility process et2 in the conditional volatility process ht. Therefore, we can define the residual nt :¼ 2t ht : This residual is white noise. From the distribution assumption (2) of the GARCH model, which can be written as
159
Overlaying Time Scales in Financial Volatility Data
2t ¼ Z2t ht ; where Zt is a standard normal random variable, we have that Ent ¼ EðEt 1 ð2t Ent ns ¼
EððZ2t
ht ÞÞ ¼ EðEt 1 ðZ2t ht
1ÞðZ2s
1Þht hs Þ ¼
(
0 2Eh2t
ht ÞÞ ¼ 0, for tas; for t ¼ s:
(5)
(6)
The latter is shown to exist in Theorem 2 of Bollerslev (1986). Therefore, if GARCH is an accurate model for financial volatility, the estimated residual process n^ t ¼ ^2t h^t should not exhibit any serial correlation. Figures 1–3 show the estimated averaged periodograms of the residuals of the GARCH estimations in Table 1. The averaged periodogram is estimated by sub-sampling with a Tukey–Hanning window of 256 points length allowing for 64 points overlap. A Lorentzian spectrum model hðwÞ ¼ a þ b=ðc2 þ w2 Þ
(7)
is fitted to the periodogram, where w denotes the frequencies and (a, b, c) are parameters. The average mean reversion time is obtained as a function of the parameter c.1 Except for the CRSP equally weighted index and the federal funds rate, the series show strong serial correlation in the residual. The persistence time scale is of the order of five days. This finding is confirmed when ARMA models are fitted to the residuals. Table 2 reports the parameter estimates and the mean absolute errors of the forecasts for the volatility process et2 when the serial correlation in the residual nt is taken into account. The errors are stated as percentages of the errors from the simple GARCH estimations in Table 1. The first order autoregressive parameters are estimated around 0.8, corresponding to the 5-day mean reversion time scale found from the Lorentzian model and corresponding to the estimated persistence from the ARCH models in Table 1. Note that the forecast errors can be substantially reduced by correcting for this correlation, which is exogenous to the GARCH model. Using the residual et2/ht, which is a truncated standard normal under GARCH, or adding autoregressive terms to the mean equation does not alter these findings. We are now able to explain the counterintuitive pattern of forecast errors from the GARCH model. The fact that the short correlation scale is not captured leads to inflated errors on short forecast horizons. The correction for the short scale therefore reduces the errors at the 1-day and at the 20-day horizon much more than at the 60-day horizon. In the following sections, we will explore different ways to account for the short scale in GARCH-type models. We will evaluate the forecast performance
160
ERIC HILLEBRAND -3
sp, mean reversion time: 5.365 days
x 10 5 4 3 2 1 0 0
0.05
0.1
-3
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.4
0.45
0.5
dj, mean reversion time: 5.2189 days
x 10 6 5 4 3 2 1 0 0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
Fig. 1. Periodogram Estimation I. Estimation of the Power Spectra (Dotted Line) of the Residual Processes n^ t ¼ ^2t h^t of the S&P 500 and Dow Jones Series and Nonlinear Least Squares fit of a Lorentzian Spectrum (Solid Line). The Estimate of the Average Mean Reversion Time is Computed from the Lorentzian.
of these models against the benchmark of the exogenous correction for the short scale in Table 2.
3. LOCAL GARCH ESTIMATION The apparent integrated or near-integrated behavior of financial volatility may be an artifact of changing parameter regimes in the data. Several authors have studied this phenomenon in a GARCH framework with market data and in simulations (Diebold, 1986; Lamoureux & Lastrapes, 1990; Hamilton & Susmel, 1994; Mikosch & Starica, 2004). Hillebrand (2005) shows that if a GARCH model is estimated globally on data that were
161
Overlaying Time Scales in Financial Volatility Data -3
cre, mean reversion time: 0 days
x 10
1.5
1
0.5
0 0
0.05
0.1
-3
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.4
0.45
0.5
crv, mean reversion time: 5.4208 days
x 10 4
3
2
1
0
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
Fig. 2. Periodogram Estimation II. Estimation of the Power Spectra (Dotted Line) of the Residual Processes n^ t ¼ ^2t h^t of the CRSP Equally Weighted Index and the CRSP Value-Weighted Index and Nonlinear Least Squares fit of a Lorentzian Spectrum (Solid Line). The Estimate of the Average Mean Reversion Time is Computed from the Lorentzian.
generated by changing local GARCH models, then the global l will be estimated close to one, regardless of the value of the data-generating l within the segments. In other words, a short mean reversion time scale in GARCH plus parameter change-points exhibits statistical properties similar to a long-mean reversion time scale. Tables 3 and 4 show the estimation of GARCH(1, 1) models on segmentations that were obtained using a change-point detector proposed by Kokoszka and Leipus (1999). The detector statistic is given by t pffiffiffiffi tðT tÞ 1 X UðtÞ ¼ T r2 t j¼1 j T2
1 T
T X
t j¼tþ1
r2j
!
.
(8)
162
ERIC HILLEBRAND 4
ffr, mean reversion time: 0 days
x 10 15
10
5
0
0
0.05
0.1
-3
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.4
0.45
0.5
yd, mean reversion time: 4.9911 days
x 10 3.5 3 2.5 2 1.5 1 0.5 0
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
Fig. 3. Periodogram Estimation III. Estimation of the Power Spectra (Dotted Line) of the Residual Processes n^ t ¼ ^2t h^t of the Federal Funds Rate and Yen–Dollar Exchange Rate Series and Nonlinear Least Squares Fit of a Lorentzian Spectrum (Solid Line). The Estimate of the Average Mean Reversion Time is Computed from the Lorentzian.
The estimator of the single change-point in the sample of T observations of the squared returns rt2 is obtained as t^ ¼ min t : jUðtÞj ¼ max fUðtÞj . t2f1;...Tg
(9)
Kokoszka and Leipus (1999) show that the statistic U(t)/s Pconverges 2in 2distribution to a standard Brownian bridge, where s2 ¼ 1 j¼ 1 covðrt ; rtþj Þ: Therefore, confidence intervals can be calculated from the distribution of the maximum of a Brownian bridge. We follow Andreou and Ghysels (2002) and use the VARHAC estimator of Den Haan and Levin (1997) for s. Changepoints at a significance level of 0.05 or less are considered in Tables 3 and 4.
163
Overlaying Time Scales in Financial Volatility Data
Table 2. Forecast Correction Using Short Scale. sp nt ¼
p P
j¼1
fj nt
j
þ
Spec f1 f1 c1 c2 MAE(1) MAE(20) MAE(60)
q P
j¼1
dj
cj Zt
j
cre
crv
ffr
yd
(2,2) 0.73 0.08 0.48 0.44 0.20 0.79 0.92
(1,1) 0.53
þ Zt ; Z white noise
(1,1) 0.82
(1,1) 0.80
(1,2) 0.79
(1,1) 0.78
0.78
0.78
0.75
0.10 0.76 0.89
0.44 0.81 0.91
0.91 0.26 0.42 0.95 0.98
0.19 0.81 0.92
0.44 0.41 0.97 0.99
Note: ARMA parameter estimates and mean absolute errors for the 1-day, 20-days, and 60-days forecast of the global GARCH(1, 1) model correcting for the short scale found in the residual nt ¼ 2t ht : The ARMA specification was found by including only parameters significant at the 0.05 level or lower. The errors are reported in percent of the global GARCH(1, 1) forecast error in Table 1.
Table 3.
Change-point Detection.
sp Segment 01/04/88 12/30/91 01/08/96 12/06/96 08/14/97 10/06/03 l¯
dj
Prob
l
0.00 0.00 0.01 0.00
0.925 0.948 0.909 0.989 0.953 0.944
cre
Segment
Prob
l
01/04/88 05/16/91 10/13/92 12/15/95 03/26/97 10/06/03
0.00 0.01 0.04 0.00
0.930 0.517 0.905 0.942 0.960 0.901
Segment
Prob
l
01/04/88 10/16/97 10/06/03
0.00
0.849 0.964
0.892
Note: Local GARCH(1, 1) estimations on segments identified by the change-point detector of Kokoszka and Leipus (1999).
In fact, the average estimated l within the segments indicates lower persistence than the global estimates. The implied mean reversion time scales range from about 3 days for the yen per dollar exchange rate to about 18 days for the S&P 500 index. However, while the approach uncovers a short scale, its forecast success is disappointing. Table 5 shows the mean absolute error from forecasts of the GARCH(1, 1) model of the last segment. Two offsetting effects are at work. While we can expect the local estimation approach to capture the short-run
164
ERIC HILLEBRAND
Table 4.
Local GARCH(1, 1) Estimations.
crv
ffr
Segment
Prob
l
01/04/88 12/30/91 12/15/95 03/26/97 10/15/97 10/06/03
0.00 0.00 0.04 0.00
0.916 0.900 0.872 0.937 0.973
l¯
yd
Segment
Prob
l
01/04/88 09/04/90 02/22/91 01/21/93 10/06/03
0.02 0.00 0.00
0.991 0.522 0.933 0.704
0.928
Segment
Prob
l
01/04/88 05/07/97 09/27/99 10/06/03
0.02 0.00
0.959 0.902 0.010
0.764
0.706
Note: The segments were identified by the change-point detector of Kokoszka and Leipus (1999).
Table 5.
MAE(1) MAE(20) MAE(60)
Forecast Correction Using Local Estimations.
sp
dj
cre
crv
ffr
yd
1.56 1.21 1.25
1.64 1.24 1.24
1.36 1.02 0.97
1.41 1.21 1.18
1.06 1.37 1.52
1.06 1.11 1.21
Note: Mean absolute errors for the 1-, 20-, and 60-days forecast of the GARCH(1, 1) model for the last segment identified by the Kokoszka and Leipus (1999) change-point detector. The MAEs for the holding sample October 7, 2003 through December 31, 2003 are expressed in percent of the global GARCH(1, 1) forecast error as reported in Table 1.
dynamics of the series better, the shorter sample sizes increase the estimation error. In the monitoring situation, where the change-point must be detected as observations arrive, the time lag necessary to recognize a parameter change will further deteriorate the forecast performance. The MAEs of Table 5 are again stated as percentages of the global forecast errors and show that in terms of forecast success, the global approach yields better results on our holding sample.
4. MODELS OF MULTIPLE TIME SCALES In the local estimations in Section 3 (Tables 3 and 4), we have seen that accounting for change-points and estimating GARCH on segments reduces the estimated persistence. The mean reversion time estimated from the average persistence across segments stays below 20 days as opposed to the
Overlaying Time Scales in Financial Volatility Data
165
global estimation in Table 1, where four out of six volatility time series display a mean reversion time of more than 100 days. These findings suggest the interpretation that financial volatility displays low persistence with occasional structural breaks, which lead to apparent high persistence in estimations that ignore the change-points. Another possible interpretation is that two or more persistence time scales influence the volatility processes simultaneously and continuously in daily data. We study this interpretation in Sections 4.1 and 4.2. A third possible interpretation is that the scales remain only for some limited time and then die out and are replaced by different scales. They may also influence each other. Section 4.3 deals with this case. A short mean reversion time scale in financial volatility has been found in many different recent studies, mostly in the context of stochastic volatility models. Gallant and Tauchen (2001) estimate a stochastic volatility model with two different volatility drivers. Estimating their model for daily returns on the Microsoft stock, one driver assumes high persistence, the other assumes low persistence. Fouque et al. (2003) and Chernov et al. (2003) propose and discuss multi-scale stochastic volatility models. LeBaron (2001) simulates a stochastic volatility model where the volatility process is an aggregate of three different persistence time scales and shows that the simulated time series display long-memory properties. This corresponds to Granger’s (1980) finding that aggregation over several scales implies longmemory behavior. These findings suggest that both, long and short scale act continuously and simultaneously in the data.
4.1. Fractional Integration Assume that the time series xt, when differenced d times, results in a time series yt that has an ARMA representation. Then, the time series xt is called integrated of order d, denoted xtI(d). Fractionally integrated time series, where d is not an integer, are a discretized version of fractional Brownian motion.2 Granger (1980) and Granger and Joyeux (1980) show that the autocorrelation function of a fractionally integrated xt is given by rk ¼
Gð1 dÞ Gðk þ dÞ Ad k2d 1 , GðdÞ Gðk þ 1 dÞ
for dA(0, 1/2), where Ad is a constant. The sum over the autocorrelations does not converge, so that it is a suitable model for long memory. If dA(1/2, 1), the process has infinite variance and is therefore non-stationary but still
166
ERIC HILLEBRAND
mean reverting. If d41, the process is non-stationary and not mean reverting. For dA( 1/2, 0), the process is anti-persistent. Granger (1980) shows that aggregation of a large number of processes with different short mean reversion times yields a fractionally integrated process. In particular, he shows that the process xt ¼
N X
yj;t ,
j¼1
where the yj,t are AR(1) models yj;t ¼ aj yj;t
1
þ j;t ;
j ¼ 1; . . . ; N,
with ej,t independent white noise processes, is a fractionally integrated process with order d depending on the statistical properties of the aj. This is particularly appealing for economic time series, which are often aggregates. The surprising result is that an aggregate of many short scales can exhibit long memory properties. Geweke and Porter-Hudak (1983) provide the standard estimation method for d. The estimator is obtained as the negative of the estimate b^ of the slope in the regression log Iðwj Þ ¼ a þ b logð4 sin2 ðwj =2Þ þ j , where I(wj) is the periodogram at the harmonic frequencies wj ¼ jp=T and T is the sample size. Most financial time series exhibit significant long memory in the sense of the Geweke and Porter-Hudak (1983) estimator. Table 6 shows the estimated d for the squared and the absolute returns of the time series considered here. The estimates are significantly larger than zero and below 1/2, indicating stationary long memory. The only exception are the absolute returns of the CRSP equally weighted index, which may be non-stationary but still mean reverting. These findings may indicate the presence of multiple overlaying time scales in the data in the sense of Granger (1980). For the absolute returns calculated from the time series considered here, we estimate an ARFIMA model using Ox (Doornik, 2002) and the ARFIMA package for Ox (Doornik & Ooms, 1999). We choose to model absolute returns since they exhibit an even richer correlation structure than squared returns (Ding & Granger, 1996) and a model for absolute returns allows to forecast squared returns as well. The ARFIMA(p,d,q) model for the time series yt is defined as FðLÞð1
LÞd yt ¼ CðLÞZt ,
(10)
167
Overlaying Time Scales in Financial Volatility Data
Estimation of Fractional Integration.
Table 6. sp
dj
cre
crv
ffr
yd
0.354 (0.071)
0.328 (0.071)
0.390 (0.071)
0.387 (0.071)
0.238 (0.071)
0.319 (0.071)
0.428 (0.071)
0.507 (0.071)
0.436 (0.071)
0.343 (0.071)
0.422 (0.071)
Panel A: r2t d
Panel B: jr2t j d
0.418 (0.071)
Note: Results of the Geweke and Porter-Hudak (1983) test for the daily S&P 500 index (sp), Dow Jones Industrial Average (dj), CRSP equally weighted index (cre), CRSP value-weighted index (crv), federal funds rate (ffr), and yen per dollar exchange rate (yd) from January 4, 1988 through December 30, 2003. The null hypothesis is d ¼ 0, the alternative d6¼0. All estimates are significant according to all common significance levels. The smoothed periodogram was estimated using a 1,024 points window with 256 points overlap. The g(T) function in Theorem 2 of Geweke and Porter-Hudak (1983) was set to [T0.55] ¼ 96. The numbers in parentheses are asymptotic standard errors.
where L is the lag operator, Zt is white noise with mean zero and variance s 2, FðLÞ ¼ 1
f1 L
f2 L2
fp Lp ,
is the autoregressive lag polynomial, and CðLÞ ¼ 1 þ c1 L þ c2 L2 þ þ cp Lp , is the moving-average lag polynomial. We assume that all roots of the lag polynomials are outside the unit circle. The fractional differencing operator (1 L)d, where dA( 0.5, 0.5), is defined by the expansion dðd 1Þ 2 d ðd 1Þðd 2Þ 3 L L þ (11) 2! 3! In estimations, a truncation lag has to be defined; we use 100 in this study. We determine the ARFIMA lag orders p and q by significance according to t-values. Table 7 reports the estimation results and the forecast errors in percent of the GARCH(1, 1) forecast error. The estimates of the fractional integration parameter d are highly significant across all time series. On the 1-day horizon the forecast error is smaller than the forecast error of the GARCH(1, 1) model across all time series. Except for the yen per dollar exchange rate, the forecast performance on the holding sample is worse than GARCH(1, 1) for the 20- and 60-day forecast horizons. ð1
LÞd ¼ 1
dL þ
168
ERIC HILLEBRAND
Table 7. ARFIMA Estimation. sp ð1
fLÞ ð1
LÞd jrt j ¼
dj q P
j¼0
spec d f c1 c2 MAE(1) MAE(20) MAE(60)
cj Zt
j
cre
crv
ffr
yd
(1,d,2) 0.415 0.558 0.515 0.244 0.86 2.38 3.48
(1,d,1) 0.300 0.343 0.567
þ Zt ; Z white noise
(1,d,1) 0.407 0.228 0.605
(1,d,1) 0.400 0.247 0.616
(1,d,0) 0.404 0.244
(1,d,1) 0.404 0.153 0.537
0.81 1.06 1.31
0.64 1.11 1.33
0.62 1.04 1.05
0.79 1.10 1.32
0.43 0.91 0.99
Note: The sample consists of absolute returns of the daily S&P 500 index (sp), Dow Jones Industrial Average (dj), CRSP equally weighted index (cre), CRSP value-weighted index (crv), federal funds rate (ffr), and yen per dollar exchange rate (yd) from January 4, 1988 through October 6, 2003. All parameter estimates are significant according to all common confidence levels. The MAEs for the holding sample October 7, 2003 through December 31, 2003 are expressed in percent of the GARCH(1, 1) error as reported in Table 1.
Baillie et al. (1996) proposed a GARCH model that allows for a fractionally integrated volatility process. The conditional variance Eq. (3) can be rewritten as an ARMA process for et2 ½ð1
aðLÞ
bðLÞ2t ¼ o þ ½1
bðLÞnt ,
(12)
where nt ¼ 2t ht : The coefficients aðLÞ ¼ a1 L þ þ ap Lp and bðLÞ ¼ b1 L þ þ bq Lq are polynomials in the lag operator L. The order of the polynomial [(1 a(L) b(L)] is max{p, q}. The residual nt is shown to be white noise in (5) and thus, (12) is indeed an ARMA process. If the process et2 is integrated, the polynomial [(1 a(L) b(L)] has a unit root and can be written as F(L)(1 L) where F(L) ¼ [(l a(L) b(L)](1 L) 1 is of order max{p, q} 1. Motivated by these considerations the fractionally integrated GARCH, or FIGARCH model defines the conditional variance as FðLÞð1
LÞd 2t ¼ o þ ½1
bðLÞnt ,
(13)
where dA(0,1). Table 8 reports the parameter estimates and MAEs of the forecast from a FIGARCH(1, d, 1) model. The estimation was carried out using the GARCH 2.2 package of Laurent and Peters (2002) for Ox.
169
Overlaying Time Scales in Financial Volatility Data
Table 8. sp ð1 ða þ bÞLÞ ð1 d a b MAE(1) MAE(20) MAE(60)
FIGARCH Estimation.
dj
LÞ 1 ð1 0.428 0.199 0.580 1.11 1.06 1.10
cre
LÞd 2t ¼ o þ ð1 0.417 0.247 0.608 1.03 1.08 1.08
bLÞnt ; nt ¼ 2t 0.708 0.553 0.005 0.87 0.91 0.93
crv
ffr
yd
ht : 0.404 0.118 0.478 0.97 1.06 1.08
0.289 0.256 0.106 0.91 2.80 4.24
0.251 0.330 0.498 0.66 0.92 0.96
Note: The sample consists of the daily S&P 500 index (sp), Dow Jones Industrial Average (dj), CRSP equally weighted index (cre), CRSP value-weighted index (crv), federal funds rate (ffr), and yen per dollar exchange rate (yd) from January 4, 1988 through October 6, 2003. The MAEs for the holding sample October 7, 2003 through December 31, 2003 are expressed in percent of the GARCH(1, 1) error as reported in Table 1. Indicates the only parameter estimate that is not significant according to all common confidence levels.
The interpretation of persistence in a FIGARCH model is difficult. The estimates of the parameter of fractional integration are highly significant across all series. Davidson (2004) note, however, that in the FIGARCH model the parameter d does not have the same interpretation as d in the ARFIMA model. The FIGARCH process is non-stationary if d40, thus our estimates indicate non-stationarity rather than stationary long memory. Also, the autoregressive parameters a and b do not have a straightforward interpretation as persistence parameters. Their sums are substantially below one for all considered series but we cannot extract a persistence time scale from this as in the case of GARCH. In summary, long-memory models describe financial volatility data well and the parameter of fractional integration is usually highly significant. This may indicate that there are several persistence time scales present in the data. Fractionally integrated models do not allow, however, for an identification of the different scales. 4.2. Two-Scale GARCH In the GARCH framework, Engle and Lee (1999) proposed a model that allows for two different overlaying persistence structures in volatility.According to (3), the conditional variance in the GARCH(1, 1) model is given by ht ¼ s2 þ að2t
1
s2 Þ þ bðht
1
s2 Þ.
170
ERIC HILLEBRAND
Engle and Lee (1999) generalize the unconditional variance s2 from a constant to a time varying process qt, which is supposed to model the highly persistent component of volatility: qt ¼ að2t
ht
1
qt ¼ o þ rqt
qt 1 Þ þ bðht 1
þ fð2t
1
1
(14)
qt 1 Þ,
(15)
ht 1 Þ.
Then, l ¼ a þ b can capture the short time scale. The authors show that the model can be written as a GARCH(2, 2) process. Table 9 reports the estimation of the Engle and Lee (1999) model on the time series considered in this paper. The results clearly show two different scales. The sum a+b of the autoregressive parameters indicates a short scale between 5 and 18 days. The autoregressive coefficient r of the slowly moving component indicates very high persistence close to integration. As before, the last three rows report the mean absolute forecast errors for the holding period as a fraction of the errors of the GARCH(1, 1) forecast. Except for two instances at the 1-day horizon, the forecast errors are higher than those of the GARCH(1, 1) model. Table 9.
ht qt ¼ að2t 1 qt 1 Þ þ bðht 1 qt ¼ o þ pqt 1 þ fð2t 1 ht 1 Þ:
a
b r f o MAE(1) MAE(20) MAE(60)
Component-GARCH Estimation. sp
dj
cre
crv
ffr
yd
0.056 (0.013) 0.888 (0.027) 0.999 (2e-4) 0.011 (0.255) 2e-5 (6e-6) 1.58 2.02 2.78
0.061 (0.016) 0.882 (0.030) 0.999 (2e-4) 0.009 (0.313) 2e-5 (6e-6) 1.80 2.25 2.95
0.199 (0.025) 0.717 (0.031) 0.999 (3e-4) 0.014 (0.181) 2e-5 (5e-6) 1.09 1.34 1.42
0.074 (0.006) 0.861 (0.030) 0.999 (3e-4) 0.011 (0.221) 2e-5 (6e-6) 1.37 1.87 2.40
0.139 (0.048) 0.700 (0.031) 0.999 (6e-4) 0.088 (0.189) 0.001 (6e-4) 0.87 2.35 4.26
0.050 (0.024) 0.726 (0.140) 0.986 (0.008) 0.028 (0.026) 2e-4 (1e-4) 0.87 1.08 1.24
qt 1 Þ;
Note: Estimation of the Engle and Lee (1999) model for the daily S&P 500 index (sp), Dow Jones Industrial Average (dj), CRSP equally weighted index (cre), CRSP value-weighted index (crv), federal funds rate (ffr), and yen per dollar exchange rate (yd) from January 4, 1988 through October 6, 2003. The MAEs for the holding sample October 7, 2003 through December 31, 2003 are expressed in percent of the GARCH(1, 1) error as reported in Table 1. Significant at 95% level; Significant at 99% level.
Overlaying Time Scales in Financial Volatility Data
171
4.3. Wavelet Analysis and Heterogeneous ARCH The presence of different time scales in a time series (or ‘‘signal’’) is a wellstudied subject in the physical sciences. A standard approach to the problem is spectral analysis, that is, Fourier decomposition of the time series. The periodogram estimation carried out in Section 2.3 rests on such a Fourier decomposition. The time series x(t) is assumed to be a stationary process and for ease of presentation, assumed to be defined on continuous time. Then, the process is decomposed into a sum of sines and cosines with dif^ ferent frequencies o and amplitudes xðoÞ: Z 1 1 iot ^ xðtÞ ¼ pffiffiffiffiffiffi do. (16) xðoÞe 2p 1 ^ The amplitudes xðoÞ are given by the Fourier transform of x(t): Z 1 1 p ffiffiffiffiffi ffi xðoÞ ¼ xðoÞe iot dt. 2p 1
(17)
^ The periodograms in Figs. 1 through 3 are essentially plots of the xðoÞ against the frequencies o. We saw that spectral theory can be used to identify the short scale. However, this result pertains to the entire sample period only. The integration over the time dimension in (17) compounds all local information into a ^ global number, the amplitude xðoÞ: The sinusoids eiot ¼ cos ot þ i sin ot have support on all t 2 R: The coefficient in the Fourier decomposition (16), ^ the amplitude xðoÞ; has therefore only global meaning. It is an interesting question, however, if there are time scales in the data that are of local importance, that is, influence the process for some time and then die out or are replaced by other scales. This problem can be approached using the wavelet transform instead of the Fourier transform (Mallat, 1999). The wavelet transform decomposes a process into wavelet functions which have a frequency parameter, now called scale, and also a position parameter. Thereby, the wavelet transform results in coefficients which indicate a time scale plus a position in time where this time scale was relevant. Contrary to that, the Fourier coefficients indicate a globally relevant time scale (frequency) only. A wavelet c(t) is a function with mean zero: Z 1 cðtÞdt ¼ 0, 1
172
ERIC HILLEBRAND
which is dilated with a scale parameter s (corresponding to the frequency in Fourier analysis), and translated by u (this is a new parameter that captures the point in time where the scale is relevant): 1 t u . cu;s ðtÞ ¼ pffiffi c s s
In analogy to the Fourier decomposition (16), the wavelet decomposition of the process x(t) is given by Z 1Z 1 Wxðu; sÞcu;s ðtÞdu ds. (18) xðtÞ ¼ 1
1
The coefficients Wx(u, s) are given by the wavelet transform of x(t) at the scale s and position u, which is calculated by convoluting x(t) with the complex conjugate of the wavelet function: Z 1 t u 1 Wxðu; sÞ ¼ xðtÞ pffiffi c dt. (19) s s 1
This is the analog to the Fourier transform (17). Figures 4 and 5 plot the wavelet coefficients Wx(u, s) against position u A {1,y, 4037} in the time series and against time scale s A {2, 4,..., 512} days. The wavelet function used here is of the Daubechies class with 4 vanishing moments (Mallat, 1999, 249 ff.) but other wavelet functions yield very similar results. The plots for the CRSP data are left out for brevity, they are very similar to Fig. 4. These are three dimensional graphs; the figures present a bird’s eye view of the coefficient surface. The color encodes the size of the coefficient Wx(u, s): the higher the coefficient, the brighter its color.3 If there were a long scale and a short scale in the data, which influenced the process continuously and simultaneously, we would expect to see two horizontal ridges, one at the long and one at the short scale. According to the GARCH estimation in Table 1, the ridge of the long scale of the S&P 500 should be located at s ¼ 1/(1–0.995) ¼ 200 days, in the case of the Dow Jones at s ¼ 1/(1–0.99) ¼ 100 days, and so on. From the results in Section 2.3, we would expect the ridge of the short scale to be located at around 5 days. Figures 4 and 5 do not exhibit such horizontal ridges. Instead, there are vertical ridges and braid-like bulges that run from longer scales to shorter scales. A pronounced example is in the upper panel of Fig. 5, where a diagonal bulge runs from observation 500 and the long end of the scale axis to observation 1250 and the short end of the scale axis. A high vertical ridge is located around observation 750. The corresponding event in the federal
173
Overlaying Time Scales in Financial Volatility Data
time scales 2:2:512
sp
386
258
130
2
500
1000
1500
2000
2500
3000
3500
4000
2500
3000
3500
4000
time scales 2:2:512
dj
386
258
130
2
500
1000
1500
2000
Fig. 4. Wavelet Estimation I. Plot of the Wavelet Coefficients of the Absolute Returns Series of the S&P 500 and the Dow Jones. The Wavelet Coefficients Wx(u, s) are Defined for a Specific Position u, here on the Abscissa, and a Specific Time Scale s, Here on the Ordinate. The Panels give a Bird’s Eye View of a Three Dimensional Graph that Plots the Coefficients Against Position and Scale. The Different Values of the Coefficients are Indicated by the Color: the Brighter the Color, the Higher the Coefficient at that Position and Scale.
funds rate series is the period of high fluctuations during the end of the year 1990 and early 1991, when the Federal Reserve phased out of minimum reserve requirements on non-transaction funds and the first gulf war was impending. These patterns indicate that there is correlation between fluctuations with long-mean reversion time and fluctuations with short-mean reversion time. This correlation has been established by Mu¨ller et al. (1997). Ghashgaie et al. (1996) relate this finding to hydrodynamic turbulence, where energy flows from long scales to short scales. Mu¨ller et al. (1997) propose a variant of ARCH, called HARCH, to account for this correlation. The idea is to use long-term fluctuations to forecast short-term volatility. This is achieved by adding squared j-day returns to the conditional variance equation. That is,
174
ERIC HILLEBRAND
scales 2:2:512
ffr
386
258
130
2
500
1000
1500
2000
2500
3000
3500
4000
scales 2:2:512
yd
386
258
130
2
500
1000
1500
2000
2500
3000
3500
Fig. 5. Wavelet Estimation II. Plot of the Wavelet Coefficients of the Absolute Returns Series of the Federal Funds Rate and the Yen–Dollar Exchange Rate. The Wavelet Coefficients Wx(u, s) are Defined for a Specific Position u, here on the Abscissa, and a Specific Time Scale s, here on the Ordinate. The Panels give a Bird’s Eye View of a Three Dimensional Graph that Plots the Coefficients Against Position and Scale. The Different Values of the Coefficients are Indicated by the Color: the Brighter the Color, the Higher the Coefficient at that Position and Scale.
(2) is replaced by ht ¼ o þ
n X j¼1
aj
j X i¼1
t
i
!2
,
(20)
where all coefficients are positive. The number of possible combinations of n and j renders information criterion searches for the best specification infeasible, in particular, since high js are desirable to capture the long-mean reversion scales. The stationarity condition is l ¼ Snj¼1 jaj o1: Mu¨ller et al. (1997) also propose a GARCH-like generalization of HARCH, adding lagged values of h to (20). We refrain from using such a specification, since it incurs spurious persistence estimates due to change-points (Hillebrand, 2004).
175
Overlaying Time Scales in Financial Volatility Data
Table 10. sp ht ¼ o þ a1 2t o a1 a2 a3 a4 a5 l MAE(1) MAE(20) MAE(60)
1
dj
HARCH Estimation. cre
crv
ffr
yd
2 2 2 2 60 100 250 þ a2 ðS20 i¼1 t i Þ þ a3 ðSi¼1 t i Þ þ a4 ðSi¼1 t i Þ þ a5 ðSi¼1 t i Þ
6e-5 (2e-7) 0.132 (0.013) 0.010 (0.001) 0.002 (3e-4) 1.5e-4 (9.5e-5) 4.5e-6 (3.9e-6) 0.45 0.62 1.00 1.17
6e-5 (2e-46) 0.120 (0.012) 0.012 (9e-4) 0.001 (3e-4) 1e-10 (1e-4) 1e-5 (4e-6) 0.43 0.62 1.16 1.31
2e-5 (7e-7) 0.327 (0.023) 0.003 (3e-4) 4e-4 (6e-5) 1e-10 (2e-5) 1e-10 (3e-6) 0.42 0.25 0.88 0.92
5e-5 (2e-6) 0.153 (0.009) 0.017 (0.011) 0.002 (3e-4) 3e-4 (8e-5) 1e-10 (4e-6) 0.49 0.50 0.99 1.09
0.047 (3e-4) 0.556 (0.011) 1e-10 (4e-4) 1e-10 (4e-4) 1e-10 (1e-4) 1e-5 (5e-6) 0.56 1.84 2.51 2.83
3e-5 (8e-7) 0.089 (0.009) 0.007 (8e-4) 0.001 (2e-4) 2e-4 (9e-5) 1e-10 (5e-6) 0.30 1.59 1.31 1.23
Note: The sample consists of the daily S&P 500 index (sp), Dow Jones Industrial Average (dj), CRSP equally-weighted index (cre), CRSP value-weighted index (crv), federal funds rate (ffr), and yen per dollar exchange rate (yd) from January 4, 1988 through October 6, 2003. The MAEs for the holding sample October 7, 2003 through December 31, 2003 are expressed in percent of the GARCH(1, 1) error as reported in Table 1. Significant at 99% level.
We use n ¼ 5, that is, we include five different returns at horizons 1, 20, 60, 100, and 250 days. Thereby, we nest the simple ARCH(1) specification but also let 1-month, 3-month, 5-month, and 1-year returns influence daily fluctuations. Table 10 reports the estimation results. Remarkable are the consistently high significance of 3-month returns and the low estimates of the persistence parameter l. HARCH yields consistent improvements on the 1-day forecast horizon for the four stock-market series, but performs badly on the federal funds rate and the exchange rate.
5. CONCLUSION This study shows that apart from the well-known high persistence in financial volatility, there is also a short correlation, or fast mean reverting time scale. We find it in six different daily financial time series: four capture the U.S. stock market, the federal funds rate and the yen–dollar exchange rate.
176
ERIC HILLEBRAND
This short scale has a mean reversion time of less than 20 days. An ARMA fit to the residual 2t ht from a GARCH estimation improves the shortterm forecast of the GARCH model and provides a benchmark. We estimate several generalizations of GARCH that allow for multiple correlation time scales, including segmentation of the data and local GARCH estimation. Unfortunately, none of the considered models is able to fully exploit the information contained in the short scale and improve over the benchmark from the simple ARMA fit. Wavelet analysis of the volatility time series reveals that there is correlation between fluctuations on long scales and fluctuations on short scales. The heterogeneous ARCH model of Mu¨ller et al. (1997), which allows for correlation of this kind, can exploit some of the short-scale information from the stock market part of our data set.
NOTES 1. The relation between the mean reversion time 1/c estimated from the Lorentzian and the mean reversion time 1/(1 l) from the GARCH model was found in simulations as 1=c ¼ 86:74 þ 61:20 1=ð1 lÞ; R2 ¼ 0:93: A motivation of the Lorentzian for GARCH models and the details of the simulations are available upon request. Rt 2. Let Y t ¼ 1 ðt sÞH 1=2 dW ðsÞ be fractional Brownian motion, where W(t) is standard Brownian motion. Then, the relation between the Hurst coefficient H and the fractional integration parameter d is given by H ¼ d þ 1=2; see Geweke and Porter-Hudak (1983). 3. The color plots can be downloaded at http://www.bus.lsu.edu/economics/ faculty/ehillebrand/personal/.
ACKNOWLEDGMENTS This paper benefited from discussions with and comments from George Papanicolaou, Knut Solna, Mordecai Kurz, Caio Almeida, Doron Levy, Carter Hill, and the participants of the 3rd Annual Advances in Econometrics Conference, in particular Robert Engle and Tom Fomby. Remaining errors of any kind are mine.
REFERENCES Andersen, T. G., & Bollerslev, T. (1997). Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance, 4, 115–158.
Overlaying Time Scales in Financial Volatility Data
177
Andreou, E., & Ghysels, E. (2002). Detecting multiple breaks in financial market volatility dynamics. Journal of Applied Econometrics, 17, 579–600. Baillie, R. T., Bollerslev, T., & Mikkelsen, H. O. (1996). Fractionally integrated generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 74, 3–30. Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31, 307–327. Bollerslev, T., Chou, R. Y., & Kroner, K. F. (1992). ARCH modeling in finance: A review of theory and empirical evidence. Journal of Econometrics, 52, 5–59. Bollerslev, T., & Engle, R. F. (1993). Common persistence in conditional variances. Econometrica, 61(1), 167–186. Bollerslev, T., Engle, R. F., & Nelson, D. B. (1994). GARCH models. In: R. F. Engle & D. L. McFadden (Eds), Handbook of Econometrics (Vol. 4). Amsterdam: Elsevier. Chernov, M., Gallant, A. R., Ghysels, E., & Tauchen, G. (2003). Alternative models of stockprice dynamics. Journal of Econometrics, 116, 225–257. Davidson, J. (2004). Moment and memory properties of linear conditional heteroskedasticity models, and a new model. Journal of Business and Economics Statistics, 22(1), 16–29. Den Haan, W., & Levin, A. (1997). A practitioner’s guide to robust covariance matrix estimation. In: G. Maddala & C. Rao, (Eds), Handbook of Statistics (Vol. 15). Amsterdam, North-Holland: Robust Inference. Diebold, F. X. (1986). Modeling the persistence of conditional variances: A comment. Econometric Reviews, 5, 51–56. Ding, Z., & Granger, C. W. J. (1996). Modeling volatility persistence of speculative returns: A new approach. Journal of Econometrics, 73, 185–215. Ding, Z., Granger, C. W. J., & Engle, R. F. (1993). A long memory property of stock market returns and a new model. Journal of Empirical Finance, 1, 83–106. Doornik, J. A. (2002). Object-oriented matrix programming using Ox (3rd ed.). London: Timberlake Consultants Press. www.nuff.ox.ac.uk/Users/Doornik Doornik, J. A., & Ooms, M. (1999). A package for estimating, forecasting and simulating arfima models. www.nuff.ox.ac.uk/Users/Doornik Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4), 987–1007. Engle, R. F., & Bollerslev, T. (1986). Modelling the persistence of conditional variances. Econometric Reviews, 5(1), 1–50. Engle, R. F., & Lee, G. G. J. (1999). A long-run and short-run component model of stock return volatility. In: R. F. Engle & H. White (Eds), Cointegration, causality, and forecasting: A festschrift in honour of Clive W. J. Granger. Oxford: Oxford University Press. Engle, R. F., & Patton, A. J. (2001). What good is a volatility model? Quantitative Finance, 1(2), 237–245. Fouque, J. P., Papanicolaou, G., Sircar, K. R., & Sølna, K. (2003). Short time-scale in S&P 500 volatility. Journal of Computational Finance, 6, 1–23. Gallant, A. R., & Tauchen, G. (2001). Efficient method of moments. Mimeo. http:// www.unc.edu/arg Geweke, J., & Porter-Hudak, S. (1983). The estimation and application of long memory time series models. Journal of Time Series Analysis, 4(4), 221–238. Ghashgaie, S., Breymann, W., Peinke, J., Talkner, P., & Dodge, Y. (1996). Turbulent cascades in foreign exchange markets. Nature, 381, 767–770.
178
ERIC HILLEBRAND
Granger, C. W. J. (1980). Long memory relationships and the aggregation of dynamic models. Journal of Econometrics, 14, 227–238. Granger, C. W. J., & Joyeux, R. (1980). An introduction to long-memory time series models and fractional differencing. Journal of Time Series Analysis, 1(1), 15–29. Hamilton, J. D., & Susmel, R. (1994). Autoregressive conditional heteroskedasticity and changes in regime. Journal of Econometrics, 64, 307–333. Hillebrand, E. (2004). Neglecting parameter changes in autoregressive models. Louisiana State University Working Paper. Hillebrand, E. (2005). Neglecting Parameter Changes in GARCH Models. Journal of Econometrics, 129, Annals Issue on Modeling Structural Breaks, Long Memory, and Stock Market Volatility (forthcoming). Kokoszka, P., & Leipus, R. (1999). Testing for parameter changes in ARCH models. Lithuanian Mathematical Journal, 39, 182–195. Lamoureux, C. G., & Lastrapes, W. D. (1990). Persistence in variance, structural change, and the GARCH model. Journal of Business and Economic Statistics, 8, 225–234. Laurent, S., & Peters, J.-P. (2002). Garch 2.2: An Ox package for estimating and forecasting various ARCH models. Journal of Economic Surveys, 16(3), 447–485. LeBaron, B. (2001). Stochastic volatility as a simple generator of apparent financial power laws and long memory. Quantitative Finance, 1(6), 621–631. Mallat, S. (1999). A wavelet tour of signal processing (2nd ed.). San Diego: Academic Press. Mikosch, T., & Starica, C. (2004). Nonstationarities in financial time series, the longrange dependence, and the IGARCH effects. Review of Economics and Statistics, 86(1), 378–390. Mu¨ller, U. A., Dacorogna, M. M., Dave, R. D., Olsen, R. B., Pictet, O. V., & von Weizsaecker, J. E. (1997). Volatilities of different time resolutions – analyzing the dynamics of market components. Journal of Empirical Finance, 4, 213–239. Nelson, D. B. (1990). Stationarity and persistence in the GARCH(1, 1) model. Econometric Theory, 6, 318–334.
EVALUATING THE 'FED MODEL' OF STOCK PRICE VALUATION: AN OUT-OF-SAMPLE FORECASTING PERSPECTIVE

Dennis W. Jansen and Zijun Wang

ABSTRACT

The "Fed Model" postulates a cointegrating relationship between the equity yield on the S&P 500 and the bond yield. We evaluate the Fed Model as a vector error correction forecasting model for stock prices and for bond yields. We compare out-of-sample forecasts of each of these two variables from a univariate model and various versions of the Fed Model including both linear and nonlinear vector error correction models. We find that for stock prices the Fed Model improves on the univariate model for longer-horizon forecasts, and the nonlinear vector error correction model performs even better than its linear version.
1. INTRODUCTION

A simple model, the so-called "Fed Model," has been proposed for predicting stock returns or the level of the stock market as measured by the S&P 500 index. This model compares the earnings yield on the S&P 500 to
the yield on 10-year government bonds, where the earnings yield on the S&P 500 is calculated as the predicted future earnings of S&P 500 firms divided by the current value of the S&P 500 index. This model is not endorsed by the Board of Governors of the Federal Reserve System (i.e. the Fed), but it was used in the Fed's Humphrey–Hawkins report to Congress in July 1997 and was quickly picked up by private security analysts, who gave it the name "Fed Model." Variants have been adopted by a number of security strategists, including some at Prudential Securities and Morgan Stanley, for example.

The basic idea behind the Fed Model is that the yields on the S&P 500 and on the 10-year government bond should be approximately the same, and certainly should move together over time, so that deviations of the equity yield from the bond yield signal that one or both yields must adjust to restore the long-run equilibrium relationship between the two yields. A simple textbook asset pricing formula would link the value of a perpetuity to the ratio of the annual payment to the interest rate, after which a simple manipulation would lead to the equality of the interest rate and the asset yield, the equilibrium relationship postulated by the Fed Model. The Fed Model can be seen as postulating this simple asset pricing formula as a long-run equilibrium relationship instead of a relationship that holds observation by observation.

The Fed Model is not, however, completely consistent with the simple textbook asset pricing formula. The textbook asset pricing formula would have equality between the real interest rate and the asset yield, not equality between the nominal interest rate and the asset yield. Nonetheless, applications of the Fed Model have ignored the distinction between real and nominal interest rates (see Asness, 2003).

In this paper, we propose to investigate the Fed Model, to consider its statistical properties including stability over time, and especially to consider the predictive power of the Fed Model. We ask if the Fed Model contains useful information for forecasting stock prices or bond yields. We conduct an out-of-sample forecasting exercise and investigate the forecasting performance of the Fed Model at various horizons. Since the Fed Model can be interpreted as postulating a cointegrating relationship between the equity yield and the bond yield, our investigation will provide insights into the usefulness of incorporating this cointegrating relationship in a forecast of equity and bond yields. Further, because the postulated cointegrating relationship is a long-run equilibrium relationship, we evaluate the Fed Model's forecasting ability for various forecasting horizons and thus provide information on the relative utility of incorporating the cointegrating relationship into short- and longer-run forecasts.
We further investigate the predictive ability of the Fed Model in a nonlinear framework. Interpreting the Fed Model as postulating the cointegrating relationship between the equity and bond yield, we ask if a nonlinear threshold cointegration approach can improve predictive performance over and above the linear version of the Fed Model. Kilian and Taylor (2003) argue that nonlinear mean reversion better describes asset price movements in a world of noise trading and risky arbitrage. They hypothesize that the risk to arbitrage decreases as assets become increasingly over- or under-priced, leading to quicker adjustments toward equilibrium. Thus small differences are likely to be persistent, while large deviations are more quickly corrected. Kilian and Taylor (2003) suggest an exponential smooth transition autoregressive (ESTAR) model for nominal exchange rate deviations from purchasing power parity, while Rapach and Wohar (2005) consider a similar ESTAR model of the price-dividend and price-earnings ratios.
2. DATA

We obtained monthly data from January 1979 through February 2004 on the average of analysts' estimated 12-month forward earnings on the S&P 500 (FE), the S&P 500 price index (SP) and the 10-year U.S. Treasury bond yield (BY), all transformed to natural logarithms. This study was limited to a January 1979 starting point because this was the earliest period for which we obtained estimated forward earnings. These estimated forward earnings are estimates from private analysts and are not available over the much longer time span for which data on the S&P 500 is available.

For purposes of conducting an out-of-sample forecasting exercise we divided this sample into two parts, a subsample to be used for estimation, and a subsample to be used for out-of-sample forecast evaluation. The subsample for model specification and estimation is January 1979–February 1994, and the forecast evaluation period is March 1994–February 2004. We chose the forecast evaluation period to be a 10-year period because we wanted a relatively long period in order to evaluate longer-horizon forecasts and in order to avoid having the late-1990s stock market boom and subsequent bust unduly influence our results.

Panel A of Fig. 1 plots the three series FE, SP and BY for our entire sample period. All three series are measured in natural logarithms, and FE was multiplied by 100 (prior to taking logs) for ease of presentation. Both SP and FE show a general upward trend for most of the period, until
Fig. 1. Forward Earnings, S&P 500 Index and 10-Year Government Bond Yield. Panel (A) and Panel (B).
September 2000 for FE and August 2000 for SP. There is a less significant downward trend in BY over the sample period. We defined the equity yield (EY) as the difference between FE and SP. Panel B of Fig. 1 plots the equity yield, EY, and the bond yield, BY. Over the entire sample period of January 1979–February 2004, the average stock
yield is 2.06 (log percentage points), slightly higher than the average bond yield of 2.03. From Fig. 1 it appears that the two yield series move together over the sample period, but there is a greater tendency for short-run divergence in the very beginning of the sample, and again beginning in 1999.

Table 1 summarizes the results of unit root tests on each series. The first half of Table 1 presents the results over the initial estimation and specification subsample, January 1979–February 1994. Over this period we do not reject the null of a unit root in any of the four series, FE, SP, BY or EY, regardless of our trend specification. We do, however, reject a unit root in the yield difference, EY − BY, providing support for the Fed Model's implication that these variables are cointegrated, and with a specific cointegrating vector. For completeness, the second half of Table 1 presents unit root results over the entire sample period of January 1979–February 2004.

Table 1.  Results of the Augmented Dickey–Fuller (ADF) Test.

                                      Including Drift              Including Drift and Trend
Variable Name                         Lag order   t-statistic      Lag order   t-statistic

Estimation subsample: 1979.01–1994.02
  Forward earnings                    0           -0.542           4           -2.579
  Stock price                         0           -0.694           1           -3.323
  Bond yield                          2           -0.501           2           -2.961
  Equity yield                        0           -1.079           1           -2.736
  Yield difference                    1           -4.635**         —           —

Full sample: 1979.01–2004.02
  Forward earnings                    3           -0.370           3           -2.510
  Stock price                         1           -1.032           1           -1.833
  Bond yield                          3           -0.895           3           -4.212
  Equity yield                        1           -1.521           1           -2.340
  Yield difference                    1           -3.403*          —           —

Notes: (1) Forward earnings (FE), stock price (SP) and bond yield (BY) are all in logs. Equity yield is defined as FE − SP. The yield difference is the difference of the equity yield and the bond yield, FE − SP − BY. (2) The lag order for the tests was determined by the Schwarz information criterion (SIC) in a search over a maximum of 12 lags. (3) MacKinnon critical values for 170 sample observations (corresponding to our estimation subsample) are: test with drift, -3.470 (1%) and -2.878 (5%); test with drift and trend, -4.014 (1%) and -3.437 (5%). For 290 sample observations (corresponding to our full sample) critical values are: test with drift, -3.455 (1%) and -2.872 (5%); test with drift and trend, -3.993 (1%) and -3.427 (5%). *Indicates rejection of null at 5% critical value; **indicates rejection of null at 1% critical value.
Here the results are much the same as over the subsample, except that when the alternative hypothesis allows a drift we would reject the unit root null for the bond yield variable, BY. For purposes of considering the Fed Model, we note that for the yield difference EY − BY we again reject a unit root. Thus the preliminary indication is that the main hypothesis drawn from the Fed Model, that the yield difference is stationary, holds over the estimation subsample and over the full sample. In first differences, all four of our series reject the null of a unit root over either the estimation subsample or the full sample.
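For readers who want to reproduce unit root tests of this kind, a minimal sketch follows. It assumes the Python statsmodels package and a hypothetical pandas DataFrame named data holding the log-transformed FE, SP and BY series; neither the package choice nor the variable names comes from the paper.

import pandas as pd
from statsmodels.tsa.stattools import adfuller

def adf_table(data: pd.DataFrame) -> pd.DataFrame:
    # Series follow the paper's definitions: EY = FE - SP, yield difference = FE - SP - BY.
    series = {
        "Forward earnings": data["FE"],
        "Stock price": data["SP"],
        "Bond yield": data["BY"],
        "Equity yield": data["FE"] - data["SP"],
        "Yield difference": data["FE"] - data["SP"] - data["BY"],
    }
    rows = []
    for name, y in series.items():
        for reg in ("c", "ct"):  # drift only / drift and trend
            # Lag order chosen by the SIC over at most 12 lags, as in Table 1.
            stat, pval, lags, nobs, _, _ = adfuller(
                y.dropna(), maxlag=12, regression=reg, autolag="BIC"
            )
            rows.append({"variable": name, "deterministics": reg,
                         "lag order": lags, "t-statistic": stat, "p-value": pval})
    return pd.DataFrame(rows)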
3. ECONOMETRIC METHODOLOGY

3.1. Linear Error Correction Models

The basic empirical framework used in this study is the vector error correction, or VECM, model. Let Y_t denote a (3 × 1) vector, Y_t = (y_{1,t}, y_{2,t}, y_{3,t})'. Here, y_1 stands for the forward earnings (FE), y_2 for the stock price index (SP) and y_3 for the bond yield (BY). Using conventional notation, this model can be described as

Y_t = A_1 Y_{t-1} + \cdots + A_p Y_{t-p} + \mu + \epsilon_t, \quad t = 1, \ldots, T    (1)

and the error correction form of Eq. (1) is

\Delta Y_t = \Pi Y_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \Delta Y_{t-i} + \mu + \epsilon_t, \quad t = 1, \ldots, T    (2)

with

\Gamma_i = -(A_{i+1} + A_{i+2} + \cdots + A_p), \quad i = 1, 2, \ldots, p-1

and

\Pi = -(I_m - A_1 - A_2 - \cdots - A_p)

where Δ is the difference operator (ΔY_t = Y_t − Y_{t-1}), Π is a (3 × 3) coefficient matrix, Γ_i (i = 1, 2, …, p−1) a (3 × 3) matrix of short-run dynamics coefficients, μ a (3 × 1) vector of parameters and ε_t a vector of Gaussian errors. If the rank of matrix Π, r, is 0, then the three variables are not cointegrated. A VAR in first differences of the three variables, ΔFE, ΔSP and ΔBY, is appropriate. If 0 < r < 3, then the three variables are cointegrated, implying that there exists a long-run relationship between the
three series. In this case, Π can be written as Π = αβ', where α and β are of dimension 3 × r. Then β is called the cointegrating vector, and reflects the long-run relationship between the series. The term β'Y_{t-1} is the error correction term, while α is a measure of the average speed by which the variables converge toward the long-run cointegrating relationship. In particular, if the stock yield equals the bond yield in the long run, as the Fed Model suggests, then a natural candidate for the cointegrating vector in Eq. (2) is FE − SP − BY.

To examine whether the inclusion of series y_j in the regression of y_i (j ≠ i) helps improve the predictability of series y_i, we compare the out-of-sample forecasting performance of the trivariate model given by Eq. (2) (hereafter the VECM model) with that of the following univariate model:

\Delta y_{i,t} = \sum_{j=1}^{p-1} \gamma_{ij} \Delta y_{i,t-j} + \mu_i + \epsilon_{i,t}, \quad t = 1, \ldots, T; \; i = 1, 2    (3)
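As an illustration of how the trivariate VECM in Eq. (2) and the univariate benchmark in Eq. (3) might be estimated in practice, the following sketch uses the Johansen trace procedure and the VECM tools in the Python statsmodels package; the function, the variable names and the single lagged difference are placeholder assumptions rather than the authors' code.

import pandas as pd
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.vector_ar.vecm import VECM, select_coint_rank

def fit_models(data: pd.DataFrame, k_ar_diff: int = 1):
    # data: hypothetical DataFrame with columns 'FE', 'SP', 'BY' in logs.
    # Johansen trace test for the cointegrating rank r of Eq. (2).
    rank = select_coint_rank(data, det_order=0, k_ar_diff=k_ar_diff,
                             method="trace", signif=0.05)
    # Trivariate VECM (Eq. (2)) with the selected rank and a constant term.
    vecm_res = VECM(data, k_ar_diff=k_ar_diff, coint_rank=rank.rank,
                    deterministic="co").fit()
    # Univariate benchmark (Eq. (3)): an AR model in first differences.
    uni_res = {col: AutoReg(data[col].diff().dropna(),
                            lags=k_ar_diff, trend="c").fit()
               for col in ("SP", "BY")}
    return rank, vecm_res, uni_res

Out-of-sample forecasts can then be produced with vecm_res.predict(steps=h) and compared with the univariate forecasts, along the lines of the comparisons reported below.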
Note that in this model, the lag terms of Δy_j are not explanatory variables for Δy_i (i ≠ j), nor is the error correction term β'Y_{t-1}.

3.2. Linearity Testing and STAR Model Specification

The above linear error correction model in Eq. (2) has several limitations. Two that we will examine are, first, that it does not allow the possibility that error correction or dynamic adjustment might only occur if the difference in yields on the two securities exceeds a certain threshold (perhaps due to transaction costs). Second, the speed of adjustment toward the long-run cointegrating relationship might vary with the magnitude of the difference in yields on the two securities, perhaps due to heterogeneous behavior of investors. While Eq. (2) does not allow for these possibilities, both can be allowed in a smooth transition autoregressive (STAR) model, assuming SY and BY are cointegrated:

\Delta y_{i,t} = \sum_{j=1}^{p-1} \sum_{k=1}^{2} \gamma_{ijk} \Delta y_{k,t-j} + a_i EC_{t-1} + \mu_i + \left( \sum_{j=1}^{p-1} \sum_{k=1}^{2} \gamma^*_{ijk} \Delta y_{k,t-j} + a^*_i EC_{t-1} + \mu^*_i \right) F(z_{t-d}) + \epsilon_{i,t}, \quad i = 1, 2    (4)

where EC_{t-1} is the error correction term, EC_{t-1} = β'Y_{t-1}, z_{t-d} the transition variable and d the delay parameter, the lag between a change in the transition
variable and the resulting switch in dynamics.1 The function F(·) varies from zero to one and determines the switch in dynamics across regimes. In practice, there are two main specifications of the transition function. One is the logistic function:

F(z_{t-d}) = (1 + e^{-\theta(z_{t-d} - c)})^{-1}    (5)

Models using this transition function are often labeled logistic smooth transition autoregressive or LSTAR models. The alternative specification is the exponential function:

F(z_{t-d}) = 1 - e^{-\theta(z_{t-d} - c)^2}    (6)

Models using this transition function are often called exponential smooth transition autoregressive or ESTAR models. For both Eqs. (5) and (6), the parameter θ > 0 measures the speed of transition between regimes. In this paper, the transition variable is the error correction term, so z_{t-d} = β'Y_{t-d}. Therefore, the dynamics of our three-variable system change as a function of the deviation of our three variables from the specified long-run cointegrating relationship.

To conduct the linearity test, we follow the procedure proposed by Terasvirta and Anderson (1992). Specifically, we estimated the following equation:
\Delta y_{i,t} = \sum_{j=1}^{p-1} \sum_{k=1}^{2} \gamma_{ijk} \Delta y_{k,t-j} + a_i EC_{t-1} + \mu_i
  + \left( \sum_{j=1}^{p-1} \sum_{k=1}^{2} \gamma^1_{ijk} \Delta y_{k,t-j} + a^1_i EC_{t-1} \right) z_{t-d}
  + \left( \sum_{j=1}^{p-1} \sum_{k=1}^{2} \gamma^2_{ijk} \Delta y_{k,t-j} + a^2_i EC_{t-1} \right) z^2_{t-d}
  + \left( \sum_{j=1}^{p-1} \sum_{k=1}^{2} \gamma^3_{ijk} \Delta y_{k,t-j} + a^3_i EC_{t-1} \right) z^3_{t-d}
  + v_{i,t}, \quad i = 1, 2    (7)

The value of d is determined through the estimation of (7) for a variety of d values. For each d, one can test the hypothesis that γ^s_{ijk} = a^s_i = 0 for all j and k, where s = 1, 2, 3. If only one value of d indicates rejection of linearity, it becomes the chosen value. On the other hand, if more than one of the d values rejects linearity, Terasvirta (1994) suggests picking the delay that yields the lowest marginal probability value.
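A small numerical sketch of the logistic and exponential transition functions in Eqs. (5) and (6) is given below; the parameter values are purely illustrative and not taken from the paper.

import numpy as np

def lstar_transition(z, theta, c=0.0):
    # Logistic transition function of Eq. (5): rises from 0 to 1 as z passes c.
    return 1.0 / (1.0 + np.exp(-theta * (z - c)))

def estar_transition(z, theta, c=0.0):
    # Exponential transition function of Eq. (6): U-shaped, equal to 0 at z = c.
    return 1.0 - np.exp(-theta * (z - c) ** 2)

# Illustrative values of the error-correction term z = EC_{t-d}.
z = np.linspace(-0.6, 0.6, 7)
print(lstar_transition(z, theta=10.0))
print(estar_transition(z, theta=15.0))

Because the exponential form depends on (z − c)², it treats large positive and large negative deviations symmetrically, which makes it the natural choice when the speed of adjustment depends on the size rather than the sign of the yield gap.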
Terasvirta and Anderson (1992) also suggest testing the following set of hypotheses:

H_{0,1}: γ^3_{ijk} = a^3_i = 0
H_{0,2}: γ^2_{ijk} = a^2_i = 0 | γ^3_{ijk} = a^3_i = 0
H_{0,3}: γ^1_{ijk} = a^1_i = 0 | γ^2_{ijk} = a^2_i = γ^3_{ijk} = a^3_i = 0
Selection of ESTAR and LSTAR can be made using the following decision rules:

(1) If H_{0,1} is rejected, select an LSTAR model.
(2) If H_{0,1} is not rejected, impose it. Then if H_{0,2} is rejected, select an ESTAR model.
(3) If neither H_{0,1} nor H_{0,2} is rejected, impose these restrictions. Then if H_{0,3} is rejected, select an LSTAR model.

3.3. Forecast Comparisons

As mentioned above, we divided the data of length T into two subsets, a subsample of length T_1 to be used for model specification and estimation, and a subsample of length T_2 to be used for forecast evaluation. We specified our univariate, linear VECM and nonlinear VECM models based upon the sample information of the first T_1 observations and then generated h-step-ahead recursive forecasts of Y_t for the remaining T_2 observations from all models. We denote the corresponding forecast error series as e^R_{it} and e^U_{it}, respectively. To test whether the forecasts from the various models are statistically different, we apply a test procedure proposed by Diebold and Mariano (1995). The Diebold and Mariano test allows a comparison of the entire distribution of forecast errors from competing models. For a pair of h-step-ahead forecast errors (e^U_{it} and e^R_{it}, t = 1, …, T_2), forecast accuracy can be judged based on some specific function g(·) of the forecast error (where the mean squared forecast error, or MSFE, is often used). The null hypothesis of equal forecast performance is:

E[g(e^U_{it}) - g(e^R_{it})] = 0

where g(·) is a loss function. We use g(u) = u^2, the square loss function, in this paper. Define d_t by

d_t = g(e^U_{it}) - g(e^R_{it})
Then the Diebold–Mariano test statistic is given by

DM = [\hat{V}(\bar{d})]^{-1/2} \, \bar{d}

where \bar{d} is the sample mean of d_t and \hat{V}(\bar{d}) is the Newey–West heteroskedasticity and autocorrelation consistent estimator of the sample variance of \bar{d}. The D–M statistic does not have a standard asymptotic distribution under the null hypothesis when the models being compared are nested (see McCracken (2004) for one-period-ahead forecasts, and Clark & McCracken (2004) for multiple-period-ahead forecasts). We refer to the simulated critical values provided in Table 1 of McCracken (2004) for the recursive scheme. These critical values apply to the case of one-period-ahead forecasts.
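For concreteness, the D–M statistic under squared-error loss can be computed as in the following sketch, which hand-codes a Newey–West long-run variance with a Bartlett window; the truncation lag of h − 1 is a conventional assumption, and the code is an illustration rather than the authors' implementation.

import numpy as np

def diebold_mariano(e_u, e_r, h=1):
    # Loss differential d_t = g(e^U_t) - g(e^R_t) with g(u) = u^2.
    d = np.asarray(e_u) ** 2 - np.asarray(e_r) ** 2
    t2 = d.shape[0]
    d_bar = d.mean()
    # Newey-West estimate of the long-run variance of d_t,
    # using the conventional truncation lag h - 1 for h-step forecasts.
    u = d - d_bar
    lrv = u @ u / t2
    for j in range(1, h):
        w = 1.0 - j / h                      # Bartlett weight
        lrv += 2.0 * w * (u[j:] @ u[:-j]) / t2
    # Compare with the simulated critical values for nested models.
    return d_bar / np.sqrt(lrv / t2)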
4. EMPIRICAL RESULTS

4.1. Model Specification and Estimation

We use the first 182 monthly observations (January 1979–February 1994) for in-sample model specification, and the remaining 120 monthly observations (10 years) for out-of-sample forecasting comparisons. As the first step in model estimation, we determine the optimum lag order p in Eq. (1). The Schwarz information criterion (SIC) is minimized at p = 2, while the AIC suggests p = 4. Since our interest is out-of-sample forecasts and simple models often deliver better forecasts than complicated models, we choose p = 2 in the subsequent analysis.2

We test the cointegrating rank of model (2) following Johansen's (1988, 1991) trace procedure (and specifying p = 2). The test statistics we calculate are 35.94 and 11.03 for the null hypotheses r = 0 and r ≤ 1, respectively. The asymptotic critical values are 29.80 and 15.49 at the 5% level, so we conclude that r = 1. That is, we conclude that there is a single long-run relationship among the three variables, FE, SP and BY. Thus, the first model we consider is a vector error correction model with p = 2 (or one lag of the differences).3

We conduct two exercises to examine the stability of the long-run cointegrating relationship. First, we conduct a sequence of tests for structural change in which all samples start in March 1979 and the first sample ends in March 1994, the second in April 1994, and so on until the last sample, which ends in February 2004 and includes all 302 observations, our full sample. This exercise is based on the idea that the estimated cointegrating relationship is stable after a possible structural change, and hence including more observations will add to our ability to detect the change.
Second, we conducted a series of tests for structural change in the adjustment coefficients and/or in the cointegrating vector as developed by Seo (1996). Figure 2 summarizes the results of these exercises. Panel A presents the recursive estimates of the cointegration rank, which indicates that the cointegration relationship breaks down temporarily in 1999, is restored in mid-2000, but then breaks down again in mid-2002.4 Panels B and C report the sequential break test for the cointegrating vector (Panel B) and the adjustment coefficients (Panel C) as proposed by Seo. Both tests indicate structural change, and there is particularly strong evidence for structural change in the adjustment coefficient.

An advantage of the above vector error correction form is that it can be easily used to test the interesting hypothesis that the equity yield is equal to the bond yield in the long run, FE − SP = BY. This hypothesis can be tested as a restriction on the cointegration space in model (2), a test of β = (1, −1, −1)'. A simple first check of the appropriateness of this hypothesis is to test the nonstationarity of the yield difference series, FE − SP − BY, as we reported in Table 1. This yield difference is stationary over both our estimation subsample and our full sample. When we directly test this restriction within the context of an estimated VECM we find less support, as the Johansen LR test statistic for this hypothesis is 7.81 with a p-value of 0.02.

The restriction of equality between the equity yield and the bond yield, or at least of a stationary and constant difference between the equity yield and the bond yield, is based in part on economic theory. Thus, while the in-sample test result does not support the hypothesis of equal yields, we estimate and forecast a model with this restriction imposed to see whether it can improve the forecasting performance, especially in the long run.5 This model, an estimated VECM imposing equality between the equity and bond yield, is our second model.

Following the procedure described in Section 3.2, we test for the appropriateness of the linear assumption implicit in the linear models outlined above. For tests of linearity we use the in-sample observations (January 1979–February 1994). The sequential tests of the null hypotheses H_{0,1}, H_{0,2} and H_{0,3} are summarized in Table 2, where we assume the maximum value of d is 3. In conducting these tests, we maintain the restriction of equal yields, so the error correction term is EC_{t-1} = FE_{t-1} − SP_{t-1} − BY_{t-1}. The null hypothesis H_0 is that linearity is an appropriate restriction. It can be seen from Table 2 that H_0 can be rejected for all three equations, that is for FE, SP and BY. Thus, a third forecasting model we consider is a nonlinear VECM. The minimum probability value occurs at
Fig. 2. Stability Tests. Panel (A) Recursive Estimates of Cointegration Rank. Panel (B) Test for Stability of Cointegrating Vector. Panel (C) Test for Stability of Adjustment Coefficient. Note: In Panel (A) the First Sample Covers the Period of 1979.03–1994.02, and the Last Sample 1979.03–2004.02. Horizontal Axes Indicate the End Periods for Each Window. The Significance Level is 5%. In Panels B and C, the Short- and Long-Dashed Lines are the 5% Critical Values of the Sup_LM and Ave_LM Test Statistics of Seo (1998).
Table 2.  Linearity Tests and Determination of the Delay Parameter d.

Dependent variable         d = 1     d = 2     d = 3

Forward earnings (FE)
  H0                       0.051     0.000     0.000
  H0,1                     0.826     0.015     0.011
  H0,2                     0.454     0.485     0.003
  H0,3                     0.003     0.000     0.001

Stock price (SP)
  H0                       0.000     0.004     0.006
  H0,1                     0.121     0.609     0.278
  H0,2                     0.074     0.016     0.088
  H0,3                     0.001     0.006     0.005

Bond yield (BY)
  H0                       0.008     0.813     0.101
  H0,1                     0.582     0.680     0.027
  H0,2                     0.001     0.420     0.088
  H0,3                     0.351     0.841     0.455

Notes: (1) The four null hypotheses in relation to Eq. (7) are: H_0: γ^1_{ijk} = a^1_i = γ^2_{ijk} = a^2_i = γ^3_{ijk} = a^3_i = 0; H_{0,1}: γ^3_{ijk} = a^3_i = 0; H_{0,2}: γ^2_{ijk} = a^2_i = 0 | γ^3_{ijk} = a^3_i = 0; H_{0,3}: γ^1_{ijk} = a^1_i = 0 | γ^2_{ijk} = a^2_i = γ^3_{ijk} = a^3_i = 0. (2) Column entries are the associated p-values.
d = 3 for the FE equation and at d = 1 for the SP and BY equations. In the SP equation and restricting our attention to d = 1, the null hypothesis H_{0,1} is not rejected and H_{0,2} is rejected at the 0.10 but not the 0.05 significance level, while H_{0,3} is rejected even at the 0.01 significance level. Thus, SP could be modeled as either an ESTAR or an LSTAR model with a delay of one. In the BY equation and restricting our attention to d = 1, the null hypothesis H_{0,1} is not rejected, but H_{0,2} is rejected, which implies that an ESTAR specification is appropriate for BY. We will use an ESTAR specification for both BY and SP. For FE we have an LSTAR specification with d = 3.

To examine whether the above results are sensitive to the choice of the cointegrating rank, we generate forecasts from the linear VECM while imposing the rank restriction r = 0. That is, we specify the model as a VAR in first differences. This is our fourth model.

As the cointegration relationship was not stable throughout the forecasting period, we also generate forecasts from models that allow the cointegrating rank to vary across each recursive sample. The fifth model is estimated with increasing-length windows (samples). This allows us to simulate a forecaster in real time estimating the model, testing its
cointegrating rank and making forecasts from the resulting model, whether specified as a VECM with r = 1 or as a VECM with r = 0 (equivalently, a VAR in first differences).

Throughout the paper, we refer to Eq. (3) as the univariate model. It is a simple univariate model in first differences with one or more lags, and will serve as our benchmark model against which the above six multivariate specifications are tested. We note that a common claim is that stock market prices can be modeled as following a random walk. We also examined the out-of-sample forecasting performance of the random walk model. This model did slightly worse than our univariate model. For SP the root mean squared error (RMSE) is 0.0379 for the one-step-ahead forecast and 0.2060 for the 12-step-ahead forecast, both slightly higher than the RMSE values for the univariate model of SP. (See Tables 3 and 4 for a comparison.)
Table 3.  Out-of-Sample Forecasts (One-Step-Ahead: 1994.03–2004.02).

Tested Model: 0 = Linear Univariate AR; 1 = Linear VECM, Estimated CV; 2 = Linear VECM, Imposed CV; 3 = Nonlinear VECM; 4 = Linear VAR in Differences; 5 = Linear VECM, r varies, Increasing Window.

                           0         1         2         3         4         5
Forecasted variable: stock price (SP)
  RMSE                     0.0366    0.0379    0.0378    0.0384    0.0375    0.0378
  MAE                      0.0281    0.0284    0.0280    0.0285    0.0284    0.0286
  D–M statistic            —         1.467     1.146     1.701     1.461     —

Forecasted variable: bond yield (BY)
  RMSE                     0.0446    0.0445    0.0449    0.0452    0.0447    0.0444
  MAE                      0.0347    0.0345    0.0349    0.0359    0.0348    0.0346
  D–M statistic            —         0.128     0.256     0.373     0.057     —

Note: Model 0 is a univariate AR model, the comparison null model in each D–M test. Model 1 is a VECM with estimated cointegrating vector. Model 2 is a VECM with imposed cointegrating vector, EC = BY − SY. Model 3 is a nonlinear smooth transition VECM, with ESTAR transition function. Model 4 is a VAR(1) in first differences. Model 5 allows r to vary over increasing-length samples of data. **Significance at the 0.05 level; *significance at the 0.10 level.
Table 4.  Out-of-Sample Forecasts (12-Steps-Ahead: 1995.02–2004.02).

Tested Model: 0 = Linear Univariate AR; 1 = Linear VECM, Estimated CV; 2 = Linear VECM, Imposed CV; 3 = Nonlinear VECM; 4 = Linear VAR in Differences; 5 = Linear VECM, r varies, Increasing Window.

                           0         1         2         3         4         5
Forecasted variable: stock price (SP)
  RMSE                     0.1906    0.1941    0.1826    0.1775    0.1920    0.1994
  MAE                      0.1540    0.1572    0.1519    0.1468    0.1543    0.1619
  D–M statistic            —         0.400     0.514     0.679     0.627

Forecasted variable: bond yield (BY)
  RMSE                     0.1659    0.1492    0.1672    0.1606    0.1660    0.1590
  MAE                      0.1380    0.1122    0.1244    0.1275    0.1388    0.1271
  D–M statistic            —         1.179     0.047     0.332     0.116

Note: See Note in Table 3. **Significance at the 0.05 level; *significance at the 0.10 level.
4.2. Comparisons of Out-of-Sample Forecasts (One-Step-Ahead)

We first recursively estimate the linear VECM model with p = 2 and r = 1, as well as its restricted version, a univariate model in first differences. From the recursive parameter estimates we generate one-step-ahead up to 12-steps-ahead out-of-sample forecasts for the monthly values of FE, SP and BY. Thus, we generated 120 sets of one-step-ahead forecasts covering the period of March 1994–February 2004, and 109 sets of 12-steps-ahead forecasts covering the period of February 1995–February 2004. As the forecasts of FE are not of direct interest, they are omitted from much of the following discussion. Instead, we focus our attention on our main interest, forecasting stock prices, and our secondary interest, forecasting bond yields.

Forecast errors are derived by subtracting the forecasts from realized values. Two statistics are calculated for each forecast error series, the root mean squared forecast error (RMSE) and the mean absolute error (MAE). The results of the one-step-ahead forecasts are summarized in Table 3 under the column labeled 1. For SP, the RMSE from the VECM is 0.0379, which is larger than the 0.0366 from the univariate model. The MAE from the VECM
is 0.0284, also larger than the univariate model's 0.0281. The D–M statistic is 1.467, indicating at the 0.05 significance level that the VECM model generates less accurate forecasts than the univariate model that includes only lags of SP.

The bottom half of Table 3 contains results for forecasts of BY. Here, for the column labeled 1 we see that the RMSE and MAE of the VECM model are 0.0445 and 0.0345, respectively, and both are smaller than their counterparts in the univariate model, 0.0446 and 0.0347. Nevertheless, the D–M statistic is insignificant.

In Table 3, in the column labeled 2, we present the forecast comparisons of the VECM with the restricted cointegration space, model 2, and the univariate models. The imposition of this restriction improves the forecasts of SP while reducing the forecast accuracy of BY. Nevertheless, the forecasts of SP from the restricted VECM are still less accurate than the forecasts from the univariate model in terms of RMSE, although they have a slightly smaller MAE than the rival forecasts (0.0280 vs. 0.0281).

The one-step-ahead out-of-sample forecasts from the nonlinear VECM model (model 3) are reported in the column labeled 3 in Table 3.6 We find that the nonlinear VECM does not improve the forecasts of SP relative to the univariate model. In this case, we also find that the nonlinear model does not help with forecasting BY. It is interesting to note that the nonlinear VECM forecasts SP less accurately than the linear VECM (model 2), and the same result holds for BY as well.

The forecasts of the VAR in first differences (model 4) are better than those of the first three specifications for SP in terms of the RMSE measure, but are mixed with respect to the forecasting of BY. The last column of Table 3 presents the forecast summary from the flexible linear specification in which we allow the cointegration rank to vary over recursive samples. Similar to the results under the first four model specifications, the predictability of SP is never better than when using the univariate model. In contrast, there seems to be some weak evidence that the inclusion of stock market information can improve the forecasts of the bond yield.7

4.3. Comparisons of Long-Run Forecasts

The above results on one-step-ahead forecasts may not be surprising, as the hypothesis of equal yields on equity and bonds is more likely to hold in the long run and hence may be more likely to improve longer-horizon forecasts. As argued in the literature, the use of long-run restrictions in a forecasting model – even when theoretically based restrictions are rejected in parametric tests – might well be more important for long-horizon forecasting performance.
Turning our attention to longer-horizon forecasts, Table 4 summarizes the performance of the six models for forecasting SP and BY 12 steps (months) ahead. As in the one-step-ahead forecasts, a simple VAR in first differences (model 4) outperforms the VECM models with estimated cointegrating vectors (models 1 and 5). In contrast, the second linear VECM model, where we impose the equal yield restriction in the cointegrating vector, provides better forecasts than the univariate model by both performance measures. We note, however, that while the performance differences are significant at the 0.10 significance level, the simulated critical values in McCracken (2004) are obtained assuming h = 1. Hence, they can be used only as rough guides in evaluating the long-run forecasts. Note that when the nonlinearity is incorporated in the multivariate regression (in addition to the imposition of the equal yield restriction), the performance of the VECM model (model 3) further improves. It now beats the univariate model by relatively large margins.

The bottom panel of Table 4 shows that several VECM specifications generate better long-run forecasts than the univariate rival does. Exceptions are model 2, with the imposed cointegrating vector, at least when performance is measured by RMSE, and model 4, which has nearly the same RMSE but a higher MAE.

Figure 3 plots the performance of the nonlinear VECM model (relative to the univariate model) against the out-of-sample forecasting horizon. Clearly, the performance of the nonlinear VECM model improves (almost uniformly) as the forecasting horizon increases. For example, the RMSE of the nonlinear model is 1.05 times as large as that of the univariate model in one-step-ahead forecasts. When the horizon increases to six months, the two models perform equally well. At a 12-month horizon, the nonlinear model leads the competition by a ratio of 0.93:1.

The above result suggests that imposing restrictions from theory may improve the forecasting performance of the error correction model, a finding consistent with some forecasting literature (e.g. Wang & Bessler, 2002). The result also indicates that the finding in previous studies that VECM models do not always forecast better than the unrestricted VAR model or univariate models may be in part due to the parameter uncertainty in estimated cointegrating vectors and/or omitted nonlinearity. These results may not be surprising in light of other results in the literature. Clements and Hendry (1995) and Hoffman and Rasche (1996) report that the inclusion of error correction terms in the forecasting model typically improves forecasts mostly at longer horizons. Rapach and Wohar (2005) also report such a result, and further report that a nonlinear ESTAR model fits the data
Fig. 3. The Ratio of Root Mean Squared Errors (RMSE) of Stock Price Forecasts from Unrestricted and Restricted ESTAR Models.
reasonably well and provides a plausible explanation for long-horizon stock price predictability, especially predictability based on the price-dividend ratio.

4.4. The Impact of the High-Tech Bubble on the Predictability of Stock Returns

In this subsection, we offer a brief discussion of how stock price predictability may have been affected by the stock market boom in the high-tech industry. As was widely observed, the period of the late 1990s featured historically low earnings-to-price ratios, especially for the so-called high-tech stocks. These historically low earnings-to-price ratios may have had a large impact on the fit of our models. To provide some perspective, in Fig. 4 we plot the recursive estimates of the RMSE statistics for both the nonlinear VECM and the univariate model for the forecasting period.

We first focus on Panel A, which includes the one-step-ahead forecasts. In forecasting SP, the VECM performs better than the univariate model in the early period, but its performance deteriorates and is worse than the univariate model starting in late 1997. The VECM then continues to underperform the univariate model through the end of our sample.
Fig. 4. Recursive Root Mean Squared Errors (RMSE) of Stock Prices. Panel (A) One-Step-Ahead Forecasts. Panel (B) 12-Step-Ahead Forecasts. Note: For the One-Step-Ahead Forecasts, the Statistics are Computed for the Period 1994M03–2004M02. For the 12-Step-Ahead Forecasts, the Statistics are Computed for the Period 1995M02–2004M02.
Fig. 5. Recursive Root Mean Squared Errors (RMSE) of Bond Yield. (A) One-Step-Ahead Forecasts. (B) 12-Step-Ahead Forecasts. Note: For the One-Step-Ahead Forecasts, the Statistics are Computed for the Period 1994M03–2004M02. For the 12-Step-Ahead Forecasts, the Statistics are Computed for the Period 1995M02–2004M02.
Panel B of Fig. 4 shows that the period of the high-tech boom was a period in which the longer-term predictability of stock returns also declined. After the stock market bust in 2001, information on BY again became helpful in long-run forecasts of SP, and the VECM forecasts were again better than the univariate model.
Table 5.  Out-of-Sample Forecasts Using Actual Earnings (Sample Period: 1994.03–2004.02).

Panel A: One-Step-Ahead Forecasts

Tested model               Univariate AR model   Linear VECM; Imposed CV   Nonlinear VECM

Forecasted variable: stock price (SP)
  RMSE                     0.0366                0.0374                    0.0390
  MAE                      0.0281                0.0288                    0.0293
  D–M statistic            —                     1.556                     0.969

Forecasted variable: bond yield (BY)
  RMSE                     0.0446                0.0451                    0.0452
  MAE                      0.0347                0.0351                    0.0359
  D–M statistic            —                     0.399                     1.137

Panel B: 12-Month-Ahead Forecasts

Tested model               Univariate AR model   Linear VECM; Imposed CV   Nonlinear VECM

Forecasted variable: stock price (SP)
  RMSE                     0.1906                0.1742                    0.1653
  MAE                      0.1540                0.1476                    0.1440
  D–M statistic            —                     0.759                     0.777

Forecasted variable: bond yield (BY)
  RMSE                     0.1659                0.1915                    0.2056
  MAE                      0.1380                0.1570                    0.1708
  D–M statistic            —                     1.789                     2.328

Note: Model 0 is the univariate AR model, the comparison null model in each D–M test. Model 1 is a linear VECM with imposed cointegrating vector, EC = BY − SY. Model 2 is a nonlinear smooth transition VECM. **Significance at the 0.05 level; *significance at the 0.10 level.
Figure 5 plots the recursive RMSE estimates for BY for both one- and 12-step-ahead forecasts. It can be seen that the VECM consistently performs better than the univariate model in the long-run forecasts over the whole forecasting period.

Some may wonder if the results we have reported are due to our use of forecasted earnings. To investigate this possibility we reran our models using actual or reported earnings on the S&P 500. The results are similar, and are reported in Table 5. For the one-month-ahead forecasts of both SP and BY, the linear univariate models perform best. For the 12-month-ahead forecasts of SP, the linear VECM model with imposed cointegrating vector performs better than the univariate model, and the nonlinear VECM performs even better than the linear VECM. For 12-month-ahead predictions of BY, however, the univariate model outperforms either variant of the VECM.

We summarize the results relating forecasting horizon to forecasting performance for forecasts of SP in Fig. 6. Similar to our results with forward earnings, we find that the performance gain of the nonlinear VECM increases nearly monotonically with forecast horizon beginning at the five-month horizon.
Fig. 6. The Ratio of Root Mean Squared Errors (RMSE) of Stock Price Forecasts from Unrestricted and Restricted ESTAR Models using Actual Earnings.
4.5. Nonlinear Dynamics in the Stock Market

Finally, we provide some details about the nonlinear VECM model. Table 6 summarizes the parameter estimates from the nonlinear VECM based on the full sample. (We do not report the estimates of the linear VECM and the univariate model, but they are available from the authors upon request.) Of interest is the coefficient θ, which indicates the curvature of the transition function F(·). The estimate is moderate in magnitude in both the SP and BY equations, which is easier to see in Fig. 7, where we plot the estimated transition functions. Most observations of the EC variable (the delay variable) fall in the range −0.2 to 0.2. Within this range, the transition function has values from 0 to 0.5. Note that the limiting case of F = 1 is not attained over the range of observed yield differences, as the maximum difference is 0.543 with a transition function value of 0.987 (the observed minimum deviation is −0.491 with an F-value of 0.971). Panel B graphs the transition function for the BY equation, which essentially tells the same story. Finally, Panel C graphs the transition function for the FE equation, which has a much higher value for θ and hence a much narrower range in which the transition function is near zero.
Table 6.  Parameter Estimates from the STAR Model, 1979.01–2004.02.

Dependent variable         Stock Price (SP)          Bond Yield (BY)           Forward Earnings (FE)
                           Est.        p-value       Est.        p-value       Est.        p-value

μ                          0.0056      0.0157        0.0039      0.1191        0.0037      0.0001
ΔFE_{t-1}                  0.0495      0.7945        0.1307      0.5862        0.4816      0.0004
ΔSP_{t-1}                  0.1928      0.0090        0.1032      0.2835        0.1155      0.0026
ΔBY_{t-1}                  0.1235      0.0868        0.4260      0.0000        0.1163      0.0023
EC_{t-1}                   0.0880      0.0128        0.0036      0.9252        0.0015      0.8545
ΔFE_{t-1}·F(EC_{t-d})      0.1010      0.8124        0.6959      0.1181        0.4403      0.0248
ΔSP_{t-1}·F(EC_{t-d})      0.1436      0.4608        0.2742      0.1508        0.1636      0.0119
ΔBY_{t-1}·F(EC_{t-d})      0.1097      0.4482        0.2764      0.0801        0.1312      0.0168
EC_{t-1}·F(EC_{t-d})       0.0728      0.0823        0.0193      0.6668        0.0126      0.2592
θ                          14.6483     0.2284        22.5343     0.1217        13.9508     0.0231
Mean log-likelihood        1.883                     1.804                     2.9566

Note: The error correction term is EC_{t-1} = FE_{t-1} − SP_{t-1} − BY_{t-1}. For the SP and BY equations the transition function is exponential, with transition variable EC_{t-d}, where d = 1 for the SP equation and d = 3 for the BY equation. For the FE equation the transition function is logistic, with transition variable EC_{t-d}, where d = 3.
Fig. 7. Transition Functions. Panel (A) Stock Price. Panel (B) Bond Yield. Panel (C) Forward Earnings. Panel (D) Histogram of EC. (Summary statistics for the EC series, sample 1979:01–2004:02, 302 observations: mean 0.027837, median 0.031615, maximum 0.542701, minimum -0.491297, standard deviation 0.184640, skewness 0.204498, kurtosis 3.789494; Jarque–Bera 9.948129, probability 0.006915.)
5. CONCLUSION

The Fed Model of stock market valuation postulates a cointegrating relationship between the equity yield on the S&P 500 (earnings over index value) and the bond yield. We evaluate the Fed Model as a vector error correction forecasting model for stock prices and for bond yields, and compare out-of-sample forecasts of each of these two variables from the Fed Model to forecasts from alternative models that include a univariate model, a nonlinear vector error correction model, and a set of modified vector error correction models.

We find that the Fed Model does not improve on the forecasts of the univariate models at short horizons, but that the Fed Model is able to improve on the univariate model for longer-horizon forecasts, especially for stock prices. The nonlinear vector error correction model performs even better than its linear version as a forecasting model for stock prices, although this does not hold true for forecasts of the bond yield.
NOTES

1. The STAR model has been widely used in the finance and economics literature (e.g. Terasvirta & Anderson, 1992; Anderson, 1997; Michael, Nobay, & Peel, 1997; Jansen & Oh, 1999; Bradley & Jansen, 2004).
2. We note that in the full sample the SIC attains its minimum at p = 2, while the choice of the AIC remains p = 4.
3. We examined the models with p = 4 and found that the forecasts of SP were less accurate than those from the corresponding models with p = 2. In contrast, we found that forecasts of FE were more accurate in the VAR(4). As our focus is the predictability of stock prices, and the basic conclusions in the paper are not changed by using other p-values (i.e. p = 1, 3, 4), we only report the results associated with p = 2.
4. We also conducted a second exercise in which we sequentially test the cointegrating rank using 121 rolling fixed-window samples. The first sample is the period March 1979–February 1994, 180 months, and we continue changing the starting and ending period by one month until we reach our final sample period of March 1994–February 2004. This procedure allows for a quicker detection of underlying structural breaks at the cost of never having a sample size that grows larger over time with the addition of new observations. The rolling sample estimates are consistent with Panel A but if anything indicate an earlier breakdown of the cointegrating relationship, in 1997.
5. The p-value of the null hypothesis of equal yields for the full sample is 0.05.
6. Note that the in-sample linearity test suggests that the delay parameter d is 1 for the BY equation. However, we find that, while the basic results to be reported below remain the same for SP for any d-value, the specification with d = 3 for the BY equation provides somewhat better out-of-sample forecasts. This is consistent with the linearity test based on the full sample, where we find that the test obtains the
minimum p-value at d = 3 (the p-values are 0.068, 0.207 and 0.037 for d = 1, 2 and 3, respectively). This may be due to the stock market responding more quickly to innovations than the bond market. In any case, we report results for d = 3 for the BY equation throughout the paper.
7. We also investigated the performance of a rolling fixed-window linear VECM. Results were similar to those of the recursive model reported in Table 3.
REFERENCES

Anderson, H. (1997). Transition costs and non-linear adjustment toward equilibrium in the U.S. Treasury bill market. Oxford Bulletin of Economics and Statistics, 59, 465–484.
Asness, C. (2003). Fight the Fed Model: The relationship between future returns and stock and bond market yields. Journal of Portfolio Management, 30, 11–24.
Bradley, M. D., & Jansen, D. W. (2004). Forecasting with a nonlinear dynamic model of stock returns and industrial production. International Journal of Forecasting, 20, 321–342.
Clark, T., & McCracken, M. W. (2004). Evaluating long-horizon forecasts. Manuscript, University of Missouri-Columbia.
Clements, M. P., & Hendry, D. F. (1995). Forecasting in cointegrated systems. Journal of Applied Econometrics, 10, 127–146.
Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business and Economic Statistics, 13, 253–263.
Hoffman, D. L., & Rasche, R. H. (1996). Assessing forecast performance in a cointegrated system. Journal of Applied Econometrics, 11, 495–517.
Jansen, D. W., & Oh, W. (1999). Modeling nonlinearity of business cycles: Choosing between the CDR and STAR models. Review of Economics and Statistics, 81, 344–349.
Johansen, S. (1988). Statistical analysis of cointegration vectors. Journal of Economic Dynamics and Control, 12, 231–254.
Johansen, S. (1991). Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models. Econometrica, 59, 1551–1580.
Kilian, L., & Taylor, M. P. (2003). Why is it so difficult to beat the random walk forecast of exchange rates? Journal of International Economics, 14, 85–107.
McCracken, M. W. (2004). Asymptotics for out-of-sample tests of Granger causality. Working Paper, University of Missouri-Columbia.
Michael, P. A., Nobay, R., & Peel, D. A. (1997). Transaction costs and nonlinear adjustment in real exchange rates: An empirical investigation. Journal of Political Economy, 105, 862–879.
Rapach, D. E., & Wohar, M. E. (2005). Valuation ratios and long-horizon stock price predictability. Journal of Applied Econometrics, 20, 327–344.
Seo, B. (1996). Tests for structural change in cointegrated systems. Econometric Theory, 14, 222–259.
Terasvirta, T. (1994). Specification, estimation, and evaluation of smooth transition autoregressive models. Journal of the American Statistical Association, 89, 208–218.
Terasvirta, T., & Anderson, H. (1992). Characterizing nonlinearities in business cycles using smooth transition autoregressive models. Journal of Applied Econometrics, 7, S119–S136.
Wang, Z., & Bessler, D. A. (2002). The homogeneity restriction and forecasting performance of VAR-type demand systems: An empirical examination of U.S. meat consumption. Journal of Forecasting, 21, 193–206.
STRUCTURAL CHANGE AS AN ALTERNATIVE TO LONG MEMORY IN FINANCIAL TIME SERIES

Tze Leung Lai and Haipeng Xing

ABSTRACT

This paper shows that volatility persistence in GARCH models and spurious long memory in autoregressive models may arise if the possibility of structural changes is not incorporated in the time series model. It also describes a tractable hidden Markov model (HMM) in which the regression parameters and error variances may undergo abrupt changes at unknown time points, while staying constant between adjacent change-points. Applications to real and simulated financial time series are given to illustrate the issues and methods.
1. INTRODUCTION

Volatility modeling is a cornerstone of empirical finance, as portfolio theory, asset pricing and hedging all involve volatilities. Since the seminal works of Engle (1982) and Bollerslev (1986), generalized autoregressive conditionally heteroskedastic (GARCH) models have been widely used to model and forecast volatilities of financial time series. In many empirical studies of
stock returns and exchange rates, estimation of the parameters ω, α and β in the GARCH(1,1) model

y_n = \sigma_n \epsilon_n, \quad \sigma_n^2 = \omega + \alpha y_{n-1}^2 + \beta \sigma_{n-1}^2    (1)

reveals high volatility persistence, with the maximum likelihood estimate λ̂ = α̂ + β̂ close to 1. The constraint λ < 1 in the GARCH model (1), in which the ε_n are independent standard normal random variables that are not observable and the y_n are the observations, enables the conditional variance σ_n^2 of y_n given the past observations to be expressible as

\sigma_n^2 = (1 - \lambda)^{-1}\omega + \alpha\{(y_{n-1}^2 - \sigma_{n-1}^2) + \lambda(y_{n-2}^2 - \sigma_{n-2}^2) + \lambda^2(y_{n-3}^2 - \sigma_{n-3}^2) + \cdots\}
For λ = 1, the contributions of the past innovations y_{n-t}^2 − σ_{n-t}^2 to the conditional variance do not decay over time but are "integrated" instead, yielding the IGARCH model of Engle and Bollerslev (1986). Baillie, Bollerslev, and Mikkelsen (1996) introduced fractional integration in their FIGARCH models, with a slow hyperbolic rate of decay for the influence of the past innovations, to quantify the long memory of exchange rate volatilities.

In his comments on Engle and Bollerslev (1986), Diebold (1986) noted with respect to interest rate data that the choice of a constant term ω in (1) not accommodating shifts in monetary policy regimes might have led to an apparently integrated series of innovations. By assuming ω to be piecewise constant and allowing the possibility of jumps between evenly spaced time intervals, Lamoureux and Lastrapes (1990) obtained smaller estimates of λ, from the daily returns of 30 stocks during the period 1963–1979, than those showing strong persistence based on the usual GARCH model with constant ω.

In Section 2 we carry out another empirical study of volatility persistence in the weekly returns of the NASDAQ index by updating the estimates of the parameters ω, α and β in the GARCH model (1) as new data arrive during the period January 1986 to September 2003, starting with an initial estimate based on the period November 1984 to December 1985. These sequential estimates show that the parameters changed over time and that λ̂_N eventually became very close to 1 as the sample size N increased with accumulating data. The empirical study therefore suggests using a model that incorporates the possibility of structural changes for these data. In Section 3 we describe a structural change model, which allows changes in the volatility and regression parameters at unknown times and with unknown change magnitudes. It is a hidden Markov model (HMM), for which the
volatility and regression parameters at time t based on data up to time t (or up to time n > t) can be estimated by recursive filters (or smoothers). Although we have been concerned with volatility persistence in GARCH models so far, the issue of spurious long memory when the possibility of structural change is not incorporated into the time series model also arises in autoregressive and other models. In Section 4, by making use of the stationary distribution of a variant of the structural change model in Section 3, we compute the asymptotic properties of the least squares estimates in a nominal AR(1) model that assumes constant parameters, thereby showing near unit root behavior of the estimated autoregressive parameter. Section 5 gives some concluding remarks and discusses related literature on structural breaks and the advantages of our approach.
2. VOLATILITY PERSISTENCE IN NASDAQ WEEKLY RETURNS

Figure 1, top panel, plots the weekly returns of the NASDAQ index, from the week starting on November 19, 1984 to the week starting on September 15, 2003. The series r_t is constructed from the closing price P_t on the last day of the week via r_t = 100 log(P_t / P_{t-1}). The data, consisting of 982 observations, are available at http://finance.yahoo.com. Similar to Lamoureux and Lastrapes (1990, p. 227), who use lagged dependent variables to account for nonsynchronous trading, we fit an AR(2) model r_n = μ + ρ_1 r_{n-1} + ρ_2 r_{n-2} + y_n to the return series. The y_n are assumed to follow a GARCH(1,1) process so that the ε_n in (1) is standard normal and independent of {(ε_i, y_i, r_i), i ≤ n−1}. The parameters of this AR-GARCH model can be estimated by maximum likelihood using 'garchfit' in MATLAB. Besides using the full dataset, we also estimated these parameters sequentially from January 1986 to September 2003, starting with an initial estimate based on the period November 1984 to December 1985. These sequential estimates are plotted in Fig. 2. Note that the estimates on the last date, September 15, 2003, correspond to those based on the full dataset; see the last row of Table 1.

Lamoureux and Lastrapes (1990) proposed the following strategy to incorporate possible time variations in ω in fitting GARCH(1,1) models to a time series of N = 4,228 observations of daily returns of each of 30 stocks during the period 1963–1979. They allow for possible jumps in ω every 302 observations, thereby replacing ω in (1) by ω_0 + δ_1 D_{1n} + ⋯ + δ_k D_{kn}, where D_{1n}, …, D_{kn} are dummy variables indicating the subsample to which n belongs. The choice of 302, 604, 906, … as the only possible jump points of ω
Fig. 1. Weekly Returns (Top Panel) of NASDAQ Index and Estimated Probabilities of Structural Change (Bottom Panel).
while not allowing jumps in α and β appears to be overly restrictive. In Section 3.3 we fit an HMM of structural change to the observed time series and use the fitted model to estimate the probability of changes in both the autoregressive and volatility parameters at any time point. On the basis of these estimated probabilities, which are plotted in the bottom panel of Fig. 1, we divide the time series of NASDAQ index returns into 4 segments of different lengths; see Section 3.3 for further details. The estimates of the AR-GARCH parameters in each segment are compared with those based on the entire dataset in Table 1. Consecutive segments in Table 1 have slight overlap because we try not to initialize the returns with highly volatile data for a segment.

Table 1 shows substantial differences in the estimated AR and GARCH parameters among the different time periods. In particular, λ̂ is as small as 0.0364 in the first segment but as high as 0.9858 in the last segment. Although this seems to suggest volatility persistence during the last period of 5½ years, we give an alternative viewpoint at the end of this section.
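A sketch of this segment-by-segment AR(2)-GARCH(1,1) estimation is given below. It uses the Python arch package instead of MATLAB's garchfit, and the break dates passed to it, as well as parameter labels such as 'alpha[1]', are assumptions made for illustration rather than part of the authors' code.

import pandas as pd
from arch import arch_model

def segment_garch(r: pd.Series, breakpoints):
    # r: weekly returns 100*log(P_t/P_{t-1}) with a DatetimeIndex.
    # Consecutive segments share a boundary date, giving the slight overlap noted above.
    edges = [r.index[0]] + list(breakpoints) + [r.index[-1]]
    out = []
    for start, end in zip(edges[:-1], edges[1:]):
        seg = r.loc[start:end]
        res = arch_model(seg, mean="AR", lags=2, vol="GARCH",
                         p=1, q=1, dist="normal").fit(disp="off")
        out.append({"start": start, "end": end,
                    "omega": res.params["omega"],
                    "alpha": res.params["alpha[1]"],
                    "beta": res.params["beta[1]"],
                    "lambda": res.params["alpha[1]"] + res.params["beta[1]"]})
    return pd.DataFrame(out)

# Example call with boundaries like those in Table 1:
# segment_garch(r, ["1987-06-16", "1990-08-15", "1998-03-13"])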
Fig. 2. Sequential Estimates of AR-GARCH Parameters.
Table 1.  Estimated AR(2)-GARCH(1,1) Parameters for Different Periods.

Period                         μ        ρ1        ρ2        ω        α         β        λ

Nov 19, 1984–Jun 16, 1987      0.3328   0.3091    0.09838   1.7453   0.0364    0.0      0.0364
Jun 09, 1987–Aug 15, 1990      0.1242   0.2068    0.0891    0.8364   0.4645    0.4784   0.9429
Aug 08, 1990–Mar 13, 1998      0.3362   0.03874   0.1039    0.4560   0.08767   0.8008   0.8885
Mar 06, 1998–Sep 15, 2003      0.2785   0.05612   0.04063   0.8777   0.2039    0.7819   0.9858

Nov 19, 1984–Sep 15, 2003      0.3203   0.09241   0.07657   0.2335   0.2243    0.7720   0.9963
Using the estimates m^ of m; r^ 1 and r^ 2 of the autoregressive parameters and ^ a^ and b^ of the GARCH parameters, we can compute the estimated mean o; return m^ þ r^ 1 rn 1 þ r2 r^n 2 at time n, the residual y^ n ¼ rn ðm^ þ r^ 1 rn 1 þ ^ þ a^ y^ 2n 1 þ b^ s^ 2n 1 Þ1=2 recursively. The r^ 2 rn 2 Þ; and the volatility s^ n ¼ ðo
210
TZE LEUNG LAI AND HAIPENG XING
4
0
-4
-8 4
0
-4
-8 4
0
-4
-8 11/19/84
01/01/90
Fig. 3.
01/01/95
01/01/00
09/15/03
Mean Return Estimates Using Three Different Methods.
results are plotted in Figs. 3 and 4, with the top panel using parameter estimates based on the entire dataset and the middle panel using different parameter estimates for different segments of the data. The bottom panel of Fig. 3 plots the estimated mean returns based on a HMM of structural change (see Section 3.2). The bottom panel of Fig. 4 plots (1) the estimated volatilities (solid curve) based on the structural change model (see Section 3.2), and (2) standard deviations (dotted curve) of the residuals in fitting an AR(2) model to the current and previous 19 observations (see Section 3.3 for further details). Consider the last 512 year period in Table 1 that gives l^ ¼ 0:9858: An alternative to the GARCH model, which is used for the bottom panel of Fig. 4, is the structural change model, which assumes that yn ¼ sn n with sn undergoing periodic jumps according to a renewal process, unlike the continuously changing sn in the GARCH model (1). As the sn are unknown, we replace them by their estimates s^ n (given by the solid curve
211
Structural Change as an Alternative to Long Memory
15
10
5
0 15
10
5
0 10 8 6 4 2 0 11/19/84
01/01/90
Fig. 4.
01/01/95
01/01/00
09/15/03
Volatility Estimates Using Three Different Methods.
in the bottom panel of Fig. 4) and consider a simulated set of innovations ynn ¼ s^ n nn for this last segment of the data, in which nn are i.i.d. standard normal random variables. Fitting a nominal GARCH (1,1) model to these simulated innovations yielded a value l^ ¼ 0:9843; which is also close to 1, and differs little from value of 0.9858 for the last period in Table 1.
3. AUTOREGRESSIVE MODELS WITH STRUCTURAL BREAKS IN VOLATILITY AND REGRESSION PARAMETERS Bayesian modeling of structural breaks in autoregressive time series has been an active research topic in the last two decades. Hamilton (1989) proposed a regime-switching model, which treats the autoregressive parameters as the hidden states of a finite-state Markov chain. Noting that volatility
212
TZE LEUNG LAI AND HAIPENG XING
persistence of stock returns may be overestimated when one ignores possible structural changes in the time series of stock returns, Hamilton and Susmel (1994) extended regime switching for autoregression to a Markov-switching ARCH model, which allows the parameters of Engle’s (1982) ARCH model of stock returns to oscillate among a few unspecified regimes, with transitions between regimes governed by a Markov chain. Albert and Chib (1993) considered autoregressive models with exogenous variables whose autoregressive parameters and error variances are subject to regime shifts determined by a two-state Markov chain with unknown transition probabilities. Wang and Zivot (2000) introduced a Bayesian time series model which assumes the number of the change-points to be known and in which multiple change points in level, trend and error variance are modeled using conjugate priors. Similar Bayesian autoregressive models allowing structural changes were introduced by Carlin, Gelfand, and Smith (1992), McCulloch and Tsay (1993), Chib (1998) and Chib, Nardari, and Shepard (2002). However, except for Hamilton’s (1989) seminal regime-switching model, Markov Chain Monte Carlo techniques are needed in the statistical analysis of these models because of their analytic intractability. Lai, Liu, and Xing (2005) recently introduced the following model to incorporate possible structural breaks in the AR(k) process X n ¼ mn þ r1n X n
1
þ þ rkn X n
k
þ sn n ; n4k
(2)
where the n are i.i.d. unobservable standard normal random variables, and hn ¼ ðmn ; r1n ; . . . ; rkn ÞT and sn are piecewise constant parameters. Note that s2n is assumed to be piecewise constant in this model, unlike the continuous change specified by the linear difference Eq. (1). The sequence of changetimes of ðhTt ; st Þ is assumed to form a discrete renewal process with parameter p, or equivalently, I t :¼ 1fðht ; st Þaðht 1 ; st 1 Þg are i.i.d. Bernoulli random variables with PðI t ¼ 1Þ ¼ p for t k þ 2 and I kþ1 ¼ 1: In addition, letting tt ¼ ð2s2t Þ 1 ; it is assumed that ðhTt ; tt Þ ¼ ð1 where
ðZT1 ;
g1 Þ;
ðZT2 ; g2 Þ;
I t Þ ðhTt 1 ; tt 1 Þ þ I t ðzTt ; gt Þ
(3)
. . . are i.i.d. random vectors such that,
gt Gamma ðg; lÞ; Zt jgt Normal ðz; V=ð2gt ÞÞ
(4)
An important advantage of this Bayesian model of structral breaks is its analytic and computational tractability. Explicit recursive formulas are available for estimating sn, mn, r1,n,y and rk,n in the Bayesian model (2)–(4).
213
Structural Change as an Alternative to Long Memory
As we now proceed to show, these formulas also have an intuitively appealing interpretation as certain weighted averages of the estimates based on different segments of the data and assuming that there is no change-point in each segment. 3.1. Sequential Estimation of sn ; mn ; r1;n ; . . . ; rk;n via Recursive Filters To estimate ðhTn ; s2n Þ from current and past observations X 1 ; . . . ; X n ; let Xt;n ¼ ð1; X n ; . . . ; X t ÞT and consider the most recent change-time J n :¼ max ft n : I t ¼ 1g: Recalling that tn ¼ ð2s2n Þ 1 ; the conditional distribution of ðhTn ; tn Þ given ðJ n ; XJ n ; nÞ can be described by n Jn þ 1 1 1 ; ; hn jtn Normal zJ n ;n ; VJ ;n tn Gamma g þ 2 aJ n ;n 2tn n (5) where for kojrn, 1 n n P P Xt k; t 1 XTt k; t 1 Xt k; t 1 X t Vj;n ¼ V 1þ ; zj;n ¼ Vj;n V 1 z þ t¼1
t¼1
aj;n
¼
l
1
1
þ zT V z þ
n P t¼j
X 2t
zTj;n Vj;n1 zj;n
(6) ~ distribution, then Y has the inverse ~ lÞ Note that if (2Y) 1 has a Gamma ðg; ~ l ¼ 2l~ and that EY ¼ l 1 ðg 1Þ 1 gamma IG(g,l) distribution with g ¼ g; pffiffiffiffi 1=2 1 when g41 and E Y ¼ l Gðg 2Þ=GðgÞ. It then follows from (5) that EðhTn ; s2n jX1;n Þ ¼
¼
n X
j¼kþ1 n X
j¼kþ1
pj;n EðhTn ; s2n jX1;n ; J n ¼ jÞ
pj;n zTj;n ;
aj;n 2g þ n j
1
ð7Þ
where pj;n ¼ PðJ n ¼ jjX1;n Þ; see Lai et al. (2005, p. 282). Moreover, Eðsn jX1;n Þ ¼ Snj¼kþ1 pj;n ðaj;n =2Þ1=2 Gn j ; where Gi ¼ Gðg þ i=2ÞGðg þ ði þ 1Þ=2Þ: Denoting conditional densities by f ðjÞ; the weights pj,n can be determined recursively by ( pf ðX n jJ n ¼ jÞ if j ¼ n (8) pj;n / pnj;n :¼ ð1 pÞpj;n 1 f ðX n jXj k;n 1 ; J n ¼ jÞ if j n 1
214
TZE LEUNG LAI AND HAIPENG XING
P P Since ni¼kþ1 pi;n ¼ 1; pj;n ¼ pnj;n = ni¼kþ1 pni;n : Moreover, as shown in Lemma 1 of Lai et al. (2005), f ðlj 1 ðX n
zTj;n 1 Xj
k;n 1 ÞjJ n
¼ j; Xj
k;n 1 Þ
¼ Studðgj Þ
(9)
where Stud(g) denotes the Student-t density function with g degrees of freedom and gj ¼ 2g þ n
j;
gj ¼ 2g;
lj ¼ aj;n ð1 þ XTn lj ¼ ð1 þ
kþ1;n 1 VXn kþ1;n 1 Þ=ð2g T Xn kþ1;n 1 VXn kþ1;n 1 Þ=ð2lgÞ
þn
jÞ if jon ifj ¼ n
3.2. Estimating Piecewise Constant Parameters from a Time Series via Bayes Smoothers Lai et al. (2005) evaluate the minimum variance estimate Eðtt ; hTt jX 1 ; . . . ; X n Þ; with k+1rtrn, by applying Bayes’ theorem to combine the forward filter that involves the conditional distribution of ðtt ; hTt Þ given X 1 ; . . . ; X t and the backward filter that involves the conditional distribution of ðtt ; hTt Þ given X tþ1 ; . . . ; X n : Noting that the normal distribution for ht assigns positive probability to the explosive region fh ¼ ðm; r1 ; . . . ; rk ÞT : 1 r1 z rk zk g has roots inside the unit circle, they propose to replace the normal distribution in (4) by a truncated normal distribution that has support in some stability region C such that inf j1
jzj1
r1 z
rk zk j40 if h ¼ ðm; r1 ; . . . ; rk ÞT 2 C
(10)
Letting TC Normal(z, V) denote the conditional distribution of Z given Z A C, where Z Normal(z, V) and C satisfies the stability condition (10), they modify (4) as gt Gamma ðg; lÞ; Zt jgt TC Normal ðz; V=ð2gt ÞÞ
(11)
hTt ;
and show that ðtt ; Xt kþ1;t Þ has a stationary distribution under which ðhTt ; tt Þ has the same distribution as ðZTt ; gt Þ in (11) and
X t jðhTt ; tt Þ Normal ðmt =ð1 r1;t rk;t Þ; ð2tt Þ 1 vt Þ P 2 where vt ¼ 1 in the power series repj¼0 bj;t and bj;t are the coefficients P j resentation of 1=ð1 a1;t z ak;t zk Þ ¼ 1 b j¼0 j;t z for jzj 1: In addiT tion, they show that the Markov chain ðtt ; ht ; Xt kþ1;t Þ is reversible if it is initialized at the stationary distribution, and therefore the backward filter of ðtt ; hTt Þ based on Xn,y,Xt+1 has the same structure as the forward
215
Structural Change as an Alternative to Long Memory
predictor based on the past n – t observations prior to t. Application of Bayes’ theorem shows that the forward and backward filters can be combined to yield f ðtt ; ht jX1;n Þ / f ðtt ; ht jX1;t Þf ðtt ; ht jXtþ1;n Þ=pðtt ; ht Þ
(12)
where p denotes the stationary density function, which is the same as that of (gt, zt) given in (11); see Lai et al. (2005, p. 284). Because the truncated normal is used in lieu of the normal distribution in (11), the conditional distribution of hn given tn needs to be replaced by hn jtn TC Normal ðzJ n ;n ; VJ n ;n =ð2tn ÞÞ; while (6) defining Vj,n, zj,n and aj,n remains unchanged. For the implementation of the forward or backward filter, Lai et al. (2005) propose to ignore the truncation, as the constraint set C only serves to generate non-explosive observations but has little effect on the values of the weights pj,n. Analogous to (5) and (6), the conditional distribution of (ht, tt) given J t ¼ i; J~ t ¼ j þ 1 and Xi,j for irtojrn can be described by j tt Gamma g þ
iþ1 1 1 ; ; ht jtt Normal Zi;j;t; Vi;j;t 2 ai;j;t 2tt
(13)
if we ignore the truncation in the truncated normal distribution in (11), where 1
~ Vi;j;t ¼ ðVi;t1 þ V j;t
V 1Þ
1
~ 1 z~ j;tþ1 zi;j;t ¼ Vi;j;t ðVi;j;t1 zi;t þ V j;t ai;j;t ¼ l
1
þ zT V 1 z þ
j X
X 2l
V 1 zÞ zTi;j;t Vi;j;t1 zi;j;t
l¼i
~ j;t ; z~ j;t and a~ j;t are defined in which Vi,t, zi,t and ai,t are defined in (6) and V similarly by reversing time. Let| |denote the determinant of a matrix, bi;j;t ¼
~ j;t j jVi;t j jV jVj jVi;j;t j
Bt ¼ p þ ð1
pÞ
1=2
(
X
GðgÞGðg þ 12ðj i þ 1ÞÞ Gðg þ 12ðt i þ 1ÞÞGðg þ 12ðj
kþ1itojn
pi;t p~ j;t bi; j;t
tÞÞ
)
gþðt iþ1Þ=2 gþðj tÞ=2 a~ j;t gþðj iþ1Þ=2 g a ai;j;t
ai;t
216
TZE LEUNG LAI AND HAIPENG XING
Using (12) and (13), Lai et al. (2005, p. 288) have shown that analogous to (7), t P : Eðs2t jX1;n Þ¼ Bpt
i¼kþ1
: Eðht jX1;n Þ¼ Bpt
t P
i¼kþ1
pi;t ai;t 2gþt i 1
þ 1Btp
pi;t zi;t þ 1Btp
P
a
kþ1itjn
P
pi;t p~ i;t bi;j;t 2gþji;j;tiþ1 (14)
pi;t p~ j;t bi;j;t zi;j;t
kþ1itjn
in which the approximation ignores truncation within C. Moreover, the conditional probability of a structural break at time t (r n) given X1,n is PfI t ¼ 1jX1;n g ¼
t X
i¼kþ1
PfI t ¼ 1; J t
1
: ¼ ijX1;n g¼p=Bt
(15)
Although the Bayes filter uses a recursive updating formula (8) for the weights pj,n (kojrn), the number of weights increases with n, resulting in unbounded computational and memory requirements in estimating sn and hn as n keeps increasing. Lai et al. (2005) propose a bounded complexity mixture (BCMIX) approximation to the optimal filter (7) by keeping the most recent mp weights together with the largest np mp of the remaining weights at every stage n (which is tantamount to setting the other n np weights to be 0), where 0rmponp. For the forward and backward filters in the Bayes smoothers, we can again use these BCMIX approximations, yielding a BCMIX smoother that approximates (14) by allowing at most np weights pi,t and np weights p~ j;t to be nonzero. In particular, the Bayes estimates of the timevarying mean returns and volatilities in the bottom panels of Figs. 3 and 4 use np ¼ 40 and mp ¼ 25: 3.3. Segmentation via Estimated Probabilities of Structural Breaks The plot of the estimated probabilities of structural breaks over time in the bottom panel of Fig. 1 for the NASDAQ weekly returns provides a natural data segmentation procedure that locates possible change-points at times when these estimated probabilities are relatively high. These probabilities are computed via (15), in which the hyperparameters p, z,V,g and l of the change-point autoregressive model (2)–(4) are determined as follows. First, we fit an AR(2) model to Pfrt ; rt 1 ; . . . ; rt 19 g; max(k+1, 20)rtrn, yielding m~ t ; r~ 1t ; r~ 2t and s~ 2t ¼ t 1 tj¼t 19 ðrj m~ t r~ 1t rj 1 r~ 2t rj 2 Þ2 ; which are the ‘‘moving window estimates’’ (with window size 20) of mt, r1t, r2t and s2t : The dotted curve in the bottom panel of Fig. 4 is a plot of the s~ t series. Then we estimate z and V by the sample mean z^ and the sample covariance
217
Structural Change as an Alternative to Long Memory
^ of fðm~ t ; r~ 1t ; r~ 2t ÞT ; max ðk þ 1; 20Þ t ng; and apply the method matrix V of moments to estimate g and l of (4). Specifically, regarding ð2s~ 2t Þ 1 as a sample from the Gamma(g, l) distribution, Eðs~ 2t Þ ¼ ð2lÞ 1 ðg 1Þ 1 and Varðs~ 2t Þ ¼ ð2lÞ 2 ðg 1Þ 2 ðg 2Þ 1 ; and using the sample mean and the sample variance of the s~ 2t to replace their population counterparts yields the following estimates of g and l: g^ ¼ 2:4479;
l^ ¼ 0:03960
z^ ¼ ð0:1304 0:003722 0:008111ÞT , 0 1 0:06096 0:009412 0:0003295 B C ^ ¼ B 0:009412 0:01864 0:004521 C V @ A 0:0003295 0:004521 0:01023
With the hyperparameters z, V, g and l thus chosen, we can estimate the hyperparameter p by maximum likelihood, noting that f ðX t jX1;t 1 Þ ¼ pf ðX t jJ t ¼ tÞ þ ð1 ¼
t X
pÞ
t 1 X
j¼kþ1
PðJ t
1
¼ jjX1;t 1 Þf ðX t jX1;t 1 ; J t
1
¼ jÞ
pnj;t
j¼kþ1
in which pnj;t is defined by (8) and is a function of p. Hence the log-likelihood function is ! ! t n n Y X X n lðpÞ ¼ log pj;t log f ðX t jX1;t 1 Þ ¼ t¼kþ1
t¼kþ1
j¼kþ1
Figure 5 plots the log-likelihood function thus defined for the NASDAQ weekly returns from November 19, 1984 to September 15, 2003, giving the maximum likelihood estimate p^ ¼ 0:027 of the hyperparameter p.
4. NEAR UNIT ROOT BEHAVIOR OF LEAST SQUARES ESTIMATES IN A CHANGE-POINT AUTOREGRESSIVE MODEL We have shown in Section 2 that volatility persistence in GARCH models may arise if the possiblity of structural changes is ignored. Similarly, spurious
218
TZE LEUNG LAI AND HAIPENG XING
-2300
-2305
-2310
-2315
-2320
-2325
-2330
0
Fig. 5.
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
0.045
0.05
Log-Likelihood Function of Hyperparameter p in Bayesian ChangePoint Model.
unit root in autoregressive models may also arise if one omits the possibility of structural changes. There is an extensive literature on the interplay between unit root (or more general long memory) and structural breaks; see Perron (1989), Perron and Vogelsang (1992), Zivot and Andrews (1992), Lumsdaine and Papell (1997), Maddala, Kim, and Phillips (1999), Granger and Hyung (1999), Diebold and Inoue (2001) and the special issues of Journal of Business and Economic Statistics (1992) and Journal of Econometrics (1996). In this Section, we use the HMM of structural changes similar to that in Section 3 to demonstrate that spurious long memory can arise when the possibility of structural change is not incorporated. Consider the autoregressive model with occasional mean shifts X n ¼ mn þ rX n
1
þ n
(16)
where n are i.i.d. unobservable standard normal random variables and are independent of I n :¼ 1fmn amn 1 g ; which are i.i.d. Bernoulli random variables
219
Structural Change as an Alternative to Long Memory
with PðI n ¼ 1Þ ¼ p; as in (3), and mn ¼ ð1
I n Þmn
1
(17)
þ I n Zn
with i.i.d. normal Zn that have common mean m and variance n. The autoregressive parameter r is assumed to be less than 1 in absolute value but is unknown. The methods in Section 3.1 or 3.2 can be readily modified to estimate mn and r, either sequentially or from a given set of data {Xn, 1rnrT}. Without incorporating possible shifts in m, suppose we fit an AR(1) model X n ¼ m þ rX n 1 þ n to {Xn, 1rnrT}. The maximum likelihood estimate ^ is the same as the least squares estimate; in particular, ^ rÞ ðm; P P PT T 1 T 1 i¼2 X i X i 1 i¼1 X i i¼2 X i T 1 (18) r^ ¼ P 2 PT 1 2 T 1 1 X X i i i¼1 i¼1 T 1 Since|r|o1, the Markov chain (mn,Xn) defined by (16) and (17) has a stationr) ary distribution p under which mn has the N(m, v) distribution and (1 Xn has the N(m, 1+v) distribution. Hence E p ðX n Þ ¼ m=ð1
rÞ; E p ðX 2n Þ ¼ ð1 þ v þ m2 Þ=ð1
rÞ2 ; E p ðmn Þ ¼ m
(19)
Making use of (19) and the stationary distribution of (mn,Xn), we prove in the appendix the following result, which shows that the least squares r^ in the nominal model that ignores structural breaks can approach 1 as n ! 1 even though |r|o1. Theorem. Suppose that in (16) and (17), ð12pÞr ¼ 1 with probability 1.
pv: Then r^ ! 1
Table 2 reports the results of a simulation study on the behavior of the least squares estimate p^ in fitting an AR(1) model with constant m to {Xn,1rnrT}, in which Xn is generated from (16) that allows periodic jumps in m. Unless stated otherwise, T ¼ 1000: Each result in Table 2 is the average of 100 simulations from the change-point model (16), with the standard error included in parentheses. The table considers different values of the parameters p, r and v of the change-point model and also lists the corresponding values of the ratio (l—p)r/(l—pv) in the preceding theorem. Table 2 shows that r^ is close to 1 when this ratio is 1, in agreement with the theorem. It also shows that r^ can be quite close to 1 even when this ratio differs substantially from 1. Also given in Table 2 are the p-values (computed by ‘PP. test’ in R) of the Phillips–Perron test of the null hypothesis that r ¼ 1; see Perron (1988). These results show that near unit root
220
TZE LEUNG LAI AND HAIPENG XING
Means and Standard Errors (in Parentheses) of Least Squares Estimates and P-Values of Unit Root Test.
Table 2.
ð1 pÞr 1 pv
r^
P-Value
0.05
0.5030
8
0.20
0.2028
0.002
8
0.84
0.8520
0.002
16
0.50
0.5155
0.004
8
0.50
0.5145
0.008
8
0.50
0.5299
0.01
16
84/99
1
0.01
16
84/99
1
0.01
16
84/99
1
0.01
16
84/99
1
0.8628 (0.0180) 0.7591 (0.0320) 0.9730 (0.0052) 0.9134 (0.0174) 0.9505 (0.0126) 0.9891 (0.0007) 0.9978 (0.0002) 0.9986 (0.0001) 0.9988 (0.0001) 0.9988 (0.0000)
0.1023 (0.0174) 0.1595 (0.0240) 0.4443 (0.0371) 0.3818 (0.0329) 0.3505 (0.0318) 0.3514 (0.0252) 0.6110 (0.0252) 0.4241 (0.0201) 0.2585 (0.0193) 0.0811 (0.0090)
p
v
0.002
4
0.002
r
ðT ¼ 2000Þ ðT ¼ 3000Þ ðT ¼ 4000Þ
behavior of r^ occurs quite frequently if the possibility of structural change is not taken into consideration.
5. CONCLUDING REMARKS Financial time series modeling should incorporate the possibility of structural changes, especially when the series stretches over a long period of time. Volatility persistence or long memory behavior of the parameter estimates typically occurs if one ignores possible structural changes, as shown by the empirical study in Section 2 and the asymptotic theory in Section 4. Another way to look at such long memory behavior is through ‘‘multiresolution analysis’’ or ‘‘multiple time scales,’’ in which the much slower time scale for parameter changes (causing ‘‘long memory’’) is overlaid on the short-run fluctuations of the time series. The structural change model in Section 3 provides a powerful and tractable method to capture structural changes in both the volatility and regression parameters. It is a HMM, with the unknown regression and volatility
Structural Change as an Alternative to Long Memory
221
parameters ht and st undergoing Markovian jump dynamics, so that estimation of ht and st can be treated as filtering and smoothing problems in the HMM. Making use of the special structure of the HMM that involves gamma-normal conjugate priors, there are explicit recursive formulas for the optimal filters and smoothers. Besides the BCMIX approximation described in Section 3.2, another approach, introduced in Lai et al. (2005), is based on Monte Carlo simulations and involves sequential importance sampling with resampling. In Section 3.3, we have shown how the optimal structural change model can be applied to segment financial time series by making use of the estimated probabilities of structural breaks. There is an extensive literature on tests for structural breaks in static regression models. Quandt (1960) and Kim and Siegmund (1989) considered likelihood ratio tests to detect a change-point in simple linear regression, and described approximations to the significance level. The issue of multiple change-points was addressed by Bai (1999), Bai and Perron (1998), and Bai, Lumsdaine, and Stock (1998). The computational complexity of the likelihood statistic and the associated significance level increase rapidly with the prescribed maximum number of change-points. Extension from static to dynamic regression models entails substantially greater complexity in the literature on tests for structural breaks in autoregressive models involving unit root and lagged variables; see Kramer, Plogerger, and Alt (1988), Banerjee, Lumsdaine, and Stock (1992), Doufour and Kivlet (1996), Harvey, Leybourne, and Newbold (2004) and the references therein. The posterior probabilities of change-points given in Section 3.2 for the Bayesian model (2)–(4), which incorporates changes not only in the autoregressive parameters but also in the error variances, provides a much more tractable Bayes test for structural breaks, allowing multiple change-points. The frequentest properties of the test are given in Lai and Xing (2005), where it is shown that the Bayes test provides a tractable approximation to the likelihood ratio test.
ACKNOWLEDGMENT This research was supported by the National Science Foundation.
REFERENCES Albert, J. H., & Chib, S. (1993). Bayes inference via Gibbs sampling of autoregressive time series subject to Markov mean and variance shifts. Journal of Business and Economic Statistics, 11, 1–15.
222
TZE LEUNG LAI AND HAIPENG XING
Bai, J. (1999). Likelihood ratio tests for multiple structural changes. Journal of Econometrics, 91, 299–323. Bai, J., Lumsdaine, R. L., & Stock, J. H. (1998). Testing for and dating common breaks in multivariate time series. Review of Economic Studies, 65, 395–432. Bai, J., & Perron, P. (1998). Estimating and testing linear models with multiple structural changes. Econometrica, 66, 47–78. Baillie, R. T., Bollerslev, T., & Mikkelsen, H. O. (1996). Fractionally integrated generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 74, 3–30. Banerjee, A., Lumsdaine, R. L., & Stock, J. H. (1992). Recursive and sequential tests of the unit-root and trend-break hypothesis: Theory and international evidence. Journal of Business and Economic Statistics, 10, 271–287. Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31, 307–327. Carlin, B. P., Gelfand, A. E., & Smith, A. F. M. (1992). Hierarchical Bayesian analysis of changepoint problems. Applied Statistics, 41, 389–405. Chib, S. (1998). Estimation and comparison of multiple change-point models. Journal of Econometrics, 86, 221–241. Chib, S., Nardari, F., & Shepard, N. (2002). Markov chain Monte Carlo methods for stochastic volatility models. Journal of Econometrics, 108, 281–316. Diebold, F. X. (1986). Comment on modeling the persistence of conditional variance. Econometric Reviews, 5, 51–56. Diebold, F. X., & Inoue, A. (2001). Long memory and regime switching. Journal of Econometrics, 105, 131–159. Doufour, J.-M., & Kivlet, J. F. (1996). Exact tests for structural change in first-order dynamic models. Journal of Econometrics, 70, 39–68. Engle, R. F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50, 987–1007. Engle, R. F., & Bollerslev, T. (1986). Modeling the persistence of conditional variances. Econometric Reviews, 5, 1–50. Granger, C. W. J., & Hyung, N. (1999).Occasional structural breaks and long memory. Economics working paper series, 99-14, Department of Economics, UC San Diego. Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57, 357–384. Hamilton, J. D., & Susmel, R. (1994). Autoregressive conditional heteroskedasticity and changes in regime. Journal of Econometrics, 64, 307–333. Harvey, D. L., Leybourne, S. J., & Newbold, P. (2004). Tests for a break in level when the order of integration is unknown. Oxford Bulletin of Economics and Statistcs, 66, 133–146. Kim, H., & Siegmund, O. D. (1989). The likelihood ratio test for a change-point in simple linear regression. Biometrika, 76, 409–423. Kramer, W., Plogerger, W., & Alt, R. (1988). Testing for structural change in dynamic models. Econometrica, 56, 1355–1369. Lai, T. L., Liu, H., & Xing, H. (2005). Autoregressive models with piecewise constant volatility and regression parameters. Statistica Sinica, 15, 279–301. Lai, T. L., & Xing, H. (2005). Hidden Markov models for structural changes and stochastic cycles in regression and time series. Working paper, Department of Statistics, Stanford University. Lamoureux, C. G., & Lastraps, W. D. (1990). Persistence in variance, structural change and the GARCH model. Journal of Business and Economic Statistics, 8, 225–234.
223
Structural Change as an Alternative to Long Memory
Lumsdaine, R. L., & Papell, D. H. (1997). Multiple trend breaks and the unit root hypothesis. Review of Economics and Statistics, 79, 212–218. Maddala, G. S., Ki, I., & Phillips, P. C. B. (1999). Unit roots, cointegration, and structural change. New York: Cambridge University Press. McCulloch, R. E., & Tsay, R. S. (1993). Bayesian inference and prediction for mean and variance shifts in autoregressive time series. Journal of the American Statistical Association, 88, 968–978. Meyn, S. P., & Tweedie, R. L. (1993). Markov chains and stochastic stability. New York: Springer-Verlag. Perron, P. (1988). Trends and random walks in macroeconomic time series. Journal of Economic Dynamics and Control, 12, 297–332. Perron, P. (1989). The great crash, the oil price shock, and the unit root hypothesis. Econometrica, 57, 1361–1401. Perron, P., & Vogelsang, T. J. (1992). Nonstationarity and level shifts with an application to purchasing power parity. Journal of Business and Economic Statistics, 10, 301–320. Quandt, R. E. (1960). Tests of the hypothesis that a linear regression system obeys two seperate regimes. Journal of the American Statistical Association, 55, 324–330. Wang, J., & Zivot, E. (2000). A Bayesian time series model of multiple structural changes in level, trend and variance. Journal of Business and Economic Statistics, 18, 374–386. Zivot, E., & Andrews, D. W. K. (1992). Further evidence on the great crash, the oil-price shock, and the unit-root hypothesis. Journal of Business and Economic Statistics, 10, 251–270.
APPENDIX The proof of the theorem makes use of (18), (19) and Covp ðmnþ1 ; X n Þ ¼ ð1
pÞv=f1
ð1
pÞrg
(A.1)
To prove (A.1), first note from (16) that Covp ðmn ; X n Þ ¼ v þ rCovðmn ; X n 1 Þ ¼ v þ rfEðmn ; X n 1 Þ Eðmn ÞEðX n 1 Þg ¼ v þ ð1
pÞrCovp ðmn 1 ; X n 1 Þ
ðA:2Þ
The last equality in (A.2) follows from (17) and that Ep(InZn) ¼ pm. The recursive representation of Covp(mn,Xn) in (A.2) yields Covp ðmn ; X n Þ ¼ v=f1
ð1
pÞrg
Combining this with the first equality in (A.2) establishes (A.1). In view of (A.2) and the strong law for Markov random walks (cf. Theorem 17.1.7 of
224
TZE LEUNG LAI AND HAIPENG XING
Meyn and Tweedie (1993)), T X1 i¼1
X i =T ! E p ðX n Þ;
T 1 X i¼1
!
X iþ1 X i =T ¼
! E p ðmnþ1 X n Þ þ
(
T X1
miþ1 X i þ r
i¼1 rE p ðX 2n Þ
T 1 X i¼1
X 2i
þ
¼ Covp ðmnþ1 ; X n Þ
þ E p ðmnþ1 ÞE p ðX n Þ þ rE p ðX 2n Þ
T X1 i¼1
)
iþ1 X i =T
ðA:3Þ
with probability 1, noting that E p ðX n nþ1 Þ ¼ 0: Putting (A.3) and (19), (A.1) into (18) shows that with probability 1, r^ converges r þ fvð1 pÞ ð1 pÞ2 g=fð1 þ vÞ½1 ð1 pÞrg; which is equal to 1 if r ¼ ð1 pvÞ=ð1 pÞ:
TIME SERIES MEAN LEVEL AND STOCHASTIC VOLATILITY MODELING BY SMOOTH TRANSITION AUTOREGRESSIONS: A BAYESIAN APPROACH Hedibert Freitas Lopes and Esther Salazar ABSTRACT In this paper, we propose a Bayesian approach to model the level and the variance of (financial) time series by the special class of nonlinear time series models known as the logistic smooth transition autoregressive models, or simply the LSTAR models. We first propose a Markov Chain Monte Carlo (MCMC) algorithm for the levels of the time series and then adapt it to model the stochastic volatilities. The LSTAR order is selected by three information criteria: the well-known AIC and BIC, and by the deviance information criteria, or DIC. We apply our algorithm to a synthetic data and two real time series, namely the canadian lynx data and the SP500 returns.
Econometric Analysis of Financial and Economic Time Series/Part B Advances in Econometrics, Volume 20, 225–238 Copyright r 2006 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(05)20028-2
225
226
HEDIBERT FREITAS LOPES AND ESTHER SALAZAR
1. INTRODUCTION Smooth transition autoregressions (STAR), initially proposed in its univariate form by Chan and Tong (1986) and further developed by Luukkonen, Saikkonen, and Tera¨svirta (1988) and Tera¨svirta (1994), have been extensively studied over the last 20 years in association to the measurement, testing, and forecasting of nonlinear financial time series. The STAR model can be seen as a continuous mixture of two AR(k) models, with the weighting function defining the degree of nonlinearity. In this article, we focus on an important subclass of the STAR model, the logistic STAR model of order k, or simply the logistic smooth transition auto regressive LSTAR(k) model, where the weighting function has the form of a logistic function. Dijk, Tera¨svirta, and Franses (2002) make an extensive review of recent developments related to the STAR model and its variants. From the Bayesian point of view, Lubrano (2000) used weighted resampling schemes (Smith & Gelfand, 1992) to perform exact posterior inference of both linear and nonlinear parameters. Our main contribution is to propose an Markov Chain Monte Carlo (MCMC) algorithm that allows easy and practical extensions of LSTAR models to modeling univariate and multivariate stochastic volatilities. Therefore, we devote Section 2 to introducing the general LSTAR(k) to model the levels of a time series. Prior specification, posterior inference, and model choice are also discussed in this section. Section 3 adapts the MCMC algorithm presented in the previous section to modeling stochastic volatility. Finally, in Section 4, we apply our estimation procedures to a synthetic data and two real time series, namely the canadian lynx data and the SP500 returns, with final thoughts and perspectives listed in Section 5.
2. LSTAR: MEAN LEVEL Let yt be the observed value of a time series at time t and xt ¼ ð1; yt 1 ; . . . ; yt k Þ0 the vector of regressors corresponding to the intercept plus k lagged values, for t ¼ 1; . . . ; n: The logistic smooth transition autoregressive model of order k, or simply LSTAR(k), is defined as follows: yt ¼ x0t y1 þ F ðg; c; st Þx0t y2 þ t
t Nð0; s2 Þ
(1)
with F playing the role of a smooth transition continuous function bounded between 0 and 1. In this paper, we focus on the logistic transition, i.e. F ðg; c; st Þ ¼ f1 þ expð gðst
cÞÞg 1 .
(2)
227
Time Series Mean Level and Stochastic Volatility Modeling
Several other functions could be easily accommodated, such as the exponential or the second order logistic function (Dijk et al., 2002). The parameter g40 is responsible for the smoothness of F, while c is a location or threshold parameter. When g-N, the LSTAR model reduces to the wellknown self-exciting TAR (SETAR) model (Tong, 1990) and when g ¼ 0 the standard AR(k) model arises. Finally, st is called the transition variable, with st ¼ yt d commonly used (Tera¨svirta, 1994), and d a delay parameter. Even though we use yt d as the transition variable throughout this paper, it is worth emphasizing that any other linear/nonlinear functions of exogenous/endogenous variables can be easily included with minor changes in the algorithms presented and studied in this paper. We will also assume throughout the paper, without loss of generality, that drk and y k+1, y ,y0 are known and fixed quantities. This is common practice in the time series literature and would only marginally increase the complexity of our computation and could easily be done by introducing a prior distribution for those missing values. The LSTAR(k) model can be seen as model that allows smooth and continuous shifts between two extreme regimes. More specifically, if d ¼ y1 þ y2 ; then x0t y1 and x0t d represent the conditional means under the two extreme regimes yt ¼ ð1
F ðg; c; yt
0 d ÞÞxt y1
þ F ðg; c; yt
0 d Þxt d
þ t
(3)
or 8 0 x y þ t > < t 1 0 0 yt ¼ ð1 oÞxt y1 þ oxt d þ t > : x0 d þ t
t
if o ¼ 0 if 0ooo1 if o ¼ 1
Therefore, the model (3) can be expressed as (1) with y2 ¼ d y1 being the contribution to the regression of considering a second regime. Huerta, Jiang, and Tanner (2003) also address issues about nonlinear time series models through logistic functions. We consider the following parameterization that will be very useful in the computation of the posterior distributions: yt ¼ z0t y þ t
(4)
where y0 ¼ ðy01 ; y02 Þ represent the linear parameters, (g, c) are the nonlinear parameters, Y ¼ ðy; g; c; s2 Þ and z0t ¼ ðx0t ; F ðg; c; yt d Þx0t Þ: The dependence of zt on g,c and yt d will be made explicit whenever necessary. The likelihood is
228
HEDIBERT FREITAS LOPES AND ESTHER SALAZAR
then written as pðyjYÞ / s where y ¼ ðy1 ; . . . ; yn Þ0 :
n=2
exp
1 Xn ðy t¼1 t 2s2
z0t yÞ2
(5)
2.1. Prior Specification and Posterior Inference In this Section, we derive MCMC algorithm for posterior assessment of both linear and nonlinear parameters and for the observations’ variance, s2, so that the number of parameters to be estimated is 2k+5. We also consider the estimation of the delay parameter d. We adopt Lubrano’s (2000) formulation, i.e. ðy2 js2 ; gÞ Nð0; s2 eg I kþ1 Þ and pðy1 ; g; s2 ; cÞ / ð1 þ g2 Þ 1 s 2 for y1 2
Time Series Mean Level and Stochastic Volatility Modeling
229
2.2. Choosing the Model Order k In this Section, we introduce the traditional information criteria, Akaike Information Criterion (AIC) (Akaike, 1974) and Bayesian Information Criterion (BIC) (Schwarz, 1978), widely used in model selection/comparison. To this toolbox we add the DIC (Deviance Information Criterion) that is a criterion recently developed and widely discussed in Spiegelhalter, Best, Carlin, and Linde (2002). Traditionally, model order or model specification are compared through the computation of information criteria, which are well known for penalizing likelihood functions of overparametrized models. AIC (Akaike, 1974) and BIC (Schwarz, 1978) are the most used ones. For data y and ^ þ 2p parameter y, these criteria are defined as follows: AIC ¼ 2 lnðpðyjyÞÞ ^ þ p ln n; p is the dimension of y, with p ¼ 2k þ 5 and BIC ¼ 2 lnðpðyjyÞÞ ^ One of for LSTAR(k), sample size n and maximum likelihood estimator, y: the major problems with AIC/BIC is that to define k is not trivial, mainly in Bayesian hierarchical models, where the priors act like reducers of the effective number of parameters through its interdependencies. To overcome this limitation, Spiegelhalter et al. (2002) developed an information criterion ~ ¯ DðyÞ; that properly defines the effective number parameter by pD ¼ D ~ ¯ ¼ EðDðyÞjyÞ: where DðyÞ ¼ 2 ln pðyjyÞ is the deviance, y ¼ EðyjyjÞ and D As a by-product, they proposed the Deviance Information Criterion: DIC ¼ ~ þ 2p ¼ D ¯ þ pD : One could argue that the most attractive and DðyÞ D appealing feature of the DIC is that it combines model fit (measured by ¯ with model complexity (measured by pD). Besides, DICs are more D) attractive than Bayes Factors since the former can be easily incorporated into MCMC routines. For successful implementations of DIC we refer to Zhu and Carlin (2000) (spatio-temporal hierarchical models) and Berg, Meyer, and Yu (2004) (stochastic volatility models), to name a few.
3. LSTAR: STOCHASTIC VOLATILITY In this Section, we adopt the LSTAR structure introduced in Section 2 to model time-varying variances, or simply the stochastic volatility, of (financial) time series. It has become standard to assume that yt, the observed value of a (financial) time series at time t, for t ¼ 1; . . . ; n; is normally distributed conditional on the unobservable volatility ht, i.e. yt jht Nð0; eht Þ; and that the log-volatilities, lt ¼ log ht ; follow an autoregressive process of order one, i.e. lt Nðy0 þ y1 lt 1 ; s2 Þ; with y0, y1 and s2
230
HEDIBERT FREITAS LOPES AND ESTHER SALAZAR
interpreted as the volatility’s level, persistence and volatility, respectively. Several variations and generalizations based on this AR(1) structure were proposed over the last two decades to accommodate asymmetry, heavy tails, strong persistence and other features claimed to be presented in the volatility of financial time series (Kim, Shephard, & Chib, 1998). As mentioned earlier, our contribution in this section is to allow the stochastic volatility to evolve according to the LSTAR(k) model, Eq. (1) from Section 2 lt ¼ x0t y1 þ F ðg; c; lt d Þx0t y2 þ t
t Nð0; s2 Þ
(8)
with x0t ¼ ð1; lt 1 ; . . . ; lt k Þ; y1 and y2 ðk þ 1Þ-dimensional vectors and F ðg; c; lt d Þ the logistic function introduced in Eq. (2). We assume, without loss of generality, that d is always less than or equal to k. In the particular case, when k ¼ 1; the AR(1) model is replaced by an LSTAR(1) lt N y10 þ y11 lt 1 þ ½y20 þ y21 lt 1 F ðg; c; lt d Þ; s2
Conditional on the vector of log-volatilities, k ¼ ðl1 ; . . . ; ln Þ; we adopt, for the static parameters describing the LSTAR model, the same prior structure presented in Section 2.1. In other words, ðy2 js2 ; gÞ Nð0; s2 eg I Kþ1 Þ and pðy1; g; s2 ; cÞ / ð1 þ g2 Þ 1 s 2 for y1 2
k Y i¼0
pðltþi jltþi
1;
. . . ; ltþi
k; YÞ
(9)
which has no known closed form. We sample lnt from NðlðjÞ t ; DÞ; for D a problem-specific tuning parameter, and ltðjÞ the current value of lt in this Random-Walk Markov chain. The draw lt* is accepted with probability. ( ) gðlt jk t; Y; yÞ a ¼ min 1; gðltðjÞ jk t ; Y; yÞ
Time Series Mean Level and Stochastic Volatility Modeling
231
We do not claim that our proposal is optimal or efficient by any formal means, but we argue that they are useful from a practical viewpoint as our simulated and real data applications reveal.
4. APPLICATIONS In this Section, our methodology is extensively studied against simulated and real time series. We start with an extensive simulation study, which is followed by the analysis of two well known dataset: (i) The Canadian Lynx series, which stands for the number of Canadian Lynx trapped in the Mackenzie River district of North-west Canada, and (ii) The USPI Index, which stands for the US Industrial Production Index.
4.1. A Simulation Study We performed a study by simulating a time series with 1,000 observations from the following the LSTAR(2): þ 0:795yt 2 ÞF ð100; 0:02; yt 2 Þ þ t (10) 1 and t where F ð100; 0:02; yt 2 Þ ¼ ½1 þ exp 100ðyt 2 0:02Þ Nð0; 0:022 Þ: The initial values were k ¼ 5, y1 ¼ y2 ¼ ð0; 1; 1; 1; 1; 1Þ; g ¼ 150; c ¼ y¯ and s2 ¼ 0:012 : We consider 5,000 MCMC runs with the first half used as burn-in. AIC, BIC and DIC statistics are presented in Table 1. Posterior means and standard deviations for the model with highest posterior model probability are shown in Table 2. The three information criteria point to the correct model, while our MCMC algorithm produces fairly accurate estimates of posterior quantities. yt ¼ 1:8yt
1
1:06yt
2
þ ð0:02
0:9yt
1
4.2. Canadian Lynx We analyze the well-known Canadian Lynx series. Figure 1 shows the logarithm of the number of Canadian Lynx trapped in the Mackenzie River district of North-west Canada over the period from 1821 to 1934 (data can be found in Tong, 1990, p. 470). For further details and previous analysis of this time series, see Ozaki (1982), Tong (1990), Tera¨svirta (1994), Medeiros and Veiga (2005), and Xia and Li (1999), among others.
232
HEDIBERT FREITAS LOPES AND ESTHER SALAZAR
Model Comparison using Information Criteria for the Simulated LSTAR(2) Process.
Table 1. k
d
AIC
BIC
DIC
1
1 2 3
4654.6 4728.6 4682.7
4620.2 4694.3 4648.4
8622.2 8731.2 8606.8
2
1 2 3
3912.0 5038.9 4761.2
3867.9 4994.8 4717.0
7434.2 9023.6 8643.8
3
2 3
5037.0 4850.4
4983.0 4796.5
9023.3 8645.9
Table 2. Posterior Means for the Parameters by LSTAR(2) Model using 2,500 MCMC Runs. Par
True Value
Mean
St.Dev.
Par
y01 y11 y21 g s2
0 1.8 1.06 100 0.0004
0.0028 1.7932 1.0809 100.87 0.00037
0.0021 0.0525 0.0654 4.9407 0.000016
y02 y12 y22 c
True Value 0.02 0.9 0.795 0.02
Mean
St.Dev.
0.023 0.8735 0.7861 0.0169
0.0036 0.0637 0.0746 0.0034
We run our MCMC algorithm for 50,000 iterations and discard the first half as burn-in. The LSTAR(11) model with d ¼ 3 has the highest posterior model probability when using the BIC as a proxy to Bayes factor ð0:307Þ
ð0:111Þ
yt ¼ 0:987 þ 0:974 yt ð0:146Þ
0:0702 yt ð2:04Þ
6
ð0:151Þ
1
ð0:158Þ
ð0:142Þ
2
0:051 yt
ð0:167Þ
0:036 yt
7
ð0:431Þ
þ ð 3:688 þ 1:36 yt ð0:657Þ
0:098 yt
þ 0:179 yt
ð0:744Þ
3:05 yt
1
ð0:735Þ
þ 0:406 yt 6 0:862 yt ð0:688Þ 1 þ expf 11:625ðyt
ð0:684Þ
7
8
2
ð0:137Þ
3
ð0:159Þ
8
þ 0:025 yt
ð1:111Þ
þ 4:01 yt
ð0:539Þ
ð0:143Þ
0:155 yt
3
4
þ 0:045 yt
5
ð0:144Þ
9
þ 0:138 yt
ð0:972Þ
2:001 yt ð0:486Þ
4
ð0:096Þ
0:288 yt
10
11
ð0:753Þ
þ 1:481 yt
0:666 yt 8 þ 0:263 yt 9 þ 0:537 yt 10 1 ð0:017Þ 3:504Þg þ t ; Eðs2 jyÞ ¼ 0:025
5
ð0:381Þ
0:569 yt
11
with posterior standard deviations in parentheses. The LSTAR(k ¼ 11) captures the decennial seasonality exhibited by the data. The vertical lines on Part (a) of Fig. 1 are estimates of the logistic transition function, F(g, c, yt 3)
233
0.0
1.0
2.0
Time Series Mean Level and Stochastic Volatility Modeling
1840
1860
1900
1920
0.4
0.8
1880 years
0.0
F(g,c,y(t−3)
(a)
−0.6 (b)
−0.2
0.2
0.4
0.6
y(t)−3.5
Fig. 1. Canadian Lynx Series: (a) Logarithm of the Number of Canadian Lynx trapped in the Mackenzie River District of North-West Canada over the Period from 1821 to 1934 (Time Series is Shifted 1.5 Units down to Accommodate the Transition Function). The Vertical Lines are Estimates of the Logistic Transition Function, F(g, c, yt 3) from Eq. (2), for g ¼ 11:625 and c ¼ 3:504: Posterior Means of g and c, respectively, (b) The Posterior Mean of the Transition Function as a Function of y 3.5.
from Eq. 2, for g ¼ 211:625 and c ¼ 3:504 posterior means of g and c, respectively. Roughly speaking, within each decade the transition cycles according to both extreme regimes as yt gets away from c ¼ 3:504; or the number of trapped lynx gets away from 3,191, which happens around 15% of the time. Part (b) of Fig. 1 exhibits the transition function as a function of y 3.5, which is relatively symmetric around zero corroborating with the fact that the decennial cycles are symmetric as well. Similar results were found in Medeiros and Veiga (2005) who fit an LSTAR(2) with d ¼ 2 and Tera¨svirta (1994) who fit an LSTAR(11) with d ¼ 3 for the same time series. 4.3. S&P500 Index Here, we fit the LSTAR(k)-stochastic volatility model proposed in Section 3 to the North American Standard and Poors 500 index, or simply the SP500 index, which was observed from January 7th, 1986 to December 31st, 1997,
234
HEDIBERT FREITAS LOPES AND ESTHER SALAZAR
a total of 3,127 observations. We entertained six models for the stochastic volatility and did model comparison by using the three information criteria presented in Section 2.2, i.e., AIC, BIC and DIC. The six models are: M1 : AR(1),M2 : AR(2),M3 : LSTAR(1) with d ¼ 1; M4 : LSTAR(1) with d ¼ 2; M5 : LSTAR(2) with d ¼ 1; and M6 : LSTAR(2) with d ¼ 2: Table 3 presents the performance of the six models, in which all three information criteria agree upon the choice of the best model for the logvolatilities: LSTAR(1) with d ¼ 1: One can argue that the linear relationship prescribed by an AR(1) structure is insufficient to capture the dynamic behavior of the log-volatilities. The LSTAR structure brings more flexibility to the modeling. Table 4 presents the posterior mean and standard devations of all parameters for each one of the six models listed above.
5. CONCLUSIONS In this paper, we developed Markov Chain Monte Carlo (MCMC) algorithms for posterior inference in a broad class of nonlinear time series models known as logistic smooth transition autoregressions, LSTAR. Our developments are checked against simulated and real dataset with encouraging results when modeling both the levels and the variance of univariate time series with LSTAR structures. We used standard information criteria such as the AIC and BIC to compare models and introduced a fairly new Bayesian criterion, the Deviance Information Criteria (DIC), all of which point to the same direction in the examples we explored in this paper. Even though we concentrated our computations and examples to the logistic transition function, the algorithms we developed can be easily adapted to other functions such as the exponential or the second-order Table 3. S&P500: Model Comparison based on Information Criteria: AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion) and DIC (Deviance Information Criterion). Models M1 : AR(1) M2 : AR(2) M3 : LSTAR(1, d ¼ 1) M4 : LSTAR(1, d ¼ 2) M5 : LSTAR(2, d ¼ 1) M6 : LSTAR(2, d ¼ 2)
AIC
BIC
DIC
12795 12624 12240 12244 12569 12732
31697 31532 31165 31170 31507 31670
7223.1 7149.2 7101.1 7150.3 7102.4 7159.4
235
Time Series Mean Level and Stochastic Volatility Modeling
Table 4. S&P500: Posterior Means and Posterior Standard Deviations for the Parameters from all Six Entertained Models: M1 : AR(1), M2 : AR(2), M3 : LSTAR(1) With d ¼ 1, M4 : LSTAR(1) With d ¼ 2, M5 : LSTAR(2) With d ¼ 1, and M6 : LSTAR(2) with d ¼ 2. Parameter
Models M1
M2
M3
M4
M5
M6
Posterior mean (standard deviation) y01
0.060 (0.184)
0.066 (0.241)
0.292 (0.579)
0.354 (0.126)
4.842 (0.802)
6.081 (1.282)
y11
0.904 (0.185)
0.184 (0.242)
0.306 (0.263)
0.572 (0.135)
0.713 (0.306)
0.940 (0.699)
y21
—
0.715 (0.248)
1.018 (0.118)
1.099 (0.336)
y02
—
—
0.685 (0.593)
0.133 (0.092)
4.783 (0.801)
6.036 (1.283)
y12
—
—
0.794 (0.257)
0.237 (0.086)
0.913 (0.314)
1.091 (0.706)
y22
—
—
—
—
c
—
—
118.18 (16.924)
163.54 (23.912)
1.748 (0.114) 132.60 (10.147)
1.892 (0.356) 189.51 (0.000)
c
—
—
1.589 (0.022)
0.022 (0.280)
2.060 (0.046)
2.125 (0.000)
s2
0.135 (0.020)
0.234 (0.044)
0.316 (0.066)
0.552 (0.218)
0.214 (0.035)
0.166 (0.026)
—
—
logistic functions, or even combinations of those. Similarly, even though we chose to work with yt d as the transition variable, our findings are naturally extended to situations where yt d is replaced by, say, s(yt 1,y, yt d, a) for a and d unknown quantities. In this paper, we focused on modeling the level and the variance of univariate time series. Our current research agenda includes (i) the generalization of the methods proposed here to model factor stochastic volatility problems (Lopes & Migon, 2002; Lopes, Aguilar, and West, 2000), (ii) fully treatment of k as another parameter and posterior inference based on reversible jump Markov Chain Monte Carlo (RJMCMC) algorithms for both the levels and the stochastic volatility of (financial) time series (Lopes
236
HEDIBERT FREITAS LOPES AND ESTHER SALAZAR
& Salazar, 2005), and (iii) comparing smooth transition regressions with alternative jump models, such as Markov switching models (Carvalho & Lopes, 2002). All our computer codes are available upon request and are fully programmed in the student’s version of Ox, a statistical language that can be downloaded free of charge from http://www.nuff.ox.ac.uk/Users/Doornik/ index.html.
ACKNOWLEDGMENTS Hedibert Lopes would like to thank the Graduate School of Business, University of Chicago and the Department of Statistical Methods, Federal University of Rio de Janeiro for supporting his research on this paper. Esther Salazar was partially supported by a scholarship from CNPq, Conselho Nacional de Desenvolvimento Cientı´fico e Tecnolo´gico, to pursue her master’s degree in Statistics at the Federal University of Rio de Janeiro. We thank the comments of the participants of III Annual Advances in Econometrics Conference (USA), Science of Modeling and Symposium on Recent Developments in Latent Variables Modelling (Japan), II Congreso Bayesiano de America Latina (Mexico) and VII Brazilian Meeting of Bayesian Statistics, and the participants of talks delivered at the Catholic University of Rio de Janeiro and Federal University of Parana´ (Brazil), and Northern Illinois University (USA). We also would like to thank the invaluable comments made by the referee.
REFERENCES Akaike, H. (1974). New look at the statistical model identification. IEEE Transactions in Automatic Control AC, 19, 716–723. Berg, A., Meyer, R., & Yu, J. (2004). DIC as a model comparison criterion for stochastic volatility models. Journal of Business & Economic Statistics, 22, 107–120. Carvalho, C., & Lopes, H. F. (2002). Simulation-based sequential analysis of Markov switching stochastic volatility models. Technical Report. Department of Statistical Methods, Federal University of Rio de Janeiro. Chan, K. S., & Tong, H. (1986). On estimating thresholds in autoregressive models. Journal of Time Series Analysis, 7, 179–190. van Dijk, D., Tera¨svirta, T., & Franses, P. (2002). Smooth transition autoregressive models – a survey of recent developments. Econometrics Reviews, 21, 1–47. Gamerman, D. (1997). Markov Chain Monte Carlo: Stochastic simulation for Bayesian inference. London: Chapman & Hall.
Time Series Mean Level and Stochastic Volatility Modeling
237
Gelfand, A. E., & Smith, A. M. F. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409. Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. London: Chapman & Hall. Hastings, W. R. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109. Huerta, G., Jiang, W., & Tanner, M. (2003). Time series modeling via hierarchical mixtures. Statistica Sinica, 13, 1097–1118. Jacquier, E., Polson, N., & Rossi, P. (1994). Bayesian analysis of stochastic volatility models (with discussion). Journal of Business of Econometrics Statistics, 12, 371–417. Kim, S. N., Shephard, N., & Chib, S. (1998). Stochastic volatility: Likelihood inference and comparison with ARCH models. Review of Economic Studies, 65, 361–393. Lopes, H. F., Aguilar, O., & West, M. (2000). Time-varying covariance structures in currency markets. In: Proceedings of the XXII Brazilian Meeting of Econometrics, Campinas, Brazil. Lopes, H. F., & Migon, H. (2002). Comovements and contagion in emergent markets: Stock indexes volatilities. In: C. Gatsonis, A. Carriquiry, A. Gelman, D. Higdon, R.E. Kass, D. Pauler, & I. Verdinelli (Eds), Case studies in Bayesian statistics, (Vol. 6, pp. 285–300). New York: Springer-Verlag. Lopes, H. F., & Salazar, E. (2005). Bayesian inference for smooth transition autoregressive time series models. Journal of Time Series Analysis (to appear). Lubrano, M. (2000). Bayesian analysis of nonlinear time series models with a threshold. In: W. A. Barnett, D. F. Hendry, S. Hylleberg, T. Tera¨svirta, D. Tjostheim & A. Wu¨rtz (Eds), Proceedings of the Eleventh International Symposium in Economic Theory. Helsinki: Cambridge University Press. Luukkonen, R., Saikkonen, P., & Tera¨svirta, T. (1988). Testing linearity against smooth transition autoregressive models. Biometrika, 75, 491–499. Medeiros, M. C., & Veiga, A. (2005). A flexible coefficient smooth transition time series model. IEEE Transactions on Neural Networks, 16, 97–113. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machine. Journal of Chemical Physics, 21, 1087–1091. Ozaki, T. (1982). The statistical analysis of perturbed limit cycle process using nonlinear time series models. Journal of Time Series Analysis, 3, 29–41. Robert, C., & Casella, G (1999). Monte Carlo statistical methods. New York: Springer-Verlag. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. Smith, A., & Gelfand, A. (1992). Bayesian statistics without tears. American Statistician, 46, 84–88. Spiegelhalter, D., Best, N., Carlin, B., & van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, 64, 583–639. Tera¨svirta, T. (1994). Specification, estimation, and evaluation of smooth transition autoregressive models. Journal of the American Statistical Association, 89, 208–218. Tong, H. (1990). Non-linear time series: A dynamical systems approach. Oxford: Oxford University Press. Xia, Y., & Li, W. K. (1999). On single-index coefficient regression models. Journal of the American Statistical Association, 94, 1275–1285. Zhu, L., & Carlin, B. (2000). Comparing hierarchical models for spatio-temporally misaligned data using the deviance information criterion. Statistics in Medicine, 19, 2265–2278.
238
HEDIBERT FREITAS LOPES AND ESTHER SALAZAR
APPENDIX:. LSTAR FULL CONDITIONAL DISTRIBUTIONS In this appendix we provide in detail the steps of our MCMC algorithm for the LSTAR models of Section 2. In what follows, [x] denotes the full conditional P distribution P of x conditional on the other parameters and the data. Also, short for nt¼1 : The full conditional distributions are:
[y] Combining Eq. (4), with yt Nðz0t y; s2 Þ; ðy2 js2 ; gÞ s2 eg I kþ1 Þ , it is P Nð0; n n 0 2 ~ y ; C y Þ where C y ¼ ð zt zt s þ S 1 Þ; m ~ ny ¼ easyP to see that ½y Nðm 1 2 2 g Sy ð zt yt s Þ and, S ¼ diagð0; s e I kþ1 Þ: [s2] From Eq. (4), t ¼ yt z0t y Nð0; s2 Þ; which combined with the 2 2 2 noninformative P 2 prior pðs Þ / s ; leads to ½s IGððT þ k þ 1Þ=2; g 0 ðe y2 y2 þ t Þ=2Þ: [g, c] We sample g* and c*, respectively, from G½ðgðiÞ Þ2 =Dg ; gðiÞ =Dg and TNðcðiÞ ; Dc Þ; a normal truncated at the interval [ca,cb], and g(i) and c(i) current values of g and c. The pair (g*, c*) is accepted with probability a ¼ minf1; Ag; where QT n f ð j0; s2 Þ f N ðy2 j0; s2 eg I kþ1 Þ pðgn Þpðcn Þ A ¼ QTt¼1 N ðiÞt f ðt j0; s2 Þ f N ðy2 j0; s2 egðiÞ I kþ1 Þ pðgðiÞ pðcðiÞ Þ ht¼1 N ðiÞ ðiÞ i c capffiffiffiffi c F cbpffiffiffiffi F f G ðgðiÞ jðgn Þ2 =Dg ; gn =DgÞ Dc Dc n i h n c cp c a ffiffiffiffi F F cpb ffiffiffiffi f G ðgn jðgðiÞ Þ2 =Dg ; gðiÞ =DgÞ Dc Dc
for nt ¼ yt z0t ðgn ; cn ; yt d Þy; ðiÞ z0t ðgðiÞ ; cðiÞ ; yt d Þy and FðÞ is the t ¼ yt cumulative distribution function of the standard normal distribution.
ESTIMATING TAYLOR-TYPE RULES: AN UNBALANCED REGRESSION?$ Pierre L. Siklos and Mark E. Wohar ABSTRACT Relying on Clive Granger’s many and varied contributions to econometric analysis, this paper considers some of the key econometric considerations involved in estimating Taylor-type rules for US data. We focus on the roles of unit roots, cointegration, structural breaks, and non-linearities to make the case that most existing estimates are based on an unbalanced regression. A variety of estimates reveal that neglected cointegration results in the omission of a necessary error correction term and that Federal Reserve (Fed) reactions during the Greenspan era appear to have been asymmetric. We argue that error correction and non-linearities may be one way to estimate Taylor rules over long samples when the underlying policy regime may have changed significantly.
$
Presented at the 3rd Annual Conference in Econometrics: Econometric Analysis of Financial Time Series, honoring the contributions of Clive Granger and Robert Engle, Louisiana State University, Baton Rouge. Comments on an earlier draft by Øyvind Eitreheim, Bill Gavin, and Charles Goodhart and Conference participants are gratefully acknowledged.
Econometric Analysis of Financial and Economic Time Series/Part B Advances in Econometrics, Volume 20, 239–276 Copyright r 2005 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(05)20029-4
239
240
PIERRE L. SIKLOS AND MARK E. WOHAR
1. INTRODUCTION

How a central bank reacts to changes in economic conditions is of crucial importance to both policy-makers and academics. Attempts to describe a systematic process by which the central bank adjusts a variable that it has control over, such as an interest rate, to changes in economic conditions give rise to the reaction function approach. There are many ways that one can specify monetary policy reaction functions or rules. Several variables have been thought to significantly affect the setting of monetary policy instruments, including monetary aggregates and the exchange rate. In recent years, however, Taylor-type rules have become the most popular way of summarizing how central banks conduct monetary policy. Taylor (1993) showed that US monetary policy, covering a brief period of only 5 years (1987–1992), is well described by movements in the federal funds rate that respond to deviations of inflation and real GDP from their target and potential levels, respectively. Taylor evaluated the response to these two variables and found that the federal funds rate could essentially be described as a rule of the form:

$$i_t = r^* + \pi_t + a(\pi_t - \pi^*) + b\tilde{y}_t \qquad (1)$$
where $i_t$ is the nominal federal funds rate, $r^*$ the equilibrium real federal funds rate, $\pi_t$ the inflation rate over the previous four quarters, $\pi^*$ the target inflation rate, and $\tilde{y}_t$ the percent deviation of real GDP from its target or potential level. The Federal Reserve (Fed) reacts by changing the federal funds rate in such a manner that it is expected to increase when inflation rises above its target or when real GDP rises above its target or potential level. The above equation can also be rewritten as

$$i_t = \mu + (1 + a)\pi_t + b\tilde{y}_t \qquad (2)$$

where $\mu = r^* - a\pi^*$. The parameters a and b reflect the preferences of the monetary authority in terms of their attitude toward the short-run trade-off between inflation and output (see Ball, 1997; Muscatelli, Tirelli, & Trecroci, 1999; Clarida, Gali, & Gertler, 1998).1 In Taylor (1993), $r^*$ and $\pi^*$ are both set equal to 2, and a weight of 0.5 is assigned to both a and b. Stability implies that $a > 0$, which means that the response of the monetary authorities to an inflation shock translates into a higher real federal funds rate. Otherwise, the central bank cannot convince markets that it prefers lower future inflation.2
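As a concrete illustration (not taken from the paper), the rule in Eqs. (1)–(2) with Taylor's original settings, $r^* = \pi^* = 2$ and $a = b = 0.5$, can be evaluated directly; the inputs below are made-up numbers.

```python
# A minimal sketch of the original Taylor (1993) rule in Eq. (1), using the
# parameterization quoted above (r* = pi* = 2, a = b = 0.5).
def taylor_rate(inflation, output_gap, r_star=2.0, pi_star=2.0, a=0.5, b=0.5):
    """Implied federal funds rate: i = r* + pi + a*(pi - pi*) + b*ygap."""
    return r_star + inflation + a * (inflation - pi_star) + b * output_gap

# Example: 3% inflation and a 1% positive output gap imply a 6% funds rate.
print(taylor_rate(3.0, 1.0))  # 2 + 3 + 0.5*1 + 0.5*1 = 6.0
```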
Numerous papers have estimated variants of the above specification. Less thought is given as to whether the right- and left-hand side variables are of the same order of integration. Banerjee, Dolado, Galbraith, and Hendry (1993, pp. 164–166) discuss conditions under which it is preferable to balance regressions where the regressor and the regressand are a mix of I(1) and I(0) series.3 Sims, Stock, and Watson (1990) show that estimating (2) in levels, even when the series are non-stationary in levels, need not be problematic.4 However, omission of a possible underlying cointegrating relationship leads to a misspecification with implications for forecast performance (Hendry, 1995, chapter 8; Clements & Hendry, 1993). This is one point Granger (2004) emphasizes in his Nobel Lecture. While most monetary economists and central bankers would agree on the fundamental features of a monetary policy rule (if they were inclined to use one), there is still disagreement about the details of the specification. In this paper, we are interested in exploring some of the econometric issues that have a bearing on the estimation of Taylor rules, especially ones that stem from the varied seminal contributions of Clive Granger in the field of time series analysis. Most notably, the stationarity and cointegration properties of time series can have a significant impact on the interpretation and forecasts from reaction functions. For example, the real interest rate, $r^*$, need not be constant as is generally assumed. The difficulty, however, is that the equilibrium real interest rate implied by the rule ($r^*$ in (1)) is unobserved and subject to a great deal of uncertainty (e.g., Laubach & Williams, 2003). Nevertheless, depending upon the chosen sample, it can display trend-like behavior. Alternatively, relying on a standard Fisher equation decomposition, a unit root in either the inflation rate or the real rate will also produce unit root behavior in the nominal rate. Moreover, economic theory suggests that some of the variables in the Taylor rule, which may be non-stationary in levels, might be cointegrated. This suggests a common factor linking these series that gives rise to error correction type models. The nominal interest rate and inflation are candidates for cointegration but other cointegrated relationships are also possible, as we shall see. Perhaps surprisingly, the extant literature has largely ignored this possibility. In the next two sections, we outline the principal econometric issues in the estimation of Taylor-type equations. Knowing the time series properties of the variables in (1) or (2) is crucial for proper inference. In particular, little thought is given to whether the right- and left-hand side variables are of the same order of integration. The dependent variable in Eq. (2) is treated as a stationary series even though it may be difficult to distinguish it from a unit root process.5 Next, following a brief review of existing estimates of Taylor
rules for the US, we consider the sensitivity of estimates of equations such as (1) and (2), and their variants, to the lessons learned from Clive Granger’s work. The paper concludes with a summary.
2. SELECTED ECONOMETRIC ASPECTS OF TAYLOR RULES

Our aim is to focus our attention on lessons learned from the contributions of Granger. Nevertheless, a few other issues germane to estimating Taylor rules will be raised though not extensively considered in the empirical work that follows. Inference from estimates of equations such as (1) and (2) is sensitive to several econometric considerations. These include:

(i) The inclusion of a term to accommodate interest rate smoothing behavior of central banks. The path of interest rates set by central banks tends to be fairly smooth over time, changing slowly in one direction or the other, with reversals occurring relatively infrequently (e.g., Siklos, 2002, Table 4.3). Taylor-type rules have been modified to incorporate interest rate smoothing by including a lagged interest rate term. Sack and Wieland (2000), in a survey, argue that interest rate smoothing may be deliberate or just the result of monetary policy reacting to persistent macroeconomic conditions. If the latter view is correct, one would expect the coefficient associated with the lagged interest rate to be small and insignificant. However, estimates of Taylor rules find that the coefficient associated with the lagged interest rate is close to unity and statistically significant, suggesting that interest rate smoothing may be deliberate.6 The conclusion indicating that central banks practice interest rate smoothing has been questioned by Rudebusch (2002) and Söderlind, Söderström, and Vredin (2003). All of them argue that a Taylor rule augmented with a lagged interest rate implies too much predictability of interest rate changes compared with yield curve evidence. Specifically, Rudebusch (2002) argues that a large coefficient on the lagged interest rate would imply that future interest rate changes are highly predictable. But yield curve evidence suggests that interest rate predictability is low. Hence, Rudebusch concludes that the dynamic Taylor rule is a mis-specified representation of monetary policy. Rudebusch is not able to obtain any definitive conclusions based on this test because of an observational equivalence problem affecting the
analysis.7 Nevertheless, there are solid theoretical and practical reasons to believe that central banks do smooth interest rates (e.g., Goodhart, 1999; Sack, 1998; Sack & Wieland, 2000; Collins & Siklos, 2004; Bernanke, 2004b). Moreover, English et al. (2003) reject on empirical grounds Rudebusch’s conjecture. Dueker and Rasche (2004) argue that US studies that rely on quarterly or monthly data fail to make an adjustment arising out of the discrete nature of changes in the target fed funds rate. Nevertheless, once the appropriate adjustment is made, interest rate smoothing remains a feature of the data.

(ii) Whether or not the equilibrium real rate is constant. Taylor-type rules contain both an unobserved equilibrium real interest rate and an unobserved inflation target. Judd and Rudebusch (1998), Kozicki (1999), and Clarida, Gali, and Gertler (2000), among others, calculate the equilibrium real interest rate as the difference between the average federal funds rate and the average inflation rate. With this sample-specific value, one is able to back out an estimated value for the inflation target from the empirical estimate of the constant term in Eq. (2) above. Of course, one could also begin with an inflation target and back out a value of the real interest rate from the constant term. Rudebusch (2001) employs a more elaborate approach to estimating the equilibrium real interest rate from empirical estimates of an IS curve. Kozicki (1999) finds that estimates of the equilibrium real interest rate vary over time for the US. Clarida et al. (2000) use the average of the observed real interest rate as a proxy for the equilibrium real rate but allow it to vary between sub-samples. More recently, Rapach and Wohar (2005) have shown that ex-post real interest rates for the G7 countries have undergone structural shifts over time. They find that these shifts correspond to structural breaks in the inflation rate.8

(iii) Estimated weights a and b may be sensitive to the policy regime in place. Different sample periods may yield different results. Judd and Rudebusch (1998) and Hamalainen (2004) estimate Taylor-type rules for different sample periods and find that the Fed has reacted differently over time to inflation and the output gap. Hence, the possibility of structural breaks, well-known to have a significant impact on the unit root and cointegration properties of the data, also plays a role.

(iv) Different ways in which the inflation rate can be estimated. In the original Taylor rule, monetary policy targets inflation as the rate of inflation in the GDP deflator over the previous four quarters. Kozicki (1999) compares the Taylor rule using four different price indexes to
compute the annual inflation rate (GDP price deflator, CPI, core CPI, and expected inflation from private-sector forecasts).

(v) Different measures of potential output. The output gap measure used in the Taylor rule has been interpreted either as representing policymakers’ objectives relating to output stability or as a measure of expected inflation (see Favero & Rovelli, 2003). Different methods have been used to construct proxies for potential output. In the original Taylor (1993) paper, potential output was computed by fitting a time trend to real GDP. Judd and Rudebusch (1998) and Clarida et al. (2000) measure potential output using the Congressional Budget Office’s (CBO) estimates and by fitting a segmented trend and quadratic trend to real GDP. McCallum and Nelson (1999) and Woodford (2001) have argued against the use of time trends as estimates of potential output. First, because the resulting output gap estimates might be overly sensitive to the chosen sample; second, de-trending ignores the potential impact of permanent shocks on output. The latter consideration proves especially relevant when the time series properties of the output gap series are investigated, as we shall see. Other measures of potential output in the extant literature include calculations using the Hodrick–Prescott (HP) filter, and a band-pass filter. Kozicki (1999) shows that different measures of potential output can lead to large variations in Federal Funds rate target movements. Finally, Walsh (2003a, b) argues that policy makers are best thought of as reacting to the change in the output gap. Typically, however, estimated output gap proxies are derived under the implicit, if not explicit, assumption that the series is stationary. Differencing such a series may further exacerbate the unbalanced regression problem noted earlier and may result in over-differencing.9

(vi) The timing of information (e.g., backward or forward looking expectations) as well as the use of current versus real time data. In Taylor’s original article, the Fed reacts contemporaneously to the right-hand side variables. However, this assumes that the central bank knows the current quarter values of real GDP and the price index when setting the federal funds rate for that quarter, which is unlikely.10 Levin, Wieland, and Williams (1999), McCallum and Nelson (1999), and Rudebusch and Svensson (1999) found few differences between the use of current versus lagged values of variables, perhaps because both inflation and the output gap are very persistent time series so that lagged values serve as good proxies for current values. Clarida et al. (1998, 2000), Orphanides (2001), and Svensson (2003) emphasize the forward-looking
nature of monetary policy and advocate the estimation of forward-looking Taylor-type rules. Orphanides (2001) estimates Taylor-type rules over the period 1987–1992 using ex-post revised data as well as real-time forecasts of the output gap. He also uses Federal Reserve staff forecasts of inflation to investigate whether forward-looking specifications describe policy better than backward-looking specifications. Differences in the interpretation of results based on whether models are forward or backward-looking, forecast-based, or rely on real time data strongly point to a potentially important role of the time series properties of the variables that enter the Taylor rule (also see Tchaidze, 2004).

(vii) Issues of model uncertainty. Svensson (2003) doubts the relevance of the Taylor rule on theoretical grounds. Levin et al. (1999) and Taylor (1999) argue that Taylor-type rules are robust to model uncertainty.11 However, these conclusions concerning robustness under model uncertainty could be challenged on the grounds that the models they use are too similar to each other. A potential solution may be to adopt the ‘‘thick modeling’’ approach of Granger and Jeon (2004; also see Castelnuovo & Surico, 2004), which we will not pursue here.12

The foregoing issues point to the crucial role played by the time series properties of the nominal interest rate, inflation, and output gap. The interest rate and inflation rate in Eq. (2) have often been found to be nonstationary I(1) processes, while the output gap is usually, by construction, a stationary I(0) process (see Goodfriend, 1991; Crowder & Hoffman, 1996; Culver & Papell, 1997). This would suggest that it might be preferable to estimate a first difference form of the Taylor rule (e.g., as in Siklos, 2004; Gerlach-Kristen, 2003).13 Regardless of whether interest rates and inflation are non-stationary or stationary processes, most would agree that these series may be near unit root processes or, rather, that their stationarity property can be regime dependent. There are problems with estimating equations with near unit root processes. Phillips (1986, 1989) has shown that if variables are integrated of order 1, or are near unit root processes, then levels regressions may yield spurious results, and standard inference is not asymptotically valid (Elliott & Stock, 1994). Alternatively, Lanne (1999, 2000) draws attention to the fact that the finding of a unit root in nominal interest rates is common because there is likely a structural break that gives rise to the unit root property. Even if one believes that a unit root in interest rates, or any of the other variables in a Taylor rule, is a convenient simplification, there remains
the problem with the economic interpretation of a unit root in interest rates. There are also good economic reasons to expect some cointegration among the variables in the Taylor rule. For example, the real interest rate in (2) reflects the Fisher equation relating the nominal interest rate to expected inflation, and these two series may be statistically attracted to each other in the long run if real interest rates are believed to be stationary.14 If I(1) variables are found not to be cointegrated, a static regression in levels will be spurious (Granger & Newbold, 1974), and in a spurious regression the estimated parameter vector is inconsistent and the t- and F-statistics diverge. In studies of the Taylor rule, regression equations with an R² greater than the DW statistic are an indication of a spurious regression.15 If no cointegration is found then there is no long-run relationship between the I(1) variables.
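A small simulation, not taken from the paper, illustrates the Granger and Newbold (1974) point referred to above: regressing one independent random walk on another typically produces an apparently significant slope and an R² well above the Durbin–Watson statistic.

```python
# Minimal simulation sketch (not from the paper): two independent I(1) series
# regressed on each other exhibit the spurious regression symptoms discussed
# above (large t-statistic, R^2 greater than the DW statistic).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
T = 500
x = np.cumsum(rng.standard_normal(T))  # independent random walk
y = np.cumsum(rng.standard_normal(T))  # another independent random walk

res = sm.OLS(y, sm.add_constant(x)).fit()
print(f"t-stat on x: {res.tvalues[1]:.2f}")
print(f"R^2: {res.rsquared:.2f}, DW: {durbin_watson(res.resid):.2f}")
```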
3. ALTERNATIVE SPECIFICATIONS OF THE TAYLOR RULE

Judd and Rudebusch (1998) modify Taylor’s rule given in Eq. (2) above to allow for gradual adjustment in the federal funds rate. The modified Taylor rule is written as

$$i_t^* = r^* + \pi_t + a(\pi_t - \pi^*) + b_1\tilde{y}_t + b_2\tilde{y}_{t-1} \qquad (3)$$

where the adjustment process is given by

$$\Delta i_t = \gamma\left(i_t^* - i_{t-1}\right) + \rho\Delta i_{t-1} \qquad (4)$$

The coefficient $\gamma$ is the adjustment to the ‘‘error’’ in the interest rate setting and the coefficient $\rho$ can be thought of as a measure of the ‘‘momentum’’ from last period’s interest rate change.16 Combining (3) and (4) gives the Taylor-type rule with interest rate smoothing:

$$\Delta i_t = \gamma\mu - \gamma i_{t-1} + \gamma(1 + a)\pi_t + \gamma b_1\tilde{y}_t + \gamma b_2\tilde{y}_{t-1} + \rho\Delta i_{t-1} \qquad (5)$$

where $\mu = r^* - a\pi^*$. Note, however, that (5) is a difference rule and, as Hendry (2004) points out, it was the dissatisfaction with specifications such as these, notably the fact that such a specification would be valid whether the raw data were I(1) or I(0), which eventually found expression in what came to be called Granger’s representation theorem.
The above equation can be rewritten as

$$i_t = \gamma\left[\mu + (1 + a)\pi_t + b_1\tilde{y}_t + b_2\tilde{y}_{t-1}\right] + (1 - \gamma)i_{t-1} + \rho\Delta i_{t-1} \qquad (6)$$
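The following sketch, with purely illustrative parameter values rather than estimates from the paper, shows how the partial-adjustment mechanism in Eqs. (3)–(6) passes only a fraction γ of the static target into the observed rate each period.

```python
# Minimal sketch of the partial-adjustment rule in Eqs. (3)-(6); all parameter
# values are illustrative placeholders.
def target_rate(inf, gap, gap_lag, r_star=2.0, pi_star=2.0, a=0.5, b1=0.5, b2=0.0):
    """Eq. (3): i*_t = r* + pi_t + a(pi_t - pi*) + b1*ygap_t + b2*ygap_{t-1}."""
    return r_star + inf + a * (inf - pi_star) + b1 * gap + b2 * gap_lag

def smoothed_rate(i_lag, di_lag, inf, gap, gap_lag, gamma=0.2, rho=0.3):
    """Eqs. (4)/(6): i_t = gamma*i*_t + (1 - gamma)*i_{t-1} + rho*Delta i_{t-1}."""
    i_star = target_rate(inf, gap, gap_lag)
    return gamma * i_star + (1.0 - gamma) * i_lag + rho * di_lag

# Example: with i_{t-1} = 4, Delta i_{t-1} = 0.25, 3% inflation and a 1% gap,
# the implied i_t moves only part of the way toward the 6% static target.
print(smoothed_rate(4.0, 0.25, 3.0, 1.0, 1.0))
```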
Clarida et al. (1998, 2000) and McCallum (2001) employ a partial-adjustment form of the Taylor rule that is further modified to reflect forward-looking behavior of monetary policy makers. This interest rate rule is given as

$$i_t^* = i^* + \varphi\left(E[\pi_{t+k} \mid \Omega_t] - \pi^*\right) + \beta E[\tilde{y}_{t+m} \mid \Omega_t] \qquad (7)$$

where $i_t^*$ is the nominal interest rate to be set; $i^*$ the desired nominal interest rate when inflation and output are at their target values; $\pi_{t+k}$ the inflation rate over the period t to t+k; $\tilde{y}_{t+m}$ a measure of the average output gap between period t and t+m; and $\Omega_t$ the information set available at the time the interest rate is set. Interest rate smoothing is incorporated through a partial-adjustment process given as

$$i_t = \rho(L)i_{t-1} + (1 - \rho)i_t^* \qquad (8)$$

where $\rho(L) = \rho_1 + \rho_2 L + \cdots + \rho_n L^{n-1}$, $\rho \equiv \rho(1)$, and $i_t$ is the current interest rate set by the monetary authority. Combining (7) and (8) yields the instrument rule to be estimated as

$$i_t = (1 - \rho)\left[r^* - (\varphi - 1)\pi^* + \varphi\pi_{t+k} + \beta\tilde{y}_{t+m}\right] + \rho(L)i_{t-1} + \epsilon_t \qquad (9)$$

where $r^* = i^* - \pi^*$ is the long-run equilibrium real rate and

$$\epsilon_t = -(1 - \rho)\left[\varphi\left(\pi_{t+k} - E[\pi_{t+k} \mid \Omega_t]\right) + \beta\left(\tilde{y}_{t+m} - E[\tilde{y}_{t+m} \mid \Omega_t]\right)\right] \qquad (10)$$
Levin et al. (1999) argue that the behavior depicted in Eq. (9) could be an optimal response for a central bank. Such interest rate smoothing has been employed in the empirical work of Clarida et al. (1998, 2000), Gerlach and Schnabel (2000), and Domenech, Ledo, and Taguas (2002). While differencing the variables in the Taylor rule solves one problem, it creates another. Castelnuovo (2003) tests for the presence of interest rate smoothing at quarterly frequencies in forward-looking Taylor rules by taking into account potentially important omitted variables such as the squared output gap and the credit spread. Collins and Siklos (2004) estimate optimal Taylor rules and demonstrate empirically that interest rate smoothing is the trade-off for responding little to the output gap. Uncertainty about the measurement of the output gap is well known (e.g., see Orphanides, 2001) but it could just as well be uncertainty about the appropriate structural model of the economy that leads policymakers to react cautiously.
While the original Taylor rule was specified as having the Fed reacting to only two variables, more recently several papers have explored the potential role of real exchange rate movements (e.g., as in Leitemo & Røisland, 1999; Medina & Valdés, 1999) or the role of stock prices (e.g., Bernanke & Gertler, 1999; Cecchetti, Genberg, Lipsky, & Wadhwani, 2000; Fuhrer & Tootell, 2004). Lately, there has been interest concerning the impact of the housing sector on monetary policy actions as many central banks have expressed concern over the economic impact of rapidly rising housing prices (e.g., see Siklos & Bohl, 2005a, b and references therein). While it is too early to suggest that there is something of a consensus in the literature, it is clear that variables such as stock prices and housing prices may result in the need to estimate an augmented version of Taylor’s rule. In what follows, we do not consider such extensions.
4. EMPIRICAL EVIDENCE ON TAYLOR-TYPE RULES: A BRIEF OVERVIEW OF THE US EXPERIENCE17

In Taylor (1993) no formal econometric analysis is carried out. Nevertheless, the simple rule described in Eq. (1) above was found to visually track the federal funds rate for the period 1987–1992 fairly well. Taylor (1999) estimated a modified version of Eq. (1), shown as Eq. (2) above, for different sample periods using OLS. Taylor concluded that the size of the response coefficients had increased over time between the international gold standard era and the Bretton Woods and Post-Bretton Woods period. Using the same specification, Hetzel (2000) examines the sample periods 1965–1979, 1979–1987, and 1987–1999 for the US. He also finds that the response coefficients increased over time, but Hetzel questions whether these could be given any structural interpretation. Orphanides (2001) estimates the Taylor rule for the US over the period 1987–1992 using OLS and IV and employs both revised- and real-time data. He finds that when real-time data are used, the rule provides a less accurate description of policy than when ex-post revised data are used. He also finds that the forward-looking versions of the Taylor rule describe policy better than contemporaneous specifications, especially when real-time data are used. Clarida et al. (2000) consider two sub-samples, 1960–1979 and 1979–1998, for the US using the same specification and employing the Generalized Method of Moments (GMM) estimator. The error term in Eq. (9) is a linear combination of forecast errors and thus it is orthogonal to the information set $\Omega_t$. Therefore, a set of instruments can be extracted from the information set
to use in the GMM estimation. They report considerable differences in the response coefficients over the two different sample periods.18 The results of Clarida et al. (2000) are challenged by the findings of Domenech et al. (2002). They find evidence for activist monetary policy in the US in the 1970s that resembles that of the 1980s and 1990s. On balance, however, as documented by Dennis (2003), Favero and Rovelli (2003), and Ozlale (2003), the bulk of the evidence suggests that the Volcker–Greenspan era differs markedly from monetary policy practiced by their predecessors at the Fed. Hamalainen (2004) presents a brief survey of the properties relating to Taylor-type monetary policy rules focusing specifically on the empirical studies of Taylor (1999) (using the original Taylor rule), Judd and Rudebusch (1998), incorporating interest rate smoothing into a modified Taylor rule, and Clarida et al. (2000), incorporating forward-looking behavior into a modified Taylor rule. Different measures of inflation and output gaps as well as different sample periods are employed. Not surprisingly, perhaps, the response coefficients are sensitive to the measure of inflation and output gap employed and to the sample estimation period employed. Hamalainen (2004) also notes that small changes in assumptions can yield very different policy recommendations, leading one to question the use of Taylor-type rules.19 There have only been a small number of papers that have examined the time series properties of the variables within Taylor-type rules.20 Christensen and Nielsen (2003) reject the traditional Taylor rule as a representation of US monetary policy in favor of an alternative stable long-run cointegrating relationship between the federal funds rate, the unemployment rate, and the long-term interest rate over the period 1988–2002. They find that deviations from this long-run relation are primarily corrected via changes in the federal funds rate. Osterholm (2003) investigates the properties of the Taylor (1993) rule applied to US, Australian, and Swedish data. The included variables are found to be integrated or near integrated series. Hence, a test for cointegration is carried out. The tests find no support for cointegration though the economic rationale for testing for cointegration among the series in the Taylor rule is not justified. Siklos (2004) reports similar results for a panel of countries that includes the US (also see Ruth, 2004), but whether the Taylor rule represents an unbalanced or spurious regression remains debatable, and one we explore econometrically in greater detail below. More recently, empirical estimates of Taylor’s rule have begun to consider non-linear effects. Siklos (2002, 2004) allows for asymmetric changes in inflation and finds in favor of this view in a panel of OECD countries, as well as for GARCH effects in interest rates. Dolado, Pedrero,
and Ruge-Murcia (2004) also permit ARCH-type effects in inflation (i.e., via the addition of the variance in inflation in the Taylor rule) but find that nonlinear behavior describes US monetary policy well only after 1983. In what follows, we rely exclusively on OLS estimation since it is adequate for bringing out the types of econometric problems raised in Granger’s work and relevant to our understanding and assessment of Taylor rules.
5. TAYLOR RULE ESTIMATES FOR THE US

5.1. Data

The data are quarterly, converted via averaging of monthly data.21 The full sample consists of data covering the period 1959.1–2003.4. The mnemonics shown in parentheses are the labels used to identify each series by FREDII at the Federal Reserve Bank of St. Louis.22 The short-term interest rate is the effective federal funds rate (FEDFUNDS). To proxy inflation, evaluated as the first log difference of the price level, we examine a number of price indices. They are: the CPI for all urban consumers and all items (CPIAUCSL), the core version of this CPI (CPILFESL; excludes food and energy prices), and the price index for Personal Consumption Expenditures (PCECTPI), as well as the core version of this index (JCXE). All these data were seasonally adjusted at the source. In the case of output, real Gross Domestic Product (GDPC96) and real potential Gross Domestic Product (these are the Congressional Budget Office’s estimates; GDPPOT2) were utilized. Again, the real GDP data were seasonally adjusted at the source. We also consider other proxies for the output gap including, for example, the difference between the log of output and its HP-filtered value.23 Some studies use a proxy for the unemployment gap. This is defined as the unemployment rate less an estimate of the NAIRU. This can be estimated as in Staiger, Stock, and Watson (1997).24 We also relied on other data from time to time to examine the sensitivity of our results to alternative samples or regimes. For example, Romer and Romer’s Fed contraction dates (Romer & Romer, 1989; 1947.01–1991.12), Boschen and Mills’ indicator of the stance of monetary policy (Boschen & Mills, 1995; 1953.01–1996.01), Bernanke and Mihov’s (1998; 1966.01–1996.12) measure of the stance of monetary policy, and the NBER’s reference cycle dates (www.nber.org).25 Forecast-based versions of Taylor rules have also proved popular both because they capture the forward-looking nature of policy making and permit estimation of the reaction function without resort to generated
regressors. In the present paper we rely on eight different sources of forecasts for inflation and real GDP growth. They are from The Economist, Consensus Economics, the OECD, the IMF (World Economic Outlook), the Survey of Professional Forecasters, the Livingston survey regularly updated by the Federal Reserve Bank of Philadelphia, the Greenbook forecasts, and the University of Michigan survey.
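A minimal sketch of how the core quarterly series described in this section could be assembled today, assuming the pandas-datareader FRED interface; the mnemonics are the ones quoted above (GDPPOT2 in particular may since have been renamed), so treat the series codes as illustrative rather than definitive.

```python
# Minimal sketch (not part of the paper) of building the core quarterly series
# from FRED, using the mnemonics quoted in the text.
import numpy as np
from pandas_datareader import data as pdr

start, end = "1959-01-01", "2003-12-31"

# Monthly series, converted to quarterly averages as in the text.
ff = pdr.DataReader("FEDFUNDS", "fred", start, end).resample("Q").mean()
cpi = pdr.DataReader("CPIAUCSL", "fred", start, end).resample("Q").mean()

# Quarterly real GDP and CBO potential GDP (mnemonics as quoted in the text;
# they may have been renamed on FRED since publication).
gdp = pdr.DataReader("GDPC96", "fred", start, end)
pot = pdr.DataReader("GDPPOT2", "fred", start, end)

# Annualized quarterly CPI inflation (first log difference) and the "standard"
# CBO output gap (log actual less log potential, in percent).
inflation = 400 * np.log(cpi.iloc[:, 0]).diff()
gap = 100 * (np.log(gdp.iloc[:, 0]) - np.log(pot.iloc[:, 0]))

# HP-filtered output gap (lambda = 1600 for quarterly data), another proxy
# discussed in the text.
from statsmodels.tsa.filters.hp_filter import hpfilter
cycle, trend = hpfilter(np.log(gdp.iloc[:, 0]), lamb=1600)
gap_hp = 100 * cycle
```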
5.2. Alternative Estimates and their Implications

Prior to presenting estimates, it is useful to examine the time series properties of the core series described above. Figure 1A plots the fed funds rate while Fig. 1B plots various measures of inflation. Both these figures show a general upward trend until the late 1970s or early 1980s followed by a general downward trend thereafter. Both series appear to be stationary, but only after 1990 approximately. Nevertheless, it is not difficult to conclude, as have several authors, that both these variables are non-stationary over a longer sample that includes a period of rising rates and inflation versus the era of falling rates and disinflation. This is also true even if the sample is roughly split between the pre-Greenspan era (1987) and the period since his appointment as the Fed Chairman.26 Summary statistics and some unit root tests that broadly support the foregoing discussion are shown in Table 1A. Both the widely-used Augmented Dickey–Fuller test statistic and the statistically more powerful modified unit root tests proposed by Elliott, Rothenberg, and Stock (Ng & Perron, 2001) generally lead to the same conclusion. Taylor rule estimates based on available forecasts generally do not consider the time series properties of the variables in question. Figure 2A plots actual CPI inflation against forecasted inflation from eight different sources. Since data are not available for the full sample from all sources, only the sample since Greenspan has been Fed Chairman is shown. The top portion of the figure collects forecasts that are survey based in addition to those from the Fed’s Greenbook. The bottom portion of the figure contrasts forecasts for inflation from international organizations or institutions that collect and average forecasts from several contributors or sources. Survey-based forecasts appear to be far more volatile than the other types of forecasts. Otherwise, neither type of forecast appears outwardly to dominate others over time. Turning to unit root tests shown in Table 1B, rejections of the null of a unit root in inflation forecasts are more frequent than for realized inflation. This is true for both the full sample as well as the
Fig. 1. (A) Federal Funds Rate, 1959:1–2003:4. (B) Alternative Measures of US Inflation (CPI, core CPI, PCE, and core PCE). Source: See Text.
Table 1A. Summary Statistics and Unit Root Tests.

Full sample (1959.1–2003.4)
Series        Mean    Std. Dev.   ADF            ERS (modified)
i             6.26    3.32        1.89 (0.34)    3.98 (7)
πcpi          4.18    2.82        1.70 (0.43)    9.70 (12)
πcpi core     4.22    2.56        1.35 (0.60)    11.11 (13)
πpce          3.65    2.15        1.40 (0.58)    11.29 (12)
πpce core     3.73    2.43        1.55 (0.51)    8.58 (13)
ỹcbo          1.29    2.59        3.46 (0.01)    1.59 (1)
ỹHP           0.03    1.55        5.85 (0.00)    1.18 (0)
ỹ             0.01    2.16        4.16 (0.001)   11.96 (12)
Δỹ            0.003   0.86        7.07 (0.00)    13.94 (9)
y − yT        0.107   3.09        2.96 (0.04)    4.83 (2)
ỹT3           0.000   2.38        3.94 (0.00)    1.61 (1)

Pre-Greenspan (1959.1–1987.2)
i             7.05    3.61        3.52 (0.04)    8.55 (2)
πcpi          4.85    3.29        1.07 (0.93)    20.44 (12)
πcpi core     4.88    2.96        0.45 (0.98)    26.69 (12)
πpce          4.35    2.33        1.73 (0.73)    17.14 (8)
πpce core     4.46    2.71        0.82 (0.96)    18.36 (12)
ỹcbo          1.67    2.94        2.73 (0.07)    7.80 (1)
ỹHP           0.08    1.81        4.28 (0.00)    6.13 (0)
ỹ             0.004   2.51        2.80 (0.06)    10.44 (12)
Δỹ            0.004   1.01        8.49 (0.00)    21.78 (10)
y − yT        0.258   3.78        2.34 (0.16)    16.86 (1)
ỹT3           0.258   2.61        3.20 (0.02)    6.29 (1)

Greenspan (1987.3–2003.4)
i             4.95    2.26        2.73 (0.23)    5.91 (1)
πcpi          3.07    1.09        2.62 (0.27)    9.12 (4)
πcpi core     3.13    1.01        3.56 (0.04)    14.50 (4)
πpce          2.49    1.09        3.35 (0.07)    13.80 (5)
πpce core     2.51    1.08        2.09 (0.54)    14.06 (8)
ỹcbo          0.65    1.71        2.52 (0.12)    16.11 (1)
ỹHP           0.05    0.99        3.46 (0.01)    11.66 (0)
ỹ             0.03    1.42        3.29 (0.02)    6.40 (4)
Δỹ            0.001   0.54        3.40 (0.01)    5.79 (1)
y − yT        0.15    1.29        2.30 (0.17)    10.50 (0)
ỹT3           0.45    1.85        2.43 (0.14)    15.85 (1)

Note: The full sample is 1959.1–2003.4, Pre-Greenspan is 1959.1–1987.2, Greenspan is 1987.3–2003.4. ADF is the Augmented Dickey–Fuller statistic with lag augmentation selected according to the Akaike Information criterion. A constant only is used in the full sample; for the interest rate and inflation series, a constant and a trend is used in the other two samples. p-values are given in parentheses next to the ADF statistics. The ERS (modified) column reports the modified ERS tests due to Ng and Perron (2001) using the modified AIC criterion, with the lag length in the augmentation portion of the test equation shown in parentheses. i = interest rate; π = inflation; ỹ = output gap; HP = HP filter; cbo = Congressional Budget Office; pce = Personal Consumption Expenditures; yT is trend output. The figures in brackets are based on a one-sided HP filter.
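A minimal sketch of the ADF tests summarized in Table 1A, using statsmodels with AIC-based lag selection as described in the table note; it reuses the series names from the earlier data sketch and is not the authors' code.

```python
# Minimal sketch (not the authors' code) of the ADF unit root tests in Table 1A.
from statsmodels.tsa.stattools import adfuller

def adf_report(series, regression="c"):
    """Return the ADF statistic, p-value, and the AIC-selected lag length."""
    stat, pvalue, usedlag, *_ = adfuller(series.dropna(), regression=regression,
                                         autolag="AIC")
    return stat, pvalue, usedlag

# Constant only for the full sample; constant and trend ("ct") for the
# interest-rate and inflation series in the sub-samples, as in the table note.
print(adf_report(ff.iloc[:, 0]))              # federal funds rate, full sample
print(adf_report(inflation, regression="ct"))  # CPI inflation, sub-sample style
```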
Fig. 2(A). Alternative Forecasts of Inflation: 1987–2003. Top panel: Livingston survey, Greenbook, Survey of Professional Forecasters, and Michigan Survey forecasts against actual CPI inflation. Bottom panel: Economist, IMF, OECD, and Consensus forecasts against actual CPI inflation. Source: See Text.
Greenspan era. When output growth forecasts are considered there is more unit-root-like behavior in the full sample than in the various output gap proxies considered in Table 1A, while the evidence of unit root behavior is just as mixed in the forecast as in the actual data. It is surprising that there have been relatively few Taylor rule estimates that rely on publicly available forecasts. After all, as then Governor Bernanke noted, such forecasts are important to policymakers since ‘‘…the public does not know but instead must estimate the central bank’s reaction function and other economic relationships using observed data, we have no guarantee that the economy will
Table 1B. Additional Unit Root Tests: Forecasts of Inflation and Output Growth.

Inflation forecasts (sources, in order: Livingston, Greenbook, Survey of Professional Forecasters, Univ. of Michigan, Economist, IMF, OECD, Consensus)
Full Sample column (non-blank cells, top to bottom): 1.66 (0.45)/24.78(11); 3.77(0.02)T/18.81(11); 4.48(0.00)T/13.74(0); 4.70(0.00)T/9.55(0)
Pre-Greenspan column (non-blank cells, top to bottom): 1.37(0.59)/8.29(11); 1.40(0.58)/23.10(8); 2.83(0.19)T/6.90(0); 2.58(0.11)/7.30(5)
Greenspan column (in source order): 3.23(0.09)T/31.18(8); 3.66(0.04)T/25.66(3); 3.61(0.04)T/5.85(1); 3.68(0.03)T/7.11(2); 3.74(0.03)T/11.89(0); 1.53(0.51)/7.35(6); 2.59(0.29)T/6.80(10); 3.34(0.07)T/5.92(0)

Output growth forecasts (sources, in order: Greenbook, Economist, OECD, Consensus, Livingston, Survey of Professional Forecasters)
Full Sample and Pre-Greenspan columns (non-blank cells, top to bottom): 4.84(0.00)/6.40(11); 3.57(0.01)/8.24(12); 1.90(0.33)/2.11(13); 1.26(0.65)/31.91(0)
Greenspan column (in source order): 2.94(0.05)/3.63(2); 2.31(0.17)/13.13(3); 2.12(0.24)/1.33(6); 4.46(0.00)/13.13(8); 2.91(0.17)/13.05(8); 3.45(0.05)/7.07(0)

Note: Tests based on the ADF test with lag augmentation chosen as in Table 1A; the statistic after the slash is the modified ERS test as described in Table 1A, with its lag length in parentheses. ‘‘T’’ indicates that a deterministic trend was added to the test equation. Blanks indicate insufficient or no data to estimate the test equation. p-values are in parentheses. OECD and Greenbook inflation forecasts are for the GDP deflator and OECD output growth forecasts are for the OECD’s estimate of the output gap.
converge – even in infinite time – to the optimal rational expectations equilibrium’’ (Bernanke, 2004a, p. 3). Since the form of the estimated Taylor rule will depend in part on the unit root property of inflation, differences in the time series properties of realized versus forecasted inflation may be important. Clive Granger’s work has long focused on the ability of econometric models to forecast time series. An influential contribution was Bates and Granger (1969), which showed that some linear combination of forecasts often outperforms even the best individual forecast. This result, seemingly counter-intuitive, comes about simply because averaging forecasts is a useful device in canceling out biases in individual forecasts. Figure 2B plots a variety of inflation forecasts. Data limitations govern the chosen samples. As we shall see, averaging inflation forecasts results in estimates of forecast-based reaction functions that compare well with Taylor rules based on realized data. This
Fig. 2(B). Average Forecasts of Inflation: 1975–2002. Note: In the top left panel, inflation forecasts from the Greenbook, IMF, and OECD (public), Survey of Professional Forecasters (SPF), Livingston, Economist (ECON), University of Michigan (UMICH), and Consensus (CONS); the top right panel is an average of SPF, Livingston, and UMICH forecasts; the bottom panel is an average of IMF and OECD forecasts. Greenbook and OECD forecasts are for the implicit price deflator.
result is not generally obtained when policy rules are estimated with individual inflation forecasts. Figures 3A and B display various plots of output growth and the output gap. The time series properties of these series differ noticeably from the other core variables in the Taylor rule. In the top portion of Fig. 3A we compare the output gap first by taking the difference between the growth of actual real GDP and the growth in potential real GDP and next by simply taking the difference between the log levels of actual and potential real GDP. Potential real GDP estimates are from the CBO. The latter is often the
Fig. 3. (A) Alternative Proxies for the Output Gap. Note: The ‘‘standard’’ output gap measure relies on CBO estimates of potential real GDP. (B) Comparing Detrended and CBO Measures of the Output Gap. Source: See Text.
‘‘standard’’ way of representing the output gap in the empirical literature that relies on US data. The former is a frequently used proxy for the output gap. While perhaps more volatile, the resulting series appears more nearly stationary. The bottom portion of Fig. 3A contrasts a measure of the output gap evaluated as the difference between the log level of real GDP and its HP filtered value against the first difference in the ‘‘standard’’ output gap measure shown in levels in the top portion of the same figure. The latter is sometimes referred to as the ‘‘speed limit’’ measure of the output gap. Both series are, broadly speaking, stationary. The HP filter suffers from the well-known end point problem. Yet, even if one estimates an output gap measure based on a one-sided HP filter where the end point is permitted to change over time, this does not seem to affect the unit root property of the resulting time series, with the possible exception of data covering the Greenspan era only (see Table 1A). The speed limit measure is clearly more volatile and more stationary in appearance. Finally, Fig. 3B compares the output gap derived from the CBO’s measure (i.e., the ‘‘standard’’ proxy shown in Fig. 3A) against a de-trended measure of the log level of real GDP. Versions of the latter have been used in some of the studies cited above. In the version shown here, separate trends for the 1959–1990 and 1991–2003 periods were fitted to the log level of real GDP. The time period captures the phenomenon noted by several analysts who have argued that the US experienced a permanent growth in potential output beginning around the 1990s as the impact of efficiencies in computing technology became economically meaningful. Chairman Greenspan has, on a number of occasions, commented on the apparently technology-driven boom of the 1990s (based, in part, on evidence as in Oliner & Sichel, 2000).27 Although the two series produce broadly similar patterns, the measure of the output gap derived from detrending comes closest to being I(1) of all the proxies considered. The same result holds for the Greenspan sample when a cubic trend serves to proxy potential real GDP (ỹT3). If one were instead to rely on the unemployment rate (results not shown) there is a slightly better chance of concluding that the unemployment rate gap is I(1), at least for the separate sub-samples 1957–1987 and 1988–2003, but not when an HP filter is applied. Notwithstanding the low power of unit root tests, and the presence of structural breaks in the data, perhaps due to regime changes in monetary policy, there are strong indications that interest rates and inflation are I(1) while the output gap is I(0). Nevertheless, since the output gap (or, for that matter, the unemployment rate gap) is a theoretical construct there is no reason that it cannot be I(1) for some sub-sample. In any event, there is
prima facie evidence that the Taylor rule is, as usually estimated, an unbalanced regression. The possibility that Eq. (2) contains two or more I(1) series raises the question of whether they are cointegrated. Indeed, if one examines Eq. (3) and adds an interest rate smoothing parameter, then it is a cointegrating relationship when $\rho < 1$. Alternatively, there may be other cointegrating relationships implicit in the standard formulation of Taylor’s rule.28 One must also ask what economic theory might lead these series to be attracted to each other in a statistical sense. Clearly, the nominal interest rate and inflation are linked to each other via the Fisher effect. It is less clear whether the output gap will be related to the interest rate in the long-run not only because it is typically I(0) by construction but also because we know that a central bank’s willingness to control inflation is regime specific. Table 2 presents a series of cointegration tests and estimates of the cointegration vectors.29 We first consider cointegration between the output gap, inflation (in the CPI alone), the fed funds rate and a long-term interest rate. Addition of the long-term interest rate series, proxied here by the yield on 10-year Treasury bonds (GS10), reflects the possibility that the term spread, often used as a predictor of future inflation, short-term interest rates, or output, affects the determination of the fed funds rate. The results in Tables 1A and 1B indicate that only the output gap series based on detrending is I(1) and thus, we rely on this series alone in the tests that follow. Not surprisingly, we are able to conclude that the output gap does not belong in the cointegrating relationship. Relying on the full sample, the null that the output gap is zero in the cointegrating vector cannot be rejected. Moreover, if we estimate a vector error correction model the null that the fed funds rate is weakly exogenous also cannot be rejected.30 Both sets of results appear consistent with the notion that the output gap is not an attractor in the specified VAR and that the form of the Taylor rule with the fed funds rate reacting to the inflation and output gap is adequate. One cointegrating vector is found in both models considered. Since the finding of cointegration is clearly sensitive to the output gap definition, we instead investigate whether there is cointegration between inflation and the short and long rates (Model 2). While cointegration ordinarily requires a long span of data, it is conceivable that the sought-after property is a feature only of some well-chosen sub-sample. The results in Table 2, at least based on the model that excludes the output gap, suggest that there is cointegration during the Greenspan and full samples. Figure 4 plots the error corrections from the estimated cointegrating vector based on the full and Greenspan samples. Only in the Greenspan sample do the error corrections appear to
Table 2. Cointegration Tests.

Model 1: ỹT, πt, it, iLt
No. of cointegrating vectors    Full {9}        Pre-Greenspan {9}    Greenspan {4}
0                               32.29 (0.02)    29.59 (0.04)         44.70 (0.00)
1                               19.90 (0.10)    22.30 (0.53)          6.98 (0.98)
2                                9.55 (0.38)    15.89 (0.20)          5.55 (0.84)
3                                7.72 (0.09)     9.16 (0.18)          9.16 (0.31)

Model 2: πt, it, iLt
No. of cointegrating vectors    Full {9}        Pre-Greenspan {12}   Greenspan {3}
0                               29 (0.00)       19.57 (0.12)         26.22 (0.01)
1                               15.19 (0.06)    17.21 (0.03)         11.27 (0.23)
2                                5.05 (0.28)     3.92 (0.92)          2.00 (0.78)

Cointegrating equations (Model 2):
Full sample (1959.1–2003.4): πt − 3.53 iLt + 2.53 it + 5.15, with standard errors (0.61), (0.53), (1.51)
Greenspan (1987.3–2003.4): πt − 0.87 iLt + 0.26 it + 1.40, with standard errors (0.09), (0.07), (0.41)

Note: iL is the long-term interest rate. Johansen’s λmax test statistic is reported (p-values due to MacKinnon, Haug, & Michelis, 1999, in parentheses). Lag lengths for the VARs, shown in braces, are chosen on the basis of the likelihood ratio test except for Model 1, Greenspan sample, where the Final Prediction Error criterion was used. The cointegrating equation is expressed in the form shown in Eq. (13), with standard errors in parentheses. * Indicates statistically significant at the 5% level.
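A minimal sketch of a Johansen-type cointegration test in the spirit of Table 2, using the coint_johansen routine in statsmodels; the variable names refer to quarterly series built as in the earlier sketches (plus the 10-year yield, GS10), the lag choice is illustrative, and this is not the authors' code, which reports λmax statistics with MacKinnon–Haug–Michelis p-values.

```python
# Minimal sketch (not the authors' code) of a Johansen cointegration test for
# Model 2 in Table 2: inflation, the funds rate, and the long rate.
import pandas as pd
from statsmodels.tsa.vector_ar.vecm import coint_johansen

# Assumed inputs: quarterly Series `inflation`, `ff_rate`, `long_rate`.
data = pd.concat([inflation, ff_rate, long_rate], axis=1).dropna()
data.columns = ["pi", "i", "iL"]

# det_order=0 includes a constant; k_ar_diff is the number of lagged
# differences in the VAR (an illustrative value, not the paper's lag choice).
result = coint_johansen(data, det_order=0, k_ar_diff=8)
print("lambda-max statistics:", result.lr2)   # trace statistics are in result.lr1
print("5% critical values:  ", result.cvm[:, 1])
```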
be roughly stationary. In the case of the full sample estimates, it is apparent that the series tends to behave asymmetrically. In contrast, the error corrections appear more symmetric when the Greenspan sample alone is considered. Based on the usual testing, it is difficult to know whether it is preferable to estimate a relationship covering the full sample or some portion of it. As noted previously, there are good reasons to think that the full sample may indeed consist of periods where cointegration is switched on or off depending upon the policy regime in place (Siklos & Granger, 1997). Part of the answer depends on the objectives of the study. Clearly, there is some evidence summarized above suggesting that the Taylor rule is a useful way of studying monetary policy over long periods of time, especially for the US experience (also see Orphanides, 2003a, b). If the purpose is to understand the behavior of a central bank during a well defined policy regime then a sub-sample is clearly preferable. Alternatively, if the point that needs to be
Fig. 4. Error Corrections. Note: Obtained from the Table 2 estimates for Model 2 for the full sample (top) and the Greenspan sample (bottom).
made is that monetary policy behaves asymmetrically over time then a longer span of data is clearly preferable. The difficulty is in properly estimating the timing of a regime change. As we have seen above there is no widespread consensus on the precise dating of shifts in regimes. Nevertheless, to maximize comparability with other studies, we retain sub-sample distinctions that reflect the impact of the tenures of Volcker and Greenspan.
More sophisticated analyses of the dating of regime changes, as in Rapach and Wohar (2005) for the real interest rate, or Burdekin and Siklos (1999) for inflation, might well choose a slightly different set of dates. However, the conclusion that distinct policy regimes are a feature of US monetary policy over the sample considered would remain unchanged.

We now turn to estimates of the Taylor rule that capture some of the features of the data discussed above. Table 3 provides the main results. Three versions of the Taylor rule were estimated. The first is (9) setting k and m both to zero, yielding:

$$i_t = (1 - \rho)\left[r^* - (\varphi - 1)\pi^* + \varphi\pi_t + \beta\tilde{y}_t\right] + \rho(L)i_{t-1} + \epsilon_t \qquad (11)$$

The second version recognizes the cointegration property described above. In this case we estimate

$$\Delta i_t = \sum_{i=0}^{o}\left(\varphi_i'\Delta\pi_{t-i} + \beta_i'\Delta\tilde{y}_{t-i} + \rho_i\Delta i^L_{t-i}\right) + \delta\left(\pi - c_1 i - c_2 i^L\right)_{t-1} + \rho\Delta i_{t-1} + \zeta_t \qquad (12)$$

where all the variables have been defined and $(\pi - c_1 i - c_2 i^L)_{t-1}$ is the error correction term.31 A third version decomposes the error correction term into positive and negative changes to capture a possible asymmetry in the reaction function based on the momentum model. The resulting specification is one of a variety of non-linear forms for the error correction mechanism. That specification is written

$$\Delta i_t = \sum_{i=0}^{o}\left(\varphi_i'\Delta\pi_{t-i} + \beta_i'\Delta\tilde{y}_{t-i} + \rho_i\Delta i^L_{t-i}\right) + \delta^{+}D^{+}\left(\pi - c_1 i - c_2 i^L\right)_{t-1} + \delta^{-}D^{-}\left(\pi - c_1 i - c_2 i^L\right)_{t-1} + \rho\Delta i_{t-1} + \zeta_t \qquad (13)$$
The first four columns of Table 3 report the steady-state coefficient estimates for the equilibrium real interest rate, the inflation and output gap parameters as well as the interest rate smoothing coefficient when the estimated model ignores cointegration. It is apparent that real interest rates are relatively lower in the Greenspan period than at any other time since the 1960s. Nevertheless, it is also true that estimates vary widely not only across time but across techniques used to proxy the output gap.32 Estimates of the interest rate response to the inflation gap are generally not significantly different from one, the norm known as Taylor’s principle. It is clear, however, that the Greenspan era stands in sharp contrast with other samples considered, as the inflation gap coefficient is considerably larger than at any other time reflecting more aggressive Fed reactions to inflation shocks. While the sample here is considerably longer with the
Standard Taylor Rule r*/(1 r): Real Rate
j/(1 r): Inflation
y~ T3
5.90 (0.02)
0.91 (0.18)
Dy~ y~ y yT y~ HP
0.81 5.41 8.21 5.03
(0.88) (0.26) (0.17) (0.00)
1.04 0.97 0.94 0.82
6.68 6.48 6.47 6.56 6.73
(0.00) (0.00) (0.00) (0.00) (0.00)
2.96 5.09 3.02 3.12 2.75 5.12 1.12
(0.18) (0.27) (0.04) (0.02) (0.27) (0.13) (0.78)
Sample/ Output gap
Difference Rules r: Interest Smoothing
d: Error Correction
1.40 (0.15)
0.91 (9.99)
0.09 ( 2.41)
(0.88) (0.85) (0.33) (0.01)
6.09 4.74 1.33 1.52
(0.49) (0.56) (0.38) (0.00)
1.04 0.97 0.94 0.82
(17.89) (15.93) (12.22) (8.65)
n.a. n.a. 0.09 ( 2.42) 0.09 ( 2.45)
0.79 0.90 0.90 0.82 0.81
(0.17) (0.41) (0.50) (0.24) (0.20)
0.09 0.95 0.28 0.08 0.09
(0.67) (0.05) (0.19) (0.73) (0.77)
0.50 0.38 0.45 0.51 0.52
(3.33) (2.44) (2.86) (3.38) (3.44)
0.001 ( 0.03) n.a. n.a. 0.001 ( 0.02) 0.002 (0.06)
3.80 4.31 0.37 3.25 4.39 0.52 0.88
(0.52) (0.54) (0.60) (0.13) (0.40) (0.65) (0.94)
0.06 19.32 4.29 0.59 0.74 1.22 0.88
(0.97) (0.56) (0.09) (0.44) (0.78) (0.06) (0.03)
1.05 (16.87) 1.02 (32.71) 0.95 (27.66) 1.06 (27.83) 1.04 (24.05) 0.89(10.00) 0.91(10.36)
b/(1 r): Output Gap
Dd+: Pos. Momentum
Dd : Neg. Momentum
.05 ( 2.61) n.n. n.n. 0.04 ( 2.02) 0.05 ( 2.26)
0.01 ( 0.60)
Pre-Volcker
Table 3. Taylor Rule Coefficient Estimates.
Volcker y~ T3 Dy~ y~ y yT y~ HP Greenspan 0.03 ( 1.93) n.a. n.a. 0.02 ( 1.75) 0.02 ( 1.72)
0.01 ( 0.79) 0.01 ( 0.52)
y~ T3 Dy~ y~ y yT y~ HP y˜-Greenbook y˜-Consensus
Table 3. (Continued ) Standard Taylor Rule Sample/ Output gap
r*/(1 r): Real Rate
j/(1 r): Inflation
b/(1 r): Output Gap
Difference Rules r: Interest Smoothing
d: Error Correction
0.03 ( 1.93) n.a. n.a. 0.02 ( 1.75) 0.02 ( 1.72)
Dd+: Pos. Momentum
Dd : Neg. Momentum
Full 4.60 4.10 3.63 5.27 4.71
(0.00) (0.00) (0.00) (0.00) (0.00)
0.92 1.09 1.14 0.37 0.66
(0.34) (0.80) (0.56) (0.18) (0.20)
0.83 3.30 1.38 0.75 1.34
(0.42) (0.04) (0.00) (0.14) (0.01)
0.91 0.92 0.88 0.93 0.90
(25.91) (27.11) (25.21) (27.67) (24.53)
1.02 0.95 0.98 0.89
(0.43) (1.16) (0.32) (0.36)
0.12 0.29 0.05 0.30
(0.47) (0.60) (0.03) (0.26)
0.87 (17.40) 0.92 (15.33) 0.88 (22.00) 0.86(17.20)
Forecast-based rules Public-y˜HP Private-y˜HP Mix I-y˜HP Mix II-y˜HP
2.50 1.36 2.54 3.30
(0.07) (0.10) (0.07) (0.07)
Note: See text and Table 1 for variable definitions. Pre-Volcker ¼ 1959.1–1979.2; Volcker ¼ 1979.3–1987.2; Greenspan ¼ 1987.3–2003.4; Full ¼ 1959.1–2003.4. Equations are estimated via OLS. In parenthesis, p-value for Wald test (F-statistic) that i =ð1 rÞ ¼ 0; j=ð1 rÞ ¼ 1; and b=ð1 rÞ ¼ 0: For r, d, Dd+, Dd t-statistic in parenthesis. d is the coefficient on the error correction term in the first difference form of the Taylor rule. Dd+, Dd are the coefficient estimates for the positive and negative changes in the error correction term based on the first difference form of the Taylor rule. Greenbook and Consensus are forecast rule estimates. Estimates based on average forecast-based rules are for 1974q2–1998q2 (Public); 1981q3–2001q3 (Private); 1960q1–2002q2 (Mix I); 1960q4–1987q2 (Mix II). Mix I and II represent average forecasts of inflation obtained from the Livingston and OECD sources. Other forecast combinations are described in Fig. 2(B).
y~ T3 Dy~ y~ y yT y~ HP
addition of 6 years of data, the coefficients reported in Table 3 are markedly higher than the ones reported by Kozicki (1999, Table 3) except perhaps when the conventional output gap using CBO potential real GDP estimates are used. Turning to the output gap coefficient it is quickly apparent, as others have found (e.g., Favero & Rovelli, 2003; Collins & Siklos, 2004), that the output gap is often insignificant and rather imprecisely estimated. Kozicki (1999, Table 3) reaches much the same conclusion. It is only for the full sample that a non-negligible Fed response to the output gap is recorded. Estimates of the interest rate smoothing parameter confirm the largely close-to-unit-root behavior of nominal interest rates, with the notable exception of the Volcker era when interest rate persistence is considerably lower. This is to be expected not only because of the sharp disinflationary period covered by this sample but also because the Fed’s operating procedure at the time emphasized monetary control over the interest rate as the instrument of monetary policy. We also considered estimating various versions of inflation forecast-based Taylor rules. Since the bulk of the available data are only available since the late 1980s we show only some estimates for the Greenspan era.33 Of eight potential versions of Taylor’s rule, only two, namely a version that relies on the Greenbook forecasts (of GDP deflator inflation) and one that relies on the Consensus forecasts, produce plausible results.34 Otherwise, estimates of all the coefficients are implausibly large and the interest rate smoothing coefficient is more often than not greater than one.35 It is not entirely clear why such a result is obtained but it is heartening that, since both these forecasts are among the most widely watched, forecasters see Fed behavior much the same way as conventional regression estimates based on actual data. Nevertheless, it should be noted that the plausibility of forecast-based estimates is highly sensitive to the choice of the output gap measure. The CBO’s output gap measure is the only one that yields results such as the ones shown in Table 3. Matters are different when forecasts are averaged. Whether public agencies or private sector forecasts are averaged no longer appears to matter. The resulting estimates are now comparable to ones that rely on actual data. Another one of Clive Granger’s insights proves useful. The last set of columns considers the significance of adding an error correction term to the Taylor rule in first differences. The error correction term was only added in cases where one could reasonably conclude that $\tilde{y} \sim I(1)$ based on the earlier reported tests. It is striking that, with the exception of the Volcker era, the error correction term is significant indicating, as argued above, that long-run equilibrium conditions implicit in Taylor’s rule,
should not be omitted. The Volcker exception makes the case for cointegration having been turned off during that period, and this interpretation is unlikely to be a controversial one. Finally, the last two columns of Table 3 give estimates of the error correction terms when a momentum type asymmetric model is estimated (Eq. (13)). Given that asymmetry in monetary policy actions is most likely to have occurred during Greenspan’s tenure, we only consider estimates for the post-1987 sample. The results suggest that since only Dd+ is statistically significant this is akin to an interpretation whereby a rise in the term spread, an indicator of higher expected inflation, over inflation (i.e., a negative error correction) prompts the Fed to raise interest rates. The effect does not appear to work in reverse. Therefore, this seems consistent with aggressive Fed policy actions to stem higher future inflation that marks Greenspan’s term as Fed Chair.36 Clearly, there are other forms of non-linearity that could have been employed. For example, as shown in Table 4, there is evidence of ARCH-type effects in the residuals of the Taylor rule in every sample although the evidence is perhaps less strong for the Greenspan era. Obviously, a task for future research is to provide not only a more rigorous theory that might lead to a non-linear rule but more extensive empirical evidence. As a final illustration of the usefulness of the error correction term we examine in-sample forecasts of the Fed funds rate during the Greenspan term. Figure 5A plots forecasts that include and exclude the error correction term and it is clear that policy makers would have produced much better forecasts if the difference rule had been employed.37 Figure 5B then shows
Table 4. Test for ARCH in Taylor Rule Residuals.

Sample           Test Statistic    Output Gap Measure
Full             11.64 (0.00)      ỹ
Full             14.15 (0.00)      ỹ^HP
Full             6.61 (0.01)       ỹ^T
Pre-Greenspan    11.20 (0.00)      ỹ
Pre-Greenspan    0.69 (0.41)       ỹ^HP
Pre-Greenspan    3.26 (0.07)       ỹ^T
Greenspan        2.82 (0.09)       ỹ
Greenspan        0.12 (0.73)       ỹ^HP
Greenspan        1.36 (0.24)       ỹ^T

Note: LM test statistic for the presence of ARCH(1) in the residuals of the Taylor rules defined by the output gap proxy in the last column. All Taylor rules use CPI inflation and the output gap proxy shown in the last column. p-values are given in parentheses. Sample details are given in Table 1.
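For readers who want to reproduce this kind of diagnostic, the sketch below shows one way an ARCH(1) LM test on Taylor-rule residuals could be run with statsmodels' het_arch. The simulated series and coefficient values are placeholders, not the data or estimates used in this paper.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_arch

rng = np.random.default_rng(0)
T = 180  # roughly quarterly data (illustrative length only)
inflation = rng.normal(3.0, 2.0, T)        # placeholder inflation series
output_gap = rng.normal(0.0, 1.5, T)       # placeholder output gap series
fedfunds = 2.0 + 1.5 * inflation + 0.5 * output_gap + rng.normal(0.0, 1.0, T)

# Estimate a static Taylor rule by OLS and test its residuals for ARCH(1)
X = sm.add_constant(np.column_stack([inflation, output_gap]))
resid = sm.OLS(fedfunds, X).fit().resid
lm_stat, lm_pvalue, f_stat, f_pvalue = het_arch(resid, nlags=1)
print(f"LM statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.2f}")
```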
Fig. 5. (A) Implied Interest Rate During Greenspan's Tenure: Level vs. Difference Rules. (B) Counterfactual, the Volcker and Greenspan Tenures: Using a Pre-Volcker Taylor Rule. Note: The top panel shows the implied rules during the Greenspan era using a level versus a difference rule; see Eqs. (11) and (12). In panel (B), a counterfactual is conducted wherein a level rule estimated for the pre-Volcker era is used to generate the implied fed funds rate during the Greenspan sample.
what the Fed funds rate would have been if the coefficients in the Taylor rule estimated for the pre-Volcker era (1959.1–1979.4) had been used to predict interest rates during Greenspan’s tenure. While policy would have clearly been too loose until about 1990, interest rates are fairly close to the actual Fed funds rate after that. On the face of it, one may be tempted to conclude that there is less to the choice of Fed chairman than priors might lead one to believe. However, if we compare the RMSE based on the implied interest rates we find, on balance, that there is a notable difference in monetary policy performance across various Fed chairmen (also see Romer & Romer, 2003).
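The counterfactual just described amounts to feeding one era's data through another era's estimated rule and comparing fit. A minimal sketch of that calculation is given below, assuming a partial-adjustment level rule; the coefficient values and simulated series are purely illustrative and are not the estimates reported in Table 3.

```python
import numpy as np

def implied_rate_path(inflation, gap, i_prev, const, rho, beta_pi, beta_gap):
    """Simulate i_t = rho*i_{t-1} + (1-rho)*(const + beta_pi*pi_t + beta_gap*gap_t)."""
    path = np.empty(len(inflation))
    for t, (pi_t, gap_t) in enumerate(zip(inflation, gap)):
        i_prev = rho * i_prev + (1.0 - rho) * (const + beta_pi * pi_t + beta_gap * gap_t)
        path[t] = i_prev
    return path

# Illustrative quarterly data for the Greenspan sample (placeholders only)
rng = np.random.default_rng(1)
T = 60
inflation = 3.0 + rng.normal(0.0, 1.0, T)
gap = rng.normal(0.0, 1.5, T)
actual_ffr = 2.0 + 1.2 * inflation + 0.3 * gap + rng.normal(0.0, 0.5, T)

# Two hypothetical coefficient sets: a "pre-Volcker" rule and a Greenspan-sample rule
rules = {"pre-Volcker rule": dict(const=1.5, rho=0.8, beta_pi=0.8, beta_gap=0.5),
         "Greenspan-sample rule": dict(const=2.0, rho=0.8, beta_pi=1.5, beta_gap=0.5)}
for name, coef in rules.items():
    implied = implied_rate_path(inflation, gap, actual_ffr[0], **coef)
    rmse = np.sqrt(np.mean((implied - actual_ffr) ** 2))
    print(f"{name}: RMSE = {rmse:.2f}")
```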
6. SUMMARY AND LESSONS LEARNED

This paper has considered a variety of time series-related questions that arise when estimating Taylor's rule. Focusing on the varied contributions of Clive Granger, it was argued that most estimates have ignored the consequences of the unbalanced nature of Taylor's original specification. This means that some of the variables that make up Taylor's rule are either under-differenced or possibly over-differenced. Moreover, the extant literature has, with very few exceptions, omitted the possibility of cointegration among the variables in the Taylor rule. Augmenting Taylor's specification with an error correction term adds significantly to explaining interest rate movements, and consequently to our understanding of the conduct of monetary policy in the US since the late 1950s. Finally, there is also significant evidence of asymmetries in the conduct of monetary policy, at least during the Greenspan era. This highlights another of Clive Granger's remarkable contributions to econometric analysis, namely the importance of non-linearities inherent in many economic relationships. A growing number of papers (e.g., Rabanal, 2004) are beginning to underscore this point.
NOTES

1. Goodhart (2004) casts doubt on the ability of expressions such as (1) to capture the preferences of central bankers if the regressions use the proper data. In particular, if a central bank has an inflation target and successfully meets it, then the correlation between historical interest rate data and inflation data ought to vanish, as central banks should always set the interest rate consistent with the target level of inflation.
2. Stability conditions are not necessarily as clear-cut as suggested above. See Woodford (2003, chapter 2).
3. They credit Mankiw and Shapiro (1985, 1986) for drawing attention to this problem. Also, see Enders (2004, pp. 102, 212), and Johnston and Dinardo (1997, p. 263).
4. For example, an augmented Dickey–Fuller test equation for the presence of a unit root is an unbalanced regression. Note, however, that standard inference on the I(0) series in the test equation is not appropriate. Hence, one must rely on a nonstandard distribution for hypothesis testing (e.g., see Hamilton, 1994, pp. 554–556).
5. Some might claim that non-stationarity in the interest rate makes no economic sense. Yet, empirical work relies on specifically chosen samples and, if there are sufficiently large shocks, the appearance of non-stationarity is difficult to reject.
6. Siklos (1999, 2002) also finds that inflation targeting central banks may indulge in relatively less interest rate smoothing behavior. Also, see Goodhart (2004).
7. English, Nelson, and Sack (2003) note that this observational equivalence problem may be overcome with a model estimated in first differences. Siklos (2004) also recognizes the near unit root nature of nominal interest rates and notes that a Taylor rule in first differences may also be more suitable for the analysis of countries that formally (or informally) target inflation. Using this type of formulation for the Taylor rule in a panel setting, he finds that there are significant differences in the conduct of monetary policy between inflation and non-inflation targeting countries.
8. Indeed, since the real interest rate variable incorporates one or more possible cointegrating relationships, the absence of an error correction term in most Taylor rule equations is surprising. Further, given well-documented shifts in monetary policy, it is conceivable that a cointegrating relationship may be turned on, or off, in a regime-sensitive manner. It is precisely this type of consideration that led Siklos and Granger (1997) to propose regime-dependent cointegration.
9. Over-differencing can induce a non-invertible moving average component in the error term. The problem has been known for some time (e.g., Plosser & Schwert, 1977) and its emergence as an econometric issue stems from the spurious regression problem made famous by Granger and Newbold (1974).
10. For example, in the US, the first (or advance) release of real GDP data for each quarter is not available until about one month after the end of the quarter. The final release is not available until three months after the quarter ends. In addition, historical data are often revised. Hence, while Taylor rules are usually estimated using final data, policy decisions often necessitate reliance on real-time data (see McCallum, 1998; Orphanides, 2001, 2003a, b; Orphanides & van Norden, 2002).
11. See also Smets (1998), Isard, Laxton, and Eliasson (1999), Orphanides et al. (1999), and Muscattelli and Trecroci (2000) on Taylor rules and output gap and model uncertainty.
12. The basic idea is to generate a wide variety of estimates from different specifications and pool these to form combined estimates and confidence intervals. Indeed, Granger and Jeon (2004) apply their technique to the US data with the Taylor rule serving as the focal point of the technique's illustration.
13. Williams (2004) also points out that estimating a first-differenced version of (1) reduces the problems associated with the uncertainty surrounding the measurement of the equilibrium real interest rate (i.e., the constant term in the Taylor rule).
14. Additionally, the inflation and output gap variables in (1) or (2) capture the trade-off between inflation and output variability and there are sound arguments, since at least Friedman (1977), to expect that an underlying long-run relationship exists between these two variables.
15. In the event an interest rate smoothing parameter is present in the Taylor rule, the DW test statistic is inappropriate. One might instead resort to the use of the Durbin h-test statistic.
16. Giannoni and Woodford (2003) consider the robustness of alternative policy rules and find in favor of a rule written in regression form as $i_t = \mu + \rho_1 i_{t-1} + \rho_2 \Delta i_{t-1} + \phi_{\pi} \pi_t + \phi_{\tilde{y}} \Delta \tilde{y}_t + \epsilon_t$. The rule is very similar to the one specified by Judd and Rudebusch (1998) except that the Fed here reacts to the change in the output gap instead of its level. It is easy to show that this type of reaction function can be turned into a first difference version that includes an error correction term that reflects cointegration between the nominal interest rate and the output gap.
17. Space constraints prevent a review of Taylor rule estimates outside the US experience. Readers are asked to consult, for example, Clarida et al. (1998), Gerlach and Schnabel (2000), Gerlach-Kristen (2003), Gerdesmeier and Roffia (2004), and Siklos and Bohl (2005a, b).
18. It is far from clear that GMM provides different estimates of the parameters in a Taylor rule. One reason is that the literature has, until recently, almost completely ignored the issue of the choice of instruments and their relevance. Siklos and Bohl (2005a, b) show that this aspect of the debate may be quite crucial in deriving policy implications from Taylor rules.
19. In addition, the weights in Taylor-type rules also depend on how the rule interacts with other equations and variables in an economic structural model.
20. Clarida et al. (1998) perform some unit root tests and report mixed results in terms of the order of integration. They argue for the stationarity of the included variables based on the argument that the unit root tests have low power.
21. We intend to make available at http://www.wlu.ca/wwwsbe/faculty/psiklos/home.htm the raw data and any programs used to generate the results. To facilitate access to the results we have relied on Eviews 5.0 for estimation.
22. The relevant data were downloaded from http://research.stlouisfed.org/fred2/.
23. Using the standard smoothing parameter of 1,600.
24. A simple estimate is based on the regression $\Delta p_t = \mu + \beta_1 u_{t-1} + \beta_2 u_{t-2} + \gamma X_t + v_t$, where $u$ is the unemployment rate, $p$ is inflation, and $X$ is a vector of other variables. We set $X = 0$ and used the CPI all items to obtain separate estimates for the 1957–1990 and 1991–2003 samples, which yielded estimates of 6.05% and 4.29%, respectively, where NAIRU $= -\mu/(\beta_1 + \beta_2)$.
25. Other than the NBER dates, the remaining data were kindly made available by Justin Wolfers and can also be downloaded from http://bpp.wharton.upenn.edu/jwolfers/personal_page/data.html.
26. Not surprisingly, the various core inflation measures are smoother than the ones that include all items and while, in general, the CPI and PCE based measures are similar, there are some differences in the volatility of the series.
27. This view was not universally shared within the FOMC. Meyer (2004, p. 213), for example, did not believe that "…the Fed's effort to reduce inflation was somehow
responsible for the wave of innovation that propelled the acceleration of productivity in the second half of the 1990s."
28. A highly readable account of the inspiration for the cointegration concept is contained in Hendry (2004).
29. The type of cointegration more or less implicit in Giannoni and Woodford (2003) is not generally found in the present data. Hence, we do not pursue this line of enquiry.
30. The test statistic for the null that the output gap does not belong in the cointegrating relationship is $\chi^2_2 = 5.80$ (p-value = 0.05). The test statistic for the null that it is weakly exogenous is $\chi^2_2 = 3.92$ (0.14). Both results are based on full sample estimates.
31. Another potential cointegrating relationship is the one between short and long rates (i.e., the term spread), but this variable proved to be statistically insignificant in almost all the specifications and samples considered.
32. Kozicki (1999, Table 1) reports an estimate of 2.34% when CPI inflation is used for the 1987–1997 sample. Our estimates of the equilibrium real rate (see Table 3) are sharply higher.
33. Where possible, estimates for the full sample were also generated, but the results were just as disappointing as the ones discussed below. Estimating forecast-based rules poses a variety of difficulties. For example, the Consensus and Economist forecasts used here are produced monthly and refer to forecasts for the subsequent calendar year. As the horizon until the forecast period narrows, the information set available to forecasters expands and it is conceivable, if not likely, that the interest rate assumption used in the forecast has changed. No special adjustments were made here for this problem and it is not obvious how the correction ought to be made.
34. Interestingly, Kuttner (2004) reaches the same conclusion using an international data set. However, he does not consider the possibility that the averaging of forecasts might produce better results.
35. This conclusion is robust to whether we use actual output gap estimates versus forecasts of real GDP growth or of the output gap (OECD), or to whether we add an error correction term to the Taylor rule.
36. The results also lend further support to the findings first reported in Enders and Siklos (2001), who introduced the momentum model in tests for threshold cointegration (also see Enders & Granger, 1998).
37. The RMSE is 0.95 when the error correction term is included and 1.92 when it is excluded. Clements and Hendry (1995) discuss conditions under which incorporating an error correction term can improve forecasts.
REFERENCES

Ball, L. (1997). Efficient rules for monetary policy. NBER Working Paper no. 5952, March. Banerjee, A., Dolado, J., Galbraith, J. W., & Hendry, D. F. (1993). Co-integration, error-correction, and the econometric analysis of non-stationary data. Oxford: Oxford University Press.
Bates, J. M., & Granger, C. W. J. (1969). The combination of forecasts. Operations Research Quarterly, 20, 451–468. Bernanke, B. S. (2004a). ‘‘Fedspeak’’. Speech delivered at the annual meeting of the American economic association, San Diego, CA, January, available from www.federalreserve.gov Bernanke, B. S. (2004b).‘‘Gradualism.’’ Remarks at an economics luncheon co-sponsored by the Federal Reserve Bank of San Francisco and the University of Washington, 20 May, available from www.federalreserve.gov Bernanke, B. S., & Gertler, M. (1999). Monetary policy and asset price volatility. Economic Review, Federal Reserve Bank of Kansas City (Fourth Quarter), 17–51. Bernanke, B. S., & Mihov, I. (1998). Measuring monetary policy. Quarterly Journal of Economics, 113(August), 869–902. Boschen, J. F., & Mills, L. O. (1995). The relation between narrative and money market indicators of monetary policy. Economic Inquiry, 33(January), 24–44. Burdekin, R. C. K., & Siklos, P. L. (1999). Exchange regimes and shifts in inflation persistence: Does nothing else matter? Journal of Money, Credit and Banking, 31(May), 235–247. Castelnuovo, E. (2003). Taylor rules, omitted variables, and interest rate smoothing in the US. Economics Letters, 81, 55–59. Castelnuovo, E., & Surico, P. (2004). Model uncertainty, optimal monetary policy and the preferences of the Fed. Scottish Journal of Political Economy, 51(February), 105–126. Cecchetti, S., Genberg, H., Lipsky, J., & Wadhwani, S. (2000). Asset prices and central bank policy. Geneva Reports on the World Economy (Vol. 2). Geneva: International Center for Monetary and Banking Studies; London: Centre for Economic Policy Research. Christensen, A. M., & Nielsen, H. B. (2003). Has US monetary policy followed the Taylor rule? A cointegration analysis 1988–2002. Working Paper, September 9. Clarida, R., Gali, J., & Gertler, M. (1998). Monetary policy rules in practice some international evidence. European Economic Review, 42(6), 1033–1067. Clarida, R., Gali, J., & Gertler, M. (2000). Monetary policy and macroeconomic stability: Evidence and some theory. The Quarterly Journal of Economics, 115(1), 147–180. Clements, M. P., & Hendry, D. F. (1993). Forecasting in cointegrated systems. Journal of Applied Econometrics, 10, 127–146. Collins, S., & Siklos, P. (2004). Optimal monetary policy rules and inflation targets: Are Australia, Canada, and New Zealand different from the U.S.? Open Economies Review, 15(October), 347–62. Crowder, W., & Hoffman, D. (1996). The long-run relationship between nominal interest rates and inflation: The fisher equation revisited. Journal of Money, Credit and Banking, 28, 102–118. Culver, S. E., & Papell, D. H. (1997). Is there a unit root in the inflation rate? Evidence from sequential break and panel data models. Journal of Applied Econometrics, 12, 435–444. Dennis, R. (2003). The policy preferences of the U.S. federal reserve. Working Paper, Federal Reserve Bank of San Francsico, July. Dolado, J., Pedrero, R., & Ruge-Murcia, F. J. (2004). Nonlinear policy rules: Some new evidence for the U.S. Studies in Nonlinear Dynamics and Econometrics, 8(3), 1–32. Domenech, R., Ledo, M., & Tanguas, D. (2002). Some new results on interest rate rules in EMU and in the U.S. Journal of Economics and Business, 54, 431–446. Dueker, M., & Rasche, R. H. (2004). Discrete policy changes and empirical models of the federal funds rate. Review of the Federal Reserve Bank of St. Louis (forthcoming).
Elliott, G., & Stock, J. H. (1994). Inference in time series regression when the order of integration of a regressor is unknown. Econometric Theory, 10, 672–700. Enders, W. (2004). Applied econometric time series (2nd ed.). New York: Wiley. Enders, W., & Granger, C. W. J. (1998). Unit-root tests and asymmetric adjustment with an example using the term structure of interest rates. Journal of Business and Economic Statistics, 16, 304–311. Enders W., & Siklos, P. L. (2001). Cointegration and threshold adjustment. Journal of Business and Economic Statistics, 19(April), 166–177. English, W. B., Nelson, W. R., & Sack, B. (2003). Interpreting the significance of the lagged interest rate in estimated monetary policy rules. Contributions to Macroeconomics, 3(1), Article 5. Favero, C. A., & Rovelli, R. (2003). Macroeconomic stability and the preferences of the fed: A formal analysis, 1961–98. Journal of Money, Credit and Banking, 35(August), 545–556. Friedman, M. (1977). Nobel lecture: Inflation and unemployment. Journal of Political Economy, 85(June), 451–72. Fuhrer, J., & Tootell, G. (2004). Eyes on the prize: How did the fed respond to the stock market? Public Policy Discussion Papers 04-2, Federal Reserve Bank of Boston, June. Gerdesmeier, D., & Roffia, B. (2004). Empirical estimates of reaction functions for the Euro area. Swiss Journal of Economics and Statistics, 140(March), 37–66. Gerlach-Kristen, P. (2003). A Taylor rule for the Euro area. Working Paper, University of Basel. Gerlach, S., & Schnabel, G. (2000). The Taylor rule and interest rates in the EMU area. Economics Letters, 67, 165–171. Giannoni, M. P., & Woodford, M. (2003). How forward looking is monetary policy? Journal of Money, Credit and Banking, 35(December), 1425–1469. Goodfriend, M. (1991). Interest rates and the conduct of monetary policy. Carnegie-Rochester Conference Series on Public Policy, 34, 7–30. Goodhart, C. A. E. (1999). Central bankers and uncertainty. Bank of England Quarterly Bulletin, 39(February), 102–114. Goodhart, C. A. E. (2004). The monetary policy committee’s reaction function: An exercise in estimation. Working Paper, Financial Markets Group, London School of Economics. Granger, C. W. J. (2004). Time series analysis, cointegration, and applications. American Economic Review, 94(June), 421–425. Granger, C. W. J., & Jeon, Y. (2004). Thick modelling. Economic Modelling, 21, 323–343. Granger, C. W. J., & Newbold, P. (1974). Spurious regressions in econometrics. Journal of Econometrics, 2, 111–120. Hamalainen, N. (2004). A survey of Taylor-type monetary policy rules. Working Paper, Department of Finance, Ministry of Finance, Canada. Hamilton, J. D. (1994). Time series analysis. Princeton: Princeton University Press. Hendry, D. F. (1995). Dynamic econometrics. Oxford: Oxford University Press. Hendry, D. F. (2004). The nobel memorial price for Clive W. J. Granger. Scandinavian Journal of Economics, 106(2), 187–213. Hetzel, R. L. (2000). The Taylor rule: Is it a useful guide to understanding monetary policy? Federal Reserve Bank of Richmond Economic Quarterly, 86, 1–33. Isard, P., Laxton, D., & Eliasson, A.-C. (1999). Simple monetary policy rules under model uncertainty. International Tax and Public Finance, 6, 537–577. Johnston, J., & Dinardo, J. (1997). Econometric methods (4th ed.). New York: McGraw-Hill.
Judd, J. P., & Rudebusch, G. D. (1998). Taylor’s rule and the Fed: 1970–1997. Federal Reserve Bank of San Francisco Economic Review, 3, 3–16. Kozicki, S. (1999). How useful are Taylor rules for monetary policy? Federal Reserve Bank of Kansas City Economic Review, 84(2), 5–33. Kuttner, K. N. (2004). A snapshot of inflation targeting in its adolescence. In: K. Christopher & S. Guttman (Eds), The future of inflation targeting (pp. 6–42). Sydney, Australia: Reserve Bank of Australia. Lanne, M. (1999). Near unit roots and the predictive power of yield spreads for changes in longterm interest rates. Review of Economics and Statistics, 81, 393–398. Lanne, M. (2000). Near unit roots, cointegration, and the term structure of interests rates. Journal of Applied Econometrics, 15(September–October), 513–529. Laubach, T., & Williams, J. C. (2003). Measuring the natural rate of interest. Review of Economics and Statistics, 85(November), 1063–1070. Leitemo, K., & Røisland, Ø. (1999). Choosing a monetary policy regime: Effects on the traded and non-traded sectors. University of Oslo and Norges Bank. Levin, A., Wieland, V., & Williams, J. C. (1999). Robustness of simple policy rules under model uncertainty. In: J. B. Taylor (Ed.), Monetary policy rules. Chicago: University of Chicago Press. MacKinnon, J. G., Haug, A., & Michelis, L. (1999). Numerical distribution functions of likelihood ratio tests for cointegration. Journal of Applied Econometrics, 14, 563–577. Mankiw, N. G., & Shapiro, M. D. (1985). Trends, random walks, and tests of the permanent income hypothesis. Journal of Monetary Economics, 16(September), 165–174. Mankiw, N. G., & Shapiro, M. D. (1986). Do we reject too often? Small sample properties of tests of rational expectations models. Economics Letters, 20, 139–145. McCallum, B. T. (1998). Issues in the design of monetary policy rules. In: J. B. Taylor & M. Woodford (Eds), Handbook of macroeconomics. Elsevier. McCallum, B. T. (2001). Should monetary policy respond strongly to output gaps? American Economic Review, 91(2), 258–262. McCallum, B. T., & Nelson, E. (1999). Performance of operational policy rules in an estimated semiclassical structural model. In: J. B. Taylor (Ed.), Monetary policy rules. Chicago: University of Chicago Press. Medina, J. P., & Valde´s, R. (1999). Optimal monetary policy rules when the current account matters. Central Bank of Chile. Meyer, L. (2004). A Term at the Fed. New York: HarperCollins. Muscatelli, V., Tirelli, A. P., & Trecroci, C. (1999). Does institutional change really matter? Inflation targets, central bank reform and interest rate policy in the OECD countries. CESifo Working Paper No. 278, Munich. Muscattelli, A., & Trecroci, C. (2000). Monetary policy rules, policy preferences, and uncertainty: Recent empirical evidence. Journal of Economic Surveys, 14(5), 597–627. Ng, S., & Perron, P. (2001). Lag length selection and the construction of unit root tests with good size and power. Econometrica, 69(November), 1519–1554. Oliner, S. D., & Sichel, D. E. (2000). The resurgence of growth in the late 1990s: Is information technology the story? Journal of Economic Perspective, 14(Fall), 3–23. Orphanides, A. (2001). Monetary policy rules based on real-time data. American Economic Review, 91(4), 964–985. Orphanides, A. (2003a). Monetary policy evaluation with noisy information. Journal of Monetary Economics, 50(3), 605–631.
Orphanides, A. (2003b). Historical monetary policy analysis and the Taylor rule. Journal of Monetary Economics, 50(5), 983–1022. Orphanides, A., & van Norden, S. (2002). The unreliability of output gap estimates in real time. Review of Economics and Statistics, 84(4), 569–583. Orphanides, A., Porter, R. D., Reifschneider, D., Tetlow, R., & Finan, F. (1999). Errors in the Measurement of the output gap and the design of monetary policy. Finance and Economics Discussion Series, 1999–45, Federal Reserve Board, August. Osterholm, P. (2003). The Taylor rule: A spurious regression. Working Paper, Uppsala University. Ozlale, U. (2003). Price stability vs. output stability: Tales from three federal reserve administrations. Journal of Economic Dynamics and Control, 9(July), 1595–1610. Phillips, P. C. B. (1986). Understanding spurious regression in econometrics. Journal of Econometrics, 33, 311–340. Phillips, P. C. B. (1989). Regression theory for nearly-integrated time series. Econometrica, 56, 1021–1043. Plosser, C. I., & Schwert, G. W. (1977). Estimation of a non-invertible moving average process: The case of over-differencing. Journal of Econometrics, 6(September), 199–224. Rabanal, P. (2004). Monetary policy rules and the U.S. business cycle: Evidence and implications. International Monetary Fund Working Paper WP/04/164, September. Rapach, D. E., & Wohar, M. E. (2005). Regime changes in international real interest rates: Are they a monetary phenomenon? Journal of Money, Credit and Banking, 37(October), 887–906. Romer, C. D., & Romer, D. H. (1989). Does monetary policy matter? A new test in the spirit of Friedman and Schwartz. NBER Macroeconomics Annual 1989 (pp. 121–170). Cambridge, MA: The MIT Press. Romer, C. D., & Romer, D. H. (2003). Choosing the federal reserve chair: Lessons from history. Working Paper, University of California, Berkeley. Rudebusch, G. D. (2001). Is the fed too timid? Monetary policy in an uncertain world. Review of Economics and Statistics, 83(2), 203–217. Rudebusch, G. D. (2002). Term structure evidence on interest rate smoothing and monetary policy inertia. Journal of Monetary Economics, 49, 1161–1187. Rudebusch, G. D., & Svensson, L. E. O. (1999). Policy rules for inflation targeting. In: J. B. Taylor (Ed.), Monetary policy rules. Chicago: University of Chicago Press. Ruth, K. (2004). Interest rate reaction functions for the Euro area: Evidence from panel data analysis. Deutsche Bundesbank Working Paper 33/2004. Sack, B. (1998). Does the fed act gradually? A VAR analysis. Journal of Monetary Economics, 46(August), 229–256. Sack, B., & Wieland, V. (2000). Interest-rate smoothing and optimal monetary policy: A review of recent empirical evidence. Journal of Economics and Business, 52, 205–228. Siklos, P. L. (1999). Inflation target design: Changing inflation performance and persistence in industrial countries. Review of the Federal Reserve Bank of St. Louis, 81(March/April), 47–58. Siklos, P. L. (2002). The changing face of central banking: Evolutionary trends since World War II. Cambridge: Cambridge University Press. Siklos, P. L. (2004). Central bank behavior, the institutional framework, and policy regimes: Inflation versus noninflation targeting countries. Contemporary Economic Policy, (July), 331–343.
Siklos, P. L., & Bohl, M. T. (2005a). The role of asset prices in Euro area monetary policy: Specification and estimation of policy rules and implications for the ECB. Working Paper, Wilfrid Laurier University. Siklos, P. L., & Bohl, M. T. (2005b). Are inflation targeting central banks relatively more forward looking? Working Paper, Wilfrid Laurier University. Siklos, P. L., & Granger, C. W. J. (1997). Regime-sensitive cointegration with an illustration from interest rate parity. Macroeconomic Dynamics, 1(3), 640–657. Sims, C., Stock, J. H., & Watson, M. W. (1990). Inference in linear time series models with some unit roots. Econometrica, 58(January), 113–144. Smets, F. (1998). Output gap uncertainty: Does it matter for the Taylor rule? BIS Working Papers, no. 60, November. So¨derlind, P., So¨derstro¨m, U., & Vredin, A. (2003). Taylor rules and the predictability of interest rates. Working Paper no. 147, Sveriges Riksbank. Staiger, D., Stock, J. H., & Watson, M. W. (1997). The NAIRU, unemployment and monetary policy. Journal of Economic Perspectives, 11(Summer), 33–50. Svensson, L.E.O. (2003). What is wrong with Taylor rules? Using judgment in monetary policy through targeting rules. Journal of Economic Literature, XLI(June), 426–477. Taylor, J. B. (1993). Discretion versus policy rules in practice. Carnegie–Rochester Conference Series on Public Policy, 39, 195–214. Taylor, J. B. (1999). A historical analysis of monetary policy rules. In: J. B. Taylor (Ed.), Monetary policy rules. Chicago: University of Chicago Press. Tchaidze, R. (2004). The greenbook and U.D. monetary policy. IMF Working Paper 04/213, November. Walsh, C. (2003a). Speed limit policies: The output gap and optimal monetary policy. American Economic Review, 93(March), 265–278. Walsh, C. (2003b). Implications for a changing economic structure for the strategy of monetary policy. In: Monetary Policy and Uncertainty: Adapting to a Changing Economy, A symposium sponsored by the Federal Reserve Bank of Kansas City Jackson Hole, Wyoming, August 28–30. Williams, J. C. (2004). Robust estimation and monetary policy with unobserved sructural change. Federal Reserve Bank of San Francisco Working Paper 04–11, July. Woodford, M. (2001). The Taylor rule and optimal monetary policy. American Economic Review, 91(2), 232–237. Woodford, M. (2003). Interest and Prices. Princeton: Princeton University Press.
BAYESIAN INFERENCE ON MIXTURE-OF-EXPERTS FOR ESTIMATION OF STOCHASTIC VOLATILITY

Alejandro Villagran and Gabriel Huerta

ABSTRACT

The problem of model mixing in time series, for which the interest lies in the estimation of stochastic volatility, is addressed using the approach known as Mixture-of-Experts (ME). Specifically, this work proposes a ME model where the experts are defined through ARCH, GARCH and EGARCH structures. Estimates of the predictive distribution of volatilities are obtained using a full Bayesian approach. The methodology is illustrated with an analysis of a section of US dollar/German mark exchange rates and a study of the Mexican stock market index using the Dow Jones Industrial index as a covariate.
1. INTRODUCTION

In options trading and in foreign exchange rate markets, the estimation of volatility plays an important role in monitoring radical changes over time of
key financial indexes. From a statistical standpoint, volatility refers to the variance of the underlying asset or return, conditional on all the previous information available until a specific time point. It is well known that the volatility of a financial series tends to change over time and there are different types of models to estimate it: continuous-time or discrete-time processes. This paper discusses a model that falls in the second category. Discrete-time models are divided into those for which the volatility is ruled by a deterministic equation and those where the volatility has a stochastic behavior. Among the former, we have the autoregressive conditional heteroskedastic (ARCH) model introduced by Engle (1982), which provided a breakthrough in the modeling and estimation of time-varying conditional variance. Extensions to this seminal model are the generalized ARCH (GARCH) of Bollerslev (1986), the exponential GARCH (EGARCH) of Nelson (1991), the Integrated GARCH (IGARCH) and the GARCH in mean (GARCH-M). Additionally, models like the stochastic volatility (SV) model of Melino and Turnbull (1990) give a discrete-time approximation to continuous diffusion processes used in option pricing. This variety of models provides a portfolio of options to represent volatility, but no agreement on which approach is best. Given this difficulty, a new source of modeling arose: mixture-of-models. Since the only agreement seems to be that no real process can be completely explained by one model, the idea of model mixing, combining different approaches into a unique representation, is very appealing. There are many types of mixture models for estimating volatility. For instance, Wong and Li (2001) proposed a mixture of ARCH models with an autoregressive component to model the mean (MAR-ARCH), for which they use the Expectation Maximization or EM algorithm to produce point estimates of the volatility. Another approach is given by Tsay (2002), who considered a mixture of ARCH and GARCH models by Markov switching. Vrontos, Dellaportas, and Politis (2000) used reversible jump Markov chain Monte Carlo (MCMC) methods to predict future volatility via model averaging of GARCH and EGARCH models. Both of these papers only considered a mixture of two models. Furthermore, Huerta, Jiang, and Tanner (2003) discussed the neural networks approach known as Hierarchical Mixture-of-Experts (HME), which is a very general and flexible approach to model mixing since it incorporates additional exogenous information, in the form of covariates or simply time, through the weights of the mixture. The example shown in that paper makes comparisons between a difference-stationary and
a trend-stationary model using time as the only covariate. Additionally, Huerta, Jiang, and Tanner (2001) consider an HME model including AR, ARCH and EGARCH models and obtain point estimates of volatility using the EM algorithm. However, that paper does not report interval estimates via a full Bayesian approach. In this paper, we use Mixture-of-Experts (ME), which is a particular case of HME, to build a mixture of ARCH, GARCH and EGARCH models. Through a full Bayesian approach based on MCMC methods, we show how to estimate the posterior distribution of the parameters and the posterior predictive distribution of the volatility, which is a very complicated function of the mixture representation. Additionally, we show how to obtain not only point estimates of the volatility but also the usually unavailable forecast intervals. The paper proceeds in the following way. In Section 2, we offer an introduction to ME and HME in the context of time series modeling. In Section 3, we give the details of our volatility ME model along with the MCMC specifications to implement a full Bayesian approach to the problem of estimating volatility. Section 4 illustrates our methodology in the context of two financial applications and Section 5 provides some conclusions and extensions.
2. MIXTURE MODELING

Hierarchical Mixture of Experts (HME) was first introduced in the seminal paper by Jordan and Jacobs (1994) and is based on mixing models to construct a neural network using the logistic distribution. This approach allows for model comparisons and a representation of the mixture weights as a function of time or other covariates. Additionally, the elements of the mixture, also known as experts, are not restricted to a particular parametric family, which allows for very general model comparisons. The model considers a response time series $\{y_t\}$ and a time series of covariates or exogenous variables $\{x_t\}$. Let $f_t(y_t \mid F_{t-1}, w; \theta)$ be the probability density function (pdf) of $y_t$ conditional on $\theta$, a vector of parameters, and $w$, the $\sigma$-algebra generated by the exogenous information $\{x_t\}_0^n$; for each $t$, $F_{t-1}$ is the $\sigma$-algebra generated by $\{y_s\}_0^{t-1}$, the previous history of the response variable up to time $t-1$. Usually, it is assumed that this conditional pdf depends on $w$ only through $x_t$. In the HME methodology the pdf $f_t$ is assumed to be a mixture of conditional pdfs of simpler models (Peng, Jacobs, & Tanner, 1996). In the
context of time series, the mixture can be represented by a finite sum
$$ f_t(y_t \mid F_{t-1}, w; \theta) = \sum_{J} g_t(J \mid F_{t-1}, w; \gamma)\, p_t(y_t \mid F_{t-1}, w, J; \eta) $$
where the functions $g_t(\cdot \mid \cdot, \cdot; \gamma)$ are the mixture weights; $p_t(\cdot \mid \cdot, \cdot, J; \eta)$ are the pdfs of simpler models defined by the label $J$; and $\gamma$ and $\eta$ are sub-vectors of the parameter vector $\theta$. The models that are being mixed in HME are commonly denoted as experts. For example, in time series, one expert could be an AR(1) model and another expert could be a GARCH(2,2) model. The experts could also be models that belong to the same class but with different orders or numbers of parameters; for example, all the experts could be AR models but with different orders and different values of the lag coefficients.

The extra hierarchy in HME partitions the space of covariates into $O$ "overlays". In each overlay we have $M$ competing models, so that the most appropriate model will be assigned a higher weight. For this hierarchical mixture, the expert index $J$ can be expressed as $J = (o, m)$, where the overlay index $o$ takes a value in the set $\{1, \ldots, O\}$ and the model type index $m$ takes a value in $\{1, \ldots, M\}$. The mixture model can be rewritten as
$$ f_t(y_t \mid F_{t-1}, w; \theta) = \sum_{o=1}^{O} \sum_{m=1}^{M} g_t(o, m \mid F_{t-1}, w; \gamma)\, p_t(y_t \mid F_{t-1}, w, o, m; \eta) $$
Within the neural network terminology, the mixture weights are known as gating functions. In contrast to other approaches, these weights have a particular parametric form that may depend on the previous history, on exogenous information or exclusively on time. This makes the weights evolve across time in a very flexible way. Specifically, it is proposed that the mixture weights have the form
$$ g_t(o, m \mid F_{t-1}, w; \gamma) = \left\{ \frac{e^{v_o + u_o^{T} W_t}}{\sum_{s=1}^{O} e^{v_s + u_s^{T} W_t}} \right\} \left\{ \frac{e^{v_{m|o} + u_{m|o}^{T} W_t}}{\sum_{l=1}^{M} e^{v_{l|o} + u_{l|o}^{T} W_t}} \right\} $$
where the $v$'s and $u$'s are parameters, which are components of $\gamma$, and $W_t$ is an input at time $t$ that is measurable with respect to the $\sigma$-algebra induced by $F_{t-1} \cup w$. In this case, $\gamma$ includes the following components: $v_1, u_1, \ldots, v_{O-1}, u_{O-1}, v_{1|1}, u_{1|1}, \ldots, v_{M-1|1}, u_{M-1|1}, \ldots, v_{M-1|O}, u_{M-1|O}$. For identifiability of the mixture weights, we set $v_O = u_O = v_{M|o} = u_{M|o} = 0$ for all $o = 1, \ldots, O$. This restriction guarantees that the gating functions are uniquely identified by $\gamma$, as shown in Huerta et al. (2003). Both terms that
define the mixture weights follow a multinomial logistic pdf, where the first term describes the probability of a given overlay and the second term the probability of a model within the overlay, each of these probabilities being a function of the input $W_t$. Mixtures of time series models for estimating volatility have also been considered by Wong and Li (2001); however, these authors only look at the problem from a point estimation perspective.

Inferences on the parameter vector $\theta$ can be based on the log-likelihood function
$$ L_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \log f_t(y_t \mid F_{t-1}, w; \theta) $$
To obtain the maximum likelihood estimator of $\theta$, $\hat{\theta} = \arg\max L_n(\theta)$, it is possible to use the EM algorithm as described in Huerta et al. (2003). A general presentation of the EM algorithm appears in Tanner (1996). After the MLE $\hat{\theta}$ is obtained, the interest focuses on the evaluation of the weights assigned to each of the $M$ models as a function of time $t$. Primarily, there are two ways to achieve this. The first one is via the conditional probability of each model $m$, defined by
$$ P_t(m \mid y_t, F_{t-1}, w; \theta) \equiv h_m(t) = \sum_{o=1}^{O} h_{om}(t; \theta) $$
where "conditional" refers to conditioning on the actual observation at time $t$, $y_t$. The second approach is to consider the unconditional probability at time $t$, given by
$$ P_t(m \mid F_{t-1}, w; \theta) \equiv g_m(t) = \sum_{o=1}^{O} g_{om}(t; \theta) $$
Point estimates of each probability can be obtained by evaluation at $\hat{\theta}$ or by computing the expected value with respect to the posterior distribution $p(\theta \mid F_n, w)$.

The particular case of HME that we consider in this paper is $O = 1$, which is known as Mixture of Experts (ME). In ME modeling it is assumed that the process that generates the response variable can be decomposed into a set of subprocesses defined over specific regions of the space of covariates. For each value of the covariate $x_t$, a 'label' $r$ is chosen with probability $g_t(r \mid w, F_{t-1}; \gamma)$. Given this value of $r$, the response $y_t$ is generated from the conditional pdf $p_t(y_t \mid r, w, F_{t-1}; \eta)$. The pdf of $y_t$ conditional on the
parameters, the covariate and the response history is given by
$$ f(y_t \mid w, F_{t-1}; \theta) = \sum_{r=1}^{M} g_t(r \mid w, F_{t-1}; \gamma)\, p_t(y_t \mid r, w, F_{t-1}; \eta) $$
and the likelihood function is
$$ L(\theta \mid w) = \prod_{t=1}^{n} \sum_{r=1}^{M} g_t(r \mid w, F_{t-1}; \gamma)\, p_t(y_t \mid r, w, F_{t-1}; \eta) $$
As with HME, the mixture probabilities associated with each value of $r$ are defined through a logistic function
$$ g_t(r \mid w, F_{t-1}; \gamma) \equiv g_r^{(t)} = \frac{e^{x_r}}{\sum_{h=1}^{M} e^{x_h}} $$
where $x_r = v_r + u_r^{T} W_t$ and $W_t$ is an input that could be a function of time, of the history and of $\{x_t\}$. For identifiability, $x_M$ is set equal to zero.

Inference about the parameters in the ME is simplified by augmenting the data with non-observable indicator variables, which determine the type of expert. For each time $t$, $z_r^{(t)}$ is a binary variable such that $z_r^{(t)} = 1$ with probability
$$ h_r^{(t)} = \frac{g_r^{(t)}\, p_t(y_t \mid r, w, F_{t-1}; \eta)}{\sum_{r=1}^{M} g_r^{(t)}\, p_t(y_t \mid r, w, F_{t-1}; \eta)} $$
If $w' = \{(x_t, z^{(t)})\}_{t=1}^{n}$, where $z^{(t)}$ is the vector that includes all the indicator variables, the augmented likelihood for the ME model is
$$ L(\theta \mid w') = \prod_{t=1}^{n} \prod_{r=1}^{M} \left\{ g_r^{(t)}\, p_t(y_t \mid r, w, F_{t-1}; \eta) \right\}^{z_r^{(t)}} $$
In the following section, we discuss how to estimate a ME model using a Bayesian approach, with the experts being ARCH, GARCH and EGARCH models. We picked these experts as our model building blocks since these are the main conditional heteroskedasticity models used in practice, as pointed out by the seminal papers of Engle (1982, 1995), Bollerslev (1986) and Nelson (1991). Also, Tsay (2002) and Vrontos et al. (2000) mention that these models are interesting in practice due to their parsimony.
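Before specializing the experts to volatility models, it may help to see the generic ME objects in code. The following Python sketch is ours, not the authors'; the Normal experts, parameter values and function names are placeholders. It evaluates the softmax gating weights g_r^(t), the mixture density, and the membership probabilities h_r^(t) used for data augmentation.

```python
import numpy as np
from scipy.stats import norm

def gating_weights(W_t, v, u):
    """Softmax gating weights g_r^(t) with x_r = v_r + u_r * W_t.

    v and u are length-M arrays; their last components are fixed at zero
    for identifiability, as in the text."""
    x = v + u * W_t
    x = x - x.max()                      # numerical stability
    ex = np.exp(x)
    return ex / ex.sum()

def mixture_density(expert_pdfs, weights):
    """Mixture pdf f(y_t | .) = sum_r g_r^(t) * p_t(y_t | r, .)."""
    return float(np.dot(weights, expert_pdfs))

def membership_probs(expert_pdfs, weights):
    """Posterior membership probabilities h_r^(t) used for data augmentation."""
    num = weights * expert_pdfs
    return num / num.sum()

# Small illustration with M = 3 Normal experts that differ only in variance
y_t, W_t = 0.4, 1.2
v = np.array([0.2, -0.1, 0.0])           # v_M = 0
u = np.array([0.5, 0.3, 0.0])            # u_M = 0
sigma2 = np.array([0.8, 1.5, 2.0])       # expert conditional variances at time t
pdfs = norm.pdf(y_t, loc=0.0, scale=np.sqrt(sigma2))
g = gating_weights(W_t, v, u)
print(g, mixture_density(pdfs, g), membership_probs(pdfs, g))
```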
3. BAYESIAN INFERENCE ON MIXTURES FOR VOLATILITY

If the Bayesian paradigm is adopted, inferences about $\theta$ are based on the posterior distribution $p(\theta \mid y)$. Bayes' theorem establishes that
$$ p(\theta \mid y) = \frac{f(y \mid \theta)\, p(\theta)}{\int_{\Theta} f(y \mid \theta)\, dF^{p}(\theta)} $$
which defines the way to obtain the posterior distribution of $\theta$ through the prior $p(\theta)$ and the likelihood function $f(y \mid \theta)$. However, for a ME or HME approach the marginal distribution of $y$, $\int_{\Theta} f(y \mid \theta)\, dF^{p}(\theta)$, cannot be obtained analytically. We overcome this difficulty by using MCMC methods to simulate samples from $p(\theta \mid y)$. For more details about MCMC methods see Tanner (1996).

First, we assume that the prior distribution for $\theta = (\eta, \gamma)$ has the form $p(\theta) = p(\eta) p(\gamma)$, so the expert parameters $\eta$ and the weight or gating parameters $\gamma$ are a priori independent. We define $Z = \{z^{(t)}, t = 1, \ldots, n\}$ and, for each $t$, $z^{(t)} = \{z_r^{(t)}, r = 1, \ldots, M\}$ is the set of indicator variables at $t$. Conditional on $\theta$, $P(z^{(t)} \mid \theta, w)$ is a Multinomial distribution with total count equal to 1 and cell probabilities $g_r(t, \gamma)$. Our MCMC scheme is based on the fact that it is easier to obtain samples from the augmented posterior distribution $p(\theta, Z \mid F_n, w)$ than to simulate values directly from $p(\theta \mid F_n, w)$. This data augmentation principle was introduced by Tanner and Wong (1987). The MCMC scheme follows a Gibbs sampling format for which we iteratively simulate from the conditional distributions $p(\theta \mid Z, F_n, w)$ and $p(Z \mid \theta, F_n, w)$.

The conditional posterior $p(Z \mid \theta, F_n, w)$ is sampled through the marginal conditional posteriors $p(z^{(t)} \mid \theta, F_n, w)$ defined for each value of $t$. Given $\theta$, $F_n$ and $w$, it can be shown that the vector $z^{(t)}$ has a Multinomial distribution with total count equal to 1 and
$$ P(z_r^{(t)} = 1 \mid \theta, F_n, w) = h_r(t; \theta) = \frac{g_r^{(t)}\, p_t(y_t \mid r, F_{t-1}, w; \eta)}{\sum_{r=1}^{M} g_r^{(t)}\, p_t(y_t \mid r, F_{t-1}, w; \eta)} $$
The vector $\theta = (\eta, \gamma)$ is sampled in two stages. First, $\eta$ is simulated from the conditional posterior distribution $p(\eta \mid \gamma, Z, F_n, w)$ and then $\gamma$ is sampled from the conditional posterior $p(\gamma \mid \eta, Z, F_n, w)$. By Bayes' theorem,
$$ p(\eta \mid \gamma, Z, F_n, w) \propto \prod_{t=1}^{n} \prod_{r=1}^{M} f_t(y_t \mid F_{t-1}, w, r; \eta)^{z_r^{(t)}}\, p(\eta) $$
Analogously,
$$ p(\gamma \mid \eta, Z, F_n, w) \propto \prod_{t=1}^{n} \prod_{r=1}^{M} g_t(r \mid F_{t-1}, w; \gamma)^{z_r^{(t)}}\, p(\gamma) $$
If $\eta$ can be decomposed into a sub-collection of parameters $\eta_r$ that are assumed a priori independent, simulation from the full conditional for $\eta$ reduces to individual simulation of each $\eta_r$. If $\eta_r$ is assigned a conjugate prior with respect to the pdf of the $r$th expert, the simulation of $\eta$ is straightforward. For $\gamma$, it is necessary to implement Metropolis–Hastings steps to obtain a sample from its full conditional distribution. The specific details of the MCMC implementation depend on the type of expert models and the prior distributions on model parameters. For example, Huerta et al. (2003) discussed an HME model with a full Bayesian approach where the experts are a 'difference-stationary' and a 'trend-stationary' model. The priors used on the parameters of their HME model were noninformative.

Here, we consider the case of experts that allow volatility modeling. It is well known that the volatility of a financial time series can be represented by ARCH, GARCH and EGARCH models. The properties of these models make them attractive for obtaining forecasts in financial applications. We propose a ME that combines the models AR(1)-ARCH(2), AR(1)-GARCH(1,1) and AR(1)-EGARCH(1,1). Although the order of the autoregressions for the observations and the volatility is low for these models, in practice it is usually not necessary to consider higher order models. The elements of our ME model are the time series of returns $\{y_t\}_1^n$ and the series of covariates $\{x_t\}_1^n$, which in one of the applications presented in the next section is simply time and in the other is the Dow Jones index (DJI). In any case, $x_r = v_r + u_r^{T} W_t$, where $W_t$ is an input that depends on the covariates. Our expert models are parameterized in the following way:

AR(1)-ARCH(2)
$$ y_t = \phi_1 y_{t-1} + \epsilon_{1,t}, \qquad \epsilon_{1,t} \sim N(0, \sigma_{1,t}^2) $$
$$ \sigma_{1,t}^2 = \omega_1 + \alpha_{11} \epsilon_{1,t-1}^2 + \alpha_{12} \epsilon_{1,t-2}^2 $$

AR(1)-GARCH(1,1)
$$ y_t = \phi_2 y_{t-1} + \epsilon_{2,t}, \qquad \epsilon_{2,t} \sim N(0, \sigma_{2,t}^2) $$
$$ \sigma_{2,t}^2 = \omega_2 + \alpha_{21} \epsilon_{2,t-1}^2 + \alpha_{22} \sigma_{2,t-1}^2 $$

AR(1)-EGARCH(1,1)
$$ y_t = \phi_3 y_{t-1} + \epsilon_{3,t}, \qquad \epsilon_{3,t} \sim N(0, \sigma_{3,t}^2) $$
$$ \ln(\sigma_{3,t}^2) = \omega_3 + \alpha_{31} \ln(\sigma_{3,t-1}^2) + \alpha_{32} \epsilon_{3,t-1} + \alpha_{33} \left( |\epsilon_{3,t-1}| - E(|\epsilon_{3,t-1}|) \right) $$
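As a companion to these three specifications, here is a small Python sketch (our own illustration, with arbitrary parameter values) of how the conditional variance recursions of the three experts can be computed from AR(1) residuals; E|ε| is evaluated as σ·sqrt(2/π) under the Normal assumption, which is our reading of the expert definition.

```python
import numpy as np

def arch2_vol(eps, omega, a1, a2, sigma2_0):
    """sigma^2_t = omega + a1*eps_{t-1}^2 + a2*eps_{t-2}^2 (AR(1)-ARCH(2) expert)."""
    s2 = np.full(len(eps), sigma2_0)
    for t in range(2, len(eps)):
        s2[t] = omega + a1 * eps[t - 1] ** 2 + a2 * eps[t - 2] ** 2
    return s2

def garch11_vol(eps, omega, a1, b1, sigma2_0):
    """sigma^2_t = omega + a1*eps_{t-1}^2 + b1*sigma^2_{t-1} (AR(1)-GARCH(1,1) expert)."""
    s2 = np.full(len(eps), sigma2_0)
    for t in range(1, len(eps)):
        s2[t] = omega + a1 * eps[t - 1] ** 2 + b1 * s2[t - 1]
    return s2

def egarch11_vol(eps, omega, a1, a2, a3, sigma2_0):
    """ln sigma^2_t = omega + a1*ln sigma^2_{t-1} + a2*eps_{t-1} + a3*(|eps_{t-1}| - E|eps_{t-1}|)."""
    s2 = np.full(len(eps), sigma2_0)
    for t in range(1, len(eps)):
        expected_abs = np.sqrt(s2[t - 1]) * np.sqrt(2.0 / np.pi)  # E|eps| for a Normal (assumed)
        s2[t] = np.exp(omega + a1 * np.log(s2[t - 1])
                       + a2 * eps[t - 1] + a3 * (abs(eps[t - 1]) - expected_abs))
    return s2

# AR(1) residuals from a simulated return series (phi is a placeholder value)
rng = np.random.default_rng(7)
y = rng.normal(0.0, 0.01, 500)
phi = 0.1
eps = y[1:] - phi * y[:-1]
print(garch11_vol(eps, omega=1e-6, a1=0.05, b1=0.9, sigma2_0=float(np.var(eps)))[:5])
```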
For each expert $m = 1, 2, 3$, the ME is represented by the following pdfs and gating functions:

Expert 1
$$ g_t(1 \mid F_{t-1}, w; \gamma) = \frac{\exp\{x_1\}}{\sum_{r=1}^{3} \exp\{x_r\}} = \frac{\exp\{v_1 + u_1 W_t\}}{\sum_{r=1}^{3} \exp\{v_r + u_r W_t\}} $$
$$ p_t(y_t \mid 1, F_{t-1}, w; \eta_1) = \frac{1}{\sqrt{2\pi\sigma_{1,t}^2}} \exp\left\{ -\frac{1}{2\sigma_{1,t}^2} (y_t - \phi_1 y_{t-1})^2 \right\} $$

Expert 2
$$ g_t(2 \mid F_{t-1}, w; \gamma) = \frac{\exp\{x_2\}}{\sum_{r=1}^{3} \exp\{x_r\}} = \frac{\exp\{v_2 + u_2 W_t\}}{\sum_{r=1}^{3} \exp\{v_r + u_r W_t\}} $$
$$ p_t(y_t \mid 2, F_{t-1}, w; \eta_2) = \frac{1}{\sqrt{2\pi\sigma_{2,t}^2}} \exp\left\{ -\frac{1}{2\sigma_{2,t}^2} (y_t - \phi_2 y_{t-1})^2 \right\} $$

Expert 3
$$ g_t(3 \mid F_{t-1}, w; \gamma) = \frac{\exp\{x_3\}}{\sum_{r=1}^{3} \exp\{x_r\}} = \frac{1}{\sum_{r=1}^{3} \exp\{v_r + u_r W_t\}} $$
$$ p_t(y_t \mid 3, F_{t-1}, w; \eta_3) = \frac{1}{\sqrt{2\pi\sigma_{3,t}^2}} \exp\left\{ -\frac{1}{2\sigma_{3,t}^2} (y_t - \phi_3 y_{t-1})^2 \right\} $$

As mentioned before, the vector $\theta = (\eta, \gamma)$ can be decomposed into two subsets: one that includes the expert parameters, $\eta = (\alpha_{11}, \alpha_{12}, \alpha_{21}, \alpha_{22}, \alpha_{31}, \alpha_{32}, \alpha_{33}, \omega_1, \omega_2, \omega_3, \phi_1, \phi_2, \phi_3)$, and another subvector that includes the gating function parameters, $\gamma = (u_1, u_2, v_1, v_2)$. The likelihood function of the model is expressed as
$$ L(\theta \mid w) = \prod_{t=1}^{n} \sum_{r=1}^{3} g_r^{(t)} \exp\left\{ -\frac{1}{2\sigma_{r,t}^2} (y_t - \phi_r y_{t-1})^2 \right\} $$
and the augmented likelihood function is
$$ L(\theta \mid w') = \prod_{t=1}^{n} \prod_{r=1}^{3} \left[ g_r^{(t)} \exp\left\{ -\frac{1}{2\sigma_{r,t}^2} (y_t - \phi_r y_{t-1})^2 \right\} \right]^{z_r^{(t)}} $$
The MCMC scheme to obtain posterior samples for this ME is based on the principle of "divide and conquer". For Expert 1, we assume that $\eta_1 = (\phi_1, \omega_1, \alpha_{11}, \alpha_{12})$ has a prior distribution with components that are a priori
independent: $p(\eta_1) = p(\phi_1) p(\omega_1) p(\alpha_{11}) p(\alpha_{12})$, where $\phi_1 \sim N(0, 0.1)$, $\omega_1 \sim U(0,1)$, $\alpha_{11} \sim U(0,1)$ and $\alpha_{12} \sim U(0,1)$. For Expert 2, $\eta_2 = (\phi_2, \omega_2, \alpha_{21}, \alpha_{22})$ also has components that are a priori independent, where $\phi_2 \sim N(0, 0.1)$, $\omega_2 \sim U(0,1)$, $\alpha_{21} \sim U(0,1)$ and $\alpha_{22} \sim U(0,1)$. In an analogous way, for Expert 3 the vector $\eta_3 = (\phi_3, \omega_3, \alpha_{31}, \alpha_{32}, \alpha_{33})$ has independent components with marginal prior distributions given by $\phi_3 \sim N(0, 0.1)$, $\omega_3 \sim N(0, 10)$, $\alpha_{31} \sim U(-1, 1)$, $\alpha_{32} \sim N(0, 10)$ and $\alpha_{33} \sim N(0, 10)$. Finally, each of the entries of the vector of parameters appearing in the mixture weights, $\gamma = (u_1, u_2, v_1, v_2)$, is assumed to have a $U(-l, l)$ prior distribution with a large value for $l$. This prior specification was chosen to reflect vague (flat) prior information and to facilitate the calculations in the different steps of our MCMC scheme. The $N(0, 0.1)$ prior on the AR(1) coefficients was proposed with the idea of containing most of its mass in the region defined by the stationarity condition. Instead of a $N(0, 0.1)$ prior, we also used a $U(-1, 1)$ prior on the coefficients, and the results obtained were essentially the same as with the Normal prior. In fact, the noninformative priors were suggested by Vrontos et al. (2000) in the context of parameter estimation, model selection and volatility prediction. Also, these authors show that under these priors and for pure GARCH/EGARCH models, the difference between classical and Bayesian point estimation is minimal. The restrictions on these priors in the different parameter spaces are there to satisfy the stationarity conditions of the expert models. For our ME model, this type of noninformative prior was key to producing good MCMC convergence results, which could not be obtained with other classes of (informative) priors. Since the parameters of ARCH/GARCH/EGARCH models are not very meaningful in practice, the most typical prior specification for these parameters is to adopt noninformative prior distributions. To our knowledge, there is no study on the effects of using informative priors in this context. Furthermore, Villagran (2003) discusses another aspect of our prior specification in terms of simulated data. If the true data follow an AR(1)-GARCH(1,1) structure, the noninformative prior allows one to estimate the parameters of the true model with a maximum absolute error of 0.003. The maximum posterior standard deviation for all the model parameters is 0.0202 and the posterior mean of $g_r^{(t)}$ is practically equal to one for the true model (in this case GARCH) and for all times $t$.

Our MCMC algorithm can be summarized as follows:

- Assign initial values for $\theta$ and with these values calculate the volatilities for each expert, $\sigma_{1,t}^{2(0)}$, $\sigma_{2,t}^{2(0)}$ and $\sigma_{3,t}^{2(0)}$, for all $t$.
- Evaluate the probabilities $g_r^{(t)}$ at $\gamma^{(0)}$ and compute the conditional probabilities $h_r^{(t)}$ for all $t$.
- Generate $z_r^{(t)}$ conditional on $\theta$, $F_n$, $w$ from a Multinomial distribution with total count equal to 1 and cell probabilities $h_r^{(t)}$. In this step we generate the vectors $(z_1^{(t)}, z_2^{(t)}, z_3^{(t)})$ for all values of $t$.
- For the augmented posterior distribution, we generate each of the expert parameters and each of the mixture weight or gating function parameters via Metropolis–Hastings (M–H) steps. A general description of the M–H algorithm with several illustrative examples appears in Tanner (1996). At each M–H step, we propose a new value $\theta^{(j)}$ for the parameters from a candidate distribution and then accept or reject this new value with probability
$$ \alpha(\theta^{(j-1)}, \theta^{(j)}) = \min\left\{ \frac{p(\theta^{(j)} \mid \cdot)\, q(\theta^{(j-1)})}{p(\theta^{(j-1)} \mid \cdot)\, q(\theta^{(j)})},\ 1 \right\} $$
- After generating all the model parameters at iteration $j$, we update the volatilities $\sigma_{1,t}^{2(j)}$, $\sigma_{2,t}^{2(j)}$ and $\sigma_{3,t}^{2(j)}$, the probabilities $g_r^{(t)}$ and $h_r^{(t)}$, and the indicator variables $Z^{(j)}$. The algorithm is iterated until Markov chain convergence is reached. An initial section of the iterations is considered a burn-in period and the remaining iterations are kept as posterior samples of the parameters.
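The sketch below gives one possible shape for a single sweep of such a sampler in Python. It is our own schematic, not the authors' implementation: the user-supplied callables (gating_probs, expert_logpdf, log_post), the component-wise random-walk proposal and the step size are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_mcmc_sweep(theta, y, W, gating_probs, expert_logpdf, log_post, step=0.05):
    """One sweep: draw the indicators z by data augmentation, then random-walk M-H on theta.

    gating_probs(theta, W)   -> (n, M) array of g_r^(t)
    expert_logpdf(theta, y)  -> (n, M) array of log p_t(y_t | r, .)
    log_post(theta, y, W, z) -> log of the augmented posterior (up to a constant)
    All three are assumed to be supplied by the user for the chosen experts."""
    n = len(y)
    # 1. Data augmentation: z_t ~ Multinomial(1, h_t), with h_r^(t) proportional to g_r^(t) p_t(.)
    h = gating_probs(theta, W) * np.exp(expert_logpdf(theta, y))
    h /= h.sum(axis=1, keepdims=True)
    z = np.array([rng.multinomial(1, h[t]) for t in range(n)])

    # 2. Metropolis-Hastings: update one parameter at a time with a symmetric random walk,
    #    so the proposal densities q cancel in the acceptance ratio.
    theta = theta.copy()
    for k in range(len(theta)):
        proposal = theta.copy()
        proposal[k] += step * rng.normal()
        if np.log(rng.uniform()) < log_post(proposal, y, W, z) - log_post(theta, y, W, z):
            theta = proposal
    return theta, z
```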
Given that mixture models are highly multimodal, to improve the convergence of our MCMC method it is convenient to propose several starting points for $\theta$ and run the algorithm for a few iterations. The value that produces the maximum posterior density is then used as the initial point $\theta^{(0)}$ to produce longer runs of the Markov chain. In the applications presented in the next section, we used 20 overdispersed starting values for $\theta$ and calibrated our proposal distribution so that the acceptance rates of all the M–H steps were around 45%. We kept these acceptance rates relatively low to allow full exploration of the parameter space and to avoid getting stuck around a local mode.

For a specific MCMC iteration, the volatility or conditional variance of our ME model is computed as
$$ V_t^{(j)} = \sum_{r=1}^{3} g_{r,t}^{(j)} \sigma_{r,t}^{2(j)} + \sum_{r=1}^{3} g_{r,t}^{(j)} \left( m_{r,t}^{(j)} - \bar{m}_t^{(j)} \right)^2, \qquad \bar{m}_t^{(j)} = \sum_{r=1}^{3} g_{r,t}^{(j)} m_{r,t}^{(j)} = \sum_{r=1}^{3} g_{r,t}^{(j)} \phi_r^{(j)} y_{t-1} $$
where the index $t$ represents time, the index $j$ the iteration, and $m_{r,t}$ the mean of expert $r$ at time $t$. These expressions follow from well-known results to compute
the variance of a mixture distribution with three components. Given a value of the model parameters at iteration $j$, $V_t^{(j)}$ can be directly evaluated from these equations. The expression for the volatility of our ME model is formed by two terms. The first term represents the dependency of the conditional variance with respect to past volatilities, and the second term represents changes in the volatility of the mixture model due to the differences in conditional mean between experts. In the next section, we show that using time as a covariate allows one to detect structural changes in volatility, so the ME is able to determine whether the process generating the data corresponds to a unique expert.
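A direct translation of this variance decomposition into Python (our own illustration; in practice the inputs would come from one MCMC draw) could look as follows.

```python
import numpy as np

def mixture_volatility(g, sigma2, m):
    """Volatility of the three-component ME at one time point and one MCMC draw.

    g      : gating weights g_{r,t}^{(j)}, summing to 1
    sigma2 : expert conditional variances sigma^2_{r,t}^{(j)}
    m      : expert conditional means m_{r,t}^{(j)} = phi_r^{(j)} * y_{t-1}
    Returns V_t^{(j)} = sum_r g_r*sigma2_r + sum_r g_r*(m_r - m_bar)^2."""
    g = np.asarray(g, dtype=float)
    m = np.asarray(m, dtype=float)
    m_bar = np.dot(g, m)
    return float(np.dot(g, sigma2) + np.dot(g, (m - m_bar) ** 2))

# Example with made-up values for one (t, j)
print(mixture_volatility(g=[0.2, 0.7, 0.1],
                         sigma2=[0.8e-4, 1.2e-4, 2.5e-4],
                         m=[0.001, 0.0005, -0.002]))
```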
4. APPLICATIONS

4.1. Exchange Rates US Dollar/German Mark

Figure 1(a) shows 500 daily observations of the exchange rate between the American dollar and the German mark starting in October of 1986, and Fig. 1(b) shows the corresponding returns of these exchange rates.1

Fig. 1. (a) Exchange Rates between U.S. Dollar and German Mark Starting from October 1986. (b) Returns of the Exchange Rates.
Fig. 2. Exchange Rates Example. Posterior Means of $g_r^{(t)}$, $r = 1, 2, 3$ (Solid Lines) and 95% Credible Intervals (Dashed Lines). Panels (top to bottom): AR(1)-ARCH(2), AR(1)-GARCH(1,1), AR(1)-EGARCH(1,1).
Using the returns as our response time series, we implemented the ME model with time as our covariate. Figure 2 shows posterior means and 95% credible intervals for the unconditional probabilities of each expert, i.e., $g_r^{(t)}$, $r = 1, 2, 3$. Both $h_r^{(t)}$ and $g_r^{(t)}$ are functions of the unknown parameter vector $\theta$, so it makes sense to attach measures of uncertainty to these probabilities. We can appreciate that the expert that dominates in terms of probability is the AR(1)-GARCH(1,1) model. Furthermore, Fig. 2 also shows the relative uncertainty about the different experts across time. For the period covering October 1986 to January 1987, there is a significant weight associated with the AR(1)-EGARCH(1,1) model, and the credible band for the weight of this model can go as high as 0.8 and as low as 0.2. In Fig. 3, we report posterior means of the conditional probabilities of each model, $h_r^{(t)}$, $r = 1, 2, 3$. This figure shows a similar pattern compared to the description of probabilities given by Fig. 2. Since these are conditional probabilities, individual
Fig. 3. Exchange Rates Example. Posterior Means of $h_r^{(t)}$, $r = 1, 2, 3$. Panels (top to bottom): AR(1)-ARCH(2), AR(1)-GARCH(1,1), AR(1)-EGARCH(1,1).
observations may produce high fluctuations in probability. However, this figure confirms that the models that dominate in the ME, at least in the initial time periods, are the AR(1)-GARCH(1,1) and the AR(1)-EGARCH(1,1). Toward the end of the period considered, the dominant model is the AR(1)-GARCH(1,1), but the AR(1)-ARCH(2) model has a significant weight of 0.4. In Fig. 4, we present posterior mean estimates of the volatility for the ME model and for the individual expert models. The ME model has a higher volatility estimate at the beginning of the time series in contrast to the individual models. This is due to the influence of the EGARCH model in the first part of the series, as shown by Figs. 2 and 3. A referee and the editors suggested that we compare the squared residuals of a pure AR(1) model fitted to the return series with the volatilities presented in Fig. 4. These residuals seem to be better characterized by the AR(1)-ARCH(2) volatilities towards the end of the time period covered by the data. However, at the beginning of the period, the residuals are
291
Bayesian Inference on Mixture-of-Experts
AR(1) ARCH(2) 4
3
3
volatility
volatility
Mixture of Experts 4
2
1
2
1
0
0
AR(1) GARCH(1,1)
AR(1) EGARCH(1,1) 15
1
volatility
volatility
0.8 0.6 0.4
10
5 0.2 0 Dec 86
Jan 87 May 87 Aug 87 Nov 87 Mar 88
0 Dec 86
Jan 87
May 87 Aug 87 Nov 87
Mar 88
Fig. 4. Exchange Rates Example. Posterior Mean Estimate of Volatility for Mixture-of-Experts Model and Posterior Estimates of Volatility for Individual Models.
more closely followed by the AR(1)-EGARCH(1,1) volatilities. Additionally, we think that the ME is at least as good as any individual expert model since it is pooling information from different models. A great advantage of the ME model is that it shows, as a function of time t, how different expert models are competing with each other conditional on the information at t – 1 and how the volatilities change according to time. Notice that from January 1987, the volatilities of the AR(1)-ARCH(2) are consistent with the volatilities of the ME model. Figure 5 considers a ‘‘future’’ period starting from March 1988 and that covers 100 daily exchange-rate values. Figure 5(a) shows the time series of returns for this future period and Fig. 5(b) shows the one-step-ahead predictive posterior means and the 95% one-step-ahead forecast intervals for volatility based on the ME model that only uses previous data from October 1986 to March 1988 and with a forecasting horizon of 100 time steps. This figure illustrates one of the main
292
ALEJANDRO VILLAGRAN AND GABRIEL HUERTA 0.02 0.015
returns
0.01 0.005 0 −0.005 −0.01
(a) −0.015 0 0.55 Posterior mean 0.5
Forecast Intervals 95%
volatility
0.45 0.4 0.35 0.3 0.25 0.2
(b)
Mar 88
Apr 88
May 88
Jun 88
Fig. 5. Exchange Rates Example. (a) Time Series of Returns Starting from March 1988. (b) Predictive Posterior Means and 95% Forecast Intervals for Volatility.
features of our model. The MCMC approach allows us to compute samples of future or past volatilities values that can be summarized in terms of predictive means and credible intervals. In our ME model, the volatility is a very complicated function of the parameters and producing non-Monte Carlo estimates, especially predictive intervals, is practically impossible. 4.2. Analysis of the Mexican Stock Market In this application, we studied the behavior of the Mexican stock market (IPC) index using as covariates the Dow Jones Industrial (DJI) index from January 2000 to September 2004 and also using time. Figure 6 shows both the IPC index and the DJI index time series with their corresponding returns. It is obvious that the IPC index and the DJI index have the same overall pattern over time. In fact, some Mexican financial analysts accept that the IPC index responds to every ‘strong’ movement of the DJI index. Our Bayesian ME approach adds some support to this theory.
293
Bayesian Inference on Mixture-of-Experts 11000
0.1 0.08 0.06 0.04
9000
returns
IPC Index
10000
8000
0.02 0 −0.02
7000
−0.04 −0.06
6000
(b)
0.1
11500
0.08
11000
0.06
10500
0.04
10000 9500 9000
0.02 0 −0.02
8500
−0.04
8000
−0.06
7500
(c)
−0.08
12000
returns
DJI Index
(a)
−0.08 Oct 00
Aug 01 May 02 Mar 03
Jan 04
(d)
Oct 00
Aug 01 May 02 Mar 03
Jan 04
Fig. 6. (a) The Mexican Stock Market (IPC) Index from January 2000 to September 2004. (b) Time Series of Returns for the IPC Index. (c) The Dow Jones Index (DJI) from January 2000 to September 2004. (d) Time Series of Returns for the DJI.
In Fig. 7, we show the posterior distributions of the parameters for the mixture weights or gating functions when Wt was set equal to the DJI index. The posterior distribution for the ‘slope’ parameters u1 and u2 have most of their posterior mass away from 0, which means that the effect of the covariate in our ME analysis is ‘highly significant’. In Fig. 8 we show the mixture weights of the ME using different covariates. The left column presents the posterior mean estimates of gr ðtÞ; r ¼ 1; 2; 3 using the DJI index as covariate and the right column shows the estimates as a function of time. The right column shows a shift on the regime since the AR(1)EGARCH(1,1) expert rules the evolution of the volatility of the IPC index from January 2000 to March 2002. After this date, the AR(1)-ARCH(2) model is the one with higher probability. As a function of the DJI index, the mixture weights behavior is quite different. Now the AR(1)-EGARCH(1,1) expert has higher probability in comparison to the other experts. This is a result due to the common high volatility shared by the IPC index and the DJI index.
294
ALEJANDRO VILLAGRAN AND GABRIEL HUERTA
5000
5000
4000
4000
3000
3000
2000
2000
1000
1000
(a)
0 −4
−3
−2
−1
0
1
2
3
0
5
6
7
8
9
0 −8
−7
−6
−5
−4
(b)
5000
5000
4000
4000
3000
3000
2000
2000
1000
1000
(c)
0 −5
−4
−3
−2
−1
0
1
(d)
10
−3
11
−2
12
−1
Fig. 7. (a) Posterior Distribution for Gating Parameter v1. (b) Posterior Distribution for Gating Parameter u1. (c) Posterior Distribution for Gating Parameter v2. (d) Posterior Distribution for Gating Parameter u2.
5. CONCLUSIONS AND EXTENSIONS In this paper, we present a mixture modeling approach based on the Bayesian paradigm and with the goal of estimating SV. We illustrate the differences of our mixture methodology versus a sole model approach in the context of ARCH/GARCH/EGARCH models for two different financial series. The two main aspects of our ME model are: (1) the comparison of different volatility models as a function of covariates and (2) the estimation of predictive volatilities with their corresponding measure of uncertainty given by a credible interval. On the other hand, we had only considered ME and not HME. The difficulty with HME is that it requires the estimation of the number of overlays O, which poses challenging computational problems in the form of reversible jump MCMC methods. Additionally, we had not considered any other experts beyond ARCH, GARCH and EGARCH
295
Bayesian Inference on Mixture-of-Experts AR(1) −ARCH(2) 1
0.8
0.8 probability
probability
AR(1)−ARCH(2) 1
0.6 0.4 0.2
0.6 0.4 0.2
0
0 AR(1) −GARCH(1,1)
1
1
0.8
0.8 probability
probability
AR(1) −GARCH(1,1)
0.6 0.4
0.6 0.4
0.2
0.2
0
0 AR(1) −EGARCH(1,1) 1
0.8
0.8 probability
probability
AR(1)−EGARCH(1,1) 1
0.6 0.4 0.2 0 −0.1
0.6 0.4 0.2
−0.05
0
0.05
0.1
returns
0
Oct 00 Aug 01 May 02 Mar 03
Jan 04
time
Fig. 8. Left Column. Probabilities of Each Expert as a Function of the Returns of the DJI. Right Column. Probabilities of Each Expert as a Function of Time.
models. An extension to our approach considers other competitors or experts like the SV models of Jacquier, Polson, and Rossi (1994). This leads into MCMC algorithms combining mixture modeling approaches with Forward filtering backward simulation. These extensions are part of future research.
NOTE 1. The code to fit the models used for this section is available under request from
[email protected]. Also, this code can be downloaded from http://www.stat.unm. edu/avhstat.
296
ALEJANDRO VILLAGRAN AND GABRIEL HUERTA
ACKNOWLEDGMENTS We wish to express our thanks to Professors Thomas Fomby and Carter Hill, editors of this volume, for all their considerations about this paper. During the preparation of this paper, A. Villagran was partially supported by CONACyT-Mexico grant 159764 and by The University of New Mexico.
REFERENCES Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31, 307–327. Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50, 987–1007. Engle, R. F. (1995). ARCH selected readings. New York: Oxford University Press. Huerta, G., Jiang, W., & Tanner, M. A. (2001). Mixtures of time series models. Journal of Computational and Graphical Statistics, 10, 82–89. Huerta, G., Jiang, W., & Tanner, M. A. (2003). Time series modeling via hierarchical mixtures. Statistica Sinica, 13, 1097–1118. Jacquier, E., Polson, N., & Rossi, P. (1994). Bayesian analysis of stochastic volatility models. Journal of Business & Economic Statistics, 12, 371–389. Jordan, M., & Jacobs, R. (1994). Hierarchical mixture of experts and the EM algorithm. Neural Computation, 6, 181–214. Melino, A., & Turnbull, S. M. (1990). Pricing foreign currency options with stochastic volatility. Journal of Econometric, 45, 239–265. Nelson, D. B. (1991). Conditional heteroskedasticity in asset returns. Econometrica, 59, 347–370. Peng, F., Jacobs, R. A., & Tanner, M. A. (1996). Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. Journal of the American Statistical Association, 91, 953–960. Tanner, M. A. (1996). Tools for statistical inference (3rd ed.). New York: Springer-Verlag. Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–550. Tsay, R. S. (2002). Analysis of financial time series. New York: Wiley. Villagran, A. (2003). Modelos Mezcla para Volatilidad. Unpublished MSc. thesis, Universidad de Guanajuato, Mexico. Vrontos, D., Dellaportas, P., & Politis, D. N. (2000). Full Bayesian inference for GARCH and EGARCH models. Journal of Business & Economic Statistics, 18, 187–197. Wong, Ch. S., & Li, W. K. (2001). On a mixture autoregressive conditional heteroscedastic model. Journal of the American Statistical Association, 96, 982–995.
A MODERN TIME SERIES ASSESSMENT OF ‘‘A STATISTICAL MODEL FOR SUNSPOT ACTIVITY’’ BY C. W. J. GRANGER (1957) Gawon Yoon ABSTRACT In a brilliant career spanning almost five decades, Sir Clive Granger has made numerous contributions to time series econometrics. This paper reappraises his very first paper, published in 1957 on sunspot numbers.
1. INTRODUCTION In 1957, a young lad from the Mathematics Department of Nottingham University published a paper entitled ‘‘A Statistical Model for Sunspot Activity’’ in the prestigious Astrophysical Journal, published nowadays by the University of Chicago Press for the American Astronomical Society. The paper showed that sunspot numbers could be well represented by a simple statistical model ‘‘formed by the repetition of a fixed basic cycle varying only in magnitude and phase.’’ The paper was received by the editorial office of the Journal on the author’s 22nd birthday and is still cited, see for instance Econometric Analysis of Financial and Economic Time Series/Part B Advances in Econometrics, Volume 20, 297–314 Copyright r 2006 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(05)20031-2
297
298
GAWON YOON
Reichel, Thejll, and Lassen (2001) and Carozza and Rampone (2001), even though the author himself stopped working on the sunspot numbers quite sometime ago. The author had just received a bachelor’s degree in mathematics and was an assistant lecturer at the University while he was still a Ph.D. student there. He did not know that there was a submission fee to the Journal, which was kindly waived by its editorial office.1 The author of the paper is Clive W. J. Granger, a Nobel Memorial Prize winner in Economic Science in 2003. His contributions to time series econometrics are numerous and far-reaching, and they are not easy to summarize in brief; see some of his work collected in Granger (2001) as well as about a dozen of his books. Hendry (2004) provides a useful summary of Granger’s major contributions to time series analysis and econometrics.2 From an interview conducted by Peter Phillips for Econometric Theory in 1997, one may gather that the young Granger was initially interested in meteorology; see Granger (1997). Interestingly, his second published paper in 1959 was concerned with predicting the number of floods of a tidal stretch; see Granger (1959). However, most of his later work is related to economic data, especially time series data. It is not yet widely acknowledged that Granger’s early work on sunspots had also led to an important theoretical development in economics. Karl Shell, who formulated the concept of sunspot equilibrium with David Cass, notes during his visit to Princeton that ‘‘Clive gave me some of his technical reports, which contained reviews of time-series literature on the possible economic effects of real-world sunspots a` la Jevons. Perhaps this was the seed that germinated into the limiting case of purely extrinsic uncertainty, i.e., stylized sunspots a` la Cass-Shell.’’ (See Shell, 2001).3 Granger (1957) proposed a simple two-parameter mathematical model of sunspot numbers, with an amplitude factor and an occurrence of minima, respectively. He reported that his model accounted for about 85% of the total variation in the sunspot data. He also made several interesting observations on sunspot numbers. It is the aim of this paper to determine if the statistical model he proposed stands the test of the passage of time with the extended sunspot numbers that span beyond the sample period he considered.4 In the next Section, a brief discussion on sunspot data is provided.
2. DATA ON SUNSPOT NUMBERS Sunspots affect the weather as well as satellite-based communications. They are also suspected of being responsible for global warming; see Reichel et al.
A Modern Time Series Assessment
299
(2001). Sunspot numbers have been recorded for more than 300 years and are known to be very difficult to analyze and forecast. For instance, Tong (1990) notes that they ‘‘reveal an intriguing cyclical phenomenon of an approximate 11 year period which has been challenging our intellecty.’’ Efforts in solar physics have been made over the years to understand the causes of sunspot cycles. For instance, Hathaway, Nandy, Wilson, and Reichman (2003, 2004) recently show that a deep meridional flow, a current of gas supposed to travel from the poles to the equator of the Sun at a depth of about 100,000 km,5 sets the sunspot cycle. Given that various theoretical models for sunspot cycles are still in their early stages of development, time series models appear to be a natural alternative in modeling sunspot numbers. Indeed, various time series models are already applied to them, such as linear autoregressive, threshold, and bilinear models, among many others. In fact, sunspot numbers were the first data to be analyzed by autoregressive models: in 1927, Yule fitted an AR(2) model to sunspot data. See also Section 7.3 of Tong (1990) for some estimation results from various time series models. Granger also came back to the sunspot numbers later in Granger and Andersen (1978), using his bilinear model. Forecasting sunspot cycles is not easy; for instance, see Morris (1977) and Hathaway, Wilson, and Reichman (1999). In the remainder of this Section, the sunspot data used in this paper will be presented and compared with those employed in Granger (1957).6 G(57) used both annual and monthly sunspot numbers from 1755 to 1953, available from Terrestrial Magnetism and Atmospheric Electricity and the Journal of Geophysical Research. The annual sunspot numbers he used are listed in his Table 2. G(57) also contained a brief discussion on how the sunspot numbers are constructed. In this paper, the extended annual sunspot numbers are employed, which are available from the Solar Influences Data Analysis Center (SIDC), http://sidc.oma.be/DATA/yearssn.dat, for the sample period of 1700–2002.7 Figure 1 shows the annual sunspot numbers. The variation in amplitude and period can easily be seen. It turns out that for the sample period of 1755–1953, the extended annual sunspot numbers are not the same as those used in G(57). Figure 2 shows the differences between the two sets of sunspot numbers. The biggest difference occurs in 1899; 12.1 in this paper and 21.1 in G(57). The discrepancy is most likely due to a type-in error. There is also a minor error in G(57) in calculating the mean for the cycle that started in 1856; in his Table 2, the mean should be 49.6, not 51.7. As this wrong mean value was employed in the ensuing calculations, the results used in G(57) for his Figs. 1 and 2 are not correct;
300
GAWON YOON
Fig. 1.
Fig. 2.
Annual Sunspot Numbers, 1700–2002.
Differences Between the Two Sets of Data, 1755–1953.
301
A Modern Time Series Assessment
Table 1. Sunspot Cycles with Annual Observations. Cycle No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
Year of Minimum
Period (Years)
Mean
From Granger (1957)
1700 1712 1723 1733 1744 1755 1766 1775 1784 1798 1810 1823 1833 1843 1856 1867 1878 1889 1901 1913 1923 1933 1944 1954 1964 1976 1986 1996
12 11 10 11 11 11 9 9 14 12 13 10 10 13 11 11 11 12 12 10 10 11 10 10 12 10 10 NA
18.3 29.5 54.1 51.5 40.1 42.4 59.9 68.2 60.45 23.8 18.1 39.0 65.3 53.7 49.6 56.6 34.6 38.8 31.1 44.2 41.0 55.0 75.7 95.0 58.8 82.9 78.5 NA
42.4 60.0 68.2 60.45 23.8 18.1 38.9 65.2 53.5 51.7 [49.6] 56.6 34.6 39.5 31.1 44.0 40.9 54.8 75.7
NA: Not applicable. The first cycle is assumed to start in 1700, which is the first year the sunspot numbers became available. Granger (1957) instead called the cycle that started in 1755 Cycle 1. The last cycle, Cycle 28, is not yet completed. The number in the square brackets, 49.6, is the correct mean, not 51.7, which was originally reported in Granger (1957).
but the effect is only minor. Compare also the last two columns of Table 1 below: Other than the error for the mean of the cycle that started in 1856, the differences in means are due to the fact that the two sunspot number data sets are not the same. The main empirical results of this paper are presented in the next Section.
302
GAWON YOON
3. MAIN EMPIRICAL RESULTS G(57) made several interesting observations on sunspot activity and provided supporting evidence. In this Section, empirical regularities discussed in G(57) are reappraised with the extended sunspot numbers. Of course, properties of sunspot cycles other than those discussed in G(57) are also available in the literature; see Hathaway, Wilson, and Reichman (1994). The organization of this Section follows closely that of G(57) to make comparison easier.
3.1. Varying Periods In this paper, the sunspot cycle that started in 1700, the first year in which the annual sunspot numbers become available, will be called Cycle 1. G(57) instead called the cycle that started in 1755 Cycle 1. In Table 1, the date for the start of sunspot cycles, their periods, and the two means for the two different sunspot numbers are listed for the 27 (and 18) completed cycles. Table 1 shows that periods are changing, as already noted in G(57). The mean of periods is 11.0 years for the extended sunspot numbers, with a standard deviation of 1.2 years. For the sunspot numbers used in G(57), they are 11.1 and 1.4 years, respectively.
3.2. The Basic Cycle G(57) assumed that sunspot cycles are ‘‘made up of a basic cycle, each multiplied by an amplitude factor plus a random element with zero mean.’’ See Eq. (4) for his model. The basic cycle was determined as follows: ‘‘The mean of the sunspot numbers for the years making up the cycle was found, and all the yearly intensities were divided by this mean. The new numbers were then averaged for the minimal year, the year after the minimum, and so forth over all the cycle’’ (p. 153).8 Figure 3 shows the newly transformed annual sunspot numbers. They display similar amplitudes, except for Cycle 1. Figure 4 displays the basic cycles from the two different sunspot number data sets, with the time from the minimum on the horizontal axis.9 The two basic cycles are very close to each other, except at 5 years after the minimum. For the extended sunspot data, the basic cycle reaches a maximum in 5 years after the minimum, not in 4 years as in the sunspot data used in G(57). However, it is still true that it takes less time to get to the maximum from the minimum than from the maximum to the next minimum,
A Modern Time Series Assessment
Fig. 3.
Fig. 4.
Transformed Annual Sunspot Numbers, 1700–1995.
The Basic Cycle, Corresponding to Fig. 1 in Granger (1957).
303
304
GAWON YOON
as already observed in G(57). G(57) also proposed the following functional form to approximate the basic curve: g ¼ t2:67 e1:73
0:63t
(1)
where t denotes the time from the minimum and g the intensity. (1) belongs to the model considered by Stewart and Panofsky (1938) and Stewart and Eggleston (1940).10 Figure 4 shows that (1) fits the basic curves rather well, especially over the middle range. 3.3. Long-Term Fluctuations G(57) also examined the cycle means over time. The real line in Fig. 5 shows the means of each cycle for the sunspot numbers employed in G(57).11 It appears that the means tend to increase, especially for the later cycles. In Fig. 5, the following relation representing a long-term fluctuation is also superimposed: 360 t þ 7:25 (2) 46 þ 19 sin 87
Fig. 5.
Means of Cycle 6–23, Corresponding to Fig. 2 in Granger (1957).
305
A Modern Time Series Assessment
Fig. 6.
Means of Cycle 1–27.
as proposed in G(57). The relation (2) seems to track the changes in means of sunspot cycles rather well, except for the later cycles. The correlation between the two curves plotted in Fig. 5 is 0.52, which is slightly lower than that reported in G(57). In Fig. 6, the cycle means are plotted with a real line for the extended sunspot numbers. Clearly, the means of Cycles 24–27 become higher. Therefore, it easily follows that the model of long-term fluctuation in the form of (2) would perform less well for the extended sunspot number data. For instance, the following model of long-term sunspot fluctuation is also plotted in Fig. 6, with a broken line: 360 46 þ 19 sin t þ 10:0 (3) 87 which produces a correlation of only 0.18 with the basic curve.12 3.4. Goodness of Fit of the Model The model G(57) proposed is of the form for sunspot activities f ðxÞ gðxÞ þ x
(4)
306
GAWON YOON
where f(x) is the amplitude factor, shown in Figs. 5 and 6, g(x) the basic curve in Fig. 4, and ex a zero-mean random variable.13 The two-parameter model of sunspot activity accounted for about 86% of the total variation for the data G(57) used. For the extended sunspot data with 27 completed cycles, the fit of his model is still high, explaining about 82% of the total variation. Hence, the model Granger proposed in 1957 still appears to be effective in accounting for the fluctuations in the extended sunspot numbers as well as in the data series he employed.
3.5. Additional Observations G(57) noted that mean solar activity is negatively correlated with sunspot cycle length: ‘‘the cycle is often shorter, the higher the mean or peak amplitude.’’ He employed a 2 2 contingency table to test the claim. In this paper, a simple linear regression model will be tested instead. Figure 7 shows that there is indeed an inverse relation between the mean and the period (in years) of a cycle.14 For instance, the following estimation result is found with OLS for the 27 cycles from the extended sunspot data: mean ¼ 131 ð31Þ
7:33 period ð2:81Þ
with R2 ¼ 0:21: Standard errors are reported in the parentheses. The estimated coefficient associated with period is significant at conventional significance levels. The fitted line is also shown in Fig. 7. For the sunspot numbers used in G(57), the results are mean ¼ 98
ð28Þ
4:53 period ð2:44Þ
with R2 ¼ 0:17 and sample size ¼ 18. Additionally, using the functional specification discussed in G(57), the following estimation results are obtained. For the extended sunspot data, mean ¼
38:5 þ 966 period 1 ; R2 ¼ 0:25
ð31:4Þ
ð339Þ
For the sunspot data used in G(57), mean ¼
10:2 þ 630 period 1 ; R2 ¼ 0:21
ð28:5Þ
ð308Þ
It should also be added, however, that there is only a very weak relation between the mean and the period for the monthly data, to be discussed below.
A Modern Time Series Assessment
Fig. 7.
307
Scatter Diagram Between Cycle Period and Mean of 27 Cycles.
G(57) contained further discussions on the sunspot numbers. The following results are obtained from monthly sunspot numbers, which are available at http://sidc.oma.be/index.php3. More specifically, smoothed monthly sunspot numbers for the sample period of 1749:07–2003:05 are used in this paper.15 Figure 8 plots the monthly sunspot numbers. Clearly, they are less smooth than the annual ones. Table 2 shows the starting date for each sunspot cycle with its period in months and mean sunspot number. The mean and standard deviation of periods are 131.5 and 14.3 months, respectively, for the 22 completed cycles. Unfortunately, the monthly sunspot numbers used in G(57) are not available to this author. G(57) noted that ‘‘the peak came later for cycles of small mean as compared to cycles with larger means.’’ This relation is known in the literature as the Waldmeier effect. G(57) defined peak by the time in months between the minimum and the maximum of a cycle divided by its length. Figure 9 shows a scatter plot of the peak and mean of sunspot cycles; note that the first cycle observed for the monthly sunspot numbers is Cycle 6, as listed in Table 2. Figure 9 seems to indicate a negative relation between the peak and mean of sunspot cycles. For instance, for the 22 completed sunspot cycles from the monthly observations, the following estimation results are found
308
GAWON YOON
Fig. 8.
Monthly Sunspot Numbers, 1749:07–2003:05.
with OLS: peak ¼ 0:53 ð0:04Þ
0:0027 mean ð0:00079Þ
R2 ¼ 0:36: However, the results seem to be heavily influenced by the observations from earlier sunspot cycles. For instance, Fig. 10 shows that earlier cycles have wider fluctuations in peak . Jagger (1977) contains a brief discussion on the quality of the sunspot numbers, and notes that the current definition of sunspot numbers dates only from 1849 and that the sunspot numbers are fairly subjective.16 When only the recent 15 sunspot cycles are utilized, there appears to be little relation between the mean and peak of sunspot cycles as the following estimation result demonstrates: peak ¼ 0:40 ð0:04Þ
0:0006 mean ð0:0007Þ
R2 ¼ 0:06: Additionally, most of the peak values are less than 0.5, indicating asymmetry in the sunspot cycles, with the rise to maximum being faster than the fall to minimum. This asymmetry is well documented in the sunspot literature.
309
A Modern Time Series Assessment
Table 2. Cycle No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
Sunspot Cycles with Monthly Observations. Month of Minimum
Period (Months)
Mean
1755:03 1766:06 1775:06 1784:09 1798:05 1810:12 1823:05 1833:11 1843:07 1855:12 1867:03 1878:12 1890:03 1902:02 1913:08 1923:08 1933:09 1944:02 1954:04 1964:10 1976:03 1986:09 1996:05
135 108 111 164 151 149 126 116 149 135 141 135 143 138 120 121 125 122 126 137 126 116 NA
41.7 50.7 67.1 75.8 28.4 18.0 20.6 60.3 64.3 48.5 61.7 34.4 41.2 30.0 33.4 43.3 47.8 66.9 91.9 70.8 67.8 81.4 NA
NA: Not applicable. Monthly observations are available from 1749:07 to 2003:05. The last cycle, Cycle 28, is not yet completed.
Finally, in Fig. 11, a scatter plot between the peak and length of sunspot cycle is displayed for the monthly observations.17 G(57) noted that ‘‘The various points appear to lie on two distinct parabolasy’’ which suggests ‘‘two types of cycle, occurring randomly.’’ However, it may be added that extreme observations in Fig. 11 are mainly from earlier sunspot cycles, which were potentially contaminated with measurement errors. With 15 recent sunspot cycles only, there appears to be little relation between the peak and length of sunspot cycles.
310
Fig. 9.
GAWON YOON
Scatter Diagram Between Cycle Mean and Peak from Monthly Data.
Fig. 10.
Plot of Peak for Each Cycle from Monthly Data.
A Modern Time Series Assessment
Fig. 11.
311
Scatter Diagram Between Length of a Cycle and Peak from the Monthly Data, Corresponding to Fig. 3 in Granger (1957).
3.6. Forecasting Sunspot Numbers G(57) offered no predictions for future sunspot values. Are these characterizations of sunspot cycles discussed in G(57) useful in forecasting the sunspot numbers? Because the two parameters in his model, the amplitude factor and the occurrence of minima, are hard to predict, G(57) noted that ‘‘It is thus doubtful that long-term prediction will be possible until a better understanding of the various solar variables is gained.’’ However, once a cycle has begun, improved short-term sunspot forecasts would be possible. Forecasting of sunspot cycles is discussed in Morris (1977) and Hathaway et al. (1999), for instance. See also references cited therein.
4. CONCLUDING REMARKS Clive Granger has permanently changed the shape of time series econometrics in his brilliant career spanning almost five decades. He has always been interested in problems empirically relevant to economic time series. In this paper, his very first published paper on sunspot numbers is reappraised
312
GAWON YOON
with extended sunspot numbers. The sunspot data are known to be very hard to analyze and forecast. His paper contained various interesting observations on sunspot cycles and is still cited in the literature. Whereas some of his findings appear to depend on particular earlier observations, which were possibly measured differently from later ones, most of his findings in his first published work still appear to hold with the extended sunspot numbers. His simple two-parameter mathematical model still accounts for more than 80% of the total variation in the extended sunspot data.
NOTES 1. The leading footnote of the paper reads that the paper was ‘‘Published with the aid of a grant from the American Astronomical Society.’’ 2. The above-mentioned paper by Reichel et al. (2001) is unique in that, in addition to citing the first published work by Granger, it also tests for Grangercausality between the solar cycle length and the cycle mean of northern hemisphere land air temperature. 3. Granger was visiting Princeton in 1959–1960, holding a Commonwealth Fellowship of the Harkness Funds. He also spent three subsequent summers in Princeton. 4. To be precise, 27 sunspot cycles are used here from the extended sunspot numbers, compared to only 18 cycles in Granger (1957) for annual sunspot numbers. For monthly sunspot observations, four additional sunspot cycles are available in this paper. See Tables 1 and 2 below for more details. 5. This description of the deep meridional flow is from the Economist (2003). 6. For efficiency of exposition, further references to Granger (1957) will be designated as G(57). 7. Monthly sunspot numbers are also analyzed below. See Section 3 for more details. 8. The averaging was accomplished with arithmetic means. The yearly intensities in the quote denote the original sunspot numbers. See his Table 2 for more details. 9. The Figure corresponds to Fig. 1 in G(57). 10. Different functional forms are suggested for the basic curve in Elling and Schwentek (1992) and Hathaway et al. (1994). 11. The means are listed in the last column in Table 1, above. The Figure corresponds to Fig. 2 in G(57). 12. The expression in (3) for the long-term sunspot fluctuations is based on experiments, and no effort has been made to obtain an improved fit. 13. The notations in (4) are from G(57), in which the same argument x for both f and g was used. However, it appears that the notations should be more carefully expressed. The argument for function g should be the time from the minimum, as in (1), and the argument for function f changes for different sunspot cycles. 14. For ease of exposition, cycle numbers are also recorded in the Figure.
A Modern Time Series Assessment
313
15. Smoothing is performed at the source with the 13-month running mean. If Rn ¯n ¼ is P the sunspot P number for month n, the 13-month running mean is R 5 6 1 ð R þ R Þ: nþi nþi i¼ 6 i¼ 5 24 16. Cycles post-1856 are understood to fall within what is known as the modern era of sunspot cycles. 17. The Figure corresponds to Fig. 3 in G(57).
ACKNOWLEDGMENTS I would like to thank Sir Clive Granger, Tae-Hwan Kim, and Hee-Taik Chung for comments, and David Hendry and Jinki Kim for some references. An earlier version of this paper was presented at the ‘‘Predictive Methodology and Application in Economics and Finance’’ conference held in honor of Clive Granger at the University of California, San Diego, on January 6–7, 2004. The title of this paper was suggested to me by Norm Swanson. This research was supported by a Pusan National University Research Grant. Finally, some personal notes. I have had the privilege of working with Clive Granger for a long time, ever since I was a graduate student at UCSD. I gratefully appreciate his advice and encouragement over the years. I always learn a lot from him. On a cold winter night in London on December 6, 2002, he told me about his experience with his first publication. This paper is my small gift to him.
REFERENCES Carozza, M., & Rampone, S. (2001). An incremental multivariate regression method for function approximation from noisy data. Pattern Recognition, 34, 695–702. Economist. (2003). Sunspots: Staring at the Sun. Economist, 367(June 28), 116–117. Elling, W., & Schwentek, H. (1992). Fitting the sunspot cycles 10–21 by a modified F-distribution density function. Solar Physics, 137, 155–165. Granger, C. W. J. (1957). A statistical model for sunspot activity. Astrophysical Journal, 126, 152–158. Granger, C. W. J. (1959). Estimating the probability of flooding on a tidal river. Journal of the Institution of Water Engineers, 13, 165–174. Granger, C. W. J. (1997). The ET interview, Econometric Theory, 13, 253–303 (interviewed by P. C. B. Phillips). Granger, C. W. J. (2001). Essays in econometrics. In: E. Ghysels, N. R. Swanson, & M. W. Watson (Eds), Collected papers of Clive W. J. Granger (Two Vols.), Econometric Society Monographs. Cambridge: Cambridge University Press. Granger, C. W. J., & Andersen, A. P. (1978). Introduction to bilinear time series models. Go¨ttingen: Vandenhoeck & Ruprecht.
314
GAWON YOON
Hathaway, D. H., Nandy, D., Wilson, R. M., & Reichman, E. J. (2003). Evidence that a deep meridional flow set the sunspot cycle period. Astrophysical Journal, 589, 665–670. Hathaway, D. H., Nandy, D., Wilson, R. M., & Reichman, E. J. (2004). Erratum: Evidence that a deep meridional flow set the sunspot cycle period. Astrophysical Journal, 602, 543. Hathaway, D. H., Wilson, R. M., & Reichman, E. J. (1994). The shape of the sunspot cycle. Solar Physics, 151, 177–190. Hathaway, D. H., Wilson, R. M., & Reichman, E. J. (1999). A synthesis of solar cycle prediction techniques. Journal of Geophysical Research, 104, 22375–22388. Hendry, D. F. (2004). The Nobel Memorial Prize for Clive W. J. Granger. Scandinavian Journal of Economics, 106, 187–213. Jagger, G. (1977). Discussion of M. J. Morris ‘‘Forecasting the sunspot cycle’’. Journal of the Royal Statistical Society, Series A, 140, 454–455. Morris, M. J. (1977). Forecasting the sunspot cycle. Journal of the Royal Statistical Society, Series A, 140, 437–468. Reichel, R., Thejll, P., & Lassen, K. (2001). The cause-and-effect relationship of solar cycle length and the Northern hemisphere air surface temperature. Journal of Geophysical Research-Space Physics, 106, 15635–15641. Shell, K. (2001). MD interview: An interview with Karl Shell. Macroeconomic Dynamics, 5, 701–741 (interviewed by Stephen E. Spear and Randall Wright). Stewart, J. Q., & Eggleston, F. C. (1940). The mathematical characteristics of sunspot variations II. Astrophysical Journal, 91, 72–82. Stewart, J. Q., & Panofsky, H. A. A. (1938). The mathematical characteristics of sunspot variations. Astrophysical Journal, 88, 385–407. Tong, H. (1990). Non-linear time series. A dynamic system approach, Oxford: Clarendon Press.
PERSONAL COMMENTS ON YOON’S DISCUSSION OF MY 1957 PAPER Sir Clive W. J. Granger, KB There is a type of art that is known as ‘‘naive,’’ which can be very simplistic and have a certain amount of charm. Reading about my own work on sunspots published in 1957 also brings to mind the description ‘‘naive.’’ I was certainly naive in thinking that I could undertake a piece of simple numerical analysis and then send it off to a major journal. The world of empirical research was then very simplistic as we had no computer at the University of Nottingham where I was employed, and all calculations had to be done on an electronic calculator and all plots by hand. It was clear that if monthly data for sunspots had been available, I would have been overwhelmed by the quantity of data! Trying to plot by hand nearly 200 years of monthly data is a lengthy task! The computing restrictions also limited the types of model that could be considered. When I started the analysis I was employed as a statistician in the Department of Mathematics and so any data was relevant. I was intrigued by the sunspot series for several reasons. It was a long sequence with interesting features and seemed to be of interest to others. I also did not understand why it had been gathered. I was told it had been gathered by Swiss monks for a couple of hundred years, but no reason for their activity was given.
Econometric Analysis of Financial and Economic Time Series/Part B Advances in Econometrics, Volume 20, 315–316 Copyright r 2006 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(05)20032-4
315
316
SIR CLIVE W. J. GRANGER, KB
The obvious way to evaluate a model is to use an extended data set, as Gawon Yoon has done. However, more recent data are gathered in a different, more reliable manner, and so the error-rate on the series may have fallen. It is unclear to me how a ‘‘sunspot’’ is actually measured as it surely has indistinct edges and a different side of the sun will face us every month and year. I suppose that the 11-year cycle could be because some sides of the sun are more active than others, and the sun rotates every 11 years. I have absolutely no idea if this is correct; in fact, the analysis of sunspot numbers by statisticians is poor empirical analysis with no physical or solar theory used to form the model. How different this is than our econometrics models when economic theory provides some structure in many cases. Not only do ‘‘sunspots’’ have an interesting place in economic theory thanks to the analysis of Cass and Shell (1983), among many, but they are also outliers in Physics. I would suppose Physics is now dominated by the idea that the world is deterministic (apart from the quantum domain) and so relies heavily on chaos theory and models. I surmise that those models cannot fit sunspot data – even the recent data – without substantial errors. I think that returning to the occasional old analysis may be a worthwhile pursuit. I believe that my 1957 paper will prove to be not the best example available to provide lessons for current research. I would like to thank Gawon Yoon for his paper and work.
REFERENCE Shell, K., & Cass, D. (1983). Do sunspots matter? Journal of Political Economy, 91, 193–227.
A NEW CLASS OF TAIL-DEPENDENT TIME-SERIES MODELS AND ITS APPLICATIONS IN FINANCIAL TIME SERIES Zhengjun Zhang ABSTRACT In this paper, the gamma test is used to determine the order of lag-k tail dependence existing in financial time series. Using standardized return series, statistical evidences based on the test results show that jumps in returns are not transient. New time series models which combine a specific class of max-stable processes, Markov processes, and GARCH processes are proposed and used to model tail dependencies within asset returns. Estimators for parameters in the models are developed and proved to be consistent and asymptotically joint normal. These new models are tested on simulation examples and some real data, the S&P 500.
1. INTRODUCTION Tail dependence – which is also known as asymptotic dependence or extreme dependence – exists in many applications, especially in financial Econometric Analysis of Financial and Economic Time Series/Part B Advances in Econometrics, Volume 20, 317–352 Copyright r 2006 by Elsevier Ltd. All rights of reproduction in any form reserved ISSN: 0731-9053/doi:10.1016/S0731-9053(05)20033-6
317
318
ZHENGJUN ZHANG
time series analysis. Not taking this dependence into account may lead to misleading results. Towards a solution to the problem, we show statistical evidences of tail dependencies existing in jumps in returns from standardized asset returns, and propose new nonlinear time series models which can be used to model serially tail dependent observations. A natural way of modeling tail dependencies is to apply extreme value theory. It is known that the limiting distributions of univariate and multivariate extremes are max-stable, as shown by Leadbetter, Lindgren and Rootzen (1983) in the univariate case and Resnick (1987) in the multivariate case. Max-stable processes, introduced by de Haan (1984), are an infinite-dimensional generalization of extreme value theory, and they do have the potential to describe clustering behavior and tail dependence. Parametric models for max-stable processes have been considered since the 1980s. Deheuvels (1983) defines the moving minima (MM) process. Davis and Resnick (1989) study what they call the max-autoregressive moving average (MARMA) process of a stationary process. For prediction, see also Davis and Resnick (1993). Recently, Hall, Peng, and Yao (2002) discuss moving maxima models. In the study of the characterization and estimation of the multivariate extremal index, introduced by Nandagopalan (1990, 1994), Smith and Weissman (1996) extend Deheuvels’ definition to the so-called multivariate maxima of moving maxima (henceforth M4) process. Like other existing nonlinear time series models, motivations of using max-stable process models are still not very clear. First, real data applications of these models are yet becoming available. Also, statistical estimation of parameters in these models is not easy due to the nonlinearity of the models and the degeneracy of the joint distributions. A further problem is that a single form of moving maxima models is not realistic since the clustered blowups of the same pattern will appear an infinite number of times as shown in Zhang and Smith (2004). In our paper, we explore the practical motivations of applying certain classes of max-stable processes, GARCH processes and Markov processes. In time series modeling, a key step is to determine a workable finite dimensional representation of the proposed model, i.e. to determine statistically how many parameters have to be included in the model. In linear time series models, such as AR(p), MA(q), auto-covariance functions or partial auto-covariance functions are often used to determine this dimension, such as the values of p and q. But in a nonlinear time-series model, these techniques may no longer be applicable. In the context of maxstable processes, since the underlying distribution has no finite variance, the
A New Class of Tail-dependent Time-Series Models
319
dimension of the model can not be determined in the usual manner. We use the gamma test and the concept of lag-k tail dependence, introduced by Zhang (2003a), to accomplish the task of model selection. Comparing with popular model selection procedures, for example Akaike’s (1974) Information Criteria (AIC), Schwarz Criteria (BIC) (1978), our procedure first determines the dimension of the model prior to the fitting of the model. In this paper, we introduce a base model which is a specific class of maxima of moving maxima processes (henceforth M3 processes) as our newly proposed nonlinear time-series model. This base model is mainly motivated through the concept of tail dependencies within time series. Although it is a special case of the M4 process, the motivation and the idea are new. This new model not only possesses the properties of M4 processes, which have been studied in detail in Zhang and Smith (2004), but statistical estimation turns out to be relatively easy. Starting from the basic M3 process at lag-k, we further improve the model to allow for possible asymmetry between positive and negative returns. This is achieved by combining an M3 process with an independent two state ({0,1}) Markov chain. The M3 process handles the jumps, whereas the Markov chain models the sign changes. In a further model, we combine two independent M3 processes, one for positive and negative within each, with an independent three state Markov chain ({ 1, 0, 1}) governing the sign changes in the returns. In Section 2, we introduce the concepts of lag-k tail dependence and the gamma test. Some theoretical results are listed in that section; proofs are given in Section 9. In Section 4, we introduce our proposed model and the motivations behind it. The identifiability of the model is discussed. The tail dependence indexes are explicitly derived under our new model. In Section 5, we combine the models introduced in Section 4 with a Markov process for the signs. In Section 7, we apply our approach to the S&P 500 index. Concluding remarks are listed in Section 8. Finally, Section 9 continues the more technical proofs.
2. THE CONCEPTS OF LAG-K TAIL DEPENDENCE AND THE GAMMA TEST Sibuya (1960) introduces tail independence between two random variables with identical marginal distributions. De Haan and Resnick (1977) extend it to the case of multivariate random variables. The definition of tail
320
ZHENGJUN ZHANG
independence and tail dependence between two random variables is given below. Definition 2.1. A bivariate random variable (X1, X2) is called tail independent if l ¼ lim PðX 1 4ujX 2 4uÞ ¼ 0, u!xF
(2.1)
where X1 and X2 are identically distributed with xF ¼ supfx 2 R : PðX 1 xÞo1g; l is also called the bivariate tail dependence index which quantifies the amount of dependence of the bivariate upper tails. If l40, then (X1, X2) is called tail dependent. If the joint distribution between X1 and X2 is known, we may be able to derive the explicit formula for l. For example, when X1 and X2 are normally distributed with correlation rA(0,1) then, l ¼ 0: When X and Y have a standard bivariate t-distribution with n degrees pffiffiffiffiffiffiffiffiffiffiffip ffiffiffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffiffiffiof freedom and correlation r4 1 then, l ¼ 2¯tnþ1 ð n þ 1 1 r= 1 þ rÞ; where ¯tnþ1 is the tail of standard t distribution. Embrechts, McNeil, and Straumann (2002) give additional cases where the joint distributions are known. Zhang (2003a) extends the definition of tail dependence between two random variables to lag-k tail dependence of a sequence of random variables with identical marginal distribution. The definition of lag-k tail dependence for a sequence of random variables is given below. Definition 2.2. A sequence of sample fX 1 ; X 2 ; :::; X n g is called lag-k tail dependent if lk ¼ lim PðX 1 4ujX kþ1 4uÞ40; lim PðX 1 4ujX kþj 4uÞ ¼ 0; j41, (2.2) u!xF
u!xF
where xF ¼ supfx 2 R : PðX 1 xÞo1g; lk is called lag-k tail dependence index. Here we need to answer two questions: the first is whether there is tail dependence; the second is how to characterize the tail dependence index. The first question can be put into the following testing of hypothesis problem between two random variables X1 and X2: H0 : X 1
and
X2
are tail independent
2H 1 : X 1 and X 2 are tail dependent;
ð2:3Þ
which can also be written as H 0 : l ¼ 0 vs: H 1 : l40.
(2.4)
321
A New Class of Tail-dependent Time-Series Models
To characterize and test tail dependence is a difficult exercise. The main problem for computing the value of the tail dependence index l is that the distribution is unknown in general; at least some parameters are unknown. Our modeling approach first starts with the gamma test for tail (in)dependence of Zhang (2003a) which also demonstrates that the gamma test efficiently detects tail (in)dependencies at high threshold levels for various examples. The test goes as follows. Let ! X 1; X 2; . . . ; X n (2.5) Y 1; Y 2; . . . ; Y n be an independent array of unit Fre´chet random variables which have distribution function F ðxÞ ¼ expð 1=xÞ; x40: Now let (Ui, Qi), i ¼ 1, ... ,n be a bivariate random sequence, where both Ui and Qi are correlated and have support over (0,u] for a typically high threshold value u. Let X ui ¼ X i I fX i 4ug þ U i I fX i ug ; Y ui ¼ Y i I fY i 4ug þ Qi I fX i ug ; i ¼ 1; . . . ; n: Then ! ! ! X un X u2 X u1 (2.6) ;...; ; Y un Y u2 Y u1 is a bivariate random sequence drawn from two dependent random variables Xui and Yui. Notice that X ui I fX ui 4ug ð¼ X i I fX i 4ug Þ and Y ui I fY ui 4ug ð¼ Y i I fY i 4ug Þ are independent; but X ui I fX ui ug ð¼ U i I fX i ug Þ and Y ui I fY ui ug ð¼ Qi I fY i ug Þ are dependent. Consequently, if only tail values are concerned, we can assume the tail values are drawn from (2.5) under the null hypothesis of tail independence. We have the following theorem. Theorem 2.3. Suppose Vi and Wi are exceedance values (above u) in (2.5). Then ( t t u þ Wi 1þt 1þt e t ¼ P t t u þ Vi 1þt þ 1þt e
ð1þtÞ=u
;
ð1þtÞ=u
if
; if
lim Pðn 1 ½maxðu þ W i Þ=ðu þ V i Þ þ 1 xÞ ¼ e
n!1
in
0oto1 (2.7)
t 1; ð1 e
1=u
Þ= x
.
(2.8)
Þx .
(2.9)
Moreover, lim Pðn½minðu þ W i Þ=ðu þ V i Þ xÞ ¼ 1
n!1
in
e ð1
e
1=u
322
ZHENGJUN ZHANG
The random variables maxirn (u+Wi)/(u+Vi) and maxirn (u+Vi)/ (u+Wi) are tail independent, i.e. ðu þ W i Þ ðu þ V i Þ þ 1 x; n 1 ½max þ 1 yÞ lim Pðn 1 ½max in ðu þ W i Þ in ðu þ V i Þ 1=u 1=u ¼ e ð1 e Þ x ð1 e Þ y .
n!1
Furthermore, the random variable maxin fðu þ W i Þ ðu þ V i Þg þ maxin fðu þ V i Þ=ðu þ W i Þg Qu;n ¼ maxin fðu þ W i Þ ðu þ V i Þg maxin fðu þ V i Þ=ðu þ W i Þg
ð2:10Þ 2 (2.11) 1
is asymptotically gamma distributed, i.e. for xZ0, lim PðQu;n xÞ ¼ zðxÞ,
n!1
(2.12)
where z is gammað2; 1 e 1=u Þ distributed. A proof of Theorem 2.3. is given in Zhang (2003a). Under certain mixing conditions, Zhang (2003a) also shows that (2.11) and (2.12) hold when Vi and Wi are not exceedances from (2.5) Remark 1. If V i ¼ ðX ui (2.11) and (2.12) hold.
uÞI fX ui 4ug and W i ¼ ðY ui
uÞI fY ui 4ug ; then
Eqs. (2.11) and (2.12) together provide a gamma test statistic which can be used to determine whether tail dependence between two random variables is significant or not. When nQu,n4za, where za is the upper ath percentile of the gamma(2, 1 e 1/u) distribution, we reject the null-hypothesis of no tail dependence.1 Remark 2. In the case of testing for lag-k tail dependence, we let Y i ¼ X iþk in the gamma test statistic (2.11). If the observed process is lag-j tail dependent, the gamma test would reject the H0 of (2.3) when k ¼ j; but it would retain the H0 of (2.3) when k ¼ j þ 1 and a bivariate subsequence fðX il ; Y il Þ; l ¼ 1; . . . ; i ¼ 1; . . . g of data are used to compute the value of the test statistic, where il il-14j. In Section 4, we will introduce a procedure using local windows to accomplish this testing procedure. So far, we have established testing procedure to determine whether there exists tail dependence between random variables. Once the H0 of (2.3) is rejected, we would like to model the data using a tail dependence model. In the next section we explore how the existence of tail dependence may imply transient behaviors in the jumps of financial time series.
323
A New Class of Tail-dependent Time-Series Models
3. JUMPS IN RETURNS ARE NOT TRANSIENT: EVIDENCES The S&P 500 index has been analyzed over and over again, but models using extreme value theory to analyze S&P 500 index returns are still somehow delicate. There are several references; examples include Tsay (1999) and, Danielsson (2002). Models addressing tail dependencies within the observed processes are more difficult to find in the literature. In this section, we use the gamma test to check whether there exists tail dependence for the S&P 500 return data. Recall that, in order to use the gamma test for tail independence, the original distribution function needs to be transformed into unit Fre´chet distribution. This transformation preserves the tail dependence parameter. 3.1. Data Transformation and Standardization In Fig. 1, we plot the negative log returns of the S&P500 index. The data are from July 3, 1962 to December 31, 2002. The highest value corresponds to Negative Daily Return 1962−2003
0.2 0.15
Neg Log Return
0.1 0.05 0 -0.05 -0.1 -0.15 -0.2 07/03/62
03/11/76
11/18/89
07/28/03
S&P500
Fig. 1.
Plot of the S&P 500 Original Log Returns. The Highest Value Corresponds to October 19, 1987 Wall Street Crash.
324
ZHENGJUN ZHANG Estimated Conditional Standard Deviation Using GARCH(1,1)
0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0
0
Fig. 2.
2000
4000
6000 S&P500 data
8000
10000
12000
Estimated Volatility Plot of the S&P 500 Log Returns. GARCH(1,1) Model is used to Estimate the Volatilities.
October 19, 1987 Wall Street crash. There clearly are large moves (possibly jumps) both in the returns as well as in the volatilities. We are especially interested in the tail dependence resulting from these large moves. As GARCH models have been quite successful in modelling volatilities in financial time series, we apply a GARCH fitting to the data. A GARCH(1,1) (Bollerslev, 1986) model fitted to the data is plotted in Fig. 1. The estimated volatility process is plotted in Fig. 2. The original data, divided by the estimated standard volatility, are plotted in Fig. 3. This series is referred as the GARCH-devolatilized time series. Examples of papers going beyond this approach are for instance McNeil and Frey (2000), and Engle (2002). In this paper, we study the standardized financial time series, or GARCH residuals using our methods. We focus on the modeling of the larger jumps in the standardized returns. Notice that the largest value in the standardized series does not correspond to October 19, 1987 Wall Street crash. The larger values in the standardized series look more like extreme observations, not outliers. An example of simulated extreme process which demonstrates extreme observations is illustrated in a later section. We now turn to study data transformation methods.
A New Class of Tail-dependent Time-Series Models
325
Negative Daily Return Divided by Estimated Standard Deviation, 1982−2001 10 8 6
Neg Log Return
4 2 0 -2 -4 -6 -8 -10 07/03/62
03/11/76
11/18/89
07/28/03
S&P500
Fig. 3.
Plot of the S&P 500 Standardized Log Returns.
Methods for exceedance modeling over high thresholds are widely used in the applications of extreme value theory. The theory used goes back to Pickands (1975), below we give a brief review. For full details, see Embrechts, Klu¨ppelberg, and Mikosch (1997) and the references therein. Consider the distribution function of a random variable X conditionally on exceeding some high threshold u, we have F u ðyÞ ¼ PðX u þ yjX 4uÞ ¼
F ðu þ yÞ F ðuÞ ; y 0 1 F ðuÞ
As u ! xF ¼ supfx : F ðxÞo1g; Pickands (1975) shows that the generalized Pareto distributions (GPD) are the only non-degenerate distribution that approximate Fu(y) for u large. The limit distribution of Fu(y) is given by Gðy; s; xÞ ¼ 1
y 1=x ð1 þ x Þþ s
(3.1)
In the GPD, yþ ¼ maxðy; 0Þ; x is the tail index, also called shape parameter, which gives a precise characterization of the shape of the tail of the GPD. For the case of x40, it is long-tailed, i.e., 1 G(y) decays at rate y 1=x for
326
ZHENGJUN ZHANG
large y. The case xo0 corresponds to a distribution function which has a finite upper end point at s/x. The case x ¼ 0 yields the exponential distribution with mean s: y Gðy; s; 0Þ ¼ 1 expð Þ s The Pareto, or GPD, and other similar distributions have long been used models for long-tailed processes. As explained in the previous sections, we first transform the data to unit Fre´chet margins. This is done using the generalized extreme value (GEV) distributions which are closely related to the GPD as explained in Embrechts et al. (1997) and can be written as x m 1=x HðxÞ ¼ exp ð1 þ x Þ , (3.2) c þ where m is a location parameter, c40 is a scale parameter, and x is a shape parameter similar to the GPD form in (3.1). Pickands (1975) first established the rigorous connection of the GPD with the GEV. Using the GEV and GPD, we fit the tails of the negative observations, the positive observations, and the absolute observations separately. The absolute returns are included in our data analysis for comparison. The generalized extreme value model (3.2) is fitted to observations which are above a threshold u. We have tried a series of threshold values and performed graphical diagnosis using the mean excess plot, the W and Z statistics plots due to Smith (2003). Those diagnosis suggested that when values above u ¼ 1.2, visually, a GPD function fits data well. There are about 10% of observed values above the threshold u ¼ 1.2. The maximum likelihood estimates are summarized in Table 1. From the table, we see that standardized negative returns are still fat tailed since the estimated shape parameter value is positive, but the Table 1. Estimations of Parameters in GEV Using Standardized Return Series and Threshold Value 1.2. Nu is the Number of Observations Over Threshold u. Nu
Table 1. Estimations of Parameters in the GEV Using the Standardized Return Series and Threshold Value 1.2. $N_u$ Is the Number of Observations Over the Threshold u.

Series     N_u     mu (SE)               log c (SE)            xi (SE)
Negative   1050    3.231845 (0.086479)   0.319098 (0.075502)    0.096760 (0.028312)
Positive   1088    2.869327 (0.054016)   0.784060 (0.064046)   -0.062123 (0.025882)
Absolute   2138    3.514718 (0.071042)   0.450730 (0.060214)    0.045448 (0.018304)
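For illustration, the tail-fitting and unit Fréchet transformation steps can be sketched as follows. The sketch fits a GPD to the exceedances with scipy rather than reproducing the GEV fit reported in Table 1, and the semiparametric mapping to the unit Fréchet scale shown here is one standard construction that may differ in detail from the author's transformation.

```python
# Sketch of the exceedance fit and the unit Frechet transformation.
# scipy's GPD routine is used instead of the GEV fit in Table 1, so the
# estimates will differ in detail.
import numpy as np
from scipy import stats

def to_unit_frechet(x, u=1.2):
    """Fit a GPD to exceedances over u and map the exceeding points to unit Frechet."""
    x = np.asarray(x)
    above = x[x > u]
    n, n_u = len(x), len(above)
    xi, _, sigma = stats.genpareto.fit(above - u, floc=0)     # shape, loc=0, scale
    tail = (n_u / n) * (1 + xi * (above - u) / sigma) ** (-1 / xi)
    cdf = 1 - tail                                            # estimated F(x) for x > u
    return -1.0 / np.log(cdf), xi, sigma                      # unit Frechet: -1/log F(x)

# z_neg, xi_neg, sig_neg = to_unit_frechet(std_neg_returns, u=1.2)   # hypothetical input
```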
Fig. 4. Plots of the S&P 500 Standardized Time Series in Fréchet Scale. Panel (a) Is for Negative Returns; Panel (b) Is for Positive Returns. The Threshold Used for Both Plots Is u = 1.2.
From Table 1, we see that the standardized negative returns are still fat tailed, since the estimated shape parameter is positive, while the standardized positive returns are now short tailed. The absolute returns remain fat tailed. The standardized data are then converted to the unit Fréchet scale for the observations above u = 1.2, using the tail fits from Table 1; the final transformed data are plotted in Fig. 4. After standardization and scale transformation, the data show far fewer jumps in volatility, but jumps in returns persist, as will be shown in the next section.

3.2. Tail Dependence

In applications of GARCH modelling, the GARCH residuals are assumed to be independently and identically distributed normal (or t) random variables. These distributional assumptions may not be appropriate in some applications, and various modified GARCH-type models have been proposed in the literature. Hall and Yao (2003), Mikosch and Stărică (2000), and Mikosch and Straumann (2002), among others, discuss heavy-tailed errors and extreme values, and Straumann (2003) deals with estimation in certain conditionally heteroscedastic time series models, such as GARCH(1,1), AGARCH(1,1), and EGARCH. Our main
interest is the analysis of tail dependence within the standardized, unit Fréchet transformed returns. Given such tail dependence, we want to propose a time series model capturing that behavior. In order to check whether tail dependence exists within the sequences in Fig. 4, we apply the gamma test from Section 2 to the transformed returns, with the threshold level at the 95th percentile of the observed sequence (the 90th percentile corresponds to u = 1.2), as in Zhang (2003a). However, in real data, or even in simulated examples such as those in Section 4, a single test may not be enough to discover the true tail dependence or independence. As pointed out in Zhang (2003a), the gamma test is able to detect tail dependence between two random variables even if there are outliers in one of the two observed sequences. In our case, we have a univariate sequence. If there are outliers in the sequence, a Type I error may occur, since outliers lead to smaller $Q_{u,n}$ values in (2.11); if there are lower-lag tail dependencies, a Type II error may occur, since there are dependencies within each projected sequence. In order to minimize the probability of making such errors, we select observations in local windows of size 500 to perform the gamma test. The observations in each window are used to compute the value of the gamma test statistic, and a global conclusion is drawn at level $\alpha = 0.05$. Besides the gamma test, we also compute empirical estimates of the tail dependence indexes, using the following procedure.

Empirical estimation procedure: Suppose $x_1, x_2, \ldots, x_n$ is a sequence of observed values and $x^*$ is the 95th sample percentile. Then the empirical estimate of the lag-k tail dependence index is
$$\sum_{i=1}^{n-k} I(x_i > x^*,\ x_{i+k} > x^*) \Big/ \sum_{i=1}^{n} I(x_i > x^*),$$
where I(.) is an indicator function.

The test results for lags 1-15 are summarized in Table 2. Columns 2 and 6 (times 100) give the percentages of rejection of H0 over all fully enumerated local windows of size 500; of each data point, at least one component is nonzero. Columns 3 and 7 are the empirically estimated lag-k tail dependence indexes over a threshold value computed at the 95th percentile of the whole data set. Columns 4 and 8 are the minima, and Columns 5 and 9 the maxima, of all $Q_{u,n}$ values computed using (2.11) over all local windows. The number 0.7015 in Column 2 means that if we partition the whole data set into 100 subsets and perform the gamma test on each subset, there are about 70 subsets for which we would reject the tail independence hypothesis. The number 0.0700 in Column 3 is an empirically estimated tail dependence index; it says that when a large price drop is observed, with the resulting negative return exceeding the 95th percentile of the historical data, there is a 7% chance of observing a large price drop on the kth following day. The rest of the numbers in the table can be interpreted similarly.
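For concreteness, the empirical estimation procedure just described can be written in a few lines of Python; a minimal sketch (variable names are illustrative):

```python
# Minimal sketch of the empirical lag-k tail dependence estimator defined above.
import numpy as np

def lagk_tail_dependence(x, k, q=0.95):
    """x: 1-D array of observations; threshold is the q-th sample quantile."""
    x = np.asarray(x)
    x_star = np.quantile(x, q)
    joint = np.sum((x[:-k] > x_star) & (x[k:] > x_star))   # both x_i and x_{i+k} exceed
    return joint / np.sum(x > x_star)

# Estimates for lags 1..15, as in Columns 3 and 7 of Table 2:
# [lagk_tail_dependence(series, k) for k in range(1, 16)]
```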
Table 2. Columns 2, 6 (times 100) Give the Percentages of Rejection of H0 from All Fully Enumerated Local Windows of Size 500 (Of Each Data Point, at Least One Component Is Nonzero). Columns 3, 7 Are Estimated Lag-k Tail Dependence Indexes Over a Threshold Value Computed at the 95th Percentile for the Whole Data. Columns 4, 8 Are the Minima, and Columns 5, 9 the Maxima, of All $Q_{u,n}$ Values Computed Using (2.11) in All Local Windows.

                 Negative Returns                      Positive Returns
lag k    Rej.     Ind.     Min      Max        Rej.     Ind.     Min      Max
1        1.0000   0.0700   0.0162   0.3092     0.6787   0.0555   0.0079   0.1291
2        0.7015   0.0700   0.0013   0.2292     0.7556   0.0357   0.0078   0.1905
3        0.6064   0.0525   0.0125   0.2370     0.6454   0.0317   0.0126   0.1630
4        0.8440   0.0525   0.0185   0.4046     0.4104   0.0317   0.0152   0.1170
5        0.5713   0.0306   0.0095   0.3369     1.0000   0.0238   0.0344   0.2113
6        0.4452   0.0088   0.0174   0.2761     0.5147   0.0119   0.0181   0.1597
7        1.0000   0.0131   0.0532   0.2917     0.5125   0.0159   0.0196   0.2245
8        0.7417   0.0306   0.0029   0.1353     0.1158   0.0079   0.0067   0.0485
9        0.3793   0.0175   0.0244   0.3740     0        0.0119   0.0065   0.0268
10       1.0000   0.0131   0.0264   0.0687     0        0.0159   0.0060   0.0186
11       1.0000   0.0088   0.0262   0.2012     1.0000   0.0159   0.0309   0.0444
12       1.0000   0.0131   0.0556   0.1802     1.0000   0.0079   0.0530   0.0679
13       0        0.0175   0.0083   0.0093     0.8681   0.0040   0.0219   0.0320
14       1.0000   0.0088   0.0480   0.0528     0        0.0119   0.0072   0.0081
15       0.5217   0.0131   0.0098   0.0436     1.0000   0.0198   0.0839   0.1061
From Column 2 in Table 2, we see that for lags 1 to 12 the percentages of rejecting H0 are high for the negative returns; at lags 1, 7, 10, 11, and 12, tail independence is always rejected. In Column 3, adding up the estimated lag-1 to lag-12 tail dependence indexes gives a total of over 30%, which says that, following a large price drop, there is a more than 30% chance of another large price drop within the next 12 days. This should be taken seriously in modeling. Lag-13 is a break point, at which the rejection rate of tail independence is 0 for the negative returns. The empirically estimated dependence indexes in Column 3 show a decreasing trend over the first five lags; after lag-5 they become small and show no trend. The table also shows that each of the first five empirically estimated tail dependence indexes falls between the minimum and the maximum of the $Q_{u,n}$ values. These numbers suggest that there is tail dependence within the negative returns. The empirically computed values, as well as the $Q_{u,n}$ values, can be used as estimates of the tail
dependence index. In Section 4, we compute theoretical lag-k tail dependence indexes for certain models. Columns 6-9 are for positive returns and can be interpreted in the same way as the columns for negative returns. Column 6 shows that, up to lag-8, the positive returns are tail dependent; lag-5 tail independence is always rejected. Lag-9 is a break point at which the rejection rate of tail independence is 0. As for the negative returns, Column 7 shows a decreasing trend in the empirically estimated tail dependence indexes; after lag-5, there is no clear pattern. Columns 8 and 9 are the minima and the maxima of the empirical tail dependence index values from formula (2.11), and each of the first five numbers in Column 7 falls between the corresponding minimum and maximum in Columns 8 and 9.

In the literature, the statement that jumps in returns are transient usually means that a large jump (positive or negative) has no future impact. When jumps in a return series are transient, they are tail independent. Models dealing with transient jumps in return series include Eraker, Johannes, and Polson (2003) and Duan, Ritchken, and Sun (2003). Our results (jumps in returns being tail dependent) suggest that jumps in returns are not transient: when jumps in returns are tail dependent, a large jump may have an extreme impact on future returns, and this should be taken into account in modeling.

In order to find a criterion for deciding on the maximal lag at which tail dependence exists, we propose the following procedure.

Empirical criterion: When the rejection rate of the lag-r test first reaches 0, or a rate substantially smaller than the rejection rate of the lag-(r-1) test, and the estimated lag-r tail dependence index is less than 1%, let K = r - 1.

At the moment, this procedure is purely ad hoc; more work on its statistical properties is needed. Based on this criterion, however, we use lag-12 tail dependence for the transformed negative returns and lag-7 tail dependence for the transformed positive returns in the sections below. The above tests may suggest the existence of tail dependence; finding an appropriate time series model explaining these dependencies is another task. In the next section we introduce a versatile class of models aimed at achieving just that.
4. THE BASIC MODEL AND THE COMPUTATION OF LAG-K TAIL DEPENDENCE INDEXES

Suppose the maximal lag of tail dependence within a univariate sequence is K, and $\{Z_{li},\ l = 0, 1, \ldots, L,\ -\infty < i < \infty\}$ is an independent array, where the random variables $Z_{li}$ are identically distributed with a unit Fréchet distribution function. Consider the model
$$Y_i = \max_{0 \le l \le L}\ \max_{0 \le k \le K} a_{lk} Z_{l, i-k}, \qquad -\infty < i < \infty, \qquad (4.1)$$
where the constants $\{a_{lk}\}$ are nonnegative and satisfy $\sum_{l=0}^{L} \sum_{k=0}^{K} a_{lk} = 1$. When L = 0, model (4.1) is a moving maxima process. It is clear that the unit Fréchet condition $\sum_{l=0}^{L}\sum_{k=0}^{K} a_{lk} = 1$ makes $Y_i$ a unit Fréchet random variable, so $Y_i$ can be thought of as being represented by an independent array of random variables which are also unit Fréchet distributed. The main idea behind (4.1) is that stochastic processes $(Y_i)$ with unit Fréchet margins can be thought of as being represented, through (4.1), by an infinite array of independent unit Fréchet random variables. A next step concerns the quality of such an approximation for finite (possibly small) values of L and K. The idea and the approximation theory are presented in Smith and Weissman (1996) and Smith (2003); Zhang (2004) establishes conditions for using a finite moving range M4 process to approximate an infinite moving range M4 process. Under model (4.1), when an extreme event occurs, i.e., when a large $Z_{lk}$ occurs, we have $Y_i \approx a_{l,i-k} Z_{lk}$ for i near k: if some $Z_{lk}$ is much larger than all neighboring Z values, then $Y_i = a_{l,i-k} Z_{lk}$ for i near k. This indicates a moving pattern in the time series, known as a signature pattern. Hence, L corresponds to the maximal number of signature patterns, while the constant K characterizes the range of dependence in each sequence and the order of the moving maxima process. We illustrate these phenomena for the case L = 0 in Fig. 5. Plots (b) and (c) involve the same values of $(a_{00}, a_{01}, \ldots, a_{0K})$: Plot (b) is a blowup of a few observations of the process in (a), and Plot (c) is a similar blowup of a few other observations of the process in (a). The vertical scale in Plot (b) runs from 0 to 5, while that in Plot (c) runs from 0 to 50. These plots show that there are characteristic shapes around the local maxima that replicate themselves; those blowups, or replicates, are known as signature patterns.

Remark 3. Model (4.1) is a simplified version of a general M4 process, since we restrict our attention to univariate time series. In Zhang and Smith (2003, 2004), properties of models with a finite number of parameters are discussed; they provide sufficient conditions under which the estimators of the model parameters are consistent and jointly asymptotically multivariate normal.
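To make the construction (4.1) concrete, the following minimal simulation sketch draws the independent unit Fréchet array and forms the moving maxima; it is an illustration under the stated assumptions, not the author's implementation.

```python
# Sketch: simulate Y_i = max_{l,k} a_{lk} Z_{l,i-k} from (4.1), where the Z_{l,i}
# are i.i.d. unit Frechet and the nonnegative weights a_{lk} sum to one.
import numpy as np

def simulate_m4(a, n, seed=None):
    """a: (L+1) x (K+1) weight array summing to 1; returns Y_1, ..., Y_n."""
    rng = np.random.default_rng(seed)
    L1, K1 = a.shape
    # unit Frechet draws Z = -1/log(U); K extra columns cover the lags at the start
    z = -1.0 / np.log(rng.uniform(size=(L1, n + K1 - 1)))
    y = np.empty(n)
    for i in range(n):
        window = z[:, i:i + K1][:, ::-1]     # column k holds Z_{l, i-k}
        y[i] = np.max(a * window)
    return y

# With the structure (4.3) below, row l of `a` has nonzero entries only at columns 0 and l.
```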
Fig. 5. An Illustration of an M4 Process. Panel (a) Consists of 365 Simulated Daily Observations; Panels (b) and (c) Are Enlarged Views of (a), From the 40th to the 45th Day and From the 138th to the 144th Day, Respectively. Panels (b) and (c) Show a Single Moving Pattern, Called a Signature Pattern, in Certain Time Periods When Extreme Events Occur.
Under model (4.1), we have the following lag-k tail dependence index formula:
$$\lambda_k = \sum_{l=1}^{\infty} \sum_{m=-\infty}^{\infty} \min\big(a_{l,1-m},\ a_{l,1+k-m}\big). \qquad (4.2)$$
Obviously, as long as both $a_{l0}$ and $a_{lK}$ are non-zero, $Y_i$ and $Y_{i+K}$ are dependent, and of course tail dependent, as can be seen from (4.2). In real data, for example the negative returns, we count how many times the negative return values $Y_i, Y_{i+1}, \ldots, Y_{i+k}$ $(i = 1, \ldots, n-k)$ are simultaneously greater than a given threshold value for any fixed k, and we record all the i values for which this happens. We typically find that jumps in negative returns as
well as jumps in positive returns appear to cluster over two consecutive days, i.e., k = 1. In general, it is not realistic to observe similar blowup patterns over a period of more than two days. A preliminary data analysis shows that blowups seem to appear over a short time period (2 or 3 days), while the transformed returns (negative, positive) have a much longer lag-k tail dependence. Considering the clustering of the data over two consecutive days and the much longer lag-k tail dependencies in the real data, the following parameter structure seems reasonable: we assume the matrix of weights $(a_{lk})$ to have the structure
$$(a_{lk}) = \begin{pmatrix}
a_{00} & 0 & 0 & 0 & \cdots & 0 \\
a_{10} & a_{11} & 0 & 0 & \cdots & 0 \\
a_{20} & 0 & a_{22} & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
a_{L0} & 0 & 0 & 0 & \cdots & a_{LL}
\end{pmatrix}. \qquad (4.3)$$
Now the number L corresponds to the maximal lag of tail dependence within the sequence, and the lag-k tail dependence index is characterized by the coefficients $a_{k0}$ and $a_{kk}$. The coefficient $a_{00}$ represents the proportion of observations which are drawn from an independent process $\{Z_{0i}\}$; in other words, a very large value at time 0 has no future impact when that large value is generated from $\{Z_{0i}\}$. If both $a_{k0}$ and $a_{kk}$ are nonzero, then a very large value at time 0 has an impact at time k when the large value is generated from $\{Z_{ki}\}$. If there is strong lag-k tail dependence for each k, the value of $a_{00}$ will be small. While the two coefficients $a_{k0}$ and $a_{kk}$ may not be sufficient to characterize all kinds of k-step impacts of a very large value at time 0, the setups (4.1) and (4.3) are considered an approximation of the observed process. It is clear that once the maximal lag of tail dependence of a sequence has been determined, we can easily write down the model based on (4.1) and (4.3). Although estimators for the coefficients of a general M4 model have been proposed by Zhang and Smith (2003) and Zhang (2003b), there are advantages to using the special structure in (4.3) to construct estimators, i.e., the conditions imposed on the parameters can be reduced to a minimal level. We now compute the lag-r tail dependence index and then, in Section 6, turn to the estimation of the parameters in (4.3).
For $r > 0$, we have
$$\begin{aligned}
P(Y_1 \le x,\ Y_{1+r} \le y) &= P\big(a_{lk} Z_{l,1-k} \le x,\ a_{lk} Z_{l,1+r-k} \le y,\ 0 \le l \le L,\ 0 \le k \le L\big) \\
&= P\big(a_{l0} Z_{l,1} \le x,\ a_{ll} Z_{l,1-l} \le x,\ a_{l0} Z_{l,1+r} \le y,\ a_{ll} Z_{l,1+r-l} \le y,\ 0 \le l \le L\big) \\
&= \exp\left\{ -\sum_{l \ne r} (a_{l0} + a_{ll})\Big(\frac{1}{x} + \frac{1}{y}\Big) - \frac{a_{rr}}{x} - \frac{a_{r0}}{y} - \max\Big(\frac{a_{r0}}{x}, \frac{a_{rr}}{y}\Big) \right\} \\
&= \exp\left\{ -\frac{1}{x} - \frac{1}{y} + \min\Big(\frac{a_{r0}}{x}, \frac{a_{rr}}{y}\Big) \right\}. \qquad (4.4)
\end{aligned}$$
A general joint probability computation leads to the following expression:
$$\begin{aligned}
P(Y_i \le y_i,\ 1 \le i \le r) &= P\big(Z_{l,i-k} \le y_i \text{ for } 0 \le l \le L,\ 0 \le k \le L,\ 1 \le i \le r\big) \\
&= P\Big(Z_{l,m} \le \min_{1 \le m+k \le r} \frac{y_{m+k}}{a_{l,k}},\ 0 \le l \le L,\ -l+1 \le m \le r\Big) \\
&= \exp\left\{ -\sum_{l=0}^{L} \sum_{m=-l+1}^{r}\ \max_{1 \le m+k \le r} \frac{a_{l,k}}{y_{m+k}} \right\}. \qquad (4.5)
\end{aligned}$$
Since
$$\frac{P(Y_1 > u,\ Y_{1+r} > u)}{P(Y_1 > u)} = \frac{1 - 2\exp\{-1/u\} + \exp\{(-2 + \min(a_{r0}, a_{rr}))/u\}}{1 - \exp\{-1/u\}} \longrightarrow \min(a_{r0}, a_{rr}) \qquad (4.6)$$
as $u \to \infty$, the lag-r tail dependence index of model (4.1) is $\min(a_{r0}, a_{rr})$.

There is some intuition behind the characterization of the lag-r tail dependence index in (4.6). The value $a_{r0} + a_{rr}$ represents the proportion of observations which are drawn from the process $\{Z_{ri}\}$, while $\min(a_{r0}, a_{rr})$ represents the proportion of observations which are over a certain threshold and are drawn from the lag-r dependent process. This can be seen from the left-hand side of (4.6) when it is replaced by its empirical counterpart; the empirical counterpart can also be used to estimate $\min(a_{r0}, a_{rr})$ for a suitable choice of u. Parameter estimation will be discussed in Section 6. We now illustrate an example and show how the gamma test detects the order of lag-k tail dependence.
Example 4.1. Consider model (4.1) with the following parameter structure:
$$(a_{lk}) = \begin{pmatrix}
0.3765 & 0 & 0 & 0 & 0 & 0 \\
0.0681 & 0.0725 & 0 & 0 & 0 & 0 \\
0.0450 & 0 & 0.0544 & 0 & 0 & 0 \\
0.0276 & 0 & 0 & 0.1166 & 0 & 0 \\
0.0711 & 0 & 0 & 0 & 0.0185 & 0 \\
0.1113 & 0 & 0 & 0 & 0 & 0.0386
\end{pmatrix}. \qquad (4.7)$$
We use (4.1) and (4.7) to generate a sequence of observations of size 5000; these observations are plotted in Fig. 6. We use the gamma test (2.12) to test lag-k tail dependencies at level $\alpha = 0.05$. As pointed out earlier, the index l in model (4.1) corresponds to a signature pattern of the observed process, and $\sum_k a_{lk}$ yields the proportion of the total number of observations drawn from the lth independent sequence $\{Z_{lk},\ -\infty < k < \infty\}$. If we simply used all the data for the gamma test, we might not get the right indications of the lag-k tail dependencies, because the values computed from the test statistic may not be associated with the kth moving pattern.
Fig. 6. Simulated Time Series of M3 Processes in Model (4.1) (Plotted on a log10 Scale).
Here we conduct the gamma test in a local moving window of size 300: we randomly draw 100 local windows and perform the gamma test using the 300 observations in each window. The number of rejections of the lag-k tail independence hypothesis is summarized in the following table.

lag-k                    1   2   3   4   5   6   7   8   9   10
# of rejections of H0    7   5   2   3   5   1   1   0   0   0
The rejection rates for lags k = 1, 2, 3, 4, 5 are close to the corresponding proportions of observations drawn from the respective moving patterns, while the rejection rates for larger lags are relatively small. Therefore, a maximal lag of 5 for the tail dependence present in the simulated data is indeed suggested by the gamma test results.
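As a numerical cross-check of (4.6), one can simulate model (4.1) with the weights (4.7) and compare the empirical lag-k tail dependence estimates with $\min(a_{k0}, a_{kk})$; a self-contained sketch is given below (at a finite threshold the empirical values only approximate the limit).

```python
# Sketch: simulate Example 4.1 and compare empirical lag-k tail dependence
# estimates with the theoretical values min(a_k0, a_kk) from (4.6).
import numpy as np

a = np.zeros((6, 6))
a[:, 0] = [0.3765, 0.0681, 0.0450, 0.0276, 0.0711, 0.1113]      # first column of (4.7)
for l, d in zip(range(1, 6), [0.0725, 0.0544, 0.1166, 0.0185, 0.0386]):
    a[l, l] = d                                                 # diagonal of (4.7)

rng = np.random.default_rng(0)
n, K = 5000, a.shape[1]
z = -1.0 / np.log(rng.uniform(size=(a.shape[0], n + K - 1)))    # unit Frechet array
y = np.array([np.max(a * z[:, i:i + K][:, ::-1]) for i in range(n)])

u = np.quantile(y, 0.95)
for k in range(1, 6):
    emp = np.sum((y[:-k] > u) & (y[k:] > u)) / np.sum(y > u)
    print(k, round(emp, 4), "theoretical:", min(a[k, 0], a[k, k]))
```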
5. COMBINING M3 WITH A MARKOV PROCESS: A NEW NONLINEAR TIME-SERIES MODEL

The previous analysis has suggested that there is asymmetric behavior between negative returns and positive returns. We have also tested whether there is tail dependence between positive returns and negative returns, and found that the null hypothesis of tail independence was not rejected. These phenomena suggest that models for negative returns should be different from models for positive returns. First, we want to find a class of models of the form (4.1) and (4.3) for the negative returns. Second, we want to find a different class of models of the form (4.1) and (4.3) for the positive returns. Then we combine these two classes of models with a Markov process for both returns. Notice that the time series plot of negative returns contains many zero values (close to 50% of the total number of points), at which positive returns were observed. This suggests that models for negative returns should include a variable modeling the locations of the occurrences of zeros, which leads to the following model structure.

Model 1: Combining M3 (used to model scales) with a Markov process (used to model signs), a new model for negative returns:
$$Y_i^- = \max_{0 \le l \le L^-}\ \max_{0 \le k \le K^-} a_{lk}^- Z_{l, i-k}^-, \qquad -\infty < i < \infty,$$
where the superscript $-$ means that the model is for negative returns only. The constants $\{a_{lk}^-\}$ are nonnegative and satisfy $\sum_{l=0}^{L^-} \sum_{k=0}^{K^-} a_{lk}^- = 1$. The matrix of weights is
$$(a_{lk}^-) = \begin{pmatrix}
a_{00}^- & 0 & 0 & 0 & \cdots & 0 \\
a_{10}^- & a_{11}^- & 0 & 0 & \cdots & 0 \\
a_{20}^- & 0 & a_{22}^- & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
a_{L^-0}^- & 0 & 0 & 0 & \cdots & a_{L^-L^-}^-
\end{pmatrix}.$$
Here $\{Z_{li}^-,\ l = 0, 1, \ldots, L^-,\ -\infty < i < \infty\}$ is an independent array, where the random variables $Z_{li}^-$ are identically distributed with a unit Fréchet distribution function. Let
$$R_i^- = \xi_i^- Y_i^-, \qquad -\infty < i < \infty, \qquad (5.1)$$
where the process $\{\xi_i^-\}$ is independent of $\{Y_i^-\}$ and takes values in the finite set $\{0, 1\}$, i.e., $\{\xi_i^-\}$ is a sign process. Here $\{Y_i^-\}$ is an M3 process, $\{\xi_i^-\}$ is a simple Markov process, and $\{R_i^-\}$ is the negative return process. For simplicity, Model (5.1) is referred to as an MCM3 process.

Remark 4. If $\{Y_i^-\}$ is an independent process, then $P(R_{i+r}^- > u \mid R_i^- > u) \to 0$ as $u \to \infty$ for $i > 0$, $r > 0$, i.e., no tail dependence exists. This shows that if there are tail dependencies in the observed process, a model whose time dependence comes only through a Markov chain cannot capture the tail dependence when the random variables used to model the scales are not themselves tail dependent.

Model 2: An MCM3 process model for positive returns:
$$Y_i^+ = \max_{0 \le l \le L^+}\ \max_{0 \le k \le K^+} a_{lk}^+ Z_{l, i-k}^+, \qquad -\infty < i < \infty,$$
where the superscript $+$ means that the model is for positive returns only. The constants $\{a_{lk}^+\}$ are nonnegative and satisfy $\sum_{l=0}^{L^+} \sum_{k=0}^{K^+} a_{lk}^+ = 1$.
The matrix of weights is
$$(a_{lk}^+) = \begin{pmatrix}
a_{00}^+ & 0 & 0 & 0 & \cdots & 0 \\
a_{10}^+ & a_{11}^+ & 0 & 0 & \cdots & 0 \\
a_{20}^+ & 0 & a_{22}^+ & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
a_{L^+0}^+ & 0 & 0 & 0 & \cdots & a_{L^+L^+}^+
\end{pmatrix}.$$
The random variables $\{Z_{li}^+,\ l = 0, 1, \ldots, L^+,\ -\infty < i < \infty\}$ form an independent array, where the $Z_{li}^+$ are identically distributed with a unit Fréchet distribution. Let
$$R_i^+ = \xi_i^+ Y_i^+, \qquad -\infty < i < \infty, \qquad (5.2)$$
where the process $\{\xi_i^+\}$ is independent of $\{Y_i^+\}$ and takes values in the finite set $\{0, 1\}$. Here $\{Y_i^+\}$ is an M3 process, $\{\xi_i^+\}$ is a simple Markov process, and $\{R_i^+\}$ is the positive return process.
Remark 5. In previous sections, we have seen that negative returns $Y_i^-$ and positive returns $Y_i^+$ are asymmetric, and we concluded that models for positive returns should be different from models for negative returns. Notice that at any time i, one can observe only one of $Y_i^-$ and $Y_i^+$; the other one is missing. By introducing the Markov processes $\xi_i^-$ and $\xi_i^+$, both $R_i^-$ in (5.1) and $R_i^+$ in (5.2) are observable. We use $R_i^-$ and $R_i^+$ to construct parameter estimators.

Model 3: An MCM3 process model for returns. With the notation established in (5.1) and (5.2), let
$$R_i = \mathrm{sign}(\xi_i)\,\big[\, I_{\{\xi_i = -1\}} Y_i^- + I_{\{\xi_i = 1\}} Y_i^+ \,\big], \qquad -\infty < i < \infty, \qquad (5.3)$$
where the process $\{\xi_i\}$ is a simple Markov process which is independent of $\{Y_i^-\}$ and $\{Y_i^+\}$ and takes values in the finite set $\{-1, 0, 1\}$. $\{R_i\}$ is the return process.

Remark 6. The processes $\{\xi_i^-\}$ and $\{\xi_i^+\}$ may be Bernoulli processes or Markov processes taking values in a finite set. The process $\{\xi_i\}$ may be considered as an independent process or as a Markov process taking values in a finite set.

Remark 7. In Model (5.3), once $\{Y_i^-\}$, $\{Y_i^+\}$, and $\xi_i$ are determined, $R_i$ is determined.
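To illustrate how the pieces of Models 1-3 fit together, here is a minimal simulation sketch: a three-state Markov sign chain combined with two independent M3 scale processes. The weight matrices and the transition matrix are arbitrary placeholders, not estimates from this paper.

```python
# Sketch of the MCM3 construction in Models 1-3: a Markov sign chain on
# {-1, 0, 1} combined with two independent M3 scale processes. All weights
# and transition probabilities below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(1)
n = 2000

def m3(a, size):
    """Simulate Y_i = max_{l,k} a_{lk} Z_{l,i-k} with unit Frechet Z, as in (4.1)."""
    L1, K1 = a.shape
    z = -1.0 / np.log(rng.uniform(size=(L1, size + K1 - 1)))
    return np.array([np.max(a * z[:, i:i + K1][:, ::-1]) for i in range(size)])

a_neg = np.array([[0.4, 0.0], [0.3, 0.3]])   # placeholder weights, sum to 1
a_pos = np.array([[0.6, 0.0], [0.2, 0.2]])   # placeholder weights, sum to 1
y_neg, y_pos = m3(a_neg, n), m3(a_pos, n)

states = np.array([-1, 0, 1])
P = np.array([[0.5, 0.1, 0.4],               # placeholder transition matrix
              [0.4, 0.2, 0.4],
              [0.4, 0.1, 0.5]])
idx = np.empty(n, dtype=int)
idx[0] = 2                                    # arbitrary initial state (+1)
for t in range(1, n):
    idx[t] = rng.choice(3, p=P[idx[t - 1]])
xi = states[idx]

# Model 3: R_i = sign(xi_i) [ I(xi_i = -1) Yneg_i + I(xi_i = 1) Ypos_i ]
r = np.where(xi == -1, -y_neg, np.where(xi == 1, y_pos, 0.0))
```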
Remark 8. In many applications, only positive observations are of concern; insurance claims, annual maxima of precipitation, file sizes, and durations of internet traffic at a given point are examples involving positive values only. Even in our negative return model, the values have been converted into positive values. Therefore, the properties of Model (5.1) can easily be extended to Model (5.2).

We now focus on Model (5.1) and first compute its lag-k tail dependence index:
$$\begin{aligned}
\frac{P(R_j > u,\ R_{j+k} > u)}{P(R_j > u)} &= \frac{P(Y_j > u,\ \xi_j = 1,\ Y_{j+k} > u,\ \xi_{j+k} = 1)}{P(Y_j > u,\ \xi_j = 1)} \\
&= \frac{P(Y_j > u,\ Y_{j+k} > u)\, P(\xi_j = 1,\ \xi_{j+k} = 1)}{P(Y_j > u)\, P(\xi_j = 1)} \\
&= \frac{P(Y_j > u,\ Y_{j+k} > u)}{P(Y_j > u)}\, P(\xi_{j+k} = 1 \mid \xi_j = 1) \\
&\longrightarrow \min(a_{k0}, a_{kk})\, P(\xi_{j+k} = 1 \mid \xi_j = 1) \qquad (5.4)
\end{aligned}$$
as u tends to infinity. Now suppose $\{\xi_i\}$ is a simple Markov process taking values in the finite set $\{0, 1\}$, with one-step transition probabilities
$$P(\xi_{j+1} = 0 \mid \xi_j = 0) = p_{00}, \quad P(\xi_{j+1} = 1 \mid \xi_j = 0) = p_{01}, \quad P(\xi_{j+1} = 0 \mid \xi_j = 1) = p_{10}, \quad P(\xi_{j+1} = 1 \mid \xi_j = 1) = p_{11}, \qquad (5.5)$$
and kth-step $(k > 1)$ transition probabilities
$$P(\xi_{j+k} = 0 \mid \xi_j = 0) = p_{00}^{(k)}, \quad P(\xi_{j+k} = 1 \mid \xi_j = 0) = p_{01}^{(k)}, \quad P(\xi_{j+k} = 0 \mid \xi_j = 1) = p_{10}^{(k)}, \quad P(\xi_{j+k} = 1 \mid \xi_j = 1) = p_{11}^{(k)}, \qquad (5.6)$$
where the superscript $(k)$ denotes the kth step of the Markov process. Using (5.5) and (5.6), we first have
$$\begin{aligned}
P(R_1 < x,\ R_{1+r} < y) &= P(R_1 < x,\ R_{1+r} < y,\ \xi_1 = 0) + P(R_1 < x,\ R_{1+r} < y,\ \xi_1 = 1) \\
&= P(R_{1+r} < y,\ \xi_1 = 0) + P(Y_1 < x,\ R_{1+r} < y,\ \xi_1 = 1). \qquad (5.7)
\end{aligned}$$
Since
$$\begin{aligned}
P(R_{1+r} < y,\ \xi_1 = 0) &= P(R_{1+r} < y,\ \xi_{1+r} = 0,\ \xi_1 = 0) + P(R_{1+r} < y,\ \xi_{1+r} = 1,\ \xi_1 = 0) \\
&= P(\xi_{1+r} = 0,\ \xi_1 = 0) + P(Y_{1+r} < y,\ \xi_{1+r} = 1,\ \xi_1 = 0) \\
&= P(\xi_1 = 0)\, P(\xi_{1+r} = 0 \mid \xi_1 = 0) + P(Y_{1+r} < y)\, P(\xi_1 = 0)\, P(\xi_{1+r} = 1 \mid \xi_1 = 0) \\
&= p_0\, p_{00}^{(r)} + p_0\, p_{01}^{(r)}\, e^{-1/y}, \qquad (5.8)
\end{aligned}$$
where $p_0 = P(\xi_1 = 0)$ is the probability of the chain starting in the initial state $\xi_1 = 0$, and
$$\begin{aligned}
P(Y_1 < x,\ R_{1+r} < y,\ \xi_1 = 1) &= P(Y_1 < x,\ R_{1+r} < y,\ \xi_{1+r} = 0,\ \xi_1 = 1) + P(Y_1 < x,\ R_{1+r} < y,\ \xi_{1+r} = 1,\ \xi_1 = 1) \\
&= P(Y_1 < x,\ \xi_{1+r} = 0,\ \xi_1 = 1) + P(Y_1 < x,\ Y_{1+r} < y)\, P(\xi_{1+r} = 1,\ \xi_1 = 1) \\
&= p_1\, p_{10}^{(r)}\, e^{-1/x} + p_1\, p_{11}^{(r)} \exp\Big\{ -\frac{1}{x} - \frac{1}{y} + \min\Big(\frac{a_{r0}}{x}, \frac{a_{rr}}{y}\Big) \Big\}, \qquad (5.9)
\end{aligned}$$
where $p_1 = 1 - p_0$, putting (5.8) and (5.9) into (5.7) gives
$$P(R_1 < x,\ R_{1+r} < y) = p_0\, p_{00}^{(r)} + p_0\, p_{01}^{(r)}\, e^{-1/y} + p_1\, p_{10}^{(r)}\, e^{-1/x} + p_1\, p_{11}^{(r)} \exp\Big\{ -\frac{1}{x} - \frac{1}{y} + \min\Big(\frac{a_{r0}}{x}, \frac{a_{rr}}{y}\Big) \Big\}. \qquad (5.10)$$
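As a small illustration, the bivariate distribution function (5.10) can be evaluated directly once the weights and the sign-chain transition matrix are given; a minimal sketch with placeholder inputs:

```python
# Sketch: evaluate the bivariate distribution function (5.10) of the MCM3 model
# from p0, the sign-chain transition matrix, and the weights a_r0, a_rr.
import numpy as np

def bivariate_cdf(x, y, r, p0, P, ar0, arr):
    """P: one-step 2x2 transition matrix of the sign chain on {0, 1}."""
    Pr = np.linalg.matrix_power(P, r)       # r-step transition probabilities (5.6)
    p1 = 1.0 - p0
    return (p0 * Pr[0, 0]
            + p0 * Pr[0, 1] * np.exp(-1.0 / y)
            + p1 * Pr[1, 0] * np.exp(-1.0 / x)
            + p1 * Pr[1, 1] * np.exp(-1.0 / x - 1.0 / y + min(ar0 / x, arr / y)))

# Example with placeholder values:
# P = np.array([[0.57, 0.43], [0.48, 0.52]])
# bivariate_cdf(3.0, 4.0, r=2, p0=0.53, P=P, ar0=0.07, arr=0.07)
```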
Notice that the event $\{R_i < x\}$ is the union of the two mutually exclusive events $\{Y_i < x,\ \xi_i = 1\}$ and $\{\xi_i = 0\}$. Then for $I_1 < I_2 < \cdots < I_m$, we have the following joint probability expression:
$$\begin{aligned}
P(R_{I_1} < x_{I_1},\ R_{I_2} < x_{I_2},\ \ldots,\ R_{I_m} < x_{I_m})
&= p_1 \prod_{j=1}^{m-1} p_{11}^{(I_{j+1} - I_j)}\, P(Y_{I_1} < x_{I_1},\ \ldots,\ Y_{I_m} < x_{I_m}) \\
&\quad + \sum_{k=1}^{m-1} \Bigg[ \sum_{\substack{j_1 < j_2 < \cdots < j_k \\ \{j_1, \ldots, j_k\} \subset \{I_1, \ldots, I_m\}}} p_1^{I(j_1 = I_1)}\, p_0^{I(j_1 > I_1)} \prod_{j=1}^{m-1} p_{I(I_j \in \{j_1, \ldots, j_k\})\, I(I_{j+1} \in \{j_1, \ldots, j_k\})}^{(I_{j+1} - I_j)}\; P(Y_{j_1} < x_{j_1},\ \ldots,\ Y_{j_k} < x_{j_k}) \Bigg] \\
&\quad + p_0 \prod_{j=1}^{m-1} p_{00}^{(I_{j+1} - I_j)}. \qquad (5.11)
\end{aligned}$$
Notice that an exact M4 data generating process (DGP) may not be observable in real data, i.e., it may not be realistic to observe an infinite number of repetitions of signature patterns such as those demonstrated in Fig. 5. It is natural to consider the following model:
$$R_i^* = \xi_i (Y_i + N_i), \qquad -\infty < i < \infty, \qquad (5.12)$$
where $\{N_i\}$ is an independent bounded noise process which is also independent of $\{\xi_i\}$ and $\{Y_i\}$. After adding the noise process, the signature patterns can no longer be exhibited explicitly as in Fig. 5. The following proposition shows that, as long as the characterization of tail dependence is the main concern, Model (5.1) or its simplified form is a good approximation to the possible true model.

Proposition 5.1. The lag-k tail dependence within $\{R_i^*\}$ can be expressed as
$$\frac{P(R_j^* > u,\ R_{j+k}^* > u)}{P(R_j^* > u)} \longrightarrow \min(a_{k0}, a_{kk})\, P(\xi_{j+k} = 1 \mid \xi_j = 1) \qquad \text{as } u \to \infty. \qquad (5.13)$$
A proof of Proposition 5.1 is given in the Appendix. The fact that the lag-k tail dependence index is the same in (5.4) and (5.13) suggests that an MCM3 process (5.1) can be used to approximate (5.12). Under (5.12), it is reasonable to assume that the paired parameters are identical, i.e., $a_{r0} = a_{rr}$; under this assumption, we derive the parameter estimators in the next section.
6. MCM3 PARAMETER ESTIMATION

Notice that the right-hand sides of (5.4) and (5.13) involve the moving coefficients and the rth-step transition probabilities. We substitute the quantities on the left-hand sides by their corresponding empirical counterparts to construct the parameter estimators. Since $\{\xi_i^-\}$ and $\{Y_i^-\}$ are assumed independent, the parameters of these two processes can be estimated separately. Our main interest here is to estimate the parameters of the M3 process. Maximum likelihood estimation and the asymptotic normality of the estimators for Markov processes are developed in Billingsley (1961a, b); here we simply perform empirical estimation and then treat the estimated transition probabilities as known parameter values when carrying out statistical inference for the M3 process.
We now assume $a_{r0} = a_{rr}$ for all $r = 1, \ldots, L = L^-$, and denote $P(R_i > u,\ R_{i+r} > u)$ by $\mu_r$ for $r = 1, \ldots, L$, $P(R_1 > u)$ by $\mu_{L+1}$, and $a_{r0}$ by $a_r$, respectively. Define
$$\bar X_r = \frac{1}{n} \sum_{i=1}^{n-r} I_{(R_i > u,\ R_{i+r} > u)}, \qquad r = 1, \ldots, L, \qquad (6.1)$$
$$\bar X_{L+1} = \frac{1}{n} \sum_{i=1}^{n} I_{(R_i > u)}. \qquad (6.2)$$
Then, by the strong law of large numbers (SLLN),
$$\bar X_r \xrightarrow{a.s.} P(R_i > u,\ R_{i+r} > u) = \mu_r, \qquad r = 1, \ldots, L, \qquad (6.3)$$
$$\bar X_{L+1} \xrightarrow{a.s.} P(R_1 > u) = \mu_{L+1}. \qquad (6.4)$$
From (5.4) and (5.13), we propose the following estimators of the parameters $a_r$:
$$\hat a_r = \frac{\bar X_r}{\bar X_{L+1}\, p_{11}^{(r)}}, \qquad r = 1, \ldots, L. \qquad (6.5)$$
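For illustration, (6.5) translates directly into code; a minimal sketch, assuming the threshold u, the maximal lag L, and the estimated r-step probabilities $\hat p_{11}^{(r)}$ are available (variable names are illustrative):

```python
# Sketch of the moment-type estimator (6.5):
#   a_hat_r = Xbar_r / (Xbar_{L+1} * p11^(r)).
import numpy as np

def estimate_ar(returns, u, L, p11_steps):
    """returns: observed R_i; u: threshold; p11_steps: p11^(1), ..., p11^(L)."""
    R = np.asarray(returns)
    n = len(R)
    exceed = R > u
    xbar_L1 = exceed.mean()                                # (6.2)
    a_hat = np.empty(L)
    for r in range(1, L + 1):
        xbar_r = np.sum(exceed[:-r] & exceed[r:]) / n      # (6.1)
        a_hat[r - 1] = xbar_r / (xbar_L1 * p11_steps[r - 1])
    return a_hat

# a0_hat = 1 - 2 * estimate_ar(...).sum()   # as in Theorem 6.3
```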
In order to study asymptotic normality, we introduce the following proposition, which is Theorem 27.4 in Billingsley (1995). First we recall the $\alpha$-mixing condition. For a sequence $Y_1, Y_2, \ldots$ of random variables, let $\alpha_n$ be a number such that
$$|P(A \cap B) - P(A)P(B)| \le \alpha_n$$
for $A \in \sigma(Y_1, \ldots, Y_k)$, $B \in \sigma(Y_{k+n}, Y_{k+n+1}, \ldots)$, and $k \ge 1$, $n \ge 1$. When $\alpha_n \to 0$, the sequence $\{Y_n\}$ is said to be $\alpha$-mixing.

Proposition 6.1. Suppose that $X_1, X_2, \ldots$ is stationary and $\alpha$-mixing with $\alpha_n = O(n^{-5})$, and that $E[X_n] = 0$ and $E[X_n^{12}] < \infty$. If $S_n = X_1 + \cdots + X_n$, then
$$n^{-1} \operatorname{Var}[S_n] \to \sigma^2 = E[X_1^2] + 2 \sum_{k=1}^{\infty} E[X_1 X_{1+k}],$$
where the series converges absolutely. If $\sigma > 0$, then $S_n/(\sigma\sqrt{n}) \xrightarrow{L} N(0, 1)$.

Remark 9. The conditions $\alpha_n = O(n^{-5})$ and $E[X_n^{12}] < \infty$ are stronger than necessary, as stated in the remark following Theorem 27.4 in Billingsley (1995), and are used to avoid technical complications in the proof.
With the established notation, we have the following lemma dealing with the asymptotic properties of the empirical functions.

Lemma 6.2. Suppose that $\bar X_j$ and $\mu_j$ are defined in (6.1)-(6.4). Then
$$\sqrt{n}\left( \begin{pmatrix} \bar X_1 \\ \vdots \\ \bar X_{L+1} \end{pmatrix} - \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_{L+1} \end{pmatrix} \right) \xrightarrow{L} N\Big(0,\ \Sigma + \sum_{k=1}^{L} \big(W_k + W_k^T\big)\Big),$$
where the entries $\sigma_{ij}$ of the matrix $\Sigma$ and the entries $w_{ijk}$ of the matrix $W_k$ are defined below. For $i = 1, \ldots, L$, $j = 1, \ldots, L$,
$$\sigma_{ij} = E\big(I_{\{R_1>u,\ R_{1+i}>u\}} - \mu_i\big)\big(I_{\{R_1>u,\ R_{1+j}>u\}} - \mu_j\big) = P(R_1>u,\ R_{1+i}>u,\ R_{1+j}>u) - \mu_i \mu_j,$$
$$w_{ijk} = E\big(I_{\{R_1>u,\ R_{1+i}>u\}} - \mu_i\big)\big(I_{\{R_{1+k}>u,\ R_{1+j+k}>u\}} - \mu_j\big) = P(R_1>u,\ R_{1+i}>u,\ R_{1+k}>u,\ R_{1+j+k}>u) - \mu_i \mu_j.$$
For $i = 1, \ldots, L$, $j = L+1$,
$$\sigma_{ij} = E\big(I_{\{R_1>u,\ R_{1+i}>u\}} - \mu_i\big)\big(I_{\{R_1>u\}} - \mu_j\big) = \mu_i - \mu_i \mu_j,$$
$$w_{ijk} = E\big(I_{\{R_1>u,\ R_{1+i}>u\}} - \mu_i\big)\big(I_{\{R_{1+k}>u\}} - \mu_j\big) = P(R_1>u,\ R_{1+i}>u,\ R_{1+k}>u) - \mu_i \mu_j.$$
For $i = L+1$, $j = 1, \ldots, L$,
$$\sigma_{ij} = \sigma_{ji}, \qquad
w_{ijk} = E\big(I_{\{R_1>u\}} - \mu_i\big)\big(I_{\{R_{1+k}>u,\ R_{1+j+k}>u\}} - \mu_j\big) = P(R_1>u,\ R_{1+k}>u,\ R_{1+j+k}>u) - \mu_i \mu_j.$$
For $i = L+1$, $j = L+1$,
$$\sigma_{ij} = \mu_i - \mu_i \mu_j, \qquad
w_{ijk} = E\big(I_{\{R_1>u\}} - \mu_i\big)\big(I_{\{R_{1+k}>u\}} - \mu_j\big) = P(R_1>u,\ R_{1+k}>u) - \mu_i \mu_j = \mu_k - \mu_i \mu_j.$$
A proof of Lemma 6.2 is deferred to the Appendix. From (6.5), the Jacobian matrix $\nabla = \big(\partial \hat a_r / \partial \mu_j\big)_{L \times (L+1)}$ has entries
$$\frac{\partial \hat a_r}{\partial \mu_r} = \frac{1}{\mu_{L+1}\, p_{11}^{(r)}}, \qquad \frac{\partial \hat a_r}{\partial \mu_{L+1}} = -\frac{\mu_r}{(\mu_{L+1})^2\, p_{11}^{(r)}}, \qquad r = 1, \ldots, L, \qquad (6.6)$$
with all other entries equal to zero; this Jacobian is used to obtain the asymptotic covariance matrix of the parameter estimators.
We then have the following theorem.

Theorem 6.3. Let $\hat a_0 = 1 - 2\sum_{i=1}^{L} \hat a_i$, and let $\hat a = [\hat a_0, \hat a_1, \ldots, \hat a_L]^T$ and $a = [a_0, a_1, \ldots, a_L]^T$. Then, with the established notation,
$$\sqrt{n}\,(\hat a - a) \xrightarrow{L} N\big(0,\ C B C^T\big),$$
where B is a matrix with elements $B_{1j} = 0$, $B_{i1} = 0$, $i, j = 1, 2, \ldots, m$, and with the minor of $B_{11}$ equal to $\nabla\big[\Sigma + \sum_{k=1}^{L} \{W_k + W_k^T\}\big]\nabla^T$; and C is a matrix with elements $C_{11} = 1$, $C_{1j} = -2$, $C_{i1} = 0$, $i, j = 2, \ldots, m$, and with the minor of $C_{11}$ equal to the identity matrix.

Similarly, we can construct estimators of the parameters in Model 2. For Model 3, we propose to apply Model 1 and Model 2 first and then re-estimate the one-step Markov transition probabilities. The asymptotic properties of the transition probability estimators can be found in Billingsley (1961a, b). We could also derive the joint asymptotic covariance matrix for all parameter estimators following a process similar to the one used for the combined M3 and Markov process for negative returns; we do not pursue that in this paper, because the joint asymptotic properties from Model 1 and Model 2 should suffice for most practical purposes.
7. MODELING JUMPS IN RETURNS

7.1. Modeling Jumps in Negative Returns

The previous analysis in Section 3 has suggested up to lag-12 tail dependence for the negative returns. We now fit model (5.1) to the transformed negative returns. From the data, the proportion of days on which negative returns are observed, i.e., on which investors lose money, is 0.4731. We use a Markov chain to model the signs of the negative returns. Suppose state 1 corresponds to a day on which a negative return is observed and state 0 corresponds to a day on which a negative return is not observed. The one-step transition probabilities $P(\xi_{i+1} = l \mid \xi_i = j)$, $l, j = 0, 1$, are estimated in the following table.

From state    To state 0    To state 1
0             0.5669        0.4331
1             0.4823        0.5177
These are empirical estimates; for example, $P(\xi_{i+1,d} = 0 \mid \xi_{i,d} = 0) = P(Y_{i+1,d} \le 0 \mid Y_{i,d} \le 0)$ is estimated by
$$\sum_{i=1}^{n-1} I_{(Y_{i,d} \le 0,\ Y_{i+1,d} \le 0)} \Big/ \sum_{i=1}^{n-1} I_{(Y_{i,d} \le 0)}.$$
The following table reports the rth-step transition probabilities estimated from the data, together with the transition probabilities computed from the Chapman-Kolmogorov equations.

             Estimated                          Chapman-Kolmogorov
Step    p00     p01     p10     p11         p00     p01     p10     p11
1       0.5669  0.4331  0.4823  0.5177      0.5669  0.4331  0.4823  0.5177
2       0.5172  0.4828  0.5377  0.4623      0.5303  0.4697  0.5231  0.4769
3       0.5195  0.4805  0.5353  0.4647      0.5272  0.4728  0.5266  0.4734
4       0.5302  0.4698  0.5232  0.4768      0.5269  0.4731  0.5269  0.4731
5       0.5284  0.4716  0.5251  0.4749      0.5269  0.4731  0.5269  0.4731
6       0.5223  0.4777  0.5319  0.4681      0.5269  0.4731  0.5269  0.4731
7       0.5279  0.4721  0.5256  0.4744      0.5269  0.4731  0.5269  0.4731
8       0.5282  0.4718  0.5256  0.4744      0.5269  0.4731  0.5269  0.4731
9       0.5239  0.4761  0.5301  0.4699      0.5269  0.4731  0.5269  0.4731
10      0.5285  0.4715  0.5252  0.4748      0.5269  0.4731  0.5269  0.4731
11      0.5270  0.4730  0.5271  0.4729      0.5269  0.4731  0.5269  0.4731
12      0.5369  0.4631  0.5158  0.4842      0.5269  0.4731  0.5269  0.4731
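A minimal sketch of how both panels of the table can be produced, assuming the 0/1 sign indicator series is stored in an array `xi` (an assumption about data layout, not the author's code): the one-step matrix is estimated by counting transitions, the "Estimated" columns come from counting r-step transitions directly, and the Chapman-Kolmogorov columns are powers of the one-step matrix.

```python
# Sketch: empirical one-step and r-step transition probabilities of the sign
# chain, and the Chapman-Kolmogorov values obtained as matrix powers.
import numpy as np

def transition_matrix(xi, step=1):
    """xi: array of 0/1 states; returns the empirical step-transition matrix."""
    xi = np.asarray(xi)
    P = np.zeros((2, 2))
    for j in (0, 1):
        from_j = xi[:-step] == j
        for k in (0, 1):
            P[j, k] = np.sum(from_j & (xi[step:] == k)) / np.sum(from_j)
    return P

# P1 = transition_matrix(xi)                          # one-step estimates
# for r in range(1, 13):
#     est = transition_matrix(xi, step=r)             # "Estimated" columns
#     ck = np.linalg.matrix_power(P1, r)              # "Chapman-Kolmogorov" columns
```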
One can see from the table that the estimated values (the "Estimated" columns) are very close to the theoretical values (the "Chapman-Kolmogorov" columns) after the first step. The limiting distribution of the two-state Markov chain is consistent with the proportion of days on which a negative return is observed. Based on this table, one can conclude that a two-state Markov chain is a good model for the transitions of the signs of the negative returns. Together with the previous analysis, this suggests that an MCM3 model is suitable for jumps in returns. Next we estimate the parameters of the M3 model; the results are summarized in Table 3. The estimated parameter values can be used for further statistical inference, for example to compute value at risk (VaR) based on the estimated model.

7.2. Modeling Jumps in Positive Returns

Similar to the case of negative returns, the estimated first-order transition probability matrix for positive returns is summarized in the following table.

From state    To state 0    To state 1
0             0.5209        0.4791
1             0.4379        0.5621
The following table reports the rth-step transition probabilities estimated from the data, together with the transition probabilities computed from the Chapman-Kolmogorov equations.

             Estimated                          Chapman-Kolmogorov
Step    p00     p01     p10     p11         p00     p01     p10     p11
1       0.5209  0.4791  0.4379  0.5621      0.5209  0.4791  0.4379  0.5621
2       0.4649  0.5351  0.4893  0.5107      0.4811  0.5189  0.4742  0.5258
3       0.4665  0.5335  0.4876  0.5124      0.4778  0.5222  0.4773  0.5227
4       0.4804  0.5196  0.4749  0.5251      0.4776  0.5224  0.4775  0.5225
5       0.4786  0.5214  0.4766  0.5234      0.4775  0.5225  0.4775  0.5225
6       0.4722  0.5278  0.4825  0.5175      0.4775  0.5225  0.4775  0.5225
7       0.4782  0.5218  0.4772  0.5228      0.4775  0.5225  0.4775  0.5225
8       0.4791  0.5209  0.4762  0.5238      0.4775  0.5225  0.4775  0.5225
The estimated parameter values of the M3 model are summarized in Table 4; they can be used for further statistical inference, for example to compute value at risk (VaR) based on the estimated model.
Table 3. Estimations of Parameters in Model (5.1).

r    ar0      SE          r    ar0      SE          r    ar0      SE
0    0.2385   0.2318      5    0.0306   0.0178      9    0.0175   0.0176
1    0.0700   0.0175      6    0.0088   0.0176      10   0.0131   0.0176
2    0.0700   0.0181      7    0.0131   0.0176      11   0.0088   0.0176
3    0.0525   0.0179      8    0.0306   0.0177      12   0.0131   0.0176
4    0.0525   0.0180
Table 4. Estimations of Parameters in Model (5.2).

r    ar0      SE          r    ar0      SE
0    0.5913   0.1668      4    0.0314   0.0200
1    0.0550   0.0190      5    0.0236   0.0199
2    0.0354   0.0200      6    0.0118   0.0198
3    0.0314   0.0200      7    0.0157   0.0199
However, we restrict ourselves to finding estimates of the M3 processes in the current work.
8. DISCUSSION

In this paper, we obtained statistical evidence of the existence of extreme impacts in financial time series data, and we introduced a new time-series model structure: a combination of Markov processes, a GARCH(1,1) volatility model, and M3 processes. We restricted our attention to a subclass of M3 processes. This subclass has the advantage of efficiently modeling serially tail-dependent financial time series, while, of course, other model specifications may also be suitable. We proposed models for tail dependence in jumps in returns. The next step is to draw statistical and economic inferences, such as computing risk measures and constructing prediction intervals; in a separate project, we extend the results developed in this paper to study extremal risk analysis and portfolio choice.

The approach adopted in the paper is a hierarchical model structure: we first apply a GARCH(1,1) fit to obtain estimated standard deviations, and then, based on the standardized return series, we apply M3 and Markov
process modelling. It is possible to study Markov processes, GARCH processes, and M3 processes simultaneously, but that requires additional work; we leave this goal as a direction for future research.
NOTE

1. Matlab codes for the gamma test are available upon request.
ACKNOWLEDGEMENT

This work was supported in part by NSF Grants DMS-0443048 and DMS-0505528, and by the Institute for Mathematical Research, ETH Zürich. The author thanks Professor Paul Embrechts, Professor Tom Fomby (Editor), and one referee for their many comments and suggestions, which have greatly improved the quality of the paper.
REFERENCES

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716–723.
Billingsley, P. (1961a). Statistical inference for Markov processes. Chicago: University of Chicago Press.
Billingsley, P. (1961b). Statistical methods in Markov chains. Annals of Mathematical Statistics, 32, 12–40.
Billingsley, P. (1995). Probability and measure (3rd ed.). New York: Wiley.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroscedasticity. Journal of Econometrics, 31, 307–327.
Danielsson, J. (2002). The emperor has no clothes: Limits to risk modelling. Journal of Banking and Finance, 26, 1273–1296.
Davis, R. A., & Resnick, S. I. (1989). Basic properties and prediction of max-ARMA processes. Advances in Applied Probability, 21, 781–803.
Davis, R. A., & Resnick, S. I. (1993). Prediction of stationary max-stable processes. Annals of Applied Probability, 3, 497–525.
Deheuvels, P. (1983). Point processes and multivariate extreme values. Journal of Multivariate Analysis, 13, 257–272.
Duan, J. C., Ritchken, P., & Sun, Z. (2003). Option valuation with jumps in returns and volatility. Web reference.
Embrechts, P., Klüppelberg, C., & Mikosch, T. (1997). Modelling extremal events for insurance and finance. Berlin: Springer.
Embrechts, P., McNeil, A., & Straumann, D. (2002). Correlation and dependence in risk management: Properties and pitfalls. In: M. A. H. Dempster (Ed.), Risk management: Value at risk and beyond (pp. 176–223). Cambridge: Cambridge University Press.
Engle, R. (2002). Dynamic conditional correlation – a simple class of multivariate GARCH models. Journal of Business and Economic Statistics, 17, 339–350.
Eraker, B., Johannes, M., & Polson, N. G. (2003). The impact of jumps in returns and volatility. Journal of Finance, 53, 1269–1300.
de Haan, L. (1984). A spectral representation for max-stable processes. Annals of Probability, 12, 1194–1204.
de Haan, L., & Resnick, S. I. (1977). Limit theory for multivariate sample extremes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 40, 317–337.
Hall, P., Peng, L., & Yao, Q. (2002). Moving-maximum models for extrema of time series. Journal of Statistical Planning and Inference, 103, 51–63.
Hall, P., & Yao, Q. (2003). Inference in ARCH and GARCH models with heavy-tailed errors. Econometrica, 71, 285–317.
Leadbetter, M. R., Lindgren, G., & Rootzén, H. (1983). Extremes and related properties of random sequences and processes. Berlin: Springer.
McNeil, A., & Frey, R. (2000). Estimation of tail-related risk measures for heteroscedastic financial time series: An extreme value approach. Journal of Empirical Finance, 7, 271–300.
Mikosch, T., & Stărică, C. (2000). Limit theory for the sample autocorrelations and extremes of a GARCH(1,1) process. Annals of Statistics, 28, 1427–1451.
Mikosch, T., & Straumann, D. (2002). Whittle estimation in a heavy-tailed GARCH(1,1) model. Stochastic Processes and their Applications, 100, 187–222.
Nandagopalan, S. (1990). Multivariate extremes and the estimation of the extremal index. Ph.D. dissertation, Department of Statistics, University of North Carolina, Chapel Hill.
Nandagopalan, S. (1994). On the multivariate extremal index. Journal of Research, National Institute of Standards and Technology, 99, 543–550.
Pickands, J., III. (1975). Statistical inference using extreme order statistics. The Annals of Statistics, 3, 119–131.
Resnick, S. I. (1987). Extreme values, regular variation, and point processes. New York: Springer.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Sibuya, M. (1960). Bivariate extreme statistics, I. Annals of the Institute of Statistical Mathematics, 1, 195–210.
Smith, R. L. (2003). Statistics of extremes: With applications in the environment, insurance and finance. In: B. Finkenstädt & H. Rootzén (Eds), Extreme values in finance, telecommunications and the environment (pp. 1–78). Boca Raton: Chapman & Hall/CRC.
Smith, R. L., & Weissman, I. (1996). Characterization and estimation of the multivariate extremal index. Technical report, University of North Carolina, USA.
Straumann, D. (2003). Estimation in conditionally heteroscedastic time series models. Ph.D. thesis, Laboratory of Actuarial Mathematics, University of Copenhagen.
Tsay, R. S. (1999). Extreme value analysis of financial data. Manuscript, University of Chicago.
Zhang, Z. (2003a). Quotient correlation: A sample based alternative to Pearson's correlation. Manuscript, Washington University.
Zhang, Z. (2003b). The estimation of M4 processes with geometric moving patterns. Manuscript, Washington University.
Zhang, Z. (2004). Some results on approximating max-stable processes. Manuscript, Washington University.
Zhang, Z., & Smith, R. L. (2003). On the estimation and application of max-stable processes. Manuscript, Washington University.
Zhang, Z., & Smith, R. L. (2004). The behavior of multivariate maxima of moving maxima processes. Journal of Applied Probability, 41, 1113–1123.
APPENDIX

Proof of Proposition 5.1. We have
$$P(R_j^* > u,\ R_{j+k}^* > u) = P(Y_j + N_j > u,\ Y_{j+k} + N_{j+k} > u,\ \xi_j = 1,\ \xi_{j+k} = 1) = p_1\, p_{11}^{(k)}\, P(Y_j + N_j > u,\ Y_{j+k} + N_{j+k} > u),$$
$$P(R_j^* > u) = p_1\, P(Y_j + N_j > u).$$
Let $f(x)$ and $M$ be the density and the bound of $N_j$. Then
$$\begin{aligned}
\frac{P(R_j^* > u,\ R_{j+k}^* > u)}{P(R_j^* > u)}
&= p_{11}^{(k)}\, \frac{P(Y_j + N_j > u,\ Y_{j+k} + N_{j+k} > u)}{P(Y_j + N_j > u)} \\
&= p_{11}^{(k)}\, \frac{\int_{-M}^{M}\!\int_{-M}^{M} P(Y_j > u - x,\ Y_{j+k} > u - y)\, f(x) f(y)\, dx\, dy}{\int_{-M}^{M} P(Y_j > u - x)\, f(x)\, dx} \\
&= p_{11}^{(k)}\, \frac{\int_{-M}^{M}\!\int_{-M}^{M} \dfrac{P(Y_j > u - x,\ Y_{j+k} > u - y)}{P(Y_j > u)}\, f(x) f(y)\, dx\, dy}{\int_{-M}^{M} \dfrac{P(Y_j > u - x)}{P(Y_j > u)}\, f(x)\, dx}.
\end{aligned}$$
It is easy to see that $\lim_{u \to \infty} P(Y_j > u - x)/P(Y_j > u) = 1$, and
$$\frac{P(Y_j > u - x,\ Y_{j+k} > u - y)}{P(Y_j > u)} = \frac{P(Y_j > u - x)}{P(Y_j > u)} \cdot \frac{P(Y_j > u - x,\ Y_{j+k} > u - y)}{P(Y_j > u - x)}.$$
We have
$$\begin{aligned}
\frac{P(Y_j > w,\ Y_{j+k} > w + z)}{P(Y_j > w)}
&= \frac{1 - e^{-1/w} - e^{-1/(w+z)} + e^{-1/w - 1/(w+z) + \min(a_{k0}/w,\ a_{kk}/(w+z))}}{1 - e^{-1/w}} \\
&= 1 - e^{-1/(w+z)}\, \frac{1 - e^{-1/w + \min(a_{k0}/w,\ a_{kk}/(w+z))}}{1 - e^{-1/w}}.
\end{aligned}$$
Since for $w > 0$, $z > 0$ (and similarly for $z < 0$) we have
$$\min\Big(\frac{a_{k0}}{w+z}, \frac{a_{kk}}{w+z}\Big) \le \min\Big(\frac{a_{k0}}{w}, \frac{a_{kk}}{w+z}\Big) \le \min\Big(\frac{a_{k0}}{w}, \frac{a_{kk}}{w}\Big),$$
it follows that
$$\frac{1 - e^{-1/w + \min(a_{k0}, a_{kk})/w}}{1 - e^{-1/w}} \le \frac{1 - e^{-1/w + \min(a_{k0}/w,\ a_{kk}/(w+z))}}{1 - e^{-1/w}} \le \frac{1 - e^{-1/w + \min(a_{k0}, a_{kk})/(w+z)}}{1 - e^{-1/w}},$$
which gives
$$\lim_{w \to \infty} \frac{1 - e^{-1/w + \min(a_{k0}, a_{kk})/w}}{1 - e^{-1/w}} = 1 - \min(a_{k0}, a_{kk}), \qquad
\lim_{w \to \infty} \frac{1 - e^{-1/w + \min(a_{k0}, a_{kk})/(w+z)}}{1 - e^{-1/w}} = 1 - \min(a_{k0}, a_{kk}),$$
and hence
$$\lim_{w \to \infty} \frac{1 - e^{-1/w + \min(a_{k0}/w,\ a_{kk}/(w+z))}}{1 - e^{-1/w}} = 1 - \min(a_{k0}, a_{kk}).$$
So we have
$$\lim_{w \to \infty} \frac{P(Y_j > w,\ Y_{j+k} > w + z)}{P(Y_j > w)} = \min(a_{k0}, a_{kk}).$$
Let $w = u - x$ and $z = x - y$; then
$$\lim_{u \to \infty} \frac{P(Y_j > u - x,\ Y_{j+k} > u - y)}{P(Y_j > u)} = \min(a_{k0}, a_{kk}),$$
which gives the desired result in (5.13).

Proof of Lemma 6.2. Let
$$U_1 = \big(I_{(R_1>u,\ R_2>u)} - \mu_1,\ I_{(R_1>u,\ R_3>u)} - \mu_2,\ \ldots,\ I_{(R_1>u,\ R_{1+L}>u)} - \mu_L,\ I_{(R_1>u)} - \mu_{1+L}\big)^T,$$
$$U_{1+k} = \big(I_{(R_{1+k}>u,\ R_{2+k}>u)} - \mu_1,\ I_{(R_{1+k}>u,\ R_{3+k}>u)} - \mu_2,\ \ldots,\ I_{(R_{1+k}>u,\ R_{1+k+L}>u)} - \mu_L,\ I_{(R_{1+k}>u)} - \mu_{1+L}\big)^T,$$
and let $a = (a_1, \ldots, a_{L+1})^T \ne 0$ be an arbitrary vector. Let $X_1 = a^T U_1,\ X_{1+k} = a^T U_{1+k}, \ldots$; then $E[X_n] = 0$ and $E[X_n^{12}] < \infty$, so Proposition 6.1 applies. (We say that an expectation is applied to a random matrix if it is applied to all of its elements.) Now $E[X_1^2] = a^T E[U_1 U_1^T] a = a^T \Sigma a$ and $E[X_1 X_{1+k}] = a^T E[U_1 U_{1+k}^T] a = a^T W_k a$, where the entries of $\Sigma$ and $W_k$ are computed from the following expressions. For $i = 1, \ldots, L$, $j = 1, \ldots, L$,
$$\sigma_{ij} = E\big(I_{\{R_1>u,\ R_{1+i}>u\}} - \mu_i\big)\big(I_{\{R_1>u,\ R_{1+j}>u\}} - \mu_j\big) = P(R_1>u,\ R_{1+i}>u,\ R_{1+j}>u) - \mu_i \mu_j,$$
$$w_{ijk} = E\big(I_{\{R_1>u,\ R_{1+i}>u\}} - \mu_i\big)\big(I_{\{R_{1+k}>u,\ R_{1+j+k}>u\}} - \mu_j\big) = P(R_1>u,\ R_{1+i}>u,\ R_{1+k}>u,\ R_{1+j+k}>u) - \mu_i \mu_j.$$
For $i = 1, \ldots, L$, $j = L+1$,
$$\sigma_{ij} = E\big(I_{\{R_1>u,\ R_{1+i}>u\}} - \mu_i\big)\big(I_{\{R_1>u\}} - \mu_j\big) = \mu_i - \mu_i \mu_j,$$
$$w_{ijk} = E\big(I_{\{R_1>u,\ R_{1+i}>u\}} - \mu_i\big)\big(I_{\{R_{1+k}>u\}} - \mu_j\big) = P(R_1>u,\ R_{1+i}>u,\ R_{1+k}>u) - \mu_i \mu_j.$$
For $i = L+1$, $j = 1, \ldots, L$,
$$\sigma_{ij} = \sigma_{ji}, \qquad w_{ijk} = E\big(I_{\{R_1>u\}} - \mu_i\big)\big(I_{\{R_{1+k}>u,\ R_{1+j+k}>u\}} - \mu_j\big) = P(R_1>u,\ R_{1+k}>u,\ R_{1+j+k}>u) - \mu_i \mu_j.$$
For $i = L+1$, $j = L+1$,
$$\sigma_{ij} = \mu_i - \mu_i \mu_j, \qquad w_{ijk} = E\big(I_{\{R_1>u\}} - \mu_i\big)\big(I_{\{R_{1+k}>u\}} - \mu_j\big) = P(R_1>u,\ R_{1+k}>u) - \mu_i \mu_j = \mu_k - \mu_i \mu_j.$$
So the proof is completed by applying the Cramér–Wold device.