The Econometrics Journal (2010), volume 13, pp. 145–176. doi: 10.1111/j.1368-423X.2010.00310.x

Specification and estimation of social interaction models with network structures

LUNG-FEI LEE†, XIAODONG LIU‡ AND XU LIN§

†Department of Economics, The Ohio State University, 410 Arps Hall, 1945 N High Street, Columbus, OH 43210, USA. E-mail: [email protected]
‡Department of Economics, University of Colorado at Boulder, 256 UCB, Boulder, CO 80309, USA. E-mail: [email protected]
§Department of Economics, Wayne State University, 656 W Kirby Street, 2074 FAB, Detroit, MI 48202, USA. E-mail: [email protected]

First version received: February 2009; final version accepted: December 2009

© The Author(s). Journal compilation © Royal Economic Society 2010. Published by Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.
Summary: This paper considers the specification and estimation of social interaction models with network structures and the presence of endogenous, contextual and correlated effects. With macro group settings, group-specific fixed effects are also incorporated in the model. The network structure provides information on the identification of the various interaction effects. We propose a quasi-maximum likelihood approach for the estimation of the model. We derive the asymptotic distribution of the proposed estimator, and provide Monte Carlo evidence on its small sample performance.

Keywords: Identification, Maximum likelihood, Network, Peer effects, Social interactions.
1. INTRODUCTION

Social interaction models study how interaction among individuals can lead to collective behaviour and aggregate patterns (Anselin, 2006). Such models are subjects of interest in the new social economics (Durlauf and Young, 2001). Empirical studies on social interactions can be found in Case (1991, 1992) on consumption patterns and technology adoption; Bertrand et al. (2000) on welfare cultures; and Sacerdote (2001), Hanushek et al. (2003) and Lin (2005, 2009) on student achievement, to name a few. In these studies, an individual belongs to a social group, and the individuals within a group may interact with each other. A general social interaction model incorporates endogenous effects, contextual effects and unobserved correlation effects. Identification of the endogenous interaction effect from the other effects is the main interest in social interaction models (see e.g. Manski, 1993, and Moffitt, 2001). In his seminal work, Manski (1993) has shown that linear regression models where the endogenous effect is specified in terms of the group mean suffer from the 'reflection problem': the various interaction effects cannot be separately identified.

Lee (2004, 2007) recognizes that many of the empirical studies of social interactions in a group setting have their model specifications related to the spatial autoregressive (SAR) model in the spatial econometrics literature (see e.g. Case, 1991, Bertrand et al., 2000, Moffitt, 2001, and Hanushek et al., 2003). Lee (2007) considers the SAR model in a group setting which allows endogenous group interactions, contextual factors and group-specific fixed effects. Lee's group interaction model assumes that an individual is equally influenced by all others in the group, so that the endogenous effect and contextual effect are specified, respectively, as the average outcomes and characteristics of the peers. Lee (2007) shows that the identification of the various social interaction effects is possible if there are sufficient variations in group sizes in the sample. The identification, however, can be weak if all of the group sizes are large. When there is no information on how individuals interact within a group, Lee's (2007) group interaction model is practical in that it assumes an individual is equally influenced by the peers.

In some data sets which are designed for the study of social interactions, information on the network structure within a group may be available. An example is the Add Health data (Udry, 2003), which records the 'named' friends, within a grade or a school, of each student in the sample. Such information on the connections of each individual (node) in a group (network) may be captured by the spatial weights matrix in a SAR model. Different from the equally weighted group interaction matrix in Lee (2007), the network weights matrix can be asymmetric and its off-diagonal entries may be zeros.
Such a weights matrix introduces more non-linearity, which helps identify the various social interaction effects beyond the variation of group sizes. Lin (2005) recognizes the value of the network structure and has estimated a network model of student academic achievement using the Add Health data. Lin's (2005) model has the specification of a SAR model, which includes group-specific fixed effects in addition to endogenous and contextual effects. Lin (2005) has discussed the differences between the network model and the linear-in-mean model of Manski (1993) and argued that the information on network structure helps identification. However, formal identification conditions were not explicitly derived in that paper. Subsequently, Bramoullé et al. (2009) investigate the identification of the network model in Lin (2005) by focusing on the network operator of the reduced form equation.

This paper discusses the specification, identification and estimation issues of the network model. The sample consists of many different groups, and a network is formed among individuals within a group. To capture the group unobservables, a group dummy is included. As there are many groups in the sample, the joint estimation of the group fixed effects with the structural parameters creates the 'incidental parameter' problem (Neyman and Scott, 1948). For this reason, Lee (2007) considers the within estimation method for the group interaction model, and Lin (2005) takes the difference of an individual's outcome from the average outcome of his/her named friends (or connections) to eliminate the group fixed effects. For the within equation, Lee (2007) discusses the 2SLS and (conditional) maximum likelihood (ML) methods for the model estimation. He shows that the ML method is efficient relative to the 2SLS. On the other hand, the empirical model in Lin (2005) is estimated by 2SLS after the elimination of group fixed effects.
The model considered in this paper has a similar specification to the network model in Lin (2005). In addition, we allow the disturbances of connected individuals to be correlated, so that the selection effect in a network can be partially captured.[1] We characterize the identification conditions of the extended SAR model based on features of the network structure, the role of exogenous variables, and the presence of correlated disturbances. We propose an alternative method to eliminate the group fixed effects, and compare the performance of the proposed elimination method with that of Lin (2005) in terms of estimation efficiency. For the estimation, we propose a quasi-maximum likelihood (QML) method which is computationally tractable and efficient relative to the 2SLS method. This likelihood is a partial likelihood in the terminology of Cox (1975).

The rest of the paper is organized as follows. Section 2 presents the SAR model with network structures; we interpret the specification of the model and discuss identification and estimation issues. Section 3 suggests a transformation of the model to eliminate group fixed effects. The implementation of the QML estimation of the transformed model is discussed in Section 4. Section 5 characterizes the identification conditions of the model and establishes the consistency of the QML estimator (QMLE). Section 6 derives the asymptotic distribution of the QMLE and compares the efficiency properties of the QMLE with the 2SLS estimator (2SLSE). Section 7 investigates the finite sample performance of the estimation methods, and the consequences of model misspecification, via Monte Carlo experiments. Section 8 briefly concludes.[2]
2. THE NETWORK MODEL WITH MACRO GROUPS

The model under consideration has the specification

    Y_nr = λ0 W_nr Y_nr + X_nr β10 + W_nr X_nr β20 + l_mr α_r0 + u_nr,    (2.1)

where u_nr = ρ0 M_nr u_nr + ε_nr for r = 1, …, r̄. Here r̄ is the total number of groups in the sample, m_r is the number of individuals in the rth group, and n = Σ_{r=1}^{r̄} m_r is the total number of sample observations. Y_nr = (y_{1r}, …, y_{m_r r})′ is an m_r-dimensional vector of y_{ir}, where y_{ir} is the observed outcome of the ith member in the macro group r. W_nr and M_nr are non-stochastic m_r × m_r network weights matrices, which may or may not be the same.[3] X_nr is an m_r × k matrix of exogenous variables.[4] l_mr is an m_r-dimensional vector of ones. ε_nr = (ε_{nr,1}, …, ε_{nr,m_r})′ is an m_r-dimensional vector of disturbances, where the ε_{nr,i}'s are i.i.d. with zero mean and variance σ0².

The specification of the weights matrices W_nr and M_nr in (2.1) captures the network structure of the macro group r.[5] In a group interaction model, with no information on how individuals interact within a group, it is typical to assume that each group member is equally affected by all the other members in that group, so that the weights matrix takes the special form W_nr^e = M_nr^e = (1/(m_r − 1))(l_mr l_mr′ − I_mr).[6] On the other hand, some data sets (e.g. the Add Health data as mentioned above) have information on the network structure. With such information, the (i, j) entry of the weights matrix is a non-zero constant if i is influenced by j, and zero otherwise. In principle, the influence is not necessarily reciprocal, and hence the weights matrices can be asymmetric. In this paper, we focus on the case where W_nr and M_nr are row-normalized such that the sum of each row is unity, that is, W_nr l_mr = M_nr l_mr = l_mr. Row normalization is popular in empirical studies of social interactions, as W_nr Y_nr can then be interpreted as the (weighted) average outcome (or behaviour) of the peers.[7]

The network model (2.1) is an equilibrium model in the sense that the observed outcomes Y_nr are simultaneously determined through the network structure within a group, under the assumption that (I_mr − λ0 W_nr) is invertible.[8] This model may have different economic contents under different contexts. One may interpret the equations in (2.1) as reaction functions in the industrial organization literature. Or, Y_nr may be regarded as the outcomes of the Nash equilibrium in a peer effect game (see Case et al., 1993, and Calvó-Armengol et al., 2006). In the spatial econometrics literature, the model (2.1) is an extended SAR model with SAR disturbances.[9] A typical SAR model, however, does not have a macro group structure, so group-specific effects are absent. As a model in the framework of social networks, which is our main focus, W_nr Y_nr captures the possible endogenous social interaction effect with the coefficient λ0, and W_nr X_nr captures the contextual effect with the coefficient β20. The endogenous effect refers to the contemporaneous influences of peers. The contextual effect includes characteristics of peers unaffected by the current behaviour. The incorporation of the contextual variables, here W_nr X_nr, in addition to X_nr, has a long history in the social interaction literature in sociology, predating models which allow simultaneity. α_r0 captures the unobserved group-specific effect, and M_nr u_nr captures the unobserved correlation effect among connected individuals with the coefficient ρ0.[10]

[1] If the network formation is endogenous due to the similar preferences of connected individuals, as argued in Moffitt (2001), disturbances of connected individuals may be correlated. Therefore, correlated disturbances shall be allowed in order to capture the endogenous network formation, which is regarded as an important selection issue in the empirical literature. Although network formation is assumed exogenous in this paper, such a specification of the disturbances is in the right direction for a better model.
[2] An empirical application to illustrate the practical use of the specified model and the proposed estimation method can be found in the working paper version of this paper and Lin (2009).
[3] Some empirical studies assume M_r = W_r (see e.g. Cohen, 2002, and Fingleton, 2008). On the other hand, some discussions on the possibility that M_r ≠ W_r can be found in LeSage (1999, pp. 87–88).
[4] Sometimes, model (2.1) can be specified as Y_nr = λ0 W_nr Y_nr + X_{1nr} β10 + W_nr X_{2nr} β20 + l_mr α_r0 + u_nr with u_nr = ρ0 M_nr u_nr + ε_nr. Here we assume X_{1nr} = X_{2nr} = X_nr without loss of generality. If X_{1nr} and X_{2nr} are not the same, they may be expanded to an X_nr which contains all the distinct columns of X_{1nr} and X_{2nr}. In that case, β10 and β20 will have zero restrictions in some of their entries.
[5] In an empirical study, one might have different specifications of the network weights matrix. Models with different network weights matrices would be different models, and we would have a model selection problem in practice. Some Monte Carlo studies in Lee (2008) provide evidence that model selection based on the maximized likelihood values can be quite effective. Such a model selection issue is interesting and important but is not the focus of this paper.
[6] A list of frequently used notations is provided in Appendix A for convenience of reference.
[7] In some cases, however, row-normalization is not plausible. For example, if a row has all zero elements, then it is impossible to normalize that row to one. Also, sometimes one may be interested in the aggregate influence rather than the average influence of the peers. Liu and Lee (2009) have proposed a GMM approach to estimate a social interaction model where the weights matrix is not row-normalized. The two models, with or without row-normalization, might address different empirical motivations and can be complementary to each other.
[8] A sufficient condition for (I_mr − λ0 W_nr) to be invertible is that ||λ0 W_nr|| < 1 for some matrix norm ||·||. For the case where W_nr, with all entries being non-negative, is row-normalized, a sufficient condition is |λ0| < 1.
[9] In the terminology of spatial econometrics, W_nr X_nr is called an exogenous spatial lag (Florax and Folmer, 1992), and a model with such a term is referred to as a spatial Durbin model (Anselin, 1988).
[10] In the literature of spatial econometrics, several approaches have been suggested for the specification of the form of spatial error dependence. In model (2.1), the regression error term u_nr is assumed to follow a SAR process. Under this specification, all the observations in a group are related to each other, with a decreasing correlation for higher orders of contiguity. Hence, such a structure is desirable as it induces global spatial autocorrelation within a group (Anselin, 2006). As an alternative, one can model the structure of spatial correlation based on a moving average process. However, such a specification only represents a local pattern of autocorrelation. For example, with a first-order moving average specification, there is no spatial covariance beyond the second neighbour (Anselin, 2006). We have shown, in Appendix E, that the proposed QML method can be extended to the model where the disturbances follow a more general spatial ARMA process. In some cases, one could model the spatial error dependence by assuming that the spatial correlation is a function of the distance between two observations (Cressie, 1993). Such a specification could be useful for geostatistic models but might be less so for social network models. For example, if the social network can be represented in a graph, the relationship between nodes could simply be represented by a binary indicator which is one for connected nodes and zero for unconnected ones. This is the case for the Add Health data (Udry, 2003), to which we have applied the proposed method in the empirical studies of our previous version and that of Lin (2009). In addition to these specifications of the disturbances, another possibility is to leave the covariance structure unspecified, as in Conley (1999) and Kelejian and Prucha (2007). For that alternative strategy, the main interest is to provide HAC covariance matrices for the 2SLS and/or GMM estimators. Our paper does not follow the latter strategy, as our interest is to consider efficient estimation for the model as well as the variance structure of the disturbances.
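To make the data-generating process in (2.1) concrete, the following is a minimal simulation sketch for a single group. All parameter values, the friendship probability, and the choice M_nr = W_nr are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6                                            # group size m_r (illustrative)
lam, rho, beta1, beta2, alpha = 0.3, 0.2, 1.0, 0.5, 2.0   # hypothetical true values

# Adjacency: entry (i, j) = 1 if i names j as a friend; row-normalize to get W.
A = (rng.random((m, m)) < 0.4).astype(float)
np.fill_diagonal(A, 0.0)
A[A.sum(axis=1) == 0, :] = 1.0 / (m - 1)          # guard against all-zero rows
np.fill_diagonal(A, 0.0)
W = A / A.sum(axis=1, keepdims=True)              # row sums are unity: W l = l

X = rng.normal(size=(m, 1))                       # one exogenous regressor (k = 1)
eps = rng.normal(size=m)
l = np.ones(m)
I = np.eye(m)

# SAR disturbances: u = (I - rho M)^{-1} eps, with M = W here for simplicity.
u = np.linalg.solve(I - rho * W, eps)
# Equilibrium outcomes from the reduced form: Y = (I - lam W)^{-1}(X b1 + W X b2 + l a + u)
Y = np.linalg.solve(I - lam * W,
                    (X * beta1 + W @ X * beta2).ravel() + alpha * l + u)
```

The final solve makes the equilibrium interpretation explicit: Y satisfies the simultaneous system (2.1) exactly, and with |λ0| < 1 and a row-normalized non-negative W the system is guaranteed to be invertible (footnote [8]).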
Manski's (1993) reflection problem refers to the difficulty of distinguishing between behavioural and contextual factors. Moffitt (2001) argues that the basic identification problem is how to distinguish correlations of outcomes that arise from social interactions from correlations that arise from correlated group unobservables. He points to two generic sources of correlated unobservables: one from preferences or other forces that lead certain types of individuals to be grouped together, and the second from unobservable common environmental factors. For our generalized network model with macro groups, while the second source is captured by the group-specific effect α_r0, the first source may be captured by the correlation effect parameter ρ0.

Although we treat α_r0 as the unobserved group effect of a macro group, such as a school-grade, this specification can be generalized if there are several network components in a macro group. In the terminology of networks, a component is formed by a maximal set of individuals directly or indirectly related to each other. A macro group may be regarded as the platform for a social network, and a social network may have a single component or several. In some applications, one may prefer to introduce a separate dummy for each component within a group instead of a single group dummy. Such a generalization is accommodated by model (2.1), as we may simply regard each component as a group.
3. ELIMINATION OF THE MACRO GROUP FIXED EFFECTS

In this paper, we allow the distribution of α_r0 to depend on X_nr and W_nr. We consider the estimation of the model conditional on the α_r0's by treating them as unknown parameters (as in the panel econometrics literature). To avoid the incidental parameter problem, we shall have the fixed effect parameters eliminated. In a linear panel regression model or a logit panel regression model with fixed effects, the fixed effect parameter can be eliminated by the method of conditional likelihood when effective sufficient statistics can be found for each of the fixed effects. For those panel models, the time average of the dependent variable provides the sufficient statistic (see Chamberlain, 1980, and Hsiao, 2003). However, effective sufficient statistics might not be available for many other models. The well-known example is the probit panel regression model, where the time average of the dependent variable does not provide the sufficient statistic, even though probit and logit models are close substitutes. For the group interaction model in Lee (2007), due to the specific structure of W_nr^e, the group average ȳ_r = (1/m_r) Σ_{i=1}^{m_r} y_{ir} does provide an effective sufficient statistic to eliminate the fixed effect parameter α_r0. The observation deviated from the group mean, (y_{ir} − ȳ_r), does not involve the fixed effect α_r0 and hence can be used in the conditional likelihood function for the estimation of the structural parameters. For a general network weights matrix W_nr, ȳ_r might not be a sufficient statistic for α_r0.[11] Even so, this paper suggests a method which eliminates the fixed effects and allows the estimation of the remaining parameters of interest via a QML framework, by exploiting the row-normalization property of the weights matrices.

[11] The model (2.1) implies that ȳ_r = (1/m_r) l_mr′ Y_nr = (λ0/m_r) l_mr′ W_nr Y_nr + (1/m_r) l_mr′ X_nr β10 + (1/m_r) l_mr′ W_nr X_nr β20 + α_r0 + (1/m_r) l_mr′ u_nr. ȳ_r does not provide a sufficient statistic for α_r0 when l_mr′ W_nr is not proportional to l_mr′, because l_mr′ W_nr Y_nr may then not be a function of ȳ_r.
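The proportionality condition in footnote [11] is easy to check numerically: for the equal-weight matrix W^e, l′W^e is a constant vector (so the group mean is sufficient), while for a generic row-normalized network matrix it is not. A small sketch with illustrative matrices:

```python
import numpy as np

m = 4
l = np.ones(m)
I = np.eye(m)

# Equal-weight group interaction matrix: W^e = (l l' - I)/(m - 1)
We = (np.outer(l, l) - I) / (m - 1)

# A generic row-normalized network weights matrix (rows sum to one)
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.0]])

# l' W gives the total weight each individual receives from the others.
print(l @ We)   # constant vector: l' W^e is proportional to l', mean is sufficient
print(l @ W)    # uneven vector: l' W Y_nr is not a function of the group mean
```

In the first case averaging (2.1) over the group isolates α_r0 through ȳ_r; in the second it does not, which is what motivates the transformation of Section 3.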
To simplify repeated notations, let S_nr(λ) = I_mr − λW_nr, S_nr = S_nr(λ0), R_nr(ρ) = I_mr − ρM_nr and R_nr = R_nr(ρ0). The reduced form equation of (2.1) is Y_nr = S_nr^{-1}(Z_nr β0 + l_mr α_r0 + R_nr^{-1} ε_nr), where Z_nr = (X_nr, W_nr X_nr) and β0 = (β10′, β20′)′. A Cochrane–Orcutt-type transformation introduces i.i.d. disturbances, so that

    R_nr S_nr Y_nr = R_nr Z_nr β0 + (1 − ρ0) l_mr α_r0 + ε_nr,    (3.1)

as R_nr l_mr = (1 − ρ0) l_mr. Let J_nr = I_mr − (1/m_r) l_mr l_mr′ be the deviation-from-group-mean projector. Premultiplication of (3.1) by J_nr eliminates the α_r0's, so we have

    J_nr R_nr S_nr Y_nr = J_nr R_nr Z_nr β0 + J_nr ε_nr.    (3.2)

The transformed disturbances J_nr ε_nr are linearly dependent because their variance matrix σ0² J_nr is singular. For an essentially equivalent but more effective transformation, we consider the orthonormal matrix of J_nr given by [F_nr, l_mr/√m_r]. The columns of F_nr are eigenvectors of J_nr corresponding to the eigenvalue one, such that F_nr′ l_mr = 0, F_nr′ F_nr = I_{m_r^*} and F_nr F_nr′ = J_nr, where m_r^* = m_r − 1. Premultiplication of (3.1) by F_nr′ leads to a transformed model without the α_r0's: F_nr′ R_nr S_nr Y_nr = F_nr′ R_nr Z_nr β0 + F_nr′ ε_nr. By Lemma C.1, this implies that[12]

    (F_nr′ R_nr F_nr)(F_nr′ S_nr F_nr) F_nr′ Y_nr = (F_nr′ R_nr F_nr) F_nr′ Z_nr β0 + F_nr′ ε_nr.    (3.3)

Denote Y_nr^* = F_nr′ Y_nr, Z_nr^* = F_nr′ Z_nr, W_nr^* = F_nr′ W_nr F_nr, M_nr^* = F_nr′ M_nr F_nr, S_nr^*(λ) = F_nr′ S_nr(λ) F_nr = I_{m_r^*} − λW_nr^*, and R_nr^*(ρ) = F_nr′ R_nr(ρ) F_nr = I_{m_r^*} − ρM_nr^*. Furthermore, denote S_nr^* = S_nr^*(λ0) and R_nr^* = R_nr^*(ρ0) for simplicity. The transformed model (3.3) can be rewritten more compactly as

    R_nr^* S_nr^* Y_nr^* = R_nr^* Z_nr^* β0 + ε_nr^*,    (3.4)

where ε_nr^* = F_nr′ ε_nr is an m_r^*-dimensional disturbance vector with zero mean and variance matrix σ0² I_{m_r^*}. Equation (3.4) is used for the estimation of the structural parameters in the model.

Some features of (3.4) may not conform to a typical SAR model. A spatial weights matrix in a conventional SAR model is specified to have zero diagonal elements. Such a specification facilitates the interpretation of spatial effects of neighbouring units on a spatial unit and excludes self-influence. A zero-diagonal spatial weights matrix is also utilized in Moran's test of spatial independence, so that the test statistic has zero mean under the null hypothesis of spatial independence. Many articles on spatial econometrics maintain this assumption. While W_nr and M_nr have zero diagonals, the transformed W_nr^* and M_nr^* do not.[13] Also, even though W_nr and M_nr are row-normalized, the transformed W_nr^* and M_nr^* do not preserve this feature. However, these do not turn out to be difficult issues for understanding the asymptotic properties of estimators. The difficulty, from the analytic point of view, lies in the uniform boundedness properties of the transformed spatial matrices. Furthermore, when the elements of ε_nr are i.i.d., the elements of ε_nr^* are only uncorrelated but, in general, not necessarily independent. So asymptotic results developed for the estimation of a typical SAR model, for example the QMLE in Lee (2004), may not directly apply. The following section discusses the implementation of the QML method for the transformed model.

[12] See Appendix C for some useful lemmas.
[13] As tr(W_nr) = 0 and W_nr l_mr = l_mr, W_nr^* has non-zero diagonal elements because tr(W_nr^*) = tr(W_nr F_nr F_nr′) = tr(W_nr J_nr) = tr(W_nr) − (1/m_r) tr(W_nr l_mr l_mr′) = −1.
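The matrix F_nr can be constructed directly from an eigendecomposition of J_nr. The following sketch is a numerical check of the properties used above (it is not the authors' code), including the fact that premultiplying by F_nr′ wipes out anything proportional to l_mr, and hence the fixed effect:

```python
import numpy as np

m = 5
l = np.ones((m, 1))
J = np.eye(m) - l @ l.T / m             # deviation-from-group-mean projector J_nr

# J has eigenvalue 1 with multiplicity m-1 and eigenvalue 0 on span{l}.
vals, vecs = np.linalg.eigh(J)
F = vecs[:, vals > 0.5]                 # the m-1 eigenvectors for eigenvalue one

assert F.shape == (m, m - 1)            # m_r^* = m_r - 1 columns
assert np.allclose(F.T @ l, 0.0)        # F' l = 0
assert np.allclose(F.T @ F, np.eye(m - 1))   # F'F = I_{m*}
assert np.allclose(F @ F.T, J)          # F F' = J

# F'(x + alpha * l) = F'x: the group fixed effect is eliminated exactly.
x = np.arange(m, dtype=float)
assert np.allclose(F.T @ (x + 7.0 * l.ravel()), F.T @ x)
```

Note that `eigh` returns one valid orthonormal basis among many; any F with these three properties yields the same transformed model (3.4), since the likelihood below depends on F_nr only through J_nr = F_nr F_nr′.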
4. QUASI-MAXIMUM LIKELIHOOD ESTIMATION

Let ε_nr^*(δ) = R_nr^*(ρ)[S_nr^*(λ) Y_nr^* − Z_nr^* β], where δ = (β′, λ, ρ)′. For a sample with r̄ macro groups, the log-likelihood function is

    ln L_n(θ) = −(n*/2) ln(2πσ²) + Σ_{r=1}^{r̄} ln|S_nr^*(λ)| + Σ_{r=1}^{r̄} ln|R_nr^*(ρ)| − (1/(2σ²)) Σ_{r=1}^{r̄} ε_nr^*′(δ) ε_nr^*(δ),    (4.1)

where θ = (δ′, σ²)′ and n* = Σ_{r=1}^{r̄} m_r^* = n − r̄ is the number of effective sample observations. The likelihood function has a partial likelihood (Cox, 1975) interpretation, as shown in Appendix D.

In order to implement the QML, the determinants and inverses of S_nr^*(λ) and R_nr^*(ρ) are needed. As |S_nr^*(λ)| = |S_nr(λ)|/(1 − λ) and |R_nr^*(ρ)| = |R_nr(ρ)|/(1 − ρ) by Lemma C.1, evaluating |S_nr^*(λ)| and |R_nr^*(ρ)| is exactly as tractable as evaluating |S_nr(λ)| and |R_nr(ρ)|.[14] Furthermore, as S_nr^*(λ)^{-1} = F_nr′ S_nr(λ)^{-1} F_nr and R_nr^*(ρ)^{-1} = F_nr′ R_nr(ρ)^{-1} F_nr by Lemma C.1, S_nr^*(λ) and R_nr^*(ρ) are invertible as long as the original matrices S_nr(λ) and R_nr(ρ) are invertible.

[14] When W_nr and M_nr are constructed as row-normalized weights matrices from original symmetric matrices, Ord (1975) suggests a computationally tractable method for the evaluation of |S_nr(λ)| and |R_nr(ρ)|. This will also be useful for evaluating |S_nr^*(λ)| and |R_nr^*(ρ)|, even though the row sums of the transformed spatial weights matrices W_nr^* and M_nr^* may not be unity.

Let ε_nr(δ) = R_nr(ρ)[S_nr(λ) Y_nr − Z_nr β]. As ε_nr^*(δ) = F_nr′ ε_nr(δ) by Lemma C.2, it follows that ε_nr^*′(δ) ε_nr^*(δ) = ε_nr′(δ) J_nr ε_nr(δ) because J_nr = F_nr F_nr′. Denote Y_n = (Y_n1′, …, Y_nr̄′)′, X_n = (X_n1′, …, X_nr̄′)′, Z_n = (Z_n1′, …, Z_nr̄′)′, ε_n = (ε_n1′, …, ε_nr̄′)′, W_n = Diag{W_n1, …, W_nr̄}, M_n = Diag{M_n1, …, M_nr̄} and J_n = Diag{J_n1, …, J_nr̄}. The log-likelihood function can then be evaluated without the F_nr's as

    ln L_n(θ) = −(n*/2) ln(2πσ²) + Σ_{r=1}^{r̄} ln(|S_nr(λ)|/(1 − λ)) + Σ_{r=1}^{r̄} ln(|R_nr(ρ)|/(1 − ρ)) − (1/(2σ²)) Σ_{r=1}^{r̄} ε_nr′(δ) J_nr ε_nr(δ)
             = −(n*/2) ln(2πσ²) + ln|S_n(λ)| + ln|R_n(ρ)| − r̄ ln[(1 − λ)(1 − ρ)] − (1/(2σ²)) ε_n′(δ) J_n ε_n(δ),    (4.2)

where ε_n(δ) = R_n(ρ)[S_n(λ) Y_n − Z_n β], S_n(λ) = I_n − λW_n and R_n(ρ) = I_n − ρM_n. For simplicity, let S_n = S_n(λ0) and R_n = R_n(ρ0).

In Lee's (2007) group interaction model, ρ0 = 0 and W_nr = W_nr^e = (1/m_r^*)(l_mr l_mr′ − I_mr). Hence, the likelihood function (4.2) becomes

    ln L_n(θ) = −(n*/2) ln(2πσ²) + Σ_{r=1}^{r̄} m_r^* [ln(m_r^* + λ) − ln m_r^*] − (1/(2σ²)) Σ_{r=1}^{r̄} ε_nr′(δ) J_nr ε_nr(δ),

where J_nr ε_nr(δ) = ((m_r^* + λ)/m_r^*) J_nr Y_nr − (J_nr X_nr, −(1/m_r^*) J_nr X_nr) β, because J_nr W_nr = −(1/m_r^*) J_nr and |I_mr − λW_nr| = (1 − λ)((m_r^* + λ)/m_r^*)^{m_r^*}. This is exactly the likelihood derived in Lee (2007). Thus, the proposed estimation approach in this paper generalizes Lee (2007).

For computational and analytical simplicity, the concentrated log likelihood can be derived by concentrating out β and σ². From (4.2), given γ = (λ, ρ)′, the QMLE of β0 is given
by β̂_n(γ) = [Z_n′ R_n′(ρ) J_n R_n(ρ) Z_n]^{-1} Z_n′ R_n′(ρ) J_n R_n(ρ) S_n(λ) Y_n, and the QMLE of σ0² is given by

    σ̂_n²(γ) = (1/n*) [S_n(λ) Y_n − Z_n β̂_n(γ)]′ R_n′(ρ) J_n R_n(ρ) [S_n(λ) Y_n − Z_n β̂_n(γ)]
             = (1/n*) Y_n′ S_n′(λ) R_n′(ρ) P_n(ρ) R_n(ρ) S_n(λ) Y_n,

where P_n(ρ) = J_n − J_n R_n(ρ) Z_n [Z_n′ R_n′(ρ) J_n R_n(ρ) Z_n]^{-1} Z_n′ R_n′(ρ) J_n, and P_n = P_n(ρ0) for simplicity. The concentrated log-likelihood function of γ is

    ln L_n(γ) = −(n*/2)(ln(2π) + 1) − (n*/2) ln σ̂_n²(γ) + ln|S_n(λ)| + ln|R_n(ρ)| − r̄ ln[(1 − λ)(1 − ρ)].    (4.3)

The QMLE γ̂_n = (λ̂_n, ρ̂_n)′ is the maximizer of the concentrated log likelihood (4.3). The QMLEs of β0 and σ0² are, respectively, β̂_n(γ̂_n) and σ̂_n²(γ̂_n).

For the asymptotic analysis, we assume the following regularity conditions.

ASSUMPTION 4.1. The {ε_nr,i}, i = 1, …, m_r, r = 1, …, r̄, are i.i.d. with mean zero and variance σ0².[15] The moment E(|ε_nr,i|^{4+η}) for some η > 0 exists.

ASSUMPTION 4.2. The elements of Z_n are uniformly bounded constants for all n.[16] Z_n has full rank 2k, and lim_{n→∞} (1/n) Z_n′ R_n′ J_n R_n Z_n exists and is non-singular.

ASSUMPTION 4.3. The sequences of row-normalized spatial weights matrices {W_n} and {M_n} are uniformly bounded in both row and column sums in absolute value.[17]

ASSUMPTION 4.4. {S_n^{-1}(λ)} and {R_n^{-1}(ρ)} are uniformly bounded in both row and column sums in absolute value, uniformly in γ in a compact parameter space Γ, with the true γ0 = (λ0, ρ0)′ in the interior of Γ.

The higher-than-fourth-moment condition in Assumption 4.1 is needed in order to apply a central limit theorem due to Kelejian and Prucha (2001). The non-stochastic Z_n and its uniform boundedness conditions in the first half of Assumption 4.2 are for analytical simplicity. The R_n Z_n are regressors transformed by the spatial filter R_n, and the J_n R_n Z_n are those transformed by the deviation-from-group-means projector J_n. The second half of Assumption 4.2 assumes that the exogenous regressors J_n R_n Z_n in the transformed model (3.2) are not multicollinear. Assumption 4.3 limits the spatial dependence among the units to a tractable degree and originates from Kelejian and Prucha (1999). It rules out the unit root case (in time series as a special case). Assumption 4.4 deals with the parameter space of γ to make sure that ln|S_n(λ)|, ln|R_n(ρ)|, ln[(1 − λ)(1 − ρ)], and their related derivatives are well behaved. As shown in Lee (2004), if ||W_n|| ≤ 1 and ||M_n|| ≤ 1, where ||·|| is a matrix norm, then {||S_n^{-1}(λ)||} and {||R_n^{-1}(ρ)||} are uniformly bounded in any subset of (−1, 1) bounded away from the boundary.

[15] Homoscedasticity might be a restrictive assumption, but it is beyond the scope of this paper to incorporate heteroscedasticity. Under unknown heteroscedasticity, one might need to consider an alternative estimation strategy such as an IV-based method (see Kelejian and Prucha, 2007, and Lin and Lee, 2009). However, the IV (or, in general, moment-based) estimation method can be sensitive in non-obvious ways to various implementation issues, such as the interaction between the choice of instruments and the specification of the model (LeSage and Pace, 2009, p. 56). Furthermore, the IV estimates can be imprecise when instruments are weak. For these reasons, we focus on likelihood-based techniques in this paper.
[16] If Z_nr is allowed to be stochastic, then appropriate moment conditions need to be imposed, and the results presented in this paper can be considered as conditional on Z_nr instead. Furthermore, if Z_nr is allowed to be correlated with ε_nr, then we have an endogenous regressor problem. In that case, estimation methods such as IV, which take the endogeneity issue into account, would be needed.
[17] A sequence of square matrices {A_n}, where A_n = [a_n,ij], is said to be uniformly bounded in row sums (column sums) in absolute value if the sequence of row sum matrix norms ||A_n||_∞ = max_{i=1,…,n} Σ_{j=1}^n |a_n,ij| (column sum matrix norms ||A_n||_1 = max_{j=1,…,n} Σ_{i=1}^n |a_n,ij|) is bounded (Horn and Johnson, 1985).
5. IDENTIFICATION AND CONSISTENCY

There is a fundamental identification issue for the network model, different from the reflection problem in Manski (1993), if λ0β10 + β20 = 0 and Wn = Mn, as summarized in the following lemma.

LEMMA 5.1. If λ0β10 + β20 = 0 and Wn = Mn, then the endogenous effect parameter λ0, the contextual effect parameter β20 and the correlated effect parameter ρ0 cannot be separately identified.

This problem is revealed by the reduced form equation of (3.4), which is

$$Y_{nr}^{*} = S_{nr}^{*-1} Z_{nr}^{*}\beta_0 + S_{nr}^{*-1} R_{nr}^{*-1}\epsilon_{nr}^{*}. \qquad (5.1)$$

With the restriction λ0β10 + β20 = 0, Z*nr β0 = X*nr β10 + W*nr X*nr β20 = (Im*r − λ0 W*nr)X*nr β10 = S*nr X*nr β10, and hence the reduced form equation (5.1) becomes Y*nr = X*nr β10 + v*nr, where v*nr = S*nr^{-1} R*nr^{-1} ε*nr. While β10 can be identified from the mean regression E(Y*nr | X*nr) = X*nr β10, both λ0 and β20 cannot be identified, as they do not appear in the mean regression equation. On the other hand, the disturbances v*nr follow a high-order SAR process, v*nr = ρ0 M*nr v*nr + λ0 W*nr v*nr − ρ0 λ0 M*nr W*nr v*nr + ε*nr, whose identification conditions have been considered in Lee and Liu (2008). If Wnr ≠ Mnr, so that W*nr ≠ M*nr, ρ0 and λ0 can be identified from the correlation structure of v*nr; β20 can then be identified via the restriction β20 = −λ0β10 once λ0 is identified. However, when Mnr = Wnr, v*nr = (ρ0 + λ0)W*nr v*nr − ρ0 λ0 W*nr² v*nr + ε*nr, and hence ρ0 and λ0 can only be locally identified but cannot be separately identified.

An interpretation of the situation λ0β10 + β20 = 0 is that (2.1) does not represent a reaction function with simultaneity but a model with spurious social correlation among peers. This is because, under the restriction λ0β10 + β20 = 0, (2.1) can be generated from the panel regression model Y*nr = X*nr β10 + v*nr with SAR disturbances. Let β10,j and β20,j be the jth elements of β10 and β20, respectively. The spurious social correlation model can be ruled out when β20,j ≠ 0 and β10,j = 0 for some j, or, in other words, when there is a relevant variable in Xn that affects Yn only through the contextual effect Wn Xn. For the linear-in-means model of Manski (1993), the identification of endogenous and exogenous interaction effects depends crucially on the existence of relevant variables in Xnr that directly affect Yn. For the network model, it is the behavioural interpretation of the parameters that can be problematic when λ0β10 + β20 = 0.18

18 This restriction can be tested even if λ0 and β20 are not identifiable. One may test the significance of the added regressor vector in the expanded equation Y*nr = X*nr β10 + W*nr X*nr ζ + v*nr by testing that ζ = 0.
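The collapse of the mean regression under λ0β10 + β20 = 0 is easy to verify numerically. The following sketch (not part of the paper; it assumes an illustrative row-normalized circulant network W) checks that two different values of λ0, each paired with β20 = −λ0β10, imply the identical reduced-form mean E(Y*|X*) = X*β10:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8                                   # illustrative group size (an assumption)
# Circulant "next two peers" network, row-normalized with zero diagonal.
W = np.zeros((m, m))
for i in range(m):
    W[i, (i + 1) % m] = 0.5
    W[i, (i + 2) % m] = 0.5

X = rng.standard_normal((m, 2))
beta1 = np.array([1.0, -0.5])

# Under the restriction beta2 = -lambda*beta1, the reduced-form mean
# S^{-1}(X beta1 + W X beta2) collapses to X beta1 for ANY lambda:
means = []
for lam in (0.2, 0.6):
    beta2 = -lam * beta1
    S = np.eye(m) - lam * W
    means.append(np.linalg.solve(S, X @ beta1 + W @ X @ beta2))
```

Both parameter points generate the same mean regression, so λ0 and β20 can only be recovered from the correlation structure of the disturbances, as the discussion of Lemma 5.1 indicates.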
The transformed equilibrium vector Jn Rn(ρ)Yn, for any ρ in its parameter space, can be represented as

$$J_n R_n(\rho) Y_n = \lambda_0 J_n R_n(\rho) G_n Z_n \beta_0 + J_n R_n(\rho) Z_n \beta_0 + J_n R_n(\rho) S_n^{-1} R_n^{-1} \epsilon_n, \qquad (5.2)$$

because Sn^{-1} = λ0 Gn + In, where Gn = Wn Sn^{-1}. A sufficient condition for global identification of θ0 is that the generated regressors Jn Rn(ρ)Gn Zn β0 and Jn Rn(ρ)Zn are not asymptotically multicollinear, and that the variance matrix of Jn Rn^{-1}εn is unique. Let

$$\sigma_{a,n}^{2}(\rho) = \frac{\sigma_0^2}{n^*}\operatorname{tr}\big(R_n^{\prime-1} R_n^{\prime}(\rho) J_n R_n(\rho) R_n^{-1}\big)$$

and

$$\sigma_{n}^{2}(\gamma) = \frac{\sigma_0^2}{n^*}\operatorname{tr}\big([R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}]' J_n [R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}]\big).$$

ASSUMPTION 5.1. Either (a) lim_{n→∞} (1/n*)[Gn Zn β0, Zn]' R'n(ρ) Jn Rn(ρ)[Gn Zn β0, Zn] exists and is non-singular for each possible ρ in its parameter space, and, for any ρ ≠ ρ0,

$$\lim_{n\to\infty}\frac{1}{n^*}\Big\{\ln\big|\sigma_{a,n}^{2}(\rho) R_n^{-1}(\rho) J_n R_n^{\prime-1}(\rho)\big| - \ln\big|\sigma_0^{2} R_n^{-1} J_n R_n^{\prime-1}\big|\Big\} \neq 0;$$

or (b) for any γ ≠ γ0,

$$\lim_{n\to\infty}\frac{1}{n^*}\Big\{\ln\big|\sigma_{n}^{2}(\gamma)[S_n^{-1}(\lambda) R_n^{-1}(\rho)] J_n [S_n^{-1}(\lambda) R_n^{-1}(\rho)]'\big| - \ln\big|\sigma_0^{2}(S_n^{-1} R_n^{-1}) J_n (S_n^{-1} R_n^{-1})'\big|\Big\} \neq 0.$$

The rank condition on Jn Rn(ρ)[Gn Zn β0, Zn] in Assumption 5.1(a) is for the identification of λ0 and β0 from the deterministic component of the reduced form equation (5.2). The following lemmas provide some sufficient conditions which imply this rank condition.

LEMMA 5.2. If β20 + λ0β10 ≠ 0 and [Xnr, Wnr Xnr, Wnr² Xnr, lmr] has full column rank for some group r, then Jn Rn(ρ)[Gn Zn β0, Zn] has full column rank.
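The rank condition of Lemma 5.2 can be checked directly for a given group. A minimal sketch, assuming a hypothetical irregular directed network of size m = 12 and k = 2 regressors (both choices are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 12, 2                            # one group; needs m > 3k + 1
# Irregular directed network: a ring plus random extra links, row-normalized.
A = (rng.random((m, m)) < 0.3).astype(float)
np.fill_diagonal(A, 0.0)
for i in range(m):
    A[i, (i + 1) % m] = 1.0             # guarantees no isolated row
W = A / A.sum(axis=1, keepdims=True)

X = rng.standard_normal((m, k))
l = np.ones((m, 1))
M = np.hstack([X, W @ X, W @ W @ X, l])  # the matrix in Lemma 5.2

rank = np.linalg.matrix_rank(M)          # generically 3k + 1 for an irregular W
```

With a generic asymmetric network the matrix attains full column rank 3k + 1, so a single sufficiently large group can already deliver identification.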
Lemma 5.2 gives a sufficient condition for the rank condition in Assumption 5.1(a) based on the network structure of a single group, which is feasible only if the size of that group is greater than 3k + 1, where k is the column rank of Xnr. If there are not enough members in any of the groups in the sample, information across groups needs to be explored to achieve identification. A sufficient condition for the rank condition in Assumption 5.1(a) based on the whole sample is given as follows.

LEMMA 5.3. If β20 + λ0β10 ≠ 0 and Jn[Xn, Wn Xn, Wn² Xn] has full column rank, then Jn Rn(ρ)[Gn Zn β0, Zn] has full column rank.

The group interaction model in Lee (2007) has the spatial weights matrix W^e_nr = (1/(mr − 1))(lmr l'mr − Imr). As Jnr W^e_nr = −(1/(mr − 1))Jnr,

$$J_{nr}\big[X_{nr},\, W_{nr}^{e} X_{nr},\, (W_{nr}^{e})^{2} X_{nr}\big] = J_{nr}\Big[X_{nr},\, -\tfrac{1}{m_r-1}X_{nr},\, \tfrac{1}{(m_r-1)^{2}}X_{nr}\Big]$$

does not have full column rank. Identification is not possible with only a single group. On the other hand, let c1, c2 and c3 be conformable vectors such that Jnr Xnr c1 + Jnr W^e_nr Xnr c2 + Jnr (W^e_nr)² Xnr c3 = 0, or, more explicitly, Jnr Xnr [c1 − (1/(mr − 1))c2 + (1/(mr − 1)²)c3] = 0 by Jnr W^e_nr = −(1/(mr − 1))Jnr. As Jnr Xnr ≠ 0, if there are at least three distinct values of the mr's in the sample, the equality holds only if c1 = c2 = c3 = 0. Hence, if there is sufficient group size variation, then Jn[Xn, Wn Xn, Wn² Xn] has full column rank, which implies that the rank condition in Assumption 5.1(a) holds by Lemma 5.3.

The identification of the endogenous effect, and hence the exogenous effect, may be intuitively illustrated via the reduced form. For a group r,

$$Y_{nr}^{*} = S_{nr}^{*-1}\big(X_{nr}^{*}\beta_{10} + W_{nr}^{*} X_{nr}^{*}\beta_{20}\big) + S_{nr}^{*-1} R_{nr}^{*-1}\epsilon_{nr}^{*} = X_{nr}^{*}\beta_{10} + \sum_{j=0}^{\infty}\lambda_0^{j} W_{nr}^{*\,j+1} X_{nr}^{*}(\beta_{20} + \lambda_0\beta_{10}) + S_{nr}^{*-1} R_{nr}^{*-1}\epsilon_{nr}^{*},$$
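The role of group size variation can be illustrated with Lee's (2007) equal-weight matrices. The sketch below (the sizes and helper functions are illustrative assumptions) shows that with a single group size the columns of Jn[Xn, Wn Xn, Wn² Xn] collapse into the span of Jn Xn, while three distinct sizes restore full column rank, in line with the Vandermonde-type argument above:

```python
import numpy as np

def block_diag(blocks):
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        m = b.shape[0]
        out[i:i + m, i:i + m] = b
        i += m
    return out

def We(m):   # equal-weight group interaction matrix of Lee (2007)
    return (np.ones((m, m)) - np.eye(m)) / (m - 1)

def Jm(m):   # within-group deviation-from-mean projector
    return np.eye(m) - np.ones((m, m)) / m

def rank_of_design(sizes, k=2, seed=2):
    rng = np.random.default_rng(seed)
    Wn = block_diag([We(m) for m in sizes])
    Jn = block_diag([Jm(m) for m in sizes])
    X = rng.standard_normal((sum(sizes), k))
    return np.linalg.matrix_rank(Jn @ np.hstack([X, Wn @ X, Wn @ Wn @ X]))

r_equal = rank_of_design([4, 4, 4])      # all columns collapse into span(Jn X): rank k
r_varied = rank_of_design([4, 6, 9])     # three distinct sizes give full rank 3k
```

With equal sizes the three blocks share the same scalar factor −1/(m−1), so the design has rank k; three distinct sizes yield three distinct factors and rank 3k.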
because $S_{nr}^{*-1} = \sum_{j=0}^{\infty}\lambda_0^{j} W_{nr}^{*j}$ when sup‖λ0 W*nr‖∞ < 1. The effects of X*nr on Y*nr can be decomposed in layers. The direct effect of X*nr is captured by β10, the effect due to immediate neighbours is captured by (β20 + λ0β10), and that due to neighbours of neighbours in the second layer is captured by (β20 + λ0β10)λ0, with the discount factor λ0. So if the immediate neighbours can be distinguished from the second-layer neighbours, the discount factor provides the identification of the endogenous effect λ0. For the case with W^e_nr = (1/(mr − 1))(lmr l'mr − Imr), as (F'nr W^e_nr Fnr)² = −(1/(mr − 1))F'nr W^e_nr Fnr, the net effect of X*nr on Y*nr through W^e_nr in group r is captured by the coefficient $(\beta_{20} + \lambda_0\beta_{10})\sum_{j=0}^{\infty}\big(-\tfrac{\lambda_0}{m_r-1}\big)^{j}$. Hence, the endogenous effect λ0 can be identified only by comparing these net effects across groups with different sizes.

An example where the rank condition above fails is the complete bipartite network, where individuals in a group are divided into two blocks such that each individual in one block is connected to all individuals in the other block but to none in the same block, and vice versa. This includes the star network, where one individual is connected to all other individuals in a group and all the others connect only to him. This example is due to Bramoullé et al. (2009) for a different transformation.19 For the complete bipartite network,

$$W_{nr} = \begin{pmatrix} 0 & \frac{1}{m_{r2}} l_{m_{r1}} l'_{m_{r2}} \\ \frac{1}{m_{r1}} l_{m_{r2}} l'_{m_{r1}} & 0 \end{pmatrix}, \quad \text{with } m_{r1} + m_{r2} = m_r.$$

It implies that

$$W_{nr}^{2} = \begin{pmatrix} \frac{1}{m_{r1}} l_{m_{r1}} l'_{m_{r1}} & 0 \\ 0 & \frac{1}{m_{r2}} l_{m_{r2}} l'_{m_{r2}} \end{pmatrix}.$$

Consequently, $W_{nr} + W_{nr}^{2} = \big[\tfrac{1}{m_{r1}} l_{m_r} l'_{m_{r1}},\, \tfrac{1}{m_{r2}} l_{m_r} l'_{m_{r2}}\big]$, with all its columns proportional to lmr. This implies, in particular, that the column space spanned by the columns of [Wnr Xnr, Wnr² Xnr] contains vectors proportional to lmr, so that Jnr(Wnr + Wnr²)Xnr = 0. So if all groups in a sample consist of complete bipartite networks, the rank condition in Assumption 5.1(a) may not hold.

We have discussed the rank condition in Assumption 5.1(a) for the identification of λ0 and β0 in the mean regression function of the reduced form equation. The second part of Assumption 5.1(a) is for the identification of ρ0 in the SAR error process. It is clear that ρ0 cannot be identified from the mean regression function, as Rn(ρ) only plays the role of weighting sample observations for efficient estimation. So ρ0 needs to be identified from the disturbances Sn^{-1}Rn^{-1}εn. On the other hand, when Jn Rn(ρ)Gn Zn β0 and Jn Rn(ρ)Zn are linearly dependent or

19 Bramoullé et al. (2009) point out this underidentification case for the model with the transformation Imr − Wnr, which has been utilized in Lin (2005) to eliminate the group effect.
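The failure of the rank condition for complete bipartite (and hence star) networks can be confirmed numerically; the block sizes below are illustrative assumptions (a star corresponds to m_{r1} = 1):

```python
import numpy as np

rng = np.random.default_rng(3)
m1, m2 = 3, 5                            # complete bipartite blocks, m = 8
m = m1 + m2
W = np.zeros((m, m))
W[:m1, m1:] = 1.0 / m2                   # block 1 members weight all of block 2
W[m1:, :m1] = 1.0 / m1                   # and vice versa
J = np.eye(m) - np.ones((m, m)) / m

# Every column of W + W^2 is constant (proportional to the vector of ones) ...
col_ranges = (W + W @ W).max(axis=0) - (W + W @ W).min(axis=0)

# ... so the within-group projector annihilates W X + W^2 X, and the columns of
# J[WX, W^2 X] are linearly dependent: the rank condition fails.
X = rng.standard_normal((m, 2))
dependent = J @ (W @ X + W @ W @ X)
```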
asymptotically multicollinear as n goes to infinity, a global identification condition would be related to the uniqueness of the variance matrix of Jn Yn, which is given by Assumption 5.1(b).20

Finally, we would like to point out that the division by the effective sample size n* in the limiting conditions in Assumption 5.1 (as well as in Assumption 6.1 below) rules out the case of large group interactions, which has been considered in Lee (2004, 2007). In that case, both the endogenous and exogenous interaction effects would be weakly identified and their rates of convergence could be quite low. Network models, by contrast, emphasize 'small world' interactions, which are the main interest in the network literature and in the empirical application of this paper.

Let Qn(γ) = max_{β,σ²} E(ln Ln(θ)). The solutions of this maximization problem are

$$\beta_n^{*}(\gamma) = [Z_n' R_n'(\rho) J_n R_n(\rho) Z_n]^{-1} Z_n' R_n'(\rho) J_n R_n(\rho) S_n(\lambda) S_n^{-1} Z_n \beta_0$$

and

$$\begin{aligned}
\sigma_n^{*2}(\gamma) &= \frac{1}{n^*} E\big\{[S_n(\lambda)Y_n - Z_n\beta_n^{*}(\gamma)]' R_n'(\rho) J_n R_n(\rho)[S_n(\lambda)Y_n - Z_n\beta_n^{*}(\gamma)]\big\} \\
&= \frac{1}{n^*}(\lambda_0 - \lambda)^{2}(G_n Z_n\beta_0)' R_n'(\rho) P_n(\rho) R_n(\rho) G_n Z_n\beta_0 + \frac{\sigma_0^{2}}{n^*}\operatorname{tr}\big[(S_n^{-1}R_n^{-1})' S_n'(\lambda) R_n'(\rho) J_n R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\big].
\end{aligned}$$

Hence,

$$Q_n(\gamma) = -\frac{n^*}{2}(\ln(2\pi) + 1) - \frac{n^*}{2}\ln\sigma_n^{*2}(\gamma) + \ln|S_n(\lambda)| + \ln|R_n(\rho)| - \bar{r}\ln[(1-\lambda)(1-\rho)]. \qquad (5.3)$$

Identification of γ0 can be based on the maximum of (1/n*)Qn(γ). With identification and uniform convergence of (1/n*)ln Ln(γ) − (1/n*)Qn(γ) to zero on the parameter space of γ, consistency of θ̂n follows.

PROPOSITION 5.1. Under Assumptions 4.1–5.1, θ0 is globally identifiable and θ̂n is a consistent estimator of θ0.
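The concentration step behind (5.3) can be sanity-checked on noise-free data, where the sample analogues of β*n(γ) and σ*²n(γ) evaluated at γ0 should return β0 and zero exactly (the within-group projector wipes out the fixed effect). The single-group design below, with an assumed circulant W = M, is an illustrative sketch, not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 10
W = np.zeros((m, m))
for i in range(m):
    W[i, (i + 1) % m] = 0.5
    W[i, (i + 2) % m] = 0.5
M = W                                     # as in the paper's Monte Carlo, Mn = Wn
I = np.eye(m)
J = I - np.ones((m, m)) / m

lam0, rho0 = 0.4, 0.3
beta0 = np.array([1.0, -0.5, 0.8, 0.6])   # (beta1', beta2')'
X = rng.standard_normal((m, 2))
Z = np.hstack([X, W @ X])
alpha = 2.0                               # group fixed effect

# Noise-free equilibrium: S(lam0) Y = Z beta0 + alpha*l  (epsilon = 0)
Y = np.linalg.solve(I - lam0 * W, Z @ beta0 + alpha * np.ones(m))

def beta_star(lam, rho):
    R = I - rho * M
    A = R @ Z
    return np.linalg.solve(A.T @ J @ A, A.T @ J @ R @ ((I - lam * W) @ Y))

def sigma2_star(lam, rho):
    R = I - rho * M
    e = J @ R @ ((I - lam * W) @ Y - Z @ beta_star(lam, rho))
    return e @ e / (m - 1)                # effective sample size n* = m - 1

b_hat = beta_star(lam0, rho0)             # recovers beta0
s2_hat = sigma2_star(lam0, rho0)          # zero: J annihilates the fixed effect
```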
6. ASYMPTOTIC DISTRIBUTIONS

From the Taylor expansion of ∂ln Ln(θ̂n)/∂θ = 0, it follows that

$$\sqrt{n^*}(\hat\theta_n - \theta_0) = -\left[\frac{1}{n^*}\frac{\partial^{2}\ln L_n(\tilde\theta_n)}{\partial\theta\,\partial\theta'}\right]^{-1}\frac{1}{\sqrt{n^*}}\frac{\partial\ln L_n(\theta_0)}{\partial\theta},$$

for some θ̃n between θ̂n and θ0. The first-order derivatives of the log-likelihood function at θ0, given in Appendix B, are linear or quadratic functions of εn. The asymptotic distribution of the first-order derivatives may be derived from central limit theorems in Kelejian and Prucha (2001). The variance matrix of (1/√n*)∂ln Ln(θ0)/∂θ is E[(1/√n*)(∂ln Ln(θ0)/∂θ) · (1/√n*)(∂ln Ln(θ0)/∂θ)'] = Σθ,n + Ωθ,n, where Σθ,n = −E[(1/n*)∂²ln Ln(θ0)/∂θ∂θ'] is the symmetric average Hessian matrix, and Ωθ,n is a symmetric
20 The identification of λ0 and/or ρ0 via the variance structure is exactly as for the SAR models in Lee (2004) and Lee and Liu (2008).
matrix such that Ωθ,n = 0 when the εnr,i's are normally distributed.21 Assumption 5.1(a) is sufficient to guarantee that the limiting average Hessian matrix is non-singular. If γ0 is a regular point (Rothenberg, 1971), as Assumption 5.1(b) is a global identification condition which implies local identification, the limiting average Hessian matrix will also be non-singular. The sufficient condition which complements Assumption 5.1(b) for this purpose is given as follows. Let A^s = A + A' for a square matrix A. Let Cn = Jn Rn Gn Rn^{-1} − (1/n)tr(Jn Rn Gn Rn^{-1})In and Dn = Jn Hn − (1/n)tr(Jn Hn)In.

ASSUMPTION 6.1.

$$\lim_{n\to\infty}\Big(\frac{1}{n^*}\Big)^{2}\big[\operatorname{tr}(D_n^{s} D_n^{s})\operatorname{tr}(C_n^{s} C_n^{s}) - \operatorname{tr}^{2}(C_n^{s} D_n^{s})\big] > 0.$$
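The bracketed quantity in Assumption 6.1 is non-negative by the Cauchy–Schwarz inequality in the trace inner product, so the assumption only rules out asymptotic proportionality of C^s_n and D^s_n. A sketch of the finite-sample analogue (the toy network and parameter values are assumptions; the n* scaling and the limit are omitted):

```python
import numpy as np

m = 8
I = np.eye(m)
J = I - np.ones((m, m)) / m
W = np.zeros((m, m))
for i in range(m):
    W[i, (i + 1) % m], W[i, (i + 2) % m] = 0.7, 0.3
M = W
lam0, rho0 = 0.4, 0.3

S_inv = np.linalg.inv(I - lam0 * W)
R = I - rho0 * M
R_inv = np.linalg.inv(R)
G_t = R @ (W @ S_inv) @ R_inv             # G~n = Rn Gn Rn^{-1}
H = M @ R_inv                             # Hn = Mn Rn^{-1}

C = J @ G_t - np.trace(J @ G_t) / m * I   # Cn
D = J @ H - np.trace(J @ H) / m * I       # Dn
Cs, Ds = C + C.T, D + D.T                 # A^s = A + A'

# tr(Ds Ds) tr(Cs Cs) - tr(Cs Ds)^2 >= 0 by Cauchy-Schwarz; strict unless
# Cs and Ds are proportional, which generically they are not.
gap = np.trace(Ds @ Ds) * np.trace(Cs @ Cs) - np.trace(Cs @ Ds) ** 2
```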
PROPOSITION 6.1. Under Assumptions 4.1–4.4 and 5.1(a), or 4.1–4.4, 5.1(b) and 6.1, √n*(θ̂n − θ0) →d N(0, Σθ^{-1} + Σθ^{-1}ΩθΣθ^{-1}), where Σθ = lim_{n→∞} Σθ,n and Ωθ = lim_{n→∞} Ωθ,n, which are assumed to exist. If the εnr,i's are normally distributed, then √n*(θ̂n − θ0) →d N(0, Σθ^{-1}).

For the transformed model Jn Yn = λ0 Jn Wn Yn + Jn Zn β0 + Jn Rn^{-1}εn, a computationally convenient estimation method is the generalized 2SLS (G2SLS) of Kelejian and Prucha (1998). In the first step of G2SLS, ζ0 = (β0', λ0)' is estimated by 2SLS with an IV matrix Q1n:

$$\hat\zeta_{2sls,n} = [(Z_n, W_n Y_n)' J_n P_{1n} J_n (Z_n, W_n Y_n)]^{-1}(Z_n, W_n Y_n)' J_n P_{1n} J_n Y_n,$$

where P1n = Q1n(Q'1n Q1n)^{-1}Q'1n. With the initial 2SLSE ζ̂2sls,n, Jn Rn^{-1}εn can be estimated as a residual, and ρ0 can be estimated by the method of moments (MOM) of Kelejian and Prucha (1999). Let ρ̂mom,n be the consistent MOM estimate of ρ0 and R̂n = Rn(ρ̂mom,n). The feasible G2SLS estimator (G2SLSE) of ζ0 in the model is

$$\hat\zeta_{g2sls,n} = [(Z_n, W_n Y_n)' \hat R_n' J_n P_{2n} J_n \hat R_n (Z_n, W_n Y_n)]^{-1}(Z_n, W_n Y_n)' \hat R_n' J_n P_{2n} J_n \hat R_n Y_n,$$

where P2n = Q2n(Q'2n Q2n)^{-1}Q'2n for some IV matrix Q2n. The G2SLSE is consistent and asymptotically normal with

$$\sqrt{n^*}(\hat\zeta_{g2sls,n} - \zeta_0) \xrightarrow{d} N\left(0,\; \sigma_0^{2}\lim_{n\to\infty}\Big[\frac{1}{n^*}(Z_n, G_n Z_n\beta_0)' R_n' J_n P_{2n} J_n R_n (Z_n, G_n Z_n\beta_0)\Big]^{-1}\right).$$

It follows from the generalized Schwarz inequality that the best selection of Q2n is Jn R̂n[Zn, Gn(λ̂n)Zn β̂n], where Gn(λ) = Wn Sn^{-1}(λ), and the variance matrix of the best G2SLS estimator ζ̂b2sls,n is (1/n*)Σζ,n^{-1}, where Σζ,n = (1/(σ0² n*))(Zn, Gn Zn β0)' Rn' Jn Rn (Zn, Gn Zn β0). When the εnr,i's are normally distributed, the variance matrix of ζ̂b2sls,n can easily be compared with that of the MLE.

PROPOSITION 6.2. When the εnr,i's are normally distributed, the MLE is more efficient than the best G2SLS estimator.
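The first-step 2SLS formula can be sketched directly. On noise-free data (ε = 0) the transformed model Jn Yn = λ0 Jn Wn Yn + Jn Zn β0 holds exactly, so the estimator below recovers (β0', λ0)' up to rounding; the network and the use of a pseudo-inverse projector (to handle any redundant IV columns) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 20
I = np.eye(m)
J = I - np.ones((m, m)) / m
W = np.zeros((m, m))
for i in range(m):
    W[i, (i + 1) % m] = 0.5
    W[i, (i + 2) % m] = 0.5

lam0 = 0.4
beta0 = np.array([1.0, -0.5, 0.8, 0.6])   # (beta1', beta2')'
X = rng.standard_normal((m, 2))
Z = np.hstack([X, W @ X])
Y = np.linalg.solve(I - lam0 * W, Z @ beta0 + 2.0 * np.ones(m))  # eps = 0

# IVs generated from the network structure (distinct columns of Jn[X, WX, ...]):
Q = J @ np.hstack([X, W @ X, W @ W @ X, W @ W @ W @ X])
P = Q @ np.linalg.pinv(Q)                 # orthogonal projector onto the IV space

D = np.hstack([Z, (W @ Y)[:, None]])      # regressors (Zn, Wn Yn)
zeta = np.linalg.solve(D.T @ J @ P @ J @ D, D.T @ J @ P @ J @ Y)
# zeta stacks (beta', lambda)'; exact here because the model is noise-free
```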
21 The explicit expressions of Σθ,n and Ωθ,n are given in Appendix B.
7. MONTE CARLO RESULTS

To investigate the finite sample performance of the MLE, we consider the following model: Ynr = λ0 Wnr Ynr + Xnr β10 + Wnr Xnr β20 + lmr αr0 + unr, where unr = ρ0 Wnr unr + εnr and εnr ∼ N(0, σ0² Imr), for r = 1, . . . , r̄. The weights matrix Wnr is based on the Add Health data (Udry, 2003). For the Monte Carlo study, we consider four samples. The first sample consists of groups with group size less than or equal to 30. There are 102 such groups in the data, with 1344 observations and an average group size of 13.1. We also consider a sub-sample with group size less than or equal to 15. In the data, there are 67 such small groups, with 557 observations and an average group size of 8.3. To facilitate comparison, we also randomly pick 67 groups with group size less than or equal to 30 and 102 groups with group size less than or equal to 50 from the data. For the first sample of randomly picked groups, the sample size is 877, with an average group size of 13.1. For the second, the sample size is 2279, with an average group size of 22.3. This allows us to inspect the effects of increasing the number of groups r̄ and increasing the average group size separately.

The number of repetitions is 400 for each case in this Monte Carlo experiment. For each repetition, Xnr and αr0 are generated from N(0, Imr) and N(0, 2), respectively, for r = 1, . . . , r̄. The data are generated with λ0 = ρ0 = 0.5 and σ0² = 1. β10 and β20 are varied in the experiments. The estimation methods considered are the 2SLS with the IV matrix Q1n = Jn(Zn, Wn Zn, Wn² Zn, Wn³ Zn), the G2SLS with the IV matrix Q1n in the first step and Q2n = Jn R̂n(Zn, Wn Zn, Wn² Zn, Wn³ Zn) in the last step, and the ML approach proposed in this paper (labelled ML1 in the following tables).22 Lin (2005) suggests an alternative method of eliminating the fixed effects by the transformation (Imr − Wnr); see Appendix F for more detail.23 However, as the rank of (Imr − Wnr) may be less than m*r, more linear dependence is induced when eliminating the fixed effects. Hence, this alternative elimination method may be less efficient. We also report the MLE based on this alternative elimination method (labelled ML2 in the following tables) in the Monte Carlo experiments.

We report the mean ('Mean') and standard deviation ('SD') of the empirical distributions of the estimates. To facilitate the comparison of the various estimators, their root mean square errors ('RMSE') are also reported. Table 1 reports the results for the case with β10 = β20 = 1, that is, when the regressors are 'strong'. For all sample sizes considered, the G2SLS estimates of ρ0 are downward biased. The bias reduces as the average group size increases. The other estimates are essentially unbiased. In terms of SD, G2SLS improves upon 2SLS for the estimates of λ0, β10 and β20, and ML improves upon G2SLS for the estimates of ρ0. For the same estimator, the SDs decrease as either r̄ or the average group size increases.

22 In finite samples, the best G2SLS with Q2n = Jn R̂n(Zn, Gn(λ̂n)Zn β̂n) is quite sensitive to the initial estimates. As the initial 2SLSEs are obtained with no restrictions on the parameter space, the initial estimate λ̂n could have an absolute value greater than one, which makes the estimated best IV used in the second step problematic. In the case when β10 = β20 = 0.2 and n = 557, about 1/10 of the replications had an initial estimate with |λ̂n| > 1. In the Monte Carlo experiments, we use the simpler Q2n above instead to avoid the effect of bad initial estimates.
23 We have experimented with iterated G2SLS. However, for many replications, the iterated estimator failed to converge. For example, when β10 = β20 = 0.2 and n = 557, the iterated estimator failed to converge in about 1/4 of the replications. This issue tends to occur especially when some estimates of λ0 go out of bounds, that is, have an absolute value greater than one, during the iterations. Note that the 2SLS approach does not impose the restriction |λ| < 1. Even for the converged iterated G2SLS estimates, there is no evidence that the iteration procedure improves the performance of the G2SLS estimator in this simulation experiment. Hence, we choose not to report simulation results on iterated G2SLS in this paper.
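The data-generating process of this section can be sketched as follows. Since the Add Health friendship matrices are not reproduced here, a circulant "next two peers" matrix stands in for Wnr (an assumption for illustration only):

```python
import numpy as np

def simulate_group(W, lam=0.5, rho=0.5, b1=1.0, b2=1.0, sigma=1.0, rng=None):
    """One group of the Section 7 design: Y = lam*W*Y + X*b1 + W*X*b2 + alpha + u,
    u = rho*W*u + eps, with W a row-normalized within-group weights matrix."""
    if rng is None:
        rng = np.random.default_rng()
    m = W.shape[0]
    I = np.eye(m)
    x = rng.standard_normal(m)                     # X ~ N(0, I)
    alpha = rng.normal(0.0, np.sqrt(2.0))          # group effect ~ N(0, 2)
    eps = rng.normal(0.0, sigma, m)
    u = np.linalg.solve(I - rho * W, eps)          # SAR(1) disturbances
    y = np.linalg.solve(I - lam * W, x * b1 + W @ x * b2 + alpha + u)
    return x, y

# Illustrative 10-member group where each member nominates the next two peers.
m = 10
W = np.zeros((m, m))
for i in range(m):
    W[i, (i + 1) % m] = 0.5
    W[i, (i + 2) % m] = 0.5
x, y = simulate_group(W, rng=np.random.default_rng(6))
```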
Table 1. 2SLS, G2SLS and ML estimation with strong X's.

                λ0 = 0.5             ρ0 = 0.5             β10 = 1              β20 = 1

Small size groups: r̄ = 67, n = 557, Σr rank(Imr − Wnr) = 466
2SLS    0.489(0.085)[0.086]  –                    1.004(0.056)[0.056]  1.003(0.092)[0.092]
G2SLS   0.498(0.069)[0.069]  0.346(0.107)[0.187]  1.003(0.053)[0.054]  1.001(0.083)[0.083]
ML1     0.495(0.069)[0.069]  0.495(0.081)[0.081]  1.004(0.053)[0.053]  1.004(0.082)[0.082]
ML2     0.495(0.089)[0.089]  0.490(0.110)[0.111]  1.005(0.053)[0.053]  1.005(0.083)[0.083]

Moderate size groups: r̄ = 67, n = 877, Σr rank(Imr − Wnr) = 761
2SLS    0.494(0.059)[0.059]  –                    1.004(0.041)[0.041]  1.008(0.078)[0.078]
G2SLS   0.499(0.048)[0.048]  0.424(0.072)[0.105]  1.003(0.037)[0.037]  1.004(0.064)[0.064]
ML1     0.496(0.048)[0.048]  0.497(0.058)[0.058]  1.003(0.037)[0.037]  1.006(0.063)[0.063]
ML2     0.497(0.060)[0.060]  0.498(0.073)[0.073]  1.003(0.037)[0.037]  1.006(0.065)[0.065]

Moderate size groups: r̄ = 102, n = 1344, Σr rank(Imr − Wnr) = 1166
2SLS    0.495(0.047)[0.047]  –                    1.003(0.034)[0.035]  1.003(0.059)[0.059]
G2SLS   0.498(0.038)[0.039]  0.428(0.058)[0.092]  1.002(0.032)[0.032]  1.001(0.049)[0.049]
ML1     0.496(0.038)[0.039]  0.501(0.046)[0.046]  1.002(0.032)[0.032]  1.003(0.048)[0.049]
ML2     0.496(0.050)[0.050]  0.504(0.063)[0.063]  1.002(0.032)[0.032]  1.003(0.050)[0.050]

Large size groups: r̄ = 102, n = 2279, Σr rank(Imr − Wnr) = 2076
2SLS    0.497(0.038)[0.038]  –                    1.002(0.028)[0.028]  1.006(0.050)[0.050]
G2SLS   0.500(0.031)[0.031]  0.460(0.046)[0.061]  1.001(0.025)[0.025]  1.002(0.041)[0.041]
ML1     0.499(0.031)[0.031]  0.499(0.038)[0.038]  1.001(0.025)[0.025]  1.003(0.040)[0.040]
ML2     0.502(0.036)[0.036]  0.497(0.044)[0.044]  1.001(0.025)[0.025]  1.002(0.041)[0.041]

Note: Mean(SD)[RMSE].
In the case when the regressors are 'weak', with β10 = β20 = 0.2, the estimation results are summarized in Table 2. When r̄ = 67, the 2SLS and G2SLS estimates of λ0 are upward biased. The G2SLS estimates of ρ0 and the 2SLS and G2SLS estimates of β20 are downward biased. When r̄ increases to 102, the 2SLS estimates of λ0 become downward biased with a smaller magnitude, and the other biases also reduce. The MLE of ρ0 is slightly downward biased in the sample of small groups. The bias reduces as the sample size increases. The MLEs also have smaller SDs and RMSEs than the other estimates for all sample sizes considered. For example, when n = 2279, the percentage reductions in SD of the MLEs of λ0, ρ0 and β20 relative to the G2SLS estimates are, respectively, 42.0%, 47.1% and 15.9%. The percentage reduction is even larger with smaller samples. For both cases, with 'strong' and with 'weak' regressors, the MLEs based on the alternative elimination method of the fixed effects by the transformation (Imr − Wnr) have larger SDs than the MLEs proposed in this paper.

Results in Table 3 inspect the effects of model misspecification on the MLEs, using the sample with 102 moderate size groups. When the positive endogenous effect captured by λ0 is ignored in the estimation, ρ̂n and β̂2n will be upward biased. When the positive exogenous effect captured by β20 is ignored in the estimation, λ̂n will be upward biased, and β̂1n and ρ̂n will be downward biased. When a positive spatial correlation ρ0 in the disturbances fails to be modelled, λ̂n will be upward biased and β̂2n will be downward biased. The bias of λ̂n can be large enough to change its sign in the case when λ0 < 0. The opposite occurs when the omitted ρ0 has a
Table 2. 2SLS, G2SLS and ML estimation with weak X's.

                λ0 = 0.5             ρ0 = 0.5             β10 = 0.2            β20 = 0.2

Small size groups: r̄ = 67, n = 557, Σr rank(Imr − Wnr) = 466
2SLS    0.591(0.499)[0.507]  –                    0.194(0.057)[0.058]  0.177(0.101)[0.103]
G2SLS   0.633(0.508)[0.525]  0.226(0.348)[0.443]  0.198(0.057)[0.057]  0.178(0.108)[0.110]
ML1     0.517(0.149)[0.150]  0.453(0.159)[0.166]  0.204(0.052)[0.052]  0.201(0.074)[0.074]
ML2     0.506(0.168)[0.168]  0.469(0.177)[0.179]  0.205(0.052)[0.053]  0.204(0.075)[0.075]

Moderate size groups: r̄ = 67, n = 877, Σr rank(Imr − Wnr) = 761
2SLS    0.527(0.409)[0.410]  –                    0.197(0.045)[0.046]  0.189(0.090)[0.090]
G2SLS   0.548(0.324)[0.328]  0.356(0.277)[0.312]  0.199(0.040)[0.040]  0.190(0.079)[0.079]
ML1     0.507(0.117)[0.117]  0.473(0.123)[0.126]  0.203(0.037)[0.037]  0.203(0.058)[0.058]
ML2     0.513(0.143)[0.144]  0.471(0.148)[0.150]  0.203(0.037)[0.037]  0.203(0.060)[0.060]

Moderate size groups: r̄ = 102, n = 1344, Σr rank(Imr − Wnr) = 1166
2SLS    0.481(0.398)[0.398]  –                    0.199(0.045)[0.045]  0.195(0.088)[0.088]
G2SLS   0.508(0.220)[0.220]  0.396(0.225)[0.248]  0.199(0.033)[0.033]  0.194(0.056)[0.056]
ML1     0.503(0.106)[0.106]  0.483(0.110)[0.111]  0.201(0.031)[0.031]  0.200(0.046)[0.046]
ML2     0.512(0.133)[0.133]  0.479(0.138)[0.140]  0.202(0.032)[0.032]  0.200(0.047)[0.047]

Large size groups: r̄ = 102, n = 2279, Σr rank(Imr − Wnr) = 2076
2SLS    0.470(0.212)[0.214]  –                    0.201(0.030)[0.030]  0.204(0.054)[0.054]
G2SLS   0.515(0.169)[0.170]  0.455(0.191)[0.197]  0.200(0.026)[0.026]  0.198(0.044)[0.044]
ML1     0.502(0.098)[0.098]  0.487(0.101)[0.102]  0.201(0.025)[0.025]  0.202(0.037)[0.037]
ML2     0.514(0.107)[0.108]  0.478(0.112)[0.114]  0.201(0.025)[0.025]  0.201(0.037)[0.037]

Note: Mean(SD)[RMSE].
negative value. The bottom panel of Table 3 studies the effects of misspecified weights matrices in a model with i.i.d. disturbances (ρ0 = 0). The weights matrices Wnr in the data-generating process are specified as above. However, suppose that, when estimating the model, we do not have information on the network structure and put equal weight on each member of a group, as in the model with group interactions, so that W^e_nr = (1/m*r)(lmr l'mr − Imr) is used. With the misspecified Wnr, λ̂n, β̂1n and β̂2n are upward biased by 65.2%, 31.6% and 83.3%, respectively. The SD of λ̂n also increases dramatically relative to the estimate with correctly specified Wnr. We also compare the likelihood values of the ML estimation of the correctly specified model with those of the misspecified models. We find that a larger likelihood value indicates a better specified model in most cases.
8. CONCLUSION

This paper considers model specification, identification and estimation of a social interaction model. The social interaction model generalizes the group interaction model in Lee (2007), where an individual in a group interacts with all other members with equal weights, to the situation where each individual may have his or her own connected peers. This model extends the SAR model with SAR errors to incorporate contextual variables and group unobservables. The social interactions are rich in that endogenous interaction effects, contextual effects, group-specific effects, and correlations among connected individuals in a network can all be captured in the model. The
Table 3. ML estimation of misspecified models (r̄ = 102, n = 1344).

                                 λ0                    ρ0                    β10                  β20                  Likelihood

λ0 = 0.5, ρ0 = 0.5, β10 = 1, β20 = 1
True model                       0.496(0.038)[0.039]   0.501(0.046)[0.046]   1.002(0.032)[0.032]  1.003(0.048)[0.049]  −1749.0 (−)
Misspecified model with λ = 0    –                     0.805(0.012)[0.305]   1.000(0.033)[0.033]  1.144(0.046)[0.151]  −1804.5 (6.0%)
Misspecified model with β2 = 0   0.799(0.011)[0.300]   −0.018(0.048)[0.521]  0.821(0.037)[0.183]  –                    −1907.1 (0.0%)
Misspecified model with ρ = 0    0.718(0.016)[0.218]   –                     0.897(0.032)[0.108]  0.677(0.050)[0.327]  −1803.9 (7.0%)

λ0 = −0.3, ρ0 = 0.5, β10 = 1, β20 = 1
True model                       −0.295(0.044)[0.045]  0.496(0.042)[0.042]   1.001(0.031)[0.031]  0.997(0.047)[0.047]  −1797.2 (−)
Misspecified model with ρ = 0    0.166(0.033)[0.467]   –                     0.920(0.032)[0.086]  0.603(0.056)[0.401]  −1824.5 (0.0%)

λ0 = 0.5, ρ0 = −0.3, β10 = 1, β20 = 1
True model                       0.497(0.024)[0.024]   −0.295(0.038)[0.038]  1.003(0.031)[0.031]  1.003(0.060)[0.061]  −1797.1 (−)
Misspecified model with ρ = 0    0.369(0.023)[0.133]   –                     1.064(0.031)[0.071]  1.192(0.056)[0.200]  −1820.8 (0.0%)

λ0 = 0.5, ρ0 = 0, β10 = 1, β20 = 1
True model                       0.499(0.019)[0.019]   –                     1.002(0.030)[0.030]  1.000(0.048)[0.048]  −1755.0 (−)
Model with misspecified Wr       0.826(0.328)[0.462]   –                     1.316(0.050)[0.320]  1.833(0.084)[0.838]  −1978.1 (0.0%)

Note: Parameter estimates: Mean(SD)[RMSE]; likelihood value: Mean (frequency of the likelihood value of the misspecified model being larger than that of the correct model).
incorporation of possible correlations among connected individuals may partially capture the endogeneity of network formation.

The identification of endogenous and contextual effects in Manski's (1993) linear-in-means model requires the inclusion of some individual exogenous characteristics but the exclusion of their corresponding contextual effects. In the group interaction model in Lee (2007), identification requires variation in group sizes in the sample. For the network model, identification is in general feasible even when groups have the same size, because of additional non-linearity due to the network structure. The identification issue is similar to that of the SAR model, but with a slight complication due to the presence of contextual variables and group unobservables. Identification can be based on the mean regression function as well as the correlation structure of the dependent variables. In general, all the social interaction effects of interest can be identified in a network model.

We consider the estimation of the network model. As a model with endogeneity, it can in general be estimated by the 2SLS method, as instrumental variables can be generated from the
network structure in the presence of relevant exogenous variables. The 2SLS method is simple but not efficient. This paper considers a possible extension of the QML method for the group interaction model in Lee (2007) to the general network model. It generalizes the QML method for a SAR model with SAR errors in that there are incidental parameters due to group-specific dummy variables. The QML method is designed after the elimination of the group dummies. This strategy may have applications in other models, for example the spatial panel data models with time dummies in Lee and Yu (2009). We establish analytically the consistency and asymptotic normality of the estimators and show that the QMLE is asymptotically efficient relative to the G2SLS estimator. Monte Carlo studies are designed to investigate the finite sample performance of the estimators; they confirm that the QMLE has better finite sample properties than the 2SLS and G2SLS estimators. We also pay special attention to the possible consequences of omitting some social or correlation effects for the estimates of the remaining effects. Furthermore, we provide some limited evidence on the possible consequences of misspecifying the network connections, and on the usefulness of the maximized log likelihood as a model selection criterion.
ACKNOWLEDGMENTS

We appreciate valuable comments and suggestions from the editor and two anonymous referees, which have improved the presentation of this paper. Lee acknowledges financial support for this research from NSF grant SES-0519204. Lin acknowledges financial support for this research from NSFC grant 70701020. An earlier version of this paper was circulated under the title 'Specification and estimation of social interaction models with network structure, contextual factors, correlation and fixed effects'.
REFERENCES

Anselin, L. (1988). Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic Publishers.
Anselin, L. (2006). Spatial econometrics. In T. C. Mills and K. Patterson (Eds.), Palgrave Handbook of Econometrics, Volume 1, 1213–92. New York: Palgrave Macmillan.
Bertrand, M., E. Luttmer and S. Mullainathan (2000). Network effects and welfare cultures. Quarterly Journal of Economics 115, 1019–55.
Bramoullé, Y., H. Djebbari and B. Fortin (2009). Identification of peer effects through social networks. Journal of Econometrics 150, 41–55.
Calvó-Armengol, A., E. Patacchini and Y. Zenou (2006). Peer effects and social networks in education. Working paper, Universitat Autonoma de Barcelona.
Case, A. (1991). Spatial patterns in household demand. Econometrica 59, 953–65.
Case, A. (1992). Neighbourhood influence and technological change. Regional Science and Urban Economics 22, 491–508.
Case, A., J. J. R. Hines and H. S. Rosen (1993). Interstate tax competition after TRA 86. Journal of Policy Analysis and Management 12, 136–48.
Chamberlain, G. (1980). Analysis of covariance with qualitative data. Review of Economic Studies 47, 225–38.
Cohen, J. (2002). Reciprocal state and local airport spending spillovers and symmetric responses to cuts and increases in federal airport grants. Public Finance Review 30, 41–55.
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics 92, 1–45.
Cox, D. R. (1975). Partial likelihood. Biometrika 62, 269–76.
Cressie, N. (1993). Statistics for Spatial Data. New York: John Wiley and Sons.
Durlauf, S. N. and H. P. Young (2001). The new social economics. In S. N. Durlauf and H. P. Young (Eds.), Social Dynamics, 1–14. Cambridge, MA: MIT Press.
Fingleton, B. (2008). A generalized method of moments estimator for a spatial panel model with an endogenous spatial lag and spatial moving average errors. Spatial Economic Analysis 3, 27–44.
Florax, R. and H. Folmer (1992). Specification and estimation of spatial linear regression models: Monte Carlo evaluation of pre-test estimators. Regional Science and Urban Economics 22, 405–32.
Hanushek, E. A., J. F. Kain, J. M. Markman and S. G. Rivkin (2003). Does peer ability affect student achievement? Journal of Applied Econometrics 18, 527–44.
Horn, R. and C. Johnson (1985). Matrix Analysis. Cambridge: Cambridge University Press.
Hsiao, C. (2003). Analysis of Panel Data (2nd edn.). Cambridge: Cambridge University Press.
Kelejian, H. H. and I. R. Prucha (1998). A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. Journal of Real Estate Finance and Economics 17, 99–121.
Kelejian, H. H. and I. R. Prucha (1999). A generalized moments estimator for the autoregressive parameter in a spatial model. International Economic Review 40, 509–33.
Kelejian, H. H. and I. R. Prucha (2001). On the asymptotic distribution of the Moran I test statistic with applications. Journal of Econometrics 104, 219–57.
Kelejian, H. H. and I. R. Prucha (2007). HAC estimation in a spatial framework. Journal of Econometrics 140, 131–54.
Lancaster, T. (2000). The incidental parameter problem since 1948. Journal of Econometrics 95, 391–413.
Lee, L. F. (2004). Asymptotic distributions of quasi-maximum likelihood estimators for spatial econometric models. Econometrica 72, 1899–926.
Lee, L. F. (2007). Identification and estimation of econometric models with group interactions, contextual factors and fixed effects. Journal of Econometrics 140, 333–74.
Lee, L. F. and X. Liu (2008). Efficient GMM estimation of high order spatial autoregressive models with autoregressive disturbances. Forthcoming in Econometric Theory.
Lee, L. F. and J. Yu (2009). A spatial dynamic panel data model with both time and individual fixed effects. Forthcoming in Econometric Theory.
Lee, S. Y. (2008). Three essays on spatial autoregressive models and empirical organization. Unpublished Ph.D. thesis, Department of Economics, Ohio State University.
LeSage, J. (1999). The theory and practice of spatial econometrics. Working paper, Department of Economics, University of Toledo.
LeSage, J. and R. Pace (2009). Introduction to Spatial Econometrics. Boca Raton, FL: CRC Press.
Lin, X. (2005). Peer effects and student academic achievement: an application of spatial autoregressive model with group unobservables. Working paper, Ohio State University.
Lin, X. (2009). Identifying peer effects in student academic achievement by a spatial autoregressive model with group unobservables. Working paper, Wayne State University.
Lin, X. and L. F. Lee (2009). GMM estimation of spatial autoregressive models with unknown heteroskedasticity. Forthcoming in Journal of Econometrics.
Liu, X. and L. F. Lee (2009). GMM estimation of social interaction models with centrality. Working paper, Department of Economics, University of Colorado at Boulder.
Manski, C. F. (1993). Identification of endogenous social effects: the reflection problem. Review of Economic Studies 60, 531–42.
164
L.-F. Lee, X. Liu and X. Lin
Moffitt, R. A. (2001). Policy interventions, low-level equilibria, and social interactions. In S. N. Durlauf and H. P. Young (Eds.), Social Dynamics, 45–82. Cambridge, MA: MIT Press. Neyman, J. and E. L. Scott (1948). Consistent estimates based on partially consistent observations. Econometrica 16, 1–32. Ord, J. (1975). Estimation methods for models of spatial interaction. Journal of the American Statistical Association 70, 120–26. Rothenberg, T. J. (1971). Identification in parametric models. Econometrica 39, 577–91. Sacerdote, B. (2001). Peer effects with random assignment: results for Dartmouth roommates. Quarterly Journal of Economics 116, 681–704. Udry, J. R. (2003). The National Longitudinal Study of Adolescent Health (Add Health), Waves I & II, 1994–1996; Wave III, 2001–2002 [Machine-readable data file and documentation]. Carolina Population Center, University of North Carolina at Chapel Hill. White, H. (1994). Estimation, Inference and Specification Analysis. New York: Cambridge University Press.
APPENDIX A: SUMMARY OF NOTATION

$\beta = (\beta_1', \beta_2')'$, $\gamma = (\lambda, \rho)'$, $\delta = (\beta', \gamma')'$, $\theta = (\delta', \sigma^2)'$, $\zeta = (\beta', \lambda)'$.
$n = \sum_{r=1}^{\bar r} m_r$, $n^* = \sum_{r=1}^{\bar r} m_r^* = n - \bar r$, $m_r^* = m_r - 1$.
$l_{m_r}$ is an $m_r$-dimensional vector of ones. $W_{nr}^e = \frac{1}{m_r-1}(l_{m_r} l_{m_r}' - I_{m_r})$.
$Z_{nr} = (X_{nr}, W_{nr} X_{nr})$; $S_{nr}(\lambda) = I_{m_r} - \lambda W_{nr}$, $S_{nr} = S_{nr}(\lambda_0)$; $R_{nr}(\rho) = I_{m_r} - \rho M_{nr}$, $R_{nr} = R_{nr}(\rho_0)$; $G_{nr} = W_{nr} S_{nr}^{-1}$.
$J_{nr} = I_{m_r} - \frac{1}{m_r} l_{m_r} l_{m_r}'$. $[F_{nr}, l_{m_r}/\sqrt{m_r}]$ is the orthonormal matrix of $J_{nr}$, where $F_{nr}$ corresponds to the eigenvalue one.
$Y_{nr}^* = F_{nr}' Y_{nr}$, $Z_{nr}^* = F_{nr}' Z_{nr}$, $\epsilon_{nr}^* = F_{nr}' \epsilon_{nr}$; $W_{nr}^* = F_{nr}' W_{nr} F_{nr}$, $M_{nr}^* = F_{nr}' M_{nr} F_{nr}$.
$S_{nr}^*(\lambda) = F_{nr}' S_{nr}(\lambda) F_{nr} = I_{m_r^*} - \lambda W_{nr}^*$, $S_{nr}^* = S_{nr}^*(\lambda_0)$.
$R_{nr}^*(\rho) = F_{nr}' R_{nr}(\rho) F_{nr} = I_{m_r^*} - \rho M_{nr}^*$, $R_{nr}^* = R_{nr}^*(\rho_0)$.
$\epsilon_{nr}(\delta) = R_{nr}(\rho)[S_{nr}(\lambda)Y_{nr} - Z_{nr}\beta]$, $\epsilon_{nr}^*(\delta) = R_{nr}^*(\rho)[S_{nr}^*(\lambda)Y_{nr}^* - Z_{nr}^*\beta]$.
$Y_n = (Y_{n1}', \ldots, Y_{n\bar r}')'$, $X_n = (X_{n1}', \ldots, X_{n\bar r}')'$, $Z_n = (Z_{n1}', \ldots, Z_{n\bar r}')'$, $\epsilon_n = (\epsilon_{n1}', \ldots, \epsilon_{n\bar r}')'$, $W_n = \mathrm{Diag}\{W_{n1}, \ldots, W_{n\bar r}\}$, $M_n = \mathrm{Diag}\{M_{n1}, \ldots, M_{n\bar r}\}$ and $J_n = \mathrm{Diag}\{J_{n1}, \ldots, J_{n\bar r}\}$.
$H_n = M_n R_n^{-1}$; $\tilde Z_n = R_n Z_n$, $\tilde G_n = R_n G_n R_n^{-1}$.
$C_n = J_n \tilde G_n - \frac{1}{n}\mathrm{tr}(J_n \tilde G_n) I_n$, $D_n = J_n H_n - \frac{1}{n}\mathrm{tr}(J_n H_n) I_n$.
$P_n(\rho) = J_n - J_n R_n(\rho) Z_n [Z_n' R_n'(\rho) J_n R_n(\rho) Z_n]^{-1} Z_n' R_n'(\rho) J_n$ and $P_n = P_n(\rho_0)$.
Let $A^s = A + A'$ for a square matrix $A$. Let $\mathrm{vec}_D(A)$ denote the column vector formed from the diagonal elements of a square matrix $A$.
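The orthonormal matrix $[F_{nr}, l_{m_r}/\sqrt{m_r}]$ used throughout can be obtained numerically from an eigendecomposition of $J_{nr}$. The following sketch (illustrative only; NumPy and the group size are my own choices, not from the paper) checks the defining properties $F_{nr}'F_{nr} = I_{m_r^*}$, $F_{nr}F_{nr}' = J_{nr}$ and $F_{nr}'l_{m_r} = 0$.

```python
import numpy as np

m = 5                                    # group size m_r (illustrative)
l = np.ones((m, 1))
J = np.eye(m) - l @ l.T / m              # within-group deviation projector J_nr

# J_nr has eigenvalue 1 with multiplicity m-1 (spanned by F_nr)
# and eigenvalue 0 with eigenvector l/sqrt(m).
eigval, eigvec = np.linalg.eigh(J)
F = eigvec[:, np.isclose(eigval, 1.0)]   # m x (m-1): the columns of F_nr

ok_orth = np.allclose(F.T @ F, np.eye(m - 1))   # F'F = I_{m*}
ok_proj = np.allclose(F @ F.T, J)               # FF' = J_nr
ok_null = np.allclose(F.T @ l, 0.0)             # F'l = 0
```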
APPENDIX B: THE SCORE VECTOR AND INFORMATION MATRIX

The first-order derivatives of the log-likelihood function at $\theta_0$ are
$$\frac{1}{\sqrt{n^*}}\frac{\partial \ln L_n(\theta_0)}{\partial \lambda} = \frac{1}{\sigma_0^2\sqrt{n^*}}\epsilon_n' J_n \tilde G_n \tilde Z_n \beta_0 + \frac{1}{\sigma_0^2\sqrt{n^*}}\big[\epsilon_n' J_n \tilde G_n \epsilon_n - \sigma_0^2\,\mathrm{tr}(J_n \tilde G_n)\big],$$
$$\frac{1}{\sqrt{n^*}}\frac{\partial \ln L_n(\theta_0)}{\partial \rho} = \frac{1}{\sigma_0^2\sqrt{n^*}}\big[\epsilon_n' J_n H_n \epsilon_n - \sigma_0^2\,\mathrm{tr}(J_n H_n)\big],$$
$$\frac{1}{\sqrt{n^*}}\frac{\partial \ln L_n(\theta_0)}{\partial \beta} = \frac{1}{\sigma_0^2\sqrt{n^*}}\tilde Z_n' J_n \epsilon_n, \qquad \frac{1}{\sqrt{n^*}}\frac{\partial \ln L_n(\theta_0)}{\partial \sigma^2} = \frac{1}{2\sigma_0^4\sqrt{n^*}}\big(\epsilon_n' J_n \epsilon_n - n^*\sigma_0^2\big).$$
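As a sanity check on the $\sigma^2$ component of the score, the sketch below compares the analytic derivative $(\epsilon_n' J_n \epsilon_n - n^*\sigma^2)/(2\sigma^4)$ with a central finite difference of the Gaussian log-likelihood term $-\frac{n^*}{2}\ln(2\pi\sigma^2) - \epsilon_n' J_n \epsilon_n/(2\sigma^2)$. This is an illustrative numerical check with made-up values, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10
J = np.eye(m) - np.ones((m, m)) / m      # J_n for a single group
n_star = m - 1                            # effective sample size tr(J_n)
eps = rng.standard_normal(m)
q = eps @ J @ eps                         # eps' J eps

def lnL(s2):
    # only the sigma^2-dependent part of the log-likelihood is needed here
    return -0.5 * n_star * np.log(2 * np.pi * s2) - q / (2 * s2)

s2 = 1.3
analytic = (q - n_star * s2) / (2 * s2 ** 2)
h = 1e-6
numeric = (lnL(s2 + h) - lnL(s2 - h)) / (2 * h)
ok_score = np.isclose(analytic, numeric, rtol=1e-4)
```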
The second-order derivatives of the log-likelihood function are
$$\frac{\partial^2 \ln L_n(\theta)}{\partial\lambda^2} = -\mathrm{tr}[J_n G_n^2(\lambda)] - \frac{1}{\sigma^2} Y_n' W_n' R_n'(\rho) J_n R_n(\rho) W_n Y_n,$$
$$\frac{\partial^2 \ln L_n(\theta)}{\partial\lambda\partial\beta'} = -\frac{1}{\sigma^2} Z_n' R_n'(\rho) J_n R_n(\rho) W_n Y_n, \qquad \frac{\partial^2 \ln L_n(\theta)}{\partial\lambda\partial\sigma^2} = -\frac{1}{\sigma^4} Y_n' W_n' R_n'(\rho) J_n \epsilon_n(\delta),$$
$$\frac{\partial^2 \ln L_n(\theta)}{\partial\lambda\partial\rho} = -\frac{1}{\sigma^2} Y_n' W_n' M_n' J_n \epsilon_n(\delta) - \frac{1}{\sigma^2} Y_n' W_n' R_n'(\rho) J_n M_n [S_n(\lambda)Y_n - Z_n\beta],$$
$$\frac{\partial^2 \ln L_n(\theta)}{\partial\beta\partial\beta'} = -\frac{1}{\sigma^2} Z_n' R_n'(\rho) J_n R_n(\rho) Z_n, \qquad \frac{\partial^2 \ln L_n(\theta)}{\partial\beta\partial\sigma^2} = -\frac{1}{\sigma^4} Z_n' R_n'(\rho) J_n \epsilon_n(\delta),$$
$$\frac{\partial^2 \ln L_n(\theta)}{\partial\beta\partial\rho} = -\frac{1}{\sigma^2} Z_n' M_n' J_n \epsilon_n(\delta) - \frac{1}{\sigma^2} Z_n' R_n'(\rho) J_n M_n [S_n(\lambda)Y_n - Z_n\beta],$$
$$\frac{\partial^2 \ln L_n(\theta)}{\partial\rho^2} = -\mathrm{tr}\big(J_n [M_n R_n^{-1}(\rho)]^2\big) - \frac{1}{\sigma^2} [S_n(\lambda)Y_n - Z_n\beta]' M_n' J_n M_n [S_n(\lambda)Y_n - Z_n\beta],$$
$$\frac{\partial^2 \ln L_n(\theta)}{\partial\rho\partial\sigma^2} = -\frac{1}{\sigma^4} \epsilon_n'(\delta) J_n M_n [S_n(\lambda)Y_n - Z_n\beta], \qquad \frac{\partial^2 \ln L_n(\theta)}{\partial(\sigma^2)^2} = \frac{n^*}{2\sigma^4} - \frac{1}{\sigma^6} \epsilon_n'(\delta) J_n \epsilon_n(\delta).$$
The variance matrix of $\frac{1}{\sqrt{n^*}}\frac{\partial \ln L_n(\theta_0)}{\partial\theta}$ is $E\big[\frac{1}{\sqrt{n^*}}\frac{\partial \ln L_n(\theta_0)}{\partial\theta}\cdot\frac{1}{\sqrt{n^*}}\frac{\partial \ln L_n(\theta_0)}{\partial\theta'}\big] = \Sigma_{\theta,n} + \Omega_{\theta,n}$, where
$$\Sigma_{\theta,n} = -E\Big[\frac{1}{n^*}\frac{\partial^2 \ln L_n(\theta_0)}{\partial\theta\partial\theta'}\Big] = \begin{pmatrix} \frac{1}{\sigma_0^2 n^*}\tilde Z_n' J_n \tilde Z_n & * & * & * \\ \frac{1}{\sigma_0^2 n^*}(\tilde G_n \tilde Z_n\beta_0)' J_n \tilde Z_n & \Sigma_{\theta,n,22} & * & * \\ 0_{1\times k} & \frac{1}{n^*}\mathrm{tr}(H_n^s J_n \tilde G_n) & \frac{1}{n^*}\mathrm{tr}(H_n^s J_n H_n) & * \\ 0_{1\times k} & \frac{1}{\sigma_0^2 n^*}\mathrm{tr}(J_n \tilde G_n) & \frac{1}{\sigma_0^2 n^*}\mathrm{tr}(J_n H_n) & \frac{1}{2\sigma_0^4} \end{pmatrix}$$
and
$$\Omega_{\theta,n} = \begin{pmatrix} 0_{k\times k} & * & * & * \\ \frac{\mu_3}{\sigma_0^4 n^*}\mathrm{vec}_D'(J_n\tilde G_n) J_n \tilde Z_n & \Omega_{\theta,n;22} & * & * \\ \frac{\mu_3}{\sigma_0^4 n^*}\mathrm{vec}_D'(J_n H_n) J_n \tilde Z_n & \Omega_{\theta,n;32} & \Omega_{\theta,n;33} & * \\ 0_{1\times k} & \frac{\mu_4-3\sigma_0^4}{2\sigma_0^6 n}\mathrm{tr}(J_n\tilde G_n) & \frac{\mu_4-3\sigma_0^4}{2\sigma_0^6 n}\mathrm{tr}(J_n H_n) & \frac{(\mu_4-3\sigma_0^4)\,n^*}{4\sigma_0^8\, n} \end{pmatrix},$$
where
$$\Sigma_{\theta,n,22} = \frac{1}{\sigma_0^2 n^*}(\tilde G_n \tilde Z_n\beta_0)' J_n (\tilde G_n \tilde Z_n\beta_0) + \frac{1}{n^*}\mathrm{tr}(\tilde G_n^s J_n \tilde G_n),$$
$$\Omega_{\theta,n;22} = \frac{2\mu_3}{\sigma_0^4 n^*}\mathrm{vec}_D'(J_n\tilde G_n) J_n \tilde G_n \tilde Z_n\beta_0 + \frac{\mu_4-3\sigma_0^4}{\sigma_0^4 n^*}\mathrm{vec}_D'(J_n\tilde G_n)\mathrm{vec}_D(J_n\tilde G_n),$$
$$\Omega_{\theta,n;32} = \frac{\mu_3}{\sigma_0^4 n^*}\mathrm{vec}_D'(J_n H_n) J_n \tilde G_n \tilde Z_n\beta_0 + \frac{\mu_4-3\sigma_0^4}{\sigma_0^4 n^*}\mathrm{vec}_D'(J_n H_n)\mathrm{vec}_D(J_n\tilde G_n),$$
$$\Omega_{\theta,n;33} = \frac{\mu_4-3\sigma_0^4}{\sigma_0^4 n^*}\mathrm{vec}_D'(J_n H_n)\mathrm{vec}_D(J_n H_n),$$
with $\mu_3$ and $\mu_4$ being the third and fourth moments of $\epsilon_{nr,i}$, respectively.
APPENDIX C: SOME BASIC PROPERTIES

In this appendix, we list some properties which are useful for the proofs of the results in the text. The results in Lemmas C.3–C.10 are either straightforward or can be found in Kelejian and Prucha (2001) and Lee (2004). They are listed here for easy reference. Throughout this appendix, the elements $v_i$ of $V_n = (v_1, \ldots, v_n)'$ are assumed to be i.i.d. with zero mean, finite variance $\sigma^2$ and finite fourth moment $\mu_4$.

LEMMA C.1. Suppose $W_{nr}$ is a row-normalized $m_r \times m_r$ matrix, $J_{nr} = I_{m_r} - \frac{1}{m_r} l_{m_r} l_{m_r}'$, and $[F_{nr}, l_{m_r}/\sqrt{m_r}]$ is the orthonormal matrix of $J_{nr}$, where $F_{nr}$ corresponds to the eigenvalue one. Let $W_{nr}^* = F_{nr}' W_{nr} F_{nr}$ and $m_r^* = m_r - 1$. Then (1) $F_{nr}'(I_{m_r} - \lambda W_{nr}) = F_{nr}'(I_{m_r} - \lambda W_{nr}) F_{nr} F_{nr}'$, (2) $|I_{m_r^*} - \lambda W_{nr}^*| = \frac{1}{1-\lambda}|I_{m_r} - \lambda W_{nr}|$, (3) $(I_{m_r^*} - \lambda W_{nr}^*)^{-1} = F_{nr}'(I_{m_r} - \lambda W_{nr})^{-1} F_{nr}$, and (4) $W_{nr}^*(I_{m_r^*} - \lambda W_{nr}^*)^{-1} = (I_{m_r^*} - \lambda W_{nr}^*)^{-1} W_{nr}^* = F_{nr}' W_{nr}(I_{m_r} - \lambda W_{nr})^{-1} F_{nr}$.

Proof: As $F_{nr} F_{nr}' = J_{nr} = I_{m_r} - l_{m_r} l_{m_r}'/m_r$, we have $F_{nr}'(I_{m_r} - \lambda W_{nr}) = F_{nr}'(I_{m_r} - \lambda W_{nr})(F_{nr} F_{nr}' + l_{m_r} l_{m_r}'/m_r) = F_{nr}'(I_{m_r} - \lambda W_{nr}) F_{nr} F_{nr}' + F_{nr}'(I_{m_r} - \lambda W_{nr}) l_{m_r} l_{m_r}'/m_r$. As $W_{nr}$ is row-normalized, $F_{nr}' W_{nr} l_{m_r} = F_{nr}' l_{m_r} = 0$. Hence, (1) holds.

To show (2), we note that $(I_{m_r^*} - \lambda W_{nr}^*) = F_{nr}'(I_{m_r} - \lambda W_{nr}) F_{nr}$. As
$$[F_{nr}, l_{m_r}/\sqrt{m_r}]'(I_{m_r} - \lambda W_{nr})[F_{nr}, l_{m_r}/\sqrt{m_r}] = \begin{pmatrix} F_{nr}'(I_{m_r} - \lambda W_{nr}) F_{nr} & F_{nr}'(I_{m_r} - \lambda W_{nr}) l_{m_r}/\sqrt{m_r} \\ l_{m_r}'(I_{m_r} - \lambda W_{nr}) F_{nr}/\sqrt{m_r} & l_{m_r}'(I_{m_r} - \lambda W_{nr}) l_{m_r}/m_r \end{pmatrix} = \begin{pmatrix} F_{nr}'(I_{m_r} - \lambda W_{nr}) F_{nr} & 0 \\ l_{m_r}'(I_{m_r} - \lambda W_{nr}) F_{nr}/\sqrt{m_r} & 1-\lambda \end{pmatrix},$$
because $F_{nr}' W_{nr} l_{m_r} = F_{nr}' l_{m_r} = 0$ and $l_{m_r}' W_{nr} l_{m_r} = m_r$, the determinant of this block-triangular matrix gives $|I_{m_r} - \lambda W_{nr}| = (1-\lambda)\,|F_{nr}'(I_{m_r} - \lambda W_{nr}) F_{nr}|$. Hence $|I_{m_r^*} - \lambda W_{nr}^*| = \frac{1}{1-\lambda}|I_{m_r} - \lambda W_{nr}|$.

Since $F_{nr}' W_{nr} l_{m_r} = F_{nr}' l_{m_r} = 0$, (3) and (4) can be verified as
$$(I_{m_r^*} - \lambda W_{nr}^*)\cdot F_{nr}'(I_{m_r} - \lambda W_{nr})^{-1} F_{nr} = F_{nr}'(I_{m_r} - \lambda W_{nr})(I_{m_r} - l_{m_r} l_{m_r}'/m_r)(I_{m_r} - \lambda W_{nr})^{-1} F_{nr} = F_{nr}' F_{nr} - F_{nr}'(I_{m_r} - \lambda W_{nr}) l_{m_r} l_{m_r}'(I_{m_r} - \lambda W_{nr})^{-1} F_{nr}/m_r = I_{m_r^*},$$
and
$$W_{nr}^*(I_{m_r^*} - \lambda W_{nr}^*)^{-1} = F_{nr}' W_{nr} F_{nr}\cdot F_{nr}'(I_{m_r} - \lambda W_{nr})^{-1} F_{nr} = F_{nr}' W_{nr}(I_{m_r} - l_{m_r} l_{m_r}'/m_r)(I_{m_r} - \lambda W_{nr})^{-1} F_{nr} = F_{nr}' W_{nr}(I_{m_r} - \lambda W_{nr})^{-1} F_{nr},$$
and $(I_{m_r^*} - \lambda W_{nr}^*)^{-1} W_{nr}^* = F_{nr}'(I_{m_r} - \lambda W_{nr})^{-1} W_{nr} F_{nr} = F_{nr}' W_{nr}(I_{m_r} - \lambda W_{nr})^{-1} F_{nr}$.
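Parts (2) and (3) of Lemma C.1 are easy to verify numerically. The sketch below (illustrative only; the random row-normalized $W_{nr}$ and the value of $\lambda$ are my own choices) checks the determinant relation and the inverse relation for one group.

```python
import numpy as np

rng = np.random.default_rng(1)
m, lam = 6, 0.4
W = rng.random((m, m))
np.fill_diagonal(W, 0.0)
W = W / W.sum(axis=1, keepdims=True)     # row-normalize: W l = l

l = np.ones((m, 1))
J = np.eye(m) - l @ l.T / m
eigval, eigvec = np.linalg.eigh(J)
F = eigvec[:, np.isclose(eigval, 1.0)]   # F_nr (m x (m-1))

W_star = F.T @ W @ F                     # W*_nr
# Lemma C.1(2): |I* - lam W*| = |I - lam W| / (1 - lam)
lhs = np.linalg.det(np.eye(m - 1) - lam * W_star)
rhs = np.linalg.det(np.eye(m) - lam * W) / (1 - lam)
ok_det = np.isclose(lhs, rhs)
# Lemma C.1(3): (I* - lam W*)^{-1} = F' (I - lam W)^{-1} F
ok_inv = np.allclose(np.linalg.inv(np.eye(m - 1) - lam * W_star),
                     F.T @ np.linalg.inv(np.eye(m) - lam * W) @ F)
```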
LEMMA C.2. $\epsilon_{nr}^*(\delta) = R_{nr}^*(\rho)[S_{nr}^*(\lambda)Y_{nr}^* - Z_{nr}^*\beta] = F_{nr}' R_{nr}(\rho)[S_{nr}(\lambda)Y_{nr} - Z_{nr}\beta]$.

Proof:
$$\epsilon_{nr}^*(\delta) = R_{nr}^*(\rho)[S_{nr}^*(\lambda)Y_{nr}^* - Z_{nr}^*\beta] = F_{nr}' R_{nr}(\rho) F_{nr}\cdot\big[F_{nr}' S_{nr}(\lambda) F_{nr}\cdot F_{nr}' Y_{nr} - F_{nr}' Z_{nr}\beta\big]$$
$$= F_{nr}' R_{nr}(\rho)\Big(I_{m_r} - \frac{1}{m_r} l_{m_r} l_{m_r}'\Big)\Big[S_{nr}(\lambda)\Big(I_{m_r} - \frac{1}{m_r} l_{m_r} l_{m_r}'\Big)Y_{nr} - Z_{nr}\beta\Big] = F_{nr}' R_{nr}(\rho)[S_{nr}(\lambda)Y_{nr} - Z_{nr}\beta],$$
because $F_{nr}' l_{m_r} = F_{nr}' W_{nr} l_{m_r} = F_{nr}' M_{nr} l_{m_r} = 0$.
LEMMA C.3. Suppose that $\{\|W_n\|\}$, $\{\|M_n\|\}$, $\{\|S_n^{-1}\|\}$ and $\{\|R_n^{-1}\|\}$, where $\|\cdot\|$ is a matrix norm, are bounded. Then $\{\|S_n^{-1}(\lambda)\|\}$ and $\{\|R_n^{-1}(\rho)\|\}$ are uniformly bounded in a neighbourhood of $\lambda_0$ and $\rho_0$, respectively.

LEMMA C.4. Suppose that the elements of the $n\times k$ matrices $Z_n$ are uniformly bounded for all $n$, and $\lim_{n\to\infty}\frac{1}{n} Z_n' R_n' J_n R_n Z_n$ exists and is non-singular. Then the projectors $P_n$ and $(J_n - P_n)$, where $P_n = J_n - J_n R_n Z_n [Z_n' R_n' J_n R_n Z_n]^{-1} Z_n' R_n' J_n$, are uniformly bounded in both row and column sums in absolute value.

LEMMA C.5. Suppose that the elements of the sequences of vectors $P_n = (p_{n1}, \ldots, p_{nn})'$ and $Q_n = (q_{n1}, \ldots, q_{nn})'$ are uniformly bounded for all $n$. (1) If $\{A_n\}$ are uniformly bounded in either row or column sums in absolute value, then $|Q_n' A_n P_n| = O(n)$. (2) If the row sums of $\{A_n\}$ and $\{Z_n\}$ are uniformly bounded, then $|z_{i,n} A_n P_n| = O(1)$ uniformly in $i$, where $z_{i,n}$ is the $i$th row of $Z_n$.

LEMMA C.6. Suppose that the elements of the $n\times n$ matrices $\{A_n\}$ are uniformly bounded, and the $n\times n$ matrices $\{B_n\}$ are uniformly bounded in column sums (respectively, row sums) in absolute value. Then the elements of $A_n B_n$ (respectively, $B_n A_n$) are uniformly bounded. For both cases, $\mathrm{tr}(A_n B_n) = \mathrm{tr}(B_n A_n) = O(n)$.

LEMMA C.7. Suppose that $A_n$ is an $n\times n$ matrix with column sums uniformly bounded in absolute value, and the elements of the $n\times k$ matrix $Z_n$ are uniformly bounded. The elements $v_i$ of $V_n = (v_1, \ldots, v_n)'$ are i.i.d. $(0, \sigma^2)$. Then $\frac{1}{\sqrt n} Z_n' A_n V_n = O_p(1)$. Furthermore, if the limit of $\frac{1}{n} Z_n' A_n A_n' Z_n$ exists and is positive definite, then $\frac{1}{\sqrt n} Z_n' A_n V_n \xrightarrow{d} N\big(0, \sigma_0^2 \lim_{n\to\infty}\frac{1}{n} Z_n' A_n A_n' Z_n\big)$.

LEMMA C.8. Let $A_n$ be an $n\times n$ matrix. Then $E(V_n' A_n V_n) = \sigma^2 \mathrm{tr}(A_n)$ and $\mathrm{Var}(V_n' A_n V_n) = (\mu_4 - 3\sigma^4)\,\mathrm{vec}_D'(A_n)\mathrm{vec}_D(A_n) + \sigma^4[\mathrm{tr}(A_n A_n') + \mathrm{tr}(A_n^2)]$.

LEMMA C.9. Suppose that $\{A_n\}$ is a sequence of $n\times n$ matrices uniformly bounded in either row or column sums in absolute value. Then $E(V_n' A_n V_n) = O(n)$, $\mathrm{Var}(V_n' A_n V_n) = O(n)$, $V_n' A_n V_n = O_p(n)$, and $\frac{1}{n}[V_n' A_n V_n - E(V_n' A_n V_n)] = o_p(1)$.

LEMMA C.10. Suppose that $\{A_n\}$ is a sequence of symmetric $n\times n$ matrices with row and column sums uniformly bounded in absolute value, and $\{b_n\}$ is a sequence of $n$-dimensional constant vectors such that $\sup_n \frac{1}{n}\sum_{i=1}^n |b_{ni}|^{2+\eta_1} < \infty$ for some $\eta_1 > 0$. The moment $E(|v|^{4+2\eta})$ of $v$ exists for some $\eta > 0$. Let $\sigma_{Q_n}^2$ be the variance of $Q_n$, where $Q_n = b_n' V_n + V_n' A_n V_n - \sigma^2 \mathrm{tr}(A_n)$. Assume that the variance $\sigma_{Q_n}^2$ is bounded away from zero at the rate $n$. Then $Q_n/\sigma_{Q_n} \xrightarrow{d} N(0, 1)$.
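The moment formulas of Lemma C.8 can be checked exactly for a small discrete distribution by enumerating all outcomes. The sketch below (illustrative; the three-point distribution and matrix are my own choices) computes the exact mean and variance of $V_n' A_n V_n$ by brute force and compares them with the lemma's expressions.

```python
import numpy as np
from itertools import product

# v_i i.i.d. on {-1, 0, 2} with probs {1/2, 1/4, 1/4}: mean zero by construction
vals = np.array([-1.0, 0.0, 2.0])
probs = np.array([0.5, 0.25, 0.25])
sigma2 = np.sum(probs * vals ** 2)        # = 1.5
mu4 = np.sum(probs * vals ** 4)           # = 4.5

n = 4
rng = np.random.default_rng(2)
A = rng.random((n, n))                    # arbitrary (not necessarily symmetric)

# Exact E(V'AV) and Var(V'AV) over all 3^n outcomes
E_q = E_q2 = 0.0
for idx in product(range(3), repeat=n):
    v = vals[list(idx)]
    p = np.prod(probs[list(idx)])
    q = v @ A @ v
    E_q += p * q
    E_q2 += p * q * q
var_q = E_q2 - E_q ** 2

# Lemma C.8 formulas
d = np.diag(A)
mean_formula = sigma2 * np.trace(A)
var_formula = ((mu4 - 3 * sigma2 ** 2) * (d @ d)
               + sigma2 ** 2 * (np.trace(A @ A.T) + np.trace(A @ A)))
ok_mean = np.isclose(E_q, mean_formula)
ok_var = np.isclose(var_q, var_formula)
```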
APPENDIX D: A PARTIAL LIKELIHOOD JUSTIFICATION

The likelihood function (4.1) is for $F_{nr}' R_{nr} Y_{nr}$ given in equation (3.4). It remains to consider the remaining component $\frac{1}{m_r} l_{m_r}' R_{nr} Y_{nr}$, which is
$$\frac{1}{m_r} l_{m_r}' R_{nr} Y_{nr} = \frac{1}{m_r} l_{m_r}' R_{nr}(\lambda_0 W_{nr} Y_{nr} + Z_{nr}\beta_0) + (1-\rho_0)\alpha_{r0} + \bar\epsilon_r, \tag{D.1}$$
where $\bar\epsilon_r = \frac{1}{m_r} l_{m_r}' \epsilon_{nr}$. As $F_{nr} F_{nr}' = J_{nr}$ and $\frac{1}{m_r} l_{m_r}' R_{nr} l_{m_r} = (1-\rho_0)$, it follows that
$$\frac{1}{m_r} l_{m_r}' R_{nr} Y_{nr} = \frac{1}{m_r} l_{m_r}' R_{nr}\Big(F_{nr} F_{nr}' + \frac{1}{m_r} l_{m_r} l_{m_r}'\Big) Y_{nr} = \frac{1}{m_r} l_{m_r}' R_{nr} F_{nr} Y_{nr}^* + (1-\rho_0)\bar y_r, \tag{D.2}$$
$$\frac{1}{m_r} l_{m_r}' R_{nr} Z_{nr} = \frac{1}{m_r} l_{m_r}' R_{nr}\Big(F_{nr} F_{nr}' + \frac{1}{m_r} l_{m_r} l_{m_r}'\Big) Z_{nr} = \frac{1}{m_r} l_{m_r}' R_{nr} F_{nr} Z_{nr}^* + (1-\rho_0)\bar z_r', \tag{D.3}$$
where $\bar y_r = \frac{1}{m_r} l_{m_r}' Y_{nr}$ and $\bar z_r' = \frac{1}{m_r} l_{m_r}' Z_{nr}$. Similarly,
$$\frac{1}{m_r} l_{m_r}' R_{nr} W_{nr} Y_{nr} = \frac{1}{m_r} l_{m_r}' R_{nr}\Big(F_{nr} F_{nr}' + \frac{1}{m_r} l_{m_r} l_{m_r}'\Big) W_{nr} Y_{nr} = \frac{1}{m_r} l_{m_r}' R_{nr} F_{nr} W_{nr}^* Y_{nr}^* + (1-\rho_0)\frac{1}{m_r} l_{m_r}' W_{nr} Y_{nr}, \tag{D.4}$$
where $\frac{1}{m_r} l_{m_r}' W_{nr} Y_{nr} = \frac{1}{m_r} l_{m_r}' W_{nr}(F_{nr} F_{nr}' + \frac{1}{m_r} l_{m_r} l_{m_r}') Y_{nr} = \frac{1}{m_r} l_{m_r}' W_{nr} F_{nr} Y_{nr}^* + \bar y_r$. Substitution of (D.2)–(D.4) in (D.1) gives
$$\frac{1}{m_r} l_{m_r}' R_{nr} F_{nr} Y_{nr}^* + (1-\rho_0)\bar y_r = \lambda_0 \frac{1}{m_r} l_{m_r}' R_{nr} F_{nr} W_{nr}^* Y_{nr}^* + \lambda_0(1-\rho_0)\Big[\frac{1}{m_r} l_{m_r}' W_{nr} F_{nr} Y_{nr}^* + \bar y_r\Big] + \frac{1}{m_r} l_{m_r}' R_{nr} F_{nr} Z_{nr}^*\beta_0 + (1-\rho_0)\bar z_r'\beta_0 + (1-\rho_0)\alpha_{r0} + \bar\epsilon_r,$$
or, equivalently,
$$\bar y_r = \frac{\lambda_0}{(1-\lambda_0)(1-\rho_0)}\frac{1}{m_r} l_{m_r}' R_{nr} F_{nr} W_{nr}^* Y_{nr}^* - \frac{1}{(1-\lambda_0)(1-\rho_0)}\frac{1}{m_r} l_{m_r}' R_{nr} F_{nr} Y_{nr}^* + \frac{\lambda_0}{1-\lambda_0}\frac{1}{m_r} l_{m_r}' W_{nr} F_{nr} Y_{nr}^* + \frac{1}{(1-\lambda_0)(1-\rho_0)}\frac{1}{m_r} l_{m_r}' R_{nr} F_{nr} Z_{nr}^*\beta_0 + \frac{1}{1-\lambda_0}\bar z_r'\beta_0 + \frac{1}{1-\lambda_0}\alpha_{r0} + \frac{1}{(1-\lambda_0)(1-\rho_0)}\bar\epsilon_r. \tag{D.5}$$

Conditional on $X_{nr}^*$, (D.5) can be regarded as a non-linear regression equation. As $\bar\epsilon_r$ is independent of $Y_{nr}^*$, the joint likelihood function of $Y_{nr}^*$ and $\bar y_r$ can thus be decomposed into a product of the conditional likelihood of $\bar y_r$ given $Y_{nr}^*$ from (D.5) and the likelihood function of $Y_{nr}^*$ from (3.4). Therefore, the likelihood function of $Y_{nr}^*$ from the transformation method for (3.4) is a partial likelihood function (Cox, 1975, and Lancaster, 2000).
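The decomposition identity (D.2) is straightforward to verify numerically. The sketch below (illustrative only; random data and parameter values are my own choices) checks that $\frac{1}{m_r} l' R_{nr} Y_{nr}$ equals $\frac{1}{m_r} l' R_{nr} F_{nr} Y_{nr}^* + (1-\rho_0)\bar y_r$ for a row-normalized $M_{nr}$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, rho = 6, 0.3
M = rng.random((m, m))
np.fill_diagonal(M, 0.0)
M = M / M.sum(axis=1, keepdims=True)     # row-normalized M_nr (so M l = l)
R = np.eye(m) - rho * M                  # R_nr(rho)

l = np.ones((m, 1))
J = np.eye(m) - l @ l.T / m
w, v = np.linalg.eigh(J)
F = v[:, np.isclose(w, 1.0)]             # F_nr

Y = rng.standard_normal((m, 1))
Y_star = F.T @ Y
ybar = float(l.T @ Y) / m

lhs = float(l.T @ R @ Y) / m
rhs = float(l.T @ R @ F @ Y_star) / m + (1 - rho) * ybar
ok_D2 = np.isclose(lhs, rhs)
```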
APPENDIX E: THE LIKELIHOOD FUNCTION OF A NETWORK MODEL WITH A SPATIAL ARMA DISTURBANCE

In this appendix, we show that the proposed QML approach can be generalized to the case where the disturbances follow a more general spatial ARMA process. The generalized model has the specification $Y_{nr} = \lambda_0 W_{nr} Y_{nr} + Z_{nr}\beta_0 + l_{m_r}\alpha_{r0} + u_{nr}$, where $u_{nr} = \rho_{10} M_{1nr} u_{nr} + \rho_{20} M_{2nr}\epsilon_{nr} + \epsilon_{nr}$ for $r = 1, \ldots, \bar r$. In this model, $W_{nr}$, $M_{1nr}$ and $M_{2nr}$ are row-normalized such that the sum of each row is unity, that is, $W_{nr} l_{m_r} = M_{1nr} l_{m_r} = M_{2nr} l_{m_r} = l_{m_r}$. As before, let $R_{nr}(\rho) = I_{m_r} - \rho M_{1nr}$ and $R_{nr} = R_{nr}(\rho_{10})$. A Cochrane–Orcutt-type transformation gives
$$R_{nr} S_{nr} Y_{nr} = R_{nr} Z_{nr}\beta_0 + (1-\rho_{10}) l_{m_r}\alpha_{r0} + (I_{m_r} + \rho_{20} M_{2nr})\epsilon_{nr}.$$
Note that $(I_{m_r} + \rho_{20} M_{2nr})^{-1} l_{m_r} = (1+\rho_{20})^{-1} l_{m_r}$ under the assumption that $(I_{m_r} + \rho_{20} M_{2nr})$ is invertible and $M_{2nr}$ is row-normalized. It follows that
$$(I_{m_r} + \rho_{20} M_{2nr})^{-1} R_{nr} S_{nr} Y_{nr} = (I_{m_r} + \rho_{20} M_{2nr})^{-1} R_{nr} Z_{nr}\beta_0 + \frac{1-\rho_{10}}{1+\rho_{20}} l_{m_r}\alpha_{r0} + \epsilon_{nr}.$$
As $F_{nr}'(I_{m_r} + \rho_{20} M_{2nr})^{-1} = F_{nr}'(I_{m_r} + \rho_{20} M_{2nr})^{-1} F_{nr} F_{nr}' = (I_{m_r^*} + \rho_{20} F_{nr}' M_{2nr} F_{nr})^{-1} F_{nr}'$, premultiplication by $F_{nr}'$ leads to a transformed model without the $\alpha_{r0}$'s; that is,
$$(I_{m_r^*} + \rho_{20} F_{nr}' M_{2nr} F_{nr})^{-1} R_{nr}^* S_{nr}^* Y_{nr}^* = (I_{m_r^*} + \rho_{20} F_{nr}' M_{2nr} F_{nr})^{-1} R_{nr}^* Z_{nr}^*\beta_0 + \epsilon_{nr}^*.$$
Let $\epsilon_{nr}^*(\delta) = (I_{m_r^*} + \rho_2 F_{nr}' M_{2nr} F_{nr})^{-1} R_{nr}^*(\rho_1)[S_{nr}^*(\lambda) Y_{nr}^* - Z_{nr}^*\beta]$, where $\delta = (\beta', \lambda, \rho_1, \rho_2)'$. For a sample with $\bar r$ macro groups, the log-likelihood function is
$$\ln L_n(\theta) = -\frac{n^*}{2}\ln(2\pi\sigma^2) + \sum_{r=1}^{\bar r}\ln|S_{nr}^*(\lambda)| + \sum_{r=1}^{\bar r}\ln|R_{nr}^*(\rho_1)| - \sum_{r=1}^{\bar r}\ln|I_{m_r^*} + \rho_2 F_{nr}' M_{2nr} F_{nr}| - \frac{1}{2\sigma^2}\sum_{r=1}^{\bar r}\epsilon_{nr}^{*\prime}(\delta)\epsilon_{nr}^*(\delta),$$
where $\theta = (\delta', \sigma^2)'$. As $|S_{nr}^*(\lambda)| = \frac{1}{1-\lambda}|S_{nr}(\lambda)|$, $|R_{nr}^*(\rho_1)| = \frac{1}{1-\rho_1}|R_{nr}(\rho_1)|$, $|I_{m_r^*} + \rho_2 F_{nr}' M_{2nr} F_{nr}| = \frac{1}{1+\rho_2}|I_{m_r} + \rho_2 M_{2nr}|$, and $\epsilon_{nr}^*(\delta) = F_{nr}'\epsilon_{nr}(\delta)$, where $\epsilon_{nr}(\delta) = (I_{m_r} + \rho_2 M_{2nr})^{-1} R_{nr}(\rho_1)[S_{nr}(\lambda) Y_{nr} - Z_{nr}\beta]$, the log-likelihood function can be evaluated without the $F_{nr}$'s as
$$\ln L_n(\theta) = -\frac{n^*}{2}\ln(2\pi\sigma^2) + \sum_{r=1}^{\bar r}\ln\frac{|S_{nr}(\lambda)|}{1-\lambda} + \sum_{r=1}^{\bar r}\ln\frac{|R_{nr}(\rho_1)|}{1-\rho_1} - \sum_{r=1}^{\bar r}\ln\frac{|I_{m_r} + \rho_2 M_{2nr}|}{1+\rho_2} - \frac{1}{2\sigma^2}\sum_{r=1}^{\bar r}\epsilon_{nr}'(\delta) J_{nr}\epsilon_{nr}(\delta).$$
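The claimed equivalence of the two forms of the log-likelihood can be checked numerically for a single group. The sketch below (illustrative only; random matrices and the parameter values are my own choices, not from the paper) evaluates both expressions and confirms they agree.

```python
import numpy as np

rng = np.random.default_rng(7)
m = 6
lam, rho1, rho2, sig2 = 0.3, 0.2, 0.25, 1.0

def row_norm(A):
    np.fill_diagonal(A, 0.0)
    return A / A.sum(axis=1, keepdims=True)

W = row_norm(rng.random((m, m)))
M1 = row_norm(rng.random((m, m)))
M2 = row_norm(rng.random((m, m)))

l = np.ones((m, 1))
J = np.eye(m) - l @ l.T / m
w, V = np.linalg.eigh(J)
F = V[:, np.isclose(w, 1.0)]

Y = rng.standard_normal((m, 1))
Z = rng.standard_normal((m, 2))
beta = np.array([[0.5], [-1.0]])

S, R1 = np.eye(m) - lam * W, np.eye(m) - rho1 * M1
A2 = np.eye(m) + rho2 * M2
n_star = m - 1

# (i) transformed-variable version (with F's)
Ss, R1s = F.T @ S @ F, F.T @ R1 @ F
A2s = np.eye(m - 1) + rho2 * F.T @ M2 @ F
es = np.linalg.solve(A2s, R1s @ (Ss @ (F.T @ Y) - F.T @ Z @ beta))
lnL1 = (-0.5 * n_star * np.log(2 * np.pi * sig2)
        + np.log(np.linalg.det(Ss)) + np.log(np.linalg.det(R1s))
        - np.log(np.linalg.det(A2s)) - float(es.T @ es) / (2 * sig2))

# (ii) F-free version
e = np.linalg.solve(A2, R1 @ (S @ Y - Z @ beta))
lnL2 = (-0.5 * n_star * np.log(2 * np.pi * sig2)
        + np.log(np.linalg.det(S) / (1 - lam))
        + np.log(np.linalg.det(R1) / (1 - rho1))
        - np.log(np.linalg.det(A2) / (1 + rho2))
        - float(e.T @ J @ e) / (2 * sig2))

ok_lnL = np.isclose(lnL1, lnL2)
```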
APPENDIX F: AN ALTERNATIVE ELIMINATION METHOD OF THE MACRO GROUP FIXED EFFECTS

Lin (2005) suggests an alternative method to eliminate the fixed effects by the transformation using $(I_{m_r} - W_{nr})$, the deviation from the weighted average of an individual's connections. Although this transformation is not very convenient when $\rho_0 \ne 0$ with an arbitrary $M_{nr}$ such that $M_{nr} \ne W_{nr}$, it can be used as an alternative approach in the special case when $M_{nr} = W_{nr}$. When $M_{nr} = W_{nr}$, as $(I_{m_r} - W_{nr}) W_{nr} = W_{nr}(I_{m_r} - W_{nr})$ and $(I_{m_r} - W_{nr}) R_{nr} = R_{nr}(I_{m_r} - W_{nr})$, premultiplication of (3.1) by $(I_{m_r} - W_{nr})$ gives
$$R_{nr}(I_{m_r} - W_{nr}) Y_{nr} = \lambda_0 R_{nr} W_{nr}(I_{m_r} - W_{nr}) Y_{nr} + R_{nr}(I_{m_r} - W_{nr}) Z_{nr}\beta_0 + (I_{m_r} - W_{nr})\epsilon_{nr}. \tag{F.1}$$
The fixed effect $\alpha_{r0}$ is eliminated because $(I_{m_r} - W_{nr}) l_{m_r} = 0$ as $W_{nr} l_{m_r} = l_{m_r}$. The variance of the transformed disturbances $(I_{m_r} - W_{nr})\epsilon_{nr}$ is $\sigma^2\ddot\Sigma_{nr}$, where $\ddot\Sigma_{nr} = (I_{m_r} - W_{nr})(I_{m_r} - W_{nr})'$. The elements of $(I_{m_r} - W_{nr})\epsilon_{nr}$ may be correlated and heteroskedastic. There is also linear dependence among its elements because $(I_{m_r} - W_{nr})$ does not have full row rank. Suppose that the rank of $(I_{m_r} - W_{nr})$ is $m_r^\dagger$, which, in principle, can be empirically evaluated as $W_{nr}$ is a given matrix. As $(I_{m_r} - W_{nr}) l_{m_r} = 0$, $m_r^\dagger \le m_r - 1$, so the transformation using $(I_{m_r} - W_{nr})$ to eliminate the fixed effects may leave the number of independent sample observations less than $\sum_{r=1}^{\bar r}(m_r - 1)$.

As $\ddot\Sigma_{nr}$ is positive semi-definite, there exists some orthonormal matrix $[\ddot F_{nr}, \ddot H_{nr}]$, where $\ddot F_{nr}$ is an $m_r\times m_r^\dagger$ matrix of normalized eigenvectors corresponding to the positive eigenvalues and $\ddot H_{nr}$ is an $m_r\times(m_r - m_r^\dagger)$ matrix of normalized eigenvectors with zero eigenvalues. Let $\Lambda_{nr}$ be the $m_r^\dagger\times m_r^\dagger$ diagonal matrix consisting of all the positive eigenvalues. Thus $\ddot F_{nr}'\ddot\Sigma_{nr}\ddot F_{nr} = \Lambda_{nr}$, $\ddot\Sigma_{nr}\ddot H_{nr} = 0$, $\ddot F_{nr}'\ddot F_{nr} = I_{m_r^\dagger}$, $\ddot F_{nr}'\ddot H_{nr} = 0$, $\ddot F_{nr}\ddot F_{nr}' + \ddot H_{nr}\ddot H_{nr}' = I_{m_r}$ and $\ddot\Sigma_{nr} = \ddot F_{nr}\Lambda_{nr}\ddot F_{nr}'$.

Denote $Y_{nr}^\dagger = \Lambda_{nr}^{-1/2}\ddot F_{nr}'(I_{m_r} - W_{nr}) Y_{nr}$, $Z_{nr}^\dagger = \Lambda_{nr}^{-1/2}\ddot F_{nr}'(I_{m_r} - W_{nr}) Z_{nr}$ and $\epsilon_{nr}^\dagger = \Lambda_{nr}^{-1/2}\ddot F_{nr}'(I_{m_r} - W_{nr})\epsilon_{nr}$. To eliminate heteroskedasticity and linear dependence in $(I_{m_r} - W_{nr})\epsilon_{nr}$, premultiplication of (F.1) by $\Lambda_{nr}^{-1/2}\ddot F_{nr}'$ yields
$$R_{nr}^\dagger Y_{nr}^\dagger = \lambda_0 R_{nr}^\dagger W_{nr}^\dagger Y_{nr}^\dagger + R_{nr}^\dagger Z_{nr}^\dagger\beta_0 + \epsilon_{nr}^\dagger, \tag{F.2}$$
where $W_{nr}^\dagger = \Lambda_{nr}^{-1/2}\ddot F_{nr}' W_{nr}\ddot F_{nr}\Lambda_{nr}^{1/2}$ and $R_{nr}^\dagger = \Lambda_{nr}^{-1/2}\ddot F_{nr}' R_{nr}\ddot F_{nr}\Lambda_{nr}^{1/2} = I_{m_r^\dagger} - \rho_0 W_{nr}^\dagger$. The variance matrix of the transformed disturbances $\epsilon_{nr}^\dagger$ is $\sigma^2 I_{m_r^\dagger}$.

Under the normality assumption, the log-likelihood function of the sample with $\bar r$ macro groups is
$$\ln L_n(\theta) = -\frac{n^\dagger}{2}\ln(2\pi\sigma^2) + \sum_{r=1}^{\bar r}\big[\ln|I_{m_r^\dagger} - \lambda W_{nr}^\dagger| + \ln|I_{m_r^\dagger} - \rho W_{nr}^\dagger|\big] - \frac{1}{2\sigma^2}\sum_{r=1}^{\bar r}\big[(I_{m_r^\dagger} - \lambda W_{nr}^\dagger) Y_{nr}^\dagger - Z_{nr}^\dagger\beta\big]' R_{nr}^{\dagger\prime}(\rho) R_{nr}^\dagger(\rho)\big[(I_{m_r^\dagger} - \lambda W_{nr}^\dagger) Y_{nr}^\dagger - Z_{nr}^\dagger\beta\big],$$
where $n^\dagger = \sum_{r=1}^{\bar r} m_r^\dagger$. To implement the ML estimation, one needs to evaluate the determinants $|I_{m_r^\dagger} - \lambda W_{nr}^\dagger|$ and $|I_{m_r^\dagger} - \rho W_{nr}^\dagger|$ for each macro group $r$. This is equivalent to evaluating the determinants $|I_{m_r} - \lambda W_{nr}|$ and $|I_{m_r} - \rho W_{nr}|$, which can be shown as follows. As
$$[\ddot F_{nr}, \ddot H_{nr}]'(I_{m_r} - \lambda W_{nr})[\ddot F_{nr}, \ddot H_{nr}] = I_{m_r} - \lambda[\ddot F_{nr}, \ddot H_{nr}]' W_{nr}[\ddot F_{nr}, \ddot H_{nr}] = \begin{pmatrix} I_{m_r^\dagger} - \lambda\ddot F_{nr}' W_{nr}\ddot F_{nr} & -\lambda\ddot F_{nr}' W_{nr}\ddot H_{nr} \\ 0 & (1-\lambda) I_{(m_r - m_r^\dagger)} \end{pmatrix},$$
because $\ddot H_{nr}' W_{nr} = \ddot H_{nr}'$, $\ddot H_{nr}'\ddot F_{nr} = 0$ and $\ddot H_{nr}'\ddot H_{nr} = I_{(m_r - m_r^\dagger)}$, it follows that $|I_{m_r} - \lambda W_{nr}| = |I_{m_r^\dagger} - \lambda\ddot F_{nr}' W_{nr}\ddot F_{nr}|\cdot|(1-\lambda) I_{(m_r - m_r^\dagger)}|$. Therefore,
$$|I_{m_r^\dagger} - \lambda W_{nr}^\dagger| = |I_{m_r^\dagger} - \lambda\Lambda_{nr}^{-1/2}\ddot F_{nr}' W_{nr}\ddot F_{nr}\Lambda_{nr}^{1/2}| = |I_{m_r^\dagger} - \lambda\ddot F_{nr}' W_{nr}\ddot F_{nr}| = (1-\lambda)^{-(m_r - m_r^\dagger)}|I_{m_r} - \lambda W_{nr}|.$$
Similarly, $|I_{m_r^\dagger} - \rho W_{nr}^\dagger| = (1-\rho)^{-(m_r - m_r^\dagger)}|I_{m_r} - \rho W_{nr}|$. As $R_{nr}^\dagger\big[(I_{m_r^\dagger} - \lambda W_{nr}^\dagger) Y_{nr}^\dagger - Z_{nr}^\dagger\beta\big] = \Lambda_{nr}^{-1/2}\ddot F_{nr}'(I_{m_r} - W_{nr}) R_{nr}[S_{nr}(\lambda) Y_{nr} - Z_{nr}\beta]$, the log-likelihood can also be expressed in terms of $Y_{nr}$, $Z_{nr}$ and $W_{nr}$ as
$$\ln L_n(\theta) = -\frac{n^\dagger}{2}\ln(2\pi\sigma^2) - \sum_{r=1}^{\bar r}(m_r - m_r^\dagger)\ln[(1-\lambda)(1-\rho)] + \sum_{r=1}^{\bar r}\ln|S_{nr}(\lambda)| + \sum_{r=1}^{\bar r}\ln|R_{nr}(\rho)| - \frac{1}{2\sigma^2}\sum_{r=1}^{\bar r}[S_{nr}(\lambda) Y_{nr} - Z_{nr}\beta]' R_{nr}'(\rho)(I_{m_r} - W_{nr})'\ddot\Sigma_{nr}^+(I_{m_r} - W_{nr}) R_{nr}(\rho)[S_{nr}(\lambda) Y_{nr} - Z_{nr}\beta], \tag{F.3}$$
where $\ddot\Sigma_{nr}^+ = \ddot F_{nr}\Lambda_{nr}^{-1}\ddot F_{nr}'$ is the generalized inverse of $(I_{m_r} - W_{nr})(I_{m_r} - W_{nr})'$. The MLE is derived from the maximization of (F.3).
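The determinant relation above can be verified numerically. In the sketch below (illustrative only; a generic row-normalized $W_{nr}$ is drawn, for which the rank of $(I - W)$ is generically $m_r - 1$), the eigendecomposition of $(I - W)(I - W)'$ is used to build the transformed weights matrix, and the determinant identity is checked.

```python
import numpy as np

rng = np.random.default_rng(4)
m, lam = 6, 0.3
W = rng.random((m, m))
np.fill_diagonal(W, 0.0)
W = W / W.sum(axis=1, keepdims=True)       # row-normalized: W l = l

A = np.eye(m) - W
Sig = A @ A.T                               # (I-W)(I-W)'
w, V = np.linalg.eigh(Sig)
pos = w > 1e-10
m_dag = int(pos.sum())                      # rank of (I - W); generically m - 1
F = V[:, pos]                               # eigenvectors for positive eigenvalues
Li = np.diag(w[pos] ** -0.5)                # Lambda^{-1/2}
Lh = np.diag(w[pos] ** 0.5)                 # Lambda^{1/2}
W_dag = Li @ F.T @ W @ F @ Lh               # transformed weights matrix

# |I - lam W_dag| = (1 - lam)^{-(m - m_dag)} |I - lam W|
lhs = np.linalg.det(np.eye(m_dag) - lam * W_dag)
rhs = (1 - lam) ** (-(m - m_dag)) * np.linalg.det(np.eye(m) - lam * W)
ok_detF = np.isclose(lhs, rhs)
```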
APPENDIX G: PROOFS

Proof of Lemma 5.1: With the restriction $\lambda_0\beta_{10} + \beta_{20} = 0$, $Z_{nr}^*\beta_0 = S_{nr}^* X_{nr}^*\beta_{10}$ and, hence, (5.1) becomes $Y_{nr}^* = X_{nr}^*\beta_{10} + v_{nr}$, where $v_{nr} = S_{nr}^{*-1} R_{nr}^{*-1}\epsilon_{nr}^*$. As $\lambda_0$ and $\rho_0$ are not in the mean regression equation, they can only be identified via the disturbances, $v_{nr} = \rho_0 M_{nr}^* v_{nr} + \lambda_0 W_{nr}^* v_{nr} - \rho_0\lambda_0 M_{nr}^* W_{nr}^* v_{nr} + \epsilon_{nr}^*$, when $M_{nr} \ne W_{nr}$. With $\lambda_0$ and $\beta_{10}$ identified, $\beta_{20}$ can be identified from the restriction $\lambda_0\beta_{10} + \beta_{20} = 0$. However, when $M_{nr} = W_{nr}$, $v_{nr} = (\rho_0 + \lambda_0) W_{nr}^* v_{nr} - \rho_0\lambda_0 W_{nr}^{*2} v_{nr} + \epsilon_{nr}^*$. In this case, $\rho_0$ and $\lambda_0$ may only be identified locally but not globally, and hence $\beta_{20}$ cannot be separately identified.

Proof of Lemma 5.2: For group $r$, let $c_1$, $c_2$ and $c_3$ be conformable scalar and column vectors such that
$$J_{nr} R_{nr}(\rho) G_{nr}(X_{nr}\beta_{10} + W_{nr} X_{nr}\beta_{20}) c_1 + J_{nr} R_{nr}(\rho) X_{nr} c_2 + J_{nr} R_{nr}(\rho) W_{nr} X_{nr} c_3 = 0, \tag{G.1}$$
where $G_{nr} = W_{nr} S_{nr}^{-1}$. We are interested in sufficient conditions so that $c_1 = c_2 = c_3 = 0$. Denote $\mu_{1r} = \frac{1}{m_r} l_{m_r}' R_{nr}(\rho) G_{nr}(X_{nr}\beta_{10} + W_{nr} X_{nr}\beta_{20})$, $\mu_{2r} = \frac{1}{m_r} l_{m_r}' R_{nr}(\rho) X_{nr}$ and $\mu_{3r} = \frac{1}{m_r} l_{m_r}' R_{nr}(\rho) W_{nr} X_{nr}$. As $W_{nr} S_{nr}^{-1} = S_{nr}^{-1} W_{nr}$,
$$[J_{nr} R_{nr}(\rho) G_{nr}(X_{nr}\beta_{10} + W_{nr} X_{nr}\beta_{20}),\; J_{nr} R_{nr}(\rho) X_{nr},\; J_{nr} R_{nr}(\rho) W_{nr} X_{nr}]$$
$$= R_{nr}(\rho) S_{nr}^{-1}\big\{[W_{nr}(X_{nr}\beta_{10} + W_{nr} X_{nr}\beta_{20}),\; S_{nr} X_{nr},\; S_{nr} W_{nr} X_{nr}] - S_{nr} R_{nr}^{-1}(\rho) l_{m_r}(\mu_{1r}, \mu_{2r}, \mu_{3r})\big\}$$
$$= R_{nr}(\rho) S_{nr}^{-1}\big\{[W_{nr}(X_{nr}\beta_{10} + W_{nr} X_{nr}\beta_{20}),\; S_{nr} X_{nr},\; S_{nr} W_{nr} X_{nr}] - l_{m_r}(\mu_{1r}^*, \mu_{2r}^*, \mu_{3r}^*)\big\},$$
where $\mu_{lr}^* = \big(\frac{1-\lambda_0}{1-\rho}\big)\mu_{lr}$ for $l = 1, 2, 3$, because $S_{nr} R_{nr}^{-1}(\rho) l_{m_r} = \big(\frac{1-\lambda_0}{1-\rho}\big) l_{m_r}$. As $R_{nr}(\rho)$ and $S_{nr}$ are non-singular, (G.1) is equivalent to
$$W_{nr}(X_{nr}\beta_{10} + W_{nr} X_{nr}\beta_{20}) c_1 + S_{nr} X_{nr} c_2 + S_{nr} W_{nr} X_{nr} c_3 - l_{m_r}(\mu_{1r}^* c_1 + \mu_{2r}^* c_2 + \mu_{3r}^* c_3)$$
$$= X_{nr} c_2 + W_{nr} X_{nr}(c_1\beta_{10} - \lambda_0 c_2 + c_3) + W_{nr}^2 X_{nr}(c_1\beta_{20} - \lambda_0 c_3) - l_{m_r}(\mu_{1r}^* c_1 + \mu_{2r}^* c_2 + \mu_{3r}^* c_3) = 0.$$
As $[X_{nr}, W_{nr} X_{nr}, W_{nr}^2 X_{nr}, l_{m_r}]$ has full column rank, it follows that $c_2 = 0$, $c_3 + c_1\beta_{10} = 0$ and $c_1\beta_{20} - \lambda_0 c_3 = 0$. These imply, in turn, that $c_1 = 0$ and $c_3 = -c_1\beta_{10} = 0$ under the assumption $\beta_{20} + \lambda_0\beta_{10} \ne 0$. The desired result follows.

Proof of Lemma 5.3: By Lemma C.1,
$$F_{nr}' R_{nr}(\rho)[G_{nr}(X_{nr}\beta_{10} + W_{nr} X_{nr}\beta_{20}),\; X_{nr},\; W_{nr} X_{nr}] = R_{nr}^*(\rho) S_{nr}^{*-1}[W_{nr}^*(X_{nr}^*\beta_{10} + W_{nr}^* X_{nr}^*\beta_{20}),\; S_{nr}^* X_{nr}^*,\; S_{nr}^* W_{nr}^* X_{nr}^*].$$
As $R_{nr}^*(\rho)$ and $S_{nr}^*$ are non-singular, a sufficient identification condition, derived from an argument similar to that in the proof of Lemma 5.2 (but without the $l_{m_r}$ term), is that the stacked matrix with its $r$th row block being $[X_{nr}^*, W_{nr}^* X_{nr}^*, W_{nr}^{*2} X_{nr}^*]$ has full column rank, as long as $\beta_{20} + \lambda_0\beta_{10} \ne 0$. By a premultiplication with $F_{nr}$, a sufficient condition is that the stacked matrix with its $r$th row block being
$$[J_{nr} X_{nr},\; (J_{nr} W_{nr})(J_{nr} X_{nr}),\; (J_{nr} W_{nr})^2(J_{nr} X_{nr})] = J_{nr}[X_{nr}, W_{nr} X_{nr}, W_{nr}^2 X_{nr}]$$
has full column rank.
Proof of Proposition 5.1: We shall prove that $\frac{1}{n^*}[\ln L_n(\gamma) - Q_n(\gamma)]$ converges in probability to zero uniformly on $\Gamma$, and that the identification uniqueness condition holds, that is, for any $\epsilon > 0$, $\limsup_{n\to\infty}\max_{\gamma\in\bar N_\epsilon(\gamma_0)}\frac{1}{n^*}[Q_n(\gamma) - Q_n(\gamma_0)] < 0$, where $\bar N_\epsilon(\gamma_0)$ is the complement of an open neighbourhood of $\gamma_0$ in $\Gamma$ with radius $\epsilon$. The following arguments extend those in Lee (2004) for the SAR model with i.i.d. disturbances to our transformed equation model.

For the proof of these properties, it is useful to establish some properties of $\ln|S_n(\lambda)|$, $\ln|R_n(\rho)|$ and
$$\sigma_n^2(\gamma) = \frac{\sigma_0^2}{n^*}\mathrm{tr}\big([R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}]' J_n [R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}]\big),$$
where
$$J_n[R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}] = J_n\big[I_n + (\rho_0 - \rho) H_n + (\lambda_0 - \lambda) R_n G_n R_n^{-1} + (\rho_0 - \rho)(\lambda_0 - \lambda) H_n R_n G_n R_n^{-1}\big].$$
There is also an auxiliary model which has useful implications. Denote $Q_{p,n}(\gamma) = -\frac{n^*}{2}(\ln(2\pi) + 1) - \frac{n^*}{2}\ln\sigma_n^2(\gamma) + \ln|S_n(\lambda)| + \ln|R_n(\rho)| - \bar r\ln[(1-\lambda)(1-\rho)]$. The log-likelihood function of a transformed SAR process $R_{nr}^* Y_{nr}^* = \lambda_0 R_{nr}^* W_{nr}^* Y_{nr}^* + \epsilon_{nr}^*$, where $\epsilon_{nr}^* \sim N(0, \sigma_0^2 I_{m_r^*})$ for $r = 1, \ldots, \bar r$, is
$$\ln L_{p,n}(\gamma, \sigma^2) = -\frac{n^*}{2}\ln(2\pi) - \frac{n^*}{2}\ln\sigma^2 + \ln|S_n(\lambda)| + \ln|R_n(\rho)| - \bar r\ln[(1-\lambda)(1-\rho)] - \frac{1}{2\sigma^2} Y_n' S_n'(\lambda) R_n'(\rho) J_n R_n(\rho) S_n(\lambda) Y_n.$$
It is apparent that $Q_{p,n}(\gamma) = \max_{\sigma^2} E_p[\ln L_{p,n}(\gamma, \sigma^2)]$, where $E_p$ is the expectation under this SAR process. By the Jensen inequality, $Q_{p,n}(\gamma) \le E_p[\ln L_{p,n}(\gamma_0, \sigma_0^2)] = Q_{p,n}(\gamma_0)$ for all $\gamma$. This implies that $\frac{1}{n^*}[Q_{p,n}(\gamma) - Q_{p,n}(\gamma_0)] \le 0$ for all $\gamma$.

Let $(\lambda_1, \rho_1)$ and $(\lambda_2, \rho_2)$ be in $\Gamma$. By the mean value theorem, $\frac{1}{n^*}(\ln|S_n(\lambda_2)| - \ln|S_n(\lambda_1)|) = \frac{1}{n^*}\mathrm{tr}(G_n(\bar\lambda_n))(\lambda_2 - \lambda_1)$, where $\bar\lambda_n$ lies between $\lambda_1$ and $\lambda_2$. By the uniform boundedness of Assumption 4.4, Lemma C.6 implies that $\frac{1}{n^*}\mathrm{tr}(G_n(\bar\lambda_n)) = O(1)$. Thus $\frac{1}{n^*}\ln|S_n(\lambda)|$ is uniformly equicontinuous in $\lambda$ in $\Gamma$. As $\Gamma$ is a bounded set, $\frac{1}{n^*}(\ln|S_n(\lambda_2)| - \ln|S_n(\lambda_1)|) = O(1)$ uniformly in $\lambda_1$ and $\lambda_2$ in $\Gamma$. Similarly, $\frac{1}{n^*}\ln|R_n(\rho)|$ is uniformly equicontinuous in $\rho$ in $\Gamma$, and $\frac{1}{n^*}(\ln|R_n(\rho_2)| - \ln|R_n(\rho_1)|) = O(1)$ uniformly in $\rho_1$ and $\rho_2$ in $\Gamma$.

The $\sigma_n^2(\gamma)$ is uniformly bounded away from zero on $\Gamma$. This can be established by a counter-argument. Suppose that $\sigma_n^2(\gamma)$ were not uniformly bounded away from zero on $\Gamma$. Then there would exist a sequence $\{\gamma_n\}$ in $\Gamma$ such that $\lim_{n\to\infty}\sigma_n^2(\gamma_n) = 0$. We have shown that $\frac{1}{n^*}[Q_{p,n}(\gamma) - Q_{p,n}(\gamma_0)] \le 0$ for all $\gamma$, which implies that
$$-\frac{1}{2}\ln\sigma_n^2(\gamma) \le -\frac{1}{2}\ln\sigma_0^2 + \frac{1}{n^*}(\ln|S_n| - \ln|S_n(\lambda)|) + \frac{1}{n^*}(\ln|R_n| - \ln|R_n(\rho)|) - \frac{\bar r}{n^*}\big(\ln[(1-\lambda_0)(1-\rho_0)] - \ln[(1-\lambda)(1-\rho)]\big) = O(1),$$
because $\frac{1}{n^*}(\ln|S_n| - \ln|S_n(\lambda)|) = O(1)$ and $\frac{1}{n^*}(\ln|R_n| - \ln|R_n(\rho)|) = O(1)$ uniformly on $\Gamma$. That is, $-\ln\sigma_n^2(\gamma_n)$ is bounded from above, a contradiction. Therefore, $\sigma_n^2(\gamma)$ must be bounded away from zero uniformly on $\Gamma$.

(Uniform convergence.) We will show that
$$\sup_{\gamma\in\Gamma}\Big|\frac{1}{n^*}\ln L_n(\gamma) - \frac{1}{n^*} Q_n(\gamma)\Big| = \sup_{\gamma\in\Gamma}\frac{1}{2}\big|\ln\hat\sigma_n^2(\gamma) - \ln\sigma_n^{*2}(\gamma)\big| = o_p(1).$$
As $P_n(\rho) R_n(\rho) S_n(\lambda) Y_n = (\lambda_0 - \lambda) P_n(\rho) R_n(\rho) G_n Z_n\beta_0 + P_n(\rho) R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n$,
$$\hat\sigma_n^2(\gamma) = \frac{1}{n^*} Y_n' S_n'(\lambda) R_n'(\rho) P_n(\rho) R_n(\rho) S_n(\lambda) Y_n = \frac{(\lambda_0 - \lambda)^2}{n^*}[R_n(\rho) G_n Z_n\beta_0]' P_n(\rho)[R_n(\rho) G_n Z_n\beta_0] + 2(\lambda_0 - \lambda) K_{1n}(\gamma) + K_{2n}(\gamma),$$
where
$$K_{1n}(\gamma) = \frac{1}{n^*}[R_n(\rho) G_n Z_n\beta_0]' P_n(\rho)[R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n]$$
and
$$K_{2n}(\gamma) = \frac{1}{n^*}[R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n]' P_n(\rho)[R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n].$$
Hence $\hat\sigma_n^2(\gamma) - \sigma_n^{*2}(\gamma) = 2(\lambda_0 - \lambda) K_{1n}(\gamma) + K_{2n}(\gamma) - \sigma_n^2(\gamma)$, since
$$\sigma_n^{*2}(\gamma) = \frac{(\lambda_0 - \lambda)^2}{n^*}[R_n(\rho) G_n Z_n\beta_0]' P_n(\rho)[R_n(\rho) G_n Z_n\beta_0] + \sigma_n^2(\gamma).$$
Lemma C.7 implies $K_{1n}(\gamma) = o_p(1)$. The convergence is uniform on $\Gamma$ as $\lambda$ and $\rho$ appear simply as polynomial factors. On the other hand,
$$K_{2n}(\gamma) - \sigma_n^2(\gamma) = \frac{1}{n^*}[R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n]' J_n[R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n] - \frac{\sigma_0^2}{n^*}\mathrm{tr}\big([R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}]' J_n[R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}]\big) - T_n(\gamma),$$
where
$$T_n(\gamma) = \frac{1}{n^*}[R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n]' J_n R_n(\rho) Z_n[Z_n' R_n'(\rho) J_n R_n(\rho) Z_n]^{-1} Z_n' R_n'(\rho) J_n[R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n].$$
As $\frac{1}{\sqrt{n^*}} Z_n' R_n'(\rho) J_n R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n = O_p(1)$ uniformly on $\Gamma$ by Lemma C.7, it follows that
$$T_n(\gamma) = \frac{1}{n^*}\Big[\frac{1}{\sqrt{n^*}} Z_n' R_n'(\rho) J_n R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n\Big]'\Big[\frac{1}{n^*} Z_n' R_n'(\rho) J_n R_n(\rho) Z_n\Big]^{-1}\Big[\frac{1}{\sqrt{n^*}} Z_n' R_n'(\rho) J_n R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n\Big] = o_p(1).$$
By Lemma C.9, we have
$$\frac{1}{n^*}\big\{[R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n]' J_n[R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}\epsilon_n] - \sigma_0^2\,\mathrm{tr}\big([R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}]' J_n[R_n(\rho) S_n(\lambda) S_n^{-1} R_n^{-1}]\big)\big\} = o_p(1).$$
These convergences are uniform on $\Gamma$ because $\lambda$ and $\rho$ appear simply as polynomial factors in those terms. That is, $K_{2n}(\gamma) - \sigma_n^2(\gamma) = o_p(1)$ uniformly on $\Gamma$. Therefore, $\hat\sigma_n^2(\gamma) - \sigma_n^{*2}(\gamma) = o_p(1)$ uniformly on $\Gamma$.

By the Taylor expansion, $|\ln\hat\sigma_n^2(\gamma) - \ln\sigma_n^{*2}(\gamma)| = |\hat\sigma_n^2(\gamma) - \sigma_n^{*2}(\gamma)|/\tilde\sigma_n^2(\gamma)$, where $\tilde\sigma_n^2(\gamma)$ lies between $\hat\sigma_n^2(\gamma)$ and $\sigma_n^{*2}(\gamma)$. As $\sigma_n^{*2}(\gamma) \ge \sigma_n^2(\gamma)$ and $\sigma_n^2(\gamma)$ is uniformly bounded away from zero on $\Gamma$, $\sigma_n^{*2}(\gamma)$ will be so too. It follows that, because $\hat\sigma_n^2(\gamma) - \sigma_n^{*2}(\gamma) = o_p(1)$ uniformly on $\Gamma$, $\hat\sigma_n^2(\gamma)$ will be bounded away from zero uniformly on $\Gamma$ in probability. Hence $|\ln\hat\sigma_n^2(\gamma) - \ln\sigma_n^{*2}(\gamma)| = o_p(1)$ uniformly on $\Gamma$. Consequently, $\sup_{\gamma\in\Gamma}|\frac{1}{n^*}\ln L_n(\gamma) - \frac{1}{n^*} Q_n(\gamma)| = o_p(1)$.

(Uniform equicontinuity.) We will show that
$$\frac{1}{n^*} Q_n(\gamma) = -\frac{1}{2}(\ln(2\pi) + 1) - \frac{1}{2}\ln\sigma_n^{*2}(\gamma) + \frac{1}{n^*}\big(\ln|S_n(\lambda)| + \ln|R_n(\rho)|\big) - \frac{\bar r}{n^*}\ln[(1-\lambda)(1-\rho)]$$
is uniformly equicontinuous on $\Gamma$. The $\sigma_n^{*2}(\gamma)$ is uniformly continuous on $\Gamma$ because it is a polynomial of $\lambda$ and $\rho$ with bounded coefficients by Lemmas C.5 and C.6. The uniform continuity of $\ln\sigma_n^{*2}(\gamma)$ on $\Gamma$ follows because $\frac{1}{\sigma_n^{*2}(\gamma)}$ is uniformly bounded on $\Gamma$. Hence $\frac{1}{n^*} Q_n(\gamma)$ is uniformly equicontinuous on $\Gamma$.

(Identification uniqueness.) At $\gamma_0$, $\sigma_n^{*2}(\gamma_0) = \sigma_0^2$. Therefore,
$$\frac{1}{n^*} Q_n(\gamma) - \frac{1}{n^*} Q_n(\gamma_0) = -\frac{1}{2}\big(\ln\sigma_n^2(\gamma) - \ln\sigma_0^2\big) + \frac{1}{n^*}\big(\ln|S_n(\lambda)| - \ln|S_n|\big) + \frac{1}{n^*}\big(\ln|R_n(\rho)| - \ln|R_n|\big) - \frac{\bar r}{n^*}\big(\ln[(1-\lambda)(1-\rho)] - \ln[(1-\lambda_0)(1-\rho_0)]\big) - \frac{1}{2}\big[\ln\sigma_n^{*2}(\gamma) - \ln\sigma_n^2(\gamma)\big]$$
$$= \frac{1}{n^*}\big(Q_{p,n}(\gamma) - Q_{p,n}(\gamma_0)\big) - \frac{1}{2}\big[\ln\sigma_n^{*2}(\gamma) - \ln\sigma_n^2(\gamma)\big].$$
Suppose that the identification uniqueness condition did not hold. Then there would exist an $\epsilon > 0$ and a sequence $\{\gamma_n\}$ in $\bar N_\epsilon(\gamma_0)$ such that $\lim_{n\to\infty}[\frac{1}{n^*} Q_n(\gamma_n) - \frac{1}{n^*} Q_n(\gamma_0)] = 0$. Because $\bar N_\epsilon(\gamma_0)$ is a compact set, there would exist a convergent subsequence $\{\gamma_{n_m}\}$ of $\{\gamma_n\}$. Let $\gamma_+$ be the limit point of $\{\gamma_{n_m}\}$ in $\Gamma$. As $\frac{1}{n^*} Q_n(\gamma)$ is uniformly equicontinuous in $\gamma$, $\lim_{n_m\to\infty}\frac{1}{n_m^*}[Q_{n_m}(\gamma_+) - Q_{n_m}(\gamma_0)] = 0$. Because $(Q_{p,n}(\gamma) - Q_{p,n}(\gamma_0)) \le 0$ and $-[\ln\sigma_n^{*2}(\gamma) - \ln\sigma_n^2(\gamma)] \le 0$, this is possible only if $\lim_{n_m\to\infty}(\sigma_{n_m}^{*2}(\gamma_+) - \sigma_{n_m}^2(\gamma_+)) = 0$ and $\lim_{n_m\to\infty}\frac{1}{n_m^*}[Q_{p,n_m}(\gamma_+) - Q_{p,n_m}(\gamma_0)] = 0$. The $\lim_{n_m\to\infty}(\sigma_{n_m}^{*2}(\gamma_+) - \sigma_{n_m}^2(\gamma_+)) = 0$ is a contradiction when $\lim_{n\to\infty}\frac{1}{n^*}[R_n(\rho) G_n Z_n\beta_0]' P_n(\rho)[R_n(\rho) G_n Z_n\beta_0] \ne 0$ for all $\rho$. In the event that $\lim_{n\to\infty}\frac{1}{n^*}[R_n(\rho) G_n Z_n\beta_0]' P_n(\rho)[R_n(\rho) G_n Z_n\beta_0] = 0$ for some $\rho$, the contradiction follows from the relation $\lim_{n_m\to\infty}\frac{1}{n_m^*}[Q_{p,n_m}(\gamma_+) - Q_{p,n_m}(\gamma_0)] = 0$ under Assumption 5.1(b). This is so because, in this event, Assumption 5.1(b) is equivalent to
$$\lim_{n\to\infty}\Big\{\frac{1}{n^*}\big(\ln|S_n(\lambda)| - \ln|S_n|\big) + \frac{1}{n^*}\big(\ln|R_n(\rho)| - \ln|R_n|\big) - \frac{\bar r}{n^*}\big(\ln[(1-\lambda)(1-\rho)] - \ln[(1-\lambda_0)(1-\rho_0)]\big) - \frac{1}{2}\big(\ln\sigma_n^2(\gamma) - \ln\sigma_0^2\big)\Big\} = \lim_{n\to\infty}\frac{1}{n^*}[Q_{p,n}(\gamma) - Q_{p,n}(\gamma_0)] \ne 0$$
for $\gamma \ne \gamma_0$. Therefore, the identification uniqueness condition must hold. The consistency of $\hat\gamma_n$ and, hence, $\hat\theta_n$ follows from this identification uniqueness and uniform convergence (White, 1994, Theorem 3.4).
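The concentrated variance $\hat\sigma_n^2(\gamma) = \frac{1}{n^*} Y_n' S_n'(\lambda) R_n'(\rho) P_n(\rho) R_n(\rho) S_n(\lambda) Y_n$ is just the within-group residual variance from regressing $J_n R_n(\rho) S_n(\lambda) Y_n$ on $J_n R_n(\rho) Z_n$. The sketch below (illustrative; a single group with $M_n = W_n$ and random data of my own choosing) verifies this equivalence.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 12, 2
lam, rho = 0.3, 0.2
W = rng.random((n, n))
np.fill_diagonal(W, 0.0)
W = W / W.sum(axis=1, keepdims=True)
M = W                                        # take M_n = W_n for simplicity
S = np.eye(n) - lam * W                      # S_n(lambda)
R = np.eye(n) - rho * M                      # R_n(rho)
J = np.eye(n) - np.ones((n, n)) / n          # one group: J_n

Z = rng.standard_normal((n, k))
Y = rng.standard_normal((n, 1))

RZ = R @ Z
P = J - J @ RZ @ np.linalg.inv(RZ.T @ J @ RZ) @ RZ.T @ J   # P_n(rho)

u = R @ S @ Y
n_star = n - 1
sig2_proj = float(u.T @ P @ u) / n_star

# Same quantity via OLS: regress J u on J R Z and take the residual variance
b, *_ = np.linalg.lstsq(J @ RZ, (J @ u).ravel(), rcond=None)
resid = (J @ u).ravel() - (J @ RZ) @ b
sig2_ols = float(resid @ resid) / n_star
ok_sigma = np.isclose(sig2_proj, sig2_ols)
```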
Proof of Proposition 6.1: (Show that $\frac{1}{n^*}\frac{\partial^2\ln L_n(\tilde\theta_n)}{\partial\theta\partial\theta'} - \frac{1}{n^*}\frac{\partial^2\ln L_n(\theta_0)}{\partial\theta\partial\theta'} \xrightarrow{p} 0$.) The second-order derivatives are given in Appendix B. By the mean value theorem,
$$\mathrm{tr}(J_n G_n^2(\tilde\lambda_n)) = \mathrm{tr}(J_n G_n^2) + 2\,\mathrm{tr}(J_n G_n^3(\bar\lambda_n))(\tilde\lambda_n - \lambda_0).$$
Note that $G_n(\bar\lambda_n)$ is uniformly bounded in row and column sums uniformly in a neighbourhood of $\lambda_0$ by Lemma C.3 under Assumption 4.4. As $R_n(\tilde\rho_n) = R_n + (\rho_0 - \tilde\rho_n) M_n$, it follows that
$$\frac{1}{n^*}\Big(\frac{\partial^2}{\partial\lambda^2}\ln L_n(\tilde\theta_n) - \frac{\partial^2}{\partial\lambda^2}\ln L_n(\theta_0)\Big) = -\frac{2}{n^*}\mathrm{tr}[J_n G_n^3(\bar\lambda_n)](\tilde\lambda_n - \lambda_0) - \frac{1}{n^*}\Big(\frac{1}{\tilde\sigma_n^2} - \frac{1}{\sigma_0^2}\Big) Y_n' W_n' R_n' J_n R_n W_n Y_n + \frac{2(\tilde\rho_n - \rho_0)}{n^*\tilde\sigma_n^2} Y_n' W_n' R_n' J_n M_n W_n Y_n - \frac{(\tilde\rho_n - \rho_0)^2}{n^*\tilde\sigma_n^2} Y_n' W_n' M_n' J_n M_n W_n Y_n = o_p(1),$$
because $\frac{1}{n^*}\mathrm{tr}(J_n G_n^3(\bar\lambda_n)) = O(1)$, $\frac{1}{n^*} Y_n' W_n' R_n' J_n R_n W_n Y_n = O_p(1)$, $\frac{1}{n^*} Y_n' W_n' R_n' J_n M_n W_n Y_n = O_p(1)$, and $\frac{1}{n^*} Y_n' W_n' M_n' J_n M_n W_n Y_n = O_p(1)$. The convergence in probability of the other second-order derivatives follows similar or more straightforward arguments.

(Show that $\frac{1}{n^*}\frac{\partial^2\ln L_n(\theta_0)}{\partial\theta\partial\theta'} - E\big(\frac{1}{n^*}\frac{\partial^2\ln L_n(\theta_0)}{\partial\theta\partial\theta'}\big) \xrightarrow{p} 0$.) As $\frac{1}{n^*}(\tilde G_n\tilde Z_n\beta_0)' J_n\tilde G_n\epsilon_n = o_p(1)$ by Lemma C.7, it follows that $\frac{1}{n^*} Y_n' W_n' R_n' J_n R_n W_n Y_n = \frac{1}{n^*}(\tilde G_n\tilde Z_n\beta_0)' J_n\tilde G_n\tilde Z_n\beta_0 + \frac{1}{n^*}\epsilon_n'\tilde G_n' J_n\tilde G_n\epsilon_n + o_p(1)$. Lemmas C.8 and C.6 imply $E(\epsilon_n'\tilde G_n' J_n\tilde G_n\epsilon_n) = \sigma_0^2\,\mathrm{tr}(\tilde G_n' J_n\tilde G_n)$ and
$$\mathrm{Var}\Big(\frac{1}{n}\epsilon_n'\tilde G_n' J_n\tilde G_n\epsilon_n\Big) = \frac{\mu_4 - 3\sigma_0^4}{n^2}\mathrm{vec}_D'(\tilde G_n' J_n\tilde G_n)\mathrm{vec}_D(\tilde G_n' J_n\tilde G_n) + \frac{2\sigma_0^4}{n^2}\mathrm{tr}[(\tilde G_n' J_n\tilde G_n)^2] = O\Big(\frac{1}{n}\Big).$$
Hence $\frac{1}{n^*}\frac{\partial^2\ln L_n(\theta_0)}{\partial\lambda^2} - E\big(\frac{1}{n^*}\frac{\partial^2\ln L_n(\theta_0)}{\partial\lambda^2}\big) \xrightarrow{p} 0$ follows from the law of large numbers. The convergence of the other terms can be derived by similar arguments.

(Show that $\Sigma_\theta$ is non-singular.) Let $\alpha = (\alpha_1', \alpha_2, \alpha_3, \alpha_4)'$ be a column vector of constants such that $\Sigma_\theta\alpha = 0$. It is sufficient to show that $\alpha = 0$. From the first row block of the linear equation system $\Sigma_\theta\alpha = 0$, one has $\alpha_1 = -\lim_{n\to\infty}(\tilde Z_n' J_n\tilde Z_n)^{-1}\tilde Z_n' J_n\tilde G_n\tilde Z_n\beta_0\cdot\alpha_2$. From the last equation of the linear system, one has $\alpha_4 = -\lim_{n\to\infty}\frac{2\sigma_0^2}{n^*}\mathrm{tr}(J_n\tilde G_n)\cdot\alpha_2 - \lim_{n\to\infty}\frac{2\sigma_0^2}{n^*}\mathrm{tr}(J_n H_n)\cdot\alpha_3$. Substitution in the third equation of the linear system gives
$$\lim_{n\to\infty}\frac{1}{n^*}\Big[\mathrm{tr}(H_n^s J_n\tilde G_n) - \frac{2}{n^*}\mathrm{tr}(J_n H_n)\mathrm{tr}(J_n\tilde G_n)\Big]\alpha_2 + \lim_{n\to\infty}\frac{1}{n^*}\Big[\mathrm{tr}(H_n^s J_n H_n) - \frac{2}{n^*}\mathrm{tr}^2(J_n H_n)\Big]\alpha_3 = 0.$$
By eliminating $\alpha_1$, $\alpha_3$ and $\alpha_4$, the remaining equation becomes
$$\lim_{n\to\infty}\frac{1}{(n^*)^2}\Big\{\Big[\mathrm{tr}(H_n^s J_n H_n) - \frac{2}{n^*}\mathrm{tr}^2(J_n H_n)\Big]\frac{1}{\sigma_0^2}(\tilde G_n\tilde Z_n\beta_0)' P_n(\tilde G_n\tilde Z_n\beta_0) + \phi_n\Big\}\cdot\alpha_2 = 0,$$
where
$$\phi_n = \Big[\mathrm{tr}(H_n^s J_n H_n) - \frac{2}{n^*}\mathrm{tr}^2(J_n H_n)\Big]\Big[\mathrm{tr}(\tilde G_n^s J_n\tilde G_n) - \frac{2}{n^*}\mathrm{tr}^2(J_n\tilde G_n)\Big] - \Big[\mathrm{tr}(H_n^s J_n\tilde G_n) - \frac{2}{n^*}\mathrm{tr}(J_n H_n)\mathrm{tr}(J_n\tilde G_n)\Big]^2.$$
Let $C_n = J_n\tilde G_n - \frac{1}{n}\mathrm{tr}(J_n\tilde G_n) I_n$ and $D_n = J_n H_n - \frac{1}{n}\mathrm{tr}(J_n H_n) I_n$. Then
$$\lim_{n\to\infty}\frac{1}{n^*}\Big[\mathrm{tr}(H_n^s J_n H_n) - \frac{2}{n^*}\mathrm{tr}^2(J_n H_n)\Big] = \lim_{n\to\infty}\frac{1}{2n^*}\mathrm{tr}(D_n^s D_n^s) \ge 0$$
and
$$\phi_n = \frac{1}{4}\big[\mathrm{tr}(D_n^s D_n^s)\mathrm{tr}(C_n^s C_n^s) - \mathrm{tr}^2(C_n^s D_n^s)\big] \ge 0.$$
As Assumption 5.1(a) implies that $\lim_{n\to\infty}\frac{1}{n^*}(R_n G_n Z_n\beta_0)' P_n(R_n G_n Z_n\beta_0)$ is positive definite and $\lim_{n\to\infty}\frac{1}{n^*}[\mathrm{tr}(H_n^s J_n H_n) - \frac{2}{n^*}\mathrm{tr}^2(J_n H_n)] > 0$, it follows that $\alpha_2 = 0$ and, so, $\alpha = 0$. On the other hand, if $\lim_{n\to\infty}\frac{1}{n^*}(R_n G_n Z_n\beta_0)' P_n(R_n G_n Z_n\beta_0) = 0$, then $\lim_{n\to\infty}(\frac{1}{n^*})^2\phi_n > 0$ by Assumption 6.1, and it follows that $\alpha_2 = 0$ and, so, $\alpha = 0$. Hence $\Sigma_\theta$ is non-singular.

(Derive the limiting distribution of $\frac{1}{\sqrt{n^*}}\frac{\partial\ln L_n(\theta_0)}{\partial\theta}$.) The matrices $J_n R_n$, $J_n H_n$ and $J_n\tilde G_n$ are uniformly bounded in both row and column sums in absolute value. As the elements of $Z_n$ are bounded, the elements of $J_n\tilde Z_n$ and $J_n\tilde G_n\tilde Z_n\beta_0$ are uniformly bounded for all $n$ by Lemma C.5. With the existence of high-order moments of $\epsilon$ in Assumption 4.1, the central limit theorem for quadratic forms of double arrays of Kelejian and Prucha (2001) can be applied, and the limiting distribution of the score vector follows. Finally, from the expansion $\sqrt{n^*}(\hat\theta_n - \theta_0) = -\big(\frac{1}{n^*}\frac{\partial^2\ln L_n(\tilde\theta_n)}{\partial\theta\partial\theta'}\big)^{-1}\frac{1}{\sqrt{n^*}}\frac{\partial\ln L_n(\theta_0)}{\partial\theta}$, the asymptotic distribution of $\hat\theta_n$ follows.
Proof of Proposition 6.2: Let
$$B_n = \begin{pmatrix} B_{n,11} & B_{n,12} \\ B_{n,21} & B_{n,22} \end{pmatrix},$$
where $B_{n,11} = \frac{1}{n^*}\operatorname{tr}(\tilde G_n^s J_n \tilde G_n)$, $B_{n,21} = B_{n,12}' = \left(\frac{1}{n^*}\operatorname{tr}(H_n^s J_n \tilde G_n),\ \frac{1}{\sigma_0^2 n^*}\operatorname{tr}(J_n \tilde G_n)\right)'$, and
$$B_{n,22} = \begin{pmatrix} \frac{1}{n^*}\operatorname{tr}(H_n^s J_n H_n) & * \\ \frac{1}{\sigma_0^2 n^*}\operatorname{tr}(J_n H_n) & \frac{1}{2\sigma_0^4} \end{pmatrix},$$
where $*$ denotes the symmetric entry. Under the normality assumption, the variance matrix of the MLE of $\theta_0$ is
$$\Sigma_{\theta,n} = \left\{\frac{1}{n^*}\begin{pmatrix} \Sigma_{\zeta,n} & 0_{(k+1)\times 2} \\ 0_{2\times(k+1)} & 0_{2\times 2}\end{pmatrix} + \frac{1}{n^*}\begin{pmatrix} 0_{k\times k} & 0_{k\times 3} \\ 0_{3\times k} & B_n \end{pmatrix}\right\}^{-1}.$$
The variance matrix of the MLE of $\zeta_0$ is
$$\left\{\frac{1}{n^*}\left[\Sigma_{\zeta,n} + \begin{pmatrix} 0_{k\times k} & 0_{k\times 1} \\ 0_{1\times k} & B_{n,11} - B_{n,12}B_{n,22}^{-1}B_{n,21}\end{pmatrix}\right]\right\}^{-1},$$
by the inversion of the partitioned matrix. As $B_n$ is non-negative definite, the variance matrix of the MLE is relatively smaller than that of $\hat\zeta_{b2sls,n}$.
The Econometrics Journal (2010), volume 13, pp. 177–204. doi: 10.1111/j.1368-423X.2010.00313.x
Improving robust model selection tests for dynamic models

Hwan-Sik Choi† and Nicholas M. Kiefer‡,§

†Department of Economics, Texas A&M University, 3035 Allen, 4228 TAMU, College Station, TX 77843-4228, USA. E-mail: [email protected]
‡Department of Economics and Department of Statistical Science, 490 Uris Hall, Cornell University, Ithaca, NY 14850, USA. E-mail: [email protected]
§CREATES, University of Aarhus, Bartholins Allé 10, Building 1322, DK-8000 Aarhus C, Denmark.
First version received: September 2008; final version accepted: February 2010
Summary We propose an improved model selection test for dynamic models using a new asymptotic approximation to the sampling distribution of a new test statistic. The model selection test is applicable to dynamic models with very general selection criteria and estimation methods. Since our test statistic does not assume the exact form of a true model, the test is essentially non-parametric once the competing models are estimated. To handle the unknown serial correlation in the data, we use a Heteroscedasticity/Autocorrelation-Consistent (HAC) variance estimator, and the sampling distribution of the test statistic is approximated by the fixed-b asymptotic approximation. The asymptotic approximation depends on the kernel functions and bandwidth parameters used in the HAC estimators. We compare the finite sample performance of the new test with the bootstrap methods as well as with the standard normal approximation, and show that the fixed-b asymptotics and the bootstrap methods are markedly superior to the standard normal approximation for moderate sample sizes of time series data. An empirical application to foreign exchange rate forecasting models is presented, and the results show that the normal approximation to the distribution of the test statistic overstates the data's ability to distinguish between two competing models. Keywords: Autocorrelation, Bootstrap, Fixed-b asymptotics, Heteroscedasticity, Model selection.
1. INTRODUCTION

Since Cox (1961, 1962), many methods for distinguishing separate families of hypotheses, or non-nested hypothesis testing, have been developed. Non-nested hypothesis testing is quite different from nested hypothesis testing. In nested hypothesis testing, the null hypothesis is well defined, but the alternative hypothesis can be arbitrarily close to, though different from, the null and therefore difficult to detect. Further, these close alternatives may not be importantly different from the null in any practical sense. In contrast, non-nested hypothesis testing has clear separation between candidate models but presents the difficulty of choosing a sensible null
hypothesis. Cox used centred log-likelihood ratios between two non-nested models under the null hypothesis that one of the models is true. A test for non-nested linear regression models was developed in Pesaran (1974). In the tradition of the nesting approach of Atkinson (1970), which sets up a general model that contains the candidate models, the J-test of Davidson and MacKinnon (1981) is popular (McAleer, 1995). See Gouriéroux and Monfort (1999) for a summary. Much of the non-nested hypothesis testing literature treats the models asymmetrically by choosing a null model and checking whether the null is unsatisfactory compared to one or many alternatives. Therefore, it is possible that all the models are rejected or observationally indistinguishable. There is a different approach that treats the candidate models symmetrically. This approach is often called model selection and does not assume the true model is among the candidates. The goal of model selection is to choose the best model under a loss function. Hence, model selection should be considered as a part of a decision-making process. See Pesaran and Weeks (2001) for a discussion. Vuong (1989) considered a selection criterion for i.i.d. data based on the difference in the Kullback–Leibler Information Criterion (KLIC, Kullback and Leibler, 1951) from the (unknown) true model to the competing models, and the null hypothesis is that the two models are equivalent in KLIC. This approach has the advantage of treating the two competing models symmetrically and does not require the specification of a nesting model. Lien and Vuong (1987) also used the same method for normal linear regression models. Rivers and Vuong (2002) extended the approach to dynamic models. The criteria allowed in their paper are very general and their method allows for serial correlation in the data.
Under their null hypothesis, the candidate models are equivalent with respect to the criterion function asymptotically, and the estimated sampling variance is used to normalize the difference in the criterion values of the competing models. They showed that the sampling variance of the difference in the criterion generally depends on the estimation method when the models are not estimated by optimising the model selection criterion. When the models use the selection criterion for estimation, the sampling error from the estimators is of higher order and asymptotically negligible, which implies that the estimators can be treated as known asymptotically. It should be noted that when the models are correctly specified, their sampling variance converges to zero and the test statistic has a different limiting distribution. Testing for zero variance is important in practice, and one can use a test similar to Vuong (1989). See Section 6 in Rivers and Vuong (2002). We assume the models are non-nested and misspecified and do not study this problem further in this paper. For variance estimation with serially dependent data, the Heteroscedasticity/Autocorrelation-Consistent (HAC) estimators are often used in estimation and inference problems when there is no parametric correlation structure assumed a priori. Rivers and Vuong (2002) suggested the HAC estimator for the sampling variance. They gave regularity conditions under which the asymptotic distribution of the test statistic becomes the standard normal distribution. Among the conditions, the existence of a non-vanishing consistent variance estimator (Assumption 8 of Rivers and Vuong, 2002) is crucial for the standard normal approximation. However, it is widely known that the standard normal approximation does not capture the sampling variability of the HAC estimators, and the results of the test are affected by the choice of bandwidth and kernel function used in the HAC estimation.
The automatic bandwidth selection methods in Andrews (1991) or Newey and West (1994) provide some guidelines, but the tests with the standard normal approximation generally reject the null too often. Thus, instead of the consistency of the
HAC estimators, we use an alternative assumption that the HAC variance estimator satisfies the functional central limit theorem (FCLT) and propose the fixed-b asymptotic approximation for the HAC estimator. The fixed-b approach, proposed by Kiefer et al. (2000) and Kiefer and Vogelsang (2002a,b, 2005), provides a new asymptotic approximation that depends on the bandwidth and the kernel function used in the HAC estimation. The approximating distribution is given by the distribution of the ratio of a standard normal random variable and a functional of the Brownian motion that depends on a kernel function and a fixed number b = m/n, where m is the bandwidth used in the HAC estimator and n is the sample size. We study the improvement of the asymptotic approximation to the sampling distribution of the test statistic of Rivers and Vuong (2002) under fixed-b asymptotics. We also provide a comparison with the block bootstrap method, which is another popular approach to improve the asymptotic approximation for serially dependent data. For the J-test of Davidson and MacKinnon (1981), which is a popular non-nested hypothesis test, extensive work has been done to study the finite sample properties, and some modified test statistics have been proposed to improve the finite sample properties (see Godfrey and Pesaran, 1983, McAleer, 1995). Fan and Li (1995), Godfrey (1998) and Davidson and MacKinnon (2002) showed that the bootstrap is an attractive alternative. Choi and Kiefer (2008) generalized the J-test to dynamic models using the HAC estimation, and showed the fixed-b asymptotic approximation and the bootstrap methods are superior to the standard normal approximation. But for model selection methods such as Rivers and Vuong (2002), less is known about the performance of the asymptotic approximations to the sampling distribution of the test statistics.
The results in this paper show that the improvement from using the fixed-b asymptotics or the bootstrap is substantial, and the conventional standard normal approximation should be avoided, especially when there is strong serial correlation in data or when a large bandwidth is used to calculate a HAC estimator. We also present simulation evidence that small bandwidths give a better size–power trade-off, but large bandwidths would give smaller size distortion. This paper also provides local power simulations and shows that, when larger bandwidths are used the power decreases, but the amount of the decrease depends on the kernel function. Among the kernel functions we considered, the power curve from the Bartlett kernel was less sensitive to the bandwidth choice. An important application of our approach is comparing the forecasting performance from competing models. In the forecasting model comparison literature, a bootstrap method is also used by White (2000). White considers a ‘benchmark’ model and a group of alternative models. The null is that none of the other models dominates the benchmark. The differences between the forecast errors from the benchmark model and all alternatives are arranged in a vector. Then, the test is that the maximum of these differences is negative, so no model dominates the benchmark. The distribution of this maximum is obtained by the stationary bootstrap of Politis and Romano (1994). Hansen (2005) also considers comparing a benchmark model with a number of alternatives. He tests the superiority of the benchmark model and uses the stationary bootstrap methods, as in White (2000). Hansen (2005) differs from White (2000) in that he Studentizes the statistic before taking the maximum. White is essentially using the null that is closest to the alternative. Hansen estimates the null mean rather than using zero. 
Instead of testing a superiority of prediction accuracy, the idea of testing equivalence is used in the DM test of Diebold and Mariano (1995). The DM test compares forecast accuracy of two competing models, where the accuracy is measured by some criterion function (such as a goodness-of-fit measure) and the null is that the forecasts are equally accurate. It is similar to Vuong (1989), except the likelihood is not used, rather a fairly general function of the (out-of-sample) fit. The variance estimator in the DM test is also a HAC estimator. Harvey et al. (1997) attempted to improve finite sample performance of the DM test by using a correction
factor to the DM test statistic [the MDM (modified DM) test]. Although we also aim to improve the finite sample properties of the DM test statistic, our approach differs from the MDM test in two respects. First, our test considers a better approximation to the whole distribution of the test statistic, whereas the MDM test considers scaled normal approximations only. Second, our approximation depends on the kernel function and bandwidth used in a HAC estimator, whereas the MDM test is derived for the particular kernel function (the uniform kernel) and bandwidth (the forecasting horizon) used in the DM test. We can apply our approach to the DM test without modification, and an empirical application for the DM test is presented to test the equality of predictive accuracy of the foreign exchange rate forecasting models considered in Diebold and Mariano (1995) using USD/EURO and YEN/USD exchange rate data. The competing models are simple, but we can clearly see that our new test could give quite different results. We expect the difference would be more dramatic in complex models.
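The DM construction described above can be sketched in code (an illustrative sketch, not the authors' implementation; squared-error loss is assumed, and the uniform-kernel variance with bandwidth tied to the forecast horizon follows the original DM recipe):

```python
import numpy as np

def dm_statistic(e1, e2, h=1):
    """Diebold-Mariano statistic for equal forecast accuracy under
    squared-error loss.  d_t is the loss differential; its long-run
    variance is estimated with the uniform (truncated) kernel using
    autocovariances up to lag h-1, the choice used in the DM test."""
    d = np.asarray(e1, float) ** 2 - np.asarray(e2, float) ** 2
    n = d.size
    dbar = d.mean()
    v = d - dbar
    lrv = (v @ v) / n                     # gamma(0)
    for j in range(1, h):                 # uniform weights up to h-1
        lrv += 2.0 * (v[j:] @ v[:-j]) / n
    return np.sqrt(n) * dbar / np.sqrt(lrv)
```

Swapping the two forecast-error series flips the sign of the statistic, reflecting the symmetric treatment of the two models.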
2. KIEFER–VOGELSANG–BUNZEL FIXED-b ASYMPTOTICS

HAC variance estimators are used frequently in econometrics for test statistics involving serially correlated observations. In the standard set-up, the approximation to a HAC estimator assumes that a long-run variance is known and equal to its estimated value. The limiting distribution does not depend on the kernel function or bandwidth parameter used, and is known to give a poor approximation to the sampling distribution, especially for size calculations. Kiefer et al. (2000) and Kiefer and Vogelsang (2002a,b, 2005) proposed a new asymptotic approximation to the sampling distributions of HAC estimators (and test statistics). They proposed to derive the approximate distributions by fixing the ratio $m/n \to b \in (0, 1]$ as $n \to \infty$, where $m$ is the bandwidth parameter used in a kernel function of a HAC estimator and $n$ is the number of observations. In the conventional approach, $b$ converges to zero as $n \to \infty$ and a HAC estimator becomes consistent. But under the new set-up with $b \in (0, 1]$, HAC estimators do not converge to long-run variances but to long-run variances multiplied by functionals of Brownian bridge processes. This new asymptotic approximation is called the 'fixed-b' approach, in comparison to the conventional or 'small-b' approach, and is shown to be more accurate in many applications. Let $\hat V_n$ be a HAC estimator (used in the denominator of a test statistic) given by
$$\hat V_n = \sum_{j=1-n}^{n-1} k\left(\frac{j}{m}\right)\hat\gamma(j), \quad (2.1)$$
where $k(x)$ is a kernel function and
$$\hat\gamma(j) = \frac{1}{n}\sum_{t=|j|+1}^{n}(\hat u_t - \bar u)(\hat u_{t-|j|} - \bar u), \quad (2.2)$$
where $\{\hat u_t\}_{t=1}^n$ is a stochastic process of interest such as residuals, scores or other criteria used in a test statistic, and $\bar u$ is the sample mean of $\{\hat u_t\}_{t=1}^n$. (Often we have $\bar u = 0$, as in residuals from linear regressions with a constant term or as in scores.) When the functional central limit theorem (FCLT) holds for the partial sum of $u_t$, i.e.
$$n^{-1/2}\sum_{t=1}^{[rn]} u_t \Rightarrow \lambda W(r), \quad (2.3)$$
where $\lambda^2 = \sum_{j=-\infty}^{\infty}\gamma(j)$ and $W(r)$ is the standard Brownian motion defined on $C[0,1]$, Kiefer and Vogelsang (2005) show that $\hat V_n$ converges weakly to a random variable under $m/n \to b \in (0, 1]$, and the limiting distribution depends on $k(x)$ and $b$. For example, if the Bartlett (triangular) kernel is used, we have
$$\hat V_n \Rightarrow \frac{2\lambda^2}{b}\left[\int_0^1 \widetilde W(r)^2\,dr - \int_0^{1-b} \widetilde W(r+b)\widetilde W(r)\,dr\right], \quad (2.4)$$
where $\widetilde W(r) = W(r) - rW(1)$ is the Brownian bridge process. In testing applications, $\lambda^2$ is cancelled out with the asymptotic variance of the numerator of a test statistic, making the test statistic pivotal. Cases with different kernel functions are also in Theorem 1 of Kiefer and Vogelsang (2005). A simple example is in Choi and Kiefer (2008).
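The estimator in (2.1)–(2.2) can be sketched in code (an illustrative implementation, not from the paper; the Bartlett kernel is assumed for concreteness):

```python
import numpy as np

def bartlett_hac(u, m):
    """HAC long-run variance estimate of the series u with the Bartlett
    kernel k(x) = 1 - |x| and bandwidth m, following (2.1)-(2.2)."""
    u = np.asarray(u, dtype=float)
    n = u.size
    v = u - u.mean()                      # demeaned, as in (2.2)
    gamma = lambda j: (v[j:] * v[:n - j]).sum() / n
    total = gamma(0)
    for j in range(1, n):
        w = 1.0 - j / m                   # Bartlett weight k(j/m)
        if w <= 0.0:                      # kernel vanishes for |j| >= m
            break
        total += 2.0 * w * gamma(j)       # symmetric sum over +/- j
    return total
```

With `m` fixed as a share of `n` (fixed-b), the estimate stays random in the limit, which is exactly what (2.4) describes.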
3. DYNAMIC MODEL SELECTION TESTING WITH FIXED-b ASYMPTOTICS

We use the same set-up as Rivers and Vuong (2002) (hereafter, RV). Let $\{Q_n^j(\omega,\gamma^j)\}$ be a sequence of random functions to minimize, such as negative quasi log-likelihood functions (divided by $n$ for scaling), for models $j = 1, 2$, with a parameter vector $\gamma^j$ in a compact domain $\Gamma^j$ and data $\omega$ with $n$ observations. Let $\hat\gamma_n^j$ be an estimator of model $j$ obtained either by minimizing $Q_n^j(\omega,\gamma^j)$ or by other estimation methods such as GMM estimation. Let $\{\bar Q_n^j(\gamma^j)\}$ be a non-stochastic sequence of a model selection criterion such as $\bar Q_n^j(\gamma^j) = E\,Q_n^j(\omega,\gamma^j)$, and assume $Q_n^j(\omega,\gamma^j) - \bar Q_n^j(\gamma^j) \xrightarrow{a.s.} 0$ uniformly on $\Gamma^j$. Moreover, let $\{\bar\gamma_n^j\}$ be a non-stochastic sequence of pseudo-true values such that $\hat\gamma_n^j - \bar\gamma_n^j \xrightarrow{a.s.} 0$. For example, $\hat\gamma_n^j$ and $\bar\gamma_n^j$ can be the minimizers of $Q_n^j(\omega,\gamma^j)$ and $\bar Q_n^j(\gamma^j)$, respectively. We consider the null hypothesis that the two models are asymptotically equivalent in the sense that
$$H_0^*: \lim_{n\to\infty}\sqrt n\left[\bar Q_n^1(\bar\gamma_n^1) - \bar Q_n^2(\bar\gamma_n^2)\right] = 0, \quad (3.1)$$
against the alternative hypothesis $H_1^* \cup H_2^*$, where
$$H_1^*: \lim_{n\to\infty}\sqrt n\left[\bar Q_n^1(\bar\gamma_n^1) - \bar Q_n^2(\bar\gamma_n^2)\right] = -\infty, \quad (3.2)$$
$$H_2^*: \lim_{n\to\infty}\sqrt n\left[\bar Q_n^1(\bar\gamma_n^1) - \bar Q_n^2(\bar\gamma_n^2)\right] = \infty. \quad (3.3)$$
Our model selection test is based on the difference $D_n$ between the criterion functions $Q_n^j(\omega,\gamma^j)$ evaluated at their estimators $\hat\gamma_n^j$, given by
$$D_n = \sqrt n\left[Q_n^1(\omega,\hat\gamma_n^1) - Q_n^2(\omega,\hat\gamma_n^2)\right]. \quad (3.4)$$
The test statistic is a standardized version of $D_n$ by its long-run variance estimator. The long-run variance of $D_n$ depends on whether we estimate $\hat\gamma_n^j$ by minimizing $Q_n^j(\omega,\gamma^j)$ or not. Generally, the sampling variances of $\hat\gamma_n^j$ affect the long-run variance of $D_n$. But when the estimators $\hat\gamma_n^j$ minimize $Q_n^j(\omega,\gamma^j)$, the sampling variances of $\hat\gamma_n^j$ do not affect the asymptotic variance of $D_n$, and we can treat the problem as if we knew $\bar\gamma_n^j$ (see RV, p. 10). We give formal technical assumptions similar to RV.
ASSUMPTION 3.1. (a) Let $\omega$ be a $p$-dimensional weakly stationary stochastic process $\{X_t\}_{t=-\infty}^{\infty}$ on a complete probability space $(\Omega, \mathcal F, P)$. For some $r > 2$, $\{X_t\}$ is a $\phi$- or $\alpha$-mixing sequence such that $\phi_m$ is of size $-r/(r-1)$ or $\alpha_m$ is of size $-2r/(r-2)$, respectively. (b) For $j = 1, 2$, let $M^j(\gamma^j)$ be a parametric or semi-parametric model with a parameter vector $\gamma^j \in \Gamma^j$, where $\Gamma^j$ is a compact subset of $\mathbb R^{k_j}$. (c) For $j = 1, 2$, let $\hat\gamma_n^j$ be an estimator using data $\{X_t\}_{t=1}^n$ such that $\hat\gamma_n^j \in \Gamma^j$ for each $n$. For a sequence of estimators $\{\hat\gamma_n^j\}_{n=1}^{\infty}$ on $(\Omega,\mathcal F,P)$, there exists a non-stochastic sequence $\{\bar\gamma_n^j\}_{n=1}^{\infty}$ of pseudo-true values uniformly interior to $\Gamma^j$ such that $\hat\gamma_n^j - \bar\gamma_n^j \xrightarrow{a.s.} 0$.

ASSUMPTION 3.2 (EXISTENCE OF APPROXIMATE SEQUENCES). For $j = 1, 2$, there exist (a) a sequence of $(k_j\times 1)$ random vectors $\{\psi_{nt}^j\}_{t=1}^n$ on $(\Omega,\mathcal F,P)$ for $n = 1, 2, \ldots$ such that
$$\sqrt n(\hat\gamma_n^j - \bar\gamma_n^j) = n^{-1/2}\sum_{t=1}^n \psi_{nt}^j + o_p(1), \quad (3.5)$$
with $n^{-1}\sum_{t=1}^n E\psi_{nt}^j = 0$, and (b) a sequence of $(r_j\times 1)$ random vectors $\{g_t^j(\omega,\gamma^j)\}_{t=1}^n$ on $(\Omega,\mathcal F,P)$ for $n = 1, 2, \ldots$ and a function $q^j: \mathbb R^{r_j}\times\Gamma^j \to \mathbb R$ such that
$$Q_n^j(\omega,\gamma^j) = q^j\left(n^{-1}\sum_{t=1}^n g_t^j(\omega,\gamma^j),\ \gamma^j\right). \quad (3.6)$$
Moreover, $q^j(\cdot,\cdot)$ is continuously differentiable, and $g_t^j(\cdot,\gamma^j)$ is measurable for every $\gamma^j \in \Gamma^j$ and continuously differentiable on $\Gamma^j$. (c) Let $\nabla_{\gamma^j} = \partial/\partial\gamma^j$. Both $g_t^j(\omega,\gamma^j)$ and $\nabla_{\gamma^j} g_t^j(\omega,\gamma^j)$ are (i) Lipschitz-$L_1$ on $\Gamma^j$ a.s.-$P$, (ii) $r$-dominated on $\Gamma^j$ uniformly in $t$, and (iii) near-epoch dependent upon $\{X_t\}$ of size $-1$ and $-1/2$, respectively, where the first dependence is uniform on $\Gamma^j$. (d) Let $\bar g_t^j(\gamma^j) = E g_t^j(\omega,\gamma^j)$, $\bar g_n^j(\gamma^j) = n^{-1}\sum_{t=1}^n \bar g_t^j(\gamma^j)$, $\bar\psi_{nt}^j = E\psi_{nt}^j$, and $\bar\psi_n^j = n^{-1}\sum_{t=1}^n \bar\psi_{nt}^j$. We have $|(\bar g_n^j(\bar\gamma_n^j), \bar\psi_n^j)| = O(n^{-1})$ and
$$\left\|\left(\bar g_t^j(\bar\gamma_n^j) - \bar g_n^j(\bar\gamma_n^j),\ \bar\psi_{nt}^j - \bar\psi_n^j\right)\right\| \le b_n, \quad (3.7)$$
for all $t = 1, \ldots, n$ with $b_n = O(n^{-1/2})$. (e) Let $\nabla_g^j = \partial q^j(x,\gamma^j)/\partial x$. We have a sequence of random variables $\{v_{nt}\}_{t=1}^n$ on $(\Omega,\mathcal F,P)$ for $n = 1, 2, \ldots$ such that
$$v_{nt} = \sum_{j=1,2} J\{j\}\times\Big\{\nabla_g^j q^j(\bar g_n^j(\bar\gamma_n^j), \bar\gamma_n^j)\left[g_t^j(\omega,\bar\gamma_n^j) - \bar g_t^j(\bar\gamma_n^j)\right] \quad (3.8)$$
$$+ \left[\nabla_{\gamma^j} q^j(\bar g_n^j(\bar\gamma_n^j), \bar\gamma_n^j) + \nabla_g^j q^j(\bar g_n^j(\bar\gamma_n^j), \bar\gamma_n^j)\nabla_{\gamma^j}\bar g_n^j(\bar\gamma_n^j)\right]\psi_{nt}^j\Big\}, \quad (3.9)$$
where $J\{1\} = 1$ and $J\{2\} = -1$. (f) Let $Var(n^{-1/2}\sum_{t=1}^n v_{nt}) = \lambda_n^2$ and
$$\lambda^2 = \lim_{n\to\infty}\lambda_n^2. \quad (3.10)$$
We have $0 < \lambda^2 < \infty$.

Denote $\{\hat v_{nt}\}$ as $\{v_{nt}\}$ with $\hat\gamma_n^j$ and $n^{-1}\sum_{t=1}^n g_t^j(\omega,\hat\gamma_n^j)$ in place of $\bar\gamma_n^j$ and $\bar g_n^j(\bar\gamma_n^j)$, respectively. To estimate $\lambda_n^2$, we use a HAC long-run variance estimator $\hat V_n$ computed from $\{\hat v_{nt}\}$; then we standardize $D_n$
with $\hat V_n$. We define the test statistic
$$T_n = \frac{\sqrt n\left[Q_n^1(\omega,\hat\gamma_n^1) - Q_n^2(\omega,\hat\gamma_n^2)\right]}{\sqrt{\hat V_n}}, \quad (3.11)$$
where
$$\hat V_n = \sum_{j=1-n}^{n-1} k\left(\frac{j}{m}\right)\hat\gamma_n(j), \quad (3.12)$$
$k(\cdot)$ is a kernel function with a bandwidth parameter $0 < m \le n$, and
$$\hat\gamma_n(j) = \frac{1}{n}\sum_{t=|j|+1}^n (\hat v_{nt} - \bar v_n)(\hat v_{n(t-|j|)} - \bar v_n), \quad (3.13)$$
where $\bar v_n = n^{-1}\sum_{t=1}^n \hat v_{nt}$. The original version of RV's test uses
$$\hat\gamma_n^{RV}(j) = \frac{1}{n}\sum_{t=|j|+1}^n \hat v_{nt}\hat v_{n(t-|j|)}, \quad (3.14)$$
instead of $\hat\gamma_n(j)$. Let $T_n^{RV}$ be the test statistic with $\hat\gamma_n^{RV}(j)$.

ASSUMPTION 3.3. The kernel function $k: \mathbb R \to \mathbb R$ is continuous at $x = 0$, $k(x) = k(-x)$, $|k(x)| \le 1$ and $\int_{-\infty}^{\infty} k^2(x)\,dx < \infty$.

ASSUMPTION 3.4. $\{m\}$ is a sequence of integers such that $m/n \to b$ as $n \to \infty$, where $b \in (0, 1]$ is a fixed number.

The main difference of our paper from RV is that $m$ diverges to infinity at the same rate as $n$ (RV assumed $m = o(n^{1/4})$). Under our assumption, the HAC variance estimator $\hat V_n$ is not consistent, but the fixed-b approach still gives an asymptotically pivotal test statistic. With the conventional asymptotic approach as in RV, the difference between $T_n$ and $T_n^{RV}$ is asymptotically negligible under the null, although the power may be different. On the other hand, we show that the fixed-b approach leads to different asymptotic limits depending on the demeaning with $\bar v_n$. Thus the modification of the original RV test statistic $T_n^{RV}$ makes an important difference under both the null and the alternative. To apply the fixed-b asymptotics to $T_n$ and $T_n^{RV}$, we further assume that the FCLT is satisfied for the process $\{v_{nt}\}$.

ASSUMPTION 3.5. The process $\{v_{nt}\}_{t=1}^n$ satisfies the FCLT under $H_0^*$,
$$\frac{1}{\sqrt n}\sum_{t=1}^{[rn]} v_{nt} \Rightarrow \lambda W(r), \quad (3.15)$$
where $W(r)$ is a standard Brownian motion, and $\lambda$ is defined in equation (3.10).
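The statistic in (3.11)–(3.13) can be sketched for a given score series as follows (an illustrative sketch, not the paper's code; `vhat` and `D_n` are assumed inputs, and the Bartlett kernel is used for concreteness):

```python
import numpy as np

def rv_statistic(vhat, D_n, b, demean=True):
    """Standardize D_n by the square root of a Bartlett HAC estimator
    built from the scores vhat with bandwidth m = b*n, as in
    (3.11)-(3.13).  Setting demean=False reproduces RV's original
    autocovariance gamma_hat^RV in (3.14)."""
    v = np.asarray(vhat, dtype=float)
    n = v.size
    m = max(1, int(np.floor(b * n)))      # fixed-b bandwidth
    if demean:
        v = v - v.mean()
    V = (v @ v) / n                       # lag-0 term
    for j in range(1, min(m, n)):
        V += 2.0 * (1.0 - j / m) * (v[j:] @ v[:-j]) / n
    return D_n / np.sqrt(V)
```

The same routine serves both variants of the test: with demeaning one compares against the $Q_{\widetilde W}(b)$ critical values, without it against the $Q_W(b)$ ones.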
Our assumption $\lambda^2 > 0$ essentially implies that the models are non-nested and misspecified. When the candidate models are nested, we have $\lambda^2 = 0$, and the test statistic must be reformulated to avoid degeneracy. The approximation is given by the mixed chi-squared distribution in this case. See RV, Section 6, for further discussion. We provided a high-level assumption to focus on presenting the idea. For low-level assumptions for the FCLT, see Phillips and Durlauf (1986) for example. The conditions in RV are stronger than the assumptions required for the consistency of HAC estimators in Andrews (1991), but the conditions for the FCLT in Phillips and Durlauf (1986) are slightly weaker than those in Andrews (1991) (weaker conditions than Andrews, 1991, for consistency are also available in Hansen, 1992, and de Jong and Davidson, 2000). Therefore, our FCLT assumption does not impose a stronger condition than RV, and the central difference of our paper from RV is $m = O(n)$. A discussion of the assumptions imposed for the FCLT is also given in Kiefer and Vogelsang (2005, p. 1135).

We define the random variable $Q(b)$, which is derived from a process $\widetilde W(r)$. The process $\widetilde W(r)$ will be defined differently depending on the application later.

DEFINITION 3.1 (HASHIMZADE AND VOGELSANG, 2008, P. 146). With a bandwidth $b = m/n$ and a kernel function $k(x)$, let the random variable $Q(b)$ be defined as follows.

Case (i): If $k(x)$ is twice continuously differentiable everywhere,
$$Q(b) = -\frac{1}{b^2}\int_0^1\int_0^1 k''\left(\frac{r-s}{b}\right)\widetilde W(r)\widetilde W(s)\,dr\,ds \quad (3.16)$$
$$+ \frac{2}{b}\widetilde W(1)\int_0^1 k'\left(\frac{1-r}{b}\right)\widetilde W(r)\,dr + \widetilde W(1)^2. \quad (3.17)$$

Case (ii): If $k(x)$ is continuous, $k(x) = 0$ for $|x| \ge 1$, and $k(x)$ is twice continuously differentiable everywhere except for $|x| = 1$,
$$Q(b) = -\frac{1}{b^2}\iint_{|r-s|<b} k''\left(\frac{r-s}{b}\right)\widetilde W(r)\widetilde W(s)\,dr\,ds \quad (3.18)$$
$$+ \frac{2k_-'(1)}{b}\int_0^{1-b}\widetilde W(r)\widetilde W(r+b)\,dr \quad (3.19)$$
$$+ \frac{2}{b}\widetilde W(1)\int_{1-b}^1 k'\left(\frac{1-r}{b}\right)\widetilde W(r)\,dr + \widetilde W(1)^2, \quad (3.20)$$
where $k_-'(1) = \lim_{h\to 0}[(k(1) - k(1-h))/h]$, i.e. $k_-'(1)$ is the derivative of $k(x)$ from the left at $x = 1$.

Case (iii): If $k(x)$ is the Bartlett kernel,
$$Q(b) = \frac{2}{b}\int_0^1 \widetilde W(r)^2\,dr - \frac{2}{b}\int_0^{1-b}\widetilde W(r)\widetilde W(r+b)\,dr \quad (3.21)$$
$$- \frac{2}{b}\widetilde W(1)\int_{1-b}^1 \widetilde W(r)\,dr + \widetilde W(1)^2. \quad (3.22)$$

In the following theorem, we define $Q(b)$ with either $\widetilde W(r) = W(r) - rW(1)$ (the Brownian bridge) or $\widetilde W(r) = W(r)$ when we consider $T_n$ or $T_n^{RV}$, respectively. The random variable $Q(b)$ becomes simpler for $T_n$ than for $T_n^{RV}$ because the Brownian bridge satisfies $\widetilde W(1) = 0$.

THEOREM 3.1. Suppose Assumptions 3.1–3.5 hold. Then under $H_0^*$ in equation (3.1), we have
$$T_n \Rightarrow \frac{W(1)}{\sqrt{Q_{\widetilde W}(b)}}, \quad (3.23)$$
where the random variable $Q_{\widetilde W}(b)$ is defined as $Q(b)$ in Definition 3.1 with $\widetilde W(r) = W(r) - rW(1)$ and is independent of $W(1)$, and
$$T_n^{RV} \Rightarrow \frac{W(1)}{\sqrt{Q_W(b)}}, \quad (3.24)$$
where $Q_W(b)$ is defined as $Q(b)$ with $W(r)$.

Since $Q(b)$ depends on both $k(x)$ and $b$, critical values can be obtained by simulating $Q(b)$ in practice. Table 1 (Kiefer and Vogelsang, 2005, p. 1146) gives the approximate critical values for popular kernel functions as a function of $b$ for $T_n$ using $Q_{\widetilde W}(b)$.
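Simulating $Q(b)$, as the text suggests, can be sketched on a discretized Brownian path (an illustrative Monte Carlo, not the authors' code; the replication counts and grid size are arbitrary assumptions, and the Bartlett case (iii) formula with the Brownian bridge is used):

```python
import numpy as np

def simulate_fixed_b_cv(b, alpha=0.05, reps=1000, steps=300, seed=0):
    """Monte Carlo approximation of the upper-alpha critical value of
    W(1)/sqrt(Q_Wtilde(b)) for the Bartlett kernel, using (3.21): with
    the bridge Wtilde(r) = W(r) - r W(1), the terms in (3.22) vanish."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / steps
    stats = np.empty(reps)
    for i in range(reps):
        W = np.cumsum(rng.standard_normal(steps)) * np.sqrt(dt)
        r = np.arange(1, steps + 1) * dt
        Wt = W - r * W[-1]                 # Brownian bridge path
        lag = max(1, int(b * steps))
        Q = (2.0 / b) * (Wt @ Wt) * dt     # (2/b) * integral of Wt^2
        if lag < steps:                    # (2/b) * overlap integral
            Q -= (2.0 / b) * (Wt[lag:] @ Wt[:-lag]) * dt
        stats[i] = W[-1] / np.sqrt(Q)
    return np.quantile(stats, 1.0 - alpha)
```

For small $b$ the simulated critical value approaches the standard normal quantile, while larger $b$ inflates it, matching the polynomial fits reported in Table 1.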
4. MONTE CARLO STUDY

4.1. Quasi-likelihood criterion

We compare the performance of our new asymptotic approximation for the sampling distribution of our test statistic with the conventional standard normal approximation and the block bootstrap method. We use the quasi-likelihood (QL) criterion for simulations, and parameters are estimated with quasi-maximum likelihood estimation (QMLE). For the QL criterion, the test statistic $T_n$ can be written as
$$T_n = \frac{\sqrt n\,\ln(\hat\sigma_2^2/\hat\sigma_1^2)}{\sqrt{\hat V_n}}, \quad (4.1)$$
where $\hat\sigma_j^2 = n^{-1}\sum_{t=1}^n \hat u_{jt}^2$ ($j = 1, 2$) is a measure of fit calculated from the residuals $\{\hat u_{jt}\}$ of model $j$,
$$\hat V_n = \sum_{j=1-n}^{n-1} k\left(\frac{j}{m}\right)\hat\gamma(j) \quad (4.2)$$
Table 1. Fixed-b asymptotic critical value function.

Kernel               Level     a0       a1        a2        a3       R^2
Bartlett             90%      1.2816   1.3040    0.5135   -0.3386   0.9995
                     95%      1.6449   2.1859    0.3142   -0.3427   0.9991
                     97.50%   1.9600   2.9694    0.4160   -0.5324   0.9980
                     99%      2.3263   4.1618    0.5368   -0.9060   0.9957
Parzen               90%      1.2816   0.9729    0.5514    0.0011   0.9993
                     95%      1.6449   1.5184    1.0821   -0.0660   0.9993
                     97.50%   1.9600   2.0470    1.7498   -0.1076   0.9993
                     99%      2.3263   2.5794    3.9580   -0.7012   0.9984
Quadratic spectral   90%      1.2816   1.6269    2.6366   -0.4329   0.9996
                     95%      1.6449   2.7098    4.5885   -0.6984   0.9992
                     97.50%   1.9600   3.0002   10.5805   -3.3454   0.9984
                     99%      2.3263   5.4054   14.1281   -2.3440   0.9969
Daniell              90%      1.2816   1.4719    2.1942   -0.1981   0.9991
                     95%      1.6449   2.4986    3.9948   -0.4587   0.9990
                     97.50%   1.9600   2.8531    9.6484   -3.0756   0.9984
                     99%      2.3263   5.0506   14.1258   -3.2775   0.9952
Tukey–Hanning        90%      1.2816   1.1147    1.9782   -0.5142   0.9940
                     95%      1.6449   1.5479    4.4153   -1.4993   0.9957
                     97.50%   1.9600   1.6568    8.2454   -2.6136   0.9892
                     99%      2.3263   1.1261   18.3270   -7.1177   0.9825
Bohman               90%      1.2816   1.0216    0.7906   -0.1121   0.9993
                     95%      1.6449   1.5927    1.5151   -0.2925   0.9994
                     97.50%   1.9600   2.2432    2.0441   -0.1358   0.9989
                     99%      2.3263   2.6213    5.4876   -1.6575   0.9982

Note: Fixed-b asymptotic critical value function coefficients for t-statistics. Given the kernel and a percentage point, for a given value of b = m/n the critical value for t is given by the polynomial cv(b) = a0 + a1 b + a2 b^2 + a3 b^3. The R^2 indicates the fit of the polynomial through the simulated asymptotic critical values.
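The note's polynomial can be evaluated directly; a minimal sketch, using the Bartlett 95% row of Table 1 as the default coefficients:

```python
# Evaluate the fitted critical value function cv(b) from Table 1.
# Default coefficients are the Bartlett kernel, 95% row.
def cv(b, a0=1.6449, a1=2.1859, a2=0.3142, a3=-0.3427):
    return a0 + a1 * b + a2 * b ** 2 + a3 * b ** 3
```

At b = 0 the critical value collapses to the standard normal quantile (1.6449 at the 95% level), and it increases with b, reflecting the extra sampling variability of the HAC estimator under fixed-b asymptotics.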
is a HAC estimator, $k(x)$ is a kernel function, $m$ is a bandwidth,
$$\hat\gamma(j) = \frac{1}{n}\sum_{t=|j|+1}^n (\hat v_t - \bar v)(\hat v_{t-|j|} - \bar v), \quad (4.3)$$
where $\bar v = n^{-1}\sum_{t=1}^n \hat v_t$ and
$$\hat v_t = \frac{\hat u_{2t}^2}{\hat\sigma_2^2} - \frac{\hat u_{1t}^2}{\hat\sigma_1^2}. \quad (4.4)$$
Note that from the definition of $\hat\sigma_j^2$, we always have $\bar v = 0$ under the null or the alternative. Therefore, we do not have to centre $\hat v_t$. We could instead use $\hat v_t = \hat\sigma_1^{-2}(\hat u_{2t}^2 - \hat u_{1t}^2)$ or $\hat v_t = \hat\sigma_2^{-2}(\hat u_{2t}^2 - \hat u_{1t}^2)$. Under these alternative definitions of $\hat v_t$, centring $(\hat v_t - \bar v)$ becomes important. As shown in the previous section, under the fixed-b asymptotics, the asymptotic distributions depend on whether $\hat v_t$ is centred or not. Therefore, different critical values must be used when $\hat v_t$ is not centred. Also, the test's power typically falls without centring. In a simulation not reported in this paper, we found that using $\hat v_t = \hat\sigma_1^{-2}(\hat u_{2t}^2 - \hat u_{1t}^2)$ without centring gives little power for local alternatives in many cases. This problem worsens as a larger bandwidth is used. It is generally a good idea to use centring for this reason. We use the centred version only in our experiments.
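The centred QL statistic in (4.1)–(4.4) can be sketched as follows (an illustrative sketch, not the paper's code; the residual series `u1`, `u2` are assumed inputs and the Bartlett kernel is used):

```python
import numpy as np

def ql_test_statistic(u1, u2, m):
    """T_n of (4.1) for the quasi-likelihood criterion:
    sqrt(n) * ln(sigma2_hat^2 / sigma1_hat^2) standardized by the
    square root of a Bartlett HAC estimate built from the scores
    v_t of (4.4), centred as the text recommends."""
    u1, u2 = np.asarray(u1, float), np.asarray(u2, float)
    n = u1.size
    s1, s2 = (u1 @ u1) / n, (u2 @ u2) / n # fit measures sigma_j^2
    v = u2 ** 2 / s2 - u1 ** 2 / s1       # v_t of (4.4); mean is 0 here
    v = v - v.mean()                      # centring (harmless for (4.4))
    V = (v @ v) / n
    for j in range(1, min(m, n)):
        V += 2.0 * (1.0 - j / m) * (v[j:] @ v[:-j]) / n
    return np.sqrt(n) * np.log(s2 / s1) / np.sqrt(V)
```

When the two residual series fit equally well the numerator is exactly zero, so the statistic vanishes, as the null of equivalent models requires.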
4.2. The bootstrap method

Bootstrap methods are popular alternatives to conventional asymptotic approximations in econometrics. In the non-nested hypothesis context, the bootstrap is known to improve the approximation of the sampling distributions of test statistics. See Fan and Li (1995), Godfrey (1998), Davidson and MacKinnon (2002) and Choi and Kiefer (2008). We used the bootstrap method in Hall and Horowitz (1996) and White (2000) for our test statistic. Our bootstrap method is 'naive' in the sense that the bootstrap test statistic is identical to the original test statistic. It has been shown that the naive block bootstrap, including the i.i.d. bootstrap, has the same limiting distribution as the fixed-b asymptotics (Gonçalves and Vogelsang, 2010, Theorem 4.1). They provided a sufficient condition under which the fixed-b approach and the naive bootstrap have the same limiting distribution and give a better rate of error than the standard normal approximation when a certain moment condition holds for the errors. The argument proceeds by writing the test statistic and the bootstrap test statistic as the same functions of the data and the bootstrap data, respectively. Using appropriate assumptions on the bootstrap data and the continuous mapping theorem gives the result that the limit distributions are identical. Showing that the resulting distribution is an improvement on the normal approximation is more difficult. Gonçalves and Vogelsang (2010) are able to obtain this result for a special case of estimation of a normal mean. See also Jansson (2004), who shows that the fixed-b asymptotics can improve on the normal approximation in terms of the rate of error in rejection probability. Our simulation results indicate that the bootstrap is practically useful in our settings. It is notable that the i.i.d. bootstrap is valid under the fixed-b asymptotics even for serially dependent observations. The i.i.d. bootstrap does not capture the correct dependence structure in the data, but the asymptotic pivotalness of the test statistic makes this irrelevant. Unlike Gonçalves and Vogelsang (2010), we use the non-parametric bootstrap, but their theory essentially applies. Our null hypothesis does not assume a specific form of the true model; therefore we cannot take the explanatory variables as given and generate bootstrap samples from one model. This implies that since neither of the candidate models is correct, we should not bootstrap from one particular model. Instead, we should bootstrap from the joint empirical distribution of the dependent variable and the explanatory variables. Then the bootstrap critical values are calculated using the standard method, as in Hall and Horowitz (1996). In our settings, it is important to resample the data without destroying the relationship between the two candidates. Thus, resampling the ith observation should draw from both of the models simultaneously.
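The joint resampling requirement in the last sentence can be sketched as follows (a hypothetical helper, not the authors' code; `X1` and `X2` stand for the two candidate models' regressor matrices):

```python
import numpy as np

def joint_iid_resample(y, X1, X2, rng):
    """Resample rows of (y, X1, X2) jointly, so the link between the
    two candidate models' regressors is preserved.  An i.i.d. (block
    size one) bootstrap like this remains valid under fixed-b
    asymptotics because the test statistic is asymptotically pivotal."""
    n = y.shape[0]
    idx = rng.integers(0, n, size=n)      # one index draw used everywhere
    return y[idx], X1[idx], X2[idx]
```

Replacing the single-index draw with draws of contiguous blocks of indices gives the block-bootstrap variant with block size five used in the experiments.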
C The Author(s). Journal compilation C Royal Economic Society 2010.
H.-S. Choi and N. M. Kiefer
For the QL criterion, the bootstrap test statistic $T_n^b$ is given by
$$ T_n^b = \frac{\sqrt{n}\,\bigl[\ln(\tilde\sigma_2^2/\tilde\sigma_1^2) - C_0\bigr]}{\sqrt{\tilde V_n}}, \qquad (4.5) $$
where $\tilde\sigma_j^2 = n^{-1}\sum_{t=1}^n \tilde u_{jt}^2$ ($j = 1, 2$) are calculated with the bootstrap residuals $\{\tilde u_{jt}\}_{t=1}^n$, and
$$ C_0 = \ln\frac{\hat\sigma_2^2}{\hat\sigma_1^2}, \qquad (4.6) $$
where $\hat\sigma_1^2$ and $\hat\sigma_2^2$ are from the original sample, and $\tilde V_n$ is calculated from equation (4.2) replacing $\hat v_t$ with
$$ \tilde v_t = \frac{\tilde u_{2t}^2}{\hat\sigma_2^2} - \frac{\tilde u_{1t}^2}{\hat\sigma_1^2} - D_1, \qquad (4.7) $$
where $D_1$ is the mean of $(\tilde u_{2t}^2/\hat\sigma_2^2 - \tilde u_{1t}^2/\hat\sigma_1^2)$, which can be rewritten as
$$ D_1 = \frac{\tilde\sigma_2^2}{\hat\sigma_2^2} - \frac{\tilde\sigma_1^2}{\hat\sigma_1^2}. \qquad (4.8) $$
Note that we could use $(\tilde\sigma_1^2, \tilde\sigma_2^2)$ instead of $(\hat\sigma_1^2, \hat\sigma_2^2)$ in the denominators, because the sampling error of the estimators is asymptotically negligible when the estimators are obtained by minimizing the selection criterion. It is interesting that $D_1 = 0$ when $(\tilde\sigma_1^2, \tilde\sigma_2^2)$ is used. We repeat this procedure for $b = 1, \ldots, B$ to obtain the bootstrap critical values. The bootstrap method in our paper is the equal-tailed percentile-$t$ method with $B = 1199$. We considered block bootstraps with block sizes one (the i.i.d. bootstrap) and five (non-overlapping). We also emphasize that special care is required for candidate models with lagged variables. We propose a non-parametric bootstrap of $\{y_t, y_{t-j}, x_t\}$, where $y_{t-j}$ is the vector of all the lagged variables used as explanatory variables in the candidate models and $x_t$ is the vector of all the other explanatory variables. We then drop the first $J$ observations, where $J$ is the highest lag used. We use this method for our MA(2) example later in this paper.

4.3. MA(2) model

Consider the MA(2) true data-generating process,
$$ y_t = \varepsilon_t + 0.5\varepsilon_{t-1} + \varepsilon_{t-2} \quad (t = 1, \ldots, n), \qquad (4.9) $$
where $\varepsilon_t \sim$ i.i.d. $N(0, 1)$. The competing models are the AR models
$$ H_1: y_t = \alpha_1 + \beta y_{t-1} + \varepsilon_{1t}, \qquad (4.10) $$
$$ H_2: y_t = \alpha_2 + \delta y_{t-2} + \varepsilon_{2t}, \qquad (4.11) $$
where $\varepsilon_{1t}$ and $\varepsilon_{2t}$ are assumed to be white noise. The true model has $\gamma(1) = \gamma(2)$, and we know $\hat\beta \overset{p}{\to} \gamma(1)/\gamma(0)$ and $\hat\delta \overset{p}{\to} \gamma(2)/\gamma(0)$. Thus the pseudo-true values coincide, $\beta^* = \delta^*$. From this fact we can easily show
$$ \operatorname*{plim}_{n\to\infty} \hat\sigma_1^2 = \operatorname*{plim}_{n\to\infty} \hat\sigma_2^2. \qquad (4.12) $$
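The equality $\gamma(1) = \gamma(2)$, and hence $\beta^* = \delta^*$, can be checked directly from the MA(2) coefficients; a small sketch, assuming the unit error variance of (4.9):

```python
# MA coefficients of y_t = e_t + 0.5 e_{t-1} + e_{t-2}, equation (4.9)
theta = [1.0, 0.5, 1.0]

def ma_autocov(theta, k):
    """gamma(k) = sum_i theta_i * theta_{i+k} for an MA process
    driven by unit-variance white noise."""
    return sum(theta[i] * theta[i + k] for i in range(len(theta) - k))

gamma0 = ma_autocov(theta, 0)  # 1 + 0.25 + 1 = 2.25
gamma1 = ma_autocov(theta, 1)  # 1*0.5 + 0.5*1 = 1
gamma2 = ma_autocov(theta, 2)  # 1*1 = 1
# gamma(1) = gamma(2), so beta* = delta* = gamma(1)/gamma(0)
```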
This implies that the two models are equivalent under our quasi-log-likelihood criterion function. The variances $\hat\sigma_1^2$ and $\hat\sigma_2^2$ were calculated from $n - 1$ observations for $H_1$ and $n - 2$ observations for $H_2$, respectively, and the HAC denominator was based on $n - 2$ residuals from each model (we dropped the first residual from $H_1$). The test statistic is $T_n = \sqrt{n-2}\,\ln(\hat\sigma_2^2/\hat\sigma_1^2)/\sqrt{\hat V_n}$, where $\hat V_n$ is from equation (4.2). The number of simulation iterations is 5000. We use sample sizes $n = 27, 52, 102, 202$ and a 5% level (two-tail). For the bootstrap tests, we resample the lagged variables $\{y_t, y_{t-1}, y_{t-2}\}$ together, dropping the first two observations; the bootstrap sample size is therefore $n - 2$. We set the number of bootstrap iterations to $B = 1199$. Table 2 shows the empirical rejection rates of the standard normal, fixed-b, i.i.d. bootstrap ['boot(1)'] and block bootstrap ['boot(5)'] approximations. The fixed-b asymptotics shows a great improvement over the standard normal approximation for all sample sizes, especially when a large bandwidth is used. The i.i.d. bootstrap approach is better than the block bootstrap and similar to, but slightly worse than, the fixed-b approximation. For large sample sizes ($n = 102, 202$) the block bootstrap improves, but in almost all cases the fixed-b approximation is better than the others. We surmise that a more sophisticated bootstrap approach is required in this setting.

4.4. Linear regression model

We generate the following variables for $t = 1, \ldots, n$:
$$ u_t = \alpha u_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \text{i.i.d. } N(0, 1), \qquad (4.13) $$
$$ w_t = \rho w_{t-1} + \zeta_{1t}, \qquad (4.14) $$
$$ x_t = \rho x_{t-1} + \zeta_{2t}, \qquad (4.15) $$
$$ z_t = \rho z_{t-1} + \zeta_{3t}, \qquad (4.16) $$
and
$$ \begin{pmatrix} \zeta_{1t} \\ \zeta_{2t} \\ \zeta_{3t} \end{pmatrix} \sim \text{i.i.d. } N\left(0, \begin{bmatrix} 1 & \kappa_1 & \kappa_1 \\ \kappa_1 & 1 & \kappa_2 \\ \kappa_1 & \kappa_2 & 1 \end{bmatrix}\right), \qquad (4.17) $$
where $\alpha, \rho, \kappa_1, \kappa_2$ are parameters we choose for the simulation. We consider two cases as true models:
$$ \text{Case I}: \; y_t = w_t + 0.5x_t + 0.5z_t + u_t, \qquad (4.18) $$
$$ \text{Case II}: \; y_t = w_t + 0.5x_t + 0.5z_t + 0.5y_{t-1} + u_t. \qquad (4.19) $$
Table 2. Actual sizes of the 5% level two-tail model selection test: MA(2).
Note: Actual sizes of the 5% level two-tail model selection test based on the quasi-likelihood criterion, using the approximations from the standard normal [N(0, 1)], fixed-b asymptotics, i.i.d. bootstrap ['Boot(1)'] and block bootstrap ['Boot(5)'] with a block length of five. An MA(2) process is true and two AR models are competing.
Note that the second case includes a lagged dependent variable. The competing models are
$$ H_1: y_t = \alpha_1 + w_t\delta_1 + x_t\beta_1 + u_{1t}, \qquad (4.20) $$
$$ H_2: y_t = \alpha_2 + w_t\delta_2 + z_t\beta_2 + u_{2t}. \qquad (4.21) $$
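Estimating the two candidate models and forming the numerator of the QL criterion amounts to comparing OLS residual variances; a minimal sketch (the helper `ols_sigma2` and the toy data are hypothetical):

```python
import numpy as np

def ols_sigma2(y, X):
    """OLS residual variance for a candidate model; a column of ones
    is appended for the intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return np.mean(resid ** 2)

rng = np.random.default_rng(0)
n = 50
w = rng.normal(size=n)
x = rng.normal(size=n)
z = rng.normal(size=n)
y = w + 0.5 * x + 0.5 * z + rng.normal(size=n)

sigma1 = ols_sigma2(y, np.column_stack([w, x]))  # H1: regressors w, x
sigma2 = ols_sigma2(y, np.column_stack([w, z]))  # H2: regressors w, z
criterion = np.log(sigma2 / sigma1)  # numerator of the QL test statistic
```

A positive value of `criterion` favours H1 (smaller residual variance), a negative value favours H2; the test statistic then scales this by the HAC variance estimate.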
In Case I, our competing models are each missing one variable; in Case II, both models are missing one variable and one lagged dependent variable. We generated 50 observations (n = 50). We have chosen κ1 = κ2 = 0.5 and ρ = 0, ±0.5, 0.9, α = 0, ±0.5, 0.9 (16 variations in total) for both Cases I and II. The number of iterations is 5000 for each case. As before, we do two-tail tests at the 5% level. Note that the test can be directional: for example, in the right-tail test, rejection favours model H1 over H2. The results under Case I and Case II are shown in Tables 3 and 4, respectively. If the regressors are serially uncorrelated (ρ = 0), the value of α does not make much difference in the distribution of the test statistic, although there were cases of underrejection due to the fact that we have to estimate the pseudo-true values. For ρ = 0, of course, it is perhaps best to ignore the possible autocorrelation. In all cases, if a robust test is used when unnecessary (ρ = 0) then the normal approximation is a disaster and both the fixed-b approximation and the bootstrap methods are better, with very similar performance, although the i.i.d. bootstrap seems a little better than the block bootstrap. With positive autocorrelation in regressors and errors (the expected case), the robust test is required and the standard normal approximation is bad. The fixed-b and the bootstrap methods beat the normal approximation and are about the same, except when both correlations are quite strong (ρ = α = 0.9), in which case the bootstrap methods outperform the fixed-b approach. The ranking of the i.i.d. and the block bootstrap when the regressors are highly autocorrelated depends on the actual value of the error autocorrelation, with the block bootstrap performing better under high autocorrelation. Perhaps this is understandable, since the block bootstrap was designed for this case. However, it is interesting that the i.i.d. bootstrap is better with moderate error autocorrelation (α = 0.5). This holds with and without lagged dependent variables (Cases II and I, respectively). With negative regressor autocorrelation, as might arise from differencing the regressors, the bootstrap and fixed-b methods perform similarly and dominate the normal approximation. In the case of lagged dependent variables and strong positive error autocorrelation, the fixed-b approach tends to underreject relative to both the i.i.d. and block bootstraps. In all cases of negative error autocorrelation, the fixed-b and bootstrap methods perform similarly and dominate the normal approximation. Although not shown in the tables, we found that when the common regressor wt is strongly correlated with the other regressors, the power of the test is reduced, since the wrong model still contains much information about the true model through wt.

4.5. Power comparison

For the comparison of two different test statistics, size-corrected power is often used. Since we have proposed different approximations to the distribution of the same test statistic, size-corrected power comparisons are not applicable. To check the finite sample power properties of the fixed-b approximation, we compare the local powers given a level, a bandwidth and a sample size for different kernel functions. We consider five different kernels (Bartlett, Parzen, Quadratic spectral, Daniell and Bohman). In the fixed-b approach, different kernels give different approximating distributions.
Table 3. (Case I) Actual sizes of the 5% level two-tail model selection test, Case I (n = 50).
Note: (Case I) Actual sizes of the 5% level two-tail model selection test based on the quasi-likelihood criterion, using the approximations from the standard normal [N(0, 1)], fixed-b asymptotics, i.i.d. bootstrap ['Boot(1)'] and block bootstrap ['Boot(5)'] with a block length of five. ρ and α are the AR(1) coefficients of the regressors and errors, respectively.
Table 4. (Case II) Actual sizes of the 5% level two-tail model selection test, Case II (n = 50).
Note: (Case II) Actual sizes of the 5% level two-tail model selection test based on the quasi-likelihood criterion, using the approximations from the standard normal [N(0, 1)], fixed-b asymptotics, i.i.d. bootstrap ['Boot(1)'] and block bootstrap ['Boot(5)'] with a block length of five. ρ and α are the AR(1) coefficients of the regressors and errors, respectively.
Note that since the test statistic's (traditional) limiting distribution under the local alternative is identical across kernels and bandwidths, an asymptotic power comparison is not available under the traditional standard normal approximation. The fixed-b asymptotics makes it possible to compare asymptotic powers across kernels and bandwidths, as shown in Kiefer and Vogelsang (2005). Our finite sample power comparison shows that the fixed-b asymptotic power comparison can be useful in understanding the actual differences in finite sample powers among kernels and bandwidth choices. The fixed-b approximation has reasonable power in our simulations. We use the following local alternatives, generating 300 observations and dropping the first 100 to reduce the effect of initial values (n = 200).
• MA(2) example: We use the same candidate models as in the size comparison of the MA(2) example, with the true DGP
$$ y_t = \varepsilon_t + 0.5(1 + c)\varepsilon_{t-1} + \varepsilon_{t-2}, \qquad (4.22) $$
where $c \in [0, 1]$ is the deviation parameter ($c = 0$ gives the null hypothesis) and the errors are i.i.d. standard normal.
• Linear regression examples: We use the same candidate models as in the size comparison, and the power of the tests is compared under the true DGPs
$$ \text{Case I}: \; y_t = w_t + 0.5(1 + c)x_t + 0.5(1 - c)z_t + u_t, \qquad (4.23) $$
$$ \text{Case II}: \; y_t = w_t + 0.5(1 + c)x_t + 0.5(1 - c)z_t + 0.5y_{t-1} + u_t, \qquad (4.24) $$
where $c \in [0, 1]$ is the deviation parameter. We set ρ = 0.5 and α = 0.5 in the regressor and error DGP specification of the size comparison section; the other settings are the same. Local powers are calculated for the five kernel functions from 5% level two-tailed tests with bandwidths b = 0.02, 0.25, 0.5, 1. Figures 1-3 show the local power comparisons from 5000 iterations for MA(2), Case I and Case II, respectively.

For the MA(2) example, we see a clear difference between two groups of kernels. The quadratic spectral (QS) and Daniell kernels behave very similarly, and the Bartlett, Parzen and Bohman kernels give similar results. The local power curves from the QS and Daniell kernels are sensitive to the bandwidth, and a large bandwidth decreases their power more than that of the other kernels; but they show good size. The power curve of the Bartlett kernel is robust to the bandwidth, and the Parzen and Bohman kernels are also robust, though less so than the Bartlett kernel. This supports the asymptotic power comparison given in Kiefer and Vogelsang (2005). The increased power from small bandwidths is also shown in Kiefer and Vogelsang (2005). For the linear regression cases, we obtained results similar to the MA(2) power results. The Bartlett kernel is robust to the bandwidth for detecting the local alternatives. The QS and Daniell kernels have low local power when the bandwidth is close to one, but show good size at small bandwidths. It is notable that the QS and Daniell kernels behave very similarly, and the Parzen and Bohman kernels show close power curves. Comparing Cases I and II, in Case II, where the candidate models are missing the lagged dependent variable yt−1, the powers decrease in all experiments. The power decrease is more severe when we increase the AR(1) coefficient for yt. We found that the Bartlett kernel has a reasonably good size property with very robust power behaviour.
Choosing a small bandwidth leads to good power but a larger size distortion, and a large bandwidth reduces the size distortion but lowers the power. The power decrease can be mitigated by using the Bartlett kernel. Though not shown in this paper, we found that the regressor and error serial correlations ρ and α affect the power: the power worsens as the serial correlation strengthens, and the effect of α is greater than that of ρ.

Figure 1. [MA(2)] Local power curves for different kernel functions.
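The five kernels compared above have the following standard forms from the HAC literature (e.g. Andrews, 1991); a sketch, where the kernel argument is the lag divided by the bandwidth:

```python
import numpy as np

def bartlett(x):
    """Bartlett: linear taper, k(x) = 1 - |x| on [-1, 1]."""
    return np.where(np.abs(x) <= 1, 1 - np.abs(x), 0.0)

def parzen(x):
    a = np.abs(x)
    return np.where(a <= 0.5, 1 - 6 * a**2 + 6 * a**3,
                    np.where(a <= 1, 2 * (1 - a)**3, 0.0))

def daniell(x):
    """Daniell: k(x) = sin(pi x) / (pi x), with k(0) = 1."""
    return np.sinc(x)  # numpy's sinc is sin(pi x)/(pi x)

def quadratic_spectral(x):
    d = 6 * np.pi * np.asarray(x, dtype=float) / 5
    with np.errstate(divide="ignore", invalid="ignore"):
        k = 3 / d**2 * (np.sin(d) / d - np.cos(d))
    return np.where(d == 0, 1.0, k)  # limit at the origin is 1

def bohman(x):
    a = np.abs(x)
    k = (1 - a) * np.cos(np.pi * a) + np.sin(np.pi * a) / np.pi
    return np.where(a <= 1, k, 0.0)
```

The Bartlett kernel is piecewise linear while the QS and Daniell kernels taper smoothly and have unbounded support, which is consistent with the bandwidth sensitivity of the latter pair observed in the power curves.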
5. EXCHANGE RATES

Diebold and Mariano (1995) considered a test for equality of predictive accuracy of two exchange rate models in forecasting three-months-ahead spot rates. They considered a random walk model (no change over three months) and a forward exchange rate model (the current three-months-forward rate). Accuracy is compared with the mean absolute error criterion. We revisit their analysis using the Federal Reserve Bank of New York's USD/EURO and YEN/USD end-of-month noon buying rates (spot rates) and three-months-forward rates. The data range from 1999.1 to 2006.7, and all changes are measured as differences in logs of exchange rates.
Figure 2. (Case I) Local power curves for different kernel functions.
The selection criterion is the mean absolute error,
$$ E|e_{it}| = E|y_{t+3} - \hat y_{it}|, \quad i = 1, 2, \qquad (5.1) $$
where $y_{t+3} = \log(s_{t+3}/s_t)$ is the (actual) change in spot rates over three months, $\{s_t\}$ is the spot exchange rate process, and $\hat y_{it}$ is the prediction from model $i = 1, 2$. The prediction from model 1 is $\hat y_{1t} = \log(f_t/s_t)$, where $f_t$ is the three-months-forward rate at $t$, and model 2 gives the random walk prediction $\hat y_{2t} = 0$. The null hypothesis is $E(d_t) = E(|e_{1t}| - |e_{2t}|) = 0$, and our HAC robust test statistic is the same as the DM test, given by
$$ T_n = \frac{\sqrt{n}\,\bar d}{\sqrt{\hat V_n}}, \qquad (5.2) $$
where $\bar d$ is the sample mean of $\{d_t\}$ and $\hat V_n$ is the HAC variance estimator for $\{d_t\}$, with $\hat v_t = |e_{1t}| - |e_{2t}|$ in equation (4.2); but we use the fixed-b approximation.
Figure 3. (Case II) Local power curves for different kernel functions.
Figure 4 shows the actual three-months changes of the USD/EURO and YEN/USD rates together with the predictions from the forward rate and random walk models. The average absolute error of the forward rate model for USD/EURO (YEN/USD) is 0.0194 (0.0173); for the random walk model it is 0.0187 (0.0163). For both currencies, the random walk model wins. We test the statistical significance of the superiority of the random walk model.

Figure 4. Three-months change of exchange rates (monthly data).

Figure 5 plots the autocovariance function of {dt}, showing strong serial correlation at low lags and a varying degree of correlation at higher-order lags. The DM test uses (h − 1) as the choice of bandwidth for the h-step-ahead forecasting problem (in our case, h = 3), together with the uniform kernel. We use the Bartlett kernel and explore all bandwidths.

Figure 5. Autocovariance function of the difference in absolute prediction errors |e1t| − |e2t|.

Figure 6 shows the values of our test statistic for a range of bandwidths, along with the critical values from the fixed-b approximation for 5% level two-sided tests. The solid line gives the two-sided 5% level critical values from the fixed-b approximation; the fixed-b critical value at zero bandwidth equals the critical value from the standard normal approximation. For USD/EURO, we reject the null at the 5% level (two-sided) for small bandwidths but cannot reject for large bandwidths. The tests for YEN/USD cannot reject the null for most bandwidths. Had we used the standard normal approximation, large bandwidths would reject the null for both currencies, and this rejection may come from the size distortion of the conventional approximation. Also for YEN/USD, the standard normal approximation rejects the null at very low bandwidths, fails to reject over a wide range of bandwidths up to about m/n = 1/2, and then rejects again at large bandwidths. This is consistent with the documented overrejection of the DM test as the forecasting horizon h grows, since it uses a bandwidth equal to (h − 1) (Harvey et al., 1997). The fixed-b approximation properly addresses the size distortion problem by giving larger critical values for larger bandwidths.
Figure 6. Values of the test statistic with various bandwidths and the Bartlett kernel.
6. CONCLUSION

For comparing non-nested dynamic models, an important improvement in the finite sample properties was obtained by using the fixed-b asymptotics. We have shown by Monte Carlo simulation that the fixed-b asymptotics corrects the size distortion, especially when a large bandwidth is used. The bootstrap methods were compared with the fixed-b approximations and showed similar performance. The fixed-b approach uses a smaller critical region than the standard normal approximation, so we improve the size property of the tests at the cost of some power. The fixed-b approximation showed reasonable local power in our examples, especially when the Bartlett kernel is used. There is a trade-off between size and power in the bandwidth selection, and the Bartlett kernel gave robust power and reasonably good size. The power is influenced by the correlation in the regressors and the errors, and also by the degree of misspecification. In an application to predicting future spot rates on currency exchanges (USD/EURO and YEN/USD), we find that a random walk model does slightly better in mean absolute prediction error than a model based on the forward rate. The difference is not significant using the fixed-b asymptotic distribution, though it appears significant using the normal approximation. Thus the documented tendency of the normal approximation to overreject could lead to an overstatement of the data's ability to distinguish these two models in our sample. Using the standard normal approximation for the dynamic model selection test is not desirable unless the regressors are serially uncorrelated and a small bandwidth is used for linear models. In general, the robust tests should be used and the normal approximation should not. The fixed-b and bootstrap approximations are practical alternatives.
REFERENCES Andrews, D. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817–58. Atkinson, A. C. (1970). A method for discriminating between models. Journal of the Royal Statistical Society, Series B 32, 211–43. Choi, H.-S. and N. Kiefer (2008). Robust nonnested testing and the demand for money. Journal of Business and Economic Statistics 26, 9–17. Cox, D. R. (1961). Tests of separate families of hypotheses. In J. Neyman (Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, 105–23. Berkeley, CA: University of California Press. Cox, D. R. (1962). Further results on tests of separate families of hypotheses. Journal of the Royal Statistical Society, Series B 24, 406–24. Davidson, R. and J. G. MacKinnon (1981). Several tests for model specification in the presence of alternative hypotheses. Econometrica 49, 781–93. Davidson, R. and J. G. MacKinnon (2002). Bootstrap J tests of nonnested linear regression models. Journal of Econometrics 109, 167–93. de Jong, R. M. and J. Davidson (2000). Consistency of kernel estimators of heteroscedastic and autocorrelated covariance matrices. Econometrica 68, 407–23. Diebold, F. X. and R. S. Mariano (1995). Comparing predictive accuracy. Journal of Business and Economic Statistics 13, 253–63. Fan, Y. and Q. Li (1995). Bootstrapping J-type tests for non-nested regression models. Economics Letters 48, 107–12. Godfrey, L. (1998). Tests of non-nested regression models: some results on small sample behaviour and the bootstrap. Journal of Econometrics 84, 59–74. Godfrey, L. and M. Pesaran (1983). Tests of non-nested regression models: small sample adjustments and Monte Carlo evidence. Journal of Econometrics 21, 133–54. Gonc¸alves, S. and T. J. Vogelsang (2010). Block bootstrap HAC robust tests: the sophistication of the naive bootstrap. Forthcoming in Econometric Theory. Gouri´eroux, C. and A. Monfort (1999). 
Testing non-nested hypotheses. In R. F. Engle and D. McFadden (Eds.), Handbook of Econometrics, Volume 4, 2583–637. Amsterdam: North-Holland. Hall, P. and J. L. Horowitz (1996). Bootstrap critical values for tests based on generalized-method-ofmoments estimators. Econometrica 64, 891–916. Hansen, B. E. (1992). Consistent covariance matrix estimation for dependent heterogeneous processes. Econometrica 60, 967–72. Hansen, P. R. (2005). A test for superior predictive ability. Journal of Business and Economic Statistics 23, 365–80. Harvey, D., S. Leybourne and P. Newbold (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting 13, 281–91. Hashimzade, N. and T. Vogelsang (2008). Fixed-b asymptotic approximation of the sampling behaviour of nonparametric spectral density estimators. Journal of Time Series Analysis 29, 142–62. Jansson, M. (2004). The error in rejection probability of simple autocorrelation robust tests. Econometrica 72, 937–46. Kiefer, N. M. and T. J. Vogelsang (2002a). Heteroskedasticity-autocorrelation robust standard errors using the Bartlett kernel without truncation. Econometrica 70, 2093–95. Kiefer, N. M. and T. J. Vogelsang (2002b). Heteroskedasticity-autocorrelation robust testing using bandwidth equal to sample size. Econometric Theory 18, 1350–66. C The Author(s). Journal compilation C Royal Economic Society 2010.
Robust model selection test
Kiefer, N. M. and T. J. Vogelsang (2005). A new asymptotic theory for heteroskedasticity-autocorrelation robust tests. Econometric Theory 21, 1130–64.
Kiefer, N. M., T. J. Vogelsang and H. Bunzel (2000). Simple robust testing of regression hypotheses. Econometrica 68, 695–714.
Kullback, S. and R. Leibler (1951). On information and sufficiency. Annals of Mathematical Statistics 22, 79–86.
Lien, D. and H. Q. Vuong (1987). Selecting the best linear regression model: a classical approach. Journal of Econometrics 35, 3–23.
McAleer, M. (1995). The significance of testing empirical non-nested models. Journal of Econometrics 67, 149–71.
Newey, W. K. and K. D. West (1994). Automatic lag selection in covariance matrix estimation. Review of Economic Studies 61, 631–53.
Pesaran, M. and M. Weeks (2001). Non-nested hypothesis testing: an overview. In B. H. Baltagi (Ed.), A Companion to Theoretical Econometrics, 279–309. Oxford: Blackwell.
Pesaran, M. H. (1974). On the general problem of model selection. Review of Economic Studies 41, 153–71.
Phillips, P. C. B. and S. N. Durlauf (1986). Multiple time series regression with integrated processes. Review of Economic Studies 53, 473–95.
Politis, D. N. and J. P. Romano (1994). The stationary bootstrap. Journal of the American Statistical Association 89, 1303–13.
Rivers, D. and Q. Vuong (2002). Model selection tests for nonlinear dynamic models. Econometrics Journal 5, 1–39.
Vuong, Q. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 307–33.
White, H. (2000). A reality check for data snooping. Econometrica 68, 1097–126.
APPENDIX A: PROOFS OF RESULTS

ASSUMPTION A.1. Let {Q_n^j(ω, γ^j)}_{n=1}^∞ be a sequence of functions such that Q_n^j(·, γ^j) is measurable for every γ^j ∈ Γ^j. There exists an equicontinuous sequence of non-stochastic functions {Q̄_n^j(γ^j)}_{n=1}^∞ such that Q_n^j(ω, γ^j) − Q̄_n^j(γ^j) → 0 a.s., uniformly on Γ^j, as n → ∞.

ASSUMPTION A.2. The functions Q_n^j(ω, ·) and Q̄_n^j(·) are continuously differentiable, and ∇_{γ^j} Q_n^j(ω, γ^j) − ∇_{γ^j} Q̄_n^j(γ^j) → 0 a.s., uniformly on Γ^j, as n → ∞. Moreover, ∇_{γ^j} Q̄_n^j(·) is equicontinuous and ∇_{γ^j} Q̄_n^j(γ̄_n^j) is bounded.

Proof of Theorem 3.1: Step 1. Using Assumptions 3.1 and 3.2, we can use the result from Step 1 of the proof of Theorem 2 in RV (p. 28) to show that Assumptions A.1 and A.2 are satisfied. Then, with Assumptions 3.1, A.1–A.2 and 3.5, we have the normality of the numerator of our test statistics,

    (√n / λ_n) [Q_n^1(ω, γ̂_n^1) − Q_n^2(ω, γ̂_n^2)] →^d N(0, 1),    (A.1)

from equation (A.3) in RV, following the proof of Theorem 1 of RV (p. 27). Note that we replaced Assumption 6 of RV with our Assumption 3.5.

Step 2. From Assumptions 3.1 and 3.2, we have

    n^{-1/2} Σ_{t=1}^{[rn]} (v̂_nt − v_nt) →^p 0,    (A.2)

uniformly in r. Then the result for the denominator,

    V̂_n ⇒ λ² Q(b),    (A.3)

is derived directly from Theorem 1 (case 3) of Hashimzade and Vogelsang (2008), for both T_n and T_n^RV, using Assumptions 3.2–3.5. Here Q(b) is defined as either Q_W(b) or Q_W̃(b), for T_n and T_n^RV respectively. Note that Assumption 3.2(d) is stronger than the assumption given in RV, since m = O(n). Therefore we get the desired results,

    T_n = [(√n / λ_n)(Q_n^1(ω, γ̂_n^1) − Q_n^2(ω, γ̂_n^2))] / √(V̂_n / λ_n²) ⇒ W(1) / √(Q_W(b)),    (A.4)

    T_n^RV ⇒ W(1) / √(Q_W̃(b)),    (A.5)

directly from equations (A.1) and (A.3). The independence between W(1) and Q_W(b) follows from the independence between W(1) and W̃(r).
The Econometrics Journal (2010), volume 13, pp. 205–217. doi: 10.1111/j.1368-423X.2010.00312.x

Testing the adequacy of conventional asymptotics in GMM

JONATHAN H. WRIGHT†

†Department of Economics, Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, USA. E-mail: [email protected]

First version received: December 2008; final version accepted: December 2009
Summary  This paper proposes a new test of the null hypothesis that a generalized method of moments model is sufficiently well identified for conventional asymptotics to be reliable. The idea of the test is to compare the volume of two confidence sets—one that is robust to lack of identification and one that is not. Under the null hypothesis, the relative volume of these two sets is O_p(1), but under the alternative, the robust confidence set has a high probability of being unbounded.

Keywords: GMM, Instrumental variables, Robust confidence sets, Weak identification.
1. INTRODUCTION

A key assumption of conventional asymptotic theory in the generalized method of moments (GMM) model (Hansen, 1982) is identification, which requires the moment condition to have a unique zero at the true parameter value and to have a gradient of full rank. Where this assumption fails, or nearly fails, conventional Gaussian asymptotic theory can provide a very poor approximation to the actual sampling distribution of estimators and test statistics. An important special case is the linear instrumental variable (IV) model, in which the identification condition requires the instruments and endogenous variables to be correlated. The many papers showing the inadequacy of conventional asymptotics when identification is weak and/or suggesting alternative asymptotic approximations include Nelson and Startz (1990a,b), Bekker (1994), Bound et al. (1995), Hansen et al. (1996), Staiger and Stock (1997), Stock and Wright (2000) and Newey and Windmeijer (2009).

Fortunately, much progress has been made in the last 15 years on developing approaches to inference in GMM that are robust to failure or near-failure of the identification condition. These approaches all give up on point estimation, because consistent point estimation is of course impossible without identification. These robust methods are all instead based on constructing tests of hypotheses concerning the structural coefficient where the asymptotic distribution of the test statistic (or in some cases its exact distribution) is the same regardless of whether the model is identified or not. Confidence sets can be formed by inverting the acceptance regions of these tests. Such identification-robust tests/confidence sets have been proposed by Anderson and Rubin (1949), Zivot et al. (1998), Stock and Wright (2000), Kleibergen (2002, 2005), Moreira (2003, 2009), Guggenberger and Smith (2005, 2008), Andrews et al. (2006), Otsu (2006), Andrews
and Marmer (2008), Hansen et al. (2008) and Kleibergen and Mavroeidis (2008), among others. Andrews and Stock (2006) provide a recent review.

There is a strong case to be made for saying that researchers should always report only identification-robust confidence sets, giving up on point estimation. However, empirical researchers evidently prefer to report point estimates and standard errors, partly because identification-robust confidence sets are hard to represent when the number of parameters is large. Accordingly, it seems helpful to have a diagnostic so as to indicate whether the identification is sufficiently strong that conventional Wald confidence sets are likely to be adequate.

A number of tests of identification have been proposed. The simplest is the first-stage F-test, which tests for a lack of identification in the linear IV model. However, a significant first-stage F-statistic by no means implies that issues of weak instruments can be ignored (see, for example, Stock and Yogo, 2005). Stock and Yogo constructed critical values for a version of the first-stage F-test in which the null hypothesis is instead that identification is too weak for conventional asymptotics to work well. Hahn and Hausman (2002) proposed a test of the hypothesis that the model is identified, but this test has, however, low power against the alternative of weak identification (Hausman et al., 2005). All of these tests apply only in the linear IV model.1

In this paper, I propose a new test of the hypothesis that identification is sufficiently strong for conventional asymptotics to work well. It is applicable in the general GMM model provided that the model has more moment conditions than parameters. The idea is to compare the volume of a Wald confidence set (not robust to identification difficulties) with the volume of a robust confidence set. Under the null that the model is well identified, this ratio is O_p(1).
Under the alternative, the robust confidence set has a high probability of being unbounded. The proposed test has non-trivial power both against the alternative that the model is completely unidentified and against the alternative that the identification is so weak that conventional Gaussian asymptotics works very poorly. Thus it is a test of the adequacy of conventional Gaussian asymptotics. In this regard, the motivation is similar to that of Stock and Yogo (2005). But the test proposed here is different from that of Stock and Yogo in that it flips the null and alternative hypotheses, and is valid in a general GMM context, not just the linear IV model.

The plan for the remainder of this paper is as follows. The GMM model is introduced in Section 2. Section 3 describes the proposed test and derives its asymptotic distribution. Section 4 contains a Monte Carlo simulation evaluating the test. Section 5 concludes. Proofs of the theorems are in the Appendix.
2. THE GMM MODEL

The GMM model specifies that {Y_t}_{t=1}^T is an observed time series and θ is an n × 1 parameter vector with a true value θ_0 in the interior of a compact space Θ, such that E(φ(Y_t, θ_0)) = 0, where φ(·, ·) is a k-dimensional function, k ≥ n. The GMM estimator of θ is

    θ̂ = arg min_θ S(θ),
1 A computationally intensive and asymptotically conservative analogue of the first-stage F-test for the GMM model was developed by Wright (2003).
where S(θ) = φ*(θ)′ W_T φ*(θ), with φ*(θ) = T^{-1/2} Σ_{t=1}^T φ(Y_t, θ), and W_T is a symmetric positive definite k × k weighting matrix which converges almost surely to a symmetric, non-stochastic, O(1) positive definite matrix W. Here are the standard assumptions for the GMM model:
ASSUMPTION 2.1. φ*(θ) is twice continuously differentiable, for all θ in Θ.

ASSUMPTION 2.2. T^{-1} Σ_{t=1}^T φ(Y_t, θ) →^{a.s.} E(φ(Y_t, θ)) and T^{-1} Σ_{t=1}^T dφ(Y_t, θ)/dθ′ →^{a.s.} E[dφ(Y_t, θ)/dθ′], uniformly in θ.

ASSUMPTION 2.3. T^{-1/2} Σ_{t=1}^T [φ(Y_t, θ) − E(φ(Y_t, θ))] →^d N(0, A(θ)), uniformly in θ, where A(θ) is 2π times the zero-frequency spectral density matrix of φ(Y_t, θ).

ASSUMPTION 2.4. V_T(θ) is an estimator of A(θ) that is consistent, uniformly in θ.

ASSUMPTION 2.5. The k × n matrix B = E[dφ(Y_t, θ_0)/dθ′] has rank n.

ASSUMPTION 2.6. E(φ(Y_t, θ)) has a unique zero at θ = θ_0.

Assumptions 2.2 and 2.3 are high-level convergence assumptions. Assumption 2.5 is the local identification assumption. Assumption 2.6 is the global identification assumption (Hsiao, 1983). Under these assumptions, θ̂ →^p θ_0 and

    √T (θ̂ − θ_0) →^d N(0, (B′WB)^{-1} B′WAWB (B′WB)^{-1}),

where A = A(θ_0). Let S_TS(θ) and S_CU(θ) denote the familiar two-step and continuous-updating (CU) objective functions (see, for example, Hansen et al., 1996). Both of these use weight matrices that converge in probability to A^{-1}. Let θ̂_TS and θ̂_CU denote the resulting estimators. Both are asymptotically efficient and are asymptotically normal with mean zero and variance (B′A^{-1}B)^{-1}.

In empirically relevant sample sizes, the above asymptotic theory often works poorly, as θ̂_TS and θ̂_CU are frequently biased and have sampling distributions far from those predicted by this asymptotic theory. These problems, documented in numerous Monte Carlo studies, can arise because of a failure or near-failure of the identification Assumptions 2.5 and 2.6. However, a simple remedy for this problem is to use the fact that, if θ_0 is the true parameter value, then Assumptions 2.3 and 2.4 alone are sufficient to ensure that S_CU(θ_0) converges in distribution to a χ²(k) random variable. A confidence set for θ can then be formed as the inverse of the acceptance region of this test, i.e. the confidence set of coverage 1 − α is

    S*_θ(α) = {θ : S_CU(θ) ≤ F_{χ²}(k, α)},

where F_{χ²}(a, b) is the 100b percentile of a χ²(a) distribution. In a completely unidentified model (E(φ(Y_t, θ)) = 0, uniformly in θ) or a locally asymptotically underidentified model (E(φ(Y_t, θ)) = O(T^{-1/2}), uniformly in θ), such a confidence set will have infinite expected volume.
But this is the correct statement of our uncertainty about θ in the presence of weak identification (Dufour, 1997). This confidence set is known as the S-set and was proposed by Stock and Wright (2000). In the homoscedastic linear IV model, it reduces to the confidence set of Anderson and Rubin (1949).2

2 Anderson and Rubin assumed normality. Making this additional assumption, their test statistic has an exact F-distribution. However, in this paper, I do not assume normality, and so view the Anderson–Rubin (AR) test as an asymptotic test.
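As a concrete sketch of how the S-set can be computed by inverting the CU objective over a grid, consider the helper below. The function name, the grid-search approach and the centred outer-product choice for the weight-matrix estimator V_T(θ) are illustrative assumptions, not the paper's code.

```python
import numpy as np
from scipy.stats import chi2


def s_set(theta_grid, phi, Y, alpha=0.95):
    """Grid inversion of the CU objective: keep theta with S_CU(theta) <= F_chi2(k, alpha).

    phi(Y, theta) should return a T x k array of moment contributions phi(Y_t, theta).
    The weight matrix here is the centred outer-product estimator, an assumption
    made for illustration."""
    T = Y.shape[0]
    keep = []
    for theta in theta_grid:
        m = phi(Y, theta)                          # T x k moment contributions
        g = np.sqrt(T) * m.mean(axis=0)            # sqrt(T) * sample moment
        d = m - m.mean(axis=0)
        V = d.T @ d / T                            # V_T(theta)
        S_cu = g @ np.linalg.solve(V, g)           # CU objective at theta
        if S_cu <= chi2.ppf(alpha, m.shape[1]):
            keep.append(theta)
    return np.array(keep)
```

In the linear IV model, a closed-form (quadratic-equation) inversion is available, but the grid version above applies to any moment function.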
3. THE PROPOSED TEST

This paper proposes a test of the null hypothesis that the model is well identified in the GMM context, so long as k > n. Assumptions 2.1–2.4 are maintained throughout. The goal is to construct a test of Assumptions 2.5 and 2.6.

Partition the parameter vector as θ = (θ_A′, θ_B′)′, with θ_A and θ_B being n_A × 1 and n_B × 1, respectively. Define W_1 as the maximum distance between any two values of θ_A in the robust S-set for θ, i.e.

    W_1 = sup_{θ_{A,1}, θ_{A,2}} ||θ_{A,1} − θ_{A,2}||  such that  S_CU(θ_i) ≤ F_{χ²}(k, α) for i = 1, 2,

where ||·|| denotes the L₂-norm and θ_i = (θ_{A,i}′, θ_{B,i}′)′ for i = 1, 2. If the S-set is empty, define W_1 to be zero. If it is unbounded, define W_1 to be infinity. The computation of W_1 simplifies in the linear IV model, because there is an analytical expression for the AR confidence set in this case (Dufour and Taamouti, 2005). It reduces to the solution to a quadratic equation in the case where θ = θ_A is a scalar parameter.

Likewise, define W_2 as the maximum distance between any two points in the usual two-step GMM Wald confidence set for θ_A, i.e.

    W_2 = sup_{θ_{A,1}, θ_{A,2}} ||θ_{A,1} − θ_{A,2}||  such that  T(θ̂_{A,TS} − θ_{A,i})′ (Ĵ_{AA} − Ĵ_{AB} Ĵ_{BB}^{-1} Ĵ_{BA}) (θ̂_{A,TS} − θ_{A,i}) ≤ F_{χ²}(n_A, α) for i = 1, 2,

where θ̂_{A,TS} is the two-step estimator of θ_A, J = B′A^{-1}B is partitioned conformably with θ as

    J = [ J_{AA}  J_{AB} ]
        [ J_{BA}  J_{BB} ],

and the Ĵ's are the counterparts replacing A and B by the usual consistent estimates. The numerical computation of W_2 is simple, as

    W_2 = 2 √( F_{χ²}(n_A, α) / (λ̂ T) ),

where λ̂ denotes the smallest eigenvalue of Ĵ_{AA} − Ĵ_{AB} Ĵ_{BB}^{-1} Ĵ_{BA}.

I refer to W_1 and W_2 as the volumes of the robust S-set and Wald confidence set, respectively. The test statistic that I propose is the ratio of these two volumes,

    L = W_1 / W_2.

The limiting distribution of L under the null of identification (Assumptions 2.5 and 2.6) is provided in Theorem 3.1.

THEOREM 3.1. Suppose that Assumptions 2.1–2.4 hold and that k > n. Under the null hypothesis that Assumptions 2.5 and 2.6 also hold,

    L →^d L* = √( (F_{χ²}(k, α) − ω) / F_{χ²}(n_A, α) ) · 1[F_{χ²}(k, α) − ω ≥ 0],

where ω is a χ²(k − n) random variable. The null limiting distribution L* has point mass at zero and non-negative support, as

    P(L* = 0) = 1 − P(ω ≤ F_{χ²}(k, α)),
    P(L* ≤ x) = 1 − P(ω ≤ F_{χ²}(k, α) − F_{χ²}(n_A, α) x²),  for 0 < x < √( F_{χ²}(k, α) / F_{χ²}(n_A, α) ),

and P(L* ≤ √( F_{χ²}(k, α) / F_{χ²}(n_A, α) )) = 1.
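The closed-form expression for W_2 can be implemented directly. The helper below is a hypothetical sketch: the function name, and the convention that Ĵ is ordered with the θ_A block first, are assumptions.

```python
import numpy as np
from scipy.stats import chi2


def wald_volume(J_hat, T, nA, alpha=0.95):
    """Diameter W2 of the two-step Wald confidence ellipsoid for theta_A.

    J_hat estimates J = B' A^{-1} B, partitioned with theta_A first. Uses the
    closed form W2 = 2 * sqrt(F_chi2(nA, alpha) / (lam * T)), where lam is the
    smallest eigenvalue of J_AA - J_AB J_BB^{-1} J_BA."""
    J_AA = J_hat[:nA, :nA]
    J_AB = J_hat[:nA, nA:]
    J_BB = J_hat[nA:, nA:]
    if J_AB.size:                                   # theta_B present: Schur complement
        M = J_AA - J_AB @ np.linalg.solve(J_BB, J_AB.T)
    else:                                           # theta_A is the whole vector
        M = J_AA
    lam = np.linalg.eigvalsh(M).min()
    return 2.0 * np.sqrt(chi2.ppf(alpha, nA) / (lam * T))
```

For a scalar parameter with Ĵ = 1 and T = 100, this returns 2 × 1.96/√100 ≈ 0.392, the familiar width of a 95% Wald interval.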
Table 1. Critical values of L*.

 k     n=1      n=2      n=3      n=4      n=5
 2     1.248
 3     1.417    1.142
 4     1.542    1.252    1.102
 5     1.642    1.338    1.185    1.080
 6     1.726    1.408    1.251    1.147    1.066
 7     1.799    1.469    1.307    1.202    1.123
 8     1.863    1.522    1.356    1.249    1.170
 9     1.922    1.569    1.398    1.289    1.210
10     1.975    1.612    1.437    1.326    1.245
11     2.024    1.652    1.472    1.358    1.277
12     2.069    1.689    1.505    1.389    1.305
13     2.112    1.723    1.535    1.417    1.332
14     2.152    1.755    1.564    1.443    1.356
15     2.190    1.786    1.591    1.467    1.379
16     2.226    1.814    1.616    1.490    1.401
17     2.260    1.842    1.640    1.512    1.421
18     2.293    1.868    1.663    1.533    1.441
19     2.324    1.893    1.685    1.553    1.459
20     2.354    1.917    1.706    1.572    1.477
21     2.383    1.940    1.726    1.590    1.494
22     2.411    1.962    1.745    1.608    1.510
23     2.438    1.984    1.764    1.625    1.526
24     2.464    2.005    1.782    1.641    1.541
25     2.489    2.025    1.800    1.657    1.556
26     2.514    2.044    1.817    1.673    1.570
27     2.537    2.063    1.833    1.688    1.584
28     2.561    2.081    1.849    1.702    1.597
29     2.583    2.099    1.865    1.716    1.610
30     2.605    2.117    1.880    1.730    1.623

Note: This table gives the 95th percentile of the null limiting distribution of L, L*, allowing a one-sided 5% test to be constructed in the special case in which the test compares the volume of Wald and S-sets for the entire parameter vector, θ. Entries are only given for k > n, as the distribution in Theorem 3.1 applies in this case only.
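Because Theorem 3.1 gives the null CDF of L* in closed form, its 95th percentile can be computed exactly rather than simulated. The sketch below (function name assumed) reproduces the Table 1 entries for the full-vector case n_A = n:

```python
from scipy.stats import chi2


def critical_value(k, n, alpha=0.95, size=0.05):
    """95th percentile of the null limit L* in Theorem 3.1 (full-vector case, nA = n).

    Inverts P(L* > x) = P(omega <= F_chi2(k, alpha) - F_chi2(n, alpha) x^2) = size,
    with omega ~ chi2(k - n), giving
    x = sqrt((F_chi2(k, alpha) - Q_omega(size)) / F_chi2(n, alpha))."""
    num = chi2.ppf(alpha, k) - chi2.ppf(size, k - n)
    return (num / chi2.ppf(alpha, n)) ** 0.5


# Matches Table 1, e.g.:
# round(critical_value(5, 1), 3)   -> 1.642
# round(critical_value(10, 2), 3)  -> 1.612
```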
The proposed test is a one-sided test that rejects the null for large values of L. The test compares the volume of two confidence sets for a subvector of θ.3

3 The proposed test can be thought of as a Hausman specification test (Hausman, 1978) applied to confidence sets rather than point estimates.

One special case is where confidence sets for the whole vector are compared (θ = θ_A), in which case Theorem 3.1 goes through with n_A = n. Table 1 gives critical values for a 5% test (the 95th percentile of L*) for this case,
with various values of k and n. The null limiting distribution of L is degenerate if k = n, but the statement of Theorem 3.1 rules out this case. The proposed test works for any nominal coverage level, α, of the Wald and S-sets. For all numerical work in this paper, I use α = 0.95. Theorem 3.2 gives a result on the power of the proposed test.

THEOREM 3.2. If the model is completely unidentified (i.e. Assumptions 2.1–2.4 continue to hold, but 2.5 and 2.6 are not satisfied, and instead E(φ(Y_t, θ)) = 0 for all θ), then in the limit as the sample size goes to infinity, the power of the test is at least 2α − 1.

While Theorem 3.2 does not show that the test is consistent, its rejection rate is guaranteed to asymptote above a certain point that depends on the coverage of the confidence sets, α. For example, if α = 0.95, i.e. the robust and Wald confidence sets have 95% nominal coverage, then the rejection rate of the test under this alternative is guaranteed to asymptote above 90% (and could of course be higher).4

Zivot et al. (1998) prove that, in the linear IV model with n = 1, the AR confidence set of nominal coverage α must be unbounded in any sample in which the first-stage F-test statistic is below the α critical value of a χ²(k)/k distribution.5 It follows that the rejection rate of L must be no less than the acceptance rate of the usual first-stage F-test. As shall be seen in Monte Carlo simulations below, in the linear IV model with n = 1, the rejection rate of the proposed test is often much greater than the acceptance rate of the usual first-stage F-test. Kleibergen and Mavroeidis (2008) show that S_CU(θ) at extreme values of θ can be interpreted as an identification test statistic (in the linear IV model, it is just the first-stage F-statistic).

Formally, the null hypothesis for the test is that the model is identified (2.5 and 2.6 hold).
But this is a point-wise result, and one would expect the test to have a high rejection rate in small samples if these assumptions hold but the identification is very weak.6 This would be desirable if the identification is so weak that the Wald confidence set is unreliable with the sample size at hand. Thus, in practice, I think of using L as a test of the null hypothesis that the model is sufficiently well identified for conventional asymptotics to be reliable. The small-sample properties of the proposed test with different degrees of identification will be evaluated in Monte Carlo simulations in Section 4.
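The unboundedness condition of Zivot et al. (1998) cited above is easy to check in practice. A minimal sketch (function name assumed):

```python
from scipy.stats import chi2


def ar_set_must_be_unbounded(F_stat, k, alpha=0.95):
    """Zivot et al. (1998) condition (linear IV, n = 1): the AR confidence set of
    nominal coverage alpha is unbounded whenever the first-stage F-statistic falls
    below the alpha critical value of a chi2(k)/k distribution. When this returns
    True, W1 is infinite, so the proposed L statistic necessarily rejects."""
    return F_stat < chi2.ppf(alpha, k) / k
```

For example, with k = 5 instruments the cutoff is χ²(5, 0.95)/5 ≈ 2.21, so a first-stage F of 1.0 forces an unbounded AR set while an F of 10.0 does not.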
4. MONTE CARLO RESULTS

4.1. The linear IV model with a single included endogenous regressor

In the first set of Monte Carlo results, I focus on the linear IV model and base the experimental design on Hahn and Hausman (2002), specifying that
4 Although L uses the Wald confidence set based on the two-step estimator as the non-robust confidence set, the same asymptotic distribution theory would apply if the Wald confidence set associated with the CU estimator were used instead. Because the CU estimator is more robust to weak identification than the two-step estimator, this, however, should give a less powerful test.

5 In the linear IV model for general n, Dufour and Taamouti (2005) show that a necessary and sufficient condition for the AR confidence set to be bounded is that a certain matrix is positive definite. Whenever this matrix is not positive definite, the AR confidence set will be unbounded, and the proposed test statistic will necessarily reject.

6 Testing the null of identification against the alternative of a lack of identification involves testing a composite null against a point alternative. Such a test must either have power equal to the size, or must fail to control the size uniformly in the parameter space under the null.
    y = Xβ + u,
    X = ZΠ + v,

where y and X are T × 1 vectors of endogenous variables, Z is a T × k matrix of instruments that are independent standard normal random variables, and u and v are conformable vectors of errors such that w_t = (u_t, v_t)′ is a vector of zero-mean Gaussian errors with variance–covariance matrix

    Σ = [ 1  ρ ]
        [ ρ  1 ].

I normalize β to zero and set Π = (π, ..., π)′. The population R² in the first-stage regression is R̃²_f = kπ²/(kπ² + 1) and measures the strength of identification, so π = √( R̃²_f / (k(1 − R̃²_f)) ). I set ρ = 0.5 and 0.9, k = 5, 10 and 30, and R̃²_f = 0, 0.01, 0.1 and 0.5, ranging from no identification to quite strong identification. Results are reported for T = 100 and 1000 in Tables 2 and 3, respectively.

In each experiment, I calculate: (i) the coverage of the Wald confidence interval for β using BTSLS (bias-adjusted TSLS, which is more robust to weak instruments than TSLS without being fully robust; Donald and Newey, 2001); (ii) the coverage of the AR confidence set for β; (iii) the rejection rate of the proposed test for identification, L (based on comparing TSLS-Wald and AR confidence sets for the parameter β); and (iv) the acceptance rates of the first-stage F-tests for the null of a lack of identification using standard critical values and the critical values of Stock and Yogo (2005), which I call F_1 and F_2, respectively.7 All confidence sets for β have 95% nominal coverage, and the tests F_1, F_2 and L all have 5% nominal size.

The model is formally identified in all the experiments except those for which R̃²_f = 0. But one would want the test L to reject if the identification is so weak that the t-statistics exhibit severe size distortions. The effective coverage rate of the Wald confidence interval can be far below the nominal level, as is well known. Its simulated coverage can be below 20%, even if R̃²_f is as large as 0.1. The AR confidence set effectively circumvents this problem.
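A minimal sketch of this experimental design follows. The function names, the no-intercept form of the first-stage F-statistic, and any details beyond those stated in the text are assumptions of the sketch.

```python
import numpy as np


def simulate_iv(T, k, R2f, rho, rng):
    """One draw from the design above: y = X*beta + u, X = Z*Pi + v, beta = 0,
    Pi = (pi, ..., pi)', with pi set so the population first-stage R^2 is R2f."""
    pi = np.sqrt(R2f / (k * (1.0 - R2f))) if R2f > 0 else 0.0
    Z = rng.standard_normal((T, k))
    cov = np.array([[1.0, rho], [rho, 1.0]])        # Var(u_t, v_t)
    w = rng.multivariate_normal(np.zeros(2), cov, size=T)
    u, v = w[:, 0], w[:, 1]
    X = Z @ (pi * np.ones(k)) + v
    beta = 0.0                                      # beta normalized to zero
    y = X * beta + u
    return y, X, Z


def first_stage_F(X, Z):
    """First-stage F-statistic (homoscedastic, no-intercept form); a sketch."""
    T, k = Z.shape
    coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
    fit = Z @ coef
    rss = ((X - fit) ** 2).sum()
    ess = ((fit - X.mean()) ** 2).sum()
    return (ess / k) / (rss / (T - k))
```

With R̃²_f = 0.5 the first-stage F is very large; with R̃²_f = 0 it hovers around 1, in which case the AR set is typically unbounded and L rejects.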
The proposed test, L, has a low rejection rate when conventional asymptotic theory works well, but a high rejection rate when it works poorly. The rejection rate of the proposed test is above 99% in all cases where R̃²_f = 0. So the test rejects with probability very close to 1 when instruments are totally irrelevant. The rejection rate of L is consistently above 93% in the case R̃²_f = 0.01. In contrast, the test of Hahn and Hausman (2002) has rejection rates of around 10% in many of these simulations (their Table 3). While it is true that the model is formally identified with R̃²_f = 0.01, the model cannot be said to be well identified with such a low theoretical first-stage R² without an enormous sample size. Viewing the test as testing for the adequacy of conventional asymptotic theory, it is a useful feature of the proposed test that it rejects in such cases.

The test has a much lower rejection rate if R̃²_f = 0.5. When the identification is strong, the rejection rate of L is increasing in the number of instruments. This makes sense because the limiting distribution of the Wald test is affected by a large number of strong instruments (Bekker, 1994).

The simulation results indicate that the test is useful as an indicator of the adequacy of the standard asymptotic approximation for the Wald test statistic. This motivates considering the coverage of the confidence set for β that is either the BTSLS-Wald or the AR confidence set, depending on the result of L, F_1 or F_2. Coverage rates for these pretest confidence sets
C The Author(s). Journal compilation C Royal Economic Society 2010.
Table 2. Monte Carlo results: linear IV model: T = 100.

                            Coverage       Reject rate      Accept rate       Pretest coverage
R̃²f     ρ      k         Wald     AR          L            F1       F2        L       F1      F2
0       0.5     5         54.5    95.5       99.8          93.6    100.0     95.5    92.9    95.5
0       0.5    10         37.3    95.9      100.0          94.2    100.0     95.9    92.7    95.9
0       0.5    30         17.7    97.4      100.0          88.3    100.0     97.4    89.3    97.4
0       0.9     5         17.7    95.5      100.0          94.6    100.0     95.5    92.6    95.5
0       0.9    10          9.4    95.9      100.0          92.8    100.0     95.9    91.5    95.9
0       0.9    30          3.5    97.4      100.0          90.0    100.0     97.4    89.4    97.4
0.01    0.5     5         58.9    95.5       99.8          89.4    100.0     95.5    92.3    95.5
0.01    0.5    10         40.0    95.9       99.9          91.5    100.0     95.9    92.2    95.9
0.01    0.5    30         17.9    97.4      100.0          85.2    100.0     97.4    87.4    97.4
0.01    0.9     5         24.1    95.5       99.3          90.3    100.0     95.5    88.6    95.5
0.01    0.9    10          8.5    95.9       99.6          89.7    100.0     95.9    88.4    95.9
0.01    0.9    30          3.9    97.4      100.0          86.7    100.0     97.4    86.4    97.4
0.1     0.5     5         81.8    95.5       93.9          28.6     99.4     95.6    89.4    95.4
0.1     0.5    10         62.4    95.9       95.5          41.4    100.0     95.5    83.2    95.9
0.1     0.5    30         32.5    97.4       99.7          57.6    100.0     97.4    76.3    97.4
0.1     0.9     5         69.6    95.5       95.7          27.9     99.6     94.2    78.1    95.4
0.1     0.9    10         45.2    95.9       95.5          40.2    100.0     94.7    69.4    95.9
0.1     0.9    30         16.4    97.4       98.3          56.7    100.0     97.3    62.4    97.4
0.5     0.5     5         94.3    95.5       47.5           0.0      0.1     96.7    94.3    94.3
0.5     0.5    10         92.1    95.9       55.3           0.0     59.0     95.1    92.1    93.8
0.5     0.5    30         80.5    97.4       86.1           0.0    100.0     96.0    80.5    97.4
0.5     0.9     5         92.7    95.5       57.0           0.0      0.5     95.4    92.7    92.8
0.5     0.9    10         88.8    95.9       65.3           0.0     60.3     93.4    88.8    91.0
0.5     0.9    30         71.3    97.4       90.6           0.0    100.0     94.3    71.3    97.4
Note: The coverage columns give the effective coverage rates of the bias-adjusted TSLS-Wald and AR confidence sets (nominal coverage: 95%). The reject rate column gives the rejection rate of the proposed test of the null of identification, L, which compares TSLS-Wald and AR confidence set volumes. The accept rate columns give the acceptance rates of F 1 and F2 , the tests of the null of underidentification comparing the first-stage F-statistic with χ 2 (k)/k critical values and the critical values of Stock and Yogo (2005) which ensure that the TSLS-Wald test size is no larger than 25%. The pretest coverage columns report the effective coverage rate of the confidence set that is either the Wald or the AR confidence set depending on the tests L, F1 and F 2 . The sample size is T = 100 in all cases. For each experiment, 1000 replications were conducted.
are also reported in Tables 2 and 3. The confidence set using L as the pretest has coverage that is consistently close to 95% (never below 89%), meaning that the proposed test does a reasonable job of assessing the adequacy of conventional asymptotic theory. The same is true for the coverage rate of the pretest confidence set using F_2. But the coverage rate of the pretest confidence set using the usual first-stage F-test as the pretest can be as low as 62%. I conclude that a researcher working with the linear IV model who has a preference for conventional point estimates and standard errors should use either L or F_2 as a pretest, using a robust confidence
Table 3. Monte Carlo results: linear IV model: T = 1000.

                            Coverage       Reject rate      Accept rate       Pretest coverage
R̃²f     ρ      k         Wald     AR          L            F1       F2        L       F1      F2
0       0.5     5         53.9    95.0      100.0          94.2    100.0     95.0    92.6    95.0
0       0.5    10         37.0    94.3       99.9          94.6    100.0     94.3    90.9    94.3
0       0.5    30         20.4    95.5       99.6          95.2    100.0     95.5    92.5    95.5
0       0.9     5         19.0    95.0      100.0          94.5    100.0     95.0    92.6    95.0
0       0.9    10          8.2    94.3       99.8          94.8    100.0     94.3    92.2    94.3
0       0.9    30          3.3    95.5       99.5          94.6    100.0     95.5    93.4    95.5
0.01    0.5     5         82.0    95.0       92.9          31.2     99.9     94.8    88.6    95.0
0.01    0.5    10         62.6    94.3       95.8          45.6    100.0     94.3    82.4    94.3
0.01    0.5    30         33.4    95.5       97.6          68.7    100.0     95.0    80.3    95.5
0.01    0.9     5         69.4    95.0       93.5          31.0     99.8     92.5    78.7    95.0
0.01    0.9    10         44.2    94.3       94.4          46.1    100.0     92.8    70.2    94.3
0.01    0.9    30         16.3    95.5       96.1          68.2    100.0     95.2    70.6    95.5
0.1     0.5     5         94.6    95.0       28.6           0.0      0.0     96.2    94.6    94.6
0.1     0.5    10         93.1    94.3       32.8           0.0     42.7     95.6    93.1    93.0
0.1     0.5    30         84.2    95.5       54.6           0.0    100.0     92.5    84.2    95.5
0.1     0.9     5         92.9    95.0       39.6           0.0      0.0     94.3    92.9    92.9
0.1     0.9    10         91.3    94.3       50.0           0.0     42.4     94.4    91.3    90.3
0.1     0.9    30         75.4    95.5       73.1           0.0    100.0     88.8    75.4    95.5
0.5     0.5     5         94.8    95.0        8.4           0.0      0.0     95.3    94.8    94.8
0.5     0.5    10         95.1    94.3        8.0           0.0      0.0     95.5    95.1    95.1
0.5     0.5    30         93.9    95.5       10.0           0.0      0.0     94.7    93.9    93.9
0.5     0.9     5         94.7    95.0        9.3           0.0      0.0     95.2    94.7    94.7
0.5     0.9    10         95.2    94.3        9.0           0.0      0.0     95.5    95.2    95.2
0.5     0.9    30         93.0    95.5       13.8           0.0      0.0     94.0    93.0    93.0
Note: As for Table 2, except that the sample size is T = 1000.
set if the model is found to be underidentified, and the BTSLS-Wald confidence set otherwise. An advantage of using L rather than F_2 as the pretest is that, while they both result in similar effective coverage, there are some simulations in which L gives the researcher a much higher chance of using conventional inference.

4.2. The consumption CAPM with CRRA preferences

An important feature of the proposed test, L, is that it is applicable in all GMM settings, not just in the homoscedastic linear IV model, unlike F_1, F_2 or the test of a null of identification proposed by Hahn and Hausman (2002). The second set of Monte Carlo results evaluates the proposed test in the consumption CAPM with CRRA preferences.
Table 4. Monte Carlo results: consumption CAPM.

                             Coverage             Reject rate    Pretest
Instruments     T         Wald    Robust S-set        L          coverage
A               50        60.2    86.2               95.8        85.7
A              100        74.4    91.3               95.9        90.1
A              250        83.7    93.1               84.0        90.0
A             1000        91.3    95.1               16.7        91.6
B               50        73.3    93.8               98.8        93.4
B              100        79.8    93.8               97.2        93.1
B              250        88.4    94.7               85.8        90.5
B             1000        92.7    95.6                7.5        92.8
Note: The coverage columns give the coverage rates of the robust S-set and the Wald confidence sets centred around the CU GMM point estimates. The rejection rate column gives the rejection rate of the proposed test of a null of identification, L, which compares two-step Wald and S-set volumes. The ‘pretest coverage’ column reports the effective coverage rate of the confidence set that is either the CU-Wald or S-set depending on the result of the proposed identification test. For each experiment, 1000 replications were conducted.
I simulated data from the consumption CAPM with a discount factor, δ, of 0.97 and a coefficient of risk aversion, γ, of 1.3, following the approach of Tauchen and Hussey (1991).8 I then consider GMM estimation of the parameters δ and γ, using both stock and bond returns as the test asset returns, and using as instruments either instrument set A (a constant, one lag of stock and bond returns, and one lag of consumption growth) or instrument set B (a constant and one lag of consumption growth).9 These instrument sets were used by Hansen et al. (1996). The sample sizes considered are 50, 100, 250 and 1000.

For these simulations, I report the coverage rate of two confidence sets for θ = (δ, γ)′: a Wald set centred around the CU GMM estimator, and the S-set. I also report the rejection rate of the proposed test, L, comparing the volumes of confidence sets for the whole vector θ. Lastly, I report the coverage rate of the confidence set for θ that is the S-set if L rejects and the Wald set otherwise. Computing the numerator of the proposed test statistic, W_1, is harder than in the linear IV model but can be done by grid search.10

The results are given in Table 4. In sample sizes of 50 and 100, the effective coverage rates of the Wald confidence sets are far below the nominal level for both instrument sets. The effective coverage rate of the Wald confidence sets rises with the sample size. The L test has rejection rates over 95% in sample sizes of 50 and 100, but its rejection rate falls to below 20% in the sample size of 1000, when the Wald confidence set fares quite well. The confidence set that is the S-set if L rejects and the Wald set otherwise yields coverage of at least 85% in a sample size of 50, and at least 90% with the larger sample sizes.

8 This involves fitting a 16-state Markov chain to consumption and stock-market dividend growth calibrated to approximate the VAR in Kocherlakota (1990).
Taking random draws from this Markov chain, numerical quadrature is then used to calculate the prices of a stock and a risk-free asset implied by the consumption CAPM. I am grateful to George Tauchen for his Gauss code for implementing this.
9 In GMM estimation of the consumption CAPM, I set \(V_T(\theta) = T^{-1}\sum_{t=1}^{T} \phi(Y_t, \theta)\phi(Y_t, \theta)'\).
10 The parameter space in each simulation is bounded between the wider of two possible limits: (i) δ between 0.5 and 1.5 and γ between −5 and 60, and (ii) the two-step estimates ±30 standard errors. Taking wider bounds could only ever increase W_1, and so make the test statistic L = W_1/W_2 more likely to reject.
Testing GMM asymptotics
5. CONCLUSION

In this paper, I have proposed a test of the null hypothesis that a GMM model is sufficiently well identified for conventional asymptotics to be reliable. It applies in any GMM model with more moment conditions than parameters. The test is conceptually simple, working by comparing the volume of confidence sets that are robust to underidentification with the volume of the non-robust Wald confidence set. In Monte Carlo simulations, I evaluated a pretesting strategy of using a Wald confidence set if the proposed test of identification accepts, and a fully robust confidence set that gives up on point estimation otherwise. This pretesting strategy has good overall coverage properties, and allows the researcher to use conventional point estimates and standard errors in circumstances where this would not be misleading.
ACKNOWLEDGMENTS

I am grateful to Rob Engle, Jon Faust, Jinyong Hahn, Eric Leeper, Nour Meddahi, Matt Pritsker, David Smith, Richard Smith, Jim Stock, and two anonymous referees for their helpful comments on earlier versions of this manuscript. All remaining errors are the sole responsibility of the author.
REFERENCES

Anderson, T. W. and H. Rubin (1949). Estimation of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics 20, 46–63.
Andrews, D. W. K. and V. Marmer (2008). Exactly distribution-free inference in instrumental variables regression with possibly weak instruments. Journal of Econometrics 142, 183–200.
Andrews, D. W. K., M. J. Moreira and J. H. Stock (2006). Optimal two-sided invariant similar tests for instrumental variables regression. Econometrica 74, 715–52.
Andrews, D. W. K. and J. H. Stock (2006). Inference with weak instruments. In R. Blundell, W. K. Newey and T. Persson (Eds.), Advances in Economics and Econometrics, Theory and Applications, 9th Congress of the Econometric Society, Volume 3, 122–73. Cambridge: Cambridge University Press.
Bekker, P. A. (1994). Alternative approximations to the distributions of instrumental variable estimators. Econometrica 62, 657–81.
Bound, J., D. A. Jaeger and R. Baker (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association 90, 443–50.
Donald, S. G. and W. K. Newey (2001). Choosing the number of instruments. Econometrica 69, 1161–91.
Dufour, J. M. (1997). Some impossibility theorems in econometrics with applications to structural and dynamic models. Econometrica 65, 1365–87.
Dufour, J. M. and M. Taamouti (2005). Projection-based statistical inference in linear structural models with possibly weak instruments. Econometrica 73, 1351–65.
Guggenberger, P. and R. J. Smith (2005). Generalized empirical likelihood estimators and tests under partial, weak and strong identification. Econometric Theory 21, 667–709.
Guggenberger, P. and R. J. Smith (2008). Generalized empirical likelihood tests in time series models with potential identification failure. Journal of Econometrics 142, 134–61.
Hahn, J. and J. A. Hausman (2002). A new specification test for the validity of instrumental variables. Econometrica 70, 163–89.
J. H. Wright
Hansen, C., J. A. Hausman and W. K. Newey (2008). Estimation with many instrumental variables. Journal of Business and Economic Statistics 26, 398–422.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–54.
Hansen, L. P., J. Heaton and A. Yaron (1996). Finite sample properties of some alternative GMM estimators. Journal of Business and Economic Statistics 14, 262–80.
Hausman, J. A. (1978). Specification tests in econometrics. Econometrica 46, 1251–71.
Hausman, J. A., J. H. Stock and M. Yogo (2005). Asymptotic properties of the Hahn–Hausman test for weak instruments. Economics Letters 89, 332–42.
Hsiao, C. (1983). Identification. In Z. Griliches and M. D. Intriligator (Eds.), Handbook of Econometrics, Volume 1, 223–83. Amsterdam: North-Holland.
Kleibergen, F. (2002). Pivotal statistics for testing structural parameters in instrumental variables regression. Econometrica 70, 1781–803.
Kleibergen, F. (2005). Testing parameters in GMM without assuming that they are identified. Econometrica 73, 1103–24.
Kleibergen, F. and S. Mavroeidis (2008). Inference on subsets of parameters in GMM without assuming identification. Working paper, Brown University.
Kocherlakota, N. (1990). On tests of representative consumer asset pricing models. Journal of Monetary Economics 26, 285–304.
Moreira, M. J. (2003). A conditional likelihood ratio test for structural models. Econometrica 71, 1027–48.
Moreira, M. J. (2009). Tests with correct size when instruments can be arbitrarily weak. Journal of Econometrics 152, 131–40.
Nelson, C. R. and R. Startz (1990a). The distribution of the instrumental variables estimator and its t-ratio when the instrument is a poor one. Journal of Business 63, S125–40.
Nelson, C. R. and R. Startz (1990b). Some further results on the exact small sample properties of the instrumental variable estimator. Econometrica 58, 967–76.
Newey, W. K. and F. Windmeijer (2009). Generalized method of moments with many weak moment conditions. Econometrica 77, 687–719.
Otsu, T. (2006). Generalized empirical likelihood inference for nonlinear and time series models under weak identification. Econometric Theory 22, 513–27.
Staiger, D. and J. H. Stock (1997). Instrumental variables regression with weak instruments. Econometrica 65, 557–86.
Stock, J. H. and J. H. Wright (2000). GMM with weak identification. Econometrica 68, 1055–96.
Stock, J. H. and M. Yogo (2005). Weak instruments in linear IV regression. In D. W. K. Andrews and J. H. Stock (Eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, 80–108. Cambridge: Cambridge University Press.
Tauchen, G. and R. Hussey (1991). Quadrature based methods for obtaining approximate solutions to nonlinear asset pricing models. Econometrica 59, 371–96.
Wright, J. H. (2003). Detecting lack of identification in GMM. Econometric Theory 19, 322–30.
Zivot, E., R. Startz and C. R. Nelson (1998). Valid confidence intervals and inference in the presence of weak instruments. International Economic Review 39, 1119–46.
APPENDIX: PROOFS OF RESULTS

Proof of Theorem 3.1: Under the null hypothesis, Assumptions 2.1–2.6 all hold. Using a second-order Taylor series expansion, the CU objective function can be decomposed into the sum of the Hansen J-test and a Wald-type test statistic as:
\[
S_{CU}(\theta) = S_{CU}(\hat{\theta}_{CU}) + (\theta - \hat{\theta}_{CU})' \frac{dS_{CU}(\hat{\theta}_{CU})}{d\theta} + \frac{1}{2}(\theta - \hat{\theta}_{CU})' \frac{d^2 S_{CU}(\theta^*)}{d\theta\,d\theta'} (\theta - \hat{\theta}_{CU})
= S_{CU}(\hat{\theta}_{CU}) + \frac{1}{2}(\theta - \hat{\theta}_{CU})' \frac{d^2 S_{CU}(\theta^*)}{d\theta\,d\theta'} (\theta - \hat{\theta}_{CU}),
\]
where \(\theta^*\) is on the line segment between \(\theta\) and \(\hat{\theta}_{CU}\). Hence \(W_1 = \sup_{\theta_{A,1},\theta_{A,2}} \|\theta_{A,1} - \theta_{A,2}\|\) such that
\[
\frac{1}{2}(\theta_i - \hat{\theta}_{CU})' \frac{d^2 S_{CU}(\theta^*)}{d\theta\,d\theta'} (\theta_i - \hat{\theta}_{CU}) \le F_{\chi^2}(k, \alpha) - S_{CU}(\hat{\theta}_{CU}) \quad \text{for } i = 1, 2.
\]
Therefore,
\[
W_1 = 2\sqrt{\frac{F_{\chi^2}(k, \alpha) - S_{CU}(\hat{\theta}_{CU})}{\mu/2}}\; 1[F_{\chi^2}(k, \alpha) - S_{CU}(\hat{\theta}_{CU}) \ge 0],
\]
where \(\mu\) is the smallest eigenvalue of \(H_{AA} - H_{AB} H_{BB}^{-1} H_{BA}\) and \(H = \frac{d^2 S_{CU}(\theta^*)}{d\theta\,d\theta'}\) is partitioned conformably with \(\theta\) as \(\begin{pmatrix} H_{AA} & H_{AB} \\ H_{BA} & H_{BB} \end{pmatrix}\). Under Assumptions 2.1–2.6, \(\frac{1}{T} H \to_P 2J\), and so \(\frac{1}{2T}\mu \to_P \lambda\), which is the smallest eigenvalue of \(J_{AA} - J_{AB} J_{BB}^{-1} J_{BA}\). Also, \(S_{CU}(\hat{\theta}_{CU}) \to_d \omega\), which is \(\chi^2(k - n)\) distributed (Hansen, 1982). Putting these together gives
\[
T^{1/2} W_1 \to_d 2\sqrt{\frac{F_{\chi^2}(k, \alpha) - \omega}{\lambda}}\; 1[F_{\chi^2}(k, \alpha) - \omega \ge 0].
\]
Meanwhile, \(W_2 = \frac{2}{\sqrt{2T}}\sqrt{\frac{2F_{\chi^2}(n_A, \alpha)}{\hat{\lambda}}}\), where \(\hat{\lambda}\) is the smallest eigenvalue of \(\hat{J}_{AA} - \hat{J}_{AB} \hat{J}_{BB}^{-1} \hat{J}_{BA}\), which is consistent for \(\lambda\), so that \(T^{1/2} W_2 \to_P 2\sqrt{F_{\chi^2}(n_A, \alpha)/\lambda}\). Combining these, using Slutsky's theorem, under Assumptions 2.1–2.6,
\[
L = \frac{T^{1/2} W_1}{T^{1/2} W_2} \to_d \sqrt{\frac{F_{\chi^2}(k, \alpha) - \omega}{F_{\chi^2}(n_A, \alpha)}}\; 1[F_{\chi^2}(k, \alpha) - \omega \ge 0],
\]
as required.
Proof of Theorem 3.2: The argument is adapted from Dufour (1997). If \(E(\phi(Y_t, \theta)) = 0\) uniformly in \(\theta\), then, from Assumptions 2.3 and 2.4, \(S_{CU}(\theta)\) has a marginal \(\chi^2(k)\) distribution for any \(\theta\). Take any pair \(\{\theta_1, \theta_2\}\). By the Bonferroni inequality, the probability that both satisfy \(S_{CU}(\theta_i) \le F_{\chi^2}(k, \alpha)\) is at least \(2\alpha - 1\). Therefore, the S-set for \(\theta\) (or any subvector of \(\theta\)) will be unbounded with at least this probability, and so the test will reject with probability at least \(2\alpha - 1\) in the limit as the sample size goes to infinity.
The Econometrics Journal (2010), volume 13, pp. 218–244. doi: 10.1111/j.1368-423X.2009.00307.x
Theory and inference for a Markov switching GARCH model

LUC BAUWENS†, ARIE PREMINGER‡ AND JEROEN V. K. ROMBOUTS†,§,¶

† Université catholique de Louvain, CORE, B-1348 Louvain-La-Neuve, Belgium. E-mail: [email protected]
‡ Department of Economics, University of Haifa, 31905, Israel. E-mail: [email protected]
§ HEC Montréal and CIRPEE, 3000 Cote Sainte Catherine, Montréal (Quebec), Canada, H3T 2A7. E-mail: [email protected]
¶ CIRANO, 2020, University Street, 25th Floor, Montréal, Quebec, Canada, H3A 2A5.
First version received: June 2008; final version accepted: November 2009
Summary We develop a Markov-switching GARCH model (MS-GARCH) wherein the conditional mean and variance switch in time from one GARCH process to another. The switching is governed by a hidden Markov chain. We provide sufficient conditions for geometric ergodicity and existence of moments of the process. Because of path dependence, maximum likelihood estimation is not feasible. By enlarging the parameter space to include the state variables, Bayesian estimation using a Gibbs sampling algorithm is feasible. We illustrate the model on S&P500 daily returns.

Keywords: Bayesian inference, GARCH, Markov-switching.
1. INTRODUCTION

The volatility of financial markets has been the object of numerous developments and applications over the past two decades, both theoretically and empirically. In this respect, the most widely used class of models is certainly that of GARCH models (see e.g. Bollerslev et al., 1994, and Giraitis et al., 2007, for a review of more recent developments). These models usually indicate a high persistence of the conditional variance (i.e. a nearly integrated GARCH process). Diebold (1986) and Lamoureux and Lastrapes (1990), among others, argue that the nearly integrated behaviour of the conditional variance may originate from structural changes in the variance process which are not accounted for by standard GARCH models. Furthermore, Mikosch and Starica (2004) and Hillebrand (2005) show that estimating a GARCH model on a sample displaying structural changes in the unconditional variance does indeed create an integrated GARCH effect. These findings clearly indicate a potential source of misspecification, to the extent that the form of the conditional variance is relatively inflexible and held fixed throughout the entire sample period. Hence the estimates of a GARCH model may suffer from a substantial upward bias in the persistence parameter. Therefore, models in which the parameters are allowed to change over time may be more appropriate for modelling volatility.
Indeed, several models based on the idea of regime changes have been proposed. Schwert (1989) considers a model in which returns can have a high or low variance, and switches between these states are determined by a two-state Markov process. Cai (1994) and Hamilton and Susmel (1994) introduce an ARCH model with Markov-switching parameters in order to take into account sudden changes in the level of the conditional variance. They use an ARCH specification instead of a GARCH to avoid the problem of path dependence of the conditional variance which renders the computation of the likelihood function infeasible. This occurs because the conditional variance at time t depends on the entire sequence of regimes up to time t due to the recursive nature of the GARCH process. Since the regimes are unobservable, one needs to integrate over all possible regime paths when computing the sample likelihood, but the number of possible paths grows exponentially with t, which renders ML estimation intractable. Gray (1996) presents a tractable Markov-switching GARCH model. In his model, the path dependence problem is removed by aggregating the conditional variances over all regimes at each time step in order to construct a single variance term. This term (conditional on the available information, but not on the regimes) is used to compute the conditional variances in the next time step. A modification of his model is suggested by Klaassen (2002); see also Dueker (1997), Bollen et al. (2000), Haas et al. (2004) and Marcucci (2005) for related papers. Stationarity conditions for some of these tractable models are given by Abramson and Cohen (2007). The objective of this paper is to develop both the probabilistic properties and the estimation of a Markov-switching GARCH (MS-GARCH) model that has a finite number of regimes in each of which the conditional mean is constant and the conditional variance takes the form of a GARCH(1,1) process. 
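To make the path dependence concrete, the following Python sketch (with illustrative parameter values, not taken from the paper) enumerates the conditional-variance sequence implied by every regime path: with n regimes and T observations there are n^T paths, which is why integrating the likelihood over all of them is intractable.

```python
import itertools

def all_path_variances(eps, regimes, sigma0_sq=1.0):
    """For each possible regime path (s_1, ..., s_T), run the GARCH(1,1) recursion
    sigma_t^2 = omega_s + alpha_s * eps_{t-1}^2 + beta_s * sigma_{t-1}^2.
    The number of variance sequences the likelihood must average over is
    len(regimes) ** T."""
    T = len(eps)
    out = {}
    for path in itertools.product(range(len(regimes)), repeat=T):
        sig2, seq = sigma0_sq, []
        for t, s in enumerate(path):
            omega, alpha, beta = regimes[s]
            eps_prev = eps[t - 1] if t > 0 else 0.0   # convention: eps_0 = 0
            sig2 = omega + alpha * eps_prev ** 2 + beta * sig2
            seq.append(sig2)
        out[path] = seq
    return out

# two hypothetical regimes, (omega, alpha, beta) each; already 2**10 paths at T = 10
regimes = [(0.1, 0.1, 0.8), (0.3, 0.2, 0.7)]
paths = all_path_variances([0.5] * 10, regimes)
print(len(paths))   # 2**10 = 1024
```

An ARCH specification breaks this dependence because sigma_t^2 then depends only on the current regime and observed past residuals, which is exactly why Cai (1994) and Hamilton and Susmel (1994) adopt it.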
Hence, in our model the conditional variance at each time depends on the whole regime path. This constitutes the main difference between our model and existing variants of Gray's (1996) model mentioned above. We provide sufficient conditions for the geometric ergodicity and the existence of moments of the proposed model. We find that for strict stationarity, it is not necessary that the stability condition of Nelson (1990) be satisfied in all the GARCH regimes, but it must be satisfied on average with respect to the unconditional probabilities of the regimes. Further, for covariance stationarity, the GARCH parameters in some regimes can be integrated or even explosive. A similar model was proposed by Francq and Zakoian (2005), who study conditions for second-order stationarity and existence of higher-order moments. Concerning the estimation method, we propose a Bayesian Markov chain Monte Carlo (MCMC) algorithm that circumvents the problem of path dependence by including the state variables in the parameter space and simulating them by Gibbs sampling. We illustrate by a repeated simulation experiment that the algorithm is able to recover the parameters of the data-generating process, and we apply the algorithm to a real data set. For the simpler MS-ARCH case, Francq et al. (2001) establish the consistency of the ML estimator. Douc et al. (2004) obtain its asymptotic normality for a class of autoregressive models with Markov regime, which includes the regime-switching ARCH model as a special case. Bayesian estimation of a Markov switching ARCH model where only the constant in the ARCH equation can switch, as in Cai (1994), has been studied and illustrated by Kaufman and Frühwirth-Schnatter (2002) and Kaufman and Scheicher (2006). Tsay (2005, pp. 588–94) proposed a Bayesian approach for a simple two-state Markov switching model with different risk premiums and different GARCH dynamics.
Das and Yoo (2004) and Gerlach and Tuyl (2006) propose an MCMC algorithm for the same model (switch in the constant only) but with a GARCH term and therefore tackle the path dependence problem, but only the last cited paper contains an application to real data. Finally, the most comparable work to our paper (for estimation) is that of Henneke et al. (2006) who
estimate by an MCMC algorithm a Markov-switching ARMA-GARCH model. They apply their algorithm to the data used by Hamilton and Susmel (1994). Non-Bayesian estimation of MS-GARCH models is studied by Francq and Zakoian (2005), who propose to estimate the model by the generalized method of moments. To illustrate our estimation method, we apply it to a time series of daily returns of the S&P500 index. We find that the MS-ARCH version of the model does not account satisfactorily for the persistence of the conditional variance, whereas the MS-GARCH version performs better. Moreover, the latter dominates the former in terms of a model choice criterion based on the BIC formula. The paper is organized as follows: in Section 2, we define our version of the MS-GARCH model and state sufficient conditions for geometric ergodicity and existence of moments. In Section 3, we explain how the model can be estimated in the Bayesian framework and provide a numerical example. In Section 4, we apply our approach to financial data. In Section 5, we conclude and discuss possible extensions. Proofs of the theorems stated in the paper are gathered in an appendix.
2. MARKOV-SWITCHING GARCH MODEL

The GARCH(1,1) model can be defined by
\[ y_t = \mu_t + \sigma_t u_t, \qquad (2.1) \]
\[ \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2, \qquad (2.2) \]
where \(\sigma_t\) and \(\mu_t\) are measurable functions of \(y_{t-\tau}\) for \(\tau \le t - 1\), \(\varepsilon_t = y_t - \mu_t\), and the error term \(u_t\) is i.i.d. with zero mean and unit variance. In order to ensure easily the positivity of the conditional variance we impose the restrictions \(\omega > 0\), \(\alpha \ge 0\) and \(\beta \ge 0\). For simplicity, we assume that \(\mu_t\) is constant. The sum \(\alpha + \beta\) measures the persistence of a shock to the conditional variance in equation (2.2). When a GARCH model is estimated using daily or higher frequency data, the estimate of this sum tends to be close to one, indicating that the volatility process is highly persistent and the second moment of the return process may not exist. However, it was argued that the high persistence may artificially result from regime shifts in the GARCH parameters over time; see Diebold (1986), Lamoureux and Lastrapes (1990) and Mikosch and Starica (2004), among others. This motivates our idea to estimate a Markov-switching GARCH (MS-GARCH) model that permits regime switching in the parameters. It is a generalization of the GARCH model and permits a different persistence in the conditional variance of each regime. Thus, the conditional variance in each regime accommodates volatility clustering, nesting the GARCH model as a special case. Let \(\{s_t\}\) be an ergodic homogeneous Markov chain on a finite set \(S = \{1, \ldots, n\}\), with transition matrix \(P\) defined by the probabilities \(\{\eta_{ij} = P(s_t = i | s_{t-1} = j)\}\) and invariant probability measure \(\pi = \{\pi_i\}\). We assume the chain is initiated at \(t = 0\), which implies that \(\{s_t\}_{t \ge 1}\) are independent by definition from \(\{u_t\}_{t \ge 1}\) since the transition probabilities are fixed over time. The MS-GARCH model is defined by
\[ y_t = \mu_{s_t} + \sigma_t u_t, \qquad (2.3) \]
\[ \sigma_t^2 = \omega_{s_t} + \alpha_{s_t} \varepsilon_{t-1}^2 + \beta_{s_t} \sigma_{t-1}^2, \qquad (2.4) \]
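A minimal simulation sketch of (2.3)–(2.4) with Gaussian errors may help fix ideas; the parameter values and the initialization of the variance at the regime-1 unconditional level are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_ms_garch(T, mu, omega, alpha, beta, P, seed=0):
    """Simulate the MS-GARCH model (2.3)-(2.4) with Gaussian errors u_t.
    P[i, j] = eta_ij = P(s_t = i | s_{t-1} = j); each column of P sums to one."""
    rng = np.random.default_rng(seed)
    n = len(mu)
    s = np.zeros(T, dtype=int)
    y = np.zeros(T)
    sig2 = omega[0] / (1.0 - alpha[0] - beta[0])   # assumed start: regime-1 unconditional variance
    eps = 0.0                                      # eps_0
    for t in range(T):
        if t > 0:
            s[t] = rng.choice(n, p=P[:, s[t - 1]])
        sig2 = omega[s[t]] + alpha[s[t]] * eps ** 2 + beta[s[t]] * sig2
        eps = np.sqrt(sig2) * rng.standard_normal()    # eps_t = sigma_t u_t
        y[t] = mu[s[t]] + eps
    return y, s

# two illustrative regimes: a low-volatility one and a more volatile, persistent one
P = np.array([[0.98, 0.05],
              [0.02, 0.95]])
y, s = simulate_ms_garch(1000, mu=[0.0, 0.0], omega=[0.1, 0.3],
                         alpha=[0.05, 0.15], beta=[0.85, 0.80], P=P)
```

Note that sigma_t^2 depends on the entire simulated regime path through the recursion, which is the path dependence discussed in the introduction.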
where \(\omega_{s_t} > 0\), \(\alpha_{s_t} \ge 0\), \(\beta_{s_t} \ge 0\) for \(s_t \in \{1, 2, \ldots, n\}\), and \(\varepsilon_t = y_t - \mu_{s_t}\). These assumptions on the GARCH coefficients entail that \(\sigma_t^2\) is almost surely strictly positive. In related papers, Yang (2000), Yao and Attali (2000), Yao (2001) and Francq and Zakoian (2002) derived conditions for the asymptotic stationarity of some AR and ARMA models with Markov-switching regimes. However, these results do not apply in our framework since, although for the GARCH model the squared residuals follow an ARMA model, the same does not hold for our model. Conditions for the weak stationarity and existence of moments of any order for the Markov-switching GARCH(p, q) model with zero means \(\mu_{s_t}\), in which the discrete Markov chain of the latent states is initiated from its stationary probabilities, have been derived by Francq and Zakoian (2005); see also Francq et al. (2001). The MS-GARCH process is not a Markov chain. However, the extended process \(Z_t = (y_t, \sigma_{t+1}^2, s_{t+1})\) is a Markov chain (see the Appendix). In what follows, we state mild regularity conditions under which this chain is geometrically ergodic, strictly stationary and has finite moments. These issues are not dealt with by Francq and Zakoian (2005). Further, geometric ergodicity implies not only that our model has a stationary solution, but also that, irrespective of the initial conditions, it asymptotically behaves as the stationary version. In this case, the asymptotic analysis can be extended to any solution of the model, and not only the stationary one. Our results are based on Markov chain theory; see e.g. Chan (1993) and Meyn and Tweedie (1993). We impose the following assumptions:

ASSUMPTION 2.1. The error term \(u_t\) is i.i.d. with a density function \(f(\cdot)\) that is positive and continuous everywhere on the real line and is centred on zero. Furthermore, \(E(|u_t^2|^\delta) < \infty\) for some \(\delta > 0\).

ASSUMPTION 2.2. \(\alpha_i > 0\), \(\beta_i > 0\), the Markov chain is homogeneous, and \(\eta_{ij} \in (0, 1)\) for all \(i, j \in \{1, \ldots, n\}\).

ASSUMPTION 2.3. \(\sum_{i=1}^n \pi_i E[\log(\alpha_i u_t^2 + \beta_i)] < 0\).

The first assumption is satisfied for a wide range of distributions for the error term, e.g. the normal and the Student distributions. For \(\delta \ge 1\), we set the variance to unity and, if \(\delta < 1\), the parameters of the conditional scaling factor of the data are estimated. The second assumption is slightly stronger than the non-negativity conditions of Bollerslev (1986) for the GARCH(1,1) model. Under this assumption all the regimes are accessible and the discrete Markov chain is ergodic. These assumptions are needed in order to establish the irreducibility and aperiodicity of the process. Assumption 2.3 implies that at least one of the regimes is stable. We assume, without loss of generality, throughout that in the first regime (\(s_t = 1\)) the process is strictly stationary, thus \(E \log(\alpha_1 u_t^2 + \beta_1) < 0\). To obtain the results in Theorem 2.1, we observe that it is not necessary that the strict stationarity requirement of Nelson (1990) be satisfied for all the GARCH regimes, but only on average with respect to the invariant probability distribution of the latent states.

THEOREM 2.1. Under Assumptions 2.1–2.3, \(Z_t\) is geometrically ergodic and, if it is initiated from its stationary distribution, the process is strictly stationary and \(\beta\)-mixing (absolutely regular) with exponential decay. Moreover, \(E(|y_t|^{2p}) < \infty\) for some \(p \in (0, \delta]\), where the expectations are taken under the stationary distribution.

The geometric ergodicity ensures not only that a unique stationary probability measure for the process exists, but also that the chain, irrespective of its initialization, converges to it at a geometric rate with respect to the total variation norm. Markov chains with this property satisfy conventional limit theorems such as the law of large numbers and the central limit theorem for any given starting value, given the existence of suitable moments; see Meyn and Tweedie
(1993, ch. 17) for details. Geometric ergodicity implies that if the process is initiated from its stationary distribution, it is regular mixing (see Doukhan, 1994, sec. 1.1, for the definition) with exponentially decaying mixing numbers. This implies that the autocovariance function of any measurable function of the data (if it exists) converges to zero at least at the same rate (e.g. the autocorrelation function of \(|y_t|^p\) decays at an exponential rate). Using geometric ergodicity, we can evaluate the stationary distribution numerically, since analytical results are hard to obtain for our model. For example, Moeanaddin and Tong (1990) propose a conditional density approach by exploiting the Chapman–Kolmogorov relation for non-linear models; see also Stachurski and Martin (2008). For the GARCH(1,1) model, the sum \(\alpha + \beta\) measures the persistence of a shock to the conditional variance. We note that (2.4) can be rewritten as
\[ \sigma_t^2 = \omega_{s_t} + \lambda_t \sigma_{t-1}^2 + v_t, \qquad (2.5) \]
where \(\lambda_t = \alpha_{s_t} + \beta_{s_t}\) and \(v_t = \alpha_{s_t}(\varepsilon_{t-1}^2 - \sigma_{t-1}^2)\) is a serially uncorrelated innovation. Let \(A_k = \prod_{j=1}^{k} \lambda_{t-j}\) for \(k \ge 1\) (\(A_0 \equiv 1\)) and \(\gamma = \sum_{i=1}^{n} \pi_i \log(\alpha_i + \beta_i)\). Assuming \(\gamma < 0\), by solving (2.5) recursively, we can express the conditional variance as
\[ \sigma_t^2 = \sum_{k=0}^{t-1} A_k (v_{t-k} + \omega_{s_{t-k}}) + A_t \sigma_0^2. \]
It can be shown that \(A_k = O(\exp(\bar{\gamma} k))\) for some \(\bar{\gamma} \in (\gamma, 0)\) with probability one.1 Thus, the impact of past shocks on the variances declines geometrically at a rate which is bounded by \(\gamma\). Hence, \(\gamma\) can serve as a bound on the measure of volatility persistence of our model. As \(\gamma\) approaches zero, the persistence of the volatility shocks increases. For \(\gamma = 0\), the impact of shocks on the variances does not decay over time. By Jensen's inequality and the strict concavity of \(\log(x)\), it is clear that if \(\gamma \le 0\) then Assumption 2.3 is satisfied. Next, we illustrate that Assumption 2.3 allows explosive regimes while the global process is stable. We consider an MS-GARCH model with two regimes. For the first regime we choose an integrated GARCH(1,1) process with realistic values \(\alpha_1 = 0.1\) and \(\beta_1 = 0.9\). We note that this process is strictly stationary, but not covariance stationary. Let \(\pi = \eta_{12}/(\eta_{12} + \eta_{21})\) be the ergodic probability of the stable regime, and let \((\alpha_2, \beta_2)\) be the parameters of the second regime. Further, define \(F(\alpha_2, \beta_2, \pi) = \pi E[\log(0.1 u_t^2 + 0.9)] + (1 - \pi) E[\log(\alpha_2 u_t^2 + \beta_2)]\). In Figure 1, we show the strict stationarity frontiers \(F(\alpha_2, \beta_2, \pi) = 0\), which have been evaluated by simulations in the cases \(u_t \sim N(0, 1)\) and \(\pi = 0, 0.5, 0.8\). The strict stationarity regions are the areas below these curves and above the axes (notice that on the graph \(\beta_2 \ge 0.80\)). We note that when \(\pi = 0\) the model has one regime which is strictly stationary and the computed values satisfy the stability condition of Nelson (1990). However, for \(\pi > 0\), the parameters of the second regime imply that it can be explosive. That is, under the non-stable regime the conditional volatility diverges. Further, the higher the probability of being in the stable regime, the higher the values that the persistence parameters of the second regime can assume.
Therefore, we observe that our model allows periods in which explosive regimes are operating, giving the impression of structural instability in the conditional volatility, before the process collapses to its stationary level.

1 Note that \(\{s_t\}\) is an ergodic Markov chain; hence, for any initial state, \(\frac{1}{k}\sum_{j=1}^{k} \log(\alpha_{s_j} + \beta_{s_j}) \to \gamma\) a.s. Hence, similarly to Nelson (1990, Theorem 2), we can show that there exists a \(\bar{\gamma} \in (\gamma, 0)\) such that \(A_k = O(\exp(\bar{\gamma} k))\) with probability one.
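The frontier in Figure 1 can be traced numerically by the kind of simulation just described. A minimal Python sketch (function and variable names are ours, and the Monte Carlo sample size is an illustrative choice):

```python
import numpy as np

def F(alpha2, beta2, pi, n_draws=200_000, seed=1):
    """Monte Carlo evaluation of
    F(alpha2, beta2, pi) = pi*E[log(0.1 u^2 + 0.9)] + (1 - pi)*E[log(alpha2 u^2 + beta2)]
    with u ~ N(0, 1). F < 0 indicates strict stationarity of the two-regime process."""
    rng = np.random.default_rng(seed)
    u2 = rng.standard_normal(n_draws) ** 2
    return (pi * np.mean(np.log(0.1 * u2 + 0.9))
            + (1.0 - pi) * np.mean(np.log(alpha2 * u2 + beta2)))

# evaluate one candidate point for the second regime; a sign change in beta2
# (holding alpha2 and pi fixed) locates the frontier F = 0 of Figure 1
val = F(alpha2=0.1, beta2=0.95, pi=0.5)
```

Scanning a grid of (alpha2, beta2) values and recording where F changes sign reproduces the shape of the frontiers in the figure.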
[Figure 1 appears here: strict stationarity frontiers in the (α2, β2) plane, with α2 from 0 to 0.3 on the horizontal axis and β2 from 0.85 to 1.05 on the vertical axis, and curves for π = 0, π = 0.5 and π = 0.8.]

Figure 1. Strict stationarity region for a two-state MS-GARCH process.
In order to establish the existence of higher-order moments, we define the \(n \times n\) matrix
\[
\Omega = \begin{pmatrix}
E(\alpha_1 u_t^2 + \beta_1)^k \eta_{11} & \cdots & E(\alpha_n u_t^2 + \beta_n)^k \eta_{n1} \\
\vdots & \ddots & \vdots \\
E(\alpha_1 u_t^2 + \beta_1)^k \eta_{1n} & \cdots & E(\alpha_n u_t^2 + \beta_n)^k \eta_{nn}
\end{pmatrix}.
\]
A similar matrix was first introduced by Yao and Attali (2000) for non-linear autoregressive models with Markov switching. Let \(\rho(\cdot)\) denote the spectral radius of a matrix, i.e. its largest eigenvalue in modulus. Then, we impose the following conditions:

ASSUMPTION 2.4. \(E(|u_t^2|^k) < \infty\) for some integer \(k \ge 1\).

ASSUMPTION 2.5. \(\rho(\Omega) < 1\).

Assumption 2.5 is similar to the stability condition imposed by Francq and Zakoian (2005) to establish the existence of a second-order stationary solution for the MS-GARCH model. However, in our set-up this condition induces not only stationarity but also geometric ergodicity.

THEOREM 2.2. Under Assumptions 2.1–2.2 and 2.4–2.5, the process is geometrically ergodic and \(E(|y_t^2|^k) < \infty\) for some integer \(k \ge 1\), where the expectations are taken under the stationary distribution.

The spectral radius condition used in Theorem 2.2 is simple to check in the leading case where \(k = 1\). Let \(d_i = \alpha_i + \beta_i\). If \(d_i < 1\) for all \(i \in \{1, 2, \ldots, n\}\), Assumption 2.5 is satisfied for this case, since \(\eta_{ij} \in (0, 1)\) (see Lütkepohl, 1996, p. 141, 4(b)), and the resulting MS-GARCH process is covariance stationary. However, it is not necessary that all the GARCH processes of each regime be covariance stationary. To illustrate this, we plot in Figure 2 the boundary curve \(\rho(\Omega) = 1\) for \(n = 2\), where \(\eta_{11} = \eta_{22} = 0.85\). The covariance stationarity region is the interior intersection of the boundary curve and the two axes. We observe that one of the GARCH regimes does not need to be weakly stationary and can even be mildly explosive, provided that the other regime is sufficiently stable.

[Figure 2 appears here: the boundary curve ρ(Ω) = 1 in the (d1, d2) plane, with d1 from 0.50 to 1.10 and d2 from 0.80 to 1.15.]

Figure 2. Covariance-stationarity region for two-state MS-GARCH.

As a special case, we consider a situation where we start the discrete ergodic Markov chain from its invariant measure \(\pi\). In this case \(\pi = P\pi\) and our model is equivalent to a regime switching GARCH model where the probabilities of the regimes are constant over time; see Bauwens et al. (2006). Under Assumptions 2.1–2.2, it can be shown that a sufficient condition for geometric ergodicity and existence of moments of order \(k\) is given by \(\sum_{i=1}^n \pi_i E(\alpha_i u_t^2 + \beta_i)^k < 1\). We observe that the condition derived by Bollerslev (1986) for covariance stationarity under a single GARCH model need not hold in each regime but only for the weighted average of the GARCH parameters. Note that high values of the parameters of the non-stable GARCH processes must be compensated by low probabilities for their regimes. The autocorrelation function of \(\varepsilon_t^2\) exists if Assumptions 2.4 and 2.5 are satisfied for \(k = 2\), and the geometric ergodicity implies that it decreases exponentially. So, our model allows for short memory in the squared residuals. Now, let \(\psi\) be the second largest eigenvalue of \(P\), which is assumed to be homogeneous in our setting. Kramer (2008) shows that if \(\rho(\Omega) < 1\) and \(P\) is non-homogeneous and depends on the sample size \(T\) such that \(1 - \psi = O(T^{-2d})\) for some \(d > 0\), then \(\mathrm{Var}(\sum_{t=1}^T \varepsilon_t^2) = O(T^{2d+1})\). This implies that the process has long memory in the squared residuals; see also Diebold and Inoue (2001).
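For \(k = 1\) and \(u_t \sim N(0, 1)\), \(E(\alpha_i u_t^2 + \beta_i) = \alpha_i + \beta_i = d_i\), so Assumption 2.5 can be checked directly. A short Python sketch (the orientation of the matrix entries follows our reconstruction above; with the symmetric transition matrix used for Figure 2 the orientation does not affect the eigenvalues):

```python
import numpy as np

def spectral_radius_k1(d, eta):
    """rho(Omega) for k = 1: entry (i, j) of Omega is E(alpha_j u^2 + beta_j) * eta_ji
    = d_j * eta_ji, where eta[a][b] = P(s_t = a | s_{t-1} = b)."""
    n = len(d)
    omega = np.array([[d[j] * eta[j][i] for j in range(n)] for i in range(n)])
    return float(max(abs(np.linalg.eigvals(omega))))

eta = [[0.85, 0.15],
       [0.15, 0.85]]                           # eta_11 = eta_22 = 0.85, as in Figure 2
rho = spectral_radius_k1([0.90, 1.05], eta)    # regime 2 mildly explosive (d_2 > 1)
print(rho < 1.0)                               # True: covariance stationary nonetheless
```

This reproduces the qualitative message of Figure 2: a mildly explosive regime (d_2 = 1.05) is compatible with covariance stationarity when the other regime is sufficiently stable.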
3. ESTIMATION

Given current computing capabilities, the estimation of switching GARCH models by the maximum likelihood method is infeasible, since the conditional variance depends on the whole past history of the state variables. We tackle the estimation problem by Bayesian inference, which allows us to treat the latent state variables as parameters of the model and to construct the likelihood function assuming we know the states. This technique is called data augmentation; see Tanner and Wong (1987) for the basic principle and more details. It has been applied for Bayesian inference on stochastic volatility models, as initially proposed by Jacquier et al. (1994); see also Bauwens and Rombouts (2004) for a survey including other references. In Section 3.1, we present the Gibbs sampling algorithm for the case of two regimes; in Section 3.2, we discuss the relation between identification and the prior; in Section 3.3, we discuss possible extensions of the algorithm and related issues; and in Section 3.4, we perform a Monte Carlo study using a realistic data-generating process.

3.1. Bayesian inference

We explain the Gibbs sampling algorithm for an MS-GARCH model with two regimes and normality of the error term \(u_t\). The normality assumption is a natural starting point. A more flexible distribution, such as the Student distribution, could be considered, although one may be sceptical that this is needed, since Gray (1996) reports large and imprecise estimates of the degrees-of-freedom parameters. For the case of two regimes, the model is given by equations (2.3)–(2.4), with \(s_t = 1\) or 2 indicating the active regime. We denote by \(Y_t\) the vector \((y_1\ y_2\ \ldots\ y_t)\) and likewise \(S_t = (s_1\ s_2\ \ldots\ s_t)\). The model parameters consist of \(\eta = (\eta_{11}, \eta_{21}, \eta_{12}, \eta_{22})'\), \(\mu = (\mu_1, \mu_2)'\) and \(\theta = (\theta_1', \theta_2')'\), where \(\theta_k = (\omega_k, \alpha_k, \beta_k)'\) for \(k = 1, 2\).
The joint density of yt and st, given the past information and the parameters, can be factorised as

$f(y_t, s_t \mid \mu, \theta, \eta, Y_{t-1}, S_{t-1}) = f(y_t \mid s_t, \mu, \theta, Y_{t-1}, S_{t-1})\, f(s_t \mid \eta, Y_{t-1}, S_{t-1}).$  (3.1)

The conditional density of yt is the Gaussian density

$f(y_t \mid s_t, \mu, \theta, Y_{t-1}, S_{t-1}) = \frac{1}{\sqrt{2\pi\sigma_t^2}} \exp\left(-\frac{(y_t - \mu_{s_t})^2}{2\sigma_t^2}\right),$  (3.2)

where σt², defined by equation (2.4), is a function of θ. The marginal density (or probability mass function) of st is specified by

$f(s_t \mid \eta, Y_{t-1}, S_{t-1}) = f(s_t \mid \eta, s_{t-1}) = \eta_{s_t s_{t-1}},$  (3.3)

with η11 + η21 = 1, η12 + η22 = 1, 0 < η11 < 1 and 0 < η22 < 1. This specification says that st depends only on the last state, and not on the previous ones or on the past observations of yt, so that the state process is a first-order Markov chain with no absorbing state. The joint density of y = (y1, y2, ..., yT) and S = (s1, s2, ..., sT) given the parameters is then obtained by taking the product of the densities in (3.2) and (3.3) over all observations:

$f(y, S \mid \mu, \theta, \eta) \propto \prod_{t=1}^{T} \sigma_t^{-1} \exp\left(-\frac{(y_t - \mu_{s_t})^2}{2\sigma_t^2}\right) \eta_{s_t s_{t-1}}.$  (3.4)
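For a given state path, the logarithm of the joint density (3.4) can be evaluated recursively. The sketch below assumes the standard form of the variance recursion (2.4) (not reproduced above), σt² = ω_{st} + α_{st} ε²_{t−1} + β_{st} σ²_{t−1}, with states coded 0/1 and an arbitrary initial variance; these coding choices are ours.

```python
import numpy as np

# Log of the joint density (3.4) of y and S for a given state path, assuming the
# variance recursion sigma_t^2 = omega_{s_t} + alpha_{s_t} e_{t-1}^2 + beta_{s_t} sigma_{t-1}^2
# (our assumed form of equation (2.4)). eta[i, j] = P(s_t = i | s_{t-1} = j).
def log_joint(y, s, mu, omega, alpha, beta, eta, sig2_init=1.0):
    ll, sig2, e_prev = 0.0, sig2_init, 0.0
    for t in range(len(y)):
        sig2 = omega[s[t]] + alpha[s[t]] * e_prev**2 + beta[s[t]] * sig2
        e_prev = y[t] - mu[s[t]]
        ll += -0.5 * np.log(2 * np.pi * sig2) - e_prev**2 / (2 * sig2)   # Gaussian term (3.2)
        if t > 0:
            ll += np.log(eta[s[t], s[t - 1]])                            # transition term (3.3)
    return ll

# Tiny illustration with the Table 2 parameter values.
ll = log_joint(np.array([0.1, -0.2, 0.3]), np.array([0, 1, 0]),
               mu=np.array([0.06, -0.09]), omega=np.array([0.30, 2.00]),
               alpha=np.array([0.35, 0.10]), beta=np.array([0.20, 0.60]),
               eta=np.array([[0.98, 0.04], [0.02, 0.96]]))
```

Summing the exponential of this quantity over all 2^T state paths would give the likelihood of y; this is exactly the summation that is numerically too demanding and motivates data augmentation.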
L. Bauwens, A. Preminger and J. V. K. Rombouts
Since integrating this function with respect to S by summing over all paths of the state variables is numerically too demanding, we implement a Gibbs sampling algorithm. The purpose of this algorithm is to simulate draws from the posterior density. These draws then serve to estimate features of the posterior distribution, such as means, standard deviations and marginal densities. As the posterior density is not standard, we cannot sample from it directly. Gibbs sampling is an iterative procedure to sample sequentially from the posterior distribution; see Gelfand and Smith (1990). Each iteration in the Gibbs sampler produces a draw from a Markov chain, implying that the draws are dependent. Under regularity conditions (see e.g. Robert and Casella, 2004), the simulated distribution converges to the posterior distribution. Thus a warm-up phase is needed: once a sufficiently large number of draws has been generated, the draws generated after the warm-up phase can be considered as draws from the posterior distribution.

The Markov chain is generated by drawing iteratively from lower-dimensional distributions of the joint posterior, called blocks or full conditional distributions. These full conditional distributions are easier to sample from, because either they are known in closed form or they can be simulated by a lower-dimensional auxiliary sampler. For the MS-GARCH model, the blocks of parameters are (θ, μ), η and the elements of S. We present below a sketch of the Gibbs algorithm, followed by a detailed presentation of the corresponding full conditional densities in three subsections, where we also state our prior densities for θ, μ and η.

Sketch of Gibbs sampling algorithm: The superscript r on a parameter denotes a draw of the parameter at the rth iteration of the algorithm. For iteration 1, initial values η0, θ0, μ0 and st0 for t = 1, 2, ..., T must be used.
One iteration of the algorithm involves three steps:

(1) Sample sequentially each state variable $s_t^r$ for t = 1, 2, ..., T, given $\eta^{r-1}$, $\mu^{r-1}$, $\theta^{r-1}$, $s_{t-1}^{r}$ and $s_{t+1}^{r-1}$: see Subsection 3.1.1.
(2) Sample the transition probabilities $\eta^r$ given $s_t^r$ for t = 1, 2, ..., T: see Subsection 3.1.2.
(3) Sample $(\theta^r, \mu^r)$ given $s_t^r$ for t = 1, 2, ..., T, and $\eta^r$: see Subsection 3.1.3.

These three steps are repeated until convergence of the Markov chain is achieved, which can be evaluated by convergence diagnostics. The warm-up draws are discarded, and the steps are iterated a large number of times to generate draws from which the desired features (means, variances, quantiles, etc.) of the posterior distribution can be estimated consistently.

3.1.1. Sampling st. To sample st we must condition on st−1 and st+1 (because of the Markov chain for the states) and on the future state variables (st+1, st+2, ..., sT) (because of the path dependence of the conditional variances). The full conditional mass function of state t is

$\varphi(s_t \mid S_{\neq t}, \mu, \theta, \eta, y) \propto \eta_{1,s_{t-1}}^{2-s_t}\, \eta_{2,s_{t-1}}^{s_t-1}\, \eta_{1,s_t}^{2-s_{t+1}}\, \eta_{2,s_t}^{s_{t+1}-1} \prod_{j=t}^{T} \sigma_j^{-1} \exp\left(-\frac{(y_j - \mu_{s_j})^2}{2\sigma_j^2}\right),$  (3.5)

where we can replace η2,st−1 by 1 − η1,st−1 and η2,st by 1 − η1,st. Since st takes two values (1 or 2), we compute the expression above for each of these values, and divide each evaluation by the sum of the two to get the normalized discrete distribution of st from which to sample. Sampling from such a distribution once the probabilities are known is like sampling from a Bernoulli distribution.
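Step (1) can be sketched as follows. Because of path dependence, the variances σj² for j ≥ t must be recomputed under each candidate value of st; for simplicity this sketch restarts the recursion from j = 0 (in practice one caches σ²_{t−1}). The variance recursion is our assumed form of equation (2.4), states are coded 0/1, and the helper names are ours.

```python
import numpy as np

# Draw s_t from its full conditional (3.5), two-regime case, states coded 0/1.
# eta[i, j] = P(s_t = i | s_{t-1} = j). Assumed variance recursion:
# sigma_j^2 = omega_{s_j} + alpha_{s_j} e_{j-1}^2 + beta_{s_j} sigma_{j-1}^2.
def draw_state(t, s, y, mu, omega, alpha, beta, eta, rng, sig2_init=1.0):
    logk = np.empty(2)
    for cand in (0, 1):                       # candidate values of s_t
        s_try = s.copy()
        s_try[t] = cand
        lk = 0.0
        if t > 0:
            lk += np.log(eta[cand, s_try[t - 1]])        # eta_{s_t, s_{t-1}}
        if t < len(y) - 1:
            lk += np.log(eta[s_try[t + 1], cand])        # eta_{s_{t+1}, s_t}
        sig2, e_prev = sig2_init, 0.0
        for j in range(len(y)):                          # likelihood terms for j >= t
            sig2 = omega[s_try[j]] + alpha[s_try[j]] * e_prev**2 + beta[s_try[j]] * sig2
            e_prev = y[j] - mu[s_try[j]]
            if j >= t:
                lk += -0.5 * np.log(sig2) - e_prev**2 / (2 * sig2)
        logk[cand] = lk
    p = np.exp(logk - logk.max())
    p /= p.sum()                              # normalized discrete distribution of s_t
    return int(rng.random() < p[1])           # Bernoulli draw

# Tiny illustration with the Table 2 parameter values.
rng = np.random.default_rng(0)
mu, omega = np.array([0.06, -0.09]), np.array([0.30, 2.00])
alpha, beta = np.array([0.35, 0.10]), np.array([0.20, 0.60])
eta = np.array([[0.98, 0.04], [0.02, 0.96]])
y = np.array([0.1, -0.2, 0.3, 0.0])
s = np.array([0, 1, 0, 0])
s[1] = draw_state(1, s, y, mu, omega, alpha, beta, eta, rng)
```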
3.1.2. Sampling η. Given a prior density π(η),

$\varphi(\eta \mid S, \mu, \theta, y) \propto \pi(\eta) \prod_{t=1}^{T} \eta_{s_t s_{t-1}},$  (3.6)

which does not depend on μ, θ and y. For simplicity, we can work with η11 and η22 as free parameters and assign to each of them a beta prior density on (0, 1). The posterior densities are then also independent beta densities. For example,

$\varphi(\eta_{11} \mid S) \propto \eta_{11}^{a_{11}+n_{11}-1} (1 - \eta_{11})^{a_{21}+n_{21}-1},$  (3.7)

where a11 and a21 are the parameters of the beta prior, n11 is the number of times that st = st−1 = 1, and n21 is the number of times that st = 2 and st−1 = 1. A uniform prior on (0, 1) corresponds to a11 = a21 = 1 and is what we use in our simulations and applications below.

3.1.3. Sampling θ and μ.
Given a prior density π(θ, μ),

$\varphi(\theta, \mu \mid S, \eta, y) \propto \pi(\theta, \mu) \prod_{t=1}^{T} \sigma_t^{-1} \exp\left(-\frac{(y_t - \mu_{s_t})^2}{2\sigma_t^2}\right),$  (3.8)
which does not depend on η. We can sample (θ, μ) either by a griddy-Gibbs step or by a Metropolis step. The griddy-Gibbs step amounts to sampling by numerical inversion each scalar element of the vector (θ, μ) from its univariate full conditional distribution. Numerical tabulation of the cdf is needed because no full conditional distribution belongs to a known family (such as the Gaussian). Thus, this step of the Gibbs algorithm that cycles between the states, η and the remaining parameters can be described as follows at iteration r + 1, given draws at iteration r denoted by the superscript (r) attached to the parameters:

(1) Using (3.8), compute $\kappa(\omega_1 \mid S^{(r)}, \beta_1^{(r)}, \alpha_1^{(r)}, \theta_2^{(r)}, \mu^{(r)}, y)$, the kernel of the conditional posterior density of ω1 given the values of S, β1, α1, θ2 and μ sampled at iteration r, over a grid (ω11, ω12, ..., ω1G), to obtain the vector Gκ = (κ1, κ2, ..., κG).

(2) By a deterministic integration rule using M points, compute Gf = (0, f2, ..., fG), where

$f_i = \int_{\omega_{11}}^{\omega_{1i}} \kappa\left(\omega_1 \mid S^{(r)}, \beta_1^{(r)}, \alpha_1^{(r)}, \theta_2^{(r)}, \mu^{(r)}, y\right) d\omega_1, \quad i = 2, \ldots, G.$  (3.9)

(3) Generate u ∼ U(0, fG) and invert $f(\omega_1 \mid S^{(r)}, \beta_1^{(r)}, \alpha_1^{(r)}, \theta_2^{(r)}, \mu^{(r)}, y)$ by numerical interpolation to get a draw $\omega_1^{(r+1)} \sim \varphi(\omega_1 \mid S^{(r)}, \beta_1^{(r)}, \alpha_1^{(r)}, \theta_2^{(r)}, \mu^{(r)}, y)$.

(4) Repeat steps 1–3 for $\varphi(\beta_1 \mid S^{(r)}, \omega_1^{(r+1)}, \alpha_1^{(r)}, \theta_2^{(r)}, \mu^{(r)}, y)$, $\varphi(\alpha_1 \mid S^{(r)}, \omega_1^{(r+1)}, \beta_1^{(r+1)}, \theta_2^{(r)}, \mu^{(r)}, y)$, $\varphi(\omega_2 \mid S^{(r)}, \beta_2^{(r)}, \alpha_2^{(r)}, \theta_1^{(r+1)}, \mu^{(r)}, y)$, etc.

Note that intervals of values for the elements of θ and μ must be defined. The choice of these bounds (such as ω11 and ω1G) needs to be fine-tuned in order to cover the range of the parameter over which the posterior is relevant. Over these intervals, the prior can be chosen as we wish, and in practice we choose independent uniform densities for all elements of (θ, μ).

For the Metropolis version of this block of the Gibbs algorithm, we construct a multivariate Gaussian proposal density for (θ, μ). Its mean and variance–covariance matrix are renewed at each iteration of the Gibbs algorithm in order to account for the updating of the states sampled as described in Subsection 3.1.1. Thus, after updating the states, we maximize the likelihood of
y given the sampled states, defined by the product of (3.2) over all observations. The mean of the Gaussian proposal is set equal to the ML estimate, and the variance–covariance matrix to minus one times the inverse Hessian evaluated at the ML estimate.

3.2. Identification and prior density

In Markov-switching models similar to the model of this paper, one must use some identification restrictions to avoid the label switching issue, which arises when the states and the parameters can be permuted without changing the posterior distribution. There are different ways to avoid label switching; see Hamilton et al. (2007) for a discussion. In an ML set-up, an identification restriction could be, for example, that the mean of regime 1 (μ1) is larger than the mean of regime 2 (μ2), or the same restriction on the variance level (ω) or another parameter of one GARCH component relative to the other. These restrictions aim at avoiding confusion between the regimes. They are exact restrictions, holding with probability 1 if we reason in the Bayesian paradigm. In Bayesian inference, however, the restrictions need not be imposed with probability 1: they can be imposed less stringently through the prior density. For example, coming back to the restriction on the means, the prior for μ1 can be uniform on (−0.03, +0.09) and independent of the prior for μ2, taken as uniform on (−0.09, +0.03). These priors overlap, but not too much, allowing a sufficiently clear a priori separation of the two regimes and avoiding label switching in the MCMC posterior simulation. Thus the regimes must be sufficiently separated to be identified; that is, some parameters must differ between regimes. Our approach is to use prior supports for the corresponding parameters of the two regimes that are partially different.
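The overlapping-uniform-prior device for the regime means can be sketched as follows, using the supports quoted above, μ1 ∼ U(−0.03, +0.09) and μ2 ∼ U(−0.09, +0.03); the function name is ours. Within the Gibbs sampler these supports act as hard bounds: a draw outside them has zero prior density, which is what discourages label switching.

```python
import numpy as np

# Joint log prior for the regime means under independent overlapping uniform
# priors: mu_1 ~ U(-0.03, 0.09), mu_2 ~ U(-0.09, 0.03) (supports from the text).
def log_prior_mu(mu1, mu2, b1=(-0.03, 0.09), b2=(-0.09, 0.03)):
    if b1[0] <= mu1 <= b1[1] and b2[0] <= mu2 <= b2[1]:
        return -np.log(b1[1] - b1[0]) - np.log(b2[1] - b2[0])
    return -np.inf          # outside the supports: excluded a priori

print(log_prior_mu(0.05, -0.05))   # inside both supports: finite log density
print(log_prior_mu(-0.05, 0.05))   # labels swapped: -inf, i.e. ruled out
```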
When choosing these prior supports, we must be careful to avoid two problems: truncating the posterior density (if we use supports that are too narrow compared to the location of the likelihood), or computing the posterior over too wide a region of the parameter space (if we use prior supports that are too wide). In the former case, the prior density will strongly distort the sample information and the posterior results will not be interesting. In the latter case, the posterior computations will be inefficient, because the prior will be multiplied by a quasi-null likelihood value on a large portion of the parameter space. Thus our approach is to start with wide enough prior intervals, taking care of the need to separate the regimes, and then to narrow these intervals if possible or to widen them if needed, such that finally the posterior is not significantly truncated and label switching is avoided. If ML estimation were feasible, the analogous procedure would be to impose some identification constraints exactly and to choose the starting values for the maximization in a portion of the parameter space where the likelihood value is not very small.

3.3. Extensions

Although we presented the Gibbs sampling algorithm for the case of two states, some extensions are possible without changing the nature of the algorithm, although they will increase the computation time. A first extension consists of allowing the mean of the process to switch between two ARMA processes rather than between constant means. Similarly, one can consider GARCH(p, q) specifications for the regimes with p and q larger than 1. Such extensions are dealt with by redefining the θ and μ parameter vectors and adapting directly the procedures described in Subsection 3.1.3 to account for the additional parameters. Henneke et al. (2006) describe an MCMC algorithm slightly different from ours for this more general model, considering also the case of a Student distribution for the innovation.
A second extension consists of considering more than two regimes. Again, the algorithm described for two regimes can in principle be extended. The states must then be sampled from a discrete distribution with three points of support, and the η parameters from Dirichlet distributions, which generalize beta distributions and can be simulated easily. Finally, the nature of the third block does not change, and it can be sampled as for two regimes. In practice, an increase in the number of regimes increases the number of parameters, which is especially costly in computing time for the third block (θ and μ), whether one uses a griddy-Gibbs or a Metropolis step. A related difficulty lies in the identification of the regimes; see the discussion in the previous subsection. Specifying the prior density for more than two regimes requires more search, and the algorithm will obviously be more complicated in this case. We leave the detailed study of this type of question for further research.

Another issue that is not yet solved is the computation of the marginal likelihood of the MS-GARCH model, as recognized in the literature; see e.g. Maheu and He (2009). Because of the combination of the GARCH (rather than ARCH) structure and the latent variable structure, we could not find a method to compute the marginal likelihood. Chib (1996) has developed an algorithm for Markov mixture models. Kaufmann and Frühwirth-Schnatter (2002) use this algorithm for an MS-ARCH model, and mention that it cannot be extended to the MS-GARCH case because of the path dependence problem: equation (3) in Chib (1996) does not hold in the case of path dependence. It would be quite a valuable contribution to solve this issue, since it would allow us to do model selection in a fully Bayesian way. Short of this, we resort to the following procedure: once the Gibbs sampler has been applied, we compute the posterior means of the state variables, obtained by averaging the Gibbs draws of the states.
These means are smoothed (posterior) probabilities of the states; a mean state close to 1 corresponds to a high probability of being in the second regime. With these means, we can assign each observation to one regime, by attributing an observation to the regime for which the state has the highest posterior probability. Once the data are classified in this way, we can easily compute the Bayesian information criterion (BIC) using the likelihood function defined by the product over all observations of the contributions given by the densities in (3.2). Given the values of the state variables, it is easy to evaluate this function (in logarithm) at the posterior means of the parameters and to subtract from it the usual penalty term 0.5 p log T, where p is the number of parameters, to get the BIC value. In large samples, the BIC usually leads to choosing the model also picked by the marginal likelihood criterion; see the discussion in Kass and Raftery (1995). We use this method in the application (Section 4).

3.4. Monte Carlo study

We have simulated a data-generating process (DGP) corresponding to the model defined by equations (2.3)–(2.4) for two states, with ut ∼ N(0, 1). The parameter values are reported in Table 2. The second GARCH equation implies a higher and more persistent conditional variance than the first one. The other parameter values are inspired by previous empirical results, as in Hamilton and Susmel (1994), and by our results presented in the next section. In particular, the transition probabilities of staying in each regime are close to unity. All the assumptions for stationarity and existence of moments of high order are satisfied by this DGP. In Table 1, we report summary statistics for 1500 observations from this DGP, and in Figure 3 we show the series, its estimated density and the autocorrelations of the squared data. The mean of the data is close to zero.
The density is skewed to the left, and its excess kurtosis is estimated to be 5.52 (the excess kurtosis is 1.62 for the first component GARCH and 0.12 for the second). The ACF of
Table 1. Descriptive statistics for the simulated data.
Mean                 0.010    Maximum    6.83
Standard deviation   1.472    Minimum   −9.44
Skewness            −0.616    Kurtosis   8.52
Note: Statistics for 1500 observations of the DGP defined in Table 2.
the squared data is strikingly more persistent than the ACF of each GARCH component, both of which are virtually at 0 after 10 lags. Said differently, a GARCH(1,1) process would have to be close to integrated to produce the excess kurtosis and the ACF shown in Figure 3. Thus the DGP chosen for illustrating the algorithm is representative of a realistic process for daily financial returns.

In Table 2, we report the posterior means and standard deviations for the model corresponding to the DGP, using the simulated data described above. The results are in the last two columns of the table. In Figure 4, we report the corresponding posterior densities. The prior density of each parameter is uniform between the bounds reported in Table 2 with the DGP values, and these bounds were used for the integrations in the griddy-Gibbs sampler (except for η11 and η22, since they are sampled from beta densities). The number of iterations of the Gibbs sampler was set to 50,000, and the initial 20,000 draws were discarded, since after these the sampler seems to have converged (based on cumsum diagrams not shown to save space). Thus the posterior moments are based on 30,000 dependent draws from the posterior distribution. With few exceptions, the posterior means are within one posterior standard deviation of the DGP values, and the shapes of the posterior densities do not reveal bimodalities that would indicate a label switching problem. From the Gibbs output, we also computed the posterior means of the state variables, obtained by averaging the Gibbs draws of the states. These means are smoothed (posterior) probabilities of the states, a mean state close to 1 corresponding to a high probability of being in the second regime. If we attribute an observation to regime 2 if its corresponding mean state is above one-half (and to regime 1 otherwise), we find that 96% of the data are correctly classified.
We repeated the previous experiment 100 times, thus generating 100 samples of size 1500 from the same DGP and repeating the Bayesian estimation for each sample, using the same prior throughout these 100 repetitions. In columns 4 and 5 of Table 2, we report the means and standard deviations of the 100 posterior means of each parameter. The posterior mean of each parameter can be viewed as an estimator (i.e. a function of the sample) of the parameter; the difference between the mean of these posterior means and the true value measures the bias of this estimator, while the standard deviation of these means measures the sampling variability of the estimator. The reported results show that the bias is not large in general, and that it is larger for the parameters of the second regime than for those of the first. The standard deviations are also larger for the parameters of the second regime. These larger biases and standard deviations are not surprising, since the second regime is active less frequently than the first one. The interpretation of the posterior mean as an estimator that is not much biased (for samples of 1500 observations) is of course also conditional on the prior information we have put into the estimation (a non-informative prior on finite intervals). It should be kept in mind that by changing the location and precision of the prior, one could induce much more (or less) bias and increase (or decrease) the variability of the estimator. From a strictly Bayesian viewpoint, the issue of bias and sampling variability is not relevant. Nevertheless, we can safely conclude from these 100 repetitions that
Figure 3. Graphs for simulated data for the DGP defined in Table 2: (a) sample path (1500 observations); (b) kernel density; (c) correlogram of squared data.
Table 2. Monte Carlo results.
                                  100 repetitions        Single sample
      DGP value  Prior support    Mean    Std. dev.      Mean    Std. dev.
ω1      0.30    (0.15, 0.45)      0.301   (0.043)        0.345   (0.051)
β1      0.20    (0.05, 0.40)      0.201   (0.061)        0.192   (0.087)
α1      0.35    (0.10, 0.50)      0.355   (0.059)        0.264   (0.051)
ω2      2.00    (0.50, 4.00)      2.232   (0.513)        2.136   (0.688)
β2      0.60    (0.35, 0.85)      0.556   (0.084)        0.584   (0.106)
α2      0.10    (0.02, 0.35)      0.110   (0.043)        0.142   (0.049)
μ1      0.06    (0.02, 0.15)      0.056   (0.017)        0.079   (0.016)
μ2     −0.09    (−0.35, 0.18)    −0.056   (0.085)       −0.076   (0.103)
η11     0.98    (0.00, 1.00)      0.977   (0.005)        0.987   (0.004)
η22     0.96    (0.00, 1.00)      0.951   (0.016)        0.958   (0.012)
Note: Columns 4–5: mean and standard deviations of posterior means for the MS-GARCH model, based on 100 replications, each consisting of 1500 observations from the DGP defined by equations (2.3)–(2.4) with N(0, 1) distribution. Columns 6–7: posterior means and standard deviations for a single sample of 1500 observations.
Figure 4. Posterior densities for the MS-GARCH model.
indeed our MCMC algorithm is able to estimate the MS-GARCH model in a reliable way, since over the 100 repetitions it inferred the DGP parameters from the data quite satisfactorily. Finally, let us mention that we also simulated a three-state model, by adding a third, more persistent regime to the two reported in Table 2. The algorithm is also able to recover this DGP. The detailed results are available on request.
4. APPLICATION

We use the S&P500 daily percentage returns from 19/07/2001 to 20/04/2007 (1500 observations) for estimation. Figure 5 displays the sample path, the kernel density and the correlogram of the squared returns. We observe a strong persistence in the squared returns, a slightly positive skewness and an excess kurtosis that is usual for this type of data and sample size; see also Table 3.

In Table 4, we report the posterior means and standard deviations from the estimation of two models using the estimation sample. The estimated models are the two-regime MS-ARCH model defined by setting β1 = β2 = 0 in equations (2.3)–(2.4), and a restricted version (β1 = α1 = 0) of the corresponding MS-GARCH model. The marginal posterior densities for these models are shown in Figures 6 and 7. The intervals over which the densities are drawn are the prior intervals (except for the transition probabilities). The intervals for the GARCH parameters were chosen to avoid negative values and truncation. For the MS-GARCH model, we report in the table the results obtained by the two versions of the Gibbs algorithm described in Subsection 3.1.3: one (GG) using the griddy-Gibbs method for sampling the mean and variance equation parameters, and the other (MH) using a Metropolis–Hastings step for the same parameters. In all cases, the total Gibbs sample size is equal to 50,000 draws with a warm-up sample of 20,000, and the prior distribution is the same. The MH version needs 20% more computing time than the griddy-Gibbs one, and its acceptance rate is 68%, which is a good performance. However, the number of rejections due to the prior restrictions (finite ranges of the GARCH parameters) is slightly more than 50%; such rejections do not occur with the griddy-Gibbs version. Thus we may conclude that the GG version has a slight advantage in this instance but, more importantly, it is reassuring that both algorithms give approximately the same posterior results.
When estimating the MS-ARCH model, we find that in the first regime, which is characterized by a low volatility level (ω1/(1 − α1) = 0.42 using the posterior means as estimates, as opposed to 2.24 in the second regime), the ARCH coefficient α1 is close to 0 (posterior mean 0.014, standard deviation 0.012; see also the marginal density in Figure 6). This is weak evidence in favour of a dynamic effect in the low volatility regime. The same conclusion emerges after estimating the MS-GARCH model, with the added complication that the β1 coefficient is poorly identified (since α1 is almost null). Thus we opted to report the MS-GARCH results with α1 and β1 set equal to 0, and GARCH dynamics only in the high volatility regime. These results show clearly that the lagged conditional variance should be included in the second regime; the MS-ARCH model does not capture enough persistence of the conditional variance in the second regime. The second regime in the MS-GARCH model is rather strongly persistent but stable, with the posterior mean of β2 + α2 equal to 0.973 (0.919 + 0.054). If we estimate a single-regime GARCH model, we find that the persistence is 0.990 (0.942 + 0.048), which makes it closer to integrated GARCH than the second regime of the MS model. The estimation results for the MS-GARCH model also imply that, compared to the first regime (where ω1 = 0.31), the second regime is a high volatility regime, since ω2/(1 − α2 − β2) = 1.73.
Figure 5. Graphs for S&P500 daily returns from 19/07/2001 to 20/04/2007: (a) sample path (1500 observations); (b) kernel density; (c) correlogram of squared data.
Table 3. Descriptive statistics for S&P500 daily returns.
Mean                 0.015    Minimum   −5.046
Standard deviation   1.00     Maximum    5.57
Skewness             0.11     Kurtosis   6.37
Note: Sample period: 19/07/2001 to 20/04/2007 (1500 observations).
Table 4. Posterior means and standard deviations (S&P500 daily returns).
        MS-ARCH (GG)        MS-GARCH (GG)       MS-GARCH (MH)
        Mean    Std. dev.   Mean    Std. dev.   Mean    Std. dev.
ω1      0.419   (0.028)     0.308   (0.025)     0.313   (0.027)
β1      –                   –                   –
α1      0.014   (0.012)     –                   –
ω2      1.988   (0.175)     0.0467  (0.024)     0.0486  (0.025)
β2      –                   0.919   (0.024)     0.917   (0.021)
α2      0.115   (0.042)     0.054   (0.015)     0.055   (0.012)
μ1      0.046   (0.016)     0.071   (0.019)     0.069   (0.020)
μ2     −0.040   (0.044)    −0.012   (0.029)    −0.012   (0.032)
η11     0.994   (0.003)     0.978   (0.011)     0.979   (0.012)
η22     0.986   (0.006)     0.985   (0.006)     0.986   (0.006)
π1      0.300               0.595               0.600
Note: Sample period: 19/07/2001 to 20/04/2007 (1500 observations). A – symbol means that the parameter was set to 0. GG: griddy-Gibbs; MH: Metropolis–Hastings. π1: unconditional probability of state 1.
Next, we compare the two models, starting with their BIC values, computed as explained at the end of Subsection 3.3. These values show clearly that the MS-GARCH model is strongly favoured, with a BIC value of −954.72, compared to −1034.2 for the MS-ARCH. Another way to compare the models is through the posterior means of the state variables, obtained by averaging the Gibbs draws of the states; these are smoothed (posterior) probabilities of the states, a mean state close to 1 corresponding to a high probability of being in the second regime. Figures 8 and 9 display the paths of these means. Both figures show, in conjunction with the sample path of the data (in Figure 5), that high probabilities are associated with high volatility periods (observations 1 to 500 and some later peaks, which are less pronounced for the ARCH than for the GARCH model). From the posterior means of the MS-GARCH model, we can also deduce that the unconditional probabilities of the regimes are, respectively, 0.60 (= (1 − η11)/(2 − η11 − η22)) for the first one and 0.40 for the second one. For the MS-ARCH model, the steady-state probability of regime 1 is only 0.30. These proportions correspond intuitively to the information provided by the plots of the mean states.

Finally, we provide some diagnostics for the estimated models. To compute standardized residuals, we attribute an observation to regime 2 if the mean of the corresponding state variable is larger than 0.5 and to regime 1 otherwise; we then standardize the observation by subtracting its mean and dividing by its conditional standard deviation computed for the attributed state
Figure 6. Posterior densities for the MS-ARCH model (S&P500 daily returns).
using the posterior means. The Ljung–Box Q(p)-statistics of the residuals of the MS-ARCH model are Q(10) = 8.19, Q(20) = 14.8 for the residuals and Q(10) = 83.5, Q(20) = 110 for the squared residuals. For the MS-GARCH model, the corresponding statistics are Q(10) = 10.1, Q(20) = 16.4 for the residuals and Q(10) = 16.5, Q(20) = 22.1 for the squared residuals. Even if there is no formal statistical theory that allows us to compare these values to quantiles of chi-square distributions, the informal evidence from these results is that there is clearly some misspecification of the conditional variance dynamics in the MS-ARCH model, and not in the MS-GARCH one. The conditional mean of both models seems correctly specified.
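The Ljung–Box statistic used in these diagnostics, Q(p) = T(T + 2) Σ_{k=1}^{p} ρ̂k²/(T − k), where ρ̂k is the lag-k sample autocorrelation, can be sketched as follows; the function name and the simulated illustration are ours.

```python
import numpy as np

# Ljung-Box statistic Q(p) = T(T+2) * sum_{k=1}^p rho_k^2 / (T - k),
# applied above to the (squared) standardized residuals.
def ljung_box(x, p):
    x = np.asarray(x, dtype=float) - np.mean(x)
    T = len(x)
    denom = np.sum(x**2)
    q = 0.0
    for k in range(1, p + 1):
        rho_k = np.sum(x[k:] * x[:-k]) / denom    # lag-k sample autocorrelation
        q += rho_k**2 / (T - k)
    return T * (T + 2) * q

# Illustration: for white noise of the sample size used in the paper,
# Q(10) is approximately chi-square(10) distributed.
rng = np.random.default_rng(1)
white = rng.standard_normal(1500)
q10 = ljung_box(white, 10)
```

A strongly autocorrelated series (e.g. a random walk) yields a Q value orders of magnitude larger, which is the pattern behind the large Q statistics of the squared MS-ARCH residuals.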
5. CONCLUSION

We establish some theoretical properties of a Markov-switching GARCH model with constant transition probabilities. We provide simple sufficient conditions for the ergodic stationarity of the process and the existence of its moments. Since ML estimation is not feasible due to path dependence, we develop a reliable Bayesian estimation algorithm for this model, and we apply it to a sample of S&P500 daily returns. Based on residual diagnostics and the Bayesian information
Figure 7. Posterior densities for the MS-GARCH model (S&P500 daily returns).
criterion, we find that the MS-GARCH model with two regimes fits the data much better than the corresponding ARCH model.

Further research could be oriented in several directions. A first direction could be to refine the specification by using existing extensions of the simple GARCH(1,1) model, and by allowing an ARMA structure for the conditional mean. A second direction is to specify the transition probabilities as a function of past information, as in Gray (1996). All these extensions would render the algorithm more CPU-time consuming (due to the additional parameters) but would not complicate it fundamentally; establishing the geometric ergodicity and existence of moments of such more richly specified processes would require us to extend and adapt the proofs presented in the current paper. Another topic would be to compare our MS-GARCH model to other models, such as Gray (1996) and the MS and finite mixture GARCH models of Haas et al. (2004), and to rank their performance in terms of in-sample fit and out-of-sample forecasting of volatility. Finally, further research could focus on estimating the model on other data series, with more regimes, and on comparisons with other GARCH models, in a similar way as done by Marcucci (2005). An open issue of particular relevance would be to find a way to compute the marginal likelihood of the MS-GARCH model.
L. Bauwens, A. Preminger and J. V. K. Rombouts
Figure 8. Mean states MS-ARCH.
Figure 9. Mean states MS-GARCH.
ACKNOWLEDGMENTS

We thank the referees and the editor for their insightful comments on an earlier version. Bauwens's work was supported by the contract 'Projet d'Actions de Recherche Concertées' 07/12-002 of the 'Communauté française de Belgique', granted by the 'Académie universitaire Louvain'. Luc Bauwens is also a member of ECORE, an association between ECARES and CORE. Arie Preminger acknowledges the financial support provided through the European Community's Human Potential Programme under contract HPRN-CT-2002-00232, Microstructure of Financial Markets in Europe, and the Ernst Foundation. This paper presents research results of the Belgian Program on Interuniversity Poles of Attraction initiated by the Belgian State, Prime Minister's Office, Science Policy Programming. The scientific responsibility is assumed by the authors.
REFERENCES

Abramson, A. and I. Cohen (2007). On the stationarity of Markov-switching GARCH processes. Econometric Theory 23, 485–500.
Bauwens, L., A. Preminger and J. Rombouts (2006). Regime switching GARCH models. CORE Discussion Paper 2006/11, Université catholique de Louvain, Louvain-la-Neuve.
Bauwens, L. and J. Rombouts (2004). Econometrics. In J. Gentle, W. Härdle and Y. Mori (Eds.), Handbook of Computational Statistics: Concepts and Methods, 951–79. Heidelberg: Springer.
Bollen, N., S. Gray and R. Whaley (2000). Regime-switching in foreign exchange rates: evidence from currency option prices. Journal of Econometrics 94, 239–76.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31, 307–27.
Bollerslev, T., R. F. Engle and D. Nelson (1994). ARCH models. In R. F. Engle and D. McFadden (Eds.), Handbook of Econometrics, Volume 4, 2959–3038. Amsterdam: North Holland.
Cai, J. (1994). A Markov model of switching-regime ARCH. Journal of Business and Economic Statistics 12, 309–16.
Chan, K. (1993). A review of some limit theorems of Markov chains and their applications. In H. Tong (Ed.), Dimension Estimation and Models. River Edge, NJ: World Scientific Publishing.
Chib, S. (1996). Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75, 79–97.
Das, D. and B. H. Yoo (2004). A Bayesian MCMC algorithm for Markov switching GARCH models. Working paper, City University of New York and Rutgers University.
Davidson, J. (1994). Stochastic Limit Theory. New York: Oxford University Press.
Diebold, F. (1986). Comment on modelling the persistence of conditional variances. Econometric Reviews 5, 51–56.
Diebold, F. and A. Inoue (2001). Long memory and regime switching. Journal of Econometrics 105, 131–59.
Douc, R., E. Moulines and T. Rydén (2004). Asymptotic properties of the maximum likelihood estimator in autoregressive models with Markov regime. Annals of Statistics 32, 2254–304.
Doukhan, P. (1994). Mixing: Properties and Examples. New York: Springer.
Dueker, M. (1997). Markov switching in GARCH processes and mean-reverting stock-market volatility. Journal of Business and Economic Statistics 15, 26–34.
Francq, C., M. Roussignol and J.-M. Zakoïan (2001). Conditional heteroskedasticity driven by hidden Markov chains. Journal of Time Series Analysis 22, 197–220.
Francq, C. and J.-M. Zakoïan (2002). Comments on the paper by Minxian Yang: 'Some properties of vector autoregressive processes with Markov-switching coefficients'. Econometric Theory 18, 815–18.
Francq, C. and J.-M. Zakoïan (2005). The L2-structures of standard and switching-regime GARCH models. Stochastic Processes and their Applications 115, 1557–82.
Gelfand, A. and A. Smith (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398–409.
Gerlach, R. and F. Tuyl (2006). MCMC methods for comparing stochastic volatility and GARCH models. International Journal of Forecasting 22, 91–107.
Giraitis, L., R. Leipus and D. Surgailis (2007). Recent advances in ARCH modelling. In G. Teyssière and A. Kirman (Eds.), Long Memory in Economics, 3–38. Heidelberg: Springer.
Gray, S. (1996). Modeling the conditional distribution of interest rates as a regime-switching process. Journal of Financial Economics 42, 27–62.
Haas, M., S. Mittnik and M. Paolella (2004). A new approach to Markov-switching GARCH models. Journal of Financial Econometrics 2, 493–530.
Hamilton, J. and R. Susmel (1994). Autoregressive conditional heteroskedasticity and changes in regime. Journal of Econometrics 64, 307–33.
Hamilton, J., D. Waggoner and T. Zha (2007). Normalization in econometrics. Econometric Reviews 26, 221–52.
Henneke, J. S., S. T. Rachev and F. J. Fabozzi (2006). MCMC based estimation of Markov switching ARMA–GARCH models. Working paper, University of Karlsruhe.
Hillebrand, E. (2005). Neglecting parameter changes in GARCH models. Journal of Econometrics 129, 121–38.
Jacquier, E., N. Polson and P. Rossi (1994). Bayesian analysis of stochastic volatility models (with discussion). Journal of Business and Economic Statistics 12, 371–417.
Kass, R. and A. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90, 773–95.
Kaufmann, S. and S. Frühwirth-Schnatter (2002). Bayesian analysis of switching ARCH models. Journal of Time Series Analysis 23, 425–58.
Kaufmann, S. and M. Scheicher (2006). A switching ARCH model for the German DAX index. Studies in Nonlinear Dynamics and Econometrics 10(4), Article 3.
Klaassen, F. (2002). Improving GARCH volatility forecasts with regime-switching GARCH. Empirical Economics 27, 363–94.
Krämer, W. (2008). Long memory with Markov-switching GARCH. Economics Letters 99, 390–92.
Lamoureux, C. and W. Lastrapes (1990). Persistence in variance, structural change, and the GARCH model. Journal of Business and Economic Statistics 8, 225–34.
Lütkepohl, H. (1996). Handbook of Matrices. New York: John Wiley.
Maheu, J. and Z. He (2009). Real time detection of structural breaks in GARCH models. Working paper, Department of Economics, University of Toronto.
Marcucci, J. (2005). Forecasting stock market volatility with regime-switching GARCH models. Studies in Nonlinear Dynamics and Econometrics 9, 1–53.
Meitz, M. and P. Saikkonen (2008). Stability of nonlinear AR-GARCH models. Journal of Time Series Analysis 29, 453–75.
Meyn, S. and R. Tweedie (1993). Markov Chains and Stochastic Stability. London: Springer.
Mikosch, T. and C. Starica (2004). Nonstationarities in financial time series, the long-range dependence, and the IGARCH effects. Review of Economics and Statistics 86, 378–90.
Moeanaddin, R. and H. Tong (1990). Numerical evaluations of distribution of non-linear autoregression. Journal of Time Series Analysis 11, 33–48.
Nelson, D. B. (1990). Stationarity and persistence in the GARCH(1,1) model. Econometric Theory 6, 318–34.
Robert, C. and G. Casella (2004). Monte Carlo Statistical Methods. New York: Springer.
Schwert, G. (1989). Why does stock market volatility change over time? Journal of Finance 44, 1115–53.
Stachurski, J. and V. Martin (2008). Computing the distributions of economic models via simulation. Econometrica 76, 443–50.
Tanner, M. and W. Wong (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82, 528–40.
Tjostheim, D. (1990). Non-linear time series and Markov chains. Advances in Applied Probability 22, 587–611.
Tsay, R. (2005). Analysis of Financial Time Series. New York: John Wiley.
Yang, M. (2000). Some properties of vector autoregressive processes with Markov-switching coefficients. Econometric Theory 16, 23–43.
Yao, J. (2001). On square-integrability of an AR process with Markov switching. Statistics and Probability Letters 52, 265–70.
Yao, J. and J.-G. Attali (2000). On stability of nonlinear AR processes with Markov switching. Advances in Applied Probability 32, 394–407.
Zhang, M. Y., J. Russell and R. Tsay (2001). A nonlinear autoregressive conditional duration model with applications to financial transaction data. Journal of Econometrics 104, 179–207.
APPENDIX A: PROOFS OF RESULTS

To prove Theorems 2.1 and 2.2, we write the model in its Markovian state space representation. We use the notation σ_t² = h_{t−1} to make it clear that σ_t is a function of the information dated at time t − 1 or earlier, not of information dated at t. Let λ and v denote the Lebesgue and the counting measures, respectively.

Proof of Theorem 2.1: There exists a measurable function g : S × Ξ → S such that s_t = g(s_{t−1}, ξ_t), where the error term ξ_t is i.i.d. and independent of u_t and h_0. Let s̄_t = s_{t+1}, η_t = (u_t, ξ_t)′, and let Z_t be defined on D ⊂ R × R₊ × S, where R₊ = (0, +∞). From (2.3) and (2.4), we have

$$ Z_t = \begin{pmatrix} y_t \\ h_t \\ \bar{s}_t \end{pmatrix} = \begin{pmatrix} \mu_{\bar{s}_{t-1}} + \sqrt{h_{t-1}}\, u_t \\ \omega_{g(\bar{s}_{t-1},\xi_{t+1})} + \alpha_{g(\bar{s}_{t-1},\xi_{t+1})} \varepsilon_t^2 + \beta_{g(\bar{s}_{t-1},\xi_{t+1})} h_{t-1} \\ g(\bar{s}_{t-1}, \xi_{t+1}) \end{pmatrix} = \begin{pmatrix} \mu_{\bar{s}_{t-1}} + \sqrt{h_{t-1}}\, u_t \\ \omega_{g(\bar{s}_{t-1},\xi_{t+1})} + \big(\alpha_{g(\bar{s}_{t-1},\xi_{t+1})} u_t^2 + \beta_{g(\bar{s}_{t-1},\xi_{t+1})}\big) h_{t-1} \\ g(\bar{s}_{t-1}, \xi_{t+1}) \end{pmatrix} = F(Z_{t-1}, \eta_t), \quad (A.1) $$

where F : D × R² → D. Since η_t is independent of Z_{t−1}, it follows from (A.1) that (y_t, h_t, s̄_t) forms a homogeneous Markov chain. The process is defined on (D, 𝔅, ϕ). The state space of the process is given by D = {(y, h, s) ∈ R × R₊ × S : (y, h) ∈ ∪_{i=1}^n D_i, s ∈ S}, where D_i is the domain of the chain in each regime and is given by D_i = ∩_{j=1}^n {(y, h) ∈ R × R₊ : h ≥ ω_i + α_i(y − μ_j)² + β_i h̄}, with h̄ = min_{β_i<1} {ω_i/(1 − β_i)}. The strict stationarity of the first regime (E log(α_1 u_t² + β_1) < 0) implies that β_1 < 1 (see Nelson, 1990), hence h̄ is well defined. Note that h̄ is a lower bound for the volatility, which implies that h_t ≥ ω_i + α_i(y_t − μ_j)² + β_i h̄ a.s. The subsets of R × R₊ × S which are not contained in D are transient sets and are null with respect to π, the stationary measure of the process, if it exists (see Chan, 1993). Therefore, without loss of generality our state space excludes such π-null sets (see Zhang et al., 2001). The state space is equipped with 𝔅, the Borel σ-algebra on R × R₊ × S restricted to D.² The measure ϕ is the product measure λ² ⊗ v on (D, 𝔅). We use P^m(z_0, A) = P(Z_t ∈ A | Z_{t−m} = z_0) to denote the probability that (y_t, h_t, s̄_t) moves from z_0 = (y_0, h_0, s̄_0) to the set A ∈ 𝔅 in m steps.

In order to establish geometric ergodicity of the Markov chain, we first show that the process is ϕ-irreducible. For irreducibility, it is sufficient to show that P^k(z_0, A) > 0 for some k ≥ 1, all z_0 ∈ D and any Borel measurable set A ∈ 𝔅 with positive ϕ measure (see Chan, 1993). In this case, ϕ is called an irreducibility measure. Now, since the ϕ measure of the set of all boundary points is zero, for any non-null set A we can find a closed non-null subset A′ which is restricted to be interior to the state space. Next, we can show that from any (y_0, h_0, s_0) ∈ D, all (y, h, s) ∈ A′ can be reached in a finite number of steps. We assume that s̄_0 = i, s = ℓ and that h̄ is achieved in regime q; that is, h̄ = ω_q/(1 − β_q).

² The topology over D is based on the product topology, where we use the discrete topology over S (for which every subset is open) and the usual topology on R².
Let h̃ = [h − ω_ℓ − α_ℓ(y − μ_q)²]/β_ℓ and ς = h̃ − h̄; since A′ is an interior set, ς > 0. Thus, there exists a positive integer m = min{t ≥ 1 : h̄ + 0.5ς + 0.5β_q^{−t}ς > ω_q + β_q h_0} such that the point (y, h, s) can be reached through the following m + 1 intermediate steps,³ w = {(y_t, h_t, s̄_t)}_{t=1}^{m+1}, where s̄_t = q, h_t = h̄ + 0.5ς + 0.5β_q^{t−(m+1)}ς, y_1 = μ_i + [(h_1 − ω_q − β_q h_0)/α_q]^{0.5} and y_t = μ_q + [0.5ς(1 − β_q)/α_q]^{0.5} for t > 1. Note that, given (s̄_{t−1}, s̄_t, y_t, h_{t−1}), from (2.4) we have h_t = ω_{s̄_t} + α_{s̄_t}(y_t − μ_{s̄_{t−1}})² + β_{s̄_t} h_{t−1}, which gives by substitution the value of h_t at each step. In the (m + 1)th step we get h_{m+1} = h̃, and by setting s̄_{m+2} = ℓ and y_{m+2} = y, we get h_{m+2} = h. For any A′ in the interior of D there exists a small open ball A″ around z (A″ ⊂ A′) such that all the points in it can be reached from z_0 in m + 2 steps as in our construction above. This result follows from the definition of m and the continuity of the steps (y_t, h_t) with respect to (y, h). The (m + 2)-step transition probability is absolutely continuous with respect to the ϕ measure. By Assumptions 2.1 and 2.2,

$$ p^{m+2}(z_0, z) \ge \prod_{t=0}^{m+1} f\big((y_{t+1} - \mu_{\bar{s}_t})/h_t^{0.5}\big)\, P(\bar{s}_{t+1} \mid \bar{s}_t) > 0, $$

which implies that P^{m+2}(z_0, A) ≥ P^{m+2}(z_0, A″) > 0 and hence the chain is ϕ-irreducible. If z_0 ∈ C, a compact set, then inf_{(z_0,z) ∈ C×C} p^{m+2}(z_0, z) ≥ δ > 0 and, for any A ∈ 𝔅 and z_0 ∈ C,

$$ P^{m+2}(z_0, A) \ge P^{m+2}(z_0, A \cap C) \ge \int_{A \cap C} p^{m+2}(z_0, z)\, d\varphi(z) \ge \delta\, \varphi(A \cap C). $$

Therefore, P^{m+2}(z_0, ·) is minorized by ϕ(· ∩ C), which implies that all non-null compact sets in D are small by definition (see Meyn and Tweedie, 1993, p. 111) and can serve as test sets. Using the same arguments as above, we can show that any small set can be reached in m + 3 steps, therefore the chain is aperiodic; see Chan (1993).

From (2.4) and the c_r inequality,⁴ for any 0 < p ≤ 1 we get

$$ h_t^{p} \le \big(\omega_{\bar{s}_t} + (\alpha_{\bar{s}_t} u_t^2 + \beta_{\bar{s}_t}) h_{t-1}\big)^{p} \le (\omega_{\bar{s}_t})^{p} + (\alpha_{\bar{s}_t} u_t^2 + \beta_{\bar{s}_t})^{p} h_{t-1}^{p} \le \cdots \le \prod_{j=1}^{t} (\alpha_{\bar{s}_j} u_j^2 + \beta_{\bar{s}_j})^{p}\, h^{p} + \sum_{j=1}^{t-1} (\omega_{\bar{s}_{t-j}})^{p} \prod_{i=1}^{j} (\alpha_{\bar{s}_{t-i+1}} u_{t-i+1}^2 + \beta_{\bar{s}_{t-i+1}})^{p} + (\omega_{\bar{s}_t})^{p}, \quad (A.2) $$

where h = h_0. Since {s̄_t} is an ergodic Markov chain, for any initial state we have

$$ \frac{1}{t} \sum_{j=1}^{t} \log\big(\alpha_{\bar{s}_j} u_j^2 + \beta_{\bar{s}_j}\big) \to \sum_{i=1}^{n} \pi_i\, E \log\big(\alpha_i u_t^2 + \beta_i\big) < 0 \quad \text{a.s.}; \quad (A.3) $$

see Chan (1993). This result, Assumption 2.1 and the dominated convergence theorem imply that there exists t̄ sufficiently large such that δ ≥ 1/t̄ = p and, for all ℓ ∈ S and t ≥ t̄,

$$ E\Big( \prod_{j=1}^{t} \big(\alpha_{\bar{s}_j} u_j^2 + \beta_{\bar{s}_j}\big)^{p} \,\Big|\, \bar{s}_0 = \ell \Big) \le \gamma < 1. \quad (A.4) $$

³ The integer m ensures that in the first step the term (h_1 − ω_q − β_q h_0) is positive, so y_1 is well defined in our state space.
⁴ The c_r inequality: for r > 0, E|X + Y|^r ≤ c_r(E|X|^r + E|Y|^r), where c_r = 1 if r ≤ 1 and c_r = 2^{r−1} if r > 1. See Davidson (1994, p. 140).
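The almost sure limit in (A.3) is simple to evaluate numerically for a candidate parametrization: estimate E log(α_i u_t² + β_i) by Monte Carlo for each regime and weight by the stationary distribution π of the regime chain. A minimal sketch, with invented parameter values (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (invented, not from the paper)
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])
alpha = np.array([0.05, 0.10])
beta = np.array([0.90, 0.85])

# Stationary distribution pi of the regime chain solves pi = pi P
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()

# Monte Carlo estimate of E log(alpha_i u_t^2 + beta_i) for u_t ~ N(0, 1)
u2 = rng.standard_normal(200_000) ** 2
drift = sum(pi[i] * np.mean(np.log(alpha[i] * u2 + beta[i])) for i in range(2))
print("sum_i pi_i E log(alpha_i u^2 + beta_i) =", round(float(drift), 4))
```

A negative value of the weighted sum is exactly the condition required in (A.3); note that each regime may individually violate E log(α_i u_t² + β_i) < 0 as long as the weighted sum stays negative.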
As a drift function we use V(z) = 1 + (η̄/κ)y^{2p} + h^p, where η̄ = η − γ, κ = E|u_t|^{2p}, p = 1/t̄ and η is some positive number which satisfies γ < η < 1; the test set is given by C = {(y, h, s) ∈ D : h + y² ≤ c, s ∈ S}, where c > 0 is to be determined below. From (A.2)–(A.4) we find, for some t̄ < t, that

$$ E\big(h_t^{p} \mid z_0 = z\big) \le E\Big( \prod_{j=1}^{t} \big(\alpha_{\bar{s}_j} u_j^2 + \beta_{\bar{s}_j}\big)^{p} \,\Big|\, \bar{s}_0 = \ell \Big) h^{p} + M \le M + \gamma h^{p}, $$

where $M = E\big(\sum_{j=1}^{t-1} \omega_{\bar{s}_{t-j}}^{p} \prod_{i=1}^{j} (\alpha_{\bar{s}_{t-i+1}} u_{t-i+1}^2 + \beta_{\bar{s}_{t-i+1}})^{p} + \omega_{\bar{s}_t}^{p} \,\big|\, \bar{s}_0 = \ell\big)$ and M < ∞ by Assumption 2.1. Therefore,

$$ E\big(V(z_t) \mid z_0 = z\big) \le 1 + M + \gamma h^{p} + \frac{\bar{\eta}}{\kappa}\, E\big(u_t^{2p}\big)\, E\big(h_{t-1}^{p} \mid z_0 = z\big) \le 2M + 1 + \eta h^{p} = h^{p} \left( \frac{2M+1}{h^{p}} + \eta \right). $$

Let M₁ = (2M + 1)/(1 − η), c = M₁^{1/p}(1 + (κ/(ηη̄))^{1/p}) and η′ = [(2M + 1)/h^p] + η, and suppose that h + y² > c. If h^p > M₁, we get that η′ ∈ (η, 1), hence E(V(z_t) | z_0 = z) ≤ η′h^p < η′V(z), where the last inequality follows from the definition of the drift function. Otherwise, if h^p ≤ M₁, we note that y² > c − h ≥ c − M₁^{1/p} = (M₁κ/(ηη̄))^{1/p}, hence M₁ < (ηη̄/κ)y^{2p} and, for any η′ ∈ (η, 1),

$$ E\big(V(z_t) \mid z_0 = z\big) \le 2M + 1 + \eta h^{p} \le M_1(1 - \eta) + \eta M_1 = M_1 \le \eta \frac{\bar{\eta}}{\kappa}\, y^{2p} < \eta' V(z). $$
Since the Lyapunov function above is bounded on compact sets, we have E(V(z_t) | z_0 = z) ≤ η′V(z) + a·1_C(z) for some a < ∞ and for all z_0 ∈ D, hence the drift criterion is satisfied. We can then combine Meyn and Tweedie (1993, Theorem 15.0.1) and Tjostheim (1990) to obtain that {Z_t} is geometrically ergodic. The finiteness of E(|y_t|^{2p}) with respect to the stationary measure follows from Meitz and Saikkonen (2008, Lemma 6). If the process is initiated from its stationary distribution, it further follows that the process is α-mixing with geometrically decaying mixing numbers; see Doukhan (1994, p. 89). □

Proof of Theorem 2.2: Let I_ℓ be a 1 × n vector that contains 1 in the ℓth position and zeros elsewhere, and let l = (1, …, 1)′ be an n × 1 vector. The matrix Ω of Assumption 2.5 is positive definite, hence its spectral radius is real and positive, and Assumption 2.5 implies that there exists a positive integer t̄ such that, for all t ≥ t̄, each element of Ω^t is smaller than 1/n; that is, (Ω^t)_{ij} < 1/n (see Lütkepohl, 1996, p. 76, 3(a)). Hence

$$ E\Big( \prod_{j=1}^{t} \big(\alpha_{\bar{s}_j} u_j^2 + \beta_{\bar{s}_j}\big)^{k} \,\Big|\, \bar{s}_0 = \ell \Big) = I_\ell\, \Omega^{t}\, l \le \gamma < 1. \quad (A.5) $$

By solving (2.4) recursively and setting h_0 = h, we get

$$ h_t = \omega_{\bar{s}_t} + \big(\alpha_{\bar{s}_t} u_t^2 + \beta_{\bar{s}_t}\big) h_{t-1} = \prod_{j=1}^{t} \big(\alpha_{\bar{s}_j} u_j^2 + \beta_{\bar{s}_j}\big)\, h + \sum_{j=1}^{t-1} \omega_{\bar{s}_{t-j}} \prod_{i=1}^{j} \big(\alpha_{\bar{s}_{t-i+1}} u_{t-i+1}^2 + \beta_{\bar{s}_{t-i+1}}\big) + \omega_{\bar{s}_t}. \quad (A.6) $$

Now, let κ = E|u_t|^{2k} and η̄ = η − γ, where η is some positive number which satisfies γ < η < 1. We select a drift function of the form V(y, h, s) = 1 + (η̄/κ)y^{2k} + h^k and a test set C = {(y, h, s) ∈ D : h + y² ≤ c, s ∈ S}. By the binomial theorem, (A.6), Assumption 2.4 and some tedious calculations, we can find, for some t̄ < t, finite positive constants {a_m}, m = 1, …, k, independent of h, such that

$$ E\big(V(y_t, h_t) \mid z_0 = z\big) = 1 + E\big(h_t^{k} \mid z_0 = z\big) + \bar{\eta}\, E\big(y_t^{2k} \mid z_0 = z\big)/\kappa \le 1 + \Big[ E\Big( \prod_{j=1}^{t} \big(\alpha_{\bar{s}_j} u_j^2 + \beta_{\bar{s}_j}\big)^{k} \,\Big|\, \bar{s}_0 = \ell \Big) + \bar{\eta} \Big] h^{k} + \sum_{m=1}^{k} a_m h^{k-m} \le 1 + \eta h^{k} + \sum_{m=1}^{k} a_m h^{k-m}. $$

Next, by applying the same arguments as in the proof of Theorem 2.1, the desired result follows. □
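The moment condition in the proof of Theorem 2.2 can be checked numerically for a candidate parametrization. The matrix of Assumption 2.5 is not reproduced explicitly in this extract, so the sketch below uses one standard choice from this literature (cf. Abramson and Cohen, 2007), a matrix with entries p_{ij} E[(α_j u² + β_j)^k]; both this form and the parameter values should be treated as illustrative assumptions rather than the paper's own construction.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative two-regime parameters (invented, not taken from the paper)
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])   # P[i, j] = P(s_t = j | s_{t-1} = i)
alpha = np.array([0.05, 0.10])
beta = np.array([0.90, 0.85])
k = 2                          # order of the moment being checked

# Monte Carlo estimate of E[(alpha_j u^2 + beta_j)^k] for u ~ N(0, 1)
u2 = rng.standard_normal(500_000) ** 2
m_k = np.array([np.mean((alpha[j] * u2 + beta[j]) ** k) for j in range(2)])

# Assumed form of the Assumption-2.5-type matrix (cf. Abramson and Cohen, 2007)
Omega = P * m_k[np.newaxis, :]            # Omega[i, j] = p_{ij} * m_k[j]
rho_spec = float(np.max(np.abs(np.linalg.eigvals(Omega))))
print("spectral radius:", round(rho_spec, 4))   # < 1 indicates the k-th moment exists
```

A spectral radius below one mirrors the role of γ < 1 in (A.5): powers of the matrix shrink geometrically, which is what the drift argument exploits.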
The Econometrics Journal
Econometrics Journal (2010), volume 13, pp. 245–270. doi: 10.1111/j.1368-423X.2010.00316.x

ECF estimation of Markov models where the transition density is unknown

GEORGE J. JIANG†,‡ AND JOHN L. KNIGHT§

†Department of Finance, Eller College of Management, University of Arizona, 1130 E. Helen Street, Tucson, AZ 85721-0108, USA. E-mail: [email protected]
‡Yunnan University of Finance and Economics, 273 Longquan Road, Kunming, China, 650221.
§Department of Economics, University of Western Ontario, 1151 Richmond St., London, ON N6A 5C2, Canada. E-mail: [email protected]

First version received: September 2006; final version accepted: January 2010

Summary In this paper, we consider the estimation of Markov models where the transition density is unknown. The approach we propose is based on the empirical characteristic function estimation procedure with an approximate optimal weight function. The approximate optimal weight function is obtained through an Edgeworth/Gram–Charlier expansion of the logarithmic transition density of the Markov process. We derive the estimating equations and demonstrate that they are similar to the approximate maximum likelihood estimation (AMLE). However, in contrast to the conventional AMLE, our approach ensures the consistency of the estimator even with the approximate likelihood function. We illustrate our approach with examples of various Markov processes. Monte Carlo simulations are performed to investigate the finite sample properties of the proposed estimator in comparison with other methods.

Keywords: Approximate MLE, Edgeworth/Gram–Charlier expansion, Empirical characteristic function (ECF), Markov process.
1. INTRODUCTION

The estimation of Markov models can proceed straightforwardly via the maximum likelihood estimation (MLE) method when the transition density is known in closed form. However, when the transition density is unknown, alternative estimators to MLE need to be considered. Various estimation methods have been proposed and applied in the literature, e.g. the quasi-maximum likelihood (QML) method by Bollerslev and Wooldridge (1992) and Fisher and Gilles (1996), among others; the maximum likelihood method based on a closed-form Hermite expansion of the likelihood function by Aït-Sahalia (2002, 2003); the generalized method of moments (GMM) by Hansen and Scheinkman (1996) and Liu (1997); as well as various simulation-based methods such as the simulated moments estimator (SME) by Duffie and Singleton (1993), the indirect inference approach by Gouriéroux et al. (1993), the efficient method of moments (EMM) by
Gallant and Tauchen (1996) and Gallant and Long (1997) with applications by Chernov and Ghysels (2000) and Andersen et al. (2002) to continuous-time asset return models, and the Bayesian MCMC method by Jones (1998), etc. In recent literature, a number of estimation methods have also been developed in the frequency domain. These methods exploit the analytical conditional characteristic function (CCF) of the state variables. It is noted that, for many Markov processes, while a closed form of the transition density is unavailable, the associated CCF of the state variables can often be derived analytically. For instance, in the continuous time finance literature, models specified within the affine framework often have closed-form CCF. This observation opens the door to alternative estimation methods using the CCF. In this regard we have the estimators developed in Singleton (2001), Jiang and Knight (2002) and Chacko and Viceira (2003) for continuous time diffusion and jump-diffusion models. The basic idea is to minimize the integrated distance between the empirical characteristic function (ECF) or joint ECF and their theoretical counterparts. Singleton (2001) also proposes to use the likelihood function obtained by Fourier inversion of the CCF. In a recent paper, Carrasco et al. (2007) extend the ECF approach by using a continuum of moment conditions for the estimation of general dynamic models. The ECF estimation method is asymptotically efficient as there is a one-to-one relationship between the CCF and the transition density function. However, the practical implementation of the method involves two issues: namely the choice of the discrete grids at which the ECF and CCF are matched, and the weight function used in the estimation procedure. 
While in the univariate case the choice of discrete grids has received some attention, see Feuerverger and Mureika (1977), Schmidt (1982), Knight and Satchell (1997), Yu (1998) and Singleton (2001), in the multivariate case the choice is indeed an open question. With given discrete grids, Singleton (2001) proposes using a numerical procedure in the GMM framework to find the optimal weight function, and shows that the optimal weight function is simply the covariance of the ECF evaluated at the discrete grid points. In this paper, we propose an alternative approach to address the issues involved in the ECF estimation procedure. Instead of choosing the discrete grid points first and then finding the optimal weight function following a numerical procedure, we propose a closed-form or analytical approximation of the optimal weight function in the ECF estimation procedure. As a result, the choice of the discrete grid can be avoided and the estimation can proceed based on the analytical conditional cumulants. The basic idea stems from the fact that, due to the one-to-one correspondence between the CCF and the transition density, the first-order conditions for ML estimation can indeed be written as a sum of weighted integrals of the difference between the ECF and the CCF. The optimal weight in this set-up is the inverse Fourier transform of the tth score. Thus, by approximating the logarithmic transition density of the Markov process we can approximate the optimal weight function and hence solve the integral for the appropriate estimating equations. We demonstrate that the estimating equations are similar to the approximate maximum likelihood estimation (AMLE). However, in contrast to the conventional AMLE, our approach ensures the consistency of the estimator even with the approximate optimal weight function. The method applies to processes with closed-form CCF, e.g.
the common continuous-time affine diffusion and jump-diffusion processes with known analytical CCF, and the discrete time compound autoregressive (CAR) processes introduced by Darolles et al. (2006). The CAR processes are a class of non-linear dynamic processes characterized by conditional log-Laplace transforms which are affine functions of lagged values of the process. Furthermore, since only the conditional cumulants are required, the method also applies to processes where the conditional cumulants can be derived up to certain orders.
The paper is organized as follows. Section 2 proposes the ECF estimation approach with approximate optimal weight function and derives the appropriate estimating equations. In particular, we relate our approach to various existing estimation methods, especially the QMLE and AMLE. In Section 3, we illustrate the application of our approach with examples of various Markov processes in both discrete time and continuous time and for both univariate and multivariate cases. In Section 4, we perform Monte Carlo simulations to investigate the finite sample properties of the proposed estimation procedure. A brief conclusion is contained in Section 5. Proofs are given in Appendix A.
2. ECF METHOD AND CONSISTENT AMLE (C-AMLE)

Let x_t ∈ R^N, t > 0, be an N-dimensional stationary Markov process defined on either a discrete or a continuous time set with a complete probability space (Ω, F, P). Suppose that {x_t}_{t=1}^T represents an observed sample over a discrete set of time points from the Markov process. Let f(x_{t+1} | x_t; θ) : R^N × R^N × Θ → R denote the measurable transition density for the Markov process and θ ∈ Θ ⊂ R^Q denote the parameter vector of the data-generating process for x_t. Following Singleton (2001), a consistent estimator of the parameters based on the empirical characteristic function (ECF) can be derived from the following equation:

$$ \frac{1}{T} \sum_{t=1}^{T} \int_{\mathbb{R}^N} w(r, t \mid x_t; \theta) \big( e^{\mathrm{i} r' x_{t+1}} - \phi(r, x_{t+1} \mid x_t; \theta) \big)\, dr = 0, \quad (2.1) $$

where φ(r, x_{t+1} | x_t; θ) = E[exp(ir′x_{t+1}) | x_t] is the CCF and w(r, t | x_t; θ) ∈ W(r, t | x_t; θ), with W(r, t | x_t; θ) being the set of 'instrument' or 'weight' functions as defined in Singleton (2001). Namely, for each 'instrument' or 'weight' function w(r, t | x_t; θ) : R^N × R₊ × R^N × Θ → C^Q, where C^Q denotes the complex numbers, we have w(r, t | x_t; θ) ∈ I_t and w(r, t | x_t; θ) = w̄(−r, t | x_t; θ), t = 1, 2, …, T + 1, where I_t is the σ-algebra generated by x_t. The ECF estimation procedure has been proposed by Feuerverger and Mureika (1977), Schmidt (1982) and Feuerverger and McDunnough (1981) for i.i.d. cases, and by Feuerverger (1990) for generic stationary Markov processes using the joint ECF of state variables. As shown in Feuerverger (1990), Singleton (2001), Jiang and Knight (2002) and Carrasco et al. (2007), there exists an optimal 'instrument' or 'weight' function in the sense that the estimator defined in (2.1) is an efficient estimator with the same asymptotic properties as the ML estimator. We summarize these results in the following lemma. Following Singleton (2001), we impose on the ECF estimation procedure the same regularity conditions as those imposed by Hansen (1982) for GMM.

LEMMA 2.1. Under standard regularity conditions, where w(r, t | x_t; θ) ∈ W(r, t | x_t; θ) is a well-defined 'instrument' or 'weight' function, equation (2.1) leads to consistent parameter estimators. Furthermore, let f(x_{t+1} | x_t; θ) be the transition density of the Markov process; then the following weight function is optimal:

$$ w(r, t \mid x_t; \theta) = \frac{1}{(2\pi)^N} \int_{\mathbb{R}^N} \frac{\partial \ln f(x_{t+1} \mid x_t; \theta)}{\partial \theta}\, e^{-\mathrm{i} r' x_{t+1}}\, dx_{t+1} \quad (2.2) $$

in the sense that the estimators based on equation (2.1) are equivalent to MLE.

It is noted that the optimal weight function is determined by the logarithm of the transition density or likelihood function of the Markov process. When f(x_{t+1} | x_t; θ) is explicitly known,
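As a concrete illustration of (2.1) with a non-optimal weight, consider a Gaussian AR(1) process x_{t+1} = c + ρx_t + σe_{t+1}, whose CCF is known in closed form: φ(r, x_{t+1} | x_t; θ) = exp(ir(c + ρx_t) − r²σ²/2). The sketch below replaces the integral over r with a small ad hoc grid and uses the simple instruments {1, x_t}; the grid, instruments and optimizer are our choices for illustration, not the estimator developed in this paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Simulate a Gaussian AR(1): x_{t+1} = c + rho * x_t + sigma * e_{t+1}
c_true, rho_true, sigma_true = 0.1, 0.6, 1.0
T = 5000
x = np.empty(T)
x[0] = c_true / (1.0 - rho_true)
for t in range(T - 1):
    x[t + 1] = c_true + rho_true * x[t] + sigma_true * rng.standard_normal()

grid = np.array([0.25, 0.5, 1.0, 2.0])   # ad hoc grid of r values
xt, xt1 = x[:-1], x[1:]

def ecf_objective(theta):
    """Sum of squared moment conditions E[w_t (e^{i r x_{t+1}} - CCF)] over the grid."""
    c, rho, log_sig = theta
    sig2 = np.exp(2.0 * log_sig)
    g = []
    for r in grid:
        # ECF increment minus the closed-form CCF of the Gaussian AR(1)
        diff = np.exp(1j * r * xt1) - np.exp(1j * r * (c + rho * xt) - 0.5 * r**2 * sig2)
        for w in (np.ones_like(xt), xt):          # instruments 1 and x_t
            g.append(np.mean(w * diff.real))
            g.append(np.mean(w * diff.imag))
    return float(np.sum(np.square(g)))

res = minimize(ecf_objective, x0=np.array([0.0, 0.5, 0.0]), method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-9, "fatol": 1e-12})
c_hat, rho_hat, sigma_hat = res.x[0], res.x[1], float(np.exp(res.x[2]))
print(c_hat, rho_hat, sigma_hat)
```

With the identity weight used here the estimator is consistent but not efficient; Lemma 2.1 shows that it is the optimal weight (2.2) which restores ML efficiency.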
the Markov model can be estimated straightforwardly via the ML method, and it can be shown that the estimation equation (2.1) results exactly in the conditional ML estimator. However, if f(x_{t+1} | x_t; θ) is not known explicitly but φ(r, x_{t+1} | x_t; θ) is, then (2.1) must be implemented with a weight function other than the optimal w(r, t | x_t; θ). Thus, the implementation of the estimation procedure in (2.1) involves two issues, namely the choice of the discrete grids at which the ECF and CCF are matched, and the weight function. Since with a discrete grid the estimation problem in (2.1) is essentially GMM in the frequency domain, Singleton (2001) proposes a GMM approach to deal with the issues involved in the implementation of (2.1). Namely, the integral in (2.1) is first approximated with a sum over a discrete set of r, and then the optimal weight function can be obtained by appealing to the GMM framework following a numerical procedure. As shown in Singleton (2001), under certain regularity conditions, as the grid of r's becomes increasingly fine in R^N, the asymptotic efficiency of the estimator approaches that of MLE. The appeal to GMM to find the optimal weight matrix is in essence a GLS solution, but its major drawback is the necessity to choose the vectors r over which the sum, and hence the integral, is approximated. In the scalar Markov case the choice of the discrete grids has been considered in the literature, at least in the i.i.d. case; see Feuerverger and Mureika (1977), Schmidt (1982), Knight and Satchell (1997), Yu (1998) and Singleton (2001). In the N-dimensional case (N ≥ 2) the problem is much more complicated and there is virtually no guidance given in the literature. In general, there is an obvious trade-off between finer grids and coarser grids. With too coarse a grid, the GMM estimation based on the selected set of moments is easier to implement but achieves lower asymptotic efficiency.
With too fine a grid, on the other hand, the implementation of the estimation procedure can become infeasible due to the singularity of the covariance matrix. Thus, in practice there is a limitation to the actual implementation of a very fine grid. In the context of the estimation of mixture distributions, Carrasco and Florens (2000) provide Monte Carlo evidence of efficiency loss relative to MLE due to the use of a discrete grid. In a recent paper, Carrasco et al. (2007) propose a GMM approach with a continuum of moment conditions and address the singularity problem via a penalization term. In this paper, we propose an alternative approach to the one in Singleton (2001) to deal with the issues involved in the implementation of (2.1). Instead of choosing the vectors r first and then finding the optimal weight function following a numerical procedure, we propose an approximation to the optimal weight function in (2.2) and derive the estimating equations accordingly. As a result, the choice of the discrete grid is avoided and the estimation is relatively straightforward to implement. We show that our approach is similar to the approximate MLE, as both are based on the approximate likelihood function or the approximate score. As in the case of the conventional approximate MLE, approximating the optimal weight function does result in an efficiency loss relative to MLE. However, in contrast to the conventional approximate MLE, our approach ensures the consistency of parameter estimators even with the approximated likelihood function.

2.1. Approximate optimal weight function and consistent AMLE (C-AMLE)

In this paper, we propose a closed-form or analytical approximation to the logarithmic transition density of the Markov process, i.e. ln f(x_{t+1} | x_t; θ), which will then, via (2.2), give us an approximate weight function in (2.1) and hence result in a consistent estimator. We consider series expansions for the log transition density rather than for the density itself.
In addition to the fact that the log transition density appears explicitly in the optimal weight function and is thus the function we aim to approximate, better approximations are, for a number of other reasons, often obtained by approximating the log transition density and then exponentiating. Since the solution of (2.1) requires knowledge of φ(r, x_{t+1} | x_t; θ) or ln φ(r, x_{t+1} | x_t; θ), we can use this function to develop approximations to ln f(x_{t+1} | x_t; θ). The approximation we propose is the multivariate Edgeworth/Gram–Charlier expansion. Following McCullagh (1987), using tensor notation, the general Gram–Charlier/Edgeworth expansion for the log multivariate density ln f(x_{t+1} | x_t; θ) is given by

ln f(x_{t+1} | x_t; θ) = ln f_0(x_{t+1} | x_t; θ)
  + (1/3!) K^{i,j,k} h_{ijk}(x_{t+1} | x_t)
  + (1/4!) K^{i,j,k,l} h_{ijkl}(x_{t+1} | x_t)
  + (1/5!) K^{i,j,k,l,m} h_{ijklm}(x_{t+1} | x_t)
  + (1/6!) [ K^{i,j,k,l,m,n} h_{ijklmn}(x_{t+1} | x_t) + K^{i,j,k} K^{l,m,n} h_{ijk,lmn}(x_{t+1} | x_t)[10] ]
  + ...,   (2.3)

where f_0(x_{t+1} | x_t; θ) is chosen such that its first-order and second-order moments agree with those of x_{t+1} conditional on x_t. Upon letting ψ(r, x_{t+1} | x_t; θ) = ln φ(−ir, x_{t+1} | x_t; θ) (the cumulant generating function), we have the conditional cumulants of various orders:

λ^i = ∂ψ(r, x_{t+1} | x_t; θ)/∂r_i |_{r=0}
λ^{i,j} = ∂²ψ(r, x_{t+1} | x_t; θ)/∂r_i∂r_j |_{r=0}
K^{i,j,k} = ∂³ψ(r, x_{t+1} | x_t; θ)/∂r_i∂r_j∂r_k |_{r=0}
K^{i,j,k,l} = ∂⁴ψ(r, x_{t+1} | x_t; θ)/∂r_i∂r_j∂r_k∂r_l |_{r=0}
···.

It is noted that Edgeworth series used for approximations to distributions are most conveniently expressed using cumulants. Moreover, where approximate normality is involved, higher-order cumulants, but not higher-order moments, can usually be neglected. In this paper, since we often deal with situations where the log CCF has a simpler expression, the cumulants can be obtained more conveniently than the moments. Furthermore, the Hermite polynomial tensors in the general Edgeworth/Gram–Charlier expansion (2.3) are given by

h_i = λ_{i,j}(x^j − λ^j)
h_{ij} = h_i h_j − λ_{i,j}
h_{ijk} = h_i h_j h_k − h_i λ_{j,k}[3]
h_{ijkl} = h_i h_j h_k h_l − h_i h_j λ_{k,l}[6] + λ_{i,j} λ_{k,l}[3]
h_{ijklm} = h_i h_j h_k h_l h_m − h_i h_j h_k λ_{l,m}[10] + h_i λ_{j,k} λ_{l,m}[15]
h_{ijklmn} = h_i · · · h_n − h_i h_j h_k h_l λ_{m,n}[15] + h_i h_j λ_{k,l} λ_{m,n}[45] − λ_{i,j} λ_{k,l} λ_{m,n}[15]
....
h_{ijk,lmn} = h_{ijklmn} − h_{ijk} h_{lmn}.

In tensor notation, it is understood that any index repeated once as a subscript and once as a superscript is interpreted as a sum over these repeated scripts, i.e. h_i = λ_{i,j}(x^j − λ^j) = Σ_j λ_{i,j}(x^j − λ^j), etc. Also, the numbers in square brackets refer to the number of permutations of the various subscripts. In this paper, for simplicity we let the initial approximating function be the multivariate normal density with mean vector λ^i and covariance matrix λ^{i,j}, i.e.

f_0(x_{t+1} | x_t; θ) = (2π)^{−N/2} |λ^{i,j}|^{−1/2} exp{ −(1/2)(x^i_{t+1} − λ^i)(x^j_{t+1} − λ^j) λ_{i,j} },   (2.4)

with x_{t+1} being an N × 1 vector whose ith element is x^i_{t+1}, mean λ^i and covariance matrix λ^{i,j}; λ_{i,j} is the inverse matrix of λ^{i,j} and |λ^{i,j}| is the determinant of the covariance matrix. For clarity of presentation as well as ease of notation, and yet without loss of generality, in the following discussion we focus on the Gram–Charlier series expansion and set the truncation order p = 4; consequently we have
ln f̂_p(x_{t+1} | x_t; θ) = −(N/2) ln 2π − (1/2) ln|λ^{i,j}| − (1/2)(x^i − λ^i)(x^j − λ^j)λ_{i,j}
  + (1/3!) K^{i,j,k} h_{ijk} + (1/4!) K^{i,j,k,l} h_{ijkl},   (2.5)

with the approximation error given by Δ ln f_p(x_{t+1} | x_t; θ) = ln f(x_{t+1} | x_t; θ) − ln f̂_p(x_{t+1} | x_t; θ). The Edgeworth and Gram–Charlier series are formally identical when the expansion order is infinite; the main difference lies in the criteria used to collect terms in a truncated series. As a result, with the same order of expansion, different cumulants or moments may appear in the estimating equations for the Edgeworth and the Gram–Charlier series. Let the parameter vector to be estimated be denoted by θ ∈ Θ; then all cumulants and Hermite tensors are functions of θ. The approximate score function, i.e. the derivative of ln f̂_p(x_{t+1} | x_t; θ), is given by

∂ln f̂_p(x_{t+1} | x_t; θ)/∂θ = −(1/(2|λ^{i,j}|)) ∂|λ^{i,j}|/∂θ + h_i ∂λ^i/∂θ − (1/2)(x^i − λ^i)(x^j − λ^j) ∂λ_{i,j}/∂θ
  + (1/6)[ (∂K^{i,j,k}/∂θ) h_{ijk} + K^{i,j,k} ∂h_{ijk}/∂θ ]
  + (1/24)[ (∂K^{i,j,k,l}/∂θ) h_{ijkl} + K^{i,j,k,l} ∂h_{ijkl}/∂θ ].   (2.6)

Using the approximate score in (2.6), we can define an approximate optimal weight function from (2.2) as

ω̂_p(r, t | x_t; θ) = (1/(2π)^N) ∫ · · · ∫ e^{−ir′x_{t+1}} ( ∂ln f̂_p(x_{t+1} | x_t; θ)/∂θ ) dx_{t+1}.   (2.7)
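To fix ideas, the univariate (N = 1) version of the truncated expansion (2.5) is easy to code. The sketch below is our own (it is not part of the paper); it uses Gauss–Hermite quadrature to verify two properties used implicitly above: the correction terms h_{11}, h_{111}, h_{1111} have mean zero under the Gaussian baseline f_0, and with K_3 = K_4 = 0 the approximation collapses to an exact N(K_1, K_2) density integrating to one.

```python
import numpy as np

def log_f_hat(x, K1, K2, K3=0.0, K4=0.0):
    """Univariate truncated Gram-Charlier log density, cf. (2.5) with p = 4."""
    h1 = (x - K1) / K2
    h111 = h1**3 - 3.0 * h1 / K2
    h1111 = h1**4 - 6.0 * h1**2 / K2 + 3.0 / K2**2
    gauss = -0.5 * np.log(2.0 * np.pi * K2) - 0.5 * (x - K1)**2 / K2
    return gauss + K3 * h111 / 6.0 + K4 * h1111 / 24.0

K1, K2 = 0.3, 0.5
t, w = np.polynomial.hermite.hermgauss(80)          # nodes/weights for int e^{-t^2} g(t) dt
x = K1 + np.sqrt(2.0 * K2) * t                      # change of variables to N(K1, K2)
expect = lambda g: np.sum(w * g) / np.sqrt(np.pi)   # E_{f0}[g(x)] by quadrature

h1 = (x - K1) / K2
zero_means = [expect(h1),
              expect(h1**2 - 1.0 / K2),
              expect(h1**3 - 3.0 * h1 / K2),
              expect(h1**4 - 6.0 * h1**2 / K2 + 3.0 / K2**2)]

# with K3 = K4 = 0, exp(log_f_hat) is an exact Gaussian density:
total_mass = np.sqrt(2.0 * K2) * np.sum(w * np.exp(t**2 + log_f_hat(x, K1, K2)))
```

With nonzero K_3 and K_4 the same function delivers the skewness- and kurtosis-corrected approximation used throughout the paper.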
C The Author(s). Journal compilation C Royal Economic Society 2010.
Estimation of Markov models
The following theorem states the estimating equation for a consistent estimator of θ. Since the estimation is based on an approximate likelihood function, we refer to the estimator as the AMLE. The asymptotic distribution of the proposed estimator is derived under standard regularity conditions as in Singleton (2001).

THEOREM 2.1. (Consistent Approximate MLE) Given the approximate optimal weight function in (2.7), the ECF procedure in (2.1), with w(r, t | x_t; θ) replaced by ω̂_p(r, t | x_t; θ), can be written as

(1/T) Σ_{t=1}^T ∫ ω̂_p(r, t | x_t; θ) ( e^{ir′x_{t+1}} − φ(r, x_{t+1} | x_t; θ) ) dr = 0.   (2.8)

From the above equation, we have the following estimating equation for our proposed estimator, which we refer to as the approximate MLE; that is,

(1/T) Σ_{t=1}^T { ∂ln f̂_p(x_{t+1} | x_t; θ)/∂θ − E[ ∂ln f̂_p(x_{t+1} | x_t; θ)/∂θ | x_t ] } = 0.   (2.9)

Under standard regularity conditions, the approximate ML estimator defined in (2.8) or (2.9) is consistent and asymptotically normal, i.e. as T → ∞,

√T (θ̂_p − θ) →_d N(0, Ω_p),   (2.10)

where the limiting covariance matrix Ω_p is given in Appendix A.

The estimating equation in (2.9) is based on the approximate likelihood function or the approximate score.¹ However, in contrast to the conventional AMLE, it also includes the second term, i.e. the conditional expectation of the approximate score. It is clear that when the true transition density of the Markov process f(x_{t+1} | x_t; θ) is known and f̂_p(x_{t+1} | x_t; θ) = f(x_{t+1} | x_t; θ), then E[∂ln f̂_p(x_{t+1} | x_t; θ)/∂θ | x_t] = 0. However, this will not necessarily be the case if we approximate f(x_{t+1} | x_t; θ), i.e. if f̂_p(x_{t+1} | x_t; θ) ≠ f(x_{t+1} | x_t; θ). The inclusion of the second term ensures the consistency of the proposed estimator, and it does not require convergence of the infinite expansion. However, as noted in the proof, if ln f̂_p → ln f as p goes to infinity, i.e. if the expansion converges, then the estimator becomes the ECF estimator proposed in Singleton (2001) with the optimal weight function and thus achieves ML efficiency. Conditions for the convergence of the Edgeworth/Gram–Charlier series are given in Cramér (1925) for the univariate case and in Skovgaard (1986) for the multivariate case. As Cramér (1946, p. 224) notes, 'in practical applications it is in most cases only of little value to know the convergence properties of our expansions. What we really want to know is whether a small number of terms—usually not more than two or three—suffice to give a good approximation to f(x)'. In our simulations, reported in Section 4, we note that an expansion involving corrections for skewness and kurtosis does indeed work very well.
¹ Note that the estimating equation in (2.9) is equivalent to the ECF procedure in (2.8). One can thus resort to (2.8) for model estimation when the conditional expectation in (2.9) is difficult to compute. We note that when an analytical expression of the CCF is unavailable, an approximate efficient estimator based on discretizing the integral in (2.8) is proposed in Singleton (2001), and a simulated method of moments is proposed in Carrasco et al. (2007).
G. J. Jiang and J. L. Knight
Since the QML estimation is a special case of approximate MLE, it is also a special case of our estimator (with expansion order p = 2). In fact, the consistency of QML estimation can be easily seen from our framework. For this and one technical note, we add the following remarks.

REMARK 2.1. If one were merely to approximate f(x_{t+1} | x_t; θ) by a normal density, i.e. to use only the initial approximating density, f̂_p(x_{t+1} | x_t; θ) = f_0(x_{t+1} | x_t; θ) ≠ f(x_{t+1} | x_t; θ), the approach is essentially QML estimation. From (2.4), we have

E[ ∂ln f̂_p(x_{t+1} | x_t; θ)/∂θ | x_t ] = 0,

which shows, via the results in Bollerslev and Wooldridge (1992), the consistency of QML estimation.

REMARK 2.2. The reader may be concerned that in the definitions (2.2) and (2.7) the score and approximate score may not be integrable. However, using the theory of generalized functions these integrals will exist. In the univariate case, the approximate score is a polynomial in x_{t+1}, and consequently the approximate optimal weight will be a function involving the Dirac delta function and its derivatives. The result is similar in the multivariate case, with the multivariate Dirac delta function playing the central role.
2.2. The approximate MLE (C-AMLE) versus other methods

In the following we further derive the estimating equations based on (2.9) and illustrate the similarities and differences between our estimator and other estimation methods. From the definition of the Hermite polynomials we can readily establish that

∂h_i/∂θ = (∂λ_{i,j}/∂θ)(x^j − λ^j) − λ_{i,j} ∂λ^j/∂θ = λ̄_{i,j}(x^j − λ^j) − λ_{i,j}λ̄^j = h̄_i − z̄_i
∂h_{ijk}/∂θ = (∂h_i/∂θ) h_j h_k [3] − (∂h_i/∂θ) λ_{j,k} [3] − h_i (∂λ_{j,k}/∂θ) [3]
∂h_{ijkl}/∂θ = (∂h_i/∂θ) h_j h_k h_l [4] − h_i h_j (∂λ_{k,l}/∂θ) [6] − (∂h_i/∂θ) h_j λ_{k,l} [12] + (∂λ_{i,j}/∂θ) λ_{k,l} [6].

Substituting these derivatives into the expansion given by (2.6) and taking expectations, we can derive the appropriate estimating equations, which are stated in the following lemma.

LEMMA 2.2. For an N-dimensional Markov process with known CCF associated with an unknown transition density, following the ECF estimation procedure with the approximate optimal weight function, the use of an Edgeworth/Gram–Charlier approximation for the unknown transition density as in (2.5) results in the following estimating equations:

(1/T) Σ_{t=1}^T { (∂λ^i/∂θ) h_i − (1/2)(∂λ_{i,j}/∂θ)[ (x^i − λ^i)(x^j − λ^j) − λ^{i,j} ]
  + (1/6)[ (∂K^{i,j,k}/∂θ)( h_{ijk} − E(h_{ijk} | x_t) )
    + 3K^{i,j,k}( h̄_i h_j h_k − E[h̄_i h_j h_k | x_t] − z̄_i( h_j h_k − E[h_j h_k | x_t] ) − h̄_i λ_{j,k} − h_i ∂λ_{j,k}/∂θ ) ]
  + (1/24)[ (∂K^{i,j,k,l}/∂θ)( h_{ijkl} − E(h_{ijkl} | x_t) )
    + K^{i,j,k,l}( 4( h̄_i h_j h_k h_l − E[h̄_i h_j h_k h_l | x_t] ) − 4z̄_i( h_j h_k h_l − E[h_j h_k h_l | x_t] )
      − 12( h̄_i h_j λ_{k,l} − E[h̄_i h_j λ_{k,l} | x_t] ) + 12 z̄_i h_j λ_{k,l}
      − 6( h_i h_j − E[h_i h_j | x_t] ) ∂λ_{k,l}/∂θ ) ] } = 0.   (2.11)
If θ is of dimension Q, then there will be Q such equations, the solution of which leads to the approximate ML estimator. The results in Lemma 2.2 underline the distinguishing feature of our method: while the estimating equations are derived from an approximation of the score, the approximate scores are not used directly for estimation, as in the common AMLE. Instead, these scores are used to construct the optimal weight function in the ECF estimation procedure, and the estimating equations involve only certain conditional cumulants. Clearly, an advantage of the proposed method is that, since only conditional cumulants are required, it applies to general Markov processes for which the conditional cumulants up to a certain order are available in closed form. The following remark discusses the relationship between our method and alternative estimation methods in the literature.

REMARK 2.3. The estimating equations in (2.11) show clearly that, like the method of moments (MM) and GMM, our method is based on conditional cumulants or, equivalently, conditional moments of various orders. Our method resembles the MM in that the number of moment restrictions equals the dimension of the parameter vector, but differs in that the estimating equations may involve more moments than there are parameters. Conversely, it resembles GMM in that the number of moments involved in the estimation may exceed the dimension of the parameter vector, but differs in that the number of moment restrictions still equals the number of parameters. In particular, whereas GMM relies on a numerical procedure to find the optimal weights for the various moment conditions, the moment restrictions in our method carry their own non-linear structure, with the specific weights of the various moment conditions defined by the estimating equations.
While the estimating equations in the general case are cumbersome, as in (2.11), in the univariate case they collapse into a system of conventional moment conditions reminiscent of the well-known method of moments, as we now illustrate. In the univariate case we essentially just let i = j = k = l = 1 and drop the superscript index, so that

λ_{11} = 1/K_2,  h_1 = (x_{t+1} − K_1)/K_2,
h_{11} = h_1² − 1/K_2,  h_{111} = h_1³ − 3h_1/K_2,  h_{1111} = h_1⁴ − 6h_1²/K_2 + 3/K_2²,

with

∂h_1/∂θ = −h_1(∂K_2/∂θ)/K_2 − (∂K_1/∂θ)/K_2,
h_1 ∂h_1/∂θ = −h_1²(∂K_2/∂θ)/K_2 − h_1(∂K_1/∂θ)/K_2,
h_1² ∂h_1/∂θ = −h_1³(∂K_2/∂θ)/K_2 − h_1²(∂K_1/∂θ)/K_2,
h_1³ ∂h_1/∂θ = −h_1⁴(∂K_2/∂θ)/K_2 − h_1³(∂K_1/∂θ)/K_2.

Moreover,

E[h_1 | x_t] = 0,  E[h_1² | x_t] = 1/K_2,  E[h_{11} | x_t] = 0,  E[h_{111} | x_t] = K_3/K_2³,
E[h_{1111} | x_t] = E[h_1⁴ | x_t] − 6E[h_1² | x_t]/K_2 + 3/K_2² = K_4/K_2⁴.
The estimating equations collapse to

(1/T) Σ_{t=1}^T { (∂K_1/∂θ) h_1 + (1/2)(∂K_2/∂θ)( h_1² − 1/K_2 )
  + (1/6)[ (∂K_3/∂θ)( h_1³ − K_3/K_2³ − (3/K_2) h_1 )
    − K_3( (3(∂K_2/∂θ)/K_2)( h_1³ − K_3/K_2³ ) + (3(∂K_1/∂θ)/K_2)( h_1² − 1/K_2 ) − (6(∂K_2/∂θ)/K_2²) h_1 ) ]
  + (1/24)[ (∂K_4/∂θ)( h_1⁴ − (K_4 + 3K_2²)/K_2⁴ − (6/K_2)( h_1² − 1/K_2 ) )
    − K_4( (4(∂K_2/∂θ)/K_2)( h_1⁴ − (K_4 + 3K_2²)/K_2⁴ ) + (4(∂K_1/∂θ)/K_2)( h_1³ − K_3/K_2³ )
      − (18(∂K_2/∂θ)/K_2²)( h_1² − 1/K_2 ) − (12(∂K_1/∂θ)/K_2²) h_1 ) ] } = 0.   (2.12)
Again, the derivatives are taken with respect to all elements of the Q-dimensional parameter vector θ. The above estimating equations can readily be put into a more recognizable form by combining the coefficients on (h_1^j − E(h_1^j | x_t)), j = 1, 2, 3, 4 (p = 4). Letting A_{ijt} be the appropriate coefficient on (h_1^j − E(h_1^j | x_t)), j = 1, 2, 3, 4 (p = 4), associated with the derivative with respect to the ith element of θ, we have the estimating equations given by

(1/T) Σ_{t=1}^T A_t g_t = 0,
where A_t is a Q × 4 (p = 4) matrix whose ith row is associated with ∂/∂θ_i, and g_t is a 4 × 1 (p = 4) vector given by

g_t = ( h_1, h_1² − 1/K_2, h_1³ − K_3/K_2³, h_1⁴ − (K_4 + 3K_2²)/K_2⁴ )′.

More specifically, with the derivatives taken with respect to θ_i, we have

A_{1t} = ∂K_1/∂θ − (∂K_3/∂θ)/(2K_2) + K_3(∂K_2/∂θ)/K_2² + K_4(∂K_1/∂θ)/(2K_2²)
A_{2t} = (∂K_2/∂θ)/2 − K_3(∂K_1/∂θ)/(2K_2) − (∂K_4/∂θ)/(4K_2) + 3K_4(∂K_2/∂θ)/(4K_2²)
A_{3t} = (∂K_3/∂θ)/6 − K_3(∂K_2/∂θ)/(2K_2) − K_4(∂K_1/∂θ)/(6K_2)
A_{4t} = (∂K_4/∂θ)/24 − K_4(∂K_2/∂θ)/(6K_2).
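Given closed-form conditional cumulants, assembling A_t and g_t is mechanical. The following sketch is our own code (the functions `cumulants` and `dcumulants`, returning K_1, ..., K_4 and their θ-derivatives, are user-supplied); it evaluates the sample estimating function (1/T)Σ_t A_t g_t, whose root is the C-AMLE:

```python
import numpy as np

def camle_equations(x, theta, cumulants, dcumulants):
    """(1/T) sum_t A_t g_t for a univariate Markov chain x_0, ..., x_T.
    cumulants(x_prev, theta)  -> (K1, K2, K3, K4)
    dcumulants(x_prev, theta) -> (Q, 4) array, columns dK1..dK4 w.r.t. theta."""
    total = np.zeros(len(theta))
    for t in range(len(x) - 1):
        K1, K2, K3, K4 = cumulants(x[t], theta)
        dK1, dK2, dK3, dK4 = dcumulants(x[t], theta).T
        h1 = (x[t + 1] - K1) / K2
        g = np.array([h1,
                      h1**2 - 1.0 / K2,
                      h1**3 - K3 / K2**3,
                      h1**4 - (K4 + 3.0 * K2**2) / K2**4])
        A = np.column_stack([
            dK1 - dK3 / (2*K2) + K3*dK2 / K2**2 + K4*dK1 / (2*K2**2),
            dK2 / 2 - K3*dK1 / (2*K2) - dK4 / (4*K2) + 3*K4*dK2 / (4*K2**2),
            dK3 / 6 - K3*dK2 / (2*K2) - K4*dK1 / (6*K2),
            dK4 / 24 - K4*dK2 / (6*K2)])       # Q x 4, rows indexed by theta_i
        total += A @ g
    return total / (len(x) - 1)
```

Any standard root-finder can then be applied to `camle_equations`. Note that with K_3 = K_4 = 0 the third and fourth columns of A_t vanish, and the equations reduce to the Gaussian QML score, consistent with Remark 2.1.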
3. ILLUSTRATIVE EXAMPLES OF MARKOV PROCESSES

EXAMPLE 3.1. The Ornstein–Uhlenbeck Process (Equivalence to MLE) (Univariate Continuous-Time Gaussian Process). The OU process is a univariate diffusion process specified by the following stochastic differential equation:

dx_t = β(α − x_t)dt + σ dw_t,   (3.1)

where w_t is a standard Brownian motion. The OU process has a normal transition density function given by

f(x_{t+τ} | x_t; θ) = (1/√(2πs²)) exp{ −(x_{t+τ} − α − (x_t − α)e^{−βτ})²/(2s²) },

where s² = (σ²/2β)(1 − e^{−2βτ}). The conditional log-likelihood is given by ln L = Σ_{t=1}^T ln f(x_{t+1} | x_t; θ), and maximization of the likelihood function leads to the ML estimator. As a member of the affine class of diffusions, the OU process has CCF

φ(r, x_{t+1} | x_t; θ) = exp{ ir(α + (x_t − α)e^{−β}) − (r²σ²/4β)(1 − e^{−2β}) }.

The conditional cumulants can be easily derived as

K_1 = α + (x_t − α)e^{−β},  K_2 = (σ²/2β)(1 − e^{−2β}),  K_i = 0 ∀ i ≥ 3.

Substituting the cumulants into the estimating equation (2.12) for the univariate case, we have

(1/T) Σ_{t=1}^T { (∂K_1/∂θ) h_1 + (1/2)(∂K_2/∂θ)( h_1² − 1/K_2 ) } = 0,

where h_1 = (x_{t+1} − K_1)/K_2 and θ = (α, β, σ). It is straightforward to verify that this is equivalent to MLE.
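As a quick numerical sanity check on the OU cumulants above (our own sketch, with arbitrary parameter values; it is not part of the paper), a fine Euler discretization of (3.1) reproduces K_1 and K_2 over a unit interval:

```python
import numpy as np

alpha, beta, sigma = 0.1, 0.8, 0.3
x0, tau, nsteps, npaths = 0.5, 1.0, 1000, 50_000
dt = tau / nsteps

# closed-form conditional cumulants of the OU transition over horizon tau
K1 = alpha + (x0 - alpha) * np.exp(-beta * tau)
K2 = sigma**2 / (2.0 * beta) * (1.0 - np.exp(-2.0 * beta * tau))

# Euler scheme dx = beta*(alpha - x)*dt + sigma*dW, started at x0
rng = np.random.default_rng(42)
x = np.full(npaths, x0)
for _ in range(nsteps):
    x += beta * (alpha - x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(npaths)
```

The simulated cross-sectional mean and variance at time τ agree with K_1 and K_2 up to Monte Carlo and discretization error.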
EXAMPLE 3.2. The Square-Root Diffusion Process (Univariate Continuous-Time Non-Gaussian Process). The square-root process is a univariate diffusion specified by the following stochastic differential equation:

dx_t = β(α − x_t)dt + σ√x_t dw_t.   (3.2)

The transition density of the square-root process is non-central chi-square and the marginal density is a gamma function. The square-root process is also a member of the affine class of diffusions and has the following CCF:

φ(r, x_{t+τ} | x_t; θ) = (1 − ir/c)^{−(q+1)} exp{ ir e^{−βτ} x_t / (1 − ir/c) },

where c = 2β/(σ²(1 − e^{−βτ})) and q = 2αβ/σ² − 1. The first four conditional cumulants can be easily derived:

K_1 = α(1 − e^{−βτ}) + x_t e^{−βτ}
K_2 = (ασ²/2β)(1 − e^{−βτ})² + (x_tσ²/β) e^{−βτ}(1 − e^{−βτ})
K_3 = (ασ⁴/2β²)(1 − e^{−βτ})³ + (3x_tσ⁴/2β²) e^{−βτ}(1 − e^{−βτ})²
K_4 = (3ασ⁶/4β³)(1 − e^{−βτ})⁴ + (3x_tσ⁶/β³) e^{−βτ}(1 − e^{−βτ})³.
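Since 2c·x_{t+τ} | x_t is non-central chi-square with df = 2q + 2 and non-centrality 2cx_te^{−βτ}, the four cumulants above can be cross-checked deterministically against the known non-central chi-square cumulants κ_n = 2^{n−1}(n − 1)!(df + n·nonc), scaled by (2c)^{−n}. The check below is our own sketch:

```python
import math
import numpy as np

def cir_cumulants(x_prev, alpha, beta, sigma, tau):
    """K1..K4 of x_{t+tau} | x_t for the square-root diffusion, as displayed above."""
    e = np.exp(-beta * tau)
    K1 = alpha * (1 - e) + x_prev * e
    K2 = alpha * sigma**2 / (2 * beta) * (1 - e)**2 \
         + x_prev * sigma**2 / beta * e * (1 - e)
    K3 = alpha * sigma**4 / (2 * beta**2) * (1 - e)**3 \
         + 3 * x_prev * sigma**4 / (2 * beta**2) * e * (1 - e)**2
    K4 = 3 * alpha * sigma**6 / (4 * beta**3) * (1 - e)**4 \
         + 3 * x_prev * sigma**6 / beta**3 * e * (1 - e)**3
    return np.array([K1, K2, K3, K4])

# reference: cumulants of the scaled non-central chi-square transition law
alpha, beta, sigma, tau, x0 = 0.075, 0.8, 0.1, 1.0, 0.08
c = 2 * beta / (sigma**2 * (1 - np.exp(-beta * tau)))
df = 4 * alpha * beta / sigma**2                 # = 2q + 2
nonc = 2 * c * x0 * np.exp(-beta * tau)
ref = np.array([2**(n - 1) * math.factorial(n - 1) * (df + n * nonc) / (2 * c)**n
                for n in (1, 2, 3, 4)])
```

The two sets of cumulants agree to machine precision.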
Again, it is straightforward to apply the estimating equation (2.12) for the univariate case using the above results. The derivatives of these cumulants with respect to the parameter vector θ = (α, β, σ²) can be easily derived.

EXAMPLE 3.3. The Autoregressive Gamma Process (Univariate Discrete-Time Non-Linear Non-Gaussian Process). Darolles et al. (2006) introduced a class of non-linear Markov processes, termed CAR processes, in the following general non-linear Markov framework: y_t = h(y_{t−1}, ε_t, θ), where ε_t is an i.i.d. stochastic process and θ is the parameter vector. The CAR processes are characterized by conditional log-Laplace transforms that are affine functions of the lagged variables of the process. With the closed-form conditional log-Laplace transform, or equivalently the CCF of the state variables, the method proposed in this paper can readily be used for statistical inference in this class of processes. Among the CAR processes, a particularly interesting model is the discrete-time counterpart of the square-root diffusion process in continuous time (as presented in Example 3.2 above), namely the autoregressive gamma process. The process is specified through the following compounding distribution:

y_t | x_t ∼ γ(δ + x_t),  x_t | y_{t−1} ∼ P(βy_{t−1}),   (3.3)

where γ(·) and P(·) are the gamma and Poisson distributions, respectively, and θ = (β, δ) is the parameter vector. Similar to the square-root diffusion process in continuous time in Example 3.2, the transition density of the process has a non-central gamma distribution and the marginal density of the process has a gamma distribution, namely y_t | y_{t−1} ∼ γ(δ, βy_{t−1}) and (1 − β)y_t ∼ γ(δ). The process, first introduced in Gouriéroux and Jasiak (2000), has the following CCF: φ(r, y_t | y_{t−1}) = E[exp{iry_t} | y_{t−1}] = exp{−a(r)y_{t−1} + b(r)}, where a(r) = −iβr/(1 − ir) and b(r) = −δ ln(1 − ir); the marginal characteristic function is ψ(r, y_t) = E[exp{iry_t}] = exp{c(r)}, where c(r) = −δ ln(1 − ir/(1 − β)). The process is stationary when |β| < 1. The conditional cumulants can be derived as

K_1 = βy_{t−1} + δ,  K_2 = 2βy_{t−1} + δ,  K_3 = 6βy_{t−1} + 2δ,  K_4 = 24βy_{t−1} + 6δ.
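The compounding representation in (3.3) makes the ARG process trivial to simulate, which in turn gives a direct Monte Carlo check on the conditional cumulants just displayed (our own sketch; the parameter values are arbitrary):

```python
import numpy as np

beta, delta, y_prev = 0.5, 1.0, 2.0
rng = np.random.default_rng(7)

# one-step transition: z | y_prev ~ Poisson(beta*y_prev), y | z ~ Gamma(delta + z, 1)
z = rng.poisson(beta * y_prev, size=1_000_000)
y = rng.gamma(delta + z, 1.0)

# empirical conditional cumulants of y_t given y_{t-1} = y_prev
m = y.mean()
c2 = y.var()
c3 = ((y - m)**3).mean()
c4 = ((y - m)**4).mean() - 3.0 * c2**2
```

The sample cumulants match βy_{t−1} + δ, 2βy_{t−1} + δ, 6βy_{t−1} + 2δ and 24βy_{t−1} + 6δ up to Monte Carlo error.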
It is straightforward to apply the estimating equation (2.12) for the univariate case using the above results.

EXAMPLE 3.4. The Double OU Process (Equivalence to MLE) (Bivariate Continuous-Time Gaussian Process).

dy_t = κ(x_t − y_t)dt + σ dw_t^y
dx_t = −βx_t dt + σ_x dw_t^x,  t ∈ [0, T],   (3.4)

where κ, β > 0 and κ ≠ β. The process y_t exhibits linear mean reversion to the conditional mean x_t, which itself follows a mean-reverting process. The transition density of the process is given by

f(y_{t+τ}, x_{t+τ} | F_t) = (2π)^{−1}|V|^{−1/2} exp{ −(1/2)(y_{t+τ} − m_y, x_{t+τ} − m_x) V^{−1} (y_{t+τ} − m_y, x_{t+τ} − m_x)′ },

where m_y = E[y_{t+τ} | F_t], m_x = E[x_{t+τ} | F_t] and

V = [ Var[y_{t+τ} | F_t]          Cov(y_{t+τ}, x_{t+τ} | F_t)
      Cov(x_{t+τ}, y_{t+τ} | F_t)  Var[x_{t+τ} | F_t] ].

The conditional log-likelihood is given by ln L = Σ_{t=1}^T ln f(y_{t+1}, x_{t+1} | y_t, x_t; θ), where θ = (κ, σ, β, σ_x), and maximizing the likelihood function leads to the ML estimator. The joint CCF of (y_{t+τ}, x_{t+τ}) can be derived in closed form and is given in Appendix B. The conditional cumulants of various orders can be derived either from the closed-form expression of the CCF or through the moment–cumulant relationship:

μ_1 = K_1
μ_2 = K_2 + K_1²
μ_3 = K_3 + 3K_2K_1 + K_1³
μ_4 = K_4 + 4K_3K_1 + 3K_2² + 6K_2K_1² + K_1⁴.

For relations between higher-order moments and cumulants, see Kendall and Stuart (1977). Note that superscript notation is used for cumulants in the multivariate case, with 1 referring to cumulants associated with y_t and 2 to cumulants associated with x_t. We have

K¹ = E[y_{t+τ} | F_t],  K^{1,1} = Var[y_{t+τ} | F_t],  K^{1,1,1} = K^{1,1,1,1} = 0,
K² = E[x_{t+τ} | F_t],  K^{2,2} = Var[x_{t+τ} | F_t],  K^{2,2,2} = K^{2,2,2,2} = 0,
K^{1,2} = Cov(y_{t+τ}, x_{t+τ} | F_t),  K^{1,1,2} = K^{1,2,2} = K^{1,1,2,2} = 0.

Substituting the cumulants into the estimating equation (2.11) for the multivariate case, we have the jth estimating equation given by

(1/T) Σ_{t=1}^T { (∂K/∂θ_j)′ h + (1/2)[ h′ (∂V/∂θ_j) h − tr( V^{−1} ∂V/∂θ_j ) ] } = 0,

where h = (h_1, h_2)′, ∂K/∂θ_j = (∂K¹/∂θ_j, ∂K²/∂θ_j)′, and we have used the fact that ∂V^{−1}/∂θ_j = −V^{−1}(∂V/∂θ_j)V^{−1}, with j = 1, 2, 3, 4. It is straightforward to verify that this is equivalent to ML estimation.
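The moment–cumulant relations listed above recur in every example, so it is convenient to encode them once. A small helper (our own, not from the paper), with a Monte Carlo sanity check on a Gaussian, for which K_3 = K_4 = 0:

```python
import numpy as np

def moments_from_cumulants(K1, K2, K3, K4):
    """Raw moments mu_1..mu_4 from the first four cumulants (relations above)."""
    mu1 = K1
    mu2 = K2 + K1**2
    mu3 = K3 + 3*K2*K1 + K1**3
    mu4 = K4 + 4*K3*K1 + 3*K2**2 + 6*K2*K1**2 + K1**4
    return np.array([mu1, mu2, mu3, mu4])

# Gaussian check: N(1, 0.25) has cumulants (1, 0.25, 0, 0)
rng = np.random.default_rng(3)
z = 1.0 + 0.5 * rng.standard_normal(1_000_000)
emp = np.array([z.mean(), (z**2).mean(), (z**3).mean(), (z**4).mean()])
```

The empirical raw moments agree with `moments_from_cumulants(1.0, 0.25, 0.0, 0.0)` up to Monte Carlo error.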
EXAMPLE 3.5. The Discrete-Time Stochastic Volatility Process (Bivariate Discrete-Time Non-Linear Process). The following process is often used in the finance literature to model the asset return process with stochastic conditional volatility:

x_t = e^{h_t/2} ε_t
h_t = α + βh_{t−1} + σν_t
(ε_t, ν_t)′ ∼ N( (0, 0)′, [1 ρ; ρ 1] ).   (3.5)

It is noted that the return process and the stochastic volatility are correlated, reflecting the leverage effect when ρ < 0. While the stochastic volatility is in general unobserved, a recent literature has shown that it can be either estimated from high-frequency return observations or implied from the derivatives market. As demonstrated in recent studies such as Andersen et al. (2001) and Barndorff-Nielsen and Shephard (2002a,b), daily volatility can be very well approximated from high-frequency intradaily returns. Such an estimator or proxy of the stochastic volatility can readily be used for the estimation of the process, and in the subsequent analysis we apply the proposed method to the SV process in the context of bivariate Markov processes.

For the model specified in (3.5) there is no closed-form expression for the CCF; however, we can still derive the necessary conditional moments of (x_t, h_t) and use them in our estimation. Substituting h_t into the process for x_t, we have x_t = e^{(α+βh_{t−1})/2} · e^{σν_t/2} · ε_t. Letting g_t = h_t − α − βh_{t−1}, we aim to derive the conditional moments E[x_t^n g_t^m | h_{t−1}], n, m = 0, 1, . . . . Decomposing the correlated normal random variables as ν_t = ρε_t + √(1 − ρ²)u_t, with u_t independent of ε_t, we have

E[x_t^n g_t^m | h_{t−1}] = e^{n(α+βh_{t−1})/2} · E[ e^{nσ√(1−ρ²)u_t/2} · e^{nσρε_t/2} · ε_t^n · (σν_t)^m | h_{t−1} ]
  = e^{n(α+βh_{t−1})/2} Σ_{q=0}^m (m choose q) E[ (σ√(1−ρ²)u_t)^{m−q} e^{nσ√(1−ρ²)u_t/2} ] · E[ (σρε_t)^q ε_t^n e^{nσρε_t/2} ].

Since E[z^n e^{rz}] = ϕ^{(n)}(r) = ∂^n e^{r²/2}/∂r^n for z ∼ N(0, 1), we thus have

E[ (σ√(1−ρ²)u_t)^{m−q} e^{nσ√(1−ρ²)u_t/2} ] = (σ√(1−ρ²))^{m−q} ϕ^{(m−q)}( nσ√(1−ρ²)/2 )
E[ (σρε_t)^q ε_t^n e^{nσρε_t/2} ] = (σρ)^q ϕ^{(q+n)}( nσρ/2 ).
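The identity E[z^n e^{rz}] = ϕ^{(n)}(r), z ∼ N(0, 1), does all the work in these moment calculations. A direct Monte Carlo check for n = 2, 3 (our own sketch, with the derivatives of e^{r²/2} written out by hand):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(2_000_000)
r = 0.3

base = np.exp(r**2 / 2.0)
phi2 = (1.0 + r**2) * base       # second derivative of e^{r^2/2}
phi3 = (3.0 * r + r**3) * base   # third derivative of e^{r^2/2}

emp2 = (z**2 * np.exp(r * z)).mean()
emp3 = (z**3 * np.exp(r * z)).mean()
```

Both empirical expectations reproduce the corresponding derivatives of ϕ up to Monte Carlo error.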
The conditional moments of various orders can then be readily derived. For instance, when n = m = 2, we have E[x_t² g_t² | h_{t−1}] = σ²(1 + σ² + ρ²(2 + 5σ² + σ⁴)) · e^{α+βh_{t−1}+σ²/2}. The conditional cumulants of various orders can be easily derived via the cumulant–moment relationship, and substituting the cumulants into (2.11) for the multivariate case yields the necessary estimating equations.

EXAMPLE 3.6. The Continuous-Time Square-Root Volatility Process (Bivariate Continuous-Time Process). The following process with stochastic volatility is often used in the literature to model asset returns in a continuous-time framework:

dx_t = μdt + √v_t dw_t^s
dv_t = β(α − v_t)dt + σ√v_t dw_t^v
dw_t^s dw_t^v = ρdt,  t ∈ [0, T],   (3.6)

where β > 0. The volatility process v_t is known to be positive with a reflecting barrier at 0, which is attainable when 2αβ < σ². This model has been widely used in the finance literature for asset return dynamics, as it admits a closed-form expression for European option prices; see Heston (1993). As with the discrete-time SV model, we apply the method proposed in this paper to the continuous-time SV process in the context of bivariate Markov processes. The joint CCF of (x_{t+τ}, v_{t+τ}) can be derived in closed form and is given in Appendix B. Based on the closed-form expression of the characteristic function, analytical expressions for the conditional cumulants of any order can be derived as follows, with superscript 1 referring to x_t and 2 to v_t:

K¹ = μτ
K^{1,1} = ατ + α(1 − e^{βτ})/(βe^{βτ}) + v_t(e^{βτ} − 1)/(βe^{βτ})
K^{1,1,1} = 3e^{−βτ}( α(2 + βτ − e^{βτ}(2 − βτ)) − (1 − e^{βτ} + βτ)v_t )ρσ / β²
K^{1,1,1,1} = [ 3σ²(α − 2v_t)( (1 − e^{βτ})²τ²ρ²β² − (2ρ² + 1)(e^{βτ} − βτ − 1) ) ] / (2β³e^{2βτ})
  + [ 6σ²(α − v_t)( 1 + e^{βτ}(βτ − 1) − 8ρ²(e^{βτ} − 1) + 4ρ²τβ(1 + e^{βτ}) ) + 3ασ² ] / (β³e^{βτ})
K^{1,2} = e^{−βτ}( α(e^{βτ} − 1 − βτ) + βτ v_t )ρσ / β
K^{1,1,2} = e^{−2βτ}( 2(1 − e^{βτ}(1 − βτ − β²τ²ρ²))v_t − α(1 − e^{2βτ}(1 + 4ρ²) + 2e^{βτ}(βτ + 2ρ² + 2βτρ² + β²τ²ρ²)) )σ² / (2β²)
K^{1,2,2} = e^{−2βτ}( α(e^{βτ} − 1)(e^{βτ} − 1 − βτ) − (1 + 2βτ − e^{βτ}(1 + βτ))v_t )ρσ³ / β²
K^{1,1,2,2} = e^{−3βτ}( (−3 + 4e^{βτ}(1 − βτ − ρ² − 2βτρ² − 2β²τ²ρ²) − e^{2βτ}(1 − 2βτ − 4ρ² − 4βτρ² − 2β²τ²ρ²))v_t
  + α(1 + e^{3βτ}(1 + 6ρ²) − e^{2βτ}(1 + 12ρ² + 2β²τ²ρ² + 2β(τ + 4τρ²)) + e^{βτ}(−1 + 6ρ² + 4β²τ²ρ² + 2β(τ + 4τρ²))) )σ⁴ / (2β³).
260
G. J. Jiang and J. L. Knight
The cumulants with respect to vt with various orders are given in Example 3.2 for the squareroot diffusion process. Substituting the cumulants into (2.11) for the multivariate case, it is straightforward to derive the necessary estimating equations.
4. MONTE CARLO SIMULATIONS In this section, we investigate via simulation the performance of the estimation method proposed in the paper for the Markov processes in comparison with alternative estimation methods. In order to focus on the relative performance of various estimators, we restrict our simulations to the models with exact sampling paths. As a result we focus on the univariate continuous-time squareroot diffusion process and the bivariate discrete-time stochastic volatility process. Through these two models, we demonstrate the performance of our estimation method for both univariate and bivariate Markov processes and in both discrete time and continuous time frameworks. 4.1. The continuous-time square-root diffusion process The square-root diffusion process is specified in (3.2). The parameter values are set as α = 0.075, β = 0.80, σ = 0.100, which are close to the estimates of an interest rate process using historical U.S. three-month Treasury bill yields. The choice of parameter values gives an integer value of the degree of freedom for the non-central chi-square transition density function and makes it feasible to generate exact sampling paths.2 Thus, there is no discretization error and differences between different estimates are entirely due to the different estimation methods. We set two sampling intervals, i.e. = 1/4, 1 with sample size T = 250, 500. In each sampling path simulation, the first 200 observations are discarded to mitigate the start-up effect. The number of replications in the Monte Carlo simulation is 1000. The estimation methods we consider include the C-AML developed in this paper, the MLE, QMLE and GMM. 
MLE Solving from the Kolmogorov backward (or Fokker–Planck) equation or from the CCF via Fourier inversion, the transition density function of the square-root process can be obtained as ν q/2 Iq (2(uν)1/2 ), (4.1) f (xt | xt−τ ; θ ) = ce−u−ν u with xt taking non-negative values, where c = 2β/(σ 2 (1 − e−βτ )), u = cxt−τ e−βt , ν = − 1, and Iq (·) is the modified Bessel function of the first kind of order q. The cxt , q = 2βα σ2 transition density function is non-central chi-square, χ 2 [2cxt ; 2q + 2, 2u], with 2q + 2 degrees of freedom and parameter of non-centrality 2u proportional to the current level of the stochastic process. If the process displays the property of mean reversion (β > 0), the process is stationary and its marginal distribution can be derived from the transition density, which is a gamma ωs s−1 −ωxt xt e , where ω = 2β/σ 2 and s = 2αβ/σ 2 , probability density function, i.e. g(xt ; θ ) = (s) with mean α and variance ασ 2 /2β. The MLE is based on ∂ ln g(x1 ; θ ) ∂ ln f (xt+1 | xt ; θ ) + = 0. ∂θ ∂θ t=1 T
² The square-root diffusion process specified in (3.2) has a non-central chi-square transition density function with degrees of freedom 2q + 2, where q = 2βα/σ² − 1. Given the parameter values used in our simulation, q equals 11.
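The exact sampling paths referred to above can be generated directly from (4.1): 2c·x_t | x_{t−Δ} is non-central chi-square with 2q + 2 degrees of freedom and non-centrality 2u. A simulation sketch (our own code, not from the paper):

```python
import numpy as np

def simulate_cir_exact(alpha, beta, sigma, delta, T, x0, rng):
    """Exact discretization of the square-root diffusion via its non-central
    chi-square transition density (4.1); no discretization bias."""
    c = 2.0 * beta / (sigma**2 * (1.0 - np.exp(-beta * delta)))
    df = 4.0 * alpha * beta / sigma**2          # = 2q + 2
    e = np.exp(-beta * delta)
    x = np.empty(T + 1)
    x[0] = x0
    for t in range(T):
        x[t + 1] = rng.noncentral_chisquare(df, 2.0 * c * x[t] * e) / (2.0 * c)
    return x

path = simulate_cir_exact(0.075, 0.8, 0.1, 0.25, 50_000, 0.075,
                          np.random.default_rng(11))
```

Because the transition law is sampled exactly, long-run sample moments match the gamma marginal (mean α, variance ασ²/2β) up to Monte Carlo error.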
QMLE. The QMLE is based on the conditional mean and variance given in Example 3.2 of Section 3, together with the unconditional mean and variance of the square-root process given above. As mentioned earlier, the QMLE is equivalent to the C-AMLE proposed in this paper when the Edgeworth/Gram–Charlier expansion is truncated at order p = 2.

GMM. The same moment conditions as in Chan et al. (1992) are used for GMM, except that here they are derived from the continuous-time model and are not subject to discretization bias, i.e.

f_t(θ) = ( ε_t, ε_t² − E[ε_t² | x_{t−Δ}] )′,   (4.2)

where t = 1, 2, . . . , T and θ = (α, β, σ); ε_t = x_t − E[x_t | x_{t−Δ}], with E[x_t | x_{t−Δ}] = α(1 − e^{−βΔ}) + x_{t−Δ}e^{−βΔ} and E[ε_t² | x_{t−Δ}] = (σ²/β)(e^{−βΔ} − e^{−2βΔ})x_{t−Δ} + (ασ²/2β)(1 − e^{−βΔ})², and the lagged variable is used as the instrumental variable in the estimation.

The simulation results for the alternative estimators are reported in Table 1 for the different sampling intervals and sample sizes. Certain interesting observations on the estimation of the square-root process are worth noting. The unconditional mean parameter (α) tends to be straightforward to estimate: measured by both bias and mean squared error, all four methods have similar performance. The estimation of the conditional variance parameter (σ) improves as both the sample size and the sampling frequency increase; among the alternative estimators, the QMLE has the worst performance. The mean-reversion parameter (β) turns out to be the most difficult to estimate. Overall, there tends to be an upward bias for all estimators; as the sampling period increases, the upward bias is reduced. In other words, accurate estimation of β requires a large sample. Overall, the C-AML estimator performs as well as the ML estimator and consistently outperforms the other estimators. The difference from the QML and GMM estimators suggests that moments of order higher than two are informative and help to improve the parameter estimation in our procedure. On the other hand, we also performed GMM estimation using the same moment conditions as those used in the C-AML procedure. The performance of the GMM estimation generally deteriorates: for example, with sampling interval Δ = 1/4 and sample size T = 250, the root mean square error of the β estimates increases to 0.2302 (from 0.2211 as reported in Table 1). The deterioration is mainly caused by the increase in the standard error of the parameter estimates.
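For concreteness, the moment conditions in (4.2), interacted with the instruments (1, x_{t−Δ}), can be stacked as follows (our own arrangement; a GMM objective or root-finder is then applied on top):

```python
import numpy as np

def gmm_moment_means(x, alpha, beta, sigma, delta):
    """Sample means of the exact conditional moment conditions (4.2),
    interacted with instruments (1, x_{t-Delta}); ~0 at the true parameters."""
    e = np.exp(-beta * delta)
    xp, xc = x[:-1], x[1:]
    eps = xc - (alpha * (1.0 - e) + xp * e)                  # conditional-mean error
    v = sigma**2 / beta * (e - e**2) * xp \
        + alpha * sigma**2 / (2.0 * beta) * (1.0 - e)**2     # conditional variance
    u = np.column_stack([eps, eps * xp, eps**2 - v, (eps**2 - v) * xp])
    return u.mean(axis=0)
```

Evaluated on an exactly simulated square-root path at the true parameter values, all four sample moments are close to zero, as the exact conditional moments imply.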
Finally, it is noted that since the MLE involves numerical evaluation of the modified Bessel function of the first kind, the alternative estimators, including the C-AML estimator, all have a certain computational advantage.

4.2. The discrete-time stochastic volatility process

The discrete-time stochastic volatility process is specified in (3.5). We consider two sets of parameter values, namely {α, β, σ, ρ} = {−0.35, 0.95, 0.25, −0.50} and {−0.70, 0.90, 0.35, −0.25}, which are similar to the values used in Andersen and Sørensen (1996), except that we set non-zero values for ρ in our study. Again, there is no approximation error involved in the path simulation, and differences between estimates are due entirely to the different estimation methods. We set two sample sizes, T = 250, 500. In each sampling path simulation, the first 200 observations are discarded to mitigate the start-up effect. The number of replications in the Monte Carlo simulation is 1000. Since neither the transition density function nor the CCF is available in closed form for the SV process, we only include the QMLE and GMM in our comparison.
Table 1. Monte Carlo simulation results of the square-root diffusion model.

Panel A: Sampling interval Δ = 1/4, sample size N = 250

Parameter    Method   Mean     Median   SD       √m.s.e.  (95 percentiles)
α (0.075)    C-AML    0.0748   0.0746   0.0043   0.0043   (0.0671, 0.0829)
             ML       0.0748   0.0746   0.0043   0.0043   (0.0672, 0.0840)
             QML      0.0748   0.0746   0.0043   0.0043   (0.0672, 0.0838)
             GMM      0.0747   0.0746   0.0044   0.0044   (0.0667, 0.0839)
β (0.800)    C-AML    0.8925   0.8797   0.2002   0.2205   (0.5535, 1.3618)
             ML       0.8958   0.8803   0.1991   0.2208   (0.5486, 1.3545)
             QML      0.8929   0.8790   0.2089   0.2285   (0.5221, 1.3691)
             GMM      0.8840   0.8692   0.2046   0.2211   (0.5465, 1.3476)
σ (0.100)    C-AML    0.0995   0.0994   0.0050   0.0051   (0.0894, 0.1084)
             ML       0.0994   0.0994   0.0051   0.0051   (0.0896, 0.1100)
             QML      0.0712   0.0712   0.0038   0.0291   (0.0642, 0.0792)
             GMM      0.0997   0.0996   0.0055   0.0055   (0.0894, 0.1103)
Panel B: Sampling interval Δ = 1/4, sample size N = 500

Parameter    Method   Mean     Median   SD       √m.s.e.  (95 percentiles)
α (0.075)    C-AML    0.0748   0.0747   0.0031   0.0031   (0.0686, 0.0802)
             ML       0.0748   0.0746   0.0031   0.0031   (0.0687, 0.0811)
             QML      0.0748   0.0746   0.0031   0.0031   (0.0688, 0.0811)
             GMM      0.0747   0.0746   0.0031   0.0031   (0.0687, 0.0810)
β (0.800)    C-AML    0.8508   0.8420   0.1374   0.1465   (0.6133, 1.1521)
             ML       0.8517   0.8427   0.1370   0.1464   (0.6094, 1.1480)
             QML      0.8453   0.8337   0.1450   0.1519   (0.5921, 1.1606)
             GMM      0.8428   0.8301   0.1403   0.1466   (0.6016, 1.1400)
σ (0.100)    C-AML    0.0995   0.0991   0.0034   0.0035   (0.0922, 0.1050)
             ML       0.0990   0.0990   0.0035   0.0036   (0.0924, 0.1056)
             QML      0.0709   0.0709   0.0026   0.0292   (0.0659, 0.0759)
             GMM      0.0998   0.0997   0.0037   0.0037   (0.0927, 0.1074)
Panel C: Sampling interval Δ = 1, sample size N = 250

Parameter    Method   Mean     Median   SD       √m.s.e.  (95 percentiles)
α (0.075)    C-AML    0.0749   0.0749   0.0023   0.0023   (0.0705, 0.0793)
             ML       0.0749   0.0749   0.0022   0.0022   (0.0709, 0.0796)
             QML      0.0749   0.0749   0.0022   0.0022   (0.0709, 0.0796)
             GMM      0.0750   0.0749   0.0023   0.0023   (0.0708, 0.0797)
β (0.800)    C-AML    0.8333   0.8236   0.1424   0.1462   (0.5896, 1.1490)
             ML       0.8385   0.8246   0.1364   0.1417   (0.5955, 1.1497)
             QML      0.8376   0.8203   0.1473   0.1520   (0.5883, 1.1710)
             GMM      0.8327   0.8200   0.1438   0.1474   (0.5861, 1.1631)
σ (0.100)    C-AML    0.1003   0.0997   0.0067   0.0067   (0.0881, 0.1137)
             ML       0.1002   0.1000   0.0066   0.0066   (0.0883, 0.1138)
             QML      0.0714   0.0711   0.0050   0.0291   (0.0625, 0.0818)
             GMM      0.1001   0.0997   0.0070   0.0070   (0.0874, 0.1143)

Panel D: Sampling interval Δ = 1, sample size N = 500
Parameter    Method   Mean     Median   SD       √m.s.e.  (95 percentiles)
α (0.075)    C-AML    0.0750   0.0749   0.0015   0.0015   (0.0718, 0.0779)
             ML       0.0749   0.0748   0.0016   0.0016   (0.0718, 0.0781)
             QML      0.0749   0.0748   0.0016   0.0016   (0.0718, 0.0781)
             GMM      0.0749   0.0749   0.0016   0.0016   (0.0718, 0.0781)
β (0.800)    C-AML    0.8145   0.8131   0.0991   0.1002   (0.6452, 1.0244)
             ML       0.8212   0.8146   0.0977   0.0999   (0.6531, 1.0294)
             QML      0.8173   0.8113   0.1051   0.1065   (0.6329, 1.0490)
             GMM      0.8151   0.8108   0.0998   0.1009   (0.6451, 1.0286)
σ (0.100)    C-AML    0.0998   0.0996   0.0047   0.0047   (0.0907, 0.1090)
             ML       0.0997   0.0996   0.0047   0.0047   (0.0911, 0.1093)
             QML      0.0710   0.0708   0.0036   0.0293   (0.0644, 0.0784)
             GMM      0.0999   0.0997   0.0049   0.0049   (0.0906, 0.1098)
QMLE. The QMLE is based on the conditional mean and variance of x_t and h_t, as well as their cross moment, given in Example 3.5 of Section 3, together with the unconditional mean and variance of x_t and h_t. Again, the QMLE is equivalent to the C-AMLE proposed in this paper when the Edgeworth/Gram–Charlier expansion has order 2, namely p = 2.

GMM. The following moment conditions are used in the GMM estimation:

    f_t(θ) = [ ν_t,  ν_t² − E[ν_t² | h_{t−1}],  ν_t x_t − E[ν_t x_t | h_{t−1}] ]′,    (4.3)

where ν_t = (h_t − α − βh_{t−1})/σ, t = 1, 2, . . . , T, θ = (α, β, σ, ρ), and the lagged variable h_{t−1} is used as instrumental variable in the estimation.

The simulation results for the alternative estimators are reported in Table 2 for the different parameter sets and sample sizes. The most noticeable difference between C-AML and the QMLE and GMM estimators is the performance of the ρ parameter estimation. As measured by both bias and root mean squared error, the C-AML estimator outperforms both QMLE and GMM, suggesting that higher-order moment conditions help to identify the correlation parameter and improve its estimation. It is not surprising that the estimation of α, β and σ performs similarly across all three estimators, as h_t follows a simple Gaussian AR(1) process. We note that in our simulations the simulated volatility process is also used in estimation; this differs from Andersen and Sørensen (1996), where volatility is a latent variable. For this reason, the non-convergence problem encountered in our simulations is much less severe than in Andersen and Sørensen (1996), and the relative performance of the alternative estimators is unaffected even when we include the non-convergence cases. Our further analysis of the GMM estimation also shows
Table 2. Monte Carlo simulation results of the discrete-time SV model.

Panel A: Sample size T = 250, Parameter Set I

Parameter    Method   Mean      Median    SD       √m.s.e.  (95 percentiles)
α (−0.35)    C-AML    −0.4321   −0.4207   0.1110   0.1380   (−0.6701, −0.2543)
             QML      −0.4626   −0.4501   0.1322   0.1736   (−0.7529, −0.2533)
             GMM      −0.4573   −0.4435   0.1318   0.1699   (−0.7427, −0.2497)
β (0.95)     C-AML    0.9383    0.9398    0.0158   0.0197   (0.9045, 0.9637)
             QML      0.9339    0.9362    0.0189   0.0248   (0.8926, 0.9638)
             GMM      0.9347    0.9367    0.0188   0.0243   (0.8936, 0.9642)
σ (0.25)     C-AML    0.2481    0.2484    0.0039   0.0044   (0.2399, 0.2550)
             QML      0.2476    0.2481    0.0043   0.0050   (0.2384, 0.2548)
             GMM      0.2473    0.2478    0.0044   0.0051   (0.2379, 0.2543)
ρ (−0.50)    C-AML    −0.4779   −0.4804   0.1297   0.1316   (−0.7136, −0.2192)
             QML      −0.4459   −0.4397   0.1385   0.1486   (−0.7267, −0.2044)
             GMM      −0.4460   −0.4384   0.1398   0.1498   (−0.7332, −0.2095)
Panel B: Sample size T = 500, Parameter Set I

Parameter    Method   Mean      Median    SD       √m.s.e.  (95 percentiles)
α (−0.35)    C-AML    −0.3978   −0.3864   0.0810   0.0940   (−0.5665, −0.2692)
             QML      −0.4119   −0.3982   0.0914   0.1104   (−0.6171, −0.2665)
             GMM      −0.4086   −0.3953   0.0913   0.1084   (−0.6123, −0.2636)
β (0.95)     C-AML    0.9432    0.9447    0.0116   0.0134   (0.9185, 0.9617)
             QML      0.9412    0.9432    0.0131   0.0158   (0.9118, 0.9621)
             GMM      0.9417    0.9437    0.0131   0.0155   (0.9126, 0.9623)
σ (0.25)     C-AML    0.2491    0.2493    0.0021   0.0022   (0.2449, 0.2526)
             QML      0.2489    0.2490    0.0023   0.0026   (0.2442, 0.2528)
             GMM      0.2487    0.2488    0.0023   0.0026   (0.2440, 0.2524)
ρ (−0.50)    C-AML    −0.5023   −0.5031   0.1014   0.1014   (−0.6512, −0.2918)
             QML      −0.4727   −0.4697   0.1098   0.1131   (−0.6965, −0.2762)
             GMM      −0.4726   −0.4700   0.1096   0.1129   (−0.6934, −0.2797)
Panel C: Sample size T = 250, Parameter Set II

Parameter    Method   Mean      Median    SD       √m.s.e.  (95 percentiles)
α (−0.70)    C-AML    −0.7970   −0.7781   0.1720   0.1975   (−1.1476, −0.5183)
             QML      −0.8093   −0.7955   0.1705   0.2024   (−1.1717, −0.5315)
             GMM      −0.7998   −0.7863   0.1706   0.1975   (−1.1569, −0.5161)
β (0.90)     C-AML    0.8861    0.8888    0.0245   0.0282   (0.8364, 0.9258)
             QML      0.8845    0.8867    0.0243   0.0288   (0.8338, 0.9242)
             GMM      0.8857    0.8877    0.0243   0.0282   (0.8348, 0.9260)
σ (0.35)     C-AML    0.3476    0.3480    0.0055   0.0060   (0.3370, 0.3572)
             QML      0.3465    0.3471    0.0058   0.0068   (0.3352, 0.3565)
             GMM      0.3467    0.3471    0.0057   0.0066   (0.3355, 0.3566)
ρ (−0.25)    C-AML    −0.2473   −0.2354   0.1082   0.1082   (−0.4202, −0.0229)
             QML      −0.2276   −0.2308   0.1120   0.1142   (−0.4371, −0.0143)
             GMM      −0.2289   −0.2304   0.1137   0.1156   (−0.4415, −0.0213)

Panel D: Sample size T = 500, Parameter Set II

Parameter    Method   Mean      Median    SD       √m.s.e.  (95 percentiles)
α (−0.70)    C-AML    −0.7430   −0.7401   0.1148   0.1225   (−0.9844, −0.5512)
             QML      −0.7491   −0.7415   0.1189   0.1286   (−0.9889, −0.5493)
             GMM      −0.7447   −0.7344   0.1186   0.1267   (−0.9877, −0.5456)
β (0.90)     C-AML    0.8939    0.8944    0.0164   0.0175   (0.8595, 0.9218)
             QML      0.8930    0.8943    0.0170   0.0183   (0.8590, 0.9212)
             GMM      0.8936    0.8950    0.0169   0.0181   (0.8593, 0.9219)
σ (0.35)     C-AML    0.3485    0.3487    0.0030   0.0034   (0.3424, 0.3537)
             QML      0.3479    0.3481    0.0033   0.0039   (0.3413, 0.3535)
             GMM      0.3479    0.3482    0.0033   0.0039   (0.3414, 0.3534)
ρ (−0.25)    C-AML    −0.2463   −0.2437   0.0779   0.0780   (−0.3710, −0.0925)
             QML      −0.2365   −0.2331   0.0845   0.0855   (−0.3990, −0.0872)
             GMM      −0.2368   −0.2337   0.0843   0.0853   (−0.4002, −0.0869)
that adding higher-order moments such as E[x_t³ | h_{t−1}] or E[x_t ν_t² | h_{t−1}] to those in (4.3) helps to identify the parameter ρ. However, adding these extra higher-order moment conditions to the GMM estimation generally increases the standard errors of the parameter estimates. As noted in Andersen and Sørensen (1996), this is likely due to the deterioration in the estimation of the weighting matrix as the number of moments increases. We also note that, as shown theoretically by Newey and Smith (2004), increasing the number of moment conditions increases the bias but decreases the standard error of the GMM estimator. To reconcile these two observations, we follow Andersen and Sørensen (1996) and perform the following simulations. First, we simulate 10,000 sample observations and estimate the so-called 'true' optimal weighting matrix in the GMM procedure. This 'true' optimal weighting matrix is then used in our simulations with different sampling frequencies and sample sizes. In all four cases of Table 2, we indeed observe a decrease in the standard errors of the GMM estimates. The results confirm the fundamental trade-off documented in Andersen and Sørensen (1996): for a given degree of precision in the estimation of the weighting matrix, including additional moments improves estimation performance, but in finite samples this must be balanced against the deterioration in the estimate of the weighting matrix as the number of moments increases. Since the C-AMLE procedure has a specific analytical structure for the moment restrictions in its estimating equations, the weighting matrix is available analytically and there is no need to estimate it numerically. This further underlines the advantage of our estimator over GMM and explains why our estimator achieves better performance in finite samples.
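The weighting-matrix trade-off just described can be made concrete. The sketch below is a hypothetical illustration, using i.i.d. N(0, 1) data and four centred moment conditions rather than the SV moments of the paper: the sample covariance of the moment vector, whose inverse is the optimal GMM weight, is estimated far more noisily at N = 250 than at N = 10,000, and the noise grows with the order of the moments involved.

```python
import numpy as np

rng = np.random.default_rng(2)

def moment_cov(n):
    # Sample covariance of the moment vector f(x) for an i.i.d. N(0,1) sample.
    x = rng.standard_normal(n)
    f = np.column_stack([x, x**2 - 1, x**3, x**4 - 3])   # E[f] = 0 under N(0,1)
    return np.cov(f, rowvar=False)

S_true = moment_cov(1_000_000)       # proxy for the population moment covariance
err_small = np.linalg.norm(moment_cov(250) - S_true)     # Frobenius errors
err_large = np.linalg.norm(moment_cov(10_000) - S_true)
print(err_small, err_large)          # estimation error shrinks with sample size
```

The entries of this covariance involve moments of the data up to order eight, which is why a short sample estimates it poorly; the C-AMLE sidesteps the problem because its weighting is available analytically.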
5. CONCLUSION

In this paper, we develop a new estimator for Markov models by combining an approximation to the logarithmic transition density with the first-order conditions associated with
the ECF estimation approach. The estimator is consistent. When the exact likelihood function is unavailable and exact ML is thus infeasible, the new approach provides an exact and parsimonious alternative. The method applies to processes with a closed-form CCF, e.g. the common continuous-time affine diffusion and jump-diffusion processes and the discrete-time compound autoregressive (CAR) processes introduced by Darolles et al. (2006). Furthermore, since only the conditional cumulants are required in the estimation, the method also applies to processes with analytical expressions for certain orders of conditional cumulants. Using various examples, we illustrate the application of our approach to Markov processes in both discrete and continuous time and in both univariate and multivariate cases. Monte Carlo simulations for selected models confirm that the estimator has desirable finite-sample performance in comparison with other methods.
ACKNOWLEDGMENTS

We wish to thank Yacine Aït-Sahalia, Alan Rogers, Peter Thompson, Stéphane Gregoir (the Editor), and two anonymous referees for helpful comments and suggestions, along with participants at the 2003 North American Econometric Society Meetings in Atlanta, the University of Toronto Economics Workshop, and the 2004 Financial Econometrics Conference at the University of Waterloo. Both authors acknowledge financial support from NSERC, Canada. The usual disclaimer applies.
REFERENCES

Aït-Sahalia, Y. (2002). Maximum likelihood estimation of discretely sampled diffusions: a closed-form approach. Econometrica 70, 223–62.
Aït-Sahalia, Y. (2003). Closed-form likelihood expansions for multivariate diffusions. Working paper, Princeton University and NBER.
Andersen, T. G., L. Benzoni and J. Lund (2002). An empirical investigation of continuous-time equity return models. Journal of Finance 57, 1239–84.
Andersen, T. G., T. Bollerslev, F. X. Diebold and P. Labys (2001). The distribution of realized exchange rate volatility. Journal of the American Statistical Association 96, 42–55.
Andersen, T. G. and B. E. Sørensen (1996). GMM estimation of a stochastic volatility model: a Monte Carlo study. Journal of Business and Economic Statistics 14, 328–52.
Barndorff-Nielsen, O. E. and N. Shephard (2002a). Econometric analysis of realised volatility and its use in estimating stochastic volatility models. Journal of the Royal Statistical Society, Series B 64, 252–80.
Barndorff-Nielsen, O. E. and N. Shephard (2002b). Estimating quadratic variation using realized variance. Journal of Applied Econometrics 17, 457–77.
Bollerslev, T. and J. Wooldridge (1992). Quasi-maximum likelihood estimation and inference in dynamic models with time-varying covariances. Econometric Reviews 11, 143–72.
Carrasco, M., M. Chernov, J.-P. Florens and E. Ghysels (2007). Efficient estimation of jump diffusions and general dynamic models with a continuum of moment conditions. Journal of Econometrics 140, 529–73.
Carrasco, M. and J. Florens (2000). Efficient GMM estimation using the empirical characteristic function. Document de travail 2000–33, CREST, Paris.
Chacko, G. and L. M. Viceira (2003). Spectral GMM estimation of continuous-time processes. Journal of Econometrics 116, 259–92.
Chan, K., G. A. Karolyi, F. A. Longstaff and A. B. Sanders (1992). An empirical comparison of alternative models of the short-term interest rate. Journal of Finance 47, 1209–27.
Chernov, M. and E. Ghysels (2000). A study toward a unified approach to the joint estimation of objective and risk-neutral measures for the purposes of options valuation. Journal of Financial Economics 56, 407–58.
Cramér, H. (1925). On some classes of series used in mathematical statistics. Proceedings of the Sixth Scandinavian Congress of Mathematicians, Copenhagen, 399–425.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton: Princeton University Press.
Darolles, S., C. Gouriéroux and J. Jasiak (2006). Structural Laplace transform and compound autoregressive models. Journal of Time Series Analysis 27, 477–503.
Duffie, D. and K. J. Singleton (1993). Simulated moments estimation of Markov models of asset prices. Econometrica 61, 929–52.
Feuerverger, A. (1990). An efficiency result for the empirical characteristic function in stationary time-series models. Canadian Journal of Statistics 18, 155–61.
Feuerverger, A. and P. McDunnough (1981). On some Fourier methods for inference. Journal of the American Statistical Association 76, 379–87.
Feuerverger, A. and R. A. Mureika (1977). The empirical characteristic function and its applications. Annals of Statistics 5, 88–97.
Fisher, M. and C. Gilles (1996). Estimating exponential affine models of the term structure. Working paper, Federal Reserve Bank of Atlanta.
Gallant, A. and J. Long (1997). Estimating stochastic differential equations efficiently by minimum chi-square. Biometrika 84, 125–41.
Gallant, A. R. and G. Tauchen (1996). Which moments to match? Econometric Theory 12, 657–81.
Gouriéroux, C. and J. Jasiak (2000). Compound gamma processes. Working paper, CREST, Paris.
Gouriéroux, C., A. Monfort and E. Renault (1993). Indirect inference. Journal of Applied Econometrics 8, S85–S118.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–54.
Hansen, L. P. and J. Scheinkman (1996). Back to the future: generating moment implications for continuous-time Markov processes. Econometrica 63, 767–804.
Heston, S. L. (1993). A closed-form solution for options with stochastic volatility with applications to bond and currency options. Review of Financial Studies 6, 327–44.
Jiang, G. J. and J. L. Knight (2002). Estimation of continuous-time processes via the empirical characteristic function. Journal of Business and Economic Statistics 20, 198–212.
Jones, C. S. (1998). Bayesian estimation of continuous-time finance models. Working paper, University of Rochester.
Kendall, M. and A. Stuart (1977). The Advanced Theory of Statistics, Volume 1. London: Charles Griffin.
Knight, J. and S. E. Satchell (1997). The cumulant generating function estimation method: implementation and asymptotic efficiency. Econometric Theory 13, 170–84.
Liu, J. (1997). Generalized method of moments estimation of affine diffusion processes. Working paper, Graduate School of Business, Stanford University.
McCullagh, P. (1987). Tensor Methods in Statistics. London: Chapman and Hall.
Newey, W. and R. Smith (2004). Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72, 219–55.
Schmidt, P. (1982). An improved version of the Quandt–Ramsey MGF estimator for mixtures of normal distributions and switching regressions. Econometrica 50, 501–24.
Singleton, K. J. (2001). Estimation of affine asset pricing models using the empirical characteristic function. Journal of Econometrics 102, 111–41.
Skovgaard, M. (1986). On multivariate Edgeworth expansions. International Statistical Review 54, 169–86.
Yu, J. (1998). Empirical characteristic function in time series estimation and a test statistic in financial modelling. Unpublished Ph.D. dissertation, University of Western Ontario.
APPENDIX A: PROOFS OF LEMMAS AND THEOREM

Proof of Lemma 2.1: The proof is essentially along the lines of Singleton (2001). From (2.2), we have

    ∂ ln f(x_{t+1} | x_t; θ)/∂θ = ∫ e^{ir′x_{t+1}} w(r, t | x_t; θ) dr

and

    ∫ w(r, t | x_t; θ) φ(r | x_t; θ) dr = ∫ w(r, t | x_t; θ) [∫ e^{ir′x_{t+1}} f(x_{t+1} | x_t; θ) dx_{t+1}] dr
        = ∫ [∫ w(r, t | x_t; θ) e^{ir′x_{t+1}} dr] f(x_{t+1} | x_t; θ) dx_{t+1}
        = E[∫ e^{ir′x_{t+1}} w(r, t | x_t; θ) dr | x_t]
        = E[∂ ln f(x_{t+1} | x_t; θ)/∂θ | x_t].

Thus the estimating equations (2.1) lead to

    (1/T) Σ_{t=1}^T { ∂ ln f(x_{t+1} | x_t; θ)/∂θ − E[∂ ln f(x_{t+1} | x_t; θ)/∂θ | x_t] } = 0,

which is equivalent to ML estimation.
Proof of Theorem 2.1: The manipulation of equation (2.8) is similar to that in the proof of Lemma 2.1, except that w(r, t | x_t; θ) is replaced with ŵ(r, t | x_t; θ), which yields equation (2.9). Firstly, from (2.8) we note immediately from Singleton (2001) that under regularity conditions our estimator is consistent and asymptotically normally distributed. Secondly, write equation (2.8) as

    H(θ) = (1/T) Σ_{t=1}^T h_t(θ) = 0.

Under certain regularity conditions, for a fixed p (the order of the Edgeworth/Gram–Charlier expansion), the asymptotic variance–covariance matrix Σ_p is given by

    Σ_p = D(θ)^{−1} Ω(θ) D(θ)^{−1}′,

with D(θ) = plim (1/T) Σ_{t=1}^T ∂h_t(θ)/∂θ′ and Ω(θ) = plim T H(θ)H(θ)′. Referring to equation (2.9), we have

    D(θ) = plim (1/T) Σ_{t=1}^T { ∂² ln f̂_p(x_{t+1} | x_t; θ)/∂θ∂θ′ − (∂/∂θ′) E[∂ ln f̂_p(x_{t+1} | x_t; θ)/∂θ | x_t] }.

Since

    E[∂ ln f̂_p(x_{t+1} | x_t; θ)/∂θ | x_t] = ∫ (∂ ln f̂_p(x_{t+1} | x_t; θ)/∂θ) f(x_{t+1} | x_t; θ) dx_{t+1},

it follows that

    (∂/∂θ′) E[∂ ln f̂_p/∂θ | x_t] = ∫ (∂² ln f̂_p/∂θ∂θ′) f dx_{t+1} + ∫ (∂ ln f̂_p/∂θ)(∂f/∂θ′) dx_{t+1}
        = E[∂² ln f̂_p/∂θ∂θ′ | x_t] + ∫ (∂ ln f̂_p/∂θ)(∂ ln f/∂θ′) f dx_{t+1}
        = E[∂² ln f̂_p/∂θ∂θ′ | x_t] + E[(∂ ln f̂_p/∂θ)(∂ ln f/∂θ′) | x_t].

Furthermore,

    E[∂² ln f̂_p/∂θ∂θ′ | x_t] = E[(∂/∂θ′)((1/f̂_p)(∂f̂_p/∂θ)) | x_t]
        = E[−(1/f̂_p²)(∂f̂_p/∂θ)(∂f̂_p/∂θ′) + (1/f̂_p)(∂²f̂_p/∂θ∂θ′) | x_t]
        = −E[(∂ ln f̂_p/∂θ)(∂ ln f̂_p/∂θ′) | x_t] + ∫ (1/f̂_p)(∂²f̂_p/∂θ∂θ′) f dx_{t+1}
        = −E[(∂ ln f̂_p/∂θ)(∂ ln f̂_p/∂θ′) | x_t],

where the last equality holds when ln f̂_p = ln f. Consequently,

    D(θ) = plim (1/T) Σ_{t=1}^T { ∂² ln f̂_p(x_{t+1} | x_t; θ)/∂θ∂θ′ − E[(∂ ln f̂_p/∂θ)(∂ ln f/∂θ − ∂ ln f̂_p/∂θ)′ | x_t] − ∫ (1/f̂_p)(∂²f̂_p/∂θ∂θ′) f dx_{t+1} }.

On the other hand, since the summands h_t(θ) form a martingale difference sequence with respect to x_t, the cross-products vanish in the limit and

    Ω(θ) = plim (1/T) Σ_{t=1}^T h_t(θ)h_t(θ)′
         = plim (1/T) Σ_{t=1}^T { (∂ ln f̂_p/∂θ)(∂ ln f̂_p/∂θ)′ − E[∂ ln f̂_p/∂θ | x_t](∂ ln f̂_p/∂θ)′
             − (∂ ln f̂_p/∂θ)E[∂ ln f̂_p/∂θ | x_t]′ + E[∂ ln f̂_p/∂θ | x_t]E[∂ ln f̂_p/∂θ | x_t]′ }.

Note that the above results are derived for a fixed p, the truncation order of the Gram–Charlier series expansion. If ln f̂_p → ln f as p → ∞, i.e. the expansion is convergent, then

    D(θ) = plim (1/T) Σ_{t=1}^T ∂² ln f(x_{t+1} | x_t; θ)/∂θ∂θ′

and

    Ω(θ) = plim (1/T) Σ_{t=1}^T (∂ ln f(x_{t+1} | x_t; θ)/∂θ)(∂ ln f(x_{t+1} | x_t; θ)/∂θ)′.

In other words, −D(θ) = Ω(θ) = I(θ), and the asymptotic covariance matrix Σ_p = D(θ)^{−1}Ω(θ)D(θ)^{−1}′ equals I(θ)^{−1}. In this case, the estimator becomes the ECF estimator proposed in Singleton (2001) with the optimal weight function and thus achieves ML efficiency.

Proof of Lemma 2.2: The results in Lemma 2.2 follow immediately from the substitution of equation (2.6) into equation (2.9). Alternatively, the estimating equation (2.11) can be derived by substituting the approximating weight function into equation (2.8) and applying the definition of cumulants.
APPENDIX B: CCF OF BIVARIATE OU PROCESS AND SV PROCESS

CCF of the bivariate OU process. The joint characteristic function of (y_{t+τ}, x_{t+τ}) conditional on F_t can be written as

    ψ(r_1, r_2; y_{t+τ}, x_{t+τ} | y_t, x_t) = E[exp{ir_1 y_{t+τ} + ir_2 x_{t+τ}} | y_t, x_t]
        = exp{C(τ; r_1, r_2) + D1(τ; r_1, r_2) y_t + D2(τ; r_1, r_2) x_t},

where C(·), D1(·) and D2(·) are solved from the Riccati equations as

    C(τ; r_1, r_2) = −(r_1²/4κ)[σ² + (κ/(β − κ))² σ_x²](1 − e^{−2κτ})
                     − (σ_x²/4β)[r_2 + r_1 κ/(β − κ)]² (1 − e^{−2βτ})
                     + (κ σ_x² r_1/(β² − κ²))[r_2 + r_1 κ/(β − κ)](1 − e^{−(β+κ)τ}),
    D1(τ; r_1, r_2) = ir_1 e^{−κτ},
    D2(τ; r_1, r_2) = i[r_2 − r_1 (κ/(β − κ))(e^{(β−κ)τ} − 1)] e^{−βτ}.
CCF of the square-root SV process. The joint characteristic function of (x_{t+τ}, v_{t+τ}) conditional on F_t can be written as

    ψ(r_1, r_2; x_{t+τ}, v_{t+τ} | x_t, v_t) = E[exp{ir_1 x_{t+τ} + ir_2 v_{t+τ}} | x_t, v_t]
        = exp{C(τ; r_1, r_2) + D1(τ; r_1, r_2) x_t + D2(τ; r_1, r_2) v_t},

where C(·), D1(·) and D2(·) are solved from the Riccati equations as

    C(τ; r_1, r_2) = (ir_1 μ + iβαr_2)τ + (αβ/σ²)[(b − h)τ − 2 ln((1 − g e^{−hτ})/(1 − g))],
    D1(τ; r_1, r_2) = ir_1,
    D2(τ; r_1, r_2) = ir_2 + ((b − h)/σ²) (1 − e^{−hτ})/(1 − g e^{−hτ}),

with

    h(r_1, r_2) = [b² + σ²(r_1² + 2ρσ r_1 r_2 + σ² r_2² + 2iβ r_2)]^{1/2},
    b = β − iρσ r_1 − iσ² r_2,
    g(r_1, r_2) = (b − h)/(b + h).
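The square-root SV CCF above can be transcribed directly into code and sanity-checked. A necessary property, independent of the parameter values (which are illustrative here), is that at τ = 0 the CCF must collapse to exp(ir_1 x_t + ir_2 v_t), i.e. C = 0, D1 = ir_1 and D2 = ir_2.

```python
import cmath

# Direct transcription of the square-root SV CCF coefficients (a sketch;
# the parameter set below is illustrative, not from the paper).
def ccf_terms(tau, r1, r2, mu, alpha, beta, sigma, rho):
    b = beta - 1j * rho * sigma * r1 - 1j * sigma**2 * r2
    h = cmath.sqrt(b**2 + sigma**2 * (r1**2 + 2 * rho * sigma * r1 * r2
                                      + sigma**2 * r2**2 + 2j * beta * r2))
    g = (b - h) / (b + h)
    C = (1j * r1 * mu + 1j * beta * alpha * r2) * tau + (alpha * beta / sigma**2) * (
        (b - h) * tau - 2 * cmath.log((1 - g * cmath.exp(-h * tau)) / (1 - g)))
    D1 = 1j * r1
    D2 = 1j * r2 + (b - h) / sigma**2 * (1 - cmath.exp(-h * tau)) / (1 - g * cmath.exp(-h * tau))
    return C, D1, D2

# At tau = 0 the CCF must reduce to exp(i r1 x_t + i r2 v_t).
C0, D10, D20 = ccf_terms(0.0, r1=0.7, r2=-0.3, mu=0.05, alpha=0.04, beta=2.0,
                         sigma=0.3, rho=-0.5)
print(C0, D10, D20)
```

Because every τ-dependent factor enters through (b − h)τ or e^{−hτ}, the τ = 0 limit holds for any admissible (r_1, r_2), which makes it a useful regression test for an implementation.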
The Econometrics Journal (2010), volume 13, pp. 271–289. doi: 10.1111/j.1368-423X.2010.00315.x

Bimodal t-ratios: the impact of thick tails on inference

CARLO V. FIORIO†,‡, VASSILIS A. HAJIVASSILIOU§ AND PETER C. B. PHILLIPS¶,∗,††,‡‡

† University of Milan, Via Conservatorio, 7, 20122 Milan, Italy. E-mail: [email protected]
‡ Econpubblica, Via Roentgen, 1, 20136 Milan, Italy.
§ London School of Economics and Financial Markets Group, Houghton Street, London WC2A 2AE, UK. E-mail: [email protected]
¶ Yale University, New Haven, CT 06520-8281, USA. E-mail: [email protected]
∗ University of Auckland, Owen G. Glenn Building, 12 Grafton Road, Auckland, New Zealand.
†† Singapore Management University, 90 Stamford Road, Singapore 178903.
‡‡ University of York, York YO10 5DD, UK.
First version received: April 2008; final version accepted: December 2009
Summary This paper studies the distribution of the classical t-ratio with data generated from distributions with no finite moments and shows how classical testing is affected by bimodality. A key condition in generating bimodality is independence of the observations in the underlying data-generating process (DGP). The paper highlights the strikingly different implications of lack of correlation versus statistical independence in DGPs with infinite moments and shows how standard inference can be invalidated in such cases, thereby pointing to the need for adapting estimation and inference procedures to the special problems induced by thick-tailed (TT) distributions. The paper presents theoretical results for the Cauchy case and develops a new distribution termed the ‘double-Pareto’, which allows the thickness of the tails and the existence of moments to be determined parametrically. It also investigates the relative importance of tail thickness in case of finite moments by using TT distributions truncated on a compact support, showing that bimodality can persist even in such cases. Simulation results highlight the dangers of relying on naive testing in the face of TT distributions. Novel density estimation kernel methods are employed, given that our theoretical results yield cases that exhibit density discontinuities. Keywords: Bimodality, Cauchy, Double-Pareto, Thick tails, t-ratio.
1. INTRODUCTION

Many economic phenomena are known to follow distributions with non-negligible probability of extreme events, termed thick-tailed (TT) distributions. Top income and wealth distributions are often modelled with infinite-variance Pareto distributions (see, inter alia, Cowell, 1995). The
distribution of cities by size seems to fit Zipf’s law, a discrete form of a Pareto distribution with infinite variance (Gabaix, 1999). Another example is the size distribution of firms (Hart and Prais, 1956, Steindl, 1965). Further, TT distributions frequently arise in financial return data and data on corporate bankruptcies, which can cause difficulties in regulating markets where such extremes are observed (Embrechts, 2001, Loretan and Phillips, 1994). A final example arises in the economics of information technology where Web traffic file sizes follow distributions that decline according to a power law (Arlitt and Williamson, 1996), often with infinite variance (Crovella and Bestavros, 1997). Although there is a large and growing literature on robust estimation with data following thick-tail distributions (e.g. Beirlant et al., 1996, Hsieh, 1999, Dupuis and Victoria-Feser, 2006), little is known about the consequences of performing classical inference using samples drawn from such distributions. Important exceptions are Logan et al. (1972), which drew early attention to the possibility of bimodal distributions in self-normalized sums of independent random variables, Marsaglia (1965) and Zellner (1976, 1978), who showed bimodality for certain ratios of normal variables, Phillips and Wickens (1978), who showed that the distribution of structural equation estimators was not always unimodal, and Phillips and Hajivassiliou (1987), who analysed bimodality in classical t-ratios. Nelson and Startz (1990) and Maddala and Jeong (1992) provided some further analysis of structural estimators with possibly weak instruments. More recent contributions include Woglom (2001), Forchini (2006), Hillier (2006) and Phillips (2006), who all consider bimodality in structural equation distributions. Not much emphasis in this literature has been placed on the difference between orthogonal and fully independent observations. 
The present paper contributes to this literature in several ways. It provides an analysis of the asymptotic distribution of the classical t-ratio for distributions with no finite variance and discusses how classical testing is affected. In Section 2 we clarify the concept of TT distributions and provide a theoretical analysis of the bimodality of the t-ratio with data from an i.i.d. Cauchy distribution. A simulation analysis of this case is given in Section 3. Novel density estimation kernel methods are employed, given that our theoretical results yield cases that exhibit density discontinuities. Section 4 considers the different implications of lack of correlation and statistical independence. Section 5 illustrates extensions to other distributions with heavy tails: the stable family of distributions (Subsection 5.1) and a symmetric double-Pareto distribution (Subsection 5.2), which allows tail thickness and existence of moments to be determined parametrically. Section 6 investigates inference in the context of t-ratios with TT distributions. Section 7 shows that bimodality can arise even with TT distributions trimmed to have finite support. Section 8 concludes.
2. CAUCHY DGPs AND BIMODALITY OF THE t-STATISTIC

While there is no universally accepted definition of a TT distribution, random variables drawn from a TT distribution have a non-negligible probability of assuming very large values. Distribution functions with infinite first moments certainly belong to the family of TT distributions. Different TT distributions have differing degrees of thick-tailedness and, accordingly, quantitative indicators have been developed to evaluate the probability of extremal events, such as the extremal claim index, which assigns weights to the tails and thus to the probability of extremal events (Embrechts et al., 1999). A crude though widely used definition describes any distribution with infinite variance as a TT distribution. Other, weaker definitions require the kurtosis coefficient to be larger than 3 (leptokurtic) (Bryson, 1982).
In this paper we say that a distribution is thick-tailed (TT) if it belongs to the class of distributions for which Pr(|X| > c) = c^{−α} with α ≤ 1. The Cauchy distribution corresponds to the boundary case where α = 1. Such distributions are sometimes called very heavy-tailed.

It is well known that ratios of random variables frequently give rise to bimodal distributions. Perhaps the simplest example is the ratio R = (a + x)/(b + y), where x and y are independent N(0, 1) variates and a and b are constants. The distribution of R was found by Fieller (1932) and its density may be represented in series form in terms of a confluent hypergeometric function (see Phillips, 1983, equation (3.35)). It turns out, however, that the mathematical form of the density of R is not the most helpful instrument in analysing or explaining the bimodality of the distribution that occurs for various combinations of the parameters (a, b). Instead, the joint normal distribution of the numerator and denominator statistics, (a + x, b + y), provides the most convenient and direct source of information about the bimodality. An interesting numerical analysis of situations where bimodality arises in this example is given by Marsaglia (1965), who shows that the density of R is unimodal or bimodal according to the region of the plane in which the mean (a, b) of the joint distribution lies. Thus, when (a, b) lies in the positive quadrant, the distribution is bimodal whenever a is large (essentially a > 2.257). Similar examples arise with simple posterior densities in Bayesian analysis and with certain structural equation estimators in econometric models of simultaneous equations. Zellner (1978) provides an interesting example of the former, involving the posterior density of the reciprocal of a mean with a diffuse prior. An important example of the latter is the simple indirect least squares estimator in just-identified structural equations, as studied, for instance, by Bergstrom (1962) and recently by Forchini (2006), Hillier (2006) and Phillips (2006).

The present paper shows that the phenomenon of bimodality can also occur with the classical t-ratio test statistic for populations with undefined second moments. The case of primary interest to us in this paper is the standard Cauchy (0, 1) with density

    pdf(x) = 1/(π(1 + x²)).    (2.1)
When the t-ratio test statistic is constructed from a random sample of n draws from this population, the distribution is bimodal even in the limit as n → ∞. This case of a Cauchy (0, 1) population is especially important because it highlights the effects of statistical dependence in multivariate spherical populations. To explain why this is so, suppose (X1 , . . . , Xn ) is multivariate Cauchy with density n+1 2 . (2.2) pdf (x) = (n+1)/2 π (1 + x x)(n+1)/2 This distribution belongs to the multivariate spherical family and may be written in terms of a variance mixture of a multivariate N (0, σ 2 In ) as ∞ N(0, σ 2 In ) dG(σ 2 ), (2.3) 0
where 1/σ² is distributed as χ²₁ and G(σ²) is the distribution function of σ². Note that the marginal distributions of (2.2) are all Cauchy. In particular, the distribution of Xᵢ is univariate Cauchy with density as in (2.1) for each i. However, the components of (X₁, …, Xₙ) are statistically dependent, in contrast to the case of a random sample from a Cauchy (0, 1) population. The effect of this dependence, which is what distinguishes (2.2) from the random sample Cauchy case, is dramatically illustrated by the distribution of the classical t-statistic:

t_X = X̄/S_X = (n⁻¹ Σᵢ₌₁ⁿ Xᵢ) / (n⁻² Σᵢ₌₁ⁿ (Xᵢ − X̄)²)^{1/2}.  (2.4)
Under (2.2), t_X is distributed as t with n − 1 degrees of freedom, just as in the classical case of a random sample from an N(0, σ²) population. This was pointed out by Zellner (1976) and is an immediate consequence of (2.3) and the fact that t_X is scale invariant.¹ However, the spherical assumption that underlies (2.2) and (2.3) and the dependence that it induces in the sample (X₁, …, Xₙ) is very restrictive. When it is removed and (X₁, …, Xₙ) comprise a random sample from a Cauchy (0, 1) population, the distribution of t_X is very different. The new distribution has symmetric density about the origin but with distinct modes around ±1. This bimodality persists even in the limiting distribution of t_X, so that both asymptotic and small sample theory are quite different from the classical case. In the classical t-ratio the numerator and denominator statistics are independent. Moreover, as n → ∞ the denominator, upon suitable scaling, converges in probability to a constant. By contrast, in the i.i.d. Cauchy case the numerator and denominator statistics of t_X converge weakly to non-degenerate random variables which are (non-linearly) dependent, so that as n → ∞ the t-statistic is a ratio of random variables. Moreover, it is the dependence between the numerator and denominator statistics (even in the limit) which induces the bimodality in the distribution. These differences are important and, as we will prove below, they explain the contrasting shapes of the distributions in the two cases. We will use the symbol '⇒' to signify weak convergence as n → ∞, and the symbol '≡' to signify equality in distribution. Recalling that for an i.i.d. sample from a Cauchy (0, 1) distribution the sample mean X̄ ≡ Cauchy (0, 1) for all n, and, of course, X̄ ⇒ X ≡ Cauchy (0, 1) as n → ∞, the following theorem will focus on the distribution of (X̄, S_X) and that of the associated t-ratio statistic.

THEOREM 2.1. Let (X₁, …, Xₙ) be a random sample from a Cauchy (0, 1) distribution with density (2.1). Define

S² = n⁻² Σᵢ₌₁ⁿ Xᵢ²,  (2.5)

t = X̄/S.  (2.6)
Then:

(a) S² ⇒ Y, where Y is a stable random variate with exponent α = 1/2 and characteristic function given by

cf_Y(v) = E(e^{ivY}) = exp{ −(2/π^{1/2}) cos(π/4) |v|^{1/2} (1 − i sgn(v) tan(π/4)) }.  (2.7)
¹ This fact may be traced back to original geometric proofs by Fieller (1932).
Bimodal t-ratios and thick tails
(b) (X̄, S²) ⇒ (X, Y), where (X, Y) are jointly stable variates with a characteristic function given by

cf_{X,Y}(u, v) = exp{ −2π^{−1/2} (−iv)^{1/2} ₁F₁(−1/2, 1/2; u²/4iv) },  (2.8)

where ₁F₁ denotes the confluent hypergeometric function. An equivalent form is

cf_{X,Y}(u, v) = exp{ −|u| − π^{−1/2} e^{−iu²/4v} Ψ(3/2, 3/2; iu²/4v) },  (2.9)

where Ψ denotes the confluent hypergeometric function of the second kind.

(c) S² − S_X² = O_p(n⁻¹),  (2.10)

t − t_X = O_p(n⁻¹).  (2.11)
(d) The probability density of the t-ratio (2.6) is bimodal, with infinite poles at ±1.

Theorem 2.1 establishes the joint distribution of (X̄, S²) and shows that the distributions of t and t_X, and of S and S_X, are respectively asymptotically equivalent.² Note that Xᵢ² has density

pdf(y) = 1 / (π y^{1/2} (1 + y)),  y > 0.  (2.14)
In fact, Xᵢ² belongs to the domain of attraction of a stable law with exponent α = 1/2. To see this, we need only verify (Feller, 1971, p. 313) that if F(y) is the distribution function of Xᵢ² then

1 − F(y) + F(−y) ∼ 2/(π y^{1/2}),  y → ∞,

which is immediate from (2.14), and that the tails are well balanced. Here we have:

(1 − F(y)) / (1 − F(y) + F(−y)) → 1,  F(−y) / (1 − F(y) + F(−y)) → 0.
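The tail equivalence just stated is easy to check by simulation. The following sketch is our illustration (the thresholds y = 100, 400 are arbitrary choices), comparing the empirical tail of Xᵢ² with the asymptotic form 2/(π y^{1/2}):

```python
import numpy as np

rng = np.random.default_rng(2)

# For X standard Cauchy, P(X^2 > y) = P(|X| > sqrt(y)) ~ 2/(pi*sqrt(y)).
x2 = rng.standard_cauchy(2_000_000) ** 2

errors = []
for y in (100.0, 400.0):
    empirical = np.mean(x2 > y)
    asymptotic = 2 / (np.pi * np.sqrt(y))
    errors.append(abs(empirical - asymptotic))
    print(f"y={y:.0f}: empirical {empirical:.4f}, asymptotic {asymptotic:.4f}")
```

The empirical and asymptotic tail probabilities agree to three decimal places already at these moderate thresholds.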
² For the definition of the hypergeometric functions that appear in (2.8) and (2.9) see Lebedev (1972, ch. 9). Note that when u = 0, (2.8) reduces to

exp{ −2π^{−1/2} (−iv)^{1/2} }.  (2.12)

We now write −iv in polar form as −iv = |v|e^{−i sgn(v)π/2}, so that

(−iv)^{1/2} = |v|^{1/2} e^{−i sgn(v)π/4} = |v|^{1/2} cos(π/4) (1 − i sgn(v) tan(π/4)),  (2.13)

from which it is apparent that (2.8) reduces to the marginal characteristic function of the stable variate Y given earlier in (2.7). When v = 0 the representation (2.9) reduces immediately to the marginal characteristic function, exp(−|u|), of the Cauchy variate X. In the general case the joint characteristic function cf_{X,Y}(u, v) does not factorize, and X and Y are dependent stable variates.
Note also that the characteristic function of the limiting variate Y given by (2.7) belongs to the general stable family, whose characteristic function (see Ibragimov and Linnik, 1971, p. 43) has the following form:

φ(v) = exp{ iγv − c|v|^α (1 − iβ sgn(v) tan(πα/2)) }.  (2.15)
In the case of (2.7) the exponent parameter α = 1/2, the location parameter γ = 0, the scale parameter c = 2π^{−1/2} cos(π/4) and the symmetry parameter β = 1. Part (a) of Theorem 2.1 shows that the denominator of the t-ratio (2.6) is the square root of a stable random variate in the limit as n → ∞. This is to be contrasted with the classical case where nS_X² converges in probability to σ² = E(Xᵢ²) under general conditions. Note that when n = 1, the numerator and denominator of t are identical up to sign. In this case we have t = ±1 and the distribution assigns probability mass of 1/2 at +1 and −1. When n > 1, the numerator and denominator statistics of t continue to be statistically dependent. This dependence persists as n → ∞. Figures 1a–d show Monte Carlo estimates (by smoothed kernel methods) of the joint probability surface of (X̄, S²) for various values of n. As is apparent from the pictures, the density involves a long curving ridge that follows roughly a parabolic shape in the (X̄, S²) plane. OLS estimates of the ridge in the joint pdf stabilize quickly as a function of n and confirm the dependence between X̄ and S² for the Cauchy DGP. Further note that the ridge in the joint density is symmetric about the S² axis. The ridge is associated with clusters of probability mass for various values of S² on either side of the S² axis and equidistant from it. These clusters of mass along the ridge produce a clear bimodality in the conditional distribution of X̄ given S² for all moderate to large S². For small S² the probability mass is concentrated in the vicinity of the origin in view of the dependence between X̄ and S². The clusters of probability mass along the ridge in the (X̄, S²) plane are also responsible for the bimodality in the distribution of certain ratios of the statistics (X̄, S²), such as the t-ratio statistics t = X̄/S and t_X = X̄/S_X. These distributions are investigated by simulation in the following section.
3. SIMULATION EVIDENCE FOR THE CAUCHY CASE

The empirical densities reported here were obtained as follows: for a given value of n, m = 10,000 random samples of size n were drawn from the standard Cauchy distribution with density given by (2.1) and corresponding cumulative distribution function

F(x) = (1/π) arctan(x) + 1/2,  −∞ < x < ∞.  (3.1)
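Since the CDF above inverts in closed form, F⁻¹(u) = tan(π(u − 1/2)), the probability integral transform applies. A minimal sketch of this draw mechanism (our code, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(42)

def cauchy_draws(n: int) -> np.ndarray:
    # Probability integral transform: u ~ U(0,1), x = F^{-1}(u) = tan(pi*(u - 1/2)).
    u = rng.uniform(size=n)
    return np.tan(np.pi * (u - 0.5))

x = cauchy_draws(100_000)

# Sanity checks: mapping back through F gives uniforms, and P(|X| < 1) = 1/2.
u_back = np.arctan(x) / np.pi + 0.5
print(round(float(u_back.mean()), 2))           # close to 0.5
print(round(float(np.mean(np.abs(x) < 1)), 2))  # close to 0.5
```

The same inverse-CDF mechanism is reused below for the double-Pareto draws of Section 5.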
Since (3.1) has a closed-form inverse, the probability integral transform method was used to generate the draws. To estimate the probability density functions, conventional kernel methods, for example Tapia and Thompson (1978), would not provide consistent estimates of the true density in a neighbourhood of ±1 in view of the infinite singularities (poles) there. An extensive literature considers how to correct the so-called boundary effect, although there is no single dominating solution that is appropriate for all shapes of density.³ The method adopted here follows Zhang et al. (1999), which is a combination of the methods of pseudo-data, transformation and reflection, is non-negative everywhere, and performs well compared to existing methods for almost all shapes of densities, especially densities with substantial mass near the boundary. For the univariate densities (Figures 2, 4 and 5) a bandwidth of h = 0.2 was used, while for the bivariate densities in Figure 1 we employed equal bandwidths h_x = h_y = 0.2. We investigated the sampling behaviour of the t-ratio statistics t and t_X by combining four kernel densities, two estimating the density on the left of ±1 and two estimating the density on the right of ±1, using the fact that for x > 1 + h, x < −1 − h and −1 + h < x < 1 − h the densities estimated with and without boundary correction coincide (Zhang et al., 1999, p. 1234). These are shown in Figure 2. Note that the bimodality is quite striking and persists for all sample sizes.

[Figure 1. Joint density function estimates of X̄ and S² for i.i.d. Cauchy DGPs; panels (a) n = 2, (b) n = 10, (c) n = 30, (d) n = 100.]
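The paper's estimates use the Zhang et al. (1999) combination method. As a simpler illustration of why a boundary correction is needed at all, the sketch below implements the reflection method mentioned in footnote 3 (our code, applied to a half-normal sample with a known boundary at 0 rather than to the t-ratio itself):

```python
import numpy as np

rng = np.random.default_rng(1)

def kde(data, grid, h):
    # Plain Gaussian kernel density estimate evaluated on a grid.
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

# Half-normal sample: support [0, inf), true density at 0 is sqrt(2/pi) ~ 0.798.
x = np.abs(rng.standard_normal(50_000))
grid = np.array([0.0])
h = 0.1

naive = kde(x, grid, h)[0]               # biased down by about 1/2 at the boundary
reflected = naive + kde(-x, grid, h)[0]  # reflection: add the mirrored-data term
print(round(float(naive), 2), round(float(reflected), 2))
```

The naive estimate at the boundary is roughly half the true value because half of each kernel's mass falls outside the support; reflecting the data restores it. The Zhang et al. combination method refines this idea with pseudo-data and transformation steps.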
3 For an introductory discussion of density estimation on bounded support, see Silverman (1986, p. 29). Methods to correct for the boundary problem include the reflection method (Silverman, 1986, Cline and Hart, 1991); the boundary kernel method (Jones, 1993, Cheng et al., 1997, Zhang and Karunamuni, 1998); the transformation method (Marron and Ruppert, 1994); and the pseudo-data method (Cowling and Hall, 1996).
[Figure 2. Density functions for i.i.d. Cauchy DGPs: (a) density function of the t-ratio; (b) density function of the t_X-ratio. Kernel estimates of f(t) for n = 10, 100, 500, 10,000.]
[Figure 3. Bivariate Cauchy: (a) spherical (dependent); (b) independent (non-spherical).]
4. LACK OF CORRELATION VERSUS INDEPENDENCE

Data from an n-dimensional spherical population with finite second moments have zero correlation, but are independent only when normally distributed. The standard multivariate Cauchy (with density given by (2.2)) has no finite integer moments, but its spherical characteristic may be interpreted as the natural analogue of uncorrelated components in multivariate families with thicker tails. When there is only 'lack of correlation', as in the spherical Cauchy case, it is well known (e.g. King, 1980) that inferential statistics such as the t-ratio reproduce the behaviour that they have under independent normal draws. When there are independent draws from a Cauchy population, the statistical behaviour of the t-ratio is very different. Examples of this type highlight the statistical implications of the differences between lack of correlation and independence in non-normal populations. Figure 3 highlights these differences for the bivariate Cauchy case. The left panel plots the iso-pdf contours of the bivariate spherical Cauchy (with the two observations being non-linearly dependent), while the right panel gives the contours for the bivariate independent Cauchy case (where the distribution is non-spherical). In view of the TTs, we see the striking divergence between sphericality and statistical independence: whereas for the Gaussian distribution sphericality (= uncorrelatedness) and full statistical independence coincide, we confirm that under non-Gaussianity sphericality is neither necessary nor sufficient for independence.⁴ These results confirm the findings of Hajivassiliou (2008), who emphasized that when data are generated from distributions with TTs, independence and zero correlation are very different properties and can have startlingly different outcomes.
By construction, the random variable in the numerator of the t-ratio, X̄, is linearly orthogonal to the S² variable in the square root

⁴ Figure 10 of the extended version of this paper (Fiorio et al., 2008) considers six representative squares on the domain of the bivariate Cauchy distributions, and calculates various measures of deviation from independence for the spherical, dependent version.
of the denominator. Under Gaussianity, this orthogonality implies full statistical independence between numerator and denominator. But in the case of data drawn from the Cauchy distribution, statistical independence of the numerator and denominator of the t-ratio rests crucially on whether or not the underlying data are independently drawn: if they are generated from a multivariate spherical Cauchy (with a diagonal scale matrix) and hence they are non-linearly dependent, then the numerator and denominator in fact become independent and the usual unimodal t-distribution obtains. If, on the other hand, they are drawn fully independently from one another, then X¯ and SX2 turn out to be dependent and hence the density of the t-ratio exhibits the striking bimodality documented here.
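Two facts from this discussion can be checked directly: the scale-mixture construction (2.3) of the spherical Cauchy, and the degenerate n = 1 case where t = ±1 exactly. A sketch (our code; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

# (i) Spherical multivariate Cauchy via the mixture (2.3):
#     draw w ~ N(0,1) so that 1/sigma^2 = w^2 ~ chi-square(1),
#     then X = sigma * Z with Z ~ N(0, I_n).
n, m = 20, 50_000
w = rng.standard_normal((m, 1))
spherical = rng.standard_normal((m, n)) / np.abs(w)

# Each marginal is standard Cauchy (a ratio of independent normals),
# so P(|X_1| < 1) = 1/2.
frac = np.mean(np.abs(spherical[:, 0]) < 1)
print(round(float(frac), 2))  # ~ 0.50

# (ii) With n = 1, S = |X_1| in (2.5), so t = X_1/|X_1| = +/-1 exactly.
x1 = rng.standard_cauchy(1)[0]
t1 = x1 / np.sqrt(x1**2)
print(abs(t1))  # 1.0 up to floating-point rounding
```

The rows of `spherical` are dependent through the common scale w, which is exactly the dependence that makes the t-ratio classically t-distributed in the spherical case.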
5. IS THE CAUCHY DGP NECESSARY FOR BIMODALITY?

Our attention has concentrated on the sampling and asymptotic behaviour of statistics based on a random sample from an underlying Cauchy (0, 1) population. This has helped to achieve a sharp contrast between our results and those that are known to apply with Cauchy (0, 1) populations under the special type of dependence implied by spherical symmetry. However, many of the qualitative results given here, such as the bimodality of the t-ratios, continue to apply for a much wider class of underlying populations. In this section we show that the bimodality of the t-ratio persists for other heavy-tailed distributions. Two cases are illustrated: (a) draws from the 'stable' family of distributions and (b) draws from the 'double-Pareto' distribution.

5.1. Draws from the stable family of distributions

Let (X₁, …, Xₙ) denote a random sample from a symmetric stable population with characteristic function

cf(s) = e^{−|s|^α},  (5.1)
and exponent parameter α < 2; then the t-ratios t and t_X have bimodal densities similar in form to those shown in Figure 2 above for the special case α = 1. To generate random variates characterized by (5.1), a procedure described in section 1 of Kanter and Steiger (1974) was used. In our experiments we considered several examples of stable distributions for various values of α. We found that the bimodality is accentuated for α < 1 and attenuated as α → 2. When α = 2, of course, the distribution is classical t with n − 1 degrees of freedom. In a similar vein to the Cauchy case, we found the ridge in the joint density to be most pronounced for α = 1/3 but to wither as α rises to 5/3. For extended simulation results see Fiorio et al. (2008).

5.2. Draws from the double-Pareto distribution

Analogous to the double-exponential (see Feller, 1971, p. 49), we define the double-Pareto distribution as the convolution of two independent Pareto (type I) distributed random variables, X₁ − X₂, where X₁ and X₂ have densities α₁β₁^{α₁} x^{−α₁−1} (x ≥ β₁, α₁ > 0, β₁ > 0) and α₂β₂^{α₂} x^{−α₂−1} (x ≥ β₂, α₂ > 0, β₂ > 0), respectively.⁵ Its density is

⁵ The name double-Pareto was also used by Reed and Jorgensen (2004) for the distribution of a random variable that is obtained as the ratio of two Pareto random variables and is only defined over a positive support.
pdf(t) = ∫₋∞^∞ α₁β₁^{α₁} α₂β₂^{α₂} (x₂ + t)^{−α₁−1} x₂^{−α₂−1} dx₂,
and its first two moments are:⁶

E(x) = [α₁β₁(α₂ − 1) − α₂β₂(α₁ − 1)] / [(α₁ − 1)(α₂ − 1)],  with α₁ > 1, α₂ > 1;

V(x) = α₁β₁²/(α₁ − 2) − 2α₁α₂β₁β₂/[(α₁ − 1)(α₂ − 1)] + α₂β₂²/(α₂ − 2),  with α₁ > 2, α₂ > 2.
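A Pareto(α, β) variate has CDF 1 − (β/x)^α, so the inverse-CDF draw is βU^{−1/α} for U ~ U(0, 1), and a double-Pareto variate is the difference of two such draws. The sketch below is our illustration (α = β = 3 are arbitrary values with finite mean and variance), checking the sample mean against the E(x) formula above:

```python
import numpy as np

rng = np.random.default_rng(3)

def pareto_draws(alpha, beta, size):
    # Inverse-CDF draw for Pareto type I: x = beta * u^(-1/alpha), u ~ U(0,1).
    return beta * rng.uniform(size=size) ** (-1.0 / alpha)

a1 = a2 = 3.0
b1 = b2 = 3.0
m = 1_000_000
x = pareto_draws(a1, b1, m) - pareto_draws(a2, b2, m)

# E(x) from the formula in the text; it is zero in this symmetric case.
mean_formula = (a1 * b1 * (a2 - 1) - a2 * b2 * (a1 - 1)) / ((a1 - 1) * (a2 - 1))
print(mean_formula)                 # 0.0
print(abs(float(x.mean())) < 0.05)  # sample mean agrees with the formula
```

The same sampler with α ≤ 1 produces the infinite-mean draws used for Figure 4(a).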
The results that follow were obtained via Monte Carlo simulations from random samples of dimension n using the method of inverted CDFs; that is, a random sample of dimension n is extracted from a unit rectangular variate, U(0, 1), and then mapped into the sample space using the inverse CDF. The number of replications m was 10,000. This study allows us to disentangle how the asymptotic distribution of the t-ratio statistic differs when either one or both of the first two moments do not exist.⁷ The Cauchy and the double-Pareto distribution with α₁ = α₂ ≤ 1 are both symmetric and have infinite mean. For these distributions, as the sample size increases, the statistic t_X converges towards a stable distribution which is symmetric and bimodal. The convergence is fairly rapid, even for samples as small as 10, and the two modes are located at ±1. For the double-Pareto distribution we find that the t-ratio distribution does depend on αᵢ, i = 1, 2: the lower is αᵢ, the higher is the concentration around the two modes; see Figure 4(a). We also examined the case 1 < α < 2 and found that the t-ratio, t_X, is not always clearly bimodally distributed. The more α departs from 1, the less evident is the bimodality of the t-ratio density and the clearer the convergence towards a standard normal distribution; see Figure 4(b). We set β = 3, but these results apply for any value of β > 0, since β is simply a threshold parameter that does not affect the behaviour of the t_X statistic.⁸ If α₁ ≠ α₂, it suffices to have either α₁ ≤ 1 or α₂ ≤ 1 for the double-Pareto to have infinite mean. However, in this case the t-ratio distribution does not have a bimodal density, nor is it stable. See Figure 6 of the extended online version of this paper, Fiorio et al. (2008). The regularity in the t_X distribution for the symmetric double-Pareto case leads us to investigate the relationship between the first and second centred moments, in the numerator and denominator of t_X respectively.
In Section 2 above, we showed that if the distribution is Cauchy, the variance converges towards a unimodal distribution with the mode lying in the interval (0, 1). However, if the distribution is double-Pareto, the sample variance does not converge towards a stable distribution but becomes more dispersed as the sample size increases. This behaviour confirms the surprising results obtained elsewhere (Ibragimov, 2004, Hajivassiliou, 2008) concerning inference with thick-tailed (TT) distributions, depending on the tail-thickness parameter α: for α = 1 the dispersion of the distribution of sample averages remains invariant to the sample size n, while for α < 1 more observations actually hurt, with the variance rising with n.
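The α = 1 stability property behind this invariance is easy to see directly: the mean of n i.i.d. Cauchy (0, 1) draws is again exactly Cauchy (0, 1), so its interquartile range stays at 2 (quartiles at ±1) no matter how large n is. A sketch (our code; the values of n are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)

m = 20_000
iqrs = []
for n in (10, 100, 400):
    # Sample means of n i.i.d. Cauchy(0,1) draws, m replications each.
    xbar = rng.standard_cauchy((m, n)).mean(axis=1)
    q75, q25 = np.percentile(xbar, [75, 25])
    iqrs.append(q75 - q25)
    print(n, round(float(q75 - q25), 1))  # ~ 2.0 for every n: no averaging-out
```

Averaging does nothing here: the dispersion of the sample mean is flat in n, the invariance the text attributes to the α = 1 case.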
⁶ For derivations, see appendix B of the extended version of this paper, Fiorio et al. (2008).
⁷ Using copulas, we could evaluate behaviour with correlated double-Pareto draws. See Hajivassiliou (2008) for a development of this idea. See also Peña et al. (2006) for some general results.
⁸ These findings can be proved theoretically along the lines of Appendix A: the theory behind the double-Pareto Figures 4(a)–4(b) corresponds to the Logan et al. (1972) case of p = 2 and Prob(t < −q) ∼ rq^{−α} = Prob(t > q) ∼ q^{−α} with r = . When 0 < α < 1, as in Figure 4(a), the density of t_X has infinite singularities at ±1, while for 1 < α < 2, as in Figure 4(b), the density is continuous throughout with modes at ±1.
[Figure 4. t-Ratios for double-Pareto distributions; kernel estimates of f(t) for n = 10, 100, 500, 10,000. (a) t-ratio of infinite-first-moment double-Pareto distributions (α < 1): α₁ = α₂ = .5 and α₁ = α₂ = .9, with β₁ = β₂ = 3. (b) t-ratio of finite-first-moment double-Pareto distributions (1 < α ≤ 2): α₁ = α₂ = 1.1 and α₁ = α₂ = 1.8, with β₁ = β₂ = 3.]
Furthermore, the usual asset-diversification result, that spreading a given amount of wealth over a larger number of assets reduces the variability of the portfolio, no longer holds: with returns from a TT distribution the variability may remain invariant to the number of assets composing the portfolio if α = 1, while portfolio variability actually rises with the number of assets if α < 1. In such cases, all eggs should be placed in the same basket.⁹
9 For specific analysis of the distribution of the variance of double-Pareto distributions with infinite mean, and of the relationship between the sample mean and variance in this case, the interested reader is referred to the extended online version of this paper, Fiorio et al. (2008).
6. REJECTION PROBABILITY ERRORS OF t-RATIOS

The preceding results are relevant for hypothesis testing in regressions with errors that are independent and identically distributed from a TT distribution. They are also relevant for testing the hypothesis of a difference in means or other statistics of two samples when either or both come from a TT distribution. How serious are the mistakes in such cases if the critical values of an N(0, 1) distribution are used in classical t-ratio testing? The issue is well illustrated using the p-value discrepancy plot (Davidson and MacKinnon, 1998). Let us now summarize results, which are extensively described in Fiorio and Hajivassiliou (2006). Assume that we have a random sample from a double-Pareto distribution with 1 < α ≤ 2 and we run a test of H₀: μ = μ₀ against the alternative H_A: μ ≠ μ₀, where μ is the true mean and μ₀ some value on the real line. The sample mean is used to estimate μ. Performing such a test using the standard normal rather than the correct distribution causes the null hypothesis to be under-rejected by quite a small amount, not larger than 5% for tests of size 5%, and even less for tests of size 1% or 10%. This conclusion would often lead us to ignore the caveat of a systematic error in rejection probability (ERP) when using the standard normal to test two-sided hypotheses with a symmetric double-Pareto distribution with 1 < α ≤ 2. However, three important points should be noted. The first is that ignoring the true nature of the t-ratio distribution under this particular DGP may be acceptable only if the size of the test is smaller than 10%. If the test has a larger size (for instance 40%), the ERP can be larger than 10% and is obviously more difficult to tolerate.¹⁰ Second, if the non-symmetric double-Pareto distribution is considered, then the t-ratio statistic is not even stable.
Finally, although the ‘ignore’ policy leads to minor errors (below ±5%) for one-sided tests in the case of the double-Pareto distribution, the ERP might be much larger for other TT processes.
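The size calculation just described can be sketched as follows (our illustration, not the Fiorio–Hajivassiliou code; α = 1.5, β = 3, n = 100 and the nominal 5% level are arbitrary choices within the 1 < α ≤ 2 case):

```python
import numpy as np

rng = np.random.default_rng(5)

def double_pareto(alpha, beta, size):
    # Difference of two inverse-CDF Pareto draws (Section 5.2); mean zero here.
    u1 = rng.uniform(size=size)
    u2 = rng.uniform(size=size)
    return beta * u1 ** (-1 / alpha) - beta * u2 ** (-1 / alpha)

n, m = 100, 20_000
x = double_pareto(1.5, 3.0, (m, n))

# Classical t-statistics for H0: mu = 0 (true), tested with N(0,1) critical values.
t = np.sqrt(n) * x.mean(axis=1) / x.std(axis=1, ddof=1)
actual = np.mean(np.abs(t) > 1.96)
print(f"nominal size 0.050, actual rejection rate {actual:.3f}")
```

The ERP is the gap between the actual rejection rate and the nominal 5%; repeating the exercise over a grid of nominal sizes traces out the p-value discrepancy plot of Davidson and MacKinnon (1998).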
7. BIMODALITY WITHOUT INFINITE MOMENTS?

In order to investigate the relative importance of tail thickness and non-existence of moments, we consider a distribution truncated on a compact support, characterized as follows:

Z = X if |X| < c, and Z is not available (NA) otherwise,  (7.1)
where X is a standard Cauchy (0, 1). The cut-off parameter c is a positive finite real number. Since the support of this distribution is by construction finite and compact, the moments of the r.v. Z are all finite. The first distribution truncated on a compact support as in (7.1) that we consider is the Cauchy (0, 1), while the second is the double-Pareto law introduced in Subsection 5.2.

¹⁰ Although tests of size larger than 10% are rather unusual in economics, it is much less so in other disciplines, such as physics, where the main point is often to maximize the power of the test rather than to minimize its size. Also, in physics and related sciences it is common to consider the 'probable error' of a test procedure, which corresponds to a significance level of 50%. In such cases it is common to find confidence intervals with about 60% coverage probability (see, for instance, Karlen, 2002).
[Figure 5. t-Ratio of Cauchy and double-Pareto on compact support. (a) Truncated Cauchy, n = 500, for cut-offs |c| = 6×10, 6×10², 6×10³, 6×10⁴ with Pr(|x| > |c|) = .01, .001, .0001, .00001. (b) Truncated double-Pareto, a₁ = a₂ = .5, b₁ = b₂ = 3, n = 500, for cut-offs |c| = 50×10², 45×10⁴, 40×10⁶, 30×10⁸ with Pr(|x| > |c|) = .01, .001, .0001, .00001.]
By considering truncated versions of distributions whose untruncated counterparts do not have finite moments, we can control the relative importance of the tails while working with distributions with all moments finite. In the simulations below, we consider the following truncation points:

Truncated Cauchy
c                     500      1000     3000     5000
prob(cut-off tails)   0.0012   0.0006   0.0002   0.0001

Truncated double-Pareto (α = 0.5)
c                     5000     100,000  250,000  500,000
prob(cut-off tails)   0.049    0.011    0.0069   0.0048
The higher the absolute value of c, the less attenuated the impact of tail behaviour will be. In contrast, low absolute values of c imply cutting out most of the (thick) tails of the distribution. The general conclusion is that the bimodality can appear even when all moments are finite and the sample size is finite but reasonably large for many empirical applications. Our results with n = 500 show that the source of the bimodality is the rate of tail decay and not unboundedness of support or non-existence of moments (Figure 5), the non-normal behaviour being more evident the larger the truncation point c. The heuristic explanation for these results is that any large draw in a finite sample from the underlying TT distribution will tend to dominate both the numerator and denominator of the t-ratio statistic, even if the DGP distribution has bounded support. Especially when there is a single extremely large draw that dominates all others, the t-ratio will be approximately ±1, leading to a distribution with modal activity in the neighbourhood of these two points. Clearly, it is not necessary for the distribution to have infinite moments or unbounded support for this phenomenon to occur.
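A minimal sketch of the truncation experiment (our code; the replication count is smaller than the paper's m = 10,000 to keep the run short):

```python
import numpy as np

rng = np.random.default_rng(9)

def truncated_cauchy(n, c):
    # Standard Cauchy draws with |X| >= c discarded, as in (7.1);
    # all moments of the truncated variate are finite.
    out = np.empty(0)
    while out.size < n:
        x = rng.standard_cauchy(2 * n)
        out = np.concatenate([out, x[np.abs(x) < c]])
    return out[:n]

n, m, c = 300, 3000, 6e4
t = np.empty(m)
for j in range(m):
    z = truncated_cauchy(n, c)
    # t_X as in (2.4): mean over (n^{-2} * sum of squared deviations)^{1/2}.
    t[j] = z.mean() / (np.sqrt(((z - z.mean()) ** 2).sum()) / n)

# Despite finite moments, the t-ratio still piles up near +/-1.
near_modes = np.mean((np.abs(t) > 0.8) & (np.abs(t) < 1.2))
near_zero = np.mean(np.abs(t) < 0.4)
print(near_zero < near_modes)
```

With this generous cut-off the truncation bites almost never, yet every draw is bounded, so the bimodality visible in the interval counts cannot be blamed on infinite moments.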
8. CONCLUSIONS This paper has investigated issues of inference from data based on independent draws from TT distributions. When the distribution is TT with infinite moments, the standard t-ratio formed from a random sample does not converge to a standard normal distribution and the limit distribution is bimodal. Conventional inference is invalidated in such cases and ERP in testing can be serious. Bimodality in the finite sample distribution of the t-ratio arises even in cases of trimmed TT distributions, showing that non-existence of moments is not necessary for the phenomenon to occur.
ACKNOWLEDGMENTS The authors would like to thank the Co-Editor and anonymous referees for constructive comments and suggestions. The usual disclaimer applies.
REFERENCES

Arlitt, M. F. and C. L. Williamson (1996). Web server workload characterization: the search for invariants. SIGMETRICS Performance Evaluation Review 24, 126–37.
Beirlant, J., P. Vynckier and J. L. Teugels (1996). Tail index estimation, Pareto quantile plots, and regression diagnostics. Journal of the American Statistical Association 91, 1659–67.
Bergstrom, A. (1962). The exact sampling distributions of least squares and maximum likelihood estimators of the marginal propensity to consume. Econometrica 30, 480–90.
Bryson, M. (1982). Heavy-tailed distributions. In N. Kotz and S. Read (Eds.), Encyclopaedia of Statistical Sciences, Volume 3, 598–601. New York: John Wiley.
Cheng, M., J. Fan and J. Marron (1997). On automatic boundary corrections. Annals of Statistics 25, 1691–708.
Cline, D. and J. Hart (1991). Kernel estimation of densities of discontinuous derivatives. Statistics 22, 69–84.
Cowell, F. A. (1995). Measuring Inequality (2nd ed.). Hemel Hempstead: Harvester Wheatsheaf.
Cowling, A. and P. Hall (1996). On pseudodata methods for removing boundary effects in kernel density estimation. Journal of the Royal Statistical Society, Series B 58, 551–63.
Crovella, M. and A. Bestavros (1997). Self-similarity in world wide web traffic: evidence and possible causes. IEEE/ACM Transactions on Networking 5, 835–46.
Davidson, R. and J. MacKinnon (1998). Graphical methods for investigating the size and power of hypothesis tests. Manchester School 66, 1–26.
Dupuis, D. J. and M. P. Victoria-Feser (2006). A robust prediction error criterion for Pareto modeling of upper tails. Canadian Journal of Statistics 34, 639–58.
Embrechts, P. (2001). Extremes in economics and the economics of extremes. In B. Finkenstädt and H. Rootzén (Eds.), Extreme Values in Finance, Telecommunications and the Environment, 169–84. Boca Raton, FL: CRC Press LLC.
Embrechts, P., C. Klüppelberg and T. Mikosch (1999). Modelling Extremal Events for Insurance and Finance. Berlin: Springer-Verlag.
Erdélyi, A. (1953). Higher Transcendental Functions, Volume 1. New York: McGraw-Hill.
Feller, W. (1971).
An Introduction to Probability Theory and its Applications, Volume II (2nd ed.). New York: John Wiley.
Fieller, E. C. (1932). The distribution of the index in a normal bivariate population. Biometrika 24, 428–40.
Fiorio, C., V. Hajivassiliou and P. Phillips (2008). Bimodal t-ratios: the impact of thick tails on inference. Working paper, Department of Economics, LSE. Available at: http://econ.lse.ac.uk/staff/vassilis/papers/.
Forchini, G. (2006). On the bimodality of the exact distribution of the TSLS estimator. Econometric Theory 22, 932–46.
Gabaix, X. (1999). Zipf's law for cities: an explanation. Quarterly Journal of Economics 114, 739–67.
Hajivassiliou, V. (2008). Correlation versus statistical dependence and thick tail distributions: some surprising results. Working paper, Department of Economics, LSE.
Hart, P. E. and S. J. Prais (1956). An analysis of business concentration. Journal of the Royal Statistical Society, Series A 119, 150–81.
Hillier, G. (2006). Yet more on the exact properties of IV estimators. Econometric Theory 22, 913–31.
Hsieh, P.-H. (1999). Robustness of tail index estimation. Journal of Computational and Graphical Statistics 8, 318–32.
Ibragimov, I. and V. Linnik (1971). Independent and Stationary Sequences of Random Variables. Groningen: Wolters-Noordhoff.
Ibragimov, R. (2004). On the robustness of economic models to heavy-tailedness assumptions. Working paper, Department of Economics, Yale University.
Jones, M. (1993). Simple boundary correction for kernel density estimation. Statistics and Computing 3, 135–46.
Bimodal t-ratios and thick tails
Kanter, M. and W. Steiger (1974). Regression and autoregression with infinite variance. Advances in Applied Probability 6, 768–83.
Karlen, D. (2002). Credibility of confidence intervals. In M. Whalley and L. Lyons (Eds.), Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, 53–57. Institute for Particle Physics Phenomenology, University of Durham.
King, M. (1980). Robust tests for spherical symmetry and their application to least squares regression. Annals of Statistics 8, 1265–71.
Lebedev, N. (1972). Special Functions and Their Applications. Englewood Cliffs, NJ: Prentice-Hall.
Logan, B., C. Mallows, S. Rice and L. Shepp (1972). Limit distributions of self-normalized sums. Annals of Probability 1, 788–809.
Loretan, M. and P. C. B. Phillips (1994). Testing the covariance stationarity of heavy-tailed time series. Journal of Empirical Finance 1, 211–48.
Maddala, G. and J. Jeong (1992). On the exact small sample distribution of the instrumental variable estimator. Econometrica 60, 181–83.
Marron, J. and D. Ruppert (1994). Transformations to reduce boundary bias in kernel density estimation. Journal of the Royal Statistical Society, Series B 56, 653–71.
Marsaglia, G. (1965). Ratios of normal variables and ratios of sums of uniform variables. Journal of the American Statistical Association 60, 193–204.
Nelson, C. and R. Startz (1990). Some further results on the exact small sample properties of the instrumental variables estimator. Econometrica 58, 967–76.
Peña, V. H. de la, R. Ibragimov and S. Sharakhmetov (2006). Characterizations of joint distributions, copulas, information, dependence and decoupling, with applications to time series. In J. Rojo (Ed.), Optimality: The Second Erich L. Lehmann Symposium, IMS Lecture Notes–Monograph Series 49, 183–209. Beachwood, OH: Institute of Mathematical Statistics.
Phillips, P. C. B. (1983). Exact small sample theory in the simultaneous equation model. In Z. Griliches and M. Intriligator (Eds.), Handbook of Econometrics, Volume 1, 449–516. Amsterdam: North-Holland.
Phillips, P. C. B. (2006). A remark on bimodality and weak instrumentation in structural equation estimation. Econometric Theory 22, 947–60.
Phillips, P. C. B. and V. A. Hajivassiliou (1987). Bimodal t-ratios. Cowles Foundation Discussion Paper No. 842, Yale University.
Phillips, P. C. B. and M. Wickens (1978). Exercises in Econometrics. Oxford: Philip Allan.
Reed, W. J. and M. Jorgensen (2004). The double Pareto-lognormal distribution: a new parametric model for size distributions. Communications in Statistics: Theory and Methods 33, 1733–53.
Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
Steindl, J. (1965). Random Processes and the Growth of Firms. London: Charles Griffin.
Tapia, R. and J. Thompson (1978). Nonparametric Probability Density Estimation. Baltimore: Johns Hopkins University Press.
Woglom, G. (2001). More results on the exact small sample properties of the instrumental variable estimator. Econometrica 69, 1381–89.
Zellner, A. (1976). Bayesian and non-Bayesian analysis of the regression model with multivariate Student-t error terms. Journal of the American Statistical Association 71, 400–05.
Zellner, A. (1978). Estimation of functions of population means and regression coefficients including structural coefficients: a minimum expected loss (MELO) approach. Journal of Econometrics 8, 127–58.
Zhang, S. and R. Karunamuni (1998). On kernel density estimation near endpoints. Journal of Statistical Planning and Inference 70, 301–16.
Zhang, S., R. Karunamuni and M. Jones (1999). An improved estimator of the density function at the boundary. Journal of the American Statistical Association 94, 1231–41.
C. V. Fiorio, V. A. Hajivassiliou and P. C. B. Phillips
APPENDIX A: PROOF OF THEOREM 2.1

Proof: Part (a): We start by finding the characteristic function of $X_i^2$. This is
\[
E\big(e^{ivX_i^2}\big)
= \int_{-\infty}^{\infty} \frac{e^{ivx^2}}{\pi(1+x^2)}\,dx
= \frac{1}{\pi}\int_{0}^{\infty} \frac{e^{ivr}}{r^{1/2}(1+r)}\,dr
= \pi^{-1/2}\,\Psi\!\left(\tfrac{1}{2},\tfrac{1}{2};-iv\right),
\]
where $\Psi$ is a confluent hypergeometric function of the second kind. It follows that the characteristic function of $S^2 = n^{-2}\sum_{i=1}^{n} X_i^2$ is
\[
E\big(e^{ivS^2}\big)
= \prod_{i=1}^{n} E\big(e^{ivX_i^2/n^2}\big)
= \left[\pi^{-1/2}\,\Psi\!\left(\tfrac{1}{2},\tfrac{1}{2};-\tfrac{iv}{n^2}\right)\right]^{n}.
\tag{A.1}
\]
We now use the following asymptotic expansion of the $\Psi$ function (see Erdélyi, 1953, p. 262):
\[
\Psi\!\left(\tfrac{1}{2},\tfrac{1}{2};-\tfrac{iv}{n^2}\right)
= \pi^{1/2}\left[1 + \frac{\Gamma(-\tfrac{1}{2})}{\Gamma(\tfrac{1}{2})}\,\pi^{-1/2}\,\frac{(-iv)^{1/2}}{n}\right] + o(1/n),
\]
so that (A.1) tends as $n \to \infty$ to
\[
\exp\!\left\{\frac{\Gamma(-\tfrac{1}{2})}{\Gamma(\tfrac{1}{2})\,\pi^{1/2}}\,(-iv)^{1/2}\right\}
= \exp\!\left\{-\frac{2}{\pi^{1/2}}\,(-iv)^{1/2}\right\}.
\]
Using the argument given in the text from equations (2.12) to (2.13), we deduce (2.7) as stated.

Part (b): We take the joint Laplace transform $L(z,w) = \int_{-\infty}^{\infty} \frac{e^{zx+wx^2}}{\pi(1+x^2)}\,dx$ and transform $x \to (r,h)$ according to the decomposition $x = r^{1/2}h$, where $r = x^2$ and $h = \operatorname{sgn}(x) = \pm 1$. Using the Bessel function integral
\[
\frac{1}{2}\sum_{h=\pm 1} e^{zr^{1/2}h}
= {}_0F_1\!\left(\tfrac{1}{2};\tfrac{z^2 r}{4}\right)
= \sum_{k=0}^{\infty} \frac{(z^2/4)^k\, r^k}{k!\,(\tfrac{1}{2})_k},
\]
we obtain
\[
L(z,w)
= \frac{1}{\pi}\sum_{k=0}^{\infty} \frac{(z^2/4)^k}{k!\,(\tfrac{1}{2})_k}\int_{0}^{\infty}\frac{e^{wr}\, r^{k-1/2}}{1+r}\,dr
= \frac{1}{\pi}\sum_{k=0}^{\infty} \frac{(z^2/4)^k\,\Gamma(k+\tfrac{1}{2})}{k!\,(\tfrac{1}{2})_k}\,
\Psi\!\left(k+\tfrac{1}{2},\,k+\tfrac{1}{2};-w\right),
\tag{A.2}
\]
from the integral representation of the $\Psi$ function (Erdélyi, 1953, p. 255). We now use the fact that
\[
\Psi\!\left(k+\tfrac{1}{2},\,k+\tfrac{1}{2};-w\right)
= \Gamma\!\left(\tfrac{1}{2}-k\right)\,{}_1F_1\!\left(k+\tfrac{1}{2},\,k+\tfrac{1}{2};-w\right)
+ \frac{\Gamma(k-\tfrac{1}{2})}{\Gamma(k+\tfrac{1}{2})}\,(-w)^{1/2-k}\,{}_1F_1\!\left(1,\tfrac{3}{2}-k;-w\right)
\tag{A.3}
\]
(see Erdélyi, 1953, p. 257), together with
\[
\Gamma\!\left(\tfrac{1}{2}-k\right) = \frac{\pi}{(-1)^k\,\Gamma(k+\tfrac{1}{2})}
\qquad\text{and}\qquad
{}_1F_1\!\left(k+\tfrac{1}{2},\,k+\tfrac{1}{2};-w\right) = e^{-w}.
\]
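The limit obtained in Part (a) lends itself to a quick Monte Carlo check. The sketch below (the sample sizes n and N, the seed and the grid of v values are illustrative choices of ours, not taken from the paper) compares the empirical characteristic function of S² = n⁻²ΣXᵢ² for Cauchy draws with the limiting form exp{−2(−iv)^{1/2}/π^{1/2}}:

```python
import numpy as np

# Monte Carlo check of Part (a): for X_i i.i.d. Cauchy(0,1) and
# S^2 = n^{-2} sum_i X_i^2, the characteristic function E[exp(iv S^2)]
# should approach exp{-(2/pi^{1/2}) (-iv)^{1/2}} as n grows.
# n, N, the seed and the v grid are illustrative assumptions, not from the paper.
rng = np.random.default_rng(0)
n, N = 400, 20000
X = rng.standard_cauchy((N, n))
S2 = (X**2).sum(axis=1) / n**2

for v in (0.5, 1.0, 2.0):
    empirical = np.exp(1j * v * S2).mean()           # Monte Carlo estimate
    limit = np.exp(-2.0 * np.sqrt(-1j * v) / np.sqrt(np.pi))  # limiting cf
    print(v, abs(empirical - limit))
```

With these settings the discrepancies are of the order of the Monte Carlo error (roughly 10⁻²); note that `np.sqrt(-1j*v)` takes the principal branch, for which Re(−iv)^{1/2} > 0, so the limiting characteristic function decays in |v| as it should.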
Combining (A.2) and (A.3) we have:
\[
L(z,w)
= \sum_{k=0}^{\infty}\frac{(-z^2/4)^k}{k!\,(\tfrac{1}{2})_k}\,e^{-w}
+ \frac{1}{\pi}\sum_{k=0}^{\infty}\frac{(z^2/4)^k\,\Gamma(k-\tfrac{1}{2})}{k!\,(\tfrac{1}{2})_k}\,(-w)^{1/2-k}\,
{}_1F_1\!\left(1,\tfrac{3}{2}-k;-w\right).
\tag{A.4}
\]
Let $z = \frac{iu}{T}$, $w = \frac{iv}{T^2}$. It follows from (A.4) that
\[
L\!\left(\frac{iu}{T},\frac{iv}{T^2}\right)
= 1 + \frac{1}{\pi}\,\frac{(-iv)^{1/2}}{T}\sum_{k=0}^{\infty}\frac{\Gamma(k-\tfrac{1}{2})\,(u^2/4iv)^k}{k!\,(\tfrac{1}{2})_k}
+ o\!\left(\frac{1}{T}\right),
\]
and thus
\[
\left[L\!\left(\frac{iu}{T},\frac{iv}{T^2}\right)\right]^{T}
\to \exp\!\left\{-\frac{2}{\pi^{1/2}}\,(-iv)^{1/2}\,{}_1F_1\!\left(-\tfrac{1}{2},\tfrac{1}{2};\frac{u^2}{4iv}\right)\right\}.
\]
Since $cf_{\bar X,S^2}(u,v) = [L(\frac{iu}{T},\frac{iv}{T^2})]^{T}$ and $\Gamma(-\tfrac{1}{2}) = -2\pi^{1/2}$, we deduce that
\[
cf_{\bar X,Y}(u,v)
= \exp\!\left\{-\frac{2}{\pi^{1/2}}\,(-iv)^{1/2}\,{}_1F_1\!\left(-\tfrac{1}{2},\tfrac{1}{2};\frac{u^2}{4iv}\right)\right\},
\tag{A.5}
\]
as required for (2.8). The second representation in this part of the theorem is obtained by noting that
\[
a^{-1}x^{a}\,{}_1F_1(a,\,a+1;-x) = \Gamma(a) - e^{-x}\,\Psi(1-a,\,1-a;\,x)
\]
(Erdélyi, 1953, p. 266). Using this result with $a = -\tfrac{1}{2}$ we find
\[
-(-iv)^{1/2}\,{}_1F_1\!\left(-\tfrac{1}{2},\tfrac{1}{2};\frac{u^2}{4iv}\right)
= \frac{|u|}{4}\left[\Gamma\!\left(-\tfrac{1}{2}\right)
- e^{u^2/4iv}\,\Psi\!\left(\tfrac{3}{2},\tfrac{3}{2};\frac{-u^2}{4iv}\right)\right].
\tag{A.6}
\]
Using (A.6) in (A.5) we obtain (2.9) as stated.

Part (c): To prove equations (2.10) and (2.11), note that $S_X^2 = S^2 - n^{-1}\bar{X}^2 = S^2 + O_p(n^{-1})$ since $\bar{X} \Rightarrow \mathrm{Cauchy}(0,1)$. Similarly, $t_X = \bar{X}\,[S^2 + O_p(n^{-1})]^{-1/2} = t + O_p(n^{-1})$, as required.

Part (d): To prove that the density of the t-ratio has singularities with infinite poles at $\pm 1$, it suffices to note that, in the notation of Logan et al. (1972), the case of the t-ratio (2.6) based on i.i.d. Cauchy draws corresponds to their parameters $p = 2$, $\alpha = 1$ and $r/l = 1$. Then their equations (5.1) and (5.2) and Lemmas A and B guarantee the result.
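The confluent hypergeometric step at (A.6) can be cross-checked numerically. A minimal sketch using mpmath, whose `hyperu` plays the role of Ψ; the test points (u, v) are arbitrary values of our choosing:

```python
from mpmath import mp, hyp1f1, hyperu, gamma, exp, sqrt

# Numerical check of the identity (A.6):
#   -(-iv)^{1/2} 1F1(-1/2, 1/2; u^2/(4iv))
#     = (|u|/4) [ Gamma(-1/2) - e^{u^2/4iv} Psi(3/2, 3/2; -u^2/(4iv)) ],
# with principal branches throughout (v > 0).
mp.dps = 30
for u, v in [(1.3, 0.7), (0.4, 2.0), (2.5, 1.1)]:
    z = u**2 / (4j * v)                       # the 1F1 argument u^2/(4iv)
    lhs = -sqrt(-1j * v) * hyp1f1(-0.5, 0.5, z)
    rhs = (abs(u) / 4) * (gamma(-0.5) - exp(z) * hyperu(1.5, 1.5, -z))
    print(abs(lhs - rhs))                     # differences are at machine-zero level
```

The agreement also follows analytically: expanding Ψ(3/2, 3/2; ·) and simplifying reduces (A.6) to Kummer's transformation ₁F₁(a, b; z) = eᶻ ₁F₁(b − a, b; −z).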
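The bimodality asserted in Part (d) is easy to reproduce by simulation. A minimal sketch, with illustrative values for n, N, the seed and the window widths (none taken from the paper): draw Cauchy samples, form the t-ratio with S² = n⁻²ΣXᵢ² as in the proof, and compare the mass near ±1 with the mass near 0:

```python
import numpy as np

# Simulation of Part (d): the t-ratio t = Xbar / S built from i.i.d.
# Cauchy(0,1) draws, with S^2 = n^{-2} sum_i X_i^2 as in the proof,
# has a bimodal density with peaks at +/-1 (infinite poles in the limit).
# n, N, the seed and the comparison windows are illustrative assumptions.
rng = np.random.default_rng(1)
n, N = 50, 100000
X = rng.standard_cauchy((N, n))
t = X.mean(axis=1) / np.sqrt((X**2).sum(axis=1) / n**2)  # = sum X / sqrt(sum X^2)

# Mass piles up near +/-1 rather than near 0 (windows of equal total width).
near_modes = np.mean((np.abs(t) > 0.8) & (np.abs(t) < 1.2))
near_zero = np.mean(np.abs(t) < 0.4)
print(near_modes, near_zero)
```

By the Cauchy–Schwarz inequality |t| ≤ √n always, and the simulated mass in the window around ±1 clearly exceeds the mass in an equally wide window around the origin, consistent with the poles at ±1.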