Econometrics Journal (2008), volume 11, pp. 219–243. doi: 10.1111/j.1368-423X.2008.00240.x

Panel vector autoregression under cross-sectional dependence

XIAO HUANG†

†Department of Economics & Finance, Box #403, Kennesaw State University, Kennesaw, GA 30144, USA. E-mail: [email protected]

First version received: October 2005; final version accepted: December 2007
Summary  This paper studies estimation in panel vector autoregression (VAR) under cross-sectional dependence. The time series are allowed to be an unknown mixture of stationary and unit root processes with possible cointegrating relations. The cross-sectional dependence is modeled with a factor structure. We extend the factor analysis in Bai and Ng (2002, Econometrica 70, 191–221) to vector processes. The fully modified (FM) estimator in Phillips (1995) is used for estimation in the panel VAR, and we also propose a factor augmented FM estimator. Our simulation results show that this factor augmented FM estimator performs well when the sample size is large.

Key words: Cross-sectional dependence, Factor analysis, Panel data, VAR.
1. INTRODUCTION

Over the past few years there has been growing interest in studying cross-sectional dependence in panel data models. Cross-sectional dependence, arising for example from oil price shocks and business cycles, is common in economic data, and it may cause bias and inconsistency in estimation if not properly accounted for. Recent studies on cross-sectional dependence in panel data focus mainly on two issues: parameter estimation in the mean and factor analysis for data with a panel structure.

The estimation of parameters in the mean under cross-sectional dependence has been studied in several papers. Pesaran (2006) suggested an augmented regression to eliminate unobserved cross-sectional dependence in panel data models, and Pesaran et al. (2004) used an augmented VAR model to study regional interdependencies. Phillips and Sul (2003) proposed a median unbiased estimation procedure for testing and confidence interval construction in dynamic panels. The asymptotic bias in dynamic panel estimation under cross-sectional dependence was studied in Phillips and Sul (2007). Andrews (2005) analyzed the properties of least squares (LS) estimators for both cross-section and panel data under cross-sectional dependence. Bai and Kao (2006) recently studied panel cointegration with a factor structure in the errors, where they obtained a limiting distribution of the fully modified (FM) estimator and also proposed a continuously-updated FM (CUP-FM) estimator. Chang (2002, 2004), Im and Pesaran (2003) and Moon and Perron (2004) studied panel unit root tests under cross-sectional dependence. Mutl (2002) studied spatial dependence in panel VAR.
Factor models for data with a panel structure assume that variations in a large number of economic variables can be modeled by a small number of reference variables. Stock and Watson (1998, 2002a,b) provided asymptotic results and examples for diffusion index forecasting. Forni et al. (2000) and Forni and Lippi (2001) provided a theory for factor models with infinite dynamics and non-orthogonal idiosyncratic errors. Bai and Ng (2002) gave criterion functions to select the number of factors under heteroskedasticity in approximate factor models. Bai (2003) further developed an inferential theory for factor models with large dimensions. Tests that can distinguish whether non-stationarity in the data comes from common components or from an idiosyncratic source were developed in Bai and Ng (2004). Bai (2004) studied large-dimension factor models with non-stationary dynamic factors. Bayesian methods were used in Canova and Ciccarelli (2002) to integrate panel VAR and index models for forecasting purposes.

In this paper, we study estimation in non-stationary panel VAR under cross-sectional dependence. The cross-sectional dependence is modeled with a factor structure and the estimated factors are used as augmented regressors in the panel VAR estimation. The time dimension of the panel is assumed to be large to allow for possible parameter heterogeneity across different cross-sectional units. The procedure we propose consists of three steps. In the first step, we obtain a first stage estimator for each cross-sectional unit through OLS by ignoring the cross-sectional dependence; in the second step, we apply factor analysis to the residuals from the first stage estimation; in the third step, the model is re-estimated using the factor augmented FM method. The theory of factor analysis for large panels developed so far applies to scalar processes, and we extend the method to vector processes in this paper.

The rest of the paper is organized as follows. Section 2 extends the method of factor analysis to vector processes and provides the asymptotic results. Section 3 presents a fully modified estimator of the VAR under cross-sectional dependence. Section 4 provides the simulation results and Section 5 concludes the paper.
2. FACTOR MODEL

Let us consider the following first order panel VAR model with m variables and r cross-sectional shocks:

  y_it = A_i y_{i,t−1} + x_it,  (2.1)
  x_it = λ_i f_t + e_it,  i = 1, ..., N,  t = 1, ..., T,  (2.2)
where y_it, x_it and e_it are m × 1 vectors, λ_i is an m × r matrix of factor loadings, f_t is an r × 1 vector of cross-sectional shocks and A_i is an m × m matrix. The elements of y_it, x_it and e_it are y_itp, x_itp and e_itp, respectively, where p = 1, ..., m. The coefficient A_i may be different for different cross-sectional units. If f_t and e_it are assumed to be i.i.d., a first stage consistent estimator Â_i can be obtained through OLS, provided T is sufficiently large, and λ_i and f_t can be estimated from the residuals x̂_it. We show in the following that for a vector process x̂_it, the solution to factor analysis with large T and N is the same as for a scalar process, except for a modification with respect to the dimension of x̂_it.

Consider the factor model in (2.2). Define x_t = (x_1t, ..., x_Nt), an m × N matrix; x_t = vec(x_t), an mN × 1 vector; x_i = (x_i1, ..., x_iT)', a T × m matrix. In a similar fashion, we define y_t, y_t, y_i, e_t, e_t and e_i. Let λ = (λ_1', ..., λ_N') be an r × mN matrix, where λ_i' = (λ_i1, ..., λ_im) is an r × m matrix and λ_ip is an r × 1 vector. Let e = (e_1, ..., e_N), f = (f_1, ..., f_T)' and x = (x_1, ..., x_N). The factor model for the error terms can be written as

  x_i = f λ_i' + e_i,  with x_i of dimension T × m,  (2.3)
  x = f λ + e,  with x of dimension T × mN,  (2.4)

where f is T × r, λ is r × mN and e is T × mN.
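To make the setup concrete, the following minimal sketch (in Python; not from the paper, and all dimension and parameter choices are illustrative) simulates (2.1)–(2.2) with a common coefficient matrix and computes the first stage OLS residuals x̂_it unit by unit:

```python
import numpy as np

# Illustrative simulation of (2.1)-(2.2) and the first-stage OLS step.
# Sizes (N, T, m, r) and parameter values are assumptions, not the paper's.
rng = np.random.default_rng(0)
N, T, m, r = 20, 200, 2, 3

A = np.array([[0.5, 0.1], [-0.5, 1.1]])        # common A_i for simplicity
lam = rng.normal(0.1, 1.0, size=(N, m, r))     # factor loadings lambda_i (m x r)
f = rng.normal(size=(T, r))                    # cross-sectional shocks f_t
y = np.zeros((N, T + 1, m))
for i in range(N):
    for t in range(1, T + 1):
        e = rng.normal(size=m)
        x = lam[i] @ f[t - 1] + e              # x_it = lambda_i f_t + e_it
        y[i, t] = A @ y[i, t - 1] + x          # y_it = A_i y_{i,t-1} + x_it

# First-stage OLS unit by unit, ignoring cross-sectional dependence; the
# residuals x_hat feed the factor analysis of this section.
x_hat = np.zeros((N, T, m))
for i in range(N):
    Y1, Y0 = y[i, 1:], y[i, :-1]                       # T x m each
    A_hat = np.linalg.lstsq(Y0, Y1, rcond=None)[0].T   # y_t' = y_{t-1}' A'
    x_hat[i] = Y1 - Y0 @ A_hat.T
```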
Let x̂ be the residual from the first stage estimation. After we replace x with x̂ in (2.4), the objective function for factor analysis becomes

  min_{λ_ip, f_t} V(k) = (mNT)^{-1} Σ_{i=1}^N Σ_{t=1}^T Σ_{p=1}^m (x̂_itp − λ_ip' f_t)².  (2.5)

The solutions to (2.5) are

  λ̃_ip = (Σ_{t=1}^T f_t f_t')^{-1} Σ_{t=1}^T f_t x̂_itp,  (2.6)

  f̃_t = (Σ_{i=1}^N Σ_{p=1}^m λ_ip λ_ip')^{-1} Σ_{i=1}^N Σ_{p=1}^m λ_ip x̂_itp.  (2.7)
Substituting (2.6) into (2.5), we obtain

  V(k) = N^{-1} Σ_{i=1}^N vec(x̂_i − f(f'f)^{-1}f'x̂_i)' vec(x̂_i − f(f'f)^{-1}f'x̂_i)
       = N^{-1} Σ_{i=1}^N tr[(x̂_i − f(f'f)^{-1}f'x̂_i)'(x̂_i − f(f'f)^{-1}f'x̂_i)]
       = N^{-1} tr(Σ_{i=1}^N x̂_i'x̂_i) − (NT)^{-1} tr(f'(Σ_{i=1}^N x̂_i x̂_i')f),  (2.8)

where vec(·) and tr(·) are the vectorization and trace operators, respectively. The last equality in (2.8) follows from the normalization condition f'f/T = I_r. Minimizing (2.5) is equivalent to maximizing the second term in (2.8) by choosing a proper f. If the initial guess for the number of factors is k, the estimated factors f̃ are √T times the eigenvectors corresponding to the k largest eigenvalues of the T × T matrix x̂x̂'. The estimated factor loadings are given by λ̃ = f̃'x̂/T.
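A compact sketch of this principal components step follows, under the stated normalization f'f/T = I_r; stacking the residuals into a T × mN array is our implementation choice, not the paper's code:

```python
import numpy as np

def pc_factors(x_hat_stacked, k):
    """Principal components factor estimates from a T x (m*N) residual matrix.

    Returns f_tilde = sqrt(T) times the eigenvectors of xhat xhat' for the k
    largest eigenvalues, and lam_tilde = f_tilde' xhat / T, as in the text.
    """
    T = x_hat_stacked.shape[0]
    XX = x_hat_stacked @ x_hat_stacked.T             # T x T
    vals, vecs = np.linalg.eigh(XX)                  # eigenvalues ascending
    f_tilde = np.sqrt(T) * vecs[:, -k:][:, ::-1]     # T x k
    lam_tilde = f_tilde.T @ x_hat_stacked / T        # k x (m*N)
    return f_tilde, lam_tilde
```

With the arrays from the earlier sketch, x_hat_stacked could be formed as x_hat.transpose(1, 0, 2).reshape(T, -1).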
Alternatively, we can substitute (2.7) into the objective function to obtain

  V(k) = T^{-1} Σ_{t=1}^T [vec(x̂_t) − λ'(λλ')^{-1}λ vec(x̂_t)]'[vec(x̂_t) − λ'(λλ')^{-1}λ vec(x̂_t)]
       = T^{-1} Σ_{t=1}^T vec(x̂_t)'vec(x̂_t) − T^{-1} Σ_{t=1}^T vec(x̂_t)'λ'(λλ')^{-1}λ vec(x̂_t)
       = T^{-1} Σ_{t=1}^T vec(x̂_t)'vec(x̂_t) − (mNT)^{-1} tr(λ x̂'x̂ λ').  (2.9)

The last equality in (2.9) follows from the normalization λλ'/(mN) = I_r, and the estimated factor loadings λ̄ are chosen as √(mN) times the eigenvectors corresponding to the k largest eigenvalues of the mN × mN matrix x̂'x̂. The estimated factors are given by f̄ = x̂λ̄'/(mN).

The above estimators for f and λ are obtained when r is assumed to be equal to k, and we also need to estimate r. Under the assumption that r is the same for every cross-sectional unit, the number of factors can be consistently estimated by using the criterion functions proposed in Bai and Ng (2002). Consider a simple example where the panel VAR includes only GDP and the interest rate. To estimate r, we can apply the method in Bai and Ng (2002) to residuals from the equation that describes the dynamics of GDP. However, the estimated factors and factor loadings then ignore information in the residuals from the equation related to interest rate dynamics. To address this problem, we propose to extend the method in Bai and Ng (2002) to vector processes, and the following assumptions are needed for the results in Theorem 2.1:

ASSUMPTION 2.1. f_t is i.i.d. with zero mean, E||f_t||⁴ < ∞ and T^{-1} Σ_{t=1}^T f_t f_t' → Σ_f as T → ∞ for some positive definite matrix Σ_f. ||λ_ip|| ≤ M < ∞ for a finite positive constant M, and (mN)^{-1} λλ' → Σ_λ as N → ∞ for some positive definite matrix Σ_λ.

ASSUMPTION 2.2. e_it is i.i.d. with zero mean and covariance matrix Σ_e. E||e_it||⁴ < ∞ and E(f_t e_isp) = 0 for all i, s, t and p.

ASSUMPTION 2.3. There exists a finite positive constant M such that for all i, j, p, q, s and t: (a) E(e_itp) = 0 and E|e_itp|⁸ ≤ M; (b) E(m^{-1}N^{-1} Σ_{i=1}^N Σ_{p=1}^m e_isp e_itp) = γ_mN(s, t), |γ_mN(s, s)| ≤ M and T^{-1} Σ_{s=1}^T Σ_{t=1}^T |γ_mN(s, t)| ≤ M; (c) E(e_itp e_jtq) = τ_{ij,t,pq} with |τ_{ij,t,pq}| ≤ |τ_{ij,pq}| for some τ_{ij,pq}, and in addition (mN)^{-1} Σ_{i=1}^N Σ_{j=1}^N Σ_{p=1}^m Σ_{q=1}^m |τ_{ij,pq}| ≤ M; and (d) E|(mN)^{-1/2} Σ_{i=1}^N Σ_{p=1}^m (e_itp e_isp − E(e_itp e_isp))|⁴ ≤ M.

Assumptions 2.1 and 2.2 are similar to those in Bai and Ng (2002), with the exception that we further require both f_t and e_isp to be i.i.d. and uncorrelated with each other. Assumption 2.3(b) allows for serial dependence in the error terms, while Assumption 2.3(c) allows for cross-sectional dependence in the error terms. Similar to the results in Bai and Ng (2002), consistent estimation of r requires a penalty function g in the criterion function

  PC(k) = V(k, f̂^k) + k · g,

where V(k, f̂^k) is obtained by substituting the estimated factors and factor loadings into (2.5). The value of k that minimizes PC(k) is the estimate of r. The penalty function g in Bai and Ng (2002) is a function of N and T only. It may also depend on the convergence rate of Â_i − A_i when x̂_it is
obtained through first stage OLS in (2.1). Define the block diagonal coefficient matrix

  A = diag(A_1, ..., A_N),

an mN × mN matrix. Let (mNT)^{-1} Σ_{s=1}^T ||y_{s−1}||² = O_p(T^δ), where y_s is an mN × 1 vector and δ indicates the probability order. For example, if all time series in the panel data are stationary, δ = 0; if there is a unit root, δ = 1. The following theorem shows that minimizing PC(k) gives a consistent estimate of r.

THEOREM 2.1. If Assumptions 2.1–2.3 hold and k factors are obtained by the principal component method with k̂ = arg min PC(k), then lim_{N,T→∞} Prob(k̂ = r) = 1 if g → 0 and [(mN)^{-1} O_p(T^δ) O_p(||Â − A||²) + O_p(C_mNT^{-2})]^{-1} g → ∞ as N, T → ∞, with C_mNT = min(√(mN), √T).

The proof is given in the Appendix. This theorem is similar to Theorem 2 in Bai and Ng (2002) except for the penalty function. When N and T are large, the effect of m in the criterion function is negligible. However, it may make some difference in finite samples. Examples of penalty functions are given in Section 4. After extracting the common shocks in the error, we proceed to study the FM estimator for the parameters in the conditional mean in the next section.
3. FULLY MODIFIED ESTIMATION

The cross-sectional dependence in (2.1) is assumed to be a stationary process, but we allow for possible non-stationarity and cointegration in y_it. When T is large, we can estimate the model for each cross-sectional unit and allow for possible parameter heterogeneity. By assuming normality of the error term e_it in (2.2) and exogeneity of the cross-sectional shocks, we may extend Johansen's (1988) method to a process with exogenous variables. It is also possible to further relax the assumption of stationary cross-sectional shocks, but the asymptotic results of factor analysis for non-stationary vector processes would have to be derived, similarly to those in Bai (2004). In this section, our model in (2.1) is estimated using the fully modified VAR (FM-VAR) method in Phillips (1995), which imposes no distributional assumptions on e_it or f_t.

The FM method was developed in Phillips (1995) as an alternative way to estimate a possibly cointegrated VAR process without pretesting for the cointegration rank and the location of unit roots. The method can be used to correct possible endogeneity in the regressors due to cointegration. Equation (2.1) differs from the model in Phillips (1995) by the extra term λ_i f_t. However, we can treat the r.h.s. of (2.2) as a composite error term, and we expect that the asymptotic theory for Â_i in (2.1) gives a different variance-covariance structure compared to that in Phillips (1995). We note that Bai and Kao (2006) also used the FM method for panel data with cross-sectional dependence. However, our paper differs from theirs in several aspects: (1) we study a panel VAR model instead of static panels with a scalar process; (2) the time dimension T is large in our model, which allows for parameter heterogeneity; (3) the method in Bai and Ng (2002) is extended to vector processes and (4) we consider the finite sample performance of factor augmented regressions in simulation.
In the FM estimation of A_i in (2.1), we consider a general qth order VAR:

  y_it = J_i(L) y_{i,t−1} + x_it  (3.1)
       = J_i*(L) Δy_{i,t−1} + A_i y_{i,t−1} + λ_i f_t + e_it  (3.2)
       = J_i z_t + A_i y_{i,t−1} + x_it,  (3.3)

where J_i(L) = Σ_{h=1}^q J_ih L^{h−1}, J_i*(L) = Σ_{h=1}^{q−1} J_ih* L^{h−1}, J_ih* = −Σ_{g=h+1}^q J_ig, J_i = (J_i1*, ..., J_i,q−1*) and z_t = (Δy_{i,t−1}', ..., Δy_{i,t−q+1}')'. When q = 1, equation (3.3) reduces to (2.1).

The following notation is used in deriving the FM estimator. Let a_t and b_t be two covariance stationary processes.¹ The long-run and one-sided long-run covariance matrices between a and b are given by Ω_ab = Σ_{j=−∞}^∞ E(a_{t+j} b_t') and Δ_ab = Σ_{j=0}^∞ E(a_{t+j} b_t'), whose kernel estimators are Ω̂_ab = Σ_{j=−T+1}^{T−1} w(j/K) Γ̂_ab(j) and Δ̂_ab = Σ_{j=0}^{T−1} w(j/K) Γ̂_ab(j), respectively. Here w(·) is the kernel function with a lag truncation or bandwidth parameter K, and in Ω̂_ab and Δ̂_ab, Γ̂_ab(j) = T^{-1} Σ_{1≤t,t+j≤T} a_{t+j} b_t'.

¹ This succinct notation is selected from Kauppi (2004).
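As an illustration, a small sketch of these kernel estimators with the Parzen kernel follows; the kernel choice matches the simulations in Section 4, while the function names and interface are ours:

```python
import numpy as np

def parzen(u):
    """Parzen kernel weight w(u); the weight vanishes for |u| > 1."""
    u = abs(u)
    if u <= 0.5:
        return 1.0 - 6.0 * u**2 + 6.0 * u**3
    if u <= 1.0:
        return 2.0 * (1.0 - u)**3
    return 0.0

def longrun_cov(a, b, K):
    """Kernel estimates of Omega_ab and Delta_ab for T x da and T x db arrays.

    Implements the formulas above with bandwidth K; terms with |j| > K get
    zero weight, so the sums are truncated there.
    """
    T = a.shape[0]

    def gamma(j):
        # Gamma_ab(j) = T^{-1} * sum over 1 <= t, t+j <= T of a_{t+j} b_t'
        if j >= 0:
            return a[j:].T @ b[:T - j] / T
        return a[:T + j].T @ b[-j:] / T

    omega = sum(parzen(j / K) * gamma(j) for j in range(-K, K + 1) if abs(j) < T)
    delta = sum(parzen(j / K) * gamma(j) for j in range(0, K + 1) if j < T)
    return omega, delta
```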
The following assumptions for the estimation of A_i are from Phillips (1995):

ASSUMPTION 3.1. (a) |I_m − J_i(L)L| = 0 has roots on or outside the unit circle; (b) A_i = I_m + αβ', where α and β are m × r matrices of full column rank r, 0 ≤ r ≤ m; (c) α_⊥'(J*(1) − I_m)β_⊥ is non-singular, where α_⊥ and β_⊥ are m × (m − r) matrices of full column rank such that α_⊥'α = β_⊥'β = 0.

Define the orthogonal transformation matrix H = [β, β_⊥]. We may multiply both sides of (3.3) by H' to obtain a set of equations similar to equations (31a) and (31b) in Phillips (1995):

  ỹ_1it = J̃_1 z̃_t + Ã_i11 ỹ_1i,t−1 + x̃_1it,  (3.4)
  ỹ_2it = ỹ_2i,t−1 + u_2t,  u_2t = x̃_2it + J̃_2 z̃_t + Ã_i21 ỹ_1i,t−1,  (3.5)

where the coefficients J̃_1, J̃_2, Ã_i11 and Ã_i21 are elements of the properly partitioned matrices of H'J_i H and H'A_i H in equation (28) of Phillips (1995), and tilded variables are left-multiplications of the original variables by H'. The transformed variable ỹ_1it is stationary and ỹ_2it is non-stationary. In matrix notation, equations (3.4) and (3.5) may be written as

  ỹ_i = J̃_i z̃_i + Ã_i ỹ_i,−1 + x̃_i = F̃_i W̃_i + x̃_i  (3.6)
      = F̃_1i W̃_i1 + F̃_2i W̃_i2 + x̃_i,  (3.7)

where F̃_i = (J̃_i, Ã_i), and W̃_i1 and W̃_i2 contain the stationary and non-stationary regressors, respectively. From (3.6), the asymptotic results for the estimator of F̃_i are similar to those in Theorem 5.1 of Phillips (1995). In practice, we need not know the non-stationarity properties of the data, and the
estimator F̂_i^+ takes the following form:

  F̂_i^+ = [ y_i'z_i ⋮ y_i'y_{i,−1} − λ̂_i(Ω̂_{f,Δy} Ω̂_{Δy,Δy}^{-1} Δy_{i,−1}'y_{i,−1} − T Δ̂_{f,Δy}^+) − (Ω̂_{e,Δy} Ω̂_{Δy,Δy}^{-1} Δy_{i,−1}'y_{i,−1} − T Δ̂_{e,Δy}^+) ] (W_i'W_i)^{-1}.  (3.8)
The distribution of the FM estimator in the original coordinates is similar to the result in Theorem 5.7 of Phillips (1995), and the proof is omitted. It is summarized in the following theorem:

THEOREM 3.1. Under Assumption 3.1, we have

  √T (F̂_i^+ − F_i) →_d N(0, (λ_i Σ_f λ_i' + Σ_e) ⊗ G Σ_11^{-1} G'),

where

  G = [ I_{q−1} ⊗ H  0 ; 0  β ],  G_⊥ = (0 ⋮ β_⊥)  and  Σ_11 = E(W_i1t W_i1t').
The estimator in (3.8) is not valid in the presence of I(2) variables. However, it is possible to use the residual-based fully modified VAR procedure in Chang (2000), where an unknown mixture of I(0), I(1) and I(2) variables is allowed. The correction for endogeneity in (3.8) is taken with respect to both the factors and the error terms, which is equivalent to making the correction with respect to the composite error term x̂_it = λ̂_i f̂_t + ê_it. In other words, if the FM method is applied to each cross-sectional unit in the panel data, ignoring the cross-sectional dependence produces the same estimator (3.8). The estimated factors can be used in the following factor augmented regression to possibly improve the estimation in (3.2):

  y_it = J_i*(L) Δy_{i,t−1} + A_i y_{i,t−1} + λ_i f̂_t + e_it.  (3.9)

The correction for possible endogeneity in the VAR in (3.9) is taken with respect to e_it only, and f̂_t is added as a stationary regressor. We study the finite sample properties of Â_i in this factor augmented FM-VAR in the next section.
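For q = 1, a minimal sketch of the augmented regression is given below; this is an OLS version for illustration only, while the paper applies the FM correction with respect to e_it on top of it:

```python
import numpy as np

def factor_augmented_ols(y_i, f_hat):
    """Factor augmented regression (3.9) with q = 1, estimated by OLS.

    y_i: (T+1) x m array of levels for unit i; f_hat: T x k estimated factors.
    Regress y_t on [y_{t-1}, f_hat_t] and split the coefficient blocks.
    """
    Y1, Y0 = y_i[1:], y_i[:-1]
    Z = np.hstack([Y0, f_hat])                     # regressors [y_{t-1}, f_hat_t]
    coef = np.linalg.lstsq(Z, Y1, rcond=None)[0]   # (m + k) x m
    m = Y0.shape[1]
    A_hat = coef[:m].T                             # m x m block: estimate of A_i
    lam_hat = coef[m:].T                           # m x k block: estimate of lambda_i
    return A_hat, lam_hat
```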
4. SIMULATION RESULTS

In this section we use simulation to study the finite sample properties of the factor analysis in (2.2), the FM-VAR in (2.1) and the factor augmented FM-VAR in (3.9). Let us consider the model in equations (2.1) and (2.2) with m = 2 and r = 3. The value of the matrix A_i is taken from the cointegrated panel VAR model in Binder et al. (2005):

  A_i = [  0.5  0.1
          −0.5  1.1 ],

with α = (−0.5, −0.5)' and β = (1, −0.2)'. We also assume parameter homogeneity across different groups to simplify the data generating process (DGP); a quick consistency check of this DGP is sketched below.

The simulation consists of two parts. First, we need to determine the number of factors in (2.2) when x_it is replaced with the residuals from the first stage OLS method in (2.1). The result in Theorem 2.1 implies that the penalty function must take into consideration the convergence rate of Â_i − A_i, in addition to mN and T. However, it can be shown that if Â_i is the first stage OLS estimator, we have (mN)^{-1} O_p(T^δ) O_p(||Â − A||²) = O_p(T^{-1}), and only mN and T need to be considered in the penalty function.
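The following quick check (ours, not part of the paper's simulations) verifies that this A_i equals I_2 + αβ' and has one unit root and one stable root, so that y_it is I(1) with a single cointegrating relation:

```python
import numpy as np

# Consistency check of the simulation DGP taken from Binder et al. (2005).
A = np.array([[0.5, 0.1], [-0.5, 1.1]])
alpha = np.array([[-0.5], [-0.5]])
beta = np.array([[1.0], [-0.2]])

assert np.allclose(A, np.eye(2) + alpha @ beta.T)
print(np.abs(np.linalg.eigvals(A)))   # roots 1.0 and 0.6: one unit root
```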
Table 1. Estimated number of factors over 1,000 replications. The true number of factors (r) is set at 3.

  N    T     PCp1    PCp2    PCp3    ICp1    ICp2    ICp3
  100  40    5       5       5       3       3       3.001
  100  60    5       5       5       3       3       3
  200  60    5       4.995   5       3       3       3
  500  60    4.121   3.96    4.603   3       3       3
  100  100   5       5       5       3       3       3
  200  100   4.988   4.497   5       3       3       3
  500  100   3.229   3.164   3.758   3       3       3
  40   100   5       5       5       3       3       3.38
  60   100   5       5       5       3       3       3.034
  60   200   5       5       5       3       3       3
  10   50    5       5       5       3.427   3.035   4.981
  10   100   5       5       5       3.031   3.001   3.706
  20   100   5       5       5       3       3       3.072
  100  10    5       5       5       5       5       5
  100  20    5       5       5       3       3       3.002
The following factor number selection criteria are modified from those in Bai and Ng (2002):

  PC_p1(k) = V(k, f̂^k) + k σ̂² ((mN + T)/(mNT)) log(mNT/(mN + T));
  PC_p2(k) = V(k, f̂^k) + k σ̂² ((mN + T)/(mNT)) log C²_mNT;
  PC_p3(k) = V(k, f̂^k) + k σ̂² (log C²_mNT / C²_mNT);
  IC_p1(k) = log(V(k, f̂^k)) + k ((mN + T)/(mNT)) log(mNT/(mN + T));
  IC_p2(k) = log(V(k, f̂^k)) + k ((mN + T)/(mNT)) log C²_mNT;
  IC_p3(k) = log(V(k, f̂^k)) + k (log C²_mNT / C²_mNT).

In our simulation, we assume f_t ~ i.i.d. N(0, 1) and λ_i ~ i.i.d. N(0.1, 1), element by element, and

  e_it ~ i.i.d. N(0, Σ_e),  Σ_e = [ 1  0.5 ; 0.5  1 ].

Table 1 reports the estimated number of factors (k̂) over 1,000 replications. The upper bound for k̂ in estimation is set at 5, while the true number of factors is 3. As reported in Table 1, all three PC criteria fail to select the correct number of factors unless both N and T are large, but the IC criteria correctly determine the number of factors most of the time.
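A sketch of the resulting selection rule, using IC_p2 as an example, is given below; this is our illustrative implementation, with V(k) computed from the principal components fit of Section 2:

```python
import numpy as np

def select_num_factors(x_hat_stacked, k_max):
    """Pick k minimizing IC_p2, with mN as the cross-section dimension."""
    T, mN = x_hat_stacked.shape
    C2 = min(mN, T)                                  # C_{mNT}^2 = min(mN, T)
    XX = x_hat_stacked @ x_hat_stacked.T
    vals, vecs = np.linalg.eigh(XX)
    ic = []
    for k in range(1, k_max + 1):
        f = np.sqrt(T) * vecs[:, -k:]                # k-factor PC fit
        lam = f.T @ x_hat_stacked / T
        resid = x_hat_stacked - f @ lam
        V = (resid**2).mean()                        # V(k, f_hat^k)
        ic.append(np.log(V) + k * (mN + T) / (mN * T) * np.log(C2))
    return int(np.argmin(ic)) + 1
```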
Table 2. Mean biases and standard deviations of four estimators. DGP: N = 100; σ²_f = 1; λ_itp ~ N(0.1, 1).

             A11 = 0.5                                A12 = 0.1
  Estimator  T = 50     T = 100    T = 200    T = 50     T = 100    T = 200
  OLS        −0.0534    −0.0202    −0.0178    0.0205     0.0075     0.0060
             (0.2967)   (0.3100)   (0.2056)   (0.0633)   (0.0690)   (0.0428)
  FM         0.1122     0.1803     0.1420     0.0218     −0.0359    −0.0282
             (0.3058)   (0.3187)   (0.2102)   (0.0652)   (0.0709)   (0.0437)
  UFM        0.1162     0.1670     0.1500     0.0217     −0.0328    −0.0293
             (0.3060)   (0.3178)   (0.2106)   (0.0653)   (0.0707)   (0.0438)
  AUG        0.0547     0.0863     0.0994     0.0107     −0.0173    −0.0197
             (0.2203)   (0.2144)   (0.1744)   (0.0470)   (0.0477)   (0.0363)

             A21 = −0.5                               A22 = 1.1
  Estimator  T = 50     T = 100    T = 200    T = 50     T = 100    T = 200
  OLS        0.0012     0.0100     −0.0031    −0.0057    −0.0065    −0.0021
             (0.6876)   (0.6007)   (0.4033)   (0.1464)   (0.1336)   (0.0839)
  FM         0.0427     0.1569     0.0589     −0.0131    −0.0337    −0.0128
             (0.6999)   (0.6082)   (0.4055)   (0.1490)   (0.1353)   (0.0843)
  UFM        0.0477     0.1459     0.0616     −0.0136    −0.0310    −0.0131
             (0.7014)   (0.6081)   (0.4057)   (0.1494)   (0.1353)   (0.0844)
  AUG        0.0429     0.0289     0.0562     −0.0112    −0.0075    −0.0121
             (0.3665)   (0.3125)   (0.2673)   (0.3668)   (0.3127)   (0.2673)

Notes: Tables 2–6 report the average bias and standard deviation (in parentheses) of the four parameters in the VAR coefficient over 1,000 simulations. In each table, the number of factors (r) is set at 3; all the factors (f_t) are sampled from i.i.d. normal distributions with zero mean. σ_f is the standard deviation of each element of f_t.
When x_it in (2.2) is observed, it is shown in an earlier version of this paper that all six criteria correctly determine the number of factors, and similar results are obtained in Bai and Ng (2002). When x_it needs to be estimated, the IC criteria clearly outperform the PC criteria. Hence we recommend using the IC criteria in practice.

Next, we study the finite sample properties of the FM-VAR under cross-sectional shocks. Each element of λ_i is generated from N(0.1, 1) or N(0.5, 1), and each element of f_t is generated from N(0, 1) or N(0, 5). We let N = 100, 500, 1,000 and T = 50, 100, 200 and use different combinations of (N, T) in the simulation. The long-run and one-sided long-run variance-covariance matrices in (3.8) are calculated with the KERNEL procedure in COINT 2.0, where we use the Parzen kernel and an arbitrary lag truncation number of 5.

In Tables 2–6, we report the mean bias and standard deviation (in parentheses) of each element of Â_i over 1,000 replications. The FM estimator and the factor augmented (AUG) FM estimator, as well as results from OLS and the continuously-updated FM estimator (UFM) of Bai and Kao (2006), are reported. UFM is obtained through an iteration procedure: first, the FM estimator is used as the
Table 3. Mean biases and standard deviations of four estimators. DGP: N = 500; σ²_f = 5; λ_itp ~ N(0.1, 1).

             A11 = 0.5                                A12 = 0.1
  Estimator  T = 50     T = 100    T = 200    T = 50     T = 100    T = 200
  OLS        −0.0698    0.0225     0.0140     0.0385     0.0128     0.0075
             (0.2710)   (0.2066)   (0.2111)   (0.0565)   (0.0428)   (0.0485)
  FM         0.1016     0.0940     0.0862     0.0201     0.0181     −0.0162
             (0.2789)   (0.2109)   (0.2135)   (0.0582)   (0.0437)   (0.0490)
  UFM        0.1105     0.1118     0.1150     0.0207     0.0210     −0.0210
             (0.2796)   (0.2117)   (0.2145)   (0.0584)   (0.0439)   (0.0493)
  AUG        0.0128     0.0323     0.0003     0.0024     0.0059     0.0011
             (0.1202)   (0.1135)   (0.0315)   (0.0251)   (0.0236)   (0.0072)

             A21 = −0.5                               A22 = 1.1
  Estimator  T = 50     T = 100    T = 200    T = 50     T = 100    T = 200
  OLS        −0.0331    −0.0006    −0.0022    0.0150     −0.0009    0.0001
             (0.7920)   (0.4699)   (0.0817)   (0.1650)   (0.0974)   (0.0188)
  FM         −0.0232    −0.0232    0.0107     −0.0010    0.0031     −0.0023
             (0.8062)   (0.4745)   (0.0820)   (0.1681)   (0.0984)   (0.0188)
  UFM        −0.0233    −0.0296    0.0140     −0.0007    0.0043     −0.0028
             (0.8077)   (0.4749)   (0.0821)   (0.1684)   (0.0985)   (0.0189)
  AUG        0.0153     0.0269     0.0033     −0.0064    −0.0066    −0.0008
             (0.3690)   (0.3147)   (0.2712)   (0.3691)   (0.3149)   (0.2712)
initial estimator; second, factors and factor loadings are estimated based on the residuals from the previous step; finally, the FM estimator is obtained based on (3.8). The procedure is repeated until convergence in Â_i is reached; an outline of this loop is sketched below. The maximum number of iteration steps is set at 30, and the convergence criterion is set at 0.001.

In Table 2, OLS produces the best results for a cointegrated VAR process, though the AUG FM estimator is expected to perform better. However, as we increase N and the standard deviation of the cross-sectional shocks in Tables 3 and 4, AUG FM yields a much improved performance. In fact, it outperforms the other three estimators in most cases in terms of bias. We see the same pattern as the mean of the factor loading is increased from 0.1 to 0.5 in Tables 5 and 6. This occurs because, as N and T increase, factor analysis gives more precise estimates of the factors and factor loadings. Also, in Table 2 the cross-sectional shock (f_t) has the same variance as the idiosyncratic error. Hence the signal from a cross-sectional shock is hard to extract when combined with idiosyncratic errors, and it is even harder when the composite error term x_it is estimated instead of directly observed. After the variance of the cross-sectional shock is increased from 1 to 5 in Tables 3 to 6, the cross-sectional shock 'stands out' in the composite error terms x_it, and factor analysis can extract the common shocks more effectively.
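In outline, the UFM iteration can be written as the following skeleton; fm_step and pc_factors stand in for the FM estimator (3.8) and the principal components step of Section 2, and are placeholders rather than the paper's code:

```python
import numpy as np

def ufm_iterate(y, fm_step, pc_factors, k, max_iter=30, tol=1e-3):
    """Alternate between coefficient estimation and factor extraction.

    fm_step(y, factors) should return (A_hat, residuals); pc_factors
    should return (f_hat, lam_hat) from the residuals.
    """
    A_hat, resid = fm_step(y, factors=None)        # initial FM estimate
    f_hat, lam_hat = None, None
    for _ in range(max_iter):
        f_hat, lam_hat = pc_factors(resid, k)      # factors from residuals
        A_new, resid = fm_step(y, factors=f_hat)   # re-estimate with (3.8)
        if np.max(np.abs(A_new - A_hat)) < tol:    # convergence in A_hat
            return A_new, f_hat, lam_hat
        A_hat = A_new
    return A_hat, f_hat, lam_hat
```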
Table 4. Mean biases and standard deviations of four estimators. DGP: N = 1000; σ²_f = 5; λ_itp ~ N(0.1, 1).

             A11 = 0.5                                A12 = 0.1
  Estimator  T = 50     T = 100    T = 200    T = 50     T = 100    T = 200
  OLS        −0.0815    −0.0232    −0.0128    0.0368     0.0104     0.0042
             (0.2729)   (0.2535)   (0.2012)   (0.0568)   (0.0625)   (0.0443)
  FM         0.0989     0.0783     0.0766     −0.0193    −0.0137    −0.0146
             (0.2807)   (0.2575)   (0.2033)   (0.0584)   (0.0635)   (0.0448)
  UFM        0.1064     0.1058     0.1054     −0.0200    −0.0178    −0.0196
             (0.2814)   (0.2586)   (0.2043)   (0.0585)   (0.0638)   (0.0450)
  AUG        0.0122     −0.0025    0.0051     −0.0021    0.0021     −0.0002
             (0.1209)   (0.0399)   (0.0448)   (0.0251)   (0.0099)   (0.0099)

             A21 = −0.5                               A22 = 1.1
  Estimator  T = 50     T = 100    T = 200    T = 50     T = 100    T = 200
  OLS        −0.0540    −0.0014    −0.0018    0.0173     −0.0033    −0.0028
             (0.7994)   (0.0842)   (0.0944)   (0.1662)   (0.0208)   (0.0208)
  FM         0.0038     0.0093     −0.0135    −0.0051    −0.0022    0.0024
             (0.8127)   (0.0847)   (0.0949)   (0.1689)   (0.0209)   (0.0209)
  UFM        0.0099     0.0116     −0.0186    −0.0061    −0.0025    0.0033
             (0.8143)   (0.0850)   (0.0950)   (0.1692)   (0.0210)   (0.0209)
  AUG        0.0214     0.0040     0.0069     −0.0066    −0.0011    −0.0018
             (0.3682)   (0.3280)   (0.2673)   (0.3686)   (0.3282)   (0.2673)
5. CONCLUSION

We study cross-sectional dependence in panel VAR using factor analysis. Our paper makes the following contributions to the literature: (1) factor analysis is extended to vector processes when N and T are large; (2) a limiting distribution of the FM-VAR estimator with cross-sectional dependence is provided and (3) finite sample properties of OLS, FM, UFM and AUG are investigated through simulation. Our findings are: (1) the vector factor analysis proposed in this paper gives a simple approach to analyzing data with a panel vector structure; (2) asymptotic results for the estimators of both stationary and non-stationary variables in panel VAR are obtained and are found to have a normal mixture distribution in the limit, similar to the results in Phillips (1995) and (3) the factor augmented regression performs reasonably well when the cross-sectional signal is strong relative to the magnitude of the idiosyncratic errors.

The literature on cross-sectional dependence in panel data models is growing, and there are several unresolved issues in panel VAR models under cross-sectional dependence. First, there is a need to test for the existence of cross-sectional shocks; such a test is a necessary prior step before augmented regressions are applied. Second, Phillips and Sul (2007) derived the analytic bias for dynamic
Table 5. Mean biases and standard deviations of four estimators. DGP: N = 500; σ²_f = 5; λ_itp ~ N(0.5, 1).

             A11 = 0.5                                A12 = 0.1
  Estimator  T = 50     T = 100    T = 200    T = 50     T = 100    T = 200
  OLS        −0.0751    −0.0251    −0.0166    0.0455     0.0126     0.0090
             (0.3659)   (0.2668)   (0.2092)   (0.0784)   (0.0551)   (0.0480)
  FM         0.1410     0.1538     0.0761     0.0289     0.0306     −0.0142
             (0.3773)   (0.2738)   (0.2113)   (0.0809)   (0.0565)   (0.0485)
  UFM        0.1326     0.1515     0.1059     0.0264     0.0297     −0.0192
             (0.3768)   (0.2737)   (0.2123)   (0.0808)   (0.0565)   (0.0487)
  AUG        0.0049     0.0426     0.0002     0.0014     0.0083     0.0010
             (0.1215)   (0.1419)   (0.0315)   (0.0261)   (0.0293)   (0.0072)

             A21 = −0.5                               A22 = 1.1
  Estimator  T = 50     T = 100    T = 200    T = 50     T = 100    T = 200
  OLS        −0.0406    −0.0028    −0.0060    0.0225     −0.0011    0.0018
             (0.9618)   (0.7750)   (0.0507)   (0.2056)   (0.1599)   (0.0116)
  FM         0.0909     0.0820     −0.0036    −0.0236    −0.0186    0.0006
             (0.9798)   (0.7823)   (0.0509)   (0.2096)   (0.1614)   (0.0117)
  UFM        0.0876     0.0833     −0.0050    −0.0221    −0.0185    0.0008
             (0.9810)   (0.7827)   (0.0510)   (0.2098)   (0.1615)   (0.0117)
  AUG        0.0000     0.0244     0.0036     −0.0033    −0.0062    −0.0009
             (0.3746)   (0.3156)   (0.2719)   (0.3745)   (0.3156)   (0.2718)
panel estimation when cross-sectional dependence is ignored, and the extension of their work to panel VAR models offers an area for further research. Third, it is also interesting to extend the results in this paper to panel data with non-stationary cross-sectional dependence. Finally, we note that the time dimension is short for many microeconomic data sets, and an extension of the work in Binder et al. (2005) to cross-sectional dependence is necessary. We hope to address these issues in subsequent work.
ACKNOWLEDGEMENTS

I would like to thank Aman Ullah for stimulating discussions and two referees for exceptionally helpful comments. I am also grateful to seminar participants at U.C. Riverside, SUNY-Albany and Kennesaw State University for helpful comments. My thanks also go to Gabriel Ramirez for his detailed and helpful suggestions. All remaining errors are mine.
Table 6. Mean biases and standard deviations of four estimators. DGP: N = 1000; σ²_f = 5; λ_itp ~ N(0.5, 1).

             A11 = 0.5                                A12 = 0.1
  Estimator  T = 50     T = 100    T = 200    T = 50     T = 100    T = 200
  OLS        −0.0833    −0.0231    −0.0148    0.0374     0.0079     0.0046
             (0.3687)   (0.2531)   (0.2111)   (0.0790)   (0.0614)   (0.0486)
  FM         0.1399     0.0860     0.0768     −0.0282    0.0156     −0.0144
             (0.3800)   (0.2574)   (0.2132)   (0.0815)   (0.0625)   (0.0492)
  UFM        0.1312     0.1120     0.1069     −0.0260    0.0193     −0.0195
             (0.3795)   (0.2584)   (0.2142)   (0.0813)   (0.0628)   (0.0494)
  AUG        0.0029     −0.0021    0.0019     −0.0005    0.0018     0.0004
             (0.1221)   (0.0416)   (0.0379)   (0.0262)   (0.0101)   (0.0087)

             A21 = −0.5                               A22 = 1.1
  Estimator  T = 50     T = 100    T = 200    T = 50     T = 100    T = 200
  OLS        −0.0579    0.0003     −0.0062    0.0182     −0.0061    −0.0020
             (0.9699)   (0.1191)   (0.0458)   (0.2075)   (0.0289)   (0.0106)
  FM         0.1141     0.0175     0.0002     −0.0278    −0.0042    −0.0002
             (0.9869)   (0.1199)   (0.0459)   (0.2111)   (0.0291)   (0.0106)
  UFM        0.1145     0.0217     0.0002     −0.0274    −0.0046    0.0002
             (0.9883)   (0.1203)   (0.0460)   (0.2114)   (0.0292)   (0.0106)
  AUG        0.0021     0.0040     0.0050     −0.0027    −0.0012    −0.0011
             (0.3741)   (0.3281)   (0.2690)   (0.3740)   (0.3281)   (0.2691)
REFERENCES

Andrews, D. W. K. (2005). Cross-section regression with common shocks. Econometrica 73, 1551–85.
Bai, J. (2003). Inference on factor models of large dimension. Econometrica 71, 135–71.
Bai, J. (2004). Estimating cross-section common stochastic trends in nonstationary panel data. Journal of Econometrics 122, 137–83.
Bai, J. and S. Ng (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–221.
Bai, J. and S. Ng (2004). A PANIC attack on unit roots and cointegration. Econometrica 72, 1127–77.
Bai, J. and C. Kao (2006). On the estimation and inference of a panel cointegration model with cross-sectional dependence. In B. H. Baltagi (Ed.), Contributions to Economic Analysis, Volume 274, 1–30. New York: Elsevier.
Binder, M., C. Hsiao and M. H. Pesaran (2005). Estimation and inference in short panel vector autoregressions with unit roots and cointegration. Econometric Theory 21, 795–837.
Canova, F. and M. Ciccarelli (2002). Panel index VAR models: specification, estimation, testing and leading indicators. Discussion Paper #4033, Centre for Economic Policy Research. Available at http://www.cepr.org/pubs/dps/DP4033.asp.
Chang, Y. (2000). Vector autoregressions with unknown mixtures of I(0), I(1), and I(2) components. Econometric Theory 16, 905–26.
Chang, Y. (2002). Nonlinear IV unit root tests in panels with cross-sectional dependency. Journal of Econometrics 110, 261–92.
Chang, Y. (2004). Bootstrap unit root tests in panels with cross-sectional dependency. Journal of Econometrics 120, 263–93.
Forni, M., M. Hallin, M. Lippi and L. Reichlin (2000). The generalized dynamic-factor model: identification and estimation. Review of Economics and Statistics 82, 540–54.
Forni, M. and M. Lippi (2001). The generalized dynamic factor model: representation theory. Econometric Theory 17, 1113–41.
Im, K. S. and M. H. Pesaran (2003). On the panel unit root tests using nonlinear instrumental variables. Working paper, Trinity College, Cambridge.
Johansen, S. (1988). Statistical analysis of cointegration vectors. Journal of Economic Dynamics and Control 12, 231–54.
Kauppi, H. (2004). On the robustness of hypothesis testing based on fully modified vector autoregression when some roots are almost one. Econometric Theory 20, 341–59.
Lütkepohl, H. (1996). Handbook of Matrices. New York: John Wiley & Sons.
Moon, H. R. and B. Perron (2004). Testing for a unit root in panels with dynamic factors. Journal of Econometrics 122, 81–126.
Mutl, J. (2002). Panel VAR with spatial dependence. Working paper, Stockholm School of Economics. Available at http://econpapers.hhs.se/cpd/2002/113 Mu66tl.pdf.
Pesaran, M. H. (2006). Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica 74, 967–1012.
Pesaran, M. H., T. Schuermann and S. M. Weiner (2004). Modeling regional interdependencies using a global error-correcting macroeconometric model. Journal of Business and Economic Statistics 22, 129–62.
Phillips, P. C. B. (1995). Fully modified least squares and vector autoregression. Econometrica 63, 1023–78.
Phillips, P. C. B. and D. Sul (2003). Dynamic panel estimation and homogeneity testing under cross section dependence. Econometrics Journal 6, 217–59.
Phillips, P. C. B. and D. Sul (2007). Bias in dynamic panel estimation with fixed effects, incidental trends and cross section dependence. Journal of Econometrics 137, 162–88.
Stock, J. and M. Watson (1998). Diffusion indexes. Working Paper #6702, National Bureau of Economic Research.
Stock, J. and M. Watson (2002a). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97, 1167–79.
Stock, J. and M. Watson (2002b). Macroeconomic forecasting using diffusion indexes. Journal of Business and Economic Statistics 20, 147–62.
APPENDIX

We need the following lemmas to prove Theorem 2.1.

LEMMA A.1. Under Assumptions 2.1–2.3 and for some finite positive constant M_1, we have

(a) T^{-1} Σ_{s=1}^T Σ_{t=1}^T γ_mN(s, t)² ≤ M_1;
(b) E(T^{-1} Σ_{t=1}^T ||(mN)^{-1/2} Σ_{i=1}^N Σ_{p=1}^m e_itp λ_ip||²) ≤ M_1;
(c) E(T^{-2} Σ_{t=1}^T Σ_{s=1}^T ((mN)^{-1} Σ_{i=1}^N Σ_{p=1}^m x̂_itp x̂_isp)²) ≤ M_1.

Proof: The proof of (a) and (b) is similar to that of Lemma A.1 of Bai and Ng (2002). Consider (c). It is sufficient to prove E|x̂_itp|⁴ ≤ M_1 for all i, t and p. Let A_ip be the pth row of A_i; then we have

  E|x̂_itp|⁴ = E|λ_ip'f_t + e_itp − (Â_ip − A_ip)y_{i,t−1}|⁴ ≤ 8E(λ_ip'f_t + e_itp)⁴ + 8E((Â_ip − A_ip)y_{i,t−1})⁴.

By the results in Lemma A.1 of Bai and Ng (2002) and Assumptions 2.1, 2.2 and 2.3(a), the first term on the r.h.s., E(λ_ip'f_t + e_itp)⁴, is bounded. The second term is o_p(1) given that Â_ip is a consistent estimator of A_ip. Hence E|x̂_itp|⁴ ≤ M_1 and the result in (c) holds.

LEMMA A.2. For any fixed k ≥ 1, there exists an r × k matrix H^k with rank(H^k) = min(k, r) and C_mNT = min(√(mN), √T) such that

  T^{-1} Σ_{t=1}^T ||f̂_t^k − H^k'f_t||² = (mN)^{-1} O_p(T^δ) O_p(||Â − A||²) + O_p(C_mNT^{-2}),  (A.1)

where O_p(T^δ) is the probability order of (mNT)^{-1} Σ_{s=1}^T ||y_{s−1}||².
||ys−1 2 .
Proof: The proof is analogous to proof of Theorem 1 in Bai and Ng (2002) except that x it is a vector process and estimated from first stage OLS method. Define fˆk = f¯k ( f¯k f¯k /T )1/2 , where f¯k = xˆ λ¯ k /m N and λ¯ k is √ ˆ Let H k = ( f˜k f /T )(λ λ/m N ) m N times eigenvectors corresponding to the k largest eigenvalues of xˆ x. and we have fˆtk − H k f t = (m N T )−1
T
xˆt xˆt f˜sk − (m N T )−1
s=1
= T −1
T
T
f s λ λ f t f˜sk
s=1
f˜sk γm N (s, t) + T −1
s=1
T
f˜sk ζst + T −1
s=1
− (m N T )−1
T
T s=1
f t λ ( Aˆ − A)ys−1 f˜sk − (m N T )−1
s=1
− (m N T )−1
T
+ (m N T )−1
T
f˜sk ξst
s=1 T
et ( Aˆ − A)ys−1 f˜sk
s=1 yt−1 ( Aˆ − A) λ f s f˜sk − (m N T )−1
s=1 T
f˜sk ηst + T −1
T
yt−1 ( Aˆ − A) es f˜sk
s=1 yt−1 ( Aˆ − A) ( Aˆ − A)ys−1 f˜sk ,
s=1
where ζst = (m N )−1 es et − γm N (s, t) , ηst = (m N )−1 f s λ et , ξst = (m N )−1 f t λ es . Using the inequality between geometric mean and arithmetic mean, we can show that T −1
T t=1
fˆtk − H k f t 2 ≤ 9T −1
T
(It + IIt + IIIt + IVt + Vt + VIt + VIIt + VIIIt + IXt ) ,
t=1
C 2008 The Author. Journal compilation C The Royal Economic Society 2008
(A.2)
234
Xiao Huang
where It = IIt = IIIt = IVt = Vt = VIt =
T T s=1 T −2 T s=1 T −2 T s=1 T −2 T s=1 −2
f˜sk γm N
2 (s, t) ,
2
f˜sk ζst
, 2 k f˜s ηst , 2 f˜sk ξst , 2 T −2 ˆ k ˜ (m N T ) f t λ ( A − A)ys−1 f s , s=1 2 T (m N T )−2 et ( Aˆ − A)ys−1 f˜sk , s=1
2 T k ˜ ˆ yt−1 ( A − A) λ f s f s , VIIt = (m N T ) s=1 2 T VIIIt = (m N T )−2 yt−1 ( Aˆ − A) es f˜sk , s=1 2 T −2 k IXt = (m N T ) yt−1 ( Aˆ − A) ( Aˆ − A)ys−1 f˜s . s=1 −2
From Lemma A.1(a), the fact that f˜k f˜k /T = Ik, and Cauchy–Schwarz inequality, T −1 For IIt , T −1
T
T T T
IIt = T −3
t=1
t=1 It
is O p (T −1 ).
f˜sk f˜uk ζst ζut
t=1 s=1 u=1
T −1
≤
T
T
⎛ 2 ⎞1/2 T T T f˜tk 2 ⎝T −2 ζst ζut ⎠ .
s=1
s=1 u=1
t=1
Since E
T
2 ζst ζut
≤ T 2 max E |ζst |4 s,t
t=1
≤ T max (m N ) 2
s,t
≤ (m N )−2 M, by Assumption 2.3(d) we have T −1
T t=1
−2
4 N m −1/2 E (m N ) (eit p eisp − E(eit p eisp )) i=1 p=1
II t ≤ O p ((m N )−1 ). C 2008 The Author. Journal compilation C The Royal Economic Society 2008
235
For III_t, we have

  III_t = T^{-2}(mN)^{-2}||Σ_{s=1}^T f̃_s^k f_s'λ vec(e_t)||²
    ≤ (mN)^{-2}||λ vec(e_t)||² (T^{-1} Σ_{s=1}^T ||f̃_s^k||²)(T^{-1} Σ_{s=1}^T ||f_s||²)
    = (mN)^{-2}||λ vec(e_t)||² · O_p(1).

It follows that T^{-1} Σ_{t=1}^T III_t = O_p(1) · (mNT)^{-1} Σ_{t=1}^T ||(mN)^{-1/2}λ vec(e_t)||² = O_p((mN)^{-1}) by Lemma A.1(b), and the result IV_t = O_p((mN)^{-1}) can be proved similarly. Next let us consider V_t:

  T^{-1} Σ_{t=1}^T V_t = T^{-1} Σ_{t=1}^T (mNT)^{-2}||Σ_{s=1}^T f̃_s^k f_t'λ(Â − A)y_{s−1}||²
    ≤ (T^{-1} Σ_{t=1}^T ||f_t||²) · (mN)^{-1}||λ(Â − A)||² · ((mNT)^{-1} Σ_{s=1}^T ||y_{s−1}||²) · (T^{-1} Σ_{s=1}^T ||f̃_s^k||²)
    ≤ (mN)^{-1} O_p(||Â − A||²) · O_p(T^δ) · O_p(1) = (mN)^{-1} O_p(T^δ) O_p(||Â − A||²).

The value of δ depends on the time series properties of the data. Consider VI_t:

  T^{-1} Σ_{t=1}^T VI_t = T^{-3}(mN)^{-2} Σ_{t=1}^T ||Σ_{s=1}^T e_t'(Â − A)y_{s−1} f̃_s^k||²
    ≤ (T^{-1} Σ_{t=1}^T (mN)^{-1}||e_t'(Â − A)||²) · ((mNT)^{-1} Σ_{s=1}^T ||y_{s−1}||²) · (T^{-1} Σ_{s=1}^T ||f̃_s^k||²)
    = (mN)^{-1} O_p(T^δ) O_p(||Â − A||²).

Similarly, we have the following results for the terms VII_t, VIII_t and IX_t:

  T^{-1} Σ_{t=1}^T VII_t = T^{-3}(mN)^{-2} Σ_{t=1}^T ||Σ_{s=1}^T y_{t−1}'(Â − A)'λ'f_s f̃_s^k||²
    ≤ (T^{-1} Σ_{s=1}^T ||f_s||²) · (mN)^{-1}||(Â − A)'λ'||² · ((mNT)^{-1} Σ_{t=1}^T ||y_{t−1}||²) · (T^{-1} Σ_{s=1}^T ||f̃_s^k||²)
    = (mN)^{-1} O_p(T^δ) O_p(||Â − A||²);

  T^{-1} Σ_{t=1}^T VIII_t = T^{-3}(mN)^{-2} Σ_{t=1}^T ||Σ_{s=1}^T y_{t−1}'(Â − A)'e_s f̃_s^k||²
    ≤ (T^{-1} Σ_{s=1}^T (mN)^{-1}||e_s'(Â − A)||²) · ((mNT)^{-1} Σ_{t=1}^T ||y_{t−1}||²) · (T^{-1} Σ_{s=1}^T ||f̃_s^k||²)
    = (mN)^{-1} O_p(T^δ) O_p(||Â − A||²);

  T^{-1} Σ_{t=1}^T IX_t = T^{-3}(mN)^{-2} Σ_{t=1}^T ||Σ_{s=1}^T y_{t−1}'(Â − A)'(Â − A)y_{s−1} f̃_s^k||²
    ≤ ((mNT)^{-1} Σ_{t=1}^T ||y_{t−1}'(Â − A)'||²) · ((mNT)^{-1} Σ_{s=1}^T ||(Â − A)y_{s−1}||²) · (T^{-1} Σ_{s=1}^T ||f̃_s^k||²)
    = ((mN)^{-1} O_p(T^δ) O_p(||Â − A||²))².

Combining the results for terms I_t to IX_t and substituting them into (A.2), we obtain Lemma A.2.
LEMMA A.3. With 1 ≤ k ≤ r,

  V(k, f̂^k) − V(k, fH^k) = (mN)^{-1/2} O_p(T^{δ/2}) O_p(||Â − A||) + O_p(C_mNT^{-1}).  (A.3)

Proof: Similar to Lemma 2 in Bai and Ng (2002), we have

  V(k, f̂^k) = (mNT)^{-1} Σ_{i=1}^N vec(x̂_i)'(I_m ⊗ M_f̂^k) vec(x̂_i),
  V(k, fH^k) = (mNT)^{-1} Σ_{i=1}^N vec(x̂_i)'(I_m ⊗ M_fH) vec(x̂_i),
  V(k, f̂^k) − V(k, fH^k) = (mNT)^{-1} Σ_{i=1}^N vec(x̂_i)'(I_m ⊗ (P_fH − P_f̂^k)) vec(x̂_i),

where P_fH = fH^k((fH^k)'fH^k)^{-1}(fH^k)', M_fH = I_T − P_fH and M_f̂^k = I_T − f̂^k(f̂^k'f̂^k)^{-1}f̂^k' = I_T − P_f̂^k. Let D_k = f̂^k'f̂^k/T and D = H^k'f'fH^k/T. Following the same decomposition as in Bai and Ng (2002), we have

  P_f̂^k − P_fH = T^{-1}[(f̂^k − fH^k)D_k^{-1}(f̂^k − fH^k)' + (f̂^k − fH^k)D_k^{-1}H^k'f' + fH^k D_k^{-1}(f̂^k − fH^k)' + fH^k(D_k^{-1} − D^{-1})H^k'f'],

and (mNT)^{-1} Σ_{i=1}^N vec(x̂_i)'(I_m ⊗ (P_fH − P_f̂^k)) vec(x̂_i) = I + II + III + IV. Define the T × 1 vector x̂_ip = (x̂_i1p, x̂_i2p, ..., x̂_iTp)'. We have

  I = (mNT)^{-1} Σ_{i=1}^N vec(x̂_i)'(I_m ⊗ (f̂^k − fH^k)D_k^{-1}(f̂^k − fH^k)') vec(x̂_i)
    = (mNT)^{-1} Σ_{i=1}^N Σ_{p=1}^m x̂_ip'(f̂^k − fH^k)D_k^{-1}(f̂^k − fH^k)'x̂_ip
    ≤ (T^{-2} Σ_{t=1}^T Σ_{s=1}^T ||f̂_t^k − H^k'f_t||² ||f̂_s^k − H^k'f_s||²)^{1/2} ||D_k^{-1}||
        · (T^{-2} Σ_{t=1}^T Σ_{s=1}^T ((mN)^{-1} Σ_{i=1}^N Σ_{p=1}^m x̂_itp x̂_isp)²)^{1/2}
    = (mN)^{-1} O_p(T^δ) O_p(||Â − A||²) + O_p(C_mNT^{-2}),

where the last equality follows from Lemma A.1(c), Lemma A.2 and the fact that ||D_k^{-1}|| = O_p(1). By Lemma A.1(c) and Lemma A.2,

  II = (mNT)^{-1} Σ_{i=1}^N vec(x̂_i)'(I_m ⊗ (f̂^k − fH^k)D_k^{-1}H^k'f') vec(x̂_i)
     = (mNT)^{-1} Σ_{i=1}^N Σ_{p=1}^m Σ_{t=1}^T Σ_{s=1}^T (f̂_t^k − H^k'f_t)'D_k^{-1}H^k'f_s x̂_itp x̂_isp
     ≤ (T^{-1} Σ_{t=1}^T ||f̂_t^k − H^k'f_t||²)^{1/2} ||D_k^{-1}|| (T^{-1} Σ_{s=1}^T ||H^k'f_s||²)^{1/2}
        · (T^{-2} Σ_{t=1}^T Σ_{s=1}^T ((mN)^{-1} Σ_{i=1}^N Σ_{p=1}^m x̂_itp x̂_isp)²)^{1/2}
     = (mN)^{-1/2} O_p(T^{δ/2}) O_p(||Â − A||) + O_p(C_mNT^{-1}) = III,

  IV = (mNT)^{-1} Σ_{i=1}^N Σ_{p=1}^m Σ_{t=1}^T Σ_{s=1}^T f_t'H^k(D_k^{-1} − D^{-1})H^k'f_s x̂_itp x̂_isp
     = ||D_k^{-1} − D^{-1}|| · O_p(1)
     = (mN)^{-1/2} O_p(T^{δ/2}) O_p(||Â − A||) + O_p(C_mNT^{-1}),

where the property ||D_k − D|| = (mN)^{-1/2} O_p(T^{δ/2}) O_p(||Â − A||) + O_p(C_mNT^{-1}) can be proved similarly to the proof of Lemma 2 in Bai and Ng (2002). Lemma A.3 holds by combining the results for terms I, II, III and IV.

LEMMA A.4. For each k < r, there exists a τ_k > 0 such that

  plim inf_{N,T→∞} [V(k, fH^k) − V(r, f)] = τ_k.  (A.4)
Proof: Define y_ip = (y_i1p, ..., y_iTp)', y_i = (y_i1, ..., y_im), y_i,−1,p = (y_i0p, ..., y_i,T−1,p)', y_i,−1 = (y_i,−1,1, ..., y_i,−1,m) and e_ip = (e_i1p, ..., e_iTp)'. Consider the following decomposition:

  V(k, fH^k) − V(r, f) = (mNT)^{-1} Σ_{i=1}^N vec(x̂_i)'(I_m ⊗ (P_f − P_fH)) vec(x̂_i)
    = (mNT)^{-1} Σ_{i=1}^N Σ_{p=1}^m (y_ip − y_i,−1 Â_ip')'(P_f − P_fH)(y_ip − y_i,−1 Â_ip')
    = (mNT)^{-1} Σ_{i=1}^N Σ_{p=1}^m (fλ_ip + e_ip − y_i,−1(Â_ip − A_ip)')'(P_f − P_fH)(fλ_ip + e_ip − y_i,−1(Â_ip − A_ip)')
    = I + II + III − IV − V + VI,

where

  I = (mNT)^{-1} Σ_{i,p} λ_ip'f'(P_f − P_fH)fλ_ip,
  II = 2(mNT)^{-1} Σ_{i,p} e_ip'(P_f − P_fH)fλ_ip,
  III = (mNT)^{-1} Σ_{i,p} e_ip'(P_f − P_fH)e_ip,
  IV = 2(mNT)^{-1} Σ_{i,p} (Â_ip − A_ip)y_i,−1'(P_f − P_fH)fλ_ip,
  V = 2(mNT)^{-1} Σ_{i,p} (Â_ip − A_ip)y_i,−1'(P_f − P_fH)e_ip,
  VI = (mNT)^{-1} Σ_{i,p} (Â_ip − A_ip)y_i,−1'(P_f − P_fH)y_i,−1(Â_ip − A_ip)'.

We know I > 0 and III > 0; the proof of this follows that in Bai and Ng (2002). Consider the second term II:

  II = 2(mNT)^{-1} Σ_{i,p} (e_ip'P_f fλ_ip − e_ip'P_fH fλ_ip).

The first part of II satisfies

  (mNT)^{-1} |Σ_{i,p} e_ip'P_f fλ_ip| = (mNT)^{-1} |Σ_{t=1}^T (Σ_{i,p} e_itp λ_ip')f_t|
    ≤ (T^{-1} Σ_{t=1}^T ||f_t||²)^{1/2} · (mN)^{-1/2} (T^{-1} Σ_{t=1}^T ||(mN)^{-1/2} Σ_{i,p} e_itp λ_ip||²)^{1/2}
    = O_p((mN)^{-1/2}),

which follows from Lemma A.1(b). The second part has the same order. Thus II = O_p((mN)^{-1/2}). Consider IV:

  IV = 2(mNT)^{-1} Σ_{i,p} (Â_ip − A_ip)y_i,−1'P_f fλ_ip − 2(mNT)^{-1} Σ_{i,p} (Â_ip − A_ip)y_i,−1'P_fH fλ_ip = IV1 − IV2.

We have

  |IV1| = 2(mNT)^{-1} |Σ_{i,p} (Â_ip − A_ip)y_i,−1'P_f fλ_ip|
    ≤ 2(mN)^{-1} Σ_{i,p} ||Â_ip − A_ip|| · T^{-1/2}||y_i,−1|| · T^{-1/2}||fλ_ip||
    = O_p(T^{δ/2}) O_p(||Â_ip − A_ip||).

Similarly, |IV2| ≤ O_p(T^{δ/2}) O_p(||Â_ip − A_ip||). We note that the convergence rate O_p(T^{δ/2}) O_p(||Â_ip − A_ip||) is equivalent to (mN)^{-1/2} O_p(T^{δ/2}) O_p(||Â − A||) in the previous results, and it converges to zero when Â is a consistent estimate of A.
Let P_f = (P_f1, ..., P_ft, ..., P_fT), where P_ft is the tth column of P_f, and

  V = 2(mNT)^{-1} Σ_{i,p} (Â_ip − A_ip)y_i,−1'(P_f − P_fH)e_ip
    = 2(mNT)^{-1} Σ_{i,p} (Â_ip − A_ip)y_i,−1'P_f e_ip − 2(mNT)^{-1} Σ_{i,p} (Â_ip − A_ip)y_i,−1'P_fH e_ip
    = V1 + V2.

We further have

  |V1| = 2(mNT)^{-1} |Σ_{i,p} (Â_ip − A_ip)y_i,−1'P_f e_ip|
    ≤ 2(mN)^{-1} Σ_{i,p} ||Â_ip − A_ip|| · T^{-1/2}||y_i,−1|| · T^{-1/2}||P_f e_ip||
    = O_p(T^{δ/2}) O_p(||Â_ip − A_ip||).

Similarly, |V2| ≤ O_p(T^{δ/2}) O_p(||Â_ip − A_ip||). Hence V = O_p(T^{δ/2}) O_p(||Â_ip − A_ip||). For VI, we have

  VI = (mNT)^{-1} Σ_{i,p} (Â_ip − A_ip)y_i,−1'P_f y_i,−1(Â_ip − A_ip)' − (mNT)^{-1} Σ_{i,p} (Â_ip − A_ip)y_i,−1'P_fH y_i,−1(Â_ip − A_ip)' = VI1 − VI2,

and

  |VI1| ≤ (mN)^{-1} Σ_{i,p} ||Â_ip − A_ip||² · T^{-1}||y_i,−1||² · ||P_f|| = O_p(T^δ) O_p(||Â_ip − A_ip||²).

The result VI2 = O_p(T^δ) O_p(||Â_ip − A_ip||²) can be proved in a similar way, so VI = O_p(T^δ) O_p(||Â_ip − A_ip||²). Lemma A.4 holds by combining the results for terms I–VI.

LEMMA A.5. For any k ≥ r,

  V(k, f̂^k) − V(r, f̂^r) = (mN)^{-1} O_p(T^δ) O_p(||Â − A||²) + O_p(C_mNT^{-2}).

Proof: By the proof of Lemma 4 in Bai and Ng (2002), it is sufficient to prove, for each k with k ≥ r,

  V(k, f̂^k) − V(r, f) = (mN)^{-1} O_p(T^δ) O_p(||Â − A||²) + O_p(C_mNT^{-2}).
Let H^k+ be the generalized inverse of H^k, and we have

  x̂_i = fλ_i' + e_i − y_i,−1(Â_i − A_i)' = f̂^k H^k+ λ_i' + u_i,

where u_i = e_i − (f̂^k − fH^k)H^k+ λ_i' − y_i,−1(Â_i − A_i)'. Similar to Bai and Ng (2002), we have

  V(k, f̂^k) = (mNT)^{-1} Σ_{i=1}^N vec(u_i)'(I_m ⊗ M_f̂^k) vec(u_i),
  V(r, f) = (mNT)^{-1} Σ_{i=1}^N vec(e_i)'(I_m ⊗ M_f) vec(e_i),

and, expanding the first expression,

  V(k, f̂^k) = (mNT)^{-1} Σ_{i,p} (e_ip − (f̂^k − fH^k)H^k+ λ_ip − y_i,−1(Â_ip − A_ip)')' M_f̂^k (e_ip − (f̂^k − fH^k)H^k+ λ_ip − y_i,−1(Â_ip − A_ip)')
    = (mNT)^{-1} Σ_{i,p} e_ip'M_f̂^k e_ip
      − 2(mNT)^{-1} Σ_{i,p} λ_ip'H^k+'(f̂^k − fH^k)'M_f̂^k e_ip
      + (mNT)^{-1} Σ_{i,p} λ_ip'H^k+'(f̂^k − fH^k)'M_f̂^k (f̂^k − fH^k)H^k+ λ_ip
      − 2(mNT)^{-1} Σ_{i,p} e_ip'M_f̂^k y_i,−1(Â_ip − A_ip)'
      + 2(mNT)^{-1} Σ_{i,p} λ_ip'H^k+'(f̂^k − fH^k)'M_f̂^k y_i,−1(Â_ip − A_ip)'
      + (mNT)^{-1} Σ_{i,p} (Â_ip − A_ip)y_i,−1'M_f̂^k y_i,−1(Â_ip − A_ip)'
    = I + II + III + IV + V + VI.

Consider II:

  |II| ≤ 2r||H^k+|| · (T^{-1} Σ_{t=1}^T ||f̂_t^k − H^k'f_t||²)^{1/2} · (mN)^{-1/2}(T^{-1} Σ_{t=1}^T ||(mN)^{-1/2} Σ_{i,p} e_itp λ_ip||²)^{1/2}
      ≤ (mN)^{-1} O_p(T^δ) O_p(||Â − A||²) + O_p(C_mNT^{-2}).

The term III is (mN)^{-1} O_p(T^δ) O_p(||Â − A||²) + O_p(C_mNT^{-2}), following the proof in Bai and Ng (2002). Consider term IV:

  |IV| ≤ 2(mNT)^{-1} |Σ_{i,p} e_ip'M_f̂^k y_i,−1(Â_ip − A_ip)'|
     ≤ 2T^{-1} ((mN)^{-1} Σ_{i,p} ||e_ip||²)^{1/2} ((mN)^{-1} Σ_{i,p} ||y_i,−1||² ||Â_ip − A_ip||²)^{1/2}
     = (mN)^{-1/2} O_p(T^{δ/2}) O_p(||Â − A||)
     ≤ (mN)^{-1} O_p(T^δ) O_p(||Â − A||²) + O_p(C_mNT^{-2}).

For terms V and VI, we have

  |V| ≤ 2(mNT)^{-1} |Σ_{i,p} λ_ip'H^k+'(f̂^k − fH^k)'M_f̂^k y_i,−1(Â_ip − A_ip)'|
     ≤ 2((mN)^{-1} Σ_{i,p} ||λ_ip||² ||H^k+||² ||f̂^k − fH^k||² (T^{-1} Σ_{t=1}^T ||y_{i,t−1}||²))^{1/2} ((mN)^{-1} Σ_{i,p} ||Â_ip − A_ip||²)^{1/2}
     = (mN)^{-1} O_p(T^δ) O_p(||Â − A||²),

where the last equality is obtained by substituting the convergence results for T^{-1} Σ_t ||f̂_t^k − H^k'f_t||², T^{-1} Σ_t ||y_{i,t−1}||² and ||Â_ip − A_ip||, and the fact that λ_ip'H^k+ is O_p(1);

  VI = (mNT)^{-1} Σ_{i,p} (Â_ip − A_ip)y_i,−1'M_f̂^k y_i,−1(Â_ip − A_ip)'
     ≤ (mN)^{-1} Σ_{i,p} ||Â_ip − A_ip||² · (T^{-1} Σ_{t=1}^T Σ_{q=1}^m y_{i,t−1,q}²)
     = (mN)^{-1} O_p(T^δ) O_p(||Â − A||²).

By the fact that V(k, f̂^k) − V(r, f) ≤ 0 for k ≥ r, we have

  V(k, f̂^k) − V(r, f) = (mNT)^{-1} Σ_{i,p} e_ip'P_f e_ip − (mNT)^{-1} Σ_{i,p} e_ip'P_f̂^k e_ip + O_p(C_mNT^{-2}).

We know that (mNT)^{-1} Σ_{i,p} e_ip'P_f e_ip = O_p(C_mNT^{-2}), which follows the proof in Lemma 4 of Bai and Ng (2002). Consider the second term:

  (mNT)^{-1} Σ_{i,p} e_ip'P_f̂^k e_ip = (mNT)^{-1} Σ_{i,p} e_ip'Σ_eip^{-1/2} (Σ_eip^{1/2} P_f̂^k Σ_eip^{1/2}) Σ_eip^{-1/2} e_ip
    ≤ (mNT)^{-1} Σ_{i,p} ρ(Σ_eip^{1/2} P_f̂^k Σ_eip^{1/2}) e_ip'Σ_eip^{-1} e_ip
    ≤ (mNT)^{-1} Σ_{i,p} ρ(Σ_eip) ρ(P_f̂^k) e_ip'Σ_eip^{-1} e_ip
    ≤ O_p(C_mNT^{-2}),

where Σ_eip is the covariance matrix of e_ip and ρ(·) denotes the largest eigenvalue of the matrix in parentheses. The first inequality follows from result (6) in Section 4.1.2 of Lütkepohl (1996). The third inequality follows from the fact that weak correlation in the time dimension makes ρ(Σ_eip) a bounded number and P_f̂^k is an idempotent matrix with eigenvalues 1 or 0. Hence we have V(k, f̂^k) − V(r, f) = O_p(C_mNT^{-2}).

Proof of Theorem 2.1: The proof follows the proof of Theorem 2 in Bai and Ng (2002) and the results in Lemmas A.1–A.5.
Econometrics Journal (2008), volume 11, pp. 244–270. doi: 10.1111/j.1368-423X.2008.00246.x
Moment based regression algorithms for drift and volatility estimation in continuous-time Markov switching models

ROBERT J. ELLIOTT†, VIKRAM KRISHNAMURTHY‡ AND JÖRN SASS§

†Haskayne School of Business, University of Calgary, Calgary, T2N 1N4, Canada. E-mail: [email protected]
‡Department of Electrical and Computer Engineering, University of British Columbia, Vancouver V6T 1Z4, Canada. E-mail: [email protected]
§RICAM, Austrian Academy of Sciences, Altenberger Str. 69, A-4040 Linz, Austria. E-mail: [email protected]

First version received: May 2006; final version accepted: March 2008
Summary  We consider a continuous time Markov switching model (MSM) which is widely used in mathematical finance. The aim is to estimate the parameters given observations in discrete time. Since there is no finite dimensional filter for estimating the underlying state of the MSM, it is not possible to compute numerically the maximum likelihood parameter estimate via the well known expectation maximization (EM) algorithm. Therefore in this paper, we propose a method of moments based parameter estimator. The moments of the observed process are computed explicitly as a function of the time discretization interval of the discrete time observation process. We then propose two algorithms for parameter estimation of the MSM. The first algorithm is based on a least-squares fit to the exact moments over different time lags, while the second algorithm is based on estimating the coefficients of the expansion (with respect to time) of the moments. Extensive numerical results comparing the algorithm with the EM algorithm for the discretized model are presented.

Keywords: Generalized method of moments, Markov switching model, Moments based estimator, Parameter estimation, Regime switching.
1. MARKOV SWITCHING MODELS AND APPLICATIONS

Continuous time Markov switching models (MSMs) are generalizations of hidden Markov models and are widely used in mathematical finance; see Buffington and Elliott (2002), Guo (2001) and the references therein. A continuous time MSM comprises an observed process R = (R_t)_{t∈[0,T]}, where

  R_t = ∫_0^t μ_s ds + ∫_0^t σ_s dW_s,  t ∈ [0, T],  (1.1)

W = (W_t)_{t∈[0,T]} is a standard Brownian motion, μ = (μ_t)_{t∈[0,T]} is the time varying drift and σ = (σ_t)_{t∈[0,T]} the time varying volatility. In a MSM, the drift μ and volatility σ are piecewise
constant functions of time, jumping according to the sample path of a continuous time Markov chain which evolves independently of W. Thus the volatility is independent of the Brownian motion, i.e. the model is based on a no-leverage assumption. For a review of classical stochastic volatility models and statistical inference we refer to Shephard (2005); see also Beskos et al. (2005) and the references therein for diffusion models with leverage. For the no-leverage case, moments and the asymptotic distribution of the realized volatility error are derived in Barndorff-Nielsen and Shephard (2002). They show that the asymptotic distribution does not depend on μ and thus the error is small when setting μ = 0. Here we consider a regime switching model, i.e. the drift μ and volatility σ jump at the same time, and we are also interested in the parameters of μ. Thus we follow a different approach, based on the explicit computation of the moments, which provides us with an explicit error term which we can utilize to estimate the corresponding parameters as described following (1.3).

To be more precise, let Y denote a continuous time, irreducible, aperiodic Markov chain with rate matrix Q and state space {e_1, ..., e_d}, where e_i, i = 1, ..., d, denote the unit vectors in R^d. Then in a MSM, μ and σ evolve in time as

  μ_t = b'Y_t,  σ_t = a'Y_t,  (1.2)

where b ∈ R^d and a ∈ R^d_{>0}. The diagonal elements of Q provide us with the rate −Q_ii for leaving state e_i, and the off-diagonal elements with the probabilities −Q_ij/Q_ii of jumping to state e_j when leaving e_i.

An important issue in utilizing the above continuous time MSMs in finance applications is to compute the parameters of the MSM to fit real world data. In financial applications, σ determines option prices; indeed, derivatives are traded based on σ. In the context of financial optimization, computing an accurate estimate for μ is important. In these applications involving stochastic control methods it is often preferable to work in continuous time, since it is easier to obtain closed form solutions, see e.g. Sass and Haussmann (2004). In order to use stochastic control methods one first has to estimate the parameters of the model from real world discrete observations. Therefore we have to deal with the problem of parameter estimation for a continuous time model observed in discrete time. For an overview of related methods we refer to Aït-Sahalia (2007), cf. also Barndorff-Nielsen and Shephard (2002) for no-leverage volatility models and Beskos et al. (2005), who propose exact sampling methods for discretely observed diffusions which can be efficiently used in combination with Bayesian methods, cf. Fearnhead et al. (2008). Here we look at a special case, the MSM, in which we can compute the moments explicitly, and hence derive moment based estimation algorithms leading, for the two-state MSM, to consistent estimators.

A particularly attractive criterion that is widely used to fit the parameters to real world data is the maximum likelihood (ML) criterion. The ML estimate of the parameters is often computed numerically via algorithms such as the expectation maximization (EM) algorithm. In the context of continuous time MSMs there are two ways of implementing such maximum likelihood estimation algorithms:
r ML Approach 1: Apply a time discretized implementation of a continuous-time algorithm
r
to the time discretized data. The implicit assumption here is that for suitably small time discretization interval, the discretized algorithm is a faithful representation of the continuous-time algorithm. ML Approach 2: Apply a discrete-time algorithm to the time discretized data. The assumption here is that the data indeed came from a discrete-time process.
C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
246
R. J. Elliott et al.
Below we argue that both of the above ML approaches have disadvantages for continuoustime MSMs thus motivating the need for developing alternative estimation criteria instead of the ML criterion. Approach 1 does not work for continuous time MSMs. An implementation of Approach 1 via the EM algorithm requires deriving finite dimensional filters to estimate the conditional mean. Then the filters are discretized over time to implement the EM algorithm. In theory, the underlying process can be observed via the volatility process. In practice we only have discrete observations for which no explicit expression for the conditional mean is known, cf. the discussion in Beskos et al. (2005). In general, for continuous time models with time varying volatility σ , it is not possible to construct a finite dimensional filter since the change of measure requires a known σ . [This is in contrast to a continuous-time hidden Markov model, for which a 1 = · · · = a d , i.e. σ is constant, and where the optimal filter is finite dimensional, see Elliott et al. (1995)]. In Approach 2 the time discretized MSM data is processed using a discrete time EM algorithm to obtain a discrete time ML estimate of the parameters. In the discrete time case a finite dimensional filter exists for a MSM—indeed the filter is virtually identical to the discrete time Hidden Markov Model filter. For discrete-time ML estimation see e.g. Elliott et al. (1997), Hamilton (1989), for the discretization of a continuous-time model Chourdakis (2002), James et al. (1996). Usually the Euler scheme is used for the discretization, say with time step t yielding approximations Z i of R i = R it − R (i−1)t . If the rates of the underlying Markov chain are quite high so that with a high probability a jump occurs in time t, then this linearization no longer provides a good approximation for the law of R i . As the results in Sass and Haussmann (2004) for market data of stock prices suggest, we have to expect high rates when using daily observations. In theory this problem can be avoided by using a smaller time discretization interval for the approximation. However, in practice we usually have a minimum interval between our observations (e.g. ticker times for stock prices) and hence we cannot decrease the time discretization interval t sufficiently. For the same reason many methods based on the quadratic variation which are widely used to estimate the noise parameters can be inadequate, cf. the discussion in Barndorff-Nielsen and Shephard (2002). It is also possible to use a Bayesian approach, see Hahn et al. (2007) where a Markov chain Monte Carlo method for the MSM is derived which yields slightly better results than the corresponding EM algorithm. The estimation of the rates is still difficult but works well for medium sample sizes. Here it would be quite interesting to derive methods based on Beskos et al. (2005) and Fearnhead et al. (2008), i.e. to use particle filters based on sampling methods without discretization error, cf. also Fearnhead and Meligkotsidou (2007). Even in the cases where the EM algorithm and Bayesian methods work well (low transition rates, low noise variance) these algorithms require several iterations to converge. Due to the above limitations, there is strong motivation to explore the use of moment based methods for MSMs, cf. Timmermann (2000) for the discrete time case or Ho et al. (1996) for a generalized method of moments (GMM) in a different continuous time setting. 
Obviously such method would be computationally efficient since estimators for the moments E[R m t ] can be computed easily from the observations. However, it is not enough to look only at the moments with respect to t since to estimate all the parameters of our model we would need a lot of moments and the performance of such moment based methods can become quite poor, cf. Andersen and Sørensen (1996) and Hansen (1982) who analyze good choices of weighting matrices to overcome these problems. For example, a MSM with three states we have to estimate 12 parameters, hence we would have to use 12 moments using classical generalized method of moments. While we shall utilize the concept of weighting matrices, our main idea is that we look at each moment over C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
Moment based regression algorithms for MSMs
247
different time lags. Computing estimators over different sampling intervals l t we can use fewer moments to estimate all parameters. To give more insight into the techniques presented in this paper, consider a simplistic version of (1.1), with constant μ and σ . Then the observation process is R t = μ t + σ Wt ,
t ∈ [0, T ].
We then get 2 E Rt /t = μ2 t + σ 2 .
(1.3)
Clearly, it is not possible to estimate both parameters μ and σ using only the moment E [R 2t ]. But if we compute estimators Rˆ 2,l for E [R l2t ] for l = 1, . . . , L with L ≥ 2, then we can identify both parameters from a least-squares fit. Alternatively, viewing the moment (1.3) as function of time, t → μ2 t + σ 2 we can get unbiased estimates by a linear regression of (lt, Rˆ 2,l /(lt)), l = 1, . . . , L. The 0th coefficient of this regression yields the estimate of the volatility σ , and the first coefficient yields the trend μ. In this paper, we aim to extend this procedure to estimate also the other parameters using a regression or best-fit method, Therefore we need exact representations of the moments of the observation process of a MSM which are derived in Section 2. We emphasize that for the computation of the exact moments we have to work in the continuous time model. For two states we give more explicit representations in Section 3 before . we propose in Section 4 two methods based on the moments. First a best-fit method similar to generalized method of moments only that we use the sample moments over different time lags. In Theorem 4.1 we show that for two states this provides consistent estimates, when we increase the number of observations while keeping the observation period t constant. as we show in Theorem 4.1. In Section 4.3, we sketch a regression method based on an expansion of the moments. Numerical results based on simulations are provided in Section 5 and an application to daily data of the Dow Jones Industrial Average index in Section 6. In particular, the best-fit method gives accurate estimates (for a reasonable number of observations), even in critical cases like high transition rates or high noise variance. Especially for two states we can exploit the explicit representation of the transition propabilities we have in this case and can show consistency of the estimates. For more states we can expand the transition matrices in terms of the rate matrix and thus get approximations of the moments of arbitrary accuracy. We illustrate the algorithm on simulated and historical data. 1.1. Notation For a vector c, Diag (c) denotes the diagonal matrix with diagonal c. For n ∈ N0 = {0, 1, 2, . . .} the bottom n/2 of n/2 equals n/2 if n is even and (n − 1)/2 if n is odd. For an n × m-matrix A, Ai,· denotes the 1 × m matrix consisting of the ith row of A. So for m-dimensional vectors a, b, we have Ai,· b = dj=1 Ai, j b j while the inner product is denoted by a b ( transposition). By c(k) we denote the vector with components ci(k) = (ci )k . C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
248
R. J. Elliott et al.
2. COMPUTATION OF THE MOMENTS For a fixed time horizon T > 0 we consider the MSM (1.1), (1.2). In the case that a 1 = . . . = a d we have a Hidden Markov Model with constant noise parameter σ . Otherwise we shall assume that ij ai = a j for all i = j. We denote the transition matrices of Y by P t , Pt = P(Y t = e j | Y 0 = ei ). They can be computed from the rate matrix Q by Pt = exp{Q t} :=
ti . i!
Qi
i≥0
(2.1)
Let ν ∈ (0, 1)d denote the invariant distribution of Y (which exists and is unique since by assumption Y is irreducible and aperiodic). It satisfies ν P t = ν as well as ν Q = 0. We assume that the chain starts with its invariant distribution, i.e. P(Y 0 = ek ) = ν k , k = 1, . . . , d. This implies P(Y t = ek ) = ν k for all t ≥ 0. For k, l ∈ N0 we define the joint moments 2k l
t
Jtk,l = E
t
σs d W s
0
.
μs ds
(2.2)
0
2.1. Main results The following proposition states how we can express the moments of the MSM R in (1.1) in terms k,l of the joint moments J k,l t . Theorem 2.1 then provides explicit formulae for J t . For proofs we refer to the appendix. PROPOSITION 2.1. Consider the MSM R given by (1.1). Then for m ∈ N, m/2 m E[Rtm ] = Jtk,m−2k , 2k k=0 where J k,m−2k satisfies t Jtk,m−2k
(2k)! = k E 2 k!
t 0
k σs2 ds
t
m−2k
μs ds
.
0
So we only have to consider certain joint moments, those defined in (2.2). The following theorem provides a formula to compute these moments given the rate matrix Q. THEOREM 2.1. For k, l ∈ N0 , the joint moments defined in (2.2) are given by Jtk,l =
(2k)!l! 2k j≥0
K k,l (Q i1 , . . . , Q ik+l−1 )
i 1 ,...,i k+l−1 ≥0 i 1 +...+i k+l−1 = j
t k+l+ j . (k + l + j)!
where K k,1 is defined for (d × d)-matrices A1 , . . . , Ak+l−1 by
k+l−1 k,l K (A1 , . . . , Ak+l−1 ) = ν Diag(ci )Ai ck+l . c1 ,...,ck+l ∈{a (2) ,b} #{i : ci =a (2) }=k
i=1
C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
Moment based regression algorithms for MSMs
249
To see how to find approximations of the moments using Theorem 2.1, we illustrate the computation of the first 6 moments of R t in the following lemma. We shall use the following lemma in the numerical studies of Section 5. To handle the long representations we shall use the notation v (k) = (v k 1 , . . . , vdk ) and Dv = Diag (v) for a vector v. LEMMA 2.1. E[Rt ] = ν b t, E[Rt2 ] = ν a (2) t + ν b(2) t 2 + ν Db Qb t 3 /3 + ν Db Q 2 b t 4 /12 + . . . , 1 1,1 3 (2) 2 (3) E[Rt ] = 3ν Da b t + ν b + K (Q) t 3 2 1 1 1,1 2 (2) (2) + ν (Db Q b + Db Qb) + (K (Q )) t 4 + . . . , 4 8 4 (4) 2 (2) (2) (2) (2) E[Rt ] = 3 ν a t + 6ν Da b + ν Da Q a t3 1 1,2 1 (2) 2 (2) 4 1,2 (4) t + ..., + ν b + (K (Q, I ) + K (I , Q)) + ν Da Q a 2 4 5 2,1 2,1 (2) (3) 5 (4) 3 E[Rt ] = 15 ν Da b t + t4 (K (Q, I ) + K (I , Q)) + 10ν Da b 4 1 2,1 2 + (K (Q , I ) + +K 2,1 (Q, Q) + K 2,1 (I , Q 2 )) 4 1 1,3 1,3 1,3 (5) + (K (Q, I , I ) + K (I , Q, I ) + K (I , I , Q)) + ν b t5 + . . . , 2 15 (2) (4) (4) (2) (4) (2) 6 (6) 3 E[Rt ] = 15ν a t + t4 ν (Da Qa + Da Qa ) + 45ν Da b 4 3 + ν (Da (2) Q 2 a (4) + Da (4) Q 2 a (2) + Da (2) Q Da (2) Qa (2) ) 4 3 2,2 2,2 2,2 (2) (4) + (K (Q, I , I ) + K (I , Q, I ) + K (I , I , Q)) + 15ν Da b t5 2 (6) + ν b + · · · t6 + · · · .
2.2. Some intuition behind the proof of Theorem 2.1 Let us give some intuition of the proof of the above theorem. It will be shown in the proof (see Appendix), that for the transition matrices P t Jtk,l =
(2k)! l! 2k
0
t
t3
··· 0
0
t2
K k,l (Pt2 −t1 , · · · , Ptk+1 −tk+l−1 ) dt1 dt2 · · · dtk+l .
(2.3)
So we have to take the sum over all possible combinations of vectors c1 , . . . , ck+l with k vectors equal to a(2) and l vectors equal to b. In the simplest case (n = 1) we have E[R t ] = J 0,1 by t Proposition 2.1. So the only possible choice is c1 = b. Further the product in the definition of C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
250
R. J. Elliott et al.
K 0,1 is empty, hence we get from Theorem 2.1 t E[Rt ] = Jt0,1 = ν b dt1 = ν b t. 0
For n = 2,
E Rt2 = Jt0,2 + Jt1,0 ,
where with the same argument as for J 0,1 t above t Jt1,0 = ν a (2) dt1 = ν a (2) t. 0
For
J 0,2 t
we have c1 = c2 = b and hence simply K 0,2 (A) = ν Diag (b)Ab. Thus, t t2 Jt0,2 = 2 ν Diag(b)Pt2 −t1 bdt2 dt1
Using the expansion Pt2 −t1 =
0
0
1 j j≥0 j! Q (t2
Jt0,2 = 2
− t1 ) j , see (2.1), we get
ν Diag(b)Q j b
j≥0
t 2+ j . (2 + j)!
Theorem 2.1 above gives the general formula. 2.3. Intuition behind Lemma 2.1 To give some intuition on the proof of the above lemma, we show how to compute the fourth moment. By Proposition 2.1, E Rt4 = Jt0,4 + 6 Jt1,2 + Jt2,0 . 2.3. From the definition of K k,l in Theorem 2.1 and using DvDw = D(vw), Dv v = v (2) , Dv w = Dw v we get Jt0,4 = 4!
j≥0
Jt1,2
ν (DbQ i1 DbQ i2 DbQ i3 b)
i 1 ,i 2 ,i 3 ≥0 i 1 +i 2 +i 3 = j
t 4+ j (4 + j)!
t5 = ν b(4) t 4 + ν DbQb(3) + Db(2) Qb(2) + Db(3) Qb + ··· 5 t 3+ j =2 K 1,2 (Q i1 , Q i2 ) (3 + j)! j≥0 i1 ,i2 ≥0 i 1 +i 2 = j
Jt2,0
t4 = ν Da (2) b(2) t 3 + ν K 1,2 (Q, I ) + K 1,2 (I , Q) + ···, 12 t 2+ j = 3! Da (2) Q j Da (2) (2 + j)! j≥0 = 3Da (4) t 2 + Da (2) Qa (2) t 3 + Da (2) Q 2 a (2)
t4 + · · · ., 4
C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
251
Moment based regression algorithms for MSMs
where I is the identity matrix and where we have by the definition in Theorem 2.1 K 1,2 (Q i1 , Q i2 ) = ν (Da (2) Q i1 DbQ i2 b + DbQ i1 Da (2) Q i2 b + DbQ i1 DbQ i2 a (2) ) and we expanded K 0,4 (Q j ), K 2,0 (Q j ) since they only consist of one term. Gathering the terms of the same order in t we can write down approximations up to a given order of t as exemplified in the above lemma where we gave explicit formulae for the first 6 moments.
3. TWO STATE MARKOV SWITCHING MODEL For the special case of a two state Markov chain Y, i.e. d = 2, the above moment formulae can be conveniently computed for a MSM as we now describe. Denote the transition rates of the two state Markov chain Y by Q=
−λ1
λ1
λ2
−λ2
,
λ1 , λ2 > 0,
and the total rate by λ = λ1 + λ2 . The invariant distribution of Y is given by ν = ( p, 1 − p) ,
p=
where
λ2 . λ
The transition probabilities are of the form Pt = A + (I − A)e−λt ,
t ≥ 0,
(3.1)
where I is the identity matrix and A=
p
1− p
p
1− p
,
so
I−A=
1− p
−(1 − p)
−p
p
.
By μk , σ l , μk σ l we denote the (mixed) annualized moments of μ and σ under the invariant distribution, i.e. μk = p b1k + (1 − p)b2k ,
σ l = p a1l + (1 − p)a2l ,
μk σ l = p b1k a1l + (1 − p)b2k a2l . (3.2)
And in the case that k = 1 we might omit the index, i.e. μ = μ1 , μσ 2 = μ1 σ 2 . Note that e.g. μ and σ 2 are the mean and the variance of R1 , respectively. Thinking of the time measured in years we hence call these moments the annualized moments. C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
252
R. J. Elliott et al.
LEMMA 3.1. E[Rt ] = μ t, E[Rt2 ] = σ 2 t + μ2 t 2 + 2(μ2 − μ2 )(e−λt − 1 + λt − (λt)2 /2)/λ2 , E[Rt3 ] = 3 μσ 2 t 2 + μ3 t 3 + 6(μσ 2 − μ σ 2 )(e−λt − 1 + λt − (λt)2 /2)/λ2 + 12(μ μ2 − (μ)3 )(1 − λt + (λt)2 /2 − (λt)3 /6 − e−λt )/λ3 + 6(μ3 − 2 μ μ2 + (μ)3 )(−2 + λt + (2 + λt)e−λt − (λt)3 /6)/λ3 , E[Rt4 ] = 3 σ 4 t 2 + 6 μ2 σ 2 t 3 + μ4 t 4 + 6(σ 4 − (σ 2 )2 )(e−λt − 1 + λt − (λt)2 /2)/λ2 + 6(4 μ2 σ 2 + 8 μ μσ 2 − 12 (μ)2 σ 2 )(1 − λt + (λt)2 /2 − (λt)3 /6 − e−λt )/λ3 + 6(6 μ2 σ 2 − 4 μ2 σ 2 − 8 μ μσ 2 + 6 (μ)2 σ 2 )(−2 + λt + (2 + λt)e−λt − (λt)3 /6)/λ3 + 72((μ)2 μ2 − (μ)4 )(e−λt − 1 + λt − (λt)2 /2 + (λt)3 /6 − (λt)4 /4!)/λ4 + 72(μ μ3 − 2 (μ)2 μ2 + (μ)4 )(3 − 2λt + (λt)2 /2 − (3 + λt)e−λt − (λt)4 /4!)/λ4 + 24(μ4 + 3 (μ)2 μ2 − 3 μ μ3 − (μ)4 )(−3 + λt + (3 + 2λt + (λt)2 /2)e−λt − (λt)4 /4!)/λ4 .
4. PARAMETER ESTIMATION FOR MARKOV SWITCHING MODELS We first provide a best-fit method which can be viewed as a generalized method of moments using moments over different time lags (not only over the smallest time span t). For the two state model described in Section 3 we show in Section 4.2 how to use the representations derived in Lemma 3.1 to estimate the parameters a 1 , a 2 , b1 , b2 , p, λ by proving identifiability of the parameters and consistency in Section 4.2. In Section 4.3 we present an alternative moment based regression method. Assume that we observe the discrete time observation sequence R 0 , R t , R 2t , . . . R N t with a time discretization interval t, so N = T /t . To estimate the moments over different time lags lt we use Rˆ m,l =
N /l 1 (R(i+l)t − Rit )m , N /l i=0
(4.1)
choosing moments m = 1, . . . , M and time lags l = 1, . . . , L N . So we consider M moments, N each over L different time lags, thus we use M L lagged moments alltogether. We may write Rˆ m,l to highlight the dependency on the number of observations. N LEMMA 4.1. The estimators Rˆ m,l are strongly consistent, i.e. m N = E Rlt a.s. lim Rˆ m,l N →∞
With the above results, we now present the best-fit algorithm. C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
Moment based regression algorithms for MSMs
253
4.1. Best-fit algorithm We wish to estimate parameters (a i )i=1,... ,d , (bi )i=1,... ,d and λi j for i, j = 1, . . . , d, i = j, which we gather into the parameter vector denoted ξ : ξ = (a1 , · · · , ad , b1 , · · · , bd , Q 12 , · · · , Q 1d , Q 21 , Q 23 , · · · Q d,d−1 ) We gather the M L lagged moments E[R lmt ], m = 1, . . . , M, l = 1, . . . , L into a vector denoted η: M M η = (E[Rt ], · · · , E[R Lt ], · · · , E Rt , · · · , E R Lt ) . (4.2) Further we denote the estimates given by the sample means (4.1) as η, ˆ so ηˆ = ( Rˆ 1,1 , · · · , Rˆ 1,L , · · · , Rˆ M,1 , · · · , Rˆ M,L ) . Suppose the analytic expressions (or approximations) of the lagged moments in terms of the parameters are given by a function f on X = Rd+ × Rd × Rd(d−1) with values in R M L , i.e. + f (ξ ) = η
( f (ξ ) ≈ η).
As we saw in Section 3, for two states (d = 2) we can compute the moments gathered in η in a concise form while for more states we have to use in practice an approximation of f using the expansions in Theorem 2.1, e.g. the expansions up to a certain order of t. Using a quadratic minimization criterion we want to find a solution of min( f (x) − η) ˆ ( f (x) − η), ˆ x∈X
using a suitable positive definite weighting matrix in R M L×M L . We choose the solution ξˆ of the minimization problem as our estimate for the parameter vector ξ . In Section 5, we will use the following algorithm. ALGORITHM 4.1. (1) Choose moments m = 1, . . . , M and time lags l = 1, . . . , L and an approximation f (ξ ) of the lagged moments η. (2) Compute ηˆ = ( Rˆ 1,1 , · · · , Rˆ 1,L , · · · , Rˆ M,1 , · · · , Rˆ M,L ) . (3) Compute the weighting matrix . (4) Find the estimator ξˆ as argument of min( f (x) − η) ˆ ( f (x) − η) ˆ x∈X
As weighting matrix we will use one of the following: (1) ii = 1/ηi2 , i j = 0, i, j = 1, · · · , d, i = j, (2) ii = 1/γii , i j = 0, i, j = 1, · · · , d, i = j, (3) = ((γ i j )i, j=1,... ,d )−1 , where γ i j denotes the sample covariance corresponding to the observation as numbered in vector η, e.g. if L ≥ 2 2 M γ1,1 = Var[Rt ], γ1,L+2 = Cov Rt , R2t , γ1,M L = Cov Rt , R Lt . C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
254
R. J. Elliott et al.
We have to use suitable approximations of ηi and γ i j , e.g. by the sample means ηˆ i and the sample covariances. So in choice (1) we use a diagonal weighting matrix with one over the squares of the sample moments in the diagonal. This corresponds to minimizing the relative error given the sample mean. A natural choice. For the simulated data in Section 5 this already provides excellent results. For more sophisticated choices of the weighting matrix cf. Andersen and Sørensen (1996) and Hansen (1982). The latter shows how to get asymptotically weighting matrices based on estimates of the covariance matrix. Since in the general case we only approximate the moments, it is not clear if the arguments carry over. Further, due to the large dimensions of the covariance matrix in (3) which may lead to severe estimation errors when we sum over all (ML)2 elements, Ho et al. (1996) propose to use only the diagonal of this optimal weighting matrix which leads to (2). In Section 5, we will look at all choices (1), (2) and (3). Note that the full weighting matrix (3) accounts for the covariance between the moments for one sampling period as well as for the dependency over the different time lags with the drawback that one has to compute M L(M L + 1)/2 sample covariances. The computation of non trivial weighting matrices is the most time consuming part of Algorithm 4.1. REMARK 4.1. (1) The idea is to fit for each m = 1, . . . , M, the lagged sample moments Rˆ m,l , l = 1, · · · , L to the curves t → E [Rm t ], m = 1, . . . , M. So the method uses more information than many methods which only rely on the moments at the smallest time discretization interval t. The next Section 4.2 shows that we in fact can get more information and identify more than M parameters. (2) The choice of L is difficult and has to be further investigated. It has to reflect the desire to get enough information on the curve, so say L ≥ 10, and to achieve a reasonable accuracy for the lagged sample moments Rˆ m,l which depends very much on parameters. 4.2. Identifiability and consistency For a two state chain we want to show that we can indeed obtain consistent estimates using less moments then parameters but sampling these over different discretization intervals. We consider a simple algorithm, which uses no weighting matrix and only as many lagged moments as parameters to ease the computations. For two states (see Section 3) we have to estimate the parameters ξ = (a1 , a2 , b1 , b2 , p, λ) .
(4.3)
Using Lemma 3.1 we can express the moments in terms of the annualized moments ζ = (μ, μ2 , σ 2 , μ3 , μσ 2 , λ) ,
(4.4)
say η = g(ζ ), and the annualized moments by the parameters, say ζ = h(ξ ). So we have f (ξ ) = g(h(ξ )) in the notation of Section 4.1. Obviously it is not enough to fit only the curve given by, say, the second moment. Since by Lemma 3.1 this could be expressed by λ, μ, μ2 , σ 2 . These four quantities are not enough to fit the six parameters in (4.3). Nevertheless it would be enough to use the first three moments. But for more stable estimates one can always choose further moments. To discuss consistency C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
Moment based regression algorithms for MSMs
255
and identifiability we shall now look at a minimal choice of moments using only the first three moments. In that case identifiability of the six parameters a 1 , a 2 , b1 , b2 , p, λ is no longer obvious. From the representations in Lemma 3.1 we see that we can estimate at most μ, μ2 , σ 2 and λ from Rˆ 1,1 and Rˆ 2,l , l = 1, · · · , L for any number L ≥ 3. So we cannot identify all 6 parameters from these observations. But the third moments depend also on μ3 , μσ 2 so we may hope to find all parameters from the estimates ηˆ for 2 2 2 3 3 η = E[Rt ], E Rt , E R2t , E R3t , E Rt , E R2t
(4.5)
which we obtain as in (4.1) above. Note that we used in Section 4.1 for each moment the same maximum time lag L to ease the notation. The Algorithm 4.1 can be implemented analogously when we use different maximum lags for each moment like we do here. Let, as above, f : X → Y, X = R2+ × R2 × (0, 1) × R+ , Y = R × R3+ × R2 be the function which assigns the moments as given by η to the values of the parameters. Here we shall look at the direct algorithm (no weighting matrix) ALGORITHM 4.2. Consider ξ , η as in (4.4), (4.5). (1) Compute ηˆ as in (4.1). (2) Do a least-squares fit min( f (x) − η) ˆ ( f (x) − η) ˆ x∈X
yielding ξˆ . PROPOSITION 4.1. Suppose ξ , η are given as in (4.4), (4.5) and b1 = b2 . For each > 0 there exist neighbourhoods U of ξ , V of η and a unique continuous function ϕ:V → U such that supx∈U x − ξ < and f (ϕ(y)) = y
f or all
y ∈ V.
Since we assume that the time t between two observations is constant and cannot be decreased, we get a longer total observation time T = N t, when we increase the number of observations N to prove consistency. THEOREM 4.1. Suppose b1 = b2 . For the choice of moments in (4.5) Algorithm 4.2 yields consistent estimates, i.e. for the estimates ξˆ N obtained for N observations lim ξˆ N = ξ
N →∞
a.s. .
Here we used for exposition the identity matrix as weighting matrix. But the result also holds for any positive definite weighting matrix. REMARK 4.2. While we cannot prove consistency for the general model with more than two states since we have to use approximations of the moments, we could still prove identifiability of the parameters for suitable choices of moments over different sampling periods. C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
256
R. J. Elliott et al. Table 1. Regression method for estimating constant volatility. Second moment Linear regression Quadratic regression
Average
0.2661
0.2516
0.2483
Standard deviation
0.0053
0.0065
0.0094
Relative error
6.44 %
0.65 %
−0.70 %
4.3. Regression Alternatively to the best-fit method one can write down the expansions E[Rtm ] t (m+1)/2
=
Lr
Cm, j t j ,
m = 1, · · · , M,
j=0
Lr j up to a given order L r of t and regress (it, Rˆ m,l )l=1,···,L on j=0 C m, j t . Setting C 1,0 = Rˆ 1,1 /t, C1, j = 0 for j ≥ 1 and having obtained the coefficients C m, j t j , m ≥ 2, from these polynomial regressions we can then find an estimate ξˆ by minimizing ψm, j (x) − Cm, j 2 m, j
Cm, j
,
over X , where ψ m, j (ξ ) denotes the analytic expression of coefficient j of moment m in the expansions above, in terms of the parameter vector ξ like we can obtain it from Lemmas 2.1 and 3.1. For two states this approach works quite well if λ t 1. If we use in Lemma 3.1 and truncate the expansion of e−λt after some power of t and divide by a suitable power of t we get E[Rt1 ]/t = μ E[Rt2 ]/t ≈ σ 2 + μ2 t − (μ2 − (μ)2 )(λt 2 /3 + λ2 t 3 /12) + · · · E[Rt3 ]/t 2 ≈ 3 μσ 2 + (μ3 − (μσ 2 − μ σ 2 )λ)t − ((μ3 − μ μ2 )λ/2 − (μσ 2 − μ σ 2 )λ2 /4)t 2 +((3 μ3 − 4 μ μ2 + (μ)3 )λ2 − (μσ 2 − μ σ 2 )λ3 )t 3 /20 + · · · . While we cannot expect consistency, we can show similar as in the proof of Proposition 4.1 that for 2 states the six parameters can be identified e.g. from the coefficients C 1,0 = μ, C 2,0 = σ 2 , C 2,1 = μ2 , C 2,2 , C 3,0 , C 3,1 of the expansions above. REMARK 4.3. (1) For the regression method to work satisfactorily we need max |Q i | t L r 1. Therefore we have to use L smaller than the maximum lag of the best-fit method. (2) An advantage of the regression method is that if we are only interested in e.g. the constant noise parameter σ of a Hidden Markov Model, it is enough to estimate the 0th coefficient σ 2 = σ 2 of the expansion of E [R 2t ]/t, i.e. we do not need to compute the moments explicitly. For example, if we observe over 5 years some asset prices daily (only 1250 observations) with return process given by (1.1) and parameters σ = 0.25, μ with states 0.8, −0.6 and corresponding rates 12 and 14, then we get for 100 simulated trajectories the results in Table 1, where ‘second moment’ means that we C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
Moment based regression algorithms for MSMs
257
use Rˆ 2,1 /t as estimate. Clearly the regressions lead to better results. If we want to estimate all parameters, the best-fit methods seems to work better – even when using only an approximation to the moments, cf. Examples 5.1 below.
5. NUMERICAL EXAMPLES In Remark 4.3 and Table 1 we saw that 1250 observations can yield an accurate estimate of constant σ . In this section we illustrate the performance of the moment based estimators when all the parameters of the MSM are estimated. To obtain accurate estimates of all the parameters of the MSM, more observations are required. We use the ‘best-fit’ Algorithm 4.1 with one of the choices (1), (2) and (3) for a weighting matrix like presented after Algorithm 4.1. In the first example we compare with the ‘regression’ method as described in Section 4.3. In the Examples 5.1, 5.2, 5.3 for 2 states we use states b = (2, − 1) , as parameter for the invariant distribution p = 0.52, and consider M = 4 moments for different L. We always average over 100 simulated trajectories and provide the average estimates (av.), the relative error (r.e.) in %, the standard deviation (s.d.) and the root mean squared errors (r.m.s.e.). Further, in some tables we provide the average computation times (av.c.t.) on a standard personal computer. Example 5.1. We consider N = 10, 000 observations with time discretization interval t = 0.001, i.e. T = 10 ‘years’. We first use the first and the second moment to get estimates μ0 and σ02 for μ and σ 2 . We then use b0 = (μ0 + 0.5, μ0 − 0.5) , a0 = ((σ02 )1/2 , (σ02 )1/2 ), and p 0 = 0.5, λ0 = 50 as initial values for the minimization procedure. These are often far off the true values. First we look at parameters which we expect to be easy to estimate: We use low noise a = (0.1, 0.05) and low rates given by the total rate λ = 25, corresponding to λ1 = 12, λ2 = 13. Lower rates should be easier to estimate since the expansion of e−λt is in terms of λ t, concrete, we stay longer in one state and our model is close to a discrete time Markov switching model. The results in Table 2 show that the we get accurate and unbiased estimates. The latter can be seen from the small difference between standard deviation (s.d.) and root mean squared error (r.m.s.e.) which measures the deviation from the true parameters. The regression method leads to a slightly bigger relative error for the rates with higher standard deviation than the best-fit method. In Table 2 we used L = 10. As pointed out in Remarks 4.1 and 4.3, for the regression method we would expect a lower choice of L to be better and for the best-fit method the choice should not have such a big influence as long as it is not too small. We illustrate the dependency of the estimates on the choice of L in Table 3. Indeed we observe that the estimates for the regression method get worse with increasing L. The same effect can be observed when we consider higher rates. Therefore, in the following we will only consider the best-fit method. As the results in Table 2 further indicate, all weighting matrices lead to comparable good results, hence we will choose in the following the best-fit method with diagonal weighting matrices (2) since the average computation (av.c.t.) of 5.90 s is much less then 55.89 s for the full weighting matrix (3). Note that the weighting matrix has dimension ML × ML, so due to symmetry ML(ML + 1)/2 = 820 covariance estimates have to be computed for the full weighting matrix compared to ML = 40 sample variances for the diagonal weighting matrix. C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
258
Method
R. J. Elliott et al. Table 2. Comparison of methods for 100 simulations using L = 10. b1 b2 a1 a2 True
2.000
−1.000
0.1000
0.0500
p
λ
0.5200
25.00
Best fit
av.
1.9988
−0.9923
0.0999
0.0505
0.5112
23.762
Weights (1) av.c.t.:
r.e. [%] s.d.
−0.0621 0.0896
−0.7673 0.0404
−0.0504 0.0012
0.9970 0.0020
−1.6956 0.0487
−4.9522 16.313
4.82 s
r.m.s.e.
0.0892
0.0409
0.0012
0.0021
0.0493
16.278
Best fit
av.
2.0023
−0.9951
0.0998
0.0504
0.5109
25.924
Weights (2) av.c.t.: 5.90 s
r.e. [%] s.d. r.m.s.e.
0.1173 0.0810 0.0806
−0.4929 0.0458 0.0458
−0.1566 0.0012 0.0012
0.8939 0.0021 0.0022
−1.7540 0.0479 0.0485
3.6948 16.286 16.230
Best fit
av.
1.9628
−1.0107
0.0994
0.0493
0.5171
24.779
Weights (3) av.c.t.: 55.89 s
r.e. [%] s.d. r.m.s.e.
−1.8598 0.0726 0.0812
1.0714 0.0362 0.0376
−0.6310 0.0017 0.0018
−1.3096 0.0016 0.0018
−0.5672 0.0466 0.0465
−0.8851 9.2855 9.2416
Regression
av. r.e. [%] s.d. r.m.s.e.
2.0074 0.3698 0.1494 0.1489
−0.9932 −0.6815 0.1749 0.1742
0.0985 −1.4951 0.0029 0.0032
0.0480 −3.9709 0.0056 0.0060
0.5133 −1.2871 0.0587 0.0588
22.710 −9.1588 66.942 66.646
av.c.t.: 2.20 s
Table 3. The dependency of the estimates on L. b1 b2 a1 a2 p 2.000 −1.000 0.1000 0.0500 0.5200
Method true par.
L
Best fit Weights (2)
10 20 40 80
1.9827 1.9885 2.0012 2.0251
−0.9970 −0.9835 −0.9515 −0.9104
0.1001 0.1002 0.1002 0.1000
0.0500 0.0515 0.0559 0.0636
Regression
10 20 40 80
1.9713 1.9745 1.9348 1.9863
−0.9883 −1.0030 −1.1244 −1.2343
0.0991 0.0956 0.0708 0.0941
0.0473 0.0457 0.0391 0.0452
λ 25.00
av.c.t.
0.5277 0.5244 0.5163 0.5042
22.126 22.599 22.217 22.626
6.4121 12.565 25.341 44.386
0.5374 0.5440 0.5938 0.5603
27.945 31.164 24.139 17.379
2.3797 3.6127 5.8604 9.6522
Example 5.2. Let us look at more challenging parameter choices considering a higher noise level a = (0.2, 0.1) or higher rates λ = 250, respectively. Again we choose L = 10. For a sample size of 10,000 the results would not always be satisfactory, hence we give the estimates for N = 20, 000 observations in Table 4 including the parameter choice we used before. While the estimates are still good we observe that as expected the standard deviation for higher σ gets higher and that the estimates for the states b are not as good for the high rates as for low rates. Example 5.3. Now we compare the best-fit method for diagonal weighting matrix (2) with a maximum likelihood method for the linearized model. We use the EM algorithm like it is proposed C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
259
Moment based regression algorithms for MSMs Table 4. Higher noise and higher rates using b = (2, − 1), p = 0.52, L = 10. λ = 25 λ = 25 λ = 250 a 1 = 0.1, a 2 = 0.05 av.
r.e.[%]
s.d.
a 1 = 0.2, a 2 = 0.1
a 1 = 0.1, a 2 = 0.05
av.
r.e.[%]
s.d.
av.
r.e.[%]
s.d.
b1
2.004
0.214
0.064
2.000
0.009
0.185
2.059
2.925
0.121
b2
−0.990 0.100
−0.999 0.050
0.033 0.001
−1.007 0.200
0.648 −0.081
0.091 0.002
−0.908 0.100
−9.181 0.300
0.058 0.001
0.051
1.104
0.002
0.099
−1.011
0.008
0.054
9.192
0.002
0.517 23.24
−0.529 −7.043
0.032 10.86
0.522 23.96
0.437 −4.175
0.049 27.27
0.497 244.5
−4.528 −2.182
0.024 36.11
a1 a2 p λ
True λ 12.5 25. 50. 100. 200.
Table 5. Comparison of best-fit method and EM algorithm for different λ. Method b1 b2 a1 a2 p true par. 2.000 −1.000 0.1000 0.0500 0.5200 Best fit EM alg. Best fit EM alg. Best fit EM alg. Best fit EM alg. Best fit EM alg.
2.0014 2.0095 1.9938 2.0142 2.0039 2.0273 2.0179 2.0178 2.0386 1.9327
−0.9934 −0.9818 −1.0014 −0.9624 −0.9844 −0.8983 −0.9625 −0.7812 −0.9297 −0.6197
0.1001 0.1001 0.0999 0.1 0.0999 0.1002 0.0999 0.1007 0.1002 0.1014
0.0503 0.0508 0.0502 0.0519 0.051 0.0546 0.0523 0.0591 0.0538 0.0644
0.5208 0.538 0.5165 0.5057 0.5181 0.4845 0.5115 0.4467 0.4999 0.4252
λ 11.282 12.14 24.417 22.528 47.005 41.305 96.353 75.101 190.51 143.17
Elliott et al. (1997). To use their algorithm we have to discretize our model, leading to Rit − R(i−1)t = b Y(i−1)t t + a Y(i−1)t (Wit − W(i−1)t ), where the transition matrix of (Y it )i=0,... ,N is given by P t , parameterized by p and λ as in (3.1) above. Then we can estimate b, a, p, λ. As outlined in Section 1 we cannot derive a closed form continuous time EM algorithm. Using the Due to the linearization of the model we cannot expect that the discrete time EM algorithm leads to consistent results for high rates for which the linearization provides a poor approximation of the original model. We compare the method for total rates λ = 12.5, 25, 50, 100, 200. As initial values we always use 80% of the true value of λ and for the other parameters we compute initial values as described in Example 5.1 via estimating μ and σ 2 first. We use 20,000 observations and as regression parameter L = 10 for the best-fit method, and (only) two iterations for the EM algorithm. For 100 independent trials we gathered the averages of the estimated parameters in Table 5. Both methods are applied to the same sample. We did not provide the standard deviations and mention only that these are smaller for the EM algorithm. In fact for estimating the rates they are much smaller, so for rates 25, 50, 100, 200 the true parameters are not in the one-standard C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
260
R. J. Elliott et al.
deviation interval. For the best fit method the true values are always within a distance of one standard deviation from the average estimate. As explained above we expect the discrete time EM algorithm to perform poorer for higher rates. The simulation results in Table 5 seem to verify our guess. We have to point out that we only used two iterations of the EM algorithm, since especially for higher values of λ the estimates for λ often go off. There would be a little improvement in the estimates for b and a if we take 2 or 3 steps more. The estimates for p and λ would get much worse. So to stop the iteration after the second step seems to give the best results (on average). Regarding the computational cost, the EM algorithm needs cd 2 N matrix multiplications to compute the conditional means in each iteration for some constant c, cf. Elliott et al. (1997), while the best fit method requires to compute M L lagged sample moments, where we have to choose ML greater than the number of parameters d(d + 1), so M L is of the order d2 and we have summations and scalar multiplications of the same order d 2 N . The computational cost then depends very much on the choice of the weighting matrix which is (M L × M L)-dimensional. So for a diagonal weighting matrix it is also of the order d 2 N and for the full covariance matrix of order d 4 N . Thus for diagonal weighting matrices we can expect a computation time comparable to one iteration of the EM algorithm while it can be much higher for a full weighting matrix, cf. the average computation times (av.c.t.) in Table 2. In this example we had on average a computation time of 11.28 s for the best-fit algorithm using a diagonal weighting matrix and of 14.81 s for the EM algorithm for each iteration. Example 5.4. Finally, we consider a MSM with 3 states and 25,000 observations (250 each year). We use approximations of the first M = 6 moments derived from Theorem 2.1, each over L = 6 time lags. The true parameters are given in Table 6. As initial values we use for b the first moment +2, +0, −2, for each component of a the constant volatility estimated by a regression like explained in Remark 4.3, and for the rates we use always 20. We use algorithm 4.1 and compare different weighting matrices. The results show that states and volatilities can be estimated very well, in particular for the extreme states, but that estimation of the rates remains difficult. While both weighting matrices (2) and (3) improve the estimates for a and b with smaller standard deviation as (1), only the diagonal weighting matrix (2) yields better estimates for the rates Q i j , cf. the discussion in Section 4.1. But we observe that for the rates the standard deviation is considerably smaller than the root mean squared error, indicating a bias in the estimates, presumably due to the approximation of the moments. But we can compute the invariant distribution ν from the estimates by ν Q = 0 for each sample. These estimates for ν in Table 7 seem to be more stable than the rates itself. A further improvement can be expected using more observations and better initial conditions.
6. APPLICATION TO MARKET DATA We estimate parameters for daily returns of the Dow Jones Industrial Average Index (DJII) using data for 75 years up to the year 2006. We assume a MSM with a 3 state Markov chain. We proceed as in Example 5.4, only that we use different initial values and consider the diagonal weighting matrix (2) only. We get the estimates in the second column (32–06) of Table 8 where we computed the invariant distribution ν from the estimate for Q by solving ν Q = 0. Then we split the time series in M = 5 parts, each corresponding to 15 years, and estimate parameters for each of these subsamples. C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
261
Moment based regression algorithms for MSMs Table 6. Parameter estimation for three states, Example 5.4. b1 b2 b3 a1
Method True
2.5
0.2
−2.5
0.15
a2
a3
0.1
0.12
2.4365
0.2359
−2.4191
0.1484
0.0834
0.1189
(1) av.c.t.:
r.e.[%] s.d.
−2.5418 0.1655
17.960 0.0930
−3.2347 0.1194
−1.0936 0.0035
−16.610 0.0251
−0.8843 0.0025
12.98 s
r.m.s.e.
0.1765
0.0992
0.1437
0.0038
0.0300
0.0027
2.4798
0.1711
−2.5201
0.1488
0.0960
0.1195
−0.8091 0.1037 0.1052
−14.452 0.0453 0.0535
0.8036 0.0740 0.0763
−0.7855 0.0030 0.0032
−3.9674 0.0029 0.0049
−0.4558 0.0021 0.0021
2.4605
0.1663
−2.4836
0.1501
0.1025
0.1188
−1.5780 0.0908 0.0985
−16.827 0.0473 0.0579
−0.6543 0.0725 0.0740
0.0624 0.0032 0.0032
2.5134 0.0019 0.0032
−1.0334 0.0028 0.0031 Q32 12
Weights
av.
Weights
av.
(2) av.c.t.: 15.61 s
r.e.[%] s.d. r.m.s.e.
Weights
av.
(3) av.c.t.: 72.88 s
r.e.[%] s.d. r.m.s.e.
True
Q12 12
Q13 8
Q21 4
Q23 4
Q31 8
av. r.e.[%]
13.697 14.146
10.164 27.045
5.9393 48.484
5.9651 49.128
10.050 25.619
13.545 12.873
s.d.
1.1027
0.8568
1.2810
1.1651
0.8545
0.9928
12.98 s Weights (2)
r.m.s.e. av. r.e.[%]
2.0212 14.005 16.706
2.3255 9.5120 18.899
2.3207 5.3576 33.940
2.2815 4.7179 17.946
2.2189 9.5921 19.902
1.8336 14.010 16.750
av.c.t.: 15.61 s
s.d. r.m.s.e.
0.7958 2.1554
0.9737 1.7957
0.6770 1.5155
0.5358 0.8942
0.8203 1.7892
0.6535 2.1126
Weights (3) av.c.t.: 72.88 s
av. r.e.[%] s.d. r.m.s.e.
8.0545 −32.879 1.0817 4.0896
3.1274 −60.908 0.9962 4.9725
3.0378 −24.054 0.6850 1.1791
0.7334 −81.665 0.7949 3.3610
1.3546 −83.067 1.5667 6.8258
4.3865 −63.446 1.4840 7.7554
Method
Weights (1) av.c.t.:
The results are given in columns 3–7 while the last column provides the averages over these M estimates. The latter are computed weighted by the corresponding invariant distributions, i.e.
1/2 M M 2 m=1 νi,m ai,m m=1 νi,m bi,m bi = M , ai = , M i=1 νi,m i=1 νi,m where bi,m , ν i,m denote the estimates for bi , ν i in subsample m. Of course, the dynamics of the index would not follow a simple MSM with parameters constant over time. Due to the non-linear estimation procedure we thus cannot expect that the averages are very close to the overall estimates. Below we get more consistent estimates for a modified time series. But from the estimates we C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
262
R. J. Elliott et al. Table 7. Invariant distribution for three states, Example 5.4. ν1 ν2
Method
ν3
True
0.2
Weights (1)
av.
0.2307
0.5362
0.2331
Weights (2)
r.e.[%] av.
15.349 0.2152
−10.632 0.5820
16.548 0.2029
r.e.[%]
7.5817
−3.0041
1.4307
av.
0.1916
0.6254
0.1830
−4.2103
4.2322
−8.4863
Weights (3)
r.e.[%]
Years b1 b2 b3 a1 a2 a3 ν1 ν2 ν3
0.6
Table 8. Estimates for the DJII. 47–61 62–76 77–91
32–06
32–46
1.234 0.028 −0.970 0.132 0.106
1.634 −0.118 −0.562 0.308 0.206
2.033 0.126 −2.165 0.108 0.087
1.809 0.079 −2.496 0.218 0.090
0.603 0.068 0.871
0.797 0.064 0.871
0.233 0.069 0.858
0.061
0.065
0.074
0.2
92–06
Average
1.310 0.244 −4.007 0.142 0.098
1.097 0.091 −0.871 0.132 0.133
1.564 0.085 −1.960 0.193 0.131
0.145 0.050 0.900
0.564 0.062 0.878
0.386 0.071 0.860
0.489 0.063 0.873
0.050
0.061
0.068
0.064
can see some qualitative properties which stay the same over all time intervals and provide some insight: The results indicate one intermediate state b2 close to 0 and two quite extreme states with higher volatilities. Due to Example 5.4 we should be cautious about the estimates for the rate matrix Q, but we expect that the estimates for the invariant distribution ν – even when computed from Q – are more reliable. They show that in the mean we stay 87% of the time in the intermediate state and only with a low probability of about 6–7% we stay in each of the extreme states. The results are similar to the 6-year estimates in Hahn et al. (2005), where they were obtained by a Markov chain Monte Carlo method. As pointed out above, the times series might not be very well described by a 3-state MSM, e.g. since we also expect some jumps in the prices. Note that the jumps of the underlying Markov chain don’t lead to jumps in the prices. If we correct the time series for jumps by simply replacing extreme daily returns by the mean return, we get much more stable results for the modified DJII data, e.g. in Table 9 for cutting off at ± 0.05, corresponding to up- and downward jumps of the prices of approximately 5% which occurred for 117 of the 18,750 observations. The results in Table 9 show that the estimates become more stable over time and the averages are much closer to the estimates taken over the whole sample.
7. CONCLUSION The numerical results in Section 5 show that the best-fit method and the regression method provide accurate and consistent estimates for a model with two states if we have a sufficient number of observations. C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
263
Moment based regression algorithms for MSMs
Years
32–06
Table 9. Estimates for the modified DJII. 32–46 47–61 62–76 77–91
92–06
Average
b1
1.405 0.149
0.435 0.163
2.021 0.127
1.556 0.104
2.284 0.105
2.024 0.143
1.665 0.128
−2.459 0.299 0.124 0.262
−2.731 0.354 0.200 0.313
−2.014 0.111 0.087 0.225
−2.491 0.231 0.091 0.100
−2.663 0.264 0.136 0.209
−2.494 0.274 0.128 0.231
−2.470 0.262 0.134 0.232
0.065 0.871 0.064
0.066 0.865 0.069
0.064 0.865 0.071
0.047 0.904 0.050
0.062 0.877 0.061
0.069 0.869 0.062
0.062 0.876 0.062
b2 b3 a1 a2 a3 ν1 ν2 ν3
In the critical case of high rates in particular the best-fit method still gives accurate results and clearly outperforms the EM algorithm for the discretized (linearized) model while no continuous time EM algorithm can be implemented for the MSM. For more than two states, if the rates are not too high, we still get good estimates. For the rates they depend on a good choice of the weighting matrix and initial conditions become important. However, the estimates for the noise parameters are still very accurate. So the method proves particularly useful to estimate the volatility parameter(s) of coarsely observed continuous time processes.
ACKNOWLEDGMENTS Robert Elliott wishes to thank SSHRC for its continued support and J¨orn Sass acknowledges support by the Austrian Science Fund, FWF grant P17947-N12.
REFERENCES A¨ıt-Sahalia, Y. (2007). Estimating continuous-time models using discretely sampled data. In R. Blundell, T. Persson and W. K. Newey (Eds.), Advances in Economics and Econometrics, Theory and Applications, Nineth World Congress. Econometric Society Monographs 43, Volume 3, 261–327. Cambridge: Cambridge University Press. Andersen, T. G. and B. E. Sørensen (1996). Generalized Methods of Moments estimation of a stochastic volatility model: A Monte Carlo study. Journal of Business and Economic Statistics 14, 328–52. Barndorff-Nielsen, O. E. and N. Shephard (2002). Econometric analysis of ralized volatility and its use in estimating stochastic volatility models. Journal of the Royal Statistical Society, Series B 64, 253– 80. Beskos, A., O. Papapiliopoulos, G. O. Roberts and P. Fearnhead (2005). Exact and efficient likelihood-based estimation for discretely observed diffusion processes. Journal of the Royal Statistical Society, Series B 68, 333–82. Buffington, J. and R. J. Elliott (2002). American options with regime switching. International Journal of Theoretical and Applied Finance 5, 497–514. C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
264
R. J. Elliott et al.
Chourdakis, K. M. (2002). Continuous time regime switching models and applications in estimating processes with stochastic volatility and jumps. Working Paper 464, Department of Economics, Queen Mary, University of London. Elliott, R. J. (1993). New finite-dimensional filters and smoothers for noisily observed Markov chains. IEEE Transactions on Information Theory 39, 265–71. Elliott, R. J. (1994). Exact adaptive filters for Markov chains observed in Gaussian noise. Automatica 30, 1399–408. Elliott, R. J., L. Aggoun and J. B. Moore (1995). Hidden Markov Models – Estimation and Control. New York: Springer-Verlag. Elliott, R. J., W. C. Hunter and B. M. Jamieson (1997). Drift and volatility estimation in discrete time. Journal of Economic Dynamics and Control 22, 209–18. Elliott, R. J. and R. W. Rishel (1994). Estimating the implicit interest rate of a risky asset. Stochastic Processes and their Applications 49, 199–206. Fearnhead, P. and L. Meligkotsidou (2007). Filtering methods for mixture models. Journal of Computtional and Graphical Statistics 16, 586–607. Fearnhead, P., O. Papaspiliopoulos and G. O. Roberts (2008). Particle filters for partially observed diffusions. Journal of the Royal Statistical Society, forthcoming in, Series B. Guo, X. (2001). An explicit solution to an optimal stopping problem with regime switching. Journal of Applied Probability 38, 464–81. Hahn, M., S. Fr¨uhwirth-Schnatter and J. Sass (2007). Markov chain Monte Carlo methods for parameter estimation in multidimensional continuous time Markov switching models. RICAM Report 2007–09, RICAM, Linz. Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57, 357–84. Hansen, L. P. (1982). Large sample properties of generalized methods of moments estimators. Econometrica 50, 1029–54. Ho, M. S., W. R. M., Perraudin, B. E., Sørensen (1996). A continuous-time arbitrage pricing model with stochastic volatility and jumps. Journal of Business and Economic Statistics 14, 31–43. James, M. R., V. Krishnamurthy, F. Le Gland (1996). Time discretization of continuous-time filters and smoothers for HMM parameter estimation. IEEE Transactions on Information Theory 42, 593–605. Krishnamurthy, V. and R. J. Elliott (2002). Robust continuous-time smoothers without two-sided stochastic integrals. IEEE Transactions on Automatic Control 47, 1824–41. Nummelin, E. (1984). General Irreducible Markov Chains and Non-negative Operators. Cambridge: Cambridge University Press. Sass, J. and U. G. Haussmann (2004). Optimizing the terminal wealth under partial information: The drift process as a continuous time Markov chain. Finance and Stochastics 8, 553–77. Shephard, N. (2005). Stochastic Volatility: Selected Readings. New York: Oxford University Press. Tierney, L. (1994). Markov chains for exploring posterior distributions. The Annals of Statistics 22, 1701–62. Timmermann, A. (2000). Moments of Markov switching models. Journal of Econometrics 96, 75–111.
C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
265
Moment based regression algorithms for MSMs
APPENDIX Let (FtY )t∈[0,T ] denote the augmented filtration generated by Y. Proof of Proposition 2.1: Let Htn = For n ≥ 2 Itˆo’s formula implies
Htn = n
t
0
t
n σs dWs
.
0
Hsn−1 σs dWs + n(n − 1)
1 2
t
0
Hsn−2 σs2 ds
By iteration and taking the conditional expectation we get for even n = 2k t3 t2 (2k)! t E Ht2k | FTY = · · · σt2k · · · σt21 dt1 dt2 · · · dtk 2k 0 0 0 t k (2k)! = k σs2 ds 2 k! 0 and for odd n = 2k + 1 E Ht2k+1 | FTY
t3 t2 t1 (2k + 1)! t Y 2 2 = ··· σtk · · · σt1 E σs dWs FT dt1 dt2 · · · dtk 2k 0 0 0 0 =0. Hence we also have E
Ht2k+1
t
l
μs ds
t
=E
0
0
Therefore E[Rtm ]
=
m n
l 2k+1 Y μs ds E Ht | FT = 0.
j
t
m− j
t
l
μs ds E Ht j 0 m−2k
t m/2 m 2k μs ds E Ht . = 2k 0 k=0 j=0
The result follows from
E[Jtk,l ]
and
E
Ht2k
0
t
=E
l
μs ds
Ht2k
μs ds
0
t
=E 0
(2k)! = k E 2 k!
l 2k Y μs ds E Ht | FT
t
0
C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
k σs2
ds 0
t
l
μs ds
266
R. J. Elliott et al.
Proof of Theorem 2.1: We prove the theorem in two steps. Step 1: First we show that t3 t2 (2k)! l! t · · · K k,l (Pt2 −t1 , · · · , Ptk+1 −tk+l−1 ) dt1 dt2 · · · dtk+l . Jtk,l = 2k 0 0 0 Step 2: Then we show that the right hand side of the above expression equates to the right hand side of Theorem 2.1. Step 1: From Proposition 2.1 we have k t l
t (2k)! k,l 2 σs ds μs ds . (A.1) Jt = k E 2 k! 0 0 Using Fubini arguments we get k t l
t 2 σs ds μs ds E 0
=
k +l l
=
k +l l
0
−1
−1
E
k+l i=1
c1 ,···,ck+l ∈{a (2) ,b} #{i : ci =a (2) }=k
0
ci Ys
(k + l)!
t
E
t
ds
···
0
c1 ,···,ck+l ∈{a (2) ,b} #{i : ci =a (2) }=k
t3
0
t2
k+l
0
ci Yti
dt1 dt2 · · · dtk+l
.
i=1
For M = k + l d j j 1{Yt M−1 =ei } E Yt M c M | Yt M−1 = ei E cM Yt M | FtYM−1 = i, j=1
=
d
i, j
j
YtkM−1 Pt M −t M−1 c M
i, j=1
=
d
YtiM−1 Pti,· c . M −t M−1 M
i=1 j
Using again 1{Y j =e j } = Yt and Yti c Y t = Yti ci this yields t
E cM−1 Yt M−1 cM Yt M | FtYM−2
=
d
j Yt M−2 E cM−1 Yt M−1 E cM Yt M | FtYM−1 | Yt M−2 = e j
j=1
=
j Yt M−2 E cM−1 Yt M−1 YtiM−1 Pti,· c | Y = e M t j −t M−2 M M−1
d j,i=1
=
j Yt M−2 E ciM−1 YtiM−1 Pti,· c | Yt M−2 = e j M −t M−1 M
d j,i=1
=
d
j
j,i
Yt M−2 Pt M−1 (Diag(c M−1 )Pt M −t M−1 )i,· c M
j,i=1
=
d
j
j,·
Yt M−2 Pt M−1 (Diag(c M−1 )Pt M −t M−1 )c M .
j C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
267
Moment based regression algorithms for MSMs j
j
Using the tower property of conditional expectations, P(Y t = e j ) = ν j , and again Yt c Y t = Yt cj , we hence get by induction
k+l d k+l−1 j j,· E ci Yti −ti−1 = E (c1 Yt1 )Yt1 Pt2 −t1 Diag(ci )Pti −ti−1 ck+l i=1
=
d
j=1
j
j,·
ν j c1 Pt2 −t1
k+l−1
j=1
=ν
k+l−1
i=2
Diag(ci )Pti −ti−1 ck+l
i=2
Diag(ci )Pti −ti−1 ck+l .
i=1
Thus
E
k+l
ci Yti −ti−1
= K k,l (Pt2 −t1 , · · · , Ptk+1 −tk+l−1 )
i=1
c1 ,···,ck+l ∈{a (2) ,b} #{i : ci =a (2) }=k
and we get
k
t
σs2
E 0
t
= k lE
t
ds
···
0
t3
μs ds
0
0
l
t2
0
K k,l (Pt2 −t1 , · · · , Ptk+1 −tk+l−1 ) dt1 dt2 · · · dtk+l
and the result follows from (A.1). Step 2: Using Theorem 2.1 and the expansion (2.1) we get K k,l (Pt2 −t1 , · · · , Ptk+l −tk+l−1 )=
K k,l (Q i1 , · · · , Q ik+l−1 )
i 1 ,···,i k+l−1 ≥0 i 1 +···+i k+l−1 = j
(tk+l − tk+l−1 )ik+l−1 (t2 − t1 )i1 ··· . i1 ! i k+l−1 !
So it remains to show that for i 1 , . . . , i k+l−1 ≥ 0, i 1 + . . . + i k+l−1 = j t3 t2 t (tk+l − tk+l−1 )ik+l−1 t k+l+ j (t2 − t1 )i1 ··· dt1 dt2 · · · dtk+l = . ··· i1 ! i k+l−1 ! (k + l + j)! 0 0 0 Since
t2 0
t i1 +1 (t2 − t1 )i1 dt1 = 2 i1 ! (i 1 + 1)!
it is by induction enough to show for j, n ≥ 1 that t (t − s) j s n t n+1+ j ds = . j! n! (n + 1 + j)! 0 Expanding the first factor we get
t 0
j (−1)i (t − s) j s n (n + 1 + j)! j t n+1+ j ds = i n + 1 + i (n + 1 + j)! j! n! j! n! i=0
So it remains to show that j (−1)i (n + 1 + j)! j = 1. j! n! i n+1+i i=0 C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
(A.2)
268
R. J. Elliott et al.
Using (n + 1 + j − 1)! (n + 1 + i + j − i) (n + j)! (n + j)! (n + 1 + j)! = = + ( j − i) n+1+i n+1+i n! n+1+i we get
j j j (n + 1 + j)! j (n + j)! (−1)i j (−1)i ( j − i) i j = (−1) + j! n! j! n! n+1+i i n+1+i i i i=0 i=0 i=0 =(1−1) j =0
j−1 (−1)i (n + 1 + j − 1)! j − 1 = ··· i ( j − 1)! n! n+1+i i=0 1 (n + 1 + 1)! 1 (−1)i = 1!n! i n+1+i i=0
=
= n + 2 − (n + 1) = 1.
Proof of Lemma 3.1: For the first moment we simply have E[Rt ] = Jt0,1 =
t
ν b dt = μ t.
0
How to proceed for higher moments we exemplify by looking at the third moment. By Proposition 2.1 we have E[Rt3 ] = 3 Jt1,1 + Jt0,3 .
(A.3)
By (2.3) derived in the proof of Theorem 2.1 we get Jt1,1 =
t 0
s
ν Diag(b)Ps−r a (2) + ν Diag(a (2) )Ps−r b dr ds .
0
Using (3.1) we get ν Diag(b)Ps−r a (2) = ν Diag(a (2) )Ps−r b = μ σ 2 + (μσ 2 − μ σ 2 )e−λ(r −s) . Integration yields t2 1 + 2(μσ 2 − μ σ 2 ) 2 (e−λt − 1 + λt) 2 λ (λt)2 . = μσ 2 t 2 + λ22 (μσ 2 − μ σ 2 ) e−λt − 1 + λt − 2
Jt1,1 = 2 μ σ 2
By (3.1) ν Diag(b)Ps−r Diag(b)Pu−s b = μ3 + (μ μ2 − μ3 )(e−λ(s−r ) + e−λ(u−s) ) +(μ3 − 2 μ μ2 + μ3 )e−λ(u−r ) C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
269
Moment based regression algorithms for MSMs and (2.3) yields t
u
s
ν Diag(b)Ps−r Diag(b)Pu−s b dr ds du 1 (λt)2 − e−λt = μ3 t 3 + 12(μ μ2 − μ3 ) 3 1 − λt + λ 2 1 6(μ3 − 2 μ μ2 − μ3 ) 3 (−2 + λt + (2 + λt)e−λt ) λ 12 (λt)2 (λt)3 3 3 −λt 3 2 − −e = μ t + 3 (μ μ − μ ) 1 − λt + λ 2 6 (λt)3 6 . + 3 (μ3 − 2 μ μ2 − μ3 ) −2 + λt + (2 + λt)e−λt − λ 6
Jt0,3 = 3!
0
0
0
So the result follows by (A.3). For the other moments the computations are similar; note only that for higher moments the integrand is no longer symmetric.

Proof of Lemma 4.1: For each $l$ we can look at $(Y_n^l, R_n^l)_{n \in \mathbb{N}_0}$ as a discrete-time Markov chain with state space R, where $Y_n^l = Y_{nlt}$ and $R_n^l = R_{(n+1)lt} - R_{nlt}$. Obviously this chain is aperiodic, and one can show that it is also Harris recurrent. Hence a Law of Large Numbers applies, cf. e.g. Nummelin (1984) and Tierney (1994), and we have for each $l$
$$\lim_{N \to \infty} \frac{1}{N/l} \sum_{n=1}^{N/l} f\big(R_n^l\big) = E\big[f(R_{lt})\big] \quad \text{a.s.},$$
if the expectation on the right-hand side is finite. Note that we compute the expectations assuming that the law of $Y_0$ is already given by the invariant distribution of $Y$. In particular we get the consistency of the estimates $\hat{R}^{k,l}$ for $E[R_{lt}^k]$.

Proof of Proposition 4.1: Thinking of $y$ as the sample moments, we want to solve $f(x) - y = 0$. A solution is given by the true parameters, $f(\xi) = \eta$. By the Implicit Function Theorem we hence know that there exist neighbourhoods $V$ of $\eta$ and $U$ of $\xi$ and a unique continuous function $\varphi\colon V \to U$ with
$$f(\varphi(y)) = y \quad \text{for all } y \in V,$$
if $\partial f(\xi)$ is non-singular; here $\partial f$ denotes the Jacobian of $f$. So we have to show for the determinant that $\det \partial f(\xi) \neq 0$. It is simpler to do this in two steps by writing $f(x) = g(h(x))$ as we did above, i.e. $h$ maps the true parameters to the annualized ones and $g$ the annualized moments to the analytic moments given by Lemma 3.1. So $h(\xi) = \zeta$ and $g(\zeta) = \eta$ for $\zeta$ as in (4.4). Thus it is sufficient to verify $\det \partial h(\xi) \neq 0$ and $\det \partial g(\zeta) \neq 0$ for all possible parameters $\xi$. An easy computation yields
$$\det \partial h(\xi) = -4\, a_1 a_2 (b_1 - b_2)^5 (1-p)^2 p^2,$$
and a more tedious one
$$\det \partial g(\zeta) = -144\, t^4 \big(1 - e^{\lambda t}\big)^5 \big(\overline{\mu^2} - \bar{\mu}^2\big)\big(e^{2\lambda t} - 1 - 2\lambda t e^{\lambda t}\big)\, \frac{e^{-8\lambda t}}{\lambda^8}.$$
For $b_1 \neq b_2$ we also have $\overline{\mu^2} - \bar{\mu}^2 > 0$. So both determinants are strictly negative and the result follows from the argument above, noting that due to the continuity of $\varphi$ we can choose $V$ such that $U$ is contained in an $\varepsilon$-neighbourhood of $\xi$.

Proof of Theorem 4.1: Suppose $\varepsilon > 0$ and choose a neighbourhood $V$ of $\eta$ according to Proposition 4.1 such that for some $\delta \leq \varepsilon$ a $\delta$-neighbourhood $U$ of $\xi$ and $\varphi\colon V \to U$ exist with $f(\varphi(y)) = y$ for all $y \in V$. By Lemma 4.1 we have strong consistency for the moment estimates $\hat{\eta}_N$, hence there exists $N_0$ such that $\hat{\eta}_N \in V$ for all $N \geq N_0$. Since $\hat{\xi}_N = \varphi(\hat{\eta}_N) \in U$ we have
$$\|\xi - \hat{\xi}_N\| \leq \|\xi - \varphi(\eta)\| + \|\varphi(\eta) - \hat{\xi}_N\| < 2\varepsilon \quad \text{for all } N \geq N_0.$$
Since $\varepsilon$ was arbitrary this yields the result, observing that due to the uniqueness of $\varphi$, $\hat{\xi}_N = \varphi(\hat{\eta}_N)$ is also the unique solution in $U$ for the least-squares fit
$$\min_{x \in U} \|f(x) - \hat{\eta}_N\|.$$
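The inversion step in this proof is easy to mimic numerically. The sketch below is illustrative only: the moment map `f` is a hypothetical stand-in for the paper's composition g(h(·)), and scipy's `least_squares` performs the local fit min over x in U of ‖f(x) − η̂_N‖.

```python
# Illustrative inversion of a smooth moment map by local least squares,
# mirroring min_{x in U} ||f(x) - eta_hat|| from the proof of Theorem 4.1.
import numpy as np
from scipy.optimize import least_squares

def f(x):
    # hypothetical moment map R^3 -> R^3, used only for illustration
    a, b, c = x
    return np.array([a + b, a * b + c, a * b * c + b ** 2])

xi_true = np.array([0.5, 1.5, 0.2])
eta_hat = f(xi_true) + 1e-3 * np.random.default_rng(0).normal(size=3)

sol = least_squares(lambda x: f(x) - eta_hat, x0=np.array([0.4, 1.4, 0.3]))
print(sol.x)  # close to xi_true once eta_hat is close to f(xi_true)
```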
Econometrics Journal (2008), volume 11, pp. 271–286. doi: 10.1111/j.1368-423X.2008.00245.x
Factor analysis in a model with rational expectations

ANDREAS BEYER†, ROGER E. A. FARMER‡, JÉRÔME HENRY† AND MASSIMILIANO MARCELLINO§

†European Central Bank, Postfach 16 03 19, D-60066 Frankfurt am Main, Germany. E-mails: [email protected], [email protected]
‡UCLA, Department of Economics, 8283 Bunche Hall, Box 951477, Los Angeles, CA 90095-1477, USA. E-mail: [email protected]
§IEP-Bocconi University, IGIER and CEPR, Via Sarfatti 25, 20136 Milano, Italy. E-mail: [email protected]

First version received: May 2006; final version accepted: November 2007
Summary DSGE models are characterized by the presence of expectations as explanatory variables. To use these models for policy evaluation, the econometrician must estimate the parameters of expectation terms. Standard estimation methods have several drawbacks, including possible lack or weakness of identification of the parameters, misspecification of the model due to omitted variables or parameter instability, and the common use of inefficient estimation methods. Several authors have raised concerns over the implications of using inappropriate instruments to achieve identification. In this paper, we analyze the practical relevance of these problems and we propose to combine factor analysis for information extraction from large data sets and generalized method of moments to estimate the parameters of systems of forward-looking equations. Using these techniques, we evaluate the robustness of recent findings on the importance of forward-looking components in the equations of a standard New-Keynesian model. Keywords: New-Keynesian Phillips curve, Forward-looking output equation, Taylor rule, Rational expectations, Factor analysis.
1. INTRODUCTION

This paper is about the estimation of New-Keynesian models of the monetary transmission mechanism. We evaluate a number of recent findings obtained using single equation methods and we develop a system approach that makes use of additional identifying information extracted using factor analysis from large data sets. The combination of factor analysis and generalized method of moments (GMM) to estimate the parameters of systems of forward-looking equations is one of the distinctive features of our work, which extends the univariate analysis in Favero et al. (2005). The latter paper has also stimulated theoretical research on the properties of the factor-GMM estimators, see Bai and Ng (2006b) and Kapetanios and Marcellino (2006b), which provide a sound theoretical framework for our empirical analysis.
Following the influential work of Galí and Gertler (1999, hereafter GG), a number of authors have used instrumental variable methods to estimate one or more equations of the New-Keynesian model of the monetary transmission mechanism. GG used the New-Keynesian paradigm to explain the behaviour of U.S. inflation as a function of its first lag, expected first lead, and the marginal cost of production. Their work stimulated considerable debate, much of which has focused on the size and significance of future expected inflation in the New-Keynesian Phillips curve. Similar arguments have been made over the role of expected future variables in the other equations of the New-Keynesian model: for example Clarida et al. (1998) estimate a Taylor rule in which expected future inflation appears as a regressor and Fuhrer and Rudebusch (2004) have estimated an Euler equation for output in which expected future output appears on the right-hand side. The estimation of models that include future expectations has revived a debate that began in the 1970s with the advent of rational expectations econometrics. In this context, a number of authors have raised econometric issues that relate to the specification and estimation of single equations with forward-looking variables. For example, Rudd and Whelan (2005, hereafter RW) showed that the GG parameter estimates for the coefficient on future inflation may be biased upward if the equation is misspecified due to the omission of relevant regressors that are instead used as instruments. With regard to the estimation of the coefficients of future variables, they pointed out that this problem can yield differences between estimates that are based on the following two alternative estimation methods. The first (direct) method estimates the coefficient directly using GMM; the second (indirect) method computes a partial solution to the complete model that removes the expected future variable from the right-hand side and substitutes an infinite distributed lag of all future expected forcing variables. RW use their analysis to argue in favour of Phillips curve specifications that favour backward lags of inflation over the New-Keynesian specification that includes only expected future inflation as a regressor. Galí et al. (2005, hereafter GGLS) have responded to the RW critique by pointing out that, in spite of the theoretical possibility of omitted variable bias, estimates obtained by direct and indirect methods are fairly close, and when additional lags of inflation are added as regressors in the structural model to proxy for omitted variables, they are not significant. While the Rudd–Whelan argument is convincing, the GGLS response is less so since other (contemporaneous) variables might also be incorrectly omitted from the simple GG inflation equation. Even if additional lags of inflation were found to be insignificant, their inclusion could change the parameters of both the closed form solution and the structural model. We argue, in this paper, that these issues can only be resolved by embedding the single equation New-Keynesian Phillips curve in a fully specified structural model. Other authors, e.g.
Fuhrer and Rudebusch (2004), Lindé (2005) and Jondeau and Le Bihan (2003) have pointed out that the GMM estimation approach followed by GG could be less robust than maximum likelihood estimation (MLE) in the presence of a range of model misspecifications such as omitted variables and measurement error, typically leading to overestimation of the parameter of future expected inflation. GGLS correctly replied that no general theoretical results are available on the relative merits of GMM and MLE under misspecification, that the comparison could be biased by the use of an inappropriate GMM estimator, and that other authors such as Ireland (2001) provided evidence in favour of a (pure) forward-looking equation for US inflation when using MLE. In this paper, we hope to shed additional light on the efficiency and possible bias of GMM estimation by comparing alternative estimation methods on the same data set and the same model specification. A different and potentially more problematic critique of the GG approach comes from Mavroeidis (2005), Bårdsen et al. (2004) and Nason and Smith (2005), building upon previous
work on rational expectations by Pesaran (1987). Pesaran (1987) stressed that the conditions for identification of the parameters of the forward-looking variables in an equation of interest should be carefully checked prior to single equation estimation. To check identification conditions, one must specify a model for all of the right-hand side variables. Even if there are enough instruments available such that conventional order and rank conditions are fulfilled and parameters are not underidentified, there might nevertheless be a problem of weak identification. The articles cited above have shown that in the presence of weak identification, estimation by GMM yields unreliable results. Weak identification is related to the quality of the instruments when applying GMM. When instruments are only weakly correlated with the corresponding endogenous variables they might not be particularly useful for forecasting, e.g. future expected inflation. The resulting GMM estimators might then suffer from weak identification, which leads to non-standard distributions for the estimators. As a consequence, this can yield misleading inference, see e.g. Stock et al. (2002) for a general overview on weak instruments and weak identification. In summary, the recent literature on the New-Keynesian Phillips curve has highlighted four main problems with the single equation approach to estimation by GMM. First, parameter estimates may be biased due to correlation of the instruments with the error term. Second, an equation of interest could be misspecified because of omitted variables or parameter instability within the sample. Third, parameters of interest may not be identified because there are not enough instruments available. Fourth, parameters may be weakly identified if the correlation of the instruments with the target variable is low. In this paper, we analyze the practical relevance of these problems, propose remedies for each of them, and evaluate whether the findings on the importance of the forward-looking component are robust when obtained within a more general econometric context. In Section 2, we compare single equation and system methods of estimation for models with forward-looking regressors. In Section 3, we conduct a robustness analysis for a full forward-looking system. In Section 4, we analyze the role of information extracted from large data sets to reduce the risk of specification bias and weak instruments problems. In Section 5, we summarize the main results of the paper and conclude.
2. SINGLE EQUATION VERSUS SYSTEM APPROACH

We begin this section with a discussion of the estimation of the New-Keynesian Phillips curve. This will be followed by a discussion of single-equation estimation of the Euler equation and the policy rule. We then contrast the single equation approach to a closed, three-equation, New-Keynesian model. We estimate simultaneously a complete structural model, which combines the three previously estimated single-equation models for the Phillips curve, the Euler equation and the policy rule, and we compare system estimates of parameters with those of the three single-equation specifications. Our starting point is a version of the New-Keynesian Phillips curve inspired by the work of GG,
$$\pi_t = \alpha_0 + \alpha_1 \pi_{t+1}^e + \alpha_2 x_t + \alpha_3 \pi_{t-1} + e_t, \qquad (2.1)$$
where π t is the GDP deflator, π et+1 is the forecast of π t+1 made in period t, x t is a real forcing variable (e.g. marginal costs as suggested by GG, unemployment—with reference to Okun’s law—as in, e.g. Beyer and Farmer (2007a), or any version of an output gap variable). The error C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
274
A. Beyer et al.
term e_t is assumed to be i.i.d. (0, σ²_e) and is, in general, correlated with the non-predetermined variables (i.e. with π^e_{t+1} and x_t). Since we want to arrive at the specification of a system of forward-looking equations, we prefer to use as a real forcing variable the output gap, measured as the deviation of real GDP from its one-sided HP filtered version as widely used in the literature.1,2 To estimate equation (2.1) we replace π^e_{t+1} with π_{t+1}, such that (2.1) becomes
$$\pi_t = \alpha_0 + \alpha_1 \pi_{t+1} + \alpha_2 x_t + \alpha_3 \pi_{t-1} + v_t. \qquad (2.2)$$
Equation (2.2) can be estimated by GMM, with HAC standard errors to take into account the MA(1) structure of the error term v_t = e_t + α_1(π^e_{t+1} − π_{t+1}).3 All data are for the United States, quarterly, for the period 1970:1–1998:4, where the constraint on the end date is due to the large data set we use in Section 4.4 In the first panel of the first column of Table 1, we report the single-equation estimation results. As in GG (1999), we find a larger coefficient on π^e_{t+1}, about 0.78, than on π_{t−1}, about 0.23. The coefficient on the forcing variable is very small and not statistically significant at the 5% level, again in line with previous results. There are at least two related problems with this single equation approach: first, the appropriateness or availability of the instruments cannot be judged in isolation without reference to a more complete model, and therefore, second, the degree of over-, just-, or under-identification is undefined. The issue of identification and the use of appropriate instruments in rational expectations models is a very subtle one, see e.g. Pesaran (1987), Mavroeidis (2005), Bårdsen et al. (2004) or Beyer and Farmer (2003a). In linear backward looking models, such as conventional simultaneous equation models, rank and order conditions can be applied in a mechanical way (see e.g. Fisher 1966). In rational expectations models, however, the conditions for identification depend on the solution of the model, i.e. whether the solution of the model is determinate or indeterminate, see Beyer and Farmer (2007b). In our case, as is common in this literature, we have used (three) lags of π_t, x_t and the interest rate i_t as instruments, where i_t is the 3-month US Federal funds interest rate. However, since i_t does not appear in (2.1), both π^e_{t+1} and x_t may not at all or may only weakly depend on lags of i_t, which would make i_t an irrelevant or a weak instrument. To evaluate whether or not lagged interest rates are suitable instruments, we estimated the following sub-VAR model:
$$\begin{aligned} x_t &= b_0 + b_1 \pi_{t-1} + b_2 x_{t-1} + b_3 i_{t-1} + u_{xt}, \\ i_t &= c_0 + c_1 \pi_{t-1} + c_2 x_{t-1} + c_3 i_{t-1} + u_{ct}, \end{aligned} \qquad (2.3)$$
1 The forward-looking IS curve is usually specified in terms of the output or unemployment gap.
2 Notice that although common practice in the applied literature, the use of the HP filtered version of the output gap is by no means unproblematic. For example, Nelson (2006) finds that the HP cyclical component of U.S. real GDP has no predictive power for future changes in output growth and Fukac and Pagan (2006) provide an example in which HP filtering produces biased coefficient estimates within a New-Keynesian model.
3 In particular, to compute the GMM estimates we start with an identity weighting matrix, get a first set of coefficients, use these to update the weighting matrix and finally iterate coefficients to convergence. To compute the HAC standard errors, we adopt the Newey and West (1994) approach with a Bartlett kernel and fixed bandwidth. These calculations are carried out with Eviews 5.0.
4 We have estimated the models using the output gap and unemployment as real forcing variables. To save space, we present here only the output gap results.
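For concreteness, the iterated GMM scheme with Newey–West weighting described in footnote 3 can be sketched as follows. This is an illustrative Python translation under generic inputs (the paper's own calculations were done in Eviews 5.0): `y` is the dependent variable, `W` the regressors and `Z` the instruments.

```python
# Sketch of iterated linear GMM with a Bartlett-kernel (Newey-West)
# weighting matrix, as described in footnote 3. Illustrative only.
import numpy as np

def bartlett_lrv(u, bandwidth):
    """Long-run variance of the moment series u (T x q), Bartlett kernel."""
    T = u.shape[0]
    S = u.T @ u / T
    for j in range(1, bandwidth + 1):
        w = 1.0 - j / (bandwidth + 1.0)
        G = u[j:].T @ u[:-j] / T
        S += w * (G + G.T)
    return S

def iterated_gmm(y, W, Z, bandwidth=4, max_iter=50, tol=1e-10):
    Omega_inv = np.eye(Z.shape[1])            # start from identity weighting
    theta = None
    for _ in range(max_iter):
        A = W.T @ Z @ Omega_inv @ Z.T @ W
        b = W.T @ Z @ Omega_inv @ Z.T @ y
        theta_new = np.linalg.solve(A, b)
        if theta is not None and np.max(np.abs(theta_new - theta)) < tol:
            return theta_new                  # coefficients have converged
        theta = theta_new
        u = Z * (y - W @ theta)[:, None]      # moment series z_t * v_t
        Omega_inv = np.linalg.inv(bartlett_lrv(u, bandwidth))
    return theta
```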
C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
Table 1. Single equation versus system estimation, GDP gap.

Equation π_t                  Single [1]          Sub-VAR [2]         System [3]
  π^e_{t+1} (α1)              0.778*** (0.139)    0.726*** (0.090)    0.672*** (0.084)
  gap_t (α2)                  −0.067* (0.034)     −0.051** (0.025)    −0.038* (0.023)
  π_{t−1} (α3)                0.231* (0.128)      0.281*** (0.082)    0.334*** (0.072)
  Adj. R²                     0.860               0.899               0.870
  J-stat                      4.809 (6)           17.993 (18)         13.780 (18)
  p-value                     0.56                0.45                0.74

Equation x_t = gap_t
  gap^e_{t+1} (β1)            0.544*** (0.046)    0.538*** (0.033)    0.540*** (0.034)
  ri_t (β2)                   0.017 (0.013)       0.016 (0.011)       0.015 (0.011)
  gap_{t−1} (β3)              0.487*** (0.040)    0.486*** (0.030)    0.484*** (0.029)
  Adj. R²                     0.954               0.954               0.954
  J-stat                      5.778 (6)           18.453 (18)         −
  p-value                     0.44                0.42                −

Equation i_t
  π^e_{t+1} (γ1)              1.103** (0.458)     1.072*** (0.362)    1.186*** (0.308)
  gap_t (γ2)                  1.675* (0.922)      1.659** (0.691)     1.476** (0.704)
  i_{t−1} (γ3)                0.921*** (0.027)    0.922*** (0.020)    0.920*** (0.022)
  Adj. R²                     0.885               0.885               0.885
  J-stat                      10.702* (6)         15.574 (18)         −
  p-value                     0.098               0.62                −

Note: The instrument set includes the constant and three lags of gap, π, i. Sample is 1970:1–1998:4. The columns report results for single equation estimation (Single), system estimation where the completing equations are sub-VARs (Sub-VAR), and the full forward-looking system (System). HAC s.e. (no pre-whitening, Bartlett kernel, fixed bandwidth Newey–West) in parentheses. *, ** and *** indicate significance at 10, 5 and 1%. J-stat is χ²(p) under the null hypothesis of p valid overidentifying restrictions.
where u_{xt} and u_{ct} are i.i.d. error terms, which are potentially correlated with e_t. If b_3 = 0, i.e. i_t does not Granger cause x_t, then lags of i_t are not relevant instruments for the endogenous variables in (2.1). Whether lagged values of inflation and the real variable beyond order one (i.e. π_{t−2}, π_{t−3}, x_{t−2} and x_{t−3}) are relevant instruments for π^e_{t+1} is also questionable. If the solution for π_t only
depends on π_{t−1} and x_{t−1}, which is the case when the solution is determinate, then the additional lags are not relevant instruments. However, in case of indeterminacy additional lags of π_t and x_t matter, which may re-establish the relevance of π_{t−2}, π_{t−3}, x_{t−2} and x_{t−3} as instruments.5 As a consequence of the model dependence with respect to the number of available and relevant instruments, Hansen's J-statistic, a popular measure for the relevance of the instruments and overidentifying restrictions that we also present for conformity to the literature, can be potentially uninformative and even misleading when applied in a forward-looking context. Estimating (2.1) and (2.3) using only one lag of π, x and i as instruments, we find that the estimate of b_3 is nonzero but the null hypothesis b_3 = 0 cannot be rejected. In this case, since the instruments are only weakly correlated with their targets, the resulting GMM estimators can suffer from weak identification, see Mavroeidis (2005, 2006). This might lead to non-standard distributions for the estimators and can yield misleading inference, see e.g. Stock et al. (2002). Empirically, we find that the size of the standard errors for the estimators of the parameters α_1 and α_2 in (2.1) matches the estimated values for α_1 and α_2. However, when we estimate (2.1) and (2.3) using three lags of π, x and i as instruments, we find that the estimate of b_3 is nonzero and the null hypothesis b_3 = 0 is strongly rejected. The estimated parameters for (2.1) are reported in the first panel in column 2 of Table 1. Compared with the corresponding single equation estimates, we find that the point estimates of the parameters are basically unaffected (there is a non-significant decrease of about 5% in the coefficient of π^e_{t+1} and a corresponding increase in that of π_{t−1}). Yet, there is a substantial reduction in the standard errors of 30–40%. Similar results are obtained when (2.3) is substituted for a VAR(3) specification. These findings suggest that the model is identified, but the solution could be indeterminate. Intuitively, indeterminacy arises because the sum of the estimated parameters α_1 and α_3 in (2.1) is very close to one. So far the processes for the forcing variables were assumed to be purely backward looking. As an alternative, we consider a forward-looking model also for x_t. For example, Fuhrer and Rudebusch (2004) estimated a model for a representative agent's Euler equation (in their notation)
$$x_t = \beta_0 + \beta_1^* x_{t+1}^e + \beta_2^* \frac{1}{k} \sum_{j=0}^{k-1} \big(i_{t+j}^e - \pi_{t+j+1}^e\big) + \beta_3^* x_{t-1} + \beta_4^* x_{t-2} + \eta_t, \qquad (2.4)$$
where x_t is real output (detrended in a variety of ways), x^e_{t+1} is the forecast of x_{t+1} made in period t, i_t − π^e_{t+1} is a proxy for the real interest rate at time t, and η_t is an i.i.d. (0, σ²_η) error term. In our sample period, the second lag of x is not significant and only the current interest rate matters. Hence, the model becomes
$$x_t = \beta_0 + \beta_1 x_{t+1}^e + \beta_2 \big(i_t - \pi_{t+1}^e\big) + \beta_3 x_{t-1} + \eta_t. \qquad (2.5)$$
Replacing the forecast with its realized value, we get
$$x_t = \beta_0 + \beta_1 x_{t+1} + \beta_2 (i_t - \pi_{t+1}) + \beta_3 x_{t-1} + \mu_t, \qquad (2.6)$$
where μ_t = β_1(x^e_{t+1} − x_{t+1}) + β_2(π^e_{t+1} − π_{t+1}).
5 Beyer and Farmer (2003a) conduct a systematic search of the parameter space in a model closely related to the one studied in this paper. They sample from the asymptotic parameter distribution of the GMM estimates and find, for typical identification schemes, that point estimates lie in the indeterminate region, but anywhere from 5 to 20% of the parameter region may lie in the non-existence or determinate region.
As in the case of the New-Keynesian Phillips curve, this equation can be estimated by GMM, appropriately corrected for the presence of an MA component in the error term μ_t. As in our estimates of the New-Keynesian Phillips curve, we use three lags of x, i and π as instruments. The results are reported in the first column of the second panel of Table 1. The coefficient on x^e_{t+1} is slightly larger than 0.5 and significant, and the coefficient on x_{t−1} is also close to 0.5 and significant. These values are in line with those in Fuhrer and Rudebusch (2004), who found lower values when using ML estimation rather than GMM, and the positive sign of the real interest rate in the equation for the output gap is similar to the Fuhrer–Rudebusch results when they used HP detrending. As with the New-Keynesian Phillips curve, we estimate equation (2.6) simultaneously together with a sub-VAR(1) as in (2.3), but here for the forcing variables π_t and i_t. Again, the significance of the coefficients in the VAR(1) equations (in particular those for lagged π_t in the i_t equation) lends support to their relevance as instruments. The numerical values of the estimated parameters for the Euler equation remain nearly unchanged. However, as in the case of the Phillips curve above, the precision of the estimators increases substantially. These results are reported in the second column of the second panel in Table 1. In order to complete our building blocks for a forward-looking system we finally also model the interest rate with a Taylor rule as in Clarida et al. (1998, 2000). Our starting point here is the equation
$$i_t^* = \bar{i} + \gamma_1 \big(\pi_{t+1}^e - \pi_t^*\big) + \gamma_2 \big(x_t - x_t^*\big), \qquad (2.7)$$
where i*_t is the target nominal interest rate, ī is the equilibrium rate, x_t is real output, and π*_t and x*_t are the desired levels of inflation and output. The parameter γ_1 indicates whether the target real rate adjusts to stabilize inflation (γ_1 > 1) or to accommodate it (γ_1 < 1), while γ_2 measures the concern of the central bank for output stabilization. Following the literature, we introduce a partial adjustment mechanism of the actual rate to the target rate i*_t:
$$i_t = (1 - \gamma_3)\, i_t^* + \gamma_3 i_{t-1} + v_t, \qquad (2.8)$$
where the smoothing parameter γ_3 satisfies 0 ≤ γ_3 ≤ 1, and v_t is an i.i.d. (0, σ²_v) error term. Combining (2.7) and (2.8), we obtain
$$i_t = \gamma_0 + (1 - \gamma_3)\gamma_1 \big(\pi_{t+1}^e - \pi_t^*\big) + (1 - \gamma_3)\gamma_2 \big(x_t - x_t^*\big) + \gamma_3 i_{t-1} + v_t, \qquad (2.9)$$
where γ_0 = (1 − γ_3)ī, which becomes
$$i_t = \gamma_0 + (1 - \gamma_3)\gamma_1 \big(\pi_{t+1} - \pi_t^*\big) + (1 - \gamma_3)\gamma_2 \big(x_t - x_t^*\big) + \gamma_3 i_{t-1} + \epsilon_t, \qquad (2.10)$$
with ε_t = (1 − γ_3)γ_1(π^e_{t+1} − π_{t+1}) + v_t, after replacing the forecasts with their realized values.
We are now in a position to estimate the full forward-looking system, composed of equations (2.1), (2.5) and (2.9):
$$\begin{aligned} \pi_t &= \alpha_0 + \alpha_1 \pi_{t+1}^e + \alpha_2 x_t + \alpha_3 \pi_{t-1} + e_t, \\ x_t &= \beta_0 + \beta_1 x_{t+1}^e + \beta_2 \big(i_t - \pi_{t+1}^e\big) + \beta_3 x_{t-1} + \eta_t, \\ i_t &= \gamma_0 + (1 - \gamma_3)\gamma_1 \big(\pi_{t+1}^e - \pi_t^*\big) + (1 - \gamma_3)\gamma_2 \big(x_t - x_t^*\big) + \gamma_3 i_{t-1} + v_t. \end{aligned} \qquad (2.11)$$
The results are reported in column 3 of Table 1. For each of the three equations the estimated parameters are very similar to those obtained either in the single equation case or in the systems completed with VAR equations. Furthermore, the reductions in the standard errors of the estimated parameters are similar to those obtained with sub-VAR(1) specifications. Since the VAR equations can be interpreted as reduced forms of the forward-looking equations, this result suggests that completing a single equation of interest with a reduced form may be enough to achieve as much efficiency as within a full system estimation. However, the full forward-looking system represents a more coherent choice from an econometric point of view, and the finding that the forward-looking variables have large and significant coefficients in all the three equations lends credibility to the complete rational expectations model. The non-linearity of our system of forward-looking equations makes the evaluation of global identification impossible. However, if we linearize the model around the estimated parameters and focus on local identification, we can show that the model is (at least) exactly identified, see Beyer et al. (2005). Exact identification holds when the point estimates imply a determinate solution. The model would be potentially overidentified in case of an indeterminate equilibrium.
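To fix ideas, the stacked sample moment conditions underlying the system GMM estimation of (2.11) can be written out as below. This is a hedged sketch: variable names, the data layout, and the absorption of the targets π*_t and x*_t into the intercept γ_0 are our illustrative assumptions, not the authors' code.

```python
# Stacked moment conditions for the three-equation system (2.11), with
# expectations replaced by realized leads; Z is the common instrument
# matrix (constant plus three lags of pi, x, i). Illustrative sketch.
import numpy as np

def system_moments(theta, pi, x, i, Z):
    a0, a1, a2, a3, b0, b1, b2, b3, g0, g1, g2, g3 = theta
    T = len(pi)
    t, lead, lag = slice(1, T - 1), slice(2, T), slice(0, T - 2)
    e = pi[t] - (a0 + a1 * pi[lead] + a2 * x[t] + a3 * pi[lag])
    eta = x[t] - (b0 + b1 * x[lead] + b2 * (i[t] - pi[lead]) + b3 * x[lag])
    v = i[t] - (g0 + (1 - g3) * (g1 * pi[lead] + g2 * x[t]) + g3 * i[lag])
    U = np.column_stack([e, eta, v])
    # interact each equation's residual with every instrument
    return np.concatenate([(Z[t] * U[:, [j]]).mean(axis=0) for j in range(3)])
```

These stacked moments are then minimized in the usual GMM quadratic form with a HAC weighting matrix, exactly as in the single-equation case.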
3. ROBUSTNESS ANALYSIS

While system estimation increases efficiency, the full forward-looking model in (2.11) could still suffer from misspecification problems, see e.g. Canova and Sala (2006). To evaluate this possibility, we conducted four types of diagnostic tests. First, we ran an LM test on the residuals of each equation to check for additional serial correlation, i.e. serial correlation beyond the one that is due to the MA(1) error structure of the model. Second, we ran the Jarque and Bera normality test on the estimated errors. Although our GMM estimation approach is robust to the presence of non-normal errors, rejection of normality could signal other problems, such as the presence of outliers or parameter instability.6 Third, we ran an LM test to check for the presence of ARCH effects; rejection of the null of no ARCH effects might more generally be a signal of changes in the variance of the errors. Finally, we checked for parameter constancy by running recursive estimates of the forward-looking system. The results of our misspecification tests are reported in the bottom lines of each panel in Table 2. For convenience, we also present the estimated parameters again in column 1. There are only minor problems of residual correlation in the inflation equation, but normality is strongly rejected for the inflation and interest rate equations, and the interest rate equation also fails the tests for absence of serial correlation and absence of ARCH. The rejection of correct specification could be due to parameter instability in the full sample 1970:3–1998:4. Instability might be caused by a variety of sources including external events such
6 Note that this is not the case for MLE.
Table 2. Alternative forward-looking systems, GDP gap.

                              1970–1998           1985–1998           1985–1998
Equation π_t                  No factors [1]      No factors [2]      Significant factors as regressors [3]
  π^e_{t+1} (α1)              0.672*** (0.083)    0.605*** (0.093)    0.435*** (0.081)
  gap_t (α2)                  −0.038* (0.023)     −0.012 (0.018)      0.067 (0.049)
  π_{t−1} (α3)                0.334*** (0.072)    0.319*** (0.067)    0.343*** (0.042)
  Adj. R²                     0.870               0.481               0.531
  No corr (4)                 1.599               2.007               2.302*
  Norm                        6.907**             1.877               0.065
  No ARCH (4)                 2.990               0.565               1.505

Equation x_t = gap_t
  gap^e_{t+1} (β1)            0.540*** (0.034)    0.466*** (0.044)    0.623*** (0.033)
  ri_t (β2)                   0.015 (0.011)       −0.021 (0.015)      0.057*** (0.010)
  gap_{t−1} (β3)              0.484*** (0.029)    0.558*** (0.047)    0.491*** (0.040)
  Adj. R²                     0.954               0.966               0.953
  No corr (4)                 0.177               0.884               0.453
  Norm                        2.721               2.198               0.440
  No ARCH (4)                 1.510               0.896               1.270

Equation i_t
  π^e_{t+1} (γ1)              1.186*** (0.308)    1.123** (0.458)     1.098*** (0.178)
  gap_t (γ2)                  1.476** (0.704)     0.771*** (0.160)    0.846*** (0.012)
  i_{t−1} (γ3)                0.920*** (0.022)    0.867*** (0.024)    0.841*** (0.090)
  Adj. R²                     0.885               0.945               0.947
  No corr (4)                 2.172*              0.237               0.227
  Norm                        525.0***            1.833               3.257
  No ARCH (4)                 2.743**             0.765               0.583
  J-stat                      13.780 (18)         12.824 (18)         12.942 (18)
  p-value                     0.74                0.80                0.99

Note: The instrument set includes the constant and three lags of gap, π, i (no factors) plus the first lag of the six estimated factors (other case). The regressors are either as in Table 1 (no factors) or include some contemporaneous factors (see text for details). HAC s.e. (as in Table 1) in parentheses. *, ** and *** indicate significance at 10, 5 and 1%. The misspecification tests (No corr, Norm, No ARCH) are conducted on the residuals of an MA(1) model for the estimated errors. No corr is an LM(4) test for no serial correlation, Norm is the Jarque–Bera statistic for normality, and No ARCH is an LM(4) test for no ARCH effects. J-stat is χ²(p) under the null hypothesis of p valid overidentifying restrictions.
[Figure 1. Backward recursive estimation, 1988:1–1970:1, system with GDP gap. Nine panels plot the backward recursive estimates of the coefficients a1, a2, a3 (inflation equation), b1, b2, b3 (output equation) and c1, c2, c3 (interest rate equation), each with ±2 standard-error bands over 1970–1995.]
as the oil shocks, internal events, such as the reduction in the volatility of output (e.g. McConnell and Perez-Quiros, 2000), or changes in the monetary policy targets. Since we had more faith in the second part of our sample, we implemented a backward recursion by estimating the system first for the subsample 1988:1–1998:4, and recursively reestimating the system by adding one quarter of data to the beginning of the sample, i.e. our second subsample consisted of the quarters 1987:4–1998:4, our third was 1987:3–1998:4 and so on until 1970:3–1998:4. In Figure 1, we report recursive parameter estimates. These graphs confirm that the likely source of the rejection of ARCH, normality and serial correlation tests is the presence of parameter change. Although the parameter estimates are stable back to 1985:1, going further back than this is associated with substantial parameter instability in all three equations, and particularly in the estimated Taylor rule. Although parameter instability is more pronounced when we use unemployment as a measure of economic activity, it is also present in estimates obtained when using the output gap. Overall, these misspecification tests cast serious doubts on results obtained for the full sample, and they suggest that a prudent approach would be to restrict our analysis to a more homogeneous sample. For this reason, in the subsequent analysis, we report results only for the subperiod 1985:1–1998:4.
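The residual diagnostics reported in Table 2 are standard and can be reproduced with library routines. A hedged sketch using statsmodels is given below; note that in the paper the tests are run on the residuals of an MA(1) model fitted to the estimated errors, a preliminary step omitted here.

```python
# Residual diagnostics as in Table 2: LM test for no serial correlation,
# Jarque-Bera normality test, and LM test for no ARCH effects.
from statsmodels.stats.diagnostic import acorr_lm, het_arch
from statsmodels.stats.stattools import jarque_bera

def misspecification_tests(resid, nlags=4):
    lm, lm_p, _, _ = acorr_lm(resid, nlags=nlags)
    jb, jb_p, _, _ = jarque_bera(resid)
    arch, arch_p, _, _ = het_arch(resid, nlags=nlags)
    return {"No corr (4)": (lm, lm_p),
            "Norm": (jb, jb_p),
            "No ARCH (4)": (arch, arch_p)}
```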
Our subsample results are presented in the second column of Table 2. It is interesting to note that the values of the estimated parameters of the New-Keynesian Phillips curve and the Euler equation are similar to those obtained for the full sample. However, parameter estimates of the coefficients of the Taylor rule differ substantially from the single equation estimates. Most prominently, there is a marked decrease in the estimated coefficient on the output gap. In the post 1985 subsample, we fail to reject the null hypothesis for all four of our diagnostic tests, thereby lending additional credibility to our estimation results. The final issue we briefly consider is the role of the method of estimation. Fuhrer and Rudebusch (2004), Lindé (2005) and Jondeau and Le Bihan (2003) have suggested that GMM may lead to an upward bias in the parameters associated with the forward-looking variables, while maximum likelihood (ML) produces better results. For example, Lindé (2005) finds that ML estimates of the parameters were not heavily biased away from the true parameters. In the case of exact identification, ML coincides with indirect least squares, where the reduced form parameters of the model are mapped back into those of the structural form. We compared our GMM point estimates with the indirect least squares estimates computed from the reduced form. Using this approach, we find that our GMM estimates are similar to the ML values. For the subsample 1985–1998, the estimated coefficient on π^e_{t+1} in the inflation equation is 0.76, and that on the future expected output gap in the Euler equation is 0.62, whereas the GMM estimates of these parameters are, respectively, 0.61 and 0.47. The differences are slightly larger for the coefficient on future inflation in the Taylor rule, in the range 2.1–2.4 with ML. Overall, we are reassured that our finding of significant coefficients on future expected variables is robust to alternative system estimation methods.
4. ENLARGING THE INFORMATION SET

The analysis in Sections 2 and 3 supports the use of a system approach to the estimation of forward-looking equations. For the 1985:1–1998:4 sample, our estimated system passes a wide range of misspecification tests. Moreover, Hansen's J-statistic, reported at the foot of Table 2, is unable to reject the null of relevant instruments for this period (but it is worth recalling the caveats on the use of the J-test in this context). However, there could still be problems of weak instruments and/or omitted variables which are hardly detectable using standard tests (see, e.g. Mavroeidis, 2005). This section proposes a method that can potentially address both of these issues. Our approach is to augment our data by adding information extracted from a large set of 146 macroeconomic variables as described in Stock and Watson (2002a, b, hereafter SW). We assume that these variables are driven by a few common forces, i.e. the factors, plus a set of idiosyncratic shocks. This assumption implies that the factors provide an exhaustive summary of the information in the large data set, so that they may alleviate omitted variable problems when used as additional regressors in our small system. Moreover, the factors extracted from the Stock and Watson data are known to have good forecasting performance for the macroeconomic variables in our small data set and they are therefore likely to be useful as additional instruments that may alleviate weak instrument problems, too. Bernanke and Boivin (2003) and Favero et al. (2005) showed that when estimated factors are included in the instrument set for GMM estimation of Taylor rules, the precision of the parameter estimators increases substantially. The economic rationale for inclusion of these variables is that central bankers rely on a large set of indicators
in the conduct of monetary policy; our extracted factors may provide a proxy for this additional information. An additional reason for being interested in the inclusion of factors in our analysis is that the inclusion of factors in small scale VARs has been shown to remove the 'price puzzle', suggesting that factors may be used to reduce or eliminate the estimation bias that arises from the omission of relevant right-hand side variables.7 In the following subsection, we present a brief overview on the specification and estimation of factor models for large data sets. Following this discussion, we evaluate whether the use of the estimated factors changes the size and/or the significance of the coefficients of the forward-looking components in the New-Keynesian model.
4.1. The factor model

Equation (4.1) represents a general formulation of the dynamic factor model
$$z_t = \Lambda f_t + \xi_t, \qquad (4.1)$$
where z_t is an N × 1 vector of variables and f_t is an r × 1 vector of common factors. We assume that r is much smaller than N, and we represent the effects of f_t on z_t by the N × r loading matrix Λ; ξ_t is an N × 1 vector of idiosyncratic shocks. Stock and Watson require the factors, f_t, to be orthogonal although they may be correlated in time and with the idiosyncratic components for each factor.8 Notice that the factors are not identified since equation (4.1) can be rewritten as z_t = ΛG G^{−1} f_t + ξ_t = Λ* p_t + ξ_t, where p_t = G^{−1} f_t is an alternative set of factors, Λ* = ΛG, and G is an arbitrary invertible r × r matrix. This fact makes it difficult to form a structural interpretation of the factors, but it does not prevent their use as a summary of the information contained in z_t. SW define the estimators of the factors as minimizing the objective function
$$V_{N,T}(f, \Lambda) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \big(z_{it} - \lambda_i' f_t\big)^2.$$
Under the hypothesis of r common factors, they show that the optimal estimators of the factors are the r eigenvectors corresponding to the r largest eigenvalues of the T × T matrix $N^{-1} \sum_{i=1}^{N} z_i z_i'$, where $z_i = (z_{i1}, \ldots, z_{iT})'$. Moreover, the r eigenvectors corresponding to the r largest eigenvalues of the N × N matrix $T^{-1} \sum_{t=1}^{T} z_t z_t'$ are the optimal estimators of Λ. These eigenvectors coincide with the principal components of z_t; they are also the OLS estimators of the coefficients in a regression of z_it on the k estimated factors, i = 1, . . . , N.9 Although there are alternative estimation methods available such as the one by Forni and Reichlin (1998) or Forni et al. (2000),
7 For a definition and discussion of this issue the reader is referred to Christiano et al. (1999, pp. 97–100).
8 Precise moment conditions on f_t and ξ_t, and requirements on the loading matrix Λ, are given in SW.
9 SW prove that when r is correctly specified, the estimated factors converge in probability to f_t, up to an arbitrary r × r transformation matrix G. When k factors are assumed, with k > r, k − r estimated factors are redundant linear combinations of the elements of f_t, while even when k < r consistency for the first k factors is preserved (because of the orthogonality hypothesis). See Bai (2003) for additional inferential results.
we chose the SW approach since there is some evidence to suggest that it dominates the alternatives in this context.10 No statistical test is currently available to determine the optimal number of factors. SW and Bai and Ng (2002) suggested minimizing a particular information criterion; however, its small sample properties in the presence of heteroskedastic idiosyncratic errors deserve additional investigation. In their empirical analysis with this data set, SW found that the first 2–3 factors are the most relevant for forecasting key US macroeconomic variables. In the following analysis we however evaluate the role of up to six factors to make sure sufficient information is captured. Finally, Bai and Ng (2006a) have shown that the estimated factors, when used in subsequent econometric analyses, do not create any generated regressor problem when $\sqrt{T}/N$ is $o_p(1)$. This condition requires the number of variables to grow faster than the sample size, which basically guarantees faster convergence of the factor estimators than of the estimators of the other parameters of interest. We assume that this condition is satisfied in our context, where $\sqrt{T}/N = 0.055$.
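A minimal sketch of the principal-components estimator just described, assuming (as is customary) that each series has been standardized first; this is an illustration, not the authors' code.

```python
# Principal-components estimation of r factors from a T x N panel z,
# via the leading eigenvectors of the T x T matrix (1/N) sum_i z_i z_i'.
import numpy as np

def estimate_factors(z, r):
    z = (z - z.mean(axis=0)) / z.std(axis=0)   # standardize each series
    T, N = z.shape
    M = z @ z.T / N                            # T x T second-moment matrix
    eigval, eigvec = np.linalg.eigh(M)         # eigenvalues in ascending order
    F = np.sqrt(T) * eigvec[:, ::-1][:, :r]    # r leading eigenvectors
    Lam = z.T @ F / T                          # OLS loadings given the factors
    return F, Lam
```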
4.2. The role of the estimated factors

As we mentioned, the estimated factors can proxy for omitted variables in the specification of the forward-looking equations. In particular, we use up to six contemporaneous factors as additional regressors in each of the three structural equations, and retain those which are statistically significant. Since the factors are potentially endogenous, we use their first lag as additional instruments. These lags are likely to be useful also for the other endogenous variables in each structural equation. Bai and Ng (2006b) and Kapetanios and Marcellino (2006b) provide a detailed derivation of the properties of factor-based GMM estimators.
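In practice this amounts to appending one lag of the estimated factors to the instrument block before re-estimating by GMM; a small sketch, with alignment conventions that are illustrative assumptions:

```python
# Augment the instrument set with the first lag of the estimated factors
# (F from the principal-components sketch above). Illustrative alignment:
# row t of the output instruments observation t with Z_base[t] and F[t-1].
import numpy as np

def augment_instruments(Z_base, F):
    return np.column_stack([Z_base[1:], F[:-1]])
```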
10 Kapetanios and Marcellino (2006a) found that SW’s estimator performs better in simulation experiments, and Favero et al. (2005) reached the same conclusion when using the estimated factors for the estimation of Taylor rules and VARs.
Finally, since there is no consensus on the best way to compute robust standard errors in this context, we verified the robustness of our findings based on Newey and West (1994) comparing them with those based on Andrews (1991). The latter are in general somewhat lower, but the advantages resulting from the use of factors are still systematically present.
5. CONCLUSIONS

In this paper, we provided a general econometric framework for the analysis of models with rational expectations, focusing in particular on the hybrid version of the New-Keynesian Phillips curve that has attracted considerable attention in the recent period. First, we showed that system estimation methods where the New-Keynesian Phillips curve is complemented with equations for the interest rate and either unemployment or the output gap yield more efficient parameter estimates than traditional single equation estimation, while there are only minor changes in the point estimates and the expected future variables play an important role in all the three equations. The latter result remains valid even if MLE is used rather than system GMM. Second, we stressed that it is important to evaluate the correct specification of the model, and we showed that our systems provide a proper statistical framework for the variables over the 1985–1998 period, while during the 1970s there is evidence of parameter changes, in particular in the interest rate equation. Third, we analyzed the role of factors that summarize the information contained in a large data set of U.S. macroeconomic variables. Some factors were found to be significant as additional regressors in the New-Keynesian Phillips curve and in the Euler equation, alleviating potential omitted variable problems. Moreover, using lags of the factors as additional instruments in our small New-Keynesian system, the standard errors of the GMM estimates systematically decrease for all the estimated parameters; the gains are particularly large for the coefficients of forward-looking variables. In conclusion, using the factors, data after 1985 are not inconsistent with the New-Keynesian interpretation of a determinate equilibrium driven by three fundamental shocks. The estimated parameters of the complete model form a consistent picture which coincides with New-Keynesian economic theory. We should note that while our results support the relevance of forward-looking variables in our estimated equations, there is a large variety of alternative models compatible with the observed data which can have very different properties both in terms of the relevance of the forward-looking variables and of the characteristics of their dynamic evolution. This has been demonstrated in Beyer and Farmer (2003b). A more detailed analysis of this issue represents an interesting topic for further research in this field.
ACKNOWLEDGMENTS

We wish to thank two Referees and seminar participants at the ECB and the CEF 2006 conference in Cyprus for useful comments on a previous version. The views expressed in this paper are those of the authors and do not necessarily represent those of the ECB. Farmer acknowledges the support of NSF grant SES-0418174.
REFERENCES

Andrews, D. W. K. (1991). Heteroscedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817–58.
Bai, J. (2003). Inferential theory for factor models of large dimension. Econometrica 71, 135–73.
Bai, J. and S. Ng (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–223.
Bai, J. and S. Ng (2006a). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica 74, 1133–50.
Bai, J. and S. Ng (2006b). IV-estimation in a data-rich environment. Working paper, New York University.
Bårdsen, G., E. S. Jansen and R. Nymoen (2004). Econometric evaluation of the New Keynesian Phillips curve. Oxford Bulletin of Economics and Statistics, Supplement 2004, 66, 671–86.
Bernanke, B. and J. Boivin (2003). Monetary policy in a data-rich environment. Journal of Monetary Economics 50, 525–46.
Beyer, A. and R. E. A. Farmer (2003a). Identifying the monetary transmission mechanism using structural breaks. Working Paper Series #275, European Central Bank.
Beyer, A. and R. E. A. Farmer (2003b). On the indeterminacy of New-Keynesian economics. Working Paper Series #323, European Central Bank.
Beyer, A. and R. E. A. Farmer (2007a). Natural rate doubts. Journal of Economic Dynamics and Control 31, 797–825.
Beyer, A. and R. E. A. Farmer (2007b). What we don't know about the monetary transmission mechanism and why we don't know it. Macroeconomic Dynamics, 1–15.
Beyer, A., R. E. A. Farmer, J. Henry and M. Marcellino (2005). Factor analysis in a New-Keynesian model. Working Paper Series #510, European Central Bank.
Canova, F. and L. Sala (2006). Back to square one: Identification issues in DSGE models. Working Paper Series #583, European Central Bank.
Christiano, L. J., M. Eichenbaum and C. L. Evans (1999). Monetary policy shocks: What have we learned and to what end? In J. Taylor and M. Woodford (Eds.), Handbook of Macroeconomics, Volume 1A, 65–148. Amsterdam: North Holland.
Clarida, R., J. Galí and M. Gertler (1998). Monetary policy rules in practice: Some international evidence. European Economic Review 42, 1033–67.
Clarida, R., J. Galí and M. Gertler (2000). Monetary policy rules and macroeconomic stability: Evidence and some theory. Quarterly Journal of Economics 115, 147–80.
Favero, C., M. Marcellino and F. Neglia (2005). Principal components at work: The empirical analysis of monetary policy with large data sets. Journal of Applied Econometrics 20, 603–20.
Fisher, F. (1966). The Identification Problem in Econometrics. New York: McGraw-Hill.
Forni, M. and L. Reichlin (1998). Let's get real: A factor analytical approach to disaggregated business cycle dynamics. Review of Economic Studies 65, 453–73.
Forni, M., M. Hallin, M. Lippi and L. Reichlin (2000). The generalised factor model: Identification and estimation. The Review of Economics and Statistics 82, 540–54.
Fuhrer, J. C. and G. D. Rudebusch (2004). Estimating the Euler equation for output. Journal of Monetary Economics 51, 1133–53.
Fukac, M. and A. Pagan (2006). Limited information estimation and evaluation of DSGE models. Working paper 2006-6, National Centre for Econometric Research, Queensland University of Technology.
Galí, J. and M. Gertler (1999). Inflation dynamics: A structural econometric approach. Journal of Monetary Economics 44, 195–222.
Galí, J., M. Gertler and J. D. Lopez-Salido (2005). Robustness of the estimates of the hybrid New-Keynesian Phillips curve. Journal of Monetary Economics 52, 1107–18.
Ireland, P. N. (2001). Sticky price models of the business cycle: Specification and stability. Journal of Monetary Economics 47, 3–18.
Jondeau, É. and H. Le Bihan (2003). ML vs GMM estimates of hybrid macroeconomic models (with an application to the 'New Phillips curve'). Working Paper 103, Banque de France.
Kapetanios, G. and M. Marcellino (2006a). A comparison of estimation methods for dynamic factor models of large dimensions. Working Paper 5620, Centre for Economic Policy Research.
Kapetanios, G. and M. Marcellino (2006b). Factor-GMM estimation with a large set of possibly weak instruments. Working Paper, Queen Mary, University of London.
Lindé, J. (2005). Estimating New-Keynesian Phillips curves: A full information maximum likelihood approach. Journal of Monetary Economics 52, 1135–49.
Mavroeidis, S. (2005). Identification issues in forward-looking models estimated by GMM with an application to the Phillips curve. Journal of Money, Credit and Banking 37, 421–49.
Mavroeidis, S. (2006). Testing the New Keynesian Phillips curve without assuming identification. Brown Economics Working Paper No. 2006-13, Brown University. Available at http://ssrn.com/abstract=905261.
McConnell, M. M. and G. Perez-Quiros (2000). Output fluctuations in the United States: What has changed since the early 1980s? American Economic Review 90, 1464–76.
Nason, J. M. and G. W. Smith (2005). Identifying the New-Keynesian Phillips curve. Working Paper No. 2005-1, Federal Reserve Bank of Atlanta.
Nelson, C. (2006). The Beveridge-Nelson decomposition in retrospect and prospect. Working paper, University of Washington.
Newey, W. K. and K. D. West (1994). Automatic lag selection in covariance matrix estimation. Review of Economic Studies 61, 631–53.
Pesaran, M. H. (1987). The Limits to Rational Expectations. Oxford: Basil Blackwell.
Rudd, J. and K. Whelan (2005). New tests of the New-Keynesian Phillips curve. Journal of Monetary Economics 52, 1167–81.
Stock, J. H. and M. W. Watson (2002a). Macroeconomic forecasting using diffusion indexes. Journal of Business and Economic Statistics 20, 147–62.
Stock, J. H. and M. W. Watson (2002b). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97, 1167–79.
Stock, J. H., J. Wright and M. Yogo (2002). GMM, weak instruments, and weak identification. Journal of Business and Economic Statistics 20, 518–30.
Econometrics Journal (2008), volume 11, pp. 287–307. doi: 10.1111/j.1368-423X.2008.00237.x
Generic consistency of the break-point estimators under specification errors in a multiple-break model

JUSHAN BAI†, HAIQIANG CHEN‡, TERENCE TAI-LEUNG CHONG§ AND SERAPH XIN WANG§

†Department of Economics, New York University. E-mail: [email protected]
‡Department of Economics, Cornell University. E-mail: [email protected]
§Department of Economics, The Chinese University of Hong Kong. E-mails: [email protected]; seraph [email protected]

First version received: March 2006; final version accepted: December 2007
Summary This paper considers the estimation of multiple-structural-break models under specification errors. A common example in economics is that the true model is measured in levels, but a linear-log model is estimated. We show that, under specification errors, if there is more than one break point and a single-break model is estimated, the estimated break point is consistent for one of the true break points. This consistency result applies to models with multiple regressors where some or all of the regressors are misspecified. Another important contribution of this paper is that we construct a Sup-Wald test whose limiting distribution is not affected by model misspecification. Using this robust test, we show that the break points can be estimated sequentially one at a time. Simulation evidence and an empirical application are provided. Keywords: Consistency, Measurement error, Misspecification, Multiple changes.
1. INTRODUCTION

The end of the last century saw significant advances in the research on structural-break models.1 While earlier studies focus on the single-break model, the attention has shifted to the multiple-break model in recent years. Pioneering works on multiple-break models include Chong (1995), Bai (1997), Bai and Perron (1998) and Chong (2001). The aforementioned studies propose different methods to estimate the locations and the number of breaks and establish the consistency of the break-point estimators under correct model specification. In the presence of model misspecification, Chong (2003) shows that the break-point estimator is still consistent when
1 For the recent development of the literature, one is referred to Perron (2006).
there is only one break. It is not known, however, whether this consistency result can be extended to the case of multiple breaks. In this paper, we extend the work of Chong (2003) to study the consistency of the break-point estimator in a multiple-break model under specification errors. This problem is important because, at least philosophically, all models are incorrect. We investigate the robustness of the estimated break point when incorrect models are employed. The main finding is that the break points can still be consistently estimated despite incorrect model specifications. To present the key idea of the underlying consistency argument, consider the following measurement error model of Chang and Huang (1997):
$$y_t = \beta_1 \xi_t + \varepsilon_t, \quad t \le k_1,$$
$$y_t = \beta_2 \xi_t + \varepsilon_t, \quad t > k_1,$$
where t = 1, ..., T. The variable ξ_t is not observable; instead, we observe x_t = ξ_t + η_t, where η_t is the measurement error. It is assumed that ε_t is independent of η_t and ξ_t. In standard regression models without a break, it is well known that if x_t is used as a regressor, simple least squares cannot yield a consistent estimate of the slope parameters because the new error is correlated with the regressor x_t. We can rewrite the model with new slope coefficients so that the measurement error problem disappears. First, projecting ξ_t on x_t, we have ξ_t = c x_t + e_t for some constant c, where the projection residual e_t is uncorrelated with x_t by the projection argument. Thus, the original model can be rewritten as
$$y_t = (\beta_1 c)\, x_t + \varepsilon_t + \beta_1 e_t, \quad t \le k_1,$$
$$y_t = (\beta_2 c)\, x_t + \varepsilon_t + \beta_2 e_t, \quad t > k_1.$$
The disturbances ε_t + β_1 e_t and ε_t + β_2 e_t are uncorrelated with the regressor x_t. Thus, this is a standard change-point problem without measurement error. Furthermore, β_1 ≠ β_2 implies β_1 c ≠ β_2 c. Note that c ≠ 0 under the measurement error setup. The case of c = 0 corresponds to using a regressor that is unrelated to the true regressor ξ_t; such a situation is ruled out by our assumptions. Since we have a standard structural-break model, the break fraction τ = k_1/T can be consistently estimated. More interestingly, the estimated break has a convergence rate of T. That is, T(τ̂ − τ) is stochastically bounded despite the measurement error in the regressors. The above simple example highlights the main argument for consistency. We extend this result to multiple-break models with general specification errors. Our estimation method does not assume that the number of breaks is known. It will be demonstrated that the estimated break point is consistent for one of the true break points when the model has multiple breaks, and that it has a convergence rate of T in spite of various misspecifications.
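The projection argument lends itself to a quick numerical check. The following is a minimal simulation sketch (not taken from the paper; all names and parameter values are our own illustrative choices) in which the regressor is observed with error, yet a grid-search least-squares fit of a single-break model still locates the break fraction:

```python
# Illustrative sketch: break-date estimation under measurement error.
# The DGP follows the Chang-Huang example above; values are assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, k1 = 1000, 400                       # sample size and true break date
beta1, beta2 = 1.0, 3.0                 # pre- and post-break slopes

xi = rng.normal(0, 1, T)                # latent regressor xi_t
x = xi + rng.normal(0, 1, T)            # observed regressor x_t = xi_t + eta_t
eps = rng.normal(0, 1, T)
slope = np.where(np.arange(T) < k1, beta1, beta2)
y = slope * xi + eps

def rss_single_break(k):
    """RSS of a one-break least-squares fit that splits the sample at date k."""
    total = 0.0
    for s in (slice(0, k), slice(k, None)):
        b = x[s] @ y[s] / (x[s] @ x[s])   # OLS slope on the subsample
        total += np.sum((y[s] - b * x[s]) ** 2)
    return total

k_hat = min(range(50, T - 50), key=rss_single_break)
print(k_hat / T)                        # close to the true fraction k1/T = 0.4
```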
This paper is organized as follows: Section 2 presents the model and major assumptions. Section 3 shows the asymptotic behaviour of the break-point estimator and the Sup-Wald statistic under misspecification. Simulations in support of our theories are given in Section 4. Section 5 provides an empirical application. The last section concludes the paper. All proofs are relegated to the Appendix.
2. THE MODEL AND ASSUMPTIONS

To begin with, we present some frequently used mathematical notation. [x] denotes the greatest integer ≤ x. The symbol '$\stackrel{p}{\to}$' represents convergence in probability, '$\stackrel{d}{\to}$' represents convergence in distribution and '⇒' signifies weak convergence in D[0, 1] (see Billingsley, 1968, and Pollard, 1984). All limits are defined as the sample size T → ∞ unless otherwise stated. Consider a multivariate linear regression model with p break points at unknown time points, where the number of break points p is also unknown. We assume that the true model has the following form:
$$y_t = f_1(x_{t1})\beta_{1i} + f_2(x_{t2})\beta_{2i} + \cdots + f_L(x_{tL})\beta_{Li} + u_t, \qquad k_{i-1} < t \le k_i, \quad i = 1, 2, \ldots, p+1,$$
for t = 1, 2, ..., T. This is a model with p + 1 regimes, and the p break points are k_1, ..., k_p. We define, throughout, k_0 = 0 and k_{p+1} = T. In matrix notation, the true model can be rewritten as
$$Y = \sum_{i=1}^{p+1} I_i F \beta_i + U, \tag{2.1}$$
where Y = (y_1 y_2 ... y_T)' is a T by 1 vector; F is a T by L matrix with (t, l)th element f_l(x_{tl}), where f_l(·) is a real-valued function for l = 1, 2, ..., L; U = (u_1 u_2 ... u_T)' is a T by 1 vector whose element u_t is a martingale difference sequence with $\frac{1}{T}\sum_{t=1}^{T} u_t^2 \stackrel{p}{\to} \sigma^2 < \infty$ and sup_t E|u_t|^{4+c} < ∞ for some c > 0 (Bai and Perron, 1998); β_i = (β_{1i} β_{2i} ... β_{Li})' is an L by 1 vector of true parameters, i = 1, 2, ..., p + 1; I_i is a T by T diagonal matrix with (t, t)th element given by the indicator function 1{k_{i-1} < t ≤ k_i}, i = 1, 2, ..., p + 1, where the k_i, i = 1, 2, ..., p, are the dates of changes. Let τ_i = k_i/T be the true break fraction for i = 1, 2, ..., p. Also, let τ_0 = 0 and τ_{p+1} = 1. Let k = [τT], where τ ∈ [0, 1]. Without knowing the true model, the following regression model with a single break is estimated:
$$Y = I_a G \hat\beta_{a[\tau]} + I_b G \hat\beta_{b[\tau]} + \hat V, \tag{2.2}$$
where I_a is a T by T diagonal matrix with (t, t)th element given by the indicator function 1{t < k} with k = [Tτ], and I_b = I_{T×T} − I_a; β̂_{a[τ]} and β̂_{b[τ]} are N by 1 vectors of OLS coefficient estimates for the first and second subsamples, respectively;
G is a T by N matrix with (t, n)th element g_n(x_{tn}), where g_n(·) is a real-valued function, n = 1, 2, ..., N; and V̂ is a T by 1 vector of residuals for the misspecified model. In this model, only one break point is estimated despite the existence of p true break points. Therefore, two kinds of misspecification occur in this model: misspecification in the regressors and misspecification in the number of breaks. Define the following matrices:
$$\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} E[f(x_t) f(x_t)'] = Q_{ff}, \qquad \lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} E[g(x_t) g(x_t)'] = Q_{gg}$$
and
$$\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} E[g(x_t) f(x_t)'] = Q_{gf},$$
where f(x_t) = (f_1(x_{t1}), ..., f_L(x_{tL}))' is an L × 1 vector and g(x_t) = (g_1(x_{t1}), ..., g_N(x_{tN}))' is an N × 1 vector. We define Q_{fg} = Q_{gf}'. Q_{ff} and Q_{gg} are assumed to be of full rank. The matrix Q_{ff} being of full rank is a necessary identification condition even in the absence of breaks, and the full rank requirement is standard. Define Q = Q_{fg} Q_{gg}^{-1} Q_{gf}. It is assumed that Q is of full rank.²

2. The assumption of full rank for Q is equivalent to Q_{gf} having full column rank. We do not allow all the regressors to be completely unrelated to the true regressors. The requirement that Q_{gf} has full column rank says that at least a subset of regressors (of the same dimension as the true regressors f) is correlated with the true regressors. This is a reasonable assumption, since one cannot expect to consistently estimate the model with arbitrary unrelated regressors.

Following Chong (2003), we also assume, uniformly in τ ∈ [0, 1], that
(A1) $\frac{1}{T} F' I_a F \stackrel{p}{\to} \tau Q_{ff}$;
(A2) $\frac{1}{T} G' I_a G \stackrel{p}{\to} \tau Q_{gg}$;
(A3) $\frac{1}{T} G' I_a F \stackrel{p}{\to} \tau Q_{gf}$;
(A4) det(F'[I(j) − I(h)]F) > 0 and det(G'[I(j) − I(h)]G) > 0 for all j, h ∈ [0, T] with j > h, where I(j) and I(h) are T by T diagonal matrices with (t, t)th elements given by the indicator functions 1{t ≤ j} and 1{t ≤ h}, respectively;
(A5) there exist some real numbers r > 2, C_l > 0 and D_n > 0 (1 ≤ l ≤ L, 1 ≤ n ≤ N) such that for all 0 ≤ h < j ≤ T, the two inequalities
$$E\left|\sum_{t=h+1}^{j} f_l(x_{tl})\, u_t\right|^r \le C_l (j-h)^{r/2}$$
and
$$E\left|\sum_{t=h+1}^{j} g_n(x_{tn})\, u_t\right|^r \le D_n (j-h)^{r/2}$$
hold;
(A6) (τ_1, τ_2, ..., τ_p, β_1, β_2, ..., β_{p+1}) ∈ Θ = [τ̲, τ̄]^p × B^{p+1} ⊂ (0, 1)^p × R^{(p+1)×L}, where B is an L-dimensional parameter space;
(A7) inf_i |τ_{i+1} − τ_i| > 0, i = 0, 1, ..., p, with τ_0 = 0 and τ_{p+1} = 1;
(A8) inf_i ‖β_{i+1} − β_i‖ > 0, i = 0, 1, ..., p.
Assumptions (A1)–(A3) exclude trending regressors. Assumption (A4) guarantees the invertibility of the matrices defined there, so that the OLS regression coefficients β̂_a and β̂_b can be estimated. The uniform convergence result is ensured by Assumption (A5). Assumption (A6) states that the true break points lie in a compact subset of (0, 1), as the estimates β̂_a and β̂_b are not defined at the boundary of the time domain. Assumption (A7) requires two consecutive change points to be separated far enough from each other. Assumption (A8) states that the magnitudes of the changes should not be negligible. In model (2.2), there are two kinds of misspecification, namely, F ≠ G or a number of true breaks exceeding one (p > 1). In other words, the model is misspecified if the functional form of the regressors is incorrect or the true number of changes is more than one.
3. ASYMPTOTICS

3.1 Consistency

Following Chong (2001), we define
$$\hat\tau_T = \arg\min_{\tau \in [\underline\tau, \bar\tau]} RSS(\tau, 0, 1), \tag{3.1}$$
where RSS(τ, τ_v, τ_l) denotes the residual sum of squares of model (2.2) regressed on data dating from τ_v T to τ_l T, and
$$RSS(\tau, 0, 1) = \left\| Y - I_a G \hat\beta_{a[\tau]} - I_b G \hat\beta_{b[\tau]} \right\|^2$$
is the residual sum of squares for the whole sample. For any given value of τ, the least-squares estimators of the pre- and post-shift parameters of model (2.2) are
$$\hat\beta_{a[\tau]} = (G' I_a G)^{-1} G' I_a Y, \qquad \hat\beta_{b[\tau]} = (G' I_b G)^{-1} G' I_b Y. \tag{3.2}$$
For τ ∈ [τ_h, τ_{h+1}], h = 0, 1, ..., p,
$$\hat\beta_{a[\tau]} \stackrel{p}{\to} Q_{gg}^{-1} Q_{gf}\, \lambda_1(\tau), \qquad \hat\beta_{b[\tau]} \stackrel{p}{\to} Q_{gg}^{-1} Q_{gf}\, \lambda_2(\tau),$$
where, for τ ∈ [τ_h, τ_{h+1}),
$$\lambda_1(\tau) = \begin{cases} \beta_1 & \text{for } h = 0, \\[4pt] \dfrac{\sum_{i=1}^{h} (\tau_i - \tau_{i-1})\beta_i + (\tau - \tau_h)\beta_{h+1}}{\tau} & \text{for } h = 1, 2, \ldots, p, \end{cases} \tag{3.3}$$
$$\lambda_2(\tau) = \begin{cases} \dfrac{\sum_{i=h+2}^{p+1} (\tau_i - \tau_{i-1})\beta_i + (\tau_{h+1} - \tau)\beta_{h+1}}{1 - \tau} & \text{for } h = 0, 1, \ldots, p-1, \\[4pt] \beta_{p+1} & \text{for } h = p. \end{cases} \tag{3.4}$$
In general, for any interval (τ_v, τ_l) ⊆ [0, 1], we have
$$\frac{1}{T} RSS(\tau, \tau_v, \tau_l) \stackrel{p}{\to} R(\tau, \tau_v, \tau_l)$$
uniformly, where R(τ, τ_v, τ_l) is a piecewise concave function of τ defined on (τ_v, τ_l). In particular, it can be shown that
$$\frac{1}{T} RSS(\tau, 0, 1) \stackrel{p}{\to} R(\tau, 0, 1), \quad \tau \in [0, 1],$$
uniformly, where
$$R(\tau, 0, 1) = \sigma^2 + \sum_{i=1}^{p+1} (\tau_i - \tau_{i-1})\, \beta_i' Q_{ff} \beta_i - \tau\, \lambda_1(\tau)' Q \lambda_1(\tau) - (1-\tau)\, \lambda_2(\tau)' Q \lambda_2(\tau).$$
Note that for τ ∈ [0, τ_1), we have
$$\frac{\partial R(\tau, 0, 1)}{\partial \tau} = -\left(\frac{1}{1-\tau}\right)^2 L_1' Q L_1 \le 0$$
and
$$\frac{\partial^2 R(\tau, 0, 1)}{\partial \tau^2} = -\frac{2}{(1-\tau)^3} L_1' Q L_1 \le 0,$$
where
$$L_1 = \sum_{i=2}^{p+1} (\tau_i - \tau_{i-1})\beta_i - (1 - \tau_1)\beta_1.$$
For τ ∈ [τ_h, τ_{h+1}), h = 1, 2, ..., p − 1, we have
$$\frac{\partial R(\tau, 0, 1)}{\partial \tau} = \lambda_1(\tau)' Q \lambda_1(\tau) - \lambda_2(\tau)' Q \lambda_2(\tau) - 2\beta_{h+1}' Q \left(\lambda_1(\tau) - \lambda_2(\tau)\right),$$
$$\frac{\partial^2 R(\tau, 0, 1)}{\partial \tau^2} = -\frac{2}{\tau} \left(\beta_{h+1} - \lambda_1(\tau)\right)' Q \left(\beta_{h+1} - \lambda_1(\tau)\right) - \frac{2}{1-\tau} \left(\beta_{h+1} - \lambda_2(\tau)\right)' Q \left(\beta_{h+1} - \lambda_2(\tau)\right) \le 0.$$
For τ ∈ [τ_p, 1], we have
$$\frac{\partial R(\tau, 0, 1)}{\partial \tau} = \frac{1}{\tau^2} L_2' Q L_2 \ge 0,$$
where
$$L_2 = \sum_{i=1}^{p} (\tau_i - \tau_{i-1})\beta_i - \tau_p \beta_{p+1},$$
and
$$\frac{\partial^2 R(\tau, 0, 1)}{\partial \tau^2} = -\frac{2}{\tau^3} L_2' Q L_2 \le 0.$$
The beauty of R(τ, 0, 1) is that it is concave within any interval [τ_h, τ_{h+1}], h = 0, 1, ..., p. In addition, for the two boundary intervals we have ∂R(τ, 0, τ_1)/∂τ ≤ 0 and ∂R(τ, τ_p, 1)/∂τ ≥ 0, implying that the function is decreasing for τ ∈ [0, τ_1] and increasing for τ ∈ [τ_p, 1]. Thus, the function R(τ, 0, 1) achieves its minimum at one of the true break points (τ_1, ..., τ_p), which in turn implies that τ̂_T is consistent for one of the true break points.

PROPOSITION 3.1. Under Assumptions (A1)–(A8), R(τ, τ_v, τ_l) defined on any subsample [τ_v, τ_l] ⊆ [0, 1] is piecewise concave.

Proposition 3.1 suggests that the estimated break point is consistent for one of the true breaks. The piecewise concavity over each subinterval guarantees the consistent estimation of all the break points. The following theorem strengthens the consistency argument by including the convergence rate of the estimated break points.

THEOREM 3.1. Under Assumptions (A1)–(A8), for some k = 1, 2, ..., p:
(i) τ̂_T is consistent for one of the true break points, i.e., plim τ̂_T = τ_k.
(ii) Suppose that τ̂_T is consistent for τ_k. Then T(τ̂_T − τ_k) = O_p(1).

Theorem 3.1 shows the consistency of the estimated change point for one of the true ones. However, which of the change points will be identified depends on a number of factors. It is not necessarily the case that the break with the biggest magnitude will be identified first; the result also depends on the duration of the break and the form of misspecification. For example, in the simplest case of Chong (1995), where F = G and the true model has two breaks but a single-break model is estimated, if
$$\frac{(\beta_3 - \beta_2)' Q (\beta_3 - \beta_2)}{(\beta_2 - \beta_1)' Q (\beta_2 - \beta_1)} < \frac{\tau_1 (1 - \tau_1)}{\tau_2 (1 - \tau_2)},$$
then $\hat\tau \stackrel{p}{\to} \tau_1$, and if this inequality is reversed, then $\hat\tau \stackrel{p}{\to} \tau_2$.

According to Theorem 3.1, we can first consistently estimate one of the break points, τ̂_T, in the whole sample, then split the sample into pre- and post-shift subsamples at the break fraction τ̂_T. Within the subsamples [0, τ̂_T] and (τ̂_T, 1], estimate arg min_τ RSS(τ, 0, τ̂_T) and arg min_τ RSS(τ, τ̂_T, 1) to obtain the next two break points, assuming each subinterval contains a break point. This process continues until all other break points are located.³ According to Proposition 3.1, each estimated break is consistent. Alternatively, in each subsequent round of estimation, only one break point is chosen, in the subsample for which the reduction in the residual sum of squares is greatest. This method is considered by Bai and Perron (1998) and is referred to as the sequential method. The process stops when the specified number of breaks is obtained.

3. Note that R(τ, 0, 1) will no longer be piecewise concave and the consistency of the estimation is not guaranteed if Q_{fg} = 0. For example, when f(x_t) = x_t ∼ i.i.d. N(0, σ_x²) and g(x_t) = x_t², we have Q_{fg} = 0, and thus $\frac{1}{T}RSS(\tau, 0, 1) \stackrel{p}{\to} R(\tau, 0, 1) = \sigma^2 + \sum_{i=1}^{p+1} (\tau_i - \tau_{i-1})\beta_i' Q_{ff} \beta_i$, which is no longer a function of τ.

3.2 The number of break points and the Wald test

Consistent estimation of a break point does not hinge on prior knowledge of the number of breaks, as long as there exists at least one break point. However, if the objective is to estimate all break points, one clearly needs to know their number. Recently, Altissimo and Corradi (2003) proposed a consistent estimation method for the number of break points when the model is correctly specified. In this paper, we discuss three estimation methods. The first is based on the Sup-Wald test procedure of Chong (2003). The second is based on a Wald test with a heteroskedasticity and autocorrelation (HAC) robust estimator of the variance. The third is to use the information criterion of Yao (1988).

3.2.1 The Sup-Wald test statistic. The Sup-Wald test statistic, together with the sample-splitting method, can be used to determine the location and number of the breaks. Suppose model (2.1) is the true model, but model (2.2) is estimated; the Sup-Wald statistic for the null hypothesis H_0: β_1 = β_2 = ··· = β_{p+1} = β is defined as
$$\sup_{\tau \in S} W_T(\tau, 0, 1) = \sup_{\tau \in S} \frac{T \tau (1 - \tau)}{RSS(\tau, 0, 1)} \left(\hat\beta_{a[\tau]} - \hat\beta_{b[\tau]}\right)' G' G \left(\hat\beta_{a[\tau]} - \hat\beta_{b[\tau]}\right), \tag{3.5}$$
where S is a set with closure in [0, 1]. The second and third arguments of W_T(τ, 0, 1) indicate that the test is calculated with the full-sample data. As shown in Chong (2003), although the pre- and post-shift estimators are not consistent in the presence of misspecification, they converge to the same constant under the null hypothesis of no structural break. Meanwhile, if there are structural breaks, the probability limits of the pre- and post-shift estimators will generally differ in the presence of specification errors. Therefore, a Wald-type test based on the magnitude of the estimated break will still be a consistent test. To derive the distribution of W_T(τ, 0, 1), we make the following assumptions:
(B1) u_t is i.i.d. (0, σ²) with σ² < ∞;
(B2) $\frac{1}{\sqrt{T}} F' I_a U \Rightarrow B_{fu}(\tau)$, where B_{fu}(τ) is an L-vector zero-mean Gaussian process with variance τσ²Q_{ff};
(B3) $\frac{1}{\sqrt{T}} G' I_a U \Rightarrow B_{gu}(\tau)$, where B_{gu}(τ) is an N-vector zero-mean Gaussian process with variance τσ²Q_{gg};
(B4) $\sqrt{T}\,\mathrm{Vec}\!\left(\frac{1}{T} G' I_a G - \tau Q_{gg}\right) \Rightarrow B_{gg}(\tau)$, where B_{gg}(τ) is an N² × 1 vector of zero-mean Gaussian processes with covariance matrix
$$\tau \sigma^2 \lim_{T\to\infty} T\, E\left[ \mathrm{Vec}\left(\frac{1}{T} G'G - Q_{gg}\right) \mathrm{Vec}\left(\frac{1}{T} G'G - Q_{gg}\right)' \right];$$
(B5) $\sqrt{T}\,\mathrm{Vec}\!\left(\frac{1}{T} G' I_a F - \tau Q_{gf}\right) \Rightarrow B_{fg}(\tau)$, where B_{fg}(τ) is an NL × 1 vector of zero-mean Gaussian processes with variance
$$\tau \sigma^2 \lim_{T\to\infty} T\, E\left[ \mathrm{Vec}\left(\frac{1}{T} G'F - Q_{gf}\right) \mathrm{Vec}\left(\frac{1}{T} G'F - Q_{gf}\right)' \right].$$
Chong (2003) derives the null distribution of W_T(τ, 0, 1) under Assumptions (A1)–(A5) and (B1)–(B5).

THEOREM 3.2. Under Assumptions (A1)–(A5) and (B1)–(B5), and H_0: β_1 = β_2 = ··· = β_{p+1} = β, as T → ∞, we have
$$W_T(\tau, 0, 1) \Rightarrow \frac{[\tau B_A(1) - B_A(\tau)]' Q_{gg}^{-1} [\tau B_A(1) - B_A(\tau)]}{\tau(1-\tau)\,[\sigma^2 + \beta'(Q_{ff} - Q)\beta]}$$
and
$$\sup_{\tau\in S} W_T(\tau, 0, 1) \stackrel{d}{\to} \sup_{\tau\in S} \frac{[\tau B_A(1) - B_A(\tau)]' Q_{gg}^{-1} [\tau B_A(1) - B_A(\tau)]}{\tau(1-\tau)\,[\sigma^2 + \beta'(Q_{ff} - Q)\beta]},$$
where S is a set with closure in [τ̲, τ̄], $B_A(\tau) = [B_{fg}(\tau) - B_{gg}(\tau) Q_{gg}^{-1} Q_{fg}]\beta + B_{gu}(\tau)$ is an N-vector of Brownian motions on [0, 1] and Q = Q_{fg} Q_{gg}^{-1} Q_{gf}.

The proof is given in Chong (2003) and is thus omitted. If there is no specification error, we have B_A(τ) = B_{fu}(τ), Q = Q_{ff} and, under H_0,
$$\sup_{\tau\in S} W_T(\tau, 0, 1) \stackrel{d}{\to} \sup_{\tau\in S} \frac{\left\| \tau B(1) - B(\tau) \right\|^2}{\tau(1-\tau)},$$
where B(τ) is an L-vector of independent Brownian motions. With little modification, Theorem 3.2 can also be shown to hold in any subsample. To estimate the number of breaks, one can follow the sample-splitting method of Chong (2001). After the first break point is identified, we split the sample into two subsamples using the first break-point estimate as a cut-off point. Then, the same test is performed on the two subsamples. The procedure continues until all the break points are identified. Note that the distribution of the Sup-Wald statistic is now affected by the form of misspecification in an unknown manner. However, if the misspecification is due to measurement errors, we can still perform the test in some special cases. To see this, let the true model be
$$Y = \sum_{i=1}^{p+1} I_i F \beta_i + U,$$
but we estimate
$$Y = I_a (F + \Delta) \hat\beta_{a[\tau]} + I_b (F + \Delta) \hat\beta_{b[\tau]} + \hat V,$$
where Δ = (δ_1, δ_2, ..., δ_T)' is a T by L matrix and δ_t = (δ_{t1}, δ_{t2}, ..., δ_{tL})', t = 1, 2, ..., T, is the measurement error vector. Assume that the {δ_{tl}}_{t=1}^T are zero-mean, i.i.d. across t and l, and
independent of F and U. Then we have E(Δ) = 0 and
$$\lim_{T\to\infty} \frac{1}{T} \Delta' \Delta = \Sigma,$$
where Σ is an L × L diagonal matrix with (l, l)th element var(δ_{tl}) = σ_l². Note that in this case Q_{fg} = Q_{ff} and Q_{gg} = Q_{ff} + Σ. It is not clear how Σ affects the test statistic when Q_{ff} is not diagonal. However, when L = 1 and if f(x_t) ∼ i.i.d. N(0, α_f σ²) and δ_t ∼ i.i.d. N(0, α_δ σ²), where α_f and α_δ are the ratios of var[f(x_t)] and var(δ_t) to the variance of u_t, then
$$\sup_{\tau\in S} W_T(\tau, 0, 1) \stackrel{d}{\to} \sup_{\tau\in S} \frac{(\tau B(1) - B(\tau))^2}{\tau(1-\tau)},$$
which is the conventional null distribution of the Sup-Wald statistic for testing a structural break.

3.2.2 The HAC-adjusted Wald test statistic. In general, the limiting distribution of the Sup-Wald statistic depends on unknown parameters, except in some special cases. The underlying reason is that misspecification and measurement error induce heteroskedasticity, in the sense that E(g_t g_t' v_t²) ≠ E(g_t g_t') E(v_t²). As is well known, the usual F test (and hence the Sup-Wald test) fails under heteroskedasticity. Thus, a heteroskedasticity-robust covariance estimator should be used. In this paper, we propose a modified test whose asymptotic distribution is not affected by the functional form F or the unknown parameter β. As the misspecification in the functional form can be absorbed by the error term, we can construct an HAC-adjusted Wald test statistic (Bai and Perron, 1998) as follows:
$$W_T^{HAC}(\tau, 0, 1) = T \tau (1 - \tau) \left( \hat\beta_{a[\tau]} - \hat\beta_{b[\tau]} \right)' \hat\Omega^{-1} \left( \hat\beta_{a[\tau]} - \hat\beta_{b[\tau]} \right), \tag{3.6}$$
where S is a set with closure in [0, 1], and the estimated asymptotic variance of $\hat\beta_{a[\tau]} - \hat\beta_{b[\tau]}$ is $\frac{1}{\tau(1-\tau)} \hat\Omega$, with
$$\hat\Omega = \left( \frac{G'G}{T} \right)^{-1} \frac{\sum_{t=1}^{T} g_t g_t' \hat v_t^2}{T} \left( \frac{G'G}{T} \right)^{-1}$$
and $\hat v_t = y_t - g_t' \hat\beta$, t = 1, 2, ..., T, the estimated residuals. Note that Ω̂ is robust to heteroskedasticity. One can also employ the Newey and West (1987) estimator to make it robust to serial correlation as well.

THEOREM 3.3. Under Assumptions (A1)–(A5) and (B1)–(B5), and H_0: β_1 = β_2 = ··· = β_{p+1} = β, as T → ∞, we have $W_T^{HAC}(\tau, 0, 1) \stackrel{d}{\to} \chi_N^2$ for each fixed τ, and
$$\sup_{\tau\in S} W_T^{HAC}(\tau, 0, 1) \stackrel{d}{\to} \sup_{\tau\in S} \frac{\left\| \tau B(1) - B(\tau) \right\|^2}{\tau(1-\tau)},$$
where S is a set with closure in [τ̲, τ̄] and B(τ) is an N-vector of independent Brownian motions.

According to Theorem 3.3, $\sup W_T^{HAC}(\tau, 0, 1)$ converges to the same asymptotic distribution as the Sup-Wald test of Andrews (1993).

3.2.3 Information criterion of Yao (1988). Alternatively, one can use the Schwarz information criterion approach of Yao (1988) to determine the total number of breaks. Once the number of breaks is determined, the dynamic programming method of Bai and Perron (1998) can be used to
estimate all breaks simultaneously. No matter which method is used, the estimated break point is T-consistent.
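To make the construction concrete, the following is a minimal sketch (our own illustrative code, not from the paper) of the HAC-adjusted statistic in (3.6) with the White (1980) variance estimator; a Newey–West estimator could replace V̂_L if serial correlation is a concern:

```python
# Sketch of the HAC-adjusted Wald statistic (3.6); names are our own.
import numpy as np

def hac_wald(y, G, tau):
    """W_T^HAC(tau, 0, 1): G is the T-by-N regressor matrix, tau in (0, 1)."""
    T = len(y)
    k = int(T * tau)
    beta_a, *_ = np.linalg.lstsq(G[:k], y[:k], rcond=None)
    beta_b, *_ = np.linalg.lstsq(G[k:], y[k:], rcond=None)
    beta_full, *_ = np.linalg.lstsq(G, y, rcond=None)
    v = y - G @ beta_full                      # residuals under the null
    Q_gg_inv = np.linalg.inv(G.T @ G / T)
    V_L = (G * (v ** 2)[:, None]).T @ G / T    # White (1980) estimator of V_L
    Omega = Q_gg_inv @ V_L @ Q_gg_inv          # heteroskedasticity-robust variance
    d = beta_a - beta_b
    return T * tau * (1 - tau) * d @ np.linalg.solve(Omega, d)

def sup_hac_wald(y, G, trim=0.15):
    """Maximize the statistic over break fractions in [trim, 1 - trim]."""
    T = len(y)
    grid = np.arange(int(T * trim), int(T * (1 - trim))) / T
    return max(hac_wald(y, G, t) for t in grid)
```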
4. SIMULATIONS

Experiment 1. This experiment shows the distribution of the first break-point estimator in a two-break model under misspecification. We perform the following simulations.
True model:
$$y_t = 1 + x_t + u_t \quad \text{for } t \in [1, T/3],$$
$$y_t = 1 + 3x_t + u_t \quad \text{for } t \in (T/3, 2T/3],$$
$$y_t = 1 + 4x_t + u_t \quad \text{for } t \in (2T/3, T].$$
Estimated model:
$$y_t = \hat\beta_{01} + \hat\beta_1 g(x_t) + \hat v_t \quad \text{for } t \le \hat\tau T,$$
$$y_t = \hat\beta_{02} + \hat\beta_2 g(x_t) + \hat v_t \quad \text{for } t > \hat\tau T,$$
with three choices of g: Model (1): g(x_t) = x_t; Model (2): g(x_t) = x_t³; Model (3): g(x_t) = ln x_t. Here T = 1,000; u_t ∼ i.i.d. N(0, 1); x_t ∼ i.i.d. U(1, 10); and {x_t}_{t=1}^T and {u_t}_{t=1}^T are independent of each other. Table 1 shows the frequency distribution of the minimizer of RSS(τ, 0, 1) around the two true break points. The number of replications is set to N = 1,000. Note that the first break tends to be identified first: Table 1 shows that the first break point has a high chance of being identified in all three models.
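A condensed version of this design can be sketched as follows (our own illustrative code with a single replication per model; the paper uses N = 1,000 replications):

```python
# Sketch of Experiment 1: two-break DGP, misspecified single-break fit.
import numpy as np

rng = np.random.default_rng(0)
T = 1000

def first_break_estimate(g):
    """Estimate the single break date by grid-search OLS with regressor g(x)."""
    t = np.arange(1, T + 1)
    x = rng.uniform(1, 10, T)
    u = rng.normal(0, 1, T)
    slope = np.where(t <= T / 3, 1.0, np.where(t <= 2 * T / 3, 3.0, 4.0))
    y = 1 + slope * x + u
    X = np.column_stack([np.ones(T), g(x)])   # intercept plus g(x_t)
    best_k, best_rss = None, np.inf
    for k in range(50, T - 50):
        rss = 0.0
        for s in (slice(0, k), slice(k, None)):
            b, *_ = np.linalg.lstsq(X[s], y[s], rcond=None)
            rss += np.sum((y[s] - X[s] @ b) ** 2)
        if rss < best_rss:
            best_k, best_rss = k, rss
    return best_k

# Models (1)-(3): g(x) = x, x**3 and log(x)
for g in (lambda x: x, lambda x: x ** 3, np.log):
    print(first_break_estimate(g))   # clusters near the first break, T/3
```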
Experiment 2. When there is only one regressor in both the true and the estimated models, we have shown that the Sup-Wald statistic is not affected asymptotically. This experiment studies the behaviour of the Sup-Wald statistic in the presence of measurement error. We estimate the following model.
True model:
$$y_t = x_t + u_t, \quad t = 1, 2, \ldots, T.$$
Misspecified model:
$$y_t = \hat\beta (x_t + \delta_t) + \hat v_t, \quad t = 1, 2, \ldots, T.$$
Here T = 1,000 (sample size); N = 10,000 (number of replications); u_t ∼ i.i.d. N(0, 1); x_t ∼ i.i.d. N(0, 1);
298
J. Bai et al. Table 1. Distribution of the first estimated break point. Estimated models kˆ1 − 334
(1)
(2)
(3)
<−6
0
10
0
−6
0
3
0
−5 −4
0 0
14 17
0 0
−3
2
29
0
−2 −1 0
7 81 820
68 126 429
0 2 446
1 2
66 17
112 56
189 111
3 4 5 6 >6
3 4 0 0 0
47 26 34 11 18
72 37 34 26 83
Table 2. Critical values of the Sup-Wald test.

        Andrews (1993)          α_δ = 0                α_δ = 1                α_δ = 2
λ\α    0.1    0.05   0.01     0.1    0.05   0.01     0.1    0.05   0.01     0.1    0.05   0.01
0.1    7.63   9.31   12.69    7.60   9.51   12.55    7.55   9.19   13.06    7.51   8.21   12.80
0.2    6.80   8.45   11.69    6.74   8.41   11.79    6.74   8.42   11.58    6.69   8.21   11.62
0.3    6.05   7.51   10.91    6.15   7.44   10.80    5.87   7.46   11.08    6.13   7.46   11.19
0.4    5.10   6.57    9.82    5.18   6.50    9.77    4.90   6.35    9.81    4.94   6.34   10.11
δ_t ∼ i.i.d. N(0, α_δ); and {x_t}_{t=1}^T, {u_t}_{t=1}^T and {δ_t}_{t=1}^T are independent of each other. Using the misspecified model, we simulate the critical values under the null hypothesis of no break point with measurement error. Table 2 reports the critical value c such that Pr[sup_{τ∈(λ,1−λ)} W_T(τ, 0, 1) > c] = α. The values are compared with those of Andrews (1993). It can be clearly observed from Table 2 that the critical values are very close to those of Andrews (1993). Additional simulations (not reported here) show that, under general misspecifications, the critical values of the HAC-adjusted Sup-Wald test are also close to those of Andrews (1993), as predicted by Theorem 3.3.
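The design of Experiment 2 can be reproduced along the following lines (an illustrative sketch under our own naming; cumulative sums keep each Sup-Wald evaluation O(T)):

```python
# Sketch of Experiment 2: Sup-Wald critical values under measurement error.
import numpy as np

rng = np.random.default_rng(0)
T, N, lam, alpha_delta = 1000, 2000, 0.1, 1.0   # N reduced from the paper's 10,000

def sup_wald(y, g):
    """sup over tau in (lam, 1 - lam) of W_T(tau, 0, 1) in (3.5), scalar regressor."""
    cgy, cgg, cyy = np.cumsum(g * y), np.cumsum(g * g), np.cumsum(y * y)
    ks = np.arange(int(T * lam), int(T * (1 - lam)))
    gy_a, gg_a = cgy[ks - 1], cgg[ks - 1]         # first-subsample cross-products
    gy_b, gg_b = cgy[-1] - gy_a, cgg[-1] - gg_a
    ba, bb = gy_a / gg_a, gy_b / gg_b             # pre- and post-break OLS slopes
    rss = cyy[-1] - ba * gy_a - bb * gy_b         # RSS of the one-break fit
    tau = ks / T
    return np.max(T * tau * (1 - tau) * (ba - bb) ** 2 * cgg[-1] / rss)

draws = []
for _ in range(N):
    x = rng.normal(0, 1, T)
    u = rng.normal(0, 1, T)
    delta = rng.normal(0, np.sqrt(alpha_delta), T)
    y = x + u                                     # true model: no break
    draws.append(sup_wald(y, x + delta))          # regressor observed with error

print(np.quantile(draws, [0.90, 0.95, 0.99]))     # compare with Table 2, lambda = 0.1
```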
Table 3. Comparison results for different models and methods.

Model      | SEQ with BIC                      | GLM with BIC       | SEQ with Sup-HAC-Wald
Mean shift | First break: 79; Second break: 47 | Two breaks: 47, 79 | First break: 79; Second break: 47
AR(1)      | First break: 81; Second break: 46 | Two breaks: 46, 78 | First break: 81; Second break: 46; Third break: 24
AR(2)      | First break: 80                   | One break: 80      | First break: 80; Second break: 46; Third break: 24
5. EMPIRICAL APPLICATION

In this section, we provide an application to the U.S. ex-post real interest rate series considered by Garcia and Perron (1996) and Bai and Perron (2003). The real interest rate series is constructed from the three-month treasury bill rate deflated by the CPI taken from the Citibase data base. Quarterly data from 1961Q1 to 1986Q3 are used. Bai and Perron (2003) applied a simple mean-shift model to the data and found two breaks. In this paper, we check the robustness of the estimation when different models are used. Meanwhile, we would also like to see whether the sequential estimation method (SEQ) gives the same estimation results as the global minimization method (GLM), which estimates all the break points simultaneously. Two methods are used to determine the number of breaks: the first is the BIC of Yao (1988), while the second is the Sup-HAC-Wald test. Note from Table 3 that SEQ obtains results similar to those of the global minimization method in all cases; for the mean-shift model, the two give identical results. Using the BIC, two breaks are detected for the mean-shift model and for the AR(1) model, and only one break is found for the AR(2) model. However, if we use the Sup-HAC-Wald test, we obtain two breaks for the mean-shift model and three for the AR(1) and AR(2) models. That the number of breaks detected by the BIC and the Sup-HAC-Wald test varies across models is not unexpected in a finite sample. According to the Sup-HAC-Wald test, we conclude that there are three breaks in the AR models. Note that the estimated break dates are very close regardless of the method used.
6. CONCLUSIONS

In this paper, we combine the works of Bai (1997) and Chong (2003) to examine the consistency of the break-point estimator under misspecification of the independent variables in a multiple-break model with an unknown number of break points. The misspecification can occur in different dimensions: (i) the number of breaks is incorrectly specified, (ii) the functional forms of the regressors are incorrectly specified, (iii) the variables are observed with errors. It is shown that the residual sum of squares divided by the sample size converges uniformly to a piecewise concave function, which in turn implies that, if a one-break model is estimated in the presence of multiple breaks, the break-point estimator will converge to one of the true break points. Thus, it is possible to locate the other break points by using the sample-splitting method. The rate of convergence of the estimated break points is also provided. If the number of break points is known, all break points can be consistently estimated one by one. The issue of how to determine the number of breaks is also discussed. Experimental and empirical evidence confirm our results.
ACKNOWLEDGMENTS

The authors would like to thank Stéphane Grégoir, Win Lin Chou, Andy Kwan, Henry Lin and two anonymous referees for helpful comments. All errors are ours.
REFERENCES

Andrews, D. W. K. (1993). Tests for parameter instability and structural change with unknown change point. Econometrica 61, 821–56.
Altissimo, F. and V. Corradi (2003). Strong rules for detecting the number of breaks in a time series. Journal of Econometrics 117, 207–44.
Bai, J. (1997). Estimating multiple breaks one at a time. Econometric Theory 13, 315–52.
Bai, J. and P. Perron (1998). Estimating and testing linear models with multiple structural changes. Econometrica 66, 47–79.
Bai, J. and P. Perron (2003). Computation and analysis of multiple structural-change models. Journal of Applied Econometrics 18, 1–22.
Billingsley, P. (1968). Convergence of Probability Measures. New York: Wiley.
Busetti, F. and A. Harvey (1998). Testing for the presence of a random walk in series with structural breaks. Discussion Paper No. EM/98/365, London School of Economics and Political Science.
Chang, Y. P. and W. T. Huang (1997). Inference for the linear errors-in-variables with change point models. Journal of the American Statistical Association 92, 171–78.
Chesher, A. (1991). The effect of measurement error. Biometrika 78, 451–62.
Chong, T. T. L. (1995). Partial parameter consistency in a misspecified structural change model. Economics Letters 49, 351–57.
Chong, T. T. L. (2001). Estimating the locations and number of change points by the sample-splitting method. Statistical Papers 42, 53–79.
Chong, T. T. L. (2003). Generic consistency of the break-point estimator under specification errors. Econometrics Journal 6, 167–92.
Forchini, G. (2002). Optimal similar tests for structural change for the linear regression model. Econometric Theory 18, 853–67.
Garcia, R. and P. Perron (1996). An analysis of the real interest rate under regime shifts. Review of Economics and Statistics 78, 111–25.
Lee, J., C. J. Huang and Y. C. Shin (1997). On stationary tests in the presence of structural breaks. Economics Letters 55, 165–72.
Nelson, D. B. (1995). Vector attenuation bias in the classical errors-in-variables model. Economics Letters 49, 345–49.
Newey, W. K. and K. D. West (1987). A simple positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703–8.
Perron, P. (2006). Dealing with structural breaks. In K. Patterson and T. C. Mills (Eds.), Palgrave Handbook of Econometrics, Volume 1, 278–352. Basingstoke: Palgrave Macmillan.
Pollard, D. (1984). Convergence of Stochastic Processes. New York: Springer.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48, 817–38.
Yao, Y. C. (1988). Estimating the number of change-points via Schwarz criterion. Statistics and Probability Letters 6, 181–89.
APPENDIX A

Proof of Proposition 3.1: We establish the piecewise concavity of R(τ, τ_v, τ_l). Suppose there are q (q ≤ p) breaks within the segment t = v + 1, ..., l − 1, l. Let
(1) τ_v = v/T, τ_l = l/T, [τ_v, τ_l] ⊂ [τ̲, τ̄];
(2) τ_v < τ̄_1 < τ̄_2 < ··· < τ̄_{q−1} < τ̄_q < τ_l, where the τ̄_i, i = 1, 2, ..., q, are q consecutive true change points with τ̄_1 = τ_j for some j ∈ {1, 2, ..., p − q};
(3) I_v, I_l, I_a, I_b and I_i be, respectively, (l − v) × (l − v) diagonal matrices with (t − v, t − v)th elements given by the indicator functions 1{v < t ≤ τ̄_1 T}, 1{τ̄_q T < t ≤ l}, 1{v < t ≤ τT}, 1{τT < t ≤ l} and 1{τ̄_{i−1} T < t ≤ τ̄_i T}, i = 2, ..., q.
For convenience of notation, let τ̄_0 = τ_v, τ̄_{q+1} = τ_l, I_1 = I_v and I_{q+1} = I_l. The true model in this segment is
$$Y = \sum_{i=1}^{q+1} I_i F \beta_{r-1+i} + U, \tag{A.1}$$
where r is an integer such that τ_{r−1} ≤ τ_v < τ_r. Let β̄_i = β_{r−1+i}, i = 1, 2, ..., q + 1. We then rewrite (2.1) as
$$Y = \sum_{i=1}^{q+1} I_i F \bar\beta_i + U.$$
Due to misspecification, model (2.2) is estimated. For τ̄_0 ≤ τ < τ̄_1, we have I_a I_1 = I_a; I_a I_i = 0, i = 2, 3, ..., q + 1; I_b I_1 = (I − I_a) I_1 = I_1 − I_a; and I_b I_i = I_i, i = 2, 3, ..., q + 1. Thus,
$$\hat\beta_{a[\tau]} = (G' I_a G)^{-1} G' I_a Y = \left( \frac{G' I_a G}{T} \right)^{-1} \frac{1}{T} G' I_a \left( \sum_{i=1}^{q+1} I_i F \bar\beta_i + U \right) \stackrel{p}{\to} [(\tau - \bar\tau_0) Q_{gg}]^{-1} (\tau - \bar\tau_0) Q_{gf}\, \bar\beta_1 = Q_{gg}^{-1} Q_{gf}\, \bar\beta_1,$$
$$\hat\beta_{b[\tau]} = (G' I_b G)^{-1} G' I_b Y = \left( \frac{G' I_b G}{T} \right)^{-1} \frac{1}{T} G' I_b \left( \sum_{i=1}^{q+1} I_i F \bar\beta_i + U \right) \stackrel{p}{\to} [(\bar\tau_{q+1} - \tau) Q_{gg}]^{-1} Q_{gf} \left[ (\bar\tau_1 - \tau)\bar\beta_1 + \sum_{i=1}^{q} (\bar\tau_{i+1} - \bar\tau_i)\, \bar\beta_{i+1} \right] = Q_{gg}^{-1} Q_{gf}\, \lambda_2(\tau),$$
uniformly, where
$$\lambda_2(\tau) = \frac{1}{\bar\tau_{q+1} - \tau} \left[ (\bar\tau_1 - \tau)\bar\beta_1 + \sum_{i=1}^{q} (\bar\tau_{i+1} - \bar\tau_i)\, \bar\beta_{i+1} \right], \quad \tau \in [\bar\tau_0, \bar\tau_1).$$
Define
$$R(\tau, \tau_v, \tau_l) = \sigma^2 + \sum_{i=1}^{q+1} (\bar\tau_i - \bar\tau_{i-1})\, \bar\beta_i' Q_{ff} \bar\beta_i - (\tau - \bar\tau_0)\, \bar\beta_1' Q \bar\beta_1 - (\bar\tau_{q+1} - \tau)\, \lambda_2(\tau)' Q \lambda_2(\tau)$$
on [τ̄_0, τ̄_1), where Q = Q_{fg} Q_{gg}^{-1} Q_{gf} is a symmetric positive definite matrix. Let T* = (τ̄_{q+1} − τ̄_0)T. By Assumptions (A1) to (A8) and the triangle inequality, we have
$$\sup_{\tau\in[\bar\tau_0,\bar\tau_1)} \left| \frac{1}{T^*} RSS(\tau, \tau_v, \tau_l) - R(\tau, \tau_v, \tau_l) \right|
\le \left| \frac{1}{T^*} U'U - \sigma^2 \right| + \left| \frac{1}{T^*} \sum_{i=1}^{q+1} \bar\beta_i' F' I_i F \bar\beta_i - \sum_{i=1}^{q+1} (\bar\tau_i - \bar\tau_{i-1})\, \bar\beta_i' Q_{ff} \bar\beta_i \right|$$
$$+ \sup_{\tau\in[\bar\tau_0,\bar\tau_1)} \left| \frac{1}{T^*} \left( \hat\beta_{a[\tau]}' G' I_a G \hat\beta_{a[\tau]} - 2 \hat\beta_{a[\tau]}' G' I_a F \bar\beta_1 \right) + (\tau - \bar\tau_0)\, \bar\beta_1' Q \bar\beta_1 \right|$$
$$+ \sup_{\tau\in[\bar\tau_0,\bar\tau_1)} \left| \frac{1}{T^*} \left( \hat\beta_{b[\tau]}' G' I_b G \hat\beta_{b[\tau]} - 2 \hat\beta_{b[\tau]}' G' (I_1 - I_a) F \bar\beta_1 - 2 \sum_{i=2}^{q+1} \hat\beta_{b[\tau]}' G' I_i F \bar\beta_i \right) + (\bar\tau_{q+1} - \tau)\, \lambda_2(\tau)' Q \lambda_2(\tau) \right|$$
$$+ \frac{2}{T^*} \left| U' \sum_{i=1}^{q+1} I_i F \bar\beta_i \right| + \sup_{\tau\in[\bar\tau_0,\bar\tau_1)} \frac{2}{T^*} \left| U' I_a G \hat\beta_{a[\tau]} \right| + \sup_{\tau\in[\bar\tau_0,\bar\tau_1)} \frac{2}{T^*} \left| U' I_b G \hat\beta_{b[\tau]} \right| = o_p(1),$$
since each individual term above is o_p(1). Thus,
$$\frac{1}{T} RSS(\tau, \tau_v, \tau_l) \stackrel{p}{\to} R(\tau, \tau_v, \tau_l)$$
uniformly, where R(τ, τ_v, τ_l) is a piecewise concave function of τ defined on (τ_v, τ_l). Meanwhile,
$$\frac{\partial R(\tau, \tau_v, \tau_l)}{\partial \tau} = -\bar\beta_1' Q \bar\beta_1 + \lambda_2(\tau)' Q \lambda_2(\tau) - 2(\bar\tau_{q+1} - \tau)\, \lambda_2(\tau)' Q \frac{\partial \lambda_2(\tau)}{\partial \tau}$$
$$= -\bar\beta_1' Q \bar\beta_1 + \lambda_2(\tau)' Q \lambda_2(\tau) - 2 \lambda_2(\tau)' Q \left(\lambda_2(\tau) - \bar\beta_1\right)$$
$$= -\left(\lambda_2(\tau) - \bar\beta_1\right)' Q \left(\lambda_2(\tau) - \bar\beta_1\right) = -\left( \frac{1}{\bar\tau_{q+1} - \tau} \right)^2 L_1' Q L_1 \le 0,$$
where
$$L_1 = \sum_{i=1}^{q} (\bar\tau_{i+1} - \bar\tau_i)\, \bar\beta_{i+1} - (\bar\tau_{q+1} - \bar\tau_1)\bar\beta_1,$$
and
$$\frac{\partial^2 R(\tau, \tau_v, \tau_l)}{\partial \tau^2} = -\frac{2}{(\bar\tau_{q+1} - \tau)^3} L_1' Q L_1 \le 0.$$
Therefore, (1/T)RSS(τ, τ_v, τ_l) converges uniformly to a concave function of τ for τ ∈ [τ̄_0, τ̄_1).

For τ ∈ [τ̄_h, τ̄_{h+1}), h = 1, 2, ..., q − 1, we have I_a I_i = I_i, 1 ≤ i ≤ h; I_a I_{h+1} = I_a − Σ_{i=1}^{h} I_i; I_a I_i = 0, h + 1 < i ≤ q + 1; I_b I_i = 0, i ≤ h; I_b I_{h+1} = Σ_{i=1}^{h+1} I_i − I_a; and I_b I_i = I_i, h + 1 < i ≤ q + 1.
Thus,
$$\hat\beta_{a[\tau]} = (G' I_a G)^{-1} G' I_a Y = \left( \frac{G' I_a G}{T} \right)^{-1} \frac{1}{T} G' I_a \left( \sum_{i=1}^{q+1} I_i F \bar\beta_i + U \right)$$
$$= \left( \frac{G' I_a G}{T} \right)^{-1} \frac{1}{T} \left( \sum_{i=1}^{h} G' I_i F \bar\beta_i + G' I_a F \bar\beta_{h+1} - \sum_{i=1}^{h} G' I_i F \bar\beta_{h+1} + G' I_a U \right)$$
$$\stackrel{p}{\to} [(\tau - \bar\tau_0) Q_{gg}]^{-1} Q_{gf} \left[ \sum_{i=1}^{h} (\bar\tau_i - \bar\tau_{i-1})\, \bar\beta_i + (\tau - \bar\tau_h)\, \bar\beta_{h+1} \right] = Q_{gg}^{-1} Q_{gf}\, \lambda_1(\tau)$$
uniformly, where
$$\lambda_1(\tau) = \frac{1}{\tau - \bar\tau_0} \left[ \sum_{i=1}^{h} (\bar\tau_i - \bar\tau_{i-1})\, \bar\beta_i + (\tau - \bar\tau_h)\, \bar\beta_{h+1} \right], \quad \tau \in [\bar\tau_h, \bar\tau_{h+1}),$$
and
$$\hat\beta_{b[\tau]} = (G' I_b G)^{-1} G' I_b Y = \left( \frac{G' I_b G}{T} \right)^{-1} \frac{1}{T} G' \left[ \left( \sum_{i=1}^{h+1} I_i - I_a \right) F \bar\beta_{h+1} + \sum_{i=h+2}^{q+1} I_i F \bar\beta_i + I_b U \right]$$
$$\stackrel{p}{\to} [(\bar\tau_{q+1} - \tau) Q_{gg}]^{-1} Q_{gf} \left[ \sum_{i=h+2}^{q+1} (\bar\tau_i - \bar\tau_{i-1})\, \bar\beta_i + (\bar\tau_{h+1} - \tau)\, \bar\beta_{h+1} \right] = Q_{gg}^{-1} Q_{gf}\, \lambda_2(\tau)$$
uniformly, where
$$\lambda_2(\tau) = \frac{1}{\bar\tau_{q+1} - \tau} \left[ \sum_{i=h+2}^{q+1} (\bar\tau_i - \bar\tau_{i-1})\, \bar\beta_i + (\bar\tau_{h+1} - \tau)\, \bar\beta_{h+1} \right], \quad \tau \in [\bar\tau_h, \bar\tau_{h+1}).$$
Define
$$R(\tau, \tau_v, \tau_l) = \sigma^2 + \sum_{i=1}^{q+1} (\bar\tau_i - \bar\tau_{i-1})\, \bar\beta_i' Q_{ff} \bar\beta_i - (\tau - \bar\tau_0)\, \lambda_1(\tau)' Q \lambda_1(\tau) - (\bar\tau_{q+1} - \tau)\, \lambda_2(\tau)' Q \lambda_2(\tau)$$
on [τ̄_h, τ̄_{h+1}). It follows that
$$\sup_{\tau\in[\bar\tau_h,\bar\tau_{h+1})} \left| \frac{1}{T^*} RSS(\tau, \tau_v, \tau_l) - R(\tau, \tau_v, \tau_l) \right|
\le \left| \frac{1}{T^*} U'U - \sigma^2 \right| + \left| \frac{1}{T^*} \sum_{i=1}^{q+1} \bar\beta_i' F' I_i F \bar\beta_i - \sum_{i=1}^{q+1} (\bar\tau_i - \bar\tau_{i-1})\, \bar\beta_i' Q_{ff} \bar\beta_i \right|$$
$$+ \sup_{\tau\in[\bar\tau_h,\bar\tau_{h+1})} \left| \frac{1}{T^*} \left( \hat\beta_{a[\tau]}' G' I_a G \hat\beta_{a[\tau]} - 2 \sum_{i=1}^{h} \hat\beta_{a[\tau]}' G' I_i F \bar\beta_i - 2 \hat\beta_{a[\tau]}' G' \Big( I_a - \sum_{i=1}^{h} I_i \Big) F \bar\beta_{h+1} \right) + (\tau - \bar\tau_0)\, \lambda_1(\tau)' Q \lambda_1(\tau) \right|$$
$$+ \sup_{\tau\in[\bar\tau_h,\bar\tau_{h+1})} \left| \frac{1}{T^*} \left( \hat\beta_{b[\tau]}' G' I_b G \hat\beta_{b[\tau]} - 2 \hat\beta_{b[\tau]}' G' \Big( \sum_{i=1}^{h+1} I_i - I_a \Big) F \bar\beta_{h+1} - 2 \sum_{i=h+2}^{q+1} \hat\beta_{b[\tau]}' G' I_i F \bar\beta_i \right) + (\bar\tau_{q+1} - \tau)\, \lambda_2(\tau)' Q \lambda_2(\tau) \right|$$
$$+ \frac{2}{T^*} \left| U' \sum_{i=1}^{q+1} I_i F \bar\beta_i \right| + \sup_{\tau\in[\bar\tau_h,\bar\tau_{h+1})} \frac{2}{T^*} \left| U' I_a G \hat\beta_{a[\tau]} \right| + \sup_{\tau\in[\bar\tau_h,\bar\tau_{h+1})} \frac{2}{T^*} \left| U' I_b G \hat\beta_{b[\tau]} \right| = o_p(1).$$
Meanwhile,
$$\frac{\partial R(\tau, \tau_v, \tau_l)}{\partial \tau} = -\left[ \lambda_1(\tau)' Q \lambda_1(\tau) - \lambda_2(\tau)' Q \lambda_2(\tau) + 2(\tau - \bar\tau_0)\, \lambda_1(\tau)' Q \frac{\partial \lambda_1(\tau)}{\partial \tau} + 2(\bar\tau_{q+1} - \tau)\, \lambda_2(\tau)' Q \frac{\partial \lambda_2(\tau)}{\partial \tau} \right]$$
$$= -\left[ \lambda_1(\tau)' Q \lambda_1(\tau) - \lambda_2(\tau)' Q \lambda_2(\tau) + 2 \lambda_1(\tau)' Q \left(\bar\beta_{h+1} - \lambda_1(\tau)\right) + 2 \lambda_2(\tau)' Q \left(\lambda_2(\tau) - \bar\beta_{h+1}\right) \right]$$
$$= \lambda_1(\tau)' Q \lambda_1(\tau) - \lambda_2(\tau)' Q \lambda_2(\tau) - 2 \bar\beta_{h+1}' Q \left(\lambda_1(\tau) - \lambda_2(\tau)\right)$$
$$= \left(\bar\beta_{h+1} - \lambda_1(\tau)\right)' Q \left(\bar\beta_{h+1} - \lambda_1(\tau)\right) - \left(\bar\beta_{h+1} - \lambda_2(\tau)\right)' Q \left(\bar\beta_{h+1} - \lambda_2(\tau)\right),$$
and
$$\frac{\partial^2 R(\tau, \tau_v, \tau_l)}{\partial \tau^2} = -\frac{2}{\tau - \bar\tau_0} \left(\bar\beta_{h+1} - \lambda_1(\tau)\right)' Q \left(\bar\beta_{h+1} - \lambda_1(\tau)\right) - \frac{2}{\bar\tau_{q+1} - \tau} \left(\bar\beta_{h+1} - \lambda_2(\tau)\right)' Q \left(\bar\beta_{h+1} - \lambda_2(\tau)\right) \le 0.$$
Therefore, (1/T)RSS(τ, τ_v, τ_l) converges uniformly to a concave function of τ for τ ∈ [τ̄_h, τ̄_{h+1}), h = 1, 2, ..., q − 1.

For τ ∈ [τ̄_q, τ̄_{q+1}], we have I_a I_i = I_i, i = 1, 2, ..., q; I_a I_{q+1} = I_a − Σ_{i=1}^{q} I_i; I_b I_i = 0, i = 1, 2, ..., q; and I_b I_{q+1} = I_b. Similarly, we can prove that
$$\hat\beta_{a[\tau]} \stackrel{p}{\to} Q_{gg}^{-1} Q_{gf}\, \lambda_1(\tau),$$
where
$$\lambda_1(\tau) = \frac{1}{\tau - \bar\tau_0} \left[ \sum_{i=1}^{q} (\bar\tau_i - \bar\tau_{i-1})\, \bar\beta_i + (\tau - \bar\tau_q)\, \bar\beta_{q+1} \right], \quad \tau \in [\bar\tau_q, \bar\tau_{q+1}],$$
and
$$\hat\beta_{b[\tau]} \stackrel{p}{\to} Q_{gg}^{-1} Q_{gf}\, \bar\beta_{q+1}.$$
Define
$$R(\tau, \tau_v, \tau_l) = \sigma^2 + \sum_{i=1}^{q+1} (\bar\tau_i - \bar\tau_{i-1})\, \bar\beta_i' Q_{ff} \bar\beta_i - (\tau - \bar\tau_0)\, \lambda_1(\tau)' Q \lambda_1(\tau) - (\bar\tau_{q+1} - \tau)\, \bar\beta_{q+1}' Q \bar\beta_{q+1}$$
on [τ̄_q, τ̄_{q+1}]; it follows that
$$\sup_{\tau\in[\bar\tau_q,\bar\tau_{q+1}]} \left| \frac{1}{T^*} RSS(\tau, \tau_v, \tau_l) - R(\tau, \tau_v, \tau_l) \right| = o_p(1).$$
Also,
$$\frac{\partial R(\tau, \tau_v, \tau_l)}{\partial \tau} = \frac{1}{\tau^2} L_2' Q L_2 \ge 0,$$
where
$$L_2 = \sum_{i=1}^{q} (\bar\tau_i - \bar\tau_{i-1})\, \bar\beta_i - \bar\tau_q \bar\beta_{q+1},$$
and
$$\frac{\partial^2 R(\tau, \tau_v, \tau_l)}{\partial \tau^2} = -\frac{2}{\tau^3} L_2' Q L_2 \le 0.$$
Combining the results above, we have proved that (1/T)RSS(τ, τ_v, τ_l) converges to a piecewise concave function R(τ, τ_v, τ_l) on any segment (τ_v T, τ_l T) of [0, T]. With little modification, one can show that
$$\frac{1}{T} RSS(\tau, 0, 1) \stackrel{p}{\to} R(\tau, 0, 1), \quad \tau \in [\underline\tau, \bar\tau],$$
uniformly, where R(τ, 0, 1) is a piecewise concave function of τ on (0, 1) and is defined as
$$R(\tau, 0, 1) = \sigma^2 + \sum_{i=1}^{p+1} (\tau_i - \tau_{i-1})\, \beta_i' Q_{ff} \beta_i - \tau\, \lambda_1(\tau)' Q \lambda_1(\tau) - (1-\tau)\, \lambda_2(\tau)' Q \lambda_2(\tau),$$
with λ_1(τ) and λ_2(τ) as defined in Section 3. □

Proof of Theorem 3.1: Recall that the true model has the form
$$Y = \sum_{i=1}^{p+1} I_i F \beta_i + U. \tag{A.2}$$
Let F_t = f(x_t) = (f_1(x_{t1}), ..., f_L(x_{tL}))' and similarly let G_t = g(x_t). Thus F = (F_1, F_2, ..., F_T)' and G = (G_1, G_2, ..., G_T)'. Projecting F_t on G_t, we have
$$F_t = C G_t + e_t, \tag{A.3}$$
where C is an L × N matrix of coefficients. Note that the rank of C is L, which is less than or equal to N. This follows from
$$\frac{1}{T} \sum_{t=1}^{T} F_t G_t' = C\, \frac{1}{T} \sum_{t=1}^{T} G_t G_t' + \frac{1}{T} \sum_{t=1}^{T} e_t G_t'.$$
From the orthogonality of G_t and e_t due to the projection, we have, in the limit, Q_{fg} = C Q_{gg}. Because the rank of Q_{fg} is L and the rank of Q_{gg} is N (N ≥ L), the rank of C must be L (full row rank). Let e = (e_1, e_2, ..., e_T)'. The model can be rewritten as
$$Y = \sum_{i=1}^{p+1} I_i (G C' + e) \beta_i + U = \sum_{i=1}^{p+1} I_i G \theta_i + V, \tag{A.4}$$
where θ_i = C'β_i and
$$V = \sum_{i=1}^{p+1} I_i e \beta_i + U.$$
Furthermore, β_{i+1} − β_i ≠ 0 implies θ_{i+1} − θ_i = C'(β_{i+1} − β_i) ≠ 0, because C' = Q_{gg}^{-1} Q_{gf} has rank L and β_i is L × 1. Thus, the model reduces to a standard multiple-change model with G considered as the true regressors and the θ_i as the new coefficients. From this point of view, the estimated model only has misspecification in the number of break points. Now, invoking the results of Bai (1997), τ̂_T is T-consistent for one of the true breaks. This completes the proof of Theorem 3.1.⁴ □

4. The proof uses the fact that Q_{gf}(β_{i+1} − β_i) ≠ 0, which is a necessary condition for consistent estimation of the break point. This is because the condition implies the existence of a break point in the projected model. The condition that Q_{gf}(β_{i+1} − β_i) ≠ 0 (i = 1, 2, ..., m) does not always require Q_{gf} to be of full column rank for a given parameter configuration β. However, in order for the condition to hold for all configurations of β, the matrix Q_{gf} must be of full column rank. Because Theorem 3.1 makes a general statement that covers all configurations of β, the full column rank requirement for Q_{gf} becomes necessary.

Proof of Theorem 3.3: Rewrite f(x_t) = Q_{fg} Q_{gg}^{-1} g(x_t) + e_t, where Q_{gg} = E(g_t g_t'). By definition, E(e_t g_t') = 0, where g_t = g(x_t). In matrix notation, F = G Q_{gg}^{-1} Q_{gf} + e, where e = F − G Q_{gg}^{-1} Q_{gf}. Thus,
$$Y = G\delta + e\beta + \varepsilon = G\delta + v,$$
where δ = Q_{gg}^{-1} Q_{gf} β and v = eβ + ε. Under the null, we have
$$\hat\beta = (G'G)^{-1} G' Y = \delta + (G'G)^{-1} G' v.$$
Thus,
$$\hat\beta - \delta = (G'G)^{-1} G' (e\beta + \varepsilon).$$
While e_t is uncorrelated with g_t, (β'e_t)² is not necessarily uncorrelated with g_t g_t' under measurement error or misspecification. This means that the equality E[(β'e_t)² g_t g_t'] = E(β'e_t)² E(g_t g_t') does not hold in general. Thus, we need to use the HAC-robust Wald test. The whole-sample estimator β̂ has the limiting distribution
$$\sqrt{T}\, (\hat\beta - \delta) \stackrel{d}{\to} N(0, \Omega),$$
where Ω = Q_{gg}^{-1} V_L Q_{gg}^{-1}, with V_L the long-run covariance matrix of g_t v_t = g_t e_t' β + g_t ε_t. For the first subsample, we have
$$\sqrt{T}\, (\hat\beta_{a[\tau]} - \delta) \stackrel{d}{\to} \frac{1}{\tau}\, \Omega^{1/2} B(\tau),$$
where B(τ) is a vector of independent Brownian motions. Similarly, for the second-subsample estimator, we have
$$\sqrt{T}\, (\hat\beta_{b[\tau]} - \delta) \stackrel{d}{\to} \frac{1}{1-\tau}\, \Omega^{1/2} [B(1) - B(\tau)].$$
Combining the two results, we have
$$\sqrt{T}\, (\hat\beta_{a[\tau]} - \hat\beta_{b[\tau]}) \stackrel{d}{\to} \frac{1}{\tau(1-\tau)}\, \Omega^{1/2} [B(\tau) - \tau B(1)].$$
Assuming no serial correlation for simplicity, the White (1980) estimator for the covariance matrix V_L is
$$\hat V_L = \frac{1}{T} \sum_{t=1}^{T} g_t g_t' \hat v_t^2.$$
The Wald test satisfies
$$T \tau (1-\tau)\, (\hat\beta_{a[\tau]} - \hat\beta_{b[\tau]})' \hat\Omega^{-1} (\hat\beta_{a[\tau]} - \hat\beta_{b[\tau]}) \stackrel{d}{\to} \frac{\left\| B(\tau) - \tau B(1) \right\|^2}{\tau(1-\tau)},$$
where $\hat\Omega = \hat Q_{gg}^{-1} \hat V_L \hat Q_{gg}^{-1}$ and $\hat Q_{gg} = \frac{G'G}{T}$. □
Econometrics Journal (2008), volume 11, pp. 308–325. doi: 10.1111/j.1368-423X.2008.00239.x
Representation theorem for convex nonparametric least squares

TIMO KUOSMANEN

Economic Research Unit, MTT Agrifood Research, Finland. E-mail: [email protected]

First version received: May 2007; final version accepted: January 2008
Summary We examine a nonparametric least-squares regression model that endogenously selects the functional form of the regression function from the family of continuous, monotonic increasing and globally concave functions that can be nondifferentiable. We show that this family of functions can be characterized, without loss of generality, by a subset of continuous, piece-wise linear functions whose intercept and slope coefficients are constrained to satisfy the required monotonicity and concavity conditions. This representation theorem is useful in at least three respects. First, it enables us to derive an explicit representation for the regression function, which can be used for assessing marginal properties and for the purposes of forecasting and ex post economic modelling. Second, it enables us to transform the infinite dimensional regression problem into a tractable quadratic programming (QP) form, which can be solved by standard QP algorithms and solver software. Importantly, the QP formulation applies to the general multiple regression setting. Third, an operational computational procedure enables us to apply bootstrap techniques to draw statistical inference. Keywords: Concavity, Convexity, Curve fitting, Linear splines, Local linear approximation, Nonparametric methods, Regression analysis.
1. INTRODUCTION

Nonparametric regression techniques that avoid strong prior assumptions about the functional form are attracting increasing attention in econometrics. Nonparametric least squares subject to continuity, monotonicity and concavity constraints [henceforth referred to as Convex Nonparametric Least Squares (CNLS)] is one of the oldest approaches, dating back to the seminal work by Hildreth (1954). This method draws its power from shape constraints that coincide with the standard regularity conditions of microeconomic theory (see e.g. Varian, 1982, 1984). In contrast to kernel regression and spline smoothing techniques, CNLS does not require specification of a smoothing parameter. Thus, CNLS circumvents the fundamental bias-variance tradeoff (see e.g. Yatchew, 2003, for discussion) associated with most other nonparametric regression techniques. Earlier work on CNLS has mainly focused on statistical properties, and the essential aspects of the CNLS estimators are nowadays well understood. The maximum-likelihood interpretation of CNLS was already noted by Hildreth (1954), and Hanson and Pledger (1976) proved its consistency. More recently, Nemirovskii et al. (1985), Mammen (1991) and Mammen and Thomas-Agnan (1999) have shown that CNLS achieves the standard nonparametric rate of
convergence $O_P(n^{-1/(2+m)})$, where n is the number of observations and m is the number of regressors. Imposing further smoothness assumptions or derivative bounds can improve the rate of convergence and alleviate the curse of dimensionality (see e.g. Mammen, 1991, Yatchew, 1998, Mammen and Thomas-Agnan, 1999, and Yatchew and Härdle, in press). Groeneboom et al. (2001) have derived the asymptotic distribution of the univariate CNLS estimator at a fixed point. Despite the attractive theoretical properties, empirical applications of CNLS remain scarce. Many factors restrict the diffusion of CNLS to econometrics. In our view, the three major barriers are (1) the lack of an explicit regression function, (2) computational complexity and (3) the difficulty of statistical inference. The lack of a tractable closed-form expression for the CNLS regression function presents a clear disadvantage relative to the parametric methods and some other nonparametric techniques such as kernel smoothing. Indeed, economists are often interested in the marginal properties and elasticities of the function, which cannot be assessed based on the discrete set of fitted values provided by CNLS. Moreover, a simple closed-form expression of the regression function is necessary for using the regression results in ex post economic modelling (e.g. using estimated utility and production functions in a computable general equilibrium model). The computational complexity of CNLS is due to the fact that the functional form of the regression function is not assumed a priori, but is endogenously selected from an infinitely large family of continuous, monotonic increasing and concave functions. Efficient computational algorithms have been developed by Wu (1982), Fraser and Massam (1989), Goldman and Ruud (1995), Ruud (1996) and Meyer (1999), but the implementation of these procedures requires considerable programming skills. More importantly, most existing computational procedures are restricted to the single-regressor case where the observations can be sorted according to the explanatory variable. These algorithms cannot be generalized (even in principle) to the multiple-regressor setting involving a vector of regressors. In principle, conventional methods of statistical inference could be adapted to the context of CNLS. However, the degrees of freedom depend on the number of different hyperplane segments or the number of observations projected to a given segment (see Meyer, 2003, 2006, for discussion). Moreover, the segments are endogenously determined in the model, and the coefficients of the segments may not be unique in the multiple regression setting. For these reasons, the bootstrap approach appears to be the most promising tool for statistical inference (see e.g. Efron, 1979, 1982, and Efron and Tibshirani, 1993). However, implementing computationally intensive bootstrap simulations requires a fast, tractable algorithm for computing the estimator. This paper presents a representation theorem that helps us to overcome (or at least lower) each of these three barriers. Firstly, drawing insight from the celebrated Afriat's Theorem, we derive an explicit representor function which can be used for assessing marginal properties and for the purposes of forecasting and ex post economic modelling. The representor function is a simple piece-wise linear function that is easy to compute given the coefficients estimated by CNLS.
Secondly, the representation theorem is useful from the computational point of view: it enables us to transform the infinite dimensional CNLS problem into a tractable quadratic programming (QP) form, which can be solved by standard QP algorithms and solver software. Importantly, the QP formulation applies to the general multiple regression setting. Thirdly, existence of a tractable computational procedure enables one to apply computationally intensive bootstrap or
Monte Carlo simulations to draw statistical inference or to assess the small-sample performance of the estimator, respectively. Finally, in addition to tackling these three barriers, we point out a number of interesting links between CNLS and parallel developments in the literature. The rest of the paper is organized as follows. Section 2 presents our main result. Section 3 applies the result to formulate the infinite dimensional CNLS problem as a finite dimensional QP problem. Section 4 derives an explicit representor function that provides a first-order approximation to any arbitrary regression function in the neighbourhood of the observed points. Section 5 illustrates the potential of the method by means of Monte Carlo simulations. Section 6 presents a concluding discussion and points out some directions for future research. In the interest of readability, all formal proofs of mathematical theorems are presented in Appendix A. A GAMS code for computing the CNLS regression is presented in Appendix B.
2. THE REPRESENTATION THEOREM

Consider the canonical multiple regression model
$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{2.1}$$
where y_i is the dependent variable, f is an unknown regression function to be estimated, x_i ∈ R^m is the vector of explanatory variables and ε_i is the idiosyncratic error term. The errors ε = (ε_1, ..., ε_n)' ∈ R^n are assumed to be uncorrelated random variables with E(ε) = 0 and Var(ε_i) = σ² < ∞ for all i = 1, ..., n (i.e. the Gauss–Markov conditions). The data set of n observations is denoted by (X, y), with y = (y_1, ..., y_n)' ∈ R^n and X = (x_1, ..., x_n) ∈ R^{m×n}. In contrast to the linear and nonlinear parametric approaches, we assume no particular functional form for f a priori. Instead, we impose the more general condition that f belongs to the set of continuous, monotonic increasing and globally concave functions, denoted by
$$F_2 = \left\{ f: \mathbb{R}^m \to \mathbb{R} \,\middle|\, \begin{array}{l} \forall x, x' \in \mathbb{R}^m: x \ge x' \Rightarrow f(x) \ge f(x'); \\ \forall x', x'' \in \mathbb{R}^m: x = \lambda x' + (1-\lambda)x'',\ \lambda \in [0,1] \Rightarrow f(x) \ge \lambda f(x') + (1-\lambda) f(x'') \end{array} \right\}. \tag{2.2}$$
The rationale behind the continuity, monotonicity and concavity postulates lies in their central role in microeconomic theory (see e.g. Varian, 1982, 1984). The CNLS problem is to find f ∈ F_2 that minimizes the sum of squares of the residuals; formally:
$$\min_{f} \sum_{i=1}^{n} (y_i - f(x_i))^2 \quad \text{s.t.} \quad f \in F_2. \tag{2.3}$$
In other words, the CNLS estimator of f is a monotonic increasing and concave function that minimizes the $L_2$-norm of the residuals. The CNLS problem (2.3) does not restrict f beforehand to any particular functional form, but selects the best-fitting function from the family F_2, which includes an infinite number of functions. This makes problem (2.3) generally hard to solve. The single-regressor algorithms developed by Wu (1982), Fraser and Massam (1989) and Meyer (1999) require that the data be sorted in ascending order according to the scalar-valued regressor x. However, such a sorting is not possible in the general multiple regression setting where x is a vector.
The Sobolev least-squares models (e.g. Wahba, 1990) differ from CNLS in that the functions f must be differentiable (smooth) at every point of the domain and the Sobolev norm of f is bounded from above. The Sobolev models that impose monotonicity and concavity (e.g. Yatchew and Bos, 1997, and Yatchew and Härdle, in press) are hence constrained variants of (2.3). Inspired by that literature, we next try to identify a subset of representor functions G ⊂ F_2 such that replacing the constraint f ∈ F_2 by f ∈ G does not influence the optimal solution of problem (2.3) but makes it easier to solve. Consider the following family of piece-wise linear functions:
$$G_2(X) = \Big\{ g: \mathbb{R}^m \to \mathbb{R} \;\Big|\; g(x) = \min_{i\in\{1,\ldots,n\}} \{\alpha_i + \beta_i' x\}; \tag{2.4}$$
$$\beta_i \ge 0 \quad \forall i = 1, \ldots, n; \tag{2.5}$$
$$\alpha_i + \beta_i' x_i \le \alpha_h + \beta_h' x_i \quad \forall h, i = 1, \ldots, n \Big\}. \tag{2.6}$$
Clearly, functions g ∈ G_2(X) are continuous, monotonic increasing and globally concave for any arbitrary X. Hence G_2 ⊂ F_2. The following theorem shows that this class of functions can be used as representors when solving the infinite dimensional CNLS problem (2.3).

THEOREM 2.1. Given arbitrary finite real-valued data (X, y), denote the optimal solution to the CNLS problem (2.3) by s_f² and let
$$s_g^2 \equiv \min_{g} \sum_{i=1}^{n} (y_i - g(x_i))^2 \quad \text{s.t.} \quad g \in G_2(X). \tag{2.7}$$
Then s_f² = s_g².

This result augments the representation theorems of the Sobolev least squares (e.g. Yatchew and Bos, 1997) to the nonsmooth CNLS setting. A number of parallel results are known in the literature. In microeconomic theory, the celebrated Afriat's Theorem relates continuous, monotonic, concave utility functions with piece-wise linear representors in a directly analogous way (Afriat, 1967, and Varian, 1982). In the productive efficiency literature, Banker and Maindiratta (1992) have applied the Afriat inequalities in the maximum-likelihood estimation of frontier production functions perturbed by skewed, non-Gaussian error terms. In the present context of nonparametric regression, the possibility of using the Afriat inequalities to model concavity/convexity constraints has been briefly suggested by Matzkin (1994, 1999) and Yatchew (1998); Theorem 2.1 confirms and further elaborates these conjectures. In the context of limited dependent variable models, Matzkin (1991, 1992) has employed the Afriat inequalities to develop consistent semi- and nonparametric estimators for the consumer's utility function. Finally, Mammen (1991) has derived a similar theorem for a class of nonparametric regression functions constrained by qualitative shape restrictions, showing that these functions can be represented by a class of piece-wise monotonic or piece-wise concave/convex splines. The link between CNLS and spline smoothing becomes evident if we interpret the piece-wise linear representors g ∈ G_2(X) as linear spline functions. In contrast to the spline functions, however, here the partition into the linear segments is not fixed a priori. Indeed, the number and the location of the segments are endogenously determined to maximize the empirical fit. Theorem 2.1 implies that the CNLS problem (2.3) can be equivalently stated as a linear spline
smoothing problem where the knots (i.e. the vertices of the hyperplane segments) are optimally selected to minimize the sum of squares of residuals subject to the monotonicity and concavity constraints on the spline function. It is worth emphasizing that the knots do not generally coincide with the observed points, but typically occur somewhere between them (see Figure 1 below for a graphical illustration). Moreover, the number of knots is usually a small fraction of n. Theorem 2.1 extends in a straightforward fashion to globally convex and/or monotonic decreasing functions. In the case of a convex regression function, the signs of the inequality constraints (2.6) should be reversed. Similarly, a monotonic decreasing function is obtained by reversing the signs of the inequalities in (2.5). One could easily impose further assumptions such as linear homogeneity. If the function f is known to be homogeneous of degree one (e.g. if f is a production function exhibiting constant returns to scale or an expected utility function exhibiting risk neutrality), we may simply impose the additional constraint α_i = 0 (or delete α_i altogether) in (2.4)–(2.6). This guarantees that the functions g pass through the origin. On the other hand, the monotonicity constraints (2.5) could be relaxed to estimate (inverse) U-shaped curves. However, removing the concavity constraints (2.6) does not directly lead us to the isotonic regression model considered by Barlow et al. (1972) and Sasabuchi et al. (1983). Establishing formal links between CNLS and isotonic regression formulations and exploring the intermediate cases of quasi-concave/convex regression are left as a challenge for future research.
3. QUADRATIC PROGRAMMING FORMULATION

Theorem 2.1 is important from the computational point of view. It enables us to transform the infinite dimensional problem (2.3) into the finite dimensional QP problem (2.7). This QP formulation can be expressed more intuitively as
$$\min_{\varepsilon,\alpha,\beta} \sum_{i=1}^{n} \varepsilon_i^2$$
$$y_i = \alpha_i + \beta_i' x_i + \varepsilon_i \quad \forall i = 1, \ldots, n,$$
$$\alpha_i + \beta_i' x_i \le \alpha_h + \beta_h' x_i \quad \forall h, i = 1, \ldots, n,$$
$$\beta_i \ge 0 \quad \forall i = 1, \ldots, n. \tag{3.1}$$
The first constraint of (3.1) can be interpreted as the regression equation; note that the coefficients α_i, β_i are specific to each observation i = 1, …, n. The second constraint is a system of Afriat inequalities that guarantee concavity [equivalent to (2.6)]. The Afriat inequalities are the key to modelling concavity constraints in the general multiple-regressor setting. The third constraint ensures monotonicity [equivalent to (2.5)]. QP problems represent one of the simplest classes of nonlinear optimization problems; many sophisticated algorithms and powerful solvers are nowadays available for such problems.1
1 QP is a standard class of problems within nonlinear programming (NLP). The quadratic objective function implies that the first-order conditions are linear. Hence, QP problems are amenable to the standard simplex and interior point algorithms developed for linear programming. A variety of commercial and shareware solver software is available for solving QP problems. High-performance QP solvers include, e.g., CPLEX, LINDO, MOSEK and QPOPT, but also general NLP solvers such as MINOS and BQPD can handle QP problems. Most solvers can be integrated with standard mathematical modelling systems/languages such as GAMS, Gauss, Mathematica and Matlab.
The QP formulation of the CNLS problem presents a clear advantage over the earlier computational algorithms that only apply in the single-regressor case (e.g. Wu, 1982, Fraser and Massam, 1989, and Meyer, 1999). It is worth noting the structural similarity between problem (3.1) and the varying coefficient (or random parameters) regression models (e.g. Fan and Zhang, 1999, and Greene, 2005). The varying coefficient models assume a conditional linear structure that allows the coefficients of the linear regression function to differ across (groups of) observations, similar to (3.1). Interestingly, our representation theorem shows that the varying coefficient approach (conventionally used for estimating n different regression functions of the same a priori specified functional form) can be used for estimating n tangent hyperplanes to a single, unspecified regression function. The piece-wise linear structure of CNLS also resembles the nonparametric data envelopment analysis (DEA) frontiers (compare with Banker and Maindiratta, 1992). The key difference between CNLS and DEA concerns the treatment of the residuals ε_i. In DEA, the residual term is interpreted as deterministic inefficiency that can only take negative values. The standard variable returns to scale DEA model is obtained as a special case of (3.1) if the residuals are constrained to be non-positive [i.e. one adds the constraint ε_i ≤ 0 ∀i = 1, …, n to problem (3.1); see Kuosmanen, 2006, for details].
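As an illustration of this computational tractability, problem (3.1) can be set up in a few lines with an open-source convex optimization package. The following sketch uses CVXPY on simulated data; the package choice, the data and all variable names are ours, not the paper's:

# Sketch: setting up the CNLS problem (3.1) with CVXPY on simulated data.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 30, 2
X = rng.uniform(1, 10, size=(n, m))                 # observed regressors
y = np.log(X).sum(axis=1) + rng.normal(0, 0.3, n)   # concave truth plus noise

alpha = cp.Variable(n)                              # observation-specific intercepts
beta = cp.Variable((n, m), nonneg=True)             # monotonicity: beta_i >= 0
fit = alpha + cp.sum(cp.multiply(beta, X), axis=1)  # alpha_i + beta_i'x_i
resid = y - fit

# Afriat inequalities: alpha_i + beta_i'x_i <= alpha_h + beta_h'x_i for all h, i
constraints = [fit[i] <= alpha[h] + X[i] @ beta[h]
               for i in range(n) for h in range(n) if h != i]

prob = cp.Problem(cp.Minimize(cp.sum_squares(resid)), constraints)
prob.solve()
print("sum of squared residuals:", prob.value)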
4. DERIVING AN EXPLICIT REPRESENTOR FUNCTION

This section takes a more detailed view of the representor functions g. Interestingly, the solution to a complex problem need not itself be very complex: Theorem 2.1 shows that the infinite dimensional optimization problem (2.3) always has an optimal solution that takes the form of a simple piece-wise linear function. Given the estimated coefficients (α̂_i, β̂_i) from (3.1), we can construct the following explicit representor function
\[
\hat{g}(x) \equiv \min_{i \in \{1,\ldots,n\}} \left\{ \hat{\alpha}_i + \hat{\beta}_i' x \right\}. \qquad (4.1)
\]
In principle, function ĝ consists of n hyperplane segments. In practice, however, the estimated coefficients (α̂_i, β̂_i) are clustered to a relatively small number of alternative values: the number of different hyperplane segments is usually much lower than n (see Section 5 for some simulation evidence). When the number of different segments embedded in (4.1) is small, the values of ĝ are easy to enumerate. The simplicity of the representor is an appealing feature for its potential ex post uses in economic modelling. The use of ĝ as an estimator of f is justified by the following result:

COROLLARY 4.1. Denote the set of functions that minimize the CNLS problem (2.3) by F₂*. For any finite real-valued data (X, y), the function ĝ defined by (4.1) and (3.1) is one of the optimal solutions to problem (2.3), that is, ĝ ∈ F₂*.

The representor ĝ and its coefficients (α̂_i, β̂_i) have a compelling interpretation: vector β̂_i can be interpreted as an estimator of the subgradient vector ∇f(x_i), and the equation y = α̂_i + β̂_i'x is an estimator of the tangent hyperplane of f at point x_i. In other words, function ĝ provides a local first-order Taylor series approximation to any f ∈ F₂* in the neighbourhood of the observed
points x_i.² This justifies the use of the representor ĝ for forecasting the values of y not just at the observed points, but also at unobserved points in the neighbourhood of observations. Coefficients β̂_i can also be used for nonparametric estimation of the marginal properties and elasticities. We can calculate the rate of substitution between variables k and m at point x_i as
\[
\frac{\partial \hat{g}(x_i) / \partial x_k}{\partial \hat{g}(x_i) / \partial x_m} = \frac{\hat{\beta}_{ik}}{\hat{\beta}_{im}}, \qquad (4.2)
\]
and further, the elasticity of substitution as
\[
e_{k,m}(x_i) = \frac{\hat{\beta}_{ik}}{\hat{\beta}_{im}} \cdot \frac{x_{im}}{x_{ik}}. \qquad (4.3)
\]
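Given a solution to (3.1), the representor (4.1) and the marginal quantities (4.2)-(4.3) reduce to a few array operations. A minimal sketch, assuming placeholder arrays alpha_hat, beta_hat and the data matrix X are available from an estimation step such as the one sketched above (all names are ours):

# Sketch: evaluating the representor (4.1) and the marginal quantities
# (4.2)-(4.3); alpha_hat and beta_hat are placeholder names for the
# estimated coefficients of problem (3.1).
import numpy as np

def g_hat(x, alpha_hat, beta_hat):
    """Piece-wise linear representor: min over i of alpha_i + beta_i'x."""
    return np.min(alpha_hat + beta_hat @ x)

def substitution_rate(i, k, m, beta_hat):
    """Rate of substitution (4.2) between variables k and m at x_i."""
    return beta_hat[i, k] / beta_hat[i, m]

def elasticity(i, k, m, X, beta_hat):
    """Elasticity of substitution (4.3) at observation i."""
    return (beta_hat[i, k] / beta_hat[i, m]) * (X[i, m] / X[i, k])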
These substitution rates and elasticities are simple to compute given the estimated β̂_i coefficients.
One should note that the optimal solution to problem (2.3) is not necessarily unique; there generally exists a family of alternate optima, denoted by F₂*. The optimal solution to problem (3.1) need not be unique either, although the fitted values and most of the coefficients typically do have a unique solution. The set of alternative representor functions characterized by (3.1) and (4.1) forms a subset of F₂*. The lack of a unique solution might be seen as a serious problem of identification, but it does not render the CNLS model meaningless. In fact, it is possible to derive tight lower and upper bounds for the alternate optima within F₂*. Specifically, functions f ∈ F₂* are bounded by the following piece-wise linear functions
\[
\hat{g}^{\min}(x) = \min_{\alpha \in \mathbb{R},\, \beta \in \mathbb{R}^m} \left\{ \alpha + \beta' x \;\middle|\; \alpha + \beta' x_i \ge \hat{y}_i \;\; \forall i = 1,\ldots,n \right\} \qquad (4.4)
\]
and
\[
\hat{g}^{\max}(x) = \max_{\phi \in \mathbb{R},\, \alpha \in \mathbb{R}^n,\, \beta \in \mathbb{R}^{m \times n}} \left\{ \phi \;\middle|\; \phi \le \alpha_i + \beta_i' x \;\forall i;\; \alpha_i + \beta_i' x_i = \hat{y}_i \;\forall i;\; \alpha_i + \beta_i' x_h \ge \hat{y}_h \;\forall h \ne i \right\}, \qquad (4.5)
\]
where
\[
\hat{y}_i = \hat{g}(x_i) = y_i - \hat{\varepsilon}_i, \quad i = 1,\ldots,n, \qquad (4.6)
\]
denote the fitted values of the dependent variable.

THEOREM 4.1. For any finite real-valued data (X, y), function ĝ^min is the tightest possible lower bound for the family of functions F₂*, and ĝ^max is the tightest possible upper bound for F₂*. Specifically,
\[
\hat{g}^{\min}(x) = \min_{f} \left\{ f(x) \;\middle|\; f \in F_2^* \right\} \qquad (4.7)
\]
and
\[
\hat{g}^{\max}(x) = \max_{f} \left\{ f(x) \;\middle|\; f \in F_2^* \right\} \qquad (4.8)
\]
for all x ∈ R^m.
2 In contrast to the flexible functional forms that can be interpreted as second-order Taylor approximations around a single, unknown expansion point, CNLS uses all n observations as expansion points for the local linear approximation.
Theorem 4.1 further highlights the role of the piece-wise linear functions in CNLS. The boundary functions ĝ^min and ĝ^max are analogous to the under- and overproduction functions derived by Varian (1984). The lower bound ĝ^min satisfies the maintained regularity properties. Moreover, since ĝ^min(x_i) = ŷ_i ∀i = 1, …, n, we have ĝ^min ∈ F₂*: there exists a unique piece-wise linear function that characterizes the lower boundary of family F₂* for all x ∈ R^m. By contrast, the upper bound ĝ^max is not globally concave, and thus ĝ^max ∉ F₂*. For any given x ∈ R^m, the upper bound ĝ^max(x) is achieved by some f ∈ F₂*, but in general, no concave function is able to reach the upper bound of F₂* at all points x ∈ R^m simultaneously.
In small samples, the boundary functions satisfy the inequalities ĝ^min(x) ≤ ĝ(x) ≤ ĝ^max(x) ∀x ∈ R^m. The consistency result by Hanson and Pledger (1976) implies that ĝ^min(x) − ĝ^max(x) converges in probability to 0 as n → ∞. Given a large enough density of observations within the observed range, the lower and upper bounds coincide with the representor ĝ.
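For a fixed evaluation point x, (4.4) is a small linear programme. A minimal sketch using SciPy's linprog, assuming the fitted values ŷ from (3.1) are available (the function and variable names are ours):

# Sketch: the lower bound g_min(x) of (4.4) solved as a linear programme.
# X is the n x m data matrix and y_hat the fitted values from (3.1).
import numpy as np
from scipy.optimize import linprog

def g_min(x, X, y_hat):
    """Minimize alpha + beta'x subject to alpha + beta'x_i >= y_hat_i."""
    n, m = X.shape
    c = np.concatenate(([1.0], x))             # objective: alpha + beta'x
    A_ub = -np.hstack([np.ones((n, 1)), X])    # -(alpha + beta'x_i) <= -y_hat_i
    b_ub = -y_hat
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (m + 1))   # alpha and beta are free
    return res.fun if res.success else -np.inf      # unbounded below the data range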
5. MONTE CARLO SIMULATIONS

5.1. Single regression example

This section examines the CNLS representor function, and the method as a whole, by simple Monte Carlo simulations.3 To gain intuition, we first illustrate the CNLS representor in the single regression case. Suppose the true regression function is of the form f(x) = ln(x) + 1. We drew a random sample of 100 observations of the x values from Uni[1,11], calculated the corresponding true f(x_i) values and perturbed them by adding a random error term drawn independently from N(0, 0.6²). This gives the observed y_i values for the dependent variable. Figure 1 illustrates the observed sample by the scatter plot. The true function f (solid grey curve) is also plotted in the figure.
We solved the QP problem (3.1) using the GAMS software with the MINOS nonlinear programming (NLP) solver. The coefficient of determination was R² = 0.795. The optimal solution to (3.1) provides coefficients α̂_i, β̂_i, which were used for constructing the piece-wise linear representor ĝ plotted in Figure 1. This function consists of six different line segments; the estimated α̂_i, β̂_i coefficients were clustered to six different vectors in this example. Recall that, in contrast to the linear spline smoothing methods, the positions of the line segments are not fixed ex ante, but the number and the length of the segments are endogenously determined within the model. As Figure 1 shows, function ĝ (solid black curve) provides a good approximation of the true f throughout the observed range of x, not only at the observed points but also in their neighbourhood. To appreciate this result, we also fitted the log-linear Cobb–Douglas function with OLS (broken grey curve). As Figure 1 indicates, the Cobb–Douglas function proved too inflexible for capturing the shape of the true f in this example.

5.2. Multiple regression simulations

We next performed more systematic Monte Carlo simulations in the two-regressor setting, fixing the sample size at 100 as before. Three different specifications for the true regression function f were considered:
3 For empirical applications, the interested reader is referred to the working papers Kuosmanen (2006) and Kuosmanen and Kortelainen (2007), which apply CNLS to production frontier estimation in cross-sectional and panel settings.
Figure 1. Illustration of ĝ of the CNLS regression.
(1) Cobb–Douglas: f_CD(x₁, x₂) = x₁^{0.4} · x₂^{0.5};
(2) Generalized Leontief: f_GL(x₁, x₂) = (0.2x₁^{0.5} + 0.3x₂^{0.5} + 0.4x₁^{0.5} · x₂^{0.5})^{0.9};
(3) Piece-wise linear: f_PWL(x₁, x₂) = min{x₁ + 2x₂, 2x₁ + x₂, 0.5x₁ + x₂ + 225, x₁ + 0.5x₂ + 225}.
The values x₁ and x₂ were independently and randomly sampled from the uniform distribution Uni[100,200], and the true f(x₁, x₂) values were computed. Subsequently, random error terms drawn from N(0, σ²) were added to f(x₁, x₂). Three different levels of standard deviation were considered: (A) low, σ = 2.5; (B) medium, σ = 5; and (C) high, σ = 10. The resulting data sets perturbed by errors were treated as the observations. We computed 250 simulations for each of the nine alternative scenarios, referred to as (1A), (1B), …, (3C). The GAMS code for solving the CNLS model is presented in Appendix B.
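One replication of this design can be sketched as follows (our translation; the paper's experiments were implemented in GAMS, see Appendix B):

# Sketch: one replication of the Monte Carlo design described above.
import numpy as np

rng = np.random.default_rng(42)
n, sigma = 100, 5.0   # sample size and the medium noise level (scenario B)

x1 = rng.uniform(100, 200, n)
x2 = rng.uniform(100, 200, n)

f_cd = x1**0.4 * x2**0.5                                          # Cobb-Douglas
f_gl = (0.2*x1**0.5 + 0.3*x2**0.5 + 0.4*x1**0.5 * x2**0.5)**0.9   # Gen. Leontief
f_pwl = np.minimum.reduce([x1 + 2*x2, 2*x1 + x2,
                           0.5*x1 + x2 + 225, x1 + 0.5*x2 + 225]) # piece-wise linear

y = f_cd + rng.normal(0, sigma, n)   # perturbed observations for scenario 1B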
Table 1. CNLS representor: the number of different hyperplane segments in each scenario.

True function          Error      Signal-to-noise   Average no.   St.dev. of   Min no.    Max no.    % obs. with
                       variance   ratio*            of segments   segments     of segm.   of segm.   some β̂_m = 0
Cobb–Douglas           σ = 2.5    4.5               23.1          4.9          9          35         5.3
                       σ = 5      2.25              20.8          4.7          6          33         8.6
                       σ = 10     1.13              18.7          4.7          6          30         7.6
Generalized Leontief   σ = 2.5    2                 20.5          4.8          6          34         8.9
                       σ = 5      1                 18.7          4.9          6          32         14.7
                       σ = 10     0.5               16.8          4.4          7          27         25.3
Piece-wise linear      σ = 2.5    20                36.9          4.4          25         49         2.7
                       σ = 5      10                37.7          4.9          24         51         4.8
                       σ = 10     5                 33.7          4.9          19         48         8.0

* Measured by the expected standard deviation of f(x₁, x₂) values divided by σ.
First, we know that the CNLS representor ĝ satisfies the regularity properties globally, but how useful can the estimated ĝ functions be in practical economic modelling? The answer to this question obviously depends on the complexity of the functions. Table 1 sheds further light on this question by describing the number of hyperplane segments in each scenario. We find that typically about 20 hyperplane segments suffice to characterize the CNLS function; in some scenarios the number was less than 10, in others as high as 50. A relatively small number of segments is desirable for analytical and computational convenience, as well as for narrowing the gap between the lower and upper bounds. Somewhat surprisingly, CNLS uses large numbers of segments in Scenario 3, where the true piece-wise linear function consists of only four linear segments. This seems to be due to the fact that the number of hyperplane segments is positively correlated with the signal-to-noise ratio: the noisier the data, the smaller the number of segments.
The right-most column of Table 1 reports the percentage of observations for which at least one element of vector β̂_i is zero. Recall that these slope coefficients determine the substitution properties. While zero values are consistent with the regularity conditions, the economic interpretation of the regression becomes odd if there are many zero substitution rates. In these simulations, the percentages of zero coefficients are relatively small.
The average number of observations per hyperplane segment was four; in Scenarios 1 and 2 this average exceeded five. This means that the bounds ĝ^min and ĝ^max coincided with ĝ for very large proportions of the curve. The percentage of observations i such that ĝ^min(x_i) = ĝ^max(x_i) varied between 70 and 95 across scenarios, with a mean value of 87.4. That is, the min and max bounds coincide over almost 90 per cent of the area of the estimated surfaces. The deviations typically occurred near the boundaries of the observed range of x.
Consider next the empirical fit. Table 2 reports the coefficient of determination (R²), log-likelihood (lnL) and mean-squared error (MSE) statistics for each scenario. To put the performance of CNLS in perspective, we also estimated the Cobb–Douglas and translog regression functions using OLS. The CNLS always gave the highest R² and log-likelihood values, as expected. The difference is notable especially in Scenarios 2C and 3A. Moreover, we see that the empirical fit and the MSE values tend to deteriorate as the error variance increases (the only exception is the MSE of the translog regression, which decreased in Scenario 3). In Scenario 1, we see that the correctly specified Cobb–Douglas model yields the lowest MSE. The empirical fit of the more flexible CNLS and translog models becomes 'too good' in this scenario; the error variance is underestimated. The translog specification performs somewhat better than the CNLS in Scenario 1, but the difference is relatively small. In Scenarios 2A and 2B (standard deviations of 2.5 and 5), the translog regression yields the lowest MSE, which is hardly surprising since both translog and generalized Leontief belong to the family of flexible functional
Table 2. Goodness of fit: average of 250 simulations (standard deviation in parentheses). Each row reports R², lnL and MSE for the CNLS, Cobb–Douglas and translog estimators.

True function: Cobb–Douglas
  σ = 2.5   CNLS: 0.960 (0.008), −179 (8.54), 0.79 (0.34) | Cobb–Douglas: 0.954 (0.009), −187 (6.83), 0.22 (0.16) | Translog: 0.954 (0.009), −186 (7.85), 0.41 (0.23)
  σ = 5     CNLS: 0.856 (0.029), −250 (8.34), 2.71 (1.23) | Cobb–Douglas: 0.838 (0.029), −256 (6.83), 0.89 (0.64) | Translog: 0.841 (0.030), −255 (7.86), 1.68 (0.92)
  σ = 10    CNLS: 0.606 (0.068), −321 (8.16), 9.48 (4.47) | Cobb–Douglas: 0.561 (0.070), −326 (7.54), 3.99 (2.97) | Translog: 0.575 (0.070), −325 (7.88), 7.09 (3.81)

True function: Generalized Leontief
  σ = 2.5   CNLS: 0.830 (0.034), −181 (8.31), 0.66 (0.30) | Cobb–Douglas: 0.813 (0.031), −187 (6.83), 7.58 (1.04) | Translog: 0.809 (0.035), −187 (7.52), 0.23 (0.16)
  σ = 5     CNLS: 0.562 (0.073), −252 (8.13), 2.33 (1.10) | Cobb–Douglas: 0.530 (0.064), −256 (6.86), 38.46 (8.97) | Translog: 0.518 (0.074), −256 (7.54), 1.00 (0.75)
  σ = 10    CNLS: 0.281 (0.084), −322 (7.97), 8.19 (4.02) | Cobb–Douglas: 0.053 (0.095), −557 (143), 46821 (40176) | Translog: 0.045 (0.259), −335 (12.8), 111 (132)

True function: Piece-wise linear
  σ = 2.5   CNLS: 0.998 (0.000), −171 (9.66), 1.65 (0.50) | Cobb–Douglas: 0.973 (0.005), −305 (6.33), 58.94 (8.04) | Translog: 0.973 (0.004), −305 (6.08), 59.18 (7.16)
  σ = 5     CNLS: 0.992 (0.001), −242 (9.40), 5.58 (1.80) | Cobb–Douglas: 0.965 (0.006), −318 (6.32), 59.79 (8.20) | Translog: 0.969 (0.01), −314 (7.89), 17.86 (4.36)
  σ = 10    CNLS: 0.985 (0.003), −276 (7.71), 15.07 (2.27) | Cobb–Douglas: 0.957 (0.007), −350 (9.08), 62.83 (9.00) | Translog: 0.963 (0.005), −330 (6.78), 20.00 (6.20)
forms. The MSE statistics of the CNLS come very close to those of the translog in these two scenarios, while the Cobb–Douglas has notably higher MSE. When the standard deviation is increased to 10 in Scenario 2C, the CNLS still provides reasonably accurate estimates, but the performance of the Cobb–Douglas and translog regressions is catastrophic. In Scenario 3, the CNLS has the lowest MSE throughout, as expected. The Cobb–Douglas specification performs poorly in all sub-scenarios, while the translog performs relatively well when the standard deviation of the error increases to 10. Overall, the MSE statistics suggest the CNLS provides more robust performance than the two parametric candidates considered. Increasing the number of observations would further improve the accuracy of CNLS, whereas increasing the number of explanatory variables would likely favour OLS.
We conclude by emphasizing that consistency with the regularity properties implied by economic theory is often a more important criterion than the empirical fit. In this respect, CNLS will always satisfy monotonicity and concavity by construction. The Cobb–Douglas function satisfies monotonicity but can violate concavity, while the translog can violate both. Table 3 reports the frequencies of violations in each scenario. Our simulations suggest that the parametric regression models are surprisingly likely to violate the regularity conditions even when the true functions satisfy the properties and the empirical fit is reasonably good. For example, in Scenario 2C, 80 per cent of the estimated Cobb–Douglas functions were convex although the true underlying function
Table 3. Violations of concavity and monotonicity (per cent of simulations).

True function          Error      Cobb–Douglas   Translog    Translog
                       variance   concavity      concavity   monotonicity
Cobb–Douglas           σ = 2.5    0              52.6        0.000
                       σ = 5      0.4            44.3        0.016
                       σ = 10     23.2           41.7        1.26
Generalized Leontief   σ = 2.5    0              0           0
                       σ = 5      14.4           0           0
                       σ = 10     80             0           0
Piece-wise linear      σ = 2.5    0              0           0
                       σ = 5      0              100         0.000
                       σ = 10     0              100         0.000
was concave. The estimated translog functions also frequently violated concavity in Scenarios 1 and 3.
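The violation counts of Table 3 can be reproduced for any fitted parametric function by inspecting its Hessian at the observed points. The following is our own generic diagnostic sketch, not the procedure used in the paper:

# Sketch: auditing a fitted function for concavity violations at the observed
# points via a numerical Hessian (our diagnostic, not the paper's procedure).
import numpy as np

def hessian(f, x, eps=1e-4):
    """Central-difference Hessian of a callable f at point x."""
    m = len(x)
    H = np.empty((m, m))
    for k in range(m):
        for l in range(m):
            e_k, e_l = np.eye(m)[k] * eps, np.eye(m)[l] * eps
            H[k, l] = (f(x + e_k + e_l) - f(x + e_k - e_l)
                       - f(x - e_k + e_l) + f(x - e_k - e_l)) / (4 * eps**2)
    return H

def violates_concavity(f, X):
    """True if the Hessian has a positive eigenvalue at any observed point."""
    return any(np.linalg.eigvalsh(hessian(f, x)).max() > 1e-8 for x in X)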
6. CONCLUSIONS AND DISCUSSION

CNLS draws its power from the shape constraints that coincide with the standard regularity conditions of microeconomic theory, avoiding prior assumptions about the functional form or its smoothness. Despite its attractive theoretical properties, applications of CNLS are scarce. In the Introduction, we noted three major barriers to application: (1) the lack of an explicit regression function, (2) computational complexity and (3) the difficulty of statistical inference. Our main result is a representation theorem, which shows that the complex, infinite dimensional CNLS problem always has a simple solution characterized by a continuous, piece-wise linear function. Making use of this result, we derived an explicit formulation for a representor function, which can be used as an estimator of the unknown regression function. The representation theorem also enabled us to express the CNLS problem as a QP problem. This facilitates the computation of the CNLS estimators by standard QP algorithms and solver software. Furthermore, a tractable computational procedure enables us to draw statistical inference by applying bootstrap simulations. Thus, we hope that the results of this paper may help to lower the barriers to using CNLS in empirical economic applications. From a methodological point of view, we noted a number of interesting links between CNLS and parallel developments in the literature.
The CNLS approach offers a rich framework for further extensions that fall beyond the scope of this paper. We have restricted attention to the estimation of monotonically increasing and concave functions, but the method applies to the estimation of monotonically decreasing and/or convex functions in a straightforward fashion. One could relax monotonicity to estimate (inverted) U-shaped functions (such as the Kuznets curves). Relaxing concavity, one arrives at the isotonic regression setting (Barlow et al., 1972). One could also postulate convexity or concavity to apply over a specific range of values, to estimate S-shaped production functions. One might also model homogeneity or homotheticity properties, as briefly suggested in Section 4. The practical implementation of these alternative properties deserves further elaboration.
Another research topic is to adapt the general-purpose regression approach presented here to more specific areas in econometrics, for example, time series or panel data analyses. In the field of production frontier estimation, CNLS has great potential for unifying the field currently dominated by two separate branches: the parametric regression-based stochastic frontier analysis (SFA) and the deterministic nonparametric DEA (see Kuosmanen, 2006, and Kuosmanen and Kortelainen, 2007, for further discussion). Consumer demand analysis is another area where CNLS has potential to bridge the gap between the nonparametric tests (Afriat, 1967, and Varian, 1982, 1985) and the parametric estimation of demand systems (e.g. Deaton, 1986).
We conclude by noting that the generality of the nonparametric approach does have a price: the rates of convergence are low when the model involves many explanatory variables, which means that large numbers of observations are required to get meaningful estimates. Our Monte Carlo simulations suggest that the method works well when there are relatively few explanatory variables relative to the sample size. In applications with many explanatory variables, the 'curse of dimensionality' could be alleviated by imposing some further semi-parametric structure (e.g. a partially linear model). While the asymptotic properties of nonparametric least squares are well understood, further work is needed to shed light on the impact of the monotonicity, concavity and other inequality constraints on the small sample performance of nonparametric least-squares estimators.
ACKNOWLEDGEMENTS

This paper has benefited from insightful comments by two anonymous reviewers of this Journal. Earlier versions of this paper have been presented at the 29th Annual Meeting of the Finnish Society for Economic Research, Lappeenranta, Finland, 1–2 February 2007; the GREMAQ Econometrics Seminar, Toulouse, France, 26 February 2007; the 4th Nordic Econometric Meeting, Tartu, Estonia, 24–26 May 2007; and the EEA/ESEM 2007, Budapest, Hungary, 27–31 August 2007. The author would like to thank Mika Kortelainen, Timo Sipiläinen, Heikki Hella, Christian Bontemps, Pierre Dubois, Thierry Magnac, Martin Browning, Dennis Kristensen, Mogens Fosgerau and other participants at these seminars for useful comments and suggestions. Financial support from the Yrjö Jahnsson Foundation for this research is gratefully acknowledged.
REFERENCES

Afriat, S. N. (1967). The construction of a utility function from expenditure data. International Economic Review 8, 67–77.
Banker, R. D. and A. Maindiratta (1992). Maximum likelihood estimation of monotone and concave production frontiers. Journal of Productivity Analysis 3, 401–15.
Barlow, R. E., D. J. Bartholomew, J. M. Bremner and H. D. Brunk (1972). Statistical Inference under Order Restrictions. New York: John Wiley & Sons.
Deaton, A. (1986). Demand analysis. In Z. Griliches and M. D. Intriligator (Eds.), Handbook of Econometrics, Volume 3, 1767–839. Amsterdam: North Holland.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics 7, 1–26.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics 38. Philadelphia: Society for Industrial and Applied Mathematics.
Efron, B. and R. J. Tibshirani (1993). An Introduction to the Bootstrap. London: Chapman and Hall.
Fan, J. and W. Zhang (1999). Statistical estimation in varying coefficient models. Annals of Statistics 27, 1491–518.
Fraser, D. A. S. and H. Massam (1989). A mixed primal-dual bases algorithm for regression under inequality constraints: Application to concave regression. Scandinavian Journal of Statistics 16, 65–74.
Goldman, S. M. and P. A. Ruud (1995). Nonparametric multivariate regression subject to constraint. Working paper, University of California at Berkeley.
Greene, W. H. (2005). Reconsidering heterogeneity in panel data estimators of the stochastic frontier model. Journal of Econometrics 126, 269–303.
Groeneboom, P., G. Jongbloed and J. A. Wellner (2001). Estimation of convex functions: Characterizations and asymptotic theory. Annals of Statistics 26, 1653–98.
Hanson, D. L. and G. Pledger (1976). Consistency in concave regression. Annals of Statistics 4, 1038–50.
Hildreth, C. (1954). Point estimates of ordinates of concave functions. Journal of the American Statistical Association 49, 598–619.
Kuosmanen, T. (2006). Stochastic nonparametric envelopment of data: Combining virtues of SFA and DEA in a unified framework. MTT Discussion Paper 3/2006, Helsinki.
Kuosmanen, T. and M. Kortelainen (2007). Stochastic nonparametric envelopment of data: Cross-sectional frontier estimation subject to shape constraints. Economics Discussion Paper #46, University of Joensuu.
Mammen, E. (1991). Nonparametric regression under qualitative smoothness assumptions. Annals of Statistics 19, 741–59.
Mammen, E. and C. Thomas-Agnan (1999). Smoothing splines and shape restrictions. Scandinavian Journal of Statistics 26, 239–52.
Matzkin, R. L. (1991). Semiparametric estimation of monotone and concave utility functions for polychotomous choice models. Econometrica 59, 1315–27.
Matzkin, R. L. (1992). Nonparametric and distribution-free estimation of the binary threshold crossing and the binary choice models. Econometrica 60, 239–70.
Matzkin, R. L. (1994). Restrictions of economic theory in nonparametric methods. In R. F. Engle and D. L. McFadden (Eds.), Handbook of Econometrics, Volume 4, 2523–58. Amsterdam: Elsevier.
Meyer, M. C. (1999). An extension of the mixed primal-dual bases algorithm to the case of more constraints than dimensions. Journal of Statistical Planning and Inference 81, 13–31.
Meyer, M. C. (2003). A test for linear vs. convex regression function using shape-restricted regression. Biometrika 90, 223–32.
Meyer, M. C. (2006). Consistency and power in tests with shape-restricted alternatives. Journal of Statistical Planning and Inference 136, 3931–47.
Nemirovskii, A. S., B. T. Polyak and A. B. Tsybakov (1985). Rates of convergence of nonparametric estimates of maximum likelihood type. Problems of Information Transmission 21, 258–71.
Rockafellar, R. T. (1970). Convex Analysis. Princeton, NJ: Princeton University Press.
Ruud, P. A. (1996). Restricted least squares subject to monotonicity and concavity constraints. Working paper, University of California at Berkeley.
Sasabuchi, S., M. Inutsuka and D. D. S. Kulatunga (1983). A multivariate version of isotonic regression. Biometrika 70, 465–72.
Varian, H. (1982). The nonparametric approach to demand analysis. Econometrica 50, 945–73.
Varian, H. (1984). The nonparametric approach to production analysis. Econometrica 52, 579–97.
Varian, H. (1985). Nonparametric analysis of optimizing behavior with measurement error. Journal of Econometrics 30, 445–58.
Wahba, G. (1990). Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics 59. Philadelphia: Society for Industrial and Applied Mathematics.
Wu, C. F. (1982). Some algorithms for concave and isotonic regression. TIMS Studies in Management Science 19, 105–16.
Yatchew, A. J. (1998). Nonparametric regression techniques in economics. Journal of Economic Literature 36, 669–721.
Yatchew, A. J. (2003). Semiparametric Regression for the Applied Econometrician. Cambridge: Cambridge University Press.
Yatchew, A. J. and L. Bos (1997). Nonparametric regression and testing in economic models. Journal of Quantitative Economics 13, 81–131.
Yatchew, A. J. and W. Härdle (2003). Nonparametric state price density estimation using constrained least squares and the bootstrap. Journal of Econometrics 133, 579–99.
APPENDIX A: PROOFS

Proof of Theorem 2.1: It is easy to verify that functions g ∈ G₂ satisfy continuity, monotonicity and concavity. Hence G₂ ⊂ F₂. This implies that problem (2.7) involves more stringent constraints than (2.3), and thus we must have s_g² ≥ s_f² for any arbitrary data. Suppose s_g² > s_f². Then there exists a function f̂ = arg min s_f² such that f̂ ∈ F₂\G₂. Define the subdifferential of f̂ at point x_i ∈ R^m as
\[
\partial \hat{f}(x_i) = \left\{ \nabla \hat{f}(x_i) \in \mathbb{R}^m \;\middle|\; \nabla \hat{f}(x_i) \cdot (x - x_i) \le \hat{f}(x) - \hat{f}(x_i) \;\; \forall x \in \mathbb{R}^m \right\}, \qquad (A.1)
\]
where vector ∇f̂(x_i) is referred to as a subgradient. Since f̂ is continuous, for every observed point x_i, i = 1, …, n, there exists a set of tangent hyperplanes
\[
H_i = \left\{ h_i : \mathbb{R}^m \to \mathbb{R} \;\middle|\; h_i(x) = \hat{f}(x_i) + \nabla \hat{f}(x_i) \cdot (x - x_i); \; \nabla \hat{f}(x_i) \in \partial \hat{f}(x_i) \right\}. \qquad (A.2)
\]
Monotonicity implies that
\[
\nabla \hat{f}(x_i) \ge 0 \quad \forall \nabla \hat{f}(x_i) \in \partial \hat{f}(x_i), \; i = 1,\ldots,n. \qquad (A.3)
\]
Concavity implies that
\[
h_i(x_i) \le h_k(x_i) \quad \forall h_i \in H_i; \; h_k \in H_k; \; k, i = 1,\ldots,n. \qquad (A.4)
\]
Note that the objective function of (2.7) depends on the value of f̂ at a finite set of points x_i, i = 1, …, n. Since h_i(x_i) = f̂(x_i) ∀h_i ∈ H_i; i = 1, …, n, without loss of generality we can represent function f̂ by its tangent hyperplanes at these points. Since by assumption s_g² > s_f², there exists at least one tangent hyperplane h_i ∈ H_i for some i ∈ {1, …, n} that is not feasible for functions g ∈ G₂. To see that the last claim implies a contradiction, we note that for any given ∇f̂(x_i) ∈ ∂f̂(x_i), it is possible to set
\[
\beta_i = \nabla \hat{f}(x_i) \qquad (A.5)
\]
and
\[
\alpha_i = h_i(0). \qquad (A.6)
\]
With this parametrization, we immediately see that the monotonicity condition (A.3) is equivalent to (2.5). Moreover, the concavity condition (A.4) can be re-written as
\[
\hat{f}(x_i) + \nabla \hat{f}(x_i) \cdot (x_i - x_i) \le \hat{f}(x_k) + \nabla \hat{f}(x_k) \cdot (x_i - x_k) \quad \forall k, i = 1,\ldots,n \qquad (A.7)
\]
\[
\Leftrightarrow \; (\hat{f}(x_i) - \nabla \hat{f}(x_i) \cdot x_i) + \nabla \hat{f}(x_i) \cdot x_i \le (\hat{f}(x_k) - \nabla \hat{f}(x_k) \cdot x_k) + \nabla \hat{f}(x_k) \cdot x_i \quad \forall k, i = 1,\ldots,n \qquad (A.8)
\]
\[
\Leftrightarrow \; \alpha_i + \beta_i' x_i \le \alpha_k + \beta_k' x_i \quad \forall k, i = 1,\ldots,n. \qquad (A.9)
\]
Inequalities (A.9) are equivalent to the concavity constraints (2.6). Thus, any set of supporting hyperplanes available for functions f̂ ∈ F₂ is also available for functions g ∈ G₂. Since the assumption s_g² > s_f² results in a contradiction, we must have s_g² = s_f². □

Proof of Corollary 4.1: Since G₂ ⊂ F₂, ĝ ∈ G₂ ⇒ ĝ ∈ F₂. Since problems (2.3) and (3.1) depend on the value of functions f and ĝ at a finite set of points x_i, i = 1, …, n, Theorem 2.1 directly implies that ĝ ∈ G₂* ⇒ ĝ ∈ F₂*. □

Proof of Theorem 4.1: We start from the lower bound. Note first that ĝ^min is continuous, monotonically increasing and concave: ĝ^min ∈ F₂. Secondly, note that ĝ^min(x_i) = ŷ_i ∀i ∈ {1, …, n}. These two observations imply that ĝ^min ∈ F₂*. Consider an arbitrary f̂ ∈ F₂*, and let ∇f̂(x) ∈ ∂f̂(x) be a subgradient of f̂ at an arbitrary point x ∈ R^m. The supporting hyperplane theorem (e.g. Rockafellar, 1970) implies that, for all fitted points (x_i, ŷ_i), i = 1, …, n, the tangent hyperplanes of a concave f̂ at point x must satisfy
\[
\hat{f}(x) + \nabla \hat{f}(x) \cdot (x_i - x) \ge \hat{y}_i \quad \forall \nabla \hat{f}(x) \in \partial \hat{f}(x). \qquad (A.10)
\]
Using α = f̂(x) − ∇f̂(x)·x and β = ∇f̂(x), function ĝ^min can be re-written as
\[
\hat{g}^{\min}(x) = \min_{\alpha \in \mathbb{R},\, \beta \in \mathbb{R}^m} \left\{ \alpha + \beta' x \;\middle|\; \alpha + \beta' x_i \ge \hat{y}_i \;\; \forall i = 1,\ldots,n \right\} \qquad (A.11)
\]
\[
= \min_{\hat{f}} \left\{ (\hat{f}(x) - \nabla \hat{f}(x) \cdot x) + \nabla \hat{f}(x) \cdot x \;\middle|\; (\hat{f}(x) - \nabla \hat{f}(x) \cdot x) + \nabla \hat{f}(x) \cdot x_i \ge \hat{y}_i \;\; \forall i = 1,\ldots,n \right\} \qquad (A.12)
\]
\[
= \min_{\hat{f}} \left\{ \hat{f}(x) \;\middle|\; \hat{f}(x) + \nabla \hat{f}(x) \cdot (x_i - x) \ge \hat{y}_i \;\; \forall i = 1,\ldots,n \right\}. \qquad (A.13)
\]
Therefore, ĝ^min(x) = min_f { f(x) | f ∈ F₂* }. This completes the first part of the proof.
Consider next the upper bound
\[
\hat{g}^{\max}(x) = \max_{\phi \in \mathbb{R},\, \alpha \in \mathbb{R}^n,\, \beta \in \mathbb{R}^{m \times n}} \left\{ \phi \;\middle|\; \phi \le \alpha_i + \beta_i' x \;\forall i;\; \alpha_i + \beta_i' x_i = \hat{y}_i \;\forall i;\; \alpha_i + \beta_i' x_h \ge \hat{y}_h \;\forall h \ne i \right\}. \qquad (A.14)
\]
Note that there must exist some observation i = arg min_{i∈{1,…,n}} (α_i + β_i'x) for which the constraint φ ≤ α_i + β_i'x holds as an equality. Therefore, the upper bound (A.14) can be equivalently written as
\[
\hat{g}^{\max}(x) = \min_{i \in \{1,\ldots,n\}} \left[ \max_{\alpha_i \in \mathbb{R},\, \beta_i \in \mathbb{R}^m} \left\{ \alpha_i + \beta_i' x \;\middle|\; \alpha_i + \beta_i' x_i = \hat{y}_i;\; \alpha_i + \beta_i' x_h \ge \hat{y}_h \;\; \forall h \ne i \right\} \right]. \qquad (A.15)
\]
This minimax formulation reveals that function ĝ^max is not concave, and thus ĝ^max ∉ F₂*. Using again α = f̂(x) − ∇f̂(x)·x and β = ∇f̂(x), function ĝ^max can be expressed as
\[
\hat{g}^{\max}(x) = \min_{i \in \{1,\ldots,n\}} \left[ \max_{\hat{f}_i} \left\{ \hat{f}_i(x) \;\middle|\; \hat{f}_i(x_i) = \hat{y}_i;\; \hat{f}_i(x) + \nabla \hat{f}_i(x) \cdot (x_h - x) \ge \hat{y}_h \;\; \forall h \ne i \right\} \right]. \qquad (A.16)
\]
Consider first the embedded maximization problem in (A.16). The problem is analogous to (A.13), except that we have replaced the minimization by maximization, and we force the tangent hyperplane to pass through a given point (x_i, ŷ_i). Constraint f̂_i(x_i) = ŷ_i is necessary in (A.16), because otherwise the problem would be unbounded. In essence, the embedded maximization problem of (A.16) finds the tangent hyperplane through a fixed point (x_i, ŷ_i) with the largest value at point x. The first-order conditions of the problem max_f { f(x) | f ∈ F₂* } imply that the optimum is achieved at one of the points f̂_i(x) on the tangent hyperplane of some i ∈ {1, …, n}. To see why the optimum must be the minimum value over i ∈ {1, …, n}, consider observations i, j ∈ {1, …, n} and let f̂_j*(x), f̂_i*(x) be the optimal solutions to the embedded maximization problem of (A.16) such that f̂_j*(x) > f̂_i*(x). The maximizing property of f̂_i*(x) implies that there exists a subset S ⊂ {1, …, n} such that f̂_i*(x) + ∇f̂_i*(x)·(x_s − x) = ŷ_s ∀s ∈ S (i.e. the constraints of the maximization problem in (A.16) are binding for all observations s ∈ S).
But then it is possible to construct a point (x̃, ỹ) as a convex combination
\[
\tilde{x} = \sum_{s \in S} \lambda_s x_s + \lambda_j x, \quad \tilde{y} = \sum_{s \in S} \lambda_s \hat{y}_s + \lambda_j \hat{f}_j(x), \quad \sum_{s \in S} \lambda_s + \lambda_j = 1, \quad \lambda_s, \lambda_j \ge 0 \;\; \forall s \in S,
\]
such that x̃ = x_i and ỹ > ŷ_i. Therefore, choosing point (x, f̂_j*(x)) violates the concavity postulate. As this argument applies to any observations i, j ∈ {1, …, n}, the only feasible solution is obtained by minimizing over observations i ∈ {1, …, n}. Therefore, ĝ^max(x) = max_f { f(x) | f ∈ F₂* }. □

REMARK A.1. The dual problem of (4.4) provides some further intuition, especially for the first part of the proof of Theorem 4.1. The dual formulation can be written as
\[
\hat{g}^{\min}(x) = \max_{z \in \mathbb{R}_+^n} \left\{ \sum_{i=1}^{n} z_i \hat{y}_i \;\middle|\; x \ge \sum_{i=1}^{n} z_i x_i; \;\; \sum_{i=1}^{n} z_i = 1 \right\}, \qquad (A.17)
\]
where z_i represents the weight assigned to observation i. In essence, the dual problem (A.17) maximizes the weighted average of the fitted values ŷ_i, subject to the constraint that the weighted average of the observed explanatory variables is less than or equal to the given level of x. Below the observed range, problem (A.17) is infeasible and ĝ^min is hence assigned the value −∞, whereas above the observed range ĝ^min(x) = max_{i∈{1,…,n}} {ŷ_i}.
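Strong LP duality means that (A.17) and (4.4) return the same value wherever (A.17) is feasible, which is easy to verify numerically. A small sketch with SciPy (our variable names; y_hat is a stand-in for the fitted values):

# Sketch: checking the dual (A.17) against the primal (4.4) numerically.
import numpy as np
from scipy.optimize import linprog

def g_min_dual(x, X, y_hat):
    """Maximize sum z_i*y_hat_i s.t. sum z_i*x_i <= x, sum z_i = 1, z >= 0."""
    n, m = X.shape
    res = linprog(-y_hat,                        # linprog minimizes, so negate
                  A_ub=X.T, b_ub=x,              # weighted average of x_i <= x
                  A_eq=np.ones((1, n)), b_eq=[1.0])  # weights sum to one (z >= 0 by default)
    return -res.fun if res.success else -np.inf  # infeasible below the data range

rng = np.random.default_rng(1)
X = rng.uniform(1, 10, (25, 2))
y_hat = np.log(X).sum(axis=1)                    # stand-in fitted values
print(g_min_dual(np.array([5.0, 5.0]), X, y_hat))  # should equal the primal g_min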
APPENDIX B: GAMS CODE FOR THE CNLS REGRESSION (100 OBSERVATIONS, TWO EXPLANATORY VARIABLES)

SETS k observations /1*100/;
ALIAS (k,m);
PARAMETERS
  Y(k)  dependent variable
  X1(k) value of explanatory variable 1 in observation k
  X2(k) value of explanatory variable 2 in observation k;
VARIABLES
  E(k)  residual of observation k
  A(k)  constant term
  SS    sum of squares of residuals;
POSITIVE VARIABLES
  B1(k) beta 1 coefficients
  B2(k) beta 2 coefficients;
EQUATIONS
  QSSE             objective = sum of squares of residuals
  QREGRESSION(k)   regression equation
  QCONCAVITY(k,m)  concavity constraint;
QSSE..            SS =e= sum(k, E(k)*E(k));
QREGRESSION(k)..  Y(k) =e= A(k) + B1(k)*X1(k) + B2(k)*X2(k) + E(k);
QCONCAVITY(k,m).. Y(m) =l= A(k) + B1(k)*X1(m) + B2(k)*X2(m) + E(m);
MODEL CNLS /all/;
SOLVE CNLS USING NLP MINIMIZING SS;
Econometrics Journal (2008), volume 11, pp. 326–348. doi: 10.1111/j.1368-423X.2008.00244.x
The impact of homework on student achievement

OZKAN EREN† AND DANIEL J. HENDERSON‡

†Department of Economics, College of Business, University of Nevada, Las Vegas, 4505 Maryland Parkway, Las Vegas, NV 89154-6005, USA. E-mail: [email protected]
‡Department of Economics, State University of New York, Binghamton, NY 13902-6000, USA. E-mail: [email protected]

First version received: June 2006; final version accepted: October 2007
Summary Utilizing parametric and nonparametric techniques, we assess the effect on academic achievement of a heretofore relatively unexplored 'input' in the educational process: homework. Our results indicate that homework is an important determinant of student test scores. Relative to more standard spending related measures, extra homework has a larger and more significant impact on test scores. However, the effects are not uniform across different subpopulations. Specifically, we find additional homework to be most effective for high and low achievers, which is further confirmed by stochastic dominance analysis. Moreover, the parametric estimates of the educational production function overstate the impact of schooling related inputs. In all estimates, the homework coefficient from the parametric model maps to the upper deciles of the nonparametric coefficient distribution and, as a by-product, the parametric model understates the percentage of students with negative responses to additional homework.

Keywords: Generalized kernel estimation, Nonparametric, School inputs, Stochastic dominance, Student achievement.
1. INTRODUCTION

Real expenditures per student in the United States have more than tripled over the last four decades, and public spending on elementary and secondary education now amounts to approximately $200 billion. Unfortunately, the substantial growth in resources devoted to schools has not been accompanied by any significant changes in student achievement (Hoxby, 1999). Given this discontinuity between educational expenditures and student achievement, economists have produced a voluminous body of research attempting to explore the primary influences on student learning. The vast majority of papers in this area have focused on spending related 'inputs' such as class size and teachers' credentials. With a few exceptions, these studies conclude that measured school 'inputs' have only limited effects on student outcomes (Hanushek, 2003). In light of these pessimistic findings, it is surprising how little work has been devoted to understanding the impact
of other aspects of the educational environment on student achievement.1 In particular, given parental concerns, policy debates and media interest (e.g. Time Magazine, 25 January 1999), very little research to date has been completed on the role of homework.
We know of two empirical studies that examine the effects of homework on student outcomes. Aksoy and Link (2000), using the National Educational Longitudinal Study of 1988 (NELS:88), find positive and significant effects of homework on tenth grade math test scores. However, the authors rely on student responses regarding the hours of homework, which carries the potential risk of a spurious correlation since it likely reflects unobserved variation in student ability and motivation. Betts (1997) presents the only empirical work that, to our knowledge, focuses on the hours of homework assigned by the teacher. This measure of homework is actually a policy variable, which the school or teacher can control. Using the Longitudinal Study of American Youth, Betts obtains a substantial effect of homework on math test scores. Specifically, an extra half hour of math homework per night in grades 7–11 is estimated to advance a student nearly two grade equivalents. Furthermore, in a nonlinear model setting, the author argues that virtually all students (99.3% of the sample) could benefit from extra homework and thus math teachers could increase almost all students' achievement by assigning more homework.
Although the aforementioned papers provide careful and important evidence on the effects of homework, there are numerous gaps remaining. First, there may be heterogeneity in the returns to homework. Theoretical treatments of the topic indicate that the responses to extra homework will depend on the student's ability level (see, e.g. Betts, 1997, and Neilson, 2005). In this respect, the impact of homework may differ among students. Second, the existing educational production function literature relies mostly on parametric regression models. Although popular, parametric models require stringent assumptions. In particular, the errors are generally assumed to come from a specified distribution and the functional form of the educational production function is given a priori. Since the theory predicts a non-monotonic relation between homework and student achievement, a parametric specification which fully captures the true relation may be difficult to find. Further, if the functional form assumption does not hold, the parametric model will most likely lead to inconsistent estimates.
In order to alleviate some of these potential shortcomings, we adopt a nonparametric approach. Nonparametric estimation procedures relax the functional form assumptions associated with traditional parametric regression models and create a tighter fitting regression curve through the data.2 These procedures do not require assumptions on the distribution of the error, nor do they require specific assumptions on the form of the underlying production function. Furthermore, the procedures generate unique coefficient estimates for each observation for each variable. This attribute enables us to make inferences regarding heterogeneity in the returns.
Utilizing the above stated techniques and the NELS:88, we reach four striking empirical findings. First, controlling for the teacher's evaluation of the overall class achievement in the educational production function is crucial. In the absence of such a control, the schooling 'inputs' are overstated.
Second, relative to more standard spending related measures such as class size, extra homework appears to have a larger and more significant impact on mathematics achievement.
1 Notable exceptions are Eren and Millimet (2007), Figlio and Lucas (2003) and Fuchs and Wößmann (2007). These studies examine the impact of other aspects of schooling such as the time structure, grading standards and institutional factors on student achievement.
2 Nonparametric estimation has been used in other labour economics domains to avoid restrictive functional form assumptions (see, e.g. Henderson et al., 2006, and Kniesner and Li, 2002).
However, the effects are not homogeneous across subpopulations. We find additional homework to be more effective for high and low achievers relative to average achievers. This is further uniformly confirmed by introducing stochastic dominance techniques into the examination of returns between groups from a nonparametric regression. Third, in contrast to time spent on homework, time spent in class is not a significant contributor to math test scores. This may suggest that learning by doing is a more effective tool for improvement in student achievement. Finally, the parametric estimates of the educational production function overstate the impact of schooling related inputs. In particular, both the homework and class size coefficients from the preferred parametric model map to the upper deciles of the coefficient distribution of the nonparametric estimates. Moreover, the parametric model understates the percentage of students with negative responses to an additional hour of homework.
The remainder of the paper is organized as follows. Section 2 provides a short sketch of the theoretical background. Section 3 describes the estimation strategy, as well as the statistical tests used in the paper. Section 4 discusses the data and Section 5 presents the results. Finally, Section 6 concludes.
2. THEORETICAL BACKGROUND

To motivate our empirical methodology, we briefly summarize the theoretical models of homework effectiveness on test scores following Betts (1997) and Neilson (2005). The existing models rest on three important assumptions: (i) students differ in ability and thus require different amounts of time to complete the same homework assignment; (ii) homework is beneficial, at least in small amounts; and (iii) students are time constrained. In the absence of the third assumption, additional homework can benefit all students regardless of ability level. However, once the third assumption comes into play, further homework will only affect those who have not hit their individual time constraint or 'give-up' limit.
Formally, let M be the (same) amount of time that each student has available for completing their homework assignment and let H(a_i, HW_m) be the amount of time spent on homework by student i, which is a function of his/her ability (a) and the number of units of homework (HW) assigned by teacher m. Moreover, let f be a production function that transforms the ability of student i and the homework assigned by teacher m into a test score as TS_i = f(a_i, HW_m). It is assumed that TS is an increasing function of a. More homework also leads to higher test scores by assumption (ii), and students who have higher ability complete their homework assignments more quickly under assumption (iii): dH/da < 0. In addition to the three assumptions, suppose that each unit of homework takes the same length of time for a given student (i.e. H is homogeneous of degree one with respect to homework), so that the most homework a student can do is M/H(a, 1) units. Note that M/H(a, 1) is an increasing function of ability since the denominator is decreasing in a. Then, for any two random students where the ability of the first is strictly greater than that of the second, there exists a level of homework above which the difference between the test scores of the first and second student is non-decreasing for a given M.3 In other words, when the low-ability student has reached their time constraint but the high-ability student has not, further homework positively affects only the high-ability student. Therefore, the
3 See proposition 2 of Neilson (2005) for the full proof.
responses to additional homework will depend on how far each student is from their individual-specific 'give-up' limit, and the relation between test scores and homework is non-monotonic.
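To see the mechanism numerically, consider a deliberately simple parametrization with H(a, HW) = HW/a and a concave score function; these functional forms are our illustration and are not taken from Betts (1997) or Neilson (2005):

# Sketch: the 'give-up' mechanism under illustrative functional forms
# (H(a, HW) = HW / a and a simple concave score function; both are our
# assumptions for illustration only).
M = 2.0                      # time budget available for homework

def units_completed(a, hw):
    """Each unit takes 1/a hours, so at most M*a units fit in the budget."""
    return min(hw, M * a)

def test_score(a, hw):
    return 10 * a + 5 * units_completed(a, hw) ** 0.5

for hw in [1, 4, 9, 16]:
    low, high = test_score(1.0, hw), test_score(2.0, hw)
    print(f"HW={hw:2d}: low-ability={low:.2f}, high-ability={high:.2f}, gap={high - low:.2f}")
# Beyond hw = M*a = 2 units the low-ability student stops gaining, so the
# score gap between the two students is non-decreasing in assigned homework.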
3. EMPIRICAL METHODOLOGY

3.1. Parametric model

We begin our empirical methodology with a parametric specification of the educational production function as
\[
TS_{ilkm} = f(HW_m, W_i, C_k, T_m, \xi_l, \mu_i; \beta) + \varepsilon_{ilkm}, \qquad (3.1)
\]
where, as described above, TS is the test score of student i in school l in class k, and HW denotes the hours of homework assigned by teacher m. The vector W represents individual and family background characteristics, as well as ex ante achievement (lagged test scores), C is a vector of class inputs and T is a vector of teacher characteristics. We control for all factors invariant within a given school with the fixed effect ξ, μ is the endowed ability of student i, β is a vector of parameters to be estimated and ε is a zero mean, normally distributed error term. We attempt to capture ability (a) with lagged test scores and a host of variables defined for μ. Our main parameter of interest is the coefficient on homework, which represents the effect of an additional hour of homework on student test scores.

3.2. Generalized kernel estimation

Parametric regression models require one to specify the functional form of the underlying data generating process prior to estimation. Correctly specified parametric models provide consistent estimates, and inference based on such estimates is valid. However, uncertainty exists about the shape of the educational production function because the theory does not provide a guide as to an appropriate functional form (Betts, 1997, Hanushek, 2003, and Todd and Wolpin, 2003). There could be nonlinear/non-monotonic relations, as well as interactions among regressors, which standard parametric models may not capture. Furthermore, typical parametric models do not fully conform to the theoretical model described in Section 2 since they commonly ignore any information regarding heterogeneity in responses to additional homework.
Given the potential shortcomings of parametric models, we also estimate a nonparametric version of (3.1). To proceed, we utilize Li–Racine generalized kernel estimation (Li and Racine, 2004, and Racine and Li, 2004) and express the test score equation as
\[
TS_i = \theta(x_i) + e_i, \quad i = 1,\ldots,N, \qquad (3.2)
\]
where θ(·) is the unknown smooth educational production function, e_i is an additive error term and N is the sample size. The covariates of equation (3.1) are subsumed in x_i = [x_i^c, x_i^u, x_i^o], where x_i^c is a vector of continuous regressors (e.g. hours of homework), x_i^u is a vector of regressors that assume unordered discrete values (e.g. race), and x_i^o is a vector of regressors that assume ordered discrete values (e.g. parental education). Taking a first-order Taylor expansion of (3.2) with respect to x_j yields
\[
TS_i \approx \theta(x_j) + (x_i^c - x_j^c)' \beta(x_j) + e_i, \qquad (3.3)
\]
where β(x_j) is defined as the partial derivative of θ(x_j) with respect to x^c. The estimator of δ(x_j) ≡ (θ(x_j), β(x_j)')' is given by
\[
\hat{\delta}(x_j) = \begin{pmatrix} \hat{\theta}(x_j) \\ \hat{\beta}(x_j) \end{pmatrix}
= \left[ \sum_{i=1}^{N} K_{h,\lambda^u,\lambda^o} \begin{pmatrix} 1 \\ x_i^c - x_j^c \end{pmatrix} \begin{pmatrix} 1 \\ x_i^c - x_j^c \end{pmatrix}' \right]^{-1}
\sum_{i=1}^{N} K_{h,\lambda^u,\lambda^o} \begin{pmatrix} 1 \\ x_i^c - x_j^c \end{pmatrix} TS_i, \qquad (3.4)
\]
where K_{h,λ^u,λ^o} is the commonly used product kernel for mixed data (Li and Racine, 2006); h refers to the estimated bandwidth associated with the standard normal kernel for a particular continuous regressor. Similarly, λ^u refers to the estimated bandwidth associated with the Aitchison and Aitken (1976) kernel function for unordered categorical data, and λ^o is the estimated bandwidth for the Wang and Van Ryzin (1981) kernel function for ordered categorical data.
Estimation of the bandwidths (h, λ^u, λ^o) is typically the most salient factor when performing nonparametric estimation. For example, choosing a very small bandwidth means that there may not be enough points for smoothing and thus we may get an undersmoothed estimate (low bias, high variance). On the other hand, choosing a very large bandwidth, we may include too many points and thus get an oversmoothed estimate (high bias, low variance). This trade-off is a well known dilemma in applied nonparametric econometrics and thus we resort to automatic determination procedures to estimate the bandwidths. Although there exist many selection methods, one popular procedure (and the one used in this paper) is that of least-squares cross-validation. In short, the procedure chooses (h, λ^u, λ^o) to minimize the least-squares cross-validation function given by
\[
CV(h, \lambda^u, \lambda^o) = \frac{1}{N} \sum_{j=1}^{N} \left[ TS_j - \hat{\theta}_{-j}(x_j) \right]^2, \qquad (3.5)
\]
where θ̂_{−j}(·) is the commonly used leave-one-out estimator of θ(x).4
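For intuition, the local-linear logic of (3.4) can be sketched for continuous regressors only; in the actual estimation the Gaussian product kernel below would be replaced by the Li and Racine (2006) mixed-data product kernel, and the bandwidth search would run over (h, λ^u, λ^o). All names in the sketch are ours:

# Sketch: local-linear kernel regression with continuous regressors only,
# a simplified stand-in for (3.4) that omits the discrete-data kernels.
import numpy as np

def local_linear(x0, X, y, h):
    """Return (theta_hat, beta_hat) at evaluation point x0."""
    d = X - x0                                        # N x q deviations
    w = np.exp(-0.5 * np.sum((d / h) ** 2, axis=1))   # Gaussian product kernel
    Z = np.column_stack([np.ones(len(X)), d])         # regressors [1, x_i - x0]
    ZtW = Z.T * w
    delta = np.linalg.solve(ZtW @ Z, ZtW @ y)         # weighted least squares
    return delta[0], delta[1:]                        # fit and gradient estimates

def loo_cv(h, X, y):
    """Leave-one-out least-squares cross-validation criterion, as in (3.5)."""
    err = [y[j] - local_linear(X[j], np.delete(X, j, 0), np.delete(y, j), h)[0]
           for j in range(len(y))]
    return np.mean(np.square(err))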
3.3. Model selection criteria

To assess the correct estimation strategy, we utilize the Hsiao et al. (2007) specification test for mixed categorical and continuous data. The null hypothesis is that the parametric model f(x_i, β) is correctly specified (H₀: Pr[E(TS_i | x_i) = f(x_i, β)] = 1) against the alternative that it is not (H₁: Pr[E(TS_i | x_i) = f(x_i, β)] < 1). The test statistic is based on I ≡ E[(E(ε | x))² f(x)], where ε = y − f(x, β). I is non-negative and equals zero if and only if the null is true. The resulting test statistic is
\[
J = \frac{N (h_1 h_2 \cdots h_q)^{1/2} \, \hat{I}}{\hat{\sigma}} \sim N(0, 1), \qquad (3.6)
\]
4 All bandwidths are readily computable in N; see Racine (2003).
where
\[
\hat{I} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1, j \ne i}^{N} \hat{\varepsilon}_i \hat{\varepsilon}_j K_{h,\lambda^u,\lambda^o}, \qquad
\hat{\sigma}^2 = \frac{2 h_1 h_2 \cdots h_q}{N^2} \sum_{i=1}^{N} \sum_{j=1, j \ne i}^{N} \hat{\varepsilon}_i^2 \hat{\varepsilon}_j^2 K_{h,\lambda^u,\lambda^o}^2,
\]
K_{h,λ^u,λ^o} is the product kernel, and q is the number of continuous regressors. If the null is false, J diverges to positive infinity. Unfortunately, the asymptotic normal approximation performs poorly in finite samples, and a bootstrap method is generally suggested for approximating the finite sample null distribution of the test statistic. This is the approach we take.

3.4. Stochastic dominance

Nonparametric estimation as described in equation (3.4) generates unique coefficient estimates for each observation for each variable. This feature of nonparametric estimation enables us to compare (rank) the returns for subgroups and thus make inferences about who benefits most from additional homework. Here we propose using stochastic dominance tests for the empirical examination of such comparisons.5 The comparison of the effectiveness of a policy on different subpopulations based on a particular index (such as a conditional mean) is highly subjective; different indices may yield substantially different conclusions. In contrast, finding a stochastic dominance relation provides a uniform ranking regarding the impact of the policy among different groups and offers robust inference.
To proceed, let β(HW) be the actual effect of an additional hour of homework on test scores unique to an individual (other regressors can be defined similarly). If there exist distinct known groups within the sample, we can examine the returns between any two groups, say w and v. Here w and v might refer to males and females, respectively. Denote β_w(HW) as the effect of an additional hour of homework on test scores for a specific individual in group w; β_v(HW) is defined similarly. Note that the remaining covariates are not constrained to be equal across or within groups. In practice, the actual effect of an additional hour of homework on test scores is unknown, but the nonparametric regression gives us an estimate of this effect: {β̂_{w,i}(HW)}_{i=1}^{N_w} is a vector of N_w estimates of β_w(HW) and {β̂_{v,i}(HW)}_{i=1}^{N_v} is an analogous vector of estimates of β_v(HW). F(β_w(HW)) and G(β_v(HW)) represent the cumulative distribution functions of β_w(HW) and β_v(HW), respectively. Consider the null hypotheses of interest:
\[
\text{Equality of Distributions: } F(\beta(HW)) = G(\beta(HW)) \quad \forall \beta(HW) \in \Omega; \qquad (3.7a)
\]
\[
\text{First Order Stochastic Dominance: } F \text{ dominates } G \text{ if } F(\beta(HW)) \le G(\beta(HW)) \quad \forall \beta(HW) \in \Omega, \qquad (3.7b)
\]
5 For empirical applications of stochastic dominance tests on school quality data see Eren and Millimet (2006) and Maasoumi et al. (2005). For an empirical application of stochastic dominance tests on fitted values obtained via nonparametric regression see Maasoumi et al. (2007).
where Ψ is the union support of β_w(HW) and β_v(HW). To test the null hypotheses, we define the empirical cumulative distribution function for β_w(HW) as

$$ \hat{F}(\beta_w(HW)) = \frac{1}{N_w} \sum_{i=1}^{N_w} 1\big(\hat{\beta}_{w,i}(HW) \leq \beta_w(HW)\big), \qquad (3.8) $$
where 1(·) denotes the indicator function and $\hat{G}(\beta_v(HW))$ is defined similarly. Next, we define the following Kolmogorov–Smirnov statistics

$$ T_{EQ} = \sup_{\beta(HW) \in \Psi} \big| \hat{F}(\beta(HW)) - \hat{G}(\beta(HW)) \big|, \qquad (3.9a) $$

$$ T_{FSD} = \sup_{\beta(HW) \in \Psi} \big( \hat{F}(\beta(HW)) - \hat{G}(\beta(HW)) \big), \qquad (3.9b) $$
for testing the equality and first order stochastic dominance (FSD) relations, respectively. Unfortunately, the asymptotic distributions of these nonparametric sample-based statistics under the null are generally unknown because they depend on the underlying distributions of the data. To overcome this problem, we approximate the empirical distributions of the test statistics. The strategy, following Abadie (2002), is as follows:

(1) Let T be generic notation for T_EQ and T_FSD. Compute the test statistics T for the original samples $\{\hat{\beta}_{w,1}(HW), \ldots, \hat{\beta}_{w,N_w}(HW)\}$ and $\{\hat{\beta}_{v,1}(HW), \ldots, \hat{\beta}_{v,N_v}(HW)\}$.
(2) Define the pooled sample as $\Gamma = \{\hat{\beta}_{w,1}(HW), \ldots, \hat{\beta}_{w,N_w}(HW), \hat{\beta}_{v,1}(HW), \ldots, \hat{\beta}_{v,N_v}(HW)\}$. Resample N_w + N_v observations with replacement from Γ and call the result Γ_b. Divide Γ_b into two groups of sizes N_w and N_v to obtain $\hat{T}_b$.
(3) Repeat step (2) B times.
(4) Calculate the p-values of the tests as p-value $= B^{-1} \sum_{b=1}^{B} 1(\hat{T}_b > T)$. Reject the null hypotheses if the p-value is smaller than some significance level α, where α ∈ (0, 1/2).

By resampling from Γ, we approximate the distribution of the test statistics when F(β(HW)) = G(β(HW)). Note that for (3.7b), F(β(HW)) = G(β(HW)) represents the least favorable case for the null hypothesis. This strategy allows us to estimate the supremum of the probability of rejection under the composite null hypothesis, which is the conventional definition of test size.6
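A minimal sketch of steps (1)–(4), assuming the two vectors of estimated coefficients are already in hand; the sup in (3.9a)–(3.9b) is approximated over the pooled sample points, and all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(12345)

def ks_statistics(beta_w, beta_v):
    # Empirical CDFs as in (3.8), evaluated over the pooled support,
    # then T_EQ and T_FSD from (3.9a) and (3.9b).
    grid = np.concatenate([beta_w, beta_v])
    F = np.searchsorted(np.sort(beta_w), grid, side="right") / beta_w.size
    G = np.searchsorted(np.sort(beta_v), grid, side="right") / beta_v.size
    return np.abs(F - G).max(), (F - G).max()

def sd_bootstrap_pvalues(beta_w, beta_v, B=999):
    # Steps (1)-(4): statistics on the original samples, then resampling
    # from the pooled sample, which imposes F = G (the least favorable
    # case for the FSD null).
    t_eq, t_fsd = ks_statistics(beta_w, beta_v)
    pooled, n_w = np.concatenate([beta_w, beta_v]), beta_w.size
    hits = np.zeros(2)
    for _ in range(B):
        s = rng.choice(pooled, size=pooled.size, replace=True)
        tb = ks_statistics(s[:n_w], s[n_w:])
        hits += [tb[0] > t_eq, tb[1] > t_fsd]
    return hits / B   # bootstrap p-values for equality and FSD
```

As footnote 6 notes, a fuller procedure would re-estimate the nonparametric returns within each bootstrap replication; the sketch treats them as data.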
4. DATA

The data are obtained from the National Educational Longitudinal Study of 1988 (NELS:88), a large longitudinal study of eighth grade students conducted by the National Center for Educational Statistics.
6 Ideally we would like to reestimate the nonparametric returns within each bootstrap replication to take into account the uncertainty of the returns. Thus, the bootstrapped p-values most likely differ slightly from their ‘true’ values. This non-trivial extension is left for future research. Nonetheless, if we obtain a large p-value, it is unlikely that accounting for such uncertainty would alter the inference.
The NELS:88 is a stratified sample, which was chosen in two stages. In the first stage, a total of 1032 schools were selected, on the basis of school size, from a universe of approximately 40,000 schools. In the second stage, up to 26 students were selected from each of the sample schools based on race and gender. The original sample contains approximately 25,000 eighth grade students. Follow-up surveys were administered in 1990, 1992, 1994 and 2000.

To measure academic achievement, students were administered cognitive tests in reading, social sciences, mathematics and science during the spring of the base year (eighth grade), first follow-up (tenth grade) and second follow-up (twelfth grade). Each of the four grade-specific tests contains material appropriate for each grade, but includes sufficient overlap from previous grades to permit measurement of academic growth. Although four test scores are available per student, teacher and class information sets (discussed below) are only available for two subjects per student. We utilize tenth grade math test scores as our dependent variable in light of the findings of Grogger and Eide (1995) and Murnane et al. (1995).7 These studies find a substantial impact of mathematics achievement on postsecondary education, as well as on earnings.

Our variable of interest is the hours of homework assigned daily and comes directly from the reports of the student's math teacher. This measure of homework is a policy variable, which the school administrator or the teacher can control. Relying on hours spent on homework from the student reports is not as accurate and may yield spurious correlations since it may reflect unobserved variation in student ability and motivation.8

Since researchers interested in the impact of school quality measures are typically (and correctly) concerned about the potential endogeneity of school quality variables, we utilize a relatively lengthy vector of student, family, endowed ability, teacher and classroom characteristics. The NELS:88 data enables us to tie teacher and class-level information directly to individual students and thus circumvents the risk of measurement error and aggregation bias. Furthermore, we include school fixed effects as described in equations (3.1) and (3.2) to capture differences between schools that may affect student achievement. Specifically, our estimations control for the following variables:
Individual: gender, race, lagged (eighth grade) math test score;
Family: father's education, mother's education, family size, socioeconomic status of the family;
Endowed ability: ever held back a grade in school, ever enrolled in a gifted class, math grades from grade 6 to 8;
Teacher: gender, race, age, education;
School: school fixed effects;
Class: class size, number of hours the math class meets weekly, teacher's evaluation of the overall class achievement.

Information on individual and family characteristics and endowed ability variables is obtained from the base year survey questionnaires, and data pertaining to the math teacher and class comes from the first follow-up survey. Observations with missing values for any of the variables defined above are dropped.
7 We follow Boozer and Rouse (2001) and Altonji et al. (2005) and utilize item response theory math test scores.
8 Even though hours of homework assigned by the teacher is a superior measure to hours of homework reported by the student, it is far from perfect since we only observe the quantity and not the quality of homework. This limitation suggests directions for future research and data collection.
We further restrict the sample to students who attend public schools. Table 1 reports the weighted summary statistics of some of the key variables for the 6,913 students in the public school math sample and for the regression sample used for estimation.9 The means and standard deviations in the regression sample are similar to those obtained when using the full set of potential public school observations. This similarity provides some assurance that missing values have not distorted our sample.

Prior to continuing, a few comments are warranted on the issue of the endogeneity/validity of the set of control variables that we utilize. First, in common with existing practice in the educational production function literature, we include lagged (eighth grade) math test scores in our estimations. Lagged test scores are assumed to provide an important control for ex ante achievement and to capture all previous inputs in the educational production process, giving the results a 'value-added' interpretation (see, e.g. Hanushek, 1979, 2005). The value-added specification is generally regarded as better than the 'contemporaneous' specification for obtaining consistent estimates of the contemporaneous inputs. However, as indicated in Todd and Wolpin (2003), the value-added specification is highly susceptible to bias even if the omitted inputs are orthogonal to the included inputs. The problem mainly arises due to the correlation between lagged test scores and (unobserved) endowed ability. If this potential endogeneity of lagged test scores is not taken into account, then the resulting bias will not only contaminate the estimate of lagged test scores but may be transmitted to the estimates of all the contemporaneous input effects. To this end, we include a host of variables, as mentioned above, to capture the endowed ability of students. Furthermore, even in the absence of (or along with) the aforementioned endogeneity, the value-added specification may still generate biased estimates if the potential omitted inputs are correlated with the lagged test scores. Todd and Wolpin (2003) propose the use of within-estimators (i.e. student fixed effects) in a longitudinal framework or within-family estimators as alternatives to the value-added specification. We investigate whether the within-estimator affects our results in the next section.

Second, the teacher's evaluation of the class plays a crucial role in our estimations and therefore requires extra attention. The teachers surveyed in the NELS:88 are asked to report 'which of the following best describes the achievement level of the students in this class compared with the average tenth grade student in this school.' Their choice is between four categories: high, average, low and widely differing. It is important to note that the overall class evaluation is not based on the test scores since the teacher surveys were administered prior to the student surveys. That being said, given the subjective nature of the question, we need to verify the quality of this variable. Under the assumption that measurement error is not dominant, Bertrand and Mullainathan (2001) indicate that subjective measures can be helpful as control variables in predicting outcomes. In Table 2, we report the summary statistics of tenth grade math test scores disaggregated by teachers' evaluations. As seen in the first column, the mean test scores are highest (lowest) for the high (low) achievement group.
Moreover, the within-class standard deviation (as well as overall standard deviation) displayed in the third column of Table 2 is largest for the widely differing achievement group. These findings provide some corroborative evidence for the validity of teachers’ responses in reflecting overall class ability and thus measurement error may not be a dominant issue.
9 Our regressions do not use weights. Instead we include controls for the variables used in the stratification; see Rose and Betts (2004) for a similar approach.
Table 1. Sample statistics of key variables.

                                                Public school math sample    Regression sample
                                                Mean        SD               Mean        SD
10th grade math test score                      51.306      9.850            52.225      9.547
Assigned daily hours of homework                0.643       0.392            0.644       0.381
Weekly hours of math class                      3.922       1.033            3.972       1.022
8th grade math test score                       51.488      9.931            52.354      9.901
Mother's education
  High school dropout                           0.133       0.340            0.127       0.333
  High school                                   0.396       0.489            0.418       0.493
  Junior college                                0.136       0.343            0.139       0.349
  College less than 4 years                     0.097       0.296            0.091       0.288
  College graduate                              0.146       0.353            0.136       0.342
  Master degree                                 0.069       0.255            0.071       0.257
  Ph.D., MD., etc.                              0.019       0.137            0.015       0.124
Family size                                     4.606       1.400            4.565       1.315
Female                                          0.498       0.500            0.493       0.498
Race
  Black                                         0.117       0.321            0.087       0.281
  Hispanic                                      0.085       0.280            0.068       0.251
  Other                                         0.042       0.202            0.031       0.173
  White                                         0.753       0.499            0.813       0.389
Ever held back a grade (1 = Yes)                0.135       0.343            0.134       0.341
Ever enrolled in a gifted class (1 = Yes)       0.213       0.409            0.217       0.412
% of teachers holding a graduate degree         0.508       0.499            0.517       0.499
Teacher's race
  Black                                         0.050       0.218            0.036       0.186
  Hispanic                                      0.017       0.129            0.016       0.126
  Other                                         0.017       0.131            0.011       0.108
  White                                         0.914       0.279            0.935       0.245
Teacher's evaluation of the overall class achievement
  High level                                    0.254       0.435            0.287       0.452
  Average level                                 0.410       0.491            0.415       0.492
  Low level                                     0.236       0.424            0.197       0.398
  Widely differing                              0.099       0.299            0.100       0.300
Class size                                      23.521      7.315            23.442      7.278
Number of observations                          6913                         3733

Notes: Weighted summary statistics are reported. The variables are only a subset of those utilized in the analysis. The remainder are excluded in the interest of brevity. The full set of sample statistics is available upon request.
Table 2. Means and standard deviations of 10th grade math test scores by achievement levels.

Teacher's evaluation of the
overall class achievement     Mean      SD       Within-class SD
High achievement              59.769    7.517    3.577
Average achievement           51.795    8.142    4.177
Low achievement               43.863    6.618    3.025
Widely differing              49.995    9.904    4.347

Notes: Achievement levels are based on teachers' evaluations. See text for further details.
5. EMPIRICAL RESULTS

5.1. Parametric estimates

Our parametric specifications are presented in Table 3. For all regression estimates, White standard errors are reported beneath each coefficient. The first column of Table 3 gives a large significant coefficient for homework: an additional hour of homework is associated with a gain of 4.01 (0.59) points in math achievement. Given that the mean test score is approximately 52.22, this represents an increase of slightly below 8%. However, this model is simplistic in that it does not take into account many observable variables that are known to affect test scores. In the second column of Table 3, we include demographic and family characteristics. There is a slight decrease in the homework coefficient.

The third column adds the student's eighth grade math score, which gives the results a value-added interpretation. Including the student's eighth grade math score greatly reduces the homework coefficient from 3.47 (0.50) to 0.90 (0.21). However, the coefficient is still statistically significant. In order to capture the potential endogeneity of lagged test scores due to endowed ability, we include the endowed ability variables in the fourth column of Table 3.10 Doing so reduces the coefficient of homework to 0.77 (0.20), and a slightly smaller decrease is observed in the eighth grade math score coefficient as well.

An important concern regarding the effect of homework and any other school quality variable is that schools may differ in both observable and unobservable dimensions. If school traits are correlated with homework or other inputs, then it is likely that the coefficients will be biased. Therefore, it is most prudent to control for any observed and unobserved factors common to all students in a school. We accomplish this by including school fixed effects in the fifth column of Table 3. The school dummies are jointly significant (p-value = 0.00), but the homework coefficient remains practically unchanged.

The sixth and seventh columns of Table 3 add teacher and classroom characteristics (class size and weekly hours of math class), respectively. Even though the effect of homework is similar in magnitude, two points are noteworthy regarding the selected covariate estimates. First, the class size coefficient is positive and statistically significant at the 10% level, in that increasing the number of students in a math class from the sample average of 23 to 33 will lead to an increase of 0.25 points in math scores. This finding is consistent with, and similar in magnitude to, Goldhaber and Brewer (1997), who use the NELS:88 to assess the impact of class size on tenth grade math test scores. Second, in contrast to Betts (1997), we do not find a significant effect of weekly hours of math class on test scores.
10 The endowed ability variables are jointly significant (p-value = 0.00).
Table 3. Parametric estimates of 10th grade math test scores on homework. Coefficient (Standard error)

                                (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)       (9)
Homework                        4.011     3.474     0.905     0.768     0.825     0.921     0.919     0.453     1.136
                                (0.586)   (0.505)   (0.213)   (0.202)   (0.229)   (0.229)   (0.229)   (0.215)   (0.429)
Homework squared                –         –         –         –         –         –         –         –         −0.192
                                                                                                                (0.098)
8th grade math test score       –         –         0.792     0.732     0.725     0.723     0.721     0.661     0.660
                                                    (0.010)   (0.008)   (0.011)   (0.011)   (0.011)   (0.015)   (0.012)
Class size                      –         –         –         –         –         –         0.025     0.009     0.014
                                                                                            (0.012)   (0.014)   (0.014)
Weekly hours of math class      –         –         –         –         –         –         0.014     −0.034    −0.032
                                                                                            (0.112)   (0.109)   (0.109)
R²                              0.027     0.232     0.762     0.778     0.825     0.826     0.826     0.838     0.839
Other controls
  Demographic and family
  characteristics               No        Yes       Yes       Yes       Yes       Yes       Yes       Yes       Yes
  Endowed ability               No        No        No        Yes       Yes       Yes       Yes       Yes       Yes
  School fixed effects          No        No        No        No        Yes       Yes       Yes       Yes       Yes
  Teacher characteristics       No        No        No        No        No        Yes       Yes       Yes       Yes
  Class characteristics         No        No        No        No        No        No        Yes       Yes       Yes
  Teacher's evaluation of the
  overall class achievement     No        No        No        No        No        No        No        Yes       Yes

Notes: White standard errors are reported in parentheses. See text for definition of the variables.
Moreover, the coefficient is very small in magnitude. It appears that time spent on homework is what matters.

The school fixed effects should capture any factors common to all students in a school, but there may still be some unobserved ability differences across students within a school. For instance, if the overall ability of students in a class is high due to non-randomness in the assignment of students to classes, then the teacher may increase (decrease) the homework load for students in that particular class. If this is the case, the homework coefficient is going to be upward (downward) biased. To control for this possibility, we utilize the teachers' responses on the overall achievement level of the math class. Assuming that measurement error is not dominant, this variable may be helpful in predicting test scores. Regression estimates controlling for class achievement are given in the eighth column of Table 3. The class achievement variables are jointly significant (p-value = 0.00). The homework coefficient is still statistically significant, but considerably diminished in magnitude. A similar reduction is observed in the class size effect as well, and it is no longer significant.

Finally, in the last column of Table 3, we test for potential nonlinear effects of homework in the parametric specification by adding a quadratic term. In this model, the homework squared term is negative and statistically significant, suggesting evidence of diminishing returns to the amount of homework assigned. The return to homework becomes zero at around 2.96 hours per day and is negative afterwards. This corresponds to 0.45% of the sample. At the mean level of hours of homework, which is 0.64 per day, the marginal product (partial effect) of homework is roughly 0.89 (= 1.136 − 2(0.192 × 0.64)).11

As noted above, the value-added specification relies on the exogeneity assumption of lagged test scores. If our set of ability variables does not fully capture endowed ability and/or there are some omitted inputs correlated with lagged test scores, then our estimates are susceptible to bias. As a robustness check, we utilize the longitudinal nature of the NELS:88 data for the sample of 6634 observations from the first and second follow-up surveys and run a student fixed effect model, assuming that the impact of homework is the same across grades, rather than a value-added specification.12 The student fixed effect estimates are 0.96 (0.42) and −0.18 (0.09) for homework and homework squared (the partial effect of homework evaluated at the mean level of homework is 0.75), respectively, and the remaining covariate estimates are qualitatively similar to those presented in the last column of Table 3 (all estimates are available upon request).13 In this respect, our value-added specification does not seem to be seriously contaminated by endogeneity of lagged test scores, and therefore we take the quadratic value-added model (column 9) as our preferred parametric specification for the remainder of the paper.

To summarize, our parametric estimates provide four key insights. First, inclusion of the teacher's evaluation of class achievement in the regression is crucial. In the absence of such a control, the coefficients on homework and class size are overstated.
11 To check for complementarity in the educational production function, we also tried including interaction terms between homework and other schooling inputs (class and teacher characteristics) one at a time. In no case did any of these interactions become significant at even the 10% level.
12 Ideally, we would like to include the eighth grade sample in our student fixed effect estimation as well; however, the teachers are not asked to report the daily hours of homework assigned in the base year sample.
13 In addition to socioeconomic status and size of the family, and teacher and classroom characteristics, the student fixed effect estimation controls for the following school-grade specific variables: average daily attendance rate, percentage of students from single parent homes, percentage of students in remedial math and percentage of limited English proficiency students.
We believe that the teacher's assessment of the class purges out some of the ability differences within the school, as well as representing the teacher's expectations of the class, and thus alleviates bias arising from the possible endogeneity of homework. Second, in contrast to time spent on homework, time spent in class is not a significant contributor to math test scores. This may suggest that learning by doing is a more effective tool for improving student achievement. Third, compared to more standard spending related measures such as class size, additional homework appears to have a larger and more significant impact on math test scores. Fourth, hours of homework assigned exhibit diminishing returns, but only 0.45% of the sample respond negatively to additional homework.

5.2. Nonparametric estimates

Prior to discussing the results, we conduct the Hsiao et al. (2007) specification test based on the assumption that the correct functional form is that of the last column of Table 3. The preferred parametric model (ninth column) is strongly rejected (p-value = 0.00); the linear parametric model (eighth column) is also rejected (p-value = 0.00). These findings raise concerns regarding the functional form assumptions of the educational production function in the existing school quality literature. Nonparametric models have the potential to alleviate these concerns since these types of procedures allow for nonlinearities/interactions in and among all variables.

Turning to the results, Table 4 displays the nonparametric estimates of homework on math test scores.14 Given the number of parameters (a unique coefficient for each student in the sample) obtained from the generalized kernel estimation procedure, it is tricky to present the results. Unfortunately, no widely accepted presentation format exists. Therefore, in Table 4, we give the mean estimate, as well as the estimates at each decile of the coefficient distribution, along with their respective bootstrapped standard errors.

The mean nonparametric estimate is positive but statistically insignificant. Looking at the coefficient distribution, we observe a positive and marginally significant effect for the 60th percentile and significant effects for the upper three deciles. The squared correlation between the actual and predicted values of student achievement rises from 0.84 to 0.88 when we switch from the parametric to the nonparametric model. Precision set aside, the parametric estimate at the mean level of homework obtained from the last column of Table 3 is larger than the corresponding mean of the nonparametric estimates. More importantly, roughly 25% of the nonparametric estimates are negative. In other words, more than 25% of the students do not respond positively to additional homework, whereas this ratio is only 0.45% of the sample in the parametric model. Table A1 in the Appendix displays the sample statistics for those with negative homework coefficients. The most interesting pattern, when we compare it with the regression sample, is observed in the overall class achievement: students with negative coefficients are concentrated in classes which the teacher evaluates as average. We further analyze this point in the next subsection.

Table 5 presents the nonparametric estimates of selected covariates. We present the mean, as well as the nonparametric estimates corresponding to the 25th, 50th and 75th percentiles of the coefficient distribution (labelled Q1, Q2 and Q3).
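Reproducing this presentation format from a vector of observation-specific estimates is straightforward; the following minimal sketch (function and variable names are ours) reports the mean, the deciles, and the share of negative estimates. Bootstrapped standard errors would come from repeating the full estimation over resamples.

```python
import numpy as np

def decile_summary(beta_hat):
    """Summarize observation-specific coefficients the way Table 4 does:
    the mean, each decile of the coefficient distribution, and the share
    of students with negative estimated returns."""
    out = {"mean": float(np.mean(beta_hat))}
    for d in range(10, 100, 10):
        out[f"{d}%"] = float(np.percentile(beta_hat, d))
    out["share_negative"] = float(np.mean(beta_hat < 0))
    return out
```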
The results for the eighth grade math scores are in line with the parametric estimates and are statistically significant throughout the distribution. The class size effect, however, differs from the parametric estimates: the mean nonparametric estimate indicates a reversal in the sign of the class size effect.
14 For all nonparametric estimates, we control for individual, family, endowed ability, teacher and classroom characteristics as well as school fixed effects.
Table 4. Nonparametric estimates of 10th grade math test scores on homework. Coefficient (Standard error)

Mean      0.593   (0.523)
10%      −0.589   (0.352)
20%      −0.165   (0.531)
30%       0.108   (0.543)
40%       0.324   (0.296)
50%       0.513   (0.345)
60%       0.727   (0.398)
70%       0.963   (0.426)
80%       1.308   (0.432)
90%       1.847   (0.567)
R²        0.881

Notes: Standard errors are obtained via bootstrapping. Estimations control for individual, family, endowed ability, teacher and classroom characteristics as well as school fixed effects.
Even though we do obtain primarily negative coefficients, a majority are insignificant and thus we are unable to draw a definite conclusion. The mean return to time spent in class is also negative and larger in magnitude than the parametric estimate. In addition, the negative effect is statistically significant at the first quartile.

In sum, the relaxation of the parametric specification reveals at least three findings. First, at the mean, the predicted effect of homework from the parametric estimate (0.89) is roughly 1.5 times larger than the nonparametric estimate (0.59). Second, parametric estimates understate the percentage of students with negative responses to homework. However, extra homework continues to be significantly effective for at least 40% of the sample under the nonparametric model. Third, the sign of the (mean) class size coefficient is reversed from positive to negative.

5.3. Effects of homework by achievement group

Given the concentration of students with negative responses at the average achievement level, we further explore the impact of homework on subgroups based on the teacher's evaluation of the class. Note that, in contrast to the parametric model, we do not need to split the sample and reestimate for each subgroup because we have already obtained a unique coefficient for homework for each individual in the nonparametric model.
Table 5. Quartile estimates for selected covariates. Coefficient (Standard error)

                              Mean       Q1         Q2         Q3
8th grade math test score     0.722      0.658      0.724      0.785
                              (0.040)    (0.023)    (0.029)    (0.023)
Class size                    −0.006     −0.047     −0.016     0.029
                              (0.026)    (0.036)    (0.029)    (0.085)
Weekly hours of math class    −0.229     −0.542     −0.222     0.075
                              (0.301)    (0.230)    (0.387)    (0.805)

Notes: Standard errors are obtained via bootstrapping.
Table 6 displays the mean nonparametric estimate, the impact at each decile of the coefficient distribution, as well as the parametric estimate of homework for each subgroup. In the parametric specifications, we exclude the homework squared term unless it is significant at the 10% level or better.

The first column of Table 6 presents the results for the high achievement group. The parametric estimate of homework is significant with a value of 2.25 (0.53) and is higher than the corresponding 90th percentile of the nonparametric coefficient distribution; the mean nonparametric counterpart is 0.83 (0.40) and statistically significant. Thus, the nonparametric model indicates that the parametric model vastly overstates the homework effect for virtually the entire subsample. In addition, the parametric model cannot capture the heterogeneity inherent in the model. For instance, the homework effect is more than twice as large at the 90th percentile (1.79) of the coefficient distribution as it is at the median (0.71).

The second column presents the estimates for the average achievement group. The parametric and nonparametric estimates are small in magnitude and do not yield any significant effect of homework on math test scores. Even though the coefficients are insignificant, the nonparametric model indicates that nearly 40% of the subsample respond negatively to extra homework. This may not be surprising given that the students with negative responses are concentrated in average achievement classes.

For the low achievement group, unlike the first two columns, we include the homework squared term in the parametric specification. The return to homework becomes zero at around 2.22 hours and is negative afterwards. This corresponds to roughly 0.42% of the subsample. At the mean level of homework (0.53 hours per day), the partial effect of homework is 1.78 and is higher than the corresponding 80th percentile of the nonparametric coefficient distribution. The mean nonparametric estimate is 0.75 (0.41) and marginally significant. Similar to the first column, the parametric model overstates the homework effect and, moreover, understates the percentage of students with negative responses, which is around 24% of the subsample based on the nonparametric coefficient distribution.

For completeness, the last column presents the estimates for students in classes with widely differing ability levels. The coefficients are large in magnitude but are only statistically significant for the upper three deciles of the nonparametric estimates.
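As a quick check on the low achievement arithmetic, the turning point and the partial effect at the subgroup mean follow directly from the quadratic specification:

$$ HW^{*} = \frac{2.336}{2 \times 0.526} \approx 2.22 \text{ hours}, \qquad \left.\frac{\partial TS}{\partial HW}\right|_{HW = 0.53} = 2.336 - 2(0.526)(0.53) \approx 1.78, $$

consistent with the figures quoted above.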
Table 6. Parametric/nonparametric estimates of 10th grade math test scores on homework by achievement level. Coefficient (Standard error)

                          High           Average        Low            Widely
                          achievement    achievement    achievement    differing
Nonparametric estimates
  Mean                    0.832          0.302          0.755          0.610
                          (0.397)        (0.566)        (0.413)        (1.551)
  0.10                    0.072          −0.757         −0.608         −0.728
                          (0.476)        (0.486)        (0.810)        (1.864)
  0.20                    0.285          −0.454         −0.103         −0.185
                          (0.728)        (0.537)        (0.493)        (1.194)
  0.30                    0.413          −0.221         0.137          0.160
                          (0.364)        (0.472)        (0.613)        (0.942)
  0.40                    0.553          0.011          0.362          0.488
                          (0.390)        (0.556)        (0.950)        (0.885)
  0.50                    0.715          0.231          0.555          0.808
                          (0.419)        (0.735)        (0.526)        (1.308)
  0.60                    0.901          0.453          0.794          1.090
                          (0.447)        (0.538)        (0.401)        (0.895)
  0.70                    1.100          0.708          1.024          1.522
                          (0.578)        (0.566)        (0.599)        (0.763)
  0.80                    1.378          1.019          1.414          1.886
                          (0.575)        (0.648)        (0.767)        (1.051)
  0.90                    1.788          1.490          2.042          2.466
                          (0.489)        (0.788)        (0.676)        (1.006)
Parametric estimates
  Homework                2.254          0.177          2.336          1.023
                          (0.531)        (0.451)        (1.587)        (1.709)
  Homework squared        N/A            N/A            −0.526         N/A
                                                        (0.312)

Notes: Standard errors are obtained via bootstrapping for the nonparametric estimates and White standard errors are reported for the parametric estimates. Homework squared term is excluded unless it is significant at 10% level or better. Estimations control for individual, family, endowed ability, teacher and classroom characteristics as well as school fixed effects.
Table 7 displays the results for selected covariates for each subgroup. We present the parametric results, as well as the nonparametric mean estimates and the nonparametric estimates corresponding to the 25th, 50th and 75th percentiles of the coefficient distribution. Three results emerge. First, the parametric estimates of eighth grade math scores are similar in magnitude to the mean (median) of the nonparametric estimates. Second, for three of the subgroups (high, average and widely differing), we observe predominantly negative but insignificant coefficients for class size.
Table 7. Quartile estimates for selected covariates by achievement level. Coefficient (Standard error)

                              Mean       Q1         Q2         Q3         Parametric
High achievement
  8th grade math test score   0.646      0.593      0.644      0.690      0.587
                              (0.030)    (0.022)    (0.031)    (0.031)    (0.022)
  Class size                  −0.031     −0.055     −0.034     −0.010     −0.031
                              (0.036)    (0.028)    (0.034)    (0.034)    (0.030)
  Weekly hours of math class  −0.125     −0.271     −0.099     0.095      −0.078
                              (0.188)    (0.258)    (0.211)    (0.184)    (0.273)
Average achievement
  8th grade math test score   0.746      0.698      0.743      0.790      0.675
                              (0.026)    (0.023)    (0.064)    (0.037)    (0.021)
  Class size                  −0.012     −0.047     −0.020     0.008      −0.001
                              (0.030)    (0.034)    (0.028)    (0.038)    (0.027)
  Weekly hours of math class  −0.510     −0.756     −0.476     −0.253     −0.237
                              (0.258)    (0.503)    (0.208)    (0.227)    (0.200)
Low achievement
  8th grade math test score   0.752      0.704      0.756      0.809      0.623
                              (0.050)    (0.065)    (0.037)    (0.034)    (0.043)
  Class size                  0.054      0.013      0.051      0.090      0.124
                              (0.046)    (0.036)    (0.036)    (0.035)    (0.064)
  Weekly hours of math class  0.029      −0.332     −0.007     0.337      0.482
                              (0.337)    (0.371)    (0.461)    (0.439)    (0.360)
Widely differing
  8th grade math test score   0.786      0.713      0.789      0.855      0.685
                              (0.044)    (0.057)    (0.047)    (0.038)    (0.064)
  Class size                  −0.021     −0.090     −0.017     0.052      −0.000
                              (0.060)    (0.066)    (0.125)    (0.011)    (0.090)
  Weekly hours of math class  0.079      −0.347     0.080      0.539      −0.585
                              (0.288)    (0.419)    (0.397)    (0.284)    (0.730)

Notes: Standard errors are obtained via bootstrapping for the nonparametric estimates and White standard errors are reported for the parametric estimates.
Similar to the full sample estimates, the parametric models overstate the class size effect. For the low achievement group, however, the parametric class size effect is positive, significant and lies in the upper extreme tail of the corresponding distribution of nonparametric estimates: specifically, the parametric estimate, 0.12 (0.06), maps to roughly the 85th percentile of the nonparametric coefficient distribution. Finally, for the average achievement group, the nonparametric estimates of time spent in class are negative and statistically significant at the mean and median.
Table 8. Stochastic dominance tests of the coefficient distributions.

                                            Equality of distributions    First order stochastic dominance
                                            p-values                     p-values
High achievement/average achievement        0.000                        0.906
High achievement/low achievement            0.000                        0.165
High achievement/widely differing           0.000                        0.000
Low achievement/average achievement         0.000                        0.961
Low achievement/widely differing            0.000                        0.000
Widely differing/average achievement        0.000                        0.909

Notes: Probability values are obtained via bootstrapping. The null hypothesis is rejected if the p-value is smaller than some significance level α (0 < α < 1/2).
The final set of results is provided in Table 8. We report the p-values associated with the null hypotheses of equality and FSD for the homework coefficient distributions among the four subgroups. The corresponding cumulative distribution functions are plotted in Figure 1. For all subgroups, we can easily reject equality of distributions at conventional confidence levels (p-value = 0.00). In terms of rankings, the homework returns for the three subgroups (high, low and widely differing) dominate average achievers' returns in the first order sense, which further confirms that extra homework is less effective, or may not be effective at all, for average achievers. We do not observe FSD between the widely differing ability group and low or high achievers. There is some evidence of FSD of the return distribution of high achievers over low achievers, but this evidence is relatively weak.15

As discussed, the theoretical models suggest that homework should positively affect the student's achievement up to some limit and then have no effect. In this respect, extra homework leading to gains for high achievers is not at odds with theory. The mean hours of homework for high achievers is 0.74 (0.40), but this amount may be far away from the subgroup's 'give-up' limit. The potential puzzle in our results is that extra homework is not effective for average achievers, despite leading to gains for low achievers. One possibility is that average achievers are at the edge of their maximum effort, whereas low achievers are below their threshold level. The mean hours of homework are 0.64 (0.36) and 0.53 (0.37) for average and low achievers, respectively. If the 'give-up' level for low achievers is some value greater than 0.53, then they will benefit from the extra homework. Although this is by no means a definitive explanation for our findings, it is a plausible one.16
15 We also examine the returns to homework for subgroups based on gender and race. The tests do not lead to strong conclusions for FSD. The results are available upon request.
16 As a robustness check, we also divide the sample based on the eighth grade math score distribution to evaluate homework effectiveness. Consistent with the teacher's overall class evaluation, we do not obtain any significant effect for average achievers. The coefficient estimates for low and high achievers are statistically significant and large in magnitude. These results are available upon request.
[Figure 1 here: six panels of empirical cumulative distribution functions F(·), plotted over the support of the estimated homework coefficients, comparing pairs of achievement groups (high, average, low and widely differing).]

Figure 1. Cumulative distribution functions of the estimated homework coefficients from the generalized kernel estimation procedure by achievement level.
6. CONCLUSIONS

The stagnation of academic achievement in the United States has given rise to a growing literature seeking to understand the determinants of student learning. Utilizing parametric and nonparametric techniques, we assess the impact of a heretofore relatively unexplored 'input' in the educational process, homework, on tenth grade test performance.
Our results indicate that homework is an important determinant of student achievement. Relative to more standard spending related measures such as class size, extra homework appears to have a larger and more significant impact on math test scores. However, the effects are not uniform across different subpopulations: we find additional homework to be most effective for high and low achievers. This is further confirmed by introducing stochastic dominance techniques into the examination of returns between groups from a nonparametric regression. In doing so, we find that the returns for both the high and low achievement groups uniformly dominate the returns for the average achievement group. Further, in contrast to time spent on homework, time spent in class is not a significant contributor to math test scores. This may suggest that learning by doing is a more effective tool for improving student achievement. Finally, parametric estimates of the educational production function overstate the impact of schooling related inputs and thus raise concerns regarding the commonly used specifications in the existing literature. Specifically, in all estimates, both the homework and class size coefficients from the parametric model map to the upper deciles of the nonparametric coefficient distribution and, as a by-product, parametric estimates understate the percentage of students with negative responses to additional homework.
ACKNOWLEDGMENTS

The authors wish to thank two anonymous referees, Kevin Grier, Qi Li, Essie Maasoumi, Daniel Millimet, Solomon Polachek and Jeff Racine for helpful comments which led to an improved version of this paper, as well as seminar participants at the University of Oklahoma, Binghamton University and Temple University. The paper also benefited from the comments of participants at the Winemiller Conference on Methodological Developments of Statistics in the Social Sciences at the University of Missouri (October 2006), the North American Summer Meetings of the Econometric Society at Duke University (June 2007) and the Western Economic Association International Meetings in Seattle, WA (July 2007). The data used in this article can be obtained from the corresponding author upon request.
REFERENCES

Abadie, A. (2002). Bootstrap tests for distributional treatment effects in instrumental variable models. Journal of the American Statistical Association 97, 284–92.
Aitchison, J. and C. G. G. Aitken (1976). Multivariate binary discrimination by the kernel method. Biometrika 63, 413–20.
Aksoy, T. and C. R. Link (2000). A panel analysis of student mathematics achievement in the US in the 1990s: does increasing the amount of time in learning activities affect math achievement? Economics of Education Review 19, 261–77.
Altonji, J., T. Elder and C. Taber (2005). Selection on observed and unobserved variables: assessing the effectiveness of Catholic schools. Journal of Political Economy 113, 151–84.
Bertrand, M. and S. Mullainathan (2001). Do people mean what they say? Implications for subjective survey data. American Economic Review 91, 67–72.
Betts, J. R. (1997). The role of homework in improving school quality. Working paper, University of California, San Diego.
Boozer, M. A. and C. Rouse (2001). Intraschool variation in class size: patterns and implications. Journal of Urban Economics 50, 163–89.
Eren, O. and D. L. Millimet (2007). Time to learn? The organizational structure of schools and student achievement. Empirical Economics 32, 301–32.
Figlio, D. N. and M. E. Lucas (2003). Do high grading standards affect student performance? Journal of Public Economics 88, 1815–34.
Fuchs, T. and L. Wößmann (2007). What accounts for international differences in student performance? A re-examination using PISA data. Empirical Economics 32, 433–64.
Goldhaber, D. B. and J. B. Brewer (1997). Why don't schools and teachers seem to matter? Assessing the impact of unobservables on educational productivity. Journal of Human Resources 32, 505–23.
Grogger, J. and E. Eide (1995). Changes in college skills and the rise in the college wage premium. Journal of Human Resources 30, 280–310.
Hanushek, E. A. (1979). Conceptual and empirical issues in the estimation of educational production functions. Journal of Human Resources 14, 351–88.
Hanushek, E. A. (2003). The failure of input based schooling policies. Economic Journal 113, 64–98.
Hanushek, E. A. (2005). Teachers, schools and academic achievement. Econometrica 73, 417–58.
Henderson, D. J., A. Olbrecht and S. W. Polachek (2006). Do former athletes earn more at work? A nonparametric assessment. Journal of Human Resources 41, 558–77.
Hoxby, C. M. (1999). The productivity of schools and other local public goods producers. Journal of Public Economics 74, 1–30.
Hsiao, C., Q. Li and J. Racine (2007). A consistent model specification test with mixed categorical and continuous data. Journal of Econometrics 140, 802–26.
Kniesner, T. J. and Q. Li (2002). Nonlinearity in dynamic adjustment: semiparametric estimation of panel labor supply. Empirical Economics 27, 131–48.
Li, Q. and J. Racine (2004). Cross-validated local linear nonparametric regression. Statistica Sinica 14, 485–512.
Li, Q. and J. Racine (2006). Nonparametric Econometrics: Theory and Practice. Princeton: Princeton University Press.
Maasoumi, E., D. L. Millimet and V. Rangaprasad (2005). Class size and educational policy: who benefits from smaller classes? Econometric Reviews 24, 333–68.
Maasoumi, E., J. Racine and T. Stengos (2007). Growth and convergence: a profile of distribution dynamics and mobility. Journal of Econometrics 136, 483–508.
Murnane, R. J., J. B. Willett and F. Levy (1995). The growing importance of cognitive skills in wage determination. Review of Economics and Statistics 77, 251–66.
Neilson, W. (2005). Homework and performance for time-constrained students. Economics Bulletin 9, 1–6.
Racine, J. (2003). N©. Available at http://www.economics.mcmaster.ca/racine/.
Racine, J. and Q. Li (2004). Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics 119, 99–130.
Rose, H. and J. R. Betts (2004). The effect of high school courses on earnings. Review of Economics and Statistics 86, 497–513.
Todd, P. E. and K. I. Wolpin (2003). On the specification and estimation of the production function for cognitive achievement. Economic Journal 113, F3–F33.
Wang, M. C. and J. van Ryzin (1981). A class of smooth estimators for discrete distributions. Biometrika 68, 301–09.
APPENDIX

Table A1. Sample statistics of key variables for students with negative responses to additional homework.

                                                Negative response sample
                                                Mean        SD
10th grade math test score                      49.586      8.566
Assigned daily hours of homework                0.659       0.470
Weekly hours of math class                      3.867       1.114
8th grade math test score                       48.991      8.240
Mother's education
  High school dropout                           0.142       0.349
  High school                                   0.493       0.500
  Junior college                                0.114       0.318
  College less than 4 years                     0.078       0.269
  College graduate                              0.094       0.292
  Master degree                                 0.063       0.243
  Ph.D., MD., etc.                              0.012       0.111
Family size                                     4.575       1.322
Female                                          0.537       0.498
Race
  Black                                         0.126       0.332
  Hispanic                                      0.076       0.266
  Other                                         0.023       0.151
  White                                         0.773       0.418
Ever held back a grade (1 = Yes)                0.126       0.333
Ever enrolled in a gifted class (1 = Yes)       0.154       0.361
% of teachers holding a graduate degree         0.585       0.492
Teacher's race
  Black                                         0.061       0.239
  Hispanic                                      0.020       0.142
  Other                                         0.007       0.084
  White                                         0.910       0.285
Teacher's evaluation of the overall class achievement
  High level                                    0.097       0.296
  Average level                                 0.627       0.483
  Low level                                     0.191       0.393
  Widely differing                              0.083       0.277
Class size                                      23.891      6.945
Number of observations                          966

Notes: Weighted summary statistics are reported. The variables listed are only a subset of those utilized in the analysis. The remainder are excluded in the interest of brevity. The full set of sample statistics is available upon request.
Econometrics Journal (2008), volume 11, pp. 349–376. doi: 10.1111/j.1368-423X.2008.00242.x
Generalized LM tests for functional form and heteroscedasticity
ZHENLIN YANG† AND YIU-KUEN TSE†
†School of Economics, Singapore Management University, Singapore 178903. E-mail: [email protected]; [email protected]
First version received: May 2005; final version accepted: December 2007
Summary We present a generalized LM test of heteroscedasticity allowing the presence of data transformation and a generalized LM test of functional form allowing the presence of heteroscedasticity. Both generalizations are meaningful as non-normality and heteroscedasticity are common in economic data. A joint test of functional form and heteroscedasticity is also given. These tests are further 'studentized' to account for possible excess skewness and kurtosis of the errors in the model. All tests are easy to implement. They are based on the expected information and are shown to possess excellent finite sample properties. Several related tests are also discussed and their finite sample performances assessed. We find that our newly proposed tests significantly outperform the others, in particular when the errors are non-normal.

Keywords: Box-Cox transformation, Double length regression, Functional form, Heteroscedasticity, LM tests, Robustness.
1. INTRODUCTION

Non-normality and heteroscedasticity are common in economic data. A popular approach to modelling such data is to apply a non-linear transformation to the response and some of the regressors, with the anticipation that the transformed model has independent and homoscedastic normal errors and a simple model structure. In practice, however, it may not be the case that all of these goals can be achieved simultaneously by a single transformation. Typically, when genuine heteroscedasticity is present in the data, it may not be possible to find a transformation that brings the data to normality as well as homoscedasticity. A more proper and realistic approach is perhaps to model the heteroscedasticity directly while allowing the presence of data transformation in the model. Thus, the role of the transformation is basically to induce normality and a relatively simpler model structure (or a correct functional form). This model, termed the Box-Cox heteroscedastic regression (BCHR) model in the literature, has found interesting applications in economics (see, e.g. Yang and Tse, 2006).

This paper presents three LM tests for the BCHR model based on the expected information (EI). We first derive a simple but general LM test for heteroscedasticity allowing the presence of data transformation in the model. There is a large literature on tests for heteroscedasticity,
and most of these tests are based on the assumption that the observations are normal.1 Some authors have relaxed the normality condition and provided robust tests for heteroscedasticity (see, e.g. Koenker, 1981, and Ruppert and Carroll, 1981). Allowing a normalizing data transformation in the model is perhaps another way to account for the non-normality of the data. Also, most of these tests concern only a null hypothesis of homoscedastic errors (e.g. Breusch and Pagan, 1979). The need for a more general test is evident: when the null hypothesis of homoscedasticity is rejected, one would like to know which heteroscedastic variables are responsible for it. Hence, our test generalizes that of Breusch and Pagan (1979) in two dimensions: (i) from a null hypothesis of homoscedasticity to a null hypothesis of a certain form of heteroscedasticity and (ii) from a regular linear regression model to a transformed regression model. To further safeguard against non-normality, we provide a studentized LM test which generalizes that of Koenker (1981).

We then derive a generalized LM test for functional form allowing the presence of heteroscedasticity in the model. This test generalizes that of Yang and Abeysinghe (2003). Most functional form tests concern either a specific functional form (linear or log-linear) or a model with homoscedastic errors.2 Our test allows for a general Box-Cox functional form that includes the linear, log-linear, square-root, cubic-root, etc. as special cases, and the presence of a general heteroscedastic structure in the model. Interestingly, this test is shown, through Monte Carlo simulations, to be fairly robust against non-normality. Finally, a joint test of functional form and heteroscedasticity is given, which generalizes that of Lahiri and Egy (1981); a robust version of it follows from the studentization or the robustness property of the two marginal tests.

There are other tests one could use, such as the LM test based on the Hessian, the LM test based on the outer-product-of-gradient (OPG), the LM test based on double length regression (DLR), and the likelihood ratio (LR) test.3 They are all much easier to derive than the EI-based LM test, but not necessarily easier to implement in practical applications. More importantly, their finite sample performance remains unknown, at least in the context of the BCHR model. In this paper, we present empirical evidence on the finite sample performance of the tests discussed above, including the newly proposed ones, through extensive Monte Carlo simulations. In terms of size, some general observations are in order: (i) the three EI-based LM tests generally outperform all the others; (ii) the tests are ranked in the following order: LM-EI, LM-DLR, LR, LM-Hessian and LM-OPG; (iii) LM-DLR performs reasonably well, especially considering the fact that it is based on only the first derivatives of the log-likelihood function; (iv) LM-OPG often performs very poorly and (v) the studentized LM test for heteroscedasticity, the LM-EI test for functional form, and the studentized joint test are all quite robust against non-normality. In terms of size-adjusted power, the EI-based tests always have better or similar power compared with the others.

Section 2 presents the model and the estimation procedure. Section 3 presents the three tests. Section 4 contains the Monte Carlo simulation results and Section 5 concludes the paper.
Appendix A contains the score and Hessian functions, Appendix B discusses some related tests, and Appendix C contains the proofs of the theorems and corollaries.
1 See, for example, Goldfeld and Quandt (1965), Glejser (1969), Harvey (1976), Amemiya (1977), Breusch and Pagan (1979), Ali and Giaccotto (1984), Griffiths and Surekha (1986), Farebrother (1987), Maekawa (1987), Evans and King (1988), Kalirajan (1989), Evans (1992), Wallentin and Agren (2002), Dufour et al. (2004) and Godfrey et al. (2006).
2 See, for example, Box and Cox (1964), Godfrey and Wickens (1981), Tse (1984), Davidson and MacKinnon (1985), Lawrance (1987), Baltagi (1997) and Yang and Abeysinghe (2003).
3 For a comparison of the observed and expected Fisher information, see Lindsay and Li (1997).
2. MODEL ESTIMATION

The BCHR model takes the following general form:

$$ h(y_i, \lambda) = \sum_{j=1}^{k_1} x_{ij}\beta_j + \sum_{j=k_1+1}^{k} h(x_{ij}, \lambda)\beta_j + \sigma\,\omega(v_i, \gamma)\, e_i \equiv x_i(\lambda)\beta + \sigma\,\omega(v_i, \gamma)\, e_i, \quad i = 1, \ldots, n, \qquad (2.1) $$
where h(·, λ) is a monotonic increasing transformation dependent on a parameter vector λ with p elements, β = (β_1, ..., β_k)′ is a k × 1 vector of regression coefficients, x_ij is the ith value of the jth regressor, ω(v_i, γ) ≡ ω_i(γ) is the weight function, v_i is a set of q weighting variables, γ is a q × 1 vector of weighting parameters, σ is a constant, and {e_i} are independent and identically distributed (i.i.d.) with zero mean and unit variance. The first k_1 of the k regressors are not transformed as they correspond to the intercept, dummy variables, etc.

Let $\psi = \{\beta', \sigma^2, \gamma', \lambda'\}'$, let $\Omega^{1/2}(\gamma) = \mathrm{diag}\{\omega_1(\gamma), \ldots, \omega_n(\gamma)\}$ with $\Omega(\gamma) = \Omega^{1/2}(\gamma)\,\Omega^{1/2}(\gamma)$, let X(λ) be the n × k regression matrix, and let Y be the n × 1 vector of the (untransformed) dependent variable. The Gaussian log-likelihood function of model (2.1), ignoring the constant, is

$$ \ell(\psi) = -\frac{n}{2}\log\sigma^2 - \sum_{i=1}^{n}\log\omega_i(\gamma) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[\frac{h(y_i,\lambda) - x_i(\lambda)\beta}{\omega_i(\gamma)}\right]^2 + \sum_{i=1}^{n}\log h_y(y_i,\lambda), \qquad (2.2) $$

where $h_y(y,\lambda) = \partial h(y,\lambda)/\partial y$.

Define $M(\gamma,\lambda) = I_n - \Omega^{-1/2}(\gamma)\, X(\lambda)[X'(\lambda)\Omega^{-1}(\gamma)X(\lambda)]^{-1}X'(\lambda)\,\Omega^{-1/2}(\gamma)$, where I_n is the n × n identity matrix. Maximizing (2.2) under given γ and λ results in the constrained estimates

$$ \hat\beta(\gamma,\lambda) = [X'(\lambda)\Omega^{-1}(\gamma)X(\lambda)]^{-1}X'(\lambda)\Omega^{-1}(\gamma)\,h(Y,\lambda), \qquad (2.3) $$

$$ \hat\sigma^2(\gamma,\lambda) = \frac{1}{n}\, h'(Y,\lambda)\,\Omega^{-1/2}(\gamma)\, M(\gamma,\lambda)\, \Omega^{-1/2}(\gamma)\, h(Y,\lambda), \qquad (2.4) $$

which upon substitution give the concentrated Gaussian log-likelihood

$$ \ell_p(\gamma,\lambda) = n \log[\dot J(\lambda)/\dot\omega(\gamma)] - \frac{n}{2}\log\hat\sigma^2(\gamma,\lambda), \qquad (2.5) $$

where $\dot\omega(\gamma)$ and $\dot J(\lambda)$ are the geometric means of $\omega_i(\gamma)$ and $J_i(\lambda) = h_y(y_i,\lambda)$, respectively.

When {e_i} are exactly normal, maximizing ℓ_p(γ, λ) over λ gives the constrained maximum likelihood estimate (MLE) λ̂_c of λ for a given γ, maximizing ℓ_p(γ, λ) over γ gives the constrained MLE γ̂_c of γ for a given λ, and maximizing ℓ_p(γ, λ) jointly over γ and λ gives the unconstrained MLEs γ̂ and λ̂ of γ and λ, respectively. Substituting these constrained or unconstrained MLEs into equations (2.3) and (2.4) gives the constrained or unconstrained MLEs of β and σ². When {e_i} are not exactly normal, the above procedure leads to Gaussian quasi-MLEs (QMLEs) of the model parameters. Under mild conditions, these MLEs or QMLEs are consistent and asymptotically normal with the same mean but different variance-covariance matrices.4
4 See Hernandez and Johnson (1980), Bickel and Doksum (1981), Carroll and Ruppert (1984) and Chen et al. (2002) for asymptotic results for some related models.
3. GENERALIZED LM TESTS

We first introduce some general notation. Define D_∘(γ) = {ω_{iγ}(γ)/ω_i(γ)}_{n×q} and D(γ) = {1_n, D_∘(γ)}, where 1_n is the n × 1 vector of ones and ω_{iγ}(γ) = ∂ω_i(γ)/∂γ. Let ε(γ, λ) = {ε_i(γ, λ)}_{n×1}, where ε_i(γ, λ) = [h(y_i, λ) − x_i'(λ)β̂(γ, λ)]/[ω_i(γ)σ̂(γ, λ)], and g(γ, λ) = {g_i(γ, λ)}_{n×1}, where g_i(γ, λ) = ε_i²(γ, λ) − 1. Let h_λ(y_i, λ) and g_λ(γ, λ) be, respectively, the partial derivatives of the h and g functions with respect to λ.

Some basic assumptions are as follows. We assume ω(v_i, 0) = constant (as commonly assumed in the literature), so that γ = 0 represents a model with homoscedastic errors. Without loss of generality, we take ω(v_i, 0) = 1. We assume that ω_i(γ) is twice differentiable, and that h(y_i, λ) is differentiable once with respect to y_i and twice with respect to λ. Some general technical assumptions are as follows. Proofs of all results are given in Appendix C.

ASSUMPTION 3.1. The disturbances {e_i} are independent and identically distributed with mean zero, variance one, skewness α, and finite kurtosis κ.

ASSUMPTION 3.2. The limit lim_{n→∞} (1/n) X'(λ)Ω^{−1}(γ)X(λ) exists and is positive definite.

ASSUMPTION 3.3. The limit lim_{n→∞} (1/n) D'(γ)D(γ) exists and is positive definite. Further, the elements of D(γ) are uniformly bounded.

3.1. A generalized LM test for heteroscedasticity

THEOREM 3.1. Under Assumptions 3.1–3.3, assume further that (i) α = 0 and κ = 3, (ii) n^{−1/2} D'(γ)g_λ(γ, λ̄) = O_p(1) uniformly in λ̄ in a neighborhood of λ, and (iii) λ̃ is a consistent estimator of λ.5 The LM statistic for testing H_0: γ = γ_0 versus H_a: γ ≠ γ_0 takes the form
$$\mathrm{LM}_E(\gamma_0) = \frac{1}{2}\, g'(\gamma_0, \tilde\lambda)\, D(\gamma_0)[D'(\gamma_0)D(\gamma_0)]^{-1}D'(\gamma_0)\, g(\gamma_0, \tilde\lambda), \qquad (3.1)$$
which has an asymptotic χ²_q distribution under H_0.

It turns out that this new test statistic is very simple: it is just one half of the explained sum of squares of the regression of g_i(γ_0, λ̃) + 1 on D_i(γ_0), the ith column of D'(γ_0). On the other hand, the test is very general, as it works with any smooth transformation function h and weighting function ω. Robustness of (3.1) against non-normality of the original data Y is enhanced because the test allows the normalizing transformation to be chosen according to the data. Furthermore, if ω(v_i, γ) = ω(v_i'γ), the special test for homoscedasticity takes a simpler form, and the test (like that of Breusch and Pagan 1979) does not depend on the exact form of the ω function. We have the following corollary.
5 λ̃ could be λ̂_c, or λ̂, or any other estimator which converges in probability to λ as n → ∞. For example, such an estimator could be constructed by adapting the method proposed by Powell (1996).
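Before stating the corollary, we note that (3.1), and the studentized statistic (3.3) introduced in Section 3.2 below, are quadratic forms requiring only the standardized residuals and the matrix D(γ_0). A minimal sketch, with all names hypothetical:

```python
# LM_E of (3.1) and, with studentized=True, LM*_E of (3.3).
import numpy as np

def lm_het(eps, D, studentized=False):
    # eps : standardized residuals eps_i(gamma_0, lambda_tilde)
    # D   : n x (q+1) matrix D(gamma_0) = [1_n, D_circ(gamma_0)]
    g = eps**2 - 1.0                                  # g_i(gamma_0, lambda_tilde)
    quad = g @ D @ np.linalg.solve(D.T @ D, D.T @ g)  # g'D(D'D)^{-1}D'g
    # np.mean(g**2) estimates kappa_tilde - 1 in (3.3)
    return quad / np.mean(g**2) if studentized else 0.5 * quad
```

The statistic is referred to the χ²_q distribution; under the conditions of Corollary 3.1 below, one simply passes D = V.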
COROLLARY 3.1. Under the conditions of Theorem 3.1, assume further that ω(v_i, γ) = ω(v_i'γ). Then, the LM statistic for testing H_0: γ = 0 becomes
$$\mathrm{LM}_E(0) = \frac{1}{2}\, g'(0, \tilde\lambda)\, V (V'V)^{-1} V'\, g(0, \tilde\lambda), \qquad (3.2)$$
where V = {1, v_i'}_{n×(q+1)}.

The test statistic for homoscedasticity in Corollary 3.1 is simply one half of the explained sum of squares of the regression of g_i(0, λ̃) + 1 on V_i = (1, v_i')'. It gives a one-step generalization of that of Breusch and Pagan (1979) by allowing a normalizing transformation to be present in the model, and hence it is more robust against non-normality of the data. The test in Theorem 3.1 gives a two-step generalization by allowing for both a normalizing transformation and a non-zero null vector γ_0. Hence, the test is not only more robust against non-normality of the data, it also allows for easy identification of truly heteroscedastic variables. It turns out that the asymptotic distribution of the test statistic does not depend on whether the λ parameter is pre-specified or estimated from the data.

3.2. Studentizing the LM test for heteroscedasticity

The LM tests given in Theorem 3.1 and Corollary 3.1 require that α = 0 and κ = 3, which means that the disturbances {e_i} are essentially Gaussian. This is in line with the aims of a data transformation: to induce normality, homoscedasticity, as well as a simple model structure (or correct functional form). However, in many practical applications it may not be possible to achieve these three goals simultaneously with a single transformation, in particular exact normality of the errors. In this case, it might be more reasonable to assume that, after the transformation, one has a correct functional form for the model while the errors obey Assumption 3.1 with arbitrary α and κ. In this subsection we explore generalizations of the results given in Theorem 3.1 and Corollary 3.1 by dropping the assumptions that α = 0 and κ = 3.

Koenker (1981) generalized the result of Breusch and Pagan (1979) by providing a studentized version of the LM test for homoscedasticity, which is robust against non-normality of the errors in terms of excess kurtosis. Very recently, Dufour et al. (2004) and Godfrey et al. (2006) presented simulation-based tests for heteroscedasticity in linear regression models. While allowing the presence of data transformations and a general heteroscedastic structure in the model complicates the matter, we are able to provide a result that very much parallels that of Koenker (1981).6

COROLLARY 3.2. Under the assumptions of Theorem 3.1 with arbitrary α and κ, the LM statistic for testing H_0: γ = γ_0 versus H_a: γ ≠ γ_0 takes the form
$$\mathrm{LM}^*_E(\gamma_0) = \frac{1}{\tilde\kappa - 1}\, g'(\gamma_0, \tilde\lambda)\, D(\gamma_0)[D'(\gamma_0)D(\gamma_0)]^{-1}D'(\gamma_0)\, g(\gamma_0, \tilde\lambda), \qquad (3.3)$$
6 We are very grateful to a referee for directing our attention to the robustness issue of the LM tests for heteroscedasticity, which directly results in a new and more useful result as stated in Corollary 3.2. This idea is further explored in Sections 3.3 and 3.4 to provide robust tests for the other two cases.
where κ̃ − 1 = (1/n) Σ_{i=1}^n g_i²(γ_0, λ̃). The statistic has an asymptotic χ²_q distribution under H_0. Furthermore, if γ_0 = 0 and ω_i(v_i, γ) = ω(v_i'γ), then LM*_E(0) = (κ̃ − 1)^{−1} g'(0, λ̃)V(V'V)^{−1}V'g(0, λ̃).

Note that LM*_E(γ_0) can be written as nR², where R² is the uncentered coefficient of determination from the regression of g(γ_0, λ̃) on D(γ_0). Also note that LM*_E(γ_0) is as simple as LM_E(γ_0), but should be much more useful when there exist excess skewness and kurtosis, even if a normalizing transformation is applied to the data. This point is later confirmed by the Monte Carlo simulation.

3.3. A generalized LM test for functional form

Unlike the LM test for heteroscedasticity, which requires only the submatrix of the expected information for a given λ, the main difficulty in deriving the expected information-based LM test for functional form is that it requires the explicit expression of the full expected information matrix. This is impossible for a general transformation function. However, when h is the Box-Cox power transformation, h(y, λ) = (y^λ − 1)/λ if λ ≠ 0 and log y if λ = 0 (Box and Cox 1964), we are able to derive a very accurate approximation to the full expected information matrix, based on which a simple LM test for functional form emerges. The approximation is based on the expansion
$$\lambda \log y_i = \log(1 + \lambda\eta_i) + \theta_i e_i - \frac{1}{2}\theta_i^2 e_i^2 + \cdots + \frac{(-1)^{k+1}}{k}\theta_i^k e_i^k + \cdots, \qquad (3.4)$$
where θ_i = λσω_i(γ)/(1 + λη_i) and η_i = x_i'(λ)β. Typically, the θ_i are small, and in this case one may need only a few terms to obtain the desired degree of approximation accuracy.7

We need further notation. Let u(γ, λ) = {[h_λ(y_i, λ) − x'_{iλ}(λ)β̂(γ, λ)]/[ω_i(γ)σ̂(γ, λ)]}_{n×1}, where x_{iλ}(λ) is the first derivative of x_i(λ). Let h_{λλ}(y_i, λ) = ∂²h(y_i, λ)/∂λ². Define θ_0 = max{|θ_i|, i = 1, ..., n}, θ = {θ_i}_{n×1}, φ = {log(1 + λη_i)}_{n×1}, A = I_n − (1/n)1_n 1_n', and R(γ) = AD_∘(γ)[D_∘'(γ)AD_∘(γ)]^{−1}D_∘'(γ)A. Common functions applied to a vector operate elementwise, e.g. θ² = {θ_i²} and log θ = {log θ_i}. Element-by-element multiplication (the Hadamard product) of two vectors, e.g. θ and φ, is denoted θ ∘ φ.

THEOREM 3.2. Under Assumptions 3.1–3.3, assume further that (i) h is the Box-Cox power transformation with θ_0 ≪ 1, (ii) {e_i} are Gaussian, and (iii) E[h²_λ(y_i, λ)], E[h(y_i, λ)h_λ(y_i, λ)] and E[h(y_i, λ)h_{λλ}(y_i, λ)] exist for all i. The EI-based LM test for testing H_0: λ = λ_0 is
$$\mathrm{LM}_E(\lambda_0) = \frac{1_n'\log Y - \epsilon'(\hat\gamma_c, \lambda_0)\, u(\hat\gamma_c, \lambda_0)}{\{\xi' M(\hat\gamma_c, \lambda_0)\xi + \delta - 2\zeta' R(\hat\gamma_c)\zeta\}^{1/2}}, \qquad (3.5)$$
where, when λ ≠ 0,
$$\delta = \frac{1}{\lambda^2}\left(\tfrac{3}{2}\theta'\theta - 2\phi' A\theta^2 + 2\phi' A\phi\right) + O(\theta_0^4), \qquad \xi = \frac{1}{\lambda}\left(\tfrac{1}{2}\theta + \phi\circ\theta^{-1} + \tfrac{1}{4}\theta^3\right) - \frac{1}{\sigma}\Omega^{-1/2}(\gamma)X_\lambda(\lambda)\beta + O(\theta_0^4), \qquad \zeta = \frac{1}{\lambda}\left(\phi - \tfrac{1}{2}\theta^2\right) + O(\theta_0^4);$$
and, when λ = 0,
$$\delta = \tfrac{3}{2}\sigma^2\,\mathrm{tr}(\Omega(\gamma)) + 2\eta' A\eta, \qquad \xi = \frac{1}{2\sigma}\Omega^{-1/2}(\gamma)\left[\eta^2 + \sigma^2\Omega(\gamma)1_n - 2\log(X)\beta\right], \qquad \zeta = \eta.$$
All the quantities θ, φ, δ, ζ and ξ are evaluated at the constrained MLEs at λ_0. Under H_0, LM_E(λ_0) is asymptotically N(0, 1).

7 There is a well-known truncation problem for the Box-Cox power transformation. The model assumption requires this truncation effect to be negligible, which in turn requires the θ_i to be small. This is seen as follows. Since (y_i^λ − 1)/λ = x_i'(λ)β + σω_i(γ)e_i, we have y_i^λ = 1 + λx_i'(λ)β + λσω_i(γ)e_i. As y_i > 0 implies y_i^λ > 0, this in turn requires |λσω_i(γ)| ≪ 1 + λx_i'(λ)β for the truncation on e_i to be negligible.
Note that the order of the remainder terms in the approximations to δ, ξ and ζ is O(θ_0⁴), indicating that the third-order approximation, i.e. k = 3 in (3.4), is used. Our simulation results show that this approximation is very accurate. Although the test statistic given in Theorem 3.2 is derived under the assumption that the errors are Gaussian, it turns out to be fairly robust against non-normality of the errors as long as Assumption 3.1 is satisfied. This is seen from (i) the Monte Carlo results presented in Section 4 and (ii) tedious but straightforward approximations to the numerator of (3.5) using (3.4), which show that the effects of higher-order moments of the errors enter only through terms of smaller magnitude.

3.4. Joint LM test for functional form and heteroscedasticity

It is sometimes desirable to conduct a joint test for both functional form and heteroscedasticity first, simply because if the null hypothesis H_0: γ = 0, λ = λ_0 (where λ_0 can be any of the convenient values such as 0, 1, 1/2, 1/3, etc.) is not rejected, one may just need to fit an ordinary linear regression model with response and explanatory variables appropriately transformed according to the fixed λ_0 value. Of course, it is arguable that the two one-dimensional tests given earlier are more interesting, as one would typically ask: given that we have fitted a transformation model, do we still need heteroscedasticity, or given that we have fitted a heteroscedastic regression model, do we still need to transform the data? Nevertheless, a joint test should be useful in certain applications, and a strong rejection of the null would simply lead to the consideration of the full transformed heteroscedastic regression model. Following the set-up in Theorem 3.2, we have our third result.

THEOREM 3.3. Under the same set of assumptions as in Theorem 3.2, the EI-based LM statistic for testing H_0: γ = γ_0 and λ = λ_0 is given by
$$\mathrm{LM}_E(\gamma_0, \lambda_0) = S_c'(\gamma_0, \lambda_0)\begin{pmatrix} 2D_\circ'(\gamma_0)AD_\circ(\gamma_0) & -2D_\circ'(\gamma_0)A\zeta \\ -2\zeta' AD_\circ(\gamma_0) & \xi' M(\gamma_0, \lambda_0)\xi + \delta \end{pmatrix}^{-1} S_c(\gamma_0, \lambda_0), \qquad (3.6)$$
where the concentrated score S_c(γ_0, λ_0) = {g'(γ_0, λ_0)D_∘(γ_0), 1_n'log Y − ε'(γ_0, λ_0)u(γ_0, λ_0)}'. All the quantities ξ, ζ and δ are given in Theorem 3.2, but evaluated at the constrained MLEs at γ_0 and λ_0. Under H_0, LM_E(γ_0, λ_0) is asymptotically χ²_{q+1}.

Although the derivations of the LM_E(λ_0) and LM_E(γ_0, λ_0) statistics are more tedious than those of the other forms of LM tests, their implementations are not, and may even be simpler than the other versions of the LM tests. Besides, their excellent finite sample performance, as shown in Section 4, indicates that for the cases where one has only a small data set, LM_E(λ_0) or LM_E(γ_0, λ_0) should be used. The point of having a test with good finite sample behaviour is further emphasized in Dufour et al. (2004) and Godfrey et al. (2006).

Following the result of Corollary 3.2 and the robustness property of the test given in (3.5), one easily generalizes the result of Theorem 3.3 to provide a studentized (robustified) version of
the joint LM test, allowing the errors to be non-Gaussian while satisfying Assumption 3.1:
$$\mathrm{LM}^*_E(\gamma_0, \lambda_0) = S_c'(\gamma_0, \lambda_0)\begin{pmatrix} \bar\tau\, D_\circ'(\gamma_0)AD_\circ(\gamma_0) & -\bar\tau\, D_\circ'(\gamma_0)A\zeta \\ -\bar\tau\, \zeta' AD_\circ(\gamma_0) & \xi' M(\gamma_0, \lambda_0)\xi + \delta \end{pmatrix}^{-1} S_c(\gamma_0, \lambda_0), \qquad (3.7)$$
where τ̄ = (1/n) Σ_{i=1}^n g_i²(γ_0, λ_0).
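As a computational illustration, the quadratic form in (3.7) (and, by setting τ̄ = 2, that in (3.6)) can be assembled directly from its ingredients. The sketch below assumes the concentrated score S_c and the quantities ξ, ζ, δ of Theorem 3.2 have already been evaluated at the constrained MLEs; all names are hypothetical.

```python
# Assemble the joint LM statistics (3.6)/(3.7) from precomputed pieces.
import numpy as np

def lm_joint(S_c, D_circ, A, M, xi, zeta, delta, tau):
    # tau = mean of g_i^2 (tau_bar) gives LM*_E in (3.7); tau = 2 gives (3.6).
    b11 = tau * D_circ.T @ A @ D_circ              # q x q block
    b12 = -tau * D_circ.T @ A @ zeta               # length-q vector
    b22 = xi @ M @ xi + delta                      # scalar
    B = np.block([[b11, b12[:, None]],
                  [b12[None, :], np.array([[b22]])]])
    return S_c @ np.linalg.solve(B, S_c)           # referred to chi2_{q+1}
```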
4. MONTE CARLO RESULTS

Section 3 introduces three EI-based LM tests for three different testing situations, and Appendix B discusses some related tests. While all the tests for a given situation are asymptotically equivalent when the errors are normally distributed, and hence any of them can be used when a large data set is available, their small sample performance remains an important question. The purpose of the Monte Carlo experiment is (i) to assess the small sample performance of the three new tests, (ii) to assess the small sample performance of the related (and readily available) tests and (iii) to compare and contrast all the tests so as to give practical guidance on which to use when only a small data set is available. We consider the following data generation process (DGP):
$$h(y_i, \lambda) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}(\lambda) + \sigma \exp(\gamma_1 x_{1i} + \gamma_2 x_{2i})\, e_i, \qquad i = 1, \ldots, n, \qquad (4.1)$$
where the values for x_{1i} are generated from U(0, 10) and the values for x_{2i} are generated from either U(0, 10) or U(0, 5), and then fixed throughout the whole Monte Carlo experiment. Throughout, the regression coefficients are set to β_0 = 25, β_1 = 10 and β_2 = 10. The sample size n, transformation parameter λ, heteroscedasticity parameters γ_1 and γ_2, and the error standard deviation σ are the quantities that could potentially affect the finite sample behaviour of the LM tests. Thus, for a thorough investigation, we have considered various combinations of the values of these quantities, with n ∈ {30, 80, 200}, λ ∈ {0.0, 0.2, 0.5, 0.8, 1.0}, γ_1 ∈ {0.0, 0.1, 0.2}, γ_2 ∈ {0.0, 0.1, 0.2, 0.3}, and σ ∈ {0.1, 0.5, 1.0}. All parameter configurations are chosen so that the probability of truncation, i.e. the probability that 1 + λ[β_0 + β_1 x_{1i} + β_2 x_{2i}(λ) + σ exp(γ_1 x_{1i} + γ_2 x_{2i})e_i] ≤ 0, is negligible.

The simulation process is as follows. For a given parameter configuration, i.e. each set of values of n, σ, γ_1, γ_2 and λ, a random sample of e_i is generated from N(0, 1) or a non-normal population with zero mean and unit variance, which is then converted to values of y_i through the DGP in (4.1). Then, we proceed with model estimation and calculation of the test statistics treating the parameters as unknown. We record 1 for each test if it rejects the null hypothesis, repeat this process 10,000 times, and the proportion of rejections gives a Monte Carlo estimate of the size (empirical size) of the test. The comparison of the small-sample performance of the tests is based on their empirical sizes. As the tests are asymptotically equivalent under the null and local alternatives, the small-sample size is the most basic criterion for performance comparison.

To examine the effects of non-normal errors on the tests, two non-normal populations are considered: a normal mixture and a normal-gamma mixture, both standardized to have zero mean and unit variance. In the case of the normal mixture, 80% of the e_i are from N(0, 1) and the remaining 20% from N(0, 4); in the case of the normal-gamma mixture, 80% of the e_i are from N(0, 1) and the remaining 20% from GA(1, 1), a gamma distribution with both scale and shape parameters equal to one.
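The following sketch illustrates the error generation and the DGP step of this experiment; the estimation and testing steps (Sections 2 and 3) are indicated only by a placeholder comment, and all names are illustrative rather than part of the paper.

```python
# One replication of the size experiment for DGP (4.1).
import numpy as np
rng = np.random.default_rng(0)

def draw_errors(n, kind="normal"):
    if kind == "normal":
        return rng.standard_normal(n)
    contaminated = rng.random(n) < 0.2             # the 20% component
    base = rng.standard_normal(n)
    if kind == "normal_mix":                       # 0.8 N(0,1) + 0.2 N(0,4)
        e = np.where(contaminated, 2.0 * rng.standard_normal(n), base)
        return e / np.sqrt(1.6)                    # population variance is 1.6
    e = np.where(contaminated, rng.gamma(1.0, 1.0, n), base)  # 0.2 GA(1,1)
    return (e - 0.2) / np.sqrt(1.16)               # population mean .2, var 1.16

def one_replication(x1, x2, lam, gam, sig, kind, beta=(25.0, 10.0, 10.0)):
    x2lam = np.log(x2) if lam == 0 else (x2**lam - 1.0) / lam
    e = draw_errors(len(x1), kind)
    h = (beta[0] + beta[1] * x1 + beta[2] * x2lam
         + sig * np.exp(gam[0] * x1 + gam[1] * x2) * e)          # DGP (4.1)
    # Inverse Box-Cox; the truncation condition keeps lam*h + 1 > 0.
    y = np.exp(h) if lam == 0 else (lam * h + 1.0)**(1.0 / lam)
    # Estimation and testing (Sections 2 and 3) would follow here, e.g.
    # reject = lm_test(y, x1, x2) > chi2_crit
    return y
```

The empirical size is then the rejection frequency over 10,000 such replications.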
For brevity, we report only a representative part of the Monte Carlo results. Full results are available from the authors upon request. For clarity and conciseness, we use plots to summarize the simulation results. In each plot, the vertical scale is the empirical size, and the horizontal scale is the index for the 60 possible combinations of parameter values γ_1 ∈ {0.0, 0.1, 0.2}, γ_2 ∈ {0.0, 0.1, 0.2, 0.3} and λ ∈ {0.0, 0.2, 0.5, 0.8, 1.0}, with λ being the fastest changing index, followed by γ_2 and then γ_1.

4.1. Tests for heteroscedasticity

Seven tests are investigated in this case, namely, (i) LME0, which is (3.1) with λ̃ replaced by the true value λ, (ii) LME, which is (3.1), (iii) LME*, which is the studentized statistic in (3.3), (iv) LMD (LM test based on the double-length regression), (v) LR (likelihood ratio test), (vi) LMH (LM test based on the Hessian) and (vii) LMG (LM test based on the gradient). The last four tests are described in Appendix B. As these seven tests all allow for any smooth monotonic h function, we consider two transformations in this case: the Box-Cox power transformation (Box and Cox 1964) and the dual power transformation of Yang (2006), where h(y, λ) = (y^λ − y^{−λ})/(2λ) if λ ≠ 0 and log y if λ = 0. Figure 1 summarizes the results.

From Figure 1 the following regularities are observed: (i) LME* has excellent finite sample performance even when the sample size is as small as 30, irrespective of whether the errors are normal or non-normal and of what transformation is used; (ii) LME and LME0 have excellent finite sample performance only when the errors are normal, showing the necessity of studentizing LME to safeguard against possible departures from normality of the error distribution; (iii) LMD performs very well under normal errors when the Box-Cox transformation is used, but not well enough when the dual power transformation is used; (iv) in the case of non-normal errors, all the tests except LME* suffer from size distortions, and furthermore their empirical sizes apparently do not converge to the nominal level 5% as n increases; (v) when errors are normal, the empirical sizes of all seven tests converge fairly quickly to 5% as n increases, except for LMG, whose empirical sizes are still nearly double the nominal size when n = 200; and (vi) changing the error standard deviation and the ranges of the covariates' values changes the empirical sizes of the tests slightly, but not the general regularities summarized above.

4.2. Tests for functional form

In this case, we report the empirical sizes for five tests: LME, LMD, LMH, LMG and LR. Selected results are summarized in Figure 2. Some general observations are in order: (i) LME generally possesses excellent finite sample properties and outperforms all the others; (ii) the tests are ranked in the following order: LME, LMD, LR, LMH and LMG, with LMG often performing very poorly; (iii) it is worth noting that LMD performs reasonably well, especially considering the fact that it is based on only the first derivatives of the log-likelihood function; (iv) all tests are fairly robust against departures from normality of the error distribution; (v) as n increases, the empirical sizes converge to 5%; and (vi) changing the parameter values does not much affect the empirical sizes.

4.3. Tests for functional form and heteroscedasticity

Six tests, namely LME, LMD, LMH, LMG, LR and LME* (defined in (3.7)), are compared, except that when the errors are normal, LME* is excluded. Selected results are summarized in Figure 3.
[Figure 1a. Empirical sizes of LM tests for heteroscedasticity, BC transformation, σ = 0.1, X1 ∼ U(0, 10) and X2 ∼ U(0, 10).]
[Figure 1b. Empirical sizes of LM tests for heteroscedasticity, BC transformation, σ = 0.5, X1 ∼ U(0, 5) and X2 ∼ U(0, 5).]
[Figure 1c. Empirical sizes of LM tests for heteroscedasticity, BC transformation, σ = 1.0, X1 ∼ U(0, 5) and X2 ∼ U(0, 5).]
[Figure 1d. Empirical sizes of LM tests for heteroscedasticity, DP transformation, σ = 0.1, X1 ∼ U(0, 5) and X2 ∼ U(0, 10).]
[Figure 2a. Empirical sizes of LM tests for functional form, BC transformation with normal errors: X1 ∼ U(0, 10), X2 ∼ U(0, 10) for the first row and X2 ∼ U(0, 5) for the last two rows.]
[Figure 2b. Empirical sizes of LM tests for functional form, BC transformation with normal mixture: X1 ∼ U(0, 5), X2 ∼ U(0, 5).]
[Figure 2c. Empirical sizes of LM tests for functional form, BC transformation with normal-gamma mixture: X1 ∼ U(0, 5), X2 ∼ U(0, 5).]
[Figure 3a. Empirical sizes of the joint LM tests, BC transformation with normal errors: X1 ∼ U(0, 10), X2 ∼ U(0, 10) for the first row and X2 ∼ U(0, 5) for the last two rows.]
[Figure 3b. Empirical sizes of the joint LM tests, BC transformation with normal mixture: X1 ∼ U(0, 5), X2 ∼ U(0, 5).]
[Figure 3c. Empirical sizes of the joint LM tests, BC transformation with normal-gamma mixture: X1 ∼ U(0, 5), X2 ∼ U(0, 5).]

For
the case of normal errors, the general observations remain the same as for testing functional form. One difference is that LMH and LMG perform notably worse. This reinforces the necessity of using the EI-based LM test when the sample size is small. Again, LMD performs reasonably well. However, unlike the EI-based LM test, LMD does not perform well uniformly across all situations. For the case of non-normal errors, LME* performs exceptionally well even when the sample size is as small as 30, whereas all the others perform poorly. Furthermore, the empirical sizes of the other tests apparently do not converge to the nominal level as n increases.

4.4. Power of the tests

The power of the tests is another important consideration for practitioners in choosing among the alternative tests. As the sizes of the tests can differ substantially, we use simulated critical values to ensure fairness in the power comparison.8 Selected results are summarized in Figure 4, with β_0 = 25, β_1 = β_2 = 10 and σ = 1.0. For the tests of heteroscedasticity, the null hypothesis is H_0: γ_1 = γ_2 = 0.1, and the alternative values are γ_1 = γ_2 = (−0.16, −0.12, −0.08, −0.04, 0.0, 0.04, 0.07, 0.1, 0.13, 0.16, 0.2, 0.24, 0.28, 0.32, 0.36); for the tests of functional form, the null hypothesis is H_0: λ = 0.1, with the alternative values λ = (0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17); and for the joint tests, the null hypothesis is H_0: γ_1 = γ_2 = λ = 0.1, and the alternative values are elementwise combinations of γ_1 = γ_2 = (−0.04, −0.02, 0.0, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24) and λ = (0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17), which are then indexed by the integers 1 to 15 for plotting.

From Figure 4, we see that (i) the LME tests always have better or similar power compared with the others; (ii) LME* for testing heteroscedasticity may have notably lower power than the others when the sample size is small, owing to its robust nature, but as the sample size increases it quickly catches up in power; (iii) LME* for the joint test performs as well as LME in terms of power; and (iv) LMH and LMG may have significantly lower power than the others in the cases of the functional form tests and joint tests.9
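A hedged sketch of the size-adjusted power procedure described in footnote 8: simulate the statistic under the null configuration, take its 95th percentile as the critical value, then compute rejection rates under each alternative. Here simulate_stat is a placeholder for generating one statistic value from DGP (4.1) at given parameters.

```python
import numpy as np

def size_adjusted_power(simulate_stat, null_params, alt_params, reps=10_000):
    null_draws = np.array([simulate_stat(null_params) for _ in range(reps)])
    crit = np.percentile(null_draws, 95)          # simulated 5% critical value
    rejections = sum(simulate_stat(alt_params) > crit for _ in range(reps))
    return rejections / reps
```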
5. CONCLUSIONS

We provide an LM test for heteroscedasticity that allows a transformation to be present in the model to take care of potential non-normality of the data. With this test, one can test any specification of the heteroscedasticity parameters, so that the variables responsible for heteroscedasticity can be identified. In the case of normal errors, the test compares favourably with the commonly used likelihood ratio test in both ease of application and finite sample performance. The test also compares favourably with other versions of the LM test. In the case of non-normal errors, the robustified version of the EI-based LM test clearly outperforms all the others.
8 For each test, 10,000 test statistic values are generated at a given parameter configuration. The 95th percentile is calculated, which is then used in the subsequent power comparisons.
9 Note that (i) for brevity the results based on other sample sizes are not plotted and (ii) the size-adjusted tests are not feasible in practice, as one does not know the true values of the model parameters.
[Figure 4. Power of the tests of heteroscedasticity (row 1), functional form (row 2), and the joint tests. BC transformation: X1, X2 ∼ U(0, 5).]
We also provide an LM test for functional form that allows for heteroscedasticity to be present in the model. This flexibility is important, as genuine heteroscedasticity often exists in the data and a transformation cannot remove it. Monte Carlo simulations show that this test outperforms the other tests. All the tests of functional form considered are quite robust against non-normality of the error distribution. Based on the test of heteroscedasticity and the test of functional form, we provide a joint test of functional form and heteroscedasticity, and a robust version of it. Monte Carlo simulation shows excellent finite sample performance of the proposed tests compared with the other tests. Considering the simplicity of their practical implementation and their excellent small sample performance, the three proposed tests, in particular the second and the studentized versions of the first and third, are recommended for practical applications.
ACKNOWLEDGEMENTS We are grateful to the Editor, Pravin K. Trivedi, the Coordinating Editor, Karim Abadir, and the two anonymous referees for their very helpful comments that have led to significant improvements on an early version of this paper. We are also grateful to the comments from the seminar participants of the Far Eastern Meeting of the Econometric Society 2004 (FEMES 2004) and the Econometric Society Australasian Meeting 2004 (ESAM 2004). We gratefully acknowledge the research support from the Wharton-SMU Research Center, and the research assistance by Chenwei Li.
REFERENCES

Ali, M. M. and C. Giaccotto (1984). A study of several new and existing tests for heteroscedasticity in the general linear model. Journal of Econometrics 26, 355–73.
Amemiya, T. (1977). A note on a heteroscedastic model. Journal of Econometrics 6, 365–70.
Baltagi, B. H. (1997). Testing linear and loglinear error components regressions against Box-Cox alternatives. Statistics and Probability Letters 33, 63–8.
Baltagi, B. H. and D. Li (2000). Double-length regressions for the Box-Cox difference model with heteroscedasticity or autocorrelation. Economics Letters 69, 9–14.
Bera, A. K. and C. MacKenzie (1986). Alternative forms and properties of the score test. Journal of Applied Statistics 13, 13–25.
Bickel, P. J. and K. A. Doksum (1981). An analysis of transformations revisited. Journal of the American Statistical Association 76, 296–311.
Box, G. E. P. and D. R. Cox (1964). An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B 26, 211–52.
Breusch, T. S. and A. R. Pagan (1979). A simple test for heteroscedasticity and random coefficient variation. Econometrica 47, 1287–93.
Carroll, R. J. and D. Ruppert (1984). Power transformations when fitting theoretical models to data. Journal of the American Statistical Association 79, 321–8.
Chen, G., R. A. Lockhart and M. A. Stephens (2002). Box-Cox transformations in linear models: Large sample theory and tests of normality (with discussion). Canadian Journal of Statistics 30, 177–234.
Davidson, R. and J. G. MacKinnon (1983). Small sample properties of alternative forms of the Lagrange multiplier test. Economics Letters 12, 269–75.
Davidson, R. and J. G. MacKinnon (1984). Model specification tests based on artificial linear regressions. International Economic Review 25, 485–502.
Davidson, R. and J. G. MacKinnon (1985). Testing linear and loglinear regression against Box-Cox alternatives. Canadian Journal of Economics 18, 499–517.
Davidson, R. and J. G. MacKinnon (1993). Estimation and Inference in Econometrics. Oxford: Oxford University Press.
Dufour, J.-M., L. Khalaf, J.-T. Bernard and I. Genest (2004). Simulation-based finite-sample tests for heteroscedasticity and ARCH effects. Journal of Econometrics 122, 317–47.
Evans, M. A. (1992). Robustness of size of tests of autocorrelation and heteroskedasticity to nonnormality. Journal of Econometrics 51, 7–24.
Evans, M. A. and M. L. King (1988). A further class of tests for heteroscedasticity. Journal of Econometrics 37, 265–76.
Farebrother, R. W. (1987). The statistical foundations of a class of parametric tests for heteroscedasticity. Journal of Econometrics 36, 359–68.
Glejser, H. (1969). A new test for heteroscedasticity. Journal of the American Statistical Association 64, 316–23.
Godfrey, L. G. (1988). Misspecification Tests in Econometrics. Cambridge: Cambridge University Press.
Godfrey, L. G. and M. R. Wickens (1981). Testing linear and log-linear regressions for functional form. Review of Economic Studies 48, 487–96.
Godfrey, L. G., C. D. Orme and J. M. C. Santos Silva (2006). Simulation-based tests for heteroskedasticity in linear regression models: Some further results. Econometrics Journal 9, 76–97.
Goldfeld, S. M. and R. E. Quandt (1965). Some tests for homoscedasticity. Journal of the American Statistical Association 60, 539–47.
Griffiths, W. E. and K. Surekha (1986). A Monte Carlo evaluation of the power of some tests for heteroscedasticity. Journal of Econometrics 31, 219–31.
Harvey, A. C. (1976). Estimating regression models with multiplicative heteroscedasticity. Econometrica 44, 461–65.
Hernandez, F. and R. A. Johnson (1980). The large-sample behavior of transformations to normality. Journal of the American Statistical Association 75, 855–61.
Kalirajan, K. P. (1989). A test for heteroscedasticity and non-normality of regression residuals. Economics Letters 30, 133–6.
Koenker, R. (1981). A note on studentizing a test for heteroscedasticity. Journal of Econometrics 17, 107–12.
Lahiri, K. and D. Egy (1981). Joint estimation and testing for functional form and heteroscedasticity. Journal of Econometrics 15, 299–307.
Lawrance, A. J. (1987). The score statistic for regression transformation. Biometrika 74, 275–9.
Lindsay, B. G. and B. Li (1997). On second-order optimality of the observed Fisher information. Annals of Statistics 25, 2172–99.
MacKinnon, J. G. and L. Magee (1990). Transforming the dependent variable in regression models. International Economic Review 31, 315–39.
Maekawa, K. (1988). Comparing the Wald, LR and LM tests for heteroscedasticity in a linear regression model. Economics Letters 26, 37–41.
Powell, J. L. (1996). Rescaled methods-of-moments estimation for the Box-Cox regression model. Economics Letters 51, 259–65.
Ruppert, D. and R. J. Carroll (1981). On robust tests for heteroscedasticity. Annals of Statistics 9, 206–10.
Tse, Y. K. (1984). Testing for linear and log-linear regression with heteroscedasticity. Economics Letters 16, 63–69.
Wallentin, B. and A. Agren (2002). Test of heteroscedasticity in a regression model in the presence of measurement errors. Economics Letters 76, 205–11.
Yang, Z. L. (2006). A modified family of power transformations. Economics Letters 92, 14–9.
Yang, Z. L. and T. Abeysinghe (2003). A score test for Box-Cox functional form. Economics Letters 79, 107–15.
Yang, Z. L. and Y. K. Tse (2006). Modelling firm-size distribution using Box-Cox heteroscedastic regression. Journal of Applied Econometrics 21, 641–53.
Yeo, I. K. and R. A. Johnson (2000). A new family of power transformation to improve normality or symmetry. Biometrika 87, 954–9.
APPENDIX A: SCORES AND OBSERVED INFORMATION

For the model with a general transformation and a general weighting function, the score function S(ψ), where ψ = {β', σ², γ', λ'}', has the following elements:
$$S_\beta = \frac{1}{\sigma^2}\sum_{i=1}^{n}\frac{[h(y_i,\lambda) - x_i'(\lambda)\beta]\,x_i(\lambda)}{\omega_i^2(\gamma)},$$
$$S_{\sigma^2} = \frac{1}{2\sigma^4}\sum_{i=1}^{n}\frac{[h(y_i,\lambda) - x_i'(\lambda)\beta]^2}{\omega_i^2(\gamma)} - \frac{n}{2\sigma^2},$$
$$S_\gamma = \frac{1}{\sigma^2}\sum_{i=1}^{n}[h(y_i,\lambda) - x_i'(\lambda)\beta]^2\,\frac{\omega_{i\gamma}(\gamma)}{\omega_i^3(\gamma)} - \sum_{i=1}^{n}\frac{\omega_{i\gamma}(\gamma)}{\omega_i(\gamma)},$$
$$S_\lambda = \sum_{i=1}^{n}\frac{h_{y\lambda}(y_i,\lambda)}{h_y(y_i,\lambda)} - \frac{1}{\sigma^2}\sum_{i=1}^{n}\frac{[h(y_i,\lambda) - x_i'(\lambda)\beta][h_\lambda(y_i,\lambda) - x_{i\lambda}'(\lambda)\beta]}{\omega_i^2(\gamma)},$$
from which the gradient matrix for use in the OPG LM test can be easily formulated.

Let e_i(ψ) = [h(y_i, λ) − x_i'(λ)β]/[σω_i(γ)], and let e_{iλ}(ψ) and e_{iλλ}(ψ) be its first and second partial derivatives with respect to λ. The elements of the Hessian matrix H(ψ) = ∂S(ψ)/∂ψ' are:
$$H_{\beta\beta} = -\frac{1}{\sigma^2}\sum_{i=1}^{n}\frac{x_i(\lambda)x_i'(\lambda)}{\omega_i^2(\gamma)}, \qquad H_{\sigma^2\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^{n}e_i^2(\psi) + \frac{n}{2\sigma^4},$$
$$H_{\gamma\gamma} = -\sum_{i=1}^{n}\left[\frac{\omega_{i\gamma\gamma}(\gamma)}{\omega_i(\gamma)} - \frac{3\,\omega_{i\gamma}(\gamma)\omega_{i\gamma}'(\gamma)}{\omega_i^2(\gamma)}\right]e_i^2(\psi) + \sum_{i=1}^{n}\left[\frac{\omega_{i\gamma\gamma}(\gamma)}{\omega_i(\gamma)} - \frac{\omega_{i\gamma}(\gamma)\omega_{i\gamma}'(\gamma)}{\omega_i^2(\gamma)}\right],$$
$$H_{\lambda\lambda} = -\sum_{i=1}^{n}\left[e_{i\lambda}^2(\psi) + e_i(\psi)e_{i\lambda\lambda}(\psi)\right] + \sum_{i=1}^{n}\frac{\partial^2\log h_y(y_i,\lambda)}{\partial\lambda^2},$$
$$H_{\beta\sigma^2} = -\frac{1}{\sigma^3}\sum_{i=1}^{n}\frac{e_i(\psi)x_i(\lambda)}{\omega_i(\gamma)}, \qquad H_{\beta\gamma} = -\frac{2}{\sigma}\sum_{i=1}^{n}\frac{e_i(\psi)x_i(\lambda)\omega_{i\gamma}'(\gamma)}{\omega_i^2(\gamma)},$$
$$H_{\beta\lambda} = \frac{1}{\sigma}\sum_{i=1}^{n}\frac{e_{i\lambda}(\psi)x_i(\lambda) + e_i(\psi)x_{i\lambda}(\lambda)}{\omega_i(\gamma)},$$
$$H_{\sigma^2\gamma} = -\frac{1}{\sigma^2}\sum_{i=1}^{n}\frac{e_i^2(\psi)\omega_{i\gamma}(\gamma)}{\omega_i(\gamma)}, \qquad H_{\sigma^2\lambda} = \frac{1}{\sigma^2}\sum_{i=1}^{n}e_i(\psi)e_{i\lambda}(\psi), \qquad H_{\gamma\lambda} = 2\sum_{i=1}^{n}\frac{e_i(\psi)e_{i\lambda}(\psi)\omega_{i\gamma}(\gamma)}{\omega_i(\gamma)}.$$

Now, for the Box-Cox transformation, we have h_y(y, λ) = y^{λ−1}, h_{yλ}(y, λ) = y^{λ−1} log y, h_{yλλ}(y, λ) = y^{λ−1}(log y)², and
$$h_\lambda(y,\lambda) = \begin{cases} \frac{1}{\lambda}[1 + \lambda h(y,\lambda)]\log y - \frac{1}{\lambda}h(y,\lambda), & \lambda \neq 0,\\[2pt] \frac{1}{2}(\log y)^2, & \lambda = 0, \end{cases}$$
$$h_{\lambda\lambda}(y,\lambda) = \begin{cases} h_\lambda(y,\lambda)\left[\log y - \frac{1}{\lambda}\right] + \frac{1}{\lambda^2}[h(y,\lambda) - \log y], & \lambda \neq 0,\\[2pt] \frac{1}{3}(\log y)^3, & \lambda = 0. \end{cases}$$
For the dual power transformation of Yang (2006), we have h_y(y, λ) = ½[y^{λ−1} + y^{−λ−1}], h_{yλ}(y, λ) = ½(y^{λ−1} − y^{−λ−1}) log y, h_{yλλ}(y, λ) = ½(y^{λ−1} + y^{−λ−1})(log y)², and
$$h_\lambda(y,\lambda) = \begin{cases} \frac{1}{2\lambda}(y^{\lambda} + y^{-\lambda})\log y - \frac{1}{\lambda}h(y,\lambda), & \lambda \neq 0,\\[2pt] 0, & \lambda = 0, \end{cases}$$
$$h_{\lambda\lambda}(y,\lambda) = \begin{cases} h(y,\lambda)(\log y)^2 - \frac{2}{\lambda}h_\lambda(y,\lambda), & \lambda \neq 0,\\[2pt] \frac{1}{3}(\log y)^3, & \lambda = 0. \end{cases}$$
The inverse of the dual power transformation is y = (λh + √(1 + λ²h²))^{1/λ} when λ ≠ 0 and exp(h) when λ = 0, where h = (y^λ − y^{−λ})/(2λ) when λ ≠ 0 and log y when λ = 0. These partial derivatives are also available for other transformations, such as those of MacKinnon and Magee (1990) and Yeo and Johnson (2000).
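For numerical work with the score and Hessian above, the Box-Cox derivatives can be transcribed directly, with the λ = 0 limits handled explicitly; a minimal sketch (the dual power case would follow the same pattern):

```python
# Box-Cox transformation and its lambda-derivatives, as listed above.
import numpy as np

def bc(y, lam):          # h(y, lambda)
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam

def bc_hy(y, lam):       # h_y = y^(lambda - 1)
    return y**(lam - 1.0)

def bc_hlam(y, lam):     # h_lambda
    if lam == 0:
        return 0.5 * np.log(y)**2
    h = bc(y, lam)
    return ((1.0 + lam * h) * np.log(y) - h) / lam

def bc_hlamlam(y, lam):  # h_lambda_lambda
    if lam == 0:
        return np.log(y)**3 / 3.0
    h, hl = bc(y, lam), bc_hlam(y, lam)
    return hl * (np.log(y) - 1.0 / lam) + (h - np.log(y)) / lam**2
```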
APPENDIX B: SOME RELATED TEST STATISTICS

The same notation as in Appendix A is followed. Let I(ψ) be the expected information matrix. If ψ̂_0 is the constrained MLE of ψ under the constraints imposed by the null hypothesis, the LM statistic is defined as
$$\mathrm{LM}_E = S'(\hat\psi_0)\, I^{-1}(\hat\psi_0)\, S(\hat\psi_0).$$
See, for example, Godfrey (1988). In situations where the test concerns only a subvector ψ_2 of ψ = {ψ_1', ψ_2'}', the test reduces to
$$\mathrm{LM}_E = S_2'(\hat\psi_0)\, I^{22}(\hat\psi_0)\, S_2(\hat\psi_0),$$
where S_2(ψ) denotes the relevant subvector of S(ψ), and I^{22}(ψ) denotes the submatrix of I^{−1}(ψ) corresponding to ψ_2. As I(ψ) may not be easily obtainable, alternative ways of estimating the information matrix have been proposed. In particular, I(ψ) may be replaced by −H(ψ) or by the outer product of the gradient (OPG), G'(ψ)G(ψ), with G(ψ) = {∂ℓ_i(ψ)/∂ψ'}, where ℓ_i is the element of the log-likelihood corresponding to the ith observation. Hence, the Hessian form and the OPG form of the LM statistic, denoted by LMH and
LMG, respectively, can be calculated as
$$\mathrm{LM}_H = -S_2'(\hat\psi_0)\, H^{22}(\hat\psi_0)\, S_2(\hat\psi_0), \qquad \mathrm{LM}_G = S_2'(\hat\psi_0)\, D^{22}(\hat\psi_0)\, S_2(\hat\psi_0),$$
where H^{22}(ψ) and D^{22}(ψ) are, respectively, the submatrices of H^{−1}(ψ) and [G'(ψ)G(ψ)]^{−1} corresponding to ψ_2. In addition, the LM statistic can also be calculated from the double-length artificial regression proposed by Davidson and MacKinnon (1984). We denote this version of the LM statistic by LMD. Then, LMD is the explained sum of squares of the regression of {e'(ψ̂_0), 1_n'}' on {−∂e(ψ̂_0)/∂ψ', ∂(log |∂e(ψ̂_0)/∂y'|)/∂ψ'}, which has 2n observations and k + p + q + 1 regressors. The LMD statistic has been found to outperform the LMH and LMG statistics in finite-sample performance (Davidson and MacKinnon, 1993), and it has been applied by many authors in different situations (see Tse, 1984, and Baltagi and Li, 2000, among others). Although the four forms of the LM statistic are asymptotically equivalent, with the same limiting chi-squared distribution under the null, LME is expected to give the best finite-sample performance.10 This is verified empirically in our present context using a Monte Carlo experiment.

The likelihood ratio (LR) test for testing, for example, heteroscedasticity is simply defined as
$$\mathrm{LR}(\gamma_0) = 2\left[\ell_p(\hat\gamma, \hat\lambda) - \ell_p(\gamma_0, \hat\lambda_c)\right], \qquad (B.1)$$
where λ̂_c is the constrained MLE of λ at γ_0.
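Given the score vector, the Hessian and the per-observation gradient matrix at the constrained MLE, the Hessian and OPG forms above are one-line computations; a sketch with hypothetical inputs:

```python
# LM_H and LM_G from the full score S, Hessian H and gradient matrix G
# (rows are per-observation score contributions), evaluated at psi_hat_0.
import numpy as np

def lm_hessian_opg(S, H, G, idx):
    # idx selects the tested subvector psi_2 within psi
    S2 = S[idx]
    H22 = np.linalg.inv(H)[np.ix_(idx, idx)]        # block of H^{-1}
    D22 = np.linalg.inv(G.T @ G)[np.ix_(idx, idx)]  # block of (G'G)^{-1}
    return -S2 @ H22 @ S2, S2 @ D22 @ S2            # LM_H, LM_G
```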
APPENDIX C: PROOFS OF THE THEOREMS AND COROLLARIES

Proof of Theorem 3.1: We start our derivation by first assuming that λ is known. Since λ is known, ψ = {β', σ², γ'}', ψ̂_0 = {β̂'(γ_0, λ), σ̂²(γ_0, λ), γ_0'}' and the score is
$$S_\gamma(\hat\psi_0) = \sum_{i=1}^{n}\frac{[h(y_i,\lambda) - x_i'(\lambda)\hat\beta(\gamma_0,\lambda)]^2}{\omega_i^2(\gamma_0)\,\hat\sigma^2(\gamma_0,\lambda)}\,\frac{\omega_{i\gamma}(\gamma_0)}{\omega_i(\gamma_0)} - \sum_{i=1}^{n}\frac{\omega_{i\gamma}(\gamma_0)}{\omega_i(\gamma_0)} = D_\circ'(\gamma_0)\, g(\gamma_0, \lambda).$$
The elements of the expected information matrix I(ψ) are I_ββ = (1/σ²)X'(λ)Ω^{−1}(γ)X(λ), I_{βσ²} = 0, I_{βγ} = 0, I_{σ²σ²} = n/(2σ⁴), I_{σ²γ} = (1/σ²)1_n'D_∘(γ), and I_{γγ} = 2D_∘'(γ)D_∘(γ). Thus, the γγ-block of I^{−1}(ψ) is
$$I^{\gamma\gamma} = \left[I_{\gamma\gamma} - I_{\gamma\sigma^2}I_{\sigma^2\sigma^2}^{-1}I_{\sigma^2\gamma}\right]^{-1} = \frac{1}{2}\left[(D_\circ(\gamma) - 1_n\bar D_\circ(\gamma))'(D_\circ(\gamma) - 1_n\bar D_\circ(\gamma))\right]^{-1},$$
where D̄_∘(γ) = (1/n)1_n'D_∘(γ). These give the LM test statistic for known λ as
$$\mathrm{LM}(\gamma_0\,|\,\lambda) = \frac{1}{2}\, g'(\gamma_0,\lambda)D_\circ(\gamma_0)\left[(D_\circ(\gamma_0) - 1_n\bar D_\circ(\gamma_0))'(D_\circ(\gamma_0) - 1_n\bar D_\circ(\gamma_0))\right]^{-1}D_\circ'(\gamma_0)\,g(\gamma_0,\lambda) = \frac{1}{2}\, g'(\gamma_0,\lambda)D(\gamma_0)[D'(\gamma_0)D(\gamma_0)]^{-1}D'(\gamma_0)\,g(\gamma_0,\lambda).$$
The proof of the asymptotic distribution of LM(γ_0 | λ) parallels that of Koenker (1981), except that we consider only the null distribution of LM(γ_0 | λ). It is easy to see that g(γ_0, λ) can be decomposed as g(γ_0, λ) = [σ²/σ̂²(γ_0, λ)](v_1 − 2v_2 + v_3 + v_4), where v_1 = e² − 1, v_2 = e ∘ (K(γ_0, λ)e), v_3 = (K(γ_0, λ)e)², v_4 = [1 − σ̂²(γ_0, λ)/σ²]1_n, and K(γ, λ) = I_n − M(γ, λ). Under Assumptions 3.1–3.3, it is easy to prove that (i) √n [D'(γ_0)D(γ_0)]^{−1}D'(γ_0)v_k = o_p(1) for k = 2, 3, 4; (ii) σ̂²(γ_0, λ) →_p σ²; and (iii) √n [D'(γ_0)D(γ_0)]^{−1}D'(γ_0)v_1 →_d N(0, 2Λ^{−1}), where Λ = lim_{n→∞} (1/n)D'(γ_0)D(γ_0). It follows that LM(γ_0 | λ) →_d χ²_q under H_0.

What is left is to prove that LM_E(γ_0 | λ̃), the LM statistic with λ replaced by λ̃, is asymptotically equivalent to LM_E(γ_0 | λ). Under Assumption 3.2, it is sufficient to show that n^{−1/2}D'(γ_0)[g(γ_0, λ̃) − g(γ_0, λ)] →_p 0. By the mean value theorem, we have
$$\frac{1}{\sqrt n}D'(\gamma_0)[g(\gamma_0, \tilde\lambda) - g(\gamma_0, \lambda)] = \frac{1}{\sqrt n}D'(\gamma_0)\,g_\lambda(\gamma_0, \lambda^*)(\tilde\lambda - \lambda),$$
where λ* lies between λ̃ and λ. As λ̃ →_p λ, λ* →_p λ. Now, as n^{−1/2}D'(γ_0)g_λ(γ_0, λ̄) is bounded in probability uniformly in λ̄ in a neighborhood of λ, we have n^{−1/2}D'(γ_0)g_λ(γ_0, λ*) = O_p(1). The result of the theorem thus follows. □

10 Bera and MacKenzie (1986) argued for the superior small-sample performance of LME over LMH and LMG, which has been found to be empirically supported. Also, the superior performance of LMD over LMH and LMG in small samples has been shown in many empirical studies (see Davidson and MacKinnon, 1983, 1984). We shall show below, however, that LMD is dominated by LME in tests of functional form and heteroscedasticity.
Proof of Corollary 3.1: As ω_i(γ) = ω(v_i'γ), it must be that ω_{iγ}(0) = c v_i for a constant c, which directly leads to equation (3.2). □

Proof of Corollary 3.2: The proof of Corollary 3.2 is identical to the proof of Theorem 3.1, except that √n [D'(γ_0)D(γ_0)]^{−1}D'(γ_0)v_1 →_d N(0, (κ − 1)Λ^{−1}), where κ, under the relaxed distributional assumption, is consistently estimated by κ̃ = 1 + (1/n) Σ_{i=1}^n g_i²(γ_0, λ̃). □

Proof of Theorems 3.2 and 3.3: Now, ψ = {β', σ², γ', λ'}'. The elements of I(ψ) corresponding to β, σ² and γ are given in the proof of Theorem 3.1. With the addition of the λ parameter and with h being the Box-Cox power transformation, the other elements of I(ψ) are I_λλ = E[e_λ'(ψ)e_λ(ψ)] + E[e'(ψ)e_λλ(ψ)]; I_βλ = −(1/σ)X'(λ)Ω^{−1/2}(γ)E[e_λ(ψ)]; I_{σ²λ} = −(1/σ²)E[e'(ψ)e_λ(ψ)]; and I_γλ = −2D_∘'(γ)E[e(ψ) ∘ e_λ(ψ)]. These give the (γ, λ)-block and the λ-element of I^{−1}(ψ), respectively, as
$$\begin{pmatrix} 2D_\circ'(\gamma)AD_\circ(\gamma) & -2D_\circ'(\gamma)A\zeta \\ -2\zeta' AD_\circ(\gamma) & \xi'M(\gamma,\lambda)\xi + \delta \end{pmatrix}^{-1}$$
and
$$\left\{\xi'M(\gamma,\lambda)\xi + \delta - 2\zeta'AD_\circ(\gamma)[D_\circ'(\gamma)AD_\circ(\gamma)]^{-1}D_\circ'(\gamma)A\zeta\right\}^{-1},$$
where ξ = E[e_λ(ψ)], ζ = E[e(ψ) ∘ e_λ(ψ)], and δ = Σ_{i=1}^n {Var[e_{iλ}(ψ)] + E[e_i(ψ)e_{iλλ}(ψ)]} − (2/n)(1_n'ζ)². The former corresponds to the middle term of (3.6), and the latter corresponds to the denominator of (3.5). However, the three quantities ξ, ζ and δ do not possess explicit expressions in general, so some approximations are desirable. From the basic properties of the Box-Cox power transformation given at the end of Appendix A, we see that in order to obtain approximations to ξ, ζ and δ, one only needs to approximate log y_i when λ ≠ 0. Using the expansion (3.4) with k = 3, we obtain
$$E[e_{i\lambda}(\psi)] = \frac{\theta_i}{2\lambda} + \frac{\phi_i}{\lambda\theta_i} + \frac{\theta_i^3}{4\lambda} - \frac{\eta_i}{\lambda\sigma\omega_i(\gamma)} - \frac{x_{i\lambda}'(\lambda)\beta}{\sigma\omega_i(\gamma)} + O(\theta_i^4),$$
$$E[e_i(\psi)e_{i\lambda}(\psi)] = \frac{1}{\lambda}\left(\phi_i - \frac{1}{2}\theta_i^2\right) + O(\theta_i^4),$$
$$\mathrm{Var}[e_{i\lambda}(\psi)] = \frac{1}{\lambda^2}\left(\frac{1}{2}\theta_i^2 - \phi_i\theta_i^2 + \phi_i^2\right) + O(\theta_i^4),$$
$$E[e_i(\psi)e_{i\lambda\lambda}(\psi)] = \frac{1}{\lambda^2}\left(\theta_i^2 - \phi_i\theta_i^2 + \phi_i^2\right) + O(\theta_i^4),$$
for i = 1, ..., n. The first expression gives an approximation to ξ after removing the fourth term, as it is absorbed by the M(γ, λ) matrix; the second expression gives an approximation to ζ; and the last three expressions together give an approximation to δ. When λ = 0, exact expressions for δ, ξ and ζ follow directly from the calculations using log y_i = η_i + σω_i(γ)e_i, or from finding the limits of the above quantities as λ approaches zero. Finally, Assumptions 3.2 and 3.3 ensure that the denominator of (3.5) and the middle term of (3.6) exist for all n. This, together with Assumption 3.1 and the normality of the errors, leads to the asymptotic normal and chi-square distributions in Theorems 3.2 and 3.3, respectively. □
Econometrics Journal (2008), volume 11, pp. 377–395. doi: 10.1111/j.1368-423X.2008.00243.x
A bootstrap procedure for panel data sets with many cross-sectional units

G. KAPETANIOS†

†Department of Economics, Queen Mary, University of London, Mile End Rd., London E1 4NS, UK. E-mail: [email protected]

First version received: April 2005; final version accepted: January 2008
Summary This paper considers the issue of bootstrap resampling in panel data sets. The availability of data sets with large temporal and cross-sectional dimensions suggests the possibility of new resampling schemes. We suggest one possibility which has not been widely explored in the literature. It amounts to constructing bootstrap samples by resampling whole cross-sectional units with replacement. In cases where the data do not exhibit cross-sectional dependence but exhibit temporal dependence, such a resampling scheme is of great interest as it allows the application of i.i.d. bootstrap resampling rather than block bootstrap resampling. It is well known that the former enables superior approximation to distributions of statistics compared to the latter. We prove that the bootstrap based on cross-sectional resampling provides asymptotic refinements. A Monte Carlo study illustrates the superior properties of the new resampling scheme compared to the block bootstrap. Keywords: Bootstrap, Panel data.
1. INTRODUCTION

Panel data sets have been increasingly used in economics to analyse complex economic phenomena. One of the attractions of panel data sets is the ability to use an extended data set to obtain information about parameters of interest which are assumed to have common values across panel units. The existing literature on panel data is huge and rapidly expanding. Good but inevitably somewhat partial reviews may be found, among others, in Baltagi (2001) and Hsiao (2003). Traditionally, panel analysis focussed on data sets with a large cross-sectional dimension (N) and a smaller time series dimension (T). More recently, with the emergence of data sets rich in both N and T, focus rests on the theoretical analysis of large N-T data sets.

Inference in panel data sets has mainly used asymptotic approximations for the construction of test statistics and estimation of variances of estimators. The use of the bootstrap as an alternative to such asymptotic approximations has been considered, but its properties have not received the same amount of attention as in the time series literature. Here, we note the well-known fact that bootstrap methods can provide better approximations to the exact distributions of various statistics than asymptotic approximations, leading to the conclusion that the analysis of the bootstrap for panel data merits further attention. This property of the bootstrap is well documented in the literature (see, e.g. Hall, 1992, for independent data or Lahiri, 2003, for weakly dependent
data). The consideration of the bootstrap for panel data has focussed on resampling in the time dimension, extending the work on the bootstrap in time series. Resampling in the cross-sectional dimension has received attention as well. A treatment of such resampling methods when N is large but T is assumed small and fixed can be found in Cameron and Trivedi (2005). This paper aims to provide a treatment of the bootstrap when resampling occurs either in the cross-sectional dimension or, more generally, in both the cross-sectional and time series dimensions.

In a nutshell, cross-sectional resampling consists of resampling cross-sectional units as wholes rather than resampling within the units across the time dimension. The motivation for such resampling is clear when N is large compared to T. In particular, it is the only kind of resampling that will provide asymptotically valid bootstrap procedures when N increases but T remains fixed. Nevertheless, this case is not very interesting, as its treatment bears analogies to the treatment of the bootstrap for multivariate time series with N and T transposed. Treatments of resampling for time series may be found in, e.g. Davison (1997) or Lahiri (2003). The analysis becomes more interesting when both N and T are large. There, cross-sectional resampling is an alternative to time series resampling. Both are asymptotically valid. The paper will discuss the asymptotic validity of cross-sectional resampling in this context.

The question of what sort of resampling to use becomes more interesting when dependence is considered. Allowing for temporal dependence in panel data is of course essential in the large N-T context. On the other hand, the analysis of cross-sectional dependence is less developed, and assuming no such dependence is quite common in empirical literatures that use panel data, such as macroeconometrics. This is despite the advances in fields such as spatial econometrics. Assuming cross-sectional independence is crucial for the bootstrap. Dependent data cannot be resampled in the same way as independent data, and methods such as the block bootstrap need to be employed. Furthermore, the use of the block bootstrap or its variants has been shown to provide less accurate approximations than the bootstrap in the i.i.d. context, as discussed in, e.g. Lahiri (2003) or Andrews (2002). We show that if there exists temporal dependence but no cross-sectional dependence, then cross-sectional resampling can be more accurate than temporal resampling. More relevantly, it is common in the panel literature to account for cross-sectional dependence by introducing a factor structure in the residuals of the panel regressions (see, e.g. Pesaran, 2002). The presence of a factor structure introduces global cross-sectional dependence which is symmetric across all panel units but does not introduce local cross-sectional dependence. As a result, the standard cross-sectional resampling scheme we consider, which makes no use of block resampling, is still the appropriate resampling scheme.

The structure of the paper is as follows. Section 2 provides a discussion of cross-sectional resampling. Section 3 provides theoretical results for the bootstrap based on cross-sectional resampling for a particular estimator. Section 4 presents a Monte Carlo analysis of the new bootstrap procedure. Finally, Section 5 concludes.
2. THE BOOTSTRAP FOR PANEL DATA SETS

In this section, we discuss various possibilities for bootstrap resampling schemes that can be applied in large N-T panel data sets. In order to do this we introduce a general panel model, given by
$$y_{i,t} = z_t'a_i + x_{i,t}'\beta + \epsilon_{i,t}, \qquad i = 1, \ldots, N;\ t = 1, \ldots, T. \qquad (2.1)$$
The focus of attention is inference on the vector β. z_t is a vector of variables that enter all cross-sectional units. In many applications it will contain deterministic terms such as a constant. x_{i,t} = (x_{1,i,t}, ..., x_{k,i,t})' contains explanatory variables that are particular to a given cross-sectional unit. We will regulate the behaviour of the explanatory variables, the coefficients and the error term ε_{i,t} via appropriate assumptions in the next section, but keep the discussion heuristic at this stage to concentrate on the intuition. We assume the existence of an estimator β̂ for β which is consistent both as only N → ∞ and as T, N → ∞ sequentially (i.e. as T → ∞ and then N → ∞), and which, suitably normalised, is asymptotically normal. The exact nature of the estimator will depend on the assumptions made about (2.1).

Define Y = (y_1, ..., y_i, ..., y_N) = (y_1, ..., y_t, ..., y_T)', X̃_j = (x̃_{j,1}, ..., x̃_{j,i}, ..., x̃_{j,N}) = (x̃_{j,1}, ..., x̃_{j,t}, ..., x̃_{j,T})', ε = (ε_1, ..., ε_i, ..., ε_N) = (ε_1, ..., ε_t, ..., ε_T)', Z = (z_1, ..., z_T)', where y_i = (y_{i,1}, ..., y_{i,T})', y_t = (y_{1,t}, ..., y_{N,t})', x̃_{j,i} = (x_{j,i,1}, ..., x_{j,i,T})', x̃_{j,t} = (x_{j,1,t}, ..., x_{j,N,t})', ε_i = (ε_{i,1}, ..., ε_{i,T})' and ε_t = (ε_{1,t}, ..., ε_{N,t})'. We first consider a 'fixed effects' interpretation of the model and assume the existence of consistent estimates of each a_i, denoted by â_i, as T → ∞. Define A = (a_1, ..., a_N) and Â = (â_1, ..., â_N).

We now consider the definition of a bootstrap sample, distinguishing between the parametric and the non-parametric bootstrap. There is an obvious tradeoff between the two, depending on how realistic one considers the assumed model to be. We define the non-parametric bootstrap sample to be given by the set {Y*, X̃*_1, ..., X̃*_k}. Likewise, the parametric bootstrap sample is given by {ε*, X̃*_1, ..., X̃*_k, Â, β̂}, where starred entries have been obtained by some sort of resampling from their non-starred counterparts. As is usual in the bootstrap literature, we use the superscript star in all probabilistic statements, such as →^{d*} or →^{p*} for convergence in distribution and probability, respectively, that relate to the bootstrap probability space, which is conditional on the realisation of the original sample.

We now focus on possible resampling schemes. Dealing first with the non-parametric bootstrap, the most common scheme for resampling both y_{i,t} and x_{i,t} operates in the time dimension and consists of drawing with replacement either individual rows or, in the case where the data are assumed to be dependent, blocks of contiguous rows from Y and X̃_j, where the block size is assumed to depend solely on and grow with T. So, for example, in the case of independent data, Y* = (y_{t_1}, ..., y_{t_t}, ..., y_{t_T})', where each element of the vector of indices (t_1, ..., t_T) is obtained by drawing with replacement from (1, ..., T)'. The same vector of indices is used to obtain X̃*_j, j = 1, ..., k.

Cross-sectional resampling, on the other hand, resamples columns of Y and X̃_j with replacement. Thus, in this case, Y* = (y_{i_1}, ..., y_{i_i}, ..., y_{i_N}), where each element of the vector of indices (i_1, ..., i_N) is obtained by drawing with replacement from (1, ..., N)'. The same vector of indices is used to obtain X̃*_j, j = 1, ..., k. In the case of cross-sectional dependence, blocks of columns of Y can be randomly resampled with replacement. In this case, Y* = (y_{i_1}, y_{i_1+1}, ..., y_{i_1+b}, ..., y_{i_i}, y_{i_i+1}, ..., y_{i_i+b}, ..., y_{i_{[N/b]}}, ..., y_{i_{[N/b]}+b}), where the vector of indices (i_1, ..., i_{[N/b]}) is obtained by drawing with replacement from (1, ..., N − b)' and b denotes the block size.

Of course, a combination of the two resampling schemes is also possible. The combination is obtained as follows. Let the temporally resampled bootstrap sample be denoted by Ȳ* = (y_{t_1}, ..., y_{t_t}, ..., y_{t_T})' ≡ (ȳ*_1, ..., ȳ*_i, ..., ȳ*_N). Then, the bootstrap sample from the combination of temporal and cross-sectional resampling is given by Y* = (ȳ*_{i_1}, ..., ȳ*_{i_i}, ..., ȳ*_{i_N}). In this case, temporal resampling occurs first, followed by cross-sectional resampling. Block resampling is straightforwardly defined.
In summary, formal definitions of the various resampling schemes suggested above are provided below.

DEFINITION 2.1 (cross-sectional resampling). For a T × N matrix of random variables Z, cross-sectional resampling is defined as the operation of constructing a T × N* matrix Z* where the columns of Z* are a random resample with replacement of blocks of the columns of Z and N* is not necessarily equal to N.

DEFINITION 2.2 (temporal resampling). For a T × N matrix of random variables Z, temporal resampling is defined as the operation of constructing a T* × N matrix Z* where the rows of Z* are a random resample with replacement of blocks of the rows of Z and T* is not necessarily equal to T.

DEFINITION 2.3 (cross-sectional/temporal resampling). For a T × N matrix of random variables Z, cross-sectional/temporal resampling is defined as the operation of constructing a T* × N* matrix Z* where the columns and rows of Z* are a random resample with replacement of blocks of the columns and rows of Z and N*, T* are not necessarily equal to N, T.

The parametric bootstrap can be implemented similarly, with the residual matrix ε̂, rather than Y, being resampled together with the X̃_j in the manner discussed above. Then, the estimates of the model parameters β̂ and Â are used to construct Y*.

Moving on to a 'random effects' interpretation of the panel model, we abstract from the issue of estimating β but simply assume that some appropriate estimator has been used. Of course, the dichotomy between the 'random effects' and 'fixed effects' interpretations is not relevant for the non-parametric bootstrap. For the parametric bootstrap, we note that by assumption a_i is independent of a_j, ∀ j ≠ i. Then, we define the residual term ε̂_{i,t} = y_{i,t} − x_{i,t}'β̂. Note that, conditional on z_t, ε̂_{i,t} is independent of ε̂_{j,t} ∀ j ≠ i. Similarly to ε, we construct the matrix ε̂ and simply resample from it either cross-sectionally, temporally or both.
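A minimal sketch of Definitions 2.1–2.3 for a T × N data matrix follows; in a regression context the same drawn indices would be applied to Y and to every X̃_j. All names are illustrative.

```python
# Cross-sectional, temporal and combined block resampling of a T x N matrix.
import numpy as np
rng = np.random.default_rng(0)

def draw_block_indices(N, b, rng=rng):
    # Start points of N // b blocks of b contiguous indices, drawn with
    # replacement; b = 1 gives ordinary i.i.d. resampling of single units.
    starts = rng.integers(0, N - b + 1, size=N // b)
    return np.concatenate([np.arange(s, s + b) for s in starts])

def cross_sectional_resample(Z, b=1, rng=rng):       # Definition 2.1
    return Z[:, draw_block_indices(Z.shape[1], b, rng)]

def temporal_resample(Z, b=1, rng=rng):              # Definition 2.2
    return Z[draw_block_indices(Z.shape[0], b, rng), :]

def cs_temporal_resample(Z, b_t=1, b_n=1, rng=rng):  # Definition 2.3
    # Temporal resampling first, then cross-sectional, as in the text.
    return cross_sectional_resample(temporal_resample(Z, b_t, rng), b_n, rng)
```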
3. THEORETICAL RESULTS In this section we provide some theoretical results for the bootstrap based on cross-sectional resampling. We will deal with the non-parametric bootstrap. Similar treatments for the parametric bootstrap can also be considered. We carry out a sequential asymptotic analysis, as first T and then N tend to infinity. We discuss straightforward extensions of part of the results to joint asymptotic analysis in Remark 3.2. A good reference on the distinction between sequential and joint asymptotic analysis is Phillips and Moon (1999). That paper makes clear that proofs involving joint asymptotic analysis are more difficult to establish that those using sequential asymptotic analysis. However, as we discuss in Remark 3.2 our assumptions can be used to establish results under both kinds of asymptotic analysis. The following assumptions are made. ASSUMPTION 3.1. Uniformly across i the regressors, xi,t , are covariance stationary, uniformly integrable processes with absolutely summable autocovariances, zero means and finite fourthorder moments and are distributed independently of the individual-specific errors, i,t , for all t and t . The regressors are independent across i. C 2008 The Author. Journal compilation C The Royal Economic Society 2008
381
Bootstrap resampling in panel data sets
ASSUMPTION 3.2. The observed common effects, zt , are covariance stationary with absolute summable autocovariances, distributed independently of the individual-specific errors, i,t and regressors xi,t , for all t and t . ASSUMPTION 3.3. The slope coefficients of the individual-specific effects, β i are restricted to be equal to a common value, β. The coefficients of the observed common effects, α i , are bounded (lie on a compact set). ASSUMPTION 3.4. The individual specific error, i,t , is distributed independently across i. They are martingale difference processes across t with mean zero, variance, σ i2 , and a finite fourth-order 4 moment, E( i,t ) ≤ K. REMARK 3.1. The above assumptions are not the most mild assumptions possible but they are relatively mild. The main restrictions are independence across units and exogeneity. In fact these assumptions are not needed explicitly for the bootstrap procedures we propose but for the underlying estimators. Further, these estimators, discussed below, have the desired properties under milder assumptions in which case the bootstrap procedures will do too. But our aim is not to have the mildest assumptions possible for the estimators but simply to provide relatively straightforward proofs for the validity of the bootstrap procedure so as to illustrate the essence of the bootstrap. Further extensions are possible in a number of dimensions and worth pursuing but lie outside the scope of this paper whose main aim is to introduce new resampling schemes in the context of the bootstrap. Next, we discuss two possible estimators for β. Of course, alternative estimators could be analysed. We define the following two estimators for β which are common in the literature (see, e.g. Pesaran, 2002, and the references cited therein). First, we define the pooled estimator given by −1 N N 1 1 ˆ βP = (3.1) X MXi X Myi , N i=1 i N i=1 i where M = I − Z (Z Z)−1 Z and Xi = (xi,1 , . . . , xi,t , . . . , xi,T ) Secondly, we define the mean group estimator N 1 βˆM G = βˆi , N i=1
where
(3.2)
−1 βˆi = Xi MXi Xi Myi .
(3.3)
Define −1/2
−1/2
ˆ j x (βˆ P − β), t M G = N 1/2 ˆ j x (βˆM G − β), t P = N 1/2 where
ˆ Px =
N Xi MXi 1 N i=1 T
−1
N X MXi 1 σˆ i i N i=1 T
C 2008 The Author. Journal compilation C The Royal Economic Society 2008
N Xi MXi 1 N i=1 T
(3.4) −1 (3.5)
382
G. Kapetanios
ˆ M Gx =
σˆ i =
N T (βˆM G − β)(βˆM G − β) N i=1
(3.6)
1 yi − Xi βˆi yi − Xi βˆi T
(3.7)
and ∗−1/2
ˆ jx t P∗ = N 1/2 where
ˆ ∗P x =
ˆ (βˆ P∗ − β),
N Xi∗ MXi∗ 1 N i=1 T
−1
∗−1/2
∗ 1/2 ˆ jx tM G = N
N X∗ MXi∗ 1 σˆ i∗ i N i=1 T
∗ ˆ (βˆM G − β),
N Xi∗ MXi∗ 1 N i=1 T
N ∗ ∗ T ˆ βˆM G − βˆ βˆM G −β N i=1 1 ∗ σˆ i∗ = yi − Xi∗ βˆi∗ yi∗ − Xi∗ βˆi∗ . T
(3.8) −1 (3.9)
∗ ˆM Gx =
(3.10)
It is a standard result in the literature (see, e.g. Pesaran, 2002) and the references cited therein) d
that T 1/2 t j → N (0, I), j = P, M G. We next present our first result. d∗
THEOREM 3.1. Let Assumptions 3.1–3.4 hold. Then, for fixed T , t ∗j → N (0, I). Further, as d∗
T → ∞ and N → ∞ sequentially, T 1/2 t ∗j → N (0, I) in probability T ASSUMPTION 3.5. The joint distribution of {xi,t }t=1 and the distribution of i,t is the same for all i. xi,t are strictly stationary.
ASSUMPTION 3.6. Let uit = (x i,t , i,t ) . E(||uit ||)l < ∞, for some sufficiently large but finite l. ASSUMPTION 3.7. Let χ (u) denote the characteristic function of uit . χ (u) satisfies Cramer’s condition lim sup |χ (u)| < 1. ||u||→∞
(3.11)
THEOREM 3.2. Let the assumptions underlying Theorem 3.1 hold. Then, the bootstrap estimate of the distribution of t j is O p∗ (N −1 ) consistent for fixed T under Assumptions 3.5–3.7. Further, it is O p∗ (N −1 T −1/2 ) as T → ∞. In this case, the assumptions underlying Theorem 3.1 are sufficient for the result. REMARK 3.2. Note that for both theoretical results we use sequential asymptotics. Extending Theorem 3.1 to hold for joint N, T asymptotics is straightforward. To see this we use results by Phillips and Moon (1999). It is sufficient to prove that conditions (i)–(iv) of Theorem 3.1 of Phillips and Moon (1999) hold for Xi∗ M∗ Xi∗ and that the 2 + δ moment of Xi∗ M∗ i∗ exists using (A.9) of theorem 2 of Phillips and Moon (1999). But these easily follow by Assumptions C 2008 The Author. Journal compilation C The Royal Economic Society 2008
Bootstrap resampling in panel data sets
383
3.1–3.4. However, proving Theorem 3.2 using joint asymptotics is not trivial as we rely heavily on results by Hall (1992) which use Edgeworth expansions. We are not aware of any derivation of Edgeworth expansions under joint asymptotics. REMARK 3.3. Assumptions 3.5 and 3.6 are quite strict. Assumption 3.5 can be easily relaxed to allow for different means and variances for the regressors and different variances for the error terms across i. Assumption 3.6 is required by theorem 5.1 of Hall (1992). Again, as noted in Remark 3.1, the point is to provide a clear outline of the kind of results possible for the new resampling schemes. Relaxing assumptions further, while worthwhile, may detract from the clarity of the exposition for these schemes.
4. MONTE CARLO STUDY In this section we carry out a Monte Carlo analysis of the various resampling schemes described in Section 2. 4.1. Monte Carlo setup In this Monte Carlo study two different models are considered. The first is given by (2.1). The Monte Carlo design for this model (Case I) is designed to satisfy the assumptions of Section 3. The second model (Case II) extends (2.1) by allowing i,t to contain a common factor effect along the lines of Pesaran (2002). In other words i,t = γi ft + εi,t where ft is an m-dimensional weakly dependent process which satisfies the same assumption as z t (i.e. Assumption 3.2), εi,t has the same properties as i,t did previously and γ i can be interpreted as either fixed bounded constants or i.i.d. random variables across i. Note that although the factor introduces cross-sectional dependence in the panel, this is not a problem for i.i.d. cross-sectional resampling. The reason for this is that the factor is basically another zt variable which does not introduce any spatial structure to the panel. Hence, a random i.i.d. cross-sectional resample of the original sample will replicate its properties as long as N → ∞. This makes our suggested cross-sectional resampling scheme much more widely applicable. We now provide specifications for the Monte Carlo experiments for Case II. We allow m = 1, 3. z t and k are set to 1. We set f s,t = ρs, f f s,t−1 + ε f ,s,t ,
s = 1, . . . , m
and xi,t = ρi,x xi,t−1 + vi,t i = 1, . . . , N ; Let
⎞ εi,t ξit = ⎝ vi,t ⎠ , ε f ,i,t ⎛
C 2008 The Author. Journal compilation C The Royal Economic Society 2008
s = 1, . . . , m
384
G. Kapetanios
ξ it is generated as iid N (0, Σξ i ), where
ξ i = diag σiε2 , σiv2 , σiε2 f ,
2 2 where σiε2 f = 1 − ρi2f . We let ρ s f = 0.5, ρ i x ∼ U [0.2, 0.9], σ iε ∼ U [1.5, 2.5], σ iv ∼ U [0.5, 1.5] and γ i,s ∼ N (1, 0.04). The final set of parameters to be fixed is γ is . γ is [k] ∼ N (1, 0.04). For the AR models underlying f s,t and x i,t , we mitigate the effect of initial conditions, which are set to zero, by generating time series of T + 200 observations and then dropping the first 200 observations. We set N, T = 25, 50, 100, 150, 250. One thousand replications are carried out. For Case I we simply set γ i,s = 0 and do not include y¯t and x¯ t in the panel regressions. We carry out cross-sectional resampling, temporal resampling with a block structure where the block size is set to [T 1/4 ] as suggested by, e.g. Lahiri (2003) or Andrews (2002), and combined cross-sectional and temporal resampling where we investigate both resampling cross-sectionally first and then temporally (as in Definition 2.3) and vice versa. In order to estimate β for the model with the factors we follow Pesaran (2002) and use the following estimators. Define
y¯t =
N N 1 1 yi,t , x¯ t = xi,t . N i=1 N i=1
(4.1)
¯ and y¯ be T × k and T × 1 observation matrices on the aggregates x¯ t and y¯t , respectively. Let X Then, βˆi for the mean group estimator is defined as ¯ i )−1 X My ¯ i, βˆi = (Xi MX i
(4.2)
¯ = IT − H( ¯ H ¯ H) ¯ −1 H ¯ , M
(4.3)
where
¯ = (Z, X, ¯ y¯ ).The Mean Group estimator is given by (3.2). The pooled estimator is defined and H as −1 N N ¯X ˜i ¯ i. βˆ PC = X M X My (4.4) i
i=1
i
i=1
These estimators are proven to be consistent and asymptotically normal by Pesaran (2002). Their ˆ ∗ ˆ ∗−1 R ˆ ∗−1 , respectively, where variances are given by (3.10) and ¯
N ¯ Xi MXi 1 ∗ Xi MXi ˆ ˆ ˆ ˆ ˆ R = (βi − β M G )(βi − β M G ) . (4.5) N − 1 i=1 T T and ˆ∗ =
N ¯ Xi MXi 1 . N i=1 T
(4.6)
We consider three different statistics whose properties we investigate: First, the bootstrap variance estimator; secondly, the rejection probability of the two-sided t-test for the null hypothesis that β = 0 for a 5% nominal significance level and thirdly the rejection probability of a Hausmantype test of panel homogeneity. Under homogeneity both the pooled and mean group estimators C 2008 The Author. Journal compilation C The Royal Economic Society 2008
385
Bootstrap resampling in panel data sets Table 1. Variance estimate results for model without factors. S E(V N ) Pooled estimator RR M Mean Group estimator M S E(V T )
R M S E(V N ) R M S E(V T )
T/N
25
50
100
150
250
25
50
100
150
250
25
1.007
0.787
0.563
0.533
0.406
0.988
0.657
0.566
0.539
0.410
50
1.177
0.868
0.660
0.581
0.519
1.171
0.888
0.655
0.584
0.512
100 150
1.317 1.404
1.022 1.160
0.850 0.900
0.667 0.741
0.569 0.658
1.324 1.508
1.007 1.127
0.773 0.946
0.648 0.723
0.581 0.684
250
1.494
1.243
1.064
0.834
0.768
1.496
1.283
1.019
0.848
0.738
25
Pooled estimator 103 ∗ RMSE(V N ,T ) 4.653 2.294 1.123 0.783 0.452
Mean Group estimator 103 ∗ RMSE(V N ,T ) 6.598 3.344 1.633 1.147 0.660
50
2.038
0.971
0.474
0.319
0.190
2.743
1.362
0.627
0.443
0.259
100 150 250
0.924 0.612 0.360
0.465 0.298 0.179
0.241 0.155 0.092
0.149 0.098 0.059
0.092 0.062 0.037
1.206 0.780 0.449
0.621 0.390 0.236
0.314 0.202 0.120
0.199 0.131 0.077
0.125 0.083 0.047
25 50 100 150 250
Pooled estimator 103 ∗ RMSE(V T,N ) 4.660 2.289 1.129 0.780 0.451 2.019 0.981 0.472 0.320 0.191 0.927 0.462 0.241 0.151 0.092 0.615 0.297 0.155 0.098 0.062 0.360 0.178 0.093 0.059 0.037
Mean Group estimator 103 ∗ RMSE(V T,N ) 6.610 3.340 1.643 1.147 0.658 2.719 1.366 0.627 0.443 0.257 1.209 0.620 0.311 0.202 0.125 0.786 0.388 0.202 0.130 0.083 0.449 0.234 0.120 0.077 0.047
Notes: V N , variance estimator based on cross-sectional resampling (see Definition 2.1); V T , variance estimator based on temporal resampling (see Definition 2.2); V N ,T , variance estimator based on cross-sectional/temporal resampling (see Definition 2.3); V T,N , variance estimator based on temporal/cross-sectional resampling (see Definition 2.3).
are consistent for β, and, under further regularity conditions, the pooled estimator is fully efficient. Under heterogeneity, the two estimators need not have the same probability limit. The major reason for considering the Hausman test is that it can provide a sterner benchmark for the performance of the various bootstrap approximations that the simple t-statistic. For the Hausman-type statistic, we do not allow for factors (Case I) and simplify further the model by specifying that ρ i x = 0.5, 2 2 σ iε = 2 and σ iv = 1. For this model, under normality, the pooled estimator is fully efficient and the standard construction for Hausman tests can be used. Note that, when unobserved factors are present in the data (Case II), it is not clear whether the estimators by Pesaran (2002) are fully efficient or, indeed, what is the fully efficient estimator and so the standard construction for Hausman tests cannot be straightforwardly used in Case II.
4.2. Monte Carlo results In Tables 1 and 4 we report root mean squared errors as performance measures of the variance estimators. The estimators are denoted by V N , V T , V N ,T and V T,N for the cross-sectional, temporal, combined cross-sectional/temporal and combined temporal/cross-sectional resampling, respectively. Actual rejection probabilities are reported in Tables 2 and 5 for the t-tests and in C 2008 The Author. Journal compilation C The Royal Economic Society 2008
386
G. Kapetanios
T/N
Table 2. Rejection probabilities for H 0 : β = 0 for model without factors. 50 100 150 250 25 50 100 150
25
Pooled estimator
250
Mean group estimator
Cross-sectional resampling 25 50
0.072 0.079
0.063 0.059
0.054 0.071
0.063 0.056
0.046 0.055
0.059 0.054
0.060 0.050
0.054 0.077
0.049 0.050
0.040 0.056
100
0.080
0.057
0.048
0.046
0.059
0.060
0.048
0.055
0.061
0.061
150 250
0.074 0.069
0.067 0.067
0.047 0.055
0.061 0.063
0.045 0.051
0.062 0.055
0.057 0.053
0.050 0.053
0.060 0.056
0.037 0.051
25
0.093
0.082
0.083
Temporal resampling 0.073 0.080 0.201
0.198
0.212
0.181
0.200
50
0.078
0.078
0.091
0.081
0.079
0.176
0.200
0.236
0.192
0.208
100 150 250
0.070 0.062 0.051
0.072 0.065 0.065
0.060 0.056 0.055
0.074 0.075 0.066
0.074 0.057 0.063
0.189 0.192 0.193
0.178 0.188 0.187
0.181 0.168 0.163
0.190 0.202 0.183
0.181 0.167 0.196
0.021 0.024 0.016 0.021 0.020
0.019 0.023 0.023 0.025 0.022
0.019 0.022 0.016 0.014 0.021
Temporal/cross-sectional resampling 0.000 0.001 0.031 0.022 0.002 0.003 0.017 0.016
0.019 0.023
0.018 0.022
0.015 0.022
0.018 0.022 0.024
0.021 0.023 0.020
0.017 0.012 0.022
25 50 100 150 250
0.002 0.000 0.002 0.002 0.001
0.003 0.001 0.001 0.002 0.001
Cross-sectional/temporal resampling 0.002 0.000 0.001 0.031 0.025 0.000 0.003 0.003 0.020 0.019 0.002 0.000 0.002 0.017 0.016 0.002 0.002 0.002 0.019 0.015 0.001 0.001 0.000 0.024 0.019
25 50
0.003 0.001
0.004 0.004
0.001 0.000
100 150 250
0.001 0.001 0.000
0.001 0.001 0.001
0.002 0.004 0.001
0.002 0.001 0.000
0.000 0.001 0.000
0.016 0.017 0.022
0.015 0.024 0.023
Table 3 for the Hausman-type test. It is worth noting here that a benchmark for the performance of the bootstrap in terms of rejection probabilities is, of course, the asymptotic approximation. For the t-test, the asymptotic approximation for both Cases I and II is known to be quite good, as discussed in, e.g. Pesaran (2002), whose Monte Carlo experiments are very similar to ours. In this respect the bootstrap approximations have a high hurdle to overcome. In the case of the Hausman-type test the performance of the asymptotic approximation is not available elsewhere and is therefore presented in Table 3. Results make interesting reading. Starting with Case I, for which results are reported in Tables 1–3, we see clearly that the cross-sectional resampling does better than the temporal resampling for all cases considered apart from cases where N is small whereas T is large. This is expected as cross-sectional resampling improves with N. But, the relative performance of crosssectional resampling improves as both N and T increase together as we can see from the diagonal elements of the relevant panels of Table 1. This implies that the i.i.d. resampling nature of crosssectional resampling is superior to the block temporal resampling scheme. This provides some C 2008 The Author. Journal compilation C The Royal Economic Society 2008
387
Bootstrap resampling in panel data sets Table 3. Rejection probabilities for Hausman Test for model without factors. T/N 25 50 100 150 250 Cross-sectional resampling 25 50
0.027 0.032
0.026 0.028
0.019 0.018
0.016 0.022
0.010 0.018
100
0.028
0.031
0.028
0.031
0.023
150
0.030
0.024
0.029
0.024
0.017
250
0.033
0.028
0.028
0.028
0.029
Temporal resampling 25 50
0.083 0.085
0.147 0.107
0.184 0.182
0.234 0.229
0.255 0.253
100 150 250
0.067 0.068 0.069
0.126 0.113 0.123
0.162 0.152 0.158
0.193 0.179 0.175
0.251 0.214 0.233
25 50 100 150 250
0.033 0.038 0.034 0.020 0.031
Cross-sectional/temporal resampling 0.061 0.086 0.102 0.043 0.093 0.113 0.062 0.074 0.103 0.057 0.087 0.086 0.060 0.078 0.094
0.111 0.135 0.149 0.125 0.152
25 50 100
0.029 0.038 0.034
Temporal/cross-sectional resampling 0.060 0.084 0.106 0.041 0.092 0.114 0.061 0.076 0.103
0.117 0.130 0.147
150 250
0.022 0.030
25 50 100 150
0.459 0.503 0.577 0.571
0.378 0.480 0.529 0.545
250
0.594
0.528
0.051 0.061
0.087 0.079
0.088 0.086
0.125 0.149
0.313 0.411 0.470 0.527
0.273 0.383 0.455 0.478
0.210 0.332 0.432 0.454
0.511
0.510
0.499
Asymptotic normal approximation
evidence supporting the theoretical result in Theorem 3.2. Results are similar for both pooled and mean groups estimators. Table 1 also report results on the combined cross-sectional and temporal resampling. We report absolute RMSE results there because the combined resampling scheme performs much worse than either of the other two resampling schemes. In particular the variance estimator is considerably upwards biased. Nevertheless, its performance improves when either N and T increase. There seems little to choose between combined cross-sectional/temporal and combined temporal/crosssectional resampling. C 2008 The Author. Journal compilation C The Royal Economic Society 2008
388
G. Kapetanios Table 4. Variance estimate results for model with factors. Pooled estimator
R M S E(V N ) R M S E(V T )
Number of factors: 1
Number of factors: 3
T/N
25
50
100
150
250
25
50
100
150
250
25
0.957
0.785
0.572
0.547
0.454
0.883
0.762
0.514
0.562
0.442
50
1.135
0.890
0.645
0.563
0.487
1.007
0.894
0.730
0.605
0.547
100 150
1.179 1.319
0.981 1.037
0.850 0.875
0.637 0.726
0.566 0.635
1.097 1.179
1.100 1.046
0.819 0.836
0.791 0.890
0.719 0.720
250
1.336
1.239
1.047
0.817
0.790
1.479
1.231
0.892
0.908
0.791
R M S E(V N ) R M S E(V T )
Mean group estimator 25
1.121
0.748
0.546
0.532
0.444
1.065
0.821
0.677
0.509
0.443
50 100 150 250
1.312 1.558 1.745 1.516
1.030 1.157 1.092 1.424
0.602 0.853 0.984 1.097
0.604 0.639 0.744 0.846
0.488 0.627 0.725 0.744
1.235 1.590 1.287 1.854
0.963 1.060 1.146 1.273
0.787 0.786 0.777 0.934
0.624 0.762 0.817 0.927
0.513 0.654 0.635 0.882
3.150 1.192 0.517 0.324 0.195
Pooled estimator 103 ∗ RMSE(V N ,T ) 1.507 1.020 0.602 6.764 3.235 0.569 0.368 0.221 2.606 1.268 0.264 0.163 0.099 1.118 0.551 0.166 0.106 0.065 0.688 0.351 0.096 0.062 0.038 0.443 0.206
1.553 0.627 0.273 0.170 0.098
1.062 0.410 0.182 0.117 0.067
0.612 0.247 0.110 0.070 0.041
3.132 1.203 0.515 0.324 0.194
Mean group estimator 103 1.518 1.016 0.600 0.569 0.370 0.222 0.264 0.164 0.100 0.166 0.105 0.065 0.097 0.062 0.038
RMSE(V N ,T ) 6.823 3.245 2.586 1.262 1.116 0.553 0.686 0.346 0.445 0.206
1.548 0.631 0.274 0.170 0.098
1.069 0.405 0.182 0.117 0.067
0.614 0.247 0.110 0.070 0.041
4.894 1.721 0.716 0.442 0.265
Pooled estimator 103 ∗ RMSE(V T,N ) 2.251 1.532 0.903 10.963 4.919 0.759 0.517 0.303 3.770 1.756 0.348 0.218 0.135 1.631 0.718 0.220 0.141 0.087 0.969 0.468 0.128 0.081 0.048 0.614 0.272
2.464 0.850 0.354 0.220 0.131
1.592 0.556 0.237 0.151 0.089
0.932 0.330 0.142 0.090 0.053
25 50 100 150 250 25 50 100 150 250 25 50 100 150 250
6.436 2.533 1.086 0.715 0.403 6.489 2.505 1.086 0.716 0.403 10.102 3.636 1.541 0.978 0.557
∗
Looking at the rejection probabilities of the various bootstrap procedures for the t-test in Table 2 we see that the best performing resampling scheme is the cross-sectional resampling. Temporal resampling tends to overreject and cross-sectional/temporal resampling tends to underreject. Moving on to the Hausman-Type test we see from Table 3 where the rejection probabilities are presented that once again the cross-sectional resampling works best. It only slightly underrejects C 2008 The Author. Journal compilation C The Royal Economic Society 2008
389
Bootstrap resampling in panel data sets Table 4. Continued. Pooled estimator
R M S E(V N ) R M S E(V T )
Number of factors: 1 T/N
25
50
100
150
Number of factors: 3 250
Mean group estimator 10
25 3 ∗
RMSE(V
50 N ,T
100
150
250
)
25
10.181
4.900
2.261
1.531
0.905
11.041
4.917
2.450
1.602
0.936
50 100
3.611 1.543
1.730 0.714
0.760 0.346
0.518 0.220
0.301 0.135
3.744 1.629
1.755 0.722
0.855 0.354
0.552 0.237
0.329 0.142
150
0.983
0.441
0.218
0.141
0.087
0.968
0.463
0.219
0.152
0.090
0.558
0.264
0.128
0.081
0.049
0.617
0.271
0.131
0.090
0.054
250
VN ,
VT ,
Notes: variance estimator based on cross-sectional resampling (see Definition 2.1); variance estimator based on temporal resampling (see Definition 2.2); V N,T , variance estimator based on cross-sectional/temporal resampling (see Definition 2.3); V T,N , variance estimator based on temporal/cross-sectional resampling (see Definition 2.3).
but this feature is less apparent from larger N and T. All other resampling schemes overreject for large N. We also report results for the normal approximation which overrejects significantly. We next consider results for Case II which are reported in Tables 4 and 5. Once again the cross-sectional resampling does better than the temporal resampling for the majority of cases considered both in terms of the variance estimators and rejection probabilities for the t-tests. One exception for the variance estimators, are cases where N is small whereas T is large, just like for Case I. Overall results for Case II are very similar to those obtained for Case I. This similarity leads us to suggest that the performance of cross-sectional resampling is good in a variety of panel data models. In particular, the results obtained for Case II is of considerable interest as it implies that strong forms of cross-sectional dependence that do not have a local cross-sectional structure such as factor structures can still be dealt with i.i.d. resampling making the applicability of the new procedures much wider. This is indeed of practical significance given the recent work of, among others, Stock and Watson (1998), Bai and Ng (2002), Bai (2003), Bai and Ng (2004) and Pesaran (2002). An interesting feature is that whereas temporal resampling performs relatively well for the pooled estimator in the case of a model without factors, this is not the case for the model with factors. The temporal resampling scheme overrejects significantly in all cases for the mean group estimator. A final issue to consider is the possibility of serial correlation in εi,t . We therefore reexamine the bootstrap variance estimator and the rejection probability for the t-test for Case I where we now let εi,t be an AR(1) process with AR coefficient equal to 0.5. and i.i.d. standard normal errors. Results are presented in Tables 6 and 7. As is clear from these results the overall conclusions reached above do not change, although the relative performance of the cross-sectional resampling is improved compared to temporal resampling. This is expected given the temporal correlation introduced in the error term.
5. CONCLUSIONS This paper has considered the issue of bootstrap resampling in panel data sets. The availability of data sets with large temporal and cross-sectional dimensions suggests the possibility of new resampling schemes. We suggest one possibility which has not been widely explored in the C 2008 The Author. Journal compilation C The Royal Economic Society 2008
390
G. Kapetanios Table 5. Rejection probabilities for H 0 : β = 0 for model with factors. Number of factors: 1 Number of factors: 3
T/N
25
50
100
150
250
25
50
100
150
250
Pooled estimator Cross-sectional resampling 25 50
0.066 0.066
0.061 0.051
0.056 0.065
0.053 0.060
0.052 0.054
0.060 0.065
0.059 0.063
0.070 0.043
0.054 0.065
0.055 0.064
100
0.062
0.063
0.043
0.050
0.056
0.064
0.056
0.049
0.043
0.046
150 250
0.053 0.066
0.066 0.059
0.048 0.051
0.051 0.061
0.049 0.055
0.082 0.050
0.059 0.060
0.057 0.067
0.048 0.041
0.045 0.054
25
0.180
0.194
0.183
Temporal resampling 0.188 0.205 0.208
0.185
0.207
0.199
0.205
50 100 150 250
0.178
0.184
0.178 0.168 0.178
0.200 0.195 0.177
0.197 0.171 0.162 0.169
0.199 0.203 0.189 0.177
0.172 0.185 0.189 0.170
0.181 0.170 0.177 0.193
0.189 0.166 0.175 0.177
0.179 0.178 0.164 0.167
0.028 0.012 0.018 0.027 0.024
0.021 0.022 0.018 0.015 0.017
0.023 0.024 0.017 0.018 0.019
Temporal/cross-sectional resampling 0.021 0.017 0.025 0.024 0.025 0.023 0.027 0.019 0.023 0.021 0.023 0.022
0.027 0.012 0.017
0.023 0.021 0.017
0.026 0.020 0.015
0.023 0.015
0.027 0.025
0.016 0.017
0.021 0.018
0.191 0.187 0.173 0.176
0.189 0.192 0.199 0.158
25 50 100 150 250
0.028 0.025 0.024 0.020 0.020
0.019 0.023 0.020 0.024 0.022
Cross-sectional/temporal resampling 0.020 0.022 0.018 0.022 0.030 0.025 0.024 0.023 0.026 0.023 0.017 0.023 0.021 0.023 0.030 0.023 0.018 0.022 0.038 0.022 0.020 0.026 0.018 0.016 0.022
25 50 100
0.032 0.022 0.018
0.018 0.023 0.020
0.024 0.025 0.017
150 250
0.019 0.017
0.022 0.021
0.029 0.017
0.017 0.028
0.023 0.015
0.036 0.011
Mean group estimator Cross-sectional resampling 25 50 100 150
0.061 0.050 0.046 0.043
0.052 0.044 0.045 0.057
0.059 0.055 0.051 0.054
0.057 0.053 0.055 0.051
0.056 0.056 0.058 0.038
0.037 0.048 0.040 0.059
0.056 0.047 0.055 0.058
0.048 0.051 0.053 0.053
0.058 0.050 0.050 0.055
0.054 0.062 0.047 0.055
250
0.042
0.042
0.050
0.049
0.057
0.045
0.054
0.059
0.050
0.045
25 50 100 150
0.181 0.186 0.187 0.170
0.184 0.176 0.181 0.188
0.218 0.217 0.178 0.172
Temporal resampling 0.190 0.191 0.191 0.177 0.192 0.184 0.194 0.177 0.167 0.198 0.179 0.226
0.187 0.180 0.200 0.179
0.184 0.184 0.183 0.201
0.185 0.196 0.189 0.178
0.191 0.173 0.172 0.178
C 2008 The Author. Journal compilation C The Royal Economic Society 2008
391
Bootstrap resampling in panel data sets Table 5. Continued. Number of factors: 1
Number of factors: 3
T/N
25
50
100
150
250
25
50
100
150
250
250
0.192
0.182
0.160
0.188
0.190
0.164
0.186
0.201
0.159
0.153
25 50
0.029 0.020
0.022 0.018
0.020 0.023
0.024 0.023
0.018 0.022
0.011 0.017
0.025 0.013
0.014 0.020
0.024 0.019
0.022 0.020
100
0.009
0.017
0.020
0.023
0.023
0.020
0.021
0.021
0.015
0.019
150 250
0.014 0.009
0.021 0.013
0.019 0.025
0.024 0.019
0.014 0.020
0.023 0.015
0.022 0.019
0.025 0.023
0.021 0.018
0.023 0.011
25
0.033
0.023
0.023
Temporal/cross-sectional resampling 0.022 0.018 0.016 0.027
0.011
0.024
0.023
50 100 150 250
0.017 0.011 0.017 0.012
0.015 0.017 0.020 0.017
0.025 0.019 0.020 0.021
0.018 0.016 0.025 0.018
0.020 0.012 0.023 0.022
0.018 0.018 0.022 0.013
Cross-sectional/temporal resampling
0.023 0.023 0.021 0.018
0.023 0.020 0.014 0.021
0.019 0.015 0.025 0.013
0.014 0.020 0.023 0.018
Table 6. Variance estimate results for model with serial correlation and without factors. S E(V N ) S E(V N ) Pooled estimator RR M Mean group estimator RR M M S E(V T ) M S E(V T ) T/N 25 50 100 150 250 25 50 100 150 250 25 50 100 150 250
25
50
100
150
250
0.878 0.754 0.441 0.423 0.312 0.923 0.610 0.469 0.439 0.359 0.986 0.722 0.642 0.480 0.389 1.022 0.819 0.631 0.519 0.529 1.150 1.017 0.962 0.587 0.710 3 ∗ N ,T Pooled estimator 10 RMSE(V ) 6.573 3.390 1.559 1.099 0.628 3.389 1.605 0.794 0.551 0.311 1.713 0.854 0.454 0.288 0.169 1.164 0.581 0.306 0.198 0.126 0.732 0.379 0.196 0.123 0.079 3 ∗ T,N Pooled estimator 10 RMSE(V ) 6.586 3.388 1.551 1.093 0.624 3.357 1.612 0.790 0.550 0.312 1.718 0.847 0.452 0.291 0.170 1.170 0.582 0.306 0.198 0.126 0.729 0.377 0.197 0.123 0.079
25
50
100
150
250
0.924 0.650 0.442 0.571 0.342 0.928 0.642 0.454 0.439 0.351 1.021 0.699 0.553 0.460 0.388 1.056 0.764 0.621 0.493 0.616 1.048 1.048 1.000 0.576 0.578 3 ∗ Mean group estimator 10 RMSE(V N ,T ) 8.686 4.285 2.047 1.510 0.842 4.091 2.010 0.954 0.682 0.381 2.052 1.021 0.517 0.334 0.203 1.362 0.675 0.345 0.230 0.148 0.812 0.443 0.227 0.141 0.088 3 ∗ Mean group estimator 10 RMSE(V T,N ) 8.665 4.291 2.043 1.506 0.836 4.059 2.014 0.954 0.679 0.379 2.054 1.021 0.510 0.338 0.204 1.377 0.673 0.343 0.229 0.148 0.812 0.439 0.228 0.141 0.088
Notes: V N , variance estimator based on cross-sectional resampling (see Definition 2.1); V T , variance estimator based on temporal resampling (see Definition 2.2); V N ,T , variance estimator based on cross-sectional/temporal resampling (see Definition 2.3); V T,N , variance estimator based on temporal/cross-sectional resampling (see Definition 2.3). C 2008 The Author. Journal compilation C The Royal Economic Society 2008
392
G. Kapetanios
Table 7. Rejection probabilities for H 0 : β = 0 for Model with serial correlation and without factors T/N 25 50 100 150 250 25 50 100 150 250 Pooled estimator
Mean group estimator
Cross-sectional resampling 25 50
0.069 0.066
0.057 0.074
0.057 0.066
0.054 0.047
0.055 0.059
0.057 0.055
0.045 0.054
0.058 0.060
0.044 0.039
0.048 0.055
100
0.078
0.063
0.049
0.058
0.062
0.046
0.053
0.053
0.051
0.064
150 250
0.082 0.076
0.074 0.069
0.057 0.049
0.063 0.064
0.053 0.051
0.059 0.053
0.071 0.051
0.063 0.050
0.058 0.058
0.042 0.057
25
0.138
0.141
0.155
Temporal resampling 0.137 0.168 0.227
0.242
0.258
0.220
0.248
50
0.127
0.138
0.137
0.130
0.146
0.207
0.236
0.255
0.228
0.257
100 150 250
0.125 0.109 0.088
0.118 0.107 0.094
0.109 0.097 0.081
0.109 0.116 0.094
0.120 0.108 0.073
0.207 0.218 0.223
0.232 0.207 0.191
0.223 0.215 0.196
0.206 0.228 0.200
0.223 0.198 0.196
0.031 0.023 0.024 0.031 0.013
0.021 0.019 0.028 0.022 0.026
0.025 0.025 0.024 0.017 0.019
Temporal/cross-sectional resampling 0.003 0.002 0.029 0.021 0.002 0.010 0.025 0.022
0.040 0.018
0.020 0.019
0.025 0.027
0.027 0.034 0.015
0.026 0.023 0.026
0.022 0.014 0.021
25 50 100 150 250
0.011 0.004 0.003 0.005 0.006
0.003 0.004 0.001 0.004 0.003
Cross-sectional/temporal resampling 0.004 0.005 0.003 0.027 0.025 0.005 0.004 0.006 0.023 0.026 0.003 0.004 0.006 0.023 0.019 0.003 0.002 0.004 0.016 0.031 0.001 0.002 0.003 0.020 0.017
25 50
0.010 0.002
0.005 0.005
0.003 0.006
100 150 250
0.006 0.004 0.003
0.003 0.007 0.004
0.002 0.003 0.000
0.004 0.002 0.001
0.006 0.003 0.003
0.019 0.012 0.020
0.021 0.024 0.022
literature. It amounts to constructing bootstrap samples by resampling whole cross-sectional units with replacement. In cases where the data do not exhibit cross-sectional dependence but exhibit temporal dependence, such a resampling scheme is of interest as it allows the application of i.i.d. bootstrap resampling rather than block bootstrap resampling. It is well known that the former enables superior approximation to distributions of statistics compared to the latter. More relevantly, it is common in the panel literature to account for cross-sectional dependence by introducing a factor structure in the residuals of the panel regressions (see, e.g. Pesaran, 2002). The presence of a factor structure introduces global cross-sectional dependence which is symmetric across all panel units but does not introduce local cross-sectional dependence. As a result the standard cross-sectional resampling scheme we consider which makes no use of block resampling is still the appropriate resampling scheme. We prove that the bootstrap based on cross-sectional resampling provides asymptotic refinements. A Monte Carlo study illustrates the performance of the new resampling scheme. C 2008 The Author. Journal compilation C The Royal Economic Society 2008
Bootstrap resampling in panel data sets
393
ACKNOWLEDGMENTS I would like to thank the Co-Editor and three anonymous referees for their extremely helpful comments that greatly improved the paper.
REFERENCES Andrews, D. W. K. (2002). Higher order improvements of a computationally attractive k-step bootstrap for extremum estimators. Econometrica 70, 119–62. Bai, J. (2003). Inferential theory for factor models of large dinensions. Econometrica 71, 135–73. Bai, J. and S. Ng (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–221. Bai, J. and S. Ng (2004). A PANIC attack on unit roots and cointegration. Econometrica 72, 1127–77. Baltagi, B. H. (2001). Econometric Analysis of Panel Data. Chichester: Wiley. Cameron, A. C. and P. K. Trivedi (2005). Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press. Davidson, J. (1994). Stochastic Limit Theory. Oxford: Oxford University Press. Davison, A. C. (1997). Bootstrap Methods and their Application. Cambridge: Cambridge University Press. Hall, P. (1992). The Bootstrap and Edgeworth Expansion, Springer Series in Statistics. New York: SpringerVerlag. Hsiao, C. (2003). Analysis of Panel Data. Cambridge: Cambridge University Press. Lahiri, S. N. (2003). Resampling Methods for Dependent Data. New York: Springer. Pesaran, M. H. (2002). Estimation and inference in large heterogenous panels with cross-section dependence. DAE working paper No. 0305, University of Cambridge. Phillips, P. C. B. and H. R. Moon (1999). Linear regression limit theory for nonstationary panel data. Econometrica 67, 1057–1111. Stock, J. and M. W. Watson (1998). Diffusion indexes. Working paper 6702, National Bureau of Economic Research.
APPENDIX: PROOF ∗ Proof of Theorem 3.1: We deal with the pooled estimator first. Substituting the true model for y i,t in the estimator gives −1 N N 1 1 1/2 ˆ ∗ ∗ ∗ ∗ ∗ ∗ ∗ N X M Xi X M i . β P − βˆ = (A.1) N i=1 i N 1/2 i=1 i
For fixed T , {Xi∗ M∗ i∗ }i is a sequence of i.i.d. random variables with expectation 0 and variance σ i E ∗ (Xi∗ M∗ Xi∗ ) by Assumptions 3.1, 3.2 and 3.4. Hence, by a standard central limit theorem for i.d. random variables (see, e.g. theorem 25.2 of Davidson, 1994), N N 1 1 d∗ σˆ i∗ Xi∗ M∗ Xi∗ Xi∗ M∗ i∗ → N (0, I ). (A.2) 1/2 N i=1 N i=1 Hence, the result follows. If T → ∞, then −1 N N Xi∗ M∗ Xi∗ Xi∗ M∗ i∗ 1 1 (N T )1/2 βˆ P∗ − βˆ = . N i=1 T N 1/2 i=1 T 1/2 C 2008 The Author. Journal compilation C The Royal Economic Society 2008
(A.3)
394
G. Kapetanios
Then, by Assumption 3.1 Xi∗ M∗ Xi∗ p∗ → Qi T
(A.4)
Xi∗ M∗ i∗ d ∗ → N 0, σi∗ Q i 1/2 T
(A.5)
where Q i is a positive definite matrix and
where both (A.4) and (A.5) hold uniformly across i by virtue of Assumption 3.1. Then, by letting N → ∞ we get the required result. Moving on to the mean group estimator, we see that the result for fixed T is obvious. To see this note that N 1 ∗ βˆM β∗. G = N i=1 i
(A.6)
N But β i∗ is simply an i.i.d. resample from {βˆi }i=1 . Since βˆi are i.d. random variables with finite second moments and equal mean (by Assumption 3.3), uniformly across i, the result follows, noting that the variance estimator given in (3.10) converges √ to the true variance following a standard law of large numbers for i.d. random variables. For T → ∞, T (βˆi − β) converges distribution and hence the above argument for √ to a normal N fixed T can be applied to a resample from { T (βˆi − β)}i=1 when T → ∞ followed by N → ∞.
Proof of Theorem 3.2: We will provide a proof for the simple case of only the constant belonging to zt and a single x regressor. The general case of multiple z and x follows with appropriate modifications. We first consider the fixed T case. The estimator βˆ P is given by −1 N N T T 1 1 2 (xi,t − x¯i ) (xi,t − x¯i )(yi,t − y¯i ) , (A.7) βˆ P = N i=1 t=1 N i=1 t=1 where x¯i =
1 T
T t=1
xi,t . Substituting in the true model for y i,t gives −1 N N T T 1 1 2 (xi,t − x¯i ) (xi,t − x¯i )i,t . βˆ P − β = N i=1 t=1 N i=1 t=1
(A.8)
Denote Xi =
T (xi,t − x¯i )2
(A.9)
t=1
and Yi =
T (xi,t − x¯i )i,t
(A.10)
t=1
and remember the definition of
ˆx =
N 1 Xi N i=1
−2
N 1 2 σ , N i=1 x,i
(A.11)
T 2 = t=1 σˆ i2 (xi,t − x¯i )2 . Then, the quantity whose distribution we are estimating is easily seen where σx,i to be a function of means of i.i.d. random variables, by assumption. These random variable sequences are 2 Xi , Yi and σ 2x,i . Their means over i are denoted by X¯i Y¯ i and σ¯ x,i . They are i.i.d. by Assumption 3.4. Denote ¯ where Z¯ = (X¯i , Y¯ i , σ¯ 2 ). Further, denote the bootstrap equivalent the function of the means by N 1/2 A(Z), x,i C 2008 The Author. Journal compilation C The Royal Economic Society 2008
Bootstrap resampling in panel data sets
395
¯ by N 1/2 A∗ (Z). ¯ We then consider Edgeworth expansions for N 1/2 A(Z) ¯ and N 1/2 A∗ (Z). ¯ By of N 1/2 A(Z) Assumption 3.6, we have that x i,t and i,t possess moments of sufficiently high order, denoted l. It then 2 follows that, for fixed T, X¯i Y¯ i and σ¯ x,i possess moments of the same order. Then, under Assumptions 3.5–3.7, it follows from theorem 5.1 of Hall (1992) that ν 1/2 − j/2 ¯ ≤ w) − (w) − sup −∞<w<∞ P(N A(Z) N q j (w)φ(w) = O(N −ν/2 ) (A.12) j=1 and
ν ∗ 1/2 ∗ ¯ − j/2 qˆ j (w)φ(w) = O p (N −(ν+1)/2 ), N sup −∞<w<∞ P (N A (Z) ≤ w) − (w) − j=1 (A.13)
where (.), and φ(.) denote the standard normal distribution and density functions, respectively, q j (w) are polynomials of population cumulants of Xi , Yi and σ 2x,i and qˆ j (w) are as q j (w) but where the population ¯ quantities are replaced by sample ones. These are the Edgeworth expansions corresponding to N 1/2 A(Z) 1/2 ∗ ¯ and N A (Z). Inverting these expansions gives Cornish–Fisher expansions of the distribution quantiles given by vα = z α +
ν
N − j/2 q j1 (z α )
(A.14)
N − j/2 qˆ j1 (z α ),
(A.15)
j=1
vˆα = z α +
ν j=1
¯ ≤ vα ) = α and (z α ) = α and qj1 (.) and qˆ j1 (.) are where v α and z α are the solutions of P(N 1/2 A(Z) polynomials defined in terms of q j and qˆ j (.). Since, sample moments and cumulants are O p (N −1/2 ) consistent estimators of population moments it follows that qˆ j (w) = q j (w) + O p (N −1/2 ) and so vˆα − vα = O p∗ (N −1 ) completing the proof for fixed T. For sequential (N, T) asymptotics we consider a sequence where T → ∞ followed by N → ∞. This 2 case is much simplified since as T → ∞, T1 Xi and T1 σx,i tend in probability to constants. Further, √1T Yi tends to a normal distribution which automatically satisfies Assumptions 3.4–3.6. Hence, the result follows via a similar treatment to the fixed T case.
C 2008 The Author. Journal compilation C The Royal Economic Society 2008
Econometrics Journal (2008), volume 11, pp. 396–408. doi: 10.1111/j.1368-423X.2008.00238.x
K-nearest-neighbour non-parametric estimation of regression functions in the presence of irrelevant variables R UI L I † AND G UAN G ONG ‡ †School of Economics & Management, Beijing University of Aeronautics & Astronautics, China. E-mail:
[email protected] ‡School of Economics, Shanghai University of Finance and Economics, China. E-mail:
[email protected] First version received: December 2006; final version accepted: December 2007
Summary We show that when estimating a non-parametric regression model, the k-nearestneighbour non-parametric estimation method has the ability to remove irrelevant variables provided one uses a product weight function with a vector of smoothing parameters, and the least-squares cross-validation method is used to select the smoothing parameters. Simulation results are consistent with our theoretical analysis and show that the performance of the k-nn estimator is comparable to the popular kernel estimator; and it dominates a non-parametric series (spline) estimator when there exist irrelevant regressors. Key words: k-nearest-neighbour, Cross-validation, Irrelevant variables, Simulations.
1. INTRODUCTION The non-parametric kernel estimation method is by far the most popular technique used to estimate a regression model non-parametrically. Recently, Hall et al. (2007) show that the kernel estimation method, coupled with the least squares cross-validation method of selecting the smoothing parameters, has the amazing property that irrelevant (continuous or discrete) regressors can be automatically smoothed out. Li et al. (2007) further use a kernel-based propensity score estimator to estimate an average treatment effect. They defend the use of the kernel method by stating that, ‘it is not clear to us how to extend the property of kernel smoothing to other nonparametric estimation methods such as series methods (i.e. the ability to automatically remove irrelevant covariates). Therefore, we restrict our attention to non-parametric kernel methods in this paper’. Indeed, when one faces a mixture of continuous and discrete variables, the only known nonparametric series estimation method is to use indicator functions to split the sample into discrete cells, and then estimate a regression model using data from each cell. This sample split method becomes infeasible when the number of discrete cells is large. Therefore, the non-parametric series estimation method does not share the amazing ‘removing irrelevant variable’ property of the kernel-based estimator. In this paper, we investigate the problem of whether or not the nonparametric k-nn method can have the property of automatically removing irrelevant variables C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008. Published by Blackwell Publishing Ltd, 9600 Garsington
Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA, 02148, USA.
K-nn non-parametric estimation of regression functions
397
in a regression model. We show that the k-nn estimator can indeed remove irrelevant variables from a regression model. However, special attention is needed in order for a k-nn estimator to possess this property. First, in order to remove irrelevant variables, a product weight function must be used with a vector of smoothing parameters so that each component of the regressor has a different smoothing parameter. Second, only the uniform weight function has the property of completely removing irrelevant variables. All other weight functions, such as the Gaussian and Epanechnikov weight functions, do not have this property. However, a simple modification can lead to k-nn estimators with non-uniform weight functions to possess the property of removing irrelevant variables. This is quite different from the kernel-based estimators because Hall et al. (2007) show that only non-uniform kernel functions have the ability of removing irrelevant regressors. In this paper, we use simulation results to compare the finite sample performance of three popular non-parametric methods: the kernel method, the nearest neighbourhood (k-nn) method and the spline method. We are particularly interested in evaluating these estimators in the presence of irrelevant regressors. The remaining part of the paper is organized as follows. In Section 2, we first discuss the conventional k-nn estimator and compare it with the kernel estimation results of Hall et al. (2007). We show in Section 3 that with some modifications, the conventional k-nn estimation method can possess the ability of removing irrelevant variables. Section 4 reports the simulation results and examines the finite sample behaviour of the k-nn estimator and compares it with the non-parametric kernel and series estimators. Finally, Section 5 concludes the paper.
2. THE NON-PARAMETRIC K-NN ESTIMATION METHOD We will mainly focus on the case of a non-parametric regression model with continuous regressors. We will briefly discuss the mixed discrete and continuous regressor case at the end of this section. Considering the following non-parametric regression model: Yi = g (X i ) + u i ,
i = 1, 2, . . . , n,
(2.1)
where X i ∈ R is a continuous variable of dimension q, the functional form of g(·) is unspecified. n are independent and identically distributed (i.i.d); We only consider the case where (Y i , X i )i=1 the results of the paper can be readily extended to the weakly dependent data case. We first review the conventional k-nn estimation method and then discuss what modifications are needed in order for the k-nn estimator to be able to automatically remove irrelevant variables. We will use the same notation as in Ouyang et al. (2006). For a fixed value x ∈ Rq , define the k-nearest-neighbour distance, centred at x by q
de f
Rx ≡ Rn (x) = the k th nearest Euclidean distance to x among all the X j s for j = 1, . . . , n.
(2.2)
Also define the k−nn distance centred at X i as de f
Ri ≡ Rn (X i ) = the kth nearest Euclidean distance to X i among all the X j s for j = 1, . . . , n.
(2.3)
q W (v)dv = Let W (·)4: R → R be a bounded non-negative weight function, 1, W (v)||v|| dv < ∞, where ||v|| denotes the Euclidean norm of v. For example, one can C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
398
Rui Li and Guan Gong
use the Epanechnikov weight function defined by W (v) = (3/4) (1 − ||v||2 ) 1(||v|| ≤ 1), where 1 (A) is an indicator function which takes value 1 if event A holds true, and 0 otherwise. The local constant k − nn estimator of g(x) is given by n 1 Xi − x gˆ (x) = / fˆ (x) , (2.4) Yi W q Rx n Rx i=1 n where fˆ (x) = n R1 q i=1 W ( XRi −x ) is the k − nn estimator of f (x), f (·) is the probability density x x function of X i . ˆ i ). Then, we Let Gˆ k denote the n × 1 vector with its ith element being given by g(X have that n ˆ G k = Mn (k)Y , where M n (k) is an n × n matrix with its (i, j)th element given by Wi j / l=1 Wil , where W i j = W ((X i − X j )/R i ). The following three procedures for selecting k were studied by Li (1987). 1 Mallow’s C L method (or C p ; Mallow, 1973): One selects kˆ to minimize the following objective function: kˆC = arg min n −1 k
n
ˆ i )]2 + 2σ 2 tr [Mn (k)]/n, [Yi − g(X
(2.5)
i=1
n ˆ i ). where σ 2 is the variance of u i . We estimate σ 2 by σˆ 2 = n −1 i=1 uˆ i2 with uˆ i = Yi − g(X 2 Generalized cross-validation method (Craven and Wahba, 1979): One selects kˆ to minimize the following objective function: n ˆ i )]2 n −1 i=1 [Yi − g(X kˆGC V = arg min . (2.6) −1 k (1 − n tr [Mn (k)])2 3 Leave-one-out cross-validation method (Stone, 1974): One chooses kˆC V to minimize n 1 2 kˆC V = arg min C V (k) ≡ arg min (2.7) [Yi − gˆ −i (X i )] , k k n i=1 where gˆ −i (X i ) = nj=i Y j Wi j / nj=i Wi j is the leave-one-out k − nn estimator of g(X i ). Li (1987) has shown that the above three procedures are asymptotically equivalent and all of them lead to optimal smoothing in the sense that [gˆ kˆ (x) − g(x)]2 d F(x) p → 1, (2.8) 2 infk [gˆ k (x) − g(x)] d F(x) where F(·) is the distribution function of X i , and gˆ kˆ (x) is the k − nn estimator of g(x) defined in (2.4) using one of the above procedures to select k, i.e., in (2.8), kˆ = kˆC , or kˆ = kˆGC V , or kˆ = kˆC V ; ˆ and gˆ k (x) = g(x) is the k-nn estimator of g(x) with a generic k as defined in (2.4). Ouyang et al. (2006) extend Li’s (1987) result and derive the rate of convergence of the leave-one-out crossvalidation kˆC V to some non-stochastic optimal benchmark value of k, say k0 . We now show that the above conventional k-nn estimator does not possess the property of removing irrelevant variables. We use X is to denote the s th component of X i . Often in applied settings not all q regressors in X i are relevant. Without loss of generality, assume that the first q 1 (1 ≤ q 1 ≤ q) components of X are ‘relevant’ in the sense defined below. C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
K-nn non-parametric estimation of regression functions
399
Let X¯ consist of the first q1 relevant components of X let X˜ = X \ { X¯ } denote the remaining irrelevant components of X. Following Hall et al. (2007) we define X¯ to be relevant and X˜ to be irrelevant by asking that ( X¯ , Y )
is independent of
X˜ .
(2.9)
Clearly, (2.9) implies that E(Y | X ) = E(Y | X¯ ). A weaker condition would be to ask that, conditional on X¯ , the irrelevant variables X˜ and Y are independent. Hall et al. (2007) show that under (2.9), the least squares cross validation method can automatically remove irrelevant variables. However, simulations reported in Hall et al. (2007) reveal that the ‘removing irrelevant variable’ results also hold under weaker condition that E(Y | X ) = E(Y | X¯ ). ¯ X¯ i ) + u i where g(·) ¯ Hall et al. (2007) assume that the true regression model is given by Yi = g( ¯ is of unknown form, and E(u i | X i ) = 0. They assume that the relevant regressors are unknown ex ¯ using the superset of regressors X = ( X¯ , X˜ ). The non-parametric ante, hence one estimates g(·) kernel estimator of g(x) is given by
X js −xs q j=1 Y j s=1 w hs
X js −xs , n q w j=1 s=1 hs
n gˆ ker nel (x) =
(2.10)
where w(·) is the univariate kernel function and h s is the smoothing parameter associated with x s , s = 1, . . . , q. Hall et al. (2007) suggest to choose the smoothing parameters by the least squares cross validation (CV) method. They show that the CV selected smoothing parameters, say hˆ s , has the property that hˆ s ∼ n −1/(4+q1 ) for s = 1, . . . , q 1 and that hˆ s → ∞ for s = q 1 + 1, . . . , q. Note X −x that when h s = ∞, we have limh s →∞ w( jsh s s ) = w(0). And gˆ ker nel (x) becomes unrelated to x s . Therefore, all irrelevant variables are smoothed out asymptotically. ˆ The k-nn estimator g(x) defined in (2.4) does not possess the ability of removing irrelevant variables. This is because the k-nn distance R x is a scalar. This is similar to the case of using a scalar smoothing parameter in the kernel estimation case, i.e., h 1 = . . . = h q = h. In this case, if there exists at least one relevant regressor, then h → 0 as n → ∞, and no variables will be smoothed out. On the other hand, if h → ∞ as n → ∞, then all variables are smoothed out. A scalar smoothing parameter h cannot have the flexibility of removing some (irrelevant) variables while keeping the remaining (relevant) variables. For the same reason, the use of a scalar k-nn distance R x cannot smooth out irrelevant regressors. Therefore, we suggest that one uses a product weight function and uses a different distance for each different component x s in the non-parametric k-nn estimation. This is the topic of next section.
3. K-NN ESTIMATOR WITH PRODUCT WEIGHT FUNCTION In order to use a product weight function and allow for a vector of smoothing parameters (k 1 , . . . , k q ), we first need to introduce some notation. For s = 1, . . . , q define R s,x by de f
Rs,x ≡ Rn (xs ) = the ksth nearest Euclidean distance to xs among all the X js s for j = 1, . . . , n. C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
(3.1)
400
Rui Li and Guan Gong
Note that the range of k s is {1, 2, . . . , n} for all s = 1, . . . , q. Also, we define the k s − nn distance centered at X is by: de f
Rs,i ≡ Rn (X is ) = the ksth nearest Euclidean distance to X is among all the X js s for j = 1, . . . , n. Using a product weight function W ( by
X j −x de f ) = Rn,x
n j=1
ˆ g(x) = n
q s=1
(3.2) w(
X js −xs ), Rs,x
Y j W ((X j − x)/Rn,x )
j=1
W ((X j − x)/Rn,x )
.
we define the k-nn estimator
(3.3)
We say that the (irrelevant) regressor x s is completely smoothed out from the regression model ˆ if g(x) is unrelated to x s for all xs ∈ Ss , where Ss is the support for x s . The next lemma shows that a k-nn estimator with a product uniform weight function possess the property of removing irrelevant variables, while a k-nn estimator with any other non-uniform weight function does not share this property. A uniform weight function is defined by w(v) = (1/2) 1(|v| ≤ 1), i.e., w(v) = 1/2 if |v| ≤ 1; and 0 otherwise. LEMMA 3.1. Letting the k-nn estimator be defined using the product weight function given in (3.3), (i) if one uses a (product) uniform weight function, then x s will be smoothed out from the regression model if k s = n. (ii) if one uses a non-uniform weight function, then x s cannot be completely smoothed out for all values of k s = 1, . . . , n. The proof of Lemma 3.1 is given in the Appendix. Lemma 3.1 shows that the uniform weight function has the property of being able to smooth out irrelevant variables completely, while all other non-uniform weight functions such as the Gaussian or Epanechnikov weight functions do not possess this property. This is quite different from the kernel estimation result of Hall et al. (2007) who rule out the use of a uniform kernel in their analysis. In practice one does not know which variable is relevant and which variable is irrelevant. Therefore, some data-driven methods are needed to select the smoothing parameter k s optimally in practice. We suggest using the least squares cross validation method to select k 1 , . . . , k q , i.e., we select k 1 , . . . , k q by minimizing the following objective function: C V (k) =
n 1 [Yi − gˆ −i (X i )]2 , n i=1
(3.4)
X −X X −X where gˆ −i (X i ) = nj=i Y j W ( Rj n,i i )/ nj=i W ( Rj n,i i ) is the leave-one-out k-nn estimator of q g(X i ) with W ((X j − X i )/R n,i ) = s=1 w((X js − X is )/R s,i ) being the product weight function. When allowing for the use of different k s , the asymptotic analysis for cross-validation selection of smoothing parameters is quite complex. To the best of our knowledge, no formal asymptotic results are available in this case. Therefore, we will resort to simulation studies to examine the finite sample performance of the k-nn estimator based cross validation selected smoothing parameters (with a vector of smoothing parameters). According to the result of Lemma 3.1, it seems that one should always use a uniform (product) weight function in non-parametric k-nn estimation. However, removing irrelevant variables is only one part of the story. For relevant variables, a uniform weight function may not be the best choice. C 2008 The Author(s). Journal compilation C The Royal Economic Society 2008
K-nn non-parametric estimation of regression functions
401
Other weight functions such as Gaussian or Epanechnikov may be more suitable to use in finite sample applications because these weight functions give more weight to observations closer to x, rather than a constant weight as the uniform weight function does. Therefore, it would be desirable if one could modify the cross validation selection rule to smooth the irrelevant variables even with a non-uniform weight function. We show below that this is indeed possible. From Lemma 3.1 we know that when a non-uniform weight function is used, such as the Gaussian or Epanechnikov weight function, then even when k s = n (taking the upper extreme value) x s is not smoothed out completely. However, a simple modification can be made so that an irrelevant variable can be (completely) smoothed out even when one uses a non-uniform weight function. We suggest the following modifications in selecting k s . If the cross validation method selects k s = n, then one should add an extreme value for R s,i = ∞ for all i = 1, . . . , n. If the resulting CV(k) (with R s,i = ∞) has a smaller value than the case with k s = n, then we remove the regressor x s from the regression model. This is reasonable because lim Rn,i →∞ w((X js − Wis )/Rs,i ) = w(0). ˆ Hence, x s is completely removed from g(x) if R s,i = ∞. Therefore, we remove x s from the regression model if removing x s leads to a smaller cross validation function value. Up to now we have assumed that all the regressors are continuous variables. Let’s consider the case of mixed continuous and discrete regressors. Say X i = (X ic , Xid ), where X ic is a continuous variable vector of dimension q, and Xid is a discrete variable vector of dimension r. Then one can use the same discrete weight function as suggested by Hall et al. (2007) to deal with the discrete variable. Specifically, define l(Xisd , xsd , λs ) = 1 if Xisd = xsd ; and l(Xisd , xsd , λs ) = λs if Xisd = xsd . Then one can estimate g(x) by
X c −x c
Y j W Rj n,x L X dj , x d , λ ˆ g(x) =
X cj −x c d , n L X j , xd, λ j=1 W Rn,x n
j=1
(3.5)
where L(Xjd , x d , λ) = rs=1 l(Xjsd , xsd , λs ) is the discrete variable product weight function. The range of λs is [0, 1] for all s = 1, . . . , r . If λs = 1, xsd is deemed as an irrelevant variable since it is completely smoothed out from the regression model. The cross validation function is modified to: C V (k, λ) =
n 1 [Yi − gˆ −i (X i )]2 , n i=1
(3.6)
X c −X c X c −X c where gˆ −i (X i ) = nj=1 Y j W ( Rj n,i i )L(X dj , X id , λ)/ nj=1 W ( Rj n,i i )L(X dj , X id , λ) is the leaveone-out estimator of g(X i ). In practice one selects (k 1 , . . . , k q , λ1 , . . . , λr ) by minimizing the objective function CV(k, λ) defined in (3.6). We examine the finite sample performance of the k-nn estimator in the next section.
We examine the finite sample performance of the k-nn estimator in the next section.

4. MONTE CARLO SIMULATIONS

In this section we report simulation results on the finite sample performance of the k-nn estimator with a product weight function and with least squares cross-validation selecting the smoothing parameters. For comparison purposes we also report results for the kernel estimator and the non-parametric series estimator. For the series estimation method we use the B-spline and the power series.
We consider series estimators in which $g(\cdot)$ is estimated using the series approximating functions
$$\hat{g}(x) = \sum_{j=1}^{K} \beta_j p_j^K(x), \qquad (4.1)$$
where the $p_j^K(x)$'s are called the base functions, which can be power series, B-splines or some other type of series-based function (e.g. wavelets). In our simulations we consider both the power spline and the cardinal B-spline. We consider orders of $r = 2, 3$ and $4$ for the B-spline base functions; the results are not sensitive to the order of the base functions. However, the estimator is sensitive to the order of the polynomial in the power spline series and to the number of knots in the B-spline series. We will call the order of the polynomial in the power series and the number of knots in the B-spline the number of terms in non-parametric series estimation. Following Li (1987) we consider three popular criteria for choosing the optimal number of terms in the power spline and B-spline: (a) Mallows' $C_L$ method (or $C_p$; Mallows, 1973); (b) the generalized cross-validation (GCV) proposed by Craven and Wahba (1979); (c) the leave-one-out cross-validation (Stone, 1974). See Li (1987) or Li and Racine (2007) for a detailed description of these procedures. We use the in-sample mean squared error to measure the performance of an estimator:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big[g(x_i) - \hat{g}(x_i)\big]^2. \qquad (4.2)$$
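As an illustration of how the number of terms might be selected, the sketch below fits the series estimator (4.1) with a simple power-series basis and picks $K$ by GCV, using the standard formula $GCV(K) = (RSS/n)/(1 - K/n)^2$ for a least-squares projection on $K$ terms. The basis choice and helper names are ours, not necessarily the implementation behind the tables.

```python
import numpy as np

def power_basis(x, K):
    """p_j^K(x) = x^{j-1}, j = 1..K (power series in one regressor)."""
    return np.column_stack([x ** j for j in range(K)])

def series_fit(y, x, K):
    """Least-squares series fit with K terms; returns fitted values."""
    P = power_basis(x, K)
    beta, *_ = np.linalg.lstsq(P, y, rcond=None)
    return P @ beta

def gcv_select(y, x, K_max=10):
    """Choose K minimizing the generalized cross-validation criterion."""
    n = len(y)
    def gcv(K):
        rss = np.sum((y - series_fit(y, x, K)) ** 2)
        return (rss / n) / (1.0 - K / n) ** 2
    return min(range(1, K_max + 1), key=gcv)
```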
With 1,000 replications we obtain 1,000 sample MSEs as given by (4.2). We then average these 1,000 MSEs, i.e., we compute the average MSE (AMSE), and report the AMSE along with the median of the 1,000 MSEs.

4.1. DGPs

We consider several data generating processes (DGPs) in the Monte Carlo simulations, all of which are nested in the general DGP
$$y_i = \alpha_0 + \alpha_1 x_{i1} + \alpha_2 x_{i1}^2 + \alpha_3 \sin(4\pi x_{i1}) + \beta_1 x_{i2} + \beta_2 x_{i2}^2 + \beta_3 \sin(4\pi x_{i2}) + \gamma_1 x_{i3} + \gamma_2 \sin(4\pi x_{i3}) + \delta_1 x_{i1} x_{i2} + \delta_2 \sin(x_{i1} x_{i2}) + \delta_3 z_i + u_i, \qquad (4.3)$$
where $x_{is} \sim \text{uniform}(0, 1)$ for $s = 1, 2, 3$, $u_i \sim N(0, 1)$, and $z_i$ is a binary variable taking values in $\{0, 1\}$ with $P(z_i = 0) = P(z_i = 1) = 1/2$. Depending on the values of the parameters, (4.3) encompasses a linear model, a partially linear model and a non-linear model. We vary the parameter values as well as the number of observations and the number of replications in our experiments. The results are not sensitive to the number of replications: with more than 200 replications the results are stable. All results reported in this paper are based on 1,000 replications, with the number of observations $n = 100$ and $200$. The MSE in equation (4.2) is used to measure the performance of the different estimators, and we report the mean and median of the 1,000 MSEs in the tables. We estimate both a univariate and a bivariate non-parametric regression function. The true DGP, however, can differ from the estimated model.
Table 1. In-sample AMSE of univariate estimators.

                        B Spline                Power Spline
                  CL     GCV    LSCV      CL     GCV    LSCV   Kernel    k-nn      k-nn
                                                                       Uniform  Gaussian
n = 100, model correctly specified (α3 = 2)
  Mean          0.112  0.112  0.112    0.477  0.478  0.478    0.127    0.136    0.137
  Median        0.108  0.110  0.105    0.474  0.474  0.478    0.117    0.136    0.128
n = 100, model over-specified (all parameters are zero)
  Mean          0.045  0.043  0.038    0.038  0.036  0.037    0.018    0.029    0.029
  Median        0.039  0.038  0.034    0.025  0.023  0.023    0.008    0.018    0.016
n = 200, model correctly specified (α3 = 2)
  Mean          0.089  0.091  0.092    0.181  0.182  0.182    0.827    0.073    0.078
  Median        0.087  0.090  0.091    0.177  0.176  0.177    0.069    0.068    0.086
n = 200, model over-specified (all parameters are zero)
  Mean          0.020  0.019  0.019    0.023  0.024  0.012    0.010    0.017    0.029
  Median        0.015  0.017  0.017    0.014  0.013  0.012    0.011    0.010    0.009
For example, if the true regression is univariate and we estimate a bivariate regression model, we say that we estimate an over-specified model. Let $\hat{y}_i$ be the non-parametric fitted value of $y_i$; we compute the goodness-of-fit $R^2$ as $R^2 = 1 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / \sum_{i=1}^{n} (y_i - \bar{y})^2$, where $\bar{y} = n^{-1} \sum_{i=1}^{n} y_i$. The $R^2$ values for the various DGPs considered in this paper range from 0.46 to 0.72.
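The Monte Carlo design can be summarized in a few lines. The sketch below draws data from the univariate special case of (4.3) with α3 = 2, computes the in-sample MSE of (4.2) for a generic estimator `fit` (a placeholder for any routine returning in-sample fitted values, such as the earlier sketches), and averages over replications to produce the AMSE and median MSE reported in the tables. It is illustrative, not the authors' simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def dgp_univariate(n, alpha3=2.0):
    """y_i = alpha_3 sin(4 pi x_i1) + u_i, the alpha_3 = 2 design of (4.3)."""
    x = rng.uniform(0.0, 1.0, n)
    g = alpha3 * np.sin(4.0 * np.pi * x)
    return x, g, g + rng.normal(0.0, 1.0, n)

def amse(fit, n=100, reps=1000):
    """Mean and median over replications of MSE = n^{-1} sum_i (g - ghat)^2."""
    mses = []
    for _ in range(reps):
        x, g, y = dgp_univariate(n)
        ghat = fit(y, x)                 # fitted values at the sample x_i
        mses.append(np.mean((g - ghat) ** 2))
    return np.mean(mses), np.median(mses)
```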
4.2. Estimation of a univariate non-parametric regression model

In this section we consider estimating a univariate regression model. The true model, however, can be univariate, a white noise, or a bivariate regression model.

4.2.1. The true model is univariate (α3 = 2). We first consider the case where α3 = 2 and all other coefficients equal zero, so the true regression function depends only on $x_{i1}$, and we estimate the univariate regression model $E(y_i \mid x_{i1})$ non-parametrically. In this case the model is correctly specified, and the performance of the various methods is close, as we can see from Table 1. The B-spline estimator performs best in most cases, closely followed by the kernel and the k-nn estimators. For other DGPs, such as an exponential function, we do find that the power spline sometimes performs better than the B-spline; it appears that no method dominates the others. When the sample size n is doubled from 100 to 200, the AMSE of all estimators falls by 20% to 50%, reflecting the consistency of all the estimators.
4.2.2. The true model is a white noise process (all parameters are zero). When all parameters are set to zero we have $y_i = u_i$, a white noise process, but we still estimate $E(y_i \mid x_{i1})$ non-parametrically. We thus estimate an over-specified regression model: the DGP is white noise but we apply the univariate estimation method to estimate the conditional mean function. In this case the k-nn (regardless of its weight function) and the kernel estimators perform much better than the B-spline and the power spline, as we can see from Table 1. The superior performance of the k-nn and kernel estimators over the series estimators is due to the fact that both the k-nn and kernel methods can smooth out irrelevant regressors, while the series estimators do not possess this property. Also, the k-nn estimator appears insensitive to whether its weight function is uniform or Gaussian.

4.3. Estimation of a bivariate additive model

In this section we estimate an additive model of the form $y_i = g_1(x_{i1}) + g_2(x_{i2}) + u_i$. The true regression model, however, may be univariate, additive, or bivariate non-additive (with interaction terms).

4.3.1. The true model is bivariate additive (α3 = −1, β3 = 1). We choose α3 = −1 and β3 = 1 and set all other parameters to zero, so that the true regression model is additive in $x_{i1}$ and $x_{i2}$. When the estimated model is correctly specified, the performance of the various methods is close, as we can see from Table 2. In general, the kernel and k-nn estimators perform slightly better than the series-based estimators.

4.3.2. The true model is univariate (α3 = 1). We choose α3 = 1 and set all other parameters to zero. In this case we estimate an over-specified model: the DGP is univariate but we estimate a bivariate additive model. As can be seen from Table 2, the kernel and k-nn estimators perform much better than the B-spline and power spline estimators. The reason is that the kernel and k-nn estimators can automatically remove the irrelevant regressor ($x_{i2}$) while the series estimators cannot.

4.4. Estimation of a bivariate fully non-parametric model

In this section we estimate a bivariate non-parametric regression model $y_i = g(x_{i1}, x_{i2}) + u_i$. The true regression model, however, can be a univariate, a bivariate, or a trivariate regression model.

4.4.1. The true model is bivariate with interactions (α3 = −1, β3 = 1, δ1 = 0.5). We choose α3 = −1, β3 = 1 and δ1 = 0.5 and set all other parameters to zero, so the estimated model is correctly specified. From Table 3 we see that the k-nn and kernel methods perform better than the spline methods.

4.4.2. The true model is univariate (α3 = 2). With α3 = 2 the only non-zero parameter, the estimated model is over-specified: the DGP is univariate but we estimate a bivariate fully non-parametric model.
Table 2. In-sample AMSE of bivariate additive estimators.

                        B Spline                Power Spline
                  CL     GCV    LSCV      CL     GCV    LSCV   Kernel    k-nn      k-nn
                                                                       Uniform  Gaussian
n = 100, model correctly specified (α3 = −1, β3 = 1)
  Mean          0.327  0.329  0.331    0.344  0.341  0.342    0.199    0.256    0.236
  Median        0.317  0.318  0.326    0.337  0.337  0.334    0.200    0.254    0.228
n = 100, model over-specified (α3 = 2)
  Mean          0.181  0.180  0.181    0.125  0.128  0.128    0.124    0.096    0.102
  Median        0.172  0.172  0.172    0.123  0.142  0.148    0.149    0.081    0.117
n = 200, model correctly specified (α3 = −1, β3 = 1)
  Mean          0.134  0.135  0.134    0.118  0.134  0.141    0.121    0.116    0.117
  Median        0.133  0.133  0.132    0.117  0.147  0.148    0.120    0.119    0.112
n = 200, model over-specified (α3 = 2)
  Mean          0.251  0.250  0.251    0.258  0.258  0.259    0.149    0.211    0.190
  Median        0.243  0.243  0.241    0.251  0.248  0.253    0.138    0.199    0.173
The kernel estimator performs best, followed by the k-nn (with Gaussian and uniform weight functions), the B-spline and the power spline, as can be seen from Table 3.
4.5. The mixed discrete and continuous regressors case

In this section we consider estimating a non-parametric regression model with mixed discrete and continuous regressors. The model we estimate is $y_i = g(x_{i1}, z_i) + u_i$. This may be the true model, or it may be an over-specified model, i.e., the true model might be the univariate regression model $y_i = g(x_{i1}) + u_i$.

4.5.1. The true model is a bivariate mixed-regressor model (α3 = 2, δ3 = 1.5). In this case the estimated model is correctly specified. Table 4 reports the results for n = 100 and 200. The performances of the B-spline, the k-nn and the kernel estimators are all similar to each other, and all of them perform much better than the power series estimators.

4.5.2. The true model is univariate (α3 = 2). In this case the true model is univariate but we estimate a bivariate regression model with one continuous regressor (the relevant variable $x_{i1}$) and one discrete regressor (the irrelevant variable $z_i$); the estimated model is thus over-specified. The best performers are the k-nn and kernel estimators, followed by the B-spline and, finally, the power spline estimator.
Table 3. In-sample AMSE of bivariate fully non-parametric estimators.

                        B Spline                Power Spline
                  CL     GCV    LSCV      CL     GCV    LSCV   Kernel    k-nn      k-nn
                                                                       Uniform  Gaussian
n = 100, model correctly specified (α3 = −1, β3 = 1, δ1 = 0.5)
  Mean          0.337  0.337  0.338    0.335  0.335  0.338    0.262    0.259    0.236
  Median        0.330  0.333  0.336    0.328  0.329  0.331    0.247    0.249    0.227
n = 100, model over-specified (α3 = 2)
  Mean          1.356  1.356  1.356    0.275  0.275  0.295    0.137    0.213    0.192
  Median        1.373  1.373  1.373    0.192  0.192  0.192    0.124    0.207    0.179
n = 200, model correctly specified (α3 = −1, β3 = 1, δ1 = 0.5)
  Mean          0.134  0.135  0.134    0.118  0.134  0.141    0.121    0.116    0.117
  Median        0.133  0.133  0.132    0.117  0.147  0.148    0.120    0.119    0.112
n = 200, model over-specified (α3 = 2)
  Mean          1.113  1.112  1.113    0.125  0.077  0.079    0.082    0.127    0.115
  Median        1.115  1.113  1.117    0.076  0.074  0.078    0.089    0.087    0.081
Again, the reason that the series-based estimators are inferior to both the k-nn and the kernel estimators is that the series-based estimators cannot smooth out irrelevant discrete variables, while the k-nn and kernel estimators can remove irrelevant discrete and continuous variables.
5. CONCLUSIONS

One conclusion from our theoretical analysis and simulations is that both the k-nn and the kernel methods can automatically remove irrelevant variables, and both perform well relative to the series method when the model is over-specified. Furthermore, our simulation results show that the k-nn method is not sensitive to the weight function used. A referee suggested to us that the approach of Abramson (1984) might be useful in reducing the estimation bias in k-nn regression function estimation. Abramson considered the non-parametric density estimation problem and suggested transformation methods under which the density function of the transformed data is smoother than that of the original data, so that the estimation bias can be reduced. It is not clear to us whether Abramson's approach can be generalized to the regression model case, because the unknown regression function may still possess the same degree of smoothness. We leave this as a future research topic. A more important future research topic is to develop new series-based non-parametric estimation methods that can automatically remove irrelevant variables.
Table 4. In-sample AMSE of non-parametric estimators with mixed regressors.

                        B Spline                Power Spline
                  CL     GCV    LSCV      CL     GCV    LSCV   Kernel    k-nn      k-nn
                                                                       Uniform  Gaussian
n = 100, model correctly specified (α3 = 2, δ3 = 1.5)
  Mean          0.235  0.234  0.234    0.416  0.432  0.428    0.250    0.248    0.246
  Median        0.225  0.224  0.224    0.358  0.365  0.349    0.241    0.241    0.242
n = 100, model over-specified (α3 = 2)
  Mean          0.223  0.223  0.224    0.409  0.428  0.420    0.177    0.183    0.191
  Median        0.219  0.220  0.219    0.349  0.362  0.346    0.165    0.166    0.189
n = 200, model correctly specified (α3 = 2, δ3 = 1.5)
  Mean          0.132  0.133  0.133    0.347  0.348  0.345    0.124    0.129    0.146
  Median        0.149  0.151  0.151    0.241  0.232  0.239    0.118    0.121    0.143
n = 200, model over-specified (α3 = 2)
  Mean          0.129  0.130  0.130    0.335  0.337  0.329    0.102    0.109    0.097
  Median        0.143  0.149  0.148    0.235  0.229  0.234    0.091    0.098    0.089
ACKNOWLEDGEMENTS The authors are grateful to two referees and a co-editor for their comments that helped to improve the paper. Rui Li’s research is supported by the National Natural Science Foundation of China (project serial number: 70573057). Guan Gong received financial support from the Pujang Project sponsored by the City of Shanghai.
REFERENCES

Abramson, I. S. (1984). Adaptive density flattening: A metric distortion principle for combating bias in nearest neighbour methods. Annals of Statistics 12, 880–86.
Craven, P. and G. Wahba (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31, 377–403.
Hall, P., Q. Li and J. Racine (2007). Nonparametric estimation of regression functions in the presence of irrelevant variables. Review of Economics and Statistics 89, 784–89.
Li, K. C. (1987). Asymptotic optimality for Cp, CL, cross-validation, and generalized cross-validation: Discrete index set. Annals of Statistics 15, 958–75.
Li, Q. and J. S. Racine (2007). Nonparametric Econometrics: Theory and Practice. Princeton, NJ: Princeton University Press.
Li, Q., J. Racine and J. Wooldridge (2007). Efficient estimation of average treatment effects with mixed categorical and continuous data. Journal of Business and Economic Statistics, in press.
Mallows, C. L. (1973). Some comments on Cp. Technometrics 15, 661–75.
Ouyang, D., D. Li and Q. Li (2006). Cross-validation and nonparametric K nearest neighbor estimation. Econometrics Journal 9, 448–71.
Stone, C. J. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B 36, 111–47.
APPENDIX: PROOF

Proof of Lemma 3.1. (i) In order to smooth $x_s$ out of $\hat{g}(x)$, one must have $w\big((X_{js} - x_s)/R_{n,s}\big) = c$ for all values of $X_{js}, x_s \in S_s$. Since $S_s$ is a compact set, we can assume without loss of generality that $S_s = [0, 1]$ (the unit interval). If $k_s = n$, then $|(X_{js} - x_s)/R_{n,s}| \leq 1$ for all $j = 1, \ldots, n$. Thus $w\big((X_{js} - x_s)/R_{n,s}\big) = 1/2$, a constant, for all values of $X_{js}$ and $x_s$, and therefore $\hat{g}(x)$ is unrelated to $x_s$.
(ii) If $w(\cdot)$ is not a uniform weight function, then for $k_s = n$ we know that $R_{n,s}$ is finite. Hence $0 \leq |(X_{js} - x_s)/R_{n,s}| \leq 1$, and $w\big((X_{js} - x_s)/R_{n,s}\big)$ is not a constant function because $|(X_{js} - x_s)/R_{n,s}|$ is not constant and $w(\cdot)$ is not a uniform weight function. Hence $\hat{g}(x)$ must depend on $x_s$, i.e., $x_s$ cannot be completely smoothed out for $k_s = n$. Similarly, one can show that for all $k_s \in \{2, \ldots, n - 1\}$, $x_s$ cannot be completely smoothed out of $\hat{g}(x)$.
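A quick numerical illustration of Lemma 3.1 (ours, not part of the paper): with a uniform weight and $k_s = n$, the component weights equal the constant $w(0) = 1/2$ for every observation, so an irrelevant $x_s$ drops out of the product weight, whereas a Gaussian weight leaves them non-constant.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
xs = rng.uniform(0.0, 1.0, n)            # regressor s, support [0, 1]
R = np.sort(np.abs(xs - xs[0]))[-1]      # k_s = n: radius = maximum distance

uniform = 0.5 * (np.abs((xs - xs[0]) / R) <= 1)
gauss = np.exp(-0.5 * ((xs - xs[0]) / R) ** 2)

print(np.ptp(uniform))    # 0.0: constant weights, x_s is smoothed out
print(np.ptp(gauss) > 0)  # True: non-constant weights, x_s still matters
```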