Springer Series in Statistics
Jeffrey D. Hart

Nonparametric Smoothing and Lack-of-Fit Tests

Springer
Springer Series in Statistics Advisors: P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg, I. Olkin, N. Wermuth, S. Zeger
For Michelle and Kayley
Jeffrey D. Hart Department of Statistics Texas A&M University College Station, TX 77843-3143 USA
Library of Congress Cataloging-in-Publication Data
Hart, Jeffrey D.
Nonparametric smoothing and lack-of-fit tests / Jeffrey D. Hart
p. cm. - (Springer series in statistics)
Includes bibliographical references and indexes.
ISBN 0-387-94980-1 (hardcover: alk. paper)
1. Smoothing (Statistics) 2. Nonparametric statistics. 3. Goodness-of-fit tests. I. Title. II. Series
QA278.H357 1997 519.5-dc21 97-10931

Printed on acid-free paper.

© 1997 Springer-Verlag New York, Inc. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.
Production managed by Steven Pisano; manufacturing supervised by Joe Quatela. Photocomposed pages prepared from the author's LaTeX files. Printed and bound by Maple-Vail Book Manufacturing Group, York, PA. Printed in the United States of America.
98765432 1 ISBN 0-387-94980-1 Springer-Verlag New York Berlin Heidelberg SPIN 10568296
Preface
The primary aim of this book is to explore the use of nonparametric regression (i.e., smoothing) methodology in testing the fit of parametric regression models. It is anticipated that the book will be of interest to an audience of graduate students, researchers and practitioners who study or use smoothing methodology. Chapters 2-4 serve as a general introduction to smoothing in the case of a single design variable. The emphasis in these chapters is on estimation of regression curves, with hardly any mention of the lack-of-fit problem. As such, Chapters 2-4 could be used as the foundation of a graduate level statistics course on nonparametric regression.
The purpose of Chapter 2 is to convey some important basic principles of smoothing in a nontechnical way. It should be of interest to practitioners who are new to smoothing and want to learn some fundamentals without having to sift through a lot of mathematics. Chapter 3 deals with statistical properties of smoothers and is somewhat more theoretical than Chapter 2. Chapter 4 describes the principal methods of smoothing parameter selection and investigates their large-sample properties.
The remainder of the book explores the problem of testing the fit of probability models. The emphasis is on testing the fit of parametric regression models, but other types of models are also considered (in Chapter 9). Chapter 5 is a review of classical lack-of-fit tests, including likelihood ratio tests, the reduction method from linear models, and some nonparametric tests. The subject of Chapter 6 is the earliest work on using linear smoothers to test the fit of models. These tests assume that a statistic's smoothing parameter is nonstochastic, which entails a certain degree of arbitrariness in performing a test. The real heart of this book is Chapters 7 through 10, in which lack-of-fit tests based on data-driven smoothing parameters are studied. It is my opinion that such tests will have the greatest appeal to both practitioners and researchers. Chapters 7 and 8 are a careful treatment of distributional properties of various "data-driven" test statistics. Chapter 9 shows that many of the ideas learned in Chapters 7 and 8 have immediate applications in more general settings, including multiple regression, spectral analysis and
testing the goodness of fit of a probability distribution. Applications are illustrated in Chapter 10 by means of several real-data examples. There are a number of people who in various ways have had an influence on this book (many of whom would probably just as soon not take any credit). I'd like to thank Scott Berry, Jim Calvin, Chien-Feng Chen, Ray Chen, Cherng-Luen Lee, Geung-Hee Lee, Fred Lombard, Manny Parzen, Seongbaek Yi and two anonymous reviewers for reading portions of the book and making valuable comments, criticisms and suggestions. I also want to thank Andy Liaw for sharing his expertise in graphics and the finer points of TEX. To the extent that there are any new ideas in this book, I have to share much of the credit with the many colleagues, smoothers and nonsmoothers alike, who have taught me so much over the years. In particular, I want to express my gratitude to Randy Eubank, Buddy Gray, and Bill Schucany, whose ideas, encouragement and friendship have profoundly affected my career. Finally, my biggest thanks go to my wife Michelle and my daughter Kayley. Without your love and understanding, finishing this book would have been impossible. Jeffrey D. Hart
Contents

Preface

1. Introduction

2. Some Basic Ideas of Smoothing
   2.1. Introduction
   2.2. Local Averaging
   2.3. Kernel Smoothing
        2.3.1 Fundamentals
        2.3.2 Variable Bandwidths
        2.3.3 Transformations of x
   2.4. Fourier Series Estimators
   2.5. Dealing with Edge Effects
        2.5.1 Kernel Smoothers
        2.5.2 Fourier Series Estimators
   2.6. Other Smoothing Methods
        2.6.1 The Duality of Approximation and Estimation
        2.6.2 Local Polynomials
        2.6.3 Smoothing Splines
        2.6.4 Rational Functions
        2.6.5 Wavelets

3. Statistical Properties of Smoothers
   3.1. Introduction
   3.2. Mean Squared Error of Gasser-Muller Estimators
        3.2.1 Mean Squared Error at an Interior Point
        3.2.2 Mean Squared Error in the Boundary Region
        3.2.3 Mean Integrated Squared Error
        3.2.4 Higher Order Kernels
        3.2.5 Variable Bandwidth Estimators
        3.2.6 Estimating Derivatives
   3.3. MISE of Trigonometric Series Estimators
        3.3.1 The Simple Truncated Series Estimator
        3.3.2 Smoothness Adaptability of Simple Series Estimators
        3.3.3 The Rogosinski Series Estimator
   3.4. Asymptotic Distribution Theory
   3.5. Large-Sample Confidence Intervals

4. Data-Driven Choice of Smoothing Parameters
   4.1. Introduction
   4.2. Description of Methods
        4.2.1 Cross-Validation
        4.2.2 Risk Estimation
        4.2.3 Plug-in Rules
        4.2.4 The Hall-Johnstone Efficient Method
        4.2.5 One-Sided Cross-Validation
        4.2.6 A Data Analysis
   4.3. Theoretical Properties of Data-Driven Smoothers
        4.3.1 Asymptotics for Cross-Validation, Plug-In and Hall-Johnstone Methods
        4.3.2 One-Sided Cross-Validation
        4.3.3 Fourier Series Estimators
   4.4. A Simulation Study
   4.5. Discussion

5. Classical Lack-of-Fit Tests
   5.1. Introduction
   5.2. Likelihood Ratio Tests
        5.2.1 The General Case
        5.2.2 Gaussian Errors
   5.3. Pure Experimental Error and Lack of Fit
   5.4. Testing the Fit of Linear Models
        5.4.1 The Reduction Method
        5.4.2 Unspecified Alternatives
        5.4.3 Non-Gaussian Errors
   5.5. Nonparametric Lack-of-Fit Tests
        5.5.1 The von Neumann Test
        5.5.2 A Cusum Test
        5.5.3 Von Neumann and Cusum Tests as Weighted Sums of Squared Fourier Coefficients
        5.5.4 Large Sample Power
   5.6. Neyman Smooth Tests

6. Lack-of-Fit Tests Based on Linear Smoothers
   6.1. Introduction
   6.2. Two Basic Approaches
        6.2.1 Smoothing Residuals
        6.2.2 Comparing Parametric and Nonparametric Models
        6.2.3 A Case for Smoothing Residuals
   6.3. Testing the Fit of a Linear Model
        6.3.1 Ratios of Quadratic Forms
        6.3.2 Orthogonal Series
        6.3.3 Asymptotic Distribution Theory
   6.4. The Effect of Smoothing Parameter
        6.4.1 Power
        6.4.2 The Significance Trace
   6.5. Historical and Bibliographical Notes

7. Testing for Association via Automated Order Selection
   7.1. Introduction
   7.2. Distributional Properties of Sample Fourier Coefficients
   7.3. The Order Selection Test
   7.4. Equivalent Forms of the Order Selection Test
        7.4.1 A Continuous-Valued Test Statistic
        7.4.2 A Graphical Test
   7.5. Small-Sample Null Distribution of T_n
        7.5.1 Gaussian Errors with Known Variance
        7.5.2 Gaussian Errors with Unknown Variance
        7.5.3 Non-Gaussian Errors and the Bootstrap
        7.5.4 A Distribution-Free Test
   7.6. Variations on the Order Selection Theme
        7.6.1 Data-Driven Neyman Smooth Tests
        7.6.2 F-Ratio with Random Degrees of Freedom
        7.6.3 Maximum Value of Estimated Risk
        7.6.4 Test Based on Rogosinski Series Estimate
        7.6.5 A Bayes Test
   7.7. Power Properties
        7.7.1 Consistency
        7.7.2 Power of Order Selection, Neyman Smooth and Cusum Tests
        7.7.3 Local Alternatives
        7.7.4 A Best Test?
   7.8. Choosing an Orthogonal Basis
   7.9. Historical and Bibliographical Notes

8. Data-Driven Lack-of-Fit Tests for General Parametric Models
   8.1. Introduction
   8.2. Testing the Fit of Linear Models
        8.2.1 Basis Functions Orthogonal to Linear Model
        8.2.2 Basis Functions Not Orthogonal to Linear Model
        8.2.3 Special Tests for Checking the Fit of a Polynomial
   8.3. Testing the Fit of a Nonlinear Model
        8.3.1 Large-Sample Distribution of Test Statistics
        8.3.2 A Bootstrap Test
   8.4. Power Properties
        8.4.1 Consistency
        8.4.2 Comparison of Power for Two Types of Tests

9. Extending the Scope of Application
   9.1. Introduction
   9.2. Random x's
   9.3. Multiple Regression
   9.4. Testing for Additivity
   9.5. Testing Homoscedasticity
   9.6. Comparing Curves
   9.7. Goodness of Fit
   9.8. Tests for White Noise
   9.9. Time Series Trend Detection

10. Some Examples
   10.1. Introduction
   10.2. Babinet Data
        10.2.1 Testing for Linearity
        10.2.2 Model Selection
        10.2.3 Residual Analysis
   10.3. Comparing Spectra
   10.4. Testing for Association Among Several Pairs of Variables
   10.5. Testing for Additivity

Appendix
   A.1. Error in Approximation of F_OS(t)
   A.2. Bounds for the Distribution of T_cusum

References

Index
1 Introduction
The estimation of functions is a pervasive statistical problem in scientific endeavors. This book provides an introduction to some nonparametric methods of function estimation, and shows how they can be used to test the adequacy of parametric function estimates. The settings in which function estimation has been studied are many, and include probability density estimation, time series spectrum estimation, and estimation of regression functions. The present treatment will deal primarily with regression, but many of the ideas and methods to be discussed have applications in other areas as well.
The basic purpose of a regression analysis is to study how a variable Y responds to changes in another variable X. The relationship between X and Y may be expressed as

(1.1)   Y = r(X) + ε,

where r is a mathematical function, called the regression function, and ε is an error term that allows for deviations from a purely deterministic relationship. A researcher is often able to collect data (X_1, Y_1), ..., (X_n, Y_n) that contain information about the function r. From these data, one may compute various guesses, or estimates, of r. If little is known about the nature of r, a nonparametric estimation approach is desirable. Nonparametric methods impose a minimum of structure on the regression function. This is paraphrased in the now banal statement that "nonparametric methods let the data speak for themselves." In order for nonparametric methods to yield reasonable estimates of r, it is only necessary that r possess some degree of smoothness. Typically, continuity of r is enough to ensure that an appropriate estimator converges to the truth as the amount of data increases without bound. Additional smoothness, such as the existence of derivatives, allows more efficient estimation.
In contrast to nonparametric methods are the parametric ones that have dominated much of classical statistics. Suppose the variable X is known to lie in the interval [0, 1]. A simple example of a parametric model for r in
(1.1) is the straight line

r(x) = θ_0 + θ_1 x,   0 ≤ x ≤ 1,

where θ_0 and θ_1 are unknown constants. More generally, one might assume that r has the linear structure

r(x) = Σ_{i=0}^{p} θ_i r_i(x),   0 ≤ x ≤ 1,

where r_0, ..., r_p are known functions and θ_0, ..., θ_p are unknown constants. Parametric models are attractive for a number of reasons. First of all, the parameters of a model often have important interpretations to a subject matter specialist. Indeed, in the regression context the parameters may be of more interest than the function values themselves. Another attractive aspect of parametric models is their statistical simplicity; estimation of the entire regression function boils down to inferring a few parameter values. Also, if our assumption of a parametric model is justified, the regression function can be estimated more efficiently than it can be by a nonparametric method.
If the assumed parametric model is incorrect, the result can be misleading inferences about the regression function. Thus, it is important to have methods for checking how well a parametric model fits the observed data. The ultimate aim of this book is to show that various nonparametric, or smoothing, methods provide a very useful means of diagnosing lack of fit of parametric models. It is by now widely acknowledged that smoothing is an extremely useful means of estimating functions; we intend to show that smoothing is also valuable in testing problems.
The next chapter is intended to be an expository introduction to some of the basic methods of nonparametric regression. The methods given our greatest attention are the so-called kernel method and Fourier series. Kernel methods are perhaps the most fundamental means of smoothing data and thus provide a natural starting point for the study of nonparametric function estimation. Our reason for focusing on Fourier series is that they are a central part of some simple and effective testing methodology that is treated later in the book. Other useful methods of nonparametric regression, including splines and local polynomials, are discussed briefly in Chapter 2 but receive less attention in the remainder of the book than do Fourier series.
Chapter 3 studies some of the statistical properties of kernel and Fourier series estimators. This chapter is much more theoretical than Chapter 2 and is not altogether necessary for appreciating subsequent chapters that deal with testing problems. Chapter 4 deals with the important practical problem of choosing an estimator's smoothing parameter. An introduction to several methods of data-driven smoothing and an account of their theoretical properties are given. The lack-of-fit tests focused upon in Chapters 7-10 are based on data-driven choice of smoothing parameters. Hence, although
not crucial for an understanding of later material, Chapter 4 provides the reader with more understanding of how subsequent testing methodology is connected with smoothing ideas.
Chapter 5 introduces the lack-of-fit problem by reviewing some classical testing procedures. The procedures considered include likelihood ratio tests, the reduction method and von Neumann's test. Chapter 6 considers more recently proposed lack-of-fit tests based on nonparametric, linear smoothers. Such tests use fixed smoothing parameters and are thus inherently different from tests based on data-driven smoothing parameters. Chapter 7 introduces the latter tests in the simple case of testing the "no-effect" hypothesis, i.e., the hypothesis that the function r is identical to a constant. This chapter deals almost exclusively with trigonometric series methods. Chapters 8 and 9 show that the type of tests introduced in Chapter 7 can be applied in a much wider range of settings than the simple no-effect problem, whereas Chapter 10 provides illustrations of these tests on some actual sets of data.
2 Some Basic Ideas of Smoothing
2.1 Introduction

In its broadest sense, smoothing is the very essence of statistics. To smooth is to sand away the rough edges from a set of data. More precisely, the aim of smoothing is to remove data variability that has no assignable cause and to thereby make systematic features of the data more apparent. In recent years the term smoothing has taken on a somewhat more specialized meaning in the statistical literature. Smoothing has become synonymous with a variety of nonparametric methods used in the estimation of functions, and it is in this sense that we shall use the term. Of course, a primary aim of smoothing in this latter sense is still to reveal interesting data features. Some major accounts of smoothing methods in various contexts may be found in Priestley (1981), Devroye and Györfi (1985), Silverman (1986), Eubank (1988), Härdle (1990), Wahba (1990), Scott (1992), Tarter and Lock (1993), Green and Silverman (1994), Wand and Jones (1995) and Fan and Gijbels (1996).
Throughout this chapter we shall make use of a canonical regression model. The scenario of interest is as follows: a data analyst wishes to study how a variable Y responds to changes in a design variable x. Data Y_1, ..., Y_n are observed at the fixed design points x_1, ..., x_n, respectively. (For convenience we suppose that 0 < x_1 < x_2 < ··· < x_n < 1.) The data are assumed to follow the model

(2.1)   Y_j = r(x_j) + ε_j,   j = 1, ..., n,

where r is a function defined on [0, 1] and ε_1, ..., ε_n are unobserved random variables representing error terms. Initially we assume that the error terms are uncorrelated and that E(ε_i) = 0 and Var(ε_i) = σ², i = 1, ..., n. The data analyst's ultimate goal is to infer the regression function r at each x in [0, 1].
The purpose of this chapter is twofold. First, we wish to introduce a variety of nonparametric smoothing methods for estimating regression functions, and secondly, we want to point out some of the basic issues that arise
when applying such methods. We begin by considering the fundamental notion of local averaging.
2.2 Local Averaging

Perhaps the simplest and most obvious nonparametric method of estimating the regression function is to use the idea of local averaging. Suppose we wish to estimate the function value r(x) for some x ∈ [0, 1]. If r is indeed continuous, then function values at x_i's near x should be fairly close to r(x). This suggests that averaging Y_i's corresponding to x_i's near x will yield an approximately unbiased estimator of r(x). Averaging has the beneficial effect of reducing the variability arising from the error terms.
Local averaging is illustrated in Figures 2.1 and 2.2. The fifty data points in Figure 2.1 were simulated from the model
Y_j = r(x_j) + ε_j,   j = 1, ..., 50,

where

r(x) = (1 − (2x − 1)²)²,   0 ≤ x ≤ 1,

x_j = (j − .5)/50, j = 1, ..., 50, and the ε_j's are independent and identically distributed as N(0, (.125)²). (N(μ, σ²) denotes the normal distribution with mean μ and variance σ².)
FIGURE 2.1. Windows Centered at .20 and .60.
For each x, consider the interval [x- h, x + h], where his a small positive number. Imagine forming a "window" by means of two lines that are parallel to the y-axis and hit the x-axis at x - h and x + h (see Figure 2.1). The window is that part of the (x, y) plane that lies between these two lines. Now consider the pairs (x j, Yj) that lie within this window, and average all the Yj 's from these pairs. This average is the estimate of the function value r(x). The window can be moved to the left or right to compute an estimate at any point. The resulting estimate of r is sometimes called a window estimate or a moving average. In the middle panel of Figure 2.2, we see the window estimate of r corresponding to the window of width .188 shown in Figure 2.1. The top and bottom panels show estimates resulting from smaller and larger window widths, respectively. Smoothing is well illustrated in these pictures. The top estimate tracks the data well, but is much too rough. The estimate becomes more and more smooth as its window is opened wider. Of course, there is a price to be paid for widening the window too much, since then the estimate does not fit the data well. Parenthetically we note that Figure 2.2 provides, with a single data set, a nice way of conveying the notion that variability of an average decreases as the number of data points increases. The decrease in variability is depicted by the increasing smoothness in the estimated curves. This is an example of how smoothing ideas can be pedagogically useful in a general study of statistics.
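To make the idea concrete, here is a minimal sketch of a window (moving-average) estimator in Python with NumPy. It is an illustration only, not code from the book; the function names, random seed and evaluation grid are mine, and the data are simulated from the model given above.

```python
import numpy as np

def window_estimate(x, y, x0, h):
    """Average the Y_i's whose design points x_i fall in the window [x0 - h, x0 + h]."""
    inside = np.abs(x - x0) <= h
    return y[inside].mean() if inside.any() else np.nan

# Simulate the data of Figure 2.1: r(x) = (1 - (2x - 1)^2)^2 plus N(0, .125^2) errors.
rng = np.random.default_rng(0)
n = 50
x = (np.arange(1, n + 1) - 0.5) / n
y = (1 - (2 * x - 1) ** 2) ** 2 + rng.normal(0, 0.125, n)

# Moving average on a grid; half-width .094 corresponds to the window width .188 above.
grid = np.linspace(0, 1, 101)
fit = np.array([window_estimate(x, y, x0, h=0.094) for x0 in grid])
```

Shrinking h makes the estimate track the data more closely but more roughly, while widening it smooths the estimate at the cost of fit, exactly as in Figure 2.2.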
2.3 Kernel Smoothing
2.3.1 Fundamentals

A somewhat more sophisticated version of local averaging involves the use of so-called kernel estimators, also referred to as kernel smoothers. Here, the simple average discussed in Section 2.2 is replaced by a weighted sum. Typically, larger weights are given to Y_i's whose x_i's are closer to the point of estimation x. There are at least three versions of the kernel estimator to which we shall refer from time to time. We first consider the Nadaraya-Watson type of estimate (Nadaraya, 1964 and Watson, 1964). Define

(2.2)   r̂_h^NW(x) = Σ_{i=1}^{n} Y_i K((x − x_i)/h) / Σ_{i=1}^{n} K((x − x_i)/h),   0 ≤ x ≤ 1,

where K is a function called the kernel. The quantity h is called the bandwidth or smoothing parameter and controls the smoothness of r̂_h^NW in the same way that window width controls the smoothness of a moving average. In fact, the window estimate of Section 2.2 is a special case of (2.2) with
FIGURE 2.2. Window Estimates. The dashed line is the true curve. The window widths of the estimates are, from top to bottom, .042, .188 and .60.
K_R(u) = (1/2) I_{(−1,1)}(u),

and I_A denotes the indicator function for the set A, i.e., I_A(x) = 1 if x ∈ A and I_A(x) = 0 if x ∉ A. The kernel K_R is called the rectangular kernel. A popular choice for K in (2.2) is the Gaussian kernel, i.e.,

K_G(x) = (1/√(2π)) exp(−x²/2).

A qualitative advantage of using the Gaussian kernel as opposed to the more naive rectangular one is illustrated in Figure 2.3. Here, a Nadaraya-Watson estimate has been applied to the same data as in our first example. The bandwidths used are comparable to their corresponding window widths in Figure 2.2. We see in each case that the Gaussian kernel estimate is smoother than the corresponding window estimate, which is obviously due to the fact that K_G is smooth, whereas K_R is discontinuous at −1 and 1. An estimate that is guaranteed to be smooth is an advantage when one envisions the underlying function r as being smooth.
At least two other types of kernel smoothers are worth considering. One, introduced by Priestley and Chao (1972), is defined by

r̂_h^PC(x) = (1/h) Σ_{i=1}^{n} (x_i − x_{i−1}) Y_i K((x − x_i)/h).

A similar type of smoother, usually known as the Gasser-Muller (1979) estimator, is

r̂_h^GM(x) = (1/h) Σ_{i=1}^{n} Y_i ∫_{s_{i−1}}^{s_i} K((x − u)/h) du,

where s_0 = 0, s_i = (x_i + x_{i+1})/2, i = 1, ..., n − 1, and s_n = 1. The Gasser-Muller estimator may also be written as

(2.3)   r̂_h^GM(x) = (1/h) ∫_0^1 Y_n(u) K((x − u)/h) du,

where Y_n(·) is the piecewise constant function

Y_n(u) = Σ_{i=1}^{n} Y_i I_{[s_{i−1}, s_i)}(u).
In other words, the Gasser-Muller estimate is the convolution of Y_n(·) with K(·/h)/h. This representation suggests that one could convolve K(·/h)/h with other "rough" functions besides Y_n(·). Clark (1977) proposed a version
FIGURE 2.3. Nadaraya-Watson Type Kernel Estimates Using a Gaussian Kernel. The dashed line is the true curve. The bandwidths of the estimates are, from top to bottom, .0135, .051 and .20.
of (2.3) in which Y_n(·) is replaced by a continuous, piecewise linear function that equals Y_i at x_i, i = 1, ..., n. An appealing consequence of (2.3) is that the Gasser-Muller estimate tends to the function Y_n(·) as the bandwidth tends to 0. By contrast, the Nadaraya-Watson estimate is not even well defined for sufficiently small h when K has finite support. As a result the latter estimate tends to be much more unstable for small h than does the Gasser-Muller estimate.
The previous paragraph suggests that there are some important differences between Nadaraya-Watson and Gasser-Muller estimators. Chu and Marron (1991) refer to the former estimator as being an "evaluation type" and to the latter as a "convolution type." The Priestley-Chao estimator may also be considered of convolution type, since, if K is continuous,
where s_{i−1} ≤ x_i ≤ s_i, i = 1, ..., n. When the design points are at least approximately evenly spaced, there is very little difference between the evaluation and convolution type estimators. However, as Chu and Marron (1991) have said, "when the design points are not equally spaced, or when they are iid random variables, there are very substantial and important differences in these estimators." Having made this point, there are, nonetheless, certain basic principles that both estimator types obey. In the balance of Section 2.3 we will discuss these principles without making any distinction between evaluation and convolution type estimators. We reiterate, though, that there are nontrivial differences between the estimators, an appreciation of which can be gained from the articles of Chu and Marron (1991) and Jones, Davies and Park (1994).
A great deal has been written about making appropriate choices for the kernel K and the bandwidth h. At this point we will discuss only the more fundamental aspects of these choices, postponing the details until Chapters 3 and 4. Note that each of the kernel estimators we have discussed may be written in the form
r̂(x) = Σ_{i=1}^{n} w_{i,n}(x) Y_i
for a particular weight function w_{i,n}(x). To ensure consistency of r̂(x) it is necessary that Σ_{i=1}^{n} w_{i,n}(x) = 1. For the Nadaraya-Watson estimator this condition is guaranteed for each x by the way in which r̂_h^NW(x) is constructed. Let us investigate the sum of weights for the Gasser-Muller
estimator. (The Priestley-Chao case is essentially the same.) We have

Σ_{i=1}^{n} (1/h) ∫_{s_{i−1}}^{s_i} K((x − u)/h) du = (1/h) ∫_0^1 K((x − u)/h) du = ∫_{(x−1)/h}^{x/h} K(v) dv.
This suggests that we take K to be a function that integrates to 1. By doing so, the sum of kernel weights will be approximately 1 as long as h is small relative to min(x, 1 − x). If ∫ K(u) du = 1 and K vanishes outside (−1, 1), then the sum of kernel weights is exactly 1 whenever h ≤ min(x, 1 − x). This explains why finite support kernels are generally used with Gasser-Muller and Priestley-Chao estimators. When x is in [0, h) or (1 − h, 1], note that the weights do not sum to 1 even if K has support (−1, 1). This is our first indication of so-called edge effects, which will be discussed in Section 2.5.
In the absence of prior information about the regression function, it seems intuitive that K should be symmetric and have a unique maximum at 0. A popular way of ensuring these two conditions and also ∫ K(u) du = 1 is to take K to be a probability density function (pdf) that is unimodal and symmetric about 0. Doing so also guarantees a positive regression estimate for positive data, an attractive property when it is known that r ≥ 0. On the other hand, there are some very useful kernel functions that take on negative values, as we will see in Section 2.4. To ensure that a kernel estimator has attractive mean squared error properties, it turns out to be important to choose K so that

(2.4)   ∫ u K(u) du = 0,   ∫ K²(u) du < ∞   and   ∫ u² K(u) du < ∞.

Note that conditions (2.4) are satisfied by K_R and K_G, and, in fact, by any bounded, finite variance pdf that is symmetric about 0. The necessity of (2.4) will become clear when we discuss mean squared error in Chapter 3.
It is widely accepted that kernel choice is not nearly so critical as choice of bandwidth. A common practice is to pick a reasonable kernel, such as the Gaussian, and use that same kernel on each data set encountered. The choice of bandwidth is another matter. We saw in Figure 2.3 how much an estimate can change when its bandwidth is varied. We now address how bandwidth affects statistical properties of a kernel estimator.
Generally speaking, the bias of a kernel estimator becomes smaller in magnitude as the bandwidth is made smaller. Unfortunately, decreasing the bandwidth also has the effect of increasing the estimator's variance. A principal goal in kernel estimation is to find a bandwidth that affords a satisfactory compromise between the competing forces of bias and variance. For a given sample size n, the interaction of three main factors dictates the value of this "optimal" bandwidth. These are
• the smoothness of the regression function,
• the distribution of the design points, and
• the amount of variability among the errors ε_1, ..., ε_n.

The effect of smoothness can be illustrated by a kernel estimator's tendency to underestimate a function at peaks and overestimate at valleys. This tendency is evident in Figure 2.4, in which data were simulated from the model
Y_j = 9.9 + .3 sin(2πx_j) + ε_j,   j = 1, ..., 50,
where x_j = (j − .5)/50, j = 1, ..., 50, and the ε_j's are independent and identically distributed as N(0, (.06)²). The estimate in this case is of Gasser-Muller type with kernel K(u) = .75(1 − u²) I_{(−1,1)}(u). The bandwidth of .15 yields an estimate with one peak and one valley that are located at nearly the same points as they are for r. However, the kernel estimate is too low and too high at the peak and valley, respectively. The problem is that at x ≈ .25 (for example) the estimate tends to be pulled down by data values (x_j, Y_j) for which r(x_j) is smaller than the peak. All other factors being equal, the tendency to under- or overestimate will be stronger the sharper the peak or valley. Said another way, the bias of a kernel estimator is smallest where the function is most nearly linear.
The bias at a peak or valley can be lessened by choosing the bandwidth smaller. However, doing so also has its price. In Figure 2.5 we see the same
FIGURE 2.4. The Tendency of a Kernel Smoother to Undershoot Peaks and Overshoot Valleys. The solid line is the true curve and the dotted line a Gasser-Muller kernel smooth.
FIGURE 2.5. Kernel Smooth with a Small Bandwidth Based on the Same Data as in Figure 2.4.
data and same type of kernel estimate as in Figure 2.4, except now the bandwidth is much smaller. Although there is no longer an obvious bias at the peak and valley, the overall estimate has become very wiggly, a feature not shared by the true curve. Figure 2.5 illustrates the fact that the variance of a kernel estimator tends to increase when its bandwidth is decreased. This is not surprising since a smaller value of h means that effectively fewer Y_j's are being averaged.
To gain some insight as to how design affects a good choice for h, consider estimating r at two different peaks of comparable sharpness. A good choice for h will be smaller at the x near which the design points are more highly concentrated. At such a point, we may decrease the size of h (and hence bias) while still retaining a relatively stable estimator. Since bias tends to be largest at points with a lot of curvature, this suggests that a good design will have the highest concentration of points at x's where r(x) is sharply peaked. This is borne out by the optimal design theory of Muller (1984).
The variance, σ², of each error term affects a good choice for h in a fairly obvious way. All other things being equal, an increase in σ² calls for an increase in h. If σ² increases, the tendency is for estimator variance to increase, and the only way to counteract this is to average more points, i.e., take h larger.
The trade-off between stability of an estimator and how well the estimator tracks the data is a basic principle of smoothing. In choosing a smoothing parameter that provides a good compromise between these two
properties, it is helpful to have an objective criterion by which to judge an estimator. One such criterion is integrated squared error, or ISE. For any two functions f and g on [0, 1], define the ISE by

I(f, g) = ∫_0^1 (f(x) − g(x))² dx.
For a given set of data, it seems sensible to consider as optimum a value of h that minimizes I(r̂_h, r). For the set of data in Figure 2.3, Figure 2.6 shows an ISE plot for a Nadaraya-Watson estimate with a Gaussian kernel. The middle estimate in Figure 2.3 uses the bandwidth of .051 that minimizes the ISE.
For the two data sets thus far considered, the function r was known, which allows one to compute the ISE curve. Of course, the whole point of using kernel smoothers is to have a means of estimating r when it is unknown. In practice then, we will be unable to compute ISE, or any other functional of r. We shall see that one of smoothing's greatest challenges is choosing a smoothing parameter that affords a good trade-off between variance and bias when the only knowledge about the unknown function comes from the observed data. This challenge is a major theme of this book and will be considered in detail for the first time in Chapter 4.
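As an illustration of the ISE criterion, the following Python/NumPy sketch (mine, not the book's) computes a Nadaraya-Watson estimate with a Gaussian kernel as in (2.2) and evaluates I(r̂_h, r) over a grid of bandwidths for simulated data in which the true r is known.

```python
import numpy as np

def nw_estimate(x, y, grid, h):
    """Nadaraya-Watson estimate (2.2) with a Gaussian kernel, evaluated on grid."""
    u = (grid[:, None] - x[None, :]) / h      # scaled distances, shape (len(grid), n)
    w = np.exp(-0.5 * u ** 2)                 # Gaussian kernel weights
    return (w * y).sum(axis=1) / w.sum(axis=1)

def ise(fit, truth, grid):
    """Integrated squared error I(fit, truth), approximated by the trapezoidal rule."""
    return np.trapz((fit - truth) ** 2, grid)

rng = np.random.default_rng(1)
n = 50
x = (np.arange(1, n + 1) - 0.5) / n
r = lambda t: (1 - (2 * t - 1) ** 2) ** 2
y = r(x) + rng.normal(0, 0.125, n)

grid = np.linspace(0, 1, 201)
bandwidths = np.linspace(0.01, 0.30, 30)
ise_values = [ise(nw_estimate(x, y, grid, h), r(grid), grid) for h in bandwidths]
h_best = bandwidths[int(np.argmin(ise_values))]   # ISE-optimal bandwidth for this sample
```

Of course, this calculation is possible only because r is known; Chapter 4 takes up data-driven substitutes for the ISE-optimal bandwidth.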
2.3.2 Variable Bandwidths

It was tacitly assumed in Section 2.3.1 that our kernel smoother used the same bandwidth at every value of x. For the data in Figures 2.3 and 2.4, doing so produces reasonable estimates. In Figure 2.7 we see the function r(x) = 1 + (24x)³ exp(−24x), 0 ≤ x ≤ 1, which has a sharp peak at x = .125, but is nearly flat for .5 ≤ x ≤ 1. The result of smoothing noisy data from this curve using a constant bandwidth smoother is also
FIGURE 2.6. Plot of log(ISE) for the Data in Figure 2.3.
2.3. Kernel Smoothing
.. 0.0
. . . ... 0.4
0.2
15
0.6
0.8
1.0
0.6
0.8
1.0
X
0.0
0.4
0.2
X
0
N
.. 0.0
0.2
.· 0.4
.. 0.6
0.8
1.0
FIGURE 2.7. The Effect of Regression Curvature on a Constant Bandwidth Estimator. The top graph is the true curve along with noisy data generated from that curve. The middle and bottom graphs show Gasser-Muller estimates with Epanechnikov kernel and respective bandwidths of h = .05 and h = .20. The lesson here is that the same bandwidth is not always adequate over the whole range of x's.
shown in Figure 2.7. The estimates in the middle and lower graphs are of Gasser-Muller type with K(u) = .75(1 − u²) I_{(−1,1)}(u), which is the so-called Epanechnikov kernel. The bandwidth used in the middle graph is appropriate for estimating the function at its peak, whereas the bandwidth in the lower graph is more appropriate for estimating the curve where it is flat. Neither estimate is satisfactory, with the former being too wiggly for x > .3 and the latter having a large bias at the peak.
This example illustrates that a constant bandwidth estimate is not always desirable. Values of x where the function has a lot of curvature call for relatively small bandwidths, whereas x's in nearly flat regions require larger bandwidths. The latter point is best illustrated by imagining the case where all the data have a common mean. Here, it is best to estimate the underlying "curve" at all points by Ȳ, the sample mean. It is not difficult to argue that a Nadaraya-Watson estimate tends to Ȳ as h tends to infinity.
An obvious way of dealing with the problem exhibited in Figure 2.7 is to use an estimator whose bandwidth varies with x. We have done so in Figure 2.8 by using h(x) of the form shown in the top graph of that figure. The smoothness of the estimate is preserved by defining h(x) so that it changes smoothly from h = .05 up to h = .5.
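A sketch of a Gasser-Muller estimate whose bandwidth may vary with x, written in Python/NumPy with the Epanechnikov kernel. This is my own illustration: the test curve follows the one above, but the noise level and the particular bandwidth function h(x), which rises from .05 to .5, are arbitrary choices.

```python
import numpy as np

def epan_cdf(v):
    """Antiderivative (CDF) of the Epanechnikov kernel K(u) = .75(1 - u^2) on (-1, 1)."""
    v = np.clip(v, -1.0, 1.0)
    return 0.75 * (v - v ** 3 / 3.0) + 0.5

def gasser_muller(x, y, grid, h):
    """Gasser-Muller estimate on [0, 1]; h may be a scalar or an array matching grid."""
    s = np.concatenate(([0.0], (x[:-1] + x[1:]) / 2.0, [1.0]))     # s_0, ..., s_n
    h = np.broadcast_to(np.asarray(h, dtype=float), grid.shape)
    # weight of Y_i at x0 is the integral of K((x0 - u)/h)/h over [s_{i-1}, s_i]
    w = epan_cdf((grid[:, None] - s[None, :-1]) / h[:, None]) \
        - epan_cdf((grid[:, None] - s[None, 1:]) / h[:, None])
    return (w * y[None, :]).sum(axis=1)

rng = np.random.default_rng(2)
n = 100
x = (np.arange(1, n + 1) - 0.5) / n
y = 1 + (24 * x) ** 3 * np.exp(-24 * x) + rng.normal(0, 0.2, n)    # noise sd is illustrative

grid = np.linspace(0, 1, 201)
h_of_x = 0.05 + 0.45 * np.clip((grid - 0.3) / 0.4, 0, 1)           # small near the peak, large in the flat region
fit = gasser_muller(x, y, grid, h_of_x)
```

With a constant h this reduces to the ordinary Gasser-Muller estimate of Section 2.3.1.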
2.3.3 Transformations of x

Constant bandwidth estimators are appealing because of their simplicity, which makes for computational convenience. In some cases where it seems that a variable bandwidth estimate is called for, a transformation of the x-variable can make a constant bandwidth estimate suitable. Let t be a strictly monotone transformation, and define s_i = (t(x_i) + t(x_{i+1}))/2, i = 1, ..., n − 1, s_0 = t(0), s_n = t(1) and

μ̂_h(z) = (1/h) Σ_{i=1}^{n} Y_i ∫_{s_{i−1}}^{s_i} K((z − u)/h) du,   t(0) ≤ z ≤ t(1).
Inasmuch as r̂_h^GM(x) estimates r(x), μ̂_h(z) estimates r(t^{−1}(z)). Therefore, an alternative estimator of r(x) is μ̂_h(t(x)). The key idea behind this approach is that the function r(t^{−1}(·)) may be more amenable to estimation with a constant bandwidth estimate than is r itself.
This idea is illustrated by reconsidering the data in Figure 2.7. The top graph in Figure 2.9 shows a scatter plot of (x_i^{1/4}, Y_i), i = 1, ..., n, and also a plot of r(x) versus x^{1/4}. Considered on this scale, a constant bandwidth estimate does not seem too unreasonable since the peak is now relatively wide in comparison to the flat spot in the right-hand tail. The
FIGURE 2.8. A Variable Bandwidth Kernel Estimate and Bandwidth Function. The bottom graph is a variable bandwidth Gasser-Muller estimate computed using the same data as in Figure 2.7. The top graph is h(x), the bandwidth function used to compute the estimate.
estimate μ̂_h(t(x)) (t(x) = x^{1/4}), shown in the bottom graph of Figure 2.9, is very similar to the variable bandwidth estimate of Figure 2.8.
The use of an x-transformation is further illustrated in Figures 2.10 and 2.11. The data in the top graph of Figure 2.10 are jawbone lengths for thirty rabbits of varying ages. Note that jawbone length increases rapidly up to a certain age, and then asymptotes when the rabbits reach maturity. Accordingly, the experimenter has used a judicious design in that more young than old rabbits have been measured.
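The transformation approach can be sketched by reusing the gasser_muller helper from the variable-bandwidth example in Section 2.3.2; the snippet below is an illustration of mine, not the book's code. It smooths on the t(x) = x^{1/4} scale with a constant bandwidth and then reads the fit off at the transformed points.

```python
import numpy as np
# assumes gasser_muller (and the simulated x, y, grid) from the sketch in Section 2.3.2

def transform_smooth(x, y, grid, h, t=lambda u: u ** 0.25):
    """Constant-bandwidth Gasser-Muller smoothing on the t(x) scale, evaluated at t(grid)."""
    # t maps [0, 1] onto [0, 1] here, so the cell boundaries computed inside gasser_muller
    # are exactly s_i = (t(x_i) + t(x_{i+1}))/2 with s_0 = t(0) and s_n = t(1).
    return gasser_muller(t(x), y, t(grid), h)

fit_transformed = transform_smooth(x, y, grid, h=0.07)
```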
FIGURE 2.9. Gasser-Muller Estimate Based on a Power Transformation of the Independent Variable. The top graph shows the same data and function as in Figure 2.8, but plotted against t(x) = x^{1/4}. The dashed line in the bottom graph is the variable bandwidth estimate from Figure 2.8, while the solid line is the estimate μ̂_h(x^{1/4}) with h = .07.
A Gasser-Muller estimate with Epanechnikov kernel and a bandwidth of .19 is shown in Figure 2.10 along with a residual plot. This estimate does not fit the data well around days 20 to 40. Now suppose that we apply a square root transformation to the x_i's. The resulting estimate μ̂_h(√x) and its residual plot are shown in Figure 2.11. Transforming the x's has obviously led to a better fitting estimate in this example.
FIGURE 2.10. Ordinary Gasser-Muller Kernel Estimate Applied to Rabbit Jawbone Data. The top graph depicts the jawbone lengths of thirty rabbits of varying ages along with a Gasser-Muller estimate with bandwidth equal to 114. The bottom graph shows the corresponding residual plot.
2.4 Fourier Series Estimators

Another class of nonparametric regression estimators makes use of ideas from orthogonal series. Many different sets of orthogonal basis functions could and have been used in the estimation of functions. These include orthogonal polynomials and wavelets. Here we shall introduce the notion of a series estimator by focusing on trigonometric, or Fourier, series. We reiterate that much of the testing methodology in Chapters 7-10 makes use
FIGURE 2.11. Gasser-Muller Estimate Computed After Using a Square Root Transformation of the Independent Variable. The top graph shows the rabbit jawbone data with a Gasser-Muller estimate based on a square root transformation of the x's. The bandwidth on the transformed scale is 5.4. The bottom graph is the corresponding residual plot.
of Fourier series; hence, the present section is perhaps the most important one in this chapter.
Consider the system C = {1, cos(πx), cos(2πx), ...} of cosine functions. The elements of C are orthogonal to each other over the interval (0, 1) in the sense that
(2.5)   ∫_0^1 cos(πjx) cos(πkx) dx = 0,   j ≠ k,   j, k = 0, 1, ....
For any function r that is absolutely integrable on (0,1), define its Fourier coefficients by
φ_j = ∫_0^1 r(x) cos(πjx) dx,   j = 0, 1, ....
The system C is complete for the class C[0, 1] of functions that are continuous on [0, 1]. In other words, for any function r in C[0, 1], the series

(2.6)   r(x; m) = φ_0 + 2 Σ_{j=1}^{m} φ_j cos(πjx),   0 ≤ x ≤ 1,

converges to r in mean square as m → ∞ (see, e.g., Tolstov, 1962). Convergence in mean square means that the integrated squared error I(r(·; m), r) converges to 0 as m → ∞. The system C is said to form an orthogonal basis for C[0, 1] since it satisfies the orthogonality properties (2.5) and is complete. The practical significance of C being an orthogonal basis is its implication that any continuous function may be well approximated on [0, 1] by a finite linear combination of elements of C.
In addition to being continuous, suppose that r has a continuous derivative on [0, 1]. Then the series r(·; m) converges uniformly on [0, 1] to r as m → ∞ (Tolstov 1962, p. 81). Often the φ_j's converge quickly to 0, implying that there is a small value of m such that

r(x; m) ≈ r(x),   for all x ∈ [0, 1].

This is especially relevant in statistical applications where it is important that the number of estimated parameters be relatively small compared to the number of observations.
Let us now return to our statistical problem in which we have data from model (2.1) and wish to estimate r. Considering series (2.6) and the discussion above, we could use Y_1, ..., Y_n to estimate φ_0, ..., φ_m and thereby obtain an estimate of r. Since the series r(·; m) is linear in cosine functions, one obvious way of estimating the φ_j's is to use least squares. Another possibility is to use a "quadrature" type estimate that parallels the definition of φ_j as an integral. Define φ̂_j by

φ̂_j = Σ_{i=1}^{n} Y_i ∫_{s_{i−1}}^{s_i} cos(πju) du,   j = 0, 1, ...,

where s_0 = 0, s_i = (x_i + x_{i+1})/2, i = 1, ..., n − 1, and s_n = 1. Now define an estimator r̂(x; m) of r(x) by

r̂(x; m) = φ̂_0 + 2 Σ_{j=1}^{m} φ̂_j cos(πjx),   0 ≤ x ≤ 1.
22
2. Some Basic Ideas of Smoothing
sense the Fourier series approach to estimating r is simply an application of linear models. We could just as well have used orthogonal polynomials as our basis functions (rather than C) and then the series estimation scheme would simply be polynomial regression. What sets a nonparametric approach apart from traditional linear models is an emphasis on the idea that the m in r(x; m) is a crucial parameter that must be inferred from the data. The quantity m plays the role of smoothing parameter in the smoother f(·; m). We shall sometimes refer to f(·; m) as a truncated series estimator and tom as a truncation point. In Figure 2.12 we see three Fourier series estimates computed from the data previously encountered in Figures 2.2 and 2.3. Note how the smoothness of these estimates decreases as the truncation point increases. The middle estimate (with m = 2) minimizes I(r(·; m), r) with respect tom. At first glance it seems that the estimator f(·; m) is completely different from the kernel smoothers of Section 2.3. On closer inspection, though, the two types of estimators are not so dissimilar. It is easily verified that
r(x; m) =
t
Yi
1Si
i=l
Km(x, u)du,
0 ::;
X ::;
1,
Si-1
where m
Km(u, v) = 1 + 2
'2: cos(1fju) cos(1fjv) j=l
=
Dm(u- v)
+ Dm(u + v)
and Dm is the Dirichlet kernel, i.e.,
Dm(t)
=
sin [(2m+ 1)1ft/2] · 2 sin(1ft/2)
We see, then, that r(·; m) has the same form as a Gasser-Muller type estimator with kernel Km (u, v). The main difference between f( ·; m) and the estimators in Section 2.3 is with respect to the type of kernel used. Note that Km(x, u) depends on both x and u, rather than only x - u. The truncation point m, roughly speaking, is inversely proportional to the bandwidth of a kernel smoother. Figure 2.13 gives us an idea about the nature of Km and the effect of increasing m. Each of the three graphs in a given panel provides the kernel weights used at a point, x, where the curve is to be estimated. Notice that the weights are largest in absolute value near x and that the kernel becomes more concentrated at x as m increases. A key difference between the kernel Km and, for example, the Gaussian kernel is the presence of high frequency wiggles, or side lobes, in the tails of Km, with the number of wiggles being an increasing function of m. To many practitioners the oscillatory nature of the kernel Km is an unsavory aspect of the series estimator r(·; m). One way of alleviating this
2.4. Fourier Series Estimators
23
q
"'
0
.:.
(0
>-
0
..·. .. ...
..,. 0 "! 0
.·'
0
0
•
0.0
0.2
0.4
0.6
' ·.•
-~
0.8
1.0
X
q
..
;
"'
0
.:.
(0
>-
0
"' 0
"'
..
0 0
0
..
.~
...
.
. '•
~~~·
0.0
0.2
0.4
0.6
0.8
1.0
0.6
0.8
1.0
X
q
"'0 (0
>-
0
"' 0
"'0 0
0
0.0
0.2
0.4 X
FIGURE 2.12. Fourier Series Estimates. The dashed line is the true curve. The truncation points of the estimates are, from top to bottom, 27, 2 and 1.
FIGURE 2.13. Kernels Corresponding to the Series Estimate r̂(x; m), for m = 3, 8 and 16. For a given m, the graphs show, from left to right, the data weights K_m(x, v) used in estimating the curve at x = .5, x = .75 and x = 1.
problem is to apply a taper to the Fourier coefficients. Define an estimator r̂(x; w_λ) of r(x) by

(2.7)   r̂(x; w_λ) = φ̂_0 + 2 Σ_{j=1}^{n−1} w_λ(j) φ̂_j cos(πjx),   0 ≤ x ≤ 1,

where the taper {w_λ(1), w_λ(2), ...} is a sequence of constants depending on a smoothing parameter λ. Usually the taper is such that 1 ≥ w_λ(1) ≥ w_λ(2) ≥ ··· ≥ w_λ(n − 1) ≥ 0. The estimator (2.7) may also be written as the kernel estimator

r̂(x; w_λ) = Σ_{i=1}^{n} Y_i ∫_{s_{i−1}}^{s_i} K_n(x, u; w_λ) du

with kernel

K_n(u, v; w_λ) = 1 + 2 Σ_{j=1}^{n−1} w_λ(j) cos(πjv) cos(πju).
By appropriate choice of w_λ, one can obtain a kernel K(x, ·; w_λ) that looks much like a Gaussian curve for x not too close to 0 or 1. We shall discuss how to do so later in this section.
Tapering has its roots in the mathematical theory of Fourier series, where it has been used to induce convergence of an otherwise divergent series. For example, define, for each positive integer m, w_m(j) = 1 − (j − 1)/m, j = 1, ..., m. These are the so-called Fejér weights and correspond to forming the sequence of arithmetic means of a series; i.e.,
(2.8)   φ_0 + 2 Σ_{j=1}^{m} (1 − (j − 1)/m) φ_j cos(πjx) = (1/m) Σ_{j=1}^{m} r(x; j).
The series (2.8) has the following remarkable properties (see, e.g., Tolstov, 1962): (i) It converges to the same limit as r(x; m) whenever the latter converges, and (ii) it converges uniformly to r(x) on [0, 1] for any function r in C[0, 1]. Property (ii) is all the more remarkable considering that there exist continuous functions r such that r(x; m) actually diverges at certain points x. (The last statement may seem to contradict the fact that the Fourier series of any r ∈ C[0, 1] converges in mean square to r; but in fact it does not, since mean square convergence is weaker than pointwise convergence.) The kernel corresponding to the Fejér weights has the desirable property of being positive, but is still quite oscillatory. Another possible set of weights is
w_m^R(j) = cos(πj/(2m + 1)),   j = 1, ..., m,   and w_m^R(j) = 0 for j > m.

The kernel corresponding to these weights may be written in terms of the Rogosinski kernel R_m (Butzer and Nessel, 1971):

K_m^R(u, v) = R_m(u − v) + R_m(u + v),

where

R_m(t) = (1/2)[D_m(t − 1/(2m + 1)) + D_m(t + 1/(2m + 1))]

and D_m is the Dirichlet kernel defined earlier. The trigonometric series

r_R(x; m) = φ_0 + 2 Σ_{j=1}^{m} cos(πj/(2m + 1)) φ_j cos(πjx),   0 ≤ x ≤ 1,
has the same nice property as the Fejér series, i.e., r_R(x; m) converges uniformly to r(x) on [0, 1] for any function r ∈ C[0, 1].
In Figure 2.14 we have plotted the kernel K_m^R at values of m that make K_m^R(x, x) ≈ K_{m'}(x, x), where m' is the truncation point used in the corresponding panel of Figure 2.13. We see that the side lobes of K_m^R(x, ·) are much smaller than those in K_{m'}(x, ·). This difference between the two kernels translates into an aesthetic advantage for the Rogosinski series, which is illustrated in Figure 2.15. Here we see the series approximations r(x; 8) and r_R(x; 14) to r(x) = (20x)² exp(−20x). These two approximators use the kernels depicted in the middle panels of Figures 2.13 and 2.14, respectively. Notice that r(·; 8) contains spurious waves in its right tail and also does a poor job of locating r's peak. These facts are a direct result of the high frequency oscillations in the kernel K_m. The Rogosinski series does a much better job of fitting both the tail and the peak.
So far the series estimators we have considered share the feature that their smoothing parameter m is also the series truncation point. A large class of estimators with continuous smoothing parameters may be constructed as follows. Let K be any continuous probability density function that is symmetric about 0. Define φ_K to be the characteristic function of K, and assume that φ_K(aj) is absolutely summable in j for each a > 0. Now take the taper w_λ in the series estimator (2.7) to be
w_λ(j) = φ_K(λπj),   λ > 0,   j = 1, 2, ....

For large n, the kernel of this estimator is well approximated by

K(u, v; λ) = 1 + 2 Σ_{j=1}^{∞} φ_K(λπj) cos(πju) cos(πjv) = K_w(u + v; λ) + K_w(u − v; λ),   0 ≤ u, v ≤ 1,
FIGURE 2.14. Kernels Corresponding to the Rogosinski Series Estimate r̂_R(x; m), for m = 4, 14 and 26. For a given m, the graphs show, from left to right, the data weights K_m^R(x, v) used in estimating the curve at x = .5, x = .75 and x = 1.
FIGURE 2.15. Gamma Type Function and Series Approximators. The solid line is r(x) = (20x)² exp(−20x), the dashed line is r(x; 8) and the dots are r_R(x; 14), a Rogosinski type series.
where K_w(y; λ) is a "wrapped" version of K(y/λ)/λ, i.e.,

K_w(y; λ) = (1/λ) Σ_{j=−∞}^{∞} K((y − 2j)/λ),   for all y.
Note that K_w(·; λ) is periodic with period 2 and integrates to 1 over any interval of length 2. Now, let u and v be any numbers in (0, 1); then if K is sufficiently light-tailed,
K_w(u + v; λ) + K_w(u − v; λ) ~ (1/λ) K((u − v)/λ)

as λ → 0. (Here and subsequently in the book, the notation a_λ ~ b_λ means that a_λ/b_λ tends to 1 as λ approaches a limit.) This is true, for example, when K is the standard normal density and whenever K has finite support. So, except near 0 and 1, the Gasser-Muller type kernel smoother (with appropriate kernel K) is essentially the same as the series estimator (2.7) with taper φ_K(λπj). We can now see that kernel smoothers and series estimators are just two sides of the same coin. Having two representations for the same estimator provides some theoretical insight and can also be useful from a computational standpoint.
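A sketch of the tapered series estimator (2.7), reusing the fourier_coeffs helper from earlier in this section together with the simulated x, y, grid of Section 2.3. The Rogosinski taper below follows the weights cos(πj/(2m + 1)) described above, and the Gaussian taper uses the characteristic function exp(−(λπj)²/2) of the standard normal kernel; the code is my own illustration, not the book's.

```python
import numpy as np
# assumes fourier_coeffs from the earlier sketch in this section, and x, y, grid from Section 2.3

def tapered_series_estimate(phi, grid, w):
    """Tapered series estimate (2.7): phi_0 + 2 * sum_j w(j) phi_j cos(pi j x)."""
    j = np.arange(1, len(w) + 1)
    return phi[0] + 2 * np.cos(np.pi * grid[:, None] * j[None, :]) @ (w * phi[1:len(w) + 1])

def rogosinski_taper(m, n_coef):
    """Rogosinski weights cos(pi j / (2m + 1)) for j <= m, zero for j > m."""
    j = np.arange(1, n_coef + 1)
    return np.where(j <= m, np.cos(np.pi * j / (2 * m + 1)), 0.0)

def gaussian_taper(lam, n_coef):
    """Taper phi_K(lambda pi j) for the standard normal kernel K."""
    j = np.arange(1, n_coef + 1)
    return np.exp(-0.5 * (lam * np.pi * j) ** 2)

phi = fourier_coeffs(x, y, len(x) - 1)
fit_rogosinski = tapered_series_estimate(phi, grid, rogosinski_taper(14, len(x) - 1))
fit_gaussian = tapered_series_estimate(phi, grid, gaussian_taper(0.05, len(x) - 1))
```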
2.5 Dealing with Edge Effects

So-called edge or boundary effects are a fundamental difficulty in smoothing. The basic problem is that the efficiency of an estimator of r(x) tends to decrease as x nears either edge, i.e., boundary, of an interval containing all the design points. Edge effects are due mostly to the fact that fewer data
are available near the boundaries and in part to properties of the particular smoother being used. In this section we discuss how various smoothers deal with edge effects and how the performance of smoothers can be enhanced by appropriate modification within the boundary region.
2.5.1 Kernel Smoothers
Priestley-Chao and Gasser-Muller type kernel estimators, as defined in Section 2.3, can be very adversely affected when x is near an edge of the estimation interval. Consider a Gasser-Muller estimator whose kernel is a pdf with support (−1, 1). If the estimation interval is [0, 1], then for x ∈ [h, 1 − h] the kernel weights of r̂_h^GM(x) add to 1, and so the estimate is a weighted average of Y_i's. However, if x is in [0, h) or (1 − h, 1], the weights add to less than 1 and the bias of r̂_h^GM(x) tends to be larger than it is when x is in [h, 1 − h]. In fact, when x is 0 or 1, we have E(r̂_h^GM(x)) ≈ r(x)/2, implying that the Gasser-Muller estimator is not even consistent for r(0) and r(1) (unless r(0) = 0 = r(1)).
A very simple way of alleviating the majority of the boundary problems experienced by r̂_h^PC and r̂_h^GM is to use the "cut-and-normalize" method. We define a normalized Gasser-Muller estimator by

(2.9)   r̃_h^GM(x) = r̂_h^GM(x) / [(1/h) ∫_0^1 K((x − u)/h) du].

By dividing by the sum of the kernel weights we guarantee that the estimate is a weighted average of Y_i's at each x. Of course, for x ∈ [h, 1 − h] the normalized estimate is identical to r̂_h^GM(x). It is worth noting that the Nadaraya-Watson kernel estimator is normalized by definition.
It turns out that even after normalizing the Priestley-Chao and Gasser-Muller estimators, their bias still tends to be largest when x is in the boundary region. As we will see in Chapter 3, for the types of kernels most often used in practice, this phenomenon occurs when r has two continuous derivatives throughout [0, 1] and r′(0+) ≠ 0 (or r′(1−) ≠ 0). Intuitively we may explain the boundary problems of the normalized Gasser-Muller estimator in the following way. Suppose we have data Y_1, ..., Y_n from model (2.1) in which r has two continuous derivatives with r′(0+) < 0. The function r is to be estimated by (2.9), where K is symmetric about 0. Now consider data Z_{−n}, ..., Z_{−1}, Z_1, ..., Z_n from the model
zi = m(ui)
+ r]i,
Iii
= 1, ... , n,
where E(TJi) = 0, -U-i = ui = Xi, i = 1, ... , n, and m( -x) = m(x) = r(x) for x 2: 0. Using the Zi's we may estimate m(x) by an ordinary GasserMuller estimator, call it mz(x), using the same kernel and bandwidth (h) as in the normalized estimator rjfM (x) of r(x). Since r'(O+) < 0, the function m has a cusp at x = 0, implying that the bias IEmz(x) - m(x)l will be
30
2. Some Basic Ideas of Smoothing
larger at x = 0 than at x E [h, 1 - h]. A bit of thought reveals that
Emz(x)- m(x) = ErfM (x)- r(x),
for x
= 0, x
E
[h, 1- h].
So, ;;;G.l\1/(x) has a large boundary bias owing to the fact that mz(x) has a large bias at the cusp x = 0. Figure 2.16 may help to clarify this explanation. In this example we have r(x) exp(-x), 0 :::; x :::; 1, and m(x) = exp(-lxl), -1 :::; x:::; 1. Suppose in our discussion above that r' (0+) had been 0, implying that m' (0) = 0. In this case, m would be just as smooth at 0 as at other points; hence, the bias of mz (0) would not be especially large. This, in turn, implies that the normalized estimator (2.9) would not exhibit the edge effects that occur when r' (0+) f. 0. This point is well illustrated by the window and kernel estimates in the middle panels of Figures 2.2 and 2.3. Notice that these (normalized) estimates show no tendency to deteriorate near 0 or 1. This is because the regression function r(x) = [1 - (2x - 1) 2 ]2 I(o,l)(x) generating the data is smooth near the boundary in the sense that r' (0) = 0 r'(1). Consider again the Gasser-Muller estimator rfM having kernel K with support ( -1, 1). If K is chosen to be symmetric, then ~ 1 uK (u) du = 0, which, as we will see in Chapter 3, is beneficial in terms of estimator bias. This benefit only takes place when r(x) is estimated at x E [h, 1 - h]. To appreciate why, suppose we estimate r( x) for x in the boundary region, say at x = qh, 0 :::; q < 1. In this case the kernel used by rfM (x) is
J
0 ~
0~,
rP"'
0 /
Q(lll)
00
0
ci
rfJ,
/6
...._..
~ 0 0 O.c?, oo ..·txJ
><0
ci
.. .. .. •v
OCQ./Q
'
ci
...·0 <{)0
tflilo
-1.0
-0.5
0.0
0.5
1.0
X
2.16. Why a Normalized Kernel Estimate Can Still Experience Boundary Problems. The solid line is the regression function r to be estimated using only data at the positive design points. We can imagine having noisy data from a symmetrized version of r defined on (-1, 1). The cusp at 0 in the symmetrized function illustrates why edge effects plague the normalized estimate. FIGURE
2.5. Dealing with Edge Effects
31
effectively K(u)I(-l,q)(u). The fact that this kernel does not integrate to 1 can be corrected simply by dividing it by J~ 1 K (u). In general, though, this normalized kernel does not have the desirable bias reduction property of a zero first moment. Gasser and Muller (1979) proposed the use of boundary kernels to ensure that the kernel integrates to 1 and has first moment 0 for each x E [0, 1]. One way of constructing such kernels is to define Kq for each q E [0, 1) by
Kq(u) = tq(u)K(u)I(-l,q)(u), where tq is a function with the properties (2.10) At x
=
I:
tq(u)K(u)du
I:
and
= 1
utq(u)K(u)du = 0.
qh the function value r(x) is estimated by
~ li ~ Ls~l Kq (X ~ u) du. A similar boundary adjustment can be done for x near 1. A simple choice for tq is
tq(u)
=
aq
+ bqu.
To ensure that (2.10) is satisfied, define Iq,j and take (2.11)
and
bq =
=
J~ 1 u1 K( u)du, j
-Iq,l 2 Iq,zlq,o - Iq,l
=
0, 1, 2,
•
Other choices of boundary kernels will be discussed in Chapter 3. Figure 2.17 shows an example of the improvement that can be obtained by using boundary kernels. The function from which the data were generated is r(x) = 15 + 50x 3 e-x, which is such that r'(O+) = 0 and r 1 (1-) > 0. Each of the two estimates shown uses the Epanechnikov kernel and bandwidth .2. The lower of the two estimates at x 1 is a normalized Gasser-Muller estimate, and the other is a Gasser-Muller estimate using boundary kernels
Kq(u) = .75(aq- bqu)(1- u 2 )I(-q,l)(u) with aq and bq defined as in (2.11). The behavior of the normalized estimate at 1 is typical. Normalized kernel estimates tend to underestimate r(1) whenever r 1 (1-) > 0 and to overestimate r(1) when r 1 (1-) < 0. At x = 0, though, where the function is nearly flat, the normalized and boundary kernel estimators behave much the same. Rice (1984a) has proposed boundary modifications for Nadaraya-Watson type kernel estimators. Although he motivates his method in terms of a numerical analysis technique, the method turns out to be similar to that
32
2. Some Basic Ideas of Smoothing
.
,!':
•
......~/
0
C')
/~ /
eve
•• •
/
lO
C\1
•· •••
0 C\1
• lO .....
•• 0.0
••
•
•
•
• •• •• •• •
• 0.2
0.6
0.4
0.8
1.0
X FIGURE 2.17. Boundary Kernels vs. Normalizing. The solid line is the true curve from which the data were generated. The dotted line is a Gasser-Muller estimate that uses boundary kernels, and the dashed line is a normalized Gasser-Muller estimate.
of Gasser and Muller in that it produces boundary kernels that integrate to 1 and have first moments equal to 0.
2.5.2 Fourier Series Estimators In this section we examine how Fourier series estimators of the form n-1
r(x; W)..) =
¢o + 2 L
W>.(J)¢j cos(njx)
j=l
are affected when x is near the boundary. Recall that such estimators include the simple truncated series r(x; m) and the Rogosinski estimator rR(x; m) as special cases. We noted in Section 2.4 that any series estimator f(x; W>.) is also a kernel estimator of the form
r(x; W>.) =
t Yi 1Si i=l
Si-1
Kn(x, u; W>.)du,
2.5. Dealing with Edge Effects
33
where n
Kn(x, u; W>-.)
=
1+2
L W>-.(j) cos(nju) cos(njx). j=1
For each x
E
[0, 1] we have
1
~ 1:~ 1 Kn(x, u; W>-.)du
1
Kn(x, u; w>-.)du
=
1.
Since the sum of the kernel weights is always 1, we can expect the boundary performance of the series estimator f(·; W>-.) to be at least as good as that of Nadaraya-Watson or normalized Gasser-Muller estimators. Figures 2.13 and 2.14 show that the boundary adjustments implicit in our series estimators are not simply normalizations. Especially in the top panels of those two figures, we see that the kernel changes shape as the point of estimation moves from x = 1/2 to 1. Another example of this behavior is seen in Figure 2.18, where kernels for the estimate r(x; W>-.) with W>-.(j) = exp( -.5(.17rj) 2 ) are shown at x = .5, .7, .9 and 1. At x = .5 and .7 the kernel is essentially a Gaussian density with standard deviation .1, but at x = .9 the kernel has a shoulder near 0. Right at x = 1 the kernel is essentially a half normal density. To further investigate boundary effects, it is convenient to express the series estimate in yet another way. Define an extended data set as follows: y_i+ 1 = Y;,, i = 1, ... , n, and B-i = -si for i = 0, ... , n. In other words, we create a new data set of size 2n by simply reflecting the data Y1, ... , Yn about the y-axis. It is now easy to verify that the series estimate r(x; W>-.) is identical to (2.12)
t ls; Yi
i=-n+1
Kn(x
u; W>-.)du,
Si-1
for each x E [0, 1], where n-1
Kn(v; W>-.)
=
~ + ~ W>-.(J) cos(njv),
\I v.
In particular, the simple series estimator f(·; m) and the Rogosinski series fR(·; m) are equivalent to kernel estimator (2.12) with Kn(·; W>-.) identical to, respectively, the Dirichlet kernel Dm and the Rogosinski kernel Rm for all n. The representation (2.12) suggests that tapered series estimators will be subject to the same type of edge effects that bother normalized kernel estimators. Note that reflecting the data about 0 will produce a cusp as in Figure 2.16, at least when r'(O+) =f 0. This lack of smoothness will tend to make the bias of estimator (2.12) relatively large at x = 0. The same type of edge effects will occur at x = 1. Since the kernel Kn(·; W>-.) is periodic
2. Some Basic Ideas of Smoothing
34
X=
s
SZ'
.5
X=
'
'
C')
C')
C\1
C\1
0
0
0.0
0.0
0.8
0.4 X=
0.8
X=1 CXJ
'
SZ'
0.4
.9
l!)
s
.7
C')
'
0
0
0.0
0.8
0.4
0.0
0.4
0.8
u
u
FIGURE 2.18. Kernels Corresponding to the Tapered Series Estimate with Taper Equal to a Gaussian Characteristic Function. Each graph is a kernel at a different point of estimation x. The bandwidth A in each case is .1.
with period 2, the estimate (2.12) at x = 1 is equal to the same type of estimate applied to data that are reflected about x = 1, rather than 0. A simple and effective way of correcting the type of edge effects just discussed was proposed by Eubank and Speckman (1990). Their proposal, called polynomial-trigonometric regression, is to estimate r (x) by m
(2.13)
rm(x)
=
alx + iizx 2 +
L
0 ~ X ~ 1,
bj cos(njx),
j=O
where, for a given m, the estimates al, iiz, bj, j ordinary least squares estimates, i.e., they minimize
with respect to a1, az, bo, ... , bm, respectively.
=
0, ... 'm, are the
2.6. Other Smoothing Methods
35
The fact that this method tends to alleviate edge effects may be explained heuristically as follows. Define r'(O+) = d1 and r 1 (1-) = d 2 and note that r(x) may be written as
r(x)
=
d1x + (1/2)(d2
-
di)x 2 + g(x),
where g is a function satisfying g'(O+) = 0 g'(1-). In essence, the quadratic terms in (2.13) model d1 x + (1/2)(d2 - d1 )x 2 and the cosine terms model g. But, since g is smooth near 0 and 1, it may be estimated by a cosine series without boundary effects. Figure 2.19 illustrates the kind of improvement that can be obtained by including quadratic terms with cosines. The data in this example are fifty observations from a model of the form (2.1) with
r(x)
=
(1 + .Olx 2 )(9.9 + .3sin(27rx)).
Edge effects are clear in the top graph where both a simple truncated series and a Rogosinski estimate have been utilized. These two estimates use five and eight cosine terms, respectively, and the estimate in the bottom graph uses quadratic terms and the cosine terms cos(1rx), cos(21rx) and cos(37rx). So, the quadratic-trigonometric estimate provides an overall better fit than the other two estimates, and is at least as parsimonious.
2.6 Other Smoothing Methods
2. 6.1 The Duality of Approximation and Estimation We have discussed two basic approaches to nonparametric function estimation: kernel methods and Fourier series. Both of these methods have counterparts in the field of approximation theory, where the goal is to approximate a function r over an interval [a, b] given values r(xi), . .. , r(xn) at the points a :S x 1 < · · · < Xn :S b. Conversely, any approximation method can be applied in an analogous statistical estimation problem where one observes not the actual function values, but "function values +noise." The isomorphism of function approximation and estimation has been noted and used to advantage by Gray (1988). 1, ... 'n, and, for each X E [a, b], let r(x; TI, ... ' Define Ti = r(xi), i rn) be an approximation to r(x) depending on r 1, ... , Tn. In a statistical estimation problem where Yi = r(xi) + Ei, i = 1, ... , n, we may estimate r(x) by
fn(x)
=
r(x; Y1, ... , Yn).
Many approximation schemes are linear in the sense that n
(2.14)
r(x; rl, ... 'Tn)
=
L Wi(x; n)ri, i=l
36
2. Some Basic Ideas of Smoothing
"i
.••iiee-
ci
-------~ ..
/
,'/
• '/J'
·~
•;,l' • 0
ci
>-
-·
.. -··
·.··!
. /," ,,
00
,•/
o)
.:-.•·
\
<0
..
•
/,'
//
~--~-~--~:
'
o)
0.2
0.0
1.0
0.8
0.6
0.4 X
"i
...
ci
'
",........ •.
'1!.
..
0
ci
...
00
o)
.. •.:
..
>-
..
<0
o)
0.0
0.2
0.6
0.4
0.8
1.0
X
FIGURE 2.19. Illustration of Boundary Effects in Fourier Series Estimates. In each graph the solid line is the true curve. In the top graph the dotted line is the simple series estimate f'(- ; 5) and the dashed line is the Rogosinski estimate fR(·; 8). In the bottom graph the dotted line is a quadratic-trigonometric estimate with the cosine terms cos(1rx), cos(27rx) and cos(37rx).
where the weights wi(x; n) do not depend upon r1, ... , rn· This implies that
and hence that
E (fn(x))- r(x)
=
r(x; r1, ... , rn)- r(x).
The quantity f(x; r 1, ... , rn) -r(x) is what the numerical analyst calls "error" and the statistician calls "bias." So, when the approximation is linear,
2.6. Other Smoothing Methods
37
improvements in "error" obtained with a new numerical method immediately translate into bias reduction in an analogous statistical problem. Of course, in the statistical problem we must beware that bias reduction is not offset by too large an increase in the variance of r(x; E1, ... , En)· A number of different approximation methods have been or could be used in nonparamteric function estimation. In addition to kernel methods and Fourier series, these include local polynomials, splines, ratios of trigonometric polynomials, and wavelets. Each of these methods will be discussed briefly in this section. The primary purpose of this book is to explore the use of smoothing in testing for lack of fit of parametric regression models. Since this topic is relatively new, it seems sensible that one explore as far as possible the use of relatively simple smoothers. This is the approach we take by focusing on trigonometric series estimators and, to a lesser extent, kernel estimators. However, our focus is not meant to imply a preference for these methods over the ones to be described in this section. In fact, one can make a strong argument for the superiority of local polynomial smoothers and/ or wavelets over kernel and trig series methods of estimation. Likewise, it may well be true that local polynomials and/or wavelets will ultimately prove to be better tools for checking the fit of models.
2.6.2 Local Polynomials If the function r has two continuous derivatives, then for each x E [0, 1] we have
r(u) : : : : r(x)
+ (u-
x)r'(x)
for all u in a small neighborhood of x. In other words, r is approximately linear in a neighborhood of x. This suggests that we estimate r by fitting straight lines locally to the data. Suppose that we have data (x1, Y1), ... , (xn, Yn) from the model (2.1). Let K be a probability density function that is unimodal, symmetric about 0 and supported on ( -1, 1), and define
D(bo, b1; x) =
~ f;;t (Yi- bo- b1(xi- x)) 2 K (X- hXi) -
,
where h > 0. Now choose the values of b0 and b1 that minimize D(b 0 , b1 ; x). Calling these values b0 ( x) and h1 ( x), respectively, the local linear estimator of r(x) is h0 (x). The slope, h1 (x), may be used to estimate the derivative
r'(x). The local linear estimate of r(x) solves a weighted least squares problem. The only data used in this problem are those for which xi is within a bandwidth h of x. Since K is unimodal and symmetric about 0, the squared 2 differences (Yi - b0 - h (Xi - x)) receiving the largest weight are those for which Xi is closest to x.
38
2. Some Basic Ideas of Smoothing
It is easy to verify that the estimate
b0 ( x)
has the form
L~=l wi(x)Yi L~=l wi(x) '
(2.15) where
and
k
=
1,2.
The form of estimate (2.15) is reminiscent of the Nadaraya-Watson kernel estimate discussed in Section 2.3. In fact, there is a connection between the two, as the Nadaraya-Watson estimate also solves a local least squares problem. Let a(x) be the value of a that minimizes 2
n t;(Yi-a) K
(
x T
x·
)
.
It follows that a(x) is precisely the Nadaraya-Watson estimate based on kernel K. So, whereas the local linear estimate corresponds to fitting lines locally, the N adaraya-Watson estimate corresponds to fitting constants locally. The concept of local constant and local linear estimators generalizes in an obvious way to local polynomial estimators. For example, the lo~al quadratic estimator is Q(x), wh~re Q('!f) = b0 (x) + b1 (x)(u x) + b2 (x)(u- x) 2 for each u, and (b 0 (x), b1 (x), b2 (x)) minimizes
t
(Yi- bo- bt(Xi- x)- bz(Xi- x) 2
r
K
(X~ Xi).
The advantage of using a local polynomial of degree more than 0 (N adarayaWatson estimate ) or 1 (local linear estimate) is that estimator bias can be reduced when the underlying curve r is sufficiently smooth. For example, the local quadratic estimator will have a smaller bias (at least for large n) than the local linear estimator when r has three continuous derivatives. For a complete discussion of mean squared error properties of local polynomials, see Fan (1992). The local linear estimator has a noteworthy property not possessed by kernel estimators. The estimator (2.15) automatically accounts for boundary effects ofthe type discussed in Section 2.5.1. This property is explained by noting that (2.15) may be written as L~=l wi(x)Yi and that the weights
2.6. Other Smoothing Methods
39
Wi (x) satisfy n
n
i=l
i=l
(2.16) for each x in [0, 1]. Recall from Section 2.5.1 that the bias reduction property of boundary kernels derives from the fact that these kernels integrate to 1 and have a zero first moment. Effectively, the weights used in the local linear estimate satisfy these same conditions, inasmuch as (2.16) is satisfied for all x E [0, 1]. To get an idea of how well the local linear estimate corrects boundary problems, consider Figure 2.20. Here we compare a Nadaraya-Watson estimate with a local linear one in a case where r(x) = 9.9 + .3 sin(2nx). Both estimates use the Epanechnikov kernel and a bandwidth of .14. The estimates are identical except in the boundary region where the local linear estimate is superior to the Nadaraya-Watson. The fact that they are the same for x E [h, 1 - h] is no accident. Whenever the design points are evenly spaced, the Nadaraya-Watson and local linear estimates having the same symmetric kernel and the same bandwidth will be identical at each design point between h and 1 - h. This follows from the fact that
•
C\1
c:i ,...
..
0 0
• •• • • •~--.-..... ~. • ...••-;·" •
.
,•' I I
,... I
~'~. •
•
7 I
>.
~
e
•· I
•
I
<X)
o)
•
c.o Q)
•
• •• 0.0
0.2
0.6
0.4
0.8
1.0
X FIGURE 2.20. Boundary Improvement Obtained with a Local Linear Estimate. The solid line is the true curve from which the data were generated. The dotted line is a Nadaraya-Watson estimate with Epanechnikov kernel and bandwidth, .14. The dashed line is a local linear estimate based on the same kernel and bandwidth.
40
2. Some Basic Ideas of Smoothing
mn, 1 (xi) = 0 for such design points whenever the xj's are evenly spaced. Many of our examples throughout the book will use evenly spaced design points. In these examples we will often use a local linear estimate, since doing so is perhaps the simplest and most convenient way of producing a boundary-adjusted kernel estimate. It is natural to inquire about the relative merits of Gasser-Muller, N adaraya-Watson and local linear estimators that utilize the same kernel. Important papers addressing this question are those of Fan (1992) and Jones Davies and Park (1994). Fan (1992) shows that when the design points are fixed (the case so far considered in this chapter), the large sample mean squared error of the Gasser-Muller and local linear estimators are identical. On the other hand, when the design points are independent and identically distributed random variables, Fan (1992) proves that the large sample mean squared error of the local linear estimator is strictly less than that of the Gasser-Muller estimator, implying that the latter estimator is inadmissible (at least asymptotically). The Nadaraya-Watson estimator is sometimes better and sometimes worse (in terms of large sample mean squared error) than either the Gasser-Muller or local linear estimator. We shall have more to say about mean squared error properties of smoothers in Chapter 3.
2. 6. 3 Smoothing Splines A spline is a piecewise polynomial constructed in such a way that it is smooth at the points, called knots, at which two polynomials are pieced together. Splines can be used to approximate virtually any smooth function, at least if a sufficiently large number of knots is used. This property makes splines well suited for the nonparametric regression problem. Here we shall only discuss the so-called smoothing splines, which arise as the solution to a certain optimization problem. For a comprehensive treatment of splines in nonparametric regression, see Eubank (1988). Let W2[a, b] be the class of all functions that are continuously differentiable on the interval [a, b] and have a second derivative that is square integrable on [a, b]. Suppose that we have data (x 1 , Yi), ... , (xn, Yn) from model (2.1) and that we are willing to assume that r is in W2 [0, 1]. For any function gin W2 [0, 1], define the criterion E>.(g) by 1
E>.(g)
=
2:: (Yi- g(xi)) n n
-
i=l
2
t
+A Jn [g''(x)] o
2
dx,
where A is a positive constant. For a given A > 0 it seems reasonable to estimate r by the function in W 2 [0, 1] that minimizes E>.(g). The term 2 n- 1 I:~=l (Yi - g(xi)) provides a measure of how well g fits the data, 1 2 whereas A J0 [g" (x) ] dx measures the smoothness of g. The constant A reflects the relative importance of fit and smoothness of g; a small A means
2.6. Other Smoothing Methods
41
that fit is more important, whereas a large A means that smoothness of function is more important than fit. The minimizer, call it ?J>.., of E;...(g) turns out to be a spline. In particular, g;... is a spline with the following properties: 1. It has knots at x1, ... , Xn· 2. It is a cubic polynomial on each interval [xi-l, xi], i 3. It has two continuous derivatives.
=
2, ... , n.
The function g;... is called the cubic smoothing spline estimator of r. It is worthwhile to consider the extreme cases A = 0 and A = oo. It turns out that if one evaluates g;... at A = 0, the result is the (unique) minimizer of 2 J~ [g"(x)] dx subject to the constraint that g(xi) = Yi, i = 1, ... , n. This spline is a so-called natural spline interpolant of Y1 , ... , Yn. At the other extreme, lim;..._, 00 g;... is simply the least squares straight line fit to the data (x1, Y1), ... , (xn, Yn)· The cases A = 0 and A = oo help to illustrate that A plays the role of smoothing parameter in the smoothing spline estimator. Varying A between 0 and oo yields estimates of r with varying degrees of smoothness and fidelity to the data. An advantage that the smoothing spline estimator has over local linear and some kernel estimators is its interpretability in the extreme cases of A = 0 and oo. When based on a finite support kernel K, Nadaraya-Watson kernel and local linear estimators are not even well defined as h ----t 0. Even if one restricts h to be at least as large as the smallest value of xi - Xi-1> these estimates still have extremely erratic behavior for small h. They do, however, approach meaningful functions as h becomes large. The Nadaraya-Watson and local linear estimates approach the constant function Y and the least squares straight line, respectively, as h ____, oo. The Gasser-Muller estimator has a nice interpretation in both extreme cases. The case h ----t 0 was discussed in Section 2.3.1, and Hart and Wehrly (1992) show that upon appropriately defining boundary kernels the Gasser-Muller estimate tends to a straight line as h ----t oo.
2. 6.4 Rational Functions A well-known technique of approximation theory is that based on ratios of functions. In particular, ratios of polynomials and ratios of trigonometric polynomials are often used to represent an unknown function. One advantage of this method is that it sometimes leads to an approximation that is more parsimonious, i.e., uses less parameters, than other approximation methods. Here we consider the regression analog of a method introduced in probability density estimation by Hart and Gray (1985) and Hart (1988). Consider approximating r by a function of the form (2.17) rp q(x)
'
=
+ 2 'L::j= 1 {3j cos( njx) , j1 + a 1 exp(nix) + · · · + ap exp(nipx) j2 f3o
0:::; x:::; 1,
42
2. Some Basic Ideas of Smoothing
where the aj's and fJk 's are all real constants. If p = 0, rp,q is simply a truncated cosine series approximation as discussed in Section 2.4. Those familiar with time series analysis will recognize r p,q as having the same form as the spectrum of an autoregressive, moving average process of order (p, q). It is often assumed that the spectrum of an observed time series has exactly the form (2.17). Here, though, we impose no such structure on the regression function r, but instead consider functions rp,q as approximations to r. In the same sense a function need not have a finite Fourier series in order for the function to be well approximated by a truncated cosine series. The representation (2.17) is especially useful in cases where the regression function has sharp peaks. Consider a function of the form (2.18)
g(x; p, e) = 11
+ 2p cos(O) exp(1rix) + p2 exp(27rix)l- 2 ,
where 0 < p < 1 and 0 .::; interval [0, 1]. When arccos( 1
0.::; x .::; 1,
e .::; 1r. This function has a single peak in the
~~2 )
.::;
e.::; arccos( 1 ~p 2
),
the peak occurs in (0, 1) at x = 1r- 1 arccos{- cos(0)(1 + p 2 )/(2p) }; otherwise it occurs at 0 or 1. The sharpness of the peak is controlled by p; the closer p is to 1, the sharper the peak. Based on these observations, a rough rule of thumb is that one should use an approximator of the form r 2 k,q when approximating a function that has k sharp peaks in (0, 1). One may ask what advantage there is to using a rational function approximator when sharp peaks can be modeled by a truncated Fourier series. After all, so long as the function is continuous, it may be approximated arbitrarily well by such a series. In cases where r has sharp peaks, the advantage of (2.17) is one of parsimony. The Fourier series of functions with sharp peaks tend to converge relatively slowly. This means that an adequate Fourier series approximation to r may require a large number of Fourier coefficients. This can be problematic in statistical applications where it is always desirable to estimate as few parameters as possible. By using an approximator of the form rp,q, one can often obtain a good approximation to r by using far fewer parameters than are required by a truncated cosine series. The notion of a "better approximation" can be made precise by comparing the integrated squared error of a truncated cosine series with that of rp,q· Consider, for example, approximating r by the function, call it which has the form r 2 ,m_ 2 and minimizes the integrated squared error I(r2 ,m-2, r) with respect to (30 , ... , !3m- 2 , a 1 and a 2 . Then under a variety of conditions one may show that
r;,,
I
lim
m-+oo
I(r;,,r) I(r(·; m), r)
=
c
< 1,
2.6. Other Smoothing Methods
43
where r( · ; m) is the truncated cosine series from Section 2.4. Results of this type are proven rigorously in Hart (1988) for the case where p is 1 rather than 2. An example of the improvement that is possible with the approximator r 2 ,q is shown in Figure 2.21. The function being approximated is
1155 - - x 5(1- x) 50 1050 '
r(x)
(2.19)
0 :s; x :S 1,
which was constructed so as to have a maximum of 1. The approximations in the left and right graphs of the bottom panel are truncated cosine series based ~n truncation points of 7 and 30, respectively. The approximator in the top left graph is one of Lhe form r 2 ,5 , and that in the top right is rz,s. The two left-hand graphs are based on the same number of parameters, but obviously r 2 ,5 yields a far better approximation than does r(·; 7). The significance of the truncation point m = 30 is that this is the smallest value
co 0
co 0
8.... 0
0
0
0
0.0
0.8
0.4
0.0
0.4
0.8
0.0
0.4
0.8
co 0
0
0
0
0
0.0
0.8
0.4 X
X
FIGURE 2.21. Rational and TI:uncated Cosine Series Approximators. In each graph the solid line is the true curve and the dashed one the approximator. The top graphs depict rational function approximators of the form r2,5 and r 2 ,8 on the left and right, respectively. The bottom graphs show truncated cosine series with truncation points of 7 and 30 on the left and right, respectively.
44
2. Some Basic Ideas of Smoothing
of m for which I(r(·; m), r) < J(r 2,5 , r). We also have J(r2,s, r) ~ .5 I(r(·; 30), r), which is quite impressive considering that r 2,8 uses only one-third the number of parameters that r(·; 30) does. In practice one needs a means of fitting a function of the form r p,q to the observed data. An obvious method is to use least squares. To illustrate a least squares algorithm, consider the case p = 2, and define g(x; p, B) as in (2.18). For given values of p and B, the model
r(x)
~ g(x; p, 8) {fio + 2 ~ fi; co"(~jx)}
is linear in (3 0 , . .• , (3q, and hence one may use a linear routine to find the least squares estimates ~0 (p, B), ... , ~q(p, B) that are conditional on p and B. A Gauss-Newton algorithm can then be used to approximate the values of p and B that minimize
t [Y; -
g(x;; p, 8)
{ilo(p, 0) + 2 ~ !J;(p, 8) coe(~jx;)}]'
This algorithm generalizes in an obvious way to cases where p > 2. Usually it is sufficient to take p to be fairly small, since, as we noted earlier, p/2 corresponds roughly to the number of peaks that r has. Even when r has several peaks, an estimator of the form r 2 ,q will often be more efficient than a truncated cosine series. In particular, this is true when the rate of decay of r's Fourier coefficients is dictated by one peak that is sharper than the others.
2. 6. 5 Wavelets Relative newcomers to the nonparametric function estimation scene are smoothers based on wavelet approximations. Wavelets have received a tremendous amount of attention in recent years from mathematicians, engineers and statisticians (see, e.g., Chui, 1992). A wavelet approximation to a function defined on the real line makes use of an orthogonal series representation for members of L 2(W), the collection of functions that are square integrable over the real line. (Throughout the book W and Wk denote the real number line and k dimensional Euclidean space, respectively.) What makes wavelets so attractive is their tremendous ability to adapt to local features of curves. One situation of particular interest is when the underlying function has jump discontinuities. Without special modification, kernel, cosine series and local polynomial estimators behave quite poorly when jumps are present. By contrast, wavelets have no problem adapting
2.6. Other Smoothing Methods
45
to jump discontinuities. In addition, wavelets are good at data compression, in that they can often approximate nonsmooth functions using far fewer parameters than would be required by comparable Fourier series approximators. In this way wavelets have a similar motivation to the rational functions discussed in Section 2.6.4. Wavelet approximators of functions are orthogonal series expansions based on dilation and translation of a wavelet function'¢. Given a function r that is square integrable over the real line, this expansion takes the form ()()
(2.20)
L
r(x) =
Cj,k 2j/ 2 '¢(2jx- k),
j,k=-00
where cj,k = 2j/ 2
/_:
r(x)'¢(2jx- k) dx.
The function '¢ is called an orthogonal wavelet if the collection of functions {2j/ 2 7j;(2jx- k)} is an orthonormal basis for £ 2 (~). In a statistical context, sample wavelet coefficients Cj,k are computed from noisy data and used in place of Cj,k in a truncated version of the infinite series (2.20). The adaptability of wavelets comes from the fact that the orthogonal functions in the expansion (2.20) make use of dilation and translation of a single function. By contrast, the trigonometric series discussed in Section 2.4 use only dilation, i.e., scaling, of the function cos(nx). To see more clearly the effect of translation, consider using the simple Haar wavelet to represent a square integrable function r on the interval [0, 1]. The Haar wavelet is '¢H(x) =
1, -1, { 0,
ifO:::;x<1/2 if 1/2::::; X< 1 otherwise.
Using the fact that '¢H has compact support [0, 1], the representation (2.20) is, for each x E [0, 1], ()() 2j -1
r(x) =co+
L L
Cj,k 2jf 2 '¢H(2jx- k).
j=O k=O
That wavelets adapt to local features so readily derives from the local nature of '¢(2jx - k). If'¢ has support (0, 1), '¢(2jx - k) has support (2-jk, 2-j(k + 1)). When approximating r(x) over an interval (a, b), the wavelet is a linear combination of only those functions '¢ ( 2j x - k) whose supports intersect (a, b). Contrast this with a cosine series, in which the approximation is a linear combination of cosines, each of which has infinite support. For some functions the global support of the cosines means that a lot of cancellation has to take place in order for a sum of cosines to yield
46
2. Some Basic Ideas of Smoothing
a good approximation. This in turn means that a large number of terms may be needed in the cosine series. By returning to the example in Section 2.6.4, we can illustrate the phenomenon just discussed. Since the function (2.19) has a lot of curvature over the interval (0, .25), but is relatively flat elsewhere, we expect lcj,kl to be largest when [2-jk, 2-j(k+ 1)) intersects (0, .25). Haar wavelet coefficients for (2.19) are plotted in Figure 2.22. For a given j, or resolution level, lcj,kl is plotted versus the midpoint of the interval [2-j k, 2-j (k + 1)], k = 0, ... , 2j - 1. One can see that the largest coefficients are indeed those of the orthogonal functions whose supports intersect (0, .25). A wavelet approximation of (2.19) of the form
rH(x)
L
=
Cj,k2j/ 2 '1/JH(2jx- k)
(j,k)ES is plotted in Figure 2.23. The indices included in the set S correspond to the twelve largest coefficients in Figure 2.22. The bottom graph in Figure 2.23 is a cosine series of the form
rp(x)
=
¢0
+2L
¢j cos(1rjx),
jES'
where the set {0} U S' contains the indices of the twelve largest-inmagnitude Fourier coefficients. Even though the approximators rH and rp use the same number of parameters, the integrated squared error of the wavelet is .000671, whereas that. of the cosine series is .001929. Notice that the wavelet approximation in Figure 2.23 has a stairstep quality. This is due to the discontinuity of the Haar wavelet. The recent advances in wavelet theory have come after the discovery of smooth orthogonal wavelets, an example of which is the Daubechies (1988) wavelet. (See also Mallat, 1989 for more on construction of wavelet orthogonal bases. ) Smooth wavelets lead to approximations that are more aesthetically appealing than the one in Figure 2.23. Suppose we have data from model (2.1) with the Xi's evenly spaced. Then a consistent estimator of the wavelet coefficient Cj,k is
A wavelet estimator of the regression function r takes the form
L
Cj,k
2j/ 2 '1jJ(2jx- k),
(j,k)ES where Sis a set of ordered pairs of integers. Note that the value of j should not exceed log n / log 2, since otherwise the support of 'ljJ (2j x - k) has finer resolution than 1/n. One means of choosing S is to use a thresholding
2.6. Other Smoothing Methods j=0
j=1
~
~ 0
0
"
·~
!;: Q)
0
u
"i5 ~ ·~
47
" Q)
'13 !;:
"'0 0
Q)
8
..
"i5 Q)
·~
0
0
0
"'0 0
.. 0
0
0
0
0
0.0
0.2
0.4
0.6
0.8
0.0
1.0
0.2
0.4
0.6
0.8
1.0
0.8
1.0
j=3
j =2
"'0
~ 0
"'0
"'0
.
"0
0
0
~
0
0
0
0
0
0
0.0
0.2
0.4
0.6
0.8
0.0
1.0
0.2
0.6
0.4
X
j=4
j =5
"'0
"'0
" Q)
'13
"'0 0
:m0
"
"i5 Q)
"0
·~
0
0
"'0 0
. ~
.........................
0
0
0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
FIGURE 2.22. Absolute Value of Haar Wavelet Coefficients of Function (2.19) at First Six Levels of Resolution.
48
2. Some Basic Ideas of Smoothing ~ <Xl
ci <0
ci
-2
'
ci C\1
ci 0
ci
0.0
0.2
0.4
0.6
0.8
1.0
0.6
0.8
1.0
X
<Xl
ci
~
'
ci
0
ci
···-···
0.0
0.2
0.4 X
2.23. Haar Wavelet (Top) and Cosine Series (Bottom) Approximations to the Function (2.19).
FIGURE
scheme in which (j, k) is included inS if and only if icJ,kl exceeds some threshold A (Donoho and Johnstone, 1994). The quantity A is the smoothing parameter for such a wavelet estimator. Small values of A allow inclusion of more coefficients and lead to rougher estimates, whereas large A's lead to smoother estimates. Thresholding is a fundamentally different way of selecting a series estimator than is truncation. In a statistical context, an advantage of truncation is that it provides more protection against inclusion of spurious high frequency terms in the model. An advantage of a thresholding estimator is that it leads to smaller bias than a truncated series based on the same number of parameters. For example, the Fourier series approximator rp
2.6. Other Smoothing Methods
49
in Figure 2.23 has a smaller integrated squared error than the truncated Fourier series r(· ; 11), whose ISE is .003783. We shall have more to say about the relative merits of thresholding and truncation in the testing context of Chapter 7.
3 Statistical Properties of Smoothers
3.1 Introduction The smoothers discussed in Chapter 2 provide very useful descriptions of regression data. However, when we use smoothers to formally estimate a regression function, it becomes important to understand their statistical properties. In this chapter we discuss issues such as mean squared error and sampling distribution of an estimator, and using smoothers to obtain confidence intervals for values of the regression function. We will consider two types of smoothers: Gasser-Muller type kernel estimators and tapered Fourier series. We choose to focus on Gasser-Muller rather than NadarayaWatson type kernel smoothers since the latter have a more complicated bias representation. Among Fourier series estimators the emphasis will be on truncated series estimators, since there are certain practical and theoretical advantages to using an estimator with a discrete smoothing parameter. Throughout most of this chapter we will assume that the data Y1, ... , Yn are generated from a model of the form
Yi
=
r(xi)
+ Ei,
i = 1, ... , n,
in which the error terms are uncorrelated and have mean 0 and common variance 0" 2 . In addition we will assume that the design points are generated by a positive, Lipschitz continuous density function f in the sense that, for all n, "= Q( i
X"
-n1/2) '
i = 1, ...
,n,
where
Q(u)
= F- 1 (u)
and
F(x) =
lax f(u)du,
0 SuS 1.
(A function g is said to be Lipschitz continuous over the interval [a, b] if there exists a constant C such that
jg(u)- g(v)l S Clu- vj, 50
V u, v
E
[a, b].)
3.2. Mean Squared Error of Gasser-Muller Estimators
51
Additional conditions on the regression function r and the error terms will be imposed as we proceed throughout the chapter.
3.2 Mean Squared Error of Gasser-Muller Estimators To simplify notation we shall denote the Gasser-Muller estimator rfM (x) by rh(x ). Until further notice we assume the following conditions on the kernel K: 1. K has support ( -1, 1). 2. K is Lipschitz continuous. 3. 1. 1 K(u) du 4.
J2 J2 1 uK(u) du =
0.
For future reference we also define the constants J K and
1 1
JK
=
1
CTJc by
1
K 2 (u) du
and
CTJc
=
-1
u 2 K(u) du.
-1
The second moment of K is denoted CTJc to conform with usual notation, but note that CTJc could be negative since conditions 1-4 allow K to take on negative values. The results to be presented in this section were first developed by Gasser and Muller (1979).
3.2.1 Mean Squared Error at an Interior Point As a means of measuring how close the estimator rh (x) is, on average, to
r(x), we consider the mean squared error, i.e.,
M(x; h)
=
E (rh(x)- r(x)) 2
=
Var(fh(x))
+ (E(rh(x))- r(x)) 2 .
There are two categories of factors affecting size of the mean squared error: those associated with the probability model from which the data arise, and those associated with the estimator. Relevant model factors are the smoothness of the regression function r near x, the error variance CT 2 and the size of the design density f near x. When using the Gasser-Muller smoother, the bandwidth h and kernel K are the estimator factors affecting the size of mean squared error. To gain insight about the effect of each of these factors, we will investigate the behavior of M(x; h) as the sample size n tends to oo. It turns out that in order for rh(x) to be mean squared error consistent as n __, oo, it is necessary, in general, for the bandwidth h to tend to 0, but slowly enough that nh __, oo. In this section we will avoid boundary issues by assuming
52
3. Statistical Properties of Smoothers
that x is an "interior point," i.e., a point that is in the open interval (0, 1) and such that his smaller than min(x, 1 - x). We first consider the variance of rh(x), which is
V(x; h) = Var(rh(x))
~a'~ (~ L K (x~u)du) =
• I
!72 t(si - Si-1)2
~2 K2
( x
2
~xi) '
where xi E [si_ 1, si], i = 1, ... , n. This expression for the variance provides a starting point, but does not make clear the effect of the design, the bandwidth and the kernel. The following theorem provides more insight by giving a simple form for the limiting variance as n --+ oo.
Theorem 3.1. Assume that the regression model and design conditions of Section 3.1 hold, and let x be in the open interval (0, 1). Then the variance V(x; h) of the Gasser-Muller estimator rh(x) has the following form as n --+ oo, h --+ 0 and nh --+ oo: !72 1 V(x; h)= nh f(x) JK PROOF.
Defining Q[(n + 1/2)/n]
+ O(n
= 1,
-1
)
+ O(nh)
-2
.
we have
V(x; h)= !72h-2 f)si- Si-1)2K2 (x
~xi)
•=1
I
= Vn,h + En,h, where
and
By the mean value theorem
Q ( i +n1/2) _ Q ( i -n1/2) = Q'(ui)
n-1,
0o~o
lVlean ;:;quared JC;rror ot Gasser-Miiller l:<Jstimators
where (i- 1/2)/n ::; ui ::; (i + 1/2)/n, i = 1, Q'(u) = 1/f[Q(u)] for all u, we then have
0
0
0,
53
no Using the identity 1
1
f[Q(ui)] n 1 1 J(xi) -;:;,' where
It follows that
Xi ::; Xi ::; Xi+lo
where
Now consider
where
si-1 ::; x~
::; si, i = 1,
0
0
0,
no Combining the steps above,
where
Making the change of variable z = ( x - t) j h in the above integral, we have 0'2
Vn,h
= -h n =
0'2
nh
lx/h
f(
(x-1)/h
11
1 X -
1
h ) K (z) dz + Rn,h + R~,h z
_ f(x _ hz) K 1
11
0'2 1 = nh f(x) _
1
K
2( )
2
*
2( )
z dz
z dz + Rn,h + Rn,h
+ 0 (n
-1) +
Rn,h
* + Rn,ho
54
3. Statistical Properties of Smoothers
Combining all the previous work yields
V(x; h) =
:~
ftx) JK
+ O(n- 1 ) + Rn,h + R~,h + En,h·
We still need to show that the remainder terms Rn,h, R~,h and En,h are negligible relative to (nh)- 1 . Doing so involves similar techniques for each term, so we look only at Rn,h, for which
a- 2 n 2
(
-
•=1 2 C 1 a- 2 n -< - -h-2 "(s·L...- ' s·,_ 1)K n n i=1
*) lx·- x-1* - -*)
x- xi h
-(
-
'
'
x- xi h
Similarly, one can show that
and so
0
Notice that in order for the variance to tend to 0, it is necessary for n to tend to oo. Also, for a given n, the variance tends to be larger the smaller h is. This is intuitively plausible since a smaller h means that fewer points are being averaged. We will see shortly that in order for bias to tend to 0, it is necessary that h--+ 0. Theorem 3.1 implies that even when h becomes small, V(x; h) will tend to 0 as long as nh --+ oo. This begins to quantify the trade-off between variance and bias that we encountered in Chapter 2. Theorem 3.1 also makes the role of the design density and the kernel K clearer. The kernel affects V (x; h) solely through K 2 , at least asymptotically. We also see that the variance of h (x) will be smallest at points x where the design density is largest. This, of course, is intuitive since
J
3.2. Mean Squared Error of Gasser-Muller Estimators
55
more Yi 's will be averaged at a value of x near which there is a larger concentration of design points. We now turn our attention to the bias E (fh(x)) - r(x), which shall be denoted B(x; h). We have
B(x; h) =
=
18·8 K (x ~ u ) du - r(x) 1 ~ [r(xi)- r(x)] h1 18'i~ K (x ~ u ) du, 8 1 n 1 i~ ~ r(xi) h
n
where the last step follows from the fact that h < min(x, 1- x). It follows that
IB(x; h)l :::; C
max
lx-xil
lr(x) - r(xi)l,
and hence the bias will tend to 0 whenever r is continuous at x and h ---+ 0. In order to further quantify the effect of h on the bias, we need to impose more smoothness conditions on r near the point x. This will allow us to use a Taylor series expansion for r(xi) at design points Xi that are withi;n a bandwidth of x. An approximation for the bias is given in the next theorem. Theorem 3.2. Assume that the regression model and design conditions of Section 3.1 hold, and let x be in the open interval (0, 1). Suppose in addition that r has two continuous derivatives throughout an open interval containing x. Then, as h ---+ 0 and n ---+ oo, the bias of the Gasser-Muller estimator has the representation
E (fh(x))- r(x) = PROOF.
h2
2
r"(x)O'k
+ o(h 2 ) + O(n- 1 ).
We have
E(fh(x))
1
=
h-
=
h- 1
~ r(xi) 1:~
1
K ( x
~ u)
r(u)K ( x
~ u)
du
1 1
+ h- 1 ~ 1:~
1
1
K ( x
1
=
h- 1
r(u)K ( x
du
~ u) [r(xi)- r(u)]
~ u)
du
du
+ O(n- 1 ),
with the last step following from the fact that r has a continuous derivative. Now, 1
[1
h- Jo r(u)K
(X_ U) -h-
du =
11
_ r(x- hv)K(v) dv. 1
56
3. Statistical Properties of Smoothers
Expanding r(x- hv) in a Taylor series about x, we have h2v2
r(x- hv) = r(x)- hvr'(x)
+ - 2 -r"(x(vh)),
where x(vh) is between x and x- hv. Therefore
j l r(x- hv)K(v) dv _
=
r(x)
1
jl
+ 2~ _ v 2 r"(x(vh))K(v) dv, 1
using conditions 3 and 4 on K. Since r 11 and K are continuous, dominated convergence implies that
j_
1
v 2 r"(x(vh))K(v) dv
=
r"(x)
1:
2
v K(v) dv
+ o(1).
1
1
Combining the results above yields
D
Theorem 3.2 shows that, when r ·has two continuous derivatives near x, the bias at x decreases in hat the rate h 2 . Theorem 3.2 also verifies what we pointed out in Chapter 2: If O'k > 0, then at peaks (r"(x) < 0) the kernel estimator will tend to underestimate r(x), whereas at valleys (r"(x) > 0), overestimation occurs. :Furthermore, the bias will be largest in magnitude where lr" (x) I is largest; in other words, the sharper the peak or valley, the larger the bias. A final, very noteworthy, aspect of Theorem 3.2 is that the dominant term in the bias expansion does not depend on the design density. By contrast, the bias of a Nadaraya-Watson estimator does depend on the design (see, e.g., Fan, 1992). The asymptotic design independence of the Gasser-Muller bias is an attractive feature, owing mainly to its simplicity. It is worth noting that the local linear estimator (Section 2.6.2) also has bias that is design-independent to first order. Combining Theorems 3.1 and 3.2 leads to the following corollary x). concerning the mean squared error of
rh (
Corollary 3.1. Assume that the regression model and design conditions of Section 3.1 hold, and let x be in the open interval (0, 1). Suppose in addition that r has two continuous derivatives. Then, as n ----t oo, h ----t 0 and nh ----t oo, the mean squared error of the Gasser-Muller estimator is M(x; h)= M(x; h)+ R(n, h), where
(3.1)
3.2. Mean Squared Error of Gasser-Muller Estimators
57
and
The competing criteria of stability and fidelity are illustrated succinctly in the asymptotic form of the mean squared error. When we make the bandwidth smaller, the squared bias decreases, but the variance increases. To balance the two criteria we could choose h to minimize the mean squared error M(x; h). By considering only the dominant term M(x; h) we can find a sequence {hn} of bandwidths such that
. M(x; hn) hm supn-+oo M(x; h~) ~ 1, for any other sequence
{h~}.
We find hn by solving the equation d
-
dh M(x; h)= 0 for h. This leads to (3.2)
(
0" 2 JK f(x)[r"(x)j20"k
) 1/5
-1/5
n
and
(3.3)
M(x; hn)
~ 1.25J1f'aU 1r"(x)l'i 5
5
415 (
(:)) 1
,-'/
5
The form of hn makes more precise some notions discussed in Chapter 2. Since hn is proportional to 0" 2 / 5 , the more noisy the data, the larger the optimal bandwidth. We also see that the optimal bandwidth is smaller at points x where r has a lot of curvature, i.e., where lr"(x)l is large. These considerations point out that variance and curvature are competing factors. Our ability to detect micro features of r, such as small bumps, is lessened the more noisy the data are. Such features require a small bandwidth, but a lot of noise in the data can make the optimal bandwidth so large that a micro feature will be completely smoothed away. Of course, sample size is also an important factor in our ability to detect fine curve features. In addition to the obvious fact that curve features occurring between adjacent design points will be lost, expression (3.2) shows that the optimal bandwidth is decreasing in n. Hence, given that the bandwidth must be sufficiently small to detect certain curve features, there exists a sample size n below which detection of those features will be impossible when using an "optimal" bandwidth. The design affects hn in the way one would expect; where the design is sparse the optimal bandwidth is relatively large and where dense the bandwidth is relatively small. For cases where the experimenter actually
58
3. Statistical Properties of Smoothers
has control over the choice of design points, expression (3.3) provides insight on how to distribute them. Clearly it is advisable to have the highest concentration of design points near the x at which lr"(x)l is largest. Of course, r" will generally be unknown, but presumably a vague knowledge of where r" is large will lead to better placement of design points than no knowledge whatsoever. An important facet of the asymptotic mean squared error in (3.3) is the rate at which it tends to 0 as n __. oo. In parametric problems the rate of decrease of mean squared error is typically n-1, but in our nonparametric problem the rate is n - 4 15 . It is not surprising that one must pay a price in efficiency for allowing an extremely general form for r. Under the conditions we have placed on our problem, Corollary 3.1 quantifies this price. Having chosen the bandwidth optimally for a given kernel, it makes sense to try to find a best kernel. Expression (3.3) shows that, in terms of asymptotic mean squared error, the optimal kernel problem can be solved independently of the model factors r, f and a-. One problem of interest is to find a kernel K that minimizes subject to the constraint that K be a density with finite support and zero first moment. This calculus of variations problem was solved by Epanechnikov (1969), who showed that the solution is
Jio-'i
The function KE is often referred to as the Epanechnikov kernel. In spite of the optimality of KE there is a large number of kernels that are nearly optimal. The efficiencies of several kernels relative to K E are given in Table 3.1. The quartic and triangle kernels are, respectively,
KQ(u ) = 15 (1- u 2)2 J(-l,l)(u ) and Kr(u) = (1- lui)J(-l,l)(u). 16 One can show that expression (3.3) is valid for the Gaussian kernel even though it does not have finite support. The fact that the relative efficiencies for the quartic, triangle and Gaussian kernels are so close to 1 explains
TABLE 3.1. Asymptotically Optimal Mean Squared Error of Various Kernels Relative to the Epanechnikov
a-2K
Kernel
Epanechnikov Quartic Triangle Gaussian Rogosinski
3/5 .7142857 2/3 .2820948 5/4
1/5 .1428571 1/6 1 .0367611
Relative efficiency
1.0000 1.0049 1.0114 1.0408 .9136
3.2. Mean Squared Error of Gasser-Muller Estimators
59
the oft-quoted maxim that "kernel choice is not nearly as important as bandwidth choice." The kernel termed "Rogosinski" in Table 3.1 is defined by KR(u)
=
(.5
+ cos(?T/5) cos(?Tu) + cos(2?T/5) cos(2?Tu))I(-l,t)(u).
This kernel satisfies conditions 1-4 at the beginning of Section 3.2 but differs from the other kernels in Table 3.1 in that it takes on negative values. On [-1, 1] KR is proportional to the kernel Kf(O, ·)defined in Section 2.4. It is interesting that KR has smaller asymptotic mean squared error than does KE, which is not a contradiction since KE is optimal among nonnegative kernels. We shall have occasion to recall this property of KR in Section 3.3 when we discuss properties of the Rogosinski series estimator.
3.2.2 Mean Squared Error in the Boundary Region Boundary effects can be quantified by considering the mean squared error of an estimator at a point qh, where q E [0, 1) is fixed. Notice that the point of estimation changes as h -----> 0, but maintains the same relative position within the boundary region [0, h). Consider first the mean squared error of a normalized Gasser-Muller estimator rf: (qh) with kernel KN (u ),q
-
K(u)I(-l,q)(u) J~ K(v) dv 1
Using the same method of proof as in Theorems 3.1 and 3.2, it is straightforward to show that (3.4) E
(
rf: (qh)- r(qh) )
2
cr2 1 "' nh f(O)
lq
_ K'fv,q(u) du 1
+
h' [''(O+)]' (I: uKN,,(u) du)
2
Expression (3.4) confirms theoretically what had already been pointed out in Section 2.5.1. The main difference between (3.4) and the corresponding mean squared error at interior points is in the squared bias. The squared bias of the normalized estimator within the boundary region is of order h 2 , rather than h 4 . Also, the effect of r on the bias is felt through the first rather than second derivative. Minimizing (3.4) with respect to h shows that, in the boundary region, the optimal rate of convergence for the mean squared error of a normalized estimator is n- 213 , at least when r' (0+) -=/=- 0. If r'(O+) = 0 and r has two continuous derivatives on [0, 1], then one can show that the squared bias of rf: (qh) is of order h 4 . Suppose now that we employ boundary kernels as described in Section 2.5.1. The main difference between this approach and the normalized estimator is that the boundary kernel Kq satisfies the same moment conditions
60
3. Statistical Properties of Smoothers
as does K; i.e.,
(3.5)
l
and
q uKq(u) du = 0.
-1
Under the same conditions as in Corollary 3.1, the following expansion holds for the mean squared error of a boundary kernel estimator rq,h(qh): (3.6) E (rq,h(qh)- r(qh))
2
1 0
0'2
=
f( )
-
nh
14
~
+
lq
_1
K;(u) du
[r"(O+)]'
[I:
u'K,,(u)du]'
+ o(h4 ) + O(n- 1 ) + O(nh)- 2 . In spite of the similarity of expressions (3.1) and (3.6), there is still a price to be paid in the boundary region. Typically the integral J~ 1 K~ (u) du will be larger than J~ 1 K 2 (u). This implies that the asymptotic variance of rq,h(qh) will be larger than the variance of rh(x) at an interior point x for which f(x) f(O). It is not surprising that the variance of one's estimator tends to be larger in the boundary, since the number of data in (x- h, x +h) (h ::;; x ::;; 1 -h) is asymptotically larger than the number in (0, qh +h) when f(x) f(O). Of course, one remedy for larger variance in the boundary is to put extra design points near 0 and 1. One means of constructing boundary kernels was described in Section 2.5.1. Muller (1991) pursues the idea of finding optimal boundary kernels. Among a certain smoothness class of boundary kernels with support ( -1, q), Muller defines as optimum the kernel which minimizes the asymptotic variance of the mth derivative of the estimator (m 2: 0). For example, if m = 1 the optimum kernel turns out to be (3.7)
Kq(u)
=
6(1
+ u)(q- u)
x
1+5
{
(
1 ) 1+q 3
(
1-q 1 +q )
2
1-q } +10( 1 +q) 2 u
I(- 1 ,q)(u).
At q = 1, Kq(u) becomes simply the Epanechnikov kernel (3/4)(1 u 2 )I( _ 1 ,1 ) ( u). To ensure a smooth estimate, it would thus be sensible to use the Epanechnikov kernel at interior points x E [h, 1 h] and the kernel (3.7) at boundary points. Boundary problems near x = 1 are handled in an analogous way. For x E (1 - h, 1] one uses the estimate
8 Yi h 1 n
1
8
i s;-1
Kq
(
U- X) du,
-h-
3.2. Mean Squared Error of Gasser-Muller Estimators
61
where q = (1 - x)jh and Kq is the same kernel used at the left-hand boundary.
3.2.3 Mean Integrated Squared Error To this point we have talked only about local properties of kernel estimators. A means of judging the overall error of an estimator is to compute the global criterion of integrated mean squared error, which is
1 1
J(rh, r)
=
2
E (rh(x)- r(x)) dx.
The quantity J(rh, r) may also be thought of as mean integrated squared error (MISE) since
J(rh, r)
=
E I(rh, r)
=
E
1 1
2
(rh(x)- r(x)) .dx.
Boundary effects assert themselves dramatically in global criteria such as MISE. Suppose that we use a normalized Gasser-Muller estimator in the boundary region. Then, under the conditions of Corollary 3.1, if either r' (0+) or r' (1-) is nonzero, the integrated squared bias of the GasserMuller estimator is dominated by boundary bias. Let rh denote a GasserMuller estimator that uses kernel K at interior points and the normalized version of K, KN,q, in the boundary. It can be shown that
1
Var (rh(x)) dx
1 ~~~) j_ 1
~~
1
rv
1 2
1
K (u) du,
· which means that the boundary has no asymptotic effect on integrated variance. Consider, though, the integrated squared bias, which may be written as B1 + B2, where
B1 =
{h B 2 (x; h) dx
lo
+ /,
1
B 2 (x; h) dx
1-h
and
1
1-h
B2 =
.
B 2(x; h) dx.
Since the squared bias of rh is of order h 2 (as h ---* 0) in the boundary, the integral B 1 is of order h 3 (unless r 1(0+) = 0 = r'(O- )). Now, B 2 (x; h) is of order h 4 for x E (h, 1- h), and so B 2 is negligible relative to B 1 . It follows that the integrated squared bias over (0, 1) is of order h 3 , and the resulting MISE has asymptotic form 01 nh
3
+ C2h,
62
3. Statistical Properties of Smoothers
which will not converge to zero faster than n- 3 / 4 . In this sense edge effects dominate the MISE of a normalized Gasser-Muller estimator. The boundary has no asymptotic effect on MISE if one uses boundary kernels. Under the conditions of Corollary 3.1, a Gasser-Muller estimator using boundary kernels has MISE that is asymptotic to the integral of M(x; h) over the interval (0, 1). This implies that the MISE converges to 0 at the rate n- 4 / 5 and that the asymptotic minimizer of MISE is - ( <J2Jx fol [f(x)]-1 dx) 1/5 -1/5 hnn . 1 <Jk fo (r" (x) )2 dx
(3.8)
3.2.4 Higher Order Kernels If one assu:rpes that the regression function has more than just two continuous derivatives, then it is possible to construct kernels for which the bias converges to 0 at a faster rate than h 2 . To show how this is done, we first define a kth order kernel K to be one that satisfies
1 1
1
1
K(u) du = 1,
-1
u1 K(u) du = 0,
j
= 1, ... , k- 1,
-1
and 1
uk K(u) du -1- 0.
[ 1
The kernels so far considered have been second order kernels. Notice that kernels of order 3 or more must take on negative values. Suppose that r has k continuous derivatives on [0, 1] and that we estimate r(x) at an interior point x using a Gasser-Muller estimator rh(x) with kth order kernel K. Using a Taylor series expansion exactly as in Theorem 3.2, it follows that (3.9)
E (rh(x))- r(x) =
Assuming that K is square integrable, the asymptotic variance of a kth order kernel estimator has the same form as in Theorm 3.1. Combining this fact with (3.9), it follows that when r has k derivatives, the mean squared error of a kth order kernel estimator satisfies
Choosing h to minimize the last expression shows that the optimal bandwidth hn has the form hn
rv
c n - 1/(Zk+ 1)
0
The corresponding minimum mean squared error of the kth order kernel estimator converges to 0 at the rate n-Zk/(Zk+l). Theoretically, the bias reduction that can be achieved by using higher order kernels seems quite attractive. However, some practitioners are reluctant to use a kernel that takes on negative values, since the associated estimate no longer has the intuitivelr appealing property of being a weighted average. Also, the integral J_ 1 K 2 (u) du tends to be larger for higher order kernels than for second order ones. In small samples where asymptotics have not yet "kicked in," this can make a higher order kernel estimator have mean squared error comparable to that of a second order one (Marron and Wand, 1992).
3.2.5 Variable Bandwidth Estimators The estimators considered in Section 3.2.3 were constant bandwidth estimators, i.e., they employed the same value of hat each x. The form of the optimal bandwidth for estimating r(x) suggests that it would be better to let the bandwidth vary with x. To minimize the pointwise mean squared error, we should let h (as a function of x) be inversely proportional to
{f(x) [r"(x)J
115
2 }
Use of the pointwise optimal bandwidth leads to MISE that is asymptotically smaller than that of the constant bandwidth estimator of Section 3.2.3. Let 1n be the variable bandwidth kernel estimator that uses bandwidth (3.2) at each x, and let 2 n be the constant bandwidth estimator with h equal to (3.8). Then it is easily verified that
r
J(T1n, ~ . r) l 1m n-->oo J(rzn, r)
r
=
11
lr"(x)l2/5 -'-----'----''--'-:-:-::- dx o [f(x)]4/5
X[l ~~:r/5 {l [r"(x)]' r5 dx
Examples in Muller and Stadtmuller (1987) show that this limiting ratio can be as small as 1/2. As a practical matter, knowing that the optimal h has the form (3.2) is not of immediate value since it depends on the unknown function r. Whether one uses a constant or variable bandwidth estimator, it will be necessary to estimate r" in order to infer a mean squared error optimal bandwidth. We will discuss a method of estimating derivatives of regression functions in Section 3.2.6. Muller and Stadtmuller (1987) proposed a
3.2. Mean Squared Error of Gasser-Muller Estimators
63
Choosing h to minimize the last expression shows that the optimal bandwidth hn has the form hn
rv
c n-1/(2k+l).
The corresponding minimum mean squared error of the kth order kernel estimator converges to 0 at the rate n- 2 k/( 2k+l). Theoretically, the bias reduction that can be achieved by using higher order kernels seems quite attractive. However, some practitioners are reluctant to use a kernel that takes on negative values, since the associated estimate no longer has the intuitivelt appealing property of being a weighted average. Also, the integral J_ 1 K 2 (u) du tends to be larger for higher order kernels than for second order ones. In small samples where asymptotics have not yet "kicked in," this can make a higher order kernel estimator have mean squared error comparable to that of a second order one (Marron and Wand, 1992).
3.2.5 Variable Bandwidth Estimators The estimators considered in Section 3.2.3 were constant bandwidth estimators, i.e., they employed the same value of hat each x. The form of the optimal bandwidth for estimating r(x) suggests that it would be better to let the bandwidth vary with x. To minimize the pointwise mean squared error, we should let h (as a function of x) be inversely proportional to {
f(x) [r"(x)]
2}1/5
Use of the pointwise optimal bandwidth leads to MISE that is asymptotically smaller than that of the constant bandwidth estimator of Section 3.2.3. Let hn be the variable bandwidth kernel estimator that uses bandwidth (3.2) at each x, and let 2 n be the constant bandwidth estimator with h equal to (3.8). Then it is easily verified that
r
J(r1n, r) 11 lr"(x)l2/5 . = -'-------'-----'--'-;--;-::-- dx 11m n-t= J(r2n, r) o [f(x)]4/5
x
[1' /~) ]-'/' {[
[r"(x)]'
dx}
-l/S
Examples in Muller and Stadtmuller (1987) show that this limiting ratio can be as small as 1/2. As a practical matter, knowing that the optimal h has the form (3.2) is not of immediate value since it depends on the unknown function r. Whether one uses a constant or variable bandwidth estimator, it will be necessary to estimate r" in order to infer a mean squared error optimal bandwidth. We will discuss a method of estimating derivatives of regression functions in Section 3.2.6. Muller and Stadtmuller (1987) proposed a
64
3. Statistical Properties of Smoothers
method for estimating (3.8) and showed by means of simulation that their data-based variable bandwidth estimator can yield smaller MISE than a data-based constant bandwidth estimator. The dependence of MISE on the design density f raises the question of optimal design. Muller (1984) addresses this issue and derives design densities that asymptotically optimize MISE for constant and variable bandwidth Gasser-Muller estimators. Interestingly, the optimal design density of a constant bandwidth estimator does not depend on local features of r, whereas that of a variable bandwidth estimator does.
3. 2. 6 Estimating Derivatives We have seen that estimation of r" is necessary if one wishes to infer a mean squared error optimal bandwidth for estimating r. Also, in some applications derivatives of the regression function are of at least as much interest as the function itself. For example, in growth studies the derivative of height or weight is important in determining growth spurts and times at which height or weight are changing rapidly. An interesting example of the use of kernel methods in growth studies may be found in Gasser et al. (1984). A Gasser-Muller type kernel estimator of rU<) is (3.10)
rk,h(x)
=
h-(k+l)
~ li 1:~
1
M ( x
~ u)
du,
where M is a kernel with support ( -1, 1) and the si's are as defined before. The kernel M is fundamentally different than ones used in estimating r(x). In order for rk,h (x) to be asymptotically unbiased for r(k) (x), it is necessary that M satisfy 1
(3.11)
1 -1
u
jM( ) d _ { 0, u u- (-1)kk!,
j = 0, 1, ... , k - 1 j = k.
If one is willing to assume the existence of m continuous derivatives (m 2: k + 2), then parallels of (3.1) for rk,h(x) may be obtained by using an M that satisfies (3.11) and also (3.12)
J:
uJM(u)du=O,
j=k+1, ... ,m-1
1
=J
0,
j
=
m.
Of course, as when estimating r itself, it will be necessary for h to tend to 0 in order for the bias of the derivative estimator to tend to 0. It turns out that the variance of rk,h(x) is of order 1/(nh 2k+l). This means that the variance of a derivative estimator will tend to be larger than that of r(x). Furthermore, nh 2k+l must tend to infinity in order for the variance to tend to 0, implying that derivative estimation requires larger
3.3. MISE of Trigonometric Series Estimators
65
bandwidths than does estimation of r. It may be shown that when M satisfies (3.11) and (3.12) and r has m continuous derivatives (m ;:=: k + 2), the mean squared error optimal bandwidth is asymptotic to Cn-l/(Zm+l) and the optimal mean squared error tends to 0 at the rate n-Z(m-k)/(Zm+l). Consider, for example, estimation of the first derivative, and assume that r has three continuous derivatives. If we take m to be 3, then the optimal bandwidth and mean squared error of i\,h (x) are asymptotic to Cn -l/ 7 and C1n- 4 17 , respectively. This makes it clear that nonparametric estimation of derivatives is a much harder problem than estimation of r. For a comprehensive discussion of mean squared error properties and optimal kernels for estimators of the form (3.10), see Gasser, Muller and Mammitzsch (1985). It is natural to ask whether or not the kth derivative of a Gasser-Muller estimator of r is consistent for r(k). The answer is "yes, assuming that the kernel K used in estimatin~ r is sufficiently smooth." Note that when K has k continuous derivatives, r:l has the same form as (3.10) with M = K(k). For the kernels most often used in practice, K(k) will satisfy (3.11) and (3.12) when K has a sufficient number of derivatives that vanish at -1 and 1. Take, for example, the case k = 2. Using integration by parts it is easy to check that K" satisfies (3.11) and (3.12) for m = 4 so long as K is a second order kernel that satisfies K(-1) K(1) = K'(-1) = K'(l) = 0.
3.3 MISE of Trigonometric Series Estimators We showed in Chapter 2 that trigonometric series estimators, as we defined them, are actually kernel estimators of Gasser-Muller form. In fact, there is very little difference between the estimators considered in Section 3.2 and series estimators having a taper of the form W>, (j) = ¢ K ( 7r >.j), j = 1, 2, ... , where ¢K is the characteristic function of kernel K. The only real difference between the two estimators occurs in the boundary region. A series estimator whose smoothing parameter is the series truncation point is another matter. Even though these estimates may be written as kernel estimates, the kernels they employ are often much different than the convolution type kernels considered in Section 3.2. Recall, for example, the Dirichlet and Rogosinski kernels discussed in Chapter 2. Another example of a kernel whose Fourier series is truncated is the Fejer-Korovkin kernel (Butzer and Nessel, 1971, p. 79). Truncated series estimators are interesting from a practical point of view since the set of smoothing parameters that need be considered is well defined and finite. By contrast, when the smoothing parameter is continuous, the data analyst cannot always be sure of the relevant range of smoothing parameters to be considered. This is not a huge difficulty, but it is a nuisance. Another relevant point is that certain theoretical arguments in the lack-of-fit problem are simpler when based on an estimator with a discrete, rather than a continuous, smoothing parameter.
66
3. Statistical Properties of Smoothers
Because of the advantages of truncated series estimators, and so as not to be redundant, we consider only truncated series estimates in this chapter, and for that matter, in the remainder of this book. The estimators considered have the general form m
(3.13)
fm(x) = ¢o + 2 '2: Wm(J)¢j cos('njx), j=l
0:::; x:::; 1.
We also confine our attention to the global error criterion of MISE. This is done for two reasons. First of all, because of Parseval's formula, the integrated squared error of a series estimate has an elegant and compact representation in terms of Fourier coefficients. Pointwise properties, on the other hand, tend to be more awkward to obtain for series estimators than for convolution type kernel estimators. A second reason for focusing on MISE is that the lack-of-fit statistics to be studied in later chapters are motivated by MISE considerations.
3.3.1 The Simple Truncated Series Estimator We first consider perhaps the simplest series estimate m
r(x; m) = ¢o + 2
2: ¢j cos(7rjx). j=l
As we did in Section 2.4, let ¢ 0 , ¢ 1 , ... denote the Fourier coefficients of the function r in the model of Section 3.1. The integrated squared error of r(·; m) may be expressed as 2
m
2
oo
J(f(·;m),r)=(¢o-¢o) +2'2:(¢j-¢j) +2 '2: ¢], j=l j=m+l which follows from Parseval's formula. The MISE of r(·; m) is thus
J(r(· ;m),r) =
E(¢o-
m
2
+ 2
l:E (¢i-
2
j=l oo
m
(3.14)
= Var(¢o) + 2 '2: Var(¢j) + 2 '2: j=l j=m+l m 2
+(¢no- ¢o) + 2 '2: (rPnj- tPi)
oo
+ 2 '2: ¢] j=m+l
¢J
2 ,
j=l where
rPnj =
t
i=l
r(xi) 1s; cos(1rju) du, Bi-1
j = 0, 1, ... , n- 1.
3.3. MISE of Trigonometric Series Estimators
67
Notice that the bias portion of J(f{; m), r) is made up of two parts: the quantity 2 l::~m+l ¢J, which we shall call truncation bias, and a term, (¢no- ¢ ) 2 + 2 (¢nj- ¢j) 2, that we call quadrature bias due to the 0
2::7'= 1
fact that ¢nj is a quadrature type approximation to the integral ¢j. We shall see that for reasonably smooth functions r, the truncation bias is usually much larger than the quadrature bias. The variance of a sample Fourier coefficient J;j is
n
=
0' 2 L(si-
Si-1)
2 cos 2(njx';),
i=1
where xi E [si_ 1 , si], i = 1, ... , n. If we further assume that the design density is strictly positive and Lipschitz continuous on [0, 1], then
(3.15)
l.) Var ('1-'J
0'2 {1 cos2(njx) d
=
n Jo
f(x)
x
·o( -2) n ,
+J
where the term O(n- 2) is bounded uniformly in j. Before investigating the bias of J;j, we define what we mean by a piecewise smooth function. Definition 3.1. The function f is said to be piecewise smooth on [a, b] if f and its derivative are continuous on [a, b] except possibly at a finite number of points where f and/or f' have jump discontinuities. We now have the following theorem on the bias of J;j when r is continuous and piecewise smooth. Theorem 3.3. Suppose r is continuous and piecewise smooth. Then
where the 0 (n - 1 ) term is bounded uniformly in j. PROOF.
We have
E(¢;j)
=
t
r(xi) lsi cos(njx) dx
i=l
=
¢j
+
t
i=1
Si-1
lsi cos(njx)(r(xi)- r(x)) dx. Si-1
68
3. Statistical Properties of Smoothers
Continuity and piecewise smoothness of r imply that
t lsi i=l
cos(njx)(r(xi) - r(x)) dx :S C
t lsi I i=l
Si-1
cos(njx)llxi - xl dx
Si-1
for some constant C. The quantity on the right-hand side of the last statement is bounded by 1
C m?-X lsi- Bi-ll { lcos(njx)l dx l~·~n }0
=a(~), n
with the last step following from the design assumptions of Section 3.1.
D
We now state a theorem on the behavior of the MISE off{; m).
Theorem 3.4. Suppose that r is continuous and piecewise smooth and that the design density f is strictly positive and Lipschitz continuous on [0, 1]. Then if m and n tend to oo with m = o(n), J(f'(-; m), r) = m0'
2
n
PROOF.
1 1
o
00
1 f( x) dx
+ 2 . I::
2 ( 1) ¢j + 0 ~
+0
( m) 2
--;;:
J=m+l
The integrated variance is
0'2 {l 1 n f(x) dx
Jo
20'2 {l
+ ---:;;:
Jo
1 f(x)
~ cos2(njx) dx + 0 m
(
)
:
2
'
by using (3.14) and (3.15). We have m
I::
cos 2 (njx)
=
j=l
m
[1 + cos(2njx)] 2 I:: ]=l
1
1
m
; + 2 I:: cos(2njx), J=l and hence 1
1 0
1 f(x)
m LJ.=l cos (njx) dx = 2 m
2
1 1
0
1 d -- x f(x)
1 ~1 cos(2njx) d -2 L.,.. x· j=l 0 f(x) 1
+
Now,
f j=l
t
Jo
cos(2njx) dx :::; f(x)
f j=l
I{
1
Jo
cos(2njx) dxl f(x) cos( njx) f(x)
3.3. MISE of Trigonometric Series Estimators
~I :::; j=
1
cos(~jx)
{1 Jo
69
dxl
f(x)
< oo, with the very last step following from the absolute convergence of the Fourier series of a continuous, piecewise smooth function (Tolstov, 1962, p. 81). It follows immediately that the integrated variance is
mo-2 t n Jo
_1 dx
f(x)
o(_!_) + o(m)2 n n
+
The integrated squared bias is
which follows immediately from (3.14) and Theorem 3.3. The proof is completed by simply adding the variance and squared bias terms. D Theorem 3.4 shows us that in order for the MISE to tend to 0, we should let m ____, oo, but slowly enough that mjn ____, 0 as n ____, oo. The former and latter conditions ensure that squared bias and variance tend to 0, respectively. The fact that the integrated variance is asymptotically proportional to m/n is our first explicit indication that m is inversely proportional to the bandwidth of a kernel smoother (see Theorem 3.1). To further investigate the MISE of r(·; m) we need to study the behavior This is a well-chronicled problem in of the truncation bias 2 "L:j:m+l the theory of trigonometric series. Let us consider the Fourier coefficient r/Yj, which may be expressed, using integration by parts, as
¢J.
¢j = -
~
1
{
~J Jo
sin(~jx)r'(x) dx,
j = 1, 2, ...
It follows under the conditions of Theorem 3.4 that 00
L ¢J =
o(m-2).
j=m+1
If we assume in addition that r' is piecewise smooth, then
¢j =
(-1)Jr'(1-)-r'(O+) ~2j2
-
1
~ ~ J
1 1
0
cos(~jx)r"(x) dx
and
1cos(~jx)r"(x) 1
dx ____, 0 as j ____, oo.
70
3. Statistical Properties of Smoothers
Therefore CXJ
L
¢]
=
j=m+1
CXJ
L
1f- 4
T 4 [(-1)1r'(1-)- r'(O+)r + o(m- 3 )
j=m+1 CXJ
=
2::
1f-4 {[r'(o+)l2 + [r'(1-)l2}
r4 + a(m-3)
j=m+1
=
[r' (0+ )]2 + [r' (1- )]2 4
3Jr m
3
+ o(m
-3
).
Finally, then, under the conditions of Theorem 3.4 and the further condition that r' is piecewise smooth, we have ffi(J2
(3.16)
J(f{; m), r)
rv-----:;;+
{1
1
Jo f(x) dx
1r;m
3
3
2 {[r'(0+)] + [r'(1-W}
as m, n-+ oo and m/n-+ 0. It follows that the optimal choice mn of m is such that mn
rv nl/4
(-2-
[r'(0+)]2 + [r'(1-)j2 )1/4 1f4CJ2 fo1[1/ f(x)]dx
and the corresponding minimum value of the MISE is
One thing made very clear by (3.16) is the fact that boundary effects dominate the integrated squared bias of the series estimator. The leading term in the integrated squared bias is determined completely by the behavior of r at 0 and 1. Note also that the rate at which the minimum MISE tends to 0 is n- 3 / 4 , the same as that of a normalized kernel estimator. This confirms what we claimed in Chapter 2 about the similarity of boundary effects on cosine series and normalized kernel estimators.
3.3.2 Smoothness Adaptability of Simple Series Estimators The simple truncated series r(·; m) has an ability to adapt to the smoothness of the underlying regression function. This ability is due to the 0-1 taper employed by r(·; m) and is not shared by many other common series estimators, such as the Rogosinski estimator. Note that expression (3.16) does not give the form of the asymptotic MISE when r' (0+) = r' (1-) = 0. To get an idea of what happens in this case, suppose, in addition to
3.3. MISE of Trigonometric Series Estimators
r' (0+) ¢·
J
=
71
= r' (1-) = 0, that rC 3l is piecewise smooth. Then we have [(-1)1+ 1 rC 3l(1-) + rC 3l(O+)]
.
( 1fJ
)4
1
+ (nj)- 4
1 0
cos(7rjx)rC 4 l(x) dx.
If we now proceed exactly as in Section 3.3.1, it is straightforward to show that
J(f{ ;m),r)"' m0'2 {1 ~ + _2_ {[rC3l(o+W + [rC3l(1-W}, n } 0 f(x) 77r8 m 7 in which case the optimal MISE converges to 0 at the rate n- 7 18 . More generally, if a sufficient number of derivatives exist with rC 2k- 1) (0+) = rC 2k- 1l(1-) = 0, k = 1, ... ,e- 1 (e 2: 2), and one of rC 2£- 1l(O+) and rC 2£- 1 l(1-) is nonzero, then the optimal rate for MISE is n-(4 £- 1 )/ 4£. So, as r becomes smoother, the optimal MISE of the simple truncated series converges at a rate that is ever closer to the parametric rate of n - 1 . We will see in the next section that the Rogosinski estimator is not able to exploit smoothness beyond the existence of a second derivative. The ability of the simple series estimator to do so derives from its use of a 0-1 taper. Since the integrated squared bias of r(·; m) is always dominated by truncation bias, the simple series estimator always benefits from an increase in the rate at which Fourier coefficients rPi tend to 0. Use of the 0-1 taper also has its price though, as evidenced by the large side lobes in the associated Dirichlet kernel. It is easy to construct tapers that have both smoothness adaptability and a more aesthetically pleasing kernel function than the Dirichlet. Smoothness adaptability will result whenever wm(j) satisfies wm(j) = 1 for 1 _:::; j :::; em, where c is a constant in (0, 1]. A kernel with smaller side lobes than the Dirichlet is obtained by using a taper that converges to 0 gradually. An example of a taper that yields both smoothness adaptability and a more pleasing kernel than the Dirichlet is
Wm(j)
=
1, 2(1- jfm), { 0,
1 :::; j _:::; m/2 m/2 .S j .S m j > m.
3.3.3 The Rogosinski Series Estimator We now turn our attention to the truncated series estimator whose weights have the form
Wm(j) =
COS (
2:~ 1 )
,
j
=
1, ... ,m.
These weights correspond to the Rogosinski kernel, as defined in Section 2.4. We shall denote the corresponding series estimator of the form (3.13) by TR(·; m).
72
3. Statistical Properties of Smoothers
The integrated squared error of any estimator as in (3.13) is m
2
oo
L
J(fR(·;m),r)=(¢o-¢o) 2 +2'L(wm(J)¢j-¢j) +2 ¢J, j=1 j=m+1 implying that the MISE is
f
2
J(fR(·;m),r)=E(¢o-¢o) 2 +2f:E(wm(J)¢j-¢j) +2 ¢J. j=1 j=m+1 Now, (3.17)
E ( Wm(J)¢j- c/Jj
f
= w;,(j)Var(¢j) + ¢J(1- Wm(J))
2
+ 2¢j(wm(J)- 1)wm(J)(¢nj- ¢j)
+ w;,(j)(¢nj- ¢j) 2 . For the Rogosinski weights and most others of practical interest, the last two terms on the right-hand side of (3.17) are negligible relative to the other two, at least when r is continuous and piecewise smooth. The following theorem gives an approximation to the MISE of the Rogosinski estimator. Theorem 3.5. Suppose that r is continuous and piecewise smooth and that the design density f is strictly positive and Lipschitz continuous on [0, 1]. Then as m and n tend to oo,
+
0(;:: ~
+ O(n- 1 ) +
fj(l- Wm(j))'
f'
o( ::).
PROOF. Using (3.17) and Theorem 3.3, the integrated squared bias portion of the approximation is immediate. The integrated variance is m
2
L
w;,(j)Var(¢j) + O(n- 1),
j=1 and arguing as in the proof of Theorem 3.4,
3o3o MISE of Trigonometric Series Estimators
73
The first term on the right-hand side of the previous expression is
~ n
1 _1_ ~ 1
2 (0) x) ~ wm J o f( = 1 1
=
(J2
n
~
2 (
~ wm
+cos(21rjx)) d 2 x
0) {1 dx Jn f(x)
J
]=1
(1 0
(J2
+
n
~
2 (
~wm
]=1
0) {1 cos(2Jrjx) Jn f(x) dxo
J
0
Just as in the proof of Theorem 3.4 the last quantity is (J
2
--;-
m
2
~ wm(J)
Now we need only investigate follows:
t, w~(J')
o
1
1 0
dX f(x)
+ O(n
-1
)o
I:,j= 1 w~(j), which can be approximated as
= (2m+ 1)
=(2m+ 1)
t, (
1 2 m 2 + 1 ) cos (
[1
2 ;:~ 1 )
112
cos 2 (1ru) du + O(m-
1
)]
m
= 2 + 0(1)0 Combining previous steps yields
from which the result follows upon adding integrated variance and squared hl~o
D
It is of interest to compare the MISE of the estimators f(o; m) and
rR(o; m)o Note that, for a given m, the variance portion of the MISE approximation for rR ( o; m) is only half of the corresponding term for T(o; m) o On the other hand, in addition to truncation bi~, the approximation to J (rR(o; m), r) contains a bias component due to the discrepancy between 1 and wm(J)o To get any further in comparing the two MISEs, we will have to look more closely at the term 2 I:,j= 1 ¢J(l- wm(j)) 2 o We will now ~sume that r' is piecewise smooth, as we did in approximating the truncation bias in Section 30301. Letting {am} be an unbounded sequence of integers such that am = o(m), we have m
2
L j=1
am-1 2
¢](1- Wm(j)) = 2
L
j=1
m
2
¢](1- Wm(j)) + 2
L
¢](1- Wm(j)?o
74
30 Statistical Properties of Smoothers
The weight function may be expressed as 0
Wm(J) = 1for some number am-1
L
bm,j
(
2::~ 1
)2 21 cos(bm,j)
between 0 and 7rj/(2m + 1)0 It follows that
¢J(1- Wm(J))2 ~ (7r4/4)(2m + 1)-4
am-1
L
(i4>j)2
j=1
j=1
As in Section 30301, we may write (-1)jr'(1-)- r'(O+)
4>j = where rJj
--+
0 as j
--+
1f
2 °2
J
+
'T]j
~' J
oo, and hence
m
L
+ o(1)
F
4(1- Wm(J)) 2o
j=arn
2 One may also argue that the term 'E}=am (-1)j F 4(1-wm(J)) is of smaller 2 order than 'E'l=am F 4(1- Wm(J)) , and SO 2
~
~
= 2 [r'(1-)12 + [r'(0+)12 ~ r4(1- Wm(j)?
4>2(1- Wm(J))2
~
1f4
J
j=arn
j=arn
m
+ o(1)
L
F 4(1- Wm(J)) 2o
j=am
Finally, it is easy to verify that, as m m
L
F 4(1- Wm(J)) 2 rv
--+
oo, {1/2
(2m+ 1)- 3
j=am
ln
2 4 u- (1- cos(7ru)) duo
0
Under the same conditions as those needed for (3016), we thus have 1
2
J(fR(o;m),r)rv
~~ 1 f~:) x
+
1r;m3 {lr'(O+)F+[r'(1-)1 3
2 } ·
[1 + ~ li' u-' (1- cos(~u))' du]
~ ~~2
11 !~:)
+
2~!~0~!)
{lr'(0+)12 + [r'(1-)12} 0
3.3. MISE of Trigonometric Series Estimators
75
The corresponding optimal choice m;; of m is such that R
mn
rv
n
(4(5.004) [r'(O+)J2 + [r'(1-)] 4 2 7r cr J01 [1/ f(x)] dx
1; 4
2
114 )
It follows that
m;;
1/4
-+ [2(5.004)] ;::::; 1.78, mn i.e., the optimal truncation point of the Rogosinski estimator is about 1. 78 times that of the simple series estimator. The limiting ratio of optimal MISEs is -
J(r R (··mR) r) ' n ' J (r(-; mn), r)
-+
(1) 3/ 4 (5.004) 1 -
2
1 4 ;::::;
.889,
and so when one of r'(O+) and r'(1-) is nonzero, the Rogosinski is more efficient than the simple series estimator. When r" is piecewise smooth with r' (0+) = r' (1-) = 0, straightforward analysis shows that
(3.18)
J (rR(·; m), r) ""
~~
2
1 ~~:) + : 1 1
1
4 m-
4
[r"(x)]
2
dx,
which is reminiscent of the MISE result in Section 3.2.3. The minimum of (3.18) converges to 0 at the rate n- 4 / 5 , which is the the same as the rate for a boundary adjusted, second order kernel estimator. Of course the MISE result for kernel estimators is obtained without the assumption r' (0+) = r' (1-) = 0. It is worth noting, though, that for any 0 < f < 1/2, mJ.n E
[1
1 _,
2
(rR(x; m) - r(x)) dx]
converges to 0 at the rate n- 4 / 5 so long as r" is continuous and piecewise smooth, even when one or both of r' (0+) and r' (1-) fails to vanish. (This can be shown using analysis as in Hall, 1983 and Eubank, Hart and Speckman, 1990.) It follows that if the Rogosinski estimator is appropriately modified for boundary effects, its MISE will converge at the same rate as that of a second order kernel estimator. One possible boundary modification is to use the data-reflection technique of Hall and Wehrly (1991). It can be shown that, in contrast to what happens with r(·; m), the optimal MISE of the Rogosinski estimator becomes "stuck" at the rate n- 4 15 even though r has a first and possibly higher order derivatives that vanish at 0. The problem is that the integrated squared bias of rR(·; m) is dominated by the discrepancy between 1 and the taper Wm (j) as soon as r" becomes piecewise smooth with r'(O+) = r'(1-) = 0. The truncation bias of the Rogosinski series does become smaller and smaller as the function becomes smoother and smoother, but is negligible in comparison to what
76
3. Statistical Properties of Smoothers
we might call "taper" bias. Another way of comparing the two cases is to recall (from Section 2.5.2) that f{; m) and rR(·; m) may be written as kernel estimators with Dirichlet and Rogosinski kernels, respectively. Now, the Rogosinski kernel is a second order kernel in the sense that it has a nonzero second moment, thus explaining why it has bias of the same order as a second order kernel estimator. On the other hand, the Dirichlet kernel is effectively an infinite order kernel and can therefore always take advantage of extra smoothness in r. Since the minimum MISE of a boundary-modified Rogosinski estimator converges at the same rate as that of a second order kernel estim!ftor, it is of interest to compare the limiting ratio of MISEs. The most effic}ent kernel estimator based on a positive kernel is the Epanechnikov, as <;liscussed in Section 3.2.1. Let MISER and MISEE denote respectively ~{e minimum values of MISE for the Rogosinski estimator and Gasser-M' ller constant bandwidth estimator with K = Epanechnikov, where each stimator has been modified to correct for boundary problems. Then using ~o~ and result (3.18), one can check that
(3.19)
lim MISER MISEE
= ( _65) 4/5
(
2165) 1/5
R;:;
.9450.
n-+oo
So, the Rogosinski estimator is more efficient than the most efficient GasserMuller estimator with positive kernel! This fact was first shown by Hall (1983) in the context of density estimation. Of course, the Rogosinski kernel is not everywhere positive, but the result (3.19) is nonetheless noteworthy. Result (3.19) calls to mind Table 3.1 and the relative efficiency of GasserMuller estimators based on Rogosinski and Epanechnikov kernels. It is remarkable that fixing the truncation point of the Rogosinski kernel at 2, and then using rescaling as a smoothing parameter leads to an even smaller asymptotic relative efficiency than in (3.19).
3.4 Asymptotic Distribution Theory To this point we have only addressed point estimation properties of smoothers. If we are to use smoothers to test hypotheses or construct confidence intervals, it is necessary to know their sampling distributions as well. In this section we shall study the large sample distributions of Gasser-Muller and simple truncated series estimators. Since these estimators are weighted sums, we anticipate that under various sets of conditions on the errors the central limit theorem will apply and the estimators will be asymptotically normally distributed.
3.4. Asymptotic Distribution Theory
77
Let f(x) denote a generic nonparametric smoother. For inference purposes it is of interest to consider the quantity
f(x)- r(x) JVar [f(x)]
= Zn +Bn,
where (3.20)
Zn
=
f(x)- E [f(x)]
and
Bn
=
JVar [f(x)]
E [f(x)] - r(x) . JVar [f(x)]
For linear smoothers f, the random variable Zn typically converges in distribution to N(O, 1). However, the deterministic quantity Bn may not be negligible in comparison to Zn, depending on how the smoothing parameter is chosen. This observation has implications for construction of confidence intervals for r(x) and will be discussed more in Section 3.5. Define Zn,h and Z~,m as follows: Z
_ fh(x)- E [fh(x)] n,h JVar [fh(x)] '
Z* n,m
= f(x; m)- E [f(x; m)] JVar [f(x; m)]
.
We now state and prove the following theorem concerning the limiting distributions of Gasser-Muller and truncated series estimators. Theorem 3.6. Suppose data are generated from the model
Yi = r(xi)
+ Ein,
i
= 1, ... , n,
where, for each n, Eln, . .. , Enn are independent random variables with mean 0, common variance CT 2 and third moments satisfying, for some constant B < oo,
EIEinl 3 <
B
for all i, n.
Assume also that the kernel K used in fh satisfies conditions 1-3 of Section 3.2 and that the design conditions of Section 3.1 hold. Let { hn} and {mn} be sequences of smoothing parameters satisfying hn -+ 0, nhn -+ oo, mn -+ oo and mn/n -+ 0 as n -+ oo. It then follows that, as n tends to infinity, Zn,hn ~ N(O, 1)
and
Z~,mn ~ N(O, 1).
PROOF. Both fhn(x)- E[fhn(x)] and f(x;mn)- E[f(x;mn)] may be written I:~=l WinEin, where
lSi (
lSi
1 X - U ) K h - du and Win = KmJx, u) du hn Bi-1 n Bi-1 in the former and latter cases, respectively. To prove the result it is sufficient to check the Liapounov condition (Chung, 1974, p. 200)
Win
(3.21)
=
-
78
3. Statistical Properties of Smoothers the uastlt)r-,Ivlu.uer estimator, Theorem 3.1 implies that
le~f nhnVar (t winEin)
> 0,
t=l
and so to verify (3.21) it is enough to show that n
3 2
(nhn) 1
2.: lwinl EIEinl 3
3
--+
0
i=l
as n
--+
oo. Now, for some constant C', B
n
2.: lwinl EIEinl 3
3
:::::
n
h
max (si- Bi-d
n l:Si:Sn
i=l
C'
sup
IK(u)i
-l
2.: wrn i=l
n
< nh l:wrn· n i=l
The last expression is O(nh)- 2 by Theorem 3.1, and (3.21) follows immediately. Essentially the same proof may be done to verify the Liapounov condition for the simple series estimator, which proves the result. D Note that Theorem 3.6 applies to the centering r(x) - E [f(x)] and not to r(x) - r(x). To understand the distribution of the latter quantity we must take into account the behavior of Bn, as defined in (3.20). We will do so in the next section in the context of confidence interval construction.
3.5 Large-Sample Confidence Intervals We consider inference only with Gasser-Muller smoothers, although the basic principles discussed have their analogs with series estimators and other linear smoothers. Theorem 3.6 says that when nh is sufficiently large, the Gasser-Muller estimator is approximately normally distributed with mean E [rh(x)]. Inasmuch as E [rh(x)] is approximately r(x), one would expect that a large sample confidence interval for r(x) would be an almost immediate consequence of Theorem 3.6. Unfortunately, this is not the case. Let Zp be the ( 1 - p) 1OOth percentile of the standard normal distribution (0 < p < 1), and consider the naive confidence interval for r(x) with nominal coverage probability 1 - a: (3.22)
fh(X)
± Zajz& [
t rz , wi(z; h)
where w 1 (x; h), ... , wn(x; h) are the Gasser-Muller kernel weights. Even when the E/s have a common variance tJ 2 and & is consistent for tJ, the
3.5. Large-Sample Confidence Intervals
79
actual level of this interval need not converge to (1 - o:). The reason why is clear upon considering the effect of
Bnh = E [rh(x)] - r(x). JVar [fh(x)] Theorem 3.6 implies that for large n the coverage probability of interval (3.22) is if>(Zo)2 - Bnh)
+ if>(za/2 + Bnh)
- 1,
where if> is the cumulative distribution function (cdf) of the standard normal distribution. A natural choice for h would be the mean squared error optimal bandwidth. However, for this sequence of h 's, Bnh converges to a nonzero constant b whenever r"(x) =1- 0. In this event the limiting coverage probability is if>(za; 2 - b)+ if>(za; 2 +b) -1, which is less than the nominal 1- 0:. A number of different approaches have been proposed to deal with the problems inherent in the naive interval (3.22). The most obvious fix is to select h in such a way that Bnh ----r 0. Doing so leads to an interval with asymptotic coverage probability equal to the nominall - o:. The only problem with selecting h in this way is that it will undersmooth relative to a bandwidth that minimizes mean, or mean integrated, squared error. This will lead to a confidence interval whose length is greater than that of interval (3.22) for all n sufficiently large. Furthermore, we are forced into the awkward position of centering our confidence interval at a different and less efficient estimator than the one we would ideally use as our point estimator of r(x). Another possibility is to estimate the quantity Bnh and account for it explicitly in constructing the confidence interval. Suppose we are willing to assume the conditions of Corollary 3.1. Then by taking h to be of the form Cn- 115 , Bnh is asymptotic to
Bc,n =
and
rh(x)- r(x)
y'Var [f(x; m)]
- Bc,n
J)
-------*
N(O, 1).
Estimation of Ben requires estimation of r" (x), which may be done using a kernel estimate' as in Section 3.2.6. If Bc,n has the same form as Bc,n but with r 11 ( x) and CJ replaced by consistent estimators, then
80
3. Statistical Properties of Smoothers
is an asymptotically valid (1 - a)100% confidence interval. Hiirdle and Bowman (1988) provide details of a bootstrap approach to obtaining a bias-adjusted interval as in (3.23). Such methods have the practical difficulty of requiring another choice of smoothing parameter, i.e., for r"(x). More fundamentally, the interval (3.23) has been criticized via the following question: "If one can really estimate r"(x), then why not adjust rh(x) for bias and thereby obtain a better estimate of r(x)?" A less common, but nonetheless sensible, approach is to simply admit that E [rh(x)], call it Tnh(x), is the estimable part of r(x), and to treat Tnh(x) as the parameter of interest. Theorem 3.6 may then be used directly to obtain an asymptotically valid confidence interval for Tnh(x). Typically, Tnh is a "shrunken" version of r, i.e., a version of r that has more rounded peaks and less dramatic valleys than r itself. It follows that if one is more interested in the shape of r, as opposed to the actual size of the function values r (x), then treating r nh as the function of interest is not an unreasonable thing to do. Using ideas of total positivity one can make more precise the correspondence between rand Tnh· When r is piecewise smooth, the expected value of the Gasser-Muller kernel estimator is (3.24) where
rh(x)
=
Jo(1y;,K (x-u) -h-
r(u) du.
Whenever nh -+ oo and h -+ 0, expression (3.24) implies that the naive interval (3.22) is asymptotically valid for rh(x). Now, the function rh is the convolution of r with h- 1 K (-/h). Karlin (1968, p. 326) shows that if K is strictly totally positive, then rh has no more modes than r. This in turn means that, for all h sufficiently small, rh and r have the same number of modes. These considerations provide some assurance, at least for large n, that interval (3.22) is valid for a function having features similar to those of r. The Gaussian density is an example of a totally positive kernel. Silverman (1981) exploited this property of the Gaussian kernel in proposing a test for the number of modes of a probability density function. Other ideas related to inferring the number of peaks of a function are considered in Hart (1984), Donoho (1988) and Terrell and Scott (1985). Bayesian motivated confidence intervals with desirable frequentist properties have been proposed by Wahba (1983). Wahba's intervals have the form r(x) ± ZajzW(x), where r(x) is a Smoothing spline and w2 (x) is a statistic that tends to be closer to the mean squared error of r(x) than to
3.5. Large-Sample Confidence Intervals
81
Var(f(x)). The latter property implies that Wahba's interval tends to have higher coverage probability than does a naive interval as in (3.22). Nychka (1988) established another interesting frequentist property of Wahba's intervals. Suppose one computes these intervals at each of then design points, using the same nominal error probability of a at each Xi· Nychka shows that the average coverage probability of these n intervals tends to be quite close to 1 - a. Perhaps of more interest than confidence intervals for selected values r( x) are simultaneous confidence bands for the entire function r. A number of methods have been proposed for constructing such bands. These include the proposals of Knafl, Sacks and Ylvisaker (1985), Hall and Titterington (1988), Hardle and Bowman (1988), Li (1989), Hardle and Marron (1991) and Eubank and Speckman (1993). The same issue arises in construction of confidence bands as was encountered in pointwise confidence intervals; namely, one must take into account the bias of a nonparametric smoother in order to guarantee validity of the corresponding interval(s). Each of the previous references provides a means of dealing with this issue. A situation in which the bias problem can be fairly easily dealt with is when one wishes to use probability bands to test the adequacy of a parametric model. When a parametric model holds, the bias of a nonparametric smoother depends at most upon the parameters of the model, and hence can be readily estimated. We now illustrate how probability bands can be constructed under a parametric model by means of an example. Suppose that we wish to test the adequacy of a straight line model for the regression function r. In other words, the null hypothesis of interest is
Ho : r(x) = Bo
+ elx,
0 :::;
X :::;
1,
for unknown parameters 80 and 81 . Define
8(x)
=
f(x)-
Oo- elx,
0:::;
X:::;
1,
where f is, say, a boundary-corrected Gasser-Muller smooth and Bo and el are the least squares estimates of 80 and 81 , respectively. The variance of 8(x) has the form a- 2 s 2 (x) for a known function s(x). We may determine a constant Ca such that
where 8- 2 is an estimator of o- 2 = E(ET) and PHa denotes that the probability is computed under H 0 . The constants -ca and Ca form simultaneous 1- a probability bounds for the statistics 8(xi)/(8-s(xi)), i = 1, ... , n. We may thus reject H 0 at level of significance a if any observed value strays outside these bounds. Let us suppose that the error terms Ei are i.i.d. N(O, a- 2 ). We may then approximate a P-value for the test described above by using simulation. When H 0 is true and f is a Gasser-Muller smooth, it turns out that, to
82
3. Statistical Properties of Smoothers
first order, r(x) -eo- elx depends only upon theE/Sand not upon Bo or Bt (Hart and Wehrly, 1992). In conducting a simulation one may thus take Bo = 81 = 0 without loss of generality. Letting mn denote the observed value of maXt
.... ....... ···.~.,{~/ ...... . .. .
,•
.··
.
:. . : ... . ·... ~-~ . >.-~-·.:~~~-~-~~~·~···
0
. ........ ....
............---~--~----.
, ... .
0.0
0.2
0.4
0.6
0.8
1.0
0.6
0.8
1.0
X
lC)
c) 0
c)
0.0
0.2
0.4 X
FIGURE 3.1. Testing the Fit of a Straight Line Model by Simulating Probability Bands.
3.5. Large-Sample Confidence Intervals
83
on an Epanechnikov kernel and a bandwidth of .2. In the lower panel of Figure 3.1 are empirical 95% probability bands for 8(x)jfJ that were obtained by simulating data from the null model with Bo = 81 = 0. The variance estimator used was & 2 = "E~=l Yi- 8 - 8 xi /(n- 2). The A
(
A
0
1
)2
bands are such that only fifty of one thousand simulated curves strayed outside the shaded region in the graph. The line in the bottom panel is 8(x) / fj for the data in the top panel. The value of max1
r
-eo -e
4 Data-Driven Choice of Smoothing Parameters
4.1 Introduction This chapter is devoted to the problem of choosing the smoothing parameter of a nonparametric regression estimator, a problem that plays a major role in the remainder of this book. We will use S to denote a generic smoothing parameter when we are not referring to a particular type of smoother. The sophistication of the technique used to choose S will depend on the data analyst's reasons for fitting a nonparametric smooth to the data. If one wishes a smooth to be merely a descriptive device, then the "by eye" technique may be satisfactory. Here, one looks at several smooths corresponding to different values of S and chooses one (or more) which display interesting features of the data. In doing so, the data analyst is not necessarily saying that these features are verification of similar ones in a population curve; he only wishes to describe aspects of the data that may warrant further investigation. In some applications the data analyst will desire an objective method of choosing S, as opposed to the subjective "by eye" method. For example, if graphical interaction with the data is impossible, it may be necessary to have a data-driven rule for choosing the smoothing parameter. Real-time applications are a good example of situations where graphical analysis is often not feasible. Another important reason for desiring a data-driven rule is that one may wish to base inferences about the underlying curve on a smooth. To honestly assess the sampling distribution of some function of a smooth, one should take into account the method of choosing S. Unless the choice of S is data based, the notion of sampling distribution may not even be well defined. For our purposes, the most important reason for considering data-driven methods of choosing S is their usefulness in testing for lack of fit of parametric regression models. We consider five methods of selecting S in this chapter, each of which, in principle, could be used in testing the fit of a model. These methods are cross-validation, risk estimation, plug-in, a method due to Hall and Johnstone (1992) and one-sided cross-validation. In 84
4.2. Description of Methods
85
this chapter we shall be content to describe the methods without reference to testing problems.
4.2 Description of Methods In this section we describe various methods of smoothing parameter selection, and in Section 4.2.6 we provide an illustration of each method on a set of data. A more rigorous, theoretical look at the methods is the subject of Section 4.3.
4.2.1 Cross- Validation The basic idea of cross-validation (see Stone, 1974) is that one builds a model from one part of the data and then uses that model to predict the rest of the data. For a given model one may compute an average prediction error over all the data points predicted. Among a collection of models, the model that minimizes this average prediction error is deemed best in that collection. For data arising from model (2.1), consider a generic estimator of r denoted by f-( · ; S), where S is the smoothing parameter. The most often used form of cross-validation in our regression context is the "leaveone-out" version. Let f- i ( · ; S) be a regression estimator of the same type as f-(·; S) except that it is computed without data value (xi, Yi). Define the cross-validation criterion CV(S) as follows: CV(S) =
~ n
t
(Yi - f-i(xi; S)) 2 .
i=l
The cross-validation smoothing parameter is that value of S that minimizes CV(S). It may seem that leaving out a single data value in computing the predictor f-i(·; S) would make little difference. However, if the observations are independent, then f-i(Xii S) is independent of }i. This suggests that CV(S) will give an accurate assessment of how well the smoother f-(- ; S) will predict future observations. By contrast, consider what happens if one simply computes the usual sum of squared residuals n
RSS(S)
=
L
(Yi- r(xi; S))
2
.
i=l
Most smoothers have a limiting version that is very rough and interpolates the data, implying that RSS(S) is smallest at this rough estimate. For example, the Gasser-Muller estimate satisfies f-h(xi) --+ Yi as h --+ 0, i = 1, ... , n. Obviously, the fact that RSS(S) is 0 at the no-smoothing point is not an indication that the interpolating estimate is a near-optimal predictor of future data.
86
4. Data-Driven Choice of Smoothing Parameters
A more compelling motivation for cross-validation comes from considering the expected value of CV(S). Assuming only that the errors in model (2.1) are uncorrelated,
E( GV(S))
~ e' + E [ ~
t
l~
(f; (x;; S) - r(x;) )'
e' + M ABE' (S).
It should be clear that unless n is very small, MASE*(S) will be approximately
MASE(S)
~ E [~
t
(r(x.; S) - r(x;))']·
It follows that CV (S) is essentially an unbiased estimator of cr 2 + MASE(S), which is minimized at the same value of S as MASE(S). This result tells us little about how well CV(S) estimates cr 2 + MASE(S), but at least it suggests that cross-validation has some promise as a method of selecting smoothing parameters. In Sections 4.3-4.4 we will take a much closer look at the behavior of bandwidths chosen by cross-validation.
4.2.2 Risk Estimation We noted in the previous section that the cross-validation curve is an approximately unbiased estimator of the MASE risk function. An alternative means of smoothing parameter selection is to estimate a risk function directly. We shall illustrate the method using MASE as the risk. Suppose that our generic estimator has the linear form n
i=l
where the weights wi(x; S) are constants for a fixed value of S. The MASE of this estimator can be written
1 ) MASE(S) = E ( -;;, RSS(S)
2
+ ~
2
t; n
wi(xi; S)- cr
2
.
The quantity M(S) = MASE(S) + cr 2 obviously has the same minimizer (wrt S) as MASE(S), and one can estimate M(S) by ~
1
M(S)
=
-
2&2
RSS(S)
n
+-
n
Ln
Wi(Xii
S),
i=l
where 8'2 is a model-free estimator of variance. For example, one could use a difference-based estimator such as
&2
=
2(n
~ 1) ~(Yi- Yi-1)2,
4.2. Description of Methods
87
or one of those discussed in Hall, Kay and Titterington (1990). The datadriven risk estimation smoothing parameter is the minimizer of M(S). Note that M(S) is approximately unbiased for MASE(0._+0' 2 to the extent that 2 fj2 is approximately unbiased for 0' . In essence M(S) is the well-known Mallows (1973) criterion applied to our nonparametric regression setting. In subsequent chapters the estimated MASE of truncated series estimators will be of particular interest to us. Let the design points be Xi = (i - 1/2)/n, i = 1, ... , n, and the smoother be a series estimator of the form r(·; m) but with ¢j replaced by
' 1 rPj,LS = n
L Yi cos(njxi), n
j = 0, 1, ... , n- 1,
i=l
which are the least squares estimators for the assumed design. Then M(S) becomes
~
28" 2 (m 2 ~ '+'j,LS _+ n ;;. 2
+ 1)
m
'
= 0, 1, ... ,n- 1.
j=m+l
For future reference we note that the minimizer of the last quantity is equal to the maximizer of Jm, where
, Jo=O,
,
'2
- ~ 2nr/Jj,LS ,2 2m,
Jm-~
j=l
m=1, ... ,n-1.
(J
A method closely related to MASE estimation is generalized crossvalidation, or GCV. The vector of predicted values (r(x 1 ; S), . .. , r(xn; S) )' may be written as Ws Y, where Y = (Y1, ... , Yn)' and Ws is ann x n matrix with ijth element wj(Xii S). The GCV criterion is defined by
n- 1 RSS(S) GCV(S) = {1- tr(Ws)/n}2' where tr(A) denotes the trace of matrix A. The value of S that minimizes GCV (S) provides a data-driven choice for the smoothing parameter. For many smoothers the relevant values of the smoothing parameter S are such that tr(Ws)/n-+ 0 as n -+ oo. In this event we have
GCV(S)
=
~RSS(S)
=
2:_ RSS(S) +
n
[1+
2
2
8"~ n
tr~Ws)
t
i=l
+O(tr(:s)r] 2
wi(xi;
S) + Op ( tr(Ws)) n
,
where 8"~ = RSS(S)/n. By comparing the very last expression with M(S), one can see the close relationship between GCV and the MASE estimate. The most important difference between the two criteria is that GCV employs an estimate of 0' 2 that depends on S. However, when n is large one
88
4. Data-Driven Choice of Smoothing Parameters
can expect the two criteria to produce similar choices for the smoothing parameter.
4.2.3 Plug-in Rules Plug-in rules exploit an asymptotic approximation to mean average (or mean integrated) squared error. The most common plug-in rules are those based on the assumption that r has two continuous derivatives, wherein, according to expression (3.8), the MISE-optimal bandwidth of a boundaryadjusted Gasser-Muller estimator satisfies hn
rv Cmodel
CK n- 115 '
with
0"2 fo1 [f(x)]-1 dx) 1/5 Cmodel =
(
fo1 [r"(x)]2 dx
and
A plug-in rule is a data-driven bandwidth of the form CmodeiCKn- 1 15 , where estimates of unknown parameters are "plugged into" Cmodel to obtain Cmodel· Note that, for a given K, the constant CK is known or can be approximated arbitrarily well by numerical means. The variance 0" 2 may be estimated by 8' 2 (as defined in Section 4.2.2), 1 1 and 0 [f(x)]- dx may be approximated by
J
n
2
n 'L)si- Si-1) ,
i=1 which is motivated by the fact that {so, s1, ... , sn} is a partition of [0, 1] and that n(si - Si-1) = 1/ f(xi) for Xi-1 :::; Xi :::; xi+1, i = 1, ... , n. The 1 2 most difficult quantity to estimate in Cmodel is Ir = J0 [r"(x)] dx. Note that Ir depends on the unknown function r, which we do not know, or else we would not need to estimate an optimal bandwidth ! A possible estimator for Ir is
1 [r~(x)] 1
Ir(b)
=
2
dx,
where r~ is a kernel type estimator as introduced in Section 3.2.6. The biggest problem with this proposal is the circularity that arises from having to choose the bandwidth b. Two main approaches have been proposed to deal with this problem. One of these is described by Park and Marron (1990) and Ruppert, Sheather and Wand (1995) and is based on determining the value of b, say bn, that minimizes the mean squared error of Ir(b).
4.2. Description of Methods
89
Of course, bn will itself depend upon the unknown r, through derivatives of order higher than two. Suppose we assume momentarily a simple parametric form for r, such as r is a quartic polynomial. We do so solely for the purpose of estimating bn. The parameters of this model may be estimated by least squares and the resulting estimate of r plugged into an asymptotic approximation for bn. This estimate of bn, call it bn, leads to the estimate Ir(bn), which in turn yields the plug-in estimate of hn· Ruppert, Sheather and Wand (1995) report that fitting a quartic model for the initial estimate of r is inadequate when r has many oscillations. As an alternative they propose dividing the x-range up into blocks and fitting a quartic within each block. A second plug-in approach is due to Gasser, Kneip and Kohler (GKK) (1991) and is iterative in nature. To avoid boundary problems in estimating r", GKK's target bandwidth is one minimizing MISE over an interval of the form (8, 1- 8). (They recommend using 8 = .10.) The asymptotic form of the optimal bandwidth is 1/5 cr2 J81-8[f(x)]-l dx -1/5 hA =OK n . 1 8 2 ( · - [r"(x)] dx )
J8
Now definer"(·; b) to be a kernel estimate of r" based on kernel W that satisfies the moment conditions 1 1 u 2 W(u) du = 2. ukW(u) du 0, k = 0, 1, and -1 1 Assuming evenly spaced design points, GKK's proposed iterative algorithm is as follows:
1
1
• Take h0 = 1/n. • Define hi, i 2: 1, iteratively by
-. h, -
cK (
&2(1- 28)
f
1 8 8-
2
[r"(x; hi-ln1110 )] dx
) 1/5 -1/5 n .
• Stop after eleven iterations and use h 11 as the plug-in bandwidth. The motivation for the factor n 1110 and the use of eleven iterations is based on asymptotic considerations (GKK, 1991). Stopping after eleven iterations provides a more automated approach, but still one wonders whether the GKK algorithm is guaranteed to converge for a given data set. The author has encountered data for which the algorithm eventually oscillates between two bandwidths. In such a case, using the answer obtained after eleven iterations seems artificial and perhaps misleading. Nonetheless, the algorithm often does converge and, in the author's opinion, provides a very useful method.
90
4. Data-Driven Choice of Smoothing Parameters
4.2.4 The Hall-Johnstone Efficient Method Hall and Johnstone (1992) propose a method for selecting the bandwidths of kernel estimators. In a sense the method is much like plug-in, and yet is more efficient in a well-defined way than are the plug-in bandwidths of the previous section. To simplify our explanation of their method, we will assume that r is smooth enough at 0 and 1 so that boundary effects can be ignored, thus allowing integrals to be taken over all of (0, 1). Letting h0 denote the minimizer of
ASE(h)
=
~ n
t
(rh(x) - r(x))
2
,
i=1
Hall and Johnstone (1992) show that ~
~
ho = A1
+ n 1/5 Az J + Rn, ~
1
where A1 and Az are somewhat complicated statistics, J = f 0 [r'(x)] dx and Rn is negligible relative to the other two terms. (The reader is referred to Hall and Johnstone, 1992 for definitions of A1 and A2 .) Inasmuch as fio is taken as our target bandwidth, this representation motivates the following definition of a data-driven bandwidth:
(4.1)
~
h
~
=
A1
2
+ n 1/5 Az J, ~
~
where J is an estimator of J. If the asymptotic mean squared error of J achieves its lower bound, then the data-driven bandwidth (4.1) is asymptotically optimal in a sense to be discussed in Section 4.3.1. It is worth remarking that the quantities A1 and A2 require an estimate of In as does the plug-in method. Hence, the Hall-Johnstone method requires choices of bandwidth for both lr and J. This problem can be dealt with as it was in the first approach discussed in Section 4.2.3. A parametric pilot function may be posited and estimated for purposes of fixing the bandwidths in lr and J
4.2.5 One-Sided Cross- Validation One-sided cross-validation is a method proposed by Hart and Yi (1996) and is a special case of a more general method proposed by the same authors. The idea underlying their general method is that one use different types of estimators at the cross-validation and estimation stages of the analysis. Suppose that rh is the estimator with which we intend to estimate the regression curve. Consider a second estimator, say rb, with smoothing parameter b for which we can define a transformation h = h(b) such that rh(b) and rb produce, in some sense, equivalent amounts of smoothing. We then use leave-one-out cross-validation to choose the smoothing parameter b of rb and take as our estimate of r(x),
rh(b) (x),
4.2. Description of Methods
91
where b is the cross-validation smoothing parameter for the estimate fb. In order for this method to be worthwhile, we desire h(b) to satisfy a property such as 2
(4.2)
E [h(b) -
ho J <
E [hcv
2
- ho J
,
where hcv and ho are respectively the usual cross-validation smoothing parameter for fh and the minimizer of the ASE for fh. The two most important aspects of the method just described are the transformation h(b) and finding an estimator rb for which cross-validation is a reasonably good method of smoothing parameter selection. Concerning the latter, pessimistic results abound in the smoothing literature to the effect that cross-validation bandwidths are highly variable (Chiu and Marron, 1990, Park and Marron, 1990 and Hall and Johnstone, 1992). One might then wonder whether there exists a "good" choice for rb· Ironically it turns out that some rather inefficient estimators rb of r satisfy (4.2). This fact does not contradict existing results since the high variability of cross-validation bandwidths has only been confirmed for efficient kernel estimators of r. We now describe one-sided cross-validation, a version of the above method that in many settings turns out to yield a more efficient data-driven smoothing parameter than ordinary cross-validation. We will describe the method for local linear smoothers, but it can just as well be applied with other estimators, such as kernel smoothers. Let fh be a local linear estimator based on kernel K and all the observations (x1, Y1), ... , (xn, Yn)· We wish to estimate r(x) by fh(x) and need to choose the bandwidth h. For rb(Xi) we propose using a local linear estimator based on the data (x1, Y1), ... , (xi, Yi); i.e., we use data on only one side ofthe point at which the estimate is to be calculated. For now, we suppose that Tb uses the same kernel as does h. Define the cross-validation curve for h by
where m is some small integer that is at least 2 and r~(xi) is a local linear estimate computed from the data (xj, Yj) for which Xj is strictly less than Xi· We must now address the problem of finding a transformation that takes b, the minimizer of CV(b), into a bandwidth that is appropriate for h. Now, CV(b) is an unbiased estimator of
whose minimizer is approximately the same as that of the MASE of rb. Because of the automatic boundary correction afforded by the local linear
92
4. Data-Driven Choice of Smoothing Parameters
estimator, the asymptotic minimizer of MASE is bon=
1 5 elKemodeln- / ,
where elK is a constant that depends only on the kernel K used in the local linear estimator (Fan, 1992). The asymptotic minimizer, han, of the MASE for rh is identical to bon except that it has a different constant e2K in place of elK This means that 0
han=
e2K -e bon, lK
which motivates the following definition of a data-driven bandwidth for use in rh: ,
hoscv
=
e2K,
-e b. lK
It will be argued in Section 4.3.2 that use of one-sided cross-validation (or OSCV) can lead to a dramatic reduction in bandwidth variance m comparison to ordinary cross-validation.
4.2.6 A Data Analysis In this section we shall apply methods discussed in Section 4.2 to a set of medical data. The data were collected from n = 842 pregnant women, and the x and Y variables are respectively number of days into pregnancy and logarithm of serum alphafetoprotein. A scatter plot of the data is provided in Figure 4.1. With the exception of the Hall-Johnstone method, each method in Section 4.2 was carried out for a local linear smoother with Epanechnikov kernel. Plots of various criterion curves are shown in Figure 4.2. The ordinary cross-validation, estimated MASE and generalized cross-validation curves are all minimized at the same value of h = 10.1, whereas the OSCV curve has its minimum at about h = 22. The GKK plug-in method was also used to select h for the local linear smooth. The method described in Section 4.2.3 is valid for the local linear estimator since, conditional on the xi's, the Gasser-Muller and local linear estimators (using the same K) have the same asymptotically optimal bandwidth. The algorithm described in Section 4.2.3 was used with an initial bandwidth of 1 and 8 = .10(range of x). After eleven iterations the plug-in bandwidth was 17.37. It is worth noting that in this example the GKK method is quite sensitive to the choice of 8. For example, with h0 = 1 and 8 = .05, the GKK bandwidth after eleven iterations is 8. 76. The results for these data are typical of those seen by the author in a number of examples. CV, GCV and estimated MASE have all led to what seems to be an undersmoothed estimate of the average log(serum) curve, whereas the OSCV estimate seems more reasonable. The plug-in and OSCV
4.3. Theoretical Properties of Data-Driven Smoothers
E
.
1.{)
2Q)
!!!..
Ol
..Q
1--
. /~ ~-;.-~-~/ ..
93
:··
~>- ~·:··;, .
!
I
150
200
250
days FIGURE 4.1. Maternal Serum Alphafetoprotein Data. Each curve is a local linear smooth. The smoothing parameter of the solid curve was chosen by OSCV, and that of the dotted curve by the plug-in method. The dashed curve's smoothing parameter was chosen by each of CV, GCV and estimated MASE.
estimates are similar in the middle of the data, but the plug-in estimate is somewhat undersmoothed near the edges of the x-interval. Furthermore, the plug-in method requires the choice of h0 and (j and appears to be quite sensitive to changes in b. By contrast, OSCV requires only a choice for m, and the author has found the method to be insensitive to this choice. Of course, the CV, GCV and estimated MASE bandwidths are also more objective than the plug-in method, but, as we will see in the next section, they are more variable than the OSCV bandwidth.
4.3 Theoretical Properties of Data-Driven Smoothers Our most detailed treatment of the theory of data-driven smoothing parameters is in the setting of kernel smoothing (Sections 4.3.1 and 4.3.2).
94
4. Data-Driven Choice of Smoothing Parameters
C\1
0
C\1
0
co
c) UJ
>
c)
en <(
0
(!)
:::2: (0
'
f'-. C\1
0 0
c)
c)
20
40
60
80
~
1.{)
c)
c)
20
40
60
80
20
40
60
80
0
C')
C\1
> 0 en
> 0
0
co
f'-. C\1
0
Ol
c)
C\1
c)
20
40 h
FIGURE
60
80
h
4.2. Cross-Validation Curves for Maternal Serum Alphafetoprotein
Data.
Data-driven choice of the truncation point for Fourier series estimators will be discussed in Section 4.3.3.
4.3.1 Asymptotics for Cross- Validation, Plug-In and Hall-Johnstone Methods Seminal work in the kernel smoothing case has been done by Rice (1984b) and Hiirdle, Hall and Marron (HHM) (1988). In this section we shall describe the results of HHM, since doing so will facilitate our theoretical discussion of the methods encountered in Section 4.2. Let rh be a kernel estimator of Priestley-Chao type. Some important insight is gained by investigating how data-driven bandwidths behave relative to h0 , the minimizer of
In the parlance of decision theory, ASE is a loss function and MASE the corresponding risk function. For a specific set of data it seems more desirable to use the bandwidth that actually minimizes ASE, rather than ASE on the average. This point of view is tantamount to the Bayesian principle that says it is more sensible to minimize posterior risk than frequentist risk.
4.3. Theoretical Properties of Data-Driven Smoothers
95
See Jones (1991) for a more comprehensive discussion of the MASE versus ASE controversy. HHM provide results on the asymptotic distribution of fi - ho, where h is a data-driven choice for the bandwidth of rh. The assumptions made by HHM are summarized as follows: 1. The design points in model (2.1) are Xi = iln, i = 1, ... , n. 2. The regression function r has a uniformly continuous, integrable second derivative. 3. The error terms Ei are i.i.d. with mean 0 and all moments finite. 4. The kernel K of fh is a compactly supported probability density that is symmetric about 0 and has a Holder continuous second derivative.
In addition we tacitly assume that boundary kernels are used to correct edge effects (Hall and Wehrly, 1991). Otherwise we would have to incorporate a taper function into our definition of the cross-validation and ASE curves to downweight the edge effects. Let hcv be the minimizer of the cross-validation curve over an interval of bandwidths of the form Hn = [n- 1+8, n], o > 0. Also, denote by h 0 the minimizer of MASE(h) for h E Hn. Under conditions 1-4 HHM prove the following results: (4.3) and (4.4)
ur
and ()~ are positive. as n ---t oo, where Results (4.3) and (4.4) have a number of interesting consequences. First, recall from Chapter 3 that h 0 rv C0 n- 115 . This fact and results (4.3) and (4.4) imply that
(4.5)
fi;:v - 1
ho
=
Op(n-1/10)
and
ho ho
1
=
Op(n-1/10).
A remarkable aspect of (4.5) is the extremely slow rate, n- 1110 , at which hcv I ho and h0 I ho tend to 1. In parametric problems we are used to the much faster rate of n- 1 / 2 . As discussed above, it is arguable that the distance Ihcv - hoI is more relevant than Ihcv - hoI· With this in mind, an interesting aspect of (4. 5) is that the cross-validation bandwidth and the MASE optimal bandwidth differ from ho by the same order in n. Hence, perfect knowledge of the MASE optimal bandwidth gets one no closer to h0 (in rate of convergence terms) than does the cross-validation bandwidth, which is data driven! If one adopts ASE rather than MASE as an optimality criterion, this makes one wonder if the extremely slow rate of n- 1110 is an inherent part of the bandwidth selection problem. In fact, Hall and Johnstone (1992) show that,
96
4. Data-Driven Choice of Smoothing Parameters
in a minimax sense, the quantity
h- ho ho never converges to 0 faster than n- 1110 , where h is any statistic. Knowing that (hcv- h0 )/h0 converges to 0 at the optimal rate, it is natural to consider how E(hcv- h 0 ) 2 compares with the analogous quantity for other data-driven bandwidths that also converge at the best rate. For commonly used kernels HHM point out that 0' 1 ~ 20' 2 , implying that ho tends to be closer to h0 in absolute terms than does hcv. This suggests the intriguing possibility that a sufficiently good estimator of h 0 will usually be closer to ho than is hcv. Let us now consider the GKK (1991) plug-in bandwidth hPJ, which is founded on estimation of h 0 . We have
hPJ - ho
=
hPJ- ho
+ (ho
- ho),
implying that h pI- h 0 will have the same asymptotic distribution as ho- h 0 as long as hPJ- h 0 is op(n- 3 110 ). GKK show that
hPJ- ho
= Op(n- 215 ) = op(n- 3110 ),
and hence
n
3/10
A
A
(hPJ- ho)
D
---+
2
N(O, 0'2 ).
Asymptotically, then, the plug-in bandwidth of GKK performs better than the cross-validated one in the sense that E(hPJ- h 0 ) 2 ~ .25E(hcv- h 0 ) 2 for commonly used kernels and all n sufficiently large. One way of explaining the behavior of hcv - h0 is to consider the representation hcv - ho
=
hcv - ho - ( ho - ho).
Rice (1984b) was the first to show that n 3 / 10 (hcv - ho) _.!!.__, N(O, 0'6v) for 0'6v > 0. It follows that, asymptotically, hcv has infinitely larger mean squared error in estimating h 0 than does hPJ. Furthermore, (4.4) and (4.5) imply that A
(4.6)
A2
A
E(hcv- ho) ~ Var(hcv)
+ Var(ho)- 2 Cov(hcv, ho). A
A
A
Expression (4.6) entails that a major factor in the large variability of hcv - ho is the fact that hcv and h 0 are negatively correlated (Hall and Johnstone, 1992). In other words, hcv has the following diabolical property: For data sets that require more (respectively, less) smoothing than average, cross-validation tends to indicate that less (respectively, more) smoothing is required.
4.3. Theoretical Properties of Data-Driven Smoothers
97
An obvious question at this point is, "Can we find a data-driven bandwidth, say h, for which E(h - h 0 )2 < E(hPI - h 0 ) 2 ?" The answer is yes, at least under sufficient regularity conditions. Hall and Johnstone (1992) find a lower bound on the limit of
where his any statistic. Let hE be the bandwidth (4.1) with an efficient estimator J of J; Hall and Johnstone (1992) show that limn->oo n 6 110 E(hEho) 2 equals the lower bound. Purely from the standpoint of asymptotic mean squared error theory, this ends the search for the ideal bandwidth selector; however, we shall have more to say on the notion of "ideal" in Section 4.5. To this point we have not discussed any theoretical properties of bandwidths, hR, selected by the risk estimation method of Section 4.2.2. HHM show that the asymptotic distribution of hR - h0 is the same as that of hcv - ho; hence, all the conclusions we have drawn about large sample behavior of cross-validation are also valid for risk estimation. Of course, asymptotics are not always an accurate indicator of what happens in finite-sized samples. Rice (1984b) shows by simulation that various asymptotically equivalent bandwidth selectors behave quite differently in small samples. It is important to point out that to first order the asymptotic ASEs of all the methods discussed in this section are the same. In other words, if h is any of the bandwidth selectors discussed, we have
ASE0) ~ 1 ASE(ho) as n ---t ,oo. The results discussed in this section nonetheless have relevance for second order terms in the ASE. Note that
ASE(h)
~
ASE(ho)
+ ~ (h- h0 ) 2 ASE"(h0 ),
where we have used the fact that ABE' (ho) = 0. Hall and Johnstone (1992) define the risk regret by E[ASE(h)] - E[ASE(ho)] and show that
E[ASE(h)] - E[ASE(ho)]
=
~ MASE"(h 0 )E(h- h0 ) 2 + rn,
where rn is negligible relative to MASE"(h 0 )E(h- h 0 ) 2 . The ratio ofrisk regrets, or relative risk regret, for two bandwidth selectors h1 and h2 is thus asymptotic to
98
4. Data-Driven Choice of Smoothing Parameters
In this way we see that results on E(h- h0 ) 2 relate directly to the question of how well the corresponding data-driven smoother estimates the underlying regression function. Hall and Johnstone (1992) provide some numerical results on risk regret for cross-validation, plug-in and their efficient method.
4.3.2 One-Sided Cross- Validation A detailed theoretical analysis of OSCV has been carried out by Yi (1996). Here we shall only summarize some salient aspects of the theory. Our main purpose in this section is to show that dramatic reductions in bandwidth variance are attainable with one-sided cross-validation. Following Chiu (1990), we assume that Xi = (i -1)/n, i = 1, ... , n, and use a "circular" design in which the data are extended periodically, i.e., for i = 1, ... , n, x-(i-1) = -i/n, Xn+i = 1 + (i - 1)/n, Y-(i-1) = Yn-i+l and Yn+i = Yi. The results in this section pertain to kernel estimators that are applied to the extended data set of size 3n. In the notation of Section 4.2.5, the estimator rh is 2
rh(x)
=
L
1 n nh . K t=-n+1
(XT x·) Yi,
0
~X~
1,
where 0 < h ~ 1 and K is a second order kernel with support (-1, 1). For the estimator rb we use
rb(x)
=
L
T
1 n (X-X·) nb . L Yi, •=-n+1
0
~X~
1,
where 0 < b ~ 1 and Lis a second order kernel with support (0, 1). Note that the estimator rb(x) uses only data for which Xi ~ x. Use of the circular design, along with the assumption that r(O) = r(1) and r' (0+) = r' (1-), eliminates boundary effects. Near the end of this section we will indicate why the forthcoming theoretical results appear to be relevant for certain local linear estimators as well. We begin by defining some notation. Let
define ho to be the minimizer of MAS E (h), and let b0 denote the minimizer of
4.3. Theoretical Properties of Data-Driven Smoothers
99
The bandwidths hcv and bcv minimize the cross-validation curves for the estimators rh and rb, respectively, and A
where, for a given function
CK CL bcv, A
hoscv
=
f, 1/5
[
J2
1
P(u) du
1
Define also the functionals Jf and BJ (when they exist) by
Finally, define Ufn (b) and UJ;, (h) by L
UJn(b)
=
1 ~ ( r ) ( -2nijr) nb ~L nb exp n '
j
= 1, ... , [n/2),
and K 1 ~ ( r ) ( 2njr) UJn (h) = nh rf::n K nh cos -----:;;:-
j
= 1, ... , [n/2].
Throughout this section we assume that the following conditions hold. (These are the same assumptions as in Chiu, 1990 plus conditions on L.) 1. The errors
2. 3. 4.
5.
E1 , E2 , ... are independent random variables with mean 0, variance 0' 2 and finite cumulants of all orders. The function r is such that r(O) = r(1), r'(O+) = r 1 (1-) and r" satisfies a Lipschitz condition of order greater than 1/2. The kernel K is a symmetric probability density function with support (-1, 1) and K" is of bounded variation. The kernel L is a second order kernel with support (0, 1). In addition, L satisfies the following: • Land L' are continuous on (0, 1), • L(O) and L' (0+) are finite, and • L" is of bounded variation on [0, 1), where L"(O) is defined to be L"(O+). The ordinary and one-sided cross-validation curves are minimized over an interval of bandwidths of the form [C- 1 n- 115 , cn- 115 ], where cis arbitrarily large but fixed.
100
4. Data-Driven Choice of Smoothing Parameters
Chiu (1990) obtains the following representation for hcv: [n/2]
(4.7)
L
n 3110 (hcv- ho) = -n 3 110 BKC~,IJ
(Vj- 2)W}~(ho)
+ op(1),
j=1
where V1 , V2, ... are i.i.d. X~ random variables,
Cr IJ
'
= (
-1,---CY_2_ _ ) 1/5
J0
r"(x) 2 dx
and
wf~,(h) = :h [1- uf~,(h)] 2 ,
j
= 1, ... , [n/2].
Similarly, Yi (1996) has shown that
(4.8)
n 3/10(hA oscv - h o)
B 03 = -n3/10 -CK CL L r,~Y [n/2]
x
L (Vj- 2)Wj~(b0 )
+
op(1),
j=1
where j = 1, ... , [n/2].
Hence, both hcv and hoscv are approximately linear combinations of independent x~ random variables. It is worth pointing out that the only reason (4.8) is not an immediate consequence of Chiu's work is that the kernel L does not satisfy Chiu's conditions of being continuous and symmetric about 0. We wish L to have support (0, 1) and to be discontinuous at 0, since such kernels are ones we have found to work well in practice. The theoretical development of Chiu (1990) relies upon the cross-validation curve being differentiable. Fortunately, differentiability of the OSCV curve is guaranteed when L is differentiable on (0, 1]; the fact that L is discontinuous at 0 does not affect the smoothness of the OSCV curve. It turns out, then, that Chiu's approach may be applied to n 3 110 (hoscv - ho) without too many modifications. The main difference in analyzing the cross-validation and OSCV bandwidths lies in the fact that, unlike U{n(h), the Fourier transform Ufn (b) is complex-valued. Representations (4. 7) and (4.8) allow one to compare the asymptotic variances of hcv and hoscv. Define the following asymptotic relative efficiency: A
E hoscv- ho
)2
(
ARE(K, L) = lim n--+oo
E (Ahcv- ho )
2
4.3. Theoretical Properties of Data-Driven Smoothers
Expressions (4. 7) and (4.8) imply that ARE(K, L) where
101
= limn-->oo AREn (K, L),
AREn(K, L) = The ratio AREn(K, L) has been computed for several values of n using the quartic kernel forK and the following choices for L, each of which has support (0, 1):
h(u)
= 140u3(1- u) 3(10- 18u),
L 3(u) = 6u(l- u)(6- lOu),
L 4 (u)
L 2 (u) = 30u2 (1- u) 2 (8- 14u), =
(5.925926- 12.96296u)(l- u 2 ) 2
L 5 (u) = (1- u 2 )(6.923077- 23.076923u
+ 16.153846u2 ).
It turns out that the limit of AREn is independent of the regression function r, and so the values of h 0 and bo were taken to be n- 1 / 5 and (CL/CK )n- 115 , respectively. The results are given in Table 4.1. The most interesting aspect of Table 4.1 is the dramatic reduction in bandwidth variation that results from using kernels L 4 and L5. Use of L 5 leads to an almost twenty-fold reduction in asymptotic variance as compared to ordinary cross-validation. Another interesting result is that the relative efficiencies decrease as the kernel L becomes less smooth at 0. Better efficiency is obtained from using the two kernels that have L(O) > 0. The relative efficiencies are smallest for L5, which is such that L~(O+) = -23.08 < -12.96 = L~(O+). The other three choices for L are shown by Miiller (1991) to be smooth, "optimum" boundary kernels. Each of these three is continuous at 0 (i.e., L(O) = 0). The kernel L 2 is smoother than L 3 in the sense that L~(O+) #- 0 while L~(O) = 0. Kernel L1 is smoother still since it has L~(O) = L~(O) = 0.
Relative Efficiencies of OneTABLE 4.1. Sided to Ordinary Cross-Validation. Each number in the body of the table is a value of AREn (K, L) for K equal to the quartic kernel.
n
L1
Lz
L3
L4
L5
50 150 300 600 1200 2400
1.732 2.165 2.197 2.202 2.202 2.202
1.296 1.899 1.936 1.939 1.940 1.939
1.303 1.811 1.667 1.811 1.755 1.768
.1719 .0469 .1089 .1001 .1004 .1006
.1039 .0389 .0456 .0627 .0561 .0558
102
T
4. Data-Driven Choice of Smoothing Parameters
The relative efficiencies in Table 4.1 suggest the possibility of further improvements in efficiency. For a given K and under general conditions on L, Yi (1996) has shown that lim n 315 Var(hoscv)
n--too
=
c; o-CkFL, '
where
and
1
=
1 1
1
AL(u)
L(x) cos(21rux) dx,
BL(u)
=
L(x) sin(27rux) dx.
Subject to the constraint that L is a second order kernel, one could use calculus of variations to determine an L that minimizes FL. Note that the asymptotically optimal L does not depend on K. Another, perhaps more relevant, optimality problem would be to find the L that minimizes 2
V(K, L)
=
' ' ) lim n 3 I 5 E (hoscvhoK
n--+oo
,
in which hoK is the minimizer of .t!SE(ht = n- 1 2:::~= 1 (fh(xi)- r(xi)) 2 . Let V(K, K) denote limn--+oo n 3 15 E(hcv- hoK ) 2 , where hcv is the ordinary cross-validation bandwidth for rh. Yi (1996) has shown that V(K, L) < V(K, K) for various choices of K and L. As before, one could employ calculus of variations to try to determine an L that minimizes V(K, L). It turns out in this case that the optimal choice for L depends on K. It seems clear that representations paralleling (4. 7) and (4.8) can be established for local linear smoothers. Suppose that rh is a local linear smoother that uses the quartic kernel. Apart from boundary effects and assuming that the Xi's are fixed and evenly spaced, this estimator is essentially the same as a Priestley-Chao type quartic-kernel estimator. Likewise, the one-sided local linear estimator using a quartic kernel is essentially the same as the kernel estimator with kernel L(u) = (5.926 - 12.963u)(1 u 2 ) 2 I(o, 1 )(u) (Fan, 1992). It is thus anticipated that the relative efficiencies in the "£4 " column of Table 4.1 will closely match those for quartic-kernel local linear estimators. Some insight as to why OSCV works better than ordinary cross-validation is gained by considering MASE curves. In a number of cases the author has noticed that the MASE curve for a one-sided estimator tends to have a more well-defined minimum than the MASE of an ordinary, or two-sided, estimator. This is illustrated in Figure 4.3, where we have plotted MASE curves of ordinary and one-sided local linear estimators that use an Epanechnikov kernel. Letting b denote the bandwidth of the one-sided estimator, that estimator's MASE is plotted against h = Cb, where Cis such
4.3. Theoretical Properties of Data-Driven Smoothers
103
1.!)
C\i 0
C\i I
UJ
(/)
"!
<(
::2:
q 1.!)
ci
0.05
0.10
0.15
0.20
h
0.02
0.10
0.06
0.14
h
FIGURE 4.3. MASE Curves for One- and Two-Sided Local Linear Estimators. (MASE- denotes MASEx 10 6 .) In each graph the solid and dotted lines correspond to the one- and two-sided estimators, respectively. The top graph and bottom graphs correspond to r(x) = x 3 (1- x) 3 and r(x) = 1.74[2x10 (1- x) 2 + x 2 (1- x) 10 ), respectively. For both graphs O" = 1/512 and n = 150.
that hoscv Chcv. For this reason the minima of the one- and two-sided MASE curves are approximately the same. Notice in each case that the range of the one-sided MASE curve is larger than that of its two-sided counterpart. Also, the two-sided curves are flatter near their minima than are the one-sided curves. It is not surprising that one can more accurately estimate the minimizer of the curve that is more curved at its minimum. A number of researchers have reported the disturbing tendency of crossvalidation to undersmooth, i.e., to choose too small a bandwidth. This
104
4. Data-Driven Choice of Smoothing Parameters
shows up as left-skewness in the sampling distribution of hcv (Chiu, 1990). Simulation studies have shown that the distribution of hoscv tends to be more nearly symmetric than that of hcv. Expressions (4.7) and (4.8) help to explain this phenomenon. We may write (n/2]
n 3/10 (hoscv- ho) ~ ~ (Vj - 2)wL(j) A
'"'
j=1
(4.9) [n/2]
n 3/10 (hcv- ho) ~ ~ (Vj- 2)wx(J), A
'"'
j=1
and then compare the weights W£ and wx. Figure 4.4 is a plot of WL(J) and wx(J), j = 1, ... , 20, for the case n = 100, K equal to the quartic kernel and L(u) = (5.926- 12.963u)(1- u 2 ) 2 I(o, 1 )(u). Since the ratio of the asymptotic variances of hoscv and hcv is independent of 0' and the underlying function r, the weights were computed with h 0 = n- 1 / 5 and
bo = (CL/Cx)n- 1 15 . Two things are remarkable about Figure 4.4. First, the weights wL(j) are quite a bit smaller in magnitude than wx(J), which is an indication that hoscv indeed has smaller variance than hcv. Also, note that wx(1) and wx(2) are large and negative, which is consistent with the left-skewness in the distribution of hcv. By contrast, the weights WL(J) are much closer to being symmetric about 0, which explains the near symmetry of the distribution of hoscv.
C\J
EOl
'(ii
0
:;;: ~
"'1 5
10
15
20
index FIGURE 4.4. Weights for CV and OSCV Bandwidths. The dotted and solid lines are respectively the weights WK(j) and wL(j), j = 1, ... , 20, as defined in (4.9) with n = 100.
4.3. Theoretical Properties of Data-Driven Smoothers
105
The kernel estimator rb (x) uses data at or to the left of x. Suppose we define another one-sided estimator fR,b(x) by
rR,b(x)
=
nb t;L 1
2
n
(
x·- X) T Yi,
0:::;
X
:S: 1,
where L is the same kernel used by rb. This estimator only uses data at or to the right of x. One might surmise that cross-validating r R,b (x) would provide extra information about a good choice of bandwidth. Under the circular design, define
where f~(xi) and rk,b(xi) use data to the left and right of xi, respectively. It is not difficult to show that
So, CVR(b) is completely redundant once CVL(b) has been computed. The extent to which this result generalizes when the design is not circular and evenly spaced is unclear. However, it is true that for evenly spaced (but not circular) designs the difference between CVL(b) and CVR(b) depends only on data within the boundary region (0, b) and (1 - b, 1). Inasmuch as boundary effects are negligible, we thus anticipate that the asymptotic correlation between the two OSCV bandwidths is 1.
4.3.3 Fourier Series Estimators The theory of data-driven Fourier series estimators is not as well developed as it is for kernel smoothers. Here we will limit our discussion to properties of the simple truncated series estimator. Suppose the design points are Xi = (i -1/2)/n, i = 1, ... , n, and let m be the maximizer of Jm (defined in Section 4.2.2) over m = 0, 1, 2, ... , mn, where mn ----t oo as n ----t oo. Define the loss function Ln by
Ln(m) =
~
t
(f(xi; m)- r(xi))
2
,
m = 0, 1, ... , n- 1,
i=l
and suppose that our model satisfies the following two conditions: 1. The error terms E1 , ... , En are independent and identically distributed with EE~ < oo. 2. The function r is not of the form "':,";=o aj cos(njx) for any finite integer m.
106
4. Data-Driven Choice of Smoothing Parameters
Under these conditions Li (1987) shows that
Ln(m) ___!__, 1 mino::;m::;mn Ln(m) as n ---+ oo. Now define J(m) = E [Ln(m)] for each m. Hurvich and Tsai (1995) obtain an interesting result on the relative rate of convergence of
J(m)
(4.10) Let
~
- 1
mino::;m::;mn J(m)
.
be an arbitrarily small positive number, and suppose that mn
o(n 2 ~). Then, under appropriate regularity conditions, Hurvich and Tsai
show that (4.10) converges to 0 in probability at the rate o(n- 1 12 H). Of course, we want 2~ to be large enough so that mn will be larger than the overall minimizer of J(m). Under the conditions leading to expression (3.16), the optimal value of m is asymptotic to Cn 114 , and so in this case it suffices to take ~ = 1/8 + 8, where 8 is an arbitrarily small positive number. This means that (4.10) will converge to 0 at a faster rate than n- 3 / 8 +6. It is of interest to know how this result compares with the analogous result when one uses risk estimation to choose the bandwidth of a kernel estimator. Under the conditions leading to the MISE expansion in Section 3.3.1, the optimal MISE of a Nadaraya-Watson estimator will be of order n- 314 , the same as for the truncated series estimator. Work of vanEs (1992) suggests that the rate of convergence of the kernel estimator analog of (4.10) is n- 1 / 4 , which is slower than the corresponding rate of n- 3 / 8 +6 for the series estimator (at least when 8 < 1/8). The significance of the Hurvich-Tsai result is that it provides evidence that data-driven methods as described in Section 4.2 are more efficient when used to choose discrete, rather than continuous, smoothing parameters. This is one of the reasons that truncated series estimators are used almost exclusively in our treatment of hypothesis testing via data-driven smoothers. We now study the behavior of lm in the special case where r has a finite Fourier series, i.e., mo
(4.11)
r(x) = ¢o
+ 2L
¢j cos(njx)
j=l
for some finite, non-negative integer m 0 . This situation will have special import to us in Chapter 7. In addition to lm we consider another order selection method that is analogous to the Bayes information criterion (BIC) of Schwarz (1978). Define
Bo = 0,
_ ~ 2n¢]
Bm- L.i j=l
~ I]'
-log(n)m,
m
=
1, 2, ... , n - 1.
4.4. A Simulation Study
107
The maximizer of Bm, call it ms, provides another data-driven truncation point for the simple series estimator. Because of the log(n) term, ms is guaranteed to be no larger than m for all n :2': 8. When r has the form (4.11), Theorem 4.1 below states that ms is a consistent estimator of m 0 , whereas m is inconsistent, tending to overestimate m 0 . The reader may be familiar with results parallel to these in the context of selecting the order of an autoregressive process (Shibata, 1976 and Hannan and Quinn, 1979). The proof of Theorem 4.1 depends on tools that will be developed in Chapter 7 and is thus deferred until that point. Theorem 4.1. Suppose that model (1.1) holds with r having the form (4.11), Xi = (i- 1/2)/n, i = 1, ... , n, and E1, ... , En i.i.d. witly finite fourth moments. Let m and ms be the respective maximizers of Jm and Bm over m = 0, 1, ... , n - 1. Then ms is a consistent estimator of m 0 , whereas m converges in probability to a random variable M which has support {mo, mo + 1, ... } and satisfies P(M = m 0 ) ~ .71. A salient aspect of Theorem 4.1 is that the criteria Jm and Bm are maximized over the set {0, 1, ... , n - 1}, which is the largest set one need consider. Parallels of Theorem 4.1 for density estimation (Ledwina, 1994) and the selection of autoregression order (Hannan and Quinn, 1979) have been proven under the more stringent condition that the number of candidate estimators is o(n). It is important to note that although m is not consistent for mo' the estimator r(x; m) is consistent for r(x; m 0 ) = ¢o + 2 L,"J'~ 1 ¢1 cos(njx). In fact, f(x;
m)- r(x; mo)
= Ov (
)n).
In deciding whether or not to use Bm one should weigh the consistency property of ms against how realistic it is to assume that r has the form (4.11). When r has an infinite Fourier series, ms will not be asymptotic to the minimizer, mn, of the MISE of f(·; m). In contrast, m/mn will generally converge in probability to 1 as n -+ oo, as argued by Hurvich and Tsai (1995).
4.4 A Simulation Study Asymptotic arguments certainly do not tell the whole story about the relative performance of various methods. Time and space do not permit a comprehensive finite-sample comparison of all the methods discussed in this chapter. In this section we present simulation results that compare ordinary cross-validation, plug-in and OSCV when they are used to choose the smoothing parameter of a local linear estimator.
108
4. Data-Driven Choice of Smoothing Parameters
The following four functions were used in the study, where in each case 0 S X S 1: r1(x)
=
x 3 (1- x) 3 ,
r2(x)
=
(x/2) 3 (1- x/2) 3 ,
r3(x) = 1.741 [2x 10 (1- x) 2 + x 2(1- x) 10 ], _ { .0212 exp(x - 1/3), .0212 exp( -2(x- 1/3)),
r 4 (x) -
x < 1/3 x ~ 1/3.
The function r 1 has a single peak, r 2 is monotone increasing, r 3 has two peaks and r 4 has a discontinuous derivative. The range of each function is the same; hence, the same values of CJ, 1/128, 1/512 and 1/2048, were used with each function. These values of CJ were chosen to represent high, moderate and low levels of noise, respectively. Two sample sizes, n = 50, 150, were used, the error terms were taken to be N(O, CJ 2) and the design points were Xi = (i- .5)/n, i = 1, ... , n. The estimator fh whose smoothing parameter is to be chosen is a local linear estimator with Epanechnikov kernel. The reader is reminded that when the design points are evenly spaced there is no essential difference between local linear and boundary-adjusted kernel estimators. One-sided cross-validation was carried out exactly as described in Section 4.2.2. We also approximated ordinary cross-validation and plug-in bandwidths for each set of data generated. The Gasser, Kneip and Kohler (1991) implementation of plug-in was used with 15 = .10, an initial bandwidth of 1/n and eleven iterations. (This is the number of iterations suggested by the authors.) Define the average squared error (ASE) for a local linear estimate fh by
The bandwidth minimizing ASE(h) was approximated for each data set, and ASE(h) was computed for each of the three data-driven bandwidths. Five hundred replications at each combination of function, CJ and n were conducted. The results are summarized in Tables 4.2-4.5. The fact that OSCV is much more stable than ordinary cross-validation is quite evident in these tables. In all but two of the twenty-four (independent) settings considered, the sample variance of the OSCV bandwidth was less than that of the ordinary cross-validation bandwidth. In eighteen of the cases the ratio of sample variances was smaller than .342. In Section 4.3.2 we argued that the OSCV bandwidth has a more nearly symmetric sampling distribution than does the ordinary cross-validation bandwidth. This is illustrated in Figure 4.5 (p. 113), where kernel density estimates for the two data-driven bandwidths are plotted from the simulation with r r1, CJ = 1/512 and n = 150. Also plotted is a density estimate for the ASE-optimal bandwidths.
=
4.4. A Simulation Study
109
Simulation Results for the Function r 1 (x) = x 3 (1- x) 3 . Results are based on 500 replications of each combination of n and 0'. The cross-validation, one-sided cross-validation, plug-in and ASE optimal bandwidths are denoted hcv, hoscv, hPJ and ho, respectively. TABLE
4.2.
(J
hcv
hoscv
hPJ
ho
50
1/128 1/512 1/2048
.2235 .1219 .0708
.2236 .1196 .0675
.1837 .1246 .0915
.2323 .1198 .0672
150
1/128 1/512 1/2048
.1683 .0915 .0519
.1746 .0917 .0530
.1654 .0979 .0579
.1751 .0944 .0539
50
1/128 1/512 1/2048
.8142 .0985 .0171
.2630 .0255 .0047
.1052 .0072 .0002
.4199 .0557 .0105
150
1/128 1/512 1/2048
.3151 .0480 .0101
.0720 .0089 .0022
.0466 .0020 .0002
.1708 .0279 .0059
50
1/128 1/512 1/2048
1.530 .1980 .0326
.8775 .1007 .0172
.8963 .0803 .0704
150
1/128 1/512 1/2048
.5956 .0912 .0181
.3118 .0504 .0095
.2839 .0378 .0081
50
1/128 1/512 1/2048
8.192 .6901 .0639
6.903 .6214 .0608
7.022 .6043 .0740
5.683 .5586 .0567
150
1/128 1/512 1/2048
3.024 .2782 .0268
2.445 .2516 .0251
2.408 .2443 .0248
2.120 .2284 .0236
n
Mean(h)
Var(h) x 10 2
Mean(h- h 0 ) 2 x 10 2
Mean(ASE(h)) x 106
110
4. Data-Driven Choice of Smoothing Parameters
TABLE x/2) 3 .
Simulation Results for the Function r 2 (x) = (x/2) 3 (1 See Table 4.2 for further notes. 4.3.
(J
hcv
hoscv
hPJ
ho
50
1/128 1/512 1/2048
1.152 .2607 .1190
1.262 .2746 .1181
.2058 .1831 .1350
.9351 .2421 .1229
150
1/128 1/512 1/2048
.6005 .1704 .0917
.6551 .1721 .0934
.2127 .1617 .0991
.5181 .1758 .0968
50
1/128 1/512 1/2048
64.44 2.877 .1150
62.35 2.926 .0243
.2796 .1155 .0066
39.46 .8669 .0696
150
1/128 1/512 1/2048
12.49 .3367 .0487
10.76 .1033 .0107
.2696 .0434 .0018
7.046 .1803 .0323
50
1/128 1/512 1/2048
127.5 4.004 .2248
140.9 4.226 .1176
93.08 1.414 .1045
150
1/128 1/512 1/2048
23.96 .7027 .1046
24.67 .3963 .0558
16.96 .3067 .0411
50
1/128 1/512 1/2048
5.166 .5049 .0438
4.231 .4263 .0388
5.951 .4062 .0380
2.899 .3251 .0342
150
1/128 1/512 1/2048
2.096 .1984 .0176
1.645 .1658 .0159
1.926 .1601 .0154
1.151 .1403 .0143
n
Mean(h)
Var(h) x 10 2
Mean( h - ho )2 x 102
Mean(ASE(h)) x 106
An arguably more relevant performance measure than bandwidth variability is ASE, since ASE actually measures how well a regression estimate fares on a particular data set. The scatter plots in Figure 4.6 (p. 114) show how the various methods performed relative to the ASE optimal estimator for the case r r 2 , u = 1/512 and n = 150. Note that the performances of OSCV and plug-in are similar and much better than for ordinary crossvalidation. Tables 4.2-4.5 show that the variability of the plug-in bandwidth is usually quite a bit smaller than that of OSCV, and yet these two methods
=
111
4.4. A Simulation Study
Simulation Results for the Function r 3 (x) = 1.7414[2x 10 x(1- x) + x 2 (1- x) 10 ]. See Table 1 for further notes.
TABLE
4.4. 2
(]'
hcv
hoscv
hPJ
ho
50
1/128 1/512 1/2048
.1846 .0853 .0539
.3349 .0818 .0506
.1524 .0938 .0753
.1612 .0812 .0505
150
1/128 1/512 1/2048
.1195 .0615 .0340
.1222 .0642 .0359
.1256 .0692 .0426
.1183 .0644 .0347
50
1/128 1/512 1/2048
1.121 .0383 .0035
6.276 .0089 .0003
.0580 .0020 .00004
.2039 .0167 .0003
150
1/128 1/512 1/2048
.1183 .0185 .0037
.0404 .0033 .0007
.0129 .0009 .00007
.0474 .0080 .0017
50
1/128 1/512 1/2048
1.507 .0636 .0052
10.02 .0302 .0007
.3008 .0373 .0620
150
1/128 1/512 1/2048
.2036 .0314 .0057
.1093 .0151 .0031
.0774 .0129 .0081
50
1/128 1/512 1/2048
9.742 .9083 .0974
11.65 .8458 .0915
8.033 .8564 .1605
7.278 .7857 .0907
150
1/128 1/512 1/2048
3.819 .3881 .0403
3.360 .3582 .0386
3.247 .3546 .0414
3.027 .3371 .0372
n
Mean(h)
Var(h) x 10 2
Mean(h - h 0 ) 2 x 10 2
Mean(ASE(h)) x 106
produce comparable ASEs. This may seem like a contradiction but is consistent with the theory discussed in Section 4.3.1. Reducing the variability of a data-driven bandwidth has diminishing returns in the sense that even the MASE minimizer (a fixed quantity) differs from the ASE optimal bandwidth by an amount whose order (inn) is just as large as it is for ordinary cross-validation. A proxie for how close the ASE of an estimate is to the optimal ASE is the quantity (h- h 0 ) 2 , where hand ho denote data-driven and ASE-optimal bandwidths, respectively. Note in Tables 4.2-4.5 that the
112
4. Data-Driven Choice of Smoothing Parameters
4.5 . Simulation Results for the Function r4(x) .0212{exp(x- 1/3)I[o,lj3)(x)+ exp[-2(x- 1/3)]I[l/3,1J(x)}. See Table 1 for further notes. TABLE
n
(J
hcv
hoscv
hPJ
ho
50
1/128 1/512 1/2048
.4623 .1483 .0769
.4411 .1495 .0766
.2024 .1544 .0976
.3918 .1425 .0696
150
1/128 1/512 1/2048
.2440 .1048 .0528
.2630 .1118 .0567
.1952 .1255 .0641
.2316 .1095 .0527
50
1/128 1/512 1/2048
9.037 .2414 .0317
6.874 .0755 .0100
.1808 .0362 .0019
3.976 .1043 .0114
150
1/128 1/512 1/2048
1.570 .0885 .0152
1.116 .0280 .0038
.1203 .0092 .0009
.6244 .0473 .0049
50
1/128 1/512 1/2048
14.83 .4354 .0580
11.74 .2442 .0310
7.908 .2137 .0945
150
1/128 1/512 1/2048
2.512 .1716 .0247
1.855 .0968 .0130
1.056 .1051 .0205
50
1/128 1/512 1/2048
6.531 .6617 .0702
5.508 .5969 .0667
6.015 .5834 .0756
4.150 .5189 .0619
150
1/128 1/512 1/2048
2.741 .2738 .0296
2.295 .2477 .0276
2.219 .2475 .0284
1.868 .2240 .0260
Mean(h)
Var(h) x 10 2
Mean(h- h0 ) 2 x 10 2
Mean(ASE(h)) x 106
average (hoscv- ho)2 is usually fairly close to average (hPJ- h0 ) 2 in spite of the larger discrepancy in the variances of the two methods. In only one of the twenty-four cases, r = r 3, u = 1/128, n = 50, was the average ASE for the OSCV bandwidth larger than that of ordinary cross-validation. This was also the only case where OSCV's mean ASE was more than 4. 7% larger than that of plug-in. This case points out a small shortcoming of OSCV. When the underlying function has multiple peaks
0 9 9
8
O
o
pite
was ary was all aks
4.5. Discussion
g
c '1ii
113
/\\ :
0
:
:I
C\1
I
t'
/
' '\
~
\
\ \
,:
J
cQ) -o
1: I( 1: I
i
I: I
0
..-
I I
•
I
/
I
/
I
/
I
!
I I
0
i i
/ /
______________ ,.,_I... .·'
0.0
0.05
0.10
0."15
h FIGURE 4.5. Kernel Density Estimates for Random Bandwidths. The solid line is an estimate constructed for ordinary cross-validation bandwidths from the simulation with r r 1 , rJ = 1/512 and n = 150. The dotted line is an estimate using OSCV bandwidths for the same 500 data sets. The dashed estimate is for ASE-optimal bandwidths.
=
of roughly comparable size and (} is sufficiently large, a one-sided predictor with large bandwidth can perform relatively well since data values near one peak are roughly the same size as those at a neighboring peak. Note that the problem experienced by OSCV for r = r 3 vanishes with an increase in either nor 1/(}. For n =50 the contrast between the cases(}= 1/128 and (} = 1/512 is clear in the two MASE curves in Figure 4.7 (p. 115).
4.5 Discussion In this chapter we have discussed a number of different data-based methods of choosing smoothing parameters. In the setting of kernel smoothing, the plug-in and Hall-Johnstone methods produce much more stable bandwidths than does the leave-one-out version of cross-validation. In fact, the Hall-Johnstone method yields a bandwidth that cannot be improved
114
4. Data-Driven Choice of Smoothing Parameters
"' 0 0
0
0 0
%
0
w
"0
5
(/)
<(
.. .· . . . :. ... :.: ..
0 0
•• •
"'
• • I
0 0
• •
·' ·.....
• ....., ··: • ••-.,:''. t~\ ' ·~.:·,·~~~' '1.#1'
.. ,.: :.
·.p~-
0
·l;···
·'
0
0
0.0
0.001
0.002
0.003
0.004
ASE(hopt)
>
" <1> 0
0 0
0
.<::
: "
iif
~
..
"'0 0 0
0
0 0.0
0.001
0.002
0.003
0.004
ASE(hopt)
a
0 0
0
.. . : .. . .. . ,.,. ...... " ..: ""• ... ·~·~.........
.<::
iif (/) <(
"' 0 0
0 0
0
.; :;.:IJA 0.0
0.001
0.002
0.003
0.004
ASE(hopt)
FIGURE 4.6. Scatter Plots of ASE From Simulation Study. Each plot is of ASEs for data-driven bandwidths vs. ASEs of optimal bandwidths in the case r r2, !J = 1/512 and n = 150. The top, middle and bottom graphs correspond to ordinary cross-validation, OSCV and plug-in, respectively.
=
4.5. Discussion
115
""' 0 0 0 0
c)
C\1 0 0 0 0
c)
0
c)
0.2
0.0
0.4
0.6
0.8
h FIGURE 4.7. MASE Curves for One-Sided Local Linear Estimators. The solid line is the MASE curve for a one-sided local linear estimator (with Epanechnikov kernel) in the case r rs, (]' = 1/128 and n = 50. The dotted line is the MASE curve in the same case except (]' = 1/512. When (]' is larger it is clear that it is harder to distinguish between optimal and oversmoothed estimates.
=
upon in terms of asymptotic efficiency. The version of one-sided crossvalidation used in the simulation of Section 4.4 lies in between ordinary cross-validation and the plug-in method in terms of efficiency. Although OSCV is somewhat less efficient than plug-in and Hall-Johnstone, it has the advantage of being more rough and ready and more objective than those two methods. OSCV does not require estimation of any derivatives, nor does it depend on parameters that must be fixed in an arbitrary way. Furthermore, as noted in Section 4.3.2, there exists the possibility that some version of OSCV will be both objective and fully efficient. We have discussed the application of data-driven smoothing methods only in the case of regression with but a single independent variable. Both cross-validation and plug-in methods can be applied in more complicated settings. For example, Vieu (1991) has applied a local version of crossvalidation to choose the bandwidth function of a variable bandwidth kernel estimator, and Muller and Stadtmiiller (1987) have proposed a plug-in rule for the same problem. Fan and Gijbels (1995) propose a data-driven method for choosing a variable bandwidth local polynomial estimator. Ruppert,
116
4. Data-Driven Choice of Smoothing Parameters
Sheather and Wand (1995) consider plug-in bandwidth rules that may be used for multiple regression as well as the univariate-x case. Use of plug-in rules for estimating additive regression models has been investigated by Opsomer and Ruppert (1996).
5
I
,, '
Classical Lack-of-Fit Tests
I
I ;
I
II
I
I
I
,
'I,I 'I
5.1 Introduction We now turn our attention to the problem of testing the fit of a parametric regression model. Ultimately, our purpose is to show how the nonparametric smoothing methods encountered in the previous three chapters can be useful in this regard. We begin, however, by considering some classical methods for checking model fit. This is done to provide some historical perspective and also to facilitate comparisons between smoothing-based and classical methods. We shall continue to use the following scenario as our canonical regression model:
(5.1)
Yi
=
r(xi)
+ Ei,
,:, [.ii ]I
I I, I
l
i = 1, ... , n,
where the xi's are fixed design points with 0 < x 1 < · · · < Xn < 1. We assume that E1 , ... , En are independent random variables with E(Ei) = 0 and Var(Ei) = o- 2 < oo, i = 1, ... , n. Sometimes we will add the assumption that each Ei has a Gaussian distribution. For our purposes the principal aim in analyzing the data (x 1, Y1), ... , (xn, Yn) is to learn about the relationship between x and Y as it is expressed through the regression function r. In a parametric approach to inferring r, one assumes that (5.2)
r E Se =: {r(· ; 0) : 0 E 8},
where 8 is some subset of p-dimensional Euclidean space with p finite, and, for each e E 8, r(·; 0) is a function with domain [0, 1]. A nonparametric model is of the same form as (5.2) with the exception that the set 8 is infinite dimensional and has an infinite number of elements. The term nonparametric is something of a misnomer since nonparametric models actually have infinitely more parameters than parametric ones. Nontheless, the term fits in the sense that a nonparametric analysis places less importance on inferring individual parameters, since no finite number of 117
! I
118
5. Classical Lack-of-Fit Tests
parameters characterizes the model. Of course, the real distinguishing feature of a nonparametric model is that it allows for a much wider range of possibilities than does a parametric one. A couple of examples will serve to contrast the two types of function classes. Consider quadratic functions defined on the interval [0, 1]:
The class
is an example of a parametric class of functions. If we assume that r E Q, then we know everything there is to know about r as soon as we know the values of the three parameters Bo, 81 and 82. An example of a nonparametric class of functions is L2(0, 1), the class of square integrable functions on (0, 1). A given function r is in £ 2 if and only if 00
2:: q;;u) < oo, j=O 1
where ¢r(J) = J0 cos(njx)r(x) dx. We may thus define the parameter space 8 for £ 2 (0, 1) to be the set of all sequences (8 0 , 81 , ... ) such that 00
:z=e; < oo. j=O
In some cases the data analyst may be virtually certain that r is in a parametric class Be. For example, model (5.2) might be a consequence of the physical system from which the data are generated. The statement "r is in Be" is thus correct to the extent that the data analyst is correct in his assessment of the physical system. Whenever a parametric model is "known" to hold, the statistical problem boils down to inferring the parameter e' which is the only aspect of r that is unknown. Perhaps a more typical scenario is one in which the data analyst merely entertains a model ofform (5.2). He has doubts about whether a particular model Be is correct and would like to have a statistical test of the null hypothesis that r is in Be. In the context of regression, such a test is usually referred to as a lack-of-fit test. An analogous procedure for testing whether a set of i.i.d. observations arises from a given class of probability distributions is more often called a goodness-of-fit test. In the remainder of this chapter we consider some classical lack-of-fit tests.
5.2. Likelihood Ratio Tests
119
5.2 Likelihood Ratio Tests A fundamental approach to testing statistical hypotheses stems from the seminal work of Neyman and Pearson (1933) on testing a simple null hypothesis versus a simple alternative. Neyman and Pearson showed that in the simple vs. simple case the most powerful test of a given size rejects the null hypothesis for small values of a likelihood ratio. This result has led to the use of likelihood ratio tests in more general settings where the model is parametric and one or both of the hypotheses are composite.
5.2.1 The General Case In order to apply a likelihood ratio test it is necessary that we know the joint distribution of the observations up to a finite number of unknown parameters. In our setting this entails that we have a model not only for the mean function r but also for the distribution of the errors E1 , ... , En· We might, for example, suppose that E1 , ... , En are independent and identically distributed from a density g( · ; ¢), where g is known up to the vector of parameters ¢. A familiar special case of this scenario is when the Ei 's are i.i.d. N(O, o- 2 ), in which case ¢ consists of the single unknown parameter a-. More generally, one could simply assume that E1 , ... , En have some joint density g(u 1 , ... , un; ¢) depending on the unknown vector of parameters ¢.If we also assume that r is in a parametric family Se (as in (5.2)), then the likelihood function of the data Y = (Y1 , ... , Yn) is
Our interest is in testing the hypotheses (5.3)
Ho : BE 8o
vs.
Ha : B E 8- 8o,
where 8o is some subset of e. Suppose that the parameter ¢ is assumed to lie in a set 1>. Then the likelihood ratio test of hypotheses (5.3) rejects H 0 for small values of the test statistic
(5.4)
An=
SUP{OE8o,>E1>} SUP{OE8,¢E1>}
L(B, ¢JY) . L(B, ¢JY)
In order to carry out a test that has a specified level of significance, one must know the probability distribution of An when the null hypothesis is true. It is well known that when H 0 is true and the probability model satisfies certain regularity conditions, the statistic -2log(An) converges in distribution to a random variable having the x2 distribution with p - p 0 degrees of freedom, where p 0 and p are the numbers of free parameters associated with 8 0 and 8, respectively. This suggests that one use as a
120
5. Classical Lack-of-Fit Tests
large sample test with nominal size a the test that rejects Ho when (5.5)
-2log(An) ;::o: x;_Po,o:'
where X~-po,o: is the (1 a)100th percentile of the X~-Po distribution. Kendall and Stuart (1979) and Chernoff (1954) provide sufficient conditions such that the test with rejection region (5.5) is asymptotically valid. In principle, the likelihood ratio test is applicable in a wide variety of cases. An elementary case is one in which the null hypothesis is simple. For example, one might assume that r has the form
r(x; B) = Bo withe = {e interest is
+ elx,
0 ::;
X ::;
1,
(Bo, 81) : -oo < 80 ,8 1 < oo}. If the null hypothesis of H0
:
r(· ; B) is identical to 0,
then 8 0 is the singleton set {(0, 0)}. Nested regression models can also be dealt with via likelihood ratio tests. Suppose, for example, that our most general model for r is (5.6) r(x; e) = exp(Bo + elx + e2x 2), 0::; X ::; 1, with 8 = ~ 3 . The hypotheses of interest might be H0
:
log (r(·; B)) is a straight line
versus Ha : log (r(·; B)) is a quadratic.
Here the straight line model is nested within the quadratic, and 8 0 is the set of all 3-tuples of the form (80 , 81, 0). A third setting to which the likelihood ratio test is applicable is when we wish to test one parametric family of functions against another, and neither family is nested within the other. One may be interested in, say, testing the null hypothesis that r is some quadratic against the alternative that r has the form (5.6). In this case we may write the regression function as r(x; B)= (eo+ B1x + B2x 2) I{o}(B3) + exp(Bo + B1x + B2x 2)J{l}(B3), where the parameter space is
and eo
=
{(Bo, e1, e2, 83) : e3
=
o}.
The non-nested case is an example of where the X~-Po approximation to the distribution of -2log(An) is typically not valid. Cox (1962) proposed a slightly modified version of the likelihood ratio test for the non-nested case,
I.
s
f
r
5.2. Likelihood Ratio Tests
121
and White (1982) provided conditions under which the modified statistic is asymptotically normal. A treatment of subsequent developments in testing non-nested models may be found in Pace and Salvan (1990). The likelihood ratio test is obviously designed for situations where we wish to compare how well two specific, parametric models fit the data. By contrast, we might wish to test the fit of a given model without having in mind any particular alternative model. For example, we might simply like to know if the data provide any evidence that r deviates from a straight line, whatever kind of deviation that might be. We shall see later that nonparametric tests are often better suited for such cases than are likelihood ratio tests or other types of parametric tests.
f
5.2.2 Gaussian Errors It is to our benefit to consider in some detail the likelihood ratio test in the case where the error terms E1 , ... , En are independent and identically distributed N(O, o- 2 ) random variables. This situation is important not only because of the central role it has played in classical statistics but also because it leads to lack-of-fit tests that are very useful for non-normal data. When the E/s are i.i.d. N(O, o- 2 ) (where a- is an unknown parameter), the likelihood ratio (5.4) is A _ n -
2
SUP{8E8a,a>D} 0"-n SUP{8E8,a>D}
exp{- 2:.:~=1 (Yi- r(xi; 0)) /(2o-
2
2
e
Let Bo and be respectively the restricted and unrestricted least squares estimators of 0; in other words, B0 minimizes
n
n
L
(Yi- r(xi; 0))
2
i=1
for 0 E 8o while B minimizes o- 2 (0) over 0 E 8. The likelihood ratio may now be expressed as
and, hence the likelihod ratio test rejects H 0 for large values of the variance ratio
(5.7)
,,
I'
I
)}
u-n exp{- 2:.:~= 1 (Yi - r(xi; 0)) /(2o- 2 )}
1 o- 2 (0) = -
!I
o- 2 ( Bo)
a-2(B) .
The quantity (5. 7) is exemplary of a wide class of variance-ratio statistics that are useful in testing lack of fit, regardless of whether the data are Gaussian or not. Many of the statistics to be encountered in this and later
I
122
5. Classical Lack-of-Fit Tests
chapters are special cases of the following general approach. Suppose that two estimators of variance are constructed, and call them &1r and &2 . The estimator &1r is derived on the assumption that the null model is correct. It is an unbiased estimator of (} 2 under H 0 and tends to overestimate (} 2 under the alternative hypothesis. The estimator &2 is constructed so as to be less model dependent than &1r, in the sense that &2 is at least approximately unbiased for (} 2 under both null and alternative hypotheses. It follows that the ratio &1r j &2 contains information about model fit. Only when the ratio is significantly larger than 1 is there compelling evidence that the data are inconsistent with the null hypothesis.
5.3 Pure Experimental Error and Lack of Fit An ideal scenario for detecting lack of fit of a regression model is when more than one replication is available at each of several design points. In this case the data may be written 1, ... , ni, i = 1, ... , n, where we assume that the Eij 's have a common variance for all i and j. For such data we may assess the pure experimental error by computing the statistic n
SSEp =
ni
L L(1'ij- Yi)
2
,
i=l j=l
where Yi is the sample mean of the data at design point Xi· Defining N = I:~=l ni, if at least one ni is more than 1, then &'J, = SSEpj(N- n) is an unbiased estimator of the error variance (} 2 . This is an example of a model-free variance estimator, in that its construction does not require a model for r. From a model-checking point of view, there is obviously a great advantage to having replicates at at least some of the design points. Suppose that r(.) is an estimate of the regression function r and that we wish to assess the fit of r(} Define f'i = r(xi), i = 1, ... , n, and consider the residuals eij
= Yij- f'i = (Yij - Yi) + (Yi
J?i),
-
j
=
1, ... , ni, i
Defining the model sum of squares SSEM by n
SSEM
=
L ni(Yi- fi) i=l
2
,
=
1, ... , n.
5.3. Pure Experimental Error and Lack of Fit
123
the sum of squared residuals SSE is n
SSE
=
ni
LL
e7j
= SSEp + SSEM.
i=1 j=1 A model-based estimator of variance is 8-Xt = SSEMin. Generally speaking, this estimator will be a "good" estimator so long as the fitted regression model is adequate. However, if the regression function r differs substantially from the fitted model, then 8-Xt will tend to be larger than u 2 , since Yi - "fi will contain a systematic component due to the discrepancy between r and the fitted model. A formal test of model fit could be based on the statistic
(5.8) which is an example of the variance-ratio discussed in Section 5.2.2. The distributional properties of 8-Xt I cr'j, depend upon several factors, including the distribution of the errors and the type of regression model fitted to the data. A special case of interest is when the null model is linear in p unknown parameters and the parameters are estimated by least squares. Here, niTXt I (n - p) is an unbiased estimator of u 2 under the null hypothesis that the linear model is correct. If in addition the errors are Gaussian, the statistic
I I ! '
F = SSEMj (n- p) 2
O"p
has, under the null model, the F distribution with degrees of freedom n- p and N - n. When H 0 is false, F will tend to be larger than it is under H 0 ; hence the appropriate size a test is to reject H 0 when F exceeds the (1 - a)100th percentile of the F(n-p),(N -n) distribution. When there are no replicates (i.e., ni = 1 for each i), it is still possible to obtain an estimator that approximates the notion of pure experimental error. The idea is to treat the observations at neighboring design points as "near" replicates. If the regression function is sufficiently smooth, then the difference Yi- Yi-1 will be approximately Ei- Ei-1i hence differences of Y's can be used to estimate the variance of the E/s. Gasser, Sroka and JennenSteinmetz (1986) refer to Yi- Yi- 1, i = 2, ... , n, as pseudo-residuals. Other candidates for pseudo-residuals are
',,, I i 1!
,,
I
'
r
which are the result of joining YiH and 1i-1 by a straight line and taking the difference between this line and 1i. Variance estimators based on these
'i
124
5. Classical Lack-of-Fit Tests
two types of pseudo-residuals are
&~
1
=
2(n- 1)
~
~(li- li-d
2
and
Either of the estimators &~ or &~ could be used in place of&~ in (5.8) to obtain a lack-of-fit statistic that approximates the notion of comparing a model's residual error with a measure of pure experimental error. Of course, in order to conduct a formal test it is necessary to know the probability distribution of the variance ratio under the null hypothesis. We defer discussion of this issue until the next section.
5.4 Testing the Fit of Linear Models Suppose that the model under consideration has the linear form p
r(x) =
L
ejrj(x),
0:::;
X:::;
1,
j=l
where r1, ... , Tp are known functions and 81, ... , Bp are unknown parameters. We shall refer to such models as linear models, which of course have played a prominent role in the theory and practice of regression analysis. In this section we consider methods that have been used to test how well such models fit the observed data. In addition to their historical significance, linear models are of interest to us because most smoothing methods are linear in the data and hence have close ties to methods used in the analysis of linear models. The link between linear models and smoothing ideas is explored in Eubank (1988). Initially we will assume that the error terms in model (5.1) are independent and identically distributed as N(O, o- 2 ). This assumption is in keeping with the classical treatment of linear models, as in Rao (1973). However, in Section 5.4.3 we will discuss ways of approximating the distribution of test statistics when the Gaussian assumption is untenable.
5.4.1 The Reduction Method The reduction method is an elegant model-checking technique for the case where one has a particular, linear alternative hypothesis in mind and the
5.4. Testing the Fit of Linear Models
125
null hypothesis is nested within the alternative. Suppose the hypotheses of interest have the form p
(5.9)
Ho : r(x)
=
L Ojo rj(x),
0::::; X::::; 1,
j=1
and p+k
Ha: r(x)
(5.10)
=
L Oja Tj(x),
0::::; X::::; 1,
j=1
where k 2: 1. In the reduction method one determines how much the error sum of squares is reduced by fitting the alternative model having p + k terms. Let SSE0 and SSEa be the sums of squared residuals obtained by fitting the null and alternative models, respectively, by least squares, and define the test statistic
I'
FR = (SSEo- SSEa)/k. SSEa/(n- p- k) Under H FR has an F distribution with degrees offreedom k and n- p- k. 0 2 The denominator SSEa/(n- p- k) is an unbiased estimator of CJ under both Ho and Ha, whereas the numerator (SSEo - SSEa)/k is unbiased for CJ 2 only under H 0 . The expected value of the numerator is larger than CJ 2 when Ha is true, and so Ho is rejected for large values of FR. Obviously FR is another lack-of-fit statistic that is a ratio of variance estimates. In fact, the test based on FR is equivalent to the Gaussian-errors likelihood ratio test. A situation where it is natural to use the reduction method is in polynomial regression. To decide if a polynomial of degree higher than p is required one may apply the reduction method with the null model corresponding to a pth degree polynomial, and the alternative model to a p + k degree polynomial, k 2: 1. Indeed, one means of choosing an appropriate degree for a polynomial is to apply a series of such reduction tests. One tests hypotheses of the form
H[; : r(x)
=
L 010 xij=1
1
vs.
Hg : r(x)
=
L
I
I I
ill
p+k-1
p-1
I,
Oja xi-
1
j=1
for p = 2, 3, ... and takes the polynomial to be of order p, where pis the smallest p for which H{; is not rejected. The reduction method can also be used in the same way to test the fit of a trigonometric series model for r. Lehmann (1959) shows that, among a class of invariant tests, the reduction test is uniformly most powerful for testing (5.9) vs. (5.10). Hence, for alternatives of the form (5.10), one cannot realistically expect to improve upon the reduction test in terms of power. Considering the problem from a larger perspective though, it is of interest to ask how well the reduction
I I
I
126
5. Classical Lack-of-Fit Tests
test performs when H 0 fails to hold, but the regression function r is not of the form (5.10). In such cases the reduction method sometimes has very poor power. As an example, suppose the data are Yi = r(i/n) + Ei, i = 1, ... , n, and the reduction method is used to test
Ho : r(x) = elO
+ e2ox,
0:::;
X :::;
1
versus
In many cases where r is neither a line nor a quadratic, the reduction method based on the quadratic alternative will still have good power. Suppose, however, that r is a cubic polynomial ra(x) = 2:;=0 "fiXi with the properties that "(3 =/= 0 and (5.11) Obviously, r a is not a straight line, and yet when a quadratic is fitted. to the data using least squares, the estimated coefficients will each be close to 0, owing to condition (5.11). The result will be a reduction test with essentially no power. Figure 5.1 shows an example of ra. The previous example points out a fundamental property of parametric tests. Although such tests will generally be powerful against the parametric alternative for which they were designed, they can have very poor power for other types of alternatives. In our example, the problem is that r a is orthogonal to quadratic functions, and consequently the model fitted under Ha ends up looking just like a function included in the null hypothesis. Our example is extreme in that a competent data analyst could obviously tell from a plot of the data that the regression function is neither linear nor quadratic. Nonetheless, the example hints that parametric tests will not always be a satisfactory means of detecting departures from a hypothesized model.
5.4.2 Unspecified Alternatives The example in the last section suggests that it is desirable to have a method for testing lack of fit that is free of any specific alternative model. We consider such a method in this section, and in the process we introduce a technique for obtaining the probability distribution of the ratio of two quadratic forms. This technique will come in handy in our subsequent study of smoothing-based lack-of-fit tests.
5.4. Testing the Fit of Linear Models
127
•,·
(\J
0 '•/
·'
..
-
...
···~,:::':K;';U' "·'
(\J
9
;·
0.0
0.2
0.6
0.4
0.8
1.0
X
FIGURE 5.1. Cubic Polynomial That Foils a Reduction Method Lack-of-Fit Test. The 1000 data values were generated from the cubic (solid line). The dotted line is the least squares quadratic fit.
We wish to test whether the data are consistent with the linear model in (5.9). To this end, define then x p design matrix R by
R=
rl(xl) r1 (x2)
rp(x1) ) rp(x2)
rl(xn)
rp(xn)
.
( We assume throughout the rest of Section 5.4.2 that R has full column rank. This condition ensures unique least squares estimates of the coefficients el, ... 'eP' The least squares estimates will be denoted Fh, ... 'Bp· The test we shall consider is a generalization ofthe von Neumann (1941) test to be described in Section 5.5.1 and is also closely related to a test proposed by Munson and Jernigan (1989). Define the ith component of the vector e of residuals by ei = Yi 1 Bjrj(xi), i = 1, ... , n. It is well known that
I.::f=
e = [In
R(R' R)- 1 R']
Y,
128
5. Classical Lack-of-Fit Tests
where Y = (Y1 , ... , Yn)' and In is then X n identity matrix. A model-based estimator of variance is
&~![ =
-
1
-
n- P
t e7 = -
1
-Y' [In- R(R'R)- 1 R'] Y. n- P
i=1
We now desire an estimator of variance that will be reasonable whether or not the linear model (5.9) holds. Consider
u-2 = -1 6~( ei an i=2 where an matrix
H=
=
-
ei-1
)2 = -1
e
I
He,
an
2( n- 1) - trace(H R( R' R) - 1 R') and H is the n x n tridiagonal
1 -1
-1
0
2
-1
0 0
0
-1
2
0 0
0 0
0 0
-1
0 0 0
0 0 0
0 0 0
0 0 0
0 0
0 0
-1
2
0
-1
-1 1
This estimator of variance is unbiased for u 2 when the linear model holds and is consistent for u 2 as long as the linear function in (5.9) and the underlying regression function r are both piecewise smooth. We now take as our statistic the variance ratio
Vn
&~
=
(j-2 .
Other possible denominators for the test statistic are the estimators &~ and &~, as defined in Section 5.3. An argument for using & 2 is that it is completely free of the underlying regression function under H 0 . Furthermore, it will typically have smaller bias than &~ when the linear model is reasonably close to the true function. Of course, one could also form an analog of &~ based on the residuals from the linear model. Let us now consider the probability distribution of the statistic Vn when the linear model holds. First observe that Vn is the following ratio of quadratic forms:
Vn
=
Y'AY Y'BY'
where
A=
1
n-p
and 2
B = (n- p) AHA. an
5.4. Testing the Fit of Linear Models
When the linear model holds, AY fi 's, and hence
= Ac:,
where
129
c: is the column vector of
c:'Ac: Vn = c:'Bc:. Note that the distribution of c:'Ac:lc:'Bc: is invariant toO", and so at this point we assume without loss of generality that O" = 1. We have
P (Vn 2 u) = P
[c:' (A-
uB)
c:
2 o).
Theorem 2.1 of Box (1954) implies that the last probability is equal to
(5.12) where r = rank(A- uB), Aln(u), ... , Arn(u) are the real nonzero eigenvalues of A - uB and xr, ... , are i.i.d. single-degree-of-freedom x2 random variables. Given an observed value, u, of the statistic Vn, one may numerically determine the eigenvalues of A- uB and thereby obtain an approximation to the P-value of the test. Simulation or a numerical method as in Davies (1980), Buckley and Eagleson (1988), Wood (1989) or Farebrother (1990) can be used to approximate P(~;=l Ajn(u)xJ > 0). This same technique can be applied to any random variable that is a ratio of quadratic forms in a Gaussian random vector. For example, the null distributions of the statistics 8-~ I 8-J and 8-~ I 8-~ defined in Section 5.3 can be obtained in this way, assuming of course that fl, ... , En are i.i.d. Gaussian random variables. Several of the nonparametric test statistics to be discussed in this and subsequent chapters are ratios of quadratic forms and hence amenable to this technique.
x;
5.4.3 Non-Gaussian Errors To this point in Section 5.4 we have assumed that the errors have a Gaussian distribution, which has allowed us to derive the null distribution of each of the test statistics considered. Of course, in practice one will often not know whether the errors are normally distributed; hence it behooves us to consider the effect of non-Gaussian errors. An important initial observation is that, whether or not the errors are Gaussian, the null distribution of each test statistic we have considered is completely free of the unknown regression coefficients 81 , ... , eP. This is a consequence of the linearity of the null model; typically the distribution of lack-of-fit statistics will depend upon unknown parameters when the null model is nonlinear. Furthermore, if we assume that c1 , ... , En are i.i.d.
I .1
130
5. Classical Lack-of-Fit Tests
with cumulative distribution function G 0 ( x / (J), then the null distribution of each statistic is invariant to the values of both e and (J. We will discuss two methods of dealing with non-Gaussian data: large sample tests and the bootstrap. As a representative of the tests so far discussed, we shall consider the test of fit based on the statistic Vn of Section 5.4.2. The following theorem provides conditions under which Vn has an asymptotically normal distribution. Theorem 5.1. Suppose model (5.1) holds with r having the linear form of (5.9) and E1 , ... , En being independent random variables with common variance (J 2 and EIEii 2 H < lvi for all i and positive constants 8 and M. If fh, ... , satisfy
ev
2
E(Bj- ej) =
o( ~),
j=1, ... ,p,
and irJ(x)l is bounded by a constant for each j and all x E [0, 1], then the statistic Vn of Section 5.4.2 is such that
Vn(Vn- 1)
_E____.
N(O, 1)
as n-+ oo.
The numerator in the first term on the right-hand side of this expression is n
2
L
n
eiei-1
+ ei + e; = 2 L
i=2
eiei-1
+ Ov(1).
i=2
Now, n
L
n
eiei-1
i=2
=
L
EiEi-1
+ Rn,
i=2
where Rn is the sum of three terms, one of which is
p
L(ej- ej)Pj, j=1
with Pj = '2.:::~= 2 ~i-1rj(xi), j = 1, ... ,p. It follows that Rn1 = Ov(1) since Epy = O(n), E(eJ- ej) 2 = O(n- 1 ) and pis finite. The other two terms in Rn can be handled in the same way, and we thus have Rn = Ov(1).
I
5.5. Nonparametric Lack-of-Fit Tests
131
Combining previous results and the fact that 1- 2(n- p)fan = O(n- 1 ) yields
2(n- p) Vn(Vn _ l) = _1_ 2::~=2_ EiEi-1 an Vn 0" 2
+Q
( P
__!___).
Vn
The result now follows immediately upon using the central limit theorem form-dependent random variables of Hoeffding and Robbins (1948). D Under the conditions of Theorem 5.1, an asymptotically valid level-a test of (5.9) rejects the null hypothesis when Za
Vn21+ ..jn' IfVn is observed to be u, then an approximate P-value is l-ei? [vn(u- 1)]. An alternative approximation to the P-value is the probability (5.12). Theorem 5.1 implies that these two approximations agree asymptotically. However, in the author's experience the approximation (5.12) is usually superior to the normal approximation since it allows for skewness in the sampling distribution of the statistic. Another means of dealing with non-Gaussian errors is to use the bootstrap. Let F(·; G) denote the cumulative distribution of ..jn(Vn- 1) when the regression function has the linear form in (5.9) and the errors E1 , ... , En are i.i.d. from distribution G. Given data (x1, Y1), ... , (xn, Yn), we can approximate F(·; G) by F(·; G), where G is the empirical distribution of the residuals ei = Yi- 2::~= 1 Ojrj(xi), i = 1, ... , n. The distribution F(·; G) can be approximated arbitrarily well by using simulation and drawing a sufficient number of bootstrap samples of size n, with replacement, from e1 , ... , en. Using arguments as in Hall and Hart (1990), it is possible to derive conditions under which the bootstrap approach yields a better approximation to the sampling distribution of ..jn(Vn - 1) than does <1?. Although we do not pursue this point here, the technology required for such a derivation may be found in Hall (1992).
5.5 Nonparametric Lack-of-Fit Tests By a nonparametric test of a parametric hypothesis we mean a test that is constructed without reference to a particular class of parametric alternatives. The test based on Vn in Section 5.4.2 is an example of such a test. Typically, nonparametric tests are designed to be omnibus, in the sense that they are consistent against a very wide class of alternatives. (A test is said to be consistent against a given alternative if the power of the test under that alternative tends to 1 as sample size tends to oo.) Chapters 6 through 10 are devoted to a study of nonparametric lack-of-fit tests that
I,'
]i
132
5. Classical Lack-of-Fit Tests
are motivated by ideas from smoothing. In this section we consider some nonparametric tests that predate the smoothing-based tests.
5. 5.1 The von Neumann Test Perhaps the simplest parametric hypothesis concerning r is the "constant regression" or "no-effect" hypothesis, i.e., H0
(5.13)
r(x)
:
=
C,
0 :S x :S 1,
where C is an unknown constant. Along with our other model assumptions, hypothesis (5.13) implies that x has no effect on Y. Von Neumann (1941) proposed a simple means of testing (5.13) in the event that the independent variable x is time. In fact, his test is useful whenever the anticipated departure of r from constancy is smooth in x. Define the statistic 8-~ as in Section 5.3, and 8 2 by 8
2
=
1 n-1
--
Ln (Yi -
-2
Y) .
i=1
Von Neumann (1941) proposed the statistic FN = 8 2 /8-~ for testing the null hypothesis (5.13). If (5.13) is true, then both 8 2 and 8-~ are unbiased estimators of f7 2 • If H 0 is false and r has a continuous first derivative, then
E(G-~) = =
f7
2
(J2
+
2
(n
~ 1) ~[(xi- Xi-1) r'(xt)] 2
+ O(n-2),
where x; E [xi_ 1, xi], i = 2, ... , n. To a good approximation, then, we may regard 8-~ as a model-free variance estimator. The estimator 8 2 is founded on the simple model r = C and, in general, has expectation
where 'Fn = n- 1 2:~= 1 r(xi)· It is thus reasonable to reject H 0 for large values of the statistic FN, which is another example of the testing approach described in Section 5.2.2. Von Neumann (1941) obtained approximations to the null distribution of FN on the assumption that Y1 , ... , Yn are i.i.d. Gaussian random variables. The null distribution in this case may also be obtained by the technique discussed in Section 5.4.2, as FN may be expressed by
Y' (In- n- 1 Jn) Y
(5.14) where Jn is then
FN=2 X
Y'HY
'
n matrix of all1's and His defined as in Section 5.4.2.
5.5. Nonparametric Lack-of-Fit Tests
133
Analogs of the von Neumann test may be constructed by using different variance estimators in the denominator of the test statistic. For example, Munson and Jernigan (1989) propose a version of FN in which&~ is replaced 2 by J~ [f"(t)] d~, where f is the natural cubic spline interpolant to the residuals Yi - Y, i = 1, ... , n. Eubank and Hart (1993) show that, for a 1 2 known constant bn, bn 0 [f"(t)] dt is indeed a yin-consistent estimator of 2 CJ whenever r" is integrable. The von Neumann test is equivalent to the Durbin-Watson (1950) test when the latter is used to check for positive serial correlation among a sequence of constant-mean random variables. Qualitatively, positive serial correlation has a similar effect on F N as that produced by a smooth trend. This is in keeping with the maxim "one person's correlation is another's smooth curve." The lack-of-fit test discussed in Section 5.4.2 is simply the von Neumann test applied to the residuals e 1 , ... , en from the linear model. Hence, the idea underlying the von Neumann test has much more general application than simply testing the no-effect hypothesis (5.13). In fact, one could use the same idea in testing the fit of a nonlinear parametric model. Suppose that Be = {r(-; e) : E 8} is any parametric family of functions and we wish to test
J
e
Ho: r
E
Be
against a general alternative. Let fJ E 8 be our "best" estimate of the true parameter value assuming that H 0 is true. Defining ei = Y,; - r(xi; B), i = 1, ... , n, a reasonable test statistic for H 0 would be (5.15)
~~= 2 (ei- ei-1) 2 ·
The only added difficulty in using (5.15) in the nonlinear case is that the null distribution of the test statistic will, in general, depend upon the true value of the parameter e. By contrast, the null distribution of (5.15) is completely free of regression coefficients in the linear case, a statement that is true whether or not the Ei 's are normally distributed. The dependence of (5.15) on unknown parameters complicates approximation of the null distribution of the test statistic. One means of dealing with the problem is to use the bootstrap. Let F(·; G) denote the cumulative distribution of (5.15) when the regression function is r(·; e) and the errors El, ... 'En are i.i.d. from distribution G. Given data (x 1 , Y1 ), ... , (xn, Yn), we can then approximate F(·; G) by F(·; G), where is our estimate of e and G is the empirical distribution of the residuals ei = Y,; - r(xi; i = 1, ... 'n. The distribution F(·; G) can be approximated by simulating bootstrap data Yt, ... , Y; from the model
e,
e
e,
e,
e),
li* = r(xi; e) + ei'
e,
i
=
1, ... 'n,
134
5. Classical Lack-of-Fit Tests
where e]', ... , e~ are i.i.d. as G. The double bootstrap is particularly helpful in this setting where the distribution of the test statistic depends on e. We will elaborate on the double bootstrap in Section 8.3.
5. 5. 2 A Cusum Test Cusum-based procedures have long been used in quality control to detect a change in the mean of a sequence of random variables (Page, 1954). Recently, Buckley (1991) has shown that a cusum-based test of the noeffect hypothesis is locally most powerful in a certain Bayesian model. For data Y1 , ... , Yn from our model (5.1), the cumulative sum, or cusum, Sj, j = 1, ... , n, is defined as follows: j
Sj
L)Yi- Y),
=
j
= 1, ... , n.
i=l
Buckley's test is based on the statistic
T
n-2 "'":
-
UJ= A2
B-
(Jd
1
82 J
'
where &~ is as defined in Section 5.3. The hypothesis that r is constant is rejected for large values of TB. Buckley's test statistic is another example of a variance ratio. Under model (5.1) it is easy to check that
E
{n- ts;} 2
J=l
~
2
=
(n-1~~n+1) +n_ 2 t~J, j=l
where j
~j
=
L)r(xi) -
'Fn),
j = 1, ... , n,
i=l
and 'Fn = I:~=l r(xi)/n. When r is a constant function, the numerator of TB has expectation proportional to u 2 . Under smooth alternatives to the no-effect hypothesis, the estimator&~ continues to be a consistent estimator of u 2 , whereas the numerator tends to be larger than it is in the null case. The main difference between TB and the von Neumann statistic, FN, is in the way they measure departures from the null hypothesis. All of the previous statistics in this chapter besides TB measure departures from H 0 in a 'rather nonsmooth way, in the sense that individual residuals are squared and then summed. The Buckley statistic, on the other hand, is based on cumulative sums of the residuals (Yi Y), i 1, ... , n. Rather than "feeling" departures from H 0 through the function r(x)- 'Fn, TB feels
5.5. Nonparametric Lack-of-Fit Tests
135
them through
R(x)
=lax (r(u)- 'Fn)dF(u),
0::;
X::;
1,
where F is the cumulative distribution of the design points. In a sense, the Buckley test is our first example of a smoothing-based test, since the residuals are averaged, i.e. smoothed, before being squared. A key distinction between TB and the smoothing-based statistics of the next chapter is that the latter are based on local averages. A statistic analogous to TB but utilizing local averages is
n -1
(5.16)
"n
uj=l
r'2( x 1)
'2 O"d
where f(x) is, say, a kernel estimate of r(x). So, at a given Xj, TB sums all the data to the left of x 1 , whereas the statistic (5.16) computes a weighted average of data in a neighborhood to the left and right of x 1 . We will see in Chapter 7 that there can be a real payoff in terms of power from using a smoothing-based statistic rather than TB. The numerator of TB can be expressed as the quadratic form
n- 1 Jn)Cn(In- n- 1 Jn)Y,
n- 2 Y'(In where Cn is the n x n matrix
n-1 n-1 n-2
n-2 n-2 n-2
2 1
2 1
2 1
n
Cn =
2 1 2 1 2 1 2 1
n-3 n-3 n-3 n-3 n-3 n-3 n-3 n-1 n-2
2 1
2 1
1 1
It follows that TB is expressible as a ratio of quadratic forms, and so we
may approximate its null distribution using the method of Section 5.4.2. This approximation is exact when the errors are normally distributed, and valid asymptotically when the errors are merely independent with common variance. Buckley (1991) utilizes a Bayesian model of the form
Y
=
Cl
+ br + c:,
where 1 is a column vector of 1's, C and b are unknown constants and r and c: are independent and normally distributed random n-vectors having common mean 0 and covariance matrices Var(r)
=
V
and
Var(c:)
= 0"
2
In.
In this model r is interpreted as a vector of unknown parameters for which our prior distribution is N(O, V), and the no-effect hypothesis is equivalent
.I
136
5. Classical Lack-of-Fit Tests
to
Ho: b = 0. The work of Cox et al. (1988) shows that a locally most powerful test of H 0 : b = 0 rejects H 0 for large values ofthe quadratic form Y'VY. Buckley proposes a particular choice for V that effectively assigns low likelihood to vectors r that are "rough." He then shows that, for evenly spaced xi's, this choice for Vis proportional to the matrix (In- n- 1 Jn)Cn(In- n- 1 Jn), implying that a test based on the numerator quadratic form of TB is locally most powerful among a certain invariance class of tests.
5. 5. 3 Von Neumann and Cusum Tests as Weighted Sums of Squared Fourier Coefficients If the design points are Xi = (i- 1/2)/n, i = 1, ... , n, then both the von Neumann and Buckley test statistics may be expressed as n-l
2 :Z:::j=l
(5.17)
where the Wj,n's are constants and coefficients ,
¢i
1 =
-
n
'2
Wj,n¢j
;/JI, ... , :Pn-l
are the sample Fourier
n
L Yi cos(njxi),
j
=
1, ... , n- 1.
i=l
The weights for von Neumann's test are Wj,n
=
1,
j
=
1, ... , n- 1,
which follows from Parseval's identity, i.e., n
- 2 -n1 '""" L.)Yi - Y)
n-l = 2 '"""'2 ~ ¢i.
i=l
j=l
Nair (1986) shows that for Buckley's test the weights are
n Wj,n =
[2nsin (jn/(2n))] 2
'
j =
1
' · · · 'n- 1.
The von Neumann statistic weights all n- 1 sample Fourier coefficients equally, whereas Buckley's test places much more weight on the low frequency coefficients. If one expects most of the "action" in the first couple of Fourier coefficients, then the Buckley test is usually preferable to the von Neumann test. For higher frequency functions the von Neumann test will often be the more powerful of the two. These notions will be made more precise in the next section. The form (5.17) suggests that we could construct any number of test statistics by using different weighting schemes {wj,n : j = 1, ... , n- 1}.
5.5. Nonparametric Lack-of-Fit Tests
137
Indeed, many of the smoothing-based statistics to be encountered in Chapter 6 are of this form. It would be desirable to choose a set of weights that maximize the power of our test. Unfortunately the optimal weights will depend upon the true regression function. Nonetheless, one could presumably use even a vague knowledge of r as an aid in determining "powerful" weights.
5.5.4 Large Sample Power When the alternative is fixed, a "reasonable" test will have power that tends to 1 as the sample size n tends to oo. This property tells us next to nothing, though, about power in small or moderate sized samples. Various methods have been proposed that admit a realistic comparison of the limiting power of tests. One such criterion is Pitman relative efficiency (Noether, 1955), in which one computes the smallest sample size required in order for a test to have a given power. This sample size may be compared with the corresponding sample size of another test. The limiting ratio of these two sample sizes is called the asymptotic relative efficiency of the tests. Another way of comparing two sequences of tests is to use local alternatives. Such alternatives tend to the null hypothesis as n tends to oo. One may compute the ratio of the powers of two tests at the same sample size, and then consider the limit of this ratio. When the local alternatives tend to the null at an appropriate rate, the limiting ratio of powers will be between 0 and 1, thus yielding a meaningful comparison of the two tests. Consider testing the null hypothesis that r is identical to a constant. We introduce the following local alternatives model: (5.18)
Yin
= f-l
+n
_, g (i -1/2) n + Ein,
i = 1, ...
,n,
where f-l is a constant, E1 n, ... , Enn are i.i.d. N(O, cr 2 ) random variables, g 1 is a function satisfying 0 g(x)dx = 0 and"( > 0. Assuming that g is not identical to 0, model (5.18) defines a sequence of alternative hypotheses that converge to the null hypothesis as n ----t oo. Typically, when "( is sufficiently large, and hence model (5.18) converges rapidly to the null hypothesis, the limiting power of a test of H 0 : r = C equals the size of the test. Conversely, if "( is sufficiently small, the limiting power will be 1. This leads to the following definition.
J
Definition 5.1. Suppose that model (5.18) holds for a given function g, and consider a sequence { ¢n : n = 1, 2, ... } of size a tests of the null hypothesis that E(Yin) = f-l for all i and n. Let {Pn(g) : n = 1, 2, ... } be the powers corresponding to the sequence of tests { ¢n}· When it exists, the maximal rate, denoted "f(g), of {¢n} is defined to be the largest value
138
5. Classical Lack-of-Fit Tests
of ""( such that lim inf Pn(g) n---+oo
> a.
The notion of maximal rate provides us with one way of comparing tests. If, for a given g, one test has a larger maximal rate than another, then the former test will have higher power than the latter for all n sufficiently large and all multiples of g that are sufficiently hard to detect (i.e., close to the null case). It is worth noting that in parametric problems maximal rates are usually 1/2, which is a consequence of the fo convergence rate of most parametric estimators. At this point we investigate the limiting distribution of the von Neumann and Buckley statistics under model (5.18). Doing so will allow us to determine the maximal rate of each test and also establish that each test is consistent against a general class of fixed alternatives. To simplify presentation of certain results, we assume throughout the remainder of Section 5.5.3 that Xi = (i -1/2)/n, i = 1, ... , n. We first state a theorem concerning the limit distribution of FN. When g = 0, this theorem is a special case of Theorem 5.1. A proof for the case where J g 2 > 0 is given in Eubank and Hart (1993). Theorem 5.2. The maximal rate of the von Neumann test under model 1 (5.18) is 1/4 for any g such that 0 < 11911 < oo, where 11911 2 = f 0 g 2 (x) dx. Furthermore, suppose that model (5.18) holds with""( = 1/4, and let g be square integrable on (0, 1). Then
An almost immediate consequence of Theorem 5.2 is that the von Neumann test is consistent against any fixed alternative r 0 for which lira - ro II > 0, where ro = f01 ro (x) dx. Another interesting result is that FN has a maximal rate of 1/4, which is less than the maximal rate of 1/2 usually associated with parametric tests. For example, suppose that H0 : r Cis tested by means of the reduction method with a pth order polynomial as alternative model (p ;:::: 1). For many functions g, even ones that are not polynomials, this reduction test has a maximal rate of 1/2. On the other hand, there exist functions g with 0 < llg II < oo such that the limiting power of a size-a reduction test is no more than a. (Such functions can be constructed as in the example depicted in Figure 5.1.) The difference between the von Neumann and reduction tests is characteristic of the general difference between parametric and omnibus nonparametric tests. A parametric test is very good in certain cases but
=
I!'
5.5. Nonparametric Lack-of-Fit Tests
139
very poor in others, whereas the nonparametric test could be described as jack-of-all trades but master of none. Turning to Buckley's cusum-based test, we have the following theorem, whose proof is omitted. Theorem 5.3. The maximal rate of the no-effect test based on TB is 112 under model (5.18) for any g such that 0 < 11911 < oo. Furthermore, if model (5.18) holds with 1 = 112 and g square integrable on (0, 1), then 1 v TB--+ 2 7r
= ( zj
L
j=l
+ -/2aj I "2
J
where Z1, Z2 , ... are i.i.d. as N(O, 1) and aj
=
(J
r
'
J01 cos(1rjx)g(x) dx.
Theorem 5.3 entails that Buckley's test is consistent against any fixed alternative r 0 for which lira- roll > 0. So, the von Neumann and Buckley tests are both consistent against any nonconstant, piecewise smooth member of L 2 [0, 1]. Theorem 5.3 also tells us that the cusum test has maximal rate equal to that commonly associated with parametric tests. In a certain sense, then, the cusum test is superior to the von Neumann test, since the latter has a smaller maximal rate of 1I 4. This means that for any given square integrable g, if we take 1 = 112 in (5.18), then there exists an n 9 such that the power of the Buckley test is strictly larger than that of the same-size von Neumann test for all n > n 9 . As impressive as it may sound this result certainly has its limitations. To appreciate why, we now compare the powers of the two tests in a maximin fashion. Let model (5.18) hold with 1 = 114. If we compare the power ofthe two tests for any specific g, then Buckley's test is asymptotically preferable to the von Neumann test since lim P (TB 2: Tn(a))
n-->CXJ
1
where Tn(a) and Vn(a) are the (1- a)100th percentiles of the null distributions of TB and FN, respectively. Alternatively, suppose that for 1 = 114 in model (5.18) we compute the smallest power of each test over a given class of functions. Consider, for example, the sequence of classes 9n:
where f3n --+ (3 > 0 as n --+ oo. It is straightforward to see by examining the proof of Theorem 5.2 that the von Neumann test satisfies (5.19)
I, ,I
i;
-, ;J
I.
140
5. Classical Lack-of-Fit Tests
Yn is the function gn(x; k) = hf3~1 2 cos(1Tkx),
Now, one element of
0:::; x:::; 1.
Obviously (5.20)
I
As indicated in Section 5.5.3, 1
TB = ~ a-
(5.21)
I
2n¢J 2: --------~---= 2 [2n sin (j7T/(2n))) '
n-1 j=
I
1
from which one can establish that if kn > n 114 log n, then
I
(5.22) as n----> oo. Along with (5.19) and (5.20), (5.22) implies that lim inf P (TB ?_ Tn(a)) :::; a
n->oo gE9n
< lim inf P (FN n->oo gEYn
?_ vn(a)).
In words, the very last expression says that certain high frequency alternatives that are easily detected by the von Neumann test will be undetectable by Buckley's test. The previous calculations show in a precise way the fundamental difference between the von Neumann and Buckley tests. Expression (5.21) implies that, for large n, 1
TB ~ 7T2
1
L j-2 ( 2n¢2) Q-2
n-
J
'
J=1
which shows that TB will have difficulty in detecting anything but very low frequency type functions, i.e., functions r for which llr - rll 2 is nearly the sum of the first one or two Fourier coefficients. By contrast, the .von Neumann statistic weights all the squared Fourier coefficients equally, and so the power of the von Neumann test is just as good for high frequency functions as for low frequency ones.
5.6 Neyman Smooth Tests Neyman smooth tests are a good point of departure as we near our treatment of smoothing-based lack-of-fit tests. Indeed, they are a special case of certain statistics that will be discussed in Chapter 6. Like the von Neumann and Buckley statistics, Neyman smooth statistics are weighted sums of squared Fourier coefficients. The only way in which they differ substantially from the two former tests is through the particular weighting scheme they employ. We shall see that Neyman smooth tests are a sort of compromise between the von Neumann and Buckley tests.
I '
5.6. Neyman Smooth Tests
141
Neyman (1937) proposed his smooth tests in the goodness-of-fit context. Suppose X1, ... , Xn are independent and identically distributed observations having common, absolutely continuous distribution function F. For a completely specified distribution F 0 , it is desired to test
Ho : F(x)
=
Fo(x) V x,
which is equivalent to hypothesizing that F 0 (Xl) has a uniform distribution on the interval (0, 1). Neyman suggested the following smooth alternative of order k to H 0 :
g(u)
= exp(eo
+
t,
0 < u < 1,
ei¢i(u)),
where g is the density of F 0 (X1), ¢ 1, ... , ¢k are Legendre polynomials transformed linearly so as to be orthonormal on (0, 1) and eo is a normalizing constant. In this formulation the null hypothesis of interest is
The test statistic proposed by Neyman is k
Ill~ = ~ Vi
2
1
with
n
Vi= fo ~ ¢i (Fo(Xj)).
Under H 0 , w~ is asymptotically distributed as x2 with k degrees offreedom. The null hypothesis is thus rejected at level a when Ill~ exceeds the (1 a) lOOth percentile of the x~ distribution. Neyman referred to his test as a smooth test since the order k alternatives differ smoothly from the flat density on (0, 1). His test was constructed in such a way that, to a first order approximation, its power function deAmong all tests with this pends on e1, ... , only through ..\2 = 1 property, Neyman (1937) argued that his smooth test is asymptotically uniformly most powerful against order k alternatives for which .A is small. We now consider Neyman's idea in the context of regression. Suppose that in model (5.1) we wish to test the no-effect hypothesis that r is identical to a constant. In analogy to Neyman's smooth order k alternatives, we could consider alternatives of the form
I:7= et.
ek
k
(5.23)
r(x) = eo+
I: ei¢i,n(x), i=l
where ¢1,n, ... , ¢k,n are functions that are orthonormal over the design points in the sense that, for any 0 ~ i, j ~ k,
(5.24)
142
5. Classical Lack-of-Fit Tests
and ¢o,n =:= 1. The least squares estimators of fh, ,
ej
1 =
-;;:
n
2::: ~¢j,n(xi),
j =
... , Bk are simply
o, ... , k.
i=1
If the errors are Gaussian, then under Ho, y'n(fh, ... , fh)/rY has a kvariate normal distribution with mean 0 and identity covariance matrix. This statement remains true in an asymptotic sense if the errors are merely independent with common variance rY 2. Define TN,k by TN,k = &2
n
k '2 l:j=1 ej '2 ' (J
CY 2 •
where is some estimator of An apparently reasonable test of the no-effect hypothesis is to conclude that there is an effect if and only if TN,k 2: c. We shall take the liberty of calling this test a "Neyman smooth test," due to its obvious similarity to Neyman's smooth goodness-of-fit test. Not surprisingly, the reduction method of testing the no-effect hypothesis against (5.23) is equivalent to a Neyman smooth test. Due to the orthogonality conditions (5.24), the statistic FR from Section 5.4.1 is
and so
where T/J,k is the version of TN,k with & 2 = n- 1 2.:~= 1 (~- Y)2. Since FR is monotone increasing in TfJ k> it follows that the reduction test is equivalent to a Neyman smooth test. ' Let us now suppose that the errors are i.i.d. as N(O, rY 2 ). Then FR has the F distribution with degrees of freedom k and n - k 1. Hence, an exact size-a: Neyman smooth test has rejection region of the form (5.25)
R nFk,n-k-1,a T Nk> · ' - Fk,n-k-1,a + (n k- 1)/k
Results in Lehmann (1959, Chapter 7) imply that when the errors are Gaussian, the test (5.25) has a power function that depends only on 7jJ 2 = 2.:7=1 e[ I rY 2 . More importantly, test (5.25) is uniformly most powerful for alternatives (5.23) among all tests whose power functions depend only on 'ljJ 2 . In light of our discussion in Section 5.5 .4, it is significant that the power of the smooth test (5.25) depends on the ei 's only through 7/J 2 . This implies, for example, that fork = 4, the power of (5.25) would be the same for the two cases e = (1, 0, 0, 0) and e = (0, 0, 0, 1). By contrast, Buckley's test
5.6. Neyman Smooth Tests
143
tends to have good power only for low frequency alternatives, as discussed in Section 5.5.4. 2 When the errors are merely independent with common variance tT , k
L
2 n(B·-8·) J J
D
-----+
(}2
x2k
II
j=l
as n
oo. It follows that if 8" 2 in
TN,k
1
rn(x) =eo+
i
2
is any consistent estimator of tT , then TN,k is asymptotically distributed x% under the no-effect hypothesis. This fact allows one to construct a valid large sample Neyman smooth test. It is easy to verify that for any order k alternative the order k Neyman smooth test has maximal rate 1/2. Furthermore, an asymptotic version of the uniformly most powerful property holds for the Neyman smooth test. Consider local alternatives of order k having the form -+
Vn
8 k
8iC/Ji,n(x).
I
I
i[: 1·,
iI
i
I:
!
•
'I
Ii
i•
i" I
I: I
Under these alternatives the order k Neyman smooth test has a limiting power function that is uniformly higher than that of any test whose limiting 2 tT . power depends only on I:7=1 Suppose the design points are Xi = (i- 1/2)/n, i = 1, ... , n, and that the Neyman smooth test uses the cosine functions
er;
c/Jj,n(x) =
..J2 cos(1rjx),
j = 0, ... , k.
Then the Neyman statistic is a weighted sum of squared Fourier coefficients as in (5.17) with 1, Wj,n = { 0,
I I I:
1~j~k
< j < n. A Neyman smooth test with 2 < k < < n may thus be viewed as a comk
promise between the von Neumann and Buckley test statistics. If one is uncertain about the way in which r deviates from constancy but still expects the deviation to be relatively low frequency, then a Neyman smooth test of fairly small order, say 5, would be a good test. An order 5 Neyman test will usually be better than either the von Neumann or Buckley test when most of the "energy" in r is concentrated in the third, fourth and fifth Fourier coefficients. In Chapter 7 we will introduce tests that may be regarded as adaptive versions of the Neyman smooth test. These tests tend to be more powerful than either the von Neumann or Buckley test but do not require specification of an alternative as do Neyman smooth tests.
I I
I,
l 6 Lack-of-Fit Tests Based on Linear Smoot hers
6.1 Introduction We are now in a position to begin our study of lack-of fit tests based on smoothing methodology. We continue to assume that observations Y1, ... , Yn are generated from the model
(6.1)
Yi = r(xi) + Ei,
i
=
1, ... , n,
in which E1 , ... , En are mean 0, independent random variables with common variance a 2 < oo. We also assume that the design points satisfy 1 Xi = F- [(i- 1/2)/n], i = 1, ... , n, where F is a cumulative distribution function with continuous derivative f that is bounded away from 0 on [0, 1]. This assumption on the design is made to allow a concise description of certain theoretical properties of tests, but it is not necessary in order for those tests to be either valid or powerful. In this chapter we focus attention on the use of linear smoothers based on fixed smoothing parameters. By a linear smoother, we mean one that is linear in either Y1 , ... , Yn or a set of residuals e1, ... , en. If applied to residuals, a linear smoother has the form n
(6.2)
g(x; S) =
L wi(x; S)ei, i=l
where the weights wi(x; S), i = 1, ... , n, are constants that do not depend on the data Y1, ... , Yn or any unknown parameters, and S denotes the value of a smoothing parameter. A smoother that we do not consider linear is g(x; S), where S is a nonconstant statistic. Kernel estimators, Fourier series, local polynomials, smoothing splines and wavelets are all linear in the Yi 's as long as their smoothing parameters are fixed rather than data driven. Our interest is in testing the null hypothesis that r is in some parametric class of functions Se against the general alternative that r is not in S 8 . The basic idea behind all the methods in this chapter is that one computes 144
6.2. Two Basic Approaches
145
a smooth and compares it with a curve that is "expected" under the null hypothesis. If the smooth differs sufficiently from the expected curve, then there is evidence that the null hypothesis is false. Smoothing-based tests turn out to be advantageous in a number of ways: • They are omnibus in the sense of being consistent against each member of a very large class of alternative hypotheses. • They tend be more powerful than some of the well-known omnibus tests discussed in Chapter 5. • They come complete with a smoother. The last advantage is perhaps the most attractive feature of smoothingbased tests. The omnibus tests of Chapter 5 do not provide any insight about the underlying regression function in the event that the null hypothesis is rejected. In contrast, by plotting the smoother associated with a lack-of-fit test, one obtains much more information about the model than is contained in a simple "accept-reject" decision. Our study begins in the next section with a look at two fundamental smoothing-based approaches to testing lack of fit.
6.2 Two Basic Approaches Two fundamental testing approaches are introduced in this section. These will be referred to as (i) smoothing residuals and (ii) comparing parametric and nonparametric models. Sometimes these two methods are equivalent, but when they are not the former method is arguably preferable. The two approaches are described in Sections 6.2.1 and 6.2.2, and a case for smoothing residuals is made in Section 6.2.3.
6.2.1 Smoothing Residuals For a parametric model Se, which could be either linear or nonlinear in the unknown parameters, we wish to test the null hypothesis Ho : r E Se = {r(·; B) : B E 8}.
Let Bbe a consistent estimator of B assuming that the null hypothesis is true. Ideally Bwill also be efficient, although numerical considerations and the nature of the parametric family Se may preclude this. Define residuals e1, ... ,en by
If the null hypothesis is true, these residuals should behave more or less like a batch of zero mean, uncorrelated random variables. Hence, when H 0 is true, a linear smooth gas in (6.2) will tend to be relatively flat and centered about 0. A useful subjective diagnostic is to plot the estimate g(·; S) and
I' i
146
6. Lack-of-Fit Tests Based on Linear Smoothers
see how much it differs from the zero function. Often a pattern will emerge in the smooth that was not evident in a plot of residuals. Of course, looks can be deceiving, and so it is also important to have a statistic that more objectively measures the discrepancy of?;(- ; S) from 0. An obvious way of testing H 0 is to use a test statistic of the form
T
JJg(·; S)J\ 2
=
B-2
,
where JJgJJ is a quantity that measures the "size" of the function g and B- 2 is a model-free estimator of the error variance o- 2 . Examples of Jjg II are
{
{
fa
1
fa
1
g2 (x)f(x) dx
fa
} 1/2
g2 (x) dx
,
1
jg(x)j dx,
} 1/2
and
sup Jg(x)J, O:Sx:s;I
where f is the design density. The measure involving f puts more weight on Jg(x) I at points x where there is a higher concentration of design points. 1 A convenient approximation to 0 g2 (x; S)f(x) dx is
J
which leads to the lack-of-fit statistic
(6.3)
A2( S) R _ n -1 '\'n Lli=1 g Xi; n -
G-2
.
We now argue heuristically that Rn is essentially a variance ratio, as discussed in Section 5.2.2. The residuals are
ei = r(xi) - r(xi; B)
+ Ei,
i = 1, ... , n.
Typically, whether or not H 0 is true, the statistic Bwill converge in probability to some quantity, call it 80 , as n __., oo. When H 0 is true, 80 is the true parameter value, whereas under the alternative, r(-; 80 ) is some member of Be that differs from the true regression function r. It follows that for large n
(6.4) where g(x) = r(x)- r(x; 80 ), and g is identically 0 if and only if H 0 is true. We have
6.2. Two Basic Approaches
147
and so, in essence, Rn has the variance-ratio form of Section 5.2.2. A sensible test would reject H 0 for large values of Rn· It is enlightening to compare Rn with the statistic Vn of Section 5.4.2. Typically a limiting version of a linear smoother interpolates the data. When smoothing residuals this means that (6.5) asS corresponds to less and less smoothing. For smoothers satisfying (6.5) it follows that Vn is a limiting version of Rn· We may thus think of Vn as a "nonsmooth" special case of the smooth statistic Rn. Later in this chapter we provide evidence that the smooth statistic usually has higher power than its nonsmooth counterpart.
6.2.2 Comparing Parametric and Nonparametric Models Suppose that f(·; S) is a nonparametric estimate of r based on a linear smooth of Y1 , ... , Yn. As in the previous section, denotes our estimate of B on the assumption that the null model is true. As our lack-of-fit statistic consider
e
0n
_ llr(·;S)-r(-;B)W Q-2
-
'
where llhll is some measure of the size of h, as in the previous section. We will refer to the statistic Cn as a comparison of parametric and nonparametric models. In general, the statistic Cn will be the same as Rn only when f(·;S)-r(-;B)
g(·;S).
We have n
f(x; S) - r(x; e) =
L {Yi -
r(xi; e)}wi(x; S)
i=l
n
i=l
(6.6)
=
g(x; S)
+ Bias{f(x; S), B},
where Bias{f(x; S), B} denotes the bias of f(x; S) when Ho holds and B is the true parameter value. It follows that smoothing residuals and comparing parametric and nonparametric fits will be equivalent only when the smoother f(x; S) is, for each x E [0, 1], an unbiased estimator of r(x) under· the null hypothesis . .From Chapter 3 we know that smoothers are generally biased estimators of the underlying regression function, and so for the most part Rn and Cn will be the same only in special circumstances.
148
6. Lack-of-Fit Tests Based on Linear Smoothers
Below are a couple of examples where the two methods are equivalent. EXAMPLE 6.1: TESTING FOR NO-EFFECT. Consider testing the no-effect hypothesis Ho : r = constant, wherein the estimate of the null model is simply Y and the residuals are ei = Yi - Y, i = 1, ... , n. The two methods are equivalent in this case whenever the smoother 2::=7= 1 Yiwi (· ; S) is unbiased for constant functions. This is true so long as n
L wi(x; S) =
1 for each x E [0, 1].
i=l
We saw in Chapter 2 that many smoothers have weights that sum to 1, including trigonometric series and local linear estimates, as well as Nadaraya-Watson and boundary modified Gasser-Muller kernel estimates.
EXAMPLE 6.2: TESTING THE STRAIGHT LINE HYPOTHESIS. Consider testing the null hypothesis
Ho : r(x) =eo+ elx,
0:::;: X:::;: 1.
It is easily checked that, independent of the choice of smoothing parameter, local linear estimators and cubic smoothing splines are unbiased for straight lines. It follows that comparing a fitted line with either a local linear estimate or a cubic smoothing spline is equivalent to smoothing residuals. For a given smoother the two basic methods will generally be equivalent for just one specific type of null hypothesis. For second order kernel smoothers without any boundary correction, the no-effect hypothesis is the only case where the equivalence occurs. This follows from the bias expansion that we derived in Section 3.2.2. If one uses a kth order kernel and an appropriate boundary correction, then smoothing residuals will be equivalent to the comparison method when testing the null hypothesis that r is a (k- 1)st degree polynomial. We will explore this situation further in Section 8.2.3.
6. 2. 3 A Case for Smoothing Residuals When constructing a statistical hypothesis test, the first thing one must do is ensure that the test is valid in the sense that its type I error probability is no more than its nominal size. The validity of a test based on Cn can be difficult to guarantee since, by (6.6), Cn's distribution will usually depend upon the unknown parameter e through the quantity Bias{r(x; S), B}. By contrast, bias is usually not a problem when smoothing residuals. Suppose the null model is linear and that we use least squares to estimate the regression coefficients 81 , ... , eP. Then the distribution of the residuals e1, ... , en
6.3. Testing the Fit of a Linear Model
is completely free of el' Furthermore,
149
... ' ep and hence so is the distribution of 11.9(. ; S) II· E(ei) = 0,
i
= 1, ... , n,
and so any linear smooth.§(·; S) of the residuals has null expectation 0 for each x and for every S. The bias-free nature of smoothed residuals is our main reason for preferring the statistic Rn to Cn. Intuitively, we may argue in the following way. Imagine two graphs: one of g(x; S) versus x, and the other of r(x; S)- r(x; B) versus x. A systematic pattern in the second graph would not be unusual even if H 0 were true, due to the bias in the smoother r(·; S). On the other hand, a pattern in the graph of smoothed residuals is not expected unless the regression function actually differs from the null model. Of course, S could be chosen so that Bias{r(x; S), B} is negligible. However, in this case (6.6) implies that Rn and Cn are essentially equivalent. When Bias{f(x; S, B} is not negligible, an obvious remedy is to center r(x; S) - r(x; B) by subtracting from it the statistic n
L r(xi; B)wi(x; S)- r(x; iJ). i=l
In doing so we are left with just the smooth g(x; S), and all distinction between the two methods vanishes. Probably the only reason for adjusting Cn differently is that doing so might lead to a more powerful test. Rather than pursuing this possibility, we will use the more straightforward approach of smoothing residuals in the remainder of this chapter.
6.3 Testing the Fit of a Linear Model We now consider using linear smoothers to test the fit of a linear model, in which case the null hypothesis is p
(6.7)
Ho : r(x) =
L
ejrj(x).
j=l
We assume that the design matrix R defined in Section 5.4.2 is of full column rank and that the parameters el' ... ' ep are estimated by the method of least squares.
6. 3.1 Ratios of Quadratic Forms We begin with a treatment of statistics that are ratios of quadratic forms in a vector of residuals. Define e to be the column vector of residuals, and
I: I
150
6. Lack-of-Fit Tests Based on Linear Smoothers
suppose that our test statistic has the form R
'2( S) _ n -1'\"n Di=l g xi;
s -
fj-2
,
where &2 = e' Ce for some matrix C not depending on the data, and g(xi; S) is of the form (6.2). The vector of smoothed residuals is denoted g and is expressible as
g =We= W(In- R(R'R)- 1 R')Y, where W is the n x n smoother matrix with ijth element wj(xi)· The statistic Rs has the form
Y'AY Y'BY' where
and
When H 0 is true
Rs
c1 Ac
= ---,-B c c;
hence for i.i.d. Gaussian errors the null distribution of Rs can be approximated using the technique introduced in Section 5.4.2. The bootstrap is often an effective means of dealing with the problem of non-normal errors. To the extent that an empirical distribution better approximates the underlying error distribution than does a Gaussian, one has confidence that the bootstrap will yield the better approximation to Rs 's sampling distribution. Even when n is very large there is a compelling reason to use the bootstrap. As will be shown in Section 6.3.3, the asymptotic null distribution of Rs is fairly sensitive to the choice of smoothing parameter. For most smoothers there are three distinct approximations to the large sample distribution of Rs. These approximations correspond to small, intermediate and large values of the smoothing parameter. In practice it will seldom be clear which of the three large sample tests is the most appropriate. The bootstrap is thus attractive even for large n since it automatically accounts for the effect of S on the distribution of Rs. To use the bootstrap to carry out a smoother-based test of (6.7), one may employ exactly the same bootstrap algorithm described in Section 5.4.3. An example in Section 6.4.2 illustrates this technique. Another means of dealing with non-normality and/or small n is to use a permutation test. Raz (1990) uses this approach to obtain P-values for a test of no-effect based on nonparametric smoothers. The idea is to obtain a distribution by computing the statistic for all n! possible ways in which
i i
6.3. Testing the Fit of a Linear Model
151
the li's may be assigned to the xi's. In a simulation Raz (1990) shows that this approach does a good job of maintaining the test's nominal level for non-normal errors and n as small as 10.
6.3.2 Orthogonal Series Suppose that each oft he functions r 1 , ... , r P in the null model is in L2 (0, 1), and let v)', v2, ... be an orthogonal basis for L 2(0, 1). Define {v1 , v2, ... } to be the collection of all vj's that are not linear combinations of TI, . .. , rp, and consider series estimators fm(x) of r(x) having the form m=O
m
=
1,2, ... ,
where, for each m, elm, ... , epm, b1m, ... , bmm are the least squares estimators of fh, ... , Bp, b1, ... , bm in the linear model m
P
Yi
(6.8)
=
L
+L
Bjrj(xi)
j=l
bjvj(xi)
+ Ei.
j=l
We may regard fm(x) as a nonparametric estimator of r(x) whose smoothing parameter is m. For a given m ~ 1, we could apply the reduction method (Section 5.4.1) to test the hypothesis (6.7) against the alternative model (6.8). The estimators f 0 and fm would correspond to the reduced and full models, respectively. We will now argue that the reduction method is equivalent to a test based on a statistic of the form (6.3) with fJ(- ; S) an orthogonal series smooth of the residuals e 1 , ... , en from the null model. Using Gram-Schmidt orthogonalization (Rao, 1973, p. 10), we may construct linear combinations u1, ... , Un-p of r1, ... , rp, v1, ... , Vn-p that satisfy the following orthogonality conditions: n
L
rj(xi)uk(xi) = 0,
1 :::; j :::; p, 1 :::; k:::; n- p,
i=l
and
These conditions imply that the least squares estimators of a 1 , ... , am in the model m
p
}i = L{)jrj(Xi) j=l
+
L:ajUj(Xi) j=l
+ Ei
152
6. Lack-of-Fit Tests Based on Linear Smoothers
are
1
iiJ·
n
= -n~ ""Yiuj(xi),
j
= 1, ... , n- p.
•=1
Let SSE0 and SSEa be as defined in Section 5.4.1 when (6.7) is tested by applying the reduction method to r1, ... , rp, u1, ... , Um· It follows that m
SSE0
-
SSEa = n
LiiJ j=1
and that
FR =
n- p- m Rmn ----~~-n- p- Rmn' m
where
Rmn =
n
~m
L..Jj= 1
A2
aj
A
(J 2
and 8- 2 = SSE0 j(n- p). Again using the orthogonality conditions, one can verify that '
where m
[;(xi; m) =
L ajuj(Xi)· j=1
Since aj has the form
(ij =
n-
1
~(ei + ~ekrk(Xi))uj(Xi) = n- ~eiUj(Xi), 1
we see that g(xi; m) is just a smooth of the residuals. So, the reduction test is a smoothing type test with the same general form as the statistic Rn of Section 6.2.1. Furthermore, recalling Section 5.6, we see that Rmn has the form of a Neyman smooth statistic in which the orthogonal functions are constructed to be orthogonal to the functions comprising the null model. Note that the reduction test uses a model-dependent estimator of variance, namely 8- 2 SSE0 j(n- p). By using different variance estimators , in the denominator of Rmn, one can generate other tests.
6. 3. 3 Asymptotic Distribution Theory In this section we will study the limiting distribution of statistics of the form Rn (Section 6.2.1). The behavior of Rn under both the null hypothesis and
6.3. Testing the Fit of a Linear Model
153
local alternatives will be studied. We consider a particular linear model and a particular smoother in our asymptotic analysis. Since our results generalize in a straightforward way to more general models and smoothers, the results of this section provide quite a bit of insight. Let r 1 be a known, twice continuously differentiable function on [0, 1] such that J~ r 1 (x) = 0, and consider testing the null hypothesis
dx
(6.9) We will test H 0 using statistics based on the Priestley-Chao kernel smoother. For convenience we assume that = (i- 1/2)/n, i = 1, ... , n. Let ell ... ' en be the residuals from the least squares fit Bo + e1 r1 (X)' and define the smoother
Xi
1
flh(x) = nh
~ 6_ eiK
(x-x·) T ,
where the kernel K has support ( -1, 1). We will study a test statistic of the following form:
Rn,h
n-[nh]
1
=
n8-2
g~(xi)·
"'
L.J
i=[nh]+1
2 The variance estimator 8- 2 is any estimator that is consistent for () under H0 . The sum in Rn,h is restricted to avoid the complication of boundary effects. We first investigate behavior of the estimator g when the null hypothesis (6.9) is true. We have
flh(x)
(6.10)
1 n LEiK nh i= 1
=-
(x- -x·)' + (Ooh
1 n (x-x·) Oo)- LK - - ' nh i= 1 h A
1 ~ + (01- 01) nh 6_ r1(xi)K (x-x·) T A
It follows that when H 0 is true, E{gh(x)} = 0 for each x. The quantities Oo- Bo and 01- e1 are each Op(n- 112 ), and so when h----+ 0 and nh----+ oo, the dominant term in flh(x) is
-1 Ln E·K nh .
•=1
'
(XXi) -h
·
It follows immediately from results in Chapter 3 that, under H 0 and for each x E (0, 1), flh(x) is a consistent estimator ofO ash----+ 0 and nh----+ oo. Consistency for 0 is also true when h is fixed as n ----+ oo, although in that 1 2 case each of the three terms on the right-hand side of (6.10) is Op(n- 1 ) and must be accounted for in describing the asymptotic behavior of gh (x).
154
6. Lack-of-Fit Tests Based on Linear Smoothers
Clearly, gh(x) estimates 0 more efficiently when his fixed, a fact which is pertinent when considering local alternatives to H 0 . The limiting distribution of Rn,h will be derived under the following local alternatives model: (6.11) where 0 < "' ~ 1/2 and J0 g(x) dx = 0. We first consider the limiting distribution of Rn,h under model (6.11) when h -+ 0 and nh -+ oo. A proof of the following theorem may be found in King (1988). 1
Theorem 6.1. Suppose that in model (6.11)
E1, ••• , En are i.i.d. random variables having finite fourth moments. Assume that K is continuous everywhere and Lipschitz continuous on [-1, 1]. If h rv Cn-a for some a E (0, 1), then for"'> {1/2)(1- a/2) in (6.11)
nhRn,h - B1 ~ hBz
IJ ------7
N(O, 1),
where
B,
~ [ , K'(u) du
and
B,
~ z[, (L K(u)K(u + z) du)' dz.
Furthermore, when"' = (1/2)(1- a/2), a nominal size a test of (6.9) based on Rn,h has limiting power
Theorem 6.1 shows that there is a direct link between the rate at which h tends to 0 and the maximal rate associated with Rn,h· When the alternative converges to the null relatively quickly ("! > (1/2)(1- a/2)), the limiting power of the test based on Rn,h is nil, inasmuch as it equals the limiting level. Theorem 6.1 shows that the maximal rate of Rn,h is (1/2)(1- a/2), in which case the limiting power is larger than the nominal level. By letting h tend to 0 arbitrarily slowly (i.e., a arbitrarily close to 0) the maximal rate can be made arbitrarily close to the parametric rate of 1/2. Of particular interest is the maximal rate in the case h rv cn- 115 ' which is the form of a mean squared error optimal bandwidth for twice differentiable functions. Here, the maximal rate is (1/2)(1-1/10) = 9/20. Theorem 6.1 also implies that the maximal rate for Rn,h is always at least 1/4. This is to be expected since Rn,h tends to the von Neumann type statistic Vn of Section 5.4.2 when h -+ 0 with n fixed. A slight extension of Theorem 5.2 shows that the maximal rate of Vn in the setting of the current section is 1/4.
6.3. Testing the Fit of a Linear Model
155
The following theorem shows that alternatives converging to H 0 at the parametric rate of n 112 can be detected by a test based on Rn,h with a fixed value of h. Theorem 6.2. Suppose that model (6.11) holds in which 1 = 1/2, g is piecewise smooth, r 1 ( x) = x for all x, and the Ei 's are independent random variables satisfying
and for some constant M and some v > 2
Let K satisfy conditions 1-4 of Section 3.2, and define the constant 19 1 12 f0 g(u)(u- 1/2) du. Define also the function ~h by
~h(s) = ~
1:
1
K(u)K(u-
If Rn,h corresponds to a test of Ho : r(x) each h E (0, 1/2)
1
nRn,h
rl-h
~ cr 2 jh
*) du, =
Bo
V s.
+ B1x,
it follows that for
w;(t) dt,
where {W9 (t) : 0::; t::; 1} is a Gaussian process with mean function
JJ(t)
=
h1
rl
Jo [g(u)- (u- 1/2)I9 ]K
(-ht u) du,
0 ::; t ::; 1,
and covariance kernel
L(s, t)
= cr 2
(~h(s-
t)- 12(s- 1/2)(t- 1/2)- I),
0::; s, t::; 1.
Using Theorems 8.1 and 12.3 of Billingsley (1968) it is enough to (i) show that (Vnfih(t 1 ), ... , ynfjh(tm)) converges in distribution to the appropriate multivariate normal distribution for any finite collection of t 1 , ... , tm in (0, 1) and (ii) verify the tightness criteria on p. 95 of Billingsley (1968). Defining fln = L~=l g(xi)/n, PROOF.
I
156
6. Lack-of-Fit Tests Based on Linear Smoothers
and "E
= 2:~= 1
Ei/n, it is easy to show that
n Vnfih(t) = 1- LEiK ynh i= 1
,
1
(t-x·) h --'
1 6 ~K - yn"Enh i= 1
~
- vn eE nh {;;{(xi - 1/2)K
(t-x·) h --'
(t-x·) T
The deterministic piece of the last expression is clearly E[yngh(t)]. By the piecewise smoothness of g and the Lipschitz continuity of K,
E[Vnfih(t)]
=
JL(t)
+ Rn(t),
where IRn(t)l is bounded by (a constant) · n- 1 for all t E (0, 1). Straightforward calculations show that lim Cov( Vnfih(s), Vnfih(t)) = L(s, t)
n--+co
V s, t E [h, 1- h].
The asymptotic normality of ( yngh (tl), ... , yngh (tm)) may be demonstrated by using the Cramer-Wold device (Serfling, 1980, p. 18) in conjunction with a triangular array analog of the Lindeberg-Feller theorem and the moment conditions imposed on the Ei 's. Having established the asymptotic normality of finite dimensional distributions, Theorems 8.1 and 12.3 of Billingsley (1968) imply that the process { Vnfih(t) : h s:; t s:; 1- h} converges in distribution to {W9 (t) : h s:; t s:; 1 - h} so long as the tightness criteria on p. 95 of Billingsley (1968) hold. These criteria are satisfied in our case if the sequence { Vnfih(h) : n = 1, 2, ... } is tight and if, for all n,
(6.12)
Elvn9h(s)- vn9h(t)l 2 s:; B(s- t) 2
v s, t
E
[h, 1- h],
where B is some positive constant. The tightness of { Vnfih(h) : n 1, 2, ... } can easily be proven using the fact that the mean and variance of Vnfih(h) converge to finite constants as n---+ oo. The bound in (6.12) is also easily established by using the boundedness of the function g and the Lipschitz continuity of K. D
I
Theorem 6.2 implies that whenever JL is not identically 0 on (h, 1 - h), the power of a size-a test based on Rn,h converges to a number larger than a. The mean function JL is a convolution of the kernel with the difference between g and its best linear approximation. Whenever g is not identical to a line, there exists an h such that JL is not identically 0. Hence, there exists an h such that the Rn,h-based test has a maximal rate of 1/2, meaning that Rn,h can detect alternatives that converge to H 0 at the parametric rate of n-1/2.
6.4. The Effect of Smoothing Parameter
157
It is sometimes difficult to know what, if anything, asymptotic results tell us about the practical setting in which we have a single set of data. It is tempting to draw conclusions from Theorems 6.1 and 6.2 about the size of bandwidth that maximizes power. Faced with a given set of data, though, it is probably best to keep an open mind about the value of h that is "best." The optimal bandwidth question will be considered more closely in Section 6.4. We can be somewhat more definitive about the practical implications of Theorems 6.1 and 6.2 concerning test validity. These theorems imply that the limiting distribution of statistics of the type Rn can be very sensitive to the choice of smoothing parameter. For example, the asymptotic distribution of Rn,h has three distinct forms depending on the size of h. The three forms correspond to very large, very small and intermediate-sized bandwidths. When h is fixed as n ___, oo, Rn,h converges in distribution to a functional of a continuous time Gaussian process, as described in Theorem 6.2. When h ___, 0 and nh ___, oo, Rn,h is asymptotically normal (Theorem 6.1), whereas if h = 0, Rn,h is asymptotically normal but with norming constants of a different form than in the case nh ___, oo (Theorem 5.1). Practically speaking these three distinct limit distributions suggest that we should use a method of approximating the null distribution that "works" regardless of the size of h. This was our motivation for advocating the bootstrap to approximate critical values of Rs in Section 6.3.1. It is worthwhile to point out that the conclusions reached in this section extend in a fairly obvious way to more general linear hypotheses and more general linear smoothers. Broadly speaking, the only way a smoother can attain a maximal rate of 1/2 is by fixing its smoothing parameter as n ___, oo. In other words, when an estimator's smoothing parameterS is chosen to be mean squared error optimal, the maximal rate of the corresponding test based on Rs will generally be less than 1/2.
6.4 The Effect of Smoothing Parameter The tests discussed in this chapter depend upon a smoothing parameter. To obtain a test with a prescribed level of significance, the smoothing parameter should be fixed before the data are examined. If several tests corresponding to different smoothing parameters are conducted, one runs into the same sort of problem encountered in multiple comparisons. If the null hypothesis is to be rejected when at least one of the test statistics is "significant," then the significance levels of the individual tests will have to be adjusted so that the overall probability of a type I error is equal to the prescribed value. By using the bootstrap one can ensure approximate validity of any test based on a single smoothing parameter value. The key issue then is the effect that choice of smoothing parameter has on power. In Section 6.4.1
l
158
6. Lack-of-Fit Tests Based on Linear Smoothers
we compute power as a function of bandwidth in some special cases to provide insight on how the type of regression curve affects the bandwidth maximizing power. In practice the insight of Section 6.4.1 will not be useful unless one has some knowledge about the true regression function. For cases where such knowledge is unavailable, it is important to have a data-based method of choosing the smoothing parameter. Section 6.4.2 introduces a device known as the significance trace that provides at least a partial solution to this problem.
6.4.1 Power Here we shall get an idea of how curve shape affects the bandwidth that maximizes power. Consider testing the null hypothesis that r is a constant function, in which case the residuals are simply ei = Yi- Y, i = 1, ... , n. We assume that model (6.1) holds with Xi = (i - 1/2)/n, i = 1, ... , n, and investigate power of the test based on the statistic
Rn(h)
~
=
t
flh(xi),
i=l
where gh is a local linear smooth (of residuals) that uses an Epanechnikov kernel and bandwidth h. Simulation was used to approximate the power of this test against the alternative functions
r 1 (x)
=
2
20 [(x/2) (1- x/2)
2
-
1/30] ,
0:::; x :::; 1,
and
r3(x)
+ (x- 1/2)] , 1) (2- 2x) 10 + (x- 1/2)],
.557 [5o (2x- 1) (2x) =
{ .557 [5o (2x
1
10
0 :S:
X
1/2 :S: 1
< 1/2 X
:S: 1.
These functions are such that fo ri(x) dx = 0 and fo rr(x) dx R:i .19, i = 1, 2, 3. The sample size n was taken to be fifty and E1 , ... , En were i.i.d. standard normal random variables. Ten thousand replications were used to approximate the .05 level critical value for Rn(h) at each of fifty evenly spaced values of h between .04 and 2. One thousand replications were then used to approximate power as a function of h for each of r 1 , r 2 and r3. The three regression functions and the corresponding empirical power curves are shown in Figure 6.1. In each case the solid and dashed vertical lines indicate, respectively, the maximizer of power and the minimizer of
-, 159
6.4. The Effect of Smoothing Parameter
"'0
"'0 "'0 a;
g "C
;< 0 c.
9"'
"l 0
"'9
": 0
0.0
0.2
0.4
0.6
0.8
1.0
"'0
0.0
0.5
1.0
1.5
2.0
0.0
0.5
1.0
1.5
2.0
0.0
0.5
1.0
1.5
2.0
"'0 "'0
"'0 a;
g 'i'!
"'0
;< 0 c.
"'9
"0 "'0
"'0 "'9
0 0.0
0.2
0.4
0.6
0.8
1.0
"' 0
"'0 g
"'
"'0 a;
0
;< 0 c.
0
0
"'0
"'9 "'
0
0 0.0
0.2
0.4
0.6
0.8
1.0
X
FIGURE 6.1. Functions and Corresponding Power Curves as a Function of Bandwidth.
mean average squared error for the local linear estimator. The agreement between the two optimal bandwidths is remarkable. In the case of r1, power is maximized at the largest bandwidth since the regression function is almost linear and the local linear smoother is a straight line for large h. For r 2 , power is maximized at h = .32 and then decreases monotonically for larger values of h. Since r 2 is symmetric about .5 the local linear smoother is an unbiased estimator of a flat line for large h, implying that the power of the test based on Rn(h) will be close to its level when h is large. By contrast, r 3 contains an overall upward trend, and so here the power at
160
6. Lack-of-Fit Tests Based on Linear Smoothers
large bandwidths is much larger than .05. In fact, the power at h = 2 is larger than it is for a range of intermediate bandwidths. The two peaks of r3 induce maximum power at a smaller bandwidth of about .13. The previous study is consistent with what intuition would suggest about the bandwidth that maximizes power. Roughly speaking, one would expect the size of the optimal bandwidth to be proportional to the smoothness of the underlying function. In other words, very smooth functions require larger bandwidths than do less smooth functions, all other factors being equal. The examples are also consistent with our claim in Section 6.2.1 that "smooth" statistics usually have more power than ones based on no smoothing. It is unclear whether or not the bandwidths maximizing power and minimizing MASE tend to agree closely in general, as they did in the above study. Agreement of the two bandwidths would suggest that one estimate an optimal testing bandwidth by using one of the methods discussed in Chapter 4, each of which provides an estimator of the MASE minimizer. The resulting test statistic would have the form Rn(h), where his a statistic. Tests of this general flavor will be the topic of Chapters 7 and 8. It is important to note at this point that the randomness of h can have a profound influence on the sampling distribution of Rn(h). It is therefore not advisable to approximate critical values for Rn (h) by pretending that his fixed. An alternative way of avoiding an arbitrary choice of bandwidth is the subject of the next section.
6.4.2 The Significance Trace King, Hart and Wehrly (1991) proposed a partial means of circumventing the bandwidth selection dilemma in testing problems. They proposed that one compute P-values corresponding to several different choices of the smoothing parameter. The question of bandwidth selection becomes moot if all P-values are less than or all greater than the prescribed level of significance. This idea was proposed independently by Young and Bowman (1995), who termed a plot of P-values versus bandwidth a significance trace. We illustrate the use of a significance trace using data borrowed from Cleveland (1993). The data consist of 355 observations and come from an experiment at the University of Seville (Bellver, 1987) on the scattering of sunlight in the atmosphere. The ¥-variable is Babinet point, the scattering angle at which the polarization of sunlight vanishes, and the x-variable is the cube root of particulate concentration in the air. The local linear smooth in Figure 6.2 seems to indicate some curvature in the relationship between average Babinet point and cube root of particulate concentration. Suppose that we test the null hypothesis that the regression function is a straight line by using a statistic Rn based on a local linear smoother.
6.5. Historical and Bibliographical Notes
161
... . . ...... .... :: . .. ... • ·!· . ::1!. . . .. . ... .....:·: ·: . . . I .. I ... • . . :. ...: !.,.::·:....• ....! ....a•:: ,: ·.... . !• • . .... ... ... . .
(!)
C\J
...
~·.
C\J
~~
C\J C\J
.
'E a.
'a Q)
c
:0
0 C\J
I a• • • •
co"'
• I • -·
.. v.;:a •.::
co
.
. 2.0
2.5
3.0
3.5
.
.. t .. : •
4.0
·= ...
4.5
5.0
Cube Root Concentration
FIGURE
6.2. The Babinet Data and Local Linear Smooth.
Figure 6.3 shows significance traces computed from three sets of data. From top to bottom, the graphs correspond respectively to 75, 100 and 200 observations randomly selected from the full set of 355. In each case the bootstrap was used to approximate P-values. Five hundred bootstrap samples were generated from each of the three data sets, and Rn was computed at twenty different values of the smoothing parameter h for each bootstrap sample. For a significance level of .05, the graphs illustrate the three cases that arise in using the significance trace. The top and bottom cases are definitive since regardless of the choice of h, the statistic Rn would lead to nonrejection of H 0 in the former case and rejection in the latter. The middle graph is ambiguous in that H 0 would be rejected for large values of the smoothing parameter but not for smaller ones. Interestingly, though, each of the graphs is consistent with the insight obtained in Section 6.4.1. Figure 6.2 suggests that the ostensible departure from linearity is low frequency; hence the tests based on less smoothing should be less powerful than those based on more smoothing.
162
6. Lack-of-Fit Tests Based on Linear Smoothers
...0 "'0 a.
"'0 0 0
0 0.5
1.0
2.0
1.5
2.5
3.0
0
"'0 0
"'0 a. 0
0
0
0 0.5
1.0
0.5
1.0
1.5
2.0
2.5
3.0
(!)
0
0
... 0
0
a.
"'0 0
0
0
2.0
1.5
2.5
3.0
h
FIGURE 6.3. Significance Traces for Babinet Data. From top to bottom the graphs correspond to sample sizes of 75, 100 and 200.
6.5. Historical and Bibliographical Notes
163
6.5 Historical and Bibliographical Notes The roots of lack-of-fit tests based on nonparametric smoothers exist in the parallel goodness-of-fit problem. As discussed in Section 5.6, smoothingbased goodness-of-fit tests can be traced at least as far back as Neyman (1937). The explicit connection, though, between Neyman smooth tests and tests based on nonparametric function estimation ideas seems not to have been made until quite recently. The use of components of omnibus goodness-of-fit tests (Durbin and Knott, 1972) is closely related to Neyman's idea of smooth tests. Eubank, LaRiccia and Rosenstein (1987) studied the components-based approach and refer to the "intimate relationship between (Fourier series) density estimation and the problem of goodness of fit." A comprehensive treatment of smooth goodness-of-fit tests may be found in Rayner and Best (1989) and a review of work on the subject in Rayner and Best (1990). Two early references on the use of kernel smoothers in testing goodness of fit are Bickel and Rosenblatt (1973) and Rosenblatt (1975). A more recent article on the same subject is that of Ghosh and Huang (1991). In the regression setting the first published paper on testing model fit via nonparametric smoothers appears to be Yanagimoto and Yanagimoto (1987), who test the fit of a straight line model by using cubic spline smoothers. Ironically, this first paper makes use of data-driven smoothing parameters, whereas most of the papers that followed dealt with the conceptually simpler case of linear smoothers, as discussed in this chapter. Tests utilizing splines with fixed smoothing parameters have been proposed by Cox, Koh, Wahba and Yandell (1988), Cox and Koh (1989), Eubank and Spiegelman (1990) and Chen (1994a, 1994b). Early work on the use of kernel smoothers in testing for lack-of-fit includes that of Azzalini, Bowman and Hardie (1989), Hall and Hart (1990), Raz (1990), King, Hart and Wehrly (1991), Muller (1992) and Hardie and Mammen (1993). Cleveland and Devlin (1988) proposed diagnostics and tests of model fit in the context of local linear estimation. Smoothingbased tests that use local likelihood ideas have been investigated by Firth, Glosup and Hinkley (1991) and Staniswalis and Severini (1991). A survey of smoothing-based tests is provided in Eubank, Hart and LaRiccia (1993), and Eubank and Hart (1993) demonstrate the commonality of some classical and smooth tests.
I ! '
'I I
+
7 Testing for Association via Automated Order Selection
7.1 Introduction The tests in Chapter 6 assumed a fixed smoothing parameter. In Chapters 7 and 8 we will discuss tests based on data-driven smoothing parameters. The current chapter deals with testing the "no-effect" hypothesis, and Chapter 8 treats more general parametric hypotheses. The methodology proposed in Chapter 7 makes use of an orthogonal series representation for r. In principle any series representation could be used, but for now we consider only trigonometric series. This is done for the sake of clarity and to make the ideas less abstract. Section 7.8 discusses the use of other types of orthogonal series. Our interest is in testing the null hypothesis (7.1)
Ho : r(x)
=
C for each x
E [0, 1],
where C is an unknown constant. This is the most basic example of the lack-of-fit scenario, wherein the model whose fit is to be tested is simply "r = C." Hypothesis (7.1) will be referred to as "the no-effect hypothesis," since under our canonical regression model it entails that x has no effect on Y. The simplicity of (7.1) will yield a good deal of insight that would be harder to attain were we to begin with a more general case. We note in passing that the methodology in this chapter can be used to test any hypothesis of the form H 0 : r(x) = C + r 0 (x), where r0 is a completely specified function. This is done by applying any of the tests in this chapter to the data Zi = Yi- ro(xi), i = 1, ... , n, rather than to Y1, ... , Yn. We assume a model of the form
(7.2)
Yj=r(xj)+Ej,
j=1, ... ,n,
where Xj = (j- 1/2)/n, j = 1, ... , n, and E1 , ... , En are independent and identically distributed random variables with E(E 1 ) = 0 and Var(El) = a 2 • Assuming the design points to be evenly spaced is often reasonable for purposes of testing (7.1), as we now argue. Consider unevenly spaced design points xi, ... , x~ that nonetheless satisfy xj = Q[(j - 1/2)/n],
164
'~ !
7.1. Introduction
165
= 1, ... , n, for some monotone increasing quantile function Q that maps [0, 1] onto [0, 1]. Then
j
Yj=r[Q(j~1 / 2 )]+Ej,
j=1, ...
,n,
and r(x) = C for all x if and only if r[Q(u)] C for all u. Therefore, we can test r for constancy by testing r[Q(·)] for constancy; but r[Q(·)] can be estimated by regressing Y1 , ... , Yn on the evenly spaced design points (j- 1/2)/n, j = 1, ... , n. Parzen (1981) refers to r[Q(·)] as the regression quantile function. If we assume that r is piecewise smooth on [0, 1], then at all points of continuity x, it can be represented as the Fourier series CXJ
r(x)
=
C
+2L
¢j cos(njx),
j=1
with Fourier coefficients
1 1
rPj =
(7.3)
r(x) cos(njx) dx,
j = 1, 2, ....
For piecewise smooth functions, then, hypothesis (7.1) is equivalent to ¢1 = ¢2 = · · · = 0; therefore, it is reasonable to consider tests of (7.1) that are sensitive to nonzero Fourier coefficients. The test statistics to be considered are functions of sample Fourier coefficients. We shall take as our estimator of rPj (7.4)
'
1
rPj = -
n
L Yi cos(njxi), n
j
= 1, ... , n-
1.
i=1
This definition of ¢j is different from that in Section 2.4; however, for evenly spaced designs the two estimators are practically identical. For our design xi = (i- 1/2)/n, i = 1, ... , n, definition (7.4) is the least squares estimator of rPj· We may estimate r(x) by the simple truncated series m
(7.5)
f(x; m) =
C + 2 L ¢j cos(njx),
x E [0, 1],
j=1
where C l:i Yi/n and the truncation point m is some non-negative integer less than n. Clearly, if H 0 is true, the best choice for m is 0, whereas under the alternative that l¢j I > 0 for some j, the best choice (for all n sufficiently large) is at least 1. In Chapter 4 we discussed a data-driven truncation point m that estimates a "best" choice for m. It makes sense that if m is 0, then there is little evidence in support of the alternative hypothesis, whereas if m 2:: 1 the data tend to favor the alternative. This simple observation motivates all the tests to be defined in this chapter.
166
7. Testing for Association via Automated Order Selection
From one perspective, the series f( · ; m) is simply a nonparametric estimator of the regression function r. However, we may also think of functions of the form m
C
+2L
aj cos(1rjx),
0 :S x :::; 1,
J'=l
as a model for r, wherein the quantity m represents model dimension. This is an important observation since the discussion in the previous paragraph suggests a very general way of testing the fit of a model. If model dimensions 0 and d > 0 correspond respectively to null and alternative hypotheses, and if a statistic d is available for estimating model dimension, then it seems reasonable to base a test of the null hypothesis on d. Many modeling problems fall into the general framework of testing d 0 versus d > 0; examples include problems for which the reduction method is appropriate, testing whether a time series is white noise against the alternative that it is autoregressive of order d > 0, or any other setting where one considers a collection of nested models. Recall the MISE-based criterion Jm introduced in Section 4.2.2: A
Jo = 0,
A
_
Jm -
Lm
2n¢J A2
2m,
m
=
1, ... , n - 1.
(]"
j=l
The statistic m is the maximizer of Jm over m = 0, 1, ... , n - 1. A number of different tests have been inspired by the criterion Jm. These will be discussed in Sections 7.3 and 7.6. For now we mention just two. One possible test rejects H 0 for large values of m. It turns out that the limiting null distribution (as n --4 oo) of m has support {0, 1, 2, ... }, with limn_,= P(m = 0) ~ .712, limn_,cxo P(O :S m :S 4) ~ .938 and limn--->cxo P(O :S m :S 5) ~ .954. This knowledge allows one to construct an asymptotically valid test of H 0 of any desired size. In particular, a test of size .05 would reject H 0 if and only if m ~ 6. A second possible test rejects H 0 for large values of Jm, a statistic that will be discussed in Section 7.6.3.
7.2 Distributional Properties of Sample Fourier Coefficients In order to derive the distribution of subsequent test statistics, it is necessary to understand distributional properties of the sample Fourier coefficients ¢ 1 , ... , ¢n-l· Our main concern is with the null distribution, and so in this section we assume that the null hypothesis (7.1) is true. More general properties of sample Fourier coefficients were discussed in Section 3.3.1.
7.2. Distributional Properties of Sample Fourier Coefficients
167
First of all,
and, for i, j ::;:: 1, Cov(¢i,¢j)={CY2/(2n), 0,
i=j i ::/= j.
These facts derive from the orthogonality properties n
L
cos(njxi) cos(nkxi)
=
j, k
0,
=
0, 1, ... , n- 1, j ::/= k,
i=1 and from 1
n
- L:cos 2 (njxi)= n i=1
1
2,
j=1, ... ,n-l.
When the Ei's are i.i.d. Gaussian, it follows that ¢ 1, ... , :Pn-1 are i.i.d. Gaussian with mean 0 and variance CY 2 / (2n). More generally, we may use the Lindeberg-Feller theorem and the Cramer-Wold device to establish that, for fixed m, vn(¢1, '¢m) converges in distribution to an m-variate normal distribution with mean vector 0 and variance matrix (CY 2/2)Im, where Im is the m X m identity. Define the normalized sample Fourier coefficients ¢N,1, ... , J;N,n-1 by 0
;;,
0 -
'1-'N,• -
0
0
v'2n¢i ff ,
i = 1, ... , n- 1,
where ff is any weakly consistent estimator of CY. Consider a test statistic S that is a function of ¢N,1, ... , J;N,m, i.e., S = S(J;N,1, ... , ¢N,m)· Then, if S is a continuous function, m is fixed and the null hypothesis (7.1) holds, S converges in distribution to S(Z1, ... , Zm), where Z1, ... , Zm are i.i.d. N(O, 1) random variables. An important example of the last statement is the Neyman smooth statistic m
S = ""A2 ~c/JN,j> j=1
x;;,.
whose limiting distribution under (7.1) is To obtain the limiting distributions of some other statistics, such as m and Jm,, it is not enough to know the limiting distribution of ¢ 1, ... , ¢m for a fixed m. The following theorem is an important tool for the case where fixing m does not suffice.
Theorem 7.1. Suppose that in model (7.2) r is constant and the Ei 's are independent and identically distributed with finite fourth moments. For each m ::;:: 1, let Bm denote the collection of all Borel subsets of lRm, and for any
i i,
168
A
E
7. Testing for Association via Automated Order Selection
l
Bm define Pmn(A) and Pm(A) by Pmn(A)
=
P [ ( V'in~/JI/!7,
... , V'in¢m/!7)
E
I
A]
i
and
I
where Z 1 , ... , Zm are i. i. d. standard normal random variables. Then for all m and n
where a(m) is a constant that depends only on m. Theorem 7.1 is essentially an application of a multivariate Berry-Esseen theorem of Bhattacharya and Ranga Rao (1976) (Theorem 13.3, p. 118). To approximate the distribution of the sample Fourier coefficients by that of i.i.d. normal random variables, we wish for the bound in Theorem 7.1 to tend to zero as nand m tend to oo. Since a(m) tends to increase with m, it is clear that m will have to increase more slowly than n 114 . Fortunately, in order to establish the limiting distribution of the statistics of interest, it suffices to allow m to increase at an arbitrarily slow rate with n. Clearly, there exists an increasing, unbounded sequence of integers {mn : n = 1, 2, ... } such that a(mn)m~/fo---+ 0 as n---+ oo.
7.3 The Order Selection Test The no-effect hypothesis says that r is identical to a constant. The nonparametric regression estimate f{; m) is nonconstant if and only if m > 0. These facts lead us to investigate tests of no-effect that are based on the statistic m. The form of lm along with Theorem 7.1 suggest that, as n ---+ oo, m converges in distribution to m, the maximizer of the random walk {S(m) : m = 0, 1, ... }, where m
S(O) = 0,
S(m) =
L
z]- 2m,
m = 1, 2, ... '
j=l
and Z 1 , Z 2 , ... are i.i.d. standard normal random variables. Theorem 7.2 below provides conditions under which this result holds. The ensuing proof is somewhat more concise than the proof of a more general result in Eubank and Hart (1992).
Theorem 7.2. Suppose that in model (7.2) r is constant and the Ei 's are independent and identically distributed with finite fourth moments. Let
7.3. The Order Selection Test
169
zl, z2, ...
be a sequence of independent and identically distributed standard normal random variables, and define m to be the maximizer of S(m) with respect tom, where S(O) = 0 and S(m) = ~';= 1 (ZJ - 2), m :2:: 1. It follows that the statistic m converges in distribution tom as n --+ 00. PROOF. For any non-negative integer m we must show that P(m = m) --+ P(m = m) as n --+ oo. Define, for any positive random variable a, the event Em(a) by
Em(a) =
m 2n¢;12 1 min - - k "" - - :2::2 { O
n~~~n
1
2n¢]
k
-k---m-
I:
-a-
j=m+l
~
}
2 ·
Note that the event {m = m} is equivalent to Em (&- 2 ) . Since &- 2 is consistent for 0' 2 , it is enough to show that P [Em(0' 2 )] --+ P(m = m). For a given m define
Un,k =
Vn,k =
1 m
m
2n¢]
k I: j=k+l k
1
k-m
k
'
0'2
2n¢]
I:
min Un,k :2:: 2 o:<:;k<m
An
= {
Bn
= { mn
n
0, ... ,m -1,
k = m
'
0'2
j=m+l
=
+ 1, ... , n- 1,
max Vn,k m
~ 2} ,
~ 2} ,
where {mn} is an unbounded, nondecreasing sequence of integers to be specified subsequently. Since P[Em(0' 2 )] = P(An n Bn), the proof will be completed if we can show that
(i) lim P(Bn)
=
n-+oo
1,
(ii) lim [P(An) - P(An)] n-+oo
= 0
and
(iii) lim P(An) = P(m = m), n-+oo
where An is defined exactly as is An with Un,k and Vn,k replaced by
uk
=
1 m _ k
m
I: j=k+l
zJ,
k =
o, ... , m -
1,
170
7. Testing for Association via Automated Order Selection
and
vk =
1
k
2.::: z;,
k- m
k
= m + 1, ... '
j=m+l
respectively. Statement (iii) is true by definition since {An} is a decreasing sequence of events. We now turn our attention to proving (i). Let Zjn = V2ii¢jfa, j = 1, ... , n- 1, and note that
(7.6)
En ::J
nn-l
{
~~~=m+lk (z;n- 1)
k=mn+l
-m
1
:S 1
}
.
Define nj = p, j = 1, 2, ... , and let j(1) be the largest integer J. such that P :S mn and j(2) be the largest j such that P < n- 1. For each n set
For any integer k such that mn + 1 :S k :S n - 1, either nj < k :S nj+l for some j or nj( 2 ) < k :S n- 1. It follows that for mn + 1 :S k :S n- 1
(7.7)
I ~7=m+l (ztn- 1) I < I ~~~m+l (Ztn- 1) I + k-m ~-m
~jn ~-m
Combining (7.6) and (7.7), we then have 2
En ::J
Jn·cJ j=j(l)
[{
~~~~m+l(Zfn- 1)1 n·-m
<
J
By Markov's inequality j(2)
(7.8)
"'
~
j=j(l)
v(nj, n) (P- m)2
7.3. The Order Selection Test
171
for
!
\. !i
The last inequality implies that the right-hand side of (7.8) is of the order I:;~](l) j- 2, which tends to 0 as mn, and hence j(1), tends to oo. To deal with the ~jn we use a result of Serfiing (1970). For any set of jointly distributed random variables Y1 , ... , Y,.. with joint distribution function F, let L be the functional L(F) = I::=l E(Y; + D) with D = 2(1 + 2\Esi/ 0' 4- 3\). Defining Fr,s to be the joint distribution of (z;+l,n1), ... , (z;+r,n -1) for all rands, it is clear that L(F,..,s) =Dr, L(F,..,s) + 2 L(Fk,r+s) = L(Fk+r,s), and E('Ej!:+l (Z}n - 1)) ::;; Dr = L(F,.., 8 ). Now an application of Theorem A of Serfiing (1970) gives
E~Jn::;; D (log(4j + 2)] 2 (2j + 1)/(log2)
2 .
Consequently,
n
j(2)
p (
j=j(l)
c.
{
',Jn
n· -m J
j(2)
2 1-
2:
4D [log(4j
+ 2)] 2 (2j + 1)(j 2
2
-
m)- (log 2)-
2 ,
j=j(l)
which tends to 1 as n
-+
oo. Combining the preceding results yields
P(Bn)
-+ 1. To prove (ii) we apply Theorem 7 .1. Since the subset of ~mn -m described by the events An and An is a Borel set, Theorem 7.1 implies that
where a(k) is a constant that depends only on k. Since one can always 2 choose mn to grow sufficiently slowly that (mn- m) a(mn- m)/Vn-+ 0, the proof of (ii) is complete. 0 The classic paper of Spitzer (1956) allows one to concisely describe the distribution of m. For now it suffices to point out that P(m = m) > 0 for each non-negative integer m, with P(m = 0) ~ .71. Most of the
'!I
172
7. Testing for Association via Automated Order Selection
"'c::i
-
-
I
0
c::i
2
0
3
4
5
m FIGURE 7.1. Large Sample Distribution ofthe Data-Driven Truncation Point When the Regression Function Is Constant.
m
distribution of m is displayed in Figure 7.1. When the no-effect hypothesis is true, then the order selection criterion Jm has a probability of about .71 of choosing the correct model. This result parallels that of Shibata (1976) in the context of selecting the order of an autoregressive process. One is tempted to use a size-a test of the form reject H 0 if and only if m ~ me"'
(7.9)
where ma is the smallest m such that P(m ~ m) ::=; a. For example, if one desires a test of size .05, the asymptotic value of ma is 6. Such a test is certainly valid, but turns out to have poor power against some nonpathological alternatives. The following result will help to explain why.
Corollary 7.1. Suppose that all the conditions of Theorem 7.2 are satisfied, and let the regression function r in model (7.2) be of the form k
r(x)
=
¢o
+ 2 L rPj cos(njx), j=l
0 ::;; x ::;; 1,
7.3. The Order Selection Test
173
for some non-negative integer k with ¢k f- 0. Then the statistic m converges in distribution to the random variable m+k, where mis defined in Theorem 7.2. Notice that Corollary 7.1 is just a restatement of the part of Theorem 4.1 concerning m. A means for proving both Corollary 7.1 and ms's consistency (the remainder of Theorem 4.1) become obvious once one understands the proof of Theorem 7.2. Corollary 7.1 implies that test (7.9) is inconsistent whenever the regression function has a Fourier series that is truncated at k with k < ma. Furthermore, whenever k < ma, the limiting power of the test is no more than .29. As an example, consider functions of the form
r(x) = C
+ 2 (¢1 cos(nx) + ¢2 cos(2nx)),
0 ~ x ~ 1.
For any such alternative the limiting power of the size .05 test of form (7.9) is equal to P(m + 2 2 6) ~ .085. In practice it is doubtful that the regression function will be exactly of truncated form. However, it is often the case that the Fourier coefficients ¢j are relatively large for small j and decay rapidly to zero as j increases. Intuition in conjunction with Corollary 7.1 suggest that the power of test (7.9) can be poor for such alternatives, even if they are not of truncated form. For this reason we shall now consider other tests that are motivated by the risk criterion Jm. The inconsistency of test (7.9) can be avoided by simply taking ma = 1. Of course, the resulting test will have a limiting size of .29, which will be an unacceptably large type I error probability in many applications. It is desirable to have a test that is generally consistent, but which is also flexible enough to attain any desired size. Consider the criterion Jm. The term -2m acts as a penalty and makes it unlikely that a high order model will be chosen. Suppose we use instead a criterion with penalty -rym, where ry is any constant larger than 1. By taking ry > 2, models with m 2 1 are penalized more than they are by Jm. The maximizer of a criterion with ry > 2 will necessarily take on zero with higher probability than will m; hence, by appropriate choice of --y, one can obtain a consistent test that attains any desired level of significance. For example, it turns out that ry = 4.18 yields a test whose size (in large samples) is .05. Define J(· ; --y) by ,
J(O; ry) = 0,
m
2n¢;~
J(m;--y) =I:~ -rym, j=l
m= 1,2, ... ,
(j
where "Y > 1. Let m'Y be the maximizer of J(m; ry) over m = 0, 1, ... , n -1. Our focus now centers on the test (7.10)
reject Ho if and only if m"~ 2 1,
174
7. Testing for Association via Automated Order Selection
where ry is fixed and determines type I error probability. Test (7.10) will be referred to as the order selection test. The limiting distribution of m"~ is obtained as a corollary to Theorem 7.2. The process {J(m; ry) : m = 0, 1, ... , n- 1} converges in distribution to the random walk {S(m; ry) : m = 0, 1, ... }, where m
S(O; ry) = 0,
S(m; ry) = ~)zJ
-
ry),
m = 1, 2, ... ,
j=l
and Z 1 , Z 2 , ... are i.i.d. standard normal random variables. The structure of S(·; ry) makes clear why the constant ry is chosen larger than 1. If ry > 1, E(ZJ) ry < 0 and so S(m; ry) __, -oo with probability 1 as m __, oo, guaranteeing that the process has a maximum. The level of test (7.10) is simply 1- P(m"~ = 0). If m"~ is the maximizer of S(m; ry) over 0, 1, ... , then lim P(m"~ = o) = P(m"~ = o).
n--+oo
A remarkable formula due to Spitzer (1956, p. 335) allows us to obtain an arbitrarily good approximation to P(m'Y = 0). His formula implies that
P(m'Y = 0) = exp
(7.11)
Xj > J'Y ·)} p(2 L . { J oo
-
d f
~ Fosb),
j=l
x;
where is a random variable having the x2 distribution with j degrees of freedom, and the subscript OS stands for "order selection." If one desires a test having asymptotic level o:, one simply sets 1 - a equal to Fosb) and solves numerically for ry. It is not difficult to see that Fos is increasing in ry > 1, and hence the solution to the equation 1 - a = Fosb) is unique. In fact, Fos is the cumulative distribution function of an absolutely continuous random variable having support (1, oo), a fact to be exploited in the next section.
7.4 Equivalent Forms of the Order Selection Test
1.4.1 A Continuous- Valued Test Statistic Data analysts often desire a P-value to accompany a test of hypothesis, the P-value being the smallest level of significance for which H 0 would be rejected given the observed data. Owing to the discrete nature of the statistic m"~, finding a P-value for the order selection test by using its definition in Section 7.3 is awkward. However, an equivalent form of the test makes computation of the P-value relatively straightforward. This alternative form is
7.4. Equivalent Forms of the Order Selection Test
175
also helpful for understanding other aspects of the test, such as power, as we will see in Section 7.7. Note that m"~ equals 0 if and only if
!__
Lm 2n¢] &2
m j=1
m
::::: "(,
=
1, ... ,n -1,
which is equivalent to
T nd~f -
max
1:S:m:S:n-1
Therefore the order selection test is equivalent to the test that rejects the no-effect hypothesis for large values of the statistic Tn· If the observed value of Tn is t, then the P-value is 1- Fn(t), where Fn is the cdf of Tn under the null hypothesis. A large sample approximation to the P-value is 1- Fos (t). Note that Fos is the cdf of the random variable T, where 1
m
T =sup- 2:ZJ m2:1
m j= 1
and Z 1 , Z 2 , ••. are i.i.d. standard normal random variables. The support ofT is (1, oo), since the strong law of large numbers entails that m- 1 I:;;'=1 ----+ 1, almost surely, as m ----+ 00. It is shown in the Appendix that
z;
(M IFos(t)
F(t;M)I:::;
+ 1)-1efl+1 1- Bt
where
F(t;M)
~ exp {- ~ P (xJ/ ;t)}
and Bt = exp (- [(t- 1) logt] /2). This allows one to determine Fos(t) to any desired accuracy. For example, if the statistic Tn was observed to be 3, we can approximate the P-value by 1 - F(3; 15) = .119, which agrees with 1 - Fos(3) to the number of decimal places shown. A graph of Fos is shown in Figure 7.2.
1.4.2 A Graphical Test The outcome of the order selection test can be related graphically in a rather appealing way. Consider the Fourier series estimator f( x; m'Y), where 'Y is chosen so that the order selection test has a desired size. The hypothesis of a constant regression function is rejected if and only if the smoother r(x; m"~) is nonconstant. This follows from two facts: (i) when m"~ = 0,
176
7. Testing for Association via Automated Order Selection ~ T""
co 0
,--...
><
LL.
~ 0
~
0
C\1
0 0 0
2
4
6
8
10
X FIGURE
7.2. The Cumulative Distribution Function Fas.
r(x; m,) equals Y, the sample mean of the Y;,'s, for each x, and (ii) Ho is rejected only when m1 > 0, which entails that l¢j I > 0 for some j 2: 1 and hence that r(x; m,) is nonconstant. The result of the order selection test can be reported by providing a graph of r(x; m,) for 0 ::; X ::; 1. The graph serves both an inferential and a descriptive purpose. If the curve is nonconstant, then there is evidence of a relationship between x and Y; at the same time one gets an impression as to the nature of that relationship. This means of testing the no-effect hypothesis seems particularly appealing when one is exploring associations among a large number of variables. For example, a common practice is to look at all possible two-dimensional scatter plots of the variables. By adding the estimate r(-; m,) to each plot, one can determine significant associations at a glance. This is an effective way of combining the exploratory aspect of smoothing with a more confirmatory aspect. I
.I '·
7.5 Small-Sample Null Distribution of Tn So far in this chapter we have considered only a large sample version of the order selection test. Not surprisingly, the distribution of Tn in small samples can be considerably different from the large sample distribution, Fos. This difference is attributable to one or a combination of the following three factors: (a) the sample size n, (b) the way in which the variance CJ 2
--~------------------------------------------
7.5. Small-Sample Null Distribution of Tn
177
is estimated, and (c) the probability distribution of an error term Ei. These factors are the same ones at play in many classical testing problems. In this section we shall be concerned with the small-sample distribution of the test statistic Tn under H 0 . As in any testing problem we wish to ensure that the actual level of the test is as close as possible to the nominal level. In doing so, however, we should keep in mind that some measures which ensure validity can jeopardize the power of a test. Hence, power will be a consideration even as we study the null distribution of Tn.
7. 5.1 Gaussian Errors with Known Variance Consider the case in which E1, ... , En are i.i.d. N(O, CJ 2 ) with CJ 2 known. Assuming the variance to be known is admittedly unrealistic, but nonetheless useful for understanding how Fos works as a small-sample approximation. If H 0 is true, then with Gaussian errors it follows that 2n¢U CJ 2 , ... , 2n¢~_IfCJ 2 are i.i.d. random variables, each having a chi-squared distribution with one degree of freedom. Now define T~ to be the statistic T* = n
1
max 1<m
-
m
~
m j=1 L....t
2n¢ 2 _ _J 0" 2
•
Using combinatorial results of Spitzer (1956), it may be shown that the probability distribution of T~ is
() -
Fos,n t -
~ L....t (Olo···•On-l)ECn-1
{nrr-1 ~1 [P(x;:::; tr) lor} , f)
r=1
r·
r
x;
where denotes a chi-squared random variable with r degrees of freedom, and Cs is the set of all s-tuples (fJ 1 , ... , fJs) of non-negative integers such that () 1 + 2() 2 + · · · + sfJ 8 = s. One may obtain Cs recursively from C1, ... , Cs-1· Percentiles of Fos,n for n = 2, ... , 26 and n = oo are shown in Table 7.1. This table shows how much of the discrepancy between the small- and large-sample percentiles is due merely to n, as opposed to non-normality or the necessity of estimating CJ 2 • For the levels of significance most often used in practice (a :::; .10), agreement between the large- and small-sample percentiles begins when n is only 14. So, the large-sample test will usually be adequate for small n under Gaussian errors with known variance.
7.5.2 Gaussian Errors with Unknown Variance We consider now the more realistic case where the E/s are i.i.d. N(O, CJ 2 ), but CJ 2 is unknown. We are then faced with the problem of choosing an appropriate variance estimator. One possibility is to use the estimator that is efficient when the null hypothesis is true, i.e., we could use 8- 2 = s 2 =
178
7. Testing for Association via Automated Order Selection TABLE 7.1. 100(1 - a)th Percentiles of the Statistic T~ for Gaussian Data. This corresponds to the case where rJ 2 is known.
tJ
n
t[
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 ()()
a= .20 1.6424 2.0839 2.2338 2.3003 2.3343 2.3533 2.3645 2.3714 2.3758 2.3787 2.3806 2.3819 2.3828 2.3834 2.3838 2.3841 2.3843 2.3844 2.3845 2.3846 2.3847 2.3847 2.3847 2.3847 2.3848 2.3848
.10
.05
.01
.001
2.7056 3.0668 3.1621 3.1957 3.2093 3.2153 3.2181 3.2195 3.2201 3.2205 3.2206 3.2207 3.2208 3.2208 3.2208 3.2208 3.2208 3.2208 3.2208 3.2208 3.2208 3.2208 3.2208 3.2208 3.2208 3.2208
3.8415 4.1077 4.1599 4.1734 4.1774 4.1787 4.1791 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793 4.1793
6.635 6.736 6.744 6.744 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745 6.745
10.83 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85 10.85
Each percentile is correct to the number of decimal places shown. At n = oo each table value tis such that 11 - Fos,n(t) - al ::; .00001.
:2:~= 1 (Yi- Y) 2 /(n- 1). The only problem with this proposal is that s 2 is not a consistent estimator of rJ 2 when the no-effect hypothesis is false. For regression functions r that are continuous on [0, 1], s 2 is a consistent estimator of
w
l1
p1 w
3J
gJ 5
a
1 t
a g e c c s v s
t
I
7.5. Small-Sample Null Distribution of Tn
179
where r = f 0 r(x) dx. It follows that s 2 is biased upward under the alternative hypothesis; hence taking &2 = s 2 will tend to deflate Tn and lower the power of the test. From the standpoint of power it is desirable to have a variance estimator that is consistent under both the null and alternative hypotheses. In nonparametric regression there are two principal methods of estimating error variance. One is to base the variance estimate on differences of the Yi's, and the other is to use residuals from a nonparametric estimate of the regression function. The former type of estimator was motivated in Section 5.5.1. Two candidates are 1
and
&~2
=
1
~
2Yi 6 (n _ 2) L_.,(YiHi=2
+ Yi-1) 2 .
The estimator &~ 2 is more variable than &~ 1 , but E(&~2 ) tends to be smaller than E(&~ 1 ) for sufficiently smooth functions. These two factors (variance and bias) must be weighed in choosing between &~ 1 and &~ 2 and, more generally, when choosing from any collection of difference-based variance estimators. Table 7.2 provides 95% confidence intervals for the null hypothesis percentiles of Tn for Gaussian errors and each of the variance estimators &~ 1 , &~ 2 and s 2 . The intervals were constructed from simulated data that consisted of 10,000 repetitions at each value of n. Binomial distribution theory was used to obtain the intervals (Conover, 1980, pp. 112-116). The test statistic using s 2 has smaller percentiles than does the large-sample distribution, whereas statistics based on fj~i, i = 1, 2, have larger percentiles. The results indicate that if a difference-based variance estimator is used, it is important to take into account sample size in approximating percentiles. For example, at n = 50, if one uses variance estimator &~ 2 , the critical value for a .05 level test is about 6, whereas the large-sample critical value is 4.18. If one used the large-sample critical value in this case, the actual size of the nominal .05 level test would be about .107. By contrast, if one used large-sample critical values in conjunction with s 2 , the result would be a conservative test. It is worth pointing out that a simulation was also done at n = 100, and the percentiles for &~ 1 and &~ 2 still differed significantly from the large sample ones at that sample size. Under the assumption of Gaussian errors, no variance estimator is more = .2:7= 1 Erfn. Of course, the E/s are unobservable; hence efficient than is not even an estimator. Consider instead the residuals
a;
&;
T/i =
Yi - f(xi),
i = 1, ... , n,
i,1 i I
'I
180
7. Testing for Association via Automated Order Selection TABLE 7.2. Estimated 100(1 - a)th Percentiles of the Statistic Tn for Gaussian Data. This is the case where cr 2 is unknown.
a=
.10
.05
.01
A2 CT dl
(3.82, 4.01) .1504 (4.20, 4.44) . 1774 (3.01, 3.15) .0882 (4.53, 4.81) .1990 (3.03, 3.21) .0936
(5.17, 5.57) .0866 (5. 75, 6.33) .1074 (3.85, 4.11) .0437 (6.21, 6.68) .1221 (4.56, 5.34) .0601
(9.34, 10.44) .0288 (10.57, 12.07) .0390 (5.93, 6.47) .0064 (11.37, 12.88) .0449 (9.03, 10. 77) .0282
(3.42, 3.60) .1219 ( 3.56, 3.70) .1327 (3.09, 3.21) .0935 (3.55, 3.70) .1359 (3.16, 3.32) .1022
(4.40, 4.67) .0640 (4.53, 4.84) .0683 (3.93, 4.15) .0444 (4.54, 4.89) .0698 (4.08, 4.37) .0508
(7.23, 8.20) .0151 ( 7.51, 8.52) .0170 (6.22, 6.74) .0080 (7.45, 8.13) .0171 (7.06, 7.86) .0144
(3.20, 3.33) . 1034 (3.21, 3.36) .1053 (3.11, 3.22) .0943 (3.18, 3.31) .1028 (3.18, 3.33) .1024
(4.03, 4.30) .0493 (4.09, 4.32) .0506 (3.93, 4.19) . 0461 (4.05, 4.33) .0494 (4.18, 4.48) .0543
(6.70, 7.24) .0115 (6.66, 7.32) .0115 (6.44, 7.05) .0096 (6.66, 7.39) .0116 (6.87, 7.67) .0131
A2 CT d2
n = 20
82 A2
crm A2 CT· m5 A2 CT dl A2 CTd2
n
50
82 A2
crm A2 CT· m5 A2 CT dl A2 CTa2
n = 200
82 A2
crm A2 CT· m5
The numbers in parentheses are a 95% confidence interval for the corresponding percentile. The number just below the confidence interval is the proportion of simulated statistics that exceeded the 100(1 - a)th percentile of Tn's limit distribution (last line of Table 7.1). The largest estimated standard error of any proportion is .004.
7.5. Small-Sample Null Distribution of Tn
where
r is a nonparametric estimator of r. We have 7]i = Ei
and if
181
r is
+ (r(xi)
r(xi)),
i
=
1, ... , n,
a sufficiently good estimator of r, then a variance estimator
Ei ry'f /n will have the same asymptotic mean squared error as a; does; see Hall and Marron (1990). Suppose we estimate r(x) by the Fourier series r(x; m). Then, using Parseval's formula, n-1
n i=1
j=m+1
and it is reasonable to estimate a- by (j~ = 2n 2:7:~+ 1 ¢]/(n- m- 1). This estimator of a- 2 depends upon the smoothing parameter m. To avoid an arbitrary choice of m we could use a variance estimator&~, where m is a data-driven choice of m. For example, m could be m"~, the maximizer of the criterion J(m; ry). In the Gaussian errors case confidence intervals for percentiles of the statistic 2
max
1
-
-
~k
k
"" 2n¢ 2 / &~
L...t
j=1
J
m-y
are given in Table 7.2 for two choices of ry, 2 and 5. The variance estimator used in defining each risk criterion was &J 1 . The percentiles corresponding to and are comparable to those for 8 2 and &J 2 , respectively. Use 5 of the variance estimator in the test statistic Tn thus appears to be a 5 good practice. Doing so yields critical values close to the large-sample ones for n as small as 20, and to a test that will tend to be more powerful than one using 8 2 •
ai
ai
ai
7. 5. 3 Non- Gaussian Errors and the Bootstrap When the errors are non-Gaussian, approximating the distribution of Tn becomes more challenging. So long as the error distribution has four moments finite, the limiting null distribution of Tn (using any weakly consistent estimator &2 ) is still Fos (expression (7.11)). However, the results in Tables 7.1 and 7.2 for the Gaussian case already give us reason to exercise caution in using the large-sample percentiles. Our proposal for handling the possibility of non-Gaussian data is to use a bootstrap procedure to approximate the distribution of Tn. Our bootstrap procedure is essentially the same as that proposed by Hall and Hart (1990) in the context of comparing two regression curves. Given observations Yi, ... , Yn, define the residuals ei = Yi - Y, i = 1, ... , n. A bootstrap sample is e]', ... , e~, where each ej is drawn randomly and with replacement from {e 1 , ... , en}· Define T~ exactly as we did Tn except
182
7. Testing for Association via Automated Order Selection
replace Yj 's by ej 's. Draw N independent bootstrap samples and compute r:; for each sample. Let an be a number such that the proportion of the N T,;;'s greater than or equal to an is a. The bootstrap test then rejects the null hypothesis at level a if and only if Tn 2: an. Ideally we would like to resample from the empirical distribution Fn of the error terms E1, ... , En, since the null distribution of Tn is a function only ofF, the distribution of E1 . However, the proposed bootstrap succeeds in resampling from Fn only when the null hypothesis is true. To see why, note that the residuals are (Ei £) + (r(xi)- 'Fn), i = 1, ... , n, where£= ~~= 1 Ei/n and 'Fn = ~~= 1 r(xi)/n. When any of the variance estimators in Section 7.5.2 are used in the denominator of Tn, the bootstrap distribution of r:; is determined completely by the empirical distribution Gn of Ei + (r(xi) - 'Fn), i = 1, ... , n. If Ho is true, Gn is identical to Fn; otherwise Gn is contaminated by the regression function r. Suppose F is continuous and r is continuous on [0, 1]; then the strong law of large numbers implies that Gn converges uniformly toG, almost surely, where
1 1
G(x)
=
F(x- r(u)
+ r)du,
-oo
<X<
00.
Inasmuch as the bootstrap is used to ensure validity of one's test, the proposed resampling scheme works precisely as the bootstrap was intended. The only potential problem with the procedure is that it could possibly have an adverse effect on the power of the test, at least in small samples. Define Pn(H) to be the probability distribution of Tn when the errors have arbitrary distribution H. The bootstrap procedure approximates the perFn, and we centiles of Pn(F) by those of Pn(Gn)· When Ho is true, Gn obtain the "correct" bootstrap approximation to Pn(F). When the alternative hypothesis is true, the power of the bootstrap test is lowered if the percentiles of Pn(Gn) tend to be larger than those of Pn(F). Note, however, that the percentiles of Pn(Gn) converge almost surely to those of Fos, so long as r is continuous on [0, 1]. (This is true since the distribution G satisfies the conditions of Theorem 7.2.) Hence, the power of the bootstrap test will not be affected in large samples by the particular resampling scheme proposed herein. Another way of bootstrapping would be to estimate the regression function by a nonparametric smoother and then res ample from the residuals (Y1 - r(xi)), ... , (Yn- r(xn)). If r is a consistent estimator of r, then the empirical cdf of the residuals is consistent for the error distribution under both the null and alternative hypotheses. There are at least two reasons, though, why one might prefer to resample from the ei 's. First, the procedure using the ei 's is simpler; the other one requires a nonparametric function estimate with the attendant problem of choosing a smoothing parameter. A second reason is that if one bootstraps from the e/s, then one is at least resampling from the empirical distribution, Fn, of the Ei 's in the case where
=
r,
t
t t
7.5. Small-Sample Null Distribution of Tn
183
Ho happens to be true. With the other procedure, one resamples from a somewhat corrupted version of Fn regardless of which hypothesis is true.
7.5.4 A Distribution-Free Test A delightfully simple way of dealing with uncertainty about the null sampling distribution is to apply the order selection test to ranks. Let rank(Yi) denote the rank of Yi among Y 1 , .•. , Yn, and define rank(Yi) - .5 n
i
=
1, ...
-
Lui cos(njxi),
j
=
0, 1, ... 'n- 1.
n
i=l
Ui=
,n,
and 1
¢R,j
=
n
A statistic for testing H 0
:
r
= constant is k
R
Tn = llf{;~n
A2
1 ""' 2ncpR j k ~ 1/12 '
The rationale behind using 1/12 as a variance estimate is that, under H 0 , rank(Yi) has a discrete uniform distribution over 1, ... , n, implying that
E(Ui)
=
1
2
,
Var(Ui)
=
1 n2 - 1 ~, 12
i = 1, ... , n.
The advantage of basing a test on T,f! is that its null distribution is completely independent of the distribution of Ei. One need not assume that Ei has any moments finite in order to conduct a valid test. For a given n the exact distribution of T[! may be determined using the fact that rank(Y1 ), ... , rank(Yn) are uniformly distributed over all permutations of 1, ... , n. The asymptotic marginal distribution of Ui is uniform on (0, 1); hence one anticipates that the asymptotic null distribution ofT[! is Fos, as given in expression (7.11). The only reason this is not an immediate consequence ofresults in Section 7.3 is that, although U1 , ... , Un are identically distributed, they are not independent. However, the dependence is very weak for large n, and so T,f!'s null distribution converges to Fos. It is of interest to know what function is estimated by the smoother A
!R(x; m)
=
1
~A
~ ¢R,j cos(njx), 2 + 2 j=l
0
~X~
1,
as this will provide insight concerning power of the rank-based order selection test. The behavior of fR(-; m) becomes clear upon considering what the Fourier coefficients ¢R,j estimate. Since rank(Yi) may be written as
184
7. Testing for Association via Automated Order Selection
E(¢R,j)
= n-
2
n
n
i=l
k=l
L L P(Yi :::0: Yk) cos(njxi) n
= n- 2 L
(7.12)
i=l
n
L
G [r(xi)
r(xk)] cos(njxi),
k=l
where G is the cdf of E1 - E2 , which is symmetric about 0. (We implicitly assume henceforth that G is absolutely continuous.) The quantity (7.12) may be written
where
1
hn(x)
=
n
L n
-
G [r(x) - r(xk)],
0 :=:; x :=:; 1.
k=l
Under the usual conditions on m and n, it is now clear that the smoother fR(x; m) consistently estimates the function
1 1
h(x)
=
G [r(x)- r(v)] dv,
0:::; x :::; 1.
This fact forms a basis for comparing the power of an order selection test based on the original data with one based on the ranked data. A fundamentally important point is whether constancy of h is equivalent to constancy of r. It can be verified that if r is piecewise smooth, then this equivalence indeed holds. Another key concern in regard to rank tests is the question of how much efficiency is lost by using a test based on ranks. Rank tests often turn out to be surprisingly efficient (see, e.g., Gibbons, 1971). In fact, they can be super efficient in comparison to certain tests when the raw data have a sufficiently long-tailed distribution. Even when the data are Gaussian there is often only a small loss in efficiency from basing a test on ranks. A good example is that of testing whether X and Y are uncorrelated on the basis of a random sample from the bivariate normal distribution. A test based on Spearman's rank-correlation coefficient has asymptotic relative efficiency of 3/n relative to the t-test based on Pearson's product moment correlation (Gibbons, 1971, p. 293). It would be interesting to compute asymptotic efficiencies of rank-based order selection tests relative to ones based on the raw data.
7.6. Variations on the Order Selection Theme
185
7.6 Variations on the Order Selection Theme In this section we introduce some alternative tests that make use of data-driven selection of a Fourier series truncation point. These include data-driven versions of Neyman's smooth test and tests formulated from a Bayesian perspective.
7. 6.1 Data-Driven Neyman Smooth Tests In Section 5.6 we discussed Neyman's classic smooth test, which in the present context would reject the no-effect hypothesis for large values of
_ ~ 2n¢7 Sm - ~ A2
(7.13)
j=l
•
(]"
Neyman's proposal was to fix m a priori, in which case Sm has an approximate distribution. Neyman (1937) pointed out that the power of the smooth test can be severely diminished by a poor choice of m. Ideally, then, m should be tailored to the underlying function r. One way of making Neyman's smooth test data adaptive is to select m by means of an order selection criterion. A test of no-effect could then be based on Sm,, where is the data-driven choice of m. Here we shall consider two order selection criteria: the Mallows-like criterion lm and the Schwarz type criterion Bm:
x;,
m
2n¢2
m
Bo
=
0,
Bm =
L
(JZ J -
m log n,
m
=
1, ... , n- 1.
j=l
Theorem 4.1 implies that under H 0 , Bm and lm yield respectively consistent and inconsistent estimators of 0. This fact has a crucial bearing on the corresponding data-driven smooth tests. denote the maximizer We shall first consider a test based on lm. Let of Jm over m = 0, 1, ... 'n- 1, and define TM to be 0 when = 0 and
m
m 2n¢z
TM
= I: ~, j=l
m 2:
m
1.
(]"
The limiting distribution of TM is given in the following theorem, whose proof is similar to that of Theorem 7.2, and hence omitted. Theorem 7.3. Assume that the Ej 's in model (7.2) are i.i.d. with mean zero and finite fourth moments, let Z1, Z 2 , ... be i.i.d. standard normal random variables, and define k
S(k)
=
L(ZJ- 2), j=l
k
= 1, 2, ....
186
7. Testing for Association via Automated Order Selection
In addition, define Ej(x) to be the event {0
< S(j):::;
X-
2j; S(k) :::; S(j), k = 1, ... , j - 1},
and let nx be the largest integer less than
j
1, 2, ... ,
x/2. Then, if r is constant on
[0, 1], the statistic TM converges in distribution to T as n
---t
oo, where T
has the following distribution function:
x
0, (7.14)
P(T:::; x)
=
.71, { .71 { 1 + 'L?:'o P [Ej(x)]}, 1
o:s;x:s;2 X> 2.
The nature of TM's distribution is such that the limiting size of any nonrandomized test based on TM can be no more than .29. This is a consequence of the inconsistency of m for 0 but does not seem to be much of a drawback since nominal test size is rarely taken to be as large as .29. Percentiles of the distribution in (7.14) can be approximated by using either numerical integration or simulation. It is interesting that, at any given x, P(T :::; x) depends on the infinite sequence of Zj's only through P(Vj < 0, j 2 1) = .71. A data-driven Neyman smooth statistic based on the Schwarz criterion Bm has a fundamentally different distribution than TM. Let Ts be the statistic of exactly the same form as TM except that in place of m we use ms, the maximizer of Bm over m = 0, 1, ... , n- 1. The fact that ms is a consistent estimator of 0 implies that Ts is exactly 0 with probability tending to 1 as n ---t oo. This in turn means that the asymptotic size of any test of the form "reject H 0 forTs 2 en" (en > 0) must tend to 0. If we wish to specify the size of our test, it is thus necessary to use a statistic other than Ts. A simple way out of the problem just described was proposed by Ledwina (1994) in the goodness-of-fit setting. Let ms denote the maximizer of Bm form in the set {1, ... , n- 1}. Ledwina's test rejects H 0 for large values of S,n, 8 , where Bm is as defined in (7.13). When the null hypothesis is true, ms converges in probability to 1 and the large sample distribution of S,n 8 has a simple form, as given in the following theorem. Theorem 7.4. Under the conditions of Theorem 7.3 and assuming that r is constant, the statistic S,n 8 converges in distribution to a XI random variable as n ---t oo.
PROOF. We have P(Bms :::; x) (7.15)
=
P(Sms :::;
= P ( 2n¢ o-z
2 1
X
:::;
n ms = 1) + P(Bms :::; x n ms
X
n ms > 1)
= 1) + P(Sms :::; x n ms > 1). \
I
7.6. Variations on the Order Selection Theme
187
In what amounts to a corollary of Theorem 4.1, we have P(ms = 1) ----+ 1 as n ----+ oo, implying with (7.15) that
P(Sms :S x)
=
;tr
2
P(
:S x)
+ o(1)
as n ----+ oo. The result is proven upon applying the Liapounov central limit theorem to ¢1· D Theorem 7.4 says that under the no-effect hypothesis, Sms R:J 2n¢i/ &2 , which converges to a random variable. This makes for a particularly familiar and simple large sample distribution. However, in the goodnessof-fit setting Kallenberg and Ledwina (1995) report that critical values approximaof an analog of Sms are substantially larger than their tions. In practice, then, it seems wise to use simulation to approximate the distribution of Sms.
xr
xr
1.6.2 F-Ratio with Random Degrees of Freedom Ideas from linear models suggest a test that is closely related to the datadriven Neyman smooth test of the previous section. Suppose that, for a specified m, we entertain the model m
(7.16)
Yi
¢o
+ 2L
¢) cos(njxi)
+ Ei,
i = 1, ... , n.
j=1 The least squares estimates of ¢ 1, ... , ¢m are simply our estimates ¢ 1, ... , ¢m· On the assumption that model (7.16) holds, the classical means of testing for no-effect is to use the F-ratio m
(7.17)
F =
~2
I.::j=1 ¢j/m n-1
~2
I.:;j=m+l ¢j/(n- m- 1)
.
When H 0 holds and the errors are Gaussian, F has the F distribution with m and n- m- 1 degrees of freedom, and the no-effect hypothesis is rejected for large values of F. Using (7.17) to test for no effect depends upon a choice form. Of course, most of this chapter has been based upon an objective method for choosing m from the data. With this in mind, we propose the test statistic (7.18)
:F n -
m
~2
~
I.:;j=1 ¢j/m
I.::j~~+l
~ I
¢1/(n- m- 1)'
where m maximizes Jm, and Fn is defined to be 0 when m = 0. One may regard Fn as an F-ratio with random degrees of freedom. The advantage of Fn over F is that Fn provides a rough and ready means of testing for
ii
188
7. Testing for Association via Automated Order Selection
no effect that requires no a priori knowledge of model dimension from the data analyst. The null hypothesis of no effect is rejected at level a if :Fn exceeds its 100(1 - a)th percentile. The distribution of :Fn may be approximated by simulation if the error distribution is known, or by using the bootstrap otherwise. The large-sample distribution of :Fn is given by the following theorem.
:6
Theorem 7.5. Using exactly the same notation as in Theorem 7.2, let :F
c s
be a random variable defined as follows:
0, :F = { "\"-m L...- =1 1
0
s s
v
ifm = o
z2;if m;::: J m,
1.
Then, under the conditions of Theorem 7.2, the statistic :Fn (see (7.18)) converges in distribution to :F as n ---> oo.
1:
r 1~
a a
7.6.3 Maximum Value of Estimated Risk Our study of order selection tests began by considering the statistic m which minimizes an estimate of the MISE off'(-; m). We pointed out that the test (7.9) is inconsistent against many truncated series alternatives and quickly moved on to another version of the order selection test. Rather than basing a test on m, suppose we use some other functional of Jm. One possibility is to reject the no-effect hypothesis if and only if J.,n is sufficiently large. Under H 0 , J.,n converges in distribution (as n---> oo) to a non-negative random variable that is finite with probability 1, as implied by the following theorem.
Theorem 7.6. Define S(m) and m as in Theorem 7.2. Then, under the conditions of Theorem 7.2, the statistic J.,n converges in distribution to S(m) as n---> oo. Obviously, J.,n tends to be larger under the alternative hypothesis than under the null, hence motivating the test
(7.19)
reject H 0 when
J.,n ;::: a
(a
> 0).
In Section 7.7.1 we will establish that test (7.19) is consistent for any alternative such that rPi =f. 0 for some j, thus showing that this test circumvents the problem inherent in test (7.9). In Section 9.8 it will be shown that test (7.19) is isomorphic to a test for white noise proposed by Parzen (1977).
7.6.4 Test Based on Rogosinski Series Estimate The unifying theme of the tests presented in this chapter is that each one involves selecting the smoothing parameter of a nonparametric estimate
v
s t
s
I~ I
7.6. Variations on the Order Selection Theme
189
of the regression function. Thus far we have dealt exclusively with the smoother f(·; m), which is the simplest type of truncated trigonometric series. Other types of smoothers can and have been used for testing the fit of models (Chapter 6). The Rogosinski series estimate (Section 2.4) has been investigated in the testing context by Ramachandran (1992) and Kuchibhatla and Hart (1996). In Section 7.4.2 we pointed out that the order selection test may be carried out graphically. This procedure is appealing from a descriptive standpoint in that a graph of a nonparametric curve estimate is provided whenever the no-effect hypothesis is rejected. This same test procedure can be carried out with any truncated series estimate. The Rogosinski estimate, fR(·; m), seems a good candidate for such a test, inasmuch as it tends to have fewer spurious features than the simple series estimate. A parallel of Theorem 7.2 for fR(·; m) has been proven by Kuchibhatla and Hart (1996). Defining wm(J) = cos [nj/(2m + 1)], the appropriate analog of J(m; "Y) is L(m; "'f), defined by L(O; "Y) = 0 and
(7.20) L(m;
~) ~ ~ v,.(j) (";~j) -~ ~ Wm(j),
m
~ 1, ... , n-l,
where vm(J) = 2wm(J) - w~(j), j = 1, ... , m. Ramachandran (1992) shows that taking "Y = 4.22 yields a test whose asymptotic size is approximately .05. Letting mR be the maximizer of (7.20) for "'( = 4.22, we may thus graph fR(· ; mR) and reject the no-effect hypothesis at (nominal) level .05 if and only if this estimate is nonconstant. Values of"'( needed for other significance levels can be determined by simulation or use of the bootstrap.
1.6.5 A Bayes Test Upon acknowledging that the Fourier series truncation point is an unknown parameter, a Bayesian approach to our testing problem would include specifying a prior distribution for this parameter. We shall formulate a Bayesian model in the frequency domain, i.e., we consider a prior for the Fourier coefficients of r, and write the likelihood function in terms of sample Fourier coefficients. Suppose we entertain a model for r of the form m
rm(x) = ¢o
+ 2 L rPj cos(njx),
0 ::::; x ::::; 1,
j=l
for some non-negative integer m. The quantity m is considered an unknown parameter, along with the Fourier coefficients rPi> j = 0, 1, 2, ... , and the error variance 0' 2 . It is assumed that the parameter m is positive if and only if IrPj I > 0 for some j ;::: 1. In this way the no-effect hypothesis is true if and only if m = 0.
190
7. Testing for Association via Automated Order Selection
The sample Fourier coefficients ¢ 0 , ¢1, ... , :Pn-1 are sufficient statistics ' and for i.i.d. Gaussian errors the likelihood function is
f(¢o, · · · ,¢n-1l¢o, · · ·,
v'2 +
(
Vn
foeJ
t,(/>; -
)n exp{-!3:__ [~(¢o -·¢o)2 (J2
¢;)'
+
2
,,%'J,j] },
m
~ 0, 1, ... , n
- 1
and
f(¢o, · · ·, :Pn-1l¢o, · · ·,
f(¢o, · · ·, :Pn-1l¢o, ... ,
m;:: n.
When the errors are merely i.i.d. with finite variance CJ 2 , this likelihood holds as a large-sample approximation. We now turn to the question of specifying a prior distribution for the model parameters. Letting ¢ denote the vector of Fourier coefficients of the regression function, our aim is to specify the prior distribution n( ¢, CJ, m), which is
n(¢, CJ, m)
=
n(¢o, ¢1, ... , ¢m, CJim)n(m).
We assume that the conditional distribution n(¢ 0 , ¢1, ... , ¢m, CJim) has the form
We desire the prior distribution to be as noninformative as possible while still allowing us to obtain a tractable probability model. Since ¢ 0 is essentially a location parameter, a reasonable noninformative prior for ¢ 0 is uniform over the real line. For the scale parameter CJ, we shall use the noninformative prior that is proportional to 1/CJ. The fact that both these priors are improper has little material effect on the ensuing Bayesian analysis. Our prior distribution is now such that
1
n(¢, CJ, m) oc -n(¢1, ... , ¢mlm)n(m). (J
For positive constants ami, i that
(7.21)
= 1, ... , m, m = 1, ... , n - 1, we will assume
7.6. Variations on the Order Selection Theme
191
where
g(s)
=
2 2 2 r(1/2) s- exp(-s- ),
s > 0.
The distribution (7.21) simplifies to A. A. 1f ( '1-'1> ••• ) 'I-'m
I ) - r((m+1)/2) m r(l/2)
(},.) m
x (1 +
(IT ami)
-J
~~
c/Jr ) -Cm+l)/2 2 L.J a 2 . i=1
mt
which is an m-variate t distribution. In the form (7.21) we see that this prior amounts to assuming that, conditional on s and m, ¢1, ... , c/Jm are independent with c/Ji distributed N(O, s 2 a~i), i = 1, ... , m, and taking s to have prior g. Note that g is proper but has infinite mean. At this point we do not specify a prior for m, since doing so is not necessary for deriving the form of the posterior distribution. One possibility would be to use Rissanen's (1983) noninformative prior for the number of parameters in a statistical model. A simple convenience prior is the geometric distribution (7.22)
1r(m)
= pm(1
- p),
m
=
0, 1, ... ,
for some pin (0, 1). An advantage of (7.22) is that it allows one complete freedom in specifying the prior probability of the no-effect hypothesis. That hypothesis is true if and only if m = 0; hence we should take p = 1 - Jro if our prior probability for the null hypothesis is 1r0 • A "proper" Bayesian test of the null hypothesis would be to compute the posterior probability that m = 0 and to reject the no-effect hypothesis if this posterior probability is sufficiently low. The posterior distribution of m is found by integrating the posterior probability function with respect to ¢ 0 , .•. , c/Jm and u. The posterior distribution 1r(cj;0 , .•• , c/Jm, u, m!data) is proportional to likelihood function
1
-1r(cf;1, ... , c/Jm!m)1r(m).
X
(}"
Integrating out ¢ 1, ... , c/Jm and u, and ignoring a term that is negligible for large n, we have (7.23)
1r(m!data)
= 'L-:~2 b. j=O
where
J
,
m
=
0, 1, ... , n- 2,
192
7. Testing for Association via Automated Order Selection
and n-1
Sl
2::
&;, = 2
·rr,
j=m+1 m
A2
2
·
(For m = 0, I:i= 1
l tl
n(Oidata) I (1 - n(Oidata)) nol(1- no) Considering the problem from a frequentist point of view, the posterior probability function n(mldata) has an interesting consistency property, as stated in the following theorem.
Theorem 7. 7. Assume that model (7.2) holds with the true regression function having the form
mo
r(x)
=
0 :::; x :::; 1,
j=1
for some integer m 0 2: 0 with
0 for j = 0, 1, ... , and let the constants ami in the prior n( ¢1, ... ,
for each non-negative integer m PROOF.
=/=-
p
-----+
0
as n
-+ oo
mo.
From (7.23) we have
n(midata) n(moldata)
n(m) ( 1 )m-mo r( ~) n(mo)
X
A
2 (Jmo
) (n-m-1)/2
A2 r( mot1) ( rim
2
1+ Am-mo (
(Jmo
(1
+
A ) (mo+l)/2
~ I:';~1
r ( ~)
A2)(m+1)/2 r(n-mo-1) 2 I:j=1
0
(
r
7.6. Variations on the Order Selection Theme
193
Clearly it is enough to consider
(
o-~0) (n-m-1)/2 r ((n- m- 1)/2) 3'~ r ((n- m 0 - 1)/2)'
2 since &~ 0 0" and 1 + (1/2) 2:::~= 1 m < m 0 and consider
¢]
2:::~= 1 ¢].Now, let
__!!___, 1 + (1/2)
2 ) = -(n-m-1)log ( 1 + ~~o
2 I:':'O 1
A
(7.24) (n-m-1)log
::=;'+
(Jm
(
1 ¢21
)
•
(Jmo
Using the well-known recursion formula for the gamma function, it is easy to verify that r((n m- 1)/2) J log [ r((n- mo- 1)/2) = O(logn). Combining this with (7.24) and the fact that I:j:m+ 1 ¢] / &~ 0 converges in probability to a positive constant proves the theorem for the case m < mo. Now let m > m 0 , and consider
a!
(J 2 A
(n-m-1)log (
)
(
=(n-m-1)log 1+
=
2 &2
m
(
~ A2)
"'mJ-&i+ 1 '+'. 12 1 )
2 LJ·-m
1 1 + Zn '
where Zn is between 0 and 2&:;;,2 I:j=mo+ 1 ¢]. It is now clear that
(n - m - 1) log ( :io ) converges in distribution to a random variable having the X~-mo distribution. So, (&~ 0 /&~)(n-m- 1 )/ 2 is bounded in probability, and since r((n- m- 1)/2) jr((n- m 0 - 1)/2) -+ 0 as n -+ oo, the proof is complete. D The posterior n(mldata) has intriguing possibilities as an order selection criterion. Theorem 7. 7 suggests that one choose the value of m that maximizes n(mldata). Plots of n(mldata) tepd to be much more definitive than those of the MISE-based criterion Jm· This is illustrated in Figure 7.3, which corresponds to four independent sets of data generated from the model
lj = r((j 1/2)/50) + Ej, j = 1, ... , 50, where r(x) = 210 [.65x 8 (1- x) 2 + .35x 2(1- x) 8 ] and the Ej's are i.i.d. (7.25)
194
7. Testing for Association via Automated Order Selection
.·· .... ··· ........................... ····...
co 0
·c
c 0 ·c
'5
'§
c 0
2
2
•
0
0
-·
0
0
0
10 20
.~··· ..................................... ..
co 0
_... 0
30 40
10 20
m
c
m
·································
co 0
.... ..... ·········· c 0 ·c
•
0
·c
..
2
'§ 0
0
co 0
·····........
·· .............···.... . Fram estirru functi
-~
(.)
...... • 0
30 40
0
0
10 20
30
40
...,•"0
10 20
30 40
Uc>(
testir m
m
7.3. Plots of Risk Criteria and Posterior Probabilities. Each plot corresponds to a set of data generated from the model (7.25). The smaller points are values of (!m -mink Jk)/(maxk Jk -mink Jk), whereas the larger ones are posterior probabilities corresponding to a prior probability of .5 for m = 0.
COnSlt
FIGURE
N(O, .72 ). The posterior probabilities (7.23) were computed for each data set with all ami= 1 and n(m) = .sm+l, m = 0, 1, .... The MISE-based quantity Jm was also computed for each data set. The posterior probability function was maximized at m = 6 in each of the four cases, whereas Jm was maximized at 6 in two cases and at 7 and 9 in the other two. Notice that the posterior probability function leaves little doubt as to which value of m is the most likely a posteriori, whereas the risk criterion tends to be much flatter near its maximum. For the data set where Jm was maximized at 9, the series estimates with truncation points 6 and 9 are shown in Figure 7.4. The Bayes criterion has chosen a better estimate in this case in the sense that it has the same features as r.
7.7 Our( relat! Such OncE how· selec· orde1 smal CUSUJ
agair othe1 from lack-
7.7. Power Properties
195
• L{)
• • • ..0
•
I
0.0
'•'•
'\,
•
• 0.2
t
•
•
\ I. • \e \',
• 0.6
0.4
0.8
1.0
X FIGURE 7.4. Data-Driven Series Estimates. The solid and dotted curves are series estimates with m = 6 and m = 9, respectively. The dashed curve is the true function.
Use of 1r(mldata) as an order selection tool and/or as a means of testing 'the no-effect hypothesis is a topic that appears to merit serious consideration, although we do not pursue it further here.
7.7 Power Properties Our discussion to this point has focused on properties of order selection and related tests under the null hypothesis of a constant regression function. Such properties are important for purposes of verifying validity of the test. Once we are confident that a test is valid, our interest naturally turns to how powerful it is. In this section we consider power properties of the order selection and other tests. We first establish the general consistency of the order selection test and the test based on J,n,. Next we present some exact small-sample power results for the order selection, Neyman smooth and cusum tests. We then study asymptotic power of the order selection test against a sequence of local alternatives, which facilitates comparisons with other tests, both omnibus and parametric. Finally, we discuss some results from the literature that compare the power of various smoothing-based lack-of-fit tests.
196
7. Testing for Association via Automated Order Selection
7. 7.1 Consistency In Sections 7.3 and 7.4 we encountered three equivalent versions of the order selection test: one based on the data-driven truncation point m.Y> one a graphical test and the other based on the statistic Tn. In studying power of the order selection test, it will be more convenient to consider the version based on Tn. For a given level of significance a, we will show consistency of the large-sample test that rejects H 0 when ~
Tn
ta,
where ta is such that Fos(ta) = 1 -a. Consistency of a small-sample test with rejection region of the form
Tn
~
tn,a
follows immediately whenever tn,a -+ ta as n -+ oo, which occurs under the conditions of Theorem 7.2. Concerning the variance estimator G- 2 used in Tn, we assume only that it converges in probability to a constant. The simplest and most important situation in which the order selection test is consistent is where the regression function r has at least one nonzero Fourier coefficient r/Yj and the sample coefficient ¢j converges in probability to ¢j. The following theorem establishes consistency under somewhat weaker conditions. Theorem 7.8. Suppose the regression function r is such that, for some j,
the following condition holds:
(7.26)
lim P(l¢j I ~ 8)
n-+oo
=
1
for some 8 > 0.
Then the order selection test is consistent in that lim P(Tn ~ ta)
n-+oo
PROOF.
We have, for all n
P(Tn
~
j
=
1.
+ 1,
~ ta) ~ P (~ t 2;fr ~ ta) J •=1
The result now follows immediately upon using condition (7.26) and the fact that G- 2 converges in probability to a positive constant. 0 Note that condition (7.26) does not even require existence of Fourier coefficients of r, nor does it assume consistency of ¢j in the event that
7.7. Power Properties
197
¢;1 exists. From a practical standpoint, though, it is probably sufficient to envision the case where Fourier coefficients exist and the sample Fourier coefficients are consistent estimators of them. For example, suppose that r is piecewise smooth. Then ¢;j exists for all j, and ¢j is a consistent estimator of ¢i so long as (} 2 < oo. The class of all piecewise smooth functions is sufficient in most applications, since it does not place restrictions on the shape of r and allows for discontinuities in both r and r'. Each of the tests discussed in Section 7.6 is consistent under very general conditions. The proof of consistency is similar in each case, and so here we consider only the test from Section 7.6.3 based on Jm,. Theorem 7.9. Let model (7.2) and the conditions of Theorem 7.2 hold, 1 and suppose that r is absolutely integrable and such that 0 r(x) cos(njx)dx is nonzero for some j. Then the power of test (7.19) tends to 1 as n ---t oo.
J
Let k ~ 1 be the smallest integer such that ¢k -/=- 0. The event ~ a} is implied by {Jk ~ a}, which occurs whenever {2n¢V 8- 2 ~ 2k+ a}. The law of large numbers implies that ¢~ /8- 2 converges in probability to ¢V (} 2 , which is positive. It is now immediate that P(2n¢V 8- 2 ~ 2k +a) ---t 1 as n ---t oo, thus proving the result. D PROOF.
{Jm,
7. 7.2 Power of Order Selection, Neyman Smooth and Cusum Tests Like the property of admissibility in decision theory, consistency is only a minimal sort of optimality property. Consistency provides a meaningful comparison of two tests only when one of the tests is consistent and the other is not. In this section we compare the finite-sample power of the order selection test with that of Neyman smooth tests and the cusum-based test of Section 5.5.2. To simplify comparisons we shall assume that (} 2 is known and that the errors are normally distributed. A slightly modified version of the cusum test rejects H 0 for large values of Tcusum =
n-l
n¢2
j=l
(}
2 L ----{-•
We have pointed out previously how this test downweights the influence of higher degree sample Fourier coefficients. As a result the cusum test will have relatively poor power unless the first couple of Fourier coefficients are "large."
198
7. Testing for Association via Automated Order Selection
An mth order Neyman smooth test of H 0 large values of
r =constant rejects Ho for
0
2nJ;2
m
S(m) =
:
L -----1, j=1
(]'
F
x;,
which has a distribution when H 0 is true. The power of a smooth test depends fairly crucially on making a good choice of m. Indeed, the test based on S(m) will have power equal to its level in cases where ¢1 = · · · = ¢m 0 and ¢j =/= 0 for some j > m. By contrast, the order selection test is consistent whenever at least one Fourier coefficient is nonzero (Section 7.7.1). Whereas the order selection test does not require knowledge of r in order to be consistent, it is clear that some advantage will accrue from partial knowledge of r. To quantify this advantage, consider functions of the form
v
p
('
mo
(7.27)
r(x)
¢o
+2L
¢j cos(njx),
0 ~ x ~ 1.
j=1
If m 0 is known, then an apparently reasonable smooth test is the one based on S(m 0 ). The power of this test is simply
P(S >
(7.28)
S<
a
('
x;,a,aJ,
where x;,a,a is the (1 o:)100th percentile of the x;,o distribution and S is distributed as a noncentral chis quare with m 0 degrees of freedom and noncentrality parameter mo
A= ;
f;¢J.
Note that the power of the smooth test depends on the Fourier coefficients only through L;'j::1 ¢]. This implies, for example, that the power is invariant to a change in ¢1/¢2 so long as ¢i + ¢§ remains the same. Defining 1
Akn
= k
k
L
2n¢]
~,
k
= 1, ... , n - 1,
j=1
and
the order selection test rejects Ho when Tn exceeds Ca, where Ca is the (1- o:)100th percentile of Tn's large-sample distribution. Letting k and ko be the respective maximizers of Akn and k- 1 2:;~= 1 ¢], the power of the
r
7.7. Power Properties
199
order selection test may be expressed as
For functions of the form (7.27), it is straightforward to show that k converges in probability to ka. (This may be done using the same method of proof as in Theorem 7.2.) It follows that (7.29)
P(Tn 2: Cc,) = P
?; ko
(
2n(jJ21
2: kaCa
0' 2
nk
= ka
)
+ o(1),
which makes it clear that the power of the order selection test is sensitive to which Fourier coefficients are large. From (7.29) we would expect the power to be largest for predominantly low frequency functions. Expression (7.29) gives, at best, a crude impression of power for the order selection test. The following bounds will be useful for learning more. For any ma 2: 1 (7.30) P ( max Akn 2: Ca) l:S;k:S;m 0
-
max Akn > Ca ( l:S;k:S;mo -
~ P(Tn
U max
2: Ca)
1
mo
k -
2n¢J 2.: ma j=mo+l k
0'
2
When (7.27) holds the inequalities in (7.30) lead to (7.31)
For specified r and 0' j yn, one may approximate (numerically or by simulation) P(maxl:S;k:S;mo Akn 2: Ca). The power of the Neyman smooth test (as given in (7.28)) may then be compared with the bounds in (7.31). A more explicit bound can be obtained in the event that r has the form (7.27) with (PI = · · · = ¢mo-l = 0 and ¢mo =J 0. Using (7.31) we have P(Tn 2: Ca) 2: max [a, P (S(ma) 2: maCa)]
(7.32) P(Tn 2: Ca) ~ 2a
+ (1 -
a)P(S(ma) 2: maCa)·
For the case ma = 1 the term 2a in the second inequality of (7.32) may be replaced by a. For the same type of function r we may also obtain bounds (see the Appendix) for the power of the test with rejection region
I,:
~# !
I
l
200
I
7. Testing for Association via Automated Order Selection
TABLE 7.3. Power of Size .01 Neyman Smooth, Order Selection and Cusum Tests. mo
A
1
2
3
4
7
.5
.058 ( .055, .065) (.044, .110)
.040 (.01, .028) (.01, .022)
.033 (.01, .021) (.01, .020)
.029 (.01, .020) (.01, .020)
.020 (.01, .020) (.01, .020)
1
.123 (.118, .127) ( .098, .190)
.084 (.021, .040) ( .010, .025)
.067 (.01, .023) (.01, .020)
.057 (.01, .021) (.01, .020)
.041 (.01, .020) (.01, .020)
2
.282 ( .275, .282) (.239, .364)
.204 (.068, .088) (.010, .033)
.164 ( .015, .035) ( .010, .021)
.139 (.01, .023) (.01, .020)
.097 (.01, .020) (.01, .020)
4
.600 ( .591, .596) ( .548, .669)
.487 (.246, .264) (.010, .070)
.417 (.082, .102) ( .010, .025)
.366 (.024, .044) (.01, .021)
.271 (.01, .020) (.01, .020)
8
.923 (.920, .920) ( .902, .942)
.867 ( .677, .690) ( .078, .252)
.818 (.397, .413) (.010, .040)
.774 (.194, .212) (.010, .024)
.667 (.011, .030) ( .010, .020)
16
.999 (.999, .999) (.998, .999)
.997 (.982, .992) (.595, .782)
.994 (.913, .924) (.010, .135)
.990 (.771, .783) (.010, .041)
.976 (.232, .250) ( .010, .020)
The underlying function is r(x) = 2¢ cos(7rmox) and A = n¢ 2 /cr 2 • For a given A and mo, the upper row contains the power of the test based on S(mo), and the second and third rows contain bounds on the power of the order selection and cusum tests, respectively.
li'l where ta is the asymptotic level a critical value. Table 7.3 displays the power of the S(m0 )-based test, the bounds (7.32) and bounds for the power of the cusum test for the case where a= .01 and r(x) = 2¢cos(1rm0 x). As long as n > 7, both sets of bounds and the power of the smooth test depend on n, CT and ¢ only through A = n¢ 2 / CT 2 • For m 0 = 1 all three tests have comparable power. For a given A, the power of each test decreases as the function becomes higher frequency. The decrease is the most and least rapid for the cusum and Neyman smooth tests, respectively. The power of the cusum test is poor for m 0 > 2, regardless of the value of A. Although
ower Properties the order selection test fares considerably better, it is clear that for >. 2: 4 a large price is paid for not taking into account information about m 0 . The setting just considered is extreme in that each test garners all its power from ¢mo. Table 7.3 is not necessarily representative of the relative power of the three tests in other cases, even if the series is truncated. If the first few Fourier coefficients are relatively large, the power of the order selection test will tend to be competitive regardless of the value of other Fourier coefficients.
7. 7. 3 Local Alternatives In this section we will establish that the order selection test has a maximal rate of 1/2. At the same time we will obtain an explicit expression for the power of the order selection test against local alternatives that converge to the null hypothesis at rate n- 1 / 2 . We shall use a model that was introduced in Chapter 5.5.4, i.e., (7.33)
=
lin
fl
+ n- 112 g (
1 2
i -n /
)
+ Ein>
i
=
1, ... , n,
where fl is a constant and g is a piecewise smooth function satisfying J~ g(x)dx = 0. The following theorem provides the limiting distribution of Tn under model (7.33). Theorem 7.10. Let the data be generated according to model (7.33), in 1 which g is a piecewise smooth function. Define aj = 0 cos(njx)g(x) dx, j = 1, 2, ... , and suppose that Eln, ... , Enn are independent and identically distributed for each i and n with E( E[n) :::; C < oo. Then
J
lim P(Tn 2: ta)
=
n->oo
where zl, variables. PROOF.
z2, ...
P
[max ~ (zj + V'J,aj ) L...t k>l -
1 -k
0"
j=l
2
2:
tal ,
is a sequence of independent standard normal random
The sample Fourier coefficients may be expressed as A
-
¢j = ¢j
1 + Vn ajn,
j
=
1, ... , n- 1,
where J;j
1
L n
= -
n
Ekn cos(njxk),
j
= 1, ... , n- 1,
k=l
and ajn
1
L g(xk) cos(njxk),
n
k=l
= -
n
j
= 1, ... , n-
1.
202
7. Testing for Association via Automated Order Selection
It follows that
2n¢;
=
[V271¢j + haj + V'i(ajn
=
(V2rJ:¢j + V'iaj) 2 + 4(Vn¢j + aj)(ajn- aj) + 2(ajn
aj)r aj) 2 .
To prove the result, then, it is sufficient to show that 2
-1
a- 2
k ( V271¢ +V'ia ·) max -1 ~ 1 1
1
-
k ~
j=1
1 k -+sup-~ k~1 k j=1
(7.34) in distribution as n
--+
(7.35)
lim
(
Zj
vf:2aj
+ --
)2
0"
oo, and that
n-->cx:>
1 max -k
1
-
k
~(ajnaj) 2 ~
=
0.
j=1
The proof of (7.34) is virtually the same as the proof of Theorem 7.2 and hence omitted. To prove (7.35), first notice that 1
k
2 max -~(a· - a·) < 1
1
-
k ~
j=1
Jn
-
We have n-1
(7.36)
2 ~(a· ~ Jn -a·) J -<
j=1
[(~ain )
1/2 ( +
~a;
) 1/2]2
in which
and !ln = 2.:~= 1 g(xi)/n. Now, since g is bounded and square integrable (owing to its piecewise smoothness), it follows that (7.36) is bounded uniformly in n, and hence
7.7. Power Properties
203
To finish the proof we will assume that g is continuous. (If g is not continuous, the proof is somewhat more complicated, but still straightforward.) We have 1 n
ajn- aj
= -
n
L cos(1rjxk) [g(xk) - g(xk)] k=l
+
~n
~
t
t
g(xk) [cos(1rjxk)- cos(1rjxk)], k=l where xk E [(k- 1)/n, kjn], k = 1, ... , n, and hence iajn- ajl :::;
lg(xk)- g(xk)i
+ 1rj
n k=l
~
t
lg(xk)i.
n n k=l
Since g is continuous and piecewise smooth, it is also Lipschitz continuous, and so C C. Bj Ia·Jn - a·i1
max l:Sk
k
k L(ajn j=l
which tends to 0, thus finishing the proof.
D
Theorem 7.10 implies that the maximal rate of the order selection test is 1/2, which of course is the maximal rate encountered in most parametric problems. Recall from Chapter 6 that the maximal rate of smoothing-based tests using mean squared error optimal smoothing parameters is smaller than 1/2. Hence, there exist sequences of alternatives for which those tests have limiting power equal to a while the order selection test has limiting power equal to 1. Also recall that certain classical omnibus tests have maximal rate less than 1/2 (Theorem 5.2). Probably the most important aspect of Theorem 7.10 is the expression for the limiting power of the order selection test. Notice that the only difference between the null distribution of Tn and its limiting power is that the central xi random variables in the null case become noncentral xi's under local alternatives. The noncentrality parameter at index j is a]/ u 2 . The power expression in Theorem 7.10 allows one to compare the power of the order selection test with that of other tests having maximal rate of 1/2.
7. 1.4 A Best Test? A virtual plethora of tests were proposed in Section 7.6. A natural question to ask is, "Which one is best?" On the assumption that one has taken pains
204
7. Testing for Association via Automated Order Selection
to ensure the validity of each test, the question of best boils down to power. The simplest answer is that it seems likely that no one test is uniformly more powerful than all others. Furthermore, it seems doubtful that any test in Section 7.6 is inadmissible. In other words, it is likely that no test will be dominated (in terms of power) by some other test for every function r. It would still be helpful to know circumstances under which a given test tends to be powerful (or not). To expedite our discussion, we remind the reader of the relevant tests and provide each with an abbreviation: order selection (OS), Neyman smooth based on Mallow's criterion (NSM), Neyman smooth based on Schwarz' criterion (NSS), F-ratio with random degrees offreedom (RANF), maximum of risk criterion (MAXR), and order selection using a Rogosinski kernel (ROGO). Most of our comparisons of tests will be based on simulation studies, although we briefly discuss some theoretical work too. Kuchibhatla and Hart (1996) performed simulations comparing OS, NSM and ROGO. For the functions they considered there was very little difference in the power of the OS and ROGO tests. For the same functions NSM tended to have slightly lower power than OS and ROGO for low frequency alternatives, and considerably higher power than OS and ROGO for high frequency alternatives. It is also worth noting that in simulations of Kuchibhatla and Hart (1996) for three different functions at three different sample sizes, NSM had either comparable or substantially better power than the von Neumann test (Section 5. 5 .1) . Eubank (1995) conducted a simulation comparing OS, NSS and Buckley's test (Section 5.5.2). The alternative regression functions were a line andAcos('rrjx),j = 1,2,3,x E [0,1]. WhenrwasalineorAcos(nx),Eubank reports that OS and Buckley's test had similar power, with both being slightly better than NSS. For the higher frequency cases A cos(2nx) and A cos(3nx), the power ranking (from best to worst) was NSS, OS and Buckley. For these last two functions the discrepancy between OS and Buckley was greater than that between NSS and OS. Simulations of Lee (1996) indicate that NSS tends to be more powerful than NSM for lower frequency functions, whereas for very high frequency functions, NSM is much more powerful than NSS. In the context of testing for additivity of a regression function, the simulation of Eubank, Hart, Simpson and Stefanski (1996) showed that OS and RANF tests had very similar power. Whether or not similar results hold in testing for no effect remains to be seen. That brings us to the MAXR test, for which the author is unaware of any power studies. A reasonable conjecture would be that MAXR and OS have similar power characteristics, although again this remains to be seen. In Section 7.7.3 OS was shown to have maximal rate of 1/2 under very general conditions. Under the same conditions it can be shown that NSM also has a maximal rate of 1/2. NSS, though, has a maximal rate of 1/2 1 under model (7.33) if and only if a1 =f. 0, where a 1 = 0 cos(nx)g(x) dx.
J
7.8. Choosing an Orthogonal Basis
205
Ironically, this curious result is a consequence of the consistency of the Schwarz criterion. For alternatives that are very close to the null (i.e. of the form in (7.33)), the Schwarz criterion chooses model order 1 with probability tending to 1. Consequently, if a1 happens to be 0, then the limiting power of NSS is equal to its level. Using a different form of asymptotic analysis yields quite different results. Eubank (1995) uses a so-called intermediate asymptotic efficiency to compare the power of OS, NSS and Buckley's test. This form of analysis considers local alternatives of the form n-'YI 2 g(x), "! E (0, 1), but insists that the significance level, an, of a test tend to 0 in such a way that -log(an)n-£ --+ 0 for some I! E (0, 1). With this form of analysis, Eubank (1995) obtains nontrivial asymptotic relative efficiencies. His results show that NSS is always at least as efficient as either of OS or Buckley's test. The OS test can be either more or less efficient than Buckley's test depending on characteristics of the alternative function. If the first Fourier coefficient of g is sufficiently small (large) relative to the others, then OS is more (less) efficient than Buckley. Suppose one wishes to test for no effect using one of OS, NSS, NSM, or Buckley's test. A careful examination of the results cited in this section suggest the following summary. If one suspects the underlying regression function to be predominantly high frequency, then NSM is the preferred test. If the regression function is such that
r(x)
~ ¢0
+ 2¢1 cos(7rx),
x E [0, 1],
then either Buckley's test or NSS is preferred. If one is unsure of the underlying nature of r, then either OS or NSS are good compromises, with perhaps a slight nod going to NSS based on the asymptotics and simulations of Eubank (1995).
7.8 Choosing an Orthogonal Basis In the author's experience the set of basis functions used will have more impact on power than will the type of statistic. (An example illustrating the effect of basis will be given in Section 10.2.) The discussion in Section 7.7.4 was predicated on using the cosine basis {cos(7rjx) : j = 1, 2, ... }. However, the order selection test and all the tests in Section 7.6 have analogs for any given orthonormal basis {u1 , u 2 , .. •} for L 2 (0, 1). Very generally speaking the asymptotic distribution of the test statistics will be invariant to the orthonormal basis used (Section 8.2.1). In Section 7. 7.4 we pointed out that certain tests have good power against high frequency alternatives. "High frequency" in reference to the cosine basis literally means that the function oscillates rapidly. For some other basis {u 1 , u 2 , ••• } , functions Uj with large indices j will not necessarily be high frequency. This has the following implication. Suppose that cosines are
206
7. Testing for Association via Automated Order Selection
replaced by uj's in the tests of Section 7.7.4. Then a test that was powerful against high frequency functions will now be powerful against functions well approximated by linear combinations of Uj 's for large j. Likewise, a test that was good against low frequency functions will now be powerful against functions of the form c0 + c1u1(x) + c2 u 2 (x), which may or may not be "low frequency." A simple way of appreciating the preceding paragraph is to realize that the ordering of the functions within a basis will have a big impact on order selection type tests. Suppose we take as our basis
u·(x) 1
= {
cos(n\11- j)x), cos(1fJX),
j = 1, ... ,10 j 2: 11
and then apply an order selection test to u 1 , u 2 , ...• This basis is mathematically the same as our usual cosine basis, but now an order selection test will have its best power against functions well-approximated by c 0 + c1 cos(lOnx). Another way to look at this phenomenon is to recognize that any test based on an order selection criterion of the form m
naz
L a-i. - Cnm,
m = 1, 2, ... , (Cn
> 1),
j=l
implicitly uses a prior for model order (m) that is, roughly speaking, inversely proportional to m. Of course, ordering the functions in a basis from lowest to highest frequency is not altogether arbitrary, since experience indicates that low frequency functions tend to occur more frequently in practice than do high frequency ones. An alternative to the truncation inherent in order selection schemes is to use thresholding, which has recently become popular with the advent of wavelets (Donoho and Johnstone, 1994). In thresholding, each basis function is given equal weight, in the sense that the function with the largest (in absolute value) sample Fourier coefficient enters the model first, the function with the next largest Fourier coefficient enters next, and so forth. In the testing context, intriguing possibilities exist for thresholding. Some work in this vein has been done by Fan (1996) and Lee (1996), both of whom propose tests that combine truncation and thresholding. Simulation studies of Lee (1996) indicate that, overall, such tests tend to be better than tests based on truncation or threshholding alone.
7.9 Historical and Bibliographical Notes One of the first authors to propose a test of model fit based on a data-driven model selector was Parzen (1977). His proposal will be discussed in Section 9.8. We noted in Section 6.5 that Yanagimoto and Yanagimoto (1987) appear to be the first authors to propose, in the regression context, a test
I
7.9. Historical and Bibliographical Notes
207
of model fit that uses a completely data-driven nonparametric smoother. Theirs is a test of the straight line hypothesis and uses smoothing splines. Barry and Hartigan (1990) worked out asymptotic distribution theory for a test that is the "no-effect" analog of the linearity test of Yanagimoto and Yanagimoto (1987). The order selection test of Eubank and Hart (1992) has already been discussed at length in this chapter. Hart and Wehrly (1992) proposed tests of polynomial fit based on the data-driven smoothing parameter of a kernel estimator. Barry (1993) proposed a test of additivity that is analogous to the Barry and Hartigan (1990) test. In the same context Eubank, Hart, Simpson and Stefanski (1995) studied order selection-type tests. For the goodness-of-fit setting, Bickel and Ritov (1992) and Kim (1992) propose (different) tests based on data-driven cosine series, while Ledwina (1994) and Kallenberg and Ledwina (1995) investigate data-driven Neyman smooth tests. Fundamental work on the distributional properties of data-driven model selectors is that of Woodroofe (1982). Recently Zhang (1992) made use of such distributional properties in a study on choosing an appropriate penalty constant for a model selection criterion.
'I
8 Data-Driven Lack-of-Fit Tests for General Parametric Models
8.1 Introduction In this chapter we consider testing the fit of parametric models of a more general nature than the constant mean model of Chapter 7. We begin with the case of a linear model, i.e., the case where r is hypothesized to be a linear combination of known functions. The fit of such models can be tested by applying the methods of Chapter 7 to residuals. It will be argued that test statistics generally have the same distributions they had in Chapter 7 if least squares is used to estimate model parameters. We also consider testing the fit of a linear model by using methodology based on trigonometric series smoothers. In this event, the null probability distribution of the test statistic depends upon the functions comprising the linear model, but not upon any unknown parameters. Finally in this chapter, we investigate the use of tests as in Chapter 7 for testing the fit of a general nonlinear model. The probability distribution of the test statistic under H 0 depends upon the nonlinear model and, in general, unknown model parameters. This requires that one estimate the parameters in order to approximate the null distribution of the test statistic. This phenomenon is also encountered when applying KolmogorovSmirnov or Cramer-von Mises tests to check the fit of incompletely specified probability distributions. The bootstrap is an especially valuable tool for approximating the null distribution of test statistics when the model is nonlinear.
8.2 Testing the Fit of Linear Models Consider our usual model
Yi = r(xi) + Ei, i = 1, ... , n, where 0 :=:; x 1 < · · · < Xn :=:; 1 are arbitrary (but fixed) design points. By (8.1)
far the most common type of model used for r is the linear model. By a 208
',~
8.2. Testing the Fit of Linear Models
209
model, we mean that r is assumed to have the form p
r(x)
=I: (}iri(x),
0::::; X ::::; 1,
i=l
r1, ... , rp are known functions and 81 , ... , Bp are unknown parameGiven data from (8.1), suppose ~hat we ~stimate 81 , ... , Bp using least and denote these estimates el, 'ep. Define the residuals e 1 , ... , en by 0
0
0
p
ei =
Yi- l:Ojrj(Xi),
i = 1, ... ,n.
j=l
If the null hypothesis is true, the residuals are p
ei
=
Ei
+ L(Bj- Oj)rj(xi),
i
=
1, ... , n,
j=l
and hence will tend to behave like the error terms E1 , •.. , En, especially when n is large. On the other hand, if r is not a linear combination of r 1 , ... , rp, the residuals will contain a systematic component due to the discrepancy between r and the projection of r onto the space of functions spanned by r 1 , ... , rp. The behavior of the residuals suggests that we check the fit of a linear model by applying a test as in Chapter 7 to e 1 , ... , en. The only problem with this proposal is that the functions cos(1rx), cos(21rx), ... are not all orthogonal to the null functions r 1 , ... , r P except in the special case p = 1 and r 1 = constant, i.e., the case treated in Chapter 7. This lack of orthogonality entails that the probability distributions of the test statistics from Chapter 7 will depend on r1o ... , rp. While the dependence on the null functions can be dealt with, as we shall see in Section 8.2.2, it is also of interest to derive test statistics whose null distributions are the same as in Chapter 7. The latter problem is the subject of Section 8.2.1.
8.2.1 Basis Functions Orthogonal to Linear Model Suppose we can define functions Uj, j orthogonality properties: (8.2) where Djj
1
n
n
i=l
- L Uj(xi)uk(xi) = Djk,
= 1 and Djk = 0 for n
(8.3)
=
L uj(xi)rk(xi) i=l
j
=0,
-=/=-
1, ... , n- p, with the following
j, k
=
1, ... , n- p,
k, and j
=1, ... , n- p, k =1, ... ,p.
210
8. Data-Driven Lack-of-Fit Tests for General Parametric Models
The functions u 1 , ... , Un-p could depend on n, but for simplicity's sake we have suppressed this in the notation. Suppose also that the n X p matrix R = {rj(xi)} has full column rank. Now define the sample Fourier coefficients
1 n aj=-L:uj(xi)li, n i=l
j=1, ... ,n-p.
These coefficients are equal to the analogous ones computed from residuals, since
= aj,
with the last step following from property (8.3). Also, the orthogonality properties (8.2) and (8.3) entail that ih, ... 'am are the least squares estimates of a 1 , ... , am in the linear model
li
=
P
m
j=l
j=l
L ejrj(xi) + 2: ajUj(xi) +
i
Ei,
= 1, ... 'n.
Consider testing p
(8.4)
Ho : r(x)
=
2: eiri(x),
0:::; X:::; 1,
i=l
by using one of the tests defined in Chapter 7 with Y1 , ... , Yn replaced by e1, ... , en. For example, consider the test statistic m
S n_-
~2
1 '"""" naj
max-~-~-,
l.Smsn-p
m j=l
0'
2
where 8- 2 is an estimator of 0' 2 based on the residuals e 1 , ... , en. When H 0 is true,
(8.5) again by (8.3). The only difference, then, between null distribution theory for Sn and Tn is that Sn is based on the collection of basis functions {u1, . .. , Un-p} rather than { cos(1rx), ... , cos (1r(n - 1)x)}. We first consider the case where the E/s are i.i.d. N(O, 0' 2 ). Obtaining the distribution of Sn in this case is fairly simple since the Fourier coefficients al, ... 'an-p are themselves i.i.d. normal random variables under the null hypothesis.
8.2. Testing the Fit of Linear Models
211
Theorem 8.1. Suppose that E1 , ... , En in model (8.1) are i.i.d. as N(O, CJ 2 ), and let Z1 , Z2 , ... be i.i.d. standard normal random variables. If fj ~ CJ, then under the null hypothesis (8.4) 'D ----*
Sn as n
--+
1 ~ 2 sup - L.i Zi m:C:l m j=l
oo.
PROOF. Due to (8.5) and the conditions on the uj's, we have are i.i.d. as N(O, CJ 2 /n). This implies that
al, ... 'an-p
1 m nii12 - ~ -m j=l L.i (J2
max
l<m
is equal in distribution to 1
max
l<m
m
- ~ Z2 m j=l L.i J
and hence that '2 (J
2 Sn
'D ----*
CJ
1
m
~
2
sup - L.i Zi . m:C:l m j=l
The result is thus proven by using Slutsky's theorem and the consistency of fJ 2 . D Using essentially the same argument as in Theorem 8.1, one may show that test statistics analogous to the others in Chapter 7 have the same limiting distributions under (8.4) as their Chapter 7 counterparts. Let us now consider the case of non-Gaussian errors. First observe that if H 0 is true and the Uj 's are such that the appropriate Lindeberg-Feller condition holds, then for each fixed k (8.6)
Vn (a1, ... , ak)
_!?__, (Z1, ... , zk),
(J
where Z 1 , ... , Zk are i.i.d. standard normal. The fact that the limiting distribution has a diagonal covariance matrix is a consequence of the orthogonality conditions on the uj's. Whenever (8.6) holds and fj is consistent for CJ, it follows immediately that (8.7)
max
l<m
1
m
L m
j=l
'2
naj fJ 2
v
----*
max -1 l<m
L Z· m
2
j=l
1 ·
The only difference between Sn and the statistic in (8.7) is that the latter takes the maximum over the fixed set {1, ... , k} rather than a set depending on n. For practical purposes result (8.7) is probably adequate since k
212
8. Data-Driven Lack-of-Fit Tests for General Parametric Models
may be arbitrarily large. Using an a priori bound of, say, 200 for the number of Fourier coefficents will be adequate in most problems a data analyst encounters. By placing further restrictions on u 1 , ... , Un-p and the Ei's we may remove the upper bound on the number of Fourier coefficients. Theorem 8.2. Suppose that E1 , ... , En in model (8.1) are i.i.d. with mean 0 and finite fourth moment and that
1 n
~
2
2
- ~uj(xi)uk(xi
(8.8)
)
::::;
C
i=l
for some constant C, all1 ::::; j, k ::::; n- p and all n. If in addition 8- ~ cr, then under the null hypothesis (8.4) 1 ~
IJ
Sn--+ supm2:1 m as n
2
~Zj j=l
---+ oo.
PROOF.
In analogy to the proof of Theorem 7.2, define
z. Jn-
y'riaj cr '
j
=
1, ... , n- p.
Using (8.5) the proof now proceeds exactly as did that of Theorem 7.2 except that Var[
t -1)] (Zfn
=
i=m+l
2(r- m)
+ ( E~:i) -
3) ~2 ~
which by (8.8) is bounded by
2(r- m)
+ ( E~:£) -
3) C(r: m)2
The remainder of the proof is identical to that of Theorem 7.2.
D
A sufficient, but not necessary, condition for (8.8) is that the uj's be uniformly bounded for all j and n. This is true, for example, when testing r for constancy and u 1 , ... , Un-p are taken to be cosines, as in Chapter 7. The uniformly bounded condition is less than ideal, however, since it excludes some interesting bases, such as orthogonal polynomials. It is thus desirable to have a condition such as (8.8) to check, although here we will not pursue the question of how generally (8.8) holds. We note, however,
8.2. Testing the Fit of Linear Models
213
that even (8.8) is not always necessary, as implied by Theorem 8.1 where the Ei 's are assumed to be Gaussian. That result asked only that the Uj 's satisfy (8.2) and (8.3). (Here we have sidestepped the question of whether or not (8.8) is a consequence of (8.2) and (8.3).) We may construct ui, ... , Un-p per the method discussed in Section 6.3.2. As a starting point one may choose any collection of functions that form an orthogonal basis for £ 2 (0, 1). If one suspects a particular type of departure from the fitted linear model, then that knowledge could be used in choosing a "powerful" basis. Consider n - p of the functions in the chosen basis, and call them VI, ... , Vn-p· In general, these functions will not be orthogonal to the rj's in the sense of (8.3). Furthermore, in spite of their orthogonality, VI, ... , Vn-p will not satisfy (8.2) for an arbitrary design XI, ... , Xn· However, we may use a Gram-Schmidt procedure to construct linear combinations ui, ... , Un-p of ri, ... , rp, VI, ... , Vn-p that satisfy both (8.2) and (8.3). It is important to know that one need not construct ui, ... , Un-p in order to carry out the lack-of-fit test. Suppose that ordinary least squares is used to fit two linear models, one based on ri, ... , rp and the other on ri, ... , rp, VI, ... , Vm· Let SSEp and SSEp+m be the respective residual sums of squares for these two models. It follows that m
n
La; = SSEp -
SSEp+m>
m
= 1,
0
0
0
)
n- p,
j=I
where aj, j = 1, ... , m, are the Fourier coefficients that would result from carrying out the Gram-Schmidt procedure. The statistic Sn may thus be expressed as
So, the simplest ordinary least squares software suffices for carrying out an order selection test. An advantage of constructing ui, ... , Un-p is that one need not fit a large number of linear models to obtain the sums ~';=I Furthermore, a number of the vj's may be highly correlated with ri, ... , rp, a condition known as collinearity. Use of Gram-Schmidt can avoid numerical problems caused by collinearity.
a;.
8.2.2 Basis Functions Not Orthogonal to Linear Model Given residuals ei, ... , en from a fitted model and basis functions VI, v 2 , . . . , consider Fourier coefficients
t 214
8. Data-Driven Lack-of-Fit Tests for General Parametric Models
Even though v1 , v2 , ... may not be orthogonal to the functions comprising the fitted model, the bj 's nonetheless contain information about model fit. Suppose we use a test statistic of the form
Sn
(8.9)
=
max
l<m
1 m
Lm --:::--2nb] · j=l
(J"
2
One advantage of this test statistic is its computational simplicity. The Fourier coefficients bj, j = 1, 2, ... , are easy to compute, requiring no orthogonalization algorithm. Two questions are pertinent at this point: (i) what is the distribution of Bn under the null hypothesis and (ii) how does the power of a test based on Sn compare with that of Sn? A discussion of power will be deferred until Section 8.4. To provide some insight into the first question, we consider a special case in which the null hypothesis is (8.10) Theorem 8.3 provides the limiting distribution of Bn under hypothesis (8.10) when v1 , v2 , ... are a cosine basis. A careful examination of the proof of this theorem will show how to extend the result to more general linear models and other bases. Theorem 8.3. Suppose that model (8.1) holds with r(x) = 80 + B1 r 1 (x), where r1 is bounded, integrable on [0, 1] and not of the following form for any finite k 2: 0: k
L
bj cos(njx),
0 ~ x ~ 1.
j=O
Assume also that the Ej 's are i.i.d. with finite fourth moments. Let ei be the ordinary least squares estimator of ei, i = o, 1, and take to be as in (8.9) with
sn
n
bj
1
= n- Lei cos(njxi),
1 ~ j < n,
i=1
and G- 2 any weakly consistent estimator of 0" 2 . Defining 'i'! and
/2 J; r1(x) cos(njx) dx (Io (rl(x)- r1) 2 dx) 1
1/2 '
j = 1, 2, ... '
I
l
8.2. Testing the Fit of Linear Models
let {Zj : j = 1, 2, ... } be a Gaussian process with E(Zj) covariances
=
215
0, j ;:::: 1, and
i=j i=J=j.
It then follows that, as n ----> oo, variable
Bn
converges in distribution to the random
1
S =sup k~1
k
k
L zJ. j=1
By Slutsky's theorem, we may assume that u is known. The support of the random variable S is (1, oo), since 2:~= 1 ZJ / k converges almost surely to 1 ask----> oo (Serfiing, 1980, p. 27). In considering P(Sn :::; x), we thus take x > 1. (It is easy to establish that P(Sn :::; x) ----> 0 for x < 1.) Notice that Zjn = ffnbj/u may be written PROOF.
1
(8.11)
Zjn
=
. ,r;;;
vn
n
L
j = 1, ... 'n,
WijnEi,
i=1
where i = 1, ...
n
n
i=1
i=1
,n,
We have
P(Sn:::;
x)
p(
1 max -k tzJn:::; 1:C::k:C::kn j=1
X
n
1 max -k tzJn:::; kn
x)
where kn is an unbounded sequence of integers. We first show that 1 P ( max -k t zJn :::; kn
(8.12)
x)
----> 1
as n----> oo. Let Vjn denote E(Z]n); then the probability in (8.12) is at least
P
1 k max -k L(Z}n- Vjn) ( kn
+ knmax
1 k ) -k L Vjn:::; x . j=1
Using (8.11) and the boundedness and integrability of r, one can establish that 2:~= 1 Vjn/k ----> 1 ask ----> oo; hence, for all n sufficiently large, the last
216
8. Data-Driven Lack-of-Fit Tests for General Parametric Models
probability is at least
P ( max
kn
1 -k
t(zJn - Vjn) :::; "() j=1
where 'Y = (x 1)/2. We now use essentially the same argument as on p. 1422 of Eubank and Hart (1992) to show that the last probability tends to 1. The only difference in the present proof is establishing that Var(I:~= 1 z]n) :::; Ak for some constant A and all k, but this is easily done by using (8.11) and the conditions on r. For notational convenience, let K = kn. Our next step is to argue that (8.13)
1
p ( max -k 1
~ ZJ x) - p ( 1
6j=1
2 n :::;
2 n :::;
---7
0,
where Z 1n, ... , ZKn have a multivariate normal distribution with mean a vector of O's and covariance matrix equal to Cov(Z1n, ... , ZKn) = I;Kn· From (8.11), I;Kn
=
IK - BKnBkn/Vn,
where IK is the K X K identity matrix and BKn is a K-dimensional column vector with jth element bjn· We can write the difference between probabilities in (8.13) as (8.14)
P(I;:K!,/
2
(Z1n, ... ,ZKn)' E AKn(x))
-P(I;:K!,/ (Z1n, ... , 2
ZKn)' E AKn(x))
for some Borel set AKn(x) inK-dimensional Euclidean space. Now apply Theorem 13.3 of Bhattacharya and Ranga Rao (1976) to the independent random vectors (in = Eii;Kn WiKn, i = 1, 'n, where wiKn (wiln, ... , WiKn)'. In doing so we note that n- 1 2:::~= 1 Cov((in) is the K -dimensional identity and 0
0
0
(8.15) Using Rao's (1973, p. 33) formula for I;J(~ and our conditions on r, one may
(L:?=K+lbJn
r
verify that expression (8.15) is bounded by CK 4 I for some constant C and all n sufficiently large. By Theorem 13.3 of Bhattacharya and Ranga Rao (1976), it now follows that the absolute value of (8.14) is bounded by (8.16)
a(K)K
4
/
(I:?=K+ 1
vn
bJnr
8.2. Testing the Fit of Linear Models
217
where a(K) is a constant depending only on K. Since an infinite number 1 of the j3/ s are nonzero and, for each j, bJ n ---+ f3] 0 ( t (x) - t) 2 dx, there exists an unbounded sequence of integers K such that (8.16) tends to 0 as n ---+ oo. The proof is completed by showing that there exists a sequence of integers K such that
J
(8.17)
P ( max
l~k~K
~k
t .
J=l
z]n ::;
x) - P (sup ~ t z] ::; x) k>l -
k .
---+
0,
J=l
where { Zj} is the Gaussian process of Theorem 3.3. It can be verified that there exists a sequence achieving both (8.17) and the convergence of (8.16) to 0. 0 The limiting distributions of the analogs of the other statistics in Chapter 7 becoine obvious once one examines the proof of Theorem 8.3. The only essential difference between the theory for this section and that in Section 7.3 is with respect to the limiting distribution of the sample Fourier coefficients. In Section 7.3 ¢1 , ... , ¢k are asymptotically multivariate normal with covariance matrix proportional to an identity, whereas in this section b1 , ... , bk are asymptotically normal with the covariance structure defined in Theorem 8.3. It is important to note that whenever the regression function is linear in the ei 's and least squares is used to estimate the model, the residuals e 1 , ... , en are completely free of the unknown ei 's, regardless of the size of n. This implies that any test statistic based only on the residuals will have a null distribution that does not depend on unknown parameters. Of course, Theorem 8.3 shows that the distribution of such a statistic will depend upon the functions r 1 , ... , rp. The interested reader may wish to check his or her understanding of Theorem 8.3 and its proof by formulating and proving a multiparameter version of the theorem.
8.2.3 Special Tests for Checking the Fit of a Polynomial Fitting polynomials is a popular way of dealing with curvilinearity in regression. Consequently, testing the fit of a polynomial is a problem that arises frequently in practice. At least three popular nonparametric smoothers have the following property: For some k, the smoother tends to a least squares polynomial of degree k as the smoothing parameter becomes large (which corresponds to more and more smoothing). This is an attractive property since the fit of a polynomial may be checked by simply comparing smooths with different smoothing parameters. Smoothing splines, boundary modified Gasser-Muller estimators and local polynomials share
218
8. Data-Driven Lack-of-Fit Tests for General Parametric Models
the property of being polynomials for large smoothing parameters. Cox and Koh (1989), Chen (1994a, 1994b) and Jayasuriya (1996) have used smoothing splines to test the fit of polynomials, whereas Hart and Wehrly (1992) use Gasser-Muller type smoothers for the same purpose. Here we shall describe the idea of Hart and Wehrly (1992) as it applies to local polynomials. Their method is very similar to the spline smoothing methods of Yanagimoto and Yanagimoto (1987) and Barry and Hartigan (1990). Suppose we wish to test the null hypothesis
Ho : r(x) = Bo
+ elx +
0
0
0
+ ekxk-1, v X E [0, 1],
where the B/s are unknown, against the alternative
Ha:
fo
1
(r(k)(x)r dx > o.
The proposed test makes use of a (k -1 )st degree local polynomial smoother rh with bandwidth h. Let h be a data-driven bandwidth for rh chosen by either cross-validation or one-sided cross-validation. When H 0 is true, the optimal value of h will be very large, owing to the fact that a local polynomial estimate with "small" h will be less efficient than the least squares polynomial of degree k- 1. If, on the other hand, Ha is true, then the optimal h will be relatively small since consistency results only if h ---+ 0 as n ---+ oo. It thus seems sensible to use a test of the form (8.18)
reject H 0 if
h < c,,
where c, is chosen to make the level of the test a. One needs to investigate the behavior ofthe cross-validation curve under the null hypothesis in order to determine critical values for the test (8.18). When the null hypothesis is true, the estimator rh (x) is unbiased at each x E [0, 1] and for all h. Specifically,
rh(x) = r(x)
+ rh(x),
where rh (x) is just the (k -1 )st degree local polynomial applied to the noise E1 , ... , En· It follows that under H 0 a cross-validation curve depends only on the noise, and not in any way on the unknown polynomial. Furthermore, the null distribution of his invariant to the value of cr whenever El/cr, ... , En/cr are independent and have a common distribution that is free of cr. To see why, note that, under Ho,
and so CV(h)jcr 2 is scale-free under the stated condition on the Ei's. But, CV(h) and CV(h)jcr 2 have the same minimizer, and so h has the same distribution regardless of the value of cr. Using the preceding facts it is straightforward to approximate the null distribution of h using simulation. If one assumes the errors to be i.i.d.
8.3. Testing the Fit of a Nonlinear Model
219
normal, then one may generate many independent random samples of size n from the standard normal distribution and compute h for each sample. Otherwise, one could resample from the residuals obtained upon fitting a (k- 1)st degree polynomial by least squares. Hart and Wehrly (1992) used simulation to approximate the distribution of h for a boundary-modified Gasser-Muller smoother. For the case k = 1, they assumed Gaussian errors and used ordinary cross-validation to obtain h. The resulting sampling distribution of 1/ h resembles a continuous analog of the distribution of min the Fourier series setting (Figure 7.1). Specifically, it was found that a value of h larger than 1 occurs with probability of about .65. The significance of h greater than 1 is that the corresponding smooth is essentially a straight line. So, ordinary cross-validation chooses the "correct" model with about the same probability as does the risk criterion Jm. Simulations performed by the author have shown that lack-of-fit tests based on OSCV bandwidths tend to be more powerful than ones based on the ordinary cross-validation bandwidth. This apparently results from the smaller variability associated with OSCV, as discussed in Chapter 4.
8.3 Testing the Fit of a Nonlinear Model Suppose we wish to test Ho : r E {re : 8 E 8}, where re(x) is not linear in the components of e. One can construct test statistics for H 0 exactly as described in the previous section. Given an estimator of e, we can compute residuals ei = Yi- r 0(xi), i = 1, ... , n, and then a statistic based on those residuals. In general, however, the null distribution of this statistic will depend upon the unknown value of the parameter e. This follows from the well-known fact (see, e.g., Carroll and Ruppert, 1988) that the distribution of e);(}, where is, say, a least squares estimator, often depends upon e for nonlinear models. In Section 8.3.1 we will consider how the unknown parameter enters into the null distribution of a test statistic. This will show us how to conduct a large-sample lack-of-fit test for a nonlinear model. In Section 8.3.2 a bootstrap algorithm is described which will often provide better approximations to significance levels in small samples.
e
(e-
e
8. 3.1 Large-Sample Distribution of Test Statistics We now provide a sketch of some theory that shows how to obtain the limiting distribution of analogs of the statistics in Chapter 7 when the null model is nonlinear. The distribution of any such statistic depends only upon the joint distribution of the sample Fourier coefficients. As in Section
--220
8. Data-Driven Lack-of-Fit Tests for General Parametric Models
8.2.2, define
1
n
bj = - Lei cos(njxi), n
j
= 1, ... , n- p,
i=l
where p is the number of components of e. The following decomposition is useful:
b·-b·+~· J J J> where
and
Let us consider the case where p = 1, and assume that, for each x, ro(x) is twice differentiable with respect to e and that
8 2 ro(x) /JB2 is bounded uniformly in
X
and
e. Then
~i = (B- BhJ,n(B) + Op(B- 8) 2 , where
1 ~ /Jro(xi)
'f'J,n(B) = -;;;,
-8
/JB
. cos(nJxi)·
In most cases of interest v'2ii(B- B)/u will converge in distribution to a random variable having a normal distribution with mean 0 and variance Vo (]"• Furthermore, it will generally be true that the limiting distribution B) I(}' is multivariate normal with mean vector 0 and of 'V2ii(b1, ... , bk, covariance matrix of the form
e-
h 2Co,O") ( 2q,O" Vo,O" , where h is the k X k identity matrix and Co (]" is a k X 1 vector whose ith element is the limit of n Cov(bi, B)/u 2 . This' in turn implies that, when H 0 is true, the limiting distribution of v'2ii(b 1 , ... , bk)/u is multivariate normal with mean vector 0 and covariance matrix
Ijk(e, u) = Ik where 'I'( B) is a k
X
+ Vo,O"'/'(B)"!'(B) + 4'/'(B)C~,O",
1 vector with ith element limn->oo '/'i,n(B).
8.3. Testing the Fit of a Nonlinear Model
221
To illustrate how to carry out a large sample test, consider the case where the fit of a nonlinear model (with p = 1) is to be checked by using the statistic -
1
m
L
Sn(k) = max l<m
-
m
j=l
2nb 2
~· 0'
When a is consistent for 0', and (b 1 , ... , bk) has the limit distribution described in the previous paragraph, Bn (k) converges in distribution to
S(k)
=
1 max -
l<m
-
m
"'zJ, m L.......t j=l
where Z 1 , ... , Zk have a multivariate normal distribution with mean 0 and covariance matrix Ijk(B, 0'). Let Fk(·; B, 0') be the distribution function of S(k) when the true parameter values are (B, 0'). Given a set of data with corresponding estimates and&, we reject the null hypothesis at (nominal) level a if and only if Bn(k) exceeds the (1- a) quantile of the distribution
e
Fk(·;
e, &).
8.3.2 A Bootstrap Test The validity of the large-sample test in the previous section is obviously reliant on the assumption that the sample Fourier coefficients have a multivariate normal distribution. When n is not particularly large this assumption may be suspect. An alternative test could be carried out by using the bootstrap. One approach would be to use essentially the same algorithm as discussed in Section 8.2.3. The only real difference between the linear and nonlinear cases is that, in the latter, the distribution of the lack-of-fit statistic usually depends on B. In other words, in the nonlinear case the test statistic is not a pivotal quantity. As Hall and Wilson (1991) point out, using the simplest form of bootstrap may not be markedly better than a large sample test (as in Section 8.3.1) when the test statistic is not a pivotal quantity. An iterated, or double, bootstrap procedure will often lead to a test whose actual level is closer to the nominal level than are those of either the large-sample or single bootstrap tests. Our description of the iterated bootstrap closely parallels that of Hall (1992, pp. 20-22). Let A be some event determined by data from the regression model (8.19)
Yi
=
re(xi)
+ O'Ei,
i = 1, ... ,n,
where El, ... , En are i.i.d. with mean 0, variance 1 and common distribution function F. The notation P(AI(B, 0', F)) denotes the probability of A under model (8.19). The true, underlying values for B and (]' will be denoted 80 and 0'0 , respectively, and the distribution of Ei by F 0 . If the test statistic is
~~' 222
Sn,
8. Data-Driven Lack-of-Fit Tests for General Parametric Models
1
our goal is to approximate c, where \
P(sn ~ ci(Bo,O"o,Fo)) =a. Note that in general c = c(B0 , O"o, Fa), i.e., c depends on Bo, O"o and F0 . Given data Y 1 , ... , Yn one may compute and {j and the standardized residuals
e
'
Ei
=
Yi- re(xi) , , (}
i
= 1, ... , n.
Letting F denote the empirical distribution of i\, ... , En, the first order bootstrap estimate of c(B0 , O"o, F 0 ) is c(B, {j, F), which may be approximated arbitrarily well by simulation. The double bootstrap involves estimating the value of t, call it t 0 , that solves the equation (8.20) The quantity t 0 may be estimated by the solution to the equation
P(s~ ~ tc(B*,(j*,F*)I(B,{j,F)) =a,
(8.21)
where s~, e*, {j* and F* are functions of data randomly generated from model (8.19) with e = = {j and F = F. Typically, neither c nor t have closed-formed expressions in terms of e, O" and F, and so simulation must be used to approximate them. To approximate
e, (}
P( s~ ~ t c(B*' {j*' F*) I(B, {j, F)) for a given t, one may generate B1 boot-
strap samples from the model (B, {j, F). In the ith bootstrap sample, i = 1' B1' one must approximate c( e;' {ji' Ft)' which may be done by generating B2 bootstrap samples from the model (e;' {ji' Ft) The name double bootstrap obviously derives from the fact that multiple samples must be drawn from each of the original bootstrap samples, leading to a total of B 1 B 2 samples of size n. Having obtained c(B, {j, F) and t0 , the solution of (8.21), the null hypothesis is rejected at level a if Sn exceeds t0 c(B, {j, F). In what sense can one expect this double bootstrap test to be an improvement over a large sample or single bootstrap test? Let aL, a 1 , and a 2 denote the respective levels of large sample, single bootstrap and double bootstrap tests, each having nominal size a. Results of Hall (1992) in a setting similar to model (8.19) suggest that, under appropriate regularity conditions, 0
0
0
'
0
aL
=
a+ O(n- 1 12 ),
a 1 = a+ O(n- 112 )
while a2
=a+ O(n- 1 ).
So, for sufficiently large n the level of the double bootstrap test will be closer to a than that of either a large-sample or single bootstrap test.
1
8.4. Power Properties
223
The double bootstrap can also be used in testing the fit of a linear model. However, doing so is not as important as it is in the nonlinear case. This is because in the linear case one may construct test statistics that are pivotal quantities in the sense that their distributions do not depend on the true regression coefficients or on cr. As we noted before, the single bootstrap tends to better approximate the distribution of pivotal quantities than of nonpivotal ones.
8.4 Power Properties
8. 4.1 Consistency A test done by applying any of the statistics from Chapter 7 to residuals will be consistent against a very general class of alternatives to the fitted model. To make this claim rigorous in at least one setting, the following theorem concerning Sn is provided. Recall that the large sample null distribution of Sn was obtained in Theorem 8.3 for a particular linear model, and the distribution of a truncated version of Sn was discussed in Section 8.3.1 for the nonlinear case. Theorem 8.4. Assume that the Ei 's in model (8.1) are independent with zero means and finite variance cr 2 , and let Xi = (i- 1/2)/n, i = 1, ... , n. Let {re : () E 8} be a parametric family that is smooth in the sense that lru(x) - rv(x)l ~ Cllu- vii for all X and all u, v E e. Suppose that we test Ho : r E {re : () E 8} by rejecting Ho whenever
(8.22)
where {en} is a bounded sequence of constants and v1 (x) = cos(1rjx), j = ' p 1, 2, .... Assume that () ______, fJo for some fJo E 8 as n -+ oo, that Ho is false with 1
in£ { (r(x)- re(x)) 2 dx > 0
(8.23)
BE8 } 0
and that r(·) and re 0 are continuous on [0, 1]. Then the power of the test with rejection region (8.22) tends to 1 as n -+ oo.
~ROOF. Define ~k =
J0\r(x)
- re 0 (x)) cos('lfkx) dx, k
= 1, 2, ... , and_let
k be the smallest integer k such that ~k =/=- 0. (The existence of such a k is
ensured by (8.23).) For n > (8.22) is at least as big as (8.24)
P(v'2nbk ~
k,
the power of the test with rejection region
8'(cnk) 112 )
+P(v'2nbk ~
1 2
-8'(cnk) 1
).
224
8. Data-Driven Lack-of-Fit Tests for General Parametric Models
It is easy to establish under the stated conditions that bk ~ ~k as n ----t oo. Since IJ is consistent for a and Cn is bounded, it follows that (8.24) tends to 1 as n ----t oo, thus completing the proof of Theorem 8.4. 0
8.4.2 Comparison of Power for Two Types of Tests In testing the fit of a linear model, two types of tests based on Fourier series were discussed in Section 8.2. One type uses basis functions that are orthogonal to the functions comprising the linear model, whereas the other uses a given set of basis functions, whether or not they are orthogonal to the linear model. The test statistic Bn corresponds to the latter type of test and Sn to the former. In Section 8.2.2 we raised the question of which of these two tests tends to be more powerful. It is straightforward to establish that a test based on Sn is consistent under very general conditions, as is Bn- Hence, a more detailed analysis is required in order to compare the power of the two tests. To get some insight into power, we will study the behavior of sample Fourier coefficients under the alternative hypothesis. Suppose we are testing the null hypothesis (8.4) and let u 1 , ... , Un-p satisfy the orthogonality conditions (8.2) and (8.3). Defining aj as in Section 8.2.1, we have (8.25)
aj
= aj + aj,n,
j
= 1, ... 'n- p,
where aj = I:~=l EiUj(xi)/n, j = 1, ... , n- p, and
1
aj,n
= ~
n
L
r(xi)uj(xi),
i=l
j
=
1, ... , n- p.
The size of the Fourier coefficients aj,n for "small" j will be the determining factor in the power of an order selection test based on the aj 's. Now (8.2) implies that aj,n is
where p
8n(x)
=
r(x)-
L
eknrk(x),
X E [0, 1],
k=l
and 81n, ... , Opn are the least squares estimates of 81 , ... , OP when the "data" are r(x1), ... , r(xn)· (One may think of I:L 1 eknrk(-) as a least squares approximation of r.) Power of the order selection test will be maximized by using a basis {Uj} for which 8n is well approximated by a linear combination of u 1 , ... , uk, where k is as small as possible. Now consider the case where {b1, b2 , ••• } is an arbitrary orthonormal basis for L 2 (0, 1), and let e 1 , ... , en be the residuals from a least squares
8.4. Power Properties
225
fit of the null model. Each residual has the form
where 7\ is the least squares predicted value at Xi when the data are £1, . . . , En. Now consider the Fourier coefficients 1
A
bj
=-
n
n
Leivj(xi),
j = 1,2, ... ,
i=1
which are appropriate when the design points are evenly spaced. We have (8.26) where bj
j = 1, 2, ... '
= 1:7=1 EiVj(xi)/n, 1
bj,n
j
= 1, 2, ... , and
n
- L 8n(xi)vj(xi),
j = 1, 2, ...
n i=1
The main differences between (8.25) and (8.26) are twofold. The two procedures use different bases, and the latter has the extra stochastic component 1:7=1 r\vj(Xi)/n. The comments made above about a good choice of basis still seem applicable if one uses an order selection test based on the bj 's. That leaves us with the question of what effect the extra stochastic component will have on power. One is tempted to say that the extra noise induced by this term will be detrimental. Is this the case though? In the goodness-of-fit problem Wells (1992) shows that Cramer-von Mises tests are actually more powerful for composite null hypotheses than for simple ones. This seems paradoxical in that parameters must be estimated in the composite case. Although we do not claim that a similar paradox occurs when comparing the tests of Sections 8.2.1 and 8.2.2, the possibility is nonetheless intriguing.
9 Extending the Scope of Application
9.1 Introduction To this point we have dealt with a somewhat limited setting, namely fixeddesign regression with but a single x-variable. In Chapter 9 it will be shown that the tests discussed in Chapters 7 and 8 have a much wider range of application. We shall consider some of the settings in which order selection tests are potentially useful. Our treatment of each setting will be more speculative and not nearly so comprehensive as in Chapters 7 and 8. The goal is not to state and prove an array of theorems on asymptotic distribution theory, but rather to provide a sense of the scope of application of order selection tests.
9.2 Random x's Previously we have assumed that the observed x's were fixed design points. Another modelfor regression data is one wherein (X1, Y1), ... , (Xn, Yn) are a random sample from a bivariate distribution having a regression function
r(x) = E(Y1IX1 = x)
V x.
Suppose we wish to test whether or not r has a particular parametric form. The intuitive motivation for the tests in Chapters 7 and 8 is not dependent on whether or not the design is random. Hence, the only real concern is the effect that randomness of design will have on the null distribution of a statistic. One point of view is that inferences should be carried out conditional on X1, ... , Xn- If one adopts this point of view and if the conditional distribution of Y- r(X) is independent of X, then the distribution theory of Chapters 7 and 8 is still relevant for the random x case. If one wishes, though, to take into account variability in the x's, then further investigation is required to see what effect this has on the null distribution of a test statistic. Let Xc 1J < · · · < X(n) denote the ordered Xsample and Y(l), ... , Y(n) the concomitant Yi 's. In other words, Y(i) is theY 226
9.2. Random x's
227
value that was paired with X(il in the original sample. Bhattacharya (1974) showed that, conditional on X 1 , ... , Xn, Y( 1 ), ... , Y(n) are independent random variables. Suppose we assume further that, conditional on the Xi's, Ei = Y(i)- r(X(i)), i = 1, ... , n, are identically distributed normal random variables with variance 0' 2 . If we now test the fit of a linear model as in Section 8.2.1, then, under the null hypothesis and conditional on the Xi's, m
(9.1)
max 1 <m
-
A2
-1 "\' naj
m j=1 L....t
0'2
is equal in distribution to 1
(9.2)
max
1 <m
-
-
m
"\' Z 2 '
m j=1 L....t
J
where the Zj's are i.i.d. N(O, 1). Since the distribution of (9.2) is free of the Xi's, it follows immediately that, unconditionally, (9.1) is equal in distribution to (9.2). So, if &2 is any consistent estimator of 0' 2 , then m
(9.3)
max 1<m
-
A2
-1 "\' naj m j=1 L....t &2
has the same limiting distribution as in the fixed-design case. It should be clear that if we merely assume E1 , ... , En to be identically distributed conditional on the Xi's, then under appropriate regularity conditions the unconditional limiting distribution of (9.3) will be the same as in the fixeddesign case. A key aspect of the preceding argument is that parameters were estimated by least squares (as in Section 8.2.1). The existence of the Gram-Schmidt procedure ensures that, conditional on the Xi's, the least squares estimates ii1 , ... , iin-p have mean 0 and common variance under the null so long as the conditional variance of each Ei is the same. In contrast, if the estimation scheme of Section 8.2.2 is used, then the conditional second moments of sample Fourier coefficients depend on the X/s. In general this will mean that the probability distribution of x1 will have a first order effect on the distribution of the order selection statistic. If the homoscedasticity assumption on Ei is relaxed, the sampling distribution of a statistic may be approximated using the bootstrap. In the general random design case, the most sensible bootstrap seems to be one which resamples from an empirical distribution composed of (X, Y) pairs. As Hall and Wilson (1991) have pointed out, the following basic principle should be adhered to in applying the bootstrap: Regardless of whether or not H 0 is true, the empirical distribution from which we res ample should be consistent with the null hypothesis. In testing the no-effect hypothesis, for example, this means that our empirical distribution should have a constant regression function. Suppose that data (Xi, Y,;), i = 1, ... , n, are a random
'
I
I
228
9. Extending the Scope of Application
sample from a distribution F, and let Fn be the ordinary empirical distribution function of the data. Even if F has a constant regression function, Fn does not. Indeed, the conditional distributions Y/X of Fn are degenerate at different values for y. Worse yet, ifF has a nonconstant regression function, then resampling from Fn would often be wildly inconsistent with sampling from a distribution for which H 0 is true. These considerations imply that resampling from Fn is not a good idea. How can we construct an empirical distribution for which Ho is true? One possibility is as follows. Define Y = n- 1 I:~=l Yi, and consider the distribution Gn that assigns equal probability to each of the 3n points
(Xi, Yij),
=
j
1, 2, 3, i
=
1, ... , n,
where Yi1, Yiz, Yi3 are chosen to satisfy (for each i)
1
1
3
-3L_., "' Yij
=
Y
and
3
-
-
3 L(Yij- Y)m = (Yi- Y)m,
m
=
2, 3.
j=l
J=l
Clearly Gn has a constant regression function, and furthermore, when H 0 is true, Gn will have conditional variance and skewness that crudely approximate those of F. The null distribution of a statistic may be approximated by repeatedly drawing random samples of size n from Gn. This particular version of the bootstrap is essentially an application of the wild bootstrap of Hardle and Marron (1991). It turns out that for purposes of approximating the sampling distribution of a statistic, a crude approximation to conditional variance and skewness suffice (Mammen, 1993).
9.3 Multiple Regression Suppose that the independent variable xis a k-dimensional vector (k 2: 2) and that one wishes to test the fit of a parametric model r(x; 8). We shall consider two basic approaches. One approach is to use the tests proposed in Chapters 7 and 8 by regressing residuals on a scalar function of x. Alternatively one may devise smoothing-based tests that explicitly take into account the multivariate nature of x. The advantage of the former approach is its simplicity, whereas the latter is consistent against a larger class of departures from the hypothesized model. Suppose that our model is
Yi
=
r(xi)
+ Ei,
i = 1, ... , n,
where x1, ... , Xn are fixed k-vectors and the error terms E1, ... , En are i.i.d. with mean 0 and variance u 2 • We will illustrate the first approach for diagnosing lack of fit by considering the case where the null model is linear.
r 9.3. Multiple Regression
229
The null hypothesis is p
Ho : r(x)
=
L
vX
ejrj(x),
E
Ak
j=l
where Ak is a compact subset of Rk and r 1 , ... , rp are known functions. We may use least squares to estimate ell ... 'ep, and then compute residuals p
ei
= Yi- LOjrj(xi),
i = 1, ... ,n.
j=l
Now let t(x) be a known, real-valued function of x. If the null hypothesis is true, the residuals should not be correlated with any function of x. We thus apply a test as in Chapter 7 to check for association between residuals and t(x). Given orthogonal functions u 1 , ... , Un-p of a single variable, we may use a Gram-Schmidt procedure to construct functions V1, ... , Vn-p that are linear combinations of r 1 , ... , rp, u 1 , ... , Un-p and which satisfy the orthogonality relations n
L
rj(xi)vk(t(xi))
=
0,
1 ::;: j ::;: p, 1 ::;: k::;: n- p,
i=l
and
The hypothesis that the underlying regression function has the specified linear form may be tested by use of the statistic
(9.4)
Tn
=
1 max -k l
k
L j=l
n¢2 0'~;
,
where ~
¢j
1
n
= - LYivj(t(xi)),
n
j = 1, ... ,n-p.
i=l
If E1 , .•. , En are i.i.d. N(O, 0' 2 ), then (8- 2 j0' 2 )Tn has the distribution function Fos,n-p defined in Section 7.5.1. A valid large-sample test could then be based on Tn so long as 8- 2 is consistent for 0' 2 . Clearly, asymptotic validity could be maintained even if the condition of Gaussian errors was relaxed to some extent. Statistics from Chapter 7 other than Tn could also be used in carrying out such a lack-of-fit test. Clearly the power of the above procedure will depend on the function t. If the experimenter suspects that the response Y depends on a particular scalar function of x, then this function would be an obvious choice for t.
230
9. Extending the Scope of Application
On the other hand, a poor choice for t could lead to a test with very poor power. A simple example illustrates the problem. Let x = (x1, ... , xk), and suppose we wish to test
Ho : r(x) = Bx1,
V x.
Let us assume that, in fact, r has the form
r(x) = ex1
+ g(x),
where
g(x)
=
1, 0, { -1,
and A1, A 2 and As partition up the design space. For designs xi = (x 1i, ... , Xki), i = 1, ... , n, such that ~~= 1 g(xi)xli = 0, the least squares estimator
is consistent for e. Hence, to a good approximation the residuals = 1, ... , n, will be
ei
=
Yi - Bxli, i
Now suppose that t(·) was taken to be, say, t(x) = x 2 for all x. A NadarayaWatson smooth of the data (t(xi), ei), i = 1, ... , n, will have the form
g(u)
=
gn(u)
+ g(u),
where ~~= 1 g(xi)K((u- X2i)/h)
gn (U ) =
n
~i=1 K
((
u- X2i)/h
)
and g(u) is just the smooth applied to the noise E1 , .•• , En· Now, if the design is such that {(xli, X2i) : Xi E A1} = {(x1i, X2i) :Xi E As} (as, for example, in some lattice designs), then ~~= 1 g(xi)X1i = 0 and gn(u) = 0 for all u. A lack-of-fit test based on the smooth g(u) will thus have power no bigger than the test's size. Although the previous example may seem naive, one could obviously construct many more examples in which essentially the same problem arises. The point is that choice of t can be important when one tries to diagnose lack of fit by regressing residuals on a scalar function t(x). A common practice in diagnosing lack of fit is to regress residuals on the predicted values p
fi
=
L j=1
Bjrj(xi),
i
= 1, ... , n.
9.3. Multiple Regression
231
One may construct a statistic Tn in the manner described above except that t(x1), ... , t(xn) are replaced by Y1, ... , Yn· Since the "independent variable" is now random, more care will have to be taken to determine the sampling distribution of the statistic Tn. However, an obvious conjecture is that the properties of such a test would be similar to those obtained by using the previous procedure in this section with t(x) = ~}=1 ejrj(x), where ~}=1 ejrj(x) is a best (but unknown) approximation of r in the space spanned by r 1, ... , rp. A more comprehensive way of diagnosing lack of fit that still uses techniques from Chapters 7 and 8 is to carry out several tests as described above, using a different transformation of x for each test. One possibility would be to determine a few principle components, say t 1(x), ... , tm(x) (m ~ k), that account for most of the variation in x, and to then apply one of the tests in Chapter 7 k times, using the values of ij(x 1), ... , ij(Xn) as the design points in the jth test. In doing so, some adjustment of the significance levels for individual tests is advisable to maintain overall control of the type I error probability. Of course, there is no guarantee that lack of fit will manifest itself along principal components. Nonetheless, it seems natural to look for dependence between residuals and x's in those directions where the x's display the most variation. Another approach in multiple regression is to devise generalizations of tests from Chapter 7 that allow the independent variable to be multivariate. To illustrate one approach, suppose that we observe data (x1i, Xzi, Y,;), i = 1, ... , n, from the model
(9.5)
Y,;
=
r(x1i, Xzi)
+ Ei,
i
= 1, ... , n,
where E1, ... , En are i.i.d. N(O, 0" 2 ) random variables and (x1i, Xzi), i = 1, ... , n, are fixed points lying in the unit square. We shall describe a test of the null hypothesis that r is identical to a constant, although the methodology can easily be extended to test the fit of a linear model for r. Consider a linear model for r of the form m1-lmz-1 (9.6)
Tm 1,m2 (u,v) =
L L j=O
ajkcos(1fju)cos(Jrkv),
0 ~ u,v ~ 1.
k=O
If r is sufficiently smooth and the ajk's are Fourier coefficients, then Tm1,m 2 will converge pointwise to r as m 1 and m 2 tend to oo (Tolstov, 1962, p. 178). Let M 1 and M 2 be two positive integers such that M1M2 ~ n. We may then fit M 1 Mz models of the form (9.6) where 1 ~ mi ~ Mi, i = 1, 2, using least squares to estimate the ajk 's in each model. We denote the sum of squared errors corresponding to model (9.6) by SSEm 1 ,m2 • As in Section 8.2.1 we may use a Gram-Schmidt process to decompose SSE1 ,1 ( = ~7= 1 (Y,; - Y)Z) into a sum of n - 1 independent random variables. When the null hypothesis is true, each of these random variables is equal in distribution to 0" 2 xi. Likewise we may construct statistics analogous to
I
I
232
9. Extending the Scope of Application
those in Chapter 7. Define, e.g.,
(9.7)
Bn
=
max (m1,m2)EAn
eff
(SSE1,1- SSEm 1 ,m 2 ) G- 2 (m1m2 - 1)
(9
where G- 2 is a consistent estimator of o- 2 and An = {(j, k) : 1 ::=; j ::=; M 1 , 1 ::=; k ::=; M 2 } - {(1, 1)}. Under our normality assumption the random variable (G- 2 / o- 2 ) Bn is equal in distribution to
(9.8)
max
ml-1 m2-l
1
1)
(m1,m2)EAn (m1m2-
(
~
t;
2
ZJk -
2 Z 00
Ur tir od
) ,
where the Zjk's are i.i.d. standard normal random variables. Since G- 2 is consistent, Bn's distribution will be well approximated for large n by that of (9.8), which in turn may be approximated using simulation. The approach just described can obviously be extended to cases with more than two independent variables. A problem encountered as the number of independent variables grows is the so-called curse of dimensionality. One result of this curse is that nonparametric curve estimators decrease in efficiency very quickly as the dimension of the design space grows (Stone, 1982). One would anticipate a similar decrease in the power of a completely nonparametric test as design dimension increases. This can be seen in the statistic Sn by noting that if J\!h = M 2 , then the number of Fourier coefficients corresponding to a particular x-direction is of order yn. If the x-dimension is d and we use an analogous test procedure, then we have only n 1 /d Fourier coefficients for each independent variable. Some of the loss in power due to large x-dimension can undoubtedly be recouped if the regression function is not completely arbitrary. This intuition is based on the fact that estimation efficiency increases when r has, for example, an additive (yet nonparametric) structure (Stone, 1985).
(0. ho en be il1l
dii (1!
fm co: b.v
th< At ¢J
n wt
9.4 Testing for Additivity At the end of Section 9.3 we pointed out that when the predictor variable is high dimensional, smoothing methods that impose no structure on the regression function become less attractive. In recent years a number of methods have been proposed that seek to circumvent the curse of dimensionality while retaining a nonparametric flavor. These methods include those based on additive models of the form
au
k
(9.9)
Y=Lri(xi)+E, i=l
where Y is the response variable, x 1 , ... , Xk are the predictor variables and E is an unobserved error term. The functions r1, ... , rk are unknown and assumed merely to be "smooth."
Us ac:
th: idE
9.4. Testing for Additivity
233
When model (9.9) holds, much can be gained in terms of estimation efficiency over the structureless model (9.10)
Under appropriate conditions, the regression function in (9.9) can be estimated with an error that tends to 0 at the same rate as in the case of a single predictor, i.e., k = 1 (Stone, 1985). Hence, nonparametric methods can effectively defeat the curse of dimensionality when the structure in (9.9) is justified. If nonparametric additive models are to be used, one should consider how well such a model fits the data. The usefulness of formal tests apparently increases with higher dimensions because departures from additivity become more difficult to assess by graphical means when the number of independent variables is large. The problem of testing the fit of an additive model has been addressed by Hastie and Tibshirani (1990), Barry (1993), Spiegelman and Wang (1994) and Eubank, H!frt, Simpson and Stefanski (1995). In addition, Gu (1992) has proposed diagnostics for assessing concurvity of nonparametric additive models. In this section we consider order selection tests of additivity as proposed by Eubank, Hart, Simpson and Stefanski (1995). For simplicity we consider the case of two independent variables and assume that model (9.10) holds. Assume that r is absolutely integrable, and define the Fourier coefficients rPjk by
11 1
rPjk =
1
cos(1rju)cos(1rkv)r(u,v)dudv,
j,k = 0,1, ....
Then under minimal smoothness conditions on r,
+ r1(u) + rz(v) + r1z(u, v),
r(u, v) = ¢oo
0 ~ u, v ~ 1,
where 00
r1(u)
=
2
L rPjo cos(1rju),
0 ~ u ~ 1,
j=1 00
rz(v) = 2
L ¢ok cos(1rkv),
0 ~ v ~ 1,
k=1 and 00
r1z(u, v) = 4
00
LL
rPjk cos(1rju) cos('lfku),
0 ~ u, v ~ 1.
j=1 k=1 Using terminology as in analysis of variance, r 1 and r 2 may be referred to as main effect functions and r 12 as the interaction function. The hypothesis that r has an additive structure is equivalent to the hypothesis that r 12 is identical to 0.
234
9. Extending the Scope of Application
A test for additivity may be done by a simple extension of the test at the end of Section 9.3. There we tested the hypothesis of no-effect, and the test was based on sums of squares of the form SSEo,o- SSEm 1,m 2,
where SSEm 1,m 2 is based on a model of the form (9.6). To test for additivity we will fit a "full" additive model, and then consider how much extra variation in the data is explained by adding interaction terms to this model. Define M1
rA(u, v)
=
¢oo
M2
+2L
cPjo cos(nju)
+2L
j=1
cPok cos(1rkv),
k=1
where Mi is an upper bound for the number of cosine terms corresponding to independent variable xi, i = 1, 2, and let SSEA be the error sum of squares obtained from fitting rA (by least squares) to the data. Now consider a model of the form A1
rA(u, v)
+4L
.A2
L
cPik cos(1rju) cos(7rkv),
j=1 k=1
and let SSE{Al,A2 , be the error sum of squares for the corresponding fitted model. Then we may base a test of additivity on
snA=
1
1(SSE A-
-;;z max \ \ U
A1,A2 /\1/\2
I) )
SSEA1,A2
where the max is taken over the set {(>. 1 , >. 2 ) : 1 :::; Ai :::; Li, i = 1, 2} and L1 and L2 are such that L1L 2 :::; n- (M1 + M 2 + 1). When Ho is true, &2 j u 2 has the same distribution as
S/:
1
max A1,A2
D1
A1
A2
LLZJk,
2 j= 1 k= 1
where the Zjk 's are i.i.d. standard normal random variables. This approach for testing additivity will be illustrated by example in Chapter 10.
9.5 Testing Homoscedasticity A common assumption in regression analysis is that of homoscedasticity, i.e., the variance of the response is constant for all values of the predictor variable. This assumption, in fact, has been made throughout most of this book. Part of a standard residual analysis is to plot residuals against x (when x is univariate) and/or against predicted values from the fitted regression model. Such a plot can sometimes reveal patterns in the residuals, a common one being that residual scatter increases with an increase in the
I 9.5. Testing Homoscedasticity
235
magnitude of predicted values. Subtle patterns that do not readily reveal themselves in a plot might be detectable with a formal test of homoscedasticity. Tests that have been proposed include a parametric test of Cook and Weisberg (1983) and the rank-correlation test (Yin and Carroll, 1990). The order selection test and the other tests of Chapter 7 can be used to test the hypothesis of homoscedasticity. Consider a generalization of our previous linear regression model having the form p
Yi =
L
ejrj(xi)
+ g(xi)Zi,
i
=
1, ... 'n,
j=1
where the xi's are fixed values in [0, 1], g is an unknown function and the Z/s are i.i.d. random variables with mean 0 and variance 1. The null hypothesis of interest is
Ho : g(x)
=
Vx
C
E
[0, 1].
Let ei = Yi - 2::;= 1 Bjrj(xi), i = 1, ... , n, be residuals from the fitted linear model. Under appropriate conditions on the functions Tj, it follows that
e;
=
2
g (xi)Z[ + ziop (
)n) + Op ( ~) .
To test for constancy of g (and hence for homoscedasticity), we may apply one of the tests from Chapter 7 to the squared residuals ei, ... , or to transformed residuals, such as
e;,,
7]i
=
leil~',
0
< "( :S
2,
or
~i
=
logleil,
i
=
1, ... ,n.
When the null hypothesis is true, these transformed residuals are approximately i.i.d. random variables and hence one anticipates that the asymptotic distribution theory of Chapter 7 will hold for statistics based on the 7]/s or Us. Liaw (1997) proves this for the case where an order selection test is applied to squared residuals. Define sample Fourier coefficients by l.-
'I'J-
~ ~ 2 (7rj(i~ ei cos n n
.5)) ,
j = 1, ... , n- 1,
i=1
and an estimator of
Ve,
the null variance of e;, by
Ve
=
~ I)e;- e2 ) 2 , n
i=1
where e2 = n- 1 2::~= 1 e[. Assuming that Zi has eight moments finite, Liaw (1997) shows that the statistic
Sn
=
1
k
2n¢]
max - "'"" - , 1:":k
r
r
'
236
9. Extending the Scope of Application
converges in distribution to a random variable having the distribution function F 0 s defined in ( 7.11) . Liaw (1997) investigates power properties of the test based on Sn. It is shown that the order selection test can detect local alternatives of the following form:
gn(xi)
=
C
+
1 Vngo(xi),
n
1, 2, ... '
where g0 is a fixed function. Liaw also performs simulations showing that the power of the order selection test is comparable to that of the rankcorrelation and Cook and Weisberg tests under conditions well suited to the latter two tests. Under other conditions the power of the order selection test can be much higher than that of the other two. One can also test for homoscedasticity of multiple regression models using order selection ideas. To do so one may apply the techniques discussed in Section 9.3 to transformed residuals. Details of the asymptotic distribution theory are still to be worked out, but even were this knowledge available one would probably still prefer using some form of bootstrap approximation.
9.6 Comparing Curves A problem that arises frequently in practice is that of comparing two or more curves. The curves may correspond to different treatments or different experimental conditions. A particularly prevalent scenario is one in which responses are measured over time under different sets of conditions. For example, the acid content in rain samples might be measured daily at two different locations, and it may be of interest to test whether or not a discrepancy between measurements at the two sites is consistent with a purely chance model. In at least one situation the order selection tests of Chapter 7 can be used without modification in testing the equality of two regression curves. Suppose that data (Y1 , ZI), ... , (Yn, Zn) are obtained from the model (9.11) where it is assumed that x 1 < · · · < Xn are fixed design points and ((1, TJl), ... , ((n, TJn) are i.i.d. pairs. Of interest is testing the null hypothesis
Ho : r1(x) Define the observations
= r2(x) V x in some interval 8i, i = 1, ... , n, by
These observations follow the familiar model f,
''
[a, b].
9.6. Comparing Curves
=
237
where r r1 - rz and Ei = (i - 'T)i, i = 1, ... , n. The null hypothesis of interest is equivalent to H 0 : r 0. We may now apply any of the no-effect tests from Chapter 7 to the data 81 , ... , 8n. If such a test is significant, then H 0 : r = 0 is rejected. In fact, one would conclude not just that r1 and rz are different, but that r 1 - r 2 is nonconstant. Note, however, that a no-effect test has no power against simple shift alternatives, i.e., ones in which r1 rz + C for some nonzero constant C. To detect a shift it would be reasonable to use a t-statistic of the form
=
=
5
t=--
So/v'n'
where 5 and s 0 are the sample mean and standard deviation of the 8i 's. This t-statistic will also have good power against many nonconstant alternatives to Ho : r 1 - r 2 0. However, it is still desirable to apply the no-effect test since it may confirm that the difference between the two curves is not just a simple shift. Hall and Hart (1990) proposed a bootstrap test of curve equality for the model (9.11). Their test statistic is based on a smooth of the 8/s and is analogous to the statistic Rn,h discussed in Section 6.3.3. The distribution theory of Hall and Hart (1990) assumes that the bandwidth of their smoother remains fixed as n tends to oo. In a simulation study, however, they use cross-validation to choose the bandwidth and find that a bootstrap test that employs cross-validation does a good job of ensuring test validity. Hall and Hart (1990) also propose a method for testing the equality of several curves simultaneously. Model (9.11) is not completely general in the sense that it assumes that responses from both curves are available at every design point. In many situations some or all of the design points corresponding to one curve are different from those of the other curve. Young and Bowman (1995) propose a method for testing curve equality in such a situation. Suppose that the data have the structure
=
and
= rz(Xzj) + 'T}j,
Zj
j
= 1, ... , nz,
where (I, ... , (n 1 , 7)1, ... , 'T)n 2 are mutually independent error terms. Young and Bowman (1995) propose that the hypothesis H 0 : r 1 r 2 be tested using a statistic of the form
=
2
ni
LL
(h(xiji h)- fz(Xiji h))
2 ,
i=l j=l
where f 1 (·; h) and f 2 (·; h) are Gasser-Muller smooths based on the Yj's and Zj 's, respectively. It is important that the same bandwidth be used
238
9. Extending the Scope of Application
for each smoother. Doing so ensures that, under the null hypothesis r 1 ~ r 2 , the first order bias of f\(x; h) is the same as that of rz(x; h). This in turn implies that the asymptotic distribution of the test statistic is free of the unknown functions r 1 and r 2 even when the bandwidth h is chosen optimally. This same desirable property holds if local linear smooths with a common bandwidth are used instead of Gasser-Muller smooths, but it does not hold in general if Nadaraya-Watson type kernel smoothers are used. Unlike Gasser-Muller and local linear estimators, the asymptotic bias of a Nadaraya-Watson estimator depends on the density generating the design points. Hence, only if the design densities for the two regressions are the same will the bias of the two Nadaraya-Watson smoothers be the same. The method of Young and Bowman requires a choice of the smoothing parameter h. They propose using the significance trace method discussed in Section 6.4.2. An alternative approach would be to use a data-driven method as in Chapter 4 to choose the common bandwidth. Another possibility is to apply one of the tests from Chapter 7 to the "pseudo-data" 8ij, j = 1, ... , ni, i = 1, 2, defined by 8ij = h(Xij;
9. 7 Goodness of Fit
tot
s
anc
(9.:
uni Foi
h:n
ser
Tl
tb tb h:
\\
f<
Suppose X 1 , ... , Xn are independent and identically distributed observations from an unknown distribution F. The goodness-of-fit problem consists of using the data X 1 , ... , Xn to test the null hypothesis that F is a member of some specified class of distributions. A particularly simple case is where the null hypothesis has the form
Ho : F(x)
US8
h)- fz(Xij; h),
where h is a very small bandwidth that essentially amounts to interpolation of the data. Distribution theory for the last two approaches is yet to be worked out. It would be worthwhile to compare these two approaches and the Young and Bowman test in terms of validity and power. In addition to the aforementioned papers, the literature on comparison of curves using smoothing ideas includes the following articles: Hardle and Marron (1990), King, Hart and Wehrly (1991) and Delgado (1993).
(9.12)
The null ThE
=
Fa(x),
V x,
in which F 0 is some known function. This is usually referrE;Jd to as the completely specified null case. A case occurring more commonly in practice is where one hypothesizes that F is in a specified parametric family. For example, one may wish to test whether Xi has a Gaussian distribution with unspecified mean and variance. A huge literature exists on the goodnessof-fit problem; see D'Agostino and Stephens (1986) for a review. Kim (1992), Ledwina (1994) and Kallenberg and Ledwina (1995) have studied the use of order selection type tests in the goodness-of-fit problem.
/ 9. 7. Goodness of Fit
239
The first two of these references applies to the case of a completely specified null hypothesis, whereas the third considers the incompletely specified case. The articles of Ledwina (1994) and Kallenberg and Ledwina (1995) make use of a BIC-type selection criterion, while Kim (1992) uses a test analogous to the order selection test of Section 7.3. Suppose that X 1 , ... , Xn are absolutely continuous random variables and that we wish to test a completely specified null hypothesis of the form (9.12). This hypothesis is true if and only if the distribution of Fo(Xl) is uniform on the interval (0, 1). Define the transformed observations Ui = Fo(Xi), i = 1, ... , n. Let g denote the density of U1 , whether or not the null hypothesis is true, and assume that g may be represented by the Fourier series 00
g(x) = 1 + 2
I:: cfj cos(njx),
0 < x < 1,
j=l
where
1 1
=
cfj
j = 1, 2, ....
g(x) cos(njx) dx,
The null hypothesis is equivalent to H 0 : cfj = 0, j = 1, 2, ... , which has the same form as the no-effect hypothesis of Chapter 7. Given estimates of the Fourier coefficients, we may thus use an order selection test to test the hypothesis (9.12). The most obvious estimators of the cfj 's are A
1
cfj = -
n
n
L
cos(njUi),
j
1, 2, ... ,
=
i=l
which are unbiased. Under the null hypothesis these estimators have the following properties:
E(J;i)=O, and A
A
Cov(¢j, cfk)
j=1,2, ... ,
{0
=
1f(2n),
j/=k
j
=
k.
This suggests that the test statistic Sn
=
1 m max - '"""' 2n¢J
l<m
-
m ~
j=l
has (under the null) the limit distribution Fos of Chapter 7. Kim (1992), in fact, has established this result rigorously. In analogy to our discussion in Section 7.7.4, the order selection tests of Kim (1992) and Ledwina (1994) may be regarded as data-driven smooth tests. Rayner and Best (1989) tout smooth goodness-of-fit tests as being
240
9. Extending the Scope of Application
generally more powerful than the more traditional Kolmogorov-Smirnov and Cramer-von Mises tests. The reason for this phenomenon is the same as we have mentioned several times before: The traditional tests downweight higher order Fourier coefficients so severely that they have relatively good power only against very low frequency alternatives.
9.8 Tests for White Noise Let X 1, ... , Xn be data values sampled at the (evenly spaced) time points 1, ... , n, respectively. A model for such data that has proven to be extremely useful in practice (see, e.g., Newton, 1988) is that of a covariance stationary time series. By a covariance stationary series we mean a stochastic process {Xt : t = 1, 2, ... } that satisfies
E(Xt)
= J.L,
t
=
1, 2, ... ,
and (9.13)
Cov(Xs, Xt)
= '"Y([s- t[),
s, t
=
1, 2, ... ,
for some function 'I· Condition (9.13) implies that the covariance between two X's depends only on how far apart in time they are. Let {Xt} be a covariance stationary time series with autocovariance function 'I· The autocorrelation function of {Xt} is
p(j)
'"Y(j) '1(0) '
j
=
0, 1, ... '
and the spectrumS of {Xt} is defined by 00
+ 2 L '"Y(j) cos(2njw),
S(w) = '1(0)
0 :::; w :::; .5,
j=l
where it is assumed that 2:}: 1 1 2(j) < oo. The covariance function'/ is by necessity positive definite, and so S is a non-negative function. Note also that
~o·
5
S(w) dw = '1(0) = Var(Xt)·
The spectrum indicates how variance among the data is distributed across frequencies. If 8 has a peak at frequency W = Wo, then X1, X2, ... will tend to exhibit an irregular cyclical behavior with a period of approximately 1/wo.
9.8. Tests for White Noise
241
The classic estimator of the covariance function "'( is n-j
:Y(J)
=
~ L)xi- X)(Xi+j- X), n
=
j
=
o, 1, ... , n- 1
i=1
j:::: n,
0,
which is a positive definite function (Newton, 1988, p. 165). The corresponding estimator of the autocorrelation function is
'(')
p J =
:Y(J) :Y(O) '
.
J
01
' ' ... 'n -
1
.
In analogy to our definition of the spectrum, the sample spectrum is defined to be n-1
S(w) = :Y(O)
+2L
"'y(j) cos(2njw),
0 :::; w :::; .5.
j=1
It turns out that at the so-called natural frequencies, i.e., at Wk = k/n, k = 0, 1, ... , [n/2], the sample spectrum is equal to the periodogram h,
where 2
k
=
0, 1, ... , [n/2].
Unfortunately, the sample spectrum is not a consistent estimator of the spectrum. Estimating S by S is akin to estimating a regression function by a series estimate with as many terms as there are data values. The result, of course, is an extremely rough estimate. Consistent estimators of S may be constructed by using series estimators of the form n-1
(9.14)
S(w; W>-.)
=
:Y(O)
+2L
W>-.(J):Y(J) cos(2njw),
0:::; w :::; .5,
j=1
where {W>-. (j)} is a collection of weights that play exactly the same role as in the regression estimators of Chapter 2. The quantity A is a smoothing parameter that controls how quickly W>-.(J) tends to 0 as j increases. The estimator (9.14) may also be expressed as a kernel smooth of the sample spectrum, i.e., '
1
112
S(w; W>-.)
=
S(w)Kn(w, u; W>-.) du,
where Kn(w, u; W>-.) is defined as in Section 2.4. A huge literature exists on spectral estimators of the form (9.14); see Priestley (1981) and Newton (1988) for more discussion and references.
242
9. Extending the Scope of Application
A fundamental problem in time series analysis is establishing that the observed data are indeed correlated across time. In the parlance of signal processing, uncorrelated data are referred to as "white noise." The hypothesis of white noise is equivalent to
l'(j) = 0,
j
= 1, 2, ... '
which in turn is equivalent to the spectrum S being constant on [0, .5]. Existing omnibus tests for white noise include Bartlett's test (Bartlett, 1955) and the portmanteau test of Box and Pierce (1970). Alternative tests of white noise may be constructed after noting an isomorphism between the regression problem of Chapter 7 and the spectral analysis problem. The sample spectrum and S(w; w;..) are analogous to regression data Y1 , ... , Yn and a Fourier series smooth, respectively. Furthermore, the white noise hypothesis is analogous to the no-effect hypothesis of Chapter 7. The former hypothesis may be tested using statistics as in Chapter 7 with 2¢]/&2 replaced by p2 (j), j = 1, ... , n- 1. The isomorphism of the two problems is even more compelling upon realizing that, under the white noise hypothesis, y'np(1), ... , y'np(m) (m fixed) are approximately independent and identically distributed N(O, 1) random variables (Priestley, 1981, p. 333). The portmanteau test of white noise (also called the Q test) is based on the statistic
p2(j)
m
Q(m) = (n
+ 2) L
J=l
(1 _ J'/n ) .
Q(m) is analogous to the Neyman smooth statistic discussed in Chapter 5. Indeed, the limiting distribution of Q(m) is x~ when the data are white noise. Newton (1988) notes that a difficulty in using the Q test is the need to choose m. To circumvent this problem, one may use a data-driven version of Q(m) analogous to the statistic TM in Section 7.6.1. Define ~2(
m
Q(m)
=
(n
J n
J=l
where
')
+ 2) L (/-~I ) ,
m is the maximizer of ~
R(m)
{ 0, =
"Lr;
1
np2 (j)- 2m,
m m
= =
0
1, ... , n- 1.
Under appropriate regularity conditions, Q(m) will converge in distribution to the random variable T defined in Theorem 7.3. The order selection criterion R( m) is one means of choosing the order m of a spectrum estimate Sm, where m
Sm(w)
=
')t(O)
+2L j=l
')t(j) cos(2njw).
9.8. Tests for White Noise
243
Such a spectrum estimate corresponds to approximating the observed time series by a moving average process. The series X 1 , X 2 , ... is said to be a moving average process of order q if q
Xt
=
I: ejZt-j,
t = 1, 2, ... ,
j=O
where e 0 = 1, eq f. 0 and {Zt : t = 0, ±1, ±2, ... } is a white noise sequence. Such a process satisfies p(j) = 0 for j > q, and hence has a spectrum of the form q
S(w) = 1(0)
+ 2L
1(j) cos(2njw).
j=l
A reasonable spectrum estimate for a qth order moving average process would be Sq. Estimation of S by Sm raises the question of whether R is the most appropriate order selection criterion for approximating a covariance stationary process by a moving average. Suppose we define an optimal m to be that which minimizes
(12
(9.15)
E
lo
2
(sm(w)- S(w))
dw.
The minimizer of this mean integrated squared error is the same as the maximizer of
Priestley (1981) shows under general conditions that
and 00
1 (1+2 Var [p(j)] ~ :;;:
L p (j) 2
)
j=l
=
-1 Cp.
n
Using these approximations one may construct an approximately unbiased estimator of the risk criterion C(m). We have
E
[f j=l
(n
+ j)p2(j.)/Cp(1- J /n)
2]
~
_1_C(m) Cpn
=
D(m).
244
9. Extending the Scope of Application
A possible estimator for Cp is
Now define
zJn = (n + j)p
2
(j)/Cp, j = 1, ... 'n- 1, and
D(m) = 0, =
~ (Zfn - 2) L.....- (1 - jjn) '
m = 0
m - 1 2 -
' ' .. · '
n
-
1
·
j=l
Letting m be the maximizer of D(m), the null hypothesis of white noise may be tested using the data-driven portmanteau statistic Q (m). When the data really are white noise, R and D will be approximately the same, since Cp estimates 1 in that case. Under appropriate regularity conditions the null distribution of Q(m) will be asymptotically the same as that of Q(m). Any advantage of Q(m) will most likely appear under the alternative hypothesis, since then m will tend to better estimate the minimizer of (9.15) than will m. Another well-known test of white noise is Bartlett's test, which may be considered as a time series analog of the Kolmogorov-Smirnov test. Bartlett's test rejects the white noise hypothesis for large values of the statistic
where N
=
[n/2]
+ 1 and
The idea behind the test is that the integrated spectrum of a covariance stationary time series is identical to w if and only if the series is white noise. Bartlett's test is analogous to a test of curve equality proposed by Delgado (1993). Because of the isomorphism of the no-effect and white noise testing problems, Bartlett's test has the same power limitations as Delgado's test. Specifically, when the observed time series is such that p(1) and p(2) are small relative to autocorrelations at higher lags, then the power of Bartlett's test will be considerably lower than that of tests based on Q(m) or Q(m). Order selection criteria of a different sort than R and D may also be used to test the white noise hypothesis. The Q test implicitly approximates the observed series by a moving average process. Suppose instead that the series is approximated by a stationary autoregressive process of order p, which
I 9.8. Tests for White Noise
245
has the form p
Xt =
L
c/JiXt-i
+ Zt, t =
0, ±1, ±2, ... '
i=l
where ¢1, ... , c/Jp are constants such that the zeros of 1 - ¢1z - ¢zz 2 · · · - c/JpzP are outside the unit circle in the complex plane, and { Zt} is a white noise process. (The Zt 's are sometimes referred to as innovations.) An autoregressive process is white noise if and only if its order is 0. Hence, given a data-driven method for selecting the order of the process, it seems sensible to reject the white noise hypothesis if and only if the selected order is greater than 0. This is precisely what Parzen (1977) proposed in conjunction with his criterion autoregressive transfer (CAT) function that selects autoregressive order. In fact, it appears that Parzen was the first person in any area of statistics to propose an order selection criterion as a test of model fit. Before discussing CAT we introduce another popular criterion for selecting autoregressive order, namely Akaike's Information Criterion (AIC) (1974), defined by AIC(k) =log & 2 (k)
+
2 k, n
k = 0, 1, ... ,
where 8' 2 (k) is the Yule-Walker estimate (Newton, 1988) of the innovations variance for a kth order autoregressive (AR) model. One may perform a test of white noise with AIC just as with the CAT function. The pioneering paper of Shibata (1976) on AIC implies that when the process is actually white noise, the minimizer of AIC(k) occurs at 0 about 71% of the time. Hence, a white noise test based on AIC has type I error probability of about .29 for large n. It is no accident that this probability matches that discussed in Section 7.3. Examining the work of Shibata (1976) reveals that AIC and the MISE criterion Jm are probabilistically isomorphic, at least in an asymptotic sense. Parzen's original version of CAT has the form CAT(k) = 1-
(n- k) n
8'~
a-z(k)
k
+ ;;: ,
k = 0, 1, ... ,
where 8'~ is a model-free estimate of innovations variance. The estimated AR order is taken to be the value of k that minimizes CAT(k). A test of white noise based on this version of CAT turns out also to have an asymptotic level of .29. Again this is no mere coincidence, since AIC and CAT turn out to be asymptotically equivalent (Bhansali, 1986a)o Bhansali (1986b) proposed the CAT a criterion, defined by 8' 2 CATa(k) = 1- a-z(k)
ak
+ --:;;
k
=
0, 1,
0
0
0'
246
9. Extending the Scope of Application
for some constant a > 1. Bhansali (1986a) has shown that his CAT 2 criterion is asymptotically equivalent to CAT. Choices of a other than 2 allow one to place more or less importance on overfitting than does CAT. There is an obvious resemblance of CATa to the regression criterion J(m; !') of Section 7.3. In analogy to the regression setting, one may use CAT a to perform a test of white noise. In fact, the asymptotic percentiles from Table 7.1 can be used in CAT a exactly as they are in J(m; 1') to produce a valid large sample test of white noise. For example, for a test with type I error probability equal to .05, one should use CAT4.1 8 . Parzen proposed the following modified CAT function: CAT*(k)
=
-(1
=
;
1
+ ~) 0' 2~0)
f;
,
1
k
k = 0
1
0' 2 (j) - 0' 2 (k),
k
=
1, 2, ... , Kn,
where 0' 2(k) = n(n- k)- 1 8- 2(k), k = 0, 1, ... , n- 1, and Kn is such that K;,/n --+ 0 as n --+ oo. Parzen proposes a test that rejects the white noise hypothesis if and only if the minimizer of CAT* is greater than 0. We shall call this white noise test Parzen's test. The "natural" definition of CAT*(O) is -1 I 0' 2(0); Par zen proposed the modification - ( 1 + n -l) I 0' 2(0) in order to decrease the significance level of the white noise test from .29. A simulation study in Newton (1988, p. 277) suggests that the level of Parzen's test is about .17. We shall provide a theoretical justification for this probability and also show that Parzen's test is closely related to the regression test proposed in Section 7.6.3. We may write 1 + 8-~CAT*(k) = CAT2(k)
+ Rn,k,
k 2: 1,
where k (
Rn,k = n
8-
2
(j-2~)
)
-
1
-
+
k(k
+ 1)
n2
~n ~ (1 - j__n ) ~
(
J=l
A
a-~' - 1)
1J'2(J)
0
If the white noise hypothesis is true, 8-~ and 8- 2(j), j = 0, ... , Kn, all estimate Var(Xt), and it follows that the terms Rn,k are negligible in comparison to CAT2(k). (This can be shown rigorously using arguments as in Bhansali, 1986a.) So, when the data are white noise, the properties of Parzen's test are asymptotically the same as for the analogous test based on a CAT2 function, where CAT2(k) = CAT2(k), k 2: 1, and
CAT;(o) = 1- (1
+
~) a-~fu)..
I 9.8. Tests for White Noise
247
The significance level of the CAT2 white noise test is
1- P
([i, {
CAT;(o)
~ CAT,(k)}) ~
1-P(Knn{&2(k)>1-(~) 8'2(0) +R
n,k
k=l
})
'
n
where
In the white noise case the terms the limiting level of the test is (9.16)
1-
nl~ P
Dl
Kn {
(
Rn,k
are asymptotically negligible, and
&2(k)
( 2k+1)}) .
log 8'2 (0) 2 log 1- - n -
The last expression is quite illuminating in that it shows Parzen's test to be asymptotically equivalent to a white noise test that rejects H 0 when the value of the AIC criterion at its minimum is less than log 8' 2 (0) -1/n. This suggests that Parzen's test is analogous to the test in Section 7.6.3 based on the maximum of an estimated risk criterion. Arguing as in Shibata (1976),
&2 (k)
-nlog --;:-z-() (J 0
L k
= n
j=l
A
2' + op(1),
¢1 (J)
k 2 1,
where, for any fixed K, fo¢1(1), ... , fo¢K(K) have a limiting multivariate normal distribution with mean vector 0 and identity covariance matrix. This fact and (9.16) imply that the limiting level of Parzen's test is
in which Z 1, Z 2 , .•• are i.i.d. standard normal random variables. The last expression may be written
1- (r~ trzJ - ~ 1) . P
2)
I:J=
Note that the random variable maxk~l 1(ZJ- 2) is precisely the same as the one appearing in Theorem 7.6, which confirms that Parzen's test is
248
9. Extending the Scope of Application TABLE 9.1. Approximate Values of qa for Parzen's Test
a qa
.29 0
.18 1
.10 2.50
.05 4.23
.01 7.87
The estimated values of qa were obtained from 10,000 replications of the process 1 (ZJ - 2), k = 1, ... , 50, where Z1, ... , Z5o are i.i.d. N(O, 1).
2:,;=
analogous to the regression test based on the maximum of an estimated risk. By means of simulation it has been confirmed that (9.17)
1- P (
'Pff
t,(z]- 1) "'.1s. 2),;
The argument leading to (9 .17) also shows that if CAT* (0) is defined to be -(1 + q/n)/& 2 (0) for a constant q, then the limiting level of the corresponding white noise test is
1 - P ( T;i'
t,(z] -
2) ,;
q) .
One can thus obtain any desired level of significance a by using a version of Parzen's test in which CAT*(O) is -(1 + qa/n)/& 2 (0) for an appropriate qa. Simulation was used to obtain approximate values of qa for large-sample tests of various sizes; see Table 9.1. It is worth noting that the values of qa in Table 9.1 are also valid large-sample percentiles for the regression test of Section 7.6.3.
9. 9 Time Series Trend Detection In time series analysis it is often of interest to test whether or not a series of observations have a common mean. One setting where such a test is desirable is in quality control applications. Consider a series of observations X1, ... , Xn made at evenly spaced time points and let /Li = E(Xi), i = 1, ... 'n. Furthermore, let us assume that the process Yi = xi - f.Li, i = 1, 2, ... , is covariance stationary, as defined in the previous section. The hypothesis Ho : f.Li = p,, i = 1, 2, ... , is simply a no-effect hypothesis as in Chapter 7. What makes the time series setting different is that the
F
sc
9.9. Time Series Trend Detection
249 '
covariance structure of the data must be taken into account in order to properly test the hypothesis of constant means. Accounting for the covariance is by no means a minor problem. Indeed, the possibility of covariance fundamentally changes the problem of detecting nonconstancy of the mean. To see part of the difficulty, consider the two data sets in Figure 9.1, which seem to display similar characteristics. The data in the top panel were generated from the autoregressive model
xi where E(Xo)
= .95Xi-1
+ Ei,
i = 1, ... , 5o,
0 and E(Ei) = 0, i
.. .. 0
.... . . . .. ..
1, ... , 50, implying that each
.... .. .. ..
T'"
'
0.0
0.2
0.4
0.6
0.8
1.0
X
1.0
...... . ..
.. .. .. ..
(f)
>-
C\1
0 T'"
'
0.0
0.2
0.4
0.6
0.8
1.0
X
FIGURE 9.1. Simulated Data from Different Models. In the top graph are data generated from an autoregressive process, while the other data were generated from a regression model with i.i.d. errors. In each graph the line is the least squares line for the corresponding data.
I
I
'i' 250
9. Extending the Scope of Application
observation has mean 0. The second data set was generated from
xi = 3.5- 2.6(i- 1/2)/50 + Ei,
i
= 1, ... , 50,
where the E/s are i.i.d. as N(O, 1). The apparent downward trend in each data set is due to different causes. The "trend" in the first data set is anomalous in the sense that it is induced by correlation; if the data set were larger the observations would eventually begin to drift upward. The downward drift in the second data set is "real" in that it is due to the deterministic factor E(Xi) = 3.5- 2.6(i- 1/2)/50. At the very least this example shows that it is important to recognize the possibility of correlation, or else one could erroneously ascribe structure in the data to a nonconstant mean. More fundamentally, the example suggests the possibility of a formal identifiability problem in which it would be impossible to differentiate between two disparate models on the basis of a data analysis alone. The more a priori knowledge one has of the underlying covariance structure, the more feasible it will be to devise a valid and consistent test of the constant mean hypothesis. Let us consider the case where {Xi - f.Li} follows a Gaussian first order autoregressive process. What would happen if we applied, say, a test as in Section 7.3 to test for a constant mean? For simplicity suppose we use a statistic of the form
=
Sn
1 m 2n¢] - "\""' - , - , 1<m<M m j=1 L....t o-2 max
where M is a constant and & 2 = I:~= 2 (Xi- Xi-1) 2/(2n). Let p be the first lag autocorrelation of the process {Xi - p,i}. It is not difficult to argue that when the P,i 's are all the same, the random vector
2n
'2
, 2
--;::-z (¢1, · · ·' ¢M) l7
converges in distribution to
1+p(2 2) (1-p)2 z1, ... ' z M , where Z 1 , ... , ZM are i.i.d. standard normal random variables. Therefore, Sn converges in distribution to the random variable
1+p (1- p)2 SM, where SM has the distribution function Fos,M defined in Section 7.5.1. Now, suppose we conduct a level-a test of constant mean as we would assuming the data were independent. The asymptotic level of this test under the first order autoregressive model is
1
-
F:
OS,M
((1-p)Zt(Y_) 1+p
9.9. Time Series Trend Detection
251
where 1- Fos,M(ta) = a. Obviously, then, this order selection test will be invalid when the data are positively correlated in that the level of the test will be larger than a when p > 0. In fact, the level can be made arbitrarily close to 1 by taking p sufficiently close to 1. This justifies our earlier comment to the effect that one may erroneously conclude that the means are nonconstant if correlation is ignored. When the first lag autocorrelation is negative, the test will be valid but less powerful than it is when the data are independent. The problem just outlined has an apparently easy fix. Suppose that pis a consistent estimator of p. The statistic
Tn =
(1 - p)2 1 + p Sn
has the same limit distribution as S M, implying that we may compare Tn with the independent-data critical values and still have an asymptotically valid test. The only problem with this proposal is obtaining an estimator of p that yields a powerful test. Probably the first estimator that comes to mind is
which is the estimator used when the process mean is assumed to be constant. This estimator is fine for ensuring asymptotic validity of the test but can be very detrimental from a power standpoint. The problem is that if the t-ti's vary slowly over time, the estimator p can be quite close to 1, causing Tn to be relatively small. An alternative estimator of p that addresses the problem inherent in Pl is
where Mi is a nonparametric estimator of f.-ti· Kim (1994) has studied the order selection test in the context of dependent data and in doing so considered various candidates for the Mi in p. A test utilizing p is problematic in that it requires the choice of a smoothing parameter for Mi· A main motivation for using an order selection test is that it circumvents any arbitrary choice of smoothing parameter. It would thus be desirable to have a test that avoided explicit estimation of the f.-ti's. This would be possible were a method available that simultaneously selects a smoothing parameter and estimates p. The time series crossvalidation (TSCV) criterion of Hart (1994) is one such method. TSCV was proposed as a means of selecting the bandwidth of a kernel smoother when the observed data are autocorrelated. It can be viewed as a generalization of the one-sided cross-validation method discussed in Section 4.2.5. To test for constancy of the means, we may proceed as in Section 8.2.3.
252
9. Extending the Scope of Application
The appropriate estimator of means is a Nadaraya-Watson smoother, since that smoother is a fiat line for large bandwidths. The bandwidth of the Nadaraya-Watson estimator is chosen by TSCV assuming that the process {Xi - J-Li} is first order autoregressive. The hypothesis of constant means is rejected if the data-driven bandwidth is sufficiently small. To obtain a valid test based on a data-driven bandwidth h, it is necessary to know the probability distribution of h when the means are constant and the data follow the prescribed process. One method of approximating the distribution of h is to use a bootstrap procedure. TSCV yields both a bandwidth h and an estimate p of p. One can show that under the null hypothesis of constant means the TSCV criterion is invariant to the scale of the xi's. It is thus sensible to generate bootstrap data as follows:
Xt = pXt_ 1
+ E7,
i
= 1, ... , n,
where X 0 = 0 and E]', ... , E~ is a random sample (with replacement) from the residuals
When H 0 is true, e 2 , ... , en will be the "correct" residuals. If the means are nonconstant, e 2 , ... , en will tend to have larger variance than will the error terms in the underlying model, but this is irrelevant since the TSCV criterion is invariant to scale when applied to the bootstrap data. So long as p is close to p we can expect this bootstrap scheme to work reasonably well. Having obtained a bootstrap sample one may compute h* from this sample in the same way h was computed from the original data. The sampling distribution of h can then be approximated by generating a large number of bootstrap samples and computing h* on each one. The assumption that {Xi- J-Li} follows a first order autoregressive process plays no important role in the above development. The test based on TSCV may be applied whenever {Xi - J-Li} is covariance stationary and has a prescribed parametric structure. Presumably one may show that this test is asymptotically valid and consistent under general conditions on the means J-Ll> . •. , f-Ln· A fascinating question is the following: What are the weakest conditions on the process {Xi - J-Li} and the means J-Li under which a valid and consistent test of equal means may be constructed? Suppose, for example, that we assume {Xi - J-Li} is an autoregressive process of unknown order p. Hart (1996) has proposed a generalization of TSCV that allows simultaneous estimation of bandwidth, p and autoregressive parameters. Is it possible to construct valid and consistent tests of constant means based on this generalization?
lC
So
10.: In tl actn Bah the l hom sis o can Fine mul1
10. 10. InS
can< of a date (0, l moe test usiu van the in 1 in S 1 thir
P-v
10 Some Examples
10.1 Introduction In this final chapter we make use of order selection tests in analyzing some actual sets of data. In Section 10.2 tests of linearity are performed on the Babinet data, which were encountered in Section 6.4.2. We also consider the problem of selecting a good model for these data, and perform a test of homoscedasticity. In Section 10.3 order selection tests are used in an analysis of hormone level spectra. Section 10.4 shows how the order selection test can enhance the scatter plots corresponding to a set of multivariate data. Finally, in Section 10.5, the order selection test is used to test whether a multiple regression model has an additive structure.
10.2 Babinet Data
10. 2.1 Testing for Linearity In Section 6.4.2 we used the Babinet data to illustrate the notion of significance trace. Here we use the same data as an example of checking linearity of a regression function via an order selection test. A scatter plot of the data is shown in Figure 10.1. (The x-variable was rescaled to the interval (0, 1).) There is some visual evidence that a straight line is not an adequate model. Does an order selection test agree with the visual impression? Two test statistics were computed using the methodology of Section 8.2.1: one using a cosine basis and the other a polynomial basis. The difference-based variance estimate 8-J 1 (Section 7.5.2) was used in each case. The values of the two test statistics and their associated large-sample P-values are given in Table 10.1. The P-values are simply 1- Fos(Sn), where Fos is defined in Section 7.3. Table 10.1 displays strong evidence that the regression function is something other than a straight line. It is interesting that, although both P-values are quite small, the one corresponding to the polynomial basis 253
254
10. Some Examples (0
C\.1
V
•
c i
•
"¢
C\.1 C\.1 C\.1
>-
j
•• • •
0 C\.1
•
CX)
••
.....
• .. • .. ..... ' .
(0
• 0.2
1
•
.....
0.0
r
0.6
0.4
'
0.8
•
.-
1.0
X FIGURE 10.1. Smooths ofBabinet Data. The solid and dotted lines are quadratic and second order cosine models, respectively. The dashed line is a local linear smooth chosen by OSCV.
is extremely small, 1.1 x 10- 6 . This is a hint that some bases will be more powerful than others in detecting departures from a particular parametric model. This point will be made even more dramatically in the next section. The P-values in Table 10.1 are based on a large-sample approximation. The sample size in this case is reasonably large, n = 355. Nonetheless, it is interesting to see what happens when the bootstrap is used to approximate a P-value. After fitting the least squares line, residuals e 1 , ... , e355 were obtained. A random sample e)', ... , e3 55 is drawn (with replacement) from these residuals and bootstrap data obtained as follows:
Yi* = ~o +~!Xi+ ei,
i
=
1, ... , 355,
TABLE 10.1. Values of Statistic Sn (Section 8.2.1) and Large-Sample P-Values
Basis Cosine Polynomial
P-value 10.06 23.68
.0015 .0000011
10.2. Babinet Data
255
!J1
where !Jo and are the least squares estimates from the original data. A cosine-based statistic S~ was then computed from (x1, Yt), ... , (xn, Y;) in exactly the same way Sn was computed from the original data. This process was repeated independently 10,000 times. A comparison of the resulting empirical distribution of s~ with the large-sample distribution Fos is shown in Figure 10.2. The two cdfs are only plotted for probabilities of at least .80, since the tail regions are of the most interest. The agreement between the two distributions is remarkable. The conclusion that the true regression curve is not simply a line appears to be well founded.
10.2.2 Model Selection Having rejected the hypothesis of linearity we turn our attention to obtaining a good estimate of the underlying regression function. One method of doing so is to use a kernel or local linear estimate. The dashed line in Figure 10.1 is a local linear smooth (with Epanechnikov kernel) whose smoothing parameter was chosen by the one-sided cross-validation method of Chapter 4. In agreement with the analysis in Section 10.2.1 the smooth shows evidence of nonlinearity. The order selection test is significant at level of significance a if and only if a particular risk criterion chooses a model order greater than 0. Which model(s) are preferred by such criteria for the Babinet data? Figure 10.3 provides plots of risk criteria of the form J(m; 1') for two values of')', 2 and 4.18. The criterion using')' = 2 corresponds to unbiased estimation of MASE and chooses a cosine series with over twenty terms. This exemplifies the undersmoothing that often occurs with MASE-based risk criteria 0 0
,....
......
i'····
g :.0
0
.0 0 ......
0
ro
CJ')
0.
0
co 0
2
4
8
6
10
12
X FIGURE
10.2. Fos and Bootstrap Distribution. The solid line is Fos.
256
10. Some Examples
........·.··....... .. . 0
l{)
0
... ... . ·...... '
"
"&i
:t::
(3
l{) (J)
..
0
C')
... ....
...
l{)
co
..
.·. ... ...... ... ........ "' .. ...
C')
0
20
40
60
80
100
0
20
40
60
80
100
c 0
·c
2
"5
0
C')
l{)
C\i
m FIGURE 10.3. Risk Criteria for the Babinet Data. The top and bottom graphs correspond to risk constants of 2 and 4.18, respectively.
and ordinary cross-validation (Chapter 4). The criterion that uses the constant 4.18 places a more severe penalty on overfitting and is consequently maximized at a much smaller m of 4. Another method of selecting a model is to use a Bayesian method as described in Section 7.6.5. In that section it was assumed that the design points were evenly spaced. For more general designs, suppose we start out with any orthogonal basis and use a Gram-Schmidt process to construct functions u 1 , ... , Un that are orthonormal in the sense of (8.2) and (8.3). The corresponding Fourier coefficients ih, ... , an are independently distributed as N(aj, (7 2 /n), j = 1, ... , n, when the errors in our model are i.i.d. N(O, (7 2 ). It follows that the likelihood function for a1 , ... , an will
10.2. Babinet Data
257
have the same structure as in Section 7.6.5, and we may construct posterior probabilities for model orders exactly as we did there. For each basis (cosines and polynomials), probabilities of the form (7.23) were computed, where ami was taken to be 1 for each i and m, and the quantity 'EZ:, 1 ¢7 was replaced by (2n)- 1 x (the sum of squares of regression for the order m model). The prior form was taken to be Rissanen's (1983) noninformative prior for integers. Define K(m) by K(m)
=
log;(m)
+ logd2(2.865064)],
m 2: 1,
where log;(m)
=
log 2 (m)
+ log2 log 2 (m) + ... ,
and the last sum includes only positive numbers and hence is finite. Our prior for m is then defined by n(O) = 1/2 and n(m)
=
TK(m),
m = 1, 2, ....
The Bayesian approach may be taken a step further by assigning a prior probability to each of the two bases. In this way we may compute posterior probabilities for model order and basis. Ascribing prior probability of 1/2 to each basis led to the posterior probabilities in Table 10.2. This analysis provides convincing evidence that the polynomial model is preferable to cosines. The posterior probability of the polynomial basis is .967, with the quadratic model alone having posterior probability of .938. The quadratic fit is shown in Figure 10.1 as the solid line, and the cosine series that maximized posterior probability is the dotted line. Interestingly, the local linear smooth chosen by OSCV is closer overall to the quadratic than it is to the cosine series. This example shows the importance of choosing an appropriate set of basis functions. It is also reassuring in that it shows how a Bayesian approach can be a very useful means of choosing between bases.
10. 2. 3 Residual Analysis A residual analysis is a standard part of any regression analysis. Patterns in the residuals may indicate that some of the model assumptions are not justified. An assumption underlying the analysis in the previous two sections
TABLE 10.2. Posterior Probabilities of Models Fitted to Babinet Data
m
0
1
2
3
4
Cosine basis Polynomial basis
.000 .000
.000 .003
.025 .938
.005 .025
.003 .001
258
10. Some Examples
•
•
I
•
0
•
• •
•
•
me
•
:'
•
•••
. .. ". •
of lat
• •
•
pb: ph:
• • I
thE
•.
A~:
e.g
hol
•
• 0.0
(1C 0.2
0.6
0.4
0.8
1.0
X
wh dE·t
FIGURE 10.4. Absolute Value of Residuals from Quadratic Fitted to Babinet Data. The curve is a first order cosine series, which was chosen by the risk criterion with penalty constant 4.18.
is that of homoscedasticity. To test for unequal variances in the Babinet data, the order selection test using a cosine basis was applied to the absolute value of residuals. The model fitted for the mean was a quadratic (see Section 10.2.2). A test as in Section 9.5 was conducted, with the estimate V;, in the denominator of Sn being the sample variance of le1l, ... , lenl· The value of Sn was 9.34, for which a large sample P-value is .0023. So, there is evidence that the variability of the errors is not constant over the design space. A plot of the absolute residuals is shown in Figure 10.4. This plot suggests that if in fact the data are heteroscedastic, the change in variance is not too dramatic. One would hope that the analyses in Sections 10.2.1 and 10.2.2 are not compromised by this level of heteroscedasticity.
10.3 Comparing Spectra Our second illustration is one that was first reported in Kuchibhatla and Hart (1996). The data used in the analysis come from Diggle (1990) and are leutenizing hormone (LH) levels from a woman's blood samples. Three groups of blood samples were used: one group from the early follicular phase of a woman's menstrual cycle and two from the late follicular phase of two successive cycles. During each of the three time periods the LH level was
te:J by dai l ear thE in 1 to qw
th2 plo Sill'
lev< cor Fig !
apl m per san boc
TM re~J
alg< Re<
of
J
10.3. Comparing Spectra
259
measured every ten minutes for eight hours, resulting in three time series of length 48 each. Here we wish to compare the spectra of the early and late follicular phases. The analysis consists of comparing the periodogram of the early follicular phase with the average of the two periodograms from the late follicular phase data. Let IE(wj), j = 1, ... , 24, and h(wj), j = 1, ... , 24, denote the early and late periodograms, respectively, where Wj denotes frequency. Assuming each observed time series to be stationary, it is well-known (see, e.g., Priestley, 1981) that to a good approximation the following model holds:
(10.1)
where S E and S L are the spectra of the two series and the ryj's are independent and identically distributed random variables with finite variance. To test the hypothesis that the log-spectra for the early and late series differ by more than a constant, we may thus apply an order selection test to the data Yj in (10.1). In the top two plots of Figure 10.5 we see the log-periodograms for the early and late follicular phases along with Rogosinski series estimates for the two spectra. The estimates were chosen using the risk criterion L(m; 'Y) in (7.20). The penalty constant 'Y was chosen to be 4.22, which corresponds to a .05 level test of the hypothesis that a spectrum is constant over frequency. The fact that the top two estimates are nonconstant is evidence that neither the late nor early follicular series are white noise. The bottom plot in Figure 10.5 shows the Yj data plotted with a .05 level Rogosinski smooth. Nonconstancy of the bottom estimate indicates that, at the .05 level of significance, we may reject the hypothesis that the two spectra are constant multiples of each other. An appealing aspect of the smooths in Figure 10.5 is that they serve both an inferential and a descriptive purpose. A data-driven Neyman smooth test based on TM (Section 7.6.1) was also applied to the data Yj in (10.1). The risk criterion Jm was maximized at m = 2, and the value of the test statistic TM was 16.13. To approximate percentiles of TM, we used a bootstrap algorithm in which 10,000 bootstrap samples were drawn from the residuals (Y1 - Y), ... , (Y23 - Y). For each bootstrap sample ei, ... , e2 3 , T"M was computed in exactly the same way TM was computed from Y1 , ... , Y23 . The resulting estimated P-value corresponding to TM = 16.13 was .035. A reassuring check on our bootstrap algorithm is that m* was 0 in 70.27% of the 10,000 bootstrap samples. Recall the theory of Section 7.3, which implies that the large sample value of P(m = 0) under H 0 is about .71.
..........------------260
10. Some Examples
.05 Level Rogosinski Estimate for Late Phase
0
E
'7
2 0Ql
c. (/) &,
"'
.Q
'? 1.0
0.5 frequency
.05 Level Rogosinski Estimate for Early Phase
0
'7
E
~
Ql
"'
c.
'f C)
.Q
'?
'1 1.0
0.5 frequency
.05 Level Rogosinski Estimate: Late vs. Early
0
.Q
~
C)
.Q
..
'7
"' '? 1.0
0.5
1.5
frequency
FIGURE
10.5. Analysis of Luteinizing Hormone in Blood Samples.
10.4. Testing for Association Among Several Pairs of Variables
261
Another, more classical, method of detecting a departure from equal spectra would be to use the statistic Un =
1 k max - "'(Yj - Y) t:s;k:s;n n L.... j=l
which is analogous to the Grenander and Rosenblatt (1957) statistic for testing the fit of a spectral model. This statistic has recently been analyzed in the context of comparing curves by Delgado (1993). Under the null hypothesis that log(SE) = log(SL), Yt, ... , Y23 are approximately i.i.d. log(F2 ,4 ) random variables (Diggle, 1990). The value of Un for our data is .300, leading to a P-value of approximately .205. The test proposed by Diggle (1990, p. 120) based on the maximium and minimum values of the ratio of periodogram ordinates is even more insensitive. Being liberal and using the smaller of two one-sided P-values gives P = .897! This is a good example of how the tests in Chapter 7 are often more sensitive to lack of fit than are many popular classical methods.
10.4 Testing for Association Among Several Pairs of Variables A common practice in the early stages of a multivariate data analysis is to consider scatter plots of all pairs of variables. Data-driven smoothers with an inferential interpretation are a nice way of supplementing such plots. This idea will be illustrated with a subset of the data used by Chernoff (1973) in his study of representing multivariate data by faces. The data were collected from 53 equally-spaced specimens taken from a 4500-foot core drilled from a Colorado mountainside. The data considered here are assays of four different minerals taken from each of the 53 specimens. Figure 10.6 shows scatter plots for each pair of minerals. A Rogosinski series estimate accompanies each plot. The horizontal scale in each case is in quantile units, so that the function being estimated is actually a regression quantile function r (Q(·)), as discussed in Section 7.1. The truncation point of each series estimate was chosen by a risk criterion with penalty constant 4.22. It follows that we may conclude at the .05 level of significance that two variables are associated if they have a nonconstant regression estimate. Note that there are two tests corresponding to any pair of variables, since the test of association is not invariant to which variable is taken to be the dependent one. A couple of caveats are in order concerning the preceding analysis. First of all, anytime several tests are performed simultaneously, one should be aware that the probability of making at least one type I error is greater than the type I error probability of any one test. One may guarantee the
262
10. Some Examples
~.. .. ...
0.4
0.0
0.0
0.8
0.8
0.4
0.0
0.8
0.4
mineral1
mineral1
mineral1
0
:g
~
~
"'.. ·:·
~·:•. · ..__.... ·.~ .. .-·.. .... ... 0.0
0.4
0 <0
..
"•
0
0.0
0.8
mineral2
0.8
0.4
·-.
0.0
•
0.8
0.4
mineral2
mineral2
...
.
~ ... 0
.- .. .,..
"' 0.4
0.0
0.8
~
·.
~.·.::· ·.
.. \
~ ;
0.0
0.4
minera14
0.8
0.0
mineral3
mineral3
..
0.4
0.0
.. ..
.....
·. ·:·
0.8
0
"'
..... 0.4
0.8
mineral3
... ...... ~ ..
-- : 0.0
0.4
mineral4
0.8
0.0
0.4
0.8
mineral4
10.6 .. 05 Level Tests of Association for Chernoff Data. The hypothesis of no association between two variables is rejected at the .05 level for any pair of variables with a nonconstant regression estimate. FIGURE
r
*( 10.5. Testing for Additivity
263
overall probability of a type I error to be no more than a by taking the significance level of each test to be a/M, where M is the number of tests done. A second caveat concerns the homoscedasticity assumption, a violation of which may adversely affect the level of an order selection test. Here we have used Chernoff's data for the sake of illustration without carefully checking for heteroscedasticity.
10.5 Testing for Additivity In Section 9.4 we described how additivity of a regression model could be checked by using an order selection test. Here that methodology is illustrated using a data set from Hastie and Tibshirani (1990). These data were obtained in a study of factors that affect diabetes in children (Sockett et al., 1987). The response variable Y is the logarithm of C-peptide concentration at diagnosis, and the two predictors are age and base deficit, a measure of acidity. In our analysis the two predictors were rescaled to be on the interval (0,1), with x 1 and x 2 denoting rescaled age and base deficit, respectively. The number of points in the data set is n = 43. For descriptive purposes, two-dimensional scatter plots along with local linear smooths are shown in Figure 10.7. We shall take as our full additive model one of the form 5
(10.2)
Y
=
5
+L
f3o
fJ1j cos(1rjx1)
+L
j=l
'0
. ..
~
a Ql
B
Cl
cos(1rkx2)
+ E,
~
~
(i)
fJ2k
k=l
'
(i)
..
'0
...·
~
a Ql
c.
o;
'
..Q
..Q
"!
"!
0.0
0.4 x1
FIGURE
0.8
0.0
0.4
0.8 x2
10.7. Diabetes Data and Local Linear Smooths.
I
···"'f 264
10. Some Examples
which uses up only 11 degrees of freedom, leaving us 32 degrees of freedom with which to assess possible interactions of x 1 and x 2. Let S S Enull denote the error sum of squares corresponding to a least squares estimate of model (10.2), and let SSEr,s be the error sum of squares for the model 5
Y = f3o
5
+L
f3tj cos(njx1)
+L
j=l T
+L
f32k cos(nkx2)
k=l
S
L "fjk cos(njx1) cos(nkx2) +
E,
j=l k=l
where r, s ;:::: 1. Our test statistic for the null hypothesis that the regression function has an additive form is (10.3)
max
l~r,s~5
(SSEnull- SSEr,s) rsu 2
The variance estimate u2 was taken to be SSEnuu/(43 - 11). If anything, this choice for u2 will tend to make the test less powerful, since an interaction between x 1 and x 2 will inflate the null variance estimate. The value of statistic (10.3) for the diabetes data is 5.594. Is this significant evidence of an interaction between x 1 and x 2? Under the null hypothesis and assuming that the errors in our model are i.i.d. Gaussian, the distribution of (10.3) is well approximated by that of T
S
1 2 max - " ' " ' Z k, l
(10.4)
where the Zjk's are i.i.d. N(O, 1). The distribution of (10.4) was approximated by simulation. In 1000 independent replications of (10.4), only 22 statistics exceeded 5.594. In other words, our estimated P-value is .022, and so there does appear to be evidence of an interaction. Two estimated surfaces are shown in Figure 10.8, one of which has interaction terms and the other an additive structure. The estimated models were as follows:
Y = 1.521- .099 cos(nx 1) - .090 cos(2nxt) - 0.049 cos(3nx 1)
·+ .076 cos(2nx 2) + E and
Y
=
1.521- .022 cos(nx1) - .173 cos(2nxt) - 0.051 cos(3nxt)
+
.065 cos(2nx 2)
+
.119 cos(nxt) cos(nx 2) - .115 cos(2nx 1) cos(nx 2) +E.
Note that our analysis implicitly assumes that the effect of terms involving cos(njx1) and cos(njx 2), j ;:::: 6, is negligible. Otherwise there remains the possibility that the model is additive and our interaction model is trying
10.5. Testing for Additivity
Additive Model log(peptide)
1.7 1.65 1.6 1.55 1.5 1.45 1.4 1.35 1.3 1.25 1.2
0
Model with Interaction log(peptide)
1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.2
0
FIGURE 10.8. Models Fitted to Diabetes Data.
265
266
10. Some Examples
to approximate the higher order additive terms. One could still check for this possibility by fitting a higher order additive model to obtain SSEnull· Fitting as many as 20 additive terms would still leave 22 degrees of freedom for assessing possible interaction.
Appendix
Here we shall derive an error bound and a probability bound that were stated in Sections 7.4.1 and Section 7.7.2, respectively.
A.l Error in Approximation of Fos(t) In Section 7.4.1 we claimed that, for any t > 1,
.
(M
IFos(t)- F(t, M)l :::;
+ 1)-lBfHl (1 _ Bt)
,
where Bt = exp(-[(t- 1) -log t]/2). To prove this result, recall that F(t; M) = exp
{
and Fos(t)
=
L M
P(
>
2
"t)}
Xj . J
M
,
=
1,2, ... ,
J
j=l
F(t; oo). Defining SM(t) =
t
P(XJ .> jt) ' J
j=l
we have 00
(A.1)
IFos(t)- F(t; M)l = exp(-aM)
P(xJ > jt)
L
j
j=M+l
where aM is between SM(t) and S00 (t). Obviously then,
L oo
IFos(t)- F(t; M)l :::;
j=M+l
P(
2
> "t)
Xj. J
J
.
267
268
Appendix
The next step is to obtain an exponential bound for P(xJ > jt) using Markov's inequality. For any number a such that 0 < a < 1/2, we have
P(xJ > jt)
= P(exp(ax]) > exp(ajt))
::::; (1- 2a)-j/ 2 exp( -ajt). Using this inequality and (A.1), it follows that 00
L r
IFos(t) - F(t; M)l ::::;
(A.2)
1(1- 2a)-j/ 2 exp( -ajt)
j=M+1 00
I:
=
r
1
exp { -Jtt(a)},
j=M+l
where ft(a) = at+ (1/2) log(1- 2a). Obviously we want to choose a so that ft(a) > 0. Simple analysis shows that ft[(1- r 1 )/2] > 0 for any t > 1. Since we also have 0 < (1 r 1 )/2 < 1/2 fort> 1, (A.2) implies that 00
IFos(t) -F(t;M)I::::;
L r
1
exp{-(j/2)[(t-1) -logt]}
j=M+l 00
::::; (M
L
+ 1)- 1
e{
j=J\!!+1
(M
+ 1)-1efH1 (1 - Bt)
thus proving the result.
A.2 Bounds for the Distribution of Tcusum Here we derive bounds for P(Tcusum :2:: t), where Tcusum is defined in Section 7.7.2. We assume that model (7.1) holds with the errors i.i.d. N(O, CJ 2 ) and r(x) = 2¢cos(nm 0 x). Define ?. 2 = n¢ 2 /CJ 2 , and let Z 1 ,Z2 , .•• denote a sequence of i.i.d. standard normal random variables. For any n > m 0 , a lower bound is
where n-1
P1 = p
(
z2
L -f J j=1
·~·.·
A.2. Bounds for the Distribution of Tcusum
269
To obtain an upper bound on P(Tcusum 2: t), note that
P(Tcusum 2:
t) : :; P ( j#mo L ~l + (Zmo + ,j'jj? 2: t) , J mo 2
where the sum extends from 1 to oo, excluding m 0 . The very last probability may be written
where¢(·) denotes the standard normal density. It follows that
P(Tcusum 2: t) :::; { cp(u) du J(u+v'2>.)2>m6t
+ {
Ht(u)cp(u) du,
J(u+v'2>.)2<m6t
where
2
Ht(u)
=P(f ~l2:t- (u+~) ) j=l
J
·
mo
The integral involving Ht (u) may be written as a sum of integrals, each of which may be bounded by using the monotonicity of Ht and values for the cdf of jj 2 (Anderson and Darling, 1952). 1
2:::;: ZJ
References
Akaike, H. (1974). A new look at statistical model identification. I. E. E. E. Trans. Auto. Control19, 716-723. Anderson, T. W. and Darling, D. A. (1952). Asymptotic theory of certain "goodness of fit" criteria based on stochastic processes. Ann. Math. Statist. 23, 193-212. Azzalini, A., Bowman, A. W. and Hardie, W. (1989). On the use ofnonparametric regression for model checking. Biometrika 76, 1-11. Barry, D. (1993). Testing for additivity of a regression function. Ann. Statist. 21, 235-254. Barry, D. and Hartigan, J. A. (1990). An omnibus test for departures from constant mean. Ann. Statist. 18, 1340-1357. Bartlett, M. S. (1955). An Introduction to Stochastic Processes with Special Reference to Methods and Applications. Cambridge University Press, London. Bellver, C. (1987). Influence of particulate pollution on the positions of neutral points in the sky in Seville (Spain). Atmos. Environ. 21, 699-702. Bhansali, R. J. (1986a). Asymptotically efficient selection of the order by the criterion autoregressive transfer function. Ann. Statist. 14, 315-325. Bhansali, R. J. (1986b). The criterion autoregressive transfer function of Parzen. J. Time Series Anal. 7, 315-325. Bhattacharya, P. K. (1974). Convergence of sample paths of normalized sums of induced order statistics. Ann. Statist. 2, 1034-1039. Bhattacharya, R. N. and Ranga Rao, R. (1976). Normal Approximation and Asymptotic Expansions. John Wiley & Sons, New York. Bickel, P. J. and Ritov, Y. (1992). Testing for goodness of fit: a new approach. Nonparametric Statistics and Related Topics (A. K. Md. E. Saleh, ed.), NorthHolland, Amsterdam, pp. 51-57. Bickel, P. J. and Rosenblatt, M. (1973). On some global measures of the deviation of density function estimates. Ann. Statist. 1, 1071-1095. Billingsley, P. (1968). Convergence of Probability Measures. John Wiley & Sons, New York. Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, I. Effect of inequality of variance in the one-way classification. Ann. Math. Statist. 25, 290-302.
271
272
References
Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509~1526. Buckley, M. J. (1991). Detecting a smooth signal: optimality of cusum based procedures. Biometrika 78, 253~262. Buckley, M. J. and Eagleson, G. K. (1988). An approximation to the distribution of quadratic forms in normal random variables. Austral. J. Statist. 30A, 150~ 159. Butzer, P. L. and Nessel, R. J. (1971). Fourier Analysis and Approximation. Academic Press, New York. Carroll, R. J. and Ruppert, D. (1988). Transformation and Weighting in Regression. Chapman & Hall, New York. Chen, J.-C. (1994a). Testing for no effect in nonparametric regression via spline smoothing techniques. Ann. Inst. Statist. Math. 46, 251~265. Chen, J.-C. (1994b). Testing goodness of fit of polynomial models via spline smoothing techniques. Statist. Probab. Lett. 19, 65~76. Chernoff, H. (1954). On the distribution of the likelihood ratio. Ann. Math. Statist. 25, 573~578. Chernoff, H. (1973). The use of faces to represent points ink-dimensional space graphically. J. Amer. Statist. Assoc. 68, 361~368. Chiu, S.-T. (1990). On the asymptotic distributions of bandwidth estimates. Ann. Statist. 18, 1696~1711. Chiu, S.-T. and Marron, J. S. (1990). The negative correlations between datadetermined bandwidths and the optimal bandwidth. Statist. Probab. Lett. 10, 173~180.
Chu, C. K. and Marron, J. S. (1991). Choosing a kernel regression estimator. Statist. Sci. 6, 425~427. Chui, C. K. (1992). An Introduction to Wavelets. Academic Press, San Diego, CA. Chung, K. L. (1974). A Course in Probability Theory. Academic Press, New York. Clark, R. M. (1977). Nonparametric estimation of a smooth regression function. J. Roy. Statist. Soc. Ser. B 39, 107~113. Cleveland, W. S. (1993). Visualizing Data. Hobart Press, Summit, NJ. Cleveland, W. S. and Devlin, S. J. (1988). Locally weighted regression: an approach to regression analysis by local fitting. J. Amer. Statist. Assoc. 83, 596~610.
Conover, W. J. (1980). Practical Nonparametric Statistics. John Wiley & Sons, New York. Cook, R. D. and Weisberg, S. (1983). Diagnostic for heteroscedasticity in regression. Biometrika 70, 1~10. Cox, D. and Koh, E. (1989). A smoothing spline based test of model adequacy in polynomial regression. Ann. Inst. Statist. Math. 41, 383~400. Cox, D., Koh, E., Wahba, G. and Yandell, B. (1988). Testing the (parametric) null model hypothesis in (semiparametric) partial and generalized spline models. Ann. Statist. 16, 113~119. Cox, D. R. (1962). Further results on tests of separate families of hypotheses. J. Roy. Statist. Soc. Ser. B 24, 406~424.
References
273
D'Agostino, R. B. and Stephens, M. A. (1986). Goodness-of-Fit Techniques. Marcel Dekker, New York. Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math. 41, 909-996. 2 Davies, R. B. (1980). The distribution of a linear combination of X random variables. Appl. Statist. 29, 323-333. Delgado, M. A. (1993). Testing the equality of nonparametric regression curves. Statist. Probab. Lett. 17, 199-204. Devroye, L. and Gyorfi, L. (1985). Nonparametric Density Estimation: The L1 View. John Wiley & Sons, New York. Diggle, P. (1990). Time Series: A Biostatistical Introduction. Oxford University Press, Oxford. Donoho, D. L. (1988). One-sided inference about functionals of a density. Ann. Statist. 16, 1390-1420. Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81, 425-455. Durbin, J. and Knott, M. (1972). Components of Cramer-von Mises statistics, I. J. Roy. Statist. Soc. Ser. B 34, 290-307. Durbin, J. and Watson, G. S. (1950). Testing for serial correlation in least squares regression I. Biometrika 37, 409-428. Epanechnikov, V. A. (1969). Nonparametric estimates of a multivariate probability density. Theory Probab. Appl. 14, 153-158. Eubank, R. L. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York. Eubank, R. L. (1995). On testing for no effect in nonparametric regression. Unpublished manuscript. Eubank, R. L. and Hart, J. D. (1992). Testing goodness-of-fit in regression via order selection criteria. Ann. Statist. 20, 1412-1425. Eubank, R. L. and Hart, J.D. (1993). Commonality of cusum, von Neumann and smoothing-based goodness-of-fit tests. Biometrika 80, 89-98. Eubank, R. L., Hart, J. D. and LaRiccia, V. N. (1993). Testing goodness of fit via nonparametric function estimation techniques. Comm. Statist. - Theory Methods 22, 3327-3354. Eubank, R. L., Hart, J. D., Simpson, D. G. and Stefanski, L. A. (1995). Testing for additivity in nonparametric regression. Ann. Statist. 23, 1896-1920. Eubank, R. L., Hart, J.D. and Speckman, P. (1990). Trigonometric series regression estimators with an application to partly linear models. J. Multivar. Anal. 32, 70-83. Eubank, R. L., LaRiccia, V. N. and Rosenstein, R. (1987). Test statistics derived as components of Pearson's phi-squared distance measure. J. Amer. Statist. Assoc. 82, 816-825. Eubank, R. L. and Speckman, P. (1990). Curve fitting by polynomial-trigonometric regression. Biometrika 77, 1-9. Eubank, R. L. and Speckman, P. (1993). Confidence bands in nonparametric regression. J. Amer. Statist. Assoc. 88, 1287-1301. Eubank, R. L. and Spiegelman, C. (1990). Testing the goodness-of-fit of a linear model via nonparametric regression techniques. J. Amer. Statist. Assoc. 85, 387-392.
274
References
Fan, J. (1992). Design-adaptive nonparametric regression. J. Amer. Statist. Assoc. 87, 998-1004. Fan, J. (1996). Test of significance based on wavelet threshholding and Neyman's truncation. J. Amer. Statist. Assoc. 91, 674-688. Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. J. Roy. Statist. Soc. Ser. B 57, 371-394. Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and its Applications. Chapman & Hall, London. Farebrother, R. W. (1990). The distribution of a quadratic form in normal variables. Appl. Statist. 39, 294-309. Firth, D., Glosup, J. and Hinkley, D. V. (1991). Model checking with nonparametric curves. Biometrika 78, 245-252. Gasser, Th., Kneip, A. and Kohler, W. (1991). A flexible and fast method for automatic smoothing. J. Amer. Statist. Assoc. 86, 643-652. Gasser, Th. and Muller, H.-G. (1979). Kernel estimation of regression functions. Smoothing Techniques for Curve Estimation (Th. Gasser and M. Rosenblatt, eds.), Springer Lecture Notes in Mathematics No. 757, Springer-Verlag, Berlin, pp. 23-68. Gasser, Th., Muller, H.-G., Kohler, W., Molinari, L. and Prader, A. (1984). Nonparametric regression analysis of growth curves. Ann. Statist. 12, 210-229. Gasser, Th., Muller, H.-G. and Mammitzsch, V. (1985). Kernels for nonparametric curve estimation. J. Roy. Statist. Soc. Ser. B 47, 238-252. Gasser, Th., Sroka, L. and Jennen-Steinmetz, C. (1986). Residual variance and residual pattern in nonlinear regression. Biometrika 73, 625-633. Ghosh, B. K. and Huang, W. (1991). The power and optimal kernel of the BickelRosenblatt test for goodness of fit. Ann. Statist. 19, 999-1009. Gibbons, J. D. (1971). Nonparametric Statistical Inference. McGraw-Hill, New York. Gray, H. L. (1988). On a unification of bias reduction and numerical approximation. Essays in Honor of Franklin A. Graybill (J. N. Srivastava, ed.), Elsevier Science Publishers B.V. (North-Holland), Amsterdam, pp. 105-116. Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models. Chapman & Hall, London. Grenander, U. and Rosenblatt, M. (1957). Statistical Analysis of Stationary Time Series. John Wiley & Sons, New York. Gu, C. (1992). Diagnostics for nonparametric regression models with additive terms. J. Amer. Statist. Assoc. 87, 1051-1058. Hall, P. (1983). Measuring the efficiency of trigonometric series estimates of a density. J. Multivar. Anal. 13, 234-256. Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer-Verlag, New York. Hall, P. and Hart, J. D. (1990). Bootstrap test for difference between means in nonparametric regression. J. Amer. Statist. Assoc. 85, 1039-1049. Hall, P. and Johnstone, I. (1992). Empirical functionals and efficient smoothing parameter selection (with discussion). J. Roy. Statist. Soc. Ser. B 54, 475-530.
References
275
Hall, P., Kay, J. W. and Titterington, D. M. (1990). Asymptotically optimal difference based estimation of variance in nonparametric regression. Biometrika 77, 521-528. Hall, P. and Marron, J. S. (1990). On variance estimation in nonparametric regression. Biometrika 77, 415-419. Hall, P. and Titterington, D. M. (1988). On confidence bands in nonparametric density estimation and regression. J. Multivar. Anal. 27, 228-254. Hall, P. and Wehrly, T. E. (1991). A geometrical method for removing edge effects from kernel-type nonparametric regression estimators. J. Amer. Statist. Assoc. 86, 665-672. Hall, P. and Wilson, S. R. (1991). Two guidelines for bootstrap hypothesis testing. Biometrics 47, 757-762. Hannan. E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. J. Roy. Statist. Soc. Ser. B 41, 190-195. Hiirdle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge. Hiirdle, W. and Bowman, A. W. (1988). Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bands. J. Amer. Statist. Assoc. 83, 102-110. Hiirdle, W., Hall, P. and Marron, J. S. (1988). How far are automatically chosen regression smoothing parameters from their optimum? (with discussion). J. Amer. Statist. Assoc. 83, 86-99. Hiirdle, W. and Mammen, E. (1993). Comparing nonparametric versus parametric regression fits. Ann. Statist. 21, 1926-1947. Hiirdle, W. and Marron, J. S. (1990). Semiparametric comparison of regression curves. Ann. Statist. 18, 63-89. Hiirdle, W. and Marron, J. S. (1991). Bootstrap simultaneous error bars for nonparametric regression. Ann. Statist. 19, 778-796. Hart, J. D. (1984). On the modal resolution of kernel density estimators. Statist. Probab. Lett. 2, 363-369. Hart, J.D. (1988). An ARMA type probability density estimator. Ann. Statist. 16, 842-855. Hart, J.D. (1994). Automated kernel smoothing of dependent data by using time series cross-validation. J. Roy. Statist. Soc. Ser. B 56, 529-542. Hart, J. D. (1996). Some automated methods of smoothing time-dependent data. J. Nonparam. Statist. 6, 115-142. Hart, J. D. and Gray, H. L. (1985). The ARMA method of approximating probability density functions. J. Statist. Plan. Inference 12, 137-152. Hart, J.D. and Wehrly, T. E. (1992). Kernel regression when the boundary region is large, with an application to testing the adequacy of polynomial models. J. Amer. Statist. Assoc. 87, 1018-1024. Hart, J. D. and Yi, S. (1996). One-sided cross-validation. Technical Report No. 249, Department of Statistics, Texas A&M University. Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman & Hall, London. Hoeffding, W. and Robbins, H. (1948). The central limit theorem for dependent random variables. Duke Math. J. 15, 773-780.
Hurvich, C. M. and Tsai, C.-L. (1995). Relative rates of convergence for efficient model selection criteria in linear regression. Biometrika 82, 418-425.
Jayasuriya, B. R. (1996). Testing for polynomial regression using nonparametric regression techniques. J. Amer. Statist. Assoc. 91, 1626-1631.
Jones, M. C. (1991). The roles of ISE and MISE in density estimation. Statist. Probab. Lett. 12, 51-56.
Jones, M. C., Davies, S. J. and Park, B. U. (1994). Versions of kernel-type regression estimators. J. Amer. Statist. Assoc. 89, 825-832.
Kallenberg, W. C. M. and Ledwina, T. (1995). Consistency and Monte Carlo simulation of a data driven version of smooth goodness-of-fit tests. Ann. Statist. 23, 1594-1608.
Karlin, S. (1968). Total Positivity. Stanford University Press, Stanford, CA.
Kendall, M. and Stuart, A. (1979). The Advanced Theory of Statistics. Charles Griffin & Company Ltd, New York.
Kim, J. (1994). Test for change in a mean function when data are dependent. Ph.D. dissertation, Department of Statistics, Texas A&M University.
Kim, J.-T. (1992). Testing goodness-of-fit via order selection criteria. Ph.D. dissertation, Department of Statistics, Texas A&M University.
King, E. C. (1988). A test for the equality of two regression curves based on kernel smoothers. Ph.D. dissertation, Department of Statistics, Texas A&M University.
King, E., Hart, J. D. and Wehrly, T. E. (1991). Testing the equality of two regression curves using linear smoothers. Statist. Probab. Lett. 12, 239-247.
Knafl, G., Sacks, J. and Ylvisaker, D. (1985). Confidence bands for regression functions. J. Amer. Statist. Assoc. 80, 683-691.
Kuchibhatla, M. and Hart, J. D. (1996). Smoothing-based lack-of-fit tests: variations on a theme. J. Nonparam. Statist. 7, 1-22.
Ledwina, T. (1994). Data-driven version of Neyman's smooth test of fit. J. Amer. Statist. Assoc. 89, 1000-1005.
Lee, G.-H. (1996). A statistical wavelet approach to model selection and data-driven Neyman smooth tests. Ph.D. dissertation, Department of Statistics, Texas A&M University.
Lehmann, E. (1959). Testing Statistical Hypotheses. John Wiley & Sons, New York.
Li, K.-C. (1987). Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: Discrete index set. Ann. Statist. 15, 958-975.
Li, K.-C. (1989). Honest confidence regions for nonparametric regression. Ann. Statist. 17, 1001-1008.
Liaw, A. (1997). An application of Fourier series smoothing to a diagnostic test of heteroscedasticity. Ph.D. dissertation, Department of Statistics, Texas A&M University.
Mallat, S. (1989). Multiresolution approximations and wavelet orthonormal bases of L²(ℝ). Trans. Amer. Math. Soc. 315, 69-87.
Mallows, C. L. (1973). Some comments on Cp. Technometrics 15, 661-675.
Mammen, E. (1993). Bootstrap and wild bootstrap for high dimensional linear models. Ann. Statist. 21, 255-285.
Marron, J. S. and Wand, M. P. (1992). Exact mean integrated squared error. Ann. Statist. 20, 712-736.
Müller, H.-G. (1984). Optimal designs for nonparametric kernel regression. Statist. Probab. Lett. 2, 285-290.
Müller, H.-G. (1991). Smooth optimum kernel estimators near endpoints. Biometrika 78, 521-530.
Müller, H.-G. (1992). Goodness-of-fit diagnostics for regression models. Scand. J. Statist. 19, 157-172.
Müller, H.-G. and Stadtmüller, U. (1987). Variable bandwidth estimators of regression curves. Ann. Statist. 15, 182-201.
Munson, P. J. and Jernigan, R. W. (1989). A cubic spline extension of the Durbin-Watson test. Biometrika 76, 39-47.
Nadaraya, E. A. (1964). On estimating regression. Theory Probab. Appl. 9, 141-142.
Nair, V. N. (1986). On testing against ordered alternatives in analysis of variance models. Biometrika 73, 493-499.
Newton, H. J. (1988). Timeslab: A Time Series Analysis Laboratory. Wadsworth & Brooks/Cole, Belmont, CA.
Neyman, J. (1937). 'Smooth' test for goodness of fit. Skandinavisk Aktuarietidskrift 20, 149-199.
Neyman, J. and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. Roy. Soc. Ser. A 231, 289-337.
Noether, G. E. (1955). On a theorem of Pitman. Ann. Math. Statist. 26, 64-68.
Nychka, D. (1988). Bayesian confidence intervals for smoothing splines. J. Amer. Statist. Assoc. 83, 1134-1143.
Opsomer, J. and Ruppert, D. (1996). A fully automated bandwidth selection method for fitting additive models. Unpublished manuscript.
Pace, L. and Salvan, A. (1990). Best conditional tests of separate families of hypotheses. J. Roy. Statist. Soc. Ser. B 52, 125-134.
Page, E. J. (1954). Continuous inspection schemes. Biometrika 41, 100-115.
Park, B. U. and Marron, J. S. (1990). Comparison of data-driven bandwidth selectors. J. Amer. Statist. Assoc. 85, 66-72.
Parzen, E. (1977). Multiple time series: determining the order of approximating autoregressive schemes. Multivariate Analysis - IV (P. Krishnaiah, ed.), North-Holland, Amsterdam, pp. 283-295.
Parzen, E. (1981). Nonparametric statistical data science: a unified approach based on density estimation and testing for "white noise." Technical Report, Department of Statistics, Texas A&M University.
Priestley, M. B. (1981). Spectral Analysis and Time Series. Academic Press, London.
Priestley, M. B. and Chao, M. T. (1972). Non-parametric function fitting. J. Roy. Statist. Soc. Ser. B 34, 385-392.
Ramachandran, M. (1992). Testing for goodness of fit using nonparametric techniques. Ph.D. dissertation, Department of Statistics, Texas A&M University.
Rao, C. R. (1973). Linear Statistical Inference and its Applications. John Wiley & Sons, New York.
Rayner, J. C. W. and Best, D. J. (1989). Smooth Tests of Goodness of Fit. Oxford University Press, New York.
Rayner, J. C. W. and Best, D. J. (1990). Smooth tests of goodness of fit: an overview. Int. Statist. Rev. 58, 9-17.
Raz, J. (1990). Testing for no effect when estimating a smooth function by nonparametric regression: a randomization approach. J. Amer. Statist. Assoc. 85, 132-138.
Rice, J. (1984a). Boundary modification for kernel regression. Comm. Statist. Theory Methods 13, 893-900.
Rice, J. (1984b). Bandwidth choice for nonparametric regression. Ann. Statist. 12, 1215-1230.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Ann. Statist. 11, 416-431.
Rosenblatt, M. (1975). A quadratic measure of deviation of two-dimensional density estimates and a test of independence. Ann. Statist. 3, 1-14.
Ruppert, D., Sheather, S. J. and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. J. Amer. Statist. Assoc. 90, 1257-1270.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, New York.
Serfling, R. J. (1970). Moment inequalities for the maximum cumulative sum. Ann. Math. Statist. 41, 1227-1234.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons, New York.
Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 63, 117-126.
Silverman, B. W. (1981). Using kernel density estimates to investigate multimodality. J. Roy. Statist. Soc. Ser. B 43, 97-99.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
Sockett, E. B., Daneman, D., Clarson, C. and Ehrich, R. M. (1987). Factors affecting and patterns of residual insulin secretion during the first year of type I (insulin dependent) diabetes mellitus in children. Diabetes 30, 453-459.
Spiegelman, C. and Wang, C. Y. (1994). Detecting interactions using low dimensional searches in high dimensional data. Chemometr. Intell. Lab. Syst. 23, 293-299.
Spitzer, F. (1956). A combinatorial lemma and its applications to probability theory. Trans. Amer. Math. Soc. 82, 323-339.
Staniswalis, J. and Severini, T. A. (1991). Diagnostics for assessing regression models. J. Amer. Statist. Assoc. 86, 684-692.
Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10, 1040-1053.
Stone, C. J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13, 689-705.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Statist. Soc. Ser. B 36, 111-147.
Tarter, M. E. and Lock, M. D. (1993). Model-Free Curve Estimation. Chapman & Hall, New York.
Terrell, G. R. and Scott, D. W. (1985). Oversmoothed nonparametric density estimates. J. Amer. Statist. Assoc. 80, 209-214.
Tolstov, G. P. (1962). Fourier Series. Dover, New York.
van Es, B. (1992). Asymptotics for least squares cross-validation bandwidths in nonsmooth cases. Ann. Statist. 20, 1647-1657.
Vieu, P. (1991). Nonparametric regression: Optimal local bandwidth choice. J. Roy. Statist. Soc. Ser. B 53, 453-464.
von Neumann, J. (1941). Distribution of the ratio of the mean squared successive difference to the variance. Ann. Math. Statist. 12, 367-395.
Wahba, G. (1983). Bayesian 'confidence intervals' for the cross-validated smoothing spline. J. Roy. Statist. Soc. Ser. B 45, 133-150.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman & Hall, New York.
Watson, G. S. (1964). Smooth regression analysis. Sankhya Ser. A 26, 359-372.
Wells, M. (1990). The relative efficiency of goodness-of-fit statistics in the simple and composite hypothesis-testing problem. J. Amer. Statist. Assoc. 85, 459-463.
White, H. (1982). Regularity conditions for Cox's test of non-nested hypotheses. J. Econometr. 19, 301-318.
Wood, A. T. A. (1989). An F approximation to the distribution of a linear combination of chi-squared variables. Comm. Statist. - Simul. Comput. 18, 1439-1456.
Woodroofe, M. (1982). On model selection and the arc sine laws. Ann. Statist. 10, 1182-1194.
Yanagimoto, T. and Yanagimoto, M. (1987). The use of marginal likelihood for a diagnostic test for the goodness of fit of the simple linear regression model. Technometrics 29, 95-101.
Yi, S. (1996). On one-sided cross-validation in nonparametric regression. Ph.D. dissertation, Department of Statistics, Texas A&M University.
Yin, Y. and Carroll, R. J. (1990). A diagnostic for heteroscedasticity based on the Spearman rank correlation. Statist. Probab. Lett. 10, 69-76.
Young, S. G. and Bowman, A. W. (1995). Non-parametric analysis of covariance. Biometrics 51, 920-931.
Zhang, P. (1994). On the distributional properties of model selection criteria. J. Amer. Statist. Assoc. 87, 732-737.
Index
additive model, 116, 232-234, 253, 263-266 Akaike, H., 245, 271 Akaike's Information Criterion (AIC), 245 Anderson, T. W., 269, 271 approximation of functions, 21, 25-26, 35-49 approximator, 28, 42, 43, 45, 46, 48 arithmetic means, 25 asymptotic distribution of cusum statistic, 139 data-driven bandwidths, 94-105 data-driven Neyman smooth statistic, 185-187 Gasser-Müller estimator, 76-78 kernel-smoother-based test statistic, 154-156 Neyman smooth statistic, 141 order selection statistic, 168-175, 210-217 truncated series estimator, 76-78 von Neumann statistic, 138 von Neumann statistic applied to residuals, 130-131 asymptotic normality, 76, 95-96, 121, 130, 156, 157, 217 asymptotic relative efficiency, 76, 100, 137, 184 autocorrelation, 240, 241, 244, 250, 251 autoregressive process, 107, 166, 172, 244, 245, 249, 250, 252 autoregressive, moving average process, 42
average squared error, 108 Azzalini, A., 163, 271 Babinet data, 160-161, 253-258 Babinet point, 160 bandwidth, basic issues in choosing, 11-14 definition of, 6 illustrated effect of, 9 optimal, 11, 14, 57-58, 63-65, 79, 88, 89, 92, 95, 108, 109, 111, 113, 114, 154, 157, 159, 160, 218 variable, 14-18, 63-64, 115 Barry, D., 207, 218, 233, 271 Bartlett's test, 242, 244 Bartlett, M. S., 242, 271 Bayes factor, 192 Bayes information criterion (BIC), 106 Bayesian methods, 80-81, 134-136, 189-195, 256-257 Bellver, C., 160, 271 Berry, Scott, viii Berry-Esseen theorem, 168 Best, D. J., 163, 239, 277, 278 Bhansali, R. J., 245, 246, 271 Bhattacharya, P. K., 227, 271 Bhattacharya, R. N., 168, 216, 271 bias of Gasser-Müller estimator, 55-56, 62-63 of sample Fourier coefficient, 67-68 bias reduction, 31, 37, 39, 63 Bickel, P. J., 163, 207, 271
Billingsley, P., 155, 156, 271 binomial distribution, 179 bootstrap, 80, 83, 130, 131, 133, 150, 157, 161, 181, 182, 188, 189, 208, 219, 221-223, 227, 228, 236, 237, 252, 254, 255, 259 double, 134, 221-223 wild, 228 Bowman, A. W., 80, 81, 160, 163, 237, 271, 275, 279 Box, G. E. P., 129, 242, 271, 272 Buckley, M. J., 129, 134, 135, 272 Butzer, P. L., 26, 65, 272 C-peptide concentration, 263 calculus of variations, 58, 102 Calvin, Jim, viii Carroll, R. J., 219, 235, 272, 279 central limit theorem, 76, 131 Chao, M. T., 8, 277 Chen, Chien-Feng, viii Chen, J.-C., 163, 218, 272 Chen, Ray, viii Chernoff, H., 120, 261, 272 Chiu, S.-T., 91, 98-100, 104, 272 Chu, C. K., 10, 272 Chui, C. K., 44, 272 Chung, K. L., 77, 272 circular design, 98, 105 Clark, R. M., 8, 272 Clarson, C., 263, 278 Cleveland, W. S., 160, 163, 272 collinearity, 213 comparing parametric and nonparametric models, 145, 147-148 complete class, 21 components-based test, 163 concurvity, 233 confidence bands, 81 confidence intervals, 50, 76-83, 179-181 Conover, W. J., 179, 272 consistency of cusum test, 139 of maximum risk test, 197 of order selection test, 195-197, 223-224 of von Neumann test, 138
consistent test, definition of, 131 continuous smoothing parameter, 26, 106 convergence in mean square, 21 convolution, 8, 10, 65, 66, 80, 156 Cook, R. D., 235, 272 cosine series, 35, 42-46, 48, 207, 255, 257, 258 covariance stationary, 240, 243, 244, 248, 252 Cox, D., 136, 163, 218, 272 Cox, D. R., 120, 272 Cramer-von Mises test, 208, 225, 240 Cramer-Wold device, 156, 167 criterion autoregressive transfer (CAT), 245 cross-validation, 84-86, 90-92, 94-103, 107-115, 218, 219, 237, 256 one-sided, 84, 90-92, 98-105, 108, 115, 218, 219, 251, 255 curse of dimensionality, 232, 233 cusp, 29, 30, 33 cusum, 134-137, 139, 195, 197-201, 268-269 cut-and-normalize method, 29 D'Agostino, R. B., 238, 273 Daneman, D., 263, 278 Darling, D. A., 269, 271 data-reflection, 75 Daubechies, I., 46, 273 Daubechies wavelet, 46 Davies, R. B., 129, 273 Davies, S. J., 10, 40, 276 Delgado, M. A., 238, 244, 261, 273 derivatives, estimation of, 63-65, 115 design and kernel estimators, 10, 13, 39-40, 54-58, 60, 64, 238 and local linear estimators, 39-40, 56, 238 and Rogosinski estimators, 72-75 and truncated series estimators, 67-69 design density, 50, 51, 54, 56, 64, 67, 68, 72, 146, 238 design, optimal, 13, 64 design, random, 40, 226-228
design-independent bias, 56, 238 Devlin, S. J., 163, 272 Devroye, L., 4, 273 diabetes data, 263-266 difference-based variance estimator, 86, 179 Diggle, P., 258, 261, 273 discrete smoothing parameter, 50, 65, 106 distribution-free test, 183-184 divergent series, 25 Donoho, D. L., 48, 80, 206, 273 Durbin, J., 133, 163, 273 Eagleson, G. K., 129, 272 Ehrich, R. M., 263, 278 eigenvalues, 129 empirical distribution, 83, 131, 133, 150, 182, 222, 227, 228, 255 Epanechnikov, V. A., 58, 273 Eubank, R., 273 Eubank, R. L., viii, 4, 34, 40, 75, 81, 124, 133, 138, 163, 168, 204, 205, 207, 216, 233, 273 Fan, J., 4, 38, 40, 56, 92, 102, 115, 206, 274 Farebrother, R. W., 129, 274 Fejer series, 26 Fejer weights, 25 Firth, D., 163, 274 Fourier coefficients, definition of, 21 Fourier coefficients, sample, 67-68, 136, 165-168, 189, 190, 196, 197, 201, 206, 210, 213, 214, 217, 219, 221, 224, 225, 227, 235, 256 frequentist, 80, 81, 94, 192 full model, 151, 234, 263 Gasser, Th., 8, 31, 51, 64, 65, 89, 96, 108, 123, 274 Gauss-Newton algorithm, 44 Gaussian process, 155, 157, 215, 217, 250 generalized cross-validation, 87, 92 Ghosh, B. K., 163, 274 Gibbons, J. D., 184, 274 Gijbels, I., 4, 115, 274
Glosup, J., 163, 274 goodness-of-fit test, defined, 118 Gram-Schmidt procedure, 151, 213, 227, 229, 231, 256 graphical test, 175-176, 189, 196 Gray, Buddy, viii Gray, H. L., 35, 41, 274, 275 Green, P. J., 4, 274 Grenander, U., 261, 274 Gu, C., 233, 274 Gyorfi, L., 4, 273 Haar wavelet, 45-48 half normal density, 33 Hall, P., 75, 76, 81, 84, 87, 90, 91, 94-98, 131, 163, 181, 221, 222, 227, 237, 274, 275 Hannan, E. J., 107, 275 Härdle, W., 4, 80, 81, 94, 163, 228, 238, 271, 275 Hart, J. D., 41, 43, 75, 80, 82, 90, 131, 133, 138, 160, 163, 168, 181, 189, 204, 207, 216, 218, 219, 233, 237, 238, 251, 252, 258, 273-276 Hartigan, J. A., 207, 218, 271 Hastie, T. J., 233, 263, 275 heteroscedasticity, 258 Hinkley, D. V., 163, 274 Hoeffding, W., 131, 275 homoscedasticity, 227, 234-236, 253, 258, 263 Huang, W., 163, 274 Hurvich, C. M., 106, 107, 276 Integrated squared error, 14, 21, 42, 46, 49, 66, 72 invariant test, 125, 136 Jayasuriya, B. R., 218, 276 Jennen-Steinmetz, C., 123, 274 Jernigan, R. W., 127, 133, 277 Johnstone, I. M., 48, 84, 90, 91, 95-98, 206, 273, 274 Jones, M. C., 4, 10, 40, 95, 276, 279 Kallenberg, W. C. M., 187, 207, 238, 239, 276 Karlin, S., 80, 276
Kay, J. W., 87, 275 Kayley, v, viii Kendall, M., 120, 276 kernel estimator, convolution type, 10, 66 evaluation type, 10 Gasser-Müller, 8, 10-12, 15, 17-20, 22, 28-33, 40, 41, 50-52, 55, 56, 59, 61, 62, 64, 65, 76-78, 80-82, 85, 88, 92, 148, 217-219, 237, 238 Nadaraya-Watson, 6, 8-10, 14, 16, 29, 31, 33, 38-41, 50, 56, 106, 148, 230, 238, 252 Priestley-Chao, 8, 10, 11, 29, 94, 102, 153 variable bandwidth, 14-18, 63-64, 115 kernel, boundary, 31, 32, 39, 41, 59, 60, 62, 95, 101 Dirichlet, 22, 33, 65, 71, 76 Epanechnikov, 15, 16, 18, 31, 39, 58, 60, 76, 83, 92, 102, 108, 115, 158, 255 Fejer-Korovkin, 65 finite support, 10, 11, 28, 41, 51, 58, 95 Gaussian, 8, 9, 11, 14, 22, 58, 80 higher order, 62-63 quartic, 58, 101, 102, 104 rectangular, 8 Rogosinski, 26, 27, 33, 58, 59, 65, 71, 76, 204 second order, 62, 63, 65, 75, 76, 98, 99, 102, 148 triangle, 58 Kim, J., 251, 276 Kim, J.-T., 207, 238, 239, 276 King, E. C., 154, 160, 163, 238, 276 Knafl, G., 81, 276 Kneip, A., 89, 96, 108, 274 knots, 40, 41 Knott, M., 163, 273 Koh, E., 136, 163, 218, 272 Köhler, W., 64, 89, 96, 108, 274 Kolmogorov-Smirnov test, 208, 240, 244 Kuchibhatla, M., 189, 204, 258, 276
Lack-of-fit test, defined, 118 LaRiccia, V. N., 163, 273 least squares, 21, 34, 37, 38, 41, 44, 81-83, 87, 89, 121, 123, 125-127, 142, 148, 149, 151, 153, 165, 187, 208-210, 213, 214, 217-219, 224, 225, 227, 229-231, 234, 249, 254, 255, 264 Ledwina, T., 107, 186, 187, 207, 238, 239, 276 Lee, Cherng-Luen, viii Lee, G.-H., viii, 204, 206, 276 Legendre polynomials, 141 Lehmann, E., 125, 142, 276 Li, K.-C., 81, 106, 276 Liapounov central limit theorem, 187 Liapounov condition, 77, 78 Liaw, A., viii, 235, 236, 276 likelihood ratio test, 3, 118-122, 125 Lindeberg-Feller theorem, 156, 167, 211 Lipschitz continuous, defined, 50 local alternatives, 137, 143, 153-157, 195, 201-203, 205, 236 local likelihood, 163 local linear estimator, 37-41, 56, 91-93, 98, 102, 103, 107, 108, 115, 148, 158-161, 163, 238, 254, 255, 257, 263 local polynomial estimator, 2, 37-40, 44, 115, 144, 217, 218 local quadratic estimator, 38 locally most powerful, 134, 136 Lock, M. D., 4, 278 Lombard, Fred, viii loss function, 94, 105 luteinizing hormone level, 258 Mallat, S., 46, 276 Mallows, C. L., 87, 276 Mammen, E., 163, 228, 275, 276 Mammitzsch, V., 65, 274 Markov's inequality, 268 Marron, J. S., 10, 63, 81, 88, 91, 94, 181, 228, 238, 272, 275-277 maximal rate, 137-139, 143, 154, 156, 157, 201, 203, 204 mean average squared error, 88, 159
mean integrated squared error of Gasser-Müller estimator, 61-62 of Rogosinski estimator, 71-76 of truncated series estimator, 68-71 mean square convergence, 25 mean squared error of Gasser-Müller estimator, 40, 51-61 of local linear estimator, 40 of local polynomial estimator, 38 mean value theorem, 52 Michelle, v, viii mineral assay data, 261-263 model selection, 255 Molinari, L., 64, 274 moving average, 6 moving average process, 243, 244 Müller, H.-G., 8, 13, 31, 51, 60, 63-65, 101, 115, 163, 274, 277 multivariate normal distribution, 155, 216, 217, 220, 221, 247 Munson, P. J., 127, 133, 277 Nadaraya, E. A., 6, 277 Nair, V. N., 136, 277 natural spline interpolant, 41, 133 Nessel, R. J., 26, 65, 272 nested models, 120, 125 Newton, H. J., 240-242, 245, 246, 277 Neyman smooth test, 140-143, 152, 163, 167, 185-187, 195, 197-201, 203-205, 207, 242, 259 Neyman, J., 119, 141, 163, 185, 277 no-effect hypothesis, 3, 132-135, 141-143, 148, 164, 168, 172, 175, 176, 178, 185, 187-189, 191, 195, 227, 23~ 239, 242, 248 Noether, G. E., 137, 277 non-Gaussian (non-normal), 121, 129-131, 150, 151, 177, 181-183, 211 nonlinear model, 129, 133, 145, 208, 219-223 non-nested models, 120, 121
normalized estimator, 29-33, 59, 61, 62, 70 Nychka, D., 81, 277 Omnibus test, 131, 138, 145, 163, 195, 203, 242 Opsomer, J., 116, 277 orthogonal basis, 19, 21, 46, 151, 205, 213, 256 orthogonal polynomials, 19, 22, 212 orthogonal wavelet, 45, 46 orthonormal, 141, 205, 224, 256 orthonormal basis, 45 Pace, L., 121, 277 Page, E. J., 134, 277 Park, B. U., 10, 40, 88, 91, 276, 277 Parseval's formula, 66, 136, 181 parsimony, 35, 41, 42 Parzen, E., 165, 188, 206, 245, 277 Parzen, Manny, viii Pearson, E. S., 119, 277 permutation test, 150 piecewise constant function, 8 piecewise linear function, 10 piecewise smooth function, definition of, 67 Pierce, D. A., 242, 272 Pitman relative efficiency, 137 pivotal quantity, 221, 223 plug-in, 84, 88-90, 92-98, 107-110, 112-116 pointwise convergence, 25, 231 polynomial regression, 22, 125 polynomial-trigonometric regression, 34 polynomials, testing the fit of, 217-219 portmanteau test, 242, 244 posterior distribution, 191, 193 posterior probability, 191, 192, 194, 257 posterior risk, 94 power and smoothing parameters, 158-161 of cusum test, 197-201 of Neyman smooth test, 197-201 of order selection test, 197-203
power transformation, 18, 20 Prader, A., 64, 274 Priestley, M. B., 4, 8, 241-243, 259, 277 prior distribution, 135, 189-192, 206, 257 prior probability, 191, 194, 257 prior, convenience, 191 prior, noninformative, 190, 191, 257 probability bands, 81-83 probability density estimation, 1, 41, 76, 107, 163 pseudo-residuals, 123, 124 pure experimental error, 122-124 P-value, 81-83, 174, 175, 253, 254, 258, 259, 261, 264 Quadratic form, 126, 128, 129, 135, 136, 149-151 quadrature, 21, 67 quadrature bias, 67 Quinn, B. G., 107, 275 Rabbit jawbone data, 17-18 Ramachandran, M., 189, 277 random walk, 168, 174 Ranga Rao, R., 168, 216, 271 rank test, 183-184, 235, 236 Rao, C. R., 124, 151, 216, 277 rational functions, 41-45 Rayner, J. C. W., 163, 239, 277, 278 Raz, J., 150, 151, 163, 278 reduced model, 151 reduction method, 3, 124-126, 138, 142, 151, 152, 166 regression quantile function, 165, 261 relative efficiency, 58, 76, 137 residual analysis, 17-18, 234, 257 residuals, 83, 85, 122, 123, 125, 127, 128, 131, 133-135, 144-153, 158, 179, 181, 182, 208-210, 213, 217, 219, 222-224, 228-231, 234-236, 252, 254, 257-259 resolution level, 46, 47 Rice, J., 31, 94, 96, 97, 278 risk estimation, 84, 86-88, 97, 106, 243, 247, 255, 256, 258 risk function, 86, 94
risk regret, 97, 98 Rissanen, J., 191, 257, 278 Ritov, Y., 207, 271 Robbins, H., 131, 275 Rogosinski series estimator, 27, 32, 33, 35, 36, 59, 70-76, 188-189, 259, 261 Rosenblatt, M., 163, 261, 271, 274, 278 Rosenstein, R., 163, 273 Ruppert, D., 88, 89, 115, 116, 219, 272, 277, 278 Sacks, J., 81, 276 Salvan, A., 121, 277 sampling distribution, 50, 76, 84, 104, 108, 131, 150, 160, 183, 219, 227, 228, 231, 252 Schucany, Bill, viii Schwarz, G., 106, 278 Scott, D. W., 4, 80, 278, 279 Serfling, R. J., 156, 171, 215, 278 serum alphafetoprotein data, 92 Severini, T. A., 163, 278 Sheather, S. J., 88, 89, 116, 278 Shibata, R., 107, 172, 245, 247, 278 side lobes, 22, 26, 71 significance trace, 158, 160-161, 238, 253 Silverman, B. W., 4, 80, 274, 278 Simpson, D. G., 204, 207, 233, 273 simulation, 64, 81-83, 97, 104, 107-113, 115, 129, 131, 151, 158, 179, 186-189, 199, 204-206, 218, 219, 222, 232, 236, 237, 246, 248, 264 Slutsky's theorem, 211, 215 smoother matrix, 150 smoothing, definition of, 4 smoothing parameter, definition of, 6 smoothing residuals, 145-149 smoothing splines, 40-41, 80, 144, 148, 163, 207, 217, 218 Sockett, E. B., 263, 278 Speckman, P., 34, 75, 81, 273 spectrum (spectra), 1, 42, 240-244, 253, 258, 259, 261 Spiegelman, C., 163, 233, 273, 278 Spitzer, F., 171, 174, 177, 278
Sroka, L., 123, 274 Stadtmüller, U., 63, 115, 277 Staniswalis, J., 163, 278 Stefanski, L. A., 204, 207, 233, 273 Stephens, M. A., 238, 273 Stone, C. J., 232, 233, 278 Stone, M., 85, 278 straight line, 2, 37, 41, 82, 83, 120, 121, 123, 126, 159, 219 straight lines, testing the fit of, 81-83, 148, 160-161, 163, 207, 253-255 strong law of large numbers, 175, 182 Stuart, A., 120, 276 Taper, 25, 26, 28, 33, 34, 50, 65, 70, 71, 75, 76, 95 Tarter, M. E., 4, 278 Taylor series, 55, 56, 62 Terrell, G. R., 80, 279 thresholding, 46, 48, 49, 206 Tibshirani, R. J., 233, 263, 275 tightness, 155, 156 time series, 1, 42, 166, 240-252, 259 Titterington, D. M., 81, 87, 275 Tolstov, G. P., 21, 25, 231, 279 total positivity, 80 transformation, 16-18, 20, 90, 91, 231 trigonometric series, 3, 19, 26, 37, 45, 65, 69, 125, 148, 164, 189, 208 truncated series estimator, 22, 32, 35, 50, 65-71, 76, 77, 87, 105, 106, 165 truncation bias, 67, 69, 71, 73, 75 truncation point, 22, 23, 26, 43, 65, 75, 76, 94, 107, 165, 172, 185, 189, 194, 196 Tsai, C.-L., 106, 107, 276 type I error probability, 148, 157, 173, 174, 231, 245, 246, 261, 263
Undersmooth, 79, 92, 93, 103, 255 uniform convergence, 21, 25, 26 uniformly most powerful, 125, 141-143 University of Seville, 160 van Es, B., 106, 279 variance-ratio, 121, 123, 147 Vieu, P., 115, 279 von Neumann, J., 127, 132, 279 Wahba, G., 4, 80, 136, 163, 272, 279 Wand, M. P., 4, 63, 88, 89, 116, 276, 278, 279 Wang, C. Y., 233, 278 Watson, G. S., 6, 133, 273, 279 wavelets, 19, 37, 44-49, 144, 206 Wehrly, T. E., 41, 75, 82, 95, 160, 163, 207, 218, 219, 238, 275, 276 Weisberg, S., 235, 272 Wells, M., 225, 279 White, H., 121, 279 Wilson, S. R., 221, 227, 275 window estimate, 6-8, 30 Wood, A. T. A., 129, 279 Woodroofe, M., 207, 279 wrapped distribution, 28 Yanagimoto, M., 163, 206, 207, 218, 279 Yanagimoto, T., 163, 206, 207, 218, 279 Yandell, B., 136, 163, 272 Yi, S., viii, 90, 98, 100, 102, 275, 279 Yin, Y., 235, 279 Ylvisaker, D., 81, 276 Young, S. G., 160, 237, 279 Zhang, P., 207, 279
Springer Series in Statistics (continued from p. ii)
Mosteller/Wallace: Applied Bayesian and Classical Inference: The Case of The Federalist Papers.
Pollard: Convergence of Stochastic Processes.
Pratt/Gibbons: Concepts of Nonparametric Theory.
Ramsay/Silverman: Functional Data Analysis.
Read/Cressie: Goodness-of-Fit Statistics for Discrete Multivariate Data.
Reinsel: Elements of Multivariate Time Series Analysis, 2nd edition.
Reiss: A Course on Point Processes.
Reiss: Approximate Distributions of Order Statistics: With Applications to Non-parametric Statistics.
Rieder: Robust Asymptotic Statistics.
Rosenbaum: Observational Studies.
Ross: Nonlinear Estimation.
Sachs: Applied Statistics: A Handbook of Techniques, 2nd edition.
Särndal/Swensson/Wretman: Model Assisted Survey Sampling.
Schervish: Theory of Statistics.
Seneta: Non-Negative Matrices and Markov Chains, 2nd edition.
Shao/Tu: The Jackknife and Bootstrap.
Siegmund: Sequential Analysis: Tests and Confidence Intervals.
Simonoff: Smoothing Methods in Statistics.
Small: The Statistical Theory of Shape.
Tanner: Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, 3rd edition.
Tong: The Multivariate Normal Distribution.
van der Vaart/Wellner: Weak Convergence and Empirical Processes: With Applications to Statistics.
Vapnik: Estimation of Dependences Based on Empirical Data.
Weerahandi: Exact Statistical Methods for Data Analysis.
West/Harrison: Bayesian Forecasting and Dynamic Models, 2nd edition.
Wolter: Introduction to Variance Estimation.
Yaglom: Correlation Theory of Stationary and Related Random Functions I: Basic Results.
Yaglom: Correlation Theory of Stationary and Related Random Functions II: Supplementary Notes and References.