ROBUST STATISTICAL METHODS with R
ROBUST STATISTICAL METHODS with R
Jana Jurecková ˇ Jan Picek
Published in 2006 b...
444 downloads
4155 Views
2MB Size
Report
This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Report copyright / DMCA form
ROBUST STATISTICAL METHODS with R
ROBUST STATISTICAL METHODS with R
Jana Jurecková ˇ Jan Picek
Published in 2006 by Chapman & Hall/CRC Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2006 by Taylor & Francis Group, LLC Chapman & Hall/CRC is an imprint of Taylor & Francis Group No claim to original U.S. Government works Printed in the United States of America on acid-free paper 10 9 8 7 6 5 4 3 2 1 International Standard Book Number-10: 1-58488-454-1 (Hardcover) International Standard Book Number-13: 978-1-58488-454-5 (Hardcover) Library of Congress Card Number 2005053192 This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use. No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC) 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data Jureckova, Jana, 1940Robust statistical methods with R / Jana Jureckova, Jan Picek. p. cm. Includes bibliographical references and indexes. ISBN-13: 978-1-58488-454-5 (acid-free paper) ISBN-10: 1-58488-454-1 (acid-free paper) Robust statistics. 2. R (Computer program language)--Statistical methods. I. Picek, Jan, 1965- II. Title. QA276.J868 2006 519.5--dc22
2005053192
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com Taylor & Francis Group is the Academic Division of Informa plc.
and the CRC Press Web site at http://www.crcpress.com
Contents
Preface
ix
Authors
xi
Introduction
1
1 Mathematical tools of robustness
5
1.1 Statistical model
5
1.2 Illustration on statistical estimation
8
1.3 Statistical functional
9
1.4 Fisher consistency
11
1.5 Some distances of probability measures
12
1.6 Relations between distances
13
1.7 Differentiable statistical functionals
14
1.8 Gˆ ateau derivative
15
1.9 Fr´echet derivative
17
1.10 Hadamard (compact) derivative
18
1.11 Large sample distribution of empirical functional
18
1.12 Computation and software notes
19
1.13 Problems and complements
23
2 Basic characteristics of robustness
27
2.1 Influence function
27
2.2 Discretized form of influence function
28
2.3 Qualitative robustness
30 v
vi
CONTENTS 2.4 Quantitative characteristics of robustness based on influence function
32
2.5 Maximum bias
33
2.6 Breakdown point
35
2.7 Tail–behavior measure of a statistical estimator
36
2.8 Variance of asymptotic normal distribution
41
2.9 Problems and complements
41
3 Robust estimators of real parameter
43
3.1 Introduction
43
3.2 M -estimators
43
3.3 M -estimator of location parameter
45
3.4 Finite sample minimax property of M -estimator
54
3.5 Moment convergence of M -estimators
58
3.6 Studentized M -estimators
61
3.7 L-estimators
63
3.8 Moment convergence of L-estimators
70
3.9 Sequential M - and L-estimators
72
3.10 R-estimators
74
3.11 Numerical illustration
77
3.12 Computation and software notes
80
3.13 Problems and complements
83
4 Robust estimators in linear model
85
4.1 Introduction
85
4.2 Least squares method
87
4.3 M -estimators
94
4.4 GM -estimators
98
4.5 S-estimators and M M -estimators
100
4.6 L-estimators, regression quantiles
101
4.7 Regression rank scores
104
4.8 Robust scale statistics
106
CONTENTS
vii
4.9 Estimators with high breakdown points
109
4.10 One-step versions of estimators
110
4.11 Numerical illustrations
112
4.12 Computation and software notes
115
4.13 Problems and complements
126
5 Multivariate location model
129
5.1 Introduction
129
5.2 Multivariate M -estimators of location and scatter
129
5.3 High breakdown estimators of multivariate location and scatter 132 5.4 Admissibility and shrinkage
133
5.5 Numerical illustrations and software notes
134
5.6 Problems and complements
139
6 Some large sample properties of robust procedures
141
6.1 Introduction
141
6.2 M -estimators
142
6.3 L-estimators
144
6.4 R-estimators
146
6.5 Interrelationships of M -, L- and R-estimators
146
6.6 Minimaximally robust estimators
150
6.7 Problems and complements
153
7 Some goodness-of-fit tests
155
7.1 Introduction
155
7.2 Tests of normality of the Shapiro-Wilk type with nuisance regression and scale parameters
155
7.3 Goodness-of-fit tests for general distribution with nuisance regression and scale
158
7.4 Numerical illustration
160
7.5 Computation and software notes
166
viii
CONTENTS
Appendix A: R system A.1 Brief R overview
173 174
References
181
Subject index
191
Author index
195
Preface
Robust statistical procedures became a part of the general statistical consciousness. Yet, students first learn descriptive statistics and the classical statistical procedures. Only later students and practical statisticians hear that the classical procedures should be used with great caution and that their favorite and simple least squares estimator and other procedures should be replaced with the robust or nonparametric alternatives. To be convinced to use the robust methods, one needs a reasonable motivation; but everybody needs motivation of their own: a mathematically–oriented person demands a theoretical proof, while a practitioner prefers to see the numerical results. Both aspects are important. The robust statistical procedures became known to the Prague statistical society by the end of the 1960s, thanks to Jaroslav H´ajek and his contacts with Peter J. Huber, Peter J. Bickel and other outstanding statisticians. Frank Hampel presented his “Some small sample asymptotics,” also touching Mestimates, at H´ajek’s conference in Prague in 1973; and published his paper in the Proceedings. Thus, we had our own experience with the first skepticism toward the robust methods, but by 1980 we started organizing regular workshops for applied statisticians called ROBUST. The 14th ROBUST will be in January 2006. On this occasion, we express our appreciation to Jarom´ır Antoch, the main organizer. The course “Robust Statistical Methods” is now a part of the master study of statistics at Charles University in Prague and is followed by all statistical students. The present book draws on experience obtained during these courses. We supplement the theory with examples and computational procedures in the system R. We chose R as a suitable tool because R seems to be one of the best statistical environments available. It is also free and the R project is an open source project. The code you are using is always available for you to see. Detailed information about the system R and the R project is available from http://www.r-project.org/. The prepared procedures and dataset, not available from the public resource, can be found on website: http://www.fp.vslib.cz/kap/picek/robust/.
ix
x
PREFACE
We acknowledge the support of the Czech Republic Grant 201/05/2340, the Research Projects MSM 0021620839 of Charles University in Prague, and MSM 467488501 of Technical University in Liberec. The authors would also like to thank the editors and anonymous referees who contributed considerably to the readability of the text. Our gratitude also belongs to our families for their support and patience.
Jana Jureˇckov´ a Jan Picek Prague and Liberec
Authors
Jana Jureˇ ckov´ a is a professor of statistics and probability at Charles University in Prague, Czech Republic. She is a coauthor of Robust Statistical Inference: Asymptotics and Inter-Relations (with P.K. Sen, John Wiley & Sons, 1996) and of Adaptive Regression (with Y. Dodge, Springer, 2000). She is a fellow of the Institute of Mathematical Statistics and an elected member of the International Statistical Institute. She was an associate editor of the Annals of Statistics for six years, and previously for two years in the 1980s; and now is an associate editor of Journal of the American Statistical Association. Jureˇckov´ a participates in extensive international cooperation, mainly with statisticians of Belgium, Canada, France, Switzerland and United States. Jan Picek is an associate professor of applied mathematics at Technical University in Liberec, Czech Republic.
xi
Introduction
If we analyze the data with the aid of classical statistical procedures, based on parametric models, we usually tacitly assume that the regression is linear, the observations are independent and homoscedastic, and assume the normal distribution of errors. However, when today we can simulate data from any probability distribution and from various models with our high–speed computers and follow the graphics, which was not possible before, we observe that these assumptions are often violated. Then we are mainly interested in the two following questions: a) When should we still use the classical statistical procedures, and when are they still optimal? b) Are there other statistical procedures that are not so closely connected with special models and conditions? The classical procedures are typically parametric: the model is fully specified up to the values of several scalar or vector parameters. These parameters typically correspond to the probability distributions of the random errors of the model. If we succeed in estimating these parameters or in testing a hypothesis on their domain, we can use our data and reach a definite conclusion. However, this conclusion is correct only under the validity of our model. An opposite approach is using the nonparametric procedures. These procedures are independent of or only weakly dependent on the special shape of the basic probability distribution, and they behave reasonably well (though not just optimally) for a broad class of distribution functions, e.g., that of distribution functions with densities, eventually symmetric. The discrete probability distributions do not create a serious problem, because their forms usually follow the character of the experiment. A typical representative of nonparametric statistical procedures is the class of the rank tests of statistical hypotheses: the “null distributions” of the test criterion (i.e., the distributions under the hypothesis H0 of interest) coincide under all continuous probability distribution functions of the observations. Unlike the parametric models with scalar or vector parameters, the nonparametric models consider the whole density function or the regression function as an unknown parameter of infinite dimension. If this functional parameter is only a nuisance, i.e., if our conclusions concern other entities of the model, 1
2
INTRODUCTION
then we try to avoid its estimation, if possible. The statistical procedures — considering the unknown density, regression function or the influence function as a nuisance parameter, while the inference concerns something else — are known as semiparametric procedures; they were developed mainly during the past 30 years. On the other hand, if just the unknown density, regression functions, etc. are our main interests, then we try to find their best possible estimates or the tests of hypotheses on their shapes (goodness-of-fit tests). Unlike the nonparametric procedures, the robust statistical procedures do not try to behave necessarily well for a broad class of models, but they are optimal in some way in a neighborhood of some probability distribution, e.g., normal. Starting with the 1940s, the statisticians successively observed that even small deviations from the normal distribution could be harmful, and can strongly impact the quality of the popular least squares estimator, the classical F test and of other classical methods. Hence, the robust statistical procedures were developed as modifications of the classical procedures, which do not fail under small deviations from the assumed conditions. They are optimal, in a specified sense, in a special neighborhood of a fixed probability distribution, defined with respect to a specified distance of probability measures. As such, the robust procedures are more efficient than the nonparametric ones that pay for their universality by some loss of their efficiency. When we speak of robust statistical procedures, we usually have the estimation procedures in mind. There are also robust tests of statistical hypotheses, namely tests of the Wald type, based on the robust estimators; but whenever possible we recommend using the rank tests instead, mainly for their simplicity and high efficiency. In the list of various statistical procedures, we cannot omit the adaptive procedures that tend to the optimal parametric estimator or test with an increasing number of observations, either in probability or almost surely. Hence, these procedures adapt themselves to the pertaining parametric model with an increasing number of observations, which would be highly desirable. However, this convergence is typically very slow and the optimality is attained under a nonrealistic number of observations. Partially adaptive procedures also exist that successively tend to be best from a prescribed finite set of possible decisions; the choice of prescribed finite set naturally determines the success of our inference. The adaptive, nonparametric, robust and semiparametric methods developed successively, mainly since the 1940s, and they continue to develop as the robust procedures in multivariate statistical models. As such, they are not disjoint from each other, there are no sharp boundaries between these classes, and some concepts and aspects appear in all of them. This book, too, though oriented mainly to robust statistical procedures, often touches other statistical methods. Our ultimate goal is to show and demonstrate which alternative procedures to apply when we are not sure of our model.
INTRODUCTION
3
Mathematically, we consider the robust procedures as the statistical functionals, defined on the space of distribution functions, and we are interested in their behavior in a neighborhood of a specific distribution or a model. This neighborhood is defined with respect to a specified distance; hence, we should first consider possible distances on the space of distribution functions and pertaining to basic characteristics of statistical functionals, such as their continuity and derivatives. This is the theoretic background of the robust statistical procedures. However, robust procedures were developed as an alternative to the practical statistical procedures, and they should be applied to practical problems. Keeping this in mind, we also must pay great attention to the computational aspects, and refer to the computational programs that are available or provide our own. As such, we hope that the readers will use, with understanding, robust procedures to solve their problems.
CHAPTER 1
Mathematical tools of robustness
1.1 Statistical model A random experiment leads to some observed values; denote them X1 , . . . , Xn . To make a formal analysis of the experiment and its results, we include everything in the frame of a statistical model. The classical statistical model assumes that the vector X = (X1 , . . . , Xn ) can attain the values in a sample space X (or Xn ) and the subsets of X are random events of our interest. If X is finite, then there is no problem working with the family of all its subsets. However, some space X can be too rich, as, e.g., the n-dimensional Euclidean space; then we do not consider all its subsets, but restrict our considerations only to some properly selected subsets/events. In order to describe the experiments and the events mathematically, we consider the family of events that creates a σ-field, i.e., that is closed with respect to the countable unions and the complements of its elements. Let us denote it as B (or Bn ). The probabilistic behavior of random vector X is described by the probability distribution P, which is a set function defined on B. The classical statistical model is a family P = {Pθ , θ ∈ Θ} of probability distributions, to which our specific distribution P also belongs. While we can observe X1 , . . . , Xn , the parameter θ is unobservable. It is a real number or a vector and can take on any value in the parametric space Θ ⊆ Rp , where p is a positive integer. The triple {X , B, Pθ : θ ∈ Θ} is the parametric statistical model. The components X1 , . . . , Xn are often independent copies of a random variable X, but they can also form a segment of a time series. The model is multivariate, when the components Xi , i = 1, . . . , n are themselves vectors in Rk with some positive integer k. The type of experiment often fully determines the character of the parametric model. We easily recognize a special discrete probability distribution, as Bernoulli (alternative distribution), binomial, multinomial, Poisson and hypergeometric distributions. Example 1.1 The binomial random variable is a number of successful trials among n independent Bernoulli trials: the i-th Bernoulli trial can result either in success with probability p: then we put Xi = 1 - or in a failure with probability 1 − p: then we put Xi = 0. In the case of n trials, the binomial 5
6
MATHEMATICAL TOOLS OF ROBUSTNESS random variable X is equal to X = ni=1 Xi and can take all integer values 0, 1, . . . , n; specifically, n P (X = k) = pk (1 − p)n−k , k = 0, . . . , n, 0 ≤ p ≤ 1 k Example 1.2 Let us have n independent trials, each of them leading exactly to one of k different outcomes, to the i-th one with probability pi , i = 1, . . . , k, k i=1 pi = 1. Then the i-th component Xi of the multinomial random vector X is the number of trials, leading to the outcome i, and P (X1 = x1 , . . . , Xn = xn ) =
n! px1 . . . pxk k x1 ! . . . xk ! 1
for any vector (x1 , . . . , xk ) of integers, 0 ≤ xi ≤ n, i = 1, . . . , k, satisfying k i=1 xi = n. Example 1.3 The Poisson random variable X is, e.g., the number of clients arriving in the system in a unit interval, the number of electrons emitted from the cathode in a unit interval, etc. Then X can take on all nonnegative integers, and if λ is the intensity of arrivals, emission, etc., then P (X = k) = e−λ
λk , k!
k = 0, 1, . . .
Example 1.4 The hypergeometric random variable X is, e.g., a number of defective items in a sample of size n taken from a finite set of products. If there are M defectives in the set of N products, then M N −M k n−k P (X = k) = N n for all integers k satisfying 0 ≤ k ≤ M and 0 ≤ n−k ≤ N −M ; this probability is equal to 0 otherwise. Among the continuous probability distributions, characterized by the densities, we most easily identify the asymmetric distributions concentrated on a halfline. For instance, the waiting time or the duration of a service can be characterized by the gamma distribution. Example 1.5 The gamma random variable X has density function (see Figure 1.1) b b−1 −ax a x e if x ≥ 0 Γ(b) f (x) = 0 if x < 0 where a and b are positive constants. The special case (b=1) is the exponential distribution.
7
0.00
0.05
0.10
0.15
0.20
0.25
STATISTICAL MODEL
0
2
4
6
8
Figure 1.1 The density function of gamma distribution with b = 3 and a = 1.
In technical practice we can find many other similar examples. However, the symmetric distributions with continuous probability densities are hardly distinguished from each other by simply looking at the data. The problem is also with asymmetric distributions extended on the whole real line. We should either test a hypothesis on their shape or, lacking knowledge of the distribution shape, use a robust or nonparametric method of inference. Most of the statistical procedures elaborated in the past were derived under the normality assumption, i.e., under the condition that the observed data come from a population with the Gaussian/normal distribution. People believed that every symmetric probability distribution described by a density is approximately normal. The procedures based on the normality assumption usually have a simple algebraic structure, thus one is tempted to use them in all situations in which the data can take on symmetrically all real values, forgetting the original normality assumption. For instance, the most popular least squares estimator (LSE) of regression or other parameters, though
8
MATHEMATICAL TOOLS OF ROBUSTNESS
seemingly universal, is closely connected with the normal distribution of the measurement errors. That itself would not matter, but the LSE fails when even a small fraction of data comes from another population whose distribution has heavier tails than the normal one, or when the dataset is contaminated by some outliers. At present, these facts can be easily demonstrated numerically with simulated data, while this was not possible before the era of high-speed computers. But these facts are not only verified with computers; the close connection of the least squares and the normal distribution is also supported by strong theoretical arguments, based on the characterizations of the normal distribution by means of properties of estimators and other procedures. For instance, Kagan, Linnik, and Rao (1973) proved that the least squares estimator (LSE) of the regression parameter in the linear regression model with a continuous distribution of the errors is admissible with respect to the quadratic risk (i.e., there is no other estimator with uniformly better quadratic risk), if and only if the distribution of the measurement errors is normal. The Student t-test, the Snedecor F -test and the test of the linear hypothesis were derived under the normality assumption. While the t-test is relatively robust to deviations from the normality, the F -test is very sensitive in this sense and should be replaced with a rank test, unless the normal distribution is taken for granted. If we are not sure by the parametric form of the model, we can use either of the following possible alternative procedures: a) Nonparametric approach: We give up a parametrization of Pθ by a real or vector parameter θ, and replace the family {Pθ : θ ∈ Θ} with a broader family of probability distributions. b) Robust approach: We introduce an appropriate measure of distance of statistical procedures made on the sample space X , and study the stability of the classical procedures, optimal for the model Pθ , under small deviations from this model. At the same time, we try to modify slightly the classical procedures (i.e., to find robust procedures) to reduce their sensitivity.
1.2 Illustration on statistical estimation Let X1 , . . . , Xn be independent observations, identically distributed with some probability distribution Pθ , where θ is unobservable parameter, θ ∈ Θ ⊆ Rp . Let F (x, θ) be the distribution function of Pθ . Our problem is to estimate the unobservable parameter θ. We have several possibilities, for instance (1) maximal likelihood method (2) moment method
STATISTICAL FUNCTIONAL
9
2
(3) method of χ -minimum, or another method minimizing another distance of the empirical and the true distributions (4) method based on the sufficient statistics (Rao-Blackwell Theorem) and on the complete sufficient statistics (Lehmann-Scheff´e Theorem) In the context of sufficient statistics, remember the very useful fact in nonparametric models that the ordered sample (the vector of order statistics) sufficient statistic for the family of Xn:1 ≤ Xn:2 ≤ . . . ≤ Xn:n is a complete n probability distributions with densities i=1 f (xi ), where f is an arbitrary continuous density. This corresponds to the model in which the observations create an independent random sample from an arbitrary continuous distribution. If θ is a one-dimensional parameter, thus a real number, we are intuitively led to the class of L-estimators of the type Tn =
n
cni h(Xn:i )
i=1
based on order statistics, with suitable coefficients cni , i = 1, . . . , n, and a suitable function h(·). (5) Minimization of some (criterion) function of observations and of θ : e.g., the minimization n ρ(Xi , θ) := min, θ ∈ Θ i=1
with a suitable non-constant function ρ(·, ·). As an example we can consider ρ(x, θ) = − log f (x, θ) leading to the maximal likelihood estimator θˆn . The estimators of this type are called M -estimators, or estimators of the maximum likelihood type. (6) An inversion of the rank tests of the shift in location, of the significance of regression, etc. leads to the class of R-estimators, based on the ranks of the observations or of their residuals. These are the M -, L- and R-estimators, and some other robust methods that create the main subject of this book.
1.3 Statistical functional Consider a random variable X with probability distribution Pθ with distribution function F, where Pθ ∈ P = {Pθ : θ ∈ Θ ⊆ Rp }. Then in many cases θ can be looked at as a functional θ = T (P ) defined on P; we can also write θ = T (F ). Intuitively, a natural estimator of θ, based on observations X1 , . . . , Xn , is T (Pn ), where Pn is the empirical probability distribution of vector (X1 , . . . , Xn ), i.e., 1 I[Xi ∈ A], n i=1 n
Pn (A) =
A∈B
(1.1)
10
MATHEMATICAL TOOLS OF ROBUSTNESS
Otherwise, Pn is the uniform distribution on the set {X1 , . . . , Xn }, because Pn ({Xi }) = n1 , i = 1, . . . , n. Distribution function, pertaining to Pn , is the empirical distribution function 1 I[Xi ≤ x], x ∈ R n i=1 n
Fn (x) = Pn ((−∞, x]) =
(1.2)
Example 1.6 (1) Expected value: T (P )
=
T (Pn ) =
R
= EX
xdP
¯n = 1 =X Xi n i=1 n
R xdPn
(2) Variance: T (P ) =
var X = R
x2 dP − (EX)2
1 2 ¯ n2 X −X n i=1 i n
T (Pn ) =
(3) If T (P ) = R h(x)dP, where h is an arbitrary P -integrable function, then an empirical counterpart of T (P ) is 1 h(Xi ) n i=1 n
T (Pn ) =
(4) Conversely, we can find a statistical functional corresponding to a given statistical estimator: for instance, the geometric mean of observations X1 , . . . , Xn is defined as T (Pn ) = Gn =
n
1/n Xi
i=1
1 log Gn = log Xi = n i=1 n
R
log xdPn
hence the corresponding statistical functional has the form
T (P ) = exp log xdP R
Similarly, the harmonic mean T (Pn ) = Hn of observations X1 , . . . , Xn is
FISHER CONSISTENCY
11
defined as 1 1 1 = Hn n i=1 Xi n
and the corresponding statistical functional has the form −1 1 dP T (P ) = H = R x Statistical functionals were first considered by von Mises (1947). The estimator T (Pn ) should tend to T (P ), as n → ∞, with respect to some type of convergence defined on the space of probability measures. Mostly, it is a convergence in probability and in distribution, or almost sure convergence; but an important characteristic also is the large sample bias of estimator T (Pn ), i.e., limn→∞ |E[T (Pn ) − T (P )]|, which corresponds to the convergence in the mean. Because we need to study the behavior of T (Pn ) also in a neighborhood of P, we consider an expansion of the functional (T (Pn ) − T (P )) of the Taylor type. To do it, we need some concepts of the functional analysis, as various distances Pn and P, and their relations, and the continuity and differentiability of functional T with respect to the considered distance.
1.4 Fisher consistency A reasonable statistical estimator should have the natural property of Fisher consistency, introduced by R. A. Fisher (1922). We say that estimator θˆn , based on observations X1 . . . , Xn with probability distribution P , is a Fisher consistent estimator of parameter θ, if, written as a functional θˆn = T (Pn ) of the empirical probability distribution of vector (X1 , . . . , Xn ), n = 1, . . . , it satisfies T (P ) = θ. The following example shows that this condition is not always automatically satisfied. 2 Example 1.7 Let θ = var X = T (P ) = R x2 dP − R xdP be the variance ¯ n )2 is Fisher of P. Then the sample variance θˆn = T (Pn ) = n1 ni=1 (Xi − X − n1 θ. On the other hand, the consistent; but it is biased, because Eθˆn = 1 1 2 ¯ n )2 is not a Fisher unbiased estimator of the variance Sn = n−1 ni=1 (Xi − X consistent estimator of θ, because n n T (Pn ) and T (P ) = T (P ) Sn2 = n−1 n−1 From the robustness point of view, the natural property of Fisher consistency of an estimator is more important than its unbiasedness; hence it should be first checked on a statistical functional.
12
MATHEMATICAL TOOLS OF ROBUSTNESS
1.5 Some distances of probability measures Let X be a metric space with metric d, separable and complete, and denote B the σ-field of its Borel subsets. Furthermore, let P be the system of all probability measures on the space (X , B). Then P is a convex set, and on P we can introduce various distances of its two elements P, Q ∈ P. Let us briefly describe some such distances, mostly used in mathematical statistics. For those who want to learn more about such and other distances and other related topics, we refer to the literature of the functional analysis and the probability theory, e.g., Billingsley (1998) or Fabian et al. (2001). (1) The Prochorov distance: inf{ε > 0 : P (A) ≤ Q(Aε ) + ε ∀A ∈ B, A = ∅}
dP (P, Q) =
where Aε = {x ∈ X : inf y∈A d(x, y) ≤ ε} is a closed ε-neighborhood of a non-empty set A. (2) The L´evy distance: X = R is the real line; let F, G be the distribution functions of probability measures P, Q, then dL (F, G)
=
inf{ε > 0 : F (x − ε) − ε ≤ G(x) ≤ F (x + ε) + ε∀x ∈ R}
(3) The total variation: dV (P, Q) = sup |P (A) − Q(A)| A∈B
We easily verify that dV (P, Q) =
X
|dP − dQ|
(4) The Kolmogorov distance: X = R is the real line and F, G are the distribution functions of probability measures P, Q, then dK (F, G) = sup |F (x) − G(x)| x∈R
(5) The Hellinger distance: dH (P, Q) = dP dµ
√ X
2 dP − dQ
1/2
dQ dµ
If f = and g = are densities of P, Q with respect to some measure µ, then the Hellinger distance can be rewritten in the form √ 2 (dH (P, Q))2 = f − g dµ = 2 1 − f gdµ X
X
RELATIONS BETWEEN DISTANCES
13
(6) The Lipschitz distance: Assume that d(x, y) ≤ 1 ∀x, y ∈ X (we take the d metric d = 1+d otherwise), then dLi (P, Q) = sup ψdP − ψdQ ψ∈L
X
X
where L = {Ψ : X → R : |ψ(x)− ψ(y)| ≤ d(x, y)} is the set of the Lipschitz functions. (7) Kullback-Leibler divergence: Let p, q be the densities of probability distributions P, Q with respect to measure µ (Lebesgue measure on the real line or the counting measure), then q(x) dKL (Q, P ) = q(x)ln dµ(x) p(x) The Kullback-Leibler divergence is not a metric, because it is not symmetric in P, Q and does not satisfy the triangle inequality. More on distances of probability measures can be found in Gibbs and Su (2002), Liese and Vajda (1987), Rachev (1991), Reiss (1989) and Zolotarev (1983), among others.
1.6 Relations between distances The family P of all probability measures on (X , B) is a metric space with respect to each of the distances described above. On this metric space we can study the continuity and other properties of the statistical functional T (P ). Because we are interested in the behavior of the functional, not only at distribution P , but also in its neighborhood; we come to the question, which distance is more sensitive to small deviations of P. The following inequalities between the distances show not only which distance eventually dominates above others, but also illustrate their relations. Their verification we leave as an exercise: d2H (P, Q)
≤ 2dV (P, Q)
≤ 2dH (P, Q)
d2P (P, Q)
≤ dLi (P, Q)
≤ 2dP (P, Q) ∀ P, Q ∈ P
(1.3)
1 2 d (P, Q) ≤ dKL (P, Q) 2 V if X = R, then it further holds: dL (P, Q)
≤ dP (P, Q)
≤ dV (P, Q)
dL (P, Q)
≤ dK (P, Q)
≤ dV (P, Q) ∀ P, Q ∈ P
(1.4)
14
MATHEMATICAL TOOLS OF ROBUSTNESS
Example 1.8 Let P be the exponential distribution with density −x e ... x ≥ 0 f (x) = 0 ... x < 0 and let Q be the uniform distribution R(0, 1) with density 1 ... 0 ≤ x ≤ 1 g(x) = 0 . . . otherwise Then
2dV (P, Q) =
1
1 − e−x dx +
0
∞
e−x dx = 1 +
1
1 2 1 −1+ = e e e
hence dV (exp, R(0, 1) ≈ 0.3679. Furthermore, dK (P, Q) = sup 1 − e−x − xI[0 ≤ x ≤ 1] − I[x > 1] x≥0
= e−1 ≈ 0.1839 and
d2H (exp, R(0, 1)) = 2 1 −
1
0
√ 2 e−x dx = 2 √ − 1 e
thus dH (exp, R(0, 1)) ≈ 0.6528. Finally
dK L(R(0, 1), exp) =
1
ln 0
1 1 dx = e−x 2
1.7 Differentiable statistical functionals Let again P be the family of all probability measures on the space (X , B, µ), and assume that X is a complete separable metric space with metric d and that B is the system of the Borel subsets of X . Choose some distance δ on P and consider the statistical functional T (·) defined on P. If we want to analyze an expansion of T (·) around P, analogous to the Taylor expansion, we must introduce the concept of a derivative of statistical functional. There are more possible definitions of the derivative, and we shall consider three of them: the Gˆ ateau derivative, the Fr´echet and the Hadamard derivative, and compare their properties from the statistical point of view. Definition 1.1 Let P, Q ∈ P and let t ∈ [0, 1]. Then the probability distribution Pt (Q) = (1 − t)P + tQ
(1.5)
is called the contamination of P by Q in ratio t. Remark 1.1 Pt (Q) is a probability distribution, because P is convex. P0 (Q) = P means an absence of the contamination, while P1 (Q) = Q means the full contamination.
ˆ GATEAU DERIVATIVE
15
1.8 Gˆ ateau derivative Fix two distributions P, Q ∈ P and denote ϕ(t) = T ((1−t)P +tQ), 0 ≤ t ≤ 1. Suppose that the function ϕ(t) has the final n-th derivative ϕ(n) , and that the derivatives ϕ(k) are continuous in interval (0, 1) and that the right-hand (k) derivatives ϕ+ are right-continuous at t = 0, k = 1, . . . , n − 1. Then we can consider the Taylor expansion around u ∈ (0, 1) ϕ(t) = ϕ(u) +
n−1 k=1
ϕ(k) (u) ϕ(n) (v) (t − u)k + (t − u)n , v ∈ [u, t] k! n!
(1.6)
We are mostly interested in the expansion on the right of u = 0, that corresponds to a small contamination of P. For that we replace derivatives ϕ(k) (0) (k) with the right-hand derivatives ϕ+ (0). The derivative ϕ+ (0) is called the Gˆ ateau derivative of functional T in P in direction Q. Definition 1.2 We say that functional T is differentiable in the Gˆ ateau sense in P in direction Q, if there exists the limit TQ (P ) = lim
t→0+
T (P + t(Q − P )) − T (P ) t
(1.7)
ateau derivative T in P in direction Q. TQ (P ) is called the Gˆ Remark 1.2 a) The Gˆ ateau derivative TQ (P ) of functional T is equal to the ordinary right derivative of function ϕ at the point 0, i.e., TQ (P ) = ϕ (0+ ) b) Similarly defined is the Gˆ ateau derivative of order k: k d (k) TQ (P ) = T (P + t(Q − p)) = ϕ(k) (0+ ) dtk t=0+ c) In the special case when Q is the Dirac probability measure Q = δx assigning probability 1 to the one-point set {x} x, we shall use a simpler notation Tδx (P ) = Tx (P ) In the special case t = 1, u = 0 the Taylor expansion (1.6) reduces to the form n−1 TQ(k) (P ) 1 dn + T (P + t(Q − p)) (1.8) T (Q) − T (P ) = k! n! dtn t=t∗ k=1
∗
where 0 ≤ t ≤ 1.
16
MATHEMATICAL TOOLS OF ROBUSTNESS
Example 1.9 (a) Expected value: T (P ) =
X
ϕ(t) = X
xdP = EP X
xd((1 − t)P + tQ) = (1 − t)EP X + tEQ X
=⇒ ϕ (t) = EQ X − EP X TQ (P ) = ϕ (0+ ) = EQ X − EP X Finally we obtain for Q = δx Tx = x − EP X (b) Variance: T (P ) = varP X = EP (X 2 ) − (EP X)2 T ((1 − t)P + tQ) = x2 d((1 − t)P + tQ) X
−
X
2 xd((1 − t)P + tQ)
=⇒ ϕ(t) = (1 − t)EP X 2 + tEQ X 2 − (1 − t)2 (EP X)2 2
−t2 (EQ X) − 2t(1 − t)EP X · EQ X ϕ (t) = −EP X 2 + EQ X 2 2
2
+2(1 − t) (EP X) − 2t (EQ X) −2(1 − 2t)EP X · EQ X This further implies lim ϕ (t) = TQ (P )
t→0+
2
= EQ X 2 − EP X 2 − 2EP X · EQ X + 2 (EP X) and finally we obtain for Q = δx Tx (P ) = x2 − EP X 2 − 2xEP X + 2 (EP X)
2
= (x − EP X)2 − varP X
´ FRECHET DERIVATIVE
17
1.9 Fr´ echet derivative Definition 1.3 We say that functional T is differentiable in P in the Fr´echet sense, if there exists a linear functional LP (Q − P ) such that T (P + t(Q − P )) − T (P ) = LP (Q − P ) t uniformly in Q ∈ P, δ(P, Q) ≤ C for any fixed C ∈ (0, ∞). lim
t→0
(1.9)
The linear functional LP (Q − P ) is called the Fr´echet derivative of functional T in P in direction Q. Remark 1.3 a) Because LP is a linear functional, there exists a function g : X → R such that LP (Q − P ) = gd(Q − P ) (1.10) X
b) If T is differentiable in the Fr´echet sense, then it is differentiable in the Gˆ ateau sense, too, i.e., there exists TQ (P ) ∀Q ∈ P, and it holds TQ (P ) = LP (Q − P )
∀Q ∈ P
Especially, Tx (P ) = LP (δx − P ) = g(x) − and this further implies EP (Tx (P ))
= X
(1.11)
gdP
(1.12)
X
Tx (P )dP = 0.
(1.13)
c) Let Pn be the empirical probability distribution of vector (X1 . . . , Xn ). n Then Pn − P = n1 i=1 (δXi − P ) . Hence, because LP is a linear functional, 1 1 LP (δXi − P ) = T (P ) = TP n (P ) n i=1 n i=1 Xi n
LP (Pn − P ) =
n
(1.14)
Proof of (1.11): Actually, because LP (·) is a linear functional, we get by (1.9) TQ (P ) = =
lim
T (P + t(Q − P )) − T (P ) t
lim
T (P + t(Q − P )) − T (P ) − LP (Q − P ) t
t→0+
t→0+
+ LP (Q − P ) = 0 + LP (Q − P ) = LP (Q − P )
2
18
MATHEMATICAL TOOLS OF ROBUSTNESS
1.10 Hadamard (compact) derivative If there exists a linear functional L(Q − P ) such that the convergence (1.9) is uniform not necessarily for bounded subsets of the metric space (P, δ) containing P, i.e., for all Q satisfying δ(P, Q) ≤ C, 0 < C < ∞, but only for Q from any fixed compact set K ⊂ P containing P ; then we say that functional T is differentiable in the Hadamard sense, and we call the functional L(Q − P ) the Hadamard (compact) derivative of T. The Fr´echet differentiable functional is obviously also Hadamard differentiable, and it is, in turn, also Gˆ ateau differentiable, similarly as in Remark 1.3. We refer to Fernholz (1983) and to Fabian et al. (2001) for more properties of differentiable functionals. The Fr´echet differentiability imposes rather restrictive conditions on the functional that are not satisfied namely by the robust functionals. On the other hand, when we have a Fr´echet differentiable functional, we can easily derive the large sample (normal) distribution of its empirical counterpart, when the number n of observations infinitely increases. If the functional is not sufficiently smooth, we can sometimes derive the large sample normal distribution of its empirical counterpart with the aid of the Hadamard derivative. If we only want to prove that T (Pn ) is a consistent estimator of T (P ), then it suffices to consider the continuity of T (P ). The Gˆ ateau derivative of Tx (P ), called the influence function of functional T, is one of the most important characteristics of its robustness and will be studied in Chapter 2 in detail.
1.11 Large sample distribution of empirical functional Consider again the metric space (P, δ) of all probability distributions on (X , B), with metric δ satisfying √ nδ(Pn , P ) = Op (1) as n → ∞, (1.15) where Pn is the empirical probability distribution of the random sample (X1 , . . . , Xn ), n = 1, 2, . . . . The convergence (1.15) holds, e.g., for the Kolmogorov distance of the empirical distribution function from the true one, which is the most important for statistical applications; but it holds also for other distances. As an illustration of the use of the functional derivatives, let us show that the Fr´echet differentiability, together with the classical central limit theorem, always give the large sample (asymptotic) distribution of the empirical functional T (Pn ). Theorem 1.1 Let T be a statistical functional, Fr´echet differentiable in P,
COMPUTATION AND SOFTWARE NOTES
19
and assume that the empirical probability distribution Pn of the random sample (X1 , . . . , Xn ) satisfies the condition (1.15) as n → ∞. If the variance of the Gˆ √ateau derivative TX1 (P ) is positive, varP TX1 (P ) > 0, then the sequence n(T (Pn ) − T (P )) is asymptotically normally distributed as n → ∞, namely L T (Pn ) − T (P ) −→ N 0, varP TX (P ) (1.16) 1 Proof. By (1.14), TP n (P ) = tion (1.15) we obtain
1 n
n i=1
TX (P ) and further by (1.8) and condii
n √ 1 n(T (Pn ) − T (P )) = √ T (P ) + Rn n i=1 Xi
√ 1 = √ LP (Pn − P ) + n o(δ(Pn , P )) n i=1 n
(1.17)
1 T (P ) + op (1) = √ n i=1 Xi n
(P ) = varP TX (P ), i = 1, . . . , n, is finite, then If the joint variance varP TX i 1 (1.16) follows from (1.17) and from the classical central limit theorem. 2
Example 1.10 Let T (P ) = varP X = σ 2 , then 1 ¯ n )2 (Xi − X n i=1 n
T (Pn ) = Sn2 = and, by Example 1.9b),
Tx (P ) = (x − EP X)2 − varP X hence (P ) = EP (X − EP X)4 − E2P (X − EP X)2 = µ4 − µ22 varP TX
and by Theorem 1.1 we get the large sample distribution of the sample variance √ L n(Sn2 − σ 2 ) −→ N 0, µ4 − µ22
1.12 Computation and software notes We chose R (a language and environment for statistical computing and graphics) as a suitable tool for numerical illustration and the computation. R seems to us to be one of the best statistical environments available. It is also free and the R project is an open source project. The code you are using is always available for you to view. Detailed information about R and the R project is available from http://www.r-project.org/.
20
MATHEMATICAL TOOLS OF ROBUSTNESS
Chapter 1 focuses mainly on the theoretical background. However, Section 1.1 mentioned some specific distribution shapes. The R system has built-in functions to compute the density, distribution function and quantile function for many standard distributions, including ones mentioned in Section 1.1 (see Table 1.1). Table 1.1 R function names and parameters for selected probability distributions.
Distribution
R name
Parameters
binomial exponential gamma hypergeometric normal Poisson uniform
binom exp gamma hyper norm pois unif
size, prob rate shape, scale m, n, k mean, sd lambda min, max
The first letter of the name of the R function indicates the function: dXXXX, pXXXX, qXXXX are, respectively, the density, distribution and quantile functions. The first argument of the function is the quantile q for the densities and distribution functions, and the probability p for quantile functions. Additional arguments specify the parameters. These functions can be used as the statistical tables. Here are some examples: • P (X = 3) - binomial distribution with n=20, p = 0.1 > dbinom(3,20,0.1) [1] 0.1901199 • P (X ≤ 5) - Poisson distribution with λ=2 > ppois(5,2) [1] 0.9834364 • 95% – quantile of standard normal distribution > qnorm(0.95) [1] 1.644854 • The density function of gamma distribution with b = 3 and a = 1 ( see Figure 1.1). > plot(seq(0,8,by=0.01),dgamma(seq(0,8,by=0.01),3,1), + type="l", ylab="",xlab="")
COMPUTATION AND SOFTWARE NOTES
21
System R enables the generation of random data. The corresponding functions have prefix r and first argument n, the size of the sample required. For example, we can generate a sample of size 1000 from gamma distribution (b = 3; a = 1) by > rgamma(1000,3,1) R has a function hist to plot histograms. > hist(rgamma(1000,3,1), prob=TRUE,nclass=16) We obtain Figure 1.2, compare with Figure 1.1.
0.15 0.00
0.05
0.10
Density
0.20
0.25
Histogram of rgamma(1000, 3, 1)
0
2
4
6
8
10
rgamma(1000, 3, 1)
Figure 1.2 The histogram of the simulated sample from gamma distribution with b = 3 and a = 1.
In Section 1.5, some distances of probability measures were introduced and illustrated in Example 1.8. There we need to compute an integral. We can also use R function integrate to solve that problem. Compare the following example with the calculation in Example 1.8.
22
MATHEMATICAL TOOLS OF ROBUSTNESS
> integrate(function(x) 0.3678794 with absolute > integrate(function(x) 0.3678794 with absolute
{1-exp(-x)},0,1) error < 4.1e-15 {exp(-x)},1, Inf) error < 2.1e-05
R can also help us in the case of discrete distribution. Let P be the binomial distribution with parameters n = 100, p = 0.001. Let Q be Poisson distribution with parameterλ = np = 1. Then 100 ∞ n e−1 e−1 k 100−k + 0.01 2dV (P, Q) = 0.99 − k k! k! k=0
k=101
> sum(abs(choose(100,0:100)*0.01^(0:100)*(0.99)^(100:0) + -exp(-1)/factorial(0:100)))+1-sum(exp(-1)/factorial(0:100)) [1] 0.005550589 > ## or also > sum(abs(dbinom(0:100,100,0.01)-dpois(0:100,1)))+1-ppois(100,1) [1] 0.005550589 Thus dV (Bi(100, 0.01), P o(1)) ≈ 0.0028. Similarly, dK (Bi(100, 0.01), P o(1)) ≈ 0.0018, dH (Bi(100, 0.01), P o(1)) ≈ 0.0036, dKL (Bi(100, 0.01), P o(1)) ≈ 0.000025 because > ### Kolmogorov distance > max(abs(pbinom(0:100,100,0.01)-ppois(0:100,1))) [1] 0.0018471 > ### Hellinger distance > sqrt(sum((sqrt(dbinom(0:100,100,0.01)) + -sqrt(dpois(0:100,1)))^2)) [1] 0.003562329 >### Kullback-Leibler divergence (Q,P) > sum(dpois(0:100,1)*log(dpois(0:100,1)/ + dbinom(0:100,100,0.01))) [1] 2.551112e-05 >### Kullback-Leibler divergence (P,Q) > sum(dbinom(0:100,100,0.01)*log(dbinom(0:100,100,0.01)/ + dpois(0:100,1))) [1] 2.525253e-05
PROBLEMS AND COMPLEMENTS
23
1.13 Problems and complements 1.1 Let Q be the binomial distribution with parameters n, p and let P be the Poisson distribution with parameter λ = np, then 1 (dV (Q, P ))2 ≤ dKL (Q, P ) 2 1 np3 p 1 p2 ≤ dKL (Q, P ) ≤ + + + p2 4 4 3 2 4n 1 min(p, np2 ) ≤ dV (Q, P ) ≤ 2p 1 − e−np 16 dKL (Q, P ) ≤
p2 2(1 − p)
λ2 n→∞ 4 See Barbour and Hall (1984), Csisz´ar (1967), Harremo¨es and Ruzankin lim n2 dKL (Q, P ) =
(2004), Kontoyannis et al. (2005), Pinsker (1960) for demonstrations. 1.2 Wasserstein-Kantorovich distance of distribution functions F, G of random variables X, Y : ∞ • L1 -distance on F1 = {F : −∞ |x|dF (x) < ∞} : 1 (1) dW (F, G) = |F −1 (t) − G−1 (t)|dt 0
Show that
∞ (1) dW (F, G) = −∞ |F (x) − G(x)|dx (Dobrushin (1970)). (1) dW (F, G) = inf{E|X − Y |} where the infimum is over
Show that jointly distributed X and Y with respective marginals F and G. ∞ • L2 -distance on F2 = {F : −∞ x2 dF (x) < ∞} : 1 (2) [F −1 (t) − G−1 (t)]2 dt dW (F, G) =
all
0
(2) dW (F, G)
Show that = inf{E(X − Y )2 } where the infimum is over all jointly distributed X and Y with respective marginals F and G (Mallows (1972)). • Weighted L1 -distance: 1 (1) dW (F, G) = |F −1 (t) − G−1 (t)|w(t)dt, 0
1
w(t)dt = 1 0
24
MATHEMATICAL TOOLS OF ROBUSTNESS (1)
1.3 Show that dP (F, G) ≤ dW (F, G) (Dobrushin (1970)). 1.4 Let (X1 , . . . , Xn ) and (Y1 , . . . , Yn ) be two samples ∞independent random ∞ from distribution functions F, G such that −∞ xdF (x) = −∞ ydG(y) = 0. n , Gn be the distribution functions of n−1/2 i=1 Xi and Let Fn n n−1/2 i=1 Yi , respectively, then (2)
(2)
dW (Fn , Gn ) ≤ dW (F, G) (Mallows (1972)). 1.5 χ2 -distance: Let p, q be the densities of probability distributions P, Q with respect to measure µ (µ can be a countable measure). Then dχ2 (P, Q) is defined as (p(x) − q(x))2 dχ2 (P, Q) = dµ(x) q(x) x∈X :p(x),q(x)>0 Then 0 ≤ dχ2 (P, Q) ≤ ∞ and dχ2 is independent of the choice of the dominating measure. It is not a metric, because it is not symmetric in P, Q. Distance dχ2 is dating back to Pearson in the 1930s and has many applications in the statistical inference. The following relations hold between dχ2 and other distances: √ (i) dH (P, Q) ≤ 2(dχ2 (P, Q))1/4 (ii) If the sample space X is countable, then dV (P, Q) ≤ 12 dχ2 (P, Q) (iii) dKL (P, Q) ≤ dχ2 (P, Q) 1.6 Let P be the exponential distribution and let Q be the uniform distribution (see Example 1.8) Then 1 2 (1 − e−x ) dχ2 (Q, P ) = dx = e + e−1 − 2 e−x 0 hence dχ2 (R(0, 1), exp) ≈ 0.350402. Furthermore, 1 −x 2 1 1 dχ2 (P, Q) = e − 1 dx = − e−2 + 2 e−1 − 2 2 0 hence dχ2 (exp, R(0, 1)) ≈ 0.168091 1.7 Bhattacharyya distance: Let p, q be the densities of probability distributions P, Q with respect to measure. Then dB (P, Q) is defined as −1 dB (P, Q) = log p(x) q(x) dµ(x) x∈X :p(x),q(x)>0
(Bhattacharyya (1943)). Furthermore, for a comparison 1 √ −1 2 −x dB (exp, R(0, 1)) = log e dx = − log(2 − √ ) ≈ 0.239605 e 0
PROBLEMS AND COMPLEMENTS 1.8 Verify 2dV (P, Q) =
X
25
|dP − dQ|.
1.9 Check the inequalities 1.3. 1.10 Check the inequalities 1.4. (1)
1.11 Compute the Wasserstein-Kantorovich distances dW (F, G) and (2) dW (F, G) for the exponential distribution and the uniform distribution (as in Example 1.8).
CHAPTER 2
Basic characteristics of robustness
2.1 Influence function Expansion (1.17) of difference T (Pn ) − T (P ) says that 1 T (P ) + n−1/2 Rn n i=1 Xi n
T (Pn ) − T (P ) =
(2.1)
where the reminder term is asymptotically n−1/2 Rn = op (n−1/2 ) n negligible, 1 as n → ∞. Then we can consider n i=1 TXi (P ) as an error of estimating (P ) as a contribution of Xi to this error, or as an T (P ) by T (Pn ), and TX i influence of Xi on this error. From this point of view, a natural interpretation of the Gˆ ateau derivative Tx (P ), x ∈ X is to call it an influence function of functional T (P ). Definition 2.1 The Gˆ ateau derivative of functional T in distribution P in the direction of Dirac distribution δx , x ∈ X is called the influence function of T in P ; thus IF (x; T, P ) = Tx (P ) = limt→0+
T (Pt (δx )) − T (P ) t
(2.2)
where Pt (δx ) = (1 − t)P + tδx . As the first main properties of IF, let us mention: a) EP (IF (x; T, P )) = X Tx (P )dP = 0, hence the average influence of all points x on the estimation error is zero. b) If T is a Fr´echet differentiable functional satisfying condition (1.15), and varP (IF (x; T, P )) = EP (IF (x; T, P ))2 > 0 √ then n(T (Pn ) − T (P ) −→ N 0, varP (IF (x; T, P )) Example 2.1 (a) Expected value:
T (P ) = EP (X) = mP , then
¯n T (Pn ) = X 27
28
BASIC CHARACTERISTICS OF ROBUSTNESS IF (x; T, P ) = Tx (P ) = x − mp EP (IF (x; T, P )) = 0 varP (IF (x; T, P )) = varP X = σP2 EQ (IF (x; T, P )) = mQ − mP for Q = P √ ¯ n − mp ) −→ N (0, σP2 ) L n(X
provided P is the true probability distribution of random sample (X1 , . . . , Xn ). (b) Variance:
T (P ) = varP X = σP2 , then IF (x; T, P ) = (x − mP )2 − σP2 EP (IF (x; T, P )) = 0 varP (IF (x; T, P )) = µ4 − µ22 = µ4 − σP4 EQ (IF (x; T, P )) = EQ (X − mp )2 − σP2 2 = σQ + (mQ − mP )2 + 2EQ (X − mQ )(mQ − mP ) 2 − σP2 + (mQ − mP )2 −σP2 = σQ
2.2 Discretized form of influence function
Let (X1 , . . . , Xn ) be the vector of observations and denote Tn = T (Pn ) = Tn (X1 , . . . , Xn ) as its empirical functional. Consider what happens if we add another observation Y to X1 , . . . , Xn . The influence of Y on Tn is characterized by the difference Tn+1 (X1 , . . . , Xn , Y ) − Tn (X1 , . . . , Xn ) := I(Tn , Y ) Because =
1 δX n i=1 i
=
1 n+1
n
Pn
Pn+1
= 1−
n
δXi + δY
i=1
1 n+1
Pn +
=
1 n Pn + δY n+1 n+1
1 δY n+1
(2.3)
DISCRETIZED FORM OF INFLUENCE FUNCTION
29
we can say that Pn+1 arose from Pn by its contamination by the one-point 1 distribution δY in ratio n+1 , hence
1 1 δY − T (Pn ) I(Tn , Y ) = T 1− Pn + n+1 n+1 Because lim (n + 1)I(Tn , Y )
(2.4)
n→∞
T = lim
1−
1 n+1
Pn +
1 n+1 δY
− T (Pn )
1 n+1
n→∞
= IF (Y ; T, P ) (n + 1)I(Tn , Y ) can be considered as a discretized form of the influence function. The supremum of |I(Tn , Y )| over Y then represents a measure of sensitivity of the empirical functional Tn with respect to an additional observation, under fixed X1 , . . . , Xn . Definition 2.2 The number S(Tn ) = sup |I(Tn (X1 , . . . , Xn ), Y )|
(2.5)
Y
is called a sensitivity of functional Tn (X1 , . . . , Xn ) to an additional observation. Example 2.2 (a) Expected value: T (P ) = EP X,
¯ n , Tn+1 = X ¯ n+1 Tn = X
1 ¯n + Y ) (nX n+1 n ¯ n + 1 Y = 1 (Y − X ¯n) −1 X I(Tn , Y ) = n+1 n+1 n+1
=⇒ Tn+1 =
P
¯ n −→ Y − EP X as n → ∞ =⇒ (n + 1)I(Tn , Y ) = Y − X ¯n) = =⇒ S(X
1 ¯n| = ∞ sup |Y − X n+1 Y
Thus, the sample mean has an infinite sensitivity to an additional observation. (b) Median: Let n = 2m + 1 and let X(1) ≤ . . . ≤ X(n) be the observations ordered in increasing magnitude. Then Tn = Tn (X1 , . . . , Xn ) = X(m+1) and Tn+1 =
30
BASIC CHARACTERISTICS OF ROBUSTNESS
Tn+1 (X1 . . . , Xn , Y ) takes on the following of Y among the other observations: ⎧ X +X (m) (m+1) ⎪ ... ⎪ 2 ⎪ ⎨ X(m+1) +X(m+2) Tn+1 = ... 2 ⎪ ⎪ ⎪ Y +X ⎩ (m+1) ... 2
values, depending on the position Y ≤ X(m) Y ≥ X(m+2) X(m) ≤ Y ≤ X(m+2)
Hence, the influence of adding Y to X1 , . . . , Xn on the median is measured by ⎧ X −X (m) (m+1) ⎪ Y ≤ X(m) ⎪ 2 ⎪ ⎨ X(m+2) −X(m+1) I(Tn , Y ) = Y ≥ X(m+2) 2 ⎪ ⎪ ⎪ Y −X ⎩ (m+1) X(m) ≤ Y ≤ X(m+2) 2 Among three possible values of |I(Tn , Y )| is | 12 (Y − X(m+1) )| the smallest; thus the sensitivity of the median to an additional observation is equal to 1 1 (X(m+1) − X(m) ), (X(m+2) − X(m+1) ) S(Tn ) = max 2 2 and it is finite under any fixed X1 , . . . , Xn . 2.3 Qualitative robustness As we have seen in Example 2.1, the influence functions of the expectation and variance are unbounded and can assume arbitrarily large values. Moreover, Example 2.2 shows that adding one more observation can cause a breakdown of the sample mean. The least squares estimator (LSE) behaves analogously (in fact, the mean is a special form of the least squares estimator). Remember the Kagan, Linnik and Rao theorem, mentioned in Section 1.1, that illustrates a large sensitivity of the LSE to deviations from the normal distribution of errors. Intuitively it means that the least squares estimator (and the mean) are very non-robust. How can we mathematically express this intuitive non-robustness property, and how shall we define the concept of robustness? Historically, this concept has been developing over a rather long period, since many statisticians observed a sensitivity of statistical procedures to deviations from assumed models, and analyzed it from various points of view. It is interesting that the physicists and astronomers, who tried to determine values of various physical, geophysical and astronomic parameters by means of an average of several measurements, were the first to notice the sensitivity of the mean and the variance to outlying observations. This interesting part of the statistical history is nicely described in the book by Stigler (1986). The history goes up to 1757, when R. J. Boskovich, analyzing his experiments aiming at a characterization of the shape of the globe, proposed an estimation
QUALITATIVE ROBUSTNESS
31
method alternative to the least squares. E. S. Pearson noticed the sensitivity of the classical analysis of variance procedures to deviations from normality in 1931. J. W. Tukey and his Princeton group started a systematic study of possible alternatives to the least squares in the 1940s. The name "robust" was first used by Box in 1953. Box and Anderson (1955) characterized as robust a statistical procedure that is little sensitive to changes of the nuisance or unimportant parameters, while it is sensitive (efficient) with respect to its parameter of interest.

When we speak about robustness of a statistical procedure, we usually mean its robustness with respect to deviations from the assumed distribution of errors. However, other types of robustness are also important, such as robustness with respect to the assumed independence of observations, an assumption that is often violated in practice.

The first mathematical definition of robustness was formulated by Hampel (1968, 1971), who based the concept of robustness of a statistical functional on its continuity in a neighborhood of the considered probability distribution. The continuity and the neighborhood were considered with respect to the Prohorov metric on the space P.

Let a random variable (or random vector) X take on values in the sample space (X, B); denote by P its probability distribution. We shall try to characterize mathematically the robustness of the functional T(P) = T(X). This functional is estimated with the aid of observations X1, . . . , Xn that are independent copies of X. More precisely, we estimate T by the empirical functional Tn(Pn) = Tn(X1, . . . , Xn), based on the empirical distribution Pn of X1, . . . , Xn. Instead of the empirical functional, Tn is often called the (sample) statistic. Hampel's definition of (qualitative) robustness is based on the Prohorov metric dP on the system P of probability measures on the sample space.

Definition 2.3 We say that the sequence of statistics (empirical functionals) {Tn} is qualitatively robust for the probability distribution P, if to any ε > 0 there exist a δ > 0 and a positive integer n0 such that, for all Q ∈ P and n ≥ n0,

   dP(P, Q) < δ  =⇒  dP(LP(Tn), LQ(Tn)) < ε                    (2.6)

where LP(Tn) and LQ(Tn) denote the probability distributions of Tn under P and Q, respectively.

This robustness is only qualitative: it only says whether the functional is or is not robust, but it does not numerically measure the level of this characteristic. Because such robustness concerns only the behavior of the functional in a small neighborhood of P, it is in fact infinitesimal. We can obviously replace the Prohorov metric with another suitable metric on the space P, e.g., the Lévy metric.

However, we do not only want to see whether T is or is not robust. We want to compare the various functionals with each other and see which is more robust
than the other. To do this, we must characterize the robustness by some quantitative measure. There are many possible quantifications of robustness. However, when using such quantitative measures, we should be aware that replacing a complicated concept with just one number can cause a bias and suppress important information.
2.4 Quantitative characteristics of robustness based on influence function

The influence function is one of the most important characteristics of a statistical functional/estimator. The value IF(x; T, P) measures the effect of a contamination of the functional T by a single value x. Hence, a robust functional T should have a bounded influence function. However, even the fact that T is a qualitatively robust functional does not automatically mean that its influence function IF(x; T, P) is bounded. As we shall see later, an example of such a functional is the R-estimator of the shift parameter that is an inversion of the van der Waerden rank test; while it is qualitatively robust, its influence function is unbounded.

The most popular quantitative characteristics of robustness of a functional T, based on the influence function, are its global and local sensitivities:

a) The global sensitivity of the functional T under distribution P is the maximum absolute value of the influence function in x under P, i.e.,

   γ* = sup_{x∈X} |IF(x; T, P)|                                       (2.7)

b) The local sensitivity of the functional T under distribution P is the value

   λ* = sup_{x,y; x≠y} |IF(y; T, P) − IF(x; T, P)| / |y − x|          (2.8)

that indicates the effect of the replacement of the value x by the value y on the functional T.

The following example illustrates the difference between the global and the local sensitivity.

Example 2.3
(a) Mean: T(P) = EP(X), IF(x; T, P) = x − EP X =⇒ γ* = ∞, λ* = 1; the mean is not robust, but it is not sensitive to local changes.

(b) Variance: T(P) = varP X = σP², IF(x; T, P) = (x − EP(X))² − σP², hence γ* = ∞ and

   λ* = sup_{y≠x} |(x − EP(X))² − (y − EP(X))²| / |x − y|
      = sup_{y≠x} |x² − y² − 2(x − y) EP X| / |x − y|
      = sup_{y≠x} |x + y − 2 EP X| = ∞
hence the variance is non-robust to large as well as to small (local) changes.
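The global and local sensitivities can be illustrated numerically by the sensitivity curve SC_n(x) = (n+1)(T_{n+1}(X1, . . . , Xn, x) − T_n(X1, . . . , Xn)), a finite-sample analogue of the influence function. The following short R sketch (not from the book; the sample and grid are arbitrary) shows the linear growth for the mean and the quadratic growth for the variance, in line with Example 2.3:

set.seed(1)
x <- rnorm(50)
## sensitivity curve of a statistic 'stat' at the added point z
sc <- function(stat, x, z) (length(x) + 1) * (stat(c(x, z)) - stat(x))
z  <- seq(-10, 10, by = 0.5)
sc.mean <- sapply(z, function(zz) sc(mean, x, zz))  # roughly linear in z
sc.var  <- sapply(z, function(zz) sc(var,  x, zz))  # roughly quadratic in z
plot(z, sc.var, type = "l", ylab = "sensitivity curve")
lines(z, sc.mean, lty = 2)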
2.5 Maximum bias

Assume that the true distribution function F0 lies in some family F. Another natural measure of robustness of the functional T is its maximal bias (maxbias) over F,

   b(F) = sup_{F∈F} |T(F) − T(F0)|                                    (2.9)

The family F can have various forms; for example, it can be a neighborhood of a fixed distribution F0 with respect to some distance described in Section 1.5. In the robustness analysis, F is often the ε-contaminated neighborhood of a fixed distribution function F0, which has the form

   FF0,ε = {F : F = (1 − ε)F0 + εG, G an unknown distribution function}      (2.10)

The contamination ratio ε is considered known, and so is the central distribution function F0. When estimating the location parameter θ of F(x − θ), where F is an unknown member of F, the central distribution F0 is usually taken as symmetric around zero and unimodal, while the contaminating distribution G can run either over symmetric or over asymmetric distribution functions. We then speak about symmetric or asymmetric contamination.

Many statistical functionals are monotone with respect to the stochastic ordering of distribution functions (or random variables), defined in the following way: a random variable X with distribution function F is stochastically smaller than a random variable Y with distribution function G, if

   F(x) ≥ G(x)   ∀x ∈ R

A monotone statistical functional thus attains its maxbias either at the stochastically largest member F∞ or at the stochastically smallest member F−∞ of FF0,ε, hence

   b(FF0,ε) = max{ |T(F∞) − T(F0)|, |T(F−∞) − T(F0)| }                (2.11)
The following example well illustrates the role of the maximal bias; it shows that while the mean is non-robust, the median is universally robust with respect to the maxbias criterion.
Example 2.4
(i) Mean: T(F) = EF(X); if F0 is symmetric around zero and so are all contaminating distributions G, all having finite first moments, then T(F) is unbiased for all F ∈ FF0,ε, hence b(FF0,ε) = 0. However, under an asymmetric contamination, b(FF0,ε) = |E(F∞) − E(F0)| = ∞, where F∞ = (1 − ε)F0 + εδ∞ is the stochastically largest member of FF0,ε.

(ii) Median: Because the median is nondecreasing with respect to the stochastic ordering of distributions, its maximum absolute bias over an asymmetric ε-contaminated neighborhood of a symmetric distribution function F0 is attained either at F∞ = (1 − ε)F0 + εδ∞ (the stochastically largest distribution of FF0,ε), or at F−∞ = (1 − ε)F0 + εδ−∞ (the stochastically smallest distribution of FF0,ε). The median of F∞ is attained at x0 satisfying

   (1 − ε)F0(x0) = 1/2  =⇒  x0 = F0^{-1}( 1/(2(1 − ε)) )

while the median of F−∞ is x0^- such that

   (1 − ε)F0(x0^-) + ε = 1/2  =⇒  x0^- = F0^{-1}( 1 − 1/(2(1 − ε)) ) = −x0

hence the maxbias of the median is equal to x0.

Let T(F) be any other functional such that its estimate T(Fn) = T(X1, . . . , Xn), based on the empirical distribution function Fn, is translation equivariant, i.e., T(X1 + c, . . . , Xn + c) = T(X1, . . . , Xn) + c for any c ∈ R. Then obviously T(F(· − c)) = T(F(·)) + c. We shall show that the maxbias of T cannot be smaller than x0. Consider two contaminations of F0,

   F+ = (1 − ε)F0 + εG+,   F− = (1 − ε)F0 + εG−

where

   G+(x) = 0                                              . . . x ≤ x0
         = (1/ε){1 − (1 − ε)[F0(x) + F0(2x0 − x)]}        . . . x ≥ x0

and

   G−(x) = (1/ε)(1 − ε)[F0(x + 2x0) − F0(x)]              . . . x < −x0
         = 1                                              . . . x ≥ −x0

Notice that F−(x − x0) = F+(x + x0), hence T(F−) + x0 = T(F+) − x0 and T(F+) − T(F−) = 2x0; thus the maxbias of T at F0 cannot be smaller than x0. This shows that the median has the smallest maxbias among all translation equivariant functionals.
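For a concrete feeling of the size of x0, the maxbias of the median can be evaluated directly in R; the following sketch (not from the book) uses the formula x0 = F0^{-1}(1/(2(1 − ε))) derived above for F0 = N(0, 1) and a few arbitrary contamination ratios:

eps <- c(0.05, 0.10, 0.20)                 # contamination ratios (illustrative)
x0  <- qnorm(1 / (2 * (1 - eps)))          # maxbias of the median under F0 = N(0,1)
round(x0, 3)                               # approximately 0.066, 0.140, 0.319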
If T(F) is a nonlinear functional, or if it is defined implicitly as a solution of a minimization or of a system of equations, then it is difficult to calculate (2.11) precisely. In that case we consider the maximum asymptotic bias of T(F) over a neighborhood F of F0. More precisely, let X1, . . . , Xn be independent identically distributed observations with distribution function F ∈ F and let Fn be the empirical distribution function. Assume that, as the number of observations increases, T(Fn) has an asymptotically normal distribution for every F ∈ F in the sense that

   P( √n (T(Fn) − T(F)) ≤ x ) → Φ( x / σ(T, F) )   as n → ∞

with variance σ²(T, F) depending on T and F. Then the maximal asymptotic bias (asymptotic maxbias) of T over F is defined as

   sup { |T(F) − T(F0)| : F ∈ F }                                     (2.12)
We shall return to the asymptotic maxbias later in the context of some robust estimators that are either nonlinear or defined implicitly as a solution of a minimization or a system of equations.
2.6 Breakdown point

The breakdown point, introduced by Donoho and Huber in 1983, is a very popular quantitative characteristic of robustness. To describe this characteristic, start from a random sample x(0) = (x1, . . . , xn) and consider the corresponding value Tn(x(0)) of an estimator of the functional T. Imagine that in this "initial" sample we can replace any m components by arbitrary values, possibly very unfavorable, even infinite. Denote the new sample after the replacement by x(m), and let Tn(x(m)) be the pertaining value of the estimator. The breakdown point of the estimator Tn for the sample x(0) is the number

   ε*n(Tn, x(0)) = m*(x(0)) / n

where m*(x(0)) is the smallest integer m for which

   sup_{x(m)} | Tn(x(m)) − Tn(x(0)) | = ∞

i.e., the smallest fraction of the observations that, being replaced with arbitrary values, can carry Tn up to infinity. Some estimators have a universal breakdown point, when m* is independent of the initial sample x(0). Then we can also calculate the limit ε* = lim_{n→∞} ε*n, which is often also called the breakdown point. We can modify the breakdown point in such a way that, instead of replacing m components, we extend the sample by some m (unfavorable) values.
Example 2.5
(a) The average X̄n = (1/n) Σ_{i=1}^n Xi:

   ε*n(X̄n, x(0)) = 1/n,   hence lim_{n→∞} ε*n(X̄n, x(0)) = 0 for any initial sample x(0)

(b) Median X̃n = X((n+1)/2) (consider n odd, for simplicity):

   ε*n(X̃n, x(0)) = (n + 1)/(2n),   thus lim_{n→∞} ε*n(X̃n, x(0)) = 1/2 for any initial sample x(0)
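Example 2.5 is easy to reproduce numerically: replacing a single observation by an arbitrarily large value already carries the mean away, while the median hardly moves. The following R lines are an illustrative sketch (sample and replacement value chosen arbitrarily, not from the book):

set.seed(2)
x <- rnorm(11)
x.bad <- x
x.bad[1] <- 1e6                    # replace one observation by a huge value
c(mean(x), mean(x.bad))            # the mean follows the outlier
c(median(x), median(x.bad))        # the median is essentially unchanged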
2.7 Tail-behavior measure of a statistical estimator

The tail-behavior measure is surprisingly intuitive, mainly in estimating the shift and regression parameters. We will first illustrate this measure on the shift parameter, and then return to regression at a suitable place.

Let (X1, . . . , Xn) be a random sample from a population with continuous distribution function F(x − θ), θ ∈ R. The problem of interest is that of estimating the parameter θ. A reasonable estimator of the shift parameter should be translation equivariant: Tn is translation equivariant if

   Tn(X1 + c, . . . , Xn + c) = Tn(X1, . . . , Xn) + c   ∀ c ∈ R and ∀ X1, . . . , Xn

The performance of such an estimator can be characterized by the probabilities Pθ(|Tn − θ| > a), analyzed either under fixed a > 0 and n → ∞, or under fixed n and a → ∞. Indeed, if {Tn} is a consistent estimator of θ, then lim_{n→∞} Pθ(|Tn − θ| > a) = 0 under any fixed a > 0. Such a characteristic was studied, e.g., by Bahadur (1967), Fu (1975, 1980) and Sievers (1978), who suggested the limit

   lim_{n→∞} −(1/n) ln Pθ(|Tn − θ| > a)   under fixed a > 0

(provided it exists) as a measure of efficiency of the estimator Tn, and compared estimators from this point of view. On the other hand, a good estimator Tn also verifies the convergence

   lim_{a→∞} Pθ(|Tn − θ| > a) = lim_{a→∞} P0(|Tn| > a) = 0             (2.13)

and this convergence should be as fast as possible. The probabilities Pθ(Tn − θ > a) and Pθ(Tn − θ < −a), for a sufficiently large, are called the right and the left tails, respectively, of the probability distribution of Tn. If Tn is symmetrically distributed around θ, then both its tails are characterized by the probability (2.13). This probability should rapidly tend to zero. However, the speed of this convergence cannot be arbitrarily high. We shall show that the rate of
convergence of tails of a translation equivariant estimator is bounded, and that its upper bound depends on the behavior of 1 − F(a) and F(−a) for large a > 0.

Let us illustrate this upper bound on a model with symmetric distribution function satisfying F(−x) = 1 − F(x) ∀x ∈ R. Jurečková (1981) introduced the following tail-behavior measure of an equivariant estimator Tn:

   B(Tn; a) = [−ln P0(|Tn| > a)] / [−ln(1 − F(a))] = [−ln Pθ(|Tn − θ| > a)] / [−ln(1 − F(a))],   a > 0      (2.14)

The values B(Tn; a) for large a show how many times faster the probability P0(|Tn| > a) tends to 0 than 1 − F(a), as a → ∞. The best is an estimator Tn with the largest possible values of B(Tn; a) for large a. The lower and upper bounds for B(Tn; a), and thus for the rate of convergence of its tails, are formulated in the following lemma:

Lemma 2.1 Let X1, . . . , Xn be a random sample from a population with distribution function F(x − θ), 0 < F(x) < 1, F(−x) = 1 − F(x), x, θ ∈ R. Let Tn be an equivariant estimator of θ such that, for any fixed n,

   min_{1≤i≤n} Xi > 0 =⇒ Tn(X1, . . . , Xn) > 0
                                                                       (2.15)
   max_{1≤i≤n} Xi < 0 =⇒ Tn(X1, . . . , Xn) < 0

Then, under any fixed n,

   1 ≤ lim inf_{a→∞} B(Tn; a) ≤ lim sup_{a→∞} B(Tn; a) ≤ n             (2.16)
Proof. Indeed, if Tn is equivariant, then

   P0(|Tn(X1, . . . , Xn)| > a)
      = P0(Tn(X1, . . . , Xn) > a) + P0(Tn(X1, . . . , Xn) < −a)
      = P0(Tn(X1 − a, . . . , Xn − a) > 0) + P0(Tn(X1 + a, . . . , Xn + a) < 0)
      ≥ P0( min_{1≤i≤n} Xi > a ) + P0( max_{1≤i≤n} Xi < −a )
      = (1 − F(a))^n + (F(−a))^n

hence

   −ln P0(|Tn(X1, . . . , Xn)| > a) ≤ −ln 2 − n ln(1 − F(a))
   =⇒ lim sup_{a→∞} [−ln P0(|Tn| > a)] / [−ln(1 − F(a))] ≤ n

Similarly,

   P0(|Tn(X1, . . . , Xn)| > a)
      ≤ P0( min_{1≤i≤n} Xi ≤ −a ) + P0( max_{1≤i≤n} Xi ≥ a )
      = 1 − (1 − F(−a))^n + 1 − (F(a))^n = 2{1 − (F(a))^n}
      = 2(1 − F(a))[1 + F(a) + . . . + (F(a))^{n−1}] ≤ 2n(1 − F(a))

hence

   −ln P0(|Tn(X1, . . . , Xn)| > a) ≥ −ln 2 − ln n − ln(1 − F(a))
   =⇒ lim inf_{a→∞} [−ln P0(|Tn| > a)] / [−ln(1 − F(a))] ≥ 1          □
If Tn attains the upper bound in (2.16), then it is obviously optimal for the distribution function F, because its tails tend to zero n-times faster than 1 − F(a), which is the fastest possible rate. However, we still have the following questions:

• Is the upper bound attainable, and for which Tn and F?
• Is there any estimator Tn attaining high values of B(Tn; a) robustly for a broad class of distribution functions?

It turns out that the sample mean X̄n can attain both the lower and the upper bound in (2.16): it attains the upper bound under the normal distribution and under exponentially tailed distributions, while it attains the lower bound under the Cauchy and other heavy-tailed distributions. This demonstrates a high non-robustness of X̄n even from the tail-behavior aspect. On the other hand, the sample median X̃n is robust even with respect to tails: X̃n does not attain the upper bound in (2.16); on the contrary, the limit lim_{a→∞} B(X̃n; a) is always in the middle of the range between 1 and n for a broad class of distribution functions. These conclusions are in good concordance with the robustness concepts. The following theorem gives them a mathematical form.

Theorem 2.1 Let X1, . . . , Xn be a random sample from a population with distribution function F(x − θ), 0 < F(x) < 1, F(−x) = 1 − F(x), x, θ ∈ R.

(i) Let X̄n = (1/n) Σ_{i=1}^n Xi be the sample mean. If F has exponential tails, i.e.,

   lim_{a→∞} [−ln(1 − F(a))] / (b a^r) = 1   for some b > 0, r ≥ 1     (2.17)

then

   lim_{a→∞} B(X̄n; a) = n                                             (2.18)
(ii) If F has heavy tails in the sense that

   lim_{a→∞} [−ln(1 − F(a))] / (m ln a) = 1   for some m > 0           (2.19)

then

   lim_{a→∞} B(X̄n; a) = 1                                             (2.20)
(iii) Let X̃n be the sample median. Then, for F satisfying either (2.17) or (2.19),

   n/2 ≤ lim inf_{a→∞} B(X̃n; a) ≤ lim sup_{a→∞} B(X̃n; a) ≤ n/2 + 1   for n even      (2.21)

   lim_{a→∞} B(X̃n; a) = (n + 1)/2   for n odd                                          (2.22)

Remark 2.1 The distribution functions with exponential tails, satisfying (2.17), will be briefly called type I. This class includes the normal distribution (r = 2), and the logistic and Laplace distributions (r = 1). The distribution functions with heavy tails, satisfying (2.19), will be called type II. The Cauchy distribution (m = 1) and the t-distribution with m > 1 degrees of freedom belong here.

Proof of Theorem 2.1. (i) It is sufficient to prove that under an exponentially tailed F the expectation

   Eε = E0[ exp{ n(1 − ε) b |X̄n|^r } ] < ∞                             (2.23)

is finite for arbitrary ε ∈ (0, 1). Indeed, then we conclude from the Markov inequality that

   P0(|X̄n| > a) ≤ Eε · exp{ −n(1 − ε) b a^r }

   =⇒ lim inf_{a→∞} [−ln P0(|X̄n| > a)] / (b a^r) ≥ lim_{a→∞} [ n(1 − ε) b a^r − ln Eε ] / (b a^r) = n(1 − ε)

and we arrive at proposition (2.18).
We obtain the finiteness of the expectation (2.23) from the Hölder inequality:

   E0[ exp{ n(1 − ε) b |X̄n|^r } ] ≤ E0[ exp{ (1 − ε) b Σ_{i=1}^n |Xi|^r } ]
      ≤ ( E0[ exp{(1 − ε) b |X1|^r} ] )^n = ( 2 ∫_0^∞ exp{(1 − ε) b x^r} dF(x) )^n          (2.24)

It follows from (2.17) that, given ε > 0, there exists Aε > 0 such that

   1 − F(a) < exp{ −(1 − ε/2) b a^r }

holds for any a ≥ Aε.
The last integral in (2.24) can be successively rewritten in the following way:

   ∫_0^∞ exp{(1 − ε) b x^r} dF(x)
      = ∫_0^{Aε} exp{(1 − ε) b x^r} dF(x) − ∫_{Aε}^∞ exp{(1 − ε) b x^r} d(1 − F(x))
      = ∫_0^{Aε} exp{(1 − ε) b x^r} dF(x) + (1 − F(Aε)) · exp{(1 − ε) b Aε^r}
        + ∫_{Aε}^∞ (1 − F(x)) (1 − ε) b r x^{r−1} · exp{(1 − ε) b x^r} dx
      ≤ ∫_0^{Aε} exp{(1 − ε) b x^r} dF(x) + exp{ −(ε/2) b Aε^r }
        + ∫_{Aε}^∞ (1 − ε) b r x^{r−1} · exp{ −(ε/2) b x^r } dx  <  ∞
and that leads to proposition (i).

(ii) If F has heavy tails, then

   P0(|X̄n| > a) = P0(X̄n > a) + P0(X̄n < −a)
      ≥ P0( X1 > −a, . . . , Xn−1 > −a, Xn > (2n − 1)a ) + P0( X1 < a, . . . , Xn−1 < a, Xn < −(2n − 1)a )
      = 2 (F(a))^{n−1} [1 − F((2n − 1)a)]

hence

   lim sup_{a→∞} B(X̄n; a) ≤ lim sup_{a→∞} [−ln(1 − F((2n − 1)a))] / (m ln a)
                           = lim_{a→∞} [−ln(1 − F((2n − 1)a))] / (m ln((2n − 1)a)) = 1
(iii) Let X̃n be the sample median and let n be odd. Then X̃n is the middle order statistic of the sample X1, . . . , Xn, i.e., X̃n = X(m), m = (n+1)/2, and F(X̃n) = U(m) has the beta-distribution. Then

   P0(|X̃n| > a) = P0(X̃n > a) + P0(X̃n < −a)
      = 2n (n−1 choose m−1) ∫_{F(a)}^1 u^{m−1} (1 − u)^{m−1} du
      ≤ 2n (n−1 choose m−1) (1 − F(a))^m

and similarly

   P0(|X̃n| > a) ≥ 2n (n−1 choose m−1) (F(a))^{m−1} (1 − F(a))^m

which leads to (2.22) after taking logarithms. Analogously we proceed for n even.  □
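Theorem 2.1 can be checked numerically without simulation, because for the mean and the median the tail probabilities in (2.14) are available in closed form: under N(0, 1) the mean of n observations is N(0, 1/n), under the Cauchy distribution it is again standard Cauchy, and for n odd P0(X̃n > a) = P(Bin(n, 1 − F(a)) ≥ (n+1)/2). The following R sketch (not from the book; n = 5 and the grid of a are arbitrary) computes B(Tn; a) on the log scale to avoid underflow; the columns approach n, 1, (n+1)/2 and (n+1)/2, respectively, as a grows:

n <- 5; m <- (n + 1) / 2; a <- c(2, 5, 10, 20)
lF.norm   <- pnorm(a, lower.tail = FALSE, log.p = TRUE)    # log(1 - F(a)), normal
lF.cauchy <- pcauchy(a, lower.tail = FALSE, log.p = TRUE)  # log(1 - F(a)), Cauchy
## log P0(|Tn| > a) for the sample mean
lT.mean.norm   <- log(2) + pnorm(a * sqrt(n), lower.tail = FALSE, log.p = TRUE)
lT.mean.cauchy <- log(2) + pcauchy(a, lower.tail = FALSE, log.p = TRUE)
## log P0(|Tn| > a) for the sample median (n odd)
lT.med.norm   <- log(2) + pbinom(m - 1, n, exp(lF.norm),   lower.tail = FALSE, log.p = TRUE)
lT.med.cauchy <- log(2) + pbinom(m - 1, n, exp(lF.cauchy), lower.tail = FALSE, log.p = TRUE)
round(cbind(a,
            mean.normal = lT.mean.norm / lF.norm,
            mean.cauchy = lT.mean.cauchy / lF.cauchy,
            med.normal  = lT.med.norm / lF.norm,
            med.cauchy  = lT.med.cauchy / lF.cauchy), 2)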
2.8 Variance of asymptotic normal distribution

If the estimator Tn of the functional T(·) is asymptotically normally distributed as n → ∞,

   LP{ √n (Tn − T(P)) } → N(0, V²(P, T))

then another possible robustness measure of T is the supremum of the variance V²(P, T),

   σ²(T) = sup_{P∈P0} V²(P, T)
over a neighborhood P0 ⊂ P of the assumed model. The estimator minimizing sup_{P∈P0} V²(P, T) over a specified class T of estimators of the parameter θ is called minimaximally robust in the class T. We shall show in the sequel that the classes of M-estimators, L-estimators and R-estimators all contain a minimaximally robust estimator of the shift and regression parameters in a class of contaminated normal distributions.
2.9 Problems and complements

2.1 Show that both the sample mean and the sample median of the random sample X1, . . . , Xn are nondecreasing in each argument Xi, i = 1, . . . , n.

2.2 Characterize distributions satisfying

   lim_{a→∞} [−ln(1 − F(a + c))] / [−ln(1 − F(a))] = 1                (2.25)

for any fixed c ∈ R. Show that this class contains distributions of type I and type II.

2.3 Let X1, . . . , Xn be a random sample from a population with distribution function F(x − θ), where F is symmetric, absolutely continuous, 0 < F(x) < 1 for x ∈ R, and satisfies (2.25). Let Tn(X1, . . . , Xn) be a translation equivariant estimator of θ, nondecreasing in each argument Xi, i = 1, . . . , n. Then Tn has a universal breakdown point m*n = m*n(Tn) and there exists a constant A such that

   Xn:m*n − A ≤ Tn(X1, . . . , Xn) ≤ Xn:n−m*n+1 + A

where Xn:1 ≤ Xn:2 ≤ . . . ≤ Xn:n are the order statistics of the sample X1, . . . , Xn. (Hint: see He et al. (1990).)

2.4 Let Tn(X1, . . . , Xn) be a translation equivariant estimator of θ, nondecreasing in each argument Xi, i = 1, . . . , n. Then, under the conditions of Problem 2.2,

   m*n ≤ lim inf_{a→∞} B(a, Tn) ≤ lim sup_{a→∞} B(a, Tn) ≤ n − m*n + 1       (2.26)

Illustrate it on the sample median. (Hint: see He et al. (1990).)

2.5 Let X1, . . . , Xn be a random sample from a population with distribution function F(x − θ). Compute the breakdown point of

   Tn = (Xn:1 + Xn:n)/2

This estimator is called the midrange (see the next chapter).

2.6 Show that the midrange (Problem 2.5) of the random sample X1, . . . , Xn is nondecreasing in each argument Xi, i = 1, . . . , n. Illustrate (2.26) for this estimator.

2.7 Determine whether the gamma distribution (Example 1.5) has exponential or heavy tails.
CHAPTER 3
Robust estimators of real parameter
3.1 Introduction

Let X1, . . . , Xn be a random sample from a population with probability distribution P; the distribution is generally unknown, we only assume that its distribution function F belongs to some class F of distribution functions. We look for an appropriate estimator of a parameter θ that can be expressed as a functional T(P) of P. The same parameter θ can be characterized by means of several functionals; e.g., the center of symmetry is simultaneously the expected value, the median and the mode of the distribution, among other possible characterizations. Some functionals T(P) are characterized implicitly as a root of an equation (or of a system of equations) or as a solution of a minimization (maximization) problem: such are the maximum likelihood estimator, the moment estimator, etc. An estimator of the parameter θ is obtained as an empirical functional, i.e., when one replaces P in the functional T(·) with the empirical distribution corresponding to the vector of observations X1, . . . , Xn. We shall mainly deal with three broad classes of robust estimators of the real parameter: M-estimators, L-estimators, and R-estimators. We shall later extend these classes to other models, mainly to the linear regression model.
3.2 M-estimators

The class of M-estimators was introduced by P. J. Huber (1964); the properties of M-estimators are studied in his book (Huber (1981)), and also in the books by Andrews et al. (1972), Antoch et al. (1998), Bunke and Bunke (1986), Dodge and Jurečková (2000), Hampel et al. (1986), Jurečková and Sen (1996), Lecoutre and Tassi (1987), Rieder (1994), Rousseeuw and Leroy (1987), Staudte and Sheather (1990), and others. The M-estimator Tn is defined as a solution of the minimization problem

   Σ_{i=1}^n ρ(Xi, θ) := min   with respect to θ ∈ Θ,   or equivalently   EPn[ρ(X, θ)] = min,  θ ∈ Θ        (3.1)
where ρ(·, ·) is a properly chosen function. The class of M-estimators also covers the maximum likelihood estimator (MLE) of the parameter θ in the parametric model P = {Pθ, θ ∈ Θ}; if f(x, θ) is the density function of Pθ, then the MLE is a solution of the minimization

   Σ_{i=1}^n (− log f(Xi, θ)) = min,   θ ∈ Θ

If ρ in (3.1) is differentiable in θ with a continuous derivative ψ(·, θ) = (∂/∂θ) ρ(·, θ), then Tn is a root (or one of the roots) of the equation

   Σ_{i=1}^n ψ(Xi, θ) = 0,   θ ∈ Θ                                     (3.2)

hence

   (1/n) Σ_{i=1}^n ψ(Xi, Tn) = EPn[ψ(X, Tn)] = 0,   Tn ∈ Θ             (3.3)

We see from (3.1) and (3.3) that the M-functional, the statistical functional corresponding to Tn, is defined as a solution of the minimization

   ∫_X ρ(x, T(P)) dP(x) = EP[ρ(X, T(P))] := min,   T(P) ∈ Θ            (3.4)

or as a solution of the equation

   ∫_X ψ(x, T(P)) dP(x) = EP[ψ(X, T(P))] = 0,   T(P) ∈ Θ               (3.5)
The functional T (P ) is Fisher consistent, if the solutions of (3.4) and (3.5) are uniquely determined.
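The minimization (3.1) can also be carried out directly by numerical optimization. The following R sketch (not the book's code; the sample, the search interval and the ρ-functions are chosen only for illustration) shows the location case: with ρ(x) = x² one recovers the sample mean, with ρ(x) = |x| the sample median, in agreement with the examples discussed below.

m.estimate <- function(x, rho) {
  ## minimize sum(rho(x - theta)) over theta in the range of the data
  optimize(function(theta) sum(rho(x - theta)), range(x))$minimum
}
set.seed(3)
x <- rcauchy(25)
c(m.estimate(x, function(u) u^2), mean(x))       # quadratic rho gives the mean
c(m.estimate(x, function(u) abs(u)), median(x))  # absolute rho gives (approximately) the median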
3.2.1 Influence function of M-estimator

Assume that ρ(·, θ) is differentiable, that its derivative ψ(·, θ) is absolutely continuous with respect to θ, and that equation (3.5) has a unique solution T(P). Let Pt = (1 − t)P + tδx; then T(Pt) solves the equation

   ∫_X ψ(y, T(Pt)) d((1 − t)P + tδx)(y) = 0

hence

   (1 − t) ∫_X ψ(y, T(Pt)) dP(y) + t ψ(x, T(Pt)) = 0                  (3.6)

Differentiating (3.6) in t, we obtain

   − ∫_X ψ(y, T(Pt)) dP(y) + ψ(x, T(Pt))
   + (1 − t) (dT(Pt)/dt) ∫_X [∂ψ(y, θ)/∂θ]_{θ=T(Pt)} dP(y)
   + t (dT(Pt)/dt) [∂ψ(x, θ)/∂θ]_{θ=T(Pt)} = 0

Letting t ↓ 0, we obtain the influence function of T(P):

   IF(x; T, P) = ψ(x, T(P)) / [ −∫_X ψ̇(y, T(P)) dP(y) ]              (3.7)

where ψ̇(y, T(P)) = [∂ψ(y, θ)/∂θ]_{θ=T(P)}.
3.3 M-estimator of location parameter

An important special case is the model with the shift parameter θ, where X1, . . . , Xn are independent observations with the same distribution function F(x − θ), θ ∈ R; the distribution function F is generally unknown. The M-estimator Tn is defined as a solution of the minimization

   Σ_{i=1}^n ρ(Xi − θ) := min                                          (3.8)

and if ρ(·) is differentiable with absolutely continuous derivative ψ(·), then Tn solves the equation

   Σ_{i=1}^n ψ(Xi − θ) = 0                                             (3.9)

The corresponding M-functional T(F) is Fisher consistent, provided the minimization ∫_X ρ(x − θ) dP(x) := min has the unique solution θ = 0. The influence function of T(F) is then

   IF(x; T, P) = ψ(x − T(P)) / ∫_X ψ′(y) dP(y)                         (3.10)
We see from the minimization (3.8) and from the equation (3.9) that Tn is translation equivariant, i.e., that it satisfies

   Tn(X1 + c, . . . , Xn + c) = Tn(X1, . . . , Xn) + c   ∀c ∈ R         (3.11)

However, Tn generally is not scale equivariant: the scale equivariance of Tn means that

   Tn(cX1, . . . , cXn) = c Tn(X1, . . . , Xn)   for c > 0

If the model is symmetric, i.e., we have a reason to assume the symmetry of F around 0, we should choose ρ symmetric around 0 (ψ would then be an odd function). If ρ(x) is strictly convex (and thus ψ(x) strictly increasing), then Σ_{i=1}^n ρ(Xi − θ) is strictly convex in θ, and the M-estimator is uniquely determined. If ρ(x) is linear in some interval [a, b], then ψ(·) is constant in [a, b], and the equation Σ_{i=1}^n ψ(Xi − θ) = 0 can have more roots. There are
many possible rules for choosing one among these roots; one possibility to obtain a unique solution is to define Tn in the following way:

   Tn = (Tn+ + Tn−)/2

   Tn− = sup{ t : Σ_{i=1}^n ψ(Xi − t) > 0 }                            (3.12)

   Tn+ = inf{ t : Σ_{i=1}^n ψ(Xi − t) < 0 }
Similarly, we determine the M-estimator in the case of a nondecreasing ψ with jump discontinuities. If ψ(·) is nondecreasing, continuous or having jump discontinuities, then the M-estimator Tn obviously satisfies, for any a ∈ R,

   Pθ( Σ_{i=1}^n ψ(Xi − a) > 0 ) ≤ Pθ(Tn > a) ≤ Pθ(Tn ≥ a)
      ≤ Pθ( Σ_{i=1}^n ψ(Xi − a) ≥ 0 )                                  (3.13)
      = Pθ( Σ_{i=1}^n ψ(Xi − a) > 0 ) + Pθ( Σ_{i=1}^n ψ(Xi − a) = 0 )

The inequalities in (3.13) turn into equalities if Pθ{ Σ_{i=1}^n ψ(Xi − a) = 0 } = 0. This further implies that, for any y ∈ R,

   P0( n^{−1/2} Σ_{i=1}^n ψ(Xi − y/√n) < 0 ) ≤ Pθ( √n(Tn − θ) < y )
      ≤ Pθ( √n(Tn − θ) ≤ y ) ≤ P0( n^{−1/2} Σ_{i=1}^n ψ(Xi − y/√n) ≤ 0 )       (3.14)

Because n^{−1/2} Σ_{i=1}^n ψ(Xi − y/√n) is a standardized sum of independent identically distributed random variables, the asymptotic probability distribution of √n(Tn − θ) can be derived from the central limit theorem by means of (3.14).
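In practice the estimating equation (3.9) is solved numerically. The following R sketch (not the book's code; the contaminated sample, the bracketing interval and the constant k = 1.345 are illustrative choices, and no studentization by a scale statistic is done here) finds the root of Σψ_H(Xi − θ) = 0 for the Huber function introduced in the next subsections:

psi.huber <- function(x, k = 1.345) pmax(-k, pmin(k, x))   # x for |x|<=k, k*sign(x) otherwise
m.location <- function(x, k = 1.345) {
  ## the sum of psi is positive at min(x) and negative at max(x), so a root exists in between
  uniroot(function(theta) sum(psi.huber(x - theta, k)), range(x))$root
}
set.seed(4)
x <- c(rnorm(18), rnorm(2, mean = 10))   # 10% of shifted observations
c(huber = m.location(x), mean = mean(x), median = median(x))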
3.3.1 Breakdown point of M-estimator of location parameter

If Mn estimates the center of symmetry θ of F(x − θ), then its breakdown point follows from Section 2.6:

   ε* = lim_{n→∞} ε*n = 0     if ψ(·) is an unbounded function
   ε* = lim_{n→∞} ε*n = 1/2   if ψ is odd and bounded
Hence, the class of M-estimators contains robust as well as non-robust elements.

Example 3.1
(a) Expected value: The expected value θ = EP X is an M-functional with the criterion function ρ(x) = x², ψ(x) = 2x and ψ′(x) = 2. Its influence function follows from (3.10):

   IF(x; T, P) = 2(x − EP(X)) / ∫_R 2 dP = x − EP(X)

The corresponding M-estimator is the sample (arithmetic) mean X̄n; its breakdown point ε* = lim_{n→∞} ε*n is equal to 0, and its global sensitivity is γ* = +∞.

(b) Median: The median X̃ = F^{-1}(1/2) can be considered as an M-functional with the criterion function ρ(x) = |x|, and the sample median Tn = X̃n is a solution of the minimization

   Σ_{i=1}^n |Xi − θ| := min,   θ ∈ R

To derive the influence function of the median, assume that the probability distribution P has a continuous distribution function F, strictly increasing in an interval (a, b), −∞ ≤ a < b ≤ ∞, and differentiable in a neighborhood of X̃. Let Ft be the distribution function of the contaminated distribution Pt = (1 − t)P + tδx. The median T(Pt) is a solution of the equation Ft(u) = 1/2, i.e.,

   (1 − t) F(T(Pt)) + t I[x < T(Pt)] = 1/2

which leads to

   T(Pt) = F^{-1}( 1/(2(1 − t)) )           . . . x > T(Pt)
         = F^{-1}( (1 − 2t)/(2(1 − t)) )    . . . x ≤ T(Pt)

The function T(Pt) is continuous at t = 0, because T(Pt) → X̃ = T(P) as t → 0; using the following expansions around t = 0,

   1/(2(1 − t)) = 1/2 + t/2 + O(t²)   and   (1 − 2t)/(2(1 − t)) = 1/2 − t/2 + O(t²)

we obtain

   lim_{t→0} (1/t) [ T(Pt) − F^{-1}(1/2) ] = (1/2) sign(x − F^{-1}(1/2)) · [ dF^{-1}(u)/du ]_{u=1/2}

and this, in turn, leads to the influence function of the median

   IF(x; X̃, F) = sign(x − X̃) / (2 f(X̃))                              (3.15)

The influence function of the median is bounded, hence the median is robust, while the expected value is non-robust. The breakdown point of the median is ε* = 1/2, and its global sensitivity is γ* = 1/(2f(X̃)) (γ* = 1.253 for the standard normal distribution N(0, 1)).

By (3.15), E(IF(x; X̃, P))² = 1/(4f²(X̃)) = const, and we can show that the sequence √n(X̃n − X̃) is asymptotically normally distributed,

   L{ √n(X̃n − X̃) } → N( 0, 1/(4f²(X̃)) )

as n → ∞. Especially, if F is the distribution function of the normal distribution N(µ, σ²), then f²(X̃) = f²(µ) = 1/(2πσ²) and

   L{ √n(X̃n − X̃) } → N( 0, πσ²/2 )

(c) Maximum likelihood estimator of the parameter θ of a probability distribution with density f(x, θ):

   ρ(x, T(P)) = − log f(x, T(P))
   ψ(x, T(P)) = −[ ∂ log f(x, θ)/∂θ ]_{θ=T(P)}

   IF(x; T, P) = (1/If(T(P))) · ḟ(x, T(P)) / f(x, T(P))

where

   ḟ(x, T(P)) = [ ∂f(x, θ)/∂θ ]_{θ=T(P)}

and

   If(T(P)) = ∫_X [ ∂ log f(x, θ)/∂θ ]²_{θ=T(P)} f(x, T(P)) dx

is the Fisher information of the distribution f at the point θ = T(P).
M -ESTIMATOR OF LOCATION PARAMETER
49
The expected value is an M -functional generated by a linear, and hence un¯n, bounded function ψ. The corresponding M -estimator is the sample mean X which is the maximal likelihood estimator of the location parameter of the normal distribution. However, this functional is closely connected with the normal distribution and is highly non-robust. If we look for an M -estimator of the location parameter of a distribution not very far from the normal distribution, but possibly containing an ε ratio of nonnormal data, more precisely, that belongs to the family F = {F : F = (1 − ε)Φ + εH} where H runs over symmetric distribution functions, we should use the function ψ, proposed and motivated by P. J. Huber (1964). This function is linear in a bounded segment [−k, k], and constant outside this segment, see Figure 3.1. x . . . |x| ≤ k (3.16) ψH (x) = k sign x . . . |x| > k where k > 0 is a fixed constant, connected with ε through the following identity: 2Φ (k) 1 2Φ(k) − 1 + = (3.17) k 1−ε The corresponding M -estimator is very popular and is often called Huber estimator in the literature. It has a bounded influence function proportional to ψH (following from (3.10)), the breakdown point ε∗ = 12 , the global sensitivity k , and the tail-behavior measure lima→∞ B(a, Tn , F ) = 12 both for γ ∗ = 2F (k)−1 distributions with exponential and heavy tails. Thus, it is a robust estimator of the center of symmetry, insensitive to the extreme and outlying observations. As Huber proved in 1964, an estimator, generated by the function (3.16), is minimaximally robust for a contaminated normal distribution, while the value k depends on the contamination ratio. An interesting and natural question is whether there exists a distribution F such that the Huber M -estimator is the maximal likelihood estimator of θ for F (x − θ), i.e., such that ψ is the likelihood function for F. Such a distribution really exists, and its density is normal in interval [−k, k], and exponential outside. 3.3.3 Other choices of ψ Some authors recommend reducing the effect of outliers even more and choosing a redescending function ψ(x), tending to 0 as x → ±∞, eventually vanishing outside a bounded interval containing 0. Such is the likelihood function of the Cauchy distribution, see Figure 3.2. ψC (x) = − where f (x) =
1 π(1+x2 )
2x f (x) = f (x) 1 + x2
is the density of the Cauchy distribution.
(3.18)
50
ROBUST ESTIMATORS OF REAL PARAMETER
Another example is the Tukey biweight function (Figure 3.3), ⎧ 2 ⎪ . . . |x| ≤ k ⎨ x 1 − xk ψT (x) = ⎪ ⎩ 0 . . . |x| > k and the Andrews sinus function (Figure 3.4), ⎧ x . . . |x| ≤ kπ ⎨ sin k ψA (x) = ⎩ 0 . . . |x| > kπ
(3.19)
(3.20)
Hampel (1974) proposed a continuous, piecewise linear function ψ (see Figure 3.5), vanishing outside a bounded interval: ⎧ |x| sign x . . . |x| < a ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ a sign x . . . a ≤ |x| < b (3.21) ψHA (x) = c−|x| ⎪ ⎪ a sign x . . . b ≤ |x| < c ⎪ c−b ⎪ ⎪ ⎪ ⎩ 0 . . . |x| > c In the robustness literature we can also find the skipped mean, generated by the function (Figure 3.6) x . . . |x| ≤ k ∗ ψ (x) = (3.22) 0 . . . |x| > k or the skipped median, generated by the function (Figure 3.7) ⎧ −1 . . . −k ≤ x < 0 ⎪ ⎪ ⎪ ⎨ ˜ 0 ... |x| > k ψ(x) = ⎪ ⎪ ⎪ ⎩ 1 ... 0≤x≤k
(3.23)
The redescending functions are not monotone, and their corresponding primitive functions ρ are not convex. Besides the global minimum, the function n ρ(Xi −θ) can have local extremes, inducing further roots of the equation i=1 n i=1 ψ(Xi − θ) = 0. Moreover, the functions ψ generating the skipped mean and n the skipped median have jump discontinuities, and hence the equation M -estimator i=1 ψ(Xi − θ) = 0 generally has no solution; the corresponding n must be calculated as a global minimum of the function i=1 ρ(Xi − θ).
M -ESTIMATOR OF LOCATION PARAMETER
51
2
1
-2
-1
1
2
-1
-2
Figure 3.1 Huber function ψH with k = 1.345.
2
1
-5
-4
-3
-2
-1
1
-1
-2
Figure 3.2 Cauchy function ψC .
2
3
4
5
52
ROBUST ESTIMATORS OF REAL PARAMETER
2
1
-2
-1
1
2
-1
-2
Figure 3.3 Tukey biweight function ψT with k = 1.345.
2
1
-6
-5
-4
-3
-2
-1
1
2
3
4
-1
-2
Figure 3.4 Andrews sinus function ψA with k = 1.339.
5
6
M -ESTIMATOR OF LOCATION PARAMETER
53
2
1
-10
-8
-6
-4
-2
2
4
6
-1
-2
Figure 3.5 Hampel function ψHA with a = 2, b = 4, c = 8.
2
1
-2
-1
1
2
-1
-2
Figure 3.6 Skipped means function ψ ∗ with k = 1.345.
8
10
54
ROBUST ESTIMATORS OF REAL PARAMETER
2
1
-2
-1
1
2
-1
-2
Figure 3.7 Skipped medians function ψ˜∗ with k = 1.345.
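All of these ψ-functions are easily written out and plotted in R; the following sketch (not the book's code; the tuning constants repeat the values used in the figures and are otherwise arbitrary) also shows how the constant k of (3.16) can be obtained from (3.17) for a given contamination ratio ε, e.g. ε = 0.05 gives k ≈ 1.4:

psi.huber   <- function(x, k = 1.345) pmax(-k, pmin(k, x))
psi.cauchy  <- function(x) 2 * x / (1 + x^2)
psi.tukey   <- function(x, k = 1.345) ifelse(abs(x) <= k, x * (1 - (x / k)^2)^2, 0)
psi.andrews <- function(x, k = 1.339) ifelse(abs(x) <= k * pi, sin(x / k), 0)
psi.hampel  <- function(x, a = 2, b = 4, c = 8) {
  s <- sign(x); ax <- abs(x)
  s * ifelse(ax < a, ax, ifelse(ax < b, a, ifelse(ax < c, a * (c - ax) / (c - b), 0)))
}
## k solving the identity (3.17) for a given contamination ratio eps
k.huber <- function(eps)
  uniroot(function(k) 2 * dnorm(k) / k + 2 * pnorm(k) - 1 - 1 / (1 - eps), c(0.01, 10))$root
k.huber(0.05)                                   # approximately 1.4
curve(psi.huber(x), -5, 5, ylab = "psi(x)")     # compare a monotone and a redescending psi
curve(psi.tukey(x), -5, 5, add = TRUE, lty = 2)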
3.4 Finite sample minimax property of M-estimator

The Huber estimator is asymptotically minimax over the family of contaminated normal distributions. We shall now illustrate another, finite sample, minimax property of the Huber M-estimator, proved by Huber in 1968.

Consider a random sample from a population with distribution function F(x − θ), where both F and θ are unknown, and assume that F belongs to the Kolmogorov ε-neighborhood of the standard normal distribution, i.e.,

   F ∈ F = {F : sup_{x∈R} |F(x) − Φ(x)| ≤ ε}                           (3.24)

where Φ is the standard normal distribution function. Fix an a > 0 and consider the inaccuracy measure of an estimator T of θ:

   sup_{F∈F, θ∈R} Pθ( |T − θ| > a )                                    (3.25)

Let TH be a slightly modified, randomized Huber estimator:

   TH = T*    with probability 1/2                                     (3.26)
      = T**   with probability 1/2
where

   T*  = sup{ t : Σ_{i=1}^n ψH(Xi − t) ≥ 0 }
                                                                       (3.27)
   T** = inf{ t : Σ_{i=1}^n ψH(Xi − t) ≤ 0 }
ψH is the Huber function (3.16), and the randomization does not depend on X1, . . . , Xn. Then TH is translation equivariant, i.e., it satisfies (3.11). We shall show that TH minimizes the inaccuracy (3.25) among all translation equivariant estimators of θ. To be more precise, let us formulate it as a theorem. The sketch of the proof of Theorem 3.1 can be omitted on first reading.

Theorem 3.1 Assume that the bound k in (3.16) is connected with ε and with a > 0 in (3.25) through the following identity:

   e^{−2ak} [Φ(a − k) − ε] + Φ(a + k) − ε = 1                          (3.28)
where Φ is the standard normal distribution function. Then the estimator TH defined in (3.26) and (3.27) minimizes the inaccuracy (3.25) in the family of translation equivariant estimators of θ.

Sketch of the proof: The main idea is to construct a minimax test of the hypothesis that the parameter equals −a against the alternative that it equals +a. The estimator TH will be an inversion of this minimax test.

Let Φ be the standard normal distribution function and φ(x), x ∈ R, its density. Moreover, denote by

   p−(x) = φ(x − a),   p+(x) = φ(x + a),   x ∈ R

the shifted normal densities, by Φ−(x) = Φ(x − a) and Φ+(x) = Φ(x + a) their distribution functions, and by P− and P+ the corresponding probability distributions. Then Φ−(x) < Φ+(x) ∀x, and the likelihood ratio

   p−(x)/p+(x) = e^{2ax}                                               (3.29)
is strictly increasing in x. Introduce two families of distribution functions:

   F− = {G ∈ F : G(x) ≤ Φ(x − a) + ε ∀x ∈ R}
                                                                       (3.30)
   F+ = {G ∈ F : G(x) ≥ Φ(x + a) − ε ∀x ∈ R}

We can assume that F− ∩ F+ = ∅, which is true for sufficiently small ε. We
shall look for the minimax test of the hypothesis H: F ∈ F− against the alternative K: F ∈ F+. This test will be the likelihood ratio test of the two least favorable distributions of the families F− and F+, respectively. We shall show that the least favorable distributions have the densities

   g−(x) = [p+(x) + p−(x)](1 + e^{2ak})^{−1}    . . . x < −k
         = p−(x)                                 . . . |x| ≤ k          (3.31)
         = [p+(x) + p−(x)](1 + e^{−2ak})^{−1}   . . . x > k

and

   g+(x) = [p+(x) + p−(x)](1 + e^{−2ak})^{−1}   . . . x < −k
         = p+(x)                                 . . . |x| ≤ k          (3.32)
         = [p+(x) + p−(x)](1 + e^{2ak})^{−1}    . . . x > k
Denote by G−, G+ the distribution functions and by Q−, Q+ the probability distributions corresponding to the densities g−, g+, respectively. We can easily verify that the log-likelihood ratio of g− and g+ is connected with ψH in the following way:

   ln Π_{i=1}^n [ g−(Xi)/g+(Xi) ] = 2a Σ_{i=1}^n ψH(Xi)

and the likelihood ratio test of the hypothesis that the true distribution is g+ against g−, with minimax risk α, rejects the hypothesis for large values of the likelihood ratio, i.e., when Σ_{i=1}^n ψH(Xi) > K for a suitable K. Mathematically such a test is characterized by a test function ζ(x), the probability that the test rejects the hypothesis given the observation x:

   ζ(x) = 1   . . . Σ_{i=1}^n ψH(Xi) > K
        = γ   . . . Σ_{i=1}^n ψH(Xi) = K                               (3.33)
        = 0   . . . Σ_{i=1}^n ψH(Xi) < K

where K is determined so that

   Q+( Σ_{i=1}^n ψH(Xi) > K ) = α,   Q−( Σ_{i=1}^n ψH(Xi) > K ) = 1 − α       (3.34)

and α ∈ (0, 1/2) is the minimax risk; from the symmetry we conclude that K = 0 and γ = 1/2. It remains to show that G−, G+ is really the least favorable pair of distribution functions for the families (3.30), and that the test (3.33) is minimax.
But this follows from the inequalities

   Q( g−(X)/g+(X) > t ) ≥ Q−( g−(X)/g+(X) > t )   for every Q ∈ F−
                                                                       (3.35)
   Q( g−(X)/g+(X) > t ) ≤ Q+( g−(X)/g+(X) > t )   for every Q ∈ F+

that hold for all t > 0. Indeed, (3.35) is trivially true for (1/(2a)) ln t < −k and (1/(2a)) ln t > k, and for −k ≤ (1/(2a)) ln t ≤ k it follows from (3.30), (3.31) and (3.32).

If the distribution of the Xi, i = 1, . . . , n, belongs to F−, then it follows from (3.35) that the likelihood ratio Π_{i=1}^n g−(Xi)/g+(Xi) is stochastically smallest when the Xi are identically distributed with density g−. Analogously, if the distribution of the Xi, i = 1, . . . , n, belongs to F+, then the likelihood ratio Π_{i=1}^n g−(Xi)/g+(Xi) is stochastically largest when the Xi are identically distributed with density g+. Thus the test (3.33) minimizes
   max{ sup_{G∈F+} EG(ζ),  sup_{G∈F−} EG(1 − ζ) }                      (3.36)
hence it is really minimax.

If the distribution of X − θ belongs to F, then that of X − θ − a and that of X − θ + a belong to F+ and F−, respectively, and

   Pθ(TH(X) > θ + a) = P0(TH(X) > a)
      = (1/2) P0(T*(X) > a) + (1/2) P0(T**(X) > a)
      = (1/2) P0(T*(X1 − a, . . . , Xn − a) > 0) + (1/2) P0(T**(X1 − a, . . . , Xn − a) > 0)
      ≤ (1/2) P0( Σ_{i=1}^n ψH(Xi − a) > 0 ) + (1/2) P0( Σ_{i=1}^n ψH(Xi − a) ≥ 0 )
      = EP0( ζ(X1 − a, . . . , Xn − a) ) ≤ EQ+( ζ(X1 − a, . . . , Xn − a) ) = α

as follows from (3.34) and (3.36). Similarly we verify that

   Pθ(TH(X) < θ − a) ≤ α

Now let T be a translation equivariant estimator. Because the distributions Q+ and Q− are absolutely continuous, T has a continuous distribution function both under Q+ and Q− (see Problem 3.4), hence Q+(T(X) = 0) = Q−(T(X) = 0) = 0.
Then T induces a test of F+ against F− that rejects when T(X) > 0, and because the test based on the Huber estimator is minimax with the minimax risk α, we conclude that

   sup_{Q+∈F+, Q−∈F−} max{ Q+(T > 0), Q−(T < 0) } ≥ α

hence no equivariant estimator can be better than TH.  □

3.5 Moment convergence of M-estimators
Summarizing the conditions imposed on a good estimator, it is desirable to have an M-estimator Tn with a bounded influence function and with breakdown point 1/2, which estimates θ consistently with the rate of consistency √n and such that √n(Tn − θ) has an asymptotic normal distribution. The asymptotic distribution naturally has finite moments; however, we also wish Tn to have finite moments tending to the moments of the asymptotic distribution. In other words, we would welcome the uniform integrability of the sequence √n(Tn − θ) and of its powers. Indeed, we can prove the moment convergence of M-estimators for a broad class of bounded ψ-functions and under some conditions on the density f. For an illustration, we shall prove it under the following conditions (A.1) and (A.2). The conditions can still be weakened, but (A.1) and (A.2) already cover a broad class of M-estimators with a bounded influence. This was first proved by Jurečková and Sen (1982).

(A.1) X1, . . . , Xn is a random sample from a distribution with density f(x − θ), where f is positive, symmetric, absolutely continuous and nonincreasing for x ≥ 0; we assume that f has positive and finite Fisher information,

   0 < I(f) = ∫_R ( f′(x)/f(x) )² dF(x) < ∞

and that there exists a positive number δ (not necessarily an integer or ≥ 1) such that

   E|X1|^δ = ∫_R |x|^δ dF(x) < ∞

(A.2) ψ is nondecreasing and skew-symmetric, ψ(x) = −ψ(−x), x ∈ R, and

   ψ(x) = ψ(c) · sign x   for |x| > c,   c > 0

Moreover, ψ can be decomposed into absolutely continuous and step components, i.e., ψ(x) = ψ1(x) + ψ2(x), x ∈ R, where ψ1 is absolutely continuous inside (−c, c) and ψ2 is a step function with a finite number of jumps inside (−c, c), i.e.,

   ψ2(x) = bj   . . . dj−1 < x < dj,   j = 1, . . . , m + 1,   d0 = −c, dm+1 = c
Theorem 3.2 For every r > 0 there exists nr < ∞ such that, under conditions (A.1) and (A.2),

   Eθ[ n^{r/2} |Tn − θ|^r ] < ∞,   uniformly in n ≥ nr                  (3.37)

Moreover,

   lim_{n→∞} Eθ[ (√n |Tn − θ|)^r ] = ν^r ∫_R |x|^r dΦ(x)                (3.38)

and, especially,

   lim_{n→∞} Eθ[ (√n |Tn − θ|)^{2r} ] = ν^{2r} (2r)! / (2^r r!)         (3.39)

for r = 1, 2, . . . , where

   ν² = σ²/γ²,   σ² = ∫_R ψ²(x) dF(x)
                                                                        (3.40)
   γ = ∫_R ψ1′(x) dF(x) + Σ_{j=1}^m (bj − bj−1) f(dj)   (> 0)

and Φ is the standard normal distribution function. Furthermore, n^{1/2}(Tn − θ) is asymptotically normally distributed,

   L{ n^{1/2}(Tn − θ) } → N(0, ν²)                                      (3.41)
Sketch of the proof. We can put θ = 0 without loss of generality. First, because F has a finite δ-th absolute moment,

   max_{x∈R} { |x|^δ F(x)(1 − F(x)) } = C < ∞                           (3.42)

and

   ∫_R [F(x)(1 − F(x))]^λ dx < ∞   ∀λ > 1/δ                             (3.43)

Let a1 > c > 0, where c comes from condition (A.2). Then

   E[ n^{r/2} |Tn|^r ] = ∫_0^∞ r t^{r−1} P(√n |Tn| > t) dt
      = ( ∫_0^{a1√n} + ∫_{a1√n}^∞ ) r t^{r−1} P(√n |Tn| > t) dt = In1 + In2

We shall first estimate the probability

   P(√n |Tn| > t) = 2 P(√n Tn > t)
      ≤ 2 P( (1/n) Σ_{i=1}^n ψ(Xi − t n^{−1/2}) − (1/n) E[ Σ_{i=1}^n ψ(Xi − t n^{−1/2}) ] ≥ −E[ψ(X1 − t n^{−1/2})] )        (3.44)
where we use the inequalities

   −E[ψ(X1 − t n^{−1/2})] = −E[ψ(X1 − t n^{−1/2}) − ψ(X1)]
      = ∫_{−c}^{c} [F(x + t n^{−1/2}) − F(x)] dψ(x)
      = ∫_0^{c} [F(x + t n^{−1/2}) − F(x − t n^{−1/2})] dψ(x)           (3.45)
      ≥ 2 t n^{−1/2} f(c + t n^{−1/2}) [ψ(c) − ψ(0)]
      ≥ 2 t n^{−1/2} f(c + a1) ψ(c),   ∀t ∈ (0, a1√n)
are independent random variables with means 0, bounded by 2ψ(c). Thus we can use the Hoeffding inequality (Theorem 2 in Hoeffding (1963)) and obtain √ for 0 < t < a1 n √ P ( n|Tn | > t) ≤ 2exp{−a2 t2 } (3.46) 2 2 where a2 ≥ 2f (c + a1 )ψ (c). Hence, a1 √n ∞ In1 ≤ 2r exp{−a2 t2 }tr−1 dt ≤ 2r exp{−a2 t2 }tr−1 dt < ∞ (3.47) 0
0
√ On the other hand, if t ≥ a1 n, then for n = 2m ≥ 2, √ √ 1 P ( n|Tn | > t) = 2P ( nTn > t) ≤ 2P (Xn:m+1 ≥ −c + tn− 2 ) (3.48) 1 1 n−1 ≤ 2n um (1 − u)n−m−1 du ≤ 2[q(F (−c + tn− 2 ))]n −1 m−1 F (−c+tn 2 ) where Xn:1 ≤ . . . ≤ Xn:n are the order statistics and q(u) = 4u(1 − u) ≤ 1, 0 ≤ u ≤ 1 √ 1 Actually, F (−c + tn− 2 ) > 12 for t ≥ a1 n, and 1 1 n−1 2n um (1 − u)n−m−1 du ≤ 2[q(A)]n for A > m−1 2 A that can be proved again with the aid of the Hoeffding inequality. If n = 2m+1, we similarly get 1 √ n−1 P ( n|Tn | > t) ≤ 2n um (1 − u)n−m du 1 m F (−c+tn− 2 ) ≤ 2[q(F (−c + tn− 2 ))]n 1
(3.49)
STUDENTIZED M -ESTIMATORS
61
Finally, using (3.42)–(3.44), (3.46), (3.48)–(3.49), we obtain r r−1 − 12 n 2 In2 ≤ 2r t [q(F (−c + tn ))] dt = 2rn xr−1 [q(F (−c + x))]n dx √ a1 n a1 r r−1 r [4F (y)(1 − F (y))]λ dy < ∞ ≤ 2r(C ) n 2 [q(F (−c + a1 ))]n−[ ]−1−λ a1 −c
for n ≥ (3.37).
[ r ]
+ 1, and limn→∞ In2 = 0. This, combined with (3.47), proves
It remains to prove the moment convergence (3.39). Under the conditions (A.1)–(A.2), the M -estimator admits the asymptotic representation n √ 1 1 ψ(Xi − θ) + Op (n− 4 ) γ n(Tn − θ) = n− 2 i=1
proved by Jureˇckov´ a (1980). Since Eθ ψ(X1 − θ) = 0 and ψ is bounded, all moments of ψ(X1 − θ) exist. Hence, the von Bahr (1965) theorem on moment convergence of sums of independent random variables applies to 1 n n− 2 i=1 ψ(Xi − θ), and this further implies (3.40) for any positive integer r, in view of the uniform integrability (3.37). It further extends to any s − 12 n positive real r, because n i=1 ψ(Xi − θ) is uniformly integrable for any s ∈ [2r − 2, 2r]. The asymptotic normality then follows from the central limit theorem. 2
3.6 Studentized M -estimators The M -estimator of the shift parameter is translation equivariant but generally it is not scale equivariant (see (3.11)). This shortage can be overcome by using either of the following two methods: • We estimate the scale simultaneously with the location parameter: e.g., Huber (1981) proposed estimating the scale parameter σ simultaneously with the location parameter θ as a solution of the following system of equations: n Xi − θ ψH =0 (3.50) σ i=1 n Xi − θ χ =0 σ i=1
(3.51)
2 2 (x) − R ψH (y)dΦ(y), ψH is the Huber function (3.16), where χ(x) = ψH and Φ is the distribution function of the standard normal distribution. • We can obtain a translation and scale equivariant estimator of θ, if we
62
ROBUST ESTIMATORS OF REAL PARAMETER studentize the M -estimator by a convenient scale statistic Sn (X1 , . . . , Xn ) and solve the following minimization: n Xi − θ ρ := min, θ ∈ R (3.52) Sn i=1 However, to guarantee the translation and scale equivariance of the solution of (3.52), our scale statistic should satisfy the following conditions: (a) Sn (x) > 0 a.e. for x ∈ R (b) Sn (x1 + c, . . . , xn + c) = Sn (x1 , . . . , xn ), c ∈ R, x ∈ Rn (translation invariance) (c) Sn (cx1 , . . . , cxn ) = cSn (x1 , . . . , xn ), c > 0, x ∈ Rn (scale equivariance)
Moreover, it is convenient if Sn consistently estimates a statistical functional S(F ), so that √ n(Sn − S(F )) = Op (1) as n → ∞ (3.53) Indeed, the estimator defined as in (3.52) is translation and scale equivariant, and the pertaining statistical functional T (F ) is defined implicitly as a solution of the minimization x−t ρ dF (x) := min, t ∈ R (3.54) S(F ) X The functional is Fisher consistent, provided the solution of the minimization (3.54) is unique. If ρ has a continuous derivative ψ, then the estimator equals a root of the equation n Xi − θ ψ =0 (3.55) Sn i=1 If ρ is convex and hence ψ is nondecreasing, but discontinuous at some points or constant on some intervals, we obtain a unique studentized estimator analogously as in (3.12), namely Tn = Tn−
1 + (T + Tn− ) 2 n
= sup{t :
n
ψ
i=1
Tn+ = inf{t :
n i=1
ψ
Xi − t Sn
Xi − t Sn
> 0}
(3.56)
< 0}
There is a variety of possible choices of Sn , because there is no universal scale functional. Let us mention some of the most popular choices of the scale statistic Sn :
L-ESTIMATORS
63
• Sample standard deviation: Sn =
n 1
n
¯ n )2 (Xi − X
12
i=1 1
S(F ) = (varF (X)) 2 This functional, being highly non-robust, is used for studentization only in special cases, as in the Student t-test under normality. • Inter-quartile range: Sn = Xn:[ 34 n] − Xn:[ 14 n] where Xn:[np] , 0 < p < 1 is the empirical p-quantile of the ordered sample Xn:1 ≤ . . . ≤ Xn:n . The corresponding functional has the form S(F ) = F −1 ( 34 ) − F −1 ( 14 ) • Median absolute deviation (MAD): ˜n| Sn = med1≤i≤n |Xi − X The corresponding statistical functional S(F ) is a solution of the equation F S(F ) + F −1 ( 12 ) − F −S(F ) + F −1 ( 12 ) = 12 and S(F ) = F −1 ( 34 ) provided the distribution function F is symmetric around 0 and F −1 ( 12 ) = 0. The influence function of the studentized M -functional in the symmetric model satisfying F (−x) = 1 − F (x), ρ(−x) = ρ(x), with absolutely continuous ψ satisfying ψ(−x) = −ψ(x), x ∈ R, has the form x − T (F ) S(F ) ψ IF (x, T, F ) = γ(F ) S(F )
y where γ(F ) = R ψ S(F ) dF (y). Hence, the influence function of T (F ) in the symmetric model depends on the value of S(F ), but not on the influence function of the functional S(F ).
3.7 L-estimators L-estimators are based on the ordered observations (order statistics) Xn:1 ≤ . . . ≤ Xn:n of the random sample X1 , . . . , Xn . The general L-estimator can be written in the form Tn =
n i=1
cni h(Xn:i ) +
k
aj h∗ (Xn:[npj ]+1 )
(3.57)
j=1
where cn1 , . . . , cnn and a1 , . . . , ak are given coefficients, 0 < p1 < . . . < pk < 1 and h(·) and h∗ (·) are given functions. The coefficients cni , 1 ≤ i ≤ n are
64
ROBUST ESTIMATORS OF REAL PARAMETER
generated by a bounded weight function J : [0, 1] → R in the following way: either ni cni = J(s)ds, i = 1, . . . , n (3.58) i−1 n
or approximately cni =
1 J n
i n+1
,
i = 1, . . . , n
(3.59)
The first component of the L-estimator (3.57) generally involves all order statistics, while the second component is a linear combination of several (finitely many) sample quantiles. Many L-estimators have just the form either of the first or of the second component in (3.57); we speak about L-estimators of type I or II, respectively. ˜n The simplest examples of L-estimators of location are the sample median X and the midrange 1 Tn = (Xn:1 + Xn:n ) 2 The popular L-estimators of scale are the sample range Rn = Xn:n − Xn:1 and the Gini mean difference Gn =
n n 1 2 |Xi − Xj | = (2i − n − 1)Xn:i n(n − 1) i,j=1 n(n − 1) i=1
The L-estimators of type I are more important for applications. Let us consider some of their main characteristics. Let L-estimator Tn have an integrable 1 weight function J such that 0 J(u)du = 1. Its corresponding statistical functional is based on the empirical quantile function Qn (t) = Fn−1 (t) = inf{x : Fn (x) ≥ t}, 0 < t < 1 that is the empirical counterpart of the quantile function Q(t) = F −1 (t) = inf{x : F (x) ≥ t}, 0 < t < 1 and is equal to i i = 1, . . . , n − 1 Xn:i . . . i−1 n < t ≤ n, Qn (t) = (3.60) n−1 Xn:n . . . n
The corresponding functional has the form 1 T (F ) = J(s)h (Q(s)) ds 0
(3.62)
L-ESTIMATORS
65
3.7.1 Influence function of L-estimator Assume that F is increasing and absolutely continuous with derivative f, and that the function h in (3.57) is absolutely continuous. Denote (1 − t)F (y) y<x Ft (y) = (1 − t)F (y) + tδx = (1 − t)F (y) + t y≥x the contamination of F by the distribution function of the constant x, i.e., by 0 if y < x δx (y) = 1 if y ≥ x Then
⎧ u −1 ⎪ F ⎪ 1−t ⎪ ⎨ −1 x Ft (u) = ⎪ ⎪ ⎪ ⎩ F −1 u−t 1−t
and hence dFt−1 (u) dt
=
u ≤ (1 − t)F (x) (1 − t)F (x) < u ≤ (1 − t)F (x) + t u > (1 − t)F (x) + t
⎧ ⎪ ⎨
u (1−t)2
·
1 u f (F −1 ( 1−t ))
u < (1 − t)F (x)
⎪ ⎩
u−1 (1−t)2
·
1 f (F −1 ( u−t 1−t ))
u > (1 − t)F (x) + t
This implies that dT (Ft ) = dt
Ft (x)
= 0
1
+ Ft (x)
1 0
dFt−1 (u) du J(u)h Ft−1 (u) · dt
u −1 F h 1−t u J(u)du · (1 − t)2 f F −1 u 1−t
−1 u−t 1−t u−1 h F J(u)du · (1 − t)2 f F −1 u−t 1−t
and t → 0+ leads to the influence function of the functional (3.62): dT (Ft ) = dt t=0
F (x)
s·
= 0
h (F −1 (u)) J(u)du + f (F −1 (u))
1
(u − 1) · F (x)
h (F −1 (u)) J(u)du f (F −1 (u))
66
= 0
1
ROBUST ESTIMATORS OF REAL PARAMETER 1 h (F −1 (u)) h (F (u)) J(u)du − J(u)du u· −1 (u)) f (F −1 (u)) F (x) f (F
−1
Hence, the influence function of T (F ) is equal to ∞ ∞ IF (x, T, F ) = F (y)h (y)J(F (y))dy − h (y)J(F (y))dy −∞
(3.63)
x
Notice that it satisfies the identity d IF (x, T, F ) = h (x)J(F (x)) dx If h(x) ≡ x, x ∈ R, J(u) = J(1 − u), 0 < u < 1 and F is symmetric around 0, the influence function simplifies: ∞ ∞ IF (x, T, F ) = F (y)J(F (y))dy − J(F (y))dy −∞
x
∞
0
F (y)J(F (y))dy +
=
−∞
0
−
(1 − F (−y))J(1 − F (−y))dy
∞
J(F (y))dy x
∞
=
∞
F (y)J(F (y))dy + 0
− =
(1 − F (y))J(F (y))dy
0 ∞
J(F (y))dy x ∞
J(F (y))dy −
0
and hence IF (x, T, F ) =
∞
J(F (y))dy x
x 0
J(F (y))dF (y)
... x ≥ 0 (3.64)
IF (−x, T, F ) = −IF (x, T, F )
... x ∈ R
Remark 3.1 If M -estimator Mn of the center of symmetry is generated by absolutely continuous function ψ, and Ln is the L-estimator with the weight function J(u) = c ψ (F −1 (u)), then the influence functions of Mn and Ln coincide. 3.7.2 Breakdown point of L-estimator If the L-estimator Tn is trimmed in the sense that its weight function satisfies J(u) = 0 for 0 < u ≤ α and 1 − α ≤ u < 1, and ε∗n = mnn is its breakdown point, then limn→∞ ε∗n = α.
L-ESTIMATORS
67 1 2)
Example 3.2 α-trimmed mean ( 0 < α < quantiles: ¯ nα = X hence
cni =
J(u) =
is the average of the central
n−[nα]
1 n − 2[nα]
Xn:i
i=[nα]+1
1 n−[nα]
. . . [nα] + 1 ≤ i ≤ n − [nα]
0
. . . otherwise
1 I[α ≤ u ≤ 1 − α] 1 − 2α
Tn = T (Fn ) =
1 1 − 2α
1 1 − 2α
T (F ) =
1−α
1−α
Fn−1 (u)du
α
F −1 (u)du
α
¯ nα follows from (3.63): The influence function of X ∞ F (y)J(F (y))dy − J(F (y))dy IF (x, T, F ) = x
R
1 = 1 − 2α
F −1 (1−α) F −1 (α)
F (y)dy −
∞
I[α < F (y) < 1 − α]dy
x
hence IF (x, T, F ) + µα = =−
1 −1 αF (1 − α) − (1 − α)F −1 (α) I x < F −1 (α) 1 − 2α
+
1 x − αF −1 (α) − αF −1 (1 − α) I F −1 (α) ≤ x ≤ F −1 (1 − α) 1 − 2α
+
1 −αF −1 (α) + (1 − α)F −1 (1 − α) I x > F −1 (1 − α) 1 − 2α
where µα =
1 1 − 2α
1−α α
F −1 (u)du =
1 1 − 2α
F −1 (1−α)
ydF (y) F −1 (α)
68
ROBUST ESTIMATORS OF REAL PARAMETER
If F is symmetric, then F −1 (u) = −F −1 (1 − u), 0 < u < 1 and µα = 0; then ⎧ F −1 (1−α) − 1−2α . . . x < −F −1 (1 − α) ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ x . . . −F −1 (1 − α) ≤ x ≤ F −1 (1 − α) IF (x, T, F ) = 1−2α ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ F −1 (1−α) . . . x > F −1 (1 − α) 1−2α The global sensitivity of the trimmed mean is γ∗ =
F −1 (1 − α) 1 − 2α
Remark 3.2 If Mn is the Huber estimator of the center of symmetry θ of F (x−θ), generated by the Huber function ψH with k = F −1 (1 − α) (see (3.16)), then the influence ¯ nα coincide. functions of Mn and X Remark 3.3 ¯ nα . Then (i) Let ε∗n = mnn be the breakdown point of the α-trimmed mean X ∗ limn→∞ εn = α. ¯ n,αn ; a) be the tail-behavior measure of (ii) Let αn = [k/n], n ≥ 3 and let B(X ¯ n,αn , defined in (2.14). Then, if F has exponential tails (2.17), X ¯ n,αn ; a) ≤ lima→∞ B(X ¯n,αn ; a) ≤ n − k n − 2k ≤ limand→∞ B(X and if F has heavy tails (2.19) and k <
n−1 2 ,
(3.65)
then
¯ nα ; a) = k + 1 lim B(X
(3.66)
a→∞
Example 3.3 The α-Winsorized mean is an example of an L-estimator of the general form (3.57), that has two components: W nα = T (Fn )
(3.67)
1 [nα]Xn:[nα]+1 + = n = αFn−1 (α) +
1−α
n−[nα]
Xn:i + [nα]Xn:n−[nα]
i=[nα]+1
Fn−1 (u)du + αFn−1 (1 − α)
α
=
n
cni Xn:i +
i=1
where cni =
[nα] + 1 [nα] + 1 Xn:[nα]+1 + Xn:n−[nα] n n
1 n
. . . 1 + [nα] < i < n − [nα]
0
. . . otherwise
L-ESTIMATORS
69
The extreme quantiles are not trimmed but replaced with quantiles Xn:[nα]+1 and Xn:n−[nα] , respectively. For the sake of simplicity, let us consider the model with symmetric distribution function F. The statistical functional T (F ) corresponding to W nα is T (F ) = T1 (F ) + T2 (F ) 1−α = F −1 (u)du + αF −1 (α) + αF −1 (1 − α) α
The influence function of T1 (F ) follows from (3.63), while the influence function of T2 (F ) is a modification of the influence function of the median (3.15) that is the α-quantile with α = 12 ; thus IF (x, W nα , F ) = = F −1 (α) −
α I[x < F −1 (α)] f (F −1 (α))
+x I[F −1 (α) ≤ x ≤ F −1 (1 − α)] +F −1 (1 − α) +
α f (F −1 (1
− α))
I[x > F −1 (1 − α)]
The global sensitivity of the Winsorized mean is α γ ∗ = F −1 (α) + f (F −1 (1 − α)) and the limiting breakdown point of W nα is ε∗ = α. The influence function of the Winsorized mean has jump points at F −1 (α) and F −1 (1 − α), while the influence function of the α-trimmed mean is continuous. Example 3.4 (i) Sen’s weighted mean (Sen (1964)): −1 n n i−1 n−i Tn,k = Xn:i 2k + 1 k k i=1
¯ n and Tn,k is the sample median if where 0 < k < Notice that Tn,0 = X either n is even and k = n2 − 1 or n is odd and k = n−1 2 . n−1 2 .
(ii) The Harrell-Davis estimator of the p-quantile (Harrell and Davis (1982)):
T_n = Σ_{i=1}^n c_{ni} X_{n:i}
c_{ni} = [Γ(n + 1)/(Γ(k)Γ(n − k + 1))] ∫_{(i−1)/n}^{i/n} u^{k−1}(1 − u)^{n−k} du,   i = 1, ..., n,
where k = [np], 0 < p < 1 (a small R sketch follows this example).
(iii) BLUE (asymptotically best linear unbiased estimator) of the location parameter (more properties are described by Blom (1956) and Jung (1955, 1962)). Let X_1, X_2, ... be independent observations with the distribution function F(x − θ), where F has an absolutely continuous density f with derivative f′. Then the BLUE is the L-estimator with the weights
T_n = Σ_{i=1}^n c_{ni} X_{n:i},   c_{ni} = (1/n) J(i/(n + 1)),   i = 1, ..., n,
J(F(x)) = ψ_f(x),   ψ_f(x) = −f′(x)/f(x),   x ∈ R
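A minimal R sketch of the Harrell-Davis estimator of (ii), under the convention k = [np] used above; the weights are increments of the Beta(k, n − k + 1) distribution function, so pbeta can be used. The function name is ours.

harrell.davis <- function(x, p)
{
  x <- sort(x[!is.na(x)])
  n <- length(x)
  k <- floor(n * p)                # requires 1 <= k <= n
  i <- 1:n
  w <- pbeta(i / n, k, n - k + 1) - pbeta((i - 1) / n, k, n - k + 1)
  sum(w * x)                       # weighted combination of order statistics
}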
3.8 Moment convergence of L-estimators

Similarly as in the case of M-estimators, we can prove the moment convergence of L-estimators for a broad class of bounded J-functions and under some conditions on the density f. Consider the L-estimator
T_n = Σ_{i=1}^n c_{ni} X_{n:i}   (3.68)
where X_{n:1} ≤ X_{n:2} ≤ ... ≤ X_{n:n} are the order statistics corresponding to the observations X_1, ..., X_n and
c_{ni} = c_{n,n−i+1} ≥ 0,   i = 1, ..., n,   Σ_{i=1}^n c_{ni} = 1
and
c_{ni} = c_{n,n−i+1} = 0 for i ≤ k_n,   where lim_{n→∞} k_n/n = α_0   (3.69)
for some α_0, 0 < α_0 < 1/2. Assume that the independent observations X_1, ..., X_n are identically distributed with a distribution function F(x − θ) such that F has a symmetric density f(x), f(−x) = f(x), x ∈ R, f is monotonically decreasing in x for x ≥ 0, and
sup_{F^{-1}(α_0) ≤ x ≤ F^{-1}(1−α_0)} |f(x)| < ∞   (3.70)
Denote
J_n(t) = n c_{ni}   for   (i − 1)/n < t ≤ i/n,   i = 1, ..., n   (3.71)
and assume that
J_n(t) → J(t) a.s.  ∀t ∈ (0, 1)   (3.72)
where J : [0, 1] → [0, ∞) is a symmetric and integrable function, J(t) = J(1 − t) ≥ 0, 0 ≤ t ≤ 1, and ∫_0^1 J(t) dt = 1. Then (see Huber (1981)) the
sequence √n(T_n − θ) has an asymptotic normal distribution N(0, σ_L²), where
σ_L² = ∫_R ∫_R [F(x ∧ y) − F(x)F(y)] J(F(x)) J(F(y)) dx dy   (3.73)

Theorem 3.3 Under the conditions (3.69)–(3.73), for any positive integer r,
lim_{n→∞} E_θ[√n(T_n − θ)]^{2r} = σ_L^{2r} (2r)!/(2^r r!)   (3.74)
We shall only sketch the basic steps of the proof; the detailed proof can be found in Jurečková and Sen (1982). We shall use the following lemma, which follows from the results of Csörgő and Révész (1978):

Lemma 3.1 Under the conditions of Theorem 3.3, for any n ≥ n_0 there exists a sequence of random variables {Y_{ni}}_{i=1}^{n+1}, independent and normally distributed N(0, 1), such that
| √n(T_n − θ) − (1/√(n+1)) Σ_{j=1}^{n+1} a_{nj} Y_{nj} | = O(n^{-1/2} log n) a.s.   (3.75)
as n → ∞, where
a_{nj} = Σ_{i=j}^{n} b_{ni} − (1/(n+1)) Σ_{i=1}^{n+1} b_{ni},   b_{ni} = c_{ni} / f(F^{-1}(i/(n+1))),   i = 1, ..., n

Proof of Lemma 3.1: Put θ = 0 without loss of generality. Using Theorem 6 of Csörgő and Révész (1978), we conclude that there exists a sequence of Brownian bridges {B_n(t) : 0 ≤ t ≤ 1} such that
max_{k_n+1 ≤ i ≤ n−k_n} | √n (X_{n:i} − F^{-1}(i/(n+1))) f(F^{-1}(i/(n+1))) − B_n(i/(n+1)) | = O(n^{-1/2} log n) a.s.
as n → ∞, hence
| √n T_n − Σ_{i=1}^{n} b_{ni} B_n(i/(n+1)) | = O(n^{-1/2} log n)
The process {(t + 1)B_n(t/(t+1)) : t ≥ 0} is a standard Wiener process W_n on [0, ∞); thus W_n(k) = Σ_{i=1}^{k} Y_{ni}, k = 1, 2, ..., where the Y_{ni} are independent random variables with N(0, 1) distributions. □

Sketch of the proof of Theorem 3.3: Because Σ_{i=1}^{n} c_{ni} F^{-1}(i/(n+1)) = 0, we get by the Jensen inequality (put θ = 0)
(√n |T_n|)^{2r} ≤ [ √n Σ_{i=1}^{n} c_{ni} |X_{n:i} − F^{-1}(i/(n+1))| ]^{2r}
             ≤ Σ_{i=k_n+1}^{n−k_n} c_{ni} [ √n |X_{n:i} − F^{-1}(i/(n+1))| ]^{2r}
hence
E_0(√n |T_n|)^{2r} ≤ Σ_{i=k_n+1}^{n−k_n} c_{ni} E[ √n |X_{n:i} − F^{-1}(i/(n+1))| ]^{2r} < ∞
This together with Lemma 3.1 implies Theorem 3.3. □
3.9 Sequential M- and L-estimators

Let T_n be a fixed estimator (e.g., an M- or L-estimator) of θ based on n independent observations X_1, ..., X_n, and assume that θ is the center of symmetry of the distribution function F(x − θ). Assume that the loss incurred when estimating θ by T_n also includes the expenses; more precisely, let c > 0 be the price of one observation and let the global loss be
L(T_n, θ, c) = a(T_n − θ)² + cn   (3.76)
where a > 0 is a constant. The corresponding risk is
R_n(T_n, θ, c) = a E_θ(T_n − θ)² + cn   (3.77)
Our goal is to find the sample size n minimizing the risk (3.77). Let us first consider the situation that F is known, that σ_n² = n E_θ(T_n − θ)² exists for n ≥ n_0, and that
lim_{n→∞} σ_n² = σ²(F),   0 < σ²(F) < ∞   (3.78)
Hence, we want to minimize aσ_n²/n + cn with respect to n, and if we use the approximation (3.78), the approximate solution n_0(c) has the form
n_0(c) ≈ σ(F) √(a/c)   (3.79)
and for the minimum risk we obtain
R_{n_0(c)}(T_{n_0(c)}, θ, c) ≈ 2σ(F) √(ac)   (3.80)
where p(c) ≈ q(c) means that lim_{c↓0} q(c)/p(c) = 1. Then obviously n_0(c) ↑ ∞ as c ↓ 0.
If the distribution function F is unknown, we cannot know σ²(F) either. But we can still solve the problem sequentially, if there is a sequence σ̂_n of estimators of σ(F). We set the random sample size (stopping rule) N_c, defined as
N_c = min{ n ≥ n_1 : n ≥ √(a/c) (σ̂_n + n^{-ν}) },   c > 0   (3.81)
where n_1 is an initial sample size and ν > 0 is a chosen number. Then N_c ↑ ∞ with probability 1 as c ↓ 0. The resulting estimator of θ is T_{N_c}, based on the N_c observations X_1, ..., X_{N_c}. The corresponding risk is
R*(T_n, θ, c) = a E(T_{N_c} − θ)² + c E N_c
We shall show that, if T_n is either a suitable M-estimator or an L-estimator, then
R*(T_n, θ, c) / R_{n_0(c)}(T_{n_0(c)}, θ, c) → 1   as c ↓ 0   (3.82)
An interpretation of the convergence (3.82) is that the sequential estimator T_{N_c} is asymptotically (in the sense that c ↓ 0) equally risk-efficient as the optimal non-sequential estimator T_{n_0(c)} corresponding to the case that σ²(F) is known. Such a problem was first considered by Ghosh and Mukhopadhyay (1979) and later by Chow and Yu (1981) under weaker conditions; they proved (3.82) for T_n the sample mean and σ̂_n² the sample variance. Sen (1980) solved the problem for a class of R- (rank-based) estimators of θ.
Let T_n be the M-estimator of θ generated by a nondecreasing and skew-symmetric function ψ, and assume that ψ and F satisfy conditions (A.1) and (A.2) of Section 3.5; put
S_n(t) = Σ_{i=1}^n ψ(X_i − t)
Then T_n is defined by (3.12) and √n(T_n − θ) is asymptotically normally distributed N(0, σ²(ψ, F)), where
σ²(ψ, F) = ∫_R ψ²(x) dF(x) / ( ∫_R f(x) dψ(x) )²
(see Huber (1981)). Put
s_n² = (1/n) Σ_{i=1}^n ψ²(X_i − T_n)
Choose α ∈ (0, 1) and put
M_n^- = sup{ t : n^{-1/2} S_n(t) > s_n Φ^{-1}(1 − α/2) }
M_n^+ = inf{ t : n^{-1/2} S_n(t) < s_n Φ^{-1}(α/2) }
d_n = M_n^+ − M_n^-
Then
σ̂_n = √n d_n / (2 Φ^{-1}(1 − α/2)) →^p σ(ψ, F)   as n → ∞,
as proved, e.g., in Jurečková (1977). If N_c is the stopping rule defined in (3.81) with this σ̂_n, then T_{N_c} is a risk-efficient M-estimator of θ.
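As a toy illustration of the stopping rule (3.81), the sketch below uses the sample mean and the sample standard deviation in the roles of T_n and σ̂_n (the case considered by Ghosh and Mukhopadhyay (1979) and Chow and Yu (1981)); the function and its arguments are our own and only indicate the mechanics of the procedure.

sequential.mean <- function(rgen, a, c, nu = 0.5, n1 = 10)
{
  x <- rgen(n1)                                      # initial sample of size n1
  repeat {
    n <- length(x)
    if (n >= sqrt(a / c) * (sd(x) + n^(-nu))) break  # stopping rule N_c of (3.81)
    x <- c(x, rgen(1))                               # otherwise take one more observation
  }
  list(N = length(x), estimate = mean(x))
}
# e.g. sequential.mean(rnorm, a = 1, c = 0.001)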
Let now T_n be an L-estimator of θ, defined in (3.68), trimmed at α and 1 − α and satisfying (3.69)–(3.72). Then √n(T_n − θ) is asymptotically normal with variance σ²(J, F) given in (3.73). Sen (1978) proposed estimating σ²(J, F) by the empirical counterpart of (3.73),
σ̂_n² = ∫_R ∫_R [F_n(x ∧ y) − F_n(x)F_n(y)] J_n(F_n(x)) J_n(F_n(y)) dx dy
     = Σ_{i=1}^{n−1} Σ_{j=1}^{n−1} n² c_{ni} c_{nj} [min(i, j)/n − ij/n²] (X_{n:i+1} − X_{n:i})(X_{n:j+1} − X_{n:j})
and proved that σ̂_n² → σ²(J, F) a.s. as n → ∞. Again, if N_c is the stopping rule defined in (3.81) with this σ̂_n, then T_{N_c} is a risk-efficient L-estimator of θ.
3.10 R-estimators

Let R_i be the rank of X_i among X_1, ..., X_n, i = 1, ..., n, where X_1, ..., X_n is a random sample from a population with a continuous distribution function. The rank R_i can be formally expressed as
R_i = Σ_{j=1}^n I[X_j ≤ X_i],   i = 1, ..., n   (3.83)
and thus R_i = nF_n(X_i), i = 1, ..., n, where F_n is the empirical distribution function of X_1, ..., X_n. The ranks are invariant with respect to the class of monotone transformations of the observations, and the tests based on ranks have many advantages: the most important among them is that the distribution of the test criterion under the hypothesis of randomness (i.e., if X_1, ..., X_n are independent and identically distributed with a continuous distribution function) is independent of the distribution of the observations. Hodges and Lehmann (1963) proposed a class of estimators, called R-estimators, that are obtained by an inversion of rank tests. The R-estimators can be defined for many models, practically for all where the rank tests make sense and the test criterion is symmetric about a known center or has another suitable property under the null hypothesis. We shall describe the R-estimators of the center of symmetry of an (unknown) continuous distribution function, and later the R-estimators in the linear regression model.
Let X_1, ..., X_n be independent random observations with a continuous distribution function F(x − θ), symmetric about θ. The hypothesis H_0 : θ = θ_0 on the value of the center of symmetry is tested with the aid of the signed rank test (or one-sample rank test), based on the statistic
S_n(θ_0) = Σ_{i=1}^n sign(X_i − θ_0) a_n(R^+_{ni}(θ_0))   (3.84)
where R^+_{ni}(θ_0) is the rank of |X_i − θ_0| among |X_1 − θ_0|, ..., |X_n − θ_0| and a_n(1) ≤
... ≤ a_n(n) are given scores, generated by a nondecreasing score function ϕ^+ : [0, 1) → R^+, ϕ^+(0) = 0, in the following way: a_n(i) = ϕ^+(i/(n + 1)), i = 1, ..., n. For example, the linear score function ϕ^+(u) = u, 0 ≤ u ≤ 1, generates the one-sample Wilcoxon test. If θ_0 is the true center of symmetry, then sign(X_i − θ_0) and R^+_{ni}(θ_0) are stochastically independent, i = 1, ..., n, and S_n(t) is a nonincreasing step function of t with probability 1 (Problem 3.10). This implies that E_{θ_0} S_n(θ_0) = 0 and the distribution of S_n(θ_0) under H_0 is symmetric around 0.
As an estimator of θ_0 we propose the value of t which solves the equation S_n(t) = 0. Because S_n(t) is discontinuous, such an equation may have no solution; then we define the R-estimator similarly as the M-estimator and put
T_n = (1/2)(T_n^- + T_n^+)   (3.85)
T_n^- = sup{t : S_n(t) > 0},   T_n^+ = inf{t : S_n(t) < 0}
T_n coincides with the sample median if a_n(i) = 1, i = 1, ..., n. The estimator corresponding to the one-sample Wilcoxon test with the scores a_n(i) = i/(n + 1), i = 1, ..., n, is known as the Hodges-Lehmann estimator (see the R sketch below):
T_n^H = med{ (X_i + X_j)/2 : 1 ≤ i ≤ j ≤ n }   (3.86)
Other R-estimators should be computed by an iterative procedure. Unlike the M-estimators, the R-estimators are not only translation, but also scale equivariant, i.e.,
T_n(X_1 + c, ..., X_n + c) = T_n(X_1, ..., X_n) + c,   c ∈ R   (3.87)
T_n(cX_1, ..., cX_n) = c T_n(X_1, ..., X_n),   c > 0
The distribution function of the statistic S_n(θ) is discontinuous, even if X_1, ..., X_n have a continuous distribution function F(x − θ). On the other hand, if θ is the actual center of symmetry, then the distribution of S_n(θ) does not depend on F. If we denote
0 ≤ p_n = P_θ(S_n(θ) = 0) = P_0(S_n(0) = 0) < 1
then lim_{n→∞} p_n = 0 and
(1/2)(1 − p_n) ≤ P_θ(T_n < θ) ≤ P_θ(T_n ≤ θ) ≤ (1/2)(1 + p_n)   (3.88)
This means that if F is symmetric around zero, T_n is an asymptotically median unbiased estimator of θ. Using (3.83) in the statistic (3.84) with linear scores, we arrive at an alternative form of the Hodges-Lehmann estimator T_n as a solution of the equation
∫_{−∞}^{∞} [F_n(y) − F_n(2T_n − y)] dF_n(y) = 0   (3.89)
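The Hodges-Lehmann estimator (3.86) is easily computed in R as the median of all Walsh averages; a minimal sketch (our own, see also Problem 3.12) follows. Its value should essentially agree with the "(pseudo)median" reported by wilcox.test(x, conf.int = TRUE).

hodges.lehmann <- function(x)
{
  x <- x[!is.na(x)]
  w <- outer(x, x, "+") / 2                 # all pairwise averages (X_i + X_j)/2
  median(w[upper.tri(w, diag = TRUE)])      # keep pairs with i <= j, including i = j
}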
Similarly, the R-estimator generated by the score function ϕ^+ can be expressed as a solution of the equation
∫_{−∞}^{∞} ϕ( F_n(y) − F_n(2T_n − y) ) dF_n(y) = 0   (3.90)
where ϕ(u) = sign(u − 1/2) ϕ^+(|2u − 1|), 0 < u < 1. Hence, the corresponding statistical functional is a solution of the equation
∫_{−∞}^{∞} ϕ[ F(y) − F(2T(F) − y) ] dF(y) = ∫_0^1 ϕ( u − F(2T(F) − F^{-1}(u)) ) du = 0   (3.91)
The influence function of the R-estimator can be derived similarly as that of the L-estimator, and in the case of a symmetric F with an absolutely continuous density f it equals
IF(x, T, F) = ϕ(F(x)) / ∫_R ϕ(F(y)) (−f′(y)) dy   (3.92)
Example 3.5 The breakdown point of the Hodges-Lehmann estimator T_n^H: the estimator can break down if at least half of the averages (X_i + X_j)/2, 1 ≤ i ≤ j ≤ n, are corrupted. Assume that the sample is corrupted by replacement of m outliers and that n + \binom{n}{2} is even; then the estimator T_n^H breaks down for m satisfying
n − m + \binom{n−m}{2} ≤ (1/2) [ n + \binom{n}{2} ]
Therefore
2m² − m(4n + 2) + n + n² ≤ 0,   1 ≤ m ≤ n   (3.93)
We look for the smallest integer m satisfying (3.93). Thus the breakdown point is ε*_n = m*/n, where
m* = ⌈ (2n + 1 − √(2n² + 2n + 1)) / 2 ⌉
and ⌈x⌉ denotes the smallest integer no smaller than x. Analogously, for n + \binom{n}{2} odd,
m* = ⌈ (2n + 1 − √(2n² + 2n + 5)) / 2 ⌉
Finally, lim_{n→∞} ε*_n(T_n^H) = 1 − 1/√2 ≈ 0.293.

Remark 3.4 If ψ(x) = cϕ(F(x)), x ∈ R, then the influence function of the M-estimator generated by ψ coincides with the influence function of the R-estimator generated by ϕ.
Jaeckel (1969) proposed an equivalent definition of the R-estimator, more convenient for the calculation. Consider the
n + \binom{n}{2} = n(n + 1)/2
averages (X_{n:j} + X_{n:k})/2, 1 ≤ j ≤ k ≤ n, including the cases j = k. Let ϕ : (0, 1) → R be a nondecreasing score function, skew-symmetric on (0, 1) in the sense that ϕ(1 − u) = −ϕ(u), 0 < u < 1, and put
d_{in} = ϕ((i + 1)/(2n + 1)) − ϕ(i/(2n + 1)),   i = 1, ..., n
Then define the weights c_{jk}, 1 ≤ j ≤ k ≤ n, in the following way:
c_{jk} = d_{n−k+j} / Σ_{i=1}^n i d_i ;   Σ_{1≤j≤k≤n} c_{jk} = 1
The R-estimator is defined as the median of the discrete distribution that assigns the probability c_{jk} to each average (X_{n:j} + X_{n:k})/2, 1 ≤ j ≤ k ≤ n. If the score function ϕ is linear and hence d_{n1} = ... = d_{nn} = 1/n, then the weights c_{jk} are all equal and the estimator is just the median of all averages, thus the Hodges-Lehmann estimator. However, Jaeckel's definition of the R-estimator is applicable to more general signed-rank tests, such as the one-sample van der Waerden test and others.
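A sketch of Jaeckel's weighted-median form described above (our own illustration); with the linear score ϕ(u) = 2u − 1 all weights coincide and the function reduces to the Hodges-Lehmann estimator, up to the simple lower weighted median used here.

jaeckel.restimator <- function(x, phi = function(u) 2 * u - 1)
{
  x <- sort(x[!is.na(x)])
  n <- length(x)
  d <- phi((2:(n + 1)) / (2 * n + 1)) - phi((1:n) / (2 * n + 1))   # d_{in}
  jk <- which(upper.tri(matrix(0, n, n), diag = TRUE), arr.ind = TRUE)
  w <- d[n - jk[, 2] + jk[, 1]]             # weight d_{n-k+j} of the pair (j, k)
  w <- w / sum(w)
  avg <- (x[jk[, 1]] + x[jk[, 2]]) / 2      # averages (X_{n:j} + X_{n:k})/2
  o <- order(avg)
  avg[o][which(cumsum(w[o]) >= 0.5)[1]]     # (lower) weighted median
}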
3.11 Numerical illustration

Assume that the following data are independent measurements of a physical entity θ: 46.34, 50.34, 48.35, 53.74, 52.06, 49.45, 49.90, 51.25, 49.38, 49.31, 50.62, 48.82, 46.90, 49.46, 51.17, 50.36, 52.18, 50.11, 52.49, 48.67. In other words, we have the measurements
X_i = θ + e_i,   i = 1, ..., 20
and we want to determine the unknown value θ, assuming that the errors e_1, ..., e_20 are independent and identically distributed, symmetrically around zero. The first column of Table 3.2 provides the values of the estimates of θ and the scale characteristics from Table 3.1, based on X_1, ..., X_20. We see that the values in column I are rather close to each other, and that the data seem to be roughly symmetric around 50.
Let us now consider what happens if some observations are slightly changed. The effects of some changes can be seen in columns II–V of Table 3.2. Columns II and III illustrate the effect of a change of solely one observation, caused by a mistake in the decimal point: column II corresponds to the fact that the last
Table 3.1 Estimates of the location and the scale characteristics.

Location
  mean                                  X̄_n
  median                                X̃_n
  5%-trimmed mean                       X̄_{.05}
  10%-trimmed mean                      X̄_{.10}
  5%-Winsorized mean                    W̄_{.05}
  10%-Winsorized mean                   W̄_{.10}
  Huber M-estimator                     M_H
  Hodges-Lehmann estimator              HL
  Sen's weighted mean, k_1 = [0.05n]    S_{k_1}
  Sen's weighted mean, k_2 = [0.1n]     S_{k_2}
  midrange                              R_m

Scale
  standard deviation                    S_n
  inter-quartile range                  R_I
  median absolute deviation             MAD
  Gini mean difference                  G_n
value in the dataset, 48.67, was replaced by 486.7, while column III gives the result of a replacement of 48.67 by 4.867. These changes considerably affected the mean X̄_n, the standard deviation S_n, and the midrange R_m, which is in correspondence with the theoretical conclusion that X̄_n, S_n and R_m are highly non-robust. Columns IV and V show the changes in the estimates when the last five observations in the dataset were replaced by the values 79.45, 76.80, 80.73, 76.10, 87.01, or by the values 1.92, 0.71, 1.26, 0.32, -1.71, respectively.
When we wish to obtain a picture of the behavior of an estimator under various models, we usually simulate the model and look at the resulting values of the estimator of interest. For example, 200 observations were simulated from each of the following probability distributions:
• Normal distributions N(0, 1) and N(10, 2) with the density
  f(x) = (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)},   µ = 0, 10,   σ² = 1, 2,   x ∈ R
Table 3.2 Effects of changes in the dataset on the estimates.
estimator                              I        II       III      IV       V

mean X̄_n                              50.04    71.95    47.85    57.36    37.48
median X̃_n                            50.00    50.22    50.00    50.48    49.34
5%-trimmed mean X̄_{.05}               50.05    50.33    49.92    56.32    38.75
10%-trimmed mean X̄_{.10}              50.09    50.33    49.98    55.39    40.32
5%-Winsorized mean W̄_{.05}            50.01    50.33    49.87    57.07    37.50
10%-Winsorized mean W̄_{.10}           50.12    50.35    49.89    57.09    37.46
Huber M-estimator M_H                  50.07    50.33    49.94    56.59    37.62
Hodges-Lehmann estimator HL            50.02    50.31    49.94    51.31    48.18
Sen's weighted mean S_{k_1}            50.04    50.29    49.98    54.19    42.49
Sen's weighted mean S_{k_2}            50.02    50.25    50.00    52.44    45.60
midrange R_m                           50.04    266.52   29.30    66.68    26.02

sample standard deviation S_n          1.82     97.64    10.28    13.67    21.97
inter-quartile range R_I               2.00     2.09     2.00     9.97     15.18
median absolute deviation MAD          1.18     0.98     1.18     1.62     1.86
Gini mean difference G_n               2.09     45.55    6.42     13.39    20.74
(symmetric and exponentially-tailed distributions).
• Exponential Exp(5) distribution with the density
  f(x) = 5e^{−5x}, x ≥ 0;  f(x) = 0, x < 0
(skewed and exponentially-tailed distribution).
• Cauchy distribution with the density
  f(x) = 1/(π(1 + x²)),   x ∈ R
(symmetric and heavy-tailed distribution).
• Pareto distribution with the density
  f(x) = 1/(1 + x)², x ≥ 1;  f(x) = 0, x < 1
(skewed and heavy-tailed distribution).
The values of various estimates under the above distributions are given in Table 3.3.
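A simulation of this kind is easy to reproduce; the following sketch (illustrative only, the numbers will of course differ from Table 3.3) generates 200 observations from three of the models and applies a few of the estimators.

set.seed(1)
samples <- list("N(0,1)"  = rnorm(200),
                "N(10,2)" = rnorm(200, mean = 10, sd = sqrt(2)),
                "Cauchy"  = rcauchy(200))
sapply(samples, function(x)
  c(mean = mean(x), median = median(x),
    trim10 = mean(x, trim = 0.10), MAD = mad(x)))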
Table 3.3 Values of estimates under various models.
estimator                              N(0,1)   N(10,2)   Exp(5)   Cauchy   Pareto

mean X̄_n                              0.06     9.92      4.49     1.77     12.19
median X̃_n                            -0.01    9.73      2.92     -0.25    2.10
5%-trimmed mean X̄_{.05}               0.05     9.89      4.01     -0.23    3.42
10%-trimmed mean X̄_{.10}              0.04     9.88      3.75     -0.29    2.84
5%-Winsorized mean W̄_{.05}            0.07     9.92      4.30     -0.10    4.17
10%-Winsorized mean W̄_{.10}           0.05     9.91      4.05     -0.30    3.32
Huber M-estimator M_H                  0.05     9.89      3.89     -0.29    2.87
Hodges-Lehmann est. HL                 0.04     9.87      3.73     -0.24    2.73
Sen's weighted mean S_{k_1}            0.02     9.75      3.16     -0.22    2.24
Sen's weighted mean S_{k_2}            0.02     9.73      3.10     -0.21    2.17
midrange R_m                           0.05     10.37     11.85    146.94   525.34

sample standard deviation S_n          1.07     2.03      4.63     13.06    78.48
inter-quartile range R_I               1.26     3.05      5.53     2.45     2.83
median abs. deviation MAD              0.63     1.58      2.02     1.24     0.93
Gini mean difference G_n               1.18     2.30      4.78     7.18     20.23
3.12 Computation and software notes

The system R includes some functions for the computation of the location and scale characteristics mentioned above. The standard ones are mean, median and var; further, the function sd computes the sample standard deviation, IQR the inter-quartile range and mad the median absolute deviation. The summary function returns the median, the quartiles, the minimum and the maximum. The slightly different function fivenum returns Tukey's five number summary (minimum, lower hinge, median, upper hinge, maximum). The hinges are versions of the first and third quartiles; they equal the quartiles for odd n and differ from them for even n (for details see the R reference manual). There are also the standard functions quantile, max, min and range.
The function mean has the argument trim, the fraction of observations to be trimmed from each end before the computation of the mean. It means that we can use mean(x, trim=alpha) to compute the α-trimmed mean. The function mad computes the median of the absolute deviations from a center – by default the median – multiplied by a constant; the default value 1.4826 of the constant ensures consistency for the standard deviation under normality.
The procedures for Huber M-estimation of location are also incorporated into the system R. The function huber finds the Huber M-estimator of location with MAD scale, and the function hubers finds the Huber M-estimator for location with scale specified, for scale with location specified, or for both if neither is specified. Both functions are stored in the package MASS.
R does not provide functions for the Hodges-Lehmann estimator, Sen's weighted mean, the Winsorized mean, the midrange and the Gini mean difference. It is easy to prepare such functions, and they can also be found on the website http://www.fp.vslib.cz/kap/picek/robust/. For example, we can create the functions for the Gini mean difference and for Sen's weighted mean as follows:

gini.mean.difference <- function(x)
{
  # Gini mean difference via order statistics:
  # G_n = 2 * sum((2i - n - 1) * X_{n:i}) / (n * (n - 1))
  x <- sort(x[!is.na(x)])
  n <- length(x)
  w <- seq((1 - n), (n - 1), by = 2)
  2 * sum(w * x) / n / (n - 1)
}

sen.weight.mean <- function(x, k = 0)
{
  # Sen's weighted mean T_{n,k} of Example 3.4(i)
  x <- x[!is.na(x)]
  n <- length(x)
  if ((k < 0) | (k >= (n - 1) / 2))
    stop("cannot estimate: k<0 or k>=(n-1)/2")
  if (trunc(k) != k)
    stop("cannot estimate: k is not integer")
  sum(choose(0:(n - 1), k) * choose((n - 1):0, k) * sort(x)) /
    choose(n, 2 * k + 1)
}

The package MASS also contains many datasets. We chose the dataset chem to illustrate the R functions; it contains 24 determinations of copper in wholemeal flour, in parts per million (Analytical Methods Committee (1989), Venables and Ripley (2002)).

> sort(chem)
 [1]  2.20  2.20  2.40  2.40  2.50  2.70  2.80  2.90  3.03  3.03
[11]  3.10  3.37  3.40  3.40  3.40  3.50  3.60  3.70  3.70  3.70
[21]  3.70  3.77  5.28 28.95
> summary(chem)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.200   2.775   3.385   4.280   3.700  28.950
> fivenum(chem)
[1]  2.200  2.750  3.385  3.700 28.950
> mean(chem)
[1] 4.280417
> median(chem)
[1] 3.385
> mean(chem, trim=0.05)
[1] 3.253636
> mean(chem, trim=0.10)
[1] 3.205
> winsorized.mean(chem, trim=0.05)
[1] 3.294167
> winsorized.mean(chem, trim=0.10)
[1] 3.185
> huber(chem)
$mu
[1] 3.206724
$s
[1] 0.526323
> hubers(chem)
$mu
[1] 3.205498
$s
[1] 0.673652
> hodges.lehmann(chem)
[1] 3.225
> sen.weight.mean(chem, trunc(length(chem)*0.05))
[1] 3.241265
> sen.weight.mean(chem, trunc(length(chem)*0.10))
[1] 3.251903
> midrange(chem)
[1] 15.575
> sd(chem)
[1] 5.297396
> IQR(chem)
[1] 0.925
> mad(chem)
[1] 0.526323
> mad(chem, co=1)
[1] 0.355
> gini.mean.difference(chem)
[1] 2.830906
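For completeness, possible implementations of two of the helper functions used in the session above; these are our own sketches and need not coincide with the versions distributed on the website, although they reproduce the values shown above for the chem data.

winsorized.mean <- function(x, trim = 0)
{
  # replace the k = [n*trim] smallest values by X_{n:k+1}
  # and the k largest by X_{n:n-k}, then average
  x <- sort(x[!is.na(x)])
  n <- length(x)
  k <- floor(n * trim)
  if (k > 0) {
    x[1:k] <- x[k + 1]
    x[(n - k + 1):n] <- x[n - k]
  }
  mean(x)
}

midrange <- function(x)
{
  x <- x[!is.na(x)]
  (min(x) + max(x)) / 2
}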
3.13 Problems and complements

3.1 Let X_1, ..., X_n be a sample from the distribution with the density
f(x) = 1  if |x| ≤ 1/4,   f(x) = 1/(32|x|³)  if |x| > 1/4
and let X̄ = (1/n) Σ_{i=1}^n X_i be the sample mean. Then var X̄ = ∞.

3.2 The α-interquantile range (0 < α < 1): S_α = F^{-1}(1 − α) − F^{-1}(α). The influence function of S_α equals
IF(x; F, S_α) =
  (1 − α)/f(a_1) − α/f(a_2)     ... x < a_1
  −α (1/f(a_1) + 1/f(a_2))      ... a_1 < x < a_2
  (1 − α)/f(a_2) − α/f(a_1)     ... x > a_2
where a_1 = F^{-1}(α), a_2 = F^{-1}(1 − α) and f(x) = dF(x)/dx; the derivative should exist in neighborhoods of a_1 and a_2.
3.3 The symmetrized α-interquantile range (0 < α < 1) (Collins (2000)):
S̃_α(F) = S_α(F̃) = F̃^{-1}(1 − α) − F̃^{-1}(α)
where
F̃(x) = (1/2) [ F(x) + 1 − F(2F^{-1}(1/2) − x) ]   for F continuous
S̃_{1/4} coincides with MAD. Calculate the influence function of S̃_α.

3.4 Let X_1, ..., X_n be an independent sample from a population with density f(x − θ) and let T(X_1, ..., X_n) be a translation equivariant estimator of θ; then T_n has a continuous distribution function.
Hint: T(x_1, ..., x_n) = t if and only if x_1 = t − T(0, x_2 − x_1, ..., x_n − x_1). Hence, given X_2 − X_1 = y_2, ..., X_n − X_1 = y_n and t ∈ R, there is exactly one point x for which T(x) = t. Hence, P{T(X) = t | X_2 − X_1 = y_2, ..., X_n − X_1 = y_n} = 0 for every (y_2, ..., y_n) and t, thus P{T(X) = t} = 0 ∀t.
3.5 Let X_1, ..., X_n be independent observations with distribution function F(x − θ), and let T_n = Σ_{i=1}^n c_{ni} X_{n:i} be an L-estimator of θ. Then
(a) If Σ_{i=1}^n c_{ni} = 1, then T_n is translation equivariant.
(b) If F is symmetric about zero, Σ_{i=1}^n c_{ni} = 1 and c_{ni} = c_{n,n−i+1}, i = 1, ..., n, then the distribution of T_n is symmetric about θ.

3.6 Tukey (1960) proposed the model of the normal distribution with variance 1 contaminated by the normal distribution with variance τ² > 1, i.e., the distribution function F of the form
F(x) = (1 − ε)Φ(x) + εΦ(x/τ)   (3.94)
where Φ is the standard normal distribution function. Compare the asymptotic variances of the sample mean and the sample median under (3.94).

3.7 Let X_1, ..., X_n be a sample from the Cauchy distribution C(ξ, σ) with the density
f(x) = (1/π) σ/(σ² + (x − ξ)²)
Then the distribution of X̄_n is again C(ξ, σ).

3.8 Let X, −π/2 ≤ X ≤ π/2, be a random angle with the uniform distribution on the unit circle. Then tan X has the Cauchy distribution C(0, 1).

3.9 Consider the equation
Σ_{i=1}^n ψ_C(X_i − θ) = 0
where ψ_C is the Cauchy maximum likelihood score function (3.18). Denote by K_n the number of its roots. If X_1, ..., X_n are independent, identically distributed with the Cauchy C(0, 1) distribution, then K_n − 1 has asymptotically the Poisson distribution with parameter 1/π, as n → ∞. (See Barnett (1966) or Reeds (1985).)

3.10 (a) If X is a random variable with a continuous distribution function, symmetric about zero, then sign X and |X| are stochastically independent.
(b) Prove that the linear signed rank statistic (3.84) is a nonincreasing step function of θ with probability 1, provided the score function ϕ^+ is nondecreasing on (0, 1) and ϕ^+(0) = 0. (See van Eeden (1972) or Jurečková and Sen (1996), Section 6.4.)

3.11 Generate samples from different distributions and apply the described methods to these data.

3.12 Write an R procedure that computes the Hodges-Lehmann estimator.
CHAPTER 4
Robust estimators in linear model
4.1 Introduction

Consider the linear regression model
Y_i = x_i′β + U_i,   i = 1, ..., n
(4.1)
with observations Y_1, ..., Y_n, unknown and unobservable parameter β ∈ R^p, where x_i ∈ R^p, i = 1, ..., n, are either given deterministic vectors or observable random vectors (regressors) and U_1, ..., U_n are independent errors with a common distribution function F. Often we consider the model in which the first component β_1 of β is an intercept; it means that x_{i1} = 1, i = 1, ..., n. The distribution function F is generally unknown; we only assume that it belongs to some family F of distribution functions. Denoting
Y = (Y_1, ..., Y_n)′,   X = X_n = (x_1, ..., x_n)′,   U = (U_1, ..., U_n)′
(so that the ith row of X is x_i′), we can rewrite (4.1) in the matrix form
Y = Xβ + U
(4.2)
The most popular estimator of β is the classical least squares estimator (LSE) β̂. If X is of rank p, then β̂ is equal to
β̂ = (X′X)^{-1} X′Y   (4.3)
As follows from the Gauss-Markov theorem, β̂ is the best linear unbiased estimator of β, provided the errors U_1, ..., U_n have a finite second moment. Moreover, β̂ is the maximum likelihood estimator of β if U_1, ..., U_n are normally distributed.
The least squares estimator β̂ is an extension of the sample mean to the linear
regression model. Then, naturally, it has similar properties: it is highly non-robust and sensitive to outliers and gross errors in the Y_i, and to deviations from the normal distribution of the errors. It fails if the distribution of the U_i is heavy-tailed. But above all, β̂ is heavily affected by the regression matrix X; namely, it is sensitive to outliers among its elements. Violating some conditions in the linear model can have more serious consequences than in the location model. This can have a serious impact in econometric as well as many other applications. Hence, we must look for robust alternatives to the classical procedures in linear models.
Example 4.1 Figure 4.1 illustrates the effect of an outlier in the x-direction (leverage point) on the least squares estimator.

Figure 4.1 Data with 27 points and the corresponding least squares regression line (top) and the sensitivity of least squares regression to an outlier in the x-direction (bottom).
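A small simulated illustration in the spirit of Example 4.1 (the data below are made up, not those of Figure 4.1): adding a single point that is outlying in the x-direction can change the least squares fit substantially.

set.seed(2)
x <- rnorm(26)
y <- 2 + x + rnorm(26)
coef(lm(y ~ x))                       # fit on the clean data
x_bad <- c(x, -5); y_bad <- c(y, 10)  # add one leverage point
coef(lm(y_bad ~ x_bad))               # intercept and slope move noticeably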
Before we start describing the robust statistical procedures, we shall try to illustrate how seriously the outliers in X can affect the performance of the estimator β̂.
4.2 Least squares method

If we estimate β by the least squares method, then the set Ŷ = Xβ̂ is a hyperplane passing through the points (x_i, Ŷ_i), i = 1, ..., n, where
Ŷ_i = x_i′β̂ = h_i′Y,   i = 1, ..., n
and h_i′ is the ith row of the projection (hat) matrix Ĥ = X(X′X)^{-1}X′. Hence, Ŷ = ĤY is the projection of the vector Y into the space spanned by the columns of the matrix X.
Because Ĥ is a projection matrix, h_i′h_j = h_ij, i, j = 1, ..., n, and thus
0 ≤ Σ_{k≠i} h_ik² = h_ii(1 − h_ii)  ⟹  0 ≤ h_ii ≤ 1,   i = 1, ..., n   (4.4)
⟹  |h_ij| ≤ ‖h_i‖ · ‖h_j‖ = (h_ii h_jj)^{1/2} ≤ 1,   i, j = 1, ..., n
The matrix Ĥ is of order n × n and of rank p; its diagonal elements lie in the interval 0 ≤ h_ii ≤ 1, i = 1, ..., n, and trace Ĥ = Σ_{i=1}^n h_ii = p.
In the extreme situation we can imagine that h_ii = 1 for some i; then
1 = h_ii = ‖h_i‖² = Σ_{k=1}^n h_ik² = 1 + Σ_{k≠i} h_ik²  ⟹  h_ij = 0 for j ≠ i
which means that
Ŷ_i = x_i′β̂ = h_i′Y = h_ii Y_i = Y_i
and the regression hyperplane passes through (x_i, Y_i), regardless of the values of the other observations. The value h_ii = 1 is an extreme case, but it illustrates that a high value of the diagonal element h_ii causes the regression hyperplane to pass near to the point (x_i, Y_i). Such a point is called a leverage point of the dataset. There are different opinions about which value of h_ii can be considered high; Huber (1981, p. 162) considers x_i a leverage point if h_ii > 0.5.
It is well known that if EU_i = 0 and 0 < σ² = EU_i² < ∞, i = 1, ..., n, then
lim_{n→∞} max_{1≤i≤n} h_ii = 0
is a necessary and sufficient condition for the convergence
E_β ‖β̂_n − β‖² → 0
ROBUST ESTIMATORS IN LINEAR MODEL
1 − β) → N 0, σ 2 I p L (X X)− 2 (β n
as n → ∞, where I p is the identity matrix of order p (see, e.g., Huber (1981)). It is intuitively clear that large values of the residuals |Yˆi − Yi | are caused Consider this relation in more by a large maximal diagonal element of H. detail. Assume that the distribution function F has nondegenerate tails, i.e., 0 < F (x) < 1, x ∈ R; moreover, assume that it is symmetric around zero, i.e., F (x)+F (−x) = 1, x ∈ R, for the sake of simplicity. One possible characteristic is the following measure: of the tail behavior of estimator β − β)| > a − log Pβ maxi |xi (β = (4.5) B(a, β) − log(1 − F (a)) We naturally expect that
− β)| > a = 0 lim Pβ max |xi (β a→∞ i
(4.6)
and we are interested in how fast this convergence can be, and under which conditions it is faster. The faster convergence leads to larger values of (4.5) under large a, denote ˜ = max hii , hii = x (X X)−1 xi , i = 1, . . . , n h i 1≤i≤n
(4.7)
˜ on the limit behavior of B(a, β) is described in the following The influence of h theorem: be the least squares estimator of β in model (4.1) with Theorem 4.1 Let β a nonrandom matrix X. (i) If F has exponential tails, i.e., lim
a→∞
− log(1 − F (a)) = 1, ba
b > 0,
then
1 ≤ lima→∞ B(a, β) ≤ 1 ≤ lima→∞ B(a, β) ˜ ˜ h h
(4.8)
(ii) If F has exponential tails with exponent r, i.e., − log(1 − F (a)) = 1, a→∞ bar lim
then
b > 0 and r ∈ (1, 2]
˜ −r+1 ≤ lim ˜ −r h a→∞ B(a, β) ≤ lima→∞ B(a, β) ≤ h
(4.9)
(iii) If F is a normal distribution function, then = lim B(a, β)
a→∞
1 ˜ h
(4.10)
LEAST SQUARES METHOD (iv)
89
If F is heavy-tailed, i.e., lim
a→∞
then
− log(1 − F (a)) = 1, m log a
m>0
=1 lim B(a, β)
a→∞
(4.11)
˜ of matrix H is Theorem 4.1 shows that if the maximal diagonal element h large, then the probability Pβ (maxi |xi (β − β| > a) decreases slowly to 0 with increasing a; this is the case even when F is normal and the number n under normal F is of observations is large. The upper bound of B(a, β) ≤ lima→∞ B(a, β)
n p
with the equality under a balanced design with the diagonal hii = 1, . . . , n.
(4.12) p n,
i =
Proof of Theorem 4.1. Let us assume, without loss of generality, that ˜ = h11 . Because 0 < h ˜ ≤ 1 and Yˆi = x β h i = hi Y , we can write Pβ max xi (β − β > a i = P0 (max |hi Y | > a) ≥ P0 (h1 Y > a) i
˜ 1 > a, h12 Y2 ≥ 0, . . . , h1n Yn ≥ 0) ≥ P0 (hY
1 n−1 ˜ 1 n−1 = (1 − F (a/h)) ˜ ≥ P0 (Y1 > a/h) 2 2 This implies that ˜ ≤ lima→∞ − log(1 − F (a/h)) lima→∞ B(a, β) − log(1 − F (a))
(4.13)
If F has exponential tails with index r, then it further follows from (4.13) that ≤ lima→∞ lima→∞ B(a, β)
˜ r b(a/h) ˜ −r =h bar
which gives the upper bounds in (i) and (ii). For a heavy-tailed F, it follows from (4.13) that ≤ lima→∞ lima→∞ B(a, β)
˜ m log(a/h) =1 m log a
has both positive and negative residuals and and it gives (iv) because β lima→∞ B(a, β) ≥ 1. It remains to verify the lower bounds in (ii) and (iii). If F has exponential tails with exponent r, 1 < r ≤ 2, then, using the Markov inequality, we can
90
ROBUST ESTIMATORS IN LINEAR MODEL
write for any ε ∈ (0, 1) that ˜ 1−r ˆ r − β| > a) ≤ E0 [exp{(1 − ε)bh (maxi |Yi | )}] Pβ (max |xi (β ˜ 1−r ar } i exp{(1 − ε)bh Hence, if we can verify that ˜ 1−r (max |Yˆi |)r }] ≤ Cr < ∞ E0 [exp{(1 − ε)bh
(4.14)
i
we can claim that ˜ 1−r ar − log P0 (max |Yˆi | > a) ≥ − log Cr + (1 − ε)bh i
and this would give the lower bound in (ii), and in fact also the lower bound for the normal distribution in (iii). Thus, it remains to prove that the expected n 1/s value in (4.14) is finite. Denote xs = ( i=1 |xi |s ) , s > 0 and put s = n r 2 k=1 hik = hii , we conclude r−1 (> 2). Then, regarding the relation (max |Yˆi |)r = max |hi Y |r ≤ max(hi s Y r )r i
i
i
n n n ˜ r−1 ≤ max( h2ik )r/s |Yk |r ≤ h |Yk |r i
k=1
k=1
k=1
and hence ˜ 1−r (max |Yˆi |r } ≤ E0 exp{(1 − ε)b E0 exp{(1 − ε)bh i
n
|Yk |r }
k=1
≤ (E0 exp{(1 − ε)b|Y1 |r })n For exponentially-tailed F with exponent r, there exists K > 0 such that 1 − F (x) ≤ exp{−(1 − 2ε bxr } = CK for x > K. Integrating by parts, we obtain ∞ 0 < E0 [exp{(1 − ε)b|Y1 |r }] = −2 exp{(1 − ε)by r }d(1 − F (y)) 0
K
exp{(1 − ε)by r }dF (y) + 2 exp{(1 − ε)bK r }(1 − F (K))
≤2
0 ∞
+2
r(1 − ε)by r−1 (1 − F (y))exp{(1 − ε)by r }dy
(4.15)
K
K
exp{(1 − ε)by r }dF (y) + 2(1 − F (K))exp{(1 − ε)bK 2 }
≤2
0 ∞
+2 K
ε r(1 − ε)by r−1 exp{− by r }dy ≤ Cε < ∞ 2
So we have proved (4.14) for 1 < r ≤ 2. If r = 1, we proceed as follows: (4.4)
LEAST SQUARES METHOD √ implies that |hij | ≤ hii , i, j = 1, . . . , n; thus max |Yˆi | = max |hi Y | = max | i
i
≤ max |hij | ij
n
i
|Yj | ≤ max |
j=1
i
91 n
hij Yj |
j=1 n
˜ 12 hij Yj | ≤ h
j=1
n
|Yj |
j=1
Using the Markov inequality, we obtain ˜ − 2 maxi |Yˆi |} E0 exp{(1 − ε)bh ˜ − 12 a} exp{(1 − ε)bh 1
P0 (max |Yˆi | > a) ≤ i
(E0 exp{(1 − ε)b|Y1 |})n ˜ − 12 a} exp{(1 − ε)bh
≤
and it further follows from (4.15) that E0 exp{(1 − ε)b|Y1 |} < ∞; this gives the lower bound in (i). 2 If F is the normal distribution function of N (0, σ ), then Y − Xβ has n , hence dimensional normal distribution Nn 0, σ 2 H
˜− 2 ) P0 (max |Yˆi | > a) ≥ P0 (h1 Y > a) = 1 − Φ(aσ −1 h 1
i
≤h ˜ −1 . and lima→∞ B(a, β)
2
The proposition (iii) of Theorem 4.1 shows that the performance of the LSE can be poor even under the normal distribution of errors, provided the design is fixed and contains leverage points leading to large ˜h. The rate of convergence in (4.6) does not improve even if the number of observations increases. In ˜ = 1, the convergence (4.6) to zero is equally slow for the extreme case of h arbitrarily large number n of observations as if there is only one observation (just the leverage one). On the other hand, if the design is balanced with diagonal elements hii = p ) = n , hence the rate of convergence i = 1, . . . , n, then lima→∞ B(a, β n n, p in (4.6) improves with n. Up to now, we have considered a fixed number n of observations. The situation can change if n → ∞ and a = an → ∞ at an appropriate rate. An interesting choice of the sequence {an } is the population
analogue of the extreme error among U1 , . . . , Un , i.e., a = an = F −1 1 − n1 . For the normal distribution N (0, σ 2 ), this population extreme is approximately 1 (4.16) an = σΦ−1 1 − ≈ σ 2 log n n Namely for the normal distribution of errors we shall derive the lower and ) under this choice of {an }. These bounds are more upper bounds of B(an , β n
92
ROBUST ESTIMATORS IN LINEAR MODEL
optimistic for the least square estimator, because they both improve with increasing n. This is true for both fixed and random designs, and in the latter case even when the xi are random vectors with a heavy-tailed distribution G, i = 1, . . . , n. Theorem 4.2 Consider the linear regression model Yi = xi β + ei ,
i = 1, . . . , n
(4.17)
with the observations Yi , i = 1, . . . , n and with independent errors e1 , . . . , en , normally distributed as N (0, σ 2 ). Assume that the matrix X n = [x1 , . . . , xn ] is either fixed and of rank p for n ≥ n0 or that x1 , . . . , xn are independent pdimensional random vectors, independent of the errors, identically distributed with distribution function G, then √ ) B(σ 2 log n, β n limn→∞ 2 ≤1 (4.18) (n−1) log n + log n 2 p √ B σ 2 log n, β n ≥1 limn→∞ 1+log 2 log log n n 1 − − 2 log n log n
and
(4.19)
Remark 4.1 The bounds (4.18) and (4.19) can be rewritten in the form of the following asymptotic inequalities that are true for the LSE under normal F, as n → ∞ and for any ε > 0 : √ 2 max x ≥ (σ β 2 log n − log P 0 i n i n (4.20) ≥ p log n n 1 + log 2 log log n n(1 − ε) ≥ − 1− ≥ 2 log n log n 2 Proof of Theorem 4.2. Let us first consider the upper bound. Note that ˜ n ≥ p because trace H n = p. For each j such that hjj > 0 we can write h n P0 max (xi βn ) ≥ a(x1 , . . . , xn ) (4.21) 1≤i≤n = P0 max(hi Y ) ≥ a(x1 , . . . , xn ) ≥ P0 hj Y ≥ a(x1 , . . . , xn ) i
≥ P0 (hjj Yj ≥ a, h1k Yk ≥ 0, k = j |(x1 , . . . , xn )) n−1 n−1 1 a 1 a ≥ P0 Yj ≥ |(x1 , . . . , xn )) = 1−F hjj 2 hjj 2 This holds for each j such that hjj > 0; hence also n−1 a 1 P0 max (xi β n ) ≥ a(x1 , . . . , xn ) ≥ 1 − F ˜ 1≤i≤n 2 h
LEAST SQUARES METHOD
93
Because (− log) is a convex function, we may apply the Jenssen inequality and (4.21) then implies ) ≥ a(x1 , . . . , xn ) − log EG P0 max(xi β n i a ≤ (n − 1) log 2 − log EG 1 − F (4.22) ˜ h a ≤ (n − 1) log 2 + EG − log 1 − F ˜ h If F is of exponential type (4.9) with 1 ≤ r ≤ 2, then
1 a = an = b−1 log n r → ∞ as n → ∞ and (4.22) gives the following asymptotic inequality, as n → ∞, for any G: r r n 1 (n − 1) log 2 (n − 1) log 2 ≤ + + B(an , β n ) ≤ EG ˜ log n p log n hn A more precise form of the above inequality is ) B(an , β n limn→∞ r ≤1 (n−1) log 2 n + p log n and this gives (4.18). The lower bound: Since N (0, σ 2 ) has exponential tails, − log (1 − Φ(a/σ)) =1 a→∞ a2 /2σ 2 lim
and we can write | ≥ a(x1 , . . . , xn ) P0 max |xi β n 1≤i≤n
1 ˜ 2 Y ≥ a(x1 , . . . , xn ) ≤ P0 h
(4.23)
max |hi Y | ≥ a(x1 , . . . , xn ) 1≤i≤n Y 2 a2 = P0 ≥ , . . . , x ) (x 1 n ˜ 2 σ2 hσ 2 a = 1 − Fn ˜ 2 hσ
= P0
where Fn is the χ2 distribution function with n degrees of freedom. Because of (4.23), it holds (see Cs¨org˝ o and R´ev´esz (1981) or Parzen (1975)) fn (x) Fn (x)(1 − Fn (x)) ≤ 1 sup 2 x∈R fn (x) hence, because ˜ hn ≤ 1,
94
ROBUST ESTIMATORS IN LINEAR MODEL 1 − Fn
a2 ˜ 2 hσ
≤ 1 − Fn
a2 σ2
(4.24)
2 2 n −1 a /σ 2
2 ≤ 2 2 ea /2σ 2n/2 Γ(n/2) 12 − n2 − 1 σa2
√ Inserting an = σ 2 log n in (4.24), we obtain n − log(1 − Fn (2 log n)) 1 + log 2 log log n ≥ − 1− log n 2 log n log n 2 4.3 M -estimators The M -estimator of parameter β in model (4.1) is defined as solutions Mn of the minimization n ρ(Yi − xi t) := min (4.25) i=1
with respect to t ∈ Rp , where ρ : R1 → R1 is absolutely continuous, usually convex function with derivative ψ. Such Mn is obviously regression equivariant, i.e., (4.26) Mn (Y + Xb) = Mn (Y ) + b ∀b ∈ Rp but Mn is generally not scale equivariant: generally, it does not hold Mn (cY ) = cMn (Y ) for c > 0
(4.27)
A scale equivariant M -estimator we obtain either by a studentization or if we estimate the scale simultaneously with the regression parameter. The studentized M -estimator is a solution of the minimization n Yi − xi t ρ := min (4.28) Sn i=1 where Sn = Sn (Y ) ≥ 0 is an appropriate scale statistic. To obtain Mn both regression and scale equivariant, our scale statistic Sn must be scale equivariant and invariant to the regression, i.e., Sn (c(Y + Xb)) = cSn (Y ) ∀b ∈ Rp and c > 0
(4.29)
Such is, e.g., the root of the residual sum of squares, ]2 Sn (Y ) = [(Y − Y ) (Y − Y )] 2 = [Y (I n − H)Y 1
1
but this is closely connected with the least squares estimator, and thus highly non-robust. Robust scale statistics can be based on the regression quantiles or on the regression rank scores, which will be considered later. The minimization (4.28) should be supplemented with a rule on how to define
M -ESTIMATORS
95
Mn in case Sn (Y ) = 0; but this mostly happens with probability 0. Moreover, a specific form of the rule does not affect the asymptotic behavior of Mn . If ψ(x) =
dρ(x) dx
is continuous, then Mn is a root of the system of equations n Yi − xi t =0 (4.30) xi ψ Sn i=1
This system can have more roots, while only one leads to the global minimum of (4.28). Under general conditions, there always exists at least one root of the √ system (4.30) which is an n-consistent estimator of β (see Jureˇckov´ a and Sen (1996)). Another important case is that ψ is a nondecreasing step function, hence ρ is a convex, piecewise n linear function. Then Mn is a point of minima of the convex function i=1 ρ ((Yi − xi t)/Sn ) over t ∈ Rp . In this case, too, we can prove its consistency and asymptotic normality. If we want to estimate the scale simultaneously with the regression parameter, we can proceed in various ways. One possibility is to consider (Mn , σ ˆ ) as a solution of the minimization n
σρ σ −1 (Yi − xi t + aσ := min, t ∈ Rp , σ > 0
(4.31)
i=1
with an appropriate constant a > 0. We arrive at the system of p+1 equations n Yi − xi t xi ψ =0 σ i=1 n Yi − xi t χ =a σ i=1 where
χ(x) = xψ(x) − ρ(x) and a =
(4.32) χ(x)dΦ(x) R
and Φ is the standard normal distribution function. The usual choice of ψ is the Huber function (3.16). The matrix X can be random, fixed or a mixture of random and fixed elements. If the matrix X is random, then its rows are usually independent random vectors, identically distributed, hence they are an independent sample from some multivariate distribution. The influence function of Mn in this situation depends on two arguments, on x and y. Similarly, the possible breakdown and the value of the breakdown point of Mn should be considered not only with respect to changes in observations y, but also with respect to those of x.
96
ROBUST ESTIMATORS IN LINEAR MODEL
4.3.1 Influence function of M -estimator with random matrix Consider the model (4.1) with random matrix X, where (xi , Yi ) , i = 1, . . . , n are independent random vectors with values in Rp ×R1 , identically distributed with distribution function P (x, y). If ρ has an absolutely continuous derivative ψ, then the statistical functional T(P ), corresponding to the estimator (4.25), is a solution of the system of p equations xψ(y − x T(P )dP (x, y) = 0 (4.33) Rp+1
Let Pt denote the contaminated distribution Pt = (1 − t)P + tδ(x0 , y0 ), 0 ≤ t ≤ 1, (x0 , y0 ) ∈ Rp × R where δ(x0 , y0 ) is a degenerated distribution with the probability mass concentrated in the point (x0 , y0 ). Then the functional T(Pt ) solves the system of equations (1 − t) xψ(y − x T(Pt ))dP (x, y) Rp+1
+tx0 ψ(y0 − x0 T(Pt )) = 0. Differentiating by t we obtain − xψ(y − x T(Pt ))dP (x, y) + x0 ψ(y0 − x0 T(Pt )) Rp+1
−(1 − t)
Rp+1
x x
dT(Pt ) ψ (y − x T(Pt ))dP (x, y) dt
dT(Pt ) ψ (y0 − x0 T(Pt )) = 0 dt and we get the influence function dT(Pt ) IF(x0 , y0 ; T, P ) = dt t=0 if we put t = 0 and notice that, on account of (4.33), xψ(y − x T(Pt ))dP (x, y) = 0 −tx0 x0
Rp+1
then
IF(x0 , y0 ; T, P )
Rp+1
x xψ (y − x T(P ))dP (x, y)
= x0 ψ(y0 − x0 T(P )) Hence, the influence function of the M -estimator is of the form IF(x0 , y0 ; T, P ) = B−1 x0 ψ(y0 − x0 T(P ))
(4.34)
M -ESTIMATORS
97
where
B= Rp+1
x xψ (y − x T(P ))dP (x, y)
(4.35)
Observe that the influence function (4.35) is bounded in y0 , provided ψ is bounded; however, it is unbounded in x0 , and thus the M -estimator is nonrobust with respect to X. Many authors tried to overcome this shortage and introduced various generalized M -estimators, called GM -estimators, that outperform the effect of outliers in x with the aid of properly chosen weights.
4.3.2 Large sample distribution of the M -estimator with nonrandom matrix Because the M -estimator is nonlinear and defined implicitly as a solution of a minimization, it would be very difficult to derive its distribution function under a finite number of observations. Moreover, even if we were able to derive this distribution function, its complicated form would not give the right picture of the estimator. This applies also to other robust estimators. In this situation, we take recourse to the limiting (asymptotic) distributions of estimators, which are typically normal, and their covariance matrices fortunately often have a compact form. Deriving the asymptotic distribution is not easy and we should use various, sometimes nontraditional methods that have an interest of their own, but their details go beyond this text. In this context, we refer to the monographs cited in the literature, such as Huber (1981), Hampel et al. (1986), Rieder (1994), Jureˇckov´ a and Sen (1996), among others. Let us start with the asymptotic properties of the non-studentized estimator with a nonrandom matrix X. Assume that the distribution function F of errors Ui in model (4.1) is symmetric around zero. Consider the M -estimator Mn as a solution of the minimization (4.25) with an odd, absolutely continuous function ψ = ρ such that EF ψ 2 (U1 ) < ∞. The matrix X = X n is supposed (n) (n) to be of rank p and max1≤i≤n hii → 0 as n → ∞, where hii is the maximal n = X n (X X n )−1 X , diagonal element of the projection (hat) matrix H n n then p
Mn −→ β
1 L (X n X n ) 2 (Mn − β) → Np 0, σ 2 (ψ, F )Ip as n → ∞, where σ 2 (ψ, F ) = If, moreover, p × p, then
1 n X nX n
L
(4.36)
EF ψ 2 (U1 ) (EF ψ (U1 ))2
→ Q, where Q is a positively definite matrix of order
√
n(Mn − β) → Np 0, σ 2 (ψ, F )Q−1
If ψ has jump points, but is nondecreasing, and F is absolutely continuous
98
ROBUST ESTIMATORS IN LINEAR MODEL
with density f, then (4.36) is still true with the only difference being that EF ψ 2 (U1 ) σ 2 (ψ, F ) = ( R f (x)dψ(x))2 Notice that σ 2 (ψ, F ) is the same as in the asymptotic distribution of the M -estimator of location. p
If M -estimator Mn is studentized by the scale statistic Sn such that Sn −→ S(F ) as n → ∞, then the asymptotic covariance matrix of Mn depends on S(F ). In the models with intercepts, only the first component of the estimator is asymptotically affected by S(F ).
4.3.3 Large sample properties of the M -estimator with random matrix If the system of equations EP [xψ(y − x t) = 0 has a unique root T(P ) = β, then T(Pn ) → T(P ) as n → ∞, where Pn is the empirical distribution pertaining to observations ((x1 , y1 ), . . . , (xn , yn )) . The functional T(Pn ) admits, under some conditions on probability distribution P, the following asymptotic representation T(Pn ) = T(P ) +
1 1 IF(x, y; T, P ) + op (n− 2 ) as n → ∞ n
If EP IF(x, y; T, P )2 < ∞, the above representation further leads to the asymptotic distribution of T(Pn ) : √ L n(T(Pn ) − T(P )) → Np (0, Σ) (4.37) where
Σ = EP [IF(x, y; T, P )] [IF(x, y; T, P )] = B−1 AB−1
B is the matrix defined in (4.35) and x xψ 2 (y − x T(P ))dP (x, y) A= Rp+1
4.4 GM -estimators The influence function (4.34) of the M -estimator is unbounded in x, thus the M -estimator is sensitive to eventual leverage points in matrix X. The choice of function ψ has no effect on this phenomenon. Some authors proposed to supplement the definition of the M -estimator by suitable weights w that reduce the influence of the gross values of xij .
GM -ESTIMATORS
99
Mallows (1973, 1975) proposed the generalized M -estimator as a solution of the minimization n Yi − xi t σw(xi )ρ (4.38) := min, t ∈ Rp , σ > 0 σ i=1 If ψ = ρ is continuous, then the generalized M -estimator solves the equation n Yi − xi t xi w(xi )ψ =0 (4.39) σ i=1 The influence function of the pertaining functional T(P ) then equals y − x T(P ) −1 (4.40) IF(x, y; T, P ) = B xw(x)ψ S(P ) where S(P ) is the functional corresponding to the solution σ in the minimization (4.38). We obtain a bounded influence function if we take w leading to bounded xw(x). Such an estimator is a special case of the following GM -estimator, which solves the equation n Yi − xi t η xi , =0 σ i=1 n Yi − xi t χ =0 σ i=1
(4.41)
with functions η, χ, where η : Rp × R → R and χ : R → R. If we take η(x, u) = u and χ(u) = u2 − 1 we obtain the least squares estimator: the choice η(x, u) = ψ(u) leads to the M -estimator, and η(x, u) = w(x)ψ(u) leads to the Mallows GM -estimator. The usual choice of the function η is η(x, u) = ψ1x(x) ψ(u), where ψ is, e.g., the Huber function. The choice of function χ usually coincides with (4.32). The statistical functionals T(P ) and S(P ) corresponding to Mn and σn are defined implicitly as a solution of the system of equation: y − x T(P ) x η x, dP (x, y) = 0 S(P ) Rp+1 y − x T(P ) χ x, dP (x, y) = 0 S(P ) Rp+1
(4.42)
100
ROBUST ESTIMATORS IN LINEAR MODEL
The influence function of functional T(P ) in the special case σ = 1 has the form IF(x, y; T, P ) = B−1 xη(x, y − x T(P )), where
x x
B=
Rp+1
∂ η(x, u) dP (x, y) ∂u u=y−x T(P )
The asymptotic properties of GM -estimators were studied by Maronna and Yohai (1981), among others. Under some √ regularity conditions, the GM estimators are strongly consistent and n(T(Pn ) − T(P )) has asymptotic p-dimensional normal distribution Np (0, Σ) with covariance matrix Σ = B−1 AB−1 , where A= x xη 2 (x, y − x T(P ))dP (x, y) Rp+1
Krasker and Welsch (1982) proposed a GM -estimator as a solution of the system of equations n Yi − xi t =0 xi wi σ i=1 with weights wi = w(xi , Yi , t) > 0 determined so that they maximize the asymptotic efficiency of the estimator (with respect to the asymptotic covariance matrix Σ) under the constraint γ ∗ ≤ a < ∞, where γ ∗ is the global sensitivity of the functional T under distribution P, i.e., 1 γ ∗ = sup (IF(x, y; T, P )) Σ−1 (IF(x, y; T, P )) 2 x,y As a solution, we obtain the weights of the form ⎧ ⎫ ⎨ ⎬ a w(x, y, t) = min 1, ⎩ y−x t (x Ax) 12 ⎭ σ where
x x
A= Rp+1
y − x t σ
w2 (x, y, t)dP (x, y)
The Krasker-Welsch estimator has a bounded influence function, but it should be computed iteratively, because matrix A depends on w.
4.5 S-estimators and M M -estimators The S-estimator, proposed by Rousseeuw and Yohai (1984) minimizes an estimator of scale, ˜ = arg min σ β ˜n (β) n
L-ESTIMATORS, REGRESSION QUANTILES
101
and the estimator of scale σ ˜n (β) solves the equation n Yi − xi β 1 ρ = b for each fixed β n i=1 σ ˜n (β) where ρ(x) is a symmetric, continuous function, nondecreasing in |x|, and b = ρ(x)dΦ(x) with Φ being the standard normal distribution function. ˜ and the scale σ After calculating the S-estimator β ˜n , we calculate the M M proposed by Yohai (1987), as a solution of the minimization estimator β, n Yi − xi β ρ = min, β ∈ Rp ˜ ) σ ˜ ( β n n i=1 4.6 L-estimators, regression quantiles L-estimators of location parameter as linear combinations of order statistics or linear combinations of functions of order statistics are highly appealing and intuitive, because they are formulated explicitly, not as solutions of minimization problems or of systems of equations. The calculation of L-estimators is much easier. Naturally, many statisticians tried to extend the L-estimators to the linear regression model. Surprisingly, this extension is not easy, because it was difficult to find a natural and intuitive extension of the empirical (sample) quantile to the regression model. A successful extension of the sample quantile appeared only in 1978, when Koenker and Bassett introduced the regression α-quantile β(α) for model (4.1). It is more illustrative in a model with an intercept: hence let us assume that β1 is an intercept and that matrix X satisfies the condition xi1 = 1, i = 1, . . . , n
(4.43)
The regression α-quantile β(α), 0 < α < 1 is defined as a solution of the minimization n ρα (Yi − xi t) := min, t ∈ Rp (4.44) i=1
with the criterion function ρα (x) = |x|{αI[x > 0] + (1 − α)I[x < 0]}, x ∈ R
(4.45)
The function ρα is piecewise linear and convex; hence we can intuitively expect that the minimization (4.44) can be solved by some modification of the simplex algorithm of the linear programming. Indeed, Koenker and Bassett (1978) proposed to calculate β(α) as the component β of the optimal solution (β, r+ , r− ) of the parametric linear programming problem α
n i=1
ri+ + (1 − α)
n i=1
ri− : min
(4.46)
102
ROBUST ESTIMATORS IN LINEAR MODEL under constraint
p
xij βj + ri+ − ri− = Yi , i = 1, . . . , n
j=1
βj ∈ R1 , j = 1, . . . , p, ri+ , ri− ≥ 0, i = 1, . . . , n, 0<α<1 The variables ri+ and ri− in (4.46) are equal to the positive and negative parts of residuals Yi − xi β, i = 1, . . . , n. The linear programming problem (4.46) not only enables calculating the regression quantiles with the aid of the simplex algorithm, but it also illustrates the structure of the regression quantiles. It is known from the linear programming theory that the set B(α) of solutions of (4.46) (and thus also of (4.44)) is nonempty, compact and polyhedral. We can choose β(α) as a lexicographically maximal element B(α), unless we have some other rule prescribed. Being considered as a function of α, the regression quantile β(α) is a step function of α ∈ (0, 1). The population counterpart (i.e., the corresponding statistical functional) of β(α) is the population regression quantile β(α) = (β1 + F −1 (α), β2 , . . . , βp )
(4.47)
The asymptotic properties of β(α) are analogous to those of sample quan√ tiles in the location model. Indeed, n(β n (α) − β(α)) has a p-dimensional asymptotic normal distribution α(1 − α) −1 Q Np 0, (4.48) (f (F −1 (α))2 under some conditions on matrix X n and on distribution function F of errors in model (4.1). For instance, (4.48) holds if X n is either fixed (nonrandom) with limn→∞ n1 X n X n = Q, or random (up to the first column that corresponds to the intercept) and limn→∞ Ex1 x1 = Q, where Q is a positively definite matrix of order p × p, and if F is symmetric around 0, strictly increasing and has a positive derivative f in a neighborhood of F −1 (α). This is in accord with the asymptotic distribution of the sample α-quantile (whose matrix is X = 1n = (1, . . . , 1) ∈ Rn ). The regression quantiles provide a basis for various L-estimators of parameter β in a linear regression model. The most popular is the L1 -estimator, or regression median, which is the regression α-quantile with α = 12 . A broad class of L-estimators is linear combinations of a finite number of regression quantiles. From a practical point of view, the trimmed least-squares estimator proposed by Koenker and Bassett (1978) is very appealing. This is a straightforward extension of the trimmed mean to the linear regression model and is defined as follows: fix α1 , α2 , 0 < α1 < α2 < 1, i = 1, . . . , n, put % & (α1 ) < Yi < x β ai = I xi β (4.49) n i n (α2 ) , i = 1, . . . , n
L-ESTIMATORS, REGRESSION QUANTILES
103
and calculate the weighted least squares estimator with the weights ai . This estimator Tn (α1 , α2 ), called the (α1 , α2 )-trimmed least squares estimator, can be written in an explicit form Tn (α1 , α2 ) = (X n An Xn )−1 X n An Y n
(4.50)
where An = diag(ai ) is a diagonal matrix with diagonal (a1 , . . . , an ). We can show that Tn (α1 , α2 ) is asymptotically normally distributed, provided F is increasing and differentiable in interval (F −1 (α1 ) − ε, F −1 (α2 ) + ε), and under some regularity conditions imposed on the matrix X n . More precisely, √
L (4.51) n(Tn − β − δe1 ) → Np 0, σ 2 Q−1 where e1 = (1, 0, . . . , 0) ∈ Rp and δ = (α2 − α1 )−1
α2
F −1 (u)du
α1
σ 2 = σ 2 (α1 , α2 , F )
α2 −1 α2 (F −1 (u) − δ)2 du = (α2 − α1 )
(4.52)
α1
+α1 (F −1 (α1 ) − δ)2 + (1 − α2 )(F −1 (α2 ) − δ)2
2 − α1 (F −1 (α1 ) − δ) + (1 − α2 )(F −1 (α2 ) − δ) In the symmetric situation, when F (x)+F (−x) = 1, x ∈ R and α1 = α, α2 = √ 1 − α, 0 < α < 12 , δ vanishes and n(Tn (α) − β) has asymptotic normal distribution Np (0, σ 2 (α, F )Q−1 ) with 1−α −1 (F (u))2 du + 2α(F −1 (α))2 2 (4.53) σ (α, F ) = α 1 − 2α Notice that σ 2 (α, F ) coincides with the asymptotic variance of the α-trimmed mean in the location model. Besides the trimmed least squares estimator we can consider the broad class of L-estimators of the form 1 (α)dν(α) (4.54) β Tνn = n 0
where ν is a suitable signed measure (0, 1) (finite and with a compact support that is a subset of (0, 1)). The special case are linear combinations of a finite number of regression quantiles, that are generated by an atomic measure ν. The trimmed L-estimators, extending the (α1 , α2 )-trimmed mean, are generated by ν that has a Lebesgue measure density J(u) =
I[α1 ≤ u ≤ α2 ] , α2 − α1
0 < α1 < α2 < 1
104
ROBUST ESTIMATORS IN LINEAR MODEL
Unlike the M -estimators, the L-estimators of regression parameter are both regression and scale equivariant. Regression quantiles and L-estimators of various types are studied, e.g., in Gutenbrunner (1986), Gutenbrunner and Jureˇckov´ a (1992), Jureˇckov´ a and Sen (1996), and recently in Koenker (2005).
4.7 Regression rank scores The dual program to (4.46) has a very interesting interpretation: while the solutions of (4.46) are the regression quantiles, the solutions of the dual problem are called regression rank scores because they remind the ranks of observations and have many properties similar to those of ranks. The dual program to (4.46) can be written in the form n
under constraint
i=1 n i=1 n
Yi a ˆi := max a ˆi = n(1 − α) xij a ˆi = (1 − α)
i=1
(4.55) n
xij , j = 2, . . . , p
i=1
0≤a ˆi ≤ 1,
i = 1, . . . , n,
0<α<1
The components of the optimal solution of (4.55), n (α) = (ˆ a an1 (α), . . . , a ˆnn (α)) ,
0≤α≤1
are called the regression rank scores. The matrix form of program (4.55) is more compact: under constraint
a := max Y n a = (1 − α)X n 1n X n ∈ [0, 1]n , 0 ≤ α ≤ 1 a
(4.56)
Recall that xi1 = 1, i = 1, . . . , n by assumption (4.43). From (4.56) we can easily verify that the regression rank scores are invariant with respect to changes of β, i.e., an (α, Y ) ∀b ∈ Rp (4.57) an (α, Y + Xb) = Because β(α) and an (α) are dual to each other, we get from the linear programming theory ' (α) 1 . . . Yi > xi β n (4.58) a ˆni (α) = 0 . . . Yi < xi β n (α), i = 1, . . . , n (α) for some i (the exact fit), then 0 < a and if Yi = xi β ˆni (α) < 1; there n are exactly p such components for each α, and the corresponding values of
REGRESSION RANK SCORES
105
a ˆni (α) are determined by the restriction conditions in (4.56). The regression rank scores a ˆni (α), i = 1, . . . , n are continuous, piecewise linear functions of α ∈ [0, 1] satisfying a ˆni (0) = 1, a ˆni (1) = 0 (see Figure 4.2). 1
1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0
0 0
0.2
0.4
0.6
0.8
1
1
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0
0 0
0.2
0.4
0.6
0.8
1
Figure 4.2 Example of regression rank scores for the simulated data.
Due to these nice properties, the regression rank scores have many applications. The invariance property (4.57) guarantees that the tests based on regression rank scores are invariant to β in the situations when β is a nuisance parameter, while another parameter is of our interest for testing a hypothesis or estimation, or when we want to test a hypothesis on the shape of the distribution. The fact that in this way we avoid an estimation of β not only facilitates the computation, but we also prevent a risk of a wrong choice of an estimator of β. The tests of linear hypotheses with nuisance parameter β, based on regression rank scores, were constructed by Gutenbrunner et al. (1993). The tests based on regression rank scores enable verification of hypotheses of the type H : β(2) = 0 in the model Y = X(1) β (1) + X2 β(2) + e
(4.59)
with the nuisance parameter β (1) , that contains an intercept. Under H, the model (4.59) reduces to the form Y = X(1) β (1) + e
(4.60)
and the test is based on the regression rank scores corresponding to the hypothetical model (4.60). The tests have the same asymptotic powers as the
106
ROBUST ESTIMATORS IN LINEAR MODEL
rank tests of H in the situation that β (1) is fully known. The exact forms of the tests and more details can be found in Gutenbrunner et al. (1993). As we have already mentioned in Section 4.3, a scale equivariant M -estimator of regression parameter can be obtained using the studentization by a suitable scale statistic. However, such statistics should be scale equivariant and invariant to the change of regression parameter (see 4.29), and such scale statistics are based on regression rank scores and on regression quantiles. We shall describe some of these scale statistics in the next section.
4.8 Robust scale statistics The M -estimators of regression parameters are regression equivariant but not scale-equivariant; hence either they should be studentized, or the scale should be estimated simultaneously with the regression parameters. The studentizing scale statistic Sn (Y ) should be fully determined by the observations Yi and by matrix Xn and it should satisfy Sn (c(Y + Xb)) = cSn (Y ) ∀b ∈ Rp , c > 0, Y ∈ Rn
(4.61)
regression invariance and scale equivariance, cf. (4.29). There are not many statistics of this type in the literature; when we studentize an M -estimator by Sn that is only invariant to the shift, but not to the regression; then the M -estimator loses its regression equivariance. We shall describe some statistics satisfying (4.61), based on regression quantiles or on the regression rank scores. (i) Median absolute deviation from the regression median (MAD). Statistic MAD is frequently used in the location model. Welsh (1986) extended MAD to the the linear regression model in the following way: start √ with an initial estimator β 0 of β, n-consistent, regression and scale equivariant (so we shall not start with the ordinary M -estimator). The Welsh scale statistic is then defined as (4.62) Sn = med1≤i≤n Yi (β 0 ) − ξ 1 (β 0 ) 2 where Yi (β 0 ) = Yi − xi β 0 , i = 1, . . . , n ξ 1 (β 0 ) = med1≤i≤n Yi (β 0 ) 2
Sn is apparently invariant/equivariant in the sense of (4.61). Its asymptotic properties are studied in the Welsh (1986) paper.
ROBUST SCALE STATISTICS
107
(ii) L-statistics based on regression quantiles. The Euclidean distance of two regression quantiles Sn = β n (α2 ) − β n (α1 )
(4.63)
0 < α1 < α2 < 1, is invariant/equivariant in the sense of (4.61) and p Sn −→ S(F ) = F −1 (α2 ) − F −1 (α1 ) as n → ∞. The Euclidean norm can be replaced by Lp -norm or another suitable norm. It is also possible to use only the absolute difference of the first components of the regression quantiles, i.e., Sn = βn1 (α2 ) − βn1 (α1 ) (iii) Bickel and Lehmann (1979) proposed several measures of spread of distribution F, e.g., 12 1 2 F −1 (u) − F −1 (1 − u) dΛ(u) S(F ) = 1 2
where Λ is the uniform probability distribution on interval ( 12 , 1 − δ), 0 < δ < 12 . Using their idea, we obtain a broad class of scale statistics based on regression quantiles of the following type 12 2 1 (1 − u) dΛ(u) Sn = β (u) − β 1 2
n
n
2 (1 − u) Again, the squared norm β (u) − β in the integral can be ren n (u) and placed by the squared difference of the first components of β n β (1 − u). n
(iv) Estimators of 1/f (F −1 (α)) based on regression quantiles. The density quantile function f (F −1 (α)), 0 < α < 1 is an important characteristic of the probability distribution, necessary in the statistical inference based on quantiles. Unfortunately, it is not easy to estimate f (F −1 (α)), even if we only need its value at a single α. Similarly as when we estimate the density, we are able√to estimate f (F −1 (α) only with a lower rate of consistency than the usual n. Falk (1986) constructed estimates of f (F −1 (α) for the location model, based on the sample quantiles. These estimates are either of the histogram or the kernel type. Benefiting from the convenient properties of regression quantiles, we can extend Falk’s estimates to the linear regression model, replacing the sample quantiles with the intercept components of the regression quantiles. The histogram type estimator has the form
1 βn1 (α + νn ) − βn1 (α − νn ) Hn(α) = (4.64) 2νn
108
ROBUST ESTIMATORS IN LINEAR MODEL
where the sequence {νn } is our choice and it should satisfy 1 νn = o n− 2 and lim nνn = ∞ n→∞
The kernel type estimator of 1/f (F −1 (α)) is based on the kernel function k : R1 → R1 that has a compact support and satisfies the relations k(x)dx = 0 and xk(x)dx = −1 The kernel estimator then has the form
1 1 n1 (u)k α − u du = β χ(α) n νn2 0 νn
(4.65)
where this time the sequence {νn } should satisfy νn → 0,
nνn2 → ∞,
nνn3 → 0, as n → ∞
√ Both (4.64) and (4.65) are nνn -consistent estimators of 1/f (F −1 (α)) and are invariant/equivariant in the sense of (4.61). Their lower rate of consistency is analogous to that of the density estimator and cannot be considerably improved. As such, they are not usually used for a studentization, but they are very useful in various contexts, e.g., in the inference on the quantiles of F. Their asymptotic properties, applications and numerical illustrations are studied in detail in the book of Dodge and Jureˇckov´ a (2000). (v) Scale statistics based on the regression rank scores. ˆnn (α)), 0 < α < 1 be the regression rank scores for model Let (ˆ an1 (α), . . . , a (4.1). Choose a nondecreasing score function ϕ : (0, 1) → R1 standardized so that 1−α0 ϕ2 (α)dα = 1 α0
for a fixed α0 , 0 < α0 <
1 2
ˆbni = −
and calculate the scores 1−α0
ϕ(α)dˆ ani (α), i = 1, . . . , n α0
The scale statistic
1 ˆ Yi bni n i=1 n
Sn =
is invariant/equivariant in the sense of (4.61) and it is an estimator of the functional 1−α0 ϕ(α)F −1 (α)dα S(F ) = α0
(4.66) √
n-consistent
ESTIMATORS WITH HIGH BREAKDOWN POINTS
109
4.9 Estimators with high breakdown points The breakdown point of an estimator in the linear model takes into account not only possible replacements of observations Y1 , . . . , Yn by arbitrary values, but also possible replacements of vectors (x1 , Y1 ) , . . . , (xn , Yn ) . More precisely, our observations create a matrix ⎡ ⎤ ⎡ ⎤ x1 , y1 z1 ⎢ z ⎥ ⎢ x , y ⎥ ⎢ 2 ⎥ ⎢ 2 2 ⎥ ⎥ ⎢ ⎥ Z=⎢ ⎢ ... ⎥ = ⎢ ... ⎥ ⎣ ⎦ ⎣ ⎦ z n
xn , yn
and the breakdown point of estimator T parameter β is the smallest integer mn (Z) such that, replacing arbitrary m rows in matrix Z by arbitrary rows and denoting the resulting estimator T∗m , then sup T − T∗m = ∞, where the supremum is taken over all possible replacements of m rows. We also often measure the breakdown point by means of a limit ε∗ = limn→∞ mnn , if the limit exists. We immediately see that even the estimators that have the breakdown point 1/2 in the location model can hardly attain 1/2 in the regression model, because the matrix X plays a substantial role. Then one naturally poses questions, e.g., whether there are any estimators with maximal possible breakdown points in the regression model; and if so, what do they look like, is it easy to compute them, and in which context are they useful and desirable? Siegel replied affirmatively to the first question in 1982, when he constructed a so-called repeated median with the 50% breakdown point. In the simple regression model Yi = α + xi β + ei , i = 1, . . . , n, the repeated median of the slope parameter has a simple form Yi − Yj β = med1≤i≤n medj:j=i xi − xj However, in the p-dimensional regression its computation needs O (np ) operations, hence it is not convenient for practical applications. Hampel (1975) expressed the idea of the least median of squares (LMS) that minimizes med1≤i≤n {[Yi − xi t]2 }, t ∈ Rp
(4.67)
Rousseeuw (1984) demonstrated that this estimator had the 50% breakdown 1 point. It estimates β consistently, but with the rate of consistency n 3 only (Kim and Pollard (1990), Davies (1990)), hence it is highly inefficient.
110
ROBUST ESTIMATORS IN LINEAR MODEL
Rousseeuw also suggested the least trimmed squares estimator (LTS) as a solution of the minimization hn
(Yi − xi t) := min, t ∈ Rp
i=1
where hn = [n/2] + [(p√+ 1)/2] and [a] denotes the integer part of a. This estimator is already an n-consistent estimator of β and has the breakdown point 1/2. Its weighted version was studied by V´ıˇsek (2002a, b). Another high breakdown point estimator is the S-estimator, constructed by Rousseeuw and Yohai (1984) and described in Section 4.5. These estimators are studied in detail in the book by Rousseeuw and Leroy (1987); great attention is also devoted to their computational aspects. These estimators combine the high 1 breakdown point with the proper rate of consistency n 2 , and the S-estimator also with a fairly good efficiency at the normal model; yet, these estimators are inordinately difficult to compute. The least trimmed sum of absolute deviations (LTA) is another estimator, originally proposed by Bassett (1991), and further studied by Tableman (1994 a, b), H¨ ossjer (1994) and Hawkins and Olive (1999). To have an estimator with a high breakdown point is desirable in regression and other models, but the high breakdown point is only one advantage that cannot be emphasized above others. Using these estimators in practice, we should be aware of the possibility that, being resistant to outliers and gross errors, they can be sensitive to small perturbations in the central part of the data. We can refer to to Hettmansperger and Sheather (1992) and to Ellis (1998) for numerical evidence of this problem, while it still needs a thorough analytical treatment.
4.10 One-step versions of estimators Many estimators, such as the M -estimators, regression quantiles and maximal likelihood estimators, are defined implicitly as solutions of a minimization or of a system of equations. It can be difficult to solve such problems algebraically, but there can be still other difficulties. The most serious difficulty happens when the system of equations has more solutions, but only one is efficient, and it is hard to distinguish the efficient solution from the others, if it is possible at all. We have already met this situation in the context of M -estimators: the system of equations (4.30) can have more roots, among them at least one is √ n-consistent, but it is difficult to distinguish it from the others. √ The n-consistent, or even the efficient root of the system of equations can be often approximated by its one-step version. This consists in the first step of the Newton-Raphson iteration algorithm of solving the algebraic equation. Let us illustrate this approach on the one-step version of the M -estimator in
ONE-STEP VERSIONS OF ESTIMATORS
111
the linear regression model (4.1), generated by a function ρ with derivative ψ, and studentized by scale statistics Sn = Sn (Y ). The procedure was first used by Bickel (1975) in the context of Huber’s M (0) estimator. Generally, √ it starts with an initial consistent estimator Mn of parameter β; the n-consistent initial estimator gives the best results. The one-step M -estimator is defined through the relation ⎧ 1 ⎨ M(0) ˆn = 0 n + γ ˆn W n . . . γ (4.68) = M(1) n ⎩ M(0) . . . γ ˆ = 0 n n where Wn =
Q−1 n
n i=1
xi ψ
Yi − xi M(0) n Sn
1 X Xn n n and γˆn is an estimator, either of the functional
x 1 γ= ψ dF (x) S(F ) R S(F ) Qn =
or of the functional
f (xS(F ))dψ(x)
γ= R
depending on whether the selected score function ψ is continuous or not. For instance, in the case of absolutely continuous ψ, we can use the estimator n 1 Yi − xi M(0) n ψ γˆn = nSn i=1 Sn (1) (0) M √ n is a good approximation of the consistent M -estimator Mn : if Mn is n-consistent and ψ is sufficiently smooth, then −1 Mn − M(1) (4.69) n = Op n
however,
−3/4 n = O Mn − M(1) p n
only, if ψ has jump discontinuities. More on the one-step versions of M - and L-estimators can be found in Janssen et al. (1985), Jureˇckov´ a and Welsh (1990), among others. The k-step versions in the location model were studied by Jureˇckov´ a and Mal´ y (1995), among others. Generally we can say that the one-step versions give good approximations of an M -estimator with smooth function ψ, while the approximation is rather poor for the discontinuous ψ, and then it is not much improved even when we use k steps instead of one.
112
ROBUST ESTIMATORS IN LINEAR MODEL
Naturally, an ideal estimator would have both a high efficiency and a high breakdown point. This can be partially achieved when we combine a high breakdown point initial estimator with an iteration based on a highly efficient M -estimator. This was first done by Jureˇckov´ a and Portnoy (1987), who τ started with an initial estimator M(0) such that n M(0) n n − β = Op (1) for 1 1 some τ ∈ ( 4 , 2 ], and defined the one-step version as ⎧ 1 ⎨ M(0) ˆn = 0 n + γ ˆn W n . . . γ = M(1) n ⎩ M(0) . . . γˆn = 0 n where W n is defined in (4.68) and a > 0 is a given constant. Then M(1) n has (1) (1) the same breakdown point as Mn , while Mn is asymptotically equivalent to the non-iterative M -estimator Mn ., i.e., −2τ M(1) n − Mn = Op n This was further extended by Simpson et al. (1992) to the GM -estimators; Welsh and Ronchetti (2002) compared various one-step versions of M -estimators, √ and effects of various initial estimators. The difference between various n-consistent initial estimators appears only in the reminder terms of the approximations, or in the so-called second order asymptotics. Surprisingly, though theoretically we always get an approximation (4.69), the numerical evidence shows that the initial estimator plays a substantial role. The best approximation gives such an initial M(0) n that already has the same influence a and Sen (1990)). This is always true for M(1) function as Mn (Jureˇckov´ n , as we see from (4.69), hence we should follow the Welsh and Ronchetti proposal and use two Newton-Raphson iterations, instead of one.
4.11 Numerical illustrations Consider the dataset used by Galton in 1885 to study the relationship between a parent’s and his or her child’s height. The dataset (see Figure 4.3) contains 1078 measurements of a father’s height and his son’s height in inches, and it was used by Galton to investigate the regression. Data are available from R package, UsingR. We have the model Yi = β0 + β1 X1i + ei ,
i = 1, . . . , 1078
where Y is the son’s height, X1 is the father’s height. We want to determine the unknown values β0 and β1 , assuming that the errors e1 , . . . , e1078 are independent and identically distributed, symmetrically around zero. In Table 4.1 we will now give the results obtained by the different methods of regression used in the preceding sections.
NUMERICAL ILLUSTRATIONS
113
O
O O
70
O O O
O O O O O O O O O O O OO O O O O OO O O O O O OO O O O OOOOO O O O O O O O O O O O OOO O O OOO O O OO O OO O OO O O O OOO OO O O O O O O O O O O O O OO O O O O O O OO OO O O O O O O O O OO O OO O O O O OOO O OOO O O O O O OOO OOO O O O O O OOO OO O O O O OO O O O OOOO O O O O O O OO O O O O OO OOOOO OOO O O OO OO O OOO O O O O OO O O O O OO O OOOOO O O OOO O OOO O OO O OOO O O O O OO O O O OOOOO O O OO O O OO OO OO O O OOOO O OO O O O O O O O O O O O OOO O O O O O O O O O O O O O O O OO O OO O OO O O O OO O OO OOO O O O O O O O O O O O O O O O OO O O O O O O O O O O O OO O O O OOO OOO OOOO O O OOOOOO O O O OO OO OO O O OOO OOO O O O OO O O O OOO O OO OO O O O O O O O O O OO O O OOOO OOO O OOO OO OOO OOO O O O OOO O OO O O OO O O OOOOO OO O OOO OO O O O OO OOOO OO OOOOOOO O OO OO O O O OO O O OO O O O OOO O OOOOO O OO OO OO O O O O O O O O O O O O O O O O O O O O OOO O O O O O OOOOO OOOOOO O O OO O O OOO OO O OO OO O OOOO O OOO OO OOO OOO O O O O O OOO O OO O OO O OOO O O OO O O O O OOO OO O OOO OO O OO OO O O O O O O OO O OO O OOO O OO OO O OOO O OO O OO O O O OO O OOOOO OO OO O O OO O OOOOO O OOO OO OO OO O O O OO O O O O OO OOO O O OOOOO O O O O OOOOO OOOO OO O OO O OO OOO OOOO OO O O OOO O O O O O O O OO O O O OO OOOOOO O O OO O OO O O O O O O O O O O O O O O O O O O O O O O O O OOO O OO O O OO O O O O O O O OO OO OO O O O O O OO O OO O O O O O O O O O O O O O O O O O O O OO O O O O OOO O O OO O O O O O OO O OO O O OO O OO OO OO OO O O O OO O OO OO O O O OO O O O O O O O OO OO O O OO O O OO O O O O O OO O O O O O OO O O O O O O OO O O O OO O O O O O O OO O O O O OO OO O O O O O O O OOO O O O O O O O O O O O O O O O O O O O O OO O O OO O O O O O O O O O O O O OO O O O O
O
60
65
Son’s height
75
O
O O
60
O
O
O
65
70
75
Father’s height
Figure 4.3 The Galton dataset contains 1078 measurements of a father’s and his son’s height in inches. Table 4.1 Results for Galton data.
Methods used for estimation Least square M -estimation M M -estimation Least absolute deviation Trimmed least squares (α = 0.05) Trimmed least squares (α = 0.20) Least median of squares Least trimmed squares
βˆ0
βˆ1
33.8866 34.5813 34.7037 35.5710 34.3876 35.0088 45.9288 38.7060
0.5141 0.5039 0.5021 0.4885 0.5068 0.4974 0.3281 0.4391
We see that the values of estimated coefficients are rather close to each other with the exception of the least median of squares estimation. Figure 4.4 illustrates this difference, i.e., least squares regression and least median of squares lines.
114
ROBUST ESTIMATORS IN LINEAR MODEL
O
O O
70
O O O
O O O O O O O O O O O OO O O O O OO O O O O O OO O O O OOOOO O O O O O O O O O O O OOO O O OOO O O OO O OO O OO O O O OOO OO O O O O O O O O O O O O OO O O O O O O OO OO O O O O O O O O OO O OO O O O O OOO O OOO O O O O O OOO OOO O O O O O OOO OO O O O O OO O O O OOOO O O O O O O OO O O O O OO OOOOO OOO O O OO OO O OOO O O O O OO O O O O OO O OOOOO O O OOO O OOO O OO O OOO O O O O OO O O O OOOOO O O OO O O OO OO OO O O OOOO O OO O O O O O O O O O O O OOO O O O O O O O O O O O O O O O OO O OO O OO O O O OO O OO OOO O O O O O O O O O O O O O O O OO O O O O O O O O O O O OO O O O OOO OOO OOOO O O OOOOOO O O O OO OO OO O O OOO OOO O O O OO O O O OOO O OO OO O O O O O O O O O OO O O OOOO OOO O OOO OO OOO OOO O O O OOO O OO O O OO O O OOOOO OO O OOO OO O O O OO OOOO OO OOOOOOO O OO OO O O O OO O O OO O O O OOO O OOOOO O OO OO OO O O O O O O O O O O O O O O O O O O O O OOO O O O O O OOOOO OOOOOO O O OO O O OOO OO O OO OO O OOOO O OOO OO OOO OOO O O O O O OOO O OO O OO O OOO O O OO O O O O OOO OO O OOO OO O OO OO O O O O O O OO O OO O OOO O OO OO O OOO O OO O OO O O O OO O OOOOO OO OO O O OO O OOOOO O OOO OO OO OO O O O OO O O O O OO OOO O O OOOOO O O O O OOOOO OOOO OO O OO O OO OOO OOOO OO O O OOO O O O O O O O OO O O O OO OOOOOO O O OO O OO O O O O O O O O O O O O O O O O O O O O O O O O OOO O OO O O OO O O O O O O O OO OO OO O O O O O OO O OO O O O O O O O O O O O O O O O O O O O OO O O O O OOO O O OO O O O O O OO O OO O O OO O OO OO OO OO O O O OO O OO OO O O O OO O O O O O O O OO OO O O OO O O OO O O O O O OO O O O O O OO O O O O O O OO O O O OO O O O O O O OO O O O O OO OO O O O O O O O OOO O O O O O O O O O O O O O O O O O O O O OO O O OO O O O O O O O O O O O O OO O O O O
O
60
65
Son’s height
75
O
O O
60
O
O
O
65
70
75
Father’s height
Figure 4.4 Least squares (solid line) and least median of squares (dashed line) regression for the Galton dataset contains 1078 measurements of a father’s and his son’s height in inches.
Another example of simple regression is from astronomy: the HertzsprungRussell diagram of the Star Cluster CYG OB1, which contains 47 stars in the direction of Cygnus. The first variable is the logarithm of the effective temperature at the surface of the star (X) and the second one is the logarithm of its light intensity (Y ). This dataset was analyzed by Rousseeuw and Leroy (1987), who compared the least squares method with the least median of squares estimator. The data contains the four giant stars — outliers, which are leverage points (but they are not errors). We again have the model Yi = β0 + β1 X1i + ei ,
i = 1, . . . , 47
where Y is the log light intesity and X1 is the log of the temperature. The results obtained by different methods are summarized in Table 4.2. We see that not only the least square regression but also M , trimmed least squares and least absolute deviation regression, are sensitive to a leverage point. Figure 4.5 displays the data and the least squares, least absolute deviation and least trimmed squares lines.
COMPUTATION AND SOFTWARE NOTES
115
Table 4.2 Results for data of the Hertzsprung-Russell diagram of the Star Cluster CYG OB1.
Methods used for estimation
βˆ1
6.7935 6.8658 -4.9702 8.1492 6.7935 6.7935 -12.76 -14.05
-0.4133 -0.4285 2.2533 -0.6931 -0.4133 -0.4133 4.00 4.32
5.5 5.0 4.0
4.5
Log light intensity
6.0
Least square M -estimation M M -estimation Least absolute deviation Trimmed least squares (α = 0.05) Trimmed least squares (α = 0.20) Least median of squares Least trimmed squares
βˆ0
3.6
3.8
4.0
4.2
4.4
4.6
Log temperature
Figure 4.5 Least squares (solid line), least absolute deviation (dashed line) and least trimmed squares (dotted line) regression for the dataset of the Hertzsprung-Russell diagram of the Star Cluster CYG OB1.
4.12 Computation and software notes The basic function of R for the linear regression (corresponding to least square method) is lm. It is used to fit linear models but it can be used to carry out regression (for example the analysis of covariance). The call is as follows:
116
ROBUST ESTIMATORS IN LINEAR MODEL > lm(formula, data)
where formula is a symbolic description of the model (the only required argument) and data is an optimal data frame. For example, if we consider the model Yi = β0 + β1 X1i + β2 X2i + ei , i = 1, . . . , n then > lm(y~x1+x2, data) The intercept term is implicitly present; its presence may be confirmed by giving a formula such as y ∼ 1+x1+x2. It may be omitted by giving a -1 term in formula, as in y∼x1+x2-1 or also y∼0+x1+x2. lm returns an object of class “lm.” Generic functions to perform further operations on this object include, among others: print for a simply display, summary for a conventional regression analysis output, coefficients for extracting the vector of regression coefficients, residuals for the residuals (response minus fitted values), fitted.values for fitted mean values, deviance for the residual sum of squares, plot for diagnostic plots, predict for prediction, including confidence and prediction intervals. Consider this example. Koenker and Bassett (1982) give the Engel food expenditure data. This is a regression dataset consisting of 235 observations on income (x) and expenditure (y) on food for Belgian working class households.
> data(engel) > lm(y~x,data=engel) Call: lm(formula = y ~ x, data = engel) Coefficients: (Intercept) 147.4754
x 0.4852
> engel.fit<-lm(y~x,data=engel) > # object of class lm > ###### print ############ > print(engel.fit) Call: lm(formula = y ~ x, data = engel)
COMPUTATION AND SOFTWARE NOTES Coefficients: (Intercept) 147.4754
117
x 0.4852
> ###### summary ######### > summary(engel.fit) Call: lm(formula = y ~ x, data = engel) Residuals: Min 1Q -725.699 -60.239
Median -4.317
3Q 53.411
Max 515.772
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 147.47539 15.95708 9.242 <2e-16 *** x 0.48518 0.01437 33.772 <2e-16 *** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 114.1 on 233 degrees of freedom Multiple R-Squared: 0.8304, Adjusted R-squared: 0.8296 F-statistic: 1141 on 1 and 233 DF, p-value: < 2.2e-16
> ###### coefficients ############ > coefficients(engel.fit) (Intercept) x 147.4753885 0.4851784 > ###### residuals ########### > residuals(engel.fit) 1 2 3 4 -95.4873907 -99.1979999 -99.0175287 -54.5459709 5 -16.2232564
6 27.4411921
7 80.8752198
8 77.8958642
....... 229 -30.6916798
230 -43.5993020
233
234
231 232 -54.6858594 -110.8549133 235
118
ROBUST ESTIMATORS IN LINEAR MODEL
38.4621331
14.6014720
89.6828553
> ######### residual sum of squares ############## > deviance(engel.fit) [1] 3033805 > ######## fitted values ############## > fitted(engel.fit) 1 351.3268
2 410.1567
3 584.6975
4 457.5433
5 511.7840
6 606.3566
7 549.8813
8 622.5450
9 783.0004
10 871.5551
......... 226 921.4131
227 524.2629
228 229 744.6929 1024.6547
230 349.0383
231 361.2049
232 410.0542
233 429.5387
235 660.6373
234 508.0004
> ####### diagnostic plots ###### > plot(engel.fit) Hit Hit Hit Hit
to to to to
see see see see
next next next next
plot: plot: plot: plot:
We obtain Figures 4.6 – 4.9. We see that the dataset contains outliers. From that reason we could look for an alternative more robust method. The package MASS implements the procedure rlm(), which correponds to either M -estimator or M M -estimator. The package MASS is recommended; it will be distributed with every binary distribution of R.
COMPUTATION AND SOFTWARE NOTES
119
600
Residuals vs Fitted
0
O O O O O O OO OO OO O O O OO OO O O OOO O O O O O O O O O O O OO OO O OO O O OO O O O OO O O O O O OOO O O O OOO O O OO O O OO OOO OO O O O O O O O OO O O O OO O O O OO OO O O O O O OOOO OO O O O O O OO O O O O O O O O O O O O O O O O OO OO O O O O O O OO O OOOO O O O O OO O O OO O O O OO O O O O OO O O O OO OO OO OOOOOO O OO OO O O O O OO O O O
−400
Residuals
200
400
59O O
105O
−800
138O
500
1000
1500
2000
2500
Fitted values lm(formula = y ~ x, data = engel)
Figure 4.6 Residuals versus the fitted values for Engel food expenditure data.
6
Normal Q−Q plot
4
59O
2 0 −2
O
−4
Standardized residuals
O O OOOOO OOO O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O OOOOOO O O OO
−6
105O
138O
−3
−2
−1
0
1
2
3
Theoretical Quantiles lm(formula = y ~ x, data = engel)
Figure 4.7 Normal Q-Q plot for Engel food expenditure data.
120
ROBUST ESTIMATORS IN LINEAR MODEL
Scale−Location plot 2.5
138O
2.0
O
1.0
1.5
O
0.0
0.5
Standardized residuals
105O 59O
O O
O O O O O O OO O O OOO O O OO O O OO O OO OO O O O O OOOO O O O O O O O O OO O O O OOOOO O O O O OO OO OO OO OO O O O O O OO O O O OO O OOOO O O OO O O O O O O O OO OO O O O O OO O O OO O OO OOO OO O O O O O OO OOO O OO OO O O O O OO O OOO O O OO O OO O O OO O O OO OO O O O OO O O OOO O O O O O O O O OOO O O OOO O OO OO O O O O O O O O O OO O OO O O OO O O O O OO
500
1000
1500
2000
2500
Fitted values lm(formula = y ~ x, data = engel)
Figure 4.8 Scale-location plot for Engel food expenditure data.
10
Cook’s distance plot
6 4 2
Cook’s distance
8
138
105
0
59
0
50
100
150
200
Obs. number lm(formula = y ~ x, data = engel)
Figure 4.9 Cook’s distance plot for Engel food expenditure data.
COMPUTATION AND SOFTWARE NOTES > ########## load package ####### > library(MASS) > ######### use engel data ##### > data(engel) > ######### M-estimator ####### > rlm(y~x,data=engel) Call: rlm(formula = y ~ x, data = engel) Converged in 10 iterations Coefficients: (Intercept) x 99.4319013 0.5368326 Degrees of freedom: 235 total; 233 residual Scale estimate: 81.4 > ### > summary(rlm(y~x,data=engel)) Call: rlm(formula = y ~ x, data = engel) Residuals: Min 1Q Median 3Q Max -933.748 -54.995 4.768 53.714 418.020 Coefficients: Value Std. Error t value (Intercept) 99.4319 12.1244 8.2010 x 0.5368 0.0109 49.1797 Residual standard error: 81.45 on 233 degrees of freedom Correlation of Coefficients: (Intercept) x -0.8845 > ### > deviance(rlm(y~x,data=engel)) [1] 3203849
121
122
ROBUST ESTIMATORS IN LINEAR MODEL
> ######## MM-estimator ###### > rlm(y~x,data=engel,method="MM") Call: rlm(formula = y ~ x, data = engel, method = "MM") Converged in 15 iterations Coefficients: (Intercept) x 85.5072163 0.5539063 Degrees of freedom: 235 total; 233 residual Scale estimate: 76.3 > ### > summary(rlm(y~x,data=engel,method="MM")) Call: rlm(formula = y ~ x, data = engel, method = "MM") Residuals: Min 1Q Median 3Q Max -1004.471 -55.202 1.173 51.141 383.753 Coefficients: Value Std. Error t value (Intercept) 85.5073 12.0796 7.0787 x 0.5539 0.0109 50.9320 Residual standard error: 76.31 on 233 degrees of freedom Correlation of Coefficients: (Intercept) x -0.8845 > ### > deviance(rlm(y~x,data=engel,method="MM")) [1] 3339047 > ######## Choice of psi function ###### > > > >
#### #### #### ####
Psi functions are supplied for the Huber, Hampel and Tukey bisquare proposals as psi.huber, psi.hampel and psi.bisquare. psi.huber is default
COMPUTATION AND SOFTWARE NOTES
123
> rlm(y~x,data=engel,psi=psi.huber) Call: rlm(formula = y ~ x, data = engel, psi = psi.huber) Converged in 10 iterations Coefficients: (Intercept) x 99.4319013 0.5368326 Degrees of freedom: 235 total; 233 residual Scale estimate: 81.4 > ### > rlm(y~x,data=engel,psi=psi.hampel) Call: rlm(formula = y ~ x, data = engel, psi = psi.hampel) Converged in 8 iterations Coefficients: (Intercept) x 80.2517611 0.5601229 Degrees of freedom: 235 total; 233 residual Scale estimate: 80.4 > ### > rlm(y~x,data=engel,psi=psi.bisquare) Call: rlm(formula = y ~ x, data = engel, psi = psi.bisquare) Converged in 11 iterations Coefficients: (Intercept) x 86.245026 0.552921 Degrees of freedom: 235 total; 233 residual Scale estimate: 81.3 > Package MASS also contains the procedure lqs for a regression estimator with a high breakdown point. We can select either the LMS estimator or the LTS estimator. > ################ LMS ################## > lqs(y~x,data=engel,method="lms")
124
ROBUST ESTIMATORS IN LINEAR MODEL
Call: lqs.formula(formula = y ~ x, data = engel, method = "lms") Coefficients: (Intercept) -3.1939
x 0.6903
Scale estimates 63.57 63.82 > ############### LTS ################# > lqs(y~x,data=engel,method="lts") Call: lqs.formula(formula = y ~ x, data = engel, method = "lts") Coefficients: (Intercept) -23.6590
x 0.7284
Scale estimates 70.89 69.86 We could find a function for LTS regression also in the package rrcov. This contributed package is available for download from CRAN (http://cran.r-project.org/) and its mirrors. > library(rrcov) Scalable Robust Estimators with High Breakdown Point > ltsReg(engel$x,engel$y) Coefficients: Intercept engel$x -5.2541 0.6824 Scale estimate 85.04 The least absolute deviation estimator is possible to find in the package quantreg. Function rq() enables computation of the regression quantiles and the function ranks() computes the regression rank scores. > ##### LAD estimator = regression median - tau=0.5 ###### > rq(y~x,data=engel,tau=0.5) Call: rq(formula = y ~ x, tau = 0.5, data = engel)
COMPUTATION AND SOFTWARE NOTES
125
Coefficients: (Intercept) x 81.4822474 0.5601806 Degrees of freedom: 235 total; 233 residual > ############ regression quantiles ################# > quant<-c(.05,.25,.75,.95) > > coefficients(rq(y~x,data=engel,tau=quant)) tau= 0.05 tau= 0.25 tau= 0.75 tau= 0.95 (Intercept) 124.8800408 95.4835396 62.3965855 64.1039632 x 0.3433611 0.4741032 0.6440141 0.7090685 > The trimmed least squares estimator we could implement with help rq() as follows: > "tls.KB"<-function(formula, data, alpha=0.05) + { + resid1 <-residuals(rq(formula,data,tau=alpha)) + resid2 <-residuals(rq(formula,data,tau=1-alpha)) + c1 <- c(resid1 >= 0) + c2 <- c(resid2 <= 0) + coefficients(lm(formula,data[c(c1 & c2),])) + } > ##### Using ###### > tls.KB(y~x,data=engel,alpha=0.2) (Intercept) x 95.5841657 0.5421883 > We plot the engel data and some fit lines this way: > + > > > > > > + >
plot(engel$x,engel$y,xlab="Household income", ylab="Food expenditure",type = "n", cex=.5) points(engel$x,engel$y,cex=.5,pch=20) abline(lm(y~x,data=engel),lty = 1) abline(rlm(y~x,data=engel),lty = 2) abline(rq(y~x,data=engel,tau=0.5),lty = 3) abline(lqs(y~x,data=engel),lty = 4) legend(3500,700,c("LS line", "M line","LAD line","LTS line" ), lty = c(1,2,3,4))
Then we obtain Figure 4.10.
ROBUST ESTIMATORS IN LINEAR MODEL
1500 1000
LS line M line LAD line LTS line
500
Food expenditure
2000
126
1000
2000
3000
4000
5000
Household income Figure 4.10 Fit lines for Engel food expenditure data.
4.13 Problems and complements 4.1 The number of distinct solutions of the linear program (4.46), as α runs from 0 to 1, is of order Op (n · ln n). In the location model, when X = 1n , the number of distinct values is exactly 1 (Portnoy (1991)). 4.2 The L1 -estimator and any other M -estimator of β in model (4.1) has the breakdown point n1 . 4.3 Another possible characteristic of performance of L-estimators in the linear model is the largest integer m such that, for any set M ⊂ N = {1, . . . , n} of size m, 1 i∈N −M |xi b| > inf b=1 2 i∈N |xi b|
PROBLEMS AND COMPLEMENTS
127
4.4 The restricted M -estimator is a solution either of minimization (4.25) or of minimization (4.28) under the linear constraint Aβ = c
(4.70)
where A is a q × p matrix of full rank and c ∈ R . The relation (4.70) can be our hypothesis and we are interested in the behavior of the estimator in the hypothetical situation. p
, i = 1, . . . , n be the residuals with respect to estimator 4.5 Let ri = Yi −xi β 0 β 0 of β, and let θn be the solution of the minimization n
ρα (ri − θn ) := min
i=1
where ρα is the criterion function of an α-regression quantile defined in (4.45), 0 < α < 0, then n− 2 1
n
ψα (ri − θn ) → 0 a.s. as n → ∞
i=1
where ψα (x) = α − I[x < 0], x ∈ R (Ruppert and Carroll (1980)). (α) be a solution of the minimization (4.44) (the regression α4.6 Let β n quantile, then n 1 (α) → 0 a.s. as n → ∞ xi ψα Yi − xi β n− 2 n i=1
where ψα is the function from problem 4.4 (Ruppert and Carroll (1980)). 4.7 The Wilcoxon-type test of hypothesis H : β (2) = 0 in the model Y = X(1) β (1) + X2 β(2) + e (see (4.59), X(1) is n × p and X(2) is n × q), based on the regression rank scores, is based on the following criterion. Assume that β (1) contains an intercept, i.e. the first column of X(1) is 1n . Calculate the regression rank scores (ˆ an1 (α), . . . , a ˆnn (α)), 0 ≤ α ≤ 1, for the hypothetical model Y = X(1) β(1) + e, the Wilcoxon scores 1 ˆbn = udˆ ani , i = 1, . . . , n 0
n (2) and linear rank statistics vector Sn = i=1 xi ˆbi . Then we reject H provided 1 −1 S Q Sn ≥ χ2q (.95) Tn = 12 n n where χ2q (.95) is the 95% quantile of the χ2q distribution (Gutenbrunner et al. (1993)).
128
ROBUST ESTIMATORS IN LINEAR MODEL
4.8 Apply the described methods on the “salinity” dataset consisting of measurements of water salinity (salt concentration, lagged salinity, trend) and river discharge taken in North Carolina. This dataset was listed by Ruppert and Carroll (1980) and the package rrcov contains it. 4.9 Software for robust statistics was developed by A. Marazzi (1992). On his web page http://www.iumsp.ch/Unites/us/Alfio/msp programmes.htm find the program library ROBETH, which has been interfaced to the statistical environments S-Plus and R.
CHAPTER 5
Multivariate location model
5.1 Introduction Estimating a location parameter of multivariate observations is a very common and important task in the practice. If we want to estimate the location robustly, we try to apply similar robustness criteria and measures of performance of estimators, such as in the univariate model. We characterize the estimators by the influence function, global and local sensitivities, breakdown point, the maxbias and a high asymptotic efficiency under the normal distribution. We also desire to have an estimator equivariant in a suitable sense. The main classes of robust estimators, M -, S,- L,- and R-estimators, were extended to the multivariate models. Among them, the latter two classes depend on the ordered data. Because the multivariate data are not naturally ordered, we should first define some ordering, and there are many possible ways to do it. Serfling (2004) gives an excellent review of multivariate descriptive measures based on sample quantiles, and collects citations to the main papers in the area. The multivariate sign and rank methods were systematically developed by Hettmansperger, Oja, Randles and their coworkers (see M¨ ott¨ onen and Oja (1995), Marden (1998), Oja and Randles (2004), among others. Similarly as in the univariate case, the generalized signs and ranks are mostly used for testing. The multivariate median is a special case, because it represents a center and is a special M -estimator in the univariate case. Here we can refer to medians of Brown (1983), Oja (1983), Small (1990), Chakraborty et al. (1988), Vardi and Zhang (2000), Hettmansperger and Randles (2002), Zuo (2003), Zuo (2004), among others. The literature on multivariate estimation is very rich, due to many various aspects and many possible extensions of univariate concepts. Thus we can touch on only the main ideas and main results of multivariate robustness. We shall illustrate these ideas mainly using M -estimators and S-estimators, with a brief description of various other constructions.
5.2 Multivariate M -estimators of location and scatter The location parameter of a multivariate distribution should be indentifiable and Fisher consistent. Whenever it is possible, we try to characterize the location parameter as a center of symmetry. However, while the symmetry and 129
130
MULTIVARIATE LOCATION MODEL
the center of symmetry are uniquely determined in the univariate model, their extension to the multivariate model is not straightforward and can be made in several possible ways. Then we will speak on the spherical, elliptical, central and angular symmetries of a multivariate distribution. We shall mainly concentrate on spherically and elliptically symmetric distributions, briefly describe some problems we encounter in the estimation of their parameters, and show some robust estimators of the location parameters. The distribution of random vector X is spherically symmetric about θ, if it is invariant with respect to a rotation, i.e., if X − θ has the same distribution as U(X − θ) for any orthogonal p × p matrix U. The distribution of X − θ then depends only on X − θ, the maximal invariant with respect to the rotations, and density of X, if it exists, has the form g(x − θ), x ∈ Rp with a nonnegative function g. The spherically symmetric distributions are used in physical models for a description of the move of fluids, in the models of gravitation and thermodynamic fields, among other applications. The class of spherically symmetric distributions covers the multivariate normal distribution with covariance matrix σ 2 Ip , but also the standard p-variate tdistribution with k degrees of freedom, what is the distribution of the random √ k vector S X where X is the standard p-variate normal, S is independent of X and has χ2 distribution with k degrees of freedom. A random vector X has a nonsingular elliptically symmetric distribution with parameters θ and V if X has the same distribution as the random vector A Y + θ where Y has a spherically symmetric distribution and A is a nonsingular p × p matrix such that A A = V. The density of X, if it exists, has the form 1 (5.1) |V|− 2 g (x − θ) V−1 (x − θ) = |A|−1 g0 A−1 (x − θ) with nonnegative functions g and g0 . If Y is p-variate normal Np (0, Ip ), then X is p-variate normal Np (θ, V) with V = A A. Elliptically symmetric distributions have various applications, that are described, e.g., in Devlin et al. (1976). Consider the problem of estimating parameters θ and V based on n independent observations, all distributed according to density (5.1), by estimators Tn and Vn , affine equivariant in the following sense Tn (AX1 + b1 . . . . , AXn + b) = ATn (X1 , . . . , Xn ) + b (5.2)
Vn (AX1 + b1 . . . . , AXn + b) = AVn (X1 , . . . , Xn )A
for all nonsingular p × p matrices A and b ∈ Rp . Even if we want to estimate only θ by an equivariant estimate and V is unknown, we cannot avoid a simultaneous estimation of V: anything like a studentization does not apply in the multivariate model. Maronna (1976) was the first to extend the class of M -estimators to the mul-
MULTIVARIATE M -ESTIMATORS OF LOCATION AND SCATTER 131 tivariate model and defined the M -estimators of (θ, V) in model (5.1), based on n independent observations X1 , . . . , Xn . Huber (1981) then studied these estimators in his book. For the sake of brevity, denote Zi = V− 2 (Xi − θ), i = 1, . . . , n; then the M -estimator of (θ, V) is a solution (Tn , Vn ) of the system of equations 1
1 w1 (Zi )Zi = 0 n i=1 n
(5.3)
1 w2 (Zi )Zi Zi = Ip n i=1 n
with suitable functions w1 , w2 . These estimators are affine equivariant ; the statistical functional, corresponding to (5.3), is defined implicitly as a solution of the system of equations EF {w1 (Z)Z} = 0
(5.4)
EF {w2 (Z)ZZ } = Ip Generally, the functions w1 (x), w2 (x) are assumed to be nonnegative, nonincreasing and continuous for x ≥ 0. Maronna proved the existence, uniqueness and consistency of the estimator (5.3), derived its influence function and its asymptotic normality, under some conditions on w1 , w2 and on the function g. Maronna and Huber showed that its breakdown point could not exceed the 1 . upper bound p+1 If the distribution of observations is spherically symmetric (i.e., V = Ip ) and θ = 0, the influence function of Tn is equal to (see Huber (1981), Section 8.7) IF (x; F, T) =
xw1 (x) EF w1 (Y) + p1 Yw1 (Y)
Notice that it is bounded if w1 (x) = ψHx(x) , where ψH (x) is the Huber function (3.16). Moreover, the components of Tn are asymptotically independent, and √ n(Tn − θ) has asymptotically the p-dimensional normal distribution with expectation 0 and with the covariance matrix σ 2 Ip , where −2 1 1 2 σ = EF (Xw1 (X) EF w1 (X) + Xw1 (X) p p 2
(5.5)
Example 5.1 (Huber Proposal 2.) The special model, considered by Huber (1964), corresponds to the choice wi (x) = ψix(x) , i = 1, 2, where ψ1 (x) = ψH (x) = ψH (x, k) is the Huber function (3.16) and ψ2 (x) = ψH (x, k 2 ).
132
MULTIVARIATE LOCATION MODEL
5.3 High breakdown estimators of multivariate location and scatter The M -estimator is computationally simple but its breakdown point cannot 1 exceed p+1 . Among estimators with a high breakdown point we can mention the multivariate S-estimators proposed by Davies (1987) and Lopuha¨ a (1989), the minimum volume ellipsoid estimator and minimum covariance determinant estimator proposed by Rousseeuw (1985) (see also Davies (1992)), and Stahel-Donoho estimator (Stahel (1981), Donoho (1982), Maronna et al. (1992), Maronna and Yohai (1995)). Another class of estimators is based on projections of multivariate estimators into a suitable univariate problem (the P -estimators, the projection estimators). Here belong the Stahel-Donoho estimator, further estimates based on Tukey’s concept of depth (Donoho and Gasko (1992)), the P -estimator of covariance matrices (Maronna, Stahel and Yohai (1992)), and its extension to multivariate location model (Tyler (1994), Adrover and Yohai (2002)). These estimators have good maxbias and breakdown properties, but are computationally intensive. The one-step versions of these estimators, similarly as in the linear regression model, can be computed more easily, but they can lose some other advantages. Let us return to the S-estimators, which were introduced by Rousseeuw and Yohai (1984) in the linear regression model. Being adjusted to the multivariate location model, the S-estimator of (θ, V) in model (5.1) is defined as (Tn , Vn ) minimizing the determinant det(V) with respect to (t, V) subject to the constraint
n 12 1 ρ (Xi − t) V−1 (Xi − t) (5.6) = b0 n i=1 where ρ is a symmetric function with a continuous derivative ψ and ρ(0) = 0. There exists a constant k > 0 such that the function ρ is strictly increasing on [0, k] and constant on [k, ∞]. The constant 0 < b0 < sup ρ(x) is chosen to achieve a high efficiency under some specific distribution, usually normal. A slightly modified S-estimator was proposed by Davies (1987). Lopuha¨ a and Rousseeuw (1991) showed that under a suitable choice of b0 , the S-estimator can attain the breakdown point n1 [n−p+2] → 12 ; but because the high break2 down point and high efficiency are in some contradiction, the value b0 leading to a high breakdown point differs from that leading to a high efficiency. In the standard case that θ = 0 and V = Ip in (5.1), the influence function of the component Tn of the S-estimator is IF (x; T, F ) = where β = EF
x 1 ψ(x) β x
1 1 1− ψ(X) + ψ (X) p p
(5.7)
ADMISSIBILITY AND SHRINKAGE The influence function of the component Vn satisfies 1 1 IF (x; V, F ) − trace[IF (x; V, F )]Ip = pψ(x)x p γ1 and
133
xx 1 − Ip 2 x p
1 2 trace[IF (x; V, F )] = (ρ(x) − b0 p γ2
where
1 EF ψ (X)X2 + (p + 1)ψ(X)X p+2 γ2 = EF (ψ(X)X)
γ1 =
5.4 Admissibility and shrinkage When we classify estimators from the decision theoretical viewpoint, then we see that many robust estimators are inadmissible with respect to standard loss functions already in the univariate models. For example, Jureˇckov´ a and Klebanov (1997, 1998) showed that the M -estimators of univariate location parameter, whose influence functions are constant on both tails, are inadmissible for any distribution extended on the whole line R. Such estimators cannot be even Bayesian in the finite sample case. The inadmissibility pertains also to multivariate models, where not only the robust estimators but even the maximum likelihood estimators (MLE) are generally inadmissible. Stein (1956) showed that if X has the normal Np (θ, Ip ) distribution and p > 2, then
E ((1 − ∆)X − θ) ((1 − ∆)X − θ) < E {(X − θ) (X − θ)} p−2 with ∆ = X X , hence X (1 − ∆) dominates X. Even under a nonnormal distribution both MLE and robust estimators often √ admit shrinkage versions, dominating them asymptotically at least in a n-neighborhood of the true parameter value. It indicates that it is doubtful to emphasize the asymptotic efficiency of an estimator under the normality, because the concept of the asymptotic relative (or risk) efficiency is imprecise in multivariate cases. We guess that this phenomenon deserves some illustration.
Let X = (X1 , . . . , Xp ) be a random vector with distribution function F (x; θ) = F (x1 − θ1 , . . . , xp − θp ), absolutely continuous. Using n independent observations X1 , . . . , Xn , we want to estimate the unknown location parameter θ = (θ1 , . . . , θp ) . The loss incurred in using estimator Tn can have various forms. First, the general quadratic loss Tn − θ2Q = (Tn − θ) Q−1 (Tn − θ)}
(5.8)
with a positively definite matrix Q; the corresponding risk is defined as RQ (Tn , θ) = Trace(Q−1 VTn )
(5.9)
134
MULTIVARIATE LOCATION MODEL
where
VTn = E(Tn − θ)(Tn − θ) (5.10) is the dispersion matrix of Tn . An important special case of the quadratic loss is the squared error loss corresponding to Q ≡ Ip , Tn − θ2 =
n
(Tni − θi )2
(5.11)
i=1
If Q = diag(q11 , . . . , qpp ) is a diagonal matrix, then (5.8) reduces to p (Tni − θi )2 i=1
qii
In practice, the matrix Q is given, while VTn depends on n and on the un√ known F. For asymptotically normal Tn , such that n(Tn −θ) has asymptotic distribution Np (0, V), we can construct its shrinkage version, asymptotically dominating Tn with respect to the quadratic risk (5.8) and (5.9) with positively definite matrix Q. Assume that the matrix V can be estimated from the data, i.e., that there p n } such that V n −→ exists a sequence {V V as n → ∞. Choose a pivot value θ0 of θ, e.g., our hypothetical value of θ. Then the observed distance of Tn from θ0 can be characterized in the following way: −1 (Tn − θ0 ) Ln = n(Tn − θ 0 ) V n
(5.12)
A possible shrinkage version of Tn we consider in the form −1 −1 Vn )(Tn − θ0 ) TSn = θ0 + (Ip − kdn L−1 n Q
(5.13)
where k is a fixed positive constant, 0 ≤ k ≤ 2(p − 2), and dn is the smallest n . If the pivot θ0 = θ, and θ is sufficiently far from characteristic root of QV θ0 so that θ − θ0 > ε > 0, then the shrinkage version TSn has the same asymptotic risk as Tn . However, the risk of TSn is smaller than that of Tn provided θ lies in a local neighborhood of the pivot, i.e., θ = θ(n) = θ 0 + n−1/2 ξ,
ξ ∈ Rp
(5.14)
For such local neighborhood of the chosen pivot, TSn asymptotically dominates −1 −1 Tn . If V = Q , hence trace Q V−1 = p and δ = 1, the reduction is maximal for the choice k = p − 2; otherwise it depends on the factors of the model. More details can be found in the book by Jureˇckov´ a and Sen (1996).
5.5 Numerical illustrations and software notes Consider the Stack Loss Plant Data (stackloss) used by Brownlee (1960) and studied by Dodge (1996). Data were obtained from 21 days of operation of a plant for the oxidation of ammonia (NH3) to nitric acid (HNO3). The nitric oxides produced are absorbed in a countercurrent absorption tower.
NUMERICAL ILLUSTRATIONS AND SOFTWARE NOTES
135
stackloss is a data frame with 21 observations on 4 variables: (i) Air. Flow — the flow of cooling air, it represents the rate of operation of the plant. (ii) Water. Temp — the temperature of cooling water circulated through coils in the absorption tower. (iii) Acid. Conc. — the concentration of the acid circulating, minus 50, times 10: that is, 89 corresponds to 58.9 percent acid. (iv) stack.loss — (the dependent variable) is 10 times the percentage of the ingoing ammonia to the plant that escapes from the absorption column unabsorbed; that is, an (inverse) measure of the overall efficiency of the plant. Two methods for robust covariance estimation are offered via the function cov.rob in packages MASS (Rousseeuw and Leroy (1987), Rousseeuw and van Zomeren (1990), Rousseeuw and van Driessen (1999) and Venables and Ripley (2002)). The usage of function is as follows: cov.rob(data, quantile.used, method) where data is a matrix or data frame, quantile.used is the minimum number of the data points regarded as good points, method — minimum volume ellipsoid, minimum covariance determinant or classical. First method mve — the minimum volume ellipsoid seeks an ellipsoid containing k=quantile.used points which is of minimum volume. Method mcd — minimum covariance determinant seeks k points whose volume of the Gaussian confidence ellipsoid is minimal. > ###### MVE ##### > cov.rob(stackloss, method="mve") $center Air.Flow Water.Temp Acid.Conc. stack.loss 56.3750 20.0000 85.4375 13.0625 $cov Air.Flow Water.Temp Acid.Conc. stack.loss Air.Flow 23.050000 6.666667 16.625000 19.308333 Water.Temp 6.666667 5.733333 5.333333 7.733333 Acid.Conc. 16.625000 5.333333 34.395833 13.837500 stack.loss 19.308333 7.733333 13.837500 18.462500 $msg [1] "32 singular samples of size 5 out of 2500" $crit [1] 20.03295
136
$best [1] 5
MULTIVARIATE LOCATION MODEL
6
7
8
9 11 12 13 14 16 17 18 19
$n.obs [1] 21 > ######## MCD ########### > cov.rob(stackloss, method="mcd") $center Air.Flow Water.Temp Acid.Conc. stack.loss 56.26667 20.13333 85.66667 13.20000 $cov Air.Flow Water.Temp Acid.Conc. stack.loss Air.Flow 24.495238 7.390476 18.238095 20.942857 Water.Temp 7.390476 5.838095 5.190476 7.971429 Acid.Conc. 18.238095 5.190476 35.952381 14.285714 stack.loss 20.942857 7.971429 14.285714 19.457143 $msg [1] "20 singular samples of size 5 out of 2500" $crit [1] 6.397633 $best [1] 5
6
7
8
9 10 11 12 15 16 17 18 19
$n.obs [1] 21
########### classical ############# > cov.rob(stackloss, method="classical") $center Air.Flow Water.Temp Acid.Conc. stack.loss 60.42857 21.09524 86.28571 17.52381 $cov Air.Flow Water.Temp Acid.Conc. stack.loss Air.Flow 84.05714 22.657143 24.571429 85.76429 Water.Temp 22.65714 9.990476 6.621429 28.14762
NUMERICAL ILLUSTRATIONS AND SOFTWARE NOTES Acid.Conc. 24.57143 stack.loss 85.76429
6.621429 28.147619
28.714286 21.792857
137
21.79286 103.46190
$n.obs [1] 21 An alternative approach assumes that the data came from a multivariate t distribution: this provides some degree of robustness to an outlier without giving a high breakdown point (Kent, Tyler and Vardi (1994), Venables and Ripley (2002)). The corresponding function is cov.trob in package MASS. > cov.trob(stackloss) $cov Air.Flow Water.Temp Acid.Conc. stack.loss Air.Flow 60.47035 17.027203 18.554452 62.28032 Water.Temp 17.02720 8.085857 5.604132 20.50469 Acid.Conc. 18.55445 5.604132 24.404633 16.91085 stack.loss 62.28032 20.504687 16.910855 72.80743 $center Air.Flow Water.Temp Acid.Conc. stack.loss 58.96905 20.79263 86.05588 16.09028 $n.obs [1] 21 $call cov.trob(x = stackloss) $iter [1] 5 Function covMcd in the package rrcov computes a multivariate location and scale estimate with a high breakdown point using the Fast MCD (Minimum Covariance Determinant) estimator. The estimator implemented in covMcd() is similar to the function cov.rob(data, method="mcd"). The implementation in rrcov uses the Fast MCD algorithm of Rousseeuw and van Driessen (1999) to approximate the minimum covariance determinant estimator.
> covMcd(stackloss) Call: covMcd(x = stackloss) Log(det):
6.398
138
MULTIVARIATE LOCATION MODEL
Center: Air.Flow 56.27
Water.Temp 20.13
Acid.Conc. 85.67
stack.loss 13.20
Covariance Matrix:
Air.Flow Water.Temp
Air.Flow 55.13 16.63
Water.Temp 16.63 13.14
Acid.Conc. 41.05 11.68
stack.loss 47.13 17.94
Acid.Conc. stack.loss
41.05 47.13
11.68 17.94
80.91 32.15
32.15 43.79
> The function tolellipse plots the 0.975 tolerance ellipse of the bivariate dataset. The ellipse is defined by those data points whose distance is equal to the square root of the 0.975 χ2 quantile with 2 degrees of freedom. We obtain Figure 5.1 if we apply this function on the stackloss dataset. > tolellipse(stackloss[,1:2],classic=TRUE)
35
CLASSICAL TOLERANCE ELLIPSE (97.5%)
30
30
35
ROBUST TOLERANCE ELLIPSE (97.5%)
2
2
O O OO O
20
20
O OO O
O O O O O O O
O
10
10
15
O
15
O O O O O O O
O
25
25
O O
40
60
80
40
60
80
Figure 5.1 The 0.975 tolerance ellipse of the stackloss dataset.
PROBLEMS AND COMPLEMENTS
139
5.6 Problems and complements

5.1 X is spherically symmetric about $\theta$ if and only if $\|X-\theta\|$ and $(X-\theta)/\|X-\theta\|$ are independent (Dempster (1969)).

5.2 A possible measure of performance of estimators is the Pitman closeness risk. For a suitable loss function $L(\cdot,\cdot)$ and for two competing estimators $T_1$ and $T_2$ of $\theta$, denote
\[ \mathcal{P}(T_1, T_2; \theta) = P_\theta\{L(T_1,\theta) < L(T_2,\theta)\} + \tfrac12\,P_\theta\{L(T_1,\theta) = L(T_2,\theta)\}. \]
Then $T_1$ is said to be Pitman-closer to $\theta$ than $T_2$ if $\mathcal{P}(T_1, T_2; \theta) \ge \tfrac12$, with a strict inequality for at least some $\theta$; and $T_1$ is called the Pitman-closest estimator of $\theta$ in the family $\mathcal{C}$ if $\mathcal{P}(T_1, T; \theta) \ge \tfrac12$ for all $\theta$ and all $T \in \mathcal{C}$. Sen (1994) showed that the Pitman-closeness risk is asymptotically isomorphic to the quadratic risk.

5.3 Stein (1981) found a whole class of estimators dominating X with respect to the quadratic risk. A particularly important class of such estimators are the James-Stein estimators
\[ \Big(1 - \frac{a}{X^\top X}\Big)X. \]
In the case of unknown scale they can be modified to $\big(1 - a\,U^\top U / X^\top X\big)X$, if there is an appropriate residual vector U available for estimating the scale. As was shown by Cellier, Fourdrinier and Robert (1989) and by Cellier and Fourdrinier (1995), the latter estimators dominate the naive estimator X not only for the normal distribution but, with a proper a, even simultaneously for all spherically symmetric distributions with density $\exp\{-g(\|x-\theta\|)\}$ with a suitable function g. Because of this property, such estimators got the name "robust James-Stein estimators." They were later studied in more detail by Cohen, Brandwein and Strawderman (1991), by Fourdrinier and Strawderman (1996), and by Fourdrinier, Marchand and Strawderman (2004), among others.

5.4 Let $x_1, \ldots, x_n$ be distinct points in $\mathbb{R}^p$ and consider the minimization of
\[ C(y) = \sum_{i=1}^n \eta_i\,\|y - x_i\| \qquad (5.15) \]
with respect to $y \in \mathbb{R}^p$, where $\eta_1, \ldots, \eta_n$ are chosen positive weights. The solution of (5.15) is called the $L_1$-multivariate median. If $C(y)$ is strictly convex in $\mathbb{R}^p$ and $x_1, \ldots, x_n$ are not collinear, then the minimum is uniquely determined. If $x_1, \ldots, x_n$ lie on a straight line, the minimum is achieved at any one-dimensional median and may not be unique (Vardi and Zhang (2000)).
5.5 Let $x_1, \ldots, x_n$ and $\eta_1, \ldots, \eta_n$ be the same as in Problem 5.4 and put $\xi_i = \eta_i / \sum_{j=1}^n \eta_j$, $i = 1, \ldots, n$. Denote
\[ D(y) = \begin{cases} 1 - \|\bar e(y)\| & \text{if } y \notin \{x_1, \ldots, x_n\} \\ 1 - \big(\|\bar e(y)\| - \xi_k\big)^+ & \text{if } y = x_k \text{ for some } k, \end{cases} \]
where $e_i(y) = (y - x_i)/\|y - x_i\|$, $i = 1, \ldots, n$, and $\bar e(y) = \sum_{i:\,x_i \neq y} \xi_i\,e_i(y)$. The function $\bar e(y)$ is the spatial rank function considered by Möttönen and Oja (1995) and Marden (1998); Chaudhuri (1996) considered the corresponding multivariate quantiles. Verify that $\|\bar e(y)\| \le \sum_{i:\,x_i \neq y} \xi_i \le 1$, hence $0 \le D(y) \le 1$ and $D(y) \to 0$ as $\|y\| \to \infty$. Moreover, $D(x_k) = 1$ if $\xi_k \ge \tfrac12$, i.e., the point $x_k$ is the $L_1$-multivariate median if it possesses at least half of the total weight (Vardi and Zhang (2000)).
CHAPTER 6
Some large sample properties of robust procedures
6.1 Introduction

Robust estimators are non-linear functions of the observations, often only implicitly defined. It is very difficult to derive their distribution functions under a finite number of observations. If we are not able to derive the finite-sample distribution in a compact form, we take recourse to limit forms of the distribution functions of the estimators as the number of observations n increases to infinity. The limit distribution is often normal, and the variance of the asymptotically normal distribution is an important characteristic of the estimator. The asymptotic distribution of robust estimators cannot be derived by a straightforward application of the central limit theorem, because they are not linear combinations of independent random variables. Thus, our first step is to approximate the sequence $\sqrt{n}\,(T_n - T(F))$ by a linear combination of independent summands. In the literature, these approximations are usually called asymptotic representations of $T_n$. If the functional T is Fréchet differentiable, then the asymptotic representation follows from the expansion (1.17) and can be written in the form
\[ \sqrt{n}\,(T_n - T(F)) = \frac{1}{\sqrt{n}} \sum_{i=1}^n IF(X_i, T, F) + R_n \qquad (6.1) \]
where $IF(x, T, F)$ is the influence function of $T(F)$ and the remainder term is asymptotically negligible as $n \to \infty$, i.e., $R_n = o_p(1)$. A similar asymptotic representation of the estimator $T_n$ can be derived even if $T_n$ is not Fréchet differentiable, using other methods and under various conditions on the smoothness of the distribution function F and of the score functions of the estimators ($\psi$, J, $\varphi$), respectively. The asymptotic theory of robust and nonparametric statistical procedures provides many interesting and challenging mathematical problems, and a host of excellent papers and books has been devoted to their solutions. However, we should keep in mind that the asymptotic solution primarily helps in situations when we do not have an exact solution under a finite number of observations.
The asymptotic properties of robust estimators and their asymptotic relations were studied in detail in several excellent monographs; we refer to Bickel et al. (1993), Dodge and Jurečková (2000), Field and Ronchetti (1990), Hampel et al. (1986), Huber (1981), Jurečková and Sen (1996), Koul (2002), Lehmann (1999), Rieder (1994), Serfling (1980), Shorack and Wellner (1986), Shorack (2000), Sen (1981), van der Vaart and Wellner (1996), Witting and Müller-Funk (1995), among others. In this chapter we shall give only a brief outline of basic asymptotic results for M-, L- and R-estimators of the location parameter. The asymptotic relations and equivalence of various estimators can be of independent interest. These results can be extended in a more or less straightforward way to the regression model; we refer to the literature cited above. If an estimator $T_n$ admits the representation (6.1), then it is asymptotically normally distributed as $n \to \infty$ in the sense that
\[ \sqrt{n}\,(T_n - T(F)) \xrightarrow{\ \mathcal{L}\ } N(0, \sigma_F^2) \qquad (6.2) \]
where $\sigma_F^2 = E_F\big[IF(X, T, F)\big]^2$. Let us consider the asymptotic representations and distributions of M-, L- and R-estimators, whose influence functions we have derived earlier. The results are stated without proofs; we refer to Jurečková and Sen (1996) for details.
6.2 M-estimators

6.2.1 M-estimator of a general scalar parameter

Let $\{X_i, i = 1, 2, \ldots\}$ be a sequence of independent observations with a joint distribution function $F(x, \theta)$, $\theta \in \Theta$, where $\Theta$ is an open interval of $\mathbb{R}^1$. The M-estimator of $\theta$ is a solution of the minimization
\[ \sum_{i=1}^n \rho(X_i, \theta) := \min, \qquad \theta \in \Theta. \]
Assume that $\rho(x, \theta)$ is absolutely continuous in $\theta$ with derivative $\psi(x, \theta) = \frac{\partial}{\partial\theta}\rho(x, \theta)$. If $\psi(x, \theta)$ is continuous in $\theta$, then we look for the M-estimator $T_n$ among the roots of the equation
\[ \sum_{i=1}^n \psi(X_i, \theta) = 0. \qquad (6.3) \]
If the function $E_\theta\,\rho(X, t)$ has a unique minimum at $t = \theta$ (hence the functional is Fisher consistent) and some other conditions on either the smoothness of $\psi(x, \theta)$ or the smoothness of $F(x, \theta)$ are satisfied, then there exists a sequence $\{T_n\}$ of roots of the equation (6.3) such that, as $n \to \infty$, $\sqrt{n}\,(T_n - \theta) = O_p(1)$ and
\[ \sqrt{n}\,(T_n - \theta) = \frac{1}{\sqrt{n}\,\gamma(\theta)} \sum_{i=1}^n \psi(X_i, \theta) + O_p(n^{-1/2}) \qquad (6.4) \]
where $\gamma(\theta) = E_\theta\,\dot\psi(X, \theta)$ and $\dot\psi(x, \theta) = \frac{\partial}{\partial\theta}\psi(x, \theta)$. This further implies that $\sqrt{n}\,(T_n - \theta)$ has asymptotically normal distribution $N(0, \sigma^2(\psi, F))$, where
\[ \sigma^2(\psi, F) = \frac{E_\theta\,\psi^2(X, \theta)}{\gamma^2(\theta)}. \qquad (6.5) \]
6.2.2 M-estimators of the location parameter

Let $X_1, X_2, \ldots$ be independent observations with a joint distribution function $F(x - \theta)$. The M-estimator of $\theta$ is a solution of the minimization
\[ \sum_{i=1}^n \rho(X_i - \theta) := \min, \qquad \theta \in \mathbb{R}^1. \]
Assume that $\rho(x)$ is absolutely continuous with derivative $\psi(x)$ and that the function $h(t) = \int_{\mathbb{R}} \rho(x - t)\,dF(x)$ has a unique minimum at $t = 0$. If $\psi$ is absolutely continuous with derivative $\psi'$ and $\gamma = \int \psi'(x)\,dF(x) > 0$, then there exists a sequence $\{T_n\}$ of roots of the equation $\sum_{i=1}^n \psi(X_i - t) = 0$ such that $\sqrt{n}\,(T_n - \theta) = O_p(1)$,
\[ \sqrt{n}\,(T_n - \theta) = \frac{1}{\sqrt{n}\,\gamma} \sum_{i=1}^n \psi(X_i - \theta) + O_p(n^{-1/2}) \qquad (6.6) \]
and
\[ P_\theta\big(\sqrt{n}\,(T_n - \theta) \le x\big) \to \Phi\Big(\frac{x}{\sigma(\psi, F)}\Big) \quad \text{as } n \to \infty, \]
where $\sigma^2(\psi, F) = \gamma^{-2}\int_{\mathbb{R}} \psi^2(x)\,dF(x)$ and $\Phi$ is the standard normal distribution function N(0, 1). If F has an absolutely continuous density f with derivative $f'$ and finite Fisher information $I(F) = \int [f'(x)/f(x)]^2\,dF(x) > 0$, then, under the special choice $\rho(x) = -\log f(x)$, the M-estimator coincides with the maximum likelihood estimator of $\theta$, and its asymptotic variance attains the Rao-Cramér lower bound $1/I(F)$.

If $\psi(x)$ has points of discontinuity, then we should assume that the distribution function F has two derivatives $f, f'$ in their neighborhoods. The M-estimator is uniquely determined by relations (3.12), provided $\psi$ is nondecreasing. The solution $T_n$ of the minimization $\sum_{i=1}^n \rho(X_i - \theta) := \min$ is not necessarily a root of the equation $\sum_{i=1}^n \psi(X_i - \theta) = 0$, but it always satisfies
\[ n^{-1/2} \sum_{i=1}^n \psi(X_i - T_n) = O_p(n^{-1/2}) \quad \text{as } n \to \infty \qquad (6.7) \]
and
\[ \sqrt{n}\,(T_n - \theta) = \frac{1}{\sqrt{n}\,\gamma_1} \sum_{i=1}^n \psi(X_i - \theta) + O_p(n^{-1/4}), \qquad \gamma_1 = \int_{\mathbb{R}} f(x)\,d\psi(x), \qquad (6.8) \]
\[ P_\theta\big(\sqrt{n}\,(T_n - \theta) \le x\big) \to \Phi\Big(\frac{x}{\sigma(\psi, F)}\Big), \]
where $\sigma^2(\psi, F) = \gamma_1^{-2}\int_{\mathbb{R}} \psi^2(x)\,dF(x)$ and $\Phi$ is the distribution function of the normal distribution N(0, 1).

Let now $T_n^*$ be an M-estimator of $\theta$, studentized by a scale statistic $S_n$ such that $S_n \to S(F) = S$ as $n \to \infty$. Then, if both $\psi$ and F are absolutely continuous and smooth, $T_n^*$ admits the asymptotic representation
\[ T_n^* - \theta = \frac{1}{n\gamma_1} \sum_{i=1}^n \psi\Big(\frac{X_i - \theta}{S}\Big) - \frac{\gamma_2}{\gamma_1}\Big(\frac{S_n}{S} - 1\Big) + O_p(n^{-1}) \]
where $\gamma_1 = \frac1S\int \psi'\big(\frac{x}{S}\big)\,dF(x)$ and $\gamma_2 = \frac1S\int x\,\psi'\big(\frac{x}{S}\big)\,dF(x)$. Unless we impose additional conditions on $\psi$, F and on $S_n$, we cannot separate $T_n^*$ from $S_n$, so $\sqrt{n}\,(T_n^* - \theta)$ is not asymptotically normally distributed, but
\[ \sqrt{n}\,\Big[\gamma_1\,(T_n^* - \theta) + \gamma_2\Big(\frac{S_n}{S} - 1\Big)\Big] \xrightarrow{\ \mathcal{L}\ } N(0, \sigma^2), \]
where $\sigma^2 = \int \psi^2\big(\frac{x}{S}\big)\,dF(x)$. These results simplify in the symmetric case, when $\psi(-x) = -\psi(x)$ and $F(x) + F(-x) = 1$, $x \in \mathbb{R}$, because then $\gamma_2 = 0$.

6.3 L-estimators

Let $X_1, X_2, \ldots$ be independent observations with a joint distribution function F. The L-estimator of type I is a linear combination of order statistics, $T_n = \sum_{i=1}^n c_{ni} X_{n:i}$, with the coefficients generated by the weight function J according to (3.58) or (3.59). Among them, we shall concentrate on the robust members, i.e., on the trimmed L-estimators with $J(u) = 0$ for $0 \le u < \alpha$ and $1 - \alpha < u \le 1$, $0 < \alpha < \frac12$. To derive the asymptotic representation of such L-estimators, we should assume that the distribution function F is continuous almost everywhere and that F and J have no joint point of discontinuity. More precisely, if J is discontinuous at u, then $F^{-1}$ should be Lipschitz in a neighborhood of u. Under these conditions, as $n \to \infty$,
\[ \sqrt{n}\,(T_n - T(F)) = n^{-1/2} \sum_{i=1}^n \psi_1(X_i) + O_p(n^{-1/2}), \]
where
\[ T(F) = \int_0^1 J(u)\,F^{-1}(u)\,du \qquad (6.9) \]
and
\[ \psi_1(x) = -\int_{\mathbb{R}} \{I[y \ge x] - F(y)\}\,J(F(y))\,dy, \qquad x \in \mathbb{R}. \]
Moreover, $\sqrt{n}\,(T_n - T(F))$ has asymptotically normal distribution $N\big(0, \sigma^2(J, F)\big)$, where
\[ \sigma^2(J, F) = \int_{\mathbb{R}} \psi_1^2(x)\,dF(x) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} J(F(x))\,J(F(y))\,\big[F(x \wedge y) - F(x)F(y)\big]\,dx\,dy. \]
With an appropriate choice of the weight function we obtain even an asymptotically efficient L-estimator; the efficient weight function has the form
\[ J(u) = J_F(u) = \frac{\psi'(F^{-1}(u))}{I(F)}, \qquad 0 < u < 1, \qquad (6.10) \]
where
\[ \psi(x) = -\frac{f'(x)}{f(x)}, \qquad x \in \mathbb{R}, \]
and $I(F) = \int [f'(x)/f(x)]^2\,dF(x)$ is the Fisher information of F; this is of course possible only if $0 < I(F) < \infty$. The asymptotic variance of the resulting efficient L-estimator is
\[ \sigma^2(J, F) = \frac{1}{I(F)}, \]
that is, the Rao-Cramér lower bound. Notice that if $J_F(u) = 0$ for $0 < u < \alpha$ and for $1 - \alpha < u < 1$, then $-f'(x)/f(x) = -\frac{d\log f(x)}{dx} = \text{const}$ for $x < F^{-1}(\alpha)$ and $x > F^{-1}(1-\alpha)$, and hence the tails of the density f decrease to 0 exponentially fast as $x \to \pm\infty$.

Let us now turn to the L-estimators of type II, which are linear combinations of a finite number of sample quantiles, $T_n = \sum_{j=1}^k a_j X_{n:[np_j]+1}$, $0 < p_1 < \ldots < p_k < 1$. For the asymptotics of such estimators it is sufficient that the quantile function $F^{-1}$ is smooth in neighborhoods of $p_1, \ldots, p_k$. For instance, it is sufficient if F is twice differentiable at $F^{-1}(p_j)$ and $F'(F^{-1}(p_j)) > 0$, $j = 1, \ldots, k$. Then, as $n \to \infty$,
\[ \sqrt{n}\,\Big(T_n - \sum_{j=1}^k a_j F^{-1}(p_j)\Big) = n^{-1/2} \sum_{i=1}^n \psi_2(X_i) + R_n, \]
where
\[ R_n = O\big(n^{-1/4}(\log n)^{1/2}(\log\log n)^{1/4}\big) \quad \text{a.s.} \qquad (6.11) \]
and
\[ \psi_2(x) = \sum_{j=1}^k \frac{a_j}{F'(F^{-1}(p_j))}\Big(p_j - I\big[x \le F^{-1}(p_j)\big]\Big), \qquad x \in \mathbb{R}. \]
Then $\sqrt{n}\,\big(T_n - \sum_{j=1}^k a_j F^{-1}(p_j)\big)$ is asymptotically normally distributed as $N\big(0, \int_{\mathbb{R}} \psi_2^2(x)\,dF(x)\big)$.
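As a quick numerical check of the last statement (a sketch only, using standard facts about sample quantiles): for a single sample quantile (k = 1, $a_1 = 1$) the asymptotic variance $\int \psi_2^2\,dF$ reduces to $p(1-p)/f(F^{-1}(p))^2$. For the median of the standard normal this equals $\pi/2$:

p <- 0.5
p * (1 - p) / dnorm(qnorm(p))^2   # asymptotic variance of the sample median, about pi/2 = 1.571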
6.4 R-estimators

We want to estimate the center of symmetry $\theta$ of the distribution function $F(x - \theta)$ with the aid of an R-estimator $T_n$, generated by the rank statistic $S_n(t)$ of (3.84) by means of the relations (3.85). Assume that the score function $\varphi(u)$, $0 < u < 1$, in (3.84) is nondecreasing and square-integrable and that F has positive and finite Fisher information I(F). Then we obtain the asymptotic representation of $T_n$ in the form
\[ \sqrt{n}\,(T_n - \theta) = \frac{1}{\sqrt{n}\,\gamma} \sum_{i=1}^n \varphi(F(X_i - \theta)) + o_p(1) \quad \text{as } n \to \infty, \qquad (6.12) \]
where $\gamma = \int_{\mathbb{R}} \varphi(F(x))\,(-f'(x))\,dx$. Hence, $\sqrt{n}\,(T_n - \theta)$ has an asymptotically normal distribution $N\big(0, \gamma^{-2}\int_0^1 \varphi^2(u)\,du\big)$. Especially, we obtain an asymptotically efficient R-estimator with the asymptotic variance $1/I(F)$ if we take
\[ \varphi(u) = -\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}, \qquad 0 < u < 1. \]
6.5 Interrelationships of M-, L- and R-estimators

Let $\{T_{1n}\}$ and $\{T_{2n}\}$ be two sequences of estimators of $\theta$, the center of symmetry of the distribution $F(x - \theta)$. Assume that both $\{T_{1n}\}$ and $\{T_{2n}\}$ are asymptotically normally distributed, namely that $\sqrt{n}\,(T_{jn} - \theta)$ has asymptotic normal distribution $N(0, \sigma_j^2)$, $j = 1, 2$, as $n \to \infty$. Then the ratio of variances $e_{1,2} = \sigma_1^2/\sigma_2^2$ is called the asymptotic relative efficiency of $\{T_{2n}\}$ with respect to $\{T_{1n}\}$. Alternatively, if $\{T_{2n'}\}$ is based on $n'$ observations, then both $\sqrt{n'}\,(T_{2n'} - \theta)$ and $\sqrt{n}\,(T_{1n} - \theta)$ have asymptotically normal distribution $N(0, \sigma_1^2)$, provided the sequence $n' = n'(n)$ is such that there exists the limit
\[ \lim_{n\to\infty} \frac{n}{n'(n)} = \frac{\sigma_1^2}{\sigma_2^2} = e_{1,2}. \]
The fact that $e_{1,2} = 1$ means that $\{T_{1n}\}$ and $\{T_{2n}\}$ are equally asymptotically efficient. If this is the case, we can consider a finer comparison of $\{T_{1n}\}$ and $\{T_{2n}\}$ by means of the so-called deficiency of $\{T_{2n}\}$ with respect to $\{T_{1n}\}$. If
\[ E_\theta\big[n(T_{jn} - \theta)^2\big] = \tau^2 + \frac{a_j}{n} + o(n^{-1}), \qquad j = 1, 2, \]
then the ratio
\[ d_{1,2} = \frac{a_2 - a_1}{\tau^2} \]
is called the deficiency of $\{T_{2n}\}$ with respect to $\{T_{1n}\}$. If $n'(n)$ is chosen so that
\[ E_\theta\big[n'(T_{2n'} - \theta)^2\big] = E_\theta\big[n(T_{1n} - \theta)^2\big] + O(n^{-1}), \]
then
\[ d_{1,2} = \lim_{n\to\infty}\big[n'(n) - n\big]. \]
We have seen in Chapter 3 that the influence functions of M-estimators and L-estimators based on observations with distribution function $F(x - \theta)$ coincide, i.e., $IF(x, T_1, F) \equiv IF(x, T_2, F)$, provided $J(u) = \psi'(F^{-1}(u))$, $0 < u < 1$. Similar relations hold also between M-estimators and R-estimators, and between L-estimators and R-estimators. Now, using the asymptotic representations (6.4), (6.9), (6.11) and (6.12), we can make much stronger statements. Comparing these representations, we conclude that such estimators not only have the same influence functions, but that the sequences $\{T_{1n}\}$ and $\{T_{2n}\}$ are asymptotically close to each other in the sense that
\[ \sqrt{n}\,(T_{2n} - T_{1n}) = R_n = o_p(1) \quad \text{as } n \to \infty \qquad (6.13) \]
whenever their asymptotic representations coincide (up to the remainder terms). Such sequences $\{T_{1n}\}$ and $\{T_{2n}\}$ of estimators are called asymptotically equivalent. If we are able to derive the exact order of the remainder term $R_n$ in (6.13), we can get more information on the relation of $\{T_{1n}\}$ and $\{T_{2n}\}$. In some cases we are able to find the asymptotic distribution of $R_n$, standardized by an appropriate power of n (most often $n^{1/2}$ or $n^{1/4}$). This is an asymptotic distribution of the second order, and it is not normal. Unfortunately, as seen in Sections 6.2–6.4, the conditions for (6.13) depend on the unknown distribution function F; thus we cannot calculate the value of the R-estimator once we have calculated the M-estimator, etc. Rather, these relationships have a theoretical character and enable the discovery of some property of an estimator of one class, once it has been proved for another class. Let us summarize the most interesting asymptotic relations among the estimators.
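The notion of asymptotic relative efficiency is easy to explore by a small simulation. The following sketch (an illustration only, not part of the relations treated below) estimates the variance ratio of the sample mean and the sample median; the limits are $2/\pi$ under the normal distribution and 2 under the Laplace distribution.

are.sim <- function(rdist, n = 200, nrep = 2000) {
  est <- replicate(nrep, { x <- rdist(n); c(mean(x), median(x)) })
  var(est[1, ]) / var(est[2, ])        # estimated Var(mean)/Var(median)
}
are.sim(rnorm)                                              # close to 2/pi = 0.64
are.sim(function(n) sample(c(-1, 1), n, TRUE) * rexp(n))    # Laplace errors; close to 2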
6.5.1 M- and L-estimators

Let $X_1, X_2, \ldots$ be independent random variables with a joint distribution function $F(x - \theta)$ such that $F(x) + F(-x) = 1$, $x \in \mathbb{R}$; let $X_{n:1} \le X_{n:2} \le \ldots \le X_{n:n}$ be the order statistics corresponding to $X_1, \ldots, X_n$.

I. Let $M_n$ be the M-estimator of $\theta$ generated by a nondecreasing step function $\psi$,
\[ \psi(x) = \alpha_j \quad \text{for } s_j < x < s_{j+1}, \qquad j = 0, 1, \ldots, k, \qquad (6.14) \]
where
\[ -\infty = s_0 < s_1 < \ldots < s_k < s_{k+1} = \infty, \qquad -\infty < \alpha_0 \le \alpha_1 \le \ldots \le \alpha_k < \infty, \]
\[ \alpha_j = -\alpha_{k-j+1}, \quad s_j = -s_{k-j+1}, \qquad j = 1, \ldots, k, \]
and at least two among $\alpha_1, \ldots, \alpha_k$ are different. It means that $M_n$ is a solution of the minimization $\sum_{i=1}^n \rho(X_i - t) := \min$, where $\rho$ is a continuous, convex, symmetric and piecewise linear function with derivative $\rho' = \psi$ a.e. Assume that F has two bounded derivatives $f, f'$ and that f is positive in neighborhoods of $s_1, \ldots, s_k$. Then the L-estimator $L_n$ asymptotically equivalent to $M_n$ is the linear combination of a finite number of quantiles, $L_n = \sum_{j=1}^k a_j X_{n:[np_j]}$, where
\[ p_j = F(s_j), \qquad a_j = \frac1\gamma(\alpha_j - \alpha_{j-1})\,f(s_j), \qquad \gamma = \sum_{j=1}^k (\alpha_j - \alpha_{j-1})\,f(s_j) \;(> 0), \qquad (6.15) \]
and $M_n - L_n = O_p(n^{-3/4})$ as $n \to \infty$.

II. Assume that F has an absolutely continuous symmetric density f and finite Fisher information $I(F) > 0$. Let $M_n$ be the Huber M-estimator of $\theta$, generated by the function
\[ \psi(x) = \begin{cases} x & \ldots\ |x| \le c \\ c \cdot \operatorname{sign} x & \ldots\ |x| > c \end{cases} \]
where $c > 0$. Let $L_n$ be the $\alpha$-trimmed mean,
\[ L_n = \frac{1}{n - 2[n\alpha]} \sum_{i=[n\alpha]+1}^{n-[n\alpha]} X_{n:i}, \]
where $\alpha = 1 - F(c)$. If F further satisfies $f(x) > a > 0$ and $f'(x)$ exists for $x \in \big(F^{-1}(\alpha - \varepsilon),\, F^{-1}(1 - \alpha + \varepsilon)\big)$, $\varepsilon > 0$, then
\[ M_n - L_n = O_p(n^{-1}) \quad \text{as } n \to \infty. \qquad (6.16) \]

III. Let $L_n$ be the $\alpha$-Winsorized mean
\[ L_n = \frac1n\Big\{ [n\alpha]\,X_{n:[n\alpha]+1} + \sum_{i=[n\alpha]+1}^{n-[n\alpha]} X_{n:i} + [n\alpha]\,X_{n:n-[n\alpha]} \Big\}. \]
Then, under the same conditions as in II,
\[ M_n - L_n = O_p(n^{-3/4}), \qquad n \to \infty, \qquad (6.17) \]
where $M_n$ is the M-estimator generated by the function
\[ \psi(x) = \begin{cases} F^{-1}(\alpha) - \dfrac{\alpha}{f(F^{-1}(\alpha))} & \ldots\ x < F^{-1}(\alpha) \\[4pt] x & \ldots\ F^{-1}(\alpha) \le x \le F^{-1}(1-\alpha) \\[4pt] F^{-1}(1-\alpha) + \dfrac{\alpha}{f(F^{-1}(1-\alpha))} & \ldots\ x > F^{-1}(1-\alpha). \end{cases} \]

IV. Let $L_n = \sum_{i=1}^n c_{ni} X_{n:i}$, where the coefficients $c_{ni}$ are generated by a function $J : (0, 1) \to \mathbb{R}$ such that
\[ J(1-u) = J(u), \quad 0 < u < 1, \qquad \int_0^1 J(u)\,du = 1, \qquad J(u) = 0 \ \text{for } u \in (0, \alpha) \cup (1-\alpha, 1), \quad 0 < \alpha < \tfrac12, \]
and such that J is continuous in (0, 1) up to a finite number of points $s_1, \ldots, s_m$, where $\alpha < s_1 < \ldots < s_m < 1 - \alpha$, and J is Lipschitz continuous in the intervals $(\alpha, s_1), (s_1, s_2), \ldots, (s_m, 1 - \alpha)$. The distribution function F is assumed to have a symmetric density, $F^{-1}(u) = \inf\{x : F(x) \ge u\}$ is supposed to be Lipschitz continuous in neighborhoods of $s_1, \ldots, s_m$, and
\[ \int_{-A}^{A} f^2(x)\,dx < \infty, \qquad \text{where } A = F^{-1}(1 - \alpha + \varepsilon), \ \varepsilon > 0. \]
Then an asymptotically equivalent M-estimator $M_n$ is generated by the function
\[ \psi(x) = -\int_{\mathbb{R}} \big(I[y \ge x] - F(y)\big)\,J(F(y))\,dy, \qquad x \in \mathbb{R}, \]
and
\[ M_n - L_n = O_p(n^{-1}), \qquad n \to \infty. \qquad (6.18) \]
6.5.2 M- and R-estimators

Let $X_1, X_2, \ldots$ be independent random variables with the same distribution function $F(x - \theta)$ such that $F(x) + F(-x) = 1$, $x \in \mathbb{R}$. We assume that F has an absolutely continuous density f and finite Fisher information $I(F) > 0$. Let $\varphi : (0, 1) \to \mathbb{R}$ be a nondecreasing score function with $\varphi(1-u) = -\varphi(u)$, $0 < u < 1$, and $\int_0^1 \varphi^2(u)\,du < \infty$. Assume that
\[ \gamma = -\int_{\mathbb{R}} \varphi(F(x))\,f'(x)\,dx > 0. \]
Let $R_n$ be the R-estimator defined in (3.84) and (3.85) with the scores $a_n(i) = \varphi^+\big(\frac{i}{n+1}\big)$, $i = 1, \ldots, n$, where $\varphi^+(u) = \varphi\big(\frac{u+1}{2}\big)$, $0 \le u < 1$. Moreover, let $M_n$ be the M-estimator generated by the function $\psi(x) = c\,\varphi(F(x))$, $x \in \mathbb{R}$, $c > 0$. Then
\[ \sqrt{n}\,(M_n - R_n) = o_p(1) \quad \text{as } n \to \infty. \qquad (6.19) \]
Especially, the Hodges-Lehmann R-estimator is generated by the score function $\varphi(u) = u - \frac12$, $0 < u < 1$, and hence an asymptotically equivalent M-estimator is generated by the $\psi$-function $\psi(x) = F(x) - \frac12$, $x \in \mathbb{R}$.
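The asymptotic variance of the Hodges-Lehmann estimator is easy to evaluate numerically from the formulas of Section 6.4: with $\varphi(u) = u - \frac12$ one has $\int_0^1 \varphi^2(u)\,du = \frac1{12}$ and, after an integration by parts, $\gamma = \int f^2(x)\,dx$. A small sketch for the standard normal:

gam <- integrate(function(x) dnorm(x)^2, -Inf, Inf)$value   # gamma = integral of f^2
(1/12) / gam^2    # asymptotic variance, about pi/3 = 1.047; its reciprocal 3/pi is the efficiency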
6.5.3 R- and L-estimators

Combining the previous results, we obtain the asymptotic relations of R- and L-estimators. We shall omit the detailed description, but we should mention an interesting case of an R-estimator asymptotically equivalent to the $\alpha$-trimmed mean, generated by the score function
\[ \varphi(u) = \begin{cases} F^{-1}(\alpha) & \ldots\ 0 < u < \alpha \\ F^{-1}(u) & \ldots\ \alpha \le u \le 1 - \alpha \\ F^{-1}(1-\alpha) & \ldots\ 1 - \alpha < u < 1. \end{cases} \]
6.6 Minimaximally robust estimators

Most of the estimators $T_n = T(F_n)$ are asymptotically normally distributed, i.e., the distribution of $\sqrt{n}\,(T_n - T(F))$ converges, as $n \to \infty$, to the normal distribution $N(0, V_{as}(F, T))$, where $V_{as}(F, T) = \int_{\mathbb{R}} IF^2(x, T, F)\,dF(x)$. The maximum asymptotic variance
\[ \sigma^2(T) = \sup_{F \in \mathcal{F}} V_{as}(F, T) \]
over a specified class $\mathcal{F}$ of distribution functions can be considered as a measure of robustness of the functional T (and of the estimator $T_n$). On the other hand, consider a class $\mathcal{T}$ of functionals, e.g., of M-functionals, and look for $T_0 \in \mathcal{T}$ such that $\sigma^2(T_0) \le \sigma^2(T)$ for all $T \in \mathcal{T}$. Such a functional, if it exists, is called minimaximally robust, because it satisfies
\[ \sigma^2(T_0) = \inf_{T \in \mathcal{T}} \sup_{F \in \mathcal{F}} V_{as}(F, T). \qquad (6.20) \]
Let us illustrate this situation on the special case of the location parameter, when $X_1, \ldots, X_n$ is a random sample from a population with distribution function $F(x - \theta)$, $\theta$ is an unknown parameter, and F is an unknown element of a family $\mathcal{F}$ of distribution functions. The most usual classes $\mathcal{F}$ are as follows:

(i) Contamination model:
\[ \mathcal{F}_G = \{F : F = (1 - \varepsilon)G + \varepsilon H, \ H \in \mathcal{P}\} \qquad (6.21) \]
where G is a fixed distribution function, $\varepsilon \in [0, 1)$ is a fixed number, and H runs over a fixed class $\mathcal{P}$ of distribution functions.
(ii) Kolmogorov model:
\[ \mathcal{F}_G = \Big\{F : \sup_{x \in \mathbb{R}} |F(x) - G(x)| \le \varepsilon\Big\}, \qquad \varepsilon \in [0, 1) \ \text{fixed}. \qquad (6.22) \]
Let $F_0 \in \mathcal{F}$ be the distribution function with the smallest Fisher information in $\mathcal{F}$ (the least favorable distribution of the family $\mathcal{F}$), i.e.,
\[ I(F_0) = \int_{\mathbb{R}} \Big(\frac{f_0'(x)}{f_0(x)}\Big)^2 dF_0(x) = \min_{F \in \mathcal{F}} I(F). \]
Let $T_0$ be the element of the class $\mathcal{T}$ of estimators that is the asymptotically efficient estimator of $\theta$ for the distribution function $F_0$, i.e., $V_{as}(F_0, T_0) = \frac{1}{I(F_0)}$ (if it exists). If
\[ \frac{1}{I(F_0)} = V_{as}(F_0, T_0) \ge \sup_{F \in \mathcal{F}} V_{as}(F, T_0), \]
then
\[ \inf_{T \in \mathcal{T}} \sup_{F \in \mathcal{F}} V_{as}(F, T) = \frac{1}{I(F_0)}, \]
i.e., for all $T \in \mathcal{T}$ and all $F \in \mathcal{F}$,
\[ V_{as}(F_0, T) \ge V_{as}(F_0, T_0) \ge V_{as}(F, T_0). \qquad (6.23) \]
In the symmetric contamination model, the minimaximally robust estimator exists in the class of M -, L- and R-estimators, respectively (see Huber (1964), Jaeckel (1971) for more detail).
6.6.1 Minimaximally robust M-, L- and R-estimators

Consider the contamination model (6.21), where G is a symmetric unimodal distribution function with a twice differentiable density g such that $-\log g(x)$ is convex in x; let H run over the symmetric distribution functions. Denote $\mathcal{F}_1$ the family of such distribution functions. Let $T(F)$ be the M-functional defined as a root of the equation $\int_{\mathbb{R}} \psi(x - T(F))\,dF(x) = 0$; then
\[ V_{as}(F, T) = \frac{\int_{\mathbb{R}} \psi^2(x - T(F))\,dF(x)}{\Big(\int_{\mathbb{R}} \psi'(x - T(F))\,dF(x)\Big)^2} \ \ge\ \frac{1}{I(F)}. \]
The least favorable distribution of the family $\mathcal{F}_1$ has the density (see Huber (1964))
\[ f_0(x) = \begin{cases} (1-\varepsilon)\,g(x_0)\,e^{k(x - x_0)} & \ldots\ x \le x_0 \\ (1-\varepsilon)\,g(x) & \ldots\ x_0 \le x \le x_1 \\ (1-\varepsilon)\,g(x_1)\,e^{-k(x - x_1)} & \ldots\ x \ge x_1 \end{cases} \qquad (6.24) \]
where
\[ x_0 = -x_1 = \inf\Big\{x : -\frac{g'(x)}{g(x)} \ge -k\Big\} \]
and $k > 0$ is determined by the relation
\[ \frac{2}{k}\,g(x_1) + \int_{x_0}^{x_1} g(x)\,dx = \frac{1}{1 - \varepsilon}. \]
$T_n$ is the maximum likelihood estimator for the distribution $f_0$, hence the M-estimator generated by the function
\[ \psi_0(x) = -\frac{f_0'(x)}{f_0(x)} = \begin{cases} -k & \ldots\ x \le x_0 \\[2pt] -\dfrac{g'(x)}{g(x)} & \ldots\ x_0 < x < x_1 \\[4pt] k & \ldots\ x \ge x_1. \end{cases} \]
We see from the asymptotic relations of Section 6.5 that minimaximally robust L- and R-estimators also exist; especially, the minimaximally robust L-estimator is generated by the weight function
\[ J_0(u) = \frac{1}{I(F_0)}\,\psi_0'(F_0^{-1}(u)), \qquad 0 < u < 1, \]
and the minimaximally robust R-estimator is generated by the score function
\[ \varphi_0(u) = \psi_0(F_0^{-1}(u)), \qquad 0 < u < 1. \]
An important special case is the minimaximally robust estimator in the model of contaminated normal distributions: put $G \equiv \Phi$, the N(0, 1) distribution function, into the model (6.21). Then the least favorable distribution has the density
\[ f_0(x) = \begin{cases} \dfrac{1-\varepsilon}{\sqrt{2\pi}}\,e^{-x^2/2} & \ldots\ |x| \le k \\[6pt] \dfrac{1-\varepsilon}{\sqrt{2\pi}}\,e^{k^2/2 - k|x|} & \ldots\ |x| > k \end{cases} \qquad (6.25) \]
and hence it is normal in the central part $[-k, k]$ and exponential outside. The corresponding maximum likelihood score function is
\[ \psi_0(x) = -\frac{f_0'(x)}{f_0(x)} = \begin{cases} x & \ldots\ |x| \le k \\ k\,\operatorname{sign} x & \ldots\ |x| > k, \end{cases} \]
that is, the well-known Huber function. The constant $k > 0$ is determined by the relation
\[ \frac{2\,\Phi'(k)}{k} + 2\,\Phi(k) - 1 = \frac{1}{1 - \varepsilon}. \]
The minimaximally robust M-estimator for the contaminated normal distribution is generated by the function $\psi_0$, and it coincides with the maximum likelihood estimator corresponding to the density $f_0$. The minimaximally robust L-estimator is generated by the weight function $J_0$ that must satisfy
\[ J_0(F_0(x)) = \frac{1}{I(F_0)}\,I[-k \le x \le k], \qquad x \in \mathbb{R}, \]
and hence equals
\[ J_0(u) = \frac{1}{I(F_0)}\,I\big[F_0(-k) \le u \le F_0(k)\big], \qquad 0 < u < 1. \]
The corresponding L-estimator is the $\alpha$-trimmed mean, where $\alpha = F_0(-k)$. Similarly, the minimaximally robust R-estimator for the contaminated normal distribution is generated by the score function $\varphi_0(u) = \psi_0\big(F_0^{-1}(u)\big)$, $0 < u < 1$.
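For a given contamination level $\varepsilon$, the cutoff k of the Huber function can be obtained numerically from the relation above; a minimal sketch using uniroot:

huber.k <- function(eps)
  uniroot(function(k) 2 * dnorm(k) / k + 2 * pnorm(k) - 1 - 1 / (1 - eps),
          c(1e-4, 10))$root
sapply(c(0.01, 0.05, 0.10), huber.k)   # roughly 1.95, 1.40 and 1.14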
6.7 Problems and complements

6.1 Let $X_1, \ldots, X_n$ have distribution function $F(x - \theta)$, where F is symmetric around 0. Let $T_n$ be Huber's M-estimator generated by the $\psi$-function (3.16) with the boundary k. Then $\sqrt{n}\,(T_n - \theta)$ has asymptotically normal distribution $N(0, \sigma^2(k, F))$, where
\[ \sigma^2(k, F) = \frac{\int_{-k}^{k} x^2\,dF(x) + 2k^2\,(1 - F(k))}{\big(F(k) - F(-k)\big)^2}. \]

6.2 Let $\mathcal{F}$ be the family of densities
\[ \mathcal{F} = \Big\{f : f(x) = \frac1s\,f_0\Big(\frac{x}{s}\Big),\ s > 0\Big\}, \]
where $f_0$ is a fixed density (but generally unknown) such that
\[ f_0(x) = f_0(-x), \ x \in \mathbb{R}, \qquad f_0(0) = 1, \qquad \int x^2 f_0(x)\,dx < \infty, \]
and $f_0$ has a bounded derivative in a neighborhood of 0. Consider the linear regression model $Y_i = \mathbf{x}_i^\top\boldsymbol\beta + e_i$, $i = 1, \ldots, n$, with $\boldsymbol\beta \in \mathbb{R}^p$, where $e_1, \ldots, e_n$ are independent errors with a joint density $f \in \mathcal{F}$, but unknown. The solution $T_n(\delta)$ of the minimization
\[ \sum_{i=1}^n \rho\Big(\frac{Y_i - \mathbf{x}_i^\top\mathbf{b}}{\hat s_n}\Big) := \min, \qquad \mathbf{b} \in \mathbb{R}^p, \]
where $\rho(x) = (1 - \delta)x^2 + \delta|x|$, $x \in \mathbb{R}$, and $\hat s_n$ is an estimator of s, is an M-estimator that can be considered a mixture of the LSE and $L_1$ estimators of $\boldsymbol\beta$. For each $f \in \mathcal{F}$ there exists an optimal value $\delta_f \in [0, 1]$, leading to the minimal asymptotic variance of $T_n(\delta)$, $0 \le \delta \le 1$. The optimal $\delta_f$ can be estimated consistently by an estimator that depends only on $\hat s_n$ and on the first two sample moments (Dodge and Jurečková (2000)). As estimators of $s = \frac{1}{f(0)}$, we can use either (4.64) or (4.65).
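Returning to Problem 6.1, the asymptotic variance $\sigma^2(k, F)$ is easy to evaluate numerically once a density is chosen; the following sketch assumes F has density fdens and distribution function pdist (here the standard normal, an illustrative choice only):

huber.avar <- function(k, fdens = dnorm, pdist = pnorm) {
  num <- integrate(function(x) x^2 * fdens(x), -k, k)$value + 2 * k^2 * (1 - pdist(k))
  den <- (pdist(k) - pdist(-k))^2
  num / den
}
huber.avar(1.345)   # about 1.05 at the standard normal (roughly 95% efficiency)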
6.3 Let X1 , . . . , Xn be a sample from a population with distribution function
$F(x - \theta)$, F symmetric around 0. The $\alpha$-trimmed mean $T_n(\alpha)$, defined in Example 3.2, is asymptotically normally distributed,
\[ \sqrt{n}\,(T_n(\alpha) - \theta) \xrightarrow{\ \mathcal{L}\ } N(0, \sigma^2(\alpha, F)), \]
where
\[ \sigma^2(\alpha, F) = \frac{1}{(1 - 2\alpha)^2}\Big[\int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} x^2\,dF(x) + 2\alpha\big(F^{-1}(1-\alpha)\big)^2\Big]. \]
Let $\alpha_F$ be a value minimizing $\sigma^2(\alpha, F)$; it is unknown whenever F is unknown. The estimation of $\alpha_F$ (adaptive choice of the trimmed mean) was considered by Tukey and McLaughlin (1963), Jaeckel (1971), and Hall (1981).

6.4 The $\alpha$-trimmed mean is asymptotically efficient for f in the Bahadur sense if and only if the central $100(1 - 2\alpha)\%$ part of the distribution F is normal, while the distribution has $100\alpha\%$ exponential tails on both sides (Fu (1980)). This can be explained so that the problem of estimating $\alpha$ is equivalent to estimating the proportion of nonnormality from the observations.
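The variance $\sigma^2(\alpha, F)$ from Problem 6.3 and the minimizing value $\alpha_F$ can be approximated numerically. A minimal sketch for an illustrative contaminated normal F (my own choice, 0.9 N(0,1) + 0.1 N(0,9)), using the symmetry of F so that $F^{-1}(\alpha) = -F^{-1}(1-\alpha)$:

f.cont <- function(x) 0.9 * dnorm(x) + 0.1 * dnorm(x, sd = 3)
F.cont <- function(x) 0.9 * pnorm(x) + 0.1 * pnorm(x, sd = 3)
Finv   <- function(p) uniroot(function(x) F.cont(x) - p, c(-50, 50))$root
avar.trim <- function(alpha) {
  q <- Finv(1 - alpha)                      # upper trimming point; lower one is -q by symmetry
  (integrate(function(x) x^2 * f.cont(x), -q, q)$value + 2 * alpha * q^2) / (1 - 2 * alpha)^2
}
optimize(avar.trim, c(0.01, 0.45))          # numerical approximation of alpha_F and sigma^2(alpha_F, F)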
CHAPTER 7
Some goodness-of-fit tests
7.1 Introduction

Robust estimators are constructed in such a way that they are insensitive to small deviations from the assumed distribution of the model errors; for instance, the Huber estimator of the location or regression parameters is minimaximally robust over a family of contaminated normal distributions. Before using a robust estimator, likely to operate well in a neighborhood of some distribution, we may try to verify a hypothesis on the shape of this distribution; in other words, we can start with a suitable goodness-of-fit test. The χ² and the Kolmogorov-Smirnov tests are probably the best known; many other tests can be found in the literature (e.g., Huber-Carol et al. (2002)). These tests work well in the simplest situation, when our observations $Y_1, \ldots, Y_n$ are independent and identically distributed with a distribution function F and we want to verify the hypothesis $H_0 : F \equiv F_0$, where $F_0$ is a fully specified distribution function. However, the hypothetical distribution function $F_0$ is often specified only up to several unknown parameters, e.g., up to the location, scale or regression parameters. This is a typical situation: our observations can follow a linear regression model whose parameters we want to estimate by a suitable robust estimator, and an approximate knowledge of the shape of the distribution of the errors would lead to a good choice of the score function. This situation is more realistic, but the standard goodness-of-fit tests then lose their simplicity. Taking these facts into account, we want to offer some goodness-of-fit tests on the shape of the distribution in the presence of nuisance regression and scale parameters.
7.2 Tests of normality of the Shapiro-Wilk type with nuisance regression and scale parameters

If the distribution seems to have a symmetric unimodal density, then the first natural idea is to test for its normality. A highly intuitive goodness-of-fit test
of normality with nuisance location and scale parameters was proposed by Shapiro and Wilk (1965). Their test has received considerable attention in the literature; its asymptotic null distribution was later studied by de Wet and Venter (1973), and recently by Sen (2002). Because we often also have a nuisance regression, we shall describe an extension of the Shapiro-Wilk test of normality to the situation with nuisance regression and scale parameters, constructed by Sen, Jurečková and Picek (2003). Their test is based on a pair of estimators of the standard deviation of the errors in the linear regression model, namely on the maximum likelihood estimator and on an L-estimator. Similarly to the Shapiro-Wilk test, the asymptotic equivalence of these estimators is a characteristic property of the normal distribution of the errors, i.e., it is true only under normality, and thus provides a test.

Let $Y_1, \ldots, Y_n$ be independent observations following the linear model
\[ Y_i = \theta + \mathbf{x}_i^\top\boldsymbol\beta + \sigma e_i, \qquad i = 1, \ldots, n, \qquad (7.1) \]
where $\mathbf{x}_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, are given regressors, not all equal, $\theta \in \mathbb{R}^1$, $\boldsymbol\beta \in \mathbb{R}^p$ and $\sigma > 0$ are unknown intercept, regression and scale parameters, and the errors $e_i$ are independent and identically distributed according to a continuous distribution function F with location 0 and scale parameter 1. We want to test the hypothesis
\[ H_0 : F \equiv \Phi \qquad \text{against} \qquad H_1 : F \equiv F_1 \neq \Phi, \qquad (7.2) \]
where $\Phi$ is the standard normal distribution function, $F_1$ is a general non-normal distribution function, and $\theta$, $\boldsymbol\beta$, and $\sigma$ are treated as nuisance parameters. For the special location-scale model (i.e., when $\boldsymbol\beta = \mathbf{0}$), Shapiro and Wilk (1965) proposed a goodness-of-fit test based on two estimators of $\sigma$: $L_n$, the BLUE (best linear unbiased estimator) under $H_0$, and $\hat\sigma_n$, the maximum likelihood estimator (MLE) under $H_0$. Suppose that $Y_1, \ldots, Y_n$ are i.i.d. observations with the distribution $N(\mu, \sigma^2)$. Then the MLE of $\sigma^2$ is $\hat\sigma_n^2$, where
\[ \hat\sigma_n^2 = n^{-1}\sum_{i=1}^n (Y_i - \bar Y_n)^2. \qquad (7.3) \]
The best linear unbiased estimator (BLUE) $L_n$ of $\sigma$ has the form
\[ L_n = \sum_{i=1}^n a_{ni}\,Y_{n:i}, \qquad (7.4) \]
where
\[ \mathbf{a}_n = (a_{n1}, \ldots, a_{nn})^\top = (\mathbf{M}_n^\top\mathbf{V}_n^{-1}\mathbf{M}_n)^{-1}(\mathbf{M}_n^\top\mathbf{V}_n^{-1}), \qquad \mathbf{a}_n^\top\mathbf{1}_n = 0, \qquad (7.5) \]
and where $\mathbf{M}_n = \mathbf{M}$ denotes the vector of expected values of the order statistics and $\mathbf{V}_n = \mathbf{V}$ is their variance matrix. Shapiro and Wilk (1965) modified the BLUE of $\sigma$ to $L_{n0} = \sum_{i=1}^n a_{ni,0}\,Y_{n:i}$, where $\mathbf{a}_{n0} = (a_{n1,0}, \ldots, a_{nn,0})^\top$ is such that
\[ \mathbf{a}_{n0} = \frac{\mathbf{M}^\top\mathbf{V}^{-1}}{(\mathbf{M}^\top\mathbf{V}^{-1}\mathbf{V}^{-1}\mathbf{M})^{1/2}}; \qquad (7.6) \]
then $\mathbf{a}_{n0}^\top\mathbf{1}_n = 0$ and $\mathbf{a}_{n0}^\top\mathbf{a}_{n0} = 1$. $L_{n0}$ is asymptotically equivalent to
\[ T_n = \frac1n \sum_{i=1}^n \Phi^{-1}\Big(\frac{i}{n+1}\Big) Y_{n:i} \qquad (7.7) \]
(see, e.g., Serfling (1980)). Let us write the Shapiro-Wilk criterion in the form
\[ W_n = n\Big(1 - \frac{L_{n0}^2}{\hat\sigma_n^2}\Big). \qquad (7.8) \]
The two scale estimators $L_{n0}$ and $\hat\sigma_n$ are asymptotically equivalent if and only if $F \equiv \Phi$, i.e., if the hypothesis of normality is true, while under a non-normal alternative $F_1$ with a finite second moment the sequence $\sqrt{n}\,\big(1 - L_{n0}^2/\hat\sigma_n^2\big)$ has a nondegenerate asymptotic (normal) distribution. It means that the test criterion is consistent with respect to the non-normal alternatives.

We propose a goodness-of-fit test of the hypothesis (7.2) of normality, based on the observations $Y_1, \ldots, Y_n$ following the linear regression model (7.1) with unknown $\theta$, $\boldsymbol\beta$ and $\sigma$. The test criterion is
\[ \widetilde W_n = n\Big(1 - \frac{\hat L_n^2}{\hat s_n^2}\Big), \qquad (7.9) \]
xi )2 (Yi − Y¯n − β n i=1 n
sˆ2n = ˆn = L
n
a0ni rn:i
(7.10)
i=1
are the residual variance and the linear estimator of σ with ani,0 , i = 1, . . . , n defined in (7.6) and the rn:i are the order statistics corresponding to the residuals
xi , i = 1, . . . , n rni = Yi − Y¯n − β (7.11) We assume that the n × p matrix Xn = [x1 , . . . , xn ] satisfies X 1n = 0,
Rank(Xn ) = p < n − 1
and
(7.12) max hn,ii = O(n
1≤i≤n
−1
) as n → ∞ (the balanced design)
where hn,ij = xi (X X)−1 xj , i, j = 1, . . . , n. The MLE of parameters θ, β, σ
158
SOME GOODNESS-OF-FIT TESTS
under normal Φ have the form ¯n , e ¯n = n−1 1n en θˆn = Y¯n = n−1 1n Yn = θ + e
= (X Xn )−1 X Yn = β + σ(X Xn )−1 en β n n n n σ ˆn2 = n−1
n
(7.13)
)2 (Yi − θˆn − xi β n
i=1
Sen, Jureˇckov´ a and Picek (2003) proved that the asymptotic null distribution of Wn coincides with that of Wn ; hence, the test rejects the hypothesis of the normality on the asymptotic significance level α provided n ≥ τα W
(7.14)
where τα is the asymptotic critical value of the Shapiro-Wilk test of normality with nuisance location and scale. The coefficients ani,0 , i = 1, . . . , n and the critical values of the original Shapiro-Wilk test for n ≤ 50 are tabulated in Shapiro and Wilk (1965). The critical values for n > 50 we have calculated by a Monte Carlo procedure.
7.3 Goodness-of-fit tests for general distribution with nuisance regression and scale Consider again the linear model (7.1), but this time we have another distribution of errors e1 , . . . , en , in mind, and we want to test the hypothesis H0 : F (e) ≡ F0 (e/σ), for a specified distribution function F0 . This is not necessarily normal, but for simplicity we assume that F0 ∈ F, the class of distribution functions is symmetric around 0 and possessing a positive density, finite variance and finite Fisher’s information. Similarly as in the test of normality, our proposed test of H0 is based on the ratio of two scale statistics: the first one is based on regression rank scores a ˆni (α), i = 1, . . . , n, 0 ≤ α ≤ 1, introduced in Section 4.7, and the second one is an extension of the interquartile range to the regression quantiles. If F0 ∈ F, we may choose the score generating function ϕ0 (u) = F0−1 (u) or −f0 (F0−1 (u))/f0 (F0−1 (u)), 0 < u < 1 (if F0 is strongly unimodal with finite Fisher information), because F (e) = F0 (e/σ), F −1 (u) = σF0−1 (u) under H0 . Our proposed test is based on the statistic 1 Sn0 ∗ Tn = n 2 log − log ξ(F0 ) (7.15) Sn1 where Sn0 = Sn0 (Y) = n−1
n
ˆn Yiˆbni = n−1 Y b
(7.16)
i=1
ˆ n = (ˆbn1 , . . . , ˆbnn ) are the regression scores generated by ϕ0 in the and b
NUISANCE REGRESSION AND SCALE following way:
159
1
ˆbni = −
ϕ0 (u)dˆ ani (u),
i = 1, . . . , n
(7.17)
0
The second scale statistic Sn1 will be based on the regression quantiles; note that the regression quantiles and regression rank scores are asymptotically independent. For simplicity, we recommend the regression interquartile range, Sn1 = βˆ1 ( 34 ) − βˆ1 ( 14 )
(7.18)
where βˆ1 (α) is the first component of the α-regression quantile. Moreover, denote
1 S0 (F ) = ϕ0 (u)F −1 (u)du 0
S1 (F ) = F −1 ( 34 ) − F −1 ( 14 ) = 2F −1 ( 34 ) ξ(F ) =
S0 (F ) , S1 (F )
(7.19)
F ∈F
0) Note that ξ(F0 ) = SS01 (F (F0 ) is a completely known function, because it depends on the chosen ϕ0 and on the hypothetical F0 , and does not depend on σ.
Jureˇckov´ a, Picek and Sen (2003) proved that the asymptotic (null) distribution of the criterion Tn∗ under the hypothesis H0 is normal ∗2 D (7.20) Tn∗ −→ N 0, τ01 where ∗2 τ01 =
1 0 0 2 0 γ − 2ξ(F )γ + ξ (F )γ 0 0 00 01 11 S02 (F0 )
(7.21)
where 0 γ00 = 14 (µ4 − µ22 )
0 0 = γ10 γ01
0 γ11 =
⎞ ⎛
F0−1 ( 3 ) 4 −1 ⎝ = e2 dF0 (e) − 12 µ2 ⎠ 1 2f0 (F0−1 ( 34 )) F0−1 ( 4 )
q11 2 4f0 (F0−1 ( 34 ))
(7.22)
µ2 =
R
e2 dF0 (e),
µ4 =
R
e4 dF0 (e)
q11 is the first diagonal element of the matrix D−1 where D = limn→∞ Dn = limn→∞ n−1 X X. ∗ Then τ01 does not depend on σ and is positive unless happens with probability 0).
Sn0 Sn1
≡ ξ(F0 ) (what
160
SOME GOODNESS-OF-FIT TESTS
We are almost ready to formulate the critical region of the test; however, we should think over alternative distributions F against which we wish to have the test consistent. We shall introduce the following one- and two-sided alternatives of H0 . For a pair (F0 , F ) of distributions, let A(F0 , F ) = S0 (F )
S1 (F0 ) − S0 (F0 ) S1 (F )
(7.23)
and set the partial ordering F F0 or F ≺ F0 accordingly A(F0 , F ) is > or < 0 This partial ordering is linked to H´ ajek’s (1969) interpretation of F having heavier or lighter tails than F0 . Consider the following alternatives to H0 : H 1 : F F0 ,
H≺ 1 : F ≺ F0 ,
≺ H= 1 : H1 ∪ H1
Then • we reject H0 in favor of H 1 on the asymptotic significance level α if Tn∗ ∗ ≥ uα τ01 • we reject H0 in favor of H≺ 1 on the asymptotic significance level α if Tn∗ ∗ ≤ −uα τ01 • we reject H0 in favor of H= 1 on the asymptotic significance level α if ∗ Tn τ ∗ ≥ u α2 01 where uα = Φ−1 (1 − α) and Φ is the standard normal distribution function.
7.4 Numerical illustration 7.4.1 Comparison of tests for testing normality Let us illustrate the performance of the proposed test on the simulated regression model. Concerning the design matrix, we generate three columns as independent, identically-distributed random variables with uniform distribution on (−10, 10) with the first column 1n added; β = (2, −2, 1, −1) and consider 25 rows for it. The errors were generated from the following densities:
NUMERICAL ILLUSTRATION
161 2 − x2
normal N (0, 1) :
f (x) =
√1 e 2π
normal N (0, 4) :
f (x) =
x √1 e− 32 4 2π
logistic (0, 1) :
f (x) =
e−x (1+e−x )2
logistic (0, 4) :
f (x) =
e−x/4 (1+e−x/4 )2
Laplace (0, 1) :
f (x) = 12 e−|x|
Laplace (0, 4) :
f (x) = 42 e−4|x|
Cauchy:
f (x) =
2
1 π(1+x2 ) .
In order to gain insight into larger sample size behavior for our proposed tests, we also generate the design matrix of 100, 250 and 500 rows, respectively; in each case errors ei are generated to insure independence. 1000 replications were simulated for each case. Based on these data, we calculated the test statistics n 2 ˆ 2n ( i=1 ani,0 rn:i ) L n 2 Wn = n 1 − 2 = n 1 − (7.24) sˆn i=1 rni where
rni = Dn−1/2 Yi − θˆn − xi β n
and an0 = (an1,0 , . . . , ann,0 ) =
M V−1 (M V−1 V−1 M)1/2
(7.25)
Because the asymptotic null distributions of the test statistics of the Shapiron are not known for n > 50, they were approximated by the Wilk type test W following simple Monte Carlo procedure: For a fixed n, a random sample of size n from the normal distribution was n was computed, and this random experiment was repeated generated and W 100, 000 times. For the sake of comparison, the nonparametric test of Section 7.3 was performed on the same data for testing the normality. Tables 7.1–7.4 give the numbers of rejections of H0 (among 1000 tests) for both statistics described above.
Table 7.1 Numbers of rejections of H0 among 1000 cases on level α for matrix (25×4).

                          α=0.01          α=0.05          α=0.1
Distribution of errors   W̃n    Tn*      W̃n    Tn*      W̃n    Tn*
Normal N(0,1)             21     13       42     61      105    142
Normal N(0,4)              9     12       54     57       90    118
Logistic (0,1)            42     22      136     76      175    139
Logistic (0,4)            42     19      131     76      182    130
Laplace (0,1)            141     56      275    127      365    212
Laplace (0,4)            148     54      255    120      320    216
Cauchy (0,1)             838    624      900    720      920    765
Table 7.2 Numbers of rejections of H0 among 1000 cases on level α for matrix (100×4).

                          α=0.01          α=0.05          α=0.1
Distribution of errors   W̃n    Tn*      W̃n    Tn*      W̃n    Tn*
Normal N(0,1)              8     12       46     60      104    143
Normal N(0,4)              8     12       47     57       96    119
Logistic (0,1)            51     22      119     76      174    137
Logistic (0,4)            44     18      112     76      158    131
Laplace (0,1)            333     63      499    127      581    210
Laplace (0,4)            354     60      539    125      616    224
Cauchy (0,1)            1000    620     1000    730     1000    773
Table 7.3 Numbers of rejections of H0 among 1000 cases on level α for matrix (250×4).

                          α=0.01          α=0.05          α=0.1
Distribution of errors   W̃n    Tn*      W̃n    Tn*      W̃n    Tn*
Normal N(0,1)             10      9       49     56       96    107
Normal N(0,4)             10     13       50     57      100    113
Logistic (0,1)            52     88       96    206      147    302
Logistic (0,4)            44     86       98    197      146    289
Laplace (0,1)            546    669      721    836      782    890
Laplace (0,4)            563    687      719    832      796    891
Cauchy (0,1)            1000   1000     1000   1000     1000   1000
Table 7.4 Numbers of rejections of H0 among 1000 cases on level α for matrix (500×4).

                          α=0.01          α=0.05          α=0.1
Distribution of errors   W̃n    Tn*      W̃n    Tn*      W̃n    Tn*
Normal N(0,1)             10     15       56     53      102    107
Normal N(0,4)              9     10       56     51      100    101
Logistic (0,1)            56    199       96    389      142    495
Logistic (0,4)            49    185       95    369      136    489
Laplace (0,1)            869    967      937    991      960    995
Laplace (0,4)            856    963      926    990      939    995
Cauchy (0,1)            1000   1000     1000   1000     1000   1000
7.4.2 Testing for nonnormal distributions
We used the test in Section 7.3 for verifying the following three null hypotheses:
(i) H0 : F ≡ logistic. Logistic scores for $S_{n0}$: $\varphi_0(u) = \log u - \log(1-u)$, $0 < u < 1$.

(ii) H0 : F ≡ normal. Normal scores for $S_{n0}$: $\varphi_0(u) = \Phi^{-1}(u)$, $0 < u < 1$, where $\Phi$ is the standard normal distribution function.

(iii) H0 : F ≡ Laplace. Laplace scores for $S_{n0}$:
\[ \varphi_0(u) = \begin{cases} \log 2u & 0 < u < 0.5 \\ -\log\big(2(1-u)\big) & u \ge 0.5. \end{cases} \]
The errors were generated by sampling from the hypothetical F0 (normal, logistic, Laplace), and from the following alternative densities:

normal N(0, 1):   $f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$
normal N(0, 4):   $f(x) = \frac{1}{4\sqrt{2\pi}}\,e^{-x^2/32}$
logistic (0, 1):  $f(x) = \frac{e^{-x}}{(1+e^{-x})^2}$
logistic (0, 4):  $f(x) = \frac{e^{-x/4}}{4\,(1+e^{-x/4})^2}$
Laplace (0, 1):   $f(x) = \frac12\,e^{-|x|}$
Laplace (0, 4):   $f(x) = \frac18\,e^{-|x|/4}$
Cauchy:           $f(x) = \frac{1}{\pi(1+x^2)}$
1000 replications were simulated for each case. Based on these data, we calculated the test statistic
\[ T_n^* = n^{1/2}\Big(\log\frac{S_{n0}}{S_{n1}} - \log\xi(F_0)\Big) \]
for the hypothesis $H_0 : F \equiv F_0$ ($\boldsymbol\beta$, $\sigma$ unspecified). We took the regression interquartile range $\hat\beta_1(\tfrac34) - \hat\beta_1(\tfrac14)$ in the role of $S_{n1}$. Tables 7.5–7.7 give the numbers of rejections of $H_0$ (among 1000 tests) for all the above cases.
(i) H0 : F ≡ logistic (0,1) (i.e., we used logistic scores)

Table 7.5 Numbers of rejections of H0 among 1000 cases on level α for matrix (25×4) and for matrix (100×3).

                              n=27                      n=108
Distribution       α=0.01  α=0.05  α=0.1     α=0.01  α=0.05  α=0.1
Normal N(0,1)          19     124    213         45     163    279
Normal N(0,4)          18      92    197         45     187    286
Logistic (0,1)         24      93    169         12      61    127
Logistic (0,4)         18      86    181         10      62    114
Laplace (0,1)          21      87    149         84     199    286
Laplace (0,4)          22     100    169         88     210    298
Cauchy (0,1)          513     632    687        999    1000    999
(ii) H0 : F ≡ normal (0,1) (i.e., we used van der Waerden scores)

Table 7.6 Numbers of rejections of H0 among 1000 cases on level α for matrix (25×4) and for matrix (100×3).

                              n=27                      n=108
Distribution       α=0.01  α=0.05  α=0.1     α=0.01  α=0.05  α=0.1
Normal N(0,1)          11      60    142         12      52    116
Normal N(0,4)          11      59    129         11      53    112
Logistic (0,1)         22      77    136         40     116    183
Logistic (0,4)         18      77    132         38     122    192
Laplace (0,1)          55     129    212        337     516    611
Laplace (0,4)          53     124    216        331     509    607
Cauchy (0,1)          626     722    765       1000    1000   1000
(iii) H0 : F ≡ Laplace (0,1) (i.e., we used Laplace scores)

Table 7.7 Numbers of rejections of H0 among 1000 cases on level α for matrix (25×4) and for matrix (100×3).

                              n=27                      n=108
Distribution       α=0.01  α=0.05  α=0.1     α=0.01  α=0.05  α=0.1
Normal N(0,1)          43     259    424        366     689    806
Normal N(0,4)          42     251    378        365     687    802
Logistic (0,1)         37     189    319        153     378    516
Logistic (0,4)         44     203    337        144     375    511
Laplace (0,1)          19     121    189         11      69    135
Laplace (0,4)          24     131    231          9      73    145
Cauchy (0,1)          342     452    515        971     991   1000
7.5 Computation and software notes

R provides classical tests of goodness-of-fit. Functions shapiro.test and ks.test compute the Shapiro-Wilk and Kolmogorov-Smirnov tests.

> n.dist <- rnorm(100, mean=-5, sd=3)
> shapiro.test(n.dist)

        Shapiro-Wilk normality test

data:  n.dist
W = 0.9866, p-value = 0.4103

> ks.test(n.dist, "pnorm")

        One-sample Kolmogorov-Smirnov test

data:  n.dist
D = 0.8436, p-value < 2.2e-16
alternative hypothesis: two.sided

> ks.test(n.dist, "pnorm", mean(n.dist), sd(n.dist))

        One-sample Kolmogorov-Smirnov test
data: n.dist D = 0.0406, p-value = 0.9965 alternative hypothesis: two.sided > > ### Exponential distribution ### > n.exp<-rexp(100) > shapiro.test(n.exp) Shapiro-Wilk normality test data: n.exp W = 0.8352, p-value = 3.489e-09 > ks.test(n.exp,"pnorm",mean(n.exp),sd(n.exp)) One-sample Kolmogorov-Smirnov test data: n.exp D = 0.1868, p-value = 0.001859 alternative hypothesis: two.sided > ks.test(n.exp,"pexp") One-sample Kolmogorov-Smirnov test data: n.exp D = 0.0807, p-value = 0.5333 alternative hypothesis: two.sided The goodness-of-fit tests for general distribution with nuisance regression and scale described in the previous sections can be implemented in R with the help of the package quantreg. "jps.goodness.test"<- function(y, x, score ="logis") { aux <- match.call(expand.dots = FALSE) am <- match(c("formula", "data", "weights", "na.action"), names(aux), 0) aux <- aux[c(1, am)] aux$drop.unused.levels <- TRUE aux[[1]] <- as.name("model.frame") aux <- eval.parent(aux) at <- attr(aux, "terms") y <- model.response(aux) x <- model.matrix(at, aux, contrasts)
n<-length(y) sn0 <-sum(rranks(y~x, score=score)$ranks*y)/n sn1 <- rq(formula, tau=0.75)$coef[1]rq(formula, tau=0.25)$coef[1] Qinv<-solve(1/n*t(x)%*%x) if(score == "logis") { S0 <- pi^2/3 qv <- qlogis(0.75) S1 <- 2*qv g00 <- 4/45*pi^4 g11 <- Qinv[1,1]/(2*dlogis(qv))^2 g01 <- 3.89034 } else if(score == "normal") { S0 <- 1 qv <- qnorm(0.75) S1 <- 2*qv g00 <- 0.5 g11 <- Qinv[1,1]/(2*dnorm(qv))^2 g01 <- 0.67449 } else if(score == "laplace") { S0 <- 2 qv <- -log(0.5) S1 <- 2*qv g00 <- 5 g11 <- Qinv[1,1]/(exp(-abs(qv)))^2 g01 <- 1.86675 } else stop("invalid score function") ksi<-S0/S1 tau<-sqrt((g00+g11*ksi^2-2*ksi*g01))/S0 Tn<-(log(sn0/sn1)-log(ksi))/tau*sqrt(n) names(Tn) <- NULL RVAL <- list(statistic = Tn, p.value = 2*(1-pnorm(abs(Tn)))) return(RVAL) } The function rq() enables the computation of the regression quantiles. The function rranks() is an extension of the function ranks(), which computes the ranks from the regression rankscore process. rranks<- function(formula, score = "wilcoxon") { v <- rq(formula, tau=-1) if(score == "wilcoxon") {
COMPUTATION AND SOFTWARE NOTES J <- ncol(v$sol) dt <- v$sol[1, 2:J] - v$sol[1, 1:(J - 1)] ranks <- as.vector((0.5 * (v$dsol[, 2:J] + v$dsol[, 1:(J - 1)]) %*% + dt) - 0.5) A2 <- 1/12 return(list(ranks = ranks, A2 = A2)) } else if(score == "normal") { J <- ncol(v$sol) dt <- v$sol[1, 2:J] - v$sol[1, 1:(J - 1)] dphi <- c(0, phi(qnorm(v$sol[1, 2:(J - 1)])), 0) dphi <- diff(dphi) ranks <- as.vector((((v$dsol[, 2:J] v$dsol[, 1:(J - 1)]))) %*% (dphi/dt)) A2 <- 1 return(list(ranks = ranks, A2 = A2)) } else if(score == "logis") { J <- ncol(v$sol) dt <- v$sol[1, 2:J] - v$sol[1, 1:(J - 1)] dphi <- c(0,-log(1-v$sol[1, 2:(J-1)])v$sol[1, 2:(J-1)]*(log(v$sol[1, 2:(J-1)]) -log(1-v$sol[1, 2:(J-1)])),0) dphi <- diff(dphi) ranks <- as.vector((((v$dsol[, 2:J] v$dsol[, 1:(J - 1)]))) %*% (dphi/dt)) A2 <- 3.28987 return(list(ranks = ranks, A2 = A2)) } else if(score == "laplace") { J <- ncol(v$sol) v2 <- rq(x, y, dual = T, ci = F, int = F,tau=0.5) aft<-length(v$sol[1,v$sol[1,]<=0.5]) v$sol<-cbind(v$sol[,1:aft],c(0.5,0,v2$coef), v$sol[,(aft+1):J]) v$dsol<-cbind(v$dsol[,1:aft],v2$dual,v$dsol[,(aft+1):J]) J <- J+1 dt <- v$sol[1, 2:J] - v$sol[1, 1:(J - 1)] dphi <- c(0,-inv.lap(v$sol[1, 2:(J-1)]),-1) dphi <- diff(dphi) dphi[(aft+1)]<- -inv.lap(v$sol[1,(aft+2)])+0.5 ranks <- as.vector((((v$dsol[, 2:J] v$dsol[, 1:(J - 1)]))) %*% (dphi/dt)) A2 <- 2 return(list(ranks = ranks, A2 = A2))
} else if (score == "sign") { j.5 <- sum(v$sol[1, ] < 0.5) w <- (0.5 - v$sol[1, j.5])/(v$sol[1, j.5 + 1] - v$sol[1, j.5]) r <- w * v$dsol[, j.5 + 1] + (1 - w) * v$dsol[, j.5] ranks <- 2 * r - 1 A2<-1 return(list(ranks = ranks, A2 = A2)) } else stop("invalid score function") }
Let us illustrate the proposed test on the real regression model with the Mayer matrix (real physical measurements, see Mayer (1750)) > mayer.x [,1] [1,] -0.8836
[,2] 0.4682
... ... [27,] -0.9284 0.3716 > > mayer.y [1] 13.17 13.13 13.20 ... ... [25] 15.63 14.90 13.12 > > jps.goodness.test(mayer.y~1+mayer.x, score ="normal") $statistic [1] -1.814286 $p.value [1] 0.06963365 > Test of normality of the Shapiro-Wilk type with nuisance regression and scale parameters is also possible to implement and compute in R. Let us illustrate it again on the real regression model with the Mayer matrix:
> ls.out <- summary(lm(mayer.y~1+mayer.x))
> hn <- mayer.x%*%solve(t(mayer.x)%*%mayer.x)%*%t(mayer.x)
> n <- length(mayer.y)
> dn <- diag((1-1/n-diag(hn))^(-0.5))
> res <- dn%*%ls.out$res
Because the coefficients $(a_1, \ldots, a_n)$ are not known, they were approximated in the following way, as suggested in Shapiro and Wilk (1965): we calculated
\[ \hat a_i^* = 2 M_i, \qquad i = 2, 3, \ldots, n-1, \]
and
\[ \hat a_1^2 = \hat a_n^2 = \frac{\Gamma\big(\tfrac12(n+1)\big)}{\sqrt{2}\,\Gamma\big(\tfrac12 n + 1\big)}. \]
The coefficients $\hat a_1, \hat a_n$ are directly usable, but $\hat a_i^*$, $i = 2, \ldots, n-1$, must be normalized in the following way:
\[ \hat a_i = \frac{\hat a_i^*}{\sqrt{-2.722 + 4.083\,n}}, \qquad i = 2, \ldots, n-1. \]
The expected values Mi of order statistics, i = 2, . . . , n − 1, were computed by the numerical integration. A similar approach was used by Royston (1982a, 1982b, 1995) and his algorithms are used in R. > a<-rep(0,n) > for (j in 2:(n-1)) { + integrand <- function(x) {x*exp(lgamma(n+1)+ lgamma(n-j+1)-lgamma(j))* + (pnorm(x))^(j-1)*(1-pnorm(x))^(n-j)*dnorm(x)} + a[j]<-integrate(integrand, lower = -Inf, upper = Inf)$value + } > a<-2*a > a[n]<-sqrt(exp(lgamma((n+1)/2)-lgamma(n/2+1))/sqrt(2)) > a[1]<--a[n] > a[2:(n-1)]<-a[2:(n-1)]/sqrt(-2.722+4.083*n) > ln<-(sum(sort(res)*a))^2 > sn<-sum(res^2) > wn<-ln/sn > wn [1] 0.9753435 The distribution of the test statistic also is not known. Consequently it was then approximated by the following Monte Carlo procedure: for a fixed n, a n is random sample of size n from the normal distribution is generated, and W computed. Because Wn has the same distribution as the usual Shapiro-Wilk test we can use function shapiro.test:
> aux <- rep(0,10000)
> for(i in 1:10000) aux[i] <- shapiro.test(rnorm(27))[[1]]
> ### the critical values ####
> quantile(aux, c(0.01,0.05,0.1))
       1%        5%       10%
0.8963324 0.9251900 0.9368363
>
> #### "p-value" ###
> length(aux[aux<wn])/10000
[1] 0.7411
>
APPENDIX A
R system
R is a system for statistical analyses and graphics created by Ross Ihaka and Robert Gentleman (1996). It is freely distributed under the terms of the GNU General Public Licence; since mid-1997 its development and distribution have been carried out by several statisticians known as the "R Core Team." The system consists of a language, a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files. The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions.

R is available in several forms: the sources written mainly in C (and some routines in Fortran), essentially for Unix and Linux machines, or some pre-compiled binaries for Windows, Linux (Debian, Mandrake, RedHat, SuSe), Macintosh and Alpha Unix. The files needed to install R, either from the sources or from the pre-compiled binaries, are distributed from the Internet site of the Comprehensive R Archive Network (CRAN); see the web page http://www.r-project.org/, where the instructions for installation (the "R Installation and Administration" manual) are also available. The resulting language is very similar to S. Most of the user-visible functions in R are written in R. It is possible for the user to interface to procedures written in the C, C++, or Fortran languages for efficiency.

The R distribution contains a large number of statistical procedures, and R could seem too complex for a non-specialist, but its main feature is flexibility. Classical software usually directly displays the results of an analysis; R saves them in objects, so that an analysis can be done with no result displayed. This feature is very useful, because the user can extract only the part of the results that is of interest. There is also a large set of functions for graphics that are visualized immediately in their own windows and can be saved in various formats (jpg, png, bmp, ps, pdf, emf, pictex, xfig, eps).

R comes complete with its base libraries and some "recommended" packages. The user can extend his or her version of R by installing additional packages. A package is a collection of functions and datasets that are "packaged" up for easy installation. There are several (over 150) packages available, again from http://www.r-project.org/. The installation of a package may require
compilation of some C or Fortran code. Usually, a Windows machine does not have such a compiler, so authors typically provide a pre-compiled package in zip format. The R project website also contains a number of tutorials, documents and books that can help one learn R. Many questions about R are asked and answered on the R mailing list. Online help, and several manuals (in pdf format) are available with the R software. The manual “An Introduction to R” by the R core development team is an excellent introduction to R for people familiar with statistics. It has many interesting examples of R and a comprehensive treatment of the features of R.
A.1 Brief R overview When you use the R program, it issues a prompt when it expects input commands. The default prompt is >. Start the R program under Unix with the command $ R. To use R under Windows the procedure is basically the same. Usually R has a shortcut icon on the desktop screen or you can find it under Start–Programs–R menu. If not, search and run the executable file rgui.exe by double clicking from the search result window. To quit R, type q() at the R prompt (>) and press the Enter key. A dialog box will ask whether to save the objects you have created during the session so that they will become available next time you invoke R. Commands you entered can be easily recalled and modified. Just by hitting the arrow keys on the keyboard, you can navigate through the recently entered commands. Most R functions have online documentation: help(name) or ?name documentation on a topic with name “name” help.search(topic) search the help system apropos(what) the names of all objects in the search list matching the regular expression “what” help.start() start the HTML version of help browser interface to help
A.1.1 Data creation, selection, and manipulation c(...) generic function to combine arguments into a vector or list seq(from,to) or from:to generate sequences rep(x,times) replicate x times matrix(data,nrow,ncol) creates a matrix from the given set of values array(data,dim) creates a multi-way array with data x; dim is a vector giving the maximal indices in each dimension list(...) creates a list – an object consisting of an ordered collection of object data.frame(...) creates data frames, tightly coupled collections of variables
which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R’s modeling software rbind(...), cbind combine arguments by rows (columns) for matrices, data frames, and others which.max(x) returns the index of the greatest element of x which.min(x) returns the index of the smallest element of x rev(x) provides a reversed version of its argument sort(x) sorts the numeric or complex vector into ascending (or descending) order x %in% y returns a logical vector indicating if there is a match or not for its left operand match(x, y) returns a vector of the same length as x with the elements of x which are in y (NA otherwise) which(x == a) returns a vector of the indices of x if the comparison operation is true unique(x) if x is a vector, array or a data frame, returns a similar object but with duplicate elements removed subset(x, ...) returns a selection of x with respect to condition Matrices: t(x) matrix transpose diag(x) diagonal matrix %*% matrix multiplication solve(a,b) solves a% ∗ %x = b for x solve(x) matrix inverse of x rowsum(x), colsum(x) sum of rows (columns) for a matrix-like object rowSum(x), colSum(x), rowMeans(x), colMeans(x) row and column sums and means for numeric arrays Slicing and extracting data - indexing vectors: x[n] nth element x[-n] all elements without the nth x[1:n] first n elements x[-(1:n)] elements from n + 1 to the end x[c(1,5,...)] specific elements x[x > what] all elements greater than “what” x[x > from & x < to] all elements between “from” and “to” Indexing lists: x[n] list with elements n x[[n]] nth element of the list x[[name]] or x$name element of the list named “name” Indexing matrices: x[i,j] element at row i, column j x[i,] row i
x[,j] column j x[,c(j1,j2)] columns j1 and j2 x[name,] row named “name” Indexing data frames: x[[name]] or x$name column named “name”
A.1.2 Variable conversion and information as.array(x), as.data.frame(x), as.numeric(x), as.logical(x), as.complex(x), as.character(x), . . . convert type is.na(x), is.null(x), is.array(x), is.data.frame(x), is.numeric(x), is.complex(x), is.character(x) . . . test for type length(x) number of elements in x dim(x) retrieve or set the dimension of an object nrow(x), ncol(x) number of rows, columns attr(x,...) get or set the attribute of x
A.1.3 Input and output save(...,file) writes an external representation of R objects to the specified file load(file) loads the datasets written with save data(dataset) loads specified datasets library(packages) or require(package) loads add-on packages write.table(x,file) prints its required argument x to “file”(after converting it to a data frame) read.table(file) reads a file in table format and creates a data frame from it print(x, ...) prints its arguments; generic, meaning it can have different methods for different objects format(x,...) format an R object for pretty printing cat(..., file) prints the arguments after coercing to character if necessary scan(file) read data into a vector or list from the console or file sink(file) send R output to a “file,” until sink()
A.1.4 Math

x + y, x - y, x * y, x / y, x ^ y, x %% y, x %/% y binary operators performing arithmetic on numeric or complex vectors
abs(x), sqrt(x), log(x), log10(x), exp(x) elementary mathematical functions
cos(x), sin(x), tan(x), acos(x), asin(x), atan(x) trigonometric functions
beta(a, b), lbeta(a, b), gamma(x), lgamma(x), psigamma(x, deriv), digamma(x), trigamma(x), choose(n, k), lchoose(n, k), factorial(x), lfactorial(x) special mathematical functions related to the beta and gamma functions
max(x), min(x) return the maximum and minimum of the input values
pmin(x,y,...), pmax(x,y,...) a vector whose ith element is the minimum (maximum) of x[i], y[i], ...
sum(x) sum of the elements of x
diff(x) lagged and iterated differences of vector x
prod(x) product of the elements of x
cumsum(x), cumprod(x), cummin(x), cummax(x) return a vector whose elements are the cumulative sums, products, minima or maxima of the elements of the argument
ceiling(x), floor(x), round(x, digits), signif(x, digits), trunc(x) rounding of numbers
scale(x) centering and scaling of a numeric matrix
union(x,y), intersect(x,y), setdiff(x,y), setequal(x,y), is.element(el,set) set functions
Re(x), Im(x), Mod(x), Arg(x), Conj(x) basic functions supporting complex arithmetic
fft(x) fast Fourier transform
Many math functions have a logical parameter na.rm (FALSE by default); setting na.rm=TRUE removes missing data (NA) before the computation.
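For example (the computations are deterministic, so the output shown is exact to the printed precision):

> x <- c(1, 4, 9, 16)
> sqrt(x)
[1] 1 2 3 4
> cumsum(x)
[1]  1  5 14 30
> round(log(x), 2)
[1] 0.00 1.39 2.20 2.77
> sum(c(x, NA))                  # NA propagates ...
[1] NA
> sum(c(x, NA), na.rm = TRUE)    # ... unless na.rm = TRUE
[1] 30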
A.1.5 Distributions

rnorm(n, mean, sd) Gaussian (normal) distribution
rexp(n, rate) exponential distribution
rgamma(n, shape, scale) gamma distribution
rpois(n, lambda) Poisson distribution
rweibull(n, shape, scale) Weibull distribution
rcauchy(n, location, scale) Cauchy distribution
rbeta(n, shape1, shape2) beta distribution
rt(n, df) Student (t) distribution
rf(n, df1, df2) Fisher-Snedecor (F) distribution
rchisq(n, df) χ2 (chi-squared) distribution
rbinom(n, size, prob) binomial distribution
rgeom(n, prob) geometric distribution
rhyper(nn, m, n, k) hypergeometric distribution
rlogis(n, location, scale) logistic distribution
rlnorm(n, meanlog, sdlog) lognormal distribution
rnbinom(n, size, prob) negative binomial distribution
runif(n, min, max) uniform distribution
rwilcox(nn, m, n), rsignrank(nn, n) Wilcoxon statistics
All these functions can be used by replacing the letter r with d, p or q to get, respectively, the probability density (dfunc(x, ...)), the cumulative distribution function (pfunc(x, ...)), and the quantile function (qfunc(p, ...)), with 0 < p < 1.
sample(x, size) takes a random sample of the specified size from the elements of the vector x, either without replacement (the default) or with replacement
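For the normal distribution, for instance, the four prefixes work as follows; the density, distribution function and quantile values shown are exact to the printed precision, while the last two lines only sketch the random-number and sampling calls:

> dnorm(0)            # N(0,1) density at 0
[1] 0.3989423
> pnorm(1.96)         # distribution function
[1] 0.9750021
> qnorm(0.975)        # quantile function
[1] 1.959964
> x <- rnorm(50, mean = 10, sd = 2)   # 50 pseudo-random N(10, 4) values
> u <- sample(x, 10)                  # subsample of size 10, without replacement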
A.1.6 Statistics, optimization, and model fitting

mean(x) mean of the elements of x
weighted.mean(x, w) mean of x with weights w
median(x) median of the elements of x
quantile(x,probs) sample quantiles
rank(x) ranks of the elements of x
var(x) or cov(x) variance of the elements of x
sd(x) standard deviation of x
cor(x) correlation matrix of x
var(x, y) or cov(x, y) covariance between x and y
cor(x, y) linear correlation between x and y
binom.test(), pairwise.t.test(), power.t.test(), prop.test(), t.test() basic tests
anova(fit,...) analysis of variance (or deviance) tables for one or more fitted model objects
density(x) kernel density estimates of x
lm(formula) fits linear models; formula is typically of the form response ~ termA + termB + ...
optim(par, fn, method) general-purpose optimization
nlm(f,p) minimizes function f using a Newton-type algorithm with starting values p
glm(formula,family) fits generalized linear models
nls(formula) nonlinear least-squares estimates of the nonlinear model parameters
approx(x,y) linearly interpolates given data points
spline(x,y) cubic spline interpolation
loess(formula) fits a polynomial surface using local fitting
The following generics often apply to model fitting functions:
predict(fit,...) predictions from fit based on input data
df.residual(fit) returns the number of residual degrees of freedom
coef(fit) returns the estimated coefficients (sometimes with their standard errors)
residuals(fit) returns the residuals
deviance(fit) returns the deviance
fitted(fit) returns the fitted values
logLik(fit) computes the logarithm of the likelihood and the number of parameters
AIC(fit) computes the Akaike information criterion
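A minimal model-fitting sketch on simulated data (the output is omitted, since it depends on the random sample):

> x <- 1:20
> y <- 2 + 0.5 * x + rnorm(20)         # simulated linear relationship
> fit <- lm(y ~ x)                     # least squares fit
> coef(fit)                            # estimated intercept and slope
> summary(residuals(fit))              # residuals of the fit
> predict(fit, data.frame(x = 25))     # prediction at a new design point
> AIC(fit)                             # Akaike information criterion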
A.1.7 Plotting

plot(x) plot of the values of x (on the y-axis) ordered on the x-axis
plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis)
hist(x) histogram of the frequencies of x
barplot(x) creates a bar plot with vertical or horizontal bars
dotchart(x) plots a Cleveland dot plot (stacked plots line-by-line and column-by-column)
pie(x) circular pie chart
boxplot(x) box-and-whiskers plot
coplot(x ~ y | z), pairs(x) functions for representing multivariate data
qqnorm(x) quantiles of x with respect to the values expected under a normal distribution
qqplot(x, y) quantiles of y with respect to the quantiles of x
contour(x, y, z), image(x, y, z), persp(x, y, z) plots of three variables
The following parameters are common to many plotting functions:
add if TRUE superposes the plot on the previous one (if it exists)
axes if FALSE does not draw the axes and the box
type specifies the type of plot
xlim, ylim specify the lower and upper limits of the axes
xlab, ylab annotate the axes
main, sub main title, subtitle
Low-level plotting commands:
points(x, y) adds points
lines(x, y) adds lines
text(x, y, labels, ...) adds text given by labels at coordinates (x, y)
abline(a,b) draws a line of intercept a and slope b
abline(h=y) draws a horizontal line at ordinate y
abline(v=x) draws a vertical line at abscissa x
abline(lm.obj) draws the regression line given by “lm.obj”
legend(x, y, legend) adds the legend at the point (x,y) with the symbols given by legend
title() adds a title and optionally a subtitle
axis(side, vect) adds an axis
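A few typical plotting calls on simulated data; the figures themselves are not reproduced here, and qqline(), which adds a reference line to a normal Q-Q plot, is a base graphics function not listed above:

> x <- rnorm(200)
> hist(x, main = "Histogram of x")     # histogram with a main title
> boxplot(x)                           # box-and-whiskers plot
> y <- 1 + 2 * x + rnorm(200, sd = 0.5)
> plot(x, y, xlab = "x", ylab = "y")   # scatterplot with axis labels
> abline(lm(y ~ x))                    # add the fitted regression line
> qqnorm(x); qqline(x)                 # normal Q-Q plot with reference line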
A.1.8 Programming

function(arglist) expr, return(value) these constructs provide the base mechanism for defining new functions
The R language has the following basic control-flow constructs:
if(cond) expr
if(cond) cons.expr else alt.expr
for(var in seq) expr
while(cond) expr
repeat expr
break
next
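As an illustration, the following hypothetical function (its name and behavior are chosen only for this example) combines several of these constructs; it returns the difference between the upper and lower alpha-quantiles of a sample:

> trimmed.range <- function(x, alpha = 0.05) {
+   if (alpha < 0 || alpha >= 0.5) stop("alpha must lie in [0, 0.5)")
+   q <- quantile(x, c(alpha, 1 - alpha))
+   return(unname(q[2] - q[1]))
+ }
> trimmed.range(rnorm(100))            # value depends on the random sample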
References
J.G. Adrover and V. J. Yohai (2002). Projection estimates of multivariate location. Ann. Statist. 20, 1760–1781. Analytical Methods Committee (1989). Robust statistics — how not to reject outliers. The Analyst 114, 1693–1702. D.F. Andrews, P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rogers, and J.W. Tukey (1972). Robust Estimates of Location. Survey and Advances. Princeton University Press, Princeton, NJ. ´ V´ıˇsek (eds.) (1992). Computational Aspects of Model Choice. J. Antoch and J.A. Physica-Verlag, Heidelberg. ´ V´ıˇsek (1998). Robust Estimation in Linear Model. J. Antoch, H. Ekblom and J.A. XploRe Macros: http://www.quantlet.de/codes/rob/ROB.html. R.R. Bahadur (1967). Rates of convergence of estimators and test statistics. Ann. Math. Statist. 38, 303–324. A.D. Barbour and P. Hall (1984). On the rate of Poisson convergence. Math. Proc. Cambridge Philos. Soc. 95, 473–480. V.D. Barnett (1966). Evaluation of the maximum likelihood estimator where the likelihood equation has multiple roots. Biometrika 53, 152–166. V.D. Barnett and T. Lewis (1994). Outliers in Statistical Data, 3rd ed. John Wiley & Sons, Chichester, U.K. G.W. Bassett (1991). Equivariant, monotonic, 50% breakdown estimators. Amer. Statist. 45, 135–137. D.A. Belsley, E. Kuh and R.E. Welsch (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, New York. A. Bhattacharyya, (1943). On a measure of divergence between two statistical populations defined by their probability distribution. Bull. Calcutta Math. Soc. 35, 99–109. P.J. Bickel (1975). One-step Huber estimates in the linear model. Ann. Statist. 1, 597–616. P.J. Bickel and E.L. Lehmann (1979). Descriptive statistics for nonparametric model. IV. Spread. Contributions to Statistics: Jaroslav H´ ajek Memorial Volume (J. Jureˇckov´ a, ed.), pp. 33–40. Academia, Prague and Reidel, Dordrecht. P.J. Bickel, C.A.J. Klaassen, Y. Ritov and J.A. Wellner (1993). Efficient and Adaptive Estimation for Semiparametric Models. John Hopkins University Press, Baltimore. P. Billingsley (1998). Convergence of Probability Measures, 2nd ed. John Wiley & Sons, New York. G. Blom (1956). On linear estimates with nearly minimum variance. Arkiv f¨ ur Mathematik 3, 365–369. 181
P. Bloomfield and W.L. Steiger (1983). Least Absolute Deviations. Theory, Applications and Algorithms. Birkh¨ auser, Boston. R.J. Boskovich (1757). De literaria expeditione per pontificiam ditionem et synopsis amplioris operis... Bononiensi Scientiarum et Artum Instituto atque Academia Commentarii 4, 353–396. G.E.P. Box (1953). Non-normality and tests of variance. Biometrika 40, 318–335. G.E.P. Box and S.L. Anderson (1955). Permutation theory in the derivation of robust criteria and the study of departures from assumption. J. Royal Statist. Soc. B 17, 1–34. B.M. Brown (1983). Statistical uses of the spatial median J. Royal Statist. Soc. B 45, 25–30. B.M. Brown (1988). Spatial median. Encyclopedia of Statistical Sciences, Vol. 8 (S. Kotz, N.L. Johnson and C.B. Read, eds.), pp. 574–575. John Wiley & Sons, New York. K.A. Brownlee (1960). Statistical Theory and Methodology in Science and Engineering. John Wiley & Sons, New York, 491–500. H. Bunke and O. Bunke (eds.) (1986). Statistical Inference in Linear Models. John Wiley & Sons, Chichester, U.K. R.J. Carroll and D. Ruppert (1988). Transformations and Weighting in Regression. Chapman & Hall, London. D. Cellier and D. Fourdrinier (1995). Shrinkage estimators under spherical symmetry for the general linear model. J. Multiv. Anal. 52, 338–351. D. Cellier, D. Fourdrinier and C. Robert (1989). Robust shrinkage estimators of the location parameter for elliptically symmetric distributions. J. Multiv. Anal. 29, 39–52. B. Chakraborty, P. Chaudhuri and H. Oja (1998). Operating transformation retransformation on spatial median and angle test. Statistica Sinica 8, 767–784. S. Chaterjee and A.S. Hadi (1988). Sensitivity Analysis in Linear Regression. John Wiley & Sons, New York. P. Chaudhuri (1996). On a geometric notion of quantiles for multivariate data. J. Amer. Statist. Assoc. 91, 862–872. Y.S. Chow and K.F. Yu (1981). The performance of a sequential procedure for the estimation of the mean. Ann. Statist. 9, 184–188. A. Cohen Brandwein and E. Strawderman (1991). Generalizations of James-Stein estimators under spherical symmetry. Ann. Statist. 19, 1639–1650. J.R. Collins (1982). Robust M -estimators of location vectors. J. Multivar. Analysis 12, 480–492. J.R. Collins (1986). Maximum asymptotic variances of trimmed means under asymmetric contaminations. Ann. Statist. 14, 348–354. J.R. Collins (1991). Maximum asymptotic biases and variances of symmetrized interguantile ranges under asymmetric contamination. Statistics 22, 379–402. J.R. Collins (1999). Robust M -estimators of scale: Minimax bias versus maximal variance. Canad. J. Statist. 27, 81–96. J.R. Collins (2000). Robustness comparisons of some classes of location parameter estimators. Ann. Inst. Statist. Math. 52, 351–366. J.R. Collins (2003). Bias-robust L-estimators of a scale parameter. Statistics 37, 287–303.
J.R. Collins and B.Wu (1998). Comparisons of asymptotic biases and variances of M -estimators of scale under asymmetric contamination. Commun. Statist. Theory Meth. 27, 1791–1810. R.D. Cook and S. Weisberg (1982). Residuals and Influence in Regression. Chapman & Hall, London. I. Csisz´ ar (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 2, 299–318. M. Cs¨ org˝ o, and P. R´ev´esz (1978). Strong approximation of the quantile process. Ann. Statist. 6, 882–894. M. Cs¨ org˝ o, and P. R´ev´esz (1981). Strong Approximations in Probability and Statistics. Akad´emiai Kiad´ o, Budapest. P.L. Davies (1987). Asymptotic behavior of S-estimates of multivariate location parameters and dispersion matrices. Ann. Statist. 15, 1269–1292. P.L. Davies (1990). The asymptotics of S-estimators in the linear regression model. Ann. Statist. 18, 1651–1675. P.L. Davies (1992). The asymptotics of Rousseeuw’s minimum volume ellipsoid estimator. Ann. Statist. 20, 1828–1843. P.L. Davies and U. Gather (2004). Robust Statistics. Handbook of Computational Statistics (J.E. Gentle, W. H¨ ardle, Y. Mori, eds.), pp. 655–695. Springer-Verlag, Berlin. A.P. Dempster (1969). Elements of Continuous Multivariate Analysis. AddisonWesley, New York. S.J. Devlin, R. Gnanadesikan and J.R. Kettenring (1976). Some multivariate applications of elliptical distributions. Essays in Probability and Statistics (S. Ikeda, ed.), pp. 365–395. Shinko Tsusho, Tokyo. T. de Wet and J.H. Venter (1973). Asymptotic distributions of quadratic forms with application to test of fit. Ann. Statist. 31, 276–295. R.L. Dobrushin (1970). Describing a system of random variables by conditional distributions. Theor. Prob. Appl. 15, 458–486. Y. Dodge (1996). The guinea pig of multiple regression. Robust Statistics, Data Analysis, and Computer Intensive Methods, (H. Rieder, ed.), Lecture Notes in Statistics 109, pp. 91–118. Springer-Verlag, New York. Y. Dodge and J. Jureˇckov´ a (2000). Adaptive Regression. Springer-Verlag, New York. D.L. Donoho (1982). Breakdown properties of multivariate location estimators. Ph.D. Thesis, Department of Statistics, Harvard University, Harvard, MA. D.L. Donoho and M. Gasko (1992). Breakdown properties of location estimates based on half space depth and projection outlyingness. Ann. Statist. 20, 1803–1827. D.L. Donoho and P.J. Huber (1983). The notion of breakdown point. A Festschrift for Erich Lehmann (P.J. Bickel, K.A. Doksum and J.L. Hodges, eds.), pp. 157– 184. Wadsworth, CA. N.R. Draper and H. Smith (1988). Applied Regression Analysis, 3rd ed. John Wiley & Sons, New York. S.P. Ellis (1998). Instability of least squares, least absolute deviation and least median of squares linear regression. Statist. Science 13, 337–350. M. Fabian, P. Habala, P. H´ ajek, V.M. Santaluc´ıa, J. Pelant and V. Z´ızler (2001). Functional Analysis and Infinite-Dimensional Geometry. Springer, New York. M. Falk (1986). On the estimation of the quantile density function. Statist. Probab. Lett. 4, 69–73.
L.T. Fernholz (1983). von Mises Calculus for Statistical Functionals. Lecture Notes in Statistics 19, Springer-Verlag, New York. C.A. Field and E.M. Ronchetti (1990). Small Sample Asymptotics. IMS Lecture Notes 13, IMS, Hayward, CA. R.A.Fisher (1922). On the mathematical foundations of theoretical statistics. Phil. Trans. Roy. Soc. London A 222, 309–368. D. Fourdrinier, E. Marchand and E. Strawderman (2004). On the inevitability of a paradox in shrinkage estimation for scale mixture of normals. J. Statist. Plann. Infer. 121, 37–51. D. Fourdrinier and E. Strawderman (1996). A paradox concerning shrinkage estimators: Should a known scale parameter be replaced by an estimated value in the shrinkage factor? J. Multiv. Anal. 59, 109–140. J.C. Fu (1975). The rate of convergence of consistent point estimators. Ann. Statist. 3, 234–240. J.C. Fu (1980). Large deviations, Bahadur efficiency, and the rate of convergence of linear functions of order statistics. Bull. Inst. Math. Acad. Sinica 8, 15–37. M. Ghosh and N. Mukhopadhyay (1979). Sequential point estimation of the mean when the distribution is unspecified. Comm. Statist. A 8, 637–652. A.L. Gibbs and F.E. Su (2002). On choosing and bounding probability metrics. Intern. Statist. Rev. 70, 419–435. C. Gutenbrunner (1986). Zur Asymptotik von Regression Quantil Prozessen und daraus abgeleiten Statistiken. Ph.D. Thesis, Universit¨ at Freiburg, Germany. C. Gutenbrunner and J. Jureˇckov´ a (1992). Regression rank scores and regression quantiles. Ann. Statist. 20, 305–330. C. Gutenbrunner, J. Jureˇckov´ a, R. Koenker and S. Portnoy (1993). Tests of linear hypotheses based on regression rank scores. J. Nonpar. Statist. 2, 307–331. J. H´ ajek (1969). A Course in Nonparametric Statistics. Holden-Day, San Francisco. P.J. Hall (1981). Large sample properties of Jaeckel’s adaptive trimmed mean. Ann. Inst. Statist. Math. 33A, 449–462. F.R. Hampel (1968). Contribution to the Theory of Robust Estimators. Ph.D. Thesis. University of California, Berkeley, CA. F.R. Hampel (1971). A general qualitative definition of robustness. Ann. Math. Statist. 42, 1887–1896. F.R. Hampel (1973). Some small sample asymptotics. Proceedings of the Prague Symposium on Asymptotic Statistics, Vol. 2 (J. H´ ajek, ed.), pp. 109–126. Charles University, Prague. F.R. Hampel (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc. 69, 383–393. F.R. Hampel (1975). Beyond location parameters: Robust concepts and methods (with discussion). Bull. Internat. Statist. Inst. 46, 375–391. F.R. Hampel, P.J. Rousseeuw, E.M. Ronchetti, and W. Stahel (1986). Robust Statistics – The Approach Based on Influence Functions. John Wiley & Sons, New York. F. Harrell and C. Davis (1982). A new distribution-free quantile estimator. Biometrika 69, 636–640. T.P. Hettmansperger (1985). Statistical Inference Based on Ranks. John Wiley & Sons, New York. P. Harremo¨es and P.S. Ruzankin (2004). Rate of convergence to Poisson law in terms of information divergence. Trans. IEEE Inform. Theory 50, 2145–2149.
D.M. Hawkins and D. Olive (1999). Applications and algorithms for least trimmed sum of absolute deviations, regression. Comp. Statist. Data Anal. 32, 119–134. X. He, J. Jureˇckov´ a, R. Koenker and S. Portnoy (1990). Tail behavior of regression estimators and their breakdown point. Econometrica 58, 1195-1214. X. He and D.G. Simpson (1993). Lower bounds for contamination bias: Globally minimax versus locally linear estimation. Ann. Statist. 21, 314–337. T.P. Hettmansperger and R.H. Randles (2002). A practical affine equivariant multivariate median. Biometrika 89, 851–860. T.P. Hettmansperger and S. Sheather (1992). A cautionary note on the method of least median squares. Amer. Statist. 46, 79–83. J.L. Hodges and E.L. Lehmann (1963). Estimation of location based on rank tests. Ann. Math. Statist. 34, 598–611. W. Hoeffding (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 13–29. O. H¨ ossjer (1994). Rank-based estimates in the linear model with high breakdown point. J. Amer. Statist. Assoc. 89, 149–158. C. Huber-Carol, N. Balakrishnan, M.S. Nikulin, M. Mesbah (eds.) (2002). Goodnessof-Fit Tests and Model Validity. Birkh¨ auser, Boston. P.J. Huber (1964). Robust estimation of a location parameter. Ann. Math. Statist. 36, 73–101. P.J. Huber (1968). Robust confidence limits. Z. Wahrsch. Verw. Geb. 10, 269–278. P.J. Huber (1969). Th´eorie de l’inf´erence de statistique robuste. Presses de l’Universit´e de Montr´eal, Montr´eal. P.J. Huber (1981). Robust Statistics. John Wiley & Sons, New York. R. Ihaka, R. Gentleman (1996). R: A language for data analysis and graphics. J. Computational Graphical Statistics 5, 299–314. L.A. Jaeckel (1969). Robust Estimates of Location. Ph.D. Thesis, University of California, Berkeley, CA. L.A. Jaeckel (1971). Robust estimation of location: Symmetry and asymmetric contamination. Ann. Math. Statist. 42, 1020–1034. L.A. Jaeckel (1972). Estimating regression coefficients by minimizing the dispersion of residuals. Ann. Statist. 43, 1449–1458. W. James and C. Stein (1961). Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematics Statistics and Probability, Vol. 1, pp. 361–380. University of California Press, Berkeley, CA. P. Janssen, J. Jureˇckov´ a and N. Veraverbeke (1985). Rate of convergence of oneand two-step M -estimators with applications to maximum likelihood and Pitman estimators. Ann. Statist. 13, 1222–1229. J. Jung (1955). On linear estimates defined by a continuous weight function. Arkiv f¨ ur Mathematik 3, 199–209. J. Jung (1962). Approximation to the best linear estimates. Contribution to Order Statistics (A.E. Sarhan and B.G. Greenberg, eds.), pp. 28–33. John Wiley & Sons, New York. J. Jureˇckov´ a (1977). Asymptotic relations of M -estimates and R-estimates in linear regression model. Ann. Statist. 5, 464–472. J. Jureˇckov´ a (1980). Asymptotic representation of M -estimators of location. Math. Operationsforsch. und Statistik, Ser. Statistics 11, 61–73.
J. Jureˇckov´ a (1981). Tail-behavior of location estimators. Ann. Statist. 9, 578–585. J. Jureˇckov´ a and A.H. Welsh (1990). Asymptotic relations between L- and M estimators in the linear model. Ann. Inst. Statist. Math. 42, 671–698. J. Jureˇckov´ a and L.B. Klebanov (1997). Inadmissibility of robust estimators with respect to L1 -norm. IMS Lecture Notes - Monograph Series, Vol. 31, (Y. Dodge, ed.), pp. 71–78. Institute of Mathematical Statistics, Hayward, CA. J. Jureˇckov´ a and L.B. Klebanov (1998). Trimmed, Bayesian and admissible estimators. Statist. Probab. Lett. 42, 47–51. J. Jureˇckov´ a and M. Mal´ y (1995). The asymptotics for studentized k-step M estimators of location. Sequen. Anal. 14, 229–245. J. Jureˇckov´ a and P.K. Sen (1982). M-estimators and L-estimators of location: Uniform integrability and asymptotic risk-efficient sequential versions. Sequential Anal. 1, 27–56. J. Jureˇckov´ a and P.K. Sen (1990). Effect of the initial estimator on the asymptotic behavior of one-step M -estimator. Ann. Inst. Statist. Math. 42, 345–357. J. Jureˇckov´ a and P.K. Sen (1994). Regression rank scores scale statistics and studentization in the linear model. Proceedings of 5th Prague Conference on Asymptotic Statistics (M. Huˇskov´ a and P. Mandl, eds.), pp. 111–121. Physica-Verlag, Vienna. J. Jureˇckov´ a and P.K. Sen (1996). Robust Statistical Procedures: Asymptotics and Interrelations. John Wiley & Sons, New York. J. Jureˇckov´ a and S. Portnoy (1987). Asymptotics for one-step M -estimators in regression with application to combining efficiency and high breakdown point. Commun. Statist. Theory Methods A 16, 2187–2199. J. Jureˇckov´ a and X. Milhaud (1993). Shrinkage of maximum likelihood estimators of multivariate location. Proceedings of 5th Prague Symposium on Asymptotic Statistics (P. Mandl and M. Huˇskov´ a , eds.), pp. 303–318. Physica-Verlag, Vienna. J. Jureˇckov´ a, J. Picek and P.K. Sen (2003). Goodness-of-fit test with nuisance regression and scale. Metrika 58, 235–258. J. Jureˇckov´ a, R. Koenker and S. Portnoy (2001). Tail behavior of the least squares estimator. Statist. Probab. Lett. 55, 377–384. A.M. Kagan, J.V. Linnik and C.R. Rao (1973). Characterization Problems in Mathematical Statistics. John Wiley & Sons, New York. J.T. Kent and D.E. Tyler (1991). Redescending M -estimates of multivariate location and scatter. Ann. Statist. 19, 2102–2119. J.T. Kent, D.E. Tyler and Y. Vardi (1994). A curious likelihood identity for the multivariate t-distribution. Comm. Statistics – Simulation Computation 23, 441– 453. J. Kim and D. Pollard (1990). Cube-root asymptotics. Ann. Statist. 18, 191–219. R. Koenker (2005). Quantile Regression. Cambridge University Press, Cambridge, U.K. R. Koenker and G. Bassett (1978). Regression quantiles. Econometrica 46, 33–50. R. Koenker and G. Bassett (1982). Robust tests of heteroscedasticity based on regression quantiles. Econometrica 50, 43–61. I. Kontoyannis, P. Harremo¨es and O. Johnson (2005). Entropy and the law of small numbers. IEEE Trans. Inform. Theory 51, 466–472. H.L. Koul (2002). Weighted Empirical Processes in Dynamic Nonlinear Models, 2nd ed., Lecture Notes in Statistics 166, Springer, New York. W. Krasker and R. Welsch (1982). Efficient bounded-influence regression estimation. J. Amer. Statist. Assoc. 77, 595–604.
J.P. Lecoutre and P. Tassi (1987). Statistique non parametrique et robustesse. Economica, Paris. E.L. Lehmann (1983). Theory of Point Estimators. John Wiley & Sons, New York. E.L. Lehmann (1999). Elements of Large Sample Theory. Springer, New York. F. Liese and I. Vajda (1987). Convex Statistical Distances. Teubner, Leipzig, Germany. R.Y. Liu, J.M. Parelius amd K. Singh (1999). Multivariate analysis by data depth: Descriptive statistics, graphics and inference (with discussion). Ann. Statist. 27, 783–858. H.P. Lopuha¨ a (1989). On the relation between S-estimators and M -estimators of multivariate location and covariance. Ann. Statist. 17, 1662–1683. H.P. Lopuha¨ a and P.J. Rousseeuw (1991). Breakdown properties of affine equivariant estimators of multivariate location and covariance matrices. Ann. Statist. 19, 229– 248. C.L. Mallows (1972). A note on asymptotic joint normality. Ann. Math. Statist. 43, 508–515. C. Mallows (1973). Influence functions. National Bureau of Economic Research, Conference on Robust Regression, Cambridge, MA. C. Mallows (1975). On some topics in robustness. Technical memorandum, Bell Telephone Laboratories, Murray Hill, NJ. A. Marazzi (1992). Algorithms, Routines and S-Functions for Robust Statistics. Chapman and Hall, New York. J.I. Marden (1998). Bivariate QQ plots and spider web plots. Statistica Sinica 8, 813–826. R.A. Maronna (1976). Robust M -estimates of multivariate location and scatter. Ann. Statist. 4, 51–67. R.A. Maronna and V.J. Yohai (1981). Asymptotic behavior of general M -estimates for regression and scale with random carriers. Z. Wahrscheinlichkeitstheorie und verw. Gebiete 58, 7–20. R.A. Maronna and V.J. Yohai (1995). The behavior of the Stahel-Donoho robust multivariate estimator. J. Amer. Statistc. Assoc. 90, 330–341. R.A. Maronna, W.A. Stahel and V.J. Yohai (1992). Bias-robust estimates of multivariate scatter based on projections. J. Multiv. Anal. 21, 965–990. R.D. Martin and R.H. Zamar (1993). Efficiency-constrained bias-robust estimation of location. Ann. Statist. 21, 991–1017. J.T. Mayer (1750). Abhandlung u ¨ber die Umw¨ alzung des Monds um seine Axe. Kosmographische Nachrichten und Sammlungen 1, 52–183. J. M¨ ott¨ onen and H. Oja (1995). Multivariate spatial sign and rank methods. J. Nonpar. Statist. 5, 201–213. H. Oja (1983). Descriptive statistics for multivariate distributions. Statist. Probab. Lett. 1, 327–332. H. Oja and R.H. Randles (2004). Multivariate nonparametric tests. Statist. Sci 4, 598–605. E. Parzen (1975). Nonparametric statistical data modelling. J. Amer. Statistical Assoc. 74, 105–131. E.S. Pearson (1931). The analysis of variance in cases of nonnormal variation. Biometrika 23, 114–133. M.S. Pinsker (1960). Information and Information Stability of Random Variables and Processes (in Russian). Izv. Akad. Nauk, Moscow.
S. Portnoy (1991). Asymptotic behavior of the number of regression quantile breakpoints. J. Sci. Statist. Comput. 12, 867–883. S.T. Rachev (1991). Probability Metrics and the Stability of Stochastic Models. John Wiley & Sons, Chichester, U.K. R Development Core Team (2004). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, http://www.Rproject.org. J.A. Reeds (1985). Asymptotic number of roots of Cauchy location likelihood equations. Ann. Statist. 13, 775–784. R.D. Reiss (1989). Approximate Distributions of Order Statistics. Springer, New York. H. Rieder (1994). Robust Asymptotic Statistics. Springer, New York. P.J. Rousseeuw (1984). Least median of squares regression. J. Amer. Statist. Assoc. 79, 871–880. P.J. Rousseeuw (1985). Multivariate estimation with high breakdown point. Mathematical Statistics and Applications, Vol. B (W. Grossmann et al., eds.), pp. 283–297. Reidel, Dordrecht. P.J. Rousseeuw and A.M. Leroy (1987). Robust Regression and Outlier Detection. John Wiley & Sons, New York. P.J. Rousseeuw and B.C. van Zomeren (1990). Unmasking multivariate outliers and leverage points. J. Amer. Statist. Assoc. 85, 633–639. P.J. Rousseeuw and K. van Driessen (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223. P.J. Rousseeuw and V.J. Yohai (1984). Robust regression by means of S-estimators. Robust and Nonlinear Time Series Analysis (J. Franke, W. H¨ ardle and R.D. Martin, eds.), pp. 256–272. Springer, New York. P. Royston (1982a). An extension of Shapiro and Wilk W test for normality to large samples. Appl. Statistics 31, 115–124. P. Royston (1982b). Algorithm AS 181: The W test for normality. Appl. Statistics 31, 176–180. P. Royston (1995). A remark on algorithm AS 181: The W test for normality. Appl. Statistics 44, 547–551. D. Ruppert and R. J. Carroll (1980). Trimmed least squares estimation in the linear model. J. Amer. Statist. Assoc. 75, 828–838. P.K. Sen (1964). On some properties of the rank-weighted means. J. Indian Soc. Agricul. Statist. 16, 51–61. P.K. Sen (1978). An invariance principle for linear combinations of order statistics. Zeit. Wahrscheinlich. Verw. Geb. 42, 327–340. P.K. Sen (1980). On nonparametric sequential point estimation of location based on general rank order statistics. Sankhy¯ a A 42, 201–218. P.K. Sen (1981). Sequential Nonparametrics: Invariance Principles and Statistical Inference. John Wiley & Sons, New York. P.K. Sen (1986). On the asymptotic distributional risk of shrinkage and preliminary test versions of maximum likelihood estimators. Sankhy¯ a A 48, 354–371. P.K. Sen (1994). Isomorphism of quadratic norm and OC ordering of estimators admitting first-order representation. Sankhy¯ a A 56, 465–475. P.K. Sen (2002). Shapiro-Wilk type goodness-of-fit tests for normality: Asymptotics revisited. Goodness-of-Fit Tests and Model Validity (C. Huber-Carol et al., eds.), pp. 73–88. Birkh¨ auser, Boston.
P.K. Sen, J. Jureˇckov´ a and J. Picek (2003). Goodness-of-fit test of Shapiro-Wilk type with nuisance regression and scale. Austrian J. of Statist. 32, No. 1 & 2, 163–177. R.J. Serfling (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons, New York. R.J. Serfling (2004). Nonparametric multivariate descriptive measures based on spatial quantiles. J. Statist. Planning Infer. 123, 259–278. S.S. Shapiro and M.B. Wilk (1965). An analysis of variance for normality (complete samples). Biometrika 52, 591–611. G.R. Shorack (2000). Probability for Statisticians. Springer, New York. G.R. Shorack and J.A. Wellner (1986). Empirical Processes with Applications to Statistics. John Wiley & Sons, New York. A.F. Siegel (1982). Robust regression using repeated medians. Biometrika 69, 242– 244. G.L. Sievers (1978). Estimation of location: A large deviation comparison. Ann. Statist. 6, 610–618. D.G. Simpson, D. Ruppert and R.J. Carroll (1992). On one-step GM -estimates and stability of inference in linear regression. J. Amer. Statist. Assoc. 87, 439–450. C.G. Small (1990). A survey of multidimensional medians. Intern. Statist. Rev. 58, 263–277. W.A. Stahel (1981). Breakdown of covariance estimators. Research Report 31, Fachgruppe f¨ ur Statistik, ETH Z¨ urich. R.J. Staudte and S.J. Sheather (1990). Robust Estimation and Testing. John Wiley & Sons, New York. C. Stein (1956). Inadmissibility of the usual estimator for the mean of multivariate distribution. Proceedings of Third Berkeley Symposium on Mathematics Statistics and Probability, Vol, 1, pp. 197–206. University of California Press, Berkeley, CA. C. Stein (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9, 1135–1151. S.M. Stigler (1986). The History of Statistics. The Measurement of Uncertainty before 1900. Belknap Press of Harvard University Press, London. M. Tableman (1994a). The influence functions for the least trimmed squares and the least trimmed absolute deviations estimators. Statist. Probab. Lett. 19, 329–337. M. Tableman (1994b). The asymptotics for the least trimmed absolute deviations (LTAD) estimator. Statist. Probab. Lett. 19, 387–398. J.W. Tukey (1960). A survey of sampling from contaminated distribution. Contributions to Probability and Statistics (I. Olkin, ed.), Stanford University Press, Stanford, CA. J.W. Tukey (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA. J.W. Tukey and D.H. McLaughlin (1963). Less vulnerable confidence and significance procedures for location based on a single sample: trimming/winsorization I. Sankhy¯ a A 25, 331–352. D.E. Tyler (1994). Finite sample breakdown points of projection based on multivariate location and scatter statistics. Ann. Statist. 22, 1024–1044. I. Vajda (1988). Theory of Statistical Inference and Information. Reidel, Dordrecht. A.W. van der Vaart and J.A. Wellner (1996). Weak Convergence and Empirical Processes (With Applications to Statistics.) Springer, New York. C. van Eeden (1972). Analogue, for signed rank statistics, of Jureˇckov´ a’s asymptotic linearity theorem for signed rank statistics. Ann. Math. Statist. 43, 791–802.
Y. Vardi and C.H. Zhang (2000). The multivariate L1 -median and associated data depth. Proceedings of National Academy of Sciences USA 97, 1423–1426. W.N. Venables and B.D. Ripley (2002). Modern Applied Statistics with S. 4 th ed., Springer, New York. ´ V´ıˇsek (1995). On high breakdown point estimation. Comp. Statist. 11, 137–146. J.A. ´ V´ıˇsek (2000). On the diversity of estimates. Comp. Statist. Data Anal. 34, J.A. 67–89. ´ V´ıˇsek (2002a). The least weighted squares I. The asymptotic linearity of normal J.A. equations. Bull. Czech Econometric Society 9/15, 31–58. ´ V´ıˇsek (2002b). The least weighted squares II. Consistency and asymptotic norJ.A. mality. Bull. Czech Econometric Society 9/16, 1–28. B. von Bahr (1965). On the convergence of moments in the central limit theorem. Ann. Math. Statist. 36, 808–818. R. von Mises (1947). On the asymptotic distribution of differentiable statistical functions. Ann. Math. Statist. 35, 73–101. A.H. Welsh (1986). Bahadur representation for robust scale estimators based on regression residuals. Ann. Statist. 14, 1246–1251. A.H. Welsh and E. Ronchetti (2002). A journey in single steps: robust one-step M-estimation in linear regression. J. Statist. Planning Infer. 103, 287–310. D.P. Wiens and Z. Zheng (1986). Robust M -estimators of multivariate location and scatter in the presence of asymmetry. Canad. J. Statist. 14, 161–176. H. Witting and U. M¨ uller-Funk (1995). Mathematische Statistik II. Asymptotische Statistik: Parametrische Modelle und nichtparametrische Funktionale. Teubner, Stuttgart, Germany. V.J. Yohai (1987). High breakdown point and high efficiency robust estimates for regression. Ann. Statist. 15, 642–656. V.M. Zolotarev (1983). Probability metrics. Theor. Probab. Appl. 28, 278–302. Y. Zuo (2003). Projection depth functions and associated medians. Ann. Statist. 31, 1460–1490. Y. Zuo (2004). Robust and efficient multivariate medians. Preprint, Michigan State University, East Lansing, MI.
Subject index admissibility, 133 Andrews sinus function, 50 asymptotic distribution, 46, 48, 141–143, 145–147, 150 asymptotic relations, 147, 150, 152 asymptotic relative efficiency, 146 asymptotic representation, 141, 147 asymptotic variance, 143, 145, 146, 150 asymptotically efficient estimator, 145 asymptotically equivalent estimator, 147–150 balanced design, 89 BLUE, 70, 156, 157 breakdown point, 35, 47–49, 66, 68, 69, 76 in multivariate model, 131, 132 in regression, 95, 109, 110, 112, 123, 126 of M -estimator of location, 46
empirical distribution function, 10, 18 empirical probability distribution, 9, 11, 17–19 empirical quantile function, 64 equivariance affine, 130 estimators with high breakdown points, 109 expected value, 47 exponential tails, 68, 88, 89 finite sample minimax, 54 Fisher consistency, 11, 44, 45, 62, 142 Fisher information, 48, 143, 145, 146, 148, 149, 151
Cauchy likelihood, 49, 84 characteristics of robustness, 27 quantitative, 32, 35 contaminated distribution, 47, 49, 152, 153 contamination model, 150, 151
geometric mean, 10 Gini mean difference, 64, 78 global sensitivity, 32, 47–49, 68, 69 GM -estimator, 97–100, 112 goodness-of-fit test, 155, 166 χ2 test, 155 Kolmogorov-Smirnov, 155, 166 Shapiro-Wilk test of normality, 155, 157, 166, 171 with nuisance regression and scale, 155, 158, 167, 170
Dirac probability measure, 15 distance of measures, 11, 12, 14, 18 Kullback-Leibler divergence, 13 L´evy, 12 Prochorov, 12 Hellinger, 12 Kolmogorov, 12, 18, 54 Lipschitz, 13 relations, 13 total variation, 12 dual program, 104
Hampel ψ function, 50 harmonic mean, 10 Harrell and Davis estimator of p-quantile, 69 heavy tailed distribution, 89 heavy tails, 49, 68 Hodges-Lehmann estimator, 75–78, 81, 84, 150 Hoeffding inequality, 60 Huber estimator randomized, 54
192 Huber function, 61, 68, 152 influence function, 27, 30, 32, 44, 45, 47–49, 63, 65–69, 76, 95–100, 112, 131–133, 141, 142, 147 discretized form, 28, 29 inter-quartile range, 63, 78, 80 James-Stein estimator, 139 Kolmogorov model, 151 L1 -multivariate median, 139, 140 large sample distribution, 18, 19 least favorable distribution, 56, 151, 152 least median of squares, 109, 113–115 least squares estimator, 8, 85, 87, 88, 94, 99, 113, 114 trimmed, 102, 103, 113, 114, 125 least squares method, 87, 113, 114 least trimmed squares estimator, 110, 113 least trimmed sum of absolute deviations, 110 L-estimator, 43, 63–66, 70, 76, 142, 144, 145, 147, 148, 150, 151 asymptotically risk-efficient, 74 in linear model, 101–104 large sample properties, 141 minimaximally robust, 151–153 moment convergence, 70 relations to other estimators, 146 leverage point, 87, 91 likelihood ratio, 57 likelihood ratio test, 56 local sensitivity, 32 maximal likelihood estimator, 44 maximum bias, 33 Mayer matrix, 170 mean, 47, 49, 73, 78, 80, 83, 84 median, 43, 47, 48, 64, 69, 75, 78, 80 median absolute deviation, 63, 78, 80 median unbiased estimator, 75 M -estimator, 9, 43–50, 66, 75, 76, 142, 143, 147, 149–152 asymptotically risk-efficient, 73
SUBJECT INDEX Huber, 49, 54, 58, 68, 78, 80, 148 in linear model, 94–98, 104, 106, 110, 118, 126 one-step version, 110–112 in multivariate model, 129, 130, 132 in regression model Huber, 111 large sample properties, 141 minimaximally robust, 151, 152 moment convergence, 58 of location, 45 of multivariate location influence function, 131 relations to other estimators, 146 studentized, 61 M -functional, 44, 45, 47, 49, 150, 151 midrange, 42, 64, 78, 81 minimax robustness, 41, 49, 150–153 minimum covariance determinant estimator, 132, 135, 137 minimum volume ellipsoid estimator, 132, 135 M M -estimators, 100, 101, 118 modus, 43 moment estimator, 43 multivariate t-distribution, 130 multivariate location, 129 multivariate normal distribution, 130 one-step version, 110 Pitman closeness, 139 pivot, 134 population extreme, 91 population regression quantile, 102 projection estimator, 132 quadratic loss, 133 qualitative robustness, 30–32 R system, 173 Rao-Cram´er lower bound, 143, 145 redescending function, 49 regression and scale equivariance, 104, 106 regression equivariance, 94, 106 regression invariance, 94, 106
SUBJECT INDEX regression quantile, 94, 101–104, 106, 107, 110, 124, 168 regression rank scores, 94, 104–106, 108, 124, 168 R-estimator, 9, 43, 74–76, 142, 146, 147, 149–151 asymptotically efficient, 146 large sample properties, 141 minimaximally robust, 151–153 relations to other estimators, 146 robust estimators in multivariate location model, 129 of real parameter, 43 sample range, 64, 80 scale equivariance, 45, 61, 62, 94, 106 scale statistic, 62, 144 in linear model, 94, 106, 107, 111 scatter matrix, 129 Sen’s weighted mean, 69, 78, 81 sequential estimator, 73 asymptotically risk efficient, 73 sequential L-estimators, 72 sequential M -estimators, 72 S-estimator, 100 in multivariate model, 132 shrinkage phenomenon, 133 skipped mean, 50 skipped median, 50 Snedecor F -test, 8 spatial rank function, 140 spherically symmetric distribution, 139 squared error loss, 134 Stahel-Donoho estimator, 132 standard deviation, 63, 78, 80 statistical functional, 9–11, 13, 14, 131 derivative, 14 Fr´echet, 14, 17, 18, 27, 141 Gˆ ateau, 14, 15, 18, 19, 27 Hadamard, 14, 18 differentiability, 11, 14 empirical, 18 tail-behavior measure, 36–40, 49, 68 statistical model, 5 stopping rule, 72 Student t-test, 8 studentized M -estimator, 61, 62, 144 in linear model, 94, 98, 106 studentized M -functional, 63
193 symmetry angular, 130 central, 130 elliptical, 130 spherical, 130 translation and scale equivariance, 61, 75 translation equivariance, 36, 37, 45, 55, 57, 61 trimmed mean, 67–69, 78, 80, 148, 150, 153, 154 Tukey biweight, 50 Winsorized mean, 68, 69, 78, 81, 148
Author index Adrover, 132, 181 Anderson, 31, 182 Andrews, 43, 181 Antoch, 43, 181 Bahadur, 36, 154, 181 Balakrishnan, 185 Barbour, 23, 181 Barnett, 84, 181 Bassett, 101, 102, 110, 116, 181, 186 Belsley, 181 Bhattacharyya, 24 Bickel, 107, 111, 142, 181, 183 Billingsley, 12, 181 Blom, 70, 181 Bloomfield, 182 Boskovich, 30, 182 Box, 31, 182 Brandwein, 139 Brown, 129, 182 Brownlee, 134, 182 Bunke H., 43, 182 Bunke O., 43, 182 Carroll, 127, 182, 188, 189 Cellier, 139, 182 Chakraborty, 129, 182 Chaterjee, 182 Chaudhuri, 140, 182 Chow, 73, 182 Cohen, 139 Cohen Brandwein, 182 Collins, 83, 182, 183 Cook, 183 Csisz´ ar, 23, 183 Cs¨ org˝ o, 71, 93, 183 Davies, 109, 132, 183 Davis, 69, 184
de Wet, 156, 183 Dempster, 139, 183 Devlin, 130, 183 Dobrushin, 23, 24, 183 Dodge, 43, 108, 134, 142, 153, 183, 186 Doksum, 183 Donoho, 35, 132, 183 Draper, 183 Ekblom, 181 Ellis, 110, 183 Fabian, 12, 183 Falk, 107, 183 Fernholz, 18, 184 Field, 142, 184 Fisher, 11, 184 Fourdrinier, 139, 182, 184 Franke, 188 Fu, 36, 154, 184 Galton, 112 Gasko, 132, 183 Gather, 183 Gentleman, 173, 185 Ghosh, 73, 184 Gibbs, 13, 184 Gnanadesikan, 183 Greenberg, 185 Gutenbrunner, 104–106, 127, 184 Habala, 183 Hadi, 182 H´ ajek, 160, 183 Hall, 23, 154, 181, 184 Hampel, 31, 43, 50, 97, 109, 142, 181, 184 H¨ ardle, 188 Harrell, 69, 184
196 Harremo¨es, 23, 184, 186 Hawkins, 110, 185 He, 42, 185 Hettmansperger, 110, 129, 184, 185 Hodges, 74, 183, 185 Hoeffding, 60, 185 H¨ ossjer, 110, 185 Huber, 35, 43, 49, 54, 61, 70, 73, 87, 88, 97, 131, 142, 151, 181, 183, 185 Huber-Carol, 155, 185 Huˇskov´ a, 186 Ihaka, 173, 185 Ikeda, 183 Jaeckel, 77, 151, 154, 185 James, 185 Janssen, 111, 185 Johnson, 182, 186 Jung, 70, 185 Jureˇckov´ a, 37, 43, 58, 61, 71, 73, 84, 95, 97, 104, 108, 111, 112, 133, 134, 142, 153, 156, 158, 181, 183–186 Kagan, 8, 186 Kent, 137, 186 Kettenring, 183 Kim, 109, 186 Klaassen, 181 Klebanov, 133, 186 Koenker, 101, 102, 104, 116, 184–186 Kontoyannis, 23, 186 Kotz, 182 Koul, 142, 186 Krasker, 100, 186 Kuh, 181 Lecoutre, 43, 187 Lehmann, 74, 107, 142, 181, 183, 185, 187 Leroy, 43, 110, 114, 135, 188 Lewis, 181 Liese, 13, 187 Linnik, 8, 186 Liu, 187 Lopuha¨ a, 132, 187 Mallows, 23, 24, 99, 187
AUTHOR INDEX Mal´ y, 111, 186 Mandl, 186 Marazzi, 128, 187 Marchand, 139, 184 Marden, 129, 140, 187 Maronna, 100, 130, 132, 187 Martin, 187, 188 Mayer, 170, 187 McLaughlin, 154, 189 Mesbah, 185 Milhaud, 186 M¨ ott¨ onen, 129, 140, 187 Mukhopadhyay, 73, 184 M¨ uller-Funk, 142, 190 Nikulin, 185 Oja, 129, 140, 182, 187 Olive, 110, 185 Olkin, 189 Parelius, 187 Parzen, 93, 187 Pearson, 31, 187 Pelant, 183 Picek, 156, 158, 186 Pinsker, 23, 187 Pollard, 109, 186 Portnoy, 112, 126, 184–186, 188 Rachev, 13, 188 Randles, 129, 185, 187 Rao, 8, 186 Read, 182 Reeds, 84, 188 Reiss, 13, 188 R´ev´esz, 71, 93, 183 Rieder, 43, 97, 142, 188 Ripley, 81, 135, 137, 190 Ritov, 181 Robert, 139, 182 Rogers, 181 Ronchetti, 112, 142, 184, 190 Rousseeuw, 43, 100, 109, 110, 114, 132, 135, 137, 184, 187, 188 Royston, 171, 188 Ruppert, 127, 182, 188, 189
AUTHOR INDEX Ruzankin, 23, 184 Santaluc´ıa, 183 Sarhan, 185 Sen, 43, 58, 69, 71, 73, 74, 84, 95, 97, 104, 112, 134, 139, 142, 156, 158, 186, 188 Serfling, 129, 142, 157, 189 Shapiro, 156, 171, 189 Sheather, 43, 110, 185, 189 Shorack, 142, 189 Siegel, 109, 189 Sievers, 36, 189 Simpson, 112, 185, 189 Singh, 187 Small, 129, 189 Smith, 183 Stahel, 132, 184, 187, 189 Staudte, 43, 189 Steiger, 182 Stein, 133, 139, 185, 189 Stigler, 30, 189 Strawderman, 139, 182, 184 Su, 184 Tableman, 110, 189 Tassi, 43, 187 Tukey, 31, 84, 154, 181, 189 Tyler, 132, 137, 186, 189 Vajda, 13, 187, 189 van der Vaart, 142, 189 van Driessen, 135, 137, 188 van Eeden, 84, 189 van Zomeren, 135, 188 Vardi, 129, 137, 139, 140, 186, 190 Venables, 81, 135, 137, 190 Venter, 156, 183 Veraverbeke, 185 V´ıˇsek, 110, 181, 190 von Bahr, 61, 190 von Mises, 11, 190 Weisberg, 183 Wellner, 142, 181, 189 Welsch, 100, 181, 186 Welsh, 106, 111, 112, 186, 190
197 Wiens, 190 Wilk, 156, 171, 189 Witting, 142, 190 Wu, 183 Yohai, 100, 101, 110, 132, 181, 187, 188, 190 Yu, 73, 182 Zamar, 187 Zhang, 129, 139, 140, 190 Zheng, 190 Z´ızler, 183 Zolotarev, 13, 190 Zuo, 129, 190