R(θ, δ_{r,s}) =  b Φ̄(√n(s − θ)),          θ < 0,
                 c [Φ(√n r) + Φ̄(√n s)],   θ = 0,
                 b Φ(√n(r − θ)),           θ > 0,

where Φ̄ = 1 − Φ, and Φ is the N(0, 1) distribution function.

(b) Plot the risk function when b = c = 1, n = 1 and (i) r = −s = −1, (ii) r = −s = −1/2. For what values of θ does the procedure with r = −s = −1 have smaller risk than the procedure with r = −s = −1/2?

4. Stratified sampling. We want to estimate the mean μ = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e.g., geographic
locations or age groups). Within the jth stratum we have a sample of i.i.d. random variables X_{1j}, ..., X_{n_j j}, j = 1, ..., s, and a stratum sample mean X̄_j, j = 1, ..., s. We assume that the s samples from different strata are independent. Suppose that the jth stratum has 100 p_j % of the population and that the jth stratum population mean and variance are μ_j and σ_j². Let N = Σ_{j=1}^s n_j and consider the two estimators

μ̂₁ = N⁻¹ Σ_{j=1}^s Σ_{i=1}^{n_j} X_{ij},    μ̂₂ = Σ_{j=1}^s p_j X̄_j,

where we assume that p_j, 1 ≤ j ≤ s, are known.

(a) Compute the biases, variances, and MSEs of μ̂₁ and μ̂₂. How should n_j, 1 ≤ j ≤ s, be chosen to make μ̂₁ unbiased?

(b) Neyman allocation. Assume that 0 < σ_j < ∞, 1 ≤ j ≤ s, are known (estimates will be used in a later chapter). Show that the strata sample sizes that minimize MSE(μ̂₂) are given by

n_k = (p_k σ_k / Σ_{j=1}^s p_j σ_j) N,   1 ≤ k ≤ s.                (1.7.3)

Hint: You may use a Lagrange multiplier.

(c) Show that MSE(μ̂₁) with n_k = p_k N minus MSE(μ̂₂) with n_k given by (1.7.3) is N⁻¹ Σ_{j=1}^s p_j (σ_j − σ̄)², where σ̄ = Σ_{j=1}^s p_j σ_j.
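The following is a minimal numerical sketch of (b) and (c), assuming Python with NumPy; the stratum proportions, standard deviations, and total sample size below are hypothetical illustrations, not values from the text.

```python
import numpy as np

# Hypothetical strata: proportions p_j, SDs sigma_j, total sample size N.
p = np.array([0.3, 0.5, 0.2])
sigma = np.array([2.0, 1.0, 4.0])
N = 100

# Neyman allocation (1.7.3) versus proportional allocation n_k = p_k * N.
n_neyman = N * p * sigma / np.sum(p * sigma)
n_prop = N * p

# mu_hat_2 = sum_j p_j * Xbar_j is unbiased, so its MSE is sum_j p_j^2 sigma_j^2 / n_j.
def mse_mu2(n):
    return np.sum(p**2 * sigma**2 / n)

print("Neyman:      ", n_neyman, " MSE =", mse_mu2(n_neyman))
print("Proportional:", n_prop,   " MSE =", mse_mu2(n_prop))

# Part (c): the difference should equal N^{-1} sum_j p_j (sigma_j - sigma_bar)^2,
# with sigma_bar = sum_j p_j sigma_j.
sigma_bar = np.sum(p * sigma)
print("claimed:", np.sum(p * (sigma - sigma_bar)**2) / N,
      " actual:", mse_mu2(n_prop) - mse_mu2(n_neyman))
```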
5. Let X̄_b and X̂_b denote the sample mean and the sample median of the sample X₁ − b, ..., X_n − b. If the parameters of interest are the population mean and median of X_i − b, respectively, show that MSE(X̄_b) and MSE(X̂_b) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift).

6. Suppose that X₁, ..., X_n are i.i.d. F, that X̂ is the median of the sample, and that n is odd. We want to estimate "the" median ν of F, where ν is defined as a value satisfying P(X ≤ ν) ≥ ½ and P(X ≥ ν) ≥ ½.

(a) Find the MSE of X̂ when
(i) F is discrete with P(X = a) = P(X = c) = p, P(X = b) = 1 − 2p, 0 < p < ½, a < b < c. Hint: Use Problem 1.3.5. The answer is

MSE(X̂) = [(a − b)² + (c − b)²] P(S ≥ k)

where k = .5(n + 1) and S ~ B(n, p).
(ii) F is uniform, U(0, 1). Hint: See Problem B.2.9.
(iii) F is normal, N(0, 1), n = 1, 5, 25, 75. Hint: See Problem B.2.13. Use a numerical integration package.

(b) Compute the relative risk RR = MSE(X̂)/MSE(X̄) in question (i) when b = 0, a = −Δ, c = Δ, p = .20, .40 and n = 1, 5, 15.

(c) Same as (b) except when n = 15, plot RR for p = .1, .2, .3, .4, .45.

(d) Find E|X̂ − b| for the situation in (i). Also find E|X̄ − b| when n = 1 and 2 and compare it to E|X̂ − b|.

(e) Compute the relative risks MSE(X̂)/MSE(X̄) in questions (ii) and (iii).

7. Let X₁, ..., X_n be a sample from a population with values θ − 2Δ, θ − Δ, θ, θ + Δ, θ + 2Δ; Δ > 0. Each value has probability .2. Let X̄ and X̂ denote the sample mean and median. Suppose that n is odd.

(a) Find MSE(X̂) and the relative risk RR = MSE(X̂)/MSE(X̄).

(b) Evaluate RR when n = 1, 3, 5. Hint: By Problem 1.3.5, set θ = 0 without loss of generality. Next note that the distribution of X̂ involves Bernoulli and multinomial trials.
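For Problem 6(a)(iii), the text only asks for "a numerical integration package"; a minimal sketch assuming Python with NumPy/SciPy (the library choice is ours) integrates x² against the exact density of the middle order statistic of a N(0, 1) sample.

```python
import numpy as np
from math import comb
from scipy import stats, integrate

# MSE of the sample median of n (odd) i.i.d. N(0,1) observations: the median is the
# k-th order statistic with k = (n+1)/2, whose density is k*C(n,k)*F^{k-1}(1-F)^{n-k} f.
def median_mse_normal(n):
    k = (n + 1) // 2
    c = k * comb(n, k)                      # = n! / ((k-1)! (n-k)!)
    def integrand(x):
        F, f = stats.norm.cdf(x), stats.norm.pdf(x)
        return x**2 * c * F**(k - 1) * (1 - F)**(n - k) * f
    val, _ = integrate.quad(integrand, -np.inf, np.inf)
    return val

for n in [1, 5, 25, 75]:
    # For comparison, MSE of the sample mean is 1/n; for large n, MSE(median) is about pi/(2n).
    print(n, median_mse_normal(n), 1 / n)
```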
8. Let X₁, ..., X_n be a sample from a population with variance σ², 0 < σ² < ∞.

(a) Show that s² = (n − 1)⁻¹ Σ_{i=1}^n (X_i − X̄)² is an unbiased estimator of σ². Hint: Write (X_i − X̄)² = ([X_i − μ] − [X̄ − μ])², then expand (X_i − X̄)² keeping the square brackets intact.

(b) Suppose X_i ~ N(μ, σ²).
(i) Show that MSE(s²) = 2(n − 1)⁻¹ σ⁴.
(ii) Let σ̂²_c = c Σ_{i=1}^n (X_i − X̄)². Show that the value of c that minimizes MSE(σ̂²_c) is (n + 1)⁻¹.

Hint for question (b): Recall (Theorem B.3.3) that (n − 1)s²/σ² has a χ²_{n−1} distribution. You may use the fact that E(X_i − μ)⁴ = 3σ⁴.
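A small sketch of part (b)(ii), assuming Python with NumPy; the sample size n = 10, σ² = 1, and the Monte Carlo design are hypothetical choices, not part of the problem.

```python
import numpy as np

# Exact MSE of c * S with S = sum (X_i - Xbar)^2 and S/sigma^2 ~ chi^2_{n-1}:
# MSE(c*S) = sigma^4 * ((n^2 - 1) c^2 - 2 (n - 1) c + 1), minimized at c = 1/(n+1).
def mse(c, n, sigma2=1.0):
    return sigma2**2 * ((n**2 - 1) * c**2 - 2 * (n - 1) * c + 1)

n = 10
for label, c in [("1/(n-1)", 1 / (n - 1)), ("1/n", 1 / n), ("1/(n+1)", 1 / (n + 1))]:
    print(label, mse(c, n))

# Monte Carlo confirmation of the same ordering (200000 hypothetical replicates).
rng = np.random.default_rng(0)
x = rng.normal(size=(200_000, n))
S = ((x - x.mean(axis=1, keepdims=True))**2).sum(axis=1)
for label, c in [("1/(n-1)", 1 / (n - 1)), ("1/(n+1)", 1 / (n + 1))]:
    print(label, np.mean((c * S - 1.0)**2))
```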
9. Let θ denote the proportion of people working in a company who have a certain characteristic (e.g., being left-handed). It is known that in the state where the company is located, 10% have the characteristic. A person in charge of ordering equipment needs to estimate θ and uses

θ̂ = (.2)(.10) + (.8)p̂

where p̂ = X/n is the proportion with the characteristic in a sample of size n from the company. Find MSE(θ̂) and MSE(p̂). If the true θ is θ₀, for what θ₀ is MSE(θ̂)/MSE(p̂) < 1? Give the answer for n = 25 and n = 100.

10. In Problem 1.3.3(a) with b = c = 1 and n = 1, suppose θ is discrete with frequency function π(0) = π(−½) = π(½) = ⅓. Compute the Bayes risk of δ_{r,s} when (a) r = −s = −1, (b) r = −s = −1/2. Which one of the rules is the better one from the Bayes point of view?
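Returning to Problem 9, a minimal sketch of the requested MSE comparison, assuming Python with NumPy; the grid of θ₀ values is an arbitrary choice.

```python
import numpy as np

# theta_hat = (.2)(.10) + (.8) p_hat with p_hat = X/n, X ~ Binomial(n, theta).
# Bias(theta_hat) = .02 - .2*theta, Var(theta_hat) = .64*theta*(1-theta)/n.
def mse_theta_hat(theta, n):
    return (0.02 - 0.2 * theta)**2 + 0.64 * theta * (1 - theta) / n

def mse_p_hat(theta, n):
    return theta * (1 - theta) / n

for n in (25, 100):
    theta = np.linspace(0.01, 0.99, 99)
    ratio = mse_theta_hat(theta, n) / mse_p_hat(theta, n)
    good = theta[ratio < 1]
    print(n, "ratio < 1 roughly for theta0 in [%.2f, %.2f]" % (good.min(), good.max()))
```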
11. A decision rule δ is said to be unbiased if

E_θ(l(θ, δ(X))) ≤ E_θ(l(θ′, δ(X)))

for all θ, θ′ ∈ Θ.

(a) Show that if θ is real and l(θ, a) = (θ − a)², then this definition coincides with the definition of an unbiased estimate of θ.

(b) Show that if we use the 0 − 1 loss function in testing, then a test function is unbiased in this sense if, and only if, the power function, defined by β(θ, δ) = E_θ(δ(X)), satisfies

β(θ′, δ) ≥ sup{β(θ, δ) : θ ∈ Θ₀}

for all θ′ ∈ Θ₁.
12. In Problem 1.3.3, show that if c ≤ b, r ≤ 0, and s > 0, then δ_{r,s} is unbiased.
13. A (behavioral) randomized test of a hypothesis H is defined as any statistic φ(X) such that 0 ≤ φ(X) ≤ 1. Suppose that U ~ U(0, 1) and is independent of X. Consider the following randomized test δ: Observe U; if U = u, use the test δ_u, where δ_u(X) = 1 if φ(X) ≥ u and δ_u(X) = 0 otherwise. Show that δ agrees with φ in the sense that

P_θ[δ(X) = 1] = 1 − P_θ[δ(X) = 0] = E_θ(φ(X)).
(b) Show directly that the median ν of Y minimizes E(|Y − c|) as a function of c.

9. If Y and Z are any two random variables, exhibit a best predictor of Y given Z for mean absolute prediction error.

10. Suppose that Z has a density p, which is symmetric about c, that is, p(c + z) = p(c − z) for all z. Show that c is a median of Z.
11. Show that if (Z, Y) has a bivariate normal distribution, the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error.

12. Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables. Suppose that Z, Y are measurements on such a variable for a randomly selected father and son. Let Z′, Z″, Y′, Y″ be the corresponding genetic and environmental components, Z = Z′ + Z″, Y = Y′ + Y″, where (Z′, Y′) have a N(μ, μ, σ², σ², ρ) distribution and Z″, Y″ are N(ν, τ²) variables independent of each other and of (Z′, Y′).

(a) Show that the relation between Z and Y is weaker than that between Z′ and Y′; that is, |Corr(Z, Y)| < |ρ|.

(b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z′ to predict Y′.

13. Suppose that Z has a density p, which is symmetric about c and which is unimodal; that is, p(z) is nonincreasing for z > c.

(a) Show that P[|Z − t| ≤ s] is maximized as a function of t for each s > 0 by t = c.

(b) Suppose (Z, Y) has a bivariate normal distribution. Suppose that if we observe Z = z and predict μ(z) for Y, our loss is 1 unit if |μ(z) − Y| > s, and 0 otherwise. Show that the predictor that minimizes our expected loss is again the best MSPE predictor.
14. Let Z₁ and Z₂ be independent and have exponential distributions with density λe^{−λz}, z > 0. Define Z = Z₂ and Y = Z₁ + Z₁Z₂. Find

(a) The best MSPE predictor E(Y | Z = z) of Y given Z = z
(b) E(E(Y | Z))
(c) Var(E(Y | Z))
(d) Var(Y | Z = z)
(e) E(Var(Y | Z))
(f) The best linear MSPE predictor of Y based on Z = z.

Hint: Recall that E(Z₁) = E(Z₂) = 1/λ and Var(Z₁) = Var(Z₂) = 1/λ².

15. Let μ(z) = E(Y | Z = z). Show that

Var(μ(Z))/Var(Y) = Corr²(Y, μ(Z)) = max_g Corr²(Y, g(Z))

where g(Z) stands for any predictor.
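A quick Monte Carlo check of the identities behind Problems 14 and 15 — E(E(Y | Z)) = E(Y) and Var(Y) = E(Var(Y | Z)) + Var(E(Y | Z)) — assuming Python with NumPy; λ = 2 and the replication count are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0
z1 = rng.exponential(1 / lam, size=500_000)
z2 = rng.exponential(1 / lam, size=500_000)
z, y = z2, z1 * (1 + z2)                      # Z = Z2, Y = Z1 + Z1*Z2 = Z1*(1 + Z2)

mu_z = (1 + z) / lam                          # E(Y | Z = z) = (1 + z)/lambda
var_z = (1 + z)**2 / lam**2                   # Var(Y | Z = z) = (1 + z)^2/lambda^2

print("E(Y)   =", y.mean(), "   E(E(Y|Z)) =", mu_z.mean())
print("Var(Y) =", y.var(),  "   E(Var(Y|Z)) + Var(E(Y|Z)) =", var_z.mean() + mu_z.var())
```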
16. Show that ρ²_{ZY} = Corr²(Y, μ_L(Z)) = max_{g∈L} Corr²(Y, g(Z)), where L is the set of linear predictors.
17. One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio η²_{ZY}; that is,

η²_{ZY} = 1 − E[Y − μ(Z)]²/Var(Y) = Var(μ(Z))/Var(Y).

(See Pearson, 1905, and Doksum and Samarov, 1995, on estimation of η²_{ZY}.)

(a) Show that η²_{ZY} ≥ ρ²_{ZY}, where ρ²_{ZY} is the population multiple correlation coefficient of Remark 1.4.3. Hint: See Problem 1.4.15.

(b) Show that if Z is one-dimensional and h is a 1-1 increasing transformation of Z, then η²_{h(Z)Y} = η²_{ZY}. That is, η² is invariant under such h.
(c) Let ε_L = Y − μ_L(Z) be the linear prediction error. Show that, in the linear model of Remark 1.4.4, ε_L is uncorrelated with μ_L(Z) and η²_{ZY} = ρ²_{ZY}.

18. Predicting the past from the present. Consider a subject who walks into a clinic today, at time t, and is diagnosed with a certain disease. At the same time t a diagnostic indicator Z₀ of the severity of the disease (e.g., a blood cell or viral load measurement) is obtained. Let S be the unknown date in the past when the subject was infected. We are interested in the time Y₀ = t − S from infection until detection. Assume that the conditional density of Z₀ (the present) given Y₀ = y₀ (the past) is N(μ + βy₀, σ²), where μ and σ² are the mean and variance of the severity indicator Z₀ in the population of people without the disease. Here βy₀ gives the mean increase of Z₀ for infected subjects over the time period y₀; β > 0, y₀ > 0. It will be convenient to rescale the problem by introducing Z = (Z₀ − μ)/σ and Y = βY₀/σ.

(a) Show that the conditional density f(z | y) of Z given Y = y is N(y, 1).

(b) Suppose that Y has the exponential density

π(y) = λ exp{−λy},   λ > 0, y > 0.
Show that the conditional distribution of Y (the past) given Z = z (the present) has density

π(y | z) = (2π)^{−1/2} c⁻¹ exp{−½[y − (z − λ)]²},   y > 0,

where c = Φ(z − λ). This density is called the truncated (at zero) normal, N(z − λ, 1), density. Hint: Use Bayes rule.
(c) Find the conditional density π₀(y₀ | z₀) of Y₀ given Z₀ = z₀.

(d) Find the best predictor of Y₀ given Z₀ = z₀ using mean absolute prediction error E|Y₀ − g(Z₀)|. Hint: See Problems 1.4.7 and 1.4.9.

(e) Show that the best MSPE predictor of Y given Z = z is

E(Y | Z = z) = c⁻¹ φ(λ − z) − (λ − z).

(In practice, all the unknowns, including the "prior" π, need to be estimated from cohort studies; see Berman, 1990, and Normand and Doksum, 2000.)

19. Establish (1.4.14) by setting the derivatives of R(a, b) equal to zero, solving for (a, b), and checking convexity.
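A short numerical check of the formula in Problem 18(e), assuming Python with SciPy; the values of λ and z are arbitrary.

```python
import numpy as np
from scipy import stats, integrate

# Compare E(Y | Z = z) = Phi(z - lam)^{-1} * phi(lam - z) - (lam - z) with direct
# numerical integration against the N(z - lam, 1) density truncated to y > 0.
lam, z = 1.5, 2.0
m = z - lam                                # mean of the untruncated normal
c = stats.norm.cdf(m)                      # normalizing constant P(N(m,1) > 0)

formula = stats.norm.pdf(lam - z) / c - (lam - z)
numeric, _ = integrate.quad(lambda y: y * stats.norm.pdf(y - m) / c, 0, np.inf)
print(formula, numeric)                    # the two values should agree
```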
20. Let Y be the number of heads showing when X fair coins are tossed, where X is the number of spots showing when a fair die is rolled. Find (a) The mean and variance of Y.
(b) The MSPE of the optimal predictor of Y based on X.
(c) The optimal predictor of Y given X = x, x = 1, ..., 6.
21. Let Y be a vector and let r(Y) and s(Y) be real valued. Write Cov[r(Y), s(Y) | z] for the covariance between r(Y) and s(Y) in the conditional distribution of (r(Y), s(Y)) given Z = z.

(a) Show that if Cov[r(Y), s(Y)] < ∞, then

Cov[r(Y), s(Y)] = E{Cov[r(Y), s(Y) | Z]} + Cov{E[r(Y) | Z], E[s(Y) | Z]}.

(b) Show that (a) is equivalent to (1.4.6) when r = s.

(c) Show that if Z is real, Cov[r(Y), Z] = Cov{E[r(Y) | Z], Z}.

(d) Suppose Y₁ = a₁ + b₁Z₁ + W and Y₂ = a₂ + b₂Z₂ + W, where Y₁ and Y₂ are responses of subjects 1 and 2 with common influence W and separate influences Z₁ and Z₂, where Z₁, Z₂ and W are independent with finite variances. Find Corr(Y₁, Y₂) using (a).
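A Monte Carlo sketch for part (d), assuming Python with NumPy; all coefficients and variances below are hypothetical, and the normal distributions are chosen only for convenience of simulation.

```python
import numpy as np

# By (a)/(c), Cov(Y1, Y2) = Var(W) since Z1, Z2, W are independent, so
# Corr(Y1, Y2) = Var(W) / sqrt(Var(Y1) * Var(Y2)).
rng = np.random.default_rng(2)
a1, b1, a2, b2 = 1.0, 2.0, -1.0, 0.5        # hypothetical coefficients
v_z1, v_z2, v_w = 1.0, 3.0, 2.0             # hypothetical variances of Z1, Z2, W

z1 = rng.normal(0, np.sqrt(v_z1), 10**6)
z2 = rng.normal(0, np.sqrt(v_z2), 10**6)
w = rng.normal(0, np.sqrt(v_w), 10**6)
y1, y2 = a1 + b1 * z1 + w, a2 + b2 * z2 + w

theory = v_w / np.sqrt((b1**2 * v_z1 + v_w) * (b2**2 * v_z2 + v_w))
print(theory, np.corrcoef(y1, y2)[0, 1])
```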
(e) In the preceding model (d), if b₁ = b₂ and Z₁, Z₂ and W have the same variance σ², we say that there is a 50% overlap between Y₁ and Y₂. In this case what is Corr(Y₁, Y₂)?

(f) In model (d), suppose that Z₁ and Z₂ are N(μ, σ²) and W ~ N(μ₀, σ₀²). Find the optimal predictor of Y₂ given (Y₁, Z₁, Z₂).
22. In Example 1.4.3, show that the MSPE of the optimal predictor is σ²_Y(1 − ρ²_{ZY}).

23. Verify that solving (1.4.15) yields (1.4.14).

24. (a) Let w(y, z) be a positive real-valued function. Then [y − g(z)]²/w(y, z) = δ_w(y, g(z)) is called weighted squared prediction error. Show that the mean weighted squared prediction error is minimized by μ₀(Z) = E₀(Y | Z), where E₀ is computed under the density p₀(y, z) = c p(y, z)/w(y, z) and c is the constant that makes p₀ a density. Assume that E δ_w(Y, g(Z)) < ∞ for some g and that p₀ is a density.

(b) Suppose that given Z = z, Y ~ B(n, z), n ≥ 2, and suppose that Z has the beta, β(r, s), density. Find μ₀(Z) when (i) w(y, z) = 1, and (ii) w(y, z) = z(1 − z), 0 < z < 1.
25. Show that EY² < ∞ if and only if E(Y − c)² < ∞ for all c.
Hint: Whatever be Y and c,

½Y² − c² ≤ (Y − c)² = Y² − 2cY + c² ≤ 2(Y² + c²).
Problems for Section 1.5

1. Let X₁, ..., X_n be a sample from a Poisson, P(θ), population where θ > 0.
(a) Show directly that Σ_{i=1}^n X_i is sufficient for θ.
(b) Establish the same result using the factorization theorem.

2. Let n items be drawn in order without replacement from a shipment of N items of which Nθ are bad. Let X_i = 1 if the ith item drawn is bad, and = 0 otherwise. Show that Σ_{i=1}^n X_i is sufficient for θ directly and by the factorization theorem.

3. Suppose X₁, ..., X_n is a sample from a population with one of the following densities.
(a) p(x, θ) = θx^{θ−1}, 0 < x < 1, θ > 0. This is the beta, β(θ, 1), density.
(b) p(x, θ) = θa x^{a−1} exp(−θx^a), x > 0, θ > 0, a > 0. This is known as the Weibull density.
(c) p(x, θ) = θa^θ/x^{θ+1}, x > a, θ > 0, a > 0. This is known as the Pareto density.
In each case, find a real-valued sufficient statistic for θ, a fixed.
4. (a) Show that T1 and T2 are equivalent statistics if, and only if, we can write T2 = H (T1) for some 1-1 transformation H of the range of T1 into the range of T2. Which of the following statistics are equivalent? (Prove or disprove. )
n� 1 Xt and I:� 1 log Xi , Xi > 0 (c) I:� 1 xi and I:� 1 log Xi, Xi > 0 (d) (l:� 1 xi , l:� 1 x;) and (l:� 1 xi, I:� 1 (xi - x?) (e) (I:� 1 Xi, 2:� 1 xl) and (2:� 1 Xi , 2:� 1 (xi X ) 3). 5. Let B = (B" 82) be a bivariate parameter. Suppose that T1 (X) is sufficient for 81 whenever 82 is fixed and known, whereas T2(X) is sufficient for fh whenever 81 is fixed (b)
-
and known. Assume that (}h B2 vary independently, lh E 81, 82 E 82 and that the set S {x : p(x, B) > 0} does not depend on B. =
(a) Show that if T1 and T2 do not depend one, and 81 respectively, then (T1 (X), T2(X))
is sufficient for e.
(T1 (X) , T,(X)) is sufficient for B, T1 (X) is sufficient for 81 whenever 8z is fixed and known, but Tz(X) is not sufficient for 8z, when (}1 is fixed (b) Exhibit an example in which
and known. 6. Let X take on the specified values v1 , vk with probabilities 81, . . . , 8k, respectively. Suppose that X1 , . . . , Xn are independently and identically distributed as X. Suppose that IJ = (81, , Bk) is unknown and may range over the set 8 = { (B, . . . , Bk) : e, > 0, 1 < i < k, L� 1 8i = 1 }. Let Nj be the number of xi which equal Vj· .
.
.
.
.
1
.
(a) What is the distribution of (Nt, . . . , Nk)? (b) Show that N = (N1, . . . , Nk-t) is sufficient for B.
7. Let X1,
. • . 1
Xn be a sample from a population with density p( x , 8) given by -
p(x, B)
0 otherwise. Here e
=
(!",a) with -oo <
11-
< oo, a > 0.
(a) Show that min (X1 , , Xn) is sufficient for fl when a is fixed. (b) Find a one-dimensional sufficient statistic for a when Jl. is fixed. . • .
(c) Exhibit a two-dimensional sufficient statistic for 8.
8. Let X1, . . , Xn be a sample from some continuous distribution Fwith density J, which is unknown. Treating f as a parameter, show that the order statistics X(t ) • . . . , X(n) (cf. Problem B.2.8) are sufficient for f. .
'
''
I !
86
9.
Statistical Models, Goals, and Performance Criteria Let
X1 ,
.
.
•
,
Chapter 1
Xn be a sample from a population with density
fo (x)
a(O) h (x)
if O I
< x < o,
0 othetwise where h(x)
>
0, 0
= (OI , O, ) with -oo
<
OI
< 02 <
oo, and a(O) =
I [J:,' h(x)dx] �
is assumed to exist. Find a two-dimensional sufficient statistic for this problem and apply
02] family of distribution s. 10. Suppose X1, , . . , Xn are U.d. with density f(x, 8) = �e-l.x-S]_ Show that (X{l)• . . . , X(n)). the order statistics, are minimal sufficient. Hint: t, Lx (O) = - L� I sgn(X, - 0), 0 rt {XI, . . . , Xn }, which determines X( I ) , · · · , X(n)• 11. Let X1 , X2 , . . . , Xn be a sample from the unifonn, U(O,B), distribution. Show that X(n) = max{ Xii 1 < i < n} is minimal sufficient for 0. 12. Dynkin, Lehmann, Schejfi 's Theorem. Let P = {Po : () E 8} where Po is discrete concentrated on X = {xi, x,, . . . }. Let p(x, 0) = Pe [X = x] = Lx(O) > 0 on X. Show your result to the U[81,
that
fxx(��) is minimial sufficient.
Hint: 13.
Apply the factorization theorem.
Suppose that X =
bution function
(X1,
• . .
F(x). If F(x)
, Xn) is a sample from a population with continuous distri is N(p, u2 ) , T(X) = (X, u2 ), where u2 = n� I l.:(X,
X)2, is sufficient, and S(X) = (X(1 1 , . . . , X(n1 ), where X(, 1 = (X(i) - X)ju, is "irrel evant" (ancillary) for (J.L, a2 ). However, S(X) is exactly what is needed to estimate the
.
!
'
"shape" of
'
class F = iff
F(x) when F(x) is unknown . The shape ofF is represented by the equivalence {F((· - a)fb) b > 0, a E R). Thus a distribution G has the same shape as F
G E F.
function
:
For instance, one "estimator" of this shape is the scaled empirical distribution
�
F,(x)
jfn,
x(n :S x < x(i+l)• j
=
1, .
..,n-1
x < x(1 ) 1, X > x(n)· � Show that for fixed x, F,((x - x )fu) converges in probability to F(x). Here we are using F to represent F because every member ofF can be obtained from F. 0,
:I
'I
14. Kolmogorov's Theorem.
"
'. •
9,
We are given a regular model with 9 finite.
(a) Suppose that a statistic T(X) has the property that for any prior distribution on the posterior distribution of
9 depends on x only through T(x).
Show that T(X) is
sufficient. (b) Conversely show that if T(X) is sufficient, then, for any prior distribution, the
posterior distribution depends on x only through T
(x) .
i
j l
'
Section 1.7
Problems
87
and Complements
Hint: Apply the factorization theorem. 15. Let Xh . , Xn be a sample from f(x 0), () E R. Show that the order statistics arc minimal sufficient when f is the density Cauchy f(t) � 1 /Jr( l + t2). 16. Let X Xrn; Y1 , . . , Y,l be independently distributed according to N(p,, 0"2) and . .
1,
.
· -
.
•
,
.
N(TJ, r2), respectively. Find minimal sufficient statistics for the following three cases: (i) p,, TJ, (ii)
p,,
17 < oo,
0 < a, T.
a = T and p,, 17, O" arc arbitrary.
(iii) p, =
17.
a, T are arbitrary: -oo < 1] and p,, a, T are arbitrary.
In Example 1.5.4, express t1 as a function of Lx(O,
I)
and
Lx(l, 1).
Problems to Section 1.6 1. Prove the assertions of Table 1 . 6. 1. 2.
Suppose
X1 ,
• • •
, Xn is as in Problem 1.5.3. In each of the cases (a), (b) and (c), show
that the distribution of X forms a one-parameter exponential family. Identify ,.,, B, T, and
h.
3.
Let X be the number of failures before the first success in a sequence of Bernoulli trials
with probability of success 9. Then Pe [X � k] �
the
geometric distribution (9 (B)).
(I - 9)'9, k � 0, I, 2, .
.
. This is called
(a) Show that the family of geometric distributions is a one-parameter exponential fam
ily with
(b)
T(x)
�
x.
Deduce from Theorem 1.6.1 that if X�o
. . . , Xn is a sample from 9(9), then the
2.:� 1 Xi form a one-parameter exponential family. (c) Show that E� xi in part (b) has a negative binomial distribution with parameters
distributions of
l
(n,9) defined by Pe iL:7 1 X, = k] �
( � ) n
+
- l
( 1 - 9)'9", k = 0 , 1 , 2 , . . . (The
negative binomial distribution is that of the number of failures before the nth success in a sequence of Bernoulli trials with probability of success
0.) Hint: By Theorem 1.6.1, Pe [L� 1 X, = k] = c,(l - 9)'0", 0 < 9 < I. If =
"\" � c,w• =
k=O
I
(1
-w
)n
, 0 < w < I,
then
Ck �
I d' n w ) (I k! fiw k w=O
Which of the following families of distributions are exponential families? (Prove or disprove.)
4.
(a) The U(O, 9) fumily
88
Statistical Models, Goals, and Performance Criteria
Chapter 1
(b)p(." , O) = {exp[- 2 log 0 + log(2x)]}l[x E (0,0)] (c) p(x, O) = �. x E { 0. 1 +0, . . . , 0.9 + 0) (d) The N(O, 02) family, 0 > 0 (e) p( x, O ) = 2(x + 0)/(1 + 20), 0 < x < I, 0 > 0
(f) p(x, 9) is the conditional frequency function of a binomial, B(n, 0), variable X,
given that X > 0.
5. Show that the following families of distributions are two-parameter exponential families and identify the functions 1], B, T, and h.
(a) The beta family.
(b) The gamma family.
6. Let X have the Dirichlet distribution, D( a) , of Problem 1.2.15. Show the distribution of X form an r-parameter exponential family and identify fJ, B, T, and h. 7. Let X = ( (X 1 , Y1 ), . . . , (X., Yn)) be a sample from a bivariate normal population. Show that the distributions of X form a five-parameter exponential family and identify 'TJ, B, T, and h. 8. Show that the family of distributions of Example 1.5.3 is not a one parameter CX(Xlnential family. Hint: If it were, there would be a set A such that p(x, 0) > 0 on A for all 0. 9. Prove the analogue of Theorem 1.6.1 for discrete k-parameter exponential families. 10. Suppose that f(x, B) is a positive density on the real line, which is continuous in x for each 0 and such that if (X1, X2 ) is a sample of size 2 from f (·, 0), then X1 + X2 is sufficient for B. Show that /(·, B) corresponds to a one-arameter exponential family of distributions with T(x) = x. Hint: There exist functions g(t, 0), h(x�, x2) such that log f(x�, 0) + log j(x2, 0) = g(x1 + x2, 0) + h(x1, x2). Fix Oo and let r(x, 0) = log f(x, 0) - log f(x, Oo), q(x, 0) = g(x,O) - g(x,Oo). Then, q(x 1 + x2,0) = r(x, , O) + r(x,, O), and hence, [r(x� , O) r(O, 0)] + [r(x2, 0) - r(O, 0)] = r(x1 + x2, 0) - r(O, 0).
' '
I
11. Use Theorems 1.6.2 and 1.6.3 to obtain moment-generating functions for the sufficient statistics when sampling from the following distributions.
(a) normal, () = (p, a2) (b) gamma, r(p, e = >., p fixed
>.),
(c) binomial (d) Poisson (e) negative binomial (see Problem 1.6.3) () = (p, (0 gamma, r(p,
>.),
.
-
>.) .
�---
------
II
'
Section 1. 7
89
Problems and Complements
12. Show directly using the definition of the rank of an ex}X)nential family that the multi nomial distribution, M(n; B,, . . . , B.) , O < B; < 1, 1 < j < k. L:�" 1 B; = 1, is of rank
k - 1.
13. Show that in Theorem 1.6.3, the condition that E has nonempty interior is equivalent to the condition that £ is not contained in any ( k I)-dimensional hyperplane . �
•
14. Construct an exponential family of rank k for which £ is not open and A is not defined on all of t:. Show that if k = 1 and &0 # 0 and A, A are defined on all of t:, then Theorem 1.6.3 continues to hold. 15. Let P = {Po : B E 8} where Po is discrete and concentrated on X = {x 1 , x2, . . . ) , and let p(x, B) = Po [X = x]. Show that if P is a (discrete) canonical exponential family generated b(, (T, h) and &0 # 0, then T is minimal sufficient. Hint: �;,• Lx(TJ) = T; (X) - E'IT;(X). Use Problem 1.5.12. 16. Life testing. Let X1, . . . , Xn be independently distributed with exponential density < < < 2 1 the > ordered X's be denoted by Y1 Y 2 0 for 0, and let · · · e-xl (28)x Y It is assumed that Y1 becomes available first, then Yz, and so on, and that observation is continued until Yr has been observed. This might arise, for example, in life testing where each X measures the length of life of, say, an electron tube, and n tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where n is the number of atoms, and observation is continued until r a-particles have been emitted. Show that ..
(i) The joint distribution of Y1 ,
1
•
[ p
n!
(28)' (n - r)! ex (ii) The distribution of II: :
l
.
.
, Yr is an exponential family with density
L::� r Yi + (n - r)yr 28
-
]
< < < , 0Yl - . . . - Yr ·
2 + (n r)Yrl/B is x with 2r degrees of freedom. Y;
(iii) Let Yi , Y2 , denote the time required until the first, second, . . . event occurs in a Poisson process with parameter 1/28' (see A.16). Then Zr = Y1/8', Z2 = (Y2 Y1 )/8', z, = (Y3 - Y2)/8', . . . are independently distributed as x2 with 2 degrees , Yr is an exponential family with density of freedom, and the joint density of Y1 , •
.
.
•
1
(28')
r) ( Y ' exp 28'
•
.
'
0<
Yl
< ... <
Yr·
The distribution of Yr/B' is again x2 with 2r degrees of freedom. (iv) The same model arises in the application to life testing if the number n of tubes is held constant by replacing each burned-out tube with a new one, and if Y1 denotes the time at which the first tube bums out, Y2 the time at which the second tube burns out, and so on, measured from some fixed time.
'
. 90
Statistical Models, Goals, and Performance Criteria
Chapter 1
[(ii): The random variables z, � (n - i + l)(Y; - l� � t )/9 (i � 1, . . . . 1· ) are inde pendently distributed as x2 with 2 degrees of freedom, and [L� 1 Yi + (n - 1')Yr]/B =
I: :� t Z,.]
17. Suppose that (Tk X l , h) generate a canonical exponential family P with parameter 1J kxl and E = Rk. Let (a) Show that Q is the exponential family generated by IIL T and h exp{ cTT}, where lh is the projection matrix of T onto L � { '1 : '1 � BIJ + c } .
(b) Show that ifP has full rank k and B is of rank l, then Hint: If B is of rank l, you may assume
Q has full rank I.
18. Suppose Yt . . . , Yn are independent with Yi "' N(/31 + /32 Zi, a 2 ), where z1 , . . . , Zn are covariate values not all equal. (See Example 1.6.6.) Show that the family has rank 3. Give the mean vector and the variance matrix of T. ,
19. Logistic Regression. We observe
(z1, Y1 ) , . . . , (zn, Yn) where the Y1, , Yn are inde pendent, Yi "-' B(�, .Xi). The success probability .Xi depends on the characteristics zi of .
.
•
the ith subject, for example, on the covariate vector zi = (age, height, blood pressure)T. The function l(u) � log[u/(1 - u)[ is called the logit function. In the logistic linear re gression model it is assumed that l ( .\,) � zf(3 where (3 ((31 , . . . , f3d )T and z, is d x 1. Show that Y (Y1, Yn) T follow an exponential model with rank d iff z1, . . . , zd are not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9). =
• . •
=
,
20. (a) In part II of the proof of Theorem 1.6.4, fill in the details of the arguments that Q is generated by (171 - 'lo)TT and that �(ii) =�(i). (b) Fill in the details of part Ill of the proof of Theorem 1.6.4.
21. Find JJ.('I)
I
EryT(X) for the gamma, r(a, .\), distribution, where 9 � (a, ,\) .
�
Xn be a sample from the k-parameter exponential family distribution 22. Let X 1 , (1.6.10). Let T � (I;� 1 T1(X,), . . . , I;� 1 Tk(Xi)) and let •
I
•
•
,
s
�
{(ry1(1J), . . . , ryk(IJ)) , e E 8).
Show that if S contains a subset of k + 1 vectors v0, . . . , Vk+l so that vi - v0, 1 < i < k, are not collinear (linearly independent), then T is minimally sufficient for 8.
I
. i I, '
,
:
23. Using (1.6.20), find a conjugate family of distributions for the gamma and beta fami lies. (a) With one parameter fixed.
(b) With both parameters free.
Section 1.7
91
Problems and Complements
24. Using (1 .6.20), find a conjugate family of distributions for the normal family using as parameter 0 (01 , 02) where 01 = Eo (X), Oz = 1/(VaroX) (cf. Problem !.2.12). =
25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with cr z known. Find a conjugate family of prior distributions for (/31 , /32) T.
26. Using (1 .6.20), find a conjugate family of distributions for the multinomial distribution. See Problem !.2.15.
27. Let 'P denote the canonical exponential family genrated by T and h. For any TJo E £, set h0(x) = q(x, 'lo ) where q is given by (!.6.9). Show that P is also the canonical exponential family generated by T and h0.
28. Exponentialfamilies are maximum entropy distributions. The entropy h(f) of a random variable X with density f is defined by
h(f) = E(- log f(X))
=
- J: [logf(x)Jf(x)dx.
This quantity arises naturally in information in theory; see Section 2.2.2 and Cover and Thomas (1991). Let S = {x : f(x) > 0}. {a) Show that the canonical k-parameter exponential family density
f(x, 'I) = exp
•
1/0 + L 'l;r; (x) - A( 'I) j:=l
, xES
maximizes h(f) subject to the constraints
f(x)
> 0,
Is f(x)dx = 1, Is f(x)r;(x) = a; , 1 < j < k,
where r-,o, . . , TJk are chosen so that f satisfies the constraints. Hint: You may usc Lagrange multipliers. Maximize the integrand. .
(b) Find the maximum entropy densities when r;(x) = xi and (i) S = (0, oo), k I, <>t > 0; (ii) S R, k = 2 , at E R, az > 0; (iii) S R, k 3, a1 E R, o:z > 0, a, E R. =
=
=
=
29. As in Example 1.6.11, suppose that Y 1 , . , Y are i.i.d. Nv(f.L, E) where f.L varies freely in RP and E ranges freely over the class of all p x p symmetric positive definite matrices. Show that the distribution of Y (Yt, . . . Yn ) is the p(p + 3) /2 canonical exponential family generated by h = l and the p(p + 3)/2 statistics .
n
.
=
,
n
n
i=l
i=I
. . . , Y;p). Show that [ is open and that this family is of rank p(p + 3) /2. Hint: Without loss of generality, take n = 1. We want to show that h = 1 and the m = p(p + 3) /2 statistics T; (Y) = Yj, 1 < j < p, and T;1(Y) Y;Yi, 1 < j < l < p,
where Y;
=
(Yil
,
=
92
Statistical Models, Goals, and Performance Criteria
Chapter 1
generate Nv(J.l, E). As E ranges over all p x p symmetric positive definite matrices. so does E-1• Next establish that for symmetric matrices M,
J exp{-uTMu}du < oo iff M is positive definite
by using the spectral decomposition (see B.10. 1 .2)
p
M = L AjejeJ for e 1 , . . . , ep orthogonal, )..J E R.
j=l
To show that the family has full rank m, use induction on p to show that if Z1, . . . , Zv are i.i.d. N(O, and if Bpxp = (b; is symmetric, then
l)
1)
p P I: a;Z; + L b;1 Z;Z1 =
j,l
j., l
c
= P(aTZ + ZTBz = c) = o
a = 0, B = 0, = 0. Next recall (Appendix B.6) that since Y � Np(/', E), then Y = SZ for some nonsingular p x p matrix S. 30. Show that if X 1 , . ,Xn are i.i.d. Nv(8,E0) given 6 where �0 is known, then the Np(A, f) family is conjugate to N (B Eo), where A varies freely in RP and r ranges over c
unless
.
.
.
,
all p x p symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions. A Hierarchical Bayesian Normal Model. Let {(I';, r; ) : j k} be a given collection of pairs with l'i E R, > 0. Let (J.I., be a random pair with >.; = P((!', u = (!';, r; ) , 0 2::��1 >.; = Let 8 be a random variable whose conditional distribution given (JL, u = (P,J, Tj) is normal, N(p,j, rJ). Consider the model X = 8 + t:, where 8 and e are independent and € N(O, a3) , a� known. Note that 8 has the prior density
1< <
.
•
'
•
' •
I
)
)
< >.; < 1,)
Tj
u)
1.
"'
•
k
1r(B) = L A;'Pr, (B - !'; ) j=l where 'l'r denotes the N(O, uon. •
i� ''' ' ' ': ' •
(1 .7.4)
T2) density. Also note that (X I B) has the N(B, u5) distribu-
(a) Find the posterior k
"
I
•
1r(B I x) = LP((tt,u) (!'j, Tj ) I x)1r(B I (!';,r;),x) j=l =
and write it in the fonn k
L >.;(x)'Pr,(x)(B - !'j(x)) j=l
Section
1.7
93
Problems ;3nd Complements
for appropriate .\J (x ) , TJ (x) and fLJ (x ). This shows that ( 1. 7 .4) defines a conjugate prior for the N(B, 176), distribution.
(b) Let Xi = 8 + Ei, l < i <
where 8 is as previously and E t , . . . , En are i.i.d. N(O, 17�). Find the posterior 7r (B I x , , . . . , Xn ) , and show that it belongs to class (1.7 .4). Hint: Consider the sufficient statistic for p(x I B). n,
32. A Hierarchical Binomial-Beta Model. Let { (r1 , sJ) : 1 < j < k} be a given collection of pairs with r; > 0, s1 > 0, let (R, S) be a random pair with P(R = r1 , S = s;) = >.;,
0 < .\1 < 1, E7= l .\1 = 1, and let 8 be a random variable whose conditional density given R = r, S = s is beta, (J(r, s ) . Consider the model in which (X the binomial, B(n1 e), distribution. Note that e has the prior density
1r(B, r, s)
1r(B) =
I B) has
k
L -\;1r(B, r1 , s1 ) .
( I .7.5)
j=l
Find the posterior
k
1r(B I x) = L P(R = r; , S = s; I x)1r(B I (r; , s;),x) j=l
and show that it can be written in the form L: >-1 (x)1r(B,r;(x),s;(x)) for appropriate >-;(x), r;(x) and s;(x). This shows that (1.7.5) defines a class of conjugate priors for the B( n, B) distribution.
33. Let p(x, ry) be a one parameter canonical exponential family generated by T(x) = x and h(x), x E X C R, and let ,P(x) be a nonconstant, nondecreasing function. Show that E,,P(X) is strictly increasing in
Hint:
ry.
Cov,(,P(X), X)
� E{(X - X')i,P(X) - ,P(X')]) where X and X' are independent identically distributed as 34. Let (Xt 1
•
•
•
, Xn) be a stationary Markov chain with two states 0 and
P[Xi = Ei 1 x1 = c1, . . . , Xi - 1 where (i) (ii)
( P10 Poo
Pol Pn
)
=
1.
That is,
Ei-d = P[Xi = ci 1 xi-1 = Ei-d = p.,._l.,.
is the matrix of transition probabilities. Suppose further that ·
Poo = PII = p, so that, Pw = Pol
P[X1
X (see A.l l.12).
= OJ = P(X1
=
1] = �·
= 1 - p.
94
Statistical
Models,
Goals, and
Performance C riteria
Chapter 1
(a) Show that if 0 < p < 1 is unknown this is a full rank, one-parameter exponential family with T = N00 + N11 where N1j the number of transitions from i to j. For example, 01011 has No1 = 2, N11 = 1, Noo = 0, N10 = I . (b) Show that E(T) = (n - 1)p (by the method of indicators or otherwise). 35. A Conjugate Priorfor the Two-Sample Problem. Suppose that X1 , . . . , Xn and Y1 , . . . , Yn are independent N'(fl1, a2) and N(�J 2 , a2) samples, respectively. Consider the prior 7r for which for some r > 0, k > 0, ro-2 has a x� distribution and given a2, p,1 and fL2 are independent with N(�1 , a2 Ikt) and N(6, a2 I k,) distributions, respectively, where �j E R, kj > 0, j = 1, 2. Show that 1r is a conjugate prior. 36. The inverse Gaussian density, IG(J..t, A), is j(X,Jl., .\)
=
1 [.\I27T] i2x-3i 2 exp{ -.\(x - J1.)2/2J1.2x}, x > 0,
J1.
> 0, .\ > 0.
Show that this is an exponential family generated by T(X) = - ; (X, x- 1 )T and h(x) = (27r) -lf2x- 3/2 (b) Show that the canonical parameters TJt , ry are given by ry1 = fL-2 A, 1J2 = .\, and 2 that A( q1 , 172 ) = (� log(q2 ) + JihijZ) , £ = [O, oo) x (O,oo). (c) Fwd the moment-generating function of T and show that E(X) = Jl., Var(X) Jl.-3 .\, E(x-') = Jl._, + .\_ ,, Var(x-') = (.\Jl.)-' + 2.\ -2. (d) Suppose J1. = Jl.o is known. Show that the gamma family, f(u,,B), is a conjugate pnor. (e) Suppose that .\ = .\0 is known. Show that the conjugate prior formula (1.6.20) produces a function that is not integrable with respect to fL. That is, n defined in (1.6.19) is empty. (f) Suppose that J1. and .\ are both unknown. Show that (1.6.20) produces a function that is not integrable; that is, f! defined in (1.6.19) is empty. 37. Let X1, . . . , Xn be i.i.d. as X � Np(O, �0) where �0 is known. Show that the conjugate prior generated by (1.6.20) is the Afv (q0 , 761) family, where 7Jo varies freely in RP, T6 > 0 and I is the p x p identity matrix. 38, Let X; = (Z;, Y;)T be i.i.d. as X = (Z, Y)T, 1 < i < n, where X has the density of Example 1.6.3. Write the density of X1 , . , Xn as a canonical exponential family and identify T, h, A, and £. Find the expected value and variance of the sufficient statistic. 39. Suppose that Yt , . . . , Yn are independent, Yi N(fLi, a2 ), n > 4. (a) Write the distribution of Y1 , , Yn in canonical exponential family form. Identify T, h, 1), A, and E. (b) Next suppose that fLi depends on the value Zi of some covariate and consider the submodel defined by the map 11 : (11 1 , 112 , II3)T � (p7, a2JT where 11 is determined by /Li = cxp{Bt + Bzzi}, Zt < z2 < · · · < Zn; a 2 = 83 (a)
-
•
'
'P'· ',,' '
.. . ' ..
.
'
H '
......,
, ,, ,.
' ' ,
"
I I' I'
.
.
:
.
•
,I '' '. •
i
I'
··---
------
'
: : •
Section 1.8
95
Notes
where 91 E R, 02 E R, 03 > 0. This model is sometimes used when Jli is restricted to be positive. Show that p(y, 0) as given by ( is a curved exponential family model with l 3.
1. 6.12)
=
40. Suppose Y1 , . . . , Y;.l are independent exponentially, E' ( ,\i), distributed survival times, n > 3. (a) Write the distribution of Y1 , . . . , Yn in canonical exponential family form. Identify T, h, '1. A, and E.
(b) Recall that J..li = E(Yi) = Ai 1 . Suppose /Ji depends on the value Zi of a covariate. Because fti > 0, J.Li is sometimes modeled as
J.Li = cxp{Or + fhzi}, i =
1, . . .
, n
where not all the z's are equal. Show that p(y, fi) as given by nential family model with l =
2.
1.8
(1.6.12) is a curved expo
NOTES
Note for Section 1.1
(1)all dominated For the measure theoreticaHy minded we can assume more generally that the Po are p(x, 9) denotes df;11 , the Radon Nikodym by a a finite measure and derivative.
ft
that
Notes for Section 1,3
� (I) More natural in the sense of measuring the Euclidean distance�between the estimate (} and the "truth" 0. Squared error gives much more weight to those (} that are far away from (} than those close to (}. (2) We define the lower boundary of a convex set simply to be the set of all boundary points r such that the set lies completely on or above any tangent to the set at r. Note for Section 1.4 (I) Source: Hodges, Jr., J. L., D. Kretch, and R. S. Crutchfield. Statlab: An Empirical Introduction to Statistics. New York: McGraw-Hill, 1975. Notes for Section 1.6 (I) Exponential families arose much earlier in the work of Boltzmann in statistical mechan ics as laws for the distribution of the states of systems of particles-see Feynman ( 1 963), for instance. The connection is through the concept of entropy, which also plays a key role in information theory-see Cover and Thomas ( 1991 ).
(2) The restriction that's x E Rq and that these families be discrete or continuous is artifi cial. In general if J.L is a a finite measure on the sample space X, p(x, (}) as given by
(1.6.1)
'
'
96
Statistical Models, Goals, and Performance Criteria
can be taken to be the density of
X with respect to J-L-see Lehmann (1997), for instance.
Chapter 1
This permits consideration of data such as images, positions, and spheres (e.g., the Earth), and so on.
Note for Section 1.7 ( I ) uT Mu > 0 for all p x
1.9
1 vectors u # 0.
REFERENCES
BERGER, J. 0., Statistical Decision Theory and Bayesian Analysis New York:
Springer, 1985.
BERMAN, S. M., ''A Stochastic Model for the Distribution ofHIV Latency Time Based on T4 Counts," Biometika. 77, 733-74 1 (1990).
BICKEL, P. J., "Using Residuals Robustly 6, 266--291 (1978).
1: Tests for Heteroscedasticity, Nonlinearity," Ann. Statist.
BLACKWELL, D. AND M. A. GIRSHICK, Theory of Games and Statistical Decisions New York:
Wiley,
1954.
Box, G. E. P., ''Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discus sion)," J. Royal Statist. Soc. A 143, 383-430 (1979).
BRowN, L., Fundamentals of Statistical Exponential Families with Applications
in Statistical Deci
sion Theory, IMS Lecture Notes-Monograph Series, Hayward, 1986.
CARROLL, R. J. AND D. RuPPERT, Transformation and Weighting in Regression New York:
Chapman
and Hall, 1988.
CoVER, T.
M. AND J. A. THOMAS, Elements of Information Theory New York: Wiley,
1991.
DE GROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1969.
DoKSuM,
K A
AND A. SAMAROV, "Nonparametric Estimation
of Global Functionals and a Measure
of the Explanatory Power of Covariates in Regression,'' Ann. Statist. 23, 1443-1473 ( 1995).
FERGUSON, T. FEYNMAN,
S., Mathematical Statistics New York: Academic Press, 1967.
R. P., The Feynmo.n Lectures on Physics, v. I , R. P. Feynman, R. B. Leighton, and M.
Sands, Eels., Ch. 40 Statistical Mechanics of Physics Reading, MA: Addison-Wesley, 1963.
GRENANDER, U. AND M. ROSENBLATT, Statistical Analysis of Stationary Time Series New York:
Wi
ley, 1957.
HorXJES, JR., 1. L., D. KRETcH AND R. S. CRUTCHFIELD, Statlab: An Empirical introduction to Statis tics New York: McGraw-Hill, 1975.
KENDALL, M. G. AND A. STUART, The Advanced Theory of Statistics, Vols. II, ffi New York:
Hafner
Publishing Co., 1961, 1966.
L6HMANN, E. L., "A Theory of Some Multiple Decision Problems, 1-25, 547-572 (1957).
L6HMANN, E. L., "Model Specification: Statist. Science 5, 160-168 (1990).
LatMANN, E. L., .. . . .' •
'
I
I and II," Ann. Math Statist. 22,
The Views of Fisher and Neyman, and Later Developments,"
Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.
Section 1. 9
97
References
D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.
LINDLEY,
MANDEL, J., The Statistical Analysis of Experimental Data
New York: J. Wiley & Sons, 1964.
A. DOKSUM, "Empirical Bayes Procedures for a Change Point Problem with Application to HIV/AIDS Data," Empirical Bayes and Likelihood Inference, 67-79, Editors: S. E. Ahmed and N. Reid. New York: Springer, Lecture Notes in Statistics, 2000.
NORMAND, S-L. AND K.
K., "On the General Theory of Skew Correlation and Nonlinear Regression," Proc. Roy. Soc. London 71, 303 (1 905). (Draper's Research Memoirs, Dulan & Co, Biometrics Series II.)
PEARSON,
RAIFFA,
H. AND R. SCHLAIFFER, Applied Statistical Decision Theory, Division of Research, Graduate
School of Business Administration, Harvard University, Boston, 1961. SAVAGE., L. J., The Foundations of Statistics, J. Wiley & Sons,
New York, 1954.
SAVAGE., L. J. ET AL., The Foundation of Statistical Inference London: SNEDECOR, G. W. AND W. G.
Press, 1989. B. and Hall, 1986.
WErHERILL, G.
AND K.
Methuen & Co., 1962.
COCHRAN, Statistical Methods, 8th Ed. Ames, lA: Iowa State University
D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman
I I
'
'
Chapter 2
METHODS OF ESTIMATION
2.1 2.1.1
BASIC HEURISTICS O F ESTIMATION Minimum Contrast Estimates; Estimating Equations
P E P, usually parametrized as P = Our basic framework is as before, X E X, X {Po : 8 E 8}. In this parametric case, how do we select reasonable estimates for 8 itself? """
-
That is, how do we find a function B(X) of the vector observation X that in some sense "is close" to the unknown 8? The fundamental heuristic is typically the following. We consider a function that we shall call a contrastfunction
p : X x 8 ---> R and define
E8,P(X, 8). As a function of 8, D(80) 9) measures the (population) discrepancy between 8 and the true value 80 of the parameter. In order for p to be a contrast function we require that D(80 1 9) is uniquely minimized for 8 = 60• That is, if P90 were true and we knew D(Bo, 8) as a function of 8, we could obtain 80 as the minimizer. Of course, we don't know the truth so this is inoperable, but in a very weak sense (unbiasedncss), p(X, 8) is an estimate of D(80, 8). So it is natural to consider 8(X) minimizing p(X, 8). This is the most general form of the minimum contrast estimate we shall consider in the next section. Now suppose 8 is Euclidean c Rd, the true 80 is an interior point of 8, and 8 D(Bo, 8) is smooth. Then we expect D(8o, 8)
-----Jo
v8 D(80, 8)
=
o
(2. 1 . 1 )
where V denotes the gradient,
Arguing heuristically again we are led to estimates B that solve
v8p(X, e)
=
o.
(2.1.2)
The equations (2.1.2) define a special form of estimating equations.
99
100
Chapter 2
Methods of Estimation
More generally, suppose we are given a function W and define V(l1o, 11) � Suppose V (80, 8) =
:
XxR d - Rd, W - ('ljl1,
E11, w(X, 11).
.
.
•
, '1/Jd)T
(2. 1 .3)
�
0 has 80 as its unique solution for all 80 E 8. Then we say 8 solving �
w(X,11)
�o
(2.1 .4)
is an estimating equation estimate. Evidently, there is a substantial overlap between the two classes of estimates. Here is an example to be pursued later.
Example 2.1.1. Least Squares. Consider the parametric version of the regression model of Example 1.1.4 with l'(z) = g({3, z), {3 E Rd, where the function g js known. Here the data function n} where Yi , . . . , Yn are independent. A natural are X = {(zi, Yi) : 1 ,$
(I)
i<
p(X, /3) to consider is the squared Euclidean distance between the vector Y of observed , g({3, Zn))T That is, we take Yi and the vector expectation of Y, JL(z) = (g({3, z 1), •
.
.
n
p(X, {3) = IY - �tl2 =
�)Yi - g({3, z;))'.
i=l
(2. 1 .5)
Strictly speaking P is not fully defined here and this is a point we shall explore later. But, for convenience, suppose we postulate that the Ei of Example 1.1.4 are i.i.d. N(O, Then {3 parametrizes the model and we can compute (see Problem 2.1.16),
a5).
E(3 ,P(X, {3)
D (f3o. {3)
n
no-;l + 2)g({30, z,) - g({3 , z, )]2 ,
(2.1.6)
i=l
J
!
which is indeed minimized at {3 = {30 and uniquely so if and only if the parametrization is � identifiable. An estimate {3 that minimizes p(X, {3) exists if g({3, z) is continuous and
' •
lim{)g(fl, z) l : lf31 �
(Problem 2.1.10). The estimate {3 is called the least squares estimate. � If, further, g({3, z) is differentiable in {3, then {3 satisfies the equation alently the system of estimating eq!.lations,
� � 8g � � 8g � ({3, z,)g(/3, z,), ({3, z;)Y; = LL813 i=l 813 i=l J
I'
oo} = oo
3
In the important linear case,
(2.1.2) or equiv(2.1 .7)
I
I
I
d
g({3, zi) = 'L, ziJ,BJ and zi = (Zit, . . . , Zid)T j=l '
• •
.i
•
•
I
•
Section 2.1
Basic Heuristics o f Estimation
101
the system becomes n
d
i=l
=I
I.>'iY; = L k
n
� L z-t;-z-k t
(2.1.8)
i=l
the normal equations. These equations are commonly written in matrix form (2.1 .9)
where Zv = ll ziJIInx d is the design matrix. Least squares, thus, provides a firs t example of both minimum contrast and estimating equation methods. We return to the remark that this estimating method is well defined even if the ci are not i.i.d. a5). In fact, once defined we have a method of computing a statistic {!J from the n}, which can be judged on its merits whatever the true P data X = {(zi 1 Yi)1 governing X is. This very important example is pursued further in Section 2.2 and Chapter 0 6.
N(O,
1
Here is another basic estimating equation example. Example 2.1.2. Method of Moments (MOM). Suppose X1, ,Xn are i.i.d. as X � P8 , 8 E Rd and 8 is identifiable. Suppose that l't ( 8), . . . , I'd (8) are the first moments of the population we are sampling from. Thus, we assume the existence of . • .
d
Define the jth sample moment /11 by,
1 <j
< d.
To apply the method of moments to the problem of estimating 9, we need to be able to express 9 as a continuous function g of the first moments. Thus, suppose
d
8 � (1'1(8), . . , !'d (O)) .
is from Rd to Rd . The method of moments prescribes that we estimate 9 by the solution of
1-1
�
Pi = l'i (O) , 1 :O j
if it exists. The motivation of this simplest estimating equation example is the law of large numbers: For X ,..._, Po, fiJ converges in probability to flJ(fJ). More generally, if we want to estimate a Rk-valued function q(B) of 9, we obtain a MOM estimate of q( 8) by expressing q(8) as a function of any of the first moments 1'1, . . . , I'd of X, say q(O) = h(l'l, . . . , I'd), > k, and then using h(j.t�, . , /id) as the estimate of q(8).
d
d
.
.
Methods of Estimation
102
Chapter 2
For instance, consider a study in which the survival time X is modeled to have a gamma distribution, f(u, A), with density [.\"/r (a)]x"- 1 exp{ -.\x}, x > O; cx > O, .\ > 0. In this case f) � (u, .\), l'l � E(X) � n /.\, and 1'2 � E(X2) � a(l + u) /.\2 Solving
for (} gives
"�
(M•/a)2 ,
.\ � i'•/a2,
=
iii �
(X/<1 )2 ; ), � X !a'
where o-2 p,2 -p,i and &2 = n -lEX[ - X2• In this example, the method of moment es timator is not unique. We can, for instance, express (} as a function of J..t t and /-L3 = E(X3 ) D and obtain a method of moment estimator based on /11 and ji3 (Problem
2.1.11).
Algorithmic issues
We note that, in general, neither minimum contrast estimates nor estimating equation solutions can be obtained in closed fonn. There are many algorithms for optimization and root finding that can be employed. An algorithm for estimating equations frequently is quick and M used when computation of M(X, ·) = D'I'(X, ·) - J�$' (X, ·JJ dxd J is nonsingular with high probability is the Newton-Raphson algorithm. It is defined by initializing with 90, then setting (2. 1 . 1 0)
.!
• • •
This algorithm and others will be discussed more extensively in Section 2.4 and in Chap ter 6, in particular Problem
6.6.10.
2.1.2
The Plug-In and Extension Principles
We can view the method of moments as an example of what we call the plug-in (or substitu tion) and extension principles, two other basic heuristics particularly applicable in the i.i.d. case. We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments. Example 2.1.3. Frequency Plug-in(2) and Extension. Suppose we observe multinomial trials in which the values v1, Vk of the population being sampled are known, but their respective probabilities p1, . . , Pk are completely unknown. If we let X1, , Xn be i.i.d. as X and Ni = number of indices j such that Xi = Vi, then the natural estimate of Pi = P[X = vi] suggested by the law of large numbers is Njn. the proportion of sample values equal to vi. As an illustration consider a population of men whose occupations fall in one of five different job categories, 3, 4, or 5. Here k 5, vi = i, i = . , 5, Pi is the proportion of men in the population in the ith job category and Njn is the sample proportion in this category. Here is some job category data (Mosteller, I 968). .
.
=
1, ..
•
.
,
...
1, 2,
'
I
I
•
'
Section 2.1
103
Basic Heuristics of Estimation
Job Category l
23
2 84
3 289
4 217
5 95
0.03
0.12
0.41
0.31
0.13
I
•
Ni �
Pi
-
n = 2::,�1 Ni
=
Lio)Pi = j
708
for Danish men whose fathers were in category 3, together with the estimates Pi = Next consider the more general problem of estimating a continuous function
Ni fn.
q(p1,
.
•
•
,
Pk) of the population proportions. The frequency plug-in principle simply proposes to re place the unknown population frequencies p1, . . . , Pk by the observable sample frequencies
Ntfn, . . . , Nkjn.
That is, use
(2. 1 . 1 1 ) to estimate q(p 1 ,
. . .
,Pk) ·
For instance, suppose that in the previous job category table,
categories 4 and 5 correspond to blue-collar jobs, whereas categories white-collar jobs. We would be interested
q(p,, · · · ,Ps)
=
2 and 3 correspond to
in estimating
(p4 + Ps) - (P2
+ P3) ,
the difference in the proportions of blue-collar and white-collar workers. If we use the frequency substitution principle, the estimate is
0.44 - 0.53
=
-0.09. Equivalently, let P denote p = (p1, . .
which in our case is
,pk) with Pi
=
=
·v,], I < i < k, and think of this model as P = {all probability distributions P on {v1 , . . . , vk} }. Then q ( p ) can be identified with a parameter v : P � R, that is, v(P) = (p4 + p5) - (p, + p,), and the frequency plug-in principle simply says to replace P = (p,, . . . , Pk) in v(P) by D P = ( �� , . . . , �,. ) the multinomial empirical distribution of X1, , Xn. .
P[X
,
.
Now suppose that the proportions
p1, . . . , Pk
functions of some d-dimensional parameter 8 =
a component of 8 or more generally a function
.
•
do not vary freely but are continuous
(fh, .
q(8) .
. , (}d) and that we want to estimate
.
Many of the models arising in the
analysis of discrete data discussed in Chapter 6 are of this type. Example
2.1.4.
Hardy-Weinberg Equilibrium.
Consider a sample from a population in
genetic equilibrium with respect to a single gene with two alleles. If we assume the three different genotypes are identifiable, we are led to suppose that there are three types of individuals whose frequencies
P1 If Ni
=
82, P2
are
=
given by the so-called
28(1 - 8),
p, = ( I
8) 2 , 0 < 8 < L
(2.LI2)
i in the sample of size n, then (Nt , N2 , N3) parameters (n,p1 ,P2, p3) given by (2.1.12). Suppose
is the number of individuals of type
has a multinomial distribution with
-
Hardy-Weinberg proportions
104
Methods of Estimation
Chapter 2
$1. we can use we want to estimate fJ, the frequency of one of the alleles. Because fJ the principle we have introduced and estimate by JN1 jn. Note, however, that we can also 0 write 0 = I ..fii3 and, thus, 1 - .jNJTi is also a plausible estimate of 0. =
-
In general, suppose that we want to estimate a continuous Rl-valued function q of 8. If p1, . ,Pk are continuous functions of 8, we can usually express q(8) as a continuous function of PI , . . . , Pk, that is, .
.
q(IJ) = h(p,(IJ), . . . ,p.(IJ)), with h defined and continuous on P=
.
(p,, . . ,p.) : p; > O,
(2. 1.13)
k
L p; = 1
j
.
'
i= 1
Given h we can apply the extension principle to estimate q( 8) as,
T(X1, . . . , Xn)
=
h ( Nn, , . . , Nn• ) .
.
.
(2 1 . 14)
As we saw in the Hardy-Weinberg case, the representation (2.1.13) and estimate (2.1.14) are not unique. We shall consider in Chapters 3 (Example 3.4.4) and 5 how to choose among such estimates. We can think of the extension principle alternatively as follows. Let
Po = {Po = (p, (IJ), . . . ,p. (IJ)) : e E e} be a submodel of P. Now q(IJ) can be identified if IJ is identifiable by a parameter v : Po � R given by v(Pe ) = q(O). Then (2.1.13) defines an extension of v from Po to P via v : P --> R where v(P) - h(p) and v(P) = v(P) for P E Po . The plug-in and extension principles can be abstractly stated as follows:
T Plug-in principle. If we have an estimate P of P E P such that P E P and v : P is a parameter. then v(P) is the plug-in estimate of v. In particular, in the i.i.d. case if P is the space of all distributions of X and X�, . . . , Xn are i.i.d. as X f'V P. the empirical distribution P of X given by -
-
-
---+
-
1
n
P[X E A] = n L I (X; E A) -
(2.1 . 15)
. t=l
�
is, by the law of large numbers, a natural estimate of P and v(P) is a plug-in estimate of v(P) in this nonparametric context For instance, if X is real and F(x) = P(X :S x) is the distribution function (d.f.), consider va (P) = i [F- 1 (a) + F.J 1 (a)], where a E (0, I) and
F-1 (a) = inf{x : F(x) > a}, Fu 1 (a) = sup{x : F(x) < a},
-
-
�
-
(2. 1.16)
Section 2.1
105
Basic Heuristics of Estimation
v0 ( P) is the ath population quantile X Here x � = v.!. (P) is called the population median. A natural estimate is the ath sample quantile then
'
a.
(2.1.17) where
F is the empirical d.f. Here X!. �
is called the sample median.
'
For a second example, if X is real and P is the class of distributions with EjXj1 < oo, then the plug-in estimate of the jth moment v(P) = f..t1 = in this nonparametric �
context is the jth sample moment v(P) =
E(XJ )
�
J x;dF(x)
=
.
n- 1 I:� 1 Xf. �
Extension principle. Suppose Po is a submod.el of P and P is an element of P but not necessarily Po and suppose v : Po � T is a parameter. If v : P � T is an extension of v � in the sense that v(P) = v(P) on Po. then v(P) is an extension (and plug-in) estimate of v(P). With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plug-in estimates for multinomial trials because
1';(8)
k
=
L vfp,(8) = h(p(ll)) i=l
where
h(p)
=
=
v(Po)
k
L v!P< = v(P) ,
i =l
�
and P is the empirical distribution. This reasoning extends to the general i.i.d. case (Prob lem 2.1.12) and to more general method of moment estimates (Problem 2.1.13). As stated, these principles are general. However, they are mainly applied in the i.i.d. case--but see Problem 2.1.14.
Remark 2.1.1. The plug-in and extension principles are used when Pe, v, and v are contin uous. For instance, in the multinomial examples 2.1.3 and 2.1.4, Pe as given by the Hardy Weinberg p(O), is a continuous map from e = [0, 1] to P, v(Pe ) = q(8) = h(p(8)) is a continuous map from 8 to R and v( P) = h(p) is a continuous map from P to R. Remark 2.1.2. The plug-in and extension principles must be calibrated with the target parameter. For instance, let Po be the class of distributions of X = B + �: where B E R
and the distribution of t: ranges over the class of symmetric distributions with mean zero. Let v(P) be the mean of let P be the class of distributions of = (I + < where B E R and the distribution off ranges over the class of distributions with mean zero. In this case both v1(P) = Ep(X) and v2(P) = "median of P" satisfy v(P) = v(P), P E Po,
X and
but only
�
v1(P)
-
=
X is a sensible� estimate of v(P), P
symmetric, the sample median
X
¢ P0, because when P is not
v2(P) does not converge in probability to Ep(X).
106
Methods of Estimation
Chapter 2
Here are three further simple examples illustrating reasonable and unreasonable MOM estimates.
Example 2.1.5. Suppose that X_1, ..., X_n is a N(μ, σ²) sample as in Example 1.1.2 with assumptions (1)-(4) holding. The method of moments estimates of μ and σ² are X̄ and σ̂² = n^{-1} Σ_{i=1}^n (X_i - X̄)². □
Example 2.1.6. Suppose X_1, ..., X_n are the indicators of a set of Bernoulli trials with probability of success θ. Because μ_1(θ) = θ, the method of moments leads to the natural estimate of θ, X̄, the frequency of successes. To estimate the population variance θ(1 - θ) we are led by the first moment to the estimate X̄(1 - X̄). Because we are dealing with (unrestricted) Bernoulli trials, these are the frequency plug-in (substitution) estimates (see Problem 2.1.1). □
Example 2.1.7. Estimating the Size of a Population (continued). In Example 1.5.3, where X_1, ..., X_n are i.i.d. U{1, 2, ..., θ}, we find μ = E_θ(X_1) = (θ + 1)/2. Thus, θ = 2μ - 1 and 2X̄ - 1 is a method of moments estimate of θ. This is clearly a foolish estimate if X_(n) = max X_i > 2X̄ - 1 because in this model θ is always at least as large as X_(n). □
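A quick simulation makes the point of Example 2.1.7 concrete (illustrative Python; the uniform-on-{1,...,θ} model and the comparison are the example's, but the particular θ, sample size, and number of replications are made up).

```python
import random

def fraction_foolish(theta=100, n=20, reps=2000, seed=0):
    """Estimate how often 2*Xbar - 1 falls below the observed maximum X_(n)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(reps):
        xs = [rng.randint(1, theta) for _ in range(n)]
        mom = 2 * sum(xs) / n - 1      # method of moments estimate
        if max(xs) > mom:              # estimate is smaller than an observed value
            bad += 1
    return bad / reps

print(fraction_foolish())
```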
As we have seen, there are often several method of moments estimates for the same q(θ). For example, if we are sampling from a Poisson population with parameter θ, then θ is both the population mean and the population variance. The method of moments can lead to either the sample mean or the sample variance. Moreover, because p_0 = P(X = 0) = exp(-θ), a frequency plug-in estimate of θ is -log p̂_0, where p̂_0 is n^{-1}[#X_i = 0]. We will make a selection among such procedures in Chapter 3.

Remark 2.1.3. What are the good points of the method of moments and frequency plug-in?
(a) They generally lead to procedures that are easy to compute and are, therefore, valuable as preliminary estimates in algorithms that search for more efficient estimates. See Section 2.4.

(b) If the sample size is large, these estimates are likely to be close to the value estimated (consistency). This minimal property is discussed in Section 5.2.

It does turn out that there are "best" frequency plug-in estimates, those obtained by the method of maximum likelihood, a special type of minimum contrast and estimating equation method. Unfortunately, as we shall see in Section 2.4, they are often difficult to compute. Algorithms for their computation will be introduced in Section 2.4.
Discussion. When we consider optimality principles, we may arrive at different types of estimates than those discussed in this section. For instance, as we shall see in Chapter 3, estimation of θ real with quadratic loss and Bayes priors lead to procedures that are data-weighted averages of θ values rather than minimizers of functions ρ(θ, X). Plug-in is not the optimal way to go for the Bayes, minimax, or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3. However, a saving grace becomes apparent in Chapters 5 and 6. If the model fits, for large amounts of data, optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions, the plug-in principle is justified, and there are best extensions.
Summary. We consider principles that suggest how we can use the outcome X of an experiment to estimate unknown parameters. For the model {P_θ : θ ∈ Θ} a contrast ρ is a function from X × Θ to R such that the discrepancy

D(θ_0, θ) = E_{θ_0} ρ(X, θ), θ ∈ Θ ⊂ R^d,

is uniquely minimized at the true value θ = θ_0 of the parameter. A minimum contrast estimator is a minimizer of ρ(X, θ), and the contrast estimating equations are ∇_θ ρ(X, θ) = 0. For data {(z_i, Y_i) : 1 ≤ i ≤ n} with Y_i independent and E(Y_i) = g(β, z_i), 1 ≤ i ≤ n, where g is a known function and β ∈ R^d is a vector of unknown regression coefficients, a least squares estimate of β is a minimizer of

ρ(X, β) = Σ_{i=1}^n [Y_i - g(β, z_i)]².

For this contrast, when g(β, z) = z^T β, the associated estimating equations are called the normal equations and are given by Z_D^T Y = Z_D^T Z_D β, where Z_D = ||z_ij||_{n×d} is called the design matrix.

Suppose X ~ P. The plug-in estimate (PIE) for a vector parameter ν = ν(P) is obtained by setting ν̂ = ν(P̂) where P̂ is an estimate of P. When P̂ is the empirical probability distribution P̂_E defined by P̂_E(A) = n^{-1} Σ_{i=1}^n 1[X_i ∈ A], then ν̂ is called the empirical PIE. If P = P_θ, θ ∈ Θ, is parametric and a vector q(θ) is to be estimated, we find a parameter ν such that ν(P_θ) = q(θ) and call ν(P̂) a plug-in estimator of q(θ). Method of moment estimates are empirical PIEs based on ν(P) = (μ_1, ..., μ_d)^T where μ_j = E(X^j), 1 ≤ j ≤ d. In the multinomial case the frequency plug-in estimators are empirical PIEs based on ν(P) = (p_1, ..., p_k), where p_j is the probability of the jth category, 1 ≤ j ≤ k. Let P_0 and P be two statistical models for X with P_0 ⊂ P. An extension ν̄ of ν from P_0 to P is a parameter satisfying ν̄(P) = ν(P), P ∈ P_0. If P̂ is an estimate of P with P̂ ∈ P, ν̄(P̂) is called the extension plug-in estimate of ν(P). The general principles are shown to be related to each other.
2.2 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS

2.2.1 Least Squares and Weighted Least Squares
Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement. It is of great importance in many areas of statistics such as the analysis of variance and regression theory. In this section we shall introduce the approach and give a few examples, leaving detailed development to Chapter 6.
In Example 2.1.1 we considered the nonlinear (and linear) Gaussian model P_0 given by

Y_i = g(β, z_i) + ε_i, 1 ≤ i ≤ n,    (2.2.1)

where the ε_i are i.i.d. N(0, σ_0²) and β ranges over R^d or an open subset. The contrast

ρ(X, β) = Σ_{i=1}^n [Y_i - g(β, z_i)]²

led to the least squares estimates (LSEs) β̂ of β. Suppose that we enlarge P_0 to P where we retain the independence of the Y_i but only require

μ(z_i) = E(Y_i) = g(β, z_i),  E(ε_i) = 0.    (2.2.2)

Are least squares estimates still reasonable? For P in the semiparametric model P, which satisfies (but is not fully specified by) (2.2.2), we can still compute
D_P(β_0, β) = E_P ρ(X, β) = Σ_{i=1}^n Var_P(ε_i) + Σ_{i=1}^n [g(β_0, z_i) - g(β, z_i)]²,    (2.2.3)

which is again minimized as a function of β by β = β_0, and uniquely so if the map β → (g(β, z_1), ..., g(β, z_n))^T is 1-1. The estimates continue to be reasonable under the Gauss-Markov assumptions,
E(ε_i) = 0, 1 ≤ i ≤ n,    (2.2.4)
Var(ε_i) = σ² > 0, 1 ≤ i ≤ n,    (2.2.5)
Cov(ε_i, ε_j) = 0, 1 ≤ i < j ≤ n,    (2.2.6)
because (2.2.3) continues to be valid. Note that the joint distribution H of (ε_1, ..., ε_n) is any distribution satisfying the Gauss-Markov assumptions. That is, the model is semiparametric with β, σ² and H unknown. The least squares method of estimation applies only to the parameters β_1, ..., β_d and is often applied in situations in which specification of the model beyond (2.2.4)-(2.2.6) is difficult.

Sometimes z can be viewed as the realization of a population variable Z, that is, (Z_i, Y_i), 1 ≤ i ≤ n, is modeled as a sample from a joint distribution. This is frequently the case for studies in the social and biological sciences. For instance, Z could be educational level and Y income, or Z could be height and Y log weight. Then we can write the conditional model of Y given Z_j = z_j, 1 ≤ j ≤ n, as in (a) of Example 1.1.4 with ε_j simply defined as Y_j - E(Y_j | Z_j = z_j). If we consider this model, (Z_i, Y_i) i.i.d. as (Z, Y) ~ P ∈ P = {all joint distributions of (Z, Y) such that E(Y | Z = z) = g(β, z), β ∈ R^d}, and β → (g(β, z_1), ..., g(β, z_n))^T is 1-1, then β has an interpretation as a parameter on P, that is, β = β(P) is the minimizer of E(Y - g(β, Z))². This follows
from Theorem 1.4.1. In this case we recognize the LSE β̂ as simply being the usual plug-in estimate β(P̂), where P̂ is the empirical distribution assigning mass n^{-1} to each of the n pairs (Z_i, Y_i).

As we noted in Example 2.1.1, the most commonly used g in these models is g(β, z) = z^T β, which, in conjunction with (2.2.1), (2.2.4), (2.2.5) and (2.2.6), is called the linear (multiple) regression model. For the data {(z_i, y_i); i = 1, ..., n} we write this model in matrix form as

Y = Z_D β + ε

where Z_D = ||z_ij|| is the design matrix. We continue our discussion for this important special case for which explicit formulae and theory have been derived. For nonlinear cases we can use numerical methods to solve the estimating equations (2.1.7). See also Problem 2.2.41, Sections 6.4.3 and 6.5, and Seber and Wild (1989). The linear model is often the default model for a number of reasons:

(1) If the range of the z's is relatively small and μ(z) is smooth, we can approximate μ(z) by
μ(z) ≈ μ(z_0) + Σ_{j=1}^d (∂μ/∂z_j)(z_0)(z_j - z_{0j})

for z_0 an interior point of the domain. We can then treat μ(z_0) - Σ_{j=1}^d (∂μ/∂z_j)(z_0) z_{0j} as an unknown β_0 and identify (∂μ/∂z_j)(z_0) with β_j to give an approximate (d + 1)-dimensional linear model with z_0 ≡ 1 and z_j as before and

μ(z) = Σ_{j=0}^d β_j z_j.
This type of approximation is the basis for nonlinear regression analysis based on local polynomials; see Ruppert and Wand (1994), Fan and Gijbels (1996), and Volume II.

(2) If, as we discussed earlier, we are in a situation in which it is plausible to assume that (Z_i, Y_i) are a sample from a (d + 1)-dimensional distribution and the covariates that are the coordinates of Z are continuous, a further modeling step is often taken and it is assumed that (Z_1, ..., Z_d, Y)^T has a nondegenerate multivariate Gaussian distribution N_{d+1}(μ, Σ). In that case, as we have seen in Section 1.4, E(Y | Z = z) can be written as

μ(z) = β_0 + Σ_{j=1}^d β_j z_j

where

β_0 = μ_Y - Σ_{j=1}^d β_j μ_j,  (β_1, ..., β_d)^T = Σ_{ZZ}^{-1} Σ_{ZY}.    (2.2.9)
Furthermore, ε = Y - μ(Z) is independent of Z and has a N(0, σ²) distribution where

σ² = σ_YY - Σ_{YZ} Σ_{ZZ}^{-1} Σ_{ZY}.

Therefore, given Z_i = z_i, 1 ≤ i ≤ n, we have a Gaussian linear regression model for Y_i.
Estimation of β in the linear regression model. We have already argued in Example 2.1.1 that, if the parametrization β → Z_D β is identifiable, the least squares estimate β̂ exists and is unique and satisfies the normal equations (2.1.8). The parametrization is identifiable if and only if Z_D is of rank d or, equivalently, if Z_D^T Z_D is of full rank d; see Problem 2.2.25. In that case, necessarily, the solution of the normal equations can be given "explicitly" by

β̂ = (Z_D^T Z_D)^{-1} Z_D^T Y.    (2.2.10)
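In matrix form, (2.2.10) translates directly into code. The sketch below (plain Python with NumPy, not part of the text; the straight-line data are made up) forms and solves the normal equations; numpy.linalg.lstsq computes the same estimate more stably.

```python
import numpy as np

def least_squares(ZD, Y):
    """Solve the normal equations ZD' ZD beta = ZD' Y, assuming ZD has full rank d."""
    return np.linalg.solve(ZD.T @ ZD, ZD.T @ Y)

# toy illustration: straight-line fit with columns (1, z)
z = np.array([0.0, 1.0, 2.0, 3.0])
ZD = np.column_stack([np.ones_like(z), z])
Y = np.array([1.1, 2.9, 5.2, 6.8])
print(least_squares(ZD, Y))                      # intercept and slope
print(np.linalg.lstsq(ZD, Y, rcond=None)[0])     # same estimate via lstsq
```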
Here are some examples.
Example 2.2.1. In the measurement model in which Y_i is the determination of a constant β_1, d = 1, g(z, β_1) = β_1 and the normal equation is Σ_{i=1}^n (y_i - β_1) = 0, whose solution is β̂_1 = (1/n) Σ_{i=1}^n y_i = ȳ. □
Example 2.2.2. We want to find out how increasing the amount z of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil. For certain chemicals and plants, the relationship between z and y can be approximated well by a linear equation y = β_1 + β_2 z provided z is restricted to a reasonably small interval. If we run several experiments with the same z using plants and soils that are as nearly identical as possible, we will find that the values of y will not be the same. For this reason, we assume that for a given z, Y is random with a distribution P(y | z). Following are the results of an experiment to which a regression model can be applied (Snedecor and Cochran, 1967, p. 139). Nine samples of soil were treated with different amounts z of phosphorus. Y is the amount of phosphorus found in corn plants grown for 38 days in the different samples of soil.

z_i:  1   4   5   9  11  13  23  23  28
y_i: 64  71  54  81  76  93  77  95 109
The points (z_i, y_i) and an estimate of the line β_1 + β_2 z are plotted in Figure 2.2.1. We want to estimate β_1 and β_2. The normal equations are

Σ_{i=1}^n (y_i - β_1 - β_2 z_i) = 0,  Σ_{i=1}^n z_i(y_i - β_1 - β_2 z_i) = 0.    (2.2.11)

When the z_i's are not all equal, we get the solutions

β̂_2 = Σ_{i=1}^n (z_i - z̄)(y_i - ȳ) / Σ_{i=1}^n (z_i - z̄)²    (2.2.12)
[Figure 2.2.1. Scatter plot {(z_i, y_i); i = 1, ..., n} and sample regression line for the phosphorus data. ε̂_5 is the residual for (z_5, y_5).]
and

β̂_1 = ȳ - β̂_2 z̄    (2.2.13)

where z̄ = (1/n) Σ_{i=1}^n z_i and ȳ = (1/n) Σ_{i=1}^n y_i. The line y = β̂_1 + β̂_2 z is known as the sample regression line or line of best fit of y_1, ..., y_n on z_1, ..., z_n. Geometrically, if we measure the distance between a point (z_i, y_i) and a line y = a + bz vertically by d_i = |y_i - (a + bz_i)|, then the regression line minimizes the sum of the squared distances to the n points (z_1, y_1), ..., (z_n, y_n). The vertical distances ε̂_i = [y_i - (β̂_1 + β̂_2 z_i)], i = 1, ..., n, are called the residuals of the fit. The line y = β̂_1 + β̂_2 z is an estimate of the best linear MSPE predictor a_1 + b_1 z of Theorem 1.4.3. This connection to prediction explains the use of vertical distances in regression. The regression line for the phosphorus data is given in Figure 2.2.1. Here β̂_1 = 61.58 and β̂_2 = 1.42. □
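The least squares line for the phosphorus data can be reproduced in a few lines of code (plain Python sketch; the data are those of Example 2.2.2).

```python
z = [1, 4, 5, 9, 11, 13, 23, 23, 28]
y = [64, 71, 54, 81, 76, 93, 77, 95, 109]
n = len(z)
zbar, ybar = sum(z) / n, sum(y) / n
b2 = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y)) / \
     sum((zi - zbar) ** 2 for zi in z)      # slope, formula (2.2.12)
b1 = ybar - b2 * zbar                        # intercept, formula (2.2.13)
print(round(b1, 2), round(b2, 2))            # 61.58 1.42
```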
Remark 2.2.1. The linear regression model is considerably more general than appears at first sight. For instance, suppose we select p real-valued functions of z, g_1, ..., g_p, p ≥ d, and postulate that μ(z) is a linear combination of g_1(z), ..., g_p(z); that is,

μ(z) = Σ_{j=1}^p θ_j g_j(z).

Then we are still dealing with a linear regression model because we can define w_{p×1} = (g_1(z), ..., g_p(z))^T as our covariate and consider the linear model

Y_i = Σ_{j=1}^p θ_j w_ij + ε_i, 1 ≤ i ≤ n,

where w_ij = g_j(z_i). For instance, if d = 1 and we take g_j(z) = z^j, 0 ≤ j ≤ 2, we arrive at quadratic regression, Y_i = θ_0 + θ_1 z_i + θ_2 z_i² + ε_i; see Problem 2.2.24 for more on polynomial regression.
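Remark 2.2.1 amounts to building a new design matrix whose columns are g_1(z), ..., g_p(z). A minimal sketch (illustrative Python/NumPy, not from the text; the basis and data are made up) for quadratic regression:

```python
import numpy as np

def basis_design(z, funcs):
    """Columns are g_j(z_i); with funcs = (1, z, z^2) this gives quadratic regression."""
    return np.column_stack([f(z) for f in funcs])

z = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
y = 1.0 + 2.0 * z - 0.5 * z ** 2 + np.array([0.05, -0.02, 0.03, -0.04, 0.01, 0.02])
W = basis_design(z, (lambda t: np.ones_like(t), lambda t: t, lambda t: t ** 2))
theta_hat = np.linalg.lstsq(W, y, rcond=None)[0]
print(theta_hat)   # close to (1, 2, -0.5)
```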
Whether any linear model is appropriate in particular situations is a delicate matter partially explorable through further analysis of the data and knowledge of the subject matter. We return to this in Volume II.
Weighted least squares. In Example 2.2.2 and many similar situations it may not be reasonable to assume that the variances of the errors ε_i are the same for all levels z_i of the covariate variable. However, we may be able to characterize the dependence of Var(ε_i) on z_i at least up to a multiplicative constant. That is, we can write

Var(ε_i) = w_i σ², 1 ≤ i ≤ n,    (2.2.14)

where σ² is unknown as before, but the w_i are known weights. Such models are called heteroscedastic (as opposed to the equal variance models that are homoscedastic). The method of least squares may not be appropriate because (2.2.5) fails. Note that the variables

Ỹ_i = Y_i/√w_i = g(β, z_i)/√w_i + ε_i/√w_i, 1 ≤ i ≤ n,

are sufficient for the Y_i and that Var(ε_i/√w_i) = σ². Thus, if we set g̃(β, z_i) = g(β, z_i)/√w_i, ε̃_i = ε_i/√w_i, and

Ỹ_i = g̃(β, z_i) + ε̃_i, 1 ≤ i ≤ n,    (2.2.15)

then the Ỹ_i satisfy the assumption (2.2.5). The weighted least squares estimate of β is now the value β̂ which for given ỹ_i = y_i/√w_i minimizes

Σ_{i=1}^n [ỹ_i - g̃(β, z_i)]² = Σ_{i=1}^n w_i^{-1}[y_i - g(β, z_i)]²    (2.2.16)

as a function of β.
Example 2.2.3. Weighted Linear Regression. Consider the case in which d = 2, z_i1 = 1, z_i2 = z_i, and g(β, z_i) = β_1 + β_2 z_i, i = 1, ..., n. We need to find the values β̂_1 and β̂_2 of β_1 and β_2 that minimize

Σ_{i=1}^n v_i[y_i - (β_1 + β_2 z_i)]²    (2.2.17)
where v_i = 1/w_i. This problem may be solved by setting up analogues to the normal equations (2.1.8). We can also use the results on prediction in Section 1.4 as follows. Let (Z*, Y*) denote a pair of discrete random variables with possible values (z_1, y_1), ..., (z_n, y_n) and probability distribution given by

P[(Z*, Y*) = (z_i, y_i)] = u_i, i = 1, ..., n,

where

u_i = v_i / Σ_{j=1}^n v_j, i = 1, ..., n.

If μ_L(Z*) = β_1 + β_2 Z* denotes a linear predictor of Y* based on Z*, then its MSPE is given by

E[Y* - μ_L(Z*)]² = Σ_{i=1}^n u_i[y_i - (β_1 + β_2 z_i)]².

It follows that the problem of minimizing (2.2.17) is equivalent to finding the best linear MSPE predictor of Y*. Thus, using Theorem 1.4.3,

β̂_2 = Cov(Z*, Y*)/Var(Z*) = [Σ_{i=1}^n u_i z_i y_i - (Σ_{i=1}^n u_i y_i)(Σ_{i=1}^n u_i z_i)] / [Σ_{i=1}^n u_i z_i² - (Σ_{i=1}^n u_i z_i)²]    (2.2.18)

and

β̂_1 = E(Y*) - β̂_2 E(Z*) = Σ_{i=1}^n u_i y_i - β̂_2 Σ_{i=1}^n u_i z_i.

This computation suggests, as we make precise in Problem 2.2.26, that weighted least squares estimates are also plug-in estimates. □
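The prediction argument gives closed-form weighted estimates directly from the u_i. The sketch below is illustrative Python (the data and weights are made up); it evaluates (2.2.18) and the companion intercept formula.

```python
def weighted_line(z, y, w):
    """Weighted least squares fit of y = b1 + b2*z when Var(eps_i) = w_i * sigma^2."""
    v = [1.0 / wi for wi in w]                   # v_i = 1/w_i
    total = sum(v)
    u = [vi / total for vi in v]                 # u_i = v_i / sum_j v_j
    Ez = sum(ui * zi for ui, zi in zip(u, z))
    Ey = sum(ui * yi for ui, yi in zip(u, y))
    Ezy = sum(ui * zi * yi for ui, zi, yi in zip(u, z, y))
    Ezz = sum(ui * zi * zi for ui, zi in zip(u, z))
    b2 = (Ezy - Ey * Ez) / (Ezz - Ez ** 2)       # (2.2.18)
    b1 = Ey - b2 * Ez
    return b1, b2

print(weighted_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.1], [1, 1, 4, 4]))
```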
Next consider finding the β that minimizes (2.2.16) for g(β, z_i) = z_i^T β and for general d. By following the steps of Example 2.2.1 leading to (2.1.7), we find (Problem 2.2.27) that β̂ satisfies the weighted least squares normal equations

Z_D^T W^{-1} Y = Z_D^T W^{-1} Z_D β    (2.2.19)

where W = diag(w_1, ..., w_n) and Z_D = ||z_ij||_{n×d} is the design matrix. When Z_D has rank d and w_i > 0, 1 ≤ i ≤ n, we can write

β̂ = (Z_D^T W^{-1} Z_D)^{-1} Z_D^T W^{-1} Y.    (2.2.20)
Remark 2.2.2. More generally, we may allow for correlation between the errors {ε_i}. That is, suppose Var(ε) = σ² W for some invertible matrix W_{n×n}. Then it can be shown (Problem 2.2.28) that the model Y = Z_D β + ε can be transformed to one satisfying (2.2.1) and (2.2.4)-(2.2.6). Moreover, when g(β, z) = z^T β, the β̂ minimizing the least squares contrast in this transformed model is given by (2.2.19) and (2.2.20).
Remark 2.2.3. Here are some applications of weighted least squares: When the ith response Y_i is an average of n_i equally variable observations, then Var(Y_i) = σ²/n_i and w_i = n_i^{-1}. If Y_i is the sum of n_i equally variable observations, then w_i = n_i. If the variance of Y_i is proportional to some covariate, say z_1, then Var(Y_i) = z_i1 σ² and w_i = z_i1. In time series and repeated measures, a covariance structure is often specified for ε (see Problems 2.2.29 and 2.2.42).

2.2.2 Maximum Likelihood
The method of maximum likelihood was first proposed by the German mathematician C. F. Gauss in 1821. However, the approach is usually credited to the English statistician R. A. Fisher (1922) who rediscovered the idea and first investigated the properties of the method. In the form we shall give, this approach makes sense only in regular parametric models. Suppose that p(x, θ) is the frequency or density function of X if θ is true and that Θ is a subset of d-dimensional space. Recall L_x(θ), the likelihood function of θ, defined in Section 1.5, which is just p(x, θ) considered as a function of θ for fixed x. Thus, if X is discrete, then for each θ, L_x(θ) gives the probability of observing x. If Θ is finite and π is the uniform prior distribution on Θ, then the posterior probability of θ given X = x satisfies π(θ | x) ∝ L_x(θ), where the proportionality is up to a function of x. Thus, we can think of L_x(θ) as a measure of how "likely" θ is to have produced the observed x. A similar interpretation applies to the continuous case (see A.7.10). The method of maximum likelihood consists of finding that value θ̂(x) of the parameter that is "most likely" to have produced the data. That is, if X = x, we seek θ̂(x) that satisfies

L_x(θ̂(x)) = p(x, θ̂(x)) = max{p(x, θ) : θ ∈ Θ} = max{L_x(θ) : θ ∈ Θ}.
By our previous remarks, if Θ is finite and π is uniform, or, more generally, the prior density π on Θ is constant, such a θ̂(x) is a mode of the posterior distribution. If such a θ̂ exists, we estimate any function q(θ) by q(θ̂(x)). The estimate q(θ̂(x)) is called the maximum likelihood estimate (MLE) of q(θ). This definition of the MLE of q(θ) is consistent. That is, suppose q is 1-1 from Θ to Ω; set ω = q(θ) and write the density of X as p_0(x, ω) = p(x, q^{-1}(ω)). If ω̂ maximizes p_0(x, ω), then ω̂ = q(θ̂) (Problem 2.2.16(a)). If q is not 1-1, the MLE of ω = q(θ) is still q(θ̂) (Problem 2.2.16(b)).

Here is a simple numerical example. Suppose θ = 0 or 1/2 and p(x, θ) is given by the following table.

x\θ     0      1/2
 1      0      0.10
 2      1.0    0.90

Then θ̂(1) = 1/2, θ̂(2) = 0. If X = 1, the only reasonable estimate of θ is 1/2 because the value 1 could not have been produced when θ = 0.
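For a finite parameter set the definition can be applied literally: tabulate p(x, θ) and take the argmax. A tiny Python illustration using the table above:

```python
# p[x][theta] for the two-point parameter set {0, 1/2}
p = {1: {0.0: 0.0, 0.5: 0.10},
     2: {0.0: 1.0, 0.5: 0.90}}

def mle(x):
    """theta maximizing the likelihood for the observed x."""
    return max(p[x], key=p[x].get)

print(mle(1), mle(2))   # 0.5 0.0
```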
Maximum likelihood estimates need neither exist nor be unique (Problems 2.2.14 and 2.2.13). In the rest of this section we identify them as of minimum contrast and estimating equation type, relate them to the plug-in and extension principles and some notions in information theory, and compute them in some important special cases in which they exist, are unique, and are expressible in closed form. In the rest of this chapter we study more detailed conditions for existence and uniqueness and algorithms for calculation of MLEs when closed forms are not available. When θ is real, MLEs can often be obtained by inspection, as we see in a pair of important examples.
Example 2.2.4. The Normal Distribution with Known Variance. Suppose X ~ N(θ, σ²), where σ² is known, and let
> 0, μ_1, μ_2 ∈ R and φ_σ(s) = σ^{-1} φ(s/σ). It is not obvious that this falls under our scheme but let

X_i = (Δ_i, S_i), 1 ≤ i ≤ n,    (2.4.6)

where the Δ_i are independent identically distributed with P_θ[Δ_i = 1] = λ = 1 - P_θ[Δ_i = 0]. Suppose that given Δ = (Δ_1, ..., Δ_n), the S_i are independent with

L_θ(S_i | Δ) = L_θ(S_i | Δ_i) = N(Δ_i μ_1 + (1 - Δ_i) μ_2, Δ_i σ_1² + (1 - Δ_i) σ_2²).

That is, Δ_i tells us whether to sample from N(μ_1, σ_1²) or N(μ_2, σ_2²). It is easy to see (Problem 2.4.11) that under θ, S has the marginal distribution given previously. Thus, we can think of S as S(X) where X is given by (2.4.6). This five-parameter model is very rich, permitting up to two modes and scales. The log likelihood similarly can have a number of local maxima and can tend to ∞ as θ tends to the boundary of the parameter space (Problem 2.4.12). Although MLEs do not exist in these models, a local maximum close to the true θ_0 turns out to be a good "proxy" for the nonexistent MLE. The EM algorithm can lead to such a local maximum. □
Let

J(θ | θ_0) = E_{θ_0}( log [p(X, θ)/p(X, θ_0)] | S(X) = s )    (2.4.7)

where we suppress dependence on s. Initialize with θ_old = θ_0. The first (E) step of the algorithm is to compute J(θ | θ_old) for as many values of θ as needed. If this is difficult, the EM algorithm is probably not suitable. The second (M) step is to maximize J(θ | θ_old) as a function of θ. Again, if this step is difficult, EM is not particularly appropriate. Then we set θ_new = arg max J(θ | θ_old), reset θ_old = θ_new and repeat the process. As we shall see in important situations, including the examples we have given, the M step is easy and the E step doable. The rationale behind the algorithm lies in the following formulas, which we give for θ real and which can be justified easily in the case that X is finite (Problem 2.4.12):
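For the Gaussian mixture model just described, both steps are available in closed form and the E/M alternation is a short loop. The sketch below is illustrative Python, not part of the text; for brevity it fixes σ_1 = σ_2 = 1 (an assumption) and uses made-up data, updating only (λ, μ_1, μ_2) from the posterior membership probabilities.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def em_mixture(s, lam, mu1, mu2, iters=50):
    """EM for s_i ~ lam*N(mu1,1) + (1-lam)*N(mu2,1), unit variances assumed."""
    for _ in range(iters):
        # E-step: posterior probability that Delta_i = 1 given S_i = s_i
        g = [lam * phi(x - mu1) /
             (lam * phi(x - mu1) + (1 - lam) * phi(x - mu2)) for x in s]
        # M-step: maximize J(theta | theta_old)
        lam = sum(g) / len(s)
        mu1 = sum(gi * x for gi, x in zip(g, s)) / sum(g)
        mu2 = sum((1 - gi) * x for gi, x in zip(g, s)) / sum(1 - gi for gi in g)
    return lam, mu1, mu2

s = [-2.1, -1.8, -2.4, -1.9, 1.7, 2.2, 2.5, 1.9, 2.1]
print(em_mixture(s, lam=0.5, mu1=-1.0, mu2=1.0))
```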
q(s, θ)/q(s, θ_0) = E_{θ_0}( p(X, θ)/p(X, θ_0) | S(X) = s )    (2.4.8)

and

(∂/∂θ) log q(s, θ) |_{θ=θ_0} = E_{θ_0}( (∂/∂θ) log p(X, θ) | S(X) = s ) |_{θ=θ_0}    (2.4.9)

for all θ_0 (under suitable regularity conditions). Note that (2.4.9) follows from (2.4.8) by taking logs in (2.4.8), differentiating, and exchanging E_{θ_0} and differentiation with respect to θ at θ_0. Because, formally,

∂J(θ | θ_0)/∂θ = E_{θ_0}( (∂/∂θ) log p(X, θ) | S(X) = s )    (2.4.10)

and, hence,

∂J(θ | θ_0)/∂θ |_{θ=θ_0} = (∂/∂θ) log q(s, θ) |_{θ=θ_0},    (2.4.11)

it follows that a fixed point θ̂ of the algorithm satisfies the likelihood equation

(∂/∂θ) log q(s, θ) = 0.    (2.4.12)
The main reason the algorithm behaves well follows.
Lemma 2.4.1. If θ_new, θ_old are as defined earlier and S(X) = s, then

q(s, θ_new) ≥ q(s, θ_old).    (2.4.13)

Equality holds in (2.4.13) iff the conditional distribution of X given S(X) = s is the same for θ_new as for θ_old and θ_new maximizes J(θ | θ_old).

Proof. We give the proof in the discrete case. However, the result holds whenever the quantities in J(θ | θ_0) can be defined in a reasonable fashion. In the discrete case we appeal to the product rule. For x ∈ X, S(x) = s,

p(x, θ) = q(s, θ) r(x | s, θ)    (2.4.14)

where r(· | ·, θ) is the conditional frequency function of X given S(X) = s. Then

J(θ | θ_0) = log [q(s, θ)/q(s, θ_0)] + E_{θ_0}{ log [r(X | s, θ)/r(X | s, θ_0)] | S(X) = s }.    (2.4.15)

If θ_0 = θ_old, θ = θ_new,

J(θ_new | θ_old) = log [q(s, θ_new)/q(s, θ_old)] + E_{θ_old}{ log [r(X | s, θ_new)/r(X | s, θ_old)] | S(X) = s }.    (2.4.16)

Now, J(θ_new | θ_old) ≥ J(θ_old | θ_old) = 0 by definition of θ_new. On the other hand,

-E_{θ_old}{ log [r(X | s, θ_new)/r(X | s, θ_old)] | S(X) = s } ≥ 0    (2.4.17)

by Shannon's inequality, Lemma 2.2.1. □
The most important and revealing special case of this lemma follows.

Theorem 2.4.3. Suppose {P_θ : θ ∈ Θ} is a canonical exponential family generated by (T, h) satisfying the conditions of Theorem 2.3.1. Let S(X) be any statistic; then
(a) The EM algorithm consists of the alternation

Ȧ(θ_new) = E_{θ_old}(T(X) | S(X) = s),    (2.4.18)
θ_old = θ_new.    (2.4.19)

If a solution of (2.4.18) exists, it is necessarily unique.

(b) If the sequence of iterates {θ̂_m} so obtained is bounded and the equation

Ȧ(θ) = E_θ(T(X) | S(X) = s)    (2.4.20)

has a unique solution, then it converges to a limit θ̂*, which is necessarily a local maximum of q(s, θ).

Proof. In this case,

J(θ | θ_0) = E_{θ_0}{ (θ - θ_0)^T T(X) - (A(θ) - A(θ_0)) | S(X) = s }
           = (θ - θ_0)^T E_{θ_0}(T(X) | S(X) = s) - (A(θ) - A(θ_0)).    (2.4.21)

Part (a) follows. Part (b) is more difficult. A proof due to Wu (1983) is sketched in Problem 2.4.16. □
Example 2.4.4 (continued). X is distributed according to the exponential family

p(x, θ) = exp{η(2N_{1n}(x) + N_{2n}(x)) - A(η)} h(x)    (2.4.22)

where

η = log(θ/(1 - θ)),  h(x) = 2^{N_{2n}(x)},  A(η) = 2n log(1 + e^η)    (2.4.23)

and N_{jn} = Σ_{i=1}^n ε_{ij}(x_i), 1 ≤ j ≤ 3. Now,

E_θ(2N_{1n} + N_{2n} | S) = 2N_{1m} + N_{2m} + E_θ( Σ_{i=m+1}^n (2ε_{i1} + ε_{i2}) | ε_{i1} + ε_{i2}, m + 1 ≤ i ≤ n ).    (2.4.24)

Under the assumption that the process that causes lumping is independent of the values of the ε_{ij},

P_θ[ε_{i1} = 1 | ε_{i1} + ε_{i2} = 1] = θ² / (θ² + 2θ(1 - θ)) = θ² / (1 - (1 - θ)²) = θ/(2 - θ),

so that E_θ(2ε_{i1} + ε_{i2} | ε_{i1} + ε_{i2} = 1) = 2/(2 - θ) and

E_{θ_old}(2N_{1n} + N_{2n} | S) = 2N_{1m} + N_{2m} + [2/(2 - θ_old)] M_n    (2.4.25)

where M_n = Σ_{i=m+1}^n (ε_{i1} + ε_{i2}). Because A'(η) = 2nθ, the M-step (2.4.18) reduces to 2n θ_new = E_{θ_old}(2N_{1n} + N_{2n} | S); thus, the EM iteration is

θ_new = (2N_{1m} + N_{2m})/(2n) + M_n/[n(2 - θ_old)].    (2.4.26)

It may be shown directly (Problem 2.4.12) that if 2N_{1m} + N_{2m} > 0 and M_n > 0, then θ̂_m converges to the unique root in (0, 1) of the fixed-point equation obtained by setting θ_new = θ_old = θ in (2.4.26), that is, of

2nθ(2 - θ) = (2N_{1m} + N_{2m})(2 - θ) + 2M_n,

which is indeed the MLE when S is observed. □
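The iteration (2.4.26) is a one-line update; the sketch below (illustrative Python, with made-up counts) runs it to convergence.

```python
def em_hardy_weinberg(N1m, N2m, Mn, n, theta=0.5, tol=1e-10):
    """EM iteration (2.4.26): theta_new = (2*N1m + N2m)/(2n) + Mn/(n*(2 - theta_old))."""
    while True:
        theta_new = (2 * N1m + N2m) / (2 * n) + Mn / (n * (2 - theta))
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new

# m = 30 fully classified trials (10 of type 1, 20 of type 2) and
# 70 lumped trials of which 50 are not of type 3 (hypothetical counts)
print(em_hardy_weinberg(N1m=10, N2m=20, Mn=50, n=100))
```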
Example 2.4.6. Let (Z_1, Y_1), ..., (Z_n, Y_n) be i.i.d. as (Z, Y), where (Z, Y) ~ N(μ_1, μ_2, σ_1², σ_2², ρ). Suppose that some of the Z_i and some of the Y_i are missing as follows: For 1 ≤ i ≤ n_1 we observe both Z_i and Y_i, for n_1 + 1 ≤ i ≤ n_2 we observe only Z_i, and for n_2 + 1 ≤ i ≤ n we observe only Y_i. In this case a set of sufficient statistics is

T_1 = (1/n) Σ_{i=1}^n Z_i,  T_2 = (1/n) Σ_{i=1}^n Y_i,  T_3 = (1/n) Σ_{i=1}^n Z_i²,  T_4 = (1/n) Σ_{i=1}^n Y_i²,  T_5 = (1/n) Σ_{i=1}^n Z_i Y_i.

The observed data are S = {(Z_i, Y_i) : 1 ≤ i ≤ n_1} ∪ {Z_i : n_1 + 1 ≤ i ≤ n_2} ∪ {Y_i : n_2 + 1 ≤ i ≤ n}.

To compute E_θ(T | S = s), where θ = (μ_1, μ_2, σ_1², σ_2², ρ), we note that for the cases with Z_i and/or Y_i observed, the conditional expected values equal their observed values. For the other cases we use the properties of the bivariate normal distribution (Appendix B.4 and Section 1.4) to conclude

E_θ(Y_i | Z_i) = μ_2 + ρ(σ_2/σ_1)(Z_i - μ_1),  E_θ(Y_i² | Z_i) = [E_θ(Y_i | Z_i)]² + σ_2²(1 - ρ²),  E_θ(Z_i Y_i | Z_i) = Z_i E_θ(Y_i | Z_i),

and similarly with Z and Y interchanged. Moreover,

Ȧ(θ) = E_θ T = (μ_1, μ_2, σ_1² + μ_1², σ_2² + μ_2², σ_1 σ_2 ρ + μ_1 μ_2).

We take θ̂_0 = θ̂_MOM, where θ̂_MOM is the method of moments estimate (μ̃_1, μ̃_2, σ̃_1², σ̃_2², r̃) (Problem 2.1.8) of θ based on the observed data, and find (Problem 2.4.1) that the M-step produces

μ̂_{1,new} = T_1(θ_old),  μ̂_{2,new} = T_2(θ_old),  σ̂²_{1,new} = T_3(θ_old) - T̂_1²,  σ̂²_{2,new} = T_4(θ_old) - T̂_2²,
ρ̂_new = [T_5(θ_old) - T̂_1 T̂_2] / {[T_3(θ_old) - T̂_1²][T_4(θ_old) - T̂_2²]}^{1/2},    (2.4.27)
where T_j(θ_old) denotes T_j with missing values replaced by the values computed in the E-step and T̂_j = T_j(θ_old), j = 1, 2. Now the process is repeated with θ̂_MOM replaced by θ̂_new. □
Because the E-step, in the context of Example 2.4.6, involves imputing missing values, the EM algorithm is often called multiple imputation.
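A compact sketch of this E-step/M-step cycle follows (illustrative Python, not from the text; the data and the starting value are made up, and in practice the start would be the method of moments estimate computed from the observed data).

```python
import math

def em_bivariate_normal(pairs, z_only, y_only, theta, iters=100):
    """EM for bivariate normal data with some Z's and some Y's missing.
    theta = (mu1, mu2, sig1sq, sig2sq, rho)."""
    n = len(pairs) + len(z_only) + len(y_only)
    for _ in range(iters):
        mu1, mu2, s1, s2, rho = theta
        T = [0.0] * 5                                   # running sums for T1..T5

        def add(z, y, zz, yy, zy):
            for k, v in enumerate((z, y, zz, yy, zy)):
                T[k] += v

        for z, y in pairs:                              # fully observed cases
            add(z, y, z * z, y * y, z * y)
        for z in z_only:                                # E-step: impute Y given Z
            ey = mu2 + rho * math.sqrt(s2 / s1) * (z - mu1)
            add(z, ey, z * z, ey * ey + s2 * (1 - rho ** 2), z * ey)
        for y in y_only:                                # E-step: impute Z given Y
            ez = mu1 + rho * math.sqrt(s1 / s2) * (y - mu2)
            add(ez, y, ez * ez + s1 * (1 - rho ** 2), y * y, ez * y)

        T = [t / n for t in T]
        v1, v2 = T[2] - T[0] ** 2, T[3] - T[1] ** 2     # M-step (2.4.27)
        theta = (T[0], T[1], v1, v2, (T[4] - T[0] * T[1]) / math.sqrt(v1 * v2))
    return theta

pairs = [(0.1, 0.3), (1.2, 1.0), (-0.5, -0.8), (2.0, 1.7), (0.4, 0.6)]
print(em_bivariate_normal(pairs, z_only=[1.5, -1.0], y_only=[0.9],
                          theta=(0.5, 0.5, 1.0, 1.0, 0.5)))
```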
Remark 2.4.1. Note that if S(X) = X, then J(θ | θ_0) is log[p(X, θ)/p(X, θ_0)], which as a function of θ is maximized where the contrast -log p(X, θ) is minimized. Also note that, in general, -E_{θ_0}[J(θ | θ_0)] is the Kullback-Leibler divergence (2.2.23).
Summary. The basic bisection algorithm for finding roots of monotone functions is devel oped and shown to yield a rapid way of computing the MLE in all one-parameter canonical exponential families with E open (when it exists). We then, in Section 2.4.2, use this algorithm as a building block for the general coordinate ascent algorithm, which yields with certainty the MLEs in k-parameter canonical exponential families with E open when it exists. Important variants of and alternatives to this algorithm, including the Newton Raphson method, are discussed and introduced in Section 2.4.3 and the problems. Finally in Section 2.4.4 we derive and discuss the important EM algorithm and its basic properties.
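As a small illustration of the bisection step mentioned in the summary (Python sketch, not from the text; the binomial family and the bracketing interval are choices made for the example), the MLE in a one-parameter canonical exponential family solves Ȧ(η) = T(x), and Ȧ is strictly increasing, so bisection applies directly.

```python
import math

def bisect_mle(A_dot, t, lo, hi, tol=1e-10):
    """Solve A_dot(eta) = t by bisection; A_dot must be increasing with A_dot(lo) < t < A_dot(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if A_dot(mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Binomial(n, theta) in canonical form: eta = log(theta/(1-theta)),
# A(eta) = n*log(1 + e^eta), so A_dot(eta) = n*e^eta/(1 + e^eta).
n, successes = 20, 7
eta_hat = bisect_mle(lambda e: n * math.exp(e) / (1 + math.exp(e)), successes, -20, 20)
print(1 / (1 + math.exp(-eta_hat)))   # recovers theta_hat = 7/20 = 0.35
```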
2.5
PROBLEMS AND COMPLEMENTS
Problems for Section 2.1 1. Consider a population made up of three different types of individuals occurring in the 2 Hardy-Weinberg proportions 02, 28(1 - 0) and ( l - 8) , respectively, where 0 < B < I. (a) Show that T3
' "
I'
=
NI/n + N2/2n is a frequency substitution estimate of e.
(b) Using the estimate of (a), what is a frequency substitution estimate of the odds ratio B/(1 - B)?
(c) Suppose X takes the values -1, 0, 1 with respective probabilities PI, p2, P3 given by the Hardy-Weinberg proportions. By considering the first moment of X, show that T3 is a method of moment estimate of (). 2. Consider n systems with failure times X1 , , Xn assumed to be independent and identically distributed with exponential, &( >.) , distributions. .
.
•
(a) Find the method of moments estimate of >.. based on the first moment. (b) Find the method of moments estimate of >.. based on the second moment. (c) Combine your answers to (a) and (b) to get a method of moment estimate of >.. based on the first two moments. (d) Find the method of moments estimate of the probability P(X1 2: 1) that one system will last at least a month.
3. Suppose that i.i.d. X1 , . , Xn have a beta, {3(a1, a2 ) distribution. Find the method of moments estimates of a = (a 1 , a2 ) based on the first two moments. .
.
; '
''
!
139
Problems and Complements
Section 2.5
Hint: See Problem 8.2.5. 4. Let X1 ,
. . .
, X11 be the indicators of n Bernoulli trials with probability of success 8.
(a) Show that X is a method of moments estimate of 8. -
(b) Exhibit method of moments estimates for VaroX 8(1 8)/n first using only the first moment and then using only the second moment of the population. Show that these estimates coincide. =
(c) Argue that in this case all frequency substitution estimates of q(8) must agree with
q(X).
5. Let Xt, . . . , Xn be a sample from a population with distribution � function F and � frequency function or density p. The empirical distribution function F is defined by F(x) = [No. of X; < x]/n. If q(8) can be written in the form q(8) � s(F) for some � function s of F we define the empirical substitution principle estimate of q(B) to be s ( F). (a) Show that in the finite discrete case, empirical substitution estimates coincides with
frequency substitution estimates. � Hint: Express F in terms of p and F in tenns of
(
p X) �
�
No. of X; � x n
.
�
(b) Show that in the continuous case X "' F means that X = Xi with probability 1/n.
(c) Show that the empirical substitution estimate of the jth moment /1j is the jth sample �
moment J..LJ . Hint: Write m;
�
J== xi dF(x) or m1
�
Ep(Xi) where X - F.
� � (d) For t1 < · · · < t., find the joint frequency function ofF(t,), . . . , F(tk). 1 Hint: Consider nF(t1), N2 n(F(t2 ) - F(t,)), . . . , � (N1, . . . , Nk+ I ) where N Nk+I = n(l - F(tk)). �
�
6. Let X( l ) < < X(n) be the order statistics of a sample X1, . . . , Xn. (See Problem 8.2.8.) There is a one-to-one correspondence between the empirical distribution function F and the order statistics in the sense that, given the order statistics we may construct F and given F, we know the order statistics. Give the details of this correspondence. ·
·
·
�
�
�
The jth cumulant 0· of the empirical distribution function is called the jth sample cumulanr and is a method of moments estimate of the cumulant cj. Give the first three sample cumulants. See A.l2.
7.
8. Let ( Z1 , Y1 ) , ( Z2 , Y2 ), . . . , ( Zn, Yn) be a set of independent and identically distributed F. The natural estimate of F(s t) is random vectors with common distribution function � the bivariate empirical distribution function F(s, t), which we define by ,
� F( s,t )
�
Number of vectors (Z;, Y;) such that Z; < s and Y; < t
n
.
I ;
140
Methods of Estimation �
Chapter
2
�
(a) Show that F(·, · ) is the distribution function of a probability P on R2 assigning mass 1/n to each point (Zi, Yi ) . (b) Define the sample product moment of order (i, j), the sample covariance, the sample correlation, and so on, as the corresponding characteristics of the distribution F. Show that the sample product moment of order (i,j) is given by �
The sample covariance is given by
IL n
n k=l '
' '
(Zk - Z)(Yk - Y) = -
n -I ...- L., zk yk - ZY, n k=l
where Z, Y are the sample means of the Z1, . . . , Zn and Y1 , sample correlation coefficient is given by
.
.
•
, Yn, respectively. The
'
All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates. (See Problem 2.1.17.) Note that it follows from (A.11.19) that - I < r < 1.
,I
'
9. Suppose X = (X,, , . . , Xn) where the X, are independent N(O, a2).
'
(a) Find an estimate of cr2 based on the second moment. (b) Construct an estimate of a using the estimate of part (a) and the equation a
=
!
#.
i •
i'
'
(c) Use the empirical substitution principle to construct an estimate of a using the relation E(IX, I) = a.,J21r.
10. In Example 2. 1.1, suppose that g(/3, z) is continuous in {3 and that lg(/3, z) I tends to oo as \.81 tends to oo. Show that the least squares estimate exists. Hint: Set c = p(X, 0). There exists a compact set K such that for f3 in the complement of K, p(X, {3) > c. Since p(X, {3) is continuous on K, the result follows. 11. In Example 2.1.2 with X ji, and M3 · Hint: See Problem B.2.4.
�
f(a, >.), find the method of moments estimate based on
12. Let X1, , Xn be i.i.d. as X � Po, (} E 8 C Rd, with (} identifiable. Suppose X has possible values v1 , . . . , vk and that q(9) can be written as
'
' '
i ' :
J l
• . .
q(O) = h(!'1(6), . . . ,!'r(O))
I ' '
'
'
I '
Section 2.5
141
":_ ' "::: P '' Problems and Com:c::: :m :: :: ::
_ _ _ _ _
for some Rk -valued function h. Show that the method of moments estimate (j = h(/11, , fir) can be written as a frequency plug-in estimate. 1 l . Suppose X estimates( ) Xn are i.i.d. as X P0 , General method of moment 1, 13. with (J E 8 C Rd and (J identifiable. Let 91 , . . . , 9r be given linearly independent functions and write • • •
.
p;(O) = Eo(g; ( X )), /i;
=
.
rv
.
n
n - r L9;( X;), j = 1,
.
.
. , r.
i=l
Suppose that X has possible values v1 , . . . , vk and that q(O) = h(pr(O), . . . , Pr(O)) for some Rk-valued function h.
(a) Show that the method of moments estimate (j = h(/11, . . . , ilr) is a frequency plug
in estimate. (b) Suppose {Po : () E 8} is the k-parameter exponential family given by (1.6.10). Let g,(X) = T;(X ) , 1 < j < k. In the following cases, find the method of moments esumates •
(i) Beta, 1)(1, 8) (ii) Beta, /)(8, 1) (iii) Raleigh, p(x,B) = (xj82) exp(-x2/282) , x > 0. 8 > 0 (iv) Gamma, f(p, 8), p fixed (v) Inverse Gaussian, IG(p, -") , () = (p, A). See Problem 1.6.36. 14.
Hint: Use Corollary 1.6.1.
When the data are not i.i.d., it may stiJl be possible to express parameters as functions of moments and then use estimates based on replacing population moments with ..sample" moments. Consider the Gaussian AR( I) model of Example 1.1.5. (a) Use E(X;) to give a method of moments estimate of p.
(b) Suppose p = po and /) = b are fixed. Use E(U{), where
U, = (X; - pa) to
1/2
,
give a method of moments estimate of u2• (c) If J..l and u2
are fixed, can you give a method of moments estimate of {3?
142
Methods of Estimation
Chapter 2
15. Hardy-Weinberg with six genotypes. In a large natural population of plants (Mimulus guttatus) there are three possible alleles S, I, and F at one locus resulting in six genotypes labeled SS, II, FF, Sf, SF, and IF. Let 01, 02 , and 83 denote the probabilities of S, I, and F, respectively, where ()i = 1. The Hardy-Weinberg model specifies that the six genotypes have probabilities
L7=I I
Genotype Genotype
ss e,
Probability
2
3
4
5
6
II
FF
Sf
SF
IF
e,
e3
ze,e,
ze,e3
ze,e3
Let Nj be the number of plants of genotype j in a sample of n independent plants, 1 < j < 6 and let p1 � N; /n. Show that
'
I'
are frequency ' '
plug-in estimates of 81,
16. Establish (2.1.6).
Hint: [Y, - g(/3,z;)] =
02 , and (}3·
[Y; - g(/30, z;)] + [g(/3o , z;) - g(/3,zi )] .
17. Multivariate method of moments. For a vector X =
(X1, . . . , Xq). of observations, let
the moments be
mikrs
=
E(XtX�),
For independent identically distributed empirical or sample moment to be �
ffijkrs
n
j
Xi = (Xt1, . . . , Xiq ) , i
1 "' x; x• - - L....J i is• n . t=l
r
> 0, k > 0; r, s = 1, . . . , q.
J > o' k > - o·' r, s ·
-
=
1, . . . , n, we define the
1, . . . ' q .
IfB = {01�, , Om) can be expressed as a function of the moments, the method of moments estimate 8 of 8 is obtained by replacing mJkrs by fiiikrs· Let X = ( Z, Y) and 8 = (a1, b1), where (Z, Y) and (a,, bt) are as in Theorem 1.4.3. Show that method of moments estimators of the parameters b1 and a1 in the best linear predictor are .
•
•
Problems for Section 2.2
1. An object of unit mass is placed in a force field of unknown constant intensity 8. Read ings Y1 , , Yn are taken at times t1 , . . . , tn on the position of the object. The reading Yi .
.
•
I
Section 2.5
143
Problems and Complements
differs from the true position ( 8/2)tf by a mndom error f1• We suppose the t:i to have mean 0 and be uncorrelated with constant variance. Find the LSE of 8. 2. Show that the fonnulae of Example 2.2.2 may be derived from Theorem 1.4.3, if we con sider the distribution assigning mass 1/n to each of the points
(zr , YI ) , . . . , (zn , Yn) ·
3. Suppose that observations Y1 , . . . , Yn have been taken at times z1 , . . , Zn and that the linear regression model holds. A new observation Yn-1-1 is to be taken at time Zn+I · What is the least squares estimate based on Y1 , . . . , Y;t of the best (MSPE) predictor of Yn+ I ? .
4. Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points (zi,Yi). i = 1, . . , n, in fact, all lie on a line. Hint: Write the lines in the fonn .
�(y - y) � � ·· p
(z - z) "
7
�
.
5. The regression line minimizes the sum of the squared vertical distances from the points (z1 , y1 ), , (zn, Yn)- Find the line that minimizes the sum of the squared perpendicular distance to the same points. Hint: The quantity to be minimized is . • .
I:?-r (y, - 8r - 8,z,)2 1 + 8,i 6. (a) Let Y1 , . . . , Yn be independent random variables with equal variances such that E(Yi) = o:z3 where the z3 are known constants. Find the least squares estimate of o:.
(b) Relate your answer to the fonnula for the best zero intercept linear predictor of Section 1.4. Show that the least squares estimate is always defined and satisfies the equations (2.1.5) provided that g is differentiable with respect to (3,, 1 < i < d, the range {g(zr, /3), . . . , g (zn , /3), i3 E Rd } is closed, and /3 ranges over R'7.
8. Find the least squares estimates for the model Yi = 81 + 82zi + €i with t:i {2.2.4)-(2 . 2.6) under the restrictions 81 > 0, 8, :;; 0.
a�
given by
9. Suppose Yi = 81 + t:i, i = 1, . . . , n1 and Yi = 82 + l':i, i = n1 + 1, . . . , n1 + n2 , where t: 1 , . . . , €n1+n2 are independent N(O, a2) variables. Find the least squares estimates of ()1 and 8,. 10. Let X1, , Xn denote a sample from a population with one of the following densities or frequency functions. Find the MLE of 8. .
•
•
(a) f(x, 8) � 8e-Ox, X > 0; 8 > 0. (exponential density)
(b) f(x, 8) = 8c9x-(O+ I ), X > c; c constant > 0; 8 > 0. (Pareto density)
144
Methods of Estimation
(c) f(3c,8)
Chapter 2
c8".r-lc+l), :r > 8; c constant > 0: 8 > 0. (Pareto density) (d) f(x, 8) = v'ex v'0- 1 , 0 < x < 1, 8 > 0. (beta, il( VB, 1), density) (e) f(x, 8) = (x/82 ) exp{ -x2/282 }, x > 0; 8 > 0. (Rayleigh density) =
(f) f(x, 8) = 8cx'--1 exp{ -8x'}, x
> 0; c constant > 0; 8 > 0. (Wei bull density)
, Xn. n > 2, is a sample from a N(Jl, a2) distribution. (a) Show that if J1 and a2 are unknown, J1 E R, a2 > 0, then the unique MLEs are /i = X and a2 = n-1 2.:� 1 (X, - X ) 2 (b) Suppose p and a 2 are both known to be nonnegative but otherwise unspecified. Find maximum likelihood estimates of J1 and a2 . 11. Suppose that X1 ,
•
.
.
12. Let X1 , . . . , Xn, n > 2, be independently and identically distributed with density f(x, 8)
=
I
-
0'
exp { -(x - Jl) /0'}, x 2 Jl ,
where 8 = (Jl, 0'2 ), -oo < J1 < oo, 0' 2 > 0.
(a) Find maximum likelihood estimates of J.t and a2 .
(b) Find the maximum likelihood estimate of Pe [X1 > t] for t > 11· Hint: You may use Problem 2.2. 16(b ).
13. Let X1 ' . . . ' Xn be a sample from a u [8 - i 8 + i I distribution. Show that any T such that X(n) i < T < X(l) + i is a maximum likelihood estimate of 8. (We write U[a, b] to make p(a) = p(b) = (b - a) 1 rather than 0.) '
-
14, If n extsts. •
=
-
I in Example 2.1.5 show that no maximum likelihood estimate of 8 = (Jl, 0'2 ) �
�
15, Suppose that T(X) is sufficient for 8 and that 8(X) is an MLE of 8. Show that 8 � depends on X through T(X) only provided that 8 is unique. Hint: Use the factorization theorem (Theorem 1.5.1). 16. (a) Let X
Pe, 8
�
E
8 and let 8 denote the MLE of 8. Suppose that h is a one-toone function from 8 onto h(8). Define ry = h(8) and let f(x, ry) denote the density or frequency function�of X in terms of T} (i.e., reparametrize the model using ry). Show that the MLE of ry is h (8) (i.e., MLEs are unaffected by reparametrization, they are equivariant under one-to-one transformations). -
9 E 8}, 8 c RP, p > I , be a family of models for X E X c R"� k Let q be a map from 8 onto !1, !1 c R , I � k < p. Show that if 9 is a MLE of 9, then q(O) is an MLE of w = q(O). = {9 E 8 : q(O) = w }, then {8(w) : w E !1} is a partition of 8, and 8(w) Let Hint: � 9 belongs to only one member of this partition, say 8(w). Because q is onto !1, for each w E !1 there is 9 E 8 such that w = q( 9). Thus, the MLE of w is by definition (b) Let 1' = {Po :
WMLE
=
arg sup sup{Lx(O) : 9 E 8(w)}. WEO
'
I
I
I
i
Section 2.5
145
Problems and Complements
Now show that WMLE � w � q(O). 17. Censored Geometric Waiting Times. If time is measured in discrete periods, a model that is often used for the time X to failure of an item is
P, [x
�
k] � Bk -1(1 - B), k � 1, 2, . . .
where 0 < 8 < 1. Suppose that we only record the time of failure, if failure occurs on or before time r and otherwise just note that the item has lived at least (r + 1) periods. Thus, we observe Y1 , . . , Yn which are independent, identically distributed, and have common frequency function, f( k,B) � Bk- 1 ( 1 - B), k � 1 , . . . , r
.
f(r + 1 , B) � 1 - Po [X < r] � 1 - L Bk-1 (1 - B) � B'. k=I (We denote by "r + 1" survival for at least (r + 1) periods.) Let M = number of indices i such that Yi = r + 1. Show that the maximum Jikelihood estimate of 8 based on Y1 , , Yn
.lS
•
-
B(Y) �
""
•
.
y: - n
L...��� . ' M. Li�l Y, -
18. Derive maximum likelihood estimates in the following models. (a) The observations are indicators of Bernoulli trials with probability of success 8. We want to estimate B and VaroX1 = B(1 - B). (b) The observations are X1 = the number of failures before the first success, X2
=
the number of failures between the first and second successes, and so on, in a sequence of binomial trials with probability of success fJ. We want to estimate 8.
.
19. Let X1 , . . , Xn be independently distributed with Xi having a N( ei, 1) distribution, 1 < i < n. (a) Find maximum likelihood estimates of the fJi under the assumption that these quan tities vary freely. (b) Solve the problem of part (a) for n = 2 when it is known that B1 < B,. A general solution of this and related problems may be found in the book by Barlow, Bartholomew, Bremner, and Brunk (1972).
20. In the "life testing" problem 1.6. 16(i), find the MLE of B. 21. (Kiefer-Wolfowitz) Suppose (X,, . . . , Xn) is a sample from a population with density
f(x, B) �
9
lOa
(X - p)
5)]/P( S � 5)
�
.0262
if S = 5. Such randomized tests are not used in practice. They are only used to show that with randomization, likelihood ratio tests are unbeatable no matter what the size a is. Theorem 4.2.1. (Neyman-Pearson Lemma). (a) If a > 0 and i{)k is a size a likelihood ratio test, then <{Jk is MP in the class of level a tests. (b) For each 0 < a :5: 1 there exists an MP size a likelihood ratio test provided that randomization is permitted, 0 < r.p(x) < l,for some x. (c) Ifr.p is an MP level o: test, then it must be a level a likelihood ratio test; that is, there exists k such that
P, [
0. To this end consider
Et[cpk(X) -
0. Note that a > 0 implies k < oo and, thus, 'Pk(x) � 1 if p(x, 80) � 0. It follows that II > 0. Finally, using (4.2.3), we have shown that k and have 0} D and 0 � Oo. It follows from the Neyman-Pearson lemma that an MP test has power at least as large as its level; that is, a with equality iffp(·, Oo) � f3(0, 0) + .u/ + (1 - >.JaN((! ) density. Find the efficiency ep( X, X) as defined in (f).- If and a = 1, = 4, evaluate the efficiency for £ = .05, 0.10, 0.15 and 0.20 and note that X is more efficient than X for these gross error cases. (h) X Suppose that X1 has the Cauchy density f(x) = 1/:>r(l + x 2), x E R. Show that ep(X , ) = O. P is c} are monotone increasing in c. Finally, to obtain I
Et[
2
kEo (
(4.2.4)
'
'
I
'
i
I
1 '
'
Section 4.2
Choosing a
Test Statistic:
225 -------= =cc
"--
The Neyman-Pearson lemma
(b) If a � 0, k = oo makes 'Pk MP size a. If a = 1, k = 0 makes E1 'Pk(X) � 1 and I.Pk is MP size a. Next consider 0 < a < 1. Let P1 denote Po, , i = 0, 1. Because Po [L(X, Oo, OJ) = oo] = 0, then there exists k < oo such that
Po[L(X, 00,01) > k] < a and Po[T,(X, 00, OJ) > k] > a. If P0[L(X, Oo, 01) � k] � 0, then 'Pk is MP size a. If not, define
a - Po[L(X, o,,O,) > k] 'Pk (x) � P0[L(X, Oo, OJ) = k]
on the set {x : L(x, Oo, O,) = k ) . Now 'Pk is MP size a. (c) Let x E {x : p(x, OJ) > 0}, then to have equality in (4.2.4) we need to have
Corollary 4.2.1. /f
p(-,o,).
Proof. See Problem 4.2. 7.
Remark 4.2.1. Let Jr denote the prior probability of 00 so that (1-Jr) is the prior probability of fh. Then the posterior probability of (}1 is
(1 - JT)p(x, O,) 7r(0 1 I X) (1 - JT)p(x, O,) + JTp(x, Oo)
-
_
(1 - JT)L(x,00,01) (1 - JT)L(x, Oo, 0, ) + Jr .
(4.2.5)
If o, denotes the Bayes procedure of Example 3.2.2(b), then, when Jr � k/(k + 1), o, � 'Pk· Moreover, we conclude from (4.2.5) thatthis o, decides 01 or 00 according as Jr(01 I x) is larger than or smaller than 1/2 . Part (a) of the lemma can, for 0 < a < 1, also be easily argued from this Bayes property of 'Pk (Problem 4.2.10). Here is an example illustrating calculation of the most powerful level a test I.Pk·
Example 4.2.1. Consider Example 3.3.2 where X = (X1, , Xn ) is a sample of n N(J.L,u2) random variables with u2 known and we test H : J.L = 0 versus K : I" = v, where v is a known signal. We found •
•
•
nv2 v n L(X, O,v) = exp 2 L X; u i= l 2u2 Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions. Therefore,
X T(X) � Vn-;; =
a
v.,fii
[log L(X,O,v) + nv2 ] 2"2
226
Testing and Confidence Regions
Chapter 4
is also optimal for this problem. But T is the test statistic we proposed in Example 4.1.4. From our discussion there we know that for any specified a, the test that rejects if, and only if. T > z(l - a)
(4.2.6)
has probability of type I error o:. The power of this test is, by (4.1 .2), (z(a) + (v../ii/ (J)). By the Neyman-Pearson lemma this is the largest power available with a level et test. Thus, if we want the probability of detecting a signal v to be at least a preassigned value j3 (say, .90 or .95), then we solve ( z (a)+ ( v../ii/(J) ) = !3 for n and find that we need to take n = ( (J /v )2[z(l -a) + z(/3)f D This is the smallest possible n for any size a test. An interesting feature of the preceding example is that the test defined by (4.2.6) that is MP for a specified signal v does not depend on v: The same test maximizes the power for all possible signals v > 0. Such a test is called uniformly most powerful (UMP). We will discuss the phenomenon further in the next section. The following important example illustrates, among other things, that the UMP test phenomenon is largely a feature of one-dimensional parameter problems. Example 4.2.2. Simple Hypothesis Against Simple Alternative for the Multivariate Nor mal: Fisher's Discriminant Flmction. Suppose X � N(p1, E1), 81 = (p1, E1), j = 0, 1. The likelihood ratio test for H : 9 = Bo versus K : (J = fh is based on '
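The sample size formula n = (σ/v)²[z(1 - α) + z(β)]² and the power expression from (4.1.2) are easy to check numerically (Python sketch, not from the text; the particular α, β, σ, and v values are made up for illustration).

```python
from statistics import NormalDist

def sample_size(alpha, beta, sigma, v):
    """Smallest n giving power >= beta at mu = v for the level-alpha test (4.2.6)."""
    z = NormalDist().inv_cdf
    return (sigma / v) ** 2 * (z(1 - alpha) + z(beta)) ** 2

def power(n, alpha, sigma, v):
    """beta(v) = Phi(z(alpha) + v*sqrt(n)/sigma)."""
    nd = NormalDist()
    return nd.cdf(nd.inv_cdf(alpha) + v * n ** 0.5 / sigma)

n = sample_size(alpha=0.05, beta=0.90, sigma=1.0, v=0.5)
print(n, power(n, 0.05, 1.0, 0.5))   # about 34.3 and 0.90
```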
.
•
'
'
'' •
t '
Rejecting H for L large is equivalent to rejecting for
I
1
•
.. • •
•
•
large. Particularly important is the case Eo = Et when "Q large" is equivalent to "F = (tt 1 - p.0)EQ1X large." The function F is known as the Fisher discriminant function. It is used in a classification context in which 9o, 91 correspond to two known populations and we desire to classify a new observation X as belonging to one or the other. We return to this in Volume II. Note that in general the test statistic L depends intrinsically on ito, JJ 1. However if, say, J.£1 = p,0 + Ado, ).. > 0 and Et = Eo, then, if ito• Llo, Eo are known, a UMP (for all ).. ) test exists and is given by: Reject if
'
i
•
•
I'
,
'
(4.2.7)
where c = z(l - a)[�6E0 1 �0]! (Problem 4.2.8). If �0 = (1, 0, . . . , O)T and Eo = I, then this test rule is to reject H if X1 is large; however, if E0 =j:. I, this is no longer the case (Problem 4.2.9). In this example we bave assumed that IJo and IJ, for the two populations are known. If this is not the case, they are estimated with their empirical versions with sample means estimating population means and sample covariances estimating population covanances. •
I
'
Section 4.3
UnifOfmly Most Powerful Tests and Monotone Likelihood Ratio Models
227
Summary. We introduce the simple likelihood ratio statistic and simple likelihood ratio (SLR) test for testing the simple hypothesis H : () = Bo versus the simple alternative K : 0 01• The Neyman-Pearson lemma, which states that the size a SLR test is uniquely most powerful (MP) in the class of level a tests, is established. We note the connection of the MP test to the Bayes procedure of Section 3.2 for de ciding between Oo and 81. Two examples in which the MP test does not depend on ()1 are given. Such tests are said to be UMP (unifonnly most powerful). =
4.3
U N I FORMLY MOST POWERFUL TESTS ANO MONOTONE L I K E LIHOOD RATIO MODELS
We saw in the two Gaussian examples of Section 4.2 that UMP tests for one-dimensional parameter problems exist. This phenomenon is not restricted to the Gaussian case as the next example illustrates. Before we give the example, here is the general definition of UMP: Definition 4.3.1. A level a test
(3 (0,
:
0 E 60 (4.3.1)
for any other level a test 1.p.
Example 4.3.1. Testing for a Multinomial Vector. Suppose that ( N1, . . . , Nk) has a multi nomial M(n , Bt, . . . , Bk) distribution with frequency function,
""' " n! n1 1. . . . nk.1 171 • · · 17k
P(nt, · · · , nk , B) -
"
'
where n1, . . . , nk are integers summing ton. With such data we often want to test a simple hypothesis H : B1 = Ow, . . . , Ok = ()kO· For instance, if a match in a genetic breeding ex periment can result in k types, n offspring are observed, and Ni is the number of offspring of type i, then (N1 , . . . , Nk) - M(n, e,, . . . , Ok)· The simple hypothesis would corre spond to the theory that the expected proportion of offspring of types 1 , . . . , k are given by Ow, Oko · Usually the alternative to H is composite. However, there is sometimes a simple alternative theory K : B1 = B11, . . . , (}k = Ok1· In this case, the likelihood ratio L • . .
,
.
IS
rr (:'')N' tO .
L�
i=l
Here is an interesting special case: Suppose Ojo > 0 for ally", 0 < integer l with 1 :0 l < k
€
< 1 and for some fixed (4.3.2)
where
228
Testmg and Confidence Regions
Chapter 4
That is, under the alternative, type l is less frequent than under H and the conditional probabilities of the other types given that type l has not occurred are the same under K as they are under H. Then =
pn-N,f_N/
pn ( Ejp)N' .
L Because t:. < 1 implies that p > E, we conclude that the MP test rejects H, if and only if, =
Critical values for level a are easily determined because Nl B( n, 810) under H. Moreover, for a = P( N, < c) , this test is UMP for testing H versus K : 0 E 8 1 = { 1:1 : 1:1 is of the form (4.3.2) with 0 < E < 1 }. Note that because l can be any of the integers 1 , . . . , k, we get radically different best tests depending on which (}i we assume to be Ow 0 under H. N1 <
c.
.......,
Typically the MP test of H : 0 = Oo versus K : (} = 01 depends on (h and the test is not UMP. However, we have seen three models where, in the case of a real parameter, there is a statistic T such that the test with critical region { x : T(x) > c} is UMP. This is part of a general phenomena we now describe.
Definition 4.3.2. The family of models { P, : 0 E 8} with 8 c R is said to be a monotone likelihood ratio (MLR) family if for fit < 02 the distributions Po1 and Po2 are distinct and D the ratio p(:r:, 02 )jp(x, 0 1 ) is an increasing function ofT(x).
Example 4.3.2 (Example 4.1.3 continued). In this i.i.d. Bernoulli case, set s then p(x, 0) = 0 ( 1 - 0)"-' = (I - 0)"[0/(1 - 0)]"
'
' '
'
and the model is by (4.2.1) MLR in s.
=
2:� 1 x;,
D
Example 4.3.3. Consider the one-parameter exponential family mode!
p(x,O) = h(x) exp{ry (O)T(x) - B (O)} .
'
•
"
•
.
If TJ(O) is strictly increasing in () E 6, then this family is MLR. Example 4.2.1 is of this D form with T(x) = ..fiixfu and ry(p.) = (..fiiu)p., where u is known. Define the Neyman-Pearson (NP) test function
. • •
I ifT(x) > t 6t (x ) 0 if T(x) < t
' • '
'
_
(4.3.3)
with 61(x) any value in (0, 1) if T(x) = t. Consider the problem of testing H : 0 = 00 versus K : 0 = O, with Oo < 8,. If {Po : 0 E 8}, 8 C R, is an MLR family in T(x), then L(x, Oo, 01) = h(T(x)) for some increasing function h. Thus, 6, equals the likelihood ratio test
"
:
'
' :·.
'
Theorem 4.3.1. Suppose {Po : 8 E 8 }, 8 c R, is an MLRfamily in T(x). (I) For each t E (0, oo), the powerfunction (3(1:1) = Eo 61 (X ) is increasing in 0. (2) If Eo061(X) = "' > 0, then 6, is UMP level "' for testing H : 8 < 80 versus
K : O > O,.
•
'
'
I
Section 4.3
Uniformly Most Po�rful Tests and Monotone likelihood Ratio Models
229
Proof (I) follows from <51 = I.Ph(t) and Corollary 4.2.1 by noting that for any fh < ()2 , c5t is MP at level Eo, o1(X) for testing H : 0 = 01 versus [( : 0 = o,. To show (2), recall that we have seen that c5t maximizes the power for testing II : () = Oo versus K : () > Oo among the class of tests with level a = Eo,61 (X). If 0 < Oo, then by (1), Eoo1 (X) < a and br is of level o: for H : () :< 00. Because the class of tests with level a for H : () < ()0 is contained in the class of tests with level a for H : () = 00, and because dt maximizes the 0 power over this larger class, bt is UMP for H : () < Oo versus K : () > Oa. The following useful result follows immediately.
Corollary 4.3.1. Suppose {Po : 0 E 6}, e c R, is an MLRfamily in T(r). If the distributionfunction Fa ofT( X) under X ,....., Pe0 is continuous and ift(l -a) is a solution of Fo (t) = 1 - a, then the test that rejects H if and only ifT(r) > t(I ex) is /JMP level a for testing H : () < Oo versus K : () > Oo. -
Example 4.3.4. Testing Precision. Suppose X_1, . . . , X_n is a sample from a N(μ, σ²) population, where μ is a known standard, and we are interested in the precision σ^{-1} of the measurements X_1, . . . , X_n. For instance, we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard. Because the most serious error is to judge the precision adequate when it is not, we test H : σ ≥ σ_0 versus K : σ < σ_0, where σ_0^{-1} represents the minimum tolerable precision. Let S = Σ_{i=1}^n (X_i − μ)²; then
p(x, σ²) = exp{ −(1/2σ²) S − (n/2) log(2πσ²) }.
This is a one-parameter exponential family and is MLR in T = −S. The UMP level α test rejects H if and only if S ≤ s(α), where s(α) is such that P_{σ0}(S ≤ s(α)) = α. If we write
S/σ_0² = Σ_{i=1}^n [(X_i − μ)/σ_0]²,
we see that S/σ_0² has a χ²_n distribution under H. Thus, the critical constant s(α) is σ_0² x_n(α), where x_n(α) is the αth quantile of the χ²_n distribution. □
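For instance, the critical constant s(α) = σ_0² x_n(α) and the power of this test at an alternative σ < σ_0 can be computed numerically. The following is a minimal Python sketch; the settings n = 20, σ_0 = 1, α = 0.05 and the alternative σ = 0.7 are illustrative, not taken from the text.

```python
# Sketch: critical constant and power for the UMP test of H: sigma >= sigma0
# versus K: sigma < sigma0 in the precision example (illustrative values).
from scipy.stats import chi2

n, sigma0, alpha = 20, 1.0, 0.05                 # hypothetical settings
s_alpha = sigma0**2 * chi2.ppf(alpha, df=n)      # s(alpha) = sigma0^2 * x_n(alpha)

def power(sigma):
    # Under sigma, S / sigma^2 ~ chi-square_n, so
    # P_sigma(S <= s(alpha)) = P(chi2_n <= s(alpha) / sigma^2).
    return chi2.cdf(s_alpha / sigma**2, df=n)

print(s_alpha)        # rejection threshold for S
print(power(0.7))     # power at the alternative sigma = 0.7
```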
Example 4.3.5. Quality Control. Suppose that, as in Example 1.1.1, X is the observed number of defectives in a sample of n chosen at random without replacement from a lot of N items containing b defectives, where b = Nθ. If the inspector making the test considers lots with b_0 = Nθ_0 defectives or more unsatisfactory, she formulates the hypothesis H as θ ≥ θ_0, the alternative K as θ < θ_0, and specifies an α such that the probability of rejecting H (keeping a bad lot) is at most α. If α is a value taken on by the distribution of X, we now show that the test δ* with "reject H if, and only if, X < h(α)," where h(α) is the αth quantile of the hypergeometric, H(Nθ_0, N, n), distribution, is UMP level α. For simplicity suppose that b_0 ≥ n and N − b_0 ≥ n. Then, if Nθ_1 = b_1 < b_0 and 0 ≤ x ≤ b_1, (1.1.1) yields
L(x, θ_0, θ_1) = [b_1(b_1 − 1) · · · (b_1 − x + 1)(N − b_1) · · · (N − b_1 − n + x + 1)] / [b_0(b_0 − 1) · · · (b_0 − x + 1)(N − b_0) · · · (N − b_0 − n + x + 1)].
Note that L(x, θ_0, θ_1) = 0 for b_1 < x ≤ n. Thus, for 0 ≤ x ≤ b_1 − 1,
L(x + 1, θ_0, θ_1)/L(x, θ_0, θ_1) = [(b_1 − x)((N − n + 1) − (b_0 − x))] / [(b_0 − x)((N − n + 1) − (b_1 − x))] < 1.
Therefore, L is decreasing in x and the hypergeometric model is an MLR family in T(x) = −x. It follows that δ* is UMP level α. The critical values for the hypergeometric distribution are available on statistical calculators and software. □
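As an illustration of how such critical values can be obtained with software, the following Python sketch finds the largest c with P_{θ0}(X ≤ c) ≤ α for a hypergeometric lot; the lot size, sample size, θ_0 and α below are hypothetical.

```python
# Sketch: critical value for the UMP quality-control test.  Reject
# H: theta >= theta0 when X <= c, where c is the largest value with
# P_{theta0}(X <= c) <= alpha.  The numbers below are illustrative only.
from scipy.stats import hypergeom

N_lot, n_sample, theta0, alpha = 1000, 50, 0.10, 0.05   # hypothetical
b0 = int(N_lot * theta0)                                # defectives under H
lot = hypergeom(M=N_lot, n=b0, N=n_sample)              # scipy's parametrization

c = -1
while lot.cdf(c + 1) <= alpha:
    c += 1

print(c, lot.cdf(c))   # critical value and the exact size it attains
```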
Power and Sample Size

In the Neyman-Pearson framework we choose the test whose size is small. That is, we choose the critical constant so that the maximum probability of falsely rejecting the null hypothesis H is small. On the other hand, we would also like large power β(θ) when θ ∈ Θ_1; that is, we want the probability of correctly detecting an alternative K to be large. However, as seen in Figure 4.1.1 and formula (4.1.2), this is, in general, not possible for all parameters in the alternative Θ_1. In both these cases, H and K are of the form H : θ ≤ θ_0 and K : θ > θ_0, and the powers are continuous increasing functions with lim_{θ↓θ0} β(θ) = α. By Corollary 4.3.1, this is a general phenomenon in MLR family models with p(x, θ) continuous in θ.

This continuity of the power shows that not too much significance can be attached to acceptance of H, if all points in the alternative are of equal significance: We can find θ > θ_0 sufficiently close to θ_0 so that β(θ) is arbitrarily close to β(θ_0) = α. For such θ the probability of falsely accepting H is almost 1 − α. This is not serious in practice if we have an indifference region. This is a subset of the alternative on which we are willing to tolerate low power. In our normal Example 4.1.4 we might be uninterested in values of μ in (0, Δ) for some small Δ > 0 because such improvements are negligible. Thus, (0, Δ) would be our indifference region. Off the indifference region, we want guaranteed power as well as an upper bound on the probability of type I error. In our example this means that in addition to the indifference region and level α, we specify β close to 1 and would like to have β(μ) ≥ β for all μ ≥ Δ. This is possible for arbitrary β < 1 only by making the sample size n large enough. In Example 4.1.4, because β(μ) is increasing, the appropriate n is obtained by solving
β(Δ) = Φ(z(α) + √n Δ/σ) = β
for sample size n. This equation is equivalent to
z(α) + √n Δ/σ = z(β),
whose solution is
n = (σ/Δ)² [z(β) − z(α)]² = (σ/Δ)² [z(1 − α) + z(β)]².
Note that a small signal-to-noise ratio Δ/σ will require a large sample size n.
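A minimal numerical sketch of this sample-size formula in Python follows; the values α = 0.05, β = 0.90 and Δ/σ = 0.5 are hypothetical choices, not taken from the text.

```python
# Sketch: sample size for the one-sided normal test so that the power at
# mu = Delta is at least beta (setting of Example 4.1.4); illustrative values.
from math import ceil
from scipy.stats import norm

alpha, beta = 0.05, 0.90
delta_over_sigma = 0.5                      # hypothetical signal-to-noise ratio

n = ((norm.ppf(1 - alpha) + norm.ppf(beta)) / delta_over_sigma) ** 2
print(ceil(n))   # smallest integer sample size meeting the power requirement
```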
Dual to the problem of not having enough power is that of having too much. It is natural to associate statistical significance with practical significance so that a very low p-value is interpreted as evidence that the alternative that holds is physically significant, that is, far from the hypothesis. Formula (4.1.2) shows that, if n is very large and/or α is small, we can have very great power for alternatives very close to θ_0. This problem arises particularly in goodness-of-fit tests (see Example 4.1.5), when we test the hypothesis that a very large sample comes from a particular distribution. Such hypotheses are often rejected even though for practical purposes "the fit is good enough." The reason is that n is so large that unimportant small discrepancies are picked up. There are various ways of dealing with this problem. They often reduce to adjusting the critical value so that the probability of rejection for parameter values at the boundary of some indifference region is α. In Example 4.1.4 this would mean rejecting H if, and only if,
X̄ ≥ (σ/√n) z(1 − α) + Δ.
As a further example and precursor to Section 5.4.4, we next show how to find the sample size that will "approximately" achieve desired power β for the size α test in the binomial example.
Example 4.3.6 (Example 4.1.3 continued). Our discussion uses the classical normal approximation to the binomial distribution. First, to achieve approximate size α, we solve β(θ_0) = P_{θ0}(S ≥ s) = α for s and find the approximate critical value
s_0 = nθ_0 + 1/2 + z(1 − α)[nθ_0(1 − θ_0)]^{1/2}.
Again using the normal approximation, we find
β(θ) = P_θ(S ≥ s_0) ≈ Φ( (nθ + 1/2 − s_0) / [nθ(1 − θ)]^{1/2} ).
Now consider the indifference region (θ_0, θ_1), where θ_1 = θ_0 + Δ, Δ > 0. We solve β(θ_1) = β for n and find the approximate solution
n = Δ^{-2} { z(1 − α)[θ_0(1 − θ_0)]^{1/2} + z(β)[θ_1(1 − θ_1)]^{1/2} }².
For instance, if α = .05, β = .90, θ_0 = 0.3, and θ_1 = 0.35, we need
n = (0.05)^{-2} {1.645 [0.3(0.7)]^{1/2} + 1.282 [0.35(0.65)]^{1/2}}² = 162.4.
Thus, the size .05 binomial test of H : θ = 0.3 requires approximately 163 observations to have probability .90 of detecting the 17% increase in θ from 0.3 to 0.35. The power achievable (exactly, using the SPLUS package) for the level .05 test for θ = 0.35 and n = 163 is 0.86.
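Exact power calculations of this kind can also be done directly from the binomial distribution. The following Python sketch computes the exact critical value and power of the one-sided binomial test; the values n = 200, θ_0 = 0.3, θ_1 = 0.35 are illustrative choices, not the text's.

```python
# Sketch: exact critical value and power for the one-sided binomial test
# H: theta = theta0 versus K: theta > theta0 (setting of Example 4.3.6).
from scipy.stats import binom

def critical_value(n, theta0, alpha):
    # Smallest s0 with P_{theta0}(S >= s0) <= alpha.
    s0 = 0
    while binom.sf(s0 - 1, n, theta0) > alpha:   # sf(s0 - 1) = P(S >= s0)
        s0 += 1
    return s0

n, theta0, theta1, alpha = 200, 0.30, 0.35, 0.05   # hypothetical settings
s0 = critical_value(n, theta0, alpha)
size = binom.sf(s0 - 1, n, theta0)       # exact size of the test
power = binom.sf(s0 - 1, n, theta1)      # exact power at theta1
print(s0, size, power)
```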
Our discussion can be generalized. Suppose θ is a vector. Often there is a function q(θ) such that H and K can be formulated as H : q(θ) ≤ q_0 and K : q(θ) > q_0. Now let
q_1 > q_0 be a value such that we want to have power β(θ) at least β when q(θ) ≥ q_1. The set {θ : q_0 < q(θ) < q_1} is our indifference region. For each n suppose we have a level α test for H versus K based on a suitable test statistic T_n. Suppose that β(θ) depends on θ only through q(θ) and is a continuous increasing function of q(θ), and also increases to 1 for fixed θ ∈ Θ_1 as n → ∞. To achieve level α and power at least β, first let c_0 be the smallest number c such that
P_{θ0}[T_n ≥ c] ≤ α.
Then let n be the smallest integer such that
P_{θ1}[T_n ≥ c_0] ≥ β,
where θ_0 is such that q(θ_0) = q_0 and θ_1 is such that q(θ_1) = q_1. This procedure can be applied, for instance, to the F test of the linear model in Section 6.1 by taking q(θ) equal to the noncentrality parameter governing the distribution of the statistic under the alternative. Implicit in this calculation is the assumption that P_{θ1}[T_n ≥ c_0] is an increasing function of n.
We have seen in Example 4.1.5 that a particular test statistic can have a fixed distribution L_0 under the hypothesis. It may also happen that the distribution of T_n as θ ranges over Θ_1 is determined by a one-dimensional parameter λ(θ) so that Θ_0 = {θ : λ(θ) = 0} and Θ_1 = {θ : λ(θ) > 0} and L_θ(T_n) = L_{λ(θ)}(T_n) for all θ. The theory we have developed demonstrates that if L_λ(T_n) is an MLR family, then rejecting for large values of T_n is UMP among all tests based on T_n. Reducing the problem to choosing among such tests comes from invariance considerations that we do not enter into until Volume II. However, we illustrate what can happen with a simple example.
Example 4.3.7. Testing Precision Continued. Suppose that in the Gaussian model of Example 4.3.4, μ is unknown. Then the MLE of σ² is σ̂² = (1/n) Σ_{i=1}^n (X_i − X̄)², as in Example 2.2.9. Although H : σ = σ_0 is now composite, the distribution of T_n = nσ̂²/σ_0² is χ²_{n−1}, independent of μ. Thus, the critical value for testing H : σ = σ_0 versus K : σ < σ_0, and rejecting H if T_n is small, is the α quantile of χ²_{n−1}. It is evident from the argument of Example 4.3.3 that this test is UMP for H : σ ≥ σ_0 versus K : σ < σ_0 among all tests depending on σ̂² only. □

Complete Families of Tests

The Neyman-Pearson framework is based on using the 0-1 loss function. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions l(θ, a), a ∈ A = {0, 1}, θ ∈ Θ, that are not 0-1. For instance, for Θ_1 = (θ_0, ∞), we may consider l(θ, 0) = (θ − θ_0), θ ∈ Θ_1. In general, when testing H : θ ≤ θ_0 versus K : θ > θ_0, a reasonable class of loss functions are those that satisfy
l(θ, 1) − l(θ, 0) > 0 for θ < θ_0,
l(θ, 1) − l(θ, 0) < 0 for θ > θ_0.    (4.3.4)
The class D of decision procedures is said to be complete^{(1),(2)} if for any decision rule φ there exists δ ∈ D such that
R(θ, δ) ≤ R(θ, φ) for all θ ∈ Θ.    (4.3.5)
That is, if the model is correct and the loss function is appropriate, then any procedure not in the complete class can be matched or improved at all θ by one in the complete class. Thus, it isn't worthwhile to look outside of complete classes. In the following the decision procedures are test functions.

Theorem 4.3.2. Suppose {P_θ : θ ∈ Θ}, Θ ⊂ R, is an MLR family in T(X) and suppose the loss function l(θ, a) satisfies (4.3.4); then the class of tests of the form (4.3.3) with E δ_t(X) = α, 0 ≤ α ≤ 1, is complete.

Proof. The risk function of any test rule φ is
R(θ, φ) = E_θ{φ(X)l(θ, 1) + [1 − φ(X)]l(θ, 0)} = E_θ{l(θ, 0) + [l(θ, 1) − l(θ, 0)]φ(X)}.
Let δ_t(X) be such that, for some θ_0, E_{θ0}δ_t(X) = E_{θ0}φ(X) > 0. If E_θφ(X) = 0 for all θ, then δ_∞(X) clearly satisfies (4.3.5). Now δ_t is UMP for H : θ ≤ θ_0 versus K : θ > θ_0 by Theorem 4.3.1 and, hence,
R(θ, δ_t) − R(θ, φ) = [l(θ, 1) − l(θ, 0)][E_θ(δ_t(X)) − E_θ(φ(X))] ≤ 0 for θ > θ_0.    (4.3.6)
But 1 − δ_t is similarly UMP for H : θ ≥ θ_0 versus K : θ < θ_0 (Problem 4.3.12) and, hence, E_θ(1 − δ_t(X)) = 1 − E_θδ_t(X) ≥ 1 − E_θφ(X) for θ < θ_0. Thus, (4.3.5) holds for all θ. □

Summary. We consider models {P_θ : θ ∈ Θ} for which there exist tests that are most powerful for every θ in a composite alternative Θ_1 (UMP tests). For θ real, a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing θ_0 versus θ_1 is an increasing function of a statistic T(x) for every θ_0 < θ_1. For MLR models, the test that rejects H : θ ≤ θ_0 for large values of T(x) is UMP for K : θ > θ_0. In such situations we show how the sample size can be chosen to guarantee minimum power for alternatives a given distance from H. We also show how, when UMP tests do not exist, locally most powerful (LMP) tests can in some cases be found. Finally, we show that for MLR models, the class of NP tests is complete in the sense that for loss functions other than the 0-1 loss function, the risk of any procedure can be matched or improved by an NP test.
4.4 CONFIDENCE BOUNDS, INTERVALS, AND REGIONS
We have in Chapter 2 considered the problem of obtaining precise estimates of parameters and we have in this chapter treated the problem of deciding whether the parameter θ is a
member of a specified set Θ_0. Now we consider the problem of giving confidence bounds, intervals, or sets that constrain the parameter with prescribed probability 1 − α. As an illustration consider Example 4.1.4 where X_1, . . . , X_n are i.i.d. N(μ, σ²) with σ² known.
Suppose that μ represents the mean increase in sleep among patients administered a drug.
Then we can use the experimental outcome X = (X_1, . . . , X_n) to establish a lower bound μ̲(X) for μ with a prescribed probability (1 − α) of being correct. In the non-Bayesian framework, μ is a constant, and we look for a statistic μ̲(X) that satisfies P(μ̲(X) ≤ μ) = 1 − α with 1 − α equal to .95 or some other desired level of confidence. In our example this is achieved by writing
P( √n(X̄ − μ)/σ ≤ z(1 − α) ) = 1 − α.
By solving the inequality inside the probability for μ, we find
P( X̄ − σ z(1 − α)/√n ≤ μ ) = 1 − α,
and
μ̲(X) = X̄ − σ z(1 − α)/√n
is a lower bound with P(μ̲(X) ≤ μ) = 1 − α. We say that μ̲(X) is a lower confidence bound with confidence level 1 − α. Similarly, as in (1.3.8), we may be interested in an upper bound on a parameter. In the N(μ, σ²) example this means finding a statistic μ̄(X) such that P(μ̄(X) ≥ μ) = 1 − α, and a solution is
μ̄(X) = X̄ + σ z(1 − α)/√n.
Here μ̄(X) is called an upper level (1 − α) confidence bound for μ.
Finally, in many situations where we want an indication of the accuracy of an estimator, we want both lower and upper bounds. That is, we want to find a such that the probability that the interval [X̄ − a, X̄ + a] contains μ is 1 − α. We find such an interval by noting
P( |√n(X̄ − μ)/σ| ≤ z(1 − ½α) ) = 1 − α
and solving the inequality inside the probability for μ. This gives
P( μ_−(X) ≤ μ ≤ μ_+(X) ) = 1 − α,
where
μ_±(X) = X̄ ± σ z(1 − ½α)/√n.
We say that [μ_−(X), μ_+(X)] is a level (1 − α) confidence interval for μ.
In general, if ν = ν(P), P ∈ P, is a parameter, and X ∼ P, X ∈ R^q, it may not be possible for a bound or interval to achieve exactly probability (1 − α) for a prescribed (1 − α) such as .95. In this case, we settle for a probability at least (1 − α). That is,
Definition 4.4.1. A statistic ν̲(X) is called a level (1 − α) lower confidence bound for ν if for every P ∈ P,
P[ν̲(X) ≤ ν] ≥ 1 − α.
Similarly, ν̄(X) is called a level (1 − α) upper confidence bound for ν if for every P ∈ P,
P[ν̄(X) ≥ ν] ≥ 1 − α.
Moreover, the random interval [ν̲(X), ν̄(X)] formed by a pair of statistics ν̲(X), ν̄(X) is a level (1 − α) or a 100(1 − α)% confidence interval for ν if, for all P ∈ P,
P[ν̲(X) ≤ ν ≤ ν̄(X)] ≥ 1 − α.
The quantities on the left are called the probabilities of coverage and (1 − α) is called a confidence level. For a given bound or interval, the confidence level is clearly not unique because any number (1 − α′) ≤ (1 − α) will be a confidence level if (1 − α) is. In order to avoid this ambiguity it is convenient to define the confidence coefficient to be the largest possible confidence level. Note that in the case of intervals this is just inf{P[ν̲(X) ≤ ν ≤ ν̄(X)] : P ∈ P} (i.e., the minimum probability of coverage). For the normal measurement problem we have just discussed, the probability of coverage is independent of P and equals the confidence coefficient.

Example 4.4.1. The (Student) t Interval and Bounds. Let X_1, . . . , X_n be a sample from a N(μ, σ²) population, and assume initially that σ² is known. In the preceding discussion we used the fact that Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution to construct bounds and intervals for μ. When σ² is unknown, a natural approach is to replace σ by its estimate s, where
s² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)².
That is, we will need the distribution of T(μ) = √n(X̄ − μ)/s. Now Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution, it is independent of V = (n − 1)s²/σ², and V has a χ²_{n−1} distribution; hence, by the definition of the (Student) t distribution,
T(μ) = √n(X̄ − μ)/s = Z(μ)/√(V/(n − 1))
has the t distribution T_{n−1} whatever be μ and σ². Let t_k(p) denote the pth quantile of the T_k distribution. Then
P( −t_{n−1}(1 − ½α) ≤ T(μ) ≤ t_{n−1}(1 − ½α) ) = 1 − α.
Solving the inequality inside the probability for μ, we find
P[ X̄ − s t_{n−1}(1 − ½α)/√n ≤ μ ≤ X̄ + s t_{n−1}(1 − ½α)/√n ] = 1 − α.
The shortest level (1 − α) confidence interval of the type X̄ ± sc/√n is, thus,
[X̄ − s t_{n−1}(1 − ½α)/√n, X̄ + s t_{n−1}(1 − ½α)/√n].    (4.4.1)
Similarly, X̄ − s t_{n−1}(1 − α)/√n and X̄ + s t_{n−1}(1 − α)/√n are natural lower and upper confidence bounds with confidence coefficients (1 − α).
To calculate the coefficients t_{n−1}(1 − ½α) and t_{n−1}(1 − α), we use a calculator, computer software, or Tables I and II. For instance, if n = 9 and α = 0.01, we enter Table II to find that the probability that a T_{n−1} variable exceeds 3.355 is .005. Hence, t_{n−1}(1 − ½α) = 3.355 and
[X̄ − 3.355 s/3, X̄ + 3.355 s/3] is the desired level 0.99 confidence interval.
From the results of Section B.7 (see Problem B.7.12), we see that as n → ∞ the T_{n−1} distribution converges in law to the standard normal distribution. For the usual values of α, we can reasonably replace t_{n−1}(p) by the standard normal quantile z(p) for n > 120.
Up to this point, we have assumed that X_1, . . . , X_n are i.i.d. N(μ, σ²). It turns out that the distribution of the pivot T(μ) is fairly close to the T_{n−1} distribution if the X's have a distribution that is nearly symmetric and whose tails are not much heavier than the normal. In this case the interval (4.4.1) has confidence coefficient close to 1 − α. On the other hand, for very skew distributions such as the χ² with few degrees of freedom, or very heavy-tailed distributions such as the Cauchy, the confidence coefficient of (4.4.1) can be much larger than 1 − α. The properties of confidence intervals such as (4.4.1) in non-Gaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5. See Figure 5.3.1. If we assume σ² < ∞, the interval will have probability (1 − α) in the limit as n → ∞. □
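Computing (4.4.1) with software is straightforward. The following Python sketch uses simulated data for illustration; the sample and the choice n = 9, α = 0.01 are hypothetical.

```python
# Sketch: the level (1 - alpha) Student t interval (4.4.1) for mu;
# the data are simulated only to illustrate the computation.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=9)      # hypothetical sample, n = 9
n, alpha = len(x), 0.01

xbar, s = x.mean(), x.std(ddof=1)               # ddof=1 gives s^2 with divisor n - 1
tq = t.ppf(1 - alpha / 2, df=n - 1)             # t_{n-1}(1 - alpha/2)
lower = xbar - s * tq / np.sqrt(n)
upper = xbar + s * tq / np.sqrt(n)
print((lower, upper))
```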
Example 4.4.2. Confidence Intervals and Bounds for the Variance of a Normal Distribution. Suppose that X_1, . . . , X_n is a sample from a N(μ, σ²) population. By Theorem B.3.1, V(σ²) = (n − 1)s²/σ² has a χ²_{n−1} distribution and can be used as a pivot. Thus, if we let x_{n−1}(p) denote the pth quantile of the χ²_{n−1} distribution, and if α_1 + α_2 = α, then
P( x(α_1) ≤ V(σ²) ≤ x(1 − α_2) ) = 1 − α.
By solving the inequality inside the probability for σ², we find that
[(n − 1)s²/x(1 − α_2), (n − 1)s²/x(α_1)]    (4.4.2)
is a confidence interval with confidence coefficient (1 − α).
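For instance, the equal-tailed version of (4.4.2) can be computed as in the following Python sketch; the simulated data, n = 30 and α = 0.05 are illustrative assumptions.

```python
# Sketch: the equal-tailed version of interval (4.4.2) for sigma^2
# (alpha_1 = alpha_2 = alpha / 2); simulated data, illustrative only.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
x = rng.normal(size=30)
n, alpha = len(x), 0.05
s2 = x.var(ddof=1)

lower = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)
upper = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)
print((lower, upper))
```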
The length of this interval is random. There is a unique choice of α_1 and α_2 that uniformly minimizes expected length among all intervals of this type. It may be shown that for n large, taking α_1 = α_2 = ½α is not far from optimal (Tate and Klett, 1959). The pivot V(σ²) similarly yields the respective lower and upper confidence bounds (n − 1)s²/x(1 − α) and (n − 1)s²/x(α).
In contrast to Example 4.4.1, if we drop the normality assumption, the confidence interval and bounds for σ² do not have confidence coefficient 1 − α even in the limit as n → ∞. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution, which typically is unknown. In Problem 1.1.16 we give an interval with correct limiting coverage probability. □
The method of pivots works primarily in problems related to sampling from normal populations. If we consider "approximate" pivots, the scope of the method becomes much broader. We illustrate by an example.

Example 4.4.3. Approximate Confidence Bounds and Intervals for the Probability of Success in Bernoulli Trials. If X_1, . . . , X_n are the indicators of n Bernoulli trials with probability of success θ, then X̄ is the MLE of θ. There is no natural "exact" pivot based on X̄ and θ. However, by the De Moivre-Laplace theorem, √n(X̄ − θ)/√(θ(1 − θ)) has approximately a N(0, 1) distribution. If we use this function as an "approximate" pivot and let ≈ denote "approximate equality," we can write
P( |√n(X̄ − θ)| / √(θ(1 − θ)) ≤ z(1 − ½α) ) ≈ 1 − α.
Let k_α = z(1 − ½α) and observe that this is equivalent to
P[ (X̄ − θ)² ≤ (k_α²/n) θ(1 − θ) ] = P( g(θ, X̄) ≤ 0 ) ≈ 1 − α,
where
g(θ, X̄) = (1 + k_α²/n)θ² − (2X̄ + k_α²/n)θ + X̄².
For fixed 0 < X̄ < 1, g(θ, X̄) is a quadratic polynomial with two real roots. In terms of S = nX̄, they are^{(1)}
θ̲(X̄) = { S + ½k_α² − k_α √[S(n − S)/n + k_α²/4] } / (n + k_α²),
θ̄(X̄) = { S + ½k_α² + k_α √[S(n − S)/n + k_α²/4] } / (n + k_α²).    (4.4.3)
Because the coefficient of θ² in g(θ, X̄) is greater than zero,
{θ : g(θ, X̄) ≤ 0} = {θ : θ̲(X̄) ≤ θ ≤ θ̄(X̄)},    (4.4.4)
so that [θ̲(X̄), θ̄(X̄)] is an approximate level (1 − α) confidence interval for θ. We can similarly show that the endpoints of the level (1 − 2α) interval are approximate upper and lower level (1 − α) confidence bounds. These intervals and bounds are satisfactory in practice for the usual levels, if the smaller of nθ, n(1 − θ) is at least 6. For small n, it is better to use the exact level (1 − α) procedure developed in Section 4.5. A discussion is given in Brown, Cai, and DasGupta (2000).
Note that in this example we can determine the sample size needed for desired accuracy. For instance, consider the market researcher whose interest is the proportion θ of a population that will buy a product. He draws a sample of n potential customers, calls willingness to buy success, and uses the preceding model. He can then determine how many customers should be sampled so that (4.4.4) has length 0.02 and is a confidence interval with confidence coefficient approximately 0.95. To see this, note that the length, say l, of the interval is
l = 2k_α { √[S(n − S)/n + k_α²/4] } (n + k_α²)^{-1}.
Now use the fact that
S(n − S)/n = n/4 − n^{-1}(S − ½n)² ≤ n/4    (4.4.5)
to conclude that
l ≤ k_α(n + k_α²)^{-1/2}.    (4.4.6)
Thus, to bound l above by l_0 = 0.02, we choose n so that k_α(n + k_α²)^{-1/2} = l_0. That is, we choose
n = (k_α/l_0)² − k_α².
In this case, 1 − ½α = 0.975, k_α = z(1 − ½α) = 1.96, and we can achieve the desired length 0.02 by choosing n so that
n = (1.96/0.02)² − (1.96)² = 9,600.16, or n = 9,601.
This formula for the sample size is very crude because (4.4.5) is used and it is only good when θ is near 1/2. Better results can be obtained if one has upper or lower bounds on θ such as θ ≤ θ_0 < ½ or θ ≥ θ_1 > ½. See Problem 4.4.4.
Another approximate pivot for this example is √n(X̄ − θ)/√(X̄(1 − X̄)). This leads to the simple interval
X̄ ± z(1 − ½α) √(X̄(1 − X̄))/√n.    (4.4.7)
See Brown, Cai, and DasGupta (2000) for a discussion. □
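Both intervals are easy to compute; the following Python sketch evaluates (4.4.3) and (4.4.7) for an illustrative count of 37 successes in 100 trials (these numbers are assumptions, not from the text).

```python
# Sketch: the approximate intervals (4.4.3) and (4.4.7) for a binomial
# proportion; the counts below are illustrative.
from math import sqrt
from scipy.stats import norm

n, s, alpha = 100, 37, 0.05          # hypothetical: 37 successes in 100 trials
k = norm.ppf(1 - alpha / 2)          # k_alpha = z(1 - alpha/2)
xbar = s / n

# Interval (4.4.3): the two roots of g(theta, xbar) <= 0.
half = k * sqrt(s * (n - s) / n + k**2 / 4)
lo_443 = (s + k**2 / 2 - half) / (n + k**2)
hi_443 = (s + k**2 / 2 + half) / (n + k**2)

# Interval (4.4.7): xbar +/- k * sqrt(xbar * (1 - xbar) / n).
lo_447 = xbar - k * sqrt(xbar * (1 - xbar) / n)
hi_447 = xbar + k * sqrt(xbar * (1 - xbar) / n)

print((lo_443, hi_443), (lo_447, hi_447))
```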
Confidence Regions for Functions of Parameters
We can define confidence regions for a function q(θ) as random subsets of the range of q that cover the true value of q(θ) with probability at least (1 − α). Note that if C(X) is a level (1 − α) confidence region for θ, then
q(C(X)) = {q(θ) : θ ∈ C(X)}
is a level (1 − α) confidence region for q(θ).

Example 4.4.4. Let X_1, . . . , X_n denote the number of hours a sample of Internet subscribers spend per week on the Internet. Suppose X_1, . . . , X_n is modeled as a sample from an exponential, E(θ^{-1}), distribution, and suppose we want a confidence interval for the population proportion P(X ≥ x) of subscribers that spend at least x hours per week on the Internet. Here q(θ) = 1 − F(x) = exp{−x/θ}. By Problem B.3.4, 2nX̄/θ has a chi-square, χ²_{2n}, distribution. By using 2nX̄/θ as a pivot we find the (1 − α) confidence interval
2nX̄/x(1 − ½α) ≤ θ ≤ 2nX̄/x(½α),
where x(β) denotes the βth quantile of the χ²_{2n} distribution. Let θ̲ and θ̄ denote the lower and upper boundaries of this interval; then
exp{−x/θ̲} ≤ q(θ) ≤ exp{−x/θ̄}
is a confidence interval for q(θ) with confidence coefficient (1 − α). If q is not 1-1, this technique is typically wasteful. That is, we can find confidence regions for q(θ) entirely contained in q(C(X)) with confidence level (1 − α). For instance, we will later give confidence regions C(X) for pairs θ = (θ_1, θ_2)^T. In this case, if q(θ) = θ_1, q(C(X)) is larger than the confidence set obtained by focusing on θ_1 alone. □
Confidence Regions of Higher Dimension
We can extend the notion of a confidence interval for one-dimensional functions q(B) to r-dimensional vectors q( 0) � (q1 ( 0), . . . , q" ( 0) ) . Suppose q (X) and q1 (X) are reai J valued. Then the r-dimensional random rectangle -
! (X) � { q(O) • q; (X) < q; (O) < q; (X) , j � l, . . . , r ) is said to be a level (I - a) confidence region, if the probability that it covers the unknown but fixed true ( q1 ( 0), . . . , qr( 0)) is at least (! - a). We write this as
P[q(O) E /(X)] 2 I
-
o.
Note that if I; (X) � [q . , q; J is a level (1 - a; ) confidence interval for q; ( 0) and if the -} I. (X) X . . · x pairs (T 1 T,), . . , (Ir, Tr) are independent, then the rectangle I (X) Ir(X) has level ,
.
=
(4.4.8)
Testing and Confidence Regions
240
Chapter 4
Thus, an r-dimensional confidence rectangle is in this case automaticalll obtained from the one-dimensional intervals. Moreover, if we choose o = 1 - ( 1 r , then I(X) has
j
1 - a:.
confidence level
- a)
IJ are not independent is to use Bonferroni's
An approach that works even if the
in
equality (A.2.7). According to this inequality,
P[q(O ) E I(X)] > Thus, if we choose
a,
=
ajr, j
1-
=
1, .
c
L P[qj(O ) "' Ij(X)] > j=l .
. , r, then
1-
c
:Laj· j= l
I(X) has confidence level (1 - a).
Example 4.4.5. Confidence Rectangle for the Parameters ofa Normal Distribution. Sup pose X1 , Xn is a N(p, u2 ) sample and we want a confidence rectangle for (t.t, a2 ). .
.
•
,
From Example 4.4.1
h (X )
=
X ± stn-1 (1 - !a )/ Vn
( - !a). From Example 4.4.2,
is a confidence interval for J.L with confidence coefficient 1
I2 (X) _ -
(n - l ) s2
Xn-1 ( 1 -
is a reasonable confidence interval for
I, (X)
is a level ( 1
1 ( Xn-1 4 a)
with confidence coefficient
'' " '
:
'
ij · I ,
,
,
!2 (X)
-
Thus,
.
The method of pivots can also be applied to oo-dimensional parameters such as
F.
X1. . . , Xn are i.i.d. as X P, and we are interested in the function F(t) F(·). We assume that F is P(X < t); that is, v(P)
Example 4.4.6.
Suppose
rv
.
' '
distribution
' ''
continuous, in which case (Proposition 4.1.1) the distribution of
"
(1 - �a).
a) confidence rectangle for (p, u2 ) It is possible to show D that the exact confidence coefficient is ( 1 - �a) 2 • See Problem 4.4.15.
'
, 'I
x
a2
14a , )
(n - l)s2
=
=
Dn(F)
=
su
p I F(t) - F(t) i
tER
Dn (F) is a pivot. Let da be - a. Then by solving Dn(F) < d. for F, we
does not depend on F and is known (Example 4.1 .5). That is,
PF(Dn(F) < dn) 1 find that a simultaneous in t size 1 - a confidence region C(x)(·) is the confidence band which, for each t E R, consists of the interval chosen such that
=
C(x)(t)
(max{ O, F(t) - d0}, �
=
min { !,
�
F(t) + d.}).
We have shown
P(C(X)(t) for all
::J
F(t) for all t E R)
P E P = set of P with P( -oo, t]
continuous in
t.
=
1 -
a 0
, ,
i ' J •
l
Section 4.5
241
The Duality Between Conf1dence Regions and Tests
We can apply the notions studied in Examples 4.4.4 and 4.4.5 to give confidence regions for scalar or vector parameters in nonparametric models. Example 4.4.7. A Lower Confidence Bound for the Mean of a Nonnegative Random Vari able. Suppose X, , . . . , Xn are i.i.d. as X and that X has a density f( t) = F' (t), which is zero for t < 0 and nonzero for t > 0. By integration by parts. if J.t = IL(F) = f0= tj(t)dt exists, then Let F-(t) and F+ (t) be the lower and upper simultaneous confidence boundaries of Example 4.4.6. Then a ( 1 a) lower confidence bound for p. is p. given by -
(4.4.9)
-
because for C(X) as in Example 4.4.6, J.t = inf { J.t( F) : F E C(X)} = J.t(F+ ) and sup{J.t(F) : F E C(X)} = J.t(F-) = oo-see Problem 4.4.19. Intervals for the case F supported on an interval (see Problem 4.4.18) arise in ac counting practice (see Bickel, 1992) where such bounds are discussed and shown to be D asymptotically strictly conservative. -
Summary. We define lower and upper confidence bounds (LCBs and UCBs), confidence
intervals, and more generally confidence regions. In a parametric model {Pe : B E e}, a Ievel l - a: confidence region for a parameter q(B) is a set C(x) depending only on the data x such that the probability under Po that C(X) covers q(8) is at least 1 - a: for all 0 E 8. For a nonpararnetric class P = {P) and parameter v = v(P), we similarly require P(C(X) ::> v) > 1 - a for all P E P. We derive the (Student) t interval for I' in the N(/1, a2 ) model with a2 unknown, and we derive an exact confidence interval for the binomial parameter. In a nonparametric setting we derive a simultaneous confidence interval for the distribution function F( t) and the mean of a positive variable X.
4.5
THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS
Confidence regions are random subsets of the parameter space that contain the true param eter with probability at least 1 a. Acceptance regions of statistical tests are. for a given hypothesis H, subsets of the sample space with probability of accepting H at least 1 - a: when H is true. We shall establish a duality between confidence regions and acceptance regions for families of hypotheses. We begin by illustrating the duality in the following example. -
Example 4.5.1. 1\vo-Sided Tests for the Mean of a Normal Distribution. Suppose that an established theory postulates the value f-LO for a certain physical constant. A scientist has reasons to believe that the theory is incorrect and measures the constant n times obtaining
242
Testing and Confidence Regions
measurements
Xi
X1 .
.
.
..
Xn.
Chapter 4
Knowledge of his instruments leads him to assume that the
are independent and identically distributed normal random variables with mean {L and
a2. If any value of Jl other than /LO is a possible alternative, then it is reasonable to formulate the problem as that of testing H : f-l = J-lo versus K : f-l I- f-lO· We can base a size o test on the level (1 - a ) confidence interval ( 4.4.1) we constructed for Jl as follows. We accept H. if and only if, the postulated value Jlo is a member of the level ( 1 - o ) confidence interval variance
[X - stn-1 (1 - �a) /vn. X + stn-1 (1 - la) / vn].
(4.5, [ )
= y'n(X - /1-o) / s, then our test accepts H, if and only if, -tn -l ( 1 - �a) < T < tn-1 (t - la) . Because P" [IT I = tn-1 (1 - )a)] = 0 the test is equivalently characterized by rejecting H when I TI > tn-1 (1 - �a). This test is called two-sided If we let T
because it rejects for both large and small values of the statistic T. In contrast to the tests of Example 4. I .4, it has power against parameter values on either side of /1-0·
Because the same interval (4.5 . 1 ) is used for every J.Lo we see that we have, in fact, generated a family of level
a tests {J(X, 11Jl where
1 if vnY�;� 1 > tn -1 (1 - �a) 0 otherwise.
(4.5.2)
These tests correspond to different hypotheses, O{X, J.lo ) being of size pothesis
H : Jl = flo-
a only for the hy
Conversely, by starting with the test (4.5.2) we obtain the confidence interval (4.5.1)
by finding the set of J1 where J(X, Jl)
� 0.
We achieve a similar effect, generating a family of level
a tests, if we start out with
(1 - a) LCB X - tn_ 1 (1 - o.)sj y1i and define J•(x, Jl) to equal ! if, and only if, X - tn_1(1 - a)s/vn 2 Jl· Evidently, (say) the level
=
1 - a. D
These are examples of a general phenomenon. Consider the general framework where the random vector X takes values in the sample space
I I
I
I I
r
E P.
Let
I
X has
distribution
,
(1 - a), that is
J
!
,
P[v E S( X ) ] > 1 - a,
''
and
v = v(P) be a parameter that takes values in the set N. For instance, in Example 4.4. 1 , J1 = Jl(P) takes values in N = (-oo, oo) in Example 4.4.2, "2 = "2 ( P) takes values in N = (0, oo ), and in Example 4.4.5, (f', "2) takes values in N = ( -oo, oo) x (0, oo). For a function space example, consider v(P) = F, as in Example 4.4.6, where F is the distribution function of Xi. Here an example of N is the class of all continuous distribution functions. Let S = S( X ) be a map from X to subsets of N, then S is a (1 - a) confidence region for v if the probability that S(X) contains v is at least P
'
X C Rq
all P
E P. . •
,
• ,
Section 4.5
243
The Duality Between Conf1dence Regions and Tests
H = Hv0 : v = v0 test J(X, v0) with level a. Then the
Next consider the testing framework where we test the hypothesis for some specified value
vo.
Suppose we have a
acceptance regiOn
A (vo) � {x : J(x, vo) = 0 }
1 - a. For some specified vo, H may be accepted, for other specified vo, H may be rejected. Consider the set of v0 for which H.,.0 is accepted; this is a random set contained in N with probaPihty at least 1 - a of containing the true value of v(P) whatever be P. Conversely, if S(X) is a Ievel l - a confidence region for v, then the test that accepts H.,.0 if and only if v0 is in S(X), is a level o test for H.,.0 • Formally, let Pv0 = {P : v(P) � v0 : vo E V}. We have the following. Duality Theorem. Let S(X) = {v0 E N : X E A(vo)). then is a subset of X with probability at least
P[X E A(v0)) > 1 - a for all P E Pv, ifand only if S(X) is a 1 - a confidence region for v. We next apply the duality theorem to MLR families:
Theorem 4.5.1. Suppose X � Po where {Po : 8 E 6} is MLR in T = T(X) and suppose that the distribution/unction Fo(t) ofT under Po is continuous in each of the variables t and B when the other is fixed. If the equation Fo(t) � 1 - a has a solution li.(t) in e. then Ba(T) is a lower confidence boundfor () with confidence coefficient 1 - a. Similarly, any solution B.(T) of Fo(T) = a with Ba E 6 is an upper confidence bound for 8 with coefficient ( 1 - a). Moreover, if a1 + a2 < 1, then [.Qa1 , Ba2] is confidence intervalfor () with confidence coefficient 1 - (o t + a2 ).
Proof. By Corollary 4.3 . 1 , the acceptance versus K : B > Bo can be written
region of the
UMP size a test of H : B
=
Bo
A(Bo) = {x : T(x) < to0(1 - a) } where
to0 (1 - a ) is the 1 - a quantile of Fo0• By the duality theorem, if s(t) = {8 E 6 : t < to(l - a ) } ,
then S(T) is a we find
1 - a confidence region for B. By applying Fo to hnth sides of t 5 to(1-a), S(t) � {B E 6 : Fo(t) < 1 - a}.
By
Theorem
4.3.1,
the power function
Po(T > t) = 1 - Fo(t)
for
a
test with critical
t is increasing in (). That i�, Fe (t) is decreasing in (). It follows that Fo (t) < 1 - a iff B > lia(t) and S(t) = [lia, oo). The proofs for the upper confid�nce hnund and interval
constant
fotlow by the same type of argument,
0
We next give connections between confidence bounds, acceptance regions, and p-values for MLR families: Let t denote the observed value
t = T(x) ofT(X) for the datum x, let
244
Testing and Confidence Regions
Chapter 4
o:(t, Oo ) denote the p-value for the UMP size a test of H : (} = Oo versus K : () > Oo, and
let
A• (B) = T(A(B)) = {T(x) : x E A(B)).
Corollary 4.5.1. Under the conditions of Theorem 4.4.1,
{t : a(t,B) > a) = (-oo, to(1 - a)] {B : a(t,B) > a) = [�(t), oo).
A•(B) S (t) Proof. The p-value is
a(t, B) = Po (T > t) = 1 - Fo(t). We have seen in the proof of Theorem 4.3.1 that 1 - Fo(t) is increasing in 0. Because D Fe(t) is a distribution function, 1 - Fo(t) is decreasing in t. The result follows. In general, let a( t, v0) denote the p-value of a test b (T, vo ) = 1 [T > c] of H : v = vo based on a statistic T = T( X ) with observed value t = T(x). Then the set C = {(t,v) : a(t,v) < a} = {(t,v) : J(t,v) = 0} (t, v)
t, v
t
where, for the given gives the pairs will be accepted; and for the given v, is plane, in the acceptance region. We call C the set of compatible points. In the vertical sections of C are the confidence regions S( whereas horizontal sections are the acoeptance regions • = 0). We illustrate these ideas using the example , Xn are i.i.d. N(f-L, cr2 ) with a2 known. Let T = X, of testing H : fL = fLO when X1 , then C = { ( , p : [t - p[ S <7z
t)
A (v) = {t : J(t, v) .
.
(t, B)
(t, B)
•
t )
(1 - ia)/ vn} . Figure 4.5.1 shows the set C, a confidence region S(t0), and an acceptance set A• (po ) for this example.
Example 4.5.2. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials. Let X1, . . . , Xn be the indicators of n binomial trials with probability E (0, 1), we seek reasonable exact level ( 1 - �) upper and lower of success 6. For To find a lower confidence bound for (J confidence bounds and confidence intervals for o E (0, 1). our preceding discussion leads us to consider level tests for H : (} < denote the critical Let We shall use some of the results derived in Example constant of a level (1 confidence region is test of H. The correspouding level given by C(Xt , . . , Xn) = : S < - 1 },
a
. .
'
, " '1 .
•
•
.
,
,
8.
a)
.
xi.
{B
a 4.1.3.
k(B, a)
where s = Ef I To analyze the structure of the region we need to examine
,,
(i)
'.,, '
(ii)
,, •
>I ,
�,
''i
•
·!
k(B,a) is nondecreasing in B. k(B,a) � k(Bo, a) ifB i Bo.
k(Oo, a) (1 - a)
Oo. B
k(8, a). We claim that
l 1
Section 4.5
245
The Duality Between Confidence Regions and Tests
Figure 4.5.1. The shaded region is the compatibility set C for the two-sided test of Hp.0 : J.L = J.to in the normal model. S(to) is a confidence interval for f.l for a given value to ofT, whereas A* (f.lo) is the acceptance region for Hp.o.
(iii)
k(fJ, a) increases by exactly
(iv)
k(O,a) � I
and k( l
,
a)
�
1
at its points of discontinuity.
n + 1.
To prove (i) note that it was shown in Theorem 4.3.1 (i) that Pe [S > j] is nondecreasing 02 and in 6 for fixed j. Clearly, it is also nonincreasing in j for fixed e. Therefore, 01 k(B1, a) > k(B2 , a) would imply that
<
a > Po, [S 2: k(B2 ,a) ] > Po, [S > k(B2 , a) - 1]
2:
Po, [S > k(B1 ,a) - 1] > a,
a contradiction. The assertion (ii) is a consequence of the following remarks. If Bo is a discontinuity point o k(B, <>), let j be the limit of k(B, <>) as 0 t Bo. Then Po [S > j] a for all 0 < Bo and, hence. Po, [S > j] a. On the other hand, if 0 > 00, Po [S 2: j] > a. Therefore, Po, [S > j] � a and j = k(B0, <>). The claims (iii) and (iv) are left as exercises. From (i), (ii), (iii), and (iv) we see that, if we define
<
<
B(S) = inf{B : k(B,a) � S
+ 1},
then
C(X) �
l] { (B(S), 0, 1 [
]
if s > 0 ifS = 0
246
Testing and Confidence Regions
Chapter 4
I
and �(S) is the desired level (I - a) LCB for 0 (2l Figure 4.5.2 portrays the situation. From our discussion, when S > 0, then k(O(S) , a) = S and, therefore, we find 8(S) as the unique solution of the equation,
When S = 0, 8(S) = 0. Similarly, we define
O(S) = sup(O : j(O,a) = S - 1} where j ( 8, a) is given by, •
Then 0 (S) is a level
(I - a) UCB for 0 and when S < n, 0(S) is the unique solution of
�( � ) s
or(! - o)n -r = a.
When S = n, O(S) = I. Putting the bounds f!(S), O(S) together we get the confidence interval (O(S), O(S)] oflevel (1- 2a). These intervals can be obtained from computer pack ages that use algorithms based on the preceding considerations. As might be expected, if n is large, these bounds and intervals differ little from those obtained by the first approximate method in Example 4.4.3.
, •
1 •
•
I
'
' •
I
l I ' i • •
•
l
I ' •
4
k(O,O.i6)
3
'�
I
I i
l l
I i ,
Figure 4.5.2. Plot of k(O, 0.16) for n = 2.
,I
'I
,. . .
• •
j
Section 4.5
247
The Duality Between Confidence Regions and Tests
Applications of Confidence Intervals to Comparisons and Selections
We have seen that confidence intervals lead naturally to two-sided tests. However, two-sided tests seem incomplete in the sense that if H B = 80 is rejected in favor of H : () -1- Bo, we usually want to know whether H : () > Bo or H : B < Bo. For instance, suppose B is the expected difference in blood pressure when two treat ments, A and B, are given to high blood pressure patients. Because we do not know whether A or B is to be preferred, we test H : B = 0 versus K : 8 -1- 0. If H is rejected, it is natural to carry the comparison of A and B further by asking whether 8 < 0 orB > 0. If we decide () < 0, then we select A as the better treatment, and vice versa. The problem of deciding whether B = 80, 8 < 80, or B > Bo is an example of a three decision problem and is a special case of the decision problems in Section 1.4, and 3.1-3.3. Here we consider the simple solution suggested by the level ( 1 - a) confidence interval !: :
1.
2.
3.
Make no judgment as w whether 0 < 80 or 8 > 80 if I contains Bo; Decide B < Bo if I i s entirely to the left of Bo; and Decide 8 > 80 if I is entirely to the right of 80.
(4.5.3)
, Xn are i.i.d. N(J-l, a2 ) with u2 known. In Section 4.4 we considered the level { 1 - a) confidence interval X ± uz{l - � a)/ fo for f-t. Using this interval and ( 4.5.3) we obtain the following three decision rule based on T = .JTi(X a: o)/ ' l Do not reject H : I' � l'c if ITI < z(l - !a). Decide I' < l'o ifT < -z(t - ia). Decide I' > l'c ifT > z(l - ia). Example 4.5.2. Suppose X1 , . . .
Thus, the two-sided test can be regarded as the first step in the decision procedure where if H is not rejected, we make no claims of significance, but if H is rejected, we decide whether this is because () is smaller or larger than Bo. For this three-decision rule, the probability of falsely claiming significance of either 8 < Bo or () > Bo is bounded above by �a. To see this consider first the case 8 > 00. Then the wrong decision "11 < �-to" is made when T < -z(l - � a). This event has probability
(
)
( l' - l'o) < P[T < -z(l - la)J � -z(l - la) - .Jii a ( -z(l - la)) � la. Similarly, when f-t < �-to. the probability of the wrong decision is at most � a. Therefore, by using this kind of procedure in a comparison or selection problem, we can control the probabilities of a wrong selection by setting the a of the parent test or confidence intervaL We can use the two-sided tests and confidence intervals introduced in later chapters in similar fashions. Summary. We explore the connection between tests of statistical hypotheses and confi
dence regions. If J{x, vo} is a level a test of H
:
v = v0 , then the set S(x) of v0 where
248
Testing and Confidence Regions
o ("'' v0)
�
0 is a level (1
- o ) confidence region for
fidence region for v. then the test that accepts
H
;
v
va.
If S(x) is a level
( 1 - a) con·
E S(:r)
= v0 when vo
Chapter 4
is a level
a:
test. We give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution. We also give a connection between confidence intervals, two-sided tests, and the three-decision problem of deciding whether a parameter 8 is (}0, less than
4.6
80, or larger than 90, where 80
is a specified value.
'
'
' '
'
UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS
In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account. We next show that for a certain notion of accuracy of confidence bounds, which is connected to the power of the associated one-sided tests, optimality of the tests translates into accuracy of the bounds. If (} and (}* are two competing level both very likely to fall below the true
B.
(1 - a ) lower confidence bounds for
But we also want the bounds to be close to
we say that the bound with the smaller probability of being far below Formally, for
(}, they are
X E X c Rq, the following is true.
B.
Thus,
(} is more accurate.
Definition 4.6.1. A level ( 1 - a) LCB 8* of(} is said to be more accurate than a competing level ( 1 - a) LCB 8 if, and only if, for any fixed 8 and all 8' < B. P, �· (X) < 8'] < Po ]B(X) :S 8']. - a ) UCB 8 8' > e.
Similarly, a level ( 1 any fixed e and all
-·
(4.6. 1)
is more accurate than a competitor 8 if, and only if, for
(4.6.2)
Lower confidence bounds B* satisfying (4.6.1) for all competitors are called uniformly
most accurate as are upper confidence bounds satisfying (4.6.2) for all competitors. Note
(1 - a) uniformly most accurate level ( 1 - a) UCB for -8. that
'
8*
is a uniformly most accurate level
LCB for B, if and only if
-8*
is a
•
.
'
I
.
i
1 ' '
' ,
.
.
:1;1
.
Example 4.6.1 (Examples 3.3.2 and 4.2.1 continued). Suppose X = (X,, . . . , Xn) is a sample of a N(J.L, a-2 ) random variables with a2 known. A level a test of H : J.L = J.Lo vs K ' J.1. > J.l.o rejects H when ..fii( X - J.l.o )/u > z(1 - a) . The dual lower confidence bound is J.l., (X) = X - z(1 - a)u/ ..fii. Using Problem 4.5.6, we find that a competing lower confidence bound is J.1. (X) = X(k)• where X(l) :S X(2) :S · · · :S X(n denotes ) 2 the ordered X,, . . . , Xn and k is defined by P(S > k) � 1 - a for a binomial, B(n, � ),
J.L (X) is random variable S. Which lower bound is more accurate? It does tum out that _,
more accurate than e2(X) and is, in fact, uniformly most accurate in the N(J.L,
.
'
L
I!
I
a-2 ) model.
This is a consequence of the following theorem, which reveals that (4.6.1) is nothing more than a comparison of (X)Wer functions.
�------ --
I ' I
•
I '
'
'
I
'
-
P, [e• (X) < B'] < Po [B(X) < 8']. '
'
0
!
I
'
'
i !
I
l
'
I
'
Section 4. 6
249
Uniformly Most Accurate Confidence Bounds
Theorem 4.6.1. Let fJ* be a level {1 - n) LCB for B, a real parameter, such that for each (}0 the associated test whose critical function O* (x. Bo) is given by
I ifO"(x) > Oo
o"(x, Oo)
0 otherwise is UMP level a for H : (} level ( I - a).
=
Bo versus K (} > Bo. Then 0_* is uniformly most accurate at :
Proof Let 0 be a competing level ( I - a ) LCB
Oo. Defined o (x, 00) by
o(x, 00) � 0 if, and only if, O(x) < 00.
Then o(X,Oo) is a level a test for H : 0 � Oo versus K : 0 > Oo. Because J"(X, 00) is UMP level a for H : (} = Bo versus K : B > Bo, for fJ 1 > (}0 we must have
Ee. (o(X, Oo)) < Ee, (o"(X, Oo)) or
Pe, [O(X) > Oo] < Pe, [O" (X) > Oo]. Identify follows.
Bo
with (}' and 81 with (} in the statement of Definition 4.4.2 and the result o
If we apply the result and Example 4.2.1 to Example 4.6. 1 , we find that x - z(l o:)a / JTi is uniformly most accurate. However, X(k) does have the advantage that we don't have to know a or even the shape of the density f of Xi to apply it. Also, the robustness considerations of Section 3.5 favor X{k) (see Example 3.5.2). Uniformly most accurate (UMA) bounds turn out to have related nice properties. For instance (see Problem 4.6. 7 for the proof), they have the smallest expected "distance" to 0:
Corollary 4.6.1. Suppose �·(X) is UMA level (I - a ) lower confidence boundfor 0. Let O(X) be any other (I - a ) lower confidence bound, then for all 0 where a+
�
a,
if a > 0, and 0 otherwise.
We can extend the notion of accuracy to confidence bounds for real-valued functions of an arbitrary parameter. We define q* to be a uniformly most accurate level (1 - a ) LCB for q(0) if, and only if, for any other level ( I - a) LCB q,
Pe[_f < q(O')] < Pe [q < q(O') ] whenever q((}') < q( 8). Most accurate upper confidence bounds are defined similarly.
Example 4.6.2. Boundsfor the Probability ofEarly Failure ofEquipment. Let X 1 , , Xn be the times to failure of n pieces of equipment where we assume that the Xi are indepen dent £(A) variables. We want a uniformly most accurate level {1 - a) upper confidence bound q* for q( A) = 1 - e->.to, the probability of early failure of a piece of equipment. .
•
.
Testing and Confidence Regi ons
250
C h a pter 4
We begin by finding a uniformly most accurate level ( 1 - a) UCB ). for A. To find ). we invert the family of UMP level a tests of H : A � Ao versus K : >. < A 0 . By Problem 4.6.8, the UMP test accepts H if •
•
n
L Xi < X2n ( 1 - a) /2>-o
(4.6.3)
i=l
or equivalently if
A0
X2 n ( 1 - a ) n
< 2 "l<' L.. t = l X'
where X2 n ( 1 - a) is the ( 1 - a) quantile of the X � n distribution. Therefore, the confidence region corresponding to this test is ( 0, 5. ) where 5. is by Theorem 4.6. 1 , a uniformly most accurate level (1 - a ) UCB for A and, because q is strictly increasing in A, it follows that q(,\ ) is a uniformly most accurate level ( 1 - a ) UCB for the probability of early D failure. *
*
*
Discussion We have only considered confidence bounds. The situation with confidence intervals i s more complicated. Considerations of accuracy lead u s to ask that, subject to the require ment that the confidence level is ( 1 a) , the confidence interval be as short as possible. Of course, the length T 'I. is random and it can be shown that in most situations there is no confidence interval of level (1 - a) that has uniformly minimum length among all such intervals. There are, however, some large sample results in this direction (see Wilks, 1 962, pp. 374-376). If we tum to the expected length Ee (T - 'I.) as a measure of precision, the situation is still unsatisfactory because, in general, there does not exist a member of the class of level ( 1 a) intervals that has minimum expected length for all B. However, as in the estimation problem, we can restrict attention to certain reasonable subclasses of level ( 1 - a) intervals for which members with uniformly smallest expected length exist. Thus, Neyman defines unbiased confidence intervals of level ( 1 a) by the property that -
-
-
-
Pe ['I. � q(B) � f] � Pe ['I. � q(B' ) � T] for every B, B' . That is, the interval must be at least as likely to cover the true value of q( B) as any other value. Pratt ( 1961) showed that in many of the classical problems of estimation . there exist level ( 1 - a) confidence intervals that have uniformly minimum expected length among all level ( 1 - a) unbiased confidence intervals . In particular, the intervals developed in Example 4 . 5 . 1 have this property. Confidence intervals obtained from two-sided tests that are uniformly most powerful within a restricted class of procedures can be shown to have optimality properties within restricted classes. These topics are discussed in Lehmann ( 1 997).
Summary. By using the duality between one-sided tests and confidence bounds we show that confidence bounds based on UMP level a tests are uniformly most accurate (UMA) level ( 1 a) in the sense that, in the case of lower bounds, they are less likely than other level (1 - a) lower confidence bounds to fall below any value B' below the true B . -
._
r t ·
i
't
;
Section 4 . 7
4.7
251
Freq uen tist a n d Bayesian Form u lations
F R E Q U E N T I S T AN D BAY E S I A N FO R M U lAT I O N S
We have so far focused on the frequentist formulation of confidence bounds and intervals where the data X E
X
c
Rq
are random while the parameters are fixed but unknown.
A consequence of this approach is that once a numerical interval has been computed from
experimental data, no probability statement can be attached to this interval. Instead, the interpretation of a
100( 1· - a)%
confidence interval is that if we repeated an experiment
indefinitely each time computing a
100 ( 1 - a)%
confidence interval, then
100 ( 1 - a)%
of
the intervals would contain the true unknown parameter value. In the B ayesian formulation of Sections 1 . 2 and 1 . 6 . 3 , what are called level
(1
...:...
a)
credible bounds and intervals are subsets of the parameter space which are given probability at least
( 1 - a)
by the posterior distribution of the parameter given the data. Suppose that,
given e, X has distribution Pe , e E
8
c
R,
and that e has the prior probability distribution
II. Definition 4.7.1.
Let
II( · f x ) denote the posterior probability distribution of e given X x , (1 - a) lower and upper credible bounds for e if they respectively =
then ft. and B are level sati sfy
II( fi. :S: e j x ) ;:::: 1 - a , II ( e :S: B !x ) � 1 - a.
Turning to Bayesian credible intervals and regions, it is natural to consider the collec tion of e that is "most likely" under the distribution
Definition 4.7.2.
Let
n ( - f x)
denote the density of e given
ck is called a If n
II ( e j x ) .
=
{e : n( - l x)
Thus ,
X = x,
then
� k}
level ( 1 - a) credible region for e i f II( Ck f x ) :2: 1 - a .
(e f x)
is unimodal, then
Ck
will be an interval of the form
[fi., B] .
We next give such
an example .
Suppose that given f..l , X1 , . . . , Xn are i.i.d. N(f..l , 0"6) with 0"6 known, N(f..lo , TJ ) , with f..l o and T;s known. Then, from Example 1 . 1 . 1 2, the posterior distribution of f..l given xl , . . . ' Xn is N(/I B , a� ) , with
Example 4.7. 1. and that f..l
rv
n _x + i f..lo 2 1 -;;:g f..l B = n + l , O"B = n + l U5 -;;:g U5 -;;:g �
It follows that the level
1
-
U5
�
a lower and upper credible bounds f..l f..l - = B - Zl -o
O"o Vn 1 + nr0
(
::"A)
1
2
for
f..l are
Testing a n d Confidence Reg ions
252
(1 -
while the level
a:) credible interval is [p,- , p, + ] with /1
±
2
Compared to the frequentis t interval
1 .3.4
a-o
�
/111 ± Z l - -"'
=
interval is pulled in the direction of Example
Cha pter 4
1
( �) nT0 2
vn 1 +
•
z1 _ � a-olfo, the center fiB of the Bayesian p,0 is a prior guess of the value of p,. See guesses. Note that as To ----> oo, the B ayesian
X ±
p,0,
where
for sources of such prior
interval tends to the frequentist interval; however, the interpretations of the intervals are D
different.
Example 4.7.2.
known. Let >-
where
a
> 0,
=
b
Suppose that given a-2, xl , · · · , Xn are i.i .d. N(p,o, a - 2 and suppose >- has the gamma f( density
�a, � b)
X� +n
credible bound for
A.
distribution, where
t
distribution, then .6_
= =
and
ij�
=
where /1 0 is
1 . 2. 1 2, given x1 , , Xn, l: (xi - p,0 ) 2 . Let X a + n ( a: ) denote the a:th X a + n ( a: ) I ( t + b) is a level ( 1 - a: ) lower
> 0 are known parameters . Then, by Problem
( t + b ) >- has a X�+n
quantile of the
a-2 )
.
.
•
( ( n - 1)s 2 + b)lxa+n (a:)
( 1 - a: ) upper credible bound for a-2 . Compared to the frequentist bound ( n 1) s 2 I Xn- l ( a: ) of Example 4.4.2, a� is shifted in the direction of the reciprocal bI a of the
i s a level
mean of 1r(A).
We shall analyze B ayesian credible regions further i n Chapter
Summary.
5.
(1
I n the Bayesian framework w e define bounds and intervals, called level
-
a: ) credible bounds and intervals, that determine subsets of the parameter space that are assigned probability at least
the data
:c.
(1
- a: ) by the posterior distribution of the parameter
In the case of a normal prior
1r ( B)
and normal model
(}
p ( x I B), the level (1
given - a: )
credible interval is similar to the frequentist interval except it is pulled in the direction
p,0
of the prior mean and it i s a little narrower. However, the interpretations are different: In
the frequentist confidence interval, the probability of coverage is computed with the data X random and
B fi xed, X
is computed with
4.8
whereas in the Bayesian credible interval, the probability of coverage
=
x fixed and (} random with probability distribution II ( B I X
=
x) .
P R E D I CT I O N I NT E RVA LS
I n Section variable
Y.
1.4
we discussed situations in which we want t o predict the value o f a random
In addition to point prediction of
that contains the unknown value
Y
Y,
it is desirable to give an interval
with prescribed probability
(1
- a .
)
[Y, Y]
For instance, a
doctor administering a treatment with delayed effect will give patients a time interval
[1:: , f']
in which the treatment is likely to take effect. S imilarly, we may want an interval for the
Section 4 . 8
P rediction Interva ls
253
future GPA of a student or a future value of a portfolio. We define a level ( 1 - n ) prediction interval as an interval [Y, Y] based on data X such that P ( Y ::; Y ::; Y) ?: 1 n . The problem of finding prediction intervals is similar to finding confidence intervals using a pivot: Example 4.8.1. The (Student) t Prediction Interval. As in Example 4.4. 1 , let X1 , . . . , Xn be i.i.d. as X "' N(p,, cr 2 ) . We want a prediction interval for Y Xn+l • which is assumed to be also N(p,, cr2 ) and independent of X1 , . . . , X11 • Let Y Y (X) denote a predictor based on X (X1 , . . . , "� ) . Then Y and Y are independent and the mean squared prediction error (MSPE) of Y is =
=
=
Note that Y can be regarded as both a predictor of Y and as an estimate of p,, and when we do so, AISP E(Y) MSE(Y) + cr 2 , where MSE denotes the estimation theory mean squared error. It follows that, in this case, the optimal estimator when it exists is also the optimal predictor. In Example 3.4.8, we found that in the class of unbiased estimators, X is the optimal estimator. We define a predictor Y* to be prediction unbiased for Y if E(Y* - Y) 0 , and can c�nclu� e that in the class of prediction unbiased predictors, the optimal MSPE predictor is Y X . � We next use the prediction error Y - Y to construct a pivot that can be used to give a prediction interval. Note that =
=
=
Y-Y
=
X
-
1 Xn+l "' N(O, [n - + 1]cr 2 ) .
-1
Moreover, s2 (n - 1 ) l:: � (Xi - X) is independent of X by Theorem independent of Xn+1 by assumption. It follows that =
B .3 . 3
and
Z
1 +y1cr p (Y) - vny -
has a N(O, 1) distribution and is independent of V (n - 1)s2 /cr2 , which has a x;,_ 1 distribution. Thus, by the definition of the (Student) t distribution in Section B . 3 , =
�
=
p (y ) -
Z(Y)
IV
v�
Y-Y
=
vn- 1 + 1s
has the t distribution, Yn-1 · By solving -tn- 1 (1 Y, we find the ( 1 - a) prediction interval
�a) ::; Tp (Y) ::; tn - 1 (1 - � a) for
Y = X ± Jn - 1 + 1stn - 1 (1 - ! a ) .
(4. 8 . 1 )
Note that T_p(Y) acts as a prediction interval pivot in the same way that T(μ) acts as a confidence interval pivot in Example 4.4.1. Also note that the prediction interval is much wider than the confidence interval (4.4.1). In fact, it can be shown using the methods of Chapter 5 that the width of the confidence interval (4.4.1) tends to zero in probability at the rate n^{-1/2}, whereas the width of the prediction interval tends to 2σz(1 - α/2). Moreover, the confidence level of (4.4.1) is approximately correct for large n even if the sample comes from a nonnormal distribution, whereas the level of the prediction interval (4.8.1) is not (1 - α) in the limit as n → ∞ for samples from non-Gaussian distributions. □
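The computation in (4.8.1) is simple to carry out numerically. The following sketch is an illustration added here (not part of the original text); it assumes NumPy and SciPy are available, and the sample values are made up.

import numpy as np
from scipy import stats

def t_prediction_interval(x, alpha=0.05):
    """Level (1 - alpha) Student t prediction interval (4.8.1) for a future
    observation from the same normal population as the sample x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    s = x.std(ddof=1)                      # s^2 = (n-1)^{-1} sum (x_i - xbar)^2
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    half_width = np.sqrt(1.0 / n + 1.0) * s * t_crit
    return xbar - half_width, xbar + half_width

# Illustrative (made-up) sample of n = 10 measurements.
sample = np.array([10.2, 9.8, 11.1, 10.5, 9.9, 10.7, 10.0, 10.4, 9.6, 10.3])
print(t_prediction_interval(sample, alpha=0.05))

Note the factor √(1/n + 1), as opposed to the factor 1/√n in the confidence interval (4.4.1); this is the numerical counterpart of the width comparison just discussed.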
We next give a prediction interval that is valid for samples from any population with a continuous distribution.
Example 4.8.2. Suppose X1, . . . , Xn are i.i.d. as X ~ F, where F is a continuous distribution function with positive density f on (a, b), -∞ ≤ a < b ≤ ∞. Let X_(1) < ··· < X_(n) denote the order statistics of X1, . . . , Xn. We want a prediction interval for Y = X_{n+1} ~ F, where X_{n+1} is independent of the data X1, . . . , Xn. Set U_i = F(X_i), i = 1, . . . , n + 1; then, by Problem B.2.12, U1, . . . , U_{n+1} are i.i.d. uniform, U(0, 1). Let U_(1) < ··· < U_(n) be U1, . . . , Un ordered; then

P(U_(j) ≤ U_{n+1} ≤ U_(k)) = ∫ P(u ≤ U_{n+1} ≤ v | U_(j) = u, U_(k) = v) dH(u, v)
 = ∫ (v - u) dH(u, v) = E(U_(k)) - E(U_(j))

where H is the joint distribution of U_(j) and U_(k). By Problem B.2.9, E(U_(i)) = i/(n + 1); thus,

P(X_(j) ≤ X_{n+1} ≤ X_(k)) = (k - j)/(n + 1).    (4.8.2)

It follows that [X_(j), X_(k)] with k = n + 1 - j is a level 1 - α = (n + 1 - 2j)/(n + 1) prediction interval for X_{n+1}. This interval is a distribution-free prediction interval. See Problem 4.8.5 for a simpler proof of (4.8.2). □
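A sketch of this order-statistic interval (my own illustration, not from the text; it assumes NumPy) picks j so that the exact coverage (n + 1 - 2j)/(n + 1) is at least the requested level.

import numpy as np

def distribution_free_pi(x, level=0.90):
    """Distribution-free prediction interval [X_(j), X_(n+1-j)] of (4.8.2).
    Returns the interval and its exact coverage (n + 1 - 2j)/(n + 1)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # largest j >= 1 with coverage (n + 1 - 2j)/(n + 1) >= level
    j = int(np.floor((n + 1) * (1 - level) / 2))
    if j < 1:
        raise ValueError("n is too small for the requested level")
    coverage = (n + 1 - 2 * j) / (n + 1)
    return (x[j - 1], x[n - j]), coverage   # X_(j), X_(n+1-j) in 0-based indexing

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=30)   # works for any continuous F
print(distribution_free_pi(sample, level=0.90))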
Bayesian Predictive Distributions

Suppose that θ is random with θ ~ π and that given θ, X1, . . . , X_{n+1} are i.i.d. with density p(x | θ). Here X1, . . . , Xn are observable and X_{n+1} is to be predicted. The posterior predictive distribution Q(· | x) of X_{n+1} is defined as the conditional distribution of X_{n+1} given x = (x1, . . . , xn); that is, Q(· | x) has in the continuous case density

q(x_{n+1} | x) = ∫ Π_{i=1}^{n+1} p(x_i | θ) π(θ) dθ / ∫ Π_{i=1}^{n} p(x_i | θ) π(θ) dθ

with a sum replacing the integral in the discrete case. Now [Y̲_B, Ȳ_B] is said to be a level (1 - α) Bayesian prediction interval for Y = X_{n+1} if

Q([Y̲_B, Ȳ_B] | x) ≥ 1 - α.
Example 4.8.3. Consider Example 3.2.1 where (X_i | θ) ~ N(θ, σ₀²), σ₀² known, and π(θ) is N(η₀, τ²), τ² known. A sufficient statistic based on the observables X1, . . . , Xn is T = X̄ = n⁻¹ Σ_{i=1}^n X_i, and it is enough to derive the marginal distribution of Y = X_{n+1} from the joint distribution of X̄, X_{n+1} and θ, where X̄ and X_{n+1} are independent. Note that

E[(X_{n+1} - θ)θ] = E{E[(X_{n+1} - θ)θ | θ]} = 0.

Thus, X_{n+1} - θ and θ are uncorrelated and, by Theorem B.4.1, independent. To obtain the predictive distribution, note that given X̄ = t, X_{n+1} - θ and θ are still uncorrelated and independent. Thus,

L{X_{n+1} | X̄ = t} = L{(X_{n+1} - θ) + θ | X̄ = t} = N(μ̂_B, σ₀² + σ̂_B²)

where, from Example 4.7.1,

σ̂_B² = 1 / (n/σ₀² + 1/τ²),   μ̂_B = (σ̂_B²/τ²) η₀ + (n σ̂_B²/σ₀²) x̄.

It follows that a level (1 - α) Bayesian prediction interval for Y is [Y̲_B, Ȳ_B] with

[Y̲_B, Ȳ_B] = μ̂_B ± z(1 - α/2) √(σ₀² + σ̂_B²).    (4.8.3)

To consider the frequentist properties of the Bayesian prediction interval (4.8.3) we compute its probability limit under the assumption that X1, . . . , Xn are i.i.d. N(θ, σ₀²). Because σ̂_B² → 0, (n σ̂_B²/σ₀²) → 1, and X̄ converges in probability to θ as n → ∞, we find that the interval (4.8.3) converges in probability to θ ± z(1 - α/2) σ₀ as n → ∞. This is the same as the probability limit of the frequentist interval (4.8.1). □
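As a numerical illustration (mine, not the book's; the values of σ₀, η₀, and τ below are assumptions), the interval (4.8.3) follows directly from the posterior quantities of Example 4.7.1.

import numpy as np
from scipy import stats

def bayes_prediction_interval(x, sigma0, eta0, tau, alpha=0.05):
    """Level (1 - alpha) Bayesian prediction interval (4.8.3) for X_{n+1}
    under the N(theta, sigma0^2) model with N(eta0, tau^2) prior on theta."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    var_b = 1.0 / (n / sigma0**2 + 1.0 / tau**2)           # posterior variance of theta
    mu_b = var_b * (eta0 / tau**2 + n * xbar / sigma0**2)   # posterior mean of theta
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(sigma0**2 + var_b)                   # predictive sd of X_{n+1}
    return mu_b - half, mu_b + half

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=20)              # sigma0 = 1.5 treated as known
print(bayes_prediction_interval(data, sigma0=1.5, eta0=0.0, tau=3.0))

For large n the posterior variance term var_b is negligible and the interval is close to x̄ ± z(1 - α/2)σ₀, in line with the probability limit described above.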
The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box, 1983).

Summary. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least (1 - α). In the case of a normal sample of size n + 1 with only n variables observable, we construct the Student t prediction interval for the unobservable variable. For a sample of size n + 1 from a continuous distribution we show how the order statistics can be used to give a distribution-free prediction interval. The Bayesian formulation is based on the posterior predictive distribution, which is the conditional distribution of the unobservable variable given the observable variables. The Bayesian prediction interval is derived for the normal model with a normal prior.

4.9 LIKELIHOOD RATIO PROCEDURES

4.9.1 Introduction
Up to this point, the results and examples in this chapter deal mostly with one-parameter problems in which it sometimes is possible to find optimal procedures. However, even in the case in which θ is one-dimensional, optimal procedures may not exist. For instance, if X1, . . . , Xn is a sample from a N(μ, σ²) population with σ² known, there is no UMP test for testing H: μ = μ₀ vs K: μ ≠ μ₀. To see this, note that it follows from Example 4.2.1 that if μ₁ > μ₀, the MP level α test δ_α(X) rejects H for T ≥ z(1 - α), where T = √n(X̄ - μ₀)/σ. On the other hand, if μ₁ < μ₀, the MP level α test φ_α(X) rejects H if T ≤ z(α). Because δ_α(x) ≠ φ_α(x), by the uniqueness of the NP test (Theorem 4.2.1(c)), there can be no UMP test of H: μ = μ₀ vs K: μ ≠ μ₀.
In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6. We start with a generalization of the Neyman-Pearson statistic p(x, θ₁)/p(x, θ₀). Suppose that X = (X1, . . . , Xn) has density or frequency function p(x, θ) and we wish to test H: θ ∈ Θ₀ vs K: θ ∈ Θ₁. The test statistic we want to consider is the likelihood ratio given by

L(x) = sup{p(x, θ): θ ∈ Θ₁} / sup{p(x, θ): θ ∈ Θ₀}.

Tests that reject H for large values of L(x) are called likelihood ratio tests. To see that this is a plausible statistic, recall from Section 2.2.2 that we think of the likelihood function L(θ, x) = p(x, θ) as a measure of how well θ "explains" the given sample x = (x1, . . . , xn). So, if sup{p(x, θ): θ ∈ Θ₁} is large compared to sup{p(x, θ): θ ∈ Θ₀}, then the observed sample is best explained by some θ ∈ Θ₁, and conversely. Also note that L(x) coincides with the optimal test statistic p(x, θ₁)/p(x, θ₀) when Θ₀ = {θ₀}, Θ₁ = {θ₁}. In particular cases, and for large samples, likelihood ratio tests have weak optimality properties to be discussed in Chapters 5 and 6.

In the cases we shall consider, p(x, θ) is a continuous function of θ and Θ₀ is of smaller dimension than Θ = Θ₀ ∪ Θ₁, so that the likelihood ratio equals the test statistic

λ(x) = sup{p(x, θ): θ ∈ Θ} / sup{p(x, θ): θ ∈ Θ₀}    (4.9.1)

whose computation is often simple. Note that in general λ(x) = max(L(x), 1).
We are going to derive likelihood ratio tests in several important testing problems. Although the calculations differ from case to case, the basic steps are always the same.

1. Calculate the MLE θ̂ of θ.

2. Calculate the MLE θ̂₀ of θ where θ may vary only over Θ₀.

3. Form λ(x) = p(x, θ̂)/p(x, θ̂₀).

4. Find a function h that is strictly increasing on the range of λ such that h(λ(X)) has a simple form and a tabled distribution under H. Because h(λ(X)) is equivalent to λ(X), we specify the size α likelihood ratio test through the test statistic h(λ(X)) and its (1 - α)th quantile obtained from the table.
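These steps translate directly into a numerical recipe whenever the two maximizations can be carried out. The sketch below is my own illustration (not from the text); the function names are assumptions, and the example evaluates 2 log λ for the one-sample normal problem of Section 4.9.2 using the matched-pair differences from the data example given there.

import numpy as np
from scipy import stats

def log_lambda(loglik, x, theta_hat, theta0_hat):
    """log of the likelihood ratio statistic (4.9.1):
    log lambda = l(theta_hat; x) - l(theta0_hat; x)."""
    return loglik(theta_hat, x) - loglik(theta0_hat, x)

# Example: N(mu, sigma^2), H: mu = mu0 with sigma^2 unknown.
def loglik(theta, x):
    mu, sigma2 = theta
    return np.sum(stats.norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))

x = np.array([1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4])
mu0 = 0.0
theta_hat = (x.mean(), x.var())                 # step 1: unrestricted MLE (ddof=0)
theta0_hat = (mu0, np.mean((x - mu0) ** 2))     # step 2: MLE over Theta_0
print(2 * log_lambda(loglik, x, theta_hat, theta0_hat))   # step 3: 2 log lambda

Step 4 then replaces 2 log λ by an equivalent statistic with a tabled distribution, which for this problem is |T_n| as derived in Section 4.9.2.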
We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions, bounds, and so on. For instance, we can invert the family of size α likelihood ratio tests of the point hypothesis H: θ = θ₀ and obtain the level (1 - α) confidence region

C(x) = {θ: p(x, θ) ≥ [c(θ)]⁻¹ sup_Θ p(x, θ)}    (4.9.2)

where sup_Θ denotes the sup over θ ∈ Θ and the critical constant c(θ₀) satisfies

P_{θ₀}[ sup_Θ p(X, θ) / p(X, θ₀) ≥ c(θ₀) ] = α.

It is often approximately true (see Chapter 6) that c(θ) is independent of θ. In that case, C(x) is just the set of all θ whose likelihood is on or above some fixed value dependent on the data. An example is discussed in Section 4.9.2.
This section includes situations in which θ = (θ₁, θ₂) where θ₁ is the parameter of interest and θ₂ is a nuisance parameter. We shall obtain likelihood ratio tests for hypotheses of the form H: θ₁ = θ₁₀, which are composite because θ₂ can vary freely. The family of such level α likelihood ratio tests obtained by varying θ₁₀ can also be inverted and yields confidence regions for θ₁. To see how the process works we refer to the specific examples in Sections 4.9.2-4.9.5.

4.9.2 Tests for the Mean of a Normal Distribution - Matched Pair Experiments
Suppose X1, . . . , Xn form a sample from a N(μ, σ²) population in which both μ and σ² are unknown. An important class of situations for which this model may be appropriate occurs in matched pair experiments. Here are some examples. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age, diet, and other factors. We are interested in expected differences in responses due to the treatment effect. In order to reduce differences due to the extraneous factors, we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors. We can regard twins as being matched pairs. After the matching, the experiment proceeds as follows. In the ith pair one patient is picked at random (i.e., with probability 1/2) and given the treatment, while the second patient serves as control and receives a placebo. Response measurements are taken on the treated and control members of each pair. Studies in which subjects serve as their own control can also be thought of as matched pair experiments. That is, we measure the response of a subject when under treatment and when not under treatment. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo, sales performance before and after a course in salesmanship, mileage of cars with and without a certain ingredient or adjustment, and so on.
Let X_i denote the difference between the treated and control responses for the ith pair. If the treatment and placebo have the same effect, the difference X_i has a distribution that is symmetric about zero. To this we have added the normality assumption. The test we derive will still have desirable properties in an approximate sense to be discussed in Chapter 5 if the normality assumption is not satisfied. Let μ = E(X₁) denote the mean difference between the response of the treated and control subjects. We think of μ as representing the treatment effect. Our null hypothesis of no treatment effect is then H: μ = 0. However, for the purpose of referring to the duality between testing and confidence procedures, we test H: μ = μ₀, where we think of μ₀ as an established standard for an old treatment.
Two-Sided Tests

We begin by considering K: μ ≠ μ₀. This corresponds to the alternative "The treatment has some effect, good or bad." However, as discussed in Section 4.5, the test can be modified into a three-decision rule that decides whether there is a significant positive or negative effect.
Form of the Two-Sided Tests

Let θ = (μ, σ²), Θ₀ = {(μ, σ²): μ = μ₀}. Under our assumptions, the problem of finding the supremum of p(x, θ) was solved in Example 3.3.6. We found that

sup{p(x, θ): θ ∈ Θ} = p(x, θ̂),

where

θ̂ = (x̄, σ̂²) = ( n⁻¹ Σ_{i=1}^n x_i , n⁻¹ Σ_{i=1}^n (x_i - x̄)² )

is the maximum likelihood estimate of θ.
Finding sup{p(x, θ): θ ∈ Θ₀} boils down to finding the maximum likelihood estimate σ̂₀² of σ² when μ = μ₀ is known and then evaluating the likelihood p(x, θ) at (μ₀, σ̂₀²). The likelihood equation is

∂/∂σ² log p(x, θ) = (1/2)[ σ⁻⁴ Σ_{i=1}^n (x_i - μ₀)² - n σ⁻² ] = 0,

which has the immediate solution

σ̂₀² = n⁻¹ Σ_{i=1}^n (x_i - μ₀)².
By Theorem 2.3.1, σ̂₀² gives the maximum of p(x, θ) for θ ∈ Θ₀. The test statistic λ(x) is equivalent to log λ(x), which thus equals

log λ(x) = log p(x, θ̂) - log p(x, (μ₀, σ̂₀²))
 = {-(n/2)[(log 2π) + (log σ̂²)] - n/2} - {-(n/2)[(log 2π) + (log σ̂₀²)] - n/2}
 = (n/2) log(σ̂₀²/σ̂²).

Our test rule, therefore, rejects H for large values of σ̂₀²/σ̂². To simplify the rule further we use the following equation, which can be established by expanding both sides:

n⁻¹ Σ_{i=1}^n (x_i - μ₀)² = n⁻¹ Σ_{i=1}^n (x_i - x̄)² + (x̄ - μ₀)².

Therefore,

σ̂₀²/σ̂² = 1 + (x̄ - μ₀)²/σ̂².

Because s² = (n - 1)⁻¹ Σ(x_i - x̄)² = nσ̂²/(n - 1), σ̂₀²/σ̂² is a monotone increasing function of |T_n| where

T_n = √n (x̄ - μ₀)/s.
Therefore, the likelihood ratio tests reject for large values of |T_n|. Because T_n has a T_{n-1} distribution under H (see Example 4.4.1), the size α critical value is t_{n-1}(1 - α/2) and we can use calculators or software that gives quantiles of the t distribution, or Table III, to find the critical value. For instance, suppose n = 25 and we want α = 0.05. Then we would reject H if, and only if, |T_n| ≥ 2.064.
One-Sided Tests

The two-sided formulation is natural if two treatments, A and B, are considered to be equal before the experiment is performed. However, if we are comparing a treatment and control, the relevant question is whether the treatment creates an improvement. Thus, the testing problem H: μ ≤ μ₀ versus K: μ > μ₀ (with μ₀ = 0) is suggested. The statistic T_n is equivalent to the likelihood ratio statistic λ for this problem. A proof is sketched in Problem 4.9.2. In Problem 4.9.11 we argue that P_θ[T_n ≥ t] is increasing in δ, where δ = (μ - μ₀)/σ. Therefore, the test that rejects H for T_n ≥ t_{n-1}(1 - α) is of size α for H: μ ≤ μ₀. Similarly, the size α likelihood ratio test for H: μ ≥ μ₀ versus K: μ < μ₀ rejects H if, and only if, T_n ≤ -t_{n-1}(1 - α).
Power Functions

To discuss the power of these tests, we need to introduce the noncentral t distribution with k degrees of freedom and noncentrality parameter δ. This distribution, denoted by T_{k,δ}, is by definition the distribution of Z/√(V/k) where Z and V are independent and have N(δ, 1) and χ²_k distributions, respectively. The density of Z/√(V/k) is given in Problem 4.9.12. To derive the distribution of T_n, note that from Section B.3, we know that √n(X̄ - μ)/σ and (n - 1)s²/σ² are independent and that (n - 1)s²/σ² has a χ²_{n-1} distribution. Because E[√n(X̄ - μ₀)/σ] = √n(μ - μ₀)/σ and Var(√n(X̄ - μ₀)/σ) = 1, √n(X̄ - μ₀)/σ has a N(δ, 1) distribution, with δ = √n(μ - μ₀)/σ. Thus, the ratio

T_n = [√n(X̄ - μ₀)/σ] / √[ ((n - 1)s²/σ²) / (n - 1) ]

has a T_{n-1,δ} distribution, and the power can be obtained from computer software or tables of the noncentral t distribution. Note that the distribution of T_n depends on θ = (μ, σ²) only through δ. The power functions of the one-sided tests are monotone in δ (Problem 4.9.11) just as the power functions of the corresponding tests of Example 4.2.1 are monotone in √n μ/σ. We can control both probabilities of error by selecting the sample size n large provided we consider alternatives of the form |δ| ≥ δ₁ > 0 in the two-sided case and δ ≥ δ₁ or δ ≤ δ₁ in the one-sided cases. Computer software will compute n. If we consider alternatives of the form (μ - μ₀) ≥ Δ, say, we can no longer control both probabilities of error by choosing the sample size. The reason is that, whatever be n, by making σ sufficiently large we can force the noncentrality parameter δ = √n(μ - μ₀)/σ as close to 0 as we please and, thus, bring the power arbitrarily close to α. We have met similar difficulties (Problem 4.4.7) when discussing confidence intervals. A Stein solution, in which we estimate σ from a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with |μ - μ₀| ≥ Δ, is possible (Lehmann, 1997, p. 260, Problem 17). With this solution, however, we may be required to take more observations than we can afford on the second stage.
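The power at noncentrality δ = √n(μ - μ₀)/σ and the sample size needed for a target power are easy to compute from the noncentral t distribution. The sketch below is my own illustration (not from the text) and assumes SciPy's noncentral t implementation scipy.stats.nct.

from scipy import stats

def t_test_power(n, delta1, alpha=0.05):
    """Power of the one-sided level-alpha t test of H: mu <= mu0 at
    delta = sqrt(n) * delta1, where delta1 = (mu - mu0)/sigma."""
    crit = stats.t.ppf(1 - alpha, df=n - 1)
    return stats.nct.sf(crit, df=n - 1, nc=n ** 0.5 * delta1)

def sample_size(delta1, power=0.90, alpha=0.05):
    """Smallest n whose power against delta1 = (mu - mu0)/sigma is at least `power`."""
    n = 2
    while t_test_power(n, delta1, alpha) < power:
        n += 1
    return n

print(t_test_power(25, 0.5), sample_size(0.5, power=0.90))

As the discussion above indicates, power can be controlled by n only when the alternative is specified on the δ = (μ - μ₀)/σ scale, not on the raw (μ - μ₀) scale.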
Likelihood Confidence Regions

If we invert the two-sided tests, we obtain the confidence region

C(X) = {μ: |√n(X̄ - μ)/s| ≤ t_{n-1}(1 - α/2)}.

We recognize C(X) as the confidence interval of Example 4.4.1. Similarly the one-sided tests lead to the lower and upper confidence bounds of Example 4.4.1.
Data Example

As an illustration of these procedures, consider the following data due to Cushny and Peebles (see Fisher, 1958, p. 121) giving the difference B - A in sleep gained using drugs A and B on 10 patients. This is a matched pair experiment with each subject serving as its own control.

Patient i    1     2     3     4     5     6     7     8     9    10
A          0.7  -1.6  -0.2  -1.2  -0.1   3.4   3.7   0.8   0.0   2.0
B          1.9   0.8   1.1   0.1  -0.1   4.4   5.5   1.6   4.6   3.4
B - A      1.2   2.4   1.3   1.3   0.0   1.0   1.8   0.8   4.6   1.4

If we denote the differences as x's, then x̄ = 1.58, s² = 1.513, and |T_n| = 4.06. Because t₉(0.995) = 3.25, we conclude at the 1% level of significance that the two drugs are significantly different. The 0.99 confidence interval for the mean difference μ between treatments is [0.32, 2.84]. It suggests that not only are the drugs different but in fact B is better than A because no hypothesis μ = μ' < 0 is accepted at this level. (See also (4.5.3).)
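The quantities quoted above can be reproduced directly from the B - A column; a minimal sketch (mine, assuming NumPy and SciPy) follows.

import numpy as np
from scipy import stats

d = np.array([1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4])  # B - A
n = len(d)
xbar, s = d.mean(), d.std(ddof=1)
T = np.sqrt(n) * xbar / s                       # one-sample t statistic, mu0 = 0
t_crit = stats.t.ppf(0.995, df=n - 1)           # two-sided 1% test
ci = (xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n))
print(round(xbar, 2), round(s**2, 3), round(T, 2), np.round(ci, 2))
# Reproduces: 1.58, 1.513, 4.06, and the 0.99 interval [0.32, 2.84].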
4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations

We often want to compare two populations with distributions F and G on the basis of two independent samples X1, . . . , X_{n1} and Y1, . . . , Y_{n2}, one from each population. For instance, suppose we wanted to test the effect of a certain drug on some biological variable (e.g., blood pressure). Then X1, . . . , X_{n1} could be blood pressure measurements on a sample of patients given a placebo, while Y1, . . . , Y_{n2} are the measurements on a sample given the drug. For quantitative measurements such as blood pressure, height, weight, length, volume, temperature, and so forth, it is usually assumed that X1, . . . , X_{n1} and Y1, . . . , Y_{n2} are independent samples from N(μ₁, σ²) and N(μ₂, σ²) populations, respectively. The preceding assumptions were discussed in Example 1.1.3. A discussion of the consequences of the violation of these assumptions will be postponed to Chapters 5 and 6.
Tests

We first consider the problem of testing H: μ₁ = μ₂ versus K: μ₁ ≠ μ₂. In the control versus treatment example, this is the problem of determining whether the treatment has any effect. Let θ = (μ₁, μ₂, σ²). Then Θ₀ = {θ: μ₁ = μ₂} and Θ₁ = {θ: μ₁ ≠ μ₂}. The log of the likelihood of (X, Y) = (X1, . . . , X_{n1}, Y1, . . . , Y_{n2}) is

log p(x, y, θ) = -(n/2) log(2πσ²) - (1/2σ²) [ Σ_{i=1}^{n1} (x_i - μ₁)² + Σ_{j=1}^{n2} (y_j - μ₂)² ]

where n = n1 + n2. As in Section 4.9.2 the likelihood function and its log are maximized over Θ by the maximum likelihood estimate θ̂. In Problem 4.9.6, it is shown that
θ̂ = (X̄, Ȳ, σ̂²), where

σ̂² = n⁻¹ [ Σ_{i=1}^{n1} (X_i - X̄)² + Σ_{j=1}^{n2} (Y_j - Ȳ)² ].

When μ₁ = μ₂ = μ, our model reduces to the one-sample model of Section 4.9.2. Thus, the maximum of p over Θ₀ is obtained for θ̂₀ = (μ̂, μ̂, σ̂₀²), where

μ̂ = n⁻¹ [ Σ_{i=1}^{n1} X_i + Σ_{j=1}^{n2} Y_j ] = (n1 X̄ + n2 Ȳ)/n

and

σ̂₀² = n⁻¹ [ Σ_{i=1}^{n1} (X_i - μ̂)² + Σ_{j=1}^{n2} (Y_j - μ̂)² ].

If we use the identities

n⁻¹ Σ_{i=1}^{n1} (X_i - μ̂)² = n⁻¹ [ Σ_{i=1}^{n1} (X_i - X̄)² + n1(X̄ - μ̂)² ]
n⁻¹ Σ_{j=1}^{n2} (Y_j - μ̂)² = n⁻¹ [ Σ_{j=1}^{n2} (Y_j - Ȳ)² + n2(Ȳ - μ̂)² ],

obtained by writing [X_i - μ̂]² = [(X_i - X̄) + (X̄ - μ̂)]² and expanding, we find that the log likelihood ratio statistic is equivalent to the test statistic |T| where

T = √(n1 n2 / n) (Ȳ - X̄)/s

and

s² = (n - 2)⁻¹ [ Σ_{i=1}^{n1} (X_i - X̄)² + Σ_{j=1}^{n2} (Y_j - Ȳ)² ].
To complete specification of the size α likelihood ratio test we show that T has a T_{n-2} distribution when μ₁ = μ₂. By Theorem B.3.3,

(1/σ)X̄,  (1/σ)Ȳ,  (1/σ²) Σ_{i=1}^{n1} (X_i - X̄)²,  and  (1/σ²) Σ_{j=1}^{n2} (Y_j - Ȳ)²

are independent and distributed as N(μ₁/σ, 1/n1), N(μ₂/σ, 1/n2), χ²_{n1-1}, χ²_{n2-1}, respectively. We conclude from this remark and the additive property of the χ² distribution that (n - 2)s²/σ² has a χ²_{n-2} distribution, and that √(n1 n2 / n)(Ȳ - X̄)/σ has a N(0, 1) distribution and is independent of (n - 2)s²/σ². Therefore, by definition, T ~ T_{n-2} under H and the resulting two-sided, two-sample t test rejects if, and only if,

|T| ≥ t_{n-2}(1 - α/2).

As usual, corresponding to the two-sided test, there are two one-sided tests with critical regions, T ≥ t_{n-2}(1 - α) for H: μ₂ ≤ μ₁ and T ≤ -t_{n-2}(1 - α) for H: μ₁ ≤ μ₂. We can show that these tests are likelihood ratio tests for these hypotheses. It is also true that these procedures are of size α for their respective hypotheses. As in the one sample case, this follows from the fact that, if μ₁ ≠ μ₂, T has a noncentral t distribution with noncentrality parameter

δ = √(n1 n2 / n) (μ₂ - μ₁)/σ.
Confidence Intervals

To obtain confidence intervals for μ₂ - μ₁ we naturally look at likelihood ratio tests for the family of testing problems H: μ₂ - μ₁ = Δ versus K: μ₂ - μ₁ ≠ Δ. As for the special case Δ = 0, we find a simple equivalent statistic |T(Δ)| where

T(Δ) = √(n1 n2 / n) (Ȳ - X̄ - Δ)/s.

If μ₂ - μ₁ = Δ, T(Δ) has a T_{n-2} distribution and inversion of the tests leads to the interval

Ȳ - X̄ ± s t_{n-2}(1 - α/2) √(n/(n1 n2))    (4.9.3)

for μ₂ - μ₁. Similarly, one-sided tests lead to the upper and lower endpoints of the interval as 1 - α/2 likelihood confidence bounds.
Data Example

As an illustration, consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines. From past experience, it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. The results in terms of logarithms were (from Hald, 1952, p. 472)

x (machine 1)   1.845   1.790   2.042
y (machine 2)   1.583   1.627   1.282

We test the hypothesis H of no difference in expected log permeability. H is rejected if |T| ≥ t_{n-2}(1 - α/2). Here ȳ - x̄ = -0.395, s² = 0.0264, and T = -2.977. Because t₄(0.975) = 2.776, we conclude at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines. The level 0.95 confidence interval for the difference in mean log permeability is -0.395 ± 0.368.

On the basis of the results of this experiment, we would select machine 2 as producing the smaller permeability and, thus, the more waterproof material. Again we can show that the selection procedure based on the level (1 - α) confidence interval has probability at most α/2 of making the wrong selection.
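These figures are easy to verify; a short sketch (mine, assuming NumPy and SciPy) computes T, its critical value, and the interval (4.9.3) for the permeability data.

import numpy as np
from scipy import stats

x = np.array([1.845, 1.790, 2.042])   # machine 1 (log permeability)
y = np.array([1.583, 1.627, 1.282])   # machine 2
n1, n2 = len(x), len(y)
n = n1 + n2
s2 = (np.sum((x - x.mean())**2) + np.sum((y - y.mean())**2)) / (n - 2)  # pooled variance
T = np.sqrt(n1 * n2 / n) * (y.mean() - x.mean()) / np.sqrt(s2)
t_crit = stats.t.ppf(0.975, df=n - 2)
half = np.sqrt(s2) * t_crit * np.sqrt(n / (n1 * n2))
print(round(T, 3), round(t_crit, 3), round(y.mean() - x.mean(), 3), round(half, 3))
# Reproduces: T = -2.977, t_4(0.975) = 2.776, difference -0.395 +- 0.368.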
4.9.4 The Two-Sample Problem with Unequal Variances

In two-sample problems of the kind mentioned in the introduction to Section 4.9.3, it may happen that the X's and Y's have different variances. For instance, a treatment that increases mean response may increase the variance of the responses. If normality holds, we are led to a model where X1, . . . , X_{n1}; Y1, . . . , Y_{n2} are two independent N(μ₁, σ₁²) and N(μ₂, σ₂²) samples, respectively. As a first step we may still want to compare mean responses for the X and Y populations. This is the Behrens-Fisher problem.

Suppose first that σ₁² and σ₂² are known. The log likelihood, except for an additive constant, is

-(1/2σ₁²) Σ_{i=1}^{n1} (x_i - μ₁)² - (1/2σ₂²) Σ_{j=1}^{n2} (y_j - μ₂)².

The MLEs of μ₁ and μ₂ are, thus, μ̂₁ = x̄ and μ̂₂ = ȳ for (μ₁, μ₂) ∈ R × R. When μ₁ = μ₂ = μ, setting the derivative of the log likelihood equal to zero yields the MLE

μ̂ = (n1 x̄ + γ n2 ȳ) / (n1 + γ n2)
where γ = σ₁²/σ₂². By writing

Σ_{i=1}^{n1} (x_i - μ̂)² = Σ_{i=1}^{n1} (x_i - x̄)² + n1(μ̂ - x̄)²
Σ_{j=1}^{n2} (y_j - μ̂)² = Σ_{j=1}^{n2} (y_j - ȳ)² + n2(μ̂ - ȳ)²

we obtain

λ(x, y) = exp{ (n1/2σ₁²)(μ̂ - x̄)² + (n2/2σ₂²)(μ̂ - ȳ)² }.

Next we compute

μ̂ - x̄ = γ n2(ȳ - x̄) / (n1 + γ n2).

Similarly, μ̂ - ȳ = n1(x̄ - ȳ)/(n1 + γ n2). It follows that the likelihood ratio test is equivalent to the statistic |D|/σ_D, where D = Ȳ - X̄ and σ_D² is the variance of D, that is,

σ_D² = Var(X̄) + Var(Ȳ) = σ₁²/n1 + σ₂²/n2.

Thus, (D - Δ)/σ_D has a N(0, 1) distribution, where Δ = μ₂ - μ₁. Because σ_D is unknown, it must be estimated. An unbiased estimate is

s_D² = s₁²/n1 + s₂²/n2.

It is natural to try and use D/s_D as a test statistic for the one-sided hypothesis H: μ₂ ≤ μ₁, |D|/s_D as a test statistic for H: μ₁ = μ₂, and more generally (D - Δ)/s_D to generate confidence procedures. Unfortunately the distribution of (D - Δ)/s_D depends on σ₁²/σ₂² for fixed n1, n2. For large n1, n2, by Slutsky's theorem and the central limit theorem, (D - Δ)/s_D has approximately a standard normal distribution (Problem 5.3.28). For small and moderate n1, n2 an approximation to the distribution of (D - Δ)/s_D due to Welch (1949) works well.
Let c = (s₁²/n1)/s_D². Then Welch's approximation is T_k where

k = [ c²/(n1 - 1) + (1 - c)²/(n2 - 1) ]⁻¹.

When k is not an integer, the critical value is obtained by linear interpolation in the t tables or using computer software. The tests and confidence intervals resulting from this approximation are called Welch's solutions to the Behrens-Fisher problem. Wang (1971) has shown the approximation to be very good for α = 0.05 and α = 0.01, the maximum error in size being bounded by 0.003. Note that Welch's solution works whether the variances are equal or not. The LR procedure derived in Section 4.9.3, which works well if the variances are equal or n1 = n2, can unfortunately be very misleading if σ₁² ≠ σ₂² and n1 ≠ n2. See Figure 5.3.3 and Problem 5.3.28.
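A sketch of Welch's approximate test (mine, assuming NumPy and SciPy; the example data are made up) computes D/s_D and the approximate degrees of freedom k.

import numpy as np
from scipy import stats

def welch_test(x, y):
    """Two-sided Welch test of H: mu1 = mu2; returns (statistic, df, p-value)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    v1, v2 = x.var(ddof=1) / n1, y.var(ddof=1) / n2
    sd = np.sqrt(v1 + v2)                  # s_D
    c = v1 / (v1 + v2)
    k = 1.0 / (c**2 / (n1 - 1) + (1 - c)**2 / (n2 - 1))   # Welch degrees of freedom
    stat = (y.mean() - x.mean()) / sd      # D / s_D
    p = 2 * stats.t.sf(abs(stat), df=k)    # SciPy accepts non-integer df
    return stat, k, p

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=8)
y = rng.normal(0.5, 3.0, size=20)          # unequal variances, unequal sample sizes
print(welch_test(x, y))
# Should agree with scipy.stats.ttest_ind(y, x, equal_var=False).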
4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions

If n subjects (persons, machines, fields, mice, etc.) are sampled from a population and two numerical characteristics are measured on each case, then we end up with a bivariate random sample (X₁, Y₁), . . . , (Xn, Yn). Empirical data sometimes suggest that a reasonable model is one in which the two characteristics (X, Y) have a joint bivariate normal distribution, N(μ₁, μ₂, σ₁², σ₂², ρ), with σ₁² > 0, σ₂² > 0.
Testing Independence, Confidence Intervals for ρ

The question "Are two random variables X and Y independent?" arises in many statistical studies. Some familiar examples are: X = weight, Y = blood pressure; X = test score on a mathematics exam, Y = test score on an English exam; X = percentage of fat in diet, Y = cholesterol level in blood; X = average cigarette consumption per day in grams, Y = age at death. If we have a sample as before and assume the bivariate normal model for (X, Y), our problem becomes that of testing H: ρ = 0.
Two-Sided Tests

If we are interested in all departures from H equally, it is natural to consider the two-sided alternative K: ρ ≠ 0. We derive the likelihood ratio test for this problem. Let θ = (μ₁, μ₂, σ₁², σ₂², ρ). Then Θ₀ = {θ: ρ = 0} and Θ₁ = {θ: ρ ≠ 0}. From (B.4.9), the log of the likelihood function of X = ((X₁, Y₁), . . . , (Xn, Yn)) is

log p(x, θ) = -n[ log(2πσ₁σ₂) + (1/2) log(1 - ρ²) ]
 - [1/2(1 - ρ²)] [ Σ_{i=1}^n (x_i - μ₁)²/σ₁² - 2ρ Σ_{i=1}^n (x_i - μ₁)(y_i - μ₂)/σ₁σ₂ + Σ_{i=1}^n (y_i - μ₂)²/σ₂² ].
The unrestricted maximum likelihood estimate θ̂ = (x̄, ȳ, σ̂₁², σ̂₂², ρ̂) was given in Problem 2.3.13 as

σ̂₁² = n⁻¹ Σ_{i=1}^n (x_i - x̄)²,   σ̂₂² = n⁻¹ Σ_{i=1}^n (y_i - ȳ)²,
ρ̂ = [ n⁻¹ Σ_{i=1}^n (x_i - x̄)(y_i - ȳ) ] / σ̂₁σ̂₂.

ρ̂ is called the sample correlation coefficient and satisfies -1 ≤ ρ̂ ≤ 1 (Problem 2.1.8). When ρ = 0 we have two independent samples, and θ̂₀ can be obtained by separately maximizing the likelihood of X₁, . . . , Xn and that of Y₁, . . . , Yn. We have θ̂₀ = (x̄, ȳ, σ̂₁², σ̂₂², 0) and the log of the likelihood ratio statistic becomes

log λ(x) = -(n/2) log(1 - ρ̂²).    (4.9.4)

Thus, log λ(x) is an increasing function of ρ̂², and the likelihood ratio tests reject H for large values of |ρ̂|. To obtain critical values we need the distribution of ρ̂ or an equivalent statistic under H. Now,
ρ̂ = Σ_{i=1}^n (U_i - Ū)(V_i - V̄) / { [ Σ_{i=1}^n (U_i - Ū)² ]^{1/2} [ Σ_{i=1}^n (V_i - V̄)² ]^{1/2} }

where U_i = (X_i - μ₁)/σ₁, V_i = (Y_i - μ₂)/σ₂. Because (U₁, V₁), . . . , (Un, Vn) is a sample from the N(0, 0, 1, 1, ρ) distribution, the distribution of ρ̂ depends on ρ only. If ρ = 0, then by Problem B.4.9,

T_n = √(n - 2) ρ̂ / √(1 - ρ̂²)    (4.9.5)

has a T_{n-2} distribution. Because |T_n| is an increasing function of |ρ̂|, the two-sided likelihood ratio tests can be based on |T_n| and the critical values obtained from Table II. When ρ = 0, the distribution of ρ̂ is available in computer packages. There is no simple form for the distribution of ρ̂ (or T_n) when ρ ≠ 0. A normal approximation is available. See Example 5.3.5. Qualitatively, for any α, the power function of the LR test is symmetric about ρ = 0 and increases continuously from α to 1 as |ρ| goes from 0 to 1. Therefore, if we specify indifference regions, we can control probabilities of type II error by increasing the sample size.
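The two-sided test based on (4.9.5) is straightforward to compute; the following sketch (mine, assuming SciPy, with made-up data) evaluates ρ̂, T_n, and the p-value, and cross-checks against scipy.stats.pearsonr.

import numpy as np
from scipy import stats

def correlation_t_test(x, y):
    """Two-sided test of H: rho = 0 via T_n = sqrt(n-2)*rhohat/sqrt(1 - rhohat^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    rhohat = np.corrcoef(x, y)[0, 1]
    T = np.sqrt(n - 2) * rhohat / np.sqrt(1 - rhohat**2)
    p = 2 * stats.t.sf(abs(T), df=n - 2)
    return rhohat, T, p

rng = np.random.default_rng(3)
x = rng.normal(size=9)
y = 0.2 * x + rng.normal(size=9)
print(correlation_t_test(x, y))
print(stats.pearsonr(x, y))   # gives the same two-sided p-value under the normal model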
One-Sided Tests

In many cases, only one-sided alternatives are of interest. For instance, if we want to decide whether increasing fat in a diet significantly increases cholesterol level in the blood, we would test H: ρ = 0 versus K: ρ > 0 or H: ρ ≤ 0 versus K: ρ > 0. It can be shown that ρ̂ is equivalent to the likelihood ratio statistic for testing H: ρ ≤ 0 versus K: ρ > 0 and similarly that -ρ̂ corresponds to the likelihood ratio statistic for H: ρ ≥ 0 versus K: ρ < 0. We can show that P_θ[ρ̂ ≥ c] is an increasing function of ρ for fixed c (Problem 4.9.15). Therefore, we obtain size α tests for each of these hypotheses by setting the critical value so that the probability of type I error is α when ρ = 0. The power functions of these tests are monotone.

Confidence Bounds and Intervals
Usually testing independence is not enough and we want bounds and intervals for ρ giving us an indication of what departure from independence is present. To obtain lower confidence bounds, we can start by constructing size α likelihood ratio tests of H: ρ = ρ₀ versus K: ρ > ρ₀. These tests can be shown to be of the form "Accept if, and only if, ρ̂ ≤ c(ρ₀)" where P_{ρ₀}[ρ̂ ≤ c(ρ₀)] = 1 - α. We obtain c(ρ) either from computer software or by the approximation of Chapter 5. Because c can be shown to be monotone increasing in ρ, inversion of this family of tests leads to level 1 - α lower confidence bounds. We can similarly obtain 1 - α upper confidence bounds and, by putting two level 1 - α/2 bounds together, we obtain a commonly used confidence interval for ρ. These intervals do not correspond to the inversion of the size α LR tests of H: ρ = ρ₀ versus K: ρ ≠ ρ₀ but rather of the "equal-tailed" test that rejects if, and only if, ρ̂ ≥ d(ρ₀) or ρ̂ ≤ c(ρ₀), where P_{ρ₀}[ρ̂ ≥ d(ρ₀)] = P_{ρ₀}[ρ̂ ≤ c(ρ₀)] = α/2. However, for large n the equal-tails and LR confidence intervals approximately coincide with each other.
Data Example
As an illustration, consider the following bivariate sample of weights x_i of young rats at a certain age and the weight increases y_i during the following week. We want to know whether there is a correlation between the initial weights and the weight increases and formulate the hypothesis H: ρ = 0.

x (initial weight)    370.7   . . .
y (weight increase)    18.1   . . .

Here ρ̂ = 0.18 and T_n = 0.48. Thus, by using the two-sided test and referring to the T₇ tables, we find that there is no evidence of correlation: the p-value is bigger than 0.75.

Summary. The likelihood ratio test statistic λ is the ratio of the maximum value of the likelihood under the general model to the maximum value of the likelihood under the model specified by the hypothesis. We find the likelihood ratio tests and associated confidence procedures for four classical normal models:
(1) Matched pair experiments in which differences are modeled as N(μ, σ²) and we test the hypothesis that the mean difference μ is zero. The likelihood ratio test is equivalent to the one-sample (Student) t test.

(2) Two-sample experiments in which two independent samples are modeled as coming from N(μ₁, σ²) and N(μ₂, σ²) populations, respectively. We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the two-sample (Student) t test.

(3) Two-sample experiments in which two independent samples are modeled as coming from N(μ₁, σ₁²) and N(μ₂, σ₂²) populations, respectively. When σ₁² and σ₂² are known, the likelihood ratio test is equivalent to the test based on |Ȳ - X̄|. When σ₁² and σ₂² are unknown, we use (Ȳ - X̄)/s_D, where s_D is an estimate of the standard deviation of D = Ȳ - X̄. Approximate critical values are obtained using Welch's t distribution approximation.

(4) Bivariate sampling experiments in which we have two measurements X and Y on each case in a sample of n cases. We test the hypothesis that X and Y are independent and find that the likelihood ratio test is equivalent to the test based on |ρ̂|, where ρ̂ is the sample correlation coefficient. We also find that the likelihood ratio statistic is equivalent to a t statistic with n - 2 degrees of freedom.
4.10 PROBLEMS AND COMPLEMENTS
Problems for Section 4.1 1. Suppose that X 1 , uniform distribution
, Xn are independently and identically distributed according to the U(O, fJ). Let Mn max (X1 , . . . , Xn) and let
• . .
=
1 if Mn � C 0 otherwise.
=
(a) Compute the power function of be and show that it is a monotone increasing function of e.
(b) In testing H : f) ::; � 0.05 ?
versus K
:
f)
>
! . what choice of
c
would make be have size
exactly
(c) Draw a rough graph of the power function of be (d) How large should n be so that the be (e) If in a sample of size n
=
20, Mn
=
specified in (b) when
specified in (b) has power
n
=
0. 98 for f)
20. =
�?
0.48, what is the p-value?
2. Let X1 , , Xn denote the times in days to failure of n similar pieces of equipment. Assume the model where X = (X1 , . . . , Xn ) is an £ ( >. ) sample. Consider the hypothesis H that the mean life 1/ >. p, ::; f..L o . . • •
=
. i
270
Testing and Confidence Regions
Chapter 4
(a) Use the result of Problem B.3.4 to show that the test with critical region
[X > !"ox( 1
-
<>) /2n] ,
where x(l - a) is the (1 - o:)th quantile of the X�n distribution, is a size a test.
(b) Give an expression of the power in terms of the X� n distribution.
(c) Use the central limit theorem to show that ) /!") + vn(l" - l"o) j !"] is an approximation to the power of the test in part (a). Draw a graph of the approximate power function. Hint: Approximate the critical region by [X > l"o(l + z(1 - a) / vn)]
;
i •
;
.
(d) The following are days until failure of air monitors at a nuclear plant. If f.Lo = 25, give a normal approximation to the significance probability. Days until failure: 3 150 4034 3237 34 2 3 1 6 5 14 150 27 4 6 27 10 3037 Is H rejected at level a � 0.05? 3. Let X1, . . .
, Xn be a P(O) sample.
(a) Use the MLE X of 0 to construct a level a test for H : 0 � Oo versus K : 0 > 00.
(b) Show that the power function of your test is increasing in 8. (c) Give an approximate expression for the critical value if n is large and 8 not too close to 0 or oo. (Use the central limit theorem.)
4. Let X1 ,
• • •
,
c • •
Xn be a sample from a population with the Rayleigh density •
f(x,O)
=
(xj0 2 ) exp{-x2j202 }, x > 0 0 > 0. ,
(a) Construct a test of H : B = 1 versus K : B > with approximate size o: using a complete sufficient statistic for this model. Hint: Use the central limit theorem for the critical value.
1
(b) Check that your test statistic has greater expected value under K than under H. 5. Show that if H is simple and the test statistic T has a <;ontinuous distribution, then the p-value a(T) has a unifonn, U(O, 1 ) , distribution.
• • •
1
I
i I
I
Hint: See Problem B.2.12.
6. Suppose that T1 , . . . , Tr are independent test statistics for the same simple H and that each Ti has a continuous distribution, j = 1, . . , r. Let o:(TJ) denote the p-value for 1j. j = 1 1 . . , r. .
.
(a) Show that, under H, T
Hint: See Problem B.3.4. 7. Establish
=
-2 L.:;�1 log <>(T;) has a X�r distribution.
(4.1.3). Assume that Fo and F are continuous.
I , ' •
i
Section 4.10
271
Problems and Complements
8. (a) Show that the power Pp[Dn > ka) of the Kolmogorov test is bounded below by
�
sup Fp[[F(x) X
�
·
Fo(x)l > ko[
Dn > [ F(x) - Fo(x)[ for each .T. (b) Suppose Fo isN(O, I) and F(x) � (l+exp(->•/r)) - 1 where r � J3/rr is chosen so that J.0000 x2dF(x) = 1. (This is the logistic distribution with mean zero and variance 1.) Evaluate the bound Pp(IF (x) - Fo(x)l > ka ) for a � 0.10, n � 80 and x � 0.5, � 1, and 1.5 using the normal approximation to the binomial distribution of nF(x) and the Hint:
approximate critical value in Example 4.1.5.
(c) Show that if F and Fo are continuous and mogorov test tends to 1 as n ---+ oo.
F
"I
Fo,
then the power of the Kol
9. Let X1 , . . . , Xn be i.i.d. with distribution function F and consider H : F = F0. Suppose that the distribution £o of the statistic T = T(X) is continuous under H and that H is rejected for large values of T. Let T(l), . . . , y( B ) be B independent Monte Carlo simulated values of T. (In practice these can be obtained by drawing B indepen dent samples X(ll, . . . , X( B) from Fo on the computer and computing T(j) = T(X Ul ) , j = 1 , . . . , B. Here, to get X with distribution .Fo, generate a U(O, 1) variable on the 1 computer and set X � F0- (U) as in Problem B.2.12(b).) Next let T(l), . . . , r<8 + 1) de note T, y{t), . . . , y(B) ordered. Show that the test rejects H iff T > T( B+ I-m) has level a � m/(B + !). Hint: If H is true T(X), T(X< 11), . . . , T(X<81) is a sample of size B + I from Co. Use the fact that T(X) is equally likely to be any particular order statistic. 10. (a) Show that the statistic Tn of Example 4.1.6 is invariant under location and scale. That is, if x; � (X, - a)jb, b > 0, then Tn(X') � Tn(X).
(b) Use part (a) to conclude that CN( ",u')(Tn) = CN(o,l) (Tn).
ll. In Example 4.1.5, let ¢( u) be a function from (0, I) to (0, oo ) , and let a the statistics � sup ,P(Fo(x))[F(x) - Fo(xW S¢,a
�
�
T¢,a
sup¢(F(x))[F(x) - Fo (x) l"
U.p,a -
j ,P(Fo(x))IF(x) - Fo (x) ["dFo(x) j ,P(F(x))[F(x) - Fo(x) l "dF(x).
V¢,a
Fo .
X
X
> 0. Define
(a) For each of these statistics show that the distribution under H does not depend on
(b) When '1/J(u) = 1 and a: = 2. Vw,o: is called the Cramer-von Mises statistic. Express
the Cramer-von Mises statistic as a sum.
272
Testing and Confidence Regions
Chapter 4
(c) Are any of the four statistics in (a) invariant under location and scale. (See Problem 4.1.10.)
!, , •
I
•
12. Expected p-values. Consider a test with critical region of the form {T > c} for testing H : 8 = (}0 versus I< : (} > (}0• Without loss of generality, take 80 = 0. Suppose that T has a continuous distribution Fe. then the p-value is
U = 1 - F0 (T).
• '
,
!
;i
(a) Show that if the test has level a, the power is
•
• •
(3(8) = P(U < a) = 1 - Fe(F0 1 (1 - a)) 1
1 where F0- (u) = inf{t : Fo (t) 2: u). (b) Define the expected p-value as EPV(8) = EeU. Let T0 denote a random variable with distribution F0, which is independent ofT. Show that EPV(8) = P(To > T). Hint: P(To > T) = J P(To 2: t I T = t)J.(t)dt where fe(t) is the density of Fe(t).
(c) Suppose that for each a E (0, 1), the UMP test is of the form 1 {T > c}. Show that the EPV(8) for 1{T > c) is uniformly minimal in 8 > 0 when compared to the EPV(8) for any other test. Hint: P(T < to I To = to) is 1 minus the power of a test with critical value to . (d) Consider the problem of testing H : Ji, = Ji,o versus K : Ji, > J.to on the basis of the N(p., "2 ) sample X1, . . . , Xn, where " is known. Let T = X p.0 and 8 p. - p.0. Show that EPV(8) = if!( -.Jii8/ ..;2" ) , where if! denotes the standard normal distribution. (For a recent review of expected p values see Sackrowitz and Samuel-Cabo, 1999.) -
' :l
=
Problems for Section 4.2
1. Consider Examples 3.3.2 and 4.2.1. You want to buy one of two systems. One has The first system costs $106, the signal-to-noise ratio vfao = 2, the other has vfao = other $105• One second of transmission on either system costs $103 each. Whichever system you buy during the year, you intend to test the satellite 100 times. If each time you test, you want the number of seconds of response sufficient to ensure that both probabilities of error are < 0.05, which system is cheaper on the basis of a year's operation?
1.
2. Consider a population with three kinds of individuals labeled 1, 2, and 3 occuring in the Hardy-Weinberg proportions /(1,8) = 82 , !(2,8) = 28(1 - 8), f(3,8) = (1 - 8) 2 • For a sample X1 , . . . , Xn from this population, let N1 N2, and N3 denote the number of Xj equal to I, 2, and 3, respectively. Let 0 < 8o < 8, < 1 . .
(a) Show that L(x, 8o, 8,) is an increasing function of 2N,
N2 . (b) Show that if c > 0 and a E (0, 1) satisfy Pe, [2N1 + N2 > c] = a, then the test that rejects H if, and only if, 2N, + N2 > c is MP for testing H : 8 = 80 versus K : 8 = 81. +
3. A gambler observing a game in which a single die is tossed repeatedly gets the impres
sion that 6 comes up about 18% of the time, 5 about 14% of the time, whereas the other ,
I
i !
, '
j
•
I
l j
'
1
i
I
1 i •
I !
I
1
�:-- :_:::_:��-=c==-==-==='---Section 4.10
273
Problems and Complements
· -
-
four numbers are equally likely to occur (i.e., with probability 0.17). Upon being asked to play, the gambler asks that he first be allowed to test his hypothesis by tossing the die n times.
(a) What test statistic should he use if the only alternative he considers is that the die is fair?
(b) Show that if n = 2 the most powerful level 0.0196 test rejects if, and only if, two 5's are obtained.
(c) Using the fact that if (N₁, . . . , N_k) ~ M(n, θ₁, . . . , θ_k), then a₁N₁ + ··· + a_kN_k has approximately a N(nμ, nσ²) distribution, where μ = Σ_{i=1}^k a_iθ_i and σ² = Σ_{i=1}^k θ_i(a_i - μ)², find an approximation to the critical value of the MP level α test for this problem.
4. A formulation of goodness of tests specifies that a test is best if the maximum probability
of error (of either type) is as small as possible. (a) Show that if in testing H : {} = {}0 versus K : fJ = fJ1 there exists a critical value c such that P0, [L(X, 80, 81) > c] = I - Po, [L(X, Bo, 81) > c] then the likelihood ratio test with critical value c is best in this sense. (b)
Find the test that is best in this sense for Example 4.2.1.
5. A newly discovered skull has cranial measurements (X, Y) known
to
be distributed either (as in population 0) according to N(O, 0, 1, 1 , 0.6) or (as in population 1) according to N(l, I, I, I, 0.6) where all parameters are known. Find a statistic T(X, Y) and a critical value c such that if we use the classification rule, (X, Y) belongs to population 1 if T > c, and to population 0 ifT < c, then the maximum of the two probabilities ofmisclassification Po [T > c], P!(T < c] is as small as possible. Hint: Use Problem 4.2.4 and recall (Proposition B.4.2) that linear combinations of bivariate normal random variables are normally distributed. 6. Show that if randomization is permitted, MP-sized a: likelihood ratio tests with 0
1 have power nondecreasing in the sample size.
< a: <
7. Prove Corollary 4.2.1. Hint: The MP test has power at least that of the test with test function d(x) = a:. 8. In Exarnle 4.2.2, derive the UMP test defined by (4.2. 7). 9. In Example 4.2.2, if .
a
(1, 0, . . . , of and �0 # I, find the MP test for testing
< 1 , prove Theorem 4.2. 1(a) using the connection between likelihood ratio
tests and Bayes tests given in Remark 4.2. 1.
i
274
Testing and Confidence Regions
Chapter 4
'
' '
'
Problems for Section 4.3
1. Let X1 be the number of arrivals at a service counter on the ith of a sequence of n days. A possible model for these data is to assume that customers arrive according to a homo geneous Poisson process and, hence, that the Xi are a sample from a Poisson distribution with parameter 8, the expected number of arrivals per day. Suppose that if 8 < 80 it is not worth keeping the counter open.
(a) Exhibit the optimal (UMP) test statistic for H : 8 < 00 versus K
:
0 > 00 .
(b) For what levels can you exhibit a UMP test? (c) What distribution tables would you need to calculate the power function of the UMP
test? 2. Consider the foregoing situation of Problem 4.3.1. You want to ensure that if the arrival rate is < 10, the probability of your deciding to stay open is < 0.01, but if the arrival rate is > 15, the probability of your deciding to close is also < 0.01. How many days must you observe to ensure that the UMP test of Problem 4.3.1 achieves this? (Use the normal approximation.) 3. In Example 4.3.4, show that the power of the UMP test can be written as
2 ( Gn( n(<>)/ <J) <J�X (J = <J ) where Gn denotes the X�n distribution function. ' .
•
'
. •
4. Let X1, . . . , Xn be the times in months until failure of n similar pieces of equipment. lf the equipment is subject to wear. a model often used (see Barlow and Proschan, 1965) is the one where X1, . . . , Xn is a sample from a Weibull distribution with density f(x, A) = .\cxc -te->.xc. x > 0. Here c is a known positive constant and .\ > 0 is the parameter of mterest. •
(a) Show that K:
1/>. > 1/>.o.
L� 1 Xf is an optimal test statistic for testing H : 1/.\ < 1/ .\o versus
(b) Show that the critical value for the size a test with critical region [L�-1 Xf > k] is k = X2 n ( 1 - <>) /2>.o where X2n ( 1 - <>) is the ( 1 <>)th quantile of the X�n distribution and that the power function of the UMP level a test is given by -
1 - G,n(>.x,n(1 - <>)/ >.o) where G2 n denotes the X�n distribution function. Hint: Show that X[ &(>.) . -
1/ >.o = 12. Find the sample size needed for a level 0.01 test to have power at least 0.95 at the alternative value 1/>.1 = 1 5 Use the normal approximation to the (c) Suppose
" •
! j
J
!
l
'
i
l i
.
critical value and the probability of rejection.
5. Show that if X1, . . . , Xn is a sample from a truncated binomial distribution with I
p(x, 0) =
(:)
'
' '
I .
n x /[1 Ox(l - 0)" (1 - O) ), X = 1, . . . , n,
,
i
Section 4.10
275
Problems and Complements
then E7 1 Xi is an optimal test statistic for testing H : 8
=
8o versus I\ ; 8 > 80.
6. Let X1, . . . , Xn denote the incomes of n persons chosen at random from a certain population. Suppose that each Xi has the Pareto density 1 +BI -( f(x,O) = c8Bx , x>c where 8 > 1 and c > 0.
(a) Express mean income J.L in terms of 8.
:
(b) Find the optimal test statistic for testing H
J.L
=
J.Lo versus
K : J.L > po.
(c) Use the central limit theorem to find a normal approximation to the critical value of test in part (b). Hint: Use the results of Theorem 1.6.2 to find the mean and variance of the optimal test statistic. 7. In the goodness-of-fit Example 4.1.5, suppose that F0 (x) has a nonzero density on some interval (a, b), -oo < a < b < oo, and consider the alternative with distribution function F(x, B) = Fi!(x), 0 < B < 1. Show that the UMP test for testing H : B > 1 versus K : B < 1 rejects H if -2E log F0(X,) 2: x1_a, where Xt-a is the (1 - u)th quantile of the X�n distribution. (See Problem 4. 1 .6.) It follows that Fisher's method for cgmbining p-values (see 4.1.6) is UMP for testing that the p-values are uniformly distributed against F(u) = u8, 0 < B < l . 8. Let the distribution of smvival times of patients receiving a standard treatment be the known distribution Fo, and let Y1, . . . , Yn be the i.i.d. survival times of a sample of patients receiving an experimental treatment. (a) Lehmann Alte�arive. In Problem 1.1. 1 2, we derived the model G(y, Ll.) = 1 - [1 - F0(y)J", y > 0, Ll. > 0. To test whether the new treatment is beneficial we test H : � < 1 versus K : .6. > 1. Assume that Fo has � density fo. Find the UMP test. Show how to find critical values. (b) Nabeya-Miura Alternative. For the purpose of modeling, imagine a sequence X1 , X2, of i.i.d. survival tjmes with distribution F0 • Let N be a zero-truncated Poisson, P()..), random variable, which is independent of X1 J X2, .
•
•
.
P(Y < y)
=
e
,
-
-
1
.
•
C(max{X1 ,
(i) Show that if we model the distribution of Y as .>..F e o (y)
•
1
•
•
.
, XN ) ) then ,
, y > 0, A > 0.
(ii) Show that if we model the distribotion of Y as C(min{Xt, . . . , XN )) , then P(Y < y)
=
e-.XFo(Y) e
,
-
_
1
1 ,
y > 0, A > 0.
276
Testing and Confidence Regions
Chapter 4
(iii) Consider the model G(y,
efiFo (Y) _ 1
0)
e• - 1
j
, O f' O
Fo(Y), o � o. To see whether the new treatment is beneficial, we test H
Fo has a density fo(y). I:� 1 Fo( l:i).
Assume that
:
{} < 0 versus K : 8 > 0.
Show that the UMP test is based on the statistic
Xn be i.i.d. with distribution function F(x). We want to test whether F is exponential, F(x) � 1 - exp( -x), x > 0, or Weibull, F( x ) � 1 - exp( -x 9), x > 0, B > 0. Find the MP test for testing H : {} = 1 versus K : B = 81 > 1. Show that the test is 9.
Let
X1,
•
.
.
,
not UMP.
10. Show that under the assumptions of Theorem
complete.
Hint: �
Consider the class of all
4.3.2 the class
Bayes tests of H : ()
1r{Oo} 1 - 1r{O,} varies between 0 and l.
=
/.00 e,
p(x, O)d1r(O)j
The left-hand side equals
j"'
-oo
of all Bayes tests is =
lh
where
loss, every Bayes test for
>
.
f6";' L(x, 0, 00)d1r(O) f_':x, L(x, 0, Oo)d1r(O)
.
Section 4-4 •
, Xn
be a sample from a normal population with unknown mean J.L and
cr2. Using a pivot based on Er 1 (Xi - X)2• (a) Show how to construct level {1 - a) confidence intervals of fixed finite length for
unknown variance log a2•
1 (X; - X) 2 � 16.52, n � a) UCB for u2? announce as your level {1
(b)
Suppose that Ef
2,
a
�
0.01.
Wbat would you
-
(B/2)t� + €i, i = 1, . . , n, where the €i are independent normal random variables with mean 0 and known variance u2 (cf. Problem 2.2.1).
2. I '
'
•
Let
Xi
=
.
'
;
ll '
l
' I '
I
I
1 l
J '
l I i
T(x), the denominator decreasing. 12. Show that under the assumptions of Theorem 4.3.2, 1 - 6t is UMP for testing H : 8 > Bo versus K : 8 < Bo.
.
' •
•
The numerator is an increasing function of
1. Let X1 ,
i
p (x, O)d1r(O) ( <) L
6
Problems for
I
•
80 versus K : B
11. Show that under the assumptions of Theorem 4.3.1 and 0-1 H : B < 80 versus K : B > 81 is of the form Ot for some t. Hint: A Bayes test rejects (accepts) H if
ll
oS�ec�t:�io�o�4.,.l=O�P�m�b�le�m�n�d�C�o �,�· �m�p21e=m2e=''='�---(a) Using a pivot based on the MLE
------
----7 --�27
(2L:r<-- l t;X!)/�i 1 ti ofB, find a fixed length level
(1 - a) confidence interval for B. (b) ItO < ti < 1, i = 1, . . . , n, but we may otherwise choose the t! freely, what values should we use for the tt so as to make our interval as short as possible for given a? 3. Let X1, . . . , Xn be as in Problem 4.4.1. Suppose that an experimenter thinking he knows the value of a2 uses a lower confidence bound for 1-l of the form p;(X) = X - c, where c is chosen so that the confidence level under the assumed value of a-2 is 1 - a. What is the actual confidence coefficient of J-t, if a2 can take on all positive values? 4. Suppose that in Example 4.4.3 we know that 8 < 0.1. (a) Justify the interval [il, min( 8, 0.1 )] if 8 < 0.1, [0. 1, 0.1] if 8 > 0.1, where 8, 8 are given by (4.4.3). (b) Calculate the smallest
n needed to bound the length of the 95% interval of part (a)
Compare your result to the n needed for
by
0.02.
5.
Show that if q(X) is a level
then
[q(X), q(X)]
is a level
interval arbitrarily if q
Hint:
6.
Use (A.2.7) .
Show that if
(4.4.3).
(1 - a,) LCB and q(X) is a level (1 - a,) UCB for q(8), (1 - (a, + a,) ) confidence interval for q(8). (Define the
> q.)
X1, . • • , Xn are i.i.d. N(J-t, u2 ) and O:t + 0:2 < a:, then the shortest level
( 1 - a) interval of the form
is obtained by taking
[x- z(1 - a,) fo' X + z(1 - a,) ;,;]
a:1 = a:2 = aj2 (assume a-2 known). Hint: Reduce to a:1 +a:2 = a: by showing that if 0:1 + 0:2 < a:, there is a shorter interval with a:1 + a:2 = a:. Use calculus. 7. Suppose we want to select a sample size N such that the interval (4.4.1) based on n = N observations has length at most l for some preassigned length l = 2d Stein's (1945) two stage procedure is the following. Begin by taking a fixed number no and calculate
X0 = ( 1/n0) I:�" 1 X, and
.
> 2 of observations
s5 = (no - 1)-ti:�" 1 (X, - Xo)2•
Then take
N - n0 further observations, with N being the smallest integer greater than no
and greater than or equal to
2
[sot,._, ( 1 - i a) /d] . Show that, although
N is random, ../N(X - p)/so, with X = I:f 1 X
distribution. It follows that
[x - sotn,- 1 (1 - �a)/ .JN, X + Sotno- 1 (1 - )a)j .JN]
'
i' I
278
Testing and Confidence Regions
(1 - a
is a confidence interval with confidence coefficient
) N,
Chapter 4
for Jt of length at most
(The sticky point of this approach is that we have no control over
'
2d.
and, if (}' is large, we
may very likely be forced to take a prohibitively large number of observations. The reader interested in pursuing the study of sequential procedures such as this one is referred to the
1986, and the fundamental monograph of Wald, 1947.) + 1 x,. By Theorem 8.3.3, Sn , is + = k, X has a N(Jl, (}'2 jk) depends only on Sno • given
book of Wetherill and Glazebrook,
Hint:
Note that -
X (no/N)Xn, (1/NJL.':' N N(O, <72) ../N(X =
independent of Xno · Because distribution. Hence,
- !') has a
(a) Show that in Problem
8.
of length at most
2d wheri (}'2
4.4.6, in
n
n
N (1
-
distribution and is independent of sn ,·
order to have a level
�
a)
is known, it is necessary to take at least
observations.
Hint: (b)
confidence interval
z2 (
(c)
'
'
'
.
•
-
�a) (}'2 jd2
Set up an inequality for the length and solve for n.
What would
be the minimum sample
size in part (a) if
a
=
0.001, (}'2
0.05? '
1
Suppose that (}'2 is not known exactly, but we are sure that (}'2
z2 {I - �a) (J'r/d2 B(n, 0) X = Sjn.
< (J'f.
=
5, d
=
Show that
observations are necessary to achieve the aim of part (a).
n
>
9.
Let S �
(1
(a) Use (A. l4.18) to show that sin - 1 ( - a) confidence interval for sin- 1 ( VO).
and
v'X)±z (1 - 1a) / 2fo
= X = 0.1, X�, ... , Xn1 Yl l . . . , Yn2 be
(b) If n
is an approximate level
use the result in part (a) to compute an approximate level
100 and
0.95 confidence interval for ().
10. Let and N(TJ, .,-2) populations, respectively.
(a) If all parameters are unknown, find two quadruples are each sufficient.
(b) Exhibit
a level (1 -
a) confidence
2)
two independent samples from N(Jl, (}'
ML estimates of /-L. v, a2, r2 • interval for 12
/2 (}'
If
( TJ - f.') .
cr 2 , r2
are known, exhibit a fixed length level (1
�
and
Show that these
using a pivot based on the
a:)
confidence interval for
Such two sample problems arise In comparing the precision of two instruments and in
determining the effect of a treatment.
11. Show that the endpoints of the approximate level ( 1 - a) interval defined by
are indeed approximate level
Hint: 12.
,
., '
'
'
[O(X) < OJ = [fo(X- 0)/[0(1 - O)Jl < B(n, 0). 0 X± J3z ( 1 - l a) / 8.
Let S �
(a)
{1 - �a:) upper and lower bounds.
Show that
interval for
Suppose that it is known that
z
:S
(4.4.3)
(1 - 1a) ].
!.
4fo is an approximate level ( 1 -
i i
i
'
I
'
statistics of part (a). Indicate what tables you wdllld need to calculate the interval.
(c)
I
a) confidence
Section 4.10
279
Problems and Complements
(b) What sample size is needed to guarantee that this interval has length at most 0.02? 13. Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures 8(64, 8), give a 95% confidence interval for the true proportion of are observed. If S cures 8 using (a) (4.4.3), and (b) (4.4.7). "'
14. Suppose that 25 measurements on the breaking strength of a certain alloy yield X = 11.1 and ,<; = 3.4. Assuming that the sample is from a /V(p,, cr2) population, find (a) A level 0.9 confidence interval for Jl. (b) A level 0.9 confidence interval for cr.
(c) A level 0.9 confidence region for (Jl, o-). (d) A level 0.9 confidence interval for p, + cr. 15. Show that the confidence coefficient of the rectangle of Example 4.4.5 is ( 1 - � o:) 2 . Hint: Let t = tn�l (1 - !n), x 1 = Xn�I ( !a), and x 2 = Xn- 1 (1 - !o:), then
P[(JL, u2 ) E J.{X) x !,{X)] By the proof of Theorem B.3.3, (n - 1)s2/u2 X�- 1 = r (� (n - 1J, D and n(X p,)2 Icr2 xt = r ( � ' �) are independent. Now the result follows from Theorem 8.2.3. �
"'
16. In Example 4.4.2,
(a) Show that x(α₁) and x(1 − α₂) can be approximated by x(α₁) ≈ (n − 1) + √(2(n − 1)) z(α₁) and x(1 − α₂) ≈ (n − 1) + √(2(n − 1)) z(1 − α₂).
Hint: By B.3.1, V ∼ χ²_{n−1} can be written as a sum of squares of n − 1 independent N(0, 1) random variables, Σ_{i=1}^{n−1} Zᵢ². Now use the central limit theorem.
(b) Suppose that Xᵢ does not necessarily have a normal distribution, but assume that μ₄ = E(Xᵢ − μ)⁴ < ∞ and that κ = Var[((Xᵢ − μ)/σ)²] = (μ₄/σ⁴) − 1 is known. Find the limit of the distribution of n^{−1/2}{[(n − 1)s²/σ²] − n} and use this distribution to find an approximate 1 − α confidence interval for σ². (In practice, κ is replaced by its MOM estimate. See Problem 5.3.30.)
Hint: (n − 1)s² = Σ_{i=1}^{n} (Xᵢ − μ)² − n(X̄ − μ)². Now use the law of large numbers, Slutsky's theorem, and the central limit theorem as given in Appendix A.
(c) Suppose Xᵢ has a χ²_k distribution. Compute the (κ known) confidence intervals of part (b) when k = 1, 10, 100, and 10,000. Compare them to the approximate interval given in part (a). κ − 2 is known as the kurtosis coefficient. In the case where Xᵢ is normal, it equals 0. See A.11.11.
Hint: Use Problem B.2.4 and the fact that χ²_k = Γ(½k, ½).
17. Consider Example 4.4.6 with x fixed. That is, we want a level (1 − α) confidence interval for F(x). In this case nF̂(x) = #[Xᵢ ≤ x] has a binomial distribution and

√n[F̂(x) − F(x)]/√(F(x)[1 − F(x)])

is the approximate pivot given in Example 4.4.3 for deriving a confidence interval for θ = F(x).
(a) For 0 < a < b < 1, define

Aₙ(F) = sup{ √n|F̂(x) − F(x)|/√(F(x)[1 − F(x)]) : F⁻¹(a) ≤ x ≤ F⁻¹(b) }.

Typical choices of a and b are .05 and .95. Show that for F continuous

P_F(Aₙ(F) ≤ t) = P_U(Aₙ(U) ≤ t)

where U denotes the uniform, U(0, 1), distribution function. It follows that the binomial confidence intervals for θ in Example 4.4.3 can be turned into simultaneous confidence intervals for F(x) by replacing z(1 − ½α) by the value u_α determined by P_U(Aₙ(U) ≤ u_α) = 1 − α.
(b) For 0 < a < b < 1, define

Bₙ(F) = sup{ √n|F̂(x) − F(x)|/√(F̂(x)[1 − F̂(x)]) : F̂⁻¹(a) ≤ x ≤ F̂⁻¹(b) }.

Show that for F continuous, P_F(Bₙ(F) ≤ t) = P_U(Bₙ(U) ≤ t).
(c) For testing H₀ : F = F₀ with F₀ continuous, indicate how critical values u_α for Aₙ(F₀) and Bₙ(F₀) can be obtained using the Monte Carlo method of Section 4.1.

18. Suppose X₁, …, Xₙ are i.i.d. as X ∼ F and that F has density f(t) = F′(t). Assume that f(t) > 0 iff t ∈ (a, b) for some −∞ ≤ a < b ≤ ∞.
(a) Show that μ = E(X) = −∫_{−∞}^{0} F(x)dx + ∫_{0}^{∞} [1 − F(x)]dx.
(b) Using Example 4.1.6, find a level (1 − α) confidence interval for μ.

19. In Example 4.4.7, verify the lower boundary μ̲ given by (4.4.9) and the upper boundary μ̄ = ∞.
Problems for Section 4.5

1. Let X₁, …, X_{n₁} and Y₁, …, Y_{n₂} be independent exponential E(θ) and E(λ) samples, respectively, and let Δ = θ/λ.
(a) If f(α) denotes the αth quantile of the F_{2n₁,2n₂} distribution, show that [Ȳ f(½α)/X̄, Ȳ f(1 − ½α)/X̄] is a confidence interval for Δ with confidence coefficient 1 − α.
Hint: Use the results of Problems B.3.4 and B.3.5.
(b) Show that the test with acceptance region [f(½α) ≤ X̄/Ȳ ≤ f(1 − ½α)] has size α for testing H : Δ = 1 versus K : Δ ≠ 1.
(c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant. Experience has shown that the exponential assumption is warranted. Give a 90% confidence interval for the ratio Δ of mean life times.
x
y
3 1 5040 34 32 37 34 2 31 6 5 14 150 27 4 6 27 10 30 37 8 26 10 8 29 20 10
Is H : Δ = 1 rejected at level α = 0.10?
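A short sketch of the interval in Problem 1(a), written in Python (an addition to the text). The sample values below are placeholders only, not the air-monitor data, whose split between the x and y rows is not reproduced here.

```python
from scipy.stats import f

def ratio_ci(x, y, alpha=0.10):
    """Confidence interval for Delta = theta/lambda based on scaling
    Y-bar/X-bar by quantiles of the F distribution with (2*n1, 2*n2) d.f."""
    n1, n2 = len(x), len(y)
    xbar, ybar = sum(x) / n1, sum(y) / n2
    lo = (ybar / xbar) * f.ppf(alpha / 2, 2 * n1, 2 * n2)
    hi = (ybar / xbar) * f.ppf(1 - alpha / 2, 2 * n1, 2 * n2)
    return lo, hi

# placeholder samples, only to show the call
x = [3.0, 150.0, 40.0, 34.0, 32.0]
y = [27.0, 10.0, 30.0, 37.0, 8.0]
print(ratio_ci(x, y))
```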
2. Show that if θ̄(X) is a level (1 − α) UCB for θ, then the test that accepts, if and only if θ̄(X) ≥ θ₀, is of level α for testing H : θ ≥ θ₀.
Hint: If θ ≥ θ₀, [θ̄(X) < θ] ⊃ [θ̄(X) < θ₀].

3. (a) Deduce from Problem 4.5.2 that the tests of H : σ² = σ₀² based on the level (1 − α) UCBs of Example 4.4.2 are level α for H : σ² ≥ σ₀².
(b) Give explicitly the power function of the test of part (a) in terms of the χ²_{n−1} distribution function.
(c) Suppose that n = 16, α = 0.05, σ₀² = 1. How small must an alternative σ² be before the size α test given in part (a) has power 0.90?

4. (a) Find c such that δ_c of Problem 4.1.1 has size α for H : θ ≤ θ₀.
(b) Derive the level (1 − α) LCB corresponding to δ_c of part (a).
(c) Similarly derive the level (1 − α) UCB for this problem and exhibit the confidence intervals obtained by putting two such bounds of level (1 − α₁) and (1 − α₂) together.
(d) Show that [Mₙ, Mₙ/α^{1/n}] is the shortest such confidence interval.

5. Let X₁, X₂ be independent N(θ₁, σ²), N(θ₂, σ²), respectively, and consider the problem of testing H : θ₁ = θ₂ = 0 versus K : θ₁² + θ₂² > 0 when σ² is known.
(a) Let δ_c(X₁, X₂) = 1 if and only if X₁² + X₂² > c. What value of c gives size α?
(b) Using Problems B.3.12 and B.3.13 show that the power β(θ₁, θ₂) is an increasing function of θ₁² + θ₂².
(c) Modify the test of part (a) to obtain a procedure that is level α for H : θ₁ = θ₁⁰, θ₂ = θ₂⁰ and exhibit the corresponding family of confidence circles for (θ₁, θ₂).
Hint: (c) X₁ − θ₁⁰, X₂ − θ₂⁰ are independent N(θ₁ − θ₁⁰, σ²), N(θ₂ − θ₂⁰, σ²), respectively.

6. Let X₁, …, Xₙ be a sample from a population with density f(t − θ) where θ and f are unknown, but f(t) = f(−t) for all t, and f is continuous and positive. Thus, we have a location parameter family.
(a) Show that testing H : θ ≤ 0 versus K : θ > 0 is equivalent to testing H′ : P[X₁ > 0] ≤ ½ versus K′ : P[X₁ > 0] > ½.
(b) The sign test of H versus K is given by

δ_k = 1 if Σ_{i=1}^{n} 1[Xᵢ > 0] > k, and δ_k = 0 otherwise.

Determine the smallest value k = k(α) such that δ_{k(α)} is level α for H and show that for n large, k(α) ≈ ½n + ½z(1 − α)√n. A numerical sketch of this comparison is given after this problem.
(c) Show that δ_{k(α)}(X₁ − θ₀, …, Xₙ − θ₀) is a level α test of H : θ ≤ θ₀ versus K : θ > θ₀.
(d) Deduce that X_{(n−k(α)+1)} (where X_{(j)} is the jth order statistic of the sample) is a level (1 − α) LCB for θ whatever be f satisfying our conditions.
(e) Show directly that P_θ[X_{(j)} ≤ θ] and P_θ[X_{(i)} ≤ θ ≤ X_{(j)}] do not depend on f or θ.
(f) Suppose that α = 2^{−(n−1)} Σ_{j=0}^{k−1} (n choose j). Show that P[X_{(k)} ≤ θ ≤ X_{(n−k+1)}] = 1 − α.
(g) Suppose that we drop the assumption that f(t) = f(−t) for all t and replace θ by ν = the median of F. Show that the conclusions of (a)–(f) still hold.
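For part (b) of Problem 6, a small sketch comparing the exact binomial choice of k(α) with the normal approximation ½n + ½z(1 − α)√n. Python, and the choices n = 25 and α = 0.05, are illustrative assumptions added here.

```python
from math import sqrt
from scipy.stats import binom, norm

def k_alpha(n, alpha):
    """Smallest k with P(S > k) <= alpha when S ~ B(n, 1/2),
    i.e., the critical value k(alpha) of the sign test."""
    for k in range(n + 1):
        if binom.sf(k, n, 0.5) <= alpha:   # sf(k) = P(S > k)
            return k
    return n

n, alpha = 25, 0.05
print("exact k(alpha):", k_alpha(n, alpha))
print("normal approx :", 0.5 * n + 0.5 * norm.ppf(1 - alpha) * sqrt(n))
```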
7. Suppose θ = (η, τ) where η is a parameter of interest and τ is a nuisance parameter. We are given for each possible value η₀ of η a level α test δ(X, η₀) of the composite hypothesis H : η = η₀. Let C(X) = {η : δ(X, η) = 0}.
(a) Show that C(X) is a level (1 − α) confidence region for the parameter η and conversely that any level (1 − α) confidence region for η is equivalent to a family of level α tests of these composite hypotheses.
(b) Find the family of tests corresponding to the level (1 − α) confidence interval for μ of Example 4.4.1 when σ² is unknown.
8. Suppose X, Y are independent and X ∼ N(ν, 1), Y ∼ N(η, 1). Let ρ = ν/η, θ = (ρ, η). Define

δ(X, Y, ρ) = 0 if |X − ρY| ≤ (1 + ρ²)^{1/2} z(1 − ½α), and δ(X, Y, ρ) = 1 otherwise.
9. Let x � N(8, 1) and q(8)
= 82
(a) Show that the lower confidence bound for q( B) obtained from the image under q of the ray (X - z(1 - <>) , oo) is q(X)
=
(X - z(1 - <>))2 if X > z(1 - <>) 0 if X < z(l - <>).
(b) Show that Po(�(X) < 82] and, hence. that sup8 P6 [q(X) < 82]
-
1 - a if 8 > 0 >l>(z(1 - a) - 28) if 8 < 0
= 1.
10. Let a (S, 8o) denote the p-value of the test of H : 8 = 8o versus K : 8 > 8o in Example 4.1.3 and let (8(S), 8( S)] be the exact level (1 - 2u) confidence interval for 8 of Example 4.5.2. Show that as 8 ranges from 8(S) to 8(S), n(S, 8) ranges from " to a value no smaller than 1 - u. Thus, if 8o < O(S) (S is inconsistent with H : 8 = 8o), the quantity � = B(S) - 00 indicates how far we have to go from 80 before the value S is not at all surprising under H . 11. Establish (iii) and (iv) of Example 4.5.2.
12. Let 1J denote a parameter of interest, let T denote a nuisance parameter, and let (} = (ry, -r ) . Then the level (1 - <>) confidence interval [ry(x), ij(x)] for ry is said to be unbiased confidence interval if
Pe[ry( X) < ry' < ry(X)] < 1 - " for all ry' ,P ry, all II.
That is, the interval is unbiased if it has larger probability of covering the true value 1J than the wrong value 17'. Show that the Student t interval (4.5.1) is unbiased. Hint: You may use the result of Problem 4.5.7.
284
Testing
and Confidence Regions
Chapter 4
Confidence Regionsfor Qua11tiles. Let X 1, . . . . Xn be a sample from a population with 1 continuous distribution F. Let Xp = 1 [F-1 (p) + Fij (P)], 0 < p < I, be the pth quantile of F. (See Section 3.5.) Suppose that p is specified. Thus, 100x.95 could be the 95th percentile of the salaries in a certain profession1 or lOOx o5 could be the fifth percentile of 13.
.
the duration time for a certain disease.
''
: Xp < 0 versus I< : Xp > 0 is equivalent to testing H' : P( X > 0) < (1 - p) versus K' : P( X 2: 0) > (I - p). (b) The quantile sign test Ok of H versus K has critical region {x : L.:� 1 [Xi > OJ > k ) . Determine the smallest value k = k(a) such that Jk(o) has level a for H 1and show that for n large, k(a) - h(a), where (a) Show that testing H
h (a) C': n(l - p) + Z1 -u )np(! - p).
'
I
'
(c) Let x• be a specified number with 0 < F(x') < I. Show that J•(X1 -x•, . . . , Xn x*) is a level a test for testing H : Xp < x* versus K : xP > x*.
.. . .
(d) Deduce that X(n-k(u)+1 ) (Xc;) is the jth order statistic (1 - a) LCB for Xp whatever be f satisfying our conditions.
of the sample) is a level
(e) Let S denote a B(n,p) variable and choose k and l such that I - a = P(k < S < n - l + I) = 2:7 k+ 1 pi(! -p)"-i. Show that P(X(k) < Xp < X(n-1)) = 1 - <>. That is, (X(k)' X(n-1)) is a level (1 - a) confidence interval for Xp whatever be F satisfying our
conditions. That is, it is distribution free. (f) Show that
'
(g) Let F(x)
•
l .
h (�a) and h (I - �a) where
denote the empirical distribution. Show that the interval in parts (e) and
!
k
and l in part (e) can be approximated by
(0 can be derived from the pivot
T(xp) Hint:
.
!
h(a) is given in part (b). -
l
'
Note that F(xp)
•
' '
' ' '
•
=
vn[F(xp) - F(xp)J . jF(xp) [l - F(xp)]
= p. Construct the interval using F-1 and Fu 1•
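The distribution-free interval of part (e) of Problem 13 can be computed directly from binomial probabilities. The following Python sketch is an addition to the text; the equal-tail choice of the two order-statistic indices and the simulated exponential sample are illustrative assumptions, not the book's notation.

```python
import numpy as np
from scipy.stats import binom

def quantile_ci(x, p, alpha=0.05):
    """Distribution-free CI for the p-th quantile using order statistics.
    With S ~ B(n, p), the coverage of [X_(k), X_(m)] is P(k <= S <= m - 1);
    here k and m are chosen so that each binomial tail is at most alpha/2."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    k = max(int(binom.ppf(alpha / 2, n, p)), 1)            # lower order-statistic index
    m = min(int(binom.ppf(1 - alpha / 2, n, p)) + 1, n)    # upper order-statistic index
    coverage = binom.cdf(m - 1, n, p) - binom.cdf(k - 1, n, p)
    return x[k - 1], x[m - 1], coverage

rng = np.random.default_rng(0)
sample = rng.exponential(size=50)
low, high, coverage = quantile_ci(sample, p=0.5, alpha=0.05)
print(low, high, coverage)   # interval endpoints and exact coverage probability
```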
14. Simultaneous Confidence Regions for Quantiles.
In Problem
13
preceding we gave a
Xp for p fixed. Suppose we want 0 < p < 1. We can proceed as follows. Let F, F-(x), and F+ (x) be as in Examples 4.4.6 and 4.4.7. Then
disstribution-free confidence interval for the pth quantile a distribution-free confidence region for Xp valid for all -
P(F -(x) < F(x) < F'+(x)) for all x E (a, b) = 1 - a. (a) Show that this statement is equivalent to
P(xp < Xp < Xp for all p E (0, 1))
Section
4.10
Problems and Complements
285
where x, � sup{.r : a < .r < b, F f (>·) < p} and .r" � inf{.r : a < x < b, fr - (x ) > p}. That is, the desired confidence region is the band consisting of the collection of intervals {[;z;,, x,] : o < p < 1 } .
(b) Express x11 and :rp in terms of the critical value of the Kolmogorov statistic and the order statistics. (c) Show how the statistic An(F) of Problem 4.1.17(a) and (c) can be used to give
another distribution-free simultaneous confidence band for xp. Express the band in terms of critical values for An(F) and the order statistics. Note the similarity to the interval in Problem 4.4. 13(g) preceding. 15.
Suppose X denotes the difference between responses after a subject has been given treatments A and B, where A is a placebo. Suppose that X has the continuous distribution F. We will write Fx for F when we need to distinguish it from the distribution F-x of -X. The hypothesis that A and B are equally effective can be expressed as H : F�x (t) = Fx (t) for all t E R. Tbe altemative is that F.x(t) ! F(t) for some t E R. Let Fx and F-X be the empirical distributions based on the i.i.d. X1 , . . . , Xn and -X1 , . . . , -Xn. �
�
(a) Consider the test statistic D(Fx , F x)
�
max{ [Fx(t) - F_ x(t)[ : t E R}. �
Show that if Fx is continuous and H holds, then D(Fx, F_x) has the same distribution as D(Fu, F1-u ) where Fu and F1-u are the empirical distributions of U and 1 - U with �
�
�
�
......
,
U
F(X) �� U(O, 1). n Hint: nFx (x) = 2:,� 1 l[Fx (X,) < Fx (x)]
�
n F_x(x)
=
n
L 1 [-X, < x] =
i==l nF1 - u{F-x (x))
�
�
nFu(F(x)) and
n L 1 [F_x( -Xi) < F_ x(x) ] i=l
�
nFr - u(F(x)) under H.
See also Example 4. 1.5. (b) Suppose we measure the difference between the effects of A and B by � the dif ference between the quantiles of X and -X, that is, vp(p) = � [xP + x1_ PJ, where p = F(x). Give a distribntion-free level (1 - <>) simultaneous confidence band for the curve {vF{p) : 0 < p < 1} . Hint: Let t.(x) = F � (Fx(x)) - x, then nF-x (x + l:. (x))
=
n L 1 [-X, < x + l:.(x)] i=l n L l [F_x ( -X,) < Fx (x)] i=l
=
nFr-u (Fx(x)) .
'
I'
''
'I
'
'•
286
Testing and Confidence Regions
· -
' '
'
'
Chapter 4
�
Moreover,
nFx (x)
2..:� 1 1 [Fx (Xi)
�
Fx(,·)] = nFu(Fx(x)). It follows that if --C, then D(Fx . F�x.") = D(Fu,F1_u), and by <
�
�
F- x(': + t.(x)), solving D( Fx, F- x,t::. ) < do. for �. where da. is the nth quantile of the distribution of D(Fu) F1 _u ) we get a distribution-free level (1 - a) simultaneous confidence band for t.(x) = F_J:(Fx(x)) - x = -2vF (F(x)). Properties of this and other bands are given we set �
F�x."(x) = �
�
�
,
by Doksum, Fenstad and Aaberge ( 1 977).
(c) A Distdbution and Parameter-Free Confidence Interval. Let 8(.) : :F
--t
R, where
:F is the class of distribution functions with finite support, be a location parameter as defined
in Problem
3.5.17. Let
VF =
vt = sup vt (P) inf vF(p), O
[vF (p), lit (p)] is the band in part (b). Show that for given F E F, the probability is (1 - a) that the interval [vF , DtJ contains the location set Lp = {B(F) : B(·) is a location parameter} of all location parameter values at F. Hint: Define H by H-1(p) = �[Fx 1 (p) - Fx 1 (1 - p) J = � [Fx 1 (p) + F_J: (v)]. Then H is symmetric about zero. Also note that 1 x = H-1 (F(x)) - 2 t.(x) = H-1 (F(x)) + vp(F(x)). It follows that X is stochastically between Xs-v F and Xs +vp where Xs = H- 1 (F(X)) has the symmetric distribution H. The result now follows from the properties of B(· ). 16. As in Example 1.1.3, let X1 , . Xn be i.i.d. treatment A (placebo) responses and let Y1 , . . . , Yn be i.i.d. treatment B responses. We assume that the X's and Y's are in� dependent and that they have respective continuous distributions Fx and Fy. To test the hypothesis H that the two treatments are equally effective, we test H : Fx(t) = Fy (t) for all t versus K : Fx (t) of Fy (t) for some t E R. Let Fx and Fy denote the X and Y where
. .
1 '
'
'
1
�
�
empirical distributions and consider the test statistic
'
II
I I
D(Fx , Fy ) = max [Fy (t) - Fx(t) [ . tER
(a) Show that if H holds, -
then
-..
""
-..
""
D(Fx , Fy) has the same distribution as D(Fu , Fv ),
Fu and Fv are independent U(O,l) empirical distributions. Hint: nFx(t) = 2..:� 1[Fx{X;) < Fx (t)] = nFu(Fx(t)); nFy(t) = 1 ) 1 < F v (xp)] = nFv (Fx (t)) under H . 2..:� 1 [Fy {Y; (b) Consider the parameter bp( Fx Fy) = Yp - xp. where Xp and Yp are the pth quan� tiles of Fx and Fy. Give a distribution-free level (1 - a) simultaneous confidence band - [&; ,&;i : 0 < p < !] for the curve {&p(Fx, Fv) : 0 < p < 1 } . Hint: Let t.(x) = Fy 1 (Fx (x)) - x, then
where
1
n n nFy(x + t.(x)) = I; I [Y; < Fy 1 (Fx(x))] = I; 1[Fy {Yi) < Fx(x)] = nFv(Fx(x)). i=l
i=l
l i t
'
I
'
Section 4.10
287
Problems and Complements
Moreover. nFx (":)
=
Fu (Fx (x)) . It follows that if we set F1� ,(x)
=
Fv (x + t.(x)), " " -� � £ then D (Fx , F;,, ) = D(Fu, Fv ) . Let d0 denote a size a; critical value for D (Fu , Fv ), � then by solving D(Fx , FY, .6.) < do: for Ll, we find a distribution-free level (1 a: ) simul1 confidence band for t.(. T ,) taneous = Fx 1 (P) - Fy (p) = a,(Fx, Fy ) . Properties of n
•
-
this and other bands are given by Doksum and Sievers ( 1 976).
(c) A parameter 0 = 0 ( ) ; :F x :F ---7 R, where :F is the class of distributions with finite support, is called a shift parameter if O(Fx, Fx +a) = 8(Fx Fx ) = a and · ,
8t
·
-a,
8t
Y1 > Y, X1 > X
o>
O(Fx , Fy) < 8(Fx, Fv, ) , O (Fx , Fy ) > O(Fx, , Fy ) .
Let 6 = mino
(d) Show that E(Y) - E(X), a,(·, ·), 0 < p < 1, 6, and 8 are shift parameters.
(e) A Distribution and Parameter-Free Confidence Interval. Letb" - = min o
-
-
-
�
�
Problems for Section 4.6
1. Suppose X1 , . . . , Xn is a sample from a r (p, ! ) distribution, where p is known and 0 is unknown. Exhibit the UMA level (1 - a;) UCB for 0.
2. (a) Consider the model of Problem 4.4.2. Show that n
n
i=l
i=l
e· = (2 2 �A X; )/ L tt
-
2z(1
n
) [L tt l - �
- a; .,-
i=l
is a uniformly most accurate lower confidence bound for 8. (b) Consider the unbiased estimate of e, T = (2 2:7 1 X;) I 2:7 1 if. Show that n
n
e = (2 z= x,J; z= ti - 2uvnz(1 i=l i=l
-
n
n)/ z= ti i=l
is also a level (1 - a:) confidence bound for O. (c) Show that the statement that 0* is more accurate than 0 is equivalent to the assertion that S = (2 E� 1 t; Xi)/ E� 1 t[ has uniformly smaller variance than T. Hint: Both 0 and 0* are normally distributed. 3. Show that for the model of Problem 4.3.4, if I' = 1 />., then
a uniformly most accurate level 1
-
a:
UCB for J.L.
I' = 2 L� 1 Xf!x,n(<>) is
288
Testing and Confidence Regions
Chapter 4
- o UpPer and lower confidence bounds J1 in the model of Problem 4.3.6 for c fixed, n = 1. 4.
Construct unifom1ly most accurate level 1
for
5. (1
Establish the following result due to Pratt
( 1961 ) . Suppose [8', 8'], [8, 8] are two level
- a) confidence intervals such that
Po[B' < 8' < 8'] < Pe[B < 8' < OJ
for al!
B oJ 8. '
(8, 8), (8', 8') have joint densities, then Eo(8' - 8') :::; Eo (8 - 8). = = Hint: Eo(B - 8) 1 = ]00= du p(s, t)dsdt = J = Po [B < u < B]du, where
Show that if
(1: )
=
p(s , t) is the joint density of ( 8, 8).
F, G corresponding to densities /, g, respec 1 I so are well defined and that p- ,G tively, satisfying the conditions of Problem 8.2.12 strictly increasing. Show that if F(x) < G(x) for all x and E(U),E(V) are finite, then E(U ) > E(V). Hint: By Problem B.2.12(b), E(U) = f; p-1 (t)dt. 6.
' ' '
7.
1
Let
U, V
be random variables with d.f.'s
Suppose that ll_' is a uniformly most accurate level ( 1 -
- a. Prove Corollary 4.6.1.
Hint:
Apply Problem 4.6.6 to
a) LCB such that ?6[8' < 8]
=
V = (8 - 8')+, U = (8 - B)+.
j,
i
J '
' •
• • •
l '
' •
8. In Example 4.6.2, establish that the UMP test has acceptance region (4.6.3). Hint: Use Examples 4.3.3 and 4.4.4.
j '
l •
'
�· ;
•
Problems for Section 4.7
•
1. (a) Show that if 8 has a beta, /3( r, s). distribution with .>. = s6/r(l - 6) has the F distribution F2r,2, .
Hint:
r
and
s positive integers, then
8,
r, s) distribution with r and s integers. Show how the quantiles of the F distribution 0.
2. Suppose that given ..\ = ...\., X1, , Xn are i.i.d. Poisson, P(>.) and that ..\ is distributed as V/s0, where so is some constant and V Let T = L:� 1 Xi· .
•
.
"'
(a) Show that (.X J T
m = k + 2t.
x2 distribution
upper and lower credible bounds for ...\..
3.
Suppose that given () =
Pareto,
Pa(c, s) , density
x�-
= t) is distributed as W/ s, where s = so+ 2n and W � x;;, with
(b) Show how quantiles of the
8,
can be used to determine level
(1 - a)
X1, . . . , Xn are i.i.d. uniform, U(O,B), and that () has the
1r(t) = sc' ft'-1, t
> c, s > 0, c > 0.
'
' •
•
X has a binomial, 13(n,8), distribution and that (} has
can be used to find upper and lower credible bounds for ...\. and for
•
'
See Sections 8.2 and 8.3.
(b) Suppose that given () = beta, /3(
: J
•
Section 4.10
Problems
289
and Complements
ma.x{X1, . . . , Xn)· Show that {8 1 M = m) � Pa(c' , s') with c' = (a) Let M max{c, m} and s' = s + n. =
(b) Find level {1 - u ) upper and lower credible bounds for B.
(c) Give a level ( 1 - a: ) confidence interval for B.
(d) Compare the level { 1 - u) upper and lower credible bounds for B to the level { 1 - a) upper and lower confidence bounds for B. In particular consider the credible bounds as n --+ oo.
4. Suppose that given 8 = (tt1 , J-1.2 , r ) = (111, /.l2, r), Xr , . . . , Xm. and Yr , . . . , Yn are two independent N(f.li , r) and N(f.l2, r) samples, respectively. Suppose 8 has the improper prior 1r{B) = ljr, T > 0.
(a) Let so = E(x; proportional to
- x)2 + E(yj
-
y )2 Show formally thatthe posterior ?r(B •
I x,y) is
1r(r I so)?r(/11 I r,x)?r(/12 1 r,y) where 1r{r I so) is the density of so/V with V � Xm+n -2 , ?r(/11 I r,x) is a N(x, rjm) density and 1r{112 I r, y) is a N(y, r/n) density. Hint: p(B I x, y) is proportional to p( 8)p(x I /11 , r)p (y I /12 , r). (b) Show that given T, 1'1 and /12 are independent in the posterior distribution p(B x, y) and that the joint density of .1. = J.ll /12 and r is �
1r {Ll.,r I x,y) = 1r(r I so)?r(Ll. l x - y , r) where 1r{ Ll. I x - y,
is (Student) t with m + n - 2 degrees of freedom. Hint: 1r{Ll. I x, y) is obtained by integrating out r in 1r{Ll., r I x, y). (d) Use part (c) to give level {1-a) credible bounds and a level (1-a) credible interval
for Ll..
Problems for Section 4.8 1. Let Xr, . . . , Xn+I be i.i.d. as X N(J-.l, a5), where u5 is known. Here X1 , observable and Xn+I is to be predicted. ,..,_,
(a) Give a level (1
� a:
) prediction interval for Xn-t l ·
• . •
, Xn is
290
Testing and Confidence Regions
Chapter 4
•
•
(
(b) Compare the interval in part (a) to the Bayesian prediction interval 4.8.3) by doing a frequentist computation of the probability of coverage. That is, suppose X1, . . . , Xn are i.i.d. N(Jl, 0'5). Take "5 � r2 I, .05. Then the level of the lOll, �o 10, and frequentist interval is 95%. Find the probability that the Bayesian interval covers the true mean J.t for M = 5, 8, 9, 9.5, 10, 10.5, 1 1 , 12, 15. Present the results in a table and a graph.
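A simulation sketch for this coverage computation. Everything below is an assumption-laden addition: Python is used, the Bayesian interval is taken to be the usual conjugate-normal predictive interval (centered at the posterior mean of μ, with variance equal to the posterior variance plus σ₀²), which may differ in detail from (4.8.3), and the parameter values σ₀² = τ² = 1, η₀ = 10, n = 100, α = .05 are the ones suggested by the problem as read here.

```python
import numpy as np
from scipy.stats import norm

def bayes_coverage(mu, n=100, sig0=1.0, tau=1.0, eta0=10.0, alpha=0.05, B=100_000, seed=0):
    """Frequentist coverage of the conjugate-normal Bayesian prediction interval
    for X_{n+1} when the data are really i.i.d. N(mu, sig0^2)."""
    rng = np.random.default_rng(seed)
    z = norm.ppf(1 - alpha / 2)
    xbar = rng.normal(mu, sig0 / np.sqrt(n), size=B)   # simulated sample means
    xnew = rng.normal(mu, sig0, size=B)                # simulated future observations
    w = tau**2 / (tau**2 + sig0**2 / n)                # posterior weight on xbar
    post_var = 1.0 / (1.0 / tau**2 + n / sig0**2)      # posterior variance of mu
    center = w * xbar + (1 - w) * eta0
    half = z * np.sqrt(post_var + sig0**2)             # predictive standard deviation
    return np.mean(np.abs(xnew - center) <= half)

for mu in [5, 8, 9, 9.5, 10, 10.5, 11, 12, 15]:
    print(mu, bayes_coverage(mu))
```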
F,
where X1, . . . , Xn are observable and Xn+l is to , Xn+ l be i.i.d. as X be predicted. A level (1 - a: ) lower (upper) prediction bound on Y = Xn+l is defined to be a function Y(Y) of X1, . . . , Xn such that P(Y < Y) > 1 - a (P(Y < Y) > I - a). 2. Let X1,
•
.
,.....,
•
(a) If F is N(Jl, a5) with "5 known, give level (I - a) lower and upper prediction bounds for Xn+l· (b) If F is N(p,, a2) with a2 unknown, give level (1 bounds for Xn+l ·
-
a:) lower and upper prediction
(c) IfF is continuous with a positive density f on (a, b), -oo < a < b < oo, give level (I - a) distribution free lower and upper prediction bounds for Xn+ I ·
3. Suppose X1 , . . . , Xn+l are i.i.d. as X where X has the exponential distribution
F(x I 0) � I - e-x/O,
.. •
x
> 0, 0 > 0.
B(n, 0),
X is a binomial, random variable, and that (} 4. Suppose that given (} � has a beta, {J(r, s), distribution. Suppose that Y, which is not observable, has a distribution given (J = 8. Show that the conditional (predictive) distribution of Y given
B(m, 0)
X = x is
i
'
'
'
i '
Suppose X1 , . . . , Xn are observable and we want to predict Xn+t· Give a level (1 - a) prediction interval for Xn+l · Hint: XdB has a � distribution and nXn+t! E� 1 xi has an F2,2n distribution.
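A minimal sketch of the interval suggested by this hint: since nX_{n+1}/ΣXᵢ is an F_{2,2n} pivot, a level (1 − α) prediction interval is [X̄ f_{2,2n}(½α), X̄ f_{2,2n}(1 − ½α)]. The Python code and the simulated sample below are illustrative additions to the text.

```python
import numpy as np
from scipy.stats import f

def exp_prediction_interval(x, alpha=0.05):
    """Level (1 - alpha) prediction interval for a future exponential observation,
    based on the pivot n * X_{n+1} / sum(X_i) ~ F(2, 2n)."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    return xbar * f.ppf(alpha / 2, 2, 2 * n), xbar * f.ppf(1 - alpha / 2, 2, 2 * n)

rng = np.random.default_rng(1)
print(exp_prediction_interval(rng.exponential(scale=2.0, size=30)))
```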
0,
i'
! '
'
1 l
i
•
q(y l x) � ( ; ) B(r +x+y, s + n - x + m - y)/B(r +x, s + n - x) where B( ·, ·) denotes the beta function. (This q(y I x) is sometimes called the P61ya distribution.) Hint: First show that
J '
i
'
•
q(y I x) � Jp(y I 0)1r(O I x)dO. 5. In Example 4.8.2, let U( I) < · · - < u
.
•
,
u C! ), . . . , U(n+l) .
Problems for Section 4.9 I
'.
B(n, 8), distribution. Show that the likelihood ratio statistic for 0 i � is equivalent I2X - nl.
1. Let X have a binomial, testing H : � versus K :
0�
to
,
Section 4.10
291
Problems and Complements
Hint: Show -'(n - x).
that for
x < �n, A(x)
is an increasing function of
-(2x - n) and A(x)
=
Jn Problems 2-4, let X1 ) . . . , Xn be a N(f-L, a2) sample with both J-L and a2 unknown. 2.
In testing
H : JL < Jlo
versus K
:
J-L
> Jlo
show that the one-sided, one-sample t test is
the likelihood ratio test (for o: < �). Hint: Note that /io � X if X < Jl.o and � Jl.o otherwise. Thus, log A(x) and � (nl2) log(! + r;l(n - !)) for Tn > 0, where Tn is the statistic.
t
We want to test
3. One-Sided Testsfor Scale.
�
0, ifTn < 0
H : a2 < a� versus K : a2 > a5. Show that
(a) Likelihood ratio tests are of the form: Reject if, and only if,
Hint: log -'( x) � 0,
if ii2 lv-5
<
'
1 and � (nl2)[ii2 Io-5
(b) To obtain size o: for H we should take c = Xn-I ( 1 Hint: Recall Theorem B.3.3.
-
-
I
-
log(ii2Io-5)1 otherwise.
o: .
)
(c) These tests coincide with the testS obtained by inverting the family of level
lower confidence bounds for
q2 .
4. Two-Sided Testsfor Scale.
We want to test
.
H:a
=
{1
-
o:
}
ao versus K : a i= a0.
(a) Show that the size o: likelihood ratio test accepts if, and only if,
n
CJ < (i) (ii)
F(cz) - F(c1)
I 2 '"' (Xi - X)
L..,. o CJ i=I
�
-
,
< Cz where c1
and
c2
satisfy,
1 - a, where F is the d.f. of the X�-1 distribution.
Ct - cz = n logct/cz.
(b) Use the
normal approximatioh to check that
C1n
Czn
V2nz(1 - �a) n + V2nz(1 - �a) n-
approximately satisfy (i) and also (ii) in the sense that the ratio
C1n - CZn --,=--=.:;-"--n log cl n / C z n
---�'
1 as n
---�'
oo .
(c) Deduce that the critical values of the commonly used equal-tailed test, Xn-l ( � o:)
Xn-1 (1 -
J a) also approximately satisfy (i) and (ii) of part (a).
,
i
292
Testing and Confidence Regions
Chapter 4
The following blood pressures were obtained in a sample of size n = 5 from a certain population: 1 24, 1 10, 1 14, 100, 190. Assume the one-sample normal model. 5.
(a) Using the size a: = 0.05 one-sample t test, can we conclude that the mean blood pressure in the population is significantly larger than l 00? (b) Compute a level 0.95 confidence interval for a2 corresponding to inversion of the equal-tailed tests of Problem 4.9.4.
! l
l
,
I
•
(c) Compute a level 0.90 confidence interval for the mean bloC
6. Let X1, . . . , Xn1 and Y1 , . . . , Yn2 be two independent N(J.L t ,
�
(I'!, 1'2 , a2 ) is (X, y, 0'2 ), where a' is as defined in
(b) Consider the problem of testing H : J..L I < J-L2 versus K : ;..t 1 > J1. 2 · Assume a < � Show that the likelihood ratio statistic is equivalent to the two-sample t statistic T.
(c) Using the normal approximation
and (1' 1 - 1'2 )/a � l7. The following data are from an experiment to study the relationship between forage pro duction in the spring and mulch left on the ground the previous fall. The control measure ments (x's) correspond to 0 pounds of mulch per acre, whereas the treatment measurements (y's) correspond to 500 pounds of mulch per acre. Forage production is also measured in pounds per acre. X
y
794 2012
1800 2477
576 3498
41 1 2092
897 1808
.
•
'
I I
•
I .I
l
I
I
'
,
l
•
I
'
!
I
'
I
'
Assume the two-sample normal model with equal variances. (a) Find a level 0.95 confidence interval for p.z - I'!·
•
•
•
(b) Can we conclude that leaving the indicated amount of mulch on the ground signifi cantly improves forage production? Use ct = 0.05.
'
•
(c) Find a level 0.90 confidence interval for a by using the pivot s2 ja 2•
8. Suppose X has density p(x, 8), 8 E 6, and that A(X, 6o, e,) depends on X only through T.
T is sufficient for 8. Show that
9. The nonnally distributed random variables X1, . . . , Xn are said to be serially correlated or to follow an autoregressive model if we can write
Xi = (}Xi-I + €i, i = 1 , . . , n, .
where X0
=
0 and £1, . . . , €n are independent
•
N(O, a2) random variables.
, I
'
•
293
Problems and Complements
Section 4.10
(a) Show that the density of X = (X1, . . . , Xn) is p(x, 8)
=
2 (21r
for -oo < Xi < oo, i = 1, . . . , n,
i= 1
xo = 0.
- 8x,_1)2}
(b) Show that the likelihood ratio statistic of = 0 (independence) versus K 0 (serial correlation) is equivalent to -(L� 2 xixi_ 1 )2 1 L:� 1 x;.
H:0
:0
i:-
a �
10. (An example due to C. Stein). Consider the following model. Fix 0 < < and a/[2(1 - ")] < c < Let 8 consist of the point -I and the interval [0, 1[ . Define the
a.
frequency functions p(x , 8) by the following table. X
8
-2
-I
0
I
2"
- - Q
Q
l2 - a
la 2
Be
(;::�) (� - a)
(;::�) ( i - a)
( 1 - B)c
'
-1 i -I
1 2
) ( 1
'
'
a
Q
(a) What is the size a likelihood ratio test for testing
H
:
2
B = - 1 versus K : B 7'= -1?
a
(b) Show that the test that rejects if, and only if, X = 0, has level and is strictly more powerful whatever be B. 11. The power functions of one- and two-sided t tests. Suppose that T has a noncentral t, 7k,6• distribution. Show that, (a) P, [T >
t] is an increasing function of J. (b) P, [IT[ > t] is an increasing function of ]J].
Hint: Let Z and V be independent and have N(O, 1), xZ distributions respectively.
Then, for each v > 0, P0 [Z > t yV/k] is increasing in J, P6 []ZJ > in [J[. Condition on V and apply the double expectation theorem.
t yV/k] is increasing
12. Show that the noncentral t distribution, 7k,6• has density
1 {= x U•-'I e-H x+C tVxfk -ol'ldx. J. ' ,(t) = h'k(�k)zlCk+l) lo Hint: Let Z and V be as in the preceding hint. From the joint distribution of Z and V, get the joint distribution of Y1 = Z/ ,jV7k and Y2 = V. Then use py, (y1) = f py,y, (y,, Y2 ) dy,. 13. The F Test for Equality of Scale. Let X1, , Xn1 , Y1 , - . . , Yn2 be two independent samples from N(Jli, £Ti), N(p.2 , 0'�). respectively, with all parameters assumed unknown.
...
294
Testing and Confidence Regions
Chapter 4
(a) Show that the LR test of H : af = a� versus K : ai > af is of the form: Reject if, and only if, F � [(n1 - 1)/(n2 - l)]E(}j - Y) 2/E(X; - X)2 > C.
(b) Show that (afja�)F has an Fnz-I,n1 -I distribution and that critical values can be obtained from the F table. -
(c) Justify the two-sided F test: Reject H if, and only if, F > /(1 a/2) or F < f(n/2 ) , where f( t) is the tth quantile of the Fnz-I,nt -I distribution. as an approximation to the LR test of H : a1 = (7z versus K : a1 -=1 a2. Argue as in Prol;>lem 4.9.4.
''' '
' '
(d) Relate the two-sided test of part (c) to the confidence intervals for a5faf obtained
in Problem 4.4.10.
14. The following data are the blood cholesterol levels (x's) and weight/height ratios (y's) of 10 men involved in a heart study.

x: 254 240 279 284 315 250 298 384 310 337
y: 2.71 2.96 2.62 2.19 2.68 2.64 2.37 2.61 2.12 1.94
Using the likelihood ratio test for the bivariate normal model, can you conclude at the 10% level of significance that blood cholesterol level is correlated with weight/height ratio?
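A direct computation for Problem 14 (a Python addition to the text): the likelihood ratio test here is equivalent to |r| (see Problem 15), and the standard equivalent form uses the statistic T = √(n − 2) r/√(1 − r²), which has a t_{n−2} distribution under ρ = 0.

```python
import numpy as np
from scipy.stats import t

x = np.array([254, 240, 279, 284, 315, 250, 298, 384, 310, 337], dtype=float)
y = np.array([2.71, 2.96, 2.62, 2.19, 2.68, 2.64, 2.37, 2.61, 2.12, 1.94])

n = len(x)
r = np.corrcoef(x, y)[0, 1]                  # sample correlation coefficient
T = np.sqrt(n - 2) * r / np.sqrt(1 - r**2)   # t_{n-2} under rho = 0
pval = 2 * t.sf(abs(T), df=n - 2)
print(r, T, pval)                            # compare pval with 0.10
```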
15, Let ( X1 , Y1 ), . . . , (Xn, Yn) be a sampj� from a bivariate.N"(O, 0, <1?, <1�, p) distribution. Consider the problem of testing H : p � 0 versus K : p -1 0. (a) Show that the likelihood ratio �tatistic is equivalent to lrl where
i=l
j=l
(b) Show that if we have a sample from a bivariate N(P.I , p.z, ai, a?, p) distribution, then P[P > c] is an increasing function of p for fixed c. Hint: Use the transformations and Problem B.4.7 to conclude that ,O has the same dis tribution as 812/8182, where
'''
=
n
n
n
2 = L l � L s; u,8 Vi , 812 � L U;V, i=2 i=2 i=2 and (U2, V,), . . . , (Un, Vn ) is a sample from a .N"(O, 0, 1 , 1, p) distribution. Let R � 812 /81 82, T � vn - 2R(V1 R2, and using the arguments of Problems B.4.7 and B.4.8, show that given Uz = u2, . . . , Un = Un, T has a noncentral 'Fn -2 distribution
with noncentrality parameter p. Because this conditional distribution does not depend on (u2, • . . , Un ), the continuous versiop of (B.1.24) implies that this is also the unconditional distribution. Finally, note that ,0 has the same distribution as R, that T is an increasing function of R, and use Probl�m 4.8.ll(a).
,
,
..
·;'
16, Let A(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p -1 0 in the bivariate normal model. Show, using (4.9.4) and (4.9.5) that 2 log >.(X) !:. V, where
•
.
' '
' '
1 '
l i
'
v has a xi distribution.
'
!
section 4. 11
295
Notes
17, Consider the bioequivalence example in Problem 3.2.9.
(a) Find the level " LR test for testing H :
8 E [-<, <] versus K : 8 i/c [-<, <].
(b) Compare your solution to the Bayesian solution based on a continuous loss function oo, and r;o = 0, n oo. given in Problem 3.2.9. Consider the cases rro = 0, TJ .......
4.11
.......
NOTES
Notes for Section 4.1 (1) The point of view usually taken in science is that of Karl Popper [1968]. Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding. Rejection is more definitive.
(2) We ignore at this time some real-life inadequacies of this experiment such as the placebo effect (see Example 1 . 1 .3). (3) A good approximation (Durbin, !973; Stephens, 1974) to the critical value is c,.(t) = tj( ..fii - 0.01 + 0.85/ ..fii) where t = 1 .035, 0.895 and t = 0.819 for " = 0.01, .05 and 0.10, respectively. Notes for Section 4.3 (1) Such a class is sometimes called essentially complete. The term complete is then re served for the class where strict inequality in ( 4.3.3) holds for some 8 if
(2) In using 8(5) as a confidence hound we are using the region [8(5), 1]. Because the region contains C(X), it also has confidence level (I <>) . -
4.12
REFERENCES
BARLOW, R. AND F. PROSCHAN, Mathematical Theory of Reliability. New York: J. Wiley & Sons, 1965.
BICKEL, P., E. HAMMEL, AND J. W. O'CONNELL, "Is there a sex bias in graduate admissions?" Science, 187, 398-404 (1975).
BOX, G. E. P., Apology for Ecumenism in Statistics and Scientific Inference, Data Analysis and Robustness, G. E. P. Box, T. Leonard, and C. F. Wu, Editors. New York: Academic Press, 1983.
BROWN, L. D., T. CAI, AND A. DASGUPTA, "Interval estimation for a binomial proportion," The American Statistician, 54 (2000).
DOKSUM, K. A. AND G. SIEVERS, "Plotting with confidence: Graphical comparisons of two populations," Biometrika, 63, 421-434 (1976).
DOKSUM, K. A., G. FENSTAD, AND R. AABERGE, "Plots and tests for symmetry," Biometrika, 64, 473-487 (1977).
DURBIN, J., "Distribution theory for tests based on the sample distribution function," Regional Conference Series in Applied Math., 9, SIAM, Philadelphia, Pennsylvania (1973).
FERGUSON, T., Mathematical Statistics: A Decision Theoretic Approach. New York: Academic Press, 1967.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner Publishing Company, 1958.
HALD, A., Statistical Theory with Engineering Applications. New York: J. Wiley & Sons, 1952.
HEDGES, L. V. AND I. OLKIN, Statistical Methods for Meta-Analysis. Orlando, FL: Academic Press, 1985.
JEFFREYS, H., The Theory of Probability. Oxford: Oxford University Press, 1961.
LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.
POPPER, K. R., Conjectures and Refutations; the Growth of Scientific Knowledge, 3rd ed. New York: Harper and Row, 1968.
PRATT, J., "Length of confidence intervals," J. Amer. Statist. Assoc., 56, 549-567 (1961).
SACKROWITZ, H. AND E. SAMUEL-CAHN, "P values as random variables - Expected P values," The American Statistician, 53, 326-331 (1999).
STEPHENS, M., "EDF statistics for goodness of fit," J. Amer. Statist. Assoc., 69, 730-737 (1974).
STEIN, C., "A two-sample test for a linear hypothesis whose power is independent of the variance," Ann. Math. Statist., 16, 243-258 (1945).
TATE, R. F. AND G. W. KLETT, "Optimal confidence intervals for the variance of a normal distribution," J. Amer. Statist. Assoc., 54, 674-682 (1959).
VAN ZWET, W. R. AND J. OOSTERHOFF, "On the combination of independent test statistics," Ann. Math. Statist., 38, 659-680 (1967).
WALD, A., Sequential Analysis. New York: Wiley, 1947.
WALD, A., Statistical Decision Functions. New York: Wiley, 1950.
WANG, Y., "Probabilities of the type I errors of the Welch tests," J. Amer. Statist. Assoc., 66, 605-608 (1971).
WELCH, B., "Further notes on Mrs. Aspin's tables," Biometrika, 36, 243-246 (1949).
WETHERILL, G. B. AND K. D. GLAZEBROOK, Sequential Methods in Statistics. New York: Chapman and Hall, 1986.
WILKS, S. S., Mathematical Statistics. New York: J. Wiley & Sons, 1962.
Chapter 5
ASYMPTOTIC APPROXIMATIONS
5.1
INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS
Despite the many simple examples we have dealt with, closed form computation of risks in terms of known functions or simple integrals is the exception rather than the rule. Even if the risk is computable for a specific P by numerical integration in one dimension, the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain. Worse, computation even at a single point may involve high-dimensional integrals. In particular, consider a sample X₁, …, Xₙ from a distribution F, our setting for this section and most of this chapter. If we want to estimate μ(F) = E_F X₁ and use X̄ we can write,

MSE_F(X̄) = E_F(X̄ − μ(F))² = σ²(F)/n, where σ²(F) = Var_F(X₁).   (5.1.1)

This is a highly informative formula, telling us exactly how the MSE behaves as a function of n, and calculable for any F and all n by a single one-dimensional integration. However, consider med(X₁, …, Xₙ) as an estimate of the population median ν(F). If n is odd, ν(F) = F⁻¹(½), and F has density f we can write

MSE_F(med(X₁, …, Xₙ)) = ∫_{−∞}^{∞} (x − F⁻¹(½))² gₙ(x) dx   (5.1.2)

where, from Problem (B.2.13), if n = 2k + 1,

gₙ(x) = n (2k choose k) Fᵏ(x)(1 − F(x))ᵏ f(x).   (5.1.3)
i
I
298
I
!' '
Asymptotic Approximations
Chapter 5
'' '
(Problem 5.1.2) and its qualitative properties are reasonably transparent. But suppose F is not Gaussian. It seems impossible to determine explicitly what happens to the power function because the distribution of foX/ S requires the joint distribution of (X, S) and in general this is only representable as an n-dirnensional integral;
'
'' '
where
A� There are two complementary approaches to these difficulties. The first, which occupies us for most of this chapter, is to approximate the risk function under study
i'
'
'
;
'
'
by a qualitatively simpler to understand and easier to compute function, Rn (F). The other, which we explore further in later chapters, is to use the Monte Carlo method. In its simplest form, Monte Carlo is described as follows. Draw B independent "samples" of size n, {XIj, . . . , Xn1 }, 1 < j < B from F using a random number generator and an explicit form for F. Approximately evaluate Rn(F) by
1 B
-
(5. 1.4) Rs = B Ll(F, o(X11, . . . , XnJ)). j=l By the law of large numbers B ---+ oo, fis � Rn (F ). Thus, save for the possibility of a very unlikely event, just as in numerical integration, we can approximate Rn(F) arbitrarily as
closely. We now turn to a detailed discussion of asymptotic approximations but will return to describe Monte Carlo and show how it complements asyrnptotics briefly in Example 5.3.3. Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or, more specifically, of distributions of statistics, based on observing n i.i.d. observations X1, . . . , Xn as n --+ oo. We shall see later that the scope of asymptotics is much greater, but for the time being let's stick to this case as we have until now. Asymptotics, in this context, always refers to a sequence of statistics
{Tn(XJ, . . . , Xn)}n>l , for instance the sequence of means {Xn}n>t. where Xn = of medians, or it refers to the sequence of their distributions
! :E� 1 Xi, or the sequence
Asymptotic statements are always statements about the sequence. The classical examples p are, Xn -+ EF(XI) or -
LF( ;fn(Xn - EF (XI))) -+ N(O, VarF(XI)).
i i
'
I I
' ' ! ,
j
Section 5 .1
Introduction: The Meaning and Uses of Asymptotics
Tn(X . , Xn)
Tn (X . , Xn) Tn(X1, ... ,Xn)
299
In theory these limits say nothing about any particular but in practice we 1, . . act as if they do because the we consider are closely related as functions of 1, . . so that we expect the limit to approximate or (in an appropriate sense). For instance, the weak law of large num bers tells us that, if oo, then
nCp(Tn (X1, . . . , Xn)) Ep]Xd <
(5.1 .5) That is, (see A.l4.1 )
h[] Xn -JL ] > <[
�
o
(5.1.6)
for all £ > 0. We interpret this as saying that, for n sufficiently large, X is approximately equal to its expectation. The trouble is that for any specified degree of approximation, say, £ = .01, (5.1.6) does not tell us how large n has to be for the chance of the approximation not holding to this degree (the !eft-hand side of (5.1.6)) to fall, say, below .01. Is n > 100 enough or does it have to be n > 100, 000? Similarly, the central limit theorem tells us that oo, ,, is as above and then if n
•
Ep]Xf] <
a2 - Varp (XI), n ( Pp [vn X a- JL) < z] il>(z )
(5.1 .7)
�
where 4:1 is the standard normal d. f. As an approximation, this reads
( (x I') ) . PF[Xn < x] "' il> vn a
(5.1.8)
n,
Again we are faced with the questions of how good the approximation is for given x, and What we in principle prefer are bounds, which are available in the classical situations of (5.1.6) and (5. 1 . 7). Thus, by Chebychev's inequality, if oo,
Pp.
a' Pp[]Xn - I'] > <] < nE2 .
EpXf <
As a bound this is typically far too conservative. For instance, if more delicate Hoeffding bound (B.9.6) gives
(5 .1 .9 )
jX 11 < 1, the much
Pp [IXn -JL] > e] ::; 2exp { - �ne2} . (5.1 .10) 1 implies that a2 < 1 with a2 = 1 possible (Problem 5.1.3), the right Because ] X !] hand side of (5.1.9) when a2 is unknown be>omes 1jn e2 For ' = .1, 400, (5.1.9) is .25 whereas (5.1. 10) is .14. ::;
n =
Further qualitative features of these bounds and relations to approximation (5.1.8) are given in Problem 5.1.4. Similarly, the celebrated Berry-Esseen bound (A. 1 5 . 1 1 ) states that if oo,
EF]Xd' <
sup X
PF [
-a !') <- x] - i!>(x) -< 0Eap]3nX1/2d3
r,; (Xn
V"
(5. ! . I I)
300
Asymptotic Approximations
Chapter 5
where C is a universal constant known to be < 33/4. Although giving us some idea of how much (5.1.8) differs from the truth, (5. 1.11) is again much too consetvative generally. (1) The approximation (5.1.8) is typically much better than (5.1.11) suggests. Bounds for the goodness of approximations have been available for Xn and its distri bution to a much greater extent than for nonlinear statistics such as the median. Yet, as we have seen, even here they are not a very reliable guide. Practically one proceeds as follows: (a) Asymptotic approximations are derived. (b) Their validity for the given n and T11 for some plausible values of F is tested by numerical integration if possible or Monte Carlo computation. If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown F generating the data may not be as good. Asymptotics has another important function beyond suggesting numerical approxima tions for specific n and F. If they are simple, asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate. For instance, (5.1.7) says that the behavior of the distribution of Xn is for large n governed (approxi mately) only by f-L and cr 2 in a precise way, although the actual distribution depends on Pp in a complicated way. It suggests that qualitatively the risk of Xn as an estimate of Jl, for any loss function of the form l(F,d) = -'(]!' - dl) where -'(0) = 0, A'(O) > 0, behaves like -' (0)( <1/ y'n)( v'27i') (Problem 5.1.5) and quite generally that risk increases with <1 and decreases with n, which is reasonable. As we shall see, quite generally, good estimates Bn of parameters B(F) will behave like Xn does in relation to Jl · The estimates On will be consistent, Bn � O (F) for all F in the model, and asymptotically normal, '
,
fo[Bn - B(F)] <1(B,F)
-->
(5.1.12)
N(O, 1)
'
'
•
•
l '
•
-
where CT(B, F) typically is the standard deviation (SD) of foOn or an approximation to this SD. Consistency will be pursued in Section 5.2 and asymptotic normality via the delta method in Section 5.3. The qualitative implications of results such as are very impor tant when we consider comparisons between competing procedures. Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo. We now turn to specifics. As we mentioned, Section 5.2 deals with consistency of various estimates including maximum likelihood. The arguments apply to vector-valued estimates of Euclidean parameters. In particular, consistency is proved for the estimates of canonical parameters in exponential families. Section 5.3 begins with asymptotic computa tion of moments and asymptotic normality of functions of a scalar mean and include as an application asymptotic normality of the maximum likelihood estimate for one-parameter exponential families. The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE 7j of the canonical parameter 17
j
i
•
•
'
j
l :
Section 5.2
30 1
Consistency
in exponential families among other results. Section 5.4 deals with optimality results for likelihood-based procedures in one-dimensional parameter models. Finally in Section 5.5 we examine the asymptotic behavior of Bayes procedures. The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A l 4, A. IS, and B.7. We will recall relevant definitions from that appendix as we need them, but we shall use results we need from A.14, A. IS, and 8.7 without further discussion. .
Summary. Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to ∞. In practice, asymptotics are methods of approximating risks, distributions, and other statistical quantities that are not realistically computable in closed form, by quantities that can be so computed. Most asymptotic theory we consider leads to approximations that in the i.i.d. case become increasingly valid as the sample size increases. We also introduce Monte Carlo methods and discuss the interaction of asymptotics, Monte Carlo, and probability bounds.
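As a concrete instance of the Monte Carlo recipe (5.1.4), the following Python sketch (an addition to the text) approximates the MSE of the sample median, the risk discussed around (5.1.2), under a N(0, 1) population. The choices of B, of n, and of squared-error loss are illustrative assumptions.

```python
import numpy as np

def mc_mse_median(n, B=20_000, seed=0):
    """Monte Carlo approximation, as in (5.1.4), of R_n(F) = MSE_F(median)
    for F = N(0, 1), whose population median is 0."""
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal((B, n))      # B independent samples of size n
    medians = np.median(samples, axis=1)
    return np.mean(medians ** 2)               # average squared error over the B draws

for n in [11, 51, 101]:
    # compare with the large-sample approximation pi/(2n) for the normal median
    print(n, mc_mse_median(n), np.pi / (2 * n))
```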
5.2
CONSISTENCY Plug-In Estimates and MlEs in Exponential Family Models
5.2.1
Suppose that we have a sample X1, . ,Xn from Po where 6 E 6 and want to estimate a real or vector q(B). The least we can ask of our estimate iln (X1, , Xn) is that as .
.
•
.
.
n � oo, fin I] q(8) for all 8. That is, in accordance with (A.l4.1) and (B.7. 1), for all 8 E e. , > o. (5.2. 1 )
·
where I j denotes Euclidean distance. A stronger requirement is (5.2.2)
Bounds b(n, <) for sup9 P9 1/
p
X � p(P) = E(X!) �
-
�
and p(P) = X, where P is the empirical distribution, is a consistent estimate of p(P). For P this large it is not uniformly consistent. (See Problem 5.2.2.) However, if, for
'
'
302 instance,
=
P
{P : EpX'f < M <
by Chebyshev's inequality, for all
X
oo }, then
Asymptotic Approximations
Chapter 5
is uniformly consistent over
P because
P E P.
0
, Xn be the indicators of binomial trials Example 5.2.2. Binomial Variance. Let X1 , with P[X1 = 1] = p. Then N = L; X; has a B(n,p) distribution, 0 < p < 1, and P = X = N/n is a uniformly consistent estimate of p. But further, consider the plug-in estimate p(1 -p)jn of the variance of p, which is � q(ji), where q(p) = p(1-p). Evidently, by A.I4.6, q(jj) is consistent. Other moments of X1 can be consistently estimated in the .
�
•
.
-
0
same way.
To some extent the plug-in method was justified by consistency considerations and it is not suprising that consistency holds quite generally for frequency plug-in estimates.
Theorem 5.2.1. Suppose thnt P = S = { (Ph . . . , Pk) : 0 < Pi < 1, 1 < j < k, L;J�1 p, = 1}, the k-dimensional simplex, where Pi = P[X1 = xi], 1 < j < k, and { x1, , xk} is the range of X,. Let Ni = L;� 1 1(X; = x1) and PJ - NJfn, Pn = (ih, . . . , fA) E S be the empirical distribution. Suppose that q : S RP is continuous. Then (jn = q(P:n ) is a unifonnly consistent estimate of q(p ). .
•
.
---+
Proof. By the weak law of large numbers for all p,
6>0
Pp [IPn - PI > 6] � 0. Because q is continuous and S is compact, it is uniformly continuous on S. Thus, for every <
> 0, there exists 6 (<) > 0 such that p, p' E S, I P' -pi < 6(<), implies l q(p')-q(p) I <
Then
Pp [liln - q l > <] :S Pp[[Pn - PI > 6(<)] But, sup{ Pp[li\n -pi > 6] : p E S} < kj4n62 (Problem 5.2.1) and the result follows. In fact, in this case, we can go further. Suppose the is defined by
0
modulus of continuity of q, w( q, 0)
·
w(q, 8)
= sup{ l q(p) - q(p') I : IP - P'l :S 8},
Evidently, w(q, ·) is increasing in
w(q,6) !
<.
O as 6 !
0. Let w-1 : [a, b] < 1 w- (<)
It easily follows (Problem
0
and has the range (a, b) say. If
(5.2.3) q is
continuous
R+ be defined as the inver�e ofw,
= inf{6
: w (q , 6)
2: <}
(5.2.4)
i
5.2.3) that (5.2.5)
A simple and important result for the case in which X�, . . . , Xn are i.i.d. is the following:
I
with Xi
EX
I
l
••
'
'
•
Section 5.2
303
Consistency
(g,, . . . , 9d) map X onto Y C Rd Suppose Proposition 5.2.1. Let g Eoi9J(Xl ) l < oo, I < j < d, for all 8; let mj(B) = EogJ (Xl), I < j < d, and let RP. Then, if h is continuous, q(B) = h(m(B)), where h : Y �
ij
h(g)
=
I n "' g(X, ) h , L... i=l
is a consistent estimate of q(8). More generally if v(P) = h{Epg(X1 ) ) and P = {P : Ep lg(X1) 1 < oo }, then v( P) = h(g) , where P is the empirical distribution, is consistent for v{P). Proof. We need only apply the general weak law of large numbers (for vectors) to conclude that (5.2.6) if Eplg(Xl)l < oo. For consistency of h(g) apply Proposition B . 7. 1: Un .!:.. U implies 0 that h(U n) !'. h (U) for all continuous h.
Example 5.2.3. Variances and Correlations. Let Xi (Ui, Vi), 1 < i < n be i.i.d. N2 (J-Lt ,J-Lz , aT,a�,p), af > 0, IPI < 1. Let g(u , v ) = (u, v, u2 , v2 , uv) so that E� 1 g(Ui, Vi) is the statistic generating this 5-parameter exponential family. If we let 8 = (J-Lt, J-l2, a'f, ag,p), then =
If h = m- 1 , then
which is well defined and continuous at all points of the range of m. We may, thus, conclude by Proposition 5.2.1 that the empirical means, variances, and correlation coefficient are all consistent. Questions of uniform consistency and consistency when P = {distributions such that EU₁² < ∞, EV₁² < ∞, Var(U₁) > 0, Var(V₁) > 0, |Corr(U₁, V₁)| < 1} are discussed in Problem 5.2.4.
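A small simulation illustrating the consistency asserted in Example 5.2.3 for the empirical correlation coefficient. Python, the bivariate normal parameters, and the sample sizes are illustrative choices added here, not part of the text.

```python
import numpy as np

def empirical_corr(n, rho=0.6, seed=0):
    """Sample correlation of n i.i.d. draws from a bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    u, v = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    return np.corrcoef(u, v)[0, 1]

for n in [50, 500, 5000, 50000]:
    print(n, empirical_corr(n))   # should approach rho = 0.6 as n grows
```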
Here is a general consequence of Proposition 5.2.1 and Theorem 2.3.1. Theorem 5.2.2. Suppose P is a canonical exponentialfamily of rank d generated by T. Let q, [ and A(·) correspond to P as in Section 1.6. Suppose [ is open. Then, zf X1, . . . , Xn are a sample from Pr, E P.
(i) P'1 [The MLE ij exists] � I.
(ii) ij is consistent.
'
�I '
' '
'
304
Asymptotic Approximations
Chapter 5
'
' '
-
Proof Recall from Corollary 2.3.1 to Theorem 2.3.1 that i)(X�> - - - , Xn) exists iff � L� 1 T(Xi) = Tn belongs to the interior C.J. of the convex support of the distribution of Tn - Note that, if no is true, En, ( T(X!)) must by Theorem 2.3.1 belong to the interior
A(n) = to,
t
A(n0) = En, T{X1),
where o = of the convex support because the equation is solved by '7o· By definition of the interior of the convex support there exists a ball < 6} c CT- By the law of large numbers, S, =
{t : [t - En, T(Xt)l
I
n
n
L T(X,) i=l
Hence,
1
P,
jo E'r,, T(X!).
n
Pn, [ n L T(X, ) E CT) � 1. i=l But Tj, which solves
(5.2.7)
n 1 A(n) = n- L T(X,), 1
•= .
exists iff the event -in (5.2.7) occurs and (i) follows. We showed in Theorem 2.3.1 that on C.J. the map 17 ..--j. A(11) is 1-1 and continuous on£. By a classical result, see, for example, : A(t:) � t: is continuous on S, and the result follows from Rudin (1987), the inverse D Proposition 5.2.1.
A- 1
i i
\ '
-
'
5.2.2
! ! '
Consistency of Minimum Contrast Estimates
'
'
The argument of the the previous subsection in which a minimum contrast estimate, the MLE, is a continuous function of a mean of i.i.d. vectors evidently used exponential family properties. A more general argument is given in the following simple theorem whose con be i.i.d. Po, B E e c Rd . Let 0 be a minimum ditions are hard to check. Let contrast estimate that minimizes
XI' . . . , Xn
n 1 X , II) Pn(X, II) = n- L p( , .
where, as usual, D(
z=l
Theorem 5.2.3. Suppose
n P�' o 1 0 sup{ [ - L[p(X, , Ii) - D{lio , li)JI : II E e} n i=l
(5.2.8)
and -
Then 8 is consistent.
-
lio
l > <} > D(lio,lio)
'
1
I '
'
I
i
' '
i
110 , II) = Eo,P(Xt. II) is uniquely minimized at lio for all lio E e.
inf{D{Ii, lio) : Ill
I
1
for l!llery e
> 0.
(5.2.9)
j
I I
I
I I
I
I
Section
305
5.2 Consistency
Proof Note that,
n 1 ] < Pe,[inf{-n L[p(X, , II) - - p(X., IIo)] : ]11 - 1101 > <} < O]
-
<
P11, [19 - llol >
By hypothesis, for all
i= l
(5.2.10)
o > 0,
1
Pll, [inf{- L(p(X, , II) - p(X,,IIo)) : [11 - llol > <} n i= l - inf{D(IIo, II) - D(ll0, llo)) : 111 -- llol > <} < -o] � 0 because the event in
n
(5.2. 1 1 ) implies that
1 n
sup{[- L[p(X,, II) - D(llo, 11)]1
n i=l
which has probability tending to
o� Then
(5.2.11)
0 : II E e) > -,
(5.2.12)
2
0 by (5.2.8). But for c > 0 let
� inf{D(II,IIo) - D(llo,llo) : [11 - llol > o).
(5.2.11) implies that the right-hand side of (5.2.10) tends to 0.
0
A simple and important special case is given by the following.
{ll1 , . . . , lid). E11, ] logp(X" II) [ < oo and the parameterization is identifiable, then, if II is the MLE, P11 [8 ¥ 0; ] � Ofor all j. CoroUary 5.2.1. lf 6 is finite, e
�
-
-
'
Proof Note that for some < > 0,
(5.2.13) By Shannon's Lemma 2.2.1 we need only check that (5.2.8) and (5.2.9) hold for
logp(x, II). But because 6 is finite, (5.2.8) follows from the WLLN and P11, [max{] � l:� 1 (p(X, , II;) - D(llo, II;)) : 1 < j < d) > <] < d max{ P11, [I� l:� 1 (p(X,, II; ) - D(llo, II;) )I > <] : 1 < j S d) and
(5.2.9) follows from Shannon's lemma. can
often fail
see
�
Problem
p(x, II) � 0. 0
5.2.5. An alternative condition that is readily seen to work more widely is the replacement of (5.2.8) by Condition
(5.2.8)
(i) For all compact K
1 n
c
6,
Pll, 0. sup - L ]p(X , 0) - D(llo, II) I : II E K n i=l ,
(ii) For some compact
1 n
K c 8,
II) - p(X,, 110)) : II E K' P11, inf - L(p(X,, _ n �=1
(5.2.14) >0
�
1.
!
•
'
306
Asymptotic Approximations
Chapter 5
•
We shall see examples in which this modification works in the problems. Unfortunately checking conditions such as (5.2.8) and (5.2.14) is in general difficult. A general approach due to Wald and a similar approach for consistency of generalized estimating equation so lutions are left to the problems. When the observations are independent but not identically distributed, consistency of the MLE may fail if the number of parameters tends to infinity, see Problem 5.3.33.
Summary. We introduce the minimal property we require of any estimate (strictly speak ing, sequence of estimates) consistency. If Bn is an estimate of B(P), we require that Bn � � B(P) as n � oo. Unifonn consistency for 'P requires more, that sup{P[IBn - B(P)] > Ej : 0 for all € > 0. We show how consistency holds for continuous functions of P E P} ---t
vector means as a consequence of the law of large numbers and derives consistency of the MLE in canonical multiparameter exponential families. We conclude by studying consis tency of the MLE and more generally MC estimates in the case 8 finite and 8 Euclidean. Sufficient conditions are explored in the problems. 5,3
FIRST- AND H IGHER-ORDER ASYMPTOTICS: THE DELTA M ETHOD WITH APPLICATIONS
We have argued in Section 5.1 that the principal use of asymptotics is to provide quantita tively or qualitatively useful approximations to risk. 5.3.1
The Delta Method for Moments
We begin this section by deriving approximations to moments of smooth functions of scalar means and even provide crude bounds on the remainders. We then sketch the extension to functions of vector means. As usual let $X_1, \ldots, X_n$ be i.i.d. $\mathcal{X}$-valued and for the moment take $\mathcal{X} = R$. Let $h : R \to R$, let $\|g\|_\infty = \sup\{|g(t)| : t \in R\}$ denote the sup norm, and assume

(i) $h$ is $m$ times differentiable on $R$, $m \ge 2$. We denote the $j$th derivative of $h$ by $h^{(j)}$ and assume $\|h^{(m)}\|_\infty = \sup_x |h^{(m)}(x)| \le M < \infty$.

(ii) $E|X_1|^m < \infty$.

Let $E(X_1) = \mu$, $\mathrm{Var}(X_1) = \sigma^2$. We have the following.

Theorem 5.3.1. If (i) and (ii) hold, then
\[ Eh(\bar X) = h(\mu) + \sum_{j=1}^{m-1} \frac{h^{(j)}(\mu)}{j!}\, E(\bar X - \mu)^j + R_m \tag{5.3.1} \]
where $|R_m| \le \dfrac{\|h^{(m)}\|_\infty}{m!}\, E|\bar X - \mu|^m$.
The proof is an immediate consequence of Taylor's expansion,
\[ h(\bar X) = h(\mu) + \sum_{j=1}^{m-1} \frac{h^{(j)}(\mu)}{j!}(\bar X - \mu)^j + \frac{h^{(m)}(X^*)}{m!}(\bar X - \mu)^m \tag{5.3.2} \]
where $|X^* - \mu| \le |\bar X - \mu|$, and the following lemma.

Lemma 5.3.1. If $E|X_1|^j < \infty$, $j \ge 2$, then there are constants $C_j > 0$ and $D_j > 0$ such that
\[ E|\bar X - \mu|^j \le C_j\, n^{-j/2}, \tag{5.3.3} \]
\[ |E(\bar X - \mu)^j| \le D_j\, n^{-[(j+1)/2]}. \tag{5.3.4} \]
Note that for $j$ even, $E|\bar X - \mu|^j = E(\bar X - \mu)^j$.

Proof. We give the proof of (5.3.4) for all $j$ and of (5.3.3) for $j$ even. The more difficult argument needed for (5.3.3) and $j$ odd is given in Problem 5.3.2. Let $\mu = E(X_1) = 0$; then

(a) $\displaystyle E(\bar X^j) = n^{-j} E\Big(\sum_{i=1}^n X_i\Big)^j = n^{-j}\sum_{i_1,\ldots,i_j} E(X_{i_1}\cdots X_{i_j})$.

But $E(X_{i_1}\cdots X_{i_j}) = 0$ unless each integer that appears among $\{i_1,\ldots,i_j\}$ appears at least twice. Moreover,

(b) $\displaystyle \sup_{i_1,\ldots,i_j} |E(X_{i_1}\cdots X_{i_j})| \le E|X_1|^j$

by Problem 5.3.5, so the number of nonzero terms in (a) is at most

(c) $\displaystyle \sum_{r=1}^{[j/2]}\binom{n}{r}\sum\Big\{\frac{j!}{i_1!\cdots i_r!} : i_1 + \cdots + i_r = j,\ i_k \ge 2 \text{ for all } k\Big\}$,

where $\binom{n}{r} = \dfrac{n!}{r!(n-r)!}$ and $[t]$ denotes the greatest integer $\le t$. The expression in (c) is, for $j \le n/2$, bounded by

(d) $\displaystyle \frac{c_j}{[j/2]!}\, n(n-1)\cdots(n - [j/2] + 1)$

where
\[ c_j = \max\Big\{\sum \frac{j!}{i_1!\cdots i_r!} : i_1 + \cdots + i_r = j,\ i_k \ge 2,\ 1 \le k \le r\Big\}, \]
the maximum being taken over $1 \le r \le [j/2]$.
But

(e) $n^{-j}\, n(n-1)\cdots(n - [j/2] + 1) \le n^{[j/2] - j}$

and (c), (d), and (e) applied to (a) imply (5.3.4) for $j$ odd and (5.3.3) for $j$ even, if $\mu = 0$. In general, by considering $X_i - \mu$ as our basic variables we obtain the lemma but with $E|X_1|^j$ replaced by $E|X_1 - \mu|^j$. By Problem 5.3.6, $E|X_1 - \mu|^j \le 2^j E|X_1|^j$ and the lemma follows. $\Box$

The two most important corollaries of Theorem 5.3.1, respectively, give approximations to the bias of $h(\bar X)$ as an estimate of $h(\mu)$ and to its variance and MSE.

Corollary 5.3.1. (a) If $E|X_1|^3 < \infty$ and $\|h^{(3)}\|_\infty < \infty$, then
\[ Eh(\bar X) = h(\mu) + \frac{h^{(2)}(\mu)\sigma^2}{2n} + O(n^{-3/2}). \tag{5.3.5} \]
(b) If $E(X_1^4) < \infty$ and $\|h^{(4)}\|_\infty < \infty$, then $O(n^{-3/2})$ in (5.3.5) can be replaced by $O(n^{-2})$.

Proof. For (5.3.5) apply Theorem 5.3.1 with $m = 3$. Because $E(\bar X - \mu)^2 = \sigma^2/n$, (5.3.5) follows. If the conditions of (b) hold, apply Theorem 5.3.1 with $m = 4$. Then $R_m = O(n^{-2})$ and also $E(\bar X - \mu)^3 = O(n^{-2})$ by (5.3.4). $\Box$
Corollary 5.3.2. (a) If $\|h^{(j)}\|_\infty < \infty$, $1 \le j \le 3$, and $E|X_1|^3 < \infty$, then
\[ \mathrm{Var}\,h(\bar X) = \frac{\sigma^2}{n}\,[h^{(1)}(\mu)]^2 + O(n^{-3/2}). \tag{5.3.6} \]
(b) If $\|h^{(j)}\|_\infty < \infty$, $1 \le j \le 3$, and $EX_1^4 < \infty$, then $O(n^{-3/2})$ in (5.3.6) can be replaced by $O(n^{-2})$.

Proof. (a) Write
\[ \begin{aligned} Eh^2(\bar X) &= h^2(\mu) + 2h(\mu)h^{(1)}(\mu)E(\bar X - \mu) + \{h^{(2)}(\mu)h(\mu) + [h^{(1)}(\mu)]^2\}E(\bar X - \mu)^2 + \tfrac{1}{6}E[h^2]^{(3)}(X^*)(\bar X - \mu)^3 \\ &= h^2(\mu) + \{h^{(2)}(\mu)h(\mu) + [h^{(1)}(\mu)]^2\}\frac{\sigma^2}{n} + O(n^{-3/2}). \end{aligned} \tag{a} \]
(b) Next, using Corollary 5.3.1,
\[ [Eh(\bar X)]^2 = h^2(\mu) + h(\mu)h^{(2)}(\mu)\frac{\sigma^2}{n} + O(n^{-3/2}). \tag{b} \]
Subtracting (b) from (a) we get (5.3.6). To get part (b) we need to expand $Eh^2(\bar X)$ to four terms and similarly apply the appropriate form of (5.3.5). $\Box$

Clearly the statements of the corollaries can also be turned into expansions as in Theorem 5.3.1 with bounds on the remainders. Note an important qualitative feature revealed by these approximations. If $h(\bar X)$ is viewed, as we normally would, as the plug-in estimate of the parameter $h(\mu)$, then, for large $n$, the bias of $h(\bar X)$, defined by $Eh(\bar X) - h(\mu)$, is $O(n^{-1})$, which is negligible compared to the standard deviation of $h(\bar X)$, which is $O(n^{-1/2})$ unless $h^{(1)}(\mu) = 0$. A qualitatively simple explanation of this important phenomenon will be given in Theorem 5.3.3.
Example 5.3.1. If $X_1, \ldots, X_n$ are i.i.d. $\mathcal{E}(\lambda)$, the MLE of $\lambda$ is $\bar X^{-1}$. If the $X_i$ represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours, then we may be interested in the warranty failure probability
\[ \theta = P_\lambda[X_1 \le 2] = 1 - \exp(-2\lambda). \tag{5.3.7} \]
If $h(t) = 1 - \exp(-2/t)$, then $h(\bar X)$ is the MLE of $1 - \exp(-2\lambda) = h(\mu)$, where $\mu = E_\lambda X_1 = 1/\lambda$. We can use the two corollaries to compute asymptotic approximations to the mean and variance of $h(\bar X)$. Thus, by Corollary 5.3.1,
\[ \mathrm{Bias}_\lambda(h(\bar X)) = E_\lambda(h(\bar X) - h(\mu)) = \frac{h^{(2)}(\mu)\sigma^2}{2n} + O(n^{-2}) = \frac{2\lambda(1-\lambda)e^{-2\lambda}}{n} + O(n^{-2}) \tag{5.3.8} \]
because $h^{(2)}(t) = 4(t^{-3} - t^{-4})\exp(-2/t)$ and $\sigma^2 = 1/\lambda^2$, and, by Corollary 5.3.2 (Problem 5.3.1),
\[ \mathrm{Var}_\lambda(h(\bar X)) = \frac{4\lambda^2 e^{-4\lambda}}{n} + O(n^{-3/2}). \tag{5.3.9} \Box \]
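The following is a minimal illustrative sketch, not part of the original text: it checks the delta-method approximations (5.3.8) and (5.3.9) for this example by simulation. The rate $\lambda$, the sample size, and the number of replications are arbitrary choices.

```python
# Sketch (not from the text): Monte Carlo check of the delta-method bias and
# variance approximations for h(Xbar) = 1 - exp(-2/Xbar) with Exponential(lam) data.
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 1.5, 50, 200_000

xbar = rng.exponential(scale=1 / lam, size=(reps, n)).mean(axis=1)
h = 1 - np.exp(-2 / xbar)                     # plug-in estimate of 1 - exp(-2*lam)

theta = 1 - np.exp(-2 * lam)                  # true warranty failure probability
bias_mc, var_mc = h.mean() - theta, h.var()

# Delta-method approximations: Bias ~ h''(mu) sigma^2 / (2n), Var ~ [h'(mu)]^2 sigma^2 / n
mu, sigma2 = 1 / lam, 1 / lam**2
h1 = -2 * mu**-2 * np.exp(-2 / mu)            # h'(mu)
h2 = 4 * (mu**-3 - mu**-4) * np.exp(-2 / mu)  # h''(mu)
print(f"bias: MC {bias_mc:.5f}  delta {h2 * sigma2 / (2 * n):.5f}")
print(f"var : MC {var_mc:.6f}  delta {h1**2 * sigma2 / n:.6f}")
```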
Further expansion can be done to increase precision of the approximation to $\mathrm{Var}\,h(\bar X)$ for large $n$. Thus, by expanding $Eh^2(\bar X)$ and $Eh(\bar X)$ to six terms we obtain the approximation
\[ \mathrm{Var}(h(\bar X)) = \frac{\sigma^2}{n}[h^{(1)}(\mu)]^2 + \frac{1}{n^2}\Big\{h^{(1)}(\mu)h^{(2)}(\mu)\mu_3 + \tfrac{1}{2}[h^{(2)}(\mu)]^2\sigma^4 + h^{(1)}(\mu)h^{(3)}(\mu)\sigma^4\Big\} + R_n' \tag{5.3.10} \]
with $R_n'$ tending to zero at the rate $1/n^3$. Here $\mu_k$ denotes the $k$th central moment of $X_i$ and we have used the facts that (see Problem 5.3.4)
\[ E(\bar X - \mu)^3 = \frac{\mu_3}{n^2}, \qquad E(\bar X - \mu)^4 = \frac{3\sigma^4}{n^2} + O(n^{-3}). \tag{5.3.11} \]
Example 5.3.2. Bias and Variance of the MLE of the Binomial Variance. We will compare $E(h(\bar X))$ and $\mathrm{Var}\,h(\bar X)$ with their approximations, when $h(t) = t(1-t)$ and $X_i \sim \mathcal{B}(1,p)$, and will illustrate how accurate (5.3.10) is in a situation in which the approximation can be checked. First calculate
\[ Eh(\bar X) = E(\bar X) - E(\bar X^2) = p - [\mathrm{Var}(\bar X) + (E\bar X)^2] = p(1-p) - \frac{1}{n}p(1-p) = \frac{n-1}{n}\,p(1-p). \]
Because $h^{(1)}(t) = 1 - 2t$ and $h^{(2)}(t) = -2$, (5.3.5) yields
\[ E(h(\bar X)) = p(1-p) - \frac{1}{n}p(1-p) \]
and in this case (5.3.5) is exact, as it should be. Next, an exact computation using the central moments of the binomial gives
\[ \mathrm{Var}\,h(\bar X) = \frac{p(1-p)}{n}\Big\{(1-2p)^2 + \frac{1}{n}\big[2p(1-p) - 2(1-2p)^2\big] + \frac{1}{n^2}\big[1 - 6p(1-p)\big]\Big\}. \]
Because $\mu_3 = p(1-p)(1-2p)$ and $h^{(3)} \equiv 0$, (5.3.10) yields
\[ \begin{aligned} \mathrm{Var}\,h(\bar X) &= \frac{1}{n}(1-2p)^2 p(1-p) + \frac{1}{n^2}\big\{-2(1-2p)^2 p(1-p) + 2p^2(1-p)^2\big\} + R_n' \\ &= \frac{p(1-p)}{n}\Big\{(1-2p)^2 + \frac{1}{n}\big[2p(1-p) - 2(1-2p)^2\big]\Big\} + R_n'. \end{aligned} \]
Thus, the error of approximation is
\[ \frac{p(1-p)}{n^3}\big[1 - 6p(1-p)\big] = O(n^{-3}). \ \Box \]
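The following is a minimal illustrative sketch, not part of the original text: it compares the exact mean and variance just derived with the approximations (5.3.5) and (5.3.10) for a few arbitrary choices of $p$ and $n$.

```python
# Sketch (not from the text): exact vs. delta-method moments for the MLE of the
# binomial variance, h(Xbar) = Xbar(1 - Xbar) with X_i ~ Bernoulli(p).
def exact_and_approx(p, n):
    q = 1 - p
    mean_exact = (n - 1) / n * p * q
    var_exact = (p * q / n) * ((1 - 2 * p) ** 2
                               + (2 * p * q - 2 * (1 - 2 * p) ** 2) / n
                               + (1 - 6 * p * q) / n ** 2)
    # approximations (5.3.5) and (5.3.10): h'(t) = 1 - 2t, h''(t) = -2, mu3 = pq(1 - 2p)
    mean_approx = p * q - p * q / n
    var_approx = (p * q / n) * ((1 - 2 * p) ** 2
                                + (2 * p * q - 2 * (1 - 2 * p) ** 2) / n)
    return mean_exact, mean_approx, var_exact, var_approx

for n in (10, 50, 250):
    me, ma, ve, va = exact_and_approx(p=0.3, n=n)
    print(f"n={n:4d}  E exact {me:.6f} approx {ma:.6f}   Var exact {ve:.8f} approx {va:.8f}")
```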
The generalization of this approach to approximation of moments for functions of vector means is formally the same but computationally not much used for $d$ larger than 2.

Theorem 5.3.2. Suppose $g : \mathcal{X} \to R^d$ and let $Y_i = g(X_i) = (g_1(X_i), \ldots, g_d(X_i))^T$. Let $h : R^d \to R$, assume that $h$ has continuous partial derivatives of order up to $m$, and that

(i) $\|D^m(h)\|_\infty < \infty$, where $D^m h(x)$ is the array (tensor)
\[ \Big(\frac{\partial^m h}{\partial x_{i_1}\cdots\partial x_{i_m}}(x) : 1 \le i_1, \ldots, i_m \le d\Big); \]

(ii) $E|Y_{1j}|^m < \infty$, $1 \le j \le d$, where $Y_{ij} = g_j(X_i)$.

Then, if $\bar Y_k = \frac{1}{n}\sum_{i=1}^n Y_{ik}$, $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$, and $\mu = EY_1$,
\[ Eh(\bar Y) = h(\mu) + \sum_{j=1}^{m-1}\frac{1}{j!}\sum_{i_1,\ldots,i_j=1}^d \frac{\partial^j h}{\partial y_{i_1}\cdots\partial y_{i_j}}(\mu)\, E\big[(\bar Y_{i_1} - \mu_{i_1})\cdots(\bar Y_{i_j} - \mu_{i_j})\big] + R_m. \tag{5.3.12} \]

This is a consequence of Taylor's expansion in $d$ variables, B.8.11, and the appropriate generalization of Lemma 5.3.1. The proof is outlined in Problem 5.3.4. The most interesting application, as for the case $d = 1$, is to $m = 3$. We get, for $d = 2$ and $E|Y_1|^3 < \infty$,
\[ Eh(\bar Y) = h(\mu) + \frac{1}{2n}\Big\{\frac{\partial^2 h}{\partial y_1^2}(\mu)\,\mathrm{Var}(Y_{11}) + 2\frac{\partial^2 h}{\partial y_1 \partial y_2}(\mu)\,\mathrm{Cov}(Y_{11}, Y_{12}) + \frac{\partial^2 h}{\partial y_2^2}(\mu)\,\mathrm{Var}(Y_{12})\Big\} + O(n^{-3/2}). \tag{5.3.13} \]
Moreover, by (5.3.3), if $E|Y_1|^4 < \infty$, then $O(n^{-3/2})$ in (5.3.13) can be replaced by $O(n^{-2})$. Similarly, under appropriate conditions (Problem 5.3.12),
\[ \mathrm{Var}\,h(\bar Y) = \frac{1}{n}\Big[\Big(\frac{\partial h}{\partial y_1}(\mu)\Big)^2\mathrm{Var}(Y_{11}) + 2\frac{\partial h}{\partial y_1}(\mu)\frac{\partial h}{\partial y_2}(\mu)\,\mathrm{Cov}(Y_{11}, Y_{12}) + \Big(\frac{\partial h}{\partial y_2}(\mu)\Big)^2\mathrm{Var}(Y_{12})\Big] + O(n^{-2}). \tag{5.3.14} \]
Approximations (5.3.5), (5.3.6), (5.3.13), and (5.3.14) do not help us to approximate risks for loss functions other than quadratic (or some power of (d - h(J.L))). The results in the next subsection go much further and "explain" the form of the approximations we already have. 5.3.2
The Delta Method for In law Approximations
As usual we begin with d = 1.
Theorem 5.3.3. Suppose that X = R. h : R � R, EXt < oo and h is differentiable at J.L = E(X 1 ). Then
.C( Vn(h(X) - h(J.L)))
�
N(O,
<72 (h))
(5.3,15)
where and
<72 = Var(X,).
The result follows from the more generaJly useful lemma.
Lemma 5.3.2. Suppose {Un} are real random variables and tlultfor a sequence {an } of constants with an --+ oo as oo,
n
--+
312
Chapter 5
Asymptotic Approximations
(i) an(Un - u ) .!:.. V for some constant u.
• '
' '
(ii) g : R ---> R is d(fferentiable at u with derivat;ve g< 1 l ( u) . Then (5.3. 16) '
'
Proof.
By definition of the derivative, for every £
> 0 there exists a 8 > 0 such that
lv - ul < li '* lg(v) - g(u) - g<'>(u)(v - u)l <
(a) Note that (i) '*
an(Un - u)
(b)
=
Op(l)
'
'
• •
'
' '
Un - U = Op(a;;: 1 ) = Op (1).
(c) '
..
Using
'
'
' ' •
l
(c), for every li > 0
'
'
P[IUn - ul < li] � l
(d) and, hence, from (a}, for every
£
>
0,
'
i
P ]lg(Un ) - g(u) - g < 'l(u)( Un - u)l :0:
(e) But (e) implies
(f)
'J
from (b). Therefore, I
(g)
i'
But, by hypothesis,
I I
''
\ '
' '
I
'
an(Un - u) !:. V and the result follows.
The theorem follows from the central limit theorem letting
v-
N(o, a2 ).
'
Un = X, an = n112, u
Note that (5.3.15) "explains" Lemma 5.3.1. Formally we expect that if $V_n \xrightarrow{\mathcal{L}} V$, then $EV_n^j \to EV^j$ (although this need not be true; see Problems 5.3.32 and B.7.8). Consider $V_n = \sqrt{n}(\bar X - \mu) \xrightarrow{\mathcal{L}} V \sim N(0, \sigma^2)$. Thus, we expect
\[ n^{j/2}E(\bar X - \mu)^j = EV_n^j \to \sigma^j EZ^j \tag{5.3.17} \]
where $Z \sim N(0,1)$. But $EZ^j > 0$ if $j$ is even, whereas $EZ^j = 0$ if $j$ is odd. Then (5.3.17) yields $E(\bar X - \mu)^j = O(n^{-j/2})$ for $j$ even and $E(\bar X - \mu)^j = o(n^{-j/2})$ for $j$ odd.
Example 5.3.3. "t" Statistics.

(a) The One-Sample Case. Let $X_1, \ldots, X_n$ be i.i.d. $F \in \mathcal{F}$ where $E_F(X_1) = \mu$, $\mathrm{Var}_F(X_1) = \sigma^2 < \infty$. A statistic for testing the hypothesis $H : \mu = 0$ versus $K : \mu > 0$ is
\[ T_n = \frac{\sqrt{n}\,\bar X}{s}, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2. \]
If $\mathcal{F} = \{$Gaussian distributions$\}$, we can obtain the critical value $t_{n-1}(1-\alpha)$ for $T_n$ from the $\mathcal{T}_{n-1}$ distribution. In general we claim that if $F \in \mathcal{F}$ and $H$ is true, then
\[ T_n \xrightarrow{\mathcal{L}} N(0, 1). \tag{5.3.18} \]
In particular this implies not only that $t_{n-1}(1-\alpha) \to z_{1-\alpha}$ but that the $t_{n-1}(1-\alpha)$ critical value (or $z_{1-\alpha}$) is approximately correct if $H$ is true and $F$ is not Gaussian. For the proof note that
\[ U_n = \sqrt{n}\,\frac{(\bar X - \mu)}{\sigma} \xrightarrow{\mathcal{L}} N(0,1) \]
by the central limit theorem, and
\[ \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2 \xrightarrow{P} \sigma^2 \]
by Theorem 5.2.2 and Slutsky's theorem. Now Slutsky's theorem yields (5.3.18) because $T_n = U_n/(s_n/\sigma) = g(U_n, s_n/\sigma)$, where $g(u, v) = u/v$.

(b) The Two-Sample Case. Let $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ be two independent samples with $\mu_1 = E(X_1)$, $\sigma_1^2 = \mathrm{Var}(X_1)$, $\mu_2 = E(Y_1)$ and $\sigma_2^2 = \mathrm{Var}(Y_1)$. Consider testing $H : \mu_1 = \mu_2$ versus $K : \mu_2 > \mu_1$. In Example 4.9.3 we saw that the two-sample $t$ statistic
\[ S_n = \sqrt{\frac{n_1 n_2}{n}}\,\frac{(\bar Y - \bar X)}{s}, \qquad n = n_1 + n_2, \]
has a $\mathcal{T}_{n-2}$ distribution under $H$ when the X's and Y's are normal with $\sigma_1^2 = \sigma_2^2$. Using the central limit theorem, Slutsky's theorem, and the foregoing arguments, we find (Problem 5.3.28) that if $n_1/n \to \lambda$, $0 < \lambda < 1$, then
\[ S_n \xrightarrow{\mathcal{L}} N\Big(0, \frac{(1-\lambda)\sigma_1^2 + \lambda\sigma_2^2}{\lambda\sigma_1^2 + (1-\lambda)\sigma_2^2}\Big). \]
It follows that if $n_1 = n_2$ or $\sigma_1^2 = \sigma_2^2$, then the critical value $t_{n-2}(1-\alpha)$ for $S_n$ is approximately correct if $H$ is true and the X's and Y's are not normal.
Monte Carlo Simulation

As mentioned in Section 5.1, approximations based on asymptotic results should be checked by Monte Carlo simulations. We illustrate such simulations for the preceding $t$ tests by generating data from the $\chi^2_d$ distribution $M$ times independently, each time computing the value of the $t$ statistic and then giving the proportion of times out of $M$ that the $t$ statistic exceeds the critical value from the $t$ table. Here we use the $\chi^2_d$ distribution because for small to moderate $d$ it is quite different from the normal distribution. Other distributions should also be tried. Figure 5.3.1 shows that for the one-sample $t$ test, when $\alpha = 0.05$, the asymptotic result gives a good approximation when $n \ge 10^{1.5} \approx 32$ and the true distribution $F$ is $\chi^2_d$ with $d \ge 10$. The $\chi^2_2$ distribution is extremely skew, and in this case the $t_{n-1}(0.95)$ approximation is only good for $n \ge 10^{2.5} \approx 316$.

[Figure 5.3.1. Each plotted point represents the results of 10,000 one-sample $t$ tests using $\chi^2_d$ data, where $d$ is either 2, 10, 20, or 50, as indicated in the plot. The simulations are repeated for different sample sizes and the observed significance levels are plotted against $\log_{10}$ sample size.]
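The following is a minimal illustrative sketch, not part of the original text, in the spirit of Figure 5.3.1: it estimates the true level of the nominal 5% one-sample $t$ test when the data are centered $\chi^2_d$ rather than Gaussian. The values of $M$, $d$, and the sample sizes are arbitrary choices (the book's figures use $M = 10{,}000$).

```python
# Sketch (not from the text): Monte Carlo estimate of the observed level of the
# one-sample t test under chi-square_d data (centered so that H: mu = 0 holds).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
M, alpha = 10_000, 0.05

for d in (2, 10, 50):
    for n in (10, 32, 100, 316):
        x = rng.chisquare(d, size=(M, n)) - d              # E(X_i) = 0 under H
        t = np.sqrt(n) * x.mean(axis=1) / x.std(axis=1, ddof=1)
        level = np.mean(t > stats.t.ppf(1 - alpha, df=n - 1))
        print(f"d={d:3d} n={n:4d}  observed level {level:.3f}")
```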
I
''
I '
'.
'
.
l
''
'
II
For the two-sample $t$ tests, Figure 5.3.2 shows that when $\sigma_1^2 = \sigma_2^2$ and $n_1 = n_2$, the $t_{n-2}(1-\alpha)$ critical value is a very good approximation even for small $n$ and for $X, Y \sim \chi^2_d$. This is because, in this case, $\bar Y - \bar X = \frac{1}{n_1}\sum_{i=1}^{n_1}(Y_i - X_i)$, and the $Y_i - X_i$ have a symmetric distribution. Other Monte Carlo runs (not shown) with $\sigma_1^2 \ne \sigma_2^2$ show that as long as $n_1 = n_2$, the $t_{n-2}(0.95)$ approximation is good for $n_1 \ge 100$, even when the X's and Y's have different $\chi^2_d$ distributions, scaled to have the same means, and $\sigma_2^2 = 12\sigma_1^2$. Moreover, the $t_{n-2}(1-\alpha)$ approximation is good when $n_1 \ne n_2$ and $\sigma_1^2 = \sigma_2^2$. However, as we see from the limiting law of $S_n$ and Figure 5.3.3, when both $n_1 \ne n_2$ and $\sigma_1^2 \ne \sigma_2^2$, the two-sample $t$ tests with critical region $1\{S_n \ge t_{n-2}(1-\alpha)\}$ do not have approximate level $\alpha$. In this case Monte Carlo studies have shown that the test in Section 4.9.4 based on Welch's approximation works well.

[Figure 5.3.2. Each plotted point represents the results of 10,000 two-sample $t$ tests. For each simulation the two samples are the same size (the size indicated on the $x$-axis), $\sigma_1^2 = \sigma_2^2$, and the data are $\chi^2_d$ where $d$ is one of 2, 10, or 50.]
[Figure 5.3.3. Each plotted point represents the results of 10,000 two-sample $t$ tests. For each simulation the two samples differ in size: the second sample is twice the size of the first. The $x$-axis denotes the size of the smaller of the two samples. The data in the first sample are $N(0,1)$ and in the second they are $N(0,\sigma^2)$ where $\sigma^2$ takes on the values 1, 3, 6, and 9, as indicated in the plot.]

Next, in the one-sample situation, let $h(\bar X)$ be an estimate of $h(\mu)$ where $h$ is continuously differentiable at $\mu$ and $h^{(1)}(\mu) \ne 0$. By Theorem 5.3.3, $\sqrt{n}[h(\bar X) - h(\mu)] \xrightarrow{\mathcal{L}} N(0, \sigma^2[h^{(1)}(\mu)]^2)$, and a natural statistic for testing $H : h(\mu) = h_0$ is
\[ T_n = \frac{\sqrt{n}\,[h(\bar X) - h_0]}{s\,|h^{(1)}(\bar X)|}. \]
Combining Theorem 5.3.3 and Slutsky's theorem, we see that here, too, if $H$ is true then $T_n \xrightarrow{\mathcal{L}} N(0,1)$
so that $z_{1-\alpha}$ is the asymptotic critical value.

Variance Stabilizing Transformations

Example 5.3.4. In Appendices A and B we encounter several important families of distributions, such as the binomial, Poisson, gamma, and beta, which are indexed by one or more parameters. If we take a sample from a member of one of these families, then the sample mean $\bar X$ will be approximately normally distributed with variance $\sigma^2/n$ depending on the parameters indexing the family considered. We have seen that smooth transformations $h(\bar X)$ are also approximately normally distributed. It turns out to be useful to know transformations $h$, called variance stabilizing, such that $\mathrm{Var}\,h(\bar X)$ is approximately independent of the parameters indexing the family we are considering. From (5.3.6) and (5.3.13) we see that a first approximation to the variance of $h(\bar X)$ is $[h^{(1)}(\mu)]^2\sigma^2/n$. Thus, finding a variance stabilizing transformation is equivalent to finding a function $h$ such that
\[ [h^{(1)}(\mu)]^2\sigma^2 \equiv c > 0 \tag{5.3.19} \]
for all $\mu$ and $\sigma^2$ appropriate to our family. Such a function can usually be found if $\sigma$ depends only on $\mu$, which varies freely. In this case (5.3.19) is an ordinary differential equation.

As an example, suppose that $X_1, \ldots, X_n$ is a sample from a $\mathcal{P}(\lambda)$ family. In this case $\sigma^2 = \lambda$ and $\mathrm{Var}(\bar X) = \lambda/n$. To have $\mathrm{Var}\,h(\bar X)$ approximately constant in $\lambda$, $h$ must satisfy the differential equation $[h^{(1)}(\lambda)]^2\lambda = c > 0$ for some arbitrary $c > 0$. If we require that $h$ is increasing, this leads to $h^{(1)}(\lambda) = \sqrt{c/\lambda}$, $\lambda > 0$, which has as its solution $h(\lambda) = 2\sqrt{c\lambda} + d$, where $d$ is arbitrary. Thus, $h(t) = \sqrt{t}$ is a variance stabilizing transformation of $\bar X$ for the Poisson family of distributions. Substituting in (5.3.6) we find $\mathrm{Var}(\sqrt{\bar X}) \approx 1/4n$ and $\sqrt{n}\big((\bar X)^{1/2} - \lambda^{1/2}\big)$ has approximately a $N(0, 1/4)$ distribution. $\Box$

One application of variance stabilizing transformations, by their definition, is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals. Thus, in the preceding $\mathcal{P}(\lambda)$ case,
\[ \sqrt{\bar X} \pm \frac{z(1 - \tfrac{1}{2}\alpha)}{2\sqrt{n}} \]
is an approximate $1 - \alpha$ confidence interval for $\sqrt{\lambda}$.

A second application occurs for models where the families of distributions for which variance stabilizing transformations exist are used as building blocks of larger models. Major examples are the generalized linear models of Section 6.5. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II. Some further examples of variance stabilizing transformations are given in the problems.

The notion of such transformations can be extended to the following situation. Suppose $\hat\gamma_n = \hat\gamma_n(X_1, \ldots, X_n)$ is an estimate of a real parameter $\gamma$ indexing a family of distributions from which $X_1, \ldots, X_n$ are an i.i.d. sample, and suppose that $\sqrt{n}(\hat\gamma_n - \gamma) \xrightarrow{\mathcal{L}} N(0, \sigma^2(\gamma))$. Then again, a variance stabilizing transformation $h$ is such that
\[ \sqrt{n}\big(h(\hat\gamma_n) - h(\gamma)\big) \xrightarrow{\mathcal{L}} N(0, c) \]
for all $\gamma$. See Example 5.3.6. Also closely related, but different, are so-called normalizing transformations; see Problems 5.3.15 and 5.3.16.
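The following is a minimal illustrative sketch, not part of the original text: it checks numerically that the square-root transformation stabilizes the variance of the Poisson sample mean and that the fixed-width interval for $\sqrt{\lambda}$ has approximately the nominal coverage. The values of $\lambda$, $n$, and the number of replications are arbitrary choices.

```python
# Sketch (not from the text): sqrt(Xbar) as a variance stabilizer for Poisson data,
# and coverage of the fixed-width interval sqrt(Xbar) +/- z/(2 sqrt(n)) for sqrt(lambda).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, alpha = 40, 100_000, 0.05
z = stats.norm.ppf(1 - alpha / 2)

for lam in (0.5, 2.0, 8.0):
    xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
    print(f"lam={lam:4.1f}  Var(Xbar)={xbar.var():.4f}  "
          f"Var(sqrt(Xbar))={np.sqrt(xbar).var():.4f}  (1/4n = {1/(4*n):.4f})")
    cover = np.mean(np.abs(np.sqrt(xbar) - np.sqrt(lam)) <= z / (2 * np.sqrt(n)))
    print(f"           coverage of the interval for sqrt(lam): {cover:.3f}")
```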
Edgeworth Approximations

The normal approximation to the distribution of $\bar X$ utilizes only the first two moments of $\bar X$. Under general conditions (Bhattacharya and Rao, 1976, p. 538) one can improve on the normal approximation by utilizing the third and fourth moments. Let $F_n$ denote the distribution of $T_n = \sqrt{n}(\bar X - \mu)/\sigma$ and let $\gamma_{1n}$ and $\gamma_{2n}$ denote the coefficients of skewness and (excess) kurtosis of $T_n$. Then under some conditions,$^{(1)}$
\[ F_n(x) = \Phi(x) - \varphi(x)\Big[\frac{\gamma_{1n}}{6}H_2(x) + \frac{\gamma_{2n}}{24}H_3(x) + \frac{\gamma_{1n}^2}{72}H_5(x)\Big] + r_n \tag{5.3.20} \]
where $r_n$ tends to zero at a rate faster than $1/n$ and $H_2$, $H_3$, and $H_5$ are Hermite polynomials defined by
\[ H_2(x) = x^2 - 1, \quad H_3(x) = x^3 - 3x, \quad H_5(x) = x^5 - 10x^3 + 15x. \tag{5.3.21} \]
The expansion (5.3.20) is called the Edgeworth expansion for $F_n$.

Example 5.3.5. Edgeworth Approximations to the $\chi^2$ Distribution. Suppose $V \sim \chi^2_n$. According to Theorem B.3.1, $V$ has the same distribution as $\sum_{i=1}^n X_i^2$, where the $X_i$ are independent and $X_i \sim N(0,1)$, $i = 1, \ldots, n$. It follows from the central limit theorem that $T_n = \big(\sum_{i=1}^n X_i^2 - n\big)/\sqrt{2n} = (V - n)/\sqrt{2n}$ has approximately a $N(0,1)$ distribution. To improve on this approximation, we need only compute $\gamma_{1n}$ and $\gamma_{2n}$. We can use Problem B.2.4 to compute
\[ \gamma_{1n} = \frac{E(V - n)^3}{(2n)^{3/2}} = \frac{2\sqrt{2}}{\sqrt{n}}, \qquad \gamma_{2n} = \frac{E(V - n)^4}{(2n)^2} - 3 = \frac{12}{n}. \]
Therefore,
\[ F_n(x) \approx \Phi(x) - \varphi(x)\Big[\frac{\sqrt{2}}{3\sqrt{n}}(x^2 - 1) + \frac{1}{2n}(x^3 - 3x) + \frac{1}{9n}(x^5 - 10x^3 + 15x)\Big]. \]
Table 5.3.1 gives this approximation together with the exact distribution and the normal approximation when $n = 10$. $\Box$

[Table 5.3.1. Edgeworth (EA) and normal (NA) approximations to the $\chi^2_{10}$ distribution, $P(T_n \le x)$, where $T_n$ is a standardized $\chi^2_{10}$ random variable. The numerical entries are not reproduced here.]
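The following is a minimal illustrative sketch, not part of the original text: it evaluates the Edgeworth approximation of Example 5.3.5 for the standardized $\chi^2_n$ distribution and compares it with the exact cdf and the plain normal approximation. The degrees of freedom and the $x$ grid are arbitrary choices.

```python
# Sketch (not from the text): two-term Edgeworth approximation (5.3.20) for the
# standardized chi-square_n distribution, compared with exact and normal values.
import numpy as np
from scipy import stats

def edgeworth_chi2(x, n):
    """Approximate P(T_n <= x) for T_n = (V - n)/sqrt(2n), V ~ chi2_n."""
    g1 = 2 * np.sqrt(2.0 / n)      # skewness of T_n
    g2 = 12.0 / n                  # excess kurtosis of T_n
    H2, H3, H5 = x**2 - 1, x**3 - 3*x, x**5 - 10*x**3 + 15*x
    return stats.norm.cdf(x) - stats.norm.pdf(x) * (g1/6*H2 + g2/24*H3 + g1**2/72*H5)

n = 10
for x in (-1.5, -0.5, 0.5, 1.5, 2.5):
    exact = stats.chi2.cdf(n + x * np.sqrt(2 * n), df=n)
    print(f"x={x:5.2f}  exact {exact:.4f}  Edgeworth {edgeworth_chi2(x, n):.4f}  "
          f"normal {stats.norm.cdf(x):.4f}")
```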
The Multivariate Case
Lemma 5.3.2 extends to the d-variate case.
Lemma 5.3.3. Suppose $\{U_n\}$ are $d$-dimensional random vectors and that for some sequence of constants $\{a_n\}$ with $a_n \to \infty$ as $n \to \infty$,

(i) $a_n(U_n - u) \xrightarrow{\mathcal{L}} V_{d\times 1}$ for some $d \times 1$ vector of constants $u$;

(ii) $g : R^d \to R^p$ has a differential $g^{(1)}_{p\times d}(u)$ at $u$.

Then
\[ a_n\big(g(U_n) - g(u)\big) \xrightarrow{\mathcal{L}} g^{(1)}(u)\,V. \]

Proof. The proof follows from the arguments of the proof of Lemma 5.3.2. $\Box$

Example 5.3.6. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. as $(X, Y)$ where $0 < EX^4 < \infty$ and $0 < EY^4 < \infty$. Let $\rho^2 = \mathrm{Cov}^2(X,Y)/\sigma_1^2\sigma_2^2$ where $\sigma_1^2 = \mathrm{Var}\,X$, $\sigma_2^2 = \mathrm{Var}\,Y$, and let $r^2 = \hat C^2/\hat\sigma_1^2\hat\sigma_2^2$ where $\hat C$, $\hat\sigma_1^2$, $\hat\sigma_2^2$ are the sample covariance and variances. Recall from Section 4.9.5 that in the bivariate normal case the sample correlation coefficient $r$ is the MLE of the population correlation coefficient $\rho$ and that the likelihood ratio test of $H : \rho = 0$ is based on $|r|$. We can write $r^2 = g(\hat C, \hat\sigma_1^2, \hat\sigma_2^2)$ where $g(u_1, u_2, u_3) = u_1^2/u_2 u_3 : R^3 \to R$. Because of the location and scale invariance of $\rho$ and $r$, we can use the transformations $\tilde X_i = (X_i - \mu_1)/\sigma_1$ and $\tilde Y_j = (Y_j - \mu_2)/\sigma_2$ to conclude that without loss of generality we may assume $\mu_1 = \mu_2 = 0$, $\sigma_1^2 = \sigma_2^2 = 1$, $\rho = E(XY)$. Using the central limit and Slutsky's theorems, we can show (Problem 5.3.9) that $\sqrt{n}(\hat C - \rho)$, $\sqrt{n}(\hat\sigma_1^2 - 1)$ and $\sqrt{n}(\hat\sigma_2^2 - 1)$ jointly have the same asymptotic distribution as $\sqrt{n}(U_n - u)$ where
\[ U_n = \Big(n^{-1}\sum X_iY_i,\ n^{-1}\sum X_i^2,\ n^{-1}\sum Y_i^2\Big) \]
and $u = (\rho, 1, 1)$. Let $\tau^2_{k,j} = \mathrm{Var}(X^kY^j)$ and $\lambda_{k,j,m,l} = \mathrm{Cov}(X^kY^j, X^mY^l)$; then by the central limit theorem
\[ \sqrt{n}(U_n - u) \xrightarrow{\mathcal{L}} N(0, \Sigma), \qquad \Sigma = \begin{pmatrix} \tau^2_{1,1} & \lambda_{1,1,2,0} & \lambda_{1,1,0,2} \\ \lambda_{1,1,2,0} & \tau^2_{2,0} & \lambda_{2,0,0,2} \\ \lambda_{1,1,0,2} & \lambda_{2,0,0,2} & \tau^2_{0,2} \end{pmatrix}. \tag{5.3.22} \]
Next we compute
\[ g^{(1)}(u) = \big(2u_1/u_2u_3,\ -u_1^2/u_2^2u_3,\ -u_1^2/u_2u_3^2\big) = (2\rho, -\rho^2, -\rho^2). \]
It follows from Lemma 5.3.3 and (B.5.6) that $\sqrt{n}(r^2 - \rho^2)$ is asymptotically normal, $N(0, \sigma_0^2)$, with
\[ \sigma_0^2 = 4\rho^2\tau^2_{1,1} + \rho^4\tau^2_{2,0} + \rho^4\tau^2_{0,2} + 2\big\{-2\rho^3\lambda_{1,1,2,0} - 2\rho^3\lambda_{1,1,0,2} + \rho^4\lambda_{2,0,0,2}\big\}. \]
When $(X, Y) \sim N(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$, then $\sigma_0^2 = 4\rho^2(1-\rho^2)^2$, and (Problem 5.3.9)
\[ \sqrt{n}(r - \rho) \xrightarrow{\mathcal{L}} N(0, (1-\rho^2)^2). \]
Referring to (5.3.19), we see (Problem 5.3.10) that in the bivariate normal case a variance stabilizing transformation $h(r)$ with $\sqrt{n}[h(r) - h(\rho)] \xrightarrow{\mathcal{L}} N(0,1)$ is achieved by choosing
\[ h(\rho) = \frac{1}{2}\log\Big(\frac{1+\rho}{1-\rho}\Big). \]
The approximation based on this transformation, which is called Fisher's $z$, has been studied extensively and it has been shown (e.g., David, 1938) that
\[ \mathcal{L}\big(\sqrt{n-3}\,(h(r) - h(\rho))\big) \]
is closely approximated by the $N(0,1)$ distribution; that is, $P(r \le c) \approx \Phi\big(\sqrt{n-3}\,(h(c) - h(\rho))\big)$. The resulting approximate $1-\alpha$ confidence interval for $\rho$ has endpoints
\[ \rho = \tanh\big\{h(r) \pm z(1 - \tfrac{1}{2}\alpha)/\sqrt{n-3}\big\} \]
where tanh is the hyperbolic tangent. $\Box$
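The following is a minimal illustrative sketch, not part of the original text: it computes Fisher's $z$ confidence interval for a bivariate normal correlation. The data-generating $\rho$, the sample size, and the nominal level are arbitrary choices.

```python
# Sketch (not from the text): Fisher's z interval for rho, as in Example 5.3.6.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho, n, alpha = 0.6, 30, 0.05

cov = np.array([[1.0, rho], [rho, 1.0]])
x, y = rng.multivariate_normal([0, 0], cov, size=n).T
r = np.corrcoef(x, y)[0, 1]

z = np.arctanh(r)                                   # h(r) = (1/2) log((1+r)/(1-r))
half = stats.norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
lo, hi = np.tanh(z - half), np.tanh(z + half)
print(f"r = {r:.3f}, approximate {100*(1-alpha):.0f}% CI for rho: ({lo:.3f}, {hi:.3f})")
```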
.
Here is an extension of Theorem 5.3.3.

Theorem 5.3.4. Suppose $Y_1, \ldots, Y_n$ are independent identically distributed $d$-vectors with $E|Y_1|^2 < \infty$, $EY_1 = m$, $\mathrm{Var}\,Y_1 = \Sigma$, and $h : \mathcal{O} \to R^p$ where $\mathcal{O}$ is an open subset of $R^d$, $h = (h_1, \ldots, h_p)$, and $h$ has a total differential $h^{(1)}(m) = \big\|\frac{\partial h_i}{\partial m_j}(m)\big\|_{p\times d}$. Then
\[ h(\bar Y) = h(m) + h^{(1)}(m)(\bar Y - m) + o_p(n^{-1/2}) \tag{5.3.23} \]
and
\[ \sqrt{n}\big(h(\bar Y) - h(m)\big) \xrightarrow{\mathcal{L}} N\big(0, h^{(1)}(m)\,\Sigma\,[h^{(1)}(m)]^T\big). \tag{5.3.24} \]

Proof. Argue as before using B.8.5:

(a) $h(y) = h(m) + h^{(1)}(m)(y - m) + o(|y - m|)$,

(b) $\sqrt{n}(\bar Y - m) \xrightarrow{\mathcal{L}} N(0, \Sigma)$,

so that

(c) $\sqrt{n}\big(h(\bar Y) - h(m)\big) = \sqrt{n}\,h^{(1)}(m)(\bar Y - m) + o_p(1)$. $\Box$
Example 5.3.7. $\chi^2$ and Normal Approximation to the Distribution of F Statistics. Suppose that $X_1, \ldots, X_n$ is a sample from a $N(0,1)$ distribution. Then according to Corollary B.3.1, the F statistic
\[ T_{k,m} = \frac{(1/k)\sum_{i=1}^k X_i^2}{(1/m)\sum_{i=k+1}^{k+m} X_i^2} \tag{5.3.25} \]
has an $\mathcal{F}_{k,m}$ distribution, where $k + m = n$. Suppose that $n \ge 60$ so that Table IV cannot be used for the distribution of $T_{k,m}$. When $k$ is fixed and $m$ (or equivalently $n = k + m$) is large, we can use Slutsky's theorem (A.14.9) to find an approximation to the distribution of $T_{k,m}$. To show this, we first note that $(1/m)\sum_{i=k+1}^{k+m} X_i^2$ is the average of $m$ independent $\chi^2_1$ random variables. By Theorem B.3.1, the mean of a $\chi^2_1$ variable is $E(Z^2)$, where $Z \sim N(0,1)$. But $E(Z^2) = \mathrm{Var}(Z) = 1$. Now the weak law of large numbers (A.15.7) implies that, as $m \to \infty$,
\[ \frac{1}{m}\sum_{i=k+1}^{k+m} X_i^2 \xrightarrow{P} 1. \]
Using the (b) part of Slutsky's theorem, we conclude that for fixed $k$,
\[ T_{k,m} \xrightarrow{\mathcal{L}} \frac{1}{k}\sum_{i=1}^k X_i^2 \]
as $m \to \infty$. By Theorem B.3.1, $\sum_{i=1}^k X_i^2$ has a $\chi^2_k$ distribution. Thus, when the number of degrees of freedom in the denominator is large, the $\mathcal{F}_{k,m}$ distribution can be approximated by the distribution of $V/k$, where $V \sim \chi^2_k$.

To get an idea of the accuracy of this approximation, check the entries of Table IV against the last row. This row, which is labeled $m = \infty$, gives the quantiles of the distribution of $V/k$. For instance, if $k = 5$ and $m = 60$, then $P[T_{5,60} \ge 2.37] = P[(V/k) \ge 2.21] = 0.05$, so the respective upper 0.05 critical values are 2.37 for the $\mathcal{F}_{5,60}$ distribution and 2.21 for the distribution of $V/k$. See also Figure B.3.1 in which the density of $V/k$, when $k = 10$, is given as the $\mathcal{F}_{10,\infty}$ density.

Next we turn to the normal approximation to the distribution of $T_{k,m}$. Suppose for simplicity that $k = m$ and $k \to \infty$. We write $T_k$ for $T_{k,k}$. The case $m = \lambda k$ for some $\lambda > 0$ is left to the problems. We do not require the $X_i$ to be normal, only that they be i.i.d. with $EX_1 = 0$, $EX_1^2 > 0$ and $EX_1^4 < \infty$. Then, if $\sigma^2 = \mathrm{Var}(X_1)$, we can write
\[ T_k = \frac{k^{-1}\sum_{i=1}^k Y_{i1}}{k^{-1}\sum_{i=1}^k Y_{i2}} = h(\bar Y) \tag{5.3.26} \]
where $Y_{i1} = X_i^2/\sigma^2$ and $Y_{i2} = X_{k+i}^2/\sigma^2$, $i = 1, \ldots, k$. Equivalently $T_k = h(\bar Y)$ with $Y_i = (Y_{i1}, Y_{i2})^T$, $E(Y_i) = (1,1)^T$ and $h(u,v) = u/v$. By Theorem 5.3.4,
\[ \sqrt{k}\,(T_k - 1) = \sqrt{k}\,\big(h(\bar Y) - h(\mathbf{1})\big) \xrightarrow{\mathcal{L}} N\big(0,\ h^{(1)}(\mathbf{1})\,\Sigma\,[h^{(1)}(\mathbf{1})]^T\big) \tag{5.3.27} \]
where $\mathbf{1} = (1,1)^T$, $h^{(1)}(u,v) = (1/v, -u/v^2)$ and $\Sigma = \mathrm{Var}(Y_{11})\,J$, where $J$ is the $2\times 2$ identity. We conclude that
\[ \sqrt{k}\,(T_k - 1) \xrightarrow{\mathcal{L}} N(0, 2\,\mathrm{Var}(Y_{11})). \]
In particular if $X_1 \sim N(0, \sigma^2)$, then $\mathrm{Var}(Y_{11}) = 2$ and, as $k \to \infty$,
\[ \sqrt{k}\,(T_k - 1) \xrightarrow{\mathcal{L}} N(0, 4). \]
In general, when $\min\{k, m\} \to \infty$, the distribution of $\sqrt{mk/(m+k)}\,(T_{k,m} - 1)$ can be approximated by a $N(0, 2)$ distribution. Thus (Problem 5.3.7), when $X_i \sim N(0, \sigma^2)$,
\[ P[T_{k,m} \le t] = P\Big[\sqrt{\tfrac{mk}{m+k}}\,(T_{k,m} - 1) \le \sqrt{\tfrac{mk}{m+k}}\,(t - 1)\Big] \approx \Phi\Big(\sqrt{\tfrac{mk}{m+k}}\,(t-1)/\sqrt{2}\Big). \tag{5.3.28} \]
An interesting and important point (noted by Box, 1953) is that, unlike the $t$ test, the F test for equality of variances (Problem 5.3.8(a)) does not have robustness of level. Specifically, if $\mathrm{Var}(X_1^2) \ne 2\sigma^4$, the upper $\mathcal{F}_{k,m}$ critical value $f_{k,m}(1-\alpha)$, which by (5.3.28) satisfies
\[ \sqrt{\tfrac{mk}{m+k}}\,\big(f_{k,m}(1-\alpha) - 1\big)/\sqrt{2} \approx z_{1-\alpha}, \quad \text{or} \quad f_{k,m}(1-\alpha) \approx 1 + z_{1-\alpha}\sqrt{\tfrac{2(m+k)}{mk}}, \]
is asymptotically incorrect. In general (Problem 5.3.8(c)) one has to use the critical value
\[ c_{k,m} = 1 + z_{1-\alpha}\sqrt{\frac{\kappa(m+k)}{mk}} \tag{5.3.29} \]
where $\kappa = \mathrm{Var}\big[(X_1 - \mu_1)/\sigma_1\big]^2$, $\mu_1 = E(X_1)$, and $\sigma_1^2 = \mathrm{Var}(X_1)$. When $\kappa$ is unknown, it can be estimated by the method of moments (Problem 5.3.8(d)). $\Box$
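The following is a minimal illustrative sketch, not part of the original text: it compares the exact $\mathcal{F}_{k,m}$ cdf with the $\chi^2_k/k$ approximation and with the normal approximation (5.3.28) at a few arbitrary values of $(k, m, t)$.

```python
# Sketch (not from the text): chi-square/k and normal approximations to F_{k,m}.
import numpy as np
from scipy import stats

for k, m, t in [(5, 60, 2.37), (10, 200, 1.5), (40, 40, 1.3)]:
    exact = stats.f.cdf(t, k, m)
    chi2_over_k = stats.chi2.cdf(k * t, df=k)                                # large-m limit
    normal = stats.norm.cdf(np.sqrt(m * k / (m + k)) * (t - 1) / np.sqrt(2))  # (5.3.28)
    print(f"k={k:3d} m={m:3d} t={t:4.2f}  exact {exact:.4f}  "
          f"chi2/k {chi2_over_k:.4f}  normal {normal:.4f}")
```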
5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families
Our final application of the delta method follows.

Theorem 5.3.5. Suppose $\mathcal{P}$ is a canonical exponential family of rank $d$ generated by $T$ with $\mathcal{E}$ open. Then if $X_1, \ldots, X_n$ are a sample from $P_\eta \in \mathcal{P}$ and $\hat{\boldsymbol\eta}$ is defined as the MLE if it exists and equal to $c$ (some fixed value) otherwise,

(i) $\hat{\boldsymbol\eta} = \boldsymbol\eta + \dfrac{1}{n}\displaystyle\sum_{i=1}^n \ddot A^{-1}(\boldsymbol\eta)\big(T(X_i) - \dot A(\boldsymbol\eta)\big) + o_{P_{\boldsymbol\eta}}(n^{-1/2})$,

(ii) $\mathcal{L}_{\boldsymbol\eta}\big(\sqrt{n}(\hat{\boldsymbol\eta} - \boldsymbol\eta)\big) \to N_d\big(0, \ddot A^{-1}(\boldsymbol\eta)\big)$.

Proof. The result is a consequence of Theorems 5.2.2 and 5.3.4. We showed in the proof of Theorem 5.2.2 that, if $\bar T = \frac{1}{n}\sum_{i=1}^n T(X_i)$, then $P_{\boldsymbol\eta}[\bar T \in \dot A(\mathcal{E})] \to 1$ and, hence, $P_{\boldsymbol\eta}[\hat{\boldsymbol\eta} = \dot A^{-1}(\bar T)] \to 1$. Identify $h$ in Theorem 5.3.4 with $\dot A^{-1}$ and $m$ with $\dot A(\boldsymbol\eta)$. Note that by B.8.14, if $t = \dot A(\boldsymbol\eta)$,
\[ D\dot A^{-1}(t) = \big[D\dot A(\boldsymbol\eta)\big]^{-1}. \tag{5.3.30} \]
But $D\dot A = \ddot A$ by definition and, thus, in our case,
\[ h^{(1)}\big(\dot A(\boldsymbol\eta)\big) = \ddot A^{-1}(\boldsymbol\eta). \tag{5.3.31} \]
Thus, (i) follows from (5.3.23). For (ii) simply note that, in our case, by Corollary 1.6.1,
\[ \Sigma = \mathrm{Var}\big(T(X_1)\big) = \ddot A(\boldsymbol\eta) \]
and, therefore,
\[ h^{(1)}(m)\,\Sigma\,[h^{(1)}(m)]^T = \ddot A^{-1}(\boldsymbol\eta)\,\ddot A(\boldsymbol\eta)\,\ddot A^{-1}(\boldsymbol\eta) = \ddot A^{-1}(\boldsymbol\eta). \tag{5.3.32} \]
Hence, (ii) follows from (5.3.24). $\Box$

Remark 5.3.1. Recall that $\ddot A(\boldsymbol\eta) = \mathrm{Var}_{\boldsymbol\eta}(T) = I(\boldsymbol\eta)$ is the Fisher information. Thus, the asymptotic variance matrix $I^{-1}(\boldsymbol\eta)$ of $\sqrt{n}(\hat{\boldsymbol\eta} - \boldsymbol\eta)$ equals the lower bound (3.4.38) on the variance matrix of $\sqrt{n}(\tilde{\boldsymbol\eta} - \boldsymbol\eta)$ for any unbiased estimator $\tilde{\boldsymbol\eta}$. This is an "asymptotic efficiency" property of the MLE we return to in Section 6.2.1.
Example 5.3.8. Let $X_1, \ldots, X_n$ be i.i.d. as $X$ with $X \sim N(\mu, \sigma^2)$. Then $T_1 = \bar X$ and $T_2 = n^{-1}\sum X_i^2$ are sufficient statistics in the canonical model. Now
\[ \sqrt{n}\big[T_1 - \mu,\ T_2 - (\mu^2 + \sigma^2)\big] \xrightarrow{\mathcal{L}} N\big(0, 0, \Sigma(\theta)\big) \tag{5.3.33} \]
where, by Example 2.3.4,
\[ \Sigma(\theta) = \begin{pmatrix} \sigma^2 & 2\mu\sigma^2 \\ 2\mu\sigma^2 & 4\mu^2\sigma^2 + 2\sigma^4 \end{pmatrix}. \]
Here $\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/2\sigma^2$, $\hat\eta_1 = \bar X/\hat\sigma^2$, and $\hat\eta_2 = -1/2\hat\sigma^2$, where $\hat\sigma^2 = T_2 - T_1^2$. By Theorem 5.3.5,
\[ \sqrt{n}(\hat{\boldsymbol\eta} - \boldsymbol\eta) \xrightarrow{\mathcal{L}} N\big(0, \ddot A^{-1}(\boldsymbol\eta)\big). \]
Because $\bar X = T_1$ and $\hat\sigma^2 = T_2 - (T_1)^2$, we can use (5.3.33) and Theorem 5.3.4 to find (Problem 5.3.26)
\[ \sqrt{n}\big(\bar X - \mu,\ \hat\sigma^2 - \sigma^2\big) \xrightarrow{\mathcal{L}} N(0, 0, \Sigma_0) \]
where $\Sigma_0 = \mathrm{diag}(\sigma^2, 2\sigma^4)$. $\Box$
Summary. Consistency is 0th-order asymptotics. First-order asymptotics provides approximations to the difference between a quantity tending to a limit and the limit, for instance, the difference between a consistent estimate and the parameter it estimates. Second-order asymptotics provides approximations to the difference between the error and its first-order approximation, and so on. We begin in Section 5.3.1 by studying approximations to moments and central moments of estimates. Fundamental asymptotic formulae are derived for the bias and variance of an estimate, first for a smooth function of a scalar mean and then of a vector mean. These "delta method" approximations, based on Taylor's formula and elementary results about moments of means of i.i.d. variables, are explained in terms of similar stochastic approximations to $h(\bar Y) - h(\mu)$ where $Y_1, \ldots, Y_n$ are i.i.d. as $Y$, $EY = \mu$, and $h$ is smooth. These stochastic approximations lead to Gaussian approximations to the laws of important statistics. The moment and in-law approximations lead to the definition of variance stabilizing transformations for classical one-dimensional exponential families. Higher-order approximations to distributions (Edgeworth series) are discussed briefly. Finally, stochastic approximations in the case of vector statistics and parameters are developed, which lead to a result on the asymptotic normality of the MLE in multiparameter exponential families.

5.4
ASYMPTOTIC THEORY I N O N E DI MENSION
' l
.
'
'
l
i•
! ' •
•
I
.
,,
I
i
'
In this section we define and study asymptotic optimality for estimation, testing, and con fidence bounds, under i.i.d. sampling, when we are dealing with one-dimensional smooth parametric models. Specifically we shall show that important likelihood based procedures such as MLE's are asymptotically optimal. In Chapter 6 we sketch how these ideas can be extended to multi-dimensional parametric families.
j !
'
'
'
; '
'
' ..
;• '
I·
5.4.1
Estimation: The Multinomial Case
Following Fisher (1958),$^{(1)}$ we develop the theory first for the case that $X_1, \ldots, X_n$ are i.i.d. taking values in $\{x_0, \ldots, x_k\}$ only, so that $P$ is defined by $\mathbf{p} = (p_0, \ldots, p_k)$ where
\[ p_j = P[X_1 = x_j], \quad 0 \le j \le k, \tag{5.4.1} \]
and $\mathbf{p} \in \mathcal{S}$, the $(k+1)$-dimensional simplex (see Example 1.6.7). Thus, $N = (N_0, \ldots, N_k)$ where $N_j = \sum_{i=1}^n 1(X_i = x_j)$ is sufficient. We consider one-dimensional parametric submodels of $\mathcal{S}$ defined by $\mathcal{P} = \{(p(x_0, \theta), \ldots, p(x_k, \theta)) : \theta \in \Theta\}$, $\Theta$ open $\subset R$ (e.g., see Example 2.1.4 and Problem 2.1.15). We focus first on estimation of $\theta$. Assume

A: $\theta \mapsto p(x_j, \theta)$, $0 < p_j < 1$, is twice differentiable for $0 \le j \le k$.
i I
�
J
-------
,
II , ,
•
Section 5.4
325
Asymptotic Theory in One Dimension
Note that A implies that k
Iogp (X1, 8) � 2.:)ogp (x;, 8)1 (X1 � x;)
l(X1 ,8) is twice differentiable and
J=oO
(5.4.2)
g� (X1 , 8) is a well�defined, bounded random variable (5.4.3)
Furthermore (Section 3.4.2), (5.4.4)
and
g;� (X
1,
8) is similarly bounded and well defined with (5.4.5)
As usual we call 1(8) the Fisher infonnation. Next suppose we are given a plug-in estimator h ( r;:) (see (2.1.11 ) ) of (J where h: S�
R
satisfies
h(p(8)) � 8 for all 8 E e
(5.4.6)
where p(8) � (p(x0, 8), . . . , p (x., B))r. Many such h exist if k 2.1.4, for instance. Assume
> 1. Consider Example
H : h is differentiable. Then we have the following theorem. Theorem 5.4.1. Under H,Jor all 8. (5.4.7) where 172 (8, h ) is given by (5.4.11). Moreover, if A also holds,
"2 (8, h) > r' (B)
(5.4.8)
with eqUality if and only if, M & (p(8))
P,
_,
p(9)
�I
m
.
(B) &B (x;,B), 0 < J < k.
(5.4.9)
Asymptotic Approximations
326 Proof.
Apply Theorem 5.3.2 noting that
( (N) ) vn h -;; - h(p (B))
Note that, using the definition of
( N ) 8 p( h + op( l) . x;,8) vn &p; (p( )) : k &h
�
Nj ,
&h ( N ) h &p; (p (8)) -:;/: - p(x; , 8) n-1 kh &p; (p(B))( l (X, k &h
n
but also its asymptotic variance is
"'(8, h )
t
( ;;') - h(p (8))} asymptotically normal with mean 0,
'
Note that by differentiating (5.4.6), we obtain
j=O
k &h &p 2: 8 (p(8)) 88 (x;,8) � 1 j=D PJ
Cove
• . •
•
'
- p(x, , 8)). (5.4.10)
&h k &h ) ( h &p; (p(8)) p(x; , 8) - L a'PJ_ (p (8))p(x; , 8)
or equivalently, by noting
I t.
� x;)
�
k
'
k
�
Thus, by (5.4.10), not only is vn { h
I
Chapter 5
(5.4. 1 1 )
2 •
(5.4.12)
� (x;, 8) � [f, l (x; , B) ] p( x;, B), k &h &l & (p(B))l(X, � x;), 8ii (X� o B) j =O PJ
L
�
1.
(5.4.13)
By (5.4. 13), using the correlation inequality (A. l 1 . 16) as in the proof of the information inequality
(3.4.12), we obtain
with equality iff,
& 2 (B,h)Var9 l < " 1 &B (X1 ,8) � "' (B,h)I (B)
&h &l L & . (p(B))(l(Xt = x; ) - p(x;,B)) � a(B) &B (XI> 8) + b(B) k
j=O
for some
PJ
a(B) f' 0 and some b(B) with prob�bility 1.
which implies (5.4.9).
(5.4.14)
, '
, 1 , '
(5.4.15)
a (B), whil• �ir common
a2 (B)I(B) � "' ( B, h), we s�� tliat equality in (5.4.8) gives a2 (B)I2 (B) = 1, •
•
Taking expectations we get b(B) = 0.
Noting that the covariance of the right-· anp lett-hand sides i s variance is
•
(5.4.16) 0
, •
I
Section 5.4
_.3::2:..:_7
n Asymptotic Theory in One Dlm io:;_ e "::':: :.c::
_ _ _ _ _ _ _ _ _ _ _ _ _ _
We shall see in Section 5.4.3 that the information bound (5.4.8) is, if it exists and under regularity conditions, achieved by if = h ( r;; ) , the MLE of B where h is defined implicitly by: h(p) is the value of 0, which �
(i) maximizes L::7�o Nj logp(xj, B) and
(ii) solves L:7�0 N, gi (xj , B) = O.
Example 5.4.1. One-Parameter Discrete Exponential Families. Suppose p (x, 8) = exp{BT(x) - A(B)}h(x) where h(x) = l(x E {xo , . , xk }). B E e. is a canonical one-parameter exponential family (supported on {xo, . . . , Xk }) and e is open. Then Theorem 5.3.5 applies to the MLE B and
..
�
C8( /ii ( B- B)) � N
(o, I(�))
(5.4.17)
with the asymptotic variance achieving the information bound J-1(B). Note that because 'C'k T(x N then, by ( -1 n (X ) 2.3.3) T = n L;� 1 T ; = L..,j�o 1 ) ";- . (5.4.18) and h(p)
=
k
[A_]- 1 L T(:Ej )Pj y';, Q
(5.4.19)
The binomial (n, p) and Hardy-Weinberg models can both be put into this framework with D canonical parameters such as B = log 1 P in the first case.
( P)
Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena. In the next two subsections we consider some more general situations.
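The following is a minimal illustrative sketch, not part of the original text: it checks the normal approximation of the type in (5.4.17) for the MLE of the canonical parameter in a one-parameter exponential family. The Poisson family with canonical parameter $\theta = \log\lambda$, for which $I(\theta) = A''(\theta) = \lambda$, is an arbitrary illustration; the values of $\lambda$, $n$, and the number of replications are also arbitrary.

```python
# Sketch (not from the text): normal approximation for the MLE of theta = log(lambda)
# in the Poisson family, where I(theta) = exp(theta) = lambda.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lam, n, reps = 3.0, 100, 50_000

xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
theta_hat = np.log(xbar)                              # MLE of theta = log(lambda)
z = np.sqrt(n * lam) * (theta_hat - np.log(lam))      # sqrt(n I(theta)) (theta_hat - theta)

for q in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"q={q:.2f}  simulated {np.quantile(z, q):6.3f}   N(0,1) {stats.norm.ppf(q):6.3f}")
```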
5.4.2
Asymptotic Normality of Minimum Contrast and M-Estimates
We begin with an asymptotic normality theorem for minimum contrast estimates. As in Theorem 5.2.3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check. Suppose i.i.d. X1 , , Xn are tentatively modeled to be distributed according to Po. B E 8 open c R and corresponding density/frequency functions p (· , B). Write P = { P, : B E E> ) . Let p : X X e � R where . • .
D(B,B o ) = Es, ( p(X , , O ) - p(X, ,Ilo ))
328
Asymptotic Approximations
Chapter 5
•
; •
•
is uniquely minimized at Bo. Let On be the minimum contrast estimate
-
0. � Suppose
AO: .p � Then
•
1 n argrnin - _L p(X,, O). n . t=l
U is well defined. 1 n ,P(X;, On) -n L . t=l
�
0.
(5.4.20)
In what follows we let P, rather than Pe, denote the distribution of Xi. This is because, as pointed out later in Remark 5.4.3, under regularity conditions the properties developed in this section are valid for P � { Pe : 0 E 8}. We need only that 0(P) is a parameter as defined in Section 1.1. As we saw in Section 2.1, parameters and their estimates can often be extended to larger classes of distributions than they originally were defined for. Suppose AI: The parameter O(P) given by the solution of
j ,P(x, O)dP(x)
'·
0
(5.4.21)
is well defined on P. That is,
J I.P (x, O) jdP(x) < co, 0 E e, p E p
and O(P) is the unique solution of(5.4.21) and, hence, O(Pe)
=
i •
• • •
'
'
0.
Ep.j;2(X1,0(P)) < oo for all P E P. A3: ,P( · 9) is differentiable, � (X1, 9) has a finite expectation and
i !
A2:
'
,
a.p
'!
I·
�
Ep 89 (X, 9(P)) f 0. A4: sup1 AS:
'
•
{ � I:;� 1 (� (X; , t) - � (X; , O(P))) I : jt - 9(P)I <
9. £. 9(P). That is, 9. is consistent on P = {Pe
0.
: 9 E 8}.
.
' '
i
Theorem 5.4.2. Under AO-A5,
'
I '
•
(5.4.22) where
-¢(x,P)
=
/( .j;(x,9(P)) -Ep 09 (Xt,9(P)) a.p
J • •
1
'
•
•
• •
•
(5.4.23)
Section 5.4
329
Asymptotic Theory in One Dimension
Hence, where
"2 (1/J, P)
=
Ep
(5.4.24)
Proof Claim (5.4.24) follows from the central limit theorem and Slutsky's theorem, ap plied to (5.4.22) because
n ../Ti(Bn - B(P)) = n-112 L ;ft(X;, P) + op(l) i=l and
Ep;f(x,, P) while
Ep
0
Ep;f2 (X1,P) =
-
by A l , A2, and A3. Next we show that (5.4.22) follows by a Taylor expansion of the equations (5.4.20) and (5.4.21). Let Bn = B (P) where P denotes the empirical probability. By expanding n-• I:7 1 <J;(X;, Bn) around B(P) , we obtain, using (5.4.20), -
-
l n &.p l n - n L <J;(X;, B(P)) = n L 88 (X;,O�)(Bn - B (P)) i=l t=l
where
IB� - B(P) I < IBn - B(P)I. Apply A5 and A4 to cooclude that n n I l thj; * a , = )) ;, ) + op ( l) (P (X ; B (X L L O <J; 8 8 8 8 n n n i=I t=l
(5.4.25)
(5.4.26)
and A3 and the WLLN to conclude that (5.4.27) Combining (5.425)-(5.4.27) we get,
(On -
8¢ ( B(P)) -Ep 88 (X1 , B (P))
But by the central limit theorem and A l ,
+
n ) l op(!) n t; <J;(X;, B (P)). =
n 2 1 n -n L = Op( - 1 ). B(P)) <J;(X;, . t=l 1
(5.4.28)
(5.4.29)
330
Asymptotic Approximations
Chapter 5
Dividing by the second factor in (5.4.28) we finally obtain
I n
On - O(P) = n L ;f(X;, O(P)) + Op
i= l
I n
!
- '<' 1/J(X;, O(P)) n�
l=l
and (5.4.22) follows from the foregoing and (5.4.29).
0
Remark 5.4.1. An additional assumption A6 gives a slightly different formula for Ep 'iilf (X" O(P)) if P = Po.
A6: Suppose P = Po so that O(P) = 0, and that the model P is regular and let l(x, 0) = logp{x, B) where p(·, (}) is as usual a density or frequency function. Suppose l is differen· tiable and assume that
&l -Eo (X, , O),P(X1 , 0) 80
-Covo (:�(Xt , O) , ,P (X1, 0) ) .
i
J 1/J(x, O)p(x, O)dJL(x)
=
0
(5.4.30)
' '
· ,
'
l i
(5.4.31)
=
{Po :
I
l l
•
I
I
l '
l •
'
; ,
• • '
•
or more generally, for M-estimates, =
I •
• •
argmin Epp(X1, B)
(2) B(P) solves Ep1/J(X1 , B)
!
•
Remark 5.4.2. Solutions to (5.4.20) are called M-estimates. Our arguments apply to M estimates. Nothing in the arguments require that On be a minimum contrast as well as an M-estimate (i.e., that 1/J = � for some p).
=
•
•
i
-
(I) O(P)
. ,
'
for all 0. If an unbiased estimate S (X!) of 0 exists and we let 1/J (x, 0) = 6 ( x) -0, it is easy to see that A6 is the same as ( 3.4.8). If further X1 takes on a finite set of values, { xo, . . . , Xk }, and we define h(p) = I:7=o 6(x1 ) Pi• we see that A6 corresponds to (5.4.12). Identity (5.4.30) suggests that ifP is regular the conclusion of Theorem 5.4.2 may hold even if 1/J is not differentiable provided that Eo 'iilf ( X1 , 0) is replaced by Covo(1/J(X" 0). Z� (X1 , B)) and a suitable replacement for A3, A4 is found. This is in fact true-see Prob lem 5.4.1.
Remark 5.4.3. Our arguments apply even if X,, . . . , Xn are i.i.d. P but P
• '
•
Note that (5.4.30) is formally obtained by differentiating the equation (5.4.2 1), written as
!
0.
Theorem 5.4.2 is valid with B(P) as in (I) or (2). This extension will he pursued in Vol ume 2. We conclude by stating some sufficient conditions, essentially due to Cramer (1946), for A4 and A6. Conditions AO, Al, A2, and A3 are readily checkable whereas we have given conditions for AS in Section 5.2.
; ;
Section 5.4
A4': (a) 8 --+
33 1
Asymptotic Theory in One Dimension
�� (Xt , O ) is a continuous function of 8 for all x.
{
(b) There exists J(O) > 0 such that sup
EhJ; iJ
0') , (X t O
where Eo M(X1 ,
Dv')
iJ (Xt, O
0)
0) < oo.
A6': �t (x, 0') is defined for all x, /0'- 0/
< J(
0) and J:+: J I !Ji:' (x, s) dJL(x)ds < oo for
some J � J(O) > 0, where JL(x) is the dominating measure for P(x) defined in (A. l0.13).
That is. "dP(x)
�
p(x) dJL(x) ."
Details of how A4' (with A0-A3) iiT!plies A4 and A6' implies A6 are given in the problems. We also indicate by example in the problems that some conditions are needed (Problem 5.4.4) but A4' and A6' are not necessary (Problem 5.4.1).
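The following is a minimal illustrative sketch, not part of the original text: it computes a location M-estimate with Huber's $\psi$ and the plug-in version of the asymptotic variance formula $\sigma^2(\psi, P) = E\psi^2(X_1, \theta(P))/\big[E\frac{\partial\psi}{\partial\theta}(X_1, \theta(P))\big]^2$ from (5.4.23)-(5.4.24). The data-generating distribution and the tuning constant are arbitrary choices.

```python
# Sketch (not from the text): Huber location M-estimate and its plug-in asymptotic SE.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=500)        # heavy-tailed sample, true center 0
c = 1.345                                 # a common Huber tuning constant

psi = lambda u: np.clip(u, -c, c)                          # psi(x, theta) = clip(x - theta)
dpsi_dtheta = lambda u: -(np.abs(u) <= c).astype(float)    # d/dtheta of psi(x - theta)

theta_hat = optimize.brentq(lambda t: psi(x - t).sum(), x.min(), x.max())

n = len(x)
num = np.mean(psi(x - theta_hat) ** 2)
den = np.mean(dpsi_dtheta(x - theta_hat)) ** 2
se = np.sqrt(num / den / n)
print(f"theta_hat = {theta_hat:.4f}, estimated asymptotic SE = {se:.4f}")
```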
5.4.3
Asymptotic Normality and Efficiency of the MLE
The most important special case of (5.4.20) occurs when and 1/>(x,
0)
- Dl
�
ijO(x, 0)
-
p(x, 0)
�
l (x, 0) -.
=
logp(x,
0)
obeys A0-A6. In this case On is the MLE On and we obtain an
identity of Fishef's,
21 & - Eo 002 (Xt , 0) where
(5.4.32)
/(8) is t�e Fisher information introduq;:d in Section 3.4.
We can now state the basic
result on asymptotic normality and e�ciency of the MLE. Theorem
5.4.3.
If AO-A6
apply to p(x, 0)
satisfies
�
l(x, 0) and P
�
Pe,
-
then the MLE On (5.4.33)
so that (5.4.34)
Furthermore, A0-A6, then
if Bn
is a minimum contrast estimate whose corresponding p and '1jJ satisfy
2 (1/>, Po) >
<7
with equality iff 1/> � a ( 0) � for some a o1 0.
1 I(O)
(5.4.35)
r •
• '
332
Asymptotic Approximations
Chapter 5
Proof Claims (5.4.33) and (5.4.34) follow directly by Theorem 5.4.2. By (5.4.30) and
(5.4.35). claim (5.4.35) is equivalent to
(5.4.36) =
0, cross multiplication shows that (5.4.36) is just the correlation Because Eo'!j;(X1 , B) inequality and the theorem follows because equality holds iff 'ljJ is a nonzero multiple a ( B) o of 3i; (X1 , B).
I
Note that Theorem 5.4.3 generalizes Example 5.4.1 once we identify '1/J(x, B) with T(x) - A'(B). The optimality part of Theorem 5.4.3 is not valid without some conditions on the esti· mates being considered. l'l Example 5.4.2. Hodges's Example. Let X1 , . . . , Xn be i.i.d. .N(B, 1). Then X is the MLE of B and it is trivial to calculate I(B)_ 1. Consider the following competitor to X: -
Bn
I
-
0 if lXI < n-1;4 X if l X I > n-1/4
(5.4.37)
We can interpret this estimate as first testing H : () 0 using the test "Reject iff JXI > n-1/4" and using X as our estimate if the test rejects and 0 as our estimate otherwise. We next compute the limiting distribution of .,fii (Bn - B). Let Z .N(O, 1). Then =
�
P[IZ + ..fiiBI < n1i4]
(5.4.38)
___,
where
-
5.4.4
1
1
' •
•
' ' • •
1
l
�
• •
:
.
' • •
;
'
; ' '
l
•
•
(5.4.39) =
•
•
0 because n114 - ,ftiB ___, -oo, and, thus, Therefore, if B # 0, P•[IXI < n- 1 14[ PB [Bn = X[ ___, 1. If B 0, Po [IXI < n1i4] ___, 1, and PB[iin = OJ I. Therefore, ___,
•
Testing
The major testing problem if (} is one-dimensional is H : (} < Bo versus K : (} > 9o. If p(·, B) is an MLR family in T(X), we know that all likelihood ratio tests for simple B1
Section 5.4
�3:::3:..::3 _;cc"_:O._o:c'-=O-=;m_:'cc":c';cco"'----����... Asymptotic Th�o:_c"f.-�
versus simple ()2 , ()1 < ()2 , as well as the likelihood ratio test for H versus K, are of the form "Reject H for T( X) large" with the critical value specified by making the probability of type I error a at ()0. If p(-, ()) is a one-parameter exponential family in B generated by T(X) , this test can also be interpreted as a test of H : >.. < >..o versus K : >.. > .\0, where ).. = A(()) because A is strictly increasing. The test is then precisely, "Reject H for large values of the MLE T(X) of >...'1 It seems natural in general to study the behavior of the test, "Reject H if Bn > c(a, Bo) " where P&, [Bn > c(a, Bo )] = a and Bn is the MLE of e. We will use asymptotic theory to study the behavior of this test when we observe i.i.d. X1 , . . . , Xn distributed according to Po. B E (a, b), a < eo < b, derive an optimality property, and then directly and through problems exhibit other tests with the same behavior. � Let en (a, ()0) denote the critical value of the test using the MLE On based on n observations. •
•
Theorem 5.4.4. Suppose the model P = {Po : 8 E 8} is such thot the conditions of Theorem 5.4.2 apply to '0 = g� and Bn. the MLE. That is,
£,(-fii(Bn - 8)) � N(o, r 1 (8))
(5.4.40)
where 1(8) > Ofor all 8. Then
c,(a,Bo) = Bo + Zt-a/Vn1(8o) + o(n- 112) where Zt- a is the 1 - a quantile of the N(O, 1) distribution. Suppose (A4') holds as well as (A6) and 1(8) < oofor all 8. Then lf8 > Bo. Po[Bn > c,(a,Oo)] � 1. lf8 < Bo, Po[Bn > en(<>, Oo)] � 0. �
�
(5.4.41)
(5.4.42) (5.4.43)
Property (5.4.42) is sometimes called consistency of the test against a fixed alternative. Proof. The proof is straightforward:
Po, [Vn1(8o)(Bn - Bo) > z] � l - 1>(z) by (5.4.40). Thus,
Po,[Bn > Bo + Zt-a/Vnf(Oo)] = Po,[Vn1(8o)(Bn - Bo) > Zt-a] � a.
(5.4.44)
But Polya's theorem (A.l4.22) guarantees that su p
which implies that other hand,
IPo,[.fii(Bn - Bo) > z] - (1 - 1>(z))l � 0,
Jn1(80)(c,(a,Oo) - Bo) - z1_0
(5.4.45)
� 0, and (5.4.41) follows. On the
Po[Bn > Cn(<>, Bo)] = Poh/;;/(B)(Bn - 8) > Vn/(ij(cn (<>, Bo) - 8)].
(5.4.46)
i
•
334
Asymptotic Approximations
Chapter 5
l
By (5.4.41),
.jni(8)(c.,(a, 80 ) - 8) and ___, oo if 8 <
foi(B)(8o - 8 + Z1-a / /ni(80) + o(n -112 )) .jnl(8)(8o - 8) + 0(1) -oo if 8 > 80
80. Claims (5.4.42) and (5.4.43) follow.
___,
0
Theorem 5.4.4 tells us that the test under discussion is consistent and that for n large the power function of the test rises steeply to a from the left at 00 and continues rising steeply to 1 to the right of 80. Optimality claims rest on a more refined analysis involving a reparametrization from 8 to ')' = vn( 8 - 8o). (3)
Theorem 5.4.5. Suppose the conditions of Theorem 5.4.2 and (5.4.40) hold uniformly for fJ in a neighborhood of Oo. That is, assume
sup{IP,[fol(Bj(Bn - 8) S z] - (1 - of?(z))[ : [8 8o [ < <(8o) } forsome <(8o) > 0. LetQ1 = Po, 'l' Jri(8 - 8o), then -
___,
0,
(5.4.47)
�
I '
(5.4.48)
j
,
.
'
•
uniformly in "f· Furthennore, if
•
(5.4.49)
•
'
'
j i !
then
'
1 - 1?(z1-a - 'l'y'I{8o)) if'Y > 0 > 1 - 1?(z1-a - 'l'y'I{8o)) if'Y < 0. <
'
(5.4.50)
Note that (5.4.48) and (5.4.50) can be interpreted as saying that among all tests that are asymptotically level a (obey (5.4.49)) the test based on rejecting for large values of Bn is asymptotically uniformly most powerful (obey (5.4.50)) and has asymptotically smallest probability of type I error for B < Bo. In fact, these statements can only be interpreted as valid in a small neighborhood of 80 because "' fixed means () 80. On the other hand, if Jri(B - B0) tends to zero, then by (5.4.50), the power of tests with asymptotic level a tend to a. If Jri(B - Bo) tends to infinity, the power of the test based on Bn tends to I by (5.4.48). In either case, the test based on Bn is still asymptotically MP. -
-----+
�
�
Proof. Write
Po[Bn > cn(a, Bo)] - P,[fol(Bj(Bn - B) > fol(Bj(c., (a, Bo) - B)] Po[/nl(B)(Bn - B) > .jni(B)(Bo - B + Z1-o/v'nl(Bo) +o(n- 1/2))]. (5.4.5 1)
!
.i
'
i '
Section 5.4
335
Asymptotic Theory in One Dimension
..fii (e - eo) is fixed, I(e) � I (eo + },; ) � !(eo) because our uniformity assumption implies that () ---+ I(O) is continuous (Problem 5.4.7). Thus, If 7
�
1 - .P(Z!-a(l + o(l)) + .jn(I(e0) + o( l))(eo - e ) + o( l ) ) I - .P ( zl a - -y�) + o( l ) )
-
(5.4.52)
and (5.4.48) follows. To prove (5.4.50) note that by the Neyman-Pearson lemma, if 1 > 0,
n
p (xi , eo + Jn) > dn (a,eo) < Pe, + )n I; log p(X e0 ) i=l 1,
-l-
n
v (xi , eo + ;?;; ) I; log -·· p(X e ) � dn(o, eo) , 1l 0 i=l (5.4.53)
where p(x, 0) denotes the density of Xi and dn, tn are uniquely chosen so that the right hand side of (5.4.53) is a if 7 is 0. Further Taylor expansion and probabilistic arguments of the type we have used show that the right-hand side of (5.4.53) tends to the right-hand side of (5.4.50) for all -y. The D details are in Problem 5.4.5.
The asymptotic results we have just established do not establish that the test that rejects for large values of On is necessarily good for all alternatives for any n. The test l [en > en (a, eo)] of Theorems 5.4.4 and 5.4.5 in the future will be referred to as a Wald test. There are two other types of test that have the same asymptotic behavior. These are the likelihood ratio test and the score or Rao test. It is easy to see that the likelihood ratio test for testing H : g < 80 versus K : 8 > 00 is of the form �
n
"Reject if
L log[p(X,}n)/p(Xi, eo)]l (en > eo) > kn(eo, a)." i=l
It may be shown (Problem 5.4.8) that, for a < �. kn(e0, a) � zf_. + o(l) and that if own (X1 , . . ) Xn) is the critical function of the Wald test and OL n (X1 ' . . . ) Xn) is the critical function of the LR test then, for all "(,
.
(5.4.54)
Assertion (5.4.54) establishes that the test O.in yields equality in (5.4.50) and, hence, is asymptotically most powerful as well. Finally, note that the Neyman Pearson LR test for H : () = ()0 versus K : Oo + t, € > 0 rejects for large values of 1 - [log pn( X1 , . . , Xn, eo E
.
+ <) - logpn (Xl , . . . , Xn, eo)]
•
•
'
•• .
.
•
336
Asymptotic Approximations
Chapter 5
where Pn(Xt, . . . , Xn, B) is the joint density of Xt , . . . , Xn . For e small, n fixed, this is approximately the same as rejecting for large values of 8�0 log pn(X 1 1 , Xn, Bo). , Xn are i.i.d. with The preceding argument doesn't depend on the fact that X1, common density or frequency function p(x, 8) and the test that rejects H for large values of a�o log Pn (X 1 , , Xn, 90) is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming •
.
•
•
.
•
.
I
•
.
'
'
•
•
"Reject H iff
t 0�0 log (X , 9 ) p
·�
. 1
;
•
I
'
o
>
!I
rn(a,9o)."
'
It is easy to see (Problem 5.4.15) that and that again if JR.n (X1, .
rn(a, 9o) = Z t - a /nl(90) + o (n1 12 ) .
•
.
(5.4.55)
, Xn) is the critical function of the Rao test then
(X1 ' . . . ' Xn) = Ow n (X , ' . . . ' Xn)] � 1 ' Po,+-'Wf.n ,;n
'
• • ..
-
'
(5.4.56)
(Problem 5.4.8) and the Rao test is asymptotically optimal. Note that for all these tests and the confidence bounds of Section 5.4.5, I(9o), which d' may require numerical integration, can be replaced by n 1 do2 ln (Bn) (Problem 5.4. 10).
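The following is a minimal illustrative sketch, not part of the original text: it computes the Wald, Rao (score), and signed likelihood ratio statistics for $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$ in a simple model in which they are available in closed form. The Poisson($\theta$) model, with $\hat\theta = \bar X$ and $I(\theta) = 1/\theta$, and the particular data and $\theta_0$ are arbitrary choices.

```python
# Sketch (not from the text): Wald, score, and signed LR statistics in a Poisson model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta0, n = 2.0, 80
x = rng.poisson(2.4, size=n)
xbar = x.mean()

wald = np.sqrt(n / xbar) * (xbar - theta0)       # sqrt(n I(theta_hat)) (theta_hat - theta0)
score = np.sqrt(n / theta0) * (xbar - theta0)    # score statistic / sqrt(n I(theta0))
loglik = lambda t: np.sum(x * np.log(t) - t)     # log-likelihood up to constants
lr_signed = np.sign(xbar - theta0) * np.sqrt(2 * max(loglik(xbar) - loglik(theta0), 0.0))

z = stats.norm.ppf(0.95)
for name, stat in [("Wald", wald), ("score", score), ("signed LR", lr_signed)]:
    print(f"{name:10s} {stat:6.3f}   reject at 5%: {stat > z}")
```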
•
• ' • • .
-
' • •
•
• •
5.4.5
'
'
•
•
Confidence Bounds
We define an asymptotic Level l - a lower confidence bound (LCB) ()n by the requirement that
'
•
(5.4.57) for all () and similarly define asymptotic Ievel l - a VCBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:
I' I.
I II '
(i) By using a natural pivot.
(
'
'
'1
(ii) By inverting the testing regions derived in Section 5.4.4.
"'
Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (AO)-(A6), (A4'), and !(9) finite for all 8, it follows (Problem 5.4.9) that
.r
Co( ni (6 ) ( Bn - B)) � N(o, 1)
(5.4.58)
9� = (in - Zt-a / Vnl(Bn ) .
(5.4.59)
•
'. •
'
I
I I
I! ••
'
V
n
for all () and, hence, an asymptotic Ievel l - a lower confidence bound is given by Turning tto method (ii), inversion of 8Wn gives formally
e�, = inf{ 9 : c,(a, 9)
>
-
9n}
(5.4.60)
'
Section 5.5
337
Asymptotic Behavior and Optim ality of the Posterior Distribution
or if we use the approximation
Cn ( Cl', 9) � e + Zt-o/Vni(iJ), (5.4.41 ), (5,4,61)
In fact neither 8�1, or 8�2 properly inverts the tests unless cn (et, 8) and C,. (a, 8) are increasing in 8. The three bounds are different as illustrated by Examples 4.4.3 and 4.5.2. If it applies and can be computed, 8�1 is preferable because this bound is not only
approximately but genuinely Ievel l - a. But computationally it is often hard to implement
because cn (et, 0) needs, in general, to be computed by simulation for a grid of B values. Typically, (5.4.59) or some equivalent alternatives (Problem 5,4, 10) are preferred but can
be quite inadequate (Problem 5.4.1 1), These bounds 8�, (}�1, 8� 2• are in fact asymptotically equivalent and optimal in a suit
able sense (Problems
Summary.
5.4,12 and 5.4.13).
We have defined asymptotic optimality for estimates
in one-parameter models.
In particular, we developed an asymptotic analogue of the information inequalily of Chap ter
3
for estimates of () in a one-dimensional subfamily of the multinomial distributions,
showed that the MLE fonnally achieves this bound, and made the latter result sharp in the context of one-parameter discrete exponential families. In Section theory of minimum contrast and
5.4.2 we developed the
M-estimates, generalizations of the MLE, along the lines
of Huber (1967), The asymptotic formulae we derived are applied to the MLE both under the mOOel that led to it and tmder an arbitrary P. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimal ity results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson ( 1996), Le Cam and Yang
(1990), Lehmann (1999), Rao (1973), and Serfling (1980), 5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as
n --+ oo
in a sense we now describe. The
framework we consider is the one considered in Sections from a regular model in which e is open c
able.
5.2
and
5.4,
i.i.d. observations
R or e � {91, , , , , e,} finite, and e is identifi
Most of the questions we address and answer are under the assumption that arbitrary specified value, or in frequentist tenns, that
e is true.
fJ = (}, an
Consistency The first natural question is whether the Bayes posterior distribution as
n ------+ oo con
centrates all mass more and more tightly around B. Intuitively this means that the data that
are coming from Po eventually wipe out any prior belief that parameter values not close to
e are likely,
Formalizing this statement about the posterior distribution, II(· I X1 , is a function-valued statistic, is somewhat subtle in general. But for e
•
.
•
, Xn). which
= {(}I, . . . ) (}k}
it is
1. i
338
Asymptotic Approximations
Chapter 5
straightforward. Let
tr(e i X, , . . . , Xn) = P[9 = e I X,, . . . , Xn]· Then we say that II( . I x,, . , Xn) is consistent iff for all e E e, Po [ltr(e I Xt, . . . , Xn) - 1 1 > <] � 0 .
(5.5. 1 )
.
for all E > 0. There is a slightly stronger definition: II(- j X1 , iff for all e E 9,
•
•
•
(5.5.2)
, Xn) is a.s. consistent
tr(e I X,, . . . , Xn) � 1 a.s. Po .
(5.5.3)
General a.s. consistency is not hard to formulate: tr ( · I Xt , . . . , Xn)
=>
b{o) a.s.
Po
(5.5. 4)
where ::::} denotes convergence in law and <5{o} is point mass at B. There is a completely satisfactory result for e finite.
Theorem 5.5.1. Let 1r1 P(8 = Bi], j = 1, . . , k denote the prior distribution of8. Then II(· I X1, , Xn) is consistent (a.s. consistent) iff 'lfj > 0for j = 1, . . . , k.
.
. '
.
-
' ,
. • .
Proof. Let p( ·,B) denote the frequency or limit j function of X. The necessity of the condition is immediate because 1fj = 0 for some j implies that 1r( Bi I X1, . . . , Xn) = 0 for all X, , . . . , Xn because, by (1.2.8),
P[IJ = e; I Xt, . . . , Xnl "; rr�, p(x,, e; )
.
:i j'
k
'
'
., ' '
•
,
I
1 1 .[f-. 1 p(X;, e.) "• - og + - L.., og '--;-�-,::.;, n i=I n p(Xi, Bj) 11"j
•
•
·
Intuitively, no amount of data can convince a Bayesian who has decided a priori that (JJ is impossible. On the other hand, suppose all 1ri are positive. If the true (J is (Ji or equivalently 8 = (Ji , then
.
.
n
L::.�t "• 0,�, p(X;, e.)
•
(5.5.5)
,
'
-
By the weak (respectively strong) LLN, under Po;• 1 .[f-. 1 p(X; , e. ) og L p(X; , e;) n ,�
1
----+
I ( o;
E
(
p(X, , e.) og p(X1 , e; )
•
'
i
,
)
I
.
::; ) < 0, by Shannon's inequality, if
in probability (respectively a.s.). But Eo; tog :�: ) B =I Bi. Therefore, a
I
I X,, . . . , Xn) tr(e. og tr(e; I Xt, " . , Xn)
in the appropriate sense, and the theorem follows. 'I
� -00 ,
0
i i
,
i
'
•
I
'
' ,' •
•
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
Remark 5.5.1. We have proved more than is stated. (} I X1 , . , Xn] 0 exponentially. .
.
339
Namely. that for each
(J E 8, Po[O =/:
-
D
As this proof suggests, consistency of the posterior distribution is very much akin to
consistency of the
MLE. The
appropriate analogues of Theorem 5.2.3 are valid. Next we
give a much stronger connection that has inferential implications:
Asymptotic normality of the posterior distribution

Under conditions A0–A6 for ρ(x, θ) = −l(x, θ) ≡ −log p(x, θ), we showed in Section 5.4 that if θ̂ is the MLE,

L_θ(√n(θ̂ − θ)) → N(0, I⁻¹(θ)).   (5.5.6)

Consider L(√n(ϑ − θ̂) | X₁, ..., Xₙ), the posterior probability distribution of √n(ϑ − θ̂(X₁, ..., Xₙ)), where we emphasize that θ̂ depends only on the data and is a constant given X₁, ..., Xₙ. For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in P_θ probability by convergence a.s. P_θ. We also add,

A7: For all θ and all δ > 0 there exists ε(δ, θ) > 0 such that

P_θ[ sup{ (1/n) Σ_{i=1}^n [l(X_i, θ′) − l(X_i, θ)] : |θ′ − θ| ≥ δ } ≤ −ε(δ, θ) ] → 1.

A8: The prior distribution has a density π(·) on Θ such that π(·) is continuous and positive at all θ.
Remarkably,

Theorem 5.5.2 ("Bernstein/von Mises"). If conditions A0–A3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold, then

L(√n(ϑ − θ̂) | X₁, ..., Xₙ) → N(0, I⁻¹(θ))   (5.5.7)

a.s. under P_θ for all θ. We can rewrite (5.5.7) more usefully as

sup_x | P[√n(ϑ − θ̂) ≤ x | X₁, ..., Xₙ] − Φ(x √I(θ)) | → 0   (5.5.8)

for all θ a.s. P_θ and, of course, the statement holds for our usual and weaker convergence in P_θ probability also. From this restatement we obtain the important corollary.

Corollary 5.5.1. Under the conditions of Theorem 5.5.2,

sup_x | P[√n(ϑ − θ̂) ≤ x | X₁, ..., Xₙ] − Φ(x √I(θ̂)) | → 0   (5.5.9)

a.s. P_θ for all θ.
Remarks. (1) Statements (5.5.4) and (5.5.7)–(5.5.9) are, in fact, frequentist statements about the asymptotic behavior of certain function-valued statistics.

(2) Claims (5.5.8) and (5.5.9) hold with a.s. convergence replaced by convergence in P_θ probability if A4 and A5 are used rather than their strong forms—see Problem 5.5.7.

(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and identifiability guarantees consistency of θ̂ in a regular model.
Proof. We compute the posterior density of √n(ϑ − θ̂) as

q_n(t) = c_n⁻¹ π(θ̂ + t/√n) ∏_{i=1}^n p(X_i, θ̂ + t/√n)   (5.5.10)

where c_n = c_n(X₁, ..., Xₙ) is given by

c_n = ∫ π(θ̂ + s/√n) ∏_{i=1}^n p(X_i, θ̂ + s/√n) ds.

Divide top and bottom of (5.5.10) by ∏_{i=1}^n p(X_i, θ̂) to obtain

q_n(t) = d_n⁻¹ π(θ̂ + t/√n) exp{ Σ_{i=1}^n [l(X_i, θ̂ + t/√n) − l(X_i, θ̂)] }   (5.5.11)

where l(x, θ) = log p(x, θ) and

d_n = ∫ π(θ̂ + s/√n) exp{ Σ_{i=1}^n [l(X_i, θ̂ + s/√n) − l(X_i, θ̂)] } ds.

We claim that

d_n q_n(t) → π(θ) exp{−t² I(θ)/2} a.s. P_θ   (5.5.12)

for all θ. To establish this note that

(a) sup{ |π(θ̂ + t/√n) − π(θ)| : |t| ≤ M } → 0 a.s. for all M, because θ̂ is a.s. consistent and π is continuous.

(b) Expanding,

Σ_{i=1}^n [l(X_i, θ̂ + t/√n) − l(X_i, θ̂)] = (t²/2) (1/n) Σ_{i=1}^n (∂²l/∂θ²)(X_i, θ*(t))   (5.5.13)

where |θ*(t) − θ̂| ≤ |t|/√n; we use Σ_{i=1}^n (∂l/∂θ)(X_i, θ̂) = 0 here. By A4(a.s.) and A5(a.s.),

sup{ | (1/n) Σ_{i=1}^n (∂²l/∂θ²)(X_i, θ*(t)) − (1/n) Σ_{i=1}^n (∂²l/∂θ²)(X_i, θ) | : |t| ≤ M } → 0 a.s. P_θ

for all M. Using (5.5.13), the strong law of large numbers (SLLN), and A8, we obtain (Problem 5.5.3)

d_n q_n(t) → π(θ) exp{ E_θ[(∂²l/∂θ²)(X₁, θ)] t²/2 } a.s. P_θ   (5.5.14)

for all t. Using A6 we obtain (5.5.12).

Now consider

d_n = ∫_{|s| ≤ δ√n} d_n q_n(s) ds + ∫_{|s| > δ√n} d_n q_n(s) ds.   (5.5.15)

By A8 and A7,

∫_{|s| > δ√n} d_n q_n(s) ds ≤ √n sup{ exp{ Σ_{i=1}^n [l(X_i, t) − l(X_i, θ̂)] } : |t − θ̂| ≥ δ } ≤ √n e^{−nε(δ, θ)}   (5.5.16)

a.s. P_θ for all δ > 0, so that the second term in (5.5.15) is bounded by √n e^{−nε(δ, θ)} → 0. Finally note that (Problem 5.5.4), by arguing as for (5.5.14), there exists δ(θ) > 0 such that

P_θ[ d_n q_n(t) ≤ 2π(θ) exp{ E_θ[(∂²l/∂θ²)(X₁, θ)] t²/4 } for all |t| ≤ δ(θ)√n ] → 1.   (5.5.17)

Finally, apply the dominated convergence theorem, Theorem B.7.5, to ∫ d_n q_n(s) 1(|s| ≤ δ(θ)√n) ds, using (5.5.14) and (5.5.17), to conclude that, a.s. P_θ,

d_n → π(θ) ∫ exp{−s² I(θ)/2} ds = π(θ) √(2π) / √I(θ).   (5.5.18)

Hence, a.s. P_θ,

q_n(t) → √I(θ) φ(t √I(θ))   (5.5.19)

where φ is the standard Gaussian density, and the theorem follows from Scheffé's Theorem B.7.6 and Proposition B.7.2. □
Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose as in Example 3.2.1 we have observations from a N(θ, σ²) distribution with σ² known and we put a N(η, τ²) prior on θ. Then the posterior distribution of ϑ is N( w₁ₙ η + w₂ₙ X̄, (n/σ² + 1/τ²)⁻¹ ), where

w₁ₙ = σ² / (nτ² + σ²),  w₂ₙ = 1 − w₁ₙ.   (5.5.20)

Evidently, as n → ∞, w₁ₙ → 0, X̄ → θ a.s. if ϑ = θ, and (n/σ² + 1/τ²)⁻¹ → 0. That is, the posterior distribution has mean approximately θ and variance approximately 0, for n large, or equivalently the posterior is close to point mass at θ, as we expect from Theorem 5.5.1. Because θ̂ = X̄, √n(ϑ − θ̂) has posterior distribution

N( √n w₁ₙ(η − X̄), n(n/σ² + 1/τ²)⁻¹ ).

Now, √n w₁ₙ = O(n^{−1/2}) = o(1) and n(n/σ² + 1/τ²)⁻¹ → σ² = I⁻¹(θ), and we have directly established the conclusion of Theorem 5.5.2. □
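A quick numerical check of this example (a sketch; σ = 1, η = 1, τ = 2, and θ = 0.3 are arbitrary illustrative choices): the exact posterior mean and variance from (5.5.20) are compared with the Bernstein–von Mises approximation, under which the posterior is approximately N(X̄, σ²/n).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, eta, tau = 1.0, 1.0, 2.0     # known sigma, N(eta, tau^2) prior (illustrative values)
theta_true = 0.3

for n in [10, 100, 1000]:
    x = rng.normal(theta_true, sigma, size=n)
    xbar = x.mean()
    w1 = sigma**2 / (n * tau**2 + sigma**2)          # w_{1n}
    post_mean = w1 * eta + (1 - w1) * xbar           # exact posterior mean
    post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)   # exact posterior variance
    # Bernstein-von Mises approximation: posterior approx N(xbar, sigma^2 / n)
    print(n, round(post_mean, 4), round(xbar, 4), round(post_var, 6), round(sigma**2 / n, 6))
```

The two means and the two variances agree to order n^{−1/2} and n^{−1}, respectively, as the example asserts.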
Example 5.5.2. Posterior Behavior in the Binomial-Beta Model. (Example 3.2.3 continued). If we observe Sₙ with a binomial, B(n, θ), distribution, or equivalently we observe X₁, ..., Xₙ i.i.d. Bernoulli(θ), and put a beta, β(r, s), prior on θ, then, as in Example 3.2.3, ϑ has posterior β(Sₙ + r, n + s − Sₙ). We have shown in Problem 5.3.20 that if U_{a,b} has a β(a, b) distribution, then as a → ∞, b → ∞,

[ (a + b)³ / (ab) ]^{1/2} ( U_{a,b} − a/(a + b) ) → N(0, 1) in law.   (5.5.21)

If 0 < θ < 1 is true, Sₙ/n → θ a.s., so that Sₙ + r → ∞ and n + s − Sₙ → ∞ a.s. P_θ. By identifying a with Sₙ + r and b with n + s − Sₙ we conclude after some algebra that, because θ̂ = X̄,

√n(ϑ − X̄) → N(0, θ(1 − θ)) in law

a.s. P_θ, as claimed by Theorem 5.5.2. □
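The same conclusion can be checked by Monte Carlo for the beta posterior (a sketch; the β(2, 3) prior and θ = 0.4 are arbitrary illustrative choices): the standard deviation of posterior draws of √n(ϑ − X̄) is compared with √(θ(1 − θ)).

```python
import numpy as np

rng = np.random.default_rng(2)
r, s = 2.0, 3.0          # beta(r, s) prior (illustrative)
theta_true = 0.4

for n in [20, 200, 2000]:
    sn = rng.binomial(n, theta_true)
    xbar = sn / n
    draws = rng.beta(sn + r, n + s - sn, size=100000)  # posterior draws of theta
    z = np.sqrt(n) * (draws - xbar)                    # posterior draws of sqrt(n)(theta - xbar)
    print(n, round(z.std(), 3), round(np.sqrt(theta_true * (1 - theta_true)), 3))
```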
Bayesian optimality of optimal frequentist procedures and frequentist optimality of Bayesian procedures

Theorem 5.5.2 has two surprising consequences.

(a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.

(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions.

As an example of this phenomenon consider the following.
Theorem 5.5.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let θ̂ be the MLE of θ and let θ* be the median of the posterior distribution of ϑ. Then

(i)

√n(θ* − θ̂) → 0   (5.5.22)

a.s. P_θ for all θ. Consequently,

θ* = θ + (1/n) Σ_{i=1}^n I⁻¹(θ) (∂l/∂θ)(X_i, θ) + o_{P_θ}(n^{−1/2})   (5.5.23)

and L_θ(√n(θ* − θ)) → N(0, I⁻¹(θ)).

(ii)

E( √n(|ϑ − θ̂| − |ϑ − θ*|) | X₁, ..., Xₙ ) → 0 a.s. P   (5.5.24)

and

E( √n(|ϑ − θ̂| − |ϑ|) | X₁, ..., Xₙ ) = min_d E( √n(|ϑ − d| − |ϑ|) | X₁, ..., Xₙ ) + o_P(1).   (5.5.25)

Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions l_n(θ, d) = √n(|θ − d| − |θ|). But the Bayes estimates for l_n and for l(θ, d) = |θ − d| must agree whenever E(|ϑ| | X₁, ..., Xₙ) < ∞. (Note that if E(|ϑ| | X₁, ..., Xₙ) = ∞, then the posterior Bayes risk under l is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.

Proof. By Theorem 5.5.2 and Polya's theorem (A.14.22),

sup_x | P[√n(ϑ − θ̂) ≤ x | X₁, ..., Xₙ] − Φ(x √I(θ)) | → 0 a.s. P_θ.   (5.5.26)

But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.11). Thus, any median of the posterior distribution of √n(ϑ − θ̂) tends to 0, the median of N(0, I⁻¹(θ)), a.s. P_θ. But the median of the posterior of √n(ϑ − θ̂) is √n(θ* − θ̂), and (5.5.22) follows. To prove (5.5.24) note that

|ϑ − θ̂| − |ϑ − θ*| ≤ |θ̂ − θ*|

and, hence, that

E( √n(|ϑ − θ̂| − |ϑ − θ*|) | X₁, ..., Xₙ ) ≤ √n |θ̂ − θ*| → 0   (5.5.27)

a.s. P_θ, for all θ. Because a.s. convergence P_θ for all θ implies a.s. convergence P (B.7), claim (5.5.24) follows and, hence,

E( √n(|ϑ − θ̂| − |ϑ|) | X₁, ..., Xₙ ) = E( √n(|ϑ − θ*| − |ϑ|) | X₁, ..., Xₙ ) + o_P(1).   (5.5.28)

Because, by Problem 1.4.7 and Proposition 3.2.1, θ* is the Bayes estimate for l_n(θ, d), (5.5.25) and the theorem follow. □
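Claim (i) of the theorem is easy to illustrate numerically in the beta-binomial setting of Example 5.5.2 (a sketch; the β(2, 2) prior and θ = 0.3 are arbitrary illustrative choices): the posterior median θ* and the MLE X̄ differ by o(n^{−1/2}).

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(3)
r, s = 2.0, 2.0          # beta(r, s) prior (illustrative)
theta_true = 0.3

for n in [20, 200, 2000, 20000]:
    sn = rng.binomial(n, theta_true)
    mle = sn / n
    post_median = beta.ppf(0.5, sn + r, n + s - sn)   # theta* = posterior median
    print(n, round(np.sqrt(n) * (post_median - mle), 4))
```

The printed values of √n(θ* − X̄) shrink toward 0 as n grows, in line with (5.5.22).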
Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).

Bayes credible regions

There is another result illustrating that the frequentist inferential procedures based on θ̂ agree with Bayesian procedures to first order.
Theorem 5.5.4. Suppose the conditions of Theorem 5.5.2 are satisfied. Let

C_n(X₁, ..., Xₙ) = { θ : π(θ | X₁, ..., Xₙ) ≥ c_n },

where c_n is chosen so that Π(C_n | X₁, ..., Xₙ) = 1 − α, be the Bayes credible region defined in Section 4.7. Let I_n(γ) be the asymptotically level 1 − γ optimal interval based on θ̂, given by

I_n(γ) = [ θ̂ − d_n(γ), θ̂ + d_n(γ) ],

where d_n(γ) = z(1 − γ/2)/√(n I(θ̂)). Then, for every ε > 0 and every θ,

P_θ[ I_n(α + ε) ⊂ C_n(X₁, ..., Xₙ) ⊂ I_n(α − ε) ] → 1.   (5.5.29)

The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of √n(ϑ − θ̂) converges to the N(0, I⁻¹(θ)) density uniformly over compact neighborhoods of θ̂ for each fixed θ, is sketched in Problem 5.5.6. The message of the theorem should be clear: Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis, both in this case and in estimation, reveals that any approximation to Bayes procedures on a scale finer than n^{−1/2} does involve the prior. A particular choice, the Jeffreys prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher n^{−1} order (see Schervish, 1995).
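A small sketch of (5.5.29) in the Bernoulli case (the uniform β(1, 1) prior is an arbitrary choice, and an equal-tailed credible interval is used as a stand-in for the credible region of the theorem): for moderate n the credible interval and the interval X̄ ± z(1 − α/2)/√(n I(θ̂)), with I(θ) = 1/[θ(1 − θ)], nearly coincide.

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(4)
r, s, alpha = 1.0, 1.0, 0.05          # uniform prior (illustrative), level 0.95
theta_true = 0.35

for n in [50, 500, 5000]:
    sn = rng.binomial(n, theta_true)
    theta_hat = sn / n
    # equal-tailed Bayes credible interval from the beta(Sn + r, n + s - Sn) posterior
    lo, hi = beta.ppf([alpha / 2, 1 - alpha / 2], sn + r, n + s - sn)
    # Wald-type interval based on the MLE and estimated Fisher information
    half = norm.ppf(1 - alpha / 2) * np.sqrt(theta_hat * (1 - theta_hat) / n)
    print(n, (round(lo, 4), round(hi, 4)), (round(theta_hat - half, 4), round(theta_hat + half, 4)))
```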
Testing

Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of θ₀ given X₁, ..., Xₙ if H is false is of a different magnitude than the p-value for the same data. For more on this so-called Lindley paradox see Berger (1985) and Schervish (1995). However, if instead of considering hypotheses specifying the one point θ₀ we consider indifference regions where H specifies [θ₀, θ₀ + Δ) or (θ₀ − Δ, θ₀ + Δ), then Bayes and frequentist testing procedures agree in the limit. See Problem 5.5.2.
Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a priori possible. Second, we established the so-called Bernstein–von Mises theorem, actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the so-called Bernstein–von Mises theorem and frequentist confidence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1

1. Suppose X₁, ..., Xₙ are i.i.d. as X ~ F, where F has median F⁻¹(½) and a continuous case density.

(a) Show that, if n = 2k + 1,

E_F[ med(X₁, ..., Xₙ) ] = n (2k choose k) ∫₀¹ F⁻¹(t) t^k (1 − t)^k dt,
E_F[ med²(X₁, ..., Xₙ) ] = n (2k choose k) ∫₀¹ [F⁻¹(t)]² t^k (1 − t)^k dt.

(b) Suppose F is uniform, U(0, 1). Find the MSE of the sample median for n = 1, 3, and 5.

2. Suppose Z ~ N(μ, 1) and V is independent of Z with distribution χ²_m. Then T = Z / (V/m)^{1/2} is said to have a noncentral t distribution with noncentrality μ and m degrees of freedom. See Section 4.9.2.

(a) Show that

P[T ≤ t] = ∫₀^∞ Φ( t √(w/m) − μ ) f_m(w) dw,

where f_m(w) is the χ²_m density.
)
nX (b) If X1, . . . , X,. are i.i d N(p, u2 show that y' .
.
/
( ,.�1 l:: (X, - XJ2 ) i has a
noncentral t distribution with noncentrality parameter .fiiJL/IT and n dom.
-
1 degrees of free
(c) Show that T2 in (a) has a noncentral :Ft, m distribution with noncentrality parameter J12. Deduce that the density ofT is p(t)
=
00
2 L P[R = i] . !2i+l(f)[
where R is given in Problem B.3.12.
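A simulation sketch of the noncentral t construction in this problem (not part of the problem itself; μ = 1.5 and m = 10 are arbitrary illustrative choices): draws of T = Z/√(V/m) are compared with scipy's noncentral t distribution.

```python
import numpy as np
from scipy.stats import nct

rng = np.random.default_rng(5)
mu, m = 1.5, 10                       # noncentrality and degrees of freedom (illustrative)

z = rng.normal(mu, 1.0, size=200000)
v = rng.chisquare(m, size=200000)
t = z / np.sqrt(v / m)                # noncentral t draws built from the definition

qs = [0.1, 0.5, 0.9]
print(np.quantile(t, qs).round(3))            # Monte Carlo quantiles
print(nct.ppf(qs, df=m, nc=mu).round(3))      # reference quantiles from scipy
```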
' "
I
1.
\
Hint: Condition on IT I.
3. Show that if P[|X| ≤ 1] = 1, then Var(X) ≤ 1, with equality iff X = ±1 with probability ½ each.
Hint: Var(X) ≤ EX² ≤ 1.
4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of n and f. through foe.
(a) Show that the ratio of the Hoeffding function h( foE) to the Chebychev function c( Jii<) tends to 0 as Jii< � oo so that h(-) is arbitrarily better than c( - ) in the tails.
'I·.. 'i
( V:() - 1 gives lower results than h in
(b) Show that the normal approximation 24>
the tails if PIIXI < 1] = 1 because, if a2 < 1 , 1 - .P(t) -
•
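A numerical sketch related to part (a) (not part of the problem; the forms used below are the standard Hoeffding bound 2 exp(−t²/2) for variables bounded by 1 and the Chebychev bound 1/t² with σ² ≤ 1, written as functions of t = √n ε — check them against the definitions of h and c in Section 5.1): the ratio h/c tends to 0 in the tails.

```python
import numpy as np

# Hypothetical forms of the two bounds as functions of t = sqrt(n) * eps:
#   Hoeffding for |X_i| <= 1:      P(|Xbar| >= eps) <= 2 exp(-t^2 / 2)
#   Chebychev with sigma^2 <= 1:   P(|Xbar| >= eps) <= 1 / t^2
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = 2 * np.exp(-t**2 / 2)
c = 1 / t**2
print(np.column_stack([t, h, c, h / c]))   # the ratio h/c tends to 0 in the tails
```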
5. Suppose .\ : R � R has .\(0)
� oo.
0, is bounded, and has a hounded second derivative .\". Show that if X1, . . . , Xn are i.i.d., EX1 = f-1. and Var X1 = a2 < oo, then
! •
=
E.\( X - p.) = >.'(0)
;,/!; (!) +
0
as
n � oo.
Hint: JiiE(.\ ( I X - l•ll - .\( 0)) = E>.'(O)Jii i X - P.l + E
IX - 1'1 < IX - 1'1· The last term is .\'(O)a J== lzl
('·;· (X - p.)(X - t•?)
:S s upx l.\''(x) la2 jn and the first tends to
Problems for Section 5.2 1. Using the notation of Theorem 5.2.1, show that
2. Let X₁, ..., Xₙ be i.i.d. N(μ, σ²). Show that for all n ≥ 1 and all ε > 0,

sup_{(μ,σ)} P_{(μ,σ)}[ |X̄ − μ| ≥ ε ] = 1.

Hint: Let σ → ∞.
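A Monte Carlo sketch of the point of this problem (not part of the text; n = 25 and ε = 0.1 are arbitrary illustrative choices): for fixed n, P_{(μ,σ)}[|X̄ − μ| ≥ ε] approaches 1 as σ grows, so the convergence X̄ → μ is not uniform over the parameter space.

```python
import numpy as np

rng = np.random.default_rng(8)
n, mu, eps = 25, 0.0, 0.1

for sigma in [1.0, 10.0, 100.0, 1000.0]:
    xbar = rng.normal(mu, sigma, size=(20000, n)).mean(axis=1)
    print(sigma, round(np.mean(np.abs(xbar - mu) >= eps), 3))
```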
3. Establish (5.2.5).
1
Hint: liJn - q(p)l > < => IPn - P I > W- ( <).
4. Let (Uᵢ, Vᵢ), 1 ≤ i ≤ n, be i.i.d. with distribution P ∈ P.

(a) Let γ(P) = P[U₁ > 0, V₁ > 0]. Show that if P = N(0, 0, 1, 1, ρ), then

ρ = sin( 2π( γ(P) − ¼ ) ).

(b) Deduce that if P is the bivariate normal distribution, then

ρ̃ = sin( 2π( n⁻¹ Σ_{i=1}^n 1(Xᵢ > X̄) 1(Yᵢ > Ȳ) − ¼ ) )

is a consistent estimate of ρ.

(c) Suppose ρ(P) is defined generally as Cov_P(U, V)/√(Var_P U · Var_P V) for P ∈ P ≡ {P : E_P U² + E_P V² < ∞, Var_P U · Var_P V > 0}. Show that the sample correlation coefficient continues to be a consistent estimate of ρ(P) but ρ̃ is no longer consistent.
.
5. Suppose X1 , . . , X
n
are
i.i.d.
N(IJ, o-5) where ao is known.
(a) Show that condition ( 5.2.8) fails even in this simplest case in which X
Hint: sup
u. r
n "' 1.. n Lt= 1
((Xi-J.l)2 - (1 (J.t-J.to)2) ) +
,.2 0
a-1 0
=
o:J.
.& IJ is clear.
(b) Show that condition (5.2. 14)(i),(ii) holds. Hint: K can be taken as [-A, A], where A is an arbitrruy positive and finite constant.
6. Prove that (5.2.14)(i) artd (ii) suffice for consistency. 7. (Wald) Suppose B
�
p( X, B) is continuous, B E R and
(i) For some <(Bo) > 0
Eo, sup {ip(X,B') - p(X,B)i : IB - B' I < <(Bo)} < oo. (ii) Eo, inf{p(X, B) - p(X, Bo) : I B - Bol > A } > 0 for some A
< oo.
�
Show that the maximum contrast estimate B is consistent. Hint: From continuity of p, (i) and the dominated convergence theorem, .
i m Eo, sup{ ip(X,B') - p(X,B) i : B' E S'(B, o)} � o l5-o where S( B, o) is the o ball about B. Therefore, by the basic property of maximum contrast estimates, for each B f> B0, and < > 0 there is o(B) > 0 such that
Eo, inf{p(X, B') - p(X, B0) : B' E S(B, i (B))) > <.
By compactness there is a finite number 81 , . . , Or of sphere centers such that .
K n {B : IB - Bol > .\}
c
T
U S(B, J (B, )) j= 1
Now
inf
1 n
-n L {p(X. , B) - p(X;, Bo)} : B E K n {I : 18 - Bol > .\ . ·�
1
'' •
'
'
' '
348
Asymptotic Approximations
2: min
l<j
�n .
n
I; inf{p(X, , O' ) - p(X, Oo)} :
t= l
Chapter 5
B' E S(81, o(8,))
For r fixed apply the law of large numbers.
8. The condition of Problem 7(ii) can also fail. Let X� be i.i.d. N(J.", a-2 ). Compact sets [( can be taken of the form {11'1 < A, ' :;; a :;; 1/<, ' > 0}. Show that the log likelihood tends to oo as a ---+ 0 and the condition fails.
'
9. Indicate how the conditions of Problem 7 have to be changed to ensure uniform consis tency on K. •
l
10. Extend the result of Problem 7 to the case 8 E RP, p >
1.
"
Problems for Section 5.3
I. Establish (5.3.9 ) in the exponential model of Example 5.3.1. 2. Establish (5.3.3) for j odd as follows: (i) Suppose Xf, . . . , X� are i.i.d. with the same distribution as X1, 1 pendent of them, and let X' = n- EX;. Then EIX - 1'11 < EI X .
(ii)
,
•
Xn but inde
- X' 11, If are i.i.d. and take the values ±1 with probability � . and if c1, , Cn are con stants, then by Jensen's inequality, for some constants Mi, ti
•
n
E L ci€i i=l (iii) Condition on IX; (iv)
.
'
<
n
EJ�I L "''' i=l
j+l < -
M· 3
•
•
t
n
Lc� i=l
•
- X; i , i = 1, . . . , n, in (i) and apply (ii) to get .
t
E II:� 1 (X; - X:JI' < M; E [2::� 1 (X; - X:J2] ' < M1n� E [,; 2::�-1 (X; - X;) 2] > < M;n1 E (,; L:iX; - X;!l) l'lj.
<
M;n > EIX1 -
3. Establish (5.3.11). Hint: See part (a) of the proof of Lemma 5.3.1. 4. Establish Theorem 5.3.2. Hint: Taylor expand and note that if i1 + · · · + id = m
d E II (Yk - JJ.d' k=l
d
<
<
1 m m + L ElY• - l'•lm k=l Cmn-k/2_
'
'
Suppose ad >
0, 1 < j < m, "L,�= I ii = m then a�1 ,
•
•
•
m
m
, a�d < [max(a1, . . . , ad)] m <
L aj j =l
m
< mm - I L aJ. J=l 5. Let X1 ,
•
•
•
, Xn be i.i.d. R valued with sup { IE( X;, . . .
6. Show that if E IXd i < oo, j
EX1 = 0. Show that
, X;, ) I : i,, . . >
.
, i1; j
2, then E I X1
�
Hint: By the iterated expectation theorem
E I Xt
�
= 1, . . . , n} = E I Xt l ·
!Lii < 2i E I X1IJ.
P·lj = E{I X, � ILl' I I Xd > IILI}P( IXtl > ll•ll +E{I X, l"lj I I Xt l < IILI}P(I Xt l < ll"l l · �
7. Establish
5.3.28.
8. Let XJ , . . . , Xn1 be i.i.d. F and Y]. , . . . , Yn2 be i.i.d. C, and suppose the X's and Y's are independent.
(a) Show that if F and G are N(/Lt afl and N(!L2 , aD, respectively, then the LR test of H : uf = a� versus K : af # a� is based on the statistic sT /s�, where sT = (nt 1)- t 2::7' 1 (X; X) " s� = (n2 1)- 1 2::7' 1 (1-j Y)" . ,
�
�
,
�
(b) Show that when F and .Fk,m distribution with k = n1
G are nonnal as in part (a), then (si/af)f(s�fa�) has an
�
(c) Now suppose that F and
and that 0
Ck,m = 1
�
1 and rn =
n2 - 1.
G are not necessarily nonnal but that
< Var( Xf) < oo. Show that if m = .\k for some .\ > 0 and +
�<(k + m) ZI -a , K, = Var [( X1 km
-
/Jl ) /0"1] 2 ,
Then, under H : Var(Xt) = Var(Y1), P (s[! s� :;
(d) Let Ck ,m be ck,m with
K.,
/JI
ck,m) � 1
= E(XI ), a21 = Var (Xt ) �
a
as k �
.
oo.
replaced by its method of moments estimate. Show that under the assumptions of part (c), if 0 < Exr < oo, PH (st/s� < Ck ,=) --+ 1 - a as k ----t 00.
i
:;
350
Asymptotic Approximations
Chapter 5
(e) Next drop the assumption that G E 9. Instead assume that 0 < Var(Y?) < :x: . Under the assumptions of part (c), use a normal approximation to find an approximate critical value qk,m (depending on �· , � Var[ (X 1 - JL 1 )/ CJr]2 and ><2 = Var[( X2 - JL2)/CJ2)2 such that PH(si/s� < Qk,m ) ----+ 1 - a as k _____. oo .
•
(0 Let iik,m
K:t
and K2 replaced by their method of moment estimates. Show that under the assumptions of part (e), if 0 < EX� < oo and 0 < EY18 < oo, then P(si/s� :::; fik,m) ----+ 1 - a as k -----> oo. '' '
•
' �•
' •
' •
' '
be Qk,m with
9. In Example 5.3.6, show that (a) If l't = 1'2 � 0, CJ1 = o-2 = I, vn((C - p), (iJ/ - 1), (ui - l))T has the same asymptotic distribution as n1 [n- 1 EX1Y;: - p, n- 1 EX[ - 1 , n-I EY? - lJT . (b) If (X, Y) � N(Jtl, /l2 , "1, ":f, p), then jil(r 2 - p2) _s N(O, 4p2 (I - p2 ) 2 ) and, if p ol 0, then jn(r - p) � N(O, (1 - p' J2). (c) Show that if p = 0, jil(r - p) � N(O, 1). Hint: Use the central limit theorem and Slutsky's theorem. Without loss of generality, JL 1 = JL2 = 0, cri = cri = 1.
(���)
10. Show that � log is the variance stabilizing transformation for the correlation coefficient in Example 5.3.6. I + I · l - l H'mt.· Wnte (1 p)2 2 1-p l+P . _
(
)
11. In survey sampling the model-based approach postulates that the population {xi , · . . , xN} or {(ui , xi), . . . , (uN, XN) } we are interested in is itself a sample from a superpopulation that is known up to parameters; that is, there exists TI , . , TN i.i.d. Po, () E 8 such that Ti = ti where ti = (ui,xi), i = 1, N In particular, suppose in the context of Example 3.4.1 that we use Ti 1 Tin, which we have sampled at random from {t 1 , tN}, to estimate x = � E� 1 Xi. Without loss of generality, suppose ij = j, 1 < j < n. Consider as estimates .
. . . 1
1
1
•
•
(1.) X = � -
•
•
•
.
.
1
•
x,+ n+X. when 7: -= ( ..
�
t
" Uz,
t
X )·
-
(ii) X R = bopt (U - u) as in Example 3.4.1, Show that. if � then
----+
,\ as N
----+
oo, 0 < ...\ < 1, and if EX'f < oo (in the supermodel),
(a) jii(X - x) ..S N(o, 72 (1 - .>,)) where 7 2 = Var(XI ). (b) Suppose Po is such that X I = bUi+t::i , i = 1, . . . , N where the Ei are i.i.d., Et::i = 0, Var(<;) = o-2 < oo and Var(U) > 0. Show that jii(XR -x) ..S N(O, (l-A)o-2 ) , "2 < Hint: (a) X - X = (1 - � ) (X - Xc) wliere xc = N�n Et n+I xi . � (b) Use the delta method for the multivariate case and note (bopt - b)(U - u) -
72
-
(
op n
- >'
).
' •
�
]
'
l '
Section 5.6
351
Problems and Complements
12. (a) Suppose that
EJYd3 < oo. Show that JE(Y. - l'a)(Y, - l'b)(Y, - /'c ) J < Mn-2
(b) Deduce formula (5. 3 .1 4 ) Hint: If U is independent of (V, W), EU .
13. Let Sn have a
X� distribution.
(a) Show that if n is large,
..;s;:
-
=
0, E(WV) <
oo,
then E(UVW )
.Jii has approximately a N(0,
=
0.
�) distribution. This
is known as Fisher's approximation. (b) From (a) deduce the approximation P[Sn < x] "' 1>( J2X y'2ri} (c) Compare the approximation of (b) with the central limit approximation P[Sn < x] = .P((x - n)/ffn) and the exact values of P[Sn < x] from the x' table for x = xo.go, X = XQ.99, n = 5, 10, 25. Here Xq denotes the qth quantile of the X; distribution.
X1 , . . . , Xn is a sample from a population with mean Jl-, variance a2 , and third central moment Jl-3 . Justify fonnally 3 E [h(X ) - E(h(X ))J' = __!, [h'(l') ] 311a + h"(I'Jih' (l' ) ] 2u4 + O(n-3 ). n2 n 14. Suppose
Hint: Use (5.3.12). 15. It can be shown (under suitable conditions) that the nonnal approximation to the distri bution of h( X) improves as the coefficient of skewness 'YI n of h(X) diminishes.
(a) Use this fact and Problem 5.3.14 to explain the numerical results of Problem 5.3. 13(c). (b) Let Sn X� · The following approximation to the distribution of Sn (due to Wilson and Hilferty, 1931) is found to be excellent ,......
Use (5.3 .6 ) to explain why. 16. Normalizing Transformation for the Poisson Distribution. Suppose X1 , . . , Xn is a sample from a P(.\) distribution. .
(a) Show that the only transformations h that make E[h(X) - E(h(X))J' up to order 1/n2 for all ,\ > 0 are of the form h(t) = ct2i3 + d.
=
0 to terms
(b) Use (a) to justify the approximation
•
17. Suppose X�, . . . , Xn are independent, each with Hardy-Weinberg frequency function f given by
352
Asymptotic Approximations
X
J(x)
0 B
I 2B(I - B)
Chapter 5
2 (I - B)2
where 0 < e < I. (a) Find an approximation to P[X < tJ in terms of fJ and t.
(b) Find an approximation to P[VX < t) in terms of B andt. (c) What is the approximate distribution of y'ii( X -- Jl.) + X2 , where J1. = E(X1 )?
18. Variance Stabilizing Transfo111U1tion for the Binomial Distribution. Let X1 , Xn be the indicators of n binomial trials with probability of success B. Show that the only variance stabilizing transformation h such that h(O) = 0, h( l ) = 1, and h'(t) > 0 for all t, is given by h(t) = ( 2/7r)sin- 1 (vt). .
.
•
1
19. Justify formally the following expressions for the moments of h(X, Y) where (X, , Yi ), . . . , (Xn, Yn) is a sample from a bivariate population with E(X) = I' I · E(Y) = J1.2, Var(X) = uf, Var(Y) = u�, Cov (X, Y) = pu1u2 . (a)
1 ) ( Y)) O(n = E(h(X , h J1.1, Jt)2 + - ).
(b) Var(h(X, Y)) "' �{[h, (Jl.t, 1'2)) 20'7 n 2 2 (n + +2h, (Jt" 1'2 lh2 (J1.,, J1.2)pu, "'2 + [h2 (J1." 1'2ll ,.n o - ) where
a a h1(x, y) = &,h(x,y), h2(x,y) = ayh(x,y) .
Hint: h(X, Y) - h(Jl.t, !'2 ) = h, (Jl.t, J1.2)(X - Jl.t ) + h2 (J1.1, J1.2)(Y - J1.2 ) + 0(n - t ) . 20. Let Bm,n have a beta distribution with parameters m and n, which are integers. Show that if m and n are both tending to oo in such a way that m/ (m + n) ---> a, 0 < a < I,
then
(Bm,n - mj (m + n)) x v' < m+n P y'a(l - a)
___,
'li (x) .
. - - -1 where X,, . . . , Xm, Y1 , . . . , Yn - Hmt: Use Bm,n = (mX/nYJII + (mXjnY)) independent standard exJX>nentials.
are
21. Show directly using Problem B.2.5 that under the conditions of the previous problem, if m/(m + n) - a tends to zero at the rate 1/(m + n) 2 , then <>(1 - a) m E(Bm 'n) = , Var Bm 'n = + Rm 'n n + m m+n
where Rm,n tends to zero at the rate 1/(m + n) 2 .
< <
i
Section 5.6 Problems and Complements
353
22. Let Sn ,..,_, X�· Usc Stirling's approximation and Problem 8.2.4 to give a direct justifi cation of where Rn / fo -+ 0 as in n -+
oo.
Recall Stirling's approximation:
(It may be shown but is not required that I foR, I is bounded.) n
23. Suppose that X1 , . . . , X is a sample from a population and that h is a real-valued func tion of X whose derivatives of order k are denoted by h(k), k > I. Suppose IM41(x)l < M for all x and some constant M and suppose that 114 is finite. Show that Eh(X) h(Jl.) + ih(21 (Jl.) + : + Rn where I Rn l < M31 (Jl.)IJ1.3I/6n2 + M(J1.4 + 30'2) /24n2 Hint: "
Therefore,
24. Let
X 1 , • . • , Xn be a sample from a population with mean J1 and variance a2 Suppose h has a second derivative h<2) continuous at fL and that h(l)(Jl) = 0.
< oo.
(a) Show that fo[h(X)-h(Jl.)] � 0 while n[h(X -h(Jl.)] is asymptotically distributed as iM'I(J1.)0'2V where V � XI·
(b) Use part (a) to show that when J1. = �. n[X(l - X) - Jl.(l - Jl.)] !:. -0"2V with V xi . Give an approximation to the distribution of X ( 1 - X) in tenns of the xi distribution function when J1 = �. "'
25. Let XI> . . . , Xn be a sample from a population with 0'2 = Var(X) < oo, J1. = E(X) and let T = X2 be an estimate of p,2• (a) When J1. of 0, find the asymptotic distribution of fo(T-Jl.') using the delta method.
(b) When J1. = 0, find the asymptotic distriution of nT using P(nT < t) = P(-Vt < foX S Vt). Compare your answer to the answer in part (a). (c) Find the limiting laws of fo(X - Jl.)' and n(X - J1.)2.
354 26.
Asymptotic Approximations Show that if
xl ' . .
Chapter 5
, XII are i.i.d. N(p, cr2), then
.
£ N(O, O, Eo) vc:n(X - p,(f2 - a2 ) � where E0 = diag (a2 , 2a4 ) . -
Hint:
27.
Use
(5.3.33) and Theorem 5.3.4.
(X1 , Yt ), . . . , (Xn, Yn ) are n sets of control and treatment responses in a
Suppose
matched pair experiment. Assume that the observations have a common N(JJ.1 , J.t2, CYi, a�, p) distribution. We want to obtain confidence intervals on
t
IL2 - J-LI = �-
Yi - Xi we treat X1 ,
of using the one-sample intervals based on the differences . . . , Yn as separate samples and use the two-sample t intervals
Y1 ,
n
Analysis for fixed
is difficult because
n - oo and (a) Show that
no longer has a
is given by
(4.9.3), then limn P[t. E In] > I -
<>
where the right-hand side is the limit of .JTi times the length of the one-sample based on the differences.
Hint:
i t
28.
Suppose
Xt, . . . , Xn1
E(XI ), a/
=
and
Yt,
.
.
ni (n � >., 0 <
' '
(b)
, Yn2
Deduce that if ,\
<
=
4.9.3
are as in Section
E(YI ), and u� =
T(Ll.)
,\ < I .
(a) Show that P[T(t.) '
.
Var(XI), /"2
the behavior of the two-sample pivot
'
.
, Xn ,
<>.
)
t interval
(a), (c) Apply Slutsky's theorem.
with I'I =
I
•
is the length of the interval In,
-
i
.
(4.9.3). What happen s? '12n-2 distribution. Let
Jlnl vnllnl � 2vu1 + a�z(1 �") > 2( Jar + a� - 2pawz)z(1 �
(c) Show that if
I
.
(t [1 - 1�',"�fJ l).
P[T(t.) < t] �
(b) Deduce that if p > 0 and In
T(�)
Suppose that instead
Var(YJ ). We want to study
of Example 4.9.3. if n� , n2 ---+ oo. so that
t) �
= �
or
a1
=
independent samples
- >.)a/ + .\a�)]:).
az, the intervals (4.9.3) have correct asymptotic
probability of coverage.
(c) Show that if a� >
of coverage < •
'
I
'I !
'
'
'I ' '
1
-
a,
a/ and >. > 1 - ,\, the interval (4 9 3) has asymptotic probability ,
.
whereas the situation is reversed if the sample size inequalities and
variance inequalities agree.
(d) Make a comparison of the asymptotic length of (4.9.3) the pivot jD - t.j(so where D and so are as in Section 4.9.4.
29. Let T = (D - t.) /so
that
E(Xf) <
(a) Show n2 ---+ oo.
where
D, t. and so
and the intervals based on
are as defined in Section
4.9.4.
Suppose
oo and E(Y/) < oo.
that
T has
asymptotically a standard normal distribution as
n1 ---+
oo and
'
,
l .
I
I I
'
l
'
,
!
'
l
Section 5.6
355
Problems and Complements
(b) Let k be the Welch degrees of freedom defined in Section 4.9.4. Show that k � ex: as n 1 ----.) oo and n2 oo. ---+
P,2 in favor of (c) Show using parts (a) and (b) that the tests that reject H : Ji.I K : f..L2 > p,1 when T > tk( l - a) , where tk (l - a) is the critical value using the Welch =
approximation, has asymptotic level a.
(d) Find or write a computer program that carries out the Welch test. Carry out a Monte Carlo study such as the one that led to Figure 5.3.3 using the Welch test based on T rather than the two-sample t test based on Sn. Plot your results. 30. Generalize Lemma 5.3.3 by showing thai if Y 1 , . . . , Y E Rd are i.i.d. vectors and EIYt l k < oo, where I · I is the Euclidean norm, then for all integers k: n
where C depends on d, ElY 1 [ k and k only. Hint: If [x[1 � L;:� l [xj [, x � (x 1 , . . . , xdf and [xl is Euclidean distance, then there exist universal constants 0 < Cd < cd < 00 Such that cdlx l l < l xl < Cd l x! J .
31. Let X1 , . . . , Xn be i.i.d. as X � F and let I' = E(X),
has a X� - I distribution when F is theN(JJ, (j2) distribution. (a) Suppose E(X4) <
oo.
(b) Let Xn-1 (a) be the ath quantile of x;_J• Firld approximations to P(Vn < Xn-1 (a)) and P(Vn :'> Xn-l (I - a)) and evaluate the approximations when F is Ts . Hint: See Problems B.3.9 and 4.4.16. . (c) Let R be the method of moment estimaie of"" and let
Va � (n - 1) + y'i<(n - l)z (a).
Show that ifO
<
EX8 < oo, then P(Vn < Va,) ----.) a as n ---+ oo.
32. It may be shown that if Tn is any sequence of random variables such that Tn £ T and
if the variances ofT and Tn exist, then lim infn Var(Tn)
>
Var(T). Let
where X is uniform, U( - 1 , 1). Show that as n ---+ oo, Tn £ X, but Var(Tn) --+ 00. 33. Let X;j (i
�
I, . . . ,p; j
�
I,..
.
, k) be independent with X,,
�
N(J';, u2).
(a) Show that the MLEs of l'i and <12 are
k
Iii
=
p
k
k- 1 L X;j and iT2 = (kp)- 1 L L (X,j - /1; )2
j= l
i=l j=l
356
Asymptotic Approximations
Chapter 5
(b) Show that if k is fixed and p � oo, then <72 !. (k - 1)cr2(k. That is the MLE jl2 is not consistent (Neyman and Scott, 1948).
'
! •
(c) Give a consistent estimate of a2 . Problems for Section 5.4
1. Let X1, !/! : R� R
.
. , Xn
.
be
i.i.d. random variables distributed according to P E P. Suppose
(i) is monotone nondecreasing (ii) ¢(-oo)
<
0 < .P(oo)
'
(iii) I.PJ(x) < M < oo for all x. (a)
Show that (i), (ii), and (iii) imply that O(P) defined (not uniquely) by
i
Ep,P(X1 - O(P)) 2 0 > Ep!/J(Xt - II ) , ali i/ > O(P)
is finite. (b) Suppose that for all P E P, O(P) is the unique solution of Ep,P(X1 - 0) = 0. Let On = O(P), where P is the empirical distribution of X1, . . . , Xn. Show that Bn is consistent for O(P) over 'P. Deduce that the sample median is a consistent estimate of the population median if the latter is unique. (Use ,P(x) = sgn(x).) Hint: Show that Ep1jJ(X1 - B) is nonincreasing in B. Use the bounded convergence .......
.....
-..
-..
theorem applied to ,P (Xt - 0) !, !/!( -oo) as 0 � oo.
(c) Assume the conditions in (a) and (b). Set (.(0) - Ep>jJ(Xt - 0) and T2(0) Varp¢(X 1 - 0). Assmne that ,\'(0) < 0 exists and that
I'
1 VnT
I ! ' •
i
-
I
'
.
•
i
• •
!•
I
[!/! (X; - On) - A(On)] !:, N(O, 1) � (O)
c
..fti(On - 0) � N Hint: P( ..jii(On - 0)) < t) = P(O < On)
(d) Assume part (c) and A6. A'(O) = Cov(,P(X, - 0), f(X,)).
(0,
T2(0) [A'(O)J'
)
I
1
i
•
•
•
'
i'
'
I I
, I
for every sequence {On} with On = 0 + t(..jii for t E R. Show that
'
I
n
!
'
I
'
'
.
'
n = P ( - 2::. �1 ,P(X; - On) < 0).
Show that if f(x)
(e) Suppose that the d.f. F(x) of X, is continuous and that f (O)
l
F' (x)
exists, then
•
\
' •
=
F'(O) exists. Let
X denote the sample median. Show that, under the conditions of (c), ..jii(X - 0) !:.
N(O, 1/4/'(0)).
� -----------------------------------------------------.... •
Problems and Complements
Section 5.6
�
.
(0 For two estimates 01 and
..-.
357 ..-.
.
c
.
J
B2 With .,jii(B1 - B) � N(O, cr]). = I, 2, the ..-.
..-.
of 81 with respect to 82 is defined as ep(81, 82 ) = then = 1r/2. ..-.
relative efficiency
..-.
-
a?fo.Y.
asymptotic
Show that if
N(J", cr2) , ep(X, X ) (g) Suppose X1 has the gross error density j,(x - B) (see Section 3.5) where f,(x) = ( I c)
-
r
2. Show that assumption A4' of this section coupled with A0-A3 implies assumption A4.
Hint:
( f!,P f!, P X X Es B(p) ( : [ t -B(p)[ < en ,, ( ,,t) L ;:; f!B f!.p ) f! f! { , P < Esup f!B (X�, t) - f!B,P (X,, B(p)) : [ t - B(p)[ < En } . up
I
"
t= l
Apply A4 and the dominated convergence theorem B.7.4. 3, Show that A6' implies A6. Hint:
b<
oo,
ff0 f.p(x,B)p(x,O)dl"(x) = J g,(.P(x, B)p(x,B))dJ"(x) if for all -oo < a <
[ J -fo (,P(x, B)p(x,B)d!"(x)) J ,P(x,b)p(x,b)dJ"(x) - J .p(x,a)p(x,a)dJ"(x). =
Condition A6' permits interchange of the order of integration by Fubini's theorem (Billings ley, 1979) which you may assume.
X, , . . . , Xn be i.i.d. U(O , B), B > 0. (a) Show that g� (x, B) = i for 0 > x and is undefined for B < x. Conclude that g�(X,B) is defined with Pe probability I but f!l I f!l Ee De(X, B) = - 7" 0, Yare (X , B) = 0. B f!B (b) Show that if B = max(X�o ... ,Xn) is the MLE, then .Ce(n(B - B)) � [ (!/B). Thus, not only does asymptotic nonnality not hold but 8 converges B faster than at rate with I(B) = oo, not 0! n-112• This is compatible ( P [n(B B) < x[ = I - ! - ,:'e ) � I - exp(-x/B ) . 4. Let
-
Hint:
e
-
"
358
Asymptotic Approximations
Chapter 5
5. (a) Show that in Theorem 5.4.5
p (x., eo + -;in) log � p(X;, Oo ) n
i.
�
'Y
n
& fo � &e log p(X , B ) ,
•
0
•
' •
,
'
' '
'
'
under Pe0 , and conclude that
'
' '
"
(b) Show that
i '
I
p (X;, Oo + -;)n) log � p(X;,Oo)
(
2 p (X;, Oo + Jn ) 2l(O) . log tl(00),"f � N � p(X;,Oo)
n
£,,+7.<
)
'
!
' ;
1i
(c) Prove (5.4.50). Hint: (b) Expand as in (a) but around 00 (d) Show that P,
o
[ + Vn -"-
n L:,·-t log -
+ -;)n. p( X.,Oo + ;l;;n )
using (b) and Polya's theorem (A.l4.22).
p(X.,'o )
= Cn
]
___,
0 for any sequence { Cn } by
6. Show that the conclusions of Prqf:>le� 5.4.5 continue to hold if ' '
' ' •
'
I j
'
;
is replaced by the likelihood ratio statistic
'
I
1
p(X;, Oo + -;)n) � L., log ---,-,-;-;,-i'-''p(X;, Bo) i�l
I
I
'
.
I
I
'
I '
'
7. Suppose A4', A2, and A6 hold for .p � &l(&B so that I(B) < oo. Show that B 1(0) is continuous. Hint: () ---+ g;� (X, 8) is continuous and �
sup
E,g;!(X,O)
-!(B)
and
{ :;; (X,B') • JO - B'J <
if
I
�
------
Section 5.6 Problems and Complements
359
8. Establish (5.4.54) and (5.4.56). Hint: Use Problems 5.4.5 and 5.4.6.
9. (a) Establish (5.4.58). Hint: Use Problem 5.4.7, A.l4.6 and Slutsky's theorem. (b) Suppose the conditions of Theorem 5.4.5 hold. Let B� be as in (5.4.59). Then (5.4.57) can be strengthened to: For each Bo E 8, there is a neighborhood V(Bo) of 80 1 - a. such that limn sup{Po[B� < BJ : B E V(Bo)} �
10. Let
�
n
&2 1 � 1 I � - - L ' (X;, B). &B n i=l
(a) Show that under assumptions (AO)-(A6) for 1j! consistent estimate of I ( fJ).
�
at all B and (A4 ), f is a '
(b) Deduce that
is an asymptotic lower confidence bound for fJ. (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5.4.59) coincide. 11. Consider Example 4.4.3, setting a lower confidence bound for binomial (}. -
�·
(a) Show that, if X = 0 or 1 , the bound (5.4.59), which is just (4.4.7), gives B � X. Compare with the exact bound of Example 4.5.2. (b) Compare the bounds in (a) with the bound (5.4.61), which agrees with (4.4.3), and give the behavior of (5.4.61) for X � 0 and 1 . 12. (a) Show that under assumptions (AO)-(A6) for all B and (A4 ) '
.
_ 8• + Op n- II ' B• ( ) -nj - -n
for j
�
1 , 2.
-
13. Let Bn 1 , Bn2 be two asymptotic Ievel l - a lower confidence bounds. We say that Bn 1 is asymptotically at least as good as Bn2 if, for all f > 0 -
-
-
Show that 8�1 and, hence, all the B�i are at least as good as any competitors. Hint: Compare Theorem 4.4.2.
360
Asymptotic Approximations
Xn are i.i.d. inverse Gaussian with parameters J.L and A, where Jl is known. That is, each Xi has density 1/ ,\x 2 ,\ x exp - 21-1 + Jl. ; > 0; J-l > 0; ,\ > 0. 2rrx3 2x 2 14. Suppose that
X1,
Chapter 5
. • .
,
( )
.\ .\ }
{
(a) Find the Neyman-Pearson (NP) test for testing H : ,\ (b) Show that the NP test is UMP for testing H
:
=
Ao versus K : A
<
A0 .
.\ > >.o versus K >. < >.o. :
(c) Find the approximate critical value of the Neyman-Pearson test using a normal approximation. (d) Find the Wald test for testing H : ,\ = .\o versus K
:
.\
.\o.
<
(e) Find the Rao score test for testing H : ,\ = ..\o versus K
:
..\ <
1
Ao.
15. Establish (5.4.55). Hint: By (3.4.10) and (3.4.11 ) , the test statistic is a sum of i.i.d. variables with mean zero and variance 1(0). Now use the central limit theorem. Problems for Section 5.5 1. Consider testing H : J.t = 0 versus K : 11- =j:. 0 given X1, . . . , Xn i.i.d. Consider the Bayes test when p, is distributed according to 1r such that
N(Jt, 1).
1 > rr({O}) = .\ > 0, rr(p 7" 0) = 1 - .\ and given I' 7" 0, I' has aN(O, T2) distribution.
!
(a) Show that the posterior probability of {0} is
1 n where mn(.fii,X) = r(l + r2)- 12rp
( (I+��;)l/2).
Hint: Use Examples 3.2.1 and 3.2.2.
i I
-
(b) Suppose that I' = 0. By Problem 4.1.5, the p-value fJ = U(O, 1) distribution. Show that (J I'. 1.
'
I
2[1 -
(c) Suppose that Jl = 8 > 0. Show that jj;(J I'. oo. That is, if H is false, the evidence against H as measured by the smallness of the p-value is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "para dox"). 2. Let X" . . . , Xn be i.i.d. N(Jl, 1) . Consider the problem of testing H K ; J1. > �. where .6. is a given number.
:
Jl
I
•
E
[0, LJ.] versus
(a) Show that the test that rejects H for large values of y'n(X - LJ.) has p-value p =
'
'
'
Section 5.6
361
Problems and Complements
(b) Suppose that 1-L has a N(O, 1 ) prior. Show that the posterior probability of H is where an = nf (n + 1).
-
�. -yn(anX - �)/ Ja,; (Lindley's "paradox" of Problem 5.1.1 is not in effect.) (c) Show that when
Jl =
c
�
N(O, 1 ) and p
c
�
U(O, 1).
(d) Compute plimn�oo p/pfor p. fo �. (e) Verify the following table giving posterior probabilities of [0, �] when ynX 1 .645 and p = 0.05. 10 .029 .058
n 1'> = 0. 1 "' = 1.0
20 .034 .054
50 .042 .052
=
100
.046 .050
3. Establish (5.5.14). Hint: By (5.5.13) and the SLLN,
logdnqn(t) t2 -2
=
(
(�
n &2/ &2/ ) t ) I e I(O) - f1 88, (x,, • (t)) - 8 , (X,,B0) + logn 0 + Vn n 0
•
Apply the argument used for Theorem 5.4.2 and the continuity of n (B). 4. Extablish (5.5.17). Hint:
<
n l
n
&' { f1 sup &B2 l(X,, B') : iB' - B[ < o}
�
if [t[ < oyn. Apply the SLLN and o ous at 0 = 0.
E9sup
{ ;;,z(X,,B')
: [B - B'l < o} continu
5, Suppose that in addition to the conditions of Theorem 5.5.2, J B'n(B)dB < yn(E(9 I X) - B) � 0 a.s. Pe. Hint: In view of Theorem 5.5.2 it is equivalent to show that dn
J tqn(t)dt
�
oo.
Then
0
M a.s. P,. By Theorem 5.5.2, J_ M tqn(t)dt � 0 a.s. for all M < oo. By (5.5.17), :J'lfo J:J'lfo [t [ exp -iJ(B)'; dt < ' for M(<) sufficiently large, all [ < [t qn(t)dt J
{
}
l
Asymptotic Approximations
362
Chapter 5
'
<>
0. Finally, n
l)l(X,, t) - l(X,})) 7r(t)dt. i=l
,,
'
'
Apply (5.5.16) noting that
I
foc"'''·0 1
6. (a) Show that sup([qn (t)
-
� 0 and
I• (O)
'
:
J [t[1r(t)dt < co. [t[ <
M} � 0 a.s. for all e.
(b) Deduce (5.5.29). Hint: {t : ,jl(Bj
we must have Cn =
I
',!.
c(z1_ � [l(B)n)-112)(! + op(!)) by Theorem 5.5.1.
and A5. Show that (5.5.8) and in
d.
�
7. Suppose that in Theorem 5.5.2 we replace the assumptions A4(a.s.)
•
/ in
and A5(a.s.) by A4
(5.5.9) hold with a.s. convergence replaced by convergence
Po probability.
5.7
NOTES
Notes for Section 5.1

(1) The bound is actually known to be essentially attained for Xᵢ = 0 with probability pₙ and 1 with probability 1 − pₙ, where pₙ → 0 or 1. For n large these do not correspond to distributions one typically faces. See Bhattacharya and Ranga Rao (1976) for further discussion.
Notes for Section 5.3

(1) If the right-hand side is negative for some x, F̃ₙ(x) is taken to be 0.
'
!' '
I' :
,I ' '
.
l
' '
•
'
' ' '
•
Notes for Section 5.4

(1) This result was first stated by R. A. Fisher (1925). A proof was given by Cramér (1946).

(2) Computed by Winston Chow.

Notes for Section 5.5

(1) This famous result appears in Laplace's work and was rediscovered by S. Bernstein and R. von Mises—see Stigler (1986) and Le Cam and Yang (1990).
-------
'
I
i
i
' •
Section 5.8
5.8
363
References
REFERENCES
BERGER, J., Statistical Decision Theory and Bayesian Analysis New York: Springer-Verlag, 1985. BHATTACHARYA, R. H. AND R. RANGA RAO, Normal Approximation and Asymptotic Expansions New
York: Wiley, 1976. BILLINGSLEY, P., Probability and Measure New York: Wiley, 1979. Box, G. E. P., "Non-normality and tests on variances," Biometrika, 40, 3 1 8-324 (1953).
CRAMER, H., Mathematical Methods of Statistics Princeton, NJ: Princeton University Press, 1946.
DAVID, F. N., Tables of the Correlation Coefficient, Cambridge University Press, reprinted in Biometrika Tablesfor Statisticians ( 1 966), Vol. I, 3rd ed., H. 0. Hartley and E. S. Pearson, Editors Cambridge: Cambridge University Press, 1938.
FERGUSON, T. S., A Course in Large Sample Theory New York: Chapman and Hall, 1996. FisHER, R. A., "Theory of statistical estimation," Proc. Camb. Phil. Soc., 22, 700-725 (1925).
FISHER, R.
A., Statistical Inference and Scientific Method, Vth Berkeley Symposium, 1958.
HAMMERSLEY, J. M. AND D. C. HANSCOMB, Monte Carlo Methods London: Methuen & Co., 1964.
HoEFFDING, W., ..Probability inequalities for sums of bounded random variables," J. Amer. Statist. Assoc., 58, 13-80 (1963).
HUBER,
Proc.
P. 1., The Behavior of the Maximum Likelihood Estimator Under Non-Standard Conditions, Vth Berk. Syrup. Math. Statist. Prob., VoL I Berkeley, CA: University of California Press,
1967. LE CAM,
1990.
L. AND G. L. YANG, Asymptotics in Statistics, Some Basic Concepts New York:
Springer,
LEHMANN, E. L., Elements of Large-Sample Theory New York: Springer-Verlag, 1999.
LEHMANN, E. L. AND G. CASELLA, Theory of Point Estimation New York Springer-Verlag, 1998.
NEYMAN, J. AND E. L. Scorr, "Consistent estimates based on partially consistent observations," Econometrica, !6, 1-32 (1948).
RA.o, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973. RUDIN, W., Mathematical Analysis, 3rd ed. New York: McGraw Hill, 1987.
SCHERVISH, M., Theory of Statistics New York: Springer, 1995.
SERFLING, R. J., Approximation Theorems of Mathematical Statistics New York: J. Wiley & Sons,
1980.
STIGLER, S., The History of Statistics: The Measurement of Uncertainty BefOre 1900 Cambridge, MA: Harvard University Press, 1986. WILSON, E. B. AND M. M. HiLPERTY, 'The distribution of chi square," Proc. Nat. Acad. Sci., U.S.A., /7, p. 684 (1931).
I
I '
'
'
'
'
!
!"I
I
i
'
'
'
I
'
'
' '
'
'
1 '
Chapter 6
=======
INFERENCE IN THE MULTIPARAMETER CASE

6.1
INFERENCE FOR GAUSSIAN LINEAR MODELS
Most modern statistical questions involve large data sets, the modeling of whose stochastic structure involves complex models governed by several, often many, real parameters and frequently even more semi- or nonparametric models. In this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates, tests, and confidence regions in regular one-dimensional parametric models for d-dimensional models {P_θ : θ ∈ Θ}, Θ ⊂ R^d. We have presented several such models already, for instance, the multinomial (Examples 1.6.7, 2.3.3), multiple regression models (Examples 1.1.4, 1.4.3, 2.1.1), and more generally have studied the theory of multiparameter exponential families (Sections 1.6.2, 2.2, 2.3). However, with the exception of Theorems 5.2.2 and 5.3.5, in which we looked at asymptotic theory for the MLE in multiparameter exponential families, we have not considered asymptotic inference, testing, confidence regions, and prediction in such situations. We begin our study with a thorough analysis of the Gaussian linear model with known variance, in which exact calculations are possible. We shall show how the exact behavior of likelihood procedures in this model corresponds to limiting behavior of such procedures in the unknown variance case and more generally in large samples from regular d-dimensional parametric models, and shall illustrate our results with a number of important examples. This chapter is a lead-in to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non- and semiparametric models. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for function-valued statistics, the properties of nonparametric MLEs, curve estimates, the bootstrap, and efficiency in semiparametric models. There is, however, an important aspect of practical situations that is not touched by the approximation: the fact that d, the number of parameters, and n, the number of observations, are often both large and commensurate or nearly so. The inequalities of Vapnik–Chervonenkis, Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II.
Chapter 6
Notational Convention: In this chapter we will, when there is no ambiguity, let expressions such as θ refer to both column and row vectors.

6.1.1
The Classical Gaussian Linear Model
Many of the examples considered in the earlier chapters fit the framework in which the ith measurement Y_i among n independent observations has a distribution that depends on known constants z_{i1}, ..., z_{ip}. In the classical Gaussian (normal) linear model this dependence takes the form

Y_i = Σ_{j=1}^p z_{ij} β_j + ε_i,  i = 1, ..., n   (6.1.1)

where ε₁, ..., εₙ are i.i.d. N(0, σ²). In vector and matrix notation, we write

Y_i = z_iᵀ β + ε_i,  i = 1, ..., n   (6.1.2)

and

Y = Zβ + ε,  ε ~ N(0, σ²J)   (6.1.3)

where z_i = (z_{i1}, ..., z_{ip})ᵀ, Z = (z_{ij})_{n×p}, and J is the n × n identity matrix. Here Y_i is called the response variable, the z_{ij} are called the design values, and Z is called the design matrix. In this section we will derive exact statistical procedures under the assumptions of the model (6.1.3). These are among the most commonly used statistical techniques. In Section 6.6 we will investigate the sensitivity of these procedures to the assumptions of the model. It turns out that these techniques are sensible and useful outside the narrow framework of model (6.1.3). Here is Example 1.1.2(4) in this framework.
. •
Example 6.1.1. The One-Sample Location Problem. We have n independent measurements Y₁, ..., Yₙ from a population with mean β₁ = E(Y). The model is

Y_i = β₁ + ε_i,  i = 1, ..., n   (6.1.4)

where ε₁, ..., εₙ are i.i.d. N(0, σ²). Here p = 1 and Z_{n×1} = (1, ..., 1)ᵀ. □
The regression framework of Examples 1.1.4 and 2.1.1 is also of the form (6.1.3):
, Zip · We are interested in relating the mean of covariate measurements denoted by Zi2 the response to the covariate values. The normal linear regression model is • . . .
'
I
p }i = f3t + L Zij/3j + €i1 j=2
i = 1, . . , n .
(6. 1 .5)
. l
Section 6.1
367
Inference for Gaussian Linear Models
where (31 is called the regression intercept and /3z, . . . /]p are called the regression coeffi cients. If we set zil 1, -i 1, . , n, then the notation (6.1.2) and (6.1.3) applies. We treat the covariate values Zij as fixed (nonrandom). In this case, (6.1.5) is called the fixed design normal linear regression model. The random design Gaussian linear regression model is given in Example 1.4.3. We can think of the fixed design model as a conditional version of the random design model with the inference developed for the conditional dis D tribution of Y given a set of observed covariate values. .
=
=
.
.
Example 6.1.3. The p-Sample Problem or One- Way Layout. In Example 1.1.3 and Sec tion 4.9.3 we considered experiments involving the comparisons of two population means when we had available two independent samples, one from each population. 1\vo-sample models apply when the design values represent a qualitative factor taking on only two val ues. Frequently, we are interested in qualitative factors taking on several values. If we are comparing pollution levels, we want to do so for a variety of locations; we often have more than two competing drugs to compare, and so on. To fix ideas suppose we are interested in comparing the performance of p > 2 treat ments on a population and that we administer only one treatment to each subject and a sample of nk subjects get treatment k, 1 < k < p, n1 + · · · + np = n. If the control and treatment responses are independent and nonnally distributed with the same variance a2, we arrive at the one-way layout or p-sample model, (6.1.6)
where Ykl is the response of the lth subject in the group obtaining the kth treatment, /3k is the mean response to the kth treatment, and the €kl are independent N(O, � ) random variables. To see that this is a linear model we relabel the observations as Y1, Yn. where Yn1+n2 to Y1 , . . . , Yn1 correspond to the group receiving the first treatment, Yn1 + 1 , that getting the second, and so on. Then for 1 $" j < p, if no = 0, the design matrix has elements: .
•
j- 1
1 if L nk + 1 < i <
k= I
•
.
.
•
,
,
j
L nk
k=l
0 otherwise and
II 0 Z=
0
,,
.
•
.. •
•
0 0
• • •
•
0
0
•
•
•
•
where Ij is a column vector of nj ones and the 0 in the "row" whose jth member is Ij is a column vector of ni zeros. The model (6. 1 . 6) is an example of what is often called analysis of variance models. Generally, this terminology is commonly used when the design values are qualitative. 1 The model (6.1.6) is often reparametrized by introducing a = p- 2::� �1 f3k and Ok /3k - 0: because then ok represents the difference between the kth and average treatment =
368
Inference in the Multiparameter Case
effects, k model is
1 , . . . ,p. In terms of the new parameter {3"' =
Y = Z'(3' + <,
<
Chapter 6
(a, ISh . . , 15v)T, the linear .
� N(O, a2J)
where z;1 (p+ I ) = (1, Z) and l n x 1 is the vector with n ones. Note that Z* is of rank p and that {3* is not identifiable for {3* E RP+ 1 . However, {3* is identifiable in the p-dimensional linear subspace {(3' E RP+ I : L:��I {,k = 0} of w+' obtained by adding the linear restriction E�=I J"k = 0 forced by the definition of the Jk 's. This type oflinear model with the number of columns d of the design matrix larger than its rank r, and with the parameters identifiable only once d r additional linear restrictions have been specified, is common in analysis of variance models. D x
-
!
Even if {3 is not a parameter (is unidentifiable), the vector of means p = (.UI, . . . , JLn ) T of Y always is. It is given by
•
p
'
I
I' =
Z(3 =
L /3jCj
; •
j=l where the Cj are the columns of the design matrix,
!
(Zlj, • • · , Znj)T, j = 1, . . . ,p.
Cj =
The parameter set for f3 is RP and the parameter set for It is w = {P- = Z(3; (3
E RP} .
Note that w is the linear space spanned by the columns Cj, j = 1, . . . , n, of the design matrix. Let r denote the number of linearly independent Cj, j = 1, . . . ,p, then r is the rank of Z and w has dimension r. It follows that the parametrization ({3,a2) is identifiable if and only ifr = p (Problem 6.1.17). We assume that n > r .
' •
The Canonical Fonn of the Gaussian Linear Model
• • '.
I
,.
i
'
r ''
The linear model can be analyzed easily using some geometry. Because dimw = r, there exists (e.g., by the Gram-Schmidt process) (see Section B.3.2), an orthonormal basis VI ' . . . ' Vn for Rn such that VI' . . ' v span w. Recall that orthonormal means vr vj = 0 for i f. j and v[vi = 1. When vrvj = o. we call Vi and Vj orthogonal. Note that any t E Jl:l can be written .
'
'
t = I;(vit)v;
I
I
r
n
'
(6.1.7)
i=l
and that
t E w ¢:> t =
T
L:(vft)vi ¢:> v'[t = 0, i = i=l
r
+ 1, . . . , n.
We now introduce the canonical variables and means '
' • •
'
I
;
Ui = vfY, T/i = E(Ui) = v'[Jl., i = 1, . . .
, n.
Inference for
Section 6.1
G:::'c:"cc"c:''_c"--L:::i--ne:::'::_'_::M.:.:oc:d:::ei'-'-----'3:C6:::C9
Theorem 6.1.1. The U1 are independent and Ui ,......, N(rJi, a2 ), i = 1, . . . 11i
while
(TJt , .
.
, n,
where
r + 1 , . . . , n,
= 0, i =
. , 1Jr )T varies freely over Rr.
Proof. Let Anxn be the orthogonal matrix with rows v'[ , . . . , v�. Then we can write U = AY, 7J = AJ.t, and by Theorem B.3.2, U1, . . , Un are independent nonnal with D variance and E(Ui) = v[J.L = 0 for i = r + 1, . . . , n because p, E w.
a2
.
Note that
1
Y � A- U. So, observing U and Y is the same thing. 1-'
J.t and 71
(6. 1.8)
are equivalently related,
� A - I 'I·
(6.1.9)
whereas Var(Y) � Var(U)
�
cr2Jnx n·
(6.1.10)
It will be convenient to obtain our statistical procedures for the canonical variables U, and then translate them which are sufficient for (J.J., using the parametrization (17, to procedures for �-'· (3, and based on Y using (6.1 .8)-(6. 1.10). We start by considering the log likelihood C(TJ, u) based on U
u2)r,
a2) cr2
n 1 � - 2 L.,(u ; - r]i) 2" t=l
C(TJ,U)
2
-
� I � 1 2 - 2 L ui + � � 2a i=l
i=l
6. 1.2
2
n 2 tog(27rcr ) 1JiUi
� r(f -
� i=l
2u2
-
n log(27r0'" 2) .
(6. 1 . 1 1 )
2
Estimation
a2 known case, which is the guide to asymptotic inference in general. Theorem 6.1.2. In the canonical form of the Gaussian linear model with a2 known We first consider the
(i) T � ( U1 , . . . , U,)T is sufficientfor rJ.
(ii) UJ, . . . , Ur is the MLE of1JI , · · · · 1lr· (iii)
Ui is the UMVU estimate ofryi, i = 1, . .
.
, r.
' c,. are constants, then the MLE of0: = (iv) If Ct ' also UMVUfor a. •
.
.
(v) The MLE of p. is il =
ui =
E;
z:; 1 Ci1Ji
is a = E� 1
and j),i is UMVUfor J.ti, i = vTil. making il and u equivalent. 1
vi Ui
1,
. . . , n.
ciui. a is Moreover,
I' '
370
Inference in the Multiparameter Case
Chapter 6
Proof. (i) By observation, (6.1.11) is an exponential family with sufficient statistic T. (ii) U1 , . , U1. are the MLEs of 111, . . , Tfr because, by observation, (6. 1.11) is a function of 1}1 , . . . , 1lr only through L�- 1 (ui - "7i )2 and is minimized by setting 'tli = ui. (We could also apply Theorem 2.3.1.) 1 , . . . , r. (iii) By Theorem 3.4.4 and Example 3.4 . 6, U; is UMVU for E(U;) = 7);, (iv) By the invariance of the MLE (Section 2.2.2), the MLE of q(9) = 2:: � 1 c;ry; is q(8) = L� 1 <;Ui· If all thee's are zero, 5 is UMVU. Assume that at least one c is different from zero. By Problem 3.4.10, we can assume without loss of generality that L:·= 1 c'f = 1. By Gram-Schmidt orthogonalization, there exists an orthonormal basis V I , . . . , Vn of Rn with VI = c = ( c , ' . ' Cr, 0, . . . ' O)T E Rn . Let wi = vru. �i = Vi'fJ, i = 1 ' . . . ' n, then W N(�,a2J) by Theorem B.3.2, where J is the n x n identity matrix. The distribution of W is an exponential family, W1 = Q is sufficient for 6 = a, and is UMVU for its expectation E(W,) = a. 0 (v) Follows from (iv). .
.. •
i
.
.
i=
-
:! I ' I' I
•
'
.
' ' '
I i
.
-
I I
Next we consider the case in which a2 is unknown and assume n
> r + 1.
•
I
!
'
i
•
i
Theorem 6.1.3. In the canonical Gaussian linear model with a2 unknown,
(i) T = (ii) (iii)
Ur, L7 r+l un T is sufficientfor ('TJ1, . . . , 1]r, a2)T. The MLE ofa 2 is n 1 L� r+1 u; . s2 = ( n r) - 1 L� r+l u? is an unbiased estimator of a2• (U1,
. . . 1
-
-
(iv) The conclusions of Theorem 6 . 1 2 (ii), . . . ,(v) are still valid. .
'
'
Proof. By {6.1.11), (UI , · · · , Ur , L� I is sufficient. But because L� 1 Ui2 L� I Ui2 + L7 r+l Ul, this statistic is equivalent to T and (i) follows. To show (ii), recall that the maximum of ( 6. 1.11) has 'Tfi = Ui . That is, we need to maximize
' '
II i
,,
I
'
I
•
i J •
. '
'
: .
"
as a function of a2. The maximizer is easily seen to be n - I L� r+ l U? (Problem 6.1.1 ). (iii) is clear because EUl = a2, i > r+ 1. To show (iv), apply Theorem 3.4.3 and Example 3.4.6 to the canonical exponential family obtained from (6.1.11) by setting T; = U;, Bj 'TJi /a2, j = 1, . . , r, Tr+1 = L� 1 U? and Br+ I = 1/2u2
=
.
-
I
•
I
.
Projections -
We next express j1 in terms of Y, obtain the MLE (3 of {3, and give a geometric interpretation of ji, {3, and s2• To this end, define the norm It! of a vector t E Rn by
l tl' = I:� 1 fl.
'
I
•
I
'
'
'
unT
•
i
'
'i
•
I
I '
Section 6.1
371
Inference for Gaussian linear Models
Definition 6.1.1. The projection y0 = 1f(Y I £...').. of a point y E Rn on w is the point
Yo = arg min{IY - W : t E w}. -
The maximum likelihood estimate f3 of f3 maximizes
or, equivalently,
13 = arg min{ IY - Z/312 : !3 E W}.
That is, the MLE of {3 equal� the least squares estimate (LSE) of {3 defined in Example 2.1.1 and Section 2.2.1. We have Theorem 6.1.4. In the Gaussian linear model (i) jl is the unique projection o.fY on
u..J
(ii) jl is orthogonal to Y - ji.. (iii)
s2
=
and is given by
ii = zl3
IY - /il 2 /(n - r)
(6.L12)
(6.Ll3)
(iv) lfp = r, then !3 is identifmble, !3 = (zrz )-1 zr !'. the MLE = LSE of!3 is unique and given by (6.1.14) -
(v) f3; is the UMVU estimate of /3;. j i = I , . . . , n.
Proof.
=
1, . . . , p, and Jii is the UMVU estimate of J.ti,
(i) is clear because Z/3, f3 E RP, spans w. (ii) and (iii) are also clear from Theorem
2:� ViUi and Y - /i = L7 r+l v;U;. To show (iv), note that 1 ,.. = Z/3 and ( 6. 1 . 12) implies zr,.. = zTZ!3 and zTji = zrzl3 and, because z has full rank, zrz is nonsingular, and !3 = (Zrz)-1 zrp, 13 = (zrz) -1Zrji. To show f3 = (ZTz)-lzTy, note that the space w.l. of vectors s orthogonal to w can be written as w.L = {s E R" : sr(Z/3) = O for all !3 E W}. 6.1 .3 because /i
=
It follows that !3T (ZT s) = 0 for all !3 E RP, which implies zr s = 0 for all s E wj_. Thus, zr (Y - ji) = 0 and the second equality in (6.1.14) follows. f3; and Jii are UMVU because, by (6.1.9), any linear combination ofY's is also a linear combination of U's, · and by Theorems 6.1.2(iv) and 6 . 1.3(iv), any linear combination of D U's is a UMVU estimate of its expectation .
372
Inference in the Multiparameter Case
Chapter 6
Note that in Example 2.1.1 we give an alternative derivation of (6.1.14) and the normal equations (ZTZ)f3 = zry. The estimate fi = Z{3 of tt is called thefitted value and € Y - ji is called the residual from this fit. The goodness of the fit is measured by the residual sum of squares (RSS) IY - fll2 = L7 1 Ef. Example 2.2.2 illustrates this terminology in the context of Example 6.1.2 with p = 2. There the points jli = /31 + fhzi. i = 1, . . , n lie on the regression line � fitted to the data { ( z;, y;), i = 1 , . . , n}; moreover, the residuals €; = (y, (f3t + /3z z,)] are the vertical distances from the points to the fitted line. Suppose we are given a value of the covariate z at which a value Y following the linear model (6.1.3) is to be taken. By Theorem 1.4.1, the best MSPE predictor of Y if f3 is known well as z is E(Y) = zT{3 and its best (UMVU) estimate not knowing f3 is as � � Y = zT{3. Taking z = zi. 1 < i < n, we obtain P,i = zi/3. 1 < i < n. In this method of "prediction" � of Yi, it is common to write Yi for j]i, the ith component of the fitted value j],. That is, Y = j.i. Note that by (6.1.12) and (6.1.14), when p = r, �
=
�
�
.
.
�
It follows from this and (B.5.3) that if J = Jnxn is the identity matrix, then (6.1.15)
'
'
Next note that the residuals can be written as €=
I
•
I
'
.
.i '
J
I,
'
I
Y - Y = (J - H)Y.
The residuals are the projection of Y on the orthocomplement of w and
Var(€) = a2 (J - H). Corollary 6.1.1. In the Gaussian linear model �
(i) the fitted values Y = jl and the residual € are independent.
I '
�
We can now conclude the following.
I
'
•
•
'
•
i
'
'
I, I
•
I
HT = H H2 = H.
'
I
'
'
The matrix H is the projection matrix mapping Rn into w, see also Section B.lO. In statistics it is also called the hat matrix because it "puts the hat on Y." As a projection matrix H is necessarily symmetric and idempotent,
I
i
1
H = Z(ZTz)-tzT
•
'
•
where
,
·J
•
Y = HY
I
'
-
�
'
1
�
�
'
, •
(ii) Y � N(J.t, a2H),
(iii) € � N(O, a2 (J - H)), and
(iv) ifp = r, (:i � N({3, a2(zrz)- 1 ).
(6.1.16)
1 .
•
Section
6. 1 Inference for Gaussian linear Models
373
Proof (Y, E) is a linear transformation of U and, hence. joint Gaussian. The independence �
follows from the identification of j1 and € in tenns of the a2(zrz)- 1 follows from (B.5.3).
�
Ui in the theorem. Var(/3)
=
o
We now return to our examples.
Example 6.1.1.
One Sample (continued). Here J-i = _81 and j1 = {3 1 Y. Moreover, the unbiased estimator s2 of o-2 is L:� 1 (Yi - Yf j(n - 1), which we have seen before in 0 Problem 1.3.8 and (3.4.2). �
-
Example 6.1.2. Regression (continued). If the design matrix Z has rank p, then the MLE LSE estimate is j3 = (zrzr- 1 zry as seen before in Example 2.1.1 and Section 2.2.1. We now see that the MLE of Jt is /l = Z/3 and that {3j and Iii are UMVU for !3i and /-Li respectively, j = 1 , . . . ,p, i = 1, . . . , n. The variances of ,8,ji, = Y, and € = Y - Y are given in Corollary 6.1.1. In the Gaussian case ,8, Y, and € are normally distributed with Y and € independent. The error variance o-2 = Var(El) can be unbiasedly estimated by s2 = (n - p)- 1 IY Il l '0
=
�
�
�
�
�
�
�
�
-
Example 6.1.3.
The One-Way Layout (continued).
(ZTZ)(3 = ZY become
In this example the normal equations
n,
nk(Jk = L ykl , k = 1, . . . ,p. 1=1 At this point we introduce an important notational convention in statistics. If {Cijk . . . } is a multiple-indexed sequence of numbers or variables, then replacement of a subscript by a dot indicates that we are considering the average over that subscript. Thus,
where n = n 1 + · · · + np and we can write the least squares estimates as
f3k = Yk. , k = 1 , . . . ,p. By Theorem 6.1 .3, in the Gaussian model, th� UMVU estimate of the average effect of all the treatments, a = {3., is
1 p a = - L Y•. (not Y. in general) p k=l and the UMVU estimate of the incremental effect Ok = f3k - a of the kth treatment is
P• = Y•. - [i, k = 1, . . . ,p. �
0
j: '
' !;
'.
''
374
Inference in the Multiparameter Case
Chapter 6
Remark 6.1.1. An alternative approach to the MLEs for the normal model and the associ ated LSEs of this section is an approach based on MLEs for the model in which the errors , En in ( 6.1.1) have the Laplace distribution with density c1, . . .
_l_exp { -� ltl } 2,;
<7
and the estimates of f3 and J.L are least absolute deviation estimates (LADEs) obtained by minimizing the absolute deviation distance L� 1 IYi - zT.BI. The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEs-see Stigler (! 986). The LSEs are preferred because of ease of computation and their geometric properties. However, the LADEs are obtained fairly quickly by modem computing methods; see Koenker and D'Orey ( !987) and Portnoy and Koenker ( !997). For more on LADEs, see Problems 1.4. 7 and 2.2.31.
6.1.3 '
• •
'
j
'
j
1
1 '
' 1
,j
'
'
'
'
The most important hypothesis-testing questions in the context of a linear model corre spond to restriction of the vector of means J-L to a linear subspace of the space w, which together with a2 specifies the model. For instance, in a study to investigate whether a drug affects the mean of a response such as blood pressure we may consider, in the context of Example 6.1.2, a regression equation of the form
lzijllnx3
.\ (y)
=
sup{p(y, JL) : JL E w } sup{p(y, JL) : JL E wo }
'
'
' '
(6. l.l7)
where zi2 is the dose level of the drug given the ith patient, ZiJ is the age of the ith patient, and the matrix with Zil = 1 has rank 3. Now we would test H : f3z = 0 versus K : {32 =I 0. Thus, under H, {JL : J-ti = (31 + f33zi3, i = 1 , . . . , n} is a two dimensional linear subspace of the full model's three-dimensional linear subspace of Rn given by (6.1.17). Next consider the p-sample model of Example 1.6.3 with f3k representing the mean reSJXlnse for the kth population. The first inferential question is typically "Are the means equal or not?" Thus we test H : (31 = · · · = (3p = (3 for some (3 E R versus K: "the {3's are not all equal." Now. under H, the mean vector is an element of the space {JL : J.Li = (3 E R, i = 1, . . . , p}, which is a one-dimensional subspace of Rn. whereas for the full model JL is in a p-dimensional subspace of Rn . In general, we let w correspond to the full model with dimension r and let Wo be a q-dimensional linear subspace over which JL can range under the null hypothesis H; 1 < q < r. We first consider the a2 known case and consider the likelihood ratio statistic
'
I
l
Tests and Confidence Intervals
mean response = /31 + f32 zi2 + fJJZiJ,
!,
'
: '
'
'
•
' '
'
Section
6_1
375
I nference for Gaussian Linear Models
for testing H : fL E w0 versus f{
:
J-L
E w - w0. Because (6.1. 18)
then, by Theorem 6. 1.1, .\(Y)
�
exp
{- 2t2IY -
)}I' - IY - Po I'
}
where j1 and flo arc the projections of Y on w and w0, respectively. But if we let An x n be an orthogonal matrix with rows vf , . . . , v� such that V J , . . . , vq span wo and v 1 , . . . , Vr span w and set (6. 1 . 19) then, by Theorem 6.1.2(v), .\(Y)
�
exp
1 2<7'
i=q+ l
u'
(6. 1 .20)
•
It follows that r
2 1og.\(Y) �
L (U./<7)2 i=q+I
Note that (U;/
distribution with
e ' � <7- 2
a2 known, 2 log .\(Y) has a x;_q(02)
r
" L 11i' = a-'1 {L - Ito I'
(6.1.2 1)
where fLo is the projection of fL on w0. In particular, when H holds, 2 log .\(Y) ""' x; -q.
Proof We only need to establish the second equality in ( 6.1.21). Write 'li is as defined in (6.1.19), then
�
AJL where A
r
L '1? � IJL - JLol 2 i=q+ I 0
376
Inference in the Multiparameter Case Next consider the case in which
MLEs of a2 for Jt
a2
E w and tt E wo are
a -2
Chapter 6
is unknown. We know from Problem 6 . 1 . 1 that the
a0 = -I IY - t-to I IY - J.L -12 and _, - I' , n n respectively. Substituting jJ.., ji0, 82, and 86 into the likelihood ratio stati stic, we obtain =
-
{ �(Y· E · a:.;) }" � r �� ; ' a Y - J.L P Y, JLo, o
n
.A ( y) =
=
where p(y, /L,
The resulting test is intuitive.
It consists of rejecting H when the fit, as measured by the
res idual sum of squares under the model specified by H, is poor compared to the fit under the general model. For the purpose of finding critical values it is more convenient to work with a statistic equivalent to >.. ( Y),
n-_:=_r IY - /iol2 - IY - /il2 IY - Iii' r-q
T Because T
=
(r - q) -'1/i - /io l2 (n - r)- 'IY - /il2 •
(n - r)(r - q)-1{[.A(Y)]2fn - !}, T is an increasing function of .A(Y)
and the two test statistics are equivalent. T is called
'
i
the
hypothesis.
>
(6.1 .22)
F statistic for the general linear
We have seen in PmJX1Sition 6.1.1 that a-211}- itol2 have a x;_ q (82) distribution with
'
:
02 =
'
T
•
=
(noncentral (central
= =
x:-• variable)!df
X�-r variable)!dj
with the numerator and denominator independent. The distribution of such a variable is
noncentral :F distribution with noncentrality parameter 82 and r - q and n - r degrees offreedom (see Problem B.3.14). We write :F.,m (B2) for this distribution where k = r - q and m = n - r. We have shown the following. called the
Proposition 6.1.2. In the Gaussian linear model the F statistic defined by ( 6.1.22), which is equivalent to the likelihood ratio statistic for H : tt E wo for K : J.t E w - wo. has the noncentral F distribution Fr-q,n -r(fP) where 02 = a-2IJL - J.tol2. In particular, when H Jwlds, T has the (central) Fr-q,n-r distribution. Remark 6.1.2. In Proposition 6 . 1 . 1 suppose the assumption ..a2 is known" is replaced by ••a2 is the same under H and K and estimated by the MLE 0'2 for /.I. E w." In this case, it can be shown (Problem 6.1.5) that if we introduce the
variance equal likelihood ratio
statistic IL E w} max{p(y, /L,iT2 : ) :\(y ) max{p (y,/L, iT2) : IL E wo} _
(6.1.23)
l '
•
Section 6.1
Inference for Gaussian ---Linear Models
377 ------ -
then A(Y) equals the likelihood ratio statistic for the o-2 known case with 0'2• It follows that 2 log .\(Y) =
r
�
-- q -T = noncenlral x?.
-
rr2
q
(6.1 .24)
central X�-c/n
(n - r)/n
replaced by
where T is the F statistic (6.1.22). Remark 6.1.3. The canonical representation (6.1.19) made it possible identity
IY - i'ol 2 = IY - iL l' + II' - i'o l2 ,
to
recognize the (6.1.25)
which we exploited in the preceding derivations. This is the Pythagorean identity. See Figure 6.1.1 and Section B.lO.
Y3 y
IY - i'o l2
y,
I Y - iL l'
Y1 = Y2
Figure 6.1.1. The projections jl and jl0 of Y on w and wo; and the Pythagorean identity.
We next return to our examples. Example 6.1.1. One Sample (continued). We test H : (31 = i'o versus K : (3 'f I'D· In this case Wo = {!'o}, q = 0, r = 1 and
T= which we recognize as 4.9.2.
(n
-
(Y - 1'0)' 1 ) 1 I: (Y, Y)2 ' -
-
t2 /n, where t is the one-sample Student t statistic of Section 0
378
Inference in the Multiparameter Case
Chapter 6
Example 6.1.2. Regression (continued). We consider the possibility that a subset of p q covariates does not affect the mean response. Without loss of generality we ask whether the last p - q covariates in multiple regression have an effect after fitting the first q. To formulate this question, we partition the design matrix Z by writing it as Z = (Z1, Z2) where Z1 is n x q and Z2 is n x (p - q ) , and we partition {3 as {3T = ({3f, !3J) where {32 is a (P - q) x 1 vector of main (e.g., treatment) effect coefficients and {31 is a q x 1 vector of"nuisance" (e.g., age, economic status) coefficients. Now the linear model can be written as --
i
'
,,
i
i
I ' '
' '
'
t zry and {30 0 versus K {32 i' Q_ ln this case {3 (zrz){3 2 (Zfz t )- 1 ZjY are the ML& under the full model (6,L26) and H, respectively, Using (6.1.22) we can write the F statistic version of the likelihood ratio test in the intuitive form We test H :
=
-
:
F=
-
=
•
I
6.2.1.
•
•
.
.
'
1 '
Example 6.1.3. The One-Way Layout (continued). Recall that the least squares estimates of {31, , (Jp are Y1., Yp.. As we indicated earlier, we want to test H : (31 = · · · = (Jp. Under H all the observations have the same mean so that, •
'
' '
0
•
'
'
'
(RSSH - RSSF)/(dfH - dfp) RSSF/dfF
1 2 2 (6, 1 27) a0 = (p - q)- {3r {ZfZ2 - ZfZt (Zfz,)-1 ZfZ,}/32 , In the special case that Z'fZ2 = 0 so the variables in Z1 are orthogonal to the variables in Z2 , 02 simplifies to a-2 (p - q)f3f(ZfZ2 )f32 , which only depends on the second set of variables and coefficients. However, in general 82 depends on the sample correlations between the variables in Z1 and those in Z2. This issue is discussed further in Example
'
•
'
'
where RSSF = IY - iW and RSSH = IY- i}0 1 2 are the residual sums of squares under the full model and H, respectively; and dfF = n-p and dfH = n-q are the corresponding degrees of freedom. The F test rejects H ifF is large when compared to the ath quantile of the Fp-q,n-p distribution. Under the alternative F has a noncentral Fp-q ,n-p(fP) distribution with noncentrality parameter (Problem 6, L 7)
'
i
,
i
.
'
l '
l ;
'
i
'
! '
' '
' I '
'
Thus,
p nk
p
lc=l l=I
k=l
2 l lit - tto = L L(Y• - Y. )2 = L nk (Yk- - YY •
Substituting in (6.1.22) we obtain the F statistic for the hypothesis H in the one-way layout
L��t nk (Yk, - Y .)2 Tp - I L��· L:�',(Ykt - Y.-) 2 ' _
n-P
. I
•
• '
I
'
------
Section 6.1
Inference for Gaussian Linear Models
379
When H holds, '1' has a Fp- l,n - p distribution. If the (J, are not all equal, T has a noncentral Fp-l,n-p distribution with noncentrality parameter (6.! .28)
where {3 = n- I :2::=�"" 1 ni/3i. To derive 1F, compute a - 2 1 M - J.Lo 12 for the vector Jt = (fh, . . . , (31,(3, . , (3,, . . . , (3p, . . . , {3p)T and its projection 1-'o = ((3, . . . , (J)T There is an interesting way of looking at the pieces of information summarized by the F statistic. The sum of squares in the numerator, p SSe = L nk(Yk. - Y.)2 .
.
k=l
is a measure of variation between the p samples Y11 , . . . , Yi sum of squares in the denominator,
SSw
=
p
n1 ,
.
• •
, Yy1 ,
• • .
, Ypnp· The
"'
L L(Yk, - Y•. ) ' , k=1 l=1
measures variation within the samples. [f we define the total sum of squares as
SSr
1'
=
nk
L L(Yk, - Y.)2 , k=1 l= l
which measures the variability of the pooled samples, then by the Pythagorean identity (6.!.25) (6.! .29) SSr = SSe + SSw. Thus, we have a decomposition of the variability of the whole set of data, SST , the total sum of squares, into two constituent components, SSs, the between groups (or treatment) sum of squares and SSw. the within groups (or residual) sum of squares. SSTfa2 is a (noncentral) x2 variable with (n - 1) degrees of freedom and noncentrality parameter f?. Because SSB /a2 and SSw/a2 are independent x' variables with (p - 1) and (n - p) degrees of freedom, respectively, we see that the decomposition (6.1.30) can also be viewed stochastically, identifying IP and (p - 1) degrees of freedom as "coming" from SSafa 2 and the remaining (n - p) of the (n - 1} degrees of freedom of SSr ja2 as "conting" from SSwfa 2 • This information as well as SSa/(P - 1) and SSw j(n - p), the unbiased estimates of 02 and a2, and the F statistic, which is their ratio, are often summarized in what is known as an analysis ofvariance (ANOVA) table. See Tables 6.1.1 and 6.1.3. As an illustration, consider the following data( l ) giving blood cholesterol levels of men in three different socioeconomic groups labeled I, H, and III with I being the "high" end. We assume the one-way layout is valid. Note that this implies the possibly unrealistic
'I ,
'
'
i'I
'
'� '
380
Inference in the Multiparameter Case
Chapter 6
'
TABLE 6.1.1. ANOVA table for the one-way layout
Between samples
Within sample�
ss B
SSw
-
Sum of squares
- I:
[ ,, '
1 SST = �
Total
k- 1 I' '
P 1< - 1
�
c
111< '
"
( \ "k- - ) 1"2 k 0? - \" · '
1
L " �<
1-l
ki
(}'kl
-
d.f.
·
k
I'
1;- I'
p-
Mean squares
I
n - p Tl -
1
-
SS8
A/Sa
AISw
=
s� Hi
F-value i\/Sa 1\IS
•
.,_
TABLE 6.1.2. Blood cholesterol levels
403 312 403
I II
Ill
311 222 244
336 420 235
269 302 353
259 420 319
386 260
353
210
286
290
I assumption that the variance of the measurement is the same in the three groups (not to speak of normality ). But see Section
6.6 for ''robustness.. to these assumptions.
p = 3, n1 = 5, n2
compute
=
1 0, n3 = 6, n = 21,
' ' • '
and we ;
' ' •
•
TABLE 6.1.3. ANOVA table for the cholesterol data
�
" • •
ss Between groups Within groups Total
1202.5 85, 750.5 86,953.0
d.f.
2 18 20
MS
F�value
601.2 1 0.126 4763.9
From :F tables, we find that the p-value corresponding to the F-value
0.126
is
0.88.
Thus, there is no evidence to indicate that mean blood cholesterol is different for the three SOCIOCCOnormc groups. •
. •
We want to test whether there is a significant difference among the mean blood choles terol of the three groups. Here
' '
0
•
Remark 6.1.4. Decompositions such as {6.1.29) of the response total sum of squares SST into a variety of sums of squares measuring variability in the observations corresponding
analysis of variance. They can be formulated in any linear model including regression models. See Scheffe (1959, pp. 42-45) and Weis berg (1985, p. 48). Originally such decompositions were used to motivate F statistics and to variation of covariates are referred to as
to establish the distribution theory of the components via a device known as Cochran's the orem (Graybill,
1961, p. 86).
Their principal use now is in the motivation of the convenient
summaries of information we call ANOVA tables.
Section 6.1
381
Inference for Gaussian linear Models
Confidence Intervals and Regions
We next use our distributional results and the method of pivots to find confidence inter vals for J.li, 1 < i < n, /3j. 1 < j < p, and in general, any linear combination
n
of the J.t'S. If we set 'lj; = 2::7 1 aiJii = aT 11 and �
�
�
where H is the hat matrix, then (> - ,P)/a{,P) has a N(O, I ) distribution. Moreover,
(n - r)s2/a2
=
[Y
··
2
iil fa2
n
=
L (Uja2)2
i=r+I
�
has a X�- r distribution and is independent of '¢. Let
be an estimate of the standard deviation a ('¢ ) of 1};, This estimated standard deviation is called the standard error of'¢. By referring to the definition of the t distribution, we find that the pivot �
�
�
has a Tn_,. distribution. Let tn-r ( 1 - �a) denote the 1 - �a quantile of the Tn-r distri bution, then by solving [T( ,P )[ < tn-• (1 - �a) for ,P, we find that
is, in the Gaussian linear model, a 100(1 - a)% confidence interval for 1/J. Example 6.1.1. One Sample (continued). Consider 1/J = Jl. We obtain the interval
which is the same as the interval of Example 4.4.1 and Section 4.9.2. Example 6.1.2. Regression (continued). Assume that p = r. First consider 1jJ = f3J for some specified �gression coefficient /3j . The 100(1 - a)% confidence interval for f3J is
(Jj = {Jj ± tn-p ( I - 1 a) s{ I {ZTZ) -1 1jj }l' �
2
382
Chapter
6
[(ZTz)�1]h is the jth diagonal element of (ZTZ)� 1 Computer software computes (zr z)�' and labels s{[(ZTz)�1 ]11) l as the standard error of the (estimated) jth regres
where
sion coefficient. Next consider ljJ level o:) confidence interval is
=
(1 -
J..Li
=
J1.i
=
mean response for the ith case,
Jii ± tn -p
1 < i < n. The
(1 - �a) sA,
hii
where is the ith diagonal element of the hat matrix H. Here s� is called the standard error of the (estimated) mean of the ith case. 2 and Next consider the special case in which
p
=
Yi = !3r + /32Zi2 + Ei, i
=
11
.
. . , n.
If we use the identity n
�)z;z - z.z)(li - Y) - �)z,2 - z.2)li - YL(zi2 - z.z) i=l - �)z" - z.z)li, we obtain from Example 2.2.2 that
fh
Because •
�
Var(Yi) a2 , we obtain
L��� ( zi2 - z.z)li
(6.1.30)
· 2 L� 1 (zi2 - z.2)
=
n
2 "' � / Var(!lz) (J L)zi2 - z.z) i=l and the 100( I - a)% confidence interval for .62 has the form !lz = iJz ± tn-p ( 1 - la) s/ V))zi2 - z z)2 . �
I
i.
I
'
Inference in the Multiparameter Case
• •
;
'
,
The confidence interval for /31 is given in Problem 6.1.10. 2 case, it is straightforward (Problem 6.1. 10) to compute Similarly, in the
p�
h
ii -
_
2 I (zi2 z.z) - + "'" n
L...i. =l (Zi2 - Z.2 ) and the confidence interval for th� mean response Jl.i of the ith case has a simple explicit
'
f3k. 1 < k < p. Because
I
'
o
fonn.
Example 6.1.3. One-Way Layout (continued). We consider 'ljJ
ilk � Yk. N(!lk, (J2/nk). we find the 100(1 - a)% confidence interval ilk � i3. ± tn-p (1 - �a) s/Jiik where s2 � SSw/(n-p). The intervals for p � f). and the incremental effect ok � ilk - p 0 given in Problem 6.1.11. �
�
are
i ; '
I
• •
=
I
•
1 ' •
J
·] •
I
i ' •
l
Section 6.2
383
Asymptotic Estimation Theory in p Dimensions
Joint Confidence Regions
We have seen how to find confidence intervals for each individual (3j. 1 < j < p. We next consider the problem of finding a confidence region C in RP that covers the vector f3 with prescribed probability (1 - a ) . This can be done by inverting the likelihood ratio test or equivalently the F test That is, we let C be the collection of {30 that is accepted when the level (I - a) F test is used to test H j3 � j30. Under H, I' � l'o � Zj30; and the numerator of the F statistic (6.1.22) is based on 2 Iii M l � IZ/3 - Zf3ol � (/3 - f3o)T (ZTZ)(/3 - /30). Thus, using (6.1.22), the simultaneous confidence region for {3 is the ellipse � � T T o (Z Z) ( f3o ) ) < fc,n-c (1 - �a ) f3 : (3 2 ((3 o C � f3 (6. 1.31) rs where Jr,n-r (1 -- �a) is the 1 - �a quantile of the :Fr,n-r distribution. Example 6.1.2. Regression (continued). We consider the case p = r and as in (6.1.26) T = (/3[, f3f) , where {32 is a vector of main effect coefficients and ZI /3 write Z = ( , Z2 ) -. -.T .-...T "'T and {31-is a vector of ..nuisance" coefficients. Similarly, we partition {3 as {3 = {/31 , {32 ) where /3 1 is q x 1 and- j32 is (p - q) x 1. By Corollary 6.1.1, 0"2 (zrz) is the variancecovariance matrix of {3. It follows that if we let S denote the lower right (P - q) x (P q) comer of ( zrz) -1, then a2S is the variance-covariance matrix of j3 2 . Thus, a joint 100(1 - a}% confidence region for {32 is the p - q dimensional ellipse :
-
o '
< fp -q, n-p (1 - �a)
f3o, :
0
We consider the classical Gaussian linear model in which the resonse Yi for the ith case in an experiment is expressed as a linear combination Jlt = Lj=1 (3i Zij of covariates plus an error fi, where ci, . . . are i.i.d. N(0, a2 ) . By introducing a suitable orthogonal transformation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transformation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients {f3j}. the resJX)nse means {Jli }, and linear combinations of these. Summary.
, tn
6.2
ASYM PTOTIC ESTIMATION TH EORY IN p DIM ENSIONS
In this section we largely parallel Section 5.4 in which we developed the asymptotic prop erties of the MLE and related tests and confidence bounds for one-dimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generaliz ing Section 5.4.2.
6.2.1
Estimating Equations
Our assumptions are as before save that everything is made a vector: i.i.d. P where P E Q, a model containing P = {Po : 6 E 6} such that (i)
,
e open c RP.
I
(ii) Densities of Po are p(-, 6), 9 E 6.
•
J
The following result gives the general asymptotic behavior of the solution of estimating equations.
AO. 'I'
=
(,P1 ,
•
.
.
,
1/!p)T where ,P1 � 1
gf. is well defined and
n
'
-L 'I'(X,, On ) � 0. .
n t= l
(6.2.1)
•
A solution to (6.2.1) is called an estimating equation estimate or an M-estimate. AI. The parameter 8( P) given by the solution of (the nonlinear system of p equations in p unknowns):
J 'l'(x, 6)dP(x) � 0
1
(6.2.2)
is well defined on Q so that 6(P) is the unique solution of (6.2.2). Necessarily 6(Po) � 6 because Q :::> P. A2. Epi'I'(X�o 6(P))I 2
'
< oo where I · I is the Euclidean norm.
A3. 1/Ji{·, 8), 1 < i < p, have first-order partials with respect to all coordinates and using the notation of Section B.S,
I! :
where
is nonsingular. A4. sup .
{I� 2:� 1 (Dw(X;, t) - Dw(X;, 6(P)))i
p
:
It - O(P)I <
AS. On � O(P) for all P E Q.
Theorem 6.2.1. Under AO-AS of this section n 1 On � O(P) + - L ii'!(X;, O(P)) + t=l where
n.
' •
.
'
.
op(n· 112)
iii (x,O(P)) � -[EpDw(X� o O(P))]" 1 w(x, O(P)).
(6 2.3 ) .
(6.2.4)
Section 6.2
Asymptotic Estimation The ry in
o
p
Dimensions
385
Hence, (6.2.5)
where E(>T>, P)
�
J(6. P)E>T>>T>r(X1, 6(P))F (0. P)
(6.2.6)
and F 1 (6, P) = - EpD>l>(X, , 6(P))
=
The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,
1
-n
n
1 n
I: >f>(X,, 6(P)) n I: D>f>(X,, 6�)(6n - 6(P)). i=l i=l =
(6.2.7)
Note that the left-hand side of (6.2. 7) is a p x 1 vector, the right is the product of a p x p matrix and a p x 1 vector. The rest of the proof follows essentially exactly as in Section 5.4.2 save that we need the observation that the set of nonsingular p x p matrices, when viewed as vectors, is an open subset of RP , representable, for instance, as the set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to con clude that A3 and A4 guarantee that with probability tending to 1, � 2::� 1 DW(Xi, 6�) is nonsingular. ,
Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of On is motivated by P, the behavior in (6.2.3) is guaranteed for P E Q, which can include P � P. In fact, typically Q is essentially the set of P's for which 6(P) can be defined uniquely by (6.2.2). We can again extend the assumptions of Section 5.4.2 to:
A6. If l(-, 6) is differentiable -E6 w(X,, 6)Dl(Xt. 6) -Cov6 (w(X1, 6), Dl(X1, 6))
(6.2.8)
defined as in B.5.2. The heuristics and conditions behind this identity are the same as in the one-dimensional case. Remarks 5.4.2, 5.4.3, and Assumptions A4' and A61 extend to the multivariate case readily. Note that consistency of On is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however, be shown that with probability tending to 1, a root-finding algorithm starting at a consistent estimate 0� will find a solution On of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
" i
i!
Inference in the Multiparameter Case
Chapter 6
'
'
,I
''
386
'
'
'
6.2.2
Asymptotic Normality and Efficiency of the MLE
[fwe take p(x O ) comes ,
=
l(x,O)
=
log p(x,O), and >I> (x , O) obeys A0-A6, then (6.2.8) heEoDl (X 1 , O)DT l( X1 , 0)) VaroDl(X1, 0)
(6.2.9)
where
is the Fisher information matrix 1(8) introduced in Section 3.4. If p : 8 ---+ R, 8 C Rd, is a scalar function, the matrix 8��eoJ {8) is known as the Hessian or curvature matrix of the surface p. Thus, (6.2.9) states that the expected value of the Hessian of l is the negative of the Fisher information. We also can immediately state the generalization of Theorem 5.4.3.
))
Theorem 6.2.2. If AO-A6 holdfor p(x, 0)
=
. •
'
I' '
'
'
-
log p(x, 0), then the MLE On satisfies
I n 11 op(n + l:; r' (O)Dl(X; , O) · ') On = 0 + n i =l
(6.2. 10)
i '
'
,
so that (6.2 . 1 1 )
If 8n is a minimum contrast estimate with p and '¢ satisfying AO--A6 and corresponding asymptotic variance matrix E(W, P9). then E(w,P0 ) > r ' (OJ in the sense of Theorem 3.4.4 with equality in (6.2.12) for 0 -On = On + Op(n - 1/2 ).
(6.2.12) =
Oo iff, unckr 00, (6.2. !3)
The proofs of (6.2.10) and (6.2.11) parallel those of {5.4.33) and (5.4.34) exactly. The proof of (6.2. 12) parallels that of Theorem 3.4.4. For completeness we give it. Note that by {6.2.6) and {6.2.8) E(>I> , Po) = Cov0 1 {U, V)Varo (U)Cov0 1 (V, U) (6.2.14) Proof.
where U = >I>(X1 , 0), V = Dl(X 1 , 0). But by (B.l0.8), for any U, V with Var(Ur, VT)T nonsingular (6.2.15) Var(V) > Cov(U, V)Var- 1 {U)Cov{V, U). Taking inverses of both sides yields I- 1 (0) = Var0 1 (V) < E(w, O). (6.2.16)
'
•
.
'
Section 6.2
Asymptotic Estimation Theory in
p
387
Dimensions
Equality holds in (6.2.15) by (B . I 0.2.3) iff for some b = b(O) U = b + Cov ( U , V ) Var- 1 ( V ) V
(6.2. 1 7)
with probability 1. This means in view of Eo 'I' = EoDl = 0 that w(X1 , 0)
=
b(O)Dl(X, , O).
In the case of identity in (6.2.16) we must have -[EoD>�'(X,, OW1>li(X, , 0) = r 1 (0)Dl(X1, 0).
(6.2.18)
Hence, from (6.2.3) and (6.2. 10) we conclude that (6.2.13) holds.
0 �
We see that, by the theorem, the MLE is efficient in the sense that for any ap l aT(J n has asymptotic bias o(n-112) and asymptotic variance n-1aT J-1(8)a, which is no larger than that of any competing minimum contrast estimate. Further any competitor Bn�such that aT(Jn has the same asymptotic behavior as aTBn for all a in fact agrees with On to order n-112 A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6 on the asymptotic normality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example. x
.
-
Example 6.2.1. The Linear Model with Stochastic Covariates. Let Xi = (Zf, Yi)T, 1 < i ::S n, be i.i.d. as X = (ZT, Y) T where Z is a p x 1 vector of explanatory variables and Y is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways: (i) (6.2.19) where ' is distributed as N(O, a2) independent of Z and E(Z) = 0. That is, given Z, Y has a N(a + zr[3, a2) distribution. (ii) The distribution Ho of Z is known with density h0 and E(ZZT) is nonsingular. The second assumption is unreasonable but easily dispensed with. It readily follows (Problem 6.2.6) that the MLE of [3 is given by (with probability 1)
� - Z �T(nJ Y . f3-.. = [Z-T(nJ Z(nJI I
(6.2.20)
Here z(n is the n X p matrix IIZij Z.j II where z.j = � LZ 1 Zij· We used subscripts ) (n) to distinguish the use of Z as a vector in this section and as a matrix in Section 6.1. In the present context, Z(n) = (Z1, . Zn)T is referred lo as the random design matrix. This example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of a and a2 are �
.
.
1
p --" a=Y � Z ; f3; ,
.J =l
I
� 2 a 2 = -[Y - (Ci + Z(nJ/3) [ . n
(6.2.21)
I
388
Inference in the Multiparameter Case
Chapter 6
�
Note that although given Z 1 , . . . , Zn, (3 is Gaussian, this is not true of the marginal distribution of {3. It is not hard to show that AO-A6 hold in this case because if Ho has density ho and if 8 denotes then �
(a,f3r,a2)T,
I 2 2 + log h0(z) rr) log + 2 ] (1oga zr,:3) + (a [Y 2 2 2oI
l(X,IJ)
Z; ; ( ,
Dl(X, IJ)
and
,
,
1(1:1)
1
,
2"
4 (<2 - I) a- 2
=
0 0
)
I
i
'
I
••
'
0
0 0
o--2 E(ZZT)
'
(6.2.23)
1
2u4
0
• .
l
<
,
•
N(O, diag (o-2 ,o-2[E ( ZZrW 1 , 2o-4 )). (6.2.24) This can be argued directly as well (Problem 6.2.8). It is clear that the restriction of Ho known plays no role in the limiting result for a,/3,U2• Of course, these will only be the MLEs if H0 depends only on parameters other than (a, {3,u2). In this case we can estimate and give approximate confidence intervals for {3j. j = 1, . ,p. E( ZZT) by � L� 1 An interesting feature of (6.2.23) is that because 1(1:1) is a block diagonal matrix so is 1 -1 (6) and, consequently, {3 and 0'2 are asymptotically independent. In the classical linear model of Section 6.1 where we perform inference conditionally given Zi = Zi, 1 < i < n, we have noted this is exactly true. This is an example of the phenomenon of adaptation. If we knew a-2, the MLE would still be {3 and its asymptotic varianc� optimal for this model. If we knew a and /3, U2 would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, &2 would be asymptotically equivalent to the MLE. To summarize, estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known. Formally, in a model P = { P(9,") : 8 E 8, '7 E £} we say we can estimate B adaptively at 'TJO if the asymptotic variance of the MLE B (or more generally, an efficient estimate of 8) in the pair (0, iiJ is the same as that of O(r;o), the efficient estimate for 'P'I1a = {P(o,f1a) : 8 E 8}. The possibility of adaptation is in fact rare, though it appears prominently in this way in the Gaussian linear model. In particular consider estimating {31 in the presence of a, (!32 . . . , /3p) with (i) (3,, . . , f3v known. (ii) ,:3 arbitrary. {3p 0. Let Z, In case (i), we take, without loss of generality, a = fJ2 (Zi l , . . . , Zip) T, then the efficient estimate in case (i) is �
zizr
.
.
�
�
�
a,
j '
•
•
so that by Theorem 6.2.2
L(y'n(ii - a, {j - ,:3,82 - o-2))
(6.2.22)
j j
�
.
.
.
.
=
=
(6.2.25)
1 •
j
'
'
l
l
l
,
I
•
1
• •
.
:
'
'
Section 6.2
Asymptotic Estimation Theory in p Dimensions
389
with asymptotic variance a2[Ezfj-- 1 . On the other hand, (31 is the first coordinate of {3 given by {6.2.20). Its asymptotic variance is the {1, 1) element of 0"2 [EzzT]- 1 , which is strictly bigger than , [Ezf] -• unless [EZZTJ- 1 is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate f3t adaptively if f3z, . . , {3P are regarded as nuisance -
,.
-
.
parameters. What is happening can be seen by a representation of [Zfn) Z( n)]-1 Z(n ) Y and ! 1 1 (11) where J-1(11) = llf''{ll) lf. We claim that
(3
2:�-JZ.t z,C I J )Y; (6.2.26) I= 1 zC 1 2 (Z ) tl L.... t=l t where Z( l ) is the regression of (Zu, . . . , Z1n)T on the linear space spanned by (Zj 1 , . . . , Zjn)T, 2 < j < Similarly, 2 " (6.2.27) Z . I (II) = "'I E(Z" !1(Zn 1 Z21 , . . , p,) ) where II(Ztl I Zz l ' . . . ) Zpl) is the projection of zll on the linear span of z21 ) . . Zp l (Problem 6.2.1 1). Thus, Il{Zu I z, " . . . 'Zpl) = l:j=, a; z,, where (a;, . . . ' a;) min imizes E(Zu - l:P_2 aj Zj 1) 2 over ( . . , ap) E RP-l (see Sections 1.4 and B.10). What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing [3,, . . , {3p when the variables Z2, . . . , Zp are in any way correlated with Z1 and the price is measured by [E(Zil - Il(Zu I Z21, . . . 'Zp1 )2]- 1 = (1 - E(Il(Zu I z21,2 . . . , Zp,))2 ) I . '-'--:..:..._ -'-""""f;c;. ,.(7 '----"---"'"-''Z E(Z11 ) ( ) E 11 -
"'n
-
p.
-
'
a,,
'
.
.
-
(6.2.28)
In the extreme case of perfect collinearity the price is oo as it should be because (31 then becomes unidentifiable. Thus, adaptation corresponds to the case where have linearly (see Section 1.4). Correspondingly no value in predicting in the Gaussian linear model (6.1.3) conditional on the Zi, i = 1, . . , n, is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if
(Z2, . . , , Zp)
Z1
fJ1
.
E(Zu - Il(Zu I z21, . . . , Zp,))2 = 0.
0
Example 6.2.2. M Estimates Generated by Linear Models with General Error Structure. Suppose that the €i in (6.2.19) are i.i.d. but not necessarily Gaussian with density !Jo (� ) for instance,
,
fo (x) = ( 1
e-x
)
+ e-x ' '
the logistic density. Such error densities have the often more realistic, heavier tails(1) than the Gaussian density. The estimates {30 1 O'o now solve -
p 1 z Y; L f3.o z,. L ,,.,p k=l i=l p n 1 L a x a I Yi - L/3koZik k= l i= l n
and
0"
-
-
-
=0
=0
, ,
390
Inference in the Multiparameter Case
=
f'
(yj_Q10 (y) + l ) , (30 = (fito, . . . ,f]po) T �
�
where 'lj;
=
Theorem
6.2.2 may be shown to hold (Problem 6.2.9) if
- J;; . x(y)
-
�
Chapter 6
The assumptions of
fo is strictly concave, i.e., fo is strictly decreasing. (ii) (log fo)" exists and is bounded. Then, if further fo is symmetric about 0,
'
(i) log
•
r:r- 2 I((3r, I )
I(IJ) =
2
,
r:r- 2
( c1E(tZ) � )
•
(6.2.29) •
, ,
fo(x)dx, c, = J (xfo (x) + 1 ) 2 fo(x)dx. Thus, ,60, iTo are opti mal estimates of {3 and a in the sense of Theorem � 6.2.2 if fo is true. Now suppose /o generating the estimates f3o and (T� is symmetric and satisfies (i) and (ii) but the true error distribution has density f possibly different from fo. Under suitable
l
conditions we can apply Theorem
i
where c1
=
J (fo(x))
6.2.1 with
' •
'
' •
i
•
•
� 'lj; ( y - L:l ZkiJk ) , I
where
'
i
;
1f;, (z, y, (3, r:r)
> ( y L�=l Zk,6k )
'
<
j
<
p
(6.2.30)
-
'
I
I •
,
-
•
' • '
• •
where {30, ao solve • '
(6.2.31)
' '
I I:
!!
II I I'' iI
<Jo
and E(>l>, P) is as in (6.2.6). What is the relation between (30, and (3, <J given in the Gaussian model (6.2.19)? If is symmetric about 0 and the solution of (6.2.31) is unique,
fo � then (30 = (3. But <Jo = c(fo)o- for some, c(fo) typically different from one. Thus, (30 can .
be used for estimating (3 althougl) if tiJI1 true distribution of the •• is N(O, r:r2 ) it should
perform less well than (3. On the qt)ler hand, iT0 is an estimate of r:r only if normalized by a constant depending on (See Problem 6.2.5.) These are issues of robustness, that is, to have a bounded sensitivity curve (Section 3�5, Problem 3.5.8), we may well wish to use a nonlinear bounded '.II = ('f/;1 , . . . , ,Pp )T to estimate {3 even though it is suboptimal when € N(O, u2), and to use a suita�ly normalized version of 0:0 for the same purpose. One effective choice of '¢1 is the Hu�er function defined in Problem 3.5.8. We will discuss D these issues further in Section 6.6 and Volume II. .
�
fo.
,....,
1
l
Co( vn(i3o f3o)) � N(O, :!:: ( >1>, P)) C( y'n(iT - r:ro) � N(O, r:r2(P))
I
j
i
to conclude that
•
ll
Section 6.2
391
Asymptotic Estimation Theory in p Dimensions
Testing and Confidence Bounds
There are three principal approaches to testing hypotheses in multiparameter models, the likelihood ratio principle, Wald tests (a generalization of pivots), and Rao's tests. All of these will be developed in Section 6.3. The three approaches coincide asymptotically but differ substantially, in perfonnance and computationally, for fixed n. Confidence regions that parallel the tests will also be developed in Section 6.3. Optimality criteria are not easily stated even in the fixed sample case and not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters such as H : 81 < 0 versus K : fh > 0 where fJz, . . . , Bp vary freely.
The Posterior Distribution in the Multiparameter Case
6.2.3
The asymptotic theory of the posterior distribution parallels that in the one dimensional case exactly. We simply make 8 a vector, and interpret I I as the Euclidean norm in conditions A7 and A8. Using multivariate expansions as in B.8 we obtain ·
Theorem 6.2.3. Ifthe multivariate versions of A0-A3, A4(a.s.), A5(a.s.) and A6-A8 hold �
then. if8 denotes the MLE,
(6.2.32) a.s. under Pe for al/ 8. The consequences of Theorem 6.2.3 are the same as those of Theorem 5.5.2, the equiv alence of Bayesian and frequentist optimality asymptotically. Again the two approaches differ at the second order when the prior begins to make a difference. See Schervish (1995) for some of the relevant calculations. A new major issue that arises is computation. Although it is easy to write down the pos terior density of 8, 1r(8) IT� 1 p(Xi, 8), up to the proportionality constant fe rr(t) ll� 1 p(X,, t)dt, the latter can pose a formidable problem if p > 2, say. The prob lem arises also when, as is usually the case, we are interested in the posterior distribution of some of the parameters, say ( 01, 82), because we then need to integrate out (83 , . . . , Bp). The asymptotic theory we have developed pennits approximation to these constants by the procedure used in deriving (5.5.19) (Laplace's method). We have implicitly done this in the calculations leading up to (5.5.19). This approach is refined in Kass, K.adane, and Tier ney ( 1989). However, typically there is an attempt at "exact" calculation. A class of Monte Carlo based methods derived from statistical physics loosely called Markov chain Monte Carlo has been developed in recent years to help with these problems. These methods are beyond the scope of this volume but will be discussed briefly in Volume II. We defined minimum contrast (MC) and M-estimates in the case of p dimensional parameters and established their convergence in law to a normal distribu tion. When the estimating equations defining the M-estimates coincide with the likelihood
Summary.
392
Inference in the Multiparameter Case
Chapter 6
equations, this result gives the asymptotic distribution of the MLE. We find that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MD or AJ-estimate if we know the correct model P = {Po : B E e} and use the MLE for this model. We use an example to introduce the concept of adaptation in which an estimate f) is called adaptive for a model {Po,11 : () E 8, ry E £} if the asymptotic distribution of fo( () - B) has mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of () in the family of models {Po,110 : () E 8}. ryo specified. In linear regression, adaptive estimation of is possible iff Z1 is uncorrelated with every linear function of Z2) . . . ) Zp. Another example deals with M-estimates based on estimating equations generated by linear models with non-Gaussian error distribution. Finally we show that in the Bayesian framework where given 0, X1 ) Xn-are i.i.d. Po, if 0 denotes the MLE for Po, then the posterior distribution of yfii(O - 0) converges a.s. under Po to the N(O, r 1 (9)) distribution.
/31
•
•
•
• •
)
. '
6 .3
LARGE SAMPLE TESTS AND CONFIDENCE REGIONS
In Section 6.1 we developed exact tests and confidence regions that are appropriate in re gression and anaysis of variance (ANOVA) situations when the responses are normally distributed. We shall show (see Section 6.6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal. However, we need methods for situations in which, as in the linear model, co variates can be arbitrary but responses are necessarily discrete (qualitative) or nonnegative and Gaussian models do not seem to he appropriate approximations. In these cases exact methods are typically not available, and we tum to asymptotic approximations to construct tests, confidence regions, and other methods of inference. We present three procedures that are used frequently: likelihood ratio, Wald and Rao large sample tests, and confidence procedures. These were treated for f) real in Section 5.4.4. In this section we will use the results of Section 6.2 to extend some of the results of Section 5.4.4 to vector-valued parameters.
,
'
i
• '
•
1 l •
·�
'
I '
j
. •
6.3.1
Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic
In Sections 4.9 and 6.1 we considered the likelihood ratio test statistic,
sup{p(x,9) : 9 E 8} = A(x) sup{p(x, 9) : IJ E 8o} for testing H : IJ E 80 versus K : IJ E 8,, 8, = 8 - 8o, and showed that in several statistical models involving normal distributions, .\(x) simplified and produced intuitive tests whose critical values can be obtained from the Student t and :F distributions. However, in many experimental situations in which the likelihood ratio test can be used to address important questions, the exact critical value is not available analytically. In such
'
'
I '
I •
I
'
'
Section
6.3
393
Large Sample Tests and Confidence Regions
cases we can turn to an approximation to the distribution of >.(X) based on asymptotic theory, which is usually referred to as Wilk.s '.s theorem or approximation. Other approxi mations that will be explored in Volume II are based on Monte Carlo and bootstrap simu lations. Here is an example in which Wilks's approximation to C(>.(X)) is useful:
Example 6.3.1. Suppose X�, X2 , . . . , Xn are i.i.d. as X where X has the gamma. r(a, /)), distribution with density
p(x; B)
�
1 iJ"x"- exp{-!lx}/f(a); x > 0; a > 0, il > 0.
(Ei, il), exists and in Example 2.4.2 we In Example 2.3.2 we showed that the MLE, B showed how to find B as a nonexplicit solution of likelihood equations. Thus, the numerator of .\(x) is available as p(x, B) � IT� 1 p(x;, B). Suppose we want to test H : a � 1 (exponential distribution) versus K : a i- 1. The MLE of f3 under H is readily seen from (2.3.5) to be ilo � 1/x and p(x; 1, ilo) is the denominator of the likelihood ratio statistic. It remains to find the critical value. This is not available analytically. D �
�
�
�
�
�
�
�
The approximation we shall give is based on the result "2log >.(X) � x�" for degrees of freedom d to be specified later. We next give an example that can be viewed as the limiting situation for which the approximation is exact:
Example 6.3.2. The Gaussian Linear Model with Known Variance. Let Y1 , . . . , Yn be independent with Yi rv N(Pi1 <1�) where Uo is known. As in Section 6.1.3 we test whether = ( 11 . . J.l. p, . , J.tn)T is a member of a q-dimensional linear subspace of Ir, w0, versus the alternative that J.l. E w - w0 where w is an r-dimensional linear subspace of Rn and w ::> w0; and we transform to canonical form by setting
vq span w0 where An x n is an orthogonal matrix with rows vf . . . v'[; such that V J , and v 1 , . . . , Vr span w. Set Bi T]i/Uo, i l, . . . , r and Xi Uduo, i l, . . . , n. Then Xi rv N(Bi, l), i 1 , . . . , r and Xi N(O, l), i r+l, . . . , n. Moreover. the hypothesis H is equivalent to H : Oq+ l = · · · = Or = 0. UsingSection 6.1.3, we conclude that under H. =
=
=
rv
• . . 1
1
1
=
=
=
2 Iog .\(Y)
r
�
L x; x;i=q+l �
•.
Wilks's theorem states that, under regularity conditions, when testing whether a parameter vector is restricted to an open subset of Rq or Rr, q < r, the X�-q distribution is an approximation to £(2 log .\(Y)). In this u2 known example, Wilks's approximation is D exact. We illustrate the remarkable fact that X�-q holds as an approximation to the null distri bution of 2 log A quite generally when the hypothesis is a nice q-dimensional submanifoJd of an r-dimensional parameter space with the following.
I [ j
i
'
394
Inference in the Multiparameter Case
Chapter 6
Example 6.3.3. The Gaussian Linear Model with Unknown Variance. If Yi are as in
Example 6.3.2 but CT2 is unknown then 8 = (Jl., u2 ) ranges over an r + 1-dimensional manifold whereas under H, 8 ranges over a q + l�dimensional manifold. In Section 6.1.3, we derived 2 " L... i =q+l xi ) n 2 log>.(Y � log I +
' '
''
2:: :• • +I X'f
•
•
Apply Example 5.3.7 to Vn = l:�-q+ 1 X[jn-1 _L:� r+ I X'f and conclude that Vn _£, x;_ , . Finally apply Lemma 5.3. 2 with g(t) � log(! + t), an � n, c � 0 and conclude that 2log ..\(Y) £ x;-q also in the a-2 unknown case. Note that for A (Y) defined in Remark 6.1.2, 2 log A (Y) �
Vn ". x;_,
as
o
welL
Consider the general i.i.d. case with X1 , . . . , Xn a sample from X c R', and () E 8 c W. Write the log likelihood as
p(x, B),
where x E
n ln(8) � I; log p(X B). ,,
i=l
We first consider the simple hypothesis H
:
'
1 •
!
6 = Bo.
'
Theorem 6.3.1. Suppose the assumptions of Theorem 6.2.2 are satisfied. Then, under H : () � 8o, � c 2 log>.(X ) � 2[ln(8n) - ln(8o)] � x;. �
Proof Because On solves the likelihood equation Doln(O) = 0, where Do is the derivative
�
with respect to 8, an expansion of ln(O) about On evaluated at 0 = 00 gives ,....
,....
,....
2[ln(8n) - ln(8o)] � n(8n - 8o) In(8n)(8n - 8o) � � for some ()� with [8� - 8n[ < [8n - 8o[. Here I
"
8
T
8
•
In(()) � - n L 88 88 log p(X , 8) k J i=l -
By Theorem 6.2.2, mation matrix. Because
,
(6.3.1)
! j '
•
• •
..
'"
fo(On - 80) ", N(O, J-1 (8o)), where I.xr(8) is the Fisher infor
l
'
�-
'
'
[8� - 8o[ < [8� - On [ + [On - 8o[ :0 2[0n - 8o[, we can conclude arguing from A.3 and A.4 that that In(8�) !. Eln(80) � 1(80). Hence, """
C
T
1 J(8o)V, V � N(O, I (8o)).
2[ln(8n) - ln(8o)] � V The result follows because, by Corollary B.6.2, yrJ(8o)V � x�.
(6.3.2) 0
•
•
•
•
' •
As a consequence of the theorem, the test that rejects 2 log>.(X) where Xr(l and
395
Large Sample Tests and Confidence Regions
Section 6.3
)
- o
is the
H : ()
= 80 when
> x, (1 - a),
1 - o quantile of the X; distribution, has approximately Ievel l - a, � {eo : 2[ln(en) - ln(eo)] < x,. ( l
-
(6.3.3)
oe) }
is a confidence region for 9 with approximat� coverage probability 1
-
a
.
H : 8 E 8o. where 8 is open and 8o is the Set Of e E 8 with 8j = 8o,j, j = q + 1 , . . . , T , and {8o ,j} are specified values. 2 Examples 6.3.1 and 6.3.2 illustrate such 8o. We set d = r - q, e T = (e (ll, e < l ) , e <1l = (81, . . . , 8,)r, e< 'l = (Bo+h - . . , 8, ) r, e6' l = (8o,,+1· · . . Bo , r )T Next we tum to the more general hypothesis
,
Theorem 6.3.2. Suppose that the assumptions of Theorem 6.2.2 hold for p(x, e), e E 8. Let Po be the model {Po : 8 E 8o} with corresponding parametn'zation 9( 1 ) = �( 1 ) �(1) (1) ( 81 , . , , 8,). Suppose that e0 is the MLE of e under H and that e0 satisfies A6for �( � T Po. Let eo ,n = (1101) , e0(2) ) . Then under H : e E 8o,
Proof.
Let
e0 E 8o and write 2 log >.(X)
= 2 [/n(iin) - In( eo) ] - 2 [1,. (iio,n) - ln (eo)].
It is easy to see that A0-A6 for P imply A0-A5 for Po. By
(6.3.4)
(6.2.10) and (6.3.1) applied to
� � (! ) � Bn and the corresponding argument applied to 00 , 9o,n and (6.3.4), 2 log >.(X)
= sT (eo)r 1 (eo)S(eo) - s; (eo)IQ' 1 (eo)S1 (eo) + Op(1)
where
(6.3.5)
n
n - 1 /2 = L Dl(X, , e) S(eo)
i=l
and S
= (S�, S2)T where 8 1 is the first q coordinates ofS.
Furthermore,
lo(eo) = Vare,S1 (eo). Make a change of parameter, for given true 80 in
8o,
TJ = M (e - eo) where, dropping the dependence on
Do. M = p J 1 /2
(6.3.6)
396
Inference in the M u!tiparameter Case
and P is an orthogonal matrix such that, if A0 MA0
=
{1J :
_
{8
-
Chapter 6
:
Oo 8 E 80} I
= · · · = 11, = 0, 1J E M8).
1/q+l
•
• •
•
Such P exists by the argument given in Example 6.2.1 because 111 2 Ao is the intersection of a q dimensional linear subspace of Rr with J1 12 {8 - 80 : () E 8}. Now write De for differentiation with respect to (} and D11 for differentiation with respect to TJ. Note that, by definition, >. is invariant under reparametrization =
.l.(X)
(6. 3. 7)
-y(X)
. ,: •
!
1 1
.
•
1 • •
where
-y(X) = sup{p(x, 110 1J
+ M-11))} / sup{p(x, llo + M-1'1) : llo + M- 11) E Bo}
and from (8.8.13) D1Ji(x, 110
+ M-11J) = [M-1JTDIII(x, ll).
(6.3.8)
We deduce from (6.3.6) and (6.3.8) that if T(1J)
=
.
'
n
n - 1 /2 L D1Jl (X ,, llo + M- 11)),
•
I
,
i=l
• •
then Moreover, because in terms of 1), H is {1J E M8 applying (6.3.5) to -r(X) we obtain, 2 log-r(X)
:
1Jq+ 1
= ..
·
=
(6.3.9)
1Jr
= 0}, then by
TT(O)T(O) - Tf{O)T1(0) + op(l)
q - L T,'(o) - L T,'(o) + ov(l) r
i=l r
L
i=q+ l
'
T,2 (0) + Op(l),
which has a limiting X�-q distribution by Slutsky's theorem because T(O) has a limiting Nr (O, J) distribution by (6.3.9). The result follows from (6.3. 7). D
I• ' •
(6.3.10)
i=l
•
'
Note that this argument is simply an asymptotic version of the one given in Example 6.3.2. Thus, under the conditions of Theorem 6.3.2, rejecting if .\(X) > X q ( l - a) is an asymptotically level a test of H : 8 E 9o. Of equal importance is that we obtain an asymptotic confidence region for (8q+ 1 , , Br ) a piece of 8. with 8 , . . . , Bq acting as 1 nuisance parameters. This asymptotic level 1 o: confidence region is r
• • .
�
�
�
.
-
{(Oq+l> . . . , llr) : 2[1n(11n) - ln(Oo,!, . . . , Oo,,, oq+ l , . . . , Or)] '
.
,. ' •
I
<
Xr-q(l - a)) (6.3.1 1 )
'
R:.= l •:"'g ; o:_ f;:_":.= 6:: i e ..C d_C de Te ":_:.: e'C e:: Se . 3___c::: ':: ' ;.:o o" o:e:cS::: c_:: ':c m C'p:C :.:o ' ":.= :_ o::.c "':__ '.c g.c :. =':: :::: -
cc 397
_ _ _ _ _ _ _ _ _ _ _ _
-
where Bo. 1 , . . . , Bo.q are the MLEs. themselves depending on 8q+l, . . . . 8,., of B 1 , , 8q assuming that Bq+ I , . . . , Br are known. More complicated linear hypotheses such as H : 6- Bo E w0 where w0 is a linear space of dimension q are also covered. We only need note that if wo is a linear space spanned by an orthogonal basis v 1 , . . . , Vq and Vq+ t , . . . , Vr are orthogonal tow0 and v 1 , . . . , Vr span Rr then, .
WO �
T {8 : 9 Vj � 0,
q
+ 1 < j < r}.
.
•
(6.3.12)
The extension of Theorem 6.3.2 to this situation is easy and given in Problem 6.3.2. The formulation of Theorem 6.3.2 is still inadequate for most applications. It can be extended as follows. Suppose H is specified by: There exist d functions, 9i ; e - R, q + 1 < j < r written as a vector g, such that Dg(8) exists and is of rank r - q at all 8 E e. Define H : 8 E 80 with
eo � {8 E e : g(8)
=
o}.
(6.3.13)
Evidently, Theorem 6.3.2 falls under this schema with 9;(8) � 8j - 8o,j. q + 1 < j S r. Examples such as testing for independence in contingency tables, which require the following general theorem, will appear in the next section. Theorem 6.3.3. Suppose the assumptions of Theorem 6.3.2 and the previously conditions on g. Suppose the MLE hold 9o,n under H is consistent for all (J E 80. Then, if A(X) is the likelihood ratio statistic for H : 8 E 8o given in (6.3.13), 2 log .X(X) ". under H.
X�-•
The proof is sketched in Problems (6.3.2)-(6.3.3). The essential idea is that, if 8o is true, .X(X) behaves asymptotically like a test for H : 8 E 800 where
eoo
�
{8 E e : Dg(8o)(8 - 8o )
�
0}
(6.3.14)
a hypothesis
of the form (6.3.13). Wilks's theorem depends critically on the fact that not only is 8 open but that if 60 given in (6.3.13) then the set { (8 1 . . . , 8q)T : 8 E e } is open in R•. We need both properties because we need to analyze both the numerator and denominator of A(X). As an example ofwhatcan go wrong, let (X; I , Xi2) be i.i.d. N(BI, 82, J), where J is the 2 x 2 identity matrix and 80 = {8 : 81 + 82 < 1 }. If 8I + 82 � 1, >
(j
0
�
( (XI + X2 )
2
+
� 1 _ (X I + X2) ) 2' 2
2
and 2 log .X(X) � xt but if 8 I + 82 < 1 clearly 2 log .X(X) � Op(1). Here the dimension of 80 and 8 is the same but the boundary of 8o has lower dimension. More sophisticated examples are given in Problems 6.3.5 and 6.3.6.
398
Inference in the Multi parameter Case
Chapter 6
' '
' '
6.3.2
Wald's and Rao's Large Sample Tests The Wald Test
Suppose that the assumptions of Theorem 6.2.2 hold. Then
� L ,fii(IJ - IJ) � N(o, r 1 (1J)) as n
� oo.
(6.3.15)
Because I( IJ) is continuous in IJ (Problem 6.3. 10), it follows from Proposition B.7.1(a) that
� p I(IJn) � I(IJ) as n � oo.
(6.3.16)
By Slutsky's theorem B.7.2, (6.3.15) and (6.3.16),
n(iin - IJ)T I(On)(iin - IJ) !:., yrI(IJ)V, V - Afr(O, r1 (1J)) where, according to Corollary B.6.2, yrI(IJ)V - x;. It follows that the Wald test that rejects H : 6 = Oo in favor of K : (} i= Oo when
� � T Wn (IJo) � n(IJn - IJo) I(IJo)(IJn - IJo)
>
'
Xr (1 - o)
has asymptotic level a. More generally I(Bo) can be replaced by any consistent estimate
� � of I( IJo). in particular - � D2ln (IJo) or I ( IJn) or -! D2ln (IJn). The last Hessian choice is � favored because it is usually computed automatically with the MLE. It and I(Bn) also have the advantage that the confidence region one generates {6 : Wn(O) < xp(l - a)} is an ellipsoid in W easily interpretable and computable see (6.1.31). For the more general hypothesis H : (} E 8o we write the MLE for 8 E 8 as 6n = � (1) �(2) � � � � �(2) ( 1) (IJn , IJn ) where IJn � (8" . . , 8, ) and IJn � (B,+, . . . , Br ) and define the Wald statistic as ) 2) (6.3.17) Wn (IJ� ) � n(ii�' - IJ�'ll[J"(iin Jr1 (ii�) - IJ�2))
'
r'.
"
" .
'
i'
�
i
�
.
I
where I22(IJ) is the lower diagonal block of I-1 ( IJ) written as
I -1 (IJ)
_
-
(
I " (IJ) I12(1J) I21(1J) J22(1J)
•
)
. •
i
''
.
! I
i
I
with diagonal blocks of dimension q X q and d X d, respectively. More generally, 122(iin) is replaceable by any consistent estimate of 122(8), for instance, the lower diagonal block
� of the inverse of - � D2ln( IJn). the Hessian (Problem 6.3.9).
j
Theorem 6.3.4. Under the conditions of Theorem 6.2.2, zf H is true,
Wn(IJ�2) ) !:., x;_,.
(6.3.18)
Proof. 1(8) continuous implies that I-1(8) is continuous and, hence, !22 is continuous. �(2 c (2) ) But by Theorem 6.2.2, ,fii(!Jn - IJ0 ) � Afd(O, J22( 1Jo )) if IJo E 60 holds. Slutsky's 0 theorem completes the proof. '
f I
�
'
'
,:
I
I :
.'
I !
�----------------------------------------
i l
'
•
'
Section 6.3
Large Sample Tests and Confidence Regions
399
2)
The Wald test, which rejects iff Hin ( Bb ) > X,-- q (1 level a . What is not as evident is that, under H,
) is, therefore. asymptotically
- a ,
�( 2 (6.3.19) Wn (90 ) ) = 2 log ,\(X) + ap(l) where A(X) is the LR statistic for H : (} C: 80. The argument is sketched in Problem 6.3.9.
Thus, the two tests are equivalent asymptotically. The Wald test leads to the Wald confidence regions for ( Bq + 1 , . Br) T given by { 8(2 ) 2 Wn (9 ( 1 ) < Xr-q(l - u)}. These regions are ellipsoids in R"- Although, as (6.3.19) indicates, the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large n, in practice they can be very different. .
.
�
:
The Rao Score Test

For the simple hypothesis $H : \theta = \theta_0$, Rao's score test is based on the observation that, by the central limit theorem,
$$\sqrt{n}\,\psi_n(\theta_0) \xrightarrow{\mathcal{L}} N(0, I(\theta_0))$$
where $\psi_n = n^{-1} D l_n(\theta_0)$ is the likelihood score vector. It follows from this and Corollary B.6.2 that under $H$, as $n \to \infty$,
$$R_n(\theta_0) = n\,\psi_n^T(\theta_0) I^{-1}(\theta_0)\psi_n(\theta_0) \xrightarrow{\mathcal{L}} \chi^2_r. \qquad (6.3.20)$$
The test that rejects $H$ when $R_n(\theta_0) > x_r(1-\alpha)$ is called the Rao score test. This test has the advantage that it can be carried out without computing the MLE, and the convergence $R_n(\theta_0) \xrightarrow{\mathcal{L}} \chi^2_r$ requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests.

The extension of the Rao test to $H : \theta \in \Theta_0$ runs as follows. Let
$$\psi_n(\theta) = n^{-1} D_2 l_n(\theta)$$
where $D_1 l_n$ represents the $q \times 1$ gradient with respect to the first $q$ coordinates and $D_2 l_n$ the $d \times 1$ gradient with respect to the last $d$. The Rao test is based on the statistic
$$R_n(\theta_0^{(2)}) = n\,\psi_n^T(\hat{\theta}_{0,n})\,\hat{\Sigma}^{-1}\,\psi_n(\hat{\theta}_{0,n})$$
where $\hat{\Sigma}$ is a consistent estimate of $\Sigma(\theta_0)$, the asymptotic variance of $\sqrt{n}\,\psi_n(\hat{\theta}_{0,n})$ under $H$. It can be shown that (Problem 6.3.8)
$$\Sigma(\theta_0) = I_{22}(\theta_0) - I_{21}(\theta_0) I_{11}^{-1}(\theta_0) I_{12}(\theta_0) \qquad (6.3.21)$$
where $I_{11}$ is the upper left $q \times q$ block of the $r \times r$ information matrix $I(\theta_0)$, $I_{12}$ is the upper right $q \times d$ block, and so on. Furthermore (Problem 6.3.9), under A0--A6 and consistency of $\hat{\theta}_{0,n}$ under $H$, a consistent estimate of $\Sigma^{-1}(\theta_0)$ is
$$n^{-1}\left[-D_2^2 l_n(\hat{\theta}_{0,n}) + D_{21} l_n(\hat{\theta}_{0,n})[D_1^2 l_n(\hat{\theta}_{0,n})]^{-1} D_{12} l_n(\hat{\theta}_{0,n})\right] \qquad (6.3.22)$$
where $D_2^2$ is the $d \times d$ matrix of second partials of $l_n$ with respect to $\theta^{(2)}$, $D_{21}$ is the matrix of mixed second partials, and so on. Under $H$,
$$R_n(\theta_0^{(2)}) \xrightarrow{\mathcal{L}} \chi^2_d.$$
The Rao large sample critical and confidence regions are $\{R_n(\theta_0^{(2)}) \ge x_d(1-\alpha)\}$ and $\{\theta^{(2)} : R_n(\theta^{(2)}) < x_d(1-\alpha)\}$. The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under $H$. On the other hand, it shares the disadvantage of the Wald test that matrices need to be computed and inverted.

Power Behavior of the LR, Rao, and Wald Tests

It is possible, as in the one-dimensional case, to derive the asymptotic power for these tests for alternatives of the form $\theta_n = \theta_0 + \Delta/\sqrt{n}$ where $\theta_0 \in \Theta_0$. The analysis for $\Theta_0 = \{\theta_0\}$ is relatively easy. For instance, for the Wald test
$$n(\hat{\theta}_n - \theta_0)^T I(\hat{\theta}_n)(\hat{\theta}_n - \theta_0) = (\sqrt{n}(\hat{\theta}_n - \theta_n) + \Delta)^T I(\hat{\theta}_n)(\sqrt{n}(\hat{\theta}_n - \theta_n) + \Delta) \xrightarrow{\mathcal{L}} \chi^2_r(\Delta^T I(\theta_0)\Delta)$$
where $\chi^2_m(\gamma^2)$ is the noncentral chi square distribution with $m$ degrees of freedom and noncentrality parameter $\gamma^2$. It may be shown that the equivalence (6.3.19) holds under $\theta_n$ and that the power behavior is unaffected and applies to all three tests. Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score tests; see Rao (1973) for more on this.

Summary. We considered the problem of testing $H : \theta \in \Theta_0$ versus $K : \theta \in \Theta - \Theta_0$ where $\Theta$ is an open subset of $R^r$ and $\Theta_0$ is the collection of $\theta \in \Theta$ with the last $r - q$ coordinates $\theta^{(2)}$ specified. We established Wilks's theorem, which states that if $\lambda(\mathbf{X})$ is the LR statistic, then, under regularity conditions, $2\log\lambda(\mathbf{X})$ has an asymptotic $\chi^2_{r-q}$ distribution under $H$. We also considered a quadratic form, called the Wald statistic, which measures the distance between the hypothesized value of $\theta^{(2)}$ and its MLE, and showed that this quadratic form has limiting distribution $\chi^2_{r-q}$. Finally, we introduced the Rao score test, which is based on a quadratic form in the gradient of the log likelihood. The asymptotic distribution of this quadratic form is also $\chi^2_{r-q}$.
6.4 LARGE SAMPLE METHODS FOR DISCRETE DATA
In this section we give a number of important applications of the general methods we have developed to inference for discrete data. In particular we shall discuss problems of
goodness-of-fit and special cases of log linear and generalized linear models (GLM), treated in more detail in Section 6.5.
6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's $\chi^2$ Test
As in Examples 1.6.7, 2.2.8, and 2.3.3, consider i.i.d. trials in which $X_i = j$ if the $i$th trial produces a result in the $j$th category, $j = 1, \ldots, k$. Let $\theta_j = P(X_i = j)$ be the probability of the $j$th category. Because $\theta_k = 1 - \sum_{j=1}^{k-1}\theta_j$, we consider the parameter $\theta = (\theta_1, \ldots, \theta_{k-1})^T$ and test the hypothesis $H : \theta_j = \theta_{0j}$ for specified $\theta_{0j}$, $j = 1, \ldots, k-1$. Thus, we may be testing whether a random number generator used in simulation experiments is producing values according to a given distribution, or we may be testing whether the phenotypes in a genetic experiment follow the frequencies predicted by theory. In Example 2.2.8 we found the MLE $\hat{\theta}_j = N_j/n$, where $N_j = \sum_{i=1}^{n} 1\{X_i = j\}$. It follows that the large sample LR rejection region is
$$2\log\lambda(\mathbf{X}) = 2\sum_{j=1}^{k} N_j\log(N_j/n\theta_{0j}) > x_{k-1}(1-\alpha).$$
To find the Wald test, we need the information matrix $I = \|I_{ij}\|$. For $i, j = 1, \ldots, k-1$, we find using (2.2.33) and (3.4.32) that
$$I_{ij}(\theta) = \begin{cases} \dfrac{1}{\theta_i} + \dfrac{1}{\theta_k} & \text{if } i = j, \\[2mm] \dfrac{1}{\theta_k} & \text{if } i \ne j. \end{cases}$$
Thus, with $\theta_{0k} = 1 - \sum_{j=1}^{k-1}\theta_{0j}$, the Wald statistic is
$$W_n(\theta_0) = n\sum_{j=1}^{k-1}[\hat{\theta}_j - \theta_{0j}]^2/\theta_{0j} + n\sum_{i=1}^{k-1}\sum_{j=1}^{k-1}(\hat{\theta}_i - \theta_{0i})(\hat{\theta}_j - \theta_{0j})/\theta_{0k}.$$
The second term on the right is
$$n\Big[\sum_{j=1}^{k-1}(\hat{\theta}_j - \theta_{0j})\Big]^2\Big/\theta_{0k} = n(\hat{\theta}_k - \theta_{0k})^2/\theta_{0k}.$$
Thus,
$$W_n(\theta_0) = \sum_{j=1}^{k}(N_j - n\theta_{0j})^2/n\theta_{0j}.$$
The term on the right is called Pearson's chi-square ($\chi^2$) statistic and is the statistic that is typically used for this multinomial testing problem. It is easily remembered as
$$\chi^2 = \text{SUM}\left[\frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}\right] \qquad (6.4.1)$$
where the sum is over categories and "expected" refers to the expected frequency $E_H(N_j)$. The general form (6.4.1) of Pearson's $\chi^2$ will reappear in other multinomial applications in this section.
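The following Python sketch computes both the LR statistic above and Pearson's $\chi^2$ of (6.4.1) for a multinomial sample; the counts and null probabilities are invented for illustration.

```python
import numpy as np
from scipy.special import xlogy
from scipy.stats import chi2

# Observed category counts N_j and hypothesized probabilities theta_0j (hypothetical values).
N = np.array([18, 55, 27])
theta0 = np.array([0.25, 0.50, 0.25])
n = N.sum()
expected = n * theta0

# LR statistic 2 sum N_j log(N_j / (n theta_0j)); xlogy handles any zero counts.
lr = 2 * np.sum(xlogy(N, N / expected))

# Pearson's chi-square: sum (Observed - Expected)^2 / Expected, as in (6.4.1).
pearson = np.sum((N - expected) ** 2 / expected)

df = len(N) - 1
print(lr, pearson, chi2.sf(pearson, df))  # both statistics are referred to the chi^2_{k-1} table
```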
with
To find J- 1 , we could invert I or note that by (6.2. 1 1) , J-1 (9) � with and, by � A. N (N1 , . . . , Nk _I ) T l3 . 1 5, llaij ll( k-1) x (k-l) =
=
=
Var ( N ) , where
=
Thus, the Rao statistic is
-
(
n
' .
(
k-1 k-1 . () L L . =1 z=1 . ()Oi J
.
. .
-
:
.�
_..._
_..._
_ z
.. "'
()k 8ok
·";;
)(
)
_..._
_..._
[ (!!J_ �) ]
The second term on the right is
k-1 L 8oj j=1 8oj 8ok _..._
-n
_..._
2
[� ] _..._
= -n
8ok
To simplify the first term on the right of (6.4.2), we write _..._
()j 8oj
-
_..._
-
()k 8ok
-
-
2
1
-1
{[8ok (()j - 8oj) ] - [8oj (()k - 8ok )] } 8oj 8ok , _..._
=
)
()k 8oi8oj . 8J· - _ 8o1· 8ok .
,.._
and expand the square keeping the square brackets intact. Then, because
k-1 L (Oj - 8oj ) j=1
=
- (Ok - 8ok),
(6.4.2)
403
Large Sam ple M ethods for Discrete Data
Section 6 . 4
the first term on the right of ( 6 . 4 . 2 ) becomes
It follows that the Rao statistic equals Pearson 's x2 .
Example 6.4. 1. Testing
a
Genetic Theory. In experiments on pea breeding, Mendel ob
served the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds. Possible types of progeny were:
(2)
wrinkled yellow;
(3)
round green; and
(1)
round yellow;
wrinkled green. If we assume the seeds are
(4)
produced independently, we can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered occurrence
e4
B1 , (h, 83, 84.
1 , 2, 3, 4
as above and associated probabilities of
Mendel's theory predicted that
81
=
9/16,
82
1 / 1 6, and we want to test whether the di stribution of types in the n
=
= =
83
performed (seeds he observed) is consistent with his theory. Mendel observed
n2 nB4o
101,
= =
n3
34 . 75,
k
= =
X2
108, 4 =
n4
=
(2.25) 2 3 1 2 . 75
32.
+
Then,
( 3 . 2 5) 2 104 . 2 5
+
nB10
=
( 3 . 75 ) 2 104 . 2 5
3 1 2 . 75,
+
nB2o
( 2 . 75 ) 2 34 . 75
=
=
=
3 / 1 6,
556 trials he
n830
n1
=
=
3 1 5,
104.25,
0 . 47 '
which has a p-value of 0.9 when referred to a $\chi^2_3$ table. There is insufficient evidence to reject Mendel's hypothesis. For comparison, $2\log\lambda = 0.48$ in this case. However, this value may be too small! See Note.
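A short Python check of the arithmetic in Example 6.4.1 (the counts and the 9:3:3:1 probabilities are Mendel's, as quoted above):

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([315, 101, 108, 32])
theta0 = np.array([9, 3, 3, 1]) / 16
expected = observed.sum() * theta0                        # 312.75, 104.25, 104.25, 34.75

pearson = np.sum((observed - expected) ** 2 / expected)   # about 0.47
lr = 2 * np.sum(observed * np.log(observed / expected))   # about 0.48

print(round(pearson, 2), round(lr, 2), round(chi2.sf(pearson, df=3), 2))  # p-value about 0.9
```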
6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables

Suppose $\mathbf{N} = (N_1, \ldots, N_k)^T$ has a multinomial, $\mathcal{M}(n, \theta)$, distribution. We will investigate how to test $H : \theta \in \Theta_0$ versus $K : \theta \notin \Theta_0$, where $\Theta_0$ is a composite "smooth" subset of the $(k-1)$-dimensional parameter space
$$\Theta = \Big\{\theta : \theta_i \ge 0,\ 1 \le i \le k,\ \sum_{i=1}^{k}\theta_i = 1\Big\}.$$
For example, in the Hardy-Weinberg model (Example 2.1.4),
$$\Theta_0 = \{(\eta^2,\ 2\eta(1-\eta),\ (1-\eta)^2) : 0 \le \eta \le 1\},$$
which is a one-dimensional curve in the two-dimensional parameter space $\Theta$. Here testing the adequacy of the Hardy-Weinberg model means testing $H : \theta \in \Theta_0$ versus $K : \theta \in \Theta_1$ where $\Theta_1 = \Theta - \Theta_0$. Other examples, which will be pursued further later in this section, involve restrictions on the $\theta_i$ obtained by specifying independence assumptions on classifications of cases into different categories. We suppose that we can describe $\Theta_0$ parametrically as
$$\Theta_0 = \{(\theta_1(\eta), \ldots, \theta_k(\eta))^T : \eta \in \mathcal{E}\}$$
where $\eta = (\eta_1, \ldots, \eta_q)^T$, $\mathcal{E}$ is a subset of $q$-dimensional space, and the map $\eta \to (\theta_1(\eta), \ldots, \theta_k(\eta))^T$ takes $\mathcal{E}$ into $\Theta_0$. To avoid trivialities we assume $q < k - 1$. Consider the likelihood ratio test for $H : \theta \in \Theta_0$ versus $K : \theta \notin \Theta_0$. Let $p(n_1, \ldots, n_k, \theta)$ denote the frequency function of $\mathbf{N}$. Maximizing $p(n_1, \ldots, n_k, \theta)$ for $\theta \in \Theta_0$ is the same as maximizing $p(n_1, \ldots, n_k, \theta(\eta))$ for $\eta \in \mathcal{E}$. If a maximizing value $\hat{\eta} = (\hat{\eta}_1, \ldots, \hat{\eta}_q)$ exists, the log likelihood ratio is given by
$$\log\lambda(n_1, \ldots, n_k) = \sum_{i=1}^{k} n_i\left[\log(n_i/n) - \log\theta_i(\hat{\eta})\right].$$
If $\eta \to \theta(\eta)$ is differentiable in each coordinate, $\mathcal{E}$ is open, and $\hat{\eta}$ exists, then it must solve the likelihood equation for the model $\{p(\cdot, \theta(\eta)) : \eta \in \mathcal{E}\}$. That is, $\hat{\eta}$ satisfies
$$\frac{\partial}{\partial\eta_j}\log p(n_1, \ldots, n_k, \theta(\eta)) = 0, \quad 1 \le j \le q, \qquad (6.4.3)$$
or
$$\sum_{i=1}^{k}\frac{n_i}{\theta_i(\eta)}\frac{\partial}{\partial\eta_j}\theta_i(\eta) = 0, \quad 1 \le j \le q.$$
If $\mathcal{E}$ is not open, sometimes the closure of $\mathcal{E}$ will contain a solution of (6.4.3). To apply the results of Section 6.3 and conclude that, under $H$, $2\log\lambda$ approximately has a $\chi^2_{k-1-q}$ distribution for large $n$ (here $r = k-1$), we define $\theta_j' = g_j(\theta)$, $j = 1, \ldots, r$, where $g_j$ is chosen so that $H$ becomes equivalent to "$(\theta_1', \ldots, \theta_q')^T$ ranges over an open subset of $R^q$ and $\theta_j' = \theta_{0j}'$, $j = q+1, \ldots, r$, for specified $\theta_{0j}'$." For instance, to test the Hardy-Weinberg model we set $\theta_1' = \theta_1$, $\theta_2' = \theta_2 - 2\sqrt{\theta_1}(1 - \sqrt{\theta_1})$ and test $H : \theta_2' = 0$. Then we can conclude from Theorem 6.3.3 that $2\log\lambda$ approximately has a $\chi^2_{r-q}$ distribution under $H$. The Rao statistic is also invariant under reparametrization and, thus, approximately $\chi^2_{k-1-q}$. Moreover, we obtain the Rao statistic for the composite multinomial hypothesis by replacing $\theta_{0j}$ in (6.4.2) by $\theta_j(\hat{\eta})$. The algebra showing $R_n(\theta_0) = \chi^2$ in Section 6.4.1 now leads to the Rao statistic
$$R_n(\theta(\hat{\eta})) = \sum_{j=1}^{k}\frac{[N_j - n\theta_j(\hat{\eta})]^2}{n\theta_j(\hat{\eta})} = \chi^2$$
where the right-hand side is Pearson's $\chi^2$ as defined in general by (6.4.1). The Wald statistic is only asymptotically invariant under reparametrization. However, the Wald statistic based on the parametrization $\theta(\eta)$, obtained by replacing $\theta_{0j}$ by $\theta_j(\hat{\eta})$, is, by the algebra of Section 6.4.1, also equal to Pearson's $\chi^2$.
Example 6.4.4. Hardy-Weinberg. We found in Example 2.2.6 that if $\Theta_0$ holds, $\hat{\eta} = (2n_1 + n_2)/2n$. Thus, $H$ is rejected if $\chi^2 \ge x_1(1-\alpha)$ with
$$\theta(\hat{\eta}) = \left(\left(\frac{2n_1 + n_2}{2n}\right)^2,\ \frac{(2n_1 + n_2)(2n_3 + n_2)}{2n^2},\ \left(\frac{2n_3 + n_2}{2n}\right)^2\right)^T. \qquad \square$$
Example 6.4.5. The Fisher Linkage Model. A self-crossing of maize heterozygous on two characteristics (starchy versus sugary; green base leaf versus white base leaf) leads to four possible offspring types: (1) sugary-white; (2) sugary-green; (3) starchy-white; (4) starchy-green. If $N_i$ is the number of offspring of type $i$ among a total of $n$ offspring, then $(N_1, \ldots, N_4)$ has a $\mathcal{M}(n, \theta_1, \ldots, \theta_4)$ distribution. A linkage model (Fisher, 1958, p. 301) specifies that
$$\theta_1 = \tfrac{1}{4}(2 + \eta), \quad \theta_2 = \theta_3 = \tfrac{1}{4}(1 - \eta), \quad \theta_4 = \tfrac{1}{4}\eta$$
where $\eta$ is an unknown number between 0 and 1. To test the validity of the linkage model we would take $\Theta_0 = \{(\tfrac{1}{4}(2+\eta), \tfrac{1}{4}(1-\eta), \tfrac{1}{4}(1-\eta), \tfrac{1}{4}\eta) : 0 \le \eta \le 1\}$, a "one-dimensional curve" of the three-dimensional parameter space $\Theta$. The likelihood equation (6.4.3) becomes
$$\frac{n_1}{2+\eta} - \frac{n_2 + n_3}{1-\eta} + \frac{n_4}{\eta} = 0, \qquad (6.4.4)$$
which reduces to a quadratic equation in $\hat{\eta}$. The only root of this equation in $[0, 1]$ is the desired estimate (see Problem 6.4.1). Because $q = 1$, $k = 4$, we obtain critical values from the $\chi^2_2$ tables. $\square$
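A minimal numerical sketch of Example 6.4.5: clearing denominators in (6.4.4) gives the quadratic $n\eta^2 - (n_1 - 2n_2 - 2n_3 - n_4)\eta - 2n_4 = 0$, whose unique root in $[0,1]$ is $\hat{\eta}$. The offspring counts used below are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

counts = np.array([58, 14, 11, 17])      # hypothetical (n1, n2, n3, n4)
n1, n2, n3, n4 = counts
n = counts.sum()

# Quadratic obtained from the likelihood equation (6.4.4).
a, b, c = n, -(n1 - 2 * n2 - 2 * n3 - n4), -2 * n4
roots = np.roots([a, b, c])
eta_hat = float(roots[(roots >= 0) & (roots <= 1)][0])

theta_hat = np.array([(2 + eta_hat) / 4, (1 - eta_hat) / 4, (1 - eta_hat) / 4, eta_hat / 4])
expected = n * theta_hat

pearson = np.sum((counts - expected) ** 2 / expected)
print(eta_hat, pearson, chi2.sf(pearson, df=2))   # df = k - 1 - q = 2
```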
Testing Independence of Classifications in Contingency Tables

Many important characteristics have only two categories. An individual either is or is not inoculated against a disease; is or is not a smoker; is male or female; and so on. We often want to know whether such characteristics are linked or are independent. For instance, do smoking and lung cancer have any relation to each other? Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic $A$ and $\bar{A}$ and of the second $B$ and $\bar{B}$. Then a randomly selected individual from the population can be one of four types $AB$, $A\bar{B}$, $\bar{A}B$, $\bar{A}\bar{B}$. Denote the probabilities of these types by $\theta_{11}$, $\theta_{12}$, $\theta_{21}$, $\theta_{22}$, respectively. Independent classification then means that the events [being an $A$] and [being a $B$] are independent or, in terms of the $\theta_{ij}$,
$$\theta_{ij} = (\theta_{i1} + \theta_{i2})(\theta_{1j} + \theta_{2j}).$$
To study the relation between the two characteristics we take a random sample of size $n$ from the population. The results are assembled in what is called a $2 \times 2$ contingency table such as the one shown.
            B        B̄
  A       N_11     N_12
  Ā       N_21     N_22
The entries in the boxes of the table indicate the number of individuals in the sample who belong to the categories of the appropriate row and column. Thus, for example, $N_{12}$ is the number of sampled individuals who fall in category $A$ of the first characteristic and category $\bar{B}$ of the second characteristic. Then, if $\mathbf{N} = (N_{11}, N_{12}, N_{21}, N_{22})^T$, we have $\mathbf{N} \sim \mathcal{M}(n, \theta_{11}, \theta_{12}, \theta_{21}, \theta_{22})$. We test the hypothesis $H : \theta \in \Theta_0$ versus $K : \theta \notin \Theta_0$, where $\Theta_0$ is a two-dimensional subset of $\Theta$ given by
$$\Theta_0 = \{(\eta_1\eta_2,\ \eta_1(1-\eta_2),\ \eta_2(1-\eta_1),\ (1-\eta_1)(1-\eta_2)) : 0 \le \eta_1 \le 1,\ 0 \le \eta_2 \le 1\}.$$
Here we have relabeled $\theta_{11} + \theta_{12}$, $\theta_{11} + \theta_{21}$ as $\eta_1$, $\eta_2$ to indicate that these are parameters which vary freely. For $\theta \in \Theta_0$, the likelihood equations (6.4.3) become
$$\frac{n_{11} + n_{12}}{\hat{\eta}_1} = \frac{n_{21} + n_{22}}{1 - \hat{\eta}_1}, \qquad \frac{n_{11} + n_{21}}{\hat{\eta}_2} = \frac{n_{12} + n_{22}}{1 - \hat{\eta}_2}, \qquad (6.4.5)$$
whose solutions are
$$\hat{\eta}_1 = (n_{11} + n_{12})/n, \qquad \hat{\eta}_2 = (n_{11} + n_{21})/n, \qquad (6.4.6)$$
the proportions of individuals of type $A$ and type $B$, respectively. These solutions are the maximum likelihood estimates. Pearson's statistic is then easily seen to be
$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(N_{ij} - R_iC_j/n)^2}{R_iC_j/n} \qquad (6.4.7)$$
where $R_i = N_{i1} + N_{i2}$ is the $i$th row sum and $C_j = N_{1j} + N_{2j}$ is the $j$th column sum. By our theory, if $H$ is true, because $k = 4$, $q = 2$, $\chi^2$ has approximately a $\chi^2_1$ distribution. This suggests that $\chi^2$ may be written as the square of a single (approximately) standard normal variable. In fact (Problem 6.4.2), the $(N_{ij} - R_iC_j/n)$ are all the same in absolute value and $\chi^2 = Z^2$, where
$$Z = \frac{N_{11} - R_1C_1/n}{\sqrt{R_1R_2C_1C_2/n^3}}.$$
An important alternative form for $Z$ is given by
$$Z = \sqrt{n}\,\frac{N_{11}N_{22} - N_{12}N_{21}}{\sqrt{R_1R_2C_1C_2}}. \qquad (6.4.8)$$
Thus,
$$Z = \sqrt{n}\left[\hat{P}(A \mid B) - \hat{P}(A \mid \bar{B})\right]\left[\frac{\hat{P}(B)\hat{P}(\bar{B})}{\hat{P}(A)\hat{P}(\bar{A})}\right]^{1/2}$$
where $\hat{P}$ is the empirical distribution and where we use $A$, $\bar{A}$, $B$, $\bar{B}$ to denote the event that a randomly selected individual has characteristic $A$, $\bar{A}$, $B$, $\bar{B}$. Thus, if $\chi^2$ measures deviations from independence, $Z$ indicates what directions these deviations take. Positive values of $Z$ indicate that $A$ and $B$ are positively associated (i.e., that $A$ is more likely to occur in the presence of $B$ than it would in the presence of $\bar{B}$). It may be shown (Problem 6.4.3) that if $A$ and $B$ are independent, that is, $P(A \mid B) = P(A \mid \bar{B})$, then $Z$ is approximately distributed as $N(0, 1)$. Therefore, it is reasonable to use the test that rejects, if and only if,
$$Z \ge z(1-\alpha)$$
as a level $\alpha$ one-sided test of $H : P(A \mid B) = P(A \mid \bar{B})$ (or $P(A \mid B) \le P(A \mid \bar{B})$) versus $K : P(A \mid B) > P(A \mid \bar{B})$. The $\chi^2$ test is equivalent to rejecting (two-sidedly) if, and only if,
$$\chi^2 = Z^2 \ge x_1(1-\alpha).$$
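Here is a small Python sketch for a hypothetical $2 \times 2$ table, computing Pearson's $\chi^2$ of (6.4.7), the statistic $Z$ of (6.4.8), and checking that $\chi^2 = Z^2$.

```python
import numpy as np
from scipy.stats import norm, chi2

# A hypothetical 2x2 table of counts: rows A, A-bar; columns B, B-bar.
N = np.array([[42, 18],
              [28, 32]], dtype=float)
n = N.sum()
R = N.sum(axis=1)                     # row sums R_1, R_2
C = N.sum(axis=0)                     # column sums C_1, C_2
expected = np.outer(R, C) / n

pearson = np.sum((N - expected) ** 2 / expected)                              # (6.4.7)
Z = np.sqrt(n) * (N[0, 0] * N[1, 1] - N[0, 1] * N[1, 0]) / np.sqrt(R.prod() * C.prod())

print(np.isclose(pearson, Z ** 2))    # chi^2 = Z^2
print(chi2.sf(pearson, df=1))         # two-sided p-value
print(norm.sf(Z))                     # one-sided p-value for K: P(A|B) > P(A|B-bar)
```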
Next we consider contingency tables for two nonnumerical characteristics having $a$ and $b$ states, respectively, $a, b \ge 2$ (e.g., eye color, hair color). If we take a sample of size $n$ from a population and classify them according to each characteristic we obtain a vector $N_{ij}$, $i = 1, \ldots, a$, $j = 1, \ldots, b$, where $N_{ij}$ is the number of individuals of type $i$ for characteristic 1 and $j$ for characteristic 2. If $\theta_{ij} = P[\text{A randomly selected individual is of type } i \text{ for 1 and } j \text{ for 2}]$, then
$$\{N_{ij} : 1 \le i \le a,\ 1 \le j \le b\} \sim \mathcal{M}(n, \theta_{ij} : 1 \le i \le a,\ 1 \le j \le b).$$
The hypothesis that the characteristics are assigned independently becomes $H : \theta_{ij} = \eta_{i1}\eta_{j2}$ for $1 \le i \le a$, $1 \le j \le b$, where the $\eta_{i1}$, $\eta_{j2}$ are nonnegative and $\sum_{i=1}^{a}\eta_{i1} = \sum_{j=1}^{b}\eta_{j2} = 1$. The $N_{ij}$ can be arranged in an $a \times b$ contingency table,

           1       2      ...      b
   1     N_11    N_12     ...    N_1b     R_1
   .       .       .                .       .
   a     N_a1    N_a2     ...    N_ab     R_a
         C_1     C_2      ...    C_b      n
with row and column sums as indicated. Maximum likelihood and dimensionality calculations similar to those for the $2 \times 2$ table show that Pearson's $\chi^2$ for the hypothesis of independence is given by
$$\chi^2 = \sum_{i=1}^{a}\sum_{j=1}^{b}\frac{(N_{ij} - R_iC_j/n)^2}{R_iC_j/n} \qquad (6.4.9)$$
which has approximately a $\chi^2_{(a-1)(b-1)}$ distribution under $H$. The argument is left to the problems, as are some numerical applications.
6.4.3 Logistic Regression for Binary Responses
In Section 6.1 we considered linear models that are appropriate for analyzing continuous responses $\{Y_i\}$ that are, perhaps after a transformation, approximately normally distributed and whose means are modeled as $\mu_i = \sum_{j=1}^{p} z_{ij}\beta_j = \mathbf{z}_i^T\beta$ for known constants $\{z_{ij}\}$ and unknown parameters $\beta_1, \ldots, \beta_p$. In this section we will consider Bernoulli responses $Y$ that can only take on the values 0 and 1. Examples are (1) medical trials where at the end of the trial the patient has either recovered ($Y = 1$) or has not recovered ($Y = 0$), (2) election polls where a voter either supports a proposition ($Y = 1$) or does not ($Y = 0$), or (3) market research where a potential customer either desires a new product ($Y = 1$) or does not ($Y = 0$). As is typical, we call $Y = 1$ a "success" and $Y = 0$ a "failure."

We assume that the distribution of the response $Y$ depends on the known covariate vector $\mathbf{z}$. In this section we assume that the data are grouped or replicated so that for each fixed $i$ we observe the number of successes $X_i = \sum_{j=1}^{m_i} Y_{ij}$, where $Y_{ij}$ is the response on the $j$th of the $m_i$ trials in block $i$, $1 \le i \le k$. Thus, we observe independent $X_1, \ldots, X_k$ with $X_i$ binomial, $B(m_i, \pi_i)$, where $\pi_i = \pi(\mathbf{z}_i)$ is the probability of success for a case with covariate vector $\mathbf{z}_i$.

Next we choose a parametric model for $\pi(\mathbf{z})$ that will generate useful procedures for analyzing experiments with binary responses. Because $\pi(\mathbf{z})$ varies between 0 and 1, a simple linear representation $\mathbf{z}^T\beta$ for $\pi(\mathbf{z})$ over the whole range of $\mathbf{z}$ is impossible. Instead we turn to the logistic transform $g(\pi)$, usually called the logit, which we introduced in Example 1.6.8 as the canonical parameter
$$\eta = g(\pi) = \log[\pi/(1-\pi)]. \qquad (6.4.10)$$
Other transforms, such as the probit $g_1(\pi) = \Phi^{-1}(\pi)$, where $\Phi$ is the $N(0,1)$ d.f., and the log-log transform $g_2(\pi) = \log[-\log(1-\pi)]$, are also used in practice. The log likelihood of $\pi = (\pi_1, \ldots, \pi_k)^T$ based on $\mathbf{X} = (X_1, \ldots, X_k)^T$ is
$$\sum_{i=1}^{k}\left[X_i\log\left(\frac{\pi_i}{1-\pi_i}\right) + m_i\log(1-\pi_i)\right] + \sum_{i=1}^{k}\log\binom{m_i}{X_i}. \qquad (6.4.11)$$
When we use the logit transform $g(\pi)$, we obtain what is called the logistic linear regression model, where
$$\eta_i = g(\pi_i) = \mathbf{z}_i^T\beta.$$
The special case $p = 2$, $\mathbf{z}_i = (1, z_i)^T$ is the logistic regression model of Problem 2.3.1. The log likelihood $l_N(\beta)$ of $\beta = (\beta_1, \ldots, \beta_p)^T$ is, if $N = \sum_{i=1}^{k} m_i$,
$$l_N(\beta) = \sum_{j=1}^{p}\beta_jT_j - \sum_{i=1}^{k} m_i\log\left(1 + \exp\{\mathbf{z}_i^T\beta\}\right) + \sum_{i=1}^{k}\log\binom{m_i}{X_i} \qquad (6.4.12)$$
where
$$T_j = \sum_{i=1}^{k} z_{ij}X_i$$
and we make the dependence on $N$ explicit. Note that $l_N(\beta)$ is the log likelihood of a $p$-parameter canonical exponential model with parameter vector $\beta$ and sufficient statistic $\mathbf{T} = (T_1, \ldots, T_p)^T$. It follows that the MLE of $\beta$ solves $E_\beta(T_j) = T_j$, $j = 1, \ldots, p$, or $E_\beta(\mathbf{Z}^T\mathbf{X}) = \mathbf{Z}^T\mathbf{X}$, where $\mathbf{Z} = \|z_{ij}\|_{k \times p}$ is the design matrix. Thus, Theorem 2.3.1 applies and we can conclude that if $0 < X_i < m_i$ and $\mathbf{Z}$ has rank $p$, the solution to this equation exists and gives the unique MLE $\hat{\beta}$ of $\beta$. The condition is sufficient but not necessary for existence; see Problem 2.3.1. We let $\mu_i = E(X_i) = m_i\pi_i$. Then, because $E(T_j) = \sum_{i=1}^{k} z_{ij}\mu_i$, the likelihood equations are just
$$\mathbf{Z}^T(\mathbf{X} - \mu) = 0. \qquad (6.4.13)$$
By Theorem 2.3.1 and by Proposition 3.4.4 the Fisher information matrix is
$$I(\beta) = \mathbf{Z}^TW\mathbf{Z} \qquad (6.4.14)$$
where $W = \mathrm{diag}\{m_i\pi_i(1-\pi_i)\}_{k\times k}$. The coordinate ascent iterative procedure of Section 2.4.2 can be used to compute the MLE of $\beta$.

Alternatively, with a good initial value the Newton-Raphson algorithm can be employed. Although, unlike coordinate ascent, Newton-Raphson need not converge, we can guarantee convergence with probability tending to 1 as $N \to \infty$ as follows. As the initial estimate use
$$\hat{\beta}_0 = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^TV, \quad V_i = \log\left(\frac{X_i + \tfrac{1}{2}}{m_i - X_i + \tfrac{1}{2}}\right), \qquad (6.4.15)$$
the empirical logistic transform. Because
$$\beta = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\eta \quad \text{and} \quad \eta_i = \log[\pi_i(1-\pi_i)^{-1}], \qquad (6.4.16)$$
$\hat{\beta}_0$ is a plug-in estimate of $\beta$ where $\pi_i$ and $(1-\pi_i)$ in $\eta_i$ have been replaced by
$$\pi_i^* = \frac{X_i}{m_i} + \frac{1}{2m_i}, \qquad (1-\pi_i)^* = 1 - \frac{X_i}{m_i} + \frac{1}{2m_i}.$$
Here the adjustment $1/2m_i$ is used to avoid $\log 0$ and $\log\infty$. Similarly, in (6.4.14), $W$ is estimated using $\hat{W}_0 = \mathrm{diag}\{m_i\pi_i^*(1-\pi_i)^*\}$. Using the $\delta$-method, it follows by Theorem 5.3.3 that if $m_i \to \infty$, $\pi_i > 0$ for $1 \le i \le k$,
$$\sqrt{m_i}\left(V_i - \log\frac{\pi_i}{1-\pi_i}\right) \xrightarrow{\mathcal{L}} N\left(0, [\pi_i(1-\pi_i)]^{-1}\right).$$
6
Because
Z has rank p, it fo1lows
(Problem
6.4. 14) that {30
is consistent.
To get expressions for the MLEs of 7r and JL, recall from Example
1 .6.8 that the inverse
of the logit transform g is the logistic distribution function
Thus, the MLE of 1ri is
Jfi
=
g
-1
( L:�=l Xij(jj ) . Testing
In analogy with Section
6.1,
we let w
=
{ 11
:
rJi
=
zT {3, {3
E
RP} and let
r
be the
dimension of w. We want to contrast w to the case where there are no restrictions on 11; that is, we set n
=
R k and consider TJ E fl. In this case the likelihood is a product of
independent binomial densities, and the MLEs of 1ri and J.Li are statistic
2 log .A for testing H :
ji is the MLE of JL for 11
11 E w versus
E w. Thus, from
D (X, ji)
K:
(6.4. 1 1)
Xdmi
and
Xi .
The LR
n - w is denoted by D(y, ji), where
TJ
and
(6.4. 12)
k
2 I )xi log(Xdfli ) + XI log( XI/ flDJ i =l
(6.4. 1 7)
x: mi Xi and fl� mi Mi · D (X, ji) measures the distance between the fit ji based on the model w and the data X. By the multivariate delta method, Theorem 5. 3.4, D(X, fl) has asymptotically a x%-r· distribution for 11 E w as mi --+ oo, i 1, . . . , k < oo-see Problem 6.4. 1 3. As in Section 6. 1 linear subhypotheses are important. If w0 is a q-dimensional linear
where
=
=
subspace of w with q <
K:
11 E w -
r,
then we can form the LR statistic for H : 11 E
w0
versus
wo
2 log .A
= 2 t [xi log ( Ei . ) + XI i =I
/-LOt
log
�
( E, . )]
(6.4. 1 8)
fLo�
ji0 is the MLE of JL under H and M�i mi MOi . In the present case, by Problem 6.4. 1 3, 2 log .,\ has an asymptotic x;-q distribution as mi --+ oo, i 1, . . . , k. Here is a
where
=
-
=
special case.
Example 6.4.1. The Binomial One- Way Layout. Suppose that k treatments are to be tested for their effectiveness by assigning the ith treatment to a sample of mi patients and record ing the number
Xi
of patients that recover, i
pendently and we observe
X1 , . . . , Xk
.
1 , . . , k.
independent with
The samples are collected inde
Xi
rv
B ( 1ri , mi ) ·
For a second
example, suppose we want to compare k different locations with respect to the percentage that have a certain attribute such as the intention to vote for or against a certain proposition. We obtain k independent samples, one from each location, and for the ith location count the number Xi among
mi that has the given attribute.
Section 6 . 5
411
Generalized Linea r Models
This model corresponds to the one-way layout of Section 6.1, and as in that section, an important hypothesis is that the populations are homogenous. Thus, we test H : 1r1 1r 1rk = 1r, 1r E (0, 1 ) , versus the alternative that the 1r's are not all equal. Under 2 H the log likelihood in canonical exponential form is =
{3T - N log(l + exp{{:J}) +
t; log ( �: ) k
where T 1::= 1 Xi , N 2: := 1 mi. and (3 = log [1r / ( 1 - 1r)] . It follows from Theorem 2.3.1 that if 0 < T < N the MLE exists and is the solution of (6.4. 13), where J.L (m 1 1r, . . . , mk7r)T. Using Z as given in the one-way layout in Example 6 . 1 .3, we find that the MLE of 1r under H is 7r TjN. The LR statistic is given by (6.4. 18) with Jioi mi1r. The Pearson statistic k- 1 ( Xi mi "' 1r ) 2 2 X m · 7r 7r ) "'(1 = � i l =
=
_ I:
-
,....
is a Wald statistic and the x2 test is equivalent asymptotically to the LR test (Problem 6.4. 15).
Summary. We used the large sample testing results of Section 6.3 to find tests for impor tant statistical problems involving discrete data. We found that for testing the hypothesis that a multinomial parameter equals a specified value, the Wald and Rao statistics take a form called "Pearson's x2 , which equals the sum of standardized squared distances be tween observed frequencies and expected frequencies under H . When the hypothesis is that the multinomial parameter is in a q-dimensional subset of the k !-dimensional pa rameter space 8, the Rao statistic is again of the Pearson x2 form. In the special case of testing independence of multinomial frequencies representing classifications in a two-way contingency table, the Pearson statistic is shown to have a simple intuitive form. Finally, we considered logistic regression for binary responses in which the logit transformation of the probability of success is modeled to be a linear function of covariates. We derive the likelihood equations, discuss algorithms for computing MLEs, and give the LR test. In the special case of testing equality of k binomial parameters, we give explicitly the MLEs and x2 test. "
-
6.5
G E N E RA L I Z E D L I N EA R M O D E LS
In Sections 6.1 and 6.4.3 we considered experiments in which the mean J.Li of a response Yi is expressed as a function of a linear combination
�i
=
zf {3
p
=
L Zij/3j
j= l
of covariate values. In particular, in the case of a Gaussian response, J.Li �i · In Section 1ri g - 1 ( �i) , where g - 1 (y) is the logistic distribution EYiJ , then /.Li 6.4.3, if J.Li =
=
=
6
412
function. More generally, McCullagh and Neider ( 1 983, 1 989) synthesized a number of previous generalizations of the linear model, most importantly the log linear model devel oped by Goodman and Haberman. See Haberman (1974). The generalized linear model with dispersion depending only on the mean
The data consist of an observation ( Z , Y) where Y = (Yb . . . , Yn ) T is n x 1 and ZJx n = (z1 , . . . , Zn ) with Zi = (zib . . . , Zip ) T nonrandom and Y has density p ( y, 17) given by (6.5 . 1 )
where 17 i s not in £, the natural parameter space o f the n-parameter canonical exponential family (6.5. 1 ) , but in a subset of E obtained by restricting TJi to be of the form
where h is a known function. As we know from Corollary 1.6.1, the mean f.t of Y is related to 11 via f.t = A ( 17) . Typically, A ( 17) = A 0 ( T/i ) for some A 0 , in which case J-Li A� (TJi ) . We assume that there is a one-to-one transform g(f.t) of f.t, called the link function, such that p
g (f.t)
= L /1jZ (j ) j=l
Z{3
where z (j ) (z1j , . . . , Znj ) T is the jth column vector of Z. Note that if A is one-one, f.t determines 17 and thereby Var(Y) A( 17) . Typically, g(f.t) is of the form (g(J-L1 ), , g(J-Ln)) T , in which case g is also called the link function. • • •
Canonical links
The most important case corresponds to the link being canonical; that is, g
11
=
or
p
I:l 111 z<j ) = Zf3.
j=
In this case, the GLM is the canonical subfamily of the original exponential family gener ated by ZTY, which is p X 1. Special cases are: (i)
The linear model with known variance.
These Yi are independent Gaussian with known variance 1 and means J-Li The model is GLM with canonical g(f.t) f.t, the identity. (ii)
Log linear models.
= L:�= l Z.;,3 {3;.
Section 6 . 5
413
Genera l ized L i n e a r Models
Suppose (Y1 , . . . , Yp)T is seen in Example 1.6. 7,
M(n, B l , · · · , Bp), Bj T/j
=
0, 1 :::; j :::;
>
log ej ' 1 :::; j :::;
p,
L::�=l B =
1 . Then, as
p
are canonical parameters. If we take g(/1 ) = (log /-ll , . . . , log /-lp) T , the link is canonical. The models we obtain are called log linear-see Haberman (1 974) for an extensive treat ment. Suppose, for example, that Y = IIYiJ I I I:Si:Sa• 1 :::; j :::; b, so that YiJ is the indicator of, say, classification i on characteristic 1 and j on characteristic 2. Then (} =
IIBij ll , 1 :::; i
:::;
1 :::; j :::; b,
a,
and the log linear model corresponding to
l og eij = /3i + /3j ,
where /3i , f3J are free (unidentifiable parameters), is that of independence
eij
=
ei+e+j
where Bi+ = L::� =l Bii• B+J = L:: �=l BiJ· The log linear label is also attached to models obtained by taking the Yi independent Bernoulli ( Bi), 0 < ei < 1 with canonical link g(B) log [B( 1 - B ) ] . This isjust the logistic linear model of Section 6.4.3. See Haberman ( 1 974) for a further discussion. =
Algorithms If the link is canonical, by Theorem 2 . 3 . 1 , if maximum likelihood estimates they necessarily uniquely satisfy the equation z ry
=
{3 exist,
z rEl3 Y = z r A(Z/3)
or (6.5 .2)
It's interesting to note that (6.5.2) can be interpreted geometrically in somewhat the same way as in the Gaussian linear model: the "residual" vector $\mathbf{Y} - \mu(\hat{\beta})$ is orthogonal to the column space of $\mathbf{Z}$. But, in general, $\mu(\hat{\beta})$ is not a member of that space.

The coordinate ascent algorithm can be used to solve (6.5.2) (or ascertain that no solution exists). With a good starting point $\hat{\beta}_0$ one can achieve faster convergence with the Newton-Raphson algorithm of Section 2.4. In this case, that procedure is just
$$\hat{\beta}_{m+1} = \hat{\beta}_m + [\mathbf{Z}^TW_m\mathbf{Z}]^{-1}\mathbf{Z}^T(\mathbf{Y} - \dot{A}(\mathbf{Z}\hat{\beta}_m)) \qquad (6.5.3)$$
where $W_m = \mathrm{diag}\{A_0''(\mathbf{z}_i^T\hat{\beta}_m)\}$. In this situation, and more generally even for noncanonical links, Newton-Raphson coincides with Fisher's method of scoring described in Problem 6.5.1. If $\hat{\beta}_0 \xrightarrow{P} \beta_0$, the true value of $\beta$, as $n \to \infty$, then with probability tending to 1 the algorithm converges to the MLE if it exists. In this context, the algorithm is also called iterated weighted least squares. This name stems from the following interpretation. Let $\Delta_{m+1} = \hat{\beta}_{m+1} - \hat{\beta}_m$, which satisfies the equation
$$\Delta_{m+1} = [\mathbf{Z}^TW_m\mathbf{Z}]^{-1}\mathbf{Z}^T(\mathbf{Y} - \dot{A}(\mathbf{Z}\hat{\beta}_m)). \qquad (6.5.4)$$
That is, the correction $\Delta_{m+1}$ is given by the weighted least squares formula (2.2.20) when the data are the residuals from the fit at stage $m$, the variance covariance matrix is $W_m$, and the regression is on the columns of $W_m\mathbf{Z}$ (Problem 6.5.2).
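As an illustration of the Newton-Raphson/iterated weighted least squares iteration (6.5.3) for a canonical link, here is a minimal Python sketch for a Poisson log-linear GLM, where coordinatewise $\dot{A}(\eta) = A''(\eta) = e^\eta$; the data are simulated and the setup is illustrative.

```python
import numpy as np

# Poisson log-linear GLM with canonical link: Y_i ~ Poisson(exp(z_i^T beta)).
rng = np.random.default_rng(1)
n = 200
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.8])
Y = rng.poisson(np.exp(Z @ beta_true))

# Iterate beta_{m+1} = beta_m + (Z^T W_m Z)^{-1} Z^T (Y - mu_m), with W_m = diag(mu_m).
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(Z @ beta)
    step = np.linalg.solve(Z.T @ (Z * mu[:, None]), Z.T @ (Y - mu))
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break
print(beta)
```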
Testing in GLM

Testing hypotheses in GLM is done via the LR statistic. As in the linear model we can define the biggest possible GLM $\mathcal{M}$ of the form (6.5.1), for which $p = n$. In that case the MLE of $\mu$ is $\hat{\mu}_{\mathcal{M}} = (Y_1, \ldots, Y_n)^T$ (assume that $\mathbf{Y}$ is in the interior of the convex support of $\{\mathbf{y} : p(\mathbf{y}, \eta) > 0\}$). Write $\eta(\cdot)$ for $\dot{A}^{-1}$. We can think of the test statistic
=
• • •
·
2 log .X. = 2 [Z ( Y , 77 ( Y )) - l (Y , "' ( IL o ) ]
for the hypothesis that IL = /L o within M as a "measure" of (squared) distance between Y and /Lo · This quantity, called the deviance between Y and /Lo , (6.5.5) is always 2:: 0. For the Gaussian linear model with known variance a-5 (Problem 6.5.4). The LR statistic for H D( Y , j1,0)
=:
:
IL E
wo is just
i nf { D( Y , IL )
:
1L E
where j1,0 is the MLE of IL in wo . The LR statistic for H : IL with w 1 :J wo is
wo } E
wo versus K : IL
E
w 1 - wo
D(Y , Ji 0) - D(Y, Ji1 ) where j1,1 is the MLE under w 1 . We can then formally write an analysis of deviance analo gous to the analysis of variance of Section 6.1. If w0 c w1 we can write (6.5.6) a decomposition of the deviance between Y and j1,0 as the sum of two nonnegative com ponents, the deviance of Y to j1, 1 and � (j1,0 , j1, 1 ) = D (Y, j1,0) - D(Y, j1,1 ) , each of which can be thought of as a squared distance between their arguments. Unfortunately � =j=. D generally except in the Gaussian case.
Section 6 . 5
415
General ized Linear Models
Formally if w0 is a GLM of dimension p and w1 of dimension q with canonical links, then �(ji,0 , ji, 1 ) is thought of as being asymptotically x; - q . This can be made precise for stochastic GLMs obtained by conditioning on Z1 , . . . , Zn in the sample (Z1 , Y1 ) , . . . , (Zn , Yn ) from the family with density
p (z, y, {3)
=
h(y) qo (z) exp { (zT{3)y - A0 (zT {3) }.
(6.5.7)
More details are discussed in what follows. Asymptotic theory for estimates and tests
If (Z1 , Y1 ) , . . . , (Zn , Yn ) can be viewed as a sample from a population and the link is canonical, the theory of Sections 6.2 and 6.3 applies straightforwardly in view of the gen eral smoothness properties of canonical exponential families. Thus, if we take Zi as having marginal density qo , which we temporarily assume known, then (Z1 , YI ) , . . . , (Zn , Yn ) has density
p (z, y, {3)
=
TI h(y;)qo (z;) { t, [(zf exp
{3) y; - A0(zf {3) ]
}
.
(6.5.8)
This is not unconditionally an exponential family in view of the A0 ( z f{3) term. However, there are easy conditions under which conditions of Theorems 6.2.2, 6.3.2, and 6.3.3 hold (Problem 6.5.3), so that the MLE {3 is unique, asymptotically exists, is consistent with probability 1, and (6.5 .9)
1 What is J- ({3) ? The efficient score function
a�i logp (z , y , {3) is
(Yi - A o (Zf{3) ) Z f and so,
which, in order to obtain approximate confidence procedures, can be estimated by f :EA.(z{j) where :E is the sample variance matrix of the covariates. For instance, if we assume the covariates in logistic regression with canonical link to be stochastic, we obtain =
If we wish to test hypotheses such as H : /31
=
· ·
·
=
/3d
=
0, d
< p,
we can calculate
i= l
where {3 H is the (p x 1 ) MLE for the GLM with f3�x l ( 0, . . . , 0 , !3d+ I, . . . , /3p) . and can conclude that the statistic of ( 6.5. 10 ) is asymptotically x� under H. Similar conclusions =
___
l_
-�P -
I.I
·' 1.'
416
I nference i n the M u ltipara m eter Case
Chapter 6
follow for the Wald and Rao statistics. Note that these tests can be carried out without knowing the density qo of z . 1 These conclusions remain valid for the usual situation in which the Z i are not random but their proof depends on asymptotic theory for independent nonidentically distributed variables, which we postpone to Volume II. The generalized linear model The GLMs considered so far force the variance of the response to be a function of its mean. An additional "dispersion" parameter can be introduced in some exponential family models by making the function h in (6. 5 . 1 ) depend on an additional scalar parameter 7. It is customary to write the model as, for c( 7) > 0, p ( y, 17 , 7) Because J p ( y , 17 , 7)dy
=
=
exp{ c- 1 ( 7 ) ( 11T y - A ( 17 ) ) }h(y, 7) .
(6.5 . 1 1)
1, then
A(17) jc(7) = log
/ exp{c- 1 (7)17TY} h(y, 7)dy.
(6.5. 1 2)
The left-hand side of (6.5. 10) is of product form A( 17 ) [1 / c ( 7) ] whereas the right-hand side cannot always be put in this form. However, when it can, then it is easy to see that
E(Y)
=
A (17 )
Var( Y ) = c(7) A (17 )
(6.5. 13) (6.5. 1 4)
so that the variance can be written as the product of a function of the mean and a general dispersion parameter. Important special cases are the N ( J-L, 0" 2 ) and gamma (p , .A) families. For further dis cussion of this generalization see McCullagp. and Neider (1983, 1989). General link functions Links other than the canonical one can be of interest. For instance, if in the binary data regression model of Section 6.4.3, we take g ( J-L) cp - 1 (J-L) so that =
'lri =
we obtain the so-called probit model. Cox ( 1 970) considers the variance stabilizing trans formation
which makes asymptotic analysis equivalent to that in the standard Gaussian linear model. As he points out, the results of analyses with these various transformations over the range . 1 :::::; J-L :::::; .9 are rather similar. From the point of analysis for fixed n, noncanonical links can cause numerical problems because the models are now curved rather than canonical ex ponential families. Existence of MLEs and convergence of algorithm questions all become more difficult and so canonical links tend to be preferred.
417
Models
Summary. We considered generalized linear models defined as a canonical exponential model where the mean vector of the vector Y of responses can be written as a function,
E/3j z(j ) , where the z(j ) are f3 is a vector of regression coefficients. We considered the
called the link function, of a linear predictor of the form observable covariate vectors and
canonical link function that corresponds to the model in which the canonical exponential model parameter equals the linear predictor. We discussed algorithms for computing MLEs of
/3.
In the random design case, we use the asymptotic results of the previous sections to
develop large sample estimation results, confidence procedures, and tests.
6.6
ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS
As we most recently indicated in Example
6.2.2,
the distributional and implicit structural
assumptions of parametric models are often suspect. In Example
6.2.2,
we studied what
procedures would be appropriate if the linearity of the linear model held but the error dis
j0, which 0, the resulting MLEs for f3 optimal under fo continue to estimate f3 as defined by (6. 2.19) even if the true errors are N(O, a 2 ) or, in fact, had any distribution symmetric around 0 (Problem 6.2.5). That is, roughly speaking, if we consider the semi parametric model, pl {P : (zr, Y) p given by (6.2. 19) with Ei i.i.d. with density f for some f symmetric about 0}, then the LSE /3 of f3 is, under further mild conditions on P, still a consistent asymptotically normal estimate of f3 and, in fact, so is any estimate solving the equations based on (6.2.31) with fo symmetric about 0. There is another semi parametric model p2 {P : (zr, Y)T p that satisfies Ep (Y I Z) zr(3, Epzrz nonsingular, Ep Y2 < oo} . For this model it turns out that the LSE is optimal in a sense to
tribution failed to be Gaussian. We found that if we assume the error distribution is symmetric about
rv
=
rv
=
be discussed in Volume II. Furthermore, if P3 is the nonparametric model where we assume
(Z, Y) has a joint distribution and if we are interested in estimating the best linear Y given Z, the right thing to do if one assumes the (Zi , Yi) are i.i .d., is "act as if the model were the one given in Example 6. 1.2." Of course, for estimating /-1-L(Z) only that
predictor 1-1-L (Z) of
in a submodel o f P3 with
Y where c is independent of is not the best estimate of
=
Z but a0 (Z)
(3.
zr (3 + ao(Z)c is not constant and
(6.6.1)
a0
is assumed known, the LSE
These issues, whose further discussion we postpone to Vol
ume II, have a fixed covariate exact counterpart, the Gauss-Markov theorem, are discussed below. Another even more important set of questions having to do with selection between nested models of different dimension are touched on in Problem
6.6.8 but otherwise
also
postponed to Volume II.
___ _
__j
418
I n ference in the M u ltipara meter Case
Chapter 6
Robustness in Estimation We drop the assumption that the errors
E1 , . . . , En in the linear model (6. 1 .3) are normal.
Instead we assume the Gauss-Markov linear model where
(6.6.2) where Z is an n of the estimates
x
p matrix of constants, {3 is p
x
1, and Y ,
€
are n
x
1. The optimality
Jii, fji of Section 6 . 1 . 1 and the LSE in general still holds when they are
compared to other linear estimates.
Theorem 6.6.1. Suppose the Gauss-Markov linear model (6.6.2) holds. Then, for any parameter of the form
0:
=
2:: � 1 aiJ-Li for some constants a1 ' . . . ' an,
the estimate
a Y1 ,
=
2:: � 1 aiJii . . . , Yn.
has uniformly minimum variance among all unbiased estimates linear in
Proof.
Because
Cov ( Yi ,
Yj )
and a i s unbiased. E( Ei ) 0 , E(a) 2:: � 1 aiJ-Li Ei , Ej ) , and by (B .5 .6), n n VarcM (a) L: a; Var (Ei ) + 2 L aiaj Cov (Ei , EJ ) a2 L: a; i 1 i=1 i <j =
=
=
=
o: ,
Moreover,
Cov (
=
=
=
where VarcM refers to the variance computed under the Gauss-Markov assumptions. Let
a stand for any estimate linear in VarcM (a)
=
Y1 , . . . , Yn .
The preceding computation shows that
Varc (a) , where Varc (a) stands for the variance computed under the Gaus
sian assumption that
E1 , . . . , En are i.i.d.
N(O, a2).
By Theorems 6 . 1 . 2(iv) and 6 . 1 .3(iv),
Varc (a) ::; Varc (a) for all unbiased a. Because Yare
=
VarcM for all linear estimators, 0
the result follows.
Note that the preceding result and proof are similar to Theorem 1 .4.4 where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case. In fact, in Example 6 . 1 .2, our current Ji coincides with the empirical plug-in estimate of the optimal linear predictor ( 1 .4. 14) . See Problem 6 . 1 .3. Many of the properties stated for the Gaussian case carry over to the Gauss-Markov case:
Proposition 6.6.1. If we replace the Gaussian assumptions on the errors E1 , , En with the Gauss-Markov assumptions, the conclusions ( 1 ) and (2) of Theorem 6 . 1 .4 are still . • .
jj and j], are still unbiased and Var(Ji) Var (jj) a2 (zrz ) - 1 •
valid; moreover, ifp
= r,
=
a2H, Var(€)
=
a2 (I H), and -
=
Example 6.6.1. One Sample (continued). In Example 6. 1 . 1 ,
J-L
=
{31 , and Ji
=
fj1
=
Y.
The Gauss-Markov theorem shows that Y, in addition to being UMVU in the normal case, is UMVU in the class of linear estimates for all models with
EY/
<
oo .
However, for
419 n
large,
Y has a larger variance (Problem 6.6.5) than the nonlinear estimate
median when the density of Y is the Laplace density
1 2A
exp{ -AIY - 11 1 } ,
For this density and a l l symmetric densities,
y E
R, J1 E R,
Y
sample
A > 0.
Y is unbiased (Problem 3.4.12).
Remark 6.6.1. As seen in Example 6 . 6 . 1 , a major weakness of the Gauss-Markov theorem is that it only applies to linear estimates. Another weakness is that it only applies to the homoscedastic case where Var(}i ) is the same for all i. Suppose we have a heteroscedastic version of the linear model where
E( E )
0, but Var ( Ei )
a
} depends on i.
If the
} are
a
known, we can use the Gauss-Markov theorem to conclude that the weighted least squares estimates of Section
2.2 are UMVU. However, when the at are unknown, our estimates are
not optimal even in the class of linear estimates.
Robustness of Tests In Section
5 . 3 we investigated the robustness of the significance levels of t tests for
the one- and two-sample problems using asymptotic and Monte Carlo methods. Now we will use asymptotic methods to investigate the robustness of levels more generally. The 6-method implies that if the observations
Xi
come from a distribution
P,
which does not
belong to the model, the asymptotic behavior of the LR, Wald and Rao tests depends criti cally on the asymptotic behavior of the underlying MLEs From the theory developed in Section true we expect that
iin
8 and 80.
6.2 we know that if '11 ( - ,
8)
=
Dl( - , 8)
and P is
8(P) , which is the unique solution of
I w (x, 8)dP(x)
(6.6.3)
0
and
.;ii (e - 8 ( P) ) � N( O, �( w , P) ) :E given by ( 6.2.6 ) . Thus, for instance, consider H :
(6.6.4)
8 80 and the Wald test statis tics Tw n ( ii 80 )T !(80 ) (9 80) . If 8(P) :::/= 80 evidently we have Tw oo . B ut if 8 (P) 80 , then Tw � vr J(80)V where V "' N(o, � ('11 , P) ) . Because � ('II , P ) :::/= J- 1 ( 8 0 ) i n general, the asymptotic distribution of Tw i s not x;. This
with
=
=
=
observation holds for the LR and Rao tests as well-but see Problem
6.6.4. There is an
important special case where all i s well: the linear model we have discussed in Example
6 . 2 . 2. Example 6.6.2. The Linear Model with Stochastic Covariates. Suppose ( Z 1 , Yl ) , . . . , (Zn , Yn ) are i.i.d. as (Z, Y) where Z is a (p x 1 ) vector of random covariates and we model the relationship between
Z and Y as
(6.6.5)
6
420
with the distribution P of (Z , Y) such that E and Z are independent, EpE 0, EpE 2 < oo , and we consider, say, H : (3 0. Then, when E is N(O, u2 ) , (see Examples 6.2.1 and 6.2.2) the LR, Wald, and Rao tests all are equivalent to the F test: Reject if =
Tn where {3 is the LSE, Z( n) s2
_ =
"" T
2 (3 Z T (n) Z (n) f3/ s 2 Jp , n -p ( l --.
- a)
(6.6.6)
(Zb . . , Z n ) T and .
1 � 2 � ( Yi - Yi ) n - p i =l
1
""'
--
=
p
n
--
2
"' IY - Z (n)f3 i .
Now, even if E is not Gaussian, it is still true that if w is given by (6. 2.30) then (3 (P) specified by (6.6.3) equals (3 in (6.6.5) and u2 ( P ) Varp ( E ) . For instance, =
Thus, it is still true by Theorem 6.2.1 that (6.6.7) and (6.6.8) Moreover, by the law of large numbers, 1 n - zr (n) z ( n)
�� � z i z! n i=1
(6.6.9)
1.
so that the confidence procedures based on approximating Var( ,B) by n - 1 z'[n ) Z ( n) s2 / Vii are still asymptotically of correct level. It follows by Slutsky's theo rem that, under H : (3 = 0,
where Wpxp is Gaussian with mean 0 and, hence, the limiting distribution of is x;. Because fp,n-p ( 1 a ) -+ xp ( 1 a ) by Example 5.3.7, the test (6.6.5) has asymptotic level a even if the errors are not Gaussian. This kind of robustness holds for H : (Jq + 1 f3o, q+ 1 , , (Jp f3o,p or more generally (3 E £0 + (30, a q-dimensional affine subspace of RP. It is intimately linked to the fact that even though the parametric model on which the test was based is false, •
. .
(i) the set of P satisfying the hypothesis remains the same and (ii) the (asymptotic) variance of (3 is the same as under the Gaussian model.
If the first of these conditions fails, then it's not clear what H : 8 80 means anymore. If it holds but the second fails, the theory goes wrong. We illustrate with two final examples. =
The Two-Sample Scale Problem. Suppose our model is that X 1 2 (X( ) , X( ) ) where X( 1 ) , x < 2 ) are independent E E respectively, the lifetimes of paired pieces of equipment. If we take H : A 1 A2 the standard Wald test (6.3. 17) is 2 ..... (0 1 02 ) > "Reject iff n (6.6. 10) - x 1 ( 1 - a) " a2 where
Example 6.6.3.
(!;), ( j;)
.....
< _!_n � xz j ) L i=1
( 6.6. 1 1 )
and a=
!2n i: < xP ) i=1
+
v
x? ) ) .
(6.6. 1 2)
The conditions of Theorem 6.3.2 clearly hold. However suppose now that X(l) /0 1 and X( 2 ) /02 are identically distributed but not exponential. Then H is still meaningful, X(l) and X( 2 ) are identically distributed. But the test (6.6.10) does not have asymptotic level a in general. To see this, note that, under H, (6.6. 1 3 )
but
J2Ep ( X ( 1 ) ) =f.
a
V2 VarpX{ 1 )
in general. It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general. Simply replace <7 2 by
�2
u
=
_!_ � n� z=1
{(
��) - ( .X < 1 ) Xz
+
2
) (X �2) - (.X (l)
.X< 2 ) ) 2
+
z
+
2
)
.X <2 ) ) 2
}
.
(6.6. 1 4) D
Example 6.6.4. The Linear Model with Stochastic Covariates with E and Z Dependent. Suppose E(E I Z) = 0 so that the parameters {3 of (6.6.5) are still meaningful but E and Z
are dependent. For simplicity, let the distribution of E given Z = z be that of a (z ) E ' where E1 is independent of Z. That is, we assume variances are heteroscedastic. Suppose without loss of generality that Var( E1 ) 1. Then ( 6.6. 7) fails and in fact by Theorem 6.2.1 =
Vii(� where
!3)
-+
N(o, Q)
(6.6. 1 5)
422
I n ference in the M u ltipara meter Case
Chapter 6
(Problem 6.6.6) and, in general, the test (6.6.6) does not have the correct level. Special ize to the case Z (1, h A1 , . . . , Id - Ad )T where (!1 , . . . , Id) has a multinomial ( A 1 . . . , Ad ) distribution, 0 < Aj < 1, 1 ::; j ::; d. This is the stochastic version of the d-sample model of Example 6 . 1 .3. It is easy to see that our tests of H : {32 f3d 0 = fail to have correct asymptotic levels in general unless a2 ( Z) is constant or A 1 1 Ad 1/ d (Problem 6.6.6). A solution is to replace Z(n ) Z(n) / s2 in (6.6.6) by Q- , where =
-
,
=
·
· ·
=
=
=
·
· ·
=
Q is a consistent estimate of Q . For d = 2 above this is just the asymptotic solution of the Behrens-Fisher problem discussed in Section 4.9.4, the two-sample problem with unequal 0 sample sizes and variances.
To summarize: If hypotheses remain meaningful when the model is false then, in general, LR, Wald, and Rao tests need to be modified to continue to be valid asymptotically. The simplest solution, at least for Wald tests, is to use as an estimate of $[\mathrm{Var}\,\sqrt{n}\,\hat{\theta}]^{-1}$ not $I(\hat{\theta})$ or $-\frac{1}{n}D^2l_n(\hat{\theta})$ but the so-called "sandwich estimate" (Huber, 1967),
$$\left[\frac{1}{n}D^2l_n(\hat{\theta})\right]\left[\frac{1}{n}\sum_{i=1}^{n} Dl(X_i, \hat{\theta})\,Dl(X_i, \hat{\theta})^T\right]^{-1}\left[\frac{1}{n}D^2l_n(\hat{\theta})\right]. \qquad (6.6.16)$$
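A minimal Python sketch of the sandwich idea in a familiar special case: heteroscedasticity-robust standard errors for least squares in the linear model, compared with the model-based (homoscedastic) ones. The data are simulated and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma = 0.5 + np.abs(Z[:, 1])                   # error scale depends on the covariate
Y = Z @ np.array([1.0, 2.0]) + sigma * rng.normal(size=n)

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
resid = Y - Z @ beta_hat

A = Z.T @ Z / n                                 # "bread": the Hessian term, up to sign/scale
B = (Z * (resid ** 2)[:, None]).T @ Z / n       # "meat": (1/n) sum of score outer products
sandwich_cov = np.linalg.inv(A) @ B @ np.linalg.inv(A) / n
naive_cov = np.linalg.inv(Z.T @ Z) * (resid @ resid / (n - 2))

print(np.sqrt(np.diag(sandwich_cov)))           # robust standard errors
print(np.sqrt(np.diag(naive_cov)))              # model-based standard errors
```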
Summary. We considered the behavior of estimates and tests when the model that gen erated them does not hold. The Gauss-Markov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i.i.d. N(O, a2 ) assumption on the errors is replaced by the assumption that the errors have mean zero, identical vari ances, and are uncorrelated, provided we restrict the class of estimates to linear functions of Y1 , . . . , Yn . In the linear model with a random design matrix, we showed that the MLEs and tests generated by the model where the errors are i.i.d. N(O, a2 ) are still reasonable when the true error distribution is not Gaussian. In particular, the confidence procedures derived from the normal model of Section 6 . 1 are still approximately valid as are the LR, Wald, and Rao tests. We also demonstrated that when either the hypothesis H or the vari ance of the MLE is not preserved when going to the wider model, then the MLE and LR procedures for a specific model will fail asymptotically in the wider model. In this case, the methods need adjustment, and we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model.
6.7
PROBLEMS AND COMPLEMENTS
Problems for Section 6.1 1. Show that in the canonical exponential model ( 6. 1 . 1 1 ) with both (i) the MLE does not exist if n = r , (ii) if n 2: r + 1, then the MLE
( TJ1 , . . . , TJr , a2 ) T I. S
n ( U1 , . . . , Ur , - 1 "' L... i =r+ n
I
U2i ) T
.
11
and a2 unknown,
(7]1 , . . . , 77n a2 )T of . ---2 - - 1 1Y J-L --- � 2 In partiCU 1ar, a n
-
•
2. For the canonical linear Gaussian model with a2 unknown, use Theorem 3.4.3 to com
pute the information lower bound on the variance of an unbiased estimator of a2 • Compare this bound to the variance of s2 •
Section 6.7
Hint:
423
Problems and Complements
By A. 1 3 .22, Var(U[ )
=
2 o- 4 .
3. Show that fi of Example 6. 1 .2 with p = r coincides with the empirical plug-in estimate of J.lL (J-LLb (Zi2 , , Zip ) T, /-LY /31 ; , /-LLn)T, where JtLi = J-LY + (zi -J-Lz)/3, z; =
=
· · ·
· · .
=
and f3 and J-Lz are as defined in ( 1 .4. 14) . Here the empirical plug-in estimate is based on i.i.d . (Zi , Y1 ) , . . . , (Z� , Yn) where Z i = (Zi2 , . . . , Zip)T . 4. Let Yi denote the response of a subject at time i, i
= 1 , . . . , n.
Suppose that Yi satisfies
the following model Yi
=
B + Ei , i
=
1, . . . , n
where Ei can be written as Ei cei- l + ei for given constant c satisfying 0 � c � 1 , and the ei are independent identically distributed with mean zero and variance o-2, i 1 , . . . , n; e0 0 (the Ei are called moving average errors, see Problem 2.2.29). Let =
=
where
a · = � - c)i L.) i =O 1
( 1 - ( - c)i ) 2 ( 1 - 1( -+cc)j+ l )/ � L 1+c i= l
(a) Show that B is the weighted least squares estimate of B. (b) Show that if ei
rv
N(O, o-2 ), then B is the MLE of B.
(c) Show that Y and B are unbiased.
(d) Show that Var( O) � Var ( Y ).
(e) Show that Var (B) < Var ( Y ) unless
c
=
0.
5. Show that >:(Y) defined in Remark 6 . 1 . 2 coincides with the likelihood ratio statistic
A (Y) for the o-2 known case with o-2 replaced by 0:2.
6. Consider the model (see Example 1 . 1 .5) Yi
B + ei, i
where ei = cei- l + Ei, i = 1 , . . . , n, t:o 0, c are i.i.d. N(O, o-2 ) . Find the MLE B of B.
=
E
1, . . . , n (0, 1] is a known constant and t: 1 , . . . , En
7. Derive the formula (6. 1.28) for the noncentrality parameter B2 in the regression example. 8. Derive the formula (6. 1 . 29) for the noncentrality parameter 82 in the one-way layout. 9. Show that in the regression example with interval for /31 in the Gaussian linear model is
p
=
r
=
2, the 100(1
z. •
-
)• ] .
a ) % confidence
424
Inference in the M u lti para meter Case
10. Show that if p
=
r
=
2 in Example 6.1.2, then the hat matrix H
=
C h a pter 6
( hij) is given by
1 + (zi2 Z.2) (zj2 - Z.2) -__:___-"----:--=__:_ __;;_ n -=----= 2.:(zi2 - z.2) 2 1 1. Show that for the estimates
(a) Var ( a)
;� 2.:::� =1 nk
=
a and J'k in the one-way layout Var ( 8k )
=
( (p�:)2 + Lk=;fi ;k ) .
(b) If n is fixed and divisible by p, then Var( a) is minimized by choosing ni
ni
n is nl2, n2
(c) If =
fixed and divisible by np nl2(p -
=
(d) Give the
·
·
·
=
2(p - 1), 1) .
c =
nIp.
then Var(J'k) is minimized by choosing
100(1 - a)% confidence intervals for a and 6k.
, Yn be a sample from a population with mean 11 and variance a 2 , where n is 1 even. Consider the three estimates T1 Y, T2 ( 1 I 2n) 2.::: 1:1 Yi + ( 3 I 2n) 2.::: �= � n+ 1 }i, 12. Let Y1,
• • .
=
1 - - 2). and T3 = 2 (Y
(a) Why can you conclude that T1 has a smaller MSE (mean square error) than T2 ? (b) Which estimate has the smallest MSE for estimating 0
=
� (/1 2) ?
13. In the one-way layout,
(a) Show that level ( 1 - a) confidence intervals for linear functions of the form {3j - {3;. are given by
and that a level ( 1
a)
confidence interval for a2 is given by
(n - p) s2 lxn -p ( 1 'lj;
(b) Find confidence intervals for
� ( � + tJ3 )
� a ) ::; a2 ::; (n - p) s2 lx n-p (� a ) . 'lj;
t31 ·
14. Assume the linear regression model with p future observation Y to be taken at the pont z .
(a) Find a level 1 + fJ2 z . fJ
(1 - a)
r
=
2. We want to predict the value of a
confidence interval for the best MSPE predictor E(Y)
=
(b) Find a level (1 a) prediction interval for Y (i.e., statistics t(Y1 , . . . , Yn) . t(Yh . . . , Yn ) such that P[t_ ::; Y ::; � 1 - a). Note that Y is independent of Y1 , . . . , Yn .
15. Often a treatment that is beneficial in small doses is harmful in large doses. The following model is useful in such situations. Consider a covariate x , which is the amount
Section 6. 7
425
Probtems a nd Lornol;emEmts
· or dose of a treatment, and a response variable Y, which is yield or production. Suppose a good fit is obtained by the equation
where Yi is observed yield for dose xi . Assume the model log }i = !31 + f32 x i + f33 log xi + Ei , i =
1, . . . , n
where E1 , . . . , En are independent N(O, a2). (a) For the following data (from Hald, 1 952, p. 653), compute {i1 , {i , {i3 and level 0.95 2 confidence intervals for (31 , !3 , (33 . 2 (b) Plot (Xi , Yi) and (Xi , fli) where Yi = e131 e132xi x f3 • Find the value of x that maxi mizes the estimated yield fj = e131 e132xx133 .
3 .77
x (nitrogen) y (yield)
1 67.5
Hint: Do the regression for /-Li = (31 + f32zil + f32zil + f33zi2 where zn = Xi - x, Zi = log xi - * I::1 log xi . You may use x = 0.0289. 2 16. Show that if C is an n x r matrix of rank r, r :=:; n, then the r x r matrix C'C is of rank r and, hence, nonsingular. Hint: Because C' is of rank r, xC' = 0 => x = 0 for any r-vector x. But xC'C = o => ll xC' II 2 = xC'Cx' = o => xC' = o .
17. In the Gaussian linear model show that the parametrization ({3, a2 ) T is identifiable if and only if r = p. Problems for Section 6.2
1. Check AO, . . . ,A6 when 8 log p(x, 8) , and Q = P.
=
(M, a2), P
2. Let 8, P, and p be as in Problem
6.2.1
densities of the form
=
{N({L, a2)
:
1-L E
R, a2
>
0}, p ( x, 8)
and let Q be the class of distributions with
( 1 - E )
a2 < 72 , E :=:;
1 2
where
1
1
=
2
>
0.
(b) Show that the MLE of (M, 7 2 ) does not exist so that A6 doesn't hold.
426
Inference in the Multiparameter Case
(c) Construct a method of moment estimate Bu of 8 moments which are fo consistent.
I I '
=
Chapter 6
(Jt. T2) based on the first two
�
(d) Deduce from Problem 6.2.10 that the estimate (} derived as the limit of NewtonRaphson estimates from 8n is efficient. 3. In Example 6.2.1, show that EZtZ; � O, i > !.
([EzzT]- 1 )(1. 1 1 > [Ezfj- t with equality if and only if
4. In Example 6.2.2, show that the assumptions of Theorem 6.2.2 hold if (i) and (ii) hold. 5. In Example 6.2.2, show that
fo is logistic.
c(fo) = uo/a is 1 if fo is normal and is different from 1 if
6. (a) In Example 6.2.1, derive the MLEs of β, μ, and σ².
Hint: f_X(x) = f_{Y|Z}(y) f_Z(z).
(b) Suppose that the distribution of Z is not known, so that the model is semiparametric: X ~ P₍θ,H₎, P = {P₍θ,H₎ : θ ∈ Θ, H ∈ 𝓗}, Θ Euclidean, 𝓗 abstract. In some cases it is possible to find T(X) such that the distribution of X given T(X) = t is Q_θ, which doesn't depend on H ∈ 𝓗. The MLE of θ based on (X, t) is then called a conditional MLE. Show that if we identify X = (Z⁽ⁿ⁾, Y), T(X) = Z⁽ⁿ⁾, then (β̂, μ̂, σ̂²) are conditional MLEs.
Hint: (a), (b) The MLE minimizes σ⁻²|Y − Z⁽ⁿ⁾β|², up to terms not involving β.
7. Fill in the details of the proof of Theorem 6.2.1.
8. Establish (6.2.24) directly as follows:
(a) Show that if Z̄ₙ = n⁻¹ Σᵢ₌₁ⁿ Zᵢ, then, given Z⁽ⁿ⁾, √n(μ̂ − μ, (β̂ − β)ᵀ)ᵀ has a multivariate normal distribution with mean 0 and variance matrix
σ² diag(1, n[Z⁽ⁿ⁾ᵀ Z⁽ⁿ⁾]⁻¹),
and that σ̂² is independent of the preceding vector with nσ̂²/σ² having a χ²_{n−p} distribution.
(b) Apply the law of large numbers to conclude that
n⁻¹ Z⁽ⁿ⁾ᵀ Z⁽ⁿ⁾ →ᴾ E(ZZᵀ).
(c) Apply Slutsky's theorem to conclude that the same limit law holds unconditionally and, hence, that
(d) (β̂ − β)ᵀ Z⁽ⁿ⁾ᵀ Z⁽ⁿ⁾ (β̂ − β) = oₚ(n⁻¹/²).
(e) Show that σ̂² is unconditionally independent of (μ̂, β̂).
(f) Combine (a)-(e) to establish (6.2.24).
9. Let Y₁, ..., Yₙ be real, independent, identically distributed,
Yᵢ = μ + σεᵢ,
where μ ∈ R, σ > 0 are unknown and ε has known density f > 0 such that if ρ(x) ≡ −log f(x), then ρ'' > 0 and, hence, ρ is strictly convex. Examples are f Gaussian, and f(x) = e⁻ˣ(1 + e⁻ˣ)⁻² (logistic).
(a) Show that if σ = σ₀ is assumed known, a unique MLE for μ exists and uniquely solves
(1/n) Σᵢ₌₁ⁿ ρ'((Yᵢ − μ)/σ₀) = 0.
(b) Write θ₁ = μ/σ, θ₂ = 1/σ. Show that if θ₂ = θ₂⁰ is assumed known, a unique MLE for θ₁ exists and uniquely solves the corresponding likelihood equation.
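As a numerical illustration of Problem 9(a), here is a minimal Python sketch for the logistic case. With f logistic, ρ'(x) = tanh(x/2), so the likelihood equation is monotone in μ and has a unique root; the data and function names below are ours.

```python
# Sketch for Problem 9(a): with sigma = sigma0 known and logistic errors,
# rho'(x) = tanh(x/2), so the MLE of mu uniquely solves
# sum_i tanh((y_i - mu) / (2*sigma0)) = 0, which is strictly decreasing in mu.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
sigma0 = 1.0
y = rng.logistic(loc=2.0, scale=sigma0, size=200)   # simulated sample

def score(mu):
    return np.sum(np.tanh((y - mu) / (2.0 * sigma0)))

mu_hat = brentq(score, y.min(), y.max())             # unique root = MLE
print(mu_hat)
```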
10. Suppose A0-A4 hold and θ̄ₙ is √n consistent; that is, θ̄ₙ = θ₀ + Oₚ(n⁻¹/²).
(a) Let θ̂ₙ be the first iterate of the Newton-Raphson algorithm for solving (6.2.1) starting at θ̄ₙ:
θ̂ₙ = θ̄ₙ − [n⁻¹ Σᵢ₌₁ⁿ Dψ(Xᵢ, θ̄ₙ)]⁻¹ n⁻¹ Σᵢ₌₁ⁿ ψ(Xᵢ, θ̄ₙ).
Show that θ̂ₙ satisfies (6.2.3).
Hint:
n⁻¹ Σᵢ₌₁ⁿ ψ(Xᵢ, θ̄ₙ) = n⁻¹ Σᵢ₌₁ⁿ ψ(Xᵢ, θ₀) − [n⁻¹ Σᵢ₌₁ⁿ Dψ(Xᵢ, θ̄ₙ) + oₚ(1)](θ̄ₙ − θ₀).
(b) Show that under A0-A4 there exists ε > 0 such that, with probability tending to 1, n⁻¹ Σᵢ₌₁ⁿ ψ(Xᵢ, θ) has a unique 0 in S(θ₀, ε), the ε ball about θ₀.
Hint: You may use a uniform version of the inverse function theorem: If gₙ : Rᵈ → Rᵈ are such that
(i) sup{|Dgₙ(θ) − Dg(θ)| : |θ − θ₀| ≤ ε} → 0,
(ii) gₙ(θ₀) → g(θ₀),
(iii) Dg(θ₀) is nonsingular,
(iv) Dg(θ) is continuous at θ₀,
then, for n sufficiently large, there exist δ > 0, t > 0 such that the gₙ are 1-1 on S(θ₀, δ) and their image contains a ball S(g(θ₀), t).
(c) Conclude that, with probability tending to 1, iteration of the Newton-Raphson algorithm starting at θ̄ₙ converges to the unique root θ̂ₙ described in (b) and that θ̂ₙ satisfies (6.2.3).
Hint: You may use the fact that if the initial value of Newton-Raphson is close enough to a unique solution, then it converges to that solution.
11. Establish (6.2.26) and (6.2.27).
Hint: Write zᵢᵀβ = (Zᵢ₁ − Ẑᵢ⁽¹⁾)θ₁ + Σⱼ₌₂ᵖ (δⱼ + cⱼθ₁)Zᵢⱼ, where Ẑᵢ⁽¹⁾ = Σⱼ₌₂ᵖ cⱼZᵢⱼ and the cⱼ do not depend on β. Thus, minimizing Σᵢ₌₁ⁿ (Yᵢ − zᵢᵀβ)² over all β is the same as minimizing
Σᵢ₌₁ⁿ (Yᵢ − (Zᵢ₁ − Ẑᵢ⁽¹⁾)θ₁ − Σⱼ₌₂ᵖ δⱼZᵢⱼ)².
Differentiate with respect to β₁. Similarly compute the information matrix when the model is written as
Yᵢ = β₁(Zᵢ₁ − π(Zᵢ₁ | Zᵢ₂, ..., Zᵢₚ)) + Σⱼ₌₂ᵖ γⱼZᵢⱼ + εᵢ,
where β₁, γ₂, ..., γₚ range freely and the εᵢ are i.i.d. N(0, σ²).
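To make the one-step construction of Problem 10(a) concrete, here is a minimal Python sketch for the logistic location model of Problem 9, taking the sample median as the √n-consistent starting value; the model choice and names are ours, not part of the problem.

```python
# Sketch for Problem 10(a): one Newton-Raphson step from a root-n consistent start.
# Logistic location model: psi(x, mu) = tanh((x - mu)/2),
# Dpsi(x, mu) = -sech^2((x - mu)/2) / 2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.logistic(loc=0.5, size=500)

mu_bar = np.median(x)                          # root-n consistent starting value
psi = np.tanh((x - mu_bar) / 2.0)
dpsi = -0.5 / np.cosh((x - mu_bar) / 2.0) ** 2

# theta_hat = theta_bar - [n^{-1} sum Dpsi]^{-1} n^{-1} sum psi
mu_hat = mu_bar - np.mean(psi) / np.mean(dpsi)
print(mu_bar, mu_hat)
```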
Problems for Section 6.3
1. Suppose responses Y₁, ..., Yₙ are independent Poisson variables, Yᵢ ~ P(λᵢ), with
log λᵢ = θ₁ + θ₂zᵢ
for given covariate values 0 < z₁ < ··· < zₙ. Find the asymptotic likelihood ratio, Wald, and Rao tests for testing H : θ₂ = 0 versus K : θ₂ ≠ 0.
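As an illustration of Problem 1, here is a minimal Python sketch of the Wald test of θ₂ = 0 in the Poisson log-linear model, with the MLE computed by Newton-Raphson (which here coincides with Fisher scoring). The data are simulated under H and the names are ours.

```python
# Sketch for Problem 1: Wald test of H: theta2 = 0 in the Poisson model
# log(lambda_i) = theta1 + theta2 * z_i, MLE by Newton-Raphson.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
z = np.linspace(0.1, 2.0, 40)
y = rng.poisson(np.exp(0.5 + 0.0 * z))            # data generated under H

X = np.column_stack([np.ones_like(z), z])
theta = np.zeros(2)
for _ in range(25):
    lam = np.exp(X @ theta)
    score = X.T @ (y - lam)                       # score vector
    info = X.T @ (X * lam[:, None])               # Fisher information
    theta = theta + np.linalg.solve(info, score)

lam = np.exp(X @ theta)
var22 = np.linalg.inv(X.T @ (X * lam[:, None]))[1, 1]
wald = theta[1] / np.sqrt(var22)
print(theta, wald, 2 * norm.sf(abs(wald)))        # statistic and p-value
```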
2. Suppose that ω₀ is given by (6.3.12) and the assumptions of Theorem 6.3.2 hold for p(x, θ), θ ∈ Θ. Reparametrize by η(θ) = Σⱼ₌₁ʳ ηⱼ(θ)vⱼ, where ηⱼ(θ) = θᵀvⱼ, {vⱼ} are orthonormal, and q(·, η) = p(·, θ) for η ∈ Ξ with Ξ = {η(θ) : θ ∈ Θ}. Show that if Ξ₀ = {η ∈ Ξ : ηⱼ = 0, q + 1 ≤ j ≤ r}, then λ(X) for the original testing problem is given by
λ(X) = sup{q(X, η) : η ∈ Ξ} / sup{q(X, η) : η ∈ Ξ₀}
and, hence, that Theorem 6.3.2 holds for Θ₀ as given in (6.3.12).
3. Suppose that θ₀ ∈ Θ₀ and the conditions of Theorem 6.3.3 hold. There exist an open ball about θ₀, S(θ₀) ⊂ Θ, and a map η : S(θ₀) → Rʳ which is continuously differentiable such that
(i) ηⱼ(θ) = gⱼ(θ) on S(θ₀), q + 1 ≤ j ≤ r, and, hence,
S(θ₀) ∩ Θ₀ = {θ ∈ S(θ₀) : ηⱼ(θ) = 0, q + 1 ≤ j ≤ r};
(ii) η is 1-1 on S(θ₀) and Dη(θ) is a nonsingular r × r matrix for all θ ∈ S(θ₀).
(Adjoin to g_{q+1}, ..., g_r the functions a₁ᵀθ, ..., a_qᵀθ, where a₁, ..., a_q are orthogonal to the linear span of Dg_{q+1}(θ₀), ..., Dg_r(θ₀).)
Show that if we reparametrize {P_θ : θ ∈ S(θ₀)} by q(·, η(θ)) = p(·, θ), where q(·, η) is uniquely defined on Ξ = {η(θ) : θ ∈ S(θ₀)}, then q(·, η), η̂ₙ = η(θ̂ₙ), and η̂₀,ₙ = η(θ̂₀,ₙ) satisfy the conditions of Problem 6.3.2. Deduce that Theorem 6.3.3 is valid.
4. Testing Simple versus Simple. Let Θ = {θ₀, θ₁}, X₁, ..., Xₙ i.i.d. with density p(·, θ). Consider testing H : θ = θ₀ versus K : θ = θ₁. Assume that P_{θ₁} ≠ P_{θ₀} and that appropriate moment conditions (i) and (ii) on l(X₁, θ₁) − l(X₁, θ₀) hold for some δ > 0.
(a) Let λ(X₁, ..., Xₙ) be the likelihood ratio statistic. Show that under H, 2 log λ(X₁, ..., Xₙ) →ᴾ 0.
(b) Show that asymptotically the critical value of the most powerful (Neyman-Pearson) test with Tₙ = Σᵢ₌₁ⁿ (l(Xᵢ, θ₁) − l(Xᵢ, θ₀)) is −nK(θ₀, θ₁) + z_{1−α} √n σ(θ₀, θ₁), where K(θ₀, θ₁) is the Kullback-Leibler information
K(θ₀, θ₁) = E_{θ₀} log[p(X₁, θ₀)/p(X₁, θ₁)]
and σ²(θ₀, θ₁) = Var_{θ₀}(l(X₁, θ₁) − l(X₁, θ₀)).
5. Let (Xᵢ, Yᵢ), 1 ≤ i ≤ n, be i.i.d. with Xᵢ and Yᵢ independent, N(θ₁, 1), N(θ₂, 1), respectively. Suppose θⱼ ≥ 0, j = 1, 2. Consider testing H : θ₁ = θ₂ = 0 versus K : θ₁ > 0 or θ₂ > 0.
(a) Show that, whatever be n, under H, 2 log λ(Xᵢ, Yᵢ : 1 ≤ i ≤ n) is distributed as a mixture of point mass at 0, χ²₁ and χ²₂ with probabilities ¼, ½, ¼, respectively.
Hint: By sufficiency reduce to n = 1.
(b) Suppose Xᵢ, Yᵢ are as above with the same hypothesis but Θ = {(θ₁, θ₂) : 0 ≤ θ₂ ≤ cθ₁, θ₁ ≥ 0}. Show that 2 log λ(Xᵢ, Yᵢ : 1 ≤ i ≤ n) has a null distribution which is a mixture of point mass at 0, χ²₁ and χ²₂, but with probabilities ½ − Δ/2π, ½ and Δ/2π, where sin Δ = c/√(1 + c²), 0 < Δ < π/2.
(c) Let (X₁, Y₁) have an N₂(θ₁, θ₂, σ²₁₀, σ²₂₀, ρ₀) distribution and (Xᵢ, Yᵢ), 1 ≤ i ≤ n, be i.i.d. Let θ₁, θ₂ ≥ 0 and H be as above. Exhibit the null distribution of 2 log λ(Xᵢ, Yᵢ : 1 ≤ i ≤ n).
Hint: Consider σ²₁₀ = σ²₂₀ = 1 and Z₁ = X₁, Z₂ = (Y₁ − ρ₀X₁)/√(1 − ρ₀²).
6. In the model of Problem 5(a) compute the MLE (θ̂₁, θ̂₂) under the model and show that
(a) If θ₁ > 0, θ₂ > 0, then L(√n(θ̂₁ − θ₁, θ̂₂ − θ₂)) → N(0, 0, 1, 1, 0).
(b) If θ₁ = θ₂ = 0, then √n(θ̂₁, θ̂₂) converges in law to (U*, V*), where U* equals a N(0, 1) variable U with probability ½ and 0 with probability ½, and V* is independent of U* with the same distribution.
(c) Obtain the limit distribution of √n(θ̂₁ − θ₁, θ̂₂ − θ₂) if θ₁ = 0, θ₂ > 0.
(d) Relate the result of (b) to the result of Problem 4(a).
Note: The results of Problems 4 and 5 apply generally to models obeying A0-A6 when we restrict the parameter space to a cone (Robertson, Wright, and Dykstra, 1988). Such restrictions are natural if, for instance, we test the efficacy of a treatment on the basis of two correlated responses per individual.
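The mixture distribution in Problem 5(a) is easy to check by simulation. Here is a minimal Python sketch, assuming unit-variance normal observations under H so that 2 log λ reduces to the sum of squared positive parts of the standardized means; the simulation sizes are arbitrary.

```python
# Sketch for Problem 5(a): under H, 2 log lambda = (sqrt(n) Xbar)_+^2 + (sqrt(n) Ybar)_+^2,
# a 1/4, 1/2, 1/4 mixture of point mass at 0, chi^2_1 and chi^2_2.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
reps, n = 100000, 25
means = rng.normal(size=(reps, 2)) / np.sqrt(n)      # (Xbar, Ybar) under H
stat = n * np.sum(np.maximum(means, 0.0) ** 2, axis=1)

t = 3.0
mixture_tail = 0.5 * chi2.sf(t, 1) + 0.25 * chi2.sf(t, 2)
print(np.mean(stat > t), mixture_tail)               # should roughly agree
```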
7. Show that (6.3.19) holds.
Hint:
(i) Show that I(θ̂ₙ) can be replaced by I(θ).
(ii) Show that Wₙ(θ̂ₙ⁽²⁾) is invariant under affine reparametrizations η = a + Bθ, where B is nonsingular.
(iii) Reparametrize as in Theorem 6.3.2 and compute Wₙ(θ̂ₙ⁽²⁾), showing that its leading term is the same as that obtained in the proof of Theorem 6.3.2 for 2 log λ(X).
8. Show that under A0-A5, and A6 for θ̂ₙ⁽¹⁾, √n(θ̂ₙ⁽¹⁾ − θ₀⁽¹⁾) has a limiting normal distribution with covariance matrix Σ(θ₀), where Σ(θ₀) is given by (6.3.21).
Hint: Write θ̂ₙ⁽¹⁾ in terms of θ̂ₙ and apply Theorem 6.2.2 to θ̂ₙ.
9. Under conditions A0-A6 for (a), and A0-A6 with A6 for θ̂ₙ⁽¹⁾ for (b), establish that
(a) [−(1/n) D²lₙ(θ̂ₙ)]⁻¹ is a consistent estimate of I⁻¹(θ₀);
(b) (6.3.22) is a consistent estimate of Σ⁻¹(θ₀).
Hint: Argue as in Problem 5.3.10.
10. Show that under A2, A3, A6, θ ↦ I(θ) is continuous.

Problems for Section 6.4
1. Exhibit the two solutions of (6.4.4) explicitly and find the one that corresponds to the maximizer of the likelihood.
2. (a) Show that for any 2 × 2 contingency table the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero and, hence, is of the form
Δ   −Δ
−Δ    Δ
for some Δ.
(b) Deduce that χ² = Z², where Z is given by (6.4.8).
(c) Derive the alternative form (6.4.8) for Z.
3. In the 2 × 2 contingency table model let Xᵢ = 1 or 0 according as the ith individual sampled is an A or Ā, and Yᵢ = 1 or 0 according as the ith individual sampled is a B or B̄.
(a) Show that the correlation of X₁ and Y₁ is
ρ = [P(A ∩ B) − P(A)P(B)] / √(P(A)(1 − P(A))P(B)(1 − P(B))).
(b) Show that the sample correlation coefficient r studied in Example 5.3.6 is related to Z of (6.4.8) by Z = √n r.
(c) Conclude that if A and B are independent, 0 < P(A) < 1, 0 < P(B) < 1, then Z has a limiting N(0, 1) distribution.
4. (a) Let (N₁₁, N₁₂, N₂₁, N₂₂) ~ M(n, θ₁₁, θ₁₂, θ₂₁, θ₂₂) as in the contingency table. Let Rᵢ = Nᵢ₁ + Nᵢ₂, Cᵢ = N₁ᵢ + N₂ᵢ. Show that given R₁ = r₁, R₂ = r₂ = n − r₁, the counts N₁₁ and N₂₁ are independent B(r₁, θ₁₁/(θ₁₁ + θ₁₂)) and B(r₂, θ₂₁/(θ₂₁ + θ₂₂)).
(b) Show that θ₁₁/(θ₁₁ + θ₁₂) = θ₂₁/(θ₂₁ + θ₂₂) iff R₁ and C₁ are independent.
(c) Show that under independence the conditional distribution of Nᵢᵢ given Rᵢ = rᵢ, Cᵢ = cᵢ, i = 1, 2, is H(cᵢ, n, rᵢ) (the hypergeometric distribution).
5. Fisher's Exact Test. From the result of Problem 6.4.4 deduce that if j(α) (depending on r₁, c₁, n) can be chosen so that
P[N₁₁ ≥ j(α) | R₁ = r₁, C₁ = c₁] ≤ α,
then the test that rejects (conditionally on R₁ = r₁, C₁ = c₁) if N₁₁ ≥ j(α) is exact level α. This is known as Fisher's exact test. It may be shown (see Volume II) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5.4.54).
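A minimal Python sketch of the conditional test just described is given below. It uses the hypergeometric distribution of N₁₁ given the margins; the counts are illustrative (they match the table of Problem 8 as reconstructed here) and the one-sided alternative is an assumption of the example.

```python
# Sketch for Problem 5: given the margins (r1, c1, n), N11 is hypergeometric,
# so an exact one-sided p-value is P[N11 >= observed | margins].
from scipy.stats import hypergeom, fisher_exact

n11, r1, c1, n = 19, 31, 19, 36                 # illustrative counts
p_onesided = hypergeom.sf(n11 - 1, n, c1, r1)   # P[N11 >= n11 | margins]
print(p_onesided)

# scipy's built-in version of the same conditional test:
table = [[19, 12], [0, 5]]
print(fisher_exact(table, alternative="greater"))
```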
6. Let Nᵢⱼ be the entries of an a × b contingency table with associated probabilities θᵢⱼ, and let ηᵢ₁ = Σⱼ₌₁ᵇ θᵢⱼ, ηⱼ₂ = Σᵢ₌₁ᵃ θᵢⱼ. Consider the hypothesis H : θᵢⱼ = ηᵢ₁ηⱼ₂ for all i, j.
(a) Show that the maximum likelihood estimates of ηᵢ₁, ηⱼ₂ are given by
η̂ᵢ₁ = Rᵢ/n, η̂ⱼ₂ = Cⱼ/n,
where Rᵢ = Σⱼ Nᵢⱼ, Cⱼ = Σᵢ Nᵢⱼ.
(b) Deduce that Pearson's χ² is given by (6.4.9) and has approximately a χ²₍ₐ₋₁₎₍ᵦ₋₁₎ distribution under H.
Hint: (a) Consider the likelihood as a function of ηᵢ₁, i = 1, ..., a − 1, and ηⱼ₂, j = 1, ..., b − 1, only.
7. Suppose in Problem 6.4.6 that H is true.
(a) Show that then
P[Nᵢⱼ = nᵢⱼ, all i, j | Rᵢ = rᵢ, Cⱼ = cⱼ, all i, j] = ∏ᵢ₌₁ᵃ (rᵢ; nᵢ₁, ..., nᵢᵦ) / (n; c₁, ..., cᵦ),
where (A; B, C, D, ...) are the multinomial coefficients.
(b) How would you, in principle, use this result to construct a test of H similar to the χ² test with probability of type I error independent of ηᵢ₁, ηⱼ₂?
8. The following table gives the number of applicants to the graduate program of a small department of the University of California, classified by sex and admission status. Would you accept or reject the hypothesis of independence at the 0.05 level
(a) using the χ² test with approximate critical value?
(b) using Fisher's exact test of Problem 6.4.5?

            Admit   Deny
Men           19     12
Women          0      5

Hint: (b) It is easier to work with N₂₂. Argue that the Fisher test is equivalent to rejecting H if N₂₂ ≥ q₂ + n − (r₁ + c₁) or N₂₂ ≤ q₁ + n − (r₁ + c₁), and that under H, N₂₂ is conditionally distributed H(r₂, n, c₂).
9. (a) If A, B, C are three events, consider the assertions
(i) P(A ∩ B | C) = P(A | C)P(B | C)   (A, B INDEPENDENT GIVEN C)
(ii) P(A ∩ B | C̄) = P(A | C̄)P(B | C̄)   (A, B INDEPENDENT GIVEN C̄)
(iii) P(A ∩ B) = P(A)P(B)   (A, B INDEPENDENT)
(C̄ is the complement of C.) Show that (i) and (ii) imply (iii) if A and C are independent or B and C are independent.
(b) Construct an experiment and three events for which (i) and (ii) hold, but (iii) does not.
(c) The following 2 × 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex. Test in both cases whether the events [being a man] and [being admitted] are independent. Then combine the two tables into one, and perform the same test on the resulting table. Give p-values for the three cases.

            Admit   Deny
Men          235     35    270
Women         38      7     45
             273     42    n = 315

            Admit   Deny
Men          122     93    215
Women        103     69    172
             225    162    n = 387
(d) Relate your results to the phenomenon discussed in (a), (b).
10. Establish (6.4.14).
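For Problem 9(c), the three χ² tests of independence are routine to compute; here is a minimal Python sketch using the counts in the two tables as reconstructed above.

```python
# Sketch for Problem 9(c): chi-square tests of independence for each department
# and for the pooled table.
import numpy as np
from scipy.stats import chi2_contingency

dept1 = np.array([[235, 35], [38, 7]])     # rows: men, women; cols: admit, deny
dept2 = np.array([[122, 93], [103, 69]])
pooled = dept1 + dept2

for name, tab in [("dept 1", dept1), ("dept 2", dept2), ("pooled", pooled)]:
    stat, p, dof, _ = chi2_contingency(tab, correction=False)
    print(name, round(stat, 3), round(p, 4))
```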
11. Suppose that we know that β₁ = 0 in the logistic model, ηᵢ = β₁ + β₂zᵢ, with the zᵢ not all equal, and that we wish to test H : β₂ ≤ β₂⁰ versus K : β₂ > β₂⁰. Show that, for suitable α, there is a UMP level α test, which rejects, if and only if,
Σᵢ₌₁ᵏ zᵢNᵢ ≥ k*, where P_{β₂⁰}[Σᵢ₌₁ᵏ zᵢNᵢ ≥ k*] = α.
12. Suppose the zᵢ in Problem 6.4.11 are obtained as realizations of i.i.d. Zᵢ and mᵢ = m, so that the (Zᵢ, Xᵢ) are i.i.d. with (Xᵢ | Zᵢ) ~ B(m, π(β₂Zᵢ)).
(a) Compute the Rao test for H : β₂ ≤ β₂⁰ and show that it agrees with the test of Problem 6.4.11.
(b) Suppose that β₁ is unknown. Compute the Rao test statistic for H : β₂ ≤ β₂⁰ in this case.
(c) By conditioning on Σᵢ₌₁ⁿ Xᵢ and using the approach of Problem 6.4.5, construct an exact test (level independent of β₁).
13. Show that if ω₀ ⊂ ω₁ are nested logistic regression models of dimension q < r ≤ k, m₁, ..., mₖ → ∞, and H : β ∈ ω₀ is true, then the law of the statistic of (6.4.18) tends to χ²_{r−q}.
Hint: (Xᵢ − μᵢ)/√(mᵢπᵢ(1 − πᵢ)), 1 ≤ i ≤ k, are independent, asymptotically N(0, 1). Use this to imitate the argument of Theorem 6.3.3, which is valid for the i.i.d. case.
14. Show that, in the logistic regression model, if the design matrix has rank p, then β̂₀ as defined by (6.4.15) is consistent.
15. In the binomial one-way layout show that the LR test is asymptotically equivalent to Pearson's χ² test in the sense that 2 log λ − χ² →ᴾ 0 under H.
16. Let X₁, ..., Xₖ be independent, Xᵢ ~ N(θᵢ, σ²), where either σ² = σ₀² (known) and θ₁, ..., θₖ vary freely, or θᵢ = θᵢ₀ (known), i = 1, ..., k, and σ² is unknown. Show that the likelihood ratio test of H : θ₁ = θ₁₀, ..., θₖ = θₖ₀, σ² = σ₀² is of the form: Reject if (1/σ₀²) Σᵢ₌₁ᵏ (Xᵢ − θᵢ₀)² ≥ k₂ or ≤ k₁. This is an approximation (for large k, n) and simplification of a model under which (N₁, ..., Nₖ) ~ M(n, θ₁₀, ..., θₖ₀) under H, but under K may be either multinomial with θ ≠ θ₀ or have E₀(Nᵢ) = nθᵢ₀ but Var₀(Nᵢ) < nθᵢ₀(1 − θᵢ₀) ("cooked data").
Problems for Section 6.5
1. Fisher's Method of Scoring. The following algorithm for solving likelihood equations was proposed by Fisher; see Rao (1973), for example. Given an initial value θ̂₀, define iterates
θ̂ₘ₊₁ = θ̂ₘ + I⁻¹(θ̂ₘ) Dl(θ̂ₘ).
Show that for GLMs this method coincides with the Newton-Raphson method of Section 2.4.
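The scoring iteration is easy to write out for a particular GLM. Here is a minimal Python sketch for logistic regression, where (canonical link) the expected and observed information agree, so Fisher scoring and Newton-Raphson are the same update; the simulated data and names are ours.

```python
# Sketch for Problem 1: Fisher scoring  theta_{m+1} = theta_m + I(theta_m)^{-1} Dl(theta_m),
# written out for logistic regression (canonical link).
import numpy as np

rng = np.random.default_rng(4)
n = 200
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-Z @ beta_true)))

beta = np.zeros(2)
for _ in range(20):
    pi = 1.0 / (1.0 + np.exp(-Z @ beta))
    score = Z.T @ (Y - pi)                        # Dl(beta)
    info = Z.T @ (Z * (pi * (1 - pi))[:, None])   # I(beta), expected information
    beta = beta + np.linalg.solve(info, score)    # one scoring step
print(beta)
```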
2. Verify that (6.5.4) is as claimed by formula (2.2.20) for the regression described after (6.5.4).
3. Suppose that (Z₁, Y₁), ..., (Zₙ, Yₙ) have density as in (6.5.8) and
(a) P[Z₁ ∈ {z⁽¹⁾, ..., z⁽ᵏ⁾}] = 1;
(b) the linear span of {z⁽¹⁾, ..., z⁽ᵏ⁾} is Rᵖ;
(c) P[Z₁ = z⁽ʲ⁾] > 0 for all j.
Show that the conditions A0-A6 hold for P = P_β ∈ P (where q₀ is assumed known).
Hint: Show that if the convex support of the conditional distribution of Y₁ given Z₁ = z⁽ʲ⁾ contains an open interval about μⱼ for j = 1, ..., k, then the convex support of the conditional distribution of Σⱼ₌₁ᵏ λⱼYⱼz⁽ʲ⁾ given Zⱼ = z⁽ʲ⁾, j = 1, ..., k, contains an open ball about Σⱼ₌₁ᵏ λⱼμⱼz⁽ʲ⁾ in Rᵖ.
4. Show that for the Gaussian linear model with known variance σ₀², the deviance is
D(y, μ₀) = |y − μ₀|²/σ₀².
5. Let Y₁, ..., Yₙ be independent responses and suppose the distribution of Yᵢ depends on a covariate vector zᵢ. Assume that there exist functions h(y, τ), b(θ), g(μ) and c(τ) such that the model for Yᵢ can be written as
p(y, θᵢ) = h(y, τ) exp{[θᵢy − b(θᵢ)]/c(τ)},
where τ is known, g(μᵢ) = zᵢᵀβ, and b' and g are monotone. Set ηᵢ = g(μᵢ) and v(μ) = Var(Y)/c(τ) = b''(θ).
(a) Show that the likelihood equations are
Σᵢ₌₁ⁿ (dμᵢ/dηᵢ) (yᵢ − μᵢ) zᵢⱼ / v(μᵢ) = 0, j = 1, ..., p.
Hint: By the chain rule
∂l(y, θ)/∂βⱼ = (∂l/∂θ)(dθ/dμ)(dμ/dη)(∂η/∂βⱼ).
(b) Show that the Fisher information is Z_Dᵀ W Z_D, where Z_D = ‖zᵢⱼ‖ is the design matrix and W = diag(w₁, ..., wₙ), wᵢ = w(μᵢ) = 1/[v(μᵢ)(dηᵢ/dμᵢ)²].
(c) Suppose (Z₁, Y₁), ..., (Zₙ, Yₙ) are i.i.d. as (Z, Y) and that given Z = z, Y follows the model p(y, θ(z)), where θ(z) solves b'(θ) = g⁻¹(zᵀβ). Show that, under appropriate conditions, √n(β̂ − β) has a limiting normal distribution, and identify its asymptotic variance.
(d) Gaussian GLM. Suppose Yᵢ ~ N(μᵢ, σ₀²). Give θ, τ, h(y, τ), b(θ), c(τ), and v(μ). Show that when g is the canonical link, g = (b')⁻¹, the result of (c) coincides with (6.5.9).
(e) Suppose that Yᵢ has the Poisson, P(μᵢ), distribution. Give θ, τ, h(y, τ), b(θ), c(τ), and v(μ). In the random design case, give the asymptotic distribution of √n(β̂ − β). Find the canonical link function and show that when g is the canonical link, your result coincides with (6.5.9).
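The likelihood equations of part (a) and the weight matrix of part (b) can be checked numerically for a specific GLM. Below is a minimal Python sketch for the Poisson case with log (canonical) link, where v(μ) = μ and dμ/dη = μ; data and names are ours.

```python
# Sketch for Problem 5(a)-(b): Poisson GLM with log link.
import numpy as np

rng = np.random.default_rng(5)
n = 150
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = rng.poisson(np.exp(0.3 + 0.7 * Z[:, 1]))

beta = np.zeros(2)
for _ in range(25):                                   # Fisher scoring
    mu = np.exp(Z @ beta)
    beta = beta + np.linalg.solve(Z.T @ (Z * mu[:, None]), Z.T @ (Y - mu))

mu = np.exp(Z @ beta)
dmu_deta, v = mu, mu                                  # canonical (log) link
score = Z.T @ (dmu_deta * (Y - mu) / v)               # equations of part (a), ~ 0 at the MLE
W = np.diag(1.0 / (v * (1.0 / dmu_deta) ** 2))        # w_i = 1/[v(mu_i)(d eta/d mu)^2]
info = Z.T @ W @ Z                                    # Fisher information Z_D' W Z_D
print(np.max(np.abs(score)), info)
```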
Problems for Section 6.6
1. Consider the linear model of Example 6.6.2 and the hypothesis
β_{q+1} = β_{0,q+1}, ..., β_p = β_{0,p}
under the sole assumption that Eε = 0, 0 < Var ε < ∞. Show that the LR, Wald, and Rao tests are still asymptotically equivalent in the sense that if 2 log λₙ, Wₙ, and Rₙ are the corresponding test statistics, then under H,
2 log λₙ = Wₙ + oₚ(1) = Rₙ + oₚ(1).
Note: 2 log λₙ, Wₙ, and Rₙ are computed under the assumption of the Gaussian linear model with σ² known.
Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under the parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation. Apply Theorem 6.2.1.
2. Show that the standard Wald test for the problem of Example 6.6.3 is as given in (6.6.10).
3. Show that σ̄² given in (6.6.14) is a consistent estimate of 2 Var_P X⁽¹⁾ in Example 6.6.3 and, hence, that replacing σ̂² by σ̄² in (6.6.10) creates a valid asymptotic level α test.
4. Consider the Rao test for H : θ = θ₀ for the model P = {P_θ : θ ∈ Θ}, assuming A0-A6 hold. Suppose that the true P does not belong to P, but that if θ(P) is defined by (6.6.3), then θ(P) = θ₀. Show that, if Var_P Dl(X, θ₀) is estimated by I(θ₀), then the Rao test does not in general have the correct asymptotic level, but that if the estimate n⁻¹ Σᵢ₌₁ⁿ [Dl][Dl]ᵀ(Xᵢ, θ₀) is used, then it does.
5. Suppose X₁, ..., Xₙ are i.i.d. P. By Problem 5.3.1, if P has a positive density f at ν(P), the unique median of P, then the sample median X̃ satisfies
√n(X̃ − ν(P)) →ᴸ N(0, σ²(P)),
where σ²(P) = 1/[4f²(ν(P))].
(a) Show that if f is symmetric about μ, then ν(P) = μ.
(b) Show that if f is N(μ, σ²), then σ²(P) > σ² = Var_P(X₁), the information bound and asymptotic variance of √n(X̄ − μ); but if f(x) = ½ exp(−|x − μ|), then σ²(P) < σ², in fact, σ²(P)/σ² = 2/π.
6. Establish (6.6.15) by verifying the conditions of Theorem 6.2.1 under this model and verifying the formula given.
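The comparison in Problem 5(b) is easy to see by Monte Carlo. Here is a minimal Python sketch comparing the mean and median under normal and double-exponential populations; sample sizes and replication counts are arbitrary choices of ours.

```python
# Sketch for Problem 5: Monte Carlo MSEs of the sample mean and sample median
# under a normal and a double-exponential (Laplace) population centered at 0.
import numpy as np

rng = np.random.default_rng(6)
reps, n = 20000, 101

for name, draw in [("normal", lambda: rng.normal(size=(reps, n))),
                   ("laplace", lambda: rng.laplace(size=(reps, n)))]:
    x = draw()
    mse_mean = np.mean(np.mean(x, axis=1) ** 2)
    mse_median = np.mean(np.median(x, axis=1) ** 2)
    print(name, mse_median / mse_mean)   # > 1 for normal, < 1 for Laplace
```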
7. In the binary data regression model of Section 6.4.3, let πᵢ = s(zᵢᵀβ), where s(t) is the continuous distribution function of a random variable symmetric about 0; that is,
s(t) = 1 − s(−t), t ∈ R.   (6.7.1)
(a) Show that π can be written in this form for both the probit and logit models.
(b) Suppose that the zᵢ are realizations of i.i.d. Zᵢ, that Z₁ is bounded with probability 1, and let β̂_L(X⁽ⁿ⁾), where X⁽ⁿ⁾ = {(Yᵢ, Zᵢ) : 1 ≤ i ≤ n}, be the MLE for the logit model. Show that if the correct model has πᵢ given by s as above and β = β₀, then β̂_L is not a consistent estimate of β₀ unless s(t) is the logistic distribution function. But if β_L is defined as the solution of E[Z₁ s(Z₁ᵀβ₀)] = Q(β), where Q(β) = E(Z₁ Λ(Z₁ᵀβ)) is p × 1 and Λ denotes the logistic distribution function, then √n(β̂_L − β_L) has a limiting normal distribution with mean 0 and variance
Q̇⁻¹(β) Var(Z₁(Y₁ − Λ(Z₁ᵀβ))) [Q̇⁻¹(β)]ᵀ,
where Q̇(β) = E(Z₁ Λ'(Z₁ᵀβ) Z₁ᵀ) is p × p and necessarily nonsingular.
Hint: Apply Theorem 6.2.1.
8. (Model Selection) Consider the classical Gaussian linear model (6.1.1), Yᵢ = μᵢ + εᵢ, i = 1, ..., n, where the εᵢ are i.i.d. Gaussian with mean zero and variance σ², μᵢ = zᵢᵀβ, and the zᵢ are d-dimensional vectors of covariate (factor) values. Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d − p don't matter; that is, β_{p+1} = ··· = β_d = 0. Let β̂(p) be the LSE under this assumption and Ŷᵢ(p) the corresponding fitted value. A natural goal to entertain is to obtain new observations Y₁*, ..., Yₙ* at z₁, ..., zₙ and evaluate the performance of Ŷ₁⁽ᵖ⁾, ..., Ŷₙ⁽ᵖ⁾ and, hence, the model with β_{p+1} = ··· = β_d = 0, by the (average) expected prediction error
EPE(p) = n⁻¹ Σᵢ₌₁ⁿ E(Yᵢ* − Ŷᵢ⁽ᵖ⁾)².
Here Y₁*, ..., Yₙ* are independent of Y₁, ..., Yₙ and Yᵢ* is distributed as Yᵢ, i = 1, ..., n. Let RSS(p) = Σᵢ(Yᵢ − Ŷᵢ⁽ᵖ⁾)² be the residual sum of squares. Suppose that σ² is known.
(a) Show that EPE(p) = σ²(1 + p/n) + n⁻¹ Σᵢ₌₁ⁿ (μᵢ − μᵢ⁽ᵖ⁾)², where μᵢ⁽ᵖ⁾ = zᵢᵀβ(p) and β(p) = (β₁, ..., βₚ, 0, ..., 0)ᵀ.
(b) Compute E[RSS(p)] and deduce that
(c) EPÊ(p) = n⁻¹ RSS(p) + 2pσ²/n is an unbiased estimate of EPE(p).
(d) Show that (a), (b) continue to hold if we assume the Gauss-Markov model.
(e) Suppose p = 2 and μ(z) = β₁z₁ + β₂z₂. Evaluate EPE for (i) ηᵢ = β₁zᵢ₁ and (ii) ηᵢ = β₁zᵢ₁ + β₂zᵢ₂. Give values of β₁, β₂ and {zᵢ₁, zᵢ₂} such that the EPE in case (i) is smaller than in case (ii) and vice versa. Use σ² = 1 and n = 10.
Hint: (a) Note that
EPE(p) = n⁻¹ Σᵢ₌₁ⁿ E(Yᵢ* − μᵢ)² + n⁻¹ Σᵢ₌₁ⁿ E(μᵢ − Ŷᵢ⁽ᵖ⁾)².
Derive the result for the canonical model.
(b) The result depends only on the mean and covariance structure of the Yᵢ, Yᵢ*, μ̂ᵢ⁽ᵖ⁾, i = 1, ..., n.
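The unbiasedness claim of part (c) can be checked by simulation. Here is a minimal Python sketch under the assumption that the submodel is correct (the last d − p coefficients are truly zero); the design, coefficients, and replication count are arbitrary choices of ours.

```python
# Sketch for Problem 8(c): check that RSS(p)/n + 2*p*sigma^2/n is unbiased for EPE(p).
import numpy as np

rng = np.random.default_rng(7)
n, d, p, sigma = 10, 4, 2, 1.0
Z = rng.normal(size=(n, d))
beta = np.array([1.0, -0.5, 0.0, 0.0])          # last d - p coefficients are zero
mu = Z @ beta

est, epe = [], []
for _ in range(20000):
    y = mu + sigma * rng.normal(size=n)
    y_new = mu + sigma * rng.normal(size=n)      # independent copy at the same z_i
    bhat, *_ = np.linalg.lstsq(Z[:, :p], y, rcond=None)
    fit = Z[:, :p] @ bhat
    est.append(np.sum((y - fit) ** 2) / n + 2 * p * sigma ** 2 / n)
    epe.append(np.mean((y_new - fit) ** 2))
print(np.mean(est), np.mean(epe))                # the two averages should agree
```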
6.8 NOTES
Note for Section 6.1
(1) From the L. A. Heart Study after Dixon and Massey (1969).
Note for Section 6.2
(1) See Problem 3.5.9 for a discussion of densities with heavy tails.
Note for Section 6.4
(1) R. A. Fisher pointed out that the agreement of this and other data of Mendel's with his hypotheses is too good. To guard against such situations he argued that the test should be used in a two-tailed fashion and that we should reject H both for large and for small values of χ². Of course, this makes no sense for the model we discussed in this section, but it is reasonable if we consider alternatives to H which are not multinomial. For instance, we might envision the possibility that an overzealous assistant of Mendel "cooked" the data. LR test statistics for enlarged models of this type do indeed reject H for data corresponding to small values of χ² as well as large ones (Problem 6.4.16). The moral of the story is that practicing statisticians should be on their guard! For more on this theme see Section 6.6.
6.9
REFERENCES
COX, D. R., The Analysis of Binary Data London: Methuen, 1970.
DIXON, W. AND F. MASSEY, Introduction to Statistical Analysis, 3rd ed. New York: McGraw-Hill, 1969.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner, 1958.
GRAYBILL, F., An Introduction to Linear Statistical Models, Vol. I New York: McGraw-Hill, 1961.
HABERMAN, S., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.
HALD, A., Statistical Theory with Engineering Applications New York: Wiley, 1952.
HUBER, P. J., "The behavior of maximum likelihood estimates under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob. I, Univ. of California Press, 221-233 (1967).
KASS, R., J. KADANE, AND L. TIERNEY, "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425-433 (1989).
KOENKER, R. AND V. D'OREY, "Computing regression quantiles," J. Roy. Statist. Soc. Ser. C, 36, 383-393 (1987).
LAPLACE, P.-S., "Sur quelques points du système du monde," Mémoires de l'Académie des Sciences de Paris (Reprinted in Oeuvres Complètes, II, 475-558. Gauthier-Villars, Paris) (1789).
MALLOWS, C., "Some comments on Cp," Technometrics, 15, 661-675 (1973).
MCCULLAGH, P. AND J. A. NELDER, Generalized Linear Models London: Chapman and Hall, New York, 1983; second edition, 1989.
PORTNOY, S. AND R. KOENKER, "The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279-300 (1997).
RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.
ROBERTSON, T., F. T. WRIGHT, AND R. L. DYKSTRA, Order Restricted Statistical Inference New York: Wiley, 1988.
SCHEFFÉ, H., The Analysis of Variance New York: Wiley, 1959.
SCHERVISH, M., Theory of Statistics New York: Springer, 1995.
STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge, MA: Harvard University Press, 1986.
WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
Appendix A
A REVIEW OF BASIC PROBABILITY THEORY
In statistics we study techniques for obtaining and using information in the presence of uncertainty. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties. The Kolmogorov model and the modem theory of prob ability based on it are what we need. The reader is expected to have had a basic course in probability theory. The purpose of this appendix is to indicate what results we consider ba sic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level, we include some commentary. Sections A.14 and A.15 contain some results that the student may not know, which are relevant to our study of statistics. Therefore, we include some proofs as well in these sections. In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts.
A.l
THE BASIC MODEL
Classical mechanics is built around the principle that like causes produce like effects. Prob ability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. A coin that_ is tossed can land heads or tails. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The intensity of solar flares in the same month of two different years can vary sharply. The situations we are going to model can all be thought of as random experiments. Viewed naively, an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. We add to this notion the requirement that to be called an experiment such an action must be repeatable, at least conceptually. The adjective random is used only to indicate that we do not, in addition, require that every repetition yield the same outcome, although we do not exclude this case. What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize. This 441
1
I
l
A Review of Basic Probability Theory
442
Appendix A
long-term relative frequency 11 ..1/ '' where 11,\ is the number of times the possible outcome A occurs in n repetitions, is to many statistician::;. including the authors, the operational interpretation of the mathematical concept of probability. ln this sense, almost any kind of activity involving uncertainty, from horse races to genetic experiments, falls under the vague heading of "random experiment." Another school of statisticians finds this formulation too restrictive. By interpreting probability as a subjective measure, they are willing to assign probabilities in any situation involving uncertainty, whether it is conceptually repeatable or not. For a discussion of this approach and further references the reader may wish to consult Savage ( 1954), Raiffa and Schlaiffer ( I 961 ), Savage (1962), Lindley ( 1 965), de Groot ( I 970), and Berger ( 1 985). We now turn to the mathematical abstraction of a random experiment, the probability model. In this section and throughout the book, we presunle the reader to be familiar with ele mentary set theory and its notation at the level of Chapter I of Feller ( 1968) or Chapter l of Parzen ( 1960). We shall use the symbols U, n, c, , c for union, intersection, comple mentation, set theoretic difference, and inclusion as is usual in elementary set theory. A random experiment is described mathematically in terms of the following quantities. .
-
A.l.l The sample space is the set of all possible outcomes of a random experiment. We denote it by n. Its complement, the null set or impossible event, is denoted by 0. A.1.2 A sample point is any member of fl and is typically denoted by w.
A.1.3 Subsets of n are called events. We denote events by A, B, and so on or by a descrip tion of their members, as we shall see subsequently. The relation between the experiment and the model is given by the correspondence "A occurs if and only if the actual outcome of the experiment is a member of A." The set operations we have mentioned have interpre tations also. For example, the relation A c B between sets considered as events means that the occurrence of A implies the occurrence of B. If w E f.!, {w} is called an elementary event. If A contains more than one point, it is called a composite event.
i •
I •
I \ '
" '
A.1.4 We will let A denote a class of subsets of n to which we an assign probabilities. For technical mathematical reasons it may not be possible to assign a probability P to ev ery subset of n. However, A is always taken to be a sigma field, which by definition is a nonempty class of events closed under countable unions, intersections, and complemen tation (cf. Chung, 1974; Grimmett and Stirzaker, 1992; and Loeve, 1977). A probabiliry distribution or measure is a nonnegative function P on A having the following properties:
•
' ' '
I
(i) (ii)
P(Q) = l. If A1, A2, . . . are pairwise disjoint sets in A, then p
I
•
'
•
•
i, •
•
u A; t=1
=
L P(A;). i= 1
Recall that Ui 1 Ai is just the collection of points that are in any one of the sets Ai and that two sets are disjoint if they have no points in common .
.�
1
l I
l
l j
' ••
j
l
•
l
\
i
l
1
Section A.2
Elementary Properties of Probability Models
443
A.l.S The three objects !1, A, and P together describe a random experiment mathemati cally. We shall refer to the triple (0, A, P) either as a probability model or identify the model with what it represents as a (random) experiment. For convenience, when we refer to events we shall automatically exclude those that are not members of A. References Gnedenko (1967) Chapter I , Sections 1-3, 6-8 Grimmett and Stirzaker (1992) Sections 1. 1-1.3 Hoe!, Port, and Stone ( 1 97 1 ) Sections Ll , 1 .2 Parzen (1 960) Chapter I, Sections 1-5 Pitman ( 1 993) Sections 1.2 and 1.3 A.2
ELEMENTARY PROPERTIES O F PROBABILITY MO DELS
The following are consequences of the definition of P. A,2,1 lf A c B, then P(B - A) = P(B) - P(A). A.2.2 P(A")
=
1 -
P(A), P(0)
A.2.3 If A c B, P(B) A.2.4 0 < P(A) < I. A.2.5 P (U;;' 1 An) < A.2.6 If A, c A, A.2.7 P
c
(n��r A,)
>
=
0.
P(A).
I;;;'
,
P(A n ) ·
· · · c An . . . , then P (U;:' 1 An) = lim,�= P(An)·
> 1 -
I:��r P(A;) (Bonferroni's inequality).
References Gnedenko, (1967) Chapter I, Section 8 Grimmett and Stirzaker ( 1992) Section 1.3 Hoe!, Port, and Stone (1992) Section 1.3 Par2en (1 960) Chapter I , Sections 4-5 Pitman (1993) Section 1.3 A.J
DISCRETE PROBABILITY MODELS
A.3.1 A probability model is called discrete if !1 is finite or countably infinite and every subset off! is assigned a probability. That is, we can write f! = {w 1 , w2, . . . } and A is the collection of subsets of n. In this case, by axiom (ii) of (A.l .4)., we have for any event A, (A.3.2)
' '
;
444
A Review of Basic Probability Theory
Appendix A
An important special case arises when n has a finite number of elements, say N, all of which are equally likely. Then P( {w )) = 1/N for every w E n. and Number of elements in A P(A ) = :_::_c:c:.:,:_�Ci"'-'-=--"'-'-' . (A.3 .3) N
A.3.4 Suppose that w 1 , .
, WN
are the members of some population (humans, guinea pigs, flowers, machines, etc.). Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another, selecting at random, is an experiment leading to the model of (A.3.3). Such selection can be carried out if N is small by putting the "names" of the wl in a hopper, shaking well, and drawing. For large N, a random number table or computer can be used. . .
-� l 1
'
1
-
'• •
• '
I
'
l ,
•
References
Gnedenko (1967) Chapter I , Sections 4--5 Parzen (1960) Chapter I , Sections 6-7 Pitman (1993) Section 1.1
A.4
.
l
1
I
CONDITIONAL PROBABI LITY AND INDEPENDENCE
! 1 '
Given an event B such that P( B) > 0 and any other event A, we define the conditional probability of A given B, which we write P(A I B), by P(A n B) P(A I B) (AA. l ) . =
P(B) If P(A) corresponds to the frequency with which A occurs in a large number of repetitions of the experiment, then P( A I B) corresponds to the frequency of occurrence of A relative to the class of trials in which B does occur. From a heuristic point of view P(A I B) is the chance we would assign to the event A if we were told that B has occurred. If A1, A2, . are (pairwise) disjoint events and P(B) > 0, then
p U A I B = :L P(A; I B).
In fact, for fixed B as before, the function P(- I B) is a probability measure on (!:l, A) which is referred to as the conditional probability measure given B. Transposition of the denominator in (A.4.1 ) gives the multiplication rule,
P(A n B) = P(B)P(A 1 B).
(A.4.3)
Bn are (pairwise) disjoint events of positive probability whose union is !:l, the identity A = U7 1 (A n B; ). (A. l 4)(ii) and (A.4.3) yield n (A.4.4) P(A) L P(A I B; )P(B; ). .
-�
�
1 '
'
'
(A.4.2)
i=l
i=l
B1 , B2,
1
.
.
If
l j .
.
•
,
.
=
j=l
j ' '
' •
1 •
'
' '
'
'
Section A.4
445
Conditional Probability and Independence
If 1 1( A ) is positive, we can combine (A.4.1 ). ( A.4.3 ), and ( A.4.4) and obtain Bayes rule P(A I B,)P(B,) , P(B, I A) � "" ) L.. j - 1 P(A I B )P(BJ
The conditional probability of A given B1 , defined by
(A4.5)
.I
.
.
.
•
Bu is written P(A B1
for any events A, B1, . . . , Bn such that P(B1 n · · · n Bn) Simple algebra leads to the multipliCation rule,
>
•
•
.
•
, En ) and
0.
P(B1 n · · · n Bn) � P(B1) P(B, I BJ)P(B, I B1 , B,) . . . P(B, I B1 . . . . , Bn _ !) (A.4.7)
whenever P(B1 n · · · n En - d > 0. Two events A and B arc said to be independent if P(A n B) � P(A)P(B).
(A.4.8)
If P( B) > 0, the relation (A.4.8) may be written P(A I B) � P(A).
(A.4.9)
In other words, A and B are independent if knowledge of B does not affect the probability of A. The events A 1 , . . . , An are said to be independent if P(A;, n . . . n A;,)
k
= IT P(A;,)
(A.4.10)
j=l
for any subset {i1, . . . , ik} of the integers {1, . . . , n}. If all the P(Ai) are positive, relation (A.4. l0) is equivalent to requiring that P(A, I A;, . . . , A;,)
�
P(A1)
for any j and {i1, . . . , ik} such that j '/c {i1, . . . , i k }. References Gnedenko ( 1967) Chapter I , Sections 9 Grimmett and Stirzaker (1992) Section 1.4 Hoe!, Port, and Stone (1971) Sections 1 .4, 1 .5 Parzen (1960) Chapter 2, Section 4; Chapter 3, Sections 1.4 Pitman (1993) Section 1 .4
(A.4. l l )
l '
446
A Review of Basic Probability Theory
Appendix A
COMPOUND EXP ERIMENTS
A.5
There is an intuitive notion of independent experiments. For example, if we toss a coin twice, the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second. On the other hand, it is easy to give examples of dependent experi ments: If we draw twice at random from a hat containing two green chips and one red chip, and if we do not replace the first chip drawn before the second draw, then the probability of a given chip in the second draw will depend on the outcome of the first dmw. To be able to talk about independence and dependence of experiments, we introduce the notion of a compound experiment. InformaUy, a compound experiment is one made up of two or more component ex periments. There are certain natural ways of defining sigma fields and probabilities for these experiments. These will be discussed in this section. The reader not interested in the formalities may skip to Section A.6 where examples of compound experiments are given . A.S.l Recall that if A 1 , . . . , An are events, the Cartesian product A1 x · · · x An of At, . . . , An is by definition { (w1, , Wn) : w� E Ai, 1 < i < n}. If we are given n ex periments (probability models) t\, . . . , En with respective sample spaces 0 1 , , .On. then the sample space .O ofthe n stage compound experiment is by definition f21 X · · · X .On . The (n stage) compound experiment consists in performing component experiments £1 , , En and recording all n outcomes. The interpretation of the sample space n is that (w1, . . . , wn) is a sample point in n if and only if w 1 is the outcome of £1, w2 is the outcome of £2 and so on. To say that £t has had outcome w? E Ot corresponds to the occurrence of the com pound event (in .O) given by fl1 X · · · X f2i-I X {w?} X f2t+I X · · · X fln = {(Wt, . . , wn) E n : wi = wf}. More generally, if Ai E At. the sigma field corresponding to £i, then Ai corresponds to nl X . . . X ni-l X At X ni+ I X . . . X nn in the compound experiment. If we want to make the £i independent, then intuitively we should have aU classes of events A1, , An with Ai E Ai, independent. This makes sense in the compound experiment. If P is the probability measure defined on the sigma field A of the compound experiment, that is, the subsets A of !1 to which we can assign probability (I), we should have P([A, X n, X X !1n] n [!1, X A, X . . . X !1n] n . . . ) .
.
•
•
'
• • •
.
I
.
' '
•
'
!
l '
i
.
.
'
•
1 1 •
.
.
•
'
.
= P(A, x · · · x An) = P(A, X n, X . . . X !1n)P(fl, X A, X . . . X nn) . . . P(n, X . . . X nn-1 X An). (A.5.2)
j l
If we are given probabilities ?, on (!1, A1), P2 on (!12 , A2), . . . , Pn on (!1. , An). then (A.5.2) defines P for A1 x · · · X An by
i
l j
· .
x
(A.5.3) · · · x An) = P, (A,) . . . Pn(An) · It may be shown (Billingsley, 1995; Chung, 1974; Loeve, 1977) that if P is defined by (A.5.3) for events A1 x · · x An. it can be uniquely extended to the sigma field A spec ified in note ( 1 ) at the end of this appendix. We shall speak of independent experiments £1, . . . , En if the n stage compound experiment has its probability structure specified by (A.5.3). In the discrete case (A.5.3) holds provided that P( {(w, , . . . , wn)}) = P1 ( { w,}) . . . Pn ( {Wn}) for all wi E !1,, 1 < i < n. (A.5.4) P(A, ·
l
I '
''
I:
:1
i
•
L i
--
---------------
�
Section
A6
Specifying
P 1, .
know
447
Bernoulli and Multinomial Trials, Sampling With and Without Replacement
P when the £,
are dependent is more complicated. In the discrete case we E 01, { (w1 . . . , U.:n ) } ) for each once we have specified c,..:11 ) with
wi P( (w1 i = , H. By the multiplication rule (A.4.7) we have, in the discrete case, the following. A.S.S P({(w1 . . , wn ) } ) P(£1 has outcome w l ) P(£2 has outcome w2 1 £1 has out £11_ 1 has outcome W n _ t ) , come W1 ) . . . P(£T, has outcome Wn I £1 has outcome W J , The probability structure is determined by these conditional prob ab ilitie s and conversely. .
•
.
.
.
•
.
.
=
.
.
•
•
•
References Grimmett and Stirzaker ( 1992) Sections Hoe!, Port, and Stone Parzen
A.6
1 .5, 1 .6
( 1 97 1 ) Section 1 .5
(1960) Chapter 3
BERNOULLI AND MULTINOMIAL TRIALS, SAMPLING WITH AND WITHOUT REPLACEMENT
A.6.1 Suppose that we have an experiment with only two possible outcomes, which we shall denote by S (success) and F (failure). If we assign P( { S) ) p, we shall refer to such =
an experiment as a
Bernoulli trial with probability of success p. The simplest example of such a Bernoulli trial is tossing a coin with probability p of landing heads (success). Other examples will appear naturally in what follows. If we repeat such an experiment n times independently, we say we have performed n Bernoulli
trials with success probability p. If 0 is the sample space of the compound experiment, any point w E n is an n-dimensional vector of S's and F's and,
where then
k(w)
P( { w ) ) = Pk(w)(l - p)"- k(w)
is the number of S's appearing in
w.
(A.6.2)
If Ak is the event [exactly
k S's occur], (A.6.3)
( k )=
where
n
The formula
n!
k! (n - k)! '
(A6.3) is known as the binomial probability.
A.6.4 More generally, if an experiment has q possible outcomes w1,
, wq and P( {wi}) = ,pq. If Pi. we refer to such an experi ment as a multinomial trial with probabilities p1 , .
•
•
•
.
•
the experiment is performed n times independently. the compound experiment is called n
multinomial trials with probabilities p1 , . . . , Pq· If n is the sample space of this experiment and w
E !1, then
(A.6.5)
' i' !'
\'
'
,,
Appendix A
k�(w) = number of times wi appears in the sequence w. If AkJ , ... ,kq is the event (ex kt WI 's are observed, exactly kzwz 's are observed, . . . , exactly kqwq 's are observed),
where actly then
(A.6.6)
.
,.
'
A Review of Basic Probability Theory
448
'
' ' '
where the
ki are natural numbers adding up to n.
'
A.6.7
If we perform an experiment given by (.0, A, P) independently
n times,
we shall
sample of size n from the population given by (O,A, P). When .0 is finite the tenn, with replacement is added to sometimes refer to the outcome of the compound experiment as a
distinguish this situation from that described in (A.6.8) as follows.
A.6.8 If we have a finite population of cases n
{w1 , WN } and we select cases wi successively at random n times without replacement, the component experiments are not independent and, for any outcome a = (wi1 , wi., ) of the compound experiment, •
P({a})
•
•
=
•
•
•
1 (N) n
�
I l
,
'
(A.6.9)
.
' '
where
' •
(N) n
N! (N - n)! .
�
' ' •
If the case drawn is replaced before the next drawing, we are sampling and the component experiments are independent and bers of n have a "special" characteristic F and
Ak
= (exactly
P(A k )
�
for max (O, n -
k
N( 1
P( {a}) �
p)
�
with replacement,
1/N". If Np of the mem
' •
'
have the opposite characteristic
• • '
k "special" individuals are obtained in the sample}, then
) ( n
S
and
�
'
(Np)k(N(I - p)) n-k (N) n
(A.6. l0)
�
N(l - p)) < k < min(n,Np),
•
J
• •
and
(A.6.1 0) is known as the hypergeometric probability.
P(Ak)
�
O otherwise. The formula '
References
J
'
j
2, Section I I Hoe!, Port, and Stone (1971) Section 2.4 Gnedenko (1 967) Chapter
.. •
'
1
Parzen ( 1960) Chapter 3, Sections 1 -4
'
I
Pitman (1993) Section 2.1
A.7
l1
PROBABILITIES ON EUCLI DEAN SPACE
Random experiments whose outcomes are real numbers play a central role in
theory and
.1
'
practice. The probability models corresponding to such experiments can all be thought of
'
i
as having a Euclidean space for sample space.
'
i
..J
________________________________
Section A.?
449
Probabilities on Euclidean Space
We shall use the notation Rk of k�dimensional Euclidean space and denote members , xk) ', where ( )' denotes transpose. of Rk by symbols such as x or (x1 , .
•
.
A.7.1 lf (a1 , bt ) , . . . , (a , bk) are k open intervals, we shall can the set (a1obt) x · · · x k (ab bk) = {(x 1, . . , xk) : ai < xi < bi, 1 < i < k} an open k rectangle. .
A.7.2 The Borel field in Rk, which we denote by !3k, is defined to be the smallest sigma field having all open k rectangles as members. Any subset of Rk we might conceivably be interested in turns out to be a member of f3k. We will write R for R1 and !3 for !3 1 . A.7.3 A discrete (probability) distribution on Rk is a probability measure P such that L:� 1 P( { xi }) = 1 for some sequence of points {xi} in Rk. That is, only an xi can occur as an outcome of the experiment. This definition is consistent with (A.3.1) because the study of this model and that of the model that has n = { x1, , Xn, . . . } are equivalent. The frequencyfunction p of a discrete distribution is defined on Rk by •
•
•
p(x) = P( (x )). (A.7.4) Conversely, any nonnegative function p on Rk vanishing except on a sequence {x1, . , Xn, . . . } of vectors and that satisfies L:� 1 p(xi) = 1 defines a unique discrete probability .
.
distribution by the relation
P(A ) =
L p(x; ).
(A.7.5)
x;EA
A.7.6 A nonnegative function p on Rk , which is integrable and which has
r p(x)dx = 1 , JR'
where dx denotes dx1 . . . dxn, is called a density function. Integrals should be interpreted in the sense of Lebesgue. However, for practical purposes, Riemann integrals are adequate. A.7.7 A continuous probability distn'bution on Rk is a probability P that is defined by the relation
P(A)
=
L p(x)dx
=
1
(A.7.8)
for some density function p and all events A. P defined by A.7.8 are usually called abso lutely continuous. We will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely. It may be shown that a function P so defined satisfies (A.l.4). Recall that the integral on the right of (A.7.8) is by definition
{ 1A(X)p(x)dx ln•
where 1 A (x ) = 1 if x E A, and 0 otherwise. Geometrically, P(A) is the volume of the "cylinder" with base A and height p(x) at x. An important special case of (A.7.8) is given by (A.7.9)
450
A Review of Basic Probability Theory
Appendix A
Jt turns out that a continuous probability distribution determines the density that generates it "uniquely."( I ) Although in a continuous model P( { x}) = 0 for every x, the density function has an operational interpretation close to that of the frequency function. For instance, if p is a continuous density on R, :r:0 and x1 are in R, and h is close to 0, then by the mean value theorem
P([xo - h, xo + h]) p(xo) "' P([xo - h, xo + h]) "' 2hp(xo) and . (A.7.10) P([Xt - h X I + h]) p(Xt ) The ratio p(xo)/p(xl ) can, thus, be thought of as measuring approximately how much more or less likely we are to obtain an outcome in a neighborhood of xo then one in a neighborhood of x1• 1
A.7.11 The distribution function (d.f.) F is defined by
F(x 1 , . . . ,x•) = P((-oo , x,J
x
. . x (-oo, x.]). ·
(A.7.12)
The d.f. defines P in the sense that if P and Q are two probabilities with the same d.f., then P = Q. When k = 1, F is a function of a real variable characterized by the following properties: '
(A.7. 13)
I I
(A.7.14)
'
• •
•
I' I
;
·.
'
'
"
I
'
x < y =? F(x) < F(y) (Monotone) Xn j x =? F(xn)
•
'
I' '
!
I '
F(x) (Continuous from the right)
limx�oo F(x) = 1 limx�-oo F(x) = 0.
•
'
�
'
!
I '
(A.7.1 5)
i
(A.7. 16)
1
It may be shown that any function F satisfying (A. 7. 13HA.7.16) defines a unique P on the real line. We always have
F(x) - F(x - 0) (2) = P({x)).
(A.7.17)
Thus, F is continuous at x if and only if P( {x}) = 0. References
I
Gnedenko (1967) Chapter 4, Sections 21, 22 Hoe!, Port, and Stone (1971) Sections 3 . 1 , 3.2, 5.1, 5.2 Parzen ( 1960) Chapter 4, Sections 1-4, 7 Pitman (1993) Sections 3.4, 4.1 and 4.5
l !
•
•
i
J
Section A.8
A.B
451
Random Vari<Jbles and Vectors: Transformations
RAN DOM VARIABLES AND VECTORS: TRANSFOR MATIONS
Although sample spaces can be very diverse, the statistician is usually interested primarily in one or more numerical characteristics of the sample point that has occurred. For example, we measure the weight of pigs drawn at random from a population, the time to breakdown and length of repair time for a randomly chosen machine, the yield per acre of a field of wheat in a given year, the concentration of a certain pollutant in the atmosphere, and so on. In the probability model, these quantities will correspond to random variables and vectors.
A.8.1 A random variable X is a function from f!to R such that the set {w : X(w) E B} x-'(B) is in 0 for every B E B.(l l
=
Xk) T is k-tuple of random variables, or equivalently a function from !1 to Rk such that the set {w , X(w) E: B} � x- 1 (B) is in A for every B E Bk_(I) For k = 1 random vectors are just random variables. The event x- 1 ( B) will usually be written [X E B] and P([X E B]) will be written P[X E B]. The probability distribution of a random vector X is, by definition, the probability measure Px in the model ( R k , 13k, Px ) given by
A.8.2 A random vector X = ( X1,
•
•
.
,
Px(B) � P[X E B[.
(A.8.3)
A.8.4 A random vector is said to have a continuous or discrete distribution (or to be contin
uous or discrete) according to whether its probability distribution is continuous or discrete. Similarly, we will refer to the frequencyfunction, density, dj, and so on of a random vector
when we are, in fact, referring to those features of its probability distribution. The subscript X or X will be used for densities, d.f.'s, and so on to indicate which vector or variable they correspond to unless the reference is clear from the context in which case they will be omitted. The probability of any event that is expressible purely in tenns of X can be calculated if we know only the probability distribution of X. In the discrete case this means we need only know the frequency function and in the continuous case the density. Thus, from (A.7.5) and (A.7.8) P[X E A]
L P(x),
xE A
if X is discrete
L p(x)dx, if X is continuous.
(A.8.5)
When we are interested in particular random variables or vectors. we will describe them purely in terms of their probability distributions without any further specification of the underlying sample space on which they are defined. The study of real- or vector-valued functions of a random vector X is central in the theory of probability and of statistics. Here is the formal definition of such transformations. Letg be any function from Rk to Rm, k, m > I, such that<21 g-1(B) {y E Rk : g(y) E �
452
A Review of B<1sic Probdbility Theory
£3} E 8/.: for every B E 8"'.
Then the random tran�(ormatirm
g (X) (w) = g(X(w) ).
g( X)
Appendix A
is defined by (A.8.6)
An example of a transformation often used in statistics is
g
(g 1 , g2 /
with
g1 (X)
= X and 92(X) = k- 1 E�· 1 ( Xi - X)2. Another common example is g(X) = (min{ X,), max( X,))'. The probability distribution of g(X) is completely detennined by that of X through 1 (A.8.7) P[g(X) E B] = P[X E g- ( B)].
k- 1 L; 1 Xi
If X is discrete with frequency function
tion
px , then g(X) is discrete and has frequency func
Pg(XJ(t) = Suppose that
X
=
L Px (x) .
(A.8.8)
{x:g(x)=t}
is continuous with density
Px and g is real-valued and one-to-one C3l
S such that P[X E S] = 1. Furthennore, assume that the derivative l of g exists and does not vanish on S. Then g(X) is continuous with density given by on an open set
'
'
(A.8.9)
!
'
'
g(S), and 0 otherwise. This is called the change of variableformula. If g(X) = aX + !'. a # 0, and X is continuous, then
'
for t E
Pg(X) (t) = From (A.8.8) it follows that if (X1
t - I' ) ( PX j';J 1
a
·
•
'
• '
(A.8.10)
Yf is a discrete random vector with frequency func tion P(X, Y)· then the frequency function of X, known as the marginal frequency function,
i
'
'
'
is given by(4) • '
i (
t '
px(x) = LP(X ,YJ(x,y) . y
(A.8 . 1 1 )
(X, Y)T is continuous with density P( X,Y)• it may be shown (as a consequence of (A.8.7) and (A.7.8)) that X is a marginal density function given by
' •
Similarly, if
px(x)
�
These notions generalize to the case
1: P(X,YJ(x, y)dy.<5l
integrating out over y in
P(X,YJ(x, y) .
! •
1 i '
Z = (X, Y), a random vector obtained by putting two
random vectors together. The (marginal) frequency or density of X is found as and (A.8.12) by summing or
(A.8.12)
. • • •
in (A.8. 1 1 )
Discrete random variables may be used to approximate continuous ones arbitrarily
!
• '
;
!
•
i
•
closely and vice versa.
I I !
•
;
' '
j
-
Section A.9
453
Independence of Random Variables and Vectors
In practice, all random variables are discrete because there is no instrument that can measure with perfect accuracy. Nevertheless, it is common in statistics to work with con tinuous distributions, which may be easier to deal with. The justification for this may be theoretical or pragmatic. One possibility is that the observed random variable or vector is obtained by rounding off to a large number of places the true unobservable continuous random variable specified by some idealized physical modeL Or else, the approximation of a discrete distribution by a continuous one is made reasonable by one of the limit theorems of Sections A. l 5 and B.7.
A.8.13 A convention: We shall write X = Y if the probability of the event [X fc Yj is 0. References Gnedenko (1967) Chapter 4, Sections 2 1-24 Grimmett and Stirzaker ( 1992) Section 4.7 Hoe!, Port, and Stone (197 1 ) Sections 3.3, 5.2, 6. 1, 6.4 Parzen (1960) Chapter 7, Sections 1-5, 8, 9 Pitman (1993) Section 4.4
INDEPENDENCE O F RANDOM VARIABLES AND VECTORS
A.9
A.9.1 Two random variables X1 and X2 are said to be independent if and only if for sets A and B in B, the events [X1 E AJ and [X, E BJ are independent. A.9.2 The random variables X1 , , Xn are said to be (mutually) independent if and only if for any sets A1, , An in B, the events [X 1 E A1], . . [Xn E An] are independent. To generalize these definitions to random vectors X 1 , . . , Xn (not necessarily of the same dimensionality) we need only use the events [Xi E Ai] where Ai is a set in the range of xi. A.9.3 By (A.8.7), if X and Y are independent, so are g(X) and h( Y) , whatever be g and h. For example. if (X1, X2 ) and (Yt . Y2 ) are independent, so are X1 + X2 and Yt Y2, (X1,X1X2) and Y2 , and so on. •
.
•
•
•
.
.
1
.
Theorem A.9.1. Suppose X = (X1, . . . , Xn) is either a discrete or continuous random vector. Then the random variables X 1 , . . . , Xn are independent if, and only if, either ofthe following two conditions hold: (A.9.4) Px(x,, .
.
.
, Xn) = Px, (xi) . .
.
px. (xn)
for all x , , . . , Xn. .
(A.9.5)
A.9.61fthe Xi are all continuous and independent, then X = (Xt, . . . , , Xn) is continuous. A.9.7 The preceding equivalences are valid for random vectors XI J . . (X�o . , Xn). .
.
.
1
Xn with X -
A Re iew of Basic Probability Theory
454
i
i
v
Appendix A
Xn are independent identically distributed k-dimensional random vectors with d.f. Fx or density (frequency function) px, then X 1 X11 is called a random sample (?{ si::.e n from a population with d.f. Fx or density (frequency function) Px- In
A.9.8 If X1
•
.
.
.
,
•
.
.
.
,
statistics, such a random sample is often obtained by selecting n members at random in the sense of (A.3.4) from a population and measuring k characteristics on each member. If A is any event, we define the random variable 1 A, the indicator of the event A, by
1 if w E A
0 otherwise.
(A9.9)
If we perfonn n Bernoulli trials with probability of success p and we let Xi be the indicator of the event (success on the ith trial), then the Xi form a sample from a distribution that assigns probability p to 1 and { 1 - p) to 0. Such samples will be referred to as the indicato rs of n Bernoulli rn·ats with probability ofsuccess p. References
• l-
i '
Gnedenko ( 1967) Chapter 4, Sections 23, 24 Grimmett and Stirzaker (1992) Sections 3.2, 4.2 Hoe!, Port, and Stone (1971) Section 3.4 Parzen (1960) Chapter 7, Sections 6, 7 Pitman (1993) Sections 2.5, 5.3
A.lO
THE EXPECTATION O F A RANDOM VARIABLE
Let X be the height of an individual sampled at random from a finite population. Then a reasonable measure of the center of the distribution of X is the average height of an individual in the given population. If x_1, ..., x_q are the only heights present in the population, it follows that this average is given by Σ_{i=1}^q x_i P[X = x_i], where P[X = x_i] is just the proportion of individuals of height x_i in the population. The same quantity arises (approximately) if we use the long-run frequency interpretation of probability and calculate the average height of the individuals in a large sample from the population in question. In line with these ideas we develop the general concept of expectation as follows.

If X is a nonnegative, discrete random variable with possible values {x_1, x_2, ...}, we define the expectation or mean of X, written E(X), by

E(X) = Σ_{i=1}^∞ x_i p_X(x_i).   (A.10.1)

(Infinity is a possible value of E(X). Take x_i = i, p_X(i) = 1/(i(i + 1)), i = 1, 2, ....)

A.10.2 More generally, if X is discrete, decompose {x_1, x_2, ...} into two sets A and B where A consists of all nonnegative x_i and B of all negative x_i. If either Σ_{x_i ∈ A} x_i p_X(x_i) < ∞
or Σ_{x_i ∈ B} (−x_i) p_X(x_i) < ∞, we define E(X) unambiguously by (A.10.1). Otherwise, we leave E(X) undefined.

Here are some properties of the expectation that hold when X is discrete. If X is a constant, X(ω) = c for all ω, then

E(X) = c.   (A.10.3)

If X = 1_A (cf. (A.9.9)), then

E(X) = P(A).   (A.10.4)

If X is an n-dimensional random vector, if g is a real-valued function on R^n, and if E(|g(X)|) < ∞, then it may be shown that

E(g(X)) = Σ_{i=1}^∞ g(x_i) p_X(x_i).   (A.10.5)

As a consequence of this result, we have

E(|X|) = Σ_{i=1}^∞ |x_i| p_X(x_i).   (A.10.6)

Taking g(x) = Σ_{i=1}^n α_i x_i we obtain the fundamental relationship

E(Σ_{i=1}^n α_i X_i) = Σ_{i=1}^n α_i E(X_i)   (A.10.7)

if α_1, ..., α_n are constants and E(|X_i|) < ∞, i = 1, ..., n.

A.10.8 From (A.10.7) it follows that if X ≤ Y and E(X), E(Y) are defined, then E(X) ≤ E(Y).

If X is a continuous random variable, it is natural to attempt a definition of the expectation via approximation from the discrete case. Those familiar with Lebesgue integration will realize that this leads to

E(X) = ∫_{-∞}^∞ x p_X(x) dx   (A.10.9)

as the definition of the expectation or mean of X whenever ∫_0^∞ x p_X(x) dx or ∫_{-∞}^0 (−x) p_X(x) dx is finite. Otherwise, E(X) is left undefined.
A.10.10 A random variable X is said to be integrable if E(|X|) < ∞. It may be shown that if X is a continuous k-dimensional random vector and g(X) is any random variable such that

∫_{R^k} |g(x)| p_X(x) dx < ∞,

then E(g(X)) exists and

E(g(X)) = ∫_{R^k} g(x) p_X(x) dx.   (A.10.11)
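As a numerical illustration of (A.10.11) (a minimal Python sketch, assuming NumPy and SciPy are available), take X ~ U(0, 1) and g(x) = x², for which E(g(X)) = 1/3; the integral and a Monte Carlo average agree.

```python
# Check E(g(X)) = integral of g(x) p_X(x) dx for X ~ U(0,1), g(x) = x^2.
import numpy as np
from scipy import integrate

g = lambda x: x ** 2
p_X = lambda x: 1.0                      # density of U(0, 1) on its support

integral, _ = integrate.quad(lambda x: g(x) * p_X(x), 0.0, 1.0)
rng = np.random.default_rng(0)
monte_carlo = g(rng.uniform(0.0, 1.0, size=1_000_000)).mean()

print(integral, monte_carlo)             # both approximately 1/3
```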
In the continuous case expectation properties (A.10.3), (A.10.4), (A.10.7), and (A.10.8) as well as continuous analogues of (A.10.5) and (A.10.6) hold. It is possible to define the expectation of a random variable in general using discrete approximations. The interested reader may consult an advanced text such as Chung (1974), Chapter 3.

The formulae (A.10.5) and (A.10.11) are both sometimes written as

E(g(X)) = ∫_{R^k} g(x) dF(x) or ∫_{R^k} g(x) dP(x)   (A.10.12)

where F denotes the distribution function of X and P is the probability function of X defined by (A.8.5). A convenient notation is dP(x) = p(x) dμ(x), which means

∫ g(x) dP(x) = Σ_{i=1}^∞ g(x_i) p(x_i) in the discrete case
             = ∫_{-∞}^∞ g(x) p(x) dx in the continuous case.   (A.10.13)

We refer to μ = μ_P as the dominating measure for P. In the discrete case μ assigns weight one to each of the points in {x : p(x) > 0} and it is called counting measure. In the continuous case dμ(x) = dx and μ is called Lebesgue measure. We will often refer to p(x) as the density of X in the discrete case as well as the continuous case.

References
Chung (1974) Chapter 3
Gnedenko (1967) Chapter 5, Section 26
Grimmett and Stirzaker (1992) Sections 3.3, 4.3
Hoel, Port, and Stone (1971) Sections 4.1, 7.1
Parzen (1960) Chapter 5; Chapter 8, Sections 1-4
Pitman (1993) Sections 3.3, 3.4, 4.1
A.11 MOMENTS
A.11.1 If k is any natural number and X is a random variable, the kth moment of X is defined to be the expectation of X^k. We assume that all moments written here exist. By (A.10.5) and (A.10.11),

E(X^k) = Σ_x x^k p_X(x) if X is discrete
       = ∫_{-∞}^∞ x^k p_X(x) dx if X is continuous.   (A.11.2)

In general, the moments depend on the distribution of X only.
A.11.3 The distribution of a random variable is typically uniquely specified by its moments. This is the case, for example, if the random variable possesses a moment generating function (cf. (A.12.1)).
A.11.4 The kth central moment of X is by definition E[(X − E(X))^k], the kth moment of (X − E(X)), and is denoted by μ_k.

A.11.5 The second central moment is called the variance of X and will be written Var X. The nonnegative square root of Var X is called the standard deviation of X. The standard deviation measures the spread of the distribution of X about its expectation. It is also called a measure of scale. Another measure of the same type is E(|X − E(X)|), which is often referred to as the mean deviation. The variance of X is finite if and only if the second moment of X is finite (cf. (A.11.15)). If a and b are constants, then by (A.10.7)

Var(aX + b) = a² Var X.   (A.11.6)

(One side of the equation exists if and only if the other does.)
A.11.7 If X is any random variable with well-defined (finite) mean and variance, the standardized version or Z-score of X is the random variable Z = (X − E(X))/√Var X. By (A.10.7) and (A.11.6) it follows then that

E(Z) = 0 and Var Z = 1.   (A.11.8)

A.11.9 If E(X²) = 0, then X = 0. If Var X = 0, X = E(X) (a constant). These results follow, for instance, from (A.15.2).

A.11.10 The third and fourth central moments are used in the coefficient of skewness γ_1 and the kurtosis γ_2, which are defined by

γ_1 = μ_3/σ³,  γ_2 = μ_4/σ⁴ − 3,

where σ² = Var X. See also Section A.12, where γ_1 and γ_2 are expressed in terms of cumulants. These descriptive measures are useful in comparing the shapes of various frequently used densities.
A.11.11 If Y = a + bX with b > 0, then the coefficient of skewness and the kurtosis of Y are the same as those of X. If X ~ N(μ, σ²), then γ_1 = γ_2 = 0.

A.11.12 It is possible to generalize the notion of moments to random vectors. For simplicity we consider the case k = 2. If X_1 and X_2 are random variables and i, j are natural numbers, then the product moment of order (i, j) of X_1 and X_2 is, by definition, E(X_1^i X_2^j). The central product moment of order (i, j) of X_1 and X_2 is again by definition E[(X_1 − E(X_1))^i (X_2 − E(X_2))^j]. The central product moment of order (1, 1) is
called the covariance of X_1 and X_2 and is written Cov(X_1, X_2). By expanding the product (X_1 − E(X_1))(X_2 − E(X_2)) and using (A.10.3) and (A.10.7), we obtain the relations

Cov(aX_1 + bX_2, cX_3 + dX_4) = ac Cov(X_1, X_3) + bc Cov(X_2, X_3) + ad Cov(X_1, X_4) + bd Cov(X_2, X_4)   (A.11.13)

and

Cov(X_1, X_2) = E(X_1 X_2) − E(X_1)E(X_2).   (A.11.14)

If X_1' and X_2' are distributed as X_1 and X_2 and are independent of each other and of (X_1, X_2), then Cov(X_1, X_2) = E[(X_1 − X_1')(X_2 − X_2')].

If we put X_1 = X_2 = X in (A.11.14), we get the formula

Var X = E(X²) − [E(X)]².   (A.11.15)

The covariance is defined whenever X_1 and X_2 have finite variances and in that case

[Cov(X_1, X_2)]² ≤ Var X_1 Var X_2   (A.11.16)

with equality holding if and only if (1) X_1 or X_2 is a constant or (2) (X_1 − E(X_1)) = [Cov(X_1, X_2)/Var X_2](X_2 − E(X_2)). This is the correlation inequality. It may be obtained from the Cauchy–Schwarz inequality,

[E(Z_1 Z_2)]² ≤ E(Z_1²) E(Z_2²)   (A.11.17)

for any two random variables Z_1, Z_2 such that E(Z_1²) < ∞, E(Z_2²) < ∞. Equality holds if and only if one of Z_1, Z_2 equals 0 or Z_1 = aZ_2 for some constant a. The correlation inequality corresponds to the special case Z_1 = X_1 − E(X_1), Z_2 = X_2 − E(X_2). A proof of the Cauchy–Schwarz inequality is given in Remark 1.4.1.

The correlation of X_1 and X_2, denoted by Corr(X_1, X_2), is defined whenever X_1 and X_2 are not constant and the variances of X_1 and X_2 are finite by

Corr(X_1, X_2) = Cov(X_1, X_2)/√((Var X_1)(Var X_2)).   (A.11.18)

The correlation of X_1 and X_2 is the covariance of the standardized versions of X_1 and X_2. The correlation inequality is equivalent to the statement

|Corr(X_1, X_2)| ≤ 1.   (A.11.19)

Equality holds if and only if X_2 is a linear function (X_2 = a + bX_1, b ≠ 0) of X_1.
If X_1, ..., X_n have finite variances, we obtain as a consequence of (A.11.13) the relation

Var(X_1 + ··· + X_n) = Σ_{i=1}^n Var X_i + 2 Σ_{i<j} Cov(X_i, X_j).   (A.11.20)

If X_1 and X_2 are independent and X_1 and X_2 are integrable, then

E(X_1 X_2) = E(X_1)E(X_2)   (A.11.21)

or, in view of (A.11.14),

Cov(X_1, X_2) = 0, Corr(X_1, X_2) = 0 when Var(X_i) > 0, i = 1, 2.   (A.11.22)

This may be checked directly. It is not true in general that X_1 and X_2 that satisfy (A.11.22) (i.e., are uncorrelated) need be independent. The correlation coefficient roughly measures the amount and sign of linear relationship between X_1 and X_2. It is −1 or 1 in the case of perfect relationship (X_2 = a + bX_1, b < 0 or b > 0, respectively). See also Section 1.4.

As a consequence of (A.11.22) and (A.11.20), we see that if X_1, ..., X_n are independent with finite variances, then

Var(X_1 + ··· + X_n) = Σ_{i=1}^n Var X_i.   (A.11.23)
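As a quick numerical sketch (assuming NumPy; the choice of distributions here is illustrative only), one can check (A.11.15) and the correlation inequality (A.11.19) on simulated data.

```python
# Empirical check of Var X = E(X^2) - [E(X)]^2 and |Corr| <= 1.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.exponential(scale=2.0, size=200_000)
x2 = 0.5 * x1 + rng.normal(size=200_000)          # linearly related plus noise

var_direct = x1.var()
var_via_moments = (x1 ** 2).mean() - x1.mean() ** 2   # (A.11.15)
corr = np.corrcoef(x1, x2)[0, 1]                      # must lie in [-1, 1]

print(var_direct, var_via_moments, corr)
```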
References
Gnedenko (1967) Chapter 5, Sections 27, 28, 30
Hoel, Port, and Stone (1971) Sections 4.2-4.5, 7.3
Parzen (1960) Chapter 5; Chapter 8, Sections 1-4
Pitman (1993) Section 6.4
A.12 MOMENT AND CUMULANT GENERATING FUNCTIONS
A.12.1 If E(e^{s_0|X|}) < ∞ for some s_0 > 0, M_X(s) = E(e^{sX}) is well defined for |s| ≤ s_0 and is called the moment generating function of X. By (A.10.5) and (A.10.11),

M_X(s) = Σ_{i=1}^∞ e^{s x_i} p_X(x_i) if X is discrete
       = ∫_{-∞}^∞ e^{sx} p_X(x) dx if X is continuous.   (A.12.2)

If M_X is well defined in a neighborhood {s : |s| ≤ s_0} of zero, all moments of X are finite and

M_X(s) = Σ_{k=0}^∞ E(X^k) s^k / k!,  |s| ≤ s_0.   (A.12.3)
A.12.4 The moment generating function M_X has derivatives of all orders at s = 0 and

d^k/ds^k M_X(s)|_{s=0} = E(X^k).
A.12.5 If defined, M_X determines the distribution of X uniquely and is itself uniquely determined by the distribution of X. If X_1, ..., X_n are independent random variables with moment generating functions M_{X_1}, ..., M_{X_n}, then X_1 + ··· + X_n has moment generating function given by

M_{X_1 + ··· + X_n}(s) = Π_{i=1}^n M_{X_i}(s).   (A.12.6)
This follows by induction from the definition and (A.11.21). For a generalization of the notion of moment generating function to random vectors, see Section B.5.

The function

K_X(s) = log M_X(s)   (A.12.7)

is called the cumulant generating function of X. If M_X is well defined in some neighborhood of zero, K_X can be represented by the convergent Taylor expansion

K_X(s) = Σ_{j=0}^∞ (c_j / j!) s^j,   (A.12.8)

where

c_j = c_j(X) = d^j/ds^j K_X(s)|_{s=0}   (A.12.9)

is called the jth cumulant of X, j ≥ 1. For j ≥ 2 and any constant a, c_j(X + a) = c_j(X). If X and Y are independent, then c_j(X + Y) = c_j(X) + c_j(Y). The first cumulant c_1 is the mean μ of X, c_2 and c_3 equal the second and third central moments μ_2 and μ_3 of X, and c_4 = μ_4 − 3μ_2². The coefficients of skewness and kurtosis (see (A.11.10)) can be written as γ_1 = c_3/c_2^{3/2} and γ_2 = c_4/c_2². If X is normally distributed, c_j = 0 for j ≥ 3. See Problem B.3.8.
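As an illustrative sketch (assuming NumPy and SciPy), the cumulant relations above can be checked for the Poisson distribution, all of whose cumulants equal λ, so that γ_1 = λ^{-1/2} and γ_2 = 1/λ.

```python
# Compare simulated skewness/kurtosis of Poisson(lam) with the cumulant formulas.
import numpy as np
from scipy import stats

lam = 4.0
x = np.random.default_rng(2).poisson(lam, size=1_000_000)

print(stats.skew(x), lam ** -0.5)        # both about 0.5
print(stats.kurtosis(x), 1.0 / lam)      # Fisher (excess) kurtosis, both about 0.25
```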
References
Hoel, Port, and Stone (1971) Chapter 8, Section 8.1
Parzen (1960) Chapter 5, Section 3; Chapter 8, Sections 2-3
Rao (1973) Section 2b.4
A.13 SOME CLASSICAL DISCRETE AND CONTINUOUS DISTRIBUTIONS
By definition, the probability distribution of a random variable or vector is just a probability measure on a suitable Euclidean space. In this section we introduce certain families of distributions, which arise frequently in probability and statistics, and list some of their properties. Following the name of each distribution we give a shorthand notation that will sometimes be used, as will obvious abbreviations such as "binomial (n, θ)" for "the binomial distribution with parameter (n, θ)". The symbol p as usual stands for a frequency or density function. If anywhere below p is not specified explicitly for some value of x, it shall be assumed that p vanishes at that point. Similarly, if the value of the distribution function F is not specified outside some set, it is assumed to be zero to the "left" of the set and one to the "right" of the set.
I. Discrete Distributions

The binomial distribution with parameters n and θ: B(n, θ).

p(k) = C(n, k) θ^k (1 − θ)^{n−k},  k = 0, 1, ..., n.   (A.13.1)

The parameter n can be any integer > 0 whereas θ may be any number in [0, 1].
A.13.2 If X is the total number of successes obtained in n Bernoulli trials with probability of success θ, then X has a B(n, θ) distribution (see (A.6.3)).

If X has a B(n, θ) distribution, then

E(X) = nθ, Var X = nθ(1 − θ).   (A.13.3)

Higher-order moments may be computed from the moment generating function

M_X(t) = [θe^t + (1 − θ)]^n.   (A.13.4)

A.13.5 If X_1, X_2, ..., X_k are independent random variables distributed as B(n_1, θ), B(n_2, θ), ..., B(n_k, θ), respectively, then X_1 + X_2 + ··· + X_k has a B(n_1 + ··· + n_k, θ) distribution. This result may be derived by using (A.12.5) and (A.12.6) in conjunction with (A.13.4).
p(k)
�
(A.I3.6)
for k a natural number with max(O, n - (N - D)) and n may be any natural nwnbers that
1t(D, N,n).
are
:S
k < min(n, D). The parameters D
less than or equal to the natural number N.
A.13.7 If X is the number of defectives (special objects) in a sample of size n taken without
X has If the sample is taken with replacement, X has
replacement from a population with D defectives and N - D nondefectives, then
H( D, N, n) distribution (see (A.6. I 0)). a B(n DfN) distribution. an
,
' •
462
A Review of Basic Probability Theory
If X has an H(D, N, n) distribution, then
D E(X) = n ' Var X = N
D nN
(
D !N
)
N-n N_ 1.
Appendix A
• •
7
The Poisson distribution with parameter .\ : P(.\).
e-'>.k p(k) = k!
(A.I3.9)
for k = 0, 1, 2, . . . The parameter >.. can be any positive number. If X has a P(.\) distribution, then .
=
Var X
= .\.
(A. 13.10)
The moment generating function of X is given by
1
Mx(t) = e'<,'- ) .
(A. 1 3 . 1 1 )
If X1, X2, . . . , Xn are independent random variables with P(.\1), P( .\2 ) , . . . , P .\n) distributions, respectively, then X1 + X2 + -i-Xn has the P(.\1 + .\2 + · · · + .\n) distribution. This result may be derived in th same manner as the corresponding fact for the binomial distribution.
A.l3.12
(
The multinomial distribution with parameters n, B1,
.
•
•
,
Bq : M(n, 01,
•
•
.
, Oq)•
n! k (A. l 3 . 1 3) p(k l , · · · , kq ) = k ! . . . kq !(}kl l ' ' • (jqq l whenever ki are nonnegative integers such that 'L.? 1 ki = n. The par�meter n is any natural number while (01, , Oq) is any vector in q q, 2:)• = 1 e = (&, . . e,) : e, > o, i=l •
.
.
.
1
.
A.13.14 If X = (Xt, . . . , Xq ) , where Xi is the number of times outcome wi occurs in n '
, Oq). then X has a M(n, ()11
multinomial trials with probabilities (01, . . . tion (see (A.6.6)). If X has a M (n, 01, . . , B,) distribution,
.
.
•
, Bq) distribu
.
If X has a M(n, θ_1, ..., θ_q) distribution, then

E(X_i) = nθ_i, Var X_i = nθ_i(1 − θ_i), Cov(X_i, X_j) = −nθ_iθ_j, i ≠ j, i, j = 1, ..., q.   (A.13.15)

These results may either be derived directly or by a representation such as that discussed in (A.13.8) and an application of formulas (A.10.4), (A.10.7), (A.13.13), and (A.11.20).
A.13.16 If X has a M(n, θ_1, ..., θ_q) distribution, then (X_{i_1}, ..., X_{i_s}, n − Σ_{j=1}^s X_{i_j}) has a M(n, θ_{i_1}, ..., θ_{i_s}, 1 − Σ_{j=1}^s θ_{i_j}) distribution for any set {i_1, ..., i_s} ⊂ {1, ..., q}. Therefore, X_j has a B(n, θ_j) distribution for each j and, more generally, Σ_{j=1}^s X_{i_j} has a B(n, Σ_{j=1}^s θ_{i_j}) distribution if s < q. These remarks follow from the interpretation (A.13.14).
II. Continuous Distributions

Before beginning our listing we introduce some convenient notations: X ~ F and X ~ p will similarly mean that X is a random variable with d.f. F, density or frequency function p.

Let Y be a random variable with d.f. F. Let F_μ be the d.f. of Y + μ. The family F_L = {F_μ : −∞ < μ < ∞} is called a location parameter family, μ is called a location parameter, and we say that Y generates F_L. By definition, for any μ, X ~ F_μ ⇔ X − μ ~ F. Therefore, for any μ, γ,

F_μ(x) = F(x − μ) = F_γ(x − (μ − γ)),

and all calculations involving F_μ can be referred back to F or any other member of the family. Similarly, if Y generates F_L, so does Y + γ for any fixed γ. If Y has a first moment, it follows that we may without loss of generality (as far as generating F_L goes) assume that E(Y) = 0. Then if X ~ F_μ, E(X) = μ.

Similarly, let F_σ^s be the d.f. of σY, σ > 0. The family F_S = {F_σ^s : σ > 0} is called a scale parameter family, σ is a scale parameter, and Y is said to generate F_S. By definition, for any σ > 0, X ~ F_σ^s ⇔ X/σ ~ F. Again all calculations involving one member of the family can be referred back to any other because for any σ, τ > 0,

F_σ^s(x) = F_τ^s(τx/σ).

If Y generates F_S and Y has a first moment different from 0, we may without loss of generality take E(Y) = 1 and, hence, if X ~ F_σ^s, then E(X) = σ. Alternatively, if Y has a second moment, we may select F as being the unique member of the family F_S having Var Y = 1, and then X ~ F_σ^s ⇒ Var X = σ².

Finally, define F_{μ,σ} as the d.f. of σY + μ. The family F_{L,S} = {F_{μ,σ} : −∞ < μ < ∞, σ > 0} is called a location-scale parameter family, μ is called a location parameter and σ a scale parameter, and Y is said to generate F_{L,S}. From

F_{μ,σ}(x) = F((x − μ)/σ) = F_{γ,τ}(τ(x − μ)/σ + γ),

we see as before how to refer calculations involving one member of the family back to any other. Without loss of generality, if Y has a second moment, we may take E(Y) = 0, Var Y = 1. Then if X ~ F_{μ,σ}, we obtain

E(X) = μ, Var X = σ².

Clearly F_μ = F_{μ,1}, F_σ^s = F_{0,σ}. The relation between the density of F_{μ,σ} and that of F is given by (A.8.10). All the families of densities we now define are location-scale or scale families.
(AI3.17)
The parameter J.l can be any real number while a is positive. The normal distribution with 1 is known as the standard normal distribution. Its density will be denoted 11 = 0 and by \P(z) and its d.f. by
a=
A.13.18 The family of N(JJ., <72) distributions is a location-scale family. If Z has a N(O, 1) distribution, then Z + Jl. has a N(JJ., <72 ) distribution, and conversely if has a N(JJ., <72 )
<7 distribution, then (X - JJ.)/<7 has a standard normal distribution. If X has a N(JJ., <72) distribution, then
X
(A.I3.19)
{ <7 t 2 2 Mx(t) = exp pt 2 }
More generally, all moments may be obtained from
+
for -oo <
t
<
oo.
a2 = 1, then k t 2 [ (2 )! l k Mx(t) = L=O z•k! (2k)' k
'
(AI3.20)
In particular if {L = 0,
=
.
,
,
(A I 3 .2 1 )
A.13.23 If X., . . , Xn
0
z•f2(k/2)!
if k > 0 is even.
1
, '
are
(AI3.22)
. • .
j
•
'
= JJ . ; , E(X;) CiXi
independent normal random variables such that has a Var Xi = a�, and c1, , Cn are any constants that are not all 0. then E� 1 N(ctJJ.t + · · · + CnJl.n o ci
i
I
ifk > O is odd k!
'
.
•
and, hence, in this case we can conclude from (A.l2.4) that
E(Xk) E(Xk)
'
I
1
'
'
!
'
,
•
'
The exponential distribution with parameter λ: E(λ).

p(x) = λe^{−λx}, x > 0.   (A.13.24)

The range of λ is (0, ∞). The distribution function corresponding to this p is given by

F(x) = 1 − e^{−λx} for x > 0.   (A.13.25)

A.13.26 If σ = 1/λ, then σ is a scale parameter. E(1) is called the standard exponential distribution.

If X has an E(λ) distribution,

E(X) = 1/λ, Var X = 1/λ².   (A.13.27)

More generally, all moments may be obtained from

M_X(t) = 1/(1 − (t/λ)) = Σ_{k=0}^∞ (k!/λ^k) t^k/k!   (A.13.28)

which is well defined for t < λ. Further information about the exponential distribution may be found in Appendix B.
The uniform distribution on (a, b): U(a, b).

p(x) = 1/(b − a), a < x < b,   (A.13.29)

where (a, b) is any pair of real numbers such that a < b. The corresponding distribution function is given by

F(x) = (x − a)/(b − a) for a < x < b.   (A.13.30)

If X has a U(a, b) distribution, then

E(X) = (a + b)/2, Var X = (b − a)²/12.   (A.13.31)

A.13.32 If we set μ = a, σ = (b − a), then we can check that the U(a, b) family is a location-scale family generated by Y, where Y ~ U(0, 1).
References
Gnedenko (1967) Chapter 4, Sections 21-24; Chapter 5, Sections 26-28, 30
Hoel, Port, and Stone (1971) Sections 3.4.1, 5.3.1, 5.3.2
Parzen (1962) Chapter 4, Sections 4-6; Chapter 5; Chapter 6
Pitman (1993) pages 475-487
A.14 MODES OF CONVERGENCE OF RANDOM VARIABLES AND LIMIT THEOREMS

Much of probability theory can be viewed as variations, extensions, and generalizations of two basic results, the central limit theorem and the law of large numbers. Both of these theorems deal with the limiting behavior of sequences of random variables. The notions of limit that are involved are the subject of this section. All limits in the section are as n → ∞.
A.14.1 We say that the sequence of random variables {Z_n} converges to the random variable Z in probability, and write Z_n →^P Z, if P[|Z_n − Z| ≥ ε] → 0 as n → ∞ for every ε > 0. That is, Z_n →^P Z if the chance that Z_n and Z differ by any given amount is negligible for n large enough.

A.14.2 We say that the sequence {Z_n} converges in law (in distribution) to Z, and write Z_n →^L Z, if F_{Z_n}(t) → F_Z(t) for every point t such that F_Z is continuous at t. (Recall that F_Z is continuous at t if and only if P[Z = t] = 0 (A.7.17).) This is the mode of convergence needed for approximation of one distribution by another.

(A.14.3) If Z_n →^P Z, then Z_n →^L Z.

Because convergence in law requires nothing of the joint distribution of the Z_n and Z whereas convergence in probability does, it is not surprising and easy to show that, in general, convergence in law does not imply convergence in probability (e.g., Chung, 1974), but consider the following.

A.14.4 If Z = z_0 (a constant), convergence in law of {Z_n} to Z implies convergence in probability.

Proof. Note that z_0 ± ε are points of continuity of F_Z for every ε > 0. Then

P[|Z_n − z_0| > ε] = 1 − P(Z_n ≤ z_0 + ε) + P(Z_n < z_0 − ε) ≤ 1 − F_{Z_n}(z_0 + ε/2) + F_{Z_n}(z_0 − ε).   (A.14.5)

By assumption the right-hand side of (A.14.5) converges to (1 − F_Z(z_0 + ε/2)) + F_Z(z_0 − ε) = 0.  □

A.14.6 If Z_n →^P z_0 (a constant) and g is continuous at z_0, then g(Z_n) →^P g(z_0).

Proof. If ε is positive, there exists a δ such that |z − z_0| < δ implies |g(z) − g(z_0)| < ε. Therefore,

P[|g(Z_n) − g(z_0)| < ε] ≥ P[|Z_n − z_0| < δ] ≥ 1 − P[|Z_n − z_0| ≥ δ].   (A.14.7)

Because the right-hand side of (A.14.7) converges to 1, by the definition (A.14.1) the result follows.  □
A more general result is given by the following.

A.14.8 If Z_n →^L Z and g is continuous, then g(Z_n) →^L g(Z).

The following theorem due to Slutsky will be used repeatedly.

Theorem A.14.9 If Z_n →^L Z and U_n →^P u_0 (a constant), then
(a) Z_n + U_n →^L Z + u_0,
(b) U_n Z_n →^L u_0 Z.

Proof. We prove (a). The other claim follows similarly. Begin by writing

F_{(Z_n + U_n)}(t) = P[Z_n + U_n ≤ t, U_n > u_0 − ε] + P[Z_n + U_n ≤ t, U_n ≤ u_0 − ε].   (A.14.10)

Let t be a point of continuity of F_{(Z + u_0)}. Because a distribution function has at most countably many points of discontinuity, we may for any t choose ε positive and arbitrarily small such that t ± ε are both points of continuity of F_{(Z + u_0)}. Now by (A.14.10),

F_{(Z_n + U_n)}(t) ≤ P[Z_n ≤ t − u_0 + ε] + P[|U_n − u_0| > ε].   (A.14.11)

Moreover,

P[Z_n ≤ t − u_0] = F_{(Z_n + u_0)}(t).   (A.14.12)

Because F_{Z_n + u_0}(t) = F_{Z_n}(t − u_0), we must have Z_n + u_0 →^L Z + u_0. Thus,

lim sup_n F_{(Z_n + U_n)}(t) ≤ lim_n F_{(Z_n + u_0)}(t + ε) + lim_n P[|U_n − u_0| > ε] = F_{(Z + u_0)}(t + ε).   (A.14.13)

Similarly,

1 − F_{(Z_n + U_n)}(t) = P[Z_n + U_n > t] ≤ P[Z_n > t − u_0 − ε] + P[|U_n − u_0| > ε],   (A.14.14)

and, hence,

lim inf_n F_{(Z_n + U_n)}(t) ≥ lim_n F_{(Z_n + u_0)}(t − ε) = F_{(Z + u_0)}(t − ε).   (A.14.15)

Therefore,

F_{(Z + u_0)}(t − ε) ≤ lim inf_n F_{(Z_n + U_n)}(t) ≤ lim sup_n F_{(Z_n + U_n)}(t) ≤ F_{(Z + u_0)}(t + ε).   (A.14.16)

Because ε may be permitted to tend to 0 and F_{(Z + u_0)} is continuous at t, the result follows.  □
(A. J4. 1 8)
1
468
Appendix A
A Review of Basic Probability Theory
1
Proof. By Slutsky's theorem
Z,. - b = .2_ [a,.(Z,. - b)] _c, 0 · X = 0. a,.
By (A. l4.4), ] Z,. - b] !':. 0. Now apply the mean value theorem to g(Z,.)
a,.[g(Z,. ) - g(b) ]
(A. l4.19)
- g(b) getting
a,.[g' (Z�) (Z,. - b)]
=
]Z,. - b]. Because ]Z,. - bl !':. 0, so does IZ; - bl and, hence, Z� !':. b. By the continuity of g' and (A.l4.6), g' (Z; ) !':. g' (b). Therefore, applying (A.l4.9) again, o g'(Z�)[a,.(Z,. - b)] _c, g'(b)X. A.14.20 Suppose that { Zn} takes on only natural number values and pz,.. (z) pz (z) for t: all z. Then Zn Z. This is immediate because whatever be z, Fz. (z ) = Lt1 0 Pz. (k) � Lt1 0pz(k) = Fz(z ) where [z] = greatest integer < z. The converse is also true and easy to establish. where
IZ� - bl
<
----+
----+
A.l4.21 (Scheffe) Suppose that { Z,.}, Z are continuous and Pz. (z) � pz(z) for (almost) all z. Then Z,. -". Z (Hajek and Sidak, 1967, p. 64).
::
•
. •
,
l
j
'
References
];
•
Grimmett and Stirzaker ( 1992) Sections 7.1-7.4
A.15 FURTHER LIMIT THEOREMS AND INEQUALITIES
The Bernoulli law of large numbers, which we give next, brings us back to the motivation of our definition of probability. There we noted that in practice the relative frequency of occurrence of an event A in many repetitions of an experiment tends to stabilize and that this "limit" corresponds to what we mean by the probability of A. Now, having defined probability and independence of experiments abstractly, we can prove that, in fact, the relative frequency of occurrence of an event A approaches its probability as the number of repetitions of the experiment increases. Such results and their generalizations are known as laws of large numbers. The first was discovered by Bernoulli in 1713. Its statement is as follows.

Bernoulli's (Weak) Law of Large Numbers

If {S_n} is a sequence of random variables such that S_n has a B(n, p) distribution for n ≥ 1, then

S_n/n →^P p.   (A.15.1)

As in (A.13.2), think of S_n as the number of successes in n binomial trials in which we identify success with occurrence of A and failure with occurrence of A^c. Then S_n/n can be interpreted as the relative frequency of occurrence of A in n independent repetitions of the experiment in which A is an event, and the Bernoulli law is now evidently a statement of the type we wanted.

Bernoulli's proof of this result was rather complicated, and it remained for the Russian mathematician Chebychev to give a two-line argument. His generalization of Bernoulli's result is based on an inequality that has proved to be of the greatest importance in probability and statistics.
Chebychev's Inequality

If X is any random variable, then

P[|X| ≥ a] ≤ E(X²)/a².   (A.15.2)

The Bernoulli law follows readily from (A.15.2) and (A.13.3) via the calculation

P[|S_n/n − p| ≥ ε] ≤ E(S_n/n − p)²/ε² = Var S_n/(n²ε²) = p(1 − p)/(nε²) → 0 as n → ∞.   (A.15.3)
A generalization of (A.15.2), which contains various important and useful inequalities, is the following. Let g be a nonnegative function on R such that g is nondecreasing on the range of a random variable Z. Then

P[Z ≥ a] ≤ E(g(Z))/g(a).   (A.15.4)

If we put Z = |X|, g(t) = t² if t ≥ 0 and 0 otherwise, we get (A.15.2). Other important cases are obtained by taking Z = |X|, g(t) = t if t ≥ 0 and 0 otherwise (Markov's inequality), and Z = X, g(t) = e^{st} for s > 0 and all real t (Bernstein's inequality; see B.8.1 for the binomial case of Bernstein's inequality).

Proof of (A.15.4). Note that by the properties of g,

g(a) 1[Z ≥ a] ≤ g(Z) 1[Z ≥ a] ≤ g(Z).   (A.15.5)

Therefore, by (A.10.8),

g(a) P[Z ≥ a] = E(g(a) 1[Z ≥ a]) ≤ E(g(Z)),   (A.15.6)

which is equivalent to (A.15.4).  □
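A small numerical sketch (assuming NumPy; the exponential population here is only an illustration) of Chebychev's inequality (A.15.2): the empirical frequency of |X| ≥ a never exceeds E(X²)/a².

```python
# Empirical check of P[|X| >= a] <= E(X^2)/a^2.
import numpy as np

x = np.random.default_rng(5).exponential(scale=1.0, size=1_000_000)
second_moment = (x ** 2).mean()

for a in (2.0, 3.0, 5.0):
    print(a, (np.abs(x) >= a).mean(), second_moment / a ** 2)
```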
The following result, which follows from Chebychev's inequality, is a useful generalization of Bernoulli's law.

Khintchin's (Weak) Law of Large Numbers

Let {X_i}, i ≥ 1, be a sequence of independent identically distributed random variables with finite mean μ and define S_n = Σ_{i=1}^n X_i. Then

S_n/n →^P μ.   (A.15.7)

Upon taking the X_i to be indicators of binomial trials, we obtain (A.15.1).
De Moivre-Laplace Theorem

Suppose that {S_n} is a sequence of random variables such that for each n, S_n has a B(n, p) distribution where 0 < p < 1. Then

(S_n − np)/√(np(1 − p)) →^L Z,   (A.15.8)

where Z has a standard normal distribution. That is, the standardized versions of S_n converge in law to a standard normal random variable. If we write

(S_n − np)/√(np(1 − p)) = [√n/√(p(1 − p))] (S_n/n − p)

and use (A.14.9), it is easy to see that (A.15.8) implies (A.15.1).
The De Moivre-Laplace theorem is generalized by the following.

Central Limit Theorem

Let {X_i} be a sequence of independent identically distributed random variables with (common) expectation μ and variance σ² such that 0 < σ² < ∞. Then, if S_n = Σ_{i=1}^n X_i,

(S_n − nμ)/(σ√n) →^L Z,   (A.15.9)

where Z has the standard normal distribution.
The last two results are most commonly used in statistics as approximation theorems. Let k and l be nonnegative integers. The De Moivre-Laplace theorem is used as

P[k ≤ S_n ≤ l] = P[k − ½ ≤ S_n ≤ l + ½]
 = P[(k − np − ½)/√(npq) ≤ (S_n − np)/√(npq) ≤ (l − np + ½)/√(npq)]
 ≈ Φ((l − np + ½)/√(npq)) − Φ((k − np − ½)/√(npq))   (A.15.10)

where q = 1 − p. The ½ appearing in k − ½ and l + ½ is called the continuity correction. We have an excellent idea of how good this approximation is. An illustrative discussion is given in Feller (1968, pp. 187-188). A rule of thumb is that for most purposes the approximation can be used when np and n(1 − p) are both larger than 5. Only when the X_i are integer-valued is the first step of (A.15.10) followed. Otherwise (A.15.9) is applied in the form

P[a ≤ S_n ≤ b] ≈ Φ((b − nμ)/(σ√n)) − Φ((a − nμ)/(σ√n)).
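A numerical sketch (assuming SciPy; the parameters are illustrative) of the normal approximation with the continuity correction in (A.15.10), compared with the exact binomial probability.

```python
# Exact P[k <= S_n <= l] for S_n ~ B(20, 0.4) versus the corrected normal approximation.
from scipy import stats

n, p, k, l = 20, 0.4, 6, 10
q = 1.0 - p
exact = stats.binom.cdf(l, n, p) - stats.binom.cdf(k - 1, n, p)

mu, sd = n * p, (n * p * q) ** 0.5
approx = stats.norm.cdf((l + 0.5 - mu) / sd) - stats.norm.cdf((k - 0.5 - mu) / sd)

print(exact, approx)   # the two values agree closely
```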
The central limit theorem (and some of its generalizations) are also used to justify the assumption that "most" random variables that are measures of numerical characteristics of real populations, such as intelligence, height, weight, and blood pressure, are approximately normally distributed. The argument is that the observed numbers are sums of a large number of small (unobserved) independent factors. That is, each of the characteristic variables is expressible as a sum of a large number of small variables such as influences of particular genes, elements in the diet, and so on. For example, height is a sum of factors corresponding to heredity and environment.

If a bound for E|X_i − μ|³ is known, it is possible to give a theoretical estimate of the error involved in replacing P(S_n ≤ b) by its normal approximation:

Berry-Esseen Theorem

Suppose that X_1, ..., X_n are i.i.d. with mean μ and variance σ² > 0. Then, for all n,

sup_x |P[(S_n − nμ)/(σ√n) ≤ x] − Φ(x)| ≤ C E|X_1 − μ|³/(σ³√n),   (A.15.11)

where C is a universal constant.

For a proof, see Chung (1974, p. 224). In practice, if we need the distribution of S_n we try to calculate it exactly for small values of n and then observe empirically when the approximation can be used with safety. This process of combining a limit theorem with empirical investigations is applicable in many statistical situations where the distributions of transformations g(x) (see A.8.6) of interest become progressively more difficult to compute as the sample size increases and yet tend to stabilize. Examples of this process may be found in Chapter 5.

We conclude this section with two simple limit theorems that lead to approximations of one classical distribution by another. The very simple proofs of these results may, for instance, be found in Gnedenko (1967, p. 53 and p. 105).
A.15.12 The first of these results reflects the intuitively obvious fact that if the populations sampled are large and the samples are comparatively small, sampling with and without replacement leads to approximately the same probability distribution. Specifically, suppose that {X_N} is a sequence of random variables such that X_N has a hypergeometric H(D_N, N, n) distribution where D_N/N → p as N → ∞ and n is fixed. Then

P[X_N = k] → C(n, k) p^k (1 − p)^{n−k}   (A.15.13)

as N → ∞ for k = 0, 1, ..., n. By (A.14.20) we conclude that

X_N →^L X,   (A.15.14)

where X has a B(n, p) distribution. The approximation of the hypergeometric distribution by the binomial distribution indicated by this theorem is rather good. For instance, if N = 50, n = 5, and D = 20, the approximating binomial distribution to H(D, N, n) is B(5, 0.4). If H holds, P[X ≤ 2] = 0.690 while under the approximation, P[X ≤ 2] = 0.683. As indicated in this example, the approximation is reasonable when (n/N) ≤ 0.1.
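The N = 50, n = 5, D = 20 figures above can be reproduced directly (a sketch assuming SciPy).

```python
# Exact hypergeometric P[X <= 2] versus its B(5, 0.4) approximation.
from scipy import stats

# scipy's hypergeom takes (M, n, N) = (population size, number of "defectives", sample size)
exact = stats.hypergeom(50, 20, 5).cdf(2)
approx = stats.binom(5, 0.4).cdf(2)

print(round(exact, 3), round(approx, 3))   # 0.690 and 0.683
```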
The next elementary result, due to Poisson, plays an important role in advanced probability theory.

Poisson's Theorem

Suppose that {X_n} is a sequence of random variables such that X_n has a B(n, p_n) distribution and np_n → λ as n → ∞, where 0 < λ < ∞. Then

P[X_n = k] → e^{−λ} λ^k / k!   (A.15.15)

for k = 0, 1, 2, ... as n → ∞. By (A.14.20) it follows that X_n →^L X, where X has a P(λ) distribution. This theorem suggests that we approximate the B(n, p) distribution by the P(np) distribution. Tables 3 on p. 108 and 2 on p. 154 of Feller (1968) indicate the excellence of the approximation when p is small and np is moderate. It may be shown that the error committed is always bounded by np².
References
Gnedenko (1967) Chapter 2, Section 13; Chapter 6, Section 32; Chapter 8, Section 42
Hoel, Port, and Stone (1971) Chapter 3, Section 3.4.2
Parzen (1960) Chapter 5, Sections 4, 5; Chapter 6, Section 2; Chapter 10, Section 2
A.16 POISSON PROCESS

A.16.1 A Poisson process with parameter λ is a collection of random variables {N(t)}, t ≥ 0, such that

(i) N(t) has a P(λt) distribution for each t.
(ii) N(t + h) − N(t) is independent of N(s) for all s ≤ t, h > 0, and has a P(λh) distribution.
'
i
j '
l •
Poisson processes are frequently applicable when we study phenomena involving events that occur "rarely" in small time intervals. For example, if N(t) is the number of disintegrations of a fixed amount of some radioactive substance in the period from time 0 to time t, then {N(t)} is a Poisson process. The numbers N(t) of "customers" (people, machines, etc.) arriving at a service counter from time 0 to time t are sometimes well approximated by a Poisson process, as is the number of people who visit a Web site from time 0 to t. Many interesting examples are discussed in the books of Feller (1968), Parzen (1962), and Karlin (1969). In each of the preceding examples of a Poisson process, N(t) represents the number of times an "event" (radioactive disintegration, arrival of a customer) has occurred in the time from 0 to t. We use the word event here for lack of a better one because these are not events in terms of the probability model on which the N(t) are defined. If we keep temporarily to this notion of event as a recurrent phenomenon that is randomly determined in some fashion and define N(t) as the number of events occurring between time 0 and time t, we can ask under what circumstances {N(t)} will form a Poisson process.
A.16.2 Formally, let {N(t)}, t ≥ 0, be a collection of natural number valued random variables. It turns out that {N(t)} is a Poisson process with parameter λ if and only if the following conditions hold:

(a) N(t + h) − N(t) is independent of N(s), s ≤ t, for h > 0,
(b) N(t + h) − N(t) has the same distribution as N(h) for h > 0,
(c) P[N(h) = 1] = λh + o(h), and
(d) P[N(h) > 1] = o(h).

(The quantity o(h) is such that o(h)/h → 0 as h → 0.) Physically, these assumptions may be interpreted as follows.

(i) The time of recurrence of the "event" is unaffected by past occurrences.
(ii) The distribution of the number of occurrences of the "event" depends only on the length of the time for which we observe the process.
(iii) and (iv) The chance of any occurrence in a given time period goes to 0 as the period shrinks, and having only one occurrence becomes far more likely than multiple occurrences.

This assertion may be proved as follows. Fix t and divide [0, t] into n intervals [0, t/n], (t/n, 2t/n], ..., ((n − 1)t/n, t]. Let I_{jn} be the indicator of the event [N(jt/n) − N((j − 1)t/n) ≥ 1] and define N_n(t) = Σ_{j=1}^n I_{jn}. Then N_n(t) differs from N(t) only insofar as multiple occurrences in one of the small subintervals are only counted as one occurrence. By (a) and (b), N_n(t) has a B(n, P[N(t/n) ≥ 1]) distribution. From (c) and (d) and Theorem (A.15.15) we see that N_n(t) →^L Z, where Z has a P(λt) distribution. On the other hand,
P[|N_n(t) − N(t)| > ε] ≤ P[N_n(t) ≠ N(t)]
 ≤ P[∪_{j=1}^n {N(jt/n) − N((j − 1)t/n) > 1}]
 ≤ Σ_{j=1}^n P[N(jt/n) − N((j − 1)t/n) > 1]
 = n P[N(t/n) > 1] = n o(t/n) → 0 as n → ∞.   (A.16.3)

The first of the inequalities in (A.16.3) is obvious, the second says that if N_n(t) ≠ N(t) there must have been a multiple occurrence in a small subinterval, the third is just (A.2.5), and the remaining identities follow from (b) and (d). The claim now follows from (A.16.3) and Slutsky's theorem (A.14.9) upon writing
N(t) = N_n(t) + (N(t) − N_n(t)).

A.16.4 Let T_1 be the time at which the "event" first occurs in a Poisson process (the first t such that N(t) = 1), T_2 be the time at which the "event" occurs for the second time, and so on. Then T_1, T_2 − T_1, ..., T_n − T_{n−1}, ... are independent, identically distributed E(λ) random variables.
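A simulation sketch (assuming NumPy) of A.16.4: summing i.i.d. E(λ) interarrival times and counting the arrivals in [0, t] reproduces a P(λt) count.

```python
# Build N(t) from exponential gaps and check its mean and variance equal lam * t.
import numpy as np

rng = np.random.default_rng(6)
lam, t, reps = 2.0, 3.0, 100_000

gaps = rng.exponential(1.0 / lam, size=(reps, 40))   # 40 gaps is ample for t = 3
arrival_times = gaps.cumsum(axis=1)
counts = (arrival_times <= t).sum(axis=1)            # N(t) for each replication

print(counts.mean(), counts.var())                   # both approximately lam * t = 6
```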
References
Gnedenko (1967) Chapter 10, Section 51
Grimmett and Stirzaker (1992) Section 6.8
Hoel, Port, and Stone (1971) Section 9.3
Parzen (1962) Chapter 6, Section 5
Pitman (1993) Sections 3.5, 4.2
A.17 NOTES

Notes for Section A.5
(1) We define A to be the smallest sigma field that has every set of the form A_1 × ··· × A_n with A_i ∈ A_i, 1 ≤ i ≤ n, as a member.

Notes for Section A.7
(1) Strictly speaking, the density is only defined up to a set of Lebesgue measure 0.
(2) We shall use the notation g(x + 0) for lim_{x_n ↓ x} g(x_n) and g(x − 0) for lim_{x_n ↑ x} g(x_n) for a function g of a real variable that possesses such limits.

Notes for Section A.8
(1) The requirement on the sets X^{-1}(B) is purely technical. It is no restriction in the discrete case and is satisfied by any function of interest when Ω is R^k or a subset of R^k. Sets B that are members of B^k are called measurable. When considering subsets of R^k, we will assume automatically that they are measurable.
(2) Such functions g are called measurable. This condition ensures that g(X) satisfies definitions (A.8.1) and (A.8.2). For convenience, when we refer to functions we shall assume automatically that this condition is satisfied.
(3) A function g is said to be one to one if g(x) = g(y) implies x = y.
(4) Strictly speaking, (X, Y) and (x, y) in (A.8.11) and (A.8.12) should be transposed. However, we avoid this awkward notation when the meaning is clear.
(5) The integral in (A.8.12) may only be finite for "almost all" x. In the regular cases we study this will not be a problem.

Notes for Section A.14
(1) It may be shown that one only needs the existence of the derivative g' at b for (A.14.17) to hold. See Theorem 5.3.3.
A.18 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer, 1985.
BILLINGSLEY, P., Probability and Measure, 3rd ed. New York: J. Wiley & Sons, 1995.
CHUNG, K. L., A Course in Probability Theory New York: Academic Press, 1974.
DEGROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1970.
FELLER, W., An Introduction to Probability Theory and Its Applications, Vol. I, 3rd ed. New York: J. Wiley & Sons, 1968.
GNEDENKO, B. V., The Theory of Probability, 4th ed. New York: Chelsea, 1967.
GRIMMETT, G. R., AND D. R. STIRZAKER, Probability and Random Processes Oxford: Clarendon Press, 1992.
HAJEK, J., AND Z. SIDAK, Theory of Rank Tests New York: Academic Press, 1967.
HOEL, P. G., S. C. PORT, AND C. J. STONE, Introduction to Probability Theory Boston: Houghton Mifflin, 1971.
KARLIN, S., A First Course in Stochastic Processes New York: Academic Press, 1969.
LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.
LOEVE, M., Probability Theory, Vol. I, 4th ed. Berlin: Springer, 1977.
PARZEN, E., Modern Probability Theory and Its Application New York: J. Wiley & Sons, 1960.
PARZEN, E., Stochastic Processes San Francisco: Holden-Day, 1962.
PITMAN, J., Probability New York: Springer, 1993.
RAIFFA, H., AND R. SCHLAIFFER, Applied Statistical Decision Theory, Division of Research, Graduate School of Business Administration, Boston: Harvard University, 1961.
RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.
SAVAGE, L. J., The Foundations of Statistics New York: J. Wiley & Sons, 1954.
SAVAGE, L. J., The Foundation of Statistical Inference London: Methuen & Co., 1962.
Appendix B

ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS

In this appendix we give some results in probability theory, matrix algebra, and analysis that are essential in our treatment of statistics and that may not be treated in enough detail in more specialized texts. Some of the material in this appendix, as well as extensions, can be found in Anderson (1958), Billingsley (1995), Breiman (1968), Chung (1978), Dempster (1969), Feller (1971), Loeve (1977), and Rao (1973). Measure theory will not be used. We make the blanket assumption that all sets and functions considered are measurable.
B.1 CONDITIONING BY A RANDOM VARIABLE OR VECTOR

The concept of conditioning is important in studying associations between random variables or vectors. In this section we present some results useful for prediction theory, estimation theory, and regression.

B.1.1 The Discrete Case
The reader is already familiar with the notion of the conditional probability of an event A given that another event B has occurred. If Y and Z are discrete random vectors possibly of different dimensions, we want to study the conditional probability structure of Y given that Z has taken on a particular value z. Define the conditional frequency function p(· | z) of Y given Z = z by

p(y | z) = P[Y = y | Z = z] = p(y, z)/p_Z(z)   (B.1.1)

where p and p_Z are the frequency functions of (Y, Z) and Z. The conditional frequency function p is defined only for values of z such that p_Z(z) > 0. With this definition it is
TABLE B.1

            y
  z        0      1      2     p_Z(z)
  0      0.25   0.05   0.05    0.35
  10     0.05   0.15   0.10    0.30
  20     0.05   0.05   0.25    0.35
 p_Y(y)  0.35   0.25   0.40
clear that p(· | z) is the frequency function of a probability distribution because

Σ_y p(y | z) = Σ_y p(y, z)/p_Z(z) = p_Z(z)/p_Z(z) = 1

by (A.8.11). This probability distribution is called the conditional distribution of Y given that Z = z.

Example B.1.1 Let Y = (Y_1, ..., Y_n), where the Y_i are the indicators of a set of n Bernoulli trials with success probability p. Let Z = Σ_{i=1}^n Y_i, the total number of successes. Then Z has a binomial, B(n, p), distribution and

p(y | z) = P[Y = y, Z = z]/P[Z = z] = p^z(1 − p)^{n−z} / [C(n, z) p^z(1 − p)^{n−z}] = 1/C(n, z)   (B.1.2)

if the y_i are all 0 or 1 and Σ y_i = z. Thus, if we are told we obtained k successes in n binomial trials, then these successes are as likely to occur on one set of trials as on any other.  □
•
1 ' •
l l
l
•
-
These figures would indicate an association between heavy smoking and poor health be 0 cause p(2 l 20) is almost twice as large as py(2). The conditional distribution of Y given Z = z is easy to calculate in two special cases.
(i) lfY and Z are independent, then p(y I z) = py(y) and the conditional distribution coincides with the marginal distribution. (ii) If Y is a function of Z, h(Z), then the conditional distribution of Y is degenerate, Y � h(Z) with probability I. ' •
• • '
I
Both of these assertions follow immediately from Definition(B. l . l ).
l •
Section B.l
479
Conditioning by a Random Variable or Vector
Two important formulae follow from (B. I . I ) and (A.4.5). Let q(z j y) denote the conditional frequency function of Z given Y y . Then =
p(y,z) p( y I z) =
=
(B. I . 3)
p(y I z)pz(z)
q(z I y)py(y ) Eyq (z I y)p y(y )
(B . I .4)
Bayes' Rule
whenever the denominator of the right-hand side is positive. Equation (B. t 3) can be used for model construction. For instance, suppose that the number Z of defectives in a lot of N produced by a manufacturing process has a B(N, 0) distribution. Suppose the lot is sampled n times without replacement and let Y be the num ber of defectives found in the sample. We know that given Z = z, Y has a hypergeometric, 'H( z, N, n ) , distribution. We can now use (8.1.3) to write down the joint distribution of Y and Z .
y
where the combinatorial coefficients
(�)
(B.I.5) n
vanish unless a , b are integers with b <
a.
We can also use this model to illustrate (8 . 1 .4). Because we would usually only observe Y, we may want to know what the conditional distribution of Z given Y y is. By (8.1 .4) this is ..,...,
P[Z
IY
=
y] =
c(y)
=
I:
= z
where ,
( � ) B"(l - 0)"-' ( � ) ( � � � ) / c(y)
(�)
W ( l - B)"
'
( � ) ( � �: )
.
This formula simplifies to (see Problem B . l . l l ) the binomial probability, (B.l .6)
8.1.2
Conditional Expectation for Discrete Variables
Suppose that Y is a random variable with E(IYI) < oo. Define the conditional expectation of Y given Z = z, written E(Y I Z = z), by
E(Y I Z
=
z) = Eyyp (y I z) .
(B. I . 7)
480
Additional Topics in Probability and Analysis
Note that by (B . I . l ), if pz(z)
>
0,
Appendix B '
�i n
•
E( ) p,· y E( I Y I I Z = z) = Ey ]y]p(y I z) < Ey ]Yi . (B.I.8) = pz z pz z Thus, when Pz ( z) > 0, the conditional expected value of Y is finite whenever the expected value is finite.
Example B.l.3 Suppose Y and Z have the joint frequency function of Table B.l. We find 5 11 1 . E(Y I z = 20) = 0 . 7 + I + 2 . 7 = 7 = 1.57. 7 I
Similarly, E(Y I Z = 10) = ; = L17 and E(Y I Z = 0) = � = 0.43. Note that in the health versus smoking context, we can think of E(Y I Z = z) as the mean health rating 0 for people who smoke z cigarettes a day. Let g(z) = E(Y I Z = z). The random variable g(Z) is written E(Y I Z) and is called the conditional expectation ofY given Z. ( l ) As an example we calculate E(Y1 I Z) where Y1 and Z are given in Example B. l . l . We have E(Y, I z = i) = P lY, = I I z = i] =
� { 7 ( )
( ��n (7)
_,__,_--"-
1 •
n
•
(B.!.9)
is just the number of ways i successes can occur in n Bernoulli
'
l
;
�
;
1 '
;
•
' •
.
'
..
trials with the first trial being a success. Therefore, z E(Y, I Z) = -. n
B, 1.3
'
!
The first of these equalities holds because Y1 is an indicator. The second follows from (B. l.2) because
<
(B.LJO)
'
l
i
j
'
Properties of Conditional Expected Values
j <
In the context of Section A.4, the conditional distribution of a random vector Y given Z = z corresponds to a single probability measure P� on (0, A). Specifically, define for
A E A,
•
'
J i '
i , •
P. (A) = P(A l iZ = z]) if pz ( z)
>
0.
(B.LJ 1)
This P� is just the conditional probability measure on (0, A) mentioned in (A.4.2). Now the conditional distribution of Y given Z = z is the same as the distribution of Y if P� is the probability measure on (0, A). Therefore, the conditional expectation is an ordinary expectation with respect to the probability measure P�. It follows that all the properties of the expectation given in (A. J0.3HA. 10.8) hold for the conditional expectation given Z = z. Thus, for any real-valued function r(Y) with Elr(Y) I < oo, E(r(Y) I Z = z) = Eyr(y )p(y I z)
' ' •
<
! '
• •
•
j
Section B.l
481
Conditioning by a Random Variable or Vector
and (B.l.12) aE(Y1 I Z = z) + {IE(Y, I Z z) identically in z for any Y" Y, such that E(IYJ ! ), E(IY2 1 ) are finite. Because the identity
E(aY1 + f)Y, I Z
holds for all
z,
=
z)
=
=
we have (B.I.13)
This process can be repeated for each of (A. I 0.3)-(A.l 0.8) to obtain analogous properties of the conditional expectations. In two special cases we can calculate conditional expectations immediately. If Y and Z are independent and E(IYI) < oo, then
E(Y I Z) = E(Y).
(B. l . l 4)
E(h(Z) I Z) = h(Z).
(B. l .15)
This is clear by (i). On the other hand, by (ii) The notion implicit in (B. l . l5) is that given Z = z, Z acts as a constant. If we carry this further, we have a relation that we shall call the substitution theorem for conditional
expectations:
E(q(Y, Z) I Z
=
z) = E(q(Y, z) I Z = z) .
This is valid for all z such that pz ( z) > 0 if E lq (Y, Z) I < tions (B. I . l l ) and (B.l.7) because
oo.
(B. l . 16)
This follows from defini
P[q(Y, Z) = a I Z = z] = P[q(Y,Z) = a, Z = z l Z = z] = P[q(Y,z) = a I Z = z] (B.l.l7) for any a. If we put q(Y , Z)
=
r(Y)h(Z), where Elr(Y)h(Z)I <
oo,
we obtain by (B. I . l 6),
E(r(Y)h(Z) I Z = z) = E(r(Y)h(z) I Z = z) = h(z)E(r(Y) I Z = z).
(B. l . l 8)
Therefore,
E(r(Y)h(Z) I Z)
=
h(Z)E(r(Y) I Z).
(B.l . 19)
Another intuitively reasonable result is that the mean of the conditional means is the mean:
E( E(Y I Z)) = E(Y),
(B.l .20)
whenever Y has a finite expectation. We refer to this as the double or iterated expectation
theorem. To prove (B.l.20) we write, in view of (B.l .7) and (A.I0.5),
E(E(Y I Z)) = Ezpz(z)[Eyyp(y I z)] = Ey,zYP ( Y I z )pz(z) = Ey,zYP(y, z) = E(Y ).
(B.I.21)
-----
'
"
'
482
Additional Topics in Probability and Analysis
The interchange of summation used i.s valid because the finiteness of all sums converge absolutely. As an illustration, we check (B.l .20) for
E(E(Yt I Z)) � E If we apply (B. 1 .20) to Y
E(Y1 I Z)
(n) Z
=
np -;;:
E( [Y[)
Appendix B implies that
given by (B . 1 . 1 0). In this case,
=
p = E(Yt ) .
(B. l .22)
= r(Y)h(Z) and use (B . l . 19), we obtain the product expec
tation fonnula: Theorem B.l.l lf Elr(Y)h(Z)I < oo, then
E( r(Y)h(Z) ) = E(h(Z)E(r(Y) I Z)). Note that we can express the conditional probability that
P[Y E A I Z = zj = E(l[Y E AJ I Z Then by taking r(Y)
I '
'
= z) =
Y E A given Z = z as
EyEAP(Y I z) .
= l[Y E AJ, h = I in Theorem B . 1 . 1 we can express the (uncondi
tional) probability that Y
E A as
P[Y E AJ = E(E(r(Y) I Z)) = E.P[Y E A I Z = zjpz (z)
.' Ii
(B.l .23)
=
E[P(Y E A I Z) J.
••
For example, if Y and
:II
Z are as in (B.l .5), P[Y < yj
!'
where
'
!
Hz
(B . l 24)
=
E.
(�)
8'(1 - 0)"-• H.(y)
is the distribution function of a hypergeometric distribution with parameters
(z, N,n).
�l �;'·; � i
'' 'I ''' i' '
.
Continuous Variables
B. l.4
•
,
,
'
•
Suppose now that
u
(Y, Z)
is a continuous random vector having coordinates that are them
selves vectors and having density function
!i
p(y, z).
between frequency and density functions, the
z = z by ifpz(z) '
i
> 0.
fu nction of Y given
(B . l .25)
,
•
1
1
'
'
'
j •
pz (z),
is given by (A.8.12), it is clear that p(·
I z)
is a density. Because (B. 1 .25) does not differ fonnally from (B. 1 . 1), equations (B.l.3) and (B.I .6) go over verbatim. Expression (B . l .4) becomes
py (y )q(z I y) where q is the conditional density of Z given
'
conditional density( I)
p(y z) P(Y I z ) , ( pz z )
Because the marginal density of Z,
•
We define, following the analogy
,
Y
=
(B.l .26)
y. This is also called Bayes' Rule.
If Y and Z are independent, the conditional distributions equal the marginals as in the discrete case.

Example B.1.4 Let Y_1 and Y_2 be independent and uniformly, U(0, 1), distributed. Let Z = min(Y_1, Y_2), Y = max(Y_1, Y_2). The joint distribution of Z and Y is given by

F(z, y) = 2P[Y_1 ≤ Y_2, Y_1 ≤ z, Y_2 ≤ y] = 2 ∫_0^y ∫_0^{min(y_2, z)} dy_1 dy_2 = 2 ∫_0^y min(y_2, z) dy_2   (B.1.27)

if 0 ≤ z, y ≤ 1. The joint density is, therefore,

p(z, y) = 2 if 0 < z < y < 1, and 0 otherwise.   (B.1.28)

The marginal density of Z is given by

p_Z(z) = ∫_z^1 2 dy = 2(1 − z), 0 < z < 1, and 0 otherwise.   (B.1.29)

We conclude that the conditional density of Y given Z = z is uniform on the interval (z, 1).  □
=
z)
=
1:
r(y)p(y I z)dy.
(8.1.30)
As before, if g(z) = E(r(Y) I Z = z) , we write g(Z) as E(r(Y) I Z) , the conditional expectation of r(Y) given Z. With this definition we can show that formulas 12, 13, 14, 19, 20, 23, and 24 of this section hold in the continuous case also. As an illustration, we next derive B.l.23: Letg(z) = E[r(Y) I ZJ, then, by (A.IO. l l), E(h(Z)E( r(Y) I Z )) =
1:
=
E(h(Z)g( Z ))
h(z)pz(z)
[C
=
1:
h(z)g(z)pz(z)dz
]
r(y)p(y I z)dy dz.
(8.1.31)
484
Additional Topics in Probability and Analysis
Appendix B
By a standard theorem on double integrals, we conclude that the right-hand side of (8 . 1 . 3 1 ) equals
1: 1: 1:1:
r(y) h( z)pz ( z}p(y I z}dydz
�
(B. l .32)
r(y}h (z)p(y, z } dyd z
�
E(r(Y)h( Z }}
j l
l ' I
'
by (A.IO. I l ), and we have established B . l .23. To illustrate these formulae, we calculate E(Y I Z) in Example 8 . 1 .4. Here, E(Y I Z
=
z } � ro ' yp (y I z}dy J
1 �
(1
_
' ydy r }, z}
1+z �
2
, 0 < z < 1,
and, hence,
.
E(Y I Z)
=
1+z 2
'
;
•
'
l
.
•
.
8 . 1 .5
1
'
Comments on the General Case
'
•
•
Clearly the cases (Y, Z) discrete and (Y, Z) continuous do not cover the field. For example, if Y is uniform on (0, 1) and Z = Y², then (Y, Z) neither has a joint frequency function nor a joint density. (The density would have to concentrate on z = y², but then it cannot satisfy ∫_0^1 ∫_0^1 f(y, z) dy dz = 1.) Thus, (Y, Z) is neither discrete nor continuous in our sense. On the other hand, we should have a concept of conditional probability for which P[Y = √u | Z = u] = 1. To cover the general theory of conditioning is beyond the scope of this book. The interested student should refer to the books by Breiman (1968), Loeve (1977), Chung (1974), or Billingsley (1995). We merely note that it is possible to define E(Y | Z = z) and E(Y | Z) in such a way that they coincide with (B.1.7) and (B.1.30) in the discrete and continuous cases and moreover so that equations 15, 16, 20, and 23 of this section hold.

As an illustration, suppose that in Example B.1.4 we want to find the conditional expectation of sin(ZY) given Z = z. By our discussion we can calculate E(sin(ZY) | Z = z) as follows. First, apply (B.1.16) to get

E(sin(ZY) | Z = z) = E(sin(zY) | Z = z).

Because, given Z = z, Y has a U(z, 1) distribution, we can complete the computation by applying (A.10.11) to get

E(sin(zY) | Z = z) = [1/(1 − z)] ∫_z^1 sin(zy) dy = [1/(z(1 − z))] [cos z² − cos z].
B.2 DISTRIBUTION THEORY FOR TRANSFORMATIONS OF RANDOM VECTORS

B.2.1 The Basic Framework
.
•
.
a [)t, Jh (t) �
h, (t)
•
• •
•
•
•
•
•
[)
•
•
[)tk
h, (t)
•
•
•
The principal result of this section, Theorem B.2.2, rests on the change of variable theorem for multiple integrals from calculus. We now state this theorem without proof (see Apostol. 1974, p. 421). Theorem B.2.1 Let h = (h1, . . . , hk)T be a transformation defined on an open subset B of Rk. Suppose that; ( ! ) (i) h has continuousfirst partial derivatives in B. (ii) h is one-to-one on B. (iii) The Jacobian of h does not vanish on B. Let f be a real-valued function (defined and measurable) on the range h(B) { (h1(t), . . . ,hk(t)) : t E B} ofh and suppose f satisfies
-
r lf(x)ldx < 00. jh(B) Then for every (measurable) suhset K of h(B) we have
f J(x)dx jK
�
f
Jh-l(K)
!(h(t)) IJh(t) ldt.
(B.2.1)
- - -·· . -- -
-- -------
486
Additional Topics in Probability and Analysis
Appendix B
In these expressions we write dx for dx1 . . dxk. Moreover, h-I denotes the inverse of the transformation h; that is, h-I (x) = t if, and only if, x = h(t). We also need another result from the calculus (see Apostol, 1974, p. 417),
' '
.
1 Jh • (t) � Jh (h-1(t)) -
(B.2.2) .
'
It follows that a transformation h satisfies the conditions of Theorem B.2.l if, and only if, h- 1 does. We can now derive the density ofY � g(X) � (91 (X), . . . , 9k(X))T when g satisfies the conditions of Theorem B.2.1 and X = (X1, Xk)T is a continuous random vector. •
•
.
I
,
Theorem B.2.2 Let X be continuous and let S be an open subset of Rk such that P(X E S) � 1. /fg � (91> . . . , 9k)T is a transformation from S to R k such that g and S satisfy the conditions of Theorem B.2.1, then the density ofY = g(Y) is given by
' • •
l 1 •
(B.2.3) •
l
for y E g(S).
1 1 j "
Proof. The distribution function ofY is (see (A.7.8))
Fy (y ) �
i
j
J
• •
where Ak � {x E Rk : 9;(x) < y;, i � 1, . . . k } Next we apply 'fheorem B.2.1 1 with h � g- 1 and f � p,. Because h- (Ak) � g(Ak) � {g(x) : 9;(x) < y, , i � 1, . . . , k} � {t : t < y;, i � 1 , . . , k }, we ll�tain ,
.
•
.
i
. '
' '
j
l l '
The result now follows if we recall from Section A 7 that whenever Fy (y) = fy'::o . . . Jy� q(tl! . . . tk)dtl . . . dtk for some nonnegative function q, then q must be the D density of Y, !
Example B.2.1 Suppose X � (X1 , Xz)T where X1 and Xz are independent with N(O, 1) and N(O, 4) distributions, res�tively. What is the joint distribution of Y1 = X1 + X2 and Yz � X1 - Xz? Here (see (A.13.17)), Px (xi>xz ) �
'
4� exp - � [xi + !x�J .
' ' '
•
!
I
x1
In this case, S � R2 Also note that 91 (x) � + Xz, 9z (x) � X1 � (Y1 + Yz) , 9:21 (Y) � 5 (y1 - 1J2 ), that the range g(S) is R2 and that
J·-· (y) =
1 z 1 2
1 2 1 -2
1 2
- -
-
xz,
g!1 (y)
' � ; '
Section 8.2
487
Distribution Theory for Transformations of Random Vectors
Upon substituting these quantities in
(B.2.3), we obtain
�Px G
Pv(YI, Y2)
(Yl + y, ),
r
� (Yl - Y2))
_I_ exp - � � (YI + Y2)2 + _l_(YI - y,)2 8?r 2 4 16 I I " 2 exp - 32 I::>y i + 5y2 + 6YIY2 I . 2 Sn '
]
This is an example of bivariate normal density. Such densities will be considered further in Section
Upon combining
(B,22) and (B,23) we see that for y E g(S) ,
py (y ) If
o
B.4.
X
is a random variable
(k
=
=
1),
Px( g - 1 (Y ) ) ' JJ. (g l (y ))l
(B2A)
the Jacobian of
g
is just its derivative and the
requirements (i) and (iii) that g' be continuous and nonvanishing imply that monotone and, hence, satisfies (ii). In this case
(B.2.4)
g is
strictly
reduces to the familiar formula
(A.8.9). It is possible to give useful generalizations of Theorem not one-to-one (Problem
B.2.2 to
situations where
g is
B.2.7).
Theorem B.2.2 provides one of the instances in which frequency and density functions
g is one-to-one, and Y
g(X) , then py(y ) = p,(g-1 (y) ) . The extra factor in the continuous case appears roughly as follows. If A(y) is a ..small" cube surrounding y and we let V(B) denote the volume of a set B, then V (g 1 (A (Y)"-')-) P[g(XJ E A(y)J P[X E g- 1 (A(y)_ ) ] . :_ py(y) "' ";'��'i': 1 V(g (A(y))) V(A(y)) V(A(y)) V(g-1 (A(y ) ) ) l (g- (y)) . V(A(y )) differ. If X is discrete,
"' P
=
X
Using the fact that
g-1
is approximately linear on
A(y) , it is not hard to show that
V(g-l(A(y))) ! J•-' (y) l . "' V(A(y )) The justification of these approximations is the content of Theorem
B.2.2.
The following generalization of (A.8.10) is very important. For a review of the ele mentary properties of matrices needed in its formulation, we refer the reader to Section
BJO.
g is called an affine transformation of Rk if there exists a k x k matrix A and a k x 1 vector c such that g(x) = Ax+c. If c = 0, g is called a linear transformation. The function g is one-to-one if, and only if, A is nonsingular and then Recall that
g-I(y ) = A - l (y - c), y E Rk, where A -I
is the inverse of A .
(B.2.5)
l
488
Additional Topics In Probability and Analysis
Appendix 8
j
'
' '
B.2.1 Suppose X is continuous and S is such that P(X E S) = 1. If g is a one-to-one affine transformatioll as defined earlier, then Y = g(X) has density Corollary
(8.2.6)
'
for y E g( S), where det A is the determinant of A. The corollary follows from (8.2.4), (8.2.5), and the relation,
' '
•
'
.. '
'
Jg (g-1 (y)) - det A.
'
"
i
(8.2.7)
Example B.2.1 is a special case of the corollary. Further applications appear in the next D section. B.2.2
The Gamma and Beta Distributions
As a consequence of the transformation theorem we obtain basic properties of two impor tant families of distributions, which will also figure in the next section. The first family has densities given by •
(8.2.8) for x > 0, where the parameters p and A are taken to be positive and f(p) denotes the Euler gamma function defined by
r(p) =
l ' = edt. tp1
'
(8.2.9)
b, ,,(x) =
B(r, s)
'
' 1 '
•
(8.2. 10)
The family of distributions with densities given by (B.2.8) is referred to as the gamma family of distributions and we shall write f(p, ..\) for the distribution corresponding to 9p,>.. · The special case p = 1 corresponds to the familiar exponential distribution £(..\) of (A.l3.24). By (A.S.IO), X is distribnted r(p, .>.) if, and only if, >.X is distribnted r(p, 1). Thus, 1 /.>. is a scale parameter for the r(p, .\) family. Let k be a positive integer. In statistics, the gamma density 9p, >.. with p = �k and .\ = � is referred to as the chi squared density with k degrees offreedom and is denoted by x%. The other family of distributions we wish to consider is the beta family, which is in dexed by the positive parameters r and s. Its densities are given by x"-1(1 - x)'-1
•
•
It follows by integration by parts that, for all p > 0,
r(p + 1) = pr(p) and that r(k) = (k - 1)! for positive integers k.
•
(8.2. 1 1)
forO < x < 1, where B(r, s) = [r(r)r(s)]/[r(r+s)] is the betafunction. The distribntion corresponding to br,• will be written (J(r, s) . Figures 8.2.1 and 8.2.2 show some typical members of the two families.
�1 •
'
'
.
i' '
j i
J
, '
j' .
'
' '' '
'
l i
Section 8.2
Distribution Theory for Transformations of Random Vectors
489
Theorem B.2.3 If X1 and X2 are independent random variables with f(p, A) and f'(q, A)
distributions, respectively, then Y1 = X1 + X2 and Y2 = XI/( XI + X2 ) are independent and have, respectively, f(p + q, A) and fJ(p, q) distributions. Proof. If A
=
1, the joint density of X1 and X2 is (B.2.I2)
for x1 > 0, X2 > 0. Let
1.4
p
1.2 ....
1.0
0.8
-<
•
=
.\ = IO
0.6 0.4 0.2 0
I I
0
I
·� - - -
/
.
p=
',
'
'--- ---' .,_ + -
-
1
2
2, .\ ,
=
'
1
..... ..... . . ............. . .. ,
,
. _ _ _
-
-
.
-
- - - -
3
X
Figure B.2.1 The gamma density, gp,>. ( x) ,
for selected p, .\.
Then g is one-to-one on 8 = {(XI. X2)T : X1 > 0, X2 > 0} and its range is 81 { (y, , Y2 ) T : Yt > 0, 0 < y2 < I}. We note that on S, g-1 ( YI , Y2 ) = (YIY2 , YI - YIY2 f·
(B.2. I3)
Therefore, Y2 YI
I - Y2 -y,
=
-y,.
-------·�
(B.2.14)
-
..
-
. .
__
490
Additional Topics in Probability and Analysis
Appendix B
If we now substitute (B 2. 1 3) and (B.2.14) in (B.2A) we get for the density of (Y1 , Y2)T .
g(S 1 . X,).
I
>
�
I e-Y1 ( Y tY2 )11- (Yt - .tJ I Y2) q - YI f (p)f( q)
(B.2. 15)
0. 0 < y2 < L Simplifying (B.2.15) leads to JIY (Yr, y,)
.
'
=
1
py(y,.y,)
for y1
•
� YP+
(y, ) bp. q (Y2).
(B.2.16)
The result is proved for ,\ = 1. If ). i- 1 define x; = /\X1 and Xf = >.Xz. Now xr and X� are independent f(p, 1). r(q, 1) variables respectively. Because X� + Xf D .\(X1 + X2) and x; (X; + x;) - 1 � X1 (X 1 + X2) _ , the theorem follows.
4 •
II II 3 "I I I I r � 5 s = 14 I I I I I 2 I I I I I I I I ·. I I 1 I I I I I
'
1
'
l
'
0
0
I I
•
r =-s = 3 -
-
I I'
'
-
'
'
•
j
•
•
1
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
' '
X
•
'
Figure B.2.2 The beta density, br,s(x), for selected
r, s.
By iterating the argument of Theorem B.2.3, we obtain the following general result. Corollary B.2.2 If X 1 , , Xn are independent random variables such that Xi has a f(pi, A) distribution, i = 1 , . . ' n, then E? 1 xi has a f(Ef I Pi, ,\) distribution. • • •
.
Some other properties of the gamma and beta families are given in the problems and in the next section.
! '
j ' '
Section 8.3
B.J
Distribution Theory for Samples from a Normal Population
491
DISTRIBUTION THEORY FOR SAMPLES FROM A NORMAL POPULATION
In this section we introduce some distributions that appear throughout modem statistics. We derive their densities as an illustration of the theory of Section B.2. However, these distrib�tions should be remembered in terms of their definitions and qualitative properties rather than density formulas.
B.J.l
The
x2, F,
and
t
Distributions
Throughout this section we shall suppose that X = (X1, . . . , Xn)T where the Xi form a sample from a N(O, a2) population. Some results for normal populations, whose mean differs from 0, are given in the problems. We begin by investigating the distribution of the Ei 1 X[, the squared distance of X from the origin.
Theorem 8.3.1 The random variable V = Ei 1X[ ja2 has a x� distribution. That is, V has density
for v > O. Proof. Let Zi = Xi/a, i = 1 , . . , n. Then Zi N(O, 1 ) . Because the Z[ are inde pendent, it is enough to prove the theorem for n = 1 and then apply Corollary B.2.2. If T = Zf, then the distribution function ofT is .
"'
(8.3.2) and, thus,
FT (t) = ( Vt) - <1> ( - Vt).
(8.3.3)
Differentiating both sides we get the density ofT
PT (t) = C i cp(Vt) =
l
y'2i;
t- ! e-t/2
(8.3.4)
for t > 0, which agrees with 91 1 up to a multiplicative constant. Because the constant is determined by the requirement21�at PT and g 1 1 are densities, we must have PT = g �·� 1 1 2•2 D and the result follows.
Let V and W be independent and have x� and x?n distributions, respectively, and let S = (Vjk ) (Wjm). The distribution of S is called the F distribution with k and m degrees offreedom. We shall denote it by Fk ,m· Next, we introduce the t distribution with k degrees offreedom, which we shall denote by 7,. By definition T. is the distribution of Q = Zj.jV/k, where Z and V are inde pendent with N(O, 1) and x� distributions, respectively. We can now state the following elementary consequence of Theorem B.3.1.
'
'
I
492
Additional Topics in Probability and Analysis
Appendix B
<
'
' •
Corollary B.3.l The random mriable
tion. The rmzdom mriab/e X
(
111
I k )E7" 1 x; j"i:,�·+;.� 1 x;
has Oil
1 / ( 1/k):E� +} Xt2 has a T�..: disrribwion.
Fk,m distrilm
Proof. For the first assertion we need only note that
'
•
'
(B.3,5) and apply the theorem and the definition of Fk,m· The second assertion follows in the same D way.
I
•
i
.
I
'
I
To make the definitions of the :Fk.m and 7k distributions useful for computation, we need their densities. We assume the S, Q, V, W are as in the definitions of these distribu· tions. To derive the density of S note that, if U � Vj (V + W), then
Vjk � m U (B.3.6) S Wjm k 1 - U · r (�m, �)( l l and V and W are independent, then by Because V r (�k, �). W Theorem B.2.3, U has a beta distribution with parameters �k and �m. To obtain the density of S we need only apply the change of variable formula (A.8.9) to U with g( u) = (mjk)uj(1 - u) . After some calculation we arrive at the Fk,m density (see Figure B.3. 1 ) (k/m)�•sl<•-2l( I + (kjm)s) -l
'""
,....,
'
! •
j
'
li , •
'
'
' ' ,
'
'
l 1 •
•
• • •
'
•
•
•
To get the density of Q we argue as follows. Because -Z has the same distribution as Z, we may conclude that Q and -Q are identically distributed. It follows that
P[O < -Q < q) 1 P[-q < Q < OJ � 2 P[O < Q2 < q2). Differentiating P[O < Q < q), P[-q < Q < OJ and JP[O < Q2 < q2] we get PQ (q) PQ ( -q) qpQ' (q2) if q > 0. P[O < Q < q)
�
�
(B.3.8)
l
(B.3.9)
Now Q2 has by Corollary B.3.1 an Fu distribution. We can, therefore, use (B.3.7) and (B.3.9) to conclnde
PQ (q) _
for -oo
< q < oo.
r (�(k + 1 )) (1 + (q' /k))-l<•+ll
v'1i'kr Gk)
l
I l
j
•
i ' ,
(B.3.10)
.. •
• •
Section 8 3
493
Distribution Theory for Samples from a Normal Population
0.2
. •
-- .. ... . . .. . . . .
0
0
I
2
3
Figure B.3.1 The :Fk,m density for selected (k, m) .
0.45
'
-
-4
-2
-3
The N(O,l) density The T4 density
-I
0
I
'
2
'
-
3
Figure B.3.2 The Student t and standard normal densities. The x2, T, and :F cumulative distribution functions are given in Tables II, III, and IV, respectively. More precisely, these tables give the inverses or quantiles of these distribu tions. For a E (0, 1 ) an ath quantile or 100 ath percentile of the continuous distribution F is by definition any number x(a) such that F(x(a)) = a. Continuity of F guarantees the existence of x(a) for all a. If F is strictly increasing, x(a) is unique for each a. As an illustration, we read from Table III that the (0.95)th quantile or 95th percentile of the '12o distribution is t(0.95) = I. 725. ,
------ ----------- --·-- -···· · ·
...... ......
.
-·--
.
-·-
494
Additional Topics in Probability and Analysis
8.3.2
Appendix B
Orthogonal Transformations
We tum now to orthogonal transformations of normal samples. Let us begin by recalling some classical facts and definitions involving matrices, which may be found in standard texts, for example, Birkhoff and MacLane (1965). Suppose that A is a k x m matrix with entry aij in the ith row and jth column, i = 1, . , k; j = 1, . . . , m. Then the transpose of A, written AT, is the m x k matrix, whose entry in the ith row and jth column is aji · Thus, the transpose of a row vector is a column vector and the transpose of a square matrix is a square matrix. An n x n matrix A is said to be orthogonal if, and only if, .
.
' ' '
J'
•
'
(8.3 . 1 1 ) or equivalently if, and only if, either one of the following two matrix equations is satisfied, (8.3.12)
;
'
(8.3.13) where I is the n xn identity matrix. Equation (B.3.12) requires that the column vectors of A be of length 1 and mutually perpendicular, whereas (B.3.13) imposes the same requirement on the rows. Clearly, (8.3.12) and (8.3.13) are equivalent. Considered as transformations on Jr1' orthogonal matrices are rigid motions, which preserve the distance between points. That is, if a = ( a1 , . . . , an ) T, b = (b, , . , bn)T, Ia - bl = JE� 1 (a, - b;) 2 is the Euclidean distance between a and b, and A is orthogonal, then . .
'
'j
•
'
'
I
I
'
'
I a - bl
=
(8.3. 14)
IAa - Abl.
'
To see this, note that
(a - b)T(a - b) = (a - b)TATA(a - b) [A(a - bW[A(a - b)] = I Aa - Abl2
I a - bl'
(8.3.15)
Finally, we shall use the fact that if A is orthogonal,
'
,
I
I
(8.3.16)
ldet AI = L
'
'
•
'
This follows from
[det AJ2
=
[det A] [det AT] = det[AAT]
=
det l =
L
(8.3.17)
Because X = (X1, . . , Xn)T is a vector of independent identically distributedN(O, a2 ) random variables we can write the density of X as .
-
[v'27i'<7]n exp [d,r<7]n exp [- 2!,1x1'] · 1
1
-
I
I
I'
" ,I
n � xr
20''
(8.3.18)
l
Section B 3
495
Distribution Theory for Sam ples from a Normal Population
We seek the density of Y = g(X) = c � (c 1 , . . , cn) · By Corollary B.2. 1, Y .
py(y)
AX + c, where A is orthogonal, � (Y1, . . . Y,, )T has density
n
x n,
and
.
-
(A (Y - c)) ' ldet AIPx Px (AT(y - c)) 1
(B.3.19)
by (B.3 . 1 1 ) and (B.3.16). If we substitute AT (Y - c) for x in (B.3.18) and apply (B.3.15), we get (B.3.20) Because
} � ( Yi : c;) Px(Y - c) = n { yJ
(B.3.21)
we see that the Yi are independent normal random variables with E(Yi) = Ci and common variance (J2 . If. in particular, c = 0, then Y1 , . . . , Yn are again a sample from a N(O, (J2) population. More generally it follows that if Z � X + d, then Y � g(Z) � A(X + d) + c � AX + (Ad + c) has density
py(y) � Px(y - (Ad -t- c)).
(B.3.22)
Because d � (d1 , . . . ,dn)T is arbitrary and, by definition, E(Z;) i = 1, . . . , n, we see that we have proved the following theorem.
Theorem B.3.2 !fZ
= E(X; + d; )
=
d;,
= (Z1, , Zn)T has independent normally distributed components with the same variance fJ2 and g is an affine transformation defined by the orthogonal matrix A andvector c = (c1, Yn)T has independent , cn)T, then Y = g(Z) = (Y1 , nonnally distributed components with variance fJ2. Furthermore, ifA = ( aij) .
•
.
.
•
•
n E(Y;) = c; + L a;;E(Z;) j =l
. . . 1
(B.3.23)
for i = l , . . . ,n. This fundamental result will be used repeatedly in the sequel. As an application, we shall derive another classical property of samples from a normal population. Theorem B.3.3 Let (Z1,
.
.
•
, Zn)T be a sample from a N(f.L, a2) population. Define n 1 Z = N L Z; . (B.3.24) i"-= 1
Then Z and Ek= t (Zi - Z)2 are independent. Furthermore, Z has a N(f.l, (J2 /n) distribu tion while ( 1 Ia2 ) :Ef I ( zi - Z? is distributed as X� - 1 .
496
Additional Topics in Probability and Analysis
Proof Construct an orthogonal matrix A a, =
=
Appendix 8
( aij) whose first row is
(Jn ' Jn) . . .
This is equivalent to finding one of the many orthogonal bases in Rn whose first mem ber is a1 and may, for instance, be done by the Gram-Schmidt process (Birkhoff and MacLane, 1965, p. 180). An example of such an A is given in Problem B.3.15. Let AZ = (Y,, . . . , Yn JT· By Theorem B.3.2, the V. are independent and normally distributed with variance 0"2 and means
n n E(Y;) = I >ijE( Zj) = I' :�::>ij· j=l j=l Because a1 j = 1/ .Jii, 1 < j < n, and A is orthogonal we see that n n L aij = Vn Laljakj = 0, k = 2, . . . , n. j=l j=l
(B.3.25)
(B.3.26) •
)
•
Therefore,
E(Yt)
= w/n, E(Yk)
•
= 0,
k = 2, . . . , n.
•
(B.3.27)
By Theorem B.3.1, (Ijcr 2 )EL2Y.,' has a X�- I distribution. Because by the definition of A,
- Yt z = ..;n·
(B.3.28)
Now
I
n n n n L(Z• - Z)2 = I:zf - 2Z l: Z. + nZ2 = I: zf - nZ2• k=l k=l k=I k=I
Therefore, by (B.3.28),
•
Finally,
n n '<;""' L.(Zk - Z- )2 = '<;"' L. z.2 - Y,2 . k=l k=l
n n l: Y.' = IYI 2 = IAZ - AOI 2 = IZI 2 = 2:: z� k=l k=l
Assertion (B.3.29) follows.
•
�
'
1
'
'
•
•
1 ., '
1
i
1'.
the theorem will be proved once we establish the identity
n n '<;"' 2 '<;"' Z = L.(Z k - )2 . L_. yk k=l k=2
'
i
.
.
.
(B.3.29)
., '
•
•
, '
.
i
l
.
(B.3.30)
., •
'
,,
!
I
•
' •
(B.3.31 )
(B.3.32) 0
. j
l
1 I
Section 8.4
B.4
497
The Bivariate Normal Distribution
THE BIVAR IATE NORMAL DISTRIBUTION
The normal distribution is the most ubiquitous object in statistics.
It appears in theory
as an approximation to the distribution of sums of independent random variables, of order
statistics, of maximum likelihood estimates, and so on. In practice it turns out that variables
arising in all sorts of situations, such as errors of measurement, height, weight, yields of
chemical and biological processes, and so on,
are
approximately normally distributed.
In the same way, the family of k-variate nonnal distributions arises on theoretical
grounds when we consider the limiting behavior of sums of independent k-vectors of ran dom variables and in practice as an approximation to the joint distribution of k-variables. Examples are given in Section
In this section we focus on the important case k = 2 where all properties can be derived relatively easily without matrix calculus and we can
6.2.
draw pictures. The general k-variate distribution is presented in Section
more thorough introduction to moments of random vectors.
B.6 following
a
Recall that if Z has a standard normal distribution, we obtain the N(Jl, cr2) distribution
as the distribution of g(Z)
= crZ +
f-1· Thus,
Z
generates the location-scale family of
N(J.i,, cr2 ) distributions. The analogue of the standard normal distribution in two dimensions
is the distribution of a random pair with two independent standard normal components,
whereas the generalization of the family of maps g( z)
= cr z + J.i, is the group
of affine
transformations. This suggests the following definition, in which we let the independent
N(O, 1) random variables Z1 and Z2 generate our family of bivariate distributions. A planar vector (X, Y) has a bivariate nonnal distribution if, and only if, there exist constants aiJ • 1 � i,j < 2, J.l,1, J.1,2, and independent standard normal random variables z1 ' z2 such that X M1 + anZ1 + a12Z2 (B.4. 1) y /-L2 + a21 zl + a22z2. In matrix notation, if A � (a,,), J1
�
definition is equivalent to
( !' 1 , /'2)T, X � (X,Y)T, Z
X � AZ + J1·
(B.4.2)
Two important properties follow from the Definition
(B.4.1).
Proposition B.4.1 The marginal distn'butions of the components of a bivariate normal random vector are (univariate) nonnal or degenerate (concentrate on one point). This is a consequence of
Note that
E(X )
�
(A.1 3.23).
The converse is not true. See Problem
1'1 + anE(Z1) + a12E(Z2)
�
1'1 , E(Y)
�
1'2
B.4.10. (B.4.3)
and define
u1 Then
X has a N(/'1, uf )
�
v'var X,
u2
�
vvar Y.
(B.4.4)
and Y a N(/' 2 , u?;) distribution.
Proposition 8.4.2 If we apply an affine transformation g(x) Cx + d to a vector X, which has a bivariate nonnal distribution, then g(X) also has such a distribution. =
-------�-- �---
. --
.
--
.
....,. _.
498
Additional Topics in Probability and Analysis
Appendix B
This is clear because
(B.4.5) We now show that the bivariate normal distribution can be characterized in terms of first- and second-order moments and derive its density. As in Section A.l l, let (X Co::.:v-" :) cc-: • Y� ( .::. ) = Cor X , Y = 0'10"2 if a1 a2 f- If a1 a2 = 0, )it will be convenient to let = We define the variance covariance matrix of (X, Y (or of the distribution of (X, Y)) as the matrix of central second moments CX + d = C(AZ + J") + d = (CA)Z + (CJL + d).
p
p
0.
0.
This symmetric matrix is in many ways the right generalization of the variance to two dimensions. We see this in Theorem B.4.1 and (B.4.21). A general definition and its properties (for k dimensions) are given in Section B.S. Theorem B.4.1 Suppose that ""'' i' 0 and I PI < I. Then
1 Px(x) = exp [-�({x - JLf�- 1 (X - JL))l
(B.4.8)
' ' '
I
I '
l
l '
(B.4.9) i
•
Because (X, Y) is an affine transformation of (Z1 , Z2), we can use Corollary 8.2.1 to obtain the joint density of (X, Y) provided A is nonsingular. We start by showing that AAT = �. Note that Proof.
i.
'i i<
liI
I I
while and
"'1
i l
t l
l
1
'�
(B.4.10) pcr1a2
Cov(auZt + a12Z2, a21Z1 + a22Z2 ) - an":ll Cov(Z1 ,Z, ) +(a,,a21 + a11a,2) Cov(Z1, Z2) +a12a22 Cov(Z2, Z2) aua21 + a12a22·
(B.4.11)
i �-----
Section 8.4
499
The Bivariate Normal Distribution
Therefore, AA T
=
ldct AI
:E and by using elementary properties of determinants we obtain
Vdet Adet AT v'det 1: � crw2 J1 - pz.
J[det A]'
�
�
Vdet AA T
(8.4.12)
Because IPI < 1 and a1a2 f. 0, we see that A is nonsingular and can apply Corollary B.2.1 to obtain the density of X. The density of Z can be written
pz(z) As in (8 .3.19),
Px (x)
�
)
1 T - z z 2
.
- � [A 1( x - J.<Jf [A - 1 (x - J.<)J } { ex p � (x - J.<)T[A -'f A -• (x - J.<) } . { � � AI
21rld t A I 2rrld t
Because
( p ;r,;: ex 1
exp -
(8.4. 13)
(8.4.14)
(8.4.15) we arrive at (B.4.8). Finally (B.4.9) follows because by the formulas for an inverse (8.4.16) 0
From (B.4.7) it is clear that :E is nonsingular if, and only if, a1a2 f. 0 and IPI < 1. Bivariate nonnal distributions with a1 a2 i- 0 and IPI < 1 are refened to as norulegenerate, whereas others are degenerate. If a'f = ai1 + ai2 = 0, then X = J.ti, Y is necessarily distributed as N(J.
(Y - 11-2) a2
�p
(X - J.
.
(8.4. l 7)
Because the marginal distributions of X <¥Id Y are, as we have already noted, N(J.t1, oD and N(J.<2, o-:j) respectively, relation (8.4.1 7) specifies the joint distribution of (X, Y) com pletely. Degenerate distributions do not have densities but correspond to random vectors whose marginal distributions are nonnal or degenerate and are such that (X, Y) falls on a fixed line or a point with probability 1. Note that when p = 0, Px (x) becomes the joint density of two independent normal variables. Thus, in the bivariate normal case, independence is equivalent to correlation zero. This is not true in general. An example is given in Problem B.4. 1 1 . Now, suppose that we are given nonnegative constants a1, a2. a number p such that IPI < 1 and numbers J.t t , J.t2 · Then we can construct a random vector (X, Y) having a
'
II
500
Additional Topics in Probability and Analysis
Appendix 8
••
p(x,y)
\
.
•
'
'
I !
'
y
'
' •
' ' • '
• • •
p(x,y)
•
J
y
Figure 8.4.1 A plot of the bivariate normal density p(x, y) for l'l o-1 = o-2 =
1, p = 0 (top) and p = 0.5 (bottom).
=
1'2
=
2,
bivariate normal distribution with vector of means (Jtl, /.L2 ) and variance-covariance matrix E given by (B.4.7). For example, take (8.4.18) and apply (B.4.10) and (B.4. I I). A bivariate normal distribution with this moment structure will be referred to as N(J<,, 1'2 . at ,a�, p) or N(J<, E). ' '
1 '
'
Section 8.4
501
The Bivariate Normal Distribution
Now suppose that U
= (U1, U2 )T is obtained by an affine trnnsformation, u,
u,
c11X + c12 Y + vi c21X c22 Y v2
(B.4.19)
+
+
from a vector (X, Y) T having a N(f-li, f-£2, uf , a� , p) distribution. By Proposition 8.4.2, U has a bivariate normal distribution. In view of our discussion, this distribution is completely determined by the means, variances, and covariances of U1 and U2, which in tum may be expressed in terms of the f-li, af , p, cij , and Vi. Explicitly, E(UI) Var U1 Var U, Cov(U1, U,)
VI + Cuf-ll C12j.L2, E(U2) V2 + C21J.L1 + C22/l2 2 2 2 c11 a1 + c12a2 cr2C11para2 2 2 2 2 C21 u1 + c22 a2 + c21C22pa1a2 cuc21a i + C12C22a� + (cuc22 + c12C2t ) Pu1a2.
=
+ 2 2+ 2
(B.4.20)
In matrix notation we can write compactly (E(U, ) , E (U,)) E(U)
( v, , v, l
T = v + C!L, + C(tL1,!'2) CECT
(B.4.21)
where E(U) denotes the covariance matrix of U. If the distribution of (X, Y) is nondegenerate and we take l
(Tl
0
( vl , v, ) ' =
e
../l - p2 l 2 .,.2Jl-p
(Tt
(B .4.22)
-C(I'bi'2JY
then ul and u2 are independent and identically distributed standard normal random vari ables. Therefore, starting with any nondegenerate bivariate normal distribution we may by an affine transformation of the vector obtain any other bivariate normal distribution. Another very important property of bivariate normal distributions is that normality is preserved under conditioning. That is, Theorem B.4.2 /f (X, Y) has a nondegenerate N(l'l , 1'2, cr?, cr�,p) distribution, then the conditional distribution ofY given X = x is
Proof. Because X has aN(tLI , a¥ ) distribution and (Y, X ) is nondegenerate, we need only
------
-
- ----
502
Additional Tepics in Probability and Analysis
Appendix 8
calculate
p( y I x)
P(Y,X) ( y, x)
px(x)
{ (x -:.t)2]} p')J
1 ---;;c=7c===cce exp o-2 }21r(1 - p2 )
+[1 - (1 -
1 exp "2 V21r(1 - p')
-
1
2(1 -
(Y - I"z ) 2 [ p2 ) ,-� -
]
2 1 (x 1"1) (Y 1" 2) _ [ p 2(1 - p2) 0"2 "'
. (8.4.23)
This is the density we want.
i
0
Theorem B.4.2 shows that the conditional mean of Y given X = x falls on the line y � 1"2 + p(o-2 /o-!)(x - 1"1). This line is called the regression line. See Figure 8.4.2, which also gives the contour plot Sc = { (x, y) : f(x, y) = c} where c is selected so that P((X, Y) E Sc) � 'Y· 'Y � 0.25, 0.50, and 0.75. Such a contour plot is also called a 1001% probability level curve. See also Problem 8.4.6. B y interchanging the roles of Y and X, we see that the conditional distribution of X 2 p + 2 given Y � y is N(p1 (udo-2)p(y 1" ) , a-i (l - )). With the convention that 0/0 = 0 the theorem holds in the degenerate case as well. More generally, the conditional distribution of any linear combination of X and Y given any other linear combination of X and Y is normal (Problem B .4.10). As we indicated at the beginning of this section the bivariate normal family of distri butions arises naturally in limit theorems for swns of independent random vectors. The main result of this type is the bivariate central limit theorem. We postpone its statement and proof to Section B.6 where we can state it for the k-variate normal.
,
l
,
'
B.S
MOMENTS OF RAN DOM VECTORS AND MATRICES
• •
We generalize univariate notions from Sections A.lO to A.l2 in this section. Let U, re spectively V, denote a random k, respectively l, vector or more generally U = IIUij llkxl , a matrix of random variables. Suppose EIUiJ I < oo for all i, j. Define the expectation of U by
E(U)
B.S. I
�
IIE(U;; ) IIk xl· ,
Basic Properties of Expectations
If Am xk. Bmxl are nonrandom and EU, EV are defined, then
E(AU + BV) � AE(U) + BE(V).
(B.S. I)
Section 8.5
503
Moments of Random Vectors and Matrices
y 3
•
(2,2) . .
2
• • • • • • •
•
1
•
•
0
0
2
1
3
X
Figure B.4.2 25%, 50%, and 75% probability level-curves, the regression line (solid line), and major axis (dotted line) for the N(2, 2, 1, I , 0.5) density.
This is an immediate consequence of the linearity of expectation for random variables and the definitions of matrix multiplication. If U � c with probability 1,
E(V) = c . For a random vector U, suppose EU,' < oo for i = 1, . . . , k or equivalently E(fUf2) < co, where 1 · I denotes Euclidean distance. Define the variance of U, often called the variance-covariance matrix, by
Var(U)
E(V - E(U))(U - E(U)f ff Cov(U, U; )[fkxk,
(B.5.2)
,
a symmetric matrix.
B.5.2 If A is m
Properties of Variance x
k as before, Var(AU)
=
AVar(U)A T
Note that Var(U) is k x k, Var(AU) is m x m.
(B.5.3)
Additional Topics in Probability and An a lysis
504
Appendix B
Let ckx 1 denote a constant vector. Then
Var(U + c) = Var(U ) .
(B.5.4)
Var( c ) = IIO]]kxk ·
(8.5.5)
If ak x 1 is constant we can apply (8.5.3) to obtain
Var( Ej= 1 ap, ) = arvar(U)a = Li,j a, a; Cov(U,, U; ).
Var(aTU)
=
(B.5.6)
Because the variance of any random variable is nonnegative and a is arbitrary, we con clude from (B.5.6) that Var(U) is a nonnegative definite symmetric matrix. The following proposition is important.
Proposition B.S.l /f E]Uj2 < oo, then Var(U ) is positive definite if and only if, for every a t 0, b,
P[aru + b = OJ < !.
(B.5.7)
Proof. By the definition of positive definite, (B.lO.l), Var(U) is not positive definite iff arvar(U)a = 0 for some a f 0. By (B.5.6) that is equivalent to Var( aTU) = 0, which is D equivalent to (B.5.7) by (A.l l .9). If Ukx 1 and Wkxi then
independent random vectors with E]U]2 <
oo,
Var(U + W) = Var(U) + Var(W).
.
E]W]2 < oo, (B.5.8)
This follows by checking the identity element by element. More generally, if E]U]2 < oo, E]V]2 < oo, define the covariance of Uk xr, Vr x 1 by
'
'.
•
•
•
are
Cov(U, V)
•
=
•
E(U - E(U))(V - E(VJJY IICov(U;, Vj )]]kx l·
Then, if U, V are independent
'
Cov(U, V) = o.
(B.5.9)
.
' •
'
•
' ' '
In general Cov(AU + a, BV + b) = ACov(U, V)Br
(B.5.10)
j I
for nonrandom A, a, B, b, and Var(U + V) = Var(U) + 2 Cov(U, V) + Var(V). We leave (B.5.10) and (B.S. l l ) to the problems. Define the moment generatingfunction (rn.g.f.) of Ukxi for t E IF by
M (t)
I
=
Mu (t)
=
E (e' u ) = E( e"�� ,•, u, ) . T
(8.5.1 1) . •
' '
> • >
•
i
•
I
Section 8.5
505
Moments of Random Vectors and Matrices
Note that Af(t) can be (c. f.) of U by,
co
for all
t =!=- 0. In parallet< 1 ) define the characteristic function
T
'l'(t) = E(e" u) = E(cos(tTU)) + iE(sin(tTU)) where i = A. Note that 'I' is defined for all theorems are beyond the scope of this book. Theorem B.5.1 Let S = (a)
t E n• , all U. The proofs of the following
{t : M(t) < oo). Then,
S is convex. (See B.9).
(b) If S
has a nonempty interior S0, (contains a sphere S(O, €), € > 0), then M is analytic on 8° In that case EIUIP < oo for all p. Then, if i1 + · · + ik = p, ·
&PM(O) &t;, 1 .
In particular,
.
•
=
&t;, k
&M
(0) &tJ
E(Ui' . . . U�' ).
(B.5.12)
= E(U)
(B.5.13)
kxl
and (B.5.14)
kxk (c)
If 8° is nonempty, M detennines the distribution ofV uniquely.
Expressions (B.5.12HB.5.14) are valid with
cumulant generating function of U is defined by K(t) = Ku(t) = log M(t). If S ( t) = { t : M (t) < oo } has a nonempty interior, then we define the cumulants as The
&P
'-"ll···tl; = e; l · · ·tl; (U ) = &t;' . & " K(t) . . tk 1 r.
·
.
, i1 + · · · + ik = P·
.
t-O
An important consequence of the definitions and (A.9.3) is that if Ukxl• Vkxl are independent then
Mu+V(t) = Mu (t)Mv(t), Ku+V(t) = Ku(t) + Kv (t)
(B.5.15)
where we use subscripts to indicate the vector to which the m.g.f. belongs. The same type of identity holds for c.f.'s. Other properties of cumulants are explored in the problems. See also Bamdorff-Nielsen and Cox (1989).
..
..
506
Additi:mal Topics in Probability and Analysis
Appendix 8
Example 8.5.1 Tire Bimriare Normal Distribution. N(JtJ , Jl2, af, ai , p) distribution then, it is easy to check that (B.S.
E(U) = I' Var(U) = �
Mu(t) where
t = (t1, t2) T , I'· and
=
exp
{
e·l'
+
16)
(B.S.l7)
� tT �t }
(B.S.l8)
� are defined as in (6 .4.7), (6.4.8). Similarly
'Pu(t) obtained by substituting iti for ti,
=
{
�
exp it TI' - tT I: t
1 < j < k,
in
(8.5.18).
}
(B.S. 19)
The result follows directly from
(A. l3.20) because
(6.5 .20) and by (6 .4.20),
U1 + t2U, has a N(tT/L, tT �t) distribution.
By taking the log in (8. 5 . 1 8) and differentiating we find the first five cumulants
[: '
t1
'
,
'
'
', ·. i'
I
I
0
All other cumulants are zero.
'
• '
I
THE MULTIVARIATE NORMAL DISTRIBUTION
8.6
l ! .
' •
;! '
''
'
8.6.1
Definition and Density
•
'
We define the multivariate normal distribution in two ways and show they are equivalent. From the equivalence we are able to derive the basic properties of this family of distribu� tions rapidly.
Definition 8.6.1 Ukxl
has a
multivariate (k-van"ate) normal distribution
iff
U
can be
i
J l
written as
U = I' + AZ when
Jlk; x l • Akxk
are constant and
Z
=
( Z1 , . . . , Zk ) T
where the Zj are independent
standard normal variables. This is the immediate generalization of our definition of the bivariate normal. We shall show that as in the bivariate case the distribution of U depends on
I'
=
E(U)
and
� - Var(U), only.
Definition B.6.2 Ukx 1 has a multivariate normal distribution dom, aTU = E1=IaiUi has a univariate normal distribution.
iff for every
akx 1
nonran
Theorem B.6.1 Definitions B.6.J and B.6.2 define the same family of distributions.
' •
'
l
Section 8.6
507
The Multivariate Normal Distribution �
[ATa]Tz + aTp, a linear combination of independent normal variables and, hence, normaL Conversely, if X = aTU has a univariate normal distribution, necessarily this is N(E(X)1 Var(X)). But
Proof If U is given by Definition B.6. I, then aTU � aT (AZ + p)
E(X) � aT E(U)
(B.6. 1)
Var(X) = aT Var(U)a
(B.6.2)
from (B.S. I ) and.(B.5.4). Note that the finiteness of E(IUI) and Var(U) is guaranteed by applying Definition B .6.2 to eJU, where ej denotes the k x 1 coordinate vector with 1 in the jth coordinate and 0 elsewhere. Now, by definition, Mu (a)
E(exp(aTU)) = E(ex) exp {aT E(U) + �aT Var(U)a}
(B.6.3)
from (A.1 3.20), for all a. Thus, by Theorem B.5 . 1 the distribution of U under Definition B.6.2 is completely determined by E(U), Var(U). We now appeal to the principal axis theorem (B. I 0.1.1 ) . If :E is nonnegative definite symmetric, there exists Akxk such that where A is nonsingular iff :E is positive definite. Now, given U defined by B.6.2 with E(U) = p, Var(U) = �- consider where A and Z are as in Definition B.6.1. Then E(V) = p, Var(V)
=
AVar(Z)AT
=
AAT
because Var(Z) = Jkxk • the identity matrix. Then, by definition, V satisfies Definition 8.6.1 and, hence, B .6.2 and has the same first and second moments as U. Since first and second moments determine the k-variate normal D distribution uniquely, U and V have the same distribution and the theorem follows. Notice that we have also proved: Corollary 8.6.1. Given arbitrary fLkxt and :E nonnegative definite symmetric, there is a unique k-variate normal distribution with mean vector J..L and variance-covariance matrix �.
We use Nk (P, �) to denote the k-variate normal distribution of Coroiary B.6.1. Argu ing from Corollary B.2.1 we see the following. Theorem 8.6.2 /j:E ispositive definite or equivalently nonsingular, then ifV U has a density given by 1 pu ( x) = (27r) k/2 [det(�)] k/2
I x { x exp - 2 ( - p) � T
-I
}
(x - p) .
""'
Nk(J..L, :E),
(B.6.5)
Additional Topics in Probability and An alysis
508
Appendix B
Proof. Apply Definition 8.6. 1 with A such that E = AAT The converse that U has a density only if E is positive definite and similar more refined results are left to Problem 8.6.2. There is another important result that follows from the spectral decomposition theorem
(8.10. 1 .2). Theorem 8.6.3 If U kx 1 has Nk(J.L, E) distribution, there exists an orthogonal matrix Pkxk such that pTU has an Nk (v, Dkxk) distribution where v = pT/1- and Dk x k is
'
j •
' •
• •
l j
the diagonal matrix whose diagonal entries are the necessarily nonnegative eigenvalues of E. /f:E is of rank l < k, necessarily only l eigenvalues are positive and cowersely.
•
Proof. By the spectral decomposition theorem there exists P orthogonal such that
E = PDPT Then pru has a./1/k(v, D) distribution since Var(PTU) ofP.
. •
=
pTEP = D by orthogonality 0
This result shows that an arbitrary normal random vector can be linearly transformed to a normal random vector with independent coordinates. some possibly degenerate. In the bivariate normal case, (B.4.19) and (B.4.22) transformed an arbitrary nondegenerate bivariate normal pair to an i.i.d. N(O, 1) pair. Note that if rank � = k and we set
( r ' = P (o l r ' p r
E i = PD i PT, E-i = E i .
•
' ,,
. .
I;
;
· .
'
·J
•
'
.1 •
has a N(O, J) distribution, where J is the k x k identity matrix.
'
•
'
' • '
Corollary 8.6.2 /fU has an./1/k (O, E) distribution andE is of rank k, then ur E - 1 U has
8.6.2
;
•
(8 .6.6)
Proof. By (8.6.6). UTE-1U = ZTZ, where z is given by (8 .6.6). But zrz = where Z; are i.i.d. ./1/(o, 1). The result follows from (8.3.1).
l
•
where D! is the diagonal matrix with diagonal entries equal to the square root of the eigenvalues of :E, then
a x� distribution.
• •
L� 1 Zl 0
Basic Properties. Conditional Distributions
If U is./1/k(Jt, E) and Atxk• br x i are nonrandom, then AU + b is .Ni(A�t + b, AEA'). This follows immediately from Definition 8.6.2. In particular, marginal distributions of blocks of coordinates of D are normal. For the next statement we need the following block matrix notation. Given :Ekx k positive definite, write
(8.6.7)
Section 8.6
509
-------------- --��
The Multivariate Normal Distribution
where :E l l is the l X l variance of ( u1 ' . ' ul ) T. which we denote by U(l)' :Ezz the ( k-·l) X (k - I) variance of (U1+ 1 , Uk) 7 denoted by Ul'l, and �" = Cov (U I ' i Ui 21)1xk-t. .
.
:E :.n
=
:Ef2 .
•
Similarly write
.
•
f.J =
ul 'I and Ui2)
(
.
p
.l ' i p Pl
)
,
2 where JL( I l and J.L( ) are the mean vectors of
,
We next show that independence and uncorrelatedness are the same for the k variate normal. Specifically
( ���; )
where U(I) and U(Z) are k and l vectors, Theorem 8.6.4 /f U k +l x = ( ) l respectively, has a k + l variate normal distribution, then UP) and U( 2) are independent iff
.
(8.6.8) Proof. If Ui 1 1 and u1 21 are independent, (8.6.8) follows from (A. 1 1 .22). Let U i 1 1• and 2 Ui 21• be independent Nk(E(Ui 1 1), Var(U ( l l ) ) , Ni (E(UI'I ) , Var(Ui 1 )) . Then we show U lll• 2 l has the same distribution as U and, hence, U( ) , u< > are below that U* U(Z)• independent. To see that U* has the same distribution as U note that _
(
)
EU•
=
2
EU
by definition, and because U(l) and U( ) have E12 Var(U(j)•) = E;;, j = 1, 2, then
Var (U)
=
(8.6.9)
0
and by construction
Var(U')
(8.6.10)
by (8.6.7). Therefore, U and U* must have the same distribution by the determination of the k-variate normal by first and second moments. D
Theorem 8.6.5 If U is distributed as Nk (p,, :E), with :E positive definite as previously, then the conditional distribution of U( I ) given U(Z) = ul2l is N1 (p,(l) + :E 2:E221 ( u<2l 1 2 p,< l), :Eu - :E 2 :E221 :E1 2). Moreover "En - :E12:E 221 :E2 1 is positive definite so there is 1 a conditional density given by (B.6.5) with (:Ell � "E 1 z:E221 :EzJ) substituted for :E and 2 2 1 1 p< 1 ) for 1-'· 1-'( ) - ��2�22' (ui Proof. Identification of the conditional density as normal can be done by direct computa tion from the formula
(8.6.1 I ) after noting that :En is positive definite because the marginal density must exist. To derive this and also obtain the required formula for conditional expectation and variance we proceed as follows. That "Eu, :En are positive definite follows by using aT �a > 0 with a whose last k - l or first l coordinates are 0. Next note that
(8.6.12)
-------------- - - ---··-
510
Additional Topics in Probability and Analysis
because
1 1 Var(U 2 I l)E 22 E21 E12E22 :E12 'Ezz1 'Ezz 'Ezz1 :E 21
Appendix B
(8.6.13)
by (B.5.3). Furthermore, we claim 1 2 Cov(E12E22'Ui'l, ul ' l - E12E22 U( l )
�
0
(8.6.14)
(Problem B.6.4) and, hence, by Theorem B.6.5, U(l) - E12E221Ui2) and Ui 2) are inde pendent Thus, the conditional distribution of U(l) - E 1 2E 221U( 2l given U( 2l = u< 2l is the same as its marginal distribution. By the substitution property of the conditional distribution this is the same as the conditional distribution of uU l - ::E 1 2 ::E221 u< 2l given u<2l = u< 2l. The result now follows by adding E 12 E 2i U(2l and noting that (8.6.15) and
E(U I u, � u,) 1
1 u2 u + E(U1 - E 12E221U2 I , � u,) E 12E22 - E(U1 - E12E221 U2 ) + E12E221 u2 1 1 JL1 - 'Et 2 'E22 1Jz + ::E 12:Ezz U2 1 p.1 + E12E22 (u2 - p.2) .
.
•
0
• ,
Theorem 8.6.6 The Multivariate Central Limit Theorem. Let Xt, Xz, . . . , Xn be independent and identically distributed random k vectors with EIX 1 I2 < oo. Let E(X1 ) = g R, JL, Var{XI) = ::E, and let Sn = Ei 1 Xi. Then, for every continuous function : Rk
(
Jnnp. ) £. g(Z)
g Sn
----->
(8.6.16)
1 ,
"
, •
� ,
•
., ,
where Z � N. (o, E).
As a consequence. if I; is positive definite, we can use Theorem 8.7.1 to conclude that
(B.6. l7) {x : xi < Zi, i = 1, . . . , k} where as usual subscripts for all z E Rk . Here {x : x < z} indicate coordinate labels. A proof of this result may be found in more advanced texts in probability, for instance, Billingsley (1995) and in Chung (1974). An important corollary follows. =
Corollary 8.6.1 1/the X, are as in the statement of Theorem B.6.6, if� is positive definite and if X = � E�=t Xi, then (B.6.l8)
• '
•
l
•
l
; '
'
j i
1 l
·�
' '
j
-1
,
'
l
Section 8.7
Convergence for Random Vectors:
Op
and OJ> Notat ion
Sll
Proof fo(X - p) (Sn - ni')/Jn. Thus, we need only note that the function g(x) = l xT:E � 1 x from Rl.: to R is continuous and that if z xl: N(O, � ) . then zT�- z �
"'
"'
(Corollary B.6.2).
8.7
c
Op
CONVERGENCE FOR RANDOM VECTORS: AND op NOTATION
The notion of convergence in probability and convergence in law for random variables dis cussed in section A . l . 5 generalizes to random vectors and even abstract valued random elements taking their values in metric spaces. We give the required generalizations for ran dom vectors and, hence, random matrices here. We shall also introduce a unique notation that makes many computations easier. In the following, B.7.1 A sequence of random vectors
Z - (Z1, . . . , Zd )T
Zn
iff
=
( Zn1 , . . . Znd )T ,
p
IZ n - Z l � or equivalently
Znj � Zj
for
that depend on n. Thus,
1 < j < d.
Zn � Z iff
converges
Zn
n - 1 L:� 1 Z;. If EIZI <
oo,
then Zn !.:. p
in probability to
are considered under probabilities
for every € > WLLN (the weak law of large numbers).
Euclidean distance.
0
Note that this definition also makes sense if the
Pn
I · I denotes
0. -
Let Z1, . . . , Zn be i.i.d. as Z and let Zn � EZ.
EjZI 2 < oo, the result follows from Chebychev's For a proof in the EIZI < oo case, see Billingsley ( 1995). When
inequality
as
in Appendix A.
The following definition is subtler.
8.7.2 A sequence
L(Zn) � L(Z), iff for all functions
{Zn}
of random vectors
converges in law
to
Z,
written
Zn
c
--+
Z
or
h(Zn) !:, h(Z)
h : Rd--+R, h continuous.
We saw this type of convergence in the central limit theorem (8 .6.6). Note that in the definition of convergence in law, the random vectors
Zn, Z only play
the role of defining marginal distributions. No requirement is put on joint distributions of
{ Zn} , Z.
Thus, if Z1 ,
. . . , Zn
are i.i.d., Zn
An equivalent statement to (B .7.2) is
.£... Z1, but Zn + Z , .
Eg(Z ,.)� Eg(Z) for all
g
:
Rd �R
p
(B.7.3)
continuous and bounded. Note that (B.7.3) implies (A.l4.6).
following stronger statement can be established .
The
512
Additional Topics in Probability and Analysis
Appendix B
' '
'
Theorem B.7.1 Zn _£. Z ijf (B.7.3) holds for every g : Rd_.,.flP such that g is bounded and if Ag = { z : g is continuous at z } then P[Z E Ag] = 1. Here are some further properties.
'
'
'
; '
' ;
' '
Proposition B.7.1 (a) IfZn £. Z and g is continuous from R" to RP, then g(Zn) £. g(Z). (b) The implication in (a) continues
conclusion above.
to
' '
.
'
,
hold if "P" is replaced by "{." in premise and
(c) The conclusion of (a) and (b) continues to hold if continuity of g is replaced by P[Z E Ag] = I where Ag - ( z : g is continuous at z } . 8.7.4
If Zn
p
�
Z then Zn
c
�
Z.
A partial converse follows.
j
L p 8.7.5 IfZn -----+ zo (a constant), then Zn __,. zo. ' '
I; '
•
•
I'
,,
II
i I.
'
'
•
Note that (B.7.4) and (8.7.5) generalize (AJ4.3), (Al4.4).
Theorem 8.7.2 Slutsky's Theorem. Suppose Z?; = (u;:, V!) where Zn is a d vector, Un is b-dimensional, Vn is c = d - b-dimensional and (a) Un (b)
" �
U
Vn � v where v is a constant vector.
(c) g is a continuous function from Jr1
to
Rb.
•
I
'
Then Again continuity of g can be weakened to P[(UT, vT )T E Ag] We next give important special cases of Slutsky's theorem:
=
L
�
Example 8.7.1 (a) d = 2, b = c = !, g(u,v) = au + {Jv. g(u,v) = uv or g(u, v) = ;; and v # 0. This covers (A.I4.9) (b)
'i '
Vn = IIVni;]]bxb,c = b2, g(uT,vT) = vu where v is a b x b matrix. To apply Theorem B.? .2. rearrange V and v as c x 1 vectors with c = b2• n
i
' ,
Section
8.7
Convergence for Random Vectors: O p and Op Notation
'---� -��--_-::_-:::
513
Combining this with b = c = dj2,g(uT,vT) = u + v, we obtain, that if the b x b matrix IIVn l l � llvll and Wn. b x 1, tends in probability to w, a constant vector, and c Un --+ U, then (8.7.6) The proof of Theorem B.7.l and other preceding results comes from the following the orem due to Hammersley (I 952), which relates the two modes of convergence. Skorokhod ( 1956) extended the result to function spaces . Theorem 8.7.3 Hammersley. Suppose vectors Zn � Z in the sense ofDefinition 8.7.2. There exists (on a suitable probability space) a sequence of random vectors { z.;:} and a vector Z"' such that
(i) .C(Z;;) = .C(Zn) for all n, .C ( Z* )
=
.C(Z)
(ii) z;; � Z"'. A
somewhat stronger statement can also be made, namely, that zn"' '::.!.
z"'
where �· refers to almost sure convergence defined by Zn �- Z if P
(n�oo lim Zn = z) = I.
This type of convergence also appears in the following famous law. SLLN (the strong law of large numbers). Let Zh . . . , Zn be i.i.d. as Z and let Zn n-l L:� 1 zi. then Zn �- J.L = EZ iff E I Z I < 00.
-
For a proof, see Billingsley ( 1 995). The proof of Theorem B.7.3 is easy for d = I. (Problem B.7.1) For the general case refer to Skorokhod ( 1 956). Here are the proofs of some of the preceding assertions using Hammersley's theorem and the following. Theorem 8.7.4 If the vector Un converges in probability to U and g P[U E A9] = I, then
is
bounded and
For a proof see Billingsley (1995, p. 209). Evidently, Theorem B.7.4 gives the equiva lence between (8.7.2) and (B.7.3) and establishes Theorem B. 7. I . The proof of Proposition B.7 .I (a) is easy if g is unifonnly continuous; that is. for every £ > 0 there exists 6 > 0 such that
{(z, , z,) : lg(zi) - g (zz ) l > E) C {(z,, zz) : l zl - Zzl > 0). A
stronger result (in view of Theorem B.7.4) is as follows.
. .
5 14
Additional Topics in Probability and Analysis
Appendix B
7.5 Dominated Conl'ergence Theorem. If { H111 } , lV are random l'ariab/es, W" -': W, P[IWn J < jliFJ] I and E]WJ < oo, then EWn EW. Proposition B.7.1 (b) and (c) follow from the (a) part and Hammersley's theorem. Then (8.7 .3) follows from the dominated convergence because if g is bounded by M and uni formly continuous, then for c5 > 0 (B.7.7) ]Epg (Zn) - Epg ( Z ) J < Theorem B.
=
�
sup{]g(z) - g(z')J : ]z - z'J < J} + MP[]Zn - ZJ > J]
Let n-+oo to obtain that lim supn ]Epg(Zn ) Epg(Z)J < sup{]g(z) - g(z')J : Jz - z'J < J} (8.7.8) and let 0-----+0. The general argument is sketched in Problem 8.7.3. For 8.7.5 let h,(z) l(]z - zol > ) Note that Ah, {z : ]z - zo] " c). Evidently if P[Z zo] 1, P[Z E Ah,] I for all > 0. Therefore, by Problem 8.7.4, P [IZn - zo] > •J�P[]Z - zo] > <] 0 because P[Z zo] 1 and the result follows. Finally Slutsky's theorem is easy because by Hammersley's theorem there exist V,:", u;: with the same marginal distributions as Vn, Un and u:;_ � U* , v,: � Then (U:;_, V,;" ) !:..,. (U*,v), whichbyProposition8.7.1 implies that (Un, Vn) .£. (U,v), which by Theorem B.7.1 implies Slutsky's theorem. In deriving asymptotic properties of some statistical methods, it will be convenient to use convergence of densities. We will use the following. -
= =
=
< .
f
=
=
=
=
=
v.
Theorem 8.7.6 Scheffe's Theorem. Suppose Pn(z) and p(z) are densities or frequency functions on Rd such that Pn (z) p( z) as n -----+ oo for all z E Rd. Then ---t
j IPn(z) - p(z)]dz
�
0 as n � oo
in the continuous case with a sum replacing the integral in the discrete case. Proof.
We give the proof in the continuous case. Note that ]Pn(z) - p(z)J
where x+ m =
ax
{ O x}. ,
= Pn(z)
- p(z) + 2 [p(z) - Pn(z)J+
Thus,
'
j IPn(z) - p(z)]dx jIPn (z) - p(z)]dz + 2 j[p(z) - Pn(z)j+dz. =
The first tenn on the right is zero. The second tenn tends to zero by applying the dominated convergence theorem to Un [p(Z) - Pn(Z)]+fp(Z) and g(u) u, u E [0, 1], because =
=
0 [p(z ) - Pn(z)]+ < p(z). Proposition 8.7.2 /fZn and Z have densities orfrequency functions Pn(z) and p(z) with Pn(z) � p(z) as n � oofor all z E Rd, then Zn .£. Z.
' '
'
'
'
1
'
'
'
·:, •
l
Section 8.7
SIS
Convergence for Random Vectors: Op and Op Notation
Proof. We give the proof in the continuous case. Let g bounded, say 191 < M < co . Then
IEg (Zn ) - Eg(Z ) I
=
:
Rd
--+
R be continuous and
j g(z)[pn(z) - p(z)]dz < M j IPn(z) - p(z)ldz D
and the result follows from (B.7.3) and Theorem 8.7.5.
Remark 8.7.1 Theorem 8.7.5 can be strengthened considerably with a suitable back ground in measure theory. Specifically, suppose J-t is a sigma finite measure on X. If 9n and g are measurable functions from X to R such that ( I)
9n
�
g in measure, i.e., I'{ x : l9n (x) - g(x) I > c)
�
0 as n
· ;
oo for all < > 0
and (2) J lun lrdl' � J lg ! dl' as n � oo for some r > l , then J l9n -g ldl' � O as n � oo. D A proof of this result can be found in Billingsley (1979, p. 184). '
L
Theorem 8.7.7 Polya's Theorem. Suppose real-valued Xn --+ X. Let Fn,F be the distribution Junctions of Xn, X, respectively. Suppose F is continuous. Then sup IFn(x ) - F(x)l X
�
0.
�
F(x) and Fn(x - 0) � F(x) for all x . Given �: > 0, choose x, x such that F(x) :S £, 1 - F(x) < £. Because F is uniformly continuous on [x, x], there exists c5 (£ ) > 0 such that for all Jl. < x1, x2 < x, < XK = x be such that lx , - Xzl < J(c) '* IF(x,) - F(x2)1 < f. Let x = xo < x1 ]x; - x; - , J :S J(<) for all j. Then Outline of Proof. By Proposition 8.7.1, Fn (x)
•
•
•
JFn(x) - F(x)l < max{IFn(x; ) - F(x; ) ] , 1Fn(X;+1) - F(x;+I ) ]} sup x3<x <xj+l + sup {max{(Fn(x) - Fn (x;)),Fn (X;+!) - Fn (x) }} x,.<x<x3+l +max{(F(x) - F(x;)), F(x;+I) - F(x) } .
sup IFn(x) - F (x) ] x<x sup IFn (x ) - F(x)l x>X
< Fn (X) <
+ F(x)
( 1 - Fn (x)) + (1 - F(x)).
Conclude that, limn supx IFn(x) - F(x ) l < 3< and the theorem follows. We end this section with some useful notatiOn.
D
Additional Topics in Probability and Analysis
516
t
' '
t
The Op , ;::. p, and
'
Appendix B
op Notation
The following asymptotic order in probability notation is useful.
I '
'
Un � o p(l ) Un Op( l )
' '
iff iff
�
iff iff
Note that
p
Un -------+ 0 \It > 0, 3M < oo such that \In P[ I Un l > M] S t I Un l � {1 ) Op /Vn l IUn l 1 � Op { ) /Vnl
1 O � O (1 O p(1)op(1) � op ), p(1) + op( l) p( ) ,
c
(8.7.9)
and Un � U =? Un � Op(1). Suppose Z1, . . . z are i.i.d. as Z with EIZI < oo Set 11- � E(Z ) , then Z" 1-' + op(1) by the WLLN. If E IZI 2 < oo, then Zn � J.L + Op(n - l ) by the central limit theorem. ,
8.8
"
.
MULTIVARIATE CALCULUS
' •
8.8.1 A function T : Rd _____, R is linear iff
T{ax1
+ f3x2) � aT(xi) + {3T(x2 )
E R, x1 , x2 E Rd. More generally, T : Rd x · · · x Rd _____, R is k linear iff k T(x 1 , x2, . . . , xk ) is linear in each coordinate separately when the others are held fixed.
for all
a, f3
8.8.2 T
-
(T1 ,
•
•
•
, TP ) mapping Rd x · · · x Rd ---+ RP is said to be k linear iff T1 ,
• • •
,
Tp
k
are k linear as in B.8.1.
8.8.3 T is k linear asin B.8.1 iff there exists an array {ai1, ,i�c : 1 < ij such that if x, = (xti , . . , Xtd ) , 1 < t < k, then ..
.
1 < d, < j < k} _
' '
.
T(xt, . - . , xk )
=
d
d
i�c=l
il=l
_L · · · _L aiJ,
.
.
. ,i"
k
IJ XjiJ
j=l
•
(8.8.4)
B.8.5 1f h : 0 � RP, 0 open c Rd, h = (h,, . , hp), then h is Frt!chet differentiable at x E 0 iff there exists a (necessarily unique) linear map Dh(x) : Rd _____, RP such that .
•
.
lh(y) - h(x) - Dh{x)(y - x)l � o(IY - xi)
where I I is the Euclidean norm. If p = 1, Db is the total differentia{. ·
,
(B.8.6)
j •
i
J •
, •
·'
'
.
'
l
-
Section 8.8
517
M ultivariate Calculus
More generally, h is m times Frichet differentiable iff there exist l linear operators D 1 h (x) : Rd x . . . x Rd R", 1 < l < m such that �
I
m Dl h
h(y) - h(x) - L l! (x)(y - x, . . . , y - x) 1=1
=
(8.8.7)
o([y - x[m).
8.8.8 If h is m times Frechet differentiable, then for 1 < j < p, hj has partial derivatives of order < m at x and the jth component of D1h(x) is defined by the array o1hj(x) l . : < 1.j < d, + + f = 1, o < fi :::; , 1
{
8.8.9 h is
"d
m
. }
.
times Fr6chet differentiable at x if hj has partial derivatives of order up to m on 0 that are continuous at x. B.8.10 Taylor's Formula
If hj, 1 < j < p has continuous partial derivatives of order up to m + 1 on 0, then, for all x , y E 0, m D 1 h(x) nm+ 1 h(x') h(y) = h(x) + L 'l. (y - x, . . . , y - x) + m + ) .' (y - x, . . . , y - x) ( ! =1 1 (8.8. 1 1 ) for some x* = x + a*(y - x), 0 < a* < 1. These classical results may be found, for instance, in Dieudonne (1960) and Rudin (1991 ). As a consequence. we obtain the following. B.8.12 Under the conditions of 8.8.1 0,
� D1h(x) (y - x, . . . , y - x) h(y) - h(x) l! � l=l
< ( (m + 1 )!)- 1 sup{ [Dm+'h(x')[ : [x' - x[ < [y - x[)[y - x[ m+I
for all x,y E 0.
B.8.13 Chain Rule. Suppose h : 0 � RP with derivative Dh and g : h(0) derivative Dg. Then the composition g o h : 0 __,. Rq is differentiable and
�
R• with
D(g o h)(x) = Dg(Dh(x) ).
As a consequence. we obtain the following.
p, h be I - l and continuously Frechet differentiable on a neighborhood of x E O, and Dh(x) = zh· (x) be nonsingular. Then h- 1 : h(O) � o is Frechet B.8.14 Let d
=
l
:x:;
differentiable at y = h(x) and
ll
px p
Dh-' (h(x))
=
[Dh(xJr'-
518
Appendix B
Additional Topics in Probability and Analysis
6.9
CONVEXITY AND I N EQUALITIES
Convexity A subset S of
Rk is said to be
ax+ (1- a)y E S.
When
k
convex
if for every
x, y
a
E [0, lj,
1, c(!nvex sets are finite and infinite intervals. When
=
spheres, rectangles, and hyperplanes are convex. The point the convex set S iff for every d
E S, and every
i- 0�
k > 1,
x0 belongs to the interior S0 of (8.9.1)
where 0 denotes the empty set. A function
g from a convex set S to R is said to be convex if
g(ax + (I - a)y)
< ag(x)
+ (I - a)g(y),
all
x, y E S,
a E [0, 1].
(B.9.2)
g is said to strictly convex if(B.9.2) holds with < replaced by < for all x � y, a '/c {0, 1}. Convex functions are continuous on 8°. When k = 1, if g" exists, convexity is equivalent to g"(x) > 0, x E S; strict convexity holds if g"(x) > 0, x E S. For g convex and fixed x, y E S, h(a) = g(ax + (I - f')Y)) is convex in a, a E [0, 1]. When k > I, if 8g2(x)j8xi8xi exists, convexity is equivalent to
L u,u;82g (x)f8x,8x; > 0,
all
u
E
Rk and x E S.
• '
i,j A function convex.
h from a convex set S to R is said to be (strictly) concave if g =
Jensen's Inequa�ty. and
If S c
EU is finite, then EU
Rk is convex and closed, g ij convex on S,
E S,
-h is (strictly)
P[U E S]
Eg(U) exists and
Eg(U) > g(EU) with equality if and only if there are a and
=
I,
(B.9.3)
bkx 1 such that
In particular, if g is strictly convex, equality holds in (B.9.3) if and only if P[U
for some Ckxl·
'
"
=
c]
=
' '
l
1
For a proof see Rockafellar (1970). We next give a useful inequality relating product moments to marginal moments.

Hölder's Inequality. Let r and s be numbers with r, s > 1, r^{-1} + s^{-1} = 1. Then

E|XY| ≤ {E|X|^r}^{1/r} {E|Y|^s}^{1/s}.   (B.9.4)

When r = s = 2, Hölder's inequality becomes the Cauchy–Schwarz inequality (A.11.17). For a proof of (B.9.4), see Billingsley (1995, p. 80) or Problem B.9.3.
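Both inequalities are easy to illustrate by simulation. A minimal Python sketch (NumPy assumed; the gamma, normal, and exponential distributions and the exponents r = 3, s = 3/2 are arbitrary illustrative choices) checks (B.9.3) with the strictly convex g(u) = u² and (B.9.4), using sample averages in place of expectations:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.gamma(shape=2.0, scale=1.5, size=100_000)
x = rng.normal(size=100_000)
y = rng.exponential(size=100_000)

# Jensen (B.9.3): E g(U) >= g(EU) for convex g; here g(u) = u^2.
print("E[U^2]  =", np.mean(u ** 2))
print("(EU)^2  =", np.mean(u) ** 2)          # should be the smaller of the two

# Hoelder (B.9.4) with r = 3, s = 3/2 (so 1/r + 1/s = 1).
r, s = 3.0, 1.5
lhs = np.mean(np.abs(x * y))
rhs = np.mean(np.abs(x) ** r) ** (1 / r) * np.mean(np.abs(y) ** s) ** (1 / s)
print("E|XY|       =", lhs)
print("Hoelder rhs =", rhs)                  # lhs <= rhs up to Monte Carlo error
```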
We conclude with bounds for tails of distributions.

Bernstein Inequality for the Binomial Case. Let S_n ∼ B(n, p). Then

P[|S_n − np| ≥ nε] ≤ 2 exp{−nε²/2}  for ε > 0.   (B.9.5)

That is, the probability that S_n deviates from its expected value np by more than a multiple nε of n tends to zero exponentially fast as n → ∞. For a proof, see Problem B.9.1.

Hoeffding's Inequality. The exponential convergence rate (B.9.5) for the sum of independent Bernoulli variables extends to the sum S_n = Σ_{i=1}^n X_i of i.i.d. bounded variables X_i, |X_i − μ| ≤ c_i, where μ = E(X_1):

P[|S_n − nμ| ≥ x] ≤ 2 exp{ −x² / (2 Σ_{i=1}^n c_i²) }.   (B.9.6)

For a proof, see Grimmett and Stirzaker (1992, p. 449) or Hoeffding (1963).
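A quick Monte Carlo illustration of (B.9.6): the minimal Python sketch below (NumPy assumed; the uniform distribution, sample size, and thresholds are illustrative choices) takes X_i uniform on (0, 1), so μ = 1/2 and c_i = 1/2, and compares the simulated probability P[|S_n − nμ| ≥ x] with the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 50_000
mu, c = 0.5, 0.5                      # U(0,1): mean 1/2, |X_i - mu| <= 1/2

s_n = rng.random((reps, n)).sum(axis=1)

for x in [5.0, 8.0, 12.0]:
    empirical = np.mean(np.abs(s_n - n * mu) >= x)
    bound = 2 * np.exp(-x ** 2 / (2 * n * c ** 2))   # (B.9.6) with c_i = c
    print(f"x={x:5.1f}  P_hat={empirical:.4f}  Hoeffding bound={bound:.4f}")
```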
B.10 TOPICS IN MATRIX THEORY AND ELEMENTARY HILBERT SPACE THEORY

B.10.1 Symmetric Matrices

We establish some of the results on symmetric nonnegative definite matrices used in the text and in Section B.6. Recall that A_{p×p} is symmetric iff A = A^T. A is nonnegative definite (nd) iff x^T A x ≥ 0 for all x, positive definite (pd) if the inequality is strict unless x = 0.

B.10.1.1 The Principal Axis Theorem

(a) A is symmetric nonnegative definite (snd) iff there exists C_{p×p} such that

A = C C^T.   (B.10.1)

(b) A is symmetric positive definite (spd) iff C above is nonsingular.

The "if" part in (a) is trivial because then x^T A x = x^T C C^T x = |C^T x|² ≥ 0. The "only if" part in (b) follows because |C^T x|² > 0 unless x = 0 is equivalent to C^T x ≠ 0 unless x = 0, which is nonsingularity. The "if" part in (b) follows by noting that C is nonsingular iff det(C) ≠ 0 and det(C C^T) = det²(C). Parenthetically we note that if A is positive definite, A is nonsingular (Problem B.10.1). The "only if" part of (a) is deeper and follows from the spectral theorem.
B.10.1.2 Spectral Theorem

(a) A_{p×p} is symmetric iff there exist P orthogonal and D = diag(λ_1, ..., λ_p) such that

A = P D P^T.   (B.10.2)

(b) The λ_j are real, unique up to labeling, and are the eigenvalues of A. That is, there exist vectors e_j, |e_j| = 1, such that

A e_j = λ_j e_j.   (B.10.3)

(c) If A is also snd, all the λ_j are nonnegative. The rank of A is the number of nonzero eigenvalues. Thus, A is positive definite iff all its eigenvalues are positive.

(d) In any case the vectors e_j can be chosen orthonormal and are then unique up to label. Thus, Theorem B.10.1.2 may equivalently be written

A = Σ_{i=1}^p e_i e_i^T λ_i   (B.10.4)

where e_i e_i^T can be interpreted as projection on the one-dimensional space spanned by e_i (Problem B.10.2). (B.10.1) follows easily from (B.10.2) by taking C = P diag(λ_1^{1/2}, ..., λ_p^{1/2}) in (B.10.1). The proof of the spectral theorem is somewhat beyond our scope; see Birkhoff and MacLane (1953, pp. 275–277, 314), for instance.
B.10.1.3 If A is spd, so is A^{-1}.

Proof. A = P diag(λ_1, ..., λ_p) P^T implies A^{-1} = P diag(λ_1^{-1}, ..., λ_p^{-1}) P^T. □

B.10.1.4 If A is spd, then max{x^T A x : x^T x ≤ 1} = max_i λ_i.
B.10.2 Order on Symmetric Matrices

As we defined in the text, for A, B symmetric, A ≤ B iff B − A is nonnegative definite. This is easily seen to be an ordering.

B.10.2.1 If A and B are symmetric and A ≤ B, then for any C,

C A C^T ≤ C B C^T.   (B.10.5)

This follows from the definition of snd or the principal axis theorem because B − A snd means B − A = E E^T, and then C B C^T − C A C^T = C(B − A)C^T = (CE)(CE)^T. Furthermore, if A and B are spd and A ≤ B, then

B^{-1} ≤ A^{-1}.   (B.10.6)

Proof. After Bellman (1960, p. 92, Problems 13, 14). Note first that, if A is spd,

x^T A^{-1} x = max{2x^T y − y^T A y : y ∈ R^p}   (B.10.7)

because y = A^{-1}x maximizes the quadratic form. Then, if A ≤ B,

2x^T y − y^T A y ≥ 2x^T y − y^T B y

for all x, y. By (B.10.7) we obtain x^T A^{-1} x ≥ x^T B^{-1} x for all x and the result follows. □
B.10.2.2 The Generalized Cauchy–Schwarz Inequality

Let Σ = (Σ_{11}, Σ_{12}; Σ_{21}, Σ_{22}) be spd, (p + q) × (p + q), with Σ_{11} p × p and Σ_{22} q × q. Then Σ_{11}, Σ_{22} are spd. Furthermore,

Σ_{12} Σ_{22}^{-1} Σ_{21} ≤ Σ_{11}.   (B.10.8)

Proof. From Section B.6 we have noted that there exist (Gaussian) random vectors U_{p×1}, V_{q×1} such that Σ = Var(U^T, V^T)^T, Σ_{11} = Var(U), Σ_{22} = Var(V), Σ_{12} = Cov(U, V). The argument given in B.6 establishes that

Var(U − Σ_{12} Σ_{22}^{-1} V) = Σ_{11} − Σ_{12} Σ_{22}^{-1} Σ_{21} ≥ 0   (B.10.9)

and the result follows. □

B.10.2.3 We note also, although this is not strictly part of this section, that if U, V are random vectors as previously (not necessarily Gaussian), then equality holds in (B.10.8) iff for some b,

U = Σ_{12} Σ_{22}^{-1} V + b   (B.10.10)

with probability 1. This follows from (B.10.9) since a^T Var(U − Σ_{12} Σ_{22}^{-1} V) a = 0 for all a iff

a^T (U − Σ_{12} Σ_{22}^{-1} V) = a^T b   (B.10.11)

with probability 1 for all a, where b is E(U − Σ_{12} Σ_{22}^{-1} V). But (B.10.11) for all a is equivalent to (B.10.10).
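A numerical sketch of (B.10.8)–(B.10.9) in Python (NumPy assumed; the dimensions and the random positive definite Σ are arbitrary illustrative choices): partition a covariance matrix and check that Σ_{11} − Σ_{12} Σ_{22}^{-1} Σ_{21} is nonnegative definite.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 2, 3
M = rng.normal(size=(p + q, p + q))
Sigma = M @ M.T + 0.1 * np.eye(p + q)            # spd covariance matrix

S11, S12 = Sigma[:p, :p], Sigma[:p, p:]
S21, S22 = Sigma[p:, :p], Sigma[p:, p:]

gap = S11 - S12 @ np.linalg.solve(S22, S21)      # Var(U - S12 S22^{-1} V) in (B.10.9)
eigvals = np.linalg.eigvalsh(gap)
print("eigenvalues of Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21:", eigvals)
print("nonnegative definite:", np.all(eigvals >= -1e-10))    # (B.10.8)
```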
B.10.3 Elementary Hilbert Space Theory

A linear space H over the reals is a Hilbert space iff

(i) It is endowed with an inner product (·, ·) : H × H → R that is bilinear,

(a h_1 + b h_2, c h_3 + d h_4) = ac(h_1, h_3) + ad(h_1, h_4) + bc(h_2, h_3) + bd(h_2, h_4),

symmetric, (h_1, h_2) = (h_2, h_1), and satisfies (h, h) ≥ 0 with equality iff h = 0. It follows that if ‖h‖² = (h, h), then ‖·‖ is a norm. That is,

(a) ‖h‖ = 0 iff h = 0
(b) ‖a h‖ = |a| ‖h‖
(c) ‖h_1 + h_2‖ ≤ ‖h_1‖ + ‖h_2‖.

(ii) H is complete: every Cauchy sequence {h_n}, that is, one with ‖h_n − h_m‖ → 0 as n, m → ∞, converges to some h ∈ H.
B.10.3.1 Orthogonality and Pythagoras's Theorem

h_1 is orthogonal to h_2 iff (h_1, h_2) = 0. This is written h_1 ⊥ h_2. This is the usual notion of orthogonality in Euclidean space. We then have

Pythagoras's Theorem. If h_1 ⊥ h_2, then

‖h_1 + h_2‖² = ‖h_1‖² + ‖h_2‖².   (B.10.12)

An interesting consequence is the inequality, valid for all h_1, h_2,

|(h_1, h_2)| ≤ ‖h_1‖ ‖h_2‖.   (B.10.13)

In R², (B.10.12) is the familiar "square on the hypotenuse" theorem, whereas (B.10.13) says that the cosine of the angle between x_1 and x_2 is at most 1 in absolute value.
B.10.3.2 Projections on Linear Spaces

We naturally define that a sequence h_n ∈ H converges to h iff ‖h_n − h‖ → 0. A linear subspace L of H is closed iff h_n ∈ L for all n and h_n → h imply h ∈ L. Given a closed linear subspace L of H we define the projection operator Π(· | L) : H → L by: Π(h | L) is that h' ∈ L that achieves min{‖h − h'‖ : h' ∈ L}. It may be shown that Π is characterized by the property

h − Π(h | L) ⊥ h' for all h' ∈ L.   (B.10.14)

Furthermore,

(i) Π(h | L) exists and is uniquely defined.
(ii) Π(· | L) is a linear operator: Π(αh_1 + βh_2 | L) = αΠ(h_1 | L) + βΠ(h_2 | L).
(iii) Π is idempotent, Π² = Π.
(iv) Π is norm reducing:

‖Π(h | L)‖ ≤ ‖h‖.   (B.10.15)

In fact, and this follows from (B.10.12),

‖h‖² = ‖Π(h | L)‖² + ‖h − Π(h | L)‖².   (B.10.16)

Here h − Π(h | L) may be interpreted as the projection on L^⊥ = {h : (h, h') = 0 for all h' ∈ L}. Properties (i)–(iii) of Π above are immediate. All of these correspond to geometric results in Euclidean space. If x is a vector in R^p, Π(x | L) is the point of L at which the perpendicular to L from x meets L. (B.10.16) is Pythagoras's theorem again. If L is the column space of a matrix A_{n×p} of rank p ≤ n, then

Π(y | L) = A(A^T A)^{-1} A^T y.   (B.10.17)
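The projection of (B.10.17) and the identities (B.10.15)–(B.10.16) can be checked directly. A minimal Python sketch (NumPy assumed; the random design matrix and response are illustrative choices) forms the projection matrix A(A^T A)^{-1} A^T and verifies idempotence, norm reduction, and the Pythagoras decomposition:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 3
A = rng.normal(size=(n, p))                      # full column rank with probability 1
y = rng.normal(size=n)

H = A @ np.linalg.solve(A.T @ A, A.T)            # projection onto the column space (B.10.17)
y_hat = H @ y                                    # Pi(y | L): the fitted values
resid = y - y_hat                                # projection onto the orthogonal complement

print(np.allclose(H @ H, H))                                # idempotent: Pi^2 = Pi
print(np.linalg.norm(y_hat) <= np.linalg.norm(y))           # (B.10.15)
print(np.isclose(y @ y, y_hat @ y_hat + resid @ resid))     # (B.10.16), the ANOVA identity
print(np.allclose(A.T @ resid, 0))                          # residual orthogonal to columns of A
```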
This is the formula for obtaining the fitted value vector Ŷ = (Ŷ_1, ..., Ŷ_n)^T by least squares in the linear regression Y = Aβ + ε, and (B.10.16) is the ANOVA identity. The most important Hilbert space other than R^p is L²(P) = {all random variables X on a (separable) probability space such that EX² < ∞}. In this case we define the inner product by

(X, Y) = E(XY)   (B.10.18)

so that

‖X‖ = [E(X²)]^{1/2}.   (B.10.19)
All properties needed for this to be a Hilbert space are immediate save for completeness, which is a theorem of F. Riesz. Maintaining our geometric intuition we see that, if E(X) = E(Y) = 0, orthogonality simply corresponds to uncorrelatedness and Pythagoras's theorem is just the familiar

Var(X + Y) = Var(X) + Var(Y)

if X and Y are uncorrelated. The projection formulation now reveals that what we obtained in Section 1.4 are formulas for projection operators in two situations:

(a) L is the linear span of 1, Z_1, ..., Z_d. Here

Π(Y | L) = E(Y) + (Σ_{ZZ}^{-1} Σ_{ZY})^T (Z − E(Z)).   (B.10.20)

This is just (1.4.14).

(b) L is the space of all X = g(Z) for some g (measurable). This is evidently a linear space that can be shown to be closed. Here,

Π(Y | L) = E(Y | Z).   (B.10.21)

That is what (1.4.4) tells us.
The identities and inequalities of Section 1.4 can readily be seen to be special cases of (B.10.16) and (B.10.15).

For a fuller treatment of these introductory aspects of Hilbert space theory, see Halmos (1951), Royden (1968), Rudin (1991), or more extensive works on functional analysis such as Dunford and Schwartz (1964).

B.11 PROBLEMS AND COMPLEMENTS
Problems for Section B. I 1. An urn contains four red and four black balls. Four balls are drawn at random without
replacement. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn. (a) Find the joint distribution of Z and Y and the conditional distribution of Y given Z and Z given Y.
(b) Find E(Y | Z = z) for z = 0, 1, 2.

2. Suppose Y and Z have joint density function p(z, y) = z + y for 0 < z < 1, 0 < y < 1.

(a) Find E(Y | Z).

(b) Compute EY = E(E(Y | Z)) using (a).

3. Suppose Z_1 and Z_2 are independent with exponential E(λ) distributions. Find E(X | Y) when X = Z_1 and Y = Z_1 + Z_2.

Hint: E(Z_1 + Z_2 | Y) = Y.

4. Suppose Y and Z have the joint density p(z, y) = k(k − 1)(z − y)^{k−2} for 0 < y < z < 1, where k ≥ 2 is an integer.

(a) Find E(Y | Z = z).

(b) Find E(Y e^{[Z + (1/Z)]} | Z = z).
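For Problem 3, symmetry gives E(Z_1 | Z_1 + Z_2) = (Z_1 + Z_2)/2, and this is easy to see by simulation. A minimal Python sketch (NumPy assumed; λ and the conditioning values are illustrative choices) bins on Y = Z_1 + Z_2 and compares the average of Z_1 within a narrow bin with y/2:

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 1.3
z1 = rng.exponential(1 / lam, size=500_000)
z2 = rng.exponential(1 / lam, size=500_000)
y = z1 + z2

for y0 in [1.0, 2.0, 3.0]:
    in_bin = np.abs(y - y0) < 0.02               # condition on Y being close to y0
    print(f"y0={y0}:  mean of Z1 given Y near y0 = {z1[in_bin].mean():.3f}   y0/2 = {y0/2:.3f}")
```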
5. Let (X_1, ..., X_n) be a sample from a Poisson P(λ) distribution and let S_m = Σ_{i=1}^m X_i, m ≤ n.

(a) Show that the conditional distribution of X = (X_1, ..., X_n) given S_n = k is multinomial M(k, 1/n, ..., 1/n).

(b) Show that E(S_m | S_n) = (m/n) S_n.
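Problem 5(a) can be checked by simulation. A minimal Python sketch (NumPy assumed; n, λ, and k are illustrative choices) conditions Poisson draws on their total and compares the conditional mean and variance of X_1 with the multinomial values k/n and k(1/n)(1 − 1/n):

```python
import numpy as np

rng = np.random.default_rng(6)
n, lam, k = 4, 2.0, 8
X = rng.poisson(lam, size=(400_000, n))
keep = X[X.sum(axis=1) == k]                     # condition on S_n = k

x1 = keep[:, 0]
print("conditional mean of X1:", x1.mean(), " multinomial value:", k / n)
print("conditional var  of X1:", x1.var(),  " multinomial value:", k * (1 / n) * (1 - 1 / n))
```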
6. A random variable X has a P(λ) distribution. Given X = k, Y has a binomial B(k, p) distribution.

(a) Using the relation E(e^{tY}) = E(E(e^{tY} | X)) and the uniqueness of moment generating functions, show that Y has a P(λp) distribution.

(b) Show that Y and X − Y are independent and find the conditional distribution of X given Y = y.
7. Suppose that X has a normal N(μ, σ²) distribution and that Y = X + Z, where Z is independent of X and has a N(γ, τ²) distribution.

(a) What is the conditional distribution of Y given X = x?

(b) Using Bayes rule find the conditional distribution of X given Y = y.
8. In each of the following examples:

(a) State whether the conditional distribution of Y given Z = z is discrete, continuous, or of neither type.

(b) Give the conditional frequency, density, or distribution function in each case.

(c) Check the identity E[E(Y | Z)] = E(Y).

(i) p_{(Z,Y)}(z, y) = 1/π if z² + y² ≤ 1, and 0 otherwise.

(ii) p_{(Z,Y)}(z, y) = 4zy if 0 < z < 1, 0 < y < 1, and 0 otherwise.

(iii) Z has a uniform U(0, 1) distribution, Y = Z².

(iv) Z has a U(−1, 1) distribution, Y = Z².

(v) Z has a U(−1, 1) distribution, Y = Z² if Z² < 1/4 and Y = 1/4 if Z² ≥ 1/4.
9. (a) Show that if E(X2) and E(Y2) are finite then Cov(X, Y) = Cov(X, E(Y I X)). (b) Deduce that the random variables X and Y in Problem B.l.8(c) (i) have correlation 0 although they are not independent.
. .
10. (a) If X 1 , . , Xn is a sample from any population and Sm
=
E� 1 Xi, m :S n, show
that the joint distribution of (Xi, Sm) does not depend on i, i < m . Hint: Show that the joint distribution of (X., . . . , Xn) is the same as that of (Xip . . . , xi,. ) where (il, . . . ' in ) is any permutation of (1, . . . ' n) .
(b) Assume that if X and Y are any two random variables, then the family of condi tional distributions of X given Y depends only on the joint distribution of (X, Y). Deduce from (a) that E(X, I Sn) = · · · = E(Xn I Sn) and, hence, that E(Sm I Sn) = (m/n)Sn.
11. Suppose that Z has a binomial B(N, θ) distribution and that given Z = z, Y has a hypergeometric H(z, N, n) distribution. Show that

P[Z = z | Y = y] = \binom{N−n}{z−y} θ^{z−y} (1 − θ)^{N−n−(z−y)}

(i.e., the binomial probability of z − y successes in N − n trials).

Hint: P[Z = z | Y = y] = \binom{N−n}{z−y} θ^z (1 − θ)^{N−z} / b(y), where

b(y) = Σ_z \binom{N−n}{z−y} θ^z (1 − θ)^{N−z} = θ^y (1 − θ)^{n−y}.
Problems for Section B.2 l. If B is uniformly distributed on (- "/2, 1r/2) show that Y � tan B has a Cauchy di stri bution whose density is given by p(y) = l/[1r(l + y2 )). -oo < y < oo. Note that this density coincides with the Student t density with one degree of freedom obtainable from
(B.3.10).
2. Suppose X_1 and X_2 are independent exponential E(λ) random variables. Let Y_1 = X_1 − X_2 and Y_2 = X_2.

(a) Find the joint density of Y_1 and Y_2.

(b) Show that Y_1 has density p(y) = ½λe^{−λ|y|}, −∞ < y < ∞. This is known as the double exponential or Laplace density.

3. Let X_1 and X_2 be independent with β(r_1, s_1) and β(r_2, s_2) distributions, respectively. Find the joint density of Y_1 = X_1 and Y_2 = X_2(1 − X_1).
4. Show that if X has a gamma Γ(p, λ) distribution, then

(a) M_X(t) = E(e^{tX}) = [λ/(λ − t)]^p, t < λ.

(b) E(X^r) = Γ(r + p)/[Γ(p) λ^r], r > −p.

(c) E(X) = p/λ, Var(X) = p/λ².

5. Show that if X has a beta β(r, s) distribution, then

(a) E(X) = r/(r + s).

(b) Var X = rs/[(r + s)²(r + s + 1)].

6. Let V_1, ..., V_{n+1} be a sample from a population with an exponential E(1) distribution (see (A.13.24)) and let S_m = Σ_{i=1}^m V_i, m ≤ n + 1.
(a) Show that T = (V_1/S_{n+1}, ..., V_n/S_{n+1})^T has a density given by

p_T(t_1, ..., t_n) = n! for t_i > 0, 1 ≤ i ≤ n, Σ_{i=1}^n t_i < 1, and 0 otherwise.

Hint: Derive first the joint distribution of (V_1/S_{n+1}, ..., V_n/S_{n+1}, S_{n+1}).

(b) Show that U = (S_1/S_{n+1}, ..., S_n/S_{n+1})^T has a density given by

p_U(u_1, ..., u_n) = n! for 0 < u_1 < u_2 < ··· < u_n < 1, and 0 otherwise.
7. Let
=
1. Suppose
si for each i.
(i) g has continuous first partial derivatives in (ii) g is one to one on each Si. (iii) The Jacobian of g does not vanish on each
Si.
Show that if X has density px , Y = g(X) has density given by r
1
1
py (y) = l:Px (g;-1 (y))IJ•• (gi (Y))I- I,(y) for y E g(ur 1 S,)
i=l
where g; is the restriction of g to S; and l;( y) is I ify E g(S; ) and 0 otherwise. (If 1 li(Y) = 0, the whole summand is taken to be O even though g; is in fact undefined.)
Hint: P[g(X) E B] = L� 1 P[g(X ) E B, X E S;) 8. Suppose that X1 , . . . , Xn is a sample from a populati(Jn with density f. The Xi ar ranged in order from smallest to largest are called the order statistics and are denoted by Xcl), . . . , X(n). Show that Y g(X ) = ( X(l), . . . , X(n) J" has density =
n
py (y) = n! IT f(y;) for y1
i=l
< y, < · - · < Yn
Hint: Let
- {(x, . . . , xn) : x1 < · · · < z.,}, {(x1, . , xn) : x, < x, < ·· · < xn}
-
and so on up to
.
.
Snt · Apply the previous problem.
528
Additional Topics in Probability and Analysis
Appendix B
9. Let X1 , . . , X, be a sample from a unifonn U(O, I ) distribution (cf. (A. 13.29)). .
(a) Show that the order statistics of X
density is given in Problem B.2.6(b).
(X 1,
=
. . .
, Xn) have the distribution whose
(b) Deduce that X(k) has a 11(k, n - k + I ) distribution.
(c) Show that EX(k) = k/(n + 1 ) and Var X(k ) = k(n - k + 1)/(n + 1)2(n + 2). Hint: Use Problem B.2.5.
10. Let X1 , . . . , Xn be a sample from a population with density f and d.f. F. (a)
Show that the (X(r+I)• - . . , X(n) )T is
if X(l) <
···
conditional
density
of
(X(l), . . . , X(n))T
gtven •
< X(r) < X(r+l)·
(b) Interpret this result.
11. (a) Show that if the population in Problem B.2. 10 is U(O, 1 ) , then
(
x( l l- , . . . , x<,J == X(r+I ) X(r+l)
I
(b) Deduce th at X(n) • . . ,
'
,
.
)
r
x,.,
and (X(,+l)• . . . , X(n))r are independent. x,. ,,
x(n-1) , X(n-2) , .
. . ,
x,,, . ' are m . dependent m thIS case. x (1)
12. Let the d.f. F have a density f that is continuous and positive on an interval (a, b) such that F(b) - F(a) = 1 , -oo S a < b < oo. (The results are in fact valid ifwe only suppose
•
I
that F is continuous.)
•
. •
(a) Show that if X has density J, then Y = F(X) is uniformly distributed on (0, 1).
I I
(b) Show that if U � U(O, 1 ), then F- 1 (U) has density
f.
(c) Let U(l) < < U(n) be the order statistics of a sample of size n from a U(O, 1) population. Show that then F-1 (U( 1J) < < F- 1 (U(n)) are distributed as the order statistics of a sample of size n from a population with density f. · · ·
· · ·
13. Using Problems B.2.9(b) and B.2.12 show that if X(k) is the kth order statistic of a
sample of size n from a population with density f, then
• ' . .
,
•
1
14. Let X(l)> . . . , X(n) be the order statistics of a sample of size n from an £(1) population.
II '
Show that nX(l)• (n - l)(X(2) - X(l)), (n - 2) (X(3) - X(2J) , . . . , (X(n) - X(n - 1)) are independent and identically distributed according to £(1 ). Hint: Apply Theorem B.2.2 directly to the density given by Problem B.2.8.
l !
Section 8.11
529
Problems and Complements
15. Let Tk be the time of the kth occurrence of an event in a Poisson process as in (A.I6.4).
(a) Show that Tk has a r(k, >,) distribution. (b) From the identity of the events, [N ( l) < k - lI
100 "
9k , l(s)ds =
k-2:1 ..,e--" ).)
,�
. 0
J.
=
[Tk
>
1 I, deduce the identity
.
Problems for Section B.3
1. Let X and Y be independent and identically distributed N(O,
..;X?J+P. Show that 8 is unifonnly distributed on ( - � , }).
(b) Let 8 = sin - 1
(c) Show that X/Y has a Cauchy distrihotion. Hint: Use Problem B.2.1.
2. Suppose that Z r ( � k, � k), k > 0, and that given Z = z, the conditional distribution of Y is N(O, z- I ) Show that Y has a T. distribution. When k = 1, this is an example where E(E(Y I Z)) = 0, while E(Y) does not exist. 3. Show that if Z1 , , Zn are as in the statement of Theorem B.3.3, then ""' .
•
.
.
n
y'ri(Z - !')/ l:(Z; - ZJ 2 /(n - 1) i= 1
has a Tn- 1 distribution.
4. Show that if XI, . . . ' Xn are independent £(>.) random variables, then T = has a X�n distribution. Hint: First show that 2.-\Xi has a r ( 1, �) = X� distribution.
2). E� 1 xi
X1, . . . , Xm; Y1 , , Yn are independent £ ().) random variables, then S = (n/m) (2:;" I X; ) / (2:; I lj) has a F2m,2n distribution.
5. Show that if
. • .
6. Suppose that X1 and X2 are independent with r(p, 1) and r (p+ �. 1) distributions. Show that Y = 2.,jXIX2 has a r(2p, 1) distribution. Suppose X has density p that is symmetric about 0; that is, Show that E(Xk) = 0 if k is odd and the kth moment is finite. 8. Let X � N(I', u2 ). 7.
(a) Show that the rth central moment of X is E (X - I')'
-
r even
- 0,
T odd.
p(x)
=
p( -x) for all x.
Appendix B
'
'
530
I'
Additional Topics in Prob.ability and Analysis
(b) Show the rth cumulant C is xero for r > 3. r Hint: Use Problem B.3.7 for r odd. For r even set m = r/2 and note that because Y = ](X - J,t)/o-]2 has a xi distribution, we can find E(Y'") from Problem 8.2.4. Now use E(X - J,t)' = o-r E(Ym ),
9. Show that if X � 7,., then
for r even and r < k. The moments do not exist for r 2: k, the odd moments are zero when r < k. The mean of X is 0, for k > 1, and Var X = kf(k - 2) for k > 2. Hint: Using the notation of Section 8.3, for r even E(Xr) = E(Qr) = k�rE(Zr)E(V-!r), where 8.3.7.
I '
10. Let X
! '
�
Z
�
N(0, 1) and V
� X� ·
Now use Problems
8.2.4
and
Fk,m • then
provided - � k < r < � m. For other r, E(Xr) does not exist. When m m/(m - 2), and when m > 4, Var X =
2m2 (k + m - 2) k(m - 2)2(m - 4)
>
2, E(X)
=
.
Hint: Using the notation of Section B.3, E(Xr) = E(Q") = (m/kY E(Vr)E(W-r), where V rv x� and W t">.J x!. Now use Problem B.2.4. 11. Let X have a N( 0, 1) distribution.
(a) Show that Y
=
X2 has density
py(y) =
1 2.fii'Y
' e-l(>+9 )(e0v'Y + e-0v'Y),
y > 0.
This density corresponds to the distribution known as noncentral x2
dom and noncentrality parameter fP .
with 1 degree offree
(h) Show that we can write
py (y)
R
where formula.
,......
P ( � 82 )
and fm is the
=
••
;
,
=
� P (R = i)j,i+I(Y) 't=l
x� density.
Give a probabilistic interpretation of this
Hint: Use the Taylor expansions for e0../fi and e-o.,;y in powers of yfij.
!
I ·1''
1
'
i
Section B.ll
531
Problems and Complements
12. Let X1 , . . . , Xn be independent normal random variables each having variance 1 and E(Xi) = Bi , i 1, . . . , n, and let 02 = L:� 1 B[. Show that the density of V = L:� 1 Xl is given by =
Pv (v) = L P(R = i)f2i+n(v), v > 0 i=O where R P (�fP) and frn is the x:n density. The distribution of V is known as the noncentral x2 with n degrees offreedom and (noncentrality) parameter 82. Hint: Use an orthogonal transformation Y = AX such that Y1 = L� 1 (BiXi/8). Now V has the same distribution as L� 1 J!i2 where Y1 , . . . , Yn are independent with vari ances I and E(Y1 ) = 0, E(Y;) = 0, i = 2 , . . . , n. Next use Problem B.3. 1 1 and ""'
roo Pv (v) = n J0
00
L P(R = i)j,i+l (v - s) fn-l (s)ds. i=O
13. Let X1, • . . ,Xn be independentN(O, 1) random variables and let V = (X1 + ej2 + L� 2 Xi2 • Show that for fixed v and n, P(V > v) is a strictly increasing function of 82. Note that V has a noncentral x� distribution with parameter 82. 14. Let V and W be independent with W x_2mand V having a noncentral xi distribution .......,
with noncentrality parameter 02• Show that S = (V/k)/(W/m) has density
Ps (s)
=
00
L P(R = i)fk+2i, m(s) i=O
where R "' P (�02) and /j, m is the density of Fi,m· The distribution of S is known as the noncentral Fk,m distribution with (noncentrality) parameter 02. 15. Let X1, • . . , Xn be independent normal random variables with - common mean and 2 = I:,�/ X; - X(m))2 . vanance. Define X(m) = (1/m) L::,� 1 X;, and Sm -
.
rn
·
rn
(a) Show that
(b ) Let
n-1 n Show that the matrix A defined by Y = AX is orthogonal and, thus, satisfies the require ments of Theorem B.3.2.
(c) Give the joint density of (X(n), �, . . . , S�)T .
'
' ' ' '
532
Additional Topics in Probability and Analysis
Appendix B
16. Show that under the assumptions of Theorem B.3.3, Z and (Z1 - Z, . . . , Zn - Z) are independent. Hint: It suffices to show that Z is independent of (Z2 - Z� . . . , Zn - Z). This provides another proof that Z and L� 1 ( Zi - Z) 2 are independent.
Problems for Section B.4 I. Let (X, Y)
�
N ( 1, 1 , 4, 1 , � ). Find
(a) P(X + 2Y < 4). (b) P(X < 2 I y = 1). (c) The joint distribution of X +
2Y and 3Y - 2X. Let (X, Y) have a N(!'l> p,2, u�, u�, p) distribution in the problems 2--{;, 9 that follow. 2. Let Fh ·, J..l. t , J..L2, cr�, cr�, p) denote the d.f. of (X, Y). Show that
I
•
( X - p,1 Y - p,2 ) , (Tl 0"2
has a N(O, 0, 1, 1, p) distribution and, hence, express F( ·, · , ttl , J.l2, af, cr�, p) in terms of F(·, · , 0, 0, 1 , 1,p) .
3. Show that X + Y and X - Y are independent, if and only if, crf
4. Show that if cr1cr2
=
u�.
> 0, IPI < 1, then
•
has a � distribution. Hint: Consider ( U1 , U2 ) defined by (B.4.19) and (B.4.22).
5. Establish the following relation due to Sheppard.
F(O, 0, 0 , 0, 1, 1, p) =
! + (1/21T) sin -l
p.
Hint: Let U1 and U2 be as defined by (B.4.19) and B.4.22, then
P[X < 0, Y < OJ
P[U1 < O, pU1 + y'1 - p2U2 < OJ p - p u, < 0, u2, > r.�=;; u y'1 - p2 =
6. The geometry of the bivariate normal surface.
c}. Suppose that cri = u� . Show that {S,; c > 0} is a family of ellipses centered at (l't. p,2) with common major axis given by (Y-1'2 ) = (a) Let S,
=
{(x, y) : P( X,Yj(X, y)
=
;
'
i
Section
B.ll
Problems
and
533
Complements
(x -111) if p > 0, (y -1'2) � - (x - 1'1 ) if p < 0. If p � 0, {So} is a family of concentric
circles.
(b) If x
=
(
c,
y) is proportional to a normal density as a function of y. That is, sections of the surface z = px_ (x, y) by planes parallel to the (y, z ) plane are proportional c, Px
to Gaussian (normal) densities. This is in fact true for sections by any plane perpendicular to the y) plane.
(x,
(c) Show that the tangents to So at the two points where the line y � 1'2+p(a2/ cr,)(x J..t d intersects Sc are vertical. See Figure B.4.2.
(X1, YI), . . . , (X., Yn) be a sample from a N(I'I . l-'2 • cr� , aJ, p) � N(l-', !:.) dis2 tribution. Let X � (1/n) 2::� 1 X, , Y � (1/n) 2::� 1 Y; , Sf � 2::� 1 (X, - X) , 2 2 " " "" S2 � L..i�1 (Y; - Y) • S12 � L..i�1 (X, - X)(Y; - Y). (a) Show that n(X - 1'1 . Y - I'2)T !:.-1 (X - I' I. Y - ll-2) has a x� distribution. (b) Show that (X, Y) and (8�, SJ, 812) are independent. 7. Let
Hint: (a): See Problem B.4.4.
(b): Let A be an orthogonal matrix whose first row is
(n- i , . . .
-
,n
�).
Let
U
(X1, . . . , Xn)T and Y � (Y1 , . . . , Ynf· Show that ) form a sample from aN(O, 0, o-i, a-�, p) population. Note that Sf V (U2, V2), . . . , (Un, n n '"' 822 � n '"' n · ;;;; while � 812 X � y n. UI /v'n. u,v;, � Vt/v v,-. �2 Li�2 Li �2 Li u,-, 8. In the model of Proplern B.4.7 let R � 812/8182 and
AX and V
� AY, where X �
=
=
y'(n - 2)R
T � �c==� J1 - R2 . (a) Show that wpen p = 0, T has a Tn_2 distribution. (b) Find the density of R if p � 0. Hint: Without loss of generality, take
cr� � cr� � 1. Let C be an (n - 1) x (n (U2, . . . , Un)/S1· Define (W2, . . . , Wnf �
1) orthogonal matrix whose first row is C(V,, . . and show that T can be written in the form T � L/M where L � 2 f� � and M2 � 2). Argue that given are, T has a Yn 2 distribution. no matter what = = Now use the continuous version of (B.l .24).
.
, Vn)T 812/S1 W2 u2 U2, . . . 'Un
(S�SJ - 8f2)/(n - )S L��3 Wt/(n u2, . . . ) Un Un,
9. Show that the conditional distribution of aX + bY given eX + dY = t is normal. Hint: Without loss of generality take a � d � 1, b � c � 0 because (aX + bY, eX +
dY) also has a bivariate normal distribution. IPI �
Deal directly with the cases
1.
Ut0"2
=
0 and
10. Let p1 denote the N(O, 0, 1 , 1, 0) density and let P2 be the N(O, 0, 1, 1, p) density.
Suppose that
(X, Y) have the joint density 1
1
p(x, y) � 2 pi (x, y) + 2 pz(x, y).
I:I
534
Additional Topics in Probability and Analysis
Show that
X
and only if, p
Appendix B
and Y have normal marginal densities, but that the joint density is normal, if
= 0.
11. Use a construction similar to that of Problem B.4.1 0 to obtain a pair of random variables (X, Y) that (i) have marginal normal distributions. (ii)
are
uncorrelated.
(iii)
are
not independent.
Do these variables have a bivariate normal distribution?
'
,
I
Problems for Section B.5
'
I. Establish (B.S.IO) and (B.S. II). 2. Let akxi and
,
Bkxk be nonrandom . Show that
I and • •
3. Show that if Mu (t) is well defined in a neighborhood of zero then =
I '
Mu (t) = 1 + L
/
II
p= l
where J.Li1 ···i,. =
p, p =
E(U;1
p.
.••
i,. t! 1
· · · U�,. ) and the sum is over all
1, 2, . . . . Moreover,
That is, the Taylor series for
4.
�J.ti1
•
•
•
t�,.
(i 1 , . . , ik) with ij > 0. E;=I ii .
Ku converges in a neighborhood of zero.
Show that the second- and higher-degree cumulants (where p =
�=1 ii
invariant under shift; thus. they depend only on the nioments about the mean.
=
2: 2) are
l I
'
5. Establish (B.5.!6}-{B.5.19). 6. a�
In the bivariate case write =
aoz·
Show that
1J.
•
=
E(U),
u;;
-
E(U, - M,)'(U2 - !-'2)i , ul
•
-
l
I
'
• •
' , '
•
-
i
' •
Section B.ll
-------�-------------�535 �
���--
Problems and Complements
and
(C40, CQ4, C221 C31, C1 3 ) (a40 - 3af, O"o4 - 3a?, a22 - a-fa� - 20'u , a:n - 3trf0'1 1 , 0"13
=
V. W. and Z are independent and that U1
7. Suppose
= Z+
-
V and U2 =
3a�au ). Z+
W.
Show
that
Mu (t) = Mv(t,)Mw(t,)Mz(tr + t,) Ku(t) = Kv(tr) + Kw(t,) + Kz(tr + t,) and show that c;1 (U)
8.
=
c,+1(Z) for i ,<
j; i,j > 0.
Suppose U
(The bivariate log normal distribution) .
N(Jl.I ,Ji-2, af, a� , p)
distribution. Then Y
bivariate log nonnal distribution.
E(Y,' Yi) where
=
= (Yi , f2)T = (eu\eu2)T
9. (a) Suppose Z is N(p., E). P = L�=I ii > 2) are zero.
has a bivariate
is said to have a
Show that
{
�
exp if.l. r + i1J.2 + i2
= 0"1 a2p.
an
= (U1, U2)T
+
ij
�j'O"�}
Show that all cumulants of degree higher than
2
(where
(b) Suppose U 1 , . . . , Un are i.i.d. as U. Let Zn = n- l I;� 1 (U; p.). Show that Kz. (t) = nKu(n-it) - ni p.T t and that all cumulants of degree higher than 2 tend to zero as n --+ oo. -
10.
Suppose
Ukxl
and
vmxl
are independent and
I = {h, . . . , i k } and J = {ik+IJ . . . , ik.,_} unless either I = {0, . . . , 0} or J = {0, . . . , 0}. where
z(k+m) x l
=
(UT, VT)T. Let CI ,J
bea cumulantofZ. Show that CI,J
Problems for Section B.6 I. (a)
Suppose
pendent
U
=
U,
N(O, 0'2)
(U1,
•
.
.
,
= fJ. + aZ; + (JZ,_,
i
=
!, . .
.
, k, where
Z0, . . . , z.
¥0
are inde
random variables. Compute the expectation and covariance matrix of
u. ). Is U k·variate normal?
(b) Perform the same operation and answer the same question for Ui defined as follows: U1 = Z1 , U2 = Z2 + aUl ! U3
=
Z3 + aU2 , - . . , Uk = Zk + o:Uk-1·
2. Let U be as in Definition B.6.1. Show that if :E is not positive definite, then U does not have a density. has positive definite variance :E. Let ugll and u[�� X l be a partition I) 2 of U with variances �11. �22 and covariance :E12 = Cov(U( I ) , u< l)lx(k -l ) · Show that
3.
Suppose
Ukxl
1 2 2 Cov(E12E 22 UC l, uCil - E12E22UC l) = 0.
536
' I'
Add"1tional Topics in Probability and Analysis
•
Problems for Section B.7
ll •
Appendix B
1. Prove Theorem B.7.3 for d = 1 when Z an Zn have continuous distribution functions F
' '
and Fn . Hint: Let U denote a unifonn, U(O, 1 ), random variable. For any d.f. G define the left inverse by Now define Z� = F;;1 (U) and z• = p-1(U ) . = inf{t : G(t) >
••
' •
u}.
a-1(u)
2. Prove Proposition B.?. l (a). 3. Establish (B.7.8).
I
4. Show that if Zn £ zo, then P(IZn - zo l > <)
'
Hint: Extend (A.14.5).
�
P(IZ - zol > <). '
5. The Lp nonn of a random vector X is defined by IXI P = { EIXIP} , , p > I. The sequence of random variables {Zn} is said to converge to Z in L norm if IZn - ZI P � 0 P as
n
-+ oo.
We write Zn
� z. Show that
Lq
Lp
(a) if p < q, then Zn � Z => Zn � Z.
'
I
Hint: Use Jensen's inequality B.9.3.
I
(b) 1f Zn Lp � Z, then Zn � Z. Hint:
il '
p
.
'
E!Zn - ZIP 2: EfiZn - ZIPI{IZn - Zl 2: <}] > EP P(!Zn - Zl 2: E).
!
i i
6. Show that IZn - Zl � 0 is equivalent to Zn; � Z; for I :S j < d. Hint: Use (B.7.3) and note that !Zn; - Z;l2 < IZn - Zl2 U(O, 1) and let U1 = I, U2 = I{U E (0, �)}, Ua = I{U E (�, I)}, U4 = I{U E [o, m. Us = I{U E (hm, . , Un = I{U E (m2 -•, (m + l)T•)},
7.
Let U
�
. .
where n =
8. Let U L,
a.s.
p > m + 2•, 0 < m < 2• and k 0. Show that Un � 0 but Un + 0.
�
U(O, I) and set Un = 2n 1{U E [0, �)}. Show that Un �· 0, Un � 0, but
Un + 0, p > I, where L is defined in Problem B. 7.5. P 9. Establish B.7.9. 10. Show that Theorem B.7.5 implies Theorem B.7.4.
. ••
.'j
.j '
1
.I' ,
'
' ' ·
I ,
•
�
11. Suppose that as in Theorem B.7.6, Fn(x) F(x) for all x, F is continuous, and strictly increasing so that p-l (n) is unique for all 0 < n < 1. Show that
sup{/ F;; 1 (a) - F- 1(a)/ : < < a < I - E} � 0 for all ' > 0. Here F;; 1 (a) = inf{x : Fn(x) > a } . Hint: Argue by contradiction.
I
'
l
Section
B.ll
537
Problems and Complements
Problems for Section B.S 1. If h
: Rd � for izl <
RP and
h(x)
=
h(xo + z) Here the integral is a p
Hint: Let g(u)
11
h : Rd for i z l < <1,
2. If
=
x
Dh(x)
=
is continuous in a sphere
h(xo) +
(1 1
)
h( x0 + uz)zdu zT
d matrix of integrals.
h(x0 + uz). Then by the chain rule, g(u) = h(Xo + uz)z and
h(xo + uz)zdu =
[
•
g(u)du = g(l) - g(O) = h(xo + z) - h(xo).
ii(x) = D2 h(x) is continuous in a sphere { x : l x - xo l < <1}, then
� R and
h(xo + z) = h(xo) + h(xo)z + zT Hint:
{x : lx - Xoi < <1}. then
Apply Problem B.S. I to
[11 11
]
ii(xo + uvz)vdudv z .
h(xo + z) - h(xo) - h(Xo)z . •
3. Apply Problems B.S.! and B.S.2 to obtain special cases of Taylor's Theorem B.8.12.
Problems for Section B.9 1.
State and prove Jensen's inequality for conditional expectations.
2. Use Hoeffding's inequality (B.9.6) to establish Bernstein's inequality (B.9.5). if p = � . the hound can be improved to 2 exp{ -2n/,2 }.
Show that
2. IYr, ; + ! = 1.
3. Derive HOlder's inequality from Jensen's inequality with k = Hint: For (x,y)
4.
Show that if k
R2, consider g(x,y)
E
= 1
=
and g"(x) exists, then
equivalent.
lxr +
g" (x) > 0,
5. Show that convexity is equivalent to the convexity of
a E [0, 1] for all x and y in S. 6.
Use Problem
7.
Show that if
all
x E
S, and convexity are
g (ax + (1 - a)y)) as function of
5 above to generalize Problem 4 above to the case k > 1.
.
/. / g2 (x) exists and the matrix I /.x, a"x,. g2 (x) I is positive definite, then x,
g is strictly convex.
x.,
8. Show that
P(X > a)
<
inf{ e-ta Ee'x
Hint: Use inequality (A. l5.4).
9. Use Problem 8 above to prove Bernstein's inequality.
:
t � 0}.
10. Show that the sum of (strictly) convex functions is (strictly) convex.

Problems for Section B.10

1. Verify that if A is snd, then A is pd iff A is nonsingular.

2. Show that if S is the one-dimensional space S = {ae : a ∈ R} for e orthonormal, then the projection matrix onto S given by (B.10.17) is just ee^T.

3. Establish (B.10.15) and (B.10.16).

4. Show that h − Π(h | L) = Π(h | L^⊥) using (B.10.14).

5. Establish (B.10.17).

B.12 NOTES

Notes for Section B.1.2

(1) We shall follow the convention of also calling E(Y | Z) any variable that is equal to g(Z) with probability 1.
Notes for Section B.1.3

(1) The definition of the conditional density (B.1.25) can be motivated as follows: Suppose that A(x), A(y) are small "cubes" with centers x and y and volumes dx, dy, and p(x, y) is continuous. Then P[X ∈ A(x) | Y ∈ A(y)] = P[X ∈ A(x), Y ∈ A(y)]/P[Y ∈ A(y)]. But P[X ∈ A(x), Y ∈ A(y)] ≈ p(x, y) dx dy, P[Y ∈ A(y)] ≈ p_Y(y) dy, and it is reasonable that we should have p(x | y) ≈ P[X ∈ A(x) | Y ∈ A(y)]/dx ≈ p(x, y)/p_Y(y).
Notes for Section B.2

(1) We do not dwell on the stated conditions of the transformation Theorem B.2.1 because the conditions are too restrictive. It may, however, be shown that (B.2.1) continues to hold even if f is assumed only to be absolutely integrable in the sense of Lebesgue and K is any member of B^k, the Borel σ-field on R^k. Thus, f can be any density function and K any set in R^k that one commonly encounters.
Notes for Section B.3.2

(1) In deriving (B.3.15) and (B.3.17) we are using the standard relations [AB]^T = B^T A^T, det[AB] = det A det B, and det A = det A^T.
Notes for Section B.5

(1) Both m.g.f.'s and c.f.'s are special cases of the Laplace transform ψ of the distribution of U, defined by ψ(z) = E[exp(z^T U)], where z is in the set of k-tuples of complex numbers.
B.13 REFERENCES
ANDERSON, T. W., An Introduction to Multivariate Statistical Analysis. New York: J. Wiley & Sons, 1958.

APOSTOL, T., Mathematical Analysis, 2nd ed. Reading, MA: Addison-Wesley, 1974.

BARNDORFF-NIELSEN, O. E., AND D. R. COX, Asymptotic Techniques for Use in Statistics. New York: Chapman and Hall, 1989.

BILLINGSLEY, P., Probability and Measure, 3rd ed. New York: J. Wiley & Sons, 1979, 1995.

BIRKHOFF, G., AND S. MACLANE, A Survey of Modern Algebra, rev. ed. New York: Macmillan, 1953.

BIRKHOFF, G., AND S. MACLANE, A Survey of Modern Algebra, 3rd ed. New York: Macmillan, 1965.

BREIMAN, L., Probability. Reading, MA: Addison-Wesley, 1968.

CHUNG, K. L., A Course in Probability Theory. New York: Academic Press, 1974.

DEMPSTER, A. P., Elements of Continuous Multivariate Analysis. Reading, MA: Addison-Wesley, 1969.

DIEUDONNÉ, J., Foundations of Modern Analysis, Vol. 1, Pure and Applied Mathematics Series, Vol. 10. New York: Academic Press, 1960.

DUNFORD, N., AND J. T. SCHWARTZ, Linear Operators, Vol. 1, Interscience. New York: J. Wiley & Sons, 1964.

FELLER, W., An Introduction to Probability Theory and Its Applications, Vol. II, 2nd ed. New York: J. Wiley & Sons, 1971.

GRIMMETT, G. R., AND D. R. STIRZAKER, Probability and Random Processes. Oxford: Clarendon Press, 1992.

HALMOS, P. R., An Introduction to Hilbert Space and the Theory of Spectral Multiplicity, 2nd ed. New York: Chelsea, 1951.

HAMMERSLEY, J., "An extension of the Slutsky–Fréchet theorem," Acta Mathematica, 87, 243–247 (1952).

HOEFFDING, W., "Probability inequalities for sums of bounded random variables," J. Amer. Statist. Assoc., 58, 13–30 (1963).

LOÈVE, M., Probability Theory, Vol. I, 4th ed. Berlin: Springer, 1977.

RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.

ROCKAFELLAR, R. T., Convex Analysis. Princeton, NJ: Princeton University Press, 1970.

ROYDEN, H. L., Real Analysis, 2nd ed. New York: Macmillan, 1968.

RUDIN, W., Functional Analysis, 2nd ed. New York: McGraw-Hill, 1991.

SKOROKHOD, A. V., "Limit theorems for stochastic processes," Th. Prob. Applic., 1, 261–290 (1956).
Appendix C

TABLES

Table I. The standard normal distribution. Entries are Pr(Z ≤ z); the left column gives z to one decimal place and the top row gives the second decimal place of z.
0.00
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.0
0.5000
0.5040
0.5080
0.5120
0.5160
0.5199
0.5239
0.5279
0.53!9
0.5359
0. 1
0.5398
0.5438
0.5478
0.5517
0.5557
0.5596
0.5636
0.5675
0.5714
0.5753
0.2
0.5793
0.5832
0.5871
0.5910
0.5948
0.5987
0.6026
0.6064
0.6103
0.6141
0.3
0.6179
0.62 1 7
0.6255
0.6293
0.6331
0.6368
0.6406
0.6443
0.6480
0.6517
0.4
0.6554
0.6591
0.6628
0.6664
0.6700
0.6736
0.6772
0.6808
0.6844
0.6879
0.5
0.6915
0.6950
0.6985
0.7019
0.7054
0.7088
0.7123
0.7157
0.7190
0.7224
0.6
0.7257
0.7291
0.7324
0.7357
0.7389
0.7422
0.7454
0.7486
0.75 1 7
0.7549
0.7
0.7580
0.76 1 1
0.7642
0.7673
0.7704
0.7734
0.7764
0.7794
0. 7823
0.7852
0.8
0.7881
0.7910
0.7939
0. 7967
0.7995
0.8023
0.805 1
0.8078
0.8106
0.8133
0.9
LO
0.8159
0.8186
0.8212
0.8238
0.8264
0.8289
0.8315
0.8340
0.8365
0.8389
0.8413
0.8438
0.8461
0.8485
0.8508
0.8531
0.8554
0.8577
0.8599
0.8621
l.l
0.8643
0.8665
0.8686
0.8708
0.8729
0.8749
0.8770
0.8790
0.8810
0.8830
L2
0.8849
0.8869
0.8888
0.8907
0.892j
0.8944
0.8962
0.8980
0.8997
0.9015
!.3
0.9032
0.9049
0.9066
0.9082
0.9099
0.9 1 1 5
0.9 131
0.9147
0.9162
0.9177
1.4
0.9192
0.9207
0.9222
0.9236
0.9251
0.9265
0.9279
0.9292
0.9306
0.9319
1.5
0.9332
0.9345
0.9357
0.9370
0.9382
0.9394
0.9406
0.9418
0.9429
0.9441
1.6
0.9452
0.9463
0.9474
0.9484
0.9495
0.9505
0.9515
0.9525
0.9535
0.9545
1.7
0.9554
0.9564
0.9573
0.9582
0.9591
0.9599
0.9608
0.9616
0.9625
0.%33
1.8
0.9641
0.9649
0.9656
0.9664
0.9671
0.9678
0.9686
0.9693
0.9699
0.9706
1.9
0.97 !3
0.9719
0.9726
0.9732
0.9738
0.9744
0.9750
0.9756
0.9761
0.9767
2.0
0.9772
0.9778
0.9783
0.9788
0.9793
0.9798
0.9803
0.9808
0.9812
0.9817
2.1
0.9821
0.9826
0.9830
0.9834
0.9838
0.9842
0.9846
0.9850
0.9854
0.9857
2.2
0.986 1
0.9864
0.9868
0.9871
0.9875
0.9878
0.988 1
0.9884
0.9887
0.9890
2.3
0.9893
0.9896
0.9898
0.9901
0.9904
0.9906
0.9909
0.99 1 1
0.99l3
0.9916
2.4
0.9918
0.9920
0.9922
0.9925
0.9927
0.9929
0.9931
0.9932
0.9934
0.9936
2.5
0.9938
0.9940
0.994 1
0.9943
0.9945
0.9946
0.9948
0.9949
0.9951
0.9952
2.6
0.9953
0.9955
0.9956
0.9957
0.9959
0.9960
0.996 1
0.9962
0.9%3
0.9%4
2.7
0.9965
0.9966
0.9967
0.9968
0.9969
0.9970
0.9971
0.9972
0.9973
0.9974
2.8
0.9974
0.9975
0.9976
0.9977
0.9977
0.9978
0.9979
0.9979
0.9980
0.9981
2.9
0.998 1
0.9982
0.9982
0.9983
0.9984
0.9984
0.9985
0.9985
0.9986
0.9986
3.0
0.9987
0.9987
0.9987
0.9988
0.9988
0.9989
0.9989
0.9989
0.9990
0.9990
3.1
0.9990
0.9991
0.9991
0.�99 1
0.9992
0.9992
0.9992
0.9992
0.9993
0.9993
3.2
0.9993
0.9993
0.9994
0.9994
0.9994
0.9994
0.9994
0.9995
0.9995
0.9995
3.3
0.9995
0.9995
0.9995
0.9996
0.99%
0.9996
0.9996
0.9996
0.99%
0.9997
3.4
0.9997
0.9997
0.9997
0.9997
0.9997
0.9997
0.9997
0.9997
0.9997
0.9998
Table entry is probability at or below z.
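Entries in this and the following tables can be reproduced with any statistical library. A minimal Python sketch (scipy.stats assumed to be available) computes one of the tabulated normal probabilities and representative t, χ², and F critical values:

```python
from scipy import stats

# Table I: Pr(Z <= z) for the standard normal.
print(round(stats.norm.cdf(1.96), 4))               # 0.9750

# Table II: t critical value with right tail probability p and df degrees of freedom.
print(round(stats.t.ppf(1 - 0.025, df=10), 3))      # 2.228

# Table III: chi-square critical value.
print(round(stats.chi2.ppf(1 - 0.05, df=5), 2))     # 11.07

# Table IV: F critical value with numerator df r1 and denominator df r2.
print(round(stats.f.ppf(1 - 0.05, dfn=3, dfd=10), 2))  # 3.71
```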
Table I'. Auxiliary table of the standard normal distribution. In each pair of rows, the first row gives the upper tail probability Pr(Z ≥ z) and the second row gives the corresponding value of z.

Pr(Z ≥ z):  .50    .45    .40    .35    .30    .25    .20     .15     .10
z:          0      .126   .253   .385   .524   .674   .842    1.036   1.282

Pr(Z ≥ z):  .09    .08    .07    .06    .05    .04    .03     .025
z:          1.341  1.405  1.476  1.555  1.645  1.751  1.881   1.960

Pr(Z ≥ z):  .02    .01    .005   .001   .0005  .0001  .00005  .00001
z:          2.054  2.326  2.576  3.090  3.291  3.719  3.891   4.265

Entries in the top row of each pair are areas to the right of the values in the row below.
[Figure: standard normal density, with central area .8 and area .1 in each tail.]
Table II. t distribution critical values. Entries are the values t with Pr(T ≥ t) = p; the top row gives the right tail probability p and the left column gives the degrees of freedom df.
.25
.10
.05
.025
Right tail probability p .02
.0 I
.005
.0025
.001
.0005
l
3.078
6.314
1 2.71
15.89
3 1 .82
63.66
127.3
318.3
636.6
2
LOOO
0.8 1 6
!.886
2.920
4.303
4.849
6.965
9.925
14.09
22.33
3 1 .60
3
0.765
1 .638
2.353
3.182
3.482
4.541
5.841
7.453
10.21
12.92
1.533
2.!32
2.776
2.999
3.747
4.604
5.598
7.173
8.610
1.476
2.015
2.571
2.757
3.365
4.032
4.773
5.893
6.869
1.943
2.447
2.612
3.143
3.707
4.317
5.208
5.959
5
0.74 I
6
0.718
7
0.7 1 1
L440
1.415
1.895
2.365
2.517
2.998
3.499
4.029
4.785
5.408
8
0.706
!.397
1.860
2.306
2.449
2.896
3.355
3.833
4.501
5.041
9
0.703
1.383
1.833
2.262
2.398
2.821
3.250
3.690
4.297
4.781
10
0.700
1.372
1.812
2.228
2.359
2.764
3.!69
3.581
4.144
4.587
11
0.697
1.363
1 .796
2.201
2.328
2.718
3. !06
3.497
4.025
4.437
12
0.695
1.356
1.782
2.179
2.303
2.681
3.055
3.428
3.930
4.318
13
0.694
1.350
1.771
2.160
2.282
2.650
3.012
3.372
3.852
4.221
14
0.692
1.345
1.761
2.145
2.264
2.624
2.977
3.326
3.787
4.140
15
0.691
1.341
1.753
2.131
2.249
2.602
2.947
3.286
3.733
4.073
16
0.690
1 .337
1 . 746
2.120
2.235
2.583
2.921
3.252
3.686
4.015
17
0.689
1.333
1.740
2.110
2.224
2.567
2.898
3.222
3.646
3.965
18
0.688
1 .330
1.734
2.101
2.214
2.552
2.878
3.197
3.610
3.922
19
0.688
1 .328
1.729
2.093
2.205
2.539
2.861
3.174
3.579
3.883
20
0.687
1.325
1.725
2.086
2.197
2.528
2.845
3.153
3.552
3.850
21
0.686
1.323
1.721
2.080
2.189
2.518
2.831
3.135
3.527
3.819
22
0.686
1.321
1.717
2.074
2.183
2.508
2.819
3.1 19
3.505
3.792
4
0.727
23
0.685
1.319
1.714
2.069
2.177
2.500
2.807
3.104
3.485
3.768
24
0.685
1 .3 1 8
1 .7 1 1
2.064
2.172
2.492
2.797
3.091
3.467
3.745
25
0.684
1.316
1.708
2.060
2.167
2.485
2.787
3.078
3.450
3.725
30
0.683
1.310
1.697
2.042
2.147
2.457
2.750
3.030
3.385
3.646
40
0.681
1.303
1.684
2.02 1
2.123
2.423
2.704
2.971
3.307
3.551
50
0.679
1.299
1 .676
2.009
2.109
2.403
2.678
2.937
3.261
3.496
60
0.679
1 296
1.671
2.000
2.099
2.390
2.660
2.915
3.232
3.460
100
0.677
1 .290
1.660
1.984
2.08 1
2.364
2.626
2.871
3.174
3390
1 000
0.675
1.282
1.646
1.%2
2.056
2.330
2.581
2.813
3.098
3.300
0.674
1.282
1 .645
1 .960
2.054
2326
2.576
2.807
3.090
3.291
50%
80%
90%
95%
96%
98%
99%
99.5%
99.8%
99.9%
00
Confidence level C
The entries in the top row are the probabilities of exceeding the tabled values. The left column gives the degrees of freedom.
•
�-"
•
545
Tables
Appendix C
Table III. χ² distribution critical values. Entries are the values x with Pr(χ² ≥ x) = p; the top row gives the right tail probability p and the left column gives the degrees of freedom df.
.25
.10
.05
.025
Right tail probability p .02
.0 I
.005
.0025
.001
.0005
I
1.32
2.71
3.84
5.02
5.41
6.63
7.88
9.14
!0.83
12.12
2
2. 77
4.61
5.99
7.38
7.82
9.21
10.60
1 1 .98
13.82
15.20
3
4. 1 1
6.25
7.81
9.35
9.84
1 1 .34
12.84
14.32
16.27
17.73
4
5.39
7.78
9.49
1 1 . 14
1 1 .67
13.28
14.86
16.42
18.47
20.00
5
6.63
9.24
1 1 .07
12.83
1 3.39
15.09
16.75
18.39
20.52
22. 1 1
6
7.84
10.64
12.59
14.45
15.03
16.81
18.55
20.25
22.46
24.!0
7
9.04
12.02
14.07
1 6.01
16.62
18.48
20.28
22.04
24.32
26.02
8
10.22
13.36
15.51
17.53
18.17
20.09
21 .95
23.77
26.12
27.87
9
1 1 .39
14.68
16.92
19.02
19.68
2 1 .67
23.59
25.46
27.88
29.67
10
12.55
15.99
18.31
20.48
21.16
II
23.21
25.19
27.1 1
29.59
31 .42
13.70
17.28
19.68
21.92
22.62
24.72
26.76
28.73
31 .26
33.14
12
14.85
18.55
21 .03
23.34
24.05
26.22
28.30
30.32
32.91
34.82
13
15.98
19.81
22.36
24.74
25.47
27.69
29.82
31 .88
34.53
36.48
14
17.12
21.06
23.68
26.12
26.87
29.14
31.32
33.43
36.12
38. 1 1
15
18.25
22.31
25.00
27.49
28.26
30.58
32.80
34.95
37.70
39.72
16
19.37
23.54
26.30
28.85
29.63
32.00
34.27
36.46
39.25
41.31
17
20.49
24.77
27.59
30.19
31.00
33.41
35.72
37.95
40.79
42.88
18
21.60
25.99
28.87
3 1 .53
32.35
34.8 1
37.16
39.42
42.31
44.43
19
22.72
27.20
30.14
32.85
33.69
36.19
38.58
40.88
43.82
45.97
20
23.83
28.4!
31.41
34.17
35.02
37.57
40.00
42.34
45.3 1
47.50
21
24.93
29.62
32.67
35.48
36.34
38.93
41.40
43.78
46.80
49.0 1
22
26.04
30.81
33.92
36.78
37.66
40.29
42.80
45.20
48.27
50.51
23
27.14
32.01
35.17
38.08
38.97
41 .64
44.18
46.62
49.73
52.00
24
28.24
33.20
36.42
39.36
40.27
42.98
45.56
48.03
51.18
53.48
25
29.34
34.38
37.65
40.65
41.57
44.31
46.93
49.44
52.62
54.95
26
30.43
35.56
38.89
41 .92
42.86
45.64
48.29
50.83
56.41
27
3 1 .53
36.74
40. I I
54.05
43.19
44.14
46.96
49.64
52.22
55.48
57.86
28
32.62
37.92
41.34
44.46
45.42
48.28
50.99
53.59
56.89
59.30
29
33.71
39.09
42.56
45.72
46.69
49.59
52.34
54.97
58.30
60.73
30
34.80
40.26
43.77
46.98
47.96
50.89
53.67
56.33
59.70
62.16
40
45.62
51.81
55.76
59.34
60.44
63.69
66.77
69.70
73.40
76.09
50
56.33
63.17
67.50
7 1 .42
72.61
76.15
79.49
82.66
86.66
89.56
60
66.98
74.40
79.08
83.30
84.58
88.38
91 .95
95.34
99.61
102.69
80
88.13
96.58
101.88
106.63
108.07
1 1 2.33
1 1 6.32
120.10
1 24.84
128.26
100
109.14
1 1 8.50
124.34
129.56
l3l.l4
135.81
140.17
144.29
149.45
153.17
The entries in the top row are the probabilities of exceeding the tabled values. p Pr(� > x ) where x is in the body of the table and p is in the top row (margin). dj denotes degrees of freedom and is given in the left column (margin). =
Table IV. F distribution critical values. Entries are the values f with Pr(F ≥ f) equal to the tail probability (0.05, 0.025, or 0.01) shown in the left margin; the columns give the numerator degrees of freedom r_1 and the rows the denominator degrees of freedom r_2.
0.05
,,
I
I
2
3
4
5
161
199
216
225
648
799
864
4052
4999
18.51
r,
6
7
8
10
15
230
234
237
239
242
246
922
937
948
957
969
985
5403
900
5625
5764
5859
5928
5981
6056
6157
19.00
19.16
19.25
19.30
19.33
19.35
19.37
19.40
19.43
38.51
39.00
39.17
39.25
39.30
39.33
39.36
39.37
39.40
39.43
98.50
99.00
99. 1 7
99.25
99.30
99.33
99.36
99.37
99.40
99.43
10.13
9.55
9.28
9.12
9.01
8.94
8.89
8.85
8.79
8.70
17.44
16.04
15.44
34. 12
30.82
7.7 1
0.025 0.01
0.025
O.o!
0.05
2
0.025
O.oJ
0.05
3
14.88
14.73
14.62
14.54
14.42
14.25
29.46
ISJO
28.71
28.24
27.91
27.67
27.49
27.23
26.87
6.94
6.59
6.39
6.26
6.16
6.09
6.04
5.96
5.86
12.22
10.65
9.98
9.60
9.36
9.20
9.07
8.98
8.84
8.66
21 .20
18.00
16.69
15.98
15.52
15.21
14.98
14.80
14.55
14.20
6.61
5.79
5.41
5.1 9
5.05
4.95
4.88
4.82
4.74
4.62
0.025
10.01
8.43
7.76
7.39
7.15
6.98
6.85
6.76
6.62
6.43
0.01
16.26
13.27
12.06
1 1 .39
10.97
10.67
10.46
10.29
10.05
9.72
5.99
5.14
4.76
4.53
4.39
4.28
4.21
4.15
4.06
3.94
8.81
7.26
6.60
6.23
5.99
5.82
5.70
5.60
5.46
527
13.75
10.92
9.78
9. 15
8.75
8.47
8.26
8.10
7.87
7.56
5.59
4.74
4.35
4.12
3.97
3.87
3.79
3.7 3
3.64
3.51
8,07
6.54
5.89
5.52
5.29
5.12
4.99
4.90
4.76
4.57
12.25
9.55
8.45
7.85
7.46
7.19
6.99
6.84
6.62
6.31
5.32
4.46
4.07
3$4
3.69
358
3.50
3.44
3.35
3.22
7.57
6.06
5.42
5.05
4.82
4.65
4.53
4.43
4.30
4.10
1 1 .26
8.65
7.59
7.01
6.63
6.37
6.18
6.03
5.81
5.52
5.12
4.26
3.86
3.63
3.48
3.37
3.29
3.23
3.14
3.01
7.21
5.71
5.08
4.72
4.48
4.32
4.20
4.10
3.96
3.77
10.56
8.02
6.99
6.42
6.06
5.80
5.61
5.47
5.26
4.96
4.96
4.10
3.7 1
3.48
3.33
3.22
3.14
3.07
2.98
2.85
6.94
5.46
4.83
4.47
4.24
4fl7
3.95
3.85
3.72
3.52
10.04
7.56
6.55
5.99
5.64
5.39
5.20
5.06
4.85
4.56
4.75
3.89
3.49
3.26
3. l l
3.00
2.91
2.85
2.75
2.62
0.025
6.55
5.10
4.47
4.12
3.89
3.73
3.61
3.51
3.37
3.18
0.01
9.33
6.93
5.95
5.41
5.06
4.82
4.64
4.50
4.30
4.01
4.54
3.68
3.29
3.06
2.90
2.79
2.71
2.64
2.54
2.40
0.025
6.20
4.77
4.15
3.80
3.58
3.41
3.29
3.20
3.06
2.86
0.01
8.68
6.36
5.42
4.89
4.56
432
4.14
4.00
3.80
3.52
0.025
O.oJ
0.05
Ofl5
0.05
4
5
6
0.025
O.Q I
0.05
7
0.025 0.01 0.05
8
0.025 0.01 0.05
9
0.025 0.01 0.05
10
0.025
O.Di
0.05
0.05
12
15
r_1 = numerator degrees of freedom; r_2 = denominator degrees of freedom.
INDEX

X ∼ F, X is distributed according to F, 463
B(n, θ), binomial distribution with parameters n and θ, 461
E(λ), exponential distribution with parameter λ, 464
H(D, N, n), hypergeometric distribution with parameters D, N, n, 461
M(n, θ_1, ..., θ_q), multinomial distribution with parameters n, θ_1, ..., θ_q, 462
N(μ, Σ), multivariate normal distribution, 507

analysis of variance (ANOVA), 367
    table, 379
M-estimate, estimating equation estimate, 330
N(JJ, u2 ), normal distribution With mean JJ and variance u2 , 464 N(JJt, j.t2, uf, u� , p), bivariate normal dis
of MLE. 33 I , 386
P(.\), Poisson distribution with parame
of sample correlation, 3 1 9
tribution, 492 ter
,\, 462
U (a, b), uniform distribution on the inter val (a, b) , 465
of estimate, 300 of minimum contrast estimate, 327
of posterior, 339, 391 asymptotic order in probability notation, 516 asymptotic relative efficiency, 357 autoregressive model, l l , 292
acceptance, 215 action space, 1 7
Bayes credible bound, 25 I
adaptation, 388
Bayes credible interval, 252
algorithm, l 02, I 27
Bayes credible region, 251
bisection, 127, 2 1 0 coordinate ascent, 129
asympbtic, 344 Bayes estirn•te, 162
EM, l33
Bernoulli trials, 166
Newton-Raphson, 102, 132, 189,210
equivan.ance, 168
for GLM, 4 1 3 proportional fitting, 157 alternative, 215, 2 1 7
Gaussim model, 163
linear, I
Bayes risk, I (i2 547
548
Index
Bayes rule, 27, 162
Bayes rule, 165
Bayes' rule, 445, 479, 482
coefficient of determination, 37
Bayes' theorem, 14
coefficient of skewness, 457
Bayesian models, 12
collinearity, 69, 90
Bayesian prediction interval, 254
comparison, 247
Behrens-Fisher problem, 264
complete families of tests, 232
Bernoulli trials, 447
compound experiment, 446
Bernstein's inequality, 469
concave function, 5 1 8
binomial case, 519 Bernstein-von Mises theorem, 339
Berry-Esseen bound, 299
conditional distribution, 478 for bivariate nonnal case, 501 for multivariate normal case, 509
Berry-Esseen theorem, 471
conditional expectation, 483
beta distribution, 488
confidence band
as prior for Bernoulli trials, 15 moments, 526 beta function, 488 bias, 20, 176 sample variance, 78 binomial distribution, 447, 461 bioequivalence trials, 198 bivariate log normal distribution, 535 bivariate normal distribution, 497 cumulants, 506 geometry, 532 nondegenerate , 499 bivariate normal model, 266
quantiles simultaneous, 284 confidence bound, 23, 234, 235 mean nonparametric, 241
unifonnly most accurate, 248
confidence interval, 24, 234, 235 Bernoulli trials approximate, 23 7 exact,
244
location parameter
nonparametric, 286
median nonpararnebic, 282
Cauchy distribution, 526
one-sample Student t, 235
Cauchy-schwartz inequality, 458
quantile
Cauchy-S chwarz inequality, 39 generalized, 521 center of a population distribution, 7 1 central limit theorem, 470 multivariate, 5 1 0
nonpanuneUic, 284
shift parameter
nonpanuneUic, 287
two-sample Student t, 263 unbiased, 283
chain rule, 517
confidence level, 235
change of variable formula, 452
confidence rectangle
characteristic function, 505 Chebychev bound, 346 Chebychev's inequality, 299, 469 chi-square disUibution, 491 noncentral, 530 chi-square test, 402 chi-squared disUibution, 488 classification
Gaussian model, 240 confidenoe region, 233, 239 distribution function, 240 confidence regions Gaussian linear model, 383 conjugate nonnal mixture distributions, 92
consistency, 30 l
Index
of estimate, 300, 301 of minimum contrast estimates, 304 ofMLE, 305, 347 of posterior, 338 oftest, 333 uniform, 30 I contingency tables, 403 contrast function, 99 control observation, 4 convergence in LP norm, 536 in law, distribution, 466 in law, in distribution for vectors, 5 1 1 in probability, 466 for vectors, 5 1 1 of random variables, 466 convergence of sample quantile, 536 convex function, 518 convex support, 122 convexity, 518 correlation, 267, 458 inequality, 458 multiple, 40 ratio, 82 covariance, 458 of random vectors, 504 covariate, 10 stochastic, 387, 419 Cramer-Rao lower bound, 1 8 1 Cramer-von Mises statistic, 271 critical region, 23, 215 critical value, 216, 217 cumulant, 460 generating function, 460 in normal distribution, 460 cumulant generating function for random vector, 505 curved exponential family, 125 existence of MLE, 125
De Moivre-Laplace theorem, 470 decision rule, 19 admissible, 3 1
549 Bayes, 27, 161, 162 inadmissible, 31 minimax, 28, !70, 171 randomized, 28 unbiased, 78 decision theory, 16 delta method, 306 for distributions, 3 1 1 for moments, 306 density, 456 conditional, 482 density function, 449 design, 366 matrix, 366 random, 387 values, 366 deviance, 414 decomposition, 4 1 4 Dirichlet distribution, 74, !98, 202 distribution function (d.f.), 450 distribution of quadratic form, 533 dominated convergence theorem, 514 double exponential distribution, 526 duality between confidence regions and tests, 241 duality theorem, 243 Dynkin, Lehmann, Scheffe's theorem, 86 Edgeworth approximations, 317 eigenvalues, 520 empirical distribution, I04 empirical distribution function, 8, 139 bivariate, !39 entropy maximum, 91 error, 3 autoregressive, 1 1 estimate, 99 consistent, 301 empirical substitution, 139 estimating equation, 100 frequency plug-in, 103 Hodges-Lehmann, 149 least squares, I 00
550
: '
Index maximum likelihood, 1 1 4
fitted value, 372
method of moments, 10 I
fixed design, 387
minimum contrast, 99
Fr6chet differentiable, 516
plug-in, 104
frequency function, 449
squared error Bayes, 162
conditional, 477 frequency plug-in principle, 103
unbiased, 176 estimating equation estimate asymptotic nonnality, 384
:
'[I '
moments, 526
estimation, 16
gamma function, 488
events, 442
gamma model
independent, 445 '
gamma distribution, 488
expectation, 454, 455 conditional, 479
MLE, 124, 129, 130 Gauss-Marakov linear model, 418 Gauss-Markov assumptions, 108
exponential distribution, 464
Gauss-Markov theorem, 418
exponential family, 49
Gaussian linear model, 366
conjugate prior, 62 convexity, 61
canonical fonn, 368
confidence intervals, 381
curved, 57
confidence regions, 383
identifiability, 60
estimation in, 369
log concavity, 61
identifiability, 3 7 1
MLE, 121
moment generating function, 59 multiparameter, 53 canonical, 54
one-parameter, 49 canonical, 52
likelihood ratio statistic, 374 MLE, 371 testing, 378
UMVU estimate, 3 7 1 Gaussian model Bayes estimate, 163
rank of, 60
existence of MLE, 123
submodel, 56
mixture, 134
supermodel, 58
Gaussian two-sample model, 261
UMVU estimate, 186
generali�ed linear models (GLM), 4 1 1
extension principle, 102, l 04 F distribution, 491 moments, 530 noncentral, 531
F statistic, 376 factorization theorem, 43
geometric �stribution, 72, 87 GLM, 412 . estimate •
asymptotic distributions, 415 Gaussi1111, 435 likelihood ratio asymptotic distribution, 415
Fisher consistent, 158
likelihood ratio test, 414
Fisher information, 1 80
Poisson, 435
matrix, 185 Fisher's discriminant function, 226
• •
I
goodness-of-fit test, 220, 223 gross error models, 190
Fisher's genetic linkage model, 405 Fisher's method of scoring, 434
Holder's inequality, 5 1 8
I I
'
'
_I
Index
551
Hammersley's theorem,
513
Hardy-Weinberg proportions,
103, 403
chi-square test, 405 MLE, 1 1 8, 124 UMVU estimate, 183 hat matrix, 372 hazard rates, 69, 70 heavy tails, 208 Hessian, 386
Kolmogorov statistic, 220 Kolmogorov's theorem, 86 Kullback-Leibler divergence, 1 16 and MLE, 1 1 6 Kullback-Leibler loss function, 169 kurtosis, 279,457
L, norm, 536
Laplace
distribution, 526
hierarchical Bayesian normal model, 92 hierarchical binomial-beta model, 93 Hodges 's example, 332
Laplace distribution, 374 law of large numbers weak
Hod.ges-Lehmann estimate, 207 Hoeffding bound, 299, 346
Bernoulli's, 468
Hoeffding's inequality, 519
Khintchin's, 469
Horvitz-Thompson estimate, 178
least absolute deviation estimates, 149, 374
Huber estimate, 207, 390 hypergeometric distribution, 3, 461
least favorable prior distribution, 170
hypergeometric probability, 448
least squares, 107, 120
hypothesis, 215 composite, 215 null, 215 simple, 215
weighted, 107, 1 12 Lehmann alternative, 275 level (of significance). 217 life testing, 8 9 likelihood equations, 1 1 7
identifiable, 6
likelihood function, 47
independence, 445 , 453
likelihood ratio, 48, 256
independent experiments, 446
asymptotic chi-square distribution,
indifference region, 230
394, 395
influence function, 196
confidence region, 257 asymptotic, 395
information bound, 181
logistic regression, 410
asymptotic variance, 327
test, 256
information inequality, 179, 1 8 1 , 186, 188, 206
bivariate normal, 266
integrable, 455
Gaussian one-sample model, 257
interquartile range (IQR), 196
Gaussian two-sample model, 261
invariance
one-sample scale, 291 two-sample scale, 293
shift, 77 inverse Gaussian distribution, 94
likelihood ratio statistic
IQR, 196
in Gaussian linear model, 376
iterated expectation theorem, 481
simple, 223
Jacobian, 485 theorem, 486
likelihood ratio test, 335 linear model Gaussian, 366
Jensen's inequality, 5 1 8
non-Gaussian, 389
-------
552 I
I
'
'
i
stochastic covariates, 419 Gaussian, 387 heteroscedastic, 421 linear regression model, 1 09 link function, 412 canonical, 4 1 2 location parameter, 209, 463 location-scale parameter family, 463 location-scale regression existence of MLE, 1 27 log linear model, 412 logistic distribution, 57, 132 likelihood equations, 154 neural nets, 1 5 1 Newton-Raphson, 132 logistic linear regression model, 408 logistic regression, 56, 90, 408 logistic transform, 408 empirical, 409 logit, 90, 408 loss function, 18 0 - 1, 19 absolute, 1 8 Euclidean, 1 8 Kullback-Leibler, 169, 202 quadratic, 1 8 M-estimate, 330 asymptotic normality, 384 marginal density, 452 Markov's inequality, 469 matched pair experiment, 257 maximum likelihood, 1 14 maximum likelihood estimate, 1 14. see MLE maximum likelihood estimate (MLE), 1 14 mean, 7 1 , 454, 455 sensitivity curve, 192 mean absolute prediction error, 80, 83 mean squared error (MSE), 20 mean squared prediction error (MSPE), 32 median, 7 1 MSE, 297
  population, 77, 80, 105
  sample, 192
  sensitivity curve, 193
Mendel's genetic model, 214
  chi-square test, 403
meta-analysis, 222
method of moments, 101
minimax estimate
  Bernoulli trials, 173
  distribution function, 202
minimax rule, 28, 170
minimax test, 173
MLE, 114, see maximum likelihood estimate
  as projection, 371
  asymptotic normality
    exponential family, 322
  Cauchy model, 149
  equivariance, 114, 144
  existence, 121
  uniqueness, 121
MLR, 228, see monotone likelihood ratio
model, 1, 5
  AR(1), 11
  Cox, 70
  Gaussian linear regression, 366
  gross error, 190, 210
  Lehmann, 69
  linear, 10
    Gaussian, 10
  location symmetric, 191
  logistic linear, 408
  nonparametric, 6
  one-sample, 3, 366
  parametric, 6
  proportional hazard, 70
  regression, 9
  regular, 9
  scale, 69
  semiparametric, 6
  shift, 4
  symmetric, 68
  two-sample, 4
moment, 456
  central, 457
  of random matrix, 502
moment generating function, 459
  for random vector, 504
monotone likelihood ratio (MLR), 228
Monte Carlo method, 219, 221, 298, 314
MSE, 20, see mean squared error
  sample mean, 21
  sample variance, 78
MSPE, 32, see mean squared prediction error
  bivariate normal, 36
  multivariate normal, 37
MSPE predictor, 83, 372
multinomial distribution, 462
multinomial trials, 55, 447, 462
  consistent estimates, 302
  Dirichlet prior, 198
  estimation
    asymptotic normality, 324
  in contingency tables, 403
  Kullback-Leibler loss
    Bayes estimate, 202
    minimax estimate, 201
  MLE, 119, 124
  Pearson's chi-square test, 401
  UMVU estimate, 187
multiple correlation coefficient, 37
multivariate normal distribution, 506
natural parameter space, 52, 54
natural sufficient statistic, 54
negative binomial distribution, 87
neural net model, 151
Neyman allocation, 76
Neyman-Pearson framework, 23
Neyman-Pearson lemma, 224
Neyman-Pearson test, 165
noncentral t distribution, 260
noncentral F distribution, 376
noncentral chi-square distribution, 375
nonnegative definite, 519
normal distribution, 464, see Gaussian
  central moments, 529
normal equations, 101
  weighted least squares, 113
normalizing transformation, zero skewness, 351
observations, 5
one-way layout, 367
  binomial testing, 410
  confidence intervals, 382
  testing, 378
order statistics, 527
orthogonal, 41
orthogonal matrix, 494
orthogonality, 522
p-sample problem, 367
p-value, 221
parameter, 6
  nuisance, 7
parametrization, 6
Pareto density, 85
Pearson's chi-square, 402
placebo, 4
plug-in principle, 102
Poisson distribution, 462
Poisson process, 472
Poisson's theorem, 472
Polya's theorem, 515
population, 448
population R-squared, 37
population quantile, 105
positive definite, 519
posterior, 13
power function, 78, 217
  asymptotic, 334
  sample size, 230
prediction, 16, 32
  training set, 19
prediction error, 32
  absolute, 33
  squared, 32
  weighted, 84
prediction interval, 252
  Bayesian, 254
  distribution-free, 254
  Student t, 253
predictor, 32
  linear, 38, 39
principal axis theorem, 519
prior, 13
  conjugate, 15
    binomial, 15
    exponential family, 62
    general model, 73
    multinomial, 74
    normal case, 63
    Poisson, 73
  improper, 163
  Jeffrey's, 203
  least favorable, 170
probability
  conditional, 444
  continuous, 449
  discrete, 443, 449
  distribution, 442
  subjective, 442
probability distribution, 451
probit model, 416
product moment, 457
  central, 457
projection, 41, 371
projection matrix, 372
projections on linear spaces, 522
Pythagorean identity, 41, 377
  in Hilbert space, 522
quality control, 229
quantile
  population, 104
  sensitivity curve, 195
random, 441
random design, 387
random effects model, 167
random experiments, 441
random variable, 451
random vector, 451
randomization, 5
randomized test, 79, 224
rank, 48
ranking, 16
Rao score, 399
  confidence region, 400
  statistic, 399
    asymptotic chi-square distribution, 400
  test, 399
    multinomial goodness-of-fit, 402
Rao test, 335, 336
Rayleigh distribution, 53
regression, 9, 366
  confidence intervals, 381
  confidence regions, 383
  heteroscedastic, 58, 153
  homoscedastic, 58
  Laplace model, 149
  linear, 109
  location-scale, 57
  logistic, 56
  Poisson, 204
  polynomial, 146
  testing, 378
  weighted least squares, 147
  weighted, linear, 112
regression line, 502
regression toward the mean, 36
rejection, 215
relative frequency, 441
residual, 48, 111, 372
  sum of squares, 379
response, 10, 366
risk function, 20
  maximum, 28
  testing, 22
risk set, 29
  convexity, 79
robustness, 190, 418
  of level t-statistics, 314
  asymptotic, 419
  of tests, 419
saddle point, 199
sample, 3
  correlation, 140, 267
  covariance, 140
  cumulant, 139
  mean, 8, 45
  median, 105, 149, 192
  quantile, 105
  random, 3
  regression line, 111
  variance, 8, 45
sample of size n, 448
sample space, 5, 442
sampling inspection, 3
scale, 457
scale parameter, 463
Scheffé's theorem, 468
  multivariate case, 514
score test, 335, 336, 399
selecting at random, 444
selection, 75, 247
sensitivity curve, 192
Shannon's lemma, 116
shift and scale equivariant, 209
shift equivariant, 206, 208
signal to noise
  fixed, 126
Slutsky's theorem, 467
  multivariate, 512
spectral theorem, 519
square root matrix, 507
standard deviation, 457
standard error, 381
standard normal distribution, 464
statistic, 8
  ancillary, 48
  equivalent, 43
  sufficient, 42
    Bayes, 46
    minimal, 46
    natural, 52
stochastic ordering, 67, 209
stratified sampling, 76, 205
substitution theorem for conditional expectations, 481
superefficiency, 332
survey sampling, 177
  model based approach, 350
survival functions, 70
symmetric distribution, 68
symmetric matrices, 519
symmetric variable, 68
t distribution, 491
  moments, 530
Taylor expansion, 517
test function, 23
test size, 217
test statistic, 216
testing, 16, 213
  Bayes, 165, 225
testing independence in contingency tables, 405
total differential, 517
transformation
  affine, 487
  k linear, 516
  linear, 487, 516
  orthogonal, 494
trimmed mean, 194, 206
  sensitivity curve, 194
type I error, 23, 216
type II error, 23, 216
UMP, 226, 227, see uniformly most powerful
UMVU, 177, see uniformly minimum variance unbiased
uncorrelated, 459
unidentifiable, 6
uniform distribution, 465
  discrete
    MLE, 115
uniformly minimum variance unbiased (UMVU), 177
uniformly most powerful
  asymptotically, 334
uniformly most powerful (UMP), 226, 227
variance, 457
  of random matrix, 503
  sensitivity curve, 195
variance stabilizing transformation, 316, 317
  for binomial, 352
  for correlation coefficient, 320, 350
  for Poisson, 317
  in GLM, 416
variance-covariance matrix, 498
von Neumann's theorem, 171
Wald confidence regions, 399
Wald statistic, 398
  asymptotic chi-square distribution, 398
Wald test, 335, 399
  multinomial goodness-of-fit, 401
weak law of large numbers
  for vectors, 511
Weibull density, 84
weighted least squares, 113, 147
Wilks's theorem, 393-395, 397
Z-score, 457